Big Data Getting Started Guide :star:
Apache Hadoop
Mirror of Apache Storm
Apache Mesos
Apache HBase
Apache Hive
Apache Iceberg
spark by apache
Apache Spark - A unified analytics engine for large-scale data processing
Scala 35985 Updated: 1 y ago License: Permissive (Apache-2.0)

data-science-ipython-notebooks by donnemartin
Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.
Python 25193 Updated: 1 y ago License: Proprietary (Proprietary)

xgboost by dmlc
Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
C++ 24228 Updated: 1 y ago License: Permissive (Apache-2.0)

simdjson by simdjson
Parsing gigabytes of JSON per second
C++ 16984 Updated: 1 y ago License: Permissive (Apache-2.0)

luigi by spotify
Luigi is a Python module that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization etc. It also comes with Hadoop support built in.
Python 16581 Updated: 1 y ago License: Permissive (Apache-2.0)

cube by cube-js
📊 Cube — The Semantic Layer for Building Data Applications
Rust 15712 Updated: 1 y ago License: Proprietary (Proprietary)

presto by prestodb
The official home of the Presto distributed SQL query engine for big data
Java 14796 Updated: 1 y ago License: Permissive (Apache-2.0)

data-engineering-zoomcamp by DataTalksClub
Free Data Engineering course!
Jupyter Notebook 13895 Updated: 1 y ago License: No License (No License)

DataX by alibaba
DataX is the open-source version of Alibaba Cloud DataWorks Data Integration.
Java 13598 Updated: 1 y ago License: Proprietary (Proprietary)

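flink-learning by zhisheng17
Flink learning blog. http://www.54tianzhisheng.cn/ Covers Flink basics, concepts, principles, hands-on practice, performance tuning, and source code analysis. Includes learning examples for Flink Connector, Metrics, Library, DataStream API, Table API & SQL, and more, plus case studies of large production Flink projects (PV/UV, log storage, real-time deduplication of tens of billions of records, monitoring and alerting). Readers are welcome to support my column "Big Data Real-Time Computing Engine: Flink in Action and Performance Optimization".
Java 13540 Updated: 1 y ago License: Permissive (Apache-2.0)
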
druid by apache
Apache Druid: a high performance real-time analytics database.
Java 12668 Updated: 1 y ago License: Permissive (Apache-2.0)

attic-predictionio by apache
PredictionIO, a machine learning server for developers and ML engineers.
Scala 12509 Updated: 4 y ago License: Permissive (Apache-2.0)

arrow by apache
Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
C++ 11870 Updated: 1 y ago License: Permissive (Apache-2.0)

airbyte by airbytehq
Data integration platform for ELT pipelines from APIs, databases & files to warehouses & lakes.
Python 10896 Updated: 1 y ago License: Proprietary (Proprietary)

spark by perwendel
A simple expressive web framework for java. Spark has a kotlin DSL https://github.com/perwendel/spark-kotlin
Java 9473 Updated: 2 y ago License: Permissive (Apache-2.0)

talent-plan by pingcap
open source training courses about distributed database and distributed systems
Rust 8866 Updated: 1 y ago License: No License (No License)

doris by apache
Apache Doris is an easy-to-use, high performance and unified analytics database.
Java 8520 Updated: 1 y ago License: Permissive (Apache-2.0)

cassandra by apache
Mirror of Apache Cassandra
Java 8020 Updated: 1 y ago License: Permissive (Apache-2.0)

trino by trinodb
Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
Java 7963 Updated: 1 y ago License: Permissive (Apache-2.0)

vaex by vaexio
Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
Python 7914 Updated: 1 y ago License: Permissive (MIT)

scylla by scylladb
NoSQL data store using the seastar framework, compatible with Apache Cassandra
C++ 7734 Updated: 3 y ago License: Strong Copyleft (AGPL-3.0)

beam by apache
Apache Beam is a unified programming model for Batch and Streaming data processing.
Java 6930 Updated: 1 y ago License: Permissive (Apache-2.0)

alluxio by Alluxio
Alluxio, data orchestration for analytics and machine learning in the cloud
Java 6268 Updated: 1 y ago License: Permissive (Apache-2.0)

zeppelin by apache
Web-based notebook that enables data-driven, interactive data analytics and collaborative documents with SQL, Scala and more.
Java 6070 Updated: 1 y ago License: Permissive (Apache-2.0)

delta by delta-io
An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs
Scala 6067 Updated: 1 y ago License: Permissive (Apache-2.0)

school-of-sre by linkedin
At LinkedIn, we are using this curriculum for onboarding our entry-level talents into the SRE role.
HTML 6053 Updated: 1 y ago License: Proprietary (Proprietary)

pachyderm by pachyderm
Data-Centric Pipelines and Data Versioning
Go 5930 Updated: 1 y ago License: Permissive (Apache-2.0)

hazelcast by hazelcast
Open-source distributed computation and storage platform. Real-time Stream Processing Unconference. Save Your Spot https://hazelcast.com/lp/unconference/
Java 5409 Updated: 1 y ago License: Proprietary (Proprietary)

Surge by Jounce
A Swift library that uses the Accelerate framework to provide high-performance functions for matrix math, digital signal processing, and image manipulation.
Swift 5083 Updated: 1 y ago License: Permissive (MIT)

incubator-seatunnel by apache
SeaTunnel is a distributed, high-performance data integration platform for the synchronization and transformation of massive data (offline & real-time).
Java 5076 Updated: 2 y ago License: Permissive (Apache-2.0)

datahub by linkedin
The Metadata Platform for the Modern Data Stack
Java 4881 Updated: 3 y ago License: Permissive (Apache-2.0)

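datax-web by WeiYe-Jing
A visual web UI for DataX: select a data source to generate a data synchronization task in one click. Supports RDBMS, Hive, HBase, ClickHouse, MongoDB, and other data sources; batch creation of RDBMS synchronization tasks; integration with an open-source scheduling system; and distributed execution, incremental data synchronization, real-time viewing of run logs, monitoring of executor resources, killing running processes, and encryption of data source information.
Java 4762 Updated: 1 y ago License: Permissive (MIT)
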
Stream-Framework by tschellenbach
Stream Framework is a Python library, which allows you to build news feed, activity streams and notification systems using Cassandra and/or Redis. The authors of Stream-Framework also provide a cloud service for feed technology:
Python 4693 Updated: 2 y ago License: Proprietary (Proprietary)

pinot by apache
Apache Pinot - A realtime distributed OLAP datastore
Java 4617 Updated: 1 y ago License: Permissive (Apache-2.0)

risingwave by risingwavelabs
🚀SQL stream processing with Postgres-like experience. 🪄More than a modern alternative to Apache Flink.
Rust 4463 Updated: 1 y ago License: Permissive (Apache-2.0)

vespa by vespa-engine
The open big data serving engine. https://vespa.ai
Java 4455 Updated: 1 y ago License: Permissive (Apache-2.0)

polynote by polynote
A better notebook for Scala (and more)
Jupyter Notebook 4421 Updated: 2 y ago License: Permissive (Apache-2.0)

feast by feast-dev
Feature Store for Machine Learning
Python 4397 Updated: 1 y ago License: Permissive (Apache-2.0)

libsvm by cjlin1
LIBSVM -- A Library for Support Vector Machines
Java 4315 Updated: 2 y ago License: Permissive (BSD-3-Clause)

SynapseML by microsoft
Simple and Distributed Machine Learning
Scala 4302 Updated: 1 y ago License: Permissive (MIT)

hudi by apache
Upserts, Deletes And Incremental Processing on Big Data.
Java 4276 Updated: 1 y ago License: Permissive (Apache-2.0)

BigDL by intel-analytics
Fast, distributed, secure AI for Big Data
Jupyter Notebook 4229 Updated: 1 y ago License: Permissive (Apache-2.0)

incubator-doris by apache
Apache Doris (Incubating) is an MPP-based interactive SQL data warehousing for reporting and analysis.
C++ 4197 Updated: 3 y ago License: Permissive (Apache-2.0)