Big Data Getting-Started Guide :star:
Apache Hadoop
Mirror of Apache Storm
Apache Mesos
Apache HBase
Apache Hive
Apache Iceberg

spark by apache
Apache Spark - A unified analytics engine for large-scale data processing
Scala
35985
Updated: 2 y ago
License: Permissive (Apache-2.0)

data-science-ipython-notebooks by donnemartin
Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.
Python
25193
Updated: 2 y ago
License: Proprietary (Proprietary)

xgboost by dmlc
Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
C++
24228
Updated: 2 y ago
License: Permissive (Apache-2.0)

simdjson by simdjson
Parsing gigabytes of JSON per second
C++
16984
Updated: 2 y ago
License: Permissive (Apache-2.0)

luigi by spotify
Luigi is a Python module that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization etc. It also comes with Hadoop support built in.
Python
16581
Updated: 2 y ago
License: Permissive (Apache-2.0)

cube by cube-js
📊 Cube — The Semantic Layer for Building Data Applications
Rust
15712
Updated: 2 y ago
License: Proprietary (Proprietary)

presto by prestodb
The official home of the Presto distributed SQL query engine for big data
Java
14796
Updated: 2 y ago
License: Permissive (Apache-2.0)

data-engineering-zoomcamp by DataTalksClub
Free Data Engineering course!
Jupyter Notebook
13895
Updated: 2 y ago
License: No License (No License)

DataX by alibaba
DataX is the open-source version of Alibaba Cloud DataWorks Data Integration.
Java
13598
Updated: 2 y ago
License: Proprietary (Proprietary)
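
flink-learning by zhisheng17
flink learning blog. http://www.54tianzhisheng.cn/ Covers Flink fundamentals, concepts, internals, hands-on practice, performance tuning, and source-code analysis. Includes learning examples for Flink Connector, Metrics, Library, DataStream API, Table API & SQL, and more, plus case studies of large Flink production projects (PV/UV, log storage, real-time deduplication at the ten-billion-record scale, monitoring and alerting). Your support for my column "Flink in Action and Performance Optimization: A Big Data Real-Time Computing Engine" is welcome.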
Java
13540
Updated: 2 y ago
License: Permissive (Apache-2.0)

druid by apache
Apache Druid: a high performance real-time analytics database.
Java
12668
Updated: 2 y ago
License: Permissive (Apache-2.0)

attic-predictionio by apache
PredictionIO, a machine learning server for developers and ML engineers.
Scala
12509
Updated: 4 y ago
License: Permissive (Apache-2.0)

arrow by apache
Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
C++
11870
Updated: 2 y ago
License: Permissive (Apache-2.0)

airbyte by airbytehq
Data integration platform for ELT pipelines from APIs, databases & files to warehouses & lakes.
Python
10896
Updated: 2 y ago
License: Proprietary (Proprietary)

spark by perwendel
A simple, expressive web framework for Java. Spark has a Kotlin DSL: https://github.com/perwendel/spark-kotlin
Java
9473
Updated: 2 y ago
License: Permissive (Apache-2.0)

talent-plan by pingcap
open source training courses about distributed database and distributed systems
Rust
8866
Updated: 2 y ago
License: No License (No License)

doris by apache
Apache Doris is an easy-to-use, high performance and unified analytics database.
Java
8520
Updated: 2 y ago
License: Permissive (Apache-2.0)

cassandra by apache
Mirror of Apache Cassandra
Java
8020
Updated: 2 y ago
License: Permissive (Apache-2.0)

trino by trinodb
Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
Java
7963
Updated: 2 y ago
License: Permissive (Apache-2.0)

vaex by vaexio
Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
Python
7914
Updated: 2 y ago
License: Permissive (MIT)

scylla by scylladb
NoSQL data store using the seastar framework, compatible with Apache Cassandra
C++
7734
Updated: 3 y ago
License: Strong Copyleft (AGPL-3.0)

beam by apache
Apache Beam is a unified programming model for Batch and Streaming data processing.
Java
6930
Updated: 2 y ago
License: Permissive (Apache-2.0)

alluxio by Alluxio
Alluxio, data orchestration for analytics and machine learning in the cloud
Java
6268
Updated: 2 y ago
License: Permissive (Apache-2.0)

zeppelin by apache
Web-based notebook that enables data-driven, interactive data analytics and collaborative documents with SQL, Scala and more.
Java
6070
Updated: 2 y ago
License: Permissive (Apache-2.0)

delta by delta-io
An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs
Scala
6067
Updated: 2 y ago
License: Permissive (Apache-2.0)

school-of-sre by linkedin
At LinkedIn, we are using this curriculum for onboarding our entry-level talents into the SRE role.
HTML
6053
Updated: 2 y ago
License: Proprietary (Proprietary)

pachyderm by pachyderm
Data-Centric Pipelines and Data Versioning
Go
5930
Updated: 2 y ago
License: Permissive (Apache-2.0)

hazelcast by hazelcast
Open-source distributed computation and storage platform. Real-time Stream Processing Unconference. Save Your Spot https://hazelcast.com/lp/unconference/
Java
5409
Updated: 2 y ago
License: Proprietary (Proprietary)

Surge by Jounce
A Swift library that uses the Accelerate framework to provide high-performance functions for matrix math, digital signal processing, and image manipulation.
Swift
5083
Updated: 2 y ago
License: Permissive (MIT)

incubator-seatunnel by apache
SeaTunnel is a distributed, high-performance data integration platform for the synchronization and transformation of massive data (offline & real-time).
Java
5076
Updated: 2 y ago
License: Permissive (Apache-2.0)

datahub by linkedin
The Metadata Platform for the Modern Data Stack
Java
4881
Updated: 3 y ago
License: Permissive (Apache-2.0)
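
datax-web by WeiYe-Jing
DataX with an integrated visual web UI: select a data source to generate a data synchronization task in one click. Supports RDBMS, Hive, HBase, ClickHouse, MongoDB, and other data sources; batch creation of RDBMS synchronization tasks; integration with an open-source scheduling system; distributed execution; incremental data synchronization; real-time viewing of run logs; monitoring of executor resources; killing running processes; encryption of data source information; and more.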
Java
4762
Updated: 2 y ago
License: Permissive (MIT)

Stream-Framework by tschellenbach
Stream Framework is a Python library, which allows you to build news feed, activity streams and notification systems using Cassandra and/or Redis. The authors of Stream-Framework also provide a cloud service for feed technology:
Python
4693
Updated: 2 y ago
License: Proprietary (Proprietary)

pinot by apache
Apache Pinot - A realtime distributed OLAP datastore
Java
4617
Updated: 2 y ago
License: Permissive (Apache-2.0)

risingwave by risingwavelabs
🚀SQL stream processing with Postgres-like experience. 🪄More than a modern alternative to Apache Flink.
Rust
4463
Updated: 2 y ago
License: Permissive (Apache-2.0)

vespa by vespa-engine
The open big data serving engine. https://vespa.ai
Java
4455
Updated: 2 y ago
License: Permissive (Apache-2.0)

polynote by polynote
A better notebook for Scala (and more)
Jupyter Notebook
4421
Updated: 2 y ago
License: Permissive (Apache-2.0)

feast by feast-dev
Feature Store for Machine Learning
Python
4397
Updated: 2 y ago
License: Permissive (Apache-2.0)

libsvm by cjlin1
LIBSVM -- A Library for Support Vector Machines
Java
4315
Updated: 2 y ago
License: Permissive (BSD-3-Clause)

SynapseML by microsoft
Simple and Distributed Machine Learning
Scala
4302
Updated: 2 y ago
License: Permissive (MIT)

hudi by apache
Upserts, Deletes And Incremental Processing on Big Data.
Java
4276
Updated: 2 y ago
License: Permissive (Apache-2.0)

BigDL by intel-analytics
Fast, distributed, secure AI for Big Data
Jupyter Notebook
4229
Updated: 2 y ago
License: Permissive (Apache-2.0)

incubator-doris by apache
Apache Doris(Incubating) is an MPP-based interactive SQL data warehousing for reporting and analysis.
C++
4197
Updated: 3 y ago
License: Permissive (Apache-2.0)