Mirror of Apache Kafka
Apache Flink
Beginner's Guide to Big Data :star:
Apache Hadoop
Apache HBase
Apache Hive
Apache Iceberg
Apache Kylin
elasticsearch by elastic
Free and Open, Distributed, RESTful Search Engine
Java
64134
Updated: 2 y ago
License: Proprietary (Proprietary)
spark by apache
Apache Spark - A unified analytics engine for large-scale data processing
Scala
35985
Updated: 2 y ago
License: Permissive (Apache-2.0)
data-science-ipython-notebooks by donnemartin
Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.
Python
25193
Updated: 2 y ago
License: Proprietary (Proprietary)
xgboost by dmlc
Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
C++
24228
Updated: 2 y ago
License: Permissive (Apache-2.0)
kibana by elastic
Your window into the Elastic Stack
TypeScript
18535
Updated: 2 y ago
License: Proprietary (Proprietary)
luigi by spotify
Luigi is a Python module that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization etc. It also comes with Hadoop support built in.
Python
16581
Updated: 2 y ago
License: Permissive (Apache-2.0)
presto by prestodb
The official home of the Presto distributed SQL query engine for big data
Java
14796
Updated: 2 y ago
License: Permissive (Apache-2.0)
mlflow by mlflow
Open source platform for the machine learning lifecycle
Python
14591
Updated: 2 y ago
License: Permissive (Apache-2.0)
flink-learning by zhisheng17
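Flink learning blog. http://www.54tianzhisheng.cn/ Covers Flink basics, concepts, principles, hands-on practice, performance tuning, source code analysis, and more. Includes learning examples for Flink Connector, Metrics, Library, DataStream API, and Table API & SQL, plus case studies of large-scale Flink production projects (PV/UV, log storage, real-time deduplication of tens of billions of records, monitoring and alerting). Support for my column "Flink, the Big Data Real-Time Compute Engine: Practice and Performance Optimization" is welcome.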
Java
13540
Updated: 2 y ago
License: Permissive (Apache-2.0)
attic-predictionio by apache
PredictionIO, a machine learning server for developers and ML engineers.
Scala
12509
Updated: 4 y ago
License: Permissive (Apache-2.0)
deeplearning4j by eclipse
Suite of tools for deploying and training deep learning models using the JVM. Highlights include model import for Keras, TensorFlow, and ONNX/PyTorch, a modular and tiny C++ library for running math code, and a Java-based math library on top of the core C++ library. Also includes SameDiff: a PyTorch/TensorFlow-like library for running deep learning using automatic differentiation.
Java
12434
Updated: 3 y ago
License: Permissive (Apache-2.0)
airbyte by airbytehq
Data integration platform for ELT pipelines from APIs, databases & files to warehouses & lakes.
Python
10896
Updated: 2 y ago
License: Proprietary (Proprietary)
spark by perwendel
A simple, expressive web framework for Java. Spark has a Kotlin DSL: https://github.com/perwendel/spark-kotlin
Java
9473
Updated: 2 y ago
License: Permissive (Apache-2.0)
trino by trinodb
Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
Java
7963
Updated: 2 y ago
License: Permissive (Apache-2.0)
vaex by vaexio
Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
Python
7914
Updated: 2 y ago
License: Permissive (MIT)
scylla by scylladb
NoSQL data store using the Seastar framework, compatible with Apache Cassandra
C++
7734
Updated: 3 y ago
License: Strong Copyleft (AGPL-3.0)
industry-machine-learning by firmai
A curated list of applied machine learning and data science notebooks and libraries across different industries (by @firmai)
Jupyter Notebook
6825
Updated: 2 y ago
License: No License (No License)
angel by Angel-ML
A Flexible and Powerful Parameter Server for large-scale machine learning
Java
6665
Updated: 2 y ago
License: Proprietary (Proprietary)
pentaho-kettle by pentaho
Pentaho Data Integration (ETL), a.k.a. Kettle
Java
6649
Updated: 2 y ago
License: Permissive (Apache-2.0)
h2o-3 by h2oai
H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
Jupyter Notebook
6315
Updated: 2 y ago
License: Permissive (Apache-2.0)
alluxio by Alluxio
Alluxio, data orchestration for analytics and machine learning in the cloud
Java
6268
Updated: 2 y ago
License: Permissive (Apache-2.0)
zeppelin by apache
Web-based notebook that enables data-driven, interactive data analytics and collaborative documents with SQL, Scala and more.
Java
6070
Updated: 2 y ago
License: Permissive (Apache-2.0)
delta by delta-io
An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs
Scala
6067
Updated: 2 y ago
License: Permissive (Apache-2.0)
pachyderm by pachyderm
Data-Centric Pipelines and Data Versioning
Go
5930
Updated: 2 y ago
License: Permissive (Apache-2.0)
cudf by rapidsai
cuDF - GPU DataFrame Library
C++
5565
Updated: 2 y ago
License: Permissive (Apache-2.0)
hazelcast by hazelcast
Open-source distributed computation and storage platform. Real-time Stream Processing Unconference. Save Your Spot https://hazelcast.com/lp/unconference/
Java
5409
Updated: 2 y ago
License: Proprietary (Proprietary)
datahub by linkedin
The Metadata Platform for the Modern Data Stack
Java
4881
Updated: 3 y ago
License: Permissive (Apache-2.0)
pinot by apache
Apache Pinot - A realtime distributed OLAP datastore
Java
4617
Updated: 2 y ago
License: Permissive (Apache-2.0)
vespa by vespa-engine
The open big data serving engine. https://vespa.ai
Java
4455
Updated: 2 y ago
License: Permissive (Apache-2.0)
polynote by polynote
A better notebook for Scala (and more)
Jupyter Notebook
4421
Updated: 2 y ago
License: Permissive (Apache-2.0)
feast by feast-dev
Feature Store for Machine Learning
Python
4397
Updated: 2 y ago
License: Permissive (Apache-2.0)
libsvm by cjlin1
LIBSVM -- A Library for Support Vector Machines
Java
4315
Updated: 2 y ago
License: Permissive (BSD-3-Clause)
SynapseML by microsoft
Simple and Distributed Machine Learning
Scala
4302
Updated: 2 y ago
License: Permissive (MIT)
hudi by apache
Upserts, Deletes And Incremental Processing on Big Data.
Java
4276
Updated: 2 y ago
License: Permissive (Apache-2.0)
BigDL by intel-analytics
Fast, distributed, secure AI for Big Data
Jupyter Notebook
4229
Updated: 2 y ago
License: Permissive (Apache-2.0)
orchest by orchest
Build data pipelines, the easy way 🛠️
TypeScript
3867
Updated: 2 y ago
License: Strong Copyleft (AGPL-3.0)
learning-spark by databricks
Example code from Learning Spark book
Java
3837
Updated: 2 y ago
License: Permissive (MIT)
TensorFlowOnSpark by yahoo
TensorFlowOnSpark brings TensorFlow programs to Apache Spark clusters.
Python
3781
Updated: 3 y ago
License: Permissive (Apache-2.0)
databus by linkedin
Source-agnostic distributed change data capture system
Java
3499
Updated: 2 y ago
License: Permissive (Apache-2.0)
lakeFS by treeverse
lakeFS - Data version control for your data lake | Git for data
Go
3443
Updated: 2 y ago
License: Permissive (Apache-2.0)
scalding by twitter
A Scala API for Cascading
Scala
3426
Updated: 2 y ago
License: Permissive (Apache-2.0)
CoolplaySpark by lw-lin
Coolplay Spark: Spark source code analysis, Spark libraries, and more
Scala
3399
Updated: 2 y ago
License: No License (No License)