Mirror of Apache Kafka
Apache Flink
A Beginner's Guide to Big Data :star:
Apache Hadoop
Apache HBase
Apache Hive
Apache Iceberg
Apache Kylin
elasticsearch by elastic
Free and Open, Distributed, RESTful Search Engine
Java 64134 Updated: 1 y ago License: Proprietary (Proprietary)
spark by apache
Apache Spark - A unified analytics engine for large-scale data processing
Scala 35985 Updated: 1 y ago License: Permissive (Apache-2.0)
data-science-ipython-notebooks by donnemartin
Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.
Python 25193 Updated: 1 y ago License: Proprietary (Proprietary)
xgboost by dmlc
Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
C++ 24228 Updated: 1 y ago License: Permissive (Apache-2.0)
kibana by elastic
Your window into the Elastic Stack
TypeScript 18535 Updated: 1 y ago License: Proprietary (Proprietary)
luigi by spotify
Luigi is a Python module that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization etc. It also comes with Hadoop support built in.
Python 16581 Updated: 1 y ago License: Permissive (Apache-2.0)
presto by prestodb
The official home of the Presto distributed SQL query engine for big data
Java 14796 Updated: 1 y ago License: Permissive (Apache-2.0)
mlflow by mlflow
Open source platform for the machine learning lifecycle
Python 14591 Updated: 1 y ago License: Permissive (Apache-2.0)
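flink-learning by zhisheng17
flink learning blog. http://www.54tianzhisheng.cn/ Covers Flink basics, concepts, principles, hands-on practice, performance tuning, source code analysis, and more. Includes learning examples for Flink Connector, Metrics, Library, DataStream API, Table API & SQL, and shares case studies of large production Flink projects (PV/UV, log storage, real-time deduplication of tens of billions of records, monitoring and alerting). You are welcome to support my column "Flink in Action and Performance Optimization: A Big Data Real-Time Computing Engine".
Java 13540 Updated: 1 y ago License: Permissive (Apache-2.0)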
attic-predictionio by apache
PredictionIO, a machine learning server for developers and ML engineers.
Scala 12509 Updated: 4 y ago License: Permissive (Apache-2.0)
deeplearning4j by eclipse
Suite of tools for deploying and training deep learning models using the JVM. Highlights include model import for keras, tensorflow, and onnx/pytorch, a modular and tiny c++ library for running math code and a java based math library on top of the core c++ library. Also includes samediff: a pytorch/tensorflow like library for running deep learning using automatic differentiation.
Java 12434 Updated: 3 y ago License: Permissive (Apache-2.0)
airbyte by airbytehq
Data integration platform for ELT pipelines from APIs, databases & files to warehouses & lakes.
Python 10896 Updated: 1 y ago License: Proprietary (Proprietary)
spark by perwendel
A simple expressive web framework for java. Spark has a kotlin DSL https://github.com/perwendel/spark-kotlin
Java 9473 Updated: 1 y ago License: Permissive (Apache-2.0)
trino by trinodb
Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
Java 7963 Updated: 1 y ago License: Permissive (Apache-2.0)
vaex by vaexio
Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
Python 7914 Updated: 1 y ago License: Permissive (MIT)
scylla by scylladb
NoSQL data store using the seastar framework, compatible with Apache Cassandra
C++ 7734 Updated: 3 y ago License: Strong Copyleft (AGPL-3.0)
industry-machine-learning by firmai
A curated list of applied machine learning and data science notebooks and libraries across different industries (by @firmai)
Jupyter Notebook 6825 Updated: 1 y ago License: No License (No License)
angel by Angel-ML
A Flexible and Powerful Parameter Server for large-scale machine learning
Java 6665 Updated: 1 y ago License: Proprietary (Proprietary)
pentaho-kettle by pentaho
Pentaho Data Integration (ETL), a.k.a. Kettle
Java 6649 Updated: 1 y ago License: Permissive (Apache-2.0)
h2o-3 by h2oai
H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
Jupyter Notebook 6315 Updated: 1 y ago License: Permissive (Apache-2.0)
alluxio by Alluxio
Alluxio, data orchestration for analytics and machine learning in the cloud
Java 6268 Updated: 1 y ago License: Permissive (Apache-2.0)
zeppelin by apache
Web-based notebook that enables data-driven, interactive data analytics and collaborative documents with SQL, Scala and more.
Java 6070 Updated: 1 y ago License: Permissive (Apache-2.0)
delta by delta-io
An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs
Scala 6067 Updated: 1 y ago License: Permissive (Apache-2.0)
pachyderm by pachyderm
Data-Centric Pipelines and Data Versioning
Go 5930 Updated: 1 y ago License: Permissive (Apache-2.0)
cudf by rapidsai
cuDF - GPU DataFrame Library
C++ 5565 Updated: 1 y ago License: Permissive (Apache-2.0)
hazelcast by hazelcast
Open-source distributed computation and storage platform. Real-time Stream Processing Unconference. Save Your Spot https://hazelcast.com/lp/unconference/
Java 5409 Updated: 1 y ago License: Proprietary (Proprietary)
datahub by linkedin
The Metadata Platform for the Modern Data Stack
Java 4881 Updated: 3 y ago License: Permissive (Apache-2.0)
pinot by apache
Apache Pinot - A realtime distributed OLAP datastore
Java 4617 Updated: 1 y ago License: Permissive (Apache-2.0)
vespa by vespa-engine
The open big data serving engine. https://vespa.ai
Java 4455 Updated: 1 y ago License: Permissive (Apache-2.0)
polynote by polynote
A better notebook for Scala (and more)
Jupyter Notebook 4421 Updated: 1 y ago License: Permissive (Apache-2.0)
feast by feast-dev
Feature Store for Machine Learning
Python 4397 Updated: 1 y ago License: Permissive (Apache-2.0)
libsvm by cjlin1
LIBSVM -- A Library for Support Vector Machines
Java 4315 Updated: 2 y ago License: Permissive (BSD-3-Clause)
SynapseML by microsoft
Simple and Distributed Machine Learning
Scala 4302 Updated: 1 y ago License: Permissive (MIT)
hudi by apache
Upserts, Deletes And Incremental Processing on Big Data.
Java 4276 Updated: 1 y ago License: Permissive (Apache-2.0)
BigDL by intel-analytics
Fast, distributed, secure AI for Big Data
Jupyter Notebook 4229 Updated: 1 y ago License: Permissive (Apache-2.0)
orchest by orchest
Build data pipelines, the easy way 🛠️
TypeScript 3867 Updated: 1 y ago License: Strong Copyleft (AGPL-3.0)
learning-spark by databricks
Example code from Learning Spark book
Java 3837 Updated: 2 y ago License: Permissive (MIT)
TensorFlowOnSpark by yahoo
TensorFlowOnSpark brings TensorFlow programs to Apache Spark clusters.
Python 3781 Updated: 3 y ago License: Permissive (Apache-2.0)
databus by linkedin
Source-agnostic distributed change data capture system
Java 3499 Updated: 1 y ago License: Permissive (Apache-2.0)
lakeFS by treeverse
lakeFS - Data version control for your data lake | Git for data
Go 3443 Updated: 1 y ago License: Permissive (Apache-2.0)
scalding by twitter
A Scala API for Cascading
Scala 3426 Updated: 2 y ago License: Permissive (Apache-2.0)
CoolplaySpark by lw-lin
Coolplay Spark: Spark source code analysis, Spark libraries, etc.
Scala 3399 Updated: 2 y ago License: No License (No License)