spark-nlp by JohnSnowLabs
State of the Art Natural Language Processing
Scala
3279
Updated: 2 y ago
License: Permissive (Apache-2.0)
koalas by databricks
Koalas: pandas API on Apache Spark
Python
3268
Updated: 2 y ago
License: Permissive (Apache-2.0)
volcano by volcano-sh
A Cloud Native Batch System (Project under CNCF)
Go
3081
Updated: 2 y ago
License: Permissive (Apache-2.0)
RoaringBitmap by RoaringBitmap
A better compressed bitset in Java
Java
3057
Updated: 2 y ago
License: Permissive (Apache-2.0)
spark-notebook by spark-notebook
Interactive and Reactive Data Science using Scala and Spark.
JavaScript
3051
Updated: 4 y ago
License: Permissive (Apache-2.0)
aresdb by uber
A GPU-powered real-time analytics storage and query engine.
Go
2910
Updated: 2 y ago
License: Permissive (Apache-2.0)
spark-jobserver by spark-jobserver
REST job server for Apache Spark
Scala
2820
Updated: 2 y ago
License: Proprietary (Proprietary)
deequ by awslabs
Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.
Scala
2812
Updated: 2 y ago
License: Permissive (Apache-2.0)
ibis by ibis-project
The flexibility of Python with the scale and performance of modern SQL.
Python
2789
Updated: 2 y ago
License: Permissive (Apache-2.0)
aws-data-wrangler by awslabs
Pandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, Neptune, OpenSearch, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).
Python
2734
Updated: 3 y ago
License: Permissive (Apache-2.0)
dpark by douban
Python clone of Spark, a MapReduce-like framework in Python
Python
2693
Updated: 2 y ago
License: Permissive (BSD-3-Clause)
Spark-The-Definitive-Guide by databricks
Spark: The Definitive Guide's Code Repository
Scala
2584
Updated: 2 y ago
License: Proprietary (Proprietary)
Movie_Recommend by LuckyZXL2016
A Spark-based movie recommendation system, comprising a web crawler project, a website, an admin back end, and the Spark recommendation engine
Java
2446
Updated: 2 y ago
License: Permissive (MIT)
Mirror of Apache Flume
mmlspark by Azure
Microsoft Machine Learning for Apache Spark
Scala
2371
Updated: 4 y ago
License: Permissive (MIT)
notebooks by huggingface
Notebooks using the Hugging Face libraries 🤗
Jupyter Notebook
2352
Updated: 2 y ago
License: Permissive (Apache-2.0)
spark-on-k8s-operator by GoogleCloudPlatform
Kubernetes operator for managing the lifecycle of Apache Spark applications on Kubernetes.
Go
2344
Updated: 2 y ago
License: Permissive (Apache-2.0)
ballista by ballista-compute
Distributed compute platform implemented in Rust, and powered by Apache Arrow.
Rust
2318
Updated: 2 y ago
License: Permissive (Apache-2.0)
TransmogrifAI by salesforce
TransmogrifAI (pronounced trăns-mŏgˈrə-fī) is an AutoML library for building modular, reusable, strongly typed machine learning workflows on Apache Spark with minimal hand-tuning
Scala
2186
Updated: 2 y ago
License: Permissive (BSD-3-Clause)
PyFunctional by EntilZha
Python library for creating data pipelines with chain functional programming
Python
2168
Updated: 2 y ago
License: Permissive (MIT)
vega by rajasekarv
A new arguably faster implementation of Apache Spark from scratch in Rust
Rust
2166
Updated: 2 y ago
License: Permissive (Apache-2.0)
gobblin by apache
A distributed data integration framework that simplifies common aspects of big data integration such as data ingestion, replication, organization and lifecycle management for both streaming and batch data ecosystems.
Java
2129
Updated: 2 y ago
License: Permissive (Apache-2.0)
Jest by searchbox-io
Elasticsearch Java Rest Client.
Java
2120
Updated: 2 y ago
License: Permissive (Apache-2.0)
summingbird by twitter
Streaming MapReduce with Scalding and Storm
Scala
2102
Updated: 4 y ago
License: Permissive (Apache-2.0)
Apache Parquet
Quicksql by Qihoo360
A Flexible, Fast, Federated (3F) SQL Analysis Middleware for Multiple Data Sources
Java
2005
Updated: 2 y ago
License: Permissive (MIT)
quill by getquill
Compile-time Language Integrated Queries for Scala
Scala
1992
Updated: 3 y ago
License: Permissive (Apache-2.0)
spark-deep-learning by databricks
Deep Learning Pipelines for Apache Spark
Python
1968
Updated: 2 y ago
License: Permissive (Apache-2.0)
EasyML by ICT-BDA
Easy Machine Learning is a general-purpose dataflow-based system for easing the process of applying machine learning algorithms to real world tasks.
Java
1958
Updated: 2 y ago
License: Permissive (Apache-2.0)
docker-hadoop by big-data-europe
Apache Hadoop docker image
Shell
1940
Updated: 2 y ago
License: No License (No License)
flinkStreamSQL by DTStack
Extends the real-time SQL of open-source Flink; mainly implements joins between streams and dimension tables, and supports all native Flink SQL syntax
Java
1921
Updated: 2 y ago
License: Permissive (Apache-2.0)
spark by dotnet
.NET for Apache® Spark™ makes Apache Spark™ easily accessible to .NET developers.
C#
1905
Updated: 2 y ago
License: Permissive (MIT)
elasticsearch-hadoop by elastic
🐘 Elasticsearch real-time search and analytics natively integrated with Hadoop
Java
1902
Updated: 2 y ago
License: Permissive (Apache-2.0)
spark-cassandra-connector by datastax
DataStax Spark Cassandra Connector
Scala
1902
Updated: 2 y ago
License: Permissive (Apache-2.0)
SZT-bigdata by geekyouth
Shenzhen Metro big data passenger flow analysis system 🚇🚄🌟
Scala
1871
Updated: 2 y ago
License: Proprietary (Proprietary)
incubator-gobblin by apache
A distributed data integration framework that simplifies common aspects of big data integration such as data ingestion, replication, organization and lifecycle management for both streaming and batch data ecosystems.
Java
1819
Updated: 4 y ago
License: Permissive (Apache-2.0)
seatunnel by InterestingLab
Production Ready Data Integration Product, documentation:
Java
1819
Updated: 3 y ago
License: Permissive (Apache-2.0)
blazingsql by BlazingDB
BlazingSQL is a lightweight, GPU accelerated, SQL engine for Python. Built on RAPIDS cuDF.
C++
1808
Updated: 2 y ago
License: Permissive (Apache-2.0)
drill by apache
Apache Drill is a distributed MPP query layer for self-describing data
Java
1801
Updated: 2 y ago
License: Permissive (Apache-2.0)
oryx by OryxProject
Oryx 2: Lambda architecture on Apache Spark, Apache Kafka for real-time large scale machine learning
Java
1798
Updated: 2 y ago
License: Permissive (Apache-2.0)
jdbi by jdbi
jdbi is designed to provide convenient tabular data access in Java; including templated SQL, parameterized and strongly typed queries, and Streams integration
Java
1782
Updated: 2 y ago
License: Permissive (Apache-2.0)
Mirror of Apache Kudu
Gaffer by gchq
A large-scale entity and relation database supporting aggregation of properties
Java
1700
Updated: 2 y ago
License: Permissive (Apache-2.0)
home by apachecn
ApacheCN open-source organization: announcements, introduction, members, events, and contact channels
CSS
1694
Updated: 2 y ago
License: Proprietary (Proprietary)
elassandra by strapdata
Elassandra = Elasticsearch + Apache Cassandra
Java
1667
Updated: 2 y ago
License: Permissive (Apache-2.0)
fugue by fugue-project
A unified interface for distributed computing. Fugue executes SQL, Python, Pandas, and Polars code on Spark, Dask and Ray without any rewrites.
Python
1622
Updated: 2 y ago
License: Permissive (Apache-2.0)
waterdrop by InterestingLab
A product for large-scale data computation in production environments; documentation:
Java
1601
Updated: 4 y ago
License: Permissive (Apache-2.0)
presto by prestosql
Home of the community managed version of Presto, the distributed SQL query engine for big data, under the auspices of the Presto Software Foundation.
Java
1595
Updated: 4 y ago
License: Permissive (Apache-2.0)
petastorm by uber
Petastorm library enables single machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as Tensorflow, Pytorch, and PySpark and can be used from pure Python code.
Python
1584
Updated: 2 y ago
License: Permissive (Apache-2.0)
DS-Take-Home by JifuZhao
My solution to the book A Collection of Data Science Take-Home Challenges
Jupyter Notebook
1575
Updated: 2 y ago
License: Permissive (MIT)