Mirror of Apache Flume
Apache Parquet
Mirror of Apache Kudu
spark-nlp by JohnSnowLabs
State of the Art Natural Language Processing
Scala 3279 Updated: 1 y ago License: Permissive (Apache-2.0)
koalas by databricks
Koalas: pandas API on Apache Spark
Python 3268 Updated: 2 y ago License: Permissive (Apache-2.0)
volcano by volcano-sh
A Cloud Native Batch System (Project under CNCF)
Go 3081 Updated: 1 y ago License: Permissive (Apache-2.0)
RoaringBitmap by RoaringBitmap
A better compressed bitset in Java
Java 3057 Updated: 2 y ago License: Permissive (Apache-2.0)
spark-notebook by spark-notebook
Interactive and Reactive Data Science using Scala and Spark.
JavaScript 3051 Updated: 3 y ago License: Permissive (Apache-2.0)
aresdb by uber
A GPU-powered real-time analytics storage and query engine.
Go 2910 Updated: 2 y ago License: Permissive (Apache-2.0)
spark-jobserver by spark-jobserver
REST job server for Apache Spark
Scala 2820 Updated: 1 y ago License: Proprietary (Proprietary)
deequ by awslabs
Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.
Scala 2812 Updated: 1 y ago License: Permissive (Apache-2.0)
ibis by ibis-project
The flexibility of Python with the scale and performance of modern SQL.
Python 2789 Updated: 1 y ago License: Permissive (Apache-2.0)
aws-data-wrangler by awslabs
Pandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, Neptune, OpenSearch, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).
Python 2734 Updated: 3 y ago License: Permissive (Apache-2.0)
dpark by douban
Python clone of Spark, a MapReduce-like framework in Python
Python 2693 Updated: 2 y ago License: Permissive (BSD-3-Clause)
Spark-The-Definitive-Guide by databricks
Spark: The Definitive Guide's Code Repository
Scala 2584 Updated: 2 y ago License: Proprietary (Proprietary)
Movie_Recommend by LuckyZXL2016
A Spark-based movie recommendation system, including a web crawler project, a website, a back-office management system, and the Spark recommendation system
Java 2446 Updated: 2 y ago License: Permissive (MIT)
mmlspark by Azure
Microsoft Machine Learning for Apache Spark
Scala 2371 Updated: 3 y ago License: Permissive (MIT)
notebooks by huggingface
Notebooks using the Hugging Face libraries 🤗
Jupyter Notebook 2352 Updated: 1 y ago License: Permissive (Apache-2.0)
spark-on-k8s-operator by GoogleCloudPlatform
Kubernetes operator for managing the lifecycle of Apache Spark applications on Kubernetes.
Go 2344 Updated: 2 y ago License: Permissive (Apache-2.0)
ballista by ballista-compute
Distributed compute platform implemented in Rust, and powered by Apache Arrow.
Rust 2318 Updated: 2 y ago License: Permissive (Apache-2.0)
TransmogrifAI by salesforce
TransmogrifAI (pronounced trăns-mŏgˈrə-fī) is an AutoML library for building modular, reusable, strongly typed machine learning workflows on Apache Spark with minimal hand-tuning
Scala 2186 Updated: 2 y ago License: Permissive (BSD-3-Clause)
PyFunctional by EntilZha
Python library for creating data pipelines with chain functional programming
Python 2168 Updated: 2 y ago License: Permissive (MIT)
vega by rajasekarv
A new, arguably faster implementation of Apache Spark from scratch in Rust
Rust 2166 Updated: 1 y ago License: Permissive (Apache-2.0)
gobblin by apache
A distributed data integration framework that simplifies common aspects of big data integration such as data ingestion, replication, organization and lifecycle management for both streaming and batch data ecosystems.
Java 2129 Updated: 2 y ago License: Permissive (Apache-2.0)
Jest by searchbox-io
Elasticsearch Java Rest Client.
Java 2120 Updated: 2 y ago License: Permissive (Apache-2.0)
summingbird by twitter
Streaming MapReduce with Scalding and Storm
Scala 2102 Updated: 4 y ago License: Permissive (Apache-2.0)
Quicksql by Qihoo360
A Flexible, Fast, Federated (3F) SQL Analysis Middleware for Multiple Data Sources
Java 2005 Updated: 2 y ago License: Permissive (MIT)
quill by getquill
Compile-time Language Integrated Queries for Scala
Scala 1992 Updated: 3 y ago License: Permissive (Apache-2.0)
spark-deep-learning by databricks
Deep Learning Pipelines for Apache Spark
Python 1968 Updated: 2 y ago License: Permissive (Apache-2.0)
EasyML by ICT-BDA
Easy Machine Learning is a general-purpose dataflow-based system for easing the process of applying machine learning algorithms to real world tasks.
Java 1958 Updated: 2 y ago License: Permissive (Apache-2.0)
docker-hadoop by big-data-europe
Apache Hadoop docker image
Shell 1940 Updated: 1 y ago License: No License (No License)
flinkStreamSQL by DTStack
Extends the real-time SQL of open-source Flink; it mainly implements joins between streams and dimension tables, and supports all native Flink SQL syntax
Java 1921 Updated: 2 y ago License: Permissive (Apache-2.0)
spark by dotnet
.NET for Apache® Spark™ makes Apache Spark™ easily accessible to .NET developers.
C# 1905 Updated: 2 y ago License: Permissive (MIT)
elasticsearch-hadoop by elastic
Elasticsearch real-time search and analytics natively integrated with Hadoop
Java 1902 Updated: 2 y ago License: Permissive (Apache-2.0)
spark-cassandra-connector by datastax
DataStax Spark Cassandra Connector
Scala 1902 Updated: 2 y ago License: Permissive (Apache-2.0)
SZT-bigdata by geekyouth
Shenzhen Metro big data passenger flow analysis system 🚇🚄🌟
Scala 1871 Updated: 1 y ago License: Proprietary (Proprietary)
incubator-gobblin by apache
A distributed data integration framework that simplifies common aspects of big data integration such as data ingestion, replication, organization and lifecycle management for both streaming and batch data ecosystems.
Java 1819 Updated: 4 y ago License: Permissive (Apache-2.0)
seatunnel by InterestingLab
Production Ready Data Integration Product, documentation:
Java 1819 Updated: 3 y ago License: Permissive (Apache-2.0)
blazingsql by BlazingDB
BlazingSQL is a lightweight, GPU accelerated, SQL engine for Python. Built on RAPIDS cuDF.
C++ 1808 Updated: 2 y ago License: Permissive (Apache-2.0)
drill by apache
Apache Drill is a distributed MPP query layer for self-describing data
Java 1801 Updated: 2 y ago License: Permissive (Apache-2.0)
oryx by OryxProject
Oryx 2: Lambda architecture on Apache Spark, Apache Kafka for real-time large scale machine learning
Java 1798 Updated: 2 y ago License: Permissive (Apache-2.0)
jdbi by jdbi
jdbi is designed to provide convenient tabular data access in Java, including templated SQL, parameterized and strongly typed queries, and Streams integration
Java 1782 Updated: 1 y ago License: Permissive (Apache-2.0)
Gaffer by gchq
A large-scale entity and relation database supporting aggregation of properties
Java 1700 Updated: 2 y ago License: Permissive (Apache-2.0)
home by apachecn
ApacheCN open source organization: announcements, introduction, members, events, and contact channels
CSS 1694 Updated: 2 y ago License: Proprietary (Proprietary)
elassandra by strapdata
Elassandra = Elasticsearch + Apache Cassandra
Java 1667 Updated: 2 y ago License: Permissive (Apache-2.0)
fugue by fugue-project
A unified interface for distributed computing. Fugue executes SQL, Python, Pandas, and Polars code on Spark, Dask and Ray without any rewrites.
Python 1622 Updated: 1 y ago License: Permissive (Apache-2.0)
waterdrop by InterestingLab
A massive-data computing product for production environments, documentation:
Java 1601 Updated: 3 y ago License: Permissive (Apache-2.0)
presto by prestosql
Home of the community managed version of Presto, the distributed SQL query engine for big data, under the auspices of the Presto Software Foundation.
Java 1595 Updated: 4 y ago License: Permissive (Apache-2.0)
petastorm by uber
Petastorm library enables single machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as Tensorflow, Pytorch, and PySpark and can be used from pure Python code.
Python 1584 Updated: 2 y ago License: Permissive (Apache-2.0)
DS-Take-Home by JifuZhao
My solutions to the book A Collection of Data Science Take-Home Challenges
Jupyter Notebook 1575 Updated: 2 y ago License: Permissive (MIT)