Apache NiFi
Apache Kylin
Mirror of Apache Flume
Apache Geode

learning-spark by databricks
Example code from Learning Spark book
Java
3837
Updated: 2 y ago
License: Permissive (MIT)
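
flink-recommandSystem-demo by will-che
A real-time product recommendation system built on Flink. Flink computes product popularity and caches it in Redis, analyzes log data, and stores user-profile tags and real-time records in HBase. When a user requests recommendations, the popularity ranking is re-sorted according to the user's profile, the collaborative-filtering and tag-based recommendation modules attach related products to each item in the newly generated ranking, and the new list is returned to the user.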
Java
3824
Updated: 2 y ago
License: No License (No License)

TensorFlowOnSpark by yahoo
TensorFlowOnSpark brings TensorFlow programs to Apache Spark clusters.
Python
3781
Updated: 3 y ago
License: Permissive (Apache-2.0)

quickwit by quickwit-oss
Sub-second search & analytics engine on cloud storage
Rust
3763
Updated: 2 y ago
License: Proprietary (Proprietary)

crate by crate
CrateDB is a distributed SQL database for storing and analyzing massive amounts of data in real-time. Built on top of Lucene.
Java
3692
Updated: 2 y ago
License: Permissive (Apache-2.0)

HELK by Cyb3rWard0g
The Hunting ELK
Jupyter Notebook
3530
Updated: 2 y ago
License: Strong Copyleft (GPL-3.0)

databus by linkedin
Source-agnostic distributed change data capture system
Java
3499
Updated: 2 y ago
License: Permissive (Apache-2.0)

scalding by twitter
A Scala API for Cascading
Scala
3426
Updated: 2 y ago
License: Permissive (Apache-2.0)

jitsu by jitsucom
Jitsu is an open-source Segment alternative. Fully scriptable data ingestion engine for modern data teams. Set up a real-time data pipeline in minutes, not days.
TypeScript
3416
Updated: 2 y ago
License: Permissive (MIT)

CoolplaySpark by lw-lin
Coolplay Spark: Spark source code analysis, Spark libraries, and more
Scala
3399
Updated: 2 y ago
License: No License (No License)

koalas by databricks
Koalas: pandas API on Apache Spark
Python
3268
Updated: 2 y ago
License: Permissive (Apache-2.0)

incubator-streampark by apache
StreamPark: make stream processing easier. An easy-to-use streaming application development framework and operation platform.
Java
3259
Updated: 2 y ago
License: Permissive (Apache-2.0)

nalgebra by dimforge
Linear algebra library for Rust.
Rust
3242
Updated: 2 y ago
License: Permissive (Apache-2.0)

gleam by chrislusf
Fast, efficient, and scalable distributed map/reduce system, DAG execution, in memory or on disk, written in pure Go, runs standalone or distributedly.
Go
3219
Updated: 2 y ago
License: Permissive (Apache-2.0)

incubator-pinot by apache
Apache Pinot (Incubating) - A realtime distributed OLAP datastore
Java
3139
Updated: 4 y ago
License: Permissive (Apache-2.0)

blaze by blaze
NumPy and Pandas interface to Big Data
Python
3133
Updated: 2 y ago
License: Permissive (BSD-3-Clause)

risingwave by RisingWaveLabs
RisingWave: the next-generation streaming database in the cloud.
Rust
3116
Updated: 3 y ago
License: Permissive (Apache-2.0)

linkis by apache
Apache Linkis builds a computation middleware layer to facilitate connection, governance and orchestration between the upper applications and the underlying data engines.
Java
3083
Updated: 2 y ago
License: Permissive (Apache-2.0)

RoaringBitmap by RoaringBitmap
A better compressed bitset in Java
Java
3057
Updated: 2 y ago
License: Permissive (Apache-2.0)

spark-notebook by spark-notebook
Interactive and Reactive Data Science using Scala and Spark.
JavaScript
3051
Updated: 4 y ago
License: Permissive (Apache-2.0)
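
flink-recommandSystem-demo by CheckChe0803
A real-time product recommendation system built on Flink. Flink computes product popularity and caches it in Redis, analyzes log data, and stores user-profile tags and real-time records in HBase. When a user requests recommendations, the popularity ranking is re-sorted according to the user's profile, the collaborative-filtering and tag-based recommendation modules attach related products to each item in the newly generated ranking, and the new list is returned to the user.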
Java
2920
Updated: 4 y ago
License: No License (No License)

smart_open by RaRe-Technologies
Utils for streaming large files (S3, HDFS, gzip, bz2...)
Python
2880
Updated: 2 y ago
License: Permissive (MIT)

spark-jobserver by spark-jobserver
REST job server for Apache Spark
Scala
2820
Updated: 2 y ago
License: Proprietary (Proprietary)

deequ by awslabs
Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.
Scala
2812
Updated: 2 y ago
License: Permissive (Apache-2.0)

ibis by ibis-project
The flexibility of Python with the scale and performance of modern SQL.
Python
2789
Updated: 2 y ago
License: Permissive (Apache-2.0)

incubator-linkis by apache
Apache Linkis builds a computation middleware layer to facilitate connection, governance and orchestration between the upper applications and the underlying data engines.
Java
2762
Updated: 2 y ago
License: Permissive (Apache-2.0)

aws-data-wrangler by awslabs
Pandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, Neptune, OpenSearch, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).
Python
2734
Updated: 3 y ago
License: Permissive (Apache-2.0)

dpark by douban
Python clone of Spark, a MapReduce-like framework in Python
Python
2693
Updated: 2 y ago
License: Permissive (BSD-3-Clause)

js by turbo
turbo.js - perform massive parallel computations in your browser with GPGPU.
JavaScript
2600
Updated: 2 y ago
License: Permissive (MIT)

nutch by apache
Apache Nutch is an extensible and scalable web crawler
Java
2584
Updated: 2 y ago
License: Permissive (Apache-2.0)

Spark-The-Definitive-Guide by databricks
Spark: The Definitive Guide's Code Repository
Scala
2584
Updated: 2 y ago
License: Proprietary (Proprietary)

highway by google
Performance-portable, length-agnostic SIMD with runtime dispatch
C++
2580
Updated: 2 y ago
License: Permissive (Apache-2.0)

mrjob by Yelp
Run MapReduce jobs on Hadoop or Amazon Web Services
Python
2546
Updated: 4 y ago
License: Proprietary (Proprietary)

Movie_Recommend by LuckyZXL2016
A Spark-based movie recommendation system, comprising a web crawler, a website, an admin back-end system, and the Spark recommendation engine.
Java
2446
Updated: 2 y ago
License: Permissive (MIT)

gocql by gocql
Package gocql implements a fast and robust Cassandra client for the Go programming language.
Go
2414
Updated: 2 y ago
License: Permissive (BSD-3-Clause)

OpenMetadata by open-metadata
Open Standard for Metadata. A Single place to Discover, Collaborate and Get your data right.
TypeScript
2368
Updated: 2 y ago
License: Permissive (Apache-2.0)

spark-on-k8s-operator by GoogleCloudPlatform
Kubernetes operator for managing the lifecycle of Apache Spark applications on Kubernetes.
Go
2344
Updated: 2 y ago
License: Permissive (Apache-2.0)

ballista by ballista-compute
Distributed compute platform implemented in Rust, and powered by Apache Arrow.
Rust
2318
Updated: 2 y ago
License: Permissive (Apache-2.0)

PyFunctional by EntilZha
Python library for creating data pipelines with chain functional programming
Python
2168
Updated: 2 y ago
License: Permissive (MIT)

vega by rajasekarv
A new arguably faster implementation of Apache Spark from scratch in Rust
Rust
2166
Updated: 2 y ago
License: Permissive (Apache-2.0)

gobblin by apache
A distributed data integration framework that simplifies common aspects of big data integration such as data ingestion, replication, organization and lifecycle management for both streaming and batch data ecosystems.
Java
2129
Updated: 2 y ago
License: Permissive (Apache-2.0)

zio-quill by zio
Compile-time Language Integrated Queries for Scala
Scala
2125
Updated: 2 y ago
License: Permissive (Apache-2.0)

summingbird by twitter
Streaming MapReduce with Scalding and Storm
Scala
2102
Updated: 4 y ago
License: Permissive (Apache-2.0)

griddb by griddb
GridDB is a next-generation open source database that makes time series IoT and big data fast and easy.
C++
2093
Updated: 2 y ago
License: Strong Copyleft (AGPL-3.0)

dinky by DataLinkDC
Dinky is an out of the box one-stop real-time computing platform dedicated to the construction and practice of Unified Streaming & Batch and Unified Data Lake & Data Warehouse. Based on Apache Flink, Dinky provides the ability to connect many big data frameworks including OLAP and Data Lake.
Java
2093
Updated: 2 y ago
License: Permissive (Apache-2.0)

Linkis by WeBankFinTech
Linkis helps easily connect to various back-end computation/storage engines (Spark, Python, TiDB...), exposes various interfaces (REST, JDBC, Java...), with multi-tenancy, high performance, and resource control.
Scala
2091
Updated: 4 y ago
License: Permissive (Apache-2.0)