Apache NiFi
Apache Kylin
Mirror of Apache Flume
Apache Geode
learning-spark by databricks
Example code from Learning Spark book
Java | 3837 | Updated: 2 y ago | License: Permissive (MIT)
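flink-recommandSystem-demo by will-che
A real-time product recommendation system built on Flink. Flink computes product popularity and caches it in Redis, analyzes log data, and stores profile tags and real-time records in HBase. When a user requests recommendations, the popularity ranking is re-sorted according to the user profile, related products are attached to each product in the new ranking using the collaborative-filtering and tag-based recommendation modules, and the new list is returned to the user.
Java | 3824 | Updated: 2 y ago | License: No License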
TensorFlowOnSpark by yahoo
TensorFlowOnSpark brings TensorFlow programs to Apache Spark clusters.
Python | 3781 | Updated: 3 y ago | License: Permissive (Apache-2.0)
quickwit by quickwit-oss
Sub-second search & analytics engine on cloud storage
Rust | 3763 | Updated: 2 y ago | License: Proprietary
crate by crate
CrateDB is a distributed SQL database for storing and analyzing massive amounts of data in real-time. Built on top of Lucene.
Java | 3692 | Updated: 2 y ago | License: Permissive (Apache-2.0)
HELK by Cyb3rWard0g
The Hunting ELK
Jupyter Notebook | 3530 | Updated: 2 y ago | License: Strong Copyleft (GPL-3.0)
databus by linkedin
Source-agnostic distributed change data capture system
Java | 3499 | Updated: 2 y ago | License: Permissive (Apache-2.0)
scalding by twitter
A Scala API for Cascading
Scala | 3426 | Updated: 2 y ago | License: Permissive (Apache-2.0)
jitsu by jitsucom
Jitsu is an open-source Segment alternative. Fully scriptable data ingestion engine for modern data teams. Set up a real-time data pipeline in minutes, not days.
TypeScript | 3416 | Updated: 2 y ago | License: Permissive (MIT)
CoolplaySpark by lw-lin
Coolplay Spark: Spark source code analysis, Spark libraries, and more
Scala | 3399 | Updated: 2 y ago | License: No License
koalas by databricks
Koalas: pandas API on Apache Spark
Python | 3268 | Updated: 2 y ago | License: Permissive (Apache-2.0)
incubator-streampark by apache
StreamPark: make stream processing easier! An easy-to-use streaming application development framework and operation platform
Java | 3259 | Updated: 2 y ago | License: Permissive (Apache-2.0)
nalgebra by dimforge
Linear algebra library for Rust.
Rust | 3242 | Updated: 2 y ago | License: Permissive (Apache-2.0)
gleam by chrislusf
Fast, efficient, and scalable distributed map/reduce system with DAG execution, in memory or on disk, written in pure Go; runs standalone or distributed.
Go | 3219 | Updated: 2 y ago | License: Permissive (Apache-2.0)
incubator-pinot by apache
Apache Pinot (Incubating) - A realtime distributed OLAP datastore
Java | 3139 | Updated: 4 y ago | License: Permissive (Apache-2.0)
blaze by blaze
NumPy and Pandas interface to Big Data
Python | 3133 | Updated: 2 y ago | License: Permissive (BSD-3-Clause)
risingwave by RisingWaveLabs
RisingWave: the next-generation streaming database in the cloud.
Rust | 3116 | Updated: 2 y ago | License: Permissive (Apache-2.0)
linkis by apache
Apache Linkis builds a computation middleware layer to facilitate connection, governance and orchestration between the upper applications and the underlying data engines.
Java | 3083 | Updated: 2 y ago | License: Permissive (Apache-2.0)
RoaringBitmap by RoaringBitmap
A better compressed bitset in Java
Java | 3057 | Updated: 2 y ago | License: Permissive (Apache-2.0)
spark-notebook by spark-notebook
Interactive and Reactive Data Science using Scala and Spark.
JavaScript | 3051 | Updated: 3 y ago | License: Permissive (Apache-2.0)
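flink-recommandSystem-demo by CheckChe0803
A real-time product recommendation system built on Flink. Flink computes product popularity and caches it in Redis, analyzes log data, and stores profile tags and real-time records in HBase. When a user requests recommendations, the popularity ranking is re-sorted according to the user profile, related products are attached to each product in the new ranking using the collaborative-filtering and tag-based recommendation modules, and the new list is returned to the user.
Java | 2920 | Updated: 3 y ago | License: No License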
smart_open by RaRe-Technologies
Utils for streaming large files (S3, HDFS, gzip, bz2...)
Python | 2880 | Updated: 2 y ago | License: Permissive (MIT)
spark-jobserver by spark-jobserver
REST job server for Apache Spark
Scala | 2820 | Updated: 2 y ago | License: Proprietary
deequ by awslabs
Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.
Scala | 2812 | Updated: 2 y ago | License: Permissive (Apache-2.0)
ibis by ibis-project
The flexibility of Python with the scale and performance of modern SQL.
Python | 2789 | Updated: 2 y ago | License: Permissive (Apache-2.0)
incubator-linkis by apache
Apache Linkis builds a computation middleware layer to facilitate connection, governance and orchestration between the upper applications and the underlying data engines.
Java | 2762 | Updated: 2 y ago | License: Permissive (Apache-2.0)
aws-data-wrangler by awslabs
Pandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, Neptune, OpenSearch, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).
Python | 2734 | Updated: 3 y ago | License: Permissive (Apache-2.0)
dpark by douban
Python clone of Spark, a MapReduce-like framework in Python
Python | 2693 | Updated: 2 y ago | License: Permissive (BSD-3-Clause)
js by turbo
turbo.js - perform massive parallel computations in your browser with GPGPU.
JavaScript | 2600 | Updated: 2 y ago | License: Permissive (MIT)
nutch by apache
Apache Nutch is an extensible and scalable web crawler
Java | 2584 | Updated: 2 y ago | License: Permissive (Apache-2.0)
Spark-The-Definitive-Guide by databricks
Spark: The Definitive Guide's Code Repository
Scala | 2584 | Updated: 2 y ago | License: Proprietary
highway by google
Performance-portable, length-agnostic SIMD with runtime dispatch
C++ | 2580 | Updated: 2 y ago | License: Permissive (Apache-2.0)
mrjob by Yelp
Run MapReduce jobs on Hadoop or Amazon Web Services
Python | 2546 | Updated: 4 y ago | License: Proprietary
Movie_Recommend by LuckyZXL2016
A Spark-based movie recommendation system, comprising a crawler project, a website, an admin back end, and the Spark recommendation system
Java | 2446 | Updated: 2 y ago | License: Permissive (MIT)
gocql by gocql
Package gocql implements a fast and robust Cassandra client for the Go programming language.
Go | 2414 | Updated: 2 y ago | License: Permissive (BSD-3-Clause)
OpenMetadata by open-metadata
Open Standard for Metadata. A Single place to Discover, Collaborate and Get your data right.
TypeScript | 2368 | Updated: 2 y ago | License: Permissive (Apache-2.0)
spark-on-k8s-operator by GoogleCloudPlatform
Kubernetes operator for managing the lifecycle of Apache Spark applications on Kubernetes.
Go | 2344 | Updated: 2 y ago | License: Permissive (Apache-2.0)
ballista by ballista-compute
Distributed compute platform implemented in Rust, and powered by Apache Arrow.
Rust | 2318 | Updated: 2 y ago | License: Permissive (Apache-2.0)
PyFunctional by EntilZha
Python library for creating data pipelines with chain functional programming
Python | 2168 | Updated: 2 y ago | License: Permissive (MIT)
vega by rajasekarv
A new, arguably faster implementation of Apache Spark from scratch in Rust
Rust | 2166 | Updated: 2 y ago | License: Permissive (Apache-2.0)
gobblin by apache
A distributed data integration framework that simplifies common aspects of big data integration such as data ingestion, replication, organization and lifecycle management for both streaming and batch data ecosystems.
Java | 2129 | Updated: 2 y ago | License: Permissive (Apache-2.0)
zio-quill by zio
Compile-time Language Integrated Queries for Scala
Scala | 2125 | Updated: 2 y ago | License: Permissive (Apache-2.0)
summingbird by twitter
Streaming MapReduce with Scalding and Storm
Scala | 2102 | Updated: 4 y ago | License: Permissive (Apache-2.0)
griddb by griddb
GridDB is a next-generation open source database that makes time series IoT and big data fast and easy.
C++ | 2093 | Updated: 2 y ago | License: Strong Copyleft (AGPL-3.0)
dinky by DataLinkDC
Dinky is an out-of-the-box, one-stop real-time computing platform dedicated to the construction and practice of Unified Streaming & Batch and Unified Data Lake & Data Warehouse. Based on Apache Flink, Dinky provides the ability to connect many big data frameworks including OLAP and Data Lake.
Java | 2093 | Updated: 2 y ago | License: Permissive (Apache-2.0)
Linkis by WeBankFinTech
Linkis helps easily connect to various back-end computation/storage engines (Spark, Python, TiDB...), exposes various interfaces (REST, JDBC, Java...), with multi-tenancy, high performance, and resource control.
Scala | 2091 | Updated: 3 y ago | License: Permissive (Apache-2.0)