High-level C binding for ØMQ
Apache Accumulo
Apache Impala
Mirror of Apache Sqoop
mini-lsm by skyzh
A tutorial of building an LSM-Tree storage engine in a week! (WIP)
Rust
1106
Updated: 2 y ago
License: Permissive (Apache-2.0)
Nagios-Plugins by HariSekhon
450+ AWS, Hadoop, Cloud, Kafka, Docker, Elasticsearch, RabbitMQ, Redis, HBase, Solr, Cassandra, ZooKeeper, HDFS, Yarn, Hive, Presto, Drill, Impala, Consul, Spark, Jenkins, Travis CI, Git, MySQL, Linux, DNS, Whois, SSL Certs, Yum Security Updates, Kubernetes, Cloudera etc...
Python
1101
Updated: 2 y ago
License: Proprietary (Proprietary)
hollow by Netflix
Hollow is a java library and toolset for disseminating in-memory datasets from a single producer to many consumers for high performance read-only access.
Java
1094
Updated: 2 y ago
License: Permissive (Apache-2.0)
libsimdpp by p12tic
Portable header-only C++ low level SIMD library
C++
1094
Updated: 2 y ago
License: Permissive (BSL-1.0)
utils4s by jacksu
Test cases and related reference material compiled while using Scala and Spark
Scala
1083
Updated: 2 y ago
License: No License (No License)
spark-sklearn by databricks
(Deprecated) Scikit-learn integration package for Apache Spark
Python
1072
Updated: 2 y ago
License: Permissive (Apache-2.0)
hazelcast-jet by hazelcast
Distributed Stream and Batch Processing
Java
1054
Updated: 2 y ago
License: Proprietary (Proprietary)
phantom by outworkers
Schema safe, type-safe, reactive Scala driver for Cassandra/Datastax Enterprise
Scala
1050
Updated: 2 y ago
License: Permissive (Apache-2.0)
goodreads_etl_pipeline by san089
An end-to-end GoodReads Data Pipeline for Building Data Lake, Data Warehouse and Analytics Platform.
Python
1048
Updated: 2 y ago
License: Permissive (MIT)
brickhouse by klout
Hive UDFs for the data warehouse
Java
1044
Updated: 2 y ago
License: Proprietary (Proprietary)
dumbo by klbostee
Python module that allows one to easily write and run Hadoop programs.
Python
1044
Updated: 2 y ago
License: No License (No License)
kylo by Teradata
Kylo is a data lake management software platform and framework for enabling scalable enterprise-class data lakes on big data technologies such as Teradata, Apache Spark and/or Hadoop. Kylo is licensed under Apache 2.0. Contributed by Teradata Inc.
Java
1041
Updated: 2 y ago
License: Permissive (Apache-2.0)
Udacity-Data-Engineering-Projects by san089
A few projects related to Data Engineering, including Data Modeling, Infrastructure setup on cloud, Data Warehousing and Data Lake development.
Python
1038
Updated: 2 y ago
License: Proprietary (Proprietary)
snappydata by TIBCOSoftware
Project SnappyData - memory optimized analytics database, based on Apache Spark™ and Apache Geode™. Stream, Transact, Analyze, Predict in one cluster
Scala
1033
Updated: 2 y ago
License: Proprietary (Proprietary)
griffin by apache
Mirror of Apache griffin
Scala
1032
Updated: 2 y ago
License: Permissive (Apache-2.0)
streamx by streamxhub
Make stream processing easier! A Flink & Spark development scaffold. The original intention of StreamX is to make the development of Flink easier. StreamX focuses on the management of development phases and tasks. Our ultimate goal is to build a one-stop big data solution integrating stream processing, batch processing, data warehouse and data lake.
Java
1031
Updated: 3 y ago
License: Permissive (Apache-2.0)
around-dataengineering by abhishek-ch
A Data Engineering & Machine Learning Knowledge Hub
Python
1026
Updated: 2 y ago
License: No License (No License)
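alldata by AllDataTeam
💥🔥 To address the pain points enterprises face when building big data platforms, this project carries out secondary development and maintenance of the many Apache big data platform components and delivers a general-purpose big data platform foundation. It focuses on the problems and challenges encountered in data collection, data storage, data computation, data development, and data operations. The original intention is to build an industry-leading open-source one-stop big data platform, to help thousands of small and medium-sized enterprises grow their business quickly, and to offer a range of solutions to developers who love big data.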
Java
1021
Updated: 3 y ago
License: Permissive (Apache-2.0)
pyspark-tutorial by mahmoudparsian
PySpark-Tutorial provides basic algorithms using PySpark
Jupyter Notebook
1009
Updated: 2 y ago
License: Proprietary (Proprietary)
MIT6.824-2021 by OneSizeFitsQuorum
4 labs + 2 challenges + 4 docs
Shell
1004
Updated: 2 y ago
License: No License (No License)
data-algorithms-book by mahmoudparsian
MapReduce, Spark, Java, and Scala for Data Algorithms Book
Java
996
Updated: 3 y ago
License: Proprietary (Proprietary)
shark by amplab
Development of Shark has ended.
Scala
993
Updated: 4 y ago
License: Permissive (Apache-2.0)
livy by cloudera
Livy is an open source REST interface for interacting with Apache Spark from anywhere
Scala
990
Updated: 2 y ago
License: No License (No License)
RecommenderSystems by DeepGraphLearning
Python
989
Updated: 2 y ago
License: Permissive (MIT)
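alldata by authorwlh
💥🔥 A data platform for big data ecosystem solutions: a big data solution built on big data, data platforms, microservices, machine learning, e-commerce, automated operations, DevOps, a container deployment platform, and data platform collection, storage, computation, development, and application building.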
Java
981
Updated: 3 y ago
License: Permissive (Apache-2.0)
spark-scala-tutorial by deanwampler
A free tutorial for Apache Spark.
Jupyter Notebook
966
Updated: 2 y ago
License: Proprietary (Proprietary)
libdivide by ridiculousfish
Official git repository for libdivide: optimized integer division
C++
955
Updated: 2 y ago
License: Proprietary (Proprietary)
spark-kotlin by perwendel
A Spark DSL in idiomatic Kotlin // dependency: com.sparkjava:spark-kotlin:1.0.0-alpha
Kotlin
955
Updated: 4 y ago
License: Permissive (Apache-2.0)
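Coding-Now by josonle
Study notes, along with eBooks I have read, video resources, and blogs, websites, and tools I have collected and found useful. Covers the major big data components, Python machine learning and data analysis, Linux, operating systems, algorithms, networking, and more.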
Python
951
Updated: 2 y ago
License: Strong Copyleft (GPL-2.0)
adam by bigdatagenomics
ADAM is a genomics analysis platform with specialized file formats built using Apache Avro, Apache Spark, and Apache Parquet. Apache 2 licensed.
Scala
943
Updated: 2 y ago
License: Permissive (Apache-2.0)
sparkling-water by h2oai
Sparkling Water provides H2O functionality inside a Spark cluster
Scala
943
Updated: 2 y ago
License: Permissive (Apache-2.0)
flume by cloudera
WE HAVE MOVED to Apache Incubator. https://cwiki.apache.org/FLUME/ . Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms. The system is centrally managed and allows for intelligent dynamic management. It uses a simple extensible data model that allows for online analytic applications.
Java
941
Updated: 2 y ago
License: Permissive (Apache-2.0)
sse2neon by DLTcollab
A translator from Intel SSE intrinsics to Arm/Aarch64 NEON implementation
C++
940
Updated: 2 y ago
License: Permissive (MIT)
Mobius by microsoft
C# and F# language binding and extensions to Apache Spark
C#
939
Updated: 2 y ago
License: Permissive (MIT)
wormhole by edp963
Wormhole is a SPaaS (Stream Processing as a Service) Platform
JavaScript
937
Updated: 3 y ago
License: Proprietary (Proprietary)
BigDataGuide by Dr11ft
Big data learning: learn big data from scratch, including study videos and interview materials for each stage of learning
Java
935
Updated: 4 y ago
License: No License (No License)
graphframes by graphframes
Scala
934
Updated: 2 y ago
License: Permissive (Apache-2.0)
go-stash by kevwan
go-stash is a high performance, free and open source server-side data processing pipeline that ingests data from Kafka, processes it, and then sends it to ElasticSearch.
Go
927
Updated: 2 y ago
License: Permissive (MIT)
Addax by wgzhao
Addax is a versatile open-source ETL tool that can seamlessly transfer data between various RDBMS and NoSQL databases, making it an ideal solution for data migration.
Java
912
Updated: 2 y ago
License: Permissive (Apache-2.0)
bigdata-growth by collabH
A big data knowledge repository covering data warehouse modeling, real-time computing, big data, data middle platforms, system design, Java, algorithms, and more.
Shell
912
Updated: 2 y ago
License: Permissive (MIT)
trafficcontrol by apache
Apache Traffic Control is an Open Source implementation of a Content Delivery Network
Go
911
Updated: 2 y ago
License: Permissive (Apache-2.0)
LearningSparkV2 by databricks
This is the github repo for Learning Spark: Lightning-Fast Data Analytics [2nd Edition]
Scala
909
Updated: 2 y ago
License: Permissive (Apache-2.0)
spark-redis by RedisLabs
A connector for Spark that allows reading and writing to/from Redis cluster
Scala
908
Updated: 2 y ago
License: Permissive (BSD-3-Clause)
sparklyr by sparklyr
R interface for Apache Spark
R
906
Updated: 2 y ago
License: Permissive (Apache-2.0)
flint by twosigma
A Time Series Library for Apache Spark
Scala
901
Updated: 4 y ago
License: Permissive (Apache-2.0)
quokka by marsupialtail
Open source SQL engine in Python
Python
891
Updated: 2 y ago
License: Permissive (Apache-2.0)