High-level C binding for ØMQ
Apache Accumulo
Apache Impala
Mirror of Apache Sqoop
mini-lsm by skyzh
A tutorial on building an LSM-Tree storage engine in a week! (WIP)
Rust 1106 Updated: 2 y ago License: Permissive (Apache-2.0)
Nagios-Plugins by HariSekhon
450+ AWS, Hadoop, Cloud, Kafka, Docker, Elasticsearch, RabbitMQ, Redis, HBase, Solr, Cassandra, ZooKeeper, HDFS, Yarn, Hive, Presto, Drill, Impala, Consul, Spark, Jenkins, Travis CI, Git, MySQL, Linux, DNS, Whois, SSL Certs, Yum Security Updates, Kubernetes, Cloudera etc...
Python 1101 Updated: 2 y ago License: Proprietary (Proprietary)
hollow by Netflix
Hollow is a Java library and toolset for disseminating in-memory datasets from a single producer to many consumers for high-performance read-only access.
Java 1094 Updated: 2 y ago License: Permissive (Apache-2.0)
libsimdpp by p12tic
Portable header-only C++ low-level SIMD library
C++ 1094 Updated: 2 y ago License: Permissive (BSL-1.0)
utils4s by jacksu
A collection of test cases and related materials gathered while working with Scala and Spark.
Scala 1083 Updated: 2 y ago License: No License (No License)
spark-sklearn by databricks
(Deprecated) Scikit-learn integration package for Apache Spark
Python 1072 Updated: 2 y ago License: Permissive (Apache-2.0)
hazelcast-jet by hazelcast
Distributed Stream and Batch Processing
Java 1054 Updated: 2 y ago License: Proprietary (Proprietary)
phantom by outworkers
Schema safe, type-safe, reactive Scala driver for Cassandra/Datastax Enterprise
Scala 1050 Updated: 2 y ago License: Permissive (Apache-2.0)
goodreads_etl_pipeline by san089
An end-to-end GoodReads Data Pipeline for Building Data Lake, Data Warehouse and Analytics Platform.
Python 1048 Updated: 2 y ago License: Permissive (MIT)
brickhouse by klout
Hive UDFs for the data warehouse
Java 1044 Updated: 2 y ago License: Proprietary (Proprietary)
dumbo by klbostee
Python module that allows one to easily write and run Hadoop programs.
Python 1044 Updated: 2 y ago License: No License (No License)
kylo by Teradata
Kylo is a data lake management software platform and framework for enabling scalable enterprise-class data lakes on big data technologies such as Teradata, Apache Spark and/or Hadoop. Kylo is licensed under Apache 2.0. Contributed by Teradata Inc.
Java 1041 Updated: 2 y ago License: Permissive (Apache-2.0)
Udacity-Data-Engineering-Projects by san089
A few projects related to Data Engineering, including Data Modeling, Infrastructure setup on cloud, Data Warehousing and Data Lake development.
Python 1038 Updated: 2 y ago License: Proprietary (Proprietary)
snappydata by TIBCOSoftware
Project SnappyData - memory optimized analytics database, based on Apache Spark™ and Apache Geode™. Stream, Transact, Analyze, Predict in one cluster
Scala 1033 Updated: 2 y ago License: Proprietary (Proprietary)
griffin by apache
Mirror of Apache Griffin
Scala 1032 Updated: 2 y ago License: Permissive (Apache-2.0)
streamx by streamxhub
Make stream processing easier! A Flink & Spark development scaffold. The original intention of StreamX is to make the development of Flink easier. StreamX focuses on the management of development phases and tasks. Our ultimate goal is to build a one-stop big data solution integrating stream processing, batch processing, data warehouse and data lake.
Java 1031 Updated: 3 y ago License: Permissive (Apache-2.0)
around-dataengineering by abhishek-ch
A Data Engineering & Machine Learning Knowledge Hub
Python 1026 Updated: 2 y ago License: No License (No License)
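alldata by AllDataTeam
💥🔥 To address the pain points enterprises face when building big data platforms, this project performs secondary development and maintenance on many of Apache's big data platform components and delivers a general-purpose big data platform foundation. It focuses on the problems and challenges that arise in data collection, data storage, data computation, data development, and data operations. The original goal is to build an industry-leading, open source, one-stop big data platform that helps thousands of small and medium-sized enterprises grow their business quickly and offers a range of solutions to developers who love big data.
Java 1021 Updated: 2 y ago License: Permissive (Apache-2.0)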
pyspark-tutorial by mahmoudparsian
PySpark-Tutorial provides basic algorithms using PySpark
Jupyter Notebook 1009 Updated: 2 y ago License: Proprietary (Proprietary)
MIT6.824-2021 by OneSizeFitsQuorum
4 labs + 2 challenges + 4 docs
Shell 1004 Updated: 2 y ago License: No License (No License)
data-algorithms-book by mahmoudparsian
MapReduce, Spark, Java, and Scala for Data Algorithms Book
Java 996 Updated: 3 y ago License: Proprietary (Proprietary)
shark by amplab
Development of Shark has ended.
Scala 993 Updated: 4 y ago License: Permissive (Apache-2.0)
livy by cloudera
Livy is an open source REST interface for interacting with Apache Spark from anywhere
Scala 990 Updated: 2 y ago License: No License (No License)
RecommenderSystems by DeepGraphLearning
Python 989 Updated: 2 y ago License: Permissive (MIT)
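alldata by authorwlh
💥🔥 A data platform for big data ecosystem solutions: a big data solution built on big data, data platforms, microservices, machine learning, e-commerce, automated operations, DevOps, a container deployment platform, and data platform collection, storage, computation, development, and application building.
Java 981 Updated: 2 y ago License: Permissive (Apache-2.0)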
spark-scala-tutorial by deanwampler
A free tutorial for Apache Spark.
Jupyter Notebook 966 Updated: 2 y ago License: Proprietary (Proprietary)
libdivide by ridiculousfish
Official git repository for libdivide: optimized integer division
C++ 955 Updated: 2 y ago License: Proprietary (Proprietary)
spark-kotlin by perwendel
A Spark DSL in idiomatic Kotlin // dependency: com.sparkjava:spark-kotlin:1.0.0-alpha
Kotlin 955 Updated: 4 y ago License: Permissive (Apache-2.0)
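Coding-Now by josonle
Study notes, along with eBooks I have read, video resources, and blogs, websites, and tools I have collected and consider worthwhile. Topics include the major big data components, Python machine learning and data analysis, Linux, operating systems, algorithms, networking, and more.
Python 951 Updated: 2 y ago License: Strong Copyleft (GPL-2.0)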
adam by bigdatagenomics
ADAM is a genomics analysis platform with specialized file formats built using Apache Avro, Apache Spark, and Apache Parquet. Apache 2 licensed.
Scala 943 Updated: 2 y ago License: Permissive (Apache-2.0)
sparkling-water by h2oai
Sparkling Water provides H2O functionality inside a Spark cluster
Scala 943 Updated: 2 y ago License: Permissive (Apache-2.0)
flume by cloudera
WE HAVE MOVED to Apache Incubator. https://cwiki.apache.org/FLUME/ . Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms. The system is centrally managed and allows for intelligent dynamic management. It uses a simple extensible data model that allows for online analytic applications.
Java 941 Updated: 2 y ago License: Permissive (Apache-2.0)
sse2neon by DLTcollab
A translator from Intel SSE intrinsics to Arm/Aarch64 NEON implementation
C++ 940 Updated: 2 y ago License: Permissive (MIT)
Mobius by microsoft
C# and F# language binding and extensions to Apache Spark
C# 939 Updated: 2 y ago License: Permissive (MIT)
wormhole by edp963
Wormhole is a SPaaS (Stream Processing as a Service) Platform
JavaScript 937 Updated: 3 y ago License: Proprietary (Proprietary)
BigDataGuide by Dr11ft
Big data learning: learn big data from scratch, including study videos and interview materials for each stage of learning.
Java 935 Updated: 4 y ago License: No License (No License)
graphframes by graphframes
Scala 934 Updated: 2 y ago License: Permissive (Apache-2.0)
go-stash by kevwan
go-stash is a high-performance, free and open source server-side data processing pipeline that ingests data from Kafka, processes it, and then sends it to ElasticSearch.
Go 927 Updated: 2 y ago License: Permissive (MIT)
Addax by wgzhao
Addax is a versatile open-source ETL tool that can seamlessly transfer data between various RDBMS and NoSQL databases, making it an ideal solution for data migration.
Java 912 Updated: 2 y ago License: Permissive (Apache-2.0)
bigdata-growth by collabH
A big data knowledge repository covering data warehouse modeling, real-time computing, big data, data middle platforms, system design, Java, algorithms, and more.
Shell 912 Updated: 2 y ago License: Permissive (MIT)
trafficcontrol by apache
Apache Traffic Control is an Open Source implementation of a Content Delivery Network
Go 911 Updated: 2 y ago License: Permissive (Apache-2.0)
LearningSparkV2 by databricks
This is the GitHub repo for Learning Spark: Lightning-Fast Data Analytics [2nd Edition]
Scala 909 Updated: 2 y ago License: Permissive (Apache-2.0)
spark-redis by RedisLabs
A connector for Spark that allows reading from and writing to a Redis cluster
Scala 908 Updated: 2 y ago License: Permissive (BSD-3-Clause)
sparklyr by sparklyr
R interface for Apache Spark
R 906 Updated: 2 y ago License: Permissive (Apache-2.0)
flint by twosigma
A Time Series Library for Apache Spark
Scala 901 Updated: 3 y ago License: Permissive (Apache-2.0)
quokka by marsupialtail
Open source SQL engine in Python
Python 891 Updated: 2 y ago License: Permissive (Apache-2.0)