Apache Trafodion

Recommendation system project (real-time and offline recommendation)

akka-analytics by krasserm
Large-scale event processing with Akka Persistence and Apache Spark
Scala
277
Updated: 2 y ago
License: Permissive (Apache-2.0)

DataVec by deeplearning4j
ETL Library for Machine Learning - data pipelines, data munging and wrangling
Java
275
Updated: 4 y ago
License: Permissive (Apache-2.0)

tigon by cdapio
High Throughput Real-time Stream Processing Framework
C++
275
Updated: 4 y ago
License: Permissive (Apache-2.0)

hops by hopshadoop
Hops Hadoop is a distribution of Apache Hadoop with distributed metadata.
Java
273
Updated: 2 y ago
License: Permissive (Apache-2.0)

Spark-with-Python by tirthajyoti
Fundamentals of Spark with Python (using PySpark), code examples
Jupyter Notebook
273
Updated: 2 y ago
License: Permissive (MIT)

kafkastreams-cep by fhussonnois
Complex Event Processing on top of Kafka Streams
Java
267
Updated: 4 y ago
License: Permissive (Apache-2.0)

hadoop-connectors by GoogleCloudDataproc
Libraries and tools for interoperability between Hadoop-related open-source software and Google Cloud Platform.
Java
267
Updated: 2 y ago
License: Permissive (Apache-2.0)

web-development-with-node-and-express-2e by EthanRBrown
Companion repository for Web Development With Node and Express, 2nd Edition (O'Reilly).
JavaScript
263
Updated: 2 y ago
License: No License (No License)

ferry by jhorey
Ferry lets you define, run, and deploy big data applications on AWS, OpenStack, and your local machine using Docker
Python
256
Updated: 4 y ago
License: Permissive (Apache-2.0)

SlimFast by HubSpot
Slimming down jars since 2016
Java
256
Updated: 4 y ago
License: Permissive (Apache-2.0)

spark-indexedrdd by amplab
An efficient updatable key-value store for Apache Spark
Scala
252
Updated: 4 y ago
License: Permissive (Apache-2.0)

cstar by spotify
Apache Cassandra cluster orchestration tool for the command line
Python
249
Updated: 3 y ago
License: Permissive (Apache-2.0)

flume-ng-sql-source by keedio
Flume Source to import data from SQL Databases
Java
248
Updated: 4 y ago
License: Permissive (Apache-2.0)

parquet-go by fraugster
Go package to read and write parquet files. parquet is a file format to store nested data structures in a flat columnar data format. It can be used in the Hadoop ecosystem and with tools such as Presto and AWS Athena.
Go
248
Updated: 2 y ago
License: Permissive (Apache-2.0)

data-generator by ysc
If you work in big data BI and want to compare the performance of different implementations such as MySQL, GreenPlum, Elasticsearch, Hive, Spark SQL, Presto, Impala, Drill, HAWQ, Druid, Pinot, Kylin, ClickHouse, and Kudu, you need a standard dataset to test against. This open-source project exists to generate such standard data.
Java
247
Updated: 4 y ago
License: Permissive (Apache-2.0)

storagetapper by uber
StorageTapper is a scalable realtime MySQL change data streaming, logical backup and logical replication service
Go
246
Updated: 4 y ago
License: Permissive (MIT)

AntDB by ADBSQL
AntDB is a distributed database to provide both write-scalability and massively parallel processing. The mirror git repository is at https://gitee.com/adbsql/antdb.
C
245
Updated: 2 y ago
License: Permissive (Apache-2.0)

Cubert by linkedin
Fast and efficient batch computation engine for complex analysis and reporting of massive datasets on Hadoop
Java
243
Updated: 5 y ago
License: Permissive (Apache-2.0)

hadoop-docker by ruoyu-chen
Docker-based Hadoop development and test environment, including Hadoop, Hive, HBase, and Spark
Shell
243
Updated: 4 y ago
License: No License (No License)

mapreduce-lite by wangkuiyi
A C++ implementation of MapReduce without a distributed filesystem
C++
240
Updated: 4 y ago
License: Proprietary (Proprietary)

Hadoop by matteobertozzi
Hadoop (Utilities, Patches and Examples)
Python
238
Updated: 2 y ago
License: No License (No License)
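
BigDataPlatform by KangU4
BigDataPlatform: a big data solution built on big data, data platform, microservices, machine learning, e-commerce, automated operations, DevOps, a container deployment platform, and data platform collection, storage, computation, development, and application building.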
Java
238
Updated: 5 y ago
License: Strong Copyleft (GPL-3.0)

omniduct by airbnb
A toolkit providing a uniform interface for connecting to and extracting data from a wide variety of (potentially remote) data stores (including HDFS, Hive, Presto, MySQL, etc).
Python
238
Updated: 2 y ago
License: Permissive (MIT)

parquet4s by mjakubowski84
Read and write Parquet in Scala. Use Scala classes as schema. No need to start a cluster.
Scala
237
Updated: 2 y ago
License: Permissive (MIT)

RefactorFirst by jimbethancourt
Tool for Java codebases that will help you identify the God Classes you should refactor first.
Java
237
Updated: 2 y ago
License: Permissive (Apache-2.0)

facebook-hive-udfs by brndnmtthws
Facebook's Hive UDFs
Java
235
Updated: 3 y ago
License: Permissive (Apache-2.0)

Cloud9 by lintool
Cloud9 is a Hadoop toolkit for working with big data
Java
234
Updated: 4 y ago
License: No License (No License)

gimel by paypal
Big Data Processing Framework - Unified Data API or SQL on Any Storage
Scala
233
Updated: 2 y ago
License: Permissive (Apache-2.0)

BigData-Getting-Started by Thpffcj
Hands-on projects for big data frameworks (Hadoop, Spark, Storm, Flink)
Java
232
Updated: 2 y ago
License: No License (No License)

shifu by ShifuML
An end-to-end machine learning and data mining framework on Hadoop
Java
232
Updated: 2 y ago
License: Permissive (Apache-2.0)

big-data by vbay
An open-source, systematic big data tutorial. Spark learning, Hadoop, Hive, HBase, and Flink tutorials, Linux, from beginner to expert
Shell
232
Updated: 2 y ago
License: No License (No License)

scala by codeport
LascoDan (Korea Scala Group) Scala study
Scala
232
Updated: 4 y ago
License: No License (No License)

hadoop-attack-library by wavestone-cdt
A collection of pentest tools and resources targeting Hadoop environments
Python
228
Updated: 4 y ago
License: No License (No License)
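
weathertop by king-angmar
Day-to-day notes from studying J2EE and Linux components, suited to readers who want to learn or review the fundamentals. Planned content currently includes design patterns, Springboot, and SpringCloud, as well as open-source Linux components such as Redis, Kafka, Nginx, ElasticSearch, Hadoop, and Zookeeper.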
JavaScript
227
Updated: 4 y ago
License: Permissive (MIT)

hive-jdbc-uber-jar by timveil
Hive JDBC "uber" or "standalone" jar based on the latest Apache Hive version
Java
226
Updated: 3 y ago
License: No License (No License)

spark-workshop by jaceklaskowski
Apache Spark™ and Scala Workshops
HTML
225
Updated: 4 y ago
License: Permissive (Apache-2.0)

rubydoop by iconara
Write Hadoop jobs in JRuby
Ruby
224
Updated: 4 y ago
License: No License (No License)

hdfs-du by twitter-archive
Visualize your HDFS cluster usage
JavaScript
224
Updated: 4 y ago
License: Permissive (Apache-2.0)

cluster-turndown by kubecost
Automated turndown of Kubernetes clusters on specific schedules.
Go
224
Updated: 2 y ago
License: Permissive (Apache-2.0)

storm-docker by wurstmeister
Dockerfiles for building a storm cluster.
Shell
223
Updated: 4 y ago
License: Permissive (Apache-2.0)

graphchi-java by GraphChi
GraphChi's Java version
Java
219
Updated: 4 y ago
License: No License (No License)

SparkRDMA by Mellanox
RDMA accelerated, high-performance, scalable and efficient ShuffleManager plugin for Apache Spark
Java
218
Updated: 4 y ago
License: Permissive (Apache-2.0)

DataX by wgzhao
DataX is an open source universal ETL tool that supports Cassandra, ClickHouse, DBF, Hive, InfluxDB, Kudu, MySQL, Oracle, Presto (Trino), PostgreSQL, and SQL Server
Java
217
Updated: 4 y ago
License: Permissive (Apache-2.0)

presto-hbase-connector by analysys
The presto hbase connector component is implemented against the Presto Connector interface specification and adds HBase querying to Presto. Compared with other open-source HBase connectors, our performance is 10 to 100 times faster or more.
Java
216
Updated: 4 y ago
License: Permissive (Apache-2.0)

pydoop by crs4
A Python MapReduce and HDFS API for Hadoop
Python
216
Updated: 4 y ago
License: Permissive (Apache-2.0)

Keras_Deep_Clustering by Tony607
How to do Unsupervised Clustering with Keras
Jupyter Notebook
214
Updated: 4 y ago
License: Proprietary (Proprietary)

docker-cassandra by nicolasff
A set of scripts and config files to run a Cassandra cluster from Docker
Shell
212
Updated: 4 y ago
License: Permissive (Apache-2.0)

big-whale by MeetYouDevs
Scheduling of offline jobs (Spark, Flink, and so on) and monitoring of real-time jobs
Java
210
Updated: 4 y ago
License: Permissive (Apache-2.0)