Apache Spark - A unified analytics engine for large-scale data processing
data-science-ipython-notebooks by donnemartin
Python 25193 Version: Current License: Proprietary (Proprietary)
Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.
Mirror of Apache Kafka
Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
Apache Flink
Luigi is a Python module that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization etc. It also comes with Hadoop support built in.
The official home of the Presto distributed SQL query engine for big data
Big Data Getting-Started Guide :star:
DataX is the open-source version of Alibaba Cloud DataWorks Data Integration.
Apache Hadoop
PredictionIO, a machine learning server for developers and ML engineers.
open source training courses about distributed database and distributed systems
Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
Alluxio, data orchestration for analytics and machine learning in the cloud
An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs
At LinkedIn, we are using this curriculum for onboarding our entry-level talents into the SRE role.
Open-source distributed computation and storage platform. Real-time Stream Processing Unconference. Save Your Spot https://hazelcast.com/lp/unconference/
Apache Mesos
Apache HBase
Apache Hive
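A visual web interface for DataX: select a data source to generate a data synchronization task in one click. Supports data sources such as RDBMS, Hive, HBase, ClickHouse, and MongoDB, batch creation of RDBMS synchronization tasks, and an integrated open-source scheduling system, with support for distributed execution, incremental data synchronization, real-time viewing of run logs, monitoring of executor resources, killing running processes, and encryption of data source information.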
Apache Pinot - A realtime distributed OLAP datastore
Apache Ignite
Feature Store for Machine Learning
LIBSVM -- A Library for Support Vector Machines
Fast, distributed, secure AI for Big Data
TensorFlowOnSpark brings TensorFlow programs to Apache Spark clusters.
CrateDB is a distributed SQL database for storing and analyzing massive amounts of data in real-time. Built on top of Lucene.
A Scala API for Cascading
Fast, efficient, and scalable distributed map/reduce system, DAG execution, in memory or on disk, written in pure Go, runs standalone or distributedly.
Apache Pinot (Incubating) - A realtime distributed OLAP datastore
A better compressed bitset in Java
A distributed and coördination-free log management system
Utils for streaming large files (S3, HDFS, gzip, bz2...)
The flexibility of Python with the scale and performance of modern SQL.
Python clone of Spark, a MapReduce alike framework in Python
Apache Nutch is an extensible and scalable web crawler
Run MapReduce jobs on Hadoop or Amazon Web Services
A Spark-based movie recommendation system, comprising a crawler project, a web site, an admin back end, and the Spark recommendation system.
hdbscan by scikit-learn-contrib
Jupyter Notebook 2430 Version: Current License: Permissive (BSD-3-Clause)
A high performance implementation of HDBSCAN clustering.
Mirror of Apache Flume
Open Standard for Metadata. A Single place to Discover, Collaborate and Get your data right.
Streaming MapReduce with Scalding and Storm
GridDB is a next-generation open source database that makes time series IoT and big data fast and easy.
Node.js bindings for librdkafka
Easy Machine Learning is a general-purpose dataflow-based system for easing the process of applying machine learning algorithms to real world tasks.
Apache Hadoop docker image
Apache Ambari simplifies provisioning, managing, and monitoring of Apache Hadoop clusters.
Extends the real-time SQL of open-source Flink; mainly implements joins between streams and dimension tables, and supports all native Flink SQL syntax.
:elephant: Elasticsearch real-time search and analytics natively integrated with Hadoop
spark by apache
Apache Spark - A unified analytics engine for large-scale data processing
Scala 35985 Updated: 1 y ago License: Permissive (Apache-2.0)
data-science-ipython-notebooks by donnemartin
Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.
Python 25193 Updated: 1 y ago License: Proprietary (Proprietary)
xgboost by dmlc
Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
C++ 24228 Updated: 1 y ago License: Permissive (Apache-2.0)
luigi by spotify
Luigi is a Python module that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization etc. It also comes with Hadoop support built in.
Python 16581 Updated: 1 y ago License: Permissive (Apache-2.0)
presto by prestodb
The official home of the Presto distributed SQL query engine for big data
Java 14796 Updated: 1 y ago License: Permissive (Apache-2.0)
DataX by alibaba
DataX is the open-source version of Alibaba Cloud DataWorks Data Integration.
Java 13598 Updated: 1 y ago License: Proprietary (Proprietary)
attic-predictionio by apache
PredictionIO, a machine learning server for developers and ML engineers.
Scala 12509 Updated: 4 y ago License: Permissive (Apache-2.0)
talent-plan by pingcap
open source training courses about distributed database and distributed systems
Rust 8866 Updated: 1 y ago License: No License (No License)
trino by trinodb
Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
Java 7963 Updated: 1 y ago License: Permissive (Apache-2.0)
alluxio by Alluxio
Alluxio, data orchestration for analytics and machine learning in the cloud
Java 6268 Updated: 1 y ago License: Permissive (Apache-2.0)
delta by delta-io
An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs
Scala 6067 Updated: 1 y ago License: Permissive (Apache-2.0)
school-of-sre by linkedin
At LinkedIn, we are using this curriculum for onboarding our entry-level talents into the SRE role.
HTML 6053 Updated: 1 y ago License: Proprietary (Proprietary)
hazelcast by hazelcast
Open-source distributed computation and storage platform. Real-time Stream Processing Unconference. Save Your Spot https://hazelcast.com/lp/unconference/
Java 5409 Updated: 1 y ago License: Proprietary (Proprietary)
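datax-web by WeiYe-Jing
A visual web interface for DataX: select a data source to generate a data synchronization task in one click. Supports data sources such as RDBMS, Hive, HBase, ClickHouse, and MongoDB, batch creation of RDBMS synchronization tasks, and an integrated open-source scheduling system, with support for distributed execution, incremental data synchronization, real-time viewing of run logs, monitoring of executor resources, killing running processes, and encryption of data source information.
Java 4762 Updated: 1 y ago License: Permissive (MIT)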
pinot by apache
Apache Pinot - A realtime distributed OLAP datastore
Java 4617 Updated: 1 y ago License: Permissive (Apache-2.0)
feast by feast-dev
Feature Store for Machine Learning
Python 4397 Updated: 1 y ago License: Permissive (Apache-2.0)
libsvm by cjlin1
LIBSVM -- A Library for Support Vector Machines
Java 4315 Updated: 2 y ago License: Permissive (BSD-3-Clause)
BigDL by intel-analytics
Fast, distributed, secure AI for Big Data
Jupyter Notebook 4229 Updated: 1 y ago License: Permissive (Apache-2.0)
TensorFlowOnSpark by yahoo
TensorFlowOnSpark brings TensorFlow programs to Apache Spark clusters.
Python 3781 Updated: 3 y ago License: Permissive (Apache-2.0)
crate by crate
CrateDB is a distributed SQL database for storing and analyzing massive amounts of data in real-time. Built on top of Lucene.
Java 3692 Updated: 1 y ago License: Permissive (Apache-2.0)
scalding by twitter
A Scala API for Cascading
Scala 3426 Updated: 2 y ago License: Permissive (Apache-2.0)
gleam by chrislusf
Fast, efficient, and scalable distributed map/reduce system, DAG execution, in memory or on disk, written in pure Go, runs standalone or distributedly.
Go 3219 Updated: 1 y ago License: Permissive (Apache-2.0)
incubator-pinot by apache
Apache Pinot (Incubating) - A realtime distributed OLAP datastore
Java 3139 Updated: 3 y ago License: Permissive (Apache-2.0)
RoaringBitmap by RoaringBitmap
A better compressed bitset in Java
Java 3057 Updated: 2 y ago License: Permissive (Apache-2.0)
oklog by oklog
A distributed and coördination-free log management system
Go 2968 Updated: 2 y ago License: Permissive (Apache-2.0)
smart_open by RaRe-Technologies
Utils for streaming large files (S3, HDFS, gzip, bz2...)
Python 2880 Updated: 1 y ago License: Permissive (MIT)
ibis by ibis-project
The flexibility of Python with the scale and performance of modern SQL.
Python 2789 Updated: 1 y ago License: Permissive (Apache-2.0)
dpark by douban
Python clone of Spark, a MapReduce alike framework in Python
Python 2693 Updated: 2 y ago License: Permissive (BSD-3-Clause)
nutch by apache
Apache Nutch is an extensible and scalable web crawler
Java 2584 Updated: 1 y ago License: Permissive (Apache-2.0)
mrjob by Yelp
Run MapReduce jobs on Hadoop or Amazon Web Services
Python 2546 Updated: 3 y ago License: Proprietary (Proprietary)
Movie_Recommend by LuckyZXL2016
A Spark-based movie recommendation system, comprising a crawler project, a web site, an admin back end, and the Spark recommendation system.
Java 2446 Updated: 1 y ago License: Permissive (MIT)
hdbscan by scikit-learn-contrib
A high performance implementation of HDBSCAN clustering.
Jupyter Notebook 2430 Updated: 1 y ago License: Permissive (BSD-3-Clause)
OpenMetadata by open-metadata
Open Standard for Metadata. A Single place to Discover, Collaborate and Get your data right.
TypeScript 2368 Updated: 1 y ago License: Permissive (Apache-2.0)
summingbird by twitter
Streaming MapReduce with Scalding and Storm
Scala 2102 Updated: 3 y ago License: Permissive (Apache-2.0)
griddb by griddb
GridDB is a next-generation open source database that makes time series IoT and big data fast and easy.
C++ 2093 Updated: 1 y ago License: Strong Copyleft (AGPL-3.0)
node-rdkafka by Blizzard
Node.js bindings for librdkafka
JavaScript 1969 Updated: 1 y ago License: Permissive (MIT)
EasyML by ICT-BDA
Easy Machine Learning is a general-purpose dataflow-based system for easing the process of applying machine learning algorithms to real world tasks.
Java 1958 Updated: 2 y ago License: Permissive (Apache-2.0)
docker-hadoop by big-data-europe
Apache Hadoop docker image
Shell 1940 Updated: 1 y ago License: No License (No License)
ambari by apache
Apache Ambari simplifies provisioning, managing, and monitoring of Apache Hadoop clusters.
Java 1925 Updated: 1 y ago License: Permissive (Apache-2.0)
flinkStreamSQL by DTStack
Extends the real-time SQL of open-source Flink; mainly implements joins between streams and dimension tables, and supports all native Flink SQL syntax.
Java 1921 Updated: 2 y ago License: Permissive (Apache-2.0)
elasticsearch-hadoop by elastic
:elephant: Elasticsearch real-time search and analytics natively integrated with Hadoop
Java 1902 Updated: 2 y ago License: Permissive (Apache-2.0)