elephas by maxpumperla
Distributed Deep learning with Keras & Spark
Python 1560 Updated: 2 y ago License: Permissive (MIT)
dlink by DataLinkDC
Dinky is an out-of-the-box, one-stop real-time computing platform dedicated to the construction and practice of Unified Streaming & Batch and Unified Data Lake & Data Warehouse. Based on Apache Flink, Dinky can connect to many big data frameworks, including OLAP and Data Lake systems.
Java 1550 Updated: 2 y ago License: Permissive (Apache-2.0)
mongo-hadoop by mongodb
MongoDB Connector for Hadoop
Java 1521 Updated: 2 y ago License: No License (No License)
spark-py-notebooks by jadianes
Apache Spark & Python (pySpark) tutorials for Big Data Analysis and Machine Learning as IPython / Jupyter notebooks
Jupyter Notebook 1521 Updated: 2 y ago License: Proprietary (Proprietary)
almond by almond-sh
A Scala kernel for Jupyter
Scala 1516 Updated: 1 y ago License: Permissive (BSD-3-Clause)
memgraph by memgraph
Open-source graph database, built for real-time streaming data, compatible with Neo4j.
C++ 1494 Updated: 1 y ago License: Proprietary (Proprietary)
aas by sryza
Code to accompany Advanced Analytics with Spark from O'Reilly Media
Scala 1485 Updated: 2 y ago License: Proprietary (Proprietary)
mleap by combust
MLeap: Deploy ML Pipelines to Production
Scala 1449 Updated: 2 y ago License: Permissive (Apache-2.0)
spark by mesos
Lightning-fast cluster computing in Java, Scala and Python.
Scala 1423 Updated: 2 y ago License: No License (No License)
spark-testing-base by holdenk
Base classes to use when writing tests with Spark
Scala 1414 Updated: 2 y ago License: Permissive (Apache-2.0)
optimus by hi-primus
:truck: Agile Data Preparation Workflows made easy with Pandas, Dask, cuDF, Dask-cuDF, Vaex and PySpark
Python 1383 Updated: 2 y ago License: Permissive (Apache-2.0)
carbondata by apache
High performance data store solution
Scala 1359 Updated: 2 y ago License: Permissive (Apache-2.0)
Apache Parquet
HiBench by Intel-bigdata
HiBench is a big data benchmark suite.
Java 1351 Updated: 2 y ago License: Proprietary (Proprietary)
incubator-kyuubi by apache
Apache Kyuubi is a distributed and multi-tenant gateway to provide serverless SQL on data warehouses and lakehouses.
Scala 1343 Updated: 2 y ago License: Permissive (Apache-2.0)
dr-elephant by linkedin
Dr. Elephant is a job and flow-level performance monitoring and tuning tool for Apache Hadoop and Apache Spark
Java 1302 Updated: 2 y ago License: Permissive (Apache-2.0)
incubator-sedona by apache
A cluster computing framework for processing large-scale geospatial data
Java 1302 Updated: 2 y ago License: Permissive (Apache-2.0)
geomesa by locationtech
GeoMesa is a suite of tools for working with big geo-spatial data in a distributed fashion.
Scala 1302 Updated: 2 y ago License: Permissive (Apache-2.0)
CaffeOnSpark by yahoo
Distributed deep learning on Hadoop and Spark clusters.
Jupyter Notebook 1265 Updated: 2 y ago License: Permissive (Apache-2.0)
AthenaX by uber-archive
SQL-based streaming analytics platform at scale
Java 1219 Updated: 2 y ago License: Permissive (Apache-2.0)
sparkmagic by jupyter-incubator
Jupyter magics and kernels for working with remote Spark clusters
Python 1213 Updated: 2 y ago License: Proprietary (Proprietary)
pyspark-example-project by AlexIoannides
Example project implementing best practices for PySpark ETL jobs and applications.
Python 1195 Updated: 1 y ago License: No License (No License)
prml by gerdm
Repository of notes, code and notebooks in Python for the book Pattern Recognition and Machine Learning by Christopher Bishop
Jupyter Notebook 1193 Updated: 2 y ago License: Strong Copyleft (AGPL-3.0)
dremio-oss by dremio
Dremio - the missing link in modern data
Java 1190 Updated: 2 y ago License: Permissive (Apache-2.0)
killrweather by killrweather
KillrWeather is a reference application (work in progress) showing how to easily integrate streaming and batch data processing with Apache Spark Streaming, Apache Cassandra, Apache Kafka and Akka for fast, streaming computations on time series data in asynchronous event-driven environments.
Scala 1185 Updated: 2 y ago License: Permissive (Apache-2.0)
open-data-registry by awslabs
A registry of publicly available datasets on AWS
Python 1184 Updated: 2 y ago License: Permissive (Apache-2.0)
spark-doc-zh by apachecn
Apache Spark official documentation, Chinese edition
JavaScript 1184 Updated: 2 y ago License: Proprietary (Proprietary)
spark-timeseries by sryza
A library for time series analysis on Apache Spark
Scala 1175 Updated: 2 y ago License: Permissive (Apache-2.0)
AthenaX by uber
SQL-based streaming analytics platform at scale
Java 1147 Updated: 4 y ago License: Permissive (Apache-2.0)
Dockerfiles by HariSekhon
50+ DockerHub public images for Docker & Kubernetes - DevOps, CI/CD, GitHub Actions, CircleCI, Jenkins, TeamCity, Alpine, CentOS, Debian, Fedora, Ubuntu, Hadoop, Kafka, ZooKeeper, HBase, Cassandra, Solr, SolrCloud, Presto, Apache Drill, Nifi, Spark, Consul, Riak
Shell 1147 Updated: 1 y ago License: Permissive (MIT)
datacollector by streamsets
StreamSets Data Collector - Continuous big data and cloud platform ingest infrastructure
Java 1145 Updated: 4 y ago License: Permissive (Apache-2.0)
sparkit-learn by lensacom
PySpark + Scikit-learn = Sparkit-learn
Python 1135 Updated: 2 y ago License: Permissive (Apache-2.0)
Nagios-Plugins by HariSekhon
450+ AWS, Hadoop, Cloud, Kafka, Docker, Elasticsearch, RabbitMQ, Redis, HBase, Solr, Cassandra, ZooKeeper, HDFS, Yarn, Hive, Presto, Drill, Impala, Consul, Spark, Jenkins, Travis CI, Git, MySQL, Linux, DNS, Whois, SSL Certs, Yum Security Updates, Kubernetes, Cloudera etc...
Python 1101 Updated: 1 y ago License: Proprietary (Proprietary)
utils4s by jacksu
A collection of test cases and related materials gathered while using Scala and Spark
Scala 1083 Updated: 2 y ago License: No License (No License)
spark-sklearn by databricks
(Deprecated) Scikit-learn integration package for Apache Spark
Python 1072 Updated: 2 y ago License: Permissive (Apache-2.0)
hazelcast-jet by hazelcast
Distributed Stream and Batch Processing
Java 1054 Updated: 2 y ago License: Proprietary (Proprietary)
goodreads_etl_pipeline by san089
An end-to-end GoodReads Data Pipeline for Building Data Lake, Data Warehouse and Analytics Platform.
Python 1048 Updated: 2 y ago License: Permissive (MIT)
kylo by Teradata
Kylo is a data lake management software platform and framework for enabling scalable enterprise-class data lakes on big data technologies such as Teradata, Apache Spark and/or Hadoop. Kylo is licensed under Apache 2.0. Contributed by Teradata Inc.
Java 1041 Updated: 2 y ago License: Permissive (Apache-2.0)
Udacity-Data-Engineering-Projects by san089
A few projects related to Data Engineering, including Data Modeling, Infrastructure setup on cloud, Data Warehousing and Data Lake development.
Python 1038 Updated: 2 y ago License: Proprietary (Proprietary)
snappydata by TIBCOSoftware
Project SnappyData - memory optimized analytics database, based on Apache Spark™ and Apache Geode™. Stream, Transact, Analyze, Predict in one cluster
Scala 1033 Updated: 2 y ago License: Proprietary (Proprietary)
streamx by streamxhub
Make stream processing easier! A Flink & Spark development scaffold. The original intention of StreamX is to make the development of Flink easier. StreamX focuses on the management of development phases and tasks. Our ultimate goal is to build a one-stop big data solution integrating stream processing, batch processing, data warehouse and data lake.
Java 1031 Updated: 3 y ago License: Permissive (Apache-2.0)
pyspark-tutorial by mahmoudparsian
PySpark-Tutorial provides basic algorithms using PySpark
Jupyter Notebook 1009 Updated: 2 y ago License: Proprietary (Proprietary)
data-algorithms-book by mahmoudparsian
MapReduce, Spark, Java, and Scala for Data Algorithms Book
Java 996 Updated: 3 y ago License: Proprietary (Proprietary)
shark by amplab
Development of Shark has ended.
Scala 993 Updated: 4 y ago License: Permissive (Apache-2.0)
livy by cloudera
Livy is an open source REST interface for interacting with Apache Spark from anywhere
Scala 990 Updated: 2 y ago License: No License (No License)
RecommenderSystems by DeepGraphLearning
Python 989 Updated: 2 y ago License: Permissive (MIT)
Apache Accumulo
Apache Impala
spark-scala-tutorial by deanwampler
A free tutorial for Apache Spark.
Jupyter Notebook 966 Updated: 2 y ago License: Proprietary (Proprietary)
bosen by sailing-pmls
Parallel ML System - Bosen project
C++ 958 Updated: 4 y ago License: Permissive (BSD-3-Clause)