spark-sklearn | Scikit-learn integration package for Apache Spark

by databricks | Python | Version: 0.3.0 | License: Apache-2.0

kandi X-RAY | spark-sklearn Summary

spark-sklearn is a Python library typically used in Big Data and Spark applications. spark-sklearn has no bugs, no reported vulnerabilities, a Permissive License, and medium support. However, its build file is not available. You can install it with 'pip install spark-sklearn' or download it from GitHub or PyPI.

(Deprecated) Scikit-learn integration package for Apache Spark

            kandi-support Support

              spark-sklearn has a medium active ecosystem.
              It has 1072 star(s) with 234 fork(s). There are 92 watchers for this library.
              It had no major release in the last 12 months.
              There are 14 open issues and 35 closed issues. On average, issues are closed in 451 days. There is 1 open pull request and 0 closed pull requests.
              It has a neutral sentiment in the developer community.
              The latest version of spark-sklearn is 0.3.0.

            kandi-Quality Quality

              spark-sklearn has 0 bugs and 0 code smells.

            kandi-Security Security

              spark-sklearn has no vulnerabilities reported, and its dependent libraries have no vulnerabilities reported.
              spark-sklearn code analysis shows 0 unresolved vulnerabilities.
              There are 0 security hotspots that need review.

            kandi-License License

              spark-sklearn is licensed under the Apache-2.0 License. This license is Permissive.
              Permissive licenses have the least restrictions, and you can use them in most projects.

            kandi-Reuse Reuse

              spark-sklearn releases are available to install and integrate.
              Deployable package is available in PyPI.
              spark-sklearn has no build file. You will need to create the build yourself to build the component from source.
              spark-sklearn saves you 627 person hours of effort in developing the same functionality from scratch.
              It has 1458 lines of code, 144 functions and 19 files.
              It has medium code complexity. Code complexity directly impacts maintainability of the code.

            Top functions reviewed by kandi - BETA

            kandi has reviewed spark-sklearn and identified the functions below as its top functions. This is intended to give you instant insight into the functionality spark-sklearn implements and to help you decide whether it suits your requirements.
            • Apply gapply to grouped data.
            • Fit the ComputationFinder.
            • Convert a dataset to pandas DataFrames.
            • Fit the ComputationGraph model.
            • Serialize an object.
            • Get the JVM handle.
            • Create a new Java object.
            • Create a Spark session.
            • Call a Java method.
            • Read part of a file.
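The first function listed, gapply, applies a pandas function to each group of a Spark DataFrame. The sketch below is hedged: the signature gapply(grouped_data, func, schema, *cols) and the import path are assumptions based on the project's docstrings and may differ between versions; the per-group logic is kept as a pure helper.

```python
def value_range(values):
    """Per-group logic kept pure so it is easy to test: max minus min."""
    return max(values) - min(values)

def run_gapply_example(spark):
    """Requires a live SparkSession; defined here but not executed."""
    import pandas as pd
    from pyspark.sql.types import DoubleType, StructField, StructType
    from spark_sklearn import gapply  # assumed import path

    df = spark.createDataFrame(
        [("a", 1.0), ("a", 3.0), ("b", 5.0)], ["key", "val"])
    schema = StructType([StructField("range", DoubleType())])

    def per_group(vals):
        # gapply hands the selected column(s) to `per_group` as pandas data
        # and expects back a pandas DataFrame matching `schema`.
        return pd.DataFrame({"range": [float(value_range(list(vals)))]})

    # One output row per group: range of "val" within each "key".
    return gapply(df.groupBy("key"), per_group, schema, "val")
```

This mirrors what pyspark itself later standardized as grouped-map pandas UDFs, which is one reason the library was deprecated.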

            spark-sklearn Key Features

            No Key Features are available at this moment for spark-sklearn.
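No key features are listed here, but the library's headline use case, per its README, is a distributed drop-in for scikit-learn's grid search: the SparkContext is passed as the first argument and the candidate model fits run as Spark tasks. A minimal sketch, assuming pyspark, scikit-learn, and spark-sklearn are installed (not a definitive implementation):

```python
from itertools import product

param_grid = {"kernel": ("linear", "rbf"), "C": [1, 10]}

def n_candidates(grid):
    """Number of parameter combinations the grid search will evaluate."""
    return len(list(product(*grid.values())))

def run_distributed_search():
    """Requires a Spark cluster; defined here but not executed."""
    from pyspark import SparkContext
    from sklearn import datasets, svm
    from spark_sklearn import GridSearchCV  # mirrors sklearn's GridSearchCV

    sc = SparkContext.getOrCreate()
    iris = datasets.load_iris()
    # Each of the n_candidates(param_grid) fits is dispatched as a Spark task.
    clf = GridSearchCV(sc, svm.SVC(gamma="auto"), param_grid)
    clf.fit(iris.data, iris.target)
    return clf.best_params_
```

The design choice is that only cross-validation is distributed; each individual model still trains on a single node, so this fits the "small data, many models" case rather than large training sets.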

            spark-sklearn Examples and Code Snippets

            spark_sklearn GridSearchCV __init__ failed with parameter error
            Python · 5 lines of code · License: Strong Copyleft (CC BY-SA 4.0)
            from pyspark import SparkContext
            from sklearn import ensemble
            from spark_sklearn import GridSearchCV

            # `params` and `param_test2` are defined earlier in the question
            GridSearchCV(
                sc=SparkContext.getOrCreate(),
                estimator=ensemble.GradientBoostingRegressor(**params),
                param_grid=param_test2, n_jobs=1)

            Community Discussions

            QUESTION

            How to convert a sklearn pipeline into a pyspark pipeline?
            Asked 2020-Sep-03 at 17:18

            We have a machine learning classifier model that we have trained with a pandas dataframe and a standard sklearn pipeline (StandardScaler, RandomForestClassifier, GridSearchCV etc). We are working on Databricks and would like to scale up this pipeline to a large dataset using the parallel computation spark offers.

            What is the quickest way to convert our sklearn pipeline into something that computes in parallel? (We can easily switch between pandas and spark DFs as required.)

            For context, our options seem to be:

            1. Rewrite the pipeline using MLLib (time-consuming)
            2. Use a sklearn-spark bridging library

            On option 2, Spark-Sklearn seems to be deprecated, but Databricks instead recommends that we use joblibspark. However, this raises an exception on Databricks:

            ...

            ANSWER

            Answered 2020-Sep-01 at 13:00

            According to the Databricks instructions (here and here), the necessary requirements are:

            • Python 3.6+
            • pyspark>=2.4
            • scikit-learn>=0.21
            • joblib>=0.14

            I cannot reproduce your issue in a community Databricks cluster running Python 3.7.5, Spark 3.0.0, scikit-learn 0.22.1, and joblib 0.14.1.
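The joblibspark workflow the answer verifies can be sketched as follows. This is hedged: it assumes the requirement versions listed above plus the joblibspark package, and the dataset and parameter values are illustrative.

```python
def cv_fit_count(n_candidates, n_folds):
    """GridSearchCV trains one model per (candidate, fold) pair; these
    pairs are the units of work the Spark joblib backend distributes."""
    return n_candidates * n_folds

def run_with_spark_backend():
    """Requires a Spark cluster (e.g. Databricks); not executed here."""
    from joblib import parallel_backend
    from joblibspark import register_spark
    from sklearn import datasets, svm
    from sklearn.model_selection import GridSearchCV

    register_spark()  # registers the "spark" joblib backend
    iris = datasets.load_iris()
    clf = GridSearchCV(svm.SVC(gamma="auto"),
                       {"kernel": ("linear", "rbf"), "C": [1, 10]}, cv=3)
    with parallel_backend("spark", n_jobs=4):
        # 4 candidates * 3 folds = 12 fits, spread across Spark tasks
        clf.fit(iris.data, iris.target)
    return clf.best_params_
```

Note that, unlike spark-sklearn, the plain scikit-learn GridSearchCV is used here; only the joblib backend changes.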

            Source https://stackoverflow.com/questions/63687319

            Community Discussions and Code Snippets include sources from the Stack Exchange Network.

            Vulnerabilities

            No vulnerabilities reported

            Install spark-sklearn

            You can install using 'pip install spark-sklearn' or download it from GitHub, PyPI.
            You can use spark-sklearn like any standard Python library. You will need a development environment with a Python distribution (including header files), a compiler, pip, and git installed. Make sure your pip, setuptools, and wheel are up to date. When using pip, it is generally recommended to install packages in a virtual environment to avoid changes to the system.

            Support

            For any new features, suggestions, or bugs, create an issue on GitHub. If you have questions, check and ask on Stack Overflow.
            Install
          • PyPI

            pip install spark-sklearn

          • CLONE
          • HTTPS

            https://github.com/databricks/spark-sklearn.git

          • CLI

            gh repo clone databricks/spark-sklearn

          • sshUrl

            git@github.com:databricks/spark-sklearn.git
