spark-sklearn | Scikit-learn integration package for Apache Spark

by databricks | Python | Version: 0.3.0 | License: Apache-2.0

kandi X-RAY | spark-sklearn Summary

spark-sklearn is a Python library typically used in Big Data and Spark applications. spark-sklearn has no bugs, no reported vulnerabilities, a Permissive License, and medium support. However, its build file is not available. You can install it with 'pip install spark-sklearn' or download it from GitHub or PyPI.

(Deprecated) Scikit-learn integration package for Apache Spark

            kandi-support Support

              spark-sklearn has a medium active ecosystem.
              It has 1072 star(s) with 234 fork(s). There are 92 watchers for this library.
              It had no major release in the last 12 months.
              There are 14 open issues and 35 closed issues. On average, issues are closed in 451 days. There is 1 open pull request and 0 closed pull requests.
              It has a neutral sentiment in the developer community.
              The latest version of spark-sklearn is 0.3.0.

            kandi-Quality Quality

              spark-sklearn has 0 bugs and 0 code smells.

            kandi-Security Security

              spark-sklearn has no vulnerabilities reported, and its dependent libraries have no vulnerabilities reported.
              spark-sklearn code analysis shows 0 unresolved vulnerabilities.
              There are 0 security hotspots that need review.

            kandi-License License

              spark-sklearn is licensed under the Apache-2.0 License. This license is Permissive.
              Permissive licenses have the least restrictions, and you can use them in most projects.

            kandi-Reuse Reuse

              spark-sklearn releases are available to install and integrate.
              Deployable package is available in PyPI.
              spark-sklearn has no build file. You will need to create the build yourself to build the component from source.
              spark-sklearn saves you 627 person hours of effort in developing the same functionality from scratch.
              It has 1458 lines of code, 144 functions and 19 files.
              It has medium code complexity. Code complexity directly impacts maintainability of the code.

            Top functions reviewed by kandi - BETA

            kandi has reviewed spark-sklearn and identified the functions below as its top functions. This is intended to give you instant insight into the functionality spark-sklearn implements and to help you decide whether it suits your requirements.
            • Apply gapply to grouped data.
            • Fit the ComputationFinder.
            • Convert a dataset to pandas DataFrames.
            • Fit the ComputationGraph model.
            • Serialize an object.
            • Get the JVM handle.
            • Create a new Java object.
            • Create a Spark session.
            • Call a Java method.
            • Read part of a file.
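The first function listed, gapply, applies a pandas function to each group of a Spark DataFrame. The sketch below is hedged: the signature gapply(grouped_data, func, schema, *cols) and the import path are assumptions based on the project's docstrings and may differ between versions; the per-group logic is kept as a pure helper.

```python
def value_range(values):
    """Per-group logic kept pure so it is easy to test: max minus min."""
    return max(values) - min(values)

def run_gapply_example(spark):
    """Requires a live SparkSession; defined here but not executed."""
    import pandas as pd
    from pyspark.sql.types import DoubleType, StructField, StructType
    from spark_sklearn import gapply  # assumed import path

    df = spark.createDataFrame(
        [("a", 1.0), ("a", 3.0), ("b", 5.0)], ["key", "val"])
    schema = StructType([StructField("range", DoubleType())])

    def per_group(vals):
        # gapply hands the selected column(s) to `per_group` as pandas data
        # and expects back a pandas DataFrame matching `schema`.
        return pd.DataFrame({"range": [float(value_range(list(vals)))]})

    # One output row per group: range of "val" within each "key".
    return gapply(df.groupBy("key"), per_group, schema, "val")
```

This mirrors what pyspark itself later standardized as grouped-map pandas UDFs, which is one reason the library was deprecated.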

            spark-sklearn Key Features

            No Key Features are available at this moment for spark-sklearn.
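No key features are listed here, but the library's headline use case, per its README, is a distributed drop-in for scikit-learn's grid search: the SparkContext is passed as the first argument and the candidate model fits run as Spark tasks. A minimal sketch, assuming pyspark, scikit-learn, and spark-sklearn are installed (not a definitive implementation):

```python
from itertools import product

param_grid = {"kernel": ("linear", "rbf"), "C": [1, 10]}

def n_candidates(grid):
    """Number of parameter combinations the grid search will evaluate."""
    return len(list(product(*grid.values())))

def run_distributed_search():
    """Requires a Spark cluster; defined here but not executed."""
    from pyspark import SparkContext
    from sklearn import datasets, svm
    from spark_sklearn import GridSearchCV  # mirrors sklearn's GridSearchCV

    sc = SparkContext.getOrCreate()
    iris = datasets.load_iris()
    # Each of the n_candidates(param_grid) fits is dispatched as a Spark task.
    clf = GridSearchCV(sc, svm.SVC(gamma="auto"), param_grid)
    clf.fit(iris.data, iris.target)
    return clf.best_params_
```

The design choice is that only cross-validation is distributed; each individual model still trains on a single node, so this fits the "small data, many models" case rather than large training sets.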

            spark-sklearn Examples and Code Snippets

            spark_sklearn GridSearchCV __init__ failed with parameter error
            Python · 5 lines of code · License: Strong Copyleft (CC BY-SA 4.0)
            from pyspark import SparkContext
            from sklearn import ensemble
            from spark_sklearn import GridSearchCV

            # `params` and `param_test2` are defined earlier in the question
            GridSearchCV(
                sc=SparkContext.getOrCreate(),
                estimator=ensemble.GradientBoostingRegressor(**params),
                param_grid=param_test2, n_jobs=1)

            Community Discussions

            QUESTION

            How to convert a sklearn pipeline into a pyspark pipeline?
            Asked 2020-Sep-03 at 17:18

            We have a machine learning classifier model that we have trained with a pandas dataframe and a standard sklearn pipeline (StandardScaler, RandomForestClassifier, GridSearchCV etc). We are working on Databricks and would like to scale up this pipeline to a large dataset using the parallel computation spark offers.

            What is the quickest way to convert our sklearn pipeline into something that computes in parallel? (We can easily switch between pandas and spark DFs as required.)

            For context, our options seem to be:

            1. Rewrite the pipeline using MLLib (time-consuming)
            2. Use a sklearn-spark bridging library

            On option 2, Spark-Sklearn seems to be deprecated, but Databricks instead recommends that we use joblibspark. However, this raises an exception on Databricks:

            ...

            ANSWER

            Answered 2020-Sep-01 at 13:00

            According to the Databricks instructions (here and here), the necessary requirements are:

            • Python 3.6+
            • pyspark>=2.4
            • scikit-learn>=0.21
            • joblib>=0.14

            I cannot reproduce your issue in a community Databricks cluster running Python 3.7.5, Spark 3.0.0, scikit-learn 0.22.1, and joblib 0.14.1.
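The joblibspark workflow the answer verifies can be sketched as follows. This is hedged: it assumes the requirement versions listed above plus the joblibspark package, and the dataset and parameter values are illustrative.

```python
def cv_fit_count(n_candidates, n_folds):
    """GridSearchCV trains one model per (candidate, fold) pair; these
    pairs are the units of work the Spark joblib backend distributes."""
    return n_candidates * n_folds

def run_with_spark_backend():
    """Requires a Spark cluster (e.g. Databricks); not executed here."""
    from joblib import parallel_backend
    from joblibspark import register_spark
    from sklearn import datasets, svm
    from sklearn.model_selection import GridSearchCV

    register_spark()  # registers the "spark" joblib backend
    iris = datasets.load_iris()
    clf = GridSearchCV(svm.SVC(gamma="auto"),
                       {"kernel": ("linear", "rbf"), "C": [1, 10]}, cv=3)
    with parallel_backend("spark", n_jobs=4):
        # 4 candidates * 3 folds = 12 fits, spread across Spark tasks
        clf.fit(iris.data, iris.target)
    return clf.best_params_
```

Note that, unlike spark-sklearn, the plain scikit-learn GridSearchCV is used here; only the joblib backend changes.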

            Source https://stackoverflow.com/questions/63687319

            Community Discussions and Code Snippets include sources from the Stack Exchange Network.

            Vulnerabilities

            No vulnerabilities reported

            Install spark-sklearn

            You can install using 'pip install spark-sklearn' or download it from GitHub, PyPI.
            You can use spark-sklearn like any standard Python library. You will need a development environment with a Python distribution (including header files), a compiler, pip, and git installed. Make sure your pip, setuptools, and wheel are up to date. When using pip, it is generally recommended to install packages in a virtual environment to avoid changes to the system.

            Support

            For any new features, suggestions, or bugs, create an issue on GitHub. If you have questions, check and ask on Stack Overflow.
            Install
          • PyPI

            pip install spark-sklearn

          • CLONE
          • HTTPS

            https://github.com/databricks/spark-sklearn.git

          • CLI

            gh repo clone databricks/spark-sklearn

          • sshUrl

            git@github.com:databricks/spark-sklearn.git
