SynapseML | Simple and Distributed Machine Learning

by microsoft | Language: Scala | Version: v0.11.1-spark3.3 | License: MIT

kandi X-RAY | SynapseML Summary

SynapseML is a Scala library typically used in Big Data and Spark applications. It has no reported bugs or vulnerabilities, a permissive license, and medium support. You can download it from GitHub.

Support

SynapseML has a medium active ecosystem.

It has 4302 star(s) with 771 fork(s). There are 143 watchers for this library.

It had no major release in the last 12 months.

There are 272 open issues and 389 have been closed. On average issues are closed in 30 days. There are 32 open pull requests and 0 closed requests.

It has a neutral sentiment in the developer community.

The latest version of SynapseML is v0.11.1-spark3.3

Quality

SynapseML has 0 bugs and 0 code smells.

Security

SynapseML has no vulnerabilities reported, and its dependent libraries have no vulnerabilities reported.

SynapseML code analysis shows 0 unresolved vulnerabilities.

There are 0 security hotspots that need review.

License

SynapseML is licensed under the MIT License. This license is Permissive.

Permissive licenses have the least restrictions, and you can use them in most projects.

Reuse

SynapseML releases are available to install and integrate.

Installation instructions, examples and code snippets are available.

It has 49797 lines of code, 4419 functions and 548 files.

It has medium code complexity. Code complexity directly impacts maintainability of the code.

Community Discussions

Trending Discussions on SynapseML

How to integrate spark.ml pipeline fitting and hyperparameter optimisation in AWS Sagemaker?

Understanding the jars in pyspark

QUESTION

How to integrate spark.ml pipeline fitting and hyperparameter optimisation in AWS Sagemaker?

Asked 2022-Feb-25 at 12:57

Here is a high-level picture of what I am trying to achieve: I want to train a LightGBM model with Spark as the compute backend, all in SageMaker, using its Training Job API. To clarify:

- I have to use LightGBM; there is no alternative here.
- I need the Spark compute backend because training on the current dataset no longer fits in memory.
- I want to use the SageMaker Training job setting so that I can use a SageMaker Hyperparameter Optimisation job to find the best hyperparameters for LightGBM. While the LightGBM Spark interface itself offers some hyperparameter tuning capabilities, it does not offer Bayesian HP tuning.

I know the general approach to running custom training in SM: build a container in a certain way, then pull it from ECR and kick off a training job or hyperparameter tuning job through the sagemaker.Estimator API. In this case SM handles resource provisioning for you, creates an instance, and so on. What confuses me is that, to use the Spark compute backend, I would need to have an EMR cluster running, so the SDK would have to handle that as well. However, I do not see how this is possible with the API above.

There is also the SageMaker PySpark SDK. However, the SageMakerEstimator API provided by that package does not support on-the-fly cluster configuration either.

Does anyone know a way to run a SageMaker training job that uses an EMR cluster, so that the same job could later be used for hyperparameter tuning?

One way I see is to run an EMR cluster in the background and then create a regular SM estimator job that connects to the EMR cluster and does the training, essentially running a Spark driver program in the SM Estimator job.

Has anyone done anything similar in the past?

Thanks

...

ANSWER

Answered 2022-Feb-25 at 12:57

Thanks for your questions. Here are answers:

SageMaker PySpark SDK (https://sagemaker-pyspark.readthedocs.io/en/latest/) does the opposite of what you want: it lets you call a non-Spark (or Spark) SageMaker job from a Spark environment. I'm not sure that's what you need here.
Running Spark in SageMaker jobs. While you can use SageMaker Notebooks to connect to a remote EMR cluster for interactive coding, you do not need EMR to run Spark in SageMaker jobs (Training and Processing). You have 2 options:
- SageMaker Processing has a built-in Spark container, which is easy to use but unfortunately not connected to SageMaker Model Tuning (which works with Training only). If you use this, you will have to find and use a third-party, external parameter search library, for example Syne Tune from AWS itself (which supports Bayesian optimization).
- SageMaker Training can run custom Docker-based jobs on one or multiple machines. If you can fit your Spark code within the SageMaker Training spec, then you will be able to use SageMaker Model Tuning to tune your Spark code. However, there is no framework container for Spark on SageMaker Training, so you would have to build your own, and I am not aware of any examples. Maybe you could get inspiration from the Processing container code here to build a custom Training container.
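
For the custom Training container option, the entry point has to cooperate with SageMaker Model Tuning in two ways: it reads the hyperparameters chosen for the current job, and it prints the objective metric to the logs so that a regex in the tuning job's metric definition can pick it up. The following is a minimal sketch of that contract, not a complete container. The file path is the one SageMaker Training uses for hyperparameters; the hyperparameter names (`num_leaves`, `learning_rate`), the metric name `validation_auc`, and the log format are assumptions chosen for illustration.

```python
import json
import re
from pathlib import Path

# SageMaker Training writes the job's hyperparameters to this file in the container.
HYPERPARAMETERS_PATH = Path("/opt/ml/input/config/hyperparameters.json")

# Hypothetical log format; the tuning job's metric definition would need a matching regex.
METRIC_REGEX = r"validation_auc=([0-9\.]+)"


def parse_hyperparameters(raw):
    """Convert raw hyperparameters to typed values (SageMaker passes them as strings)."""
    return {
        # Hypothetical LightGBM-style names and defaults, for illustration only.
        "num_leaves": int(raw.get("num_leaves", "31")),
        "learning_rate": float(raw.get("learning_rate", "0.1")),
    }


def load_hyperparameters(path=HYPERPARAMETERS_PATH):
    """Read and parse the hyperparameters file provided by SageMaker Training."""
    return parse_hyperparameters(json.loads(Path(path).read_text()))


def format_metric(value):
    """Build the log line that the tuner's regex is expected to match."""
    return f"validation_auc={value:.6f}"


if __name__ == "__main__":
    params = load_hyperparameters()
    # ... run the Spark training with `params` and compute the real metric here ...
    print(format_metric(0.0))  # placeholder value, not a real result
```

Whatever launches the Spark work (a local Spark session in the container or a remote cluster), only the final printed line matters to the tuner.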

Your idea of using the Training job as a client to launch an EMR cluster is good and should work (if SM has the right permissions), and will indeed allow you to use SM Model Tuning. I'd recommend:

- having each SM job create a new transient cluster (auto-terminating after the step) to keep costs low and to avoid tuning results being polluted by inter-job contention, which could arise if everything ran on the same cluster.
- using the cheapest possible instance type for the SM estimator, because it needs to stay up for the whole duration of your EMR experiment to collect and print your final metric (accuracy, duration, cost...).
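
The transient-cluster recommendation above can be sketched with boto3's EMR client. This is an illustrative outline under stated assumptions, not a tested deployment: the release label `emr-6.9.0`, the instance type and count, the role names `EMR_EC2_DefaultRole` and `EMR_DefaultRole`, and the way hyperparameters are passed as `--name=value` script arguments are all assumptions to adapt to your account and training script.

```python
def build_transient_cluster_request(job_name, script_s3_uri, hyperparameters,
                                    instance_type="m5.xlarge", instance_count=3,
                                    release_label="emr-6.9.0"):
    """Build run_job_flow arguments for a cluster that shuts down after one step."""
    # Hypothetical convention: forward each hyperparameter as --name=value.
    script_args = [f"--{name}={value}" for name, value in sorted(hyperparameters.items())]
    return {
        "Name": job_name,
        "ReleaseLabel": release_label,          # assumed EMR release
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            "MasterInstanceType": instance_type,
            "SlaveInstanceType": instance_type,
            "InstanceCount": instance_count,
            # False makes the cluster terminate once its steps are done.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": [{
            "Name": "train",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "--deploy-mode", "cluster",
                         script_s3_uri, *script_args],
            },
        }],
        "JobFlowRole": "EMR_EC2_DefaultRole",   # assumed default role names
        "ServiceRole": "EMR_DefaultRole",
    }


def run_and_wait(request, region_name):
    """Launch the cluster and block until it has terminated."""
    import boto3  # imported here so the request builder works without AWS access

    emr = boto3.client("emr", region_name=region_name)
    cluster_id = emr.run_job_flow(**request)["JobFlowId"]
    # The default waiter limits may be too short for long training runs.
    emr.get_waiter("cluster_terminated").wait(
        ClusterId=cluster_id,
        WaiterConfig={"Delay": 60, "MaxAttempts": 240},
    )
    return cluster_id
```

The SageMaker Training job would call `run_and_wait`, then read the metric that the Spark step wrote (for example to S3) and print it so that Model Tuning can collect it.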

In the same spirit, I once used SageMaker Training myself to launch Batch Transform jobs for the sole purpose of leveraging the bayesian search API to find an inference configuration that minimizes cost.

Source https://stackoverflow.com/questions/70835006

QUESTION

Understanding the jars in pyspark

Asked 2022-Jan-18 at 07:45

I'm new to Spark and my understanding is this:

Jars are like a bundle of Java code files.
Each library that I install that internally uses Spark (or PySpark) has its own jar files, which need to be available to both the driver and the executors so that they can execute the package API calls that the user interacts with. These jar files are like the backend code for those API calls.

Questions:

Why are these jar files needed? Why could it not have sufficed to have all the code in Python? (I guess the answer is that Spark was originally written in Scala, where it distributes its dependencies as jars. So, to avoid recreating that mountain of code, the Python libraries just call that Java code from the Python interpreter through some converter that translates Java code to equivalent Python code. Please tell me if I have understood this correctly.)
You specify these jar file locations while creating the Spark context via spark.driver.extraClassPath and spark.executor.extraClassPath. I guess these are outdated parameters, though. What is the current way to specify the location of these jar files?
Where do I find these jars for each library that I install? For example, synapseml. What is the general idea about where the jar files for a package are located? Why don't the libraries make it clear where their specific jar files are going to be?

I understand I might not be making sense here, and what I have mentioned above is partly just my hunch about how it must be happening.

So, can you please help me understand this whole business with jars and how to find and specify them?

...

ANSWER

Answered 2021-Dec-09 at 12:15

Each library that I install that internally uses spark (or pyspark) has its own jar files

Can you tell me which library you are trying to install?

Yes, external libraries can have jars even if you are writing code in Python.

Why?

These libraries must be using some UDFs (User Defined Functions). Spark runs the code in the Java runtime. If these UDFs are written in Python, then a lot of time is spent on serialization and deserialization, because the data has to be converted into something readable by Python.

Java and Scala UDFs are usually faster, which is why some libraries ship with jars.

Why could it not have sufficed to have all the code in python?

Same reason: Scala/Java UDFs are faster than Python UDFs.

What is the recent way to specify these jar files location?

You can use the spark.jars.packages property. The jars are made available to both the driver and the executors.

Where do I find these jars for each library that I install? For example synapseml. What is the general idea about where the jar files for a package are located?

https://github.com/microsoft/SynapseML#python

That page states which jars are required, i.e. com.microsoft.azure:synapseml_2.12:0.9.4
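
As a rough illustration of the spark.jars.packages approach, the sketch below builds the configuration and applies it when creating a session. The Maven coordinate is the one quoted above; the helper names `spark_jar_conf` and `create_session` are hypothetical, and the optional `spark.jars.repositories` entry is only needed if a package is hosted outside the default repositories.

```python
SYNAPSEML_COORDINATE = "com.microsoft.azure:synapseml_2.12:0.9.4"


def spark_jar_conf(packages, repositories=()):
    """Build Spark properties that pull jars by Maven coordinate."""
    # Multiple coordinates are passed as one comma-separated string.
    conf = {"spark.jars.packages": ",".join(packages)}
    if repositories:
        conf["spark.jars.repositories"] = ",".join(repositories)
    return conf


def create_session(app_name, conf):
    """Create a SparkSession with the given properties applied."""
    from pyspark.sql import SparkSession  # imported here so the helper above works without pyspark

    builder = SparkSession.builder.appName(app_name)
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


if __name__ == "__main__":
    # Must run before any Spark session exists in this process,
    # because the packages are resolved when the JVM starts.
    spark = create_session("MyApp", spark_jar_conf([SYNAPSEML_COORDINATE]))
```

The command-line equivalent is passing the same coordinate to spark-submit with the `--packages` option.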

Source https://stackoverflow.com/questions/70288193

Community Discussions, Code Snippets contain sources that include Stack Exchange Network