SynapseML | Simple and Distributed Machine Learning
kandi X-RAY | SynapseML Summary
kandi X-RAY | SynapseML Summary
Simple and Distributed Machine Learning
Support
Quality
Security
License
Reuse
Top functions reviewed by kandi - BETA
Currently covering the most popular Java, JavaScript and Python libraries. See a Sample of SynapseML
SynapseML Key Features
SynapseML Examples and Code Snippets
Community Discussions
Trending Discussions on SynapseML
QUESTION
Here is a high-level picture of what I am trying to achieve: I want to train a LightGBM model with spark as a compute backend, all in SageMaker using their Training Job api. To clarify:
- I have to use LightGBM in general, there is no option here.
- The reason I need to use spark compute backend is because the training with the current dataset does not fit in memory anymore.
- I want to use SageMaker Training job setting so I could use SM Hyperparameter optimisation job to find the best hyperparameters for LightGBM. While LightGBM spark interface itself does offer some hyperparameter tuning capabilities, it does not offer Bayesian HP tuning.
Now, I know the general approach to running custom training in SM: build a container in a certain way, and then just pull it from ECR and kick-off a training job/hyperparameter tuning job through sagemaker.Estimator
API. Now, in this case SM would handle resource provisioning for you, would create an instance and so on. What I am confused about is that essentially, to use spark compute backend, I would need to have an EMR cluster running, so the SDK would have to handle that as well. However, I do not see how this is possible with the API above.
Now, there is also that thing called Sagemaker Pyspark SDK. However, the provided SageMakerEstimator
API from that package does not support on-the-fly cluster configuration either.
Does anyone know a way how to run a Sagemaker training job that would use an EMR cluster so that later the same job could be used for hyperparameter tuning activities?
One way I see is to run an EMR cluster in the background, and then just create a regular SM estimator job that would connect to the EMR cluster and do the training, essentially running a spark driver program in SM Estimator job.
Has anyone done anything similar in the past?
Thanks
...ANSWER
Answered 2022-Feb-25 at 12:57Thanks for your questions. Here are answers:
SageMaker PySpark SDK https://sagemaker-pyspark.readthedocs.io/en/latest/ does the opposite of what you want: being able to call a non-spark (or spark) SageMaker job from a Spark environment. Not sure that's what you need here.
Running Spark in SageMaker jobs. While you can use SageMaker Notebooks to connect to a remote EMR cluster for interactive coding, you do not need EMR to run Spark in SageMaker jobs (Training and Processing). You have 2 options:
SageMaker Processing has a built-in Spark Container, which is easy to use but unfortunately not connected to SageMaker Model Tuning (that works with Training only). If you use this, you will have to find and use a third-party, external parameter search library ; for example Syne Tune from AWS itself (that supports bayesian optimization)
SageMaker Training can run custom docker-based jobs, on one or multiple machines. If you can fit your Spark code within SageMaker Training spec, then you will be able to use SageMaker Model Tuning to tune your Spark code. However there is no framework container for Spark on SageMaker Training, so you would have to build your own, and I am not aware of any examples. Maybe you could get inspiration from the Processing container code here to build a custom Training container
Your idea of using the Training job as a client to launch an EMR cluster is good and should work (if SM has the right permissions), and will indeed allow you to use SM Model Tuning. I'd recommend:
- each SM job to create a new transient cluster (auto-terminate after step) to keep costs low and avoid tuning results to be polluted by inter-job contention that could arise if running everything on the same cluster.
- use the cheapest possible instance type for the SM estimator, because it will need to stay up during all duration of your EMR experiment to collect and print your final metric (accuracy, duration, cost...)
In the same spirit, I once used SageMaker Training myself to launch Batch Transform jobs for the sole purpose of leveraging the bayesian search API to find an inference configuration that minimizes cost.
QUESTION
I'm new to spark and my understanding is this:
- jars are like a bundle of java code files
- Each library that I install that internally uses spark (or pyspark) has its own jar files that need to be available with both driver and executors in order for them to execute the package API calls that the user interacts with. These jar files are like the backend code for those API calls
Questions:
- Why are these jar files needed. Why could it not have sufficed to have all the code in python? (I guess the answer is that originally Spark is written in scala and there it distributes its dependencies as jars. So to not have to create that codebase mountain again, the python libraries just call that javacode in python interpreter through some converter that converts java code to equivalent python code. Please if I have understood right)
- You specify these jar files locations while creating the spark context via
spark.driver.extraClassPath
andspark.executor.extraClassPath
. These are outdated parameters though I guess. What is the recent way to specify these jar files location? - Where do I find these jars for each library that I install? For example synapseml. What is the general idea about where the jar files for a package are located? Why do not the libraries make it clear where their specific jar files are going to be?
I understand I might not be making sense here and what I have mentioned above is partly just my hunch that that is how it must be happening.
So, can you please help me understand this whole business with jars and how to find and specify them?
...ANSWER
Answered 2021-Dec-09 at 12:15Each library that I install that internally uses spark (or pyspark) has its own jar files
Can you tell which library are you trying to install ?
Yes, external libraries can have jars even if you are writing code in python.
Why ?
These libraries must be using some UDF (User Defined Functions). Spark runs the code in java runtime. If these UDF are written in python, then there will be lot of serialization and deserialization time due to converting data into something readable by python.
Java and Scala UDFs are usually faster that's why some libraries ship with jars.
Why could it not have sufficed to have all the code in python?
Same reason, scala/java UDFs are faster than python UDF.
What is the recent way to specify these jar files location?
You can use spark.jars.packages
property. It will copy to both driver and executor.
Where do I find these jars for each library that I install? For example synapseml. What is the general idea about where the jar files for a package are located?
https://github.com/microsoft/SynapseML#python
They have mentioned here what jars are required i.e. com.microsoft.azure:synapseml_2.12:0.9.4
Community Discussions, Code Snippets contain sources that include Stack Exchange Network
Vulnerabilities
No vulnerabilities reported
Install SynapseML
Support
Reuse Trending Solutions
Find, review, and download reusable Libraries, Code Snippets, Cloud APIs from over 650 million Knowledge Items
Find more librariesStay Updated
Subscribe to our newsletter for trending solutions and developer bootcamps
Share this Page