spark-rapids | Spark RAPIDS plugin - accelerate Apache Spark with GPUs
Community Discussions
Trending Discussions on spark-rapids
QUESTION
I am new to Rapids and I have trouble understanding the supported operations.
I have data in the following format:
...
ANSWER
Answered 2021-May-20 at 17:53
Kenny, may I please know what version of the rapids-4-spark plugin you're using, and the version of Spark?
The initial GPU implementation of COLLECT_LIST() was disabled by default because its behaviour did not match Spark's with respect to null values: the GPU version kept nulls in the aggregated array rows, while Spark removed them. Edit: the behaviour was corrected in the 0.5 release.
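(For illustration, a minimal PySpark sketch of the CPU behaviour described above; the toy column names are placeholders. Spark's collect_list silently drops the null, which is the semantic the early GPU version did not match.)

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Toy data: one group key with a null in the value column (names are made up).
df = spark.createDataFrame([("a", 1), ("a", None), ("a", 2)], ["k", "v"])

# CPU Spark's collect_list drops the null, yielding [1, 2];
# the pre-0.5 GPU version kept it, hence the mismatch noted above.
df.groupBy("k").agg(F.collect_list("v").alias("vals")).show()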
If you have no nulls in your aggregation column (and are using rapids-4-spark 0.4), you might try enabling the operator by setting spark.rapids.sql.expression.CollectList=true.
In general, you can examine why an operator didn't run on the GPU by setting spark.rapids.sql.explain=NOT_ON_GPU. That should print the reason to the console.
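(As a rough sketch, assuming a PySpark session, the two settings might be applied like this; depending on your deployment they can equally be passed as --conf options to spark-submit.)

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Print why an operator fell back to the CPU.
spark.conf.set("spark.rapids.sql.explain", "NOT_ON_GPU")

# Opt in to the GPU COLLECT_LIST implementation
# (only advisable if the aggregation column has no nulls, per the caveat above).
spark.conf.set("spark.rapids.sql.expression.CollectList", "true")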
If you still experience difficulty or incorrect behaviour with the rapids-4-spark plugin, please feel free to raise a bug on the project's GitHub. We'd be happy to investigate further.
QUESTION
Currently I am studying the usage of Apache Spark 3.0 with Rapids GPU Acceleration. In the official spark-rapids docs I came across this page, which states:
There are cases where you may want to get access to the raw data on the GPU, preferably without copying it. One use case for this is exporting the data to an ML framework after doing feature extraction.
To me this sounds as if one could take data that is already on the GPU from some upstream Spark ETL process and make it directly available to a framework such as Tensorflow or PyTorch. If this is the case, how can I access the data from within either of these frameworks? If I am misunderstanding something here, what exactly is the quote referring to?
...
ANSWER
Answered 2021-Jan-27 at 17:32
The link you reference really only gives you access to the data still sitting on the GPU; using that data in another framework, like Tensorflow or PyTorch, is not that simple.
TL;DR: unless you have a library explicitly set up to work with the RAPIDS accelerator, you probably want to run your ETL with RAPIDS, then save the output, and launch a new job to train your models on that data (a sketch of this pattern follows the list below).
There are still a number of issues that you would need to solve. We have worked on these in the case of XGBoost, but it has not been something that we have tried to tackle for Tensorflow or PyTorch yet.
The big issues are:
- Getting the data to the correct process. Even if the data is on the GPU, for security reasons it is tied to a given user process. PyTorch and Tensorflow generally run as Python processes, not in the same JVM that Spark is running in. This means the data has to be sent to the other process. There are several ways to do this, but it is non-trivial to do as a zero-copy operation.
- The format of the data is not what Tensorflow or PyTorch want. The data for RAPIDS is in an Arrow-compatible format. Tensorflow and PyTorch have APIs for importing data in standard formats from the CPU, but it might take a bit of work to get the data into a format the frameworks want and to find an API that lets you pull it in directly from the GPU.
- Sharing GPU resources. Spark only recently added support for scheduling GPUs. Prior to that, people would launch a single Spark task per executor and a single Python process, so the Python process would own the entire GPU when doing training or inference. With the RAPIDS accelerator the GPU is no longer free, and you need a way to share its resources. RMM provides some of this if both libraries are updated to use it and run in the same process, but PyTorch and Tensorflow typically run in separate Python processes, so figuring out how to share the GPU is hard.
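(To make the TL;DR concrete, here is a minimal, hedged sketch of the save-then-train pattern; the paths, column names, and two-jobs-in-one-file layout are purely illustrative.)

# --- Job 1: Spark ETL (GPU-accelerated by the RAPIDS plugin) writes features out ---
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
raw = spark.read.parquet("/data/raw")  # hypothetical input path
raw.selectExpr("label", "f1", "f2").write.mode("overwrite").parquet("/data/features")

# --- Job 2: separate Python training process reads the saved features ---
import pandas as pd
import torch

pdf = pd.read_parquet("/data/features")  # same hypothetical path
X = torch.tensor(pdf[["f1", "f2"]].values, dtype=torch.float32)
y = torch.tensor(pdf["label"].values, dtype=torch.float32)
# ...train a PyTorch model on X and y as usual...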
Community Discussions and Code Snippets contain sources that include the Stack Exchange Network.