spark-rapids | Spark RAPIDS plugin - accelerate Apache Spark with GPUs
Community Discussions
Trending Discussions on spark-rapids
QUESTION
I am new to Rapids and I have trouble understanding the supported operations.
I have data in the following format:
...
ANSWER
Answered 2021-May-20 at 17:53
Kenny, may I please know what version of the rapids-4-spark plugin you're using, and the version of Spark?
The initial GPU implementation of COLLECT_LIST() was disabled by default because its behaviour did not match Spark's with respect to null values: the GPU version kept nulls in the aggregated array rows, while Spark removed them. Edit: the behaviour was corrected in the 0.5 release.
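(For illustration, a minimal PySpark sketch of the CPU behaviour described above; the toy column names are placeholders. Spark's collect_list silently drops the null, which is the semantic the early GPU version did not match.)

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Toy data: one group key with a null in the value column (names are made up).
df = spark.createDataFrame([("a", 1), ("a", None), ("a", 2)], ["k", "v"])

# CPU Spark's collect_list drops the null, yielding [1, 2];
# the pre-0.5 GPU version kept it, hence the mismatch noted above.
df.groupBy("k").agg(F.collect_list("v").alias("vals")).show()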
If you have no nulls in your aggregation column (and are using rapids-4-spark 0.4), you might try enabling the operator by setting spark.rapids.sql.expression.CollectList=true.
In general, you can examine why an operator didn't run on the GPU by setting spark.rapids.sql.explain=NOT_ON_GPU. That should print the reason to the console.
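(As a rough sketch, assuming a PySpark session, the two settings might be applied like this; depending on your deployment they can equally be passed as --conf options to spark-submit.)

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Print why an operator fell back to the CPU.
spark.conf.set("spark.rapids.sql.explain", "NOT_ON_GPU")

# Opt in to the GPU COLLECT_LIST implementation
# (only advisable if the aggregation column has no nulls, per the caveat above).
spark.conf.set("spark.rapids.sql.expression.CollectList", "true")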
If you still experience difficulty or incorrect behaviour with the rapids-4-spark plugin, please feel free to raise a bug on the project's GitHub. We'd be happy to investigate further.
QUESTION
Currently I am studying the usage of Apache Spark 3.0 with Rapids GPU Acceleration. In the official spark-rapids docs I came across this page, which states:
There are cases where you may want to get access to the raw data on the GPU, preferably without copying it. One use case for this is exporting the data to an ML framework after doing feature extraction.
To me this sounds as if one could take data that is already on the GPU from some upstream Spark ETL process and make it directly available to a framework such as Tensorflow or PyTorch. If this is the case, how can I access the data from within either of these frameworks? If I am misunderstanding something here, what exactly is the quote referring to?
...
ANSWER
Answered 2021-Jan-27 at 17:32
The link you reference really only gives you access to the data still sitting on the GPU; using that data in another framework, like Tensorflow or PyTorch, is not that simple.
TL;DR: unless you have a library explicitly set up to work with the RAPIDS accelerator, you probably want to run your ETL with RAPIDS, then save the output, and launch a new job to train your models on that data (a sketch of this pattern follows the list below).
There are still a number of issues that you would need to solve. We have worked on these in the case of XGBoost, but it has not been something that we have tried to tackle for Tensorflow or PyTorch yet.
The big issues are:
- Getting the data to the correct process. Even if the data is on the GPU, for security reasons it is tied to a given user process. PyTorch and Tensorflow generally run as Python processes, not in the same JVM that Spark is running in. This means the data has to be sent to the other process. There are several ways to do this, but it is non-trivial to do as a zero-copy operation.
- The format of the data is not what Tensorflow or PyTorch want. The data for RAPIDS is in an Arrow-compatible format. Tensorflow and PyTorch have APIs for importing data in standard formats from the CPU, but it might take a bit of work to get the data into a format the frameworks want and to find an API that lets you pull it in directly from the GPU.
- Sharing GPU resources. Spark only recently added support for scheduling GPUs. Prior to that, people would launch a single Spark task per executor and a single Python process, so the Python process would own the entire GPU when doing training or inference. With the RAPIDS accelerator the GPU is no longer free, and you need a way to share its resources. RMM provides some of this if both libraries are updated to use it and run in the same process, but PyTorch and Tensorflow typically run in separate Python processes, so figuring out how to share the GPU is hard.
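(To make the TL;DR concrete, here is a minimal, hedged sketch of the save-then-train pattern; the paths, column names, and two-jobs-in-one-file layout are purely illustrative.)

# --- Job 1: Spark ETL (GPU-accelerated by the RAPIDS plugin) writes features out ---
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
raw = spark.read.parquet("/data/raw")  # hypothetical input path
raw.selectExpr("label", "f1", "f2").write.mode("overwrite").parquet("/data/features")

# --- Job 2: separate Python training process reads the saved features ---
import pandas as pd
import torch

pdf = pd.read_parquet("/data/features")  # same hypothetical path
X = torch.tensor(pdf[["f1", "f2"]].values, dtype=torch.float32)
y = torch.tensor(pdf["label"].values, dtype=torch.float32)
# ...train a PyTorch model on X and y as usual...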
Community Discussions and Code Snippets contain sources that include the Stack Exchange Network.