```
"Load the CIFAR dataset."
X_train, y_train, _, _ = load_data('cifar') # load/download from openml.org
X_train = X_train / 255  # normalize pixel values to [0, 1]
# Plot, to check it's the right data.
# (This cell's code is from: https://www.tensorflow.org/tutorials/images)
```

```
import cupy
import numpy
import smallpebble as sp
# Switch to CuPy
sp.use(cupy)
print(sp.array_library.library.__name__) # should be 'cupy'
# Switch back to NumPy:
sp.use(numpy)
print(sp.array_library.library.__name__) # should be 'numpy'
```

QUESTION

Suppose I need to define functions that return the NumPy version of a function when the input is a NumPy array, and the CuPy version when the input is a CuPy array.

...ANSWER

Answered 2022-Apr-11 at 05:54

To insert 3 functions into the current module with a loop:
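The answer's own code is not reproduced here, so the following is only a minimal sketch of the idea, not the answerer's implementation. It assumes that `cupy.get_array_module` is used to pick the library, and the helper names (`array_module`, `make_dispatcher`) and the choice of `sin`, `cos`, `exp` are hypothetical. CuPy is treated as optional so the sketch also runs on a CPU-only machine.

```python
import numpy as np

try:
    import cupy as cp  # optional; only needed for GPU arrays
except Exception:  # CuPy missing or CUDA not usable
    cp = None


def array_module(x):
    """Return the cupy module for CuPy arrays, numpy otherwise."""
    if cp is not None:
        return cp.get_array_module(x)
    return np


def make_dispatcher(name):
    """Build a function that forwards to numpy.<name> or cupy.<name>."""
    def dispatcher(x, *args, **kwargs):
        xp = array_module(x)
        return getattr(xp, name)(x, *args, **kwargs)
    dispatcher.__name__ = name
    return dispatcher


# Insert three functions into the current module with a loop.
for _name in ("sin", "cos", "exp"):  # hypothetical choice of functions
    globals()[_name] = make_dispatcher(_name)

print(sin(np.array([0.0, np.pi / 2])))
```

Calling `sin` with a CuPy array would return a CuPy array, because the dispatcher looks up the function on whichever module owns the input.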

QUESTION

I dug through the documentation for `cupy` sparse matrices. As in `scipy`, I expect to have something like this:

ANSWER

Answered 2022-Apr-02 at 08:13

As stated in the error, you need to convert the datatype to either bool, float32/64, or complex64/128:
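As a rough illustration of that conversion (the original code is not shown, so this is an assumption about its shape), the sketch below casts an array to a supported dtype before it would be handed to a sparse constructor. The helper `to_sparse_compatible` and the `float32` fallback are hypothetical. It runs on NumPy so it can be tried without a GPU, and the CuPy usage is shown only as a comment.

```python
import numpy as np

# Dtypes named in the answer as accepted for sparse matrices.
SUPPORTED = (np.bool_, np.float32, np.float64, np.complex64, np.complex128)


def to_sparse_compatible(a, fallback=np.float32):
    """Cast `a` to `fallback` unless its dtype is already supported."""
    if any(a.dtype == np.dtype(t) for t in SUPPORTED):
        return a
    return a.astype(fallback)


dense = np.array([[0, 2], [3, 0]], dtype=np.int64)
converted = to_sparse_compatible(dense)
print(converted.dtype)

# With CuPy installed (assumption), the same helper can be applied first:
#   import cupy as cp
#   from cupyx.scipy import sparse
#   m = sparse.csr_matrix(to_sparse_compatible(cp.asarray(dense)))
```

If the integers are large, passing `fallback=np.float64` avoids the precision loss of `float32`.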

QUESTION

Consider a simplified example that uses multiprocessing inside a class that uses cupy for simulation. this part

...ANSWER

Answered 2022-Mar-17 at 11:43

Adding an answer here to wrap this one up. I didn't come across a Stack Overflow thread when researching this issue, so I'm assuming this thread will get more views in the future.

The issue is that the default start method does not work with CUDA multiprocessing. Explicitly setting the start method to spawn with `multiprocessing.set_start_method('spawn', force=True)` resolves it.
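A minimal sketch of where that call could go is shown below. The worker `simulate` and the wrapper `run_simulations` are hypothetical stand-ins for the asker's class, and the worker does plain arithmetic instead of CuPy work so that the structure can be tried on any machine.

```python
import multiprocessing as mp


def simulate(seed):
    """Stand-in worker; a real worker would import and use cupy here."""
    return seed * seed


def run_simulations(seeds, processes=2):
    """Run `simulate` in worker processes started with 'spawn'."""
    # Set the start method before any worker process is created.
    mp.set_start_method("spawn", force=True)
    with mp.Pool(processes=processes) as pool:
        return pool.map(simulate, seeds)
```

Because spawn re-imports the main module in every worker, `run_simulations(range(4))` should be called from under an `if __name__ == "__main__":` guard in a script.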

QUESTION

I am hoping to move my custom camera video pipeline to use video memory with a combination of numba and cupy, and to avoid passing data back to host memory if at all possible. As part of doing this I need to port my sharpness detection routine to use CUDA. The easiest way to do this seemed to be to use cupy, as essentially all I do is compute the variance of a Laplace transform of each image. The trouble I am hitting is that the cupy variance computation appears to be ~8x slower than numpy. The numpy figure includes the time it takes to copy the device ndarray to the host and perform the variance computation on the CPU. I am hoping to gain a better understanding of why the variance computation ReductionKernel employed by cupy on the GPU is so much slower. I'll start by including the test I ran below.

...ANSWER

Answered 2022-Jan-14 at 21:58

I have a partial hypothesis about the problem (not a full explanation) and a work-around. Perhaps someone can fill in the gaps. I've used a quicker-and-dirtier benchmark, for brevity's sake.

The work-around: reduce one axis at a time

CuPy is **much** faster when reduction is performed on one axis at a time. Instead of:

`x.sum()`

prefer this:

`x.sum(-1).sum(-1).sum(-1)...`

Note that the results of these computations may differ due to rounding error.

Here are faster `mean` and `var` functions:
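The answerer's functions are not included in this excerpt. The sketch below is one plausible reading of the axis-at-a-time idea, written for real-valued arrays. The names `axiswise_sum`, `fast_mean` and `fast_var` are hypothetical. It uses only array methods, so it should accept either NumPy or CuPy arrays, and it is demonstrated on NumPy here.

```python
import numpy as np


def axiswise_sum(x):
    """Sum all elements by reducing the last axis repeatedly."""
    while x.ndim > 0:
        x = x.sum(-1)
    return x


def fast_mean(x):
    """Mean over all elements, built on the axis-at-a-time sum."""
    return axiswise_sum(x) / x.size


def fast_var(x):
    """Population variance (ddof=0), matching the default of `x.var()`."""
    centered = x - fast_mean(x)
    return axiswise_sum(centered * centered) / x.size


x = np.random.default_rng(0).random((8, 16, 32))
print(fast_mean(x), fast_var(x))
```

As the answer notes for `sum`, the results can differ from `x.mean()` and `x.var()` by rounding error, so compare with a tolerance and not with exact equality.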

QUESTION

Below is a runnable code snippet using `dask` and `cupy`, which I have problems with. I run this on Google Colab with GPU activated.

Basically, my problem is that **A** and **At** are arrays that are too big for RAM, which is why I use `Dask`. I run operations on these arrays, but I would like to obtain **AtW1[:,k]** (as a cupy array) without exhausting my RAM or GPU memory, because I need this value for further operations. How can I achieve this?

ANSWER

Answered 2022-Jan-12 at 11:31

Although the idea of rechunking makes a lot of sense on paper, in practice rechunking needs to be done with great care, since it will only be able to reshape the work that can be blocked in principle.

For example, compare the following two approaches:

QUESTION

I am trying to improve my code's efficiency with cupy, but I cannot find a way to carry out linear programming within cupy. The problem comes from the following parts:

...ANSWER

Answered 2021-Dec-08 at 12:49

I’ve seen papers that propose using GPUs for linear programming. Some of them even report outstanding improvements. But from what I saw, they compare their GPU implementation of the simplex method with their own sequential implementation, not with Gurobi, CPLEX, or even CLP. I have never heard of an efficient GPU-based LP solver that beats good LP solvers. Even a flagship solver like Gurobi does not support GPUs. And I know there are some doubts that GPUs can actually help in large-scale LP.

- Large-scale LPs are sparse, and GPUs are not good at sparse computation.
- Optimization in general is mostly a sequential process (parallelization in modern LP solvers is very specific and cannot utilize GPUs).

If you want to try to implement your own GPU-based LP solver, I encourage you to try. Whatever you get, it would be a great experience.

But if you only need to speed up your solution process, then get a different solver. `linprog` from SciPy may be a good choice for prototyping, but GLPK or CLP/CBC will give you much better speed. You can invoke them through Pyomo or PuLP.
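For prototyping, a CPU-side call to `scipy.optimize.linprog` might look like the sketch below. The small problem is made up for illustration and is not taken from the question: maximize x + 2y subject to x + y <= 4 and x + 3y <= 6 with x, y >= 0, written as a minimization because `linprog` minimizes.

```python
from scipy.optimize import linprog

# Minimize -x - 2y, which is the same as maximizing x + 2y.
c = [-1.0, -2.0]
A_ub = [[1.0, 1.0],   # x +  y <= 4
        [1.0, 3.0]]   # x + 3y <= 6
b_ub = [4.0, 6.0]
bounds = [(0, None), (0, None)]  # x >= 0, y >= 0

result = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
print(result.x, result.fun)
```

Data that lives in CuPy arrays would have to be copied to the host first (for example with `cupy.asnumpy`) before being passed to a CPU solver like this one.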

QUESTION

This is a total newbie question but I've been searching for a couple days and cannot find the answer.

I am using cupy to allocate a large array of doubles (circa 655k rows x 4k columns), which is about 16 GB in RAM. I'm running on a p2.8xlarge (the AWS instance that claims to have 96 GB of GPU RAM and 8 GPUs), but when I allocate the array I get an out-of-memory error.

Is this happening because the 96 GB of RAM is split into 8 x 12 GB lots that are each accessible only to one GPU? Is there no concept of pooling the GPU RAM across the GPUs (like regular RAM in a multi-CPU situation)?

...ANSWER

Answered 2021-Nov-05 at 18:57

From playing around with it a fair bit, I think the answer is no, you cannot pool memory across GPUs. You can move data back and forth between GPUs and CPU, but there's no concept of unified GPU RAM accessible to all GPUs.
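A quick back-of-the-envelope check shows why the allocation fails on a single device. It treats the question's "circa" figures as exact and reads the 12 GB per GPU as 12 GiB, both of which are assumptions. With these rounded figures the array comes out somewhat larger than the estimate in the question, and either way it exceeds one GPU's memory.

```python
import math

rows, cols = 655_000, 4_000      # "circa" figures from the question, assumed exact
bytes_per_double = 8
per_gpu_bytes = 12 * 1024**3     # 12 GB per GPU from the question, read as GiB

array_bytes = rows * cols * bytes_per_double
fits_on_one_gpu = array_bytes <= per_gpu_bytes

# Lower bound on the number of row chunks if the array is split across GPUs.
min_gpus = math.ceil(array_bytes / per_gpu_bytes)
rows_per_chunk = math.ceil(rows / min_gpus)

print(f"array: {array_bytes / 1024**3:.1f} GiB, fits on one GPU: {fits_on_one_gpu}")
print(f"row chunks needed: {min_gpus}, rows per chunk: {rows_per_chunk}")
```

The chunk count is only a lower bound, since real workloads also need room for temporaries and library workspaces on each device.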

QUESTION

I am using CuPy with the following code:

...ANSWER

Answered 2021-Oct-25 at 13:56

For high-level, NumPy-like APIs, there is currently no public interface to change the grid/block configuration. In addition, many linalg APIs (such as `eigh` in your example) delegate the job to the CUDA Math Libraries solvers, which do not allow users to set the grid/block configuration either. I wonder what prompts this need. It'd be nice if you could elaborate.

QUESTION

I'm trying to start using Cupy for some Cuda Programming. I need to write my own kernels. However, I'm struggling with 2D kernels. It seems that Cupy does not work the way I expected. Here is a very simple example of a 2D kernel in Numba Cuda:

...ANSWER

Answered 2021-Oct-19 at 18:18

Memory in C is stored in row-major order, so we need to index following this order. Also, since I'm passing int arrays, I changed the argument types of my kernel. Here is the code:
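The kernel itself is not reproduced in this excerpt. As a small host-side illustration of the row-major rule it relies on, the sketch below checks that element `(row, col)` of a C-ordered 2D array sits at flat offset `row * n_cols + col`, which is the index arithmetic a raw kernel would apply to its flat pointer. The helper `flat_index` is hypothetical, and NumPy is used so the check runs without a GPU.

```python
import numpy as np

n_rows, n_cols = 3, 4
a = np.arange(n_rows * n_cols, dtype=np.int32).reshape(n_rows, n_cols)
flat = a.ravel()  # the same data viewed as a flat C-ordered buffer


def flat_index(row, col, n_cols):
    """Offset of element (row, col) in a row-major (C-order) buffer."""
    return row * n_cols + col


# Every 2D element matches the flat buffer at the computed offset.
for row in range(n_rows):
    for col in range(n_cols):
        assert flat[flat_index(row, col, n_cols)] == a[row, col]

print(flat_index(2, 1, n_cols))
```

Note that the row index is multiplied by the number of columns, not the number of rows, which is an easy mistake when mapping the x and y thread indices of a 2D launch onto the array.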

QUESTION

I've been making the rounds on forums trying out different ways to install cupy on macOS running on a device without an NVIDIA GPU. So far, nothing has worked. I've tried both a Homebrew install of Python 3.7 and a conda install of Python 3.7 and attempted each of the following:

`conda install -c conda-forge cupy`

`conda install cupy`

`pip install cupy`

- ...

ANSWER

Answered 2021-Oct-19 at 13:50

There is no Mac support in CuPy since NVIDIA no longer supports macOS. Whatever you read is outdated. I know because I sent a PR to remove the last broken bits from CuPy's codebase, and I also maintain the CuPy package on conda-forge.

