cuda-gdb | This directory contains various GNU compilers, assemblers | GPU library
kandi X-RAY | cuda-gdb Summary
CUDA GDB
Community Discussions
Trending Discussions on cuda-gdb
QUESTION
The following CUDA code takes a list of labels (0, 1, 2, 3, ...) and finds the sums of the weights of these labels.
To accelerate the calculation, I use shared memory so that each thread maintains its own running sum. At the end of the calculation, I perform a CUB block-wide reduction and then an atomic add to the global memory.
The CPU and GPU agree on the results if I use fewer than 30 blocks, but disagree if I use more than this. Why is this and how can I fix it?
Checking error codes in the code doesn't yield anything and cuda-gdb and cuda-memcheck do not show any uncaught errors or memory issues.
I'm using NVCC v10.1.243 and running on an NVIDIA Quadro P2000.
MWE ...ANSWER
Answered 2020-Oct-03 at 01:20When I run your code on a Tesla V100, all the results are failures except the first test.
You have a problem here:
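As a general aid for this kind of CPU/GPU mismatch, a host-side reference implementation of the label-weight sum makes it easy to compare results after every kernel change. A minimal sketch (names are hypothetical, not taken from the question's MWE):

```cpp
#include <cstddef>
#include <vector>

// Host reference: accumulate each item's weight into the bin for its label.
std::vector<double> label_weight_sums(const std::vector<int>& labels,
                                      const std::vector<double>& weights,
                                      int num_labels) {
    std::vector<double> sums(num_labels, 0.0);
    for (std::size_t i = 0; i < labels.size(); ++i)
        sums[labels[i]] += weights[i];
    return sums;
}
```

Comparing against such a reference quickly localizes block-count-dependent bugs like the one described.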
QUESTION
I'm new to CUDA and was working on a small project to learn how to use it when I came across the following error when executing my program:
...ANSWER
Answered 2020-Sep-03 at 02:22 Your in-kernel malloc operations are exceeding the device heap size.
Any time you are having trouble with a CUDA code that uses in-kernel malloc or new, it's good practice (at least as a diagnostic) to check the returned pointer value for NULL before attempting to use (i.e. dereference) it.
When I do that in your code right after the malloc operations in aligner::kmdist, I get the asserts being hit, indicating NULL return values. This is the indication that you have exceeded the device heap. You can increase the device heap size.
When I increase the device heap size to 1GB, this particular issue disappears, and at that point cuda-memcheck may start reporting other errors (I don't know; your application may have other defects, but the proximal issue here is exceeding the device heap).
As an aside, I also recommend that you compile your code to match the architecture you are running on:
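Both suggestions can be sketched together; the sizes, kernel, and the sm_61 architecture (assuming a Quadro P2000, compute capability 6.1) are illustrative:

```cuda
// Compile with, e.g.: nvcc -arch=sm_61 heap_demo.cu
#include <cassert>
#include <cstdlib>

__global__ void worker() {
    // Diagnostic: in-kernel malloc returns NULL once the device heap is exhausted.
    int *p = static_cast<int *>(malloc(1024 * sizeof(int)));
    assert(p != nullptr);   // fires when the heap limit is too small
    free(p);
}

int main() {
    // Raise the device heap before launching any kernel that uses in-kernel malloc/new.
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 1024ULL * 1024 * 1024); // 1 GB
    worker<<<1, 1>>>();
    cudaDeviceSynchronize();
    return 0;
}
```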
QUESTION
I'm trying to install gcc version 4.9 on Ubuntu to replace the current version 7.5 (because Torch is not compatible with version 6 and above). However, even following precise instructions, I can't install it. I did:
...ANSWER
Answered 2020-Jun-04 at 08:51 In the meantime, I figured it out myself. Strangely, G++ and GCC version 4.9 are still not available, so you must go with 4.8. By combining multiple sources, I put together a way to install G++ and GCC 4.8.5 on your machine and configure them as the default ones:
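The kind of sequence described above looks roughly like this (Ubuntu package and alternatives names; a sketch, not verified against any particular release):

```shell
sudo apt-get update
sudo apt-get install -y gcc-4.8 g++-4.8
# Register 4.8 with the alternatives system and make it the default
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-4.8 50
sudo update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-4.8 50
sudo update-alternatives --set gcc /usr/bin/gcc-4.8
sudo update-alternatives --set g++ /usr/bin/g++-4.8
gcc --version   # should now report 4.8.x
```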
QUESTION
I am developing a C++ application with cmake as the build system. Each component in the application builds into a static library, which the executable links to.
I am trying to link in some cuda code that is built as a separate static library, also with cmake. When I attempt to invoke the global function entry point in the cuda static library from the main application, everything seems to work fine - the cudaDeviceSynchronize that follows my global function invocation returns 0. However, the output of the kernel is not set and the call returns immediately.
I ran cuda-gdb. Despite the code being compiled with -g and -G, I was not able to break within the device function called by the kernel. So, I ran cuda-memcheck. When the kernel is launched, this message appears:
========= Program hit cudaErrorInvalidDeviceFunction (error 8) due to "invalid device function" on CUDA API call to cudaLaunchKernel.
I looked this up, and the NVIDIA docs/forum posts I read suggested this is usually due to compiling for the wrong compute capability. However, I'm running Titan V's, and the CC is correctly set to 7.0 when compiling.
I have set CUDA_SEPARABLE_COMPILATION on both the cuda library and the component in the main application that links to the cuda code, per https://devblogs.nvidia.com/building-cuda-applications-cmake/. I've also tried setting CUDA_RESOLVE_DEVICE_SYMBOLS.
Here is the relevant portion of the cmake for the main application:
(kronmult_cuda is the component in the main application that links to the cuda library ${KRONLIB}. Another component, kronmult, links to kronmult_cuda. Eventually, something that links to kronmult is linked into the main application.)
ANSWER
Answered 2020-Apr-26 at 12:22 After the helpful hint from @talonmies, I suspected this was a device linking problem. I simplified my build process, included all CUDA files in one translation unit, and turned off SEPARABLE_COMPILATION.
Still, I did not see a cmake_device_link.o in either my main application binary or the component that called into my cuda library, and I still had the same error. I tried setting CUDA_RESOLVE_DEVICE_SYMBOLS, to no effect.
Finally, I tried building the component that calls into my cuda library as SHARED. I saw the device linking step when building the .so in my cmake output, and the program runs fine. I do not know why building SHARED fixes what I suspect was a device linking problem - will accept any answer that deciphers that.
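For reference, the CMake properties discussed in this thread fit together roughly like this (target names are hypothetical, modeled on the question; a sketch, not the asker's actual build):

```cmake
# Hypothetical CUDA static library containing the kernels
add_library(kronlib STATIC kernels.cu)
set_target_properties(kronlib PROPERTIES
    CUDA_SEPARABLE_COMPILATION ON    # device code may span translation units
    CUDA_RESOLVE_DEVICE_SYMBOLS ON)  # force a device-link step for this archive

# Component that calls into the CUDA library
add_library(kronmult_cuda STATIC caller.cpp)
target_link_libraries(kronmult_cuda PRIVATE kronlib)
```

Putting CUDA_RESOLVE_DEVICE_SYMBOLS on the library that owns the device code is often what makes the device-link step actually run for static archives.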
QUESTION
CUDA 10.1 and the NVIDIA drivers v440 are installed on my Ubuntu 18.04 system. I don't understand why the nvidia-smi tool reports CUDA version 10.2 when the installed version is 10.1 (see further down).
ANSWER
Answered 2020-Feb-13 at 20:46 From Could not dlopen library 'libcudart.so.10.0' we can tell that your tensorflow package is built against CUDA 10.0. You should install CUDA 10.0, or build tensorflow from source (against CUDA 10.1 or 10.2) yourself. (Note that nvidia-smi reports the highest CUDA version the installed driver supports, not the installed toolkit version, so a 10.2-capable driver alongside a 10.1 toolkit is expected.)
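The two version numbers in play can be checked side by side; nvidia-smi shows the driver's maximum supported CUDA version, while nvcc shows the installed toolkit:

```shell
nvidia-smi | grep "CUDA Version"   # driver's highest supported CUDA version (e.g. 10.2)
nvcc --version                     # installed CUDA toolkit version (e.g. 10.1)
```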
QUESTION
I'm developing a CUDA matrix multiplication, but I made some modifications to observe how they affect performance.
I'm trying to observe the behavior (and I'm measuring the changes in GPU event times) of a simple matrix multiplication kernel, testing it under two different conditions:
- I have a number of matrices (say matN) for each of A, B and C. I transfer (H2D) one matrix for A and one for B at a time, multiply them, then transfer back (D2H) one C.
- I have matN matrices for each of A, B and C, but I transfer more than one at a time (say chunk) for A and for B, perform exactly chunk multiplications, and transfer back chunk result matrices C.
In the first case (chunk = 1) everything works as expected, but in the second case (chunk > 1) some of the Cs I get are correct, while others are wrong.
But if I put a cudaDeviceSynchronize() after the cudaMemcpyAsync, all the results I get are correct.
Here's the part of the code doing what I've just described above:
...ANSWER
Answered 2019-Jun-19 at 09:35 If you are using multiple streams, you may be overwriting Ad and Bd before using them.
Example with iters = 2 and nStream = 2:
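A hedged reconstruction of the hazard (illustrative only, not the answer's original snippet; buffer and loop names follow the question):

```cuda
// Single shared pair of device buffers across multiple streams: a race.
for (int i = 0; i < iters; ++i) {
    cudaStream_t s = streams[i % nStream];
    // Iteration 1 (stream 1) can start overwriting Ad/Bd while
    // iteration 0's kernel (stream 0) is still reading them.
    cudaMemcpyAsync(Ad, A[i], bytes, cudaMemcpyHostToDevice, s);
    cudaMemcpyAsync(Bd, B[i], bytes, cudaMemcpyHostToDevice, s);
    matmul<<<grid, block, 0, s>>>(Ad, Bd, Cd);
    cudaMemcpyAsync(C[i], Cd, bytes, cudaMemcpyDeviceToHost, s);
}
// Fix: give each in-flight iteration its own buffers,
// e.g. Ad[i % nStream], Bd[i % nStream], Cd[i % nStream],
// so concurrent streams never share working memory.
```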
QUESTION
I have seen a lot of posts about particular, case-specific problems, but no fundamental motivating explanation. What does this error:
...ANSWER
Answered 2019-Apr-21 at 15:08 When a device-side error is detected while CUDA device code is running, that error is reported via the usual CUDA runtime API error reporting mechanism. The usual detected error in device code would be something like an illegal address (e.g. an attempt to dereference an invalid pointer), but another type is a device-side assert. This type of error is generated whenever a C/C++ assert() occurs in device code and the assert condition is false.
Such an error occurs as a result of a specific kernel. Runtime error checking in CUDA is necessarily asynchronous, but there are at least 3 possible methods to start debugging this:
1. Modify the source code to effectively convert asynchronous kernel launches to synchronous kernel launches, and do rigorous error-checking after each kernel launch. This will identify the specific kernel that has caused the error. At that point it may be sufficient simply to look at the various asserts in that kernel code, but you could also use step 2 or 3 below.
2. Run your code with cuda-memcheck. This is a tool something like "valgrind for device code". When you run your code with cuda-memcheck, it will tend to run much more slowly, but the runtime error reporting will be enhanced. It is also usually preferable to compile your code with -lineinfo. In that scenario, when a device-side assert is triggered, cuda-memcheck will report the source code line number where the assert is, as well as the assert itself and the condition that was false. You can see here for a walkthrough of using it (albeit with an illegal address error instead of assert(), but the process with assert() will be similar).
3. It should also be possible to use a debugger. If you use a debugger such as cuda-gdb (e.g. on linux), then the debugger will have back-trace reports that indicate which line the assert was on when it was hit.
Both cuda-memcheck and the debugger can be used if the CUDA code is launched from a python script.
At this point you have discovered what the assert is and where in the source code it is. Why it is there cannot be answered generically; this will depend on the developer's intention, and if it is not commented or otherwise obvious, you will need some way to intuit it. The question of "how to work backwards" is also a general debugging question, not specific to CUDA. You can use printf in CUDA kernel code, and also a debugger like cuda-gdb, to assist with this (for example, set a breakpoint prior to the assert and inspect machine state - e.g. variables - when the assert is about to be hit).
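Step 1 above is commonly done with a checking macro after every launch. A minimal sketch (the kernel is a placeholder):

```cuda
#include <cassert>
#include <cstdio>
#include <cstdlib>

// Report and exit on any CUDA runtime error.
#define CUDA_CHECK(call)                                            \
    do {                                                            \
        cudaError_t err_ = (call);                                  \
        if (err_ != cudaSuccess) {                                  \
            fprintf(stderr, "CUDA error '%s' at %s:%d\n",           \
                    cudaGetErrorString(err_), __FILE__, __LINE__);  \
            exit(EXIT_FAILURE);                                     \
        }                                                           \
    } while (0)

__global__ void demo(const int *v) {
    assert(v[threadIdx.x] >= 0);   // device-side assert
}

int main() {
    int *d_v;
    CUDA_CHECK(cudaMalloc(&d_v, 32 * sizeof(int)));
    CUDA_CHECK(cudaMemset(d_v, 0, 32 * sizeof(int)));
    demo<<<1, 32>>>(d_v);
    CUDA_CHECK(cudaGetLastError());      // catches launch-configuration errors
    CUDA_CHECK(cudaDeviceSynchronize()); // makes the launch synchronous, surfacing device-side asserts
    CUDA_CHECK(cudaFree(d_v));
    return 0;
}
```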
QUESTION
I've looked at many pages and either could not follow what they were saying because they were unclear and/or my knowledge is just not sufficient enough.
I am trying to run:
luarocks install https://raw.githubusercontent.com/qassemoquab/stnbhwd/master/stnbhwd-scm-1.rockspec
So that I may run DenseCap over some images using GPU Acceleration. When I run it, I get this error:
...ANSWER
Answered 2017-Dec-05 at 23:35 Try changing the code architecture (such as sm_20) to some higher version in the CMakeLists.txt of the stnbhwd package you are trying to install.
From:
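The edit being described typically looks like this in a FindCUDA-based CMakeLists.txt (a sketch; the actual flag line in stnbhwd may differ):

```cmake
# Old: architecture too old for recent toolkits, e.g.
#   LIST(APPEND CUDA_NVCC_FLAGS "-arch=sm_20")
# New: target the GPU's actual compute capability, e.g. 6.1
LIST(APPEND CUDA_NVCC_FLAGS "-arch=sm_61")
```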
QUESTION
I have an application which generates CUDA C++ source code, compiles it into PTX at runtime using NVRTC, and then creates CUDA modules from it using the CUDA driver API.
If I debug this application using cuda-gdb, it displays the kernel (where an error occurred) in the backtrace, but does not show the line number.
I export the generated source code into a file and give the directory to cuda-gdb using the --directory option. I also tried passing its file name to nvrtcCreateProgram() (the name argument). I use the compile options --device-debug and --generate-line-info with NVRTC.
Is there a way to let cuda-gdb know the location of the generated source code file and display the line number information in its backtrace?
ANSWER
Answered 2019-Feb-16 at 18:31 I was able to do kernel source-level debugging on an nvrtc-generated kernel with cuda-gdb as follows:
- start with the vectorAdd_nvrtc sample code
- modify the compileFileToPTX routine (provided by nvrtc_helper.h) to add the --device-debug switch during the compile-cu-to-ptx step
- modify the loadPTX routine (provided by nvrtc_helper.h) to add the CU_JIT_GENERATE_DEBUG_INFO option (set to 1) for the cuModuleLoadDataEx load/JIT PTX-to-binary step
- compile the main function (vectorAdd.cpp) with the -g option
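The two code changes above can be sketched with the raw NVRTC and driver-API calls (error checking omitted; variable names are illustrative, and src is assumed to hold the generated kernel source):

```cpp
#include <cuda.h>
#include <nvrtc.h>
#include <vector>

// 1. Compile the generated source to PTX with full device debug info.
nvrtcProgram prog;
nvrtcCreateProgram(&prog, src, "generated.cu", 0, nullptr, nullptr);
const char *opts[] = {"--device-debug"};
nvrtcCompileProgram(prog, 1, opts);
size_t ptxSize;
nvrtcGetPTXSize(prog, &ptxSize);
std::vector<char> ptx(ptxSize);
nvrtcGetPTX(prog, ptx.data());

// 2. JIT the PTX, asking the driver to keep debug info for cuda-gdb.
CUmodule module;
CUjit_option jitOpt = CU_JIT_GENERATE_DEBUG_INFO;
void *jitVal = reinterpret_cast<void *>(1);
cuModuleLoadDataEx(&module, ptx.data(), 1, &jitOpt, &jitVal);
```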
Here is a complete test case/session. I'm only showing the vectorAdd.cpp file from the project because that is the only file I modified. Other project file(s) are identical to what is in the sample project:
QUESTION
Below are the things I have checked with cuda-gdb:
- the contents of src are correct
- cudaMalloc, malloc, and file I/O are successful
- cudaMemcpy returns cudaSuccess
- the problematic cudaMemcpy is called and throws no errors or exceptions
- destination is allocated (cudaMalloc) successfully
Below are the relevant parts of the code: wavenet_server.cc mallocs the source, copies data from a file into it, and calls make_wavenet. wavenet_infer.cu calls the constructor of MyWaveNet and calls setEmbeddings.
wavenet_server.cc:
...ANSWER
Answered 2018-Oct-12 at 06:55 Turns out that cudaMemcpy was not the issue. When examining device global memory using cuda-gdb, one cannot do x/10fw float_array - it will give incorrect values. To view the values, try this instead: p ((@global float*) float_array)[0]@10
Community Discussions, Code Snippets contain sources that include Stack Exchange Network