mykernel | mykernel 2.0 : Develop your own OS kernel
kandi X-RAY | mykernel Summary
Develop your own OS kernel by reusing Linux infrastructure. mykernel 2.0 is based on x86-64/Linux kernel 5.4.34; mykernel 1.0 was based on IA32/Linux kernel 3.9.4.
mykernel Key Features
mykernel Examples and Code Snippets
Community Discussions
Trending Discussions on mykernel
QUESTION
Post-solution edit: The issue is with the code alone. There is no hardware issue here. Now to the original post:
I'm trying to get a basic OpenCL program to work.
The program simply creates a buffer, writes 42 to it, reads it back, and prints the result.
Here's the code, written in C:
...
ANSWER
Answered 2021-Feb-21 at 04:07
I am a buffoon. The error was on the line where I set the kernel argument:
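The poster's actual fix is not reproduced here. As a hedged illustration only, the call in question usually has this shape, and the typical mistake is passing the buffer handle itself (or the size of the data it holds) instead of sizeof(cl_mem) and a pointer to the handle. The helper name, argument index, and buffer name below are hypothetical:

    #include <stdio.h>
    #include <CL/cl.h>

    /* Hypothetical helper: bind a cl_mem buffer as argument `index` of `kernel`.
       Note the size is sizeof(cl_mem) and the value is a pointer to the handle. */
    static int set_buffer_arg(cl_kernel kernel, cl_uint index, cl_mem buf)
    {
        cl_int err = clSetKernelArg(kernel, index, sizeof(cl_mem), &buf);
        if (err != CL_SUCCESS) {
            fprintf(stderr, "clSetKernelArg(%u) failed: %d\n", index, err);
            return -1;
        }
        return 0;
    }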
QUESTION
When trying out the new jitify support planned for CuPy v9.x, I found that the name_expressions named argument to cupy.RawModule needs to be iterable for NVRTC not to fail when later calling get_function. (This question stems from: cupy.RawModule using name_expressions and nvcc and/or path.)
ANSWER
Answered 2020-Dec-02 at 16:28
Well, first of all, we did say it's a sequence (ex: list/tuple) of strings and gave an example in the doc page you quoted:
name_expressions (sequence of str) – A sequence (e.g. list) of strings referring to the names of C++ global/template kernels. For example, name_expressions=['func1<int>', 'func1<double>', 'func2'] for the template kernel func1<T> and non-template kernel func2. Strings in this tuple must then be passed, one at a time, to get_function() to retrieve the corresponding kernel.
So I don't see any ambiguity. There is no doubt that it's a common pitfall in Python to write ('abc') thinking it's a 1-element tuple containing the string 'abc', when it should have been written as ('abc',) with a comma. But checking for such a pitfall everywhere in the codebase would be a pain in the ass IMHO.
Second, even if we added a check to ensure the input is iterable, it still wouldn't solve your issue, as strings are also iterables/sequences:
QUESTION
It seems that there is a maximum number of resident blocks allowed per SM. But while other "hard" limits are easily found (via, for example, cudaGetDeviceProperties), a maximum number of resident blocks doesn't seem to be widely documented.
In the following sample code, I configure the kernel with one thread per block. To test the hypothesis that this GPU (a P100) has a maximum of 32 resident blocks per SM, I create a grid of 56*32 blocks (56 = number of SMs on the P100). Each kernel takes 1 second to execute (via a "sleep" routine), so if I have configured the kernel correctly, the code should take 1 second. The timing results confirm this. Configuring with 32*56+1 blocks takes 2 seconds, suggesting that 32 blocks per SM is the maximum allowed per SM.
What I wonder is, why isn't this limit made more widely available? For example, it doesn't show up in cudaGetDeviceProperties. Where can I find this limit for various GPUs? Or maybe this isn't a real limit, but is derived from other hard limits?
I am running CUDA 10.1.
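The sample code itself is not included above. As a rough, hypothetical reconstruction (not the poster's actual code) of the kind of experiment described — one thread per block, each block busy-waiting roughly one second, grid sized to 32 blocks per SM — it could look like this:

    #include <cstdio>

    // Crude "sleep": busy-wait for a given number of SM clock cycles.
    __global__ void sleep_kernel(long long delay_cycles)
    {
        long long start = clock64();
        while (clock64() - start < delay_cycles) { /* spin */ }
    }

    int main()
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);
        int num_sms = prop.multiProcessorCount;                   // 56 on a P100
        long long one_second = (long long)prop.clockRate * 1000;  // clockRate is in kHz

        // 32 blocks per SM, 1 thread per block: should complete in a single ~1 s
        // wave if 32 resident blocks per SM is really the limit.
        sleep_kernel<<<num_sms * 32, 1>>>(one_second);
        cudaDeviceSynchronize();
        printf("done\n");
        return 0;
    }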
...
ANSWER
Answered 2020-Apr-23 at 09:48
Yes, there is a limit to the number of blocks per SM. The maximum number of blocks that can be contained in an SM is the maximum number of blocks that can be active (resident) at the same time. Blocks can be organized into one-, two-, or three-dimensional grids containing a very large number of blocks, but the SMs of your GPU can only accommodate a certain number of them at once. This limit is linked in two ways to the compute capability of your GPU.
Hardware limit stated by CUDA.
Each GPU allows a maximum number of blocks per SM, regardless of the number of threads they contain and the amount of resources they use. For example, a GPU with compute capability 2.0 has a limit of 8 blocks/SM, while one with compute capability 7.0 has a limit of 32 blocks/SM. This is the highest number of active blocks per SM that you can achieve: let's call it MAX_BLOCKS.
Limit derived from the amount of resources used by each block.
A block is made up of threads, and each thread uses a certain number of registers: the more registers it uses, the greater the amount of resources needed by the block that contains it. Similarly, the amount of shared memory assigned to a block increases the resources the block needs to be allocated. Once a certain threshold is exceeded, the resources needed per block become so large that the SM cannot host as many blocks as MAX_BLOCKS allows: in that case the resources required by each block are what limits the maximum number of active blocks per SM.
How do I find these boundaries?
CUDA thought about that too. The CUDA Occupancy Calculator spreadsheet, available on NVIDIA's site, lists the hardware limits grouped by compute capability. You can also enter the amount of resources used by your blocks (number of threads, registers per thread, bytes of shared memory) and get graphs and important information about the number of active blocks. The first tab of the file lets you calculate the actual usage of the SM based on the resources used. If you want to know how many registers per thread you use, add the -Xptxas -v option so the compiler reports register usage during compilation. The last tab of the file lists the hardware limits grouped by compute capability.
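As a small, hedged addition: the CUDA occupancy API (available since CUDA 6.5, so also in CUDA 10.1) reports the same answer programmatically for a concrete kernel and block size, with both the hardware MAX_BLOCKS cap and the per-block resource limits already folded in. A minimal sketch, assuming a trivial kernel:

    #include <cstdio>

    __global__ void my_kernel() { }

    int main()
    {
        int blocks_per_sm = 0;
        // For this kernel, 1 thread per block, no dynamic shared memory:
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks_per_sm, my_kernel,
                                                      /*blockSize=*/1,
                                                      /*dynamicSMemSize=*/0);
        printf("Resident blocks per SM for my_kernel: %d\n", blocks_per_sm);
        return 0;
    }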
QUESTION
I am trying to generate "random" numbers from a uniform distribution inside a CUDA __global__ kernel using two different approaches. The first is using the cuRAND device API, and the second is using thrust. For each approach I have created a different class.
Here is my cuRAND solution:
ANSWER
Answered 2020-Apr-20 at 16:24
Perhaps the performance difference happens because cuRAND and Thrust use different PRNG algorithms, with different performance profiles and demands on memory. Note that cuRAND supports five different PRNG algorithms, and your code doesn't show which one is in use.
Thrust's default_random_engine is currently minstd_rand, but its documentation notes that this "may change in a future version". (A comment written after I wrote mine also noted that it's minstd_rand.) minstd_rand is a simple linear congruential generator that may be faster than whatever PRNG cuRAND is using.
This was a comment converted to an answer and edited.
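For reference, here is a minimal sketch (not the poster's code; kernel names, sizes, and the simple per-thread seeding are hypothetical) of the two approaches being compared, each filling an array with uniform floats from device code:

    #include <curand_kernel.h>
    #include <thrust/random.h>

    // cuRAND device API: per-thread curandState, uniform floats in (0, 1].
    __global__ void curand_uniforms(float *out, unsigned long long seed, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        curandState state;
        curand_init(seed, /*subsequence=*/i, /*offset=*/0, &state);
        out[i] = curand_uniform(&state);
    }

    // thrust::random: minstd_rand (the current default_random_engine) plus a
    // uniform_real_distribution, usable directly in device code.
    __global__ void thrust_uniforms(float *out, unsigned long long seed, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        thrust::minstd_rand rng(seed + i);   // naive per-thread seeding, for illustration
        thrust::uniform_real_distribution<float> dist(0.0f, 1.0f);
        out[i] = dist(rng);
    }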
QUESTION
Is there a way to execute a statement inside device code without raising a CUDA error? Something like the following:
...
ANSWER
Answered 2020-Feb-17 at 17:32
Currently, CUDA device code does not support exception handling. try/catch cannot be used in CUDA device code.
If a statement would cause a device-side error (an out-of-bounds access, for example), there is no way to execute that statement in device code without triggering the device-side error.
To avoid the error, you would need to modify the statement, or conditionally not execute it, so that the device-side error is never triggered.
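A minimal sketch of that last point: guard the potentially faulting statement with a condition instead of relying on try/catch. The array bound n and the sentinel value below are hypothetical:

    __global__ void guarded_read(const int *data, int n, int idx, int *out)
    {
        // Instead of "trying" the access and catching a fault (unsupported in
        // device code), check the precondition and skip the statement if it
        // would be an out-of-bounds access.
        if (idx >= 0 && idx < n)
            *out = data[idx];
        else
            *out = -1;  // hypothetical sentinel meaning "index out of range"
    }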
QUESTION
I need to dynamically size an int* array that's queried in every thread of my kernel's instance.
My goal is to make an array of ints whose size I won't know until run time (so it can't be a fixed size).
With that, when I do:
...
ANSWER
Answered 2020-Jan-20 at 08:24
How do you use a pointer array with __constant__?
You don't. It is not possible.
Constant memory must be statically defined at compile time. That means if you want a constant memory array, the size must be defined when you compile the code. There is no way to dynamically allocate constant memory.
The only realistic solution is to define a __constant__ array with a maximum size, and then a second __constant__ variable indicating the size of the array that is actually used for a given kernel invocation. So something like:
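The answer's own snippet is not reproduced above; a minimal sketch of the pattern it describes (the maximum size and the names here are hypothetical) might look like:

    #define MAX_TABLE_SIZE 4096                  // compile-time upper bound

    __constant__ int d_table[MAX_TABLE_SIZE];    // fixed-size __constant__ array
    __constant__ int d_table_size;               // how many entries are valid this launch

    __global__ void use_table(int *out)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < d_table_size)
            out[i] = d_table[i];
    }

    // Host side, before launching:
    //   cudaMemcpyToSymbol(d_table, h_data, n * sizeof(int));
    //   cudaMemcpyToSymbol(d_table_size, &n, sizeof(int));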
QUESTION
This runtime error has been bothering me for two days. I have tried every way I can think of to debug it, but I still can't find what the issue is.
...
ANSWER
Answered 2019-Jul-30 at 00:30
I just found the solution myself. Different machines use different sizes in bytes for "long": on some machines it is 4 bytes, on others 8 bytes. Make sure both sides are built with the same compiler and architecture; otherwise cudaMemcpy will end up copying chunks of two different sizes.
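One hedged way to sidestep this class of bug is to use fixed-width types such as int64_t, so every machine (and the device) agrees on the element size. A minimal sketch, with a hypothetical element count:

    #include <cstdint>
    #include <vector>

    int main()
    {
        const size_t n = 1024;                      // hypothetical element count
        std::vector<int64_t> h_data(n, 0);          // int64_t is 8 bytes everywhere

        int64_t *d_data = nullptr;
        cudaMalloc(&d_data, n * sizeof(int64_t));
        // Both sides now compute the byte count from the same 8-byte element size.
        cudaMemcpy(d_data, h_data.data(), n * sizeof(int64_t), cudaMemcpyHostToDevice);
        cudaFree(d_data);
        return 0;
    }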
QUESTION
I have been writing some Metal compute kernels. So, I wrote a kernel with the following declaration:
...
ANSWER
Answered 2019-Jun-20 at 15:44
One approach that might work is to:
- Declare inData as a void*.
- In the body of the kernel shader, call a template function, passing along the arguments. The template function would be templated by the desired type and would receive inData as a pointer to that type.
You could use an input parameter to dynamically choose which variant of the template function to call. But a better approach is probably to use a function constant to pick. That way, the choice is compiled in.
So, something like:
QUESTION
Each instance of my CUDA kernel (i.e. each thread) needs three private arrays, with different types.
e.g.
...
ANSWER
Answered 2019-Jun-12 at 22:11
Preliminaries
Usually, people are interested in GPU computing for performance reasons - to make their codes run faster. So we'll keep performance as a guide when trying to make decisions about what to do.
I think one of the problems the sketch you provided in your question would have is with the natural alignment requirement in CUDA. Picking an arbitrary pointer and type-casting it to a different type may run afoul of this. If you have such a problem in your code, the cuda-memcheck tool should be able to uncover it.
The typical place to put thread-private arrays in C++ is local memory, and CUDA is no different in my opinion. However, CUDA C++, at least, does not support variable-length arrays. In your question, you sketched out using shared memory as a proxy for this. One of the implications of your thinking (I assume) is that although the sizes of these arrays are not known at compile time, there must be an upper bound on their size, because shared memory may impose a limit as low as 48KB per threadblock. Therefore, if you have 1024 threads in a threadblock, the maximum combined array size per thread would be limited to 48 bytes. With 512 threads per block you could conceivably have 96 bytes per thread. These limits would be due to shared memory if you used shared memory.
So an alternate approach (if you can adhere to these low limits) would be simply to put an upper bound on the local memory needed, and statically define a local memory array of that size (or three of them), per thread. A single array would have to be partitioned among the various arrays, with attention paid to alignment as already mentioned. But given the small sizes suggested by your approach (e.g. ~96 bytes total), it would probably be expedient just to use upper-bounded, fixed-size local arrays (not shared memory).
Local memory in CUDA is ultimately backed by the same physical resource -- GPU DRAM memory -- as global memory. However the arrangement is such that if each thread is accessing a particular element in their own local memory, the effect across threads would be equivalent to coalesced access, should that access need to be serviced by DRAM. This means the per-thread local storage is interleaved, in some fashion. And this interleaving characteristic is also something we will want to pay attention to, for performance reasons, if we come up with our own variable-length array implementation. It applies equally to a global memory proxy (to enable coalescing) or a shared memory proxy (to avoid bank conflicts).
In addition to the desire to interleave access for performance reasons, a possible performance reason to not prefer a shared memory implementation is that extensive use of shared memory can have negative implications for occupancy, and therefore for performance. This topic is covered in many other places, so I'll not drill into it further here.
Implementations
Local Memory
As mentioned above, I believe one of the implicit assumptions about your suggestion to use shared memory is that there must be some (reasonably small) upper bound to the actual sizes of the arrays needed. If that is the case, it may be expedient to use 3 arrays allocated with the upper bound size:
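The answer's code is not shown above; a minimal sketch of what such upper-bounded, thread-private local arrays might look like (the bounds, element types, and fill loops are hypothetical placeholders):

    // Hypothetical compile-time upper bounds on the run-time sizes.
    #define MAX_A 8
    #define MAX_B 16
    #define MAX_C 4

    __global__ void per_thread_arrays(int a_len, int b_len, int c_len)
    {
        // Thread-private arrays in local memory, sized to the upper bounds;
        // only the first a_len/b_len/c_len elements are actually used.
        float  a[MAX_A];
        double b[MAX_B];
        short  c[MAX_C];

        for (int i = 0; i < a_len && i < MAX_A; ++i) a[i] = i * 1.0f;
        for (int i = 0; i < b_len && i < MAX_B; ++i) b[i] = i * 2.0;
        for (int i = 0; i < c_len && i < MAX_C; ++i) c[i] = (short)i;
        // ... kernel work using a, b, c ...
    }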
QUESTION
How do I work around an nvprof crash that occurs when running on a disk with a relatively small amount of space available?
Specifically, when profiling my CUDA kernel, I use the following two commands:
...
ANSWER
Answered 2019-May-31 at 19:24
One can direct nvprof to use a different temporary directory by setting the TMPDIR environment variable. This is helpful because, since Linux kernel 2.6, there's a decent chance that you have a RAM disk available at /dev/shm (see https://superuser.com/a/45509/363816 for more info). Thus, adding the following at the beginning of one's [bash] script will likely work around your issue.
Community Discussions and Code Snippets contain sources that include the Stack Exchange Network.
Vulnerabilities
No vulnerabilities reported
Install mykernel
The mykernel-2.0 patch was generated with this command: diff -Naur linux-5.4.34 linux-5.4.34-mykernel > mykernel-2.0_for_linux-5.4.34.patch