mykernel | mykernel 2.0 : Develop your own OS kernel
kandi X-RAY | mykernel Summary
Develop your own OS kernel by reusing Linux infrastructure. mykernel 2.0 is based on x86-64/Linux kernel 5.4.34; mykernel 1.0 was based on IA32/Linux kernel 3.9.4.
mykernel Key Features
mykernel Examples and Code Snippets
Community Discussions
Trending Discussions on mykernel
QUESTION
Post-solution edit: The issue is with the code alone. There is no hardware issue here. Now to the original post:
I'm trying to get a basic OpenCL program to work.
The program simply creates a buffer, writes 42 to it, reads it back, and prints the result.
Here's the code, written in C:
...
ANSWER
Answered 2021-Feb-21 at 04:07
I am a buffoon. The error was on the line where I set the kernel argument:
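The poster's actual fix is not reproduced here. As a hedged illustration only, the call in question usually has this shape, and the typical mistake is passing the buffer handle itself (or the size of the data it holds) instead of sizeof(cl_mem) and a pointer to the handle. The helper name, argument index, and buffer name below are hypothetical:

    #include <stdio.h>
    #include <CL/cl.h>

    /* Hypothetical helper: bind a cl_mem buffer as argument `index` of `kernel`.
       Note the size is sizeof(cl_mem) and the value is a pointer to the handle. */
    static int set_buffer_arg(cl_kernel kernel, cl_uint index, cl_mem buf)
    {
        cl_int err = clSetKernelArg(kernel, index, sizeof(cl_mem), &buf);
        if (err != CL_SUCCESS) {
            fprintf(stderr, "clSetKernelArg(%u) failed: %d\n", index, err);
            return -1;
        }
        return 0;
    }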
QUESTION
When trying out the new jitify support planned for CuPy v9.x, I found that the name_expressions named argument to cupy.RawModule needs to be iterable for NVRTC not to fail when later calling get_function. (This question stems from: cupy.RawModule using name_expressions and nvcc and/or path.)
ANSWER
Answered 2020-Dec-02 at 16:28
Well, first of all, we did say it's a sequence (ex: list/tuple) of strings and gave an example in the doc page you quoted:
name_expressions (sequence of str) – A sequence (e.g. list) of strings referring to the names of C++ global/template kernels. For example, name_expressions=['func1<int>', 'func1<double>', 'func2'] for the template kernel func1<T> and non-template kernel func2. Strings in this tuple must then be passed, one at a time, to get_function() to retrieve the corresponding kernel.
So I don't see any ambiguity. There is no doubt that it's a common pitfall in Python to write ('abc') thinking it's a 1-element tuple containing the string 'abc', when it should have been written as ('abc',) with a comma. But checking for such a pitfall everywhere in the codebase would be a pain in the ass IMHO.
Second, even if we added a check to ensure the input is iterable, it still wouldn't solve your issue, as strings are also iterables/sequences:
QUESTION
It seems that there is a maximum number of resident blocks allowed per SM. But while other "hard" limits are easily found (via, for example, cudaGetDeviceProperties), a maximum number of resident blocks doesn't seem to be widely documented.
In the following sample code, I configure the kernel with one thread per block. To test the hypothesis that this GPU (a P100) has a maximum of 32 resident blocks per SM, I create a grid of 56*32 blocks (56 = number of SMs on the P100). Each kernel takes 1 second to execute (via a "sleep" routine), so if I have configured the kernel correctly, the code should take 1 second. The timing results confirm this. Configuring with 32*56+1 blocks takes 2 seconds, suggesting that 32 blocks per SM is the maximum allowed per SM.
What I wonder is, why isn't this limit made more widely available? For example, it doesn't show up in cudaGetDeviceProperties. Where can I find this limit for various GPUs? Or maybe this isn't a real limit, but is derived from other hard limits?
I am running CUDA 10.1.
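The sample code itself is not included above. As a rough, hypothetical reconstruction (not the poster's actual code) of the kind of experiment described — one thread per block, each block busy-waiting roughly one second, grid sized to 32 blocks per SM — it could look like this:

    #include <cstdio>

    // Crude "sleep": busy-wait for a given number of SM clock cycles.
    __global__ void sleep_kernel(long long delay_cycles)
    {
        long long start = clock64();
        while (clock64() - start < delay_cycles) { /* spin */ }
    }

    int main()
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);
        int num_sms = prop.multiProcessorCount;                   // 56 on a P100
        long long one_second = (long long)prop.clockRate * 1000;  // clockRate is in kHz

        // 32 blocks per SM, 1 thread per block: should complete in a single ~1 s
        // wave if 32 resident blocks per SM is really the limit.
        sleep_kernel<<<num_sms * 32, 1>>>(one_second);
        cudaDeviceSynchronize();
        printf("done\n");
        return 0;
    }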
...
ANSWER
Answered 2020-Apr-23 at 09:48
Yes, there is a limit to the number of blocks per SM. The maximum number of blocks that can be contained in an SM is the maximum number of blocks that can be active (resident) at the same time. Blocks can be organized into one-, two-, or three-dimensional grids containing a very large number of blocks, but the SMs of your GPU can only accommodate a certain number of them at once. This limit is linked in two ways to the compute capability of your GPU.
Hardware limit stated by CUDA.
Each GPU allows a maximum number of blocks per SM, regardless of the number of threads they contain and the amount of resources they use. For example, a GPU with compute capability 2.0 has a limit of 8 blocks/SM, while one with compute capability 7.0 has a limit of 32 blocks/SM. This is the highest number of active blocks per SM that you can achieve: let's call it MAX_BLOCKS.
Limit derived from the amount of resources used by each block.
A block is made up of threads, and each thread uses a certain number of registers: the more registers it uses, the greater the amount of resources needed by the block that contains it. Similarly, the amount of shared memory assigned to a block increases the resources the block needs to be allocated. Once a certain threshold is exceeded, the resources needed per block become so large that the SM cannot host as many blocks as MAX_BLOCKS allows: in that case the resources required by each block are what limits the maximum number of active blocks per SM.
How do I find these boundaries?
CUDA thought about that too. The CUDA Occupancy Calculator spreadsheet, available on NVIDIA's site, lists the hardware limits grouped by compute capability. You can also enter the amount of resources used by your blocks (number of threads, registers per thread, bytes of shared memory) and get graphs and important information about the number of active blocks. The first tab of the file lets you calculate the actual usage of the SM based on the resources used. If you want to know how many registers per thread you use, add the -Xptxas -v option so the compiler reports register usage during compilation. The last tab of the file lists the hardware limits grouped by compute capability.
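As a small, hedged addition: the CUDA occupancy API (available since CUDA 6.5, so also in CUDA 10.1) reports the same answer programmatically for a concrete kernel and block size, with both the hardware MAX_BLOCKS cap and the per-block resource limits already folded in. A minimal sketch, assuming a trivial kernel:

    #include <cstdio>

    __global__ void my_kernel() { }

    int main()
    {
        int blocks_per_sm = 0;
        // For this kernel, 1 thread per block, no dynamic shared memory:
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks_per_sm, my_kernel,
                                                      /*blockSize=*/1,
                                                      /*dynamicSMemSize=*/0);
        printf("Resident blocks per SM for my_kernel: %d\n", blocks_per_sm);
        return 0;
    }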
QUESTION
I am trying to generate "random" numbers from a uniform distribution inside a CUDA __global__ kernel using two different approaches. The first is using the cuRAND device API, and the second is using thrust. For each approach I have created a different class.
Here is my cuRAND solution:
ANSWER
Answered 2020-Apr-20 at 16:24
Perhaps the performance difference happens because cuRAND and Thrust use different PRNG algorithms, with different performance profiles and demands on memory. Note that cuRAND supports five different PRNG algorithms, and your code doesn't show which one is in use.
Thrust's default_random_engine is currently minstd_rand, but its documentation notes that this "may change in a future version". (A comment written after I wrote mine also noted that it's minstd_rand.) minstd_rand is a simple linear congruential generator that may be faster than whatever PRNG cuRAND is using.
This was a comment converted to an answer and edited.
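For reference, here is a minimal sketch (not the poster's code; kernel names, sizes, and the simple per-thread seeding are hypothetical) of the two approaches being compared, each filling an array with uniform floats from device code:

    #include <curand_kernel.h>
    #include <thrust/random.h>

    // cuRAND device API: per-thread curandState, uniform floats in (0, 1].
    __global__ void curand_uniforms(float *out, unsigned long long seed, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        curandState state;
        curand_init(seed, /*subsequence=*/i, /*offset=*/0, &state);
        out[i] = curand_uniform(&state);
    }

    // thrust::random: minstd_rand (the current default_random_engine) plus a
    // uniform_real_distribution, usable directly in device code.
    __global__ void thrust_uniforms(float *out, unsigned long long seed, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        thrust::minstd_rand rng(seed + i);   // naive per-thread seeding, for illustration
        thrust::uniform_real_distribution<float> dist(0.0f, 1.0f);
        out[i] = dist(rng);
    }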
QUESTION
Is there a way to execute a statement inside device code without raising a CUDA error? Something like the following:
...
ANSWER
Answered 2020-Feb-17 at 17:32
Currently, CUDA device code does not support exception handling. try/catch cannot be used in CUDA device code.
If a statement would cause a device-side error (an out-of-bounds access, for example), there is no way to execute that statement in device code without triggering the device-side error.
To avoid the error, you would need to modify the statement, or conditionally not execute it, so that the device-side error is never triggered.
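A minimal sketch of that last point: guard the potentially faulting statement with a condition instead of relying on try/catch. The array bound n and the sentinel value below are hypothetical:

    __global__ void guarded_read(const int *data, int n, int idx, int *out)
    {
        // Instead of "trying" the access and catching a fault (unsupported in
        // device code), check the precondition and skip the statement if it
        // would be an out-of-bounds access.
        if (idx >= 0 && idx < n)
            *out = data[idx];
        else
            *out = -1;  // hypothetical sentinel meaning "index out of range"
    }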
QUESTION
I need to dynamically size an int* array that's queried in every thread of my kernel's instance.
My goal is to make an array of ints whose size I won't know until run time (so it can't be a fixed size).
With that, when I do:
...
ANSWER
Answered 2020-Jan-20 at 08:24
How do you use a pointer array with __constant__?
You don't. It is not possible.
Constant memory must be statically defined at compile time. That means if you want a constant memory array, the size must be defined when you compile the code. There is no way to dynamically allocate constant memory.
The only realistic solution is to define a __constant__ array with a maximum size, and then a second __constant__ variable indicating the size of the array that is actually used for a given kernel invocation. So something like:
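The answer's own snippet is not reproduced above; a minimal sketch of the pattern it describes (the maximum size and the names here are hypothetical) might look like:

    #define MAX_TABLE_SIZE 4096                  // compile-time upper bound

    __constant__ int d_table[MAX_TABLE_SIZE];    // fixed-size __constant__ array
    __constant__ int d_table_size;               // how many entries are valid this launch

    __global__ void use_table(int *out)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < d_table_size)
            out[i] = d_table[i];
    }

    // Host side, before launching:
    //   cudaMemcpyToSymbol(d_table, h_data, n * sizeof(int));
    //   cudaMemcpyToSymbol(d_table_size, &n, sizeof(int));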
QUESTION
This runtime error has been bothering me for two days. I have tried every way I can think of to debug it, but I still can't find what the issue is.
...
ANSWER
Answered 2019-Jul-30 at 00:30
I just found the solution myself. Different machines use different sizes in bytes for "long": on some machines it is 4 bytes, on others 8 bytes. Make sure both sides are built with the same compiler and architecture; otherwise cudaMemcpy will end up copying chunks of two different sizes.
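One hedged way to sidestep this class of bug is to use fixed-width types such as int64_t, so every machine (and the device) agrees on the element size. A minimal sketch, with a hypothetical element count:

    #include <cstdint>
    #include <vector>

    int main()
    {
        const size_t n = 1024;                      // hypothetical element count
        std::vector<int64_t> h_data(n, 0);          // int64_t is 8 bytes everywhere

        int64_t *d_data = nullptr;
        cudaMalloc(&d_data, n * sizeof(int64_t));
        // Both sides now compute the byte count from the same 8-byte element size.
        cudaMemcpy(d_data, h_data.data(), n * sizeof(int64_t), cudaMemcpyHostToDevice);
        cudaFree(d_data);
        return 0;
    }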
QUESTION
I have been writing some Metal compute kernels. So, I wrote a kernel with the following declaration:
...
ANSWER
Answered 2019-Jun-20 at 15:44
One approach that might work is to:
- Declare inData as a void*.
- In the body of the kernel shader, call a template function, passing along the arguments. The template function would be templated by the desired type and would receive inData as a pointer to that type.
You could use an input parameter to dynamically choose which variant of the template function to call. But a better approach is probably to use a function constant to pick. That way, the choice is compiled in.
So, something like:
QUESTION
Each instance of my CUDA kernel (i.e. each thread) needs three private arrays, with different types.
e.g.
...
ANSWER
Answered 2019-Jun-12 at 22:11
Preliminaries
Usually, people are interested in GPU computing for performance reasons - to make their codes run faster. So we'll keep performance as a guide when trying to make decisions about what to do.
I think one of the problems the sketch you provided in your question would have is with the natural alignment requirement in CUDA. Picking an arbitrary pointer and type-casting it to a different type may run afoul of this. If you have such a problem in your code, the cuda-memcheck tool should be able to uncover it.
The typical place to put thread-private arrays in C++ is local memory, and CUDA is no different in my opinion. However, CUDA C++, at least, does not support variable-length arrays. In your question, you sketched out using shared memory as a proxy for this. One of the implications of your thinking (I assume) is that although the sizes of these arrays are not known at compile time, there must be an upper bound on their size, because shared memory may impose a limit as low as 48KB per threadblock. Therefore, if you have 1024 threads in a threadblock, the maximum combined array size per thread would be limited to 48 bytes. With 512 threads per block you could conceivably have 96 bytes per thread. These limits would be due to shared memory if you used shared memory.
So an alternate approach (if you can adhere to these low limits) would be simply to put an upper bound on the local memory needed, and statically define a local memory array of that size (or three of them), per thread. A single array would have to be partitioned among the various arrays, with attention paid to alignment as already mentioned. But given the small sizes suggested by your approach (e.g. ~96 bytes total), it would probably be expedient just to use upper-bounded, fixed-size local arrays (not shared memory).
Local memory in CUDA is ultimately backed by the same physical resource -- GPU DRAM memory -- as global memory. However the arrangement is such that if each thread is accessing a particular element in their own local memory, the effect across threads would be equivalent to coalesced access, should that access need to be serviced by DRAM. This means the per-thread local storage is interleaved, in some fashion. And this interleaving characteristic is also something we will want to pay attention to, for performance reasons, if we come up with our own variable-length array implementation. It applies equally to a global memory proxy (to enable coalescing) or a shared memory proxy (to avoid bank conflicts).
In addition to the desire to interleave access for performance reasons, a possible performance reason to not prefer a shared memory implementation is that extensive use of shared memory can have negative implications for occupancy, and therefore for performance. This topic is covered in many other places, so I'll not drill into it further here.
Implementations
Local Memory
As mentioned above, I believe one of the implicit assumptions about your suggestion to use shared memory is that there must be some (reasonably small) upper bound to the actual sizes of the arrays needed. If that is the case, it may be expedient to use 3 arrays allocated with the upper bound size:
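The answer's code is not shown above; a minimal sketch of what such upper-bounded, thread-private local arrays might look like (the bounds, element types, and fill loops are hypothetical placeholders):

    // Hypothetical compile-time upper bounds on the run-time sizes.
    #define MAX_A 8
    #define MAX_B 16
    #define MAX_C 4

    __global__ void per_thread_arrays(int a_len, int b_len, int c_len)
    {
        // Thread-private arrays in local memory, sized to the upper bounds;
        // only the first a_len/b_len/c_len elements are actually used.
        float  a[MAX_A];
        double b[MAX_B];
        short  c[MAX_C];

        for (int i = 0; i < a_len && i < MAX_A; ++i) a[i] = i * 1.0f;
        for (int i = 0; i < b_len && i < MAX_B; ++i) b[i] = i * 2.0;
        for (int i = 0; i < c_len && i < MAX_C; ++i) c[i] = (short)i;
        // ... kernel work using a, b, c ...
    }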
QUESTION
How do I work around an nvprof crash that occurs when running on a disk with a relatively small amount of space available?
Specifically, when profiling my CUDA kernel, I use the following two commands:
...
ANSWER
Answered 2019-May-31 at 19:24
One can direct nvprof to use a different temporary directory by setting the TMPDIR environment variable. This is helpful because, since Linux kernel 2.6, there's a decent chance that you have a RAM disk available at /dev/shm (see https://superuser.com/a/45509/363816 for more info). Thus, adding the following at the beginning of one's [bash] script will likely work around your issue.
Community Discussions and Code Snippets contain sources that include the Stack Exchange Network.
Vulnerabilities
No vulnerabilities reported
Install mykernel
The mykernel-2.0 patch was generated with this command: diff -Naur linux-5.4.34 linux-5.4.34-mykernel > mykernel-2.0_for_linux-5.4.34.patch