CL-basic | simple prototype for a basic OpenCL host | GPU library
kandi X-RAY | CL-basic Summary
A very simple prototype for a basic OpenCL host and kernel code
Community Discussions
Trending Discussions on CL-basic
QUESTION
I am trying to translate single-threaded serial code for the MJPEG decoder into OpenCL code which I want to execute on the GPU (NVIDIA Tesla k20c).
After translating several major functions into kernels, the execution time of the code has gone from about 18 ms per frame to an abysmal 400 ms per frame.
I am using a standard method of opening a file, reading it, and using buffer and NDRange commands to execute code on the GPU and read the results back on the CPU. I feel that transferring the MJPEG file (which is of the data type FILE *) to the GPU's memory would considerably cut down the communication overhead while the code is processed.
I referred to this link but the suggestions are only applicable to CUDA. This source and NVIDIA's OpenCL guide explain the utility of pinned memory but their usage of pinned memory is confined to kernel parameters and buffer commands.
I want to transfer the entire MJPEG file (size is about 2.8 MB) to the GPU's memory but I am struggling to find resources which do it.
Can I do this safely? If this can be done, how can I read the file to perform the various steps of MJPEG decoding?
EDIT:
The details of my GPU are as follows:
...

ANSWER
Answered 2017-Sep-05 at 21:02

There's nothing stopping you from copying the literal data of the image into a buffer in host memory and then copying it to the GPU:
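The answer's code is not preserved in this snippet. A minimal sketch of the idea in OpenCL host C, assuming context and queue are an already-created cl_context and cl_command_queue (error handling abbreviated, function name illustrative):

#include <CL/cl.h>
#include <stdio.h>
#include <stdlib.h>

/* Read the raw MJPEG bytes into host memory, then copy them into a
   device buffer that the decoding kernels can parse. */
cl_mem upload_file(cl_context context, cl_command_queue queue,
                   const char *path, size_t *out_size)
{
    FILE *f = fopen(path, "rb");
    if (!f) return NULL;

    fseek(f, 0, SEEK_END);
    size_t size = (size_t)ftell(f);
    fseek(f, 0, SEEK_SET);

    unsigned char *host = malloc(size);
    fread(host, 1, size, f);
    fclose(f);

    cl_int err;
    cl_mem buf = clCreateBuffer(context, CL_MEM_READ_ONLY, size, NULL, &err);
    if (err == CL_SUCCESS)
        /* Blocking write: returns once the bytes are resident on the device. */
        err = clEnqueueWriteBuffer(queue, buf, CL_TRUE, 0, size, host,
                                   0, NULL, NULL);
    free(host);

    if (err != CL_SUCCESS) {
        if (buf) clReleaseMemObject(buf);
        return NULL;
    }
    *out_size = size;
    return buf;
}

The resulting buffer can then be set as a __global uchar * kernel argument with clSetKernelArg.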
QUESTION
As is known, there are the WARP (in CUDA) and the WaveFront (in OpenCL): http://courses.cs.washington.edu/courses/cse471/13sp/lectures/GPUsStudents.pdf
4.1. SIMT Architecture
...
A warp executes one common instruction at a time, so full efficiency is realized when all 32 threads of a warp agree on their execution path. If threads of a warp diverge via a data-dependent conditional branch, the warp serially executes each branch path taken, disabling threads that are not on that path, and when all paths complete, the threads converge back to the same execution path. Branch divergence occurs only within a warp; different warps execute independently regardless of whether they are executing common or disjoint code paths.
The SIMT architecture is akin to SIMD (Single Instruction, Multiple Data) vector organizations in that a single instruction controls multiple processing elements. A key difference is that SIMD vector organizations expose the SIMD width to the software, whereas SIMT instructions specify the execution and branching behavior of a single thread.
- WaveFront in OpenCL: https://sites.google.com/site/csc8820/opencl-basics/opencl-terms-explained#TOC-Wavefront
During runtime, the first wavefront is sent to the compute unit to run, then the second wavefront is sent to the compute unit, and so on. Work items within one wavefront are executed in parallel and in lock steps. But different wavefronts are executed sequentially.
I.e., we know that:
threads in a WARP (CUDA) are SIMT threads, which always execute the same instruction at the same time and always stay synchronized - i.e. threads of a WARP are the same as lanes of SIMD (on a CPU)
threads in a WaveFront (OpenCL) are threads which always execute in parallel, but not necessarily all of them perform the exact same instruction, and not necessarily all of them are synchronized
But is there any guarantee that all of the threads in a WaveFront are always synchronized, like threads in a WARP or lanes in SIMD?
Conclusion:
- WaveFront threads (work-items) are always synchronized - lock step: "wavefront executes a number of work-items in lock step relative to each other."
- A WaveFront is mapped onto a SIMD block: "all work-items in the wavefront go to both paths of flow control"
- I.e. each WaveFront thread (work-item) is mapped to a SIMD lane
Chapter 1 OpenCL Architecture and AMD Accelerated Parallel Processing
1.1 Terminology
...
Wavefronts and work-groups are two concepts relating to compute kernels that provide data-parallel granularity. A wavefront executes a number of work-items in lock step relative to each other. Sixteen work-items are executed in parallel across the vector unit, and the whole wavefront is covered over four clock cycles. It is the lowest level that flow control can affect. This means that if two work-items inside of a wavefront go divergent paths of flow control, all work-items in the wavefront go to both paths of flow control.
This is true for: http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_OpenCL_Programming_Optimization_Guide2.pdf
- (page-45) Chapter 2 OpenCL Performance and Optimization for GCN Devices
- (page-81) Chapter 3 OpenCL Performance and Optimization for Evergreen and Northern Islands Devices
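For reference, the data-dependent branch that both guides describe looks like this in an OpenCL kernel (a minimal illustrative sketch, not code from the question): on lock-step hardware the whole wavefront/warp walks through both paths, masking off the work-items that did not take the current one.

__kernel void divergent(__global const float *in, __global float *out)
{
    size_t gid = get_global_id(0);
    if (in[gid] >= 0.0f)
        out[gid] = in[gid] * 2.0f;   /* path A */
    else
        out[gid] = -in[gid];         /* path B: divergence if only some lanes take it */
}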
ANSWER
Answered 2017-Feb-15 at 22:41

First, you can query some values:
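The answer's code is not preserved in this snippet. A minimal sketch of one such query, assuming kernel and device are an already-built cl_kernel and its cl_device_id: the portable value to inspect is the preferred work-group size multiple, which typically equals the warp size (32) on NVIDIA and the wavefront size (64) on AMD GCN; vendor extensions (cl_nv_device_attribute_query, cl_amd_device_attribute_query) expose the width more directly.

#include <CL/cl.h>
#include <stdio.h>

void print_simd_width(cl_kernel kernel, cl_device_id device)
{
    size_t multiple = 0;
    /* Standard OpenCL 1.1+ query; on most GPUs this reports the warp/wavefront size. */
    cl_int err = clGetKernelWorkGroupInfo(
        kernel, device, CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
        sizeof(multiple), &multiple, NULL);
    if (err == CL_SUCCESS)
        printf("preferred work-group size multiple: %zu\n", multiple);
}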
Community Discussions, Code Snippets contain sources that include Stack Exchange Network
Vulnerabilities
No vulnerabilities reported