MIOpenGEMM | OpenCL general matrix multiplication API | GPU library
kandi X-RAY | MIOpenGEMM Summary
An OpenCL general matrix multiplication (GEMM) API and kernel generator. More information is available on the wiki.
MIOpenGEMM Key Features
MIOpenGEMM Examples and Code Snippets
mkdir build; cd build;
cmake ..
cmake -DOPENCL_LIBRARIES=<opencl-library-path> -DOPENCL_INCLUDE_DIRS=<opencl-headers-path> ..
cmake -DCMAKE_INSTALL_PREFIX=<install-prefix> ..
Community Discussions
Trending Discussions on MIOpenGEMM
QUESTION
I have an OpenCL kernel that multiplies 2 matrices (GEMM) with M=4096, N=4096 and K=16 (i.e. matrices of 4096 x 16 floats).
I run it on Polaris 560, 16CU GPU.
Code: https://github.com/artyom-beilis/oclblas/blob/master/gemm/gemm.cl
I noticed a very strange performance drop for this size: matrix multiplication with this size achieves ~8-10 GFlops, while if I change N to 4095 or 4097 I get around 130-150 GFlops. I noticed similar behaviour with other GEMM libraries like clBlas or MIOpenGEMM: there is a significant performance drop for this particular size of 4096x16, and changing N by 1 boosts the performance several times.
The workload is split into work-groups of 256 threads. Each work-group handles 128x16 and 128x16 matrix tiles (an 8x8 block per thread).
I tried changing the matrix tiling to 96x96 with 6x6 blocks instead of 128x128 with 8x8 - same result.
I tested the same code with ROCm 3.7 OpenCL, Clover OpenCL and even with the Windows OpenCL driver - same behaviour.
There is no such issue on an NVIDIA GTX 960, which has the same number of GPU cores (threads) and the same memory type/size.
I suspect that this is somehow cache/collision related, but I don't understand how it happens, so I don't know how to work around it.
ANSWER
Answered 2021-Jul-09 at 20:28

Finally I found that the clBlas library (originally developed for AMD) special-cases lda % 1024 == 0 and ldb % 1024 == 0, probably because of the cache.
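The cache connection can be illustrated with a toy model. The sketch below is a rough illustration, not AMD's actual cache geometry: the line size, the set count, and the `distinct_sets` helper are all assumptions. It counts how many distinct cache sets are touched when walking down a column of a row-major matrix with leading dimensions of 4095, 4096 and 4097 floats:

```python
# Toy model of a set-associative cache: count the distinct cache sets touched
# when stepping through memory with a fixed row stride (the leading dimension).
# The cache parameters below are assumptions chosen for illustration only.

LINE_BYTES = 64    # assumed cache-line size
NUM_SETS = 1024    # assumed number of sets (e.g. a 256 KB 4-way cache)
FLOAT_BYTES = 4

def distinct_sets(stride_floats, rows=128):
    """Number of distinct cache sets hit by `rows` column accesses."""
    stride_bytes = stride_floats * FLOAT_BYTES
    return len({(i * stride_bytes // LINE_BYTES) % NUM_SETS
                for i in range(rows)})

for n in (4095, 4096, 4097):
    print(n, distinct_sets(n))
```

With a power-of-two stride every access lands in a handful of sets (only 4 here for 4096), so the lines keep evicting each other; nudging the leading dimension by one spreads the accesses across many more sets, which matches the observed speedup at N=4095/4097.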
I found that a better way was to rearrange the blocks in z-curve order instead of queuing several kernels.

https://github.com/artyom-beilis/oclblas/blob/master/gemm/gemm.cl#L109

To handle the cases M != N or M != 1<<n, I just increased the number of work groups over M/N to the nearest 1<<n; groups that have no work exit at the beginning, adding very little overhead.

The z-order rearrangement improved performance about 4x.
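As a rough sketch of the idea (the function name and the 16-bit grid limit are my assumptions, not code from the linked kernel), a linear work-group id can be de-interleaved into z-curve tile coordinates like this:

```python
# Sketch: map a linear work-group id to (tile_x, tile_y) in z-order (Morton
# order), so consecutive work-groups touch nearby tiles of both matrices.
# The grid is padded up to a power of two; out-of-range groups would exit early.

def z_order_tile(group_id):
    """De-interleave the bits of group_id into (x, y) tile coordinates."""
    x = y = 0
    for bit in range(16):  # supports tile grids up to 2**16 x 2**16
        x |= ((group_id >> (2 * bit)) & 1) << bit       # even bits -> x
        y |= ((group_id >> (2 * bit + 1)) & 1) << bit   # odd bits  -> y
    return x, y

# The first few ids trace the familiar Z curve over a 4x4 grid:
print([z_order_tile(i) for i in range(8)])
# → [(0, 0), (1, 0), (0, 1), (1, 1), (2, 0), (3, 0), (2, 1), (3, 1)]
```

Because neighbouring ids share most of their tile coordinates, the tiles they load overlap in cache far more often than with plain row-major group ordering.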
Community Discussions, Code Snippets contain sources that include Stack Exchange Network
Vulnerabilities
No vulnerabilities reported
Install MIOpenGEMM
All examples can be built together, or individually by name.
HTML and PDF documentation can also be built. This creates a local searchable web site inside the ./MIOpenGEMM/doc/html folder and a PDF document inside the ./MIOpenGEMM/doc/pdf folder. Documentation is generated using Doxygen, which should be installed separately. HTML and PDFs are generated using Sphinx and Breathe, with the ReadTheDocs theme.
Support