Simd | C++ image processing and machine learning library | Computer Vision library
kandi X-RAY | Simd Summary
Trending Discussions on Simd
QUESTION
I'm creating an int (32 bit) vector with 1024 * 1024 * 1024 elements like so:
std::vector<int> nums;
for (size_t i = 0; i < 1024 * 1024 * 1024; i++) {
nums.push_back(rand() % 1024);
}
which holds 4 GB of random data at that point. And then I'm simply summing up all the elements in the vector like so:
uint64_t total = 0;
for (auto cn = nums.begin(); cn < nums.end(); cn++) {
total += *cn;
}
This takes about ~0.18 seconds which means the data is processed at around 22.2 GB/s. I'm running this on an M1 with a much higher memory bandwidth of about 60GB/s. Is there a way to make the above code run faster on a single core?
EDIT: Manual SIMD version:
int32x4_t simd_total = vmovq_n_s32(0);
for (auto cn = nums.begin(); cn < nums.end()-3; cn +=4) {
const int32_t v[4] = {cn[0], cn[1], cn[2], cn[3]};
simd_total = vaddq_s32(simd_total, vld1q_s32(v));
}
return vaddvq_s32(simd_total);
The SIMD version has the same performance as the non-manual-SIMD version.
EDIT 2: Alright, so I changed the vector elements to uint32_t and also changed the result type to uint32_t (as suggested by @Peter Cordes):
uint32_t sum_ints_32(const std::vector<uint32_t>& nums) {
uint32_t total = 0;
for (auto cn = nums.begin(); cn < nums.end(); cn++) {
total += *cn;
}
return total;
}
This runs much faster (~45 GB/s). This is the disassembly:
0000000100002218 <__Z11sum_ints_32RKNSt3__16vectorIjNS_9allocatorIjEEEE>:
100002218: a940200c ldp x12, x8, [x0]
10000221c: eb08019f cmp x12, x8
100002220: 54000102 b.cs 100002240 <__Z11sum_ints_32RKNSt3__16vectorIjNS_9allocatorIjEEEE+0x28> // b.hs, b.nlast
100002224: aa2c03e9 mvn x9, x12
100002228: 8b090109 add x9, x8, x9
10000222c: f1006d3f cmp x9, #0x1b
100002230: 540000c8 b.hi 100002248 <__Z11sum_ints_32RKNSt3__16vectorIjNS_9allocatorIjEEEE+0x30> // b.pmore
100002234: 52800000 mov w0, #0x0 // #0
100002238: aa0c03e9 mov x9, x12
10000223c: 14000016 b 100002294 <__Z11sum_ints_32RKNSt3__16vectorIjNS_9allocatorIjEEEE+0x7c>
100002240: 52800000 mov w0, #0x0 // #0
100002244: d65f03c0 ret
100002248: d342fd29 lsr x9, x9, #2
10000224c: 9100052a add x10, x9, #0x1
100002250: 927ded4b and x11, x10, #0x7ffffffffffffff8
100002254: 8b0b0989 add x9, x12, x11, lsl #2
100002258: 9100418c add x12, x12, #0x10
10000225c: 6f00e400 movi v0.2d, #0x0
100002260: aa0b03ed mov x13, x11
100002264: 6f00e401 movi v1.2d, #0x0
100002268: ad7f8d82 ldp q2, q3, [x12, #-16]
10000226c: 4ea08440 add v0.4s, v2.4s, v0.4s
100002270: 4ea18461 add v1.4s, v3.4s, v1.4s
100002274: 9100818c add x12, x12, #0x20
100002278: f10021ad subs x13, x13, #0x8
10000227c: 54ffff61 b.ne 100002268 <__Z11sum_ints_32RKNSt3__16vectorIjNS_9allocatorIjEEEE+0x50> // b.any
100002280: 4ea08420 add v0.4s, v1.4s, v0.4s
100002284: 4eb1b800 addv s0, v0.4s
100002288: 1e260000 fmov w0, s0
10000228c: eb0b015f cmp x10, x11
100002290: 540000a0 b.eq 1000022a4 <__Z11sum_ints_32RKNSt3__16vectorIjNS_9allocatorIjEEEE+0x8c> // b.none
100002294: b840452a ldr w10, [x9], #4
100002298: 0b000140 add w0, w10, w0
10000229c: eb08013f cmp x9, x8
1000022a0: 54ffffa3 b.cc 100002294 <__Z11sum_ints_32RKNSt3__16vectorIjNS_9allocatorIjEEEE+0x7c> // b.lo, b.ul, b.last
1000022a4: d65f03c0 ret
I also rewrote the Manual-SIMD version:
uint32_t sum_ints_simd_2(const std::vector<uint32_t>& nums) {
uint32x4_t simd_total = vmovq_n_u32(0);
for (auto cn = nums.begin(); cn < nums.end()-3; cn +=4) {
const uint32_t v[4] = { cn[0], cn[1], cn[2], cn[3] };
simd_total = vaddq_u32(simd_total, vld1q_u32(v));
}
return vaddvq_u32(simd_total);
}
which still runs 2x slower than the non-manual-SIMD version and results in the following disassembly:
0000000100002464 <__Z15sum_ints_simd_2RKNSt3__16vectorIjNS_9allocatorIjEEEE>:
100002464: a9402408 ldp x8, x9, [x0]
100002468: d1003129 sub x9, x9, #0xc
10000246c: 6f00e400 movi v0.2d, #0x0
100002470: eb09011f cmp x8, x9
100002474: 540000c2 b.cs 10000248c <__Z15sum_ints_simd_2RKNSt3__16vectorIjNS_9allocatorIjEEEE+0x28> // b.hs, b.nlast
100002478: 6f00e400 movi v0.2d, #0x0
10000247c: 3cc10501 ldr q1, [x8], #16
100002480: 4ea08420 add v0.4s, v1.4s, v0.4s
100002484: eb09011f cmp x8, x9
100002488: 54ffffa3 b.cc 10000247c <__Z15sum_ints_simd_2RKNSt3__16vectorIjNS_9allocatorIjEEEE+0x18> // b.lo, b.ul, b.last
10000248c: 4eb1b800 addv s0, v0.4s
100002490: 1e260000 fmov w0, s0
100002494: d65f03c0 ret
To reach the same speed as the auto-vectorized version, we can use a uint32x4x2_t instead of a uint32x4_t for our manual-SIMD version:
uint32_t sum_ints_simd_3(const std::vector<uint32_t>& nums) {
uint32x4x2_t simd_total;
simd_total.val[0] = vmovq_n_u32(0);
simd_total.val[1] = vmovq_n_u32(0);
for (auto cn = nums.begin(); cn < nums.end()-7; cn +=8) {
const uint32_t v[4] = { cn[0], cn[1], cn[2], cn[3] };
const uint32_t v2[4] = { cn[4], cn[5], cn[6], cn[7] };
simd_total.val[0] = vaddq_u32(simd_total.val[0], vld1q_u32(v));
simd_total.val[1] = vaddq_u32(simd_total.val[1], vld1q_u32(v2));
}
return vaddvq_u32(simd_total.val[0]) + vaddvq_u32(simd_total.val[1]);
}
And to gain even more speed we can leverage uint32x4x4_t (which gets us about ~53 GB/s):
uint32_t sum_ints_simd_4(const std::vector<uint32_t>& nums) {
uint32x4x4_t simd_total;
simd_total.val[0] = vmovq_n_u32(0);
simd_total.val[1] = vmovq_n_u32(0);
simd_total.val[2] = vmovq_n_u32(0);
simd_total.val[3] = vmovq_n_u32(0);
for (auto cn = nums.begin(); cn < nums.end()-15; cn +=16) {
const uint32_t v[4] = { cn[0], cn[1], cn[2], cn[3] };
const uint32_t v2[4] = { cn[4], cn[5], cn[6], cn[7] };
const uint32_t v3[4] = { cn[8], cn[9], cn[10], cn[11] };
const uint32_t v4[4] = { cn[12], cn[13], cn[14], cn[15] };
simd_total.val[0] = vaddq_u32(simd_total.val[0], vld1q_u32(v));
simd_total.val[1] = vaddq_u32(simd_total.val[1], vld1q_u32(v2));
simd_total.val[2] = vaddq_u32(simd_total.val[2], vld1q_u32(v3));
simd_total.val[3] = vaddq_u32(simd_total.val[3], vld1q_u32(v4));
}
return vaddvq_u32(simd_total.val[0])
+ vaddvq_u32(simd_total.val[1])
+ vaddvq_u32(simd_total.val[2])
+ vaddvq_u32(simd_total.val[3]);
}
which gets us the following disassembly:
0000000100005e34 <__Z15sum_ints_simd_4RKNSt3__16vectorIjNS_9allocatorIjEEEE>:
100005e34: a9402408 ldp x8, x9, [x0]
100005e38: d100f129 sub x9, x9, #0x3c
100005e3c: 6f00e403 movi v3.2d, #0x0
100005e40: 6f00e402 movi v2.2d, #0x0
100005e44: 6f00e401 movi v1.2d, #0x0
100005e48: 6f00e400 movi v0.2d, #0x0
100005e4c: eb09011f cmp x8, x9
100005e50: 540001c2 b.cs 100005e88 <__Z15sum_ints_simd_4RKNSt3__16vectorIjNS_9allocatorIjEEEE+0x54> // b.hs, b.nlast
100005e54: 6f00e400 movi v0.2d, #0x0
100005e58: 6f00e401 movi v1.2d, #0x0
100005e5c: 6f00e402 movi v2.2d, #0x0
100005e60: 6f00e403 movi v3.2d, #0x0
100005e64: ad401504 ldp q4, q5, [x8]
100005e68: ad411d06 ldp q6, q7, [x8, #32]
100005e6c: 4ea38483 add v3.4s, v4.4s, v3.4s
100005e70: 4ea284a2 add v2.4s, v5.4s, v2.4s
100005e74: 4ea184c1 add v1.4s, v6.4s, v1.4s
100005e78: 4ea084e0 add v0.4s, v7.4s, v0.4s
100005e7c: 91010108 add x8, x8, #0x40
100005e80: eb09011f cmp x8, x9
100005e84: 54ffff03 b.cc 100005e64 <__Z15sum_ints_simd_4RKNSt3__16vectorIjNS_9allocatorIjEEEE+0x30> // b.lo, b.ul, b.last
100005e88: 4eb1b863 addv s3, v3.4s
100005e8c: 1e260068 fmov w8, s3
100005e90: 4eb1b842 addv s2, v2.4s
100005e94: 1e260049 fmov w9, s2
100005e98: 0b080128 add w8, w9, w8
100005e9c: 4eb1b821 addv s1, v1.4s
100005ea0: 1e260029 fmov w9, s1
100005ea4: 0b090108 add w8, w8, w9
100005ea8: 4eb1b800 addv s0, v0.4s
100005eac: 1e260009 fmov w9, s0
100005eb0: 0b090100 add w0, w8, w9
100005eb4: d65f03c0 ret
Crazy stuff
ANSWER
Answered 2021-Jun-14 at 17:01
Here are some techniques.
Loop Unrolling
uint64_t total = 0;
for (auto cn = nums.begin(); cn < nums.end(); cn += 4)
{
total += cn[0];
total += cn[1];
total += cn[2];
total += cn[3];
}
uint64_t total = 0;
for (auto cn = nums.begin(); cn < nums.end(); cn += 4)
{
const uint64_t n0 = cn[0];
const uint64_t n1 = cn[1];
const uint64_t n2 = cn[2];
const uint64_t n3 = cn[3];
total += n0;
total += n1;
total += n2;
total += n3;
}
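A further variant uses four independent partial sums so the additions don't form a single serial dependency chain (this is the same idea that makes the questioner's uint32x4x2/uint32x4x4 versions faster); a sketch, assuming the element count is a multiple of 4:
uint64_t total0 = 0, total1 = 0, total2 = 0, total3 = 0;
for (auto cn = nums.begin(); cn < nums.end(); cn += 4)
{
    total0 += cn[0];    // four independent accumulators:
    total1 += cn[1];    // no single serial chain of additions
    total2 += cn[2];
    total3 += cn[3];
}
uint64_t total = total0 + total1 + total2 + total3;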
You should print the assembly language for each of these at high optimization level and compare them.
Also, your processor may have some specialized instructions that you could use. For example, the ARM processor can load multiple registers from memory with one instruction.
Also, look up SIMD instructions or search the internet for "C++ SIMD read memory".
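As an illustration (a sketch, not drop-in code for the snippets above): a NEON version that loads directly from the data with vld1q_u32 instead of copying through a temporary array, using two independent accumulators. It assumes AArch64 and <arm_neon.h>.
#include <arm_neon.h>
#include <cstddef>
#include <cstdint>

uint32_t sum_u32_neon(const uint32_t* p, std::size_t n) {
    uint32x4_t acc0 = vmovq_n_u32(0);
    uint32x4_t acc1 = vmovq_n_u32(0);
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        acc0 = vaddq_u32(acc0, vld1q_u32(p + i));      // load 4 lanes straight from memory
        acc1 = vaddq_u32(acc1, vld1q_u32(p + i + 4));  // second, independent accumulator
    }
    uint32_t total = vaddvq_u32(acc0) + vaddvq_u32(acc1);
    for (; i < n; ++i) total += p[i];                  // scalar tail
    return total;
}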
I've argued with compilers (on embedded systems) and found that the compiler's optimization strategies may be better than or equal to instruction specialization or other techniques (timings were performed using test points and an oscilloscope).
You'll have to remember that your task, on a one-core machine, will most likely be swapped out more often than on a system with multiple cores or a specialized (embedded) system.
QUESTION
I got the idea of using pre-defined functions to do this: calculate "a + b", "c * 5", "d * 3" and then add the results.
But this way seems to generate a lot of code. Is there any better method to do this?
By the way, does Apache Arrow use SIMD by default (C++ version)? If not, how can I make it use SIMD?
ANSWER
Answered 2021-Jun-14 at 12:27
PyArrow doesn't currently override operators in Python, but you can easily call the arithmetic compute functions. (functools.reduce is used here since the addition kernel is binary, not n-ary.)
PyArrow automatically uses SIMD, based on what flags it was compiled with. It should use the 'highest' SIMD level that it was compiled with and that your CPU supports. Not all compute function implementations leverage SIMD internally; right now it looks like it's mostly the aggregation kernels that do so.
>>> import pyarrow as pa
>>> import pyarrow.compute as pc
>>> import functools
>>> pa.__version__
'4.0.1'
>>> a = pa.array([1,2,3])
>>> b = pa.array([3,4,5])
>>> c = pa.array([1,0,1])
>>> d = pa.array([2,4,2])
>>> functools.reduce(pc.add, [pc.add(a,b), pc.multiply(c, 5), pc.multiply(d, 3)])
[
15,
18,
19
]
QUESTION
I have trouble understanding something about simd_packed vectors in the simd module in Swift. I use the example of float4, I hope someone can help.
My understanding is that simd_float4 is a typealias of SIMD4<Float>, and MemoryLayout<SIMD4<Float>>.alignment = 16 (bytes), hence MemoryLayout<simd_float4>.alignment = 16. Makes sense.
But the following I do not understand: simd_packed_float4 is also a typealias of SIMD4<Float>, and so MemoryLayout<simd_packed_float4>.alignment = 16.
What is the point of the "packed" in simd_packed_float4, then? Where is the "relaxed alignment" that the documentation talks about?
In the Metal Shading Language Specification (Version 2.4) ( https://developer.apple.com/metal/Metal-Shading-Language-Specification.pdf ), Table 2.4 (p. 28) says the alignment of packed_float4 is 4 (which is also the alignment of the scalar type, float), so this IS a "relaxed alignment" (as compared to 16). That makes sense on its own, but how do I reconcile it with the above (simd_packed_float4 is a typealias of SIMD4<Float> and MemoryLayout<simd_packed_float4>.alignment = 16)?
ANSWER
Answered 2021-Jun-12 at 03:45I actually think it's impossible to achieve relaxed alignment like this with a packed type in Swift. I think Swift compiler just can't bring the alignment attributes to actual Swift interface.
I think this makes simd_packed_float4
useless in Swift.
I have made a playground to check this, and using it as it's intended doesn't work.
import simd
MemoryLayout<simd_packed_float4>.stride
MemoryLayout<simd_packed_float4>.alignment
let capacity = 8
let buffer = UnsafeMutableBufferPointer<Float>.allocate(capacity: capacity)
for i in 0..<capacity {
    buffer[i] = Float(i)
}
let rawBuffer = UnsafeMutableRawBufferPointer(buffer)
let readAligned = rawBuffer.load(fromByteOffset: MemoryLayout<Float>.stride * 4, as: simd_packed_float4.self)
print(readAligned)
let readUnaligned = rawBuffer.load(fromByteOffset: MemoryLayout<Float>.stride * 2, as: simd_packed_float4.self)
print(readUnaligned)
Which will output
SIMD4(4.0, 5.0, 6.0, 7.0)
Swift/UnsafeRawPointer.swift:900: Fatal error: load from misaligned raw pointer
If you do need to load or put unaligned simd_float4 vectors into buffers, I would suggest just making an extension that does this component-wise, so all the alignments work out, kinda like this
extension UnsafeMutableRawBufferPointer {
    func loadFloat4(fromByteOffset offset: Int) -> simd_float4 {
        let x = load(fromByteOffset: offset + MemoryLayout<Float>.stride * 0, as: Float.self)
        let y = load(fromByteOffset: offset + MemoryLayout<Float>.stride * 1, as: Float.self)
        let z = load(fromByteOffset: offset + MemoryLayout<Float>.stride * 2, as: Float.self)
        let w = load(fromByteOffset: offset + MemoryLayout<Float>.stride * 3, as: Float.self)
        return simd_float4(x, y, z, w)
    }
}
let readUnaligned2 = rawBuffer.loadFloat4(fromByteOffset: MemoryLayout<Float>.stride * 2)
print(readUnaligned2)
Or you can even make it generic
QUESTION
I have a minimally reproducible sample which is as follows -
#include <iostream>
#include <chrono>
#include <cstddef>
#include <omp.h>
template<typename type>
void AddMatrixOpenMP(type* matA, type* matB, type* result, size_t size){
    for(size_t i=0; i < size * size; i++){
        result[i] = matA[i] + matB[i];
    }
}
int main(){
    size_t size = 8192;
    // three size*size float matrices (~256 MiB each)
    float* matA = new float[size * size];
    float* matB = new float[size * size];
    float* result = new float[size * size];
    for(size_t i = 0; i < size * size; i++){   // initialize the two inputs
        matA[i] = 1.0f;
        matB[i] = 2.0f;
    }
    auto start = std::chrono::high_resolution_clock::now();
    for(int run = 0; run < 500; run++){
        AddMatrixOpenMP<float>(matA, matB, result, size);
    }
    auto end = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();
    std::cout << "Average Time is = " << duration / 500 << std::endl;
}
I experiment as follows: I time the code with the #pragma omp for simd directive on the loop in the AddMatrixOpenMP function, and then time it without the directive. I compile the code as follows: g++ -O3 -fopenmp example.cpp
Upon inspecting the assembly, both variants generate vector instructions, but when the OpenMP pragma is explicitly specified, the code runs 3 times slower.
I am not able to understand why.
Edit - I am running GCC 9.3 and OpenMP 4.5. This is running on an i7 9750h 6C/12T on Ubuntu 20.04. I ensured no major processes were running in the background. The CPU frequency held more or less constant during the run for both versions (Minor variations from 4.0 to 4.1)
TIA
ANSWER
Answered 2021-Jun-11 at 14:46
The non-OpenMP vectorizer is defeating your benchmark with loop inversion.
Make your function __attribute__((noinline, noclone)) to stop GCC from inlining it into the repeat loop. For cases like this, with functions large enough that call/ret overhead is minor and constant propagation isn't important, this is a pretty good way to make sure that the compiler doesn't hoist work out of the loop.
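Applied to the function from the question, that might look like this (a sketch):
#include <cstddef>

template<typename type>
__attribute__((noinline, noclone))   // keep GCC from inlining/cloning this into the repeat loop
void AddMatrixOpenMP(type* matA, type* matB, type* result, size_t size) {
    for (size_t i = 0; i < size * size; i++) {
        result[i] = matA[i] + matB[i];
    }
}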
And in future, check the asm, and/or make sure the benchmark time scales linearly with the iteration count. e.g. increasing 500 up to 1000 should give the same average time in a benchmark that's working properly, but it won't with -O3. (Although it's surprisingly close here, so that smell test doesn't definitively detect the problem!)
After adding the missing #pragma omp simd to the code, yeah I can reproduce this. On i7-6700k Skylake (3.9GHz with DDR4-2666) with GCC 10.2 -O3 (without -march=native or -fopenmp), I get 18266, but with -O3 -fopenmp I get avg time 39772.
With the OpenMP vectorized version, if I look at top while it runs, memory usage (RSS) is steady at 771 MiB. (As expected: init code faults in the two inputs, and the first iteration of the timed region writes to result, triggering page-faults for it, too.)
But with the "normal" vectorizer (not OpenMP), I see the memory usage climb from ~500 MiB until it exits just as it reaches the max 770 MiB.
So it looks like gcc -O3 performed some kind of loop inversion after inlining and defeated the memory-bandwidth-intensive aspect of your benchmark loop, only touching each array element once.
The asm shows the evidence: GCC 9.3 -O3 on Godbolt doesn't vectorize, and it leaves an empty inner loop instead of repeating the work.
.L4: # outer loop
movss xmm0, DWORD PTR [rbx+rdx*4]
addss xmm0, DWORD PTR [r13+0+rdx*4] # one scalar operation
mov eax, 500
.L3: # do {
sub eax, 1 # empty inner loop after inversion
jne .L3 # }while(--i);
add rdx, 1
movss DWORD PTR [rcx], xmm0
add rcx, 4
cmp rdx, 67108864
jne .L4
This is only 2 or 3x faster than fully doing the work. Probably because it's not vectorized, and it's effectively running a delay loop instead of optimizing away the empty inner loop entirely. And because modern desktops have very good single-threaded memory bandwidth.
Bumping up the repeat count from 500 to 1000 only improved the computed "average" from 18266 to 17821 us per iter. An empty loop still takes 1 iteration per clock. Normally scaling linearly with the repeat count is a good litmus test for broken benchmarks, but this is close enough to be believable.
There's also the overhead of page faults inside the timed region, but the whole thing runs for multiple seconds so that's minor.
The OpenMP vectorized version does respect your benchmark repeat-loop. (Or to put it another way, doesn't manage to find the huge optimization that's possible in this code.)
Looking at memory bandwidth while the benchmark is running: running intel_gpu_top -l while the proper benchmark is running (OpenMP, or with __attribute__((noinline, noclone))) shows the following. IMC is the Integrated Memory Controller on the CPU die, shared by the IA cores and the GPU via the ring bus; that's why a GPU-monitoring program is useful here.
$ intel_gpu_top -l
Freq MHz IRQ RC6 Power IMC MiB/s RCS/0 BCS/0 VCS/0 VECS/0
req act /s % W rd wr % se wa % se wa % se wa % se wa
0 0 0 97 0.00 20421 7482 0.00 0 0 0.00 0 0 0.00 0 0 0.00 0 0
3 4 14 99 0.02 19627 6505 0.47 0 0 0.00 0 0 0.00 0 0 0.00 0 0
7 7 20 98 0.02 19625 6516 0.67 0 0 0.00 0 0 0.00 0 0 0.00 0 0
11 10 22 98 0.03 19632 6516 0.65 0 0 0.00 0 0 0.00 0 0 0.00 0 0
3 4 13 99 0.02 19609 6505 0.46 0 0 0.00 0 0 0.00 0 0 0.00 0 0
Note the ~19.6GB/s read / 6.5GB/s write. Read ~= 3x write since it's not using NT stores for the output stream.
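For comparison, an explicitly non-temporal version of the add loop could look roughly like the sketch below; it assumes AVX, a 32-byte-aligned result pointer, and size*size divisible by 8 (none of which is claimed by the question or this answer).
#include <immintrin.h>
#include <cstddef>

void AddMatrixStream(const float* matA, const float* matB, float* result, std::size_t size) {
    for (std::size_t i = 0; i < size * size; i += 8) {
        __m256 a = _mm256_loadu_ps(matA + i);
        __m256 b = _mm256_loadu_ps(matB + i);
        _mm256_stream_ps(result + i, _mm256_add_ps(a, b));  // NT store: the output bypasses the cache
    }
    _mm_sfence();  // make the streaming stores visible before any later ordinary stores
}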
But with -O3 defeating the benchmark, with a 1000 repeat count, we see only near-idle levels of main-memory bandwidth.
Freq MHz IRQ RC6 Power IMC MiB/s RCS/0 BCS/0 VCS/0 VECS/0
req act /s % W rd wr % se wa % se wa % se wa % se wa
...
8 8 17 99 0.03 365 85 0.62 0 0 0.00 0 0 0.00 0 0 0.00 0 0
9 9 17 99 0.02 349 90 0.62 0 0 0.00 0 0 0.00 0 0 0.00 0 0
4 4 5 100 0.01 303 63 0.25 0 0 0.00 0 0 0.00 0 0 0.00 0 0
7 7 15 100 0.02 345 69 0.43 0 0 0.00 0 0 0.00 0 0 0.00 0 0
10 10 21 99 0.03 350 74 0.64 0 0 0.00 0 0 0.00 0 0 0.00 0 0
vs. a baseline of 150 to 180 MB/s read, 35 to 50MB/s write when the benchmark isn't running at all. (I have some programs running that don't totally sleep even when I'm not touching the mouse / keyboard.)
QUESTION
I am working on a C++ intrinsic wrapper for x64 and neon. I want my functions to be constexpr. My motivation is similar to Constexpr and SSE intrinsics, but #pragma omp simd and intrinsics may not be supported by the compiler (GCC) in a constexpr function. The following code is just a demonstration (auto-vectorization is good enough for addition).
struct FA{
float c[4];
};
inline constexpr FA add(FA a, FA b){
FA result{};
#pragma omp simd // clang error: statement not allowed in constexpr function
for(int i = 0; i < 4; i++){ // GCC error: uninitialized variable 'i' in 'constexpr' function
result.c[i] = b.c[i] + a.c[i];
}
return result;
}
struct FA2{
__m128 c;
};
inline constexpr FA2 add2(FA2 a, FA2 b){
FA2 result{};
result.c = _mm_add_ps(a.c,b.c); // GCC error: call to non-'constexpr' function '__m128 _mm_add_ps(__m128, __m128)'
return result; // fine with clang
}
I have to provide reference C++ code for portability anyway. Is there a code efficient way to let the compiler use the reference code at compile time?
f(){
if(){
// constexpr version
}else{
// intrinsic version
}
}
It should work on all compilers that support omp, intrinsics and C++20.
ANSWER
Answered 2021-Jun-01 at 14:43
Using std::is_constant_evaluated, you can get exactly what you want:
#include <type_traits>
struct FA{
float c[4];
};
// Just for the sake of the example. Makes for nice-looking assembly.
extern FA add_parallel(FA a, FA b);
constexpr FA add(FA a, FA b) {
if (std::is_constant_evaluated()) {
// do it in a constexpr-friendly manner
FA result{};
for(int i = 0; i < 4; i++) {
result.c[i] = b.c[i] + a.c[i];
}
return result;
} else {
// can be anything that's not constexpr-friendly.
return add_parallel(a, b);
}
}
constexpr FA at_compile_time = add(FA{1,2,3,4}, FA{5,6,7,8});
FA at_runtime(FA a) {
return add(a, at_compile_time);
}
See on godbolt: https://gcc.godbolt.org/z/szhWKs3ec
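For completeness, the non-constexpr add_parallel above (declared extern in the answer) could itself be implemented with the SSE intrinsic from the question; a sketch assuming x86 and <immintrin.h>, reusing the FA struct defined above:
#include <immintrin.h>

// One possible runtime body for add_parallel.
FA add_parallel(FA a, FA b) {
    FA result{};
    __m128 va = _mm_loadu_ps(a.c);                  // load the four floats of each operand
    __m128 vb = _mm_loadu_ps(b.c);
    _mm_storeu_ps(result.c, _mm_add_ps(va, vb));    // add and store back into the result struct
    return result;
}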
QUESTION
This is yet another "SSE is slower than normal code! Why?" type of question.
I know that there are a bunch of similar questions but they don't seem to match my situation.
I am trying to implement Miller-Rabin primality test with Montgomery Modular Multiplication for fast modulo operations.
I tried to implement it in both scalar and SIMD way and it turns out that the SIMD version was around 10% slower.
Note that [esp+16] or [esp+12] is pointing to the modular inverse of N, in case anyone is wondering.
I am really puzzled by the fact that psrldq, supposedly a 1-cycle-latency, 1c-throughput, 1-uop instruction, takes longer than pmuludq (3-cycle latency, 0.5c throughput, 1 uop).
Below are the code and the run-time analysis in Visual Studio, run on a Ryzen 5 3600.
Any ideas on how to improve the SIMD code and/or why it is slower than the scalar code are appreciated.
P.S. Seems like the run time analysis is off by one instruction for some reason
EDIT 1: the comment on the image was wrong, I attached a fixed version below:
;------------------------------------------
; Calculate base^d mod x
;
; eax = 1
; esi = x
; edi = bases[eax]
; ebp = d
; while d do
; if d & 1 then eax = (eax * edi) mod x
; edi = (edi*edi) mod x
; d >>= 1
; end
;------------------------------------------
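(In C terms, the loop described by this comment corresponds roughly to the sketch below; montmul is a hypothetical placeholder for the Montgomery multiply-and-reduce step that the assembly performs inline.)
#include <cstdint>

// Sketch of the exponentiation loop from the comment above.
// montmul(a, b) is assumed to return (a * b) reduced modulo x via Montgomery reduction.
uint32_t modexp(uint32_t base, uint32_t d,
                uint32_t (*montmul)(uint32_t, uint32_t)) {
    uint32_t acc = 1;                // eax
    uint32_t b = base;               // edi = bases[eax]
    while (d) {                      // while d do
        if (d & 1)
            acc = montmul(acc, b);   // eax = (eax * edi) mod x
        b = montmul(b, b);           // edi = (edi * edi) mod x
        d >>= 1;
    }
    return acc;
}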
Scalar code:
LOOP_MODEXP:
push eax
test ebp, 1
jz @F
mul edi
mov ecx, edx
imul eax, DWORD PTR [esp+16]
mul esi
xor ebx, ebx
sub ecx, edx
cmovs ebx, esi
add ecx, ebx
mov DWORD PTR [esp], ecx
@@:
mov edx, edi
mulx ecx, edx, edi
imul edx, DWORD PTR [esp+16]
mulx eax, ebx, esi
xor ebx, ebx
sub ecx, eax
cmovs ebx, esi
add ecx, ebx
mov edi, ecx
pop eax
shr ebp, 1
jnz LOOP_MODEXP
SIMD code:
movd xmm2, DWORD PTR [esp+12]
movd xmm3, esi
pshufd xmm2, xmm2, 0
pshufd xmm3, xmm3, 0
movd xmm1, edi
pshufd xmm1, xmm1, 0
movdqa xmm0, xmm1
pinsrd xmm0, eax, 2
LOOP_MODEXP:
movdqa xmm4, xmm0
pmuludq xmm0, xmm1
movdqa xmm1, xmm0
pmuludq xmm0, xmm2
pmuludq xmm0, xmm3
psubd xmm1, xmm0
psrldq xmm1, 4
pxor xmm0, xmm0
pcmpgtd xmm0, xmm1
blendvps xmm0, xmm3, xmm0
paddd xmm0, xmm1
movddup xmm1, xmm0
test ebp, 1
jnz @F
blendps xmm0, xmm4, 4
@@:
shr ebp, 1
jnz LOOP_MODEXP
pextrd eax, xmm0, 2
ANSWER
Answered 2021-May-30 at 16:13
- Your SIMD code wastes time mispredicting that test ebp, 1 / jnz branch. There’s no conditional move instruction in SSE, but you can still optimize away that test + branch with a few more instructions like this:
mov ebx, ebp
and ebx, 1
sub ebx, 1
pxor xmm5, xmm5
pinsrd xmm5, ebx, 2
blendvps xmm0, xmm4, xmm5
instead of your
test ebp, 1
jnz @F
blendps xmm0, xmm4, 4
The above code computes ebx = ( ebp & 1 ) ? 0 : -1;, inserts that integer into the third lane of a zero vector, and uses that vector as the selector for the blendvps instruction.
- This instruction is not needed:
pcmpgtd xmm0, xmm1
Along with the previous and the next one, it computes this:
xmm0 = _mm_cmplt_epi32( xmm1, _mm_setzero_si128() );
xmm0 = _mm_blendv_ps( xmm0, xmm3, xmm0 );
Here’s an equivalent:
xmm0 = _mm_blendv_ps( _mm_setzero_si128(), xmm3, xmm1 );
That comparison instruction compares the int32 lanes for xmm1 < 0, which just reproduces the sign bit of those integers. The _mm_blendv_ps instruction only tests the high bit of each 32-bit lane, so you don't really need to compare for xmm1 < 0 before it.
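In intrinsics, with the integer/float casts written out, that simplification might look like this (a sketch, SSE4.1):
#include <immintrin.h>

// Blend xmm3 into a zero vector wherever the sign bit of the corresponding
// 32-bit lane of xmm1 is set; no separate pcmpgtd is needed because
// blendvps keys off the sign bit of each lane.
__m128i blend_by_sign(__m128i xmm1, __m128i xmm3) {
    __m128 res = _mm_blendv_ps(_mm_setzero_ps(),
                               _mm_castsi128_ps(xmm3),
                               _mm_castsi128_ps(xmm1));
    return _mm_castps_si128(res);
}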
- Unless you need to support CPUs without AVX, you should use VEX encoding of the instructions, even for code dealing with 16-byte vectors. Your SIMD code uses legacy encodings: most of those instructions take 2 arguments and write the result into the first one. Most VEX instructions take 3 arguments and write the result into a separate destination. This should get rid of a couple of redundant move instructions like movdqa xmm4, xmm0.
QUESTION
I am confused about the data-sharing scope of the variable acc in the following two cases. In case 1 I get the following compilation error: error: reduction variable ‘acc’ is private in outer context, whereas case 2 compiles without any issues.
According to this article, variables defined outside a parallel region are shared.
Why does adding for-loop parallelism privatize acc? How can I, in this case, accumulate the result calculated in the for-loop and distribute the loop's iteration space across a thread team?
case 1
float acc = 0.0f;
#pragma omp for simd reduction(+: acc)
for (int k = 0; k < MATRIX_SIZE; k++) {
float mul = alpha;
mul *= a[i * MATRIX_SIZE + k];
mul *= b[j * MATRIX_SIZE + k];
acc += mul;
}
case 2
float acc = 0.0f;
#pragma omp simd reduction(+: acc)
for (int k = 0; k < MATRIX_SIZE; k++) {
float mul = alpha;
mul *= a[i * MATRIX_SIZE + k];
mul *= b[j * MATRIX_SIZE + k];
acc += mul;
}
ANSWER
Answered 2021-May-26 at 18:08
Your case 1 is violating OpenMP semantics, as there's an implicit parallel region (see OpenMP Language Terminology, "sequential part") that contains the definition of acc. Thus, acc is indeed private to that implicit parallel region. This is what the compiler complains about.
Your case 2 is different in that the simd construct is not a worksharing construct and thus has a different definition of the semantics of the reduction clause.
Case 1 would be correct if you wrote it this way:
void example(void) {
float acc = 0.0f;
#pragma omp parallel for simd reduction(+: acc)
for (int k = 0; k < MATRIX_SIZE; k++) {
float mul = alpha;
mul *= a[i * MATRIX_SIZE + k];
mul *= b[j * MATRIX_SIZE + k];
acc += mul;
}
}
The acc variable is now defined outside of the parallel region that the for simd construct binds to.
QUESTION
I wrote a function to add up all the elements of a double[] array using SIMD (System.Numerics.Vector<double>) and the performance is worse than the naïve method.
On my computer Vector<double>.Count is 4, which means I could create an accumulator of 4 values and run through the array adding up the elements by groups.
For example a 10 element array, with a 4 element accumulator and 2 remaining elements I would get
// | loop | remainder
acc[0] = vector[0] + vector[4] + vector[8]
acc[1] = vector[1] + vector[5] + vector[9]
acc[2] = vector[2] + vector[6]
acc[3] = vector[3] + vector[7]
and the result sum = acc[0]+acc[1]+acc[2]+acc[3]
The code below produces the correct results, but the speed isn't there compared to just a loop adding up the values
public static double SumSimd(this Span<double> a)
{
var n = System.Numerics.Vector<double>.Count;
var count = a.Length;
// divide array into n=4 element groups
// Example, 57 = 14*4 + 3
var groups = Math.DivRem(count, n, out int remain);
var buffer = new double[n];
// Create buffer with remaining elements (not in groups)
a.Slice(groups*n, remain).CopyTo(buffer);
// Scan through all groups and accumulate
var accumulator = new System.Numerics.Vector<double>(buffer);
for (int i = 0; i < groups; i++)
{
//var next = new System.Numerics.Vector<double>(a, n * i);
var next = new System.Numerics.Vector<double>(a.Slice(n * i, n));
accumulator += next;
}
var sum = 0.0;
// Add up the elements of the accumulator vs
for (int j = 0; j < n; j++)
{
sum += accumulator[j];
}
return sum;
}
So my question is why aren't I realizing any benefits here with SIMD?
Baseline
The baseline code looks like this
public static double LinAlgSum(this ReadOnlySpan<double> span)
{
double sum = 0;
for (int i = 0; i < span.Length; i++)
{
sum += span[i];
}
return sum;
}
In benchmarking the SIMD code compared to the above, the SIMD code is 5× slower for size=7, 2.5× slower for size=144, and about the same for size=770.
I am running release mode using BenchmarkDotNet. Here is the driver class
[GroupBenchmarksBy(BenchmarkLogicalGroupRule.ByCategory)]
[CategoriesColumn]
public class LinearAlgebraBench
{
[Params(7, 35, 77, 144, 195, 311, 722)]
public int Size { get; set; }
[IterationSetup]
public void SetupData()
{
A = new LinearAlgebra.Vector(Size, (iter) => 2 * Size - iter).ToArray();
B = new LinearAlgebra.Vector(Size, (iter) => Size/2 + 2* iter).ToArray();
}
public double[] A { get; set; }
public double[] B { get; set; }
[BenchmarkCategory("Sum"), Benchmark(Baseline = true)]
public double BenchLinAlgSum()
{
return LinearAlgebra.LinearAlgebra.Sum(A.AsSpan().AsReadOnly());
}
[BenchmarkCategory("Sum"), Benchmark]
public double BenchSimdSum()
{
return LinearAlgebra.LinearAlgebra.SumSimd(A);
}
}
ANSWER
Answered 2021-May-19 at 18:28
I would suggest you take a look at this article exploring SIMD performance in .Net.
The overall algorithm looks identical for summing using regular vectorization. One difference is that the multiplication can be avoided when slicing the array:
while (i < lastBlockIndex)
{
vresult += new Vector<double>(source.Slice(i));
i += Vector<double>.Count;
}
One multiplication should have fairly small effect on performance, but for this kind of code it might be relevant.
It is also worth noting that the compiler does not seem to produce very efficient SIMD code with the generic API. Performance for summing 32768 items:
- SumUnrolled - 8,979.690 ns
- SumVectorT - 6,689.829 ns
- SumIntrinstics - 2,200.996 ns
So, the generic version of SIMD only gains ~30% performance, while the intrinsics version gains ~400% performance, near the theoretical max.
QUESTION
First, some relevant background info: I've got a CoreAudio-based low-latency audio processing application that does various mixing and special effects on audio that is coming from an input device on a purpose-dedicated Mac (running the latest version of MacOS) and delivers the results back to one of the Mac's local audio devices.
In order to obtain the best/most reliable low-latency performance, this app is designed to hook in to CoreAudio's low-level audio-rendering callback (via AudioDeviceCreateIOProcID(), AudioDeviceStart(), etc) and every time the callback-function is called (from the CoreAudio's realtime context), it reads the incoming audio frames (e.g. 128 frames, 64 samples per frame), does the necessary math, and writes out the outgoing samples.
This all works quite well, but from everything I've read, Apple's CoreAudio implementation has an unwritten de-facto requirement that all real-time audio operations happen in a single thread. There are good reasons for this which I acknowledge (mainly that outside of SIMD/SSE/AVX instructions, which I already use, almost all of the mechanisms you might employ to co-ordinate parallelized behavior are not real-time-safe and therefore trying to use them would result in intermittently glitchy audio).
However, my co-workers and I are greedy, and nevertheless we'd like to do many more math-operations per sample-buffer than even the fastest single core could reliably execute in the brief time-window that is necessary to avoid audio-underruns and glitching.
My co-worker (who is fairly experienced at real-time audio processing on embedded/purpose-built Linux hardware) tells me that under Linux it is possible for a program to requisition exclusive access for one or more CPU cores, such that the OS will never try to use them for anything else. Once he has done this, he can run "bare metal" style code on that CPU that simply busy-waits/polls on an atomic variable until the "real" audio thread updates it to let the dedicated core know it's time to do its thing; at that point the dedicated core will run its math routines on the input samples and generate its output in a (hopefully) finite amount of time, at which point the "real" audio thread can gather the results (more busy-waiting/polling here) and incorporate them back into the outgoing audio buffer.
My question is, is this approach worth attempting under MacOS/X? (i.e. can a MacOS/X program, even one with root access, convince MacOS to give it exclusive access to some cores, and if so, will big ugly busy-waiting/polling loops on those cores (including the polling-loops necessary to synchronize the CoreAudio callback-thread relative to their input/output requirements) yield results that are reliably real-time enough that you might someday want to use them in front of a paying audience?)
It seems like something that might be possible in principle, but before I spend too much time banging my head against whatever walls might exist there, I'd like some input about whether this is an avenue worth pursuing on this platform.
ANSWER
Answered 2021-May-20 at 13:46
can a MacOS/X program, even one with root access, convince MacOS to give it exclusive access to some cores
I don't know about that, but you can use as many cores / real-time threads as you want for your calculations, using whatever synchronisation methods you need to make it work, then pass the audio to your IOProc using a lock-free ring buffer, like TPCircularBuffer.
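For illustration, the core idea behind such a single-producer/single-consumer lock-free ring buffer looks roughly like the sketch below; this is a concept sketch in C++, not TPCircularBuffer's actual API, and it assumes a power-of-two capacity.
#include <atomic>
#include <cstddef>
#include <vector>

class SpscRingBuffer {
public:
    explicit SpscRingBuffer(std::size_t capacity)
        : buffer_(capacity), mask_(capacity - 1) {}   // capacity must be a power of two

    // Called from the producer thread only.
    bool push(float sample) {
        const std::size_t head = head_.load(std::memory_order_relaxed);
        const std::size_t next = (head + 1) & mask_;
        if (next == tail_.load(std::memory_order_acquire))
            return false;                              // full
        buffer_[head] = sample;
        head_.store(next, std::memory_order_release);  // publish the new sample
        return true;
    }

    // Called from the consumer (audio) thread only.
    bool pop(float& sample) {
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        if (tail == head_.load(std::memory_order_acquire))
            return false;                              // empty
        sample = buffer_[tail];
        tail_.store((tail + 1) & mask_, std::memory_order_release);
        return true;
    }

private:
    std::vector<float> buffer_;
    const std::size_t mask_;
    std::atomic<std::size_t> head_{0};   // written by producer
    std::atomic<std::size_t> tail_{0};   // written by consumer
};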
But your question reminded me of a new macOS 11/iOS 14 API I've been meaning to try, the Audio Workgroups API (2020 WWDC Video).
My understanding is that this API lets you "bless" your non-IOProc real-time threads with audio real-time thread properties or at least cooperate better with the audio thread.
The documents distinguish between the threads working in parallel (this sounds like your case) and working asynchronously (this sounds like my proposal), I don't know which case is better for you.
I still don't know what happens in practice when you use Audio Workgroups, whether they opt you in to good stuff or opt you out of bad stuff, but if they're not the hammer you're seeking, they may have some useful hammer-like properties.
QUESTION
I can't make my MTKView clear its background. I've set the view's and its layer's isOpaque to false, background color to clear and tried multiple solutions found on google/stackoverflow (most in the code below like loadAction and clearColor of color attachment) but nothing works.
All the background color settings seem to be ignored. Setting loadAction and clearColor of MTLRenderPassColorAttachmentDescriptor does nothing.
I'd like to have my regular UIView's drawn under the MTKView. What am I missing?
// initialization
let metal = MTKView(frame: self.view.bounds)
metal.device = MTLCreateSystemDefaultDevice()
self.renderer = try! MetalRenderer(mtkView: metal)
metal.delegate = self.renderer
self.view.addSubview(metal);
import Foundation
import MetalKit
import simd
public enum MetalError: Error {
case mtkViewError
case renderError
}
internal class MetalRenderer: NSObject, MTKViewDelegate {
private let commandQueue: MTLCommandQueue;
private let pipelineState: MTLRenderPipelineState
private var viewportSize: SIMD2<UInt32> = SIMD2<UInt32>(x: 10, y: 10);
private weak var mtkView: MTKView?
init(mtkView: MTKView) throws {
guard let device = mtkView.device else {
print("device not found error")
throw MetalError.mtkViewError
}
self.mtkView = mtkView
// Load all the shader files with a .metal file extension in the project.
guard let defaultLibrary = device.makeDefaultLibrary() else {
print("Could not find library")
throw MetalError.mtkViewError
}
let vertexFunction = defaultLibrary.makeFunction(name: "vertexShader")
let fragmentFunction = defaultLibrary.makeFunction(name: "fragmentShader")
mtkView.layer.isOpaque = false;
mtkView.layer.backgroundColor = UIColor.clear.cgColor
mtkView.isOpaque = false;
mtkView.backgroundColor = .clear
let pipelineStateDescriptor = MTLRenderPipelineDescriptor();
pipelineStateDescriptor.label = "Pipeline";
pipelineStateDescriptor.vertexFunction = vertexFunction;
pipelineStateDescriptor.fragmentFunction = fragmentFunction;
pipelineStateDescriptor.isAlphaToCoverageEnabled = true
pipelineStateDescriptor.colorAttachments[0].pixelFormat = .bgra8Unorm;
pipelineStateDescriptor.colorAttachments[0].isBlendingEnabled = true;
pipelineStateDescriptor.colorAttachments[0].destinationRGBBlendFactor = .oneMinusSourceAlpha;
pipelineStateDescriptor.colorAttachments[0].destinationAlphaBlendFactor = .oneMinusSourceAlpha;
pipelineState = try! device.makeRenderPipelineState(descriptor: pipelineStateDescriptor);
guard let queue = device.makeCommandQueue() else {
print("make command queue error")
throw MetalError.mtkViewError
}
commandQueue = queue
}
func mtkView(_ view: MTKView, drawableSizeWillChange size: CGSize) {
viewportSize.x = UInt32(size.width)
viewportSize.y = UInt32(size.height)
}
func draw(in view: MTKView) {
let vertices: [Vertex] = [
Vertex(position: SIMD3(x: 250.0, y: -250.0, z: 0)),
Vertex(position: SIMD3(x: -250.0, y: -250.0, z: 0)),
Vertex(position: SIMD3(x: 0.0, y: 250.0, z: 0)),
]
guard let commandBuffer = commandQueue.makeCommandBuffer() else {
print("Couldn't create command buffer")
return
}
// Create a new command buffer for each render pass to the current drawable.
commandBuffer.label = "MyCommand";
// Obtain a renderPassDescriptor generated from the view's drawable textures.
guard let renderPassDescriptor = view.currentRenderPassDescriptor else {
print("Couldn't create render pass descriptor")
return
}
guard let renderEncoder = commandBuffer.makeRenderCommandEncoder(descriptor: renderPassDescriptor) else {
print("Couldn't create render encoder")
return
}
renderPassDescriptor.colorAttachments[0].loadAction = .clear
renderPassDescriptor.colorAttachments[0].clearColor = MTLClearColorMake(1.0, 0.0, 0.0, 0.5)
renderEncoder.label = "MyRenderEncoder";
// Set the region of the drawable to draw into.
renderEncoder.setViewport(MTLViewport(originX: 0.0, originY: 0.0, width: Double(viewportSize.x), height: Double(viewportSize.y), znear: 0.0, zfar: 1.0))
renderEncoder.setRenderPipelineState(pipelineState)
// Pass in the parameter data.
renderEncoder.setVertexBytes(vertices, length: MemoryLayout<Vertex>.size * vertices.count, index: Int(VertexInputIndexVertices.rawValue))
renderEncoder.setVertexBytes(&viewportSize, length: MemoryLayout<SIMD2<UInt32>>.size, index: Int(VertexInputIndexViewportSize.rawValue))
renderEncoder.drawPrimitives(type: MTLPrimitiveType.triangle, vertexStart: 0, vertexCount: 3)
renderEncoder.endEncoding()
// Schedule a present once the framebuffer is complete using the current drawable.
guard let drawable = view.currentDrawable else {
print("Couldn't get current drawable")
return
}
commandBuffer.present(drawable)
// Finalize rendering here & push the command buffer to the GPU.
commandBuffer.commit()
}
}
ANSWER
Answered 2021-May-17 at 09:09
Thanks to Frank, the answer was to just set the clearColor property of the view itself, which I missed. I also removed most adjustments in the MTLRenderPipelineDescriptor, whose code is now:
let pipelineStateDescriptor = MTLRenderPipelineDescriptor();
pipelineStateDescriptor.label = "Pipeline";
pipelineStateDescriptor.vertexFunction = vertexFunction;
pipelineStateDescriptor.fragmentFunction = fragmentFunction;
pipelineStateDescriptor.colorAttachments[0].pixelFormat =
mtkView.colorPixelFormat;
Also no changes necessary to MTLRenderPassDescriptor from currentRenderPassDescriptor.
EDIT: Also be sure to set the isOpaque property of the MTKView to false.
Community Discussions and Code Snippets contain sources that include the Stack Exchange Network.
Vulnerabilities
No vulnerabilities reported