kandi X-RAY | hyperfine Summary
A command-line benchmarking tool.
Trending Discussions on hyperfine
I'm trying to make sure gcc vectorizes my loops. It turns out that by using -march=native, gcc skips some loops even though they can be vectorized. Why does this happen?
In this code, the second loop, which multiplies each element by a scalar, is not vectorised:...
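The code from the question isn't reproduced here; a minimal stand-in with the same shape (the array size, names, and compile command are my own choices, not the question's) would be something like:

#include <stddef.h>
#include <stdint.h>

#define N 4096   /* placeholder size, not the question's actual value */

void fill_and_scale(uint64_t *arr)
{
    for (size_t i = 0; i < N; i++)   /* first loop: memset-like, does get vectorized */
        arr[i] = 1;
    for (size_t i = 0; i < N; i++)   /* second loop: multiply each element by a scalar */
        arr[i] *= 5;
}

/* e.g. gcc -O3 -march=native -fopt-info-vec-missed -c example.c
   prints which loops were not vectorized and why */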
ANSWER (2022-Apr-10 at 02:47)
The default tuning prefers 256-bit vectors, and -mavx2 doesn't change that.
-march=znver1 implies -mprefer-vector-width=128, because that's all the native width of the HW. An instruction using 32-byte YMM vectors decodes to at least 2 uops, more if it's a lane-crossing shuffle. For simple vertical SIMD like this, 32-byte vectors would be ok; the pipeline handles 2-uop instructions efficiently. (And I think Zen1 is 6 uops wide but only 5 instructions wide, so max front-end throughput isn't available using only 1-uop instructions.) But when vectorization would require shuffling, e.g. with arrays of different element widths, GCC code-gen can get messier with 256-bit or wider.
vmovdqa ymm0, ymm1 mov-elimination only works on the low 128-bit half on Zen1. Also, normally using 256-bit vectors would imply one should use vzeroupper afterwards, to avoid performance problems on other CPUs (but not Zen1).
I don't know how Zen1 handles misaligned 32-byte loads/stores where each 16-byte half is aligned but in separate cache lines. If that performs well, GCC might want to consider increasing the znver1 -mprefer-vector-width to 256. But wider vectors mean more cleanup code if the size isn't known to be a multiple of the vector width.
Ideally GCC would be able to detect easy cases like this and use 256-bit vectors there. (Pure vertical, no mixing of element widths, constant size that's a multiple of 32 bytes.) At least on CPUs where that's fine: znver1, but not bdver2 for example, where 256-bit stores are always slow due to a CPU design bug.
You can see the result of this choice in the way it vectorizes your first loop, the memset-like loop, with a vmovdqu [rdx], xmm0: https://godbolt.org/z/E5Tq7Gfzc
So given that GCC has decided to only use 128-bit vectors, which can only hold two uint64_t elements, it (rightly or wrongly) decides it wouldn't be worth using vpaddd to implement qword (v<<2) + v, vs. doing it with integer in one LEA instruction. Almost certainly wrongly in this case, since it still requires a separate load and store for every element or pair of elements. (And loop overhead, since GCC's default is not to unroll except with PGO, -fprofile-use. SIMD is like loop unrolling, especially on a CPU that handles 256-bit vectors as 2 separate uops.)
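For reference, the 128-bit strategy being weighed here, written by hand with SSE2 intrinsics (a sketch, not GCC's actual output), handles two uint64_t per iteration with a shift and an add, plus one 16-byte load and store:

#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h>   /* SSE2 intrinsics */

void scale_by5_sse2(uint64_t *arr, size_t n)
{
    size_t i = 0;
    for (; i + 2 <= n; i += 2) {
        __m128i v = _mm_loadu_si128((__m128i *)&arr[i]);
        /* v * 5 computed as (v << 2) + v */
        v = _mm_add_epi64(_mm_slli_epi64(v, 2), v);
        _mm_storeu_si128((__m128i *)&arr[i], v);
    }
    for (; i < n; i++)      /* scalar cleanup; a compiler can fold this *5 into one lea */
        arr[i] *= 5;
}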
I'm not sure exactly what GCC means by "not vectorized: unsupported data-type". x86 doesn't have a SIMD uint64_t multiply instruction until AVX-512, so perhaps GCC assigns it a cost based on the general case of having to emulate it with multiple 32x32 => 64-bit pmuludq instructions and a bunch of shuffles. And it's only after it gets over that hump that it realizes that it's actually quite cheap for a constant like 5 with only 2 set bits?
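By contrast with the shift-and-add above, the general-case emulation that cost may be based on looks roughly like the classic SSE2 trick below (a sketch, not necessarily what GCC would emit; pmuludq is _mm_mul_epu32): three 32x32 => 64-bit multiplies plus shifts and adds per vector.

#include <emmintrin.h>

/* Low 64 bits of a 64x64-bit multiply for each qword element,
   without AVX-512's vpmullq. */
static __m128i mul64_sse2(__m128i a, __m128i b)
{
    __m128i lo    = _mm_mul_epu32(a, b);                      /* a_lo * b_lo */
    __m128i hi1   = _mm_mul_epu32(_mm_srli_epi64(a, 32), b);  /* a_hi * b_lo */
    __m128i hi2   = _mm_mul_epu32(a, _mm_srli_epi64(b, 32));  /* a_lo * b_hi */
    __m128i cross = _mm_add_epi64(hi1, hi2);
    return _mm_add_epi64(lo, _mm_slli_epi64(cross, 32));      /* add cross terms into the high half */
}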
That would explain GCC's decision-making process here, but I'm not sure it's exactly the right explanation. Still, these are the kinds of trade-offs that come up in a complex piece of machinery like a compiler. A skilled human can easily make smarter choices, but compilers just do sequences of optimization passes that don't always consider the big picture and all the details at the same time.
-mprefer-vector-width=256 doesn't help: not vectorizing uint64_t *= 5 seems to be a GCC9 regression
(The benchmarks in the question confirm that an actual Zen1 CPU gets a nearly 2x speedup, as expected from doing 2x uint64 in 6 uops vs. 1x in 5 uops with scalar. Or 4x uint64_t in 10 uops with 256-bit vectors, including two 128-bit stores which will be the throughput bottleneck along with the front-end.)
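To reproduce that kind of measurement, one option is a tiny driver timed with hyperfine itself; the file name, array size, repeat count, and the choice of -mavx2 vs. -march=znver1 builds below are illustrative assumptions, not taken from the question:

#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define N 4096
static uint64_t arr[N];

int main(int argc, char **argv)
{
    long reps = argc > 1 ? atol(argv[1]) : 100000;  /* repeat enough to get a measurable runtime */
    for (size_t i = 0; i < N; i++)
        arr[i] = 1;
    for (long r = 0; r < reps; r++)
        for (size_t i = 0; i < N; i++)
            arr[i] *= 5;
    return (int)(arr[0] & 0x7f);   /* use the result so the work isn't optimized away */
}

/* gcc -O3 -mavx2        bench.c -o bench_avx2
   gcc -O3 -march=znver1 bench.c -o bench_znver1
   hyperfine --warmup 3 './bench_avx2 100000' './bench_znver1 100000' */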
With -march=znver1 -O3 -mprefer-vector-width=256, we don't get the *= 5 loop vectorized with GCC9, 10, or 11, or current trunk. As you say, we do with
We do get vectorization with those options for uint32_t (even leaving the vector width at 128-bit). Scalar would cost 4 operations per vector uop (not instruction), regardless of 128- or 256-bit vectorization on Zen1, so this doesn't tell us whether *= is what makes the cost-model decide not to vectorize, or just the 2 vs. 4 elements per 128-bit internal uop.
With uint64_t, changing to arr[i] += arr[i]<<2; still doesn't vectorize, but arr[i] <<= 1; does (https://godbolt.org/z/6PMn93Y5G). Even arr[i] <<= 2; and arr[i] += 123 in the same loop vectorize, to the same instructions that GCC thinks aren't worth it for vectorizing *= 5, just with different operands: a constant instead of the original vector again. (Scalar could still use one LEA.) So clearly the cost-model isn't looking as far as final x86 asm machine instructions, but I don't know why arr[i] += arr[i] would be considered more expensive than arr[i] <<= 1;, which is exactly the same thing.
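Collected into one file for pasting into Compiler Explorer with -O3 -march=znver1, the variants discussed above look like this (the comments just restate the GCC 9-11 behaviour described in the text; function names are my own):

#include <stddef.h>
#include <stdint.h>

void mul5(uint64_t *arr, size_t n)          /* uint64_t *= 5: not vectorized */
{ for (size_t i = 0; i < n; i++) arr[i] *= 5; }

void shl2_add(uint64_t *arr, size_t n)      /* arr[i] += arr[i]<<2: still not vectorized */
{ for (size_t i = 0; i < n; i++) arr[i] += arr[i] << 2; }

void shl1(uint64_t *arr, size_t n)          /* arr[i] <<= 1: vectorized */
{ for (size_t i = 0; i < n; i++) arr[i] <<= 1; }

void shl2_plus123(uint64_t *arr, size_t n)  /* shift and add in one loop: vectorized */
{ for (size_t i = 0; i < n; i++) { arr[i] <<= 2; arr[i] += 123; } }

void mul5_u32(uint32_t *arr, size_t n)      /* uint32_t *= 5: vectorized even at 128-bit */
{ for (size_t i = 0; i < n; i++) arr[i] *= 5; }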
GCC8 does vectorize your loop, even with 128-bit vector width: https://godbolt.org/z/5o6qjc7f6