Code-Red | A Graphics Interface for DirectX12 and Vulkan | Graphics library
kandi X-RAY | Code-Red Summary
A Graphics Interface for DirectX12 and Vulkan
Community Discussions
Trending Discussions on Code-Red
QUESTION
Consider the following code:
...
ANSWER
Answered 2020-Jul-28 at 17:45
This is a GCC missed optimization. Unfortunately, this is not rare for GCC in tiny functions, where its register allocator does a poor job with the hard-register constraints imposed by the calling convention; apparently GCC is not usually dumb like this between parts of larger functions.
The pxor-zeroing is there to break the (false) output dependency of cvtss2sd, which exists because of Intel's short-sighted design for single-source scalar instructions to leave the upper part of the destination vector unmodified. They started this with SSE1 for PIII, where it gave a short-term gain because PIII handled XMM regs as two 64-bit halves, so only writing one half let instructions like sqrtss be single-uop.
But they unfortunately kept this pattern even for SSE2 (new with Pentium 4), and later declined to fix it with the AVX version of SSE instructions. So compilers are stuck choosing between the risks of creating a long loop-carried dependency chain through a false dependency, or of using pxor-zeroing. GCC conservatively always uses pxor at -O3, omitting it at -Os. (2-source operations like mulsd already depend on the destination as an input, so this is unnecessary for them.)
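As a rough illustration of the zeroing idea described above, the following C sketch uses SSE2 intrinsics to widen a float into a register that was explicitly zeroed first, so the conversion merges into a value with no pending dependency. The helper name widen_with_zeroed_dest is hypothetical and not from the original answer. A compiler is free to pick different instructions than the intrinsics suggest, so treat this as a sketch of the technique, not a guarantee of the generated code. On targets without SSE2 it falls back to a plain cast.

```c
#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE2 intrinsics: _mm_cvtss_sd and friends */
#endif

/* Hypothetical helper (name is an assumption for illustration).
 * Converts a float to a double, merging the result into a freshly
 * zeroed vector instead of into whatever the register held before. */
static double widen_with_zeroed_dest(float f)
{
#if defined(__SSE2__)
    __m128d zeroed = _mm_setzero_pd();           /* the xor-zeroing idiom */
    __m128  src    = _mm_set_ss(f);              /* f in the low lane, rest zero */
    __m128d merged = _mm_cvtss_sd(zeroed, src);  /* low lane = (double)f, high lane from zeroed */
    return _mm_cvtsd_f64(merged);                /* extract the low double */
#else
    return (double)f;                            /* portable fallback */
#endif
}
```

The numeric result is the same as a plain (double)f cast; the only difference is which register value the conversion's untouched upper half comes from.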
In this case, with its poor choice of register allocation, leaving out pxor-zeroing would mean that converting (float)b back to double couldn't start until a was ready. So if the critical path was a being ready (b ready early), omitting it would increase the latency from a->result by 5 cycles on Skylake (for the 2-uop cvtss2sd to run only after a was ready, because the output has to merge into the register that originally held a). Otherwise it's just the mulsd that has to wait for a, with all the stuff involving b done ahead of time.
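The question's code is elided in this copy, so the following is only a hypothetical function of the shape the answer describes: b is narrowed to float, converted back to double, and multiplied by a. The name scale_by_narrowed is an assumption for illustration. Compiling something like this with GCC at -O3 and at -Os and comparing the assembly is one way to look for the pxor difference mentioned above, though the exact output depends on the compiler version.

```c
/* Hypothetical example (not the original question's code).
 * (float)b rounds b to single precision; multiplying by the double a
 * then converts that float back to double (cvtss2sd on x86-64)
 * before the double-precision multiply (mulsd). */
double scale_by_narrowed(double a, double b)
{
    return a * (float)b;
}
```

All work involving b (the narrowing and widening conversions) can in principle finish before a arrives, which is why the register the conversion merges into matters for the latency from a to the result.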
foo same,same is another way to work around an output dependency; that's what clang is doing. (And what GCC tries to do for popcnt, which unexpectedly has one on Sandybridge-family that's not architecturally required, unlike these stupid SSE ones.)
BTW, AVX 3-operand instructions do sometimes provide a way to work around the false dependencies, using a "cold" register, or one that was xor-zeroed, as the register to merge into. This includes scalar int->FP conversion, although clang sometimes just uses movd plus packed-conversion for that.
Related: Why does adding an xorps instruction make this function using cvtsi2ss and addss ~5x faster? (I should have just linked that, I forgot I already wrote this up in that much detail on Stack Overflow recently.)
The movapd and pxor zeroing don't cost any latency on modern CPUs, but nothing is ever free. They still cost a front-end uop, and code size (L1i cache footprint). movapd has zero latency in the back-end, and doesn't need an execution unit, but that's all; see: Can x86's MOV really be "free"? Why can't I reproduce this at all?
Community Discussions, Code Snippets contain sources that include Stack Exchange Network
Vulnerabilities
No vulnerabilities reported
Install Code-Red
Download and install the requisite SDK.
Open the solution of CodeRed with Visual Studio 2019.
Build the source into a library (or reference the source directly).