travisdowns.github.io | Performance Matters blog content
travisdowns.github.io Summary
This is the backing repository for the Performance Matters blog hosted on GitHub Pages. If you find an error or issue in the blog, feel free to open an issue here or even submit a PR. Or feel free to email me if that's not your thing or you don't have a GitHub account.
Community Discussions
Trending Discussions on travisdowns.github.io
QUESTION
Originally I was trying to reproduce the effect described in Agner Fog's microarchitecture guide section "Warm-up period for YMM and ZMM vector instructions" where it says that:
The processor turns off the upper parts of the vector execution units when it is not used, in order to save power. Instructions with 256-bit vectors have a throughput that is approximately 4.5 times slower than normal during an initial warm-up period of approximately 56,000 clock cycles or 14 μs.
I got the slowdown, although it seems like it was closer to ~2x instead of 4.5x. But what I've found is that on my CPU (Intel i7-9750H, Coffee Lake) the slowdown affects not only 256-bit operations but also 128-bit vector ops and scalar floating-point ops (and even N GPR-only instructions following an XMM-touching instruction).
Code of the benchmark program:
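(The original program was not captured in this excerpt. As a stand-in, here is a minimal sketch of the kind of warm-up benchmark being described, assuming x86-64 with AVX2 and GCC/Clang-style inline assembly, compiled with -mavx2; the chunk sizes and iteration counts are illustrative, not the original poster's values.)

```cpp
#include <cstdio>
#include <cstdint>
#include <x86intrin.h>  // __rdtsc, _mm_pause

int main() {
    // Idle for a while so the CPU can power down the upper vector lanes.
    for (int i = 0; i < 10000000; i++) _mm_pause();

    const int N = 1000;  // 256-bit instructions per timed chunk
    for (int chunk = 0; chunk < 200; chunk++) {
        uint64_t t0 = __rdtsc();
        for (int i = 0; i < N; i++)
            // Dependent chain of 256-bit adds; these should be slow
            // during the warm-up window and fast afterwards.
            asm volatile("vpaddd %%ymm0, %%ymm0, %%ymm0" ::: "xmm0");
        uint64_t t1 = __rdtsc();
        std::printf("chunk %3d: %.2f ref cycles/insn\n",
                    chunk, double(t1 - t0) / N);
    }
}
```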
ANSWER
Answered 2021-Apr-01 at 06:19
The fact that you see throttling even for narrow SIMD instructions is a side effect of a behavior I call implicit widening.
Basically, on modern Intel, if bits 128 to 255 are dirty on any register in the range ymm0 to ymm15, any SIMD instruction is internally widened to 256 bits, since the upper bits need to be zeroed, and this requires the full 256-bit registers in the register file to be powered, and probably the 256-bit ALU path as well. So the instruction acts, for the purposes of AVX frequencies, as if it were 256 bits wide.
Similarly, if bits 256 to 511 are dirty on any zmm register in the range zmm0 to zmm15, operations are implicitly widened to 512 bits.
For the purposes of light vs heavy instructions, the widened instructions have the same type as they would if they were full width. That is, a 128-bit FMA which gets widened to 512 bits acts as "heavy AVX-512" even though only 128 bits of FMA is occurring.
This applies to all instructions which use the xmm/ymm registers, even scalar FP operations.
Note that this doesn't just apply to this throttling period: it means that if you have dirty uppers, a narrow SIMD instruction (or scalar FP) will cause a transition to the more conservative DVFS states just as a full-width instruction would do.
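(Not part of the answer itself: the dirty-uppers state described above can be cleared with vzeroupper, and a minimal sketch of how one might observe the two states follows, assuming x86-64 with AVX2 and GCC/Clang, compiled with -mavx2; the iteration counts are illustrative.)

```cpp
#include <cstdio>
#include <cstdint>
#include <x86intrin.h>  // __rdtsc, _mm256_zeroupper

// Time a chain of 128-bit adds; per the answer, these are implicitly
// widened whenever any ymm register has dirty upper bits.
static uint64_t time_xmm_chain() {
    uint64_t t0 = __rdtsc();
    for (int i = 0; i < 100000; i++)
        asm volatile("vpaddd %%xmm1, %%xmm1, %%xmm1" ::: "xmm1");
    return __rdtsc() - t0;
}

int main() {
    // Dirty bits 128-255 of ymm0 with a 256-bit operation.
    asm volatile("vpaddd %%ymm0, %%ymm0, %%ymm0" ::: "xmm0");
    uint64_t dirty = time_xmm_chain();

    // Mark all uppers clean; subsequent 128-bit ops run as truly 128-bit.
    _mm256_zeroupper();
    uint64_t clean = time_xmm_chain();

    std::printf("dirty uppers: %llu ref cycles, clean uppers: %llu ref cycles\n",
                (unsigned long long)dirty, (unsigned long long)clean);
}
```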
QUESTION
This question is specifically aimed at modern x86-64 cache coherent architectures - I appreciate the answer can be different on other CPUs.
If I write to memory, the MESI protocol requires that the cache line is first read into the cache and then modified in the cache (the value is written to the cache line, which is then marked dirty). In older write-through micro-architectures this would immediately trigger the cache line being flushed; under write-back, the flush can be delayed for some time, and some write combining can occur under both mechanisms (more likely with write-back). And I know how this interacts with other cores accessing the same cache line of data - cache snooping etc.
My question is: if the store matches precisely the value already in the cache, if not a single bit is flipped, does any Intel micro-architecture notice this and NOT mark the line as dirty, thereby possibly saving the line from being marked as exclusive and sparing the writeback memory overhead that would at some point follow?
As I vectorise more of my loops, my vectorised compositional primitives don't explicitly check for values changing, and doing the check in the CPU/ALU seems wasteful, but I was wondering if the underlying cache circuitry could do it without explicit coding (e.g. in the store micro-op or the cache logic itself). As shared memory bandwidth across multiple cores becomes more of a resource bottleneck, this would seem like an increasingly useful optimisation (e.g. repeated zeroing of the same memory buffer: we don't re-read the values from RAM if they're already in cache, but forcing a writeback of the same values seems wasteful). Writeback caching is itself an acknowledgement of this sort of issue.
Can I politely request holding back on "in theory" or "it really doesn't matter" answers - I know how the memory model works, what I'm looking for is hard facts about how writing the same value (as opposed to avoiding a store) will affect the contention for the memory bus on what you may safely assume is a machine running multiple workloads that are nearly always bound by memory bandwidth. On the other hand an explanation of precise reasons why chips don't do this (I'm pessimistically assuming they don't) would be enlightening...
Update: Some answers along the expected lines here: https://softwareengineering.stackexchange.com/questions/302705/are-there-cpus-that-perform-this-possible-l1-cache-write-optimization - but still an awful lot of speculation along the lines of "it must be hard because it isn't done", and claims that doing this in the main CPU core would be expensive (though I still wonder why it couldn't be part of the actual cache logic itself).
Update (2020): Travis Downs has found evidence of Hardware Store Elimination, but only, it seems, for zeros, and only where the data misses L1 and L2, and even then not in all cases. His article is highly recommended as it goes into much more detail: https://travisdowns.github.io/blog/2020/05/13/intel-zero-opt.html
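(For concreteness, and not from the question itself: here is what the question is asking the hardware to do, expressed in software. The function names are illustrative; the explicit per-element check is exactly the work the questioner considers wasteful to do in the ALU.)

```cpp
#include <cstddef>  // std::size_t
#include <cstring>  // std::memset

// Always dirties every cache line it touches, even if the buffer
// already contains zeros, so every line is eventually written back.
void zero_buffer_naive(char* buf, std::size_t n) {
    std::memset(buf, 0, n);
}

// Read first and store only when a bit would actually flip: lines that
// were already all-zero stay clean and need no writeback. This is the
// check the question hopes the cache circuitry could do implicitly.
void zero_buffer_checked(char* buf, std::size_t n) {
    for (std::size_t i = 0; i < n; i++)
        if (buf[i] != 0)
            buf[i] = 0;
}
```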
ANSWER
Answered 2017-Nov-21 at 17:35
Currently no implementation of x86 (or any other ISA, as far as I know) supports optimizing silent stores.
There has been academic research on this, and there is even a patent on "eliminating silent store invalidation propagation in shared memory cache coherency protocols". (Google '"silent store" cache' if you are interested in more.)
For x86, this would interfere with MONITOR/MWAIT; some users might want the monitoring thread to wake on a silent store (one could avoid invalidation and add a "touched" coherence message). (Currently MONITOR/MWAIT is privileged, but that might change in the future.)
Similarly, such elimination could interfere with some clever uses of transactional memory: for example, if a memory location is used as a guard to avoid explicitly loading other memory locations, or, in an architecture that supports it (as in AMD's Advanced Synchronization Facility), to drop the guarded memory locations from the read set.
(Hardware Lock Elision is a very constrained implementation of silent ABA store elimination. It has the implementation advantage that the check for value consistency is explicitly requested.)
There are also implementation issues in terms of performance impact and design complexity. Silent store elimination would prohibit avoiding read-for-ownership (unless it was only active when the cache line was already present in shared state), though read-for-ownership avoidance is also currently not implemented.
Special handling for silent stores would also complicate the implementation of a memory consistency model (probably especially x86's relatively strong model). It might also increase the frequency of rollbacks on speculation that failed consistency. If silent stores were only eliminated for lines present in L1, the time window would be very small and rollbacks extremely rare; eliminating stores to cache lines in L3 or memory would widen that window, perhaps enough to make rollbacks a noticeable issue.
Silence at cache line granularity is also less common than silence at the access level, so the number of invalidations avoided would be smaller.
The additional cache bandwidth would also be an issue. Currently Intel uses parity only on L1 caches to avoid the need for read-modify-write on small writes. Requiring every write to be preceded by a read in order to detect silent stores would have obvious performance and power implications. (Such reads could be limited to shared cache lines and performed opportunistically, exploiting cycles without full cache-access utilization, but that would still have a power cost.) It also means that this cost would largely fall away if read-modify-write support was already present to provide ECC on L1 (a feature which would please some users).
I am not well-read on silent store elimination, so there are probably other issues (and workarounds).
With much of the low-hanging fruit for performance improvement having been taken, more difficult, less beneficial, and less general optimizations become more attractive. Since silent store optimization becomes more important with higher inter-core communication, and inter-core communication will increase as more cores are used to work on a single task, the value of such an optimization seems likely to increase.
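(A sketch of the kind of experiment described in the question's 2020 update, inspired by but not taken from the linked article: repeatedly fill a buffer much larger than L2 with zeros versus with ones and compare timings. If the hardware eliminates redundant zero stores as Travis Downs describes, the zero case generates less writeback traffic and should run measurably faster; the buffer size and repetition count are illustrative.)

```cpp
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <cstring>
#include <vector>

// Time `reps` passes of filling `buf` with `value`.
static double fill_seconds(char* buf, std::size_t n, char value, int reps) {
    auto t0 = std::chrono::steady_clock::now();
    for (int r = 0; r < reps; r++)
        std::memset(buf, value, n);
    std::chrono::duration<double> dt = std::chrono::steady_clock::now() - t0;
    return dt.count();
}

int main() {
    const std::size_t N = 64 * 1024 * 1024;  // well past L2, so lines spill outward
    std::vector<char> buf(N);
    // Zero stores may be eliminated before writeback; ones never can be.
    std::printf("zeros: %.3f s\n", fill_seconds(buf.data(), N, 0, 10));
    std::printf("ones:  %.3f s\n", fill_seconds(buf.data(), N, 1, 10));
}
```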
QUESTION
I'm using Jekyll with the minima theme.
In a recent post I made, someone pointed out to me that I can use nasm-specific formatting with ~~~nasm - it's good to know that nasm is a supported language!
However, the syntax highlighting is pretty ugly - look at the red glow around the square brackets (screenshot omitted from this excerpt).
It is incorrectly indicating a syntax error, due to this bug.
Is it possible to override the style for those brackets and other aspects of the code samples?
ANSWER
Answered 2019-Jun-14 at 11:00
Save https://raw.githubusercontent.com/jekyll/minima/master/_sass/minima.scss to _sass/minima.scss.
Save https://raw.githubusercontent.com/jekyll/minima/master/_sass/minima/_syntax-highlighting.scss as _sass/minima/_syntax-highlighting.scss.
You can now modify your syntax highlighting in _sass/minima/_syntax-highlighting.scss.
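(A minimal sketch of one possible override, not from the answer itself: Rouge marks tokens it fails to lex with the .err class, which minima's default stylesheet renders with a red background, so neutralizing that rule removes the red boxes around the brackets.)

```scss
// Appended to _sass/minima/_syntax-highlighting.scss (assumed location):
// neutralize Rouge's lexing-error style so mis-lexed tokens, like the
// square brackets in nasm listings, render like ordinary code.
.highlight {
  .err {
    color: inherit;
    background-color: transparent;
  }
}
```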
Community Discussions and Code Snippets contain sources that include the Stack Exchange Network.