travisdowns.github.io | Performance Matters blog content
travisdowns.github.io Summary
This is the backing repository for the Performance Matters blog hosted on GitHub Pages. If you find an error or issue in the blog, feel free to open an issue here or even submit a PR. Or feel free to email me if that's not your thing or you don't have a GitHub account.
Community Discussions
Trending Discussions on travisdowns.github.io
QUESTION
Originally I was trying to reproduce the effect described in Agner Fog's microarchitecture guide section "Warm-up period for YMM and ZMM vector instructions" where it says that:
The processor turns off the upper parts of the vector execution units when it is not used, in order to save power. Instructions with 256-bit vectors have a throughput that is approximately 4.5 times slower than normal during an initial warm-up period of approximately 56,000 clock cycles or 14 μs.
I got the slowdown, although it seems like it was closer to ~2x instead of 4.5x. But what I've found is that on my CPU (Intel i7-9750H, Coffee Lake) the slowdown affects not only 256-bit operations but also 128-bit vector ops and scalar floating-point ops (and even N GPR-only instructions following an XMM-touching instruction).
Code of the benchmark program:
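(The original program was not captured in this excerpt. As a stand-in, here is a minimal sketch of the kind of warm-up benchmark being described, assuming x86-64 with AVX2 and GCC/Clang-style inline assembly, compiled with -mavx2; the chunk sizes and iteration counts are illustrative, not the original poster's values.)

```cpp
#include <cstdio>
#include <cstdint>
#include <x86intrin.h>  // __rdtsc, _mm_pause

int main() {
    // Idle for a while so the CPU can power down the upper vector lanes.
    for (int i = 0; i < 10000000; i++) _mm_pause();

    const int N = 1000;  // 256-bit instructions per timed chunk
    for (int chunk = 0; chunk < 200; chunk++) {
        uint64_t t0 = __rdtsc();
        for (int i = 0; i < N; i++)
            // Dependent chain of 256-bit adds; these should be slow
            // during the warm-up window and fast afterwards.
            asm volatile("vpaddd %%ymm0, %%ymm0, %%ymm0" ::: "xmm0");
        uint64_t t1 = __rdtsc();
        std::printf("chunk %3d: %.2f ref cycles/insn\n",
                    chunk, double(t1 - t0) / N);
    }
}
```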
ANSWER
Answered 2021-Apr-01 at 06:19
The fact that you see throttling even for narrow SIMD instructions is a side effect of a behavior I call implicit widening.
Basically, on modern Intel, if bits 128 to 255 are dirty on any register in the range ymm0 to ymm15, any SIMD instruction is internally widened to 256 bits, since the upper bits need to be zeroed, and this requires the full 256-bit registers in the register file to be powered, and probably the 256-bit ALU path as well. So the instruction acts, for the purposes of AVX frequencies, as if it were 256 bits wide.
Similarly, if bits 256 to 511 are dirty on any zmm register in the range zmm0 to zmm15, operations are implicitly widened to 512 bits.
For the purposes of light vs heavy instructions, the widened instructions have the same type as they would if they were full width. That is, a 128-bit FMA which gets widened to 512 bits acts as "heavy AVX-512" even though only 128 bits of FMA is occurring.
This applies to all instructions which use the xmm/ymm registers, even scalar FP operations.
Note that this doesn't just apply to this throttling period: it means that if you have dirty uppers, a narrow SIMD instruction (or scalar FP) will cause a transition to the more conservative DVFS states just as a full-width instruction would do.
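(Not part of the answer itself: the dirty-uppers state described above can be cleared with vzeroupper, and a minimal sketch of how one might observe the two states follows, assuming x86-64 with AVX2 and GCC/Clang, compiled with -mavx2; the iteration counts are illustrative.)

```cpp
#include <cstdio>
#include <cstdint>
#include <x86intrin.h>  // __rdtsc, _mm256_zeroupper

// Time a chain of 128-bit adds; per the answer, these are implicitly
// widened whenever any ymm register has dirty upper bits.
static uint64_t time_xmm_chain() {
    uint64_t t0 = __rdtsc();
    for (int i = 0; i < 100000; i++)
        asm volatile("vpaddd %%xmm1, %%xmm1, %%xmm1" ::: "xmm1");
    return __rdtsc() - t0;
}

int main() {
    // Dirty bits 128-255 of ymm0 with a 256-bit operation.
    asm volatile("vpaddd %%ymm0, %%ymm0, %%ymm0" ::: "xmm0");
    uint64_t dirty = time_xmm_chain();

    // Mark all uppers clean; subsequent 128-bit ops run as truly 128-bit.
    _mm256_zeroupper();
    uint64_t clean = time_xmm_chain();

    std::printf("dirty uppers: %llu ref cycles, clean uppers: %llu ref cycles\n",
                (unsigned long long)dirty, (unsigned long long)clean);
}
```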
QUESTION
This question is specifically aimed at modern x86-64 cache coherent architectures - I appreciate the answer can be different on other CPUs.
If I write to memory, the MESI protocol requires that the cache line is first read into the cache and then modified in the cache (the value is written to the cache line, which is then marked dirty). In older write-through micro-architectures this would immediately trigger the cache line being flushed; under write-back, the flush can be delayed for some time, and some write combining can occur under both mechanisms (more likely with write-back). And I know how this interacts with other cores accessing the same cache line of data - cache snooping etc.
My question is: if the store matches precisely the value already in the cache, if not a single bit is flipped, does any Intel micro-architecture notice this and NOT mark the line as dirty, thereby possibly saving the line from being marked as exclusive and sparing the writeback memory overhead that would at some point follow?
As I vectorise more of my loops, my vectorised compositional primitives don't explicitly check for values changing, and doing the check in the CPU/ALU seems wasteful, but I was wondering if the underlying cache circuitry could do it without explicit coding (e.g. in the store micro-op or the cache logic itself). As shared memory bandwidth across multiple cores becomes more of a resource bottleneck, this would seem like an increasingly useful optimisation (e.g. repeated zeroing of the same memory buffer: we don't re-read the values from RAM if they're already in cache, but forcing a writeback of the same values seems wasteful). Writeback caching is itself an acknowledgement of this sort of issue.
Can I politely request holding back on "in theory" or "it really doesn't matter" answers - I know how the memory model works, what I'm looking for is hard facts about how writing the same value (as opposed to avoiding a store) will affect the contention for the memory bus on what you may safely assume is a machine running multiple workloads that are nearly always bound by memory bandwidth. On the other hand an explanation of precise reasons why chips don't do this (I'm pessimistically assuming they don't) would be enlightening...
Update: Some answers along the expected lines here: https://softwareengineering.stackexchange.com/questions/302705/are-there-cpus-that-perform-this-possible-l1-cache-write-optimization - but still an awful lot of speculation along the lines of "it must be hard because it isn't done", and claims that doing this in the main CPU core would be expensive (though I still wonder why it couldn't be part of the actual cache logic itself).
Update (2020): Travis Downs has found evidence of Hardware Store Elimination, but only, it seems, for zeros, and only where the data misses L1 and L2, and even then not in all cases. His article is highly recommended as it goes into much more detail: https://travisdowns.github.io/blog/2020/05/13/intel-zero-opt.html
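(For concreteness, and not from the question itself: here is what the question is asking the hardware to do, expressed in software. The function names are illustrative; the explicit per-element check is exactly the work the questioner considers wasteful to do in the ALU.)

```cpp
#include <cstddef>  // std::size_t
#include <cstring>  // std::memset

// Always dirties every cache line it touches, even if the buffer
// already contains zeros, so every line is eventually written back.
void zero_buffer_naive(char* buf, std::size_t n) {
    std::memset(buf, 0, n);
}

// Read first and store only when a bit would actually flip: lines that
// were already all-zero stay clean and need no writeback. This is the
// check the question hopes the cache circuitry could do implicitly.
void zero_buffer_checked(char* buf, std::size_t n) {
    for (std::size_t i = 0; i < n; i++)
        if (buf[i] != 0)
            buf[i] = 0;
}
```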
ANSWER
Answered 2017-Nov-21 at 17:35
Currently no implementation of x86 (or any other ISA, as far as I know) supports optimizing silent stores.
There has been academic research on this, and there is even a patent on "eliminating silent store invalidation propagation in shared memory cache coherency protocols". (Google '"silent store" cache' if you are interested in more.)
For x86, this would interfere with MONITOR/MWAIT; some users might want the monitoring thread to wake on a silent store (one could avoid invalidation and add a "touched" coherence message). (Currently MONITOR/MWAIT is privileged, but that might change in the future.)
Similarly, such elimination could interfere with some clever uses of transactional memory: for example, if a memory location is used as a guard to avoid explicitly loading other memory locations, or, in an architecture that supports it (as in AMD's Advanced Synchronization Facility), to drop the guarded memory locations from the read set.
(Hardware Lock Elision is a very constrained implementation of silent ABA store elimination. It has the implementation advantage that the check for value consistency is explicitly requested.)
There are also implementation issues in terms of performance impact and design complexity. Silent store elimination would prohibit avoiding read-for-ownership (unless it was only active when the cache line was already present in shared state), though read-for-ownership avoidance is also currently not implemented.
Special handling for silent stores would also complicate the implementation of a memory consistency model (probably especially x86's relatively strong model). It might also increase the frequency of rollbacks on speculation that failed consistency. If silent stores were only eliminated for lines present in L1, the time window would be very small and rollbacks extremely rare; eliminating stores to cache lines in L3 or memory would widen that window, perhaps enough to make rollbacks a noticeable issue.
Silence at cache line granularity is also less common than silence at the access level, so the number of invalidations avoided would be smaller.
The additional cache bandwidth would also be an issue. Currently Intel uses parity only on L1 caches to avoid the need for read-modify-write on small writes. Requiring every write to be preceded by a read in order to detect silent stores would have obvious performance and power implications. (Such reads could be limited to shared cache lines and performed opportunistically, exploiting cycles without full cache-access utilization, but that would still have a power cost.) It also means that this cost would largely fall away if read-modify-write support was already present to provide ECC on L1 (a feature which would please some users).
I am not well-read on silent store elimination, so there are probably other issues (and workarounds).
With much of the low-hanging fruit for performance improvement having been taken, more difficult, less beneficial, and less general optimizations become more attractive. Since silent store optimization becomes more important with higher inter-core communication, and inter-core communication will increase as more cores are used to work on a single task, the value of such an optimization seems likely to increase.
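(A sketch of the kind of experiment described in the question's 2020 update, inspired by but not taken from the linked article: repeatedly fill a buffer much larger than L2 with zeros versus with ones and compare timings. If the hardware eliminates redundant zero stores as Travis Downs describes, the zero case generates less writeback traffic and should run measurably faster; the buffer size and repetition count are illustrative.)

```cpp
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <cstring>
#include <vector>

// Time `reps` passes of filling `buf` with `value`.
static double fill_seconds(char* buf, std::size_t n, char value, int reps) {
    auto t0 = std::chrono::steady_clock::now();
    for (int r = 0; r < reps; r++)
        std::memset(buf, value, n);
    std::chrono::duration<double> dt = std::chrono::steady_clock::now() - t0;
    return dt.count();
}

int main() {
    const std::size_t N = 64 * 1024 * 1024;  // well past L2, so lines spill outward
    std::vector<char> buf(N);
    // Zero stores may be eliminated before writeback; ones never can be.
    std::printf("zeros: %.3f s\n", fill_seconds(buf.data(), N, 0, 10));
    std::printf("ones:  %.3f s\n", fill_seconds(buf.data(), N, 1, 10));
}
```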
QUESTION
I'm using Jekyll with the minima theme.
In a recent post I made, someone pointed out to me that I can use nasm-specific formatting with ~~~nasm - it's good to know that nasm is a supported language!
However, the syntax highlighting is pretty ugly - look at the red glow around the square brackets (screenshot omitted from this excerpt).
It is incorrectly indicating a syntax error, due to this bug.
Is it possible to override the style for those brackets and other aspects of the code samples?
ANSWER
Answered 2019-Jun-14 at 11:00
Save https://raw.githubusercontent.com/jekyll/minima/master/_sass/minima.scss to _sass/minima.scss.
Save https://raw.githubusercontent.com/jekyll/minima/master/_sass/minima/_syntax-highlighting.scss as _sass/minima/_syntax-highlighting.scss.
You can now modify your syntax highlighting in _sass/minima/_syntax-highlighting.scss.
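(A minimal sketch of one possible override, not from the answer itself: Rouge marks tokens it fails to lex with the .err class, which minima's default stylesheet renders with a red background, so neutralizing that rule removes the red boxes around the brackets.)

```scss
// Appended to _sass/minima/_syntax-highlighting.scss (assumed location):
// neutralize Rouge's lexing-error style so mis-lexed tokens, like the
// square brackets in nasm listings, render like ordinary code.
.highlight {
  .err {
    color: inherit;
    background-color: transparent;
  }
}
```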
Community Discussions and Code Snippets contain sources that include the Stack Exchange Network.