isa-l | Intelligent Storage Acceleration Library | Compression library
kandi X-RAY | isa-l Summary
[Package on conda-forge] ISA-L is a collection of optimized low-level functions targeting storage applications. ISA-L includes:

* Erasure codes - Fast block Reed-Solomon type erasure codes for any encode/decode matrix in GF(2^8).
* CRC - Fast implementations of cyclic redundancy check. Six different polynomials supported: iscsi32, ieee32, t10dif, ecma64, iso64, jones64.
* Raid - Calculate and operate on XOR and P+Q parity found in common RAID implementations.
* Compression - Fast deflate-compatible data compression.
* Decompression - Fast inflate-compatible data decompression.
* igzip - A command line application like gzip, accelerated with ISA-L.

Also see:

* [ISA-L for updates]
* For crypto functions see [isa-l_crypto on github]
* The [github wiki], including a list of [distros/ports] offering binary packages as well as a list of [language bindings]
* ISA-L [mailing list]
* [Contributing] (CONTRIBUTING.md)
Community Discussions
Trending Discussions on isa-l
QUESTION
According to [1], the sha256rnds2 instruction has an implicit 3rd operand that uses register xmm0. This is what prevents me from effectively computing SHA-256 over multiple buffers simultaneously, and thus from fully utilizing the CPU's execution pipelines.
Other multibuffer implementations (e.g. [2], [3]) use two different techniques to overcome this:
- Compute rounds sequentially
- Partially utilize parallelization where possible
The question I have: why was this instruction designed this way - with an implicit operand that acts as a barrier, preventing us from utilizing multiple execution pipelines, or from effectively using two back-to-back instructions given its reciprocal throughput?
I see three possible reasons:
- Initially, SHA-NI was considered an extension for low-performance CPUs, and no one expected it to become popular in high-performance CPUs - hence no support for multiple pipelines.
- There is a limit on the instruction encoding/decoding side - there are not enough bits to encode a 3rd register, which is why it's hardcoded.
- sha256rnds2 has tremendous energy consumption, and this is why it's not possible to have multiple execution pipelines for it.
Links:
...ANSWER
Answered 2021-Dec-11 at 18:48
Register renaming makes this a non-problem for the back-end. (See Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators) for info on how register renaming hides write-after-write and write-after-read hazards.)
At worst this costs you an extra movdqa xmm0, whatever or vmovdqa instruction before some or all of your sha256rnds2 instructions, costing a small amount of front-end throughput. Or I guess if you're out of registers, then maybe an extra load, or even a store/reload.
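The renaming point can be sketched concretely. The following is a hypothetical interleaving of two independent SHA-256 states - not code from ISA-L; the register assignments, and the assumption that each buffer's message-plus-round-constant words (WK) are already computed, are illustrative:

```
; hypothetical interleaving of two independent SHA-256 computations
; buffer A's state in xmm1:xmm2, buffer B's state in xmm3:xmm4
; assume the WK words for the current two rounds are already
; in xmm5 (buffer A) and xmm6 (buffer B)
movdqa      xmm0, xmm5    ; stage A's WK in the implicit xmm0 operand
sha256rnds2 xmm2, xmm1    ; two rounds of buffer A
movdqa      xmm0, xmm6    ; overwrite xmm0 with B's WK; renaming avoids a WAW stall
sha256rnds2 xmm4, xmm3    ; two rounds of buffer B, an independent dependency chain
```

Because each movdqa write to xmm0 gets a fresh physical register, the two sha256rnds2 instructions can overlap in the pipelined execution unit even though both name xmm0.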
Looks like they wanted to avoid a VEX encoding, so they could provide SHA extensions on low-power Silvermont-family CPUs that don't have AVX/BMI instructions. (That's where it's most useful, because the CPU is slow relative to the amount of data it's throwing around.) So yes, only 2 explicit operands can be encoded via the normal ModRM mechanism in x86 machine code. x86 does three-register instructions with VEX prefixes, which provide a new field for another 4-bit register number. (vpblendvb has 4 explicit operands, with the 4th register number as an immediate, but that's crazy and requires special decoder support.)
So (1) led to (2), but not because of any lack of pipelining.
According to https://uops.info/ and https://agner.org/optimize/, the sha256rnds2 instruction is at least partially pipelined on all CPUs that support it. Ice Lake has one execution unit for it on port 5, with 6-cycle latency but pipelined at one per 3 cycles, so 2 can be in flight at once. Not close to a front-end bottleneck with an extra movdqa.
QUESTION
I'm trying to read a zst-compressed file using Spark on Scala.
...ANSWER
Answered 2021-Apr-18 at 21:25
Since I didn't want to build Hadoop myself, inspired by the workaround used here, I've configured Spark to use the Hadoop native libraries:
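The configuration itself did not survive in this excerpt; the following spark-shell invocation is a sketch of that kind of workaround, assuming Hadoop native libraries (built with zstd support) are installed under /opt/hadoop/lib/native - the path and build details are assumptions, not from the original answer:

```shell
# Point both the driver and the executors at Hadoop's native libraries,
# which supply the zstd codec missing from the pure-Java classpath.
# (/opt/hadoop/lib/native is an assumed install location.)
spark-shell \
  --conf spark.driver.extraLibraryPath=/opt/hadoop/lib/native \
  --conf spark.executor.extraLibraryPath=/opt/hadoop/lib/native
```

The same two properties can instead be set in spark-defaults.conf so every job picks them up.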
QUESTION
From all the information I could gather, there's no performance penalty for mixing SSE and 128-bit (E)VEX encoded instructions. This suggests it should be fine to mix the two, which may be beneficial since SSE instructions are often 1 byte shorter than their VEX equivalents.
However, I've never seen anyone, or any compiler, do this. As an example, in Intel's AVX (128-bit) MD5 implementation, various vmovdqa instructions could be replaced with movaps (or this vshufps could be replaced with the shorter shufps, since the dest and src1 registers are the same).
Is there any particular reason for this avoidance of SSE, or is there something I'm missing?
ANSWER
Answered 2020-Jun-07 at 01:42
You're right: if YMM uppers are known zero from a vzeroupper, mixing AVX-128 and SSE has no penalty, and it's a missed optimization not to do so when it would save code size.
Also note that it only saves code size if you don't need a REX prefix: 2-byte VEX is equivalent to REX + 0F for SSE1. Compilers do try to favour low registers to hopefully avoid REX prefixes, but I think they don't look at which combinations of registers are used in each instruction to minimize total REX prefixes. (Or if they do try, they're not good at it.) Humans can spend time planning like that.
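The byte counts can be checked against the encodings themselves; these listings follow the standard legacy-SSE and VEX encoding rules (the register choices are arbitrary examples):

```
movaps  xmm1, xmm2    ; 0F 28 CA       - 3 bytes (legacy SSE, low regs, no REX)
vmovaps xmm1, xmm2    ; C5 F8 28 CA    - 4 bytes (2-byte VEX prefix)
movaps  xmm8, xmm2    ; 44 0F 28 C2    - 4 bytes (REX.R needed for xmm8)
vmovaps xmm8, xmm2    ; C5 78 28 C2    - 4 bytes (R bit folds into the VEX prefix)
```

With only low registers the SSE form saves a byte; once a REX prefix is needed anyway, the two encodings tie.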
It's pretty minor most of the time, just an occasional byte of code size. That's usually a good thing and can help the front-end. (Or it can save a uop: blendvps xmm, xmm over vblendvps xmm, xmm, xmm, xmm on Intel CPUs (same for the pd and pblendvb forms), if you can arrange to use it without needing another movaps.)
The downside if you get it wrong is an SSE/AVX transition penalty (on Haswell and Ice Lake), or a false dependency on Skylake. Why is this SSE code 6 times slower without VZEROUPPER on Skylake?. IDK if Zen2 does anything like that; Zen1 splits 256-bit operations into 2 uops and doesn't care about vzeroupper.
For compilers to do it safely, they would have to keep track of more stuff to make sure they don't run an SSE instruction inside a function while a YMM register has a dirty upper half. Compilers don't have an option to limit AVX code-gen to 128-bit instructions only, so they'd have to start tracking paths of execution that could have dirtied a YMM upper half.
However, I think they have to do that anyway on a whole-function basis to know when to use vzeroupper before ret (in functions that don't accept or return a __m256/i/d by value, which would mean the caller is already using wide vectors).
But not needing vzeroupper is a separate thing from whether movaps is performance-safe, so it would be one more thing to track in a similar way: finding every case where it's safe to avoid a VEX prefix.
Still, there are probably cases where it's easy to prove it would be safe. It would be fine if compilers used a conservative algorithm with some missed optimizations when branching might or might not have dirtied uppers: in that case always using VEX, and always using vzeroupper.
Community Discussions, Code Snippets contain sources that include Stack Exchange Network
Vulnerabilities
No vulnerabilities reported