isa-l | Intelligent Storage Acceleration Library | Compression library
kandi X-RAY | isa-l Summary
[Package on conda-forge] ISA-L is a collection of optimized low-level functions targeting storage applications. ISA-L includes:

* Erasure codes - Fast block Reed-Solomon type erasure codes for any encode/decode matrix in GF(2^8).
* CRC - Fast implementations of cyclic redundancy check. Six different polynomials supported: iscsi32, ieee32, t10dif, ecma64, iso64, jones64.
* Raid - Calculate and operate on XOR and P+Q parity found in common RAID implementations.
* Compression - Fast deflate-compatible data compression.
* Decompression - Fast inflate-compatible data decompression.
* igzip - A command line application like gzip, accelerated with ISA-L.

Also see:

* [ISA-L for updates]
* For crypto functions see [isa-l_crypto on github]
* The [github wiki], including a list of [distros/ports] offering binary packages as well as a list of [language bindings]
* ISA-L [mailing list]
* [Contributing] (CONTRIBUTING.md)
Community Discussions
Trending Discussions on isa-l
QUESTION
According to [1], the sha256rnds2 instruction has an implicit 3rd operand that uses register xmm0. This is what prevents me from effectively computing SHA-256 over multiple buffers simultaneously, and thus from fully utilizing the CPU's execution pipelines.
Other multibuffer implementations (e.g. [2], [3]) use two different techniques to overcome this:
- Compute rounds sequentially
- Partially utilize parallelization where possible
The question I have: why was this instruction designed this way - with an implicit operand that acts as a barrier, preventing us from utilizing multiple execution pipelines, or from effectively using two back-to-back instructions given its reciprocal throughput?
I see three possible reasons:
- Initially, SHA-NI was considered an extension for low-performance CPUs, and no one expected it to become popular in high-performance CPUs - hence no support for multiple pipelines.
- There is a limit on the instruction encoding/decoding side - there are not enough bits to encode a 3rd register, which is why it's hardcoded.
- sha256rnds2 has tremendous energy consumption, and this is why it's not possible to have multiple execution pipelines for it.
Links:
...ANSWER
Answered 2021-Dec-11 at 18:48
Register renaming makes this a non-problem for the back-end. (See Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators) for info on how register renaming hides write-after-write and write-after-read hazards.)
At worst this costs you an extra movdqa xmm0, whatever or vmovdqa instruction before some or all of your sha256rnds2 instructions, costing a small amount of front-end throughput. Or I guess if you're out of registers, then maybe an extra load, or even a store/reload.
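The renaming point can be sketched concretely. The following is a hypothetical interleaving of two independent SHA-256 states - not code from ISA-L; the register assignments, and the assumption that each buffer's message-plus-round-constant words (WK) are already computed, are illustrative:

```
; hypothetical interleaving of two independent SHA-256 computations
; buffer A's state in xmm1:xmm2, buffer B's state in xmm3:xmm4
; assume the WK words for the current two rounds are already
; in xmm5 (buffer A) and xmm6 (buffer B)
movdqa      xmm0, xmm5    ; stage A's WK in the implicit xmm0 operand
sha256rnds2 xmm2, xmm1    ; two rounds of buffer A
movdqa      xmm0, xmm6    ; overwrite xmm0 with B's WK; renaming avoids a WAW stall
sha256rnds2 xmm4, xmm3    ; two rounds of buffer B, an independent dependency chain
```

Because each movdqa write to xmm0 gets a fresh physical register, the two sha256rnds2 instructions can overlap in the pipelined execution unit even though both name xmm0.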
Looks like they wanted to avoid a VEX encoding, so they could provide SHA extensions on low-power Silvermont-family CPUs that don't have AVX/BMI instructions. (That's where it's most useful, because the CPU is slow relative to the amount of data it's throwing around.) So yes, only 2 explicit operands can be encoded via the normal ModRM mechanism in x86 machine code. x86 does three-register instructions with VEX prefixes, which provide a new field for another 4-bit register number. (vpblendvb has 4 explicit operands, with the 4th register number as an immediate, but that's crazy and requires special decoder support.)
So (1) led to (2), but not because of any lack of pipelining.
According to https://uops.info/ and https://agner.org/optimize/, the sha256rnds2 instruction is at least partially pipelined on all CPUs that support it. Ice Lake has one execution unit for it on port 5, with 6-cycle latency but pipelined at one per 3 cycles, so 2 can be in flight at once. Not close to a front-end bottleneck with an extra movdqa.
QUESTION
I'm trying to read a zst-compressed file using Spark on Scala.
...ANSWER
Answered 2021-Apr-18 at 21:25
Since I didn't want to build Hadoop myself, inspired by the workaround used here, I've configured Spark to use the Hadoop native libraries:
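The configuration itself did not survive in this excerpt; the following spark-shell invocation is a sketch of that kind of workaround, assuming Hadoop native libraries (built with zstd support) are installed under /opt/hadoop/lib/native - the path and build details are assumptions, not from the original answer:

```shell
# Point both the driver and the executors at Hadoop's native libraries,
# which supply the zstd codec missing from the pure-Java classpath.
# (/opt/hadoop/lib/native is an assumed install location.)
spark-shell \
  --conf spark.driver.extraLibraryPath=/opt/hadoop/lib/native \
  --conf spark.executor.extraLibraryPath=/opt/hadoop/lib/native
```

The same two properties can instead be set in spark-defaults.conf so every job picks them up.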
QUESTION
From all the information I could gather, there's no performance penalty for mixing SSE and 128-bit (E)VEX encoded instructions. This suggests it should be fine to mix the two, which may be beneficial since SSE instructions are often 1 byte shorter than their VEX equivalents.
However, I've never seen anyone, or any compiler, do this. As an example, in Intel's AVX (128-bit) MD5 implementation, various vmovdqa instructions could be replaced with movaps (or this vshufps could be replaced with the shorter shufps, since the dest and src1 registers are the same).
Is there any particular reason for this avoidance of SSE, or is there something I'm missing?
ANSWER
Answered 2020-Jun-07 at 01:42
You're right: if YMM uppers are known zero from a vzeroupper, mixing AVX-128 and SSE has no penalty, and it's a missed optimization not to do so when it would save code size.
Also note that it only saves code size if you don't need a REX prefix: 2-byte VEX is equivalent to REX + 0F for SSE1. Compilers do try to favour low registers to hopefully avoid REX prefixes, but I think they don't look at which combinations of registers are used in each instruction to minimize total REX prefixes. (Or if they do try, they're not good at it.) Humans can spend time planning like that.
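The byte counts can be checked against the encodings themselves; these listings follow the standard legacy-SSE and VEX encoding rules (the register choices are arbitrary examples):

```
movaps  xmm1, xmm2    ; 0F 28 CA       - 3 bytes (legacy SSE, low regs, no REX)
vmovaps xmm1, xmm2    ; C5 F8 28 CA    - 4 bytes (2-byte VEX prefix)
movaps  xmm8, xmm2    ; 44 0F 28 C2    - 4 bytes (REX.R needed for xmm8)
vmovaps xmm8, xmm2    ; C5 78 28 C2    - 4 bytes (R bit folds into the VEX prefix)
```

With only low registers the SSE form saves a byte; once a REX prefix is needed anyway, the two encodings tie.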
It's pretty minor most of the time, just an occasional byte of code size. That's usually a good thing and can help the front-end. (Or it can save a uop: blendvps xmm, xmm over vblendvps xmm, xmm, xmm, xmm on Intel CPUs (same for the pd and pblendvb forms), if you can arrange to use it without needing another movaps.)
The downside if you get it wrong is an SSE/AVX transition penalty (on Haswell and Ice Lake), or a false dependency on Skylake. Why is this SSE code 6 times slower without VZEROUPPER on Skylake?. IDK if Zen2 does anything like that; Zen1 splits 256-bit operations into 2 uops and doesn't care about vzeroupper.
For compilers to do it safely, they would have to keep track of more stuff to make sure they don't run an SSE instruction inside a function while a YMM register has a dirty upper half. Compilers don't have an option to limit AVX code-gen to 128-bit instructions only, so they'd have to start tracking paths of execution that could have dirtied a YMM upper half.
However, I think they have to do that anyway on a whole-function basis to know when to use vzeroupper before ret (in functions that don't accept or return a __m256/i/d by value, which would mean the caller is already using wide vectors).
But not needing vzeroupper is a separate thing from whether movaps is performance-safe, so it would be one more thing to track in a similar way: finding every case where it's safe to avoid a VEX prefix.
Still, there are probably cases where it's easy to prove it would be safe. It would be fine if compilers used a conservative algorithm with some missed optimizations when branching might or might not have dirtied uppers: in that case always using VEX, and always using vzeroupper.
Community Discussions, Code Snippets contain sources that include Stack Exchange Network
Vulnerabilities
No vulnerabilities reported