uarch-bench | A benchmark for low-level CPU micro-architectural features | Architecture library

by travisdowns | C++ | Version: Current | License: MIT

kandi X-RAY | uarch-bench Summary

uarch-bench is a C++ library typically used in Architecture applications. It has no reported bugs or vulnerabilities, a permissive license, and low support. You can download it from GitHub.

Support

uarch-bench has a low active ecosystem.

It has 582 star(s) with 51 fork(s). There are 31 watchers for this library.

It had no major release in the last 6 months.

There are 28 open issues and 55 have been closed. On average issues are closed in 85 days. There are no pull requests.

It has a neutral sentiment in the developer community.

The latest version of uarch-bench is current.

Quality

uarch-bench has no bugs reported.

Security

uarch-bench has no vulnerabilities reported, and its dependent libraries have no vulnerabilities reported.

License

uarch-bench is licensed under the MIT License. This license is Permissive.

Permissive licenses have the least restrictions, and you can use them in most projects.

Reuse

uarch-bench releases are not available. You will need to build from source code and install.

Installation instructions are not available. Examples and code snippets are available.

Community Discussions

Trending Discussions on uarch-bench

Which Intel microarchitecture introduced the ADC reg,0 single-uop special case?

QUESTION

Asked 2019-Sep-19 at 18:46

ADC on Haswell and earlier is normally 2 uops, with 2 cycle latency, because Intel uops traditionally could only have 2 inputs (https://agner.org/optimize/). Broadwell / Skylake and later have single-uop ADC/SBB/CMOV, after Haswell introduced 3-input uops for FMA and micro-fusion of indexed addressing modes in some cases.

(But BDW/SKL still uses 2 uops for the adc al, imm8 short-form encoding, or the other al/ax/eax/rax, imm8/16/32/32 short forms with no ModRM. More details in my answer.)

But adc with immediate 0 is special-cased on Haswell to decode as only a single uop. @BeeOnRope tested this, and included a check for this performance quirk in his uarch-bench: https://github.com/travisdowns/uarch-bench. Sample output from CI on a Haswell server shows a difference between adc reg,0 and adc reg,1 or adc reg,zeroed-reg.

(But only for 32 or 64-bit operand-size, not adc bl,0. So use 32-bit when using adc on a setcc result to combine 2 conditions into one branch.)
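As a rough illustration of the setcc-plus-adc idiom mentioned above, here is a minimal C++ sketch. The function name `count_below` and the assembly in the comment are assumptions made for illustration, not code from the question or from uarch-bench, and a compiler is free to lower the C++ differently.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: returns how many of the two unsigned comparisons
// are true (0, 1, or 2). The hand-written x86 shape it mirrors would be
// roughly (an assumed sketch, placeholder operands a, b, c, d):
//   xor   eax, eax
//   cmp   a, b
//   setb  al          ; eax = (a < b)
//   cmp   c, d        ; CF  = (c < d)
//   adc   eax, 0      ; eax += CF, using 32-bit operand size, not adc al,0
//   jnz   taken       ; a single branch covers "either condition held"
unsigned count_below(std::uint32_t a, std::uint32_t b,
                     std::uint32_t c, std::uint32_t d) {
    unsigned n = (a < b);  // first condition, like setb into a zeroed register
    n += (c < d);          // second condition, like adding the carry flag
    assert(n <= 2);        // sanity check on the sketch: at most two conditions
    return n;
}
```

Branching on `count_below(...) != 0` tests "either", and comparing the result against 2 tests "both".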

Same for SBB. As far as I've seen, there's never any difference between ADC and SBB performance on any CPU, for the equivalent encoding with the same immediate value.

When was this optimization for imm=0 introduced?

I tested on Core 2¹, and found that adc eax,0 latency is 2 cycles, same as adc eax,3. And also the cycle count is identical for a few variations of throughput tests with 0 vs. 3, so first-gen Core 2 (Conroe/Merom) doesn't do this optimization.

The easiest way to answer this is probably to use my test program below on a Sandybridge system, and see if adc eax,0 is faster than adc eax,1. But answers based on reliable documentation would be fine, too.

Footnote 1: I used this test program on my Core 2 E6600 (Conroe / Merom), running Linux.

...

ANSWER

Answered 2019-Sep-19 at 18:46

It's not present on Nehalem, but is on IvyBridge. So it was new either in Sandybridge or IvB.

My guess is Sandybridge for this, because that was a major redesign of the decoders: they produce up to 4 total uops, rather than patterns like 4+1+1+1 that were possible in Core2 / Nehalem, and they hang on to instructions that can macro-fuse (like add or sub) if they're the last in a group, in case the next instruction is a jcc.

Significantly for this, I think SnB decoders also look at the imm8 in immediate-count shifts to check if it's zero, instead of only doing that in the execution units².

Hard data so far:

Broadwell and later (and AMD, and Silvermont/KNL) don't need this optimization: adc r,imm and adc r,r are always 1 uop, except for the AL/AX/EAX/RAX imm short form¹ on Broadwell/Skylake.
Haswell does this optimization: adc reg,0 is 1 uop, adc reg,1 is 2. This applies to 32 and 64-bit operand-size, not 8-bit.
IvyBridge i7-3630QM does this optimization (thanks @DavidWohlferd).
Sandybridge ???
Nehalem i7-820QM does not, adc is slower than add regardless of the imm.
Core 2 E6600 (Conroe/Merom) doesn't either.
Safe to assume Pentium M and earlier don't.
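
The data points above can be collected into a small lookup, sketched here in C++. The enum, the function name `adc_reg_imm_uops`, and the restriction to 32/64-bit operand size with a ModRM encoding (not the accumulator short form) are assumptions made for this illustration. The Sandybridge zero-immediate case is left unknown, as in the list.

```cpp
#include <cassert>
#include <optional>

// Hypothetical names, introduced only for this sketch.
enum class Uarch { Core2, Nehalem, SandyBridge, IvyBridge, Haswell, Broadwell, Skylake };

// Uop count for `adc reg, imm` with 32/64-bit operand size in a ModRM
// encoding, following the list above. std::nullopt means "no data here".
std::optional<int> adc_reg_imm_uops(Uarch u, bool imm_is_zero) {
    switch (u) {
        case Uarch::Core2:
        case Uarch::Nehalem:
            return 2;  // no special case: 2 uops regardless of the immediate
        case Uarch::SandyBridge:
            if (imm_is_zero) return std::nullopt;  // still untested ("???")
            return 2;
        case Uarch::IvyBridge:
        case Uarch::Haswell:
            return imm_is_zero ? 1 : 2;  // imm=0 decodes to a single uop
        case Uarch::Broadwell:
        case Uarch::Skylake:
            return 1;  // single-uop adc in general
    }
    assert(false && "unhandled Uarch value");
    return std::nullopt;
}
```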

Footnote 1: On Skylake, the al/ax/eax/rax, imm8/16/32/32 short-form encodings with no ModR/M byte still decode to 2 uops, even when the immediate is zero. For example, adc eax, strict dword 0 (15 00 00 00 00) is twice as slow as 83 d0 00. Both uops are on the critical path for latency.

Looks like Intel forgot to update the decoding for the other immediate forms of adc and sbb! (This all applies equally to both ADC and SBB.)

Assemblers will use the short-form by default for immediates that don't fit in an imm8, so for example adc rax, 12345 assembles to 48 15 39 30 00 00 instead of the one-byte larger single-uop form that is the only option for registers other than the accumulator.

A loop that bottlenecks on the latency of adc rcx, 12345 runs twice as fast as the same loop using adc rax, 12345. But adc rax, 123 is unaffected, because it uses the adc r/m64, imm8 encoding which is single uop.
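
To make the three encodings concrete, here is a minimal C++ sketch that emits the machine-code bytes for adc r64, imm. The helper `encode_adc_r64_imm` and its `prefer_short_form` parameter are hypothetical names for this illustration. It handles only the low eight registers and only models the assembler behaviour described above.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper, not part of uarch-bench or any assembler.
// reg: 0 = rax, 1 = rcx, ..., 7 = rdi (r8-r15 would need REX.B, omitted).
// prefer_short_form mimics assemblers that pick the accumulator-only
// opcode when the destination is rax and the immediate needs 32 bits.
std::vector<std::uint8_t> encode_adc_r64_imm(unsigned reg, std::int32_t imm,
                                             bool prefer_short_form = true) {
    assert(reg < 8);
    std::vector<std::uint8_t> out;
    out.push_back(0x48);  // REX.W: 64-bit operand size
    // ModRM with mod=11 (register direct), reg field /2 selects ADC.
    const auto modrm = static_cast<std::uint8_t>(0xC0u | (2u << 3) | reg);

    if (imm >= -128 && imm <= 127) {
        out.push_back(0x83);  // adc r/m64, imm8 (sign-extended)
        out.push_back(modrm);
        out.push_back(static_cast<std::uint8_t>(imm));
        return out;
    }
    if (reg == 0 && prefer_short_form) {
        out.push_back(0x15);  // adc rax, imm32: short form with no ModRM
    } else {
        out.push_back(0x81);  // adc r/m64, imm32
        out.push_back(modrm);
    }
    const auto bits = static_cast<std::uint32_t>(imm);
    for (int shift = 0; shift < 32; shift += 8) {
        out.push_back(static_cast<std::uint8_t>(bits >> shift));  // little-endian
    }
    return out;
}
```

Under these assumptions, `encode_adc_r64_imm(0, 12345)` yields the six-byte short form 48 15 39 30 00 00. Passing `false` for `prefer_short_form` yields the seven-byte 48 81 d0 39 30 00 00, the form an assembler would need to be forced into to avoid the 2-uop decoding.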

Footnote 2: See INC instruction vs ADD 1: Does it matter? for quotes from Intel's optimization manual about Core2 stalling the front-end if a later instruction reads flags from a shl r/m32, imm8, in case the imm8 was 0. (As opposed to the implicit-1 opcode, which the decoder knows always writes flags.)

But SnB-family doesn't do that; the decoder apparently checks the imm8 to see whether the instruction writes flags unconditionally or whether it leaves them untouched. So checking an imm8 is something that SnB decoders already do, and could usefully do for adc to omit the uop that adds that input, leaving only adding CF to the destination.
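
The imm8 check described in this footnote follows the architectural rule for immediate-count shifts: the count is masked to 5 bits (6 bits for 64-bit operand size), and a masked count of zero leaves the flags unmodified. A minimal C++ sketch of that rule is below. The function name is hypothetical, and it models only the architectural behaviour, not how any decoder implements the check.

```cpp
#include <cassert>

// Hypothetical helper: does `shl/shr/sar r/m, imm8` write flags?
// operand_bits is the operand size (8, 16, 32, or 64).
bool shift_imm8_writes_flags(unsigned imm8, unsigned operand_bits) {
    assert(imm8 <= 0xFF);
    // The count is masked to 6 bits for 64-bit operands, 5 bits otherwise.
    const unsigned mask = (operand_bits == 64) ? 0x3Fu : 0x1Fu;
    // A masked count of zero leaves the flags untouched.
    return (imm8 & mask) != 0;
}
```

For example, an imm8 of 32 counts as zero for a 32-bit shift but not for a 64-bit one, so the answer depends on the operand size as well as the immediate.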

Source https://stackoverflow.com/questions/51664369

Community Discussions, Code Snippets contain sources that include Stack Exchange Network