agner | Reworking of Agner Fog's performance test programs for Linux | GPU library

by mattgodbolt | C++ | Version: Current | License: GPL-3.0

kandi X-RAY | agner Summary

agner is a C++ library typically used in Hardware, GPU applications. agner has no bugs, it has no vulnerabilities, it has a Strong Copyleft License and it has low support. You can download it from GitHub.

A suite of tools, drivers and scripts for investigating the performance of x86 CPUs. Based very heavily on Agner Fog's test programs.

            kandi-support Support

              agner has a low active ecosystem.
It has 78 stars and 18 forks. There are 7 watchers for this library.
              It had no major release in the last 6 months.
              agner has no issues reported. There are no pull requests.
              It has a neutral sentiment in the developer community.
              The latest version of agner is current.

            kandi-Quality Quality

              agner has no bugs reported.

            kandi-Security Security

              agner has no vulnerabilities reported, and its dependent libraries have no vulnerabilities reported.

            kandi-License License

              agner is licensed under the GPL-3.0 License. This license is Strong Copyleft.
              Strong Copyleft licenses enforce sharing, and you can use them when creating open source projects.

            kandi-Reuse Reuse

              agner releases are not available. You will need to build from source code and install.


            agner Key Features

            No Key Features are available at this moment for agner.

            agner Examples and Code Snippets

            No Code Snippets are available at this moment for agner.

            Community Discussions

            QUESTION

            Out-of-order execution in C#
            Asked 2021-Apr-30 at 02:57

            I have the following snippet:

            ...

            ANSWER

            Answered 2021-Apr-30 at 02:57

            Integer addition is associative. Compilers can take advantage of this (the "as-if rule"), regardless of the source-level order of operations.

            (Unfortunately it seems most compilers are doing a bad job at this and making it worse even if you write your source cleverly.)

There's no side-effect for integer overflow in asm; even on targets like MIPS where add traps on signed overflow, compilers use addu, which doesn't, so they can optimize. (In C, compilers can assume the source-level order of operations never overflows, because that would be Undefined Behaviour, so they could use a trapping add on ISAs that have one for calculations that happen with the same inputs in the C abstract machine. But even though gcc -fwrapv, which gives signed-integer overflow well-defined 2's complement wraparound behaviour, is not the default, compilers still use instructions that wrap silently rather than trap, largely so they don't have to care whether any given operation is on values that appear in the C abstract machine or not. UB doesn't mean required-to-fault; -fsanitize=undefined takes extra code to make that happen.)

e.g. INT_MAX + INT_MIN + 1 could be evaluated as INT_MAX + 1 (overflowing to INT_MIN), then that result + INT_MIN overflowing back to 0 on a 2's complement machine, or in source order with no overflows. Same final result, and that's all that's logically visible from the operation.

            CPUs with out-of-order exec don't try to re-associate instructions, though, they follow the dependency graph from the asm / machine code.

            (For one thing, that's too much for hardware to consider on the fly, and for another, the FLAGS output of each operation does depend on which temporaries you create, and an interrupt could arrive at any point. So the proper architectural state needs to be recoverable at instruction boundaries when all older instructions have finished. That means it's the compiler's job to expose instruction-level parallelism in the asm, not for the hardware to use math to create it. See also Modern Microprocessors A 90-Minute Guide! and this answer)

            So how did compilers do?

            Mostly badly, shooting themselves in the foot / pessimizing your attempt at doing this source-level optimization, at least in this case.

            • C#: removes ILP even if it exists in the source; serializes (a+b) + (c+d) into one linear chain of operations; 3 cycle latency.

            • clang12.0: same, serializes both versions.

            • MSVC: same, serializes both versions.

• GCC11.1 for signed int64_t: preserves the source order of operations. It's a longstanding GCC missed-optimization bug that its optimizer avoids introducing signed overflow even in temporaries, as if it has things backwards as far as the promises / guarantees / optimization opportunities that something being UB in the abstract machine creates when making a concrete implementation that runs as-if on the abstract machine. GCC does know it can auto-vectorize int addition; it's only when reordering within a scalar expression that some overly-conservative check lumps signed integers in with floating-point as non-associative.

            • GCC11.1 for uint64_t or with -fwrapv: treats as associative and compiles f and g the same way. Serializes with most tuning options (including for other ISAs like MIPS or PowerPC), but -march=znver1 happens to create ILP. (This does not mean that only AMD Zen is superscalar, it means GCC has missed-optimization bugs!)

            • ICC 2021.1.2: creates ILP even in the linear source version (f), but uses add/mov instead of LEA as the final step. :/

            Godbolt for clang/MSVC/ICC.

            Godbolt for GCC signed / unsigned or with -fwrapv.

            Ideal is to start with two independent additions, then combine the pairs. One of those three additions should be done with an lea to get a result into RAX, but it can be any of the three. In a stand-alone function, you're allowed to destroy any of the incoming arg-passing registers and there's no real reason to avoid overwriting two of them instead of just one.

            You do only want one LEA because a 2-register addressing mode makes it a longer instruction than an ADD.
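The question's original snippet isn't reproduced above, but as a minimal sketch of the two versions the answer refers to as f and g (sketched in C++ here; the names mirror the answer, everything else is assumed):

#include <cstdint>

int64_t f(int64_t a, int64_t b, int64_t c, int64_t d) {
    return a + b + c + d;        // one serial chain: ((a + b) + c) + d
}

int64_t g(int64_t a, int64_t b, int64_t c, int64_t d) {
    return (a + b) + (c + d);    // two independent adds, then one combining add
}

With full reassociation either version can compile to two independent adds plus one combining add (the lea into RAX the answer mentions), for 2 cycles of latency instead of 3.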

            Source https://stackoverflow.com/questions/67321596

            QUESTION

            Why do gcc and clang generate mov reg,-1
            Asked 2021-Apr-26 at 07:21

            I am using compiler explorer to look at some outputs from gcc and clang to get an idea of what assembly these compilers emit for some code. Recently I looked at the output of this code.

            ...

            ANSWER

            Answered 2021-Apr-26 at 07:21

When optimizing for speed, mov reg, -1 is used instead of or reg, -1 because the former uses the register as a "write-only" operand, which the CPU knows about and uses to schedule it efficiently (out of order). Whereas or reg, -1, even though it will always produce -1, is not recognized by the CPU as a dependency-breaking (write-only) instruction.

            To illustrate how it can affect performance:
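The answer's original demonstration isn't included above. As a hedged sketch of the effect (assuming GNU C++ inline asm and the __rdtsc intrinsic; function and variable names are made up), a register rewritten with or keeps a loop-carried dependency on the previous imul, while one rewritten with mov does not:

// Hypothetical micro-benchmark sketch, not the answer's original example.
// Assumes GNU C++ (inline asm, __rdtsc) on x86-64; build with e.g. g++ -O2.
#include <cstdint>
#include <cstdio>
#include <x86intrin.h>

static uint64_t time_chain(bool use_mov, long iters) {
    uint64_t x = 1;
    uint64_t t0 = __rdtsc();
    for (long i = 0; i < iters; ++i) {
        if (use_mov)
            // mov is treated as write-only: each imul starts a fresh chain.
            asm volatile("imul $3, %0, %0\n\t"
                         "mov $-1, %0" : "+r"(x) : : "cc");
        else
            // or reg,-1 still reads the old value, so the imuls form one
            // loop-carried latency chain.
            asm volatile("imul $3, %0, %0\n\t"
                         "or  $-1, %0" : "+r"(x) : : "cc");
    }
    return __rdtsc() - t0;
}

int main() {
    const long n = 100000000;
    printf("or  chain: %llu ref cycles\n", (unsigned long long)time_chain(false, n));
    printf("mov chain: %llu ref cycles\n", (unsigned long long)time_chain(true, n));
}

On an out-of-order core the or version should run at roughly the imul latency plus one cycle per iteration, while the mov version is limited only by throughput.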

            Source https://stackoverflow.com/questions/67257911

            QUESTION

            First use of AVX 256-bit vectors slows down 128-bit vector and AVX scalar ops
            Asked 2021-Apr-01 at 06:19

            Originally I was trying to reproduce the effect described in Agner Fog's microarchitecture guide section "Warm-up period for YMM and ZMM vector instructions" where it says that:

            The processor turns off the upper parts of the vector execution units when it is not used, in order to save power. Instructions with 256-bit vectors have a throughput that is approximately 4.5 times slower than normal during an initial warm-up period of approximately 56,000 clock cycles or 14 μs.

I got the slowdown, although it seems like it was closer to ~2x instead of 4.5x. But what I've found is that on my CPU (Intel i7-9750H, Coffee Lake) the slowdown is not only affecting 256-bit operations, but also 128-bit vector ops and scalar floating-point ops (and even some number N of GPR-only instructions following an XMM-touching instruction).

            Code of the benchmark program:

            ...

            ANSWER

            Answered 2021-Apr-01 at 06:19

            The fact that you see throttling even for narrow SIMD instructions is a side-effect of a behavior I call implicit widening.

            Basically, on modern Intel, if the upper 128-255 bits are dirty on any register in the range ymm0 to ymm15, any SIMD instruction is internally widened to 256 bits, since the upper bits need to be zeroed and this requires the full 256-bit registers in the register file to be powered and probably the 256-bit ALU path as well. So the instruction acts for the purposes of AVX frequencies as if it was 256-bit wide.

            Similarly, if bits 256 to 511 are dirty on any zmm register in the range zmm0 to zmm15, operations are implicitly widened to 512 bits.

            For the purposes of light vs heavy instructions, the widened instructions have the same type as they would if they were full width. That is, a 128-bit FMA which gets widened to 512 bits acts as "heavy AVX-512" even though only 128 bits of FMA is occurring.

            This applies to all instructions which use the xmm/ymm registers, even scalar FP operations.

            Note that this doesn't just apply to this throttling period: it means that if you have dirty uppers, a narrow SIMD instruction (or scalar FP) will cause a transition to the more conservative DVFS states just as a full-width instruction would do.
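As a practical, hedged sketch (standard AVX intrinsics; compilers normally insert vzeroupper at function boundaries on their own, so this just makes the point explicit): clearing the dirty uppers with _mm256_zeroupper() keeps later 128-bit or scalar FP work from being implicitly widened:

#include <cstddef>
#include <immintrin.h>

// Sketch: dirty ymm uppers make later narrow SIMD / scalar FP behave as if it
// were 256-bit wide for frequency purposes, so clear them when leaving AVX code.
void scale(float *a, std::size_t n, float k) {
    __m256 vk = _mm256_set1_ps(k);
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8)        // 256-bit work dirties the ymm uppers
        _mm256_storeu_ps(a + i, _mm256_mul_ps(_mm256_loadu_ps(a + i), vk));

    _mm256_zeroupper();               // uppers now clean: the scalar tail below
                                      // (which still uses xmm registers) won't be
                                      // implicitly widened to 256 bits
    for (; i < n; ++i)
        a[i] *= k;
}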

            Source https://stackoverflow.com/questions/66874161

            QUESTION

            Is there a reason why Roslyn does not optimize multiple increments?
            Asked 2021-Mar-23 at 18:06

            I was trying to see how Roslyn optimizes the following snippet:

            code

            ...

            ANSWER

            Answered 2021-Mar-23 at 16:11

            The compiler does optimize. n is a parameter though, so it can't be modified. The JIT compiler must modify a copy of the parameter's value.

If the value is assigned to a variable before incrementing, the Roslyn compiler will eliminate the increments. From this Sharplab.io snippet, this C# code:

            Source https://stackoverflow.com/questions/66766573

            QUESTION

            Access of struct member faster if located <128 bytes from start?
            Asked 2021-Mar-12 at 01:30

From Agner Fog's C++ optimization manual, I read:

            The code for accessing a data member is more compact if the offset of the member relative to the beginning of the structure or class is less than 128 because the offset can be expressed as an 8-bit signed number. If the offset relative to the beginning of the structure or class is 128 bytes or more then the offset has to be expressed as a 32-bit number (the instruction set has nothing between 8 bit and 32 bit offsets). Example:

            ...

            ANSWER

            Answered 2021-Mar-12 at 01:30

            You're meant to be looking at the asm for ReadB, not main; but since they are defined inline, no asm is generated unless you call them (and then it would be mixed in with the code of the calling function). Let's move them out-of-line to make it easier.
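As a minimal sketch of the encoding difference the manual describes (hypothetical struct layout; the asm comments assume x86-64 GCC/Clang with Intel-syntax disassembly):

#include <cstdint>

struct S {
    char pad[124];
    int32_t a;        // offset 124: fits in a signed 8-bit displacement
    int32_t b;        // offset 128: needs a 32-bit displacement
};

int32_t ReadA(const S *s) { return s->a; }  // mov eax, DWORD PTR [rdi+124]  (disp8)
int32_t ReadB(const S *s) { return s->b; }  // mov eax, DWORD PTR [rdi+128]  (disp32, 3 bytes longer)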

            Source https://stackoverflow.com/questions/66592024

            QUESTION

            How to look up what form of an instruction is used, by opcode or disassembly?
            Asked 2020-Dec-14 at 20:52

            Sites like https://uops.info/ and Agner Fog's instruction tables, and even Intel's own manuals, list various forms of the same instruction. For example add m, r (in Agner's tables) or add (m64, r64) on uops.info, or ADD r/m64, r64 in Intel's manual (https://www.felixcloutier.com/x86/add).

            Here's a simple example I ran on godbolt

            ...

            ANSWER

            Answered 2020-Dec-14 at 00:05

            http://ref.x86asm.net/coder64.html has an opcode map, but with a bit of experience you won't need one most of the time. Especially when you have disassembly, you can just check the manual entry for that mnemonic (https://www.felixcloutier.com/x86/add), and see which of the possible opcodes it is (83 /0 add r/m32, imm8).

Clearly this has a 32-bit operand-size (dword ptr) memory destination, and the source is an immediate (numeric constant). That rules out an r64 register source for 2 separate reasons. So even without looking at the machine code, it's definitely add r/m32, imm with an imm8 or imm32. Any sane assembler will of course pick imm8 for a small constant that fits in a signed 8-bit integer.

            Generally different ways of encoding the same instruction aren't special, so the source-level assembly / disassembly is fine, as long as you understand what's a register, what's memory, and what's an immediate.

            But there are a few special cases, e.g. Agner Fog's guide notes that rotates by 1 using the short-form encoding are slower than rol reg, imm8 even when the imm8=1, because the flag-updating special case for rotate-by-1 actually depends on the opcode, not the immediate count. (Intel's documentation apparently assumes your assembler will always pick the short-form for rotate by constant 1. The part about "masked count" may only apply to rotate by cl. https://www.felixcloutier.com/x86/rcl:rcr:rol:ror#flags-affected. I haven't tested this recently and am not 100% sure I'm remembering correctly when OF is updated (but other flags in the SPAZO group are always left unmodified), but IIRC that's why rotates by 1 (2 uops) and by cl (3 uops) are slow, vs. rotates by other immediate counts (1 uop) on Intel).

Or https://github.com/travisdowns/uarch-bench/wiki/Intel-Performance-Quirks. Specifically I mean Which Intel microarchitecture introduced the ADC reg,0 single-uop special case? - even on Haswell / Skylake, adc al,0 (using the short form with no modrm byte) is 2 uops, and so is the equivalent adc eax, 12345. But adc edx, 12345 is 1 uop, using the non-special-case encoding. Then you have to either check the machine code, or know how your assembler will have chosen to encode a given instruction. (Optimizing for size).

BTW, using a segment with a non-zero base adds 1 cycle of latency to address-generation, IIRC, but that isn't a significant throughput penalty. (Unless of course throughput bottlenecks on a latency chain that it's part of...)

            Source https://stackoverflow.com/questions/65281909

            QUESTION

            Packing non-contiguous vector elements in AVX (and higher)
            Asked 2020-Nov-16 at 14:27

Having code of this nature:

            ...

            ANSWER

            Answered 2020-Nov-12 at 20:46

            vfmaddXXXsd and pd instructions are "cheap" (single uop, 2/clock throughput), even cheaper than shuffles (1/clock throughput on Intel CPUs) or gather-loads. https://uops.info/. Load operations are also 2/clock, so lots of scalar loads (especially from the same cache line) are quite cheap, and notice how 3 of them can fold into memory source operands for FMAs.

            Worst case, packing 4 (x2) totally non-contiguous inputs and then manually scattering the outputs is definitely not worth it vs. just using scalar loads and scalar FMAs (especially when that allows memory source operands for the FMAs).

            Your case is far from the worst case; you have 3 contiguous elements from 1 input. If you know you can safely load 4 elements without risk of touching an unmapped page, that takes care of that input. (And you can always use maskload). But the other vector is still non-contiguous and may be a showstopper for speedups.

            It's usually not worth it if it would take more total instructions (actually uops) to do it via shuffling than plain scalar. And/or if shuffle throughput would be a worse bottleneck than anything in the scalar version.

            (vgatherdpd counts as many instructions for this, being multi-uop and doing 1 cache access per load. Also you'd have to load constant vectors of indices instead of hard-coding offsets into addressing modes.

            Also, gathers are quite slow on AMD CPUs, even Zen2. We don't have scatter at all until AVX512, and those are slow even on Ice Lake. Your case doesn't need scatters, though, just a horizontal sum. Which will involve more shuffles and vaddpd / sd. So even with a maskload + gather for inputs, having 3 products in separate vector elements is not particularly convenient for you.)

            A little bit of SIMD (not a whole array, just a few operations) can be helpful, but this doesn't look like one of the cases where it's a significant win. Maybe there's something worth doing, like maybe replace 2 loads with a load + a shuffle. Or maybe shorten a latency chain for y[5] by summing the 3 products before adding to the output, instead of the chain of 3 FMAs. That might even be numerically better, in cases where an accumulator can hold a large number; adding multiple small numbers to a big total loses precision. Of course that would cost 1 mul, 2 FMA, and 1 add.
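A small sketch of that last suggestion (hypothetical names, plain scalar C++): summing the three products first leaves only a single add on the dependency chain through y[5]:

// Chain through y5 is 3 FMAs long: y5 = fma(a2,b2, fma(a1,b1, fma(a0,b0, y5)))
void accumulate_chained(double &y5, const double a[3], const double b[3]) {
    for (int i = 0; i < 3; ++i)
        y5 += a[i] * b[i];          // compiler may contract each step into an FMA
}

// Sum the products first: only the final add is on the y5 dependency chain.
void accumulate_reassociated(double &y5, const double a[3], const double b[3]) {
    double t = a[0] * b[0] + a[1] * b[1] + a[2] * b[2];  // 1 mul + 2 FMA-able ops
    y5 += t;                                             // single add onto y[5]
}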

            Source https://stackoverflow.com/questions/64810953

            QUESTION

            delays and measurement of specific instructions
            Asked 2020-Oct-22 at 08:09

Because modern processors make heavy use of pipelining, even for the ALU, multiple independent arithmetic operations can be in flight at once; for example, four add operations can be executed in 4 cycles rather than 4 * the latency of one add.

Even with the existence of pipelines and the presence of contention on the execution ports, I would like to implement cycle-accurate delays by executing instructions in a way that makes the time to execute a sequence of instructions predictable. For example, if instruction x takes 2 cycles and cannot be pipelined, then by executing x four times I expect that I can add an 8-cycle delay.

I know that this is usually impossible from userspace because the kernel can intervene in the middle of the execution sequence and cause more delay than expected. However, assume that this code executes on the kernel side without interrupts, or on an isolated core that is free from noise.

After taking a look at https://agner.org/optimize/instruction_tables.pdf, I found that the CDQ instruction doesn't require a memory operation and has a latency and reciprocal throughput of 1 cycle. If I understand this correctly, that means that if there is no contention for the port used by CDQ, the CPU can execute this instruction every cycle. To test it, I put CDQ in between RDTSC timers and set the core frequency to the nominal core frequency (hoping that it is the same as the TSC frequency). I also pinned two processes to hyperthreaded cores; one spins in a while(1) loop and the other executes CDQ instructions. It seems that adding one instruction increases the measurement by 1-2 TSC cycles.

However, I am concerned about the case where it takes lots of CDQ instructions to produce a large delay such as 10000 cycles, which might require at least 5000 instructions. If the code is too large to fit in the instruction cache and causes cache misses and TLB misses, it might introduce jitter into my delay. I've tried using a simple for loop to execute the CDQ instructions, but I cannot be sure whether it is okay to use a for loop (implemented with jnz, cmp, and sub) because it might also introduce unexpected noise into my delay. Could anyone confirm whether I can use the CDQ instruction in this way?

            Added Question

After testing with multiple CMC instructions, it seems that 10 CMC instructions add 10 TSC cycles. I used the code below to measure the time for executing 0, 10, 20, 30, 40, and 50 of them:

            ...

            ANSWER

            Answered 2020-Sep-23 at 23:38

            You have 4 main options:

            • delay the 2nd operation by giving it a data dependency on (the result of) the first.
            • lfence, fixed delay sequence, lfence. Both of these can only give a minimum delay; could be much longer depending on CPU frequency scaling and/or interrupts.
            • spin on rdtsc until a deadline (which you calculate somehow, e.g. based on an earlier rdtsc), or do a longer sleep based on a TSC deadline e.g. using the local APIC.
            • Give up and use a different design, or use an in-order microcontroller where you can get reliable cycle-accurate timing at a fixed clock frequency.

            This may be an X-Y problem, or at least isn't solvable without getting into the specific details of the two things you want to separate with a delay. (e.g. create a data dependency between a load and a store-address, and lengthen that dep chain with some instructions). There is no general-case answer that works between arbitrary code for very short delays.

            If you need accurate delays of only a few clock cycles, you're mostly screwed; superscalar out-of-order execution, interrupts, and variable clock frequency makes that essentially impossible in the general case. As @Brendan explained:

            For "extremely small and accurate" delays the only option is to give up then reassess the reason why you made the mistake of thinking you wanted it.

            For kernel code; for longer delays with slightly less accuracy you could look into using local APIC timer in "TSC deadline mode" (possibly with some adjustment for IRQ exit timing) and/or similar with performance monitoring counters.

            For delays of several dozen clock cycles, spin-wait for RDTSC to have a value you're looking for. How to calculate time for an asm delay loop on x86 linux? But that has some minimum overhead to execute RDTSC twice, or RDTSC plus TPAUSE if you have the "waitpkg" ISA extension. (You don't on i9-9900k). You also need lfence if you want to stop out-of-order exec across the whole thing.

            If you need to do something "every 20 ns" or something, then increment a deadline instead of trying to do a fixed delay between other work. So variation in the other work won't accumulate error. But one interrupt will put you far behind and lead to running your other work back-to-back until you catch up. So as well as checking for the deadline, you'd also want to check for being far behind the deadline and take a new TSC sample.

            (The TSC ticks at constant frequency on modern x86, but the core clock doesn't: see How to get the CPU cycle count in x86_64 from C++? for more details)
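A minimal sketch of the spin-on-RDTSC / deadline idea (assuming GNU C++ with x86intrin.h; lfence placement and TSC-to-nanosecond calibration are left out):

#include <cstdint>
#include <x86intrin.h>

// Spin until the TSC reaches `deadline`. Gives a minimum delay only: an
// interrupt or a low core clock can make the real delay longer.
static inline void spin_until_tsc(uint64_t deadline) {
    while (__rdtsc() < deadline)
        _mm_pause();              // be polite to the other hyperthread
}

// "Do something every `period` TSC ticks": advance a deadline instead of doing
// a fixed delay between work items, so jitter in the work doesn't accumulate.
void periodic_work(uint64_t period, int iterations) {
    uint64_t deadline = __rdtsc() + period;
    for (int i = 0; i < iterations; ++i) {
        // ... real work here ...
        spin_until_tsc(deadline);
        if (__rdtsc() > deadline + period)   // fell far behind (e.g. an interrupt)?
            deadline = __rdtsc();            // resync instead of running back-to-back
        deadline += period;
    }
}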

            Maybe you can use a data dependency between your real work?

            Small delays of a few clock cycles, smaller than the out-of-order scheduler size1, are not really possible without taking the surrounding code into consideration and knowing the exact microarchitecture you're executing on.

            footnote 1: 97 entry RS on Skylake-derived uarches, although there's some evidence that it's not truly a unified scheduler: some entries can only hold some kinds of uops.

            If you can create a data dependency between the two things you're trying to separate, you might be able to create a minimum delay between their execution that way. There are ways to couple a dependency chain into another register without affecting its value, e.g. and eax, 0 / or ecx, eax makes ECX depend on the instruction that wrote EAX without affecting the value of ECX. (Make a register depend on another one without changing its value).
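A hedged sketch of that coupling trick as a GNU C++ inline-asm helper (the helper name is made up, and the source register is left zeroed as a side effect):

#include <cstdint>

// Make `dst` carry a data dependency on `src` (and on whatever wrote `src`)
// without changing dst's value. `and reg, 0` is NOT a recognized zeroing idiom,
// so it keeps its input dependency; `or dst, 0` then leaves dst unchanged.
static inline void couple_dep(uint64_t &dst, uint64_t &src) {
    asm volatile("and $0, %[tmp]\n\t"
                 "or  %[tmp], %[val]"
                 : [val] "+r"(dst), [tmp] "+r"(src)
                 :
                 : "cc");
}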

            e.g. between two loads, you could create a data dependency from the load result of one into the load address of the later load, or into a store address. Coupling two store addresses together with a dependency chain is less good; the first store could take a bunch of extra time (e.g. for a dTLB miss) after the address is known, so two stores end up committing back-to-back after all. You might need mfence then lfence between two stores if you want to put a delay before the 2nd store. See also Are loads and stores the only instructions that gets reordered? for more about OoO exec across lfence (and mfence on Skylake).

            This may require writing your "real work" in asm, too, unless you can come up with a way to "launder" the data dependency from the compiler with a small inline asm statement.

            CMC is one of the few single-byte instructions available in 64-bit mode that you can just repeat to create a latency bottleneck (1 cycle per instruction on most CPUs) without also accessing memory (like lodsb which bottlenecks on merging into the low byte of RAX). xchg eax, reg would also work, but that's 3 uops on Intel.

            Instead of lfence, you could couple that dep chain into a specific instruction using adc reg, 0, if you start with a known CF state and use an odd or even number of CMC instructions such that CF=0 at that point. Or cmovc same,same would make a register value depend on CF without modifying it, regardless of whether CF was set or cleared.

However, single-byte instructions can create weird front-end effects when you have too many in a row for the uop cache to handle. That's what slows down CDQ if you repeat it indefinitely; apparently Skylake can only decode it at 1/clock in the legacy decoders. Can the simple decoders in recent Intel microarchitectures handle all 1-µop instructions?. That may be ok and/or what you want. 3 cycles per 3-byte instruction would let this code be cached by the uop cache, e.g. imul eax, eax or imul eax, 0. But maybe it's better to avoid polluting the uop cache with code that's supposed to run slowly.

            Between LFENCE instructions, cld is 3 uops and has a 4c throughput on Skylake, so if you're using lfence at the start/end of your delay that could be usable.

            Also of course, any dead-reckoning delay in terms of a certain number of some instructions (not rdtsc) will depend on the core clock frequency, not the reference frequency. And at best it's a minimum delay; if an interrupt comes in during your delay loop, the total delay will be close to the total of interrupt handling time plus whatever your delay-loop took.

            Or if the CPU happens to be running at idle speed (often 800MHz), the delay in nanoseconds will be much longer than if the CPU is at max turbo.

            Re: your 2nd experiment with CMC between lfence OoO exec barriers

            Yes, you can pretty accurately control the core clock cycles between two lfence instructions, or between lfence and rdtscp, with a simple dependency chain, pause instruction, or a throughput bottleneck on some execution unit(s), possibly the integer or FP divider. But I assume your real use case cares about the total delay between stuff before the first lfence and stuff after the 2nd lfence.
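As a rough sketch of such a fenced fixed-delay sequence (assuming GNU C++ inline asm and the GNU assembler's .rept directive; this gives a minimum delay in core clock cycles, not wall-clock time):

// Roughly CYCLES core clock cycles of latency between two lfence barriers,
// using a chain of 1-byte CMC instructions (~1 cycle latency each on most CPUs).
template <int CYCLES>
static inline void lfence_cmc_delay() {
    asm volatile("lfence\n\t"
                 ".rept %c[n]\n\t"
                 "cmc\n\t"
                 ".endr\n\t"
                 "lfence"
                 :
                 : [n] "i"(CYCLES)
                 : "cc", "memory");
}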

            The first lfence has to wait for whatever instructions were previously in flight to retire from the out-of-order back-end (ROB = reorder buffer, 224 fused-domain uops on Skylake-family). If those included any loads that might miss in cache, your wait time can vary tremendously, and be much longer than you probably want.

            Is it because CMC instructions back to back have no dependency on each other but CDQ instructions do have a dependency in between them?

            You have that backwards: CMC has a true dependency on the previous CMC because it reads and writes the carry flag. Just like not eax has a true dependency on the previous EAX value.

            CDQ does not: it reads EAX and writes EDX. Register renaming makes it possible for RDX to be written more than once in the same clock cycle. e.g. Zen can run 4 cdq instructions per clock. Your Coffee Lake can run 2 CDQ per clock (0.5c throughput), bottlenecked on the back-end ports it can run on (p0 and p6).

            Agner Fog's numbers were based on testing a huge block of repeated instruction, apparently bottlenecking on legacy-decode throughput of 1/clock. (Again, see Can the simple decoders in recent Intel microarchitectures handle all 1-µop instructions? ). https://uops.info/ numbers are closer to accurate for small repeat counts for Coffee Lake, showing it as 0.6 c throughput. (But if you look at the detailed breakdown, with an unroll count of 500 https://www.uops.info/html-tp/CFL/CDQ-Measurements.html confirms that Coffee Lake still has that front-end bottleneck).

            But increasing the repeat count up past about 20 (if aligned) will lead to the same legacy-decode bottleneck that Agner saw. However, if you don't use lfence, decode could be far ahead of execution so this is not good.

            CDQ is a poor choice because of the weird front-end effects, and/or being a back-end throughput bottleneck instead of latency. But OoO exec can still see around it once the front-end gets past the repeated CDQs. 1-byte NOP could create a front-end bottleneck which might be more usable depending on what two things you were trying to separate.

            BTW, if you don't fully understand dependency chains and their implications for out-of-order execution, and probably a bunch of other cpu-architecture details about the exact CPU you're using (e.g. store buffers if you want to separate any stores), you're going to have a bad time trying to do anything meaningful.

            If you can do what you need with just a data dependency between two things, that might reduce the amount of stuff you need to understand to make anything like what you described as your goal.

            Otherwise you probably need to understand basically all of this answer (and Agner Fog's microarchitecture guide) to figure out how your real problem translates into something you can actually make a CPU do. Or realize that it can't, and you'll need something else. (Like maybe a very fast in-order CPU, perhaps ARM, where you can somewhat control timing between independent instructions with delay sequences / loops.)

            Source https://stackoverflow.com/questions/64011093

            QUESTION

            uops for integer DIV instruction
            Asked 2020-Jul-29 at 18:11

I was looking at Agner Fog's instruction tables here, specifically at the Sandy Bridge case, and one thing caught my attention. If you look at the DIV instructions you can see that, for example, the r64 DIV instruction can be decoded into up to 56 uops! My question is: is this true, or have I made a misinterpretation?

This is something that I can't get my head around. I've always thought that an integer division of 2 registers was decoded into only 1 uop, and that that uop was dispatched to Port 0 (for example on Sandy Bridge).

What I thought happened here is: the uop is dispatched to Port 0 and it finishes some cycles later. But, thanks to pipelining, 1 div uop (or another uop that needs Port 0) can be sent to that port each cycle. But this has completely broken my mental model: 56 different uops which need to be dispatched over 56 different cycles, occupying 56 ROB entries, ONLY to do 1 integer division?

            ...

            ANSWER

            Answered 2020-Jul-29 at 18:11

            Not all of those uops run on the actual divider unit on port 0. It seems only signed idiv is that many uops on Skylake, div r64 is "only" 33 uops. Perhaps signed idiv r64 is taking absolute values to do extended-precision division using a narrower HW divider unit, like you'd do for software extended-precision? (Why is __int128_t faster than long long on x86-64 GCC?)

            And idiv/div r32 is "only" 10 uops, probably only 1 or 2 of them needing the actual divide unit on port 0, the others doing IDK what on other ports. Note the counts for arith.divider_active shown in Skylake profile results on Trial-division code runs 2x faster as 32-bit on Windows than 64-bit on Linux - div r64 with small inputs barely keeps the actual port 0 divider active for longer than div r32, but the other overhead makes it much slower.
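A small sketch of the practical upshot, in the spirit of the linked trial-division Q&A (hedged: exact uop counts and compiler output vary by microarchitecture and compiler): when 64-bit operands are known to fit in 32 bits, dividing at 32-bit width avoids the much more heavily micro-coded 64-bit divide:

#include <cstdint>

// Typically compiles to div r64: tens of uops and much higher latency.
uint64_t rem64(uint64_t n, uint64_t d) {
    return n % d;
}

// If both values are known to fit in 32 bits, divide at 32-bit width instead:
// div r32 is "only" ~10 uops on Skylake-family CPUs.
uint64_t rem32(uint64_t n, uint64_t d) {
    return uint32_t(n) % uint32_t(d);
}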

            FP division is actually single-uop because FP div performance is important in some real-world algorithms. (Especially effect of one divpd on front-end throughput of surrounding code). See Floating point division vs floating point multiplication

            See also Do FP and integer division compete for the same throughput resources on x86 CPUs? - Ice Lake improves the divider HW.

            See also discussion in comments clearing up other misconceptions.

            Related:

            I think I've read that modern divider units are typically built with an iterative not-fully-pipelined part, and then 2 Newton Raphson steps which are pipelined. So that's how division can be partially pipelined on modern CPUs: the next one can start as soon as the current one can move into the Newton-Raphson pipelined part of the execution unit.

            Source https://stackoverflow.com/questions/63153788

            QUESTION

            What happens for a RIP-relative load next to the current instruction? Cache hit?
            Asked 2020-Jun-29 at 21:39

            I am reading Agner Fog's book on x86 assembly. I am wondering about how RIP-relative addressing works in this scenario. Specifically, assume my RIP offset is +1. This suggests the data I want to read is right next to this instruction in memory.

            This piece of data is likely already fetched into the L1 instruction cache. Assuming that this data is not also in the L1d, what exactly will happen on the CPU?

            Let's assume it's a relatively recent Intel architecture like Kaby Lake.

            ...

            ANSWER

            Answered 2020-Jun-29 at 13:30

            Yes, it's probably hot in L1i cache, as well as the uop cache. The page is also hot in L1iTLB. But all that's irrelevant for a data load.

            It might be hot in L2 because of instruction fetch, but it might have been evicted since then (L2 is NINE wrt. L1 caches). So best case is a hit in L2.

            L1iTLB and L1dTLB are separate, so it will miss in L1dTLB if this is the first data load from that page. If the unified 2nd-level TLB is a victim cache, it could miss there and even trigger a page walk despite being hot in L1iTLB, but I don't know if L2TLB actually is a victim cache or not in recent Intel CPUs. It would make sense, though; code and data in the same page are usually rare. (Although less rare than code and data in the same line.)

See also Why do Compilers put data inside .text(code) section of the PE and ELF files and how does the CPU distinguish between data and code? for some details and discussion. But note that question's premise is a false claim: compilers don't do that on x86, because it's the opposite of helpful for performance (wasting TLB coverage footprint, and wasting cache capacity), unlike on ARM where constant pools between functions are normal because PC-relative addressing has very limited range. Only some obfuscators might do it.

            Specifically, assume my RIP offset is +1. This suggests the data I want to read is right next to this instruction in memory

            The rel32 is relative to the end of the current instruction. So no, not right next to; that would be a 1-byte gap.

            e.g. like this:
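The answer's original example isn't included above. As a rough sketch of the point (assuming GNU C++ inline asm in AT&T syntax; embedding data in .text like this is exactly what the answer says compilers avoid), the rel32 is measured from the end of the mov, so the displacement here is just the size of the jmp that skips over the data:

#include <cstdint>

// The load's RIP-relative displacement is the distance from the END of the mov
// to the .quad, i.e. the 2-byte short jmp that skips the embedded data.
static int64_t load_data_next_to_code() {
    int64_t v;
    asm("mov 1f(%%rip), %0\n\t"   // rel32 counted from the end of this instruction
        "jmp 2f\n\t"              // skip the data bytes so we don't execute them
        "1: .quad 42\n\t"         // data living in .text, right after the code
        "2:"
        : "=r"(v));
    return v;
}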

            Source https://stackoverflow.com/questions/62637943

Community Discussions and Code Snippets contain sources from the Stack Exchange Network.

            Vulnerabilities

            No vulnerabilities reported

            Install agner

            You can download it from GitHub.

            Support

For any new features, suggestions, or bugs, create an issue on GitHub. If you have any questions, check for existing answers and ask on the Stack Overflow community page.
            CLONE
          • HTTPS

            https://github.com/mattgodbolt/agner.git

          • CLI

            gh repo clone mattgodbolt/agner

• SSH

            git@github.com:mattgodbolt/agner.git



Consider Popular GPU Libraries

taichi by taichi-dev
gpu.js by gpujs
hashcat by hashcat
cupy by cupy
EASTL by electronicarts

Try Top Libraries by mattgodbolt

seasocks by mattgodbolt (C++)
zindex by mattgodbolt (C++)
jsbeeb by mattgodbolt (JavaScript)
pt-three-ways by mattgodbolt (C++)
Miracle by mattgodbolt (JavaScript)