pmu-tools | Intel PMU profiling tools | Performance Testing library

by andikleen Python Version: r220420 License: GPL-2.0

X-Ray Key Features Code Snippets(1)Community Discussions(3)Vulnerabilities Install Support

kandi X-RAY | pmu-tools Summary

pmu-tools is a Python library typically used in Telecommunications, Media, Media, Entertainment, Testing, Performance Testing applications. pmu-tools has no bugs, it has no vulnerabilities, it has build file available, it has a Strong Copyleft License and it has medium support. You can download it from GitHub.

pmu tools is a collection of tools and libraries for profile collection and performance analysis on Intel CPUs on top of Linux perf. This uses performance counters in the CPU.

Support

Quality

Security

License

Reuse

Support

pmu-tools has a medium active ecosystem.

It has 1740 star(s) with 308 fork(s). There are 88 watchers for this library.

It had no major release in the last 6 months.

There are 167 open issues and 218 have been closed. On average issues are closed in 157 days. There are no pull requests.

It has a neutral sentiment in the developer community.

The latest version of pmu-tools is r220420

Quality

pmu-tools has 0 bugs and 0 code smells.

Security

pmu-tools has no vulnerabilities reported, and its dependent libraries have no vulnerabilities reported.

pmu-tools code analysis shows 0 unresolved vulnerabilities.

There are 0 security hotspots that need review.

License

pmu-tools is licensed under the GPL-2.0 License. This license is Strong Copyleft.

Strong Copyleft licenses enforce sharing, and you can use them when creating open source projects.

Reuse

pmu-tools releases are not available. You will need to build from source code and install.

Build file is available. You can build the component from source.

Installation instructions, examples and code snippets are available.

Top functions reviewed by kandi - BETA

kandi has reviewed pmu-tools and discovered the below as its top functions. This is intended to give you an instant insight into pmu-tools implemented functionality, and help decide if they suit your requirements.

Generate polynomial
Cat rm all files in infn
Format a name
C catrm output files
Measure the performance of a given function
Handle an input subset
Generate script
Execute flex function
Find the emap native JSON object
Return the name of the eventlist file
Process command line arguments
Generate chart
Perform a perf command
Set up the number of CPU cores
Handle GET request
Flush all the cpunames
Compute the resolution
Update the contents of the file
Write the csv to the writer
Return the name of the eventlist
Check if the runner list is valid
Return the string representation of the event
Return a list of relevant perf features
Measure the workload and sample the results
Define the event
Create a worksheet with the given information
Convert a list of strings into equations

Get all kandi verified functions for this library.

pmu-tools Key Features

No Key Features are available at this moment for pmu-tools.

pmu-tools Examples and Code Snippets

2. Different use cases

Python

Lines of Code : 12

License : Permissive (Apache-2.0)

Copy

$ gcc -g3 -Iinclude -lm -o bubble_sort src/bubble_sort.c src/debug.c
-fprofile-generate $ gcc -g3 -Iinclude -lm -o matrix_multiplication
src/matrix_multiplication.c src/debug.c -fprofile-generate $ gcc -g3
-Iinclude -lm -o pi_calculation src/pi_calcu

Community Discussions

Trending Discussions on pmu-tools

How exactly do partial registers on Haswell/Skylake perform? Writing AL seems to have a false dependency on RAX, and AH is inconsistent

Trouble understanding and comparing CPU performance metrics

how to interpret perf iTLB-loads,iTLB-load-misses

QUESTION

How exactly do partial registers on Haswell/Skylake perform? Writing AL seems to have a false dependency on RAX, and AH is inconsistent

Asked 2019-Oct-10 at 02:19

This loop runs at one iteration per 3 cycles on Intel Conroe/Merom, bottlenecked on imul throughput as expected. But on Haswell/Skylake, it runs at one iteration per 11 cycles, apparently because setnz al has a dependency on the last imul.

...

ANSWER

Answered 2019-Oct-10 at 02:04

Other answers welcome to address Sandybridge and IvyBridge in more detail. I don't have access to that hardware.

I haven't found any partial-reg behaviour differences between HSW and SKL. On Haswell and Skylake, everything I've tested so far supports this model:

AL is never renamed separately from RAX (or r15b from r15). So if you never touch the high8 registers (AH/BH/CH/DH), everything behaves exactly like on a CPU with no partial-reg renaming (e.g. AMD).

Write-only access to AL merges into RAX, with a dependency on RAX. For loads into AL, this is a micro-fused ALU+load uop that executes on p0156, which is one of the strongest pieces of evidence that it's truly merging on every write, and not just doing some fancy double-bookkeeping as Agner speculated.

Agner (and Intel) say Sandybridge can require a merging uop for AL, so it probably is renamed separately from RAX. For SnB, Intel's optimization manual (section 3.5.2.4 Partial Register Stalls) says

SnB (not necessarily later uarches) inserts a merging uop in the following cases:

After a write to one of the registers AH, BH, CH or DH and before a following read of the 2-, 4- or 8-byte form of the same register. In these cases a merge micro-op is inserted. The insertion consumes a full allocation cycle in which other micro-ops cannot be allocated.

After a micro-op with a destination register of 1 or 2 bytes, which is not a source of the instruction (or the register's bigger form), and before a following read of a 2-,4- or 8-byte form of the same register. In these cases the merge micro-op is part of the flow.

I think they're saying that on SnB, add al,bl will RMW the full RAX instead of renaming it separately, because one of the source registers is (part of) RAX. My guess is that this doesn't apply for a load like mov al, [rbx + rax]; rax in an addressing mode probably doesn't count as a source.

I haven't tested whether high8 merging uops still have to issue/rename on their own on HSW/SKL. That would make the front-end impact equivalent to 4 uops (since that's the issue/rename pipeline width).

There is no way to break a dependency involving AL without writing EAX/RAX. xor al,al doesn't help, and neither does mov al, 0.
movzx ebx, al has zero latency (renamed), and needs no execution unit. (i.e. mov-elimination works on HSW and SKL). It triggers merging of AH if it's dirty, which I guess is necessary for it to work without an ALU. It's probably not a coincidence that Intel dropped low8 renaming in the same uarch that introduced mov-elimination. (Agner Fog's micro-arch guide has a mistake here, saying that zero-extended moves are not eliminated on HSW or SKL, only IvB.)
movzx eax, al is not eliminated at rename. mov-elimination on Intel never works for same,same. mov rax,rax isn't eliminated either, even though it doesn't have to zero-extend anything. (Although there'd be no point to giving it special hardware support, because it's just a no-op, unlike mov eax,eax). Anyway, prefer moving between two separate architectural registers when zero-extending, whether it's with a 32-bit mov or an 8-bit movzx.
movzx eax, bx is not eliminated at rename on HSW or SKL. It has 1c latency and uses an ALU uop. Intel's optimization manual only mentions zero-latency for 8-bit movzx (and points out that movzx r32, high8 is never renamed).

High-8 regs can be renamed separately from the rest of the register, and do need merging uops.

Write-only access to ah with mov ah, reg8 or mov ah, [mem8] do rename AH, with no dependency on the old value. These are both instructions that wouldn't normally need an ALU uop for the 32-bit version. (But mov ah, bl is not eliminated; it does need a p0156 ALU uop so that might be a coincidence).
a RMW of AH (like inc ah) dirties it.
setcc ah depends on the old ah, but still dirties it. I think mov ah, imm8 is the same, but haven't tested as many corner cases.

(Unexplained: a loop involving setcc ah can sometimes run from the LSD, see the rcr loop at the end of this post. Maybe as long as ah is clean at the end of the loop, it can use the LSD?).

If ah is dirty, setcc ah merges into the renamed ah, rather than forcing a merge into rax. e.g. %rep 4 (inc al / test ebx,ebx / setcc ah / inc al / inc ah) generates no merging uops, and only runs in about 8.7c (latency of 8 inc al slowed down by resource conflicts from the uops for ah. Also the inc ah / setcc ah dep chain).

I think what's going on here is that setcc r8 is always implemented as a read-modify-write. Intel probably decided that it wasn't worth having a write-only setcc uop to optimize the setcc ah case, since it's very rare for compiler-generated code to setcc ah. (But see the godbolt link in the question: clang4.0 with -m32 will do so.)
reading AX, EAX, or RAX triggers a merge uop (which takes up front-end issue/rename bandwidth). Probably the RAT (Register Allocation Table) tracks the high-8-dirty state for the architectural R[ABCD]X, and even after a write to AH retires, the AH data is stored in a separate physical register from RAX. Even with 256 NOPs between writing AH and reading EAX, there is an extra merging uop. (ROB size=224 on SKL, so this guarantees that the mov ah, 123 was retired). Detected with uops_issued/executed perf counters, which clearly show the difference.
Read-modify-write of AL (e.g. inc al) merges for free, as part of the ALU uop. (Only tested with a few simple uops, like add/inc, not div r8 or mul r8). Again, no merging uop is triggered even if AH is dirty.
Write-only to EAX/RAX (like lea eax, [rsi + rcx] or xor eax,eax) clears the AH-dirty state (no merging uop).
Write-only to AX (mov ax, 1) triggers a merge of AH first. I guess instead of special-casing this, it runs like any other RMW of AX/RAX. (TODO: test mov ax, bx, although that shouldn't be special because it's not renamed.)
xor ah,ah has 1c latency, is not dep-breaking, and still needs an execution port.
Read and/or write of AL does not force a merge, so AH can stay dirty (and be used independently in a separate dep chain). (e.g. add ah, cl / add al, dl can run at 1 per clock (bottlenecked on add latency).

Making AH dirty prevents a loop from running from the LSD (the loop-buffer), even when there are no merging uops. The LSD is when the CPU recycles uops in the queue that feeds the issue/rename stage. (Called the IDQ).

Inserting merging uops is a bit like inserting stack-sync uops for the stack-engine. Intel's optimization manual says that SnB's LSD can't run loops with mismatched push/pop, which makes sense, but it implies that it can run loops with balanced push/pop. That's not what I'm seeing on SKL: even balanced push/pop prevents running from the LSD (e.g. push rax / pop rdx / times 6 imul rax, rdx. (There may be a real difference between SnB's LSD and HSW/SKL: SnB may just "lock down" the uops in the IDQ instead of repeating them multiple times, so a 5-uop loop takes 2 cycles to issue instead of 1.25.) Anyway, it appears that HSW/SKL can't use the LSD when a high-8 register is dirty, or when it contains stack-engine uops.

This behaviour may be related to a an erratum in SKL:

SKL150: Short Loops Which Use AH/BH/CH/DH Registers May Cause Unpredictable System Behaviour

Problem: Under complex micro-architectural conditions, short loops of less than 64 instruction that use AH, BH, CH, or DH registers as well as their corresponding wider registers (e.g. RAX, EAX, or AX for AH) may cause unpredictable system behaviour. This can only happen when both logical processors on the same physical processor are active.

This may also be related to Intel's optimization manual statement that SnB at least has to issue/rename an AH-merge uop in a cycle by itself. That's a weird difference for the front-end.

My Linux kernel log says microcode: sig=0x506e3, pf=0x2, revision=0x84. Arch Linux's intel-ucode package just provides the update, you have to edit config files to actually have it loaded. So my Skylake testing was on an i7-6700k with microcode revision 0x84, which doesn't include the fix for SKL150. It matches the Haswell behaviour in every case I tested, IIRC. (e.g. both Haswell and my SKL can run the setne ah / add ah,ah / rcr ebx,1 / mov eax,ebx loop from the LSD). I have HT enabled (which is a pre-condition for SKL150 to manifest), but I was testing on a mostly-idle system so my thread had the core to itself.

With updated microcode, the LSD is completely disabled for everything all the time, not just when partial registers are active. lsd.uops is always exactly zero, including for real programs not synthetic loops. Hardware bugs (rather than microcode bugs) often require disabling a whole feature to fix. This is why SKL-avx512 (SKX) is reported to not have a loopback buffer. Fortunately this is not a performance problem: SKL's increased uop-cache throughput over Broadwell can almost always keep up with issue/rename.

Extra AH/BH/CH/DH latency:

Reading AH when it's not dirty (renamed separately) adds an extra cycle of latency for both operands. e.g. add bl, ah has a latency of 2c from input BL to output BL, so it can add latency to the critical path even if RAX and AH are not part of it. (I've seen this kind of extra latency for the other operand before, with vector latency on Skylake, where an int/float delay "pollutes" a register forever. TODO: write that up.)

This means unpacking bytes with movzx ecx, al / movzx edx, ah has extra latency vs. movzx/shr eax,8/movzx, but still better throughput.

Reading AH when it is dirty doesn't add any latency. (add ah,ah or add ah,dh/add dh,ah have 1c latency per add). I haven't done a lot of testing to confirm this in many corner-cases.

Hypothesis: a dirty high8 value is stored in the bottom of a physical register. Reading a clean high8 requires a shift to extract bits [15:8], but reading a dirty high8 can just take bits [7:0] of a physical register like a normal 8-bit register read.

Extra latency doesn't mean reduced throughput. This program can run at 1 iter per 2 clocks, even though all the add instructions have 2c latency (from reading DH, which is not modified.)

Source https://stackoverflow.com/questions/45660139

QUESTION

Trouble understanding and comparing CPU performance metrics

Asked 2019-Feb-24 at 01:15

When running toplev, from pmu-tools on a piece of software (compiled with gcc: gcc -g -O3) I get this output:

...

ANSWER

Answered 2019-Feb-24 at 01:15

Your assembly code reveals why the bandwidth DSB metric is very high (i.e., in 42.01% of all core cycles in which the DSB is active, the DSB delivers less than 4 uops). The issue seems to exist in the following loop:

Source https://stackoverflow.com/questions/54845826

QUESTION

how to interpret perf iTLB-loads,iTLB-load-misses

Asked 2018-Apr-21 at 19:06

I have a test case to observe perf iTLB-loads,iTLB-load-misses by

...

ANSWER

Answered 2018-Apr-21 at 19:06

On your Broadwell processor, perf maps iTLB-loads to ITLB_MISSES.STLB_HIT, which represents the event of a TLB lookup that misses the L1 ITLB but hits the unified TLB for all page sizes, and iTLB-load-misses to ITLB_MISSES.MISS_CAUSES_A_WALK, which represents the event of a TLB lookup that misses both the L1 ITLB and the unified TLB (causing a page walk) for all page sizes. Therefore, iTLB-load-misses can be larger or smaller than or equal to iTLB-loads. They are independent events.

Source https://stackoverflow.com/questions/49933319

Community Discussions, Code Snippets contain sources that include Stack Exchange Network

Vulnerabilities

No vulnerabilities reported

Install pmu-tools

pmu-tools doesn't really need to be installed. It's enough to clone the repository and run the respective tool (like toplev or ocperf) out of the source directory. To run it from other directories you can use export PATH=$PATH:/path/to/pmu-tools or symlink the tool you're interested in to /usr/local/bin or ~/bin. The tools automatically find their python dependencies. When first run, toplev / ocperf will automatically download the Intel event lists from https://download.01.org. This requires working internet access. Later runs can be done offline. It's also possible to download the event lists ahead, see pmu-tools offline. toplev works with both python 2.7 and python 3. However it requires an not too old perf tools and depending on the CPU an uptodate kernel. For more details see toplev kernel support. The majority of the tools also don't require any python dependencies and run in "included batteries only" mode. The main exception is generating plots or XLSX spreadsheets, which require external libraries. If you want to use those run. once, or follow the command suggested in error messages. jevents is a C library. It has no dependencies other than gcc/make and can be built with.

Support

For any new features, suggestions and bugs create an issue on GitHub. If you have any questions check and ask questions on community page Stack Overflow .

Find more information at: