perf | Performance measurement, storage, and analysis | Storage library
kandi X-RAY | perf Summary
This subrepository holds the source for various packages and tools related to performance measurement, storage, and analysis.

- cmd/benchstat contains a command-line tool that computes and compares statistics about benchmarks.
- cmd/benchsave contains a command-line tool for publishing benchmark results.
- storage contains the benchmark result storage system.
- analysis contains the benchmark result analysis system.

Both storage and analysis can be run locally; the following commands will run the complete stack on your machine with an in-memory datastore. The storage system is designed to have a standardized API, and we encourage additional analysis tools to be written against the API. A client can be found in the storage package.
Community Discussions
Trending Discussions on perf
QUESTION
I'm currently learning how to create C extensions for Python so that I can call C/C++ code. I've been teaching myself with a few examples. I started with this guide and it was very helpful for getting up and running. All of the guides and examples I've found online only give C code where a single function is defined. I'm planning to access a C++ library with multiple functions from Python and so I decided the next logical step in learning would be to add more functions to the example.
However, when I do this, only the first function in the extension is accessible from Python. Here's the example that I've made for myself (for reference I'm working on Ubuntu 21):
The C code (with two functions, func1 and func2, where func1 also depends on func2) and header files:
ANSWER
Answered 2022-Mar-10 at 13:32
Make extern "C" include both functions:
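A minimal sketch of what the fixed extension source might look like, assuming a module named mymodule with toy function bodies (the module name, argument handling, and bodies are illustrative, not from the original post):

    #include <Python.h>

    extern "C" {  // C linkage must cover BOTH functions, not just the first

    static PyObject* func2(PyObject* self, PyObject* args) {
        long x;
        if (!PyArg_ParseTuple(args, "l", &x)) return NULL;
        return PyLong_FromLong(2 * x);                  // toy body
    }

    static PyObject* func1(PyObject* self, PyObject* args) {
        long x;
        if (!PyArg_ParseTuple(args, "l", &x)) return NULL;
        return PyLong_FromLong(2 * x + 1);              // toy body
    }

    // Every function must also appear in the method table; a function
    // missing here is invisible from Python even if it compiles and links.
    static PyMethodDef Methods[] = {
        {"func1", func1, METH_VARARGS, "first function"},
        {"func2", func2, METH_VARARGS, "second function"},
        {NULL, NULL, 0, NULL}
    };

    static struct PyModuleDef moduledef = {
        PyModuleDef_HEAD_INIT, "mymodule", NULL, -1, Methods
    };

    PyMODINIT_FUNC PyInit_mymodule(void) {
        return PyModule_Create(&moduledef);
    }

    }  // extern "C"

With both functions inside extern "C" and listed in the method table, import mymodule followed by mymodule.func1(3) and mymodule.func2(3) should both resolve from Python.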
QUESTION
It was working fine before. I have done nothing: no package updates, no Gradle update, nothing; I just created a new build and this error occurred. But for some team members the error occurred after a Gradle sync.
The issue is that the build generates successfully without any error, but when the app opens it suddenly crashes (in both debug and release mode).
Error
...
ANSWER
Answered 2022-Feb-25 at 23:22
We have fixed the issue by replacing
QUESTION
I am benchmarking the following code: for (T& x : v) x = x + x; where T is int.
When compiling with -mavx2, performance fluctuates by a factor of 2 depending on some conditions. This does not reproduce with -msse4.2. I would like to understand what's happening.
How does the benchmark work
I am using Google Benchmark. It spins the loop until the point it is sure about the time.
The main benchmarking code:
...
ANSWER
Answered 2022-Feb-12 at 20:11
Yes, data misalignment could explain your 2x slowdown for small arrays that fit in L1d. You'd hope that with every other load/store being a cache-line split, it might only slow down by a factor of 1.5x, not 2, if a split load or store cost 2 accesses to L1d instead of 1.
But it has extra effects like replays of uops dependent on the load result that apparently account for the rest of the problem, either making out-of-order exec less able to overlap work and hide latency, or directly running into bottlenecks like "split registers".
ld_blocks.no_sr counts the number of times cache-line split loads are temporarily blocked because all resources for handling the split accesses are in use.
When a load execution unit detects that the load splits across a cache line, it has to save the first part somewhere (apparently in a "split register") and then access the 2nd cache line. On Intel SnB-family CPUs like yours, this 2nd access doesn't require the RS to dispatch the load uop to the port again; the load execution unit just does it a few cycles later. (But presumably can't accept another load in the same cycle as that 2nd access.)
- https://chat.stackoverflow.com/transcript/message/48426108#48426108 - uops waiting for the result of a cache-split load will get replayed.
- Are load ops deallocated from the RS when they dispatch, complete, or some other time? But the load itself can leave the RS earlier.
- How can I accurately benchmark unaligned access speed on x86_64? General stuff on split-load penalties.
The extra latency of split loads, and also the potential replays of uops waiting for those loads' results, is another factor, but those are also fairly direct consequences of misaligned loads. Lots of counts for ld_blocks.no_sr tells you that the CPU actually ran out of split registers and could otherwise be doing more work, but had to stall because of the unaligned load itself, not just other effects.
You could also look for the front-end stalling due to the ROB or RS being full, if you want to investigate the details, but not being able to execute split loads will make that happen more. So probably all the back-end stalling is a consequence of the unaligned loads (and maybe stores if commit from store buffer to L1d is also a bottleneck.)
With 100 KB I reproduce the issue: 1075 ns vs 1412 ns. With 1 MB I don't think I see it.
Data alignment doesn't normally make that much difference for large arrays (except with 512-bit vectors). With a cache line (2x YMM vectors) arriving less frequently, the back-end has time to work through the extra overhead of unaligned loads / stores and still keep up. HW prefetch does a good enough job that it can still max out the per-core L3 bandwidth. Seeing a smaller effect for a size that fits in L2 but not L1d (like 100kiB) is expected.
Of course, most kinds of execution bottlenecks would show similar effects, even something as simple as un-optimized code that does some extra store/reloads for each vector of array data. So this alone doesn't prove that it was misalignment causing the slowdowns for small sizes that do fit in L1d, like your 10 KiB. But that's clearly the most sensible conclusion.
Code alignment or other front-end bottlenecks seem not to be the problem; most of your uops are coming from the DSB, according to idq.dsb_uops. (A significant number aren't, but not a big percentage difference between slow vs. fast.)
How can I mitigate the impact of the Intel jcc erratum on gcc? can be important on Skylake-derived microarchitectures like yours; it's even possible that's why your idq.dsb_uops isn't closer to your uops_issued.any.
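A minimal sketch (not the poster's Google Benchmark harness) of how the effect can be reproduced by offsetting the buffer so every 32-byte AVX2 load/store splits a cache line; the 10 KiB size and 4-byte offset are illustrative assumptions (build with g++ -O3 -mavx2):

    #include <chrono>
    #include <cstdio>
    #include <cstdlib>
    #include <cstring>

    // Time the x = x + x loop over a buffer whose start is either on a
    // 64-byte boundary or 4 bytes past one; in the second case every 32-byte
    // vector access straddles a cache line. unsigned, so repeated doubling
    // wraps without signed-overflow UB.
    static double time_loop(unsigned* data, size_t n) {
        auto t0 = std::chrono::steady_clock::now();
        for (int rep = 0; rep < 100000; ++rep)
            for (size_t i = 0; i < n; ++i)
                data[i] = data[i] + data[i];
        auto t1 = std::chrono::steady_clock::now();
        return std::chrono::duration<double>(t1 - t0).count();
    }

    int main() {
        const size_t n = 10 * 1024 / sizeof(unsigned);  // ~10 KiB, fits in L1d
        // Over-allocate so both an aligned and a misaligned view fit.
        char* raw = static_cast<char*>(std::aligned_alloc(64, n * sizeof(unsigned) + 64));
        std::memset(raw, 1, n * sizeof(unsigned) + 64);
        unsigned* aligned    = reinterpret_cast<unsigned*>(raw);      // 64-byte aligned
        unsigned* misaligned = reinterpret_cast<unsigned*>(raw + 4);  // cache-line splits
        std::printf("aligned:    %f s\n", time_loop(aligned, n));
        std::printf("misaligned: %f s\n", time_loop(misaligned, n));
        std::free(raw);
    }

Running the two cases under perf stat -e ld_blocks.no_sr should show the split-load counts climbing only for the misaligned view.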
QUESTION
I'm receiving the below error on API 31 devices with the Firebase Auth UI library (only the phone number credential),
...
ANSWER
Answered 2022-Jan-20 at 05:58
In my case, Firebase UI (com.firebaseui:firebase-ui-auth:8.0.0) was using com.google.android.gms:play-services-auth:19.0.0, which I found with the command './gradlew -q app:dependencyInsight --dependency play-services-auth --configuration debugCompileClasspath'.
This version of the play services auth was causing the issue for me.
I added a separate implementation 'com.google.android.gms:play-services-auth:20.0.1' to my gradle and this issue disappeared.
QUESTION
I have many test classes, and each has dozens of tests. I want to isolate tests, so instead of a mega context MyDbContext, I use MyDbContextToTestFoo, MyDbContextToTestBar, MyDbContextToTestBaz, etc. So I have MANY DbContext subclasses.
In my unit tests with EF Core 5 I'm running into the ManyServiceProvidersCreatedWarning. They work individually, but many fail when run as a group:
System.InvalidOperationException : An error was generated for warning 'Microsoft.EntityFrameworkCore.Infrastructure.ManyServiceProvidersCreatedWarning': More than twenty 'IServiceProvider' instances have been created for internal use by Entity Framework. This is commonly caused by injection of a new singleton service instance into every DbContext instance. For example, calling 'UseLoggerFactory' passing in a new instance each time--see https://go.microsoft.com/fwlink/?linkid=869049 for more details. This may lead to performance issues, consider reviewing calls on 'DbContextOptionsBuilder' that may require new service providers to be built. This exception can be suppressed or logged by passing event ID 'CoreEventId.ManyServiceProvidersCreatedWarning' to the 'ConfigureWarnings' method in 'DbContext.OnConfiguring' or 'AddDbContext'.
I don't do anything weird with DbContextOptionsBuilder as that error suggests. I don't know how to diagnose "...that may require new service providers to be built". In most tests I create a context normally: new DbContextOptionsBuilder<TContext>().UseSqlite("DataSource=:memory:"), where TContext is one of the context types I mentioned above.
I've read many issues on the repo, and discovered that EF does heavy caching of all sorts of things, but docs on that topic don't exist. The recommendation is to "find what causes so many service providers to be cached", but I don't know what to look for.
There are two workarounds:
- builder.EnableServiceProviderCaching(false), which is apparently very bad for perf
- builder.ConfigureWarnings(x => x.Ignore(CoreEventId.ManyServiceProvidersCreatedWarning)), which ignores the problem
I assume that "service provider" means EF's internal IoC container.
What I want to know is: does the fact that I have many DbContext types (and thus IModel types) affect service provider caching? Are the two related? (I know EF caches an IModel for every DbContext; does it also cache a service provider for each one?)
ANSWER
Answered 2021-Dec-28 at 07:59
Service provider caching is purely based on the context options configuration - the context type, model, etc. don't matter.
In EF Core 5.0, the key according to the source code is
QUESTION
I'm trying to read a CSV file, block by block.
CSV looks like:
...
ANSWER
Answered 2021-Dec-24 at 13:27
Load your file with pd.read_csv and start a new block each time the row of your first column is 'No.'. Use groupby to iterate over each block and create a new dataframe.
QUESTION
I'm using perf for profiling on Ubuntu 20.04 (though I can use any other free tool). It allows passing a delay on the CLI, so that event collection starts a certain time after program launch. However, this time varies a lot (by 20 seconds out of 1000), and there are tail computations which I am not interested in either.
So it would be great to call some API from my program to start perf event collection for the fragment of code I'm interested in, and then stop collection after the code finishes.
It's not really an option to run the code in a loop, because there is a ~30 second initialization phase and a 10 second measurement phase, and I'm only interested in the latter.
...
ANSWER
Answered 2021-Dec-23 at 18:55
There is an inter-process communication mechanism to achieve this between the program being profiled (or a controlling process) and the perf process: use the --control option in the format --control=fifo:ctl-fifo[,ack-fifo] or --control=fd:ctl-fd[,ack-fd], as discussed in the perf-stat(1) manpage. This option specifies either a pair of pathnames of FIFO files (named pipes) or a pair of file descriptors. The first file is used for issuing commands to enable or disable all events in any perf process that is listening to the same file. The second file, which is optional, is used to check with perf when it has actually executed the command.
There is an example in the manpage that shows how to use this option to control a perf process from a bash script, which you can easily translate to C/C++:
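A minimal C++ sketch of the program side. It assumes the two FIFOs already exist and that perf was launched with events initially disabled, along the lines of mkfifo /tmp/perf_ctl.fifo /tmp/perf_ack.fifo and perf stat -D -1 --control=fifo:/tmp/perf_ctl.fifo,/tmp/perf_ack.fifo -p <pid>; the paths and invocation are illustrative:

    #include <fcntl.h>
    #include <unistd.h>

    int main() {
        // Keep both FIFO ends open for the whole run, as in the manpage's
        // bash example; open() blocks until perf has the other end open.
        int ctl = open("/tmp/perf_ctl.fifo", O_WRONLY);
        int ack = open("/tmp/perf_ack.fifo", O_RDONLY);
        char buf[8];

        // ... ~30 s initialization phase, unmeasured (perf started disabled) ...

        write(ctl, "enable\n", 7);     // start event collection
        read(ack, buf, sizeof buf);    // wait until perf confirms with "ack\n"

        // ... the ~10 s region of interest ...

        write(ctl, "disable\n", 8);    // stop event collection
        read(ack, buf, sizeof buf);

        close(ctl);
        close(ack);
    }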
QUESTION
I'm trying to profile with perf on Ubuntu 20.04, but the problem is that many functions do not appear in it (likely because they are inlined), or only their addresses appear (without names etc.). I'm using CMake's RelWithDebInfo build. But there are some template functions that I don't know how to bring to the profiler results. I think marking them noinline may help, if this is legal in C++ for template functions, but this screws up the codebase and needs to be done per function. Any suggestions how to make all functions noinline at once?
ANSWER
Answered 2021-Dec-03 at 11:53
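A hedged sketch of one common approach (offered as an assumption, not as the original answer's text): GCC and Clang accept -fno-inline to keep every function out-of-line at once, and the noinline attribute is also legal on function templates for selective use:

    // Build-wide: add -fno-inline (GCC/Clang) to the RelWithDebInfo flags so
    // all functions stay out-of-line while keeping optimization and debug
    // info, e.g.:
    //   cmake -DCMAKE_CXX_FLAGS_RELWITHDEBINFO="-O2 -g -fno-inline" ..
    //
    // Per-function: the attribute works on template functions too.
    template <typename T>
    [[gnu::noinline]] T lerp(T a, T b, T t) {  // hypothetical example function
        return a + (b - a) * t;
    }

    int main() { return static_cast<int>(lerp(0.0, 10.0, 0.5)); }

QUESTION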
I've installed Windows 10 21H2 on both my desktop (AMD 5950X system with RTX3080) and my laptop (Dell XPS 9560 with i7-7700HQ and GTX1050) following the instructions on https://docs.nvidia.com/cuda/wsl-user-guide/index.html:
- Install CUDA-capable driver in Windows
- Update WSL2 kernel in PowerShell: wsl --update
- Install CUDA toolkit in Ubuntu 20.04 in WSL2 (note that you don't install a CUDA driver in WSL2; the instructions explicitly say that the CUDA driver should not be installed):
ANSWER
Answered 2021-Nov-18 at 19:20
Turns out that Windows 10 Update Assistant incorrectly reported it upgraded my OS to 21H2 on my laptop. Checking the Windows version by running winver reports that my OS is still 21H1.
Of course CUDA in WSL2 will not work in Windows 10 without 21H2.
After successfully installing 21H2 I can confirm CUDA works with WSL2 even for laptops with Optimus NVIDIA cards.
QUESTION
I'm trying to verify the conclusion that two fuseable pairs can be decoded in the same clock cycle, using my Intel i7-10700 and ubuntu 20.04.
The test code is arranged like below, and it is copied like 8000 times to avoid the influence of LSD and DSB (to use MITE mostly).
...
ANSWER
Answered 2021-Nov-12 at 13:08
On Haswell and later, yes. On Ivy Bridge and earlier, no.
On Ice Lake and later, Agner Fog says macro-fusion is done right after decode, instead of in the decoders which required the pre-decoders to send the right chunks of x86 machine code to decoders accordingly. (And Ice Lake has slightly different restrictions: Instructions with a memory operand cannot fuse, unlike previous CPU models. Instructions with an immediate operand can fuse.) So on Ice Lake, macro-fusion doesn't let the decoders handle more than 5 instructions per clock.
Wikichip claims that only 1 macro-fusion per clock is possible on Ice Lake, but that's probably incorrect. Harold tested with my microbenchmark on Rocket Lake and found the same results as Skylake. (Rocket Lake uses a Cypress Cove core, a variant of Sunny Cove back-ported to a 14nm process, so it's likely that it's the same as Ice Lake in this respect.)
Your results indicate that uops_issued.any is about half of instructions, therefore you are seeing macro-fusion of most pairs. (You could also look at the uops_retired.macro_fused perf event. BTW, modern perf has symbolic names for most uarch-specific events: use perf list to see them.)
The decoders will still produce up to four or even five uops per clock on Skylake-derived microarchitectures, though, even if they only make two macro-fusions. You didn't look at how many cycles MITE is active, so you can't see that execution stalls most of the time, until there's room in the ROB / RS for an issue-group of 4 uops. And that opens up space in the IDQ for a decode group from MITE.
You have three other bottlenecks in your loop:
- Loop-carried dependency through dec ecx: only 1/clock, because each dec has to wait for the result of the previous one to be ready.
- Only one taken branch can execute per cycle (on port 6), and dec/jge is taken almost every time, except for 1 in 2^32 when ECX was 0 before the dec. The other branch execution unit on port 0 only handles predicted-not-taken branches. https://www.realworldtech.com/haswell-cpu/4/ shows the layout but doesn't mention that limitation; Agner Fog's microarch guide does.
- Branch prediction: even jumping to the next instruction, which is architecturally a NOP, is not special-cased by the CPU. Slow jmp-instruction (because there's no reason for real code to do this, except for call +0/pop, which is special-cased at least for the return-address predictor stack.)
This is why you're executing at significantly less than one instruction per clock, let alone one uop per clock.
Surprisingly to me, MITE didn't go on to decode a separate test and jcc in the same cycle as it made two fusions. I guess the decoders are optimized for filling the uop cache. (A similar effect on Sandybridge / IvyBridge is that if the final uop of a decode-group is potentially fusable, like dec, decoders will only produce 3 uops that cycle, in anticipation of maybe fusing the dec next cycle. That's true at least on SnB/IvB, where the decoders can only make 1 fusion per cycle, and will decode separate ALU + jcc uops if there is another pair in the same decode group. Here, SKL is choosing not to decode a separate test uop (and jcc and another test) after making two fusions.)
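For reference, a minimal sketch of this kind of experiment (not the poster's 8000-copy test): a loop whose body is two macro-fuseable pairs, to run under perf stat -e instructions,uops_issued.any. The iteration count and GNU inline-asm framing are illustrative assumptions:

    #include <cstdint>

    int main() {
        uint64_t iters = 100000000;        // arbitrary repeat count
        asm volatile(
            "mov $1, %%eax\n\t"            // rax = 1, so jz is never taken
            "1:\n\t"
            "test %%rax, %%rax\n\t"        // pair 1: test + jcc can macro-fuse
            "jz 2f\n\t"                    // not taken; falls through either way
            "2:\n\t"
            "dec %0\n\t"                   // pair 2: dec + jnz fuses on SnB-family
            "jnz 1b\n\t"
            : "+r"(iters)
            :
            : "rax", "cc");
        return 0;
    }

If both pairs fuse, uops_issued.any for the loop should come out near half of instructions, as described above.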
Community Discussions and Code Snippets contain sources that include the Stack Exchange Network.
Vulnerabilities
No vulnerabilities reported