perf | Performance measurement, storage, and analysis | Storage library
kandi X-RAY | perf Summary
This subrepository holds the source for various packages and tools related to performance measurement, storage, and analysis.

- cmd/benchstat contains a command-line tool that computes and compares statistics about benchmarks.
- cmd/benchsave contains a command-line tool for publishing benchmark results.
- storage contains the benchmark result storage system.
- analysis contains the benchmark result analysis system.

Both storage and analysis can be run locally; the following commands will run the complete stack on your machine with an in-memory datastore. The storage system is designed to have a standardized API, and we encourage additional analysis tools to be written against the API. A client can be found in the storage package.
Community Discussions
Trending Discussions on perf
QUESTION
I'm currently learning how to create C extensions for Python so that I can call C/C++ code. I've been teaching myself with a few examples. I started with this guide and it was very helpful for getting up and running. All of the guides and examples I've found online only give C code where a single function is defined. I'm planning to access a C++ library with multiple functions from Python and so I decided the next logical step in learning would be to add more functions to the example.
However, when I do this, only the first function in the extension is accessible from Python. Here's the example that I've made for myself (for reference I'm working on Ubuntu 21):
The C code (with two functions, func1 and func2, where func1 also depends on func2) and header files:
ANSWER
Answered 2022-Mar-10 at 13:32
Make extern "C" include both functions:
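A minimal sketch of what the fixed extension source might look like, assuming a module named mymodule with toy function bodies (the module name, argument handling, and bodies are illustrative, not from the original post):

    #include <Python.h>

    extern "C" {  // C linkage must cover BOTH functions, not just the first

    static PyObject* func2(PyObject* self, PyObject* args) {
        long x;
        if (!PyArg_ParseTuple(args, "l", &x)) return NULL;
        return PyLong_FromLong(2 * x);                  // toy body
    }

    static PyObject* func1(PyObject* self, PyObject* args) {
        long x;
        if (!PyArg_ParseTuple(args, "l", &x)) return NULL;
        return PyLong_FromLong(2 * x + 1);              // toy body
    }

    // Every function must also appear in the method table; a function
    // missing here is invisible from Python even if it compiles and links.
    static PyMethodDef Methods[] = {
        {"func1", func1, METH_VARARGS, "first function"},
        {"func2", func2, METH_VARARGS, "second function"},
        {NULL, NULL, 0, NULL}
    };

    static struct PyModuleDef moduledef = {
        PyModuleDef_HEAD_INIT, "mymodule", NULL, -1, Methods
    };

    PyMODINIT_FUNC PyInit_mymodule(void) {
        return PyModule_Create(&moduledef);
    }

    }  // extern "C"

With both functions inside extern "C" and listed in the method table, import mymodule followed by mymodule.func1(3) and mymodule.func2(3) should both resolve from Python.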
QUESTION
It was working fine before. I have done nothing: no package updates, no Gradle update, nothing; I just created a new build and this error occurred. But for some team members the error occurred after a Gradle sync.
The issue is that the build generates successfully without any error, but when the app opens it suddenly crashes (in both debug and release mode).
Error
...
ANSWER
Answered 2022-Feb-25 at 23:22
We have fixed the issue by replacing
QUESTION
I am benchmarking the following code: for (T& x : v) x = x + x; where T is int.
When compiling with -mavx2, performance fluctuates by a factor of 2 depending on some conditions. This does not reproduce with -msse4.2. I would like to understand what's happening.
How does the benchmark work
I am using Google Benchmark. It spins the loop until the point it is sure about the time.
The main benchmarking code:
...
ANSWER
Answered 2022-Feb-12 at 20:11
Yes, data misalignment could explain your 2x slowdown for small arrays that fit in L1d. You'd hope that with every other load/store being a cache-line split, it might only slow down by a factor of 1.5x, not 2, if a split load or store cost 2 accesses to L1d instead of 1.
But it has extra effects like replays of uops dependent on the load result that apparently account for the rest of the problem, either making out-of-order exec less able to overlap work and hide latency, or directly running into bottlenecks like "split registers".
ld_blocks.no_sr counts the number of times cache-line split loads are temporarily blocked because all resources for handling the split accesses are in use.
When a load execution unit detects that the load splits across a cache line, it has to save the first part somewhere (apparently in a "split register") and then access the 2nd cache line. On Intel SnB-family CPUs like yours, this 2nd access doesn't require the RS to dispatch the load uop to the port again; the load execution unit just does it a few cycles later. (But presumably can't accept another load in the same cycle as that 2nd access.)
- https://chat.stackoverflow.com/transcript/message/48426108#48426108 - uops waiting for the result of a cache-split load will get replayed.
- Are load ops deallocated from the RS when they dispatch, complete, or some other time? But the load itself can leave the RS earlier.
- How can I accurately benchmark unaligned access speed on x86_64? General stuff on split-load penalties.
The extra latency of split loads, and also the potential replays of uops waiting for those loads' results, is another factor, but those are also fairly direct consequences of misaligned loads. Lots of counts for ld_blocks.no_sr tells you that the CPU actually ran out of split registers and could otherwise be doing more work, but had to stall because of the unaligned load itself, not just other effects.
You could also look for the front-end stalling due to the ROB or RS being full, if you want to investigate the details, but not being able to execute split loads will make that happen more. So probably all the back-end stalling is a consequence of the unaligned loads (and maybe stores if commit from store buffer to L1d is also a bottleneck.)
With 100 KB I reproduce the issue: 1075 ns vs 1412 ns. With 1 MB I don't think I see it.
Data alignment doesn't normally make that much difference for large arrays (except with 512-bit vectors). With a cache line (2x YMM vectors) arriving less frequently, the back-end has time to work through the extra overhead of unaligned loads / stores and still keep up. HW prefetch does a good enough job that it can still max out the per-core L3 bandwidth. Seeing a smaller effect for a size that fits in L2 but not L1d (like 100kiB) is expected.
Of course, most kinds of execution bottlenecks would show similar effects, even something as simple as un-optimized code that does some extra store/reloads for each vector of array data. So this alone doesn't prove that it was misalignment causing the slowdowns for small sizes that do fit in L1d, like your 10 KiB. But that's clearly the most sensible conclusion.
Code alignment or other front-end bottlenecks seem not to be the problem; most of your uops are coming from the DSB, according to idq.dsb_uops. (A significant number aren't, but not a big percentage difference between slow vs. fast.)
How can I mitigate the impact of the Intel jcc erratum on gcc? can be important on Skylake-derived microarchitectures like yours; it's even possible that's why your idq.dsb_uops isn't closer to your uops_issued.any.
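A minimal sketch (not the poster's Google Benchmark harness) of how the effect can be reproduced by offsetting the buffer so every 32-byte AVX2 load/store splits a cache line; the 10 KiB size and 4-byte offset are illustrative assumptions (build with g++ -O3 -mavx2):

    #include <chrono>
    #include <cstdio>
    #include <cstdlib>
    #include <cstring>

    // Time the x = x + x loop over a buffer whose start is either on a
    // 64-byte boundary or 4 bytes past one; in the second case every 32-byte
    // vector access straddles a cache line. unsigned, so repeated doubling
    // wraps without signed-overflow UB.
    static double time_loop(unsigned* data, size_t n) {
        auto t0 = std::chrono::steady_clock::now();
        for (int rep = 0; rep < 100000; ++rep)
            for (size_t i = 0; i < n; ++i)
                data[i] = data[i] + data[i];
        auto t1 = std::chrono::steady_clock::now();
        return std::chrono::duration<double>(t1 - t0).count();
    }

    int main() {
        const size_t n = 10 * 1024 / sizeof(unsigned);  // ~10 KiB, fits in L1d
        // Over-allocate so both an aligned and a misaligned view fit.
        char* raw = static_cast<char*>(std::aligned_alloc(64, n * sizeof(unsigned) + 64));
        std::memset(raw, 1, n * sizeof(unsigned) + 64);
        unsigned* aligned    = reinterpret_cast<unsigned*>(raw);      // 64-byte aligned
        unsigned* misaligned = reinterpret_cast<unsigned*>(raw + 4);  // cache-line splits
        std::printf("aligned:    %f s\n", time_loop(aligned, n));
        std::printf("misaligned: %f s\n", time_loop(misaligned, n));
        std::free(raw);
    }

Running the two cases under perf stat -e ld_blocks.no_sr should show the split-load counts climbing only for the misaligned view.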
QUESTION
I'm receiving the below error on API 31 devices with the Firebase Auth UI library (only the phone number credential),
...
ANSWER
Answered 2022-Jan-20 at 05:58
In my case, Firebase UI (com.firebaseui:firebase-ui-auth:8.0.0) was using com.google.android.gms:play-services-auth:19.0.0, which I found with the command './gradlew -q app:dependencyInsight --dependency play-services-auth --configuration debugCompileClasspath'.
This version of the play services auth was causing the issue for me.
I added a separate implementation 'com.google.android.gms:play-services-auth:20.0.1' to my gradle and this issue disappeared.
QUESTION
I have many test classes, and each has dozens of tests. I want to isolate tests, so instead of a mega context MyDbContext, I use MyDbContextToTestFoo, MyDbContextToTestBar, MyDbContextToTestBaz, etc. So I have MANY DbContext subclasses.
In my unit tests with EF Core 5 I'm running into the ManyServiceProvidersCreatedWarning. They work individually, but many fail when run as a group:
System.InvalidOperationException : An error was generated for warning 'Microsoft.EntityFrameworkCore.Infrastructure.ManyServiceProvidersCreatedWarning': More than twenty 'IServiceProvider' instances have been created for internal use by Entity Framework. This is commonly caused by injection of a new singleton service instance into every DbContext instance. For example, calling 'UseLoggerFactory' passing in a new instance each time--see https://go.microsoft.com/fwlink/?linkid=869049 for more details. This may lead to performance issues, consider reviewing calls on 'DbContextOptionsBuilder' that may require new service providers to be built. This exception can be suppressed or logged by passing event ID 'CoreEventId.ManyServiceProvidersCreatedWarning' to the 'ConfigureWarnings' method in 'DbContext.OnConfiguring' or 'AddDbContext'.
I don't do anything weird with DbContextOptionsBuilder as that error suggests. I don't know how to diagnose "...that may require new service providers to be built". In most tests I create a context normally: new DbContextOptionsBuilder<TContext>().UseSqlite("DataSource=:memory:"), where TContext is one of the context types I mentioned above.
I've read many issues on the repo, and discovered that EF does heavy caching of all sorts of things, but docs on that topic don't exist. The recommendation is to "find what causes so many service providers to be cached", but I don't know what to look for.
There are two workarounds:
- builder.EnableServiceProviderCaching(false), which is apparently very bad for perf
- builder.ConfigureWarnings(x => x.Ignore(CoreEventId.ManyServiceProvidersCreatedWarning)), which ignores the problem
I assume that "service provider" means EF's internal IoC container.
What I want to know is: does the fact that I have many DbContext types (and thus IModel types) affect service provider caching? Are the two related? (I know EF caches an IModel for every DbContext; does it also cache a service provider for each one?)
ANSWER
Answered 2021-Dec-28 at 07:59
Service provider caching is purely based on the context options configuration - the context type, model, etc. don't matter.
In EF Core 5.0, the key according to the source code is
QUESTION
I'm trying to read a CSV file, block by block.
CSV looks like:
...
ANSWER
Answered 2021-Dec-24 at 13:27
Load your file with pd.read_csv and start a new block each time the row of your first column is 'No.'. Use groupby to iterate over each block and create a new dataframe.
QUESTION
I'm using perf for profiling on Ubuntu 20.04 (though I can use any other free tool). It allows passing a delay on the CLI, so that event collection starts a certain time after program launch. However, this time varies a lot (by 20 seconds out of 1000), and there are tail computations which I am not interested in either.
So it would be great to call some API from my program to start perf event collection for the fragment of code I'm interested in, and then stop collection after the code finishes.
It's not really an option to run the code in a loop, because there is a ~30 second initialization phase and a 10 second measurement phase, and I'm only interested in the latter.
...
ANSWER
Answered 2021-Dec-23 at 18:55
There is an inter-process communication mechanism to achieve this between the program being profiled (or a controlling process) and the perf process: use the --control option in the format --control=fifo:ctl-fifo[,ack-fifo] or --control=fd:ctl-fd[,ack-fd], as discussed in the perf-stat(1) manpage. This option specifies either a pair of pathnames of FIFO files (named pipes) or a pair of file descriptors. The first file is used for issuing commands to enable or disable all events in any perf process that is listening to the same file. The second file, which is optional, is used to check with perf when it has actually executed the command.
There is an example in the manpage that shows how to use this option to control a perf process from a bash script, which you can easily translate to C/C++:
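A minimal C++ sketch of the program side. It assumes the two FIFOs already exist and that perf was launched with events initially disabled, along the lines of mkfifo /tmp/perf_ctl.fifo /tmp/perf_ack.fifo and perf stat -D -1 --control=fifo:/tmp/perf_ctl.fifo,/tmp/perf_ack.fifo -p <pid>; the paths and invocation are illustrative:

    #include <fcntl.h>
    #include <unistd.h>

    int main() {
        // Keep both FIFO ends open for the whole run, as in the manpage's
        // bash example; open() blocks until perf has the other end open.
        int ctl = open("/tmp/perf_ctl.fifo", O_WRONLY);
        int ack = open("/tmp/perf_ack.fifo", O_RDONLY);
        char buf[8];

        // ... ~30 s initialization phase, unmeasured (perf started disabled) ...

        write(ctl, "enable\n", 7);     // start event collection
        read(ack, buf, sizeof buf);    // wait until perf confirms with "ack\n"

        // ... the ~10 s region of interest ...

        write(ctl, "disable\n", 8);    // stop event collection
        read(ack, buf, sizeof buf);

        close(ctl);
        close(ack);
    }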
QUESTION
I'm trying to profile with perf on Ubuntu 20.04, but the problem is that many functions do not appear in it (likely because they are inlined), or only their addresses appear (without names etc.). I'm using CMake's RelWithDebInfo build. But there are some template functions that I don't know how to bring to the profiler results. I think marking them noinline may help, if this is legal in C++ for template functions, but this screws up the codebase and needs to be done per function. Any suggestions how to make all functions noinline at once?
ANSWER
Answered 2021-Dec-03 at 11:53
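A hedged sketch of one common approach (offered as an assumption, not as the original answer's text): GCC and Clang accept -fno-inline to keep every function out-of-line at once, and the noinline attribute is also legal on function templates for selective use:

    // Build-wide: add -fno-inline (GCC/Clang) to the RelWithDebInfo flags so
    // all functions stay out-of-line while keeping optimization and debug
    // info, e.g.:
    //   cmake -DCMAKE_CXX_FLAGS_RELWITHDEBINFO="-O2 -g -fno-inline" ..
    //
    // Per-function: the attribute works on template functions too.
    template <typename T>
    [[gnu::noinline]] T lerp(T a, T b, T t) {  // hypothetical example function
        return a + (b - a) * t;
    }

    int main() { return static_cast<int>(lerp(0.0, 10.0, 0.5)); }

QUESTION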
I've installed Windows 10 21H2 on both my desktop (AMD 5950X system with RTX3080) and my laptop (Dell XPS 9560 with i7-7700HQ and GTX1050) following the instructions on https://docs.nvidia.com/cuda/wsl-user-guide/index.html:
- Install CUDA-capable driver in Windows
- Update WSL2 kernel in PowerShell: wsl --update
- Install CUDA toolkit in Ubuntu 20.04 in WSL2 (note that you don't install a CUDA driver in WSL2; the instructions explicitly say that the CUDA driver should not be installed):
ANSWER
Answered 2021-Nov-18 at 19:20
Turns out that Windows 10 Update Assistant incorrectly reported it upgraded my OS to 21H2 on my laptop. Checking the Windows version by running winver reports that my OS is still 21H1.
Of course CUDA in WSL2 will not work in Windows 10 without 21H2.
After successfully installing 21H2 I can confirm CUDA works with WSL2 even for laptops with Optimus NVIDIA cards.
QUESTION
I'm trying to verify the conclusion that two fuseable pairs can be decoded in the same clock cycle, using my Intel i7-10700 and ubuntu 20.04.
The test code is arranged like below, and it is copied like 8000 times to avoid the influence of LSD and DSB (to use MITE mostly).
...
ANSWER
Answered 2021-Nov-12 at 13:08
On Haswell and later, yes. On Ivy Bridge and earlier, no.
On Ice Lake and later, Agner Fog says macro-fusion is done right after decode, instead of in the decoders which required the pre-decoders to send the right chunks of x86 machine code to decoders accordingly. (And Ice Lake has slightly different restrictions: Instructions with a memory operand cannot fuse, unlike previous CPU models. Instructions with an immediate operand can fuse.) So on Ice Lake, macro-fusion doesn't let the decoders handle more than 5 instructions per clock.
Wikichip claims that only 1 macro-fusion per clock is possible on Ice Lake, but that's probably incorrect. Harold tested with my microbenchmark on Rocket Lake and found the same results as Skylake. (Rocket Lake uses a Cypress Cove core, a variant of Sunny Cove back-ported to a 14nm process, so it's likely that it's the same as Ice Lake in this respect.)
Your results indicate that uops_issued.any is about half of instructions, therefore you are seeing macro-fusion of most pairs. (You could also look at the uops_retired.macro_fused perf event. BTW, modern perf has symbolic names for most uarch-specific events: use perf list to see them.)
The decoders will still produce up to four or even five uops per clock on Skylake-derived microarchitectures, though, even if they only make two macro-fusions. You didn't look at how many cycles MITE is active, so you can't see that execution stalls most of the time, until there's room in the ROB / RS for an issue-group of 4 uops. And that opens up space in the IDQ for a decode group from MITE.
You have three other bottlenecks in your loop:
- Loop-carried dependency through dec ecx: only 1/clock, because each dec has to wait for the result of the previous one to be ready.
- Only one taken branch can execute per cycle (on port 6), and dec/jge is taken almost every time, except for 1 in 2^32 when ECX was 0 before the dec. The other branch execution unit on port 0 only handles predicted-not-taken branches. https://www.realworldtech.com/haswell-cpu/4/ shows the layout but doesn't mention that limitation; Agner Fog's microarch guide does.
- Branch prediction: even jumping to the next instruction, which is architecturally a NOP, is not special-cased by the CPU. Slow jmp-instruction (because there's no reason for real code to do this, except for call +0/pop, which is special-cased at least for the return-address predictor stack.)
This is why you're executing at significantly less than one instruction per clock, let alone one uop per clock.
Surprisingly to me, MITE didn't go on to decode a separate test and jcc in the same cycle as it made two fusions. I guess the decoders are optimized for filling the uop cache. (A similar effect on Sandybridge / IvyBridge is that if the final uop of a decode-group is potentially fusable, like dec, decoders will only produce 3 uops that cycle, in anticipation of maybe fusing the dec next cycle. That's true at least on SnB/IvB, where the decoders can only make 1 fusion per cycle, and will decode separate ALU + jcc uops if there is another pair in the same decode group. Here, SKL is choosing not to decode a separate test uop (and jcc and another test) after making two fusions.)
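For reference, a minimal sketch of this kind of experiment (not the poster's 8000-copy test): a loop whose body is two macro-fuseable pairs, to run under perf stat -e instructions,uops_issued.any. The iteration count and GNU inline-asm framing are illustrative assumptions:

    #include <cstdint>

    int main() {
        uint64_t iters = 100000000;        // arbitrary repeat count
        asm volatile(
            "mov $1, %%eax\n\t"            // rax = 1, so jz is never taken
            "1:\n\t"
            "test %%rax, %%rax\n\t"        // pair 1: test + jcc can macro-fuse
            "jz 2f\n\t"                    // not taken; falls through either way
            "2:\n\t"
            "dec %0\n\t"                   // pair 2: dec + jnz fuses on SnB-family
            "jnz 1b\n\t"
            : "+r"(iters)
            :
            : "rax", "cc");
        return 0;
    }

If both pairs fuse, uops_issued.any for the loop should come out near half of instructions, as described above.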
Community Discussions and Code Snippets contain sources that include the Stack Exchange Network.
Vulnerabilities
No vulnerabilities reported