benchmarks | Various caliper benchmarks

by nurkiewicz | Language: Java | Version: Current | License: Apache-2.0

kandi X-RAY | benchmarks Summary

benchmarks is a Java library. It has no reported bugs or vulnerabilities, a build file is available, it has a permissive license, and it has low support. You can download it from GitHub.

Support

benchmarks has a low active ecosystem.

It has 27 stars and 10 forks. There is 1 watcher for this library.

It had no major release in the last 6 months.

benchmarks has no issues reported. There is 1 open pull request and there are 0 closed requests.

It has a neutral sentiment in the developer community.

The latest version of benchmarks is current.

Quality

benchmarks has 0 bugs and 0 code smells.

Security

benchmarks has no vulnerabilities reported, and its dependent libraries have no vulnerabilities reported.

benchmarks code analysis shows 0 unresolved vulnerabilities.

There are 0 security hotspots that need review.

License

benchmarks is licensed under the Apache-2.0 License. This license is Permissive.

Permissive licenses have the least restrictions, and you can use them in most projects.

Reuse

benchmarks releases are not available. You will need to build from source code and install.

Build file is available. You can build the component from source.

It has 403 lines of code, 46 functions and 14 files.

It has low code complexity. Code complexity directly impacts maintainability of the code.

Top functions reviewed by kandi - BETA

kandi has reviewed benchmarks and identified the functions below as its top functions. This is intended to give you a quick insight into the functionality benchmarks implements, and to help you decide whether it suits your requirements.

Utility function calls for small private methods
Add two integers
Add 8 bits
Add 128 bits
Add a 128 bit
Adds 8 bits
Adds two integers
Adds 4 4 to the quadr
Time calls for a single method call
Add a 128 bit matrix
Overriding super class
The underlying Calculator
Performs benchmark using CGLib
Performs a benchmark with an identity benchmark
Performs a benchmark with no caching
Returns the value for the given x
Performs a manual caching
Performs benchmark using a cacheable class
Performs a benchmark of the cacheable as cacheable
Performs a benchmark using aspect ratio
Calculate the number of smallvirtualVirtualVirtualMethodCallCallCalls
Method to get the Calculator instance from Spring context

benchmarks Key Features

No Key Features are available at this moment for benchmarks.

benchmarks Examples and Code Snippets

No Code Snippets are available at this moment for benchmarks.

Community Discussions

Trending Discussions on benchmarks

JMH - How to measure time it takes to insert 50M items in an ArrayList

Is it possible to use #if NET6_0_OR_GREATER to exclude a benchmark method from a BenchmarkDotNet run?

BenchmarkTools outputs to DataFrame

Which alignment causes this performance difference

nexus-staging-maven-plugin: maven deploy failed: An API incompatibility was encountered while executing

Create std::string from std::span of unsigned char

looping over array, performance difference between indexed and enhanced for loop

Performance issue when using multiple threads with sqlite3

GEMM kernel implemented using AVX2 is faster than AVX2/FMA on a Zen 2 CPU

Most efficient way to remove element of certain value everywhere from List? C#

QUESTION

JMH - How to measure time it takes to insert 50M items in an ArrayList

Asked 2022-Mar-21 at 15:47

I have an ArrayList of 50M items, and I would like to measure the time it takes to store that many objects in it. It seems that all JMH modes are time based, so we can't really control the number of executions of the code under @Benchmark. For example, how can I ensure the following code is run exactly 50M times per fork?

...

ANSWER

Answered 2022-Mar-21 at 15:47

You can create a benchmark class (ArrayListBenchmark) and a runner class (BenchmarkRunner).

In the ArrayListBenchmark class, you add the benchmark method, which iterates the desired number of times, adding items to the List.
In the BenchmarkRunner class, you set the desired number of items to add to the List and configure the runner options.

Note: Depending on your environment, adding 50M items may throw an OutOfMemoryError.
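The core of that approach is a method whose loop runs a fixed number of times, so the count is controlled by the code rather than by the harness. The following is a minimal plain-Java sketch of that idea, not the answer's original code and not a JMH benchmark: the class name ArrayListFill, the method names, and the item count are all hypothetical, and the timing uses System.nanoTime as a stand-in for the harness. In a real JMH setup the fill loop would sit inside the @Benchmark method, and JMH's single-shot mode is the usual fit for timing one invocation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the benchmark body; not the answer's original code.
class ArrayListFill {

    // Adds exactly `count` items to a fresh list and returns it.
    // Under JMH this loop would be the body of the @Benchmark method.
    static List<Integer> fill(int count) {
        List<Integer> list = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            list.add(i);
        }
        return list;
    }

    // Times a single invocation of fill(count), in nanoseconds.
    static long timeOnce(int count) {
        long start = System.nanoTime();
        List<Integer> list = fill(count);
        long elapsed = System.nanoTime() - start;
        // Use the result so the work cannot be discarded as dead code.
        if (list.size() != count) {
            throw new IllegalStateException("unexpected size: " + list.size());
        }
        return elapsed;
    }

    public static void main(String[] args) {
        int count = 1_000_000; // assumed size; 50M may need a larger heap
        System.out.println("elapsed ns: " + timeOnce(count));
    }
}
```

A single hand-timed run like this includes JIT warm-up and garbage-collection noise, which is exactly what a harness such as JMH is designed to control, so treat the number as a rough indication only.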

Benchmark class:

Source https://stackoverflow.com/questions/71549289

QUESTION

Is it possible to use #if NET6_0_OR_GREATER to exclude a benchmark method from a BenchmarkDotNet run?

Asked 2022-Feb-21 at 12:25

Suppose that you're writing some benchmarks for use with BenchmarkDotNet that are multi-targeted to net48 and net6.0, and that one of those benchmarks can only be compiled for the net6.0 target.

The obvious thing to do is to use something like this to exclude that particular benchmark from the net48 build:

...

ANSWER

Answered 2021-Dec-08 at 16:50

From memory, BenchmarkDotNet will run benchmarks for all frameworks with some internal wizardry. So instead of using the existing preprocessor symbols, it's probably better to split your tests across two classes with different RuntimeMoniker attributes. For example:

Source https://stackoverflow.com/questions/70274948

QUESTION

BenchmarkTools outputs to DataFrame

Asked 2022-Feb-19 at 14:49

I am trying to benchmark the performance of functions using BenchmarkTools as in the example below. My goal is to obtain the outputs of @benchmark as a DataFrame.

In this example, I am benchmarking the performance of the following two functions:

...

ANSWER

Answered 2022-Feb-19 at 14:07

You can do it e.g. like this:

Source https://stackoverflow.com/questions/71183235

QUESTION

Which alignment causes this performance difference

Asked 2022-Feb-12 at 20:11

What's the problem

I am benchmarking the following code: for (T& x : v) x = x + x; where T is int. When compiling with -mavx2, performance fluctuates by a factor of 2 depending on some conditions. This does not reproduce with SSE4.2.

I would like to understand what's happening.

How does the benchmark work

I am using Google Benchmark. It spins the loop until the point it is sure about the time.

The main benchmarking code:

...

ANSWER

Answered 2022-Feb-12 at 20:11

Yes, data misalignment could explain your 2x slowdown for small arrays that fit in L1d. You'd hope that with every other load/store being a cache-line split, it might only slow down by a factor of 1.5x, not 2, if a split load or store cost 2 accesses to L1d instead of 1.
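The "every other load" and 1.5x figures above can be checked with a little arithmetic. A minimal sketch, assuming 64-byte cache lines and 32-byte (YMM) accesses issued back to back; the class and method names are hypothetical, and the cost model (a split costs two L1d accesses, anything else costs one) is the simplification the answer describes, not a measurement.

```java
// Hypothetical helper for reasoning about cache-line splits.
class CacheLineSplit {
    static final int LINE = 64; // assumed cache-line size in bytes

    // True if an access of `width` bytes at `addr` touches two cache lines.
    static boolean splits(long addr, int width) {
        return (addr / LINE) != ((addr + width - 1) / LINE);
    }

    // Counts split accesses among `n` consecutive vectors starting at `base`.
    static int countSplits(long base, int width, int n) {
        int count = 0;
        for (int i = 0; i < n; i++) {
            if (splits(base + (long) i * width, width)) {
                count++;
            }
        }
        return count;
    }

    // Average L1d accesses per load if a split costs 2 and others cost 1.
    static double avgAccesses(long base, int width, int n) {
        return (double) (n + countSplits(base, width, n)) / n;
    }

    public static void main(String[] args) {
        // Aligned base: no splits. Base offset by 16 bytes: every other load splits.
        System.out.println(countSplits(0, 32, 8));
        System.out.println(countSplits(16, 32, 8));
        System.out.println(avgAccesses(16, 32, 8));
    }
}
```

With a base address offset by 16 bytes, half of the 32-byte accesses straddle a line boundary, which gives the 1.5 accesses per load that the simple model predicts; the observed 2x slowdown is what the answer attributes to the extra effects discussed next.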

But it has extra effects like replays of uops dependent on the load result that apparently account for the rest of the problem, either making out-of-order exec less able to overlap work and hide latency, or directly running into bottlenecks like "split registers".

ld_blocks.no_sr counts number of times cache-line split loads are temporarily blocked because all resources for handling the split accesses are in use.

When a load execution unit detects that the load splits across a cache line, it has to save the first part somewhere (apparently in a "split register") and then access the 2nd cache line. On Intel SnB-family CPUs like yours, this 2nd access doesn't require the RS to dispatch the load uop to the port again; the load execution unit just does it a few cycles later. (But presumably can't accept another load in the same cycle as that 2nd access.)

https://chat.stackoverflow.com/transcript/message/48426108#48426108 - uops waiting for the result of a cache-split load will get replayed.
"Are load ops deallocated from the RS when they dispatch, complete or some other time?" - But the load itself can leave the RS earlier.
"How can I accurately benchmark unaligned access speed on x86_64?" - general stuff on split load penalties.

The extra latency of split loads, and also the potential replays of uops waiting for those loads results, is another factor, but those are also fairly direct consequences of misaligned loads. Lots of counts for ld_blocks.no_sr tells you that the CPU actually ran out of split registers and could otherwise be doing more work, but had to stall because of the unaligned load itself, not just other effects.

You could also look for the front-end stalling due to the ROB or RS being full, if you want to investigate the details, but not being able to execute split loads will make that happen more. So probably all the back-end stalling is a consequence of the unaligned loads (and maybe stores if commit from store buffer to L1d is also a bottleneck.)

On a 100KB I reproduce the issue: 1075ns vs 1412ns. On 1 MB I don't think I see it.

Data alignment doesn't normally make that much difference for large arrays (except with 512-bit vectors). With a cache line (2x YMM vectors) arriving less frequently, the back-end has time to work through the extra overhead of unaligned loads / stores and still keep up. HW prefetch does a good enough job that it can still max out the per-core L3 bandwidth. Seeing a smaller effect for a size that fits in L2 but not L1d (like 100kiB) is expected.

Of course, most kinds of execution bottlenecks would show similar effects, even something as simple as un-optimized code that does some extra store/reloads for each vector of array data. So this alone doesn't prove that it was misalignment causing the slowdowns for small sizes that do fit in L1d, like your 10 KiB. But that's clearly the most sensible conclusion.

Code alignment or other front-end bottlenecks seem not to be the problem; most of your uops are coming from the DSB, according to idq.dsb_uops. (A significant number aren't, but not a big percentage difference between slow vs. fast.)

"How can I mitigate the impact of the Intel jcc erratum on gcc?" can be important on Skylake-derived microarchitectures like yours; it's even possible that's why your idq.dsb_uops isn't closer to your uops_issued.any.

Source https://stackoverflow.com/questions/71090526

QUESTION

nexus-staging-maven-plugin: maven deploy failed: An API incompatibility was encountered while executing

Asked 2022-Feb-11 at 22:39

This worked fine for me when building under Java 8. Now, under Java 17.01, I get this when I run mvn deploy.

mvn install works fine. I tried 3.6.3 and 3.8.4 and updated (I think) all my plugins to the newest versions.

Any ideas?

...

ANSWER

Answered 2022-Feb-11 at 22:39

Update: Version 1.6.9 has been released and should fix this issue! 🎉

This is actually a known bug, which has been open for quite a while: OSSRH-66257. There are two known workarounds:

1. Open Modules

As a workaround, use --add-opens to give the library causing the problem access to the required classes:

Source https://stackoverflow.com/questions/70153962

QUESTION

Create std::string from std::span of unsigned char

Asked 2022-Jan-23 at 16:19

I am using a C library which uses various fixed-sized unsigned char arrays with no null terminator as strings.

I've been converting them to std::string using the following function:

...

ANSWER

Answered 2022-Jan-22 at 22:33

You want:

Source https://stackoverflow.com/questions/70817628

QUESTION

looping over array, performance difference between indexed and enhanced for loop

Asked 2022-Jan-05 at 19:41

The JLS states that, for arrays, "The enhanced for statement is equivalent to a basic for statement of the form". However, if I check the generated bytecode for JDK 8, different bytecode is generated for the two variants, and if I try to measure the performance, surprisingly, the enhanced one seems to give better results (on JDK 8). Can someone explain why? I'd guess it's because of incorrect JMH testing, so if that's the case, please suggest how to fix it. (I know that JMH advises against testing with loops, but I don't think this applies here, as I'm actually trying to measure the loops.)

My JMH testing was rather simple (probably too simple), but I cannot explain the results. Testing JMH code is below, typical results are:

...

ANSWER

Answered 2022-Jan-05 at 19:41

TL;DR: You are observing what happens when the JIT compiler cannot trust that values are not changing inside the loop. Additionally, in a tiny benchmark like this, Blackhole.consume costs dominate, obscuring the results.
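To make the first point concrete, here is a small sketch of the three loop shapes involved. It assumes the array is held in an instance field, which is an assumption about the question's benchmark and not something shown here; the class and method names are hypothetical. Per the JLS expansion quoted in the question, the enhanced for evaluates the array expression once into a local, whereas the indexed loop as written re-reads the field for the bound and for each element.

```java
// Hypothetical illustration; not the question's or the answer's code.
class LoopForms {
    int[] data; // a field: other code could reassign it while a loop runs

    LoopForms(int[] data) {
        this.data = data;
    }

    // Indexed loop: reads the field `data` for the bound and for each element.
    long sumIndexed() {
        long sum = 0;
        for (int i = 0; i < data.length; i++) {
            sum += data[i];
        }
        return sum;
    }

    // Enhanced for loop over the same field.
    long sumEnhanced() {
        long sum = 0;
        for (int v : data) {
            sum += v;
        }
        return sum;
    }

    // Hand-written form of the JLS expansion: the array reference is
    // copied to a local once, so the loop no longer depends on the field.
    long sumLocalCopy() {
        int[] a = data;
        long sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        LoopForms f = new LoopForms(new int[] {1, 2, 3, 4});
        System.out.println(f.sumIndexed() + " " + f.sumEnhanced() + " " + f.sumLocalCopy());
    }
}
```

All three return the same sum; the difference is only in what the compiler is allowed to assume about `data` between iterations, which is the effect the answer refers to.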

Simplifying the test:

Source https://stackoverflow.com/questions/70583053

QUESTION

Performance issue when using multiple threads with sqlite3

Asked 2021-Dec-27 at 20:44

I am writing a program that generates hashes for files in all subdirectories and then puts them in a database or prints them to standard output: https://github.com/cherrry9/dedup

In the latest commit, I added an option for my program to use multiple threads (THREADS macro).

Here are some benchmarks that I did:

...

ANSWER

Answered 2021-Dec-27 at 20:11

It seems that all your threads use the same database connection and statement objects. Therefore you have a race-condition (even in SERIALIZED threading model), as multiple threads are binding, stepping, and resetting the same statement. Asking 'why is it slow' becomes irrelevant until you fix this problem.

Instead you should wrap your sql_insert with a mutex to guarantee that at most one thread is accessing the database connection:
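The answer's fix is in C against sqlite3; the same pattern can be sketched in Java, which is a swapped-in language used purely for illustration. In this sketch the shared connection and statement are replaced by a plain ArrayList, which is likewise not safe for concurrent writes. The names GuardedInsert, insert, and run are hypothetical, and the lock plays the role of the mutex around sql_insert.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical stand-in: the list represents the shared, non-thread-safe
// database connection and statement from the question.
class GuardedInsert {
    private final ReentrantLock lock = new ReentrantLock();
    private final List<String> rows = new ArrayList<>();

    // At most one thread at a time may touch the shared resource.
    void insert(String row) {
        lock.lock();
        try {
            rows.add(row);
        } finally {
            lock.unlock();
        }
    }

    int size() {
        lock.lock();
        try {
            return rows.size();
        } finally {
            lock.unlock();
        }
    }

    // Starts `threads` workers that each insert `perThread` rows,
    // waits for them, and returns the total row count.
    static int run(int threads, int perThread) {
        GuardedInsert db = new GuardedInsert();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final int id = t;
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    db.insert("file-" + id + "-" + i);
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return db.size();
    }

    public static void main(String[] args) {
        System.out.println(run(4, 1000)); // prints 4000
    }
}
```

Serializing the inserts this way removes the race but also means the database step gains nothing from extra threads; any speedup has to come from the work done outside the lock, such as hashing.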

Source https://stackoverflow.com/questions/70499116

QUESTION

GEMM kernel implemented using AVX2 is faster than AVX2/FMA on a Zen 2 CPU

Asked 2021-Dec-14 at 20:40

I have tried speeding up a toy GEMM implementation. I deal with blocks of 32x32 doubles for which I need an optimized MM kernel. I have access to AVX2 and FMA.

I have two codes (in ASM; I apologize for the crudeness of the formatting) defined below: one makes use of AVX2 features, the other uses FMA.

Without going into micro benchmarks, I would like to try to develop an understanding (theoretical) of why the AVX2 implementation is 1.11x faster than the FMA version. And possibly how to improve both versions.

The codes below are for a 3000x3000 MM of doubles and the kernels are implemented using the classical, naive MM with an interchanged deepest loop. I'm using a Ryzen 3700x/Zen 2 as development CPU.

I have not tried unrolling aggressively, in fear that the CPU might run out of physical registers.

AVX2 32x32 MM kernel:

...

ANSWER

Answered 2021-Dec-13 at 21:36

Zen2 has 3 cycle latency for vaddpd, 5 cycle latency for vfma...pd. (https://uops.info/).

Your code with 8 accumulators has enough ILP that you'd expect close to two FMA per clock, about 8 per 5 clocks (if there aren't other bottlenecks) which is a bit less than the 10/5 theoretical max.

vaddpd and vmulpd actually run on different ports on Zen2 (unlike Intel), port FP2/3 and FP0/1 respectively, so it can in theory sustain 2/clock vaddpd and vmulpd. Since the latency of the loop-carried dependency is shorter, 8 accumulators are enough to hide the vaddpd latency if scheduling doesn't let one dep chain get behind. (But at least multiplies aren't stealing cycles from it.)

Zen2's front-end is 5 instructions wide (or 6 uops if there are any multi-uop instructions), and it can decode memory-source instructions as a single uop. So it might well be doing 2/clock each multiply and add with the non-FMA version.

If you can unroll by 10 or 12, that might hide enough FMA latency and make it equal to the non-FMA version, but with less power consumption and more SMT-friendly to code running on the other logical core. (10 = 5 x 2 would be just barely enough, which means any scheduling imperfections lose progress on a dep chain which is on the critical path. See "Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators)" for some testing on Intel.)
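The accumulator counts in this answer follow from one rule of thumb: the number of independent dependency chains needed to keep a pipelined unit busy is latency times throughput. A minimal sketch of that arithmetic, using only the figures quoted above (5-cycle FMA latency, 3-cycle vaddpd latency, 2 operations per clock); the class and method names are hypothetical, and the model ignores every other bottleneck.

```java
// Hypothetical helper for the latency-times-throughput rule of thumb.
class AccumulatorMath {

    // Independent chains needed to saturate a unit:
    // latency (cycles) x peak throughput (operations per cycle).
    static int chainsNeeded(int latencyCycles, int opsPerCycle) {
        return latencyCycles * opsPerCycle;
    }

    // Operations per cycle sustained with `chains` accumulators,
    // capped at the unit's peak throughput.
    static double sustained(int chains, int latencyCycles, int opsPerCycle) {
        return Math.min((double) chains / latencyCycles, opsPerCycle);
    }

    public static void main(String[] args) {
        System.out.println(chainsNeeded(5, 2)); // FMA: 10 accumulators
        System.out.println(sustained(8, 5, 2)); // FMA with 8: 1.6 per clock
        System.out.println(chainsNeeded(3, 2)); // vaddpd: 6 accumulators
        System.out.println(sustained(8, 3, 2)); // vaddpd with 8: 2.0 per clock
    }
}
```

Under this model, 8 accumulators leave the FMA version at 8 per 5 clocks while the separate add chain is already saturated, which is consistent with the answer's suggestion to unroll by 10 or 12.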

(By comparison, Intel Skylake runs vaddpd/vmulpd on the same ports with the same latency as vfma...pd, all with 4c latency, 0.5c throughput.)

I didn't look at your code super carefully, but 10 YMM vectors might be a tradeoff between touching two pairs of cache lines vs. touching 5 total lines, which might be worse if a spatial prefetcher tries to complete an aligned pair. Or might be fine. 12 YMM vectors would be three pairs, which should be fine.

Depending on matrix size, out-of-order exec may be able to overlap inner loop dep chains between separate iterations of the outer loop, especially if the loop exit condition can execute sooner and resolve the mispredict (if there is one) while FP work is still in flight. That's an advantage to having fewer total uops for the same work, favouring FMA.

Source https://stackoverflow.com/questions/70340734

QUESTION

Most efficient way to remove element of certain value everywhere from List? C#

Asked 2021-Nov-21 at 19:25

EDIT: Benchmarks for different techniques published at the bottom of this question.

I have a very large List full of integers. I want to remove every occurrence of "3" from the List. Which technique would be most efficient to do this? I would normally use the .Remove(3) extension until it returns false, but I fear that each call to .Remove(3) internally loops through the entire List unnecessarily.

EDIT: It was recommended in the comments to try

TheList = TheList.Where(x => x != 3).ToList();

but I need to remove the elements without instantiating a new List.

...

ANSWER

Answered 2021-Nov-21 at 17:55

You can just use List.RemoveAll and pass your predicate - https://docs.microsoft.com/en-us/dotnet/api/system.collections.generic.list-1.removeall?view=net-6.0#System_Collections_Generic_List_1_RemoveAll_System_Predicate__0__ . This is guaranteed to have linear complexity, O(list.Count).
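The question and answer are about C#. For readers of this Java library page, the closest Java counterpart is Collection.removeIf, sketched below as an analogue and not as the answer's code. The class and method names are hypothetical. Like RemoveAll, it modifies the existing list in place and takes a predicate, so no second list is created by the caller.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical Java analogue of the C# List.RemoveAll call.
class RemoveAllThrees {

    // Removes every element equal to `value` from `list`, in place.
    static void removeValue(List<Integer> list, int value) {
        list.removeIf(x -> x == value);
    }

    public static void main(String[] args) {
        List<Integer> list = new ArrayList<>(Arrays.asList(3, 1, 3, 2, 3));
        removeValue(list, 3);
        System.out.println(list); // prints [1, 2]
    }
}
```

This avoids the repeated scans that calling a single-element remove in a loop would cause, which is the concern raised in the question.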

Source https://stackoverflow.com/questions/70056930

Community Discussions, Code Snippets contain sources that include Stack Exchange Network

Vulnerabilities

No vulnerabilities reported

Install benchmarks

You can download it from GitHub.
You can use benchmarks like any standard Java library. Include the jar files in your classpath. You can also use any IDE to run and debug the benchmarks component as you would any other Java program. Best practice is to use a build tool that supports dependency management, such as Maven or Gradle. For Maven installation, please refer to maven.apache.org. For Gradle installation, please refer to gradle.org.

Support

For new features, suggestions, and bugs, create an issue on GitHub. If you have questions, search for existing answers or ask on the Stack Overflow community page.