benchmark | Run microbenchmarks in Elm

by elm-explorations Elm Version: Current License: BSD-3-Clause

X-Ray Key Features Code Snippets Community Discussions(10)Vulnerabilities Install Support

kandi X-RAY | benchmark Summary

benchmark is a Elm library. benchmark has no bugs, it has no vulnerabilities, it has a Permissive License and it has low support. You can download it from GitHub.

Run microbenchmarks in Elm.

Support

Quality

Security

License

Reuse

Support

benchmark has a low active ecosystem.

It has 23 star(s) with 5 fork(s). There are 3 watchers for this library.

It had no major release in the last 6 months.

There are 14 open issues and 2 have been closed. On average issues are closed in 119 days. There are 2 open pull requests and 0 closed requests.

It has a neutral sentiment in the developer community.

The latest version of benchmark is current.

Quality

benchmark has 0 bugs and 0 code smells.

Security

benchmark has no vulnerabilities reported, and its dependent libraries have no vulnerabilities reported.

benchmark code analysis shows 0 unresolved vulnerabilities.

There are 0 security hotspots that need review.

License

benchmark is licensed under the BSD-3-Clause License. This license is Permissive.

Permissive licenses have the least restrictions, and you can use them in most projects.

Reuse

benchmark releases are not available. You will need to build from source code and install.

Installation instructions, examples and code snippets are available.

Top functions reviewed by kandi - BETA

kandi's functional review helps you automatically verify the functionalities of the libraries and avoid rework.
Currently covering the most popular Java, JavaScript and Python libraries. See a Sample of benchmark

Get all kandi verified functions for this library.

benchmark Key Features

No Key Features are available at this moment for benchmark.

benchmark Examples and Code Snippets

No Code Snippets are available at this moment for benchmark.

Community Discussions

Trending Discussions on benchmark

Fast method of getting all the descendants of a parent

Is if(A | B) always faster than if(A || B)?

JMH using java 17, no dead code elimination

Missing bounds checking elimination in String constructor?

Why is QuackSort 2x faster than Data.List's sort for random lists?

GEMM kernel implemented using AVX2 is faster than AVX2/FMA on a Zen 2 CPU

Assembly why is "lea eax, [eax + eax*const]; shl eax, eax, const;" combined faster than "imul eax, eax, const" according to gcc -O2?

Loop takes more cycles to execute than expected in an ARM Cortex-A72 CPU

How is the s=s+c string concat optimization decided?

Bug? MATLAB MEX changes the kind of the default logical

QUESTION

Fast method of getting all the descendants of a parent

Asked 2022-Feb-25 at 08:17

With the parent-child relationships data frame as below:

...

ANSWER

Answered 2022-Feb-25 at 08:17

We can use ego like below

Source https://stackoverflow.com/questions/71022350

QUESTION

Is if(A | B) always faster than if(A || B)?

Asked 2022-Feb-11 at 05:03

I am reading this book by Fedor Pikus and he has some very very interesting examples which for me were a surprise.
Particularly this benchmark caught me, where the only difference is that in one of them we use || in if and in another we use |.

...

ANSWER

Answered 2022-Feb-08 at 19:57

Code readability, short-circuiting and it is not guaranteed that Ord will always outperform a || operand. Computer systems are more complicated than expected, even though they are man-made.

There was a case where a for loop with a much more complicated condition ran faster on an IBM. The CPU didn't cool and thus instructions were executed faster, that was a possible reason. What I am trying to say, focus on other areas to improve code than fighting small-cases which will differ depending on the CPU and the boolean evaluation (compiler optimizations).

Source https://stackoverflow.com/questions/71039947

QUESTION

JMH using java 17, no dead code elimination

Asked 2022-Feb-09 at 17:17

I run sample JHM benchmark which suppose to show dead code elimination. Code is rewritten for conciseness from jhm github sample.

...

ANSWER

Answered 2022-Feb-09 at 17:17

Those samples depend on JDK internals.

Looks like since JDK 9 and JDK-8152907, Math.log is no longer intrinsified into C2 intermediate representation. Instead, a direct call to a quick LIBM-backed stub is made. This is usually faster for the code that actually uses the result. Notice how measureCorrect is faster in JDK 17 output in your case.

But for JMH samples, it limits the the compiler optimizations around the Math.log, and dead code / folding samples do not work properly. The fix it to make samples that do not rely on JDK internals without a good reason, and instead use a custom written payload.

This is being done in JMH here:

Source https://stackoverflow.com/questions/71044636

QUESTION

Missing bounds checking elimination in String constructor?

Asked 2022-Jan-30 at 21:18

Looking into UTF8 decoding performance, I noticed the performance of protobuf's UnsafeProcessor::decodeUtf8 is better than String(byte[] bytes, int offset, int length, Charset charset) for the following non ascii string: "Quizdeltagerne spiste jordbær med flØde, mens cirkusklovnen".

I tried to figure out why, so I copied the relevant code in String and replaced the array accesses with unsafe array accesses, same as UnsafeProcessor::decodeUtf8. Here are the JMH benchmark results:

...

ANSWER

Answered 2022-Jan-12 at 09:52

To measure the branch you are interested in and particularly the scenario when while loop becomes hot, I've used the following benchmark:

Source https://stackoverflow.com/questions/70272651

QUESTION

Why is QuackSort 2x faster than Data.List's sort for random lists?

Asked 2022-Jan-27 at 19:24

I was looking for the canonical implementation of MergeSort on Haskell to port to HOVM, and I found this StackOverflow answer. When porting the algorithm, I realized something looked silly: the algorithm has a "halve" function that does nothing but split a list in two, using half of the length, before recursing and merging. So I thought: why not make a better use of this pass, and use a pivot, to make each half respectively smaller and bigger than that pivot? That would increase the odds that recursive merge calls are applied to already-sorted lists, which might speed up the algorithm!

I've done this change, resulting in the following code:

...

ANSWER

Answered 2022-Jan-27 at 19:15

Your split splits the list in two ordered halves, so merge consumes its first argument first and then just produces the second half in full. In other words it is equivalent to ++, doing redundant comparisons on the first half which always turn out to be True.

In the true mergesort the merge actually does twice the work on random data because the two parts are not ordered.

The split though spends some work on the partitioning whereas an online bottom-up mergesort would spend no work there at all. But the built-in sort tries to detect ordered runs in the input, and apparently that extra work is not negligible.

Source https://stackoverflow.com/questions/70856865

QUESTION

GEMM kernel implemented using AVX2 is faster than AVX2/FMA on a Zen 2 CPU

Asked 2021-Dec-14 at 20:40

I have tried speeding up a toy GEMM implementation. I deal with blocks of 32x32 doubles for which I need an optimized MM kernel. I have access to AVX2 and FMA.

I have two codes (in ASM, I apologies for the crudeness of the formatting) defined below, one is making use of AVX2 features, the other uses FMA.

Without going into micro benchmarks, I would like to try to develop an understanding (theoretical) of why the AVX2 implementation is 1.11x faster than the FMA version. And possibly how to improve both versions.

The codes below are for a 3000x3000 MM of doubles and the kernels are implemented using the classical, naive MM with an interchanged deepest loop. I'm using a Ryzen 3700x/Zen 2 as development CPU.

I have not tried unrolling aggressively, in fear that the CPU might run out of physical registers.

AVX2 32x32 MM kernel:

...

ANSWER

Answered 2021-Dec-13 at 21:36

Zen2 has 3 cycle latency for vaddpd, 5 cycle latency for vfma...pd. (https://uops.info/).

Your code with 8 accumulators has enough ILP that you'd expect close to two FMA per clock, about 8 per 5 clocks (if there aren't other bottlenecks) which is a bit less than the 10/5 theoretical max.

vaddpd and vmulpd actually run on different ports on Zen2 (unlike Intel), port FP2/3 and FP0/1 respectively, so it can in theory sustain 2/clock vaddpd and vmulpd. Since the latency of the loop-carried dependency is shorter, 8 accumulators are enough to hide the vaddpd latency if scheduling doesn't let one dep chain get behind. (But at least multiplies aren't stealing cycles from it.)

Zen2's front-end is 5 instructions wide (or 6 uops if there are any multi-uop instructions), and it can decode memory-source instructions as a single uop. So it might well be doing 2/clock each multiply and add with the non-FMA version.

If you can unroll by 10 or 12, that might hide enough FMA latency and make it equal to the non-FMA version, but with less power consumption and more SMT-friendly to code running on the other logical core. (10 = 5 x 2 would be just barely enough, which means any scheduling imperfections lose progress on a dep chain which is on the critical path. See Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators) for some testing on Intel.)

(By comparison, Intel Skylake runs vaddpd/vmulpd on the same ports with the same latency as vfma...pd, all with 4c latency, 0.5c throughput.)

I didn't look at your code super carefully, but 10 YMM vectors might be a tradeoff between touching two pairs of cache lines vs. touching 5 total lines, which might be worse if a spatial prefetcher tries to complete an aligned pair. Or might be fine. 12 YMM vectors would be three pairs, which should be fine.

Depending on matrix size, out-of-order exec may be able to overlap inner loop dep chains between separate iterations of the outer loop, especially if the loop exit condition can execute sooner and resolve the mispredict (if there is one) while FP work is still in flight. That's an advantage to having fewer total uops for the same work, favouring FMA.

Source https://stackoverflow.com/questions/70340734

QUESTION

Assembly why is "lea eax, [eax + eax*const]; shl eax, eax, const;" combined faster than "imul eax, eax, const" according to gcc -O2?

Asked 2021-Dec-13 at 10:27

I'm using godbolt to get assembly of the following program:

...

ANSWER

Answered 2021-Dec-13 at 06:33

You can see the cost of instructions on most mainstream architecture here and there. Based on that and assuming you use for example an Intel Skylake processor, you can see that one 32-bit imul instruction can be computed per cycle but with a latency of 3 cycles. In the optimized code, 2 lea instructions (which are very cheap) can be executed per cycle with a 1 cycle latency. The same thing apply for the sal instruction (2 per cycle and 1 cycle of latency).

This means that the optimized version can be executed with only 2 cycle of latency while the first one takes 3 cycle of latency (not taking into account load/store instructions that are the same). Moreover, the second version can be better pipelined since the two instructions can be executed for two different input data in parallel thanks to a superscalar out-of-order execution. Note that two loads can be executed in parallel too although only one store can be executed in parallel per cycle. This means that the execution is bounded by the throughput of store instructions. Overall, only 1 value can only computed per cycle. AFAIK, recent Intel Icelake processors can do two stores in parallel like new AMD Ryzen processors. The second one is expected to be as fast or possibly faster on the chosen use-case (Intel Skylake processors). It should be significantly faster on very recent x86-64 processors.

Note that the lea instruction is very fast because the multiply-add is done on a dedicated CPU unit (hard-wired shifters) and it only supports some specific constant for the multiplication (supported factors are 1, 2, 4 and 8, which mean that lea can be used to multiply an integer by the constants 2, 3, 4, 5, 8 and 9). This is why lea is faster than imul/mul.

UPDATE (v2):

I can reproduce the slower execution with -O2 using GCC 11.2 (on Linux with a i5-9600KF processor).

The main source of source of slowdown comes from the higher number of micro-operations (uops) to be executed in the -O2 version certainly combined with the saturation of some execution ports certainly due to a bad micro-operation scheduling.

Here is the assembly of the loop with -Os:

Source https://stackoverflow.com/questions/70316686

QUESTION

Loop takes more cycles to execute than expected in an ARM Cortex-A72 CPU

Asked 2021-Dec-03 at 06:02

Consider the following code, running on an ARM Cortex-A72 processor (optimization guide here). I have included what I expect are resource pressures for each execution port:

Instruction B I0 I1 M L S F0 F1 .LBB0_1: ldr q3, [x1], #16 0.5 0.5 1 ldr q4, [x2], #16 0.5 0.5 1 add x8, x8, #4 0.5 0.5 cmp x8, #508 0.5 0.5 mul v5.4s, v3.4s, v4.4s 2 mul v5.4s, v5.4s, v0.4s 2 smull v6.2d, v5.2s, v1.2s 1 smull2 v5.2d, v5.4s, v2.4s 1 smlal v6.2d, v3.2s, v4.2s 1 smlal2 v5.2d, v3.4s, v4.4s 1 uzp2 v3.4s, v6.4s, v5.4s 1 str q3, [x0], #16 0.5 0.5 1 b.lo .LBB0_1 1 Total port pressure 1 2.5 2.5 0 2 1 8 1

Although uzp2 could run on either the F0 or F1 ports, I chose to attribute it entirely to F1 due to high pressure on F0 and zero pressure on F1 other than this instruction.

There are no dependencies between loop iterations, other than the loop counter and array pointers; and these should be resolved very quickly, compared to the time taken for the rest of the loop body.

Thus, my intuition is that this code should be throughput limited, and considering the worst pressure is on F0, run in 8 cycles per iteration (unless it hits a decoding bottleneck or cache misses). The latter is unlikely given the streaming access pattern, and the fact that arrays comfortably fit in L1 cache. As for the former, considering the constraints listed on section 4.1 of the optimization manual, I project that the loop body is decodable in only 8 cycles.

Yet microbenchmarking indicates that each iteration of the loop body takes 12.5 cycles on average. If no other plausible explanation exists, I may edit the question including further details about how I benchmarked this code, but I'm fairly certain the difference can't be attributed to benchmarking artifacts alone. Also, I have tried to increase the number of iterations to see if performance improved towards an asymptotic limit due to startup/cool-down effects, but it appears to have done so already for the selected value of 128 iterations displayed above.

Manually unrolling the loop to include two calculations per iteration decreased performance to 13 cycles; however, note that this would also duplicate the number of load and store instructions. Interestingly, if the doubled loads and stores are instead replaced by single LD1/ST1 instructions (two-register format) (e.g. ld1 { v3.4s, v4.4s }, [x1], #32) then performance improves to 11.75 cycles per iteration. Further unrolling the loop to four calculations per iteration, while using the four-register format of LD1/ST1, improves performance to 11.25 cycles per iteration.

In spite of the improvements, the performance is still far away from the 8 cycles per iteration that I expected from looking at resource pressures alone. Even if the CPU made a bad scheduling call and issued uzp2 to F0, revising the resource pressure table would indicate 9 cycles per iteration, still far from actual measurements. So, what's causing this code to run so much slower than expected? What kind of effects am I missing in my analysis?

EDIT: As promised, some more benchmarking details. I run the loop 3 times for warmup, 10 times for say n = 512, and then 10 times for n = 256. I take the minimum cycle count for the n = 512 runs and subtract from the minimum for n = 256. The difference should give me how many cycles it takes to run for n = 256, while canceling out the fixed setup cost (code not shown). In addition, this should ensure all data is in the L1 I and D cache. Measurements are taken by reading the cycle counter (pmccntr_el0) directly. Any overhead should be canceled out by the measurement strategy above.

...

ANSWER

Answered 2021-Nov-06 at 13:50

First off, you can further reduce the theoretical cycles to 6 by replacing the first mul with uzp1 and doing the following smull and smlal the other way around: mul, mul, smull, smlal => smull, uzp1, mul, smlal This also heavily reduces the register pressure so that we can do an even deeper unrolling (up to 32 per iteration)

And you don't need v2 coefficents, but you can pack them to the higher part of v1

Let's rule out everything by unrolling this deep and writing it in assembly:

Source https://stackoverflow.com/questions/69855672

QUESTION

How is the s=s+c string concat optimization decided?

Asked 2021-Sep-08 at 00:15

Short version: If s is a string, then s = s + 'c' might modify the string in place, while t = s + 'c' can't. But how does the operation s + 'c' know which scenario it's in?

Long version:

t = s + 'c' needs to create a separate string because the program afterwards wants both the old string as s and the new string as t.

s = s + 'c' can modify the string in place if s is the only reference, as the program only wants s to be the extended string. CPython actually does this optimization, if there's space at the end for the extra character.

Consider these functions, which repeatedly add a character:

...

ANSWER

Answered 2021-Sep-08 at 00:15

Here's the code in question, from the Python 3.10 branch (in ceval.c, and called from the same file's implementation of the BINARY_ADD opcode). As @jasonharper noted in a comment, it peeks ahead to see whether the result of the BINARY_ADD will next be bound to the same name from which the left-hand addend came. In fast(), it is (operand came from s and result stored into s), but in slow() it isn't (operand came from s but stored into t).

There's no guarantee this optimization will persist, though. For example, I noticed that your fast() is no faster than your slow() on the current development CPython main branch (which is the current work-in-progress toward an eventual 3.11 release).

Should people rely on this?

As noted, there's no guarantee this optimization will persist. "Serious" Python programmers should know better than to rely on dodgy CPython-specific tricks, and, indeed, PEP 8 explicitly warns against relying on this specific one:

Code should be written in a way that does not disadvantage other implementations of Python (PyPy, Jython, IronPython, Cython, Psyco, and such).

For example, do not rely on CPython's efficient implementation of in-place string concatenation for statements in the form a += b or a = a + b ...

Source https://stackoverflow.com/questions/69079181

QUESTION

Bug? MATLAB MEX changes the kind of the default logical

Asked 2021-Sep-07 at 00:41

When interfacing a piece of Fortran 2003 (or above) code with MATLAB by MEX, I am surprised to find that MEX changes the kind of the default logical. This is fatal, because a piece of perfectly compilable Fortran code may fail to be mexified due to a mismatch of types, which did happen in my project.

Here is a minimal working example.

Name the following code as "test_kind.F", compile it by mex test_kind.F in MATLAB, and then run test_kind in MATLAB. This will produce a plain text file named fort.99, which contains two numbers "4" and then "8" as the result of the WRITE instructions.

...

ANSWER

Answered 2021-Sep-05 at 13:51

By default MEX compiles with the gfortran option -fdefault-integer-8. The way gfortran handles this results in what you see.

Consider the non-MEX program

Source https://stackoverflow.com/questions/69060408

Community Discussions, Code Snippets contain sources that include Stack Exchange Network

Vulnerabilities

No vulnerabilities reported

Install benchmark

Here's a sample, benchmarking Array.
describe to organize benchmarks
benchmark to run benchmarks

Support

For any new features, suggestions and bugs create an issue on GitHub. If you have any questions check and ask questions on community page Stack Overflow .

Find more information at: