benchmark | Run microbenchmarks in Elm
kandi X-RAY | benchmark Summary
kandi X-RAY | benchmark Summary
Run microbenchmarks in Elm.
Support
Quality
Security
License
Reuse
Top functions reviewed by kandi - BETA
Currently covering the most popular Java, JavaScript and Python libraries. See a Sample of benchmark
benchmark Key Features
benchmark Examples and Code Snippets
Community Discussions
Trending Discussions on benchmark
QUESTION
With the parent-child
relationships data frame as below:
ANSWER
Answered 2022-Feb-25 at 08:17We can use ego
like below
QUESTION
I am reading this book by Fedor Pikus and he has some very very interesting examples which for me were a surprise.
Particularly this benchmark caught me, where the only difference is that in one of them we use || in if and in another we use |.
ANSWER
Answered 2022-Feb-08 at 19:57Code readability, short-circuiting and it is not guaranteed that Ord will always outperform a ||
operand.
Computer systems are more complicated than expected, even though they are man-made.
There was a case where a for loop with a much more complicated condition ran faster on an IBM. The CPU didn't cool and thus instructions were executed faster, that was a possible reason. What I am trying to say, focus on other areas to improve code than fighting small-cases which will differ depending on the CPU and the boolean evaluation (compiler optimizations).
QUESTION
I run sample JHM benchmark which suppose to show dead code elimination. Code is rewritten for conciseness from jhm github sample.
...ANSWER
Answered 2022-Feb-09 at 17:17Those samples depend on JDK internals.
Looks like since JDK 9 and JDK-8152907, Math.log
is no longer intrinsified into C2 intermediate representation. Instead, a direct call to a quick LIBM-backed stub is made. This is usually faster for the code that actually uses the result. Notice how measureCorrect
is faster in JDK 17 output in your case.
But for JMH samples, it limits the the compiler optimizations around the Math.log
, and dead code / folding samples do not work properly. The fix it to make samples that do not rely on JDK internals without a good reason, and instead use a custom written payload.
This is being done in JMH here:
QUESTION
Looking into UTF8 decoding performance, I noticed the performance of protobuf's UnsafeProcessor::decodeUtf8
is better than String(byte[] bytes, int offset, int length, Charset charset)
for the following non ascii string: "Quizdeltagerne spiste jordbær med flØde, mens cirkusklovnen"
.
I tried to figure out why, so I copied the relevant code in String
and replaced the array accesses with unsafe array accesses, same as UnsafeProcessor::decodeUtf8
.
Here are the JMH benchmark results:
ANSWER
Answered 2022-Jan-12 at 09:52To measure the branch you are interested in and particularly the scenario when while
loop becomes hot, I've used the following benchmark:
QUESTION
I was looking for the canonical implementation of MergeSort on Haskell to port to HOVM, and I found this StackOverflow answer. When porting the algorithm, I realized something looked silly: the algorithm has a "halve" function that does nothing but split a list in two, using half of the length, before recursing and merging. So I thought: why not make a better use of this pass, and use a pivot, to make each half respectively smaller and bigger than that pivot? That would increase the odds that recursive merge calls are applied to already-sorted lists, which might speed up the algorithm!
I've done this change, resulting in the following code:
...ANSWER
Answered 2022-Jan-27 at 19:15Your split
splits the list in two ordered halves, so merge
consumes its first argument first and then just produces the second half in full. In other words it is equivalent to ++
, doing redundant comparisons on the first half which always turn out to be True
.
In the true mergesort the merge actually does twice the work on random data because the two parts are not ordered.
The split
though spends some work on the partitioning whereas an online bottom-up mergesort would spend no work there at all. But the built-in sort tries to detect ordered runs in the input, and apparently that extra work is not negligible.
QUESTION
I have tried speeding up a toy GEMM implementation. I deal with blocks of 32x32 doubles for which I need an optimized MM kernel. I have access to AVX2 and FMA.
I have two codes (in ASM, I apologies for the crudeness of the formatting) defined below, one is making use of AVX2 features, the other uses FMA.
Without going into micro benchmarks, I would like to try to develop an understanding (theoretical) of why the AVX2 implementation is 1.11x faster than the FMA version. And possibly how to improve both versions.
The codes below are for a 3000x3000 MM of doubles and the kernels are implemented using the classical, naive MM with an interchanged deepest loop. I'm using a Ryzen 3700x/Zen 2 as development CPU.
I have not tried unrolling aggressively, in fear that the CPU might run out of physical registers.
AVX2 32x32 MM kernel:
...ANSWER
Answered 2021-Dec-13 at 21:36Zen2 has 3 cycle latency for vaddpd
, 5 cycle latency for vfma...pd
. (https://uops.info/).
Your code with 8 accumulators has enough ILP that you'd expect close to two FMA per clock, about 8 per 5 clocks (if there aren't other bottlenecks) which is a bit less than the 10/5 theoretical max.
vaddpd
and vmulpd
actually run on different ports on Zen2 (unlike Intel), port FP2/3 and FP0/1 respectively, so it can in theory sustain 2/clock vaddpd
and vmulpd
. Since the latency of the loop-carried dependency is shorter, 8 accumulators are enough to hide the vaddpd
latency if scheduling doesn't let one dep chain get behind. (But at least multiplies aren't stealing cycles from it.)
Zen2's front-end is 5 instructions wide (or 6 uops if there are any multi-uop instructions), and it can decode memory-source instructions as a single uop. So it might well be doing 2/clock each multiply and add with the non-FMA version.
If you can unroll by 10 or 12, that might hide enough FMA latency and make it equal to the non-FMA version, but with less power consumption and more SMT-friendly to code running on the other logical core. (10 = 5 x 2 would be just barely enough, which means any scheduling imperfections lose progress on a dep chain which is on the critical path. See Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators) for some testing on Intel.)
(By comparison, Intel Skylake runs vaddpd/vmulpd on the same ports with the same latency as vfma...pd, all with 4c latency, 0.5c throughput.)
I didn't look at your code super carefully, but 10 YMM vectors might be a tradeoff between touching two pairs of cache lines vs. touching 5 total lines, which might be worse if a spatial prefetcher tries to complete an aligned pair. Or might be fine. 12 YMM vectors would be three pairs, which should be fine.
Depending on matrix size, out-of-order exec may be able to overlap inner loop dep chains between separate iterations of the outer loop, especially if the loop exit condition can execute sooner and resolve the mispredict (if there is one) while FP work is still in flight. That's an advantage to having fewer total uops for the same work, favouring FMA.
QUESTION
I'm using godbolt to get assembly of the following program:
...ANSWER
Answered 2021-Dec-13 at 06:33You can see the cost of instructions on most mainstream architecture here and there. Based on that and assuming you use for example an Intel Skylake processor, you can see that one 32-bit imul
instruction can be computed per cycle but with a latency of 3 cycles. In the optimized code, 2 lea
instructions (which are very cheap) can be executed per cycle with a 1 cycle latency. The same thing apply for the sal
instruction (2 per cycle and 1 cycle of latency).
This means that the optimized version can be executed with only 2 cycle of latency while the first one takes 3 cycle of latency (not taking into account load/store instructions that are the same). Moreover, the second version can be better pipelined since the two instructions can be executed for two different input data in parallel thanks to a superscalar out-of-order execution. Note that two loads can be executed in parallel too although only one store can be executed in parallel per cycle. This means that the execution is bounded by the throughput of store instructions. Overall, only 1 value can only computed per cycle. AFAIK, recent Intel Icelake processors can do two stores in parallel like new AMD Ryzen processors. The second one is expected to be as fast or possibly faster on the chosen use-case (Intel Skylake processors). It should be significantly faster on very recent x86-64 processors.
Note that the lea
instruction is very fast because the multiply-add is done on a dedicated CPU unit (hard-wired shifters) and it only supports some specific constant for the multiplication (supported factors are 1, 2, 4 and 8, which mean that lea can be used to multiply an integer by the constants 2, 3, 4, 5, 8 and 9). This is why lea
is faster than imul
/mul
.
I can reproduce the slower execution with -O2
using GCC 11.2 (on Linux with a i5-9600KF processor).
The main source of source of slowdown comes from the higher number of micro-operations (uops) to be executed in the -O2
version certainly combined with the saturation of some execution ports certainly due to a bad micro-operation scheduling.
Here is the assembly of the loop with -Os
:
QUESTION
Consider the following code, running on an ARM Cortex-A72 processor (optimization guide here). I have included what I expect are resource pressures for each execution port:
Instruction B I0 I1 M L S F0 F1.LBB0_1:
ldr q3, [x1], #16
0.5
0.5
1
ldr q4, [x2], #16
0.5
0.5
1
add x8, x8, #4
0.5
0.5
cmp x8, #508
0.5
0.5
mul v5.4s, v3.4s, v4.4s
2
mul v5.4s, v5.4s, v0.4s
2
smull v6.2d, v5.2s, v1.2s
1
smull2 v5.2d, v5.4s, v2.4s
1
smlal v6.2d, v3.2s, v4.2s
1
smlal2 v5.2d, v3.4s, v4.4s
1
uzp2 v3.4s, v6.4s, v5.4s
1
str q3, [x0], #16
0.5
0.5
1
b.lo .LBB0_1
1
Total port pressure
1
2.5
2.5
0
2
1
8
1
Although uzp2
could run on either the F0 or F1 ports, I chose to attribute it entirely to F1 due to high pressure on F0 and zero pressure on F1 other than this instruction.
There are no dependencies between loop iterations, other than the loop counter and array pointers; and these should be resolved very quickly, compared to the time taken for the rest of the loop body.
Thus, my intuition is that this code should be throughput limited, and considering the worst pressure is on F0, run in 8 cycles per iteration (unless it hits a decoding bottleneck or cache misses). The latter is unlikely given the streaming access pattern, and the fact that arrays comfortably fit in L1 cache. As for the former, considering the constraints listed on section 4.1 of the optimization manual, I project that the loop body is decodable in only 8 cycles.
Yet microbenchmarking indicates that each iteration of the loop body takes 12.5 cycles on average. If no other plausible explanation exists, I may edit the question including further details about how I benchmarked this code, but I'm fairly certain the difference can't be attributed to benchmarking artifacts alone. Also, I have tried to increase the number of iterations to see if performance improved towards an asymptotic limit due to startup/cool-down effects, but it appears to have done so already for the selected value of 128 iterations displayed above.
Manually unrolling the loop to include two calculations per iteration decreased performance to 13 cycles; however, note that this would also duplicate the number of load and store instructions. Interestingly, if the doubled loads and stores are instead replaced by single LD1
/ST1
instructions (two-register format) (e.g. ld1 { v3.4s, v4.4s }, [x1], #32
) then performance improves to 11.75 cycles per iteration. Further unrolling the loop to four calculations per iteration, while using the four-register format of LD1
/ST1
, improves performance to 11.25 cycles per iteration.
In spite of the improvements, the performance is still far away from the 8 cycles per iteration that I expected from looking at resource pressures alone. Even if the CPU made a bad scheduling call and issued uzp2
to F0, revising the resource pressure table would indicate 9 cycles per iteration, still far from actual measurements. So, what's causing this code to run so much slower than expected? What kind of effects am I missing in my analysis?
EDIT: As promised, some more benchmarking details. I run the loop 3 times for warmup, 10 times for say n = 512, and then 10 times for n = 256. I take the minimum cycle count for the n = 512 runs and subtract from the minimum for n = 256. The difference should give me how many cycles it takes to run for n = 256, while canceling out the fixed setup cost (code not shown). In addition, this should ensure all data is in the L1 I and D cache. Measurements are taken by reading the cycle counter (pmccntr_el0
) directly. Any overhead should be canceled out by the measurement strategy above.
ANSWER
Answered 2021-Nov-06 at 13:50First off, you can further reduce the theoretical cycles to 6 by replacing the first mul
with uzp1
and doing the following smull
and smlal
the other way around: mul
, mul
, smull
, smlal
=> smull
, uzp1
, mul
, smlal
This also heavily reduces the register pressure so that we can do an even deeper unrolling (up to 32 per iteration)
And you don't need v2
coefficents, but you can pack them to the higher part of v1
Let's rule out everything by unrolling this deep and writing it in assembly:
QUESTION
Short version: If s
is a string, then s = s + 'c'
might modify the string in place, while t = s + 'c'
can't. But how does the operation s + 'c'
know which scenario it's in?
Long version:
t = s + 'c'
needs to create a separate string because the program afterwards wants both the old string as s
and the new string as t
.
s = s + 'c'
can modify the string in place if s
is the only reference, as the program only wants s
to be the extended string. CPython actually does this optimization, if there's space at the end for the extra character.
Consider these functions, which repeatedly add a character:
...ANSWER
Answered 2021-Sep-08 at 00:15Here's the code in question, from the Python 3.10 branch (in ceval.c
, and called from the same file's implementation of the BINARY_ADD
opcode). As @jasonharper noted in a comment, it peeks ahead to see whether the result of the BINARY_ADD
will next be bound to the same name from which the left-hand addend came. In fast()
, it is (operand came from s
and result stored into s
), but in slow()
it isn't (operand came from s
but stored into t
).
There's no guarantee this optimization will persist, though. For example, I noticed that your fast()
is no faster than your slow()
on the current development CPython main
branch (which is the current work-in-progress toward an eventual 3.11 release).
As noted, there's no guarantee this optimization will persist. "Serious" Python programmers should know better than to rely on dodgy CPython-specific tricks, and, indeed, PEP 8 explicitly warns against relying on this specific one:
Code should be written in a way that does not disadvantage other implementations of Python (PyPy, Jython, IronPython, Cython, Psyco, and such).
For example, do not rely on CPython's efficient implementation of in-place string concatenation for statements in the form
a += b
ora = a + b
...
QUESTION
When interfacing a piece of Fortran 2003 (or above) code with MATLAB by MEX, I am surprised to find that MEX changes the kind of the default logical. This is fatal, because a piece of perfectly compilable Fortran code may fail to be mexified due to a mismatch of types, which did happen in my project.
Here is a minimal working example.
Name the following code as "test_kind.F", compile it by mex test_kind.F
in MATLAB, and then run test_kind
in MATLAB. This will produce a plain text file named fort.99, which contains two numbers "4" and then "8" as the result of the WRITE instructions.
ANSWER
Answered 2021-Sep-05 at 13:51By default MEX compiles with the gfortran option -fdefault-integer-8
. The way gfortran handles this results in what you see.
Consider the non-MEX program
Community Discussions, Code Snippets contain sources that include Stack Exchange Network
Vulnerabilities
No vulnerabilities reported
Install benchmark
describe to organize benchmarks
benchmark to run benchmarks
Support
Reuse Trending Solutions
Find, review, and download reusable Libraries, Code Snippets, Cloud APIs from over 650 million Knowledge Items
Find more librariesStay Updated
Subscribe to our newsletter for trending solutions and developer bootcamps
Share this Page