benchmarking | A performance comparison of Duplicacy, restic, and Attic | Performance Testing library
kandi X-RAY | benchmarking Summary
A performance comparison of Duplicacy, restic, Attic, and duplicity
benchmarking Examples and Code Snippets
```python
import time
import unittest
from selenium import webdriver

class TestThree(unittest.TestCase):
    def setUp(self):
        self.startTime = time.time()  # per-test timer

    def test_url_fire(self):
        time.sleep(2)
        self.driver = webdriver.Firefox()

    def tearDown(self):
        # Report elapsed time and close the browser.
        print(f"{self.id()}: {time.time() - self.startTime:.3f}s")
        self.driver.quit()

if __name__ == "__main__":
    unittest.main()
```
```python
# Signature excerpts; the method bodies are elided in the original snippet.

def run_benchmark(self,
                  dataset,
                  num_elements,
                  iters=1,
                  warmup=True,
                  apply_default_optimizations=False,
                  session_config=None):
    ...

def run_and_report_benchmark(self,
                             dataset,
                             num_elements,
                             name,
                             iters=5,
                             extras=None):
    ...

def _run_graph_benchmark(self,
                         iterable,
                         iters,
                         warmup,
                         session_config,
                         initializer=None):
    """Benchmarks the iterable."""
```
Community Discussions
Trending Discussions on benchmarking
QUESTION
I'm trying to make sure gcc vectorizes my loops. It turns out that by using -march=znver1 (or -march=native) gcc skips some loops even though they can be vectorized. Why does this happen?
In this code, the second loop, which multiplies each element by a scalar, is not vectorised:
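(The question's code is elided in this excerpt. A minimal reconstruction, assuming the uint64_t array and the *= 5 scalar that the answer below discusses; the array name and size are illustrative.)

```cpp
#include <cstddef>
#include <cstdint>

constexpr std::size_t N = 1024;   // illustrative size
std::uint64_t arr[N];

void fill_then_scale() {
    for (std::size_t i = 0; i < N; ++i)
        arr[i] = 0;               // memset-like loop: gets vectorized
    for (std::size_t i = 0; i < N; ++i)
        arr[i] *= 5;              // scalar multiply: skipped with -march=znver1
}
```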
...ANSWER
Answered 2022-Apr-10 at 02:47
The default -mtune=generic has -mprefer-vector-width=256, and -mavx2 doesn't change that.
znver1 implies -mprefer-vector-width=128, because that's the native width of the HW. An instruction using 32-byte YMM vectors decodes to at least 2 uops, more if it's a lane-crossing shuffle. For simple vertical SIMD like this, 32-byte vectors would be OK; the pipeline handles 2-uop instructions efficiently. (And I think Zen1 is 6 uops wide but only 5 instructions wide, so max front-end throughput isn't available using only 1-uop instructions.) But when vectorization would require shuffling, e.g. with arrays of different element widths, GCC code-gen can get messier with 256-bit or wider.
And vmovdqa ymm0, ymm1 mov-elimination only works on the low 128-bit half on Zen1. Also, normally using 256-bit vectors would imply one should use vzeroupper afterwards, to avoid performance problems on other CPUs (but not Zen1).
I don't know how Zen1 handles misaligned 32-byte loads/stores where each 16-byte half is aligned but in separate cache lines. If that performs well, GCC might want to consider increasing the znver1 -mprefer-vector-width to 256. But wider vectors mean more cleanup code if the size isn't known to be a multiple of the vector width.
Ideally GCC would be able to detect easy cases like this and use 256-bit vectors there. (Pure vertical, no mixing of element widths, constant size that's a multiple of 32 bytes.) At least on CPUs where that's fine: znver1, but not bdver2, for example, where 256-bit stores are always slow due to a CPU design bug.
You can see the result of this choice in the way it vectorizes your first loop, the memset-like loop, with a vmovdqu [rdx], xmm0. https://godbolt.org/z/E5Tq7Gfzc
So given that GCC has decided to only use 128-bit vectors, which can only hold two uint64_t elements, it (rightly or wrongly) decides it wouldn't be worth using vpsllq / vpaddq to implement qword *5 as (v<<2) + v, vs. doing it with integer in one LEA instruction.
Almost certainly wrongly in this case, since it still requires a separate load and store for every element or pair of elements. (And loop overhead, since GCC's default is not to unroll except with PGO, -fprofile-use. SIMD is like loop unrolling, especially on a CPU that handles 256-bit vectors as 2 separate uops.)
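As a concrete illustration (my sketch, not from the answer; the helper name is hypothetical), the shift-and-add form of *5 on two packed uint64_t elements takes just two instructions:

```cpp
#include <immintrin.h>

// Multiply each packed 64-bit element by 5 as (v << 2) + v.
// Compiles to vpsllq + vpaddq with -mavx2.
__m128i mul5_epu64(__m128i v) {
    return _mm_add_epi64(_mm_slli_epi64(v, 2), v);
}
```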
I'm not sure exactly what GCC means by "not vectorized: unsupported data-type". x86 doesn't have a SIMD uint64_t multiply instruction until AVX-512, so perhaps GCC assigns it a cost based on the general case of having to emulate it with multiple 32x32 => 64-bit pmuludq instructions and a bunch of shuffles. And it's only after it gets over that hump that it realizes that it's actually quite cheap for a constant like 5 with only 2 set bits?
That would explain GCC's decision-making process here, but I'm not sure it's exactly the right explanation. Still, these kinds of factors are what happen in a complex piece of machinery like a compiler. A skilled human can easily make smarter choices, but compilers just do sequences of optimization passes that don't always consider the big picture and all the details at the same time.
-mprefer-vector-width=256 doesn't help:
Not vectorizing uint64_t *= 5 seems to be a GCC9 regression
(The benchmarks in the question confirm that an actual Zen1 CPU gets a nearly 2x speedup, as expected from doing 2x uint64 in 6 uops vs. 1x in 5 uops with scalar. Or 4x uint64_t in 10 uops with 256-bit vectors, including two 128-bit stores which will be the throughput bottleneck along with the front-end.)
Even with -march=znver1 -O3 -mprefer-vector-width=256, we don't get the *= 5 loop vectorized with GCC 9, 10, or 11, or current trunk. As you say, we do with -march=znver2. https://godbolt.org/z/dMTh7Wxcq
We do get vectorization with those options for uint32_t (even leaving the vector width at 128-bit). Scalar would cost 4 operations per vector uop (not instruction), regardless of 128- or 256-bit vectorization on Zen1, so this doesn't tell us whether *= is what makes the cost-model decide not to vectorize, or just the 2 vs. 4 elements per 128-bit internal uop.
With uint64_t, changing to arr[i] += arr[i]<<2; still doesn't vectorize, but arr[i] <<= 1; does. (https://godbolt.org/z/6PMn93Y5G). Even arr[i] <<= 2; and arr[i] += 123 in the same loop vectorize, to the same instructions that GCC thinks aren't worth it for vectorizing *= 5, just different operands, constant instead of the original vector again. (Scalar could still use one LEA.) So clearly the cost-model isn't looking as far as final x86 asm machine instructions, but I don't know why arr[i] += arr[i] would be considered more expensive than arr[i] <<= 1;, which is exactly the same thing.
GCC8 does vectorize your loop, even with 128-bit vector width: https://godbolt.org/z/5o6qjc7f6
QUESTION
I am trying to benchmark the performance of functions using BenchmarkTools as in the example below. My goal is to obtain the outputs of @benchmark as a DataFrame.
In this example, I am benchmarking the performance of the following two functions:
...ANSWER
Answered 2022-Feb-19 at 14:07
You can do it e.g. like this:
QUESTION
In the benchmarking page "https://docs.vespa.ai/en/performance/vespa-benchmarking.html" it is stated that we need to restart the services after we increase the per-search threads, using the commands vespa-stop-services and vespa-start-services. Could you tell us if we need to do this on all the content nodes or just the config nodes?
...ANSWER
Answered 2022-Feb-16 at 18:43
When deploying a change that requires a restart, the deploy command will list the actions you need to take. For example, when changing the global per-search thread setting from 2 to 5, as in the example below:
QUESTION
In the cmprsk package, one of the tests to call crr shows the following:
crr(ftime, fstatus, cov, cbind(cov[,1], cov[,1]), function(Uft) cbind(Uft, Uft^2))
It's the final inline function, function(Uft) cbind(Uft,Uft^2), which I am interested in generalizing. This example call to crr above has cov1 as cov and cov2 as cbind(cov[,1],cov[,1]), essentially hardcoding the number of covariates. Hence the hardcoded function function(Uft) cbind(Uft,Uft^2) is sufficient.
I am working on some benchmarking code where I would like the number of covariates to be variable, so I generate a matrix cov2 which has nobs rows and ncovs columns (rather than ncovs being pre-defined as 2 above).
My question is: how do I modify the inline function function(Uft) cbind(Uft,Uft^2) to take in a single column vector of length nobs and return a matrix of size nobs x ncovs, where each column is simply the input vector raised to the column index?
Minimal reproducible example below. My call to z2 <- crr(...) is incorrect:
ANSWER
Answered 2022-Feb-13 at 18:12
```r
# Build a matrix whose j-th column is vec^j, via cumulative
# element-wise products of ncovs copies of vec.
vec <- 1:4
ncovs <- 5
matrix(unlist(Reduce("*", rep(list(vec), ncovs), accumulate = TRUE)), ncol = ncovs)
```
QUESTION
I am benchmarking the following code: for (T& x : v) x = x + x; where T is int.
When compiling with -mavx2, performance fluctuates 2x depending on some conditions. This does not reproduce with SSE4.2. I would like to understand what's happening.
How does the benchmark work
I am using Google Benchmark. It spins the loop until it is sure about the time.
The main benchmarking code:
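(The benchmark body is elided in this excerpt. A minimal sketch of a Google Benchmark harness for this loop, with illustrative names and sizes, might look like this.)

```cpp
#include <benchmark/benchmark.h>
#include <vector>

// Doubles every element; the array size is a runtime argument.
static void BM_DoubleElements(benchmark::State& state) {
    std::vector<int> v(state.range(0), 1);
    for (auto _ : state) {
        for (int& x : v) x = x + x;
        benchmark::DoNotOptimize(v.data());  // keep the loop from being elided
        benchmark::ClobberMemory();
    }
}
BENCHMARK(BM_DoubleElements)->Arg(10 * 1024 / sizeof(int));  // ~10 KiB of ints
BENCHMARK_MAIN();
```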
...ANSWER
Answered 2022-Feb-12 at 20:11
Yes, data misalignment could explain your 2x slowdown for small arrays that fit in L1d. You'd hope that with every other load/store being a cache-line split, it might only slow down by a factor of 1.5x, not 2x, if a split load or store cost 2 accesses to L1d instead of 1.
But it has extra effects, like replays of uops dependent on the load result, that apparently account for the rest of the problem, either making out-of-order exec less able to overlap work and hide latency, or directly running into bottlenecks like "split registers".
ld_blocks.no_sr counts the number of times cache-line split loads are temporarily blocked because all resources for handling the split accesses are in use.
When a load execution unit detects that the load splits across a cache line, it has to save the first part somewhere (apparently in a "split register") and then access the 2nd cache line. On Intel SnB-family CPUs like yours, this 2nd access doesn't require the RS to dispatch the load uop to the port again; the load execution unit just does it a few cycles later. (But presumably can't accept another load in the same cycle as that 2nd access.)
- https://chat.stackoverflow.com/transcript/message/48426108#48426108 - uops waiting for the result of a cache-split load will get replayed.
- Are load ops deallocated from the RS when they dispatch, complete or some other time? But the load itself can leave the RS earlier.
- How can I accurately benchmark unaligned access speed on x86_64? general stuff on split load penalties.
The extra latency of split loads, and also the potential replays of uops waiting for those load results, is another factor, but those are also fairly direct consequences of misaligned loads. Lots of counts for ld_blocks.no_sr tells you that the CPU actually ran out of split registers and could otherwise be doing more work, but had to stall because of the unaligned load itself, not just other effects.
You could also look for the front-end stalling due to the ROB or RS being full, if you want to investigate the details, but not being able to execute split loads will make that happen more. So probably all the back-end stalling is a consequence of the unaligned loads (and maybe stores if commit from store buffer to L1d is also a bottleneck.)
With a 100 KB array I reproduce the issue: 1075 ns vs. 1412 ns. At 1 MB I don't think I see it.
Data alignment doesn't normally make that much difference for large arrays (except with 512-bit vectors). With a cache line (2x YMM vectors) arriving less frequently, the back-end has time to work through the extra overhead of unaligned loads / stores and still keep up. HW prefetch does a good enough job that it can still max out the per-core L3 bandwidth. Seeing a smaller effect for a size that fits in L2 but not L1d (like 100kiB) is expected.
Of course, most kinds of execution bottlenecks would show similar effects, even something as simple as un-optimized code that does some extra store/reloads for each vector of array data. So this alone doesn't prove that it was misalignment causing the slowdowns for small sizes that do fit in L1d, like your 10 KiB. But that's clearly the most sensible conclusion.
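One way to probe this directly (my sketch, not part of the answer; the helper name is hypothetical): time the same loop over a deliberately misaligned view of an over-aligned buffer and compare offsets.

```cpp
#include <cstddef>
#include <cstdlib>

// Returns a pointer byte_offset bytes past a 64-byte-aligned allocation,
// so the same loop can be timed at alignment offsets 0, 4, 32, ...
// (Caller frees the underlying block; error handling omitted.)
int* make_offset_buffer(std::size_t n, std::size_t byte_offset) {
    std::size_t bytes = ((n * sizeof(int) + byte_offset + 63) / 64) * 64;
    void* p = std::aligned_alloc(64, bytes);  // size rounded to a multiple of 64
    return reinterpret_cast<int*>(static_cast<char*>(p) + byte_offset);
}
```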
Code alignment or other front-end bottlenecks seem not to be the problem; most of your uops are coming from the DSB, according to idq.dsb_uops. (A significant number aren't, but not a big percentage difference between slow vs. fast.)
How can I mitigate the impact of the Intel jcc erratum on gcc? can be important on Skylake-derived microarchitectures like yours; it's even possible that's why your idq.dsb_uops isn't closer to your uops_issued.any.
QUESTION
I am using a C library which uses various fixed-size unsigned char arrays with no null terminator as strings.
I've been converting them to std::string using the following function:
ANSWER
Answered 2022-Jan-22 at 22:33
You want:
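(The answer's one-liner is elided in this excerpt. A minimal sketch of the idea, with an illustrative helper name: use std::string's (pointer, length) constructor, which copies exactly N bytes and needs no NUL terminator.)

```cpp
#include <cstddef>
#include <string>

// Convert a fixed-size unsigned char array to std::string without
// relying on a null terminator; the length is taken from the array type.
template <std::size_t N>
std::string to_std_string(const unsigned char (&arr)[N]) {
    return std::string(reinterpret_cast<const char*>(arr), N);
}
```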
QUESTION
I was thinking of using the package StaticArrays.jl to enhance the performance of my code. However, I only use arrays to store computed variables and use them later after certain conditions are set. Hence, I was benchmarking the type SizedVector in comparison with a normal vector, but I do not understand the code below. I also tried StaticVector and used the workaround Setfield.jl.
...ANSWER
Answered 2022-Jan-07 at 13:15
@phipsgabler is right! Statically sized arrays have their performance advantages when the size is known statically, at compile time. My arrays are, however, dynamically sized, with the size n being a runtime variable.
Changing this yields more sensible results:
QUESTION
I am benchmarking an ARMv7 NEON code on two ARMv8 processors in AArch32 mode: the Cortex-A53 and Cortex-A72. I am using the Raspberry Pi 3B and Raspberry Pi 4B boards with 32-bit Raspbian Buster.
My benchmarking method is as follows:
...ANSWER
Answered 2021-Oct-27 at 12:00
I compared the instruction cycle timing of the A72 and A55 (nothing available on the A53) for vshl and vshr:
- A72: throughput (IPC) 1, latency 3, executes on the F1 pipeline only
- A55: throughput (IPC) 2, latency 2, executes on both pipelines (restricted, though)
That pretty much nails it, since there are many of them in your code.
There are some drawbacks in your assembly code, too:
- vadd has fewer restrictions and better throughput/latency than vshl. You should replace all vshl by immediate 1 with vadd; barrel shifters are more costly than arithmetic on SIMD. (A sketch of this rewrite follows the list.)
- You should not repeat the same instructions unnecessarily (<<5).
- The second vmvn is unnecessary. You can replace all the following vand with vbic instead.
- Compilers generate acceptable machine code as long as no permutations are involved. Hence I'd write the code in NEON intrinsics in this case.
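A minimal illustration of the first point (my sketch; the function name is hypothetical):

```cpp
#include <arm_neon.h>

// x << 1 rewritten as x + x: vadd has better throughput/latency than
// vshl-by-immediate on these cores.
uint32x4_t shl1_via_add(uint32x4_t x) {
    return vaddq_u32(x, x);   // instead of vshlq_n_u32(x, 1)
}
```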
QUESTION
I'm using godbolt to get assembly of the following program:
...ANSWER
Answered 2021-Dec-13 at 06:33
You can see the cost of instructions on most mainstream architectures here and there. Based on that, and assuming you use for example an Intel Skylake processor, you can see that one 32-bit imul instruction can be computed per cycle, but with a latency of 3 cycles. In the optimized code, 2 lea instructions (which are very cheap) can be executed per cycle, with a 1-cycle latency. The same thing applies to the sal instruction (2 per cycle and 1 cycle of latency).
This means that the optimized version can be executed with only 2 cycles of latency, while the first one takes 3 cycles of latency (not taking into account load/store instructions, which are the same). Moreover, the second version can be better pipelined, since the two instructions can be executed for two different input data in parallel thanks to superscalar out-of-order execution. Note that two loads can be executed in parallel too, although only one store can be executed per cycle. This means that the execution is bounded by the throughput of store instructions: overall, only 1 value can be computed per cycle. AFAIK, recent Intel Ice Lake processors can do two stores in parallel, like new AMD Ryzen processors. The second version is expected to be as fast or possibly faster on the chosen use-case (Intel Skylake processors). It should be significantly faster on very recent x86-64 processors.
Note that the lea instruction is very fast because the multiply-add is done on a dedicated CPU unit (hard-wired shifters) and it only supports specific constants for the multiplication (supported factors are 1, 2, 4 and 8, which means lea can be used to multiply an integer by the constants 2, 3, 4, 5, 8 and 9). This is why lea is faster than imul/mul.
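To make the strength reduction concrete (my illustration, not from the answer):

```cpp
// x * 5 == x + x * 4, which fits lea's base + index*scale addressing,
// so a compiler can emit a single instruction:
//     lea eax, [rdi + rdi*4]
int times5(int x) { return x * 5; }
```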
I can reproduce the slower execution with -O2 using GCC 11.2 (on Linux with an i5-9600KF processor).
The main source of slowdown is the higher number of micro-operations (uops) to be executed in the -O2 version, certainly combined with the saturation of some execution ports due to bad micro-operation scheduling.
Here is the assembly of the loop with -Os:
QUESTION
Consider the following code, running on an ARM Cortex-A72 processor (optimization guide here). I have included what I expect are resource pressures for each execution port:
| Instruction                 | B | I0  | I1  | M | L | S | F0 | F1 |
|-----------------------------|---|-----|-----|---|---|---|----|----|
| .LBB0_1:                    |   |     |     |   |   |   |    |    |
| ldr q3, [x1], #16           |   | 0.5 | 0.5 |   | 1 |   |    |    |
| ldr q4, [x2], #16           |   | 0.5 | 0.5 |   | 1 |   |    |    |
| add x8, x8, #4              |   | 0.5 | 0.5 |   |   |   |    |    |
| cmp x8, #508                |   | 0.5 | 0.5 |   |   |   |    |    |
| mul v5.4s, v3.4s, v4.4s     |   |     |     |   |   |   | 2  |    |
| mul v5.4s, v5.4s, v0.4s     |   |     |     |   |   |   | 2  |    |
| smull v6.2d, v5.2s, v1.2s   |   |     |     |   |   |   | 1  |    |
| smull2 v5.2d, v5.4s, v2.4s  |   |     |     |   |   |   | 1  |    |
| smlal v6.2d, v3.2s, v4.2s   |   |     |     |   |   |   | 1  |    |
| smlal2 v5.2d, v3.4s, v4.4s  |   |     |     |   |   |   | 1  |    |
| uzp2 v3.4s, v6.4s, v5.4s    |   |     |     |   |   |   |    | 1  |
| str q3, [x0], #16           |   | 0.5 | 0.5 |   |   | 1 |    |    |
| b.lo .LBB0_1                | 1 |     |     |   |   |   |    |    |
| Total port pressure         | 1 | 2.5 | 2.5 | 0 | 2 | 1 | 8  | 1  |
Although uzp2 could run on either the F0 or F1 ports, I chose to attribute it entirely to F1 due to high pressure on F0 and zero pressure on F1 other than this instruction.
There are no dependencies between loop iterations, other than the loop counter and array pointers; and these should be resolved very quickly, compared to the time taken for the rest of the loop body.
Thus, my intuition is that this code should be throughput limited, and considering the worst pressure is on F0, run in 8 cycles per iteration (unless it hits a decoding bottleneck or cache misses). The latter is unlikely given the streaming access pattern, and the fact that arrays comfortably fit in L1 cache. As for the former, considering the constraints listed on section 4.1 of the optimization manual, I project that the loop body is decodable in only 8 cycles.
Yet microbenchmarking indicates that each iteration of the loop body takes 12.5 cycles on average. If no other plausible explanation exists, I may edit the question including further details about how I benchmarked this code, but I'm fairly certain the difference can't be attributed to benchmarking artifacts alone. Also, I have tried to increase the number of iterations to see if performance improved towards an asymptotic limit due to startup/cool-down effects, but it appears to have done so already for the selected value of 128 iterations displayed above.
Manually unrolling the loop to include two calculations per iteration decreased performance to 13 cycles; however, note that this would also duplicate the number of load and store instructions. Interestingly, if the doubled loads and stores are instead replaced by single LD1/ST1 instructions (two-register format, e.g. ld1 { v3.4s, v4.4s }, [x1], #32), then performance improves to 11.75 cycles per iteration. Further unrolling the loop to four calculations per iteration, while using the four-register format of LD1/ST1, improves performance to 11.25 cycles per iteration.
In spite of the improvements, the performance is still far away from the 8 cycles per iteration that I expected from looking at resource pressures alone. Even if the CPU made a bad scheduling call and issued uzp2 to F0, revising the resource pressure table would indicate 9 cycles per iteration, still far from actual measurements. So, what's causing this code to run so much slower than expected? What kind of effects am I missing in my analysis?
EDIT: As promised, some more benchmarking details. I run the loop 3 times for warmup, 10 times for say n = 512, and then 10 times for n = 256. I take the minimum cycle count for the n = 512 runs and subtract from the minimum for n = 256. The difference should give me how many cycles it takes to run for n = 256, while canceling out the fixed setup cost (code not shown). In addition, this should ensure all data is in the L1 I and D cache. Measurements are taken by reading the cycle counter (pmccntr_el0) directly. Any overhead should be canceled out by the measurement strategy above.
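(A minimal sketch of such a counter reader, assuming AArch64 and that user-space access to the cycle counter has been enabled, e.g. via a kernel module; the helper name is illustrative.)

```cpp
#include <cstdint>

// Read the AArch64 cycle counter (PMCCNTR_EL0) directly.
static inline std::uint64_t read_cycles() {
    std::uint64_t c;
    asm volatile("mrs %0, pmccntr_el0" : "=r"(c));
    return c;
}
```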
ANSWER
Answered 2021-Nov-06 at 13:50
First off, you can further reduce the theoretical cycles to 6 by replacing the first mul with uzp1 and doing the following smull and smlal the other way around: mul, mul, smull, smlal => smull, uzp1, mul, smlal.
This also heavily reduces the register pressure so that we can do an even deeper unrolling (up to 32 per iteration).
And you don't need the v2 coefficients; you can pack them into the higher part of v1.
Let's rule out everything by unrolling this deep and writing it in assembly:
Community Discussions, Code Snippets contain sources that include Stack Exchange Network