Average | Templated class for calculating averages | Analytics library
kandi X-RAY | Average Summary
Templated class for calculating averages and statistics of data sets.
Community Discussions
Trending Discussions on Average
QUESTION
I have a dataframe like the following:
...ANSWER
Answered 2022-Feb-22 at 20:00
I think actually a more efficient way would be to sort by Brand and then Year, and then use interpolate:
QUESTION
While testing things around Compiler Explorer, I tried out the following overflow-free function for calculating the average of two unsigned 32-bit integers:
...ANSWER
Answered 2022-Mar-08 at 10:00
Clang does the same thing. Probably for compiler-construction and CPU-architecture reasons:
Disentangling that logic into just a swap may allow better optimization in some cases; definitely something it makes sense for a compiler to do early so it can follow values through the swap.
Xor-swap is total garbage for swapping registers, the only advantage being that it doesn't need a temporary. But xchg reg,reg already does that better.
I'm not surprised that GCC's optimizer recognizes the xor-swap pattern and disentangles it to follow the original values. In general, this makes constant-propagation and value-range optimizations possible through swaps, especially for cases where the swap wasn't conditional on the values of the vars being swapped. This pattern-recognition probably happens soon after transforming the program logic to GIMPLE (SSA) representation, so at that point it will forget that the original source ever used an xor swap, and not think about emitting asm that way.
Hopefully sometimes that lets it then optimize down to only a single mov, or two movs, depending on register allocation for the surrounding code (e.g. if one of the vars can move to a new register, instead of having to end up back in the original locations). And whether both variables are actually used later, or only one. Or if it can fully disentangle an unconditional swap, maybe no mov instructions.
But worst case, three mov instructions needing a temporary register is still better, unless it's running out of registers. I'd guess GCC is not smart enough to use xchg reg,reg instead of spilling something else or saving/restoring another tmp reg, so there might be corner cases where this optimization actually hurts.
(Apparently GCC -Os does have a peephole optimization to use xchg reg,reg instead of 3x mov: PR 92549 was fixed for GCC10. It looks for that quite late, during RTL -> assembly. And yes, it works here: turning your xor-swap into an xchg: https://godbolt.org/z/zs969xh47)
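To make the pattern concrete, here is roughly the kind of source-level code under discussion (a minimal sketch, not the asker's original function; the conditional swap is an assumption): a hand-written xor-swap versus the plain temporary-variable swap that compilers effectively turn it into.

```c
#include <stdint.h>

/* Conditional xor-swap, as sometimes written by hand to avoid a temporary.
   GCC/Clang typically recognize this pattern early (in GIMPLE/SSA form) and
   follow the values through it rather than emitting three xor instructions. */
void swap_xor(uint32_t *a, uint32_t *b)
{
    if (*a < *b) {
        *a ^= *b;
        *b ^= *a;
        *a ^= *b;
    }
}

/* Equivalent swap with a temporary: at worst a few mov instructions, and
   often cheaper than that after register renaming / mov-elimination. */
void swap_tmp(uint32_t *a, uint32_t *b)
{
    if (*a < *b) {
        uint32_t t = *a;
        *a = *b;
        *b = t;
    }
}
```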
With no memory reads, and the same number of instructions, I don't see any bad impacts and it feels odd that it would be changed. Clearly there is something I did not think through, but what is it?
Instruction count is only a rough proxy for one of three things that are relevant for perf analysis: front-end uops, latency, and back-end execution ports. (And machine-code size in bytes: x86 machine-code instructions are variable-length.)
It's the same size in machine-code bytes, and the same number of front-end uops, but the critical-path latency is worse: 3 cycles from input a to output a for xor-swap, and 2 from input b to output a, for example.
MOV-swap has at worst 1-cycle and 2-cycle latencies from inputs to outputs, or less with mov-elimination. (Which can also avoid using back-end execution ports, especially relevant for CPUs like IvyBridge and Tiger Lake with a front-end wider than the number of integer ALU ports. And Ice Lake, except Intel disabled mov-elimination on it as an erratum workaround; not sure if it's re-enabled for Tiger Lake or not.)
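As an aside on the underlying problem, one well-known branch-free formulation of an overflow-free unsigned average is shown below (a small sketch for reference only; it is not necessarily the function the asker compiled):

```c
#include <stdint.h>

/* floor((a + b) / 2) without overflow: since a + b == (a ^ b) + 2*(a & b),
   halving gives (a & b) + ((a ^ b) >> 1), which never exceeds UINT32_MAX. */
uint32_t avg_u32(uint32_t a, uint32_t b)
{
    return (a & b) + ((a ^ b) >> 1);
}
```

On typical x86-64 this compiles to a few single-uop ALU instructions, with no branch and no swap at all.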
Also related:
- Why is XCHG reg, reg a 3 micro-op instruction on modern Intel architectures? - and those 3 uops can't benefit from mov-elimination. But on modern AMD, xchg reg,reg is only 2 uops.
GCC's real missed optimization here (even with -O3) is that tail-duplication results in about the same static code size, just a couple extra bytes since these are mostly 2-byte instructions. The big win is that the a path then becomes the same length as the other, instead of twice as long to first do a swap and then run the same 3 uops for averaging.
Update: GCC will do this for you with -ftracer (https://godbolt.org/z/es7a3bEPv), optimizing away the swap. (That's only enabled manually or as part of -fprofile-use, not at -O3, so it's probably not a good idea to use all the time without PGO, potentially bloating machine code in cold functions / code-paths.)
Doing it manually in the source (Godbolt):
QUESTION
I have an array of positive integers. For example:
...ANSWER
Answered 2022-Feb-27 at 22:44
This problem has a fun O(n) solution.
If you draw a graph of cumulative sum vs index, then:
The average value in the subarray between any two indexes is the slope of the line between those points on the graph.
The first highest-average-prefix will end at the point that makes the highest angle from 0. The next highest-average-prefix must then have a smaller average, and it will end at the point that makes the highest angle from the first ending. Continuing to the end of the array, we find that...
These segments of highest average are exactly the segments in the upper convex hull of the cumulative sum graph.
Find these segments using the monotone chain algorithm. Since the points are already sorted, it takes O(n) time.
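To make that concrete, here is a minimal C sketch of the approach (names like highest_average_segments are hypothetical, and the arithmetic is assumed to fit in 64-bit integers): build the prefix-sum points, keep only the upper convex hull with a monotone-chain scan, and read each hull edge off as one maximal highest-average segment.

```c
#include <stdio.h>

/* Decompose x[0..n-1] into consecutive segments of non-increasing average,
   each as "high-average" as possible: these are exactly the edges of the
   upper convex hull of the prefix-sum points (i, P[i]). */
void highest_average_segments(const long long *x, int n)
{
    long long P[n + 1];   /* prefix sums: P[i] = x[0] + ... + x[i-1] */
    int hull[n + 1];      /* indices of points kept on the upper hull */
    int top = 0;

    P[0] = 0;
    for (int i = 0; i < n; i++)
        P[i + 1] = P[i] + x[i];

    /* Monotone chain over already-sorted x-coordinates: O(n). Pop the last
       hull point while it is not strictly above the chord from the point
       before it to the new point (so it cannot lie on the upper hull). */
    for (int i = 0; i <= n; i++) {
        while (top >= 2) {
            int o = hull[top - 2], a = hull[top - 1];
            long long cross = (long long)(a - o) * (P[i] - P[o])
                            - (P[a] - P[o]) * (long long)(i - o);
            if (cross >= 0)
                top--;          /* a is on or below the chord o -> i */
            else
                break;
        }
        hull[top++] = i;
    }

    /* Each hull edge (hull[k-1], hull[k]) is one segment of x; its slope is
       the segment's average, and slopes are non-increasing left to right. */
    for (int k = 1; k < top; k++) {
        int l = hull[k - 1], r = hull[k];   /* covers x[l .. r-1] */
        printf("segment [%d, %d) average = %.3f\n",
               l, r, (double)(P[r] - P[l]) / (r - l));
    }
}

int main(void)
{
    long long x[] = {4, 2, 5, 3, 3, 1};
    highest_average_segments(x, 6);
    return 0;
}
```

For this sample input the printed segment averages come out non-increasing (4, 3.5, 3, 1), which is the property the answer describes.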
QUESTION
The dataframe looks like this
...ANSWER
Answered 2022-Jan-16 at 18:49
With apply, use MARGIN = 1 to loop over the rows of the numeric columns, sort, get the head/tail depending on decreasing = TRUE/FALSE, and return the mean in base R.
QUESTION
I have the following column structure:
...ANSWER
Answered 2022-Jan-12 at 17:36
We may remove the non-numeric part with gsub, read with read.table specifying the sep as -, and use rowMeans in base R.
QUESTION
Originally this problem came up on mathematica.SE, but since multiple programming languages are involved in the discussion, I think it's better to rephrase it a bit and post it here.
In short, michalkvasnicka found that in the following MATLAB sample
...ANSWER
Answered 2021-Dec-30 at 12:23
tic/toc should be fine, but it looks like the timing is being skewed by memory pre-allocation.
I can reproduce similar timings to your MATLAB example, however:
On first run (clear workspace):
- Loop approach takes 2.08 sec
- Vectorised approach takes 1.04 sec
- Vectorisation saves 50% execution time
On second run (workspace not cleared):
- Loop approach takes 2.55 sec
- Vectorised approach takes 0.065 sec
- Vectorisation "saves" 97.5% execution time
My guess would be that since the loop approach explicitly creates a new matrix via zeros, the memory is reallocated from scratch on every run and you don't see the speed improvement on subsequent runs. However, when HH remains in memory and the HH=___ line outputs a matrix of the same size, I suspect MATLAB is doing some clever memory allocation to speed up the operation.
We can prove this theory with the following test:
QUESTION
I'm using godbolt to get assembly of the following program:
...ANSWER
Answered 2021-Dec-13 at 06:33
You can see the cost of instructions on most mainstream architectures here and there. Based on that, and assuming you use for example an Intel Skylake processor, you can see that one 32-bit imul instruction can be computed per cycle, but with a latency of 3 cycles. In the optimized code, 2 lea instructions (which are very cheap) can be executed per cycle with a 1-cycle latency. The same thing applies to the sal instruction (2 per cycle and 1 cycle of latency).
This means that the optimized version can be executed with only 2 cycles of latency, while the first one takes 3 cycles of latency (not taking into account load/store instructions, which are the same). Moreover, the second version can be better pipelined since the two instructions can be executed for two different input data in parallel thanks to superscalar out-of-order execution. Note that two loads can be executed in parallel too, although only one store can be executed per cycle. This means that the execution is bounded by the throughput of store instructions: overall, only one value can be computed per cycle. AFAIK, recent Intel Ice Lake processors can do two stores in parallel, like new AMD Ryzen processors. The second version is expected to be as fast or possibly faster on the chosen use-case (Intel Skylake processors). It should be significantly faster on very recent x86-64 processors.
Note that the lea instruction is very fast because the multiply-add is done on a dedicated CPU unit (hard-wired shifters) and it only supports some specific constants for the multiplication (supported factors are 1, 2, 4 and 8, which means that lea can be used to multiply an integer by the constants 2, 3, 4, 5, 8 and 9). This is why lea is faster than imul/mul.
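For a concrete illustration (a small sketch; this is not the asker's program), multiplying by one of those small constants is exactly the case where optimizing compilers substitute lea for imul:

```c
#include <stdint.h>

/* With optimizations enabled, GCC and Clang typically compile this to a
   single "lea eax, [rdi + rdi*4]" (x + x*4), with 1-cycle latency, instead
   of a 3-cycle-latency imul. */
uint32_t times5(uint32_t x)
{
    return x * 5;
}

/* A factor outside {2,3,4,5,8,9} usually still avoids imul, but needs an
   extra step: x*10 is commonly emitted as the lea for x*5 followed by an
   add or shift. */
uint32_t times10(uint32_t x)
{
    return x * 10;
}
```

Compiling either function with -O2 on x86-64 GCC or Clang and inspecting the assembly (for example on godbolt.org) shows the lea-based sequences described above.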
I can reproduce the slower execution with -O2 using GCC 11.2 (on Linux with an i5-9600KF processor).
The main source of slowdown comes from the higher number of micro-operations (uops) to be executed in the -O2 version, certainly combined with the saturation of some execution ports due to a bad micro-operation scheduling.
Here is the assembly of the loop with -Os:
QUESTION
I am trying code from this page. I ran up to the part LR (tf-idf) and got similar results.
After that I decided to try GridSearchCV. My questions below:
1)
...ANSWER
Answered 2021-Dec-09 at 23:12
You end up with the error about precision because some of your penalization is too strong for this model; if you check the results, you get 0 for the f1 score when C = 0.001 and C = 0.01.
QUESTION
Consider the following code, running on an ARM Cortex-A72 processor (optimization guide here). I have included what I expect are resource pressures for each execution port:
Instruction | B | I0 | I1 | M | L | S | F0 | F1
.LBB0_1:
ldr q3, [x1], #16 | | 0.5 | 0.5 | | 1 | | |
ldr q4, [x2], #16 | | 0.5 | 0.5 | | 1 | | |
add x8, x8, #4 | | 0.5 | 0.5 | | | | |
cmp x8, #508 | | 0.5 | 0.5 | | | | |
mul v5.4s, v3.4s, v4.4s | | | | | | | 2 |
mul v5.4s, v5.4s, v0.4s | | | | | | | 2 |
smull v6.2d, v5.2s, v1.2s | | | | | | | 1 |
smull2 v5.2d, v5.4s, v2.4s | | | | | | | 1 |
smlal v6.2d, v3.2s, v4.2s | | | | | | | 1 |
smlal2 v5.2d, v3.4s, v4.4s | | | | | | | 1 |
uzp2 v3.4s, v6.4s, v5.4s | | | | | | | | 1
str q3, [x0], #16 | | 0.5 | 0.5 | | | 1 | |
b.lo .LBB0_1 | 1 | | | | | | |
Total port pressure | 1 | 2.5 | 2.5 | 0 | 2 | 1 | 8 | 1
Although uzp2 could run on either the F0 or F1 ports, I chose to attribute it entirely to F1 due to high pressure on F0 and zero pressure on F1 other than this instruction.
There are no dependencies between loop iterations, other than the loop counter and array pointers; and these should be resolved very quickly, compared to the time taken for the rest of the loop body.
Thus, my intuition is that this code should be throughput limited, and considering the worst pressure is on F0, run in 8 cycles per iteration (unless it hits a decoding bottleneck or cache misses). The latter is unlikely given the streaming access pattern, and the fact that arrays comfortably fit in L1 cache. As for the former, considering the constraints listed on section 4.1 of the optimization manual, I project that the loop body is decodable in only 8 cycles.
Yet microbenchmarking indicates that each iteration of the loop body takes 12.5 cycles on average. If no other plausible explanation exists, I may edit the question including further details about how I benchmarked this code, but I'm fairly certain the difference can't be attributed to benchmarking artifacts alone. Also, I have tried to increase the number of iterations to see if performance improved towards an asymptotic limit due to startup/cool-down effects, but it appears to have done so already for the selected value of 128 iterations displayed above.
Manually unrolling the loop to include two calculations per iteration decreased performance to 13 cycles; however, note that this would also duplicate the number of load and store instructions. Interestingly, if the doubled loads and stores are instead replaced by single LD1/ST1 instructions (two-register format) (e.g. ld1 { v3.4s, v4.4s }, [x1], #32) then performance improves to 11.75 cycles per iteration. Further unrolling the loop to four calculations per iteration, while using the four-register format of LD1/ST1, improves performance to 11.25 cycles per iteration.
In spite of the improvements, the performance is still far away from the 8 cycles per iteration that I expected from looking at resource pressures alone. Even if the CPU made a bad scheduling call and issued uzp2 to F0, revising the resource pressure table would indicate 9 cycles per iteration, still far from actual measurements. So, what's causing this code to run so much slower than expected? What kind of effects am I missing in my analysis?
EDIT: As promised, some more benchmarking details. I run the loop 3 times for warmup, 10 times for say n = 512, and then 10 times for n = 256. I take the minimum cycle count for the n = 512 runs and subtract from the minimum for n = 256. The difference should give me how many cycles it takes to run for n = 256, while canceling out the fixed setup cost (code not shown). In addition, this should ensure all data is in the L1 I and D cache. Measurements are taken by reading the cycle counter (pmccntr_el0) directly. Any overhead should be canceled out by the measurement strategy above.
ANSWER
Answered 2021-Nov-06 at 13:50
First off, you can further reduce the theoretical cycles to 6 by replacing the first mul with uzp1 and doing the following smull and smlal the other way around: mul, mul, smull, smlal => smull, uzp1, mul, smlal.
This also heavily reduces the register pressure, so that we can do an even deeper unrolling (up to 32 per iteration).
And you don't need the v2 coefficients; you can pack them into the higher part of v1.
Let's rule out everything by unrolling this deep and writing it in assembly:
QUESTION
I've installed Windows 10 21H2 on both my desktop (AMD 5950X system with RTX3080) and my laptop (Dell XPS 9560 with i7-7700HQ and GTX1050) following the instructions on https://docs.nvidia.com/cuda/wsl-user-guide/index.html:
- Install CUDA-capable driver in Windows
- Update WSL2 kernel in PowerShell:
wsl --update
- Install CUDA toolkit in Ubuntu 20.04 in WSL2 (Note that you don't install a CUDA driver in WSL2; the instructions explicitly state that the CUDA driver should not be installed.):
ANSWER
Answered 2021-Nov-18 at 19:20
Turns out that Windows 10 Update Assistant incorrectly reported it upgraded my OS to 21H2 on my laptop. Checking the Windows version by running winver reports that my OS is still 21H1.
Of course CUDA in WSL2 will not work in Windows 10 without 21H2.
After successfully installing 21H2 I can confirm CUDA works with WSL2 even for laptops with Optimus NVIDIA cards.
Community Discussions, Code Snippets contain sources that include Stack Exchange Network
Vulnerabilities
No vulnerabilities reported