Average | Templated class for calculating averages | Analytics library
kandi X-RAY | Average Summary
Templated class for calculating averages and statistics of data sets.
Community Discussions
Trending Discussions on Average
QUESTION
I have a dataframe like the following:
...ANSWER
Answered 2022-Feb-22 at 20:00
I think actually a more efficient way would be to sort by Brand and then Year, and then use interpolate:
QUESTION
While testing things around Compiler Explorer, I tried out the following overflow-free function for calculating the average of two unsigned 32-bit integers:
...ANSWER
Answered 2022-Mar-08 at 10:00
Clang does the same thing. Probably for compiler-construction and CPU-architecture reasons:
Disentangling that logic into just a swap may allow better optimization in some cases; definitely something it makes sense for a compiler to do early so it can follow values through the swap.
Xor-swap is total garbage for swapping registers, the only advantage being that it doesn't need a temporary. But xchg reg,reg already does that better.
I'm not surprised that GCC's optimizer recognizes the xor-swap pattern and disentangles it to follow the original values. In general, this makes constant-propagation and value-range optimizations possible through swaps, especially for cases where the swap wasn't conditional on the values of the vars being swapped. This pattern-recognition probably happens soon after transforming the program logic to GIMPLE (SSA) representation, so at that point it will forget that the original source ever used an xor swap, and not think about emitting asm that way.
Hopefully sometimes that lets it then optimize down to only a single mov, or two movs, depending on register allocation for the surrounding code (e.g. if one of the vars can move to a new register, instead of having to end up back in the original locations). And whether both variables are actually used later, or only one. Or if it can fully disentangle an unconditional swap, maybe no mov instructions.
But worst case, three mov instructions needing a temporary register is still better, unless it's running out of registers. I'd guess GCC is not smart enough to use xchg reg,reg instead of spilling something else or saving/restoring another tmp reg, so there might be corner cases where this optimization actually hurts.
(Apparently GCC -Os does have a peephole optimization to use xchg reg,reg instead of 3x mov: PR 92549 was fixed for GCC10. It looks for that quite late, during RTL -> assembly. And yes, it works here: turning your xor-swap into an xchg: https://godbolt.org/z/zs969xh47)
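To make the pattern concrete, here is roughly the kind of source-level code under discussion (a minimal sketch, not the asker's original function; the conditional swap is an assumption): a hand-written xor-swap versus the plain temporary-variable swap that compilers effectively turn it into.

```c
#include <stdint.h>

/* Conditional xor-swap, as sometimes written by hand to avoid a temporary.
   GCC/Clang typically recognize this pattern early (in GIMPLE/SSA form) and
   follow the values through it rather than emitting three xor instructions. */
void swap_xor(uint32_t *a, uint32_t *b)
{
    if (*a < *b) {
        *a ^= *b;
        *b ^= *a;
        *a ^= *b;
    }
}

/* Equivalent swap with a temporary: at worst a few mov instructions, and
   often cheaper than that after register renaming / mov-elimination. */
void swap_tmp(uint32_t *a, uint32_t *b)
{
    if (*a < *b) {
        uint32_t t = *a;
        *a = *b;
        *b = t;
    }
}
```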
With no memory reads, and the same number of instructions, I don't see any bad impacts and it feels odd that it would be changed. Clearly there is something I did not think through, but what is it?
Instruction count is only a rough proxy for one of three things that are relevant for perf analysis: front-end uops, latency, and back-end execution ports. (And machine-code size in bytes: x86 machine-code instructions are variable-length.)
It's the same size in machine-code bytes, and the same number of front-end uops, but the critical-path latency is worse: 3 cycles from input a to output a for xor-swap, and 2 from input b to output a, for example.
MOV-swap has at worst 1-cycle and 2-cycle latencies from inputs to outputs, or less with mov-elimination. (Which can also avoid using back-end execution ports, especially relevant for CPUs like IvyBridge and Tiger Lake with a front-end wider than the number of integer ALU ports. And Ice Lake, except Intel disabled mov-elimination on it as an erratum workaround; not sure if it's re-enabled for Tiger Lake or not.)
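As an aside on the underlying problem, one well-known branch-free formulation of an overflow-free unsigned average is shown below (a small sketch for reference only; it is not necessarily the function the asker compiled):

```c
#include <stdint.h>

/* floor((a + b) / 2) without overflow: since a + b == (a ^ b) + 2*(a & b),
   halving gives (a & b) + ((a ^ b) >> 1), which never exceeds UINT32_MAX. */
uint32_t avg_u32(uint32_t a, uint32_t b)
{
    return (a & b) + ((a ^ b) >> 1);
}
```

On typical x86-64 this compiles to a few single-uop ALU instructions, with no branch and no swap at all.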
Also related:
- Why is XCHG reg, reg a 3 micro-op instruction on modern Intel architectures? - and those 3 uops can't benefit from mov-elimination. But on modern AMD, xchg reg,reg is only 2 uops.
GCC's real missed optimization here (even with -O3) is that tail-duplication results in about the same static code size, just a couple extra bytes since these are mostly 2-byte instructions. The big win is that the a path then becomes the same length as the other, instead of twice as long to first do a swap and then run the same 3 uops for averaging.
Update: GCC will do this for you with -ftracer (https://godbolt.org/z/es7a3bEPv), optimizing away the swap. (That's only enabled manually or as part of -fprofile-use, not at -O3, so it's probably not a good idea to use all the time without PGO, potentially bloating machine code in cold functions / code-paths.)
Doing it manually in the source (Godbolt):
QUESTION
I have an array of positive integers. For example:
...ANSWER
Answered 2022-Feb-27 at 22:44
This problem has a fun O(n) solution.
If you draw a graph of cumulative sum vs index, then:
The average value in the subarray between any two indexes is the slope of the line between those points on the graph.
The first highest-average-prefix will end at the point that makes the highest angle from 0. The next highest-average-prefix must then have a smaller average, and it will end at the point that makes the highest angle from the first ending. Continuing to the end of the array, we find that...
These segments of highest average are exactly the segments in the upper convex hull of the cumulative sum graph.
Find these segments using the monotone chain algorithm. Since the points are already sorted, it takes O(n) time.
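To make that concrete, here is a minimal C sketch of the approach (names like highest_average_segments are hypothetical, and the arithmetic is assumed to fit in 64-bit integers): build the prefix-sum points, keep only the upper convex hull with a monotone-chain scan, and read each hull edge off as one maximal highest-average segment.

```c
#include <stdio.h>

/* Decompose x[0..n-1] into consecutive segments of non-increasing average,
   each as "high-average" as possible: these are exactly the edges of the
   upper convex hull of the prefix-sum points (i, P[i]). */
void highest_average_segments(const long long *x, int n)
{
    long long P[n + 1];   /* prefix sums: P[i] = x[0] + ... + x[i-1] */
    int hull[n + 1];      /* indices of points kept on the upper hull */
    int top = 0;

    P[0] = 0;
    for (int i = 0; i < n; i++)
        P[i + 1] = P[i] + x[i];

    /* Monotone chain over already-sorted x-coordinates: O(n). Pop the last
       hull point while it is not strictly above the chord from the point
       before it to the new point (so it cannot lie on the upper hull). */
    for (int i = 0; i <= n; i++) {
        while (top >= 2) {
            int o = hull[top - 2], a = hull[top - 1];
            long long cross = (long long)(a - o) * (P[i] - P[o])
                            - (P[a] - P[o]) * (long long)(i - o);
            if (cross >= 0)
                top--;          /* a is on or below the chord o -> i */
            else
                break;
        }
        hull[top++] = i;
    }

    /* Each hull edge (hull[k-1], hull[k]) is one segment of x; its slope is
       the segment's average, and slopes are non-increasing left to right. */
    for (int k = 1; k < top; k++) {
        int l = hull[k - 1], r = hull[k];   /* covers x[l .. r-1] */
        printf("segment [%d, %d) average = %.3f\n",
               l, r, (double)(P[r] - P[l]) / (r - l));
    }
}

int main(void)
{
    long long x[] = {4, 2, 5, 3, 3, 1};
    highest_average_segments(x, 6);
    return 0;
}
```

For this sample input the printed segment averages come out non-increasing (4, 3.5, 3, 1), which is the property the answer describes.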
QUESTION
The dataframe looks like this
...ANSWER
Answered 2022-Jan-16 at 18:49
With apply, use MARGIN = 1 to loop over the rows of the numeric columns, sort, get the head/tail depending on decreasing = TRUE/FALSE, and return the mean in base R.
QUESTION
I have the following column structure:
...ANSWER
Answered 2022-Jan-12 at 17:36
We may remove the non-numeric part with gsub, read with read.table specifying the sep as -, and use rowMeans in base R.
QUESTION
Originally this problem came up on mathematica.SE, but since multiple programming languages are involved in the discussion, I think it's better to rephrase it a bit and post it here.
In short, michalkvasnicka found that in the following MATLAB sample
...ANSWER
Answered 2021-Dec-30 at 12:23
tic/toc should be fine, but it looks like the timing is being skewed by memory pre-allocation.
I can reproduce similar timings to your MATLAB example, however:
On first run (clear workspace):
- Loop approach takes 2.08 sec
- Vectorised approach takes 1.04 sec
- Vectorisation saves 50% execution time
On second run (workspace not cleared):
- Loop approach takes 2.55 sec
- Vectorised approach takes 0.065 sec
- Vectorisation "saves" 97.5% execution time
My guess would be that since the loop approach explicitly creates a new matrix via zeros, the memory is reallocated from scratch on every run and you don't see the speed improvement on subsequent runs. However, when HH remains in memory and the HH=___ line outputs a matrix of the same size, I suspect MATLAB is doing some clever memory allocation to speed up the operation.
We can prove this theory with the following test:
QUESTION
I'm using godbolt to get assembly of the following program:
...ANSWER
Answered 2021-Dec-13 at 06:33
You can see the cost of instructions on most mainstream architectures here and there. Based on that, and assuming you use for example an Intel Skylake processor, you can see that one 32-bit imul instruction can be computed per cycle, but with a latency of 3 cycles. In the optimized code, 2 lea instructions (which are very cheap) can be executed per cycle with a 1-cycle latency. The same thing applies to the sal instruction (2 per cycle and 1 cycle of latency).
This means that the optimized version can be executed with only 2 cycles of latency, while the first one takes 3 cycles of latency (not taking into account load/store instructions, which are the same). Moreover, the second version can be better pipelined since the two instructions can be executed for two different input data in parallel thanks to superscalar out-of-order execution. Note that two loads can be executed in parallel too, although only one store can be executed per cycle. This means that the execution is bounded by the throughput of store instructions: overall, only one value can be computed per cycle. AFAIK, recent Intel Ice Lake processors can do two stores in parallel, like new AMD Ryzen processors. The second version is expected to be as fast or possibly faster on the chosen use-case (Intel Skylake processors). It should be significantly faster on very recent x86-64 processors.
Note that the lea instruction is very fast because the multiply-add is done on a dedicated CPU unit (hard-wired shifters) and it only supports some specific constants for the multiplication (supported factors are 1, 2, 4 and 8, which means that lea can be used to multiply an integer by the constants 2, 3, 4, 5, 8 and 9). This is why lea is faster than imul/mul.
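For a concrete illustration (a small sketch; this is not the asker's program), multiplying by one of those small constants is exactly the case where optimizing compilers substitute lea for imul:

```c
#include <stdint.h>

/* With optimizations enabled, GCC and Clang typically compile this to a
   single "lea eax, [rdi + rdi*4]" (x + x*4), with 1-cycle latency, instead
   of a 3-cycle-latency imul. */
uint32_t times5(uint32_t x)
{
    return x * 5;
}

/* A factor outside {2,3,4,5,8,9} usually still avoids imul, but needs an
   extra step: x*10 is commonly emitted as the lea for x*5 followed by an
   add or shift. */
uint32_t times10(uint32_t x)
{
    return x * 10;
}
```

Compiling either function with -O2 on x86-64 GCC or Clang and inspecting the assembly (for example on godbolt.org) shows the lea-based sequences described above.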
I can reproduce the slower execution with -O2 using GCC 11.2 (on Linux with an i5-9600KF processor).
The main source of slowdown comes from the higher number of micro-operations (uops) to be executed in the -O2 version, certainly combined with the saturation of some execution ports due to a bad micro-operation scheduling.
Here is the assembly of the loop with -Os:
QUESTION
I am trying code from this page. I ran up to the part LR (tf-idf) and got similar results.
After that I decided to try GridSearchCV. My questions below:
1)
...ANSWER
Answered 2021-Dec-09 at 23:12
You end up with the error about precision because some of your penalization is too strong for this model; if you check the results, you get 0 for the f1 score when C = 0.001 and C = 0.01.
QUESTION
Consider the following code, running on an ARM Cortex-A72 processor (optimization guide here). I have included what I expect are resource pressures for each execution port:
Instruction | B | I0 | I1 | M | L | S | F0 | F1
.LBB0_1:
ldr q3, [x1], #16 | | 0.5 | 0.5 | | 1 | | |
ldr q4, [x2], #16 | | 0.5 | 0.5 | | 1 | | |
add x8, x8, #4 | | 0.5 | 0.5 | | | | |
cmp x8, #508 | | 0.5 | 0.5 | | | | |
mul v5.4s, v3.4s, v4.4s | | | | | | | 2 |
mul v5.4s, v5.4s, v0.4s | | | | | | | 2 |
smull v6.2d, v5.2s, v1.2s | | | | | | | 1 |
smull2 v5.2d, v5.4s, v2.4s | | | | | | | 1 |
smlal v6.2d, v3.2s, v4.2s | | | | | | | 1 |
smlal2 v5.2d, v3.4s, v4.4s | | | | | | | 1 |
uzp2 v3.4s, v6.4s, v5.4s | | | | | | | | 1
str q3, [x0], #16 | | 0.5 | 0.5 | | | 1 | |
b.lo .LBB0_1 | 1 | | | | | | |
Total port pressure | 1 | 2.5 | 2.5 | 0 | 2 | 1 | 8 | 1
Although uzp2 could run on either the F0 or F1 ports, I chose to attribute it entirely to F1 due to high pressure on F0 and zero pressure on F1 other than this instruction.
There are no dependencies between loop iterations, other than the loop counter and array pointers; and these should be resolved very quickly, compared to the time taken for the rest of the loop body.
Thus, my intuition is that this code should be throughput limited, and considering the worst pressure is on F0, run in 8 cycles per iteration (unless it hits a decoding bottleneck or cache misses). The latter is unlikely given the streaming access pattern, and the fact that arrays comfortably fit in L1 cache. As for the former, considering the constraints listed on section 4.1 of the optimization manual, I project that the loop body is decodable in only 8 cycles.
Yet microbenchmarking indicates that each iteration of the loop body takes 12.5 cycles on average. If no other plausible explanation exists, I may edit the question including further details about how I benchmarked this code, but I'm fairly certain the difference can't be attributed to benchmarking artifacts alone. Also, I have tried to increase the number of iterations to see if performance improved towards an asymptotic limit due to startup/cool-down effects, but it appears to have done so already for the selected value of 128 iterations displayed above.
Manually unrolling the loop to include two calculations per iteration decreased performance to 13 cycles; however, note that this would also duplicate the number of load and store instructions. Interestingly, if the doubled loads and stores are instead replaced by single LD1/ST1 instructions (two-register format) (e.g. ld1 { v3.4s, v4.4s }, [x1], #32) then performance improves to 11.75 cycles per iteration. Further unrolling the loop to four calculations per iteration, while using the four-register format of LD1/ST1, improves performance to 11.25 cycles per iteration.
In spite of the improvements, the performance is still far away from the 8 cycles per iteration that I expected from looking at resource pressures alone. Even if the CPU made a bad scheduling call and issued uzp2 to F0, revising the resource pressure table would indicate 9 cycles per iteration, still far from actual measurements. So, what's causing this code to run so much slower than expected? What kind of effects am I missing in my analysis?
EDIT: As promised, some more benchmarking details. I run the loop 3 times for warmup, 10 times for say n = 512, and then 10 times for n = 256. I take the minimum cycle count for the n = 512 runs and subtract from the minimum for n = 256. The difference should give me how many cycles it takes to run for n = 256, while canceling out the fixed setup cost (code not shown). In addition, this should ensure all data is in the L1 I and D cache. Measurements are taken by reading the cycle counter (pmccntr_el0) directly. Any overhead should be canceled out by the measurement strategy above.
ANSWER
Answered 2021-Nov-06 at 13:50
First off, you can further reduce the theoretical cycles to 6 by replacing the first mul with uzp1 and doing the following smull and smlal the other way around: mul, mul, smull, smlal => smull, uzp1, mul, smlal.
This also heavily reduces the register pressure, so that we can do an even deeper unrolling (up to 32 per iteration).
And you don't need the v2 coefficients; you can pack them into the higher part of v1.
Let's rule out everything by unrolling this deep and writing it in assembly:
QUESTION
I've installed Windows 10 21H2 on both my desktop (AMD 5950X system with RTX3080) and my laptop (Dell XPS 9560 with i7-7700HQ and GTX1050) following the instructions on https://docs.nvidia.com/cuda/wsl-user-guide/index.html:
- Install CUDA-capable driver in Windows
- Update WSL2 kernel in PowerShell:
wsl --update
- Install CUDA toolkit in Ubuntu 20.04 in WSL2 (Note that you don't install a CUDA driver in WSL2; the instructions explicitly state that the CUDA driver should not be installed.):
ANSWER
Answered 2021-Nov-18 at 19:20
Turns out that Windows 10 Update Assistant incorrectly reported it upgraded my OS to 21H2 on my laptop. Checking the Windows version by running winver reports that my OS is still 21H1.
Of course CUDA in WSL2 will not work in Windows 10 without 21H2.
After successfully installing 21H2 I can confirm CUDA works with WSL2 even for laptops with Optimus NVIDIA cards.
Community Discussions, Code Snippets contain sources that include Stack Exchange Network
Vulnerabilities
No vulnerabilities reported