kandi X-RAY | c8 Summary
kandi X-RAY | c8 Summary
output coverage reports using Node.js' built in coverage
Top functions reviewed by kandi - BETA
Currently covering the most popular Java, JavaScript and Python libraries. See a Sample of c8
c8 Key Features
c8 Examples and Code Snippets
Community Discussions
Trending Discussions on c8
QUESTION
Assembly novice here. I've written a benchmark to measure the floating-point performance of a machine in computing a transposed matrix-tensor product.
Given my machine with 32GiB RAM (bandwidth ~37GiB/s) and Intel(R) Core(TM) i5-8400 CPU @ 2.80GHz (Turbo 4.0GHz) processor, I estimate the maximum performance (with pipelining and data in registers) to be 6 cores x 4.0GHz = 24GFLOP/s. However, when I run my benchmark, I am measuring 127GFLOP/s, which is obviously a wrong measurement.
Note: in order to measure the FP performance, I am measuring the op-count: n*n*n*n*6
(n^3
for matrix-matrix multiplication, performed on n
slices of complex data-points i.e. assuming 6 FLOPs for 1 complex-complex multiplication) and dividing it by the average time taken for each run.
Code snippet in main function:
...ANSWER
Answered 2022-Mar-25 at 19:331 FP operation per core clock cycle would be pathetic for a modern superscalar CPU. Your Skylake-derived CPU can actually do 2x 4-wide SIMD double-precision FMA operations per core per clock, and each FMA counts as two FLOPs, so theoretical max = 16 double-precision FLOPs per core clock, so 24 * 16 = 384
GFLOP/S. (Using vectors of 4 double
s, i.e. 256-bit wide AVX). See FLOPS per cycle for sandy-bridge and haswell SSE2/AVX/AVX2
There is a a function call inside the timed region, callq 403c0b <_Z12do_timed_runRKmRd+0x1eb>
(as well as the __kmpc_end_serialized_parallel
stuff).
There's no symbol associated with that call target, so I guess you didn't compile with debug info enabled. (That's separate from optimization level, e.g. gcc -g -O3 -march=native -fopenmp
should run the same asm, just have more debug metadata.) Even a function invented by OpenMP should have a symbol name associated at some point.
As far as benchmark validity, a good litmus test is whether it scales reasonably with problem size. Unless you exceed L3 cache size or not with a smaller or larger problem, the time should change in some reasonable way. If not, then you'd worry about it optimizing away, or clock speed warm-up effects (Idiomatic way of performance evaluation? for that and more, like page-faults.)
- Why are there non-conditional jumps in code (at 403ad3, 403b53, 403d78 and 403d8f)?
Once you're already in an if
block, you unconditionally know the else
block should not run, so you jmp
over it instead of jcc
(even if FLAGS
were still set so you didn't have to test the condition again). Or you put one or the other block out-of-line (like at the end of the function, or before the entry point) and jcc
to it, then it jmp
s back to after the other side. That allows the fast path to be contiguous with no taken branches.
- Why are there 3 retq instances in the same function with only one return path (at 403c0a, 403ca4 and 403d26)?
Duplicate ret
comes from "tail duplication" optimization, where multiple paths of execution that all return can just get their own ret
instead of jumping to a ret
. (And copies of any cleanup necessary, like restoring regs and stack pointer.)
QUESTION
I have a dataframe of gene expression scores (cells x genes). I also have the cluster that each cell belongs to in stored as a column.
I want to calculate the mean expression values per cluster for a group of genes (columns), however, I only want to include values > 0 in these calculations.
My attempt at this is as follows:
...ANSWER
Answered 2022-Mar-24 at 20:52What you want is:
QUESTION
I am trying to calculate a NPV of several cashflows with continuous compounding. However when trying to refer to the optional parameter (which signals that the NPV will be continuous compounding), i get #VALUE! instead of a valid NPV
My VBA code is as follows:
...ANSWER
Answered 2022-Mar-21 at 21:07I believe you want to include both inside the loop and then use an IF to decide which to do:
QUESTION
I have implemented a Convolutional Neural Network in C and have been studying what parts of it have the longest latency.
Based on my research, the massive amounts of matricial multiplication required by CNNs makes running them on CPUs and even GPUs very inefficient. However, when I actually profiled my code (on an unoptimized build) I found out that something other than the multiplication itself was the bottleneck of the implementation.
After turning on optimization (-O3 -march=native -ffast-math
, gcc cross compiler), the Gprof result was the following:
Clearly, the convolution2D
function takes the largest amount of time to run, followed by the batch normalization and depthwise convolution functions.
The convolution function in question looks like this:
...ANSWER
Answered 2022-Mar-10 at 13:57Looking at the result of Cachegrind, it doesn't look like the memory is your bottleneck. The NN has to be stored in memory anyway, but if it's too large that your program's having a lot of L1 cache misses, then it's worth thinking to try to minimize L1 misses, but 1.7% of L1 (data) miss rate is not a problem.
So you're trying to make this run fast anyway. Looking at your code, what's happening at the most inner loop is very simple (load-> multiply -> add -> store), and it doesn't have any side effect other than the final store. This kind of code is easily parallelizable, for example, by multithreading or vectorizing. I think you'll know how to make this run in multiple threads seeing that you can write code with some complexity, and you asked in comments how to manually vectorize the code.
I will explain that part, but one thing to bear in mind is that once you choose to manually vectorize the code, it will often be tied to certain CPU architectures. Let's not consider non-AMD64 compatible CPUs like ARM. Still, you have the option of MMX, SSE, AVX, and AVX512 to choose as an extension for vectorized computation, and each extension has multiple versions. If you want maximum portability, SSE2 is a reasonable choice. SSE2 appeared with Pentium 4, and it supports 128-bit vectors. For this post I'll use AVX2, which supports 128-bit and 256-bit vectors. It runs fine on your CPU, and has reasonable portability these days, supported from Haswell (2013) and Excavator (2015).
The pattern you're using in the inner loop is called FMA (fused multiply and add). AVX2 has an instruction for this. Have a look at this function and the compiled output.
QUESTION
I am training a Unet segmentation model for binary class. The dataset is loaded in tensorflow data pipeline. The images are in (512, 512, 3) shape, masks are in (512, 512, 1) shape. The model expects the input in (512, 512, 3) shape. But I am getting the following error. Input 0 of layer "model" is incompatible with the layer: expected shape=(None, 512, 512, 3), found shape=(512, 512, 3)
Here are the images in metadata dataframe.
Randomly sampling the indices to select the training and validation set
...ANSWER
Answered 2022-Mar-08 at 13:38Use train_batches
in model.fit
and not train_images
. Also, you do not need to use repeat()
, which causes an infinite dataset if you do not specify how many times you want to repeat your dataset. Regarding your labels error, try rewriting your model like this:
QUESTION
I have a dataframe like this:
...ANSWER
Answered 2022-Feb-10 at 23:31You can use a named groupby
:
QUESTION
With Python, I wanted to format a string of hex characters:
- spaces between each byte (easy enough):
2f2f
->2f 2f
- line breaks at a specified max byte width (not hard):
2f 2f 2f 2f 2f 2f 2f 2f\n
- address ranges for each line (doable):
0x7f8-0x808: 2f 2f 2f 2f 2f 2f 2f 2f\n
- replace large ranges of sequential
00
bytes with:... trimmed 35 x 00 bytes [0x7 - 0x2a] ...
... it was at this point that I knew I was doing some bad coding. The function got bloated and hard to follow. Too many features piled up in a non-intuitive way.
Example output:
...ANSWER
Answered 2021-Dec-23 at 11:16I would suggest to not start a "trimmed 00 bytes" series in the middle of an output line, but only apply this compacting when it applies to complete output lines with only zeroes.
This means that you will still see non-compacted zeroes in a line that also contains non-zeroes, but in my opinion this results in a cleaner output format. For instance, if a line would end with just two 00 bytes, it really does not help to replace that last part of the line with the longer "trimmed 2 x 00 bytes" message. By only replacing complete 00-lines with this message, and compress multiple such lines with one message, the output format seems cleaner.
To produce that output format, I would use the power of regular expressions:
to identify a block of bytes to be output on one line: either a line with at least one non-zero, or a range of zero bytes which either runs to the end of the input, or else is a multiple of the "byte width" argument.
to insert spaces in a line of bytes
All this can be done through iterations in one expression:
QUESTION
Intel recommends using instruction prefixes to mitigate the performance consequences of JCC Erratum.
MSVC if compiled with /QIntel-jcc-erratum
follows the recommendation, and inserts prefixed instructions, like this:
ANSWER
Answered 2021-Dec-21 at 16:31A NOP is a separate instruction that had to decode and go through the pipeline separately. It's always better to pad instructions with prefixes to achieve desired alignment, not insert NOPs, as discussed in What methods can be used to efficiently extend instruction length on modern x86? (but only in ways that don't cause major stalls on some CPUs which can't handle large numbers of prefixes).
Perhaps Intel considered it worth the effort for toolchains to do it this way for this case since this would actually be inside inner loops, not just a NOP outside an inner loop. (And tacking on prefixes to one previous instruction is relatively simple.)
I now have some data point. The result of benchmarking for /QIntel-jcc-erratum
on AMD FX 8300 is bad.
The slowdown is by a decimal order of magnitude for a specific benchmark, where the benefit on Intel Skylake for the same benchmark is about 20 percent. This aligns with Peter's comments:
I checked Agner Fog's microarch guide, and AMD Zen has no problem with any number of prefixes on a single instruction, like mainstream Intel since Core2. AMD Bulldozer-family has a "very large" penalty for decoding instructions with more than 3 prefixes, like 14-15 cycles for 4-7 prefixes
It's somewhat valid to consider Bulldozer-family obsolete enough to not care much about it, although there are still some APU desktops and laptops around for sure, but they'd certainly show large regressions in loops where the compiler put 4 or more prefixes on one instruction inside a hot inner loop (including existing prefixes like REX or 66h). Much worse than the 3% for MITE legacy decode on SKL.
Though indeed Bulldozer-family is obsolete-ish, I don't think I can afford this much of an impact. I'm also afraid of other CPUs that may choke with extra prefixes the same way. So the conclusion for me is not to use /QIntel-jcc-erratum
for generally-targeted software. Unless it is enabled in specific translation units and dynamic dispatch to there is made, which is too much of the trouble most of the time.
One thing that probably safe to do on MSVC is to stop using /Os
flag . It was discovered that /Os
flag at least:
- Avoids jump tables in favor of conditional jumps
- Avoids loop start padding
Try the following example (https://godbolt.org/z/jvezPd9jM):
QUESTION
I have this:
...ANSWER
Answered 2021-Dec-01 at 08:58Split the enum class
into an enum
and a class
(or struct
for convenience).
QUESTION
So afaik IA32_LSTAR is supposed to hold the address of KiSystemCall64/KiSystemCall64Shadow, so right before a syscall was made on ntdll I dumped it, and set a breakpoint on it (KiSystemCall64Shadow) upon tracing with p on windbg I get a bugcheck(DOUBLE_FAULT), why is that?
I should mention that this whole process was inside a VM so I could kernel-debug the application
output of !analyze -v
...ANSWER
Answered 2021-Nov-26 at 23:41kernel mode interrupt entries check in which mode was interrupt - user or kernel, by checking lowest bit of CS
on stack (CPL
) and execute SWAPGS
instruction only in case iterrupt was from user mode. otherwise in GS assume already correct value - in user mode GS
point to TEB
and in kernel mode to KPCR
. example of
Community Discussions, Code Snippets contain sources that include Stack Exchange Network
Vulnerabilities
No vulnerabilities reported
Install c8
No Installation instructions are available at this moment for c8.Refer to component home page for details.
Support
If you have any questions vist the community on GitHub, Stack Overflow.
Reuse Trending Solutions
Find, review, and download reusable Libraries, Code Snippets, Cloud APIs from over 650 million Knowledge Items
Find more librariesStay Updated
Subscribe to our newsletter for trending solutions and developer bootcamps
Share this Page