c8

by bcoe JavaScript Version: Current License: ISC

X-Ray Key Features Code Snippets Community Discussions(10)Vulnerabilities Install Support

kandi X-RAY | c8 Summary

null

output coverage reports using Node.js' built in coverage

Support

Quality

Security

License

Reuse

Top functions reviewed by kandi - BETA

kandi's functional review helps you automatically verify the functionalities of the libraries and avoid rework.
Currently covering the most popular Java, JavaScript and Python libraries. See a Sample of c8

Get all kandi verified functions for this library.

c8 Key Features

No Key Features are available at this moment for c8.

c8 Examples and Code Snippets

No Code Snippets are available at this moment for c8.

Community Discussions

Trending Discussions on c8

Faulty benchmark, puzzling assembly

Dplyr: using summarise across to take mean of columns only if row value > 0

VBA Function returning #VALUE! when trying to refer to Optional Parameter

Convolution Function Latency Bottleneck

Input 0 of layer "model" is incompatible with the layer: expected shape=(None, 512, 512, 3), found shape=(512, 512, 3)

Pandas groupby and count numbers of item by conditions

What's a less bad way to write this hex formatting function in Python?

Intel JCC Erratum - what is the effect of prefixes used for mitigation?

enum class with scoping but without having to cast to underlying type

Why do I get a DoubleFault when I set a break point on IA32_LSTAR MSR for SYSCALL on windows 64 bit with a kernel debugger? (KVASCODE section)

QUESTION

Faulty benchmark, puzzling assembly

Asked 2022-Mar-28 at 07:40

Assembly novice here. I've written a benchmark to measure the floating-point performance of a machine in computing a transposed matrix-tensor product.

Given my machine with 32GiB RAM (bandwidth ~37GiB/s) and Intel(R) Core(TM) i5-8400 CPU @ 2.80GHz (Turbo 4.0GHz) processor, I estimate the maximum performance (with pipelining and data in registers) to be 6 cores x 4.0GHz = 24GFLOP/s. However, when I run my benchmark, I am measuring 127GFLOP/s, which is obviously a wrong measurement.

Note: in order to measure the FP performance, I am measuring the op-count: n*n*n*n*6 (n^3 for matrix-matrix multiplication, performed on n slices of complex data-points i.e. assuming 6 FLOPs for 1 complex-complex multiplication) and dividing it by the average time taken for each run.

Code snippet in main function:

...

ANSWER

Answered 2022-Mar-25 at 19:33

1 FP operation per core clock cycle would be pathetic for a modern superscalar CPU. Your Skylake-derived CPU can actually do 2x 4-wide SIMD double-precision FMA operations per core per clock, and each FMA counts as two FLOPs, so theoretical max = 16 double-precision FLOPs per core clock, so 24 * 16 = 384 GFLOP/S. (Using vectors of 4 doubles, i.e. 256-bit wide AVX). See FLOPS per cycle for sandy-bridge and haswell SSE2/AVX/AVX2

There is a a function call inside the timed region, callq 403c0b <_Z12do_timed_runRKmRd+0x1eb> (as well as the __kmpc_end_serialized_parallel stuff).

There's no symbol associated with that call target, so I guess you didn't compile with debug info enabled. (That's separate from optimization level, e.g. gcc -g -O3 -march=native -fopenmp should run the same asm, just have more debug metadata.) Even a function invented by OpenMP should have a symbol name associated at some point.

As far as benchmark validity, a good litmus test is whether it scales reasonably with problem size. Unless you exceed L3 cache size or not with a smaller or larger problem, the time should change in some reasonable way. If not, then you'd worry about it optimizing away, or clock speed warm-up effects (Idiomatic way of performance evaluation? for that and more, like page-faults.)

Why are there non-conditional jumps in code (at 403ad3, 403b53, 403d78 and 403d8f)?

Once you're already in an if block, you unconditionally know the else block should not run, so you jmp over it instead of jcc (even if FLAGS were still set so you didn't have to test the condition again). Or you put one or the other block out-of-line (like at the end of the function, or before the entry point) and jcc to it, then it jmps back to after the other side. That allows the fast path to be contiguous with no taken branches.

Why are there 3 retq instances in the same function with only one return path (at 403c0a, 403ca4 and 403d26)?

Duplicate ret comes from "tail duplication" optimization, where multiple paths of execution that all return can just get their own ret instead of jumping to a ret. (And copies of any cleanup necessary, like restoring regs and stack pointer.)

Source https://stackoverflow.com/questions/71618068

QUESTION

Dplyr: using summarise across to take mean of columns only if row value > 0

Asked 2022-Mar-25 at 09:04

I have a dataframe of gene expression scores (cells x genes). I also have the cluster that each cell belongs to in stored as a column.

I want to calculate the mean expression values per cluster for a group of genes (columns), however, I only want to include values > 0 in these calculations.

My attempt at this is as follows:

...

ANSWER

Answered 2022-Mar-24 at 20:52

What you want is:

Source https://stackoverflow.com/questions/71609087

QUESTION

VBA Function returning #VALUE! when trying to refer to Optional Parameter

Asked 2022-Mar-21 at 21:07

I am trying to calculate a NPV of several cashflows with continuous compounding. However when trying to refer to the optional parameter (which signals that the NPV will be continuous compounding), i get #VALUE! instead of a valid NPV

My VBA code is as follows:

...

ANSWER

Answered 2022-Mar-21 at 21:07

I believe you want to include both inside the loop and then use an IF to decide which to do:

Source https://stackoverflow.com/questions/71563976

QUESTION

Convolution Function Latency Bottleneck

Asked 2022-Mar-10 at 13:57

I have implemented a Convolutional Neural Network in C and have been studying what parts of it have the longest latency.

Based on my research, the massive amounts of matricial multiplication required by CNNs makes running them on CPUs and even GPUs very inefficient. However, when I actually profiled my code (on an unoptimized build) I found out that something other than the multiplication itself was the bottleneck of the implementation.

After turning on optimization (-O3 -march=native -ffast-math, gcc cross compiler), the Gprof result was the following:

Clearly, the convolution2D function takes the largest amount of time to run, followed by the batch normalization and depthwise convolution functions.

The convolution function in question looks like this:

...

ANSWER

Answered 2022-Mar-10 at 13:57

Looking at the result of Cachegrind, it doesn't look like the memory is your bottleneck. The NN has to be stored in memory anyway, but if it's too large that your program's having a lot of L1 cache misses, then it's worth thinking to try to minimize L1 misses, but 1.7% of L1 (data) miss rate is not a problem.

So you're trying to make this run fast anyway. Looking at your code, what's happening at the most inner loop is very simple (load-> multiply -> add -> store), and it doesn't have any side effect other than the final store. This kind of code is easily parallelizable, for example, by multithreading or vectorizing. I think you'll know how to make this run in multiple threads seeing that you can write code with some complexity, and you asked in comments how to manually vectorize the code.

I will explain that part, but one thing to bear in mind is that once you choose to manually vectorize the code, it will often be tied to certain CPU architectures. Let's not consider non-AMD64 compatible CPUs like ARM. Still, you have the option of MMX, SSE, AVX, and AVX512 to choose as an extension for vectorized computation, and each extension has multiple versions. If you want maximum portability, SSE2 is a reasonable choice. SSE2 appeared with Pentium 4, and it supports 128-bit vectors. For this post I'll use AVX2, which supports 128-bit and 256-bit vectors. It runs fine on your CPU, and has reasonable portability these days, supported from Haswell (2013) and Excavator (2015).

The pattern you're using in the inner loop is called FMA (fused multiply and add). AVX2 has an instruction for this. Have a look at this function and the compiled output.

Source https://stackoverflow.com/questions/71401876

QUESTION

Input 0 of layer "model" is incompatible with the layer: expected shape=(None, 512, 512, 3), found shape=(512, 512, 3)

Asked 2022-Mar-08 at 14:22

I am training a Unet segmentation model for binary class. The dataset is loaded in tensorflow data pipeline. The images are in (512, 512, 3) shape, masks are in (512, 512, 1) shape. The model expects the input in (512, 512, 3) shape. But I am getting the following error. Input 0 of layer "model" is incompatible with the layer: expected shape=(None, 512, 512, 3), found shape=(512, 512, 3)

Here are the images in metadata dataframe.

Randomly sampling the indices to select the training and validation set

...

ANSWER

Answered 2022-Mar-08 at 13:38

Use train_batches in model.fit and not train_images. Also, you do not need to use repeat(), which causes an infinite dataset if you do not specify how many times you want to repeat your dataset. Regarding your labels error, try rewriting your model like this:

Source https://stackoverflow.com/questions/71395504

QUESTION

Pandas groupby and count numbers of item by conditions

Asked 2022-Feb-10 at 23:31

I have a dataframe like this:

...

ANSWER

Answered 2022-Feb-10 at 23:31

You can use a named groupby:

Source https://stackoverflow.com/questions/70884256

QUESTION

What's a less bad way to write this hex formatting function in Python?

Asked 2021-Dec-23 at 18:06

With Python, I wanted to format a string of hex characters:

spaces between each byte (easy enough): 2f2f -> 2f 2f
line breaks at a specified max byte width (not hard): 2f 2f 2f 2f 2f 2f 2f 2f\n
address ranges for each line (doable): 0x7f8-0x808: 2f 2f 2f 2f 2f 2f 2f 2f\n
replace large ranges of sequential 00 bytes with: ... trimmed 35 x 00 bytes [0x7 - 0x2a] ... ... it was at this point that I knew I was doing some bad coding. The function got bloated and hard to follow. Too many features piled up in a non-intuitive way.

Example output:

...

ANSWER

Answered 2021-Dec-23 at 11:16

I would suggest to not start a "trimmed 00 bytes" series in the middle of an output line, but only apply this compacting when it applies to complete output lines with only zeroes.

This means that you will still see non-compacted zeroes in a line that also contains non-zeroes, but in my opinion this results in a cleaner output format. For instance, if a line would end with just two 00 bytes, it really does not help to replace that last part of the line with the longer "trimmed 2 x 00 bytes" message. By only replacing complete 00-lines with this message, and compress multiple such lines with one message, the output format seems cleaner.

To produce that output format, I would use the power of regular expressions:

to identify a block of bytes to be output on one line: either a line with at least one non-zero, or a range of zero bytes which either runs to the end of the input, or else is a multiple of the "byte width" argument.
to insert spaces in a line of bytes

All this can be done through iterations in one expression:

Source https://stackoverflow.com/questions/70459165

QUESTION

Intel JCC Erratum - what is the effect of prefixes used for mitigation?

Asked 2021-Dec-21 at 16:31

Intel recommends using instruction prefixes to mitigate the performance consequences of JCC Erratum.

MSVC if compiled with /QIntel-jcc-erratum follows the recommendation, and inserts prefixed instructions, like this:

...

ANSWER

Answered 2021-Dec-21 at 16:31

A NOP is a separate instruction that had to decode and go through the pipeline separately. It's always better to pad instructions with prefixes to achieve desired alignment, not insert NOPs, as discussed in What methods can be used to efficiently extend instruction length on modern x86? (but only in ways that don't cause major stalls on some CPUs which can't handle large numbers of prefixes).

Perhaps Intel considered it worth the effort for toolchains to do it this way for this case since this would actually be inside inner loops, not just a NOP outside an inner loop. (And tacking on prefixes to one previous instruction is relatively simple.)

I now have some data point. The result of benchmarking for /QIntel-jcc-erratum on AMD FX 8300 is bad.

The slowdown is by a decimal order of magnitude for a specific benchmark, where the benefit on Intel Skylake for the same benchmark is about 20 percent. This aligns with Peter's comments:

I checked Agner Fog's microarch guide, and AMD Zen has no problem with any number of prefixes on a single instruction, like mainstream Intel since Core2. AMD Bulldozer-family has a "very large" penalty for decoding instructions with more than 3 prefixes, like 14-15 cycles for 4-7 prefixes

It's somewhat valid to consider Bulldozer-family obsolete enough to not care much about it, although there are still some APU desktops and laptops around for sure, but they'd certainly show large regressions in loops where the compiler put 4 or more prefixes on one instruction inside a hot inner loop (including existing prefixes like REX or 66h). Much worse than the 3% for MITE legacy decode on SKL.

Though indeed Bulldozer-family is obsolete-ish, I don't think I can afford this much of an impact. I'm also afraid of other CPUs that may choke with extra prefixes the same way. So the conclusion for me is not to use /QIntel-jcc-erratum for generally-targeted software. Unless it is enabled in specific translation units and dynamic dispatch to there is made, which is too much of the trouble most of the time.

One thing that probably safe to do on MSVC is to stop using /Os flag . It was discovered that /Os flag at least:

Avoids jump tables in favor of conditional jumps
Avoids loop start padding

Try the following example (https://godbolt.org/z/jvezPd9jM):

Source https://stackoverflow.com/questions/70216761

QUESTION

enum class with scoping but without having to cast to underlying type

Asked 2021-Dec-01 at 08:59

I have this:

...

ANSWER

Answered 2021-Dec-01 at 08:58

Split the enum class into an enum and a class (or struct for convenience).

Source https://stackoverflow.com/questions/70179190

QUESTION

Why do I get a DoubleFault when I set a break point on IA32_LSTAR MSR for SYSCALL on windows 64 bit with a kernel debugger? (KVASCODE section)

Asked 2021-Nov-26 at 23:41

TLDR: Why can you set a break point at windows kernel .text section but not on the KVAS section?

So afaik IA32_LSTAR is supposed to hold the address of KiSystemCall64/KiSystemCall64Shadow, so right before a syscall was made on ntdll I dumped it, and set a breakpoint on it (KiSystemCall64Shadow) upon tracing with p on windbg I get a bugcheck(DOUBLE_FAULT), why is that?

I should mention that this whole process was inside a VM so I could kernel-debug the application

output of !analyze -v

...

ANSWER

Answered 2021-Nov-26 at 23:41

kernel mode interrupt entries check in which mode was interrupt - user or kernel, by checking lowest bit of CS on stack (CPL) and execute SWAPGS instruction only in case iterrupt was from user mode. otherwise in GS assume already correct value - in user mode GS point to TEB and in kernel mode to KPCR. example of

Source https://stackoverflow.com/questions/70059375

Community Discussions, Code Snippets contain sources that include Stack Exchange Network