rip | Yes, I know sed can do | Regex library
kandi X-RAY | rip Summary
Extract data from input using regular expressions.
Community Discussions
Trending Discussions on rip
QUESTION
I am using a modified version of the GetMetaData script originally written by Ed Wilson at Microsoft (https://devblogs.microsoft.com/scripting/hey-scripting-guy-how-can-i-find-files-metadata/) and then modified by user wOxxOm here: https://stackoverflow.com/a/42933461/5061596. I'm trying to analyze all my DVD and BluRay rips and see what tool was used to create them. Mainly I want to check which ones I compressed with Handbrake and which ones came directly from MakeMKV. The problem is I can't find this field.
If I use the "stock" script and change the number of properties it looks for from 0 - 266 up to 0 - 330, I find the extra file info like movie length, resolution, etc., but I can't find the tool used. For example, here is what the MediaInfo Lite tool reports:
But looking through the metadata I get something like this, with no "Writing application" property:
...ANSWER
Answered 2022-Apr-05 at 13:21
Edit: actually, this seems more reliable. So far, any file that mediainfo can read, this also works with.
QUESTION
Assembly novice here. I've written a benchmark to measure the floating-point performance of a machine in computing a transposed matrix-tensor product.
Given my machine with 32GiB RAM (bandwidth ~37GiB/s) and Intel(R) Core(TM) i5-8400 CPU @ 2.80GHz (Turbo 4.0GHz) processor, I estimate the maximum performance (with pipelining and data in registers) to be 6 cores x 4.0GHz = 24GFLOP/s. However, when I run my benchmark, I am measuring 127GFLOP/s, which is obviously a wrong measurement.
Note: in order to measure the FP performance, I am measuring the op-count n*n*n*n*6 (n^3 multiplications for each matrix-matrix product, performed on n slices of complex data points, i.e. assuming 6 FLOPs for 1 complex-complex multiplication) and dividing it by the average time taken for each run.
Code snippet in main function:
...ANSWER
Answered 2022-Mar-25 at 19:33
1 FP operation per core clock cycle would be pathetic for a modern superscalar CPU. Your Skylake-derived CPU can actually do 2x 4-wide SIMD double-precision FMA operations per core per clock, and each FMA counts as two FLOPs, so the theoretical max is 16 double-precision FLOPs per core clock: 24 * 16 = 384 GFLOP/s (using vectors of 4 doubles, i.e. 256-bit wide AVX). See FLOPS per cycle for sandy-bridge and haswell SSE2/AVX/AVX2.
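To see what approaching that throughput requires, the loop body needs enough independent FMA dependency chains to hide the FMA latency. Here is a minimal sketch (not from the original answer; it assumes AVX2+FMA and a build like g++ -O2 -mavx2 -mfma):

#include <immintrin.h>
#include <cstddef>

// Hypothetical peak-FLOPs kernel: 8 independent accumulator chains hide
// the ~4-cycle FMA latency so both FMA ports can start one 4-wide FMA
// every cycle: 2 FMAs/clock * 4 doubles * 2 FLOPs = 16 FLOPs per clock.
double fma_peak_kernel(std::size_t iters) {
    const __m256d a = _mm256_set1_pd(1.000000001);
    const __m256d b = _mm256_set1_pd(-0.000000001);
    __m256d acc[8];
    for (int i = 0; i < 8; ++i)
        acc[i] = _mm256_set1_pd(1.0);
    for (std::size_t n = 0; n < iters; ++n)
        for (int i = 0; i < 8; ++i)
            acc[i] = _mm256_fmadd_pd(a, acc[i], b);  // acc = a*acc + b
    __m256d sum = acc[0];
    for (int i = 1; i < 8; ++i)
        sum = _mm256_add_pd(sum, acc[i]);
    alignas(32) double out[4];
    _mm256_store_pd(out, sum);
    return out[0] + out[1] + out[2] + out[3];  // keep the result observable
}

Each iteration issues 8 FMAs (64 FLOPs), so timing a large iters and dividing gives a number close to the 16 FLOPs/clock figure above.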
There is a function call inside the timed region, callq 403c0b <_Z12do_timed_runRKmRd+0x1eb> (as well as the __kmpc_end_serialized_parallel stuff).
There's no symbol associated with that call target, so I guess you didn't compile with debug info enabled. (That's separate from optimization level; e.g. gcc -g -O3 -march=native -fopenmp should run the same asm, just with more debug metadata.) Even a function invented by OpenMP should have a symbol name associated at some point.
As far as benchmark validity goes, a good litmus test is whether it scales reasonably with problem size. As you cross boundaries such as the L3 cache size with smaller or larger problems, the time should change in some reasonable way. If it doesn't, you'd worry about the computation being optimized away, or about clock-speed warm-up effects (see Idiomatic way of performance evaluation? for that and more, like page faults).
- Why are there non-conditional jumps in code (at 403ad3, 403b53, 403d78 and 403d8f)?
Once you're already in an if block, you unconditionally know the else block should not run, so you jmp over it instead of using jcc (even if FLAGS were still set so you didn't have to test the condition again). Or you put one or the other block out-of-line (like at the end of the function, or before the entry point) and jcc to it, then it jmps back to after the other side. That allows the fast path to be contiguous with no taken branches.
- Why are there 3 retq instances in the same function with only one return path (at 403c0a, 403ca4 and 403d26)?
Duplicate ret comes from the "tail duplication" optimization, where multiple paths of execution that all return can each get their own ret instead of jumping to a shared ret. (Along with copies of any cleanup necessary, like restoring registers and the stack pointer.)
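As a hedged illustration (not the asker's code), a function like this commonly compiles to both patterns at once: an unconditional jmp over the untaken block, and a duplicated ret on each return path.

// Hypothetical example. Each branch can return directly, so the compiler
// may "tail-duplicate" the return (one ret per path) rather than jumping
// to a single shared ret; likewise the if/else structure compiles to a
// conditional jump into one block plus an unconditional jmp over the other.
int sign(int x) {
    if (x < 0)
        return -1;   // path 1: may get its own ret
    else if (x > 0)
        return 1;    // path 2: may get its own ret
    return 0;        // path 3: may get its own ret
}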
QUESTION
If I compile this code with GCC or Clang and enable -O2 optimizations, I still get some global object initialization. Is it even possible for any code to reach these variables?
ANSWER
Answered 2022-Mar-18 at 06:44
Compiling that code with short string optimization (SSO) may be the equivalent of taking the address of a std::string member variable. The constructor has to analyze the string length at compile time and decide whether it can fit into the internal storage of the std::string object or whether it has to allocate memory dynamically, but may then find that the string is never read, so the allocation code can be optimized out.
The lack of optimization in this case might be an optimization flaw limited to simple outlying examples like this one:
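(The original example isn't reproduced in this excerpt; as a rough sketch of the kind of code under discussion, with hypothetical names:)

#include <string>

// A short literal fits in std::string's internal (SSO) buffer, so no heap
// allocation is needed and the unused global can be optimized away entirely;
// a long literal forces a dynamic allocation, which compilers are less
// willing to elide for an object with static storage duration.
namespace {
    const std::string short_unused = "hi";                                    // SSO: no allocation
    const std::string long_unused  = "long enough to defeat the SSO buffer";  // heap allocation
}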
QUESTION
I have implemented a Convolutional Neural Network in C and have been studying what parts of it have the longest latency.
Based on my research, the massive amount of matrix multiplication required by CNNs makes running them on CPUs and even GPUs very inefficient. However, when I actually profiled my code (on an unoptimized build) I found out that something other than the multiplication itself was the bottleneck of the implementation.
After turning on optimization (-O3 -march=native -ffast-math, gcc cross compiler), the Gprof result was the following:
Clearly, the convolution2D function takes the largest amount of time to run, followed by the batch normalization and depthwise convolution functions.
The convolution function in question looks like this:
...ANSWER
Answered 2022-Mar-10 at 13:57
Looking at the result of Cachegrind, it doesn't look like memory is your bottleneck. The NN has to be stored in memory anyway, and if it were so large that your program suffered a lot of L1 cache misses, it would be worth trying to minimize them; but a 1.7% L1 (data) miss rate is not a problem.
So you're trying to make this run fast anyway. Looking at your code, what's happening in the innermost loop is very simple (load -> multiply -> add -> store), and it has no side effect other than the final store. This kind of code is easily parallelizable, for example by multithreading or vectorizing. You likely know how to make this run in multiple threads, and you asked in the comments how to manually vectorize the code.
I will explain that part, but one thing to bear in mind is that once you choose to manually vectorize the code, it will often be tied to certain CPU architectures. Let's not consider non-AMD64-compatible CPUs like ARM. You still have MMX, SSE, AVX, and AVX512 to choose from as an extension for vectorized computation, and each extension has multiple versions. If you want maximum portability, SSE2 is a reasonable choice: it appeared with the Pentium 4, and it supports 128-bit vectors. For this post I'll use AVX2, which supports 128-bit and 256-bit vectors. It runs fine on your CPU, and has reasonable portability these days, being supported since Haswell (2013) and Excavator (2015).
The pattern you're using in the inner loop is called FMA (fused multiply and add). AVX2 has an instruction for this. Have a look at this function and the compiled output.
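As a sketch of that idea (assuming float data; the loop shape and names are hypothetical, not the asker's exact code), the scalar multiply-accumulate becomes one vfmadd per 8 elements:

#include <immintrin.h>
#include <cstddef>

// Multiply-accumulate over a row: out[i] += in[i] * w, 8 floats at a time
// with one fused multiply-add per vector, plus a scalar tail loop.
// Build with something like: g++ -O2 -mavx2 -mfma
void fma_row(float* out, const float* in, float w, std::size_t n) {
    const __m256 vw = _mm256_set1_ps(w);
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 vo = _mm256_loadu_ps(out + i);
        __m256 vi = _mm256_loadu_ps(in + i);
        vo = _mm256_fmadd_ps(vi, vw, vo);   // vo = vi * vw + vo
        _mm256_storeu_ps(out + i, vo);
    }
    for (; i < n; ++i)
        out[i] += in[i] * w;                // leftover elements
}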
QUESTION
I wanted to know how methods are implemented in C++ "under the hood". So I made a simple C++ program with a class that has one non-static field and one non-static, non-virtual method.
Then I instantiated the class in the main function and called the method. I used objdump with the -d option in order to see the CPU instructions of this program. I have an x86-64 processor.
Here's the code:
ANSWER
Answered 2022-Mar-02 at 06:25
I think what you are looking for are these instructions:
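(The referenced disassembly isn't shown in this excerpt. Conceptually, a non-static, non-virtual method lowers to an ordinary function taking the object address as a hidden first argument; a sketch, not the asker's exact program:)

struct Widget {
    int field;
    int get() const { return field; }   // compiled roughly like get_impl below
};

// What the method amounts to after lowering: `this` is passed as an
// ordinary first argument (in rdi under the x86-64 SysV ABI).
static int get_impl(const Widget* self) { return self->field; }

int main() {
    Widget w{42};
    return w.get() + get_impl(&w);   // both calls do the same work
}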
QUESTION
I am in the process of creating a fiber threading system in C, following https://graphitemaster.github.io/fibers/ . I have a function to set and restore context, and what I am trying to accomplish is launching a function as a fiber with its own stack. Linux, x86_64 SysV ABI.
...ANSWER
Answered 2022-Feb-25 at 05:34
Agree with comments: your stack alignment is incorrect.
It is true that the stack must be aligned to 16 bytes. However, the question is when? The normal rule is that the stack pointer must be a multiple of 16 at the site of a call instruction that calls an ABI-compliant function.
Well, you don't use a call instruction, but what that really means is that on entry to an ABI-compliant function, the stack pointer must be 8 less than a multiple of 16, or in other words an odd multiple of 8, since it assumes it was called with a call instruction that pushed an 8-byte return address. That is just the opposite of what your code does, so the stack is misaligned for the rest of your program, which makes printf crash when it tries to use aligned move instructions.
You could subtract 8 from the sp computed in your C code.
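A sketch of that fix in C-style code (prepare_fiber_sp is a hypothetical helper, not from the question):

#include <cstdint>
#include <cstddef>

// Compute the initial stack pointer for a fiber: align the top of the
// stack down to 16 bytes, then subtract 8 so the fiber's entry function
// sees the stack exactly as if a call had just pushed a return address.
std::uintptr_t prepare_fiber_sp(void* stack_base, std::size_t stack_size) {
    std::uintptr_t sp = reinterpret_cast<std::uintptr_t>(stack_base) + stack_size;
    sp &= ~static_cast<std::uintptr_t>(15);  // 16-byte alignment
    sp -= 8;                                 // fake the pushed return address
    return sp;
}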
Or, I'm not really sure why you go to the trouble of loading the destination address into a register, then pushing and ret, when an indirect jump or call would do. (Unless you are deliberately trying to fool the indirect branch predictor?) An indirect call will also kill the stack-alignment bird, by pushing the return address (even though it will never be used). So you could leave the rest of your code alone, and replace all the r8/ret stuff in restore_context with just
QUESTION
In the following pseudocode description of the Intel loop instruction, when the operand size is 16, the description appears to omit use of the DEST branch-target operand in the taken case:
ANSWER
Answered 2022-Feb-18 at 03:20
Yeah, that looks like a bug. The loop instruction does jump, not just truncate EIP, in 16-bit mode just like in other modes.
(R/E)IP < CS.Base also looks like a bug; the linear address is formed by adding EIP to CS.Base, i.e. valid EIP values are from 0 to CS.Limit, unsigned, regardless of a non-zero CS base.
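A C-style model of what the corrected 16-bit taken case should say (a sketch of the semantics, not Intel's actual pseudocode):

#include <cstdint>

std::uint16_t ip;  // instruction pointer with 16-bit operand size
std::uint16_t cx;  // loop counter (CX here; the address size picks CX/ECX/RCX)

// The counter is decremented, and if it is non-zero the branch target
// DEST is actually assigned to IP (truncated to 16 bits), not merely
// IP truncated in place as the manual's pseudocode suggests.
void loop_insn(std::uint16_t dest) {
    cx -= 1;
    if (cx != 0)
        ip = dest;   // taken: IP <- DEST mod 2^16
    // otherwise: fall through to the next instruction
}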
I think Intel's forums work as a way to report bugs in manuals / guides, but it's not obvious which section to report in.
https://community.intel.com/t5/Intel-ISA-Extensions/bd-p/isa-extensions has some posts with bug reports for the intrinsics guide, which got the attention of Intel people who could do something about it.
Also possibly https://community.intel.com/t5/Software-Development-Topics/ct-p/software-dev-topics or some other sub-forum of the "software developer" forums. The "CPU" forums seem to be about people using CPUs, like motherboard / RAM compatibility and such.
QUESTION
I made a bubble sort implementation in C, and was testing its performance when I noticed that the -O3 flag made it run even slower than no flags at all! Meanwhile -O2 made it run a lot faster, as expected.
Without optimisations:
...ANSWER
Answered 2021-Oct-27 at 19:53
It looks like GCC's naïveté about store-forwarding stalls is hurting its auto-vectorization strategy here. See also Store forwarding by example for some practical benchmarks on Intel with hardware performance counters, and What are the costs of failed store-to-load forwarding on x86? Also Agner Fog's x86 optimization guides.
(gcc -O3 enables -ftree-vectorize and a few other options not included by -O2, e.g. if-conversion to branchless cmov, which is another way -O3 can hurt with data patterns GCC didn't expect. By comparison, Clang enables auto-vectorization even at -O2, although some of its optimizations are still only on at -O3.)
It's doing 64-bit loads (and branching to store or not) on pairs of ints. This means, if we swapped the last iteration, this load comes half from that store, half from fresh memory, so we get a store-forwarding stall after every swap. But bubble sort often has long chains of swapping every iteration as an element bubbles far, so this is really bad.
(Bubble sort is bad in general, especially if implemented naively without keeping the previous iteration's second element around in a register. It can be interesting to analyze the asm details of exactly why it sucks, so it is fair enough for wanting to try.)
Anyway, this is pretty clearly an anti-optimization you should report on GCC Bugzilla with the "missed-optimization" keyword. Scalar loads are cheap, and store-forwarding stalls are costly. (Can modern x86 implementations store-forward from more than one prior store? no, nor can microarchitectures other than in-order Atom efficiently load when it partially overlaps with one previous store, and partially from data that has to come from the L1d cache.)
Even better would be to keep buf[x+1] in a register and use it as buf[x] in the next iteration, avoiding a store and load. (Like good hand-written asm bubble sort examples, a few of which exist on Stack Overflow.) A sketch of one pass written that way follows below.
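Here is that idea in hypothetical C-style code (not the asker's program): the bubbling element stays in cur, so each iteration loads buf[x+1] once and never re-reads a value it just stored.

#include <cstddef>

// One bubble-sort pass keeping the "bubbling" element in a register:
// each iteration does one load (buf[x+1]) and one store (buf[x]) and
// never reloads the location it stored on the previous iteration, so
// there is no store-forwarding dependency through memory.
void bubble_pass(int* buf, std::size_t n) {
    if (n == 0) return;
    int cur = buf[0];
    for (std::size_t x = 0; x + 1 < n; ++x) {
        int next = buf[x + 1];
        if (cur > next) {
            buf[x] = next;   // swap: smaller element settles here, cur keeps bubbling
        } else {
            buf[x] = cur;    // already in order: drop cur here and pick up next
            cur = next;
        }
    }
    buf[n - 1] = cur;        // the largest element seen ends up at the top
}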
If it wasn't for the store-forwarding stalls (which AFAIK GCC doesn't know about in its cost model), this strategy might be about break-even. SSE 4.1 for a branchless pminsd / pmaxsd comparator might be interesting, but that would mean always storing, and the C source doesn't do that.
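For reference, that branchless comparator would look something like this (a sketch, assuming int elements and a build with -msse4.1):

#include <smmintrin.h>  // SSE4.1

// Branchless compare-exchange of two vectors of four ints: pminsd/pmaxsd
// produce the element-wise sorted pair with no conditional branches,
// at the cost of unconditionally storing both results.
void minmax4(int* a, int* b) {
    __m128i va = _mm_loadu_si128(reinterpret_cast<__m128i*>(a));
    __m128i vb = _mm_loadu_si128(reinterpret_cast<__m128i*>(b));
    _mm_storeu_si128(reinterpret_cast<__m128i*>(a), _mm_min_epi32(va, vb)); // pminsd
    _mm_storeu_si128(reinterpret_cast<__m128i*>(b), _mm_max_epi32(va, vb)); // pmaxsd
}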
If this strategy of double-width load had any merit, it would be better implemented with pure integer on a 64-bit machine like x86-64, where you can operate on just the low 32 bits with garbage (or valuable data) in the upper half. E.g.,
QUESTION
I know that it's way easier to ensure single instances from the class level, and that there's the excellent Staticish module from Jonathan Stowe that does the same by using roles, but I just want to try and understand a bit better how the higher-order workings of classes can be handled, mainly for a FOSDEM talk. I could think of several ways of doing it at the metamodel level, but eventually this is what I came up with:
ANSWER
Answered 2022-Jan-16 at 16:02
There are a few misunderstandings in this attempt.
- There is one instance of the meta-class per type. Thus if we want to allow a given type to only be instantiated once, the correct scoping is an attribute in the meta-class, not a my. A my would mean there's one global object no matter which type we create.
- The compose method, when subclassing ClassHOW, should always call back up to the base compose method (which can be done using callsame). Otherwise, the class will not be composed.
- The method_table method returns the table of methods for this exact type. However, most classes won't have a new method. Rather, they will inherit the default one. If we wrap that, however, we'd be having a very global effect.
While new is relatively common to override to change the interface to construction, the bless method - which new calls after doing any mapping work - is not something we'd expect language users to be overriding. So one way we could proceed is to just try installing a bless method that does the required logic. (We could also work with new, but really we'd need to check if there was one in this class, wrap it if so, and add a copy of the default one that we then wrap if not, which is a bit more effort.)
Here's a solution that works:
QUESTION
I'm lost... I wanted to play around with the compiler explorer to experiment with multithreaded C code, and started with a simple piece of code. The code is compiled with -O3.
ANSWER
Answered 2021-Dec-28 at 12:48
It's because of the following rule:
[intro.progress]
The implementation may assume that any thread will eventually do one of the following:
- terminate,
- make a call to a library I/O function,
- perform an access through a volatile glvalue, or
- perform a synchronization operation or an atomic operation.
The compiler was able to prove that a program that enters the loop will never do any of the listed things and thus it is allowed to assume that the loop will never be entered.
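A hedged reconstruction of the kind of code that triggers this (not the asker's exact snippet):

// The loop below does no I/O, touches no volatile, and performs no atomic
// or synchronization operation; if entered with stop == false it would
// never terminate. [intro.progress] lets the compiler assume that cannot
// happen, so it may compile the function as if the loop were never entered.
void wait_for(bool stop) {       // plain bool, not std::atomic<bool>
    while (!stop) {
        // empty: none of the allowed "progress" operations happen here
    }
}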
Community Discussions and Code Snippets contain sources that include the Stack Exchange Network.
Vulnerabilities
No vulnerabilities reported