rip | Yes, I know sed can do | Regex library

by icholy | Go | Version: v0.0.2 | License: No License

kandi X-RAY | rip Summary

rip is a Go library typically used in Utilities and Regex applications. rip has no bugs, no vulnerabilities, and low support. You can download it from GitHub.

Extract data from input using regular expressions.

            kandi-support Support

rip has a low active ecosystem.
It has 139 stars and 6 forks. There are 3 watchers for this library.
It had no major release in the last 12 months.
There are 0 open issues and 3 have been closed. On average, issues are closed in 8 days. There are no pull requests.
It has a neutral sentiment in the developer community.
The latest version of rip is v0.0.2.

            kandi-Quality Quality

              rip has 0 bugs and 0 code smells.

            kandi-Security Security

              rip has no vulnerabilities reported, and its dependent libraries have no vulnerabilities reported.
              rip code analysis shows 0 unresolved vulnerabilities.
              There are 0 security hotspots that need review.

            kandi-License License

              rip does not have a standard license declared.
              Check the repository for any license declaration and review the terms closely.
              Without a license, all rights are reserved, and you cannot use the library in your applications.

            kandi-Reuse Reuse

              rip releases are available to install and integrate.
              Installation instructions are not available. Examples and code snippets are available.


            rip Key Features

            No Key Features are available at this moment for rip.

            rip Examples and Code Snippets

            No Code Snippets are available at this moment for rip.

            Community Discussions

            QUESTION

            Powershell script to get the metadata field "writing application"
            Asked 2022-Apr-05 at 13:21

I am using a modified version of the GetMetaData script originally written by Ed Wilson at Microsoft (https://devblogs.microsoft.com/scripting/hey-scripting-guy-how-can-i-find-files-metadata/) and then modified by user wOxxOm here: https://stackoverflow.com/a/42933461/5061596. I'm trying to analyze all my DVD and BluRay rips and see what tool was used to create them. Mainly I want to check which ones I compressed with Handbrake and which ones came directly from MakeMKV. The problem is I can't find this field.

If I use the "stock" script and change the number of properties it looks for from 0 - 266 up to 0 - 330, I find the extra file info like movie length, resolution, etc. But I can't find the tool used. For example, here is what the MediaInfo Lite tool reports:

            But looking through the meta data I get something like this with no "Writing application" property:

            ...

            ANSWER

            Answered 2022-Apr-05 at 13:21

Edit: actually, this seems more reliable. So far, any file that mediainfo can read, this also works with.

            Source https://stackoverflow.com/questions/71721200

            QUESTION

            Faulty benchmark, puzzling assembly
            Asked 2022-Mar-28 at 07:40

            Assembly novice here. I've written a benchmark to measure the floating-point performance of a machine in computing a transposed matrix-tensor product.

            Given my machine with 32GiB RAM (bandwidth ~37GiB/s) and Intel(R) Core(TM) i5-8400 CPU @ 2.80GHz (Turbo 4.0GHz) processor, I estimate the maximum performance (with pipelining and data in registers) to be 6 cores x 4.0GHz = 24GFLOP/s. However, when I run my benchmark, I am measuring 127GFLOP/s, which is obviously a wrong measurement.

            Note: in order to measure the FP performance, I am measuring the op-count: n*n*n*n*6 (n^3 for matrix-matrix multiplication, performed on n slices of complex data-points i.e. assuming 6 FLOPs for 1 complex-complex multiplication) and dividing it by the average time taken for each run.

            Code snippet in main function:

            ...

            ANSWER

            Answered 2022-Mar-25 at 19:33

1 FP operation per core clock cycle would be pathetic for a modern superscalar CPU. Your Skylake-derived CPU can actually do 2x 4-wide SIMD double-precision FMA operations per core per clock, and each FMA counts as two FLOPs, so the theoretical max is 16 double-precision FLOPs per core clock, giving 24 * 16 = 384 GFLOP/s. (Using vectors of 4 doubles, i.e. 256-bit wide AVX.) See FLOPS per cycle for sandy-bridge and haswell SSE2/AVX/AVX2
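To make the arithmetic explicit, the answer's numbers can be checked with a tiny self-contained C++ snippet (a sketch; the constants are taken directly from the text above):

    // Peak double-precision throughput of the i5-8400, per the answer's numbers.
    int main() {
        constexpr double fma_units_per_core = 2;   // two 256-bit FMA pipes per core
        constexpr double doubles_per_vector = 4;   // a 256-bit AVX vector holds 4 doubles
        constexpr double flops_per_fma      = 2;   // each FMA counts as a multiply plus an add
        constexpr double turbo_ghz          = 4.0; // clock, in billions of cycles per second
        constexpr double cores              = 6;

        // 2 * 4 * 2 = 16 FLOPs per core per clock; 16 * 4.0 GHz * 6 cores = 384 GFLOP/s.
        constexpr double peak_gflops =
            fma_units_per_core * doubles_per_vector * flops_per_fma * turbo_ghz * cores;
        static_assert(peak_gflops == 384.0, "matches the answer's 24 * 16 = 384");
        return 0;
    }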

There is a function call inside the timed region, callq 403c0b <_Z12do_timed_runRKmRd+0x1eb> (as well as the __kmpc_end_serialized_parallel stuff).

            There's no symbol associated with that call target, so I guess you didn't compile with debug info enabled. (That's separate from optimization level, e.g. gcc -g -O3 -march=native -fopenmp should run the same asm, just have more debug metadata.) Even a function invented by OpenMP should have a symbol name associated at some point.

As far as benchmark validity goes, a good litmus test is whether it scales reasonably with problem size. Whether or not you exceed the L3 cache size with a smaller or larger problem, the time should change in some reasonable way. If it doesn't, you'd worry about the work being optimized away, or about clock-speed warm-up effects (see Idiomatic way of performance evaluation? for that and more, like page faults).

            1. Why are there non-conditional jumps in code (at 403ad3, 403b53, 403d78 and 403d8f)?

            Once you're already in an if block, you unconditionally know the else block should not run, so you jmp over it instead of jcc (even if FLAGS were still set so you didn't have to test the condition again). Or you put one or the other block out-of-line (like at the end of the function, or before the entry point) and jcc to it, then it jmps back to after the other side. That allows the fast path to be contiguous with no taken branches.
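As a rough sketch of that layout (a hypothetical function; the typical x86 shape is shown in the comments):

    int f(int x) {
        if (x > 0)       // cmp edi, 0 / jle .Lelse -- conditional branch guards the if-body
            x += 1;      // the fast path falls straight through
                         // jmp .Ldone              -- unconditional jump over the else-body
        else
            x -= 1;      // .Lelse: the else-body
        return x;        // .Ldone: ret
    }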

2. Why are there 3 retq instances in the same function with only one return path (at 403c0a, 403ca4 and 403d26)?

            Duplicate ret comes from "tail duplication" optimization, where multiple paths of execution that all return can just get their own ret instead of jumping to a ret. (And copies of any cleanup necessary, like restoring regs and stack pointer.)

            Source https://stackoverflow.com/questions/71618068

            QUESTION

            Why can compiler not optimize out unused static std::string?
            Asked 2022-Mar-18 at 06:44

            If I compile this code with GCC or Clang and enable -O2 optimizations, I still get some global object initialization. Is it even possible for any code to reach these variables?

            ...

            ANSWER

            Answered 2022-Mar-18 at 06:44

With short string optimization (SSO), constructing that string is roughly equivalent to taking the address of a std::string member variable. The constructor has to determine the string's length and decide whether it fits into the std::string object's internal storage or must be allocated dynamically; to remove that allocation, the optimizer would then have to prove the string is never read.
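The question's code is not reproduced on this page; a minimal sketch of the kind of program under discussion (an assumption, not the asker's exact code) would be:

    #include <string>

    // Never read anywhere, yet its constructor still runs during static
    // initialization; a literal too long for the small-string buffer forces
    // a dynamic allocation, which (per the question) is not optimized out
    // even at -O2.
    static const std::string unused = "long enough to defeat the small-string buffer";

    int main() { return 0; }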

The lack of optimization in this case might be a flaw limited to simple, outlying examples like this one:

            Source https://stackoverflow.com/questions/71445432

            QUESTION

            Convolution Function Latency Bottleneck
            Asked 2022-Mar-10 at 13:57

            I have implemented a Convolutional Neural Network in C and have been studying what parts of it have the longest latency.

Based on my research, the massive amount of matrix multiplication required by CNNs makes running them on CPUs and even GPUs very inefficient. However, when I actually profiled my code (on an unoptimized build) I found out that something other than the multiplication itself was the bottleneck of the implementation.

            After turning on optimization (-O3 -march=native -ffast-math, gcc cross compiler), the Gprof result was the following:

            Clearly, the convolution2D function takes the largest amount of time to run, followed by the batch normalization and depthwise convolution functions.

            The convolution function in question looks like this:

            ...

            ANSWER

            Answered 2022-Mar-10 at 13:57

Looking at the result of Cachegrind, it doesn't look like memory is your bottleneck. The NN has to be stored in memory anyway, but if it were so large that your program had a lot of L1 cache misses, it would be worth trying to minimize them; a 1.7% L1 (data) miss rate, however, is not a problem.

So you're trying to make this run fast anyway. Looking at your code, what happens in the innermost loop is very simple (load → multiply → add → store), and it has no side effect other than the final store. This kind of code is easily parallelizable, for example by multithreading or vectorizing. I think you'll know how to make this run in multiple threads, seeing that you can write code with some complexity, and you asked in the comments how to manually vectorize the code.

            I will explain that part, but one thing to bear in mind is that once you choose to manually vectorize the code, it will often be tied to certain CPU architectures. Let's not consider non-AMD64 compatible CPUs like ARM. Still, you have the option of MMX, SSE, AVX, and AVX512 to choose as an extension for vectorized computation, and each extension has multiple versions. If you want maximum portability, SSE2 is a reasonable choice. SSE2 appeared with Pentium 4, and it supports 128-bit vectors. For this post I'll use AVX2, which supports 128-bit and 256-bit vectors. It runs fine on your CPU, and has reasonable portability these days, supported from Haswell (2013) and Excavator (2015).

            The pattern you're using in the inner loop is called FMA (fused multiply and add). AVX2 has an instruction for this. Have a look at this function and the compiled output.
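The function the answer refers to is not reproduced on this page; as a stand-in, here is a minimal sketch of the load → FMA → store pattern with AVX2/FMA intrinsics (illustrative only; compile with -mavx2 -mfma, and n is assumed to be a multiple of 8):

    #include <immintrin.h>

    void fma_loop(float *out, const float *a, const float *b, int n) {
        for (int i = 0; i < n; i += 8) {
            __m256 va  = _mm256_loadu_ps(a + i);    // load 8 floats
            __m256 vb  = _mm256_loadu_ps(b + i);    // load 8 floats
            __m256 acc = _mm256_loadu_ps(out + i);  // load the running sum
            acc = _mm256_fmadd_ps(va, vb, acc);     // acc = va * vb + acc in one instruction
            _mm256_storeu_ps(out + i, acc);         // store the result
        }
    }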

            Source https://stackoverflow.com/questions/71401876

            QUESTION

            How are non-static, non-virtual methods implemented in C++?
            Asked 2022-Mar-05 at 02:56

I wanted to know how methods are implemented in C++ "under the hood". So I made a simple C++ program which has a class with one non-static field and one non-static, non-virtual method.

Then I instantiated the class in the main function and called the method. I used the objdump -d option in order to see the CPU instructions of this program. I have an x86-64 processor. Here's the code:

            ...

            ANSWER

            Answered 2022-Mar-02 at 06:25

            I think what you are looking for are these instructions:
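Those instructions are not reproduced on this page, but the general lowering is easy to sketch (a hedged illustration, not the asker's code): a non-static, non-virtual method compiles to an ordinary function that receives the object's address as a hidden first argument (in RDI under the x86-64 SysV ABI).

    struct Counter {
        int value;
        int get() { return value; }   // compiles roughly like the free function below
    };

    // Conceptually what the compiler emits: 'this' becomes an explicit pointer.
    int Counter_get(Counter *self) { return self->value; }

    int main() {
        Counter c{42};
        return c.get() - Counter_get(&c);  // both load c.value the same way; returns 0
    }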

            Source https://stackoverflow.com/questions/71318184

            QUESTION

            C fibers crashing on printf
            Asked 2022-Feb-25 at 06:52

I am in the process of creating a fiber threading system in C, following https://graphitemaster.github.io/fibers/ . I have a function to set and restore context, and what I am trying to accomplish is launching a function as a fiber with its own stack. Linux, x86_64 SysV ABI.

            ...

            ANSWER

            Answered 2022-Feb-25 at 05:34

            Agree with comments: your stack alignment is incorrect.

            It is true that the stack must be aligned to 16 bytes. However, the question is when? The normal rule is that the stack pointer must be a multiple of 16 at the site of a call instruction that calls an ABI-compliant function.

            Well, you don't use a call instruction, but what that really means is that on entry to an ABI-compliant function, the stack pointer must be 8 less than a multiple of 16, or in other words an odd multiple of 8, since it assumes it was called with a call instruction that pushed an 8-byte return address. That is just the opposite of what your code does, and so the stack is misaligned for the rest of your program, which makes printf crash when it tries to use aligned move instructions.

            You could subtract 8 from the sp computed in your C code.
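A sketch of that suggestion (the function and names here are hypothetical, not from the asker's code): align the fiber's stack down to 16 bytes, then subtract 8 so the entry function sees RSP exactly as if it had been reached by a call instruction.

    #include <cstdint>

    void *prepare_fiber_sp(void *stack_top) {
        uintptr_t sp = reinterpret_cast<uintptr_t>(stack_top);
        sp &= ~static_cast<uintptr_t>(15); // round down to a 16-byte boundary
        sp -= 8;                           // mimic the 8-byte return address a call would push
        return reinterpret_cast<void *>(sp);
    }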

            Or, I'm not really sure why you go to the trouble of loading the destination address into a register, then pushing and ret, when an indirect jump or call would do. (Unless you are deliberately trying to fool the indirect branch predictor?) An indirect call will also kill the stack-alignment bird, by pushing the return address (even though it will never be used). So you could leave the rest of your code alone, and replace all the r8/ret stuff in restore_context with just

            Source https://stackoverflow.com/questions/71259613

            QUESTION

            Is there a typo/bug in the documentation of the loop instruction?
            Asked 2022-Feb-18 at 03:20

            In the following pseudo code description of the Intel loop instruction, when the operand size is 16, this description appears to omit use of the DEST branch-target operand in the taken case:

            ...

            ANSWER

            Answered 2022-Feb-18 at 03:20

Yeah, that looks like a bug. The loop instruction does jump, not just truncate EIP, in 16-bit mode just like in other modes.

            (R/E)IP < CS.Base also looks like a bug; the linear address is formed by adding EIP to CS.Base. i.e. valid EIP values are from 0 to CS.Limit, unsigned, regardless of non-zero CS base.

            I think Intel's forums work as a way to report bugs in manuals / guides, but it's not obvious which section to report in.

            https://community.intel.com/t5/Intel-ISA-Extensions/bd-p/isa-extensions has some posts with bug reports for the intrinsics guide, which got the attention of Intel people who could do something about it.

Also possibly https://community.intel.com/t5/Software-Development-Topics/ct-p/software-dev-topics or some other sub-forum of the "software developer" forums. The "cpu" forums seem to be about people using CPUs, like motherboard / RAM compatibility and such.

            Source https://stackoverflow.com/questions/71164945

            QUESTION

            Bubble sort slower with -O3 than -O2 with GCC
            Asked 2022-Jan-21 at 02:41

            I made a bubble sort implementation in C, and was testing its performance when I noticed that the -O3 flag made it run even slower than no flags at all! Meanwhile -O2 was making it run a lot faster as expected.

            Without optimisations:

            ...

            ANSWER

            Answered 2021-Oct-27 at 19:53

            It looks like GCC's naïveté about store-forwarding stalls is hurting its auto-vectorization strategy here. See also Store forwarding by example for some practical benchmarks on Intel with hardware performance counters, and What are the costs of failed store-to-load forwarding on x86? Also Agner Fog's x86 optimization guides.

            (gcc -O3 enables -ftree-vectorize and a few other options not included by -O2, e.g. if-conversion to branchless cmov, which is another way -O3 can hurt with data patterns GCC didn't expect. By comparison, Clang enables auto-vectorization even at -O2, although some of its optimizations are still only on at -O3.)

It's doing 64-bit loads (and branching to store or not) on pairs of ints. This means that if we swapped on the last iteration, this load comes half from that store and half from fresh memory, so we get a store-forwarding stall after every swap. But bubble sort often has long chains of swapping every iteration as an element bubbles far, so this is really bad.

            (Bubble sort is bad in general, especially if implemented naively without keeping the previous iteration's second element around in a register. It can be interesting to analyze the asm details of exactly why it sucks, so it is fair enough for wanting to try.)

            Anyway, this is pretty clearly an anti-optimization you should report on GCC Bugzilla with the "missed-optimization" keyword. Scalar loads are cheap, and store-forwarding stalls are costly. (Can modern x86 implementations store-forward from more than one prior store? no, nor can microarchitectures other than in-order Atom efficiently load when it partially overlaps with one previous store, and partially from data that has to come from the L1d cache.)

            Even better would be to keep buf[x+1] in a register and use it as buf[x] in the next iteration, avoiding a store and load. (Like good hand-written asm bubble sort examples, a few of which exist on Stack Overflow.)
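A sketch of that rewrite (one possible shape, not a quoted implementation): carry the element that is bubbling rightward in a local variable, so each iteration does a single load instead of re-reading a just-stored buf[x].

    // One bubble-sort pass over n ints; 'cur' stays in a register while it bubbles.
    void bubble_pass(int *buf, int n) {
        int cur = buf[0];
        for (int x = 0; x + 1 < n; ++x) {
            int next = buf[x + 1];   // the only load per element
            if (cur > next) {
                buf[x] = next;       // smaller element settles at x; cur keeps bubbling
            } else {
                buf[x] = cur;        // cur settles here; next starts bubbling
                cur = next;
            }
        }
        buf[n - 1] = cur;            // the largest element seen in this pass
    }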

If it wasn't for the store-forwarding stalls (which AFAIK GCC doesn't know about in its cost model), this strategy might be about break-even. SSE 4.1 for a branchless pminsd / pmaxsd comparator might be interesting, but that would mean always storing, and the C source doesn't do that.

            If this strategy of double-width load had any merit, it would be better implemented with pure integer on a 64-bit machine like x86-64, where you can operate on just the low 32 bits with garbage (or valuable data) in the upper half. E.g.,

            Source https://stackoverflow.com/questions/69503317

            QUESTION

            Enforcing single instances from the metamodel
            Asked 2022-Jan-16 at 16:02

I know that it's way easier to ensure single instances from the class level, and that there's the excellent Staticish module from Jonathan Stowe that does the same by using roles, but I just want to try to understand a bit better how the higher-order workings of classes can be handled, mainly for a FOSDEM talk. I could think of several ways of doing it at the metamodel level, but eventually this is what I came up with:

            ...

            ANSWER

            Answered 2022-Jan-16 at 16:02

There are a few misunderstandings in this attempt.

            1. There is one instance of a meta-class per type. Thus if we want to allow a given type to only be instantiated once, the correct scoping is an attribute in the meta-class, not a my. A my would mean there's one global object no matter which type we create.
            2. The compose method, when subclassing ClassHOW, should always call back up to the base compose method (which can be done using callsame). Otherwise, the class will not be composed.
3. The method_table method returns the table of methods for this exact type. However, most classes won't have a new method. Rather, they will inherit the default one. If we wrapped that, however, we'd have a very global effect.

            While new is relatively common to override to change the interface to construction, the bless method - which new calls after doing any mapping work - is not something we'd expect language users to be overriding. So one way we could proceed is to just try installing a bless method that does the required logic. (We could also work with new, but really we'd need to check if there was one in this class, wrap it if so, and add a copy of the default one that we then wrap if not, which is a bit more effort.)

            Here's a solution that works:

            Source https://stackoverflow.com/questions/70729700

            QUESTION

            Why is this optimized away by modern compilers for C++11 and higher
            Asked 2021-Dec-28 at 12:48

I'm lost... I wanted to play around with Compiler Explorer to experiment with multithreaded C code, and started with a simple piece of code. The code is compiled with -O3.

            ...

            ANSWER

            Answered 2021-Dec-28 at 12:48

It's because of the following rule:

            [intro.progress]

            The implementation may assume that any thread will eventually do one of the following:

            • terminate,
            • make a call to a library I/O function,
            • perform an access through a volatile glvalue, or
            • perform a synchronization operation or an atomic operation.

            The compiler was able to prove that a program that enters the loop will never do any of the listed things and thus it is allowed to assume that the loop will never be entered.
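The question's snippet is not shown on this page; a minimal example of the rule in action (an assumed shape, not the original code) would be:

    // At -O3, GCC and Clang may drop this loop entirely: it never terminates,
    // does no I/O, touches nothing volatile, and performs no atomic or
    // synchronization operation, so under [intro.progress] the implementation
    // may assume it is never entered.
    int main() {
        bool spin = true;  // ordinary non-volatile, non-atomic flag
        while (spin) { }   // no observable effect inside the loop
        return 0;
    }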

            Source https://stackoverflow.com/questions/70506857

Community Discussions and Code Snippets contain sources that include the Stack Exchange Network.

            Vulnerabilities

            No vulnerabilities reported

            Install rip

            You can download it from GitHub.

            Support

For any new features, suggestions, or bugs, create an issue on GitHub. If you have any questions, check and ask on the Stack Overflow community page.
            CLONE
          • HTTPS

            https://github.com/icholy/rip.git

          • CLI

            gh repo clone icholy/rip

• SSH

            git@github.com:icholy/rip.git



Consider Popular Regex Libraries

z by rupa
JSVerbalExpressions by VerbalExpressions
regexr by gskinner
path-to-regexp by pillarjs

Try Top Libraries by icholy

ttygif (C)
Duration.js (TypeScript)
gomajor (Go)
gist (Go)
recurring (Go)