edi | An X12 EDI parsing crate | Parser library

by sezna | Language: Rust | Version: Current | License: GPL-3.0

kandi X-RAY | edi Summary

edi is a Rust library typically used in utility and parser applications. It has no reported bugs or vulnerabilities, a strong copyleft license (GPL-3.0), and low support activity. You can download it from GitHub.

An X12 EDI parsing crate.

Support

edi has a low active ecosystem.

It has 9 star(s) with 5 fork(s). There are 2 watchers for this library.

It had no major release in the last 6 months.

edi has no issues reported. There are no pull requests.

It has a neutral sentiment in the developer community.

The latest version of edi is current.

Quality

edi has 0 bugs and 0 code smells.

Security

edi has no vulnerabilities reported, and its dependent libraries have no vulnerabilities reported.

edi code analysis shows 0 unresolved vulnerabilities.

There are 0 security hotspots that need review.

License

edi is licensed under the GPL-3.0 License. This license is Strong Copyleft.

Strong copyleft licenses require derivative works to be distributed under the same license, so they suit open source projects but constrain use in proprietary software.

Reuse

No packaged releases of edi are available. You will need to build and install it from source.

edi Key Features

No Key Features are available at this moment for edi.

edi Examples and Code Snippets

No Code Snippets are available at this moment for edi.

Community Discussions

Trending Discussions on edi

Why can compiler not optimize out unused static std::string?

Convolution Function Latency Bottleneck

Why is the XOR swap optimized into a normal swap using the MOV instruction?

Why does GCC allocate more stack memory than needed?

Databricks Pyspark - Group related rows

Why is this optimized away by modern compilers for C++11 and higher

Missed optimization with string_view::find_first_of

How does gdb start an assembly compiled program and step one line at a time?

Delphi records assignment

Assembly function address table and data under the function or in data section

QUESTION

Why can compiler not optimize out unused static std::string?

Asked 2022-Mar-18 at 06:44

If I compile this code with GCC or Clang and enable -O2 optimizations, I still get some global object initialization. Is it even possible for any code to reach these variables?

...

ANSWER

Answered 2022-Mar-18 at 06:44

Compiling that code with short string optimization (SSO) may be equivalent to taking the address of one of std::string's member variables. The constructor has to analyze the string length at compile time and decide whether the string fits into the internal storage of the std::string object or has to be allocated dynamically. Only after that can the compiler find that the string is never read, so that the allocation code can be optimized out.

The lack of optimization here might be a flaw limited to simple outlying examples such as this one:

Source https://stackoverflow.com/questions/71445432

QUESTION

Convolution Function Latency Bottleneck

Asked 2022-Mar-10 at 13:57

I have implemented a Convolutional Neural Network in C and have been studying what parts of it have the longest latency.

Based on my research, the massive amount of matrix multiplication required by CNNs makes running them on CPUs and even GPUs very inefficient. However, when I actually profiled my code (on an unoptimized build) I found that something other than the multiplication itself was the bottleneck of the implementation.

After turning on optimization (-O3 -march=native -ffast-math, gcc cross compiler), the Gprof result was the following:

Clearly, the convolution2D function takes the largest amount of time to run, followed by the batch normalization and depthwise convolution functions.

The convolution function in question looks like this:

...

ANSWER

Answered 2022-Mar-10 at 13:57

Looking at the Cachegrind results, memory doesn't look like your bottleneck. The NN has to be stored in memory anyway. If it were so large that your program had a lot of L1 cache misses, it would be worth trying to minimize them, but an L1 data miss rate of 1.7% is not a problem.

So you're trying to make this run fast anyway. Looking at your code, what's happening in the innermost loop is very simple (load -> multiply -> add -> store), and it has no side effects other than the final store. This kind of code is easily parallelizable, for example by multithreading or vectorizing. I think you'll know how to make this run in multiple threads, seeing that you can write code with some complexity, and you asked in the comments how to manually vectorize the code.

I will explain that part, but one thing to bear in mind is that once you choose to manually vectorize the code, it will often be tied to certain CPU architectures. Let's not consider non-AMD64 compatible CPUs like ARM. Still, you have the option of MMX, SSE, AVX, and AVX512 to choose as an extension for vectorized computation, and each extension has multiple versions. If you want maximum portability, SSE2 is a reasonable choice. SSE2 appeared with Pentium 4, and it supports 128-bit vectors. For this post I'll use AVX2, which supports 128-bit and 256-bit vectors. It runs fine on your CPU, and has reasonable portability these days, supported from Haswell (2013) and Excavator (2015).

The pattern you're using in the inner loop is called FMA (fused multiply-add). CPUs that support AVX2 almost always also support the separate FMA3 extension, which has an instruction for this. Have a look at this function and the compiled output.
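The structure of a manually vectorized multiply-add loop can be sketched in plain Python. This is only an illustration of the chunking, not the answer's code: `LANES`, `multiply_add`, and `dot` are hypothetical names, the eight lanes assume 32-bit floats in a 256-bit vector, and Python's `c + x * y` rounds twice, whereas a hardware FMA rounds once.

```python
LANES = 8  # assumption: a 256-bit vector holds eight 32-bit floats


def multiply_add(acc, a, b):
    """Lane-wise acc + a * b, standing in for one vector multiply-add."""
    return [c + x * y for c, x, y in zip(acc, a, b)]


def dot(xs, ws):
    """Dot product processed LANES elements at a time, with a scalar tail."""
    acc = [0.0] * LANES
    full = len(xs) - len(xs) % LANES
    for i in range(0, full, LANES):
        acc = multiply_add(acc, xs[i:i + LANES], ws[i:i + LANES])
    total = sum(acc)  # horizontal sum of the lanes
    for i in range(full, len(xs)):  # leftover elements that do not fill a vector
        total += xs[i] * ws[i]
    return total
```

Because the lanes accumulate separately, the additions happen in a different order than in a scalar loop, so floating-point results can differ in the last bits.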

Source https://stackoverflow.com/questions/71401876

QUESTION

Why is the XOR swap optimized into a normal swap using the MOV instruction?

Asked 2022-Mar-08 at 10:00

While testing things around Compiler Explorer, I tried out the following overflow-free function for calculating the average of 2 unsigned 32-bit integers:

...

ANSWER

Answered 2022-Mar-08 at 10:00

Clang does the same thing. Probably for compiler-construction and CPU architecture reasons:

Disentangling that logic into just a swap may allow better optimization in some cases; definitely something it makes sense for a compiler to do early so it can follow values through the swap.
Xor-swap is total garbage for swapping registers, the only advantage being that it doesn't need a temporary. But xchg reg,reg already does that better.

I'm not surprised that GCC's optimizer recognizes the xor-swap pattern and disentangles it to follow the original values. In general, this makes constant-propagation and value-range optimizations possible through swaps, especially for cases where the swap wasn't conditional on the values of the vars being swapped. This pattern-recognition probably happens soon after transforming the program logic to GIMPLE (SSA) representation, so at that point it will forget that the original source ever used an xor swap, and not think about emitting asm that way.
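A small Python sketch shows why a compiler is free to treat an xor-swap as a plain swap: the three XORs always leave the two values exchanged. The function below is an assumed shape for the kind of overflow-free average the question describes, not the question's actual code. `MASK`, `xor_swap`, and `average` are hypothetical names, and 32-bit registers are emulated by masking.

```python
MASK = 0xFFFFFFFF  # emulate 32-bit unsigned registers


def xor_swap(a, b):
    """Exchange two values using three XORs and no temporary."""
    a ^= b
    b ^= a
    a ^= b
    return a & MASK, b & MASK


def average(a, b):
    """Overflow-free average: order the inputs, then add half the gap."""
    if a > b:
        a, b = xor_swap(a, b)
    return (a + (b - a) // 2) & MASK
```

Replacing `xor_swap(a, b)` with `(b, a)` gives the same results for every input, which is the equivalence the optimizer exploits.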

Hopefully sometimes that lets it then optimize down to only a single mov, or two movs, depending on register allocation for the surrounding code (e.g. if one of the vars can move to a new register, instead of having to end up back in the original locations). And whether both variables are actually used later, or only one. Or if it can fully disentangle an unconditional swap, maybe no mov instructions.

But worst case, three mov instructions needing a temporary register is still better, unless it's running out of registers. I'd guess GCC is not smart enough to use xchg reg,reg instead of spilling something else or saving/restoring another tmp reg, so there might be corner cases where this optimization actually hurts.

(Apparently GCC -Os does have a peephole optimization to use xchg reg,reg instead of 3x mov: PR 92549 was fixed for GCC10. It looks for that quite late, during RTL -> assembly. And yes, it works here: turning your xor-swap into an xchg: https://godbolt.org/z/zs969xh47)

xor-swap has worse latency and defeats mov-elimination

with no memory reads, and the same number of instructions, I don't see any bad impacts and feels odd that it be changed. Clearly there is something I did not think through though, but what is it?

Instruction count is only a rough proxy for one of three things that are relevant for perf analysis: front-end uops, latency, and back-end execution ports. (And machine-code size in bytes: x86 machine-code instructions are variable-length.)

It's the same size in machine-code bytes, and same number of front-end uops, but the critical-path latency is worse: 3 cycles from input a to output a for xor-swap, and 2 from input b to output a, for example.

MOV-swap has at worst 1-cycle and 2-cycle latencies from inputs to outputs, or less with mov-elimination. (Which can also avoid using back-end execution ports, especially relevant for CPUs like IvyBridge and Tiger Lake with a front-end wider than the number of integer ALU ports. And Ice Lake, except Intel disabled mov-elimination on it as an erratum workaround; not sure if it's re-enabled for Tiger Lake or not.)

Also related:

Why is XCHG reg, reg a 3 micro-op instruction on modern Intel architectures? - and those 3 uops can't benefit from mov-elimination. But on modern AMD xchg reg,reg is only 2 uops.

If you're going to branch, just duplicate the averaging code

GCC's real missed optimization here (even with -O3) is that tail-duplication results in about the same static code size, just a couple extra bytes since these are mostly 2-byte instructions. The big win is that the a path then becomes the same length as the other, instead of twice as long to first do a swap and then run the same 3 uops for averaging.

update: GCC will do this for you with -ftracer (https://godbolt.org/z/es7a3bEPv), optimizing away the swap. (That's only enabled manually or as part of -fprofile-use, not at -O3, so it's probably not a good idea to use all the time without PGO, potentially bloating machine code in cold functions / code-paths.)

Doing it manually in the source (Godbolt):

Source https://stackoverflow.com/questions/71382441

QUESTION

Why does GCC allocate more stack memory than needed?

Asked 2022-Feb-03 at 08:12

I'm reading "Computer Systems: A Programmer's Perspective, 3/E" (CS:APP3e) and the following code is an example from the book:

...

ANSWER

Answered 2022-Feb-03 at 04:10

(This answer is a summary of comments posted above by Antti Haapala, klutt and Peter Cordes.)

GCC allocates more space than "necessary" in order to ensure that the stack is properly aligned for the call to proc: the stack pointer must be adjusted by a multiple of 16, plus 8 (i.e. by an odd multiple of 8). Why does the x86-64 / AMD64 System V ABI mandate a 16 byte stack alignment?
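The "multiple of 16, plus 8" rule can be checked with a little arithmetic. The sketch below assumes the function pushes no registers and that the stack pointer at entry is 8 bytes past a 16-byte boundary, because the call instruction has just pushed the return address. `frame_adjustment` and `rsp_after_prologue` are hypothetical helper names.

```python
ALIGN = 16       # System V x86-64 stack alignment required at each call
RETURN_ADDR = 8  # bytes pushed by the call instruction


def frame_adjustment(local_bytes):
    """Smallest adjustment >= local_bytes of the form 16 * k + 8."""
    return (local_bytes + 7) // ALIGN * ALIGN + RETURN_ADDR


def rsp_after_prologue(rsp_at_entry, local_bytes):
    """Stack pointer after reserving space for local_bytes of locals."""
    return rsp_at_entry - frame_adjustment(local_bytes)
```

For example, a function that needs 20 bytes of locals reserves 24, which brings the stack pointer back to a 16-byte boundary before it calls anything else.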

What's strange is that the code in the book doesn't do that; the code as shown would violate the ABI and, if proc actually relies on proper stack alignment (e.g. using aligned SSE2 instructions), it may crash.

So it appears that either the code in the book was incorrectly copied from compiler output, or else the authors of the book are using some unusual compiler flags which alter the ABI.

Modern GCC 11.2 emits nearly identical asm (Godbolt) using -Og -mpreferred-stack-boundary=3 -maccumulate-outgoing-args, the former of which changes the ABI to maintain only 2^3 byte stack alignment, down from the default 2^4. (Code compiled this way can't safely call anything compiled normally, even standard library functions.) -maccumulate-outgoing-args used to be the default in older GCC, but modern CPUs have a "stack engine" that makes push/pop single-uop so that option isn't the default anymore; push for stack args saves a bit of code size.

One difference from the book's asm is a movl $0, %eax before the call, because there's no prototype so the caller has to assume it might be variadic and pass AL = the number of FP args in XMM registers. (A prototype that matches the args passed would prevent that.) The other instructions are all the same, and in the same order as whatever older GCC version the book used, except for choice of registers after call proc returns: it ends up using movslq %edx, %rdx instead of cltq (sign-extend with RAX).

CS:APP 3e global edition is notorious for errors in practice problems introduced by the publisher (not the authors), but apparently this code is present in the North American edition, too. So this may be the author's mistake / choice to use actual compiler output with weird options. Unlike some of the bad global edition practice problems, this code could have come unmodified from some GCC version, but only with non-standard options.

Related: Why does GCC allocate more space than necessary on the stack, beyond what's needed for alignment? - GCC has a missed-optimization bug where it sometimes reserves an additional 16 bytes that it truly didn't need to. That's not what's happening here, though.

Source https://stackoverflow.com/questions/70953275

QUESTION

Databricks Pyspark - Group related rows

Asked 2022-Feb-01 at 13:55

I am parsing an EDI file in Azure Databricks. Rows in the input file are related to other rows based on the order in which they appear. What I need is a way to group related rows together.

...

ANSWER

Answered 2022-Feb-01 at 13:54

You can use conditional sum aggregation over a window ordered by sequence like this:
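The idea can be sketched in plain Python rather than PySpark, under the assumption that a particular row type marks the start of each group. The rows and the choice of the X12 `ST` segment as the start marker are hypothetical, since the question's data is not shown here.

```python
from itertools import accumulate

# Hypothetical rows in file order: (sequence number, segment id).
rows = [
    (1, "ST"), (2, "BEG"), (3, "PO1"), (4, "SE"),
    (5, "ST"), (6, "BEG"), (7, "SE"),
]


def assign_groups(rows, start_id="ST"):
    """Tag each row with a running count of the start rows seen so far."""
    ordered = sorted(rows)  # order by sequence, like the window's ordering
    flags = [1 if seg == start_id else 0 for _, seg in ordered]
    group_ids = accumulate(flags)  # conditional running sum
    return [(seq, seg, gid) for (seq, seg), gid in zip(ordered, group_ids)]


grouped = assign_groups(rows)
```

Every row between one start marker and the next receives the same group id. In PySpark the same running sum would typically be written with `sum(when(...))` over a `Window.orderBy(...)` specification.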

Source https://stackoverflow.com/questions/70941527

QUESTION

Why is this optimized away by modern compilers for C++11 and higher

Asked 2021-Dec-28 at 12:48

I'm lost.. I wanted to play around with the compiler explorer to experiment with multithreaded C code, and started with a simple piece of code. The code is compiled with -O3.

...

ANSWER

Answered 2021-Dec-28 at 12:48

It's because of the following rule:

[intro.progress]

The implementation may assume that any thread will eventually do one of the following:

terminate,

make a call to a library I/O function,

perform an access through a volatile glvalue, or

perform a synchronization operation or an atomic operation.

The compiler was able to prove that a program that enters the loop will never do any of the listed things and thus it is allowed to assume that the loop will never be entered.

Source https://stackoverflow.com/questions/70506857

QUESTION

Missed optimization with string_view::find_first_of

Asked 2021-Dec-22 at 07:51

Update: relevant GCC bug report: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103798

I tested the following code:
...

ANSWER

Answered 2021-Dec-21 at 11:08

libstdc++'s std::string_view::find_first_of looks something like:

Source https://stackoverflow.com/questions/70433152

QUESTION

How does gdb start an assembly compiled program and step one line at a time?

Asked 2021-Dec-15 at 11:03

Valgrind says the following on their documentation page

Your program is then run on a synthetic CPU provided by the Valgrind core

However, GDB doesn't seem to do that. It seems to launch a separate process which executes independently. There's also no C library from what I can tell. Here's what I did:

Compile using clang or gcc: gcc -g tiny.s -nostdlib (-g seems to be required)

gdb ./a.out

Write starti

Press s a bunch of times

You'll see it'll print out "Test1\n" without printing test2. You can also kill the process without terminating gdb. GDB will say "Program received signal SIGTERM, Terminated." and won't ever write Test2

How does gdb start the process and have it execute only one line at a time?
...

ANSWER

Answered 2021-Oct-30 at 20:48

starti implementation
As usual for a process that wants to start another process, it does a fork/exec, like a shell does. But in the new process, GDB doesn't just make an execve system call right away.

Instead, it calls ptrace(PTRACE_TRACEME) to wait for the parent process to attach to it, so GDB (the parent) is already attached before the child process makes an execve() system call to make this process start executing the specified executable file.

Also note in the execve(2) man page:

If the current program is being ptraced, a SIGTRAP signal is sent to it after a successful execve().

So that's how the kernel debugging API supports stopping before the first user-space instruction is executed in a newly-execed process. i.e. exactly what starti wants. This doesn't depend on setting a breakpoint; that can't happen until after execve anyway, and with ASLR the correct address isn't even known until after execve picks a base address. (GDB by default disables ASLR, but it still works if you tell it not to disable ASLR.)

This is also what GDB uses if you set breakpoints before run, manually, or by using start to set a one-time breakpoint on main. Before the starti command existed, a hack to emulate that functionality was to set an invalid breakpoint before run, so GDB would stop on that error, giving you control at that point.

If you strace -f -o gdb.trace gdb ./foo or something, you'll see some of what GDB does. (Nested tracing apparently doesn't work, so running GDB under strace means GDB's ptrace system call fails, but we can see what it does leading up to that.)

Source https://stackoverflow.com/questions/69762022

QUESTION

Delphi records assignment

Asked 2021-Nov-08 at 15:06

I've encountered some strange behaviour in the Delphi XE3 compiler (I compile for the x86 architecture).

Imagine I have a class with one field, a custom record with several fields of simple types:
...

ANSWER

Answered 2021-Nov-08 at 15:06

This is a bug that is still present in Delphi 11 (thanks to LU RD for confirming that). You should submit a bug report to Quality Portal.

In the meantime, I think that you can work around it by making the assignment in TPage rather than TParagraph. Like this:

Source https://stackoverflow.com/questions/69882932

QUESTION

Assembly function address table and data under the function or in data section

Asked 2021-Oct-30 at 09:56

I have a question about putting data (address table or other data) in the .text section under its function or put in .data section? For example, I have a function like this :

...

ANSWER

Answered 2021-Oct-30 at 07:51

Yes, you can put the table of pointers (.L4:) in .text section (if it won't be modified at run time) but I don't see a reason for double indirection to a set of jumps to external functions i0..i5. You can branch with an indirect near jump, which takes the destination address from a table of pointers to those external functions. The linker takes care of the completion of external addresses. Example in NASM/Intel syntax:
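As a high-level analogue only (this is Python, not the NASM example the answer refers to), a single-indirection dispatch looks like indexing a table of function references and calling the result. The handlers below are hypothetical stand-ins for the external functions i0..i5.

```python
def make_handler(name):
    """Build a stand-in for one external function."""
    def handler():
        return name
    return handler


# Table of references to the handlers, indexed directly by the selector.
TABLE = [make_handler(f"i{n}") for n in range(6)]


def dispatch(selector):
    """Pick the handler from the table and call it."""
    if not 0 <= selector < len(TABLE):
        raise IndexError("selector out of range")
    return TABLE[selector]()  # one indexed load, then an indirect call
```

The table holds the final destinations itself, so there is no intermediate layer of jump stubs between the lookup and the target.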

Source https://stackoverflow.com/questions/69772172

Community Discussions, Code Snippets contain sources that include Stack Exchange Network

Vulnerabilities
No vulnerabilities reported

Install edi
You can download it from GitHub.
Rust is installed and managed by the rustup tool. Rust has a 6-week rapid release process and supports a great number of platforms, so there are many builds of Rust available at any time. Please refer to rust-lang.org for more information.

Support
For new features, suggestions, and bugs, create an issue on GitHub. If you have questions, check existing answers and ask on Stack Overflow.