rax | 🐰 Rax is a progressive framework | Server Side Rendering library
kandi X-RAY | rax Summary
Rax is a progressive framework for building universal applications. Write Once, Run Anywhere: write one codebase and run it on Web, Weex, Node.js, Alibaba MiniApp and WeChat MiniProgram; it can also work with any other container that implements the driver specification. Fast: a better-performing, smaller (~6KB) alternative to React with the same API. Easy: quick start with zero configuration; features such as Progressive Web App (PWA), Server-Side Rendering (SSR) and Function as a Service (FaaS) can be used out of the box.
Top functions reviewed by kandi - BETA
- Creates a new ReactContext.
- Computes results from running benchmarks.
- Build a Rollup bundle
- Create element.
- Apply an effect.
- Benchmark benchmark
- Replace a reactive render.
- Run an app
- A helper function which runs synchronizing a synchronous selector and updates state accordingly.
- convertPureProps toPureProps.
rax Key Features
rax Examples and Code Snippets
Community Discussions
Trending Discussions on rax
QUESTION
... or rather, why does not static_cast-ing slow down my function?
Consider the function below, which performs integer division:
...ANSWER
Answered 2022-Mar-17 at 15:27
I'm keeping this answer up for now as the comments are useful.
QUESTION
Inspired by a recent question.
One use case for gcc-style inline assembly is to encode instructions neither compiler nor assembler are aware of. For example, I gave this example for how to use the rdrand instruction on a toolchain too old to support it:
ANSWER
Answered 2022-Mar-14 at 15:38
I've actually had the same problem and came up with the following solution.
QUESTION
I'm reading "Computer Systems: A Programmer's Perspective, 3/E" (CS:APP3e) and the following code is an example from the book:
...ANSWER
Answered 2022-Feb-03 at 04:10
(This answer is a summary of comments posted above by Antti Haapala, klutt and Peter Cordes.)
GCC allocates more space than "necessary" in order to ensure that the stack is properly aligned for the call to proc: the stack pointer must be adjusted by a multiple of 16, plus 8 (i.e. by an odd multiple of 8). (See: "Why does the x86-64 / AMD64 System V ABI mandate a 16 byte stack alignment?")
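The "multiple of 16, plus 8" rule can be worked through as arithmetic. The following is only a sketch under stated assumptions: the function pushes no callee-saved registers before adjusting RSP, it makes at least one call, and the hypothetical helper name frame_adjust is not taken from the book or from GCC. On entry RSP is 8 bytes past a 16-byte boundary (the call pushed a return address), so the smallest legal adjustment that still fits the locals is the smallest value of the form 16k + 8.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: smallest stack adjustment (in bytes) that is at
 * least `locals` bytes and is congruent to 8 modulo 16, so that RSP is
 * 16-byte aligned again at the next call instruction.
 * Assumes no registers were pushed after function entry. */
size_t frame_adjust(size_t locals)
{
    assert(locals <= SIZE_MAX - 24);      /* guard against overflow */
    return (locals + 7) / 16 * 16 + 8;    /* round up to the form 16k + 8 */
}
```

For instance, 24 bytes of locals need exactly 24 bytes, while 25 bytes of locals already require 40. A compiler that only kept 8-byte alignment could reserve less, which is the situation the book's code appears to be in.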
What's strange is that the code in the book doesn't do that; the code as shown would violate the ABI and, if proc actually relies on proper stack alignment (e.g. using aligned SSE2 instructions), it may crash.
So it appears that either the code in the book was incorrectly copied from compiler output, or else the authors of the book are using some unusual compiler flags which alter the ABI.
Modern GCC 11.2 emits nearly identical asm (Godbolt) using -Og -mpreferred-stack-boundary=3 -maccumulate-outgoing-args. The -mpreferred-stack-boundary=3 option changes the ABI to maintain only 2^3 byte stack alignment, down from the default 2^4. (Code compiled this way can't safely call anything compiled normally, even standard library functions.) -maccumulate-outgoing-args used to be the default in older GCC, but modern CPUs have a "stack engine" that makes push/pop single-uop so that option isn't the default anymore; push for stack args saves a bit of code size.
One difference from the book's asm is a movl $0, %eax before the call, because there's no prototype so the caller has to assume it might be variadic and pass AL = the number of FP args in XMM registers. (A prototype that matches the args passed would prevent that.) The other instructions are all the same, and in the same order as whatever older GCC version the book used, except for choice of registers after call proc returns: it ends up using movslq %edx, %rdx instead of cltq (sign-extend with RAX).
CS:APP 3e global edition is notorious for errors in practice problems introduced by the publisher (not the authors), but apparently this code is present in the North American edition, too. So this may be the author's mistake / choice to use actual compiler output with weird options. Unlike some of the bad global edition practice problems, this code could have come unmodified from some GCC version, but only with non-standard options.
Related: Why does GCC allocate more space than necessary on the stack, beyond what's needed for alignment? - GCC has a missed-optimization bug where it sometimes reserves an additional 16 bytes that it truly didn't need to. That's not what's happening here, though.
QUESTION
I made a bubble sort implementation in C, and was testing its performance when I noticed that the -O3 flag made it run even slower than no flags at all! Meanwhile -O2 was making it run a lot faster as expected.
Without optimisations:
...ANSWER
Answered 2021-Oct-27 at 19:53
It looks like GCC's naïveté about store-forwarding stalls is hurting its auto-vectorization strategy here. See also "Store forwarding by example" for some practical benchmarks on Intel with hardware performance counters, and "What are the costs of failed store-to-load forwarding on x86?" Also Agner Fog's x86 optimization guides.
(gcc -O3 enables -ftree-vectorize and a few other options not included by -O2, e.g. if-conversion to branchless cmov, which is another way -O3 can hurt with data patterns GCC didn't expect. By comparison, Clang enables auto-vectorization even at -O2, although some of its optimizations are still only on at -O3.)
It's doing 64-bit loads (and branching to store or not) on pairs of ints. This means, if we swapped the last iteration, this load comes half from that store, half from fresh memory, so we get a store-forwarding stall after every swap. But bubble sort often has long chains of swapping every iteration as an element bubbles far, so this is really bad.
(Bubble sort is bad in general, especially if implemented naively without keeping the previous iteration's second element around in a register. It can be interesting to analyze the asm details of exactly why it performs so badly, so wanting to try it is fair enough.)
Anyway, this is pretty clearly an anti-optimization you should report on GCC Bugzilla with the "missed-optimization" keyword. Scalar loads are cheap, and store-forwarding stalls are costly. (Can modern x86 implementations store-forward from more than one prior store? no, nor can microarchitectures other than in-order Atom efficiently load when it partially overlaps with one previous store, and partially from data that has to come from the L1d cache.)
Even better would be to keep buf[x+1] in a register and use it as buf[x] in the next iteration, avoiding a store and load. (Like good hand-written asm bubble sort examples, a few of which exist on Stack Overflow.)
If it wasn't for the store-forwarding stalls (which AFAIK GCC doesn't know about in its cost model), this strategy might be about break-even. SSE 4.1 for a branchless pmind / pmaxd comparator might be interesting, but that would mean always storing and the C source doesn't do that.
If this strategy of double-width load had any merit, it would be better implemented with pure integer on a 64-bit machine like x86-64, where you can operate on just the low 32 bits with garbage (or valuable data) in the upper half. E.g.,
QUESTION
I'm more of a C++ programmer than a C programmer, but this simple code example is a big surprise to me:
...ANSWER
Answered 2022-Jan-17 at 12:33
This function is 100% correct.
In the C language, int foo() { means: define a function foo returning int and taking an unspecified number of parameters.
Your confusion comes from C++, where int foo(void) and int foo() mean exactly the same thing.
In the C language, to define a function which does not take any parameters, you need to define it as:
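A minimal sketch of the two forms side by side, since the answer's own snippet is not reproduced here. The function names are hypothetical, and the comments describe C before C23 (in C23 an empty parameter list is defined to mean the same as void).

```c
#include <assert.h>   /* only needed by callers that check the results */

/* Before C23: empty parentheses give no prototype, so the compiler does
 * not check the arguments at call sites. */
int foo_unspecified() { return 1; }

/* (void) is a prototype saying "takes no parameters"; passing any
 * argument to this function is a compile-time error. */
int foo_no_params(void) { return 2; }
```

With these definitions a call such as foo_no_params(42) is rejected by the compiler, whereas pre-C23 compilers generally accept foo_unspecified(42), possibly with a warning, even though the behaviour of that call is undefined.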
QUESTION
These two loops are equivalent in C++ and Rust:
...ANSWER
Answered 2022-Jan-12 at 10:20
Overflow in the iterator state.
The C++ version will loop forever when given a large enough input:
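The question's loops are elided above, so the following is only an illustration of the kind of overflow meant, assuming the loop has the inclusive-bound shape for (i = 0; i <= n; ++i). It is written in C with an 8-bit counter so that the wraparound is reached quickly, and the hypothetical cap parameter exists purely so the demonstration terminates; unsigned wraparound behaves the same way in C++.

```c
#include <assert.h>
#include <stdint.h>

/* Counts iterations of an inclusive loop over 0..n using an 8-bit
 * counter. When n == UINT8_MAX the condition i <= n can never become
 * false, because ++i wraps from 255 back to 0; `cap` bounds the demo. */
unsigned long count_inclusive_u8(uint8_t n, unsigned long cap)
{
    assert(cap > 0);
    unsigned long iters = 0;
    for (uint8_t i = 0; i <= n; ++i) {
        if (++iters == cap)
            break;              /* give up: the loop would not end on its own */
    }
    return iters;
}
```

For n = 3 the loop runs four times and exits normally; for n = 255 it only stops because of the cap. An inclusive range that must terminate for every input has to track that extra "finished" state somewhere, which is the iterator state the answer refers to.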
QUESTION
I have just browsed the Linux kernel source tree and read the file tools/include/nolibc/nolibc.h.
I saw the syscall in this file uses %r8, %r9 and %r10 in the clobber list.
Also there is a comment that says:
rcx and r8..r11 may be clobbered, others are preserved.
As far as I know, syscall only clobbers %rax, %rcx and %r11 (and memory).
Is there a real example of syscall that clobbers %r8, %r9 and %r10?
ANSWER
Answered 2022-Jan-05 at 10:28
Only 32-bit system calls (e.g. via int 0x80) in 64-bit mode step on those registers, along with R11. (See: "What happens if you use the 32-bit int 0x80 Linux ABI in 64-bit code?")
syscall properly saves/restores all regs including R8, R9, and R10, so user-space using it can assume they keep their values, except the RAX return value. (The kernel's syscall entry point even saves RCX and R11, but at that point they've already been overwritten by the syscall instruction itself with the original RIP and before-masking RFLAGS value.)
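As a sketch of what this means for a clobber list, here is a zero-argument wrapper that declares only RCX, R11 and memory as clobbered, with RAX as the output. It assumes GCC or Clang extended asm on x86-64 Linux; the name raw_syscall0 is hypothetical and this is not the nolibc.h macro.

```c
#include <assert.h>
#include <sys/syscall.h>   /* SYS_getpid */
#include <unistd.h>        /* getpid, used for comparison by callers */

/* Hypothetical wrapper for a system call that takes no arguments.
 * Only rcx and r11 (overwritten by the syscall instruction itself) and
 * memory are listed as clobbers; r8, r9 and r10 are not listed. */
static inline long raw_syscall0(long nr)
{
    long ret;
    assert(nr >= 0);
    __asm__ volatile ("syscall"
                      : "=a"(ret)               /* return value in RAX */
                      : "a"(nr)                 /* call number in RAX */
                      : "rcx", "r11", "memory");
    return ret;
}
```

Calling raw_syscall0(SYS_getpid) should return the same value as getpid(). Listing r8 to r10 as clobbers as well, as nolibc.h does, is more conservative than needed for syscall but costs the compiler some register freedom.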
Those, with R11, are the non-legacy registers that are call-clobbered in the function-calling convention, so compiler-generated code for C functions inside the kernel naturally preserves R12-R15, even if an asm entry point didn't save them.
Currently the 64-bit int 0x80 entry point just pushes 0 for the call-clobbered R8-R11 registers in the process-state struct that it will restore from before returning to user space, instead of the original register values.
Historically, the int 0x80 entry point from 32-bit user-space didn't save/restore those registers at all. So their values were whatever compiler-generated kernel code left sitting around. This was thought to be innocent because 32-bit mode can't read those registers, until it was realized that user-space can far-jump to 64-bit mode, using the same CS value that the kernel uses for normal 64-bit user-space processes, selecting that system-wide GDT entry. So there was an actual info leak of kernel data, which was fixed by zeroing those registers.
IDK whether there used to be or still is a separate entry point from 64-bit user-space vs. 32-bit, or how they differ in struct pt_regs layout. The historical situation where int 0x80 leaked r8..r11 wouldn't have made sense for 64-bit user-space; that leak would have been obvious. So if they're unified now, they must not have been in the past.
QUESTION
This follows from experimenting on Compiler Explorer to ascertain the compiler's (rustc's) behaviour when it comes to the log2()/leading_zeros() and similar functions. I came across this result, which seems both exceedingly bizarre and concerning:
Code:
...ANSWER
Answered 2021-Dec-26 at 01:56
Old x86-64 CPUs don't support lzcnt, so rustc/llvm won't emit it by default. (They would execute it as bsr but the behavior is not identical.)
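The difference matters mostly for a zero input: lzcnt returns the operand width (32 for a 32-bit operand), while bsr leaves its destination undefined. A portable leading-zero count therefore needs an explicit zero check when lzcnt cannot be assumed. The sketch below is in C rather than Rust and relies on the GCC/Clang builtin __builtin_clz, whose result is undefined for 0; the function name leading_zeros32 is hypothetical.

```c
#include <assert.h>   /* only needed by callers that check the results */
#include <stdint.h>

/* Leading-zero count with the same result as lzcnt for every input.
 * __builtin_clz(0) is undefined, so zero is handled separately; without
 * lzcnt available, compilers typically lower the builtin to bsr plus a
 * fix-up. */
unsigned leading_zeros32(uint32_t x)
{
    if (x == 0)
        return 32;                      /* what lzcnt would produce */
    return (unsigned)__builtin_clz(x);  /* defined for non-zero x */
}
```

When the target is known to support lzcnt, the compiler can turn this whole function into a single instruction; otherwise the extra compare and branch (or cmov) is the kind of code the question found surprising.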
Use -C target-feature=+lzcnt to enable it. Try.
More generally, you may wish to use -C target-cpu=XXX to enable all the features of a specific CPU model. Use rustc --print target-cpus for a list.
In particular, -C target-cpu=native will generate code for the CPU that rustc itself is running on, e.g. if you will run the code on the same machine where you are compiling it.
QUESTION
Update: relevant GCC bug report: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103798
I tested the following code:
...ANSWER
Answered 2021-Dec-21 at 11:08
libstdc++'s std::string_view::find_first_of looks something like:
QUESTION
Intel recommends using instruction prefixes to mitigate the performance consequences of JCC Erratum.
MSVC, if compiled with /QIntel-jcc-erratum, follows the recommendation and inserts prefixed instructions, like this:
ANSWER
Answered 2021-Dec-21 at 16:31
A NOP is a separate instruction that has to be decoded and go through the pipeline separately. It's always better to pad instructions with prefixes to achieve desired alignment, not insert NOPs, as discussed in "What methods can be used to efficiently extend instruction length on modern x86?" (but only in ways that don't cause major stalls on some CPUs which can't handle large numbers of prefixes).
Perhaps Intel considered it worth the effort for toolchains to do it this way for this case since this would actually be inside inner loops, not just a NOP outside an inner loop. (And tacking on prefixes to one previous instruction is relatively simple.)
I now have a data point. The result of benchmarking /QIntel-jcc-erratum on an AMD FX 8300 is bad.
The slowdown is by a decimal order of magnitude for a specific benchmark, where the benefit on Intel Skylake for the same benchmark is about 20 percent. This aligns with Peter's comments:
I checked Agner Fog's microarch guide, and AMD Zen has no problem with any number of prefixes on a single instruction, like mainstream Intel since Core2. AMD Bulldozer-family has a "very large" penalty for decoding instructions with more than 3 prefixes, like 14-15 cycles for 4-7 prefixes
It's somewhat valid to consider Bulldozer-family obsolete enough to not care much about it, although there are still some APU desktops and laptops around for sure, but they'd certainly show large regressions in loops where the compiler put 4 or more prefixes on one instruction inside a hot inner loop (including existing prefixes like REX or 66h). Much worse than the 3% for MITE legacy decode on SKL.
Though Bulldozer-family is indeed obsolete-ish, I don't think I can afford this much of an impact. I'm also afraid of other CPUs that may choke on extra prefixes the same way. So the conclusion for me is not to use /QIntel-jcc-erratum for generally-targeted software, unless it is enabled only in specific translation units that are reached through dynamic dispatch, which is too much trouble most of the time.
One thing that is probably safe to do on MSVC is to stop using the /Os flag. It was discovered that the /Os flag at least:
- Avoids jump tables in favor of conditional jumps
- Avoids loop start padding
Try the following example (https://godbolt.org/z/jvezPd9jM):
Community Discussions, Code Snippets contain sources that include Stack Exchange Network
Vulnerabilities
No vulnerabilities reported