leaf | A fake file system for Deno binaries
kandi X-RAY | leaf Summary
kandi X-RAY | leaf Summary
Leaf is a fake file system for Deno binaries. This means, you can save your files along with the binary generated by deno compile, this way, you can put all your files together in a single executable which leads to easier deployments and a more compacted deliverable output.
Support
Quality
Security
License
Reuse
Top functions reviewed by kandi - BETA
Currently covering the most popular Java, JavaScript and Python libraries. See a Sample of leaf
leaf Key Features
leaf Examples and Code Snippets
Community Discussions
Trending Discussions on leaf
QUESTION
I'm trying to make sure gcc vectorizes my loops. It turns out, that by using -march=znver1
(or -march=native
) gcc skips some loops even though they can be vectorized. Why does this happen?
In this code, the second loop, which multiplies each element by a scalar is not vectorised:
...ANSWER
Answered 2022-Apr-10 at 02:47The default -mtune=generic
has -mprefer-vector-width=256
, and -mavx2
doesn't change that.
znver1 implies -mprefer-vector-width=128
, because that's all the native width of the HW. An instruction using 32-byte YMM vectors decodes to at least 2 uops, more if it's a lane-crossing shuffle. For simple vertical SIMD like this, 32-byte vectors would be ok; the pipeline handles 2-uop instructions efficiently. (And I think is 6 uops wide but only 5 instructions wide, so max front-end throughput isn't available using only 1-uop instructions). But when vectorization would require shuffling, e.g. with arrays of different element widths, GCC code-gen can get messier with 256-bit or wider.
And vmovdqa ymm0, ymm1
mov-elimination only works on the low 128-bit half on Zen1. Also, normally using 256-bit vectors would imply one should use vzeroupper
afterwards, to avoid performance problems on other CPUs (but not Zen1).
I don't know how Zen1 handles misaligned 32-byte loads/stores where each 16-byte half is aligned but in separate cache lines. If that performs well, GCC might want to consider increasing the znver1 -mprefer-vector-width
to 256. But wider vectors means more cleanup code if the size isn't known to be a multiple of the vector width.
Ideally GCC would be able to detect easy cases like this and use 256-bit vectors there. (Pure vertical, no mixing of element widths, constant size that's am multiple of 32 bytes.) At least on CPUs where that's fine: znver1, but not bdver2 for example where 256-bit stores are always slow due to a CPU design bug.
You can see the result of this choice in the way it vectorizes your first loop, the memset-like loop, with a vmovdqu [rdx], xmm0
. https://godbolt.org/z/E5Tq7Gfzc
So given that GCC has decided to only use 128-bit vectors, which can only hold two uint64_t
elements, it (rightly or wrongly) decides it wouldn't be worth using vpsllq
/ vpaddd
to implement qword *5
as (v<<2) + v
, vs. doing it with integer in one LEA instruction.
Almost certainly wrongly in this case, since it still requires a separate load and store for every element or pair of elements. (And loop overhead since GCC's default is not to unroll except with PGO, -fprofile-use
. SIMD is like loop unrolling, especially on a CPU that handles 256-bit vectors as 2 separate uops.)
I'm not sure exactly what GCC means by "not vectorized: unsupported data-type". x86 doesn't have a SIMD uint64_t
multiply instruction until AVX-512, so perhaps GCC assigns it a cost based on the general case of having to emulate it with multiple 32x32 => 64-bit pmuludq
instructions and a bunch of shuffles. And it's only after it gets over that hump that it realizes that it's actually quite cheap for a constant like 5
with only 2 set bits?
That would explain GCC's decision-making process here, but I'm not sure it's exactly the right explanation. Still, these kinds of factors are what happen in a complex piece of machinery like a compiler. A skilled human can easily make smarter choices, but compilers just do sequences of optimization passes that don't always consider the big picture and all the details at the same time.
-mprefer-vector-width=256
doesn't help:
Not vectorizing uint64_t *= 5
seems to be a GCC9 regression
(The benchmarks in the question confirm that an actual Zen1 CPU gets a nearly 2x speedup, as expected from doing 2x uint64 in 6 uops vs. 1x in 5 uops with scalar. Or 4x uint64_t in 10 uops with 256-bit vectors, including two 128-bit stores which will be the throughput bottleneck along with the front-end.)
Even with -march=znver1 -O3 -mprefer-vector-width=256
, we don't get the *= 5
loop vectorized with GCC9, 10, or 11, or current trunk. As you say, we do with -march=znver2
. https://godbolt.org/z/dMTh7Wxcq
We do get vectorization with those options for uint32_t
(even leaving the vector width at 128-bit). Scalar would cost 4 operations per vector uop (not instruction), regardless of 128 or 256-bit vectorization on Zen1, so this doesn't tell us whether *=
is what makes the cost-model decide not to vectorize, or just the 2 vs. 4 elements per 128-bit internal uop.
With uint64_t
, changing to arr[i] += arr[i]<<2;
still doesn't vectorize, but arr[i] <<= 1;
does. (https://godbolt.org/z/6PMn93Y5G). Even arr[i] <<= 2;
and arr[i] += 123
in the same loop vectorize, to the same instructions that GCC thinks aren't worth it for vectorizing *= 5
, just different operands, constant instead of the original vector again. (Scalar could still use one LEA). So clearly the cost-model isn't looking as far as final x86 asm machine instructions, but I don't know why arr[i] += arr[i]
would be considered more expensive than arr[i] <<= 1;
which is exactly the same thing.
GCC8 does vectorize your loop, even with 128-bit vector width: https://godbolt.org/z/5o6qjc7f6
QUESTION
I was wondering if someone could help me understand why does System.IO.FileInfo
behaves differently on Windows than on Linux when handling relative paths.
- On Linux
ANSWER
Answered 2021-Dec-07 at 15:50Feels a bit duplicate, but since you asked..
I'm sorry I don't know about Linux, but in Windows:
You can add a test first to see if the path is relative and if so, convert it to absolute like:
QUESTION
Here it is my tree, a nested dictionary
...ANSWER
Answered 2021-Dec-13 at 16:38You are not using the result of the recursive call. That worked for print_leaves
which didn't have a return value, but that doesn't work for a function with return
or yield
.
Here is the long version:
QUESTION
I'm trying to replace the string types in the following function with more specific types that ensure type safe property access:
...ANSWER
Answered 2022-Mar-21 at 14:18I can't believe this abomination actually works:
QUESTION
I have two tables as follows :
...ANSWER
Answered 2022-Feb-26 at 05:47Create Split
Function like this
QUESTION
I've been working on python to make a program which need to handle complex problem from list of dict. The thing is I need to transform this data into dictionary and sort it. The input for this function is come from trees. The code I share here is working, but takes a long time to run. In here I wanna ask is there any idea to make this function run more faster in python? I use python 3.7.3 if you ask. The reason I wanna improve this code is because when I tried to make input data for this function need around 3-4 hours, but to run this function need time around 21-22 hours (this really shock me).
here is the structure of data that I input on below:
...ANSWER
Answered 2022-Feb-08 at 00:52Without having the full code to test outputs this is harder to do, but it seems that there are some redundant processes that you are adding elements to a list of lists only to flatten that list and add that to a dictionary as a set. You can increase some of the speed and memory by removing that and instead just adding it to the dictionary right away.
There are some other tweaks that can be done such as using f-strings instead of string concatenation, using list comprehension, and removing having to do the same math in the loop (time_range * gamma) and instead just reference it by memory.
But these are all minor tweaks compared to your step one process which looks to be the largest time sink (approx N^4 in time complexity). I am unsure if it is larger as I don't see the functions that you use inside that for loop, but tweaking that to reduce the number of calculations would provide the largest benefit to time savings.
QUESTION
n00b here!
I have managed to assemble JavaScript code and I don't understand why is not working correct.
The code is designed to calculate the difference between a given date and today's date. I guess my logic in the script is not correct as no error show in the console.
Here is the code:
...ANSWER
Answered 2022-Feb-03 at 23:26QUESTION
The function above AllPaths()
appends an array containing the path to each leaf of the binary tree to the global array res
.
The code works just fine, but I want to remove the global variable res
and make the function return an array instead. How can I do that?
ANSWER
Answered 2021-Dec-26 at 20:20A simple way that allow you to avoid the inner lists and global list altogether is to make a generator that yields the values as they come. Then you can just pass this to list
to make the final outcome:
QUESTION
I am attempting to construct an immutable Trie defined as such:
...ANSWER
Answered 2022-Jan-12 at 01:10The appears to get at what you're after.
QUESTION
Here is the code screenshot.
I want to render Homepage component but I want to wrap it into these MainLayout component.
The problem is that screen is blank and there is no error in Terminal but when I inspect the page it says "Matched leaf route at location "/" does not have an element", so guys I know this is version update syntax problem because I had same problem when I was writing component= {component }
but syntax has been changed and I should have written element={}
.
So I believe this is also syntax problem but I've done research but couldn't solve. I believe I should change this
...ANSWER
Answered 2021-Dec-22 at 19:03The Route
components in react-router-dom
v6 no longer take component
or render
props; the routed components are rendered on the element
prop as JSX.
Community Discussions, Code Snippets contain sources that include Stack Exchange Network
Vulnerabilities
No vulnerabilities reported
Install leaf
Support
Reuse Trending Solutions
Find, review, and download reusable Libraries, Code Snippets, Cloud APIs from over 650 million Knowledge Items
Find more librariesStay Updated
Subscribe to our newsletter for trending solutions and developer bootcamps
Share this Page