nano | Mirror of git://git.sv.gnu.org/nano.git
kandi X-RAY | nano Summary
How to compile and install nano.
Community Discussions
Trending Discussions on nano
QUESTION
I'm trying to make sure gcc vectorizes my loops. It turns out that by using -march=znver1 (or -march=native) gcc skips some loops even though they can be vectorized. Why does this happen?
In this code, the second loop, which multiplies each element by a scalar, is not vectorised:
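The question's original code is not reproduced on this page; a representative sketch of the two loops being discussed (the array name, its size, and the fill constant are assumptions), compiled with gcc -O3 -march=znver1, could look like this:

    #include <stdint.h>
    #include <stddef.h>

    #define N 1024                      /* assumed size */
    uint64_t arr[N];

    void fill(void) {
        /* first loop: memset-like store of a constant; GCC vectorizes this */
        for (size_t i = 0; i < N; i++)
            arr[i] = 0x1122334455667788u;
    }

    void scale(void) {
        /* second loop: multiply each element by a scalar; with -march=znver1
           GCC leaves this scalar (one LEA per element) instead of vectorizing */
        for (size_t i = 0; i < N; i++)
            arr[i] *= 5;
    }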
...ANSWER
Answered 2022-Apr-10 at 02:47
The default -mtune=generic has -mprefer-vector-width=256, and -mavx2 doesn't change that.
znver1 implies -mprefer-vector-width=128, because that's the native width of the HW. An instruction using 32-byte YMM vectors decodes to at least 2 uops, more if it's a lane-crossing shuffle. For simple vertical SIMD like this, 32-byte vectors would be ok; the pipeline handles 2-uop instructions efficiently. (And I think Zen1 is 6 uops wide but only 5 instructions wide, so max front-end throughput isn't available using only 1-uop instructions.) But when vectorization would require shuffling, e.g. with arrays of different element widths, GCC code-gen can get messier with 256-bit or wider.
And vmovdqa ymm0, ymm1 mov-elimination only works on the low 128-bit half on Zen1. Also, normally using 256-bit vectors would imply one should use vzeroupper afterwards, to avoid performance problems on other CPUs (but not Zen1).
I don't know how Zen1 handles misaligned 32-byte loads/stores where each 16-byte half is aligned but in separate cache lines. If that performs well, GCC might want to consider increasing the znver1 -mprefer-vector-width to 256. But wider vectors mean more cleanup code if the size isn't known to be a multiple of the vector width.
Ideally GCC would be able to detect easy cases like this and use 256-bit vectors there. (Pure vertical, no mixing of element widths, constant size that's a multiple of 32 bytes.) At least on CPUs where that's fine: znver1, but not bdver2 for example, where 256-bit stores are always slow due to a CPU design bug.
You can see the result of this choice in the way it vectorizes your first loop, the memset-like loop, with a vmovdqu [rdx], xmm0. https://godbolt.org/z/E5Tq7Gfzc
So given that GCC has decided to only use 128-bit vectors, which can only hold two uint64_t elements, it (rightly or wrongly) decides it wouldn't be worth using vpsllq / vpaddd to implement qword *5 as (v<<2) + v, vs. doing it with integer in one LEA instruction.
Almost certainly wrongly in this case, since it still requires a separate load and store for every element or pair of elements. (And loop overhead, since GCC's default is not to unroll except with PGO, -fprofile-use. SIMD is like loop unrolling, especially on a CPU that handles 256-bit vectors as 2 separate uops.)
I'm not sure exactly what GCC means by "not vectorized: unsupported data-type". x86 doesn't have a SIMD uint64_t multiply instruction until AVX-512, so perhaps GCC assigns it a cost based on the general case of having to emulate it with multiple 32x32 => 64-bit pmuludq instructions and a bunch of shuffles. And it's only after it gets over that hump that it realizes that it's actually quite cheap for a constant like 5 with only 2 set bits?
That would explain GCC's decision-making process here, but I'm not sure it's exactly the right explanation. Still, these kinds of factors are what happen in a complex piece of machinery like a compiler. A skilled human can easily make smarter choices, but compilers just do sequences of optimization passes that don't always consider the big picture and all the details at the same time.
-mprefer-vector-width=256 doesn't help:
Not vectorizing uint64_t *= 5 seems to be a GCC9 regression
(The benchmarks in the question confirm that an actual Zen1 CPU gets a nearly 2x speedup, as expected from doing 2x uint64 in 6 uops vs. 1x in 5 uops with scalar. Or 4x uint64_t in 10 uops with 256-bit vectors, including two 128-bit stores which will be the throughput bottleneck along with the front-end.)
Even with -march=znver1 -O3 -mprefer-vector-width=256, we don't get the *= 5 loop vectorized with GCC9, 10, or 11, or current trunk. As you say, we do with -march=znver2. https://godbolt.org/z/dMTh7Wxcq
We do get vectorization with those options for uint32_t (even leaving the vector width at 128-bit). Scalar would cost 4 operations per vector uop (not instruction), regardless of 128 or 256-bit vectorization on Zen1, so this doesn't tell us whether *= is what makes the cost-model decide not to vectorize, or just the 2 vs. 4 elements per 128-bit internal uop.
With uint64_t, changing to arr[i] += arr[i]<<2; still doesn't vectorize, but arr[i] <<= 1; does. (https://godbolt.org/z/6PMn93Y5G). Even arr[i] <<= 2; and arr[i] += 123 in the same loop vectorize, to the same instructions that GCC thinks aren't worth it for vectorizing *= 5, just different operands, constant instead of the original vector again. (Scalar could still use one LEA.) So clearly the cost-model isn't looking as far as final x86 asm machine instructions, but I don't know why arr[i] += arr[i] would be considered more expensive than arr[i] <<= 1; which is exactly the same thing.
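For reference, a self-contained sketch of the variants mentioned above (function names are illustrative) that can be compared on Godbolt with gcc -O3 -march=znver1 -mprefer-vector-width=256:

    #include <stdint.h>
    #include <stddef.h>

    /* Not vectorized by GCC 9/10/11 with -march=znver1 -mprefer-vector-width=256 */
    void mul5(uint64_t *arr, size_t n) {
        for (size_t i = 0; i < n; i++) arr[i] *= 5;
    }

    /* Equivalent shift-and-add form: still not vectorized */
    void shift_add(uint64_t *arr, size_t n) {
        for (size_t i = 0; i < n; i++) arr[i] += arr[i] << 2;
    }

    /* A plain shift: this one does get vectorized */
    void shl1(uint64_t *arr, size_t n) {
        for (size_t i = 0; i < n; i++) arr[i] <<= 1;
    }

    /* Shift plus add of a constant in the same loop: also vectorized */
    void shl2_add(uint64_t *arr, size_t n) {
        for (size_t i = 0; i < n; i++) { arr[i] <<= 2; arr[i] += 123; }
    }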
GCC8 does vectorize your loop, even with 128-bit vector width: https://godbolt.org/z/5o6qjc7f6
QUESTION
I have microk8s v1.22.2 running on Ubuntu 20.04.3 LTS.
Output from /etc/hosts:
ANSWER
Answered 2021-Oct-10 at 18:29
error: unable to recognize "ingress.yaml": no matches for kind "Ingress" in version "extensions/v1beta1"
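The error means the manifest still uses the extensions/v1beta1 Ingress API, which was removed in Kubernetes 1.22 (what microk8s v1.22.2 runs); rewriting it against networking.k8s.io/v1 resolves it. A minimal sketch (resource name, host, service name and port are placeholders):

    apiVersion: networking.k8s.io/v1
    kind: Ingress
    metadata:
      name: my-ingress               # placeholder
    spec:
      rules:
        - host: example.local        # placeholder
          http:
            paths:
              - path: /
                pathType: Prefix
                backend:
                  service:
                    name: my-service # placeholder
                    port:
                      number: 80

Re-applying with kubectl apply -f ingress.yaml should then be accepted.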
QUESTION
I'm correctly generating my image Yocto-hardknott-technexion with this:
...ANSWER
Answered 2022-Feb-16 at 16:34
The solution was to change imx7d-pico-pi-m4.dtb to imx7d-pico-pi-qca-m4.dtb in the Yocto/Hardknott/technexion configuration file called pico-imx7.conf (described in the post).
QUESTION
Why is the use of fBuffer1 in the attached code example (SELECT_QUICK = true) twice as fast as the other variant, where fBuffer2 is resized only once at the beginning (SELECT_QUICK = false)? The code path is absolutely identical, but even after 10 minutes the throughput of fBuffer2 does not reach the level of fBuffer1.
We have a generic data processing framework that collects thousands of Java primitive values in different subclasses (one subclass for each primitive type). These values are stored internally in arrays, which we originally sized sufficiently large. To save heap memory, we have now switched these arrays to dynamic resizing (arrays grow only if needed). As expected, this change has massively reduced the heap memory. On the other hand, however, the performance has unfortunately degraded significantly: our processing jobs now take 2-3 times longer than before (e.g. 6 min instead of 2 min).
I have reduced our problem to a minimum working example and attached it. You can choose with SELECT_QUICK which buffer should be used. I see the same effect with jdk-1.8.0_202-x64 as well as with openjdk-17.0.1-x64.
ANSWER
Answered 2022-Feb-04 at 18:19
JIT compilation in HotSpot JVM is 1) based on runtime profile data; 2) uses speculative optimizations.
Once the method is compiled at the maximum optimization level, HotSpot stops profiling this code, so it is never recompiled afterwards, no matter how long the code runs. (The exception is when the method needs to be deoptimized or unloaded, but it's not your case).
In the first case (SELECT_QUICK == true), the condition remaining > fBuffer1.length - fIdx is never met, and HotSpot JVM is aware of that from profiling data collected at lower tiers. So it speculatively hoists the check out of the loop, and compiles the loop body with the assumption that array index is always within bounds. After the optimization, the loop is compiled like this (in pseudocode):
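The answer's original pseudocode snippet is not reproduced on this page; a rough, hypothetical sketch of the idea (the values array and the deoptimize call stand in for whatever the real compiled code does) is:

    // Hypothetical pseudocode of the optimized loop:
    // the capacity check is hoisted, and its failure path is an "uncommon trap"
    // that deoptimizes back to the interpreter, because the profile says it never fires.
    if (remaining > fBuffer1.length - fIdx) {
        deoptimize();                       // never taken according to profiling data
    }
    for (int i = 0; i < remaining; i++) {
        fBuffer1[fIdx + i] = values[i];     // loop body compiled without per-iteration checks
    }
    fIdx += remaining;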
QUESTION
I am aware of the following scenario: (weird formatting, I know)
...ANSWER
Answered 2022-Jan-20 at 17:24
The Java memory model is sequential consistency in the absence of data races (which your program doesn't have). This is very strong; it says that all reads and writes in your program form a total order, which is consistent with program order. So you can imagine that reads and writes from different threads simply get interleaved, or shuffled together, in some fashion - without changing the relative ordering of actions made from the same thread as each other.
But for this purpose, every action is a separate element in this order.
So just by the mere fact that aBoolean.get() and aBoolean.compareAndSet() are two actions and not one action, it is possible for any number of other actions by other threads to take place in between them.
It doesn't matter whether those actions are part of a single statement, or different statements; or what kind of expression they appear in; or what operators (if any) are between them; or what computations the thread itself may or may not be doing around them. There is no way in which two actions can be "so close together" that nothing else can happen in between, short of replacing them by a single action defined to be atomic by the language.
At the level of the machine, a very simple way this can happen is that, since aBoolean.get() and aBoolean.compareAndSet() are almost certainly two different machine instructions, an interrupt may arrive in between them. This interrupt could cause the thread to be delayed for any amount of time, during which other threads could do anything they wish. So it is entirely possible that threads #1 and #2 are both interrupted in between their get() and compareAndSet(), and that thread #3 executes its set in the meantime.
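A minimal, self-contained sketch of the point (class and method names are assumed; only the AtomicBoolean field matches the question): a get() followed by a compareAndSet() is two actions that other threads can interleave with, whereas retrying the CAS makes the outcome depend only on the single atomic operation.

    import java.util.concurrent.atomic.AtomicBoolean;

    class CasInterleaving {
        static final AtomicBoolean aBoolean = new AtomicBoolean(false);

        // Two separate actions: another thread's set()/compareAndSet()
        // may run between the get() and the compareAndSet().
        static boolean checkThenAct() {
            boolean seen = aBoolean.get();
            return aBoolean.compareAndSet(seen, !seen); // can fail if someone intervened
        }

        // Retrying until the CAS succeeds: the final decision rests on the
        // single atomic compareAndSet, not on an earlier, possibly stale read.
        static void flipAtomically() {
            boolean seen;
            do {
                seen = aBoolean.get();
            } while (!aBoolean.compareAndSet(seen, !seen));
        }
    }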
Caution: Reasoning about how a particular machine might work is often useful for understanding why undesired behavior is possible, as in the previous paragraph. But it is not a substitute for reasoning about the formal memory model, and should not be used to try to argue that a program must have its desired behavior. Even if a particular machine you have in mind would do the right thing for your code, or you can't think of a way in which a plausible machine would fail, that does not prove that your program is correct.
So trying to say "oh, the machine will do a lock cmpxchg and so blah blah blah and everything works" isn't wise; some other machine you haven't thought of might work in a totally different fashion, that still complies with the abstract Java memory model but otherwise violates your x86-based expectations. In fact, x86 is a particularly poor example for this: for historical reasons, it provides a rather strong set of instruction-level memory ordering guarantees that many other "weakly ordered" architectures do not, and so there may be many things that Java abstractly allows but that x86 in practice won't do.
QUESTION
I am trying to ingest data from Kafka into Aerospike. What am I missing in the Kafka message being sent?
I am sending the data below into Kafka for pushing into Aerospike:
...ANSWER
Answered 2022-Jan-19 at 03:20
It looks like you are not specifying a key when you are sending your Kafka message. By default Kafka sends a null key, and your config says to use the Kafka key as the Aerospike key. In order to send a Kafka key you need to set parse.key to true and specify what your separator will be (in the Kafka producer).
See step 8 here: https://kafka-tutorials.confluent.io/kafka-console-consumer-producer-basics/kafka.html
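For example, with the console producer (topic name, broker address and separator below are placeholders):

    # Hypothetical example; topic, broker and separator are placeholders.
    kafka-console-producer --topic aerospike-input \
      --bootstrap-server localhost:9092 \
      --property "parse.key=true" \
      --property "key.separator=:"
    # Each line typed is now split into key and value at the separator, e.g.
    #   user1:{"name": "Alice"}
    # so the connector can use "user1" as the Aerospike key.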
QUESTION
I'm trying to install Phusion Passenger as a dynamic module with Nginx installed from the repo. The process seems to be working but my Meteor app doesn't load and it looks like the Passenger module isn't running.
OS: RedHat 8
Nginx: 1.20.1
Passenger: Standalone 6.0.12
Meteor: 2.5.1
How I've built the module:
Install Passenger standalone as per the tutorial
ANSWER
Answered 2022-Jan-06 at 13:35
I worked it out; the issue was that I didn't realise that when you install Passenger as a dynamic module, you still need to do the same config as with a regular install. In particular, in your nginx.conf, you need to add this to the http block:
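Roughly, the relevant pieces look like the sketch below; the exact paths are assumptions and should come from passenger-config --root (and your Ruby location), as the Passenger docs describe:

    # At the very top of nginx.conf (main context): load the dynamic module.
    load_module modules/ngx_http_passenger_module.so;

    http {
        # Values below are placeholders; use the output of `passenger-config --root`.
        passenger_root /usr/share/ruby/vendor_ruby/phusion_passenger/locations.ini;
        passenger_ruby /usr/bin/ruby;

        # ... rest of the existing http block ...
    }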
QUESTION
I have a script to list and check if multiple anti-virus products are installed on a machine, which is working fine. Is there a better way to make it simpler than having such long code?
...ANSWER
Answered 2021-Dec-29 at 02:29
I don't have Get-CimInstance available for testing, but it should be as easy as this:
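Presumably the idea is to query the Security Center registration once rather than checking each product individually; a sketch (the root/SecurityCenter2 namespace exists on Windows client editions) might be:

    # List every registered anti-virus product in one query.
    Get-CimInstance -Namespace root/SecurityCenter2 -ClassName AntiVirusProduct |
        Select-Object displayName, productState, timestamp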
QUESTION
With the second generation runtime of Google Cloud Run, it's now possible to mount Google Storage Buckets using gcsfuse.
https://cloud.google.com/run/docs/tutorials/network-filesystems-fuse
The python3 example is working fine. Unfortunately, I keep getting this error with my Dockerfile:
...ANSWER
Answered 2021-Dec-23 at 13:39
I solved it by mounting the GCS bucket in Cloud Run and reading/writing objects with the following changes:
- Dockerfile:
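The answer's actual Dockerfile is not shown here; a sketch following the gcsfuse tutorial pattern (base image, package repository, and the gcsfuse_run.sh start script are assumptions) would be along these lines:

    # Sketch only: versions, paths and gcsfuse_run.sh are assumptions.
    FROM python:3.9-slim-bullseye

    # Install gcsfuse from Google's apt repository (plus fuse and tini).
    RUN apt-get update && apt-get install -y curl gnupg && \
        echo "deb https://packages.cloud.google.com/apt gcsfuse-bullseye main" \
            > /etc/apt/sources.list.d/gcsfuse.list && \
        curl -fsSL https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add - && \
        apt-get update && apt-get install -y gcsfuse fuse tini && \
        rm -rf /var/lib/apt/lists/*

    WORKDIR /app
    COPY . .
    RUN pip install --no-cache-dir -r requirements.txt

    ENV MNT_DIR=/mnt/gcs
    # gcsfuse_run.sh mounts the bucket with gcsfuse into $MNT_DIR, then starts the app.
    ENTRYPOINT ["/usr/bin/tini", "--", "/app/gcsfuse_run.sh"]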
QUESTION
Hi, I am unable to run this command: aws sts get-caller-identity.
When I do sudo nano ~/.aws/credentials, I can only locate this:
ANSWER
Answered 2021-Dec-08 at 16:03
Sometimes this kind of issue is caused by another credential configuration.
Environment-variable credential configuration takes priority over the credentials config file, so if the environment variables AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY or AWS_SESSION_TOKEN are present, they can cause issues if they are misconfigured or have expired.
Try checking the env vars associated with AWS credentials and removing them using the 'unset' command in Linux (see the example after the list below).
Additionally, to remove env vars permanently you need to remove the related lines from configuration files like:
- /etc/environment
- /etc/profile
- ~/.profile
- ~/.bashrc
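For example (session-only; adjust to whichever variables are actually set):

    # Check which AWS-related environment variables are currently set
    env | grep '^AWS_'

    # Remove the overrides for the current shell session
    unset AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_SESSION_TOKEN

    # Then retry
    aws sts get-caller-identity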
Reference:
Community Discussions, Code Snippets contain sources that include Stack Exchange Network
Vulnerabilities
No vulnerabilities reported