simd | simd offers a basic interface

by huonw Rust Version: Current License: Non-SPDX

kandi X-RAY | simd Summary

simd is a Rust library typically used in Big Data applications. It has no reported bugs or vulnerabilities, and it has low support. However, it has a Non-SPDX license. You can download it from GitHub.

simd offers a basic interface to the SIMD functionality of CPUs.

Support

simd has a low active ecosystem.

It has 81 stars and 19 forks. There are 11 watchers for this library.

It had no major release in the last 6 months.

There are 19 open issues and 8 have been closed. On average issues are closed in 33 days. There are 2 open pull requests and 0 closed requests.

It has a neutral sentiment in the developer community.

The latest version of simd is current.

Quality

simd has no bugs reported.

Security

simd has no vulnerabilities reported, and its dependent libraries have no vulnerabilities reported.

License

simd has a Non-SPDX License.

A Non-SPDX license may be an open-source license that is not SPDX-compliant, or it may not be open source at all, so you need to review it closely before use.

Reuse

No packaged releases of simd are available. You will need to build and install it from source.

simd Key Features

No Key Features are available at this moment for simd.

simd Examples and Code Snippets

No Code Snippets are available at this moment for simd.

Community Discussions

Trending Discussions on simd

C++ Optimize Memory Read Speed

how to use Apache Arrow to do "a + b + c*5 + d*3"?

Alignment of simd_packed vector in Swift (vs Metal Shader language)

OpenMP vectorised code runs way slower than O3 optimized code

How to combine constexpr and vectorized code?

Why is SIMD slower than scalar counterpart

error: reduction variable is private in outer context (omp reduction)

Writing a vector sum function with SIMD (System.Numerics) and making it faster than a for loop

Is it practical to use the "rude big hammer" approach to parallelize a MacOS/CoreAudio real-time audio callback?

MTKView Transparency

QUESTION

C++ Optimize Memory Read Speed

Asked 2021-Jun-14 at 20:17

I'm creating an int (32 bit) vector with 1024 * 1024 * 1024 elements like so:

...

ANSWER

Answered 2021-Jun-14 at 17:01

Here are some techniques.

Loop Unrolling

Source https://stackoverflow.com/questions/67974127

QUESTION

how to use Apache Arrow to do "a + b + c*5 + d*3"?

Asked 2021-Jun-14 at 12:27

I got the idea of using pre-defined functions to do this: calculate "a + b", "c * 5", "d * 3" and then add the result.

But this approach seems to generate a lot of code. Is there a better way to do this?

By the way, does Apache Arrow (C++ version) use SIMD by default? If not, how can I make it use SIMD?

...

ANSWER

Answered 2021-Jun-14 at 12:27

PyArrow doesn't currently override operators in Python, but you can easily call the arithmetic compute functions. (functools.reduce is used here since the addition kernel is binary, not n-ary.)

PyArrow automatically uses SIMD, based on the flags it was compiled with. It should use the 'highest' SIMD level that your CPU supports and that it was compiled for. Not all compute function implementations leverage SIMD internally. Right now it looks like it's mostly the aggregation kernels that do so.
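The answer's PyArrow code is not reproduced on this page. As a language-neutral illustration of the reduce pattern it mentions, here is a minimal Rust sketch that folds a binary element-wise addition over four terms. It does not use the Arrow API; the function names `add`, `mul_scalar`, and `combine` are hypothetical, and plain slices stand in for Arrow columns.

```rust
/// Element-wise binary addition "kernel": takes exactly two columns.
/// (Hypothetical stand-in for a binary compute function, not Arrow's API.)
fn add(a: &[i64], b: &[i64]) -> Vec<i64> {
    assert_eq!(a.len(), b.len());
    a.iter().zip(b).map(|(x, y)| x + y).collect()
}

/// Element-wise multiplication of a column by a scalar.
fn mul_scalar(a: &[i64], k: i64) -> Vec<i64> {
    a.iter().map(|x| x * k).collect()
}

/// Computes a + b + c*5 + d*3 by reducing the binary `add` over four terms.
pub fn combine(a: &[i64], b: &[i64], c: &[i64], d: &[i64]) -> Vec<i64> {
    let terms = vec![a.to_vec(), b.to_vec(), mul_scalar(c, 5), mul_scalar(d, 3)];
    terms
        .into_iter()
        // `add` only accepts two inputs, so fold it pairwise over the list.
        .reduce(|acc, t| add(&acc, &t))
        .expect("the term list is never empty")
}
```

Because the addition takes only two inputs, the reduction keeps the call site short no matter how many terms the expression has.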

Source https://stackoverflow.com/questions/67958321

QUESTION

Alignment of simd_packed vector in Swift (vs Metal Shader language)

Asked 2021-Jun-13 at 05:17

I have trouble understanding something about simd_packed vectors in the simd module in Swift. I use the example of float4, and I hope someone can help.

My understanding is that simd_float4 is a typealias of SIMD4<Float>, and MemoryLayout<SIMD4<Float>>.alignment = 16 (bytes), hence MemoryLayout<simd_float4>.alignment = 16. Makes sense.

But the following I do not understand: simd_packed_float4 is also a typealias of SIMD4<Float>. And so MemoryLayout<simd_packed_float4>.alignment = 16.

What is the point of the "packed" in simd_packed_float4, then? Where is the "relaxed alignment" that the documentation talks about?

In the Metal Shader Language Specification (Version 2.4) (https://developer.apple.com/metal/Metal-Shading-Language-Specification.pdf) in Table 2.4 (p.28), it says the alignment of packed_float4 is 4 (which is also the alignment of the scalar type, float), so this IS a "relaxed alignment" (as compared to the 16). That makes sense on its own, but how do I reconcile this with the above (simd_packed_float4 is a typealias of SIMD4<Float> and MemoryLayout<simd_packed_float4>.alignment = 16)?

...

ANSWER

Answered 2021-Jun-12 at 03:45

I actually think it's impossible to achieve relaxed alignment like this with a packed type in Swift. I think the Swift compiler just can't carry the alignment attributes over to the actual Swift interface.

I think this makes simd_packed_float4 useless in Swift.

I have made a playground to check this, and using it as it's intended doesn't work.

Source https://stackoverflow.com/questions/67943802

QUESTION

OpenMP vectorised code runs way slower than O3 optimized code

Asked 2021-Jun-11 at 14:46

I have a minimal reproducible sample, which is as follows:

...

ANSWER

Answered 2021-Jun-11 at 14:46

The non-OpenMP vectorizer is defeating your benchmark with loop inversion.
Make your function __attribute__((noinline, noclone)) to stop GCC from inlining it into the repeat loop. For cases like this with large enough functions that call/ret overhead is minor, and constant propagation isn't important, this is a pretty good way to make sure that the compiler doesn't hoist work out of the loop.

And in future, check the asm, and/or make sure the benchmark time scales linearly with the iteration count. e.g. increasing 500 up to 1000 should give the same average time in a benchmark that's working properly, but it won't with -O3. (Although it's surprisingly close here, so that smell test doesn't definitively detect the problem!)

After adding the missing #pragma omp simd to the code, yeah I can reproduce this. On i7-6700k Skylake (3.9GHz with DDR4-2666) with GCC 10.2 -O3 (without -march=native or -fopenmp), I get 18266, but with -O3 -fopenmp I get avg time 39772.

With the OpenMP vectorized version, if I look at top while it runs, memory usage (RSS) is steady at 771 MiB. (As expected: init code faults in the two inputs, and the first iteration of the timed region writes to result, triggering page-faults for it, too.)

But with the "normal" vectorizer (not OpenMP), I see the memory usage climb from ~500 MiB until it exits just as it reaches the max 770MiB.

So it looks like gcc -O3 performed some kind of loop inversion after inlining and defeated the memory-bandwidth-intensive aspect of your benchmark loop, only touching each array element once.

The asm shows the evidence: GCC 9.3 -O3 on Godbolt doesn't vectorize, and it leaves an empty inner loop instead of repeating the work.
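The same benchmarking pitfall exists outside GCC. As a rough sketch in Rust (not the C code from the question), `#[inline(never)]` plays a role similar to `__attribute__((noinline))`, and `std::hint::black_box` discourages the optimizer from hoisting or deleting the repeated work. The names `add_arrays` and `bench` are hypothetical, and the timing is only as reliable as the host it runs on.

```rust
use std::hint::black_box;
use std::time::Instant;

/// Kernel under test. `#[inline(never)]` keeps it from being inlined
/// into the repeat loop, where the optimizer could hoist its work.
#[inline(never)]
pub fn add_arrays(a: &[f32], b: &[f32], out: &mut [f32]) {
    for ((o, x), y) in out.iter_mut().zip(a).zip(b) {
        *o = x + y;
    }
}

/// Runs the kernel `reps` times and returns the average time
/// per repetition in nanoseconds.
pub fn bench(n: usize, reps: u32) -> f64 {
    let a = vec![1.0f32; n];
    let b = vec![2.0f32; n];
    let mut out = vec![0.0f32; n];
    let start = Instant::now();
    for _ in 0..reps {
        // black_box hides the arguments from the optimizer on every pass.
        add_arrays(
            black_box(a.as_slice()),
            black_box(b.as_slice()),
            black_box(out.as_mut_slice()),
        );
        // Treat the output as observed so the stores cannot be dropped.
        black_box(&out);
    }
    start.elapsed().as_nanos() as f64 / reps as f64
}
```

As the answer suggests, a quick sanity check is to double `reps` and confirm that the average time per repetition stays roughly the same.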

Source https://stackoverflow.com/questions/67937516

QUESTION

How to combine constexpr and vectorized code?

Asked 2021-Jun-01 at 14:43

I am working on a C++ intrinsic wrapper for x64 and NEON. I want my functions to be constexpr. My motivation is similar to the question "Constexpr and SSE intrinsics", but #pragma omp simd and intrinsics may not be supported by the compiler (GCC) in a constexpr function. The following code is just a demonstration (auto-vectorization is good enough for addition).

...

ANSWER

Answered 2021-Jun-01 at 14:43

Using std::is_constant_evaluated, you can get exactly what you want:

Source https://stackoverflow.com/questions/67726812

QUESTION

Why is SIMD slower than scalar counterpart

Asked 2021-May-30 at 16:13

This is yet another "SSE is slower than normal code! Why?" type of question.
I know that there are a bunch of similar questions, but they don't seem to match my situation.

I am trying to implement the Miller-Rabin primality test with Montgomery modular multiplication for fast modulo operations.
I tried to implement it in both a scalar and a SIMD way, and it turns out that the SIMD version was around 10% slower.
In case anyone is wondering, [esp+16] or [esp+12] points to the modular inverse of N.

I am really puzzled by the fact that psrldq, supposedly an instruction with latency 1, throughput 1c, and 1 uop, takes more time than pmuludq, which has latency 3, throughput 0.5c, and 1 uop.

Below is the code and the run-time analysis from Visual Studio, run on a Ryzen 5 3600.

Any ideas on how to improve the SIMD code, and/or why it is slower than the scalar code, are appreciated.

P.S. It seems like the run-time analysis is off by one instruction for some reason.

EDIT 1: the comment on the image was wrong, I attached a fixed version below:

...

ANSWER

Answered 2021-May-30 at 16:13

Your SIMD code wastes time mispredicting that test ebp, 1 / jnz branch. There’s no conditional move instruction in SSE, but you can still optimize away that test + branch with a few more instructions like this:
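The answer's assembly is not reproduced on this page. As a generic illustration of replacing a test-and-branch with a mask, the following scalar Rust sketch builds an all-ones or all-zeros mask from the low bit and uses AND/OR to pick a value, which is the same idea as the and/andnot/or blend used with SSE registers. The function names are hypothetical, and the compiler is free to emit different instructions than the source suggests.

```rust
/// Returns `a` if the low bit of `flag` is set, otherwise `b`,
/// using a mask instead of a conditional branch.
pub fn select_if_odd(flag: u32, a: u32, b: u32) -> u32 {
    // 0xFFFF_FFFF when the low bit is 1, 0 when it is 0.
    let mask = (flag & 1).wrapping_neg();
    (a & mask) | (b & !mask)
}

/// Lane-wise version: each mask lane must be all ones or all zeros,
/// mirroring how a compare result is used to blend two SIMD vectors.
pub fn select_lanes(mask: [u32; 4], a: [u32; 4], b: [u32; 4]) -> [u32; 4] {
    let mut out = [0u32; 4];
    for i in 0..4 {
        out[i] = (a[i] & mask[i]) | (b[i] & !mask[i]);
    }
    out
}
```

Both paths are always computed and one is discarded, so this only pays off when the branch is hard to predict and the discarded work is cheap.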

Source https://stackoverflow.com/questions/67761813

QUESTION

error: reduction variable is private in outer context (omp reduction)

Asked 2021-May-26 at 18:08

I am confused about the data-sharing scope of the variable acc in the following two cases. In case 1 I get the following compilation error: error: reduction variable ‘acc’ is private in outer context, whereas case 2 compiles without any issues.

According to this article, variables defined outside a parallel region are shared.

Why does adding for-loop parallelism privatize acc? How can I, in this case, accumulate the result calculated in the for-loop and distribute the loop's iteration space across a thread team?

case 1

...

ANSWER

Answered 2021-May-26 at 18:08

Your case 1 is violating OpenMP semantics, as there's an implicit parallel region (see OpenMP Language Terminology, "sequential part") that contains the definition of acc. Thus, acc is indeed private to that implicit parallel region. This is what the compiler complains about.

Your case 2 is different in that the simd construct is not a worksharing construct and thus has a different definition of the semantics of the reduction clause.

Case 1 would be correct if you wrote it this way:

Source https://stackoverflow.com/questions/67709887

QUESTION

Writing a vector sum function with SIMD (System.Numerics) and making it faster than a for loop

Asked 2021-May-21 at 18:27

I wrote a function to add up all the elements of a double[] array using SIMD (System.Numerics.Vector) and the performance is worse than the naïve method.

On my computer Vector.Count is 4 which means I could create an accumulator of 4 values and run through the array adding up the elements by groups.

For example a 10 element array, with a 4 element accumulator and 2 remaining elements I would get

...
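The scheme the question describes (a 4-element accumulator plus a pass over the leftover elements) can be sketched in Rust rather than C#. This is not the asker's code and does not use System.Numerics; a fixed array of four `f64` values stands in for the vector accumulator, and `LANES` and `sum_lanes` are hypothetical names. Whether the compiler turns the inner loop into real SIMD instructions depends on the target and optimization level.

```rust
/// Number of lanes in the accumulator (assumed to be 4, as in the question).
const LANES: usize = 4;

/// Sums a slice by accumulating four lanes in parallel, then adds
/// the lanes together and finishes with the leftover elements.
pub fn sum_lanes(data: &[f64]) -> f64 {
    let mut acc = [0.0f64; LANES];
    let mut chunks = data.chunks_exact(LANES);
    for chunk in &mut chunks {
        // Lane i of the accumulator only ever sees element i of each group.
        for (a, x) in acc.iter_mut().zip(chunk) {
            *a += x;
        }
    }
    // Horizontal add of the lanes, then the elements that did not fill a group.
    let lanes: f64 = acc.iter().sum();
    let tail: f64 = chunks.remainder().iter().sum();
    lanes + tail
}
```

For a 10-element input this processes two full groups of four and leaves two elements for the tail, matching the example in the question.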

ANSWER

Answered 2021-May-19 at 18:28

I would suggest you take a look at this article exploring SIMD performance in .NET.

The overall algorithm looks identical for summing using regular vectorization. One difference is that the multiplication can be avoided when slicing the array:

Source https://stackoverflow.com/questions/67605744

QUESTION

Is it practical to use the "rude big hammer" approach to parallelize a MacOS/CoreAudio real-time audio callback?

Asked 2021-May-20 at 13:46

First, some relevant background info: I've got a CoreAudio-based low-latency audio processing application that does various mixing and special effects on audio that is coming from an input device on a purpose-dedicated Mac (running the latest version of MacOS) and delivers the results back to one of the Mac's local audio devices.

In order to obtain the best/most reliable low-latency performance, this app is designed to hook in to CoreAudio's low-level audio-rendering callback (via AudioDeviceCreateIOProcID(), AudioDeviceStart(), etc) and every time the callback-function is called (from the CoreAudio's realtime context), it reads the incoming audio frames (e.g. 128 frames, 64 samples per frame), does the necessary math, and writes out the outgoing samples.

This all works quite well, but from everything I've read, Apple's CoreAudio implementation has an unwritten de-facto requirement that all real-time audio operations happen in a single thread. There are good reasons for this which I acknowledge (mainly that outside of SIMD/SSE/AVX instructions, which I already use, almost all of the mechanisms you might employ to co-ordinate parallelized behavior are not real-time-safe and therefore trying to use them would result in intermittently glitchy audio).

However, my co-workers and I are greedy, and nevertheless we'd like to do many more math-operations per sample-buffer than even the fastest single core could reliably execute in the brief time-window that is necessary to avoid audio-underruns and glitching.

My co-worker (who is fairly experienced at real-time audio processing on embedded/purpose-built Linux hardware) tells me that under Linux it is possible for a program to requisition exclusive access for one or more CPU cores, such that the OS will never try to use them for anything else. Once he has done this, he can run "bare metal" style code on that CPU that simply busy-waits/polls on an atomic variable until the "real" audio thread updates it to let the dedicated core know it's time to do its thing; at that point the dedicated core will run its math routines on the input samples and generate its output in a (hopefully) finite amount of time, at which point the "real" audio thread can gather the results (more busy-waiting/polling here) and incorporate them back into the outgoing audio buffer.

My question is, is this approach worth attempting under MacOS/X? (i.e. can a MacOS/X program, even one with root access, convince MacOS to give it exclusive access to some cores, and if so, will big ugly busy-waiting/polling loops on those cores (including the polling-loops necessary to synchronize the CoreAudio callback-thread relative to their input/output requirements) yield results that are reliably real-time enough that you might someday want to use them in front of a paying audience?)

It seems like something that might be possible in principle, but before I spend too much time banging my head against whatever walls might exist there, I'd like some input about whether this is an avenue worth pursuing on this platform.
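The busy-wait hand-off described above can be sketched in Rust with standard-library atomics. This is only an illustration of the polling protocol, under the assumption that one worker thread stands in for the dedicated core. It does not pin threads to cores or request real-time scheduling, both of which are OS-specific, and the names `offload_sum_of_squares`, `IDLE`, `WORK_READY`, and `DONE` are hypothetical.

```rust
use std::hint::spin_loop;
use std::sync::atomic::{AtomicU32, Ordering};
use std::thread;

const IDLE: u32 = 0;
const WORK_READY: u32 = 1;
const DONE: u32 = 2;

/// Hands `samples` to a polling worker thread and spins until it replies.
pub fn offload_sum_of_squares(samples: &[f32]) -> f32 {
    let state = AtomicU32::new(IDLE);
    let result_bits = AtomicU32::new(0);

    thread::scope(|s| {
        // Stand-in for the dedicated core: poll until work is published.
        s.spawn(|| {
            while state.load(Ordering::Acquire) != WORK_READY {
                spin_loop();
            }
            let sum: f32 = samples.iter().map(|x| x * x).sum();
            result_bits.store(sum.to_bits(), Ordering::Relaxed);
            // Release makes the result visible before DONE is observed.
            state.store(DONE, Ordering::Release);
        });

        // Stand-in for the audio callback: publish work, then poll for the result.
        state.store(WORK_READY, Ordering::Release);
        while state.load(Ordering::Acquire) != DONE {
            spin_loop();
        }
        f32::from_bits(result_bits.load(Ordering::Relaxed))
    })
}
```

No locks or allocations happen between publishing the work and reading the result, but the caller still burns a core while it waits, and nothing here stops the OS from preempting either thread.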

...

ANSWER

Answered 2021-May-20 at 13:46

can a MacOS/X program, even one with root access, convince MacOS to give it exclusive access to some cores

I don't know about that, but you can use as many cores / real-time threads as you want for your calculations, using whatever synchronisation methods you need to make it work, then pass the audio to your IOProc using a lock free ring buffer, like TPCircularBuffer.

But your question reminded me of a new macOS 11/iOS 14 API I've been meaning to try, the Audio Workgroups API (2020 WWDC Video).

My understanding is that this API lets you "bless" your non-IOProc real-time threads with audio real-time thread properties or at least cooperate better with the audio thread.

The documents distinguish between the threads working in parallel (this sounds like your case) and working asynchronously (this sounds like my proposal), I don't know which case is better for you.

I still don't know what happens in practice when you use Audio Workgroups, whether they opt you in to good stuff or opt you out of bad stuff, but if they're not the hammer you're seeking, they may have some useful hammer-like properties.

Source https://stackoverflow.com/questions/67601620

QUESTION

MTKView Transparency

Asked 2021-May-17 at 09:09

I can't make my MTKView clear its background. I've set the view's and its layer's isOpaque to false, background color to clear and tried multiple solutions found on google/stackoverflow (most in the code below like loadAction and clearColor of color attachment) but nothing works.

All the background color settings seem to be ignored. Setting loadAction and clearColor of MTLRenderPassColorAttachmentDescriptor does nothing.

I'd like to have my regular UIView's drawn under the MTKView. What am I missing?

...

ANSWER

Answered 2021-May-17 at 09:09

Thanks to Frank, the answer was to just set the clearColor property of the view itself, which I missed. I also removed most adjustments in the MTLRenderPipelineDescriptor, whose code is now:

Source https://stackoverflow.com/questions/67487986

Community Discussions, Code Snippets contain sources that include Stack Exchange Network

Vulnerabilities

No vulnerabilities reported

Install simd

You can download it from GitHub.
Rust is installed and managed by the rustup tool. Rust has a 6-week rapid release process and supports a great number of platforms, so there are many builds of Rust available at any time. Please refer to rust-lang.org for more information.

Support

For new features, suggestions, and bugs, create an issue on GitHub. If you have questions, check the Stack Overflow community page and ask there.