Code-used-on-Daniel-Lemire-s-blog | This is a repository for the code posted on my blog
This code is meant to illustrate ideas that I present on my blog. Don't expect or ask for industrial-strength software. It is experimental code: it can be wrong, slow, poorly coded and poorly documented. I do maintain some software meant for actual use, with bona fide unit testing and documentation. The code here does not fit in this category.
Community Discussions
QUESTION
I will preface this by saying that I am a complete beginner at SIMD intrinsics.

Essentially, I have a CPU which supports the AVX2 instruction set (Intel(R) Core(TM) i5-7500T CPU @ 2.70GHz). I would like to know the fastest way to compute the dot product of two std::vector of size 512.

I have done some digging online and found this and this, and this Stack Overflow question suggests using the following function: __m256 _mm256_dp_ps(__m256 m1, __m256 m2, const int mask);. However, these all suggest different ways of performing the dot product, and I am not sure which is the correct (and fastest) way to do it.

In particular, I am looking for the fastest way to perform the dot product for a vector of size 512 (because I know the vector size affects the implementation).

Thank you for your help.
Edit 1: I am also a little confused about the -mavx2 gcc flag. If I use these AVX2 functions, do I need to add the flag when I compile? Also, is gcc able to do these optimizations for me (say, if I use the -Ofast gcc flag) if I write a naive dot product implementation?

Edit 2: If anyone has the time and energy, I would very much appreciate it if you could write a full implementation. I am sure other beginners would also value this information.
ANSWER

Answered 2020-Jan-01 at 04:13

_mm256_dp_ps is only useful for dot products of 2 to 4 elements; for longer vectors, use vertical SIMD in a loop and reduce to scalar at the end. Using _mm256_dp_ps and _mm256_add_ps in a loop would be much slower.
GCC and clang require you to enable (with command line options) ISA extensions that you use intrinsics for, unlike MSVC and ICC.
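For the -Ofast part of Edit 1: a plain scalar loop like the sketch below (the function name is illustrative) can be auto-vectorized by gcc and clang, but only when the compiler is allowed to reassociate floating-point additions, e.g. with -Ofast or -O3 -ffast-math, because FP addition is not associative.

```cpp
#include <cstddef>

// A naive dot product. GCC/clang can auto-vectorize this reduction,
// but only under -Ofast or -O3 -ffast-math (reassociation of the FP
// sum is required); plain -O3 keeps it scalar to preserve FP order.
float dot_naive(const float *a, const float *b, std::size_t n) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}
```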
The intrinsics code below is probably close to the theoretical performance limit of your CPU. Untested.

Compile it with clang or gcc -O3 -march=native. (It requires at least -mavx -mfma, but the -mtune options implied by -march are good too, and so are -mpopcnt and the other things -march=native enables. Tune options are critical for this to compile efficiently for most CPUs with FMA, specifically -mno-avx256-split-unaligned-load: see Why doesn't gcc resolve _mm256_loadu_pd as single vmovupd?)

Or compile it with MSVC: -O2 -arch:AVX2
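A minimal sketch of the vertical-FMA loop described above, assuming float data and a length that is a multiple of 32 (as 512 is); the function and helper names are illustrative, and multiple accumulators are used to hide FMA latency:

```cpp
#include <immintrin.h>
#include <cstddef>

// Horizontal sum of the 8 floats in an __m256 (illustrative helper).
static inline float hsum256_ps(__m256 v) {
    __m128 lo = _mm256_castps256_ps128(v);    // low 128 bits
    __m128 hi = _mm256_extractf128_ps(v, 1);  // high 128 bits
    lo = _mm_add_ps(lo, hi);                  // 8 -> 4
    __m128 shuf = _mm_movehdup_ps(lo);        // duplicate odd elements
    __m128 sums = _mm_add_ps(lo, shuf);       // 4 -> 2
    shuf = _mm_movehl_ps(shuf, sums);         // high pair -> low
    sums = _mm_add_ss(sums, shuf);            // 2 -> 1
    return _mm_cvtss_f32(sums);
}

// Dot product of two float arrays; assumes n is a multiple of 32.
// Four accumulators let independent FMAs overlap in the pipeline
// instead of serializing on one accumulator's latency.
float dot_avx2(const float *a, const float *b, std::size_t n) {
    __m256 acc0 = _mm256_setzero_ps();
    __m256 acc1 = _mm256_setzero_ps();
    __m256 acc2 = _mm256_setzero_ps();
    __m256 acc3 = _mm256_setzero_ps();
    for (std::size_t i = 0; i + 32 <= n; i += 32) {
        acc0 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i),
                               _mm256_loadu_ps(b + i), acc0);
        acc1 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i + 8),
                               _mm256_loadu_ps(b + i + 8), acc1);
        acc2 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i + 16),
                               _mm256_loadu_ps(b + i + 16), acc2);
        acc3 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i + 24),
                               _mm256_loadu_ps(b + i + 24), acc3);
    }
    // Combine the accumulators, then reduce to a scalar at the end.
    acc0 = _mm256_add_ps(_mm256_add_ps(acc0, acc1),
                         _mm256_add_ps(acc2, acc3));
    return hsum256_ps(acc0);
}
```

With two std::vector<float> of size 512, this would be called as dot_avx2(v1.data(), v2.data(), v1.size()).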
Community Discussions and Code Snippets contain sources that include the Stack Exchange Network.