Code-used-on-Daniel-Lemire-s-blog | This is a repository for the code posted on my blog
This code is meant to illustrate ideas that I present on my blog. Don't expect or ask for industrial-strength software. It is experimental code: it can be wrong, slow, poorly coded and poorly documented. I do maintain some software meant for actual use, with bona fide unit testing and documentation. The code here does not fit in this category.
Community Discussions
QUESTION
I will preface this by saying that I am a complete beginner at SIMD intrinsics.

Essentially, I have a CPU which supports the AVX2 instruction set (Intel(R) Core(TM) i5-7500T CPU @ 2.70GHz). I would like to know the fastest way to compute the dot product of two std::vector of size 512.

I have done some digging online and found this and this, and this Stack Overflow question suggests using the following function: __m256 _mm256_dp_ps(__m256 m1, __m256 m2, const int mask);. However, these all suggest different ways of performing the dot product, and I am not sure which is the correct (and fastest) way to do it.

In particular, I am looking for the fastest way to perform the dot product for a vector of size 512 (because I know the vector size affects the implementation).

Thank you for your help.
Edit 1: I am also a little confused about the -mavx2 gcc flag. If I use these AVX2 functions, do I need to add the flag when I compile? Also, is gcc able to do these optimizations for me (say, if I use the -Ofast gcc flag) if I write a naive dot product implementation?

Edit 2: If anyone has the time and energy, I would very much appreciate it if you could write a full implementation. I am sure other beginners would also value this information.
ANSWER

Answered 2020-Jan-01 at 04:13

_mm256_dp_ps is only useful for dot products of 2 to 4 elements; for longer vectors, use vertical SIMD in a loop and reduce to scalar at the end. Using _mm256_dp_ps and _mm256_add_ps in a loop would be much slower.
GCC and clang require you to enable (with command line options) ISA extensions that you use intrinsics for, unlike MSVC and ICC.
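For the -Ofast part of Edit 1: a plain scalar loop like the sketch below (the function name is illustrative) can be auto-vectorized by gcc and clang, but only when the compiler is allowed to reassociate floating-point additions, e.g. with -Ofast or -O3 -ffast-math, because FP addition is not associative.

```cpp
#include <cstddef>

// A naive dot product. GCC/clang can auto-vectorize this reduction,
// but only under -Ofast or -O3 -ffast-math (reassociation of the FP
// sum is required); plain -O3 keeps it scalar to preserve FP order.
float dot_naive(const float *a, const float *b, std::size_t n) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}
```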
The intrinsics code below is probably close to the theoretical performance limit of your CPU. Untested.

Compile it with clang or gcc -O3 -march=native. (It requires at least -mavx -mfma, but the -mtune options implied by -march are good too, and so are -mpopcnt and the other things -march=native enables. Tune options are critical for this to compile efficiently for most CPUs with FMA, specifically -mno-avx256-split-unaligned-load: see Why doesn't gcc resolve _mm256_loadu_pd as single vmovupd?)

Or compile it with MSVC: -O2 -arch:AVX2
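A minimal sketch of the vertical-FMA loop described above, assuming float data and a length that is a multiple of 32 (as 512 is); the function and helper names are illustrative, and multiple accumulators are used to hide FMA latency:

```cpp
#include <immintrin.h>
#include <cstddef>

// Horizontal sum of the 8 floats in an __m256 (illustrative helper).
static inline float hsum256_ps(__m256 v) {
    __m128 lo = _mm256_castps256_ps128(v);    // low 128 bits
    __m128 hi = _mm256_extractf128_ps(v, 1);  // high 128 bits
    lo = _mm_add_ps(lo, hi);                  // 8 -> 4
    __m128 shuf = _mm_movehdup_ps(lo);        // duplicate odd elements
    __m128 sums = _mm_add_ps(lo, shuf);       // 4 -> 2
    shuf = _mm_movehl_ps(shuf, sums);         // high pair -> low
    sums = _mm_add_ss(sums, shuf);            // 2 -> 1
    return _mm_cvtss_f32(sums);
}

// Dot product of two float arrays; assumes n is a multiple of 32.
// Four accumulators let independent FMAs overlap in the pipeline
// instead of serializing on one accumulator's latency.
float dot_avx2(const float *a, const float *b, std::size_t n) {
    __m256 acc0 = _mm256_setzero_ps();
    __m256 acc1 = _mm256_setzero_ps();
    __m256 acc2 = _mm256_setzero_ps();
    __m256 acc3 = _mm256_setzero_ps();
    for (std::size_t i = 0; i + 32 <= n; i += 32) {
        acc0 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i),
                               _mm256_loadu_ps(b + i), acc0);
        acc1 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i + 8),
                               _mm256_loadu_ps(b + i + 8), acc1);
        acc2 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i + 16),
                               _mm256_loadu_ps(b + i + 16), acc2);
        acc3 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i + 24),
                               _mm256_loadu_ps(b + i + 24), acc3);
    }
    // Combine the accumulators, then reduce to a scalar at the end.
    acc0 = _mm256_add_ps(_mm256_add_ps(acc0, acc1),
                         _mm256_add_ps(acc2, acc3));
    return hsum256_ps(acc0);
}
```

With two std::vector<float> of size 512, this would be called as dot_avx2(v1.data(), v2.data(), v1.size()).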
Community Discussions and Code Snippets contain sources that include the Stack Exchange Network.