Simd | C++ image processing and machine learning | Computer Vision library

by ermig1979 · C++ · Version: v5.2.123 · License: MIT

kandi X-RAY | Simd Summary

Simd is a C++ library typically used in Artificial Intelligence and Computer Vision applications. Simd has no reported bugs or vulnerabilities, has a permissive license, and has medium support. You can download it from GitHub.
The Simd Library is a free, open-source image processing and machine learning library designed for C and C++ programmers. It provides many high-performance algorithms for image processing, such as pixel format conversion, image scaling and filtering, extraction of statistical information from images, motion detection, object detection (HAAR and LBP classifier cascades) and classification, and neural networks. The algorithms are optimized using different SIMD CPU extensions; in particular, the library supports the following: SSE, AVX and AVX-512 for x86/x64, VMX (Altivec) and VSX (Power7) for PowerPC (big-endian), and NEON for ARM. The Simd Library has a C API and also contains useful C++ classes and functions to facilitate access to the C API. The library supports dynamic and static linking; 32-bit and 64-bit Windows and Linux; MSVS, GCC and Clang compilers; and MSVS project and CMake build systems.

kandi-support Support

Simd has a medium active ecosystem.
It has 1740 star(s) with 380 fork(s). There are 115 watchers for this library.
There were 6 major release(s) in the last 6 months.
There are 20 open issues and 190 have been closed. On average issues are closed in 160 days. There are no pull requests.
It has a neutral sentiment in the developer community.
The latest version of Simd is v5.2.123.

kandi-Quality Quality

Simd has no bugs reported.

kandi-Security Security

Simd has no vulnerabilities reported, and its dependent libraries have no vulnerabilities reported.

kandi-License License

Simd is licensed under the MIT License. This license is Permissive.
Permissive licenses have the least restrictions, and you can use them in most projects.

kandi-Reuse Reuse

Simd releases are available to install and integrate.
Installation instructions are not available. Examples and code snippets are available.

                                                                                  Simd Key Features

C++ image processing and machine learning library using SIMD: SSE, AVX, AVX-512 and AMX for x86/x64, VMX (Altivec) and VSX (Power7) for PowerPC, NEON for ARM.

                                                                                  Simd Examples and Code Snippets

                                                                                  No Code Snippets are available at this moment for Simd.
Community Discussions

Trending Discussions on Simd

C++ Optimize Memory Read Speed
how to use Apache Arrow to do "a + b + c*5 + d*3"?
Alignment of simd_packed vector in Swift (vs Metal Shader language)
OpenMP vectorised code runs way slower than O3 optimized code
How to combine constexpr and vectorized code?
Why is SIMD slower than scalar counterpart
error: reduction variable is private in outer context (omp reduction)
Writing a vector sum function with SIMD (System.Numerics) and making it faster than a for loop
Is it practical to use the "rude big hammer" approach to parallelize a MacOS/CoreAudio real-time audio callback?
MTKView Transparency

                                                                                  QUESTION

                                                                                  C++ Optimize Memory Read Speed
                                                                                  Asked 2021-Jun-14 at 20:17

                                                                                  I'm creating an int (32 bit) vector with 1024 * 1024 * 1024 elements like so:

std::vector<int> nums;
for (size_t i = 0; i < 1024 * 1024 * 1024; i++) {
   nums.push_back(rand() % 1024);
}
                                                                                  

                                                                                  which holds 4 GB of random data at that point. And then I'm simply summing up all the elements in the vector like so:

                                                                                  uint64_t total = 0;
                                                                                  for (auto cn = nums.begin(); cn < nums.end(); cn++) {
                                                                                     total += *cn;
                                                                                  }
                                                                                  

                                                                                  This takes about ~0.18 seconds which means the data is processed at around 22.2 GB/s. I'm running this on an M1 with a much higher memory bandwidth of about 60GB/s. Is there a way to make the above code run faster on a single core?

                                                                                  EDIT: Manual SIMD version:

int32x4_t simd_total = vmovq_n_s32(0);
for (auto cn = nums.begin(); cn < nums.end()-3; cn += 4) {
    const int32_t v[4] = {cn[0], cn[1], cn[2], cn[3]};
    simd_total = vaddq_s32(simd_total, vld1q_s32(v));
}
return vaddvq_s32(simd_total);
                                                                                  

                                                                                  The SIMD version has the same performance as the non-manual-SIMD version.

EDIT 2: Alright, so I changed the vector elements to uint32_t and also changed the result type to uint32_t (as suggested by @Peter Cordes):

uint32_t sum_ints_32(const std::vector<uint32_t>& nums) {
    uint32_t total = 0;
    for (auto cn = nums.begin(); cn < nums.end(); cn++) {
        total += *cn;
    }
    return total;
}
                                                                                  

This runs much faster (~45 GB/s): keeping the accumulator the same 32-bit width as the elements lets the compiler vectorize the sum directly, without widening each lane to 64 bits. (With 2^30 elements of up to 1023, the true sum exceeds 32 bits, so total wraps modulo 2^32; that is fine when only measuring throughput.) This is the disassembly:

                                                                                  0000000100002218 <__Z11sum_ints_32RKNSt3__16vectorIjNS_9allocatorIjEEEE>:
                                                                                     100002218:   a940200c    ldp x12, x8, [x0]
                                                                                     10000221c:   eb08019f    cmp x12, x8
                                                                                     100002220:   54000102    b.cs    100002240 <__Z11sum_ints_32RKNSt3__16vectorIjNS_9allocatorIjEEEE+0x28>  // b.hs, b.nlast
                                                                                     100002224:   aa2c03e9    mvn x9, x12
                                                                                     100002228:   8b090109    add x9, x8, x9
                                                                                     10000222c:   f1006d3f    cmp x9, #0x1b
                                                                                     100002230:   540000c8    b.hi    100002248 <__Z11sum_ints_32RKNSt3__16vectorIjNS_9allocatorIjEEEE+0x30>  // b.pmore
                                                                                     100002234:   52800000    mov w0, #0x0                    // #0
                                                                                     100002238:   aa0c03e9    mov x9, x12
                                                                                     10000223c:   14000016    b   100002294 <__Z11sum_ints_32RKNSt3__16vectorIjNS_9allocatorIjEEEE+0x7c>
                                                                                     100002240:   52800000    mov w0, #0x0                    // #0
                                                                                     100002244:   d65f03c0    ret
                                                                                     100002248:   d342fd29    lsr x9, x9, #2
                                                                                     10000224c:   9100052a    add x10, x9, #0x1
                                                                                     100002250:   927ded4b    and x11, x10, #0x7ffffffffffffff8
                                                                                     100002254:   8b0b0989    add x9, x12, x11, lsl #2
                                                                                     100002258:   9100418c    add x12, x12, #0x10
                                                                                     10000225c:   6f00e400    movi    v0.2d, #0x0
                                                                                     100002260:   aa0b03ed    mov x13, x11
                                                                                     100002264:   6f00e401    movi    v1.2d, #0x0
                                                                                     100002268:   ad7f8d82    ldp q2, q3, [x12, #-16]
                                                                                     10000226c:   4ea08440    add v0.4s, v2.4s, v0.4s
                                                                                     100002270:   4ea18461    add v1.4s, v3.4s, v1.4s
                                                                                     100002274:   9100818c    add x12, x12, #0x20
                                                                                     100002278:   f10021ad    subs    x13, x13, #0x8
                                                                                     10000227c:   54ffff61    b.ne    100002268 <__Z11sum_ints_32RKNSt3__16vectorIjNS_9allocatorIjEEEE+0x50>  // b.any
                                                                                     100002280:   4ea08420    add v0.4s, v1.4s, v0.4s
                                                                                     100002284:   4eb1b800    addv    s0, v0.4s
                                                                                     100002288:   1e260000    fmov    w0, s0
                                                                                     10000228c:   eb0b015f    cmp x10, x11
                                                                                     100002290:   540000a0    b.eq    1000022a4 <__Z11sum_ints_32RKNSt3__16vectorIjNS_9allocatorIjEEEE+0x8c>  // b.none
                                                                                     100002294:   b840452a    ldr w10, [x9], #4
                                                                                     100002298:   0b000140    add w0, w10, w0
                                                                                     10000229c:   eb08013f    cmp x9, x8
                                                                                     1000022a0:   54ffffa3    b.cc    100002294 <__Z11sum_ints_32RKNSt3__16vectorIjNS_9allocatorIjEEEE+0x7c>  // b.lo, b.ul, b.last
                                                                                     1000022a4:   d65f03c0    ret
                                                                                  

                                                                                  I also rewrote the Manual-SIMD version:

uint32_t sum_ints_simd_2(const std::vector<uint32_t>& nums) {
    uint32x4_t simd_total = vmovq_n_u32(0);
    for (auto cn = nums.begin(); cn < nums.end()-3; cn += 4) {
        const uint32_t v[4] = { cn[0], cn[1], cn[2], cn[3] };
        simd_total = vaddq_u32(simd_total, vld1q_u32(v));
    }
    return vaddvq_u32(simd_total);
}
                                                                                  

                                                                                  which still runs 2x slower than the non-manual-SIMD version and results in the following disassembly:

                                                                                  0000000100002464 <__Z15sum_ints_simd_2RKNSt3__16vectorIjNS_9allocatorIjEEEE>:
                                                                                     100002464:   a9402408    ldp x8, x9, [x0]
                                                                                     100002468:   d1003129    sub x9, x9, #0xc
                                                                                     10000246c:   6f00e400    movi    v0.2d, #0x0
                                                                                     100002470:   eb09011f    cmp x8, x9
                                                                                     100002474:   540000c2    b.cs    10000248c <__Z15sum_ints_simd_2RKNSt3__16vectorIjNS_9allocatorIjEEEE+0x28>  // b.hs, b.nlast
                                                                                     100002478:   6f00e400    movi    v0.2d, #0x0
                                                                                     10000247c:   3cc10501    ldr q1, [x8], #16
                                                                                     100002480:   4ea08420    add v0.4s, v1.4s, v0.4s
                                                                                     100002484:   eb09011f    cmp x8, x9
                                                                                     100002488:   54ffffa3    b.cc    10000247c <__Z15sum_ints_simd_2RKNSt3__16vectorIjNS_9allocatorIjEEEE+0x18>  // b.lo, b.ul, b.last
                                                                                     10000248c:   4eb1b800    addv    s0, v0.4s
                                                                                     100002490:   1e260000    fmov    w0, s0
                                                                                     100002494:   d65f03c0    ret
                                                                                  

To reach the same speed as the auto-vectorized version, we can use a uint32x4x2_t instead of a uint32x4_t in our manual-SIMD version; two independent vector accumulators break the loop-carried dependency, so consecutive adds can overlap instead of each one waiting on the previous result:

uint32_t sum_ints_simd_3(const std::vector<uint32_t>& nums) {
    uint32x4x2_t simd_total;
    simd_total.val[0] = vmovq_n_u32(0);
    simd_total.val[1] = vmovq_n_u32(0);
    for (auto cn = nums.begin(); cn < nums.end()-7; cn += 8) {
        const uint32_t v[4] = { cn[0], cn[1], cn[2], cn[3] };
        const uint32_t v2[4] = { cn[4], cn[5], cn[6], cn[7] };
        simd_total.val[0] = vaddq_u32(simd_total.val[0], vld1q_u32(v));
        simd_total.val[1] = vaddq_u32(simd_total.val[1], vld1q_u32(v2));
    }
    return vaddvq_u32(simd_total.val[0]) + vaddvq_u32(simd_total.val[1]);
}
                                                                                  

And to gain even more speed we can leverage uint32x4x4_t, i.e. four independent accumulators (which gets us to about 53 GB/s):

uint32_t sum_ints_simd_4(const std::vector<uint32_t>& nums) {
    uint32x4x4_t simd_total;
    simd_total.val[0] = vmovq_n_u32(0);
    simd_total.val[1] = vmovq_n_u32(0);
    simd_total.val[2] = vmovq_n_u32(0);
    simd_total.val[3] = vmovq_n_u32(0);
    for (auto cn = nums.begin(); cn < nums.end()-15; cn += 16) {
        const uint32_t v[4] = { cn[0], cn[1], cn[2], cn[3] };
        const uint32_t v2[4] = { cn[4], cn[5], cn[6], cn[7] };
        const uint32_t v3[4] = { cn[8], cn[9], cn[10], cn[11] };
        const uint32_t v4[4] = { cn[12], cn[13], cn[14], cn[15] };
        simd_total.val[0] = vaddq_u32(simd_total.val[0], vld1q_u32(v));
        simd_total.val[1] = vaddq_u32(simd_total.val[1], vld1q_u32(v2));
        simd_total.val[2] = vaddq_u32(simd_total.val[2], vld1q_u32(v3));
        simd_total.val[3] = vaddq_u32(simd_total.val[3], vld1q_u32(v4));
    }
    return vaddvq_u32(simd_total.val[0])
        + vaddvq_u32(simd_total.val[1])
        + vaddvq_u32(simd_total.val[2])
        + vaddvq_u32(simd_total.val[3]);
}
                                                                                  

                                                                                  which gets us the following disassembly:

                                                                                  0000000100005e34 <__Z15sum_ints_simd_4RKNSt3__16vectorIjNS_9allocatorIjEEEE>:
                                                                                     100005e34:   a9402408    ldp x8, x9, [x0]
                                                                                     100005e38:   d100f129    sub x9, x9, #0x3c
                                                                                     100005e3c:   6f00e403    movi    v3.2d, #0x0
                                                                                     100005e40:   6f00e402    movi    v2.2d, #0x0
                                                                                     100005e44:   6f00e401    movi    v1.2d, #0x0
                                                                                     100005e48:   6f00e400    movi    v0.2d, #0x0
                                                                                     100005e4c:   eb09011f    cmp x8, x9
                                                                                     100005e50:   540001c2    b.cs    100005e88 <__Z15sum_ints_simd_4RKNSt3__16vectorIjNS_9allocatorIjEEEE+0x54>  // b.hs, b.nlast
                                                                                     100005e54:   6f00e400    movi    v0.2d, #0x0
                                                                                     100005e58:   6f00e401    movi    v1.2d, #0x0
                                                                                     100005e5c:   6f00e402    movi    v2.2d, #0x0
                                                                                     100005e60:   6f00e403    movi    v3.2d, #0x0
                                                                                     100005e64:   ad401504    ldp q4, q5, [x8]
                                                                                     100005e68:   ad411d06    ldp q6, q7, [x8, #32]
                                                                                     100005e6c:   4ea38483    add v3.4s, v4.4s, v3.4s
                                                                                     100005e70:   4ea284a2    add v2.4s, v5.4s, v2.4s
                                                                                     100005e74:   4ea184c1    add v1.4s, v6.4s, v1.4s
                                                                                     100005e78:   4ea084e0    add v0.4s, v7.4s, v0.4s
                                                                                     100005e7c:   91010108    add x8, x8, #0x40
                                                                                     100005e80:   eb09011f    cmp x8, x9
                                                                                     100005e84:   54ffff03    b.cc    100005e64 <__Z15sum_ints_simd_4RKNSt3__16vectorIjNS_9allocatorIjEEEE+0x30>  // b.lo, b.ul, b.last
                                                                                     100005e88:   4eb1b863    addv    s3, v3.4s
                                                                                     100005e8c:   1e260068    fmov    w8, s3
                                                                                     100005e90:   4eb1b842    addv    s2, v2.4s
                                                                                     100005e94:   1e260049    fmov    w9, s2
                                                                                     100005e98:   0b080128    add w8, w9, w8
                                                                                     100005e9c:   4eb1b821    addv    s1, v1.4s
                                                                                     100005ea0:   1e260029    fmov    w9, s1
                                                                                     100005ea4:   0b090108    add w8, w8, w9
                                                                                     100005ea8:   4eb1b800    addv    s0, v0.4s
                                                                                     100005eac:   1e260009    fmov    w9, s0
                                                                                     100005eb0:   0b090100    add w0, w8, w9
                                                                                     100005eb4:   d65f03c0    ret
                                                                                  

                                                                                  Crazy stuff

                                                                                  ANSWER

                                                                                  Answered 2021-Jun-14 at 17:01

                                                                                  Here are some techniques.

                                                                                  Loop Unrolling
                                                                                  uint64_t total = 0;
                                                                                  for (auto cn = nums.begin(); cn < nums.end(); cn += 4)
                                                                                  {
                                                                                      total += cn[0];
                                                                                      total += cn[1];
                                                                                      total += cn[2];
                                                                                      total += cn[3];
                                                                                  }
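
A caveat worth adding (my note, not part of the original answer): the unrolled loop above assumes the element count is a multiple of 4 and would read past the end otherwise. A sketch with a scalar tail loop:

```cpp
#include <cstdint>
#include <vector>

uint64_t sum_unrolled(const std::vector<uint32_t>& nums)
{
    uint64_t total = 0;
    auto cn = nums.begin();
    // main loop: 4 elements per iteration
    for (; nums.end() - cn >= 4; cn += 4)
    {
        total += cn[0];
        total += cn[1];
        total += cn[2];
        total += cn[3];
    }
    // scalar tail for the leftover 0-3 elements
    for (; cn != nums.end(); ++cn)
        total += *cn;
    return total;
}
```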
                                                                                  
                                                                                  Register Prefetch
                                                                                  uint64_t total = 0;
                                                                                  for (auto cn = nums.begin(); cn < nums.end(); cn += 4)
                                                                                  {
    const uint64_t n0 = cn[0];
    const uint64_t n1 = cn[1];
    const uint64_t n2 = cn[2];
    const uint64_t n3 = cn[3];
                                                                                      total += n0;
                                                                                      total += n1;
                                                                                      total += n2;
                                                                                      total += n3;
                                                                                  }
                                                                                  

                                                                                  You should print the assembly language for each of these at high optimization level and compare them.

Also, your processor may have some specialized instructions that you could use. For example, the ARM processor can load multiple registers from memory with one instruction.

                                                                                  Also, look up SIMD instructions or search the internet for "C++ SIMD read memory".
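
To make that concrete (my sketch, using GCC/Clang vector extensions rather than platform-specific intrinsics, so it compiles for both x86 and ARM), a four-accumulator sum similar to the disassembly above might look like:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

typedef uint32_t v4u32 __attribute__((vector_size(16)));  // 4 uint32 lanes

static inline v4u32 load4(const uint32_t* p)
{
    v4u32 v;
    std::memcpy(&v, p, sizeof v);  // unaligned-safe load
    return v;
}

// Assumes nums.size() is a multiple of 16, like the code in the disassembly.
uint32_t sum_ints_simd(const std::vector<uint32_t>& nums)
{
    v4u32 acc0{}, acc1{}, acc2{}, acc3{};  // 4 accumulators hide add latency
    const uint32_t* p = nums.data();
    for (size_t i = 0; i < nums.size(); i += 16)
    {
        acc0 += load4(p + i);
        acc1 += load4(p + i + 4);
        acc2 += load4(p + i + 8);
        acc3 += load4(p + i + 12);
    }
    v4u32 acc = acc0 + acc1 + acc2 + acc3;
    return acc[0] + acc[1] + acc[2] + acc[3];  // horizontal sum (the addv steps)
}
```

On ARM the compiler lowers the lane adds to the same `add v.4s` / `addv` pattern shown in the disassembly above.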

I've argued with compilers (on embedded systems) and found that the compiler's optimization strategies can be equal to or better than instruction specialization or other hand-applied techniques (timings were performed using test points and an oscilloscope).

You'll have to remember that your task, on a one-core machine, will most likely be swapped out more often than on a system with multiple cores or a specialized (embedded) system.

                                                                                  Source https://stackoverflow.com/questions/67974127

                                                                                  QUESTION

                                                                                  how to use Apache Arrow to do "a + b + c*5 + d*3"?
                                                                                  Asked 2021-Jun-14 at 12:27

                                                                                  I got the idea of using pre-defined functions to do this: calculate "a + b", "c * 5", "d * 3" and then add the result.

But this way seems to generate a lot of code. Are there any better methods to do this?

By the way, does Apache Arrow use SIMD by default (C++ version)? If not, how can I make it use SIMD?

                                                                                  ANSWER

                                                                                  Answered 2021-Jun-14 at 12:27

                                                                                  PyArrow doesn't currently override operators in Python, but you can easily call the arithmetic compute functions. (functools.reduce is used here since the addition kernel is binary, not n-ary.)

PyArrow automatically uses SIMD, based on what flags it was compiled with. It should use the highest SIMD level that both your CPU supports and it was compiled for. Not all compute function implementations leverage SIMD internally; right now it looks like it's mostly the aggregation kernels that do so.

                                                                                  >>> import pyarrow as pa
                                                                                  >>> import pyarrow.compute as pc
                                                                                  >>> import functools
                                                                                  >>> pa.__version__
                                                                                  '4.0.1'
                                                                                  >>> a = pa.array([1,2,3])
                                                                                  >>> b = pa.array([3,4,5])
                                                                                  >>> c = pa.array([1,0,1])
                                                                                  >>> d = pa.array([2,4,2])
                                                                                  >>> functools.reduce(pc.add, [pc.add(a,b), pc.multiply(c, 5), pc.multiply(d, 3)])
                                                                                  
                                                                                  [
                                                                                    15,
                                                                                    18,
                                                                                    19
                                                                                  ]
                                                                                  

                                                                                  Source https://stackoverflow.com/questions/67958321

                                                                                  QUESTION

                                                                                  Alignment of simd_packed vector in Swift (vs Metal Shader language)
                                                                                  Asked 2021-Jun-13 at 05:17

                                                                                  I have trouble understanding something about simd_packed vectors in the simd module in Swift. I use the example of float4, I hope someone can help.

My understanding is that simd_float4 is a typealias of SIMD4<Float>, and MemoryLayout<SIMD4<Float>>.alignment = 16 (bytes), hence MemoryLayout<simd_float4>.alignment = 16. Makes sense.

But the following I do not understand: simd_packed_float4 is also a typealias of SIMD4<Float>. And so MemoryLayout<simd_packed_float4>.alignment = 16.

                                                                                  What is the point of the "packed" in simd_packed_float4, then? Where is the "relaxed alignment" that the documentation talks about?

In the Metal Shading Language Specification (Version 2.4) ( https://developer.apple.com/metal/Metal-Shading-Language-Specification.pdf ), Table 2.4 (p. 28) says the alignment of packed_float4 is 4 (which is also the alignment of the scalar type, float), so this IS a "relaxed alignment" (as compared to the 16). That makes sense on its own, but how do I reconcile it with the above (simd_packed_float4 is a typealias of SIMD4<Float> and MemoryLayout<simd_packed_float4>.alignment = 16)?
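
For comparison (my addition, not part of the question), C++ can express both layouts directly, which may make the "relaxed alignment" idea easier to see:

```cpp
#include <cstddef>

// 16-byte alignment, like simd_float4 / MSL float4
struct alignas(16) float4 { float c[4]; };

// natural (scalar) alignment, like MSL packed_float4
struct packed_float4 { float c[4]; };

static_assert(alignof(float4) == 16, "over-aligned vector type");
static_assert(alignof(packed_float4) == alignof(float), "relaxed alignment");
static_assert(sizeof(float4) == sizeof(packed_float4), "same 16-byte payload");
```

The question is essentially why Swift's simd_packed_float4 behaves like the first struct rather than the second.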

                                                                                  ANSWER

                                                                                  Answered 2021-Jun-12 at 03:45

I actually think it's impossible to achieve relaxed alignment like this with a packed type in Swift. The Swift compiler just can't bring the alignment attributes through to the actual Swift interface.

I think this makes simd_packed_float4 useless in Swift.

I have made a playground to check this, and using it as intended doesn't work.

import simd

MemoryLayout<simd_packed_float4>.stride     // 16
MemoryLayout<simd_packed_float4>.alignment  // 16, not the "relaxed" 4

let capacity = 8
let buffer = UnsafeMutableBufferPointer<Float>.allocate(capacity: capacity)
for i in 0..<capacity { buffer[i] = Float(i) }
let rawBuffer = UnsafeRawBufferPointer(buffer)

let readAligned = rawBuffer.load(fromByteOffset: MemoryLayout<Float>.stride * 4, as: simd_packed_float4.self)

print(readAligned)

let readUnaligned = rawBuffer.load(fromByteOffset: MemoryLayout<Float>.stride * 2, as: simd_packed_float4.self)

print(readUnaligned)
                                                                                  

                                                                                  Which will output

                                                                                  SIMD4(4.0, 5.0, 6.0, 7.0)
                                                                                  Swift/UnsafeRawPointer.swift:900: Fatal error: load from misaligned raw pointer
                                                                                  

                                                                                  If you do need to load or put unaligned simd_float4 vectors into buffers, I would suggest just making an extension that does this component-wise, so all the alignments work out, kinda like this

extension UnsafeRawBufferPointer {
    func loadFloat4(fromByteOffset offset: Int) -> simd_float4 {
        let x = load(fromByteOffset: offset + MemoryLayout<Float>.stride * 0, as: Float.self)
        let y = load(fromByteOffset: offset + MemoryLayout<Float>.stride * 1, as: Float.self)
        let z = load(fromByteOffset: offset + MemoryLayout<Float>.stride * 2, as: Float.self)
        let w = load(fromByteOffset: offset + MemoryLayout<Float>.stride * 3, as: Float.self)

        return simd_float4(x, y, z, w)
    }
}
                                                                                  
let readUnaligned2 = rawBuffer.loadFloat4(fromByteOffset: MemoryLayout<Float>.stride * 2)
                                                                                  print(readUnaligned2)
                                                                                  

                                                                                  Or you can even make it generic

                                                                                  Source https://stackoverflow.com/questions/67943802

                                                                                  QUESTION

                                                                                  OpenMP vectorised code runs way slower than O3 optimized code
                                                                                  Asked 2021-Jun-11 at 14:46

                                                                                  I have a minimally reproducible sample which is as follows -

#include <iostream>
#include <chrono>

template<typename type>
void AddMatrixOpenMP(type* matA, type* matB, type* result, size_t size){
        for(size_t i=0; i < size * size; i++){
            result[i] = matA[i] + matB[i];
        }
}

int main(){
    size_t size = 8192;

    float* matA = new float[size * size];
    float* matB = new float[size * size];
    float* result = new float[size * size];

    auto start = std::chrono::high_resolution_clock::now();

    for(int i = 0; i < 500; i++){
        AddMatrixOpenMP<float>(matA, matB, result, size);
    }

    auto end = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();

    std::cout << "Average Time is = " << duration / 500 << std::endl;
}

I experiment as follows: I time the code with the #pragma omp for simd directive on the loop in the AddMatrixOpenMP function, and then time it without the directive. I compile the code as follows: g++ -O3 -fopenmp example.cpp

Upon inspecting the assembly, both variants generate vector instructions, but when the OpenMP pragma is explicitly specified, the code runs 3 times slower. I am not able to understand why.

                                                                                  Edit - I am running GCC 9.3 and OpenMP 4.5. This is running on an i7 9750h 6C/12T on Ubuntu 20.04. I ensured no major processes were running in the background. The CPU frequency held more or less constant during the run for both versions (Minor variations from 4.0 to 4.1)

                                                                                  TIA

                                                                                  ANSWER

                                                                                  Answered 2021-Jun-11 at 14:46

                                                                                  The non-OpenMP vectorizer is defeating your benchmark with loop inversion.
                                                                                  Make your function __attribute__((noinline, noclone)) to stop GCC from inlining it into the repeat loop. For cases like this with large enough functions that call/ret overhead is minor, and constant propagation isn't important, this is a pretty good way to make sure that the compiler doesn't hoist work out of the loop.
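
Applied to the function in question, that looks like the following (a sketch; noinline/noclone are GCC attributes, which is the compiler used here):

```cpp
#include <cstddef>

// The attribute forces a real call on every repeat-loop iteration, so the
// optimizer can't hoist the array work out of the benchmark loop.
template<typename type>
__attribute__((noinline, noclone))
void AddMatrixOpenMP(type* matA, type* matB, type* result, size_t size)
{
    for (size_t i = 0; i < size * size; i++)
        result[i] = matA[i] + matB[i];
}
```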

                                                                                  And in future, check the asm, and/or make sure the benchmark time scales linearly with the iteration count. e.g. increasing 500 up to 1000 should give the same average time in a benchmark that's working properly, but it won't with -O3. (Although it's surprisingly close here, so that smell test doesn't definitively detect the problem!)

                                                                                  After adding the missing #pragma omp simd to the code, yeah I can reproduce this. On i7-6700k Skylake (3.9GHz with DDR4-2666) with GCC 10.2 -O3 (without -march=native or -fopenmp), I get 18266, but with -O3 -fopenmp I get avg time 39772.

                                                                                  With the OpenMP vectorized version, if I look at top while it runs, memory usage (RSS) is steady at 771 MiB. (As expected: init code faults in the two inputs, and the first iteration of the timed region writes to result, triggering page-faults for it, too.)

                                                                                  But with the "normal" vectorizer (not OpenMP), I see the memory usage climb from ~500 MiB until it exits just as it reaches the max 770MiB.

                                                                                  So it looks like gcc -O3 performed some kind of loop inversion after inlining and defeated the memory-bandwidth-intensive aspect of your benchmark loop, only touching each array element once.
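
In other words (my C++ paraphrase of the transformation, not GCC's literal output), the inlined code behaves roughly like this:

```cpp
#include <cstddef>

// Roughly what the inversion produced: the 500x repeat no longer
// multiplies the memory traffic, it just spins an empty counter.
void add_once_with_delay(const float* matA, const float* matB,
                         float* result, size_t n, int repeats)
{
    for (size_t i = 0; i < n; i++)
    {
        float sum = matA[i] + matB[i];          // each element touched once
        for (int r = repeats; r != 0; --r) {}   // leftover empty inner loop
        result[i] = sum;
    }
}
```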

                                                                                  The asm shows the evidence: GCC 9.3 -O3 on Godbolt doesn't vectorize, and it leaves an empty inner loop instead of repeating the work.

                                                                                  .L4:                    # outer loop
                                                                                          movss   xmm0, DWORD PTR [rbx+rdx*4]
                                                                                          addss   xmm0, DWORD PTR [r13+0+rdx*4]        # one scalar operation
                                                                                          mov     eax, 500
                                                                                  .L3:                             # do {
                                                                                          sub     eax, 1                   # empty inner loop after inversion
                                                                                          jne     .L3              # }while(--i);
                                                                                  
                                                                                          add     rdx, 1
                                                                                          movss   DWORD PTR [rcx], xmm0
                                                                                          add     rcx, 4
                                                                                          cmp     rdx, 67108864
                                                                                          jne     .L4
                                                                                  

                                                                                  This is only 2 or 3x faster than fully doing the work. Probably because it's not vectorized, and it's effectively running a delay loop instead of optimizing away the empty inner loop entirely. And because modern desktops have very good single-threaded memory bandwidth.

                                                                                  Bumping up the repeat count from 500 to 1000 only improved the computed "average" from 18266 to 17821 us per iter. An empty loop still takes 1 iteration per clock. Normally scaling linearly with the repeat count is a good litmus test for broken benchmarks, but this is close enough to be believable.

                                                                                  There's also the overhead of page faults inside the timed region, but the whole thing runs for multiple seconds so that's minor.

                                                                                  The OpenMP vectorized version does respect your benchmark repeat-loop. (Or to put it another way, doesn't manage to find the huge optimization that's possible in this code.)

                                                                                  Looking at memory bandwidth while the benchmark is running:

Running intel_gpu_top -l while the proper benchmark is running (OpenMP, or with __attribute__((noinline, noclone))) shows the following. IMC is the Integrated Memory Controller on the CPU die, shared by the IA cores and the GPU via the ring bus; that's why a GPU-monitoring program is useful here.

                                                                                  $ intel_gpu_top -l
                                                                                   Freq MHz      IRQ RC6 Power     IMC MiB/s           RCS/0           BCS/0           VCS/0          VECS/0 
                                                                                   req  act       /s   %     W     rd     wr       %  se  wa       %  se  wa       %  se  wa       %  se  wa 
                                                                                     0    0        0  97  0.00  20421   7482    0.00   0   0    0.00   0   0    0.00   0   0    0.00   0   0 
                                                                                     3    4       14  99  0.02  19627   6505    0.47   0   0    0.00   0   0    0.00   0   0    0.00   0   0 
                                                                                     7    7       20  98  0.02  19625   6516    0.67   0   0    0.00   0   0    0.00   0   0    0.00   0   0 
                                                                                    11   10       22  98  0.03  19632   6516    0.65   0   0    0.00   0   0    0.00   0   0    0.00   0   0 
                                                                                     3    4       13  99  0.02  19609   6505    0.46   0   0    0.00   0   0    0.00   0   0    0.00   0   0 
                                                                                  

                                                                                  Note the ~19.6GB/s read / 6.5GB/s write. Read ~= 3x write since it's not using NT stores for the output stream.

                                                                                  But with -O3 defeating the benchmark, with a 1000 repeat count, we see only near-idle levels of main-memory bandwidth.

                                                                                   Freq MHz      IRQ RC6 Power     IMC MiB/s           RCS/0           BCS/0           VCS/0          VECS/0 
                                                                                   req  act       /s   %     W     rd     wr       %  se  wa       %  se  wa       %  se  wa       %  se  wa 
                                                                                  ...
                                                                                     8    8       17  99  0.03    365     85    0.62   0   0    0.00   0   0    0.00   0   0    0.00   0   0 
                                                                                     9    9       17  99  0.02    349     90    0.62   0   0    0.00   0   0    0.00   0   0    0.00   0   0 
                                                                                     4    4        5 100  0.01    303     63    0.25   0   0    0.00   0   0    0.00   0   0    0.00   0   0 
                                                                                     7    7       15 100  0.02    345     69    0.43   0   0    0.00   0   0    0.00   0   0    0.00   0   0 
                                                                                    10   10       21  99  0.03    350     74    0.64   0   0    0.00   0   0    0.00   0   0    0.00   0   0 
                                                                                  

                                                                                  vs. a baseline of 150 to 180 MB/s read, 35 to 50MB/s write when the benchmark isn't running at all. (I have some programs running that don't totally sleep even when I'm not touching the mouse / keyboard.)

                                                                                  Source https://stackoverflow.com/questions/67937516

                                                                                  QUESTION

                                                                                  How to combine constexpr and vectorized code?
                                                                                  Asked 2021-Jun-01 at 14:43

                                                                                  I am working on a C++ intrinsic wrapper for x64 and neon. I want my functions to be constexpr. My motivation is similar to Constexpr and SSE intrinsics, but #pragma omp simd and intrinsics may not be supported by the compiler (GCC) in a constexpr function. The following code is just a demonstration (auto-vectorization is good enough for addition).

                                                                                  struct FA{
                                                                                      float c[4];
                                                                                  };
                                                                                  
                                                                                  inline constexpr FA add(FA a, FA b){
                                                                                      FA result{};
                                                                                      #pragma omp simd            // clang error: statement not allowed in constexpr function
                                                                                      for(int i = 0; i < 4; i++){ // GCC error: uninitialized variable 'i' in 'constexpr' function
                                                                                          result.c[i] = b.c[i] + a.c[i];
                                                                                      }
                                                                                      return result;
                                                                                  }
                                                                                  struct FA2{
                                                                                      __m128 c;
                                                                                  };
                                                                                  
                                                                                  
                                                                                  inline constexpr FA2 add2(FA2 a, FA2 b){
                                                                                          FA2 result{};
                                                                                          result.c = _mm_add_ps(a.c,b.c); // GCC error: call to non-'constexpr' function '__m128 _mm_add_ps(__m128, __m128)'
                                                                                          return result;                  // fine with clang
                                                                                  }
                                                                                  
                                                                                  
                                                                                  

                                                                                  I have to provide reference C++ code for portability anyway. Is there a code-efficient way to let the compiler use the reference code at compile time and the intrinsic version at run time?

                                                                                  f(){
                                                                                      if(){
                                                                                          // constexpr version
                                                                                      }else{
                                                                                          // intrinsic version
                                                                                      }
                                                                                  }
                                                                                  

                                                                                  It should work on all compilers that support omp, intrinsics and C++20.

                                                                                  ANSWER

                                                                                  Answered 2021-Jun-01 at 14:43

                                                                                  Using std::is_constant_evaluated, you can get exactly what you want:

                                                                                  #include <type_traits>
                                                                                  
                                                                                  struct FA{
                                                                                      float c[4];
                                                                                  };
                                                                                  
                                                                                  // Just for the sake of the example. Makes for nice-looking assembly.
                                                                                  extern FA add_parallel(FA a, FA b);
                                                                                  
                                                                                  constexpr FA add(FA a, FA b) {
                                                                                      if (std::is_constant_evaluated()) {
                                                                                          // do it in a constexpr-friendly manner
                                                                                          FA result{};
                                                                                          for(int i = 0; i < 4; i++) {
                                                                                              result.c[i] = b.c[i] + a.c[i];
                                                                                          }
                                                                                          return result;
                                                                                      } else {
                                                                                          // can be anything that's not constexpr-friendly.
                                                                                          return add_parallel(a, b);
                                                                                      }
                                                                                  }
                                                                                  
                                                                                  constexpr FA at_compile_time = add(FA{1,2,3,4}, FA{5,6,7,8});
                                                                                  
                                                                                  FA at_runtime(FA a) {
                                                                                      return add(a, at_compile_time);
                                                                                  }
                                                                                  

                                                                                  Source https://stackoverflow.com/questions/67726812

                                                                                  QUESTION

                                                                                  Why is SIMD slower than scalar counterpart
                                                                                  Asked 2021-May-30 at 16:13

                                                                                  This is yet another "SSE is slower than normal code! Why?" type of question.
                                                                                  I know that there are a bunch of similar questions but they don't seem to match my situation.

                                                                                  I am trying to implement Miller-Rabin primality test with Montgomery Modular Multiplication for fast modulo operations.
                                                                                  I tried to implement it in both scalar and SIMD way and it turns out that the SIMD version was around 10% slower.
                                                                                  In case anyone is wondering, [esp+16] or [esp+12] points to the modular inverse of N.

                                                                                  I am really puzzled by the fact that psrldq, supposedly a 1c-latency, 1c-throughput, 1-uop instruction, takes longer than pmuludq at 3c latency, 0.5c throughput and 1 uop.

                                                                                  Below is the code and the run-time analysis from Visual Studio, run on a Ryzen 5 3600.

                                                                                  Any idea on how to improve the SIMD code and/or why it is slower than the scalar code is appreciated.

                                                                                  P.S. The run-time analysis seems to be off by one instruction for some reason.

                                                                                  EDIT 1: the comment on the image was wrong, I attached a fixed version below:

                                                                                      ;------------------------------------------
                                                                                      ; Calculate base^d mod x
                                                                                      ;
                                                                                      ; eax = 1
                                                                                      ; esi = x
                                                                                      ; edi = bases[eax]
                                                                                      ; ebp = d
                                                                                      ; while d do
                                                                                      ;     if d & 1 then eax = (eax * edi) mod x
                                                                                      ;     edi = (edi*edi) mod x
                                                                                      ;     d >>= 1
                                                                                      ; end
                                                                                      ;------------------------------------------
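The comment block above is ordinary square-and-multiply binary exponentiation. As a plain C++ reference (without the Montgomery reductions that the assembly performs for the two multiplications), it can be sketched like this:

```cpp
#include <cstdint>

// Plain C++ reference for the commented pseudocode above: binary
// exponentiation computing base^d mod x. The assembly additionally keeps
// values in Montgomery form for fast reductions; this sketch does not.
uint32_t modexp(uint32_t base, uint32_t d, uint32_t x) {
    uint64_t result = 1;
    uint64_t b = base % x;
    while (d) {
        if (d & 1) result = (result * b) % x;  // eax = (eax * edi) mod x
        b = (b * b) % x;                       // edi = (edi * edi) mod x
        d >>= 1;                               // d >>= 1
    }
    return (uint32_t)result;
}

// e.g. modexp(2, 10, 1000) is 1024 mod 1000
```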
                                                                                  

                                                                                  Scalar code:

                                                                                  LOOP_MODEXP:
                                                                                      push eax
                                                                                  
                                                                                      test ebp, 1
                                                                                      jz @F
                                                                                  
                                                                                      mul edi
                                                                                      mov ecx, edx
                                                                                      imul eax, DWORD PTR [esp+16]
                                                                                      mul esi
                                                                                      xor ebx, ebx
                                                                                      sub ecx, edx
                                                                                      cmovs ebx, esi
                                                                                      add ecx, ebx
                                                                                      mov DWORD PTR [esp], ecx
                                                                                  @@:
                                                                                      mov edx, edi
                                                                                      mulx ecx, edx, edi
                                                                                      imul edx, DWORD PTR [esp+16]
                                                                                      mulx eax, ebx, esi
                                                                                      xor ebx, ebx
                                                                                      sub ecx, eax
                                                                                      cmovs ebx, esi
                                                                                      add ecx, ebx
                                                                                      mov edi, ecx
                                                                                  
                                                                                      pop eax
                                                                                  
                                                                                      shr ebp, 1
                                                                                      jnz LOOP_MODEXP
                                                                                  

                                                                                  SIMD code

                                                                                      movd xmm2, DWORD PTR [esp+12]
                                                                                      movd xmm3, esi
                                                                                      pshufd xmm2, xmm2, 0
                                                                                      pshufd xmm3, xmm3, 0
                                                                                      
                                                                                      movd xmm1, edi
                                                                                  
                                                                                      pshufd xmm1, xmm1, 0
                                                                                      movdqa xmm0, xmm1
                                                                                  
                                                                                      pinsrd xmm0, eax, 2
                                                                                  
                                                                                  LOOP_MODEXP:
                                                                                      movdqa xmm4, xmm0
                                                                                      pmuludq xmm0, xmm1
                                                                                      movdqa xmm1, xmm0
                                                                                      pmuludq xmm0, xmm2
                                                                                      pmuludq xmm0, xmm3
                                                                                      psubd xmm1, xmm0
                                                                                      
                                                                                      psrldq xmm1, 4
                                                                                      pxor xmm0, xmm0
                                                                                      pcmpgtd xmm0, xmm1
                                                                                      blendvps xmm0, xmm3, xmm0
                                                                                      paddd xmm0, xmm1
                                                                                  
                                                                                      movddup xmm1, xmm0
                                                                                  
                                                                                      test ebp, 1
                                                                                      jnz @F
                                                                                      blendps xmm0, xmm4, 4
                                                                                  
                                                                                  @@:
                                                                                      shr ebp, 1
                                                                                      jnz LOOP_MODEXP
                                                                                  
                                                                                      pextrd eax, xmm0, 2
                                                                                  

                                                                                  ANSWER

                                                                                  Answered 2021-May-30 at 16:13
                                                                                  1. Your SIMD code wastes time mispredicting the test ebp, 1 / jnz branch. There’s no conditional move instruction in SSE, but you can still optimize away that test + branch with a few more instructions, like this:
                                                                                  mov      ebx, ebp
                                                                                  and      ebx, 1
                                                                                  sub      ebx, 1
                                                                                  pxor     xmm5, xmm5
                                                                                  pinsrd   xmm5, ebx, 2
                                                                                  blendvps xmm0, xmm4, xmm5
                                                                                  

                                                                                  instead of your

                                                                                  test    ebp, 1
                                                                                  jnz @F
                                                                                  blendps xmm0, xmm4, 4
                                                                                  

                                                                                  The above code computes ebx = ( ebp & 1 ) ? 0 : -1;, inserts that integer into the third lane of a zero vector, and uses that vector as the selector for the blendvps instruction.

                                                                                  2. This instruction is not needed: pcmpgtd xmm0, xmm1. Together with the previous and next instructions, it computes this:
                                                                                  xmm0 = _mm_cmplt_epi32( xmm1, _mm_setzero_si128() );
                                                                                  xmm0 = _mm_blendv_ps( xmm0, xmm3, xmm0 );
                                                                                  

                                                                                  Here’s an equivalent:

                                                                                  xmm0 = _mm_blendv_ps( _mm_setzero_si128(), xmm3, xmm1 );
                                                                                  

                                                                                  That comparison instruction tests the int32 lanes for xmm1 < 0, which merely broadcasts the sign bit of each integer across its lane. Since the _mm_blendv_ps instruction only tests the high bit of each 32-bit lane, you don’t really need the comparison for xmm1 < 0 before it.

                                                                                  3. Unless you need to support CPUs without AVX, you should use the VEX encoding of the instructions, even for code dealing with 16-byte vectors. Your SIMD code uses the legacy encoding, where most instructions take two arguments and write the result into the first one. Most VEX instructions take three arguments and write the result into a separate one. This should get rid of a couple of redundant move instructions such as movdqa xmm4, xmm0.
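The branchless selection in point 1 is the SIMD form of a classic bit-select idiom. A scalar model of the same idea (hypothetical names, plain C++ rather than intrinsics):

```cpp
#include <cstdint>

// Scalar model of the blendvps trick from point 1: derive an all-ones or
// all-zeros mask from the low bit of d, then select without branching.
uint32_t select_by_low_bit(uint32_t d, uint32_t if_odd, uint32_t if_even) {
    // mask = (d & 1) ? 0x00000000 : 0xFFFFFFFF -- mirrors ebx = (ebp & 1) ? 0 : -1
    uint32_t mask = (d & 1u) - 1u;
    // blendv-style select: take if_even where mask bits are set, if_odd elsewhere
    return (mask & if_even) | (~mask & if_odd);
}
```

The vector version only needs the sign bit of the mask lane, because blendvps tests the high bit of each 32-bit lane.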

                                                                                  Source https://stackoverflow.com/questions/67761813

                                                                                  QUESTION

                                                                                  error: reduction variable is private in outer context (omp reduction)
                                                                                  Asked 2021-May-26 at 18:08

                                                                                  I am confused about the data-sharing scope of the variable acc in the following two cases. In case 1 I get the following compilation error: error: reduction variable ‘acc’ is private in outer context, whereas case 2 compiles without any issues.

                                                                                  According to this article, variables defined outside a parallel region are shared.

                                                                                  Why does adding for-loop parallelism privatize acc? How can I, in this case, accumulate the result calculated in the for-loop and distribute the loop's iteration space across a thread team?

                                                                                  case 1

                                                                                              float acc = 0.0f;
                                                                                              
                                                                                              #pragma omp for simd reduction(+: acc)
                                                                                              for (int k = 0; k < MATRIX_SIZE; k++) {
                                                                                                  float mul = alpha;
                                                                                                  mul *=  a[i * MATRIX_SIZE + k];
                                                                                                  mul *=  b[j * MATRIX_SIZE + k];
                                                                                                  acc += mul;
                                                                                              }
                                                                                  
                                                                                  

                                                                                  case 2

                                                                                              float acc = 0.0f;
                                                                                              
                                                                                              #pragma omp simd reduction(+: acc)
                                                                                              for (int k = 0; k < MATRIX_SIZE; k++) {
                                                                                                  float mul = alpha;
                                                                                                  mul *=  a[i * MATRIX_SIZE + k];
                                                                                                  mul *=  b[j * MATRIX_SIZE + k];
                                                                                                  acc += mul;
                                                                                              }
                                                                                  
                                                                                  
                                                                                  

                                                                                  ANSWER

                                                                                  Answered 2021-May-26 at 18:08

                                                                                  Your case 1 is violating OpenMP semantics, as there's an implicit parallel region (see OpenMP Language Terminology, "sequential part") that contains the definition of acc. Thus, acc is indeed private to that implicit parallel region. This is what the compiler complains about.

                                                                                  Your case 2 is different in that the simd construct is not a worksharing construct and thus has a different definition of the semantics of the reduction clause.

                                                                                  Case 1 would be correct if you wrote it this way:

                                                                                  void example(void) {
                                                                                      float acc = 0.0f;
                                                                                  
                                                                                      #pragma omp parallel for simd reduction(+: acc)
                                                                                      for (int k = 0; k < MATRIX_SIZE; k++) {
                                                                                          float mul = alpha;
                                                                                          mul *=  a[i * MATRIX_SIZE + k];
                                                                                          mul *=  b[j * MATRIX_SIZE + k];
                                                                                          acc += mul;
                                                                                      }
                                                                                  }
                                                                                  

                                                                                  The acc variable is now defined outside of the parallel region that the for simd construct binds to.
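To make the corrected case 1 self-contained (MATRIX_SIZE, alpha, a, b, i and j below are stand-ins filled in from the question's surrounding context, not part of the original snippet):

```cpp
// Self-contained sketch of the corrected case 1.
constexpr int MATRIX_SIZE = 4;

float row_dot(const float* a, const float* b, float alpha, int i, int j) {
    float acc = 0.0f;
    // 'parallel for simd' creates an explicit parallel region for the
    // reduction to bind to, so acc is shared there and correctly reduced.
    #pragma omp parallel for simd reduction(+: acc)
    for (int k = 0; k < MATRIX_SIZE; k++) {
        float mul = alpha;
        mul *= a[i * MATRIX_SIZE + k];
        mul *= b[j * MATRIX_SIZE + k];
        acc += mul;
    }
    return acc;
}

float demo_result() {
    float a[MATRIX_SIZE * MATRIX_SIZE];
    float b[MATRIX_SIZE * MATRIX_SIZE];
    for (int n = 0; n < MATRIX_SIZE * MATRIX_SIZE; n++) {
        a[n] = 1.0f;
        b[n] = 2.0f;
    }
    // Four terms of 1 * 1 * 2 each, so the row dot product is 8.
    return row_dot(a, b, 1.0f, 0, 0);
}
```

Without -fopenmp the pragma is ignored and the code still computes the same result serially.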

                                                                                  Source https://stackoverflow.com/questions/67709887

                                                                                  QUESTION

                                                                                  Writing a vector sum function with SIMD (System.Numerics) and making it faster than a for loop
                                                                                  Asked 2021-May-21 at 18:27

                                                                                  I wrote a function to add up all the elements of a double[] array using SIMD (System.Numerics.Vector<double>) and the performance is worse than the naïve method.

                                                                                  On my computer Vector<double>.Count is 4, which means I could create an accumulator of 4 values and run through the array adding up the elements by groups.

                                                                                  For example a 10 element array, with a 4 element accumulator and 2 remaining elements I would get

                                                                                  //     | loop                  | remainder
                                                                                  acc[0] = vector[0] + vector[4] + vector[8]
                                                                                  acc[1] = vector[1] + vector[5] + vector[9]
                                                                                  acc[2] = vector[2] + vector[6] 
                                                                                  acc[3] = vector[3] + vector[7] 
                                                                                  

                                                                                  and the result sum = acc[0]+acc[1]+acc[2]+acc[3]

                                                                                  The code below produces the correct results, but the speed isn't there compared to just a loop adding up the values

                                                                                  public static double SumSimd(this Span<double> a)
                                                                                  {
                                                                                      var n = System.Numerics.Vector<double>.Count;
                                                                                      var count = a.Length;
                                                                                      // divide array into n=4 element groups
                                                                                      // Example, 57 = 14*4 + 3
                                                                                      var groups = Math.DivRem(count, n, out int remain);
                                                                                      var buffer = new double[n];
                                                                                      // Create buffer with remaining elements (not in groups)
                                                                                      a.Slice(groups*n, remain).CopyTo(buffer);
                                                                                      // Scan through all groups and accumulate
                                                                                      var accumulator = new System.Numerics.Vector<double>(buffer);
                                                                                      for (int i = 0; i < groups; i++)
                                                                                      {
                                                                                          //var next = new System.Numerics.Vector<double>(a, n * i);
                                                                                          var next = new System.Numerics.Vector<double>(a.Slice(n * i, n));
                                                                                          accumulator += next;
                                                                                      }
                                                                                      var sum = 0.0;
                                                                                      // Add up the elements of the accumulator vector
                                                                                      for (int j = 0; j < n; j++)
                                                                                      {
                                                                                          sum += accumulator[j];
                                                                                      }
                                                                                      return sum;
                                                                                  }
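The striped accumulation described above can be modeled in scalar code. This C++ sketch (the question's code is C#; names here are illustrative) shows exactly which source elements each accumulator lane receives:

```cpp
// Scalar model of the striped accumulation: lane j of the accumulator
// receives every element whose index is congruent to j modulo n.
double sum_striped(const double* v, int count) {
    const int n = 4;                     // stands in for Vector<double>.Count
    double acc[4] = {0, 0, 0, 0};
    int groups = count / n;
    for (int i = 0; i < groups; i++)     // full groups: acc[j] += v[i*n + j]
        for (int j = 0; j < n; j++)
            acc[j] += v[i * n + j];
    for (int r = groups * n; r < count; r++)  // remainder elements
        acc[r % n] += v[r];
    return acc[0] + acc[1] + acc[2] + acc[3];  // final horizontal sum
}

double demo_sum() {
    // The 10-element example from the question: indices 8 and 9 land in
    // lanes 0 and 1, matching the acc[0]/acc[1] layout shown above.
    double v[10] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
    return sum_striped(v, 10);
}
```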
                                                                                  

                                                                                  So my question is why aren't I realizing any benefits here with SIMD?

                                                                                  Baseline

                                                                                  The baseline code looks like this

                                                                                  public static double LinAlgSum(this ReadOnlySpan<double> span)
                                                                                  {
                                                                                      double sum = 0;
                                                                                      for (int i = 0; i < span.Length; i++)
                                                                                      {
                                                                                          sum += span[i];
                                                                                      }
                                                                                      return sum;
                                                                                  }
                                                                                  

In benchmarking the SIMD code against the baseline above, the SIMD code is 5× slower for size=7, 2.5× slower for size=144, and about the same for size=722.

                                                                                  I am running release mode using BenchmarkDotNet. Here is the driver class

                                                                                  [GroupBenchmarksBy(BenchmarkLogicalGroupRule.ByCategory)]
                                                                                  [CategoriesColumn]
                                                                                  public class LinearAlgebraBench
                                                                                  {
                                                                                      [Params(7, 35, 77, 144, 195, 311, 722)]
                                                                                      public int Size { get; set; }
                                                                                  
                                                                                      [IterationSetup]
                                                                                      public void SetupData()
                                                                                      {
                                                                                          A = new LinearAlgebra.Vector(Size, (iter) => 2 * Size - iter).ToArray();
                                                                                          B = new LinearAlgebra.Vector(Size, (iter) => Size/2 + 2* iter).ToArray();
                                                                                      }
                                                                                  
                                                                                      public double[] A { get; set; }
                                                                                      public double[] B { get; set; }
                                                                                  
                                                                                      [BenchmarkCategory("Sum"), Benchmark(Baseline = true)]
                                                                                      public double BenchLinAlgSum()
                                                                                      {
                                                                                          return LinearAlgebra.LinearAlgebra.Sum(A.AsSpan().AsReadOnly());
                                                                                      }
                                                                                      [BenchmarkCategory("Sum"), Benchmark]
                                                                                      public double BenchSimdSum()
                                                                                      {
                                                                                          return LinearAlgebra.LinearAlgebra.SumSimd(A);
                                                                                      }
                                                                                  }
                                                                                  

                                                                                  ANSWER

                                                                                  Answered 2021-May-19 at 18:28

                                                                                  I would suggest you take a look at this article exploring SIMD performance in .Net.

The overall algorithm for summing with regular vectorization looks identical. One difference is that the multiplication can be avoided when slicing the array:

while (i < lastBlockIndex)
{
    vresult += new Vector<double>(source.Slice(i));
    i += Vector<double>.Count;
}
                                                                                  

A single multiplication should have a fairly small effect on performance, but in a loop this tight it may still be measurable.

                                                                                  It is also worth noting that the compiler does not seem to produce very efficient SIMD code with the generic API. Performance for summing 32768 items:

                                                                                  • SumUnrolled - 8,979.690 ns
                                                                                  • SumVectorT - 6,689.829 ns
• SumIntrinsics - 2,200.996 ns

So the generic Vector&lt;T&gt; version of SIMD gains only ~30% performance, while the intrinsics version gains ~400%, close to the theoretical maximum.
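The intrinsics approach referenced above can be sketched roughly like this in C++ (the Simd Library's own language). This is a minimal SSE2 illustration of the general technique, not the benchmark code from the linked article; the function name is my own:

```cpp
#include <emmintrin.h> // SSE2 intrinsics, baseline on x86-64
#include <cstddef>

// Sum doubles two lanes at a time; a minimal sketch of the
// intrinsics technique the benchmark above refers to.
double SumIntrinsics(const double* data, size_t n)
{
    __m128d acc = _mm_setzero_pd();
    size_t i = 0;
    for (; i + 2 <= n; i += 2)
        acc = _mm_add_pd(acc, _mm_loadu_pd(data + i)); // two adds per iteration
    double lanes[2];
    _mm_storeu_pd(lanes, acc);
    double sum = lanes[0] + lanes[1]; // horizontal add of the accumulator
    for (; i < n; ++i)                // scalar tail for odd lengths
        sum += data[i];
    return sum;
}
```

Real benchmark code would additionally unroll the loop over several independent accumulators to hide the latency of the vector add, which is where most of the ~400% gain comes from.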

                                                                                  Source https://stackoverflow.com/questions/67605744

                                                                                  QUESTION

                                                                                  Is it practical to use the "rude big hammer" approach to parallelize a MacOS/CoreAudio real-time audio callback?
                                                                                  Asked 2021-May-20 at 13:46

                                                                                  First, some relevant background info: I've got a CoreAudio-based low-latency audio processing application that does various mixing and special effects on audio that is coming from an input device on a purpose-dedicated Mac (running the latest version of MacOS) and delivers the results back to one of the Mac's local audio devices.

                                                                                  In order to obtain the best/most reliable low-latency performance, this app is designed to hook in to CoreAudio's low-level audio-rendering callback (via AudioDeviceCreateIOProcID(), AudioDeviceStart(), etc) and every time the callback-function is called (from the CoreAudio's realtime context), it reads the incoming audio frames (e.g. 128 frames, 64 samples per frame), does the necessary math, and writes out the outgoing samples.

                                                                                  This all works quite well, but from everything I've read, Apple's CoreAudio implementation has an unwritten de-facto requirement that all real-time audio operations happen in a single thread. There are good reasons for this which I acknowledge (mainly that outside of SIMD/SSE/AVX instructions, which I already use, almost all of the mechanisms you might employ to co-ordinate parallelized behavior are not real-time-safe and therefore trying to use them would result in intermittently glitchy audio).

                                                                                  However, my co-workers and I are greedy, and nevertheless we'd like to do many more math-operations per sample-buffer than even the fastest single core could reliably execute in the brief time-window that is necessary to avoid audio-underruns and glitching.

                                                                                  My co-worker (who is fairly experienced at real-time audio processing on embedded/purpose-built Linux hardware) tells me that under Linux it is possible for a program to requisition exclusive access for one or more CPU cores, such that the OS will never try to use them for anything else. Once he has done this, he can run "bare metal" style code on that CPU that simply busy-waits/polls on an atomic variable until the "real" audio thread updates it to let the dedicated core know it's time to do its thing; at that point the dedicated core will run its math routines on the input samples and generate its output in a (hopefully) finite amount of time, at which point the "real" audio thread can gather the results (more busy-waiting/polling here) and incorporate them back into the outgoing audio buffer.

My question is: is this approach worth attempting under MacOS/X? That is, can a MacOS/X program, even one with root access, convince MacOS to give it exclusive access to some cores? And if so, will big ugly busy-waiting/polling loops on those cores (including the polling loops needed to synchronize the CoreAudio callback-thread with their input/output requirements) yield results that are reliably real-time enough that you might someday want to use them in front of a paying audience?

                                                                                  It seems like something that might be possible in principle, but before I spend too much time banging my head against whatever walls might exist there, I'd like some input about whether this is an avenue worth pursuing on this platform.

                                                                                  ANSWER

                                                                                  Answered 2021-May-20 at 13:46

                                                                                  can a MacOS/X program, even one with root access, convince MacOS to give it exclusive access to some cores

                                                                                  I don't know about that, but you can use as many cores / real-time threads as you want for your calculations, using whatever synchronisation methods you need to make it work, then pass the audio to your IOProc using a lock free ring buffer, like TPCircularBuffer.
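A lock-free single-producer/single-consumer ring buffer of the kind TPCircularBuffer provides can be sketched like this (my own minimal C++ sketch of the pattern, not TPCircularBuffer's actual API):

```cpp
#include <atomic>
#include <cstddef>

// Minimal SPSC ring buffer: one thread pushes, one thread pops,
// no locks anywhere. Capacity is N-1 elements (one slot is kept
// empty to distinguish full from empty).
template <typename T, size_t N>
class SpscRing
{
    T data_[N];
    std::atomic<size_t> head_{0}, tail_{0}; // head: write index, tail: read index
public:
    bool Push(const T& v)                   // producer side only
    {
        size_t h = head_.load(std::memory_order_relaxed);
        size_t next = (h + 1) % N;
        if (next == tail_.load(std::memory_order_acquire))
            return false;                   // full: drop rather than block
        data_[h] = v;
        head_.store(next, std::memory_order_release);
        return true;
    }
    bool Pop(T& v)                          // consumer side only
    {
        size_t t = tail_.load(std::memory_order_relaxed);
        if (t == head_.load(std::memory_order_acquire))
            return false;                   // empty: caller outputs silence
        v = data_[t];
        tail_.store((t + 1) % N, std::memory_order_release);
        return true;
    }
};
```

The key property for the IOProc is that `Pop` never blocks: if the worker threads haven't delivered in time, the callback can output silence instead of stalling the audio device.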

                                                                                  But your question reminded me of a new macOS 11/iOS 14 API I've been meaning to try, the Audio Workgroups API (2020 WWDC Video).

                                                                                  My understanding is that this API lets you "bless" your non-IOProc real-time threads with audio real-time thread properties or at least cooperate better with the audio thread.

                                                                                  The documents distinguish between the threads working in parallel (this sounds like your case) and working asynchronously (this sounds like my proposal), I don't know which case is better for you.

                                                                                  I still don't know what happens in practice when you use Audio Workgroups, whether they opt you in to good stuff or opt you out of bad stuff, but if they're not the hammer you're seeking, they may have some useful hammer-like properties.

                                                                                  Source https://stackoverflow.com/questions/67601620

                                                                                  QUESTION

                                                                                  MTKView Transparency
                                                                                  Asked 2021-May-17 at 09:09

                                                                                  I can't make my MTKView clear its background. I've set the view's and its layer's isOpaque to false, background color to clear and tried multiple solutions found on google/stackoverflow (most in the code below like loadAction and clearColor of color attachment) but nothing works.

                                                                                  All the background color settings seem to be ignored. Setting loadAction and clearColor of MTLRenderPassColorAttachmentDescriptor does nothing.

                                                                                  I'd like to have my regular UIView's drawn under the MTKView. What am I missing?

                                                                                              // initialization
                                                                                              let metal = MTKView(frame: self.view.bounds)
                                                                                              metal.device = MTLCreateSystemDefaultDevice()
                                                                                              self.renderer = try! MetalRenderer(mtkView: metal)
                                                                                              metal.delegate = self.renderer
                                                                                              self.view.addSubview(metal);
                                                                                  
                                                                                  import Foundation
                                                                                  import MetalKit
                                                                                  import simd
                                                                                  
                                                                                  public enum MetalError: Error {
                                                                                      case mtkViewError
                                                                                      case renderError
                                                                                  }
                                                                                  
                                                                                  internal class MetalRenderer: NSObject, MTKViewDelegate {
                                                                                      private let commandQueue: MTLCommandQueue;
                                                                                      private let pipelineState: MTLRenderPipelineState
    private var viewportSize: SIMD2<UInt32> = SIMD2<UInt32>(x: 10, y: 10);
                                                                                      private weak var mtkView: MTKView?
                                                                                          
                                                                                  
                                                                                      
                                                                                      init(mtkView: MTKView) throws {
                                                                                  
                                                                                          guard let device = mtkView.device else { 
                                                                                              print("device not found error")
                                                                                              throw MetalError.mtkViewError 
                                                                                          }
                                                                                          self.mtkView = mtkView
                                                                                  
                                                                                          // Load all the shader files with a .metal file extension in the project.
                                                                                          guard let defaultLibrary = device.makeDefaultLibrary() else {
                                                                                              print("Could not find library")
                                                                                              throw MetalError.mtkViewError
                                                                                          }
                                                                                          let vertexFunction = defaultLibrary.makeFunction(name: "vertexShader")
                                                                                          let fragmentFunction = defaultLibrary.makeFunction(name: "fragmentShader")
                                                                                  
                                                                                          mtkView.layer.isOpaque = false;
                                                                                          mtkView.layer.backgroundColor = UIColor.clear.cgColor
                                                                                          mtkView.isOpaque = false;
                                                                                          mtkView.backgroundColor = .clear
                                                                                  
                                                                                          let pipelineStateDescriptor = MTLRenderPipelineDescriptor();
                                                                                          pipelineStateDescriptor.label = "Pipeline";
                                                                                          pipelineStateDescriptor.vertexFunction = vertexFunction;
                                                                                          pipelineStateDescriptor.fragmentFunction = fragmentFunction;
                                                                                          pipelineStateDescriptor.isAlphaToCoverageEnabled = true
                                                                                          pipelineStateDescriptor.colorAttachments[0].pixelFormat = .bgra8Unorm;
                                                                                          pipelineStateDescriptor.colorAttachments[0].isBlendingEnabled = true;
                                                                                          pipelineStateDescriptor.colorAttachments[0].destinationRGBBlendFactor = .oneMinusSourceAlpha;
                                                                                          pipelineStateDescriptor.colorAttachments[0].destinationAlphaBlendFactor = .oneMinusSourceAlpha;
                                                                                  
                                                                                          pipelineState = try! device.makeRenderPipelineState(descriptor: pipelineStateDescriptor);
                                                                                          
                                                                                          guard let queue = device.makeCommandQueue() else { 
                                                                                              print("make command queue error")
                                                                                              throw MetalError.mtkViewError 
                                                                                          }
                                                                                          commandQueue = queue
                                                                                      }
                                                                                  
                                                                                  
                                                                                      func mtkView(_ view: MTKView, drawableSizeWillChange size: CGSize) {
                                                                                          viewportSize.x = UInt32(size.width)
                                                                                          viewportSize.y = UInt32(size.height)
                                                                                      }
                                                                                  
                                                                                      func draw(in view: MTKView) {
                                                                                  
                                                                                          let vertices: [Vertex] = [
                                                                                              Vertex(position: SIMD3(x: 250.0, y: -250.0, z: 0)),
                                                                                              Vertex(position: SIMD3(x: -250.0, y: -250.0, z: 0)),
                                                                                              Vertex(position: SIMD3(x: 0.0, y: 250.0, z: 0)),
                                                                                          ]
                                                                                          
                                                                                          guard let commandBuffer = commandQueue.makeCommandBuffer() else { 
                                                                                              print("Couldn't create command buffer")
                                                                                              return 
                                                                                          }
                                                                                          // Create a new command buffer for each render pass to the current drawable.
                                                                                          commandBuffer.label = "MyCommand";
                                                                                  
                                                                                          // Obtain a renderPassDescriptor generated from the view's drawable textures.
                                                                                          guard let renderPassDescriptor = view.currentRenderPassDescriptor else {
                                                                                              print("Couldn't create render pass descriptor")
                                                                                              return
                                                                                          }
                                                                                          
                                                                                          guard let renderEncoder = commandBuffer.makeRenderCommandEncoder(descriptor: renderPassDescriptor) else {
                                                                                              print("Couldn't create render encoder")
                                                                                              return
                                                                                          }
                                                                                  
                                                                                          renderPassDescriptor.colorAttachments[0].loadAction = .clear
                                                                                          renderPassDescriptor.colorAttachments[0].clearColor = MTLClearColorMake(1.0, 0.0, 0.0, 0.5)
                                                                                  
                                                                                          renderEncoder.label = "MyRenderEncoder";
                                                                                          // Set the region of the drawable to draw into.
                                                                                          renderEncoder.setViewport(MTLViewport(originX: 0.0, originY: 0.0, width: Double(viewportSize.x), height: Double(viewportSize.y), znear: 0.0, zfar: 1.0))
                                                                                          renderEncoder.setRenderPipelineState(pipelineState)
                                                                                  
                                                                                          // Pass in the parameter data.
        renderEncoder.setVertexBytes(vertices, length: MemoryLayout<Vertex>.size * vertices.count, index: Int(VertexInputIndexVertices.rawValue))
        renderEncoder.setVertexBytes(&viewportSize, length: MemoryLayout<SIMD2<UInt32>>.size, index: Int(VertexInputIndexViewportSize.rawValue))
                                                                                          renderEncoder.drawPrimitives(type: MTLPrimitiveType.triangle, vertexStart: 0, vertexCount: 3)
                                                                                          renderEncoder.endEncoding()
                                                                                  
                                                                                          // Schedule a present once the framebuffer is complete using the current drawable.
                                                                                          guard let drawable = view.currentDrawable else {
                                                                                              print("Couldn't get current drawable")
                                                                                              return
                                                                                          }
                                                                                  
                                                                                          commandBuffer.present(drawable)
                                                                                  
                                                                                          // Finalize rendering here & push the command buffer to the GPU.
                                                                                          commandBuffer.commit()
                                                                                      }
                                                                                  }
                                                                                  

                                                                                  ANSWER

                                                                                  Answered 2021-May-17 at 09:09

Thanks to Frank, the answer was to just set the clearColor property of the view itself, which I had missed. I also removed most adjustments in the MTLRenderPipelineDescriptor, whose code is now:

                                                                                      let pipelineStateDescriptor = MTLRenderPipelineDescriptor();
                                                                                      pipelineStateDescriptor.label = "Pipeline";
                                                                                      pipelineStateDescriptor.vertexFunction = vertexFunction;
                                                                                      pipelineStateDescriptor.fragmentFunction = fragmentFunction;
                                                                                      pipelineStateDescriptor.colorAttachments[0].pixelFormat = 
                                                                                      mtkView.colorPixelFormat;
                                                                                  

No changes to the MTLRenderPassDescriptor obtained from currentRenderPassDescriptor were necessary.

EDIT: Also be sure to set the isOpaque property of the MTKView to false.

                                                                                  Source https://stackoverflow.com/questions/67487986

                                                                                  Community Discussions, Code Snippets contain sources that include Stack Exchange Network

                                                                                  Vulnerabilities

                                                                                  No vulnerabilities reported

                                                                                  Install Simd

                                                                                  You can download it from GitHub.

                                                                                  Support

For any new features, suggestions, and bugs, create an issue on GitHub. If you have questions, check and ask on the community page at Stack Overflow.

Consider Popular Computer Vision Libraries

opencv by opencv
tesseract by tesseract-ocr
tesseract.js by naptha
Detectron by facebookresearch

Try Top Libraries by ermig1979

AntiDupl by ermig1979 (C#)
Synet by ermig1979 (C++)
Examples by ermig1979 (C++)
Cpl by ermig1979 (C++)

Compare Computer Vision Libraries with Highest Support

opencv by opencv
picasso by square
thumbor by thumbor
albumentations by albumentations-team
vision by pytorch
