Simulation

Explore all libraries in Simulation

Explore all Simulation open source software, libraries, packages, source code, cloud functions and APIs.

Popular New Releases in Simulation

mongoose

6.2.10

bullet3

PyBullet 3.22

abstreet

Tiramisconstrued

Locale-Emulator

v2.5.0.1

a32nx

v0.7.5-rc1

Popular Libraries in Simulation

mongoose

by Automattic javascript

24117 MIT

MongoDB object modeling designed to work in an asynchronous environment.

OSX-KVM

by kholia python

11515

Run macOS on QEMU/KVM. With OpenCore + Big Sur support now! Only commercial (paid) support is available now to avoid spammy issues.

bullet3

by bulletphysics c++

9090 NOASSERTION

Bullet Physics SDK: real-time collision detection and multi-physics simulation for VR, games, visual effects, robotics, machine learning etc.

citra

by citra-emu c++

6741 GPL-2.0

A Nintendo 3DS Emulator

abstreet

by a-b-street rust

6519 Apache-2.0

Transportation planning and traffic simulation software for creating cities friendlier to walking, biking, and public transit

Locale-Emulator

by xupefei csharp

6367 LGPL-3.0

Yet Another System Region and Language Simulator

a32nx

by flybywiresim javascript

4107 GPL-3.0

The A32NX Project is a community driven open source project to create a free Airbus A320neo in Microsoft Flight Simulator that is as close to reality as possible.

paint

by dli javascript

2567 MIT

Fluid Paint - http://david.li/paint

crest

by wave-harmonic csharp

2531 NOASSERTION

An advanced ocean system implemented in Unity3D

Trending New libraries in Simulation

a32nx

by flybywiresim javascript

4107 GPL-3.0

The A32NX Project is a community driven open source project to create a free Airbus A320neo in Microsoft Flight Simulator that is as close to reality as possible.

alien

by chrxh c++

2442 GPL-3.0

ALIEN is a CUDA-powered artificial life simulation program.

AntSimulator

by johnBuffer c++

1729 MIT

Simple Ants simulator

JoltPhysics

by jrouwe c++

1529 MIT

A multi core friendly rigid body physics and collision detection library suitable for games and VR applications.

covid19_scenarios

by neherlab javascript

1380 MIT

Models of COVID-19 outbreak trajectories and hospital demand

xemu

by mborgerson c

1181 NOASSERTION

Original Xbox Emulator for Windows, macOS, and Linux (Active Development)

brax

by google jupyter notebook

1078 Apache-2.0

Massively parallel rigidbody physics simulation on accelerator hardware.

Unity-Robotics-Hub

by Unity-Technologies csharp

1053 Apache-2.0

Central repository for tools, tutorials, resources, and documentation for robotics simulation in Unity.

gazebo

by osrf c++

876 NOASSERTION

Open source robotics simulator.

Top Authors in Simulation

phetsims

50 Libraries

235

mupen64plus

13 Libraries

1284

nasa

13 Libraries

1396

LLNL

13 Libraries

281

OpenEmu

11 Libraries

140

NREL

11 Libraries

810

brummer10

10 Libraries

cran

9 Libraries

Scrawk

9 Libraries

1399

bio-phys

8 Libraries

100

Trending Kits in Simulation

6 Best C++ Simulation Libraries

Here are the best open-source C++ simulation libraries for your applications. You can use these to build simulations that model complex systems, such as physical processes, financial markets, or social interactions.

C++ is widely used to build and run simulations across many fields. C++ simulation libraries are collections of pre-written code for modeling complex systems, so you do not have to write a simulation from scratch. The language suits computationally intensive simulations because it offers high performance and low-level control, along with advanced data structures, mathematical functions, and parallel-processing support.

Whether you are working on discrete event simulation, network simulation, or fluid dynamics simulation, a library here can help you get the job done. We have handpicked the top and trending open-source C++ simulation libraries for your next application development project.

Open Dynamics Engine (ODE):

Used for simulating articulated rigid body dynamics.
It’s an open-source, high-performance physics engine.
Lets you simulate realistic interactions between objects.

Simbody:

Used in C++-based simulation applications.
Offers advanced features such as multibody dynamics, contact modeling, etc.
Also provides constraint-based modeling.

SOFA:

Typically used for medical simulation, where it supports the development of new algorithms.
It includes a variety of modules and algorithms.
Features include simulating soft-tissue deformation, fluid dynamics, and more.

Chrono:

Used to model and simulate the dynamics of large systems.
It’s a multi-physics simulation library.
Lets you simulate interactions between solid and fluid objects and control systems.

Bullet:

Used in VR, games, visual effects, robotics, machine learning, etc.
Features include collision detection, rigid body dynamics, and soft body simulation.
Offers real-time collision detection and multi-physics simulation.

Gazebo:

Used to simulate multiple robots in a 3D environment, with extensive dynamic interaction between objects.
Offers high-fidelity physics, rendering, and sensor models.
Includes physics simulation, sensor simulation, and visualization capabilities.

Here's a kit of 8 open-source projects addressing water challenges worldwide.

Trending Discussions on Simulation

Reverting a linear filter for time series in R

Why should I use normalised units in numerical integration?

Dramatic drop in numpy fromfile performance when switching from python 2 to python 3

ValueError: Layer "sequential" expects 1 input(s), but it received 10 input tensors

Convolution Function Latency Bottleneck

Write custom metadata to Parquet file in Julia

Floating point inconsistencies after upgrading libc/libm

Missing types, namespaces, directives, and assembly references

Iterating over an array of class objects VS a class object containing arrays

R: Trying to recreate mean-median difference gerrymander tests

QUESTION

Reverting a linear filter for time series in R

Asked 2022-Mar-28 at 17:50

I'm using the stats::filter function in R in order to understand ARIMA simulation (as in the function stats::arima.sim) and estimation. I know that stats::filter applies a linear filter to a vector or time series, but I'm not sure how to "unfilter" my series.

Consider the following example: I want to apply a recursive filter with value 0.7 to my series x = 1:5 (which essentially generates an AR(1) with phi=0.7). I can do so by:

x    <- 1:5
ar   <- 0.7
filt <- filter(x, ar, method="recursive")
filt

Time Series:
Start = 1
End = 5
Frequency = 1
[1]  1.0000  2.7000  4.8900  7.4230 10.1961

Which returns me essentially c(y1,y2,y3,y4,y5) where:

y1 <- x[1]
y2 <- x[2] + ar*y1
y3 <- x[3] + ar*y2
y4 <- x[4] + ar*y3
y5 <- x[5] + ar*y4

Now imagine I have the y = c(y1,y2,y3,y4,y5) series. How can I use the filter function to return me the original series x = 1:5?

I can write code to do it, like this:

unfilt <- rep(NA, 5)
unfilt[1] <- filt[1]

for(i in 2:5){
   unfilt[i] <- filt[i] - ar*filt[i-1]
}
unfilt
[1] 1 2 3 4 5

But I want to use the filter function for this instead of writing my own function. How can I do that? I tried stats::filter(filt, -ar, method="recursive"), which returns [1] 1.0000 2.0000 3.4900 4.9800 6.7101, which is not what I want.
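The mismatch in that attempt can be reproduced outside R. The following is a minimal Python sketch, not the stats::filter implementation: the helper name recursive_filter is hypothetical and it assumes zero initial conditions, as in the hand-written recursion above. It shows that a recursive filter with coefficient -ar feeds back its own output, whereas the inversion needs the previous input value.

```python
def recursive_filter(x, coef):
    """y[i] = x[i] + coef * y[i-1], with y[-1] taken as 0 (assumed initial condition)."""
    y = []
    for i, xi in enumerate(x):
        prev = y[i - 1] if i > 0 else 0.0
        y.append(xi + coef * prev)
    return y

ar = 0.7
x = [1.0, 2.0, 3.0, 4.0, 5.0]
filt = recursive_filter(x, ar)

# The attempt from the question: recursion on the *output* of the new filter.
attempt = recursive_filter(filt, -ar)

# The hand-written loop: subtract ar times the previous *input* value.
inverse = [filt[i] - ar * (filt[i - 1] if i > 0 else 0.0) for i in range(len(filt))]

print([round(v, 4) for v in attempt])
print([round(v, 4) for v in inverse])
```

The first list matches the unwanted result quoted in the question, while the second recovers the original series.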

ANSWER

Answered 2022-Mar-28 at 17:50

stats::filter used with the recursive option is a particular case of an ARMA filter.

a[1]*y[n] + a[2]*y[n-1] + … + a[p+1]*y[n-p] = b[1]*x[n] + b[2]*x[n-1] + … + b[q+1]*x[n-q]

You could implement this filter with the signal package, which allows more options than stats::filter:

a = c(1,-ar)
b = 1

filt_Arma <- signal::filter(signal::Arma(b = b, a = a),x)
filt_Arma

# Time Series:
# Start = 1 
# End = 5 
# Frequency = 1 
# [1]  1.0000  2.7000  4.8900  7.4230 10.1961

identical(filt,filt_Arma)
# [1] TRUE

Reverting an ARMA filter can be done by switching b and a, provided that the inverse filter stays stable (which is the case here):

signal::filter(signal::Arma(b = a, a = b),filt)

# Time Series:
# Start = 2 
# End = 6 
# Frequency = 1 
# [1] 1 2 3 4 5

This corresponds to switching numerator and denominator in the z-transform:

Y(z) = b(z)/a(z) X(z)

X(z) = a(z)/b(z) Y(z)
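As a language-neutral check of the swap described in this answer, here is a minimal pure-Python sketch of a direct-form difference equation. It is an illustration only and not the signal package's implementation: arma_filter is a hypothetical helper, indices are 0-based, and zero initial conditions are assumed.

```python
def arma_filter(b, a, x):
    """Solve a[0]*y[n] + a[1]*y[n-1] + ... = b[0]*x[n] + b[1]*x[n-1] + ...
    for y, assuming all values before the start of the series are zero."""
    y = []
    for n in range(len(x)):
        acc = sum(b[k] * x[n - k] for k in range(len(b)) if n - k >= 0)
        acc -= sum(a[k] * y[n - k] for k in range(1, len(a)) if n - k >= 0)
        y.append(acc / a[0])
    return y

ar = 0.7
x = [1.0, 2.0, 3.0, 4.0, 5.0]
a, b = [1.0, -ar], [1.0]

filt = arma_filter(b, a, x)       # forward filter: recursive AR(1)
unfilt = arma_filter(a, b, filt)  # inverse filter: a and b swapped

print([round(v, 4) for v in filt])
print([round(v, 4) for v in unfilt])
```

Swapping the two coefficient vectors turns the recursive (autoregressive) filter into a two-term moving-average filter, which is why the inverse stays stable here.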

Source https://stackoverflow.com/questions/71623626

QUESTION

Why should I use normalised units in numerical integration?

Asked 2022-Mar-19 at 10:40

I was simulating the solar system (Sun, Earth and Moon). When I first started working on the project, I used the base units: meters for distance, seconds for time, and metres per second for velocity. Because I was dealing with the solar system, the numbers were pretty big, for example the distance between the Earth and Sun is 150·10⁹ m.

When I numerically integrated the system with scipy.integrate.solve_ivp, the results were completely wrong. Here is an example of Earth and Moon trajectories.

But then I got a suggestion from a friend that I should use standardised units: astronomical unit (AU) for distance and years for time. And the simulation started working flawlessly!

My question is: Why is this generally valid advice for problems such as mine? (Mind that this is not about my specific problem, which was already solved, but rather about why the solution worked.)

ANSWER

Answered 2021-Jul-25 at 07:42

Most, if not all, integration modules work best out of the box if:

your dynamical variables have the same order of magnitude;
that order of magnitude is 1;
the smallest time scale of your dynamics also has the order of magnitude 1.

This typically fails for astronomical simulations where the orders of magnitude vary and values as well as time scales are often large in typical units.

The reason for the above behaviour of integrators is that they use step-size adaption, i.e., the integration step is adjusted to keep the estimated error at a defined level. The step-size adaption in turn is governed by a lot of parameters like absolute tolerance, relative tolerance, minimum time step, etc. You can usually tweak these parameters, but if you don’t, there need to be some default values and these default values are chosen with the above setup in mind.
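To make the role of the defaults concrete, here is a small sketch of what rescaling does. It assumes the documented scipy.integrate.solve_ivp defaults (rtol=1e-3, atol=1e-6) and the documented per-component error bound atol + rtol * |y|. The helper names rescale_G and allowed_error are hypothetical, and the constants are rounded standard values.

```python
import math

# Rounded standard constants in SI units.
G_SI = 6.674e-11         # gravitational constant, m^3 kg^-1 s^-2
AU = 1.495978707e11      # astronomical unit, m
YEAR = 365.25 * 86400.0  # Julian year, s
M_SUN = 1.989e30         # solar mass, kg

def rescale_G(length_unit, time_unit, mass_unit):
    """Express G in length_unit^3 / (mass_unit * time_unit^2)."""
    return G_SI * mass_unit * time_unit**2 / length_unit**3

def allowed_error(y, rtol=1e-3, atol=1e-6):
    """Per-component error bound of the form atol + rtol * |y|."""
    return atol + rtol * abs(y)

# In AU, years and solar masses, G is of order 10 (close to 4 * pi**2).
G_norm = rescale_G(AU, YEAR, M_SUN)
print(f"G in AU^3 / (M_sun yr^2): {G_norm:.3f}  (4*pi^2 = {4 * math.pi**2:.3f})")

# When a coordinate passes through zero, only atol remains in the bound.
# The same default number means very different physical lengths:
atol_if_metres = allowed_error(0.0)    # 1e-6 m, about a micrometre
atol_if_au = allowed_error(0.0) * AU   # 1e-6 AU expressed in metres
print(f"default atol, state in metres: {atol_if_metres:.1e} m")
print(f"default atol, state in AU:     {atol_if_au:.3e} m")
```

With the state in metres, the default absolute tolerance asks for micrometre accuracy whenever a coordinate is near zero. In AU the same default corresponds to roughly 150 km, which is a far more sensible target for an orbit.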

Digression

You might ask yourself: Can these parameters not be chosen more dynamically? As a developer and maintainer of an integration module, I would roughly expect that introducing such automatisms has the following consequences:

About twenty in a thousand users will not run into problems like yours.
About fifty in a thousand users (including the above) will miss an opportunity to learn rudimentary knowledge about how integrators work and about reading documentation.
About one in a thousand users will run into a horrible problem with the automatisms that is much more difficult to solve than the above.
I need to introduce new parameters governing the automatisms that are even harder to grasp for the average user.
I spend a lot of time in devising and implementing the automatisms.

Source https://stackoverflow.com/questions/68500704

QUESTION

Dramatic drop in numpy fromfile performance when switching from python 2 to python 3

Asked 2022-Mar-16 at 23:53

Background

I am analyzing large (between 0.5 and 20 GB) binary files, which contain information about particle collisions from a simulation. The number of collisions and the numbers of incoming and outgoing particles can vary, so the files consist of variable-length records. For analysis I use python and numpy. After switching from python 2 to python 3 I noticed a dramatic decrease in the performance of my scripts and traced it to the numpy.fromfile function.

Simplified code to reproduce the problem

This code, iotest.py

Generates a file of a similar structure to what I have in my studies
Reads it using numpy.fromfile
Reads it using numpy.frombuffer
Compares timing of both

import numpy as np
import os

def generate_binary_file(filename, nrecords):
    n_records = np.random.poisson(lam = nrecords)
    record_lengths = np.random.poisson(lam = 10, size = n_records).astype(dtype = 'i4')
    x = np.random.normal(size = record_lengths.sum()).astype(dtype = 'd')
    with open(filename, 'wb') as f:
        s = 0
        for i in range(n_records):
            f.write(record_lengths[i].tobytes())
            f.write(x[s:s+record_lengths[i]].tobytes())
            s += record_lengths[i]
        # Trick for testing: make sum of records equal to 0
        f.write(np.array([1], dtype = 'i4').tobytes())
        f.write(np.array([-x.sum()], dtype = 'd').tobytes())
    return os.path.getsize(filename)

def read_binary_npfromfile(filename):
    checksum = 0.0
    with open(filename, 'rb') as f:
        while True:
            try:
                record_length = np.fromfile(f, 'i4', 1)[0]
                x = np.fromfile(f, 'd', record_length)
                checksum += x.sum()
            except:
                break
    assert(np.abs(checksum) < 1e-6)

def read_binary_npfrombuffer(filename):
    checksum = 0.0
    with open(filename, 'rb') as f:
        while True:
            try:
                record_length = np.frombuffer(f.read(np.dtype('i4').itemsize), dtype = 'i4', count = 1)[0]
                x = np.frombuffer(f.read(np.dtype('d').itemsize * record_length), dtype = 'd', count = record_length)
                checksum += x.sum()
            except:
                break
    assert(np.abs(checksum) < 1e-6)


if __name__ == '__main__':
    from timeit import Timer
    from functools import partial

    fname = 'testfile.tmp'
    print("# File size[MB], Timings and errors [s]: fromfile, frombuffer")
    for i in [10**3, 3*10**3, 10**4, 3*10**4, 10**5, 3*10**5, 10**6, 3*10**6]:
        fsize = generate_binary_file(fname, i)
        t1 = Timer(partial(read_binary_npfromfile, fname))
        t2 = Timer(partial(read_binary_npfrombuffer, fname))
        a1 = np.array(t1.repeat(5, 1))
        a2 = np.array(t2.repeat(5, 1))
        print('%8.3f %12.6f %12.6f %12.6f %12.6f' % (1.0 * fsize / (2**20), a1.mean(), a1.std(), a2.mean(), a2.std()))

Results

Conclusions

In Python 2 numpy.fromfile was probably the fastest way to deal with binary files of variable structure. It was approximately 3 times faster than numpy.frombuffer. Performance of both scaled linearly with file size.

In Python 3 numpy.frombuffer became around 10% slower, while numpy.fromfile became around 9.3 times slower compared to Python 2! Performance of both still scales linearly with file size.

In the documentation of numpy.fromfile it is described as "A highly efficient way of reading binary data with a known data-type". It is not correct in Python 3 anymore. This was in fact noticed earlier by other people already.

Questions

In Python 3 how to obtain a comparable (or better) performance to Python 2, when reading binary files of variable structure?
What happened in Python 3 so that numpy.fromfile became an order of magnitude slower?

ANSWER

Answered 2022-Mar-16 at 23:52

TL;DR: np.fromfile and np.frombuffer are not optimized to read many small buffers. You can load the whole file in a big buffer and then decode it very efficiently using Numba.

Analysis

The main issue is that the benchmark measures overheads. Indeed, it performs a lot of system/C calls that are very inefficient. For example, on the 24 MiB file, the while loops call np.fromfile and np.frombuffer 601_214 times. The timings on my machine are 10.5 s for read_binary_npfromfile and 1.2 s for read_binary_npfrombuffer. This means respectively 17.4 us and 2.0 us per call for the two functions. Such per-call timings are relatively reasonable considering Numpy is not designed to operate efficiently on very small arrays (it needs to perform many checks, call some functions, wrap/unwrap CPython types, allocate some objects, etc.). The overhead of these functions can change from one version to another and, unless it becomes huge, this is not a bug. The addition of new features to Numpy and CPython often impacts overheads, and this appears to be the case here (e.g. the buffering interface). The point is that it is not really a problem, because a different approach is much faster (as it does not pay huge overheads).
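The per-call figures follow directly from the totals quoted above. A minimal sketch of that arithmetic is shown below; per_call_us is a hypothetical helper, and the inputs are the timings and call count reported in this answer, not new measurements.

```python
def per_call_us(total_seconds, n_calls):
    """Average time per call, in microseconds."""
    return total_seconds / n_calls * 1e6

n_calls = 601_214  # number of calls reported for the 24 MiB file

fromfile_us = per_call_us(10.5, n_calls)
frombuffer_us = per_call_us(1.2, n_calls)

print(f"fromfile:   {fromfile_us:.1f} us per call")
print(f"frombuffer: {frombuffer_us:.1f} us per call")
print(f"ratio:      {fromfile_us / frombuffer_us:.2f}x")
```

A cost of a few microseconds per call is small in isolation. It dominates here only because each record holds about ten doubles, so the fixed overhead is paid once for every handful of values.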

Faster Numpy code

The main way to write a fast implementation is to read the whole file once into a big byte buffer and then decode it using ndarray.view. That being said, this is a bit tricky because of data alignment and because nearly all Numpy functions must be avoided in the while loop due to their overhead. Here is an example:

def read_binary_faster_numpy(filename):
    buff = np.fromfile(filename, dtype=np.uint8)
    buff_int32 = buff.view(np.int32)
    buff_double_1 = buff[0:len(buff)//8*8].view(np.float64)
    buff_double_2 = buff[4:4+(len(buff)-4)//8*8].view(np.float64)
    nblocks = buff.size // 4      # Number of 4-byte blocks
    pos = 0                       # Displacement by block of 4 bytes
    lst = []
    while pos < nblocks:
        record_length = buff_int32[pos]
        pos += 1
        if pos + record_length * 2 > nblocks:
            break
        offset = pos // 2
        if pos % 2 == 0:          # Aligned with buff_double_1
            x = buff_double_1[offset:offset+record_length]
        else:                     # Aligned with buff_double_2
            x = buff_double_2[offset:offset+record_length]
        lst.append(x)             # np.sum is too expensive here
        pos += record_length * 2
    checksum = np.sum(np.concatenate(lst))
    assert(np.abs(checksum) < 1e-6)

The above implementation should be faster, but it is a bit tricky to understand and it is still bound by the latency of Numpy operations. Indeed, the loop still calls Numpy functions through operations like buff_int32[pos] or buff_double_1[offset:offset+record_length]. Even though the overhead of indexing is much smaller than that of the previous functions, it is still quite big for such a critical loop (with ~300_000 iterations)...
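The alignment trick can be checked in isolation. Records start on 4-byte boundaries, while a float64 view only addresses offsets that are a multiple of 8 bytes from where the view starts, hence one view starting at byte 0 and another at byte 4. The sketch below uses hypothetical helpers (locate_record, byte_offset) that mirror the index arithmetic of that function without touching any file.

```python
def locate_record(pos):
    """Map a position counted in 4-byte blocks to (view start byte, float64 index).

    An even pos lands in the view starting at byte 0 (buff_double_1),
    an odd pos in the view starting at byte 4 (buff_double_2).
    """
    view_start = 0 if pos % 2 == 0 else 4
    return view_start, pos // 2

def byte_offset(view_start, index):
    """Byte address of element `index` in a float64 view starting at view_start."""
    return view_start + 8 * index

for pos in range(1, 7):
    start, idx = locate_record(pos)
    print(f"pos={pos}: view at byte {start}, element {idx}, "
          f"byte {byte_offset(start, idx)} (expected {4 * pos})")
```

For every position, the selected view and element index resolve to byte 4 * pos, which is where the doubles of that record begin.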

Better performance with... a basic pure-Python code

It turns out that the following pure-python implementation is faster, safer and simpler:

from struct import unpack_from

def read_binary_python_struct(filename):
    checksum = 0.0
    with open(filename, 'rb') as f:
        data = f.read()
        offset = 0
        while offset < len(data):
            record_length = unpack_from('@i', data, offset)[0]
            checksum += sum(unpack_from(f'{record_length}d', data, offset + 4))
            offset += 4 + record_length * 8
    assert(np.abs(checksum) < 1e-6)

This is because the overhead of unpack_from is far lower than that of Numpy functions, but it is still not great.

In fact, the main issue now is the CPython interpreter itself. It is clearly not designed with high performance in mind, and the above code pushes it to the limit. Allocating millions of temporary reference-counted dynamic objects like variable-sized integers and strings is very expensive. It is not reasonable to let CPython do such an operation.
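The record layout that this loop walks (a 4-byte integer length followed by that many 8-byte doubles) can be checked on a small in-memory buffer. The sketch below is an illustration only: encode_records and decode_records are hypothetical helpers, and it uses the struct '=' prefix (native byte order, standard sizes, no padding) rather than the '@' prefix used above.

```python
import struct

def encode_records(records):
    """Pack each record as a 4-byte int length followed by that many doubles."""
    out = bytearray()
    for rec in records:
        out += struct.pack('=i', len(rec))
        out += struct.pack(f'={len(rec)}d', *rec)
    return bytes(out)

def decode_records(data):
    """Walk the buffer record by record, advancing by 4 + 8 * length bytes."""
    offset, records = 0, []
    while offset < len(data):
        (n,) = struct.unpack_from('=i', data, offset)
        records.append(list(struct.unpack_from(f'={n}d', data, offset + 4)))
        offset += 4 + 8 * n
    return records

original = [[1.0, 2.0], [], [3.5]]
blob = encode_records(original)
decoded = decode_records(blob)

print(len(blob), "bytes")
print(decoded)
```

Each iteration still creates several Python objects (a tuple, a list and one float per value), which is the interpreter cost described above.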

Writing a high-performance code with Numba

We can drastically speed it up using Numba, which can compile Numpy-based Python code to native code using a just-in-time compiler! Here is an example:

import numba as nb
import numpy as np

# Eagerly compiled for a contiguous 1D uint8 array; returns the sum of all doubles.
@nb.njit('float64(uint8[::1])')
def decode_buffer(buff):
    checksum = 0.0
    offset = 0
    while offset + 4 < buff.size:
        # 4-byte record length, reinterpreted in place (no copy)
        record_length = buff[offset:offset+4].view(np.int32)[0]
        start = offset + 4
        end = start + record_length * 8
        if end > buff.size:
            break
        # Payload: record_length 8-byte doubles
        x = buff[start:end].view(np.float64)
        checksum += x.sum()
        offset = end
    return checksum

def read_binary_numba(filename):
    # np.fromfile is not available inside Numba-compiled code, so read here
    buff = np.fromfile(filename, dtype=np.uint8)
    checksum = decode_buffer(buff)
    assert(np.abs(checksum) < 1e-6)

Numba removes nearly all of the NumPy overhead because the code is compiled to native code. That said, note that Numba does not implement all NumPy functions yet. This includes np.fromfile, which needs to be called outside a Numba-compiled function.

Benchmark

Here are the performance results on my machine (i5-9600KF with a high-performance NVMe SSD) with Python 3.8.1, NumPy 1.20.3 and Numba 0.54.1.

read_binary_npfromfile:      10616 ms    (   x1)
read_binary_npfrombuffer:     1132 ms    (   x9)
read_binary_faster_numpy:      509 ms    (  x21)
read_binary_python_struct:     222 ms    (  x48)
read_binary_numba:              12 ms    ( x885)
Optimal time:                    7 ms    (x1517)

One can see that the Numba implementation is extremely fast compared to the initial Python implementation, and even compared to the fastest alternative Python implementation. This is especially true considering that 8 ms is spent in np.fromfile and only 4 ms in decode_buffer!
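Timings like these can be collected with a small harness. The sketch below is not the benchmark that produced the numbers above; it only shows the measuring pattern, using `timeit.repeat` and keeping the best of several single runs. `time_reader`, `report` and the two toy functions are hypothetical stand-ins for the `read_binary_*` functions.

```python
import timeit

def time_reader(func, arg, repeat=5):
    """Return the best wall-clock time, in ms, over `repeat` single calls."""
    timings = timeit.repeat(lambda: func(arg), repeat=repeat, number=1)
    return min(timings) * 1000.0

def report(readers, arg):
    """Time every reader and express each one as a speed-up over the first."""
    results = [(name, time_reader(func, arg)) for name, func in readers]
    baseline = results[0][1]
    rows = []
    for name, ms in results:
        speedup = baseline / ms if ms > 0 else float('inf')
        rows.append((name, ms, speedup))
    return rows

# Toy stand-ins: both compute the same sum; the explicit loop is usually slower.
def slow_sum(values):
    total = 0.0
    for v in values:
        total += v
    return total

def fast_sum(values):
    return sum(values)

values = [float(i) for i in range(100_000)]
rows = report([('slow_sum', slow_sum), ('fast_sum', fast_sum)], values)
for name, ms, speedup in rows:
    label = name + ':'
    print(f'{label:<12}{ms:10.3f} ms    (x{speedup:.1f})')
```

Taking the minimum rather than the mean reduces the influence of background noise, and the first call of a Numba function should be excluded or pre-warmed unless a signature is given to `njit`, since compilation happens on that call.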

Source https://stackoverflow.com/questions/71411907

QUESTION

ValueError: Layer "sequential" expects 1 input(s), but it received 10 input tensors

Asked 2022-Mar-15 at 15:48

I am following the TFF tutorials to build my FL model. My data is contained in different CSV files, which are treated as different clients. Following this tutorial, I built the Keras model function as follows:

@tf.function
def create_tf_dataset_for_client_fn(dataset_path):
   return tf.data.experimental.CsvDataset(dataset_path,
                                          record_defaults=record_defaults,
                                          header=True)

@tf.function
def add_parsing(dataset):
  def parse_dataset(*x):
    return OrderedDict([('y', x[-1]), ('x', x[1:-1])])
  return dataset.map(parse_dataset, num_parallel_calls=tf.data.AUTOTUNE)

source = tff.simulation.datasets.FilePerUserClientData(
  dataset_paths, create_tf_dataset_for_client_fn)

client_ids = sorted(source.client_ids)

# Make sure the client ids are tensor strings when splitting data.
source._client_ids = [tf.cast(c, tf.string) for c in source.client_ids]
source = source.preprocess(add_parsing)

train, test = source.train_test_client_split(source, 1)

train_client_ids = train.client_ids

train_data = train.create_tf_dataset_for_client(train_client_ids[0])

def create_keras_model():
  initializer = tf.keras.initializers.GlorotNormal(seed=0)
  return tf.keras.models.Sequential([
      tf.keras.layers.Input(shape=(32,)),
      tf.keras.layers.Dense(10, kernel_initializer=initializer),
      tf.keras.layers.Softmax(),
  ])

def model_fn():
  keras_model = create_keras_model()
  return tff.learning.from_keras_model(
      keras_model,
      input_spec=train_data.element_spec,
      loss=tf.keras.losses.SparseCategoricalCrossentropy(),
      metrics=[tf.keras.metrics.SparseCategoricalAccuracy()])

Then I followed the instructions and ran the other @tff.tf_computation functions from the tutorial, such as server_init(), initialize_fn(), client_update() and server_update(). But when I run client_update_fn(), I get this error:

ValueError: in user code:

   File "<ipython-input-14-cada45ffae0f>", line 12, in client_update  *
       for batch in dataset:
   File "/usr/local/lib/python3.7/dist-packages/tensorflow_federated/python/learning/keras_utils.py", line 455, in forward_pass  *
       return self._forward_pass(batch_input, training=training)
   File "/usr/local/lib/python3.7/dist-packages/tensorflow_federated/python/learning/keras_utils.py", line 408, in _forward_pass  *
       predictions = self.predict_on_batch(inputs, training)
   File "/usr/local/lib/python3.7/dist-packages/tensorflow_federated/python/learning/keras_utils.py", line 398, in predict_on_batch  *
       return self._keras_model(x, training=training)
   File "/usr/local/lib/python3.7/dist-packages/keras/engine/base_layer_v1.py", line 740, in __call__  **
       self.name)
   File "/usr/local/lib/python3.7/dist-packages/keras/engine/input_spec.py", line 200, in assert_input_compatibility
       raise ValueError(f'Layer "{layer_name}" expects {len(input_spec)} input(s),'

ValueError: Layer "sequential" expects 1 input(s), but it received 10 input tensors. Inputs received: [<tf.Tensor 'x:0' shape=() dtype=int32>, <tf.Tensor 'x_1:0' shape=() dtype=int32>, <tf.Tensor 'x_2:0' shape=() dtype=int32>, <tf.Tensor 'x_3:0' shape=() dtype=float32>, <tf.Tensor 'x_4:0' shape=() dtype=float32>, <tf.Tensor 'x_5:0' shape=() dtype=float32>, <tf.Tensor 'x_6:0' shape=() dtype=float32>, <tf.Tensor 'x_7:0' shape=() dtype=float32>, <tf.Tensor 'x_8:0' shape=() dtype=float32>, <tf.Tensor 'x_9:0' shape=() dtype=int32>]

Notes:

Each CSV file has 10 columns as features (input) and one column as the label (output).
I added shape=(32,) arbitrarily; I don't really know what the shape of the data in each column is.

So the question is: how do I feed the data to the Keras model and overcome this error?

Thanks in advance

ANSWER

Answered 2022-Mar-15 at 15:48

A couple problems: Your data has ten separate features, which means you actually need 10 separate inputs for your model. However, you can also stack the features into a tensor and then use a single input with the shape (10,). Here is a working example, but please note that it uses dummy data and therefore may not make much sense in reality.
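The difference between ten separate inputs and one stacked input can be seen without TensorFlow. The following plain-Python sketch mimics the `parse_dataset` slicing from the question (`x[1:-1]` as features, `x[-1]` as label) on a made-up 12-column row, then packs the features into a single list of floats, which is roughly the role that casting and stacking the feature tensors plays in a real pipeline. The row values and the helper names `parse_row` and `stack_features` are invented for illustration.

```python
from collections import OrderedDict

def parse_row(*x):
    """Same slicing as parse_dataset: skip column 0, last column is the label."""
    return OrderedDict([('y', x[-1]), ('x', x[1:-1])])

def stack_features(example):
    """Turn the tuple of mixed int/float scalars into one float vector."""
    stacked = OrderedDict(example)
    stacked['x'] = [float(v) for v in example['x']]
    return stacked

# One made-up CSV row: 4 ints, 6 floats, 2 ints (12 columns in total).
row = (7, 12, 3, 41, 0.5, -1.25, 0.0, 2.0, -0.75, 1.5, 9, 4)

example = parse_row(*row)
print(len(example['x']))                  # number of separate feature scalars

stacked = stack_features(example)
print(len(stacked['x']), stacked['y'])    # one vector of features, plus the label
```

Before stacking, the model is handed a tuple of separate scalars, which a Sequential model with a single Input layer rejects; after stacking it receives one vector whose length must match the Input shape.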

Create dummy data:

import tensorflow as tf
import tensorflow_federated as tff
import pandas as pd
from collections import OrderedDict
import nest_asyncio
nest_asyncio.apply()

# Dummy data
samples = 5
data = [[tf.random.uniform((samples,), maxval=50, dtype=tf.int32).numpy().tolist(),
        tf.random.uniform((samples,), maxval=50, dtype=tf.int32).numpy().tolist(),
        tf.random.uniform((samples,), maxval=50, dtype=tf.int32).numpy().tolist(),
        tf.random.uniform((samples,), maxval=50, dtype=tf.int32).numpy().tolist(),
        tf.random.normal((samples,)).numpy().tolist(),
        tf.random.normal((samples,)).numpy().tolist(),
        tf.random.normal((samples,)).numpy().tolist(),
        tf.random.normal((samples,)).numpy().tolist(),
        tf.random.normal((samples,)).numpy().tolist(),
        tf.random.normal((samples,)).numpy().tolist(),
        tf.random.uniform((samples,), maxval=50, dtype=tf.int32).numpy().tolist(),
        tf.random.uniform((samples,), maxval=50, dtype=tf.int32).numpy().tolist()]]
df = pd.DataFrame(data)
df = df.explode(list(df.columns))
df.to_csv('client1.csv', index= False)
df.to_csv('client2.csv', index= False)

Load and process dataset:

import tensorflow as tf

record_defaults = [int(), int(), int(), int(), float(),float(),float(),float(),float(),float(), int(), int()]

@tf.function
def create_tf_dataset_for_client_fn(dataset_path):
   return tf.data.experimental.CsvDataset(dataset_path,
                                          record_defaults=record_defaults,
                                          header=True)
@tf.function
def add_parsing(dataset):
  def parse_dataset(*x):
    return OrderedDict([('y', x[-1]), ('x', x[1:-1])])
  return dataset.map(parse_dataset, num_parallel_calls=tf.data.AUTOTUNE)

dataset_paths = {'client1': '/content/client1.csv', 'client2': '/content/client2.csv'}

source = tff.simulation.datasets.FilePerUserClientData(
  dataset_paths, create_tf_dataset_for_client_fn)

client_ids = sorted(source.client_ids)

# Make sure the client ids are tensor strings when splitting data.
source._client_ids = [tf.cast(c, tf.string) for c in source.client_ids]
source = source.preprocess(add_parsing)

train, test = source.train_test_client_split(source, 1)

train_client_ids = train.client_ids

def reshape_data(d):
  d['x'] = tf.stack([tf.cast(x, dtype=tf.float32) for x in d['x']])
  return d

train_data = [train.create_tf_dataset_for_client(c).map(reshape_data).batch(1) for c in train_client_ids]

Create and run model:


def create_keras_model():
  initializer = tf.keras.initializers.GlorotNormal(seed=0)
  return tf.keras.models.Sequential([
      tf.keras.layers.Input(shape=(10,)),
      tf.keras.layers.Dense(75, kernel_initializer=initializer),
      tf.keras.layers.Dense(50, kernel_initializer=initializer),
      tf.keras.layers.Softmax(),
  ])

def model_fn():
  keras_model = create_keras_model()
  return tff.learning.from_keras_model(
      keras_model,
      input_spec=train_data[0].element_spec,
      loss=tf.keras.losses.SparseCategoricalCrossentropy(),
      metrics=[tf.keras.metrics.SparseCategoricalAccuracy()])

# Plain-Python initializer; the federated initialize_fn defined further down replaces it.
def initialize_fn():
  model = model_fn()
  return model.trainable_variables

@tf.function
def client_update(model, dataset, server_weights, client_optimizer):
  """Performs training (using the server model weights) on the client's dataset."""
  client_weights = model.trainable_variables
  tf.nest.map_structure(lambda x, y: x.assign(y),
                        client_weights, server_weights)

  for batch in dataset:
    with tf.GradientTape() as tape:
      outputs = model.forward_pass(batch)

    grads = tape.gradient(outputs.loss, client_weights)
    grads_and_vars = zip(grads, client_weights)
    client_optimizer.apply_gradients(grads_and_vars)

  return client_weights

@tf.function
def server_update(model, mean_client_weights):
  """Updates the server model weights as the average of the client model weights."""
  model_weights = model.trainable_variables
  tf.nest.map_structure(lambda x, y: x.assign(y),
                        model_weights, mean_client_weights)
  return model_weights

federated_float_on_clients = tff.FederatedType(tf.float32, tff.CLIENTS)

@tff.federated_computation(tff.FederatedType(tf.float32, tff.CLIENTS))
def get_average_temperature(client_temperatures):
  return tff.federated_mean(client_temperatures)

str(get_average_temperature.type_signature)
get_average_temperature([68.5, 70.3, 69.8])

@tff.tf_computation
def server_init():
  model = model_fn()
  return model.trainable_variables

@tff.federated_computation
def initialize_fn():
  return tff.federated_value(server_init(), tff.SERVER)

whimsy_model = model_fn()
tf_dataset_type = tff.SequenceType(whimsy_model.input_spec)
model_weights_type = server_init.type_signature.result

@tff.tf_computation(tf_dataset_type, model_weights_type)
def client_update_fn(tf_dataset, server_weights):
  model = model_fn()
  client_optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)
  return client_update(model, tf_dataset, server_weights, client_optimizer)

@tff.tf_computation(model_weights_type)
def server_update_fn(mean_client_weights):
  model = model_fn()
  return server_update(model, mean_client_weights)

federated_server_type = tff.FederatedType(model_weights_type, tff.SERVER)
federated_dataset_type = tff.FederatedType(tf_dataset_type, tff.CLIENTS)

@tff.federated_computation(federated_server_type, federated_dataset_type)
def next_fn(server_weights, federated_dataset):
  server_weights_at_client = tff.federated_broadcast(server_weights)
  client_weights = tff.federated_map(
      client_update_fn, (federated_dataset, server_weights_at_client))
  mean_client_weights = tff.federated_mean(client_weights)

  server_weights = tff.federated_map(server_update_fn, mean_client_weights)
  return server_weights

federated_algorithm = tff.templates.IterativeProcess(
    initialize_fn=initialize_fn,
    next_fn=next_fn
)

server_state = federated_algorithm.initialize()
for round in range(15):
  server_state = federated_algorithm.next(server_state, train_data)

Regarding this line in the model: tf.keras.layers.Dense(50, kernel_initializer=initializer). I am using 50 output nodes because the dummy labels I created can vary between 0 and 49. This is necessary when using the SparseCategoricalCrossentropy loss function, which treats each integer label as an index into the output nodes.

Source https://stackoverflow.com/questions/71428904

QUESTION

Convolution Function Latency Bottleneck

Asked 2022-Mar-10 at 13:57

I have implemented a Convolutional Neural Network in C and have been studying what parts of it have the longest latency.

Based on my research, the massive amount of matrix multiplication required by CNNs makes running them on CPUs, and even GPUs, very inefficient. However, when I actually profiled my code (on an unoptimized build), I found that something other than the multiplication itself was the bottleneck of the implementation.

After turning on optimization (-O3 -march=native -ffast-math, gcc cross compiler), the Gprof result was the following:

Clearly, the convolution2D function takes the largest amount of time to run, followed by the batch normalization and depthwise convolution functions.

The convolution function in question looks like this:

void convolution2D(int isize,   // width/height of input
        int osize,              // width/height of output
        int ksize,              // width/height of kernel
        int stride,             // shift between input pixels, between consecutive outputs
        int pad,                // offset between (0,0) pixels between input and output
        int idepth, int odepth, // number of input and output channels
        float idata[isize][isize][idepth],
        float odata[osize][osize][odepth],
        float kdata[odepth][ksize][ksize][idepth])
{
    // iterate over the output
    for (int oy = 0; oy < osize; ++oy) {
    for (int ox = 0; ox < osize; ++ox) {
    for (int od = 0; od < odepth; ++od) {
        odata[oy][ox][od] = 0;  // Zero the output first; otherwise repeated calls would keep accumulating on top of the previous result.
        for (int ky = 0; ky < ksize; ++ky) {
        for (int kx = 0; kx < ksize; ++kx) {
            // map position in output and kernel to the input
            int iy = stride * oy + ky - pad;
            int ix = stride * ox + kx - pad;
            // use only valid inputs
            if (iy >= 0 && iy < isize && ix >= 0 && ix < isize)
                for (int id = 0; id < idepth; ++id)
                    odata[oy][ox][od] += kdata[od][ky][kx][id] * idata[iy][ix][id];
        }}
    }}}
}

This is a design based on my previous question and most of the processing time should fall on the convolution itself: odata[oy][ox][od] += kdata[od][ky][kx][id] * idata[iy][ix][id];.
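To put a rough number on how often that statement runs, the loop bounds can be multiplied out. The sketch below is only illustrative: the helper names are hypothetical, the output-size formula assumes the same pad on both sides of the input (the question does not state this), and the count is an upper bound because the bounds check skips some iterations near the border.

```c
/* Output width/height for a square input.
   Assumption: symmetric padding and integer (floor) division. */
static int conv_output_size(int isize, int ksize, int stride, int pad)
{
    return (isize + 2 * pad - ksize) / stride + 1;
}

/* Upper bound on the number of multiply-accumulate operations performed
   by the loop nest: one per (oy, ox, od, ky, kx, id) combination. */
static long long conv_mac_upper_bound(int osize, int ksize,
                                      int idepth, int odepth)
{
    return (long long)osize * osize * odepth * ksize * ksize * idepth;
}
```

For example, a hypothetical 32x32 input with 16 channels, a 3x3 kernel, stride 1, pad 1 and 32 output channels gives an output size of 32 and at most 4718592 multiply-accumulates per call.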

Using objdump -drwC -Mintel to take a look at the assembly code gives me the following:

0000000000007880 <convolution2D>:
    7880:   f3 0f 1e fa             endbr64
    7884:   55                      push   rbp
    7885:   48 89 e5                mov    rbp,rsp
    7888:   41 57                   push   r15
    788a:   41 56                   push   r14
    788c:   41 55                   push   r13
    788e:   41 54                   push   r12
    7890:   53                      push   rbx
    7891:   48 81 ec b0 00 00 00    sub    rsp,0xb0
    7898:   ff 15 4a a7 00 00       call   QWORD PTR [rip+0xa74a]        # 11fe8 <mcount@GLIBC_2.2.5>
    789e:   89 d3                   mov    ebx,edx
    78a0:   89 55 a8                mov    DWORD PTR [rbp-0x58],edx
    78a3:   89 8d 74 ff ff ff       mov    DWORD PTR [rbp-0x8c],ecx
    78a9:   49 63 d1                movsxd rdx,r9d
    78ac:   48 63 cf                movsxd rcx,edi
    78af:   41 89 f2                mov    r10d,esi
    78b2:   89 b5 38 ff ff ff       mov    DWORD PTR [rbp-0xc8],esi
    78b8:   49 63 c0                movsxd rax,r8d
    78bb:   48 0f af ca             imul   rcx,rdx
    78bf:   48 63 75 10             movsxd rsi,DWORD PTR [rbp+0x10]
    78c3:   49 89 d6                mov    r14,rdx
    78c6:   4c 8d 24 95 00 00 00 00     lea    r12,[rdx*4+0x0]
    78ce:   41 89 fd                mov    r13d,edi
    78d1:   49 89 cb                mov    r11,rcx
    78d4:   48 89 8d 60 ff ff ff    mov    QWORD PTR [rbp-0xa0],rcx
    78db:   49 63 ca                movsxd rcx,r10d
    78de:   4c 8d 0c b5 00 00 00 00     lea    r9,[rsi*4+0x0]
    78e6:   49 89 f0                mov    r8,rsi
    78e9:   48 0f af f1             imul   rsi,rcx
    78ed:   48 63 cb                movsxd rcx,ebx
    78f0:   4c 89 8d 48 ff ff ff    mov    QWORD PTR [rbp-0xb8],r9
    78f7:   48 0f af d1             imul   rdx,rcx
    78fb:   48 8d 3c 95 00 00 00 00     lea    rdi,[rdx*4+0x0]
    7903:   45 85 d2                test   r10d,r10d
    7906:   0f 8e 73 02 00 00       jle    7b7f <convolution2D+0x2ff>
    790c:   48 c1 ef 02             shr    rdi,0x2
    7910:   49 c1 e9 02             shr    r9,0x2
    7914:   48 89 7d c8             mov    QWORD PTR [rbp-0x38],rdi
    7918:   4c 89 e7                mov    rdi,r12
    791b:   4c 89 8d 58 ff ff ff    mov    QWORD PTR [rbp-0xa8],r9
    7922:   48 c1 ef 02             shr    rdi,0x2
    7926:   48 89 bd 50 ff ff ff    mov    QWORD PTR [rbp-0xb0],rdi
    792d:   45 85 c0                test   r8d,r8d
    7930:   0f 8e 49 02 00 00       jle    7b7f <convolution2D+0x2ff>
    7936:   48 c1 e6 02             shl    rsi,0x2
    793a:   48 0f af d1             imul   rdx,rcx
    793e:   29 c3                   sub    ebx,eax
    7940:   89 c7                   mov    edi,eax
    7942:   48 89 b5 30 ff ff ff    mov    QWORD PTR [rbp-0xd0],rsi
    7949:   48 8b 75 20             mov    rsi,QWORD PTR [rbp+0x20]
    794d:   48 89 85 68 ff ff ff    mov    QWORD PTR [rbp-0x98],rax
    7954:   f7 df                   neg    edi
    7956:   45 8d 7e ff             lea    r15d,[r14-0x1]
    795a:   89 9d 70 ff ff ff       mov    DWORD PTR [rbp-0x90],ebx
    7960:   89 bd 3c ff ff ff       mov    DWORD PTR [rbp-0xc4],edi
    7966:   48 8d 0c 95 00 00 00 00     lea    rcx,[rdx*4+0x0]
    796e:   89 7d ac                mov    DWORD PTR [rbp-0x54],edi
    7971:   89 5d d4                mov    DWORD PTR [rbp-0x2c],ebx
    7974:   48 89 4d 98             mov    QWORD PTR [rbp-0x68],rcx
    7978:   4a 8d 0c 9d 00 00 00 00     lea    rcx,[r11*4+0x0]
    7980:   c7 45 80 00 00 00 00    mov    DWORD PTR [rbp-0x80],0x0
    7987:   48 89 75 88             mov    QWORD PTR [rbp-0x78],rsi
    798b:   41 8d 70 ff             lea    esi,[r8-0x1]
    798f:   48 89 4d c0             mov    QWORD PTR [rbp-0x40],rcx
    7993:   48 8d 04 b5 04 00 00 00     lea    rax,[rsi*4+0x4]
    799b:   c7 45 90 00 00 00 00    mov    DWORD PTR [rbp-0x70],0x0
    79a2:   48 89 85 28 ff ff ff    mov    QWORD PTR [rbp-0xd8],rax
    79a9:   44 89 f0                mov    eax,r14d
    79ac:   45 89 ee                mov    r14d,r13d
    79af:   41 89 c5                mov    r13d,eax
    79b2:   48 8b 85 28 ff ff ff    mov    rax,QWORD PTR [rbp-0xd8]
    79b9:   48 03 45 88             add    rax,QWORD PTR [rbp-0x78]
    79bd:   48 c7 85 78 ff ff ff 00 00 00 00    mov    QWORD PTR [rbp-0x88],0x0
    79c8:   c7 45 84 00 00 00 00    mov    DWORD PTR [rbp-0x7c],0x0
    79cf:   c7 45 94 00 00 00 00    mov    DWORD PTR [rbp-0x6c],0x0
    79d6:   44 8b 95 70 ff ff ff    mov    r10d,DWORD PTR [rbp-0x90]
    79dd:   48 89 45 b0             mov    QWORD PTR [rbp-0x50],rax
    79e1:   48 63 45 80             movsxd rax,DWORD PTR [rbp-0x80]
    79e5:   48 2b 85 68 ff ff ff    sub    rax,QWORD PTR [rbp-0x98]
    79ec:   48 0f af 85 60 ff ff ff     imul   rax,QWORD PTR [rbp-0xa0]
    79f4:   48 89 85 40 ff ff ff    mov    QWORD PTR [rbp-0xc0],rax
    79fb:   8b 85 3c ff ff ff       mov    eax,DWORD PTR [rbp-0xc4]
    7a01:   89 45 d0                mov    DWORD PTR [rbp-0x30],eax
    7a04:   48 8b 45 88             mov    rax,QWORD PTR [rbp-0x78]
    7a08:   48 8b 9d 78 ff ff ff    mov    rbx,QWORD PTR [rbp-0x88]
    7a0f:   4c 8d 04 98             lea    r8,[rax+rbx*4]
    7a13:   48 8b 45 28             mov    rax,QWORD PTR [rbp+0x28]
    7a17:   48 8b 5d 18             mov    rbx,QWORD PTR [rbp+0x18]
    7a1b:   48 89 45 b8             mov    QWORD PTR [rbp-0x48],rax
    7a1f:   48 63 45 84             movsxd rax,DWORD PTR [rbp-0x7c]
    7a23:   48 2b 85 68 ff ff ff    sub    rax,QWORD PTR [rbp-0x98]
    7a2a:   48 0f af 85 50 ff ff ff     imul   rax,QWORD PTR [rbp-0xb0]
    7a32:   48 03 85 40 ff ff ff    add    rax,QWORD PTR [rbp-0xc0]
    7a39:   48 8d 04 83             lea    rax,[rbx+rax*4]
    7a3d:   48 89 45 a0             mov    QWORD PTR [rbp-0x60],rax
    7a41:   66 66 2e 0f 1f 84 00 00 00 00 00    data16 nop WORD PTR cs:[rax+rax*1+0x0]
    7a4c:   0f 1f 40 00             nop    DWORD PTR [rax+0x0]
    7a50:   8b 45 a8                mov    eax,DWORD PTR [rbp-0x58]
    7a53:   41 c7 00 00 00 00 00    mov    DWORD PTR [r8],0x0
    7a5a:   45 31 db                xor    r11d,r11d
    7a5d:   48 8b 5d a0             mov    rbx,QWORD PTR [rbp-0x60]
    7a61:   44 8b 4d ac             mov    r9d,DWORD PTR [rbp-0x54]
    7a65:   85 c0                   test   eax,eax
    7a67:   0f 8e 98 00 00 00       jle    7b05 <convolution2D+0x285>
    7a6d:   0f 1f 00                nop    DWORD PTR [rax]
    7a70:   45 85 c9                test   r9d,r9d
    7a73:   78 7b                   js     7af0 <convolution2D+0x270>
    7a75:   45 39 ce                cmp    r14d,r9d
    7a78:   7e 76                   jle    7af0 <convolution2D+0x270>
    7a7a:   48 8b 45 b8             mov    rax,QWORD PTR [rbp-0x48]
    7a7e:   8b 55 d0                mov    edx,DWORD PTR [rbp-0x30]
    7a81:   48 89 de                mov    rsi,rbx
    7a84:   4a 8d 3c 98             lea    rdi,[rax+r11*4]
    7a88:   eb 13                   jmp    7a9d <convolution2D+0x21d>
    7a8a:   66 0f 1f 44 00 00       nop    WORD PTR [rax+rax*1+0x0]
    7a90:   ff c2                   inc    edx
    7a92:   4c 01 e7                add    rdi,r12
    7a95:   4c 01 e6                add    rsi,r12
    7a98:   44 39 d2                cmp    edx,r10d
    7a9b:   74 53                   je     7af0 <convolution2D+0x270>
    7a9d:   85 d2                   test   edx,edx
    7a9f:   78 ef                   js     7a90 <convolution2D+0x210>
    7aa1:   41 39 d6                cmp    r14d,edx
    7aa4:   7e ea                   jle    7a90 <convolution2D+0x210>
    7aa6:   45 85 ed                test   r13d,r13d
    7aa9:   7e e5                   jle    7a90 <convolution2D+0x210>
    7aab:   c4 c1 7a 10 08          vmovss xmm1,DWORD PTR [r8]
    7ab0:   31 c0                   xor    eax,eax
    7ab2:   66 66 2e 0f 1f 84 00 00 00 00 00    data16 nop WORD PTR cs:[rax+rax*1+0x0]
    7abd:   0f 1f 00                nop    DWORD PTR [rax]
    7ac0:   c5 fa 10 04 87          vmovss xmm0,DWORD PTR [rdi+rax*4]
    7ac5:   48 89 c1                mov    rcx,rax
    7ac8:   c5 fa 59 04 86          vmulss xmm0,xmm0,DWORD PTR [rsi+rax*4]
    7acd:   48 ff c0                inc    rax
    7ad0:   c5 f2 58 c8             vaddss xmm1,xmm1,xmm0
    7ad4:   c4 c1 7a 11 08          vmovss DWORD PTR [r8],xmm1
    7ad9:   49 39 cf                cmp    r15,rcx
    7adc:   75 e2                   jne    7ac0 <convolution2D+0x240>
    7ade:   ff c2                   inc    edx
    7ae0:   4c 01 e7                add    rdi,r12
    7ae3:   4c 01 e6                add    rsi,r12
    7ae6:   44 39 d2                cmp    edx,r10d
    7ae9:   75 b2                   jne    7a9d <convolution2D+0x21d>
    7aeb:   0f 1f 44 00 00          nop    DWORD PTR [rax+rax*1+0x0]
    7af0:   4c 03 5d c8             add    r11,QWORD PTR [rbp-0x38]
    7af4:   48 03 5d c0             add    rbx,QWORD PTR [rbp-0x40]
    7af8:   41 ff c1                inc    r9d
    7afb:   44 3b 4d d4             cmp    r9d,DWORD PTR [rbp-0x2c]
    7aff:   0f 85 6b ff ff ff       jne    7a70 <convolution2D+0x1f0>
    7b05:   48 8b 5d 98             mov    rbx,QWORD PTR [rbp-0x68]
    7b09:   49 83 c0 04             add    r8,0x4
    7b0d:   48 01 5d b8             add    QWORD PTR [rbp-0x48],rbx
    7b11:   4c 3b 45 b0             cmp    r8,QWORD PTR [rbp-0x50]
    7b15:   0f 85 35 ff ff ff       jne    7a50 <convolution2D+0x1d0>
    7b1b:   8b 9d 74 ff ff ff       mov    ebx,DWORD PTR [rbp-0x8c]
    7b21:   8b 45 94                mov    eax,DWORD PTR [rbp-0x6c]
    7b24:   48 8b 8d 48 ff ff ff    mov    rcx,QWORD PTR [rbp-0xb8]
    7b2b:   01 5d d0                add    DWORD PTR [rbp-0x30],ebx
    7b2e:   48 01 4d b0             add    QWORD PTR [rbp-0x50],rcx
    7b32:   01 5d 84                add    DWORD PTR [rbp-0x7c],ebx
    7b35:   48 8b 8d 58 ff ff ff    mov    rcx,QWORD PTR [rbp-0xa8]
    7b3c:   41 01 da                add    r10d,ebx
    7b3f:   48 01 8d 78 ff ff ff    add    QWORD PTR [rbp-0x88],rcx
    7b46:   ff c0                   inc    eax
    7b48:   39 85 38 ff ff ff       cmp    DWORD PTR [rbp-0xc8],eax
    7b4e:   74 08                   je     7b58 <convolution2D+0x2d8>
    7b50:   89 45 94                mov    DWORD PTR [rbp-0x6c],eax
    7b53:   e9 ac fe ff ff          jmp    7a04 <convolution2D+0x184>
    7b58:   8b 4d 90                mov    ecx,DWORD PTR [rbp-0x70]
    7b5b:   48 8b b5 30 ff ff ff    mov    rsi,QWORD PTR [rbp-0xd0]
    7b62:   01 5d d4                add    DWORD PTR [rbp-0x2c],ebx
    7b65:   01 5d ac                add    DWORD PTR [rbp-0x54],ebx
    7b68:   01 5d 80                add    DWORD PTR [rbp-0x80],ebx
    7b6b:   48 01 75 88             add    QWORD PTR [rbp-0x78],rsi
    7b6f:   8d 41 01                lea    eax,[rcx+0x1]
    7b72:   39 4d 94                cmp    DWORD PTR [rbp-0x6c],ecx
    7b75:   74 08                   je     7b7f <convolution2D+0x2ff>
    7b77:   89 45 90                mov    DWORD PTR [rbp-0x70],eax
    7b7a:   e9 33 fe ff ff          jmp    79b2 <convolution2D+0x132>
    7b7f:   48 81 c4 b0 00 00 00    add    rsp,0xb0
    7b86:   5b                      pop    rbx
    7b87:   41 5c                   pop    r12
    7b89:   41 5d                   pop    r13
    7b8b:   41 5e                   pop    r14
    7b8d:   41 5f                   pop    r15
    7b8f:   5d                      pop    rbp
    7b90:   c3                      ret
    7b91:   66 66 2e 0f 1f 84 00 00 00 00 00    data16 nop WORD PTR cs:[rax+rax*1+0x0]
    7b9c:   0f 1f 40 00             nop    DWORD PTR [rax+0x0]

For reference, I'm using an AMD Ryzen 7 CPU, which uses the Zen 2 architecture. Here is its list of instructions (page 101).

I suspect that the data here points to a memory issue instead of simply the multiplication being the cause of the bottleneck.
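One detail in the disassembly supports this suspicion: the innermost loop (7ac0 to 7adc) uses scalar vmulss/vaddss and stores the running sum back to [r8] on every iteration. A possible explanation is that the compiler cannot prove that odata does not overlap idata or kdata. The following is a minimal sketch of one way to test that idea, not a measured fix: the name convolution2D_acc is hypothetical, and the restrict qualifiers are an assumption that the three arrays never alias.

```c
/* Sketch: same loop nest as convolution2D, but the sum lives in a local
   variable and is written to memory once per output element.
   Assumption: idata, odata and kdata never overlap (hence restrict). */
void convolution2D_acc(int isize, int osize, int ksize, int stride, int pad,
                       int idepth, int odepth,
                       float (*restrict idata)[isize][idepth],
                       float (*restrict odata)[osize][odepth],
                       float (*restrict kdata)[ksize][ksize][idepth])
{
    for (int oy = 0; oy < osize; ++oy) {
    for (int ox = 0; ox < osize; ++ox) {
    for (int od = 0; od < odepth; ++od) {
        float sum = 0.0f;  /* local accumulator, free to stay in a register */
        for (int ky = 0; ky < ksize; ++ky) {
            int iy = stride * oy + ky - pad;
            if (iy < 0 || iy >= isize) continue;      /* row outside the input */
            for (int kx = 0; kx < ksize; ++kx) {
                int ix = stride * ox + kx - pad;
                if (ix < 0 || ix >= isize) continue;  /* column outside the input */
                for (int id = 0; id < idepth; ++id)
                    sum += kdata[od][ky][kx][id] * idata[iy][ix][id];
            }
        }
        odata[oy][ox][od] = sum;  /* single store per output element */
    }}}
}
```

Whether this actually removes the per-iteration store and lets the compiler vectorize the reduction depends on the compiler and flags, so the generated assembly and a fresh profile would need to be checked.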

Question:

How can I improve this code so that it does not cause a memory bottleneck?

I'm guessing this is actually a problem particular to my code, perhaps something related to the multidimensional arrays I'm using. If I instead used one big single-dimensional array for each variable, would the latency decrease?
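On the single-dimensional array idea: a C multidimensional array (including the variable-length array parameters used here) is already one contiguous row-major block, so indexing it as [y][x][d] reaches the same address as a hand-computed flat offset. The sketch below only illustrates that equivalence; the helper names idx3 and idx4 are hypothetical.

```c
#include <stddef.h>

/* Flat offset of element [y][x][d] in a row-major [size][size][depth] array. */
static size_t idx3(int size, int depth, int y, int x, int d)
{
    return ((size_t)y * size + x) * depth + d;
}

/* Flat offset of element [od][ky][kx][id] in a row-major
   [odepth][ksize][ksize][idepth] array. */
static size_t idx4(int ksize, int idepth, int od, int ky, int kx, int id)
{
    return (((size_t)od * ksize + ky) * ksize + kx) * idepth + id;
}
```

Since both forms touch the same memory in the same order, switching to a flat array by itself would probably not change the access pattern; any difference would have to come from what the compiler can prove about the pointers.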

Relevant information:

There are two ways I declare the variables that are passed to this function. The first is as a global variable (usually in a struct), the second is as dynamic allocation:

207    7b75:   74 08                   je     7b7f &lt;convolution2D+0x2ff&gt;
208    7b77:   89 45 90                mov    DWORD PTR [rbp-0x70],eax
209    7b7a:   e9 33 fe ff ff          jmp    79b2 &lt;convolution2D+0x132&gt;
210    7b7f:   48 81 c4 b0 00 00 00    add    rsp,0xb0
211    7b86:   5b                      pop    rbx
212    7b87:   41 5c                   pop    r12
213    7b89:   41 5d                   pop    r13
214    7b8b:   41 5e                   pop    r14
215    7b8d:   41 5f                   pop    r15
216    7b8f:   5d                      pop    rbp
217    7b90:   c3                      ret    
218    7b91:   66 66 2e 0f 1f 84 00 00 00 00 00    data16 nop WORD PTR cs:[rax+rax*1+0x0]
219    7b9c:   0f 1f 40 00             nop    DWORD PTR [rax+0x0]
220
221float (*arr)[x][y] = calloc(z, sizeof *arr);
222

Perhaps the order in which I declare these matrixes is not cache-friendly, but I am not sure how to re-order it.

Stride values for the previous function are always 1 or 2, usually 1.

Here is the output of valgrind --tool=cachegrind:

==430300== Cachegrind, a cache and branch-prediction profiler
==430300== Copyright (C) 2002-2017, and GNU GPL'd, by Nicholas Nethercote et al.
==430300== Using Valgrind-3.15.0 and LibVEX; rerun with -h for copyright info
==430300== Command: ./EmbeddedNet test 1
==430300== Parent PID: 170008
==430300== 
--430300-- warning: L3 cache found, using its data for the LL simulation.
==430300== 
==430300== I   refs:      6,369,594,192
==430300== I1  misses:            4,271
==430300== LLi misses:            2,442
==430300== I1  miss rate:          0.00%
==430300== LLi miss rate:          0.00%
==430300== 
==430300== D   refs:      2,064,233,110  (1,359,003,131 rd   + 705,229,979 wr)
==430300== D1  misses:       34,476,969  (   19,010,839 rd   +  15,466,130 wr)
==430300== LLd misses:        5,311,277  (    1,603,955 rd   +   3,707,322 wr)
==430300== D1  miss rate:           1.7% (          1.4%     +         2.2%  )
==430300== LLd miss rate:           0.3% (          0.1%     +         0.5%  )
==430300== 
==430300== LL refs:          34,481,240  (   19,015,110 rd   +  15,466,130 wr)
==430300== LL misses:         5,313,719  (    1,606,397 rd   +   3,707,322 wr)
==430300== LL miss rate:            0.1% (          0.0%     +         0.5%  )

ANSWER

Answered 2022-Mar-10 at 13:57

Looking at the Cachegrind result, memory doesn't look like your bottleneck. The NN has to be stored in memory anyway. If it were so large that your program had a lot of L1 cache misses, it would be worth trying to minimize them, but an L1 data miss rate of 1.7% is not a problem.
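For reference, the percentages Cachegrind prints are simply misses divided by references. The small helper below is only an illustration of that arithmetic; the name `miss_rate_percent` is made up for this sketch and is not part of Valgrind.

```c
/* Miss rate as a percentage: misses per 100 memory references.
 * Hypothetical helper, only meant to reproduce Cachegrind's summary lines. */
static double miss_rate_percent(unsigned long long misses,
                                unsigned long long refs)
{
    if (refs == 0)
        return 0.0;   /* avoid dividing by zero when nothing was counted */
    return 100.0 * (double)misses / (double)refs;
}
```

Plugging in the counts from the question, `miss_rate_percent(34476969ULL, 2064233110ULL)` comes out at roughly 1.67, which Cachegrind rounds to the 1.7% shown for D1, and `miss_rate_percent(5311277ULL, 2064233110ULL)` is roughly 0.26, shown as 0.3% for LLd.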

So you're trying to make this run fast anyway. Looking at your code, what happens in the innermost loop is very simple (load -> multiply -> add -> store), and it has no side effect other than the final store. This kind of code is easy to parallelize, for example by multithreading or vectorizing. Since you can write code of some complexity, I think you'll know how to run this in multiple threads, and you asked in the comments how to vectorize the code manually.
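Before parallelizing anything, one small thing is visible in the disassembly from the question: the `vmovss DWORD PTR [r8],xmm1` store sits inside the innermost loop, so the running sum is written back to `odata` on every iteration. A minimal sketch of keeping the sum in a local variable instead is shown below. The function name `accumulate` is hypothetical, and whether this makes a measurable difference on your machine is something you would have to time.

```c
#include <stddef.h>

/* Adds the dot product of k[0..n) and in[0..n) to a running sum.
 * The sum lives in a local variable, so the compiler can keep it in a
 * register and the loop body is only load, multiply, add. */
static float accumulate(float acc, const float *k, const float *in, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        acc += k[i] * in[i];
    return acc;
}
```

In `convolution2D` you would start each output element with `float acc = 0;`, call `acc = accumulate(acc, kdata[od][ky][kx], idata[iy][ix], idepth);` for every valid `(ky, kx)`, and write `odata[oy][ox][od] = acc;` once at the end. The additions happen in the same order as in the original loop, so the result should not change.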

I will explain that part, but bear in mind that once you vectorize the code manually, it will usually be tied to a particular CPU architecture. Let's set aside CPUs that are not AMD64-compatible, such as ARM. Even then, you can choose among MMX, SSE, AVX, and AVX-512 as the extension for vectorized computation, and each extension has multiple versions. If you want maximum portability, SSE2 is a reasonable choice: it appeared with the Pentium 4, supports 128-bit vectors, and is part of the AMD64 baseline. For this post I'll use AVX2, which supports 128-bit and 256-bit vectors. It runs fine on your CPU and is reasonably portable these days, being supported since Haswell (2013) and Excavator (2015).
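To make the portability trade-off concrete, here is a rough sketch of a 256-bit dot product that is only used when the CPU reports the needed features at run time, with a plain scalar fallback otherwise. It is not the code from this answer. It assumes GCC or Clang on x86-64 (for `__builtin_cpu_supports` and the `target` attribute), it uses the fused multiply-add intrinsic `_mm256_fmadd_ps`, and the names `dot`, `dot_avx2` and `dot_scalar` are made up for the example.

```c
#include <stddef.h>

/* Portable fallback: one multiply and one add per element. */
static float dot_scalar(const float *a, const float *b, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i)
        sum += a[i] * b[i];
    return sum;
}

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#include <immintrin.h>

/* Built for AVX2 + FMA regardless of the global -m flags (GCC/Clang only). */
__attribute__((target("avx2,fma")))
static float dot_avx2(const float *a, const float *b, size_t n)
{
    __m256 acc = _mm256_setzero_ps();        /* 8 partial sums */
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);  /* unaligned loads are allowed */
        __m256 vb = _mm256_loadu_ps(b + i);
        acc = _mm256_fmadd_ps(va, vb, acc);  /* acc += va * vb, 8 lanes */
    }
    /* Reduce the 8 lanes to a single float. */
    __m128 lo = _mm256_castps256_ps128(acc);
    __m128 hi = _mm256_extractf128_ps(acc, 1);
    __m128 s  = _mm_add_ps(lo, hi);
    s = _mm_add_ps(s, _mm_movehl_ps(s, s));
    s = _mm_add_ss(s, _mm_shuffle_ps(s, s, 1));
    float sum = _mm_cvtss_f32(s);
    for (; i < n; ++i)                       /* leftover elements */
        sum += a[i] * b[i];
    return sum;
}
#endif

/* Picks the vector version only if the running CPU supports it. */
float dot(const float *a, const float *b, size_t n)
{
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    if (__builtin_cpu_supports("avx2") && __builtin_cpu_supports("fma"))
        return dot_avx2(a, b, n);
#endif
    return dot_scalar(a, b, n);
}
```

Note that the vector version adds the products in a different order than the scalar loop, so for general floating-point data the two results can differ in the last bits. If you compile the whole file with `-mavx2 -mfma` instead, the attribute and the run-time check are unnecessary, but the binary will then only run on CPUs that have both features.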

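The pattern you're using in the inner loop is a multiply-accumulate, which maps onto FMA (fused multiply-add). Strictly speaking, FMA is a separate extension (FMA3) rather than part of AVX2, but Haswell and Excavator support both, and it has an instruction for exactly this. Have a look at this function and the compiled output.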

222==430300== Cachegrind, a cache and branch-prediction profiler
223==430300== Copyright (C) 2002-2017, and GNU GPL'd, by Nicholas Nethercote et al.
224==430300== Using Valgrind-3.15.0 and LibVEX; rerun with -h for copyright info
225==430300== Command: ./EmbeddedNet test 1
226==430300== Parent PID: 170008
227==430300== 
228--430300-- warning: L3 cache found, using its data for the LL simulation.
229==430300== 
230==430300== I   refs:      6,369,594,192
231==430300== I1  misses:            4,271
232==430300== LLi misses:            2,442
233==430300== I1  miss rate:          0.00%
234==430300== LLi miss rate:          0.00%
235==430300== 
236==430300== D   refs:      2,064,233,110  (1,359,003,131 rd   + 705,229,979 wr)
237==430300== D1  misses:       34,476,969  (   19,010,839 rd   +  15,466,130 wr)
238==430300== LLd misses:        5,311,277  (    1,603,955 rd   +   3,707,322 wr)
239==430300== D1  miss rate:           1.7% (          1.4%     +         2.2%  )
240==430300== LLd miss rate:           0.3% (          0.1%     +         0.5%  )
241==430300== 
242==430300== LL refs:          34,481,240  (   19,015,110 rd   +  15,466,130 wr)
243==430300== LL misses:         5,313,719  (    1,606,397 rd   +   3,707,322 wr)
244==430300== LL miss rate:            0.1% (          0.0%     +         0.5%  )
float fma_scl(float a, float b, float c) {
    return a * b + c;
}

fma_scl:
        vfmadd132ss     xmm0, xmm2, xmm1
        ret

As the assembly shows, the whole calculation is done with a single instruction, vfmadd132ss.
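To connect this back to the convolution: its innermost loop is a multiply-accumulate, so every iteration has the same a * b + c shape. Below is a minimal sketch of that pattern, not code from the question. The name dot_fma is hypothetical. Whether the compiler actually emits a fused instruction depends on the target flags, which the answer doesn't state; something like gcc -O2 -mfma is assumed here.

```c
#include <stddef.h>

/* Hypothetical helper: dot product of two float arrays of length n.
   Each iteration has the a * b + c shape of fma_scl, so a compiler
   targeting FMA hardware may emit one fused instruction per element. */
float dot_fma(const float *k, const float *x, size_t n) {
    float acc = 0.0f;              /* local accumulator, returned once at the end */
    for (size_t i = 0; i < n; ++i)
        acc = k[i] * x[i] + acc;   /* multiply-accumulate step */
    return acc;
}
```

Accumulating into a local variable instead of writing to the output array on every iteration also lets the compiler keep the running sum in a register.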

We'll define a 256-bit vector type using GCC's vector extension.

typedef float Vec __attribute__((vector_size(32), aligned(32)));
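If you haven't used the vector extension before, here is a rough sketch of how a type like Vec behaves, assuming GCC or Clang. The helper names vec_mul8 and vec_sum8 and the constant VEC_LANES are made up for illustration. Arithmetic operators work lane by lane, and individual lanes can be read with ordinary subscripts.

```c
#include <string.h>

typedef float Vec __attribute__((vector_size(32), aligned(32)));

/* 32 bytes / 4 bytes per float = 8 lanes */
enum { VEC_LANES = sizeof(Vec) / sizeof(float) };

/* Hypothetical helper: dst[i] = a[i] * b[i] for one block of 8 floats. */
void vec_mul8(const float *a, const float *b, float *dst) {
    Vec va, vb;
    memcpy(&va, a, sizeof va);        /* memcpy is safe for unaligned input */
    memcpy(&vb, b, sizeof vb);
    Vec prod = va * vb;               /* operator applies to all 8 lanes */
    memcpy(dst, &prod, sizeof prod);
}

/* Hypothetical helper: sum of the 8 lanes, read via subscripting. */
float vec_sum8(const float *a) {
    Vec v;
    memcpy(&v, a, sizeof v);
    float s = 0.0f;
    for (int i = 0; i < VEC_LANES; ++i)
        s += v[i];
    return s;
}
```

The helpers take plain float pointers rather than Vec values, so the sketch compiles the same way whether or not AVX is enabled; without AVX the compiler splits each 256-bit operation into narrower ones.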

Here's a vectorized fma function.

1void convolution2D(int isize,   // width/height of input
2        int osize,              // width/height of output
3        int ksize,              // width/height of kernel
4        int stride,             // shift between input pixels, between consecutive outputs
5        int pad,                // offset between (0,0) pixels between input and output
6        int idepth, int odepth, // number of input and output channels
7        float idata[isize][isize][idepth],
8        float odata[osize][osize][odepth],
9        float kdata[odepth][ksize][ksize][idepth])
10{
11    // iterate over the output
12    for (int oy = 0; oy &lt; osize; ++oy) {
13    for (int ox = 0; ox &lt; osize; ++ox) {
14    for (int od = 0; od &lt; odepth; ++od) {
15        odata[oy][ox][od] = 0;  // When you iterate multiple times without closing the program, this number would stack up to infinity, so we have to zero it out every time.
16        for (int ky = 0; ky &lt; ksize; ++ky) {
17        for (int kx = 0; kx &lt; ksize; ++kx) {
18            // map position in output and kernel to the input
19            int iy = stride * oy + ky - pad;
20            int ix = stride * ox + kx - pad;
21            // use only valid inputs
22            if (iy &gt;= 0 &amp;&amp; iy &lt; isize &amp;&amp; ix &gt;= 0 &amp;&amp; ix &lt; isize)
23                for (int id = 0; id &lt; idepth; ++id)
24                    odata[oy][ox][od] += kdata[od][ky][kx][id] * idata[iy][ix][id];
25        }}
26    }}}
27
28}
29
300000000000007880 &lt;convolution2D&gt;:
31    7880:   f3 0f 1e fa             endbr64 
32    7884:   55                      push   rbp
33    7885:   48 89 e5                mov    rbp,rsp
34    7888:   41 57                   push   r15
35    788a:   41 56                   push   r14
36    788c:   41 55                   push   r13
37    788e:   41 54                   push   r12
38    7890:   53                      push   rbx
39    7891:   48 81 ec b0 00 00 00    sub    rsp,0xb0
40    7898:   ff 15 4a a7 00 00       call   QWORD PTR [rip+0xa74a]        # 11fe8 &lt;mcount@GLIBC_2.2.5&gt;
41    789e:   89 d3                   mov    ebx,edx
42    78a0:   89 55 a8                mov    DWORD PTR [rbp-0x58],edx
43    78a3:   89 8d 74 ff ff ff       mov    DWORD PTR [rbp-0x8c],ecx
44    78a9:   49 63 d1                movsxd rdx,r9d
45    78ac:   48 63 cf                movsxd rcx,edi
46    78af:   41 89 f2                mov    r10d,esi
47    78b2:   89 b5 38 ff ff ff       mov    DWORD PTR [rbp-0xc8],esi
48    78b8:   49 63 c0                movsxd rax,r8d
49    78bb:   48 0f af ca             imul   rcx,rdx
50    78bf:   48 63 75 10             movsxd rsi,DWORD PTR [rbp+0x10]
51    78c3:   49 89 d6                mov    r14,rdx
52    78c6:   4c 8d 24 95 00 00 00 00     lea    r12,[rdx*4+0x0]
53    78ce:   41 89 fd                mov    r13d,edi
54    78d1:   49 89 cb                mov    r11,rcx
55    78d4:   48 89 8d 60 ff ff ff    mov    QWORD PTR [rbp-0xa0],rcx
56    78db:   49 63 ca                movsxd rcx,r10d
57    78de:   4c 8d 0c b5 00 00 00 00     lea    r9,[rsi*4+0x0]
58    78e6:   49 89 f0                mov    r8,rsi
59    78e9:   48 0f af f1             imul   rsi,rcx
60    78ed:   48 63 cb                movsxd rcx,ebx
61    78f0:   4c 89 8d 48 ff ff ff    mov    QWORD PTR [rbp-0xb8],r9
62    78f7:   48 0f af d1             imul   rdx,rcx
63    78fb:   48 8d 3c 95 00 00 00 00     lea    rdi,[rdx*4+0x0]
64    7903:   45 85 d2                test   r10d,r10d
65    7906:   0f 8e 73 02 00 00       jle    7b7f &lt;convolution2D+0x2ff&gt;
66    790c:   48 c1 ef 02             shr    rdi,0x2
67    7910:   49 c1 e9 02             shr    r9,0x2
68    7914:   48 89 7d c8             mov    QWORD PTR [rbp-0x38],rdi
69    7918:   4c 89 e7                mov    rdi,r12
70    791b:   4c 89 8d 58 ff ff ff    mov    QWORD PTR [rbp-0xa8],r9
71    7922:   48 c1 ef 02             shr    rdi,0x2
72    7926:   48 89 bd 50 ff ff ff    mov    QWORD PTR [rbp-0xb0],rdi
73    792d:   45 85 c0                test   r8d,r8d
74    7930:   0f 8e 49 02 00 00       jle    7b7f &lt;convolution2D+0x2ff&gt;
75    7936:   48 c1 e6 02             shl    rsi,0x2
76    793a:   48 0f af d1             imul   rdx,rcx
77    793e:   29 c3                   sub    ebx,eax
78    7940:   89 c7                   mov    edi,eax
79    7942:   48 89 b5 30 ff ff ff    mov    QWORD PTR [rbp-0xd0],rsi
80    7949:   48 8b 75 20             mov    rsi,QWORD PTR [rbp+0x20]
81    794d:   48 89 85 68 ff ff ff    mov    QWORD PTR [rbp-0x98],rax
82    7954:   f7 df                   neg    edi
83    7956:   45 8d 7e ff             lea    r15d,[r14-0x1]
84    795a:   89 9d 70 ff ff ff       mov    DWORD PTR [rbp-0x90],ebx
85    7960:   89 bd 3c ff ff ff       mov    DWORD PTR [rbp-0xc4],edi
86    7966:   48 8d 0c 95 00 00 00 00     lea    rcx,[rdx*4+0x0]
87    796e:   89 7d ac                mov    DWORD PTR [rbp-0x54],edi
88    7971:   89 5d d4                mov    DWORD PTR [rbp-0x2c],ebx
89    7974:   48 89 4d 98             mov    QWORD PTR [rbp-0x68],rcx
90    7978:   4a 8d 0c 9d 00 00 00 00     lea    rcx,[r11*4+0x0]
91    7980:   c7 45 80 00 00 00 00    mov    DWORD PTR [rbp-0x80],0x0
92    7987:   48 89 75 88             mov    QWORD PTR [rbp-0x78],rsi
93    798b:   41 8d 70 ff             lea    esi,[r8-0x1]
94    798f:   48 89 4d c0             mov    QWORD PTR [rbp-0x40],rcx
95    7993:   48 8d 04 b5 04 00 00 00     lea    rax,[rsi*4+0x4]
96    799b:   c7 45 90 00 00 00 00    mov    DWORD PTR [rbp-0x70],0x0
97    79a2:   48 89 85 28 ff ff ff    mov    QWORD PTR [rbp-0xd8],rax
98    79a9:   44 89 f0                mov    eax,r14d
99    79ac:   45 89 ee                mov    r14d,r13d
100    79af:   41 89 c5                mov    r13d,eax
101    79b2:   48 8b 85 28 ff ff ff    mov    rax,QWORD PTR [rbp-0xd8]
102    79b9:   48 03 45 88             add    rax,QWORD PTR [rbp-0x78]
103    79bd:   48 c7 85 78 ff ff ff 00 00 00 00    mov    QWORD PTR [rbp-0x88],0x0
104    79c8:   c7 45 84 00 00 00 00    mov    DWORD PTR [rbp-0x7c],0x0
105    79cf:   c7 45 94 00 00 00 00    mov    DWORD PTR [rbp-0x6c],0x0
106    79d6:   44 8b 95 70 ff ff ff    mov    r10d,DWORD PTR [rbp-0x90]
107    79dd:   48 89 45 b0             mov    QWORD PTR [rbp-0x50],rax
108    79e1:   48 63 45 80             movsxd rax,DWORD PTR [rbp-0x80]
109    79e5:   48 2b 85 68 ff ff ff    sub    rax,QWORD PTR [rbp-0x98]
110    79ec:   48 0f af 85 60 ff ff ff     imul   rax,QWORD PTR [rbp-0xa0]
111    79f4:   48 89 85 40 ff ff ff    mov    QWORD PTR [rbp-0xc0],rax
112    79fb:   8b 85 3c ff ff ff       mov    eax,DWORD PTR [rbp-0xc4]
113    7a01:   89 45 d0                mov    DWORD PTR [rbp-0x30],eax
114    7a04:   48 8b 45 88             mov    rax,QWORD PTR [rbp-0x78]
115    7a08:   48 8b 9d 78 ff ff ff    mov    rbx,QWORD PTR [rbp-0x88]
116    7a0f:   4c 8d 04 98             lea    r8,[rax+rbx*4]
117    7a13:   48 8b 45 28             mov    rax,QWORD PTR [rbp+0x28]
118    7a17:   48 8b 5d 18             mov    rbx,QWORD PTR [rbp+0x18]
119    7a1b:   48 89 45 b8             mov    QWORD PTR [rbp-0x48],rax
120    7a1f:   48 63 45 84             movsxd rax,DWORD PTR [rbp-0x7c]
121    7a23:   48 2b 85 68 ff ff ff    sub    rax,QWORD PTR [rbp-0x98]
122    7a2a:   48 0f af 85 50 ff ff ff     imul   rax,QWORD PTR [rbp-0xb0]
123    7a32:   48 03 85 40 ff ff ff    add    rax,QWORD PTR [rbp-0xc0]
124    7a39:   48 8d 04 83             lea    rax,[rbx+rax*4]
125    7a3d:   48 89 45 a0             mov    QWORD PTR [rbp-0x60],rax
126    7a41:   66 66 2e 0f 1f 84 00 00 00 00 00    data16 nop WORD PTR cs:[rax+rax*1+0x0]
127    7a4c:   0f 1f 40 00             nop    DWORD PTR [rax+0x0]
128    7a50:   8b 45 a8                mov    eax,DWORD PTR [rbp-0x58]
129    7a53:   41 c7 00 00 00 00 00    mov    DWORD PTR [r8],0x0
130    7a5a:   45 31 db                xor    r11d,r11d
131    7a5d:   48 8b 5d a0             mov    rbx,QWORD PTR [rbp-0x60]
132    7a61:   44 8b 4d ac             mov    r9d,DWORD PTR [rbp-0x54]
133    7a65:   85 c0                   test   eax,eax
134    7a67:   0f 8e 98 00 00 00       jle    7b05 &lt;convolution2D+0x285&gt;
135    7a6d:   0f 1f 00                nop    DWORD PTR [rax]
136    7a70:   45 85 c9                test   r9d,r9d
137    7a73:   78 7b                   js     7af0 &lt;convolution2D+0x270&gt;
138    7a75:   45 39 ce                cmp    r14d,r9d
139    7a78:   7e 76                   jle    7af0 &lt;convolution2D+0x270&gt;
140    7a7a:   48 8b 45 b8             mov    rax,QWORD PTR [rbp-0x48]
141    7a7e:   8b 55 d0                mov    edx,DWORD PTR [rbp-0x30]
142    7a81:   48 89 de                mov    rsi,rbx
143    7a84:   4a 8d 3c 98             lea    rdi,[rax+r11*4]
144    7a88:   eb 13                   jmp    7a9d &lt;convolution2D+0x21d&gt;
145    7a8a:   66 0f 1f 44 00 00       nop    WORD PTR [rax+rax*1+0x0]
146    7a90:   ff c2                   inc    edx
147    7a92:   4c 01 e7                add    rdi,r12
148    7a95:   4c 01 e6                add    rsi,r12
149    7a98:   44 39 d2                cmp    edx,r10d
150    7a9b:   74 53                   je     7af0 &lt;convolution2D+0x270&gt;
151    7a9d:   85 d2                   test   edx,edx
152    7a9f:   78 ef                   js     7a90 &lt;convolution2D+0x210&gt;
153    7aa1:   41 39 d6                cmp    r14d,edx
154    7aa4:   7e ea                   jle    7a90 &lt;convolution2D+0x210&gt;
155    7aa6:   45 85 ed                test   r13d,r13d
156    7aa9:   7e e5                   jle    7a90 &lt;convolution2D+0x210&gt;
157    7aab:   c4 c1 7a 10 08          vmovss xmm1,DWORD PTR [r8]
158    7ab0:   31 c0                   xor    eax,eax
159    7ab2:   66 66 2e 0f 1f 84 00 00 00 00 00    data16 nop WORD PTR cs:[rax+rax*1+0x0]
160    7abd:   0f 1f 00                nop    DWORD PTR [rax]
161    7ac0:   c5 fa 10 04 87          vmovss xmm0,DWORD PTR [rdi+rax*4]
162    7ac5:   48 89 c1                mov    rcx,rax
163    7ac8:   c5 fa 59 04 86          vmulss xmm0,xmm0,DWORD PTR [rsi+rax*4]
164    7acd:   48 ff c0                inc    rax
165    7ad0:   c5 f2 58 c8             vaddss xmm1,xmm1,xmm0
166    7ad4:   c4 c1 7a 11 08          vmovss DWORD PTR [r8],xmm1
167    7ad9:   49 39 cf                cmp    r15,rcx
168    7adc:   75 e2                   jne    7ac0 &lt;convolution2D+0x240&gt;
169    7ade:   ff c2                   inc    edx
170    7ae0:   4c 01 e7                add    rdi,r12
171    7ae3:   4c 01 e6                add    rsi,r12
172    7ae6:   44 39 d2                cmp    edx,r10d
173    7ae9:   75 b2                   jne    7a9d &lt;convolution2D+0x21d&gt;
174    7aeb:   0f 1f 44 00 00          nop    DWORD PTR [rax+rax*1+0x0]
175    7af0:   4c 03 5d c8             add    r11,QWORD PTR [rbp-0x38]
176    7af4:   48 03 5d c0             add    rbx,QWORD PTR [rbp-0x40]
177    7af8:   41 ff c1                inc    r9d
178    7afb:   44 3b 4d d4             cmp    r9d,DWORD PTR [rbp-0x2c]
179    7aff:   0f 85 6b ff ff ff       jne    7a70 &lt;convolution2D+0x1f0&gt;
180    7b05:   48 8b 5d 98             mov    rbx,QWORD PTR [rbp-0x68]
181    7b09:   49 83 c0 04             add    r8,0x4
    7b0d:   48 01 5d b8             add    QWORD PTR [rbp-0x48],rbx
    7b11:   4c 3b 45 b0             cmp    r8,QWORD PTR [rbp-0x50]
    7b15:   0f 85 35 ff ff ff       jne    7a50 <convolution2D+0x1d0>
    7b1b:   8b 9d 74 ff ff ff       mov    ebx,DWORD PTR [rbp-0x8c]
    7b21:   8b 45 94                mov    eax,DWORD PTR [rbp-0x6c]
    7b24:   48 8b 8d 48 ff ff ff    mov    rcx,QWORD PTR [rbp-0xb8]
    7b2b:   01 5d d0                add    DWORD PTR [rbp-0x30],ebx
    7b2e:   48 01 4d b0             add    QWORD PTR [rbp-0x50],rcx
    7b32:   01 5d 84                add    DWORD PTR [rbp-0x7c],ebx
    7b35:   48 8b 8d 58 ff ff ff    mov    rcx,QWORD PTR [rbp-0xa8]
    7b3c:   41 01 da                add    r10d,ebx
    7b3f:   48 01 8d 78 ff ff ff    add    QWORD PTR [rbp-0x88],rcx
    7b46:   ff c0                   inc    eax
    7b48:   39 85 38 ff ff ff       cmp    DWORD PTR [rbp-0xc8],eax
    7b4e:   74 08                   je     7b58 <convolution2D+0x2d8>
    7b50:   89 45 94                mov    DWORD PTR [rbp-0x6c],eax
    7b53:   e9 ac fe ff ff          jmp    7a04 <convolution2D+0x184>
    7b58:   8b 4d 90                mov    ecx,DWORD PTR [rbp-0x70]
    7b5b:   48 8b b5 30 ff ff ff    mov    rsi,QWORD PTR [rbp-0xd0]
    7b62:   01 5d d4                add    DWORD PTR [rbp-0x2c],ebx
    7b65:   01 5d ac                add    DWORD PTR [rbp-0x54],ebx
    7b68:   01 5d 80                add    DWORD PTR [rbp-0x80],ebx
    7b6b:   48 01 75 88             add    QWORD PTR [rbp-0x78],rsi
    7b6f:   8d 41 01                lea    eax,[rcx+0x1]
    7b72:   39 4d 94                cmp    DWORD PTR [rbp-0x6c],ecx
    7b75:   74 08                   je     7b7f <convolution2D+0x2ff>
    7b77:   89 45 90                mov    DWORD PTR [rbp-0x70],eax
    7b7a:   e9 33 fe ff ff          jmp    79b2 <convolution2D+0x132>
    7b7f:   48 81 c4 b0 00 00 00    add    rsp,0xb0
    7b86:   5b                      pop    rbx
    7b87:   41 5c                   pop    r12
    7b89:   41 5d                   pop    r13
    7b8b:   41 5e                   pop    r14
    7b8d:   41 5f                   pop    r15
    7b8f:   5d                      pop    rbp
    7b90:   c3                      ret
    7b91:   66 66 2e 0f 1f 84 00 00 00 00 00    data16 nop WORD PTR cs:[rax+rax*1+0x0]
    7b9c:   0f 1f 40 00             nop    DWORD PTR [rax+0x0]

float (*arr)[x][y] = calloc(z, sizeof *arr);

==430300== Cachegrind, a cache and branch-prediction profiler
==430300== Copyright (C) 2002-2017, and GNU GPL'd, by Nicholas Nethercote et al.
==430300== Using Valgrind-3.15.0 and LibVEX; rerun with -h for copyright info
==430300== Command: ./EmbeddedNet test 1
==430300== Parent PID: 170008
==430300==
--430300-- warning: L3 cache found, using its data for the LL simulation.
==430300==
==430300== I   refs:      6,369,594,192
==430300== I1  misses:            4,271
==430300== LLi misses:            2,442
==430300== I1  miss rate:          0.00%
==430300== LLi miss rate:          0.00%
==430300==
==430300== D   refs:      2,064,233,110  (1,359,003,131 rd   + 705,229,979 wr)
==430300== D1  misses:       34,476,969  (   19,010,839 rd   +  15,466,130 wr)
==430300== LLd misses:        5,311,277  (    1,603,955 rd   +   3,707,322 wr)
==430300== D1  miss rate:           1.7% (          1.4%     +         2.2%  )
==430300== LLd miss rate:           0.3% (          0.1%     +         0.5%  )
==430300==
==430300== LL refs:          34,481,240  (   19,015,110 rd   +  15,466,130 wr)
==430300== LL misses:         5,313,719  (    1,606,397 rd   +   3,707,322 wr)
==430300== LL miss rate:            0.1% (          0.0%     +         0.5%  )

float fma_scl(float a, float b, float c) {
    return a * b + c;
}

fma_scl:
        vfmadd132ss     xmm0, xmm2, xmm1
        ret

typedef float Vec __attribute__((vector_size(32), aligned(32)));
Vec fma_vec(Vec a, Vec b, Vec c) {
    return a * b + c;
}

fma_vec:
        vfmadd132ps     ymm0, ymm2, ymm1
        ret

The vector code above is semantically the same as the element-wise code below, except that all eight multiply-adds are done in a single instruction.

typedef struct {
    float f[8];
} Vec_;

Vec_ fma_vec_(Vec_ a, Vec_ b, Vec_ c) {
    Vec_ r;
    for (unsigned i = 0; i < 8; ++i) {
        r.f[i] = a.f[i] * b.f[i] + c.f[i];
    }
    return r;
}
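
To check the equivalence claimed above, here is a minimal sketch, assuming GCC or Clang (the `vector_size` attribute is a compiler extension, not standard C). The helper names `fma_loop8`, `fma_vec8` and `check_fma_versions` are made up for this illustration and are not part of the original answer. The sample inputs are chosen so that every product and sum is exactly representable in a float, so the comparison should hold whether or not the compiler fuses the multiply and the add.

```c
#include <assert.h>
#include <string.h>

/* Same GCC/Clang vector type as in the answer: 8 floats in 32 bytes. */
typedef float Vec __attribute__((vector_size(32), aligned(32)));

/* Element-wise version: one multiply-add per loop iteration. */
static void fma_loop8(const float *a, const float *b, const float *c, float *r) {
    for (unsigned i = 0; i < 8; ++i)
        r[i] = a[i] * b[i] + c[i];
}

/* Vector version: the expression is written once and applies to all 8 lanes. */
static void fma_vec8(const float *a, const float *b, const float *c, float *r) {
    Vec va, vb, vc, vr;
    memcpy(&va, a, sizeof va);   /* memcpy: no alignment requirement on the inputs */
    memcpy(&vb, b, sizeof vb);
    memcpy(&vc, c, sizeof vc);
    vr = va * vb + vc;           /* lane-wise multiply and add */
    memcpy(r, &vr, sizeof vr);
}

/* Hypothetical self-check: asserts that both versions agree on a sample input. */
static void check_fma_versions(void) {
    float a[8], b[8], c[8], r_loop[8], r_vec[8];
    for (unsigned i = 0; i < 8; ++i) {
        a[i] = (float)(i + 1);   /* 1, 2, ..., 8 */
        b[i] = 2.0f;
        c[i] = 0.5f;
    }
    fma_loop8(a, b, c, r_loop);
    fma_vec8(a, b, c, r_vec);
    for (unsigned i = 0; i < 8; ++i)
        assert(r_loop[i] == r_vec[i]);
}
```

Calling `check_fma_versions()` from `main` should pass silently. Whether the vector version really becomes a single `vfmadd132ps` depends on the compiler and flags (for example `-O2 -mfma`); without FMA support enabled, the compiler has to emit separate multiply and add instructions.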

I think you now get the idea of how vectorization makes code run faster.

Here is a simple function that's somewhat similar to your inner loop.

1void convolution2D(int isize,   // width/height of input
2        int osize,              // width/height of output
3        int ksize,              // width/height of kernel
4        int stride,             // shift between input pixels, between consecutive outputs
5        int pad,                // offset between (0,0) pixels between input and output
6        int idepth, int odepth, // number of input and output channels
7        float idata[isize][isize][idepth],
8        float odata[osize][osize][odepth],
9        float kdata[odepth][ksize][ksize][idepth])
10{
11    // iterate over the output
12    for (int oy = 0; oy &lt; osize; ++oy) {
13    for (int ox = 0; ox &lt; osize; ++ox) {
14    for (int od = 0; od &lt; odepth; ++od) {
15        odata[oy][ox][od] = 0;  // When you iterate multiple times without closing the program, this number would stack up to infinity, so we have to zero it out every time.
16        for (int ky = 0; ky &lt; ksize; ++ky) {
17        for (int kx = 0; kx &lt; ksize; ++kx) {
18            // map position in output and kernel to the input
19            int iy = stride * oy + ky - pad;
20            int ix = stride * ox + kx - pad;
21            // use only valid inputs
22            if (iy &gt;= 0 &amp;&amp; iy &lt; isize &amp;&amp; ix &gt;= 0 &amp;&amp; ix &lt; isize)
23                for (int id = 0; id &lt; idepth; ++id)
24                    odata[oy][ox][od] += kdata[od][ky][kx][id] * idata[iy][ix][id];
25        }}
26    }}}
27
28}
29
300000000000007880 &lt;convolution2D&gt;:
31    7880:   f3 0f 1e fa             endbr64 
32    7884:   55                      push   rbp
33    7885:   48 89 e5                mov    rbp,rsp
34    7888:   41 57                   push   r15
35    788a:   41 56                   push   r14
36    788c:   41 55                   push   r13
37    788e:   41 54                   push   r12
38    7890:   53                      push   rbx
39    7891:   48 81 ec b0 00 00 00    sub    rsp,0xb0
40    7898:   ff 15 4a a7 00 00       call   QWORD PTR [rip+0xa74a]        # 11fe8 &lt;mcount@GLIBC_2.2.5&gt;
41    789e:   89 d3                   mov    ebx,edx
42    78a0:   89 55 a8                mov    DWORD PTR [rbp-0x58],edx
43    78a3:   89 8d 74 ff ff ff       mov    DWORD PTR [rbp-0x8c],ecx
44    78a9:   49 63 d1                movsxd rdx,r9d
45    78ac:   48 63 cf                movsxd rcx,edi
46    78af:   41 89 f2                mov    r10d,esi
47    78b2:   89 b5 38 ff ff ff       mov    DWORD PTR [rbp-0xc8],esi
48    78b8:   49 63 c0                movsxd rax,r8d
49    78bb:   48 0f af ca             imul   rcx,rdx
50    78bf:   48 63 75 10             movsxd rsi,DWORD PTR [rbp+0x10]
51    78c3:   49 89 d6                mov    r14,rdx
52    78c6:   4c 8d 24 95 00 00 00 00     lea    r12,[rdx*4+0x0]
53    78ce:   41 89 fd                mov    r13d,edi
54    78d1:   49 89 cb                mov    r11,rcx
55    78d4:   48 89 8d 60 ff ff ff    mov    QWORD PTR [rbp-0xa0],rcx
56    78db:   49 63 ca                movsxd rcx,r10d
57    78de:   4c 8d 0c b5 00 00 00 00     lea    r9,[rsi*4+0x0]
58    78e6:   49 89 f0                mov    r8,rsi
59    78e9:   48 0f af f1             imul   rsi,rcx
60    78ed:   48 63 cb                movsxd rcx,ebx
61    78f0:   4c 89 8d 48 ff ff ff    mov    QWORD PTR [rbp-0xb8],r9
62    78f7:   48 0f af d1             imul   rdx,rcx
63    78fb:   48 8d 3c 95 00 00 00 00     lea    rdi,[rdx*4+0x0]
64    7903:   45 85 d2                test   r10d,r10d
65    7906:   0f 8e 73 02 00 00       jle    7b7f &lt;convolution2D+0x2ff&gt;
66    790c:   48 c1 ef 02             shr    rdi,0x2
67    7910:   49 c1 e9 02             shr    r9,0x2
68    7914:   48 89 7d c8             mov    QWORD PTR [rbp-0x38],rdi
69    7918:   4c 89 e7                mov    rdi,r12
70    791b:   4c 89 8d 58 ff ff ff    mov    QWORD PTR [rbp-0xa8],r9
71    7922:   48 c1 ef 02             shr    rdi,0x2
72    7926:   48 89 bd 50 ff ff ff    mov    QWORD PTR [rbp-0xb0],rdi
73    792d:   45 85 c0                test   r8d,r8d
74    7930:   0f 8e 49 02 00 00       jle    7b7f &lt;convolution2D+0x2ff&gt;
75    7936:   48 c1 e6 02             shl    rsi,0x2
76    793a:   48 0f af d1             imul   rdx,rcx
77    793e:   29 c3                   sub    ebx,eax
78    7940:   89 c7                   mov    edi,eax
79    7942:   48 89 b5 30 ff ff ff    mov    QWORD PTR [rbp-0xd0],rsi
80    7949:   48 8b 75 20             mov    rsi,QWORD PTR [rbp+0x20]
81    794d:   48 89 85 68 ff ff ff    mov    QWORD PTR [rbp-0x98],rax
82    7954:   f7 df                   neg    edi
83    7956:   45 8d 7e ff             lea    r15d,[r14-0x1]
84    795a:   89 9d 70 ff ff ff       mov    DWORD PTR [rbp-0x90],ebx
85    7960:   89 bd 3c ff ff ff       mov    DWORD PTR [rbp-0xc4],edi
86    7966:   48 8d 0c 95 00 00 00 00     lea    rcx,[rdx*4+0x0]
87    796e:   89 7d ac                mov    DWORD PTR [rbp-0x54],edi
88    7971:   89 5d d4                mov    DWORD PTR [rbp-0x2c],ebx
89    7974:   48 89 4d 98             mov    QWORD PTR [rbp-0x68],rcx
90    7978:   4a 8d 0c 9d 00 00 00 00     lea    rcx,[r11*4+0x0]
91    7980:   c7 45 80 00 00 00 00    mov    DWORD PTR [rbp-0x80],0x0
92    7987:   48 89 75 88             mov    QWORD PTR [rbp-0x78],rsi
93    798b:   41 8d 70 ff             lea    esi,[r8-0x1]
94    798f:   48 89 4d c0             mov    QWORD PTR [rbp-0x40],rcx
95    7993:   48 8d 04 b5 04 00 00 00     lea    rax,[rsi*4+0x4]
96    799b:   c7 45 90 00 00 00 00    mov    DWORD PTR [rbp-0x70],0x0
97    79a2:   48 89 85 28 ff ff ff    mov    QWORD PTR [rbp-0xd8],rax
98    79a9:   44 89 f0                mov    eax,r14d
99    79ac:   45 89 ee                mov    r14d,r13d
100    79af:   41 89 c5                mov    r13d,eax
101    79b2:   48 8b 85 28 ff ff ff    mov    rax,QWORD PTR [rbp-0xd8]
102    79b9:   48 03 45 88             add    rax,QWORD PTR [rbp-0x78]
103    79bd:   48 c7 85 78 ff ff ff 00 00 00 00    mov    QWORD PTR [rbp-0x88],0x0
104    79c8:   c7 45 84 00 00 00 00    mov    DWORD PTR [rbp-0x7c],0x0
105    79cf:   c7 45 94 00 00 00 00    mov    DWORD PTR [rbp-0x6c],0x0
106    79d6:   44 8b 95 70 ff ff ff    mov    r10d,DWORD PTR [rbp-0x90]
107    79dd:   48 89 45 b0             mov    QWORD PTR [rbp-0x50],rax
108    79e1:   48 63 45 80             movsxd rax,DWORD PTR [rbp-0x80]
109    79e5:   48 2b 85 68 ff ff ff    sub    rax,QWORD PTR [rbp-0x98]
110    79ec:   48 0f af 85 60 ff ff ff     imul   rax,QWORD PTR [rbp-0xa0]
111    79f4:   48 89 85 40 ff ff ff    mov    QWORD PTR [rbp-0xc0],rax
112    79fb:   8b 85 3c ff ff ff       mov    eax,DWORD PTR [rbp-0xc4]
113    7a01:   89 45 d0                mov    DWORD PTR [rbp-0x30],eax
114    7a04:   48 8b 45 88             mov    rax,QWORD PTR [rbp-0x78]
115    7a08:   48 8b 9d 78 ff ff ff    mov    rbx,QWORD PTR [rbp-0x88]
116    7a0f:   4c 8d 04 98             lea    r8,[rax+rbx*4]
117    7a13:   48 8b 45 28             mov    rax,QWORD PTR [rbp+0x28]
118    7a17:   48 8b 5d 18             mov    rbx,QWORD PTR [rbp+0x18]
119    7a1b:   48 89 45 b8             mov    QWORD PTR [rbp-0x48],rax
120    7a1f:   48 63 45 84             movsxd rax,DWORD PTR [rbp-0x7c]
121    7a23:   48 2b 85 68 ff ff ff    sub    rax,QWORD PTR [rbp-0x98]
122    7a2a:   48 0f af 85 50 ff ff ff     imul   rax,QWORD PTR [rbp-0xb0]
123    7a32:   48 03 85 40 ff ff ff    add    rax,QWORD PTR [rbp-0xc0]
124    7a39:   48 8d 04 83             lea    rax,[rbx+rax*4]
125    7a3d:   48 89 45 a0             mov    QWORD PTR [rbp-0x60],rax
126    7a41:   66 66 2e 0f 1f 84 00 00 00 00 00    data16 nop WORD PTR cs:[rax+rax*1+0x0]
127    7a4c:   0f 1f 40 00             nop    DWORD PTR [rax+0x0]
128    7a50:   8b 45 a8                mov    eax,DWORD PTR [rbp-0x58]
129    7a53:   41 c7 00 00 00 00 00    mov    DWORD PTR [r8],0x0
130    7a5a:   45 31 db                xor    r11d,r11d
131    7a5d:   48 8b 5d a0             mov    rbx,QWORD PTR [rbp-0x60]
132    7a61:   44 8b 4d ac             mov    r9d,DWORD PTR [rbp-0x54]
133    7a65:   85 c0                   test   eax,eax
134    7a67:   0f 8e 98 00 00 00       jle    7b05 &lt;convolution2D+0x285&gt;
135    7a6d:   0f 1f 00                nop    DWORD PTR [rax]
136    7a70:   45 85 c9                test   r9d,r9d
137    7a73:   78 7b                   js     7af0 &lt;convolution2D+0x270&gt;
138    7a75:   45 39 ce                cmp    r14d,r9d
139    7a78:   7e 76                   jle    7af0 &lt;convolution2D+0x270&gt;
140    7a7a:   48 8b 45 b8             mov    rax,QWORD PTR [rbp-0x48]
141    7a7e:   8b 55 d0                mov    edx,DWORD PTR [rbp-0x30]
142    7a81:   48 89 de                mov    rsi,rbx
143    7a84:   4a 8d 3c 98             lea    rdi,[rax+r11*4]
144    7a88:   eb 13                   jmp    7a9d &lt;convolution2D+0x21d&gt;
145    7a8a:   66 0f 1f 44 00 00       nop    WORD PTR [rax+rax*1+0x0]
146    7a90:   ff c2                   inc    edx
147    7a92:   4c 01 e7                add    rdi,r12
148    7a95:   4c 01 e6                add    rsi,r12
149    7a98:   44 39 d2                cmp    edx,r10d
150    7a9b:   74 53                   je     7af0 &lt;convolution2D+0x270&gt;
151    7a9d:   85 d2                   test   edx,edx
152    7a9f:   78 ef                   js     7a90 &lt;convolution2D+0x210&gt;
153    7aa1:   41 39 d6                cmp    r14d,edx
154    7aa4:   7e ea                   jle    7a90 &lt;convolution2D+0x210&gt;
155    7aa6:   45 85 ed                test   r13d,r13d
156    7aa9:   7e e5                   jle    7a90 &lt;convolution2D+0x210&gt;
157    7aab:   c4 c1 7a 10 08          vmovss xmm1,DWORD PTR [r8]
158    7ab0:   31 c0                   xor    eax,eax
159    7ab2:   66 66 2e 0f 1f 84 00 00 00 00 00    data16 nop WORD PTR cs:[rax+rax*1+0x0]
160    7abd:   0f 1f 00                nop    DWORD PTR [rax]
161    7ac0:   c5 fa 10 04 87          vmovss xmm0,DWORD PTR [rdi+rax*4]
162    7ac5:   48 89 c1                mov    rcx,rax
163    7ac8:   c5 fa 59 04 86          vmulss xmm0,xmm0,DWORD PTR [rsi+rax*4]
164    7acd:   48 ff c0                inc    rax
165    7ad0:   c5 f2 58 c8             vaddss xmm1,xmm1,xmm0
166    7ad4:   c4 c1 7a 11 08          vmovss DWORD PTR [r8],xmm1
167    7ad9:   49 39 cf                cmp    r15,rcx
168    7adc:   75 e2                   jne    7ac0 &lt;convolution2D+0x240&gt;
169    7ade:   ff c2                   inc    edx
170    7ae0:   4c 01 e7                add    rdi,r12
171    7ae3:   4c 01 e6                add    rsi,r12
172    7ae6:   44 39 d2                cmp    edx,r10d
173    7ae9:   75 b2                   jne    7a9d &lt;convolution2D+0x21d&gt;
174    7aeb:   0f 1f 44 00 00          nop    DWORD PTR [rax+rax*1+0x0]
175    7af0:   4c 03 5d c8             add    r11,QWORD PTR [rbp-0x38]
176    7af4:   48 03 5d c0             add    rbx,QWORD PTR [rbp-0x40]
177    7af8:   41 ff c1                inc    r9d
178    7afb:   44 3b 4d d4             cmp    r9d,DWORD PTR [rbp-0x2c]
179    7aff:   0f 85 6b ff ff ff       jne    7a70 &lt;convolution2D+0x1f0&gt;
180    7b05:   48 8b 5d 98             mov    rbx,QWORD PTR [rbp-0x68]
181    7b09:   49 83 c0 04             add    r8,0x4
182    7b0d:   48 01 5d b8             add    QWORD PTR [rbp-0x48],rbx
183    7b11:   4c 3b 45 b0             cmp    r8,QWORD PTR [rbp-0x50]
184    7b15:   0f 85 35 ff ff ff       jne    7a50 &lt;convolution2D+0x1d0&gt;
185    7b1b:   8b 9d 74 ff ff ff       mov    ebx,DWORD PTR [rbp-0x8c]
186    7b21:   8b 45 94                mov    eax,DWORD PTR [rbp-0x6c]
187    7b24:   48 8b 8d 48 ff ff ff    mov    rcx,QWORD PTR [rbp-0xb8]
188    7b2b:   01 5d d0                add    DWORD PTR [rbp-0x30],ebx
189    7b2e:   48 01 4d b0             add    QWORD PTR [rbp-0x50],rcx
190    7b32:   01 5d 84                add    DWORD PTR [rbp-0x7c],ebx
191    7b35:   48 8b 8d 58 ff ff ff    mov    rcx,QWORD PTR [rbp-0xa8]
192    7b3c:   41 01 da                add    r10d,ebx
193    7b3f:   48 01 8d 78 ff ff ff    add    QWORD PTR [rbp-0x88],rcx
194    7b46:   ff c0                   inc    eax
195    7b48:   39 85 38 ff ff ff       cmp    DWORD PTR [rbp-0xc8],eax
196    7b4e:   74 08                   je     7b58 &lt;convolution2D+0x2d8&gt;
197    7b50:   89 45 94                mov    DWORD PTR [rbp-0x6c],eax
198    7b53:   e9 ac fe ff ff          jmp    7a04 &lt;convolution2D+0x184&gt;
199    7b58:   8b 4d 90                mov    ecx,DWORD PTR [rbp-0x70]
200    7b5b:   48 8b b5 30 ff ff ff    mov    rsi,QWORD PTR [rbp-0xd0]
201    7b62:   01 5d d4                add    DWORD PTR [rbp-0x2c],ebx
202    7b65:   01 5d ac                add    DWORD PTR [rbp-0x54],ebx
203    7b68:   01 5d 80                add    DWORD PTR [rbp-0x80],ebx
204    7b6b:   48 01 75 88             add    QWORD PTR [rbp-0x78],rsi
205    7b6f:   8d 41 01                lea    eax,[rcx+0x1]
206    7b72:   39 4d 94                cmp    DWORD PTR [rbp-0x6c],ecx
207    7b75:   74 08                   je     7b7f &lt;convolution2D+0x2ff&gt;
208    7b77:   89 45 90                mov    DWORD PTR [rbp-0x70],eax
209    7b7a:   e9 33 fe ff ff          jmp    79b2 &lt;convolution2D+0x132&gt;
210    7b7f:   48 81 c4 b0 00 00 00    add    rsp,0xb0
211    7b86:   5b                      pop    rbx
212    7b87:   41 5c                   pop    r12
213    7b89:   41 5d                   pop    r13
214    7b8b:   41 5e                   pop    r14
215    7b8d:   41 5f                   pop    r15
216    7b8f:   5d                      pop    rbp
217    7b90:   c3                      ret    
218    7b91:   66 66 2e 0f 1f 84 00 00 00 00 00    data16 nop WORD PTR cs:[rax+rax*1+0x0]
219    7b9c:   0f 1f 40 00             nop    DWORD PTR [rax+0x0]

float (*arr)[x][y] = calloc(z, sizeof *arr);

==430300== Cachegrind, a cache and branch-prediction profiler
==430300== Copyright (C) 2002-2017, and GNU GPL'd, by Nicholas Nethercote et al.
==430300== Using Valgrind-3.15.0 and LibVEX; rerun with -h for copyright info
==430300== Command: ./EmbeddedNet test 1
==430300== Parent PID: 170008
==430300==
--430300-- warning: L3 cache found, using its data for the LL simulation.
==430300==
==430300== I   refs:      6,369,594,192
==430300== I1  misses:            4,271
==430300== LLi misses:            2,442
==430300== I1  miss rate:          0.00%
==430300== LLi miss rate:          0.00%
==430300==
==430300== D   refs:      2,064,233,110  (1,359,003,131 rd   + 705,229,979 wr)
==430300== D1  misses:       34,476,969  (   19,010,839 rd   +  15,466,130 wr)
==430300== LLd misses:        5,311,277  (    1,603,955 rd   +   3,707,322 wr)
==430300== D1  miss rate:           1.7% (          1.4%     +         2.2%  )
==430300== LLd miss rate:           0.3% (          0.1%     +         0.5%  )
==430300==
==430300== LL refs:          34,481,240  (   19,015,110 rd   +  15,466,130 wr)
==430300== LL misses:         5,313,719  (    1,606,397 rd   +   3,707,322 wr)
==430300== LL miss rate:            0.1% (          0.0%     +         0.5%  )

float fma_scl(float a, float b, float c) {
    return a * b + c;
}

fma_scl:
        vfmadd132ss     xmm0, xmm2, xmm1
        ret

typedef float Vec __attribute__((vector_size(32), aligned(32)));
Vec fma_vec(Vec a, Vec b, Vec c) {
    return a * b + c;
}

fma_vec:
        vfmadd132ps     ymm0, ymm2, ymm1
        ret

typedef struct {
    float f[8];
} Vec_;

Vec_ fma_vec_(Vec_ a, Vec_ b, Vec_ c) {
    Vec_ r;
    for (unsigned i = 0; i < 8; ++i) {
        r.f[i] = a.f[i] * b.f[i] + c.f[i];
    }
    return r;
}

void loopadd_scl(float *restrict a, float *restrict b, float *restrict c, unsigned n) {
    for (unsigned i = 0; i < n; ++i) {
        a[i] = fma_scl(b[i], c[i], a[i]);
    }
}

When you compile loopadd_scl with GCC at -O3 -march=znver2, this is the output. It's huge. I'll explain it below.

loopadd_scl:
        test    ecx, ecx
        je      .L25
        lea     eax, [rcx-1]
        cmp     eax, 6
        jbe     .L13
        mov     r8d, ecx
        xor     eax, eax
        shr     r8d, 3
        sal     r8, 5
.L9:
        vmovups ymm1, YMMWORD PTR [rdi+rax]
        vmovups ymm0, YMMWORD PTR [rdx+rax]
        vfmadd132ps     ymm0, ymm1, YMMWORD PTR [rsi+rax]
        vmovups YMMWORD PTR [rdi+rax], ymm0
        add     rax, 32
        cmp     r8, rax
        jne     .L9
        mov     eax, ecx
        and     eax, -8
        test    cl, 7
        je      .L26
        vzeroupper
.L8:
        mov     r9d, ecx
        sub     r9d, eax
        lea     r8d, [r9-1]
        cmp     r8d, 2
        jbe     .L11
        mov     r8d, eax
        sal     r8, 2
        lea     r10, [rdi+r8]
        vmovups xmm0, XMMWORD PTR [rdx+r8]
        vmovups xmm2, XMMWORD PTR [r10]
        vfmadd132ps     xmm0, xmm2, XMMWORD PTR [rsi+r8]
        mov     r8d, r9d
        and     r8d, -4
        add     eax, r8d
        and     r9d, 3
        vmovups XMMWORD PTR [r10], xmm0
        je      .L25
.L11:
        mov     r8d, eax
        sal     r8, 2
        lea     r9, [rdi+r8]
        vmovss  xmm0, DWORD PTR [rdx+r8]
        vmovss  xmm3, DWORD PTR [r9]
        vfmadd132ss     xmm0, xmm3, DWORD PTR [rsi+r8]
        lea     r8d, [rax+1]
        vmovss  DWORD PTR [r9], xmm0
        cmp     r8d, ecx
        jnb     .L25
        sal     r8, 2
        add     eax, 2
        lea     r9, [rdi+r8]
        vmovss  xmm0, DWORD PTR [rsi+r8]
        vmovss  xmm4, DWORD PTR [r9]
        vfmadd132ss     xmm0, xmm4, DWORD PTR [rdx+r8]
        vmovss  DWORD PTR [r9], xmm0
        cmp     eax, ecx
        jnb     .L25
        sal     rax, 2
        add     rdi, rax
        vmovss  xmm0, DWORD PTR [rdx+rax]
        vmovss  xmm5, DWORD PTR [rdi]
        vfmadd132ss     xmm0, xmm5, DWORD PTR [rsi+rax]
        vmovss  DWORD PTR [rdi], xmm0
.L25:
        ret
.L26:
        vzeroupper
        ret
.L13:
        xor     eax, eax
        jmp     .L8

Basically, GCC doesn't know anything about n, so it splits the loop into three stages. First, if n is at least 8, it processes 8 floats per iteration using 256-bit ymm registers. Then, if at least 4 elements are left over, it handles 4 of them at once with 128-bit xmm registers. Finally, it finishes the remaining 0 to 3 elements one at a time with scalar ss instructions.
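
To make those three stages easier to spot in the assembly, here is roughly the same structure written out by hand in C. This is only an illustrative sketch: loopadd_ref and loopadd_staged are hypothetical names that do not appear in the original code, and the fixed-count inner loops merely stand in for the single ymm and xmm instructions that GCC emits.

```c
#include <assert.h>

/* Reference: the plain scalar loop, same arithmetic as loopadd_scl. */
void loopadd_ref(float *restrict a, const float *restrict b,
                 const float *restrict c, unsigned n) {
    for (unsigned i = 0; i < n; ++i)
        a[i] = b[i] * c[i] + a[i];
}

/* Hypothetical hand-written version of the staged loop structure. */
void loopadd_staged(float *restrict a, const float *restrict b,
                    const float *restrict c, unsigned n) {
    unsigned i = 0;

    /* Stage 1: blocks of 8 floats (one 256-bit ymm FMA each in the asm). */
    for (; n - i >= 8; i += 8)
        for (unsigned j = 0; j < 8; ++j)
            a[i + j] = b[i + j] * c[i + j] + a[i + j];
    assert(n - i < 8);

    /* Stage 2: at most one block of 4 floats (one 128-bit xmm FMA). */
    if (n - i >= 4) {
        for (unsigned j = 0; j < 4; ++j)
            a[i + j] = b[i + j] * c[i + j] + a[i + j];
        i += 4;
    }
    assert(n - i < 4);

    /* Stage 3: the 0 to 3 leftover elements, one scalar FMA each. */
    for (; i < n; ++i)
        a[i] = b[i] * c[i] + a[i];
}
```

Both functions should produce the same array contents for any n; the staged version only makes explicit the bookkeeping that the compiler has to generate when it cannot assume anything about n.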

You can avoid this mess if you know n is a multiple of 8. I'm getting a bit lazy now, so have a look at the code and the compiler output below and compare them with the above. I think you're smart enough to get the idea.

208    7b77:   89 45 90                mov    DWORD PTR [rbp-0x70],eax
209    7b7a:   e9 33 fe ff ff          jmp    79b2 &lt;convolution2D+0x132&gt;
210    7b7f:   48 81 c4 b0 00 00 00    add    rsp,0xb0
211    7b86:   5b                      pop    rbx
212    7b87:   41 5c                   pop    r12
213    7b89:   41 5d                   pop    r13
214    7b8b:   41 5e                   pop    r14
215    7b8d:   41 5f                   pop    r15
216    7b8f:   5d                      pop    rbp
217    7b90:   c3                      ret    
218    7b91:   66 66 2e 0f 1f 84 00 00 00 00 00    data16 nop WORD PTR cs:[rax+rax*1+0x0]
219    7b9c:   0f 1f 40 00             nop    DWORD PTR [rax+0x0]

float (*arr)[x][y] = calloc(z, sizeof *arr);

==430300== Cachegrind, a cache and branch-prediction profiler
==430300== Copyright (C) 2002-2017, and GNU GPL'd, by Nicholas Nethercote et al.
==430300== Using Valgrind-3.15.0 and LibVEX; rerun with -h for copyright info
==430300== Command: ./EmbeddedNet test 1
==430300== Parent PID: 170008
==430300== 
--430300-- warning: L3 cache found, using its data for the LL simulation.
==430300== 
==430300== I   refs:      6,369,594,192
==430300== I1  misses:            4,271
==430300== LLi misses:            2,442
==430300== I1  miss rate:          0.00%
==430300== LLi miss rate:          0.00%
==430300== 
==430300== D   refs:      2,064,233,110  (1,359,003,131 rd   + 705,229,979 wr)
==430300== D1  misses:       34,476,969  (   19,010,839 rd   +  15,466,130 wr)
==430300== LLd misses:        5,311,277  (    1,603,955 rd   +   3,707,322 wr)
==430300== D1  miss rate:           1.7% (          1.4%     +         2.2%  )
==430300== LLd miss rate:           0.3% (          0.1%     +         0.5%  )
==430300== 
==430300== LL refs:          34,481,240  (   19,015,110 rd   +  15,466,130 wr)
==430300== LL misses:         5,313,719  (    1,606,397 rd   +   3,707,322 wr)
==430300== LL miss rate:            0.1% (          0.0%     +         0.5%  )

float fma_scl(float a, float b, float c) {
    return a * b + c;
}

fma_scl:
        vfmadd132ss     xmm0, xmm2, xmm1
        ret

typedef float Vec __attribute__((vector_size(32), aligned(32)));
Vec fma_vec(Vec a, Vec b, Vec c) {
    return a * b + c;
}

fma_vec:
        vfmadd132ps     ymm0, ymm2, ymm1
        ret

typedef struct {
    float f[8];
} Vec_;

Vec_ fma_vec_(Vec_ a, Vec_ b, Vec_ c) {
    Vec_ r;
    for (unsigned i = 0; i < 8; ++i) {
        r.f[i] = a.f[i] * b.f[i] + c.f[i];
    }
    return r;
}

void loopadd_scl(float *restrict a, float *restrict b, float *restrict c, unsigned n) {
    for (unsigned i = 0; i < n; ++i) {
        a[i] = fma_scl(b[i], c[i], a[i]);
    }
}

loopadd_scl:
        test    ecx, ecx
        je      .L25
        lea     eax, [rcx-1]
        cmp     eax, 6
        jbe     .L13
        mov     r8d, ecx
        xor     eax, eax
        shr     r8d, 3
        sal     r8, 5
.L9:
        vmovups ymm1, YMMWORD PTR [rdi+rax]
        vmovups ymm0, YMMWORD PTR [rdx+rax]
        vfmadd132ps     ymm0, ymm1, YMMWORD PTR [rsi+rax]
        vmovups YMMWORD PTR [rdi+rax], ymm0
        add     rax, 32
        cmp     r8, rax
        jne     .L9
        mov     eax, ecx
        and     eax, -8
        test    cl, 7
        je      .L26
        vzeroupper
.L8:
        mov     r9d, ecx
        sub     r9d, eax
        lea     r8d, [r9-1]
        cmp     r8d, 2
        jbe     .L11
        mov     r8d, eax
        sal     r8, 2
        lea     r10, [rdi+r8]
        vmovups xmm0, XMMWORD PTR [rdx+r8]
        vmovups xmm2, XMMWORD PTR [r10]
        vfmadd132ps     xmm0, xmm2, XMMWORD PTR [rsi+r8]
        mov     r8d, r9d
        and     r8d, -4
        add     eax, r8d
        and     r9d, 3
        vmovups XMMWORD PTR [r10], xmm0
        je      .L25
.L11:
        mov     r8d, eax
        sal     r8, 2
        lea     r9, [rdi+r8]
        vmovss  xmm0, DWORD PTR [rdx+r8]
        vmovss  xmm3, DWORD PTR [r9]
        vfmadd132ss     xmm0, xmm3, DWORD PTR [rsi+r8]
        lea     r8d, [rax+1]
        vmovss  DWORD PTR [r9], xmm0
        cmp     r8d, ecx
        jnb     .L25
        sal     r8, 2
        add     eax, 2
        lea     r9, [rdi+r8]
        vmovss  xmm0, DWORD PTR [rsi+r8]
        vmovss  xmm4, DWORD PTR [r9]
        vfmadd132ss     xmm0, xmm4, DWORD PTR [rdx+r8]
        vmovss  DWORD PTR [r9], xmm0
        cmp     eax, ecx
        jnb     .L25
        sal     rax, 2
        add     rdi, rax
        vmovss  xmm0, DWORD PTR [rdx+rax]
        vmovss  xmm5, DWORD PTR [rdi]
        vfmadd132ss     xmm0, xmm5, DWORD PTR [rsi+rax]
        vmovss  DWORD PTR [rdi], xmm0
.L25:
        ret
.L26:
        vzeroupper
        ret
.L13:
        xor     eax, eax
        jmp     .L8

void loopadd_vec(Vec *restrict a, Vec *restrict b, Vec *restrict c, unsigned n) {
    n /= 8;
    for (unsigned i = 0; i < n; ++i) {
        a[i] = fma_vec(b[i], c[i], a[i]);
    }
}

loopadd_vec:
        shr     ecx, 3
        je      .L34
        mov     ecx, ecx
        xor     eax, eax
        sal     rcx, 5
.L29:
        vmovaps ymm1, YMMWORD PTR [rdi+rax]
        vmovaps ymm0, YMMWORD PTR [rdx+rax]
        vfmadd132ps     ymm0, ymm1, YMMWORD PTR [rsi+rax]
        vmovaps YMMWORD PTR [rdi+rax], ymm0
        add     rax, 32
        cmp     rcx, rax
        jne     .L29
        vzeroupper
.L34:
        ret

Source https://stackoverflow.com/questions/71401876

QUESTION

Write custom metadata to Parquet file in Julia

Asked 2022-Mar-05 at 18:36

I am currently storing the output (a Julia DataFrame) of my Julia simulation in a Parquet file using Parquet.jl. I would also like to save some of the simulation parameters (e.g. a list of (byte-)strings) to that same output file.

Preferably, these parameters are different for each column as each column is the result of different starting conditions of my code. However, I could also work with a global parameter list and then untangle it afterwards by indexing.

I have found a solution for Python using pyarrow: https://mungingdata.com/pyarrow/arbitrary-metadata-parquet-table/

Do you know a way to do this in Julia?

ANSWER

Answered 2022-Mar-05 at 18:36

It's not quite done yet, and it's not registered, but my rewrite of the Julia parquet package, Parquet2.jl, does support both custom file metadata and individual column metadata (the keyword arguments metadata and column_metadata in Parquet2.writefile).

I haven't gotten to documentation for writing yet, but if you are feeling adventurous you can give it a shot. I do expect to finish up this package and register it within the next couple of weeks. I don't have unit tests in for writing yet, so of course, if you try it and have problems, please open an issue.

It's probably also worth mentioning that the main use case I recommend for parquet is if you must have parquet for compatibility reasons. Most of the time, Julia users are probably better off with Arrow.jl, as the format has a number of advantages over parquet for most use cases; please see my FAQ answer on this. Of course, the reason I undertook writing the package is that parquet is arguably the only ubiquitous binary format in the "big data world", so a robust writer is desperately needed.
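The fallback mentioned in the question (one global parameter list that is untangled afterwards) works with any Parquet writer, because file-level key/value metadata in Parquet consists of plain strings. The following is a minimal sketch in Python, the language of the pyarrow solution linked in the question, using only the standard library. The key name `sim_params`, the column names, and both helper functions are hypothetical, and the actual write call is left out.

```python
import json

# Hypothetical per-column simulation parameters (one entry per DataFrame column).
column_params = {
    "run_a": {"seed": 1, "labels": ["x0=0.1", "dt=0.01"]},
    "run_b": {"seed": 2, "labels": ["x0=0.5", "dt=0.01"]},
}


def pack_metadata(params, key="sim_params"):
    """Flatten nested parameters into a single string-valued metadata entry."""
    return {key: json.dumps(params, sort_keys=True)}


def unpack_metadata(metadata, column, key="sim_params"):
    """Recover the parameters of one column from the metadata mapping."""
    return json.loads(metadata[key])[column]


# A plain str -> str mapping, which is the shape custom file metadata takes.
metadata = pack_metadata(column_params)
print(unpack_metadata(metadata, "run_b")["seed"])
```

With Parquet2.jl, a mapping like this would presumably be passed through the `metadata` keyword of `Parquet2.writefile`, while truly per-column values fit the `column_metadata` keyword mentioned above.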

Source https://stackoverflow.com/questions/71310140

QUESTION

Floating point inconsistencies after upgrading libc/libm

Asked 2022-Mar-01 at 13:31

I recently upgraded my OS from Debian 9 to Debian 11. I have a bunch of servers running a simulation and one subset produces a certain result and another subset produces a different result. This did not happen with Debian 9. I have produced a minimal failing example:

#include <stdio.h>
#include <math.h>

int main()
{
  double lp = 11.525775909423828;
  double ap = exp(lp);

  printf("%.14f %.14f\n", lp, ap);

  return 0;
}

The lp value prints the same answer on every machine but I have two different answers for ap: 101293.33662281210127 and 101293.33662281208672

The code was compiled with "gcc fptest.c -lm -O0". The '-O0' was just added to ensure optimizations weren't an issue. It behaves the same without this option.

The libraries linked in the Debian 11 version are libm-2.31.so and libc-2.31.so.

The libraries linked in the (working) Debian 9 version are libm-2.24.so and libc-2.24.so.

The servers are all running with different CPUs, so it's hard to say much about that. But I get different results between a Xeon E5-2695 v2 and a Xeon E5-2695 v3, for example.

Amongst all the processors I have, I only see one of these two results on Debian 11, and when running on Debian 9 I consistently only get one result.

This feels like a bug in libm-2.31 and/or libc-2.31 to me. I have zero experience with this sort of thing. Could someone please explain if what I am seeing is expected? Does it look like a bug? Anything I can do about it? etc.

Also tried compiling with clang, and get the exact same problem.

Also note that the binary compiled on Debian 9 runs on Debian 11 and produces the same results/problem as the Debian 11 binary adding further weight to my suspicion that this is library related (I cannot run the Debian 11 binary on Debian 9).

Update

Just read this post, which was helpful. So I'm happy that different architectures may give different results for the exp() function. But all my processors are x86_64 and some kind of Intel Xeon-xxxx. I can't understand why the exact same binary with the exact same libraries is giving different results on different processors.

As suggested in that post I printed the values using %a. The two answers differ only by the LSB. If I use expl() I get the same answer on all machines.
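The one-LSB difference can be checked independently of the C toolchain. The sketch below, in Python, parses the two decimal values of ap quoted in the question and compares their IEEE 754 bit patterns; the helper name `bits` is an assumption made for illustration.

```python
import struct


def bits(x):
    """Return the IEEE 754 binary64 bit pattern of x as a signed integer."""
    return struct.unpack("<q", struct.pack("<d", x))[0]


# The two values of ap reported in the question.
a = float("101293.33662281210127")
b = float("101293.33662281208672")

print(a.hex(), b.hex())    # hexadecimal form, comparable to printf("%a")
print(bits(a) - bits(b))   # distance between the two doubles in units in the last place
```

For positive doubles, a difference of 1 between the integer bit patterns means the two values are adjacent, i.e. no other double lies between them.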

An explanation of why I'm seeing differences, and if this is expected, would be nice. Any compiler flags that ensure consistency would also be nice.

ANSWER

Answered 2022-Feb-28 at 13:18

It’s not a bug. Floating point arithmetic has rounding errors. For single arithmetic operations + - * / sqrt the results should be the same, but for floating-point functions you can’t really expect it.
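One well-understood way for two correct implementations to disagree in the last bit is the number of rounding steps: a fused multiply-add rounds a*b + c once, whereas a separate multiply and add round twice. Whether this is what differs between the two library versions in the question is an assumption, not something this thread establishes. The Python sketch below emulates the single rounding with exact rational arithmetic; the input values are chosen purely for illustration.

```python
from fractions import Fraction

a = 1.0 + 2.0 ** -27
c = -(1.0 + 2.0 ** -26)

# Two roundings: the product is rounded to a double, then the sum is rounded again.
two_step = a * a + c

# One rounding: evaluate a*a + c exactly, then round to a double once,
# which emulates what a fused multiply-add instruction does.
fused = float(Fraction(a) * Fraction(a) + Fraction(c))

print(two_step, fused)
```

Both results are legitimate under IEEE 754; they differ only because the low-order term of the product is discarded in the two-step version before the addition takes place.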

In this case it seems the compiler itself produced the results at compile time. The processor you use is unlikely to make a difference. And we don’t know whether the new version is more or less precise than the old one.

Source https://stackoverflow.com/questions/71294653

QUESTION

Missing types, namespaces, directives, and assembly references

Asked 2022-Feb-27 at 10:24

I use VS Code for C# and Unity3D and TypeScript and Angular and Python programming, so I have pretty much every required extension, including the .NET Framework and Core as well as the Quantum Development Kit (QDK) plus the Q# Interoperability Tools and also C# and Python extensions for VS Code.

I have devised the following steps to create my first quantum Hello World based on a few tutorials:

$ dotnet --version
$ dotnet --list-sdks
$ dotnet --list-runtimes

$ dotnet new globaljson                                     # Create a "global.json" file in your current directory folder.
$ dotnet new globaljson --sdk-version 3.1.416 --force       # Change the current SDK version of your project.

$ dotnet new -i Microsoft.DotNet.Web.Spa.ProjectTemplates
$ dotnet new -i "Microsoft.Quantum.ProjectTemplates::0.2-*"

# First, create a Q# application and a .NET host, and then make a call to Q# from the host.
# Create a project for your Q# library and for the .NET host that will call
# into the operations and functions defined in your Q# library.

$ dotnet new classlib -lang Q# -o quantum                         # Create a new Q# class-library project.
$ dotnet new console -lang C# -o host                             # Create a new C# console project.

$ cd host                                                         # Navigate into the C# host directory.
$ dotnet add reference ../quantum/quantum.csproj                  # Add your Q# class-library project as a reference to your C# console project.

$ cd ..                                                           # Exit the C# host directory.
$ dotnet new sln -n quantum-dotnet                                # Create a new solution for both projects.
$ dotnet sln quantum-dotnet.sln add ./quantum/quantum.csproj      # Add the Q# class-library project to the solution.
$ dotnet sln quantum-dotnet.sln add ./host/host.csproj            # Add the C# host project to the solution.

# The -o or --output command specifies the location to place the generated output.

However, as soon as this is all done, I get the following load of errors:

What am I doing wrong? What am I missing?

Here is some additional info:

$ dotnet --list-sdks
3.1.416 [C:\Program Files\dotnet\sdk]
5.0.404 [C:\Program Files\dotnet\sdk]

$ dotnet --version
3.1.416

$ dotnet restore
  Determining projects to restore...
  Restored C:\Users\Muhy\Dropbox\Q\Q#-programs\Qtest\quantum\quantum.csproj (in 484 ms).
  Restored C:\Users\Muhy\Dropbox\Q\Q#-programs\Qtest\host\host.csproj (in 782 ms).

When I run dotnet build, I get the following:

Determining projects to restore...
All projects are up-to-date for restore.
It was not possible to find any compatible framework version
The framework 'Microsoft.NETCore.App', version '2.0.0' was not found.
  - The following frameworks were found:
      3.1.22 at [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
      5.0.13 at [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]

You can resolve the problem by installing the specified framework and/or SDK.

The specified framework can be found at: - https://aka.ms/dotnet-core-applaunch?framework=Microsoft.NETCore.App&framework_version=2.0.0&arch=x64&rid=win10-x64 C:\Users\Muhy.nuget\packages\microsoft.quantum.development.kit\0.2.1802.1603-preview\build\Microsoft.Quantum.Development.Kit.targets(17,5): error MSB3073: The command "dotnet C:\Users\Muhy.nuget\packages\microsoft.quantum.development.kit\0.2.1802.1603-preview\build../tools/qsc/qsc.dll --input "Operation.qs" --references "C:\Users\Muhy.nuget\packages\microsoft.quantum.canon\0.2.1802.1603-preview\lib\netstandard2.0\Microsoft.Quantum.Canon.dll" "C:\Users\Muhy.nuget\packages\microsoft.quantum.development.kit\0.2.1802.1603-preview\lib\netstandard2.0\Microsoft.Quantum.MetaData.dll" "C:\Users\Muhy.nuget\packages\microsoft.quantum.development.kit\0.2.1802.1603-preview\lib\netstandard2.0\Microsoft.Quantum.Primitives.dll" "C:\Users\Muhy.nuget\packages\microsoft.quantum.development.kit\0.2.1802.1603-preview\lib\netstandard2.0\Microsoft.Quantum.Simulation.Common.dll" "C:\Users\Muhy.nuget\packages\microsoft.quantum.development.kit\0.2.1802.1603-preview\lib\netstandard2.0\Microsoft.Quantum.Simulation.Core.dll" "C:\Users\Muhy.nuget\packages\microsoft.quantum.development.kit\0.2.1802.1603-preview\lib\netstandard2.0\Microsoft.Quantum.Simulation.QCTraceSimulatorRuntime.dll" "C:\Users\Muhy.nuget\packages\microsoft.quantum.development.kit\0.2.1802.1603-preview\lib\netstandard2.0\Microsoft.Quantum.Simulation.Simulators.dll" --outputpath obj\qsharp\src" exited with code -2147450730. [C:\Users\Muhy\Dropbox\Q\Q#-programs\Qtest\quantum\quantum.csproj]

Build FAILED.

C:\Users\Muhy.nuget\packages\microsoft.quantum.development.kit\0.2.1802.1603-preview\build\Microsoft.Quantum.Development.Kit.targets(17,5): error MSB3073: The command "dotnet C:\Users\Muhy.nuget\packages\microsoft.quantum.development.kit\0.2.1802.1603-preview\build../tools/qsc/qsc.dll --input "Operation.qs" --references "C:\Users\Muhy.nuget\packages\microsoft.quantum.canon\0.2.1802.1603-preview\lib\netstandard2.0\Microsoft.Quantum.Canon.dll" "C:\Users\Muhy.nuget\packages\microsoft.quantum.development.kit\0.2.1802.1603-preview\lib\netstandard2.0\Microsoft.Quantum.MetaData.dll" "C:\Users\Muhy.nuget\packages\microsoft.quantum.development.kit\0.2.1802.1603-preview\lib\netstandard2.0\Microsoft.Quantum.Primitives.dll" "C:\Users\Muhy.nuget\packages\microsoft.quantum.development.kit\0.2.1802.1603-preview\lib\netstandard2.0\Microsoft.Quantum.Simulation.Common.dll" "C:\Users\Muhy.nuget\packages\microsoft.quantum.development.kit\0.2.1802.1603-preview\lib\netstandard2.0\Microsoft.Quantum.Simulation.Core.dll" "C:\Users\Muhy.nuget\packages\microsoft.quantum.development.kit\0.2.1802.1603-preview\lib\netstandard2.0\Microsoft.Quantum.Simulation.QCTraceSimulatorRuntime.dll" "C:\Users\Muhy.nuget\packages\microsoft.quantum.development.kit\0.2.1802.1603-preview\lib\netstandard2.0\Microsoft.Quantum.Simulation.Simulators.dll" --outputpath obj\qsharp\src" exited with code -2147450730. [C:\Users\Muhy\Dropbox\Q\Q#-programs\Qtest\quantum\quantum.csproj] 0 Warning(s) 1 Error(s)

When I also build the project (dotnet build quantum-dotnet.sln and dotnet build ./host/host.csproj) I get the same errors.

ANSWER

Answered 2022-Feb-27 at 10:24

With help from a user on another forum, it turns out the problem was the command:

dotnet new -i "Microsoft.Quantum.ProjectTemplates::0.2-*"

which installed version 0.2.1802.1603-preview, which is quite old. The latest version is 0.22.187631.

This did the trick:

dotnet new -i "Microsoft.Quantum.ProjectTemplates"

Followed by:

dotnet new --update-apply

to update all the project templates. After this, re-creating the projects fixed all the errors.

Now, I believe the following set of instructions should work for every beginner:

# Install the .NET Framework and Core as well as the Quantum Development Kit (QDK)
# plus the Q# Interoperability Tools and also C# and Python extensions for VS Code.

$ dotnet --version
$ dotnet --list-sdks
$ dotnet --list-runtimes

$ dotnet new globaljson                                             # Create a "global.json" file in your current directory folder.
$ dotnet new globaljson --sdk-version 3.1.416 --force               # Change the current SDK version of your project to .NET 3.1.416

$ dotnet new -i Microsoft.DotNet.Web.Spa.ProjectTemplates
$ dotnet new -i "Microsoft.Quantum.ProjectTemplates"
$ dotnet new --update-apply

# First, create a Q# application and a .NET host, and then make a call to Q# from the host.
# Create a project for your Q# library and for the .NET host that will call
# into the operations and functions defined in your Q# library.

$ dotnet new classlib -lang Q# -o quantum                           # Create a new Q# class-library project.
$ dotnet new console -lang C# -o host                               # Create a new C# console project.
                                                                    # The -o or --output command specifies the location to place the generated output.

$ cd host                                                           # Navigate into the C# host directory.
$ dotnet add reference ../quantum/quantum.csproj                    # Add your Q# class-library project as a reference to your C# console project.

$ cd ..                                                             # Exit the C# host directory.
$ dotnet new sln -n quantum-dotnet                                  # Create a new solution for both projects.
$ dotnet sln quantum-dotnet.sln add ./quantum/quantum.csproj        # Add the Q# class-library project to the solution.
$ dotnet sln quantum-dotnet.sln add ./host/host.csproj              # Add the C# host project to the solution.

$ cd host
$ dotnet build
$ dotnet run

Source https://stackoverflow.com/questions/71100198

QUESTION

Iterating over an array of class objects VS a class object containing arrays

Asked 2022-Feb-13 at 16:58

I want to create a program for multi-agent simulation, and I am wondering whether I should use NumPy or Numba to accelerate the calculation. Basically, I would need a class to store the state of the agents, and I would have over 1000 instances of this class. In each time step, I will perform different calculations for all instances. There are two approaches that I am thinking of:

NumPy vectorization:

Have one class with multiple NumPy arrays that store the states of all agents, so there is only one class instance at any time during the simulation. With this approach, I can simply use NumPy vectorization to perform calculations. However, this makes running functions for specific agents difficult, and I would need an extra class to store the index of each agent.

import numpy as np

class Agent:
    def __init__(self):
        # One array per state variable; index i is agent i.
        self.A = np.zeros(1000)
        self.B = np.zeros(1000)

    def timestep(self):
        return self.A + self.B

Numba jitclass:

Use the Numba jitclass decorator to compile the code. With this approach, I can write more standard OOP-style code, as one class instance represents one agent. However, I am not sure about the performance of looping through many jitclass instances (say 1000 or more).

from numba import float64
from numba.experimental import jitclass

# jitclass needs the field types declared up front.
@jitclass([("A", float64), ("B", float64)])
class Agent:
    def __init__(self):
        self.A = 0.0
        self.B = 0.0

    def timestep(self):
        return self.A + self.B

I would like to know which would be a faster approach.

ANSWER

Answered 2022-Feb-13 at 16:53

This problem is known as "AoS vs. SoA", where AoS means array of structures and SoA means structure of arrays. You can find some information about this here. SoA is less user-friendly than AoS, but it is generally much more efficient. This is especially true when your code can benefit from SIMD instructions. When you deal with many big arrays (e.g. >= 8 big arrays) or perform many scalar random memory accesses, neither AoS nor SoA is efficient. In that case, the best solution is to use an array of structures of small arrays (AoSoA), which makes better use of CPU caches while still benefiting from SIMD. However, AoSoA is tedious, as it significantly complicates the code for non-trivial algorithms. Note that the number of fields accessed also matters when choosing the best layout (e.g. if only one field is frequently read, then SoA is perfect).

OOP is generally rather bad when it comes to performance, partially because of this. Another reason is the frequent use of virtual calls and polymorphism even when they are not needed. OOP code tends to cause a lot of cache misses, and optimizing a large code base that relies heavily on OOP is often a mess (which sometimes results in rewriting a big part of the target software, or in the code being left very slow). To address this problem, data-oriented design can be used. This approach has been successfully used to drastically speed up large code bases, from video games (e.g. Unity) to web browser renderers (e.g. Chrome) and even relational databases. In high-performance computing (HPC), OOP is often barely used. Data-oriented design is closely related to the use of SoA rather than AoS, so as to make better use of caches and benefit from SIMD. For more information, please read this related post.

To conclude, I advise you to use the first code (SoA) in your case (since you only have two arrays and they are not so huge).
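As a rough illustration of the two layouts, the sketch below builds the same 1000 agents both ways using plain NumPy and plain Python objects (no Numba), and checks that both give the same result. The class names `AgentsSoA` and `AgentAoS` and the initial values are made up for the example; they are not from the question.

```python
import numpy as np


class AgentsSoA:
    """Structure of arrays: one object holds the state of every agent."""

    def __init__(self, n):
        self.A = np.arange(n, dtype=np.float64)
        self.B = np.ones(n, dtype=np.float64)

    def timestep(self):
        # One vectorized operation covers all agents.
        return self.A + self.B


class AgentAoS:
    """Array of structures: one object per agent, scalar state."""

    def __init__(self, a, b):
        self.A = a
        self.B = b

    def timestep(self):
        return self.A + self.B


n = 1000
soa = AgentsSoA(n)
aos = [AgentAoS(float(i), 1.0) for i in range(n)]

soa_result = soa.timestep()                                  # single array operation
aos_result = np.array([agent.timestep() for agent in aos])   # Python-level loop

# With SoA, a specific agent is addressed by its index.
single = soa.A[42] + soa.B[42]
```

Timing the two `timestep` paths (for example with `timeit`) is a quick way to check how large the gap is on your own machine before committing to a design.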

Source https://stackoverflow.com/questions/71101579

QUESTION

R: Trying to recreate mean-median difference gerrymander tests

Asked 2022-Feb-09 at 23:58

I'm trying to recreate the mean-median difference test described here: Archive of NYT article. I've downloaded House data from MIT's Election Lab, and pared it down to the 2012 Pennsylvania race. Using dplyr, I whittled it down to the relevant columns, and it now looks something like this:

Rows: 42
Columns: 5
$ district       <dbl> 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10, 11, 1~
$ party          <chr> "REPUBLICAN", "DEMOCRAT", "INDEPENDENT", "REPUBLICAN", "DEMOCRAT", "DEMOCRAT", ~
$ candidatevotes <dbl> 41708, 235394, 4829, 33381, 318176, 123933, 165826, 12755, 6210, 181603, 11524,~
$ totalvotes     <dbl> 277102, 277102, 356386, 356386, 356386, 302514, 302514, 302514, 303980, 303980,~
$ pct_votes      <dbl> 15.051497, 84.948503, 1.354991, 9.366530, 89.278479, 40.967691, 54.815975, 4.21~

Each row represents a district candidate. The final column was created using mutate, and represents the percentage of the vote in that district that went to the candidate. Now, I can find the median and mean Democratic vote with:

library(dplyr)

PA2012_house_dem <- PA2012_house %>% filter(party == "DEMOCRAT")
obs_median <- median(PA2012_house_dem$pct_votes)
obs_mean <- mean(PA2012_house_dem$pct_votes)
obs_median - obs_mean

What's giving me fits is calculating the "zone of chance". What I'd like to do is some kind of Monte Carlo simulation of taking each voter and randomly assigning them to a district, so that the number of voters in each district is unchanged, the number of total votes for each party is unchanged, but the proportion of Republican and Democratic (and other parties) in each district is random, as in a permutation test. The mean Democratic vote should be unchanged, but I can't figure out a good way to carry out this randomization so that I can calculate the median district's Democratic vote percentage.

Thanks in advance for your help!

Edit for clarification: I'd like to do the randomization, say, 10,000 times, and for each of those trials, calculate the median-mean difference. The result should then, ideally, be a vector or data frame with 10,000 rows, that I can then turn into a histogram or something.

EDIT 2 -- PARTIAL SOLUTION:

I have some code that runs, but it's not giving me a reasonable answer. Using dplyr, I've filtered out all but the DEMOCRAT votes, so that each row just gives me the Democrat vote share for a single district.

Rows: 18
Columns: 5
$ district       <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18
$ party          <chr> "DEMOCRAT", "DEMOCRAT", "DEMOCRAT", "DEMOCRAT", "DEMOCRAT", "DEMOCRAT", "DEMOCRAT", "DEMOCRAT", "DEMOCR~
$ candidatevotes <dbl> 235394, 318176, 123933, 104643, 104725, 143803, 143509, 152859, 105128, 94227, 118231, 163589, 209901, ~
$ totalvotes     <dbl> 277102, 356386, 302514, 303980, 282465, 335528, 353451, 352238, 274305, 273790, 285198, 338941, 303819,~
$ pct_votes      <dbl> 84.94850, 89.27848, 40.96769, 34.42430, 37.07539, 42.85872, 40.60223, 43.39651, 38.32522, 34.41579, 41.~

This is saved as PA2012_reduced_dem.

Now, here is my code:

require(mosaic) # for the tally() function
data <- PA2012_reduced_dem
B <- 100
samples_diff <- vector("numeric", B)
samples_mean <- vector("numeric", B)
samples_median <- vector("numeric", B)

for (samp in 1:B) {
  data_w_sample <- mutate(data, sample_vote = tally(sample(district, sum(candidatevotes), replace = TRUE, prob = totalvotes)))
  data_w_sample <- mutate(data_w_sample, sample_vote_pct = (sample_vote / totalvotes * 100))
  mean_sample <- weighted.mean(data_w_sample$sample_vote_pct, w = data_w_sample$totalvotes)
  median_sample <- median(data_w_sample$sample_vote_pct)
  diff_mean_median <- mean_sample - median_sample
  samples_diff[samp] <- diff_mean_median
  samples_mean[samp] <- mean_sample
  samples_median[samp] <- median_sample
}

samples <- data.frame(samples_mean, samples_median, samples_diff)

The idea is that I'm randomly placing each Democrat voter in a district, weighted by the total number of votes per district. Since I have the total vote as a variable, I can compute the share of vote in each district that goes to the Democrat (I'm ignoring independent and other party votes).

Obviously, this is slow, because each trial is sampling for every single Democrat vote (roughly 2.8 million), so I'm only running 100 trials right now.

However, my Monte Carlo simulations are finding a very small "zone of chance" around the mean: the median is only about 0.05 percent above or below the mean. Again, I'm only running 100 trials, but I was expecting a wider zone of chance.

ANSWER

Answered 2022-Feb-09 at 23:58

I figured it out! Randomly placing voters in each district is not correct, and honestly it was pretty silly of me to do so. Instead, I had to use dplyr to create a data frame with the number of Democrat and Republican votes in each of the 435 House districts, one district per row. Then, I followed the advice on page 12 of this paper. I created samples of 18 districts drawn from this 435-row data frame, rejecting them if the mean vote share was more than 1 percent away from that of PA. The results have a much nicer 95% confidence interval, which matches the results of the original article.
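A small sketch may help show why shuffling individual voters produced such a narrow zone of chance. It assumes 18 equally sized districts of 300,000 voters and a 45% statewide Democratic share; these are made-up round numbers, not the PA data. Placing the Democratic voters into fixed-size districts at random is a multivariate hypergeometric draw, and with districts this large every district lands very close to the statewide share, so the median barely moves away from the mean.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical inputs: 18 districts of 300,000 voters, 45% Democratic statewide.
district_sizes = np.full(18, 300_000)
total_dem = int(district_sizes.sum()) * 45 // 100

# Each row is one trial: Democratic voters per district, with the district
# sizes and the statewide Democratic total both held fixed.
trials = 200
dem_counts = rng.multivariate_hypergeometric(district_sizes, total_dem, size=trials)

pct_dem = 100.0 * dem_counts / district_sizes
median_minus_mean = np.median(pct_dem, axis=1) - pct_dem.mean(axis=1)
spread = np.abs(median_minus_mean).max()
print(f"largest |median - mean| over {trials} trials: {spread:.3f} points")
```

Sampling whole districts, as described above, keeps the district-to-district variation that the voter-level shuffle averages away.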

library(dplyr)

data <- house_2012_reduced
# created with dplyr, contains total and percentage of votes
# for Democrats and Republicans.
B <- 100000
del_districts <- 18 # 18 districts in PA
samples_diff <- vector("numeric", B)
samples_mean <- vector("numeric", B)
samples_median <- vector("numeric", B)

for (samp in 1:B) {
  sample_delegation <- sample_n(data, del_districts)
  sample_delegation_pct_dem_mean <- weighted.mean(sample_delegation$pct_dem_votes, w = sample_delegation$total_votes)
  sample_delegation_pct_dem_median <- median(sample_delegation$pct_dem_votes)
  # Keep the sample only if its mean is within 1 percentage point of PA's
  # (mean_dem_pct_PA is assumed to have been computed beforehand).
  if (near(mean_dem_pct_PA, sample_delegation_pct_dem_mean, 1)) {
    samples_mean[samp] <- sample_delegation_pct_dem_mean
    samples_median[samp] <- sample_delegation_pct_dem_median
    samples_diff[samp] <- (sample_delegation_pct_dem_mean - sample_delegation_pct_dem_median)
  }
}

samples <- data.frame(samples_mean, samples_median, samples_diff)
# Drop the rejected samples, which were left as all-zero rows.
samples <- filter_all(samples, any_vars(. != 0))
quantile(samples$samples_median, c(0.025, 0.975))
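The same rejection-sampling idea can be sketched in Python with NumPy. The national table here is synthetic (random vote shares and turnouts for 435 districts), and `target_mean`, `tolerance`, and the other names are placeholders chosen for illustration, so the numbers it prints say nothing about the real election. Unlike the R code above, this sketch reports the interval of the mean-median difference rather than of the median.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-in for the 435-row national table (not real election data).
n_national = 435
pct_dem = rng.uniform(20.0, 80.0, size=n_national)         # Democratic share, percent
total_votes = rng.integers(200_000, 400_000, size=n_national)

n_districts = 18     # size of the simulated delegation
target_mean = 50.0   # hypothetical statewide Democratic share, percent
tolerance = 1.0      # keep samples whose mean is within 1 point of the target
n_trials = 20_000

diffs = []
for _ in range(n_trials):
    # Draw a delegation of 18 distinct districts from the national table.
    idx = rng.choice(n_national, size=n_districts, replace=False)
    mean_pct = np.average(pct_dem[idx], weights=total_votes[idx])
    if abs(mean_pct - target_mean) > tolerance:
        continue  # reject: statewide share too far from the target
    diffs.append(mean_pct - np.median(pct_dem[idx]))

diffs = np.array(diffs)
low, high = np.quantile(diffs, [0.025, 0.975])
print(f"kept {diffs.size} of {n_trials} samples; 95% interval: [{low:.2f}, {high:.2f}]")
```

Collecting only the accepted samples in a list avoids having to filter out zero-filled rows afterwards.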