Popular New Releases in Simulation
- mongoose: 6.2.10
- bullet3: PyBullet 3.22
- abstreet: Tiramisconstrued
- Locale-Emulator: v2.5.0.1
- a32nx: v0.7.5-rc1
Popular Libraries in Simulation
by Automattic (JavaScript), 24117 stars, MIT
MongoDB object modeling designed to work in an asynchronous environment.

by kholia (Python), 11515 stars
Run macOS on QEMU/KVM. With OpenCore + Big Sur support now! Only commercial (paid) support is available now to avoid spammy issues.

by bulletphysics (C++), 9090 stars, NOASSERTION
Bullet Physics SDK: real-time collision detection and multi-physics simulation for VR, games, visual effects, robotics, machine learning etc.

by citra-emu (C++), 6741 stars, GPL-2.0
A Nintendo 3DS Emulator

by a-b-street (Rust), 6519 stars, Apache-2.0
Transportation planning and traffic simulation software for creating cities friendlier to walking, biking, and public transit

by xupefei (C#), 6367 stars, LGPL-3.0
Yet Another System Region and Language Simulator

by flybywiresim (JavaScript), 4107 stars, GPL-3.0
The A32NX Project is a community driven open source project to create a free Airbus A320neo in Microsoft Flight Simulator that is as close to reality as possible.

by dli (JavaScript), 2567 stars, MIT
Fluid Paint - http://david.li/paint

by wave-harmonic (C#), 2531 stars, NOASSERTION
An advanced ocean system implemented in Unity3D
Trending New libraries in Simulation
by flybywiresim (JavaScript), 4107 stars, GPL-3.0
The A32NX Project is a community driven open source project to create a free Airbus A320neo in Microsoft Flight Simulator that is as close to reality as possible.

by chrxh (C++), 2442 stars, GPL-3.0
ALIEN is a CUDA-powered artificial life simulation program.

by johnBuffer (C++), 1729 stars, MIT
Simple Ants simulator

by jrouwe (C++), 1529 stars, MIT
A multi core friendly rigid body physics and collision detection library suitable for games and VR applications.

by neherlab (JavaScript), 1380 stars, MIT
Models of COVID-19 outbreak trajectories and hospital demand

by mborgerson (C), 1181 stars, NOASSERTION
Original Xbox Emulator for Windows, macOS, and Linux (Active Development)

by google (Jupyter Notebook), 1078 stars, Apache-2.0
Massively parallel rigidbody physics simulation on accelerator hardware.

by Unity-Technologies (C#), 1053 stars, Apache-2.0
Central repository for tools, tutorials, resources, and documentation for robotics simulation in Unity.

by osrf (C++), 876 stars, NOASSERTION
Open source robotics simulator.
Top Authors in Simulation
Trending Kits in Simulation
6 Best C++ Simulation Libraries
Here are the best open-source C++ simulation libraries for your applications. You can use these to build simulations that model complex systems, such as physical processes, financial markets, or social interactions.
C++ is a powerful programming language widely used to create and run simulations in many fields. C++ simulation libraries are collections of pre-written code for modeling complex systems and running simulations without writing everything from scratch. C++ is popular for this work because it offers high performance and low-level control, which makes it well suited to computationally intensive simulations, and because its ecosystem provides advanced data structures, mathematical functions, and parallel processing support. The libraries below build on these strengths to offer powerful, ready-made tools for simulation in C++.
Whether you are working on discrete event simulation, network simulation, or fluid dynamics simulation, a library here can help you get the job done. We have handpicked the top and trending open-source C++ simulation libraries for your next application development project.
Open Dynamics Engine (ODE):
- Used for simulating articulated rigid body dynamics.
- It’s an open-source, high-performance physics engine.
- Lets you simulate realistic interactions between objects.
Simbody:
- Used in C++-based simulation applications.
- Offers advanced features such as multibody dynamics, contact modeling, etc.
- Also provides constraint-based modeling.
SOFA:
- Used typically for medical simulations to help foster newer algorithms.
- It includes a variety of modules and algorithms.
- Features include simulating soft-tissue deformation, fluid dynamics, and more.
Chrono:
- Used to model and simulate the dynamics of large systems.
- It’s a multi-physics simulation library package.
- Lets you simulate interactions between solid bodies, fluids, and control systems.
Bullet:
- Used in many VRs, games, visual effects, robotics, machine learning, etc.
- Features include collision detection, rigid body dynamics, and soft body simulation.
- Offers real-time collision detection and multi-physics simulation.
Gazebo:
- Used to simulate multiple robots in a 3D environment, with extensive dynamic interaction between objects.
- Offers high-fidelity physics, rendering, and sensor models.
- Includes physics simulation, sensor simulation, and visualization capabilities.
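All of these engines share the same core loop: integrate each body's state forward in time, then resolve collisions. As a rough illustration of that idea, here is a toy sketch in plain Python (deliberately not using any of the libraries above) of a semi-implicit Euler step with a simple ground-plane bounce:

```python
# Toy illustration of a physics engine's inner loop:
# semi-implicit Euler integration plus a ground-plane collision response.
# A sketch only, not a substitute for Bullet, ODE, or the other engines.

def simulate_fall(y0, steps=1000, dt=1e-3, g=-9.81, restitution=0.5):
    """Drop a point mass from height y0 (metres); bounce on the plane y = 0."""
    y, vy = y0, 0.0
    for _ in range(steps):
        vy += g * dt          # integrate velocity first (semi-implicit Euler)
        y += vy * dt          # then position, using the updated velocity
        if y < 0.0:           # collision with the ground plane
            y = 0.0
            vy = -vy * restitution  # lose energy on each bounce
    return y, vy

final_y, final_vy = simulate_fall(1.0)
```

Real engines replace each line here with far more machinery (broad/narrow-phase collision detection, constraint solvers, stable integrators), which is exactly why the libraries above are worth using.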
Trending Discussions on Simulation
Reverting a linear filter for time series in R
Why should I use normalised units in numerical integration?
Dramatic drop in numpy fromfile performance when switching from python 2 to python 3
ValueError: Layer "sequential" expects 1 input(s), but it received 10 input tensors
Convolution Function Latency Bottleneck
Write custom metadata to Parquet file in Julia
Floating point inconsistencies after upgrading libc/libm
Missing types, namespaces, directives, and assembly references
Iterating over an array of class objects VS a class object containing arrays
R: Trying to recreate mean-median difference gerrymander tests
QUESTION
Reverting a linear filter for time series in R
Asked 2022-Mar-28 at 17:50

I'm using the stats::filter function in R to understand ARIMA simulation (as in the function stats::arima.sim) and estimation. I know that stats::filter applies a linear filter to a vector or time series, but I'm not sure how to "unfilter" my series.

Consider the following example: I want to apply a recursive filter with value 0.7 to my series x = 1:5 (which essentially generates an AR(1) with phi = 0.7). I can do so by:
x <- 1:5
ar <- 0.7
filt <- filter(x, ar, method="recursive")
filt

Time Series:
Start = 1
End = 5
Frequency = 1
[1]  1.0000  2.7000  4.8900  7.4230 10.1961
which returns essentially c(y1, y2, y3, y4, y5), where:

y1 <- x[1]
y2 <- x[2] + ar*y1
y3 <- x[3] + ar*y2
y4 <- x[4] + ar*y3
y5 <- x[5] + ar*y4
Now imagine I have the series y = c(y1, y2, y3, y4, y5). How can I use the filter function to recover the original series x = 1:5?

I can write code to do it myself:
unfilt <- rep(NA, 5)
unfilt[1] <- filt[1]

for(i in 2:5){
  unfilt[i] <- filt[i] - ar*filt[i-1]
}
unfilt
[1] 1 2 3 4 5
But I want to use the filter function to do so, instead of writing my own. How can I do that? I tried stats::filter(filt, -ar, method="recursive"), which returns [1] 1.0000 2.0000 3.4900 4.9800 6.7101, not what I want.
ANSWER
Answered 2022-Mar-28 at 17:50

stats::filter used with the recursive option is a particular case of an ARMA filter:
a[1]*y[n] + a[2]*y[n-1] + … + a[n]*y[1] = b[1]*x[n] + b[2]*x[n-1] + … + b[m]*x[1]
You could implement this filter with the signal package, which allows more options than stats::filter:
a = c(1, -ar)
b = 1

filt_Arma <- signal::filter(signal::Arma(b = b, a = a), x)
filt_Arma

# Time Series:
# Start = 1
# End = 5
# Frequency = 1
# [1]  1.0000  2.7000  4.8900  7.4230 10.1961

identical(filt, filt_Arma)
# [1] TRUE
Reverting an ARMA filter can be done by switching b and a, provided that the inverse filter stays stable (which is the case here):
signal::filter(signal::Arma(b = a, a = b), filt)

# Time Series:
# Start = 2
# End = 6
# Frequency = 1
# [1] 1 2 3 4 5
This corresponds to switching numerator and denominator in the z-transform:
Y(z) = b(z)/a(z) X(z)

X(z) = a(z)/b(z) Y(z)
QUESTION
Why should I use normalised units in numerical integration?
Asked 2022-Mar-19 at 10:40

I was simulating the solar system (Sun, Earth and Moon). When I first started working on the project, I used the base units: metres for distance, seconds for time, and metres per second for velocity. Because I was dealing with the solar system, the numbers were pretty big; for example, the distance between the Earth and Sun is 150·10⁹ m.
When I numerically integrated the system with scipy.solve_ivp, the results were completely wrong, as a plot of the Earth and Moon trajectories made obvious.
But then I got a suggestion from a friend that I should use standardised units: astronomical unit (AU) for distance and years for time. And the simulation started working flawlessly!
My question is: Why is this a generally valid advice for problems such as mine? (Mind that this is not about my specific problem which was already solved, but rather why the solution worked.)
ANSWER
Answered 2021-Jul-25 at 07:42

Most, if not all, integration modules work best out of the box if:
- your dynamical variables have the same order of magnitude;
- that order of magnitude is 1;
- the smallest time scale of your dynamics also has the order of magnitude 1.
This typically fails for astronomical simulations where the orders of magnitude vary and values as well as time scales are often large in typical units.
The reason for the above behaviour of integrators is that they use step-size adaption, i.e., the integration step is adjusted to keep the estimated error at a defined level. The step-size adaption in turn is governed by a lot of parameters like absolute tolerance, relative tolerance, minimum time step, etc. You can usually tweak these parameters, but if you don’t, there need to be some default values and these default values are chosen with the above setup in mind.
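As a concrete illustration of the advice above, rescaling a solar-system problem into astronomical units and years brings every quantity the integrator sees down to order 1. The following is a minimal sketch of such a rescaling; the constants are standard published values, shown to illustrate the technique rather than taken from the question:

```python
import math

# Unit conversion: work in astronomical units and years instead of SI,
# so positions, velocities, and time spans are all of order 1.
AU = 1.495978707e11      # metres per astronomical unit
YEAR = 3.15576e7         # seconds per Julian year

GM_SUN_SI = 1.32712440018e20            # Sun's GM in m^3 / s^2
GM_SUN = GM_SUN_SI * YEAR**2 / AU**3    # same constant in AU^3 / yr^2

# In these units Kepler's third law gives GM_SUN ≈ 4*pi^2 ≈ 39.48,
# and Earth's state vector is O(1): r ≈ 1 AU, v ≈ 2*pi AU/yr.
print(GM_SUN)            # ≈ 39.48
print(2 * math.pi)       # Earth's orbital speed in AU/yr, ≈ 6.28
```

With the state and time rescaled like this, the integrator's default absolute and relative tolerances are well matched to the magnitudes of the dynamical variables.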
Digression

You might ask yourself: Can these parameters not be chosen more dynamically? As a developer and maintainer of an integration module, I would roughly expect that introducing such automatisms would have the following consequences:
- About twenty in a thousand users will not run into problems like yours.
- About fifty in a thousand users (including the above) miss an opportunity to learn rudimentary knowledge about how integrators work and about reading documentation.
- About one in a thousand users will run into a horrible problem with the automatisms that is much more difficult to solve than the above.
- I need to introduce new parameters governing the automatisms that are even harder to grasp for the average user.
- I spend a lot of time in devising and implementing the automatisms.
QUESTION
Dramatic drop in numpy fromfile performance when switching from python 2 to python 3
Asked 2022-Mar-16 at 23:53I am analyzing large (between 0.5 and 20 GB) binary files, which contain information about particle collisions from a simulation. The number of collisions, number of incoming and outgoing particles can vary, so the files consist of variable length records. For analysis I use python and numpy. After switching from python 2 to python 3 I have noticed a dramatic decrease in performance of my scripts and traced it down to numpy.fromfile function.
Simplified code to reproduce the problem

This code, iotest.py:
- Generates a file of a similar structure to what I have in my studies
- Reads it using numpy.fromfile
- Reads it using numpy.frombuffer
- Compares timing of both
import numpy as np
import os

def generate_binary_file(filename, nrecords):
    n_records = np.random.poisson(lam = nrecords)
    record_lengths = np.random.poisson(lam = 10, size = n_records).astype(dtype = 'i4')
    x = np.random.normal(size = record_lengths.sum()).astype(dtype = 'd')
    with open(filename, 'wb') as f:
        s = 0
        for i in range(n_records):
            f.write(record_lengths[i].tobytes())
            f.write(x[s:s+record_lengths[i]].tobytes())
            s += record_lengths[i]
        # Trick for testing: make sum of records equal to 0
        f.write(np.array([1], dtype = 'i4').tobytes())
        f.write(np.array([-x.sum()], dtype = 'd').tobytes())
    return os.path.getsize(filename)

def read_binary_npfromfile(filename):
    checksum = 0.0
    with open(filename, 'rb') as f:
        while True:
            try:
                record_length = np.fromfile(f, 'i4', 1)[0]
                x = np.fromfile(f, 'd', record_length)
                checksum += x.sum()
            except:
                break
    assert(np.abs(checksum) < 1e-6)

def read_binary_npfrombuffer(filename):
    checksum = 0.0
    with open(filename, 'rb') as f:
        while True:
            try:
                record_length = np.frombuffer(f.read(np.dtype('i4').itemsize), dtype = 'i4', count = 1)[0]
                x = np.frombuffer(f.read(np.dtype('d').itemsize * record_length), dtype = 'd', count = record_length)
                checksum += x.sum()
            except:
                break
    assert(np.abs(checksum) < 1e-6)

if __name__ == '__main__':
    from timeit import Timer
    from functools import partial

    fname = 'testfile.tmp'
    print("# File size[MB], Timings and errors [s]: fromfile, frombuffer")
    for i in [10**3, 3*10**3, 10**4, 3*10**4, 10**5, 3*10**5, 10**6, 3*10**6]:
        fsize = generate_binary_file(fname, i)
        t1 = Timer(partial(read_binary_npfromfile, fname))
        t2 = Timer(partial(read_binary_npfrombuffer, fname))
        a1 = np.array(t1.repeat(5, 1))
        a2 = np.array(t2.repeat(5, 1))
        print('%8.3f %12.6f %12.6f %12.6f %12.6f' % (1.0 * fsize / (2**20), a1.mean(), a1.std(), a2.mean(), a2.std()))
In Python 2 numpy.fromfile was probably the fastest way to deal with binary files of variable structure. It was approximately 3 times faster than numpy.frombuffer. Performance of both scaled linearly with file size.
In Python 3 numpy.frombuffer became around 10% slower, while numpy.fromfile became around 9.3 times slower compared to Python 2! Performance of both still scales linearly with file size.
In the documentation, numpy.fromfile is described as "A highly efficient way of reading binary data with a known data-type". This is no longer true in Python 3, which has in fact been noticed by other people already.
Questions

- In Python 3, how can I obtain performance comparable to (or better than) Python 2 when reading binary files of variable structure?
- What happened in Python 3 that made numpy.fromfile an order of magnitude slower?
ANSWER
Answered 2022-Mar-16 at 23:52

TL;DR: np.fromfile and np.frombuffer are not optimized to read many small buffers. You can load the whole file into a big buffer and then decode it very efficiently using Numba.
Analysis
The main issue is that the benchmark measures overheads. Indeed, it performs a lot of system/C calls that are very inefficient. For example, on the 24 MiB file, the while loops call np.fromfile and np.frombuffer 601_214 times. The timings on my machine are 10.5 s for read_binary_npfromfile and 1.2 s for read_binary_npfrombuffer, i.e. 17.4 µs and 2.0 µs per call respectively. Such per-call timings are relatively reasonable considering Numpy is not designed to operate efficiently on very small arrays (it needs to perform many checks, call some functions, wrap/unwrap CPython types, allocate objects, etc.). The overhead of these functions can change from one version to another, and unless it becomes huge, this is not a bug. The addition of new features to Numpy and CPython often impacts overheads, and that appears to be the case here (e.g. the buffering interface). The point is that this is not really a problem, because a different approach avoids these overheads and is much, much faster.
Faster Numpy code
The main idea for a fast implementation is to read the whole file once into a big byte buffer and then decode it with array views (ndarray.view). That being said, this is a bit tricky because of data alignment and because nearly all Numpy functions must be kept out of the while loop due to their overhead. Here is an example:
def read_binary_faster_numpy(filename):
    buff = np.fromfile(filename, dtype=np.uint8)
    buff_int32 = buff.view(np.int32)
    buff_double_1 = buff[0:len(buff)//8*8].view(np.float64)
    buff_double_2 = buff[4:4+(len(buff)-4)//8*8].view(np.float64)
    nblocks = buff.size // 4  # Number of 4-byte blocks
    pos = 0                   # Displacement by block of 4 bytes
    lst = []
    while pos < nblocks:
        record_length = buff_int32[pos]
        pos += 1
        if pos + record_length * 2 > nblocks:
            break
        offset = pos // 2
        if pos % 2 == 0:  # Aligned with buff_double_1
            x = buff_double_1[offset:offset+record_length]
        else:             # Aligned with buff_double_2
            x = buff_double_2[offset:offset+record_length]
        lst.append(x)     # np.sum is too expensive here
        pos += record_length * 2
    checksum = np.sum(np.concatenate(lst))
    assert(np.abs(checksum) < 1e-6)
The above implementation should be faster, but it is a bit tricky to understand, and it is still bounded by the latency of Numpy operations. Indeed, the loop still calls Numpy functions through operations like buff_int32[pos] or buff_double_1[offset:offset+record_length]. Even though the overhead of indexing is much smaller than that of the previous functions, it is still quite large for such a critical loop (with ~300_000 iterations)...
Better performance with... basic pure-Python code

It turns out that the following pure-Python implementation is faster, safer and simpler:
from struct import unpack_from

def read_binary_python_struct(filename):
    checksum = 0.0
    with open(filename, 'rb') as f:
        data = f.read()
        offset = 0
        while offset < len(data):
            record_length = unpack_from('@i', data, offset)[0]
            checksum += sum(unpack_from(f'{record_length}d', data, offset + 4))
            offset += 4 + record_length * 8
    assert(np.abs(checksum) < 1e-6)
This is because the overhead of unpack_from is far lower than that of the Numpy functions, though it is still not great. In fact, the main issue now is the CPython interpreter itself: it is clearly not designed with high performance in mind, and the above code pushes it to its limit. Allocating millions of temporary reference-counted dynamic objects like variable-sized integers and strings is very expensive, and it is not reasonable to make CPython do such work.
Writing high-performance code with Numba

We can drastically speed this up using Numba, which can compile Numpy-based Python code to native code using a just-in-time compiler. Here is an example:
import numba as nb  # required for the @nb.njit decorator below

@nb.njit('float64(uint8[::1])')
def decode_buffer(buff):
    checksum = 0.0
    offset = 0
    while offset + 4 < buff.size:
        record_length = buff[offset:offset+4].view(np.int32)[0]
        start = offset + 4
        end = start + record_length * 8
        if end > buff.size:
            break
        x = buff[start:end].view(np.float64)
        checksum += x.sum()
        offset = end
    return checksum

def read_binary_numba(filename):
    buff = np.fromfile(filename, dtype=np.uint8)
    checksum = decode_buffer(buff)
    assert(np.abs(checksum) < 1e-6)
Numba removes nearly all of the Numpy overheads thanks to natively compiled code. That being said, note that Numba does not implement all Numpy functions yet. This includes np.fromfile, which needs to be called outside a Numba-compiled function.
Benchmark
Here are the performance results on my machine (i5-9600KF with a high-performance Nvme SSD) with Python 3.8.1, Numpy 1.20.3 and Numba 0.54.1.
import numpy as np
import os

def generate_binary_file(filename, nrecords):
    n_records = np.random.poisson(lam = nrecords)
    record_lengths = np.random.poisson(lam = 10, size = n_records).astype(dtype = 'i4')
    x = np.random.normal(size = record_lengths.sum()).astype(dtype = 'd')
    with open(filename, 'wb') as f:
        s = 0
        for i in range(n_records):
            f.write(record_lengths[i].tobytes())
            f.write(x[s:s+record_lengths[i]].tobytes())
            s += record_lengths[i]
        # Trick for testing: make sum of records equal to 0
        f.write(np.array([1], dtype = 'i4').tobytes())
        f.write(np.array([-x.sum()], dtype = 'd').tobytes())
    return os.path.getsize(filename)

def read_binary_npfromfile(filename):
    checksum = 0.0
    with open(filename, 'rb') as f:
        while True:
            try:
                record_length = np.fromfile(f, 'i4', 1)[0]
                x = np.fromfile(f, 'd', record_length)
                checksum += x.sum()
            except:
                break
    assert(np.abs(checksum) < 1e-6)

def read_binary_npfrombuffer(filename):
    checksum = 0.0
    with open(filename, 'rb') as f:
        while True:
            try:
                record_length = np.frombuffer(f.read(np.dtype('i4').itemsize), dtype = 'i4', count = 1)[0]
                x = np.frombuffer(f.read(np.dtype('d').itemsize * record_length), dtype = 'd', count = record_length)
                checksum += x.sum()
            except:
                break
    assert(np.abs(checksum) < 1e-6)


if __name__ == '__main__':
    from timeit import Timer
    from functools import partial

    fname = 'testfile.tmp'
    print("# File size[MB], Timings and errors [s]: fromfile, frombuffer")
    for i in [10**3, 3*10**3, 10**4, 3*10**4, 10**5, 3*10**5, 10**6, 3*10**6]:
        fsize = generate_binary_file(fname, i)
        t1 = Timer(partial(read_binary_npfromfile, fname))
        t2 = Timer(partial(read_binary_npfrombuffer, fname))
        a1 = np.array(t1.repeat(5, 1))
        a2 = np.array(t2.repeat(5, 1))
        print('%8.3f %12.6f %12.6f %12.6f %12.6f' % (1.0 * fsize / (2**20), a1.mean(), a1.std(), a2.mean(), a2.std()))
def read_binary_faster_numpy(filename):
    buff = np.fromfile(filename, dtype=np.uint8)
    buff_int32 = buff.view(np.int32)
    buff_double_1 = buff[0:len(buff)//8*8].view(np.float64)
    buff_double_2 = buff[4:4+(len(buff)-4)//8*8].view(np.float64)
    nblocks = buff.size // 4  # Number of 4-byte blocks
    pos = 0                   # Displacement by block of 4 bytes
    lst = []
    while pos < nblocks:
        record_length = buff_int32[pos]
        pos += 1
        if pos + record_length * 2 > nblocks:
            break
        offset = pos // 2
        if pos % 2 == 0:  # Aligned with buff_double_1
            x = buff_double_1[offset:offset+record_length]
        else:             # Aligned with buff_double_2
            x = buff_double_2[offset:offset+record_length]
        lst.append(x)  # np.sum is too expensive here
        pos += record_length * 2
    checksum = np.sum(np.concatenate(lst))
    assert(np.abs(checksum) < 1e-6)
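The two overlapping float64 views above handle the fact that a record's payload may start either on an 8-byte boundary or 4 bytes past one: a reinterpreting view can only step in whole 8-byte elements from its own starting byte, so one view starts at byte 0 and the other at byte 4. A minimal sketch of the idea (the buffer contents are hypothetical, NumPy only):

```python
import numpy as np

# A 20-byte buffer: a 4-byte int32 length followed by two float64 payload
# values, so the doubles start at byte offset 4 (not a multiple of 8).
raw = np.int32(2).tobytes() + np.float64(1.5).tobytes() + np.float64(-2.5).tobytes()
buff = np.frombuffer(raw, dtype=np.uint8)

# The view starting at byte 0 would misinterpret these bytes; the view
# shifted by 4 bytes lines up exactly with the float64 payload.
buff_double_2 = buff[4:4 + (len(buff) - 4) // 8 * 8].view(np.float64)
print(buff_double_2.tolist())  # [1.5, -2.5]
```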
from struct import unpack_from

def read_binary_python_struct(filename):
    checksum = 0.0
    with open(filename, 'rb') as f:
        data = f.read()
        offset = 0
        while offset < len(data):
            record_length = unpack_from('@i', data, offset)[0]
            checksum += sum(unpack_from(f'{record_length}d', data, offset + 4))
            offset += 4 + record_length * 8
    assert(np.abs(checksum) < 1e-6)
@nb.njit('float64(uint8[::1])')
def decode_buffer(buff):
    checksum = 0.0
    offset = 0
    while offset + 4 < buff.size:
        record_length = buff[offset:offset+4].view(np.int32)[0]
        start = offset + 4
        end = start + record_length * 8
        if end > buff.size:
            break
        x = buff[start:end].view(np.float64)
        checksum += x.sum()
        offset = end
    return checksum

def read_binary_numba(filename):
    buff = np.fromfile(filename, dtype=np.uint8)
    checksum = decode_buffer(buff)
    assert(np.abs(checksum) < 1e-6)
read_binary_npfromfile:    10616 ms  (   x1)
read_binary_npfrombuffer:   1132 ms  (   x9)
read_binary_faster_numpy:    509 ms  (  x21)
read_binary_python_struct:   222 ms  (  x48)
read_binary_numba:            12 ms  ( x885)
Optimal time:                  7 ms  (x1517)
One can see that the Numba implementation is extremely fast compared to the initial Python implementation, and even compared to the fastest alternative Python implementation. This is especially impressive considering that, of the 12 ms total, 8 ms is spent in np.fromfile and only 4 ms in decode_buffer!
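The per-phase split quoted above can be measured with a small wrapper like the following (a sketch: the file name 'tiny.tmp' and the lambda are stand-ins for the real test file and decode_buffer):

```python
import time
import numpy as np

def time_phases(filename, decode):
    # Time the I/O phase (np.fromfile) separately from the decoding phase;
    # this is the kind of split used to attribute the 12 ms above.
    t0 = time.perf_counter()
    buff = np.fromfile(filename, dtype=np.uint8)
    t1 = time.perf_counter()
    decode(buff)
    t2 = time.perf_counter()
    return t1 - t0, t2 - t1

# Example with a stand-in decode function (the real one is decode_buffer).
np.arange(10, dtype=np.uint8).tofile('tiny.tmp')
io_s, decode_s = time_phases('tiny.tmp', lambda b: b.sum())
```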
QUESTION
ValueError: Layer "sequential" expects 1 input(s), but it received 10 input tensors
Asked 2022-Mar-15 at 15:48

I am following the TFF tutorials to build my FL model. My data is contained in different CSV files, which are considered as different clients. Following this tutorial, I built the Keras model function as follows:
@tf.function
def create_tf_dataset_for_client_fn(dataset_path):
    return tf.data.experimental.CsvDataset(dataset_path,
                                           record_defaults=record_defaults,
                                           header=True)

@tf.function
def add_parsing(dataset):
    def parse_dataset(*x):
        return OrderedDict([('y', x[-1]), ('x', x[1:-1])])
    return dataset.map(parse_dataset, num_parallel_calls=tf.data.AUTOTUNE)

source = tff.simulation.datasets.FilePerUserClientData(
    dataset_paths, create_tf_dataset_for_client_fn)

client_ids = sorted(source.client_ids)

# Make sure the client ids are tensor strings when splitting data.
source._client_ids = [tf.cast(c, tf.string) for c in source.client_ids]
source = source.preprocess(add_parsing)

train, test = source.train_test_client_split(source, 1)

train_client_ids = train.client_ids

train_data = train.create_tf_dataset_for_client(train_client_ids[0])

def create_keras_model():
    initializer = tf.keras.initializers.GlorotNormal(seed=0)
    return tf.keras.models.Sequential([
        tf.keras.layers.Input(shape=(32,)),
        tf.keras.layers.Dense(10, kernel_initializer=initializer),
        tf.keras.layers.Softmax(),
    ])

def model_fn():
    keras_model = create_keras_model()
    return tff.learning.from_keras_model(
        keras_model,
        input_spec=train_data.element_spec,
        loss=tf.keras.losses.SparseCategoricalCrossentropy(),
        metrics=[tf.keras.metrics.SparseCategoricalAccuracy()])
Then I followed the instructions and ran the other @tff.tf_computation functions as in the tutorial, like server_init(), initialize_fn(), client_update() and server_update(). But when I ran client_update_fn() I got this error:
ValueError: in user code:

    File "<ipython-input-14-cada45ffae0f>", line 12, in client_update *
        for batch in dataset:
    File "/usr/local/lib/python3.7/dist-packages/tensorflow_federated/python/learning/keras_utils.py", line 455, in forward_pass *
        return self._forward_pass(batch_input, training=training)
    File "/usr/local/lib/python3.7/dist-packages/tensorflow_federated/python/learning/keras_utils.py", line 408, in _forward_pass *
        predictions = self.predict_on_batch(inputs, training)
    File "/usr/local/lib/python3.7/dist-packages/tensorflow_federated/python/learning/keras_utils.py", line 398, in predict_on_batch *
        return self._keras_model(x, training=training)
    File "/usr/local/lib/python3.7/dist-packages/keras/engine/base_layer_v1.py", line 740, in __call__ **
        self.name)
    File "/usr/local/lib/python3.7/dist-packages/keras/engine/input_spec.py", line 200, in assert_input_compatibility
        raise ValueError(f'Layer "{layer_name}" expects {len(input_spec)} input(s),'

ValueError: Layer "sequential" expects 1 input(s), but it received 10 input tensors. Inputs received: [<tf.Tensor 'x:0' shape=() dtype=int32>, <tf.Tensor 'x_1:0' shape=() dtype=int32>, <tf.Tensor 'x_2:0' shape=() dtype=int32>, <tf.Tensor 'x_3:0' shape=() dtype=float32>, <tf.Tensor 'x_4:0' shape=() dtype=float32>, <tf.Tensor 'x_5:0' shape=() dtype=float32>, <tf.Tensor 'x_6:0' shape=() dtype=float32>, <tf.Tensor 'x_7:0' shape=() dtype=float32>, <tf.Tensor 'x_8:0' shape=() dtype=float32>, <tf.Tensor 'x_9:0' shape=() dtype=int32>]
Notes:
- Each CSV file has 10 columns as features (input) and one column as label (output).
- I added shape=(32,) arbitrarily; I don't really know what the shape of the data in each column is.

So, the question is: how do I feed the data to the Keras model and overcome this error?
Thanks in advance.
ANSWER
Answered 2022-Mar-15 at 15:48

A couple of problems: your data has ten separate features, which means you would actually need 10 separate inputs for your model. However, you can also stack the features into a single tensor and then use one input with the shape (10,). Here is a working example, but please note that it uses dummy data and therefore may not make much sense in reality.
Create dummy data:
import tensorflow as tf
import tensorflow_federated as tff
import pandas as pd
from collections import OrderedDict
import nest_asyncio
nest_asyncio.apply()

# Dummy data
samples = 5
data = [[tf.random.uniform((samples,), maxval=50, dtype=tf.int32).numpy().tolist(),
         tf.random.uniform((samples,), maxval=50, dtype=tf.int32).numpy().tolist(),
         tf.random.uniform((samples,), maxval=50, dtype=tf.int32).numpy().tolist(),
         tf.random.uniform((samples,), maxval=50, dtype=tf.int32).numpy().tolist(),
         tf.random.normal((samples,)).numpy().tolist(),
         tf.random.normal((samples,)).numpy().tolist(),
         tf.random.normal((samples,)).numpy().tolist(),
         tf.random.normal((samples,)).numpy().tolist(),
         tf.random.normal((samples,)).numpy().tolist(),
         tf.random.normal((samples,)).numpy().tolist(),
         tf.random.uniform((samples,), maxval=50, dtype=tf.int32).numpy().tolist(),
         tf.random.uniform((samples,), maxval=50, dtype=tf.int32).numpy().tolist()]]
df = pd.DataFrame(data)
df = df.explode(list(df.columns))
df.to_csv('client1.csv', index=False)
df.to_csv('client2.csv', index=False)
Load and process dataset:
import tensorflow as tf

record_defaults = [int(), int(), int(), int(), float(), float(), float(), float(), float(), float(), int(), int()]

@tf.function
def create_tf_dataset_for_client_fn(dataset_path):
    return tf.data.experimental.CsvDataset(dataset_path,
                                           record_defaults=record_defaults,
                                           header=True)

@tf.function
def add_parsing(dataset):
    def parse_dataset(*x):
        return OrderedDict([('y', x[-1]), ('x', x[1:-1])])
    return dataset.map(parse_dataset, num_parallel_calls=tf.data.AUTOTUNE)

dataset_paths = {'client1': '/content/client1.csv', 'client2': '/content/client2.csv'}

source = tff.simulation.datasets.FilePerUserClientData(
    dataset_paths, create_tf_dataset_for_client_fn)

client_ids = sorted(source.client_ids)

# Make sure the client ids are tensor strings when splitting data.
source._client_ids = [tf.cast(c, tf.string) for c in source.client_ids]
source = source.preprocess(add_parsing)

train, test = source.train_test_client_split(source, 1)

train_client_ids = train.client_ids

def reshape_data(d):
    d['x'] = tf.stack([tf.cast(x, dtype=tf.float32) for x in d['x']])
    return d

train_data = [train.create_tf_dataset_for_client(c).map(reshape_data).batch(1) for c in train_client_ids]
Create and run model:
def create_keras_model():
    initializer = tf.keras.initializers.GlorotNormal(seed=0)
    return tf.keras.models.Sequential([
        tf.keras.layers.Input(shape=(10,)),
        tf.keras.layers.Dense(75, kernel_initializer=initializer),
        tf.keras.layers.Dense(50, kernel_initializer=initializer),
        tf.keras.layers.Softmax(),
    ])

def model_fn():
    keras_model = create_keras_model()
    return tff.learning.from_keras_model(
        keras_model,
        input_spec=train_data[0].element_spec,
        loss=tf.keras.losses.SparseCategoricalCrossentropy(),
        metrics=[tf.keras.metrics.SparseCategoricalAccuracy()])

def initialize_fn():
    model = model_fn()
    return model.trainable_variables

@tf.function
def client_update(model, dataset, server_weights, client_optimizer):
    """Performs training (using the server model weights) on the client's dataset."""
    client_weights = model.trainable_variables
    tf.nest.map_structure(lambda x, y: x.assign(y),
                          client_weights, server_weights)

    for batch in dataset:
        with tf.GradientTape() as tape:
            outputs = model.forward_pass(batch)

        grads = tape.gradient(outputs.loss, client_weights)
        grads_and_vars = zip(grads, client_weights)
        client_optimizer.apply_gradients(grads_and_vars)

    return client_weights

@tf.function
def server_update(model, mean_client_weights):
    """Updates the server model weights as the average of the client model weights."""
    model_weights = model.trainable_variables
    tf.nest.map_structure(lambda x, y: x.assign(y),
                          model_weights, mean_client_weights)
    return model_weights

federated_float_on_clients = tff.FederatedType(tf.float32, tff.CLIENTS)

@tff.federated_computation(tff.FederatedType(tf.float32, tff.CLIENTS))
def get_average_temperature(client_temperatures):
    return tff.federated_mean(client_temperatures)

str(get_average_temperature.type_signature)
get_average_temperature([68.5, 70.3, 69.8])

@tff.tf_computation
def server_init():
    model = model_fn()
    return model.trainable_variables

@tff.federated_computation
def initialize_fn():
    return tff.federated_value(server_init(), tff.SERVER)

whimsy_model = model_fn()
tf_dataset_type = tff.SequenceType(whimsy_model.input_spec)
model_weights_type = server_init.type_signature.result

@tff.tf_computation(tf_dataset_type, model_weights_type)
def client_update_fn(tf_dataset, server_weights):
    model = model_fn()
    client_optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)
    return client_update(model, tf_dataset, server_weights, client_optimizer)

@tff.tf_computation(model_weights_type)
def server_update_fn(mean_client_weights):
    model = model_fn()
    return server_update(model, mean_client_weights)

federated_server_type = tff.FederatedType(model_weights_type, tff.SERVER)
federated_dataset_type = tff.FederatedType(tf_dataset_type, tff.CLIENTS)

@tff.federated_computation(federated_server_type, federated_dataset_type)
def next_fn(server_weights, federated_dataset):
    server_weights_at_client = tff.federated_broadcast(server_weights)
    client_weights = tff.federated_map(
        client_update_fn, (federated_dataset, server_weights_at_client))
    mean_client_weights = tff.federated_mean(client_weights)

    server_weights = tff.federated_map(server_update_fn, mean_client_weights)
    return server_weights

federated_algorithm = tff.templates.IterativeProcess(
    initialize_fn=initialize_fn,
    next_fn=next_fn
)

server_state = federated_algorithm.initialize()
for round in range(15):
    server_state = federated_algorithm.next(server_state, train_data)
Regarding the line tf.keras.layers.Dense(50, kernel_initializer=initializer) in the model: I am using 50 output nodes, since I created dummy labels that can vary between 0 and 49. This is necessary when using the SparseCategoricalCrossentropy loss function.
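The reason the output width must cover the label range can be seen from how sparse categorical crossentropy works: the integer label indexes directly into the model's probability vector. A minimal NumPy sketch of that computation (dummy uniform probabilities; this illustrates the math, it is not the Keras implementation):

```python
import numpy as np

def sparse_categorical_crossentropy(y_true, probs):
    # probs has shape (batch, num_classes); each integer label in y_true
    # is used as an index into its row, so every label must be < num_classes.
    return -np.log(probs[np.arange(len(y_true)), y_true])

probs = np.full((2, 50), 1.0 / 50)  # uniform distribution over 50 classes
y_true = np.array([0, 49])          # labels can be as large as 49
print(sparse_categorical_crossentropy(y_true, probs))  # both entries equal ln(50) ≈ 3.912
```

With fewer than 50 outputs, a label of 49 would index past the end of the probability vector, which is exactly why the layer width has to match the number of classes.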
QUESTION
Convolution Function Latency Bottleneck
Asked 2022-Mar-10 at 13:57

I have implemented a Convolutional Neural Network in C and have been studying which parts of it have the longest latency.
Based on my research, the massive amount of matrix multiplication required by CNNs makes running them on CPUs, and even GPUs, very inefficient. However, when I actually profiled my code (on an unoptimized build), I found that something other than the multiplication itself was the bottleneck of the implementation.
After turning on optimization (-O3 -march=native -ffast-math, gcc cross compiler), the gprof profile clearly showed that the convolution2D function takes the largest amount of time to run, followed by the batch normalization and depthwise convolution functions.
The convolution function in question looks like this:
void convolution2D(int isize,  // width/height of input
                   int osize,  // width/height of output
                   int ksize,  // width/height of kernel
                   int stride, // shift between input pixels, between consecutive outputs
                   int pad,    // offset between (0,0) pixels between input and output
                   int idepth, int odepth, // number of input and output channels
                   float idata[isize][isize][idepth],
                   float odata[osize][osize][odepth],
                   float kdata[odepth][ksize][ksize][idepth])
{
    // iterate over the output
    for (int oy = 0; oy < osize; ++oy) {
        for (int ox = 0; ox < osize; ++ox) {
            for (int od = 0; od < odepth; ++od) {
                odata[oy][ox][od] = 0; // When you iterate multiple times without closing the program, this number would stack up to infinity, so we have to zero it out every time.
                for (int ky = 0; ky < ksize; ++ky) {
                    for (int kx = 0; kx < ksize; ++kx) {
                        // map position in output and kernel to the input
                        int iy = stride * oy + ky - pad;
                        int ix = stride * ox + kx - pad;
                        // use only valid inputs
                        if (iy >= 0 && iy < isize && ix >= 0 && ix < isize)
                            for (int id = 0; id < idepth; ++id)
                                odata[oy][ox][od] += kdata[od][ky][kx][id] * idata[iy][ix][id];
                    }
                }
            }
        }
    }
}
This design is based on my previous question, and most of the processing time should fall on the convolution itself: odata[oy][ox][od] += kdata[od][ky][kx][id] * idata[iy][ix][id];.
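One low-risk change worth trying (a sketch with the same signature as the function above, not a guaranteed fix) is to accumulate each output element in a local variable and store it once at the end. Writing through odata[oy][ox][od] inside the kernel loops forces a load-add-store through memory on every iteration; a local scalar lets the compiler keep the running sum in a register and vectorize the inner loop more freely:

```c
// Same loop nest as convolution2D above, but with a local accumulator.
void convolution2D_local_acc(int isize, int osize, int ksize, int stride,
                             int pad, int idepth, int odepth,
                             float idata[isize][isize][idepth],
                             float odata[osize][osize][odepth],
                             float kdata[odepth][ksize][ksize][idepth])
{
    for (int oy = 0; oy < osize; ++oy) {
        for (int ox = 0; ox < osize; ++ox) {
            for (int od = 0; od < odepth; ++od) {
                float sum = 0.0f; // kept in a register instead of memory
                for (int ky = 0; ky < ksize; ++ky) {
                    for (int kx = 0; kx < ksize; ++kx) {
                        int iy = stride * oy + ky - pad;
                        int ix = stride * ox + kx - pad;
                        if (iy >= 0 && iy < isize && ix >= 0 && ix < isize)
                            for (int id = 0; id < idepth; ++id)
                                sum += kdata[od][ky][kx][id] * idata[iy][ix][id];
                    }
                }
                odata[oy][ox][od] = sum; // single store per output element
            }
        }
    }
}
```

Since -ffast-math is already enabled, the compiler may reorder these additions anyway, so the gain here comes mainly from removing the redundant loads and stores rather than from changing the arithmetic.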
Using objdump -drwC -Mintel to take a look at the assembly code returns the following:
0000000000007880 <convolution2D>:
    7880: f3 0f 1e fa                       endbr64
    7884: 55                                push rbp
    7885: 48 89 e5                          mov rbp,rsp
    7888: 41 57                             push r15
    788a: 41 56                             push r14
    788c: 41 55                             push r13
    788e: 41 54                             push r12
    7890: 53                                push rbx
    7891: 48 81 ec b0 00 00 00              sub rsp,0xb0
    7898: ff 15 4a a7 00 00                 call QWORD PTR [rip+0xa74a]        # 11fe8 <mcount@GLIBC_2.2.5>
    789e: 89 d3                             mov ebx,edx
    78a0: 89 55 a8                          mov DWORD PTR [rbp-0x58],edx
    78a3: 89 8d 74 ff ff ff                 mov DWORD PTR [rbp-0x8c],ecx
    78a9: 49 63 d1                          movsxd rdx,r9d
    78ac: 48 63 cf                          movsxd rcx,edi
    78af: 41 89 f2                          mov r10d,esi
    78b2: 89 b5 38 ff ff ff                 mov DWORD PTR [rbp-0xc8],esi
    78b8: 49 63 c0                          movsxd rax,r8d
    78bb: 48 0f af ca                       imul rcx,rdx
    78bf: 48 63 75 10                       movsxd rsi,DWORD PTR [rbp+0x10]
    78c3: 49 89 d6                          mov r14,rdx
    78c6: 4c 8d 24 95 00 00 00 00           lea r12,[rdx*4+0x0]
    78ce: 41 89 fd                          mov r13d,edi
    78d1: 49 89 cb                          mov r11,rcx
    78d4: 48 89 8d 60 ff ff ff              mov QWORD PTR [rbp-0xa0],rcx
    78db: 49 63 ca                          movsxd rcx,r10d
    78de: 4c 8d 0c b5 00 00 00 00           lea r9,[rsi*4+0x0]
    78e6: 49 89 f0                          mov r8,rsi
    78e9: 48 0f af f1                       imul rsi,rcx
    78ed: 48 63 cb                          movsxd rcx,ebx
    78f0: 4c 89 8d 48 ff ff ff              mov QWORD PTR [rbp-0xb8],r9
    78f7: 48 0f af d1                       imul rdx,rcx
    78fb: 48 8d 3c 95 00 00 00 00           lea rdi,[rdx*4+0x0]
    7903: 45 85 d2                          test r10d,r10d
    7906: 0f 8e 73 02 00 00                 jle 7b7f <convolution2D+0x2ff>
    790c: 48 c1 ef 02                       shr rdi,0x2
    7910: 49 c1 e9 02                       shr r9,0x2
    7914: 48 89 7d c8                       mov QWORD PTR [rbp-0x38],rdi
    7918: 4c 89 e7                          mov rdi,r12
    791b: 4c 89 8d 58 ff ff ff              mov QWORD PTR [rbp-0xa8],r9
    7922: 48 c1 ef 02                       shr rdi,0x2
    7926: 48 89 bd 50 ff ff ff              mov QWORD PTR [rbp-0xb0],rdi
    792d: 45 85 c0                          test r8d,r8d
    7930: 0f 8e 49 02 00 00                 jle 7b7f <convolution2D+0x2ff>
    7936: 48 c1 e6 02                       shl rsi,0x2
    793a: 48 0f af d1                       imul rdx,rcx
    793e: 29 c3                             sub ebx,eax
    7940: 89 c7                             mov edi,eax
    7942: 48 89 b5 30 ff ff ff              mov QWORD PTR [rbp-0xd0],rsi
    7949: 48 8b 75 20                       mov rsi,QWORD PTR [rbp+0x20]
    794d: 48 89 85 68 ff ff ff              mov QWORD PTR [rbp-0x98],rax
    7954: f7 df                             neg edi
    7956: 45 8d 7e ff                       lea r15d,[r14-0x1]
    795a: 89 9d 70 ff ff ff                 mov DWORD PTR [rbp-0x90],ebx
    7960: 89 bd 3c ff ff ff                 mov DWORD PTR [rbp-0xc4],edi
    7966: 48 8d 0c 95 00 00 00 00           lea rcx,[rdx*4+0x0]
    796e: 89 7d ac                          mov DWORD PTR [rbp-0x54],edi
    7971: 89 5d d4                          mov DWORD PTR [rbp-0x2c],ebx
    7974: 48 89 4d 98                       mov QWORD PTR [rbp-0x68],rcx
    7978: 4a 8d 0c 9d 00 00 00 00           lea rcx,[r11*4+0x0]
    7980: c7 45 80 00 00 00 00              mov DWORD PTR [rbp-0x80],0x0
    7987: 48 89 75 88                       mov QWORD PTR [rbp-0x78],rsi
    798b: 41 8d 70 ff                       lea esi,[r8-0x1]
    798f: 48 89 4d c0                       mov QWORD PTR [rbp-0x40],rcx
    7993: 48 8d 04 b5 04 00 00 00           lea rax,[rsi*4+0x4]
    799b: c7 45 90 00 00 00 00              mov DWORD PTR [rbp-0x70],0x0
    79a2: 48 89 85 28 ff ff ff              mov QWORD PTR [rbp-0xd8],rax
    79a9: 44 89 f0                          mov eax,r14d
    79ac: 45 89 ee                          mov r14d,r13d
    79af: 41 89 c5                          mov r13d,eax
    79b2: 48 8b 85 28 ff ff ff              mov rax,QWORD PTR [rbp-0xd8]
    79b9: 48 03 45 88                       add rax,QWORD PTR [rbp-0x78]
    79bd: 48 c7 85 78 ff ff ff 00 00 00 00  mov QWORD PTR [rbp-0x88],0x0
104 79c8: c7 45 84 00 00 00 00 mov DWORD PTR [rbp-0x7c],0x0
105 79cf: c7 45 94 00 00 00 00 mov DWORD PTR [rbp-0x6c],0x0
106 79d6: 44 8b 95 70 ff ff ff mov r10d,DWORD PTR [rbp-0x90]
107 79dd: 48 89 45 b0 mov QWORD PTR [rbp-0x50],rax
108 79e1: 48 63 45 80 movsxd rax,DWORD PTR [rbp-0x80]
109 79e5: 48 2b 85 68 ff ff ff sub rax,QWORD PTR [rbp-0x98]
110 79ec: 48 0f af 85 60 ff ff ff imul rax,QWORD PTR [rbp-0xa0]
111 79f4: 48 89 85 40 ff ff ff mov QWORD PTR [rbp-0xc0],rax
112 79fb: 8b 85 3c ff ff ff mov eax,DWORD PTR [rbp-0xc4]
113 7a01: 89 45 d0 mov DWORD PTR [rbp-0x30],eax
114 7a04: 48 8b 45 88 mov rax,QWORD PTR [rbp-0x78]
115 7a08: 48 8b 9d 78 ff ff ff mov rbx,QWORD PTR [rbp-0x88]
116 7a0f: 4c 8d 04 98 lea r8,[rax+rbx*4]
117 7a13: 48 8b 45 28 mov rax,QWORD PTR [rbp+0x28]
118 7a17: 48 8b 5d 18 mov rbx,QWORD PTR [rbp+0x18]
119 7a1b: 48 89 45 b8 mov QWORD PTR [rbp-0x48],rax
120 7a1f: 48 63 45 84 movsxd rax,DWORD PTR [rbp-0x7c]
121 7a23: 48 2b 85 68 ff ff ff sub rax,QWORD PTR [rbp-0x98]
122 7a2a: 48 0f af 85 50 ff ff ff imul rax,QWORD PTR [rbp-0xb0]
123 7a32: 48 03 85 40 ff ff ff add rax,QWORD PTR [rbp-0xc0]
124 7a39: 48 8d 04 83 lea rax,[rbx+rax*4]
125 7a3d: 48 89 45 a0 mov QWORD PTR [rbp-0x60],rax
126 7a41: 66 66 2e 0f 1f 84 00 00 00 00 00 data16 nop WORD PTR cs:[rax+rax*1+0x0]
127 7a4c: 0f 1f 40 00 nop DWORD PTR [rax+0x0]
128 7a50: 8b 45 a8 mov eax,DWORD PTR [rbp-0x58]
129 7a53: 41 c7 00 00 00 00 00 mov DWORD PTR [r8],0x0
130 7a5a: 45 31 db xor r11d,r11d
131 7a5d: 48 8b 5d a0 mov rbx,QWORD PTR [rbp-0x60]
132 7a61: 44 8b 4d ac mov r9d,DWORD PTR [rbp-0x54]
133 7a65: 85 c0 test eax,eax
134 7a67: 0f 8e 98 00 00 00 jle 7b05 <convolution2D+0x285>
135 7a6d: 0f 1f 00 nop DWORD PTR [rax]
136 7a70: 45 85 c9 test r9d,r9d
137 7a73: 78 7b js 7af0 <convolution2D+0x270>
138 7a75: 45 39 ce cmp r14d,r9d
139 7a78: 7e 76 jle 7af0 <convolution2D+0x270>
140 7a7a: 48 8b 45 b8 mov rax,QWORD PTR [rbp-0x48]
141 7a7e: 8b 55 d0 mov edx,DWORD PTR [rbp-0x30]
142 7a81: 48 89 de mov rsi,rbx
143 7a84: 4a 8d 3c 98 lea rdi,[rax+r11*4]
144 7a88: eb 13 jmp 7a9d <convolution2D+0x21d>
145 7a8a: 66 0f 1f 44 00 00 nop WORD PTR [rax+rax*1+0x0]
146 7a90: ff c2 inc edx
147 7a92: 4c 01 e7 add rdi,r12
148 7a95: 4c 01 e6 add rsi,r12
149 7a98: 44 39 d2 cmp edx,r10d
150 7a9b: 74 53 je 7af0 <convolution2D+0x270>
151 7a9d: 85 d2 test edx,edx
152 7a9f: 78 ef js 7a90 <convolution2D+0x210>
153 7aa1: 41 39 d6 cmp r14d,edx
154 7aa4: 7e ea jle 7a90 <convolution2D+0x210>
155 7aa6: 45 85 ed test r13d,r13d
156 7aa9: 7e e5 jle 7a90 <convolution2D+0x210>
157 7aab: c4 c1 7a 10 08 vmovss xmm1,DWORD PTR [r8]
158 7ab0: 31 c0 xor eax,eax
159 7ab2: 66 66 2e 0f 1f 84 00 00 00 00 00 data16 nop WORD PTR cs:[rax+rax*1+0x0]
160 7abd: 0f 1f 00 nop DWORD PTR [rax]
161 7ac0: c5 fa 10 04 87 vmovss xmm0,DWORD PTR [rdi+rax*4]
162 7ac5: 48 89 c1 mov rcx,rax
163 7ac8: c5 fa 59 04 86 vmulss xmm0,xmm0,DWORD PTR [rsi+rax*4]
164 7acd: 48 ff c0 inc rax
165 7ad0: c5 f2 58 c8 vaddss xmm1,xmm1,xmm0
166 7ad4: c4 c1 7a 11 08 vmovss DWORD PTR [r8],xmm1
167 7ad9: 49 39 cf cmp r15,rcx
168 7adc: 75 e2 jne 7ac0 <convolution2D+0x240>
169 7ade: ff c2 inc edx
170 7ae0: 4c 01 e7 add rdi,r12
171 7ae3: 4c 01 e6 add rsi,r12
172 7ae6: 44 39 d2 cmp edx,r10d
173 7ae9: 75 b2 jne 7a9d <convolution2D+0x21d>
174 7aeb: 0f 1f 44 00 00 nop DWORD PTR [rax+rax*1+0x0]
175 7af0: 4c 03 5d c8 add r11,QWORD PTR [rbp-0x38]
176 7af4: 48 03 5d c0 add rbx,QWORD PTR [rbp-0x40]
177 7af8: 41 ff c1 inc r9d
178 7afb: 44 3b 4d d4 cmp r9d,DWORD PTR [rbp-0x2c]
179 7aff: 0f 85 6b ff ff ff jne 7a70 <convolution2D+0x1f0>
180 7b05: 48 8b 5d 98 mov rbx,QWORD PTR [rbp-0x68]
181 7b09: 49 83 c0 04 add r8,0x4
182 7b0d: 48 01 5d b8 add QWORD PTR [rbp-0x48],rbx
183 7b11: 4c 3b 45 b0 cmp r8,QWORD PTR [rbp-0x50]
184 7b15: 0f 85 35 ff ff ff jne 7a50 <convolution2D+0x1d0>
185 7b1b: 8b 9d 74 ff ff ff mov ebx,DWORD PTR [rbp-0x8c]
186 7b21: 8b 45 94 mov eax,DWORD PTR [rbp-0x6c]
187 7b24: 48 8b 8d 48 ff ff ff mov rcx,QWORD PTR [rbp-0xb8]
188 7b2b: 01 5d d0 add DWORD PTR [rbp-0x30],ebx
189 7b2e: 48 01 4d b0 add QWORD PTR [rbp-0x50],rcx
190 7b32: 01 5d 84 add DWORD PTR [rbp-0x7c],ebx
191 7b35: 48 8b 8d 58 ff ff ff mov rcx,QWORD PTR [rbp-0xa8]
192 7b3c: 41 01 da add r10d,ebx
193 7b3f: 48 01 8d 78 ff ff ff add QWORD PTR [rbp-0x88],rcx
194 7b46: ff c0 inc eax
195 7b48: 39 85 38 ff ff ff cmp DWORD PTR [rbp-0xc8],eax
196 7b4e: 74 08 je 7b58 <convolution2D+0x2d8>
197 7b50: 89 45 94 mov DWORD PTR [rbp-0x6c],eax
198 7b53: e9 ac fe ff ff jmp 7a04 <convolution2D+0x184>
199 7b58: 8b 4d 90 mov ecx,DWORD PTR [rbp-0x70]
200 7b5b: 48 8b b5 30 ff ff ff mov rsi,QWORD PTR [rbp-0xd0]
201 7b62: 01 5d d4 add DWORD PTR [rbp-0x2c],ebx
202 7b65: 01 5d ac add DWORD PTR [rbp-0x54],ebx
203 7b68: 01 5d 80 add DWORD PTR [rbp-0x80],ebx
204 7b6b: 48 01 75 88 add QWORD PTR [rbp-0x78],rsi
205 7b6f: 8d 41 01 lea eax,[rcx+0x1]
206 7b72: 39 4d 94 cmp DWORD PTR [rbp-0x6c],ecx
207 7b75: 74 08 je 7b7f <convolution2D+0x2ff>
208 7b77: 89 45 90 mov DWORD PTR [rbp-0x70],eax
209 7b7a: e9 33 fe ff ff jmp 79b2 <convolution2D+0x132>
210 7b7f: 48 81 c4 b0 00 00 00 add rsp,0xb0
211 7b86: 5b pop rbx
212 7b87: 41 5c pop r12
213 7b89: 41 5d pop r13
214 7b8b: 41 5e pop r14
215 7b8d: 41 5f pop r15
216 7b8f: 5d pop rbp
217 7b90: c3 ret
218 7b91: 66 66 2e 0f 1f 84 00 00 00 00 00 data16 nop WORD PTR cs:[rax+rax*1+0x0]
219 7b9c: 0f 1f 40 00 nop DWORD PTR [rax+0x0]
For reference, I'm using an AMD Ryzen 7 CPU, which is based on the Zen 2 architecture. Here is its list of instructions (page 101).
I suspect that this data points to a memory issue rather than to the multiplication itself being the bottleneck.
Question: How can I improve this code so that it does not cause a memory bottleneck?
I'm guessing this is a problem particular to my code, perhaps something related to the multidimensional arrays I'm using. If I instead used one big one-dimensional array for each variable, would the latency decrease?
Relevant information:
There are two ways I declare the variables that are passed to this function: the first is as a global variable (usually in a struct), and the second is via dynamic allocation:
float (*arr)[x][y] = calloc(z, sizeof *arr);
Perhaps the order in which I declare these matrices is not cache-friendly, but I am not sure how to re-order them.
Stride values for the previous function are always 1 or 2, usually 1.
Here is the output of valgrind --tool=cachegrind:
==430300== Cachegrind, a cache and branch-prediction profiler
==430300== Copyright (C) 2002-2017, and GNU GPL'd, by Nicholas Nethercote et al.
==430300== Using Valgrind-3.15.0 and LibVEX; rerun with -h for copyright info
==430300== Command: ./EmbeddedNet test 1
==430300== Parent PID: 170008
==430300==
--430300-- warning: L3 cache found, using its data for the LL simulation.
==430300==
==430300== I refs: 6,369,594,192
==430300== I1 misses: 4,271
==430300== LLi misses: 2,442
==430300== I1 miss rate: 0.00%
==430300== LLi miss rate: 0.00%
==430300==
==430300== D refs: 2,064,233,110 (1,359,003,131 rd + 705,229,979 wr)
==430300== D1 misses: 34,476,969 ( 19,010,839 rd + 15,466,130 wr)
==430300== LLd misses: 5,311,277 ( 1,603,955 rd + 3,707,322 wr)
==430300== D1 miss rate: 1.7% ( 1.4% + 2.2% )
==430300== LLd miss rate: 0.3% ( 0.1% + 0.5% )
==430300==
==430300== LL refs: 34,481,240 ( 19,015,110 rd + 15,466,130 wr)
==430300== LL misses: 5,313,719 ( 1,606,397 rd + 3,707,322 wr)
==430300== LL miss rate: 0.1% ( 0.0% + 0.5% )
ANSWER
Answered 2022-Mar-10 at 13:57
Looking at the Cachegrind results, it doesn't look like memory is your bottleneck. The network has to be stored in memory anyway, and if it were so large that your program suffered a lot of L1 cache misses it would be worth trying to minimize them, but an L1 data miss rate of 1.7% is not a problem.
So you're trying to make this run fast anyway. Looking at your code, what happens in the innermost loop is very simple (load -> multiply -> add -> store), and it has no side effect other than the final store. Code like this is easy to parallelize, for example by multithreading or by vectorizing. Seeing that you can already write code of some complexity, I think you know how to run this in multiple threads, and you asked in the comments how to manually vectorize the code.
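To make the multithreading route concrete, here is a minimal OpenMP sketch (an illustrative variant, not the vectorized version discussed below; `convolution2D_mt` is a hypothetical name). The pragma is ignored unless you compile with `-fopenmp`, and accumulating into a local variable keeps all but one store out of the inner loop:

```c
// Hypothetical multithreaded variant of the question's function (same
// signature). Build with: gcc -O3 -fopenmp ...
void convolution2D_mt(int isize, int osize, int ksize, int stride, int pad,
                      int idepth, int odepth,
                      float idata[isize][isize][idepth],
                      float odata[osize][osize][odepth],
                      float kdata[odepth][ksize][ksize][idepth])
{
    // every (oy, ox, od) output cell is independent, so output rows can be
    // split across threads without synchronization
    #pragma omp parallel for collapse(2) schedule(static)
    for (int oy = 0; oy < osize; ++oy) {
        for (int ox = 0; ox < osize; ++ox) {
            for (int od = 0; od < odepth; ++od) {
                float acc = 0.0f;  // local accumulator: one store per cell
                for (int ky = 0; ky < ksize; ++ky)
                    for (int kx = 0; kx < ksize; ++kx) {
                        // map position in output and kernel to the input
                        int iy = stride * oy + ky - pad;
                        int ix = stride * ox + kx - pad;
                        // use only valid inputs
                        if (iy >= 0 && iy < isize && ix >= 0 && ix < isize)
                            for (int id = 0; id < idepth; ++id)
                                acc += kdata[od][ky][kx][id] * idata[iy][ix][id];
                    }
                odata[oy][ox][od] = acc;
            }
        }
    }
}
```

Without `-fopenmp` the function still compiles and runs serially, so the fallback behavior matches the original.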
I will explain that part, but one thing to bear in mind is that manually vectorized code is often tied to a particular CPU architecture. Leaving aside non-AMD64-compatible CPUs such as ARM, you still have MMX, SSE, AVX, and AVX-512 to choose from as the extension for vectorized computation, and each extension has multiple versions. If you want maximum portability, SSE2 is a reasonable choice: it appeared with the Pentium 4, and it supports 128-bit vectors. For this post I'll use AVX2, which supports both 128-bit and 256-bit vectors. It runs fine on your CPU and has reasonable portability these days, being supported since Haswell (2013) and Excavator (2015).
The pattern you're using in the inner loop is called FMA (fused multiply-add). AVX2 has an instruction for this. Have a look at this function and the compiled output.
1void convolution2D(int isize, // width/height of input
2 int osize, // width/height of output
3 int ksize, // width/height of kernel
4 int stride, // shift between input pixels, between consecutive outputs
5 int pad, // offset between (0,0) pixels between input and output
6 int idepth, int odepth, // number of input and output channels
7 float idata[isize][isize][idepth],
8 float odata[osize][osize][odepth],
9 float kdata[odepth][ksize][ksize][idepth])
10{
11 // iterate over the output
12 for (int oy = 0; oy < osize; ++oy) {
13 for (int ox = 0; ox < osize; ++ox) {
14 for (int od = 0; od < odepth; ++od) {
15 odata[oy][ox][od] = 0; // When you iterate multiple times without closing the program, this number would stack up to infinity, so we have to zero it out every time.
16 for (int ky = 0; ky < ksize; ++ky) {
17 for (int kx = 0; kx < ksize; ++kx) {
18 // map position in output and kernel to the input
19 int iy = stride * oy + ky - pad;
20 int ix = stride * ox + kx - pad;
21 // use only valid inputs
22 if (iy >= 0 && iy < isize && ix >= 0 && ix < isize)
23 for (int id = 0; id < idepth; ++id)
24 odata[oy][ox][od] += kdata[od][ky][kx][id] * idata[iy][ix][id];
25 }}
26 }}}
27
28}
29
0000000000007880 <convolution2D>:
    7880: f3 0f 1e fa             endbr64
    7884: 55                      push rbp
    7885: 48 89 e5                mov rbp,rsp
    7888: 41 57                   push r15
    788a: 41 56                   push r14
    788c: 41 55                   push r13
    788e: 41 54                   push r12
    7890: 53                      push rbx
    7891: 48 81 ec b0 00 00 00    sub rsp,0xb0
    7898: ff 15 4a a7 00 00       call QWORD PTR [rip+0xa74a] # 11fe8 <mcount@GLIBC_2.2.5>
    789e: 89 d3                   mov ebx,edx
    78a0: 89 55 a8                mov DWORD PTR [rbp-0x58],edx
    78a3: 89 8d 74 ff ff ff       mov DWORD PTR [rbp-0x8c],ecx
    78a9: 49 63 d1                movsxd rdx,r9d
    78ac: 48 63 cf                movsxd rcx,edi
    78af: 41 89 f2                mov r10d,esi
    78b2: 89 b5 38 ff ff ff       mov DWORD PTR [rbp-0xc8],esi
    78b8: 49 63 c0                movsxd rax,r8d
    78bb: 48 0f af ca             imul rcx,rdx
    78bf: 48 63 75 10             movsxd rsi,DWORD PTR [rbp+0x10]
    78c3: 49 89 d6                mov r14,rdx
    78c6: 4c 8d 24 95 00 00 00 00 lea r12,[rdx*4+0x0]
    78ce: 41 89 fd                mov r13d,edi
    78d1: 49 89 cb                mov r11,rcx
    78d4: 48 89 8d 60 ff ff ff    mov QWORD PTR [rbp-0xa0],rcx
    78db: 49 63 ca                movsxd rcx,r10d
    78de: 4c 8d 0c b5 00 00 00 00 lea r9,[rsi*4+0x0]
    78e6: 49 89 f0                mov r8,rsi
    78e9: 48 0f af f1             imul rsi,rcx
    78ed: 48 63 cb                movsxd rcx,ebx
    78f0: 4c 89 8d 48 ff ff ff    mov QWORD PTR [rbp-0xb8],r9
    78f7: 48 0f af d1             imul rdx,rcx
    78fb: 48 8d 3c 95 00 00 00 00 lea rdi,[rdx*4+0x0]
    7903: 45 85 d2                test r10d,r10d
    7906: 0f 8e 73 02 00 00       jle 7b7f <convolution2D+0x2ff>
    790c: 48 c1 ef 02             shr rdi,0x2
    7910: 49 c1 e9 02             shr r9,0x2
    7914: 48 89 7d c8             mov QWORD PTR [rbp-0x38],rdi
    7918: 4c 89 e7                mov rdi,r12
    791b: 4c 89 8d 58 ff ff ff    mov QWORD PTR [rbp-0xa8],r9
    7922: 48 c1 ef 02             shr rdi,0x2
    7926: 48 89 bd 50 ff ff ff    mov QWORD PTR [rbp-0xb0],rdi
    792d: 45 85 c0                test r8d,r8d
    7930: 0f 8e 49 02 00 00       jle 7b7f <convolution2D+0x2ff>
    7936: 48 c1 e6 02             shl rsi,0x2
    793a: 48 0f af d1             imul rdx,rcx
    793e: 29 c3                   sub ebx,eax
    7940: 89 c7                   mov edi,eax
    7942: 48 89 b5 30 ff ff ff    mov QWORD PTR [rbp-0xd0],rsi
    7949: 48 8b 75 20             mov rsi,QWORD PTR [rbp+0x20]
    794d: 48 89 85 68 ff ff ff    mov QWORD PTR [rbp-0x98],rax
    7954: f7 df                   neg edi
    7956: 45 8d 7e ff             lea r15d,[r14-0x1]
    795a: 89 9d 70 ff ff ff       mov DWORD PTR [rbp-0x90],ebx
    7960: 89 bd 3c ff ff ff       mov DWORD PTR [rbp-0xc4],edi
    7966: 48 8d 0c 95 00 00 00 00 lea rcx,[rdx*4+0x0]
    796e: 89 7d ac                mov DWORD PTR [rbp-0x54],edi
    7971: 89 5d d4                mov DWORD PTR [rbp-0x2c],ebx
    7974: 48 89 4d 98             mov QWORD PTR [rbp-0x68],rcx
    7978: 4a 8d 0c 9d 00 00 00 00 lea rcx,[r11*4+0x0]
    7980: c7 45 80 00 00 00 00    mov DWORD PTR [rbp-0x80],0x0
    7987: 48 89 75 88             mov QWORD PTR [rbp-0x78],rsi
    798b: 41 8d 70 ff             lea esi,[r8-0x1]
    798f: 48 89 4d c0             mov QWORD PTR [rbp-0x40],rcx
    7993: 48 8d 04 b5 04 00 00 00 lea rax,[rsi*4+0x4]
    799b: c7 45 90 00 00 00 00    mov DWORD PTR [rbp-0x70],0x0
    79a2: 48 89 85 28 ff ff ff    mov QWORD PTR [rbp-0xd8],rax
    79a9: 44 89 f0                mov eax,r14d
    79ac: 45 89 ee                mov r14d,r13d
    79af: 41 89 c5                mov r13d,eax
    79b2: 48 8b 85 28 ff ff ff    mov rax,QWORD PTR [rbp-0xd8]
    79b9: 48 03 45 88             add rax,QWORD PTR [rbp-0x78]
    79bd: 48 c7 85 78 ff ff ff 00 00 00 00 mov QWORD PTR [rbp-0x88],0x0
    79c8: c7 45 84 00 00 00 00    mov DWORD PTR [rbp-0x7c],0x0
    79cf: c7 45 94 00 00 00 00    mov DWORD PTR [rbp-0x6c],0x0
    79d6: 44 8b 95 70 ff ff ff    mov r10d,DWORD PTR [rbp-0x90]
    79dd: 48 89 45 b0             mov QWORD PTR [rbp-0x50],rax
    79e1: 48 63 45 80             movsxd rax,DWORD PTR [rbp-0x80]
    79e5: 48 2b 85 68 ff ff ff    sub rax,QWORD PTR [rbp-0x98]
    79ec: 48 0f af 85 60 ff ff ff imul rax,QWORD PTR [rbp-0xa0]
    79f4: 48 89 85 40 ff ff ff    mov QWORD PTR [rbp-0xc0],rax
    79fb: 8b 85 3c ff ff ff       mov eax,DWORD PTR [rbp-0xc4]
    7a01: 89 45 d0                mov DWORD PTR [rbp-0x30],eax
    7a04: 48 8b 45 88             mov rax,QWORD PTR [rbp-0x78]
    7a08: 48 8b 9d 78 ff ff ff    mov rbx,QWORD PTR [rbp-0x88]
    7a0f: 4c 8d 04 98             lea r8,[rax+rbx*4]
    7a13: 48 8b 45 28             mov rax,QWORD PTR [rbp+0x28]
    7a17: 48 8b 5d 18             mov rbx,QWORD PTR [rbp+0x18]
    7a1b: 48 89 45 b8             mov QWORD PTR [rbp-0x48],rax
    7a1f: 48 63 45 84             movsxd rax,DWORD PTR [rbp-0x7c]
    7a23: 48 2b 85 68 ff ff ff    sub rax,QWORD PTR [rbp-0x98]
    7a2a: 48 0f af 85 50 ff ff ff imul rax,QWORD PTR [rbp-0xb0]
    7a32: 48 03 85 40 ff ff ff    add rax,QWORD PTR [rbp-0xc0]
    7a39: 48 8d 04 83             lea rax,[rbx+rax*4]
    7a3d: 48 89 45 a0             mov QWORD PTR [rbp-0x60],rax
    7a41: 66 66 2e 0f 1f 84 00 00 00 00 00 data16 nop WORD PTR cs:[rax+rax*1+0x0]
    7a4c: 0f 1f 40 00             nop DWORD PTR [rax+0x0]
    7a50: 8b 45 a8                mov eax,DWORD PTR [rbp-0x58]
    7a53: 41 c7 00 00 00 00 00    mov DWORD PTR [r8],0x0
    7a5a: 45 31 db                xor r11d,r11d
    7a5d: 48 8b 5d a0             mov rbx,QWORD PTR [rbp-0x60]
    7a61: 44 8b 4d ac             mov r9d,DWORD PTR [rbp-0x54]
    7a65: 85 c0                   test eax,eax
    7a67: 0f 8e 98 00 00 00       jle 7b05 <convolution2D+0x285>
    7a6d: 0f 1f 00                nop DWORD PTR [rax]
    7a70: 45 85 c9                test r9d,r9d
    7a73: 78 7b                   js 7af0 <convolution2D+0x270>
    7a75: 45 39 ce                cmp r14d,r9d
    7a78: 7e 76                   jle 7af0 <convolution2D+0x270>
    7a7a: 48 8b 45 b8             mov rax,QWORD PTR [rbp-0x48]
    7a7e: 8b 55 d0                mov edx,DWORD PTR [rbp-0x30]
    7a81: 48 89 de                mov rsi,rbx
    7a84: 4a 8d 3c 98             lea rdi,[rax+r11*4]
    7a88: eb 13                   jmp 7a9d <convolution2D+0x21d>
    7a8a: 66 0f 1f 44 00 00       nop WORD PTR [rax+rax*1+0x0]
    7a90: ff c2                   inc edx
    7a92: 4c 01 e7                add rdi,r12
    7a95: 4c 01 e6                add rsi,r12
    7a98: 44 39 d2                cmp edx,r10d
    7a9b: 74 53                   je 7af0 <convolution2D+0x270>
    7a9d: 85 d2                   test edx,edx
    7a9f: 78 ef                   js 7a90 <convolution2D+0x210>
    7aa1: 41 39 d6                cmp r14d,edx
    7aa4: 7e ea                   jle 7a90 <convolution2D+0x210>
    7aa6: 45 85 ed                test r13d,r13d
    7aa9: 7e e5                   jle 7a90 <convolution2D+0x210>
    7aab: c4 c1 7a 10 08          vmovss xmm1,DWORD PTR [r8]
    7ab0: 31 c0                   xor eax,eax
    7ab2: 66 66 2e 0f 1f 84 00 00 00 00 00 data16 nop WORD PTR cs:[rax+rax*1+0x0]
    7abd: 0f 1f 00                nop DWORD PTR [rax]
    7ac0: c5 fa 10 04 87          vmovss xmm0,DWORD PTR [rdi+rax*4]
    7ac5: 48 89 c1                mov rcx,rax
    7ac8: c5 fa 59 04 86          vmulss xmm0,xmm0,DWORD PTR [rsi+rax*4]
    7acd: 48 ff c0                inc rax
    7ad0: c5 f2 58 c8             vaddss xmm1,xmm1,xmm0
    7ad4: c4 c1 7a 11 08          vmovss DWORD PTR [r8],xmm1
    7ad9: 49 39 cf                cmp r15,rcx
    7adc: 75 e2                   jne 7ac0 <convolution2D+0x240>
    7ade: ff c2                   inc edx
    7ae0: 4c 01 e7                add rdi,r12
    7ae3: 4c 01 e6                add rsi,r12
    7ae6: 44 39 d2                cmp edx,r10d
    7ae9: 75 b2                   jne 7a9d <convolution2D+0x21d>
    7aeb: 0f 1f 44 00 00          nop DWORD PTR [rax+rax*1+0x0]
    7af0: 4c 03 5d c8             add r11,QWORD PTR [rbp-0x38]
    7af4: 48 03 5d c0             add rbx,QWORD PTR [rbp-0x40]
    7af8: 41 ff c1                inc r9d
    7afb: 44 3b 4d d4             cmp r9d,DWORD PTR [rbp-0x2c]
    7aff: 0f 85 6b ff ff ff       jne 7a70 <convolution2D+0x1f0>
    7b05: 48 8b 5d 98             mov rbx,QWORD PTR [rbp-0x68]
    7b09: 49 83 c0 04             add r8,0x4
    7b0d: 48 01 5d b8             add QWORD PTR [rbp-0x48],rbx
    7b11: 4c 3b 45 b0             cmp r8,QWORD PTR [rbp-0x50]
    7b15: 0f 85 35 ff ff ff       jne 7a50 <convolution2D+0x1d0>
    7b1b: 8b 9d 74 ff ff ff       mov ebx,DWORD PTR [rbp-0x8c]
    7b21: 8b 45 94                mov eax,DWORD PTR [rbp-0x6c]
    7b24: 48 8b 8d 48 ff ff ff    mov rcx,QWORD PTR [rbp-0xb8]
    7b2b: 01 5d d0                add DWORD PTR [rbp-0x30],ebx
    7b2e: 48 01 4d b0             add QWORD PTR [rbp-0x50],rcx
    7b32: 01 5d 84                add DWORD PTR [rbp-0x7c],ebx
    7b35: 48 8b 8d 58 ff ff ff    mov rcx,QWORD PTR [rbp-0xa8]
    7b3c: 41 01 da                add r10d,ebx
    7b3f: 48 01 8d 78 ff ff ff    add QWORD PTR [rbp-0x88],rcx
    7b46: ff c0                   inc eax
    7b48: 39 85 38 ff ff ff       cmp DWORD PTR [rbp-0xc8],eax
    7b4e: 74 08                   je 7b58 <convolution2D+0x2d8>
    7b50: 89 45 94                mov DWORD PTR [rbp-0x6c],eax
    7b53: e9 ac fe ff ff          jmp 7a04 <convolution2D+0x184>
    7b58: 8b 4d 90                mov ecx,DWORD PTR [rbp-0x70]
    7b5b: 48 8b b5 30 ff ff ff    mov rsi,QWORD PTR [rbp-0xd0]
    7b62: 01 5d d4                add DWORD PTR [rbp-0x2c],ebx
    7b65: 01 5d ac                add DWORD PTR [rbp-0x54],ebx
    7b68: 01 5d 80                add DWORD PTR [rbp-0x80],ebx
    7b6b: 48 01 75 88             add QWORD PTR [rbp-0x78],rsi
    7b6f: 8d 41 01                lea eax,[rcx+0x1]
    7b72: 39 4d 94                cmp DWORD PTR [rbp-0x6c],ecx
    7b75: 74 08                   je 7b7f <convolution2D+0x2ff>
    7b77: 89 45 90                mov DWORD PTR [rbp-0x70],eax
    7b7a: e9 33 fe ff ff          jmp 79b2 <convolution2D+0x132>
    7b7f: 48 81 c4 b0 00 00 00    add rsp,0xb0
    7b86: 5b                      pop rbx
    7b87: 41 5c                   pop r12
    7b89: 41 5d                   pop r13
    7b8b: 41 5e                   pop r14
    7b8d: 41 5f                   pop r15
    7b8f: 5d                      pop rbp
    7b90: c3                      ret
    7b91: 66 66 2e 0f 1f 84 00 00 00 00 00 data16 nop WORD PTR cs:[rax+rax*1+0x0]
    7b9c: 0f 1f 40 00             nop DWORD PTR [rax+0x0]

float (*arr)[x][y] = calloc(z, sizeof *arr);

==430300== Cachegrind, a cache and branch-prediction profiler
==430300== Copyright (C) 2002-2017, and GNU GPL'd, by Nicholas Nethercote et al.
==430300== Using Valgrind-3.15.0 and LibVEX; rerun with -h for copyright info
==430300== Command: ./EmbeddedNet test 1
==430300== Parent PID: 170008
==430300==
--430300-- warning: L3 cache found, using its data for the LL simulation.
==430300==
==430300== I refs:        6,369,594,192
==430300== I1  misses:            4,271
==430300== LLi misses:            2,442
==430300== I1  miss rate:          0.00%
==430300== LLi miss rate:          0.00%
==430300==
==430300== D refs:        2,064,233,110  (1,359,003,131 rd + 705,229,979 wr)
==430300== D1  misses:       34,476,969  (   19,010,839 rd +  15,466,130 wr)
==430300== LLd misses:        5,311,277  (    1,603,955 rd +   3,707,322 wr)
==430300== D1  miss rate:           1.7% (          1.4%   +         2.2%  )
==430300== LLd miss rate:           0.3% (          0.1%   +         0.5%  )
==430300==
==430300== LL refs:          34,481,240  (   19,015,110 rd +  15,466,130 wr)
==430300== LL misses:         5,313,719  (    1,606,397 rd +   3,707,322 wr)
==430300== LL miss rate:            0.1% (          0.0%   +         0.5%  )

float fma_scl(float a, float b, float c) {
    return a * b + c;
}

fma_scl:
        vfmadd132ss     xmm0, xmm2, xmm1
        ret
You can see the whole calculation done with a single instruction, vfmadd132ss (note that the compiler only emits FMA instructions when they are enabled, e.g. with -mfma or -march=haswell). Next, we'll define a 256-bit vector type using GCC's vector extension.
typedef float Vec __attribute__((vector_size(32), aligned(32)));
Here's a vectorized FMA function, operating on eight floats at once.
Vec fma_vec(Vec a, Vec b, Vec c) {
    return a * b + c;
}

fma_vec:
        vfmadd132ps     ymm0, ymm2, ymm1
        ret
This vector function is semantically the same as a scalar loop over the eight elements, but everything is done in a single instruction.
typedef struct {
    float f[8];
} Vec_;

Vec_ fma_vec_(Vec_ a, Vec_ b, Vec_ c) {
    Vec_ r;
    for (unsigned i = 0; i < 8; ++i) {
        r.f[i] = a.f[i] * b.f[i] + c.f[i];
    }
    return r;
}

I think you now get the idea of how vectorization makes code run faster. Here is a simple function that's somewhat similar to your inner loop.
void loopadd_scl(float *restrict a, float *restrict b, float *restrict c, unsigned n) {
    for (unsigned i = 0; i < n; ++i) {
        a[i] = fma_scl(b[i], c[i], a[i]);
    }
}

When you compile through GCC with -O3 -march=znver2, this is the output. It's huge. I'll explain below.
246 return a * b + c;
247}
248
249fma_scl:
250 vfmadd132ss xmm0, xmm2, xmm1
251 ret
252typedef float Vec __attribute__((vector_size(32), aligned(32)));
253Vec fma_vec(Vec a, Vec b, Vec c) {
254 return a * b + c;
255}
256
257fma_vec:
258 vfmadd132ps ymm0, ymm2, ymm1
259 ret
260typedef struct {
261 float f[8];
262} Vec_;
263
264Vec_ fma_vec_(Vec_ a, Vec_ b, Vec_ c) {
265 Vec_ r;
266 for (unsigned i = 0; i < 8; ++i) {
267 r.f[i] = a.f[i] * b.f[i] + c.f[i];
268 }
269 return r;
270}
271void loopadd_scl(float *restrict a, float *restrict b, float *restrict c, unsigned n) {
272 for (unsigned i = 0; i < n; ++i) {
273 a[i] = fma_scl(b[i], c[i], a[i]);
274 }
275}
276loopadd_scl:
277 test ecx, ecx
278 je .L25
279 lea eax, [rcx-1]
280 cmp eax, 6
281 jbe .L13
282 mov r8d, ecx
283 xor eax, eax
284 shr r8d, 3
285 sal r8, 5
286.L9:
287 vmovups ymm1, YMMWORD PTR [rdi+rax]
288 vmovups ymm0, YMMWORD PTR [rdx+rax]
289 vfmadd132ps ymm0, ymm1, YMMWORD PTR [rsi+rax]
290 vmovups YMMWORD PTR [rdi+rax], ymm0
291 add rax, 32
292 cmp r8, rax
293 jne .L9
294 mov eax, ecx
295 and eax, -8
296 test cl, 7
297 je .L26
298 vzeroupper
299.L8:
300 mov r9d, ecx
301 sub r9d, eax
302 lea r8d, [r9-1]
303 cmp r8d, 2
304 jbe .L11
305 mov r8d, eax
306 sal r8, 2
307 lea r10, [rdi+r8]
308 vmovups xmm0, XMMWORD PTR [rdx+r8]
309 vmovups xmm2, XMMWORD PTR [r10]
310 vfmadd132ps xmm0, xmm2, XMMWORD PTR [rsi+r8]
311 mov r8d, r9d
312 and r8d, -4
313 add eax, r8d
314 and r9d, 3
315 vmovups XMMWORD PTR [r10], xmm0
316 je .L25
317.L11:
318 mov r8d, eax
319 sal r8, 2
320 lea r9, [rdi+r8]
321 vmovss xmm0, DWORD PTR [rdx+r8]
322 vmovss xmm3, DWORD PTR [r9]
323 vfmadd132ss xmm0, xmm3, DWORD PTR [rsi+r8]
324 lea r8d, [rax+1]
325 vmovss DWORD PTR [r9], xmm0
326 cmp r8d, ecx
327 jnb .L25
328 sal r8, 2
329 add eax, 2
330 lea r9, [rdi+r8]
331 vmovss xmm0, DWORD PTR [rsi+r8]
332 vmovss xmm4, DWORD PTR [r9]
333 vfmadd132ss xmm0, xmm4, DWORD PTR [rdx+r8]
334 vmovss DWORD PTR [r9], xmm0
335 cmp eax, ecx
336 jnb .L25
337 sal rax, 2
338 add rdi, rax
339 vmovss xmm0, DWORD PTR [rdx+rax]
340 vmovss xmm5, DWORD PTR [rdi]
341 vfmadd132ss xmm0, xmm5, DWORD PTR [rsi+rax]
342 vmovss DWORD PTR [rdi], xmm0
343.L25:
344 ret
345.L26:
346 vzeroupper
347 ret
348.L13:
349 xor eax, eax
350 jmp .L8
Basically GCC doesn't know anything about n, so it splits the loop into three cases: n / 8 > 1, n / 4 > 1, and n < 4. It first handles the n / 8 > 1 part using 256-bit ymm registers. Then it handles the n / 4 > 1 part with 128-bit xmm registers. Finally, it handles n < 4 with scalar ss instructions. You can avoid this mess if you know n is a multiple of 8. I got a bit lazy here, so have a look at the code and the compiler output below and compare it with the above; I think you're smart enough to get the idea.
351void loopadd_vec(Vec *restrict a, Vec *restrict b, Vec *restrict c, unsigned n) {
352 n /= 8;
353 for (unsigned i = 0; i < n; ++i) {
354 a[i] = fma_vec(b[i], c[i], a[i]);
355 }
356}
357
358loopadd_vec:
359 shr ecx, 3
360 je .L34
361 mov ecx, ecx
362 xor eax, eax
363 sal rcx, 5
364.L29:
365 vmovaps ymm1, YMMWORD PTR [rdi+rax]
366 vmovaps ymm0, YMMWORD PTR [rdx+rax]
367 vfmadd132ps ymm0, ymm1, YMMWORD PTR [rsi+rax]
368 vmovaps YMMWORD PTR [rdi+rax], ymm0
369 add rax, 32
370 cmp rcx, rax
371 jne .L29
372 vzeroupper
373.L34:
374 ret
375}
376
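To make the "multiple of 8" idea concrete without the vector typedef, here is a sketch of my own (not from the original post): the helper name loopadd_mul8 is hypothetical, and the __builtin_unreachable() guard is the standard GCC/Clang idiom for promising the compiler a property of n, which lets it keep just the ymm loop and drop the xmm and scalar tails.

```c
/* Sketch: promise GCC that n is a multiple of 8 so the
 * vectorizer can drop the xmm and scalar remainder cases. */
void loopadd_mul8(float *restrict a, const float *restrict b,
                  const float *restrict c, unsigned n) {
    if (n % 8 != 0)
        __builtin_unreachable(); /* caller guarantees n % 8 == 0 */
    for (unsigned i = 0; i < n; ++i)
        a[i] = b[i] * c[i] + a[i];
}
```

With gcc -O3 -mavx2 -mfma this should compile to a loop much like loopadd_vec above, though the exact output depends on the GCC version.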
QUESTION
Write custom metadata to Parquet file in Julia
Asked 2022-Mar-05 at 18:36
I am currently storing the output (a Julia DataFrame) of my Julia simulation in a Parquet file using Parquet.jl. I would also like to save some of the simulation parameters (e.g. a list of (byte-)strings) to that same output file.
Preferably, these parameters are different for each column as each column is the result of different starting conditions of my code. However, I could also work with a global parameter list and then untangle it afterwards by indexing.
I have found a solution for Python using pyarrow: https://mungingdata.com/pyarrow/arbitrary-metadata-parquet-table/.
Do you know a way to do this in Julia?
ANSWER
Answered 2022-Mar-05 at 18:36
It's not quite done yet, and it's not registered, but my rewrite of the Julia parquet package, Parquet2.jl, does support both custom file metadata and individual column metadata (the keyword arguments metadata and column_metadata in Parquet2.writefile).
I haven't gotten to documentation for writing yet, but if you are feeling adventurous you can give it a shot. I do expect to finish up this package and register it within the next couple of weeks. I don't have unit tests in for writing yet, so of course, if you try it and have problems, please open an issue.
It's probably also worth mentioning that the main use case I recommend for parquet is when you must have parquet for compatibility reasons. Most of the time, Julia users are probably better off with Arrow.jl, as the format has a number of advantages over parquet for most use cases; please see my FAQ answer on this. Of course, the reason I undertook writing the package is that parquet is arguably the only ubiquitous binary format in the "big data" world, so a robust writer is desperately needed.
QUESTION
Floating point inconsistencies after upgrading libc/libm
Asked 2022-Mar-01 at 13:31
I recently upgraded my OS from Debian 9 to Debian 11. I have a bunch of servers running a simulation; one subset produces a certain result and another subset produces a different result. This did not happen with Debian 9. I have produced a minimal failing example:
#include <stdio.h>
#include <math.h>

int main()
{
    double lp = 11.525775909423828;
    double ap = exp(lp);

    printf("%.14f %.14f\n", lp, ap);

    return 0;
}
The lp value prints the same answer on every machine, but I get two different answers for ap: 101293.33662281210127 and 101293.33662281208672.
The code was compiled with "gcc fptest.c -lm -O0". The '-O0' was added just to ensure optimizations weren't an issue; it behaves the same without this option.
The libraries linked in the Debian 11 version are libm-2.31.so and libc-2.31.so.
The libraries linked in the (working) Debian 9 version are libm-2.24.so and libc-2.24.so.
The servers are all running different CPUs, so it's hard to say much about that. But I get different results between a Xeon E5-2695 v2 and a Xeon E5-2695 v3, for example.
Amongst all the processors I have, I only see one of these two results on Debian 11, and when running on Debian 9 I consistently only get one result.
This feels like a bug in libm-2.31 and/or libc-2.31 to me. I have zero experience with this sort of thing. Could someone please explain if what I am seeing is expected? Does it look like a bug? Anything I can do about it? etc.
I also tried compiling with clang and get the exact same problem.
Also note that the binary compiled on Debian 9 runs on Debian 11 and produces the same results/problem as the Debian 11 binary, adding further weight to my suspicion that this is library-related (I cannot run the Debian 11 binary on Debian 9).
Update
I just read this post, which was helpful. So I'm happy that different architectures may give different results for the exp() function. But all my processors are x86_64 and some kind of intel xeon-xxxx. I can't understand why the exact same binary with the exact same libraries gives different results on different processors.
As suggested in that post I printed the values using %a. The two answers differ only by the LSB. If I use expl() I get the same answer on all machines.
An explanation of why I'm seeing differences, and if this is expected, would be nice. Any compiler flags that ensure consistency would also be nice.
ANSWER
Answered 2022-Feb-28 at 13:18
It's not a bug. Floating-point arithmetic has rounding errors. For the single arithmetic operations + - * / and sqrt, IEEE 754 requires correctly rounded results, so they should be the same everywhere, but for the floating-point functions in libm you can't really expect that.
In this case it seems the compiler itself produced the results at compile time. The processor you use is unlikely to make a difference. And we don’t know whether the new version is more or less precise than the old one.
QUESTION
Missing types, namespaces, directives, and assembly references
Asked 2022-Feb-27 at 10:24I use VS Code
for C#
and Unity3D
and TypeScript
and Angular
and Python
programming, so I have pretty much every required extension, including the .NET Framework
and Core
as well as the Quantum Development Kit (QDK)
plus the Q# Interoperability Tools
and also C#
and Python
extensions for VS Code
.
I have devised the following steps to create my first quantum Hello World based on a few tutorials:
$ dotnet --version
$ dotnet --list-sdks
$ dotnet --list-runtimes

$ dotnet new globaljson # Create a "global.json" file in your current directory folder.
$ dotnet new globaljson --sdk-version 3.1.416 --force # Change the current SDK version of your project.

$ dotnet new -i Microsoft.DotNet.Web.Spa.ProjectTemplates
$ dotnet new -i "Microsoft.Quantum.ProjectTemplates::0.2-*"

# First, create a Q# application and a .NET host, and then make a call to Q# from the host.
# Create a project for your Q# library and for the .NET host that will call
# into the operations and functions defined in your Q# library.

$ dotnet new classlib -lang Q# -o quantum # Create a new Q# class-library project.
$ dotnet new console -lang C# -o host # Create a new C# console project.

$ cd host # Navigate into the C# host directory.
$ dotnet add reference ../quantum/quantum.csproj # Add your Q# class-library project as a reference to your C# console project.

$ cd .. # Exit the C# host directory.
$ dotnet new sln -n quantum-dotnet # Create a new solution for both projects.
$ dotnet sln quantum-dotnet.sln add ./quantum/quantum.csproj # Add the Q# class-library project to the solution.
$ dotnet sln quantum-dotnet.sln add ./host/host.csproj # Add the C# host project to the solution.

# The -o or --output command specifies the location to place the generated output.
However, as soon as this is all done, I get the following load of errors:
What am I doing wrong? What am I missing?
Here is some additional info:
$ dotnet --list-sdks
3.1.416 [C:\Program Files\dotnet\sdk]
5.0.404 [C:\Program Files\dotnet\sdk]

$ dotnet --version
3.1.416

$ dotnet restore
  Determining projects to restore...
  Restored C:\Users\Muhy\Dropbox\Q\Q#-programs\Qtest\quantum\quantum.csproj (in 484 ms).
  Restored C:\Users\Muhy\Dropbox\Q\Q#-programs\Qtest\host\host.csproj (in 782 ms).
When I run dotnet build, I get the following:
Microsoft (R) Build Engine version 16.7.2+b60ddb6f4 for .NET
Copyright (C) Microsoft Corporation. All rights reserved.

  Determining projects to restore...
  All projects are up-to-date for restore.
It was not possible to find any compatible framework version
The framework 'Microsoft.NETCore.App', version '2.0.0' was not found.
  - The following frameworks were found:
      3.1.22 at [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
      5.0.13 at [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
You can resolve the problem by installing the specified framework and/or SDK.
The specified framework can be found at: - https://aka.ms/dotnet-core-applaunch?framework=Microsoft.NETCore.App&framework_version=2.0.0&arch=x64&rid=win10-x64 C:\Users\Muhy.nuget\packages\microsoft.quantum.development.kit\0.2.1802.1603-preview\build\Microsoft.Quantum.Development.Kit.targets(17,5): error MSB3073: The command "dotnet C:\Users\Muhy.nuget\packages\microsoft.quantum.development.kit\0.2.1802.1603-preview\build../tools/qsc/qsc.dll --input "Operation.qs" --references "C:\Users\Muhy.nuget\packages\microsoft.quantum.canon\0.2.1802.1603-preview\lib\netstandard2.0\Microsoft.Quantum.Canon.dll" "C:\Users\Muhy.nuget\packages\microsoft.quantum.development.kit\0.2.1802.1603-preview\lib\netstandard2.0\Microsoft.Quantum.MetaData.dll" "C:\Users\Muhy.nuget\packages\microsoft.quantum.development.kit\0.2.1802.1603-preview\lib\netstandard2.0\Microsoft.Quantum.Primitives.dll" "C:\Users\Muhy.nuget\packages\microsoft.quantum.development.kit\0.2.1802.1603-preview\lib\netstandard2.0\Microsoft.Quantum.Simulation.Common.dll" "C:\Users\Muhy.nuget\packages\microsoft.quantum.development.kit\0.2.1802.1603-preview\lib\netstandard2.0\Microsoft.Quantum.Simulation.Core.dll" "C:\Users\Muhy.nuget\packages\microsoft.quantum.development.kit\0.2.1802.1603-preview\lib\netstandard2.0\Microsoft.Quantum.Simulation.QCTraceSimulatorRuntime.dll" "C:\Users\Muhy.nuget\packages\microsoft.quantum.development.kit\0.2.1802.1603-preview\lib\netstandard2.0\Microsoft.Quantum.Simulation.Simulators.dll" --outputpath obj\qsharp\src" exited with code -2147450730. [C:\Users\Muhy\Dropbox\Q\Q#-programs\Qtest\quantum\quantum.csproj]
Build FAILED.
C:\Users\Muhy.nuget\packages\microsoft.quantum.development.kit\0.2.1802.1603-preview\build\Microsoft.Quantum.Development.Kit.targets(17,5): error MSB3073: The command "dotnet C:\Users\Muhy.nuget\packages\microsoft.quantum.development.kit\0.2.1802.1603-preview\build../tools/qsc/qsc.dll --input "Operation.qs" --references "C:\Users\Muhy.nuget\packages\microsoft.quantum.canon\0.2.1802.1603-preview\lib\netstandard2.0\Microsoft.Quantum.Canon.dll" "C:\Users\Muhy.nuget\packages\microsoft.quantum.development.kit\0.2.1802.1603-preview\lib\netstandard2.0\Microsoft.Quantum.MetaData.dll" "C:\Users\Muhy.nuget\packages\microsoft.quantum.development.kit\0.2.1802.1603-preview\lib\netstandard2.0\Microsoft.Quantum.Primitives.dll" "C:\Users\Muhy.nuget\packages\microsoft.quantum.development.kit\0.2.1802.1603-preview\lib\netstandard2.0\Microsoft.Quantum.Simulation.Common.dll" "C:\Users\Muhy.nuget\packages\microsoft.quantum.development.kit\0.2.1802.1603-preview\lib\netstandard2.0\Microsoft.Quantum.Simulation.Core.dll" "C:\Users\Muhy.nuget\packages\microsoft.quantum.development.kit\0.2.1802.1603-preview\lib\netstandard2.0\Microsoft.Quantum.Simulation.QCTraceSimulatorRuntime.dll" "C:\Users\Muhy.nuget\packages\microsoft.quantum.development.kit\0.2.1802.1603-preview\lib\netstandard2.0\Microsoft.Quantum.Simulation.Simulators.dll" --outputpath obj\qsharp\src" exited with code -2147450730. [C:\Users\Muhy\Dropbox\Q\Q#-programs\Qtest\quantum\quantum.csproj] 0 Warning(s) 1 Error(s)
When I also build the project (dotnet build quantum-dotnet.sln and dotnet build ./host/host.csproj), I get the same errors.
ANSWER
Answered 2022-Feb-27 at 10:24
With help from a user on another forum, it turns out the problem was this command:
$ dotnet new -i "Microsoft.Quantum.ProjectTemplates::0.2-*"
which installed version 0.2.1802.1603-preview
which is quite old. The latest version is 0.22.187631
.
This did the trick:

dotnet new -i "Microsoft.Quantum.ProjectTemplates"
Followed by:

dotnet new --update-apply
to update all the project templates. After this, re-creating the projects fixed all the errors.
Now, I believe the following set of instructions should work for every beginner:
# Install the .NET Framework and Core as well as the Quantum Development Kit (QDK)
# plus the Q# Interoperability Tools and also the C# and Python extensions for VS Code.

$ dotnet --version
$ dotnet --list-sdks
$ dotnet --list-runtimes

$ dotnet new globaljson # Create a "global.json" file in the current directory.
$ dotnet new globaljson --sdk-version 3.1.416 --force # Pin the SDK version of your project to .NET 3.1.416.

$ dotnet new -i Microsoft.DotNet.Web.Spa.ProjectTemplates
$ dotnet new -i "Microsoft.Quantum.ProjectTemplates"
$ dotnet new --update-apply

# First, create a Q# application and a .NET host, and then make a call to Q# from the host.
# Create a project for your Q# library and for the .NET host that will call
# into the operations and functions defined in your Q# library.

$ dotnet new classlib -lang Q# -o quantum # Create a new Q# class-library project.
$ dotnet new console -lang C# -o host # Create a new C# console project.
# The -o or --output option specifies the location for the generated output.

$ cd host # Navigate into the C# host directory.
$ dotnet add reference ../quantum/quantum.csproj # Add your Q# class-library project as a reference to your C# console project.

$ cd .. # Exit the C# host directory.
$ dotnet new sln -n quantum-dotnet # Create a new solution for both projects.
$ dotnet sln quantum-dotnet.sln add ./quantum/quantum.csproj # Add the Q# class-library project to the solution.
$ dotnet sln quantum-dotnet.sln add ./host/host.csproj # Add the C# host project to the solution.

$ cd host
$ dotnet build
$ dotnet run
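For context, the dotnet new globaljson --sdk-version 3.1.416 step pins which SDK every dotnet command in that directory tree will use. The generated file looks roughly like this (a minimal sketch; the exact contents of your generated file may differ, e.g. it may also include a rollForward policy):

```json
{
  "sdk": {
    "version": "3.1.416"
  }
}
```

Deleting or regenerating this file is how you switch the project back to a newer SDK later.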
QUESTION
Iterating over an array of class objects VS a class object containing arrays
Asked 2022-Feb-13 at 16:58

I want to create a program for multi-agent simulation and I am wondering whether I should use NumPy or Numba to accelerate the calculation. Basically, I need a class to store the state of the agents, and I will have over 1,000 instances of this class. In each time step, I will perform different calculations for all instances. There are two approaches that I am thinking of:
Numpy vectorization:
Having one class with multiple NumPy arrays storing the states of all agents. Hence, I will only have one class instance at any time during the simulation. With this approach, I can simply use NumPy vectorization to perform the calculations. However, this makes running functions for specific agents difficult, and I would need an extra class to store the index of each agent.
class agent:
    def __init__(self):
        self.A = np.zeros(1000)
        self.B = np.zeros(1000)
    def timestep(self):
        return self.A + self.B
Numba jitclass:
Using the Numba jitclass decorator to compile the code. With this approach, I can write more standard OOP-style code, as one class instance represents one agent. However, I am not sure about the performance of looping through many jitclass instances (say 1,000 or more).
@jitclass
class agent:
    def __init__(self):
        self.A = 0
        self.B = 0
    def timestep(self):
        return self.A + self.B
I would like to know which approach would be faster.
ANSWER
Answered 2022-Feb-13 at 16:53

This problem is known as "AoS vs. SoA", where AoS means array of structures and SoA means structure of arrays. You can find some information about this here. SoA is less user-friendly than AoS, but it is generally much more efficient. This is especially true when your code can benefit from SIMD instructions. When you deal with many big arrays (e.g. >= 8) or perform many scalar random memory accesses, neither AoS nor SoA is efficient. In that case, the best solution is an array of structures of small arrays (AoSoA), which makes better use of CPU caches while still being able to benefit from SIMD. However, AoSoA is tedious, as it significantly complicates the code of non-trivial algorithms. Note that the number of fields accessed also matters in the choice of the best layout (e.g. if only one field is frequently read, then SoA is perfect).
OOP is generally rather bad when it comes to performance, partly because of this. Another reason is the frequent use of virtual calls and polymorphism even when they are not needed. OOP code tends to cause a lot of cache misses, and optimizing a large code base that makes heavy use of OOP is often a mess (which sometimes results in rewriting a big part of the target software, or in the code being left very slow). To address this problem, data-oriented design can be used. This approach has been used successfully to drastically speed up large code bases, from video games (e.g. Unity) to web browser renderers (e.g. Chrome) and even relational databases. In high-performance computing (HPC), OOP is often barely used. Data-oriented design is closely related to the use of SoA rather than AoS, so as to better use the cache and benefit from SIMD. For more information, please read this related post.
To conclude, I advise you to use the first approach (SoA) in your case, since you only have two arrays and they are not huge.
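To make the two layouts concrete, here is a minimal self-contained sketch (not from the original post; class names are illustrative): the SoA version updates all agents with one vectorized NumPy operation, while the AoS version pays for one interpreted method call per agent.

```python
import numpy as np

N = 1000  # number of agents

# SoA: one object holds a NumPy array per field; a timestep is a single
# vectorized operation over all agents at once.
class AgentsSoA:
    def __init__(self, n):
        self.A = np.zeros(n)
        self.B = np.zeros(n)

    def timestep(self):
        return self.A + self.B  # one C-level loop for all agents

# AoS: one Python object per agent; a timestep requires looping over
# every instance from the interpreter.
class AgentAoS:
    def __init__(self):
        self.A = 0.0
        self.B = 0.0

    def timestep(self):
        return self.A + self.B

soa = AgentsSoA(N)
soa.A[:] = 1.0
aos = [AgentAoS() for _ in range(N)]
for a in aos:
    a.A = 1.0

# Both layouts produce the same per-agent results; the SoA version just
# does the work in one call instead of N.
assert np.allclose(soa.timestep(), [a.timestep() for a in aos])
```

Timing the two with timeit on your actual per-step computation is the quickest way to see the gap grow with the number of agents.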
QUESTION
R: Trying to recreate mean-median difference gerrymander tests
Asked 2022-Feb-09 at 23:58

I'm trying to recreate the mean-median difference test described here: Archive of NYT article. I've downloaded House data from MIT's Election Lab, and pared it down to the 2012 Pennsylvania race. Using dplyr, I whittled it down to the relevant columns, and it now looks something like this:
Rows: 42
Columns: 5
$ district <dbl> 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10, 11, 1~
$ party <chr> "REPUBLICAN", "DEMOCRAT", "INDEPENDENT", "REPUBLICAN", "DEMOCRAT", "DEMOCRAT", ~
$ candidatevotes <dbl> 41708, 235394, 4829, 33381, 318176, 123933, 165826, 12755, 6210, 181603, 11524,~
$ totalvotes <dbl> 277102, 277102, 356386, 356386, 356386, 302514, 302514, 302514, 303980, 303980,~
$ pct_votes <dbl> 15.051497, 84.948503, 1.354991, 9.366530, 89.278479, 40.967691, 54.815975, 4.21~
Each row represents a district candidate. The final column was created using mutate, and represents the percentage of the vote in that district that went to the candidate. Now, I can find the median and mean Democratic vote with
PA2012_house_dem <- PA2012_house %>% filter(party == "DEMOCRAT")
obs_median <- median(PA2012_house_dem$pct_votes)
obs_mean <- mean(PA2012_house_dem$pct_votes)
obs_median - obs_mean
What's giving me fits is calculating the "zone of chance". What I'd like to do is some kind of Monte Carlo simulation of taking each voter and randomly assigning them to a district, so that the number of voters in each district is unchanged, the number of total votes for each party is unchanged, but the proportion of Republican and Democratic (and other parties) in each district is random, as in a permutation test. The mean Democratic vote should be unchanged, but I can't figure out a good way to carry out this randomization so that I can calculate the median district's Democratic vote percentage.
Thanks in advance for your help!
Edit for clarification: I'd like to do the randomization, say, 10,000 times, and for each of those trials, calculate the median-mean difference. The result should then, ideally, be a vector or data frame with 10,000 rows, that I can then turn into a histogram or something.
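The randomization described above (drop every Democratic vote into a district with probability proportional to district size, keeping district sizes and the statewide total fixed) can be done with a single multinomial draw per trial instead of sampling ~2.8 million voters one at a time. Here is a sketch in Python rather than R, with made-up district totals standing in for the real totalvotes and candidatevotes columns:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the real data: total votes cast per district
# and the statewide Democratic vote total.
totalvotes = np.array([277102, 356386, 302514, 303980, 282465])
dem_total = 900_000

B = 10_000
p = totalvotes / totalvotes.sum()  # chance a given vote lands in each district
diffs = np.empty(B)
for b in range(B):
    # One multinomial draw assigns every Democratic vote to a district at
    # once; the draw always sums to dem_total, so party totals are preserved.
    dem_votes = rng.multinomial(dem_total, p)
    pct = dem_votes / totalvotes * 100
    diffs[b] = np.median(pct) - np.average(pct, weights=totalvotes)
```

The diffs vector holds one median-mean difference per trial, ready to turn into a histogram or quantiles; the same multinomial trick is available in R via rmultinom().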
EDIT 2 -- PARTIAL SOLUTION:
I have some code that runs, but it's not giving me a reasonable answer. Using dplyr, I've filtered out all but the DEMOCRAT votes, so that each row just gives me the Democrat vote share for a single district.
Rows: 18
Columns: 5
$ district <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18
$ party <chr> "DEMOCRAT", "DEMOCRAT", "DEMOCRAT", "DEMOCRAT", "DEMOCRAT", "DEMOCRAT", "DEMOCRAT", "DEMOCRAT", "DEMOCR~
$ candidatevotes <dbl> 235394, 318176, 123933, 104643, 104725, 143803, 143509, 152859, 105128, 94227, 118231, 163589, 209901, ~
$ totalvotes <dbl> 277102, 356386, 302514, 303980, 282465, 335528, 353451, 352238, 274305, 273790, 285198, 338941, 303819,~
$ pct_votes <dbl> 84.94850, 89.27848, 40.96769, 34.42430, 37.07539, 42.85872, 40.60223, 43.39651, 38.32522, 34.41579, 41.~
This is saved as PA2012_reduced_dem.
Now, here is my code:
require(mosaic) # for the tally() function
data <- PA2012_reduced_dem
B <- 100
samples_diff <- vector("numeric", B)
samples_mean <- vector("numeric", B)
samples_median <- vector("numeric", B)

for(samp in 1:B) {
  data_w_sample <- mutate(data, sample_vote = tally(sample(district, sum(candidatevotes), replace = T, prob = totalvotes)))
  data_w_sample <- mutate(data_w_sample, sample_vote_pct = (sample_vote / totalvotes * 100))
  mean_sample <- weighted.mean(data_w_sample$sample_vote_pct, w = data_w_sample$totalvotes)
  median_sample <- median(data_w_sample$sample_vote_pct)
  diff_mean_median <- mean_sample - median_sample
  samples_diff[samp] <- diff_mean_median
  samples_mean[samp] <- mean_sample
  samples_median[samp] <- median_sample
}

samples <- data.frame(samples_mean, samples_median, samples_diff)
The idea is that I'm randomly placing each Democrat voter in a district, weighted by the total number of votes per district. Since I have the total vote as a variable, I can compute the share of vote in each district that goes to the Democrat (I'm ignoring independent and other party votes).
Obviously, this is slow, because each trial is sampling for every single Democrat vote (roughly 2.8 million), so I'm only running 100 trials right now.
However, my Monte Carlo simulations are finding a very small "zone of chance" around the mean, the median is only about 0.05 percent above or below the mean. Again, I'm only running 100 trials, but I was expecting a wider zone of chance.
ANSWER
Answered 2022-Feb-09 at 23:58

I figured it out! Randomly placing voters in each district is not correct, and honestly it was pretty silly of me to do so. Instead, I had to use dplyr to create a data frame with the number of Democrat and Republican votes in each of the 435 House districts, one district per row. Then, I followed the advice on page 12 of this paper. I created samples of 18 districts drawn from this 435-row data frame, rejecting them if the mean vote share was more than 1 percent away from that of PA. The results have a much nicer 95% confidence interval that matches the results of the original article.
data <- house_2012_reduced
# created with dplyr, contains total and percentage of votes
# for Democrats and Republicans.
B <- 100000
del_districts <- 18 # 18 districts in PA
samples_diff <- vector("numeric", B)
samples_mean <- vector("numeric", B)
samples_median <- vector("numeric", B)

for(samp in 1:B) {
  sample_delegation <- sample_n(data, del_districts)
  sample_delegation_pct_dem_mean <- weighted.mean(sample_delegation$pct_dem_votes, w = sample_delegation$total_votes)
  sample_delegation_pct_dem_median <- median(sample_delegation$pct_dem_votes)
  if(near(mean_dem_pct_PA, sample_delegation_pct_dem_mean, 1)){
    samples_mean[samp] <- sample_delegation_pct_dem_mean
    samples_median[samp] <- sample_delegation_pct_dem_median
    samples_diff[samp] <- (sample_delegation_pct_dem_mean - sample_delegation_pct_dem_median)
  }
}

samples <- data.frame(samples_mean, samples_median, samples_diff)
samples <- filter_all(samples, any_vars(. != 0))
quantile(samples$samples_median, c(0.025, 0.975))
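The same rejection-sampling idea translates directly to other languages. Here is a Python sketch with synthetic district data standing in for the 435-row data frame and 50% standing in for mean_dem_pct_PA (both are made-up stand-ins, not the real election data):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-ins for the national 435-district data frame.
pool_pct = rng.uniform(20.0, 80.0, size=435)        # Dem vote share per district
pool_votes = rng.integers(200_000, 400_000, 435)    # total votes per district

pa_mean = 50.0      # stand-in for mean_dem_pct_PA
n_districts = 18    # PA's delegation size
B = 10_000

diffs = []
for _ in range(B):
    # Draw a hypothetical 18-district delegation without replacement.
    idx = rng.choice(435, size=n_districts, replace=False)
    m = np.average(pool_pct[idx], weights=pool_votes[idx])
    # Reject delegations whose weighted mean is more than 1 point from PA's.
    if abs(m - pa_mean) < 1.0:
        diffs.append(m - np.median(pool_pct[idx]))

diffs = np.array(diffs)
lo, hi = np.quantile(diffs, [0.025, 0.975])
```

Accepted trials accumulate in diffs, so the 95% interval (lo, hi) plays the role of the quantile() call in the R solution; no filtering of zero rows is needed because rejected trials are never stored.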
Community Discussions contain sources that include Stack Exchange Network