Explore all Compiler open source software, libraries, packages, source code, cloud functions and APIs.

Popular New Releases in Compiler

rust

Rust 1.60.0

emscripten

3.0.0

zig

0.7.1

numba

Version 0.55.1

kotlin-native

1.5.10

Popular Libraries in Compiler

rust

by rust-lang (language: rust)

65482 stars, license: NOASSERTION

Empowering everyone to build reliable and efficient software.

emscripten

by emscripten-core (language: c)

22125 stars, license: NOASSERTION

Emscripten: An LLVM-to-WebAssembly Compiler

zig

by ziglang (language: c)

8966 stars, license: MIT

General-purpose programming language and toolchain for maintaining robust, optimal, and reusable software.

numba

by numba (language: python)

7237 stars, license: BSD-2-Clause

NumPy aware dynamic Python compiler using LLVM

kotlin-native

by JetBrains (language: kotlin)

7071 stars, license: Apache-2.0

Kotlin/Native infrastructure

Nuitka

by Nuitka (language: python)

6578 stars, license: Apache-2.0

Nuitka is a Python compiler written in Python. It's fully compatible with Python 2.6, 2.7, 3.3, 3.4, 3.5, 3.6, 3.7, 3.8, 3.9, and 3.10. You feed it your Python app, it does a lot of clever things, and spits out an executable or extension module.

retdec

by avast (language: c++)

6409 stars, license: NOASSERTION

RetDec is a retargetable machine-code decompiler based on LLVM.

gcc

by gcc-mirror (language: c)

5340 stars, license: NOASSERTION

dotty

by lampepfl (language: scala)

4933 stars, license: Apache-2.0

The Scala 3 compiler, also known as Dotty.

Trending New libraries in Compiler

rescript-lang.org

by rescript-association (language: javascript)

1052 stars, license: MIT

Official documentation website for the ReScript programming language

circt

by llvm (language: c++)

888 stars, license: NOASSERTION

Circuit IR Compilers and Tools

mini-typescript

by sandersn (language: typescript)

811 stars, license: MIT

A miniature model of the Typescript compiler, intended to teach the structure of the real Typescript compiler

oakc

by adam-mcdaniel (language: rust)

621 stars, license: Apache-2.0

A portable programming language with a compact intermediate representation

langcraft

by SuperTails (language: rust)

497 stars, license: NOASSERTION

Compiler from LLVM IR to Minecraft datapacks.

symcc

by eurecom-s3 (language: c++)

415 stars, license: GPL-3.0

SymCC: efficient compiler-based symbolic execution

tenderjit

by tenderlove (language: ruby)

363 stars, license: Apache-2.0

JIT for Ruby that is written in Ruby

clang-tutor

by banach-space (language: c++)

362 stars, license: Unlicense

A collection of out-of-tree Clang plugins for teaching and learning

StaticScript

by StaticScript (language: c++)

340 stars, license: MIT

🎉🎉🎉 A new statically typed programming language, syntactically like TypeScript.

Top Authors in Compiler

1. microsoft: 12 libraries, 1353 stars
2. pfalcon: 7 libraries, 134 stars
3. llvm: 7 libraries, 1069 stars
4. trailofbits: 7 libraries, 809 stars
5. maekawatoshiki: 7 libraries, 732 stars
6. regehr: 6 libraries, 51 stars
7. clang-omp: 6 libraries, 149 stars
8. weliveindetail: 6 libraries, 211 stars
9. chapuni: 6 libraries, 84 stars
10. google: 6 libraries, 1626 stars

Trending Discussions on Compiler

Java, Intellij IDEA problem Unrecognized option: --add-opens=jdk.compiler/com.sun.tools.javac.code=ALL-UNNAMED

Is it safe to bind an unsigned int to a signed int reference?

Can one delete a function returning an incomplete type in C++?

Why does the TypeScript compiler compile its optional chaining and null-coalescing operators with two checks?

Why does my Intel Skylake / Kaby Lake CPU incur a mysterious factor 3 slowdown in a simple hash table implementation?

Function default argument value depending on argument name in C++

Command CompileSwiftSources failed with a nonzero exit code XCode 13

Is it allowed to name a global variable `read` or `malloc` in C++?

Are char arrays guaranteed to be null terminated?

Why is C++'s NULL typically an integer literal rather than a pointer like in C?

QUESTION

Java, Intellij IDEA problem Unrecognized option: --add-opens=jdk.compiler/com.sun.tools.javac.code=ALL-UNNAMED

Asked 2022-Mar-26 at 15:23

I have newly installed

IntelliJ IDEA 2021.2 (Ultimate Edition)
Build #IU-212.4746.92, built on July 27, 2021
Licensed to XXXXXX
Subscription is active until August 15, 2021.
Runtime version: 11.0.11+9-b1504.13 amd64
VM: OpenJDK 64-Bit Server VM by JetBrains s.r.o.
Linux 5.4.0-80-generic
GC: G1 Young Generation, G1 Old Generation
Memory: 2048M
Cores: 3

Kotlin: 212-1.5.10-release-IJ4746.92
Current Desktop: X-Cinnamon

I cloned a project that I work on without issues on another workstation, but here I cannot run any class with a main method. IDEA reports:

Abnormal build process termination:
/usr/lib/jvm/java-1.8.0-openjdk-amd64/bin/java -Xmx700m -Djava.awt.headless=true -Djava.endorsed.dirs=\"\" -Dcompile.parallel=false -Drebuild.on.dependency.change=true -Djdt.compiler.useSingleThread=true -Daether.connector.resumeDownloads=false -Dio.netty.initialSeedUniquifier=-5972351880001011455 -Dfile.encoding=UTF-8 -Duser.language=en -Duser.country=US -Didea.paths.selector=IntelliJIdea2021.2 -Didea.home.path=/home/pm/idea-IU-212.4746.92 -Didea.config.path=/home/pm/.config/JetBrains/IntelliJIdea2021.2 -Didea.plugins.path=/home/pm/.local/share/JetBrains/IntelliJIdea2021.2 -Djps.log.dir=/home/pm/.cache/JetBrains/IntelliJIdea2021.2/log/build-log -Djps.fallback.jdk.home=/home/pm/idea-IU-212.4746.92/jbr -Djps.fallback.jdk.version=11.0.11 -Dio.netty.noUnsafe=true -Djava.io.tmpdir=/home/pm/.cache/JetBrains/IntelliJIdea2021.2/compile-server/rfg-survey-api_cc70fc05/_temp_ -Djps.backward.ref.index.builder=true -Djps.track.ap.dependencies=false --add-opens=jdk.compiler/com.sun.tools.javac.code=ALL-UNNAMED --add-opens=jdk.compiler/com.sun.tools.javac.comp=ALL-UNNAMED --add-opens=jdk.compiler/com.sun.tools.javac.file=ALL-UNNAMED --add-opens=jdk.compiler/com.sun.tools.javac.main=ALL-UNNAMED --add-opens=jdk.compiler/com.sun.tools.javac.model=ALL-UNNAMED --add-opens=jdk.compiler/com.sun.tools.javac.parser=ALL-UNNAMED --add-opens=jdk.compiler/com.sun.tools.javac.processing=ALL-UNNAMED --add-opens=jdk.compiler/com.sun.tools.javac.tree=ALL-UNNAMED --add-opens=jdk.compiler/com.sun.tools.javac.util=ALL-UNNAMED --add-opens=jdk.compiler/com.sun.tools.javac.jvm=ALL-UNNAMED -Dtmh.instrument.annotations=true -Dtmh.generate.line.numbers=true -Dkotlin.incremental.compilation=true -Dkotlin.incremental.compilation.js=true -Dkotlin.daemon.enabled -Dkotlin.daemon.client.alive.path=\"/tmp/kotlin-idea-12426594439704512301-is-running\" -classpath /home/pm/idea-IU-212.4746.92/plugins/java/lib/jps-launcher.jar:/usr/lib/jvm/java-1.8.0-openjdk-amd64/lib/tools.jar org.jetbrains.jps.cmdline.Launcher /home/pm/idea-IU-212.4746.92/lib/slf4j.jar:/home/pm/idea-IU-212.4746.92/lib/idea_rt.jar:/home/pm/idea-IU-212.4746.92/lib/platform-api.jar:/home/pm/idea-IU-212.4746.92/plugins/java/lib/maven-resolver-transport-file-1.3.3.jar:/home/pm/idea-IU-212.4746.92/lib/forms_rt.jar:/home/pm/idea-IU-212.4746.92/lib/util.jar:/home/pm/idea-IU-212.4746.92/lib/annotations.jar:/home/pm/idea-IU-212.4746.92/lib/3rd-party.jar:/home/pm/idea-IU-212.4746.92/lib/kotlin-stdlib-jdk8.jar:/home/pm/idea-IU-212.4746.92/plugins/java/lib/maven-resolver-connector-basic-1.3.3.jar:/home/pm/idea-IU-212.4746.92/lib/jna-platform.jar:/home/pm/idea-IU-212.4746.92/lib/protobuf-java-3.15.8.jar:/home/pm/idea-IU-212.4746.92/plugins/java/lib/jps-builders-6.jar:/home/pm/idea-IU-212.4746.92/plugins/java/lib/javac2.jar:/home/pm/idea-IU-212.4746.92/plugins/java/lib/aether-dependency-resolver.jar:/home/pm/idea-IU-212.4746.92/plugins/java/lib/jps-builders.jar:/home/pm/idea-IU-212.4746.92/plugins/java/lib/jps-javac-extension-1.jar:/home/pm/idea-IU-212.4746.92/lib/jna.jar:/home/pm/idea-IU-212.4746.92/lib/jps-model.jar:/home/pm/idea-IU-212.4746.92/plugins/java/lib/maven-resolver-transport-http-1.3.3.jar:/home/pm/idea-IU-212.4746.92/plugins/JavaEE/lib/jasper-v2-rt.jar:/home/pm/idea-IU-212.4746.92/plugins/Kotlin/lib/kotlin-reflect.jar:/home/pm/idea-IU-212.4746.92/plugins/Kotlin/lib/kotlin-plugin.jar:/home/pm/idea-IU-212.4746.92/plugins/ant/lib/ant-jps.jar:/home/pm/idea-IU-212.4746.92/plugins/uiDesigner/lib/jps/java-guiForms-jps.jar:/home/pm/idea-IU-212.4746.92/plugins/eclipse/lib/eclipse-jps.jar:/home/pm/idea-IU-212.4746.92/plugins/eclipse/lib/eclipse-common.jar:/home/pm/idea-IU-212.4746.92/plugins/IntelliLang/lib/java-langInjection-jps.jar:/home/pm/idea-IU-212.4746.92/plugins/Groovy/lib/groovy-jps.jar:/home/pm/idea-IU-212.4746.92/plugins/Groovy/lib/groovy-constants-rt.jar:/home/pm/idea-IU-212.4746.92/plugins/maven/lib/maven-jps.jar:/home/pm/idea-IU-212.4746.92/plugins/gradle-java/lib/gradle-jps.jar:/home/pm/idea-IU-212.4746.92/plugins/devkit/lib/devkit-jps.jar:/home/pm/idea-IU-212.4746.92/plugins/javaFX/lib/javaFX-jps.jar:/home/pm/idea-IU-212.4746.92/plugins/javaFX/lib/javaFX-common.jar:/home/pm/idea-IU-212.4746.92/plugins/JavaEE/lib/javaee-jps.jar:/home/pm/idea-IU-212.4746.92/plugins/webSphereIntegration/lib/jps/javaee-appServers-websphere-jps.jar:/home/pm/idea-IU-212.4746.92/plugins/weblogicIntegration/lib/jps/javaee-appServers-weblogic-jps.jar:/home/pm/idea-IU-212.4746.92/plugins/JPA/lib/jps/javaee-jpa-jps.jar:/home/pm/idea-IU-212.4746.92/plugins/Grails/lib/groovy-grails-jps.jar:/home/pm/idea-IU-212.4746.92/plugins/Grails/lib/groovy-grails-compilerPatch.jar:/home/pm/idea-IU-212.4746.92/plugins/Kotlin/lib/jps/kotlin-jps-plugin.jar:/home/pm/idea-IU-212.4746.92/plugins/Kotlin/lib/kotlin-jps-common.jar:/home/pm/idea-IU-212.4746.92/plugins/Kotlin/lib/kotlin-common.jar org.jetbrains.jps.cmdline.BuildMain 127.0.0.1 34781 9f0681bb-da2a-48db-8344-900ddeb29804 /home/pm/.cache/JetBrains/IntelliJIdea2021.2/compile-server
Unrecognized option: --add-opens=jdk.compiler/com.sun.tools.javac.code=ALL-UNNAMED
Error: Could not create the Java Virtual Machine.
Error: A fatal exception has occurred. Program will exit.

I found another comment suggesting I check that Lombok works, and it appears to be fine.

How can I fix the problem?

ANSWER

Answered 2021-Jul-28 at 07:22

You are running the project on Java 1.8 while the --add-opens option is passed to the runner. Java 1.8 does not support this option: it was introduced together with the module system in Java 9, so a Java 8 JVM rejects it as unrecognized.

So the first option is to run the project on Java 11, which recognizes this VM option.

Another solution is to find where --add-opens is added and remove it. Check the Run configuration in IntelliJ IDEA (VM options field) and the Maven/Gradle configuration files for argLine (Maven) and jvmArgs (Gradle).

Source https://stackoverflow.com/questions/68554693

QUESTION

Is it safe to bind an unsigned int to a signed int reference?

Asked 2022-Feb-09 at 07:17

After coming across something similar in a co-worker's code, I'm having trouble understanding why/how this code executes without compiler warnings or errors.

#include <iostream>

int main (void)
{
    unsigned int u = 42;

    const int& s = u;

    std::cout << "u=" << u << " s=" << s << "\n";

    u = 6 * 9;

    std::cout << "u=" << u << " s=" << s << "\n";
}

Output:

u=42 s=42
u=54 s=42

First, I expect the compiler to issue some kind of diagnostic when I mix signed/unsigned integers like this. Certainly it does if I attempt to compare with <. That's one thing that confuses me.

Second, I'm not sure how the second line of output is generated. I expected the value of s to be 54. How does this work? Is the compiler creating an anonymous, automatic signed integer variable, assigning the value of u, and pointing the reference s at that value? Or is it doing something else, like changing s from a reference to a plain integer variable?

ANSWER

Answered 2022-Feb-09 at 07:17

A reference cannot bind directly to an object of a different type. Given const int& s = u;, u is first implicitly converted to int, which yields a temporary (a brand-new object), and s then binds to that temporary int. (Lvalue references to const, and rvalue references, can bind to temporaries.) The lifetime of the temporary is extended to the lifetime of s, so it is destroyed when control leaves main. Because s refers to the temporary rather than to u, the later assignment to u does not change s.
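The behaviour described in this answer can be checked with a small sketch. The function names binds_to_temporary and binds_directly are illustrative and do not come from the question; both are marked constexpr only so that the result can be verified at compile time, which assumes C++14 or later.

```cpp
// Returns true when the reference keeps the old value after u changes,
// i.e. when s refers to a converted temporary rather than to u itself.
constexpr bool binds_to_temporary() {
    unsigned int u = 42;
    const int& s = u;   // u is converted to int; s binds to that temporary
    u = 6 * 9;          // modifies u only; the temporary still holds 42
    return s == 42 && u == 54u;
}

// Contrast: with matching types the reference binds to u directly,
// so the change to u is visible through r.
constexpr bool binds_directly() {
    unsigned int u = 42;
    const unsigned int& r = u;
    u = 6 * 9;
    return r == 54u;
}

static_assert(binds_to_temporary(), "s refers to a separate temporary int");
static_assert(binds_directly(), "r aliases u");
```

If both static_assert lines compile, the first reference observed a stale copy and the second observed the live object, which matches the output shown in the question.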

Source https://stackoverflow.com/questions/70712797

QUESTION

Can one delete a function returning an incomplete type in C++?

Asked 2021-Dec-19 at 10:56

In the following example function f() returning incomplete type A is marked as deleted:

struct A;
A f() = delete;

It is accepted by GCC but rejected by Clang, which complains:

error: incomplete result type 'A' in function definition

Demo: https://gcc.godbolt.org/z/937PEz1h3

Which compiler is right here according to the standard?

ANSWER

Answered 2021-Dec-19 at 10:26

Clang is wrong.

[dcl.fct.def.general]

2 The type of a parameter or the return type for a function definition shall not be a (possibly cv-qualified) class type that is incomplete or abstract within the function body unless the function is deleted ([dcl.fct.def.delete]).

That's pretty clear I think. A deleted definition allows for an incomplete class type. It's not like the function can actually be called in a well-formed program, or the body is actually using the incomplete type in some way. The function is a placeholder to signify an invalid result to overload resolution.

Granted, the parameter types are more interesting in the case of actual overload resolution (and the return type can be anything), but there is no reason to restrict the return type into being complete here either.
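The point that a deleted function is a placeholder that still takes part in overload resolution can be illustrated with a short sketch. It deliberately uses complete types only, so it does not depend on the GCC/Clang disagreement discussed above; the names print, can_print_impl and can_print are hypothetical, and C++14 or later is assumed.

```cpp
#include <type_traits>
#include <utility>

// Two overloads: one usable, one deleted. Both take part in overload resolution.
void print(int) {}
void print(double) = delete;

// Detection helper: this overload is viable only if the call print(T) is well-formed.
template <class T>
auto can_print_impl(int) -> decltype(print(std::declval<T>()), std::true_type{});

// Fallback chosen when the call above is ill-formed (for example, it selects a deleted function).
template <class T>
std::false_type can_print_impl(...);

template <class T>
constexpr bool can_print = decltype(can_print_impl<T>(0))::value;

static_assert(can_print<int>, "print(int) is selected and callable");
static_assert(!can_print<double>, "print(double) is selected, but it is deleted");
static_assert(!can_print<float>, "float promotes to double, so the deleted overload wins");
```

The float case shows why the deleted overload matters: it is the best match, so the call is rejected instead of silently falling back to print(int).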

Source https://stackoverflow.com/questions/70410542

QUESTION

Why does the TypeScript compiler compile its optional chaining and null-coalescing operators with two checks?

Asked 2021-Nov-17 at 06:56

Why does the TypeScript compiler compile its optional chaining and null-coalescing operators, ?. and ??, to

// x?.y
x === null || x === void 0 ? void 0 : x.y;

// x ?? y
x !== null && x !== void 0 ? x : y

instead of

// x?.y
x == null ? void 0 : x.y

// x ?? y
x != null ? x : y

?

Odds are that behind the scenes == null does the same two checks, but even for the sake of code length, a single check seems cleaner. It also adds far fewer parentheses when chaining several optional accesses.

Incidentally, I'm also surprised that optional chaining doesn't compile to

x == null ? x : x.y

to preserve null vs undefined. This has since been answered: Why does JavaScript's optional chaining to use undefined instead of preserving null?

ANSWER

Answered 2021-Nov-04 at 17:40

You can find an authoritative answer in microsoft/TypeScript#16 (wow, an old one); it is specifically explained in this comment:

That's because of document.all [...], a quirk that gets special treatment in the language for backwards compatibility.

document.all == null // true
document.all === null || document.all === undefined // false

In the optional chaining proposal

document.all?.foo === document.all.foo

but document.all == null ? void 0 : document.all.foo would incorrectly return void 0.

So there is a particular idiosyncratic deprecated obsolete wacky legacy pseudo-property edge case of type HTMLAllCollection that nobody uses, which is loosely equal to null but not strictly equal to either undefined or null. Amazing!

It doesn't seem like anyone seriously considered just breaking things for document.all. And since the xxx === null || xxx === undefined version works for all situations, it's probably the tersest way of emitting backward-compatible JS code that behaves according to the spec.

Source https://stackoverflow.com/questions/69843082

QUESTION

Why does my Intel Skylake / Kaby Lake CPU incur a mysterious factor 3 slowdown in a simple hash table implementation?

Asked 2021-Oct-26 at 09:13

In short:

I have implemented a simple (multi-key) hash table with buckets (containing several elements) that exactly fit a cacheline. Inserting into a cacheline bucket is very simple, and the critical part of the main loop.

I have implemented three versions that produce the same outcome and should behave the same.
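For readers who want a concrete picture of a bucket that exactly fits a cacheline, here is a minimal sketch. The layout is an assumption made for illustration (seven 8-byte keys, seven 1-byte values and a 1-byte size counter, 64 bytes in total); the question does not show the definition of bucket_t, and the one in the linked repository may differ. The helpers try_store and stores_accepted are hypothetical names, and C++14 or later is assumed.

```cpp
#include <cstddef>
#include <cstdint>

// Assumed capacity: 7 slots per bucket (not stated in the question).
constexpr std::size_t bucket_size = 7;

// Assumed layout: 7*8 + 7*1 + 1 = 64 bytes, aligned to a 64-byte cacheline.
struct alignas(64) bucket_t {
    std::uint64_t keys[bucket_size];
    std::uint8_t  values[bucket_size];
    std::uint8_t  size;   // number of occupied slots
};
static_assert(sizeof(bucket_t) == 64, "bucket must occupy exactly one cacheline");

// Stores k in the next free slot, mirroring the store step of insert_ok.
// Returns false when the bucket is already full.
constexpr bool try_store(bucket_t& B, std::uint64_t k) {
    if (B.size == bucket_size)
        return false;
    B.keys[B.size] = k;
    B.values[B.size] = 1;
    ++B.size;
    return true;
}

// Counts how many of `attempts` stores an initially empty bucket accepts.
constexpr std::size_t stores_accepted(std::size_t attempts) {
    bucket_t B{};
    std::size_t accepted = 0;
    for (std::size_t i = 0; i < attempts; ++i)
        if (try_store(B, i))
            ++accepted;
    return accepted;
}

static_assert(stores_accepted(10) == bucket_size, "an empty bucket accepts exactly bucket_size stores");
```

Keeping the size counter inside the same 64 bytes means that the fullness check and the store touch a single cacheline.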

The mystery

However, I'm seeing wild performance differences, by a surprisingly large factor of 3, despite all versions having the exact same cacheline access pattern and resulting in identical hash table data.

The best implementation, insert_ok, suffers a slowdown of around a factor of 3 compared to insert_bad & insert_alt on my CPU (i7-7700HQ). One variant, insert_bad, is a simple modification of insert_ok that adds an extra, unnecessary linear search within the cacheline to find the position to write to (which it already knows), and it does not suffer this 3x slowdown.

The exact same executable shows insert_ok to be a factor of 1.6 faster than insert_bad & insert_alt on other CPUs (AMD 5950X (Zen 3), Intel i7-11800H (Tiger Lake)).

# see https://github.com/cr-marcstevens/hashtable_mystery
$ ./test.sh
model name      : Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
==============================
CXX=g++    CXXFLAGS=-std=c++11 -O2 -march=native -falign-functions=64
tablesize: 117440512 elements: 67108864 loadfactor=0.571429
- test insert_ok : 11200ms
- test insert_bad: 3164ms
  (outcome identical to insert_ok: true)
- test insert_alt: 3366ms
  (outcome identical to insert_ok: true)

tablesize: 117440813 elements: 67108864 loadfactor=0.571427
- test insert_ok : 10840ms
- test insert_bad: 3301ms
  (outcome identical to insert_ok: true)
- test insert_alt: 3579ms
  (outcome identical to insert_ok: true)

The Code

// insert element in hash_table
inline void insert_ok(uint64_t k)
{
    // compute target bucket
    uint64_t b = mod(k);
    // bounded linear search for first non-full bucket
    for (size_t c = 0; c < 1024; ++c)
    {
        bucket_t& B = table_ok[b];
        // if bucket non-full then store element and return
        if (B.size != bucket_size)
        {
            B.keys[B.size] = k;
            B.values[B.size] = 1;
            ++B.size;
            ++table_count;
            return;
        }
        // increase b w/ wrap around
        if (++b == table_size)
            b = 0;
    }
}

// equivalent to insert_ok
// but uses a stupid linear search to store the element at the target position
inline void insert_bad(uint64_t k)
{
    // compute target bucket
    uint64_t b = mod(k);
    // bounded linear search for first non-full bucket
    for (size_t c = 0; c < 1024; ++c)
    {
        bucket_t& B = table_bad[b];
        // if bucket non-full then store element and return
        if (B.size != bucket_size)
        {
            for (size_t i = 0; i < bucket_size; ++i)
            {
                if (i == B.size)
                {
                    B.keys[i] = k;
                    B.values[i] = 1;
                    ++B.size;
                    ++table_count;
                    return;
                }
            }
        }
        // increase b w/ wrap around
        if (++b == table_size)
            b = 0;
    }
}

// instead of using bucket_t.size, empty elements are marked by special empty_key value
// a bucket is filled first to last, so bucket is full if last element key != empty_key
uint64_t empty_key = ~uint64_t(0);
inline void insert_alt(uint64_t k)
{
    // compute target bucket
    uint64_t b = mod(k);
    // bounded linear search for first non-full bucket
    for (size_t c = 0; c < 1024; ++c)
    {
        bucket_t& B = table_alt[b];
        // if bucket non-full then store element and return
        if (B.keys[bucket_size-1] == empty_key)
        {
            for (size_t i = 0; i < bucket_size; ++i)
            {
                if (B.keys[i] == empty_key)
                {
                    B.keys[i] = k;
                    B.values[i] = 1;
                    ++table_count;
                    return;
                }
            }
        }
        // increase b w/ wrap around
        if (++b == table_size)
            b = 0;
    }
}

My analysis

I've tried various modifications to the C++ loop, but it is inherently so simple that the compiler produces the same assembly. It's really not obvious from the resulting assembly what might cause the factor-3 loss. I've tried measuring with perf, but I can't seem to pinpoint any meaningful difference.

Comparing the assembly of the 3 versions, which are all just relatively small loops, nothing suggests anything that could cause a factor-3 loss between them.

Hence, I presume the 3x slow down is a weird effect of automatic prefetching, or branch prediction, or instruction/jump alignment or maybe a combination of those.

Does anybody have better insights or ways to measure what effects might actually be at play here?

Details

I've created a small working C++11 example that demonstrates the problem. The code is available at https://github.com/cr-marcstevens/hashtable_mystery

This also includes my own static binaries that demonstrate the problem on my CPU (different compilers may produce different code), as well as dumped assembly code for all three hash table versions.

perf event measurements

Here are a lot of perf event measurements. I've focused on ones that include the word miss and stall. Each event has two lines:

  • the first line corresponds to insert_ok which has the slowdown
  • the second line corresponds to insert_alt which has an additional loop and additional work, but ends up faster

=== L1-dcache-load-misses ===
insert_ok : 171411476
insert_alt: 244244027
=== L1-dcache-loads ===
insert_ok : 775468123
insert_alt: 1038574743
=== L1-dcache-stores ===
insert_ok : 621353009
insert_alt: 554244145
=== L1-icache-load-misses ===
insert_ok : 69666
insert_alt: 259102
=== LLC-load-misses ===
insert_ok : 70519701
insert_alt: 71399242
=== LLC-loads ===
insert_ok : 130909270
insert_alt: 134776189
=== LLC-store-misses ===
insert_ok : 16782747
insert_alt: 16851787
=== LLC-stores ===
insert_ok : 17072141
insert_alt: 17534866
=== arith.divider_active ===
insert_ok : 26810
insert_alt: 26611
=== baclears.any ===
insert_ok : 2038060
insert_alt: 7648128
=== br_inst_retired.all_branches ===
insert_ok : 546479449
insert_alt: 938434022
=== br_inst_retired.all_branches_pebs ===
insert_ok : 546480454
insert_alt: 938412921
=== br_inst_retired.cond_ntaken ===
insert_ok : 237470651
insert_alt: 433439086
=== br_inst_retired.conditional ===
insert_ok : 477604946
insert_alt: 802468807
=== br_inst_retired.far_branch ===
insert_ok : 1058138
insert_alt: 1052510
=== br_inst_retired.near_call ===
insert_ok : 227076
insert_alt: 227074
=== br_inst_retired.near_return ===
insert_ok : 227072
insert_alt: 227070
=== br_inst_retired.near_taken ===
insert_ok : 307946256
insert_alt: 503926433
=== br_inst_retired.not_taken ===
insert_ok : 237458763
insert_alt: 433429466
=== br_misp_retired.all_branches ===
insert_ok : 36443541
insert_alt: 90626754
=== br_misp_retired.all_branches_pebs ===
insert_ok : 36441027
insert_alt: 90622375
=== br_misp_retired.conditional ===
insert_ok : 36454196
insert_alt: 90591031
=== br_misp_retired.near_call ===
insert_ok : 173
insert_alt: 169
=== br_misp_retired.near_taken ===
insert_ok : 19032467
insert_alt: 40361420
=== branch-instructions ===
insert_ok : 546476228
insert_alt: 938447476
=== branch-load-misses ===
insert_ok : 36441314
insert_alt: 90611299
=== branch-loads ===
insert_ok : 546472151
insert_alt: 938435143
=== branch-misses ===
insert_ok : 36436325
insert_alt: 90597372
=== bus-cycles ===
insert_ok : 222283508
insert_alt: 88243938
=== cache-misses ===
insert_ok : 257067753
insert_alt: 475091979
=== cache-references ===
insert_ok : 445465943
insert_alt: 590770464
=== cpu-clock ===
insert_ok : 10333.94 msec cpu-clock:u # 1.000 CPUs utilized
insert_alt: 4766.53 msec cpu-clock:u # 1.000 CPUs utilized
=== cpu-cycles ===
insert_ok : 25273361574
insert_alt: 11675804743
=== cpu_clk_thread_unhalted.one_thread_active ===
insert_ok : 223196489
insert_alt: 88616919
=== cpu_clk_thread_unhalted.ref_xclk ===
insert_ok : 222719013
insert_alt: 88467292
=== cpu_clk_unhalted.one_thread_active ===
insert_ok : 223380608
insert_alt: 88212476
=== cpu_clk_unhalted.ref_tsc ===
insert_ok : 32663820508
insert_alt: 12901195392
=== cpu_clk_unhalted.ref_xclk ===
insert_ok : 221957996
insert_alt: 88390991
=== cpu_clk_unhalted.ring0_trans ===
insert_ok : 374
insert_alt: 373
=== cpu_clk_unhalted.thread ===
insert_ok : 25286801620
insert_alt: 11714137483
=== cycle_activity.cycles_l1d_miss ===
insert_ok : 16278956219
insert_alt: 7417877493
=== cycle_activity.cycles_l2_miss ===
insert_ok : 15607833569
insert_alt: 7054717199
=== cycle_activity.cycles_l3_miss ===
insert_ok : 12987627072
insert_alt: 6745771672
=== cycle_activity.cycles_mem_any ===
insert_ok : 23440206343
insert_alt: 9027220495
=== cycle_activity.stalls_l1d_miss ===
insert_ok : 16194872307
insert_alt: 4718344050
=== cycle_activity.stalls_l2_miss ===
insert_ok : 15350067722
insert_alt: 4578933898
=== cycle_activity.stalls_l3_miss ===
insert_ok : 12697354271
insert_alt: 4457980047
=== cycle_activity.stalls_mem_any ===
insert_ok : 20930005455
insert_alt: 4555461595
=== cycle_activity.stalls_total ===
insert_ok : 22243173394
insert_alt: 6561416461
=== dTLB-load-misses ===
insert_ok : 67817362
insert_alt: 63603879
=== dTLB-loads ===
insert_ok : 775467642
insert_alt: 1038562488
=== dTLB-store-misses ===
insert_ok : 8823481
insert_alt: 13050341
=== dTLB-stores ===
insert_ok : 621353007
insert_alt: 554244145
=== dsb2mite_switches.count ===
insert_ok : 93894397
insert_alt: 315793354
=== dsb2mite_switches.penalty_cycles ===
insert_ok : 9216240937
insert_alt: 206393788
=== dtlb_load_misses.miss_causes_a_walk ===
insert_ok : 177266866
insert_alt: 101439773
270=== dtlb_load_misses.stlb_hit ===
271insert_ok : 2994329
272insert_alt: 35601646
273=== dtlb_load_misses.walk_active ===
274insert_ok : 4747616986
275insert_alt: 3893609232
276=== dtlb_load_misses.walk_completed ===
277insert_ok : 67817832
278insert_alt: 63591832
279=== dtlb_load_misses.walk_completed_4k ===
280insert_ok : 67817841
281insert_alt: 63596148
282=== dtlb_load_misses.walk_pending ===
283insert_ok : 6495600072
284insert_alt: 5987182579
285=== dtlb_store_misses.miss_causes_a_walk ===
286insert_ok : 89895924
287insert_alt: 21841494
288=== dtlb_store_misses.stlb_hit ===
289insert_ok : 4940907
290insert_alt: 21970231
291=== dtlb_store_misses.walk_active ===
292insert_ok : 1784142210
293insert_alt: 903334856
294=== dtlb_store_misses.walk_completed ===
295insert_ok : 8845884
296insert_alt: 13071262
297=== dtlb_store_misses.walk_completed_4k ===
298insert_ok : 8822993
299insert_alt: 12936414
300=== dtlb_store_misses.walk_pending ===
301insert_ok : 1842905733
302insert_alt: 933039119
303=== exe_activity.1_ports_util ===
304insert_ok : 991400575
305insert_alt: 1433908710
306=== exe_activity.2_ports_util ===
307insert_ok : 782270731
308insert_alt: 1314443071
309=== exe_activity.3_ports_util ===
310insert_ok : 556847358
311insert_alt: 1158115803
312=== exe_activity.4_ports_util ===
313insert_ok : 427323800
314insert_alt: 783571280
315=== exe_activity.bound_on_stores ===
316insert_ok : 299732094
317insert_alt: 303475333
318=== exe_activity.exe_bound_0_ports ===
319insert_ok : 227569792
320insert_alt: 348959512
=== frontend_retired.dsb_miss ===
insert_ok : 6771584
insert_alt: 93700643
=== frontend_retired.itlb_miss ===
insert_ok : 1115
insert_alt: 1689
=== frontend_retired.l1i_miss ===
insert_ok : 3639
insert_alt: 3857
=== frontend_retired.l2_miss ===
insert_ok : 2826
insert_alt: 2830
=== frontend_retired.latency_ge_1 ===
insert_ok : 9206268
insert_alt: 178345368
=== frontend_retired.latency_ge_128 ===
insert_ok : 2708
insert_alt: 2703
=== frontend_retired.latency_ge_16 ===
insert_ok : 403492
insert_alt: 820950
=== frontend_retired.latency_ge_2 ===
insert_ok : 4981263
insert_alt: 85781924
=== frontend_retired.latency_ge_256 ===
insert_ok : 802
insert_alt: 970
=== frontend_retired.latency_ge_2_bubbles_ge_1 ===
insert_ok : 56936702
insert_alt: 225712704
=== frontend_retired.latency_ge_2_bubbles_ge_2 ===
insert_ok : 10312026
insert_alt: 163227996
=== frontend_retired.latency_ge_2_bubbles_ge_3 ===
insert_ok : 7599252
insert_alt: 122841752
=== frontend_retired.latency_ge_32 ===
insert_ok : 3599
insert_alt: 3317
=== frontend_retired.latency_ge_4 ===
insert_ok : 2627373
insert_alt: 42287077
=== frontend_retired.latency_ge_512 ===
insert_ok : 418
insert_alt: 241
=== frontend_retired.latency_ge_64 ===
insert_ok : 2474
insert_alt: 2802
=== frontend_retired.latency_ge_8 ===
insert_ok : 528748
insert_alt: 951836
=== frontend_retired.stlb_miss ===
insert_ok : 769
insert_alt: 562
=== hw_interrupts.received ===
insert_ok : 9330
insert_alt: 3738
=== iTLB-load-misses ===
insert_ok : 456094
insert_alt: 90739
=== iTLB-loads ===
insert_ok : 949
insert_alt: 1031
=== icache_16b.ifdata_stall ===
insert_ok : 1145821
insert_alt: 862403
=== icache_64b.iftag_hit ===
insert_ok : 1378406022
insert_alt: 4459469241
=== icache_64b.iftag_miss ===
insert_ok : 61812
insert_alt: 57204
=== icache_64b.iftag_stall ===
insert_ok : 56551468
insert_alt: 82354039
=== idq.all_dsb_cycles_4_uops ===
insert_ok : 896374829
insert_alt: 1610100578
=== idq.all_dsb_cycles_any_uops ===
insert_ok : 1217878089
insert_alt: 2739912727
=== idq.all_mite_cycles_4_uops ===
insert_ok : 315979501
insert_alt: 480165021
=== idq.all_mite_cycles_any_uops ===
insert_ok : 1053703958
insert_alt: 2251382760
=== idq.dsb_cycles ===
insert_ok : 1218891711
insert_alt: 2744099964
=== idq.dsb_uops ===
insert_ok : 5828442701
insert_alt: 10445095004
=== idq.mite_cycles ===
insert_ok : 470409312
insert_alt: 1664892371
=== idq.mite_uops ===
insert_ok : 1407396065
insert_alt: 4515396737
=== idq.ms_cycles ===
insert_ok : 583601361
insert_alt: 587996351
=== idq.ms_dsb_cycles ===
insert_ok : 218346
insert_alt: 74155
=== idq.ms_mite_uops ===
insert_ok : 1266443204
insert_alt: 1277980465
=== idq.ms_switches ===
insert_ok : 149106449
insert_alt: 150392336
=== idq.ms_uops ===
insert_ok : 1266950097
insert_alt: 1277330690
=== idq_uops_not_delivered.core ===
insert_ok : 1871959581
insert_alt: 6531069387
=== idq_uops_not_delivered.cycles_0_uops_deliv.core ===
insert_ok : 289301660
insert_alt: 946930713
=== idq_uops_not_delivered.cycles_fe_was_ok ===
insert_ok : 24668869613
insert_alt: 9335642949
=== idq_uops_not_delivered.cycles_le_1_uop_deliv.core ===
insert_ok : 393750384
insert_alt: 1344106460
=== idq_uops_not_delivered.cycles_le_2_uop_deliv.core ===
insert_ok : 506090534
insert_alt: 1824690188
=== idq_uops_not_delivered.cycles_le_3_uop_deliv.core ===
insert_ok : 688462029
insert_alt: 2416339045
=== ild_stall.lcp ===
insert_ok : 380
insert_alt: 480
=== inst_retired.any ===
insert_ok : 4760842560
insert_alt: 5470438932
=== inst_retired.any_p ===
insert_ok : 4760919037
insert_alt: 5470404264
=== inst_retired.prec_dist ===
insert_ok : 4760801654
insert_alt: 5470649220
=== inst_retired.total_cycles_ps ===
insert_ok : 25175372339
insert_alt: 11718929626
=== instructions ===
insert_ok : 4760805219
insert_alt: 5470497783
=== int_misc.clear_resteer_cycles ===
insert_ok : 199623562
insert_alt: 671083279
=== int_misc.recovery_cycles ===
insert_ok : 314434729
insert_alt: 704406698
=== itlb.itlb_flush ===
insert_ok : 303
insert_alt: 248
=== itlb_misses.miss_causes_a_walk ===
insert_ok : 19537
insert_alt: 116729
=== itlb_misses.stlb_hit ===
insert_ok : 11323
insert_alt: 5557
=== itlb_misses.walk_active ===
insert_ok : 2809766
insert_alt: 4070194
=== itlb_misses.walk_completed ===
insert_ok : 24298
insert_alt: 45251
=== itlb_misses.walk_completed_4k ===
insert_ok : 34084
insert_alt: 29759
=== itlb_misses.walk_pending ===
insert_ok : 853764
insert_alt: 2817933
=== l1d.replacement ===
insert_ok : 171135334
insert_alt: 244967326
=== l1d_pend_miss.fb_full ===
insert_ok : 354631656
insert_alt: 382309583
=== l1d_pend_miss.pending ===
insert_ok : 16792436441
insert_alt: 22979721104
=== l1d_pend_miss.pending_cycles ===
insert_ok : 16377420892
insert_alt: 7349245429
=== l1d_pend_miss.pending_cycles_any ===
insert_ok :
insert_alt:
=== l2_lines_in.all ===
insert_ok : 303009088
insert_alt: 411750486
=== l2_lines_out.non_silent ===
insert_ok : 157208112
insert_alt: 309484666
=== l2_lines_out.silent ===
insert_ok : 127379047
insert_alt: 84169481
=== l2_lines_out.useless_hwpf ===
insert_ok : 70374658
insert_alt: 144359127
=== l2_lines_out.useless_pref ===
insert_ok : 70747103
insert_alt: 142931540
=== l2_rqsts.all_code_rd ===
insert_ok : 71254
insert_alt: 242327
=== l2_rqsts.all_demand_data_rd ===
insert_ok : 137366274
insert_alt: 143507049
=== l2_rqsts.all_demand_miss ===
insert_ok : 150071420
insert_alt: 150820168
=== l2_rqsts.all_demand_references ===
insert_ok : 154854022
insert_alt: 160487082
=== l2_rqsts.all_pf ===
insert_ok : 170261458
insert_alt: 282476184
=== l2_rqsts.all_rfo ===
insert_ok : 17575896
insert_alt: 16938897
=== l2_rqsts.code_rd_hit ===
insert_ok : 79800
insert_alt: 381566
=== l2_rqsts.code_rd_miss ===
insert_ok : 25800
insert_alt: 33755
=== l2_rqsts.demand_data_rd_hit ===
insert_ok : 5191029
insert_alt: 9831101
=== l2_rqsts.demand_data_rd_miss ===
insert_ok : 132253891
insert_alt: 133965310
=== l2_rqsts.miss ===
insert_ok : 305347974
insert_alt: 414758839
=== l2_rqsts.pf_hit ===
insert_ok : 14639778
insert_alt: 19484420
=== l2_rqsts.pf_miss ===
insert_ok : 156092998
insert_alt: 263293430
=== l2_rqsts.references ===
insert_ok : 326549998
insert_alt: 443460029
=== l2_rqsts.rfo_hit ===
insert_ok : 11650
insert_alt: 21474
=== l2_rqsts.rfo_miss ===
insert_ok : 17544467
insert_alt: 16835137
=== l2_trans.l2_wb ===
insert_ok : 157044674
insert_alt: 308107712
=== ld_blocks.no_sr ===
insert_ok : 14
insert_alt: 13
=== ld_blocks.store_forward ===
insert_ok : 158
insert_alt: 128
=== ld_blocks_partial.address_alias ===
insert_ok : 5155853
insert_alt: 17867414
=== load_hit_pre.sw_pf ===
insert_ok : 10840795
insert_alt: 11072297
=== longest_lat_cache.miss ===
insert_ok : 257061118
insert_alt: 471152073
=== longest_lat_cache.reference ===
insert_ok : 445701577
insert_alt: 583870610
=== machine_clears.count ===
insert_ok : 3926377
insert_alt: 4280080
=== machine_clears.memory_ordering ===
insert_ok : 97177
insert_alt: 25407
=== machine_clears.smc ===
insert_ok : 138579
insert_alt: 305423
=== mem-stores ===
insert_ok : 621353009
insert_alt: 554244143
=== mem_inst_retired.all_loads ===
insert_ok : 775473590
insert_alt: 1038559807
=== mem_inst_retired.all_stores ===
insert_ok : 621353013
insert_alt: 554244145
=== mem_inst_retired.lock_loads ===
insert_ok : 85
insert_alt: 85
=== mem_inst_retired.split_loads ===
insert_ok : 171
insert_alt: 174
=== mem_inst_retired.split_stores ===
insert_ok : 53
insert_alt: 49
=== mem_inst_retired.stlb_miss_loads ===
insert_ok : 68308539
insert_alt: 18088047
=== mem_inst_retired.stlb_miss_stores ===
insert_ok : 264054
insert_alt: 819551
=== mem_load_l3_hit_retired.xsnp_none ===
insert_ok : 231116
insert_alt: 175217
=== mem_load_retired.fb_hit ===
insert_ok : 6510722
insert_alt: 95952490
=== mem_load_retired.l1_hit ===
insert_ok : 698271530
insert_alt: 920982402
=== mem_load_retired.l1_miss ===
insert_ok : 69525335
insert_alt: 20089897
=== mem_load_retired.l2_hit ===
insert_ok : 1451905
insert_alt: 773356
=== mem_load_retired.l2_miss ===
insert_ok : 68085186
insert_alt: 19474303
=== mem_load_retired.l3_hit ===
insert_ok : 222829
insert_alt: 155958
=== mem_load_retired.l3_miss ===
insert_ok : 67879593
insert_alt: 19244746
=== memory_disambiguation.history_reset ===
insert_ok : 97621
insert_alt: 25831
=== minor-faults ===
insert_ok : 1048716
insert_alt: 1048718
=== node-loads ===
insert_ok : 71473780
insert_alt: 71377840
=== node-stores ===
insert_ok : 16781161
insert_alt: 16842666
=== offcore_requests.all_data_rd ===
insert_ok : 284186682
insert_alt: 392110677
=== offcore_requests.all_requests ===
insert_ok : 530876505
insert_alt: 777784382
=== offcore_requests.demand_code_rd ===
insert_ok : 34252
insert_alt: 45896
=== offcore_requests.demand_data_rd ===
insert_ok : 133468710
insert_alt: 134288893
=== offcore_requests.demand_rfo ===
insert_ok : 17612516
insert_alt: 17062276
=== offcore_requests.l3_miss_demand_data_rd ===
insert_ok : 71616594
insert_alt: 82917520
=== offcore_requests_buffer.sq_full ===
insert_ok : 2001445
insert_alt: 3113287
=== offcore_requests_outstanding.all_data_rd ===
insert_ok : 35577129549
insert_alt: 78698308135
=== offcore_requests_outstanding.cycles_with_data_rd ===
insert_ok : 17518017620
insert_alt: 7940272202
=== offcore_requests_outstanding.demand_code_rd ===
insert_ok : 11085819
insert_alt: 9390881
=== offcore_requests_outstanding.demand_data_rd ===
insert_ok : 15902243707
insert_alt: 21097348926
=== offcore_requests_outstanding.demand_data_rd_ge_6 ===
insert_ok : 1225437
insert_alt: 317436422
=== offcore_requests_outstanding.demand_rfo ===
insert_ok : 1074492442
insert_alt: 1157902315
=== offcore_response.demand_code_rd.any_response ===
insert_ok : 53675
insert_alt: 69683
=== offcore_response.demand_code_rd.l3_hit.any_snoop ===
insert_ok : 19407
insert_alt: 29704
=== offcore_response.demand_code_rd.l3_hit.snoop_none ===
insert_ok : 12675
insert_alt: 11951
=== offcore_response.demand_code_rd.l3_miss.any_snoop ===
insert_ok : 34617
insert_alt: 40868
=== offcore_response.demand_code_rd.l3_miss.spl_hit ===
insert_ok : 0
insert_alt: 753
=== offcore_response.demand_data_rd.any_response ===
insert_ok : 131014821
insert_alt: 134813171
=== offcore_response.demand_data_rd.l3_hit.any_snoop ===
insert_ok : 59713328
insert_alt: 50254543
=== offcore_response.demand_data_rd.l3_miss.any_snoop ===
insert_ok : 71431585
insert_alt: 83916030
=== offcore_response.demand_data_rd.l3_miss.spl_hit ===
insert_ok : 244837
insert_alt: 6441992
=== offcore_response.demand_rfo.any_response ===
insert_ok : 16876557
insert_alt: 17619450
=== offcore_response.demand_rfo.l3_hit.any_snoop ===
insert_ok : 907432
insert_alt: 45127
=== offcore_response.demand_rfo.l3_hit.snoop_none ===
insert_ok : 787567
insert_alt: 794579
=== offcore_response.demand_rfo.l3_hit_e.any_snoop ===
insert_ok : 496938
insert_alt: 173658
=== offcore_response.demand_rfo.l3_hit_e.snoop_none ===
insert_ok : 779919
insert_alt: 50575
=== offcore_response.demand_rfo.l3_hit_m.any_snoop ===
insert_ok : 128627
insert_alt: 25483
=== offcore_response.demand_rfo.l3_miss.any_snoop ===
insert_ok : 16782186
insert_alt: 16847970
=== offcore_response.demand_rfo.l3_miss.snoop_none ===
insert_ok : 16782647
insert_alt: 16850104
=== offcore_response.demand_rfo.l3_miss.spl_hit ===
insert_ok : 0
insert_alt: 1364
=== offcore_response.other.any_response ===
insert_ok : 137231000
insert_alt: 189526494
=== offcore_response.other.l3_hit.any_snoop ===
insert_ok : 62695084
insert_alt: 51005882
=== offcore_response.other.l3_hit.snoop_none ===
insert_ok : 62975018
insert_alt: 50217349
=== offcore_response.other.l3_hit_e.any_snoop ===
insert_ok : 62770215
insert_alt: 50691817
=== offcore_response.other.l3_hit_e.snoop_none ===
insert_ok : 62602591
insert_alt: 50642954
=== offcore_response.other.l3_miss.any_snoop ===
insert_ok : 74247236
insert_alt: 139212975
=== offcore_response.other.l3_miss.snoop_none ===
insert_ok : 75911794
insert_alt: 141076520
=== other_assists.any ===
insert_ok : 1
insert_alt: 3
=== page-faults ===
insert_ok : 1048719
insert_alt: 1048718
=== partial_rat_stalls.scoreboard ===
insert_ok : 530950991
insert_alt: 539869553
=== ref-cycles ===
insert_ok : 32546980212
insert_alt: 12930921138
=== resource_stalls.any ===
insert_ok : 21923576648
insert_alt: 5205690082
=== resource_stalls.sb ===
insert_ok : 397908667
insert_alt: 402738367
=== rs_events.empty_cycles ===
insert_ok : 1173721723
insert_alt: 1880165720
=== rs_events.empty_end ===
insert_ok : 87752182
insert_alt: 160792701
=== sw_prefetch_access.t0 ===
insert_ok : 20835202
insert_alt: 20599176
=== task-clock ===
insert_ok : 10416.86 msec task-clock:u # 1.000 CPUs utilized
insert_alt: 4767.78 msec task-clock:u # 1.000 CPUs utilized
=== tlb_flush.stlb_any ===
insert_ok : 1835393
insert_alt: 1835396
=== topdown-fetch-bubbles ===
insert_ok : 1904143421
insert_alt: 6543146396
=== topdown-slots-issued ===
insert_ok : 7538371393
insert_alt: 14449966516
=== topdown-slots-retired ===
insert_ok : 5267325162
insert_alt: 5849706597
=== uops_dispatched_port.port_0 ===
insert_ok : 1252121297
insert_alt: 1489605354
=== uops_dispatched_port.port_1 ===
insert_ok : 1379316967
insert_alt: 1585037107
=== uops_dispatched_port.port_2 ===
insert_ok : 1140861153
insert_alt: 1785053149
=== uops_dispatched_port.port_3 ===
insert_ok : 1187151423
insert_alt: 1828975838
=== uops_dispatched_port.port_4 ===
insert_ok : 1577171758
insert_alt: 1557761857
=== uops_dispatched_port.port_5 ===
insert_ok : 1341370655
insert_alt: 1653599117
=== uops_dispatched_port.port_6 ===
insert_ok : 1856735970
insert_alt: 4387464794
=== uops_dispatched_port.port_7 ===
insert_ok : 508351498
insert_alt: 603583315
=== uops_executed.core ===
insert_ok : 7225522677
insert_alt: 12716368190
=== uops_executed.core_cycles_ge_1 ===
insert_ok : 3041586797
insert_alt: 5168421550
=== uops_executed.core_cycles_ge_2 ===
insert_ok : 2017794537
insert_alt: 3653591208
=== uops_executed.core_cycles_ge_3 ===
insert_ok : 1225785335
insert_alt: 2316014066
=== uops_executed.core_cycles_ge_4 ===
insert_ok : 657121809
insert_alt: 1143390519
=== uops_executed.core_cycles_none ===
insert_ok : 22191507320
insert_alt: 6563722081
=== uops_executed.cycles_ge_1_uop_exec ===
insert_ok : 3040999757
insert_alt: 5175668459
=== uops_executed.cycles_ge_2_uops_exec ===
insert_ok : 2015520940
insert_alt: 3659989196
=== uops_executed.cycles_ge_3_uops_exec ===
insert_ok : 1224025952
insert_alt: 2319025110
=== uops_executed.cycles_ge_4_uops_exec ===
insert_ok : 657094113
insert_alt: 1141381027
=== uops_executed.stall_cycles ===
insert_ok : 22350754164
insert_alt: 6590978048
=== uops_executed.thread ===
insert_ok : 7214521925
insert_alt: 12697219901
=== uops_executed.x87 ===
insert_ok : 2992
insert_alt: 3337
=== uops_issued.any ===
insert_ok : 7531354736
insert_alt: 14462113169
=== uops_issued.slow_lea ===
insert_ok : 2136241
insert_alt: 2115308
=== uops_issued.stall_cycles ===
insert_ok : 23244177475
insert_alt: 7416801878
=== uops_retired.macro_fused ===
insert_ok : 410461916
insert_alt: 735050350
=== uops_retired.retire_slots ===
insert_ok : 5265023980
insert_alt: 5855259326
=== uops_retired.stall_cycles ===
insert_ok : 23513958928
insert_alt: 9630258867
=== uops_retired.total_cycles ===
insert_ok : 25266688635
insert_alt: 11703285605

Background

I'm implementing a cryptanalytic attack in C++11 and need to find many collisions between two large lists (both generated on the fly). A crucial part of the attack thus just consists of two critical loops:

  1. first populating a hash table with one list
  2. then matching the other list against the hash table.

The hash table operations are thus performance critical: a factor-3 slowdown in them makes the whole attack 3x slower.
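To make the two loops concrete, here is a minimal sketch of the populate-then-match pattern. It is not the attack code: `find_collisions` is a hypothetical name, both lists are assumed to be already materialised as vectors (the real ones are generated on the fly), and `std::unordered_multimap` stands in for the custom bucketed table purely for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical helper: returns all index pairs (i, j) with first[i] == second[j].
std::vector<std::pair<std::size_t, std::size_t>> find_collisions(
    const std::vector<uint64_t>& first, const std::vector<uint64_t>& second)
{
    // Loop 1: populate the hash table with the first list.
    std::unordered_multimap<uint64_t, std::size_t> table;
    table.reserve(first.size());
    for (std::size_t i = 0; i < first.size(); ++i)
        table.emplace(first[i], i);

    // Loop 2: match every element of the second list against the table.
    std::vector<std::pair<std::size_t, std::size_t>> hits;
    for (std::size_t j = 0; j < second.size(); ++j) {
        auto range = table.equal_range(second[j]);
        for (auto it = range.first; it != range.second; ++it)
            hits.emplace_back(it->second, j);
    }
    return hits;
}
```

Both loops do essentially one random memory access per element, which is why the cost of a single table operation dominates the running time once the table no longer fits in cache.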

Regarding design: besides trying to minimize memory usage, I'm also trying to have a typical hash table operation touch just a single cacheline, as I expect that will increase overall attack performance, especially when running the attack on all CPU cores.
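One way to express the single-cacheline goal is to size and align each bucket to 64 bytes and let the compiler check it. The sketch below is an assumption for illustration, not the actual `bucket_t`: the names `CacheLineBucket`, `kBucketSlots` and `try_insert`, the capacity of 5, and the 64-bit key / 32-bit value widths are all made up so that the fields add up to exactly 64 bytes.

```cpp
#include <cstddef>
#include <cstdint>

constexpr std::size_t kBucketSlots = 5;  // hypothetical capacity

// Assumed layout: 40 + 20 + 4 = 64 bytes, i.e. one cache line on typical x86 parts.
struct alignas(64) CacheLineBucket {
    uint64_t keys[kBucketSlots];    // 5 * 8 = 40 bytes
    uint32_t values[kBucketSlots];  // 5 * 4 = 20 bytes
    uint32_t size;                  // number of occupied slots
};

static_assert(sizeof(CacheLineBucket) == 64, "bucket must fill exactly one cache line");
static_assert(alignof(CacheLineBucket) == 64, "bucket must not straddle two cache lines");

// Stores (key, value) in the first free slot; returns false if the bucket is full.
inline bool try_insert(CacheLineBucket& b, uint64_t key, uint32_t value)
{
    if (b.size == kBucketSlots)
        return false;
    b.keys[b.size] = key;
    b.values[b.size] = value;
    ++b.size;
    return true;
}
```

Note that in C++11 a plain `new[]` or `std::vector` is not guaranteed to honour the 64-byte alignment for heap storage, so the table itself would need an aligned allocation for the static checks to carry over to the heap.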

ANSWER

Answered 2021-Oct-25 at 22:53
Summary

The TL;DR is that loads which miss all levels of the TLB (and so require a page walk) and which are separated by address-unknown stores can't execute in parallel, i.e., the loads are serialized and the memory-level parallelism (MLP) factor is capped at 1. Effectively, the stores fence the loads, much as lfence would.

The slow version of your insert function results in this scenario, while the other two don't (the store address is known). For large region sizes the memory access pattern dominates, and the performance is almost directly related to the MLP: the fast versions can overlap load misses and get an MLP of about 3, resulting in a 3x speedup (and the narrower reproduction case we discuss below can show more than a 10x difference on Skylake).
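The difference between the two kinds of store can be sketched in isolation. This is a simplified illustration under stated assumptions, not the benchmark itself: `Bucket`, `kSlots`, `put_dependent`, `put_scanned` and `fill` are hypothetical names, and whether the timing gap shows up depends on the CPU, the region size and on the compiler not rewriting one variant into the other.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kSlots = 4;  // hypothetical bucket capacity

struct Bucket {
    uint32_t size;
    uint32_t keys[kSlots];
};

// Variant 1: the store address is computed from the loaded B.size, so it
// stays unknown until that (possibly TLB-missing) load has returned.
inline void put_dependent(Bucket& B, uint32_t k)
{
    if (B.size != kSlots) {
        B.keys[B.size] = k;
        ++B.size;
    }
}

// Variant 2: same result, but the store index is the loop counter; the
// loaded size only feeds a branch, which the CPU can predict.
inline void put_scanned(Bucket& B, uint32_t k)
{
    if (B.size != kSlots) {
        for (uint32_t i = 0; i < kSlots; ++i) {
            if (i == B.size) {
                B.keys[i] = k;
                ++B.size;
                return;
            }
        }
    }
}

// Applies one of the variants to a pseudo-random sequence of buckets.
template <typename Put>
void fill(std::vector<Bucket>& table, std::size_t n, Put put)
{
    uint64_t x = 88172645463325252ULL;  // xorshift64 state
    for (std::size_t i = 0; i < n; ++i) {
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        put(table[x % table.size()], static_cast<uint32_t>(i));
    }
}
```

Both variants leave the table in the same state; only the way the store address becomes known differs. To look for the effect, the table would have to be far larger than the TLB reach (the tables in the question have over 100 million buckets), with each variant timed separately.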

The underlying reason seems to be that the Skylake processor tries to maintain page-table coherence, which is not required by the specification but can work around bugs in software.

The Details

For those who are interested, we'll dig into the details of what's going on.

I could reproduce the problem immediately on my Skylake i7-6700HQ machine, and by stripping out extraneous parts we can reduce the original hash insert benchmark to this simple loop, which exhibits the same issue:

# see https://github.com/cr-marcstevens/hashtable_mystery
$ ./test.sh
model name      : Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
==============================
CXX=g++    CXXFLAGS=-std=c++11 -O2 -march=native -falign-functions=64
tablesize: 117440512 elements: 67108864 loadfactor=0.571429
- test insert_ok : 11200ms
- test insert_bad: 3164ms
  (outcome identical to insert_ok: true)
- test insert_alt: 3366ms
  (outcome identical to insert_ok: true)

tablesize: 117440813 elements: 67108864 loadfactor=0.571427
- test insert_ok : 10840ms
- test insert_bad: 3301ms
  (outcome identical to insert_ok: true)
- test insert_alt: 3579ms
  (outcome identical to insert_ok: true)

// insert element in hash_table
inline void insert_ok(uint64_t k)
{
    // compute target bucket
    uint64_t b = mod(k);
    // bounded linear search for first non-full bucket
    for (size_t c = 0; c < 1024; ++c)
    {
        bucket_t& B = table_ok[b];
        // if bucket non-full then store element and return
        if (B.size != bucket_size)
        {
            B.keys[B.size] = k;
            B.values[B.size] = 1;
            ++B.size;
            ++table_count;
            return;
        }
        // increase b w/ wrap around
        if (++b == table_size)
            b = 0;
    }
}

// equivalent to insert_ok
// but uses a stupid linear search to store the element at the target position
inline void insert_bad(uint64_t k)
{
    // compute target bucket
    uint64_t b = mod(k);
    // bounded linear search for first non-full bucket
    for (size_t c = 0; c < 1024; ++c)
    {
        bucket_t& B = table_bad[b];
        // if bucket non-full then store element and return
        if (B.size != bucket_size)
        {
            for (size_t i = 0; i < bucket_size; ++i)
            {
                if (i == B.size)
                {
                    B.keys[i] = k;
                    B.values[i] = 1;
                    ++B.size;
                    ++table_count;
                    return;
                }
            }
        }
        // increase b w/ wrap around
        if (++b == table_size)
            b = 0;
    }
}

// instead of using bucket_t.size, empty elements are marked by special empty_key value
// a bucket is filled first to last, so bucket is full if last element key != empty_key
uint64_t empty_key = ~uint64_t(0);
inline void insert_alt(uint64_t k)
{
    // compute target bucket
    uint64_t b = mod(k);
    // bounded linear search for first non-full bucket
    for (size_t c = 0; c < 1024; ++c)
    {
        bucket_t& B = table_alt[b];
        // if bucket non-full then store element and return
        if (B.keys[bucket_size-1] == empty_key)
        {
            for (size_t i = 0; i < bucket_size; ++i)
            {
                if (B.keys[i] == empty_key)
                {
                    B.keys[i] = k;
                    B.values[i] = 1;
                    ++table_count;
                    return;
                }
            }
        }
        // increase b w/ wrap around
        if (++b == table_size)
            b = 0;
    }
}
192=== cache-references ===
193insert_ok : 445465943
194insert_alt: 590770464
195=== cpu-clock ===
196insert_ok : 10333.94 msec cpu-clock:u # 1.000 CPUs utilized
197insert_alt: 4766.53 msec cpu-clock:u # 1.000 CPUs utilized
198=== cpu-cycles ===
199insert_ok : 25273361574
200insert_alt: 11675804743
201=== cpu_clk_thread_unhalted.one_thread_active ===
202insert_ok : 223196489
203insert_alt: 88616919
204=== cpu_clk_thread_unhalted.ref_xclk ===
205insert_ok : 222719013
206insert_alt: 88467292
207=== cpu_clk_unhalted.one_thread_active ===
208insert_ok : 223380608
209insert_alt: 88212476
210=== cpu_clk_unhalted.ref_tsc ===
211insert_ok : 32663820508
212insert_alt: 12901195392
213=== cpu_clk_unhalted.ref_xclk ===
214insert_ok : 221957996
215insert_alt: 88390991
216insert_alt: === cpu_clk_unhalted.ring0_trans ===
217insert_ok : 374
218insert_alt: 373
219=== cpu_clk_unhalted.thread ===
220insert_ok : 25286801620
221insert_alt: 11714137483
222=== cycle_activity.cycles_l1d_miss ===
223insert_ok : 16278956219
224insert_alt: 7417877493
225=== cycle_activity.cycles_l2_miss ===
226insert_ok : 15607833569
227insert_alt: 7054717199
228=== cycle_activity.cycles_l3_miss ===
229insert_ok : 12987627072
230insert_alt: 6745771672
231=== cycle_activity.cycles_mem_any ===
232insert_ok : 23440206343
233insert_alt: 9027220495
234=== cycle_activity.stalls_l1d_miss ===
235insert_ok : 16194872307
236insert_alt: 4718344050
237=== cycle_activity.stalls_l2_miss ===
238insert_ok : 15350067722
239insert_alt: 4578933898
240=== cycle_activity.stalls_l3_miss ===
241insert_ok : 12697354271
242insert_alt: 4457980047
243=== cycle_activity.stalls_mem_any ===
244insert_ok : 20930005455
245insert_alt: 4555461595
246=== cycle_activity.stalls_total ===
247insert_ok : 22243173394
248insert_alt: 6561416461
249=== dTLB-load-misses ===
250insert_ok : 67817362
251insert_alt: 63603879
252=== dTLB-loads ===
253insert_ok : 775467642
254insert_alt: 1038562488
255=== dTLB-store-misses ===
256insert_ok : 8823481
257insert_alt: 13050341
258=== dTLB-stores ===
259insert_ok : 621353007
260insert_alt: 554244145
261=== dsb2mite_switches.count ===
262insert_ok : 93894397
263insert_alt: 315793354
264=== dsb2mite_switches.penalty_cycles ===
265insert_ok : 9216240937
266insert_alt: 206393788
267=== dtlb_load_misses.miss_causes_a_walk ===
268insert_ok : 177266866
269insert_alt: 101439773
270=== dtlb_load_misses.stlb_hit ===
271insert_ok : 2994329
272insert_alt: 35601646
273=== dtlb_load_misses.walk_active ===
274insert_ok : 4747616986
275insert_alt: 3893609232
276=== dtlb_load_misses.walk_completed ===
277insert_ok : 67817832
278insert_alt: 63591832
279=== dtlb_load_misses.walk_completed_4k ===
280insert_ok : 67817841
281insert_alt: 63596148
282=== dtlb_load_misses.walk_pending ===
283insert_ok : 6495600072
284insert_alt: 5987182579
285=== dtlb_store_misses.miss_causes_a_walk ===
286insert_ok : 89895924
287insert_alt: 21841494
288=== dtlb_store_misses.stlb_hit ===
289insert_ok : 4940907
290insert_alt: 21970231
291=== dtlb_store_misses.walk_active ===
292insert_ok : 1784142210
293insert_alt: 903334856
294=== dtlb_store_misses.walk_completed ===
295insert_ok : 8845884
296insert_alt: 13071262
297=== dtlb_store_misses.walk_completed_4k ===
298insert_ok : 8822993
299insert_alt: 12936414
300=== dtlb_store_misses.walk_pending ===
301insert_ok : 1842905733
302insert_alt: 933039119
=== exe_activity.1_ports_util ===
insert_ok : 991400575
insert_alt: 1433908710
=== exe_activity.2_ports_util ===
insert_ok : 782270731
insert_alt: 1314443071
=== exe_activity.3_ports_util ===
insert_ok : 556847358
insert_alt: 1158115803
=== exe_activity.4_ports_util ===
insert_ok : 427323800
insert_alt: 783571280
=== exe_activity.bound_on_stores ===
insert_ok : 299732094
insert_alt: 303475333
=== exe_activity.exe_bound_0_ports ===
insert_ok : 227569792
insert_alt: 348959512
=== frontend_retired.dsb_miss ===
insert_ok : 6771584
insert_alt: 93700643
=== frontend_retired.itlb_miss ===
insert_ok : 1115
insert_alt: 1689
=== frontend_retired.l1i_miss ===
insert_ok : 3639
insert_alt: 3857
=== frontend_retired.l2_miss ===
insert_ok : 2826
insert_alt: 2830
=== frontend_retired.latency_ge_1 ===
insert_ok : 9206268
insert_alt: 178345368
=== frontend_retired.latency_ge_128 ===
insert_ok : 2708
insert_alt: 2703
=== frontend_retired.latency_ge_16 ===
insert_ok : 403492
insert_alt: 820950
=== frontend_retired.latency_ge_2 ===
insert_ok : 4981263
insert_alt: 85781924
=== frontend_retired.latency_ge_256 ===
insert_ok : 802
insert_alt: 970
=== frontend_retired.latency_ge_2_bubbles_ge_1 ===
insert_ok : 56936702
insert_alt: 225712704
=== frontend_retired.latency_ge_2_bubbles_ge_2 ===
insert_ok : 10312026
insert_alt: 163227996
=== frontend_retired.latency_ge_2_bubbles_ge_3 ===
insert_ok : 7599252
insert_alt: 122841752
=== frontend_retired.latency_ge_32 ===
insert_ok : 3599
insert_alt: 3317
=== frontend_retired.latency_ge_4 ===
insert_ok : 2627373
insert_alt: 42287077
=== frontend_retired.latency_ge_512 ===
insert_ok : 418
insert_alt: 241
=== frontend_retired.latency_ge_64 ===
insert_ok : 2474
insert_alt: 2802
=== frontend_retired.latency_ge_8 ===
insert_ok : 528748
insert_alt: 951836
=== frontend_retired.stlb_miss ===
insert_ok : 769
insert_alt: 562
=== hw_interrupts.received ===
insert_ok : 9330
insert_alt: 3738
=== iTLB-load-misses ===
insert_ok : 456094
insert_alt: 90739
=== iTLB-loads ===
insert_ok : 949
insert_alt: 1031
=== icache_16b.ifdata_stall ===
insert_ok : 1145821
insert_alt: 862403
=== icache_64b.iftag_hit ===
insert_ok : 1378406022
insert_alt: 4459469241
=== icache_64b.iftag_miss ===
insert_ok : 61812
insert_alt: 57204
=== icache_64b.iftag_stall ===
insert_ok : 56551468
insert_alt: 82354039
=== idq.all_dsb_cycles_4_uops ===
insert_ok : 896374829
insert_alt: 1610100578
=== idq.all_dsb_cycles_any_uops ===
insert_ok : 1217878089
insert_alt: 2739912727
=== idq.all_mite_cycles_4_uops ===
insert_ok : 315979501
insert_alt: 480165021
=== idq.all_mite_cycles_any_uops ===
insert_ok : 1053703958
insert_alt: 2251382760
=== idq.dsb_cycles ===
insert_ok : 1218891711
insert_alt: 2744099964
=== idq.dsb_uops ===
insert_ok : 5828442701
insert_alt: 10445095004
=== idq.mite_cycles ===
insert_ok : 470409312
insert_alt: 1664892371
=== idq.mite_uops ===
insert_ok : 1407396065
insert_alt: 4515396737
=== idq.ms_cycles ===
insert_ok : 583601361
insert_alt: 587996351
=== idq.ms_dsb_cycles ===
insert_ok : 218346
insert_alt: 74155
=== idq.ms_mite_uops ===
insert_ok : 1266443204
insert_alt: 1277980465
=== idq.ms_switches ===
insert_ok : 149106449
insert_alt: 150392336
=== idq.ms_uops ===
insert_ok : 1266950097
insert_alt: 1277330690
=== idq_uops_not_delivered.core ===
insert_ok : 1871959581
insert_alt: 6531069387
=== idq_uops_not_delivered.cycles_0_uops_deliv.core ===
insert_ok : 289301660
insert_alt: 946930713
=== idq_uops_not_delivered.cycles_fe_was_ok ===
insert_ok : 24668869613
insert_alt: 9335642949
=== idq_uops_not_delivered.cycles_le_1_uop_deliv.core ===
insert_ok : 393750384
insert_alt: 1344106460
=== idq_uops_not_delivered.cycles_le_2_uop_deliv.core ===
insert_ok : 506090534
insert_alt: 1824690188
=== idq_uops_not_delivered.cycles_le_3_uop_deliv.core ===
insert_ok : 688462029
insert_alt: 2416339045
=== ild_stall.lcp ===
insert_ok : 380
insert_alt: 480
=== inst_retired.any ===
insert_ok : 4760842560
insert_alt: 5470438932
=== inst_retired.any_p ===
insert_ok : 4760919037
insert_alt: 5470404264
=== inst_retired.prec_dist ===
insert_ok : 4760801654
insert_alt: 5470649220
=== inst_retired.total_cycles_ps ===
insert_ok : 25175372339
insert_alt: 11718929626
=== instructions ===
insert_ok : 4760805219
insert_alt: 5470497783
=== int_misc.clear_resteer_cycles ===
insert_ok : 199623562
insert_alt: 671083279
=== int_misc.recovery_cycles ===
insert_ok : 314434729
insert_alt: 704406698
=== itlb.itlb_flush ===
insert_ok : 303
insert_alt: 248
=== itlb_misses.miss_causes_a_walk ===
insert_ok : 19537
insert_alt: 116729
=== itlb_misses.stlb_hit ===
insert_ok : 11323
insert_alt: 5557
=== itlb_misses.walk_active ===
insert_ok : 2809766
insert_alt: 4070194
=== itlb_misses.walk_completed ===
insert_ok : 24298
insert_alt: 45251
=== itlb_misses.walk_completed_4k ===
insert_ok : 34084
insert_alt: 29759
=== itlb_misses.walk_pending ===
insert_ok : 853764
insert_alt: 2817933
=== l1d.replacement ===
insert_ok : 171135334
insert_alt: 244967326
=== l1d_pend_miss.fb_full ===
insert_ok : 354631656
insert_alt: 382309583
=== l1d_pend_miss.pending ===
insert_ok : 16792436441
insert_alt: 22979721104
=== l1d_pend_miss.pending_cycles ===
insert_ok : 16377420892
insert_alt: 7349245429
=== l1d_pend_miss.pending_cycles_any ===
insert_ok :
insert_alt:
=== l2_lines_in.all ===
insert_ok : 303009088
insert_alt: 411750486
=== l2_lines_out.non_silent ===
insert_ok : 157208112
insert_alt: 309484666
=== l2_lines_out.silent ===
insert_ok : 127379047
insert_alt: 84169481
=== l2_lines_out.useless_hwpf ===
insert_ok : 70374658
insert_alt: 144359127
=== l2_lines_out.useless_pref ===
insert_ok : 70747103
insert_alt: 142931540
=== l2_rqsts.all_code_rd ===
insert_ok : 71254
insert_alt: 242327
=== l2_rqsts.all_demand_data_rd ===
insert_ok : 137366274
insert_alt: 143507049
=== l2_rqsts.all_demand_miss ===
insert_ok : 150071420
insert_alt: 150820168
=== l2_rqsts.all_demand_references ===
insert_ok : 154854022
insert_alt: 160487082
=== l2_rqsts.all_pf ===
insert_ok : 170261458
insert_alt: 282476184
=== l2_rqsts.all_rfo ===
insert_ok : 17575896
insert_alt: 16938897
=== l2_rqsts.code_rd_hit ===
insert_ok : 79800
insert_alt: 381566
=== l2_rqsts.code_rd_miss ===
insert_ok : 25800
insert_alt: 33755
=== l2_rqsts.demand_data_rd_hit ===
insert_ok : 5191029
insert_alt: 9831101
=== l2_rqsts.demand_data_rd_miss ===
insert_ok : 132253891
insert_alt: 133965310
=== l2_rqsts.miss ===
insert_ok : 305347974
insert_alt: 414758839
=== l2_rqsts.pf_hit ===
insert_ok : 14639778
insert_alt: 19484420
=== l2_rqsts.pf_miss ===
insert_ok : 156092998
insert_alt: 263293430
=== l2_rqsts.references ===
insert_ok : 326549998
insert_alt: 443460029
=== l2_rqsts.rfo_hit ===
insert_ok : 11650
insert_alt: 21474
=== l2_rqsts.rfo_miss ===
insert_ok : 17544467
insert_alt: 16835137
=== l2_trans.l2_wb ===
insert_ok : 157044674
insert_alt: 308107712
=== ld_blocks.no_sr ===
insert_ok : 14
insert_alt: 13
=== ld_blocks.store_forward ===
insert_ok : 158
insert_alt: 128
=== ld_blocks_partial.address_alias ===
insert_ok : 5155853
insert_alt: 17867414
=== load_hit_pre.sw_pf ===
insert_ok : 10840795
insert_alt: 11072297
=== longest_lat_cache.miss ===
insert_ok : 257061118
insert_alt: 471152073
=== longest_lat_cache.reference ===
insert_ok : 445701577
insert_alt: 583870610
=== machine_clears.count ===
insert_ok : 3926377
insert_alt: 4280080
=== machine_clears.memory_ordering ===
insert_ok : 97177
insert_alt: 25407
=== machine_clears.smc ===
insert_ok : 138579
insert_alt: 305423
=== mem-stores ===
insert_ok : 621353009
insert_alt: 554244143
=== mem_inst_retired.all_loads ===
insert_ok : 775473590
insert_alt: 1038559807
=== mem_inst_retired.all_stores ===
insert_ok : 621353013
insert_alt: 554244145
=== mem_inst_retired.lock_loads ===
insert_ok : 85
insert_alt: 85
=== mem_inst_retired.split_loads ===
insert_ok : 171
insert_alt: 174
=== mem_inst_retired.split_stores ===
insert_ok : 53
insert_alt: 49
=== mem_inst_retired.stlb_miss_loads ===
insert_ok : 68308539
insert_alt: 18088047
=== mem_inst_retired.stlb_miss_stores ===
insert_ok : 264054
insert_alt: 819551
=== mem_load_l3_hit_retired.xsnp_none ===
insert_ok : 231116
insert_alt: 175217
=== mem_load_retired.fb_hit ===
insert_ok : 6510722
insert_alt: 95952490
=== mem_load_retired.l1_hit ===
insert_ok : 698271530
insert_alt: 920982402
=== mem_load_retired.l1_miss ===
insert_ok : 69525335
insert_alt: 20089897
=== mem_load_retired.l2_hit ===
insert_ok : 1451905
insert_alt: 773356
=== mem_load_retired.l2_miss ===
insert_ok : 68085186
insert_alt: 19474303
=== mem_load_retired.l3_hit ===
insert_ok : 222829
insert_alt: 155958
=== mem_load_retired.l3_miss ===
insert_ok : 67879593
insert_alt: 19244746
=== memory_disambiguation.history_reset ===
insert_ok : 97621
insert_alt: 25831
=== minor-faults ===
insert_ok : 1048716
insert_alt: 1048718
=== node-loads ===
insert_ok : 71473780
insert_alt: 71377840
=== node-stores ===
insert_ok : 16781161
insert_alt: 16842666
=== offcore_requests.all_data_rd ===
insert_ok : 284186682
insert_alt: 392110677
=== offcore_requests.all_requests ===
insert_ok : 530876505
insert_alt: 777784382
=== offcore_requests.demand_code_rd ===
insert_ok : 34252
insert_alt: 45896
=== offcore_requests.demand_data_rd ===
insert_ok : 133468710
insert_alt: 134288893
=== offcore_requests.demand_rfo ===
insert_ok : 17612516
insert_alt: 17062276
=== offcore_requests.l3_miss_demand_data_rd ===
insert_ok : 71616594
insert_alt: 82917520
=== offcore_requests_buffer.sq_full ===
insert_ok : 2001445
insert_alt: 3113287
=== offcore_requests_outstanding.all_data_rd ===
insert_ok : 35577129549
insert_alt: 78698308135
=== offcore_requests_outstanding.cycles_with_data_rd ===
insert_ok : 17518017620
insert_alt: 7940272202
=== offcore_requests_outstanding.demand_code_rd ===
insert_ok : 11085819
insert_alt: 9390881
=== offcore_requests_outstanding.demand_data_rd ===
insert_ok : 15902243707
insert_alt: 21097348926
=== offcore_requests_outstanding.demand_data_rd_ge_6 ===
insert_ok : 1225437
insert_alt: 317436422
=== offcore_requests_outstanding.demand_rfo ===
insert_ok : 1074492442
insert_alt: 1157902315
=== offcore_response.demand_code_rd.any_response ===
insert_ok : 53675
insert_alt: 69683
=== offcore_response.demand_code_rd.l3_hit.any_snoop ===
insert_ok : 19407
insert_alt: 29704
=== offcore_response.demand_code_rd.l3_hit.snoop_none ===
insert_ok : 12675
insert_alt: 11951
=== offcore_response.demand_code_rd.l3_miss.any_snoop ===
insert_ok : 34617
insert_alt: 40868
=== offcore_response.demand_code_rd.l3_miss.spl_hit ===
insert_ok : 0
insert_alt: 753
=== offcore_response.demand_data_rd.any_response ===
insert_ok : 131014821
insert_alt: 134813171
=== offcore_response.demand_data_rd.l3_hit.any_snoop ===
insert_ok : 59713328
insert_alt: 50254543
=== offcore_response.demand_data_rd.l3_miss.any_snoop ===
insert_ok : 71431585
insert_alt: 83916030
=== offcore_response.demand_data_rd.l3_miss.spl_hit ===
insert_ok : 244837
insert_alt: 6441992
=== offcore_response.demand_rfo.any_response ===
insert_ok : 16876557
insert_alt: 17619450
=== offcore_response.demand_rfo.l3_hit.any_snoop ===
insert_ok : 907432
insert_alt: 45127
=== offcore_response.demand_rfo.l3_hit.snoop_none ===
insert_ok : 787567
insert_alt: 794579
=== offcore_response.demand_rfo.l3_hit_e.any_snoop ===
insert_ok : 496938
insert_alt: 173658
=== offcore_response.demand_rfo.l3_hit_e.snoop_none ===
insert_ok : 779919
insert_alt: 50575
=== offcore_response.demand_rfo.l3_hit_m.any_snoop ===
insert_ok : 128627
insert_alt: 25483
=== offcore_response.demand_rfo.l3_miss.any_snoop ===
insert_ok : 16782186
insert_alt: 16847970
=== offcore_response.demand_rfo.l3_miss.snoop_none ===
insert_ok : 16782647
insert_alt: 16850104
=== offcore_response.demand_rfo.l3_miss.spl_hit ===
insert_ok : 0
insert_alt: 1364
=== offcore_response.other.any_response ===
insert_ok : 137231000
insert_alt: 189526494
=== offcore_response.other.l3_hit.any_snoop ===
insert_ok : 62695084
insert_alt: 51005882
=== offcore_response.other.l3_hit.snoop_none ===
insert_ok : 62975018
insert_alt: 50217349
=== offcore_response.other.l3_hit_e.any_snoop ===
insert_ok : 62770215
insert_alt: 50691817
=== offcore_response.other.l3_hit_e.snoop_none ===
insert_ok : 62602591
insert_alt: 50642954
=== offcore_response.other.l3_miss.any_snoop ===
insert_ok : 74247236
insert_alt: 139212975
=== offcore_response.other.l3_miss.snoop_none ===
insert_ok : 75911794
insert_alt: 141076520
=== other_assists.any ===
insert_ok : 1
insert_alt: 3
=== page-faults ===
insert_ok : 1048719
insert_alt: 1048718
=== partial_rat_stalls.scoreboard ===
insert_ok : 530950991
insert_alt: 539869553
=== ref-cycles ===
insert_ok : 32546980212
insert_alt: 12930921138
=== resource_stalls.any ===
insert_ok : 21923576648
insert_alt: 5205690082
=== resource_stalls.sb ===
insert_ok : 397908667
insert_alt: 402738367
=== rs_events.empty_cycles ===
insert_ok : 1173721723
insert_alt: 1880165720
=== rs_events.empty_end ===
insert_ok : 87752182
insert_alt: 160792701
=== sw_prefetch_access.t0 ===
insert_ok : 20835202
insert_alt: 20599176
=== task-clock ===
insert_ok : 10416.86 msec task-clock:u # 1.000 CPUs utilized
insert_alt: 4767.78 msec task-clock:u # 1.000 CPUs utilized
=== tlb_flush.stlb_any ===
insert_ok : 1835393
insert_alt: 1835396
=== topdown-fetch-bubbles ===
insert_ok : 1904143421
insert_alt: 6543146396
=== topdown-slots-issued ===
insert_ok : 7538371393
insert_alt: 14449966516
=== topdown-slots-retired ===
insert_ok : 5267325162
insert_alt: 5849706597
=== uops_dispatched_port.port_0 ===
insert_ok : 1252121297
insert_alt: 1489605354
=== uops_dispatched_port.port_1 ===
insert_ok : 1379316967
insert_alt: 1585037107
=== uops_dispatched_port.port_2 ===
insert_ok : 1140861153
insert_alt: 1785053149
=== uops_dispatched_port.port_3 ===
insert_ok : 1187151423
insert_alt: 1828975838
=== uops_dispatched_port.port_4 ===
insert_ok : 1577171758
insert_alt: 1557761857
=== uops_dispatched_port.port_5 ===
insert_ok : 1341370655
insert_alt: 1653599117
=== uops_dispatched_port.port_6 ===
insert_ok : 1856735970
insert_alt: 4387464794
=== uops_dispatched_port.port_7 ===
insert_ok : 508351498
insert_alt: 603583315
=== uops_executed.core ===
insert_ok : 7225522677
insert_alt: 12716368190
=== uops_executed.core_cycles_ge_1 ===
insert_ok : 3041586797
insert_alt: 5168421550
=== uops_executed.core_cycles_ge_2 ===
insert_ok : 2017794537
insert_alt: 3653591208
=== uops_executed.core_cycles_ge_3 ===
insert_ok : 1225785335
insert_alt: 2316014066
=== uops_executed.core_cycles_ge_4 ===
insert_ok : 657121809
insert_alt: 1143390519
=== uops_executed.core_cycles_none ===
insert_ok : 22191507320
insert_alt: 6563722081
=== uops_executed.cycles_ge_1_uop_exec ===
insert_ok : 3040999757
insert_alt: 5175668459
=== uops_executed.cycles_ge_2_uops_exec ===
insert_ok : 2015520940
insert_alt: 3659989196
=== uops_executed.cycles_ge_3_uops_exec ===
insert_ok : 1224025952
insert_alt: 2319025110
=== uops_executed.cycles_ge_4_uops_exec ===
insert_ok : 657094113
insert_alt: 1141381027
=== uops_executed.stall_cycles ===
insert_ok : 22350754164
insert_alt: 6590978048
=== uops_executed.thread ===
insert_ok : 7214521925
insert_alt: 12697219901
=== uops_executed.x87 ===
insert_ok : 2992
insert_alt: 3337
=== uops_issued.any ===
insert_ok : 7531354736
insert_alt: 14462113169
=== uops_issued.slow_lea ===
insert_ok : 2136241
insert_alt: 2115308
=== uops_issued.stall_cycles ===
insert_ok : 23244177475
insert_alt: 7416801878
=== uops_retired.macro_fused ===
insert_ok : 410461916
insert_alt: 735050350
=== uops_retired.retire_slots ===
insert_ok : 5265023980
insert_alt: 5855259326
=== uops_retired.stall_cycles ===
insert_ok : 23513958928
insert_alt: 9630258867
=== uops_retired.total_cycles ===
insert_ok : 25266688635
insert_alt: 11703285605

tlb_fencing:

    xor     eax, eax  ; the index pointer
    mov     r9 , [rsi + region.start]

    mov     r8 , [rsi + region.size]
    sub     r8 , 200                   ; pointer to end of region (plus a bit of buffer)

    mov     r10, [rsi + region.size]
    sub     r10, 1 ; mask

    mov     rsi, r9   ; region start

.top:
    mov     rcx, rax
    and     rcx, r10        ; remap the index into the region via masking
    add     rcx, r9         ; make pointer p into the region
    mov     rdx, [rcx]      ; load 8 bytes at p, always zero
    xor     rcx, rcx        ; no-op
    mov     DWORD [rsi + rdx + 160], 0 ; store zero at region start + 160 (rdx is always zero)
    add     rax, (64 * 67)  ; advance a prime number of cache lines slightly larger than a page

    dec     rdi
    jnz     .top

    ret

This is roughly equivalent to the B.size access (the load) and the B.values[B.size] = 1 access (the store) of the innermost loop of insert_ok [4].

Concentrating on the loop, we do a strided load and a fixed store, then move the load location forward by a bit more than the size of a page (4 KiB). Critically, the store address depends on the result of the load: the addressing expression [rsi + rdx + 160] includes rdx, which is the register holding the loaded value [1]. The store always goes to the same address, since none of the address components changes in the loop (so we expect it to always hit in L1).
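
To make the access pattern concrete, here is a small C++ sketch of the loop's address arithmetic. It is only an illustration and not the benchmark itself: the names load_offsets, page_of, kStride and kPageSize are made up for this example, the region size is assumed to be a power of two (which the mask computation in the assembly also relies on), and no memory is actually touched.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Stride used by the assembly loop: 67 cache lines of 64 bytes (4288 bytes),
// i.e. slightly more than one 4 KiB page.
constexpr std::uint64_t kStride = 64 * 67;
constexpr std::uint64_t kPageSize = 4096;

// Returns the byte offsets (relative to the region start) that the loop
// would load from. Mirrors `and rcx, r10` with r10 = region_size - 1.
// Assumes region_size is a power of two.
inline std::vector<std::uint64_t> load_offsets(std::uint64_t region_size,
                                               std::size_t iterations)
{
    assert(region_size != 0 && (region_size & (region_size - 1)) == 0);
    const std::uint64_t mask = region_size - 1;
    std::vector<std::uint64_t> offsets;
    offsets.reserve(iterations);
    std::uint64_t index = 0;             // rax in the assembly
    for (std::size_t i = 0; i < iterations; ++i)
    {
        offsets.push_back(index & mask); // wrap the index into the region
        index += kStride;                // advance by a bit more than a page
    }
    return offsets;
}

// Page number (within the region) that a given offset falls on.
inline std::uint64_t page_of(std::uint64_t offset)
{
    return offset / kPageSize;
}
```

Until the index wraps around the region, consecutive offsets always fall on different 4 KiB pages, so every iteration needs a different address translation.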

The original hash example did a lot more work, accessed memory randomly, and had the store go to the same line as the load, but this simple loop captures the same effect.

We also use one other version of the benchmark, which is identical except that the no-op xor rcx, rcx between the load and the store is replaced by xor rdx, rdx. This breaks the dependency between the load and the store address.

Naively, we don't expect this dependency to do much. The stores here are fire-and-forget: we don't read from the stored location again (at least not for many iterations) so they aren't part of any carried dependency chain. For small regions we expect the bottleneck to be just chewing through the ~8 uops and for large regions we expect the time to handle all the cache misses to dominate: critically, we expect many misses to be handled in parallel since the load addresses can be independently calculated from simple non-memory uops.

Below is the performance in cycles per iteration for region sizes from 4 KiB up to 256 MiB, with the following three variations:

2M dep: The loop shown above (with the store address dependent on load) with 2 MiB huge pages.

4K dep: The loop shown above (with the store address dependent on load) with standard 4 KiB pages.

4K indep: The variant of the above loop with xor rdx, rdx replacing xor rcx, rcx to break the dependency between the load result and the store address, using 4 KiB pages.

The result:

[Figure: shows the dep case sucking when the region is 8 MiB or more]

The performance of all the variants is basically identical for small region sizes. Everything up to 256 KiB takes 2 cycles/iteration, limited simply by the 8 uops in the loop and the CPU width of 4 uops/cycle. A bit of math shows that we have decent MLP (memory level parallelism): an L2 cache hit has a latency of 12 cycles, but we are completing one every 2 cycles, so on average we must be overlapping the latency of 6 L1 misses to achieve that.
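
The "bit of math" above can be written down as a one-line helper. This is just a sketch of the arithmetic; mlp_lower_bound is a hypothetical name, and the 12-cycle L2 latency and 2 cycles/iteration are the figures quoted in the text.

```cpp
#include <cassert>

// Lower bound on average memory-level parallelism: if each miss takes
// `latency_cycles` to resolve but one completes every
// `cycles_per_iteration`, then at least latency / throughput misses must
// be in flight on average.
inline double mlp_lower_bound(double latency_cycles, double cycles_per_iteration)
{
    assert(cycles_per_iteration > 0.0);
    return latency_cycles / cycles_per_iteration;
}
```

For example, mlp_lower_bound(12.0, 2.0) gives 6, matching the six overlapped L1 misses mentioned above.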

Between 256 KiB and 4096 KiB the performance degrades somewhat as L3 hits start happening, but performance is good and MLP high.

At 8192 KiB, performance degrades catastrophically for the 4K dep case only, crossing 150 cycles and eventually stabilizing at about 220 cycles. It is more than 10 times slower than the other two cases [2].

We can already make some key observations:

  • Both the 2M dep and the 4K indep cases are fast: so this is not just about the dependency between the load and the store, but also about paging behavior.
  • The 2M dep case is fastest of all, so we know the dependency doesn't cause some fundamental problem even when you miss to memory.
  • The performance of the slow 4K dep case is suspiciously similar to the memory latency of my machine.

I've mentioned MLP above and calculated a lower bound on the MLP based on the observed performance, but on Intel CPUs we can measure MLP directly using two performance counters:

l1d_pend_miss.pending

Counts duration of L1D miss outstanding, that is each cycle number of Fill Buffers (FB) outstanding required by Demand Reads.

l1d_pend_miss.pending_cycles

Cycles with L1D load Misses outstanding

The first counts, every cycle, how many requests are outstanding from the L1D. So if 3 misses are in progress, this counter increments by 3 every cycle. The second counter increments by 1 in every cycle in which at least one miss is in progress. You can see it as a version of the first counter which saturates at 1 every cycle. The ratio l1d_pend_miss.pending / l1d_pend_miss.pending_cycles of these counters over a period of time is the average MLP factor while any miss is outstanding [3].
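
As a sketch of how this ratio is computed, the helper below divides the two counters. The name average_mlp is hypothetical. The sample inputs in the usage note are the l1d_pend_miss values from the perf output of the hash table benchmark (insert_ok and insert_alt) listed earlier, not from the microbenchmark discussed here.

```cpp
#include <cassert>
#include <cstdint>

// Average MLP while at least one L1D miss is outstanding:
// l1d_pend_miss.pending / l1d_pend_miss.pending_cycles.
inline double average_mlp(std::uint64_t pending, std::uint64_t pending_cycles)
{
    assert(pending_cycles != 0);
    return static_cast<double>(pending) / static_cast<double>(pending_cycles);
}
```

With those counter values, average_mlp(16792436441ULL, 16377420892ULL) is about 1.03 for insert_ok, while average_mlp(22979721104ULL, 7349245429ULL) is about 3.13 for insert_alt.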

Let's plot that MLP ratio for the dep and indep versions of the 4K benchmark:

[Figure: shows that MLP tanks to 1 in the 4K dep case at 8 MiB]

The problem becomes very clear. Up to regions of 4096 KiB, performance is identical, and MLP is high (for very small region sizes there is "no" MLP since there are no L1D misses at all). Suddenly at 8192 KiB, the MLP for the dependent case drops to 1 and stays there, while in the independent case the MLP goes to almost 10. That alone basically explains the 10x performance difference: the dependent case is not able to overlap loads, at all.

Why? The problem seems to be TLB misses. What happens at 8192 KiB is that the benchmark starts missing the TLB. Specifically, each Skylake core has 1536 STLB (second-level TLB) entries which can cover 1536 × 4096 = 6 MiB of 4K pages. So right between the 4 and 8 MiB region sizes, TLB misses go to 1 per iteration based on dtlb_load_misses.walk_completed, leading to this almost-too-perfect-is-it-fake plot:

[Figure: shows that 1.0 page walks are done per iteration for both 4K cases at 8 MiB]

So that's what happens: when address-unknown stores are in the store buffer, loads that take STLB misses can't overlap: they go one at a time. So you suffer the full memory latency for every access. This also explains why the 2 MiB page case was fast: with 2 MiB pages the same 1536 entries can cover 3 GiB of memory, so there are no STLB misses or page walks for these region sizes.
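
The coverage figures can be checked with a few lines of arithmetic. The sketch below uses hypothetical names (tlb_reach, region_fits) and takes the 1536-entry STLB size from the text. It only computes how much memory the entries can map and ignores everything else about how a real TLB behaves.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint64_t KiB = 1024;
constexpr std::uint64_t MiB = 1024 * KiB;
constexpr std::uint64_t GiB = 1024 * MiB;

// Amount of memory (in bytes) that a TLB with `entries` entries can map
// when every entry covers one page of `page_size` bytes.
constexpr std::uint64_t tlb_reach(std::uint64_t entries, std::uint64_t page_size)
{
    return entries * page_size;
}

// True if a region of `region_size` bytes can be fully mapped at once.
constexpr bool region_fits(std::uint64_t region_size, std::uint64_t entries,
                           std::uint64_t page_size)
{
    return region_size <= tlb_reach(entries, page_size);
}

// Figures quoted in the text, for 1536 STLB entries.
static_assert(tlb_reach(1536, 4 * KiB) == 6 * MiB, "4 KiB pages cover 6 MiB");
static_assert(tlb_reach(1536, 2 * MiB) == 3 * GiB, "2 MiB pages cover 3 GiB");
```

A 4 MiB region fits within the 6 MiB reach of 4 KiB pages but an 8 MiB region does not, which is where the slowdown appears. With 2 MiB pages even the largest 256 MiB region fits easily.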

Why

This behavior seems to stem from the fact that Skylake and earlier Intel processors implement page table coherence, even though the x86 platform does not require it. Page table coherence means that if a store modifies an address mapping (for example), a subsequent load that uses a virtual address affected by the remapping will consistently see the new mapping without any explicit flushes.

This insight comes from Henry Wong, who reports in his excellent article on page walk coherence that, to do this, page walks are terminated if a conflicting or address-unknown store is encountered during the walk:

Unexpectedly, Intel Core 2 and newer systems behaved as though a pagewalk coherence misspeculation had occurred even though there were no page table modifications. These systems have memory dependence prediction, so the load should have speculatively executed much earlier than the store and broken the data dependence chain.

It turns out it is precisely the early-executing load that is responsible for the incorrectly-detected misspeculation. This gives a hint on how coherence violations may be detected: by comparing pagewalks to known older store addresses (in the store queue?), and assuming a coherence violation if there is an older store with a conflict or an unknown address.

So even though these stores are totally innocent in that they don't modify any page tables, they get caught up in the page table coherence mechanism. We can find further evidence of this theory by looking at the event dtlb_load_misses.miss_causes_a_walk. Unlike the walk_completed event, this counts all walks that started even if they don't complete successfully. That one looks like this (again, 2M isn't shown because it starts no page walks at all):

[Figure: shows that the dep case has slightly more than 2 walks per iteration]

Huh! The 4K dependent case shows two started walks, only one of which completes successfully. That's two walks for every load. This aligns with the theory that the page walk starts for the load in iteration N+1, but it finds the store from iteration N still sitting in the store buffer (since the load for iteration N provides its address, and it is still in progress). Since the address is unknown, the page walk is canceled as Henry describes. Further page walks are delayed until the store address is resolved. The upshot is that all the loads complete in a serialized fashion, because the page walk for load N+1 must wait for the result of load N.
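
The effect of this serialization on throughput can be illustrated with a toy timing model. This is a rough sketch and not a description of the real hardware: total_cycles is a hypothetical helper, and the parameters used in the usage note (220-cycle latency, a window of 10 overlapping loads, one load issued every 2 cycles) are assumptions loosely based on the numbers quoted in the text.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy model: total cycles to complete `n` loads that each take `latency`
// cycles. At most `window` loads may be in flight at once, and a new load
// can be issued every `issue_interval` cycles. window == 1 models the
// serialized case where load N+1 cannot proceed until load N has finished.
inline std::uint64_t total_cycles(std::size_t n, std::uint64_t latency,
                                  std::size_t window, std::uint64_t issue_interval)
{
    assert(window >= 1);
    std::vector<std::uint64_t> done(n); // completion time of each load
    for (std::size_t i = 0; i < n; ++i)
    {
        std::uint64_t start = issue_interval * i;
        if (i >= window)
            start = std::max(start, done[i - window]); // wait for a free slot
        done[i] = start + latency;
    }
    return n == 0 ? 0 : done[n - 1];
}
```

Under these assumptions, 1000 serialized loads (window of 1) take 220000 cycles, or 220 per load, while the same loads with a window of 10 take 22018 cycles, roughly 22 per load. That is the same order-of-magnitude gap as the one observed between the dep and indep cases.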

Why the "bad" and "alt" methods are fast

Finally, there is one remaining mystery. The above explains why the original hash access was slow, but not why the other two were fast. The key is that neither of the fast methods has address-unknown stores, because the data dependency on the load is replaced by a speculative control dependency.

Take a look at the inner loop for the insert_bad approach:

# see https://github.com/cr-marcstevens/hashtable_mystery
$ ./test.sh
model name      : Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
==============================
CXX=g++    CXXFLAGS=-std=c++11 -O2 -march=native -falign-functions=64
tablesize: 117440512 elements: 67108864 loadfactor=0.571429
- test insert_ok : 11200ms
- test insert_bad: 3164ms
  (outcome identical to insert_ok: true)
- test insert_alt: 3366ms
  (outcome identical to insert_ok: true)

tablesize: 117440813 elements: 67108864 loadfactor=0.571427
- test insert_ok : 10840ms
- test insert_bad: 3301ms
  (outcome identical to insert_ok: true)
- test insert_alt: 3579ms
  (outcome identical to insert_ok: true)

// insert element in hash_table
inline void insert_ok(uint64_t k)
{
    // compute target bucket
    uint64_t b = mod(k);
    // bounded linear search for first non-full bucket
    for (size_t c = 0; c < 1024; ++c)
    {
        bucket_t& B = table_ok[b];
        // if bucket non-full then store element and return
        if (B.size != bucket_size)
        {
            B.keys[B.size] = k;
            B.values[B.size] = 1;
            ++B.size;
            ++table_count;
            return;
        }
        // increase b w/ wrap around
        if (++b == table_size)
            b = 0;
    }
}

// equivalent to insert_ok
// but uses a stupid linear search to store the element at the target position
inline void insert_bad(uint64_t k)
{
    // compute target bucket
    uint64_t b = mod(k);
    // bounded linear search for first non-full bucket
    for (size_t c = 0; c < 1024; ++c)
    {
        bucket_t& B = table_bad[b];
        // if bucket non-full then store element and return
        if (B.size != bucket_size)
        {
            for (size_t i = 0; i < bucket_size; ++i)
            {
                if (i == B.size)
                {
                    B.keys[i] = k;
                    B.values[i] = 1;
                    ++B.size;
                    ++table_count;
                    return;
                }
            }
        }
        // increase b w/ wrap around
        if (++b == table_size)
            b = 0;
    }
}

// instead of using bucket_t.size, empty elements are marked by special empty_key value
// a bucket is filled first to last, so bucket is full if last element key != empty_key
uint64_t empty_key = ~uint64_t(0);
inline void insert_alt(uint64_t k)
{
    // compute target bucket
    uint64_t b = mod(k);
    // bounded linear search for first non-full bucket
    for (size_t c = 0; c < 1024; ++c)
    {
        bucket_t& B = table_alt[b];
        // if bucket non-full then store element and return
        if (B.keys[bucket_size-1] == empty_key)
        {
            for (size_t i = 0; i < bucket_size; ++i)
            {
                if (B.keys[i] == empty_key)
                {
                    B.keys[i] = k;
                    B.values[i] = 1;
                    ++table_count;
                    return;
                }
            }
        }
        // increase b w/ wrap around
        if (++b == table_size)
            b = 0;
    }
}

=== L1-dcache-load-misses ===
insert_ok : 171411476
insert_alt: 244244027
=== L1-dcache-loads ===
insert_ok : 775468123
insert_alt: 1038574743
=== L1-dcache-stores ===
insert_ok : 621353009
insert_alt: 554244145
=== L1-icache-load-misses ===
insert_ok : 69666
insert_alt: 259102
=== LLC-load-misses ===
insert_ok : 70519701
insert_alt: 71399242
=== LLC-loads ===
insert_ok : 130909270
insert_alt: 134776189
=== LLC-store-misses ===
insert_ok : 16782747
insert_alt: 16851787
=== LLC-stores ===
insert_ok : 17072141
insert_alt: 17534866
=== arith.divider_active ===
insert_ok : 26810
insert_alt: 26611
=== baclears.any ===
insert_ok : 2038060
insert_alt: 7648128
=== br_inst_retired.all_branches ===
insert_ok : 546479449
insert_alt: 938434022
=== br_inst_retired.all_branches_pebs ===
insert_ok : 546480454
insert_alt: 938412921
=== br_inst_retired.cond_ntaken ===
insert_ok : 237470651
insert_alt: 433439086
=== br_inst_retired.conditional ===
insert_ok : 477604946
insert_alt: 802468807
=== br_inst_retired.far_branch ===
insert_ok : 1058138
insert_alt: 1052510
=== br_inst_retired.near_call ===
insert_ok : 227076
insert_alt: 227074
=== br_inst_retired.near_return ===
insert_ok : 227072
insert_alt: 227070
=== br_inst_retired.near_taken ===
insert_ok : 307946256
insert_alt: 503926433
=== br_inst_retired.not_taken ===
insert_ok : 237458763
insert_alt: 433429466
=== br_misp_retired.all_branches ===
insert_ok : 36443541
insert_alt: 90626754
=== br_misp_retired.all_branches_pebs ===
insert_ok : 36441027
insert_alt: 90622375
=== br_misp_retired.conditional ===
insert_ok : 36454196
insert_alt: 90591031
=== br_misp_retired.near_call ===
insert_ok : 173
insert_alt: 169
=== br_misp_retired.near_taken ===
insert_ok : 19032467
insert_alt: 40361420
=== branch-instructions ===
insert_ok : 546476228
insert_alt: 938447476
=== branch-load-misses ===
insert_ok : 36441314
insert_alt: 90611299
=== branch-loads ===
insert_ok : 546472151
insert_alt: 938435143
=== branch-misses ===
insert_ok : 36436325
insert_alt: 90597372
=== bus-cycles ===
insert_ok : 222283508
insert_alt: 88243938
=== cache-misses ===
insert_ok : 257067753
insert_alt: 475091979
=== cache-references ===
insert_ok : 445465943
insert_alt: 590770464
=== cpu-clock ===
insert_ok : 10333.94 msec cpu-clock:u # 1.000 CPUs utilized
insert_alt: 4766.53 msec cpu-clock:u # 1.000 CPUs utilized
=== cpu-cycles ===
insert_ok : 25273361574
insert_alt: 11675804743
=== cpu_clk_thread_unhalted.one_thread_active ===
insert_ok : 223196489
insert_alt: 88616919
=== cpu_clk_thread_unhalted.ref_xclk ===
insert_ok : 222719013
insert_alt: 88467292
=== cpu_clk_unhalted.one_thread_active ===
insert_ok : 223380608
insert_alt: 88212476
=== cpu_clk_unhalted.ref_tsc ===
insert_ok : 32663820508
insert_alt: 12901195392
=== cpu_clk_unhalted.ref_xclk ===
insert_ok : 221957996
insert_alt: 88390991
insert_alt:
=== cpu_clk_unhalted.ring0_trans ===
insert_ok : 374
insert_alt: 373
=== cpu_clk_unhalted.thread ===
insert_ok : 25286801620
insert_alt: 11714137483
=== cycle_activity.cycles_l1d_miss ===
insert_ok : 16278956219
insert_alt: 7417877493
=== cycle_activity.cycles_l2_miss ===
insert_ok : 15607833569
insert_alt: 7054717199
=== cycle_activity.cycles_l3_miss ===
insert_ok : 12987627072
insert_alt: 6745771672
=== cycle_activity.cycles_mem_any ===
insert_ok : 23440206343
insert_alt: 9027220495
=== cycle_activity.stalls_l1d_miss ===
insert_ok : 16194872307
insert_alt: 4718344050
=== cycle_activity.stalls_l2_miss ===
insert_ok : 15350067722
insert_alt: 4578933898
=== cycle_activity.stalls_l3_miss ===
insert_ok : 12697354271
insert_alt: 4457980047
=== cycle_activity.stalls_mem_any ===
insert_ok : 20930005455
insert_alt: 4555461595
=== cycle_activity.stalls_total ===
insert_ok : 22243173394
insert_alt: 6561416461
=== dTLB-load-misses ===
insert_ok : 67817362
insert_alt: 63603879
=== dTLB-loads ===
insert_ok : 775467642
insert_alt: 1038562488
=== dTLB-store-misses ===
insert_ok : 8823481
insert_alt: 13050341
=== dTLB-stores ===
insert_ok : 621353007
insert_alt: 554244145
=== dsb2mite_switches.count ===
insert_ok : 93894397
insert_alt: 315793354
=== dsb2mite_switches.penalty_cycles ===
insert_ok : 9216240937
insert_alt: 206393788
=== dtlb_load_misses.miss_causes_a_walk ===
insert_ok : 177266866
insert_alt: 101439773
=== dtlb_load_misses.stlb_hit ===
insert_ok : 2994329
insert_alt: 35601646
=== dtlb_load_misses.walk_active ===
insert_ok : 4747616986
insert_alt: 3893609232
=== dtlb_load_misses.walk_completed ===
insert_ok : 67817832
insert_alt: 63591832
=== dtlb_load_misses.walk_completed_4k ===
insert_ok : 67817841
insert_alt: 63596148
=== dtlb_load_misses.walk_pending ===
insert_ok : 6495600072
insert_alt: 5987182579
=== dtlb_store_misses.miss_causes_a_walk ===
insert_ok : 89895924
insert_alt: 21841494
=== dtlb_store_misses.stlb_hit ===
insert_ok : 4940907
insert_alt: 21970231
=== dtlb_store_misses.walk_active ===
insert_ok : 1784142210
insert_alt: 903334856
=== dtlb_store_misses.walk_completed ===
insert_ok : 8845884
insert_alt: 13071262
=== dtlb_store_misses.walk_completed_4k ===
insert_ok : 8822993
insert_alt: 12936414
=== dtlb_store_misses.walk_pending ===
insert_ok : 1842905733
insert_alt: 933039119
=== exe_activity.1_ports_util ===
304insert_ok : 991400575
305insert_alt: 1433908710
306=== exe_activity.2_ports_util ===
307insert_ok : 782270731
308insert_alt: 1314443071
309=== exe_activity.3_ports_util ===
310insert_ok : 556847358
311insert_alt: 1158115803
312=== exe_activity.4_ports_util ===
313insert_ok : 427323800
314insert_alt: 783571280
315=== exe_activity.bound_on_stores ===
316insert_ok : 299732094
317insert_alt: 303475333
318=== exe_activity.exe_bound_0_ports ===
319insert_ok : 227569792
320insert_alt: 348959512
321=== frontend_retired.dsb_miss ===
322insert_ok : 6771584
323insert_alt: 93700643
324=== frontend_retired.itlb_miss ===
325insert_ok : 1115
326insert_alt: 1689
327=== frontend_retired.l1i_miss ===
328insert_ok : 3639
329insert_alt: 3857
330=== frontend_retired.l2_miss ===
331insert_ok : 2826
332insert_alt: 2830
333=== frontend_retired.latency_ge_1 ===
334insert_ok : 9206268
335insert_alt: 178345368
336=== frontend_retired.latency_ge_128 ===
337insert_ok : 2708
338insert_alt: 2703
339=== frontend_retired.latency_ge_16 ===
340insert_ok : 403492
341insert_alt: 820950
342=== frontend_retired.latency_ge_2 ===
343insert_ok : 4981263
344insert_alt: 85781924
345=== frontend_retired.latency_ge_256 ===
346insert_ok : 802
347insert_alt: 970
348=== frontend_retired.latency_ge_2_bubbles_ge_1 ===
349insert_ok : 56936702
350insert_alt: 225712704
351=== frontend_retired.latency_ge_2_bubbles_ge_2 ===
352insert_ok : 10312026
353insert_alt: 163227996
354=== frontend_retired.latency_ge_2_bubbles_ge_3 ===
355insert_ok : 7599252
356insert_alt: 122841752
357=== frontend_retired.latency_ge_32 ===
358insert_ok : 3599
359insert_alt: 3317
360=== frontend_retired.latency_ge_4 ===
361insert_ok : 2627373
362insert_alt: 42287077
363=== frontend_retired.latency_ge_512 ===
364insert_ok : 418
365insert_alt: 241
366=== frontend_retired.latency_ge_64 ===
367insert_ok : 2474
368insert_alt: 2802
369=== frontend_retired.latency_ge_8 ===
370insert_ok : 528748
371insert_alt: 951836
372=== frontend_retired.stlb_miss ===
373insert_ok : 769
374insert_alt: 562
375=== hw_interrupts.received ===
376insert_ok : 9330
377insert_alt: 3738
378=== iTLB-load-misses ===
379insert_ok : 456094
380insert_alt: 90739
381=== iTLB-loads ===
382insert_ok : 949
383insert_alt: 1031
384=== icache_16b.ifdata_stall ===
385insert_ok : 1145821
386insert_alt: 862403
=== icache_64b.iftag_hit ===
insert_ok : 1378406022
insert_alt: 4459469241
=== icache_64b.iftag_miss ===
insert_ok : 61812
insert_alt: 57204
=== icache_64b.iftag_stall ===
insert_ok : 56551468
insert_alt: 82354039
=== idq.all_dsb_cycles_4_uops ===
insert_ok : 896374829
insert_alt: 1610100578
=== idq.all_dsb_cycles_any_uops ===
insert_ok : 1217878089
insert_alt: 2739912727
=== idq.all_mite_cycles_4_uops ===
insert_ok : 315979501
insert_alt: 480165021
=== idq.all_mite_cycles_any_uops ===
insert_ok : 1053703958
insert_alt: 2251382760
=== idq.dsb_cycles ===
insert_ok : 1218891711
insert_alt: 2744099964
=== idq.dsb_uops ===
insert_ok : 5828442701
insert_alt: 10445095004
=== idq.mite_cycles ===
insert_ok : 470409312
insert_alt: 1664892371
=== idq.mite_uops ===
insert_ok : 1407396065
insert_alt: 4515396737
=== idq.ms_cycles ===
insert_ok : 583601361
insert_alt: 587996351
=== idq.ms_dsb_cycles ===
insert_ok : 218346
insert_alt: 74155
=== idq.ms_mite_uops ===
insert_ok : 1266443204
insert_alt: 1277980465
=== idq.ms_switches ===
insert_ok : 149106449
insert_alt: 150392336
=== idq.ms_uops ===
insert_ok : 1266950097
insert_alt: 1277330690
=== idq_uops_not_delivered.core ===
insert_ok : 1871959581
insert_alt: 6531069387
=== idq_uops_not_delivered.cycles_0_uops_deliv.core ===
insert_ok : 289301660
insert_alt: 946930713
=== idq_uops_not_delivered.cycles_fe_was_ok ===
insert_ok : 24668869613
insert_alt: 9335642949
=== idq_uops_not_delivered.cycles_le_1_uop_deliv.core ===
insert_ok : 393750384
insert_alt: 1344106460
=== idq_uops_not_delivered.cycles_le_2_uop_deliv.core ===
insert_ok : 506090534
insert_alt: 1824690188
=== idq_uops_not_delivered.cycles_le_3_uop_deliv.core ===
insert_ok : 688462029
insert_alt: 2416339045
=== ild_stall.lcp ===
insert_ok : 380
insert_alt: 480
=== inst_retired.any ===
insert_ok : 4760842560
insert_alt: 5470438932
=== inst_retired.any_p ===
insert_ok : 4760919037
insert_alt: 5470404264
=== inst_retired.prec_dist ===
insert_ok : 4760801654
insert_alt: 5470649220
=== inst_retired.total_cycles_ps ===
insert_ok : 25175372339
insert_alt: 11718929626
=== instructions ===
insert_ok : 4760805219
insert_alt: 5470497783
=== int_misc.clear_resteer_cycles ===
insert_ok : 199623562
insert_alt: 671083279
=== int_misc.recovery_cycles ===
insert_ok : 314434729
insert_alt: 704406698
=== itlb.itlb_flush ===
insert_ok : 303
insert_alt: 248
=== itlb_misses.miss_causes_a_walk ===
insert_ok : 19537
insert_alt: 116729
=== itlb_misses.stlb_hit ===
insert_ok : 11323
insert_alt: 5557
=== itlb_misses.walk_active ===
insert_ok : 2809766
insert_alt: 4070194
=== itlb_misses.walk_completed ===
insert_ok : 24298
insert_alt: 45251
=== itlb_misses.walk_completed_4k ===
insert_ok : 34084
insert_alt: 29759
=== itlb_misses.walk_pending ===
insert_ok : 853764
insert_alt: 2817933
=== l1d.replacement ===
insert_ok : 171135334
insert_alt: 244967326
=== l1d_pend_miss.fb_full ===
insert_ok : 354631656
insert_alt: 382309583
=== l1d_pend_miss.pending ===
insert_ok : 16792436441
insert_alt: 22979721104
=== l1d_pend_miss.pending_cycles ===
insert_ok : 16377420892
insert_alt: 7349245429
=== l1d_pend_miss.pending_cycles_any ===
insert_ok :
insert_alt:
=== l2_lines_in.all ===
insert_ok : 303009088
insert_alt: 411750486
=== l2_lines_out.non_silent ===
insert_ok : 157208112
insert_alt: 309484666
=== l2_lines_out.silent ===
insert_ok : 127379047
insert_alt: 84169481
=== l2_lines_out.useless_hwpf ===
insert_ok : 70374658
insert_alt: 144359127
=== l2_lines_out.useless_pref ===
insert_ok : 70747103
insert_alt: 142931540
=== l2_rqsts.all_code_rd ===
insert_ok : 71254
insert_alt: 242327
=== l2_rqsts.all_demand_data_rd ===
insert_ok : 137366274
insert_alt: 143507049
=== l2_rqsts.all_demand_miss ===
insert_ok : 150071420
insert_alt: 150820168
=== l2_rqsts.all_demand_references ===
insert_ok : 154854022
insert_alt: 160487082
=== l2_rqsts.all_pf ===
insert_ok : 170261458
insert_alt: 282476184
=== l2_rqsts.all_rfo ===
insert_ok : 17575896
insert_alt: 16938897
=== l2_rqsts.code_rd_hit ===
insert_ok : 79800
insert_alt: 381566
=== l2_rqsts.code_rd_miss ===
insert_ok : 25800
insert_alt: 33755
=== l2_rqsts.demand_data_rd_hit ===
insert_ok : 5191029
insert_alt: 9831101
=== l2_rqsts.demand_data_rd_miss ===
insert_ok : 132253891
insert_alt: 133965310
=== l2_rqsts.miss ===
insert_ok : 305347974
insert_alt: 414758839
=== l2_rqsts.pf_hit ===
insert_ok : 14639778
insert_alt: 19484420
=== l2_rqsts.pf_miss ===
insert_ok : 156092998
insert_alt: 263293430
=== l2_rqsts.references ===
insert_ok : 326549998
insert_alt: 443460029
=== l2_rqsts.rfo_hit ===
insert_ok : 11650
insert_alt: 21474
=== l2_rqsts.rfo_miss ===
insert_ok : 17544467
insert_alt: 16835137
=== l2_trans.l2_wb ===
insert_ok : 157044674
insert_alt: 308107712
=== ld_blocks.no_sr ===
insert_ok : 14
insert_alt: 13
=== ld_blocks.store_forward ===
insert_ok : 158
insert_alt: 128
=== ld_blocks_partial.address_alias ===
insert_ok : 5155853
insert_alt: 17867414
=== load_hit_pre.sw_pf ===
insert_ok : 10840795
insert_alt: 11072297
=== longest_lat_cache.miss ===
insert_ok : 257061118
insert_alt: 471152073
=== longest_lat_cache.reference ===
insert_ok : 445701577
insert_alt: 583870610
=== machine_clears.count ===
insert_ok : 3926377
insert_alt: 4280080
=== machine_clears.memory_ordering ===
insert_ok : 97177
insert_alt: 25407
=== machine_clears.smc ===
insert_ok : 138579
insert_alt: 305423
=== mem-stores ===
insert_ok : 621353009
insert_alt: 554244143
=== mem_inst_retired.all_loads ===
insert_ok : 775473590
insert_alt: 1038559807
=== mem_inst_retired.all_stores ===
insert_ok : 621353013
insert_alt: 554244145
=== mem_inst_retired.lock_loads ===
insert_ok : 85
insert_alt: 85
=== mem_inst_retired.split_loads ===
insert_ok : 171
insert_alt: 174
=== mem_inst_retired.split_stores ===
insert_ok : 53
insert_alt: 49
=== mem_inst_retired.stlb_miss_loads ===
insert_ok : 68308539
insert_alt: 18088047
=== mem_inst_retired.stlb_miss_stores ===
insert_ok : 264054
insert_alt: 819551
=== mem_load_l3_hit_retired.xsnp_none ===
insert_ok : 231116
insert_alt: 175217
=== mem_load_retired.fb_hit ===
insert_ok : 6510722
insert_alt: 95952490
=== mem_load_retired.l1_hit ===
insert_ok : 698271530
insert_alt: 920982402
=== mem_load_retired.l1_miss ===
insert_ok : 69525335
insert_alt: 20089897
=== mem_load_retired.l2_hit ===
insert_ok : 1451905
insert_alt: 773356
=== mem_load_retired.l2_miss ===
insert_ok : 68085186
insert_alt: 19474303
=== mem_load_retired.l3_hit ===
insert_ok : 222829
insert_alt: 155958
=== mem_load_retired.l3_miss ===
insert_ok : 67879593
insert_alt: 19244746
=== memory_disambiguation.history_reset ===
insert_ok : 97621
insert_alt: 25831
=== minor-faults ===
insert_ok : 1048716
insert_alt: 1048718
=== node-loads ===
insert_ok : 71473780
insert_alt: 71377840
=== node-stores ===
insert_ok : 16781161
insert_alt: 16842666
=== offcore_requests.all_data_rd ===
insert_ok : 284186682
insert_alt: 392110677
=== offcore_requests.all_requests ===
insert_ok : 530876505
insert_alt: 777784382
=== offcore_requests.demand_code_rd ===
insert_ok : 34252
insert_alt: 45896
=== offcore_requests.demand_data_rd ===
insert_ok : 133468710
insert_alt: 134288893
=== offcore_requests.demand_rfo ===
insert_ok : 17612516
insert_alt: 17062276
=== offcore_requests.l3_miss_demand_data_rd ===
insert_ok : 71616594
insert_alt: 82917520
=== offcore_requests_buffer.sq_full ===
insert_ok : 2001445
insert_alt: 3113287
=== offcore_requests_outstanding.all_data_rd ===
insert_ok : 35577129549
insert_alt: 78698308135
=== offcore_requests_outstanding.cycles_with_data_rd ===
insert_ok : 17518017620
insert_alt: 7940272202
=== offcore_requests_outstanding.demand_code_rd ===
insert_ok : 11085819
insert_alt: 9390881
=== offcore_requests_outstanding.demand_data_rd ===
insert_ok : 15902243707
insert_alt: 21097348926
=== offcore_requests_outstanding.demand_data_rd_ge_6 ===
insert_ok : 1225437
insert_alt: 317436422
=== offcore_requests_outstanding.demand_rfo ===
insert_ok : 1074492442
insert_alt: 1157902315
=== offcore_response.demand_code_rd.any_response ===
insert_ok : 53675
insert_alt: 69683
=== offcore_response.demand_code_rd.l3_hit.any_snoop ===
insert_ok : 19407
insert_alt: 29704
=== offcore_response.demand_code_rd.l3_hit.snoop_none ===
insert_ok : 12675
insert_alt: 11951
=== offcore_response.demand_code_rd.l3_miss.any_snoop ===
insert_ok : 34617
insert_alt: 40868
=== offcore_response.demand_code_rd.l3_miss.spl_hit ===
insert_ok : 0
insert_alt: 753
=== offcore_response.demand_data_rd.any_response ===
insert_ok : 131014821
insert_alt: 134813171
=== offcore_response.demand_data_rd.l3_hit.any_snoop ===
insert_ok : 59713328
insert_alt: 50254543
=== offcore_response.demand_data_rd.l3_miss.any_snoop ===
insert_ok : 71431585
insert_alt: 83916030
=== offcore_response.demand_data_rd.l3_miss.spl_hit ===
insert_ok : 244837
insert_alt: 6441992
=== offcore_response.demand_rfo.any_response ===
insert_ok : 16876557
insert_alt: 17619450
=== offcore_response.demand_rfo.l3_hit.any_snoop ===
insert_ok : 907432
insert_alt: 45127
=== offcore_response.demand_rfo.l3_hit.snoop_none ===
insert_ok : 787567
insert_alt: 794579
=== offcore_response.demand_rfo.l3_hit_e.any_snoop ===
insert_ok : 496938
insert_alt: 173658
=== offcore_response.demand_rfo.l3_hit_e.snoop_none ===
insert_ok : 779919
insert_alt: 50575
=== offcore_response.demand_rfo.l3_hit_m.any_snoop ===
insert_ok : 128627
insert_alt: 25483
=== offcore_response.demand_rfo.l3_miss.any_snoop ===
insert_ok : 16782186
insert_alt: 16847970
=== offcore_response.demand_rfo.l3_miss.snoop_none ===
insert_ok : 16782647
insert_alt: 16850104
=== offcore_response.demand_rfo.l3_miss.spl_hit ===
insert_ok : 0
insert_alt: 1364
=== offcore_response.other.any_response ===
insert_ok : 137231000
insert_alt: 189526494
=== offcore_response.other.l3_hit.any_snoop ===
insert_ok : 62695084
insert_alt: 51005882
=== offcore_response.other.l3_hit.snoop_none ===
insert_ok : 62975018
insert_alt: 50217349
=== offcore_response.other.l3_hit_e.any_snoop ===
insert_ok : 62770215
insert_alt: 50691817
=== offcore_response.other.l3_hit_e.snoop_none ===
insert_ok : 62602591
insert_alt: 50642954
=== offcore_response.other.l3_miss.any_snoop ===
insert_ok : 74247236
insert_alt: 139212975
=== offcore_response.other.l3_miss.snoop_none ===
insert_ok : 75911794
insert_alt: 141076520
=== other_assists.any ===
insert_ok : 1
insert_alt: 3
=== page-faults ===
insert_ok : 1048719
insert_alt: 1048718
=== partial_rat_stalls.scoreboard ===
insert_ok : 530950991
insert_alt: 539869553
=== ref-cycles ===
insert_ok : 32546980212
insert_alt: 12930921138
=== resource_stalls.any ===
insert_ok : 21923576648
insert_alt: 5205690082
=== resource_stalls.sb ===
insert_ok : 397908667
insert_alt: 402738367
=== rs_events.empty_cycles ===
insert_ok : 1173721723
insert_alt: 1880165720
=== rs_events.empty_end ===
insert_ok : 87752182
insert_alt: 160792701
=== sw_prefetch_access.t0 ===
insert_ok : 20835202
insert_alt: 20599176
=== task-clock ===
insert_ok : 10416.86 msec task-clock:u # 1.000 CPUs utilized
insert_alt: 4767.78 msec task-clock:u # 1.000 CPUs utilized
=== tlb_flush.stlb_any ===
insert_ok : 1835393
insert_alt: 1835396
=== topdown-fetch-bubbles ===
insert_ok : 1904143421
insert_alt: 6543146396
=== topdown-slots-issued ===
insert_ok : 7538371393
insert_alt: 14449966516
=== topdown-slots-retired ===
insert_ok : 5267325162
insert_alt: 5849706597
=== uops_dispatched_port.port_0 ===
insert_ok : 1252121297
insert_alt: 1489605354
=== uops_dispatched_port.port_1 ===
insert_ok : 1379316967
insert_alt: 1585037107
=== uops_dispatched_port.port_2 ===
insert_ok : 1140861153
insert_alt: 1785053149
=== uops_dispatched_port.port_3 ===
insert_ok : 1187151423
insert_alt: 1828975838
=== uops_dispatched_port.port_4 ===
insert_ok : 1577171758
insert_alt: 1557761857
=== uops_dispatched_port.port_5 ===
insert_ok : 1341370655
insert_alt: 1653599117
=== uops_dispatched_port.port_6 ===
insert_ok : 1856735970
insert_alt: 4387464794
=== uops_dispatched_port.port_7 ===
insert_ok : 508351498
insert_alt: 603583315
=== uops_executed.core ===
insert_ok : 7225522677
insert_alt: 12716368190
=== uops_executed.core_cycles_ge_1 ===
insert_ok : 3041586797
insert_alt: 5168421550
=== uops_executed.core_cycles_ge_2 ===
insert_ok : 2017794537
insert_alt: 3653591208
=== uops_executed.core_cycles_ge_3 ===
insert_ok : 1225785335
insert_alt: 2316014066
=== uops_executed.core_cycles_ge_4 ===
insert_ok : 657121809
insert_alt: 1143390519
=== uops_executed.core_cycles_none ===
insert_ok : 22191507320
insert_alt: 6563722081
=== uops_executed.cycles_ge_1_uop_exec ===
insert_ok : 3040999757
insert_alt: 5175668459
=== uops_executed.cycles_ge_2_uops_exec ===
insert_ok : 2015520940
insert_alt: 3659989196
=== uops_executed.cycles_ge_3_uops_exec ===
insert_ok : 1224025952
insert_alt: 2319025110
=== uops_executed.cycles_ge_4_uops_exec ===
insert_ok : 657094113
insert_alt: 1141381027
=== uops_executed.stall_cycles ===
insert_ok : 22350754164
insert_alt: 6590978048
=== uops_executed.thread ===
insert_ok : 7214521925
insert_alt: 12697219901
=== uops_executed.x87 ===
insert_ok : 2992
insert_alt: 3337
=== uops_issued.any ===
insert_ok : 7531354736
insert_alt: 14462113169
=== uops_issued.slow_lea ===
insert_ok : 2136241
insert_alt: 2115308
=== uops_issued.stall_cycles ===
insert_ok : 23244177475
insert_alt: 7416801878
=== uops_retired.macro_fused ===
insert_ok : 410461916
insert_alt: 735050350
=== uops_retired.retire_slots ===
insert_ok : 5265023980
insert_alt: 5855259326
=== uops_retired.stall_cycles ===
insert_ok : 23513958928
insert_alt: 9630258867
=== uops_retired.total_cycles ===
insert_ok : 25266688635
insert_alt: 11703285605
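
The raw counts above are easier to compare once reduced to a few ratios. The following is a minimal sketch and not part of the original benchmark: the `Counters` struct and the helper names are hypothetical, and the numbers are copied from the `cpu-cycles`, `instructions`, `dtlb_load_misses.miss_causes_a_walk` and `dtlb_load_misses.walk_completed` entries of the dump.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>

// Hypothetical container for the handful of counters used here.
// All values are copied from the perf dump above.
struct Counters {
    const char* name;
    std::uint64_t cycles;           // cpu-cycles
    std::uint64_t instructions;     // instructions
    std::uint64_t walks_started;    // dtlb_load_misses.miss_causes_a_walk
    std::uint64_t walks_completed;  // dtlb_load_misses.walk_completed
};

constexpr Counters insert_ok_counters  = {"insert_ok",  25273361574ULL, 4760805219ULL, 177266866ULL, 67817832ULL};
constexpr Counters insert_alt_counters = {"insert_alt", 11675804743ULL, 5470497783ULL, 101439773ULL, 63591832ULL};

// Instructions retired per core cycle.
inline double ipc(const Counters& c) {
    return static_cast<double>(c.instructions) / static_cast<double>(c.cycles);
}

// Load page walks started per load page walk completed. A value well above 1
// suggests that many walks were started but did not run to completion.
inline double walks_started_per_completed(const Counters& c) {
    return static_cast<double>(c.walks_started) / static_cast<double>(c.walks_completed);
}

// Prints one summary line for a variant.
inline void print_summary(const Counters& c) {
    std::printf("%-10s IPC=%.3f  walks started/completed=%.2f\n",
                c.name, ipc(c), walks_started_per_completed(c));
}
```

Calling `print_summary` on both variants gives an IPC of roughly 0.19 for insert_ok against roughly 0.47 for insert_alt, and about 2.6 load page walks started per completed walk for insert_ok against about 1.6 for insert_alt.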

tlb_fencing:

    xor     eax, eax  ; the index pointer
    mov     r9 , [rsi + region.start]

    mov     r8 , [rsi + region.size]
    sub     r8 , 200                   ; pointer to end of region (plus a bit of buffer)

    mov     r10, [rsi + region.size]
    sub     r10, 1 ; mask

    mov     rsi, r9   ; region start

.top:
    mov     rcx, rax
    and     rcx, r10        ; remap the index into the region via masking
    add     rcx, r9         ; make pointer p into the region
    mov     rdx, [rcx]      ; load 8 bytes at p, always zero
    xor     rcx, rcx        ; no-op
    mov     DWORD [rsi + rdx + 160], 0 ; store zero at p + 160
    add     rax, (64 * 67)  ; advance a prime number of cache lines slightly larger than a page

    dec     rdi
    jnz     .top

    ret
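
To make the access pattern of the loop above concrete, here is a small C++ sketch of the index arithmetic only. It is an illustration under stated assumptions rather than code from uarch-bench: the function names are made up, 4 KiB pages are assumed, and the region size is assumed to be a power of two (which the `size - 1` mask in the assembly implies).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::uint64_t kCacheLine = 64;
constexpr std::uint64_t kPage = 4096;               // assumption: 4 KiB pages
constexpr std::uint64_t kStride = kCacheLine * 67;  // 4288 bytes, as in the loop above

// Returns the byte offsets (relative to the region start) touched by the first
// `iterations` trips through the loop. `region_size` must be a power of two,
// because the index is wrapped with a mask of region_size - 1.
inline std::vector<std::uint64_t> access_offsets(std::uint64_t region_size,
                                                 std::size_t iterations) {
    assert(region_size != 0 && (region_size & (region_size - 1)) == 0);
    const std::uint64_t mask = region_size - 1;
    std::vector<std::uint64_t> offsets;
    offsets.reserve(iterations);
    std::uint64_t index = 0;              // plays the role of rax
    for (std::size_t i = 0; i < iterations; ++i) {
        offsets.push_back(index & mask);  // and rcx, r10
        index += kStride;                 // add rax, (64 * 67)
    }
    return offsets;
}

// True if every access lands on a different page than the access before it.
inline bool every_step_changes_page(const std::vector<std::uint64_t>& offsets) {
    for (std::size_t i = 1; i < offsets.size(); ++i) {
        if (offsets[i] / kPage == offsets[i - 1] / kPage) return false;
    }
    return true;
}
```

Because the stride is larger than a page, consecutive iterations never touch the same page, so in a region much larger than the TLB reach almost every access needs a fresh translation.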

for (size_t i = 0; i < bucket_size; ++i)
{
    if (i == B.size)
    {
        B.keys[i] = k;
        B.values[i] = 1;
        ++B.size;
        ++table_count;
        return;
    }
}

Note that the stores use the loop index i. Unlike the insert_ok case, where the index B.size is itself loaded from memory, i is simply a calculated value in a register. Now, i is related to the loaded value B.size, since its final value will be equal to it, but that relationship is established via a comparison, which is a speculated control dependency rather than a data dependency, so it doesn't cause any problem with page walk cancellation. This scenario does have a lot of mispredictions (since the loop exit is unpredictable), but for the large region case these aren't actually too harmful, because the bad path usually makes the same memory accesses as the good path (specifically, the next value inserted is always the same) and memory access behavior dominates.

The same is true for the alt case: the index to write at is established by using a calculated value i to load a value, checking whether it is the special marker value, and then writing at that location using index i. Again, there is no delayed store address, just a quickly calculated register value and a speculated control dependency.
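
The three ways of choosing the store index discussed above can be put side by side for a single bucket. This is a simplified sketch, not the benchmark's code: the `Bucket` layout, the value of `bucket_size` and the function names are assumptions made for illustration, and the hash table lookup around the bucket is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t bucket_size = 8;                  // assumed value, for illustration only
constexpr std::uint64_t empty_key = ~std::uint64_t(0);  // marker for an unused slot

// Hypothetical bucket layout; the real bucket_t may differ.
struct Bucket {
    std::uint64_t keys[bucket_size];
    std::uint64_t values[bucket_size];
    std::size_t size;
};

// Marks every slot as empty.
inline void reset(Bucket& b) {
    for (std::size_t i = 0; i < bucket_size; ++i) {
        b.keys[i] = empty_key;
        b.values[i] = 0;
    }
    b.size = 0;
}

// insert_ok style: the store address is computed from the loaded value b.size,
// so the address is unknown until that load completes.
inline bool store_at_loaded_index(Bucket& b, std::uint64_t k) {
    if (b.size == bucket_size) return false;
    b.keys[b.size] = k;
    b.values[b.size] = 1;
    ++b.size;
    return true;
}

// insert_bad style: the store address uses the loop counter i; b.size only
// feeds the comparison, which the CPU can speculate past.
inline bool store_at_loop_index(Bucket& b, std::uint64_t k) {
    for (std::size_t i = 0; i < bucket_size; ++i) {
        if (i == b.size) {
            b.keys[i] = k;
            b.values[i] = 1;
            ++b.size;
            return true;
        }
    }
    return false;
}

// insert_alt style: no size field is consulted; the first slot still holding
// the marker is found by a search, and the store again uses the loop counter i.
inline bool store_at_first_empty(Bucket& b, std::uint64_t k) {
    for (std::size_t i = 0; i < bucket_size; ++i) {
        if (b.keys[i] == empty_key) {
            b.keys[i] = k;
            b.values[i] = 1;
            return true;
        }
    }
    return false;
}
```

All three functions leave the same keys and values in the bucket; only the way the store address is derived differs. Whether that difference survives into the machine code depends on the compiler and its flags, so it is worth checking the generated assembly before drawing conclusions from timings.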

What About Other Hardware

Like the question author, I found the effect on Skylake, and I also observed the same behavior on Haswell. On Ice Lake, I can't reproduce it: the dep and indep variants have almost identical performance.

User Noah, however, reported that he could reproduce it on Tiger Lake (TGL) using the original benchmark for certain alignments. I believe the most likely explanation is that TGL isn't subject to this page walk behavior, but that at some alignments the memory disambiguation predictors collide, causing a very similar effect: the loads can't execute ahead of earlier address-unknown stores because the processor thinks the stores might forward to the load.

Run It Yourself

You can run the benchmark described above yourself; it is part of uarch-bench. On Linux (or WSL, although performance counters aren't available there), you can run the following command to collect the results:

# see https://github.com/cr-marcstevens/hashtable_mystery
$ ./test.sh
model name      : Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
==============================
CXX=g++    CXXFLAGS=-std=c++11 -O2 -march=native -falign-functions=64
tablesize: 117440512 elements: 67108864 loadfactor=0.571429
- test insert_ok : 11200ms
- test insert_bad: 3164ms
  (outcome identical to insert_ok: true)
- test insert_alt: 3366ms
  (outcome identical to insert_ok: true)

tablesize: 117440813 elements: 67108864 loadfactor=0.571427
- test insert_ok : 10840ms
- test insert_bad: 3301ms
  (outcome identical to insert_ok: true)
- test insert_alt: 3579ms
  (outcome identical to insert_ok: true)

// insert element in hash_table
inline void insert_ok(uint64_t k)
{
    // compute target bucket
    uint64_t b = mod(k);
    // bounded linear search for first non-full bucket
    for (size_t c = 0; c < 1024; ++c)
    {
        bucket_t& B = table_ok[b];
        // if bucket non-full then store element and return
        if (B.size != bucket_size)
        {
            B.keys[B.size] = k;
            B.values[B.size] = 1;
            ++B.size;
            ++table_count;
            return;
        }
        // increase b w/ wrap around
        if (++b == table_size)
            b = 0;
    }
}

// equivalent to insert_ok
// but uses a stupid linear search to store the element at the target position
inline void insert_bad(uint64_t k)
{
    // compute target bucket
    uint64_t b = mod(k);
    // bounded linear search for first non-full bucket
    for (size_t c = 0; c < 1024; ++c)
    {
        bucket_t& B = table_bad[b];
        // if bucket non-full then store element and return
        if (B.size != bucket_size)
        {
            for (size_t i = 0; i < bucket_size; ++i)
            {
                if (i == B.size)
                {
                    B.keys[i] = k;
                    B.values[i] = 1;
                    ++B.size;
                    ++table_count;
                    return;
                }
            }
        }
        // increase b w/ wrap around
        if (++b == table_size)
            b = 0;
    }
}

// instead of using bucket_t.size, empty elements are marked by special empty_key value
// a bucket is filled first to last, so bucket is full if last element key != empty_key
uint64_t empty_key = ~uint64_t(0);
inline void insert_alt(uint64_t k)
{
    // compute target bucket
    uint64_t b = mod(k);
    // bounded linear search for first non-full bucket
    for (size_t c = 0; c < 1024; ++c)
    {
        bucket_t& B = table_alt[b];
        // if bucket non-full then store element and return
        if (B.keys[bucket_size-1] == empty_key)
        {
            for (size_t i = 0; i < bucket_size; ++i)
            {
                if (B.keys[i] == empty_key)
                {
                    B.keys[i] = k;
                    B.values[i] = 1;
                    ++table_count;
                    return;
                }
            }
        }
        // increase b w/ wrap around
        if (++b == table_size)
            b = 0;
    }
}

387=== icache_64b.iftag_hit ===
388insert_ok : 1378406022
389insert_alt: 4459469241
390=== icache_64b.iftag_miss ===
391insert_ok : 61812
392insert_alt: 57204
393=== icache_64b.iftag_stall ===
394insert_ok : 56551468
395insert_alt: 82354039
396=== idq.all_dsb_cycles_4_uops ===
397insert_ok : 896374829
398insert_alt: 1610100578
399=== idq.all_dsb_cycles_any_uops ===
400insert_ok : 1217878089
401insert_alt: 2739912727
402=== idq.all_mite_cycles_4_uops ===
403insert_ok : 315979501
404insert_alt: 480165021
405=== idq.all_mite_cycles_any_uops ===
406insert_ok : 1053703958
407insert_alt: 2251382760
408=== idq.dsb_cycles ===
409insert_ok : 1218891711
410insert_alt: 2744099964
411=== idq.dsb_uops ===
412insert_ok : 5828442701
413insert_alt: 10445095004
414=== idq.mite_cycles ===
415insert_ok : 470409312
416insert_alt: 1664892371
417=== idq.mite_uops ===
418insert_ok : 1407396065
419insert_alt: 4515396737
420=== idq.ms_cycles ===
421insert_ok : 583601361
422insert_alt: 587996351
423=== idq.ms_dsb_cycles ===
424insert_ok : 218346
425insert_alt: 74155
426=== idq.ms_mite_uops ===
427insert_ok : 1266443204
428insert_alt: 1277980465
429=== idq.ms_switches ===
430insert_ok : 149106449
431insert_alt: 150392336
432=== idq.ms_uops ===
433insert_ok : 1266950097
434insert_alt: 1277330690
435=== idq_uops_not_delivered.core ===
436insert_ok : 1871959581
437insert_alt: 6531069387
438=== idq_uops_not_delivered.cycles_0_uops_deliv.core ===
439insert_ok : 289301660
440insert_alt: 946930713
441=== idq_uops_not_delivered.cycles_fe_was_ok ===
442insert_ok : 24668869613
443insert_alt: 9335642949
444=== idq_uops_not_delivered.cycles_le_1_uop_deliv.core ===
445insert_ok : 393750384
446insert_alt: 1344106460
447=== idq_uops_not_delivered.cycles_le_2_uop_deliv.core ===
448insert_ok : 506090534
449insert_alt: 1824690188
450=== idq_uops_not_delivered.cycles_le_3_uop_deliv.core ===
451insert_ok : 688462029
452insert_alt: 2416339045
453=== ild_stall.lcp ===
454insert_ok : 380
455insert_alt: 480
456=== inst_retired.any ===
457insert_ok : 4760842560
458insert_alt: 5470438932
459=== inst_retired.any_p ===
460insert_ok : 4760919037
461insert_alt: 5470404264
462=== inst_retired.prec_dist ===
463insert_ok : 4760801654
464insert_alt: 5470649220
465=== inst_retired.total_cycles_ps ===
466insert_ok : 25175372339
467insert_alt: 11718929626
468=== instructions ===
469insert_ok : 4760805219
470insert_alt: 5470497783
471=== int_misc.clear_resteer_cycles ===
472insert_ok : 199623562
473insert_alt: 671083279
474=== int_misc.recovery_cycles ===
475insert_ok : 314434729
476insert_alt: 704406698
477=== itlb.itlb_flush ===
478insert_ok : 303
479insert_alt: 248
480=== itlb_misses.miss_causes_a_walk ===
481insert_ok : 19537
482insert_alt: 116729
483=== itlb_misses.stlb_hit ===
484insert_ok : 11323
485insert_alt: 5557
486=== itlb_misses.walk_active ===
487insert_ok : 2809766
488insert_alt: 4070194
489=== itlb_misses.walk_completed ===
490insert_ok : 24298
491insert_alt: 45251
492=== itlb_misses.walk_completed_4k ===
493insert_ok : 34084
494insert_alt: 29759
495=== itlb_misses.walk_pending ===
496insert_ok : 853764
497insert_alt: 2817933
498=== l1d.replacement ===
499insert_ok : 171135334
500insert_alt: 244967326
501=== l1d_pend_miss.fb_full ===
502insert_ok : 354631656
503insert_alt: 382309583
504=== l1d_pend_miss.pending ===
505insert_ok : 16792436441
506insert_alt: 22979721104
507=== l1d_pend_miss.pending_cycles ===
508insert_ok : 16377420892
509insert_alt: 7349245429
510=== l1d_pend_miss.pending_cycles_any ===
insert_ok :
insert_alt:
=== l2_lines_in.all ===
512insert_ok : 303009088
513insert_alt: 411750486
514=== l2_lines_out.non_silent ===
515insert_ok : 157208112
516insert_alt: 309484666
517=== l2_lines_out.silent ===
518insert_ok : 127379047
519insert_alt: 84169481
520=== l2_lines_out.useless_hwpf ===
521insert_ok : 70374658
522insert_alt: 144359127
523=== l2_lines_out.useless_pref ===
524insert_ok : 70747103
525insert_alt: 142931540
526=== l2_rqsts.all_code_rd ===
527insert_ok : 71254
528insert_alt: 242327
529=== l2_rqsts.all_demand_data_rd ===
530insert_ok : 137366274
531insert_alt: 143507049
532=== l2_rqsts.all_demand_miss ===
533insert_ok : 150071420
534insert_alt: 150820168
535=== l2_rqsts.all_demand_references ===
536insert_ok : 154854022
537insert_alt: 160487082
538=== l2_rqsts.all_pf ===
539insert_ok : 170261458
540insert_alt: 282476184
541=== l2_rqsts.all_rfo ===
542insert_ok : 17575896
543insert_alt: 16938897
544=== l2_rqsts.code_rd_hit ===
545insert_ok : 79800
546insert_alt: 381566
547=== l2_rqsts.code_rd_miss ===
548insert_ok : 25800
549insert_alt: 33755
550=== l2_rqsts.demand_data_rd_hit ===
551insert_ok : 5191029
552insert_alt: 9831101
553=== l2_rqsts.demand_data_rd_miss ===
554insert_ok : 132253891
555insert_alt: 133965310
556=== l2_rqsts.miss ===
557insert_ok : 305347974
558insert_alt: 414758839
559=== l2_rqsts.pf_hit ===
560insert_ok : 14639778
561insert_alt: 19484420
562=== l2_rqsts.pf_miss ===
563insert_ok : 156092998
564insert_alt: 263293430
565=== l2_rqsts.references ===
566insert_ok : 326549998
567insert_alt: 443460029
568=== l2_rqsts.rfo_hit ===
569insert_ok : 11650
570insert_alt: 21474
571=== l2_rqsts.rfo_miss ===
572insert_ok : 17544467
573insert_alt: 16835137
574=== l2_trans.l2_wb ===
575insert_ok : 157044674
576insert_alt: 308107712
577=== ld_blocks.no_sr ===
578insert_ok : 14
579insert_alt: 13
580=== ld_blocks.store_forward ===
581insert_ok : 158
582insert_alt: 128
583=== ld_blocks_partial.address_alias ===
584insert_ok : 5155853
585insert_alt: 17867414
586=== load_hit_pre.sw_pf ===
587insert_ok : 10840795
588insert_alt: 11072297
589=== longest_lat_cache.miss ===
590insert_ok : 257061118
591insert_alt: 471152073
592=== longest_lat_cache.reference ===
593insert_ok : 445701577
594insert_alt: 583870610
595=== machine_clears.count ===
596insert_ok : 3926377
597insert_alt: 4280080
598=== machine_clears.memory_ordering ===
599insert_ok : 97177
600insert_alt: 25407
601=== machine_clears.smc ===
602insert_ok : 138579
603insert_alt: 305423
604=== mem-stores ===
605insert_ok : 621353009
606insert_alt: 554244143
607=== mem_inst_retired.all_loads ===
608insert_ok : 775473590
609insert_alt: 1038559807
610=== mem_inst_retired.all_stores ===
611insert_ok : 621353013
612insert_alt: 554244145
613=== mem_inst_retired.lock_loads ===
614insert_ok : 85
615insert_alt: 85
616=== mem_inst_retired.split_loads ===
617insert_ok : 171
618insert_alt: 174
619=== mem_inst_retired.split_stores ===
620insert_ok : 53
621insert_alt: 49
622=== mem_inst_retired.stlb_miss_loads ===
623insert_ok : 68308539
624insert_alt: 18088047
625=== mem_inst_retired.stlb_miss_stores ===
626insert_ok : 264054
627insert_alt: 819551
628=== mem_load_l3_hit_retired.xsnp_none ===
629insert_ok : 231116
630insert_alt: 175217
631=== mem_load_retired.fb_hit ===
632insert_ok : 6510722
633insert_alt: 95952490
634=== mem_load_retired.l1_hit ===
635insert_ok : 698271530
636insert_alt: 920982402
637=== mem_load_retired.l1_miss ===
638insert_ok : 69525335
639insert_alt: 20089897
640=== mem_load_retired.l2_hit ===
641insert_ok : 1451905
642insert_alt: 773356
643=== mem_load_retired.l2_miss ===
644insert_ok : 68085186
645insert_alt: 19474303
646=== mem_load_retired.l3_hit ===
647insert_ok : 222829
648insert_alt: 155958
649=== mem_load_retired.l3_miss ===
650insert_ok : 67879593
651insert_alt: 19244746
652=== memory_disambiguation.history_reset ===
653insert_ok : 97621
654insert_alt: 25831
655=== minor-faults ===
656insert_ok : 1048716
657insert_alt: 1048718
658=== node-loads ===
659insert_ok : 71473780
660insert_alt: 71377840
661=== node-stores ===
662insert_ok : 16781161
663insert_alt: 16842666
664=== offcore_requests.all_data_rd ===
665insert_ok : 284186682
666insert_alt: 392110677
667=== offcore_requests.all_requests ===
668insert_ok : 530876505
669insert_alt: 777784382
670=== offcore_requests.demand_code_rd ===
671insert_ok : 34252
672insert_alt: 45896
673=== offcore_requests.demand_data_rd ===
674insert_ok : 133468710
675insert_alt: 134288893
676=== offcore_requests.demand_rfo ===
677insert_ok : 17612516
678insert_alt: 17062276
679=== offcore_requests.l3_miss_demand_data_rd ===
680insert_ok : 71616594
681insert_alt: 82917520
682=== offcore_requests_buffer.sq_full ===
683insert_ok : 2001445
684insert_alt: 3113287
685=== offcore_requests_outstanding.all_data_rd ===
686insert_ok : 35577129549
687insert_alt: 78698308135
688=== offcore_requests_outstanding.cycles_with_data_rd ===
689insert_ok : 17518017620
690insert_alt: 7940272202
691=== offcore_requests_outstanding.demand_code_rd ===
692insert_ok : 11085819
693insert_alt: 9390881
694=== offcore_requests_outstanding.demand_data_rd ===
695insert_ok : 15902243707
696insert_alt: 21097348926
697=== offcore_requests_outstanding.demand_data_rd_ge_6 ===
698insert_ok : 1225437
699insert_alt: 317436422
700=== offcore_requests_outstanding.demand_rfo ===
701insert_ok : 1074492442
702insert_alt: 1157902315
703=== offcore_response.demand_code_rd.any_response ===
704insert_ok : 53675
705insert_alt: 69683
706=== offcore_response.demand_code_rd.l3_hit.any_snoop ===
707insert_ok : 19407
708insert_alt: 29704
709=== offcore_response.demand_code_rd.l3_hit.snoop_none ===
710insert_ok : 12675
711insert_alt: 11951
712=== offcore_response.demand_code_rd.l3_miss.any_snoop ===
713insert_ok : 34617
714insert_alt: 40868
715=== offcore_response.demand_code_rd.l3_miss.spl_hit ===
716insert_ok : 0
717insert_alt: 753
718=== offcore_response.demand_data_rd.any_response ===
719insert_ok : 131014821
720insert_alt: 134813171
721=== offcore_response.demand_data_rd.l3_hit.any_snoop ===
722insert_ok : 59713328
723insert_alt: 50254543
724=== offcore_response.demand_data_rd.l3_miss.any_snoop ===
725insert_ok : 71431585
726insert_alt: 83916030
727=== offcore_response.demand_data_rd.l3_miss.spl_hit ===
728insert_ok : 244837
729insert_alt: 6441992
730=== offcore_response.demand_rfo.any_response ===
731insert_ok : 16876557
732insert_alt: 17619450
733=== offcore_response.demand_rfo.l3_hit.any_snoop ===
734insert_ok : 907432
735insert_alt: 45127
736=== offcore_response.demand_rfo.l3_hit.snoop_none ===
737insert_ok : 787567
738insert_alt: 794579
739=== offcore_response.demand_rfo.l3_hit_e.any_snoop ===
740insert_ok : 496938
741insert_alt: 173658
742=== offcore_response.demand_rfo.l3_hit_e.snoop_none ===
743insert_ok : 779919
744insert_alt: 50575
745=== offcore_response.demand_rfo.l3_hit_m.any_snoop ===
746insert_ok : 128627
747insert_alt: 25483
748=== offcore_response.demand_rfo.l3_miss.any_snoop ===
749insert_ok : 16782186
750insert_alt: 16847970
751=== offcore_response.demand_rfo.l3_miss.snoop_none ===
752insert_ok : 16782647
753insert_alt: 16850104
754=== offcore_response.demand_rfo.l3_miss.spl_hit ===
755insert_ok : 0
756insert_alt: 1364
757=== offcore_response.other.any_response ===
758insert_ok : 137231000
759insert_alt: 189526494
760=== offcore_response.other.l3_hit.any_snoop ===
761insert_ok : 62695084
762insert_alt: 51005882
763=== offcore_response.other.l3_hit.snoop_none ===
764insert_ok : 62975018
765insert_alt: 50217349
766=== offcore_response.other.l3_hit_e.any_snoop ===
767insert_ok : 62770215
768insert_alt: 50691817
769=== offcore_response.other.l3_hit_e.snoop_none ===
770insert_ok : 62602591
771insert_alt: 50642954
772=== offcore_response.other.l3_miss.any_snoop ===
773insert_ok : 74247236
774insert_alt: 139212975
775=== offcore_response.other.l3_miss.snoop_none ===
776insert_ok : 75911794
777insert_alt: 141076520
778=== other_assists.any ===
779insert_ok : 1
780insert_alt: 3
781=== page-faults ===
782insert_ok : 1048719
783insert_alt: 1048718
784=== partial_rat_stalls.scoreboard ===
785insert_ok : 530950991
786insert_alt: 539869553
787=== ref-cycles ===
788insert_ok : 32546980212
789insert_alt: 12930921138
790=== resource_stalls.any ===
791insert_ok : 21923576648
792insert_alt: 5205690082
793=== resource_stalls.sb ===
794insert_ok : 397908667
795insert_alt: 402738367
796=== rs_events.empty_cycles ===
797insert_ok : 1173721723
798insert_alt: 1880165720
799=== rs_events.empty_end ===
800insert_ok : 87752182
801insert_alt: 160792701
802=== sw_prefetch_access.t0 ===
803insert_ok : 20835202
804insert_alt: 20599176
805=== task-clock ===
806insert_ok : 10416.86 msec task-clock:u # 1.000 CPUs utilized
807insert_alt: 4767.78 msec task-clock:u # 1.000 CPUs utilized
808=== tlb_flush.stlb_any ===
809insert_ok : 1835393
810insert_alt: 1835396
811=== topdown-fetch-bubbles ===
812insert_ok : 1904143421
813insert_alt: 6543146396
814=== topdown-slots-issued ===
815insert_ok : 7538371393
816insert_alt: 14449966516
817=== topdown-slots-retired ===
818insert_ok : 5267325162
819insert_alt: 5849706597
820=== uops_dispatched_port.port_0 ===
821insert_ok : 1252121297
822insert_alt: 1489605354
823=== uops_dispatched_port.port_1 ===
824insert_ok : 1379316967
825insert_alt: 1585037107
826=== uops_dispatched_port.port_2 ===
827insert_ok : 1140861153
828insert_alt: 1785053149
829=== uops_dispatched_port.port_3 ===
830insert_ok : 1187151423
831insert_alt: 1828975838
832=== uops_dispatched_port.port_4 ===
833insert_ok : 1577171758
834insert_alt: 1557761857
835=== uops_dispatched_port.port_5 ===
836insert_ok : 1341370655
837insert_alt: 1653599117
838=== uops_dispatched_port.port_6 ===
839insert_ok : 1856735970
840insert_alt: 4387464794
841=== uops_dispatched_port.port_7 ===
842insert_ok : 508351498
843insert_alt: 603583315
844=== uops_executed.core ===
845insert_ok : 7225522677
846insert_alt: 12716368190
847=== uops_executed.core_cycles_ge_1 ===
848insert_ok : 3041586797
849insert_alt: 5168421550
850=== uops_executed.core_cycles_ge_2 ===
851insert_ok : 2017794537
852insert_alt: 3653591208
853=== uops_executed.core_cycles_ge_3 ===
854insert_ok : 1225785335
855insert_alt: 2316014066
856=== uops_executed.core_cycles_ge_4 ===
857insert_ok : 657121809
858insert_alt: 1143390519
859=== uops_executed.core_cycles_none ===
860insert_ok : 22191507320
861insert_alt: 6563722081
862=== uops_executed.cycles_ge_1_uop_exec ===
863insert_ok : 3040999757
864insert_alt: 5175668459
865=== uops_executed.cycles_ge_2_uops_exec ===
866insert_ok : 2015520940
867insert_alt: 3659989196
868=== uops_executed.cycles_ge_3_uops_exec ===
869insert_ok : 1224025952
870insert_alt: 2319025110
871=== uops_executed.cycles_ge_4_uops_exec ===
872insert_ok : 657094113
873insert_alt: 1141381027
874=== uops_executed.stall_cycles ===
875insert_ok : 22350754164
876insert_alt: 6590978048
877=== uops_executed.thread ===
878insert_ok : 7214521925
879insert_alt: 12697219901
880=== uops_executed.x87 ===
881insert_ok : 2992
882insert_alt: 3337
883=== uops_issued.any ===
884insert_ok : 7531354736
885insert_alt: 14462113169
886=== uops_issued.slow_lea ===
887insert_ok : 2136241
888insert_alt: 2115308
889=== uops_issued.stall_cycles ===
890insert_ok : 23244177475
891insert_alt: 7416801878
892=== uops_retired.macro_fused ===
893insert_ok : 410461916
894insert_alt: 735050350
895=== uops_retired.retire_slots ===
896insert_ok : 5265023980
897insert_alt: 5855259326
898=== uops_retired.stall_cycles ===
899insert_ok : 23513958928
900insert_alt: 9630258867
901=== uops_retired.total_cycles ===
902insert_ok : 25266688635
903insert_alt: 11703285605
tlb_fencing:

    xor     eax, eax  ; the index pointer
    mov     r9 , [rsi + region.start]

    mov     r8 , [rsi + region.size]
    sub     r8 , 200                   ; pointer to end of region (plus a bit of buffer)

    mov     r10, [rsi + region.size]
    sub     r10, 1 ; mask

    mov     rsi, r9   ; region start

.top:
    mov     rcx, rax
    and     rcx, r10        ; remap the index into the region via masking
    add     rcx, r9         ; make pointer p into the region
    mov     rdx, [rcx]      ; load 8 bytes at p, always zero
    xor     rcx, rcx        ; no-op
    mov     DWORD [rsi + rdx + 160], 0 ; store zero at region start + rdx + 160 (rdx is always zero)
    add     rax, (64 * 67)  ; advance a prime number of cache lines slightly larger than a page

    dec     rdi
    jnz     .top

    ret

for (size_t i = 0; i < bucket_size; ++i)
{
    if (i == B.size)
    {
        B.keys[i] = k;
        B.values[i] = 1;
        ++B.size;
        ++table_count;
        return;
    }
}

for s in 2M-dep 4K-dep 4K-indep; do ./uarch-bench --timer=perf --test-name="studies/memory/tlb-fencing/*$s" --extra-events=dtlb_load_misses.miss_causes_a_walk#walk_s,dtlb_load_misses.walk_completed#walk_c,l1d_pend_miss.pending#l1d_p,l1d_pend_miss.pending_cycles#l1d_pc; done

Some systems may not have enough free performance counters available (for example, when hyperthreading is enabled). In that case, you can do two runs, each with a different set of counters.


1 In this case, rdx is always zero (the region is entirely full of zeros) so the store address happens to be the same as if this register wasn't included in the addressing expression, but the CPU doesn't know that!

2 Here, the 2M dep case also starts to show better performance than the 4K indep case, although the gap is modest.

3 Note the "while any miss is outstanding" part: you could also calculate MLP as l1d_pend_miss.pending / cycles, which would be the average MLP over a period of time, regardless of whether any misses were outstanding. Each is useful in its own way, but in a case like this with misses constantly outstanding they give almost identical values.

4 Yes, there are many differences between this and the original example. We store to a single fixed location, whereas the original loop stored near the load location, which varies every iteration. We store 0 not 1. We don't check B.size to see if it is too large. In our test the loaded value is always 0. There is no search loop for when the bucket is full. We don't load a random value to address, but just do a linear stride. However, these are not material: the same effect occurs in both cases and you can incrementally modify the original example by removing complexity until you reach this simple case.
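
The MLP ratio from footnote 3 can be worked through with the l1d_pend_miss.pending and l1d_pend_miss.pending_cycles values from the counter listing above. This is only a sketch: the helper name mlp and the constant names are made up for illustration, and the second argument can be either pending_cycles (MLP while a miss is outstanding) or a total cycle count (average MLP over the whole run).

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: average number of outstanding L1D misses per cycle.
// Pass l1d_pend_miss.pending_cycles as `cycles` for "MLP while any miss is
// outstanding", or a total cycle count for the average over the whole run.
constexpr double mlp(std::uint64_t pending, std::uint64_t cycles) {
    return cycles == 0 ? 0.0
                       : static_cast<double>(pending) / static_cast<double>(cycles);
}

// Counter values copied from the listing above:
// l1d_pend_miss.pending / l1d_pend_miss.pending_cycles.
constexpr double mlp_insert_ok  = mlp(16792436441ULL, 16377420892ULL);
constexpr double mlp_insert_alt = mlp(22979721104ULL, 7349245429ULL);

// With these numbers, insert_alt keeps more misses in flight per cycle.
static_assert(mlp_insert_ok < mlp_insert_alt,
              "insert_alt shows the higher MLP for these counter values");
```

With these inputs the ratio is a little above 1 for insert_ok and a little above 3 for insert_alt.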

Source https://stackoverflow.com/questions/69664733

QUESTION

Function default argument value depending on argument name in C++

Asked 2021-Oct-06 at 22:12

If one defines a new variable in C++, then the name of the variable can be used in the initialization expression, for example:

int x = sizeof(x);

And what about default value of a function argument? Is it allowed there to reference the argument by its name? For example:

void f(int y = sizeof(y)) {}

This function is accepted in Clang, but rejected in GCC with the error:

'y' was not declared in this scope

Demo: https://gcc.godbolt.org/z/YsvYnhjTb

Which compiler is right here?

ANSWER

Answered 2021-Oct-06 at 22:12

According to the C++17 standard (11.3.6 Default arguments)

9 A default argument is evaluated each time the function is called with no argument for the corresponding parameter. A parameter shall not appear as a potentially-evaluated expression in a default argument. Parameters of a function declared before a default argument are in scope and can hide namespace and class member names.

It provides the following example:

int h(int a, int b = sizeof(a)); // OK, unevaluated operand

So, this function declaration

void f(int y = sizeof(y)) {}

is correct because, in this expression sizeof(y), y is not an evaluated operand, based on C++17 8.3.3 Sizeof:

1 The sizeof operator yields the number of bytes in the object representation of its operand. The operand is either an expression, which is an unevaluated operand (Clause 8), or a parenthesized type-id.

and C++17 6.3.2 Point of declaration:

1 The point of declaration for a name is immediately after its complete declarator (Clause 11) and before its initializer (if any), except as noted below.
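
The two rules quoted above can be tried out with a small sketch. The namespace demo and the body of h are hypothetical additions for illustration; only the declarations themselves come from the question and the standard's example. The disputed declaration is left as a comment, since GCC rejected it at the time of the question.

```cpp
#include <cassert>

namespace demo {

// Point of declaration: the name x is already in scope inside its own
// initializer, so sizeof(x) refers to the variable being declared.
int x = sizeof(x);

// An earlier parameter may appear in a later default argument as long as it
// is an unevaluated operand, as in the standard's own example.
// The function body is a hypothetical addition so the default can be observed.
int h(int a, int b = sizeof(a)) { return a + b; }

// The declaration from the question; accepted by Clang but rejected by GCC
// when the question was asked, so it is not compiled here:
// void f(int y = sizeof(y)) {}

}  // namespace demo
```

Calling demo::h(1) uses the default and adds sizeof(int) to its argument, while demo::h(1, 2) ignores the default.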

Source https://stackoverflow.com/questions/69461415

QUESTION

Command CompileSwiftSources failed with a nonzero exit code XCode 13

Asked 2021-Oct-05 at 16:33

I am trying to run a project in Xcode 13 after running pod cache clean --all, deleting the derived data, and running pod update. When I clean the project and build it, the following error appears:

CompileSwiftSources normal x86_64 com.apple.xcode.tools.swift.compiler (in target 'Alamofire' from project 'Pods')
    cd /Users/aimoresa/MyProject-iOS/Pods
    export DEVELOPER_DIR\=/Applications/Xcode.app/Contents/Developer
    export SDKROOT\=/Applications/Xcode.app/Contents/Developer/Platforms/iPhoneSimulator.platform/Developer/SDKs/iPhoneSimulator15.0.sdk
    /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/swiftc -incremental -module-name Alamofire -Onone -enable-batch-mode -enforce-exclusivity\=checked @/Users/aimoresa/Library/Developer/Xcode/DerivedData/LinkProject-bwzldrnlucfenpavteypbjybxdky/Build/Intermediates.noindex/Pods.build/Debug-iphonesimulator/Alamofire.build/Objects-normal/x86_64/Alamofire.SwiftFileList -DDEBUG -D COCOAPODS -suppress-warnings -sdk /Applications/Xcode.app/Contents/Developer/Platforms/iPhoneSimulator.platform/Developer/SDKs/iPhoneSimulator15.0.sdk -target x86_64-apple-ios10.0-simulator -g -module-cache-path /Users/aimoresa/Library/Developer/Xcode/DerivedData/ModuleCache.noindex -Xfrontend -serialize-debugging-options -enable-testing -index-store-path /Users/aimoresa/Library/Developer/Xcode/DerivedData/LinkProject-bwzldrnlucfenpavteypbjybxdky/Index/DataStore -swift-version 5 -I /Users/aimoresa/Library/Developer/Xcode/DerivedData/LinkProject-bwzldrnlucfenpavteypbjybxdky/Build/Products/Debug-iphonesimulator/Alamofire -F /Users/aimoresa/Library/Developer/Xcode/DerivedData/LinkProject-bwzldrnlucfenpavteypbjybxdky/Build/Products/Debug-iphonesimulator/Alamofire -c -j4 -output-file-map /Users/aimoresa/Library/Developer/Xcode/DerivedData/LinkProject-bwzldrnlucfenpavteypbjybxdky/Build/Intermediates.noindex/Pods.build/Debug-iphonesimulator/Alamofire.build/Objects-normal/x86_64/Alamofire-OutputFileMap.json -parseable-output -serialize-diagnostics -emit-dependencies -emit-module -emit-module-path /Users/aimoresa/Library/Developer/Xcode/DerivedData/LinkProject-bwzldrnlucfenpavteypbjybxdky/Build/Intermediates.noindex/Pods.build/Debug-iphonesimulator/Alamofire.build/Objects-normal/x86_64/Alamofire.swiftmodule -Xcc -I/Users/aimoresa/Library/Developer/Xcode/DerivedData/LinkProject-bwzldrnlucfenpavteypbjybxdky/Build/Intermediates.noindex/Pods.build/Debug-iphonesimulator/Alamofire.build/swift-overrides.hmap -Xcc -iquote -Xcc 
/Users/aimoresa/Library/Developer/Xcode/DerivedData/LinkProject-bwzldrnlucfenpavteypbjybxdky/Build/Intermediates.noindex/Pods.build/Debug-iphonesimulator/Alamofire.build/Alamofire-generated-files.hmap -Xcc -I/Users/aimoresa/Library/Developer/Xcode/DerivedData/LinkProject-bwzldrnlucfenpavteypbjybxdky/Build/Intermediates.noindex/Pods.build/Debug-iphonesimulator/Alamofire.build/Alamofire-own-target-headers.hmap -Xcc -I/Users/aimoresa/Library/Developer/Xcode/DerivedData/LinkProject-bwzldrnlucfenpavteypbjybxdky/Build/Intermediates.noindex/Pods.build/Debug-iphonesimulator/Alamofire.build/Alamofire-all-non-framework-target-headers.hmap -Xcc -ivfsoverlay -Xcc /Users/aimoresa/Library/Developer/Xcode/DerivedData/LinkProject-bwzldrnlucfenpavteypbjybxdky/Build/Intermediates.noindex/Pods.build/Debug-iphonesimulator/Alamofire.build/all-product-headers.yaml -Xcc -iquote -Xcc /Users/aimoresa/Library/Developer/Xcode/DerivedData/LinkProject-bwzldrnlucfenpavteypbjybxdky/Build/Intermediates.noindex/Pods.build/Debug-iphonesimulator/Alamofire.build/Alamofire-project-headers.hmap -Xcc -I/Users/aimoresa/Library/Developer/Xcode/DerivedData/LinkProject-bwzldrnlucfenpavteypbjybxdky/Build/Products/Debug-iphonesimulator/Alamofire/include -Xcc -I/Users/aimoresa/Library/Developer/Xcode/DerivedData/LinkProject-bwzldrnlucfenpavteypbjybxdky/Build/Intermediates.noindex/Pods.build/Debug-iphonesimulator/Alamofire.build/DerivedSources-normal/x86_64 -Xcc -I/Users/aimoresa/Library/Developer/Xcode/DerivedData/LinkProject-bwzldrnlucfenpavteypbjybxdky/Build/Intermediates.noindex/Pods.build/Debug-iphonesimulator/Alamofire.build/DerivedSources/x86_64 -Xcc -I/Users/aimoresa/Library/Developer/Xcode/DerivedData/LinkProject-bwzldrnlucfenpavteypbjybxdky/Build/Intermediates.noindex/Pods.build/Debug-iphonesimulator/Alamofire.build/DerivedSources -Xcc -DPOD_CONFIGURATION_DEBUG\=1 -Xcc -DDEBUG\=1 -Xcc -DCOCOAPODS\=1 -emit-objc-header -emit-objc-header-path 
/Users/aimoresa/Library/Developer/Xcode/DerivedData/LinkProject-bwzldrnlucfenpavteypbjybxdky/Build/Intermediates.noindex/Pods.build/Debug-iphonesimulator/Alamofire.build/Objects-normal/x86_64/Alamofire-Swift.h -import-underlying-module -Xcc -ivfsoverlay -Xcc /Users/aimoresa/Library/Developer/Xcode/DerivedData/LinkProject-bwzldrnlucfenpavteypbjybxdky/Build/Intermediates.noindex/Pods.build/Debug-iphonesimulator/Alamofire.build/unextended-module-overlay.yaml -working-directory /Users/aimoresa/InvestorCentre-iOS/Pods
Command CompileSwiftSources failed with a nonzero exit code

ANSWER

Answered 2021-Oct-05 at 16:33

Edited: For people who use CocoaPods, this answer might be useful: https://stackoverflow.com/a/69384358/587609


I also faced this issue, and it seems that there is a known issue on Xcode 13 as mentioned in this document: https://developer.apple.com/documentation/Xcode-Release-Notes/xcode-13-release-notes

Swift libraries depending on Combine may fail to build for targets including armv7 and i386 architectures. (82183186, 82189214)

Workaround: Use an updated version of the library that isn't impacted (if available) or remove armv7 and i386 support (for example, increase the deployment target of the library to iOS 11 or higher).

If your app is for iOS 11 or higher, one of the libraries should be modified to target iOS 11 or higher (e.g., my app is for iOS 12 or higher).

For example, I am using GRDB.swift, whose minimum iOS version is 10.0. This was discussed in an issue on that repository, and I followed a comment there to solve the problem as follows:

  1. Fork the repository
  2. Change Package.swift to modify the minimum iOS version like:
1CompileSwiftSources normal x86_64 com.apple.xcode.tools.swift.compiler (in target 'Alamofire' from project 'Pods')
2    cd /Users/aimoresa/MyProject-iOS/Pods
3    export DEVELOPER_DIR\=/Applications/Xcode.app/Contents/Developer
4    export SDKROOT\=/Applications/Xcode.app/Contents/Developer/Platforms/iPhoneSimulator.platform/Developer/SDKs/iPhoneSimulator15.0.sdk
5    /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/swiftc -incremental -module-name Alamofire -Onone -enable-batch-mode -enforce-exclusivity\=checked @/Users/aimoresa/Library/Developer/Xcode/DerivedData/LinkProject-bwzldrnlucfenpavteypbjybxdky/Build/Intermediates.noindex/Pods.build/Debug-iphonesimulator/Alamofire.build/Objects-normal/x86_64/Alamofire.SwiftFileList -DDEBUG -D COCOAPODS -suppress-warnings -sdk /Applications/Xcode.app/Contents/Developer/Platforms/iPhoneSimulator.platform/Developer/SDKs/iPhoneSimulator15.0.sdk -target x86_64-apple-ios10.0-simulator -g -module-cache-path /Users/aimoresa/Library/Developer/Xcode/DerivedData/ModuleCache.noindex -Xfrontend -serialize-debugging-options -enable-testing -index-store-path /Users/aimoresa/Library/Developer/Xcode/DerivedData/LinkProject-bwzldrnlucfenpavteypbjybxdky/Index/DataStore -swift-version 5 -I /Users/aimoresa/Library/Developer/Xcode/DerivedData/LinkProject-bwzldrnlucfenpavteypbjybxdky/Build/Products/Debug-iphonesimulator/Alamofire -F /Users/aimoresa/Library/Developer/Xcode/DerivedData/LinkProject-bwzldrnlucfenpavteypbjybxdky/Build/Products/Debug-iphonesimulator/Alamofire -c -j4 -output-file-map /Users/aimoresa/Library/Developer/Xcode/DerivedData/LinkProject-bwzldrnlucfenpavteypbjybxdky/Build/Intermediates.noindex/Pods.build/Debug-iphonesimulator/Alamofire.build/Objects-normal/x86_64/Alamofire-OutputFileMap.json -parseable-output -serialize-diagnostics -emit-dependencies -emit-module -emit-module-path /Users/aimoresa/Library/Developer/Xcode/DerivedData/LinkProject-bwzldrnlucfenpavteypbjybxdky/Build/Intermediates.noindex/Pods.build/Debug-iphonesimulator/Alamofire.build/Objects-normal/x86_64/Alamofire.swiftmodule -Xcc -I/Users/aimoresa/Library/Developer/Xcode/DerivedData/LinkProject-bwzldrnlucfenpavteypbjybxdky/Build/Intermediates.noindex/Pods.build/Debug-iphonesimulator/Alamofire.build/swift-overrides.hmap -Xcc -iquote -Xcc 
/Users/aimoresa/Library/Developer/Xcode/DerivedData/LinkProject-bwzldrnlucfenpavteypbjybxdky/Build/Intermediates.noindex/Pods.build/Debug-iphonesimulator/Alamofire.build/Alamofire-generated-files.hmap -Xcc -I/Users/aimoresa/Library/Developer/Xcode/DerivedData/LinkProject-bwzldrnlucfenpavteypbjybxdky/Build/Intermediates.noindex/Pods.build/Debug-iphonesimulator/Alamofire.build/Alamofire-own-target-headers.hmap -Xcc -I/Users/aimoresa/Library/Developer/Xcode/DerivedData/LinkProject-bwzldrnlucfenpavteypbjybxdky/Build/Intermediates.noindex/Pods.build/Debug-iphonesimulator/Alamofire.build/Alamofire-all-non-framework-target-headers.hmap -Xcc -ivfsoverlay -Xcc /Users/aimoresa/Library/Developer/Xcode/DerivedData/LinkProject-bwzldrnlucfenpavteypbjybxdky/Build/Intermediates.noindex/Pods.build/Debug-iphonesimulator/Alamofire.build/all-product-headers.yaml -Xcc -iquote -Xcc /Users/aimoresa/Library/Developer/Xcode/DerivedData/LinkProject-bwzldrnlucfenpavteypbjybxdky/Build/Intermediates.noindex/Pods.build/Debug-iphonesimulator/Alamofire.build/Alamofire-project-headers.hmap -Xcc -I/Users/aimoresa/Library/Developer/Xcode/DerivedData/LinkProject-bwzldrnlucfenpavteypbjybxdky/Build/Products/Debug-iphonesimulator/Alamofire/include -Xcc -I/Users/aimoresa/Library/Developer/Xcode/DerivedData/LinkProject-bwzldrnlucfenpavteypbjybxdky/Build/Intermediates.noindex/Pods.build/Debug-iphonesimulator/Alamofire.build/DerivedSources-normal/x86_64 -Xcc -I/Users/aimoresa/Library/Developer/Xcode/DerivedData/LinkProject-bwzldrnlucfenpavteypbjybxdky/Build/Intermediates.noindex/Pods.build/Debug-iphonesimulator/Alamofire.build/DerivedSources/x86_64 -Xcc -I/Users/aimoresa/Library/Developer/Xcode/DerivedData/LinkProject-bwzldrnlucfenpavteypbjybxdky/Build/Intermediates.noindex/Pods.build/Debug-iphonesimulator/Alamofire.build/DerivedSources -Xcc -DPOD_CONFIGURATION_DEBUG\=1 -Xcc -DDEBUG\=1 -Xcc -DCOCOAPODS\=1 -emit-objc-header -emit-objc-header-path 
/Users/aimoresa/Library/Developer/Xcode/DerivedData/LinkProject-bwzldrnlucfenpavteypbjybxdky/Build/Intermediates.noindex/Pods.build/Debug-iphonesimulator/Alamofire.build/Objects-normal/x86_64/Alamofire-Swift.h -import-underlying-module -Xcc -ivfsoverlay -Xcc /Users/aimoresa/Library/Developer/Xcode/DerivedData/LinkProject-bwzldrnlucfenpavteypbjybxdky/Build/Intermediates.noindex/Pods.build/Debug-iphonesimulator/Alamofire.build/unextended-module-overlay.yaml -working-directory /Users/aimoresa/InvestorCentre-iOS/Pods

Command CompileSwiftSources failed with a nonzero exit code

let package = Package(
    name: "GRDB",
    platforms: [
        .iOS("12.0"),   // changed here
        .macOS("10.10"),
        .tvOS("9.0"),
        .watchOS("2.0"),
    ],
    ...

  1. Modify Podfile or Swift Package Manager (SPM) config to use my forked repository

I am using five libraries via SPM in my Xcode project, but applying the above method to only one of those libraries solved this issue.

There is also a related thread in the Apple forum: https://developer.apple.com/forums/thread/682285

Source https://stackoverflow.com/questions/69276367

QUESTION

Is it allowed to name a global variable `read` or `malloc` in C++?

Asked 2021-Oct-04 at 09:43

Consider the following C++17 code:

#include <iostream>
int read;
int main(){
    std::ios_base::sync_with_stdio(false);
    std::cin >> read;
}

It compiles and runs fine on Godbolt with GCC 11.2 and Clang 12.0.1, but results in a runtime error when compiled with the -static flag.

As far as I understand, there is a POSIX(?) function called read (see man read(2)), so the example above actually commits an ODR violation and the program is essentially ill-formed even when compiled without -static. GCC even emits a warning if I try to name a variable malloc: built-in function 'malloc' declared as non-function

Is the program above valid C++17? If not, why not? If so, is it a compiler bug that prevents it from running?

ANSWER

Answered 2021-Oct-03 at 12:09

The code shown is valid (all C++ Standard versions, I believe). The similar restrictions are all listed in [reserved.names]. Since read is not declared in the C++ standard library, nor in the C standard library, nor in older versions of the standard libraries, and is not otherwise listed there, it's fair game as a name in the global namespace.

So is it an implementation defect that it won't link with -static? (Not a "compiler bug": the compiler piece of the toolchain is fine, and there's nothing forbidding a warning on valid code.) It does at least work with default settings (though only because the GNU linker doesn't mind duplicated symbols in an unused object of a dynamic library), and one could argue that's all that's needed for Standard compliance.

We also have at [intro.compliance]/8

A conforming implementation may have extensions (including additional library functions), provided they do not alter the behavior of any well-formed program. Implementations are required to diagnose programs that use such extensions that are ill-formed according to this International Standard. Having done so, however, they can compile and execute such programs.

We can consider POSIX functions such an extension. This is intentionally vague on when or how such extensions are enabled. The g++ driver of the GCC toolset links a number of libraries by default, and we can consider that as adding not only the availability of non-standard #include headers but also adding additional translation units to the program. In theory, different arguments to the g++ driver might make it work without the underlying link step using libc.so. But good luck - one could argue it's a problem that there's no simple way to link only names from the C++ and C standard libraries without including other unreserved names.

(Does "not altering the behavior of any well-formed program" even mean that an implementation extension can't use non-reserved names for the additional libraries? I hope not, but I could see a strict reading implying that.)

So I haven't claimed a definitive answer to the question, but the practical situation is unlikely to change, and a Standard Defect Report would in my opinion be more nit-picking than a useful clarification.

Source https://stackoverflow.com/questions/69424363

QUESTION

Are char arrays guaranteed to be null terminated?

Asked 2021-Sep-16 at 07:51

#include <stdio.h>

int main() {
    char a = 5;
    char b[2] = "hi"; // No explicit room for `\0`.
    char c = 6;

    return 0;
}

Whenever we write a string, enclosed in double quotes, C automatically creates an array of characters for us, containing that string, terminated by the \0 character http://www.eskimo.com/~scs/cclass/notes/sx8.html

In the above example, b only has room for 2 characters, so the null terminator has nowhere to be placed. Yet the compiler is reorganizing the memory store instructions so that a and c are stored before b in memory, making room for a \0 at the end of the array.

Is this expected or am I hitting undefined behavior?

ANSWER

Answered 2021-Sep-13 at 12:35

It is allowed to initialize a char array with a string if the array is at least large enough to hold all of the characters in the string besides the null terminator.

This is detailed in section 6.7.9p14 of the C standard:

An array of character type may be initialized by a character string literal or UTF-8 string literal, optionally enclosed in braces. Successive bytes of the string literal (including the terminating null character if there is room or if the array is of unknown size) initialize the elements of the array.

However, this also means that you can't treat the array as a string since it's not null terminated. So as written, since you're not performing any string operations on b, your code is fine.

What you can't do is initialize with a string that's too long, i.e.:

char b[2] = "hello";

This gives more initializers than can fit in the array and is a constraint violation. Section 6.7.9p2 states this as follows:

No initializer shall attempt to provide a value for an object not contained within the entity being initialized.

If you were to declare and initialize the array like this:

char b[] = "hi";

Then b would be an array of size 3, which is large enough to hold the two characters in the string constant plus the terminating null byte, making b a string.

To summarize:

If the array has a fixed size:

  • If the string constant used to initialize it is shorter than the array, the array will contain the characters in the string with successive elements set to 0, so the array will contain a string.
  • If the array is exactly large enough to contain the elements of the string but not the null terminator, the array will contain the characters in the string without the null terminator, meaning the array is not a string.
  • If the string constant (not counting the null terminator) is longer than the array, this is a constraint violation, which triggers undefined behavior.

If the array does not have an explicit size, the array will be sized to hold the string constant plus the terminating null byte.
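
The cases summarized above can be checked with a small sketch in C. This is only an illustration, assuming a hosted C99-or-later compiler; the helper names (has_terminator, longer_array_is_string, exact_array_is_string, deduced_array_size) are hypothetical and do not come from the original answer. The too-long initializer is left out because it should not compile cleanly.

```c
#include <stddef.h>
#include <string.h>

/* Returns 1 if a '\0' occurs within the first n bytes of buf, else 0.
   memchr never reads past n bytes, so this is safe on unterminated arrays. */
int has_terminator(const char *buf, size_t n) {
    return memchr(buf, '\0', n) != NULL;
}

/* Array longer than the literal: the remaining elements are set to 0,
   so the array holds a string. */
int longer_array_is_string(void) {
    char b[5] = "hi";
    return has_terminator(b, sizeof b) && b[2] == 0 && b[3] == 0 && b[4] == 0;
}

/* Array exactly as long as the characters: there is no room for '\0',
   so the array is not a string (valid in C, rejected by C++). */
int exact_array_is_string(void) {
    char b[2] = "hi";
    return has_terminator(b, sizeof b);
}

/* No explicit size: the array is sized to hold the characters
   plus the terminating null byte. */
size_t deduced_array_size(void) {
    char b[] = "hi";
    return sizeof b;
}
```

Calling these from main, longer_array_is_string() should report a string, exact_array_is_string() should not, and deduced_array_size() should be one more than the number of characters in the literal. Note that has_terminator only looks inside the array; calling strlen on the two-element array would read past its end.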

Source https://stackoverflow.com/questions/69162573

QUESTION

Why is C++'s NULL typically an integer literal rather than a pointer like in C?

Asked 2021-Sep-11 at 17:58

I've been writing C++ for many years, using nullptr for null pointers. I also know C, whence NULL originates, and remember that it's the constant for a null pointer, with type void *.

For reasons, I've had to use NULL in my C++ code for something. Well, imagine my surprise when during some template argument deduction the compiler tells me that my NULL is really a ... long. So, I double-checked:

#include <type_traits>
#include <cstddef>

static_assert(not std::is_same<decltype(NULL), long>::value, "NULL is long ???");

And indeed, the static assertion fails (with GCC and with Clang).

I checked on cppreference.com, and sure enough (C++11 wording):

The macro NULL is an implementation-defined null pointer constant, which may be an integer literal with value zero, or a prvalue of type std::nullptr_t.

Why does this make sense? In itself, and in light of the incompatibility with C?

ANSWER

Answered 2021-Sep-04 at 16:50

In C, a void* can be implicitly converted to any T*. As such, making NULL a void* is entirely appropriate.

But that's profoundly dangerous. So C++ did away with such conversions, requiring you to do most pointer casts manually. But that would create source-incompatibility with C; a valid C program that used NULL the way C wanted would fail to compile in C++. It would also require a bunch of redundancy: T *pt = (T*)(NULL);, which would be irritating and pointless.

So C++ redefined the NULL macro to be the integer literal 0. In C, the literal 0 is also implicitly convertible to any pointer type and generates a null pointer value, behavior which C++ kept.
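
The C side of this can be sketched as follows. It is a minimal illustration, assuming any conforming C compiler; the function name null_forms_agree is hypothetical. The line that converts a void* to an int* without a cast is the one a C++ compiler would reject.

```c
#include <stddef.h>

/* Returns 1 if the three ways of producing a null int* all agree. */
int null_forms_agree(void) {
    int *from_zero = 0;        /* integer constant 0 is a null pointer constant */
    int *from_macro = NULL;    /* works whether NULL is 0 or ((void *)0) */
    void *generic = NULL;
    int *from_void = generic;  /* implicit void* -> int*: valid C, ill-formed in C++ */
    return from_zero == from_macro
        && from_macro == from_void
        && from_void == NULL;
}
```

Because both the literal 0 and a void* null pointer convert implicitly in C, a C implementation is free to pick either definition of NULL. Once C++ dropped the implicit void* conversion, only the integer-literal form kept such code compiling.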

Now of course, using the literal 0 (or more accurately, an integer constant expression whose value is 0) for a null pointer constant was... not the best idea. Particularly in a language that allows overloading. So C++11 gave up on NULL entirely in favor of a keyword, nullptr, that specifically means "null pointer constant" and nothing else.

Source https://stackoverflow.com/questions/69057184

Community Discussions contain sources that include Stack Exchange Network

Tutorials and Learning Resources in Compiler

Tutorials and Learning Resources are not available at this moment for Compiler
