Seqlock | An implementation of Seqlock in C11
This project was created by Erik Rigtorp.
Community Discussions on Seqlock
QUESTION
In both the MSVC STL and LLVM libc++ implementations, std::atomic<T> for a T too large to be lock-free is implemented using a spin lock. Could a SeqLock be used instead?

libc++ (Github):
...

ANSWER
Answered 2022-Feb-20 at 22:49

Yes, you can use a SeqLock as a readers/writers lock if you provide mutual exclusion between writers. You'd still get read-side scalability, while writes and RMWs would stay about the same.
It's not a bad idea, although it has potential fairness problems for readers if you have very frequent writes. Maybe not a good idea for a mainstream standard library, at least not without some testing with some different workloads / use-cases on a range of hardware, since working great on some machines but faceplanting on others is not what you want for standard library stuff. (Code that wants great performance for its special case often unfortunately has to use an implementation that's tuned for it, not the standard one.)
Mutual exclusion is possible with a separate spinlock, or just using the low bit of the sequence number. In fact I've seen other descriptions of a SeqLock that assumed you'd be using it with multiple writers, and didn't even mention the single-writer case that allows pure-load and pure-store for the sequence number to avoid the cost of an atomic RMW.
How to use the sequence number as a spinlock

A writer or RMWer attempts an atomic CAS to increment the sequence number (if it wasn't already odd). If the sequence number is already odd, writers just spin until they see an even value.
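For concreteness, here is a minimal C++ sketch of that scheme. The class name SpinSeqLock and its layout are invented for this illustration (it is not any particular library's implementation); the reader side follows the usual load/copy/fence/re-check pattern:

    #include <atomic>

    // Illustrative seqlock where the low bit of 'seq' doubles as the
    // writer lock: even = unlocked, odd = write in progress.
    template <typename T>
    class SpinSeqLock {
        std::atomic<unsigned> seq{0};
        T data{};

    public:
        void store(const T& desired) {
            unsigned s;
            for (;;) {
                s = seq.load(std::memory_order_relaxed);
                // Only try the CAS when the count is even; CAS to odd
                // both takes the lock and announces a write in progress.
                if (!(s & 1) &&
                    seq.compare_exchange_weak(s, s + 1,
                                              std::memory_order_acquire))
                    break;
                // else: another writer is active or we lost the race; spin.
            }
            data = desired;  // NB: formally a data race with readers in ISO C++;
                             // real code copies via relaxed atomic chunks.
            seq.store(s + 2, std::memory_order_release);  // even again: unlock
        }

        T load() const {
            T copy;
            unsigned s0, s1;
            do {
                s0 = seq.load(std::memory_order_acquire);
                copy = data;  // may tear; validated by the re-check below
                std::atomic_thread_fence(std::memory_order_acquire);
                s1 = seq.load(std::memory_order_relaxed);
            } while ((s0 & 1) || s0 != s1);  // retry if odd or changed
            return copy;
        }
    };

With a single writer, the CAS can be replaced by a plain relaxed load and store of seq, which is the cheaper pure-load/pure-store case mentioned above.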
This would mean writers have to start by reading the sequence number before trying to write, which can cause extra coherency traffic (a MESI Share request, then an RFO). On a machine that actually had fetch_or in hardware, you could use it to atomically make the count odd and see if you won the race to take it from even to odd.
On x86-64, you can use lock bts to set the low bit and find out what the old low bit was, then load the whole sequence number if it was previously even (because you won the race, and no other writer is going to be modifying it). So you can do a release-store of that value plus 1 to "unlock", instead of needing a lock add.
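Testing only the old low bit of a fetch_or result is the shape that a compiler may turn into lock bts. A sketch of that write path, reusing the hypothetical seq/data fields from the SpinSeqLock sketch above:

    // Hypothetical fetch_or-based write path; T, seq and data are as in
    // the SpinSeqLock sketch earlier.
    template <typename T>
    void store_via_fetch_or(std::atomic<unsigned>& seq, T& data,
                            const T& desired) {
        // Set the low bit and test only the old low bit; clang can compile
        // this pattern to "lock bts" on x86-64.
        while (seq.fetch_or(1, std::memory_order_acquire) & 1) {
            // Already odd: another writer holds the lock; keep spinning.
        }
        // We won the even->odd race, so no other writer can touch seq now:
        // a plain load recovers the full (odd) sequence value.
        unsigned odd = seq.load(std::memory_order_relaxed);
        data = desired;  // same formal-data-race caveat as before
        seq.store(odd + 1, std::memory_order_release);  // odd + 1 = even: unlock
    }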
Making other writers faster at reclaiming the lock may actually be a bad thing, though: you want to give a window for readers to complete. Maybe just use multiple pause instructions (or the equivalent on non-x86) in write-side spin loops, more than in read-side spins. If contention is low, readers probably had time to see it before writers got to it; otherwise writers will frequently see it locked and go into the slower spin loop. Maybe with faster-increasing backoff for writers, too, as sketched below.
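One possible shape for that asymmetric backoff, as a sketch (the constants and function names are invented; _mm_pause is the x86 spin-wait hint):

    #include <immintrin.h>  // _mm_pause on x86; use a comparable hint elsewhere

    // Readers re-check quickly; writers back off harder, and increasingly
    // so, to leave in-flight readers a window to finish.
    inline void read_side_spin() {
        _mm_pause();
    }

    inline void write_side_spin(unsigned& backoff) {
        for (unsigned i = 0; i < backoff; ++i)
            _mm_pause();
        if (backoff < 1024)
            backoff *= 2;  // faster-increasing backoff for writers, capped
    }

A writer's retry loop would start with something like backoff = 4 and call write_side_spin after each failed attempt; readers just call read_side_spin between retries.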
An LL/SC machine could (in asm at least) test-and-increment just as easily as CAS or TAS. I don't know how to write pure C++ that would compile to just that. fetch_or could compile efficiently for LL/SC, but would still do a store even if the value was already odd. (If you have to LL separately from SC, you might as well make the most of it and not store at all when the store would be useless, and hope that the hardware is designed to make the best of things.)
(It's critical to not unconditionally increment; you must not unlock another writer's ownership of the lock. But an atomic-RMW that leaves the value unchanged is always ok for correctness, if not performance.)
It may not be a good idea by default, because heavy write activity can make it hard for a reader to get a successful read done. As Wikipedia points out:
The reader never blocks, but it may have to retry if a write is in progress; this speeds up the readers in the case where the data was not modified, since they do not have to acquire the lock as they would with a traditional read–write lock. Also, writers do not wait for readers, whereas with traditional read–write locks they do, leading to potential resource starvation in a situation where there are a number of readers (because the writer must wait for there to be no readers). Because of these two factors, seqlocks are more efficient than traditional read–write locks for the situation where there are many readers and few writers. The drawback is that if there is too much write activity or the reader is too slow, they might livelock (and the readers may starve).
The "too slow reader" problem is unlikely, just a small memcpy. Code shouldn't expect good results from std::atomic
for very large T
; the general assumption is that you'd only bother with std::atomic for a T that can be lock-free on some implementations. (Usually not including transactional memory since mainstream implementations don't do that.)
But the "too much write" problem could still be real: SeqLock is best for read-mostly data. Readers may have a bad time with a heavy write mix, retrying even more than with a simple spinlock or a readers-writers lock.
It would be nice if there were a way to make this an option for an implementation, like an optional extra template parameter on std::atomic, or a #pragma, or a #define before including <atomic>. Or a command-line option.
An optional template param affects every use of the type, but might be slightly less clunky than a separate class name like gnu::atomic_seqlock. An optional template param would still make these types be std::atomic, so e.g. they would still match specializations of other things written for std::atomic. But it might break other things, IDK.
Might be fun to hack something up to experiment with.
QUESTION
The paper N4455 No Sane Compiler Would Optimize Atomics talks about various optimizations compilers can apply to atomics. Under the section Optimization Around Atomics, for the seqlock example, it mentions a transformation implemented in LLVM, where a fetch_add(0, std::memory_order_release) is turned into an mfence followed by a plain load, rather than the usual lock add or xadd. The idea is that this avoids taking exclusive access of the cache line, and is relatively cheaper. The mfence is still required, regardless of the ordering constraint supplied, to prevent StoreLoad reordering for the mov instruction generated.
This transformation is performed for such read-don't-modify-write operations regardless of the ordering, and equivalent assembly is produced for fetch_add(0, memory_order_relaxed).
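Concretely, the shape of the transformation on x86-64 is roughly the following (the assembly comments show the general idea, not the exact output of any particular compiler version):

    #include <atomic>

    std::atomic<int> x;

    int read_dont_modify_write() {
        // An RMW whose "modify" step is a no-op.
        return x.fetch_add(0, std::memory_order_release);
    }

    // Compiled naively, this is a real atomic RMW:
    //     xor  eax, eax
    //     lock xadd dword ptr [rip + x], eax
    // With the LLVM transformation described above, it becomes roughly:
    //     mfence
    //     mov  eax, dword ptr [rip + x]
    // i.e. a full barrier plus a plain load, which never needs exclusive
    // ownership of the cache line.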
However, I am wondering if this is legal. The C++ standard explicitly notes under [atomic.order] that:
Atomic read-modify-write operations shall always read the last value (in the modification order) written before the write associated with the read-modify-write operation.
This fact about RMW operations seeing the 'latest' value has also been noted previously by Anthony Williams.
My question is: Is there a difference in the behavior a thread could observe, in terms of the modification order of the atomic variable, depending on whether the compiler emits a lock add vs. an mfence followed by a plain load? Is it possible for this transformation to cause the thread performing the RMW operation to instead load values older than the latest one? Does this violate the guarantees of the C++ memory model?
ANSWER
Answered 2020-Dec-16 at 04:44(I started writing this a while ago but got stalled; I'm not sure it adds up to a full answer, but thought some of this might be worth posting. I think @LWimsey's comments do a better job of getting to the heart of an answer than what I wrote.)
Yes, it's safe.
Keep in mind that the way the as-if rule applies is that execution on the real machine has to always produce a result that matches one possible execution on the C++ abstract machine. It's legal for optimizations to make some executions that the C++ abstract machine allows impossible on the target. Even compiling for x86 at all makes all IRIW reordering impossible, for example, whether the compiler likes it or not. (See below; some PowerPC hardware is the only mainstream hardware that can do it in practice.)
I think the reason that wording is there for RMWs specifically is that it ties the load to the "modification order" which ISO C++ requires to exist for each atomic object separately. (Maybe.)
Remember that the way C++ formally defines its ordering model is in terms of synchronizes-with, and the existence of a modification order for each object (that all threads can agree on). That's unlike hardware, where there is a notion of coherent caches (footnote 1) creating a single coherent view of memory that each core accesses. The existence of coherent shared memory (typically using MESI to maintain coherence at all times) makes a bunch of things implicit, like the impossibility of reading "stale" values. (Although HW memory models do typically document this explicitly, like C++ does.)
Thus the transformation is safe.
ISO C++ does mention the concept of coherency in a note in another section: http://eel.is/c++draft/intro.races#14
The value of an atomic object M, as determined by evaluation B, shall be the value stored by some side effect A that modifies M, where B does not happen before A.
[Note 14: The set of such side effects is also restricted by the rest of the rules described here, and in particular, by the coherence requirements below. — end note]...
[Note 19: The four preceding coherence requirements effectively disallow compiler reordering of atomic operations to a single object, even if both operations are relaxed loads. This effectively makes the cache coherence guarantee provided by most hardware available to C++ atomic operations. — end note]
[Note 20: The value observed by a load of an atomic depends on the “happens before” relation, which depends on the values observed by loads of atomics. The intended reading is that there must exist an association of atomic loads with modifications they observe that, together with suitably chosen modification orders and the “happens before” relation derived as described above, satisfy the resulting constraints as imposed here. — end note]
So ISO C++ itself notes that cache coherence gives some ordering, and x86 has coherent caches. (I'm not making a complete argument that this is enough ordering, sorry. LWimsey's comments about what it even means to be the latest in a modification order are relevant.)
(On many ISAs (but not all), the memory model also rules out IRIW reordering when you have stores to 2 separate objects. (e.g. on PowerPC, 2 reader threads can disagree about the order of 2 stores to 2 separate objects). Very few implementations can create such reordering: if shared cache is the only way data can get between cores, like on most CPUs, that creates an order for stores.)
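For reference, the IRIW (independent reads of independent writes) litmus test looks like this. With seq_cst everywhere, the outcome r1==1, r2==0, r3==1, r4==0 is forbidden, because all threads must agree on a single order of the two stores; with weaker orders, PowerPC hardware can let the two readers disagree. (A sketch to show the shape, not a reliable reproducer.)

    #include <atomic>
    #include <thread>

    std::atomic<int> x{0}, y{0};
    int r1, r2, r3, r4;

    // Two independent writers, two readers reading in opposite order.
    void w1() { x.store(1, std::memory_order_seq_cst); }
    void w2() { y.store(1, std::memory_order_seq_cst); }
    void t1() { r1 = x.load(std::memory_order_seq_cst);
                r2 = y.load(std::memory_order_seq_cst); }
    void t2() { r3 = y.load(std::memory_order_seq_cst);
                r4 = x.load(std::memory_order_seq_cst); }

    int main() {
        std::thread a(w1), b(w2), c(t1), d(t2);
        a.join(); b.join(); c.join(); d.join();
    }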
Is it possible for this transformation to cause the thread performing the RMW operation to instead load values older than the latest one?
On x86 specifically, it's very easy to reason about. x86 has a strongly-ordered memory model (TSO = Total Store Order = program order + a store buffer with store-forwarding).
Footnote 1: All cores that std::thread can run across have coherent caches. True on all real-world C++ implementations across all ISAs, not just x86-64. There are some heterogeneous boards with separate CPUs sharing memory without cache coherency, but ordinary C++ threads of the same process won't be running across those different cores. See this answer for more details about that.
Community Discussions and Code Snippets contain sources that include the Stack Exchange Network.