I was reading through the guide to using the various memory barriers provided by Linux and came upon the example below.
I was curious why CPU 2 will load P into Q before issuing the load of *Q, and whether this is actually guaranteed to always be the case.
I do not see it explicitly stated, but I assume there is a memory-ordering guarantee that all writes to a pointer variable will occur before any subsequent dereference of that pointer variable. Can anyone confirm that this is an accurate interpretation, or provide evidence from the Linux-kernel memory model that justifies this behavior?
As a further example, consider this sequence of events:
```
CPU 1                 CPU 2
===============       ===============
{ A == 1, B == 2, C == 3, P == &A, Q == &C }
B = 4;                Q = P;
P = &B;               D = *Q;
```
There is an obvious address dependency here, as the value loaded into D depends
on the address retrieved from P by CPU 2. At the end of the sequence, any of
the following results are possible:
```
(Q == &A) and (D == 1)
(Q == &B) and (D == 2)
(Q == &B) and (D == 4)
```
Note that CPU 2 will never try and load C into D because the CPU will load P
into Q before issuing the load of *Q.
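To make the question concrete, here is the same sequence written out as plain C. This is my own sketch, not from memory-barriers.txt; the function names are made up and the globals just mirror the documented initial values:

```c
int A = 1, B = 2, C = 3;
int *P = &A, *Q = &C;
int D;

void cpu1(void)
{
    B = 4;
    P = &B;     /* no barrier: this store may become visible before B = 4 */
}

void cpu2(void)
{
    Q = P;      /* load P, store the pointer value into Q */
    D = *Q;     /* dereference whatever address Q now holds */
}
```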
The Linux kernel with `volatile` is essentially the same as ISO C with `memory_order_relaxed`. (Or stronger, because compile-time reordering of `volatile` operations wrt. each other isn't allowed even to different addresses.) (In Linux kernel code this would be `WRITE_ONCE(B, 4);` and so on, unless you actually declared the shared variables as `volatile`.)

Within the same thread, sequencing and coherence rules apply, so `shared = 1; tmp = shared` is guaranteed to read `1` or some later value in the modification order of `shared`.
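A rough ISO C11 analogue of that point (my own sketch, not kernel code): relaxed atomics give the same "no tearing, no invented accesses" property as `WRITE_ONCE()`/`READ_ONCE()`, and program order still constrains what a later load in the same thread can return.

```c
#include <stdatomic.h>

_Atomic int shared;

void same_thread(void)
{
    atomic_store_explicit(&shared, 1, memory_order_relaxed);

    /* Sequenced after the store above in the same thread, so this is
     * guaranteed to observe 1 or some later value in shared's
     * modification order. */
    int tmp = atomic_load_explicit(&shared, memory_order_relaxed);
    (void)tmp;
}
```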
Except it's even simpler in this case: `Q` isn't touched by CPU 1 in this example, so it's not a shared variable; it's effectively local to CPU 2. The as-if rules of compiler optimization and out-of-order execution require preserving single-threaded correctness, such as variables having their new values after we write them.

Dereferencing an atomic pointer involves reading it into a temporary (a register on all(?) current ISAs that Linux supports) and then dereferencing that temporary. In this case, `Q` is that temporary, despite the unclear naming convention.

The read itself (of `Q`) is sequenced after the write (`Q = ...`) from the same thread.
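Spelled out, `D = *Q;` is really two ordered steps; a sketch of what CPU 2 effectively executes (declarations mirror the example, the temporary name is mine):

```c
extern int *P, *Q;
extern int D;

void cpu2(void)
{
    int *tmp = P;   /* step 1: load the pointer value; this is what "Q" ends up holding */
    Q = tmp;
    D = *tmp;       /* step 2: load through that value; the address used here is the
                     * result of the step-1 load, so it can't be formed before it */
}
```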
The pointed-to data from `D = *Q;` is dependency-ordered after the read of `Q`, like C/C++ `memory_order_consume`, except on DEC Alpha AXP. All other ISAs that Linux runs on have hardware memory-dependency ordering, and a compiler can't plausibly break this code by knowing there's only one possible value for `Q` and substituting a constant for it (which wouldn't have a data dependency on the `P` load result), or a branch (control dependencies can be speculated past, unlike data dependencies).
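For comparison, here is a minimal C11 sketch of the reader side using `memory_order_consume` (my own illustration; in practice compilers promote `consume` to `acquire`, as discussed further down):

```c
#include <stdatomic.h>

extern _Atomic(int *) P;   /* the published pointer from the example */

int reader(void)
{
    int *q = atomic_load_explicit(&P, memory_order_consume);

    /* This dereference carries a data dependency on the value q returned
     * by the load above, so it is dependency-ordered after that load. */
    return *q;
}
```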
But we can still read `(Q == &B) && (D == 2)`, because CPU 1 didn't do a release store. If it had, then even though CPU 2 didn't do an acquire load, the data dependency through the address will stop any real-world CPU from loading from `*Q` until after it knows the address from `Q`.

If you want to rely on this in Linux kernel code, you should use `smp_read_barrier_depends()` between `Q = P` and `*Q`. It's a no-op on everything except Alpha.
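A hedged kernel-style sketch of that pairing (not a standalone program; `WRITE_ONCE()`, `READ_ONCE()`, `smp_store_release()` and `smp_read_barrier_depends()` are kernel-internal primitives, and newer kernels have since folded the Alpha barrier into `READ_ONCE()`/`rcu_dereference()`, so this reflects the interface described above):

```c
int A = 1, B = 2;
int *P = &A;
int D;

void cpu1(void)
{
    WRITE_ONCE(B, 4);
    smp_store_release(&P, &B);      /* publish: orders B = 4 before the P update */
}

void cpu2(void)
{
    int *q = READ_ONCE(P);
    smp_read_barrier_depends();     /* no-op everywhere except Alpha */
    D = *q;                         /* if q == &B, D is guaranteed to be 4 */
}
```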
Related:

- Memory order consume usage in C11
- C++11: the difference between memory_order_relaxed and memory_order_consume
- Paul E. McKenney's CppCon 2016 talk, *C++ Atomics: The Sad Story of `memory_order_consume`: A Happy Ending At Last?* - he describes how Linux effectively uses `relaxed` and avoids doing stuff like `tmp = shared; array[tmp - tmp]` as a way to get one load ordered after another, because compilers will optimize that to a constant `0` with no dependency.
(That's why C++'s `memory_order_consume` had to exist: it does let you do stuff like that with formal guarantees. On real ISAs like AArch64, compilers would have to emit asm that generates a `0` with a data dependency on `tmp`, e.g. with a `sub` or `xor` instruction. Fun fact: non-x86 CPUs aren't even allowed to optimize xor-zeroing of registers, because their ISA rules say `xor` and `sub` carry a dependency. Anyway, that's also why `memory_order_consume` proved so hard to support correctly, so compilers gave up and promoted `consume` to `acquire` instead of possibly having their optimizer break code like this. Linux kernel code can rely on stuff like "we know what we're doing well enough, with a limited set of compilers, that we can write code that compilers shouldn't optimize in ways that break `consume`".)
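For completeness, here is the kind of manufactured dependency McKenney warns about, written as a sketch (names are mine). Under relaxed or plain semantics a compiler may fold `tmp - tmp` to `0`, leaving the array load with no data dependency on `tmp` at all:

```c
#include <stdatomic.h>

extern int array[];
extern _Atomic int shared;

int manufactured_dependency(void)
{
    int tmp = atomic_load_explicit(&shared, memory_order_relaxed);

    /* Likely compiled as array[0]: the intended ordering between the two
     * loads evaporates once the index is constant-folded. */
    return array[tmp - tmp];
}
```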