Why exactly is a full read memory barrier required in the kernel docs at Documentation/memory-barriers.txt:709:
    q = READ_ONCE(a);
    if (q) {
        <read barrier>  // why?
        p = READ_ONCE(b);
    }
The explanation says 'the CPU may short-circuit by attempting to predict the outcome in advance, so that other CPUs see the load from b as having happened before the load from a'
Does this explanation imply that the CPU executing this snippet will not reorder the reads from a and b?
Why is the order in which other CPUs see the reads important? What would be an example of a scenario where this results in a bug?
The only issue I can see is if the CPU emits the reads to a and b out of order.
Are the CPUs supported by the kernel allowed to do that reordering?
If yes, where is this rule stated? Then I would see the need for the barrier, but not for the reason stated in the explanation.
I tried asking on IRC, but no one knew.
Speculative execution is the key here. A CPU can speculate a load, unlike a store, because a load has no side effects visible to other cores.
CPUs handle control dependencies (branches) with branch prediction plus speculative execution, rather than waiting for the input the way they must for data dependencies. (See: Difference between data dependence and control dependence.)
The second load can start as soon as the address of b is available and the load instruction enters the out-of-order back-end, before q is ready. When q is ready and the cbnz or whatever can execute to confirm correct branch prediction, nothing happens: the later load was already started. (If it instead detects that q was zero, so the load shouldn't have happened, execution rolls back to the correct path, discarding the load result.)
As for whether CPUs supported by the kernel are allowed to do that reordering: yes, most non-x86 CPUs do that. (And modern x86 internally speculates, but takes a memory-ordering machine clear if a cache line wasn't still valid at the point when it's architecturally allowed to be read. Loading early is so important for performance that it's worth speculating on, especially since the speculation will be valid as long as no other core invalidates the cache line, as happens with true or false sharing.)
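As a rough illustration, the snippet from the question can be approximated in portable C11 atomics. This is a sketch, not kernel code: the relaxed loads stand in for READ_ONCE(), the acquire fence stands in for the read barrier (it is at least as strong for load-load ordering), and the names load_b_if_a_set and smoke_test as well as the -1 sentinel are made up for the example.

```c
#include <assert.h>
#include <stdatomic.h>

atomic_int a;   /* shared flag from the question, zero-initialised */
atomic_int b;   /* shared payload from the question, zero-initialised */

/*
 * Hypothetical helper: returns the value of b if a was nonzero, else -1.
 * Without the fence, a weakly-ordered CPU may satisfy the load of b
 * speculatively, before the load of a has produced its value.
 */
int load_b_if_a_set(void)
{
    /* Roughly READ_ONCE(a): an atomic load with no ordering guarantees. */
    int q = atomic_load_explicit(&a, memory_order_relaxed);

    if (q) {
        /* Stand-in for the read barrier: orders the load of a
         * before every load that follows the fence. */
        atomic_thread_fence(memory_order_acquire);

        /* Roughly READ_ONCE(b). */
        return atomic_load_explicit(&b, memory_order_relaxed);
    }
    return -1;
}

/* Single-threaded smoke test: it exercises the control flow only and
 * cannot demonstrate the reordering itself. */
void smoke_test(void)
{
    atomic_store(&b, 7);
    atomic_store(&a, 0);
    assert(load_b_if_a_set() == -1);   /* branch not taken */
    atomic_store(&a, 1);
    assert(load_b_if_a_set() == 7);    /* branch taken, b is read */
}
```

On x86 the fence typically compiles to no instruction at all, because the hardware already preserves load-load order; on AArch64 or POWER it becomes a real barrier instruction.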
Related: Why do weak memory models exist and how is their instruction order selected?
Note that these loads are independent: the address for the second doesn't depend on the load result from the first. If they were dependent, all architectures except some models of DEC Alpha would do the dependent load after the first, like C++ memory_order_consume. (Dependent loads reordering in CPU). In the Linux kernel memory model, you'd need smp_read_barrier_depends(), which is a no-op on ISAs still supported by Linux. And apparently it is now considered obsolete in the kernel, being implicit in the READ_ONCE macros.
Writer:
reader:
Without the smp_rmb read memory barrier (or better, an acquire load, like the AArch64 ldapr that C++11 std::atomic data_ready.load(std::memory_order_acquire) would use), the reader can load old values from buffer[i] from before the writer's stores.
This is the canonical use-case for acquire/release semantics (https://preshing.com/20120913/acquire-and-release-semantics/), whether the reader checks the flag once or waits for it in a spin-loop.
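The writer/reader hand-off described above can be sketched in user space with C11 atomics and POSIX threads. This is an assumed reconstruction of the pattern, not the original kernel code: buffer and data_ready come from the text, while N, the writer and reader thread bodies, and run_handoff are hypothetical names chosen for the example. The release store plays the role of a write barrier followed by the flag store, and the acquire load plays the role of the flag load followed by smp_rmb.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

#define N 4                      /* hypothetical payload size */

int buffer[N];                   /* plain, non-atomic payload */
atomic_int data_ready;           /* flag guarding the payload, zero-initialised */

void *writer(void *arg)
{
    (void)arg;
    for (int i = 0; i < N; i++)
        buffer[i] = i + 1;       /* fill the payload with 1..N */

    /* Release store: the payload stores above cannot become visible
     * after this flag store. */
    atomic_store_explicit(&data_ready, 1, memory_order_release);
    return NULL;
}

void *reader(void *arg)
{
    int *sum = arg;

    /* Spin until the flag is set. The acquire load keeps the buffer
     * loads below from being satisfied before the flag load. */
    while (!atomic_load_explicit(&data_ready, memory_order_acquire))
        ;

    *sum = 0;
    for (int i = 0; i < N; i++)
        *sum += buffer[i];
    return NULL;
}

/* Runs one hand-off and returns the sum the reader observed. */
int run_handoff(void)
{
    pthread_t w, r;
    int sum = -1;
    int rc;

    atomic_store(&data_ready, 0);    /* reset the flag before starting */

    rc = pthread_create(&r, NULL, reader, &sum);
    assert(rc == 0);
    rc = pthread_create(&w, NULL, writer, NULL);
    assert(rc == 0);
    (void)rc;                        /* silence unused warning under NDEBUG */

    pthread_join(w, NULL);
    pthread_join(r, NULL);
    return sum;
}
```

With the acquire/release pair in place the reader can only observe the fully written buffer. Weakening both operations to memory_order_relaxed would make the program a data race, and on a weakly-ordered machine the reader could then see stale buffer contents, which is the bug the question asks about.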