Consider the following code:
```cpp
#include <atomic>

std::atomic<bool> flag;

void thread1()
{
    flag.store(true, std::memory_order_relaxed);  // set the flag; no ordering of other data
}

void thread2()
{
    while (!flag.load(std::memory_order_relaxed))  // spin until the store becomes visible
        ;
}
```
Under the Standard, could the compiler optimize out the store in `thread1` (since `thread1` has no release fence), making `thread2` an infinite loop? Or could the compiler buffer `flag` in a register in `thread2` after reading it from memory once, making `thread2` potentially an infinite loop regardless of how `flag` is written?
If that's a potential problem, would `volatile` fix it?

```cpp
volatile std::atomic<bool> flag;
```
An answer that quotes from the Standard would be most appreciated.
No. Skipping the store, or hoisting the `std::atomic` load out of the loop (and thereby inventing an infinite loop that the thread could never leave!), would violate the standard's guidance that stores "should" be visible to loads "in a reasonable amount of time" [atomics.order] and in a "finite period of time" [intro.progress].

I suspect they're only "should", not "must", and not more strongly worded, because context switches and extreme load can suspend another thread for a long time in the worst case (swap thrashing, or even using a debugger to pause and single-step one of the program's threads). The weaker wording also allows for, e.g., cooperative multi-tasking on a single core, where the time between context switches might sometimes be high-ish.
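To make the ruled-out transformation concrete, here's a minimal sketch (the variable and function names are hypothetical, not from the question) contrasting the atomic flag with a plain `bool`, where hoisting the load actually is legal:

```cpp
#include <atomic>

std::atomic<bool> stop_atomic{false};
bool stop_plain = false;

void spin_atomic()
{
    // Must re-load every iteration: hoisting this load would delay
    // visibility of another thread's store forever, contrary to
    // [intro.progress]'s finite-time visibility requirement.
    while (!stop_atomic.load(std::memory_order_relaxed))
        ;
}

void spin_plain()
{
    // A compiler may legally transform this into
    //   if (!stop_plain) for (;;);
    // since any concurrent write to stop_plain would be a data race (UB).
    while (!stop_plain)
        ;
}
```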
Those "should" requirements aren't just notes; they're normative. One might argue that a Deathstation 9000 could totally ignore them, but without some justification that seems unreasonable. There are lots of ways to make an ISO C++ implementation that's nearly unusable, and any implementation that aims to be usable will definitely compile that `.store(true, relaxed)` to an asm store, and the `load` to a load inside the loop.

Why set the stop flag using `memory_order_seq_cst`, if you check it with `memory_order_relaxed`? asks a slightly different thing about an equivalent `exit_now` spin-loop (worried more about keeping inter-thread latency low than about the loop being infinite), but the same quotes from the standard apply.

CPUs commit stores to cache-coherent memory as soon as they can; stronger orders (and fences) just make this thread wait for things, e.g. for an acquire load to complete before taking a value for other loads, or for earlier stores and loads to complete before a release-store commits to L1d cache and itself becomes globally visible. Fences don't push data to other threads; they only control the order in which that happens. Data becoming globally visible happens on its own, very fast. (If you were implementing C++ on hypothetical hardware that didn't work this way, you'd have to compile even a relaxed store to include extra instructions to flush its own address.)

IDK if this misconception (that barriers are needed to create visibility, rather than to order it) is what's happening here, or if you're just asking what actual wording in the standard prevents a Deathstation 9000 from being terrible.
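To make the ordering-vs-visibility distinction concrete, a minimal release/acquire sketch (`payload`, `ready`, and the function names are hypothetical): the memory orders control when `payload` is ordered relative to `ready`, not how fast `ready` itself propagates.

```cpp
#include <atomic>

int payload = 0;                 // plain data to publish
std::atomic<bool> ready{false};

void producer()
{
    payload = 42;                                  // plain store, sequenced before...
    ready.store(true, std::memory_order_release);  // ...the release store that publishes it
}

void consumer()
{
    while (!ready.load(std::memory_order_acquire))  // pairs with the release store
        ;
    // The acquire load that saw `true` synchronizes-with the release store,
    // so this read is guaranteed to see 42. Neither memory order made
    // `ready` become visible any sooner; they only ordered `payload`
    // relative to it.
    int v = payload;
    (void)v;
}
```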
The `store` definitely can't be optimized away, for that and other reasons: it's a visible side effect that changes the program state. It's guaranteed visible to later loads in this thread (e.g. in the caller of the `thread1` function). For the same reason, even a non-atomic `plain_bool = true` assignment couldn't be optimized away, unless it inlined into a caller that did `plain_bool = false` afterwards; then dead-store elimination could happen.

Compilers currently don't optimize atomics, treating them basically like `volatile atomic<>` already, but ISO C++ would allow optimizing atomic `flag=true; flag=false;` into just `flag=false;` (even with `seq_cst`, but also with `.store(val, relaxed)`). This could remove the time window for other threads to ever detect that the variable was `true`; ISO C++ makes no guarantee that any state which exists in the abstract machine can actually be observed by another thread.

However, as a quality-of-implementation issue, it can be undesirable to optimize away an unlock/relock or a `++`/`--` pair, which is part of why compilers don't optimize atomics. Also related: If a RMW operation changes nothing, can it be optimized away, for all memory orders? Merging two RMWs into a no-op can't optimize away their memory-ordering semantics, unless they're both `relaxed` and there are no fences anywhere, including in possible callers.
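A sketch of those transformations that the standard would permit but current compilers don't perform (`blink` and `balanced` are hypothetical names, not from the question):

```cpp
#include <atomic>

std::atomic<bool> flag{false};
std::atomic<int>  counter{0};

void blink()
{
    flag.store(true, std::memory_order_relaxed);
    flag.store(false, std::memory_order_relaxed);
    // The as-if rule would permit keeping only the second store:
    // no other thread is guaranteed to observe the transient `true`.
}

void balanced()
{
    counter.fetch_add(1, std::memory_order_relaxed);
    counter.fetch_sub(1, std::memory_order_relaxed);
    // Both relaxed: if there are no fences anywhere (including in
    // possible callers), merging this pair into a no-op would also be
    // permitted. Current compilers do neither transformation.
}
```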
Even if compilers did optimize as much as the standard allows per the as-if rule, you still wouldn't need `volatile atomic` for this case (assuming the caller of `thread1()` doesn't do `flag.store(false, order)` right after the call).

You might perhaps want `volatile atomic` in other situations, but http://wg21.link/p0062 / http://wg21.link/n4455 point out that even `volatile atomic` doesn't close all the possible loopholes for overly aggressive optimizations. So until further design progress is made on letting programmers control when optimization of atomics would be OK, the plan is that compilers will continue to behave as they do now, not optimizing atomics.

Also related, re: compiler optimizations inventing infinite loops:
- What are the exact inter-thread reordering constraints on mutex.lock() and .unlock() in c++11 and up? Mutex operations can reorder at run-time, so could a compiler statically reorder in a way that creates a deadlock? No; that would not be sane, and my answer there argues it's not valid per the as-if rule.
- How C++ Standard prevents deadlock in spinlock mutex with memory_order_acquire and memory_order_release? The same thing, but with a manually-implemented spinlock using `std::atomic` (a minimal sketch of such a lock follows below). The ISO C++ standard doesn't discuss compile-time vs. run-time reordering, or in fact reordering at all; it only specifies inter-thread visibility and the creation of happens-before relationships. It comes down to whether the compiler is allowed to invent an infinite loop, delaying visibility of an atomic store indefinitely. The answer is no, per [intro.progress], same as here.
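For reference, the kind of hand-rolled spinlock those questions discuss, as a minimal sketch only (no backoff, no test-and-test-and-set; not production code):

```cpp
#include <atomic>

class spinlock {
    std::atomic<bool> locked{false};
public:
    void lock()
    {
        // The acquire on a successful exchange synchronizes-with the
        // release in unlock(), so the critical section can't leak out.
        // [intro.progress] requires the unlocking store to become visible
        // in a finite period of time, so the compiler can't turn this
        // loop into an unconditional infinite loop.
        while (locked.exchange(true, std::memory_order_acquire))
            ;
    }
    void unlock()
    {
        locked.store(false, std::memory_order_release);
    }
};
```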