Consider the following code:
```cpp
#include <atomic>

std::atomic<bool> flag;

void thread1()
{
    flag.store(true, std::memory_order_relaxed);  // set the flag; no ordering of other data
}

void thread2()
{
    while (!flag.load(std::memory_order_relaxed))  // spin until the store becomes visible
        ;
}
```
Under the Standard, could the compiler optimize out the store in `thread1` (since `thread1` has no release fence), making `thread2` an infinite loop? Or could the compiler buffer `flag` in a register in `thread2` after reading it from memory once, making `thread2` potentially an infinite loop regardless of how `flag` is written?
If that's a potential problem, would `volatile` fix it?

```cpp
volatile std::atomic<bool> flag;
```
An answer that quotes from the Standard would be most appreciated.
No. Skipping the store, or hoisting the `std::atomic` load out of the loop (and thereby inventing an infinite loop that the thread could never leave!), would violate the standard's guidance that stores "should" be visible to loads "in a reasonable amount of time" [atomics.order] and in a "finite period of time" [intro.progress].

I suspect they're only "should", not "must", and not more strongly worded, because context switches and extreme load can suspend another thread for a long time in the worst case (swap thrashing, or even using a debugger to pause and single-step one of the program's threads). The weaker wording also allows for, e.g., cooperative multi-tasking on a single core, where the time between context switches might sometimes be high-ish.
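To make the ruled-out transformation concrete, here's a minimal sketch (the variable and function names are hypothetical, not from the question) contrasting the atomic flag with a plain `bool`, where hoisting the load actually is legal:

```cpp
#include <atomic>

std::atomic<bool> stop_atomic{false};
bool stop_plain = false;

void spin_atomic()
{
    // Must re-load every iteration: hoisting this load would delay
    // visibility of another thread's store forever, contrary to
    // [intro.progress]'s finite-time visibility requirement.
    while (!stop_atomic.load(std::memory_order_relaxed))
        ;
}

void spin_plain()
{
    // A compiler may legally transform this into
    //   if (!stop_plain) for (;;);
    // since any concurrent write to stop_plain would be a data race (UB).
    while (!stop_plain)
        ;
}
```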
Those "should" requirements aren't just notes; they're normative. One might argue that a Deathstation 9000 could totally ignore them, but without some justification that seems unreasonable. There are lots of ways to make an ISO C++ implementation that's nearly unusable, and any implementation that aims to be usable will definitely compile that `.store(true, relaxed)` to an asm store, and the `load` to a load inside the loop.

Why set the stop flag using `memory_order_seq_cst`, if you check it with `memory_order_relaxed`? asks a slightly different thing about an equivalent `exit_now` spin-loop (worried more about keeping inter-thread latency low than about the loop being infinite), but the same quotes from the standard apply.

CPUs commit stores to cache-coherent memory as soon as they can; stronger orders (and fences) just make this thread wait for things, e.g. for an acquire load to complete before taking a value for other loads, or for earlier stores and loads to complete before a release-store commits to L1d cache and itself becomes globally visible. Fences don't push data to other threads; they only control the order in which that happens. Data becoming globally visible happens on its own, very fast. (If you were implementing C++ on hypothetical hardware that didn't work this way, you'd have to compile even a relaxed store to include extra instructions to flush its own address.)

IDK if this misconception (that barriers are needed to create visibility, rather than to order it) is what's happening here, or if you're just asking what actual wording in the standard prevents a Deathstation 9000 from being terrible.
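To make the ordering-vs-visibility distinction concrete, a minimal release/acquire sketch (`payload`, `ready`, and the function names are hypothetical): the memory orders control when `payload` is ordered relative to `ready`, not how fast `ready` itself propagates.

```cpp
#include <atomic>

int payload = 0;                 // plain data to publish
std::atomic<bool> ready{false};

void producer()
{
    payload = 42;                                  // plain store, sequenced before...
    ready.store(true, std::memory_order_release);  // ...the release store that publishes it
}

void consumer()
{
    while (!ready.load(std::memory_order_acquire))  // pairs with the release store
        ;
    // The acquire load that saw `true` synchronizes-with the release store,
    // so this read is guaranteed to see 42. Neither memory order made
    // `ready` become visible any sooner; they only ordered `payload`
    // relative to it.
    int v = payload;
    (void)v;
}
```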
The `store` definitely can't be optimized away, for that and other reasons: it's a visible side effect that changes the program state. It's guaranteed visible to later loads in this thread (e.g. in the caller of the `thread1` function). For the same reason, even a non-atomic `plain_bool = true` assignment couldn't be optimized away, unless it inlined into a caller that did `plain_bool = false` afterwards; then dead-store elimination could happen.

Compilers currently don't optimize atomics, treating them basically like `volatile atomic<>` already, but ISO C++ would allow optimizing atomic `flag=true; flag=false;` into just `flag=false;` (even with `seq_cst`, but also with `.store(val, relaxed)`). This could remove the time window for other threads to ever detect that the variable was `true`; ISO C++ makes no guarantee that any state which exists in the abstract machine can actually be observed by another thread.

However, as a quality-of-implementation issue, it can be undesirable to optimize away an unlock/relock or a `++`/`--` pair, which is part of why compilers don't optimize atomics. Also related: If a RMW operation changes nothing, can it be optimized away, for all memory orders? Merging two RMWs into a no-op can't optimize away their memory-ordering semantics, unless they're both `relaxed` and there are no fences anywhere, including in possible callers.
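A sketch of those transformations that the standard would permit but current compilers don't perform (`blink` and `balanced` are hypothetical names, not from the question):

```cpp
#include <atomic>

std::atomic<bool> flag{false};
std::atomic<int>  counter{0};

void blink()
{
    flag.store(true, std::memory_order_relaxed);
    flag.store(false, std::memory_order_relaxed);
    // The as-if rule would permit keeping only the second store:
    // no other thread is guaranteed to observe the transient `true`.
}

void balanced()
{
    counter.fetch_add(1, std::memory_order_relaxed);
    counter.fetch_sub(1, std::memory_order_relaxed);
    // Both relaxed: if there are no fences anywhere (including in
    // possible callers), merging this pair into a no-op would also be
    // permitted. Current compilers do neither transformation.
}
```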
Even if compilers did optimize as much as the standard allows per the as-if rule, you still wouldn't need `volatile atomic` for this case (assuming the caller of `thread1()` doesn't do `flag.store(false, order)` right after the call).

You might perhaps want `volatile atomic` in other situations, but http://wg21.link/p0062 / http://wg21.link/n4455 point out that even `volatile atomic` doesn't close all the possible loopholes for overly aggressive optimizations. So until further design progress is made on letting programmers control when optimization of atomics would be OK, the plan is that compilers will continue to behave as they do now, not optimizing atomics.

Also related, re: compiler optimizations inventing infinite loops:
- What are the exact inter-thread reordering constraints on mutex.lock() and .unlock() in c++11 and up? Mutex operations can reorder at run-time, so could a compiler statically reorder in a way that creates a deadlock? No; that would not be sane, and my answer there argues it's not valid per the as-if rule.
- How C++ Standard prevents deadlock in spinlock mutex with memory_order_acquire and memory_order_release? The same thing, but with a manually-implemented spinlock using `std::atomic` (a minimal sketch of such a lock follows below). The ISO C++ standard doesn't discuss compile-time vs. run-time reordering, or in fact reordering at all; it only specifies inter-thread visibility and the creation of happens-before relationships. It comes down to whether the compiler is allowed to invent an infinite loop, delaying visibility of an atomic store indefinitely. The answer is no, per [intro.progress], same as here.
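For reference, the kind of hand-rolled spinlock those questions discuss, as a minimal sketch only (no backoff, no test-and-test-and-set; not production code):

```cpp
#include <atomic>

class spinlock {
    std::atomic<bool> locked{false};
public:
    void lock()
    {
        // The acquire on a successful exchange synchronizes-with the
        // release in unlock(), so the critical section can't leak out.
        // [intro.progress] requires the unlocking store to become visible
        // in a finite period of time, so the compiler can't turn this
        // loop into an unconditional infinite loop.
        while (locked.exchange(true, std::memory_order_acquire))
            ;
    }
    void unlock()
    {
        locked.store(false, std::memory_order_release);
    }
};
```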