C++11 atomics: visibility, thread.join(), and the correct way to stop a thread


For which (if any?) STORE_ORDER & LOAD_ORDER does C++11 guarantee that this code runs in finite time?

std::atomic<bool> a{false};
std::thread t{[&]{
  while(!a.load(LOAD_ORDER));
}};
a.store(true, STORE_ORDER);
t.join();

I see two issues with this:

Memory order

It seems to me that with release and acquire, the compiler and CPU are allowed to reorder my join() (assuming it behaves like a load) before the store, which would of course break this.

Even with memory_order_seq_cst, I'm not sure whether such a reordering is prohibited, because I don't know if join() actually performs any loads or stores.

Visibility

If I understood this question about memory_order_relaxed correctly, it is not guaranteed that a store with memory_order_relaxed becomes visible to other threads in a finite amount of time. Is there such a guarantee for any of the other orderings?

I understand that std::atomic is about atomicity and memory ordering, not about visibility. But I am not aware of any other tools in C++11 that could help me here. Would I need to use a platform-specific tool to get a correctness guarantee here, and if so, which one?


To take this one step further – if I have finiteness, it would be nice to also have some promise about speed. I don't think the C++ standard makes any such promises. But is there any compiler or x86-specific way to get a promise that the store will become visible to the other thread quickly?


In summary: I'm looking for a way to swiftly stop a worker thread that is actually guaranteed to have this property. Ideally this would be platform-independent. But if we can't have that, does it at least exist for x86?

2 Answers

Answer by Chronial

After some more searching, I found a question that is identical to the visibility part of mine, which got a clear answer: there is indeed no such guarantee. There is only the request that "implementations should make atomic stores visible to atomic loads within a reasonable amount of time". The standard does not define what it means by "should", but I will assume the usual meaning, which makes this non-binding. It is also not quite clear what "reasonable" means, but I would assume it clearly excludes "infinite".

This doesn't quite answer the question about memory ordering. But if the store were ordered after the join(), which may block forever, the store would never become visible to the other thread, which would certainly not be a "reasonable amount of time".

So while the standard does not require the code in the question to be valid, it at least suggests that it should be valid. As a bonus, it suggests that visibility should take not just finite time, but should also be somewhat fast (or well, "reasonable").

That leaves the part of my question about a platform-specific solution: is there an x86-specific way to write the requested algorithm so it is actually guaranteed to be correct?

Answer by Peter Cordes

Is there an x86-specific way to write the requested algorithm so it is actually guaranteed to be correct?

Use a compiler that isn't broken, so that thread.join() is correctly treated as a "black box" function which might read or write any memory.

Thus the compiler has to make sure that memory is "in sync" with the C++ abstract machine before blocking on the thread finishing. That is, reordering the store after the call at compile time could violate the as-if rule, so compilers must not do it unless they can prove it is safe (e.g. for a store to a local variable whose address doesn't escape the function). That isn't the case here, so the store has to happen in the x86 asm before `call join`, even for mo_relaxed.

(Or in a hypothetical implementation where join could fully inline, there would at least be a compile-time memory barrier like GNU C asm("":::"memory"), or maybe an atomic_thread_fence() of some strength. Anything up to acq_rel needs no asm instructions on x86; it only has to block compile-time reordering.)

x86 CPU cores share a coherent view of memory through a coherent cache. (MESI protocol or equivalent). As soon as the store commits from the store buffer to L1d cache, it becomes impossible for any other core to read a "stale" value. Inter-core latency is typically 40 to 100 nanoseconds on modern x86, IIRC (when both threads are already running on different physical cores).

See Is mov + mfence safe on NUMA? and When to use volatile with multi threading?, which explain more about how asm can't see stale values indefinitely on real CPUs (including non-x86 ones). So program order at compile time is sufficient here.

This same reasoning applies to every other real CPU that a sane C++ implementation can start threads across using std::thread. (There are some heterogeneous systems-on-chip with separate coherency domains between a microcontroller and a DSP, but a quality C++ implementation's std::thread wouldn't start threads across non-coherent cores.)

Or if an implementation did run threads on cores without coherent cache, its std::atomic would have to flush the cache line after a relaxed-atomic store, and maybe before a relaxed-atomic load, or write back any dirty data across the whole cache before a release-store. So it would be hilariously inefficient to implement C++'s memory model on top of a non-coherent / explicit-coherency system. It would have to be coherent enough for atomic RMWs to work as described (release sequences and so on, with only one "live" copy of an atomic counter at any time).

This is why you wouldn't build a C++ implementation like that. You might allow starting work on other cores that are part of a different coherency domain (e.g. ARM cores that aren't "inner shareable"), but not via std::thread: users would need to understand the implications, because shared variables wouldn't work sanely between such cores.