atomic_flag_test_and_setYes!atomic_flag_clearYes!atomic_flag_test_and_clearnopeatomic_flag_setnope
If you wanted to do something like set a flag on a event in some context, and in some other context check and clear the event, C/C++ doesn't allow you to do in single atomic call per context.
You'd have to invert the flag, so clear the flag on the event, check and set the flag when checking the event.
Not a huge deal, but seems backwards in this scenario, especially considering the default state of the flag is false, which in the inverted sense means the event is asserted by default.
I supposed alternatively, an atomic bool with atomic_exchange could be used instead.
Practical advice: use
atomic<bool>oratomic<unsigned char>instead ofatomic_flagin normal code if its limited set of operations makes your code less efficient. Unless you care about being lock-free on very primitive machines whereatomic<bool>andatomic<int>might not be lock-free. (Or perhaps use C++20std::atomic_unsigned_lock_free.)TAS (Test And Set) is one of the well-known primitive atomic RMW operations in computer science that can be a building block for mutual-exclusion (locking). On some old hardware (like Motorola 68000), it's the only atomic RMW.
There are no machines where the only atomic RMW is a test-and-clear. (Zero-initializing memory is the norm, so a spinlock or mutex in static storage will start out in the unlocked state if 0 means unlocked, and TAS takes the lock. You need an atomic RMW when acquiring a spinlock but not when releasing.)
TAS can also be efficiently implemented in terms of an atomic exchange, aka swap, which some other old hardware provides as its only atomic RMW. (
local = foo.exchange(true)and test the result.)But neither TAS nor exchange work as building blocks for arbitrary lock-free atomic RMWs like
fetch_xoror CAS (Compare-and-Swap e.g.compare_exchange_weak/strong). A machine with just TAS or a way to emulate it can't provide lock-freestd::atomic<bool>, but can provide lock-freestd::atomic_flag.(LL/SC or CAS itself are building blocks for arbitrary lock-free RMWs on a single variable. All(?) modern machines that support multiple cores have at least one of those, and sometimes also direct support for some of the common integer operations like
fetch_add,fetch_or,exchange, etc. like on x86-64 and ARMv8.1. And of course atomicity guarantees for pure-load and pure-store operations so.load()can be done as an actual load, not an RMW which would fault on read-only memory and have contention between readers.)Hypothetically, on a CPU with only a test-can-clear, you could still implement
atomic_flagwith the C++falsestate having a non-zero object-representation; before C++20 it wasn't required that static initialization made it false. But if a CPU had TAS and "TAC" (or exchange which can do both) but not enough functionality to implement lock-freeatomic<bool>, you can't take full advantage of it viaatomic_flag.atomic_flagis required to be lock-free, unlike anyatomic<T>. It exists to expose a minimal set of lock-free functionality that can be implemented across a wide range of hardware that's able to implement ISO C++11 at all (includingstd::mutexand the locking for non-lock-freestd::atomic<T>.)Some things it avoids requiring:
Read-only access (before C++20
.test())Some hardware might have race detection where it could fault on concurrent read + write. (The same condition that would be undefined behaviour in C++ on non-atomic variables.) Presumably such hardware would have special instructions that are allowed to be concurrent, which may only include writes and RMWs.
Write-only access other than
.clear().Writing
truecan be done with.test_and_set()and ignoring the return value. But that's not efficient. It's still an atomic RMW and thus continues a release-sequence and has to see the "latest value" in the modification order so it's linearizable, so it's non-trivial for compilers to optimize it to just a store if the return value isn't used.IDK what hardware reason might exist for not providing a
.set()API. If the underlying hardware doesn't allow plain stores of values other than0(or whatever the actual bit-pattern is forclear), the implementation can always use a TAS instruction and ignore the result. Being stronger isn't a problem.So this might just be a case of keeping the API minimal because it mostly doesn't matter; most code should just use
atomic<bool>because that's also lock-free on the real platforms most people are programming for.Store of a variable value: you could always
if(x) flag.TAS(); else flag.clear(), but the API doesn't provide that.Toggle of an existing value: the only available RMW (TAS) stores a new value that doesn't depend on what was already there. This allows implementation in terms of instructions like ARM
swp(Swap) which pre-dated LL/SC support on ARM. As well as in terms of TAS itself.Presumably this is one of the factors that makes TAS and swap easier to implement in hardware: a read+write could maybe happen in the same cycle, unlike with operations where the value to be written has to be computed from the load result so won't be ready until at least a cycle later.
If any operation supported by
atomic<bool>needs a mutex (due to lack of CAS or LL/SC), all operations have to respect the mutex. (Except pure loads, if tearing and memory ordering aren't a problem, e.g. for relaxed loads of a byte or bool, or atomic int on some systems, but that's a corner case for obsolete systems that modern compilers probably don't bother with.) Thenatomic<bool>::is_always_lock_freehas to befalse.C++20 added
std::atomic_signed_lock_freeandstd::atomic_unsigned_lock_freeinteger types that are guaranteed lock-free (and are "most efficient" for wait/notify). Those are optional in freestanding implementations (not hosted under an OS), but I think this rules out C++20 hosted implementations on 386 or 68000; you'd need 486 or 68020 or later. I think C++20 decided to add useful stuff that people want even if that means some retro hardware can't implement C++20. Modern C++ has been making choices like that in other areas, like requiring signed integers to be two's complement, dropping the implementation choice of one's complement or sign/magnitude.In ISO C, the thread/mutex/atomics stuff is optional, unlike ISO C++, so modern C can still be implemented on ancient hardware by leaving out the thread support library entirely.
ISAs with just TAS or Swap
68000 TAS instruction 68000 / http://www.easy68k.com/paulrsm/doc/dpbm68k3.htm
68020 and later have CAS (and even CAS2 on 68020/030/040, an operation on two separate memory locations, although that proved very difficult to support efficiently so was dropped from later CPUs).
ARMv5 (just
swpSwap)SparcV8 (mentioned by https://llvm.org/docs/Atomics.html#atomics-and-codegen)
80386:
cmpxchgwas officially new in Pentium, although 486 had a different undocumented opcode for it.(See also a retrocomputing 486 SMP question for some mention of some early x86 SMP systems.)
80386 has
xchg(which is an atomic RMW even without alockprefix when used with a memory operand, whether you want it or not), andlock bts/btr/btcto test-and-set, test-and-reset (clear), or test-and-complement (flip) a bit without disturbing surrounding bits.It also has
lock add,lock or/lock andetc. (fetch_or/fetch_andwithout a return value, other than setting FLAGS from the result so you can branch on the result being all-zero or having its MSB set. Or the parity of the low byte.)lock xaddwas new in 486 (fetch_addincluding the return value) If you use the return value offetch_or, compilers have to implement it using alock cmpxchgretry loop.But none of these are
cmpxchgor sufficient to emulate it.Mutual exclusion without atomic RMWs?
Technically, mutual exclusion is possible without an atomic RMW via Peterson's algorithm with just
seq_cstloads and stores, but that requires an array with an entry for each possible thread, and C++ allows starting new threads after mutex objects already exist and are in use. So that's not really viable. Lamport's bakery algorithm has the same requirement for an array with size =NUM_THREADS.C11 and C++11 weren't interested in trying to support threads on such ancient CPUs where multiprocessing is such a struggle, so they went ahead and required
atomic_flagto support lock-free RMWs.On a uniprocessor machine, one way to implement any atomic RMW is just to disable/re-enable interrupts around it. This is still lock-free in a sense: it can't create a deadlock between an interrupt or signal handler and the main thread like an actual spinlock or mutex could. This would require a system call in a hosted implementation.
Or, on uniprocessor machines where interrupts can only happen at instruction boundaries, do it with a single instruction. e.g. x86
add [mem], eaxis atomic wrt. interrupts and thus context switches on the same core, but not in a multi-core system unless you also use thelockprefix. (Is incrementing an int effectively atomic in specific cases?).So some CISCs with memory-destination RMW instructions can use those for operations that only have to be atomicity wrt. other threads (or interrupt handlers) that could only run on the same core (because this is a uniprocessor system or because we only care about interrupt / signal handlers). Even though the instructions don't have any special RMW guarantees wrt. concurrent writers like DMA or other bus-master I/O devices. Or other cores if any existed.
Some ISAs like m68k can save partial progress of executing an instruction and resume after an interrupt, defeating this. But x86 isn't like that; interrupts are only processed at instruction boundaries. (See Is x86 CMPXCHG atomic, if so why does it need LOCK?)
Related:
How is atomic_flag implemented? - for x86, current compilers use
xchgbecause it's a bit more efficient thanlock bts.clear(memory_order_release)or weaker is much more efficient thanlock btr, using plainmovwithoutmfenceorxchg.Do atomic operations require support from hardware?
Why only std::atomic_flag is guaranteed to be lock-free?
TAS instruction 68000 - example asm for a spinlock
http://www.easy68k.com/paulrsm/doc/dpbm68k3.htm discusses the design behind TAS
Is incrementing an int effectively atomic in specific cases? - how modern microarchitectures with cache handle atomic RMWs without locking the whole memory bus, so separate cores can be doing atomic RMWs on separate memory locations at the same time.