I am trying to use a shared index to indicate that data has been written to a shared circular buffer. Is there an efficient way to do this on ARM (arm gcc 9.3.1 for cortex M4 with -O3) without using the discouraged volatile keyword?
The following C functions work fine on x86:
void Test1(int volatile* x) { *x = 5; }
void Test2(int* x) { __atomic_store_n(x, 5, __ATOMIC_RELEASE); }
Both compile efficiently and identically on x86:
0000000000000000 <Test1>:
0: c7 07 05 00 00 00 movl $0x5,(%rdi)
6: c3 retq
7: 66 0f 1f 84 00 00 00 nopw 0x0(%rax,%rax,1)
e: 00 00
0000000000000010 <Test2>:
10: c7 07 05 00 00 00 movl $0x5,(%rdi)
16: c3 retq
However on ARM the __atomic builtin generates a Data Memory Barrier, while volatile does not:
00000000 <Test1>:
0: 2305 movs r3, #5
2: 6003 str r3, [r0, #0]
4: 4770 bx lr
6: bf00 nop
00000000 <Test2>:
0: 2305 movs r3, #5
2: f3bf 8f5b dmb ish
6: 6003 str r3, [r0, #0]
8: 4770 bx lr
a: bf00 nop
How do I avoid the memory barrier (or similar inefficiencies) while also avoiding volatile?
The `volatile` assignment isn't a release-store, and doesn't even give you StoreStore ordering, which might be all you need here. `volatile` is basically equivalent to `__ATOMIC_RELAXED` ordering, except that it prevents compile-time reordering with other `volatile` accesses. It does not do anything to prevent run-time reordering, which CPU memory models other than x86 do allow. (As for actual atomicity, with narrow enough types you do get atomicity with certain compilers, like GCC and Clang, since the Linux kernel uses `volatile` this way to roll its own atomics, along with inline asm for fences.)

See also When to use volatile with multi threading? - never: `volatile` doesn't give you anything you can't get with atomics for the purposes of multi-threading. Use GNU C builtins or C++20 `std::atomic_ref` with `memory_order_relaxed` instead of `volatile` if you need non-atomic access to a variable in other parts of your program. Or, more simply, use C11 `<stdatomic.h>` `_Atomic int` or C++11 `std::atomic<>` if you never need to point a plain `int*` at it.

`dmb ishst` is at least a StoreStore barrier, so in asm you could get release semantics wrt. earlier stores but not earlier loads. That isn't sufficient for `std::memory_order_release` aka `__ATOMIC_RELEASE` (which also requires LoadStore ordering), so there's no way to get a compiler to use it for you. (None of the ops or fences in https://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html map to that.)

So unfortunately on ARMv7 and earlier, you need a full barrier (`dmb ish`) for any standard C / C++ memory_order other than `relaxed`. ARMv8 fixed that.

With `-mcpu=cortex-a53` or other ARMv8 CPUs, `stl` is available as a release-store even in AArch32 state. Use that to avoid an expensive `dmb ish` full barrier for release stores or acquire loads: https://godbolt.org/z/1hzvGMbon

Single-core systems
On your single-core Cortex-M4, all "threads" run on the same core, so run-time memory reordering isn't possible. An interrupt that leads to a context switch is equivalent to a signal handler in the C11 / C++11 memory models.
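For the plain publish shown in the question's `Test2`, that means a relaxed atomic store is already enough on this target: it compiles to the same plain `str` as the `volatile` version, with no `dmb`. A minimal sketch (`Test3` is a hypothetical name, not from the question):

```c
/* Hypothetical variant of the question's Test2: a relaxed atomic
 * store is still atomic, but on Cortex-M4 it compiles to a plain
 * str with no dmb, just like the volatile version in Test1. */
void Test3(int *x) { __atomic_store_n(x, 5, __ATOMIC_RELAXED); }
```

Relaxed alone gives atomicity but no ordering with the buffer writes, which is where the fences below come in.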
You can use `atomic_signal_fence` to roll your own same-core acquire / release for `relaxed` loads/stores.

Porting such code to multi-core by changing `atomic_signal_fence` to `atomic_thread_fence` is safe, but worse for performance on some ISAs, notably ARMv8, where a separate barrier instruction is expensive but a release-store operation can just use `stl`.
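A sketch of that single-core recipe for the circular buffer in the question (the names `produce`/`consume`, the buffer size, and the `-1` empty sentinel are all illustrative assumptions): the producer writes the data, issues a release signal fence, then publishes the index with a relaxed store; the consumer pairs a relaxed index load with an acquire signal fence before reading the data. Neither fence emits a `dmb`; they only restrict compile-time reordering, which is sufficient when both sides run on the same core.

```c
#include <stdatomic.h>

#define BUF_SIZE 16

static int buffer[BUF_SIZE];
static _Atomic unsigned widx;   /* write index shared with the consumer */

/* Producer (e.g. main loop): write the data, then publish the index. */
void produce(int value)
{
    unsigned i = atomic_load_explicit(&widx, memory_order_relaxed);
    buffer[i % BUF_SIZE] = value;
    /* Compiler barrier only: orders the data write before the index
     * store for same-core observers (ISRs); emits no dmb instruction. */
    atomic_signal_fence(memory_order_release);
    atomic_store_explicit(&widx, i + 1, memory_order_relaxed);
}

/* Consumer (e.g. an interrupt handler): read the index, then the data.
 * Returns -1 when empty (assumes -1 is never a valid data value). */
int consume(unsigned *ridx)
{
    unsigned w = atomic_load_explicit(&widx, memory_order_relaxed);
    if (w == *ridx)
        return -1;
    atomic_signal_fence(memory_order_acquire); /* pairs with the release fence */
    return buffer[(*ridx)++ % BUF_SIZE];
}
```

Swapping both `atomic_signal_fence` calls for `atomic_thread_fence` (or using release/acquire orderings directly) is the portable multi-core version, at the cost described above.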