Looking into the Intel Intrinsics documentation, the synopsis for `_mm_mfence` is as follows:
> Perform a serializing operation on all load-from-memory and store-to-memory instructions that were issued prior to this instruction. Guarantees that every memory access that precedes, in program order, the memory fence instruction is globally visible before any memory instruction which follows the fence in program order.
The asm manual for the corresponding `mfence` instruction is similar.
There is a lot of jargon here that I cannot understand. Could anybody please clear up what `_mm_mfence` actually does?

I tried to Google this, but there was no decent documentation, and no other questions asking the same thing.
The intrinsic is a full barrier against compile-time and run-time memory reordering, including blocking StoreLoad reordering, the only kind x86 allows at run-time. https://preshing.com/20120515/memory-reordering-caught-in-the-act/ has an asm demo of the `mfence` instruction in action. (x86's asm memory model is program order plus a store buffer with store-forwarding, so acquire/release is free; only sequential consistency needs stronger ordering.)
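Here's a minimal sketch of that classic StoreLoad litmus test in C++, assuming the same two-thread setup as Preshing's demo; the names `X`, `Y`, `r1`, `r2` are mine, not from the article:

```cpp
#include <atomic>
#include <thread>
#include <emmintrin.h>  // _mm_mfence

// Relaxed atomics keep the example data-race-free at the C++ level,
// so only the hardware/fence behaviour is interesting.
std::atomic<int> X{0}, Y{0};
int r1, r2;

void thread1() {
    X.store(1, std::memory_order_relaxed);
    _mm_mfence();   // without this, the load below may pass the store above
    r1 = Y.load(std::memory_order_relaxed);
}

void thread2() {
    Y.store(1, std::memory_order_relaxed);
    _mm_mfence();
    r2 = X.load(std::memory_order_relaxed);
}

int main() {
    std::thread t1(thread1), t2(thread2);
    t1.join(); t2.join();
    // With the fences, r1 == 0 && r2 == 0 can't happen (in practice;
    // _mm_mfence isn't a formal C++ fence). Remove them and x86's
    // StoreLoad reordering can produce exactly that result.
    return (r1 == 0 && r2 == 0) ? 1 : 0;
}
```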
The C++ intrinsic is rarely useful; use C++11 `std::atomic` memory-ordering stuff instead, unless you're implementing a C++ standard library yourself. And even then, don't actually use `mfence`; use a dummy `lock`ed operation like GCC and Clang do.
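For example, a sketch of what the portable version looks like (the function names are mine); the compiler picks the asm, typically `xchg` for a seq_cst store and a plain `mov` for a seq_cst load on x86:

```cpp
#include <atomic>

std::atomic<int> flag{0};

void publish() {
    flag.store(1, std::memory_order_seq_cst);  // full barrier on x86: xchg (or mov + fence)
}

int consume() {
    return flag.load(std::memory_order_seq_cst);  // just a mov on x86
}
```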
`_mm_mfence` is like a slower version of `atomic_thread_fence(seq_cst)`, but one that also works on weakly-ordered NT loads from WC memory, which `lock`ed instructions might not. (Does lock xchg have the same behavior as mfence?) If you're using NT stores, usually you only need `_mm_sfence()` afterward to inter-operate with `std::atomic` and `std::mutex` acquire/release ordering guarantees. (Also, `atomic_thread_fence` isn't guaranteed on paper to work wrt. NT stores, but in practice `atomic_thread_fence(seq_cst)` does, although `release` doesn't.)
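A minimal sketch of that NT-store pattern, using a buffer and flag I made up for illustration:

```cpp
#include <atomic>
#include <emmintrin.h>  // _mm_stream_si128, _mm_sfence

alignas(16) int buf[4];
std::atomic<bool> ready{false};

void producer() {
    __m128i v = _mm_set1_epi32(42);
    _mm_stream_si128(reinterpret_cast<__m128i*>(buf), v);  // NT store: weakly ordered
    _mm_sfence();  // order the NT store before the flag store below
    ready.store(true, std::memory_order_release);  // normal release publish
}

int consumer() {
    while (!ready.load(std::memory_order_acquire)) {}  // spin until published
    return buf[0];  // safe: sfence + release/acquire ordered the NT store first
}
```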
See also:

- When should I use _mm_sfence _mm_lfence and _mm_mfence
- When are x86 LFENCE, SFENCE and MFENCE instructions required?
- Does lock xchg have the same behavior as mfence?
- https://preshing.com/20120625/memory-ordering-at-compile-time/
On paper, `std::atomic_thread_fence(seq_cst)` is not necessarily guaranteed to block non-atomic vars on one side from reordering with non-atomic vars on the other side, unless there are also `std::atomic` loads/stores/RMWs that other threads could potentially sync-with. e.g. `foo=1; thread_fence(sc); foo=2; atomic_var.store(3, relaxed);` is, I think, on paper allowed to do dead-store elimination and only do `foo=2` before the barrier, removing the `foo=1` assignment. Real compilers don't do that, AFAIK. But with `_mm_mfence()`, I think they wouldn't be allowed to, because the intrinsic is a full memory barrier for the compiler, so all globally-reachable memory has to be in sync and assumed to have changed. Like GNU C `asm("mfence" ::: "memory")`, i.e. as strong as a non-inline call to a function the compiler doesn't have a definition for. (Why can `asm volatile("" ::: "memory")` serve as a compiler barrier?)

The intrinsics guide is just describing the behaviour of the asm instruction in terms of x86 asm, not the fact that the intrinsic also needs to block compile-time reordering to be useful in C++. https://preshing.com/20120625/memory-ordering-at-compile-time/
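A compilable version of that example, with the hypothetical `foo` / `atomic_var` names from the paragraph above:

```cpp
#include <atomic>
#include <emmintrin.h>

int foo;
std::atomic<int> atomic_var;

void fence_version() {
    foo = 1;   // on paper, arguably removable by dead-store elimination
    std::atomic_thread_fence(std::memory_order_seq_cst);
    foo = 2;
    atomic_var.store(3, std::memory_order_relaxed);
}

void mfence_version() {
    foo = 1;        // must actually happen: _mm_mfence() is a full compiler
    _mm_mfence();   // barrier, like GNU C asm("mfence" ::: "memory")
    foo = 2;
    atomic_var.store(3, std::memory_order_relaxed);
}
```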
Also, don't try to use just `_mm_mfence` for multi-threading. First of all, you don't need sequential consistency most of the time, so on x86 you just need to block compile-time reordering to get acquire/release semantics. But to get sane behaviour in a multi-threaded program, you also need to make sure that one access to a shared variable in your C++ source compiles to one access to it in the asm; not, for example, `int local_var = shared_var;` actually compiling into multiple reads of `shared_var` when you use `local_var` multiple times in your function (see the sketch below). And you need those accesses to be atomic.

See Who's afraid of a big bad optimizing compiler? on LWN for more about why barriers alone aren't sufficient for multi-threaded C programs, including invented loads.
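A sketch of that hazard (the names are mine; the read here is a data race, which is exactly the problem):

```cpp
int shared_var;  // assumption: another thread writes this concurrently

int bad_reader() {
    int local_var = shared_var;  // compiler may legally reload shared_var
    if (local_var > 0)           // for each use, so this test...
        return local_var;        // ...and this return can see different values
    return 0;
}
```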
`std::atomic<>` is the normal way to get all of these things (even with `memory_order_relaxed`), but if you insist on rolling your own like the Linux kernel does, use `volatile`.
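The `std::atomic` version of the same reader (again, my names): even `memory_order_relaxed` guarantees exactly one load, and that the load is atomic:

```cpp
#include <atomic>

std::atomic<int> shared_atomic{0};

int good_reader() {
    int local_var = shared_atomic.load(std::memory_order_relaxed);  // exactly one atomic load
    if (local_var > 0)
        return local_var;  // both uses see the same value
    return 0;
}
```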
You might use `_mm_mfence` to prevent StoreLoad reordering between a `volatile` store and a `volatile` load from MMIO device registers, but normally you'd have the memory region mapped UC (uncacheable), which implies strong ordering, so StoreLoad reordering wouldn't be possible in the first place.
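A hedged sketch of that MMIO case, assuming hypothetical register offsets and a pointer to a WC-mapped region (with a UC mapping you wouldn't need the fence):

```cpp
#include <cstdint>
#include <emmintrin.h>

uint32_t write_then_read(volatile uint32_t* regs) {  // hypothetical device registers
    regs[0] = 0xDEADBEEFu;  // volatile store to one register
    _mm_mfence();           // block StoreLoad reordering before the readback
    return regs[1];         // volatile load from another register
}
```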
It does emit an `mfence` instruction, which waits for the store buffer to drain before later loads can happen. (Or stores, but StoreStore reordering isn't allowed on x86 anyway.)
On Skylake at least, `mfence` is extra slow after a microcode update added `lfence`-like behaviour of not letting even later ALU instructions execute until it's done waiting for the store buffer to drain. See Are loads and stores the only instructions that gets reordered? for an example. This is one reason compilers have stopped using `mfence` even for stuff like `atomic_thread_fence(seq_cst)`, instead using `lock add byte [rsp], 0` or similar. (As well as the fact that `mfence` was slower on AMD even before that.)
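For reference, a sketch of that dummy `lock`ed-operation idea in GNU C inline asm for x86-64 (close to what the Linux kernel's smp_mb() does); adding 0 is a value-preserving atomic RMW, so it's safe to aim at the stack:

```cpp
static inline void full_barrier() {
    // Dummy locked RMW on the stack: a full memory barrier on x86,
    // usually cheaper than mfence. "memory" makes it a compiler barrier too.
    asm volatile("lock; addl $0, -4(%%rsp)" ::: "memory", "cc");
}
```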