what does the _mm_mfence() function do


Looking into the Intel Intrinsics documentation, the synopsis for _mm_mfence is as follows:

Perform a serializing operation on all load-from-memory and store-to-memory instructions that were issued prior to this instruction. Guarantees that every memory access that precedes, in program order, the memory fence instruction is globally visible before any memory instruction which follows the fence in program order.

The asm manual for the corresponding mfence instruction is similar.


There seems to be a lot of jargon I cannot understand. Could anybody please clear up what _mm_mfence actually does?

I tried to Google this, but there was no decent documentation and no related questions.


Answer by Peter Cordes

The intrinsic is a full barrier against compile-time and run-time memory reordering, including blocking StoreLoad reordering, the only kind x86 allows at run-time. https://preshing.com/20120515/memory-reordering-caught-in-the-act/ has an asm demo of the mfence instruction in action. (x86's asm memory model is program order plus a store buffer with store-forwarding, so acquire/release is free; only sequential consistency needs stronger ordering.)
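For illustration, here is a minimal sketch of the kind of litmus test the linked Preshing post demonstrates. The variable and function names are mine, not from the post, and a single run rarely catches the reordering; a real demo loops many times with random delays:

    #include <emmintrin.h>   // _mm_mfence
    #include <cstdio>
    #include <thread>

    volatile int X = 0, Y = 0;
    int r1, r2;

    void t1() {
        X = 1;
        _mm_mfence();   // without this, the load of Y may pass the store to X
        r1 = Y;
    }

    void t2() {
        Y = 1;
        _mm_mfence();
        r2 = X;
    }

    int main() {
        std::thread a(t1), b(t2);
        a.join(); b.join();
        // With the fences, r1 == 0 && r2 == 0 is impossible.
        // Remove them and real x86 hardware can produce exactly that.
        std::printf("r1=%d r2=%d\n", r1, r2);
    }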

The C++ intrinsic is rarely useful; use C++11 std::atomic memory-ordering stuff instead, unless you're implementing a C++ standard library yourself. And even then, don't actually use mfence; use a dummy locked operation like GCC and Clang do.
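For example, instead of rolling your own barrier with _mm_mfence, ordinary std::atomic code gets you sequential consistency portably. A sketch (seq_cst is the default ordering when you don't specify one):

    #include <atomic>

    std::atomic<int> data{0};
    std::atomic<bool> ready{false};

    void producer() {
        data.store(42, std::memory_order_relaxed);
        ready.store(true);    // seq_cst store: the compiler emits whatever
                              // barrier the target needs (typically xchg on x86)
    }

    int consumer() {
        while (!ready.load()) {}    // seq_cst load: a plain mov on x86
        return data.load(std::memory_order_relaxed);   // guaranteed to see 42
    }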

It's like a slower version of atomic_thread_fence(seq_cst), but it also orders weakly-ordered NT loads from WC memory, which locked instructions might not. (Does lock xchg have the same behavior as mfence?) If you're using NT stores, usually you only need _mm_sfence() afterward to inter-operate with the acquire/release ordering guarantees of std::atomic and std::mutex.
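A sketch of that NT-store pattern; the buffer and flag names are invented for illustration, and the source pointer is assumed 16-byte aligned:

    #include <immintrin.h>
    #include <atomic>
    #include <cstddef>
    #include <cstdint>

    alignas(16) uint8_t buf[1024];
    std::atomic<bool> done{false};

    void publish(const __m128i* src, size_t n) {
        __m128i* dst = reinterpret_cast<__m128i*>(buf);
        for (size_t i = 0; i < n; ++i)
            _mm_stream_si128(dst + i, src[i]);   // weakly-ordered NT stores
        _mm_sfence();   // order the NT stores before the release store
        done.store(true, std::memory_order_release);
    }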

(Also, atomic_thread_fence isn't guaranteed on paper to work wrt. NT stores, but in practice atomic_thread_fence(seq_cst) does, although release doesn't.)

On paper, std::atomic_thread_fence(seq_cst) is not necessarily guaranteed to block non-atomic vars on one side from reordering with non-atomic vars on the other side, unless there are also std::atomic loads/stores/RMWs that other threads could potentially sync-with. e.g. foo=1; thread_fence(seq_cst); foo=2; atomic_var.store(3, relaxed); is, I think, on paper allowed to be optimized with dead-store elimination, doing only foo=2 before the barrier and removing the foo=1 assignment. Real compilers don't do that, AFAIK. But with _mm_mfence(), I think they wouldn't be allowed to, because the intrinsic is a full memory barrier for the compiler: all globally-reachable memory has to be in sync and assumed to have changed, like GNU C asm("mfence" ::: "memory"), i.e. as strong as a call to a non-inline function the compiler doesn't have a definition for. (Why can `asm volatile("" ::: "memory")` serve as a compiler barrier?)
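Here is the example from that paragraph as compilable code, a sketch using the names from the text:

    #include <atomic>
    #include <immintrin.h>

    int foo;
    std::atomic<int> atomic_var;

    void on_paper_elidable() {
        foo = 1;    // on paper, dead-store elimination could remove this...
        std::atomic_thread_fence(std::memory_order_seq_cst);
        foo = 2;    // ...keeping only this store (real compilers don't, AFAIK)
        atomic_var.store(3, std::memory_order_relaxed);
    }

    void full_compiler_barrier() {
        foo = 1;    // must actually be stored: _mm_mfence() is a full
                    // compiler barrier, like asm("mfence" ::: "memory")
        _mm_mfence();
        foo = 2;
        atomic_var.store(3, std::memory_order_relaxed);
    }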

The intrinsics guide is just describing the behaviour of the asm instruction, not the fact that the intrinsic also needs to block compile-time reordering to be useful in C++. https://preshing.com/20120625/memory-ordering-at-compile-time/

Also, don't try to use just _mm_mfence for multi-threading. First of all, you don't need sequential consistency most of the time, so on x86 you only need to block compile-time reordering to get acquire/release semantics. But to get sane behaviour in a multi-threaded program you also need to make sure one access to a shared variable in your C++ source compiles to one access to it in asm: for example, int local_var = shared_var; must not actually compile into multiple reads of shared_var when you use local_var multiple times in your function. And you need those accesses to be atomic.
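A sketch of that "one access in C++ must be one access in asm" point; the function names are invented:

    #include <atomic>

    int shared_plain;                  // racy if another thread writes it
    std::atomic<int> shared_atomic;

    void racy(int* out) {
        int local = shared_plain;      // looks like a single read, but the
        if (local != 0)                // compiler may legally re-read
            *out = local;              // shared_plain here and store 0
    }

    void safe(int* out) {
        int local = shared_atomic.load(std::memory_order_relaxed);
        if (local != 0)                // exactly one load: 'local' can't
            *out = local;              // change behind your back
    }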

See Who's afraid of a big bad optimizing compiler? on LWN for more about why barriers alone aren't sufficient for multi-threaded C programs, including invented loads. std::atomic<> is the normal way to get all of these things (even with memory_order_relaxed), but if you insist on rolling your own like the Linux kernel does, use volatile.

You might use _mm_mfence to prevent StoreLoad reordering between a volatile store and a volatile load from MMIO device registers, but normally you'd have the memory region mapped UC (uncacheable), which implies strong ordering, so StoreLoad reordering wouldn't be possible in the first place.
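A hypothetical MMIO sketch of that pattern; the register addresses and names are invented, and the fence only matters if the region is mapped WC rather than UC:

    #include <emmintrin.h>
    #include <cstdint>

    volatile uint32_t* const REG_CMD =
        reinterpret_cast<volatile uint32_t*>(0xFED00000u);
    volatile uint32_t* const REG_STATUS =
        reinterpret_cast<volatile uint32_t*>(0xFED00004u);

    uint32_t issue_command(uint32_t cmd) {
        *REG_CMD = cmd;       // volatile store to the device
        _mm_mfence();         // keep the status read from passing the write
        return *REG_STATUS;   // volatile load
    }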


It does emit an mfence instruction which waits for the store buffer to drain before later loads can happen. (Or stores, but StoreStore reordering isn't allowed on x86 anyway.)

On Skylake at least, mfence is extra slow after a microcode update that gave it lfence-like behaviour: it doesn't let even later ALU instructions execute until it's done waiting for the store buffer to drain. See Are loads and stores the only instructions that gets reordered? for an example. This is one reason compilers have stopped using mfence even for stuff like atomic_thread_fence(seq_cst), instead using lock add byte [rsp], 0 or similar. (mfence was also slower on AMD even before that.)
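You can see this from C++ without using the intrinsic at all; check the output on godbolt.org, since the exact dummy instruction varies by compiler and version:

    #include <atomic>

    void full_barrier() {
        // Recent GCC/Clang for x86-64 emit a dummy locked RMW on the stack
        // here (e.g. lock or dword ptr [rsp], 0) instead of mfence.
        std::atomic_thread_fence(std::memory_order_seq_cst);
    }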

Answer by Mike Nakis

In simple terms:

One thing that is important to always keep in mind when writing code for modern, highly complex computers is that code which looks sequential is not necessarily executed in a sequential fashion.

There are various tricks happening under the hood, as a result of compiler optimizations, multi-threading, CPU instruction pipelining, etc. which may change the order in which some instructions are executed, and thus change the order in which some memory locations are accessed.

To tie this to the jargon that you have already seen, a memory access A that precedes, in program order, a memory access B, may in fact happen after B!

These tricks are always done very carefully, so that they do not change the semantics of your code under normal circumstances. However, the moment you start doing things in your code that fall outside of what is considered "normal circumstances", like multi-threading without proper synchronization, or trying to roll your own synchronization mechanism, you might start running into trouble because of these tricks. Things will look mighty bizarre; impossible things will seem to be happening; cause and effect will seem to have lost its meaning; nothing will make sense.

A "memory fence" is an abstraction of some low-level mechanism that you can use to get out of trouble. The low-level mechanism might be a special CPU instruction, some library call, or who knows what, but that's the beauty of abstractions: we only need to know what it does, we do not need to know what it is and how it does it. Furthermore, an explanation of what it is and how it does it would necessarily involve jargon, but an explanation of what the abstraction does, need not involve any jargon! So:

When you use a memory fence you are requesting that any ongoing tricks should be stopped for a moment, so that:

  • All memory accesses that appear in your program code before the memory fence do in fact happen before the memory fence.
  • All memory accesses that appear in your program code after the memory fence do in fact happen after the memory fence.

(That's why it is called a fence: it keeps what should have happened in the past separate from what should happen in the future.)
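For example, here is roughly what using fences looks like in portable C++. This is a sketch of the classic message-passing pattern using the standard fence functions rather than the raw intrinsic; the names are mine:

    #include <atomic>

    int payload;                        // plain, non-atomic data
    std::atomic<bool> flag{false};

    void writer() {
        payload = 42;                   // "before the fence" stays before
        std::atomic_thread_fence(std::memory_order_release);
        flag.store(true, std::memory_order_relaxed);
    }

    int reader() {
        while (!flag.load(std::memory_order_relaxed)) {}
        std::atomic_thread_fence(std::memory_order_acquire);
        return payload;                 // guaranteed to see 42
    }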

Of course, when you invoke a memory fence, various ongoing optimizations are cancelled, so you take a small performance penalty, but code that works is always better than slightly faster code that does not work.