I'm trying to understand how mfence guarantees sequential consistency on x86.
Take this code, for example:
#include <atomic>

std::atomic<int> a, b, r;

void write_a()
{
    a.store(1, std::memory_order_seq_cst);
}

void write_b()
{
    b.store(1, std::memory_order_seq_cst);
}

void read_a_b()
{
    // Spin until the store to a is visible, then check b.
    while (!a.load(std::memory_order_seq_cst));
    if (b.load(std::memory_order_seq_cst)) {
        r++;
    }
}

void read_b_a()
{
    // Spin until the store to b is visible, then check a.
    while (!b.load(std::memory_order_seq_cst));
    if (a.load(std::memory_order_seq_cst)) {
        r++;
    }
}
gcc 9.5 with -O3 generates the following assembly for the write_a and write_b functions:
write_a():
    mov DWORD PTR a[rip], 1
    mfence
    ret
write_b():
    mov DWORD PTR b[rip], 1
    mfence
    ret
When std::memory_order_release stores are used instead, the code becomes:
write_a():
    mov DWORD PTR a[rip], 1
    ret
write_b():
    mov DWORD PTR b[rip], 1
    ret
So essentially, the mfence is simply dropped.
To my understanding, with sequential consistency a result of r==0 is impossible, whereas with acquire-release ordering it is theoretically possible.
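For reference, the scenario I have in mind can be exercised with a small harness along the following lines. This is only a sketch: IriwState, run_iriw_once, and count_zero_results are names I made up for illustration, and the variables are made local to each run so that every iteration starts from zero. It uses the seq_cst variant, for which the C++ memory model forbids r==0.

```cpp
#include <atomic>
#include <thread>

// Hypothetical harness: all names here are illustrative only.
struct IriwState {
    std::atomic<int> a{0}, b{0}, r{0};
};

// Runs the two writers and two readers once on fresh state
// and returns the final value of r.
int run_iriw_once() {
    IriwState s;
    std::thread t1([&] { s.a.store(1, std::memory_order_seq_cst); });
    std::thread t2([&] { s.b.store(1, std::memory_order_seq_cst); });
    std::thread t3([&] {
        // Same logic as read_a_b(): wait for a, then check b.
        while (!s.a.load(std::memory_order_seq_cst)) {}
        if (s.b.load(std::memory_order_seq_cst)) s.r++;
    });
    std::thread t4([&] {
        // Same logic as read_b_a(): wait for b, then check a.
        while (!s.b.load(std::memory_order_seq_cst)) {}
        if (s.a.load(std::memory_order_seq_cst)) s.r++;
    });
    t1.join();
    t2.join();
    t3.join();
    t4.join();
    return s.r.load();
}

// Repeats the experiment and counts the runs that ended with r == 0.
int count_zero_results(int iterations) {
    int zeros = 0;
    for (int i = 0; i < iterations; ++i) {
        if (run_iriw_once() == 0) ++zeros;
    }
    return zeros;
}
```

Calling count_zero_results(n) from main (built with -pthread) reports how often the two readers disagreed. A count of zero on one machine does not prove anything about the weaker orderings, of course; it only fails to find a counterexample.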
As far as I know, mfence makes sure that the store buffer gets "flushed" and that every store and load that follows it is stalled until the memory operations before the mfence have completed globally. However, in my example no other memory operation follows the mfence, so I don't understand how it makes any difference with regard to the sequentially consistent visibility of the changes to a and b.
In particular, what happens if thread1 has executed mov DWORD PTR a[rip], 1 but has not yet started executing mfence, and thread2 has analogously executed mov DWORD PTR b[rip], 1 but has not yet started executing mfence?
write_a():
    mov DWORD PTR a[rip], 1    <- thread1 finished this operation
    mfence                     <- thread1 has not yet executed this operation
    ret
write_b():
    mov DWORD PTR b[rip], 1    <- thread2 finished this operation
    mfence                     <- thread2 has not yet executed this operation
    ret
At this point in time, the code executed so far is identical to the code generated for the std::memory_order_release stores, so up to this point only "release" stores have taken place.
Now, if thread3 executes read_a_b() and thread4 executes read_b_a(), I believe they could still disagree on the order of the writes to a and b, so a result of r==0 is still theoretically possible. Only after thread1 and thread2 have executed their respective mfence instructions would this no longer be possible.
What am I getting wrong?
I know that gcc 10 uses xchg instead of mov + mfence, but my underlying question remains the same.