Suppose I have a memory-mapped file and write into it from different threads (writes never overlap and are independent of each other). I want to sync the already written data to disk by calling msync or FlushViewOfFile, and then unmap the file.
Do I need to synchronize the writer threads with the flushing thread, e.g. using release memory fences in the writers and an acquire fence in the flusher? I fear that some of the writes would still be in CPU caches and not in main memory at that point.
Do the CPU and OS guarantee that writes still sitting in CPU caches will eventually reach the disk, or should I first ensure that the writes reach main RAM and only then flush and unmap the file?
My threads never read from the mapped pages, and I want to use only relaxed atomic operations to track my data, if that is possible.
Pseudo-code of what I am trying to do:
static NUM_WRITERS: AtomicUint = 0;
static CURRENT_OFFSET: AtomicUint = 0;
fn some_thread_fn(void* buffer, payload: &[byte]){
    NUM_WRITERS.fetch_add(1, memory_order_relaxed);
    offset = CURRENT_OFFSET.fetch_add(payload_size, memory_order_relaxed);
    void* dst = buffer + offset;
    memcpy(dst, payload, payload_size);
    // Do I need memory fence or fetch_sub(Release) here?
    compiler_fence(memory_order_release); // Prevent compiler from reordering instructions
    NUM_WRITERS.fetch_sub(1, memory_order_relaxed);
    if offset + payload_size < buffer_size {
        // Some other thread is responsible for unmapping.
        return;
    }
    while (NUM_WRITERS.load(relaxed) > 0) {
        mm_pause();
    }
    // Do I need acquire memory fence here?
    compiler_fence(memory_order_acquire); // Prevent compiler from reordering instructions
    flush_async(buffer);
    unmap(buffer);
}
All stores need to happen-before the `munmap`, else they could fault. They will make it to disk unless the system crashes before that happens. For data to be affected by `msync`, make sure the write (assignments / memcpy) happens-before `msync` on that address range. `fdatasync` on the fd after `munmap` would be a simpler way to make sure all dirty data makes it to disk ASAP, unless you have other regions of the file that you don't want to sync. Without any manual syncing, dirty pages in the page-cache get queued for write-back to disk after some timeout, like 15 seconds.
Sequenced-before is a sufficiently strong form of happens-before, since a call to `munmap` or `msync` does the "observing" from this thread. For example, within a single thread, the "dirty" flag bits in the page-table entries will be seen by any later kernel code (such as during the `msync` syscall) for pages modified by store instructions by this thread. (Or by any other threads that you've synced-with.)
I think in your case, yes, you do need `NUM_WRITERS.fetch_sub(1, memory_order_release);` for every thread, and `while (NUM_WRITERS.load(acquire) > 0) { pause }` for the one thread that reaches that spin-wait loop to do the cleanup.
`mm_pause()` is x86-specific; on x86, `acquire` loads are free, same asm as `relaxed`. And all RMWs need `lock`, e.g. `lock add`, the same asm that's strong enough for `seq_cst`. If you plan to port to other ISAs, then rest assured AArch64 has relatively efficient `acquire` and `release`.
`relaxed` may work most of the time, but would in theory allow this thread to `munmap` and invalidate the page-table entries before other threads have even reached the store instructions, leading to a fault. Or, more plausibly, it could let some of the stores happen after `msync`.
With cache-coherent DMA (like on x86), only the actual DMA read of the memory by the device is the deadline for stores to have committed to cache. (But for `msync` to notice the page was dirty in the first place and queue it for writing to disk, at least one byte of it would have to be written recently before the OS checked the hardware page tables.)
When a store instruction runs on a page where the TLB entry shows the `D` (Dirty) bit = 0, on x86 that core takes a microcode assist to atomically RMW the page-table entry to have `D=1`. (https://wiki.osdev.org/Paging#Page_Directory) There's also an `A` bit which gets set even by reads. (The OS can clear this and see which pages have it set again soon; those pages are bad choices for eviction.)
You don't need a manual `atomic_signal_fence` (aka `compiler_barrier`) because compilers already can't move stores past a call to a function that might read the stored data. (Like `arr[i] = 1; foo(arr)` where `foo` is `msync` or `munmap`, for exactly the same reason it's safe with a user-defined function the compiler doesn't know about.) CPUs that do out-of-order exec will preserve the illusion of a single thread running in program order.
If each write has page granularity, you could have each thread do its own `msync` on the pages it wrote. This would not be great if writes are smaller than pages, though, since you'd trigger multiple disk I/Os for the same page.
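
To make the release/acquire pairing concrete, here is a minimal C++ sketch of the question's pseudo-code with the orderings suggested above. It assumes a POSIX/Linux system and payloads that fill the buffer exactly. The names `write_chunk`, `run_demo`, `num_writers` and `current_offset` are hypothetical, and `std::this_thread::yield()` is used as a portable stand-in for `mm_pause()`. The `acq_rel` ordering on the offset counter is an extra, conservative choice of this sketch rather than something the answer requires.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <thread>
#include <vector>

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

// Hypothetical globals mirroring NUM_WRITERS / CURRENT_OFFSET from the question.
static std::atomic<std::size_t> num_writers{0};
static std::atomic<std::size_t> current_offset{0};

// Copies one payload into the mapping. The call whose payload reaches the end
// of the buffer waits for the other writers, then flushes and unmaps.
// Returns true if this call did the cleanup.
bool write_chunk(char* buffer, std::size_t buffer_size,
                 const char* payload, std::size_t payload_size) {
    num_writers.fetch_add(1, std::memory_order_relaxed);
    // Assumption of this sketch: acq_rel here makes every earlier writer's
    // increment of num_writers visible to the thread that claims the last chunk.
    std::size_t offset =
        current_offset.fetch_add(payload_size, std::memory_order_acq_rel);
    assert(offset + payload_size <= buffer_size);  // demo assumes an exact fit
    std::memcpy(buffer + offset, payload, payload_size);

    // Release: the memcpy above happens-before any acquire load that
    // observes this decrement.
    num_writers.fetch_sub(1, std::memory_order_release);
    if (offset + payload_size < buffer_size) return false;  // not the last chunk

    // Acquire: once this reads 0, all other writers' stores happen-before
    // the msync/munmap below.
    while (num_writers.load(std::memory_order_acquire) > 0)
        std::this_thread::yield();

    msync(buffer, buffer_size, MS_ASYNC);
    munmap(buffer, buffer_size);
    return true;
}

// Hypothetical demo driver: n threads each write one chunk of `chunk` bytes.
// Returns how many calls performed the cleanup, or -1 on any error.
int run_demo(const char* path, unsigned n, std::size_t chunk) {
    const std::size_t size = n * chunk;
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0) return -1;
    if (ftruncate(fd, static_cast<off_t>(size)) != 0) { close(fd); return -1; }
    void* map = mmap(nullptr, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) { close(fd); return -1; }

    num_writers.store(0);
    current_offset.store(0);
    std::atomic<int> cleanups{0};
    std::vector<std::thread> threads;
    for (unsigned i = 0; i < n; ++i) {
        threads.emplace_back([&, i] {
            std::string payload(chunk, static_cast<char>('A' + i));
            if (write_chunk(static_cast<char*>(map), size, payload.data(), chunk))
                cleanups.fetch_add(1);
        });
    }
    for (auto& t : threads) t.join();

    fdatasync(fd);  // the simpler option from the answer: sync via the fd after munmap

    // Read the file back: each chunk must hold a single letter written by one thread.
    std::string file(size, '\0');
    bool ok = pread(fd, &file[0], size, 0) == static_cast<ssize_t>(size);
    for (std::size_t i = 0; ok && i < size; ++i)
        ok = file[i] == file[i - i % chunk] &&
             file[i] >= 'A' && file[i] < 'A' + static_cast<int>(n);
    close(fd);
    unlink(path);
    return ok ? cleanups.load() : -1;
}
```

Calling something like `run_demo("/tmp/mmap_flush_demo.bin", 4, 4096)` from `main` should report exactly one cleanup call, since only the thread that claims the final chunk reaches the spin-wait loop. On x86 these orderings should compile to the same instructions as the `relaxed` versions, so the stronger ordering mainly constrains the compiler there and only costs extra on weaker ISAs.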