I am working on a fiber-based job system for my latest project, which will depend on spinlocks to work correctly. I had intended to use the PAUSE instruction, as that seems to be the gold standard for the waiting portion of your average modern spinlock. However, while researching how to implement my own fibers, I came across the fact that the latency of PAUSE on recent machines has increased dramatically.
I found this out from here, where it says, quoting the Intel Optimization Manual, "The latency of PAUSE instruction in prior generation microarchitecture is about 10 cycles, whereas on Skylake microarchitecture it has been extended to as many as 140 cycles," and "As the PAUSE latency has been increased significantly, workloads that are sensitive to PAUSE latency will suffer some performance loss."
As a result, I'd like to find an alternative to the PAUSE instruction for use in my own spinlocks. I've read that PAUSE has historically been preferred because it somehow saves energy, which I'm guessing is related to the other often-quoted claim that PAUSE signals to the processor that it's in the middle of a spin-wait loop. I'm also guessing that this puts it at the opposite end of the power spectrum from burning the desired number of cycles on some dummy calculation.
Given this, is there a solution that comes close to PAUSE's apparent energy efficiency while keeping the flexibility and low cycle count of a repeated 'throwaway' calculation?
Yes, `pause` lets the CPU avoid memory-order mis-speculation when leaving a read-only spin-wait loop, which is how you should spin to avoid creating contention for the thread trying to unlock that location. (Don't spam `xchg`.) See also:

- How does x86 pause instruction work in spinlock *and* can it be used in other scenarios?
- Does cmpxchg write destination cache line on failure? If not, is it better than xchg for spinlock? - spin read-only (with `pause`) if the first atomic RMW fails to take the lock. You do want to start with an RMW attempt so you don't get a Shared copy of the cache line and then have to wait for another off-core request; if the first access is an RMW like `xchg` or `lock cmpxchg`, the first request the core makes will be an RFO. (A sketch of this pattern follows below.)
- Locks around memory manipulation via inline assembly - a minimal x86 asm spinlock using `pause` (but no fallback to OS-assisted sleep/wake, so inappropriate if the locking thread might ever sleep while holding the lock).

If you have to wait for another core to change something in memory, it doesn't help much to check much more frequently than the inter-core latency, especially if you'll stall right when you finally do see the change you've been waiting for.
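For concreteness, here's a minimal sketch of that pattern in C++ (the class name and structure are mine, not from any of the linked answers): the first access is an atomic RMW so the initial request is an RFO, and the contended path spins read-only with `pause` via `_mm_pause()`.

```cpp
// Minimal test-and-test-and-set spinlock sketch: RMW first, then spin
// read-only with pause so we don't hammer the line with RFOs while the
// owner is trying to store to it.
#include <atomic>
#include <immintrin.h>   // _mm_pause (compiles to the `pause` instruction)

class SpinLock {
    std::atomic<bool> locked{false};
public:
    void lock() {
        // First access is an RMW, so the core requests the line with an RFO
        // right away instead of first getting a Shared copy.
        while (locked.exchange(true, std::memory_order_acquire)) {
            // Lock was already held: spin read-only until it looks free,
            // then loop back and retry the RMW.
            while (locked.load(std::memory_order_relaxed))
                _mm_pause();
        }
    }
    void unlock() {
        locked.store(false, std::memory_order_release);
    }
};
```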
**Dealing with `pause` being slow (newer Intel) or fast (AMD and old Intel)**

If you have code that uses multiple `pause` instructions between checking the shared value, you should change the code to do fewer.

See also Why have AMD-CPUs such a silly PAUSE-timing, with Brendan's suggestion of checking `rdtsc` in spin loops to better adapt to unknown delays from `pause` on different CPUs. (A sketch of that idea is below.)

Basically try to make your workload not as sensitive to `pause` latency. That may also mean trying to avoid waiting for data from other threads as much, or having other useful work to do until a lock becomes available.
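As a hedged illustration of that `rdtsc` idea (the function name and timeout policy are made up for the example): bound the spin by elapsed reference cycles rather than by a fixed count of `pause` iterations, and check the shared value between every single `pause` instead of batching several pauses per check.

```cpp
// Spin-wait bounded by elapsed TSC ticks, so the total wait time doesn't
// depend on whether pause costs ~10 or ~140 cycles on this CPU.
#include <atomic>
#include <cstdint>
#include <immintrin.h>   // _mm_pause
#include <x86intrin.h>   // __rdtsc (GCC/Clang; MSVC has it in <intrin.h>)

// Returns true if `flag` became true within roughly `max_tsc_ticks`
// reference cycles; on false the caller would fall back to an OS wait.
bool spin_wait_tsc(const std::atomic<bool>& flag, uint64_t max_tsc_ticks) {
    uint64_t start = __rdtsc();
    while (!flag.load(std::memory_order_acquire)) {
        _mm_pause();                      // one pause per check, not a burst
        if (__rdtsc() - start > max_tsc_ticks)
            return false;                 // time out instead of burning more CPU
    }
    return true;
}
```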
**Alternatives, IDK if this is lower wakeup latency than spinning with `pause`**

On CPUs new enough to have the WAITPKG extension (Tremont or Alder Lake / Sapphire Rapids), `umonitor` / `umwait` could be viable to wait in user-space for a memory location to change, like spinning but the CPU wakes itself up when it sees cache-coherency traffic about a change, or something like that. Although that may be slower than `pause` if it has to enter a sleep state.

(You can ask `umwait` to only go into C0.1, not C0.2, with bit 0 of the register you specify, and EDX:EAX as a TSC deadline. Intel says a C0.2 sleep state improves performance of the other hyperthread, so presumably that means switching back to single-core-active mode, de-partitioning the store buffer, ROB, etc., and having to wait for re-partitioning before this core can wake up. But the C0.1 state doesn't do that.)
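A rough sketch of what that could look like with the WAITPKG intrinsics, assuming a GCC/Clang-style `<immintrin.h>` built with `-mwaitpkg` and a CPU that actually reports the feature; the deadline value and re-check structure are just illustrative:

```cpp
// Wait for a 32-bit location to change from old_value using umonitor/umwait.
#include <atomic>
#include <cstdint>
#include <immintrin.h>   // _umonitor, _umwait (requires -mwaitpkg)
#include <x86intrin.h>   // __rdtsc

void wait_for_change(std::atomic<uint32_t>& word, uint32_t old_value) {
    while (word.load(std::memory_order_acquire) == old_value) {
        _umonitor(&word);                  // arm address monitoring on this line
        // Re-check after arming so we don't sleep through a change that
        // happened between the load above and the umonitor.
        if (word.load(std::memory_order_acquire) != old_value)
            break;
        // Control bit 0 = 1 requests the shallower C0.1 state; the second
        // argument is an absolute TSC deadline so the sleep is bounded.
        _umwait(1, __rdtsc() + 100000);
    }
}
```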
Even in the worst case, `pause` is only about 140 core clock cycles. That's still much faster than a Linux system call on a modern x86-64, especially with Spectre / Meltdown mitigation. (Thousands to tens of thousands of clock cycles, up from a couple hundred just for `syscall` + `sysret`, let alone calling `schedule()` and maybe running something else.)

So if you're aiming to minimize wakeup latency at the expense of wasting CPU time spinning longer, `nanosleep` is not an option. It might be good for other use-cases though, as a fallback after spinning on `pause` a couple of times.

Or use `futex` to sleep on a value changing or on a notify from another process. (It doesn't guarantee that the kernel will use `monitor`/`mwait` to sleep until a change; it will let other tasks run. So you do need to have the unlocking thread make a `futex` system call if any waiters had gone to sleep with `futex`. But you can still make a lightweight mutex that avoids any system calls in the locker and unlocker if there isn't contention and no threads go to sleep.)

But at that point, with `futex` you're probably reproducing what glibc's `pthread_mutex` does.
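For reference, here's a hedged, Linux-only sketch of that spin-then-`futex` pattern, a simplified variant of the 0/1/2 state scheme from Drepper's "Futexes Are Tricky" (0 = unlocked, 1 = locked, 2 = locked with possible waiters). The spin count and other details are illustrative; glibc's implementation handles more cases and more carefully.

```cpp
// Lightweight mutex: CAS fast path, short pause-spin, then futex sleep.
// No syscalls at all when there's no contention and nobody went to sleep.
#include <atomic>
#include <cstdint>
#include <immintrin.h>      // _mm_pause
#include <linux/futex.h>    // FUTEX_WAIT_PRIVATE, FUTEX_WAKE_PRIVATE
#include <sys/syscall.h>    // SYS_futex
#include <unistd.h>         // syscall

class FutexMutex {
    std::atomic<uint32_t> state{0};

    static long futex(std::atomic<uint32_t>* addr, int op, uint32_t val) {
        return syscall(SYS_futex, reinterpret_cast<uint32_t*>(addr),
                       op, val, nullptr, nullptr, 0);
    }

public:
    void lock() {
        uint32_t expected = 0;
        if (state.compare_exchange_strong(expected, 1, std::memory_order_acquire))
            return;                                   // uncontended: no syscall

        // Brief spin with pause in case the owner releases very soon.
        for (int i = 0; i < 40; ++i) {
            expected = 0;
            if (state.compare_exchange_weak(expected, 1, std::memory_order_acquire))
                return;
            _mm_pause();
        }

        // Mark "waiters may be present" and sleep.  FUTEX_WAIT only blocks
        // if the value is still 2, so a racing unlock can't be missed.
        while (state.exchange(2, std::memory_order_acquire) != 0)
            futex(&state, FUTEX_WAIT_PRIVATE, 2);
    }

    void unlock() {
        // Only pay for the wake syscall if someone might be sleeping.
        if (state.exchange(0, std::memory_order_release) == 2)
            futex(&state, FUTEX_WAKE_PRIVATE, 1);
    }
};
```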