I'm hacking on a fifo/queue for a multithreaded program using the following example as a guide. One thread writes to the fifo and another reads from it. The fifo will only be shared between these two threads. The synchronization overhead is massive; adding and removing 10 million items from the fifo takes about three seconds on my machine if I use mutexes and 1.5 seconds if I use spinlocks instead.
My blocking add function is as follows; the blocking remove function works analogously. The functions were updated after John Bollinger pointed out that the spinlock code was racy with more than one writer or reader.
static void
spin_while_full(volatile synced_queue *me) {
    /* Busy-wait on the capacity check without holding the lock. */
    while (me->queue->n_elements == me->queue->capacity) {
    }
}

void
synced_queue_add(synced_queue *me, void *value) {
    synced_queue_lock_type tp = me->lock_type;
    if (tp == SYNCED_QUEUE_SPIN_LOCK) {
        /* Take the lock, re-check capacity under it, release and retry while full. */
        while (true) {
            pthread_spin_lock(&me->lock);
            if (me->queue->n_elements < me->queue->capacity) {
                break;
            }
            pthread_spin_unlock(&me->lock);
        }
    } else if (tp == SYNCED_QUEUE_SPIN_LOCK_UNCONTESTED) {
        /* Check capacity outside the lock; only safe with a single producer. */
        spin_while_full(me);
        pthread_spin_lock(&me->lock);
    } else if (tp == SYNCED_QUEUE_MUTEX) {
        pthread_mutex_lock(&me->mutex);
        while (me->queue->n_elements == me->queue->capacity) {
            pthread_cond_wait(&me->var_prod, &me->mutex);
        }
    } else {
        assert(false);
    }
    queue_add(me->queue, value);
    if (tp == SYNCED_QUEUE_SPIN_LOCK ||
        tp == SYNCED_QUEUE_SPIN_LOCK_UNCONTESTED) {
        pthread_spin_unlock(&me->lock);
    } else {
        pthread_mutex_unlock(&me->mutex);
        /* Wake the consumer, which waits on var_cons when the queue is empty. */
        pthread_cond_signal(&me->var_cons);
    }
}
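Roughly, the analogous blocking remove looks like this (a sketch: queue_remove is the counterpart of queue_add and is assumed to pop and return the oldest element):

static void
spin_while_empty(volatile synced_queue *me) {
    while (me->queue->n_elements == 0) {
    }
}

void *
synced_queue_remove(synced_queue *me) {
    synced_queue_lock_type tp = me->lock_type;
    if (tp == SYNCED_QUEUE_SPIN_LOCK) {
        while (true) {
            pthread_spin_lock(&me->lock);
            if (me->queue->n_elements > 0) {
                break;
            }
            pthread_spin_unlock(&me->lock);
        }
    } else if (tp == SYNCED_QUEUE_SPIN_LOCK_UNCONTESTED) {
        spin_while_empty(me);
        pthread_spin_lock(&me->lock);
    } else if (tp == SYNCED_QUEUE_MUTEX) {
        pthread_mutex_lock(&me->mutex);
        while (me->queue->n_elements == 0) {
            /* The producer signals var_cons after adding an element. */
            pthread_cond_wait(&me->var_cons, &me->mutex);
        }
    } else {
        assert(false);
    }
    void *value = queue_remove(me->queue);
    if (tp == SYNCED_QUEUE_SPIN_LOCK ||
        tp == SYNCED_QUEUE_SPIN_LOCK_UNCONTESTED) {
        pthread_spin_unlock(&me->lock);
    } else {
        pthread_mutex_unlock(&me->mutex);
        /* Wake the producer, which waits on var_prod when the queue is full. */
        pthread_cond_signal(&me->var_prod);
    }
    return value;
}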
The library's user sets the lock_type flag to whatever synchronization method they want. The benchmark results for the three lock types are:
=== benchmark_synced_queue
spinlock 1024 elements 0.17 us/job
uncontested spinlock 1024 elements 0.14 us/job
mutex 1024 elements 0.31 us/job
spinlock 512 elements 0.17 us/job
uncontested spinlock 512 elements 0.14 us/job
mutex 512 elements 0.31 us/job
spinlock 256 elements 0.15 us/job
uncontested spinlock 256 elements 0.14 us/job
mutex 256 elements 0.28 us/job
spinlock 128 elements 0.15 us/job
uncontested spinlock 128 elements 0.14 us/job
mutex 128 elements 0.29 us/job
spinlock 64 elements 0.16 us/job
uncontested spinlock 64 elements 0.14 us/job
mutex 64 elements 0.28 us/job
spinlock 32 elements 0.18 us/job
uncontested spinlock 32 elements 0.15 us/job
mutex 32 elements 0.15 us/job
spinlock 16 elements 0.21 us/job
uncontested spinlock 16 elements 0.16 us/job
mutex 16 elements 0.30 us/job
spinlock 8 elements 0.29 us/job
uncontested spinlock 8 elements 0.16 us/job
mutex 8 elements 0.60 us/job
spinlock 4 elements 0.43 us/job
uncontested spinlock 4 elements 0.17 us/job
mutex 4 elements 1.21 us/job
So while the "correct" spinlock is a little slower than the incorrect one, the difference between the spinlock types and the mutex is much bigger.
The pthread manual pretty much discourages the use of spinlocks:
Spin locks should be employed in conjunction with real-time scheduling policies (SCHED_FIFO, or possibly SCHED_RR). Use of spin locks with nondeterministic scheduling policies such as SCHED_OTHER probably indicates a design mistake. ... User-space spin locks are not applicable as a general locking solution. They are, by definition, prone to priority inversion and unbounded spin times. A programmer using spin locks must be exceptionally careful not only in the code, but also in terms of system configuration, thread placement, and priority assignment.
My question is: why is locking so much slower with mutexes than with spinlocks? The problem with spinlocks is of course that they tax the CPU. So is there a way to avoid spinlocks but still get good performance?
I'll give you my educated guess. It needs some testing to be proven right/wrong, which you should be able to perform yourself.
First of all let me highlight the fundamental implementation difference between spinlocks and mutexes:
- Spinlocks are acquired by busy-waiting in userspace: the thread keeps retrying the lock (typically with an atomic compare-and-exchange such as CMPXCHG) and never involves the kernel while it waits.
- Mutexes are implemented through the futex(2) syscall (at least on Linux), or anyway through a set of syscalls. Contrary to spinlocks, they prevent CPU usage while waiting for a lock by making the current thread sleep at least until the lock is ready to be acquired.

Your implementation uses a busy loop in the case of spinlocks vs. a pthread condition variable in the case of mutexes. Again, the first case is likely a userspace-only job, while the second implies sleeping and waking threads through syscalls. This further polarizes the two cases you are testing: in the first one (spinlocks) you have two threads that almost entirely hog the CPU for their exclusive use, while in the second one (mutexes) you have two threads that "intelligently" sleep and wake, trying to minimize CPU usage when it is not needed.
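To see what that "set of syscalls" amounts to, here is a minimal sketch of the futex(2) interface that mutex and condition-variable implementations typically sit on under Linux. The wrapper names (futex_wait, futex_wake) are mine; glibc exposes no futex() function, so the raw syscall is used:

#define _GNU_SOURCE
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdint.h>

/* Atomically check that *addr still equals expected and, if so, sleep
 * in the kernel until another thread does a FUTEX_WAKE on the address. */
static long
futex_wait(uint32_t *addr, uint32_t expected) {
    return syscall(SYS_futex, addr, FUTEX_WAIT, expected, NULL, NULL, 0);
}

/* Wake up to nwaiters threads currently sleeping on addr. */
static long
futex_wake(uint32_t *addr, int nwaiters) {
    return syscall(SYS_futex, addr, FUTEX_WAKE, nwaiters, NULL, NULL, 0);
}

On Linux, a contended pthread_mutex_lock or a pthread_cond_wait eventually ends up in something like futex_wait, i.e. a full round trip into the kernel.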
By itself, this already introduces a big difference in overhead between the two implementations: the performance of the first is more or less limited only by the speed of your CPUs, while the second also has to rely on the underlying kernel to sleep and wake tasks. Performing a syscall (mutex) is a much more expensive operation than a few hundred CMPXCHG instructions (spinlock).
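For contrast, the spinlock side can be sketched entirely in userspace with C11 atomics; toy_spinlock and its functions are illustrative names, not the pthread implementation:

#include <stdatomic.h>

typedef struct {
    atomic_flag flag;    /* initialize with ATOMIC_FLAG_INIT */
} toy_spinlock;

static void
toy_spin_lock(toy_spinlock *l) {
    /* atomic_flag_test_and_set compiles to an atomic read-modify-write
     * (an XCHG/CMPXCHG-style instruction on x86); the thread keeps
     * retrying here and never enters the kernel while it waits. */
    while (atomic_flag_test_and_set_explicit(&l->flag, memory_order_acquire)) {
        /* busy-wait */
    }
}

static void
toy_spin_unlock(toy_spinlock *l) {
    atomic_flag_clear_explicit(&l->flag, memory_order_release);
}

Acquiring an uncontended (or lightly contended) lock this way costs a handful of instructions and no context switch at all.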
On top of this, modern CPUs and operating systems use dynamic CPU frequency scaling, which essentially means that the CPU clock speed of each core is not fixed, but continuously adjusted (independently from other cores) based on its utilization either by the kernel, the hardware itself, or a combination of the two (see e.g. Intel P-state).
Therefore, what is likely happening is that the intermittent sleeping caused by mutexes in your second implementation is further degrading performance by making the CPU constantly change clock speed like a roller coaster. I have talked about this phenomenon in more detail (specifically focusing on Linux) in this other answer of mine.
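One quick way to test the frequency-scaling hypothesis is to watch the governor and the current clock of a core while the benchmark runs. A rough sketch, assuming the standard Linux cpufreq sysfs layout (cpu0 is just an example; loop over cpu* or pick the core the benchmark is pinned to):

#include <stdio.h>

/* Print one cpufreq sysfs file, e.g. the governor or the current frequency. */
static void
print_file(const char *path) {
    char buf[128];
    FILE *f = fopen(path, "r");
    if (f != NULL) {
        if (fgets(buf, sizeof(buf), f) != NULL) {
            printf("%s: %s", path, buf);
        }
        fclose(f);
    }
}

int
main(void) {
    print_file("/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor");
    print_file("/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq");
    return 0;
}

If the mutex numbers improve noticeably after switching the governor to performance (so the clock stays pinned at its maximum), frequency scaling is at least part of the story.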