Fair Reader-Writer Ticket Spinlock in C++ running slow


I have recently implemented a fair reader-writer ticket spinlock in C++. The code is fairly simple and I thought it was working great. I have integrated the spinlock into a larger application and noticed that on some rare occasions the code runs extremely slowly, while most of the time it works really fast. I know it is due to the spinlock, because if I replace it with a simple reader-writer spinlock (not fair and no ticket), the code suddenly runs much faster. It has happened a few times on different machines. I know that these kinds of locks can run slowly if you run them with more threads than cores, but I ran it with 16 threads on a machine with 48 cores. I couldn't reproduce the issue on my laptop with 4 threads and 4 cores. Here is the code:

    #include <atomic>
    #include <cstddef>
    #include <cstdint>
    #include <thread>

    // Rounds v up to the next power of two (assumes a 64-bit size_t).
    inline size_t rndup(size_t v) {

        v--;
        v |= v >> 1;
        v |= v >> 2;
        v |= v >> 4;
        v |= v >> 8;
        v |= v >> 16;
        v |= v >> 32;
        v++;

        return v;
    }    

    class SpinLockRW_MCS {

        public:

            SpinLockRW_MCS(const size_t nb_readers) :   writer(nullptr), lock_pool(nullptr), it_lock_pool(0),
                                                        load_lock_pool(0), mask_it(rndup(2 * nb_readers + 1) - 1),
                                                        padding1{0}, padding2{0}, padding3{0}, padding4{0} {

                if (nb_readers <= std::thread::hardware_concurrency()){

                    lock_pool = new Lock[mask_it + 1];
                    lock_pool[0].is_locked = false;
                }
            }

            ~SpinLockRW_MCS() {

                clear();
            }

            inline void clear() {

                if (lock_pool != nullptr){

                    delete[] lock_pool;
                    lock_pool = nullptr;
                }

                writer = nullptr;

                it_lock_pool = 0;
                load_lock_pool = 0;
            }

            inline void acquire_reader() {

                uint_fast32_t retry = 0;

                // Take a ticket: wait on the previous slot, then release the next one.
                const size_t prev_reader_id = it_lock_pool.fetch_add(1) & mask_it;
                const size_t new_reader_id = (prev_reader_id + 1) & mask_it;

                // Wait for the previous ticket holder to release its slot.
                while (lock_pool[prev_reader_id].is_locked){

                    if (++retry > 100) std::this_thread::yield();
                }

                ++load_lock_pool;

                lock_pool[prev_reader_id].is_locked = true;
                lock_pool[new_reader_id].is_locked = false;
            }

            inline void release_reader() {

                --load_lock_pool;
            }

            inline void acquire_writer() {

                uint_fast32_t retry = 0;

                const size_t prev_reader_id = it_lock_pool.fetch_add(1) & mask_it;
                const size_t new_reader_id = (prev_reader_id + 1) & mask_it;

                // Wait for the previous ticket holder to release its slot.
                while (lock_pool[prev_reader_id].is_locked){

                    if (++retry > 100) std::this_thread::yield();
                }

                // Then wait for all in-flight readers to drain before taking write ownership.
                while (load_lock_pool){

                    if (++retry > 100) std::this_thread::yield();
                }

                lock_pool[prev_reader_id].is_locked = true;

                writer = &lock_pool[new_reader_id];
            }

            inline void release_writer() {

                writer->is_locked = false;
            }

            inline void release_writer_acquire_reader() {

                ++load_lock_pool;

                writer->is_locked = false;
            }

        private:

            struct Lock {

                std::atomic<bool> is_locked;
                const int padding[15];

                Lock() : is_locked(true), padding{0} {}
            };

            Lock* writer;
            const int padding1[14];
            Lock* lock_pool;
            const int padding2[14];
            const size_t mask_it;
            const int padding3[14];
            std::atomic<size_t> it_lock_pool;
            const int padding4[14];
            std::atomic<size_t> load_lock_pool;
    };
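
For context, the lock is driven by plain reader and writer threads. The minimal sketch below is made up for this question (thread counts, iteration counts and the shared counter are arbitrary) and is not my actual benchmark; it just exercises the acquire/release calls:

    // Minimal made-up driver: a few readers and one writer hammering the lock.
    // Not the real workload; thread and iteration counts are arbitrary.
    #include <cstdio>
    #include <thread>
    #include <vector>

    int main() {

        const size_t nb_readers = 4; // must stay <= hardware_concurrency()
        SpinLockRW_MCS lock(nb_readers);

        size_t shared_value = 0;
        std::vector<std::thread> threads;

        for (size_t i = 0; i != nb_readers; ++i) {

            threads.emplace_back([&]() {

                for (int j = 0; j != 100000; ++j) {

                    lock.acquire_reader();
                    volatile size_t v = shared_value; // read under the lock
                    (void) v;
                    lock.release_reader();
                }
            });
        }

        threads.emplace_back([&]() {

            for (int j = 0; j != 10000; ++j) {

                lock.acquire_writer();
                ++shared_value; // write under the lock
                lock.release_writer();
            }
        });

        for (auto& t : threads) t.join();

        std::printf("final value: %zu\n", shared_value);
        return 0;
    }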

Any suggestions would be greatly appreciated! Thanks!

2 Answers

Answer from darune:

My bet is your issue is somewhere around these lines:

    if (++retry > 100) std::this_thread::yield();

I know this is how you plan to be 'optimistic', but hardcoded (arbitrary) values like this ('100' in this case) are normally an indication of a design flaw when dealing with this class of problem. As you say, you only see the issue on other systems, which may be a symptom of that (since with this value it seems to work on your system). This in turn points to the this_thread::yield() as being part of the issue.
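
For illustration only (this is not necessarily the right fix for your class): a common alternative is to back off progressively instead of jumping from 100 tight spins straight to yield(). The sketch below assumes x86 and a compiler that ships <immintrin.h> for _mm_pause; the constants are arbitrary, not tuned:

    // Illustrative only: progressive backoff instead of a hard 100-spin cutoff.
    // Assumes x86 (<immintrin.h> / _mm_pause); constants are arbitrary.
    #include <immintrin.h>
    #include <atomic>
    #include <thread>

    inline void spin_wait(const std::atomic<bool>& flag) {

        unsigned pause_count = 1;

        while (flag.load(std::memory_order_acquire)) {

            if (pause_count <= 64) {

                // Stay on the CPU at first, but tell the core we are spinning.
                for (unsigned i = 0; i != pause_count; ++i) _mm_pause();
                pause_count *= 2;
            }
            else {

                // Only after a substantial on-CPU wait do we give up the timeslice.
                std::this_thread::yield();
            }
        }
    }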

Answer from ComicSansMS:

It is somewhat difficult to assess the problem without more details, but here's my shot in the dark: I suspect that in your scenario readers need to acquire locks very frequently (otherwise, you would probably be better off with a traditional lock). Here's your problem:

Any single thread is able to starve out all others.

This is true for both readers and writers, whereas in a non-fair algorithm it is usually only true for writers. The problem in your situation happens when you have multiple readers queuing up for read access. Each thread waits for the preceding lock to become available (while (lock_pool[prev_reader_id].is_locked) ...). If they can get that lock, everything is fine, but you get into trouble as soon as one thread cannot. All the reader threads queue up waiting for their predecessor's flag to flip to false; each of them depends on its direct predecessor.

Now imagine the first reader not being able to get the lock. It will keep spinning for a while and eventually yield(). This effectively means your thread is no longer running: the operating system drops it from the scheduling queue and it won't run again for a long time (the rest of its timeslice, which is long compared to how long the 100 spins take to complete). As a consequence, the complete chain of waiting threads will very likely go into yield as well.

Eventually, the flag that the first thread was waiting for will flip to false. But your scheduler is now in a predicament. It has a bunch of threads lying around, but each of them only spins for a very short time before going into yield again. The expectation here is that all but the first thread in the chain, if they get picked, are almost certainly doomed to lie dormant for one more complete timeslice. As a consequence, if this happens to a thread early in the chain of waiting threads, you also condemn all the threads behind it to wait for at least as long.

What you are playing here is a game of probabilities, where your odds of winning decrease significantly with a growing number of readers in the queue. That's why the problem got worse when moving from 4 to 16 threads. In particular, once the average time it takes for a new reader to arrive in the queue is roughly on the order of the time it takes a thread to move through the queue, you will have a hard time ever getting back to an empty queue. This is not unlikely, as we are talking about multiples of a timeslice here, which brings you into the range of tens to hundreds of milliseconds.

This is a typical trade-off in fair scheduling algorithms. The fairness comes at a price, in this case that a single reader is able to block everyone: since my reader can never overtake your reader if you managed to get to the acquire call first, I will have to wait forever if you don't move on. One solution to this problem is to give the scheduler additional information about what each thread is waiting on, so that it has a better chance of waking threads up in the right order. Another is to choose a different algorithm that is better suited to your particular scenario.
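
As a rough sketch of the first option (assuming a C++20 toolchain; this is not a drop-in replacement for your class), std::atomic::wait/notify passes exactly that information down to the OS, which can then park each waiter on the specific flag it depends on and wake it when that flag changes:

    // Rough C++20 sketch: block on the predecessor's flag via atomic wait/notify
    // (a futex on Linux) instead of spinning and yielding, so the kernel knows
    // exactly which waiter to wake. Not a drop-in fix for the posted class.
    #include <atomic>

    struct Slot {
        std::atomic<bool> is_locked{true};
    };

    inline void wait_for_predecessor(Slot& s) {

        // wait() blocks until the stored value differs from 'true', i.e. until
        // the predecessor flips is_locked to false; spurious wakeups are handled
        // by re-checking in the loop.
        while (s.is_locked.load(std::memory_order_acquire)) {
            s.is_locked.wait(true, std::memory_order_acquire);
        }
    }

    inline void release_and_wake(Slot& s) {

        s.is_locked.store(false, std::memory_order_release);
        s.is_locked.notify_all(); // wake the waiters parked on this slot
    }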