Horrible scalability of std::mutex with zero contention


Even without contention, the scalability of std::mutex seems to be horrible. This is a case where every thread is guaranteed to use its own mutex. What is going on?

#include <mutex>
#include <thread>   // std::thread, std::this_thread
#include <vector>
#include <numeric>  // std::accumulate
#include <cstdio>   // printf

void TestThread(bool *pbFinished, int* pResult)
{
    std::mutex mtx;
    for (; !*pbFinished; (*pResult)++)
    {
        mtx.lock();
        mtx.unlock();
    }
}

void Test(int coreCnt)
{
    const int ms = 3000;
    bool bFinished = false;
    std::vector<int> results(coreCnt);    
    std::vector<std::thread*> threads(coreCnt);

    for (int i = 0; i < coreCnt; i++)
        threads[i] = new std::thread(TestThread, &bFinished, &results[i]);

    std::this_thread::sleep_for(std::chrono::milliseconds(ms));

    bFinished = true;
    for (std::thread* pThread : threads)
    {
        pThread->join();
        delete pThread;
    }

    int sum = std::accumulate(results.begin(), results.end(), 0);
    printf("%d cores: %.03fm ops/sec\n", coreCnt, double(sum)/double(ms)/1000.);
}

int main(int argc, char** argv)
{
    for (int cores = 1; cores <= (int)std::thread::hardware_concurrency(); cores++)
        Test(cores);

    return 0;
}
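A side note on the code above: `bFinished` is written by the main thread and read by the workers with no synchronization, which is formally a data race (a compiler is even allowed to hoist the read out of the loop). A minimal sketch of a race-free stop flag using `std::atomic`; the relaxed ordering is my assumption that the flag carries no other data:

```cpp
#include <atomic>

// Race-free stop flag: workers poll it, the main thread sets it.
// memory_order_relaxed suffices for a pure "please stop" signal
// that publishes no other data.
long long SpinUntilStopped(const std::atomic<bool>& stop)
{
    long long iterations = 0;
    while (!stop.load(std::memory_order_relaxed))
        ++iterations;
    return iterations;
}
```

In `Test()`, `bool bFinished` would become `std::atomic<bool>` and the main thread would call `store(true)` instead of a plain assignment.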

Results on Windows are very bad:

1 cores: 15.696m ops/sec
2 cores: 12.294m ops/sec
3 cores: 17.134m ops/sec
4 cores: 9.680m ops/sec
5 cores: 13.012m ops/sec
6 cores: 21.142m ops/sec
7 cores: 18.936m ops/sec
8 cores: 18.507m ops/sec

Linux manages to be an even bigger loser:

1 cores: 46.525m ops/sec
2 cores: 15.089m ops/sec
3 cores: 15.105m ops/sec
4 cores: 14.822m ops/sec
5 cores: 14.519m ops/sec
6 cores: 14.544m ops/sec
7 cores: 13.996m ops/sec
8 cores: 13.869m ops/sec

I have also tried TBB's reader/writer lock, and even rolled my own.
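For what it's worth (my reading, not confirmed in the thread): the per-thread counters in `results` are adjacent `int`s, so every increment invalidates the same cache line on every other core. The mutexes are uncontended, but the cache line holding the counters is not. A sketch of padding each counter onto its own cache line; the 64-byte line size and the `PaddedCounter` name are assumptions:

```cpp
// Hypothetical fix: give each thread's counter its own cache line so
// increments on one core don't invalidate the other cores' cached copies.
// 64 bytes is an assumed (x86-typical) cache-line size.
struct alignas(64) PaddedCounter
{
    long long value = 0;
};

static_assert(sizeof(PaddedCounter) == 64, "exactly one cache line per counter");
```

In `Test()`, `std::vector<int> results` would become `std::vector<PaddedCounter>`, and each thread would increment its own `results[i].value`.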

Answer by selbie:

I made my own variant of your test with the following changes:

  • Each test thread executes a fixed number of iterations instead of running for a fixed amount of time, and returns how long those iterations took. (For testing, I used 20 million iterations.)

  • The main thread waits for each test thread to signal that it's ready to begin. Once all threads are ready, it signals "go" to all of them. These signals are built on condition variables. This eliminates the noise of one thread already executing while another is still being spun up by the kernel.

  • No test thread touches a shared/global variable until it finishes and returns its result.

  • When all the threads have finished, the aggregate throughput is computed from each thread's iteration count and elapsed time.

  • Used std::chrono::high_resolution_clock to measure each thread's elapsed time.

// Includes needed by the snippets below:
#include <mutex>
#include <condition_variable>
#include <thread>
#include <chrono>
#include <vector>
#include <iostream>

struct TestSignal
{
    std::mutex mut;
    std::condition_variable cv;
    bool isReady;

    TestSignal() : isReady(false) {}

    void Signal()
    {
        mut.lock();
        isReady = true;
        mut.unlock();
        cv.notify_all();
    }

    void Wait()
    {
        std::unique_lock<std::mutex> lck(mut);
        cv.wait(lck, [this] {return isReady; });
    }
};

long long TestThread2(long long iterations, TestSignal& signalReady, TestSignal& signalGo)
{
    std::mutex mtx;

    signalReady.Signal(); // signal to the main thread we're ready to proceed
    signalGo.Wait();      // wait for the main thread to tell us to start

    auto start = std::chrono::high_resolution_clock::now();

    for (long long i = 0; i < iterations; i++)
    {
        mtx.lock();
        mtx.unlock();
    }

    auto end = std::chrono::high_resolution_clock::now();

    auto elapsedMs = std::chrono::duration_cast<std::chrono::milliseconds>(end - start);
    return elapsedMs.count(); // how long the iterations took, in milliseconds
}


void Test2(unsigned int threadcount)
{

    long long iterations = 20000000; // 20 million

    std::vector<std::thread> threads(threadcount);
    std::vector<TestSignal> readySignals(threadcount);
    std::vector<long long> results(threadcount);

    TestSignal signalGo;

    for (unsigned int i = 0; i < threadcount; i++)
    {
        auto t = std::thread([&results, &readySignals, &signalGo, i, iterations] {results[i] = TestThread2(iterations, readySignals[i], signalGo); });
        readySignals[i].Wait();
        threads[i] = std::move(t);
    }

    std::this_thread::sleep_for(std::chrono::milliseconds(500));

    signalGo.Signal(); // unleash the threads

    for (unsigned int i = 0; i < threadcount; i++)
    {
        threads[i].join();
    }

    double totalrate = 0;
    for (unsigned int i = 0; i < threadcount; i++)
    {
        double rate = iterations / (double)results[i]; // operations per millisecond
        totalrate += rate;
    }

    std::cout << threadcount << " threads: " << totalrate/1000 << "m ops/sec (new test)\n";
    
}

Then a simple main to compare both tests three times:

int main()
{
#ifdef _WIN32
    // Requires <windows.h>; raise priority to reduce scheduler noise.
    ::SetPriorityClass(GetCurrentProcess(), HIGH_PRIORITY_CLASS);
#endif

    Test(std::thread::hardware_concurrency());
    Test2(std::thread::hardware_concurrency());

    Test(std::thread::hardware_concurrency());
    Test2(std::thread::hardware_concurrency());

    Test(std::thread::hardware_concurrency());
    Test2(std::thread::hardware_concurrency());


    return 0;
}

The results are noticeably different:

12 cores: 66.343m ops/sec
12 threads: 482.187m ops/sec (new test)
12 cores: 111.061m ops/sec
12 threads: 474.199m ops/sec (new test)
12 cores: 66.758m ops/sec
12 threads: 481.353m ops/sec (new test)
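For clarity, the aggregation at the end of Test2 works out as: each thread's rate is its iteration count divided by its elapsed milliseconds (ops/ms, i.e. thousands of ops/sec), the per-thread rates are summed, and dividing by 1000 converts to millions of ops/sec. A standalone restatement of that arithmetic (the function name is mine):

```cpp
#include <vector>

// Mirror of Test2's aggregation: sum per-thread rates (iterations per
// elapsed millisecond) and convert to millions of operations per second.
double AggregateRateMops(long long iterations, const std::vector<long long>& elapsedMs)
{
    double totalOpsPerMs = 0;
    for (long long ms : elapsedMs)
        totalOpsPerMs += double(iterations) / double(ms);
    return totalOpsPerMs / 1000.0; // ops/ms are kilo-ops/sec; /1000 gives millions/sec
}
```

For example, two threads each doing 20 million iterations in 1000 ms aggregate to 40m ops/sec, matching the scale of the printed results.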