I am trying to convert a single-threaded raytracing program to a multithreaded program. My idea is the following:
- Create a new thread for processing the color of each pixel. Since each pixel has 1024 samples, the cost of creating a thread should be small compared to the calculation of a pixel.
- Create a producer thread which is responsible for pushing the future of each newly spawned thread into a BlockingQueue and waiting for the threads to finish. When a thread returns, future.get yields the returned value and the future is popped out of the BlockingQueue.
- The main program is responsible for getting the returned pixel value out of each future and popping the future out of the BlockingQueue.
The whole code block is the following:
std::cout << "P3\n" << image_width << ' ' << image_height << "\n255\n";
const int CONCURRENCY = 8; // `std::thread::hardware_concurrency();` returns 16 on my laptop.
std::clog << "\rCONCURRENCY: " << CONCURRENCY << std::endl;
BlockingQueue<std::shared_future<color>> futures(CONCURRENCY); // CONCURRENCY is the size of the thread pool.
for (int j = 0; j < image_height; ++j) {
std::clog << "\rScanlines remaining: " << (image_height - j) << ' ' << std::flush;
// The lambda function on which each spawned thread is going to run.
auto lambda_func = [this, j, &world](int i) {
color pixel_color(0,0,0);
for (int sample = 0; sample < this->samples_per_pixel; ++sample) {
const ray r = get_ray(i, j);
pixel_color += ray_color(r, this->max_depth, world);
}
return pixel_color;
};
std::thread producer_thread(producer_func, std::ref(futures), image_width, lambda_func);
for (int i = 0; i < image_width; ++i) {
color pixel_color(0,0,0);
// Multithread
std::shared_future<color> f = futures.front();
pixel_color = f.get();
futures.pop();
write_color(std::cout, pixel_color, samples_per_pixel);
}
producer_thread.join();
}
std::clog << "\rDone. \n";
Here is an explanation of each component:
BlockingQueue: It is a thread-safe queue that allows the producer_thread and the main thread to push and pop futures.
template <typename T>
class BlockingQueue {
private:
    std::mutex _mtx;
    std::condition_variable _cond;
    int _max_size;
    std::queue<T> _queue;

public:
    BlockingQueue(int max_size) : _max_size(max_size) {
    }

    void push(T t) {
        std::unique_lock<std::mutex> lock(_mtx);
        _cond.wait(lock, [this]() { return _queue.size() < _max_size; });
        _queue.push(t);
        lock.unlock();
        _cond.notify_one();
    }

    T front() {
        std::unique_lock<std::mutex> lock(_mtx);
        _cond.wait(lock, [this]() { return !_queue.empty(); });
        return _queue.front();
    }

    void pop() {
        std::unique_lock<std::mutex> lock(_mtx);
        _cond.wait(lock, [this]() { return !_queue.empty(); });
        _queue.pop();
        lock.unlock();
        _cond.notify_one();
    }

    int size() {
        std::lock_guard<std::mutex> lock(_mtx);
        return _queue.size();
    }
};
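For context, here is a minimal, self-contained sketch of how I intend this queue to be used, with a plain int payload instead of the raytracer's color type; it is an illustration, not part of the actual program.

#include <future>
#include <iostream>
#include <thread>
// Assumes the BlockingQueue<T> class above (and its <mutex>, <condition_variable>, <queue> includes).

int main() {
    BlockingQueue<std::shared_future<int>> futures(4); // at most ~4 in-flight tasks

    // Producer: spawns async tasks and pushes their futures; push() blocks when the queue is full.
    std::thread producer([&futures]() {
        for (int i = 0; i < 16; ++i) {
            std::shared_future<int> f = std::async(std::launch::async, [i]() { return i * i; });
            futures.push(f);
        }
    });

    // Consumer (main thread): takes futures in FIFO order, waits on each result, then pops.
    for (int i = 0; i < 16; ++i) {
        std::shared_future<int> f = futures.front();
        std::cout << f.get() << '\n';
        futures.pop();
    }

    producer.join();
}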
producer_func: It is the function that the producer_thread runs to push the futures of newly spawned threads into the BlockingQueue. This program processes the image row by row, so i is the parameter that selects which pixel of the row is going to be processed. There is no image_height information in this function because the height (row) information is already embedded inside lambda_func.
void producer_func(BlockingQueue<std::shared_future<color>> &futures, int image_width, std::function<color(int)> lambda_func) {
    for (int i = 0; i < image_width; i++) {
        // A new thread is spawned and its future is pushed into the BlockingQueue.
        std::shared_future<color> f = std::async(std::launch::async, lambda_func, i);
        // If the BlockingQueue is full, this thread will wait until an element is popped out.
        futures.push(f);
    }
}
lambda_func and producer_thread: lambda_func calculates and accumulates the color of all samples of a pixel and then returns it. The producer_thread is responsible for generating the column index i that lambda_func needs.
auto lambda_func = [this, j, &world](int i) {
    color pixel_color(0,0,0);
    for (int sample = 0; sample < this->samples_per_pixel; ++sample) {
        const ray r = get_ray(i, j);
        pixel_color += ray_color(r, this->max_depth, world);
    }
    return pixel_color;
};

std::thread producer_thread(producer_func, std::ref(futures), image_width, lambda_func);
Performance
I've run this program with CONCURRENCY=2 and CONCURRENCY=8. Both generate a correct image, but the timings don't make much sense:
- Using 2 or 8 threads is faster than the baseline (no multithreading).
- Using 2 threads takes roughly the same time as using 8 threads; sometimes 2 threads is even a few seconds faster. I would expect 8 threads to be significantly faster than 2. The improvement doesn't necessarily need to be linear, but in my opinion 8 threads should still beat 2 threads (see the scaling check sketched below).
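As a sanity check (not part of the raytracer), here is a minimal, self-contained sketch I can use to confirm that the machine itself scales with thread count for a purely CPU-bound workload; busy_work and the iteration counts are arbitrary placeholders. If this scales but the raytracer does not, the problem is in how the raytracer distributes work rather than in the hardware.

#include <chrono>
#include <cstdint>
#include <future>
#include <iostream>
#include <vector>

// Arbitrary CPU-bound work; only exists to keep a core busy for a measurable time.
static double busy_work(std::uint64_t iters) {
    double x = 0.0;
    for (std::uint64_t i = 0; i < iters; ++i)
        x += static_cast<double>(i % 7) * 1e-9;
    return x;
}

int main() {
    const std::uint64_t total_iters = 400000000; // tune so one thread takes a few seconds
    for (int n_threads : {1, 2, 8}) {
        const auto start = std::chrono::steady_clock::now();
        std::vector<std::future<double>> futures;
        for (int t = 0; t < n_threads; ++t)
            futures.push_back(std::async(std::launch::async, busy_work, total_iters / n_threads));
        double checksum = 0.0;
        for (auto &f : futures)
            checksum += f.get(); // wait for every worker
        const auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                            std::chrono::steady_clock::now() - start).count();
        std::cout << n_threads << " threads: " << ms << " ms (checksum " << checksum << ")\n";
    }
}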
This program uses standard output (std::cout) to write integer pixel values to an image. I suspected this could be the bottleneck of the program, but the running time is the same without writing to standard output.

Due to the nature of raytracing, it should be a great example for showcasing the power of multithreading, because every pixel can be calculated independently of its neighboring pixels. The only thing shared among the threads is world, an object which describes the scene. world is passed as a constant reference (const hittable& world) and is read-only. My understanding of multithreading is that if a shared object (passed by reference) is only read and never written by the threads, then no mutex or lock is needed when accessing it. (However, I wonder how this works at a low level: how can two threads read the same memory address without stepping on each other's toes?)
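To illustrate the kind of read-only sharing I am relying on, here is a minimal sketch (using a plain std::vector instead of the raytracer's hittable type): two threads read the same const data concurrently without any lock, which is safe precisely because no thread ever writes to it.

#include <cstddef>
#include <future>
#include <iostream>
#include <vector>

int main() {
    // Shared, read-only data: built once before any thread starts, then only read.
    const std::vector<int> scene_data(1000000, 1);

    // Each worker reads its half of the shared vector; no mutex is needed
    // because there are no writers while the threads are running.
    auto partial_sum = [&scene_data](std::size_t begin, std::size_t end) {
        long long sum = 0;
        for (std::size_t i = begin; i < end; ++i)
            sum += scene_data[i]; // concurrent reads of the same memory are safe
        return sum;
    };

    const std::size_t half = scene_data.size() / 2;
    auto f1 = std::async(std::launch::async, partial_sum, std::size_t{0}, half);
    auto f2 = std::async(std::launch::async, partial_sum, half, scene_data.size());

    std::cout << "total: " << f1.get() + f2.get() << '\n'; // prints 1000000
}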
My questions
Based on the information I provided, what could be the bottleneck of this program?
Why is using 8 threads taking roughly the same time as only 2 threads?
The spec of my CPUs:
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 39 bits physical, 48 bits virtual
CPU(s): 16
On-line CPU(s) list: 0-15
Thread(s) per core: 2
Core(s) per socket: 8
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 141
Model name: 11th Gen Intel(R) Core(TM) i7-11850H @ 2.50GHz
Stepping: 1
CPU MHz: 3866.270
CPU max MHz: 4800,0000
CPU min MHz: 800,0000
BogoMIPS: 4992.00
Virtualization: VT-x
L1d cache: 384 KiB
L1i cache: 256 KiB
L2 cache: 10 MiB
L3 cache: 24 MiB
NUMA node0 CPU(s): 0-15