I am trying to convert a single-threaded raytracing program to a multithreaded program. My idea is the following:
- Create a new thread for processing the color of each pixel. Since each pixel has 1024 samples, the cost of creating a thread should be small compared to the calculation of a pixel.
- Create a producer thread which is responsible for pushing the future of each newly spawned thread into a BlockingQueue and waiting for the threads to finish. When a thread returns, future.get yields the returned value and the future is popped out of the BlockingQueue.
- The main program is responsible for getting the returned pixel value out of each future and popping the future out of the BlockingQueue.
The whole code block is the following:
std::cout << "P3\n" << image_width << ' ' << image_height << "\n255\n";
const int CONCURRENCY = 8; // `std::thread::hardware_concurrency();` returns 16 on my laptop.
std::clog << "\rCONCURRENCY: " << CONCURRENCY << std::endl;
BlockingQueue<std::shared_future<color>> futures(CONCURRENCY); // CONCURRENCY is the size of the thread pool.
for (int j = 0; j < image_height; ++j) {
std::clog << "\rScanlines remaining: " << (image_height - j) << ' ' << std::flush;
// The lambda function on which each spawned thread is going to run.
auto lambda_func = [this, j, &world](int i) {
color pixel_color(0,0,0);
for (int sample = 0; sample < this->samples_per_pixel; ++sample) {
const ray r = get_ray(i, j);
pixel_color += ray_color(r, this->max_depth, world);
}
return pixel_color;
};
std::thread producer_thread(producer_func, std::ref(futures), image_width, lambda_func);
for (int i = 0; i < image_width; ++i) {
color pixel_color(0,0,0);
// Multithread
std::shared_future<color> f = futures.front();
pixel_color = f.get();
futures.pop();
write_color(std::cout, pixel_color, samples_per_pixel);
}
producer_thread.join();
}
std::clog << "\rDone. \n";
Here is an explanation of each component:
BlockingQueue: It is a thread-safe queue that allows the producer_thread and the main thread to push and pop futures.
template <typename T>
class BlockingQueue {
private:
    std::mutex _mtx;
    std::condition_variable _cond;
    int _max_size;
    std::queue<T> _queue;

public:
    BlockingQueue(int max_size) : _max_size(max_size) {
    }

    void push(T t) {
        std::unique_lock<std::mutex> lock(_mtx);
        _cond.wait(lock, [this]() { return _queue.size() < _max_size; });
        _queue.push(t);
        lock.unlock();
        _cond.notify_one();
    }

    T front() {
        std::unique_lock<std::mutex> lock(_mtx);
        _cond.wait(lock, [this]() { return !_queue.empty(); });
        return _queue.front();
    }

    void pop() {
        std::unique_lock<std::mutex> lock(_mtx);
        _cond.wait(lock, [this]() { return !_queue.empty(); });
        _queue.pop();
        lock.unlock();
        _cond.notify_one();
    }

    int size() {
        std::lock_guard<std::mutex> lock(_mtx);
        return _queue.size();
    }
};
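For context, here is a minimal, self-contained sketch of how I intend this queue to be used, with a plain int payload instead of the raytracer's color type; it is an illustration, not part of the actual program.

#include <future>
#include <iostream>
#include <thread>
// Assumes the BlockingQueue<T> class above (and its <mutex>, <condition_variable>, <queue> includes).

int main() {
    BlockingQueue<std::shared_future<int>> futures(4); // at most ~4 in-flight tasks

    // Producer: spawns async tasks and pushes their futures; push() blocks when the queue is full.
    std::thread producer([&futures]() {
        for (int i = 0; i < 16; ++i) {
            std::shared_future<int> f = std::async(std::launch::async, [i]() { return i * i; });
            futures.push(f);
        }
    });

    // Consumer (main thread): takes futures in FIFO order, waits on each result, then pops.
    for (int i = 0; i < 16; ++i) {
        std::shared_future<int> f = futures.front();
        std::cout << f.get() << '\n';
        futures.pop();
    }

    producer.join();
}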
producer_func: It is the function that the producer_thread runs to push the futures of newly spawned threads into the BlockingQueue. This program processes the image row by row, so i is the parameter that selects which pixel of the row is going to be processed. There is no image_height information in this function because the height (row) information is already embedded inside lambda_func.
void producer_func(BlockingQueue<std::shared_future<color>> &futures, int image_width, std::function<color(int)> lambda_func) {
    for (int i = 0; i < image_width; i++) {
        // A new thread is spawned and its future is pushed into the BlockingQueue.
        std::shared_future<color> f = std::async(std::launch::async, lambda_func, i);
        // If the BlockingQueue is full, this thread will wait until an element is popped out.
        futures.push(f);
    }
}
lambda_func and producer_thread: lambda_func calculates and accumulates the color of all samples of a pixel and then returns it. The producer_thread is responsible for generating the column index i that lambda_func needs.
auto lambda_func = [this, j, &world](int i) {
    color pixel_color(0,0,0);
    for (int sample = 0; sample < this->samples_per_pixel; ++sample) {
        const ray r = get_ray(i, j);
        pixel_color += ray_color(r, this->max_depth, world);
    }
    return pixel_color;
};

std::thread producer_thread(producer_func, std::ref(futures), image_width, lambda_func);
Performance
I've run this program with CONCURRENCY=2 and CONCURRENCY=8. Both generate a correct image, but the timings don't make much sense:
- Using 2 or 8 threads is faster than the baseline (no multithreading).
- Using 2 threads takes roughly the same time as using 8 threads; sometimes 2 threads is even a few seconds faster. I would expect 8 threads to be significantly faster than 2. The improvement doesn't necessarily need to be linear, but in my opinion 8 threads should still beat 2 threads (see the scaling check sketched below).
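As a sanity check (not part of the raytracer), here is a minimal, self-contained sketch I can use to confirm that the machine itself scales with thread count for a purely CPU-bound workload; busy_work and the iteration counts are arbitrary placeholders. If this scales but the raytracer does not, the problem is in how the raytracer distributes work rather than in the hardware.

#include <chrono>
#include <cstdint>
#include <future>
#include <iostream>
#include <vector>

// Arbitrary CPU-bound work; only exists to keep a core busy for a measurable time.
static double busy_work(std::uint64_t iters) {
    double x = 0.0;
    for (std::uint64_t i = 0; i < iters; ++i)
        x += static_cast<double>(i % 7) * 1e-9;
    return x;
}

int main() {
    const std::uint64_t total_iters = 400000000; // tune so one thread takes a few seconds
    for (int n_threads : {1, 2, 8}) {
        const auto start = std::chrono::steady_clock::now();
        std::vector<std::future<double>> futures;
        for (int t = 0; t < n_threads; ++t)
            futures.push_back(std::async(std::launch::async, busy_work, total_iters / n_threads));
        double checksum = 0.0;
        for (auto &f : futures)
            checksum += f.get(); // wait for every worker
        const auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                            std::chrono::steady_clock::now() - start).count();
        std::cout << n_threads << " threads: " << ms << " ms (checksum " << checksum << ")\n";
    }
}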
This program uses standard output (std::cout) to write integer pixel values to an image. I suspected this could be the bottleneck of the program, but the running time is the same without writing to standard output.

Due to the nature of raytracing, it should be a great example for showcasing the power of multithreading, because every pixel can be calculated independently of its neighboring pixels. The only thing shared among the threads is world, an object which describes the scene. world is passed as a constant reference (const hittable& world) and is read-only. My understanding of multithreading is that if a shared object (passed by reference) is only read and never written by the threads, then no mutex or lock is needed when accessing it. (However, I wonder how this works at a low level: how can two threads read the same memory address without stepping on each other's toes?)
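To illustrate the kind of read-only sharing I am relying on, here is a minimal sketch (using a plain std::vector instead of the raytracer's hittable type): two threads read the same const data concurrently without any lock, which is safe precisely because no thread ever writes to it.

#include <cstddef>
#include <future>
#include <iostream>
#include <vector>

int main() {
    // Shared, read-only data: built once before any thread starts, then only read.
    const std::vector<int> scene_data(1000000, 1);

    // Each worker reads its half of the shared vector; no mutex is needed
    // because there are no writers while the threads are running.
    auto partial_sum = [&scene_data](std::size_t begin, std::size_t end) {
        long long sum = 0;
        for (std::size_t i = begin; i < end; ++i)
            sum += scene_data[i]; // concurrent reads of the same memory are safe
        return sum;
    };

    const std::size_t half = scene_data.size() / 2;
    auto f1 = std::async(std::launch::async, partial_sum, std::size_t{0}, half);
    auto f2 = std::async(std::launch::async, partial_sum, half, scene_data.size());

    std::cout << "total: " << f1.get() + f2.get() << '\n'; // prints 1000000
}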
My questions
Based on the information I provided, what could be the bottleneck of this program?
Why is using 8 threads taking roughly the same time as only 2 threads?
The spec of my CPUs:
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 39 bits physical, 48 bits virtual
CPU(s): 16
On-line CPU(s) list: 0-15
Thread(s) per core: 2
Core(s) per socket: 8
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 141
Model name: 11th Gen Intel(R) Core(TM) i7-11850H @ 2.50GHz
Stepping: 1
CPU MHz: 3866.270
CPU max MHz: 4800,0000
CPU min MHz: 800,0000
BogoMIPS: 4992.00
Virtualization: VT-x
L1d cache: 384 KiB
L1i cache: 256 KiB
L2 cache: 10 MiB
L3 cache: 24 MiB
NUMA node0 CPU(s): 0-15