In my application I am using Intel oneAPI TBB for parallelism. I isolate work on NUMA nodes using task_arenas. On each NUMA node I need to execute a couple of parallel tasks, for which I use task_groups. Some of these tasks need to synchronize with the tasks running on the other NUMA nodes.
I am trying to do this with std::barrier. When I test my implementation, the execution sometimes deadlocks and sometimes does not, and I don't understand why.
Here is an example that reproduces the problem:
#include <oneapi/tbb/info.h>
#include <oneapi/tbb/task_arena.h>
#include <oneapi/tbb/task_group.h>

#include <barrier>
#include <vector>

using namespace oneapi;

int main() {
    // One arena and one task_group per NUMA node.
    std::vector<tbb::numa_node_id> numa_indexes = tbb::info::numa_nodes();
    std::vector<tbb::task_arena> arenas(numa_indexes.size());
    std::vector<tbb::task_group> task_groups(numa_indexes.size());

    // The barrier expects one arrival per NUMA node.
    std::barrier barrier(numa_indexes.size());

    // Constrain each arena to its NUMA node.
    for (unsigned j = 0; j < numa_indexes.size(); j++) {
        arenas[j].initialize(tbb::task_arena::constraints(numa_indexes[j]));
    }

    for (int i = 0; i < 10000; ++i) {
        // Submit one task per node; each task blocks on the barrier
        // until the tasks of all nodes have arrived.
        for (unsigned j = 0; j < numa_indexes.size(); j++) {
            arenas[j].execute([&task_groups, &barrier, j]() {
                task_groups[j].run([&barrier]() { barrier.arrive_and_wait(); });
            });
        }

        // Wait inside each arena for that node's task_group to finish.
        for (unsigned j = 0; j < numa_indexes.size(); j++) {
            arenas[j].execute([&task_groups, j]() { task_groups[j].wait(); });
        }
    }
}
When I run the same code without the task_groups (calling barrier.arrive_and_wait() directly inside arenas[j].execute()), it works as expected. But without task_groups it is not possible to wait for the completion of specific tasks.
I also tried other synchronization approaches, which led to the same result.
Can someone explain to me why this happens?
I am using oneTBB 2021.9.0 and gcc 13.1.0.