Once the program has executed a parallel section, all threads except the master are destroyed. When the program enters another parallel section, the threads are spawned again. This routine causes overhead which may significantly limit the efficiency of multithreading.
Consider a nested loop with the inner one being parallelized:
for (int i = 0; i < outer_loops; i++) {
    // some code which does not affect the execution time noticeably
    #pragma omp parallel for
    for (int j = 0; j < inner_loops; j++) {
        // some code to run through with multiple threads
    }
}
Is there a way to keep the slave threads from being spawned anew each time the program enters the inner loop? For example: once the inner loop is done, the master thread takes care of the outer loop while the slave threads, instead of being destroyed and spawned again, just wait for the next pass over the inner loop.
In my case, there is no way I can include the outer loop in the parallel section, as it must run sequentially. I am trying to minimize the overhead caused by thread spawning.
This premise is wrong in practice. An OpenMP implementation can keep threads alive outside a parallel section and recycle them. In fact, GOMP (the OpenMP runtime of GCC) and IOMP (the OpenMP runtime of Clang and ICC) do this optimization, as long as some constraints are fulfilled, such as having the same number of threads in the different sections.
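For illustration, this is the kind of pattern the constraint refers to: two consecutive parallel regions requesting the same number of threads, which these runtimes typically serve from the same thread pool (a sketch; the reuse itself is runtime behaviour, not something the code controls):

    #include <omp.h>

    void two_sections(void)
    {
        #pragma omp parallel num_threads(4)
        {
            // first parallel section: the runtime creates a pool of 4 threads
        }
        // between the sections the pool is typically kept alive (idle or spinning)
        #pragma omp parallel num_threads(4)
        {
            // second parallel section with the same thread count: GOMP/IOMP
            // generally reuse the existing pool instead of destroying and
            // re-spawning the threads
        }
    }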
Most of the overhead does not come from the creation of the threads, but from the initialisation of some internal data structures, the wake-up of the threads (if needed, depending on your settings), their synchronisation, and the actual work-sharing operation. With an active waiting policy (see OMP_WAIT_POLICY), the initialisation of the internal data structures should be the most expensive part, though it is generally pretty fast. You can use one parallel section and tell OpenMP that a part of it is only executed by the master thread:
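A minimal sketch of this approach, assuming the sequential part of an iteration must complete before the parallel inner loop of that same iteration starts:

    #pragma omp parallel
    for (int i = 0; i < outer_loops; i++) {
        // only the master thread executes the sequential part
        #pragma omp master
        {
            // some code which does not affect the execution time noticeably
        }
        // first barrier: the workers must wait until the master is done
        // (the master construct has no implied barrier)
        #pragma omp barrier
        // the work-sharing loop is executed by the already-existing threads;
        // no new threads are spawned here
        #pragma omp for
        for (int j = 0; j < inner_loops; j++) {
            // some code to run through with multiple threads
        }
        // second barrier: implied at the end of the "omp for" construct, so
        // no thread starts iteration i+1 before the inner loop is finished
    }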
The two barriers slow down the execution, but they are mandatory in your case. If the sequential code of the outer loop does not depend on the result of the inner parallel loop, then you can use OpenMP tasks to speed up the computation (this reduces the need for expensive synchronizations and provides more parallelism).
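A rough sketch of the task-based variant, assuming OpenMP 4.5 (for taskloop) and that the sequential code never reads results of the inner loops:

    #pragma omp parallel
    #pragma omp single
    for (int i = 0; i < outer_loops; i++) {
        // sequential code of the outer loop runs on the generating thread
        // (it must not depend on the results of the inner parallel loops)

        // "nogroup" removes the implied taskgroup, so the generating thread
        // moves on to iteration i+1 while the other threads pick up and
        // execute the generated tasks
        #pragma omp taskloop nogroup
        for (int j = 0; j < inner_loops; j++) {
            // some code to run through with multiple threads
        }
    }
    // the implicit barrier at the end of the parallel region guarantees
    // that all remaining tasks are completed before the region is left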
Note that this code is fundamentally limited by the speed of the sequential code. Amdahl's law states that this strongly limits the scalability of the approach. Synchronization and work-sharing overheads tend to slow down the serial part further, reducing scalability. In the worst case, the parallel code can even be slower than the sequential one. The only ways to solve this are to reduce the time taken by the serial part or to work on more data (see Gustafson's law).
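For reference, if a fraction p of the run time is parallelisable and n threads are used, Amdahl's law bounds the speedup by S(n) = 1 / ((1 - p) + p / n), so even with infinitely many threads the speedup cannot exceed 1 / (1 - p).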