Is OpenMP (parallel for) in g++ 4.7 not very efficient? 2.5x at 5x CPU


I've tried using OpenMP with a single #pragma omp parallel for, and it resulted in my programme going from a runtime of 35s (99.6% CPU) to 14s (500% CPU), running on Intel(R) Xeon(R) CPU E3-1240 v3 @ 3.40GHz. That's the difference between compiling with g++ -O3 and g++ -O3 -fopenmp, both with gcc (Debian 4.7.2-5) 4.7.2 on Debian 7 (wheezy).

  • Why is it only using 500% CPU at most, when the theoretical maximum would be 800%, since the CPU is 4 core / 8 threads? Shouldn't it be reaching at least low 700s?

  • Why am I only getting a 2.5x improvement in overall time, yet at a cost of 5x in CPU use? Cache thrashing?

The whole programme is based on C++ string manipulation, with recursive processing (using a lot of .substr(1) and some concatenation), where the resulting strings are continuously inserted into a vector of sets.

In other words, there are about 2k iterations in a single parallel for loop over the vector. Each iteration may make two recursive calls to itself, with some string .substr(1) and + char concatenation, and the recursion terminates with a set .insert of either a single string or a concatenation of two strings; that .insert also absorbs the significant number of duplicates that are possible.
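For reference, here is a minimal sketch of the loop shape described above; the names (expand, inputs, results) and the exact recursion are hypothetical stand-ins, not the actual code:

    #include <set>
    #include <string>
    #include <vector>

    // Hypothetical recursion: two calls per level, one dropping the first
    // character, one keeping it via a + char concatenation; it terminates
    // with a set .insert that also absorbs duplicates.
    void expand(const std::string& s, const std::string& acc,
                std::set<std::string>& out) {
        if (s.size() <= 1) {
            out.insert(acc + s);
            return;
        }
        expand(s.substr(1), acc, out);
        expand(s.substr(1), acc + s[0], out);
    }

    int main() {
        std::vector<std::string> inputs(2000, "examples");
        std::vector<std::set<std::string> > results(inputs.size());

        // The single parallel for: each iteration fills its own set.
        #pragma omp parallel for
        for (long i = 0; i < (long)inputs.size(); ++i)
            expand(inputs[i], "", results[i]);
        return 0;
    }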

Everything runs correctly and well within the spec, but I'm trying to see if it can run faster. :-)


There are 2 best solutions below


Achieving linear speedup is usually not trivial, and it gets harder the more cores you have. There are several aspects you may need to look at:

  1. False sharing: if the array is not sliced properly, a cache line may be contended between two cores; typically the two halves of the line are written by two different threads, causing it to bounce from the L2 cache of one core to the other. The same contention can also happen if you use a lot of shared variables; in that case, consider using private or firstprivate to avoid the synchronization (see the sketch after this list).
  2. OpenMP scheduling: if not specified, OpenMP splits the loop iterations evenly among the threads and assigns a contiguous subrange to each of them. However, if the amount of work per index varies, you can end up in a situation where most threads finish their work and sit blocked at the barrier at the end of the parallel region, waiting for the slowest thread (the one that got the most work). This depends on the algorithm inside your loop, but OpenMP lets you change the policy with the schedule() clause. Consider at least trying schedule(dynamic), as shown in the sketch below.
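As a hedged illustration of both points (names like process, run and depth_limit are placeholders, not your actual code), the scheduling and data-sharing clauses go directly on the existing pragma:

    #include <set>
    #include <string>
    #include <vector>

    // Trivial stand-in for the recursive per-item work.
    void process(const std::string& item, std::set<std::string>& out,
                 int /*depth_limit*/) {
        out.insert(item);
    }

    void run(const std::vector<std::string>& inputs,
             std::vector<std::set<std::string> >& results) {
        int depth_limit = 10;  // example of a read-only shared scalar
        // schedule(dynamic) hands out iterations in small chunks, so threads
        // that finish early pick up more work instead of idling at the barrier;
        // firstprivate gives each thread its own initialized copy of depth_limit.
        #pragma omp parallel for schedule(dynamic) firstprivate(depth_limit)
        for (long i = 0; i < (long)inputs.size(); ++i)
            process(inputs[i], results[i], depth_limit);
    }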

Based on your description you can draw the following conclusions:

I am assuming OpenMP really does use 8 threads (verify by setting export OMP_NUM_THREADS=8 explicitly).

  1. 500% CPU means that there is significant time spent in barriers. This is due to bad load balance: different iterations take varying amounts of time. Hence the default (static) loop scheduling is inefficient; try other kinds of loop scheduling, e.g. schedule(dynamic).

  2. If the runtime does not decrease proportionally to the number of threads (i.e. the overall CPU time spent increases), there is either a shared resource acting as a bottleneck or interference between the threads. Note that this can also be the result of short barriers (from load imbalance) that busy-wait instead of blocking.

    • The most common shared resource is memory bandwidth. Whether that affects you depends on whether your working set fits into the local caches. Considering the many memory hierarchies and NUMA properties of modern systems, this can become very complex to understand. There is nothing fundamental you can do about it short of restructuring your data accesses to use the caches more efficiently (blocking).

    • False sharing: if multiple threads write to and read from the same memory locations (cache lines), the respective memory accesses become much slower. Try to restrict writes within the parallel loop to private variables (see the sketch at the end of this answer).

    • HyperThreading - a core is a shared resource between two hardware threads.

      using 5x resources

      This is a fundamental misunderstanding. The extra hardware thread adds very little additional resource to each core. If two threads run on the same core, they share its computational resources and memory bandwidth; basically the only advantage is hiding I/O latency. You will never see a 2x speedup, despite spending 2x the CPU time. Sometimes it will give you a 10% speedup, sometimes it will be slower.

      Now consider 5 active threads running on 4 cores. The 2 threads sharing a core will each run at only ~50% speed, slowing everything down. Try reducing the number of threads to the number of cores (export OMP_NUM_THREADS=4); a sketch combining this with thread-private sets follows below.
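A minimal sketch combining these suggestions: thread-private sets merged once per thread, dynamic scheduling, and the thread count pinned to the 4 physical cores. The names are hypothetical, and it assumes a single merged result set purely for illustration; adapt it to your actual vector-of-sets layout.

    #include <omp.h>
    #include <set>
    #include <string>
    #include <vector>

    // Trivial stand-in for the recursive per-item work.
    void process(const std::string& item, std::set<std::string>& out) {
        out.insert(item);
    }

    std::set<std::string> run(const std::vector<std::string>& inputs) {
        std::set<std::string> merged;

        // Same effect as `export OMP_NUM_THREADS=4`: one thread per physical core.
        omp_set_num_threads(4);

        #pragma omp parallel
        {
            // Private to this thread, so the hot loop never writes shared data.
            std::set<std::string> local;

            #pragma omp for schedule(dynamic) nowait
            for (long i = 0; i < (long)inputs.size(); ++i)
                process(inputs[i], local);

            // Merge once per thread instead of synchronizing on every insert.
            #pragma omp critical
            merged.insert(local.begin(), local.end());
        }
        return merged;
    }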