The Threading Building Blocks (TBB) library provides two functions for performing a reduction over a range: `parallel_reduce` and `parallel_deterministic_reduce`.
Which of the two should I choose if I want to perform the reduction as fast as possible, but still get exactly the same answer regardless of hardware concurrency and the load from other processes or threads? I am basically interested in two scenarios, sketched below:
- Computing the sum of the elements of a vector of integers.
- Computing the sum of the elements of a vector of floating-point values.
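For concreteness, here is a rough sketch of the kind of call I mean (the templated helper is my own illustration; only `parallel_reduce` and `blocked_range` are actual TBB API):

```cpp
#include <cstddef>
#include <functional>
#include <vector>
#include "tbb/parallel_reduce.h"
#include "tbb/blocked_range.h"

// Rough sketch of the two scenarios; T would be int in the first case
// and float or double in the second.
template <typename T>
T sum(const std::vector<T>& v) {
    return tbb::parallel_reduce(
        tbb::blocked_range<std::size_t>(0, v.size()),
        T(0),
        [&](const tbb::blocked_range<std::size_t>& r, T partial) {
            for (std::size_t i = r.begin(); i != r.end(); ++i)
                partial += v[i];
            return partial;
        },
        std::plus<T>());
}
```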
And a side question: on the documentation page for `parallel_deterministic_reduce` there is a warning:

> Since `simple_partitioner` does not automatically coarsen ranges, make sure to specify an appropriate grain size.
Does this mean that calling `parallel_deterministic_reduce` with a range that has no explicitly specified grain size will lead to poor performance? How should the grain size be set then?
`parallel_reduce` does not make any guarantees regarding the summation order. If used with floating-point numbers, the result is not deterministic, since summation of floating-point numbers is not associative. In contrast, `parallel_deterministic_reduce` guarantees that the summation order is always the same, regardless of the number of threads used. Note, however, that there is still no guarantee of a specific summation order, just that the order is deterministic (for example, the result can differ compared to `std::accumulate`). For the integer case, addition is associative (ignoring overflow), so the summation order does not affect the result anyway. Thus:

- `parallel_reduce` for best performance.
- `parallel_deterministic_reduce` if you need deterministic behavior.
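For the floating-point case, a minimal sketch of what the drop-in swap looks like (the helper function and vector are illustrative, not from the question):

```cpp
#include <cstddef>
#include <functional>
#include <vector>
#include "tbb/parallel_reduce.h"
#include "tbb/blocked_range.h"

// parallel_deterministic_reduce takes the same arguments as the lambda form
// of parallel_reduce, so switching is a drop-in change. The range is split
// deterministically, so the floating-point result is reproducible across
// runs and thread counts.
float deterministic_sum(const std::vector<float>& v) {
    return tbb::parallel_deterministic_reduce(
        tbb::blocked_range<std::size_t>(0, v.size()),
        0.0f,
        [&](const tbb::blocked_range<std::size_t>& r, float partial) {
            for (std::size_t i = r.begin(); i != r.end(); ++i)
                partial += v[i];
            return partial;
        },
        std::plus<float>());
}
```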
Regarding the note that `simple_partitioner` does not automatically coarsen ranges: I am not entirely sure why they mention this specifically in the documentation of `parallel_deterministic_reduce`. In all cases where you use `simple_partitioner`, you should think about appropriate grain sizes, not just in the case of `parallel_deterministic_reduce`. A too-small grain size can lead to large overhead. See for example here. In practice this especially means measuring the performance for typical workloads and playing with the grain size or partitioner until performance is maximized. I guess they just wanted to highlight the general issue again for `parallel_deterministic_reduce`, because `parallel_deterministic_reduce` supports only `simple_partitioner` and `static_partitioner`, but neither `affinity_partitioner` nor `auto_partitioner` (the latter usually being the default).
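If you do want to pin the grain size down explicitly, it is the third constructor argument of `blocked_range`. Here is a sketch (the extra parameter and the example value of 10000 are purely illustrative and would need to be tuned by measurement):

```cpp
#include <cstddef>
#include <functional>
#include <vector>
#include "tbb/parallel_reduce.h"
#include "tbb/blocked_range.h"

// The third blocked_range argument is the grain size: simple_partitioner
// keeps splitting a subrange only while it holds more than this many elements.
float deterministic_sum(const std::vector<float>& v, std::size_t grain) {
    return tbb::parallel_deterministic_reduce(
        tbb::blocked_range<std::size_t>(0, v.size(), grain),
        0.0f,
        [&](const tbb::blocked_range<std::size_t>& r, float partial) {
            for (std::size_t i = r.begin(); i != r.end(); ++i)
                partial += v[i];
            return partial;
        },
        std::plus<float>());
}

// Usage, e.g.: deterministic_sum(data, 10000);
```

Note that changing the grain size changes how the range is split, which can change the floating-point result; for a fixed grain size and range, the result remains reproducible across thread counts.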