activemask() vs ballot_sync()

Question

activemask() vs ballot_sync()

4.5k Views Asked by Fabio T. At 05 January 2019 at 18:54

After read this post on CUDA Developer Blog I am struggling to understand when is safe\correct use __activemask() in place of __ballot_sync().

In section Active Mask Query, the authors wrote:

This is incorrect, as it would result in partial sums instead of a total sum.

and after, in section Opportunistic Warp-level Programming they are using the function __activemask() because:

This may be difficult if you want to use warp-level programming inside a library function but you cannot change the function interface.

Original Q&A

There are 1 best solutions below

**Robert Crovella** · Accepted Answer · 2019-01-05T19:39:37.260000

There is no __active_mask() in CUDA. That is a typo (in the blog article). It should be __activemask().

__activemask() is only a query. It asks the question "which threads in the warp are currently executing this instruction, in this cycle?" which is equivalent to asking "which threads in the warp are currently converged at this point?"

It has no effect on convergence. It will not cause threads to converge. It has no warp synchronizing behavior.

__ballot_sync() on the other hand has converging behavior (according to the supplied mask).

The primary differentiation here should be considered in light of the Volta warp execution model. Volta and beyond, because of hardware changes in the warp execution engine, can support threads in a warp being diverged in a larger number of scenarios, and for a longer time, than can previous architectures.

The divergence we are referring to here is incidental divergence due to previous conditional execution. Enforced divergence due to explicit coding is identical before or after Volta.

Let's consider an example:

if (threadIdx.x < 1){
   statement_A();}
statement_B();

Assuming the threadblock X dimension is greater than 1, statement_A() is in an area of enforced divergence. The warp will be in a diverged state when statement_A() is executed.

What about statement_B() ? The CUDA execution model makes no particular statements about whether the warp will be in a diverged state or not when statement_B() is executed (see note 1 below). In a pre-Volta execution environment, programmers would typically expect that there is some sort of warp reconvergence at the closing curly-brace of the previous if statement (although CUDA makes no guarantees of that). Therefore the general expectation is that statement_B() would be executed in a non-diverged state.

However in the Volta execution model, not only are there no guarantees provided by CUDA, but in practice we may observe the warp to be in a diverged state at statement_B(). Divergence at statement_B() isn't required for code correctness (whereas it is required at statement_A()), nor is convergence at statement_B() required by the CUDA execution model. If there is divergence at statement_B() as may occur in the Volta execution model, I'm referring to this as incidental divergence. It is divergence that arises not out of some requirement of the code, but typically as a result of some kind of previous conditional execution behavior.

If we have no divergence at statement_B(), then these two expressions (if they were at statement_B()) should return the same result:

int mask = __activemask();

and

int mask = __ballot_sync(0xFFFFFFFF, 1);

So in the pre-volta case, when we would typically not expect divergence at statement_B() in practice these two expressions return the same value.

In the Volta execution model, we can have incidental divergence at statement_B(). Therefore these two expressions might not return the same result. Why?

The __ballot_sync() instruction, like all other CUDA 9+ warp level intrinsics which have a mask parameter, have a synchronizing effect. If we have code-enforced divergence, if the synchronizing "request" indicated by the mask argument cannot be fulfilled (as would be the case above where we are requesting full convergence), that would represent illegal code.

However if we have incidental divergence (only, for this example), the __ballot_sync() semantics are to first reconverge the warp at least to the extent that the mask argument is requesting, then perform the requested ballot operation.

The __activemask() operation has no such reconvergence behavior. It simply reports the threads that are currently converged. If some threads are diverged, for whatever reason, they will not be reported in the return value.

If you then created code that performed some warp level operation (such as a warp-level sum-reduction as suggested in the blog article) and selected the threads to participate based on __activemask() vs. __ballot_sync(0xFFFFFFFF, 1), you could conceivably get a different result, in the presence of incidental divergence. The __activemask() realization, in the presence of incidental divergence, would compute a result that did not include all threads (i.e. it would compute a "partial" sum). On the other hand, the __ballot_sync(0xFFFFFFFF, 1) realization, because it would first eliminate the incidental divergence, would force all threads to participate (computing a "total" sum).

A similar example and description as what I have given here is given around listing 10 in the blog article.

An example of where it could be correct to use __activemask is given in the blog article on "opportunistic warp-level programming", here:

int mask = __match_all_sync(__activemask(), ptr, &pred);

this statement is saying "tell me which threads are converged" (i.e. the __activemask() request), and then "use (at least) those threads to perform the __match_all operation. This is perfectly legal and will use whatever threads happen to be converged at that point. As that listing 9 example continues, the mask computed in the above step is used in the only other warp-cooperative primitive:

res = __shfl_sync(mask, res, leader);

(which happens to be right after a piece of conditional code). This determines which threads are available, and then forces the use of those threads, regardless of what incidental divergence may have existed, to produce a predictable result.

As an additional clarification on the usage of the mask parameter, note the usage statements in the PTX guide. In particular, the mask parameter is not intended to be an exclusion method. If you wish threads to be excluded from the shuffle operation, you must use conditional code to do that. This is important in light of the following statement from the PTX guide:

The behavior of shfl.sync is undefined if the executing thread is not in the membermask.

Furthermore, although not directly related to the above discussion, there is an "exception" to the forced divergence idea for __shfl_sync(). The programming guide states that this is acceptable on volta and beyond:

if (tid % warpSize < 16) {
    ...
    float swapped = __shfl_xor_sync(0xffffffff, val, 16);
    ...
} else {
    ...
    float swapped = __shfl_xor_sync(0xffffffff, val, 16);
    ...
}

The reason for this is hinted at there, and we can get a further explanation of behavior in this case from the PTX guide:

shfl.sync will cause executing thread to wait until all non-exited threads corresponding to membermask have executed shfl.sync with the same qualifiers and same membermask value before resuming execution.

This means that the __shfl_sync() in the if-path and the __shfl_sync() in the else-path are effectively working together in this case to produce a defined result for all threads in the warp. A few caveats:

This statement applies to cc7.0 and higer

Other constructs are not necessarily going to work. For example this:

  if (tid % warpSize < 16) {
      ...
      float swapped = __shfl_xor_sync(0xffffffff, val, 16);
      ...
  } else {
  }

will not provide interesting results for any thread in the warp.

This question/answer may also be of interest.

(note 1) This is perhaps a non-intuitive concept. The CUDA programming model allows for incidental divergence to occur or appear at any point in device code, for no reason at all. The blog article makes reference to this concept in several places, I will excerpt one here:

It assumes that threads in the same warp that are once synchronized will stay synchronized until the next thread-divergent branch. Although it is often true, it is not guaranteed in the CUDA programming model.

activemask() vs ballot_sync()

There are 1 best solutions below

Related Questions in CUDA

Related Questions in GPU-WARP

Trending Questions

Popular # Hahtags

Popular Questions

__activemask() vs __ballot_sync()

There are 1 best solutions below

Related Questions in CUDA

Related Questions in GPU-WARP

Trending Questions

Popular # Hahtags

Popular Questions

activemask() vs ballot_sync()