I have several (e.g. 3, 5, 7, or 9) equally sized huge blocks of data (e.g. 100KB-100MB), and want to do bitwise majority voting to get a single block with the most frequent value for each bit position. To speed this up, I would like to use SSE/SSE2/AVX/NEON/... CPU extensions.
I just tried it manually, bit by bit, and the result was very slow.
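Something along these lines (a simplified sketch, not the exact code):

```cpp
#include <cstddef>
#include <cstdint>

// in: array of pointers to the input blocks, out: the result block.
// Tests every bit of every input separately -- very slow.
void majority_vote(const uint8_t *const *in, size_t ninputs,
                   uint8_t *out, size_t nbytes)
{
    for (size_t byte = 0; byte < nbytes; ++byte) {
        uint8_t result = 0;
        for (int bit = 0; bit < 8; ++bit) {
            size_t ones = 0;
            for (size_t i = 0; i < ninputs; ++i)
                ones += (in[i][byte] >> bit) & 1;   // one bit at a time
            if (ones >= ninputs / 2 + 1)            // strict majority
                result |= (uint8_t)(1u << bit);
        }
        out[byte] = result;
    }
}
```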
As discussed in other answers, specific functions for Maj3, Maj5, Maj7, or Maj9 do auto-vectorize nicely (especially with AVX-512 `vpternlogd`, although compilers don't use it optimally for Maj3, as a single operation). For clang AVX2 on Skylake, Maj3 with pure vertical bitwise operations is about 19x faster than the counting approach described here, if cache bandwidth isn't a bottleneck. Or 11x for Maj7 vs. Aki's version.

If you need to support arbitrary numbers of voters, positional popcount is one strategy: count into one byte per bit position across `n` inputs, so the output bit is `1` if `count >= n/2 + 1`.
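To make the strategy concrete, here is a minimal scalar sketch (illustrative names; assuming at most 255 voters so one byte per counter is enough):

```cpp
#include <cstddef>
#include <cstdint>

// Positional popcount over 32 bit positions: one uint8_t counter per bit.
uint32_t majority32(const uint32_t *votes, size_t n)   // n <= 255 voters
{
    uint8_t count[32] = {};                 // positional popcounts
    for (size_t i = 0; i < n; ++i)
        for (int b = 0; b < 32; ++b)
            count[b] += (votes[i] >> b) & 1;

    uint32_t out = 0;
    for (int b = 0; b < 32; ++b)
        if (count[b] >= n / 2 + 1)          // strict majority at this position
            out |= (uint32_t)1 << b;
    return out;
}
```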
With AVX2, for example, we end with a SIMD byte compare and movemask to get 32 output bits.

One option for using vector instructions is a portable API like `std::experimental::simd`, or GCC vector extensions (built-in functions), to process, for example, 32 votes at once with 256-bit vectors. One thing these lack is a movemask operation; compilers aren't smart enough to optimize the portable extract-and-OR strategy (as in the sketch after the next paragraph) into x86 `movemask` or AArch64 shift-right-and-insert chains (which can be more efficient when reducing multiple vectors to a 64-bit mask).
On x86 with AVX2, it takes roughly 4 vector instructions to load each row of 32 votes (4 bytes) and sum them up (broadcast load, `(v & mask) == mask`, and subtract), and another 10 to compute and store the majority votes from the counts.
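A sketch of this approach with GCC vector extensions (hypothetical code under the assumptions above: compile-time-constant `R`, at most 255 voters, output length a multiple of 4 bytes; the broadcast and the final extract-and-OR are written as plain lane loops for clarity, which the compiler turns into a broadcast load plus shuffle and the portable bit extraction discussed above):

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

typedef uint8_t u8x32 __attribute__((vector_size(32)));

// bitmask[j] selects bit (j % 8) within the byte replicated into lane j.
static const u8x32 bitmask = {
    1, 2, 4, 8, 16, 32, 64, 128,  1, 2, 4, 8, 16, 32, 64, 128,
    1, 2, 4, 8, 16, 32, 64, 128,  1, 2, 4, 8, 16, 32, 64, 128,
};

// R = number of voters, a compile-time constant so the row loop unrolls.
template <size_t R>
void majority(const uint8_t *const (&in)[R], uint8_t *out, size_t nbytes)
{
    for (size_t pos = 0; pos < nbytes; pos += 4) {  // 32 output bits per step
        u8x32 counts = {};                          // one counter per bit
        for (size_t r = 0; r < R; ++r) {
            u8x32 v;
            for (int j = 0; j < 32; ++j)            // "broadcast load": lane j
                v[j] = in[r][pos + j / 8];          //  gets the byte holding bit j
            u8x32 hit = (u8x32)((v & bitmask) == bitmask); // 0xFF where bit set
            counts -= hit;                          // subtracting -1 counts it
        }
        u8x32 thresh = {};
        thresh += (uint8_t)(R / 2 + 1);             // splat majority threshold
        u8x32 maj = (u8x32)(counts >= thresh);      // 0xFF on majority lanes
        uint32_t bits = 0;                          // portable "movemask":
        for (int j = 0; j < 32; ++j)                //  extract one bit per lane
            bits |= (uint32_t)(maj[j] & 1) << j;    //  and OR it into place
        memcpy(out + pos, &bits, 4);
    }
}
```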
Generated assembly is decent: the loops are unrolled (thanks to the compile-time-constant `R` (rows)), with vectorized stores and broadcast loads, no loads or stores beyond the required minimum, and no spilling to the stack, with `gcc-13.2 -O3` and `clang-17.0.1 -O2`, both with `-march=x86-64-v3`.

But this is still much slower than pure vertical bitwise operations for small fixed vote counts. The worst case is Maj3: this approach bottlenecks on 18 vector ALU instructions in the loop per 4 result bytes with AVX2 clang, vs. Maj3 as an auto-vectorized `(a & b) | (a & c) | (b & c)` using `unsigned long` (which optimizes to 2 ANDs and 2 ORs per result vector), 10 total instructions per 32 bytes of output (Peter's answer).

With no cache misses, https://uica.uops.info/ predicts this runs Maj3 at best 6 cycles per 4 bytes on Skylake or Ice Lake, vs. 2.5 or 2.0 cycles per 32 bytes (with some unrolling) for the way clang compiles the bitwise version. That's the 19x speedup/slowdown. Maj7 is less extreme, but still about 11x vs. Aki's version.
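For comparison, the vertical bitwise Maj3 from Peter's answer looks roughly like this (a sketch; compilers re-associate the expression and auto-vectorize the loop):

```cpp
#include <cstddef>

// Vertical bitwise Maj3, one machine word at a time.
void maj3(const unsigned long *a, const unsigned long *b,
          const unsigned long *c, unsigned long *out, size_t nwords)
{
    for (size_t i = 0; i < nwords; ++i)
        // compilers rewrite this as (a & (b | c)) | (b & c):
        // 2 ANDs + 2 ORs per vector of results
        out[i] = (a[i] & b[i]) | (a[i] & c[i]) | (b[i] & c[i]);
}
```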
Another option is to use Intel AVX2 intrinsics, trading portability for speed, especially in the cleanup. This version can be more efficient than the above: roughly the same 4 AVX instructions per row of votes, but only 5 to compute and store the result, taking advantage of the AVX2 `_mm256_movemask_epi8` intrinsic, which has no equivalent among the GCC vector-extension built-ins.
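A sketch of just that cleanup step (assuming `counts` holds one byte-sized counter per bit position as above, and fewer than 128 voters so a signed byte compare is safe):

```cpp
#include <immintrin.h>
#include <cstdint>

// Pack 32 majority decisions into 32 bits with one compare + one movemask.
static inline uint32_t majority_bits(__m256i counts, int nvoters)
{
    // counts > nvoters/2  is the same as  counts >= nvoters/2 + 1
    __m256i thresh = _mm256_set1_epi8((char)(nvoters / 2));
    __m256i maj = _mm256_cmpgt_epi8(counts, thresh);  // 0xFF per majority lane
    return (uint32_t)_mm256_movemask_epi8(maj);       // lane MSBs -> 32 bits
}
```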