All subsets of the 4-bit number 1101 are 0000, 0001, 0100, 0101, 1000, 1001, 1100, 1101. The subsets of this mask which are divisible by 2 are 0000, 0100, 1000, 1100.
Given a 64-bit mask M and a 64-bit integer P, how do I iterate over all subsets of M which are divisible by P?
To iterate over the subsets of a bit mask, I can do

```c
uint64_t superset = ...;
uint64_t subset = 0;
do {
    print(subset);
    subset = (subset - superset) & superset;  // next subset, in ascending order
} while (subset != 0);
```
If M is `~0` I can just start with 0 and keep adding P to iterate over all multiples of P. If P is a power of two I can just do `M &= ~(P - 1)` to chop off bits which are never going to be set.
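As a concrete sketch of the power-of-two shortcut (the function name and output convention here are my own): a subset of M is divisible by a power of two P exactly when it uses no bit below P, so the bits below P can be cleared from the mask before running the usual subset enumeration.

```cpp
#include <cstdint>
#include <cstddef>

// Special case: P is a power of two. Trim the mask, then enumerate its
// subsets; every subset of the trimmed mask is automatically a multiple
// of P. Writes the subsets into out and returns how many there were.
size_t subsets_divisible_by_pow2(uint64_t M, uint64_t P, uint64_t* out) {
    M &= ~(P - 1);                  // bits below P can never be set
    size_t n = 0;
    uint64_t subset = 0;
    do {
        out[n++] = subset;
        subset = (subset - M) & M;  // next subset of M, ascending
    } while (subset != 0);
    return n;
}
```

With M = 1101 and P = 2 this yields 0000, 0100, 1000, 1100, matching the example at the top.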
But if I have none of the constraints above, do I have a better shot than naively checking each and every subset for divisibility by P? On average, the naive algorithm takes O(P) operations to find the next subset divisible by P. Can I do better than O(P)?
A Parallel Algorithm
There are inputs for which it is vastly more efficient to check the multiples of the factor than the subsets of the mask, and inputs where it's the other way around. For example, when M is `0xFFFFFFFFFFFFFFFF` and P is `0x4000000000000000`, checking the three nonzero multiples of P is nigh-instantaneous, but even if you could crunch and check a billion subsets of M each second, enumerating them all would take over five hundred years. The optimization of finding only subsets greater than or equal to P would only cut that by a quarter.

However, there is a strong reason to enumerate and check the multiples of P instead of the subsets of M: parallelism. I want to emphasize this, because of incorrect comments on this code elsewhere: the algorithm in the OP is inherently sequential, because each value of `subset` uses the previous value of `subset`. It cannot run until all the lower subsets have already been calculated.

It cannot be vectorized to use AVX registers or similar. You cannot load four values into an AVX2 register and run SIMD instructions on them, because you would need to calculate the first value to initialize the second element, the second to initialize the third, and all three to initialize the final one, and then you are back to computing only one value at a time. Nor, which is not the same thing, can it be split between worker threads on different CPU cores. (The accepted answer can be modified to do the latter, but not the former, without a total refactoring.) You cannot divide the workload into subsets 0 to 63, subsets 64 to 127, and so on, and have different threads work on each in parallel, because you cannot start on the sixty-fourth subset until you know what the sixty-third subset is, for which you need the sixty-second, and so on.

If you take nothing else away from this, I highly recommend that you try this code out on Godbolt with full optimizations enabled, and see for yourself that it compiles to sequential code. If you're familiar with OpenMP, try adding `#pragma omp simd` and `#pragma omp parallel` directives and see what happens. The problem isn't with the compiler; it's that the algorithm is inherently sequential. But seeing what real compilers do should at least convince you that compilers in the year 2023 are not able to vectorize code like this.

For reference, here is what Clang 16 does with `find`:

Enumerate and Check the Multiples Instead of the Subsets
In addition to having more parallelism, this has several advantages in speed:

- Computing `(i+4)*p` given `i*p`, to use this on a vector of four elements, can be strength-reduced to a single addition.
- It avoids the `%` operation, which most CPUs do not have as a native instruction, and which is always the slowest ALU operation even when it is there.

So, a version of this code that uses both multi-threading and SIMD for speed-up:
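The listing has not survived here, so the following is a sketch of what `check_multiples` could look like, a reconstruction of the idea rather than the answer's exact code:

```cpp
#include <cstdint>
#include <vector>

// Walk the multiples of p by strength-reduced addition (candidate += p)
// and keep those that are subsets of m. Each candidate test is
// independent of the others, so a real version can split the range of i
// across threads and test several candidates per SIMD lane; the shared
// std::vector here would then need per-thread buffers. Requires p != 0.
std::vector<uint64_t> check_multiples(uint64_t m, uint64_t p) {
    std::vector<uint64_t> results;
    const uint64_t n_max = ~UINT64_C(0) / p;  // largest n with n*p < 2^64
    uint64_t candidate = 0;
    for (uint64_t i = 0; ; ++i, candidate += p) {
        if ((candidate & ~m) == 0)            // sets no bit outside m
            results.push_back(candidate);
        if (i == n_max)
            break;
    }
    return results;
}
```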
The inner loop of `check_multiples` compiles, on ICX 2022, to:

I encourage you to try your variations on the algorithm in this compiler, under the same settings, and see what happens. If you think it should be possible to generate vectorized code over the subsets that is as good as that, you should try it and see.
A Possible Improvement
The number of candidates to check could get extremely large, but one way to limit it is to also compute the modular multiplicative inverse of the odd part of P, and use that if it is better.
Every value of P decomposes into 2ⁱ · Q, where Q is odd. Since Q and 2⁶⁴ are coprime, Q has a modular multiplicative inverse Q′ such that QQ′ ≡ 1 (mod 2⁶⁴). You can find this with the extended Euclidean algorithm (but not the method I proposed here initially).
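As a sketch of the decomposition and the inverse, using Newton's (Hensel-lifting) iteration rather than the extended Euclidean algorithm mentioned above; the names here are my own:

```cpp
#include <cstdint>

struct Decomposition { int i; uint64_t Q, Q_inv; };

// Split P into 2^i * Q with Q odd, then compute Q' = Q^-1 mod 2^64.
// For odd Q, x = Q is already correct modulo 8, and each step of
// x <- x * (2 - Q*x) doubles the number of correct low bits, so five
// steps cover 64 bits. Requires P != 0.
Decomposition decompose(uint64_t P) {
    int i = 0;
    while ((P & 1) == 0) { P >>= 1; ++i; }  // strip factors of two
    uint64_t x = P;                          // correct to 3 bits
    for (int step = 0; step < 5; ++step)
        x *= 2 - P * x;                      // Newton step, mod 2^64
    return {i, P, x};
}
```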
This is useful for optimizing the algorithm because, for many values of P, Q′ < P. If m is a solution, m = nP for some integer n. Multiplying both sides by Q′ gives Q′m = Q′Pn = 2ⁱ·n (mod 2⁶⁴), since Q′P ≡ 2ⁱ. This means we can enumerate (with a bit of extra logic to make sure they have enough trailing zero bits) the multiples of Q′ or of P. Note that, since Q′ is odd, it is not necessary to check all multiples of Q′; if 2ⁱ is 4, for example, you need only check the multiples of 4·Q′.
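To make the algebra concrete, here is a small self-check of that identity (my own illustration, not part of the original answer) for P = 12 = 2² · 3, so i = 2, Q = 3, and Q′ = `0xAAAAAAAAAAAAAAAB`, the inverse of 3 mod 2⁶⁴:

```cpp
#include <cstdint>

// For m = n*P, Q'*m should equal 2^i * n modulo 2^64; in particular,
// Q'*m then has at least i trailing zero bits. All arithmetic below
// wraps modulo 2^64, as the identity requires.
bool identity_holds(uint64_t n) {
    const uint64_t P = 12;                          // 2^2 * 3
    const uint64_t Q_inv = UINT64_C(0xAAAAAAAAAAAAAAAB);  // 3^-1 mod 2^64
    uint64_t m = n * P;                             // a multiple of P
    return Q_inv * m == (n << 2);                   // Q'*m == 2^i * n
}
```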