Setting bits:
Given an array int inds[N], where each inds[i] is a bit position in the range [0, 255] (all inds[i] are sorted and unique), I need to set the corresponding bits of a __m256i to 1.
Is there a better way than what I do below:
alignas(32) uint64_t buf[4] = {0};
for (int i = 0; i < N; ++i) {
int ind = inds[i];
buf[ind / 64] |= 1ull << (ind % 64); // 1ull, not 1ul: unsigned long is 32-bit on some ABIs
}
auto r = _mm256_load_si256((__m256i*)buf);
Getting bits:
In the opposite operation, I need to compute the product of double values at the 1-bit positions. I.e., given double const sizes[256], compute the product of some of them (at the positions given by a __m256i mask).
inline
double size (__m256i r, double const sizes[256])
{
alignas(32) uint64_t buf[4]; // _mm256_store_si256 requires 32-byte alignment
_mm256_store_si256((__m256i*)buf, r);
double s[4] = {1.0, 1.0, 1.0, 1.0};
// __builtin_ctzl(i) gives next position
// and i &= i - 1 clears that bit
for (; buf[0] != 0; buf[0] &= buf[0] - 1)
s[0] *= sizes[__builtin_ctzl(buf[0]) + 0 * 64];
for (; buf[1] != 0; buf[1] &= buf[1] - 1)
s[1] *= sizes[__builtin_ctzl(buf[1]) + 1 * 64];
for (; buf[2] != 0; buf[2] &= buf[2] - 1)
s[2] *= sizes[__builtin_ctzl(buf[2]) + 2 * 64];
for (; buf[3] != 0; buf[3] &= buf[3] - 1)
s[3] *= sizes[__builtin_ctzl(buf[3]) + 3 * 64];
return s[0] * s[1] * s[2] * s[3];
}
Same question: better way to do it?
It's possible to do this with AVX512, and it is more efficient than the scalar approach in some cases, depending on N.
There is something else though: the scalar approach has a fixable problem, a loop-carried dependency through memory. GCC, for example, compiles the setting loop to an `or` with a memory destination, loading and storing the same memory location in (most) successive loop iterations.

This can be avoided by writing separate loops for each chunk of the result. At the source level that still looks like a dependency through memory, but when it's written this way (with a constant index into the array), compilers are likely to apply an optimization where they temporarily keep `buf[0]` and so on in a register; GCC does, and other compilers behave similarly. Much better (though GCC missed the opportunity to use `bts` with a register destination here, which is efficient, unlike the version with a memory destination). In fact it was more than twice as fast in my tests, but that will depend on `N` and other factors.

As for AVX512: on my PC (Rocket Lake), an AVX512 version is faster (in the sense of throughput being higher, I did not test latency) than the improved scalar code for some `N`, with Peter's suggestions now around 16 or more, not bad. Conversion to AVX2 seems possible, but that would raise the threshold where it begins to be worth it.
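For reference, the restructured scalar setter might look like the following sketch (`set_bits` is a name chosen here for illustration; it exploits the fact that `inds` is sorted, so each 64-bit chunk's indices are contiguous):

```c
#include <immintrin.h>
#include <stdalign.h>
#include <stdint.h>

// One loop per 64-bit chunk of the result. The chunk being built lives
// in a local variable, so the compiler can keep it in a register instead
// of doing load/or/store on the same memory location every iteration.
__attribute__((target("avx")))
__m256i set_bits(int const inds[], int N)
{
    alignas(32) uint64_t buf[4] = {0};
    int i = 0;
    for (int chunk = 0; chunk < 4; ++chunk) {
        int const lim = (chunk + 1) * 64;
        uint64_t b = 0;  // accumulates in a register
        for (; i < N && inds[i] < lim; ++i)
            b |= 1ull << (inds[i] % 64);
        buf[chunk] = b;
    }
    return _mm256_load_si256((__m256i const*)buf);
}
```

The constant-index stores `buf[chunk] = b` happen only four times total, so the store/reload dependency disappears from the hot inner loops.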
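As an illustration only (this is one possible shape, not necessarily the AVX512 hack referred to above), the mask-to-product direction can use masked multiplies, where each byte of the 256-bit mask selects lanes in a group of 8 doubles:

```c
#include <immintrin.h>
#include <stdalign.h>
#include <stdint.h>

// Hypothetical sketch: consume the 256-bit mask one byte at a time,
// using each byte as a __mmask8 for a masked multiply over 8 doubles.
// 32 fixed iterations, no data-dependent branching.
__attribute__((target("avx512f")))
double size_avx512(__m256i r, double const sizes[256])
{
    alignas(32) uint64_t buf[4];
    _mm256_store_si256((__m256i*)buf, r);

    __m512d prod = _mm512_set1_pd(1.0);
    for (int g = 0; g < 32; ++g) {
        // 8 mask bits covering sizes[8*g] .. sizes[8*g + 7]
        __mmask8 m = (__mmask8)(buf[g >> 3] >> (8 * (g & 7)));
        // Lanes whose mask bit is 0 keep their previous value.
        prod = _mm512_mask_mul_pd(prod, m, prod, _mm512_loadu_pd(sizes + 8 * g));
    }
    return _mm512_reduce_mul_pd(prod);  // horizontal product of the 8 lanes
}
```

Note that as written this is one dependency chain of 32 multiplies, so it is latency-bound; splitting `prod` into several accumulators that are multiplied together at the end would expose more parallelism.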