The function takes about 20-35 ms when declared static inline and compiled with GCC or Clang at O1 or higher (at O0 it takes the same 400-600 ms as without the static keyword). When static is removed, the function takes over 400 ms on an array of 1 billion bytes/chars. When run single-threaded, the function's time doesn't change whether static is used or not. On MSVC it always takes 400 ms or more, even with O2i and the AVX2 arch option.

If I just replace the AVX2 code with a simple call to std::count(begin, end, target), it runs just as fast as the fast AVX2 case whether static is specified or not (even on MSVC this time).

Code:

static inline uint64_t opt_count(const char* begin, const char* end, const char target) noexcept {

    const __m256i avx2_Target = _mm256_set1_epi8(target);
    uint64_t result = 0;

    static __m256i cnk1, cnk2;
    static __m256i cmp1, cmp2;
    static uint32_t msk1, msk2;
    uint64_t cst;

    for (; begin < end; begin += 64) {
        cnk1 = _mm256_load_si256((const __m256i*)(begin));
        cnk2 = _mm256_load_si256((const __m256i*)(begin+32));

        cmp1 = _mm256_cmpeq_epi8(cnk1, avx2_Target);
        cmp2 = _mm256_cmpeq_epi8(cnk2, avx2_Target);

        msk1 = _mm256_movemask_epi8(cmp1);
        msk2 = _mm256_movemask_epi8(cmp2);
        // Casting and shifting is faster than 2 popcnt calls
        cst = static_cast<uint64_t>(msk2) << 32;
        result += _mm_popcnt_u64(msk1 | cst);
    }

    return result;
}

Caller:


uint64_t opt_count_parallel(const char* begin, const char* end, const char target) noexcept {
    const size_t num_threads = std::thread::hardware_concurrency()*2;
    const size_t total_length = end - begin;
    if (total_length < num_threads * 2) {
        return opt_count(begin, end, target);
    }

    const size_t chunk_size = (total_length + num_threads - 1) / num_threads;

    std::vector<std::future<uint64_t>> futures;
    futures.reserve(num_threads);

    for (size_t i = 0; i < num_threads; ++i) {
        const char* chunk_begin = begin + (i * chunk_size);
        const char* chunk_end = std::min(end, chunk_begin + chunk_size);

        futures.emplace_back(std::async(std::launch::async, opt_count, chunk_begin, chunk_end, target));
    }

    uint64_t total_count = 0;
    for (auto& future : futures) {
        total_count += future.get();
    }

    return total_count;
}

In a different file I allocate a buffer with new, align it, memset it to '\n' with every other char set to 'x', then time each iteration of the opt_count_parallel call and print its output.

I have tried using thread and future and both have more or less the same result.

Here is the godbolt diff view: https://godbolt.org/z/9P87bndsb . I don't see much difference in the assembly, but I'm not knowledgeable enough to understand the small differences.

I've also tried assigning avx2_Target outside opt_count, in opt_count_parallel, which made no difference.

I looked at GCC's -fopt-info and the output was the same in both cases. I've also tried force inlining and noalign, but again there was no noticeable difference.

I've also tried debugging/profiling, but it's a bit of an annoyance since the speedup is lost at -O0, and profiling just shows that everything takes uniformly longer to execute.

There are 1 best solutions below

Maj mac

Changing these:

    static __m256i cnk1, cnk2;
    static __m256i cmp1, cmp2;

to:

    __m256i cnk1, cnk2;
    __m256i cmp1, cmp2;

fixed the issue. Thanks, Harold.
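Some background on why this matters: a function-local static variable is a single object shared by every call, so all the threads started by opt_count_parallel were writing to the same cnk, cmp and msk variables. Those unsynchronized writes are a data race, and the temporaries have to live in memory instead of registers, which is a likely source of the slowdown. The msk1 and msk2 variables in the question are static too and are worth making automatic as well.

Below is a minimal sketch of the kernel with every temporary declared as an ordinary local. The name opt_count_fixed is hypothetical. Beyond the fix above, the sketch also uses unaligned loads and a scalar tail loop; those are additions of this sketch, not part of the accepted fix, so that a chunk boundary that is not a multiple of 64 is neither over-read nor required to be aligned. It falls back to the scalar loop alone when not compiled for x86-64 with AVX2 enabled.

```cpp
#include <cstddef>
#include <cstdint>
#if defined(__AVX2__) && (defined(__x86_64__) || defined(_M_X64))
#include <immintrin.h>
#define OPT_COUNT_HAVE_AVX2 1
#endif

// Counts occurrences of target in [begin, end).
// All temporaries are automatic variables, so each thread has its own copy.
inline std::uint64_t opt_count_fixed(const char* begin, const char* end,
                                     char target) noexcept {
    std::uint64_t result = 0;

#ifdef OPT_COUNT_HAVE_AVX2
    const __m256i needle = _mm256_set1_epi8(target);
    // Process 64 bytes per iteration while a full block remains.
    for (; end - begin >= 64; begin += 64) {
        // Unaligned loads: an assumption of this sketch, so chunks may
        // start at any address.
        const __m256i cnk1 =
            _mm256_loadu_si256(reinterpret_cast<const __m256i*>(begin));
        const __m256i cnk2 =
            _mm256_loadu_si256(reinterpret_cast<const __m256i*>(begin + 32));

        const __m256i cmp1 = _mm256_cmpeq_epi8(cnk1, needle);
        const __m256i cmp2 = _mm256_cmpeq_epi8(cnk2, needle);

        const std::uint32_t msk1 =
            static_cast<std::uint32_t>(_mm256_movemask_epi8(cmp1));
        const std::uint32_t msk2 =
            static_cast<std::uint32_t>(_mm256_movemask_epi8(cmp2));

        // Combine both 32-bit masks and count the set bits once.
        const std::uint64_t both =
            msk1 | (static_cast<std::uint64_t>(msk2) << 32);
        result += static_cast<std::uint64_t>(_mm_popcnt_u64(both));
    }
#endif

    // Scalar tail (and the whole range when AVX2 is unavailable).
    for (; begin < end; ++begin) {
        result += (*begin == target) ? 1u : 0u;
    }
    return result;
}
```

Because nothing in the function is shared any more, it can be passed to std::async exactly as opt_count is in the caller above.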