Fastest way to mask out bytes higher than separator position with SIMD

uint8_t data[] = "mykeyxyz:1234\nky:123\n..."; My input is lines of the form key:value, where every line is guaranteed to have len(key) <= 16. I want to load mykeyxyz into a __m128i, with the positions above the key filled with 0.

The easiest way is to have an array of 255/0 mask bytes and load a sliding window from it, but that requires another memory load. Is there any way to do this faster?

The accepted answer gives ~2% faster total program time. To compare, test 1brc_valid13.cpp against 1brc_valid14.cpp (which uses the accepted answer). Hardware: AMD 2950X, Ubuntu 18.04, g++ 11.4, compile command: g++ -o main 1brc_final_valid.cpp -O3 -std=c++17 -march=native -m64 -lpthread

Edit: preferably without AVX512

Edit 2: I need the variable len so I can start parsing the value part.

Edit 3: the function will be used in a loop (for example, to parse 1 million lines of text), but strcmp_mask will basically always be in L1 cache.

Edit 4: I benchmark the functions by parsing 1 billion lines of (key, value) and processing them. You can download the code/data and replicate the results in my repo: https://github.com/lehuyduc/1brc-simd . The discussion post also contains more info.

Edit 5: I tested maskafterc256 and found that it made my code 50x slower!!! If I replace _mm256_set_epi8 with _mm256_setr_epi8, it becomes 500+x slower (it took so long that I just hit Ctrl-C). I'm not sure what _mm256_set_epi8 does here, because it's translated into a sequence of instructions instead of a single one.

perf stat -d ./main result for maskafterc

         14,470.46 msec task-clock                #   20.785 CPUs utilized
             3,032      context-switches          #    0.210 K/sec
                 5      cpu-migrations            #    0.000 K/sec
           341,894      page-faults               #    0.024 M/sec
    55,073,525,723      cycles                    #    3.806 GHz                      (37.19%)
     1,714,679,575      stalled-cycles-frontend   #    3.11% frontend cycles idle     (36.71%)
    11,442,758,125      stalled-cycles-backend    #   20.78% backend cycles idle      (36.92%)
    80,739,874,133      instructions              #    1.47  insn per cycle
                                                  #    0.14  stalled cycles per insn  (37.39%)
     8,661,529,181      branches                  #  598.566 M/sec                    (38.22%)
        39,299,214      branch-misses             #    0.45% of all branches          (38.13%)
    17,784,400,757      L1-dcache-loads           # 1229.015 M/sec                    (37.93%)
       827,509,870      L1-dcache-load-misses     #    4.65% of all L1-dcache hits    (37.52%)
   <not supported>      LLC-loads
   <not supported>      LLC-load-misses

       0.696207306 seconds time elapsed

      12.918590000 seconds user
       1.546535000 seconds sys

perf stat -d ./main result for maskafterc256

Performance counter stats for './main':

      1,047,414.73 msec task-clock                #   29.982 CPUs utilized
           125,296      context-switches          #    0.120 K/sec
               211      cpu-migrations            #    0.000 K/sec
           341,889      page-faults               #    0.326 K/sec
 4,229,832,527,830      cycles                    #    4.038 GHz                      (37.50%)
    10,965,796,240      stalled-cycles-frontend   #    0.26% frontend cycles idle     (37.50%)
   167,711,051,408      stalled-cycles-backend    #    3.96% backend cycles idle      (37.49%)
   296,573,918,148      instructions              #    0.07  insn per cycle
                                                  #    0.57  stalled cycles per insn  (37.50%)
    44,843,867,352      branches                  #   42.814 M/sec                    (37.50%)
        56,509,334      branch-misses             #    0.13% of all branches          (37.51%)
    91,621,829,978      L1-dcache-loads           #   87.474 M/sec                    (37.50%)
    18,776,996,709      L1-dcache-load-misses     #   20.49% of all L1-dcache hits    (37.51%)
   <not supported>      LLC-loads
   <not supported>      LLC-load-misses

      34.935225940 seconds time elapsed

    1039.609651000 seconds user
       6.774492000 seconds sys

My current code:

#include <iostream>
#include <immintrin.h>
#include <string>
#include <cstring>
using namespace std;

alignas(4096) const uint8_t strcmp_mask[32] = {
  255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
};

int main()
{
  uint8_t data[] = "mykeyxyz:1234\naaaaaaaaaaa";
  __m128i chars = _mm_loadu_si128((__m128i*)data);
  __m128i separators = _mm_set1_epi8(':');
  __m128i compared = _mm_cmpeq_epi8(chars, separators);
  uint32_t separator_mask = _mm_movemask_epi8(compared);
  uint32_t len = __builtin_ctz(separator_mask);
  
  cout << "len = " << len << "\n";
  __m128i mask = _mm_loadu_si128((__m128i*)(strcmp_mask + 16 - len));
  __m128i key_chars = _mm_and_si128(chars, mask);
  
  uint8_t res[16];
  memcpy(res, (char*)&key_chars, 16);
  for (int i = 0; i < 16; i++) cout << int(res[i]) << " ";
  cout << "\n";
}
// len = 8
// 109 121 107 101 121 120 121 122 0 0 0 0 0 0 0 0

There are 4 answers below.

Simon Goater (accepted answer, 27 votes)

I often find it interesting to see how others approach a problem, so here's my version. It only requires SSE2, but benefits from SSSE3, and BMI1 for the trailing zeros calculation.

#include <stdio.h>
#include <string.h>
#include <stdint.h>
#include <immintrin.h>

// gcc maskafterc.c -o maskafterc.bin -O3 -march=native -Wall

__m128i maskafterc(__m128i input, uint8_t c, uint8_t* restrict pos) {
  // Finds the first occurrence of c in input and stores its position in pos.
  // Returns a mask of 255s before pos, 0s at and after.
  // e.g. maskafterc([5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20], 9, uint8_t *pos)
  // sets pos = 4 and returns [255, 255, 255, 255, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0].
  __m128i index = _mm_setr_epi8(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15);
  __m128i cmp = _mm_cmpeq_epi8(input, _mm_set1_epi8(c));
  uint32_t mmask = _mm_movemask_epi8(cmp);
  *pos = (mmask ? __builtin_ctz(mmask) : 16); // Return all -1s if c not found
  return _mm_cmplt_epi8(index, _mm_set1_epi8(*pos));
}

int main(int argc, char **argv)
{
  unsigned char data[] = "mykeyxyz:98765211234\naaaaaaaaaaa";
  uint8_t pos;
  __m128i chars = _mm_loadu_si128((__m128i*)data);
  __m128i res =_mm_and_si128(maskafterc(chars, ':', &pos), chars);
  if (pos < 16) puts((char*) &res);
  printf("keylen = %i\n", pos);
}
//mykeyxyz
//keylen = 8

EDIT: AVX2 version.

__m256i maskafterc256(__m256i input, uint8_t c, uint8_t* restrict pos) {
  // Finds the first occurrence of c in input and stores its position in pos.
  // Returns a mask of 255s before pos, 0s at and after.
  __m256i index = _mm256_setr_epi8(0, 1, 2, 3, 4, 5, 6, 7, 8, 9,
                                  10, 11, 12, 13, 14, 15, 16,
                                  17, 18, 19, 20, 21, 22, 23, 24,
                                  25, 26, 27, 28, 29, 30, 31);
  __m256i cmp = _mm256_cmpeq_epi8(input, _mm256_set1_epi8(c));
  uint32_t mmask = _mm256_movemask_epi8(cmp);
  *pos = (mmask ? __builtin_ctz(mmask) : 32); // Return all -1s if c not found
  return _mm256_cmpgt_epi8(_mm256_set1_epi8(*pos), index);
}
Jan Schultke (7 votes)

Your existing way is probably best for your use-case of processing a big batch of key:value lines. You'll get at most one cache-miss for the sliding-window array on the first call, but it'll then be hot in L1d cache assuming you aren't touching huge amounts of memory before parsing the next key:value. Unaligned vector loads are very cheap on modern x86 when they hit in cache, as long as they don't cross a cache-line boundary (or 32-byte boundary on Zen 1), and aligning your 32-byte constant array to its size prevents that.

You say you need the __builtin_ctz(mask) or std::countr_zero(mask) result anyway, and your current strategy only takes a couple extra instructions beyond that (like a sub or neg and then a memory-source vpand), so it's very good for throughput.
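
For reference, here is that sliding-window strategy factored into a small helper. This is only a sketch assuming the question's strcmp_mask table (16 bytes of 255 followed by 16 bytes of 0) is in scope; the function name is just for illustration:

static inline __m128i mask_first_n_window(__m128i string, unsigned len) {
    // unaligned load of a window containing `len` 0xFF bytes followed by zeros
    __m128i mask = _mm_loadu_si128((const __m128i*)(strcmp_mask + 16 - len));
    return _mm_and_si128(string, mask);   // the memory-source vpand mentioned above
}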


@chtz's answer shows a way to process the vector compare result directly into a vector mask, without going to scalar and back. It takes surprisingly few instructions even though there isn't a subtract or decrement that carries across the whole 128-bit vector, only within 64-bit elements. Going to scalar with _mm_movemask_epi8 makes the bithack easier, but applying it to a vector without AVX-512 would take even more work and require loading a non-trivial vector constant for any of the efficient approaches in "is there an inverse instruction to the movemask instruction in intel avx2?". If you're going to load anything from .rodata, it might as well be the mask you want. (Although that does make the load address data-dependent on the compare result, so load-use latency is part of what out-of-order exec has to hide, and the load can't start early to hide a cache miss.)

In a different use-case from yours, where data-cache misses are a problem for the sliding-window mask load, @chtz's answer is worth trying, and perhaps also this answer if AVX-512 is available. (@Simon's answer is good but needs a vector constant: if you're doing enough work between calls that cache isn't still hot, it's probably not in a loop that can keep that constant in a register either.)

With AVX-512, it's cheap to apply a scalar mask to a vector. This answer has some AVX-512 versions which avoid any vector constants, which might or might not be better than your sliding window or than chtz's answer. (AVX-512 makes chtz's code more efficient, combining the final andnot and and into one vpternlogq instruction). chtz's AVX1 vector shuffles have the lowest critical-path latency for out-of-order exec to hide, but AVX-512 allows fewer total uops for the front-end and ROB (reorder buffer). Especially when you also need __builtin_ctz(mask).
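
To illustrate the vpternlogq point, here is a hedged sketch (not code from chtz's answer) of how the final ~cmp & (cmp-1) & string could be written explicitly with AVX-512VL; immediate 0x08 selects ~a & b & c:

__m128i apply_chtz_mask_ternlog(__m128i cmp, __m128i cmp_minus_1, __m128i string) {
    // imm8 = 0x08: the result bit is 1 only where cmp=0, cmp_minus_1=1, string=1,
    // i.e. (~cmp) & (cmp - 1) & string, which a compiler can also form on its own
    return _mm_ternarylogic_epi64(cmp, cmp_minus_1, string, 0x08);
}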

AVX-512 also makes it cheaper for a compiler to do _mm_set1_epi8(':') with a mov-immediate + vpbroadcastb instead of loading a constant from .rodata. (GCC 12 likes to construct constants that way even without AVX-512, needing a separate [v]movd as well as vpbroadcastd or pshufd. This costs more code size but less .rodata size (especially for large 32- and 64-byte constants), and can't miss in data cache. Maybe GCC's tuning testing found that code was more predictable and easier for the CPU to prefetch than data, or that I-cache pressure wasn't as much of a problem on average.)


Using AVX-512, you can generate a mask using _mm_maskz_broadcastb_epi8 for use with _mm_and_si128.
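
For example, a sketch of that variant (assuming AVX-512BW/VL plus BMI2 for bzhi; the function name is illustrative):

#if defined(__AVX512BW__) && defined(__AVX512VL__) && defined(__BMI2__)
__m128i mask_first_n_broadcast(__m128i string, int n) {
    __mmask16 k = _bzhi_u32(-1u, n);                  // low n bits set
    // broadcast 0xFF into the low n bytes, zero the rest
    __m128i vmask = _mm_maskz_broadcastb_epi8(k, _mm_set1_epi8(-1));
    return _mm_and_si128(string, vmask);
}
#endif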

Even more simply, you can mask the input characters with _mm_maskz_mov_epi8 (1). If you do another vector operation on the __m128i, the compiler might even be able to fold this zero-masking into the next operation.

This version uses the key length (ctz), so it's good for cases which need that as an extra output.

#if defined(__AVX512BW__) && defined(__BMI2__)
__m128i mask_first_n(__m128i string, int n) {
    //__mmask16 mask = (1u << n) - 1;      // compilers use shlx/sub
    __mmask16 mask = _bzhi_u32(-1u, n);    // same result for n=0..31.  For n=32, produces -1 instead of 0 (for 1u<<(n&31) like x86 shifts)
    return _mm_maskz_mov_epi8(mask, string);
}
#endif

This compiles to:

mask_first_n(long long __vector(2), int):
        mov     eax, -1         # can potentially get hoisted out of a loop
        bzhi    eax, eax, edi
        kmovw   k1, eax
        vmovdqu8        xmm0{k1}{z}, xmm0
        ret

So this is 3 instructions after getting the __builtin_ctz result, not counting the mov eax, -1 which could be reused across loop iterations if the loop has enough registers to spare. They're all single-uop, except kmov* k, r32 on Zen 4 which is 2 uops. (https://uops.info/)

With (1u << n) - 1, we got mov eax,1 / shlx / sub eax,1, which on Intel can run on any ALU port including 6 which doesn't have any vector ALUs, but that's 2 total uops. bzhi is 1 uop on Intel and AMD, but on Intel only runs on port 1 or 5. (Or on Alder Lake just port 1 since it has 3 cycle latency there.)

Using this, you can mask your string like this:

int find(__m128i string, __m128i separator) {
    // You do still want SSE/AVX compare+movemask here, not _mm_cmpeq_epi8_mask
    __m128i compareMask = _mm_cmpeq_epi8(string, separator);
    unsigned mask = _mm_movemask_epi8(compareMask);
    // TODO: error handling if mask == 0, i.e. character not found:
    //  _tzcnt_u32 or countr_zero returns 32,
    //  and bzhi(-1, 32) and zero-masking will leave all 16 bytes unchanged.
    //  Either use 32-byte vectors, or std::countr_zero( mask | (1<<16) ) to get a "correct" len.
    return std::countr_zero(mask);    // needs <bit>; or _tzcnt_u32 if you don't want C++20
}

__m128i zero_above_key_ctz_bzhi(__m128i vec, __m128i separator){
    int pos = find(vec, separator);
    __m128i result = mask_first_n(vec, pos);
    return result;
}


int main() {
    unsigned char data[] = "mykeyxyz:1234\naaaaaaaaaaa";
    __m128i chars = _mm_loadu_si128((__m128i*)data);

    int pos = find(chars, _mm_set1_epi8(':'));
    __m128i result = mask_first_n(chars, pos);

    puts((char*) &result);
}

This code prints mykeyxyz. See the live code at Compiler Explorer, which also includes code from other answers. Old version with (1u << n) - 1: https://godbolt.org/z/Gjjqna4Ke


Without using tzcnt

We can use instructions like kadd and kandn to do @chtz's suggestion of (mask-1) & (~mask), which is like BMI1 blsmsk but not including the lowest set bit. Or we can use non-AVX512 compare/movemask to do the bithack in general-purpose integer registers.
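
As a concrete example of that bithack, with ':' at byte 8 of "mykeyxyz:1234...":

unsigned mask = 1u << 8;                // movemask result: only bit 8 set
unsigned keep = (mask - 1) & ~mask;     // 0x00FF: bits 0..7 set
// the low 8 bits select the 8 key bytes; the separator bit and everything above stay clear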

AVX1 compare + movemask is probably better if we need the value in a general-purpose register for tzcnt to get the length. That version of vpcmpeqb runs on more ports on Intel than comparing into a k mask register, and compare+movemask might be lower latency than cmp_epi8_mask / kmov. Lower latency for pos can be very important if lots of later code depends on it, more than out-of-order exec can see past.

__m128i zero_above_key_k_bithack(__m128i vec, __m128i separator)
{
    __mmask16 mask = _mm_cmpeq_epi8_mask(vec, separator);
    //mask = (mask-1) & (~mask);  // kmov / lea / andn / kmov.  Use AVX1 cmp/movemask to avoid the first kmov
    mask = _kandn_mask16(mask, _kadd_mask16(mask, -1));  // kadd / kandn  (plus kxnor to create a constant)
     // clang does a masked compare into mask (port 5, 3c latency) instead of KAND (port 0, 1c latency).
    __m128i result = _mm_maskz_mov_epi8(mask, vec);
    return result;
}

__m128i zero_above_key_gpr_bithack(__m128i vec, __m128i separator)
{
    __m128i cmp = _mm_cmpeq_epi8(vec, separator);
    unsigned mask = _mm_movemask_epi8(cmp);
    mask = (mask-1) & (~mask);  // lea / andn / kmov.
    __m128i result = _mm_maskz_mov_epi8(mask, vec);
    return result;
}

GCC compiles them as expected. Clang uses AVX-512 vpcmpb k0, xmm, xmm and needs a kmovd instead of vpmovmskb even for the GPR version, which is probably worse. (Round-trip latency for kmov is 4 cycles on Ice Lake, according to uops.info's testing.)

zero_above_key_k_bithack(long long __vector(2), long long __vector(2)):
        vpcmpb  k0, xmm0, xmm1, 0
        kxnorw  k1, k1, k1
        kaddw   k1, k0, k1
        kandnw  k1, k0, k1
        vmovdqu8        xmm1{k1}{z}, xmm0
        vmovdqa xmm0, xmm1
        ret

zero_above_key_gpr_bithack(long long __vector(2), long long __vector(2)):
        vpcmpeqb        xmm1, xmm0, xmm1
        vpmovmskb       eax, xmm1
        lea     edx, -1[rax]
        andn    eax, eax, edx
        kmovw   k1, eax
        vmovdqu8        xmm1{k1}{z}, xmm0
        vmovdqa xmm0, xmm1
        ret

vs. the tzcnt version putting together the pieces with find() and mask_first_n().

zero_above_key_ctz_bzhi(long long __vector(2), long long __vector(2)):
        vpcmpeqb        xmm1, xmm0, xmm1
        mov     edx, -1
        vpmovmskb       eax, xmm1
        tzcnt   eax, eax
        bzhi    edx, edx, eax
        kmovw   k1, edx
        vmovdqu8        xmm1{k1}{z}, xmm0
        vmovdqa xmm0, xmm1         # could have operated in-place
        ret

(1) Thanks to @PeterCordes for the suggestion and for expanding and optimizing this answer.

chtz (4 votes)

The following code (requiring only SSE4.1) masks out the first occurrence of char c in string and every byte after it:

__m128i maskString(__m128i string, char c) {
    // find every occurrence of `c`:
    __m128i compareMask = _mm_cmpeq_epi8(string, _mm_set1_epi8(c));
    // always generate a `-1` for the lower half, 
    // and a `-1` for the upper half if the lower half is zero (i.e. if it would carry):
    __m128i minus_one = _mm_cmpeq_epi64(_mm_bslli_si128(compareMask, 8), _mm_setzero_si128());
    // `~compareMask & (compareMask-1)` as if computed as 128bit subtraction.
    // Subtraction makes `0xff` of every byte below the first occurrence of `c` in `string`, anding with `~compareMask` clears every remaining `0xff` and `0xfe`
    __m128i andMask = _mm_andnot_si128(compareMask, _mm_add_epi64(compareMask, minus_one));

    // apply mask to string and return:
    return _mm_and_si128(andMask, string);
}

Godbolt demo: https://godbolt.org/z/Tnsj1sf46

N.B.: If you need the length of the key anyway, OP's original code is probably fine (loading data from cache is usually cheap).
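
If you do want both outputs from this approach, one possibility (an assumption, not something from this answer) is to recover the length from the final mask with a popcount:

// Hypothetical helper: key length from chtz's andMask, which has 0xFF in
// exactly the bytes before the first separator and 0 everywhere else.
int key_len_from_mask(__m128i andMask) {
    return __builtin_popcount((unsigned)_mm_movemask_epi8(andMask));
}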

Aki Suihkonen (1 vote)

One could get the position with _mm_minpos_epu16, by shifting so that the high bit of the minimum value is concatenated with the word index, forming the desired byte index. This requires that there's only one separator; otherwise a later 00FF compares less than an earlier FF00.

(We also need to preprocess, because 00 00 words would be smaller than either 00FF or FF00.)

   // key:12312 --> 00 00 00 FF 00 00 00 ... -> min_idx = idx=01|FF00
   // kk:131232 --> 00 00 FF 00 00 00 00 ... -> min_idx = idx=01|00FF
   __m128i mask = _mm_cmpeq_epi8(key, separator);
   // convert the 0000 -> FFFF
   //             00FF -> 00FE
   //             FF00 -> FEFF
   mask = _mm_sub_epi16(mask, _mm_set1_epi16(1));  // or add -1, cheaper constant
   __m128i min_idx = _mm_minpos_epu16(mask);

   // shift right to get min_idx = 3 for the first case
   // min_idx = 2 for the second case
   min_idx = _mm_srli_epi32(min_idx, 15);  // keep high bit of min val + 2*idx

   // broadcast the byte zero everywhere
   min_idx = _mm_shuffle_epi8(min_idx, _mm_setzero_si128());
   
   // make the byte mask
   __m128i byte_mask = _mm_cmplt_epi8(k0123456789, min_idx);

   // use the mask
   key = _mm_and_si128(key, byte_mask);
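
Put together, a self-contained sketch of the single-separator variant above (the index-vector name k0123456789 is spelled out here, and the helper name is only illustrative):

// SSE4.1 for minpos, SSSE3 for shuffle. Only correct when `key` contains
// at most one separator byte; if c is absent, the computed byte index is 1,
// so only byte 0 survives (a caller must handle that case).
__m128i mask_key_minpos(__m128i key, char c) {
   const __m128i k0123456789 = _mm_setr_epi8(0, 1, 2, 3, 4, 5, 6, 7,
                                             8, 9, 10, 11, 12, 13, 14, 15);
   __m128i sep_mask = _mm_cmpeq_epi8(key, _mm_set1_epi8(c));
   sep_mask = _mm_add_epi16(sep_mask, _mm_set1_epi16(-1));    // 0000->FFFF, 00FF->00FE, FF00->FEFF
   __m128i min_idx = _mm_minpos_epu16(sep_mask);              // min value in word 0, its index in bits 16..18
   min_idx = _mm_srli_epi32(min_idx, 15);                     // byte 0 = 2*word_idx + high bit of min value
   min_idx = _mm_shuffle_epi8(min_idx, _mm_setzero_si128());  // broadcast the byte index to all lanes
   __m128i byte_mask = _mm_cmplt_epi8(k0123456789, min_idx);  // 0xFF strictly before the separator
   return _mm_and_si128(key, byte_mask);
}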

One can just as well broadcast byte #2 (i.e. the index field from minpos) after doubling:

   __m128i min_idx = _mm_minpos_epu16(mask);
   min_idx = _mm_add_epi32(min_idx, min_idx);
   min_idx = _mm_shuffle_epi8(min_idx, _mm_set1_epi8( 2 ));

But as pointed out in the comments, this would not readily work for cases where there are two or more separators.

One can try to equalise 00FF and FF00, e.g. by

   __m128i mask_s = _mm_cmpeq_epi8(key, separator);
   __m128i tmp = _mm_cmpeq_epi16(mask_s, _mm_setzero_si128());
   // now 0000 -> FFFF
   //     00FF -> 0000
   //     FF00 -> 0000
   __m128i pos = _mm_minpos_epu16(tmp);

But now the LSB (whether the separator is the low or the high byte of its word) needs to be detected separately. One can still use minpos, even though it has the wrong granularity, as in:

   __m128i word_mask = _mm_cmplt_epi8(k0011223344556677, pos_broadcasted_as_int16);
   word_mask = _mm_or_si128(word_mask, mask_s);