SIMD algorithm to check if an integer block is "consecutive"


How do you check if an aligned chunk of 16 u32s is consecutive (and increasing)?

For example, [100, 101, 102, ..., 115] is, and [100, 99, 3, ...] is not.
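(Just to pin down the semantics, a plain scalar version of the check is something like the sketch below; C++ shown here, but the language doesn't matter.)

#include <cstdint>

// Scalar reference check (illustration only): every element must be exactly
// one more than the previous one.
bool is_consecutive_scalar(const uint32_t *p)   // p points to 16 u32s
{
    for (int i = 1; i < 16; ++i)
        if (p[i] != p[i - 1] + 1)   // note: + 1 wraps at UINT32_MAX; whether a
            return false;           // wrapping run should count is a separate choice
    return true;
}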

I'm on AVX512f. This is what I have so far:

Algo A:

* predefine DECREASE_U32, a u32x16 of [15,14,13,...0]
* let a = input + DECREASE_U32 // wrapping is OK
* compare a to u32x16::splat(first_item(a))
* Return whether all true

Alternative (Algo B):

* let a = the input, and let b = a copy of a
* rotate the elements of b by one position
* let b = a - b
* check whether b is all 1's (except for the 1st position)

I'm doing this in Rust with the packed_simd crate, but any language/pseudocode is fine. (I wish there were a SIMD operation to subtract adjacent items.)


2 Answers

Peter Cordes (accepted answer)

I think your first idea is probably best if done inside a loop that can amortize the cost of loading a vector constant. AVX-512 can do that efficiently.

Either with a vector load and then separately broadcast the low element with vpbroadcastd, or with a vector load and a broadcast-load. e.g. vpaddd zmm16, zmm31, [rdi]{1to16} / vpcmpeqd k1, zmm16, [rdi].

Hmm, but then checking for all elements being true, I guess perhaps kaddw with a constant 1 and check that the low 16 bits are zero with kortest? Or just kmov to an integer register for a compare against 0xffff like we'd do with SSE/AVX pmovmskb. I tried that, and clang had a better idea: compare for not-equal, and check that the mask is all zero. (i.e. check that every element is equal by checking that they aren't not-equal.) That allows kortest on the mask itself. I applied clang's idea to my intrinsics so GCC could make better asm as well.

In C++:

#include <immintrin.h>

// compare for not-equal, checking the mask for 0
bool check_contig(int *p)
{
    __m512i bcast_first = _mm512_set1_epi32(*p);
    __m512i desired = _mm512_add_epi32(bcast_first, _mm512_set_epi32(15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0));  // p[0] + lane index in each lane

    __m512i v = _mm512_loadu_si512(p);
    __mmask16 cmp = _mm512_cmpneq_epi32_mask(desired, v);
    return cmp == 0;
}

Godbolt - asm from GCC and clang:

# GCC
check_contig(int*):
        vmovdqa32       zmm0, ZMMWORD PTR .LC0[rip]
        vpaddd  zmm0, zmm0, DWORD PTR [rdi]{1to16}
        vpcmpd  k0, zmm0, ZMMWORD PTR [rdi], 4
        kortestw        k0, k0
        sete    al
        vzeroupper
        ret
# clang
check_contig(int*):
        vpbroadcastd    zmm0, dword ptr [rdi]
        vpaddd  zmm0, zmm0, zmmword ptr [rip + .LCPI0_0]
        vpcmpneqd       k0, zmm0, zmmword ptr [rdi]
        kortestw        k0, k0
        sete    al
        vzeroupper
        ret

So they both choose to load twice instead of vpbroadcastd zmm1, xmm0, at least when not in a loop so the vector constant also has to get loaded from .rodata.

Perhaps if I wrote it differently, as _mm512_broadcastd_epi32( _mm512_castsi512_si128(v)), they'd prefer one load, at the cost of an extra shuffle uop. (Which is probably worse when you have 512-bit uops in flight, so Intel CPUs shut down the vector ALU on port 1, leaving only ports 0 and 5. https://agner.org/optimize/ and https://uops.info/)
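For reference, that single-load version might look like the following sketch (illustration only, untested):

#include <immintrin.h>

// Sketch of the single-load variant: reuse the already-loaded vector and
// broadcast its low element with an in-register shuffle, trading the second
// load for an extra shuffle uop.
bool check_contig_one_load(int *p)
{
    __m512i v = _mm512_loadu_si512(p);
    __m512i bcast_first = _mm512_broadcastd_epi32(_mm512_castsi512_si128(v));
    __m512i desired = _mm512_add_epi32(bcast_first,
                        _mm512_set_epi32(15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0));
    __mmask16 cmp = _mm512_cmpneq_epi32_mask(desired, v);  // not-equal anywhere?
    return cmp == 0;
}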


Algo B - avoiding a non-trivial vector constant

Maybe your second way could also be done efficiently with valignd to rotate the vector; the only vector constant it needs is all-ones which can be generated somewhat more cheaply (vpternlogd) instead of loaded.

Checking the compare-mask would probably require a kmov to integer for an and + cmp to check all but one bit, unless we can use the same trick clang did and arrange things so we actually want the mask to be all-zero in the places we want. In that case, test eax, imm32 can check the bits we want while ignoring the one we don't.
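A rough intrinsics sketch of that idea (illustration of the approach, untested):

#include <immintrin.h>

// Algo B sketch: rotate the vector by one element with valignd, add -1 (the
// all-ones constant, which compilers materialize with vpternlogd), and compare
// with the original.  Lane i then checks v[i+1] - 1 == v[i]; the wrap-around
// comparison in lane 15 is ignored by masking it out.
bool check_contig_algo_b(int *p)
{
    __m512i v   = _mm512_loadu_si512(p);
    __m512i rot = _mm512_alignr_epi32(v, v, 1);   // rot[i] = v[i+1], rot[15] = v[0]
    __m512i all_ones = _mm512_set1_epi32(-1);     // vpternlogd zmm, zmm, zmm, 0xFF
    __m512i next_minus_1 = _mm512_add_epi32(rot, all_ones);
    __mmask16 neq = _mm512_cmpneq_epi32_mask(next_minus_1, v);
    return (neq & 0x7FFF) == 0;   // test eax, imm32 style: ignore the wrap lane
}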

Carl

The core of my current Rust code is now this macro code:

    // Scalar spot check: the last element must be exactly (lanes - 1) more
    // than the first, with no overflow.
    const LAST_INDEX: usize = <$simd>::lanes() - 1;
    let (expected, overflowed) = $chunk[0].overflowing_add(LAST_INDEX as $scalar);
    if overflowed || expected != $chunk[LAST_INDEX] {
        return false;
    }

    // SIMD check: adding the decreasing constant [15, 14, ..., 0] turns a
    // consecutive chunk into a vector of identical values, so compare every
    // lane against the first lane of the sum.
    let a = unsafe { <$simd>::from_slice_aligned_unchecked($chunk) } + $decrease;
    let compare_mask = a.eq(<$simd>::splat(a.extract(0)));
    compare_mask.all()

Where $scalar is u32, $simd is u32x16, and $decrease is the [15, 14, ..., 0] constant. The first part of the code spot-checks that the last element is 15 more than the first (and handles overflow).

I asked a smart tool to help me understand the SIMD assembly produced. It says:

  • vmovdqa64: Moves a 512-bit vector into a ZMM register. It's used here twice: vmovdqa64 zmm0,zmmword ptr [...] loads a 512-bit vector from memory into zmm0, and a later vmovdqa64 zmm0,zmmword ptr [...] loads a different 512-bit vector into zmm0.

  • vpaddd zmm0,zmm0,zmmword ptr [rax+40h]: Packed addition of 32-bit integers. This adds the 512-bit vector in zmm0 to another 512-bit vector (loaded from the memory address rax + 40h) and stores the result back in zmm0.

  • vpbroadcastd zmm1,xmm0: Broadcasts a 32-bit integer from xmm0 (the low 128 bits of zmm0) across all lanes of zmm1. This creates a 512-bit vector in zmm1 where every element equals the low element of xmm0.

  • vpcmpeqd k0,zmm0,zmm1: Compares the 32-bit integers in zmm0 and zmm1 for equality. The results go into the mask register k0, where each bit holds the comparison result for one pair of elements.

  • vpternlogd zmm1,zmm1,zmm1,0FFh: Performs a bitwise ternary logic operation determined by the immediate; 0xFF makes every result bit 1, so this is the standard idiom for setting zmm1 to all-ones.

  • vpmovm2d zmm0,k0: Expands the mask register k0 into the vector register zmm0. Each bit of k0 becomes a 32-bit element in zmm0 that is all-ones if the bit is set and zero otherwise.

  • vpcmpd k0,zmm0,zmm1,4: Compares the 32-bit integers in zmm0 and zmm1 according to the predicate given as the last operand (here 4, which means "not equal"). The result is stored in the mask register k0.

  • vmovdqu64 zmmword ptr [rsp+50h],zmm0: Stores the 512-bit vector in zmm0 to memory at rsp + 50h.

  • kortestw k0,k0: ORs the mask register k0 with itself and sets the zero flag if the result is zero (and the carry flag if it is all-ones). This is often used for conditional branches on SIMD comparison results.

  • vzeroupper: Zeroes the bits above 127 of the vector registers to avoid transition penalties when mixing VEX/EVEX-encoded AVX code with legacy SSE code. It's good practice before calling or returning to code that might use legacy-SSE instructions.