How do you check if an aligned chunk of 16 u32s is consecutive (and increasing)?
For example: [100, 101, 102, ..., 115] is.
And [100, 99, 3, ...] is not.
I'm on AVX-512F. This is what I have so far:
Algo A:
* predefine DECREASE_U32, a u32x16 of [15, 14, 13, ..., 0]
* let a = input + DECREASE_U32 // wrapping is OK
* compare a to u32x16::splat(first_item(a))
* Return whether all true
Alternative (Algo B):
* let b = copy of a
* rotate the elements of b by one position
* let b = a - b
* check whether b is all 1's (except for the 1st position)
I'm doing this in Rust with the packed_simd crate, but any language/pseudocode is fine. (I wish there were a SIMD operation to subtract adjacent items.)
I think your first idea is probably best if done inside a loop that can amortize the cost of loading a vector constant. AVX-512 can do that efficiently.
Either with a vector load and then separately broadcasting the low element with `vpbroadcastd`, or with a vector load and a broadcast-load, e.g. `vpaddd zmm16, zmm31, [rdi]{1to16}` / `vpcmpeqd k1, zmm16, [rdi]`.

Hmm, but then to check for all elements being true, I guess perhaps `kaddw` with a constant `1` and check that the low 16 bits are zero with `kortest`? Or just `kmov` to an integer register for a compare against `0xffff`, like we'd do with SSE/AVX `pmovmskb`. I tried that, and clang had a better idea: compare for not-equal, and check that the mask is all-zero. (I.e. check that every element is equal by checking that they aren't not-equal.) That allows `kortest` on the mask itself. I applied clang's idea to my intrinsics so GCC could make better asm as well.

In C++:
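Something like this (a sketch of that compare-for-not-equal version; the function name `u32_contig16` is made up here):

```c++
#include <immintrin.h>
#include <stdint.h>

// Return true if p[0..15] are consecutive increasing u32s (e.g. 100..115).
// expected = splat(p[0]) + [0,1,...,15]; then compare the actual vector for
// NOT-equal, and require the resulting mask to be all-zero.
bool u32_contig16(const uint32_t *p)
{
    __m512i incr  = _mm512_setr_epi32(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15);
    __m512i first = _mm512_set1_epi32(p[0]);           // broadcast of the first element
    __m512i expected = _mm512_add_epi32(first, incr);  // wrapping add is fine
    __m512i v = _mm512_load_si512((const void*)p);     // aligned load of the chunk
    __mmask16 neq = _mm512_cmpneq_epu32_mask(v, expected);
    return neq == 0;  // all equal <=> no element compared not-equal
}
```

In a loop over many chunks, the load of the `incr` constant can be hoisted out, leaving just the broadcast, add, and compare per iteration.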
Looking at the asm from GCC and clang on Godbolt: they both choose to load twice instead of `vpbroadcastd zmm1, xmm0`, at least when not in a loop, so the vector constant also has to get loaded from `.rodata`. Perhaps if I wrote it differently, as `_mm512_broadcastd_epi32( _mm512_castsi512_si128(v) )`, they'd prefer one load, at the cost of an extra shuffle uop. (Which is probably worse when you have 512-bit uops in flight, since Intel CPUs then shut down the vector ALU on port 1, leaving only ports 0 and 5. See https://agner.org/optimize/ and https://uops.info/.)
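That variant would look something like this (a sketch, replacing the two loads in the hypothetical `u32_contig16` above):

```c++
// One full vector load, then broadcast its low element in-register
// (vpbroadcastd zmm, xmm) instead of a separate broadcast-load of p[0].
__m512i v = _mm512_load_si512((const void*)p);
__m512i first = _mm512_broadcastd_epi32(_mm512_castsi512_si128(v));
```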
Algo B - avoiding a non-trivial vector constant

Maybe your second way could also be done efficiently with `valignd` to rotate the vector; the only vector constant it needs is all-ones, which can be generated somewhat more cheaply (`vpternlogd`) instead of loaded.

Checking the compare-mask would probably require a `kmov` to an integer register for an `and` + `cmp` to check all but one bit, unless we can use the same trick clang did and arrange things so we actually want the mask to be all-zero in the lanes we check. In that case, `test eax, imm32` can check the bits we want while ignoring the one we don't.
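A sketch of that idea (the function name is again made up; note it flips the question's subtraction so the expected adjacent difference is -1, i.e. all-ones bits, which is the cheap `vpternlogd` constant, leaving only the single wrap-around lane to ignore in the mask check):

```c++
#include <immintrin.h>
#include <stdint.h>

// Algo B sketch: rotate by one lane with valignd, subtract, and check that
// every difference except the wrap-around lane is -1 (all-ones).
// v - rotate(v) = [v0-v1, v1-v2, ..., v14-v15, v15-v0]; for a consecutive
// increasing run, lanes 0..14 are all -1 and lane 15 wraps around to +15.
bool u32_contig16_algob(const uint32_t *p)
{
    __m512i v    = _mm512_load_si512((const void*)p);
    __m512i rot  = _mm512_alignr_epi32(v, v, 1);   // [v1, v2, ..., v15, v0]
    __m512i diff = _mm512_sub_epi32(v, rot);
    __m512i ones = _mm512_set1_epi32(-1);          // compilers can materialize this with vpternlogd
    __mmask16 neq = _mm512_cmpneq_epu32_mask(diff, ones);
    return (neq & 0x7FFF) == 0;  // ignore lane 15 (the wrap-around), like test eax, 0x7FFF
}
```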