Can FP compares like SSE2 _mm_cmpeq_pd / AVX _mm_cmp_pd be used to compare 64 bit integers?
The idea is to emulate the missing _mm_cmpeq_epi64, which would be the 64-bit counterpart of _mm_cmpeq_epi8, _mm_cmpeq_epi16 and _mm_cmpeq_epi32.
My concern is that I'm not sure whether the comparison is bitwise or applies floating-point semantics, e.g. NaN values always comparing unequal.
AVX implies SSE4.1, so pcmpeqq is available; in that case you should just use _mm_cmpeq_epi64.

FP compares treat NaN != NaN and -0.0 == +0.0, and if DAZ is set in MXCSR, they treat any small integer (one with bits [62:52] all zero) as zero. (Because exponent = 0 means it represents a denormal, and Denormals-Are-Zero mode treats them as exactly zero on input to avoid possible speed penalties for any operations on any microarchitecture, including for compares. IIRC, modern microarchitectures don't have a penalty for subnormal inputs to compares, but do still for some other operations. In any case, programs built with -ffast-math set FTZ and DAZ for the main thread on startup.)

So FP compares are not really usable for integers unless you know that some but not all of bits [62:52] (inclusive) will be set.
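To make the NaN and signed-zero problems concrete, here is a minimal sketch using only SSE2. The helper name fp_compare_says_equal is made up for illustration; it puts two 64-bit integers into the low element of a vector, reinterprets them as doubles, and reports what cmpeqpd thinks. It doesn't try to demonstrate the DAZ case, since that depends on the MXCSR state at run time.

```c
#include <emmintrin.h>  /* SSE2: _mm_cmpeq_pd, _mm_movemask_pd, _mm_set_epi64x */
#include <stdint.h>

/* Hypothetical helper (not a real intrinsic): returns 1 if cmpeqpd
 * considers the bit patterns a and b "equal" when viewed as doubles. */
static int fp_compare_says_equal(uint64_t a, uint64_t b)
{
    /* Place each integer in the low 64 bits; the high element is zero. */
    __m128d va = _mm_castsi128_pd(_mm_set_epi64x(0, (long long)a));
    __m128d vb = _mm_castsi128_pd(_mm_set_epi64x(0, (long long)b));
    __m128d eq = _mm_cmpeq_pd(va, vb);   /* ordered, non-signalling equality */
    return _mm_movemask_pd(eq) & 1;      /* sign bit of the low element */
}
```

With this, the integer 0x7FF8000000000000 (a quiet NaN bit pattern) compares unequal to itself, while 0x8000000000000000 and 0 (-0.0 and +0.0) compare equal even though their bits differ.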
It's much better to use pcmpeqd (_mm_cmpeq_epi32) than to hack up some FP bit-manipulation. (Although @chtz suggested in comments that you could do 42.0 == (42.0 ^ (a^b)) with xorpd, as long as the compiler doesn't optimize away the constant and compare against 0.0. Doing that without -ffast-math is a GCC bug.)

If you want a condition like at-least-one-match, then you need to make sure both halves of a 64-bit element matched, like mask & (mask<<1) on a movmskps result, which can compile to lea/and followed by a test. Only bits 1 and 3 of that result are meaningful (bit 2 pairs halves from two different elements), so test it against 0b1010. (You could do mask & (mask<<4) on a pmovmskb result, but that's slightly less efficient because LEA copy-and-shift can only shift by 0..3.)

Of course "all-matched" doesn't care about element sizes, so you can just use _mm_movemask_epi8 on any compare result and check it against 0xFFFF.

If you want to use it for a blend with and/andnot/or, you can pshufd/pand: swap halves within 64-bit elements and AND that with the original compare result. (If you were feeding pblendvb or blendvpd, that would mean SSE4.1 was available, so you should have used pcmpeqq.)

The more expensive one to emulate is SSE4.2 pcmpgtq, although I think GCC and/or clang do know how to emulate it when auto-vectorizing.
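A minimal SSE2-only sketch of the pcmpeqd-based approaches described above. The function names cmpeq_epi64_sse2 and any_equal_epi64_sse2 are my own, not real intrinsics. The first builds a full 64-bit compare mask with pcmpeqd / pshufd / pand, suitable for and/andnot/or blends. The second does the at-least-one-match check on a movmskps result. Note the final & 0xA: bit 2 of mask & (mask<<1) combines the high half of element 0 with the low half of element 1, so it has to be ignored.

```c
#include <emmintrin.h>  /* SSE2 */

/* Hypothetical stand-in for _mm_cmpeq_epi64 on SSE2-only targets:
 * each 64-bit element becomes all-ones if equal, all-zeros otherwise. */
static __m128i cmpeq_epi64_sse2(__m128i a, __m128i b)
{
    __m128i eq32 = _mm_cmpeq_epi32(a, b);                     /* pcmpeqd */
    /* Swap the two 32-bit halves inside each 64-bit element. */
    __m128i swapped = _mm_shuffle_epi32(eq32, _MM_SHUFFLE(2, 3, 0, 1));
    return _mm_and_si128(eq32, swapped);   /* both halves must have matched */
}

/* Returns nonzero if at least one 64-bit element of a equals that of b. */
static int any_equal_epi64_sse2(__m128i a, __m128i b)
{
    __m128i eq32 = _mm_cmpeq_epi32(a, b);
    unsigned mask = (unsigned)_mm_movemask_ps(_mm_castsi128_ps(eq32));
    /* Bit 1: both halves of element 0 matched; bit 3: same for element 1.
     * Bit 2 would mix halves of different elements, so mask it off. */
    return (mask & (mask << 1) & 0xAu) != 0;
}
```

For "all-matched" you don't need either helper: _mm_movemask_epi8(_mm_cmpeq_epi32(a, b)) == 0xFFFF is enough.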