I adapted a function I found on SO for SSE2 and included it in my program. The function uses SSE2 intrinsics to calculate the leading zero count of each of the 8 x 16bit integers in the vector. When I compiled the program, which produced no warnings, and ran it on my old laptop which I often use for development, it crashed with the message 'Illegal instruction (core dumped)'. On inspecting the assembly, I noticed my function appeared to have AVX1 'VEX' encoded SSE2 instructions as shown below.
.globl _mm_lzcnt_epi32
.type _mm_lzcnt_epi32, @function
_mm_lzcnt_epi32:
.LFB5318:
.cfi_startproc
endbr64
vmovdqa64 %xmm0, %xmm1
vpsrld $8, %xmm0, %xmm0
vpandn %xmm1, %xmm0, %xmm0
vmovdqa64 .LC0(%rip), %xmm1
vcvtdq2ps %xmm0, %xmm0
vpsrld $23, %xmm0, %xmm0
vpsubusw %xmm0, %xmm1, %xmm0
vpminsw .LC1(%rip), %xmm0, %xmm0
ret
.cfi_endproc
If I change the header immintrin.h to emmintrin.h, it compiles my code properly to SSE2 instructions
.globl _mm_lzcnt_epi32
.type _mm_lzcnt_epi32, @function
_mm_lzcnt_epi32:
.LFB567:
.cfi_startproc
endbr64
movdqa %xmm0, %xmm1
psrld $8, %xmm0
pandn %xmm1, %xmm0
cvtdq2ps %xmm0, %xmm1
movdqa .LC0(%rip), %xmm0
psrld $23, %xmm1
psubusw %xmm1, %xmm0
pminsw .LC1(%rip), %xmm0
ret
.cfi_endproc
and runs successfully. My program is as follows.
#include <stdio.h>
#include <string.h>
#include <stdbool.h>
#include <stdint.h>
#include <immintrin.h>
// gcc ssebug.c -o ssebug.bin -O3 -msse2 -Wall
__m128i _mm_lzcnt_epi32(__m128i v) {
// Based on https://stackoverflow.com/questions/58823140/count-leading-zero-bits-for-each-element-in-avx2-vector-emulate-mm256-lzcnt-ep
// prevent value from being rounded up to the next power of two
v = _mm_andnot_si128(_mm_srli_epi32(v, 8), v); // keep 8 MSB
v = _mm_castps_si128(_mm_cvtepi32_ps(v)); // convert signed integer to float ??
v = _mm_srli_epi32(v, 23); // shift down the exponent
v = _mm_subs_epu16(_mm_set1_epi32(158), v); // undo bias
v = _mm_min_epi16(v, _mm_set1_epi32(32)); // clamp at 32
return v;
}
int main(int argc, char **argv) {
uint32_t i, a[4];
__m128i arg;
uint32_t argval = 123;
if (argc >= 2) argval = atoi(argv[1]);
arg = _mm_set1_epi32(argval);
arg = _mm_lzcnt_epi32(arg);
_mm_storeu_si128((void*)a, arg);
for(i=0; i<4; i++) {
printf("%u ", a[i]);
}
printf("\n");
}
This explanation, Header files for x86 SIMD intrinsics, appears to suggest that for gcc at least, it is safe to just use immintrin.h for everything, which appears to be false. My questions are as follows.
Is it supposed to be safe to use immintrin.h for everything, or does using it tell the compiler to assume at least AVX1?
Isn't it the compiler's responsibility to produce ONLY instructions which are valid for the target architecture? If not, why not?
Why does it work (produce only SSE2) if I use immintrin.h but make my function static inline?
Is there a way to scan an assembly file to reveal what CPU extensions it requires?
Who should I contact about such issues in future?
I think this is potentially quite a serious issue as it isn't always feasible to check the assembler contains only valid instructions for the target architecture. I only found this because my program crashed, and I was using an old machine which doesn't support AVX1. If the function was in some hardly ever executed branch, I might have missed it. You could argue that it isn't worth worrying about this issue specifically because nobody will be using such old hardware for anything serious, but the issues it raises could well apply to newer architectures too. Thanks for your time. I am using gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0.
Rename your function to not clash with intrinsics
Like
lzcnt_epi32_sse2or justlzcnt_epi32. Theepi32is already enough to remind you it's related to Intel intrinsics like taking an__m128iarg, but the lack of_mmin the name lets you know it's just a function, and not one of Intel's SVML functions or something.If you want to mix vector widths and need to distinguish that in your helper functions (since C doesn't allow overloading), perhaps
__m1128i lzcntd_m128i( __m128i v );. I've also seen names likemm_lzcnt_epi32without the leading_, but it would be very easy to miss that when reading.Don't define your own functions with names that start with
_, those are reserved for use by the implementation. That reserved part of the namespace is a reasonable place for non-portable extensions that won't clash with any existing code, which is probably why Intel chose it for their intrinsics. (What are the rules about using an underscore in a C++ identifier? - C has pretty much the same rules as C++ for this, IIRC. Since your definition isn'tstatic, it's in the global namespace where_anythingis reserved. Not that I'd recommendstatic inlinewith clashing names.)Don't follow their naming scheme for your own functions that take
__m128iargs, and definitely never define your own version of an intrinsic. Those do get defined even without-mavx512vlenabled globally, so they're usable inside functions that use__attribute__((target("avx512vl"))), and unfortunately you end up with silent use of ISA extensions you didn't want, with no good way for GCC to detect a potential problem to even warn about it, I think.The intrinsic's definition
_mm_lzcnt_epi32is a real intrinsic for an AVX-512 instruction. It's declared and defined in a GCC header as anextern inlinewrapper function (around a GNU C__builtin) inside a#pragma GCC target("avx512vl,avx512cd")block, with__attribute__((always_inline)). (Ifavx512vlwasn't enabled globally, it will#pragma GCC pop_optionsafterwards so it's only enabled for that block of definitions.)Apparently the target-attribute part of the declaration sticks, but not the always-inline attribute which normally makes inlining fail with a compile-time error. This part may be a GCC bug. And somehow it's not an error to redefine the function, because of the
gnu_inlineattribute in the header's definition1. It is an error with clang which uses different headers.So you get a
call _mm_lzcnt_epi32inmainto a non-inline function that uses AVX-512 instructions. (Yes, GCC9.4 uses EVEXvmovdqa64 xmm1, xmm0as well as VEXvpsrld xmm0, xmm0, 8, as you show in your code block. This is a missed-optimization bug that was fixed in GCC10:vmovdqa xmm1, xmm0is fewer bytes of machine code. Although I think the whole copy is avoidable by shifting into a separate destination so there is still a missed optimization, but GCC10 makes asm that will run on Godbolt's Zen 3 AWS instances, not just its SKX / Ice Lake instances.)This is what's supposed to happen with
arg = _mm_lzcnt_epi32(arg);if you haven't defined your own version of it - a "target-specific options mismatch" error:Or if you use the raw builtin manually:
Note that
-msse2is baseline for x86-64. You only need to enable it if targeting-m32with a GCC config that doesn't do that by default. It doesn't do any harm for x86-64, but it also doesn't override AVX enabled by any earlier options like-march=x86-64-v3or-mavx. For that you want-mno-avx. But that just sets the baseline for all code: pragma and per-function__attribute__can still enable use of later ISA extensions for specific functions.gcc -msse2 -mno-avxis equivalent to the default and won't help work around this bug of naming a function that clashes with an intrinsic.Some Linux distros are planning to ship versions that are built with
-march=x86-64-v3(Haswell baseline: AVX2+FMA+BMI2, wikipedia) although IDK if they're planning to configure GCC with that higher baseline as a no-options default the way many do for SSE2 withgcc -m32. But your GCC 9.4.0-1ubuntu1~20.04.1 is definitely not configured that way, and what I can see on Godbolt matches what you report your GCC doing.Which CPUs is this relevant for?
First of all, your code uses AVX-512 instructions (
vmovdqa64) and will crash on Intel's latest desktop / laptop CPUs because they removed AVX-512 before defining a way (AVX10.1) to expose 128 and 256-bit EVEX instructions with all the great new features like masking, better shuffles,vpternlogd, and niche instructions likevplzcntd. They'll run fine on Zen 4, though.Secondly, low-power Intel CPUs based on Tremont and earlier lack AVX/BMI, so there are recent low-power servers and low-end netbooks without AVX.
Also, Intel Pentium and Celeron before Ice Lake had AVX+BMI disabled. (BMI perhaps a victim of disabling decode of VEX prefixes as a way to disable AVX+FMA?) This was pretty bad, not helping the x86 ecosystem get closer to making BMI (or AVX) baseline. BMI1/BMI2 are most useful if used everywhere for stuff like more efficient variable-count shifts, not just in a couple hot loops like SIMD.
(Ice Lake Pentium/Celeron are still half-width, but that means 256-bit so x86-64-v3 without AVX-512. Low-end / low-power Alder Lake N has all Gracemont E-cores but that's the same x86-64-v3 feature level as their P-cores, thanks to Intel crippling the AVX-512 on the P-cores even in systems with no E-cores, while enhancing their E-cores to add x86-64-v3 features.)
Footnote 1: No redefinition error?
It seems that
__attribute__((__gnu_inline__))is responsible for allowing a second definition. GCC compiles this without complaint:(
__gnu_inline__is a version ofgnu_inlinethat doesn't pollute the namespace, for use in-std=gnu11mode, like__asm__vs.asm. Most GNU keywords have an__x__version so headers don't break even if user code did a#defineon any non-reserved part of the namespace.)From the GCC manual: function attributes:
So I guess the version in the header wasn't a candidate for inlining because of mismatching target options, but providing a non-
inlinedefinition let GCC call it anyway. So this might not be a GCC bug. And it's probably not something GCC should even warn about since most.cfiles that provide the non-inline definition (if there is one; not the case for intrinsics I assume) will include the header that defines theextern inlineversion.Even if it were or is a bug that GCC didn't error or warn about this, don't define your own functions in a reserved part of the namespace in the first place. The most we could hope for is GCC being more helpful like erroring at compile-time instead of silently making a binary you didn't intend.
The behaviour is undefined in this case (defining functions with reserved names). Perhaps GCC could warn if it differentiated based on path, knowing which headers were "part of the implementation" vs. 3rd-party libraries. But I think glibc also uses plenty of
__names in headers in/usr/include, so I don't think that's viable.At first I thought GCC was allowing it because different target attributes on definitions for the same name is how GCC does function multiversioning. But this is different. If it was doing multiversioning, it would be using a non-AVX512 version because
mainwas compiled with just SSE2 in effect. The test-case above compiles with justgnu_inline, no target-attribute stuff required.