Motivated by this question, I compared three different functions for checking if 8 bytes pointed to by the argument are zeros (note that in the original question, characters are compared with '0', not 0):
bool f1(const char *ptr)
{
for (int i = 0; i < 8; i++)
if (ptr[i])
return false;
return true;
}
bool f2(const char *ptr)
{
bool res = true;
for (int i = 0; i < 8; i++)
res &= (ptr[i] == 0);
return res;
}
bool f3(const char *ptr)
{
static const char tmp[8]{};
return !std::memcmp(ptr, tmp, 8);
}
Though I would expect the same assembly outcome with enabled optimizations, only the memcmp version was translated into a single cmp instruction on x64. Both f1 and f2 were translated into either a winded or unwinded loop. Moreover, this holds for all GCC, Clang, and Intel compilers with -O3.
Is there any reason why f1 and f2 cannot be optimized into a single compare instruction? It seem to be a pretty straightforward optimization to me.
Live demo: https://godbolt.org/z/j48366
First of all,
f1stops reading at the first non-zero byte, so there are cases where it won't fault if you pass it a pointer to a shorter object near the end of a page, and the next page is unmapped. Unconditionally reading 8 bytes can fault in cases wheref1doesn't encounter UB, as @bruno points out. (Is it safe to read past the end of a buffer within the same page on x86 and x64?). The compiler doesn't know that you're never going to use it this way; it has to make code that works for every possible non-UB case for any hypothetical caller.You can fix that by making the function arg
const char ptr[static 8](but that's a C99 feature, not C++) to guarantee that it's safe to touch all 8 bytes even if the C abstract machine wouldn't. Then the compiler can safely invent reads. (A pointer to astruct {char buf[8]};would also work, but wouldn't be strict-aliasing safe if the actual pointed-to object wasn't that.)GCC and clang can't auto-vectorize loops whose trip-count isn't known before the first iteration. So that rules out all search loops like
f1, even if made it check a static array of known size or something. (ICC can vectorize some search loops like a naive strlen implementation, though.)Your
f2could have been optimized the same asf3, to a qwordcmp, without overcoming that major compiler-internals limitations because it always does 8 iterations. In fact, current nightly builds of clang do optimizef2, thanks @Tharwen for spotting that.Recognizing loop patterns is not that simple, and takes compile time to look for. IDK how valuable this optimization would be in practice; that's what compiler devs need trade off against when considering writing more code to look for such patterns. (Maintenance cost of code, and compile-time cost.)
The value depends on how much real world code actually has patterns like this, as well as how big a saving it is when you find it. In this case it's a very nice saving, so it's not crazy for clang to look for it, especially if they have the infrastructure to turn a loop over 8 bytes into an 8-byte integer operation in general.
In practice, just use
memcmpif that's what you want; apparently most compilers don't spend time looking for patterns likef2. Modern compilers do reliably inline it, especially for x86-64 where unaligned loads are known to be safe and efficient in asm.Or use
memcpyto do an aliasing-safe unaligned load and compare that, if you think your compiler is more likely to have a builtin memcpy than memcmp.Or in GNU C++, use a typedef to express unaligned may-alias loads:
Compiles on Godbolt with GCC10 -O3:
Casting to
uint64_t*would potentially violatealignof(uint64_t), and probably violate the strict-aliasing rule unless the actual object pointed to by thechar*was compatible withuint64_t.And yes, alignment does matter on x86-64 because the ABI allows compilers to make assumptions based on it. A faulting
movapsor other problems can happen with real compilers in corner cases.https://trust-in-soft.com/blog/2020/04/06/gcc-always-assumes-aligned-pointers/
Why does unaligned access to mmap'ed memory sometimes segfault on AMD64?
Is `reinterpret_cast`ing between hardware SIMD vector pointer and the corresponding type an undefined behavior? is another example of using
may_alias(withoutaligned(1)in that case because implicit-length strings could end at any point, so you need to do aligned loads to make sure that your chunk that contains at least 1 valid string byte doesn't cross a page boundary.) Also Is `reinterpret_cast`ing between hardware SIMD vector pointer and the corresponding type an undefined behavior?