What prevents the compiler from optimizing a hand written memcmp()?


Given:

#include <string.h>

bool test_data(void *data)
{
    return memcmp(data, "abcd", 4) == 0;
}

The compiler can optimize it into:

test_data:
    cmpl    $1684234849, (%rdi)
    sete    %al
    ret

which is nice.

But if I use my own memcmp() (not the one from <string.h>), the compiler can't optimize it into a single cmpl instruction:

static int memcmp(const void *s1, const void *s2, size_t n)
{
    const unsigned char *p1 = s1, *p2 = s2;
    size_t i;

    for (i = 0; i < n; i++) {
        int ret = p1[i] - p2[i];
        if (ret)
            return ret;
    }

    return 0;
}

bool test_data(void *data)
{
    return memcmp(data, "abcd", 4) == 0;
}

which compiles to:

test_data:
    cmpb    $97, (%rdi)
    jne     .L5
    cmpb    $98, 1(%rdi)
    jne     .L5
    cmpb    $99, 2(%rdi)
    jne     .L5
    cmpb    $100, 3(%rdi)
    sete    %al
    ret
.L5:
    xorl    %eax, %eax
    ret

Link: https://godbolt.org/z/Kfhchr45a

  • What prevents the compiler from further optimizing it?
  • Did I do something that inhibits the optimization?


Peter Cordes (best answer)

Data-dependent branching defeats auto-vectorization in GCC/Clang (but not classic ICC). The trip-count can't be computed before the first iteration (in the abstract machine), so GCC and clang wouldn't even try to use pcmpeqb/pmovmskb for large sizes. (Which is the efficient way to memcmp on x86-64 for large inputs.)

Apparently this is a hard problem, as it's been unsolved in GCC/Clang for 15 to 20 years (from now back to whenever GCC first started doing auto-vectorization.)

See also how to auto vectorization array comparison function - writing the loop so that it counts mismatches and always touches all elements can enable auto-vectorization (or an OR reduction instead of a sum reduction). But that won't help for a small fixed size like 4 bytes, and removing the early-out entirely is a disaster for a 1 GiB memcmp with a difference in the first byte. (A good SIMD memcmp, like glibc's SSE2/AVX2 versions, would branch maybe every 64 to 128 bytes on vector compare results.)


Apparently there's no idiom-recognition either, at least not for this way of writing it. (There is for memcpy; GCC and clang can turn simple copy loops into call memcpy or call memset, or inline expansions of those functions. So if you're writing your own, you need -fno-builtin-memcpy or else your function turns into a call to itself... Pick a different name for experiments.)

Writing it as a loop that unconditionally touches all bytes could give auto-vectorization, but probably won't get recognized as a memcmp idiom in GCC's internals. I think that would be necessary for good code with small problems, like this case where we want a single dword cmp.
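Written out, that "touch all bytes, no early-out" idea might look like the sketch below (bytes_equal is a hypothetical name; it only answers equal/unequal, not ordering, which is what lets the branch disappear from the loop body):

```c
#include <stdbool.h>
#include <stddef.h>

/* Sketch: OR-reduce all byte differences instead of branching on each
 * one. The trip count is known before the first iteration and the loop
 * body is branchless, so the auto-vectorizer has a chance. */
static bool bytes_equal(const void *s1, const void *s2, size_t n)
{
    const unsigned char *p1 = s1, *p2 = s2;
    unsigned char diff = 0;

    for (size_t i = 0; i < n; i++)
        diff |= p1[i] ^ p2[i];   /* nonzero iff any byte differs */

    return diff == 0;
}
```

Note the trade-off described above still applies: this always reads all n bytes, so it is only reasonable when n is small or mismatches are rare.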


Compilers must avoid introducing segfaults by inventing reads past where the abstract machine stops. If void *data points to a 1-byte buffer holding 'z', your manual loop has well-defined behaviour in the abstract machine. Reading all 4 bytes would be past the end of the buffer.

But memcmp is UB if any part of the array is inaccessible, so the compiler can touch all 4 bytes without checking for early-out or for pointer alignment. (Can std::memcmp read any bytes past the first difference? yes, unlike your loop.)

(In x86-64 asm, going past the end could go into an unmapped page, resulting in a segfault, violating the as-if rule. Is it safe to read past the end of a buffer within the same page on x86 and x64? - yes, but only within the same page. This is something you can work around with aligned loads for strlen and strchr, but a bigger obstacle for vectorizing strcmp with differently-aligned pointers.)

Instead of comparing two unknown pointers from function args, I changed your test_data caller to pass pointers to two global arrays, char foo[4], bar[4];, which the compiler knows for certain are both readable (Godbolt). But that didn't help, so still no single-dword compare.

chqrlie

As coded, the compiler cannot optimize the implementation for your special case because it cannot determine that reading the second byte pointed to by data is allowed when the first byte differs from 'a', and so on for the following bytes. memcmp is specified in the C Standard as having undefined behavior if any of the n bytes pointed to by either pointer is inaccessible. So the compiler and/or standard library may assume that all 4 bytes are accessible, and that unaligned access is supported on the target architecture.

If you modify your implementation to read all 4 bytes, assuming unaligned access is OK for the target architecture, both gcc and clang will optimize test_data, generating 3 instructions as expected:

#include <string.h>
#include <arpa/inet.h>

static int my_memcmp(const void *s1, const void *s2, size_t n)
{
    const unsigned char *p1 = s1, *p2 = s2;
    while (n >= sizeof(unsigned)) {
        unsigned u1, u2;
        memcpy(&u1, p1, sizeof u1);
        memcpy(&u2, p2, sizeof u2);
        if (u1 != u2)
            return htonl(u1) < htonl(u2) ? -1 : 1;
        n -= sizeof(unsigned);
        p1 += sizeof(unsigned);
        p2 += sizeof(unsigned);
    }
    for (size_t i = 0; i < n; i++) {
        int ret = p1[i] - p2[i];
        if (ret)
            return ret;
    }
    return 0;
}
Lundin

In addition to what's already been said, a library implementation of memcmp for a 64-bit computer may try to carry out the comparison in a single wide compare, as long as it can deduce that the data is small enough.

So a naive implementation of an optimized memcmp for a 64-bit CPU might look like this:

static int memcmp1 (const void* s1, const void* s2, size_t n)
{
  if(n>8)
  {
    // call some other function suitable for larger chunks
    // e.g. the loop in chqrlie's answer, preferably using unsigned long
    // I used the real memcmp for illustration purposes here    
    return memcmp(s1,s2,n); 
  }

  uint64_t i1=0;
  memcpy(&i1, s1, n);  // clang manages to turn this n<=8 byte copy into a single load
  uint64_t i2=0;
  memcpy(&i2, s2, n);
  
  if(i1 < i2)   // assume i1 and i2 are big-endian, unlike on x86-64.
    return -1;  // Use be64toh if you care about < vs >, not just != vs. ==
  if(i1 > i2)
    return 1;
  return 0;
}

https://godbolt.org/z/edjn4hx8h

Despite being rather naive code, this still generates quite efficient assembly. The compiler ought to discard the branch that calls the more intricate function while inlining memcmp1, because it knows the size of the data at compile time. GCC failed to do so in this case; Clang managed it. Either result looks pretty OK regardless.

As for the "function suitable for larger chunks", it will likely start by checking for alignment, then treat misaligned bytes at the beginning and end of the data as special cases, and perform the rest of the checks word by word, 8 bytes at a time on x86-64.
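A minimal sketch of such a helper might look like this (big_equal is a hypothetical name; equality-only for brevity, no SIMD, and it reuses the memcpy-for-unaligned-loads trick from the code above; ordering would additionally need a byte swap):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sketch: byte-compare a misaligned head until p1 is 8-byte aligned,
 * compare the body 8 bytes at a time, then byte-compare the tail.
 * Returns 1 if the ranges are equal, 0 otherwise. */
static int big_equal(const void *s1, const void *s2, size_t n)
{
    const unsigned char *p1 = s1, *p2 = s2;

    /* head: advance until p1 is 8-byte aligned (or n runs out) */
    while (n && ((uintptr_t)p1 & 7)) {
        if (*p1 != *p2)
            return 0;
        p1++; p2++; n--;
    }

    /* body: 8 bytes at a time; memcpy keeps the p2 load alignment-safe */
    while (n >= 8) {
        uint64_t w1, w2;
        memcpy(&w1, p1, 8);
        memcpy(&w2, p2, 8);
        if (w1 != w2)
            return 0;
        p1 += 8; p2 += 8; n -= 8;
    }

    /* tail */
    while (n--) {
        if (*p1++ != *p2++)
            return 0;
    }
    return 1;
}
```

A real libc version would additionally use SIMD for the body, as Peter Cordes's answer describes.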

And as noted, implementing such library functions without abusing strict aliasing can be tricky: many functions in glibc and similar libraries contain blatant strict-aliasing violations, which is fine as long as the library isn't compiled as strictly conforming standard C. Compiled otherwise, those libraries will often break spectacularly, so follow the build instructions for the library if you build it yourself. See the section of this answer about glibc's strlen: "Why this is safe as part of glibc but not otherwise."

But the code in this answer is actually safe; memcpy is safe for aliasing and alignment.
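The memcpy idiom that makes it safe can be isolated into a tiny helper (load_u32 is a made-up name for illustration; compilers typically compile it to a single 32-bit load on targets that allow unaligned access):

```c
#include <stdint.h>
#include <string.h>

/* Sketch: memcpy-based load avoids the strict-aliasing violation of a
 * pointer cast like *(const uint32_t *)p, and makes no alignment
 * assumption about p. */
static uint32_t load_u32(const void *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);  /* legal for any object type, any alignment */
    return v;
}
```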