I have been playing around with the Godbolt Compiler and typed this code:
constexpr int func(int x)
{
return x > 3 ? x * 2 : (x < -4 ? x - 4 : x / 2);
}
int main(int argc)
{
return func(argc);
}
The code is somewhat straight forward. The important part here is the final division by 2 inside func(int x). Since x is an integer, basically any compiler would simplify this to a shift to avoid division instructions.
The assembly from x86-64 gcc 12.2 -O3 (for Linux, thus System V ABI) looks like this:
main:
cmp edi, 3
jle .L2
lea eax, [rdi+rdi]
ret
.L2:
cmp edi, -4
jge .L4
lea eax, [rdi-4]
ret
.L4:
mov eax, edi
mov ecx, 2
cdq
idiv ecx
ret
You can see the final idiv ecx command which is not a shift but an actual division by 2. I also tested clang and clang does actually reduce this to a shift.
main: # @main
mov eax, edi
cmp edi, 4
jl .LBB0_2
add eax, eax
ret
.LBB0_2:
cmp eax, -5
jg .LBB0_4
add eax, -4
ret
.LBB0_4:
mov ecx, eax
shr cl, 7
add cl, al
sar cl
movsx eax, cl
ret
Could this be due to inlining maybe? I am very curious about whats going on here.
GCC treats
mainspecially: implicit__attribute__((cold))So
maingets optimized less (or favouring size over speed), as it's usually only called once in most programs.__attribute__((cold))isn't quite the same as-Os(optimize for size), but it's a step in that direction which sometimes gets the cost heuristics to pick a naive division instruction.As GCC dev Marc Glisse commented, don't put your code in a function called
mainif you're benchmarking it or looking at how it optimizes. (There can be other special things besidescold, e.g. MinGW GCC puts in an extracallto an init function, andgcc -m32adds code to align the stack by 16. All of which are noise you don't want for the code you're looking at. See also How to remove "noise" from GCC/clang assembly output?)Another Q&A shows GCC putting
mainin the.text.startupsection, along with other presumed "cold" functions. (This is good for TLB and paging locality; hopefully a whole page of init functions can get evicted after process startup. The idea is that code inmainprobably only runs once, with real work happening inside some function it calls. This may not be true if real work inlines into main, or for simple programs.)This is a bad heuristic for toy programs with all their code in
main, but it's what GCC does. Most real programs people run regularly aren't toys and have enough code in some other function that it doesn't inline intomain. Although it would be nice if the heuristic was a little smarter and removed thecoldif it turned out that the whole program or all the functions in a loop did optimize intomain, since some real programs are pretty simple.You can override the heuristic with a GNU C function attribute.
__attribute__((hot)) int main(){ ...optimizes the way you expect(Godbolt from Sopel's comment, with attribute added).
__attribute__((cold))on a function not calledmainproducesidiv.__attribute__((optimize("O3")))doesn't help.int main(int x, char **y){ return x/2; }does still use shifts withgcc -O2, so main beingcolddoesn't always have that effect (unlike-Os).But perhaps with your division already being conditional, GCC guesses that basic block doesn't even run every time so that's more reason to make it small instead of fast.
Insanely, GCC
-Osfor x86-64 (Godbolt) does useidivfor signed division a constant 2, not just for arbitrary constants (where GCC normally uses a multiplicative inverse even at-O0). It doesn't save much if any code size vs. an arithmetic right shift with a fixup to round towards zero (instead of -inf), and can be much slower, especially for 64-bit integers on Intel before Ice Lake. Same for AArch64, where it's 2 fixed-size instructions either way, withsdivalmost certainly being much slower.sdivdoes save some code size on AArch64 for higher powers of 2 (Godbolt), but still so much slower that it's probably not a good tradeoff for-Os.idivdoesn't save instructions on x86-64 (ascdqorcqointo RDX is required), although maybe a couple bytes of code size. So probably only good for-Ozwhere it would also usepush 2/pop rcxto get a small constant into a register in 3 bytes of x86-64 machine code instead of 5.