I don't really understand what optimization this is. All I know is that it's really fast, and I've tried many options from the manual to no avail. Can anyone explain in detail what optimization this is, and which GCC option generates the same asm as ICC, or better?
I'm not sure there is an option that makes GCC do this optimization. Don't write loops that redo the same work 100k times if you don't want your program to spend time doing that.
Defeating benchmark loops can make compilers look good on benchmarks, but AFAIK it is often not useful in real-world code, where something else happens between runs of the loop you want optimized.
ICC is defeating your benchmark repeat-loop by turning it into this
The first step, swapping the inner and outer loops, is called loop interchange. Making one pass over the array is good for cache locality and enables further optimizations.
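
As a rough sketch of what loop interchange means here, the two functions below compute the same sum with the loops in opposite order. The function names, the `std::vector<int>` input, and the `>= 128` threshold are assumptions for illustration; they are not taken from the compiler output discussed in this answer.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed shape of the benchmark: the repeat loop is outermost,
// so the whole array is re-read `reps` times.
std::int64_t sum_repeat_outer(const std::vector<int>& data, int reps) {
    std::int64_t sum = 0;
    for (int i = 0; i < reps; ++i) {
        for (std::size_t c = 0; c < data.size(); ++c) {
            if (data[c] >= 128)   // threshold is an assumption
                sum += data[c];
        }
    }
    return sum;
}

// After loop interchange: a single pass over the array, with the
// repeat loop moved inside. The result is the same.
std::int64_t sum_interchanged(const std::vector<int>& data, int reps) {
    std::int64_t sum = 0;
    for (std::size_t c = 0; c < data.size(); ++c) {
        for (int i = 0; i < reps; ++i) {
            if (data[c] >= 128)
                sum += data[c];
        }
    }
    return sum;
}
```

In the interchanged form, `data[c]` and the condition are both invariant in the inner loop, which is what makes the later steps possible.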
Turning `for() if()` into `if() for(){} else for(){}` is called loop unswitching. In this case, there is no "else" work to do; the only thing in the loop was `if() sum+=...`, so it becomes just an `if` controlling a repeated-addition loop.
ICC unrolls + vectorizes that `sum +=`, strangely not just doing it with a multiply. Instead it does 100000 64-bit add operations. `ymm0` holds `_mm256_set1_epi64x(data[c])` from `vpbroadcastq`.
This inner loop only runs conditionally; it's worth branching if it's going to save 6250 iterations of this loop. (Only one pass over the array, one branch per element total, not 100k.)
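
A minimal scalar sketch of the unswitched form, plus the multiply that the repeated addition is equivalent to. The function names, the `std::vector<int>` input, and the `>= 128` threshold are assumptions for illustration. This is not ICC's actual output, which uses vector adds.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// After unswitching: the loop-invariant condition is tested once per
// element, and the inner loop is nothing but repeated addition.
std::int64_t sum_unswitched(const std::vector<int>& data, int reps) {
    std::int64_t sum = 0;
    for (std::size_t c = 0; c < data.size(); ++c) {
        if (data[c] >= 128) {                 // threshold is an assumption
            for (int i = 0; i < reps; ++i)
                sum += data[c];
        }
    }
    return sum;
}

// The repeated addition collapsed into one multiply per element.
std::int64_t sum_multiplied(const std::vector<int>& data, int reps) {
    std::int64_t sum = 0;
    for (std::size_t c = 0; c < data.size(); ++c) {
        if (data[c] >= 128)
            sum += static_cast<std::int64_t>(reps) * data[c];
    }
    return sum;
}
```

Both functions return the same value for any non-negative `reps`; the second simply does one multiply where the first does `reps` additions.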
Every iteration does 16 additions, 4 per instruction, unrolled by 4 into separate accumulators that are reduced to 1 and then hsummed after the loop. Unrolling lets Skylake and later run 3 `vpaddq` per clock cycle.
By contrast, GCC does multiple passes over the array, inside the loop vectorizing to branchlessly do 8 compares:
This is inside a repeat loop that makes multiple passes over the array, and might bottleneck on 1 shuffle uop per clock, like 8 elements per 3 clock cycles on Skylake.
So it just vectorized the inner `if() sum+=data[c]` like you'd expect, without defeating the repeat loop at all. Clang does something similar, as in "Why is processing an unsorted array the same speed as processing a sorted array with modern x86-64 clang?"
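
To make "branchless" concrete, here is a scalar model of the compare-and-mask idea: the comparison result becomes an all-ones or all-zeros mask that is ANDed with the element before adding. This is only a sketch under assumptions (the function name, the `std::vector<int>` input, and the `>= 128` threshold are made up for illustration); the real GCC code does this on 8 elements at a time with SIMD instructions, and its exact instruction sequence is not reproduced here.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// The repeat loop is kept as written: every pass re-reads the array.
std::int64_t sum_branchless(const std::vector<int>& data, int reps) {
    std::int64_t sum = 0;
    for (int i = 0; i < reps; ++i) {
        for (std::size_t c = 0; c < data.size(); ++c) {
            // mask is -1 (all bits set) when the condition holds, else 0
            const int mask = -static_cast<int>(data[c] >= 128);
            sum += data[c] & mask;   // adds data[c] or 0, with no branch
        }
    }
    return sum;
}
```

The work per pass no longer depends on how predictable the data is, but the total work still scales with the repeat count, unlike the interchanged and unswitched version.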