Vector by Scalar Division with -ffast-math

typedef float float4 __attribute__((vector_size(16)));
float4 divvs(float4 vector, float scalar) {
    return vector / scalar;
}

compiles to

// x86 gcc/clang -O3
    shufps  xmm1, xmm1, 0
    divps   xmm0, xmm1

// arm gcc/clang -O3
    dup     v1.4s, v1.s[0]
    fdiv    v0.4s, v0.4s, v1.4s

// x86 gcc -O3 -ffast-math
    shufps  xmm1, xmm1, 0
    rcpps   xmm2, xmm1
    mulps   xmm1, xmm2
    mulps   xmm1, xmm2
    addps   xmm2, xmm2
    subps   xmm2, xmm1
    mulps   xmm0, xmm2

// x86 clang -O3 -ffast-math
    movss   xmm2, dword ptr [rip + .LCPI0_0] # 1.0f
    divss   xmm2, xmm1
    shufps  xmm2, xmm2, 0
    mulps   xmm0, xmm2

// arm gcc -O3 -ffast-math (same code as without it though)
    dup     v1.4s, v1.s[0]
    fdiv    v0.4s, v0.4s, v1.4s

// arm clang -O3 -ffast-math
    fmov    v2.4s, #1.00000000
    dup     v1.4s, v1.s[0]
    fdiv    v1.4s, v2.4s, v1.4s
    fmul    v0.4s, v0.4s, v1.4s

My understanding is that -ffast-math allows a reciprocal approximation to be used in place of division. I also understand that the scalar division and reciprocal instructions have at most one cycle of latency difference from their vector counterparts (Intel's intrinsics guide says so; on arm it's a bit harder to find figures, but this and my own benchmarks agree, on Apple Silicon at least).
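To make the comparison concrete, here is a minimal sketch of the two source-level forms that the listings above seem to correspond to: dividing every lane, versus taking one scalar reciprocal and multiplying. The names `divvs_div` and `divvs_recip` are made up for illustration, and the mapping to the clang -ffast-math output is my reading of the assembly, not something the compilers document.

```c
/* Same vector type as in the question (GCC/clang vector extension). */
typedef float float4 __attribute__((vector_size(16)));

/* Form 1: divide every lane by the scalar, as in the original divvs. */
float4 divvs_div(float4 vector, float scalar) {
    return vector / scalar;
}

/* Form 2: one scalar division to get the reciprocal, then a vector
   multiply. This looks like what the clang -ffast-math listings do
   (divss + shufps + mulps on x86). Hypothetical helper name. */
float4 divvs_recip(float4 vector, float scalar) {
    float r = 1.0f / scalar;
    return vector * r;
}
```

Without -ffast-math the two forms are not interchangeable in general, since `x * (1.0f / s)` rounds twice and can differ from `x / s` in the last bit; they agree exactly when the reciprocal is exactly representable, for example when `s` is a power of two.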

My questions are:

  1. What's going on with the x86 gcc version? My guess is that rcpps on its own has too much error for -ffast-math, and whatever gcc does here gets the error down below some threshold. I can't quite wrap my head around why it multiplies by its own reciprocal though, nor the code that follows. I'm pretty curious about the math.
  2. Aren't both clang versions just unconditionally worse than the one without -ffast-math? Unfortunately I don't have an Intel machine handy to benchmark. The arm clang version takes 25% longer on my M1 Mac though (putting fmov outside the loop naturally makes little difference; fdiv s0 and fdiv v0.4s are exactly the same on their own). Intel's guide says divss and divps have the same latency too, so even if reciprocal estimate instructions didn't exist, how could it not be better in all respects to just shufps and divps? Just in principle, why does -ffast-math put clang through all this pain to multiply by reciprocal rather than divide, when the reciprocal comes from a division?
  3. Why doesn't either arm version use frecpe (arm's rcpps)? It's 10x faster. I checked -march=armv9.3-a, it isn't that. The only thing I can think of is that there's some standard for accuracy -ffast-math still has to meet, and frecpe doesn't meet it. But even then, assuming my theory in question #1 is correct, the only way gcc could justify the inconsistency between x86 and arm is if it somehow took fewer resources to fdiv than it does to get the error from frecpe down to the error in xmm2 by the end of the x86 version, presumably because rcpps is more accurate. That definitely doesn't sound right. Unfortunately I can't find error numbers for frecpe anywhere.
  4. If arm clang must insist on fdiv then fmul, can't it skip the dup and just fmul v0.4s, v0.4s, v1.s[0]? What's wrong with that?
  5. To be intentionally vague, what's the "best" code to do this?
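For question 1, tracing the registers in the x86 gcc -ffast-math listing suggests it computes `2*x0 - (d*x0)*x0`, where `d` is the broadcast scalar and `x0` is the rcpps result. That has the shape of one Newton-Raphson step for the reciprocal, `x1 = x0 * (2 - d*x0)`: if `x0 = (1 - e)/d`, then `x1 = (1 - e*e)/d`, so the relative error is squared. Below is a scalar sketch of that arithmetic. `rough_recip` is a hypothetical stand-in for a hardware estimate, and the 2^-12 error it injects is an assumption chosen for illustration, not a measured property of rcpps or frecpe.

```c
/* Hypothetical stand-in for a hardware reciprocal estimate: the exact
   reciprocal deliberately perturbed by a relative error of 2^-12.
   (Assumption for illustration only.) */
float rough_recip(float d) {
    return (1.0f / d) * (1.0f - 1.0f / 4096.0f);
}

/* One Newton-Raphson step for f(x) = 1/x - d, i.e. x1 = x0 * (2 - d*x0),
   written as 2*x0 - (d*x0)*x0 to mirror the mulps/mulps/addps/subps
   pattern in the gcc listing. */
float refine_recip(float d, float x0) {
    return (x0 + x0) - (d * x0) * x0;
}

/* Relative error of x as an approximation to 1/d, evaluated in double. */
double recip_rel_error(float d, float x) {
    double e = (double)x * (double)d - 1.0;
    return e < 0.0 ? -e : e;
}
```

With `d = 3.0f`, the error of `rough_recip(d)` is about 2.4e-4, and one call to `refine_recip` should bring it down to roughly float rounding level, which would be consistent with the extra instructions being an accuracy fix-up rather than wasted work.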