I've read this article and do-denormal-flags-like-denormals-are-zero-daz-affect-comparisons-for-equality, and I understand the usage of the FTZ and DAZ flags and the difference between them.
DAZ applies on input, FTZ on output from an FP operation.
What confuses me is where a denormal value can come from, at the assembly level, if FTZ is set. I think it can only be a constant, either as an immediate operand or from section `.rodata` (accessed with RIP-relative addressing).
But in my binary there are no denormal values in these places, and it still suffers from FP-ASSIST issues that cause bad performance.
If I set both DAZ and FTZ, the issue disappears and performance gets better. I can't even find any denormal inputs in my source code. I am really confused: where do the denormal values come from?
Another question, by the way: for the instruction `vmovsd 0x9498(%rip),%xmm0`, supposing `0x9498(%rip)` holds a denormal value, what happens to `xmm0` after this instruction executes if we set FTZ or DAZ, respectively?
In my understanding, DAZ would make it treat the value at `0x9498(%rip)` as zero and move 0 into `xmm0`; FTZ would move the value into `xmm0`, find that it is a denormal, and flush `xmm0` to zero. Is that correct?
A denormal (aka subnormal) is a value with exponent field = 0 and a non-zero mantissa in the IEEE binary format. (Exponent field = 0 with a zero mantissa is just ±0.0.) https://en.wikipedia.org/wiki/Double-precision_floating-point_format
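To make the definition concrete, here is a minimal sketch of a bit-level test for a subnormal `double`. The function name `is_subnormal` is made up for illustration, and it assumes `double` is IEEE binary64 (as it is on x86).

```c
#include <float.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: returns 1 if x is subnormal, i.e. the biased
   exponent field is 0 and the mantissa field is non-zero.
   Assumes double is IEEE binary64. */
int is_subnormal(double x)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);               /* well-defined type pun */
    uint64_t exponent = (bits >> 52) & 0x7FF;     /* 11-bit biased exponent */
    uint64_t mantissa = bits & ((1ULL << 52) - 1); /* low 52 bits */
    return exponent == 0 && mantissa != 0;
}
```

For example, `is_subnormal(DBL_MIN / 4.0)` is 1, while `is_subnormal(DBL_MIN)` and `is_subnormal(0.0)` are both 0. The standard `fpclassify(x) == FP_SUBNORMAL` from `<math.h>` answers the same question; the integer version is shown only because it makes the exponent-field rule visible.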
When an FP math instruction (not move or pure bitwise boolean) reads such a number as an input operand, it has to handle that special case when lining up the mantissa with the other operand, and when applying the implicit top bit of the mantissa that's implied by the exponent being 0 or non-zero.
Yes, most of the time FTZ on output is sufficient because most floating-point values are the results of other FP computations. And yes, FTZ is necessary because mul/div/add/sub on normal numbers can create a subnormal result. (For add, the inputs need opposite signs.) The other IEEE "basic" exactly-rounded operation, sqrt, can't create subnormals because it makes numbers closer to 1.0.
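As a rough illustration of a multiply on two normal inputs producing a subnormal, and of FTZ replacing that result with zero, something like the following could be used. It is only a sketch: `mul_with_ftz` and `MXCSR_FTZ` are names invented here, and it assumes x86-64 with GCC or Clang, where `double` math compiles to SSE2 instructions that obey MXCSR.

```c
#include <float.h>
#include <xmmintrin.h>   /* _mm_getcsr, _mm_setcsr (x86 SSE) */

#define MXCSR_FTZ 0x8000u   /* MXCSR bit 15: flush-to-zero */

/* Hypothetical helper: multiply a * b with FTZ forced on or off,
   then restore the caller's MXCSR.
   Assumes x86-64, where the multiply is a mulsd that obeys MXCSR. */
double mul_with_ftz(double a, double b, int ftz_on)
{
    unsigned int saved = _mm_getcsr();
    _mm_setcsr(ftz_on ? (saved | MXCSR_FTZ) : (saved & ~MXCSR_FTZ));

    /* volatile stops the compiler from folding the multiply at compile
       time or moving it outside the region where MXCSR is modified. */
    volatile double va = a, vb = b;
    volatile double r = va * vb;

    _mm_setcsr(saved);
    return r;
}
```

With FTZ off, `mul_with_ftz(1e-200, 1e-120, 0)` returns a subnormal near 1e-320 (greater than 0 but less than `DBL_MIN`). With FTZ on, the same call returns 0.0. Results in the normal range, such as `mul_with_ftz(2.0, 3.0, 1)`, are unaffected.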
The obvious thing would be to use `perf record` to find out where you're getting FP-assists, and add some extra checks there to print or something when you find a denormal there. (Then set a breakpoint in that branch so you can examine the situation.)

Possible sources of denormals (not exhaustive) with FTZ set, i.e. other than FP math ops:

- `strtod`
- `nextafter`. Also possibly as part of the internals of an `exp` implementation that stuffs an integer into an exponent field of a `double`.
- `static double foo = DBL_MIN / 4.0;` would be a compile-time denormal. But you would find them in `.rodata` or `.data`. Non-const non-zero static / global variables go in `.data`.

Obviously any manual manipulation of FP bit-patterns using integer stuff can do it, too. "How to use bits in a byte to set dwords in ymm register without AVX2? (Inverse of vmovmskps)" could have produced denormal inputs to a compare if I didn't spend an extra instruction to avoid it, but that's an unusual manual vectorization trick that compilers wouldn't be doing for you.
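The "extra checks" idea could look something like this sketch. Every name here (`check_subnormal`, `on_subnormal`, `subnormal_hits`, `from_text`, `from_bits`) is hypothetical, and it assumes IEEE binary64 `double`. It also exercises two of the non-math sources mentioned above: parsing text with `strtod`, and building a bit pattern with integer operations.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

unsigned long subnormal_hits;   /* how many subnormals were seen so far */

/* Hypothetical hook: set a debugger breakpoint on this function. */
void on_subnormal(const char *where, double x)
{
    ++subnormal_hits;
    fprintf(stderr, "subnormal at %s: %a\n", where, x);
}

/* The test itself uses only integer operations on the bit pattern,
   so it gives the same answer whether or not DAZ is set. */
int check_subnormal(const char *where, double x)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);
    /* exponent field == 0, and mantissa (everything below it) != 0 */
    int sub = ((bits >> 52) & 0x7FF) == 0 && (bits << 12) != 0;
    if (sub)
        on_subnormal(where, x);
    return sub;
}

/* Source 1: text input that parses to a value below DBL_MIN. */
double from_text(void)
{
    return strtod("1e-310", NULL);
}

/* Source 2: integer manipulation of the bit pattern
   (here, the smallest positive subnormal). */
double from_bits(void)
{
    uint64_t one = 1;
    double d;
    memcpy(&d, &one, sizeof d);
    return d;
}
```

You would sprinkle calls like `check_subnormal("after parse", x)` near the hot spots that `perf` points at, then break on `on_subnormal` in a debugger to look at the call stack.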
x86 doesn't have FP immediates; you'd have to `mov rax, imm64` / `movq xmm0, rax` or similar. But compilers don't do that because it's generally more efficient to load from `.rodata`.

`vmovsd` is just a load and always copies the 64 bits exactly; architecturally equivalent to a `vmovq` SIMD-integer load.

It doesn't run the value through an ALU so no MXCSR bits have any effect on `vmovsd`, FP shuffles, etc. Only instructions that do actual FP math and can raise FP exceptions are affected. You can tell by looking at the exceptions section of the asm manual entry. e.g. `roundsd` does obey DAZ to possibly round the input to zero before rounding it according to the specified mode.
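The difference between a plain move and an FP math instruction under DAZ can be checked with a sketch like this one. The names `copy_under_daz`, `mul_under_daz`, `tiny_subnormal`, `bits_of` and `MXCSR_DAZ` are invented for illustration, and it assumes x86-64 with GCC or Clang on a CPU that supports the DAZ bit.

```c
#include <stdint.h>
#include <string.h>
#include <xmmintrin.h>   /* _mm_getcsr, _mm_setcsr (x86 SSE) */

#define MXCSR_DAZ 0x0040u   /* MXCSR bit 6: denormals-are-zero */

/* Smallest positive subnormal, built from its bit pattern (no FP math). */
static double tiny_subnormal(void)
{
    uint64_t b = 1;
    double d;
    memcpy(&d, &b, sizeof d);
    return d;
}

static uint64_t bits_of(double d)
{
    uint64_t b;
    memcpy(&b, &d, sizeof b);
    return b;
}

/* Copy a subnormal while DAZ is set; returns the bits of the copy. */
uint64_t copy_under_daz(void)
{
    unsigned int saved = _mm_getcsr();
    _mm_setcsr(saved | MXCSR_DAZ);
    volatile double src = tiny_subnormal();
    volatile double dst = src;        /* plain load + store, no FP math */
    _mm_setcsr(saved);
    return bits_of(dst);
}

/* Multiply a subnormal by 1.0 while DAZ is set; returns the result bits. */
uint64_t mul_under_daz(void)
{
    unsigned int saved = _mm_getcsr();
    _mm_setcsr(saved | MXCSR_DAZ);
    volatile double src = tiny_subnormal();
    volatile double one = 1.0;
    volatile double prod = src * one; /* mulsd reads src as 0.0 under DAZ */
    _mm_setcsr(saved);
    return bits_of(prod);
}
```

Here `copy_under_daz()` returns 1 (the subnormal's bit pattern survives the move), while `mul_under_daz()` returns 0 (the multiply saw its input as zero).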