When writing computational code for GPUs using APIs where compute shaders are translated via SPIR-V (in particular, Vulkan), I am guaranteed that the ULP error of floating-point division will be at most 3. Other basic arithmetic (addition, multiplication) is correctly rounded.
How can correctly rounded division be implemented efficiently under these circumstances? Let's assume that FMA is available and correctly rounded.
What happens with denormals will depend on the underlying hardware. The Vulkan API allows querying whether the device can preserve denormals and whether it can flush them to zero (so a GPU that does not fully support denormals will report "canPreserve: false, canFlush: true"). Let's additionally assume that the GPU can produce and handle denormals without flushing them to zero (otherwise trying to produce a correctly rounded result which is subnormal seems futile).
Modern FMA-based division algorithms typically rely heavily on the work of Peter Markstein who worked on this issue extensively. The canonical reference is:
Peter Markstein, "IA-64 and Elementary Functions: Speed and Precision", Prentice-Hall 2000.
The C code below is also based on Markstein's work. It basically consists of a fast path that handles common cases as quickly as possible, and a slow path that handles all the pesky special cases.
The concepts involved are pretty straightforward, but they do require careful mathematical analysis to ensure a correctly rounded result. The first step is to compute an accurate reciprocal of the divisor. This requires the accurate computation of the approximation error, for which FMA is the best tool. The refinement of the reciprocal from a (hardware-based) approximation typically takes the form of a single Newton-Raphson iteration with quadratic convergence or a single Halley iteration with cubic convergence, both of which map perfectly onto FMA.
Multiplying the reciprocal of the divisor with the dividend then gives us an approximation to the quotient. This is likewise refined based on the FMA-based computation of the residual. At this stage one typically works with a version of the dividend scaled towards unity to guard against overflow and underflow in intermediate computation and allow for the widest possible range of operands in the fast path of the division algorithm. This also means that there is a need for a final scaling step at the end to adjust the quotient magnitude.
An advantageous arrangement of the two paths through the code is to first unconditionally perform the fast-path computation, then check at the end whether the conditions for fast-path usage are met, and if not, invoke the slow-path code. This allows the necessary conditions to be computed concurrently with the fast-path computation, and allows the slow-path computation to be outlined, which facilitates placing that code in a rarely used memory page where various slow-path and exception-handling code is collected.
Please treat the code below as an algorithmic sketch. The included "smoke"-level test based on random test vectors is far from a rigorous test framework. The conditions for invoking the slow path have not been shown to be as tight as possible, as tight as necessary, or arranged in the most efficient way. The refinement of the reciprocal uses a single Newton-Raphson step, but that may not be sufficient for a given GPU's reciprocal approximation, and a Halley iteration may need to be substituted at the cost of an additional FMA. Finally, I have omitted the construction of the entire slow path, as that would exceed the limitations of a Stack Overflow answer, both in terms of the design effort and the amount of textual description.
Some additional notes after re-acquainting myself with the issue. It is advantageous if the hardware's reciprocal approximation instruction treats subnormals as zero. This allows the elimination of the divisor magnitude check in the invocation of the slow-path code, as any divisor greater than 2^126 will result in a reciprocal of zero and a quotient of infinity, thus triggering the quotient check.
As for the quotient check, any fast-path result that represents a normal can be accepted, except for the smallest normal, which may represent an incorrect rounding from a subnormal result. In other words, quotients whose absolute value is in [0x1.000002p-126, 0x1.fffffep+127] should not trigger re-computation by the slow-path code.
There is no need for a quotient-refinement step as long as a sufficiently accurate reciprocal is used. An exhaustive test that covers all dividends in [1,2] and all divisors in [1,2], and thus all possible combinations of significand (mantissa) patterns, is feasible with modern hardware, requiring 2^46 test cases. Even with scalar code running on a single CPU core it completed in less than four days, with no errors reported.
In practical usage, where possible, one would want to force the compiler to inline fp32_div() while forcing fp_div_slowpath() into a called subroutine.

Below I prototyped an AVX-based version of single-precision division using the simplifications discussed above. It passed all my tests. Since the accuracy of the AVX-based hardware reciprocal is low, a Halley iteration is required to refine the reciprocal to full single precision, delivering results with a maximum error close to 0.5 ulp.
Corresponding CUDA code (tested) for an NVIDIA GPU looks as follows: