The 8086/8087/8088 MACRO ASSEMBLY LANGUAGE REFERENCE MANUAL (c) 1980 Intel Corporation mentions (Emphasis mine):
... the 8087 provides a very good approximation of the real number system. It is important to remember, however, that it is not an exact representation, and that arithmetic on real numbers is inherently approximate.
Conversely, and equally important, the 8087 does perform exact arithmetic on its integer subset of the reals. That is, an operation on two integers returns an exact integral result, provided that the true result is an integer and is in range.
A more recent manual is even more concise (Emphasis theirs):
the IA processors ... They can process decimal numbers of up to 18 digits without round-off errors, performing exact arithmetic on integers as large as 2^64 (or 10^18).
The integer data types that the FPU supports include signed word (16 bit), signed dword (32 bit), and signed qword (64 bit). There is never any mention of UNsigned. In fact, everything about the FPU breathes signedness, to the extent that they even support signed zero (+0 and -0).
So, is it possible to use the FPU to divide a couple of unsigned 64-bit numbers and get an exact quotient and remainder for it?
For the division of a couple of signed 64-bit numbers I wrote next code. The quotient seems fine but the remainder is always returned zero. Why is this?
; IN (edx:eax,ecx:ebx) OUT (edx:eax,ecx:ebx,CF)
FiDiv: push edi ecx ebx edx eax
mov edi, esp
fninit
fild qword [edi] ; Dividend
fild qword [edi+8] ; Divisor
fld
fnstcw [edi]
or word [edi], 0C00h ; Truncate Towards Zero
fldcw [edi]
fdivr st2 ; st0 = st2 / st0
fld ; Duplicate because `fistp` does pop
fistp qword [edi] ; Quotient
fmulp ; st1 *= st0, pop st0
fsubp ; st1 -= st0, pop st0
fistp qword [edi+8] ; Remainder
fnstsw ax
test ax, 101b ; #Z (Divide-By-Zero), #I (Invalid Arithmetic)
pop eax edx ebx ecx edi
jnz .NOK
ret ; CF=0
.NOK: stc ; Overflow on 8000000000000000h / -1
ret ; or Division by zero
; ------------------------------
The FPU rounding mode is set to 'Truncate Towards Zero' aka 'Chop', so as to mimic the behavior of the ALU idiv instruction.
Zero remainder
This code calculates the remainder from:
Because the Rounding Mode got set to 'Truncate Towards Zero', the
fistp qword [edi]instruction will convert the (copy of the) quotient held in ST0 to an integer right before storing to memory. However the value for the quotient that remains on the (fpu) stack is still a real number with a fraction. Once multiplied with the divisor it will produce the dividend again, resulting in a zero remainder.What is missing is rounding the quotient to an integer and doing it on the (fpu) stack already:
But a faster way is to simply reload the integer quotient from memory:
Unsigned division
Internally, the FPU dedicates 64 bits to the significand of a number plus a separate bit for the sign of the number. The FPU can represent every whole number in the range from -18'446744073'709551616 to 18'446744073'709551616. The 64-bit significand allows us to work with the unsigned 64-bit integers that range from 0 to 18'446744073'709551615. The only trouble is how to load and store these values that
fildandfistpcan't deal with (because they are restricted to operate in the range from -9'223372036'854775808 to 9'223372036'854775807).It would be possible to convert back and forth between unsigned quadword and the extended real format so we could use
fldandfstpinstead. Another approach would load/store the unsigned quadwords from/to their upper and lower halves. But conversions take time and so I found that by eliminating enough special cases, the only troublesome operation that remains was the loading of the dividend. Everything else can just usefildandfistpas usual.The special cases include:
Where an actual
fdivis needed, the code first loads half of the dividend, doubles it back on the (fpu) stack, and conditionally adds 1 if the true dividend was odd.Faster results
Quoting from an answer that uses a technique named Chunking to divide a couple of 64-bit integers:
Today's code can benefit from the same prefix, but as it turns out, it is more beneficial to keep detecting the special cases first: