Why do we set the least significant bit as part of converting a 64-bit unsigned integer to a 32-bit float on x86?

#include <cstddef>
float cast(size_t sz){return sz;}

Compiling the above code on Clang 13.0.1 with -mavx2 -O3 produces the following assembly:

cast(unsigned long):                               # @cast(unsigned long)
        test    rdi, rdi
        js      .LBB0_1
        vcvtsi2ss       xmm0, xmm0, rdi
        ret
.LBB0_1:
        mov     rax, rdi
        shr     rax
        and     edi, 1
        or      rdi, rax
        vcvtsi2ss       xmm0, xmm0, rdi
        vaddss  xmm0, xmm0, xmm0
        ret

Similar code is produced by GCC, MSVC and the Intel compiler, including in older versions.

I understand the general goal of the algorithm: it works around the fact that x86 has no instruction for converting a 64-bit unsigned integer to float or double until AVX-512.

So if the number is large enough to be interpreted as negative, it halves it, converts it, and then doubles it. However, what is the purpose of setting the least significant bit if it was set in the original integer?

It seems like a waste of time, since the float only has 24 significant bits (23 explicitly stored), and this bit is guaranteed to fall well below them. Perhaps if it were an increment instruction instead, it could affect the significant bits in some cases. But a mere or instruction doesn't seem to do anything.

Answer by Mike Vine:

You are assuming that the bottom bit can never matter - this isn't true. There are a few corner-case values where the bottom bit does affect the conversion. Consider:

#include <cstdint>
#include <iostream>

int main()
{
    uint64_t start = 0x8000000000000000ull;

    // Step through large values with the sign bit set and report
    // whenever the bottom bit changes the converted result.
    for (uint64_t i = start; i < (start + 0x10000000000000ull); i += 0x1000000ull)
    {
        float f = (float)(i);
        float g = (float)(i + 1);
        if (f != g)
        {
            std::cout << "From = " << std::hex << i << " we get "
                      << std::hexfloat << f << " and from " << (i + 1)
                      << " we get " << std::hexfloat << g << std::endl;
        }
    }
}

This iterates through a range of large values and, using the handy std::hexfloat modifier, prints the cases where the bottom bit makes a difference. It shows that a few values do need this adjustment:

From = 8000008000000000 we get 0x1p+63 and from 8000008000000001 we get 0x1.000002p+63
From = 8000028000000000 we get 0x1.000004p+63 and from 8000028000000001 we get 0x1.000006p+63
From = 8000048000000000 we get 0x1.000008p+63 and from 8000048000000001 we get 0x1.00000ap+63
From = 8000068000000000 we get 0x1.00000cp+63 and from 8000068000000001 we get 0x1.00000ep+63
From = 8000088000000000 we get 0x1.00001p+63 and from 8000088000000001 we get 0x1.000012p+63
From = 80000a8000000000 we get 0x1.000014p+63 and from 80000a8000000001 we get 0x1.000016p+63
...

Link: https://godbolt.org/z/jTeo1Ko4s