How does the int to float cast work for large numbers?


If we cast an integer to a float, the value has to be rounded (or truncated) once it becomes too large to be represented exactly by a floating-point number. Here is a small test program to take a look at this rounding.

#include <stdio.h>

#define INT2FLOAT(num) printf(" %d: %.0f\n", (num), (float)(num));

int main(void)
{
    INT2FLOAT((1<<24) + 1);
    INT2FLOAT((1<<24) + 2);
    INT2FLOAT((1<<24) + 3);
    INT2FLOAT((1<<24) + 4);
    INT2FLOAT((1<<24) + 5);
    INT2FLOAT((1<<24) + 6);
    INT2FLOAT((1<<24) + 7);
    INT2FLOAT((1<<24) + 8);
    INT2FLOAT((1<<24) + 9);
    INT2FLOAT((1<<24) + 10);

    return 0;
}

The output is:

 16777217: 16777216
 16777218: 16777218
 16777219: 16777220
 16777220: 16777220
 16777221: 16777220
 16777222: 16777222
 16777223: 16777224
 16777224: 16777224
 16777225: 16777224
 16777226: 16777226

Values exactly in the middle between two representable integers are sometimes rounded up and sometimes rounded down. It seems like some sort of round-to-even is being applied. How does this work exactly? Where can I find the code that does this conversion?


3 Answers

M.M (accepted answer)

The behaviour of this implicit conversion is implementation-defined (C11 6.3.1.4/2):

If the value being converted is in the range of values that can be represented but cannot be represented exactly, the result is either the nearest higher or nearest lower representable value, chosen in an implementation-defined manner.

This means your compiler should document how it works, but you may not be able to control it.

There are various functions and macros for controlling the rounding direction when rounding a floating-point source to an integer, but I'm not aware of any for the case of converting integer to floating.
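
For the float-to-integer direction, a minimal sketch of that control (assuming <fenv.h> and <math.h> are available and the program is built without aggressive constant folding; strictly, #pragma STDC FENV_ACCESS ON is also required):

#include <fenv.h>
#include <math.h>
#include <stdio.h>

int main(void)
{
    volatile double d = 2.5;   /* volatile keeps the rounding at run time */

    /* Floating-point -> integer: lrint() follows the current rounding mode. */
    fesetround(FE_DOWNWARD);
    printf("lrint(2.5), FE_DOWNWARD: %ld\n", lrint(d));  /* 2 */

    fesetround(FE_UPWARD);
    printf("lrint(2.5), FE_UPWARD:   %ld\n", lrint(d));  /* 3 */

    /* Integer -> float: the rounding of this cast is implementation-defined;
       the mode set above is not guaranteed to influence it. */
    fesetround(FE_TONEAREST);
    printf("(float)(2^24 + 1):       %.0f\n", (float)((1 << 24) + 1));
    return 0;
}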

Luis Colorado

In addition to what has been said in the other answers: Intel floating-point units internally use a full 80-bit representation with extra bits of precision, so when the value is rounded to the nearest representable 32-bit float (which is what your output shows), the unit has every bit of the int available and can round exactly.

IEEE-754 specifies a 32-bit float as a number with 23 bits dedicated to storing the significand. For a normalized number the most significant bit is implicit (not stored, as it is always a 1 bit), so you actually have 24 bits of significand of the form 1xxxxxxx_xxxxxxxx_xxxxxxxx. Every integer up to 2^24 can be represented exactly; 2^24 - 1 (11111111_11111111_11111111) is the last one that needs all 24 bits. Beyond 2^24 you can represent all the even numbers, but not the odd ones, because the least significant bit needed for them is no longer stored. This means you are able to represent:

                                                     v decimal dot.
16777210  == 2^24-6        11111111_11111111_11111010.
16777211  == 2^24-5        11111111_11111111_11111011.
16777212  == 2^24-4        11111111_11111111_11111100.
16777213  == 2^24-3        11111111_11111111_11111101.
16777214  == 2^24-2        11111111_11111111_11111110.
16777215  == 2^24-1        11111111_11111111_11111111.
16777216  == 2^24         10000000_00000000_00000000_. <-- here the leap becomes 2 as there are no more than 23 bits to play with.
16777217  == 2^24+1       10000000_00000000_00000000_. (there should be a 1 bit after the last 0)
16777218  == 2^24+2       10000000_00000000_00000001_.
...
33554430  == 2^25-2       11111111_11111111_11111111_.
33554432  == 2^25        10000000_00000000_00000000__. <-- here the leap becomes 4 as there's another shift
33554436  == 2^25+4      10000000_00000000_00000001__.
...
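
A quick way to check these gaps on your own machine is to look at FLT_MANT_DIG and at the distance to the next representable float with nextafterf() (a small sketch, assuming IEEE-754 floats):

#include <float.h>
#include <math.h>
#include <stdio.h>

int main(void)
{
    /* 24 significand bits in total: 23 stored plus the implicit leading 1. */
    printf("FLT_MANT_DIG = %d\n", FLT_MANT_DIG);

    /* Gap between consecutive representable floats just above 2^24 and 2^25. */
    printf("gap above 2^24: %g\n", nextafterf(16777216.0f, INFINITY) - 16777216.0f); /* 2 */
    printf("gap above 2^25: %g\n", nextafterf(33554432.0f, INFINITY) - 33554432.0f); /* 4 */
    return 0;
}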

If you imagine the problem in base 10, assume we have floating-point numbers with just three decimal digits of significand and a power-of-ten exponent. When we begin counting from 0, we get this:

  1  => 1.00E0
...
  8  => 8.00E0
  9  => 9.00E0
 10  => 1.00E1  <<< see what happened here: same significand as the first entry, but with the exponent of ten incremented, i.e. every digit shifted one place to the left.
 11  => 1.10E1
...
 98  => 9.80E1
 99  => 9.90E1
100  => 1.00E2  <<< and here.
101  => 1.01E2
...
996  => 9.96E2
997  => 9.97E2
998  => 9.98E2
999  => 9.99E2
1000 => 1.00E3  <<< exact, but from here on there is no fourth digit left to represent the units.
1001 => 1.00E3  (this number cannot be represented exactly)
...
1004 => 1.00E3  (this number cannot be represented exactly)
1005 => 1.01E3  (this number cannot be represented exactly) <<< here rounding is applied, but the implementation is free to do whatever it wants.
...
1009 => 1.01E3  (this number cannot be represented exactly)
1010 => 1.01E3 <<< this is the next number that can be represented exactly with three floating-point digits. So we switched from an increment of one by one to an increment of ten by ten.
...

Note

The case you show is the default rounding mode of Intel processors (and the IEEE-754 default): round to nearest. When the value lies exactly halfway between two representable numbers, the tie is broken towards the neighbour whose last significand bit is even (zero). Breaking ties this way avoids the bias of always rounding halves up, which matters for instance in banking; banks in fact avoid binary floating point altogether precisely because they need exact control over rounding.
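
The base-10 table above rounds the halfway cases upward for simplicity; the cast in the question breaks those ties towards the even neighbour instead. A small sketch of that ties-to-even rule in the same three-digit decimal setting (round3 is just an illustrative helper, not anything from a library):

#include <stdio.h>

/* Round a positive integer to 3 significant decimal digits,
 * breaking ties towards the even quotient (illustration only). */
static long round3(long n)
{
    long scale = 1;
    while (n / scale >= 1000)      /* keep only the 3 leading digits */
        scale *= 10;

    long q = n / scale;            /* the 3-digit "significand"      */
    long r = n % scale;            /* the part that gets rounded off */

    if (2 * r > scale || (2 * r == scale && (q % 2 == 1)))
        q++;                       /* round up; ties go to even q    */

    return q * scale;
}

int main(void)
{
    for (long n = 998; n <= 1012; n++)
        printf("%4ld => %ld\n", n, round3(n));   /* 1005 => 1000, 1006 => 1010, ... */
    return 0;
}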

Stand with Gaza

As the other answers have stated, the conversion on your machine is almost certainly implemented in hardware, so there is no C or assembly code for you to inspect. That said, the algorithm can also be implemented in software. Here is how it works for positive integers and 32-bit IEEE-754 floats:

  • Find the index of the integer's most significant set bit (msb) and clear that bit.

  • Check whether msb > 23. If it isn't, the integer can be represented exactly and no rounding is necessary.

  • Otherwise, divide the integer by 2^(msb - 23) into quotient (q) and remainder (r).

  • Round up (increment q) if:
      • 2^(msb - 23) - r < r, or
      • 2^(msb - 23) - r = r and q % 2 == 1 (round ties to even).
    Otherwise round down (do nothing).

  • If q = 2^23, increment msb and set q = 0.

  • The significand is q and the (biased) exponent is msb + 127.

The following C code implements the algorithm, using bit twiddling instead of division to make it more efficient. Its inputs are the unsigned integer u32 and its msb; its outputs are the significand sig and the exponent exp:

// Clear the most significant set bit; its index is msb.
u32 -= (1u << msb);

uint32_t sig;
if (msb > 23) {
    // Number of low-order bits to truncate (the shift amount).
    int8_t trunc_msb = msb - 23;
    sig = u32 >> trunc_msb;

    // Upper bound of truncation range.
    uint32_t upper = 1u << trunc_msb;

    // Truncated value (the bits that get dropped).
    uint32_t trunc = u32 & (upper - 1);

    // Distance to the upper and lower bound (which is zero).
    uint32_t lo = trunc - 0;
    uint32_t hi = upper - trunc;

    // Round up if closer to upper bound than lower, or if
    // equally close round up if odd (so to even).
    if ((lo > hi) ||
        (lo == hi && (sig & 1))) {
        sig++;

        // Incrementing the sig may cause wrap-around in
        // which case we increase the msb.
        sig &= (1u << 23) - 1;
        msb += !sig;
    }
} else {
    sig = u32 << (23 - msb);
}
uint8_t exp = msb + 127;
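
To sanity-check the fragment, you can pull apart the bit pattern the compiler's own cast produces and compare its fields with sig and exp (a small sketch, assuming 32-bit IEEE-754 floats whose bytes can be copied into a uint32_t with memcpy):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Illustrative helper: print the exponent and significand fields of the
 * compiler's own cast so they can be compared with exp and sig above. */
static void check(uint32_t u)
{
    float f = (float)u;
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);

    uint32_t exp = (bits >> 23) & 0xFFu;  /* biased exponent field */
    uint32_t sig = bits & 0x7FFFFFu;      /* stored 23-bit significand */

    printf("%u -> exp %u, sig 0x%06X\n",
           (unsigned)u, (unsigned)exp, (unsigned)sig);
}

int main(void)
{
    check((1u << 24) + 1);  /* tie: rounds down to 16777216 in the question's output */
    check((1u << 24) + 3);  /* tie: rounds up to 16777220 */
    return 0;
}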