Issue related to double precision floating point division in C++

Question

Issue related to double precision floating point division in C++

202 Views Asked by Gaurav Dhir At 10 February 2023 at 02:57

In C++, we know that we can find the minimum representable double precision value using std::numeric_limits<double>::min(). The value turns out to be 2.22507e-308 when printed.

Now if a given double value (say val) is subtracted from this minimum value and then a division is undertaken with the same previous double value (val - minval) / val, I was expecting the answer to be rounded to 0 if the operation floor((val - minval ) / val) was performed on the resulting divided value.

To my surprise, the answer is delivered as 1. Can someone please explain this anomalous behavior?

Consider the following code:

int main()
{
  double minval = std::numeric_limits<double>::min(), wg = 8038,    
         ans = floor((wg - minval) / wg); // expecting the answer to round to 0   
  cout << ans; // but the answer actually resulted as 1!
}

Original Q&A

There are 2 best solutions below

**Jerry Coffin** · Answer 1 · 2023-02-10T03:34:06.587000

A double typically has around 16 digits of precision.

You're starting with 8038. For simplicity, I'm going to call that 8.038e3. Since we have around 16 digits of precision, the smallest number we can subtract from that and still get a result different from 8038 is 8038e(3-16) = 8038e-13.

8038 - 2.2e-308 is like reducing the mass of the universe by one electron, and expecting that to affect the mass of the universe by a significant amount.

Actually, relatively speaking, 8038-2.2e-308 is a much smaller change than removing a whole electron from the universe--more like removing a minuscule fraction of a single electron from the universe, if that were possible. Even if we were to assume that string theory were correct, even removing one string from the universe would still be a huge change compared to subtracting 2.2e-308 from 8038.

**Gaurav Dhir** · Answer 2 · 2023-02-14T22:22:29.843000

The comments and the previous answer correctly attribute the cause to floating point precision issues but there are additional details needed to explain the correct behavior. In fact, even in cases where subtraction cannot be carried out such that the results of the subtraction cannot be represented with the finite precision of floating point numbers, inexact rounding is still performed by the compiler and subtraction is not completely discarded.

As an example, consider the code below.

int main()
{
    double b, c, d;
    
    vector<double> a{0.07, 0.08, 0.09, 0.1, 0.2, 0.3, 0.4, 0.6, 0.7};
    
    cout << "Subtraction Possible?" << "\t" << "Floor Result" << "\n";
    
    for( int i = 0; i < 9; i++ ) {
        
        b = std::nextafter( a[i], 0 );
    
        c = a[i] - b;
    
        d = 1e-17;
        
        if( (bool)(d > c) )
            cout << "True" << "\t";
        else
            cout << "False" << "\t";
        
        cout << setprecision(52) << floor((a[i] - d)/a[i]) << "\n";
        
    }
    
    return 0;
}

The code takes in different double precision values in the form of vector a and performs subtraction from 1e-17. It must be noted that the smallest value that can be subtracted from 0.07 is shown to be 1.387778780781445675529539585113525390625e-17 using std::nextafter for the value 0.07. This means that 1e-17 is smaller than the smallest value which can be subtracted from any of these numbers. Hence, theoretically, subtraction should not be possible for any of the numbers listed in vector a. If we assume that the subtraction results are discarded, then the answer must always stay 1 but it turns out that sometimes the answer is 0 and other times 1.

This can be observed from the output of the C++ program as shown below:

Subtraction Possible?   Floor Result
False                       0
False                       0
False                       0
False                       0
False                       1
False                       1
False                       1
False                       1
False                       1

The reasons lay buried within the Floating Point specification prescribed in the IEEE 754 document. In general the standard specifically states that even in cases where the results of an operation cannot be represented, rounding must be carried out. I quote Page 27, Section 4.3 of the IEEE 754, 2019 document:

Except where stated otherwise, every operation shall be performed as if it first produced an intermediate result correct to infinite precision and with unbounded range, and then rounded that result according to one of the attributes in this clause

The statement in further repeated in Section 5.1 of Page 29 as shown below:

Unless otherwise specified, each of the computational operations specified by this standard that returns a numeric result shall be performed as if it first produced an intermediate result correct to infinite precision and with unbounded range, and then rounded that intermediate result, if necessary, to fit in the destination’s format (see Clause 4 and Clause 7).

C++'s g++ compiler (which I have been testing) correctly and very precisely interprets the standard by implementing nearest rounding stated in Section 4.3.1 of the IEEE 754 document. This has the implication that even when a[i] - b is not representable, a numeric result is delivered as if the subtraction first produced an intermediate result correct to infinite precision and with unbounded range, and then rounded that intermediate result. Hence, it may or may not be the case that a[i] - b == a[i] which means that the answer may or may not be 1 depending on whether a[i] - b is closer to a[i] or it is closer to the next representable value after a[i].

It turns out that 8038 - 2.22507e-308 is closer to 8038 due to which the answer is rounded (using nearest rounding) to 8038 and the final answer is 1 but this is to only state that this behavior does result from the compiler's interpretation of the standard and is not something arbitrary.

I found below references on Floating Point numbers to be very useful. I would recommend reading Cleve Moler's (founder of MATLAB) reference on floating point numbers before going through the IEEE specification for a quick and easy understanding of their behavior.

"IEEE Standard for Floating-Point Arithmetic," in IEEE Std 754-2019 (Revision of IEEE 754-2008) , vol., no., pp.1-84, 22 July 2019, doi: 10.1109/IEEESTD.2019.8766229.
Moler, Cleve. “Floating Points.” MATLAB News and Notes. Fall, 1996.

Issue related to double precision floating point division in C++

There are 2 best solutions below

Related Questions in C++

Related Questions in C++14

Related Questions in DOUBLE

Related Questions in ROUNDING

Related Questions in FLOOR-DIVISION

Trending Questions

Popular # Hahtags

Popular Questions