Are there non-negative floating points a,b,c such that a + (b - a) * c > b with a < b and c < 1?

74 Views Asked by At

This question is related to When does (b - a) + a != b exactly in floating point but with an additional multiplicator c that must be strictly smaller than 1.

Formally I am asking myself if there exist any floating point numbers a, b, and c such that

  1. 0 <= a < b
  2. 0 <= c < 1
  3. a + (b - a) * c > b.

If there exists such a triple I am further wondering if we can find normal floating point numbers or if some of the three need to be subnormal.

First of all to find such a triple it is pretty clear that if it works it will work with c = 1. - epsilon, i.e. the largest floating point smaller than 1. So to find suiting values for a and b I tried multiple things without success:

  1. Generate a lot of random values
  2. Generate pairs a = i / n and b = j / n with 0 <= i < j <= n as suggested in this answer: https://stackoverflow.com/a/51383057/23137796. I tried for all (i,j,n) pairs with n <= 1000. I could find many pairs with a + (b - a) > b but none with a + (b - a) * c > b.
  3. I tried to generate a and b values that are close to each other (i.e. only epsilon away from each other or some constant multiple of epsilon away, up to 100). I couldn't find any pair that satisfies a + (b - a) > b.

Edit: Thank you for everyone presenting examples with rounding mode towards infinity, I can see that with this rounding mode we can find examples. I am mostly interested in rounding mode "Round to nearest, ties to even" that is used in many programming languages by default, should have mentioned this to begin with.

1

There are 1 best solutions below

4
Eric Postpischil On

Summary

There are such numbers when rounding toward +∞ is used. There are not such numbers when rounding to nearest with ties to even is used in a binary floating-point format. There are such numbers in non-binary formats.

In this answer, text in code font represents floating-point arithmetic. Text not in code font represents real-number arithmetic.

Rounding Toward +∞

Consider IEEE-754 binary32 with rounding toward +∞. Let a be ½−2−25, the representable number just below ½. Let b be 1. Let c be 1−2−24, the representable number just below 1. ba (mathematical subtraction) is ½+2−25. That is not representable, so b-a (floating-point subtraction) with rounding toward +∞ produces ½+2−24. b-ac is ½+2−24−2−25-2−48, so (b-a)*c produces ½+2−24. Then a+(b-a)*c is ½−2−25+½+2−24 = 1−225+2−24 = 1+2−25. 1+2−25 is not representable, so rounding toward +∞ produces 1+2−23, so a + (b-a)*c produces 1+2−23, which is greater than b.

Demonstrating in C, the output of:

#include <fenv.h>
#include <stdio.h>
#include <math.h>

#pragma STDC FENV_ACCESS ON


int main(void)
{
    fesetround(FE_UPWARD);
    float a = nexttowardf(.5f, 0);
    float b = 1;
    float c = nexttowardf(1, 0);
    printf("%a\n", a + (b-a)*c);
}

is:

0x1.000002p+0

Round to Nearest With Binary Floating-Point

This is not possible in a binary format with round-to-nearest-ties-to-even. Proof:

Consider, that, if a + (b-a)*c is greater than b, it is at least one ULP of b greater than b. Let u be the ULP of b. Consider first a + (b-a). Since a < b, b-ab, so the ULP of b-a is at most the UP of b, u. With round-to-nearest, the maximum error in a single operation is ½ ULP, for the ULP of the result. There are values for a and b where b-a rounds up by ½ ULP, such as when we subtract 1½•2−23 from 2−2−23. (2−2−23 − 1½•2−23 = 2 − 2½•2−23, which rounds up to 2 − 2•2−23.)

So, for b-a, it is possible to produce a result of ba + ½ u.

Then when we add a, there are again values for which the addition in a + (b-a) rounds up by ½ ULP. We know from above the mathematical result is at most a + ba + ½u = b + ½u. This is halfway to the next representable number after b, so it will, with round-to-nearest-ties-to-even, round up if the next representable number has an even low digit in its significand. The numbers above are an example of that; with a = 1½•2−23 and b = 2−2−23, a + (b-a) yields 2, which is greater than b.

Observe that we just barely achieved this. The final addition rounded up because there was exactly ½u rounding error in b-a, the maximum possible. It is not possible to arrange for the result of a + (b-a) to be any greater than b + ½u. So, when we consider a + (b-a)*c, we must not lose any of this achieved rounding error. The result of (b-a)*c must be b-a and not any less, or a + (b-a)*c will be lower than the value needed to allow rounding up.

The greatest c can be is the next representable number less than 1. The above examples use binary32, but we will consider binary formats generally, so the greatest c is 1−2p, where p is the precision of the format (the number of digits in the significand).

b-ac is necessarily less than b-a, so we want (b-a)*c to round up to compensate for that. Suppose b-a has the largest significand in a binary format, 2−21−p (using [1, 2) for the normal significand interval). c also has this significand. (It is 1−2p, which normalizes to 2−21−p.) Then product of the significands in b-ac is (2−21−p)2 = 4 − 2•21−p + (21−p)2 = 4 − 22−p + 22−2p. Scaling that to the normal interval yields 2 − 21−p + 21−2p.

2 − 21−p is the largest significand, and its ULP is 21−p, so the product does not exceed the largest significand by at least ½ ULP unless 21−2p ≥ ½•21−p, implying p ≤ 1. (I will leave it to the reader to consider whether there is a solution with p = 1 and what to replace round-to-nearest-ties-to-even with when all significands of nonzero numbers have an odd low digit.)

Therefore the product of this largest possible significand of a-b with c cannot round up, and no lower significand can round up either. Therefore, we cannot achieve the necessary rounding error to make a + (b-a)*c exceed b.

Bonus

For the p = 1 case left to the reader above, the representable numbers are … ¼, ½, 1, 2, 4,… The largest possible value for c is ½. We can consider any representable number for b, as there is only one significand (1) for non-zero numbers, and all behaviors will scale as the exponent is changed. Let’s choose b = 4, which makes a at most 2. For b-a, the result for any a is at most b, so let’s replace a + (b-a)*c with a + b*c. This is a + 4*½, so a + 2. Since a ≤ 2, a + 2 ≤ 4. So there is no solution in this degenerate floating-point format.

Decimal Format

Consider a two-digit decimal format with round-to-nearest. Let a be 15, b be 130, and c be .99. Then b-a = 115, so b-a rounds to 120, and b-ac = 118.8, so (b-a)*c rounds to 120. Then a + (b-a)*c = 15 + 120 = 135, so a + (b-a)*c rounds to 140, and 140 is greater than b.