This question is related to When does (b - a) + a != b exactly in floating point but with an additional multiplicator c that must be strictly smaller than 1.
Formally I am asking myself if there exist any floating point numbers a, b, and c such that
0 <= a < b0 <= c < 1a + (b - a) * c > b.
If there exists such a triple I am further wondering if we can find normal floating point numbers or if some of the three need to be subnormal.
First of all to find such a triple it is pretty clear that if it works it will work with c = 1. - epsilon, i.e. the largest floating point smaller than 1. So to find suiting values for a and b I tried multiple things without success:
- Generate a lot of random values
- Generate pairs
a = i / nandb = j / nwith0 <= i < j <= nas suggested in this answer: https://stackoverflow.com/a/51383057/23137796. I tried for all(i,j,n)pairs withn <= 1000. I could find many pairs witha + (b - a) > bbut none witha + (b - a) * c > b. - I tried to generate
aandbvalues that are close to each other (i.e. only epsilon away from each other or some constant multiple of epsilon away, up to 100). I couldn't find any pair that satisfiesa + (b - a) > b.
Edit: Thank you for everyone presenting examples with rounding mode towards infinity, I can see that with this rounding mode we can find examples. I am mostly interested in rounding mode "Round to nearest, ties to even" that is used in many programming languages by default, should have mentioned this to begin with.
Summary
There are such numbers when rounding toward +∞ is used. There are not such numbers when rounding to nearest with ties to even is used in a binary floating-point format. There are such numbers in non-binary formats.
In this answer, text in
code fontrepresents floating-point arithmetic. Text not in code font represents real-number arithmetic.Rounding Toward +∞
Consider IEEE-754 binary32 with rounding toward +∞. Let
abe ½−2−25, the representable number just below ½. Letbbe 1. Letcbe 1−2−24, the representable number just below 1.b−a(mathematical subtraction) is ½+2−25. That is not representable, sob-a(floating-point subtraction) with rounding toward +∞ produces ½+2−24.b-a•cis ½+2−24−2−25-2−48, so(b-a)*cproduces ½+2−24. Thena+(b-a)*cis ½−2−25+½+2−24 = 1−225+2−24 = 1+2−25. 1+2−25 is not representable, so rounding toward +∞ produces 1+2−23, soa + (b-a)*cproduces 1+2−23, which is greater thanb.Demonstrating in C, the output of:
is:
Round to Nearest With Binary Floating-Point
This is not possible in a binary format with round-to-nearest-ties-to-even. Proof:
Consider, that, if
a + (b-a)*cis greater thanb, it is at least one ULP ofbgreater thanb. Let u be the ULP ofb. Consider firsta + (b-a). Sincea<b,b-a≤b, so the ULP ofb-ais at most the UP ofb, u. With round-to-nearest, the maximum error in a single operation is ½ ULP, for the ULP of the result. There are values foraandbwhereb-arounds up by ½ ULP, such as when we subtract 1½•2−23 from 2−2−23. (2−2−23 − 1½•2−23 = 2 − 2½•2−23, which rounds up to 2 − 2•2−23.)So, for
b-a, it is possible to produce a result ofb−a+ ½ u.Then when we add
a, there are again values for which the addition ina + (b-a)rounds up by ½ ULP. We know from above the mathematical result is at mosta+b−a+ ½u =b+ ½u. This is halfway to the next representable number afterb, so it will, with round-to-nearest-ties-to-even, round up if the next representable number has an even low digit in its significand. The numbers above are an example of that; witha= 1½•2−23 andb= 2−2−23,a + (b-a)yields 2, which is greater thanb.Observe that we just barely achieved this. The final addition rounded up because there was exactly ½u rounding error in
b-a, the maximum possible. It is not possible to arrange for the result ofa + (b-a)to be any greater thanb+ ½u. So, when we considera + (b-a)*c, we must not lose any of this achieved rounding error. The result of(b-a)*cmust beb-aand not any less, ora + (b-a)*cwill be lower than the value needed to allow rounding up.The greatest
ccan be is the next representable number less than 1. The above examples use binary32, but we will consider binary formats generally, so the greatestcis 1−2−p, where p is the precision of the format (the number of digits in the significand).b-a•cis necessarily less thanb-a, so we want(b-a)*cto round up to compensate for that. Supposeb-ahas the largest significand in a binary format, 2−21−p (using [1, 2) for the normal significand interval).calso has this significand. (It is 1−2−p, which normalizes to 2−21−p.) Then product of the significands inb-a•cis (2−21−p)2 = 4 − 2•21−p + (21−p)2 = 4 − 22−p + 22−2p. Scaling that to the normal interval yields 2 − 21−p + 21−2p.2 − 21−p is the largest significand, and its ULP is 21−p, so the product does not exceed the largest significand by at least ½ ULP unless 21−2p ≥ ½•21−p, implying p ≤ 1. (I will leave it to the reader to consider whether there is a solution with p = 1 and what to replace round-to-nearest-ties-to-even with when all significands of nonzero numbers have an odd low digit.)
Therefore the product of this largest possible significand of
a-bwithccannot round up, and no lower significand can round up either. Therefore, we cannot achieve the necessary rounding error to makea + (b-a)*cexceedb.Bonus
For the p = 1 case left to the reader above, the representable numbers are … ¼, ½, 1, 2, 4,… The largest possible value for
cis ½. We can consider any representable number forb, as there is only one significand (1) for non-zero numbers, and all behaviors will scale as the exponent is changed. Let’s chooseb= 4, which makesaat most 2. Forb-a, the result for anyais at mostb, so let’s replacea + (b-a)*cwitha + b*c. This isa + 4*½, soa + 2. Sincea≤ 2,a + 2≤ 4. So there is no solution in this degenerate floating-point format.Decimal Format
Consider a two-digit decimal format with round-to-nearest. Let
abe 15,bbe 130, andcbe .99. Thenb-a= 115, sob-arounds to 120, andb-a•c= 118.8, so(b-a)*crounds to 120. Thena+(b-a)*c= 15 + 120 = 135, soa + (b-a)*crounds to 140, and 140 is greater thanb.