In experimenting with rounding modes, FE_DOWNWARD ( and FE_TOWARDZERO) apparently fails to form the expected sum of infinity and instead forms DBL_MAX when adding DBL_MAX and 1 ULP of DBL_MAX.
Follows is test code that demos the unexpected sum. Under different rounds modes, it adds to DBL_MAX values near 0.5 ULP and 1.0 ULP. No problems noted in FE_TONEAREST and FE_UPWARD.
Questions:
- Do you agree it is an error?
- Does code form the correct answer on another machine?
- This sadly follows another near
DBL_MAXproblem recently reported, so perhaps my math library is sub-par. Advice on how to report is requested.
Compiler notes:
Invoking: Cygwin C Compiler
gcc -std=c11 -O0 -g3 -pedantic -Wall -Wextra -Wconversion -c -fmessage-length=0 -v -MMD -MP -MF"rand_i.d" -MT"rand_i.o" -o "rand_i.o" "../rand_i.c"
COLLECT_GCC=gcc
Target: x86_64-pc-cygwin
gcc version 11.3.0 (GCC)
GNU C11 (GCC) version 11.3.0 (x86_64-pc-cygwin)
compiled by GNU C version 11.3.0, GMP version 6.2.1, MPFR version 4.1.0, MPC version 1.2.1, isl version isl-0.25-GMP
#include <fenv.h>
#include <float.h>
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#define STRINGIFY(x) #x
#define TOSTRING(x) STRINGIFY(x)
int main() {
const double max = DBL_MAX;
const double max_ulp = max - nextafter(max, 0);
int mode[] = {FE_DOWNWARD, FE_TONEAREST, FE_TOWARDZERO, FE_UPWARD};
const char *modes[] = {STRINGIFY(FE_DOWNWARD), STRINGIFY(FE_TONEAREST), //
STRINGIFY(FE_TOWARDZERO), STRINGIFY(FE_UPWARD)};
int n = sizeof mode / sizeof mode[0];
int p = (DBL_MANT_DIG + 2)/4;
int P = (LDBL_MANT_DIG + 2)/4;
printf("%s:%d\n", STRINGIFY(FLT_EVAL_METHOD), FLT_EVAL_METHOD);
printf("LDBL_MAX :%-24.*La %.*Lg\n", P, LDBL_MAX, LDBL_DECIMAL_DIG, LDBL_MAX);
printf("DBL_MAX :%-24.*a %.*g\n", p, max, DBL_DECIMAL_DIG, max);
printf("DBL_MAX_ULP:%-24.*a %.*g\n", p, max_ulp, DBL_DECIMAL_DIG, max_ulp);
printf("\n");
printf("mode: Addendum SUM (double) SUM (long double)\n");
for (int i = 0; i < n; i++) {
if (fesetround(mode[i])) {
perror("Invalid mode");
return -1;
}
double delta[] = {nextafter(max_ulp / 2, 0), max_ulp / 2, //
nextafter(max_ulp / 2, INFINITY), nextafter(max_ulp, 0), //
max_ulp, nextafter(max_ulp, INFINITY)};
const char *deltas[] = { "0.5 ulp-", "0.5 ulp", "0.5 ulp+", //
"ulp-", "ulp", "ulp+"};
int dn = sizeof delta / sizeof delta[0];
for (int d = 0; d < dn; d++) {
double sum = max + delta[d];
printf("mode:%-14s %-8s:%-24.*a sum:%-24.*a %-24.*La\n", //
modes[i], deltas[d], p, delta[d], p, sum, P, 0.0L + max + delta[d]);
}
puts("");
}
}
Output: note 4 unexpected lines.
FLT_EVAL_METHOD:0
LDBL_MAX :0x1.fffffffffffffffep+16383 1.18973149535723176502e+4932
DBL_MAX :0x1.fffffffffffffp+1023 1.7976931348623157e+308
DBL_MAX_ULP:0x1.0000000000000p+971 1.9958403095347198e+292
mode: Addendum SUM (double) SUM (long double)
mode:FE_DOWNWARD 0.5 ulp-:0x1.fffffffffffffp+969 sum:0x1.fffffffffffffp+1023 0x1.fffffffffffff7fep+1023
mode:FE_DOWNWARD 0.5 ulp :0x1.0000000000000p+970 sum:0x1.fffffffffffffp+1023 0x1.fffffffffffff800p+1023
mode:FE_DOWNWARD 0.5 ulp+:0x1.0000000000001p+970 sum:0x1.fffffffffffffp+1023 0x1.fffffffffffff800p+1023
mode:FE_DOWNWARD ulp- :0x1.fffffffffffffp+970 sum:0x1.fffffffffffffp+1023 0x1.fffffffffffffffep+1023
mode:FE_DOWNWARD ulp :0x1.0000000000000p+971 sum:0x1.fffffffffffffp+1023 0x1.0000000000000000p+1024
--> inf expected <--
mode:FE_DOWNWARD ulp+ :0x1.0000000000001p+971 sum:0x1.fffffffffffffp+1023 0x1.0000000000000000p+1024
--> inf expected <--
mode:FE_TONEAREST 0.5 ulp-:0x1.fffffffffffffp+969 sum:0x1.fffffffffffffp+1023 0x1.fffffffffffff800p+1023
mode:FE_TONEAREST 0.5 ulp :0x1.0000000000000p+970 sum:inf 0x1.fffffffffffff800p+1023
mode:FE_TONEAREST 0.5 ulp+:0x1.0000000000001p+970 sum:inf 0x1.fffffffffffff800p+1023
mode:FE_TONEAREST ulp- :0x1.fffffffffffffp+970 sum:inf 0x1.0000000000000000p+1024
mode:FE_TONEAREST ulp :0x1.0000000000000p+971 sum:inf 0x1.0000000000000000p+1024
mode:FE_TONEAREST ulp+ :0x1.0000000000001p+971 sum:inf 0x1.0000000000000000p+1024
mode:FE_TOWARDZERO 0.5 ulp-:0x1.fffffffffffffp+969 sum:0x1.fffffffffffffp+1023 0x1.fffffffffffff7fep+1023
mode:FE_TOWARDZERO 0.5 ulp :0x1.0000000000000p+970 sum:0x1.fffffffffffffp+1023 0x1.fffffffffffff800p+1023
mode:FE_TOWARDZERO 0.5 ulp+:0x1.0000000000001p+970 sum:0x1.fffffffffffffp+1023 0x1.fffffffffffff800p+1023
mode:FE_TOWARDZERO ulp- :0x1.fffffffffffffp+970 sum:0x1.fffffffffffffp+1023 0x1.fffffffffffffffep+1023
mode:FE_TOWARDZERO ulp :0x1.0000000000000p+971 sum:0x1.fffffffffffffp+1023 0x1.0000000000000000p+1024
--> inf expected <--
mode:FE_TOWARDZERO ulp+ :0x1.0000000000001p+971 sum:0x1.fffffffffffffp+1023 0x1.0000000000000000p+1024
--> inf expected <--
mode:FE_UPWARD 0.5 ulp-:0x1.fffffffffffffp+969 sum:inf 0x1.fffffffffffff800p+1023
mode:FE_UPWARD 0.5 ulp :0x1.0000000000000p+970 sum:inf 0x1.fffffffffffff800p+1023
mode:FE_UPWARD 0.5 ulp+:0x1.0000000000001p+970 sum:inf 0x1.fffffffffffff802p+1023
mode:FE_UPWARD ulp- :0x1.fffffffffffffp+970 sum:inf 0x1.0000000000000000p+1024
mode:FE_UPWARD ulp :0x1.0000000000000p+971 sum:inf 0x1.0000000000000000p+1024
mode:FE_UPWARD ulp+ :0x1.0000000000001p+971 sum:inf 0x1.0000000000000002p+1024
Checking for and dealing with overflow happens in two stages. Note that the actual software/circuitry may use a different strategy, but the result will be as if this were the procedure.
FE_NEAREST(the "normal" mode) the number is corrected to infinity. ForFE_TOWARDZERO, however, it's corrected to+/-DBL_MAX. (For the other rounding modes it depends on sign: rounding away from zero leads to infinity and rounding toward zero leads to+/-DBL_MAX.)The overflow rules for a given mode are thus reminiscent of the rounding rules for that mode, but not really the same. Arguably
FE_NEARESTis the weird one, since it acts like the (IEEE-754-only) ties-away-from-zero rounding mode under all out-of-range situations, rather than noticing that infinity is way farther away thanDBL_MAX. But the basic behavior for the other modes is to output+/-DBL_MAXwhen its preferred direction (given the sign) is toward zero, and infinity when its preferred direction is away from zero.Also note that when the result of stage 1 is a value outside the representable range, an overflow exception will still be emitted even when the result of stage 2 is
DBL_MAX. The overflow doesn't indicate "I made an infinity", but "I had to do stage 2".