How to safely do floating point arithmetic in C

177 Views Asked by At

The recent (C23) addition of stdckdint.h provides a portable way to check for overflow problems in integer arithmetic. I have used similar compiler extensions in the past in cases where the integers in question are provided by the user and they have been very helpful. Right now I faced with a situation where the values provided are floating point numbers instead, and I need to do arithmetic operations with them.

My question is whether there is a similar solution (not necessarily in the standard library), to ensure that operations between floating point variables (floats, doubles, long doubles) won't run into any issues. If not, what are all the possible problems that might occur during arithmetic operations (overflows, underflows, etc.) and how can I check for them?

2

There are 2 best solutions below

0
Eric Postpischil On BEST ANSWER

If not, what are all the possible problems that might occur during arithmetic operations (overflows, underflows, etc.)…

Omitting discussion of C conformance, both of the standard and of implementations, to the IEEE 754 floating-point standard, IEEE 754 specifies five kinds of exceptions in clause 7:

  • Invalid operation is signaled iff “there is no useful definable result”: general operations on a signaling NaN, multiplication of a zero by an infinity, addition of opposite-signed infinities, subtraction of same-signed infinities, division of zero by zero or infinity by infinity, square root of a number less than zero, certain uses of the quantize operation (used to adjust the exponent of decimal formats), conversions to integer that cannot represent the result, comparisons of NaNs with operators that require non-NaNs, and integer-logarithm (“get the floating-point exponent”) of a NaN, +∞, or zero.
  • Division by zero is signaled for ordinary division by zero and integer-logarithm of −∞.
  • Overflow is signaled when the computed result is infinite because the exponent is bounded by the floating-point format (not because it would “naturally” be infinite).
  • Underflow is signaled when the computed result is zero because the exponent is bounded by the floating-point format (not because it would “naturally” be zero) except an implementation may also signal underflow when the result before rounding is lower in magnitude than the smallest representable non-zero number.
  • Inexact is signaled when the computed floating-point result differs from the real-number arithmetic result. (Inexact is a frequent outcome of floating-point arithmetic so that it is considered normal, not an exception in a common language sense, except in special-purpose code.)

In this context, “signaled” means the exception is indicted in some means, such as setting a bit in a status register. It does not mean program execution is interrupted or altered, as when a Unix signal causes a signal handler to be executed. The above is a summary and omits some details.

… If not, what are all the possible problems that might occur during arithmetic operations (overflows, underflows, etc.) and how can I check for them?

The current C standard, 2018, and the forthcoming standard, if faithful to the available drafts, provide facilities for testing floating-point exceptions in clause 7.6, discussing the header <fenv.h> and its associated facilities. Testing for exceptions is specified in 7.6.4.

feclearexcept(FE_ALL_EXCEPT) attempts to clear all supported exception flags and returns zero iff succesful.

fetestexcept(mask) returns the bitwise AND of mask and the exception flags that are set. The mask may be a bitwise OR of FE_DIVBYZERO, FE_INEXACT, FE_INVALID, FE_OVERFLOW, FE_UNDERFLOW, and possibly other flag names supported by the particular C implementation.

Testing exceptions in this way may be slow, depending on the C implementation and its underlying hardware, because it may require accessing a status register in a way that interrupts processor execution of instructions or because it requires the compiler to use scalar floating-point instructions when it might have optimized to using SIMD instructions if the flags were not being accessed. Some C implementations may provide alternate methods of testing flags. In particular, for performance, it may be preferable not to test flags explicitly but rather to set exception handlers to be triggered when an exception occurs. This can be useful for detecting exceptions you expect to be rare, such as overflows with code designed to avoid them. However, with exceptions that are common, such as inexact (normally common) or underflow (not uncommon in some code, such as diminishing echo effects), frequent execution of exception handlers will impair performance.

0
Martin Brown On

Overflow of double precision 64bit FP (or higher) isn't something that you need to lose sleep over unless you are really pushing things hard. Division by zero is by far the most common user error in FP arithmetic and you should always check for it.

FP failures of infinity, overflow, denorm, underflow and indeterminate propagate through calculations so that the answer will be "Indeterminate" Ind, "Infinity" Inf or "Not a Number" NaN if the worst should happen. Denorms can still be useful as smaller than normally allowed but are much slower and will lose precision.

Default behaviour is to continue on execution after an FP error but you can enable interrupts so that any FP error allows your own code to take control.

You need to be much more careful with ordinary 32bit floats where the upper limit is ~10^38 which isn't quite enough to be safe in some common numerical algorithms. Solving a cubic equation for example involves A^6 which can cause trouble in single precision if you are careless about scaling your coefficients.

This is an excellent introduction by Goldberg to the issues of floating point computation and although quite old now some key things don't change.

Catastrophic cancellation is the other common gotcha which can catch out the unwary. Likewise with tests for exact equality especially involving decimal fractions which do not have a perfect representation in FP binary.