Summing floats gives different results depending on the method


I tried three methods to sum a list of floats: the pandas sum, a for loop, and math.fsum. Only math.fsum seems to give the right result. The first and second methods give different values, and neither is exactly 100.0.

import pandas as pd
import math

def main():
    data = [61.1, 19.3, 15.7, 3.07, .255, .158, .102, .072, .0608,
            .0048, .0416, .0368, .0288, .0128, .0112, .0096, .0096,
            .008, .0048, .004, .004, .0032, .0024, .0006]

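    # Method 1: pandas Series.sum()
    df = pd.DataFrame(data, columns=['value'])
    print(df['value'].sum())

    # Method 2: plain left-to-right loop
    # (named total so it does not shadow the builtin sum)
    total = 0
    for x in data:
        total += x
    print(total)

    # Method 3: math.fsum
    print(math.fsum(data))

main()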

Current results:

99.99999999999999
100.00000000000004
100.0

Expected results:

100.0
100.0
100.0

There are 2 best solutions below

Answer by tax evader

All three approaches give different results because each uses a different algorithm for summing floating-point numbers. Floating-point values have limited precision, so each addition may round its result, and the algorithms trade performance against accuracy in different ways. You can learn more about the implementation of floating-point operations in this answer

  • The sum += x loop adds each value to a running total, left to right, rounding after every addition. Algorithmically it is the cheapest approach, but also the least accurate of the three.

  • df['value'].sum() uses NumPy's sum under the hood, which uses a pairwise summation algorithm. This is more accurate than a plain running sum, at some cost in speed.

  • math.fsum is the most accurate of the three, but also the slowest. You can learn more about its implementation here
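To see the effect on the data from the question, here is a small sketch that sums the same values in three different orders. The helper name loop_sum and the choice of orderings are only for illustration. The running-total loop may print a different value for each ordering, while math.fsum should print the same value for all of them.

```python
import math

data = [61.1, 19.3, 15.7, 3.07, .255, .158, .102, .072, .0608,
        .0048, .0416, .0368, .0288, .0128, .0112, .0096, .0096,
        .008, .0048, .004, .004, .0032, .0024, .0006]

def loop_sum(values):
    """Left-to-right running total, rounding after every addition."""
    total = 0.0
    for x in values:
        total += x
    return total

# Same numbers, three different orders (illustrative choices).
orders = {
    "original": data,
    "reversed": data[::-1],
    "ascending": sorted(data),
}

for name, values in orders.items():
    print(f"{name:>9}: loop={loop_sum(values)!r}  fsum={math.fsum(values)!r}")
```

Any differences between the loop results are only in the last few digits, which is the kind of discrepancy shown in the question.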

Answer by Tim Peters

Living with the limitations of binary floating-point is a very large and complex topic. You're not going to get easy answers, alas. Start by reading the elementary Python docs on the topic.

Floating-point addition is not, in general, associative. That means the order in which additions are performed can change the final result: it is not always the case that (x+y)+z returns the same result as x+(y+z). You can easily find examples by generating addends at random (e.g., using Python's random.random()). So the order in which an implementation does addition can matter.
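A minimal illustration of the point about associativity: the first part uses one well-known triple, and the second part counts how often random triples disagree. The seed and the number of trials are arbitrary choices made here so the run is repeatable.

```python
import random

# One concrete triple where grouping changes the result.
x, y, z = 0.1, 0.2, 0.3
left = (x + y) + z
right = x + (y + z)
print(left, right, left == right)  # 0.6000000000000001 0.6 False

# Count how often random triples give different results for the two groupings.
random.seed(0)      # arbitrary seed, only for repeatability
trials = 10_000     # arbitrary number of trials
mismatches = 0
for _ in range(trials):
    a, b, c = random.random(), random.random(), random.random()
    if (a + b) + c != a + (b + c):
        mismatches += 1
print(f"{mismatches} of {trials} random triples are grouping-sensitive")
```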

For reference:

  • Adding "left to right", as your for loop does, is one of the poorest methods for limiting accumulated rounding errors. Python's builtin sum() function also works left-to-right(*). You can do worse! For example, sort the addends from largest to smallest before adding left-to-right (although the numerics are generally better if sorted from smallest to largest instead).

  • The Pandas docs say it uses the method numpy uses. That's a little slower, but numerically better behaved. It effectively organizes the additions as a binary tree, adding adjacent pairs at the bottom, then pairs-of-pair-sums, and so on.

  • math.fsum() is an entirely different kind of method, and much slower. It emulates "infinite precision" arithmetic in software, and returns the result "as if" no rounding ever occurred, until a single rounding at the end (to deliver the final result). Because there is no internal rounding, math.fsum(xs) delivers the same result for every permutation of xs.
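The three methods above can be compared in a rough sketch. pairwise_sum below is a hypothetical pure-Python illustration of the binary-tree idea, not NumPy's actual implementation (which is written in C and works on blocks of elements). The test data, 1000 copies of 0.1, is an arbitrary choice. The exact reference value is computed with fractions.Fraction, which represents each stored double exactly.

```python
import math
from fractions import Fraction

def pairwise_sum(values):
    """Split the list in half, sum each half recursively, add the two results."""
    n = len(values)
    if n == 0:
        return 0.0
    if n == 1:
        return values[0]
    mid = n // 2
    return pairwise_sum(values[:mid]) + pairwise_sum(values[mid:])

def loop_sum(values):
    """Left-to-right running total, as in a plain for loop."""
    total = 0.0
    for x in values:
        total += x
    return total

values = [0.1] * 1000                      # arbitrary illustrative data
exact = sum(Fraction(x) for x in values)   # exact sum of the stored doubles

results = [
    ("loop", loop_sum(values)),
    ("pairwise", pairwise_sum(values)),
    ("fsum", math.fsum(values)),
]
for name, result in results:
    error = float(Fraction(result) - exact)  # signed error against the exact sum
    print(f"{name:>8}: {result!r}  error={error:.3e}")
```

The fsum row should match the exact sum rounded once to a double, which is what "a single rounding at the end" means in practice.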

(*) Note: starting in Python 3.12, the builtin sum() applied to floats uses yet another algorithm to improve accuracy. This is close to math.fsum() in accuracy, but generally much faster than math.fsum().
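For the curious, the Python 3.12 release notes describe the new float behavior of sum() as Neumaier summation, a form of compensated summation. The function below is a pure-Python sketch of that general technique under the name neumaier_sum (a name chosen here), not CPython's actual C code. It keeps a second variable that accumulates the low-order bits lost in each addition and adds them back at the end.

```python
def neumaier_sum(values):
    """Compensated (Neumaier) summation: track the rounding error of each add."""
    total = 0.0
    compensation = 0.0  # running sum of the bits lost to rounding
    for x in values:
        t = total + x
        if abs(total) >= abs(x):
            # Low-order bits of x were lost when it was added to total.
            compensation += (total - t) + x
        else:
            # Low-order bits of total were lost instead.
            compensation += (x - t) + total
        total = t
    return total + compensation

def loop_sum(values):
    """Plain left-to-right running total, for comparison."""
    total = 0.0
    for x in values:
        total += x
    return total

# A classic stress case: the two 1.0 terms are swallowed by 1e100 in a plain loop.
tricky = [1.0, 1e100, 1.0, -1e100]
print(loop_sum(tricky))      # 0.0
print(neumaier_sum(tricky))  # 2.0
```

The extra work per element is a few additions and one comparison, which is why this style of algorithm can be much faster than math.fsum() while still recovering most of the lost accuracy.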