I understand that float32, per IEEE 754, has precision issues when summing values on the order of 10^-7, and I fully expect to see differences in computations between architectures, including GPUs. However, right now I'm testing two different implementations of a PyTorch module and getting different results from one versus the other. I can see that the discrepancies appear when I do torch.sum(x, dim=-1) on a vector of small values. Since both repositories are running on the same version of PyTorch (2.0.1), on the same CPU, on the same computer, I can't understand why I would see different results. This should be the exact same sum operation, so shouldn't the rounding errors be identical on both sides?
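To illustrate the kind of order-dependence I mean (the values below are synthetic stand-ins, not the actual output of my module): summing the same float32 values in a different order can give results that differ in the last bits, even on one machine.

```python
import torch

torch.manual_seed(0)

# Hypothetical stand-in for the vector of small values from my module.
x = torch.rand(10_000, dtype=torch.float32) * 1e-7

# torch.sum on CPU uses a vectorized/blocked reduction, so its result can
# differ in the last bits from a naive left-to-right accumulation.
vectorized = torch.sum(x)
sequential = torch.zeros((), dtype=torch.float32)
for v in x:
    sequential = sequential + v

# float64 reference: both float32 results are "correct" to within rounding,
# they just round differently because the accumulation order differs.
reference = torch.sum(x.double())

print(f"torch.sum (float32): {vectorized.item():.12e}")
print(f"loop sum  (float32): {sequential.item():.12e}")
print(f"torch.sum (float64): {reference.item():.12e}")
```

My confusion is that I'd expect that order, whatever it is, to be the same for both repositories when the PyTorch build and hardware are identical.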
What are the possible flags/configurations that affect the precision of float32 sum operations? For example, I see torch.set_flush_denormal: https://pytorch.org/docs/stable/generated/torch.set_flush_denormal.html. Is there anything else that affects how float operations are performed?
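For reference, this is roughly what I'm printing in both repositories to compare configuration (I'm not sure all of these actually matter for torch.sum, so treat it as a checklist sketch, not a definitive list):

```python
import torch

print(torch.__version__)
print(torch.get_num_threads())                       # thread count could change how the reduction is split up
print(torch.get_float32_matmul_precision())          # "highest"/"high"/"medium" (matmul-oriented, probably not sum)
print(torch.are_deterministic_algorithms_enabled())  # deterministic-algorithms setting
print(torch.backends.mkldnn.is_available())          # whether oneDNN/MKL-DNN kernels are available

# I don't know of a getter for the flush-denormal state, so I pin it
# explicitly in both repos before re-testing:
torch.set_flush_denormal(False)

# Build and threading details (OpenMP/TBB, MKL version, etc.):
print(torch.__config__.parallel_info())
```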