How to express float constants precisely in source code

Question

271 Views Asked by Cerno At 16 August 2022 at 11:24

I have some C++11 code, produced by a code generator, that contains a large array of floats. I want to make sure that the compiled values are precisely the same as the values in the generator (assuming that both follow the same ISO floating-point standard).

So I figured the best way to do it is to store the values as hex representations and interpret them as float in the code.

Edit for Clarification: The code generator takes the float values and converts them to their corresponding hex representations. The target code is supposed to convert back to float.

It looks something like this:

const unsigned int data[3] = { 0x3d13f407U, 0x3ea27884U, 0xbe072dddU};
float const* ptr = reinterpret_cast<float const*>(&data[0]);

This works and gives me access to all the data elements as floats, but I recently stumbled upon the fact that this is actually undefined behavior and only works because my compiler resolves it the way I intended:

https://gist.github.com/shafik/848ae25ee209f698763cffee272a58f8

https://en.cppreference.com/w/cpp/language/reinterpret_cast

The standard basically says that reinterpret_cast is not defined between POD pointers of different type.

So basically I have three options:

1) Use memcpy and hope that the compiler will be able to optimize this.
2) Store the data not as hex values but in a different way.
3) Use std::bit_cast from C++20.

I cannot use 3) because I'm stuck with C++11.

I don't have the resources to store the data array twice, so I would have to rely on the compiler to optimize this. Due to this, I don't particularly like 1) because it could stop working if I changed compilers or compiler settings.
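For reference, option 1) could look roughly like the sketch below. The names kData and load_float are made up for illustration, and the code assumes a 32-bit IEEE 754 float with the same byte order as the integer type. Each element is copied into a local float on access, so the table is not stored twice. Mainstream optimizing compilers typically turn a fixed-size memcpy like this into a plain load, but the standard does not promise that.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical generated table: the bit patterns from the question.
static const std::uint32_t kData[3] = { 0x3d13f407U, 0x3ea27884U, 0xbe072dddU };

static_assert(sizeof(float) == sizeof(std::uint32_t), "float must be 32 bits wide");

// Reads element `index` as a float without breaking the aliasing rules.
// Only one float-sized temporary exists at a time; the table is not duplicated.
inline float load_float(std::size_t index) {
    assert(index < sizeof(kData) / sizeof(kData[0]));
    float value;
    std::memcpy(&value, &kData[index], sizeof value);
    return value;
}
```

Whether the copy really disappears can be checked in the generated assembly for the compilers and flags that matter.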

So that leaves me with 2):

Is there a standardized way to express float values in source code so that they map to the exact float value when compiled? Does the ISO float standard define this in a way that guarantees that any compiler will follow the interpretation? I imagine if I deviate from the way the compiler expects, I could run the risk that the float "neighbor" of the number I actually want is used.

I would also take alternative ideas if there is an option 4 I forgot.

Original Q&A

There are 3 best solutions below

**KamilCuk** · Answer 1 · 2022-08-16T11:37:09.760000

How to express float constants precisely in source code

Use hexadecimal floating-point literals. Assuming some endianness for the hex values you presented:

float floats[] = { 0x1.27e80ep-5, 0x1.44f108p-2, -0x1.0e5bbap-3 };
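On the generator side, literals of this form can be produced with the "%a" conversion of printf, which prints a binary floating-point value exactly. The following is only a sketch: hex_float_literal and bits_from_literal are hypothetical helper names, finite values and a 32-bit IEEE 754 float are assumed, and the exact spelling that "%a" produces (for example the leading digit) may differ between C libraries. The second helper parses a literal back with std::strtof so the round trip can be checked.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <string>

static_assert(sizeof(float) == sizeof(std::uint32_t), "float must be 32 bits wide");

// Hypothetical generator helper: bit pattern -> hexadecimal floating-point text.
inline std::string hex_float_literal(std::uint32_t bits) {
    float value;
    std::memcpy(&value, &bits, sizeof value);
    char buffer[64];
    // float -> double is exact, and "%a" without a precision prints all digits.
    const int written = std::snprintf(buffer, sizeof buffer, "%a",
                                      static_cast<double>(value));
    assert(written > 0 && static_cast<std::size_t>(written) < sizeof buffer);
    (void)written;  // silence unused-variable warnings when NDEBUG is defined
    return std::string(buffer);
}

// Hypothetical checker: parses the text back and returns the float's bit pattern.
inline std::uint32_t bits_from_literal(const std::string& literal) {
    const float value = std::strtof(literal.c_str(), nullptr);
    std::uint32_t bits;
    std::memcpy(&bits, &value, sizeof bits);
    return bits;
}
```

A generator could call bits_from_literal(hex_float_literal(x)) for every entry and refuse to emit the file if the result differs from x.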

**Eric Postpischil** · Answer 2 · 2022-08-16T12:12:33.467000

If you have the generated code produce the full representation of the floating-point value—all of the decimal digits needed to show its exact value—then a C++ 11 compiler is required to parse the number exactly.

C++ 11 draft N3092 2.14.4 1 says, of a floating literal:

… The exponent, if present, indicates the power of 10 by which the significant [likely typo, should be “significand”] part is to be scaled. If the scaled value is in the range of representable values for its type, the result is the scaled value if representable, else the larger or smaller representable value nearest the scaled value, chosen in an implementation-defined manner…

Thus, if the floating literal does not have all the digits needed to show the exact value, the implementation may round it either upward or downward, as the implementation defines. But if it does have all the digits, then the value represented by the floating literal is representable in the floating-point format, and so its value must be the result of the parsing.
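A generator could emit such full-length decimal literals along the lines of the sketch below. The helper name exact_decimal_literal is hypothetical. The sketch assumes a 32-bit IEEE 754 float, for which 112 significant decimal digits are enough to write any finite value exactly, and it assumes that the generator's C library prints all requested digits exactly (glibc does; other libraries should be checked). The trailing "f" makes the emitted text a literal of type float.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <string>

// Hypothetical generator helper: writes a finite float with 112 significant
// decimal digits (1 before the point, 111 after), then appends the 'f' suffix.
inline std::string exact_decimal_literal(float value) {
    char buffer[160];
    const int written = std::snprintf(buffer, sizeof buffer, "%.111e",
                                      static_cast<double>(value));
    assert(written > 0 && static_cast<std::size_t>(written) < sizeof buffer);
    (void)written;  // silence unused-variable warnings when NDEBUG is defined
    return std::string(buffer) + "f";
}

// Parses the literal text back (strtof stops at the 'f' suffix) and reports
// whether the result has exactly the same representation as `expected`.
inline bool round_trips(const std::string& literal, float expected) {
    const float parsed = std::strtof(literal.c_str(), nullptr);
    return std::memcmp(&parsed, &expected, sizeof parsed) == 0;
}
```

The literals are long and mostly trailing zeros, but that only costs source file size, not anything in the compiled array.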

**Cerno** · Answer 3 · 2022-08-18T15:50:01.827000

I have read some very valuable information here and would like to throw in an option that does not strictly answer the question, but could be a solution.

It might be problematic, but if so, I would like to discuss it.

The simple solution would be: Leave it as it is.

A short rundown of why I am hesitant about the suggested options:

memcpy relies on the compiler to optimize away the actual copy and to understand that I only want to read the values. Since I have large arrays of data, I want to avoid a surprise in which a changed compiler setting suddenly increases the runtime and requires a fix on short notice.
bit_cast is only available from C++20. There are reference implementations, but they basically use memcpy under the hood (see above).
Hex float literals are only available from C++17.
Directly writing the floats precisely... I don't know, it seems somewhat dangerous, because if I make a slight mistake I may end up with a data block that is slightly off, which could have an impact on my classification results. A mistake like that would be a nightmare to spot.

So why do I think I can get away with an implementation that is, strictly speaking, undefined? The rationale is that the standard may not define it, but compiler vendors likely do; at least the ones I have worked with so far gave me exact results. The code has been running without major problems for a fairly long time, across dozens of code generator runs, and I would expect a failed reinterpret_cast to break the conversion so severely that I would spot it in my classification results right away.

Still not robust enough though. So my idea was to write a unit test that contains a significant number of hex-floats, do the reinterpret_cast and compare to reference float values for exact correspondence to tell me if a setting or compiler failed in this regard.

I have one doubt though: Is the assumption somewhat reasonable that a failed reinterpret_cast would break things spectacularly, or are the bets totally off when it comes to undefined behavior?

I am a bit worried that the compiler might define the undefined behavior in such a way that it picks a float that is close to the hex value instead of the precise one (although I would wonder why), and that this happens only sporadically, so that my unit test misses the problem.

So the endgame would be to unit test every single data entry against the corresponding reference float. Since the code is generated, I can generate the test as well. I think that should put all my worries to rest and make sure that I can get this to work across all possible compilers and compiler settings or be notified if anything breaks.
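The per-entry check could be generated along these lines. This is only a sketch: count_mismatches is a made-up name, a 32-bit IEEE 754 float is assumed, and the comparison itself goes through memcmp rather than reinterpret_cast so that the test does not depend on the behavior it is supposed to detect. The reference array can be filled with whatever the code under test actually reads.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == sizeof(std::uint32_t), "float must be 32 bits wide");

// Hypothetical helper for a generated test: counts the entries whose float
// does not have exactly the stored bit pattern.
inline std::size_t count_mismatches(const std::uint32_t* bits,
                                    const float* reference,
                                    std::size_t count) {
    assert(bits != nullptr && reference != nullptr);
    std::size_t mismatches = 0;
    for (std::size_t i = 0; i < count; ++i) {
        // Compare representations rather than values, so that +0.0f and -0.0f
        // differ and a "neighbor" float one bit away is reported as well.
        if (std::memcmp(&bits[i], &reference[i], sizeof(float)) != 0) {
            ++mismatches;
        }
    }
    return mismatches;
}
```

A result other than zero means at least one entry differs from its bit pattern, even if only in the last bit.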
