Is it possible to represent -3/32 as a binary floating-point value using only 7 bits?


Suppose you are limited to 7 bits for a floating-point representation: 1 sign bit, 3 exponent bits, and 3 fraction bits.

First I convert 3/32 to the binary 0.00011,
then to the standard scientific notation of 1.1 * 2^(-4).

At this point I realize my exponent field will be -1, which is not valid.
I try to represent 3/32 as 0.11 * 2^(-3) instead, which leads to the more intuitive representation of 1 000 110.
However, obviously this is a denormalized value, and if I try to convert the representation back to decimal I get -3/16.

My question is: is it even possible to represent this value precisely within the constraints of the problem?
It looks like the most negative finite value for this scheme is -15, so -3/32 falls within the representable range.
I'm aware that bits are dropped and precision is lost during conversions; is this the case here?


There are 2 best solutions below

alias On BEST ANSWER

With 1 sign, 3 exponent, and 3 significand bits, following IEEE-754 rules, here are the four smallest non-negative finite values you can represent:

Bits       | Decimal Value
-----------+----------------
0b0000000  | 0
0b0000001  | 0.03125
0b0000010  | 0.0625
0b0000011  | 0.09375

The value you're looking for, 3/32, equals 0.09375 in decimal, which matches the fourth value. So it is exactly representable in this format.
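The table can be reproduced with a small decoder. This is a minimal sketch in Python, assuming IEEE-754-style rules (bias of 3, subnormals when the exponent field is 0, all-ones exponent reserved for infinity and NaN); the names `decode`, `EXP_BITS`, `FRAC_BITS`, and `BIAS` are illustrative and not taken from any library.

```python
from fractions import Fraction

EXP_BITS = 3
FRAC_BITS = 3
BIAS = 2 ** (EXP_BITS - 1) - 1  # 3, assuming the usual IEEE-754-style bias


def decode(bits: int):
    """Decode a 7-bit pattern (1 sign, 3 exponent, 3 fraction bits).

    Returns an exact Fraction for finite values, or a string for
    infinities and NaNs.
    """
    sign = -1 if (bits >> (EXP_BITS + FRAC_BITS)) & 1 else 1
    exp_field = (bits >> FRAC_BITS) & ((1 << EXP_BITS) - 1)
    frac_field = bits & ((1 << FRAC_BITS) - 1)
    frac = Fraction(frac_field, 1 << FRAC_BITS)

    if exp_field == (1 << EXP_BITS) - 1:
        # All-ones exponent field is reserved for infinity and NaN.
        if frac_field:
            return "nan"
        return "-inf" if sign < 0 else "inf"
    if exp_field == 0:
        # Subnormal: no hidden 1, exponent fixed at 1 - BIAS = -2.
        return sign * frac * Fraction(2) ** (1 - BIAS)
    # Normal: hidden leading 1, exponent is the stored field minus the bias.
    return sign * (1 + frac) * Fraction(2) ** (exp_field - BIAS)


# Print the four smallest non-negative bit patterns and their values.
for bits in range(4):
    print(f"0b{bits:07b}  {float(decode(bits))}")
```

Under the same assumptions, `decode(0b1000110)` gives -3/16, which is the value the question arrived at for the pattern 1 000 110.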

The detailed representation of this value is:

                  6 543 210
                  S E3- S3-
   Binary layout: 0 000 011
      Hex layout: 03
       Precision: 3 exponent bits, 3 significand bits
            Sign: Positive
        Exponent: -2 (Subnormal, with fixed exponent value. Stored: 0, Bias: 3)
  Classification: FP_SUBNORMAL
          Binary: 0b1.1p-4
           Octal: 0o6p-6
             Hex: 0x1.8p-4

Since you wanted -3/32, you can simply set the sign bit.
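As a quick check of the negative case, here is a minimal Python sketch that sets the sign bit and evaluates the subnormal formula directly. It assumes a bias of 3 and the IEEE-754 subnormal rule (exponent fixed at 1 - bias); the variable names are illustrative only.

```python
from fractions import Fraction

BIAS = 3  # assumed: 2**(3 - 1) - 1 for a 3-bit exponent field

# Fields for -3/32: sign bit set, exponent field 0 (subnormal), fraction 011.
sign, exp_field, frac_field = 1, 0b000, 0b011

# Subnormal value: (-1)**sign * 0.fff (binary) * 2**(1 - BIAS)
value = (-1) ** sign * Fraction(frac_field, 8) * Fraction(2) ** (1 - BIAS)

# Pack the three fields into a single 7-bit pattern.
bits = (sign << 6) | (exp_field << 3) | frac_field
print(f"{bits:07b} -> {value}")  # prints "1000011 -> -3/32"
```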

Chris Dodd On

The first step is to represent your number in (binary) scientific notation, which is 0b1.1 × 2^(-4). So for a normalized floating-point value, you'd have a mantissa field of just 1 (padded out with trailing 0s to fill the field, since the leading 1 is hidden) and an exponent of -4.

The problem is that with only 3 exponent bits, you'll (probably) have a bias of 3 (2^(k-1) - 1 for k bits), which means the minimum representable exponent is -2. As a result, you need to use a denormalized representation with no "hidden" 1 and an exponent of -2. So you shift the mantissa right by two places to raise the exponent from -4 to -2, giving 0b0.011 × 2^(-2).

This makes your final 7-bit fp value for 3/32 equal to 0 000 011: the sign is 0, the exponent field is 0 (for a denorm), and the mantissa is 011. For -3/32, set the sign bit to get 1 000 011.
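The steps in this answer can be sketched as a small encoder. This is only an illustration in Python, assuming a bias of 3 and IEEE-754-style denormals; the function name `encode` and the constants are hypothetical, and the sketch rejects values that are not exactly representable rather than rounding them.

```python
from fractions import Fraction

EXP_BITS, FRAC_BITS = 3, 3
BIAS = 2 ** (EXP_BITS - 1) - 1  # 3 (assumed IEEE-754-style bias)
MIN_EXP = 1 - BIAS              # -2, smallest exponent of a normal value


def encode(value) -> str:
    """Return 'S EEE FFF' for an exactly representable finite value.

    Raises ValueError if the value is too large or would need rounding.
    """
    value = Fraction(value)
    sign = 1 if value < 0 else 0
    mag = abs(value)
    if mag == 0:
        return f"{sign} {'0' * EXP_BITS} {'0' * FRAC_BITS}"

    # Step 1: binary scientific notation, mag = m * 2**e with 1 <= m < 2.
    m, e = mag, 0
    while m >= 2:
        m /= 2
        e += 1
    while m < 1:
        m *= 2
        e -= 1

    if e < MIN_EXP:
        # Step 2: too small for a hidden 1. Shift the mantissa right until
        # the exponent is MIN_EXP and store an exponent field of 0.
        m /= 2 ** (MIN_EXP - e)
        exp_field = 0
        frac = m * 2 ** FRAC_BITS
    else:
        # Normal case: drop the hidden 1 and store the biased exponent.
        exp_field = e + BIAS
        frac = (m - 1) * 2 ** FRAC_BITS

    if exp_field >= 2 ** EXP_BITS - 1:
        raise ValueError("magnitude too large for this format")
    if frac.denominator != 1:
        raise ValueError("value is not exactly representable")
    return f"{sign} {exp_field:0{EXP_BITS}b} {int(frac):0{FRAC_BITS}b}"


print(encode(Fraction(-3, 32)))
```

Working with `Fraction` keeps every step exact, so the final check on `frac.denominator` detects any value whose mantissa would not fit in 3 bits.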