I am currently writing a compiler to test my programming ability, and my target architecture doesn't have a floating-point unit. To account for this, I am adding functions to my standard library that handle floating-point calculations with bitwise operations. All floating-point values are 32 bits: 1 sign bit, 8 exponent bits, and 23 mantissa bits. Detailed below are all the steps I use for the subtraction operation.
As a note, the code I currently have in the standard library works for numbers of the same sign, and even works in some cases for numbers with different signs.
The test case I have that is failing is associated with the following subtraction operation:
50.0 - 92.0 = -42.0
Step one is to convert both numbers into binary. The converted numbers are as follows:
       S|Exponent|Mantissa                   Binary scientific notation
50.0 = 0|10000100|10010000000000000000000 = 1.10010000000000000000000x2^5
92.0 = 0|10000101|01110000000000000000000 = 1.01110000000000000000000x2^6
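The bit patterns above can be checked on a host machine with a short script. This is only a sketch for verification, not part of the standard library in question; the helper names `float_to_fields` and `format_fields` are made up for illustration.

```python
import struct

def float_to_fields(x):
    """Split x into IEEE 754 single-precision sign, biased exponent, and fraction fields."""
    # Pack as a big-endian 32-bit float, then reinterpret the bytes as an unsigned integer.
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF   # 8 bits, biased by 127
    fraction = bits & 0x7FFFFF       # 23 stored bits, leading 1 is implicit
    return sign, exponent, fraction

def format_fields(x):
    """Render x in the sign|exponent|fraction layout used in this post."""
    sign, exponent, fraction = float_to_fields(x)
    return f"{sign}|{exponent:08b}|{fraction:023b}"

print("50.0 =", format_fields(50.0))
print("92.0 =", format_fields(92.0))
```

Running the same helper on -42.0 gives the expected answer to compare against at the end.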
Step two is to raise the exponent of 50.0 from 5 to 6. Therefore, we need to shift its bits one place to the right to account for the increase in the exponent.
1.10010000000000000000000x2^5 becomes 0.11001000000000000000000x2^6
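In integer terms, the alignment step might look like the sketch below. It assumes each mantissa is held as a 24-bit integer with the implicit leading 1 already restored (so 50.0 is 0xC80000 and 92.0 is 0xB80000), and the function name `align` is hypothetical. Bits shifted out on the right are simply dropped here; a complete implementation would normally keep extra guard bits for rounding.

```python
def align(sig_a, exp_a, sig_b, exp_b):
    """Shift the significand with the smaller exponent right so both share the larger exponent.

    sig_a and sig_b are 24-bit integers including the implicit leading 1.
    exp_a and exp_b are biased exponents. Returns (sig_a, sig_b, common_exponent).
    """
    if exp_a < exp_b:
        sig_a >>= exp_b - exp_a   # bits shifted out are lost (no rounding in this sketch)
        exp_a = exp_b
    elif exp_b < exp_a:
        sig_b >>= exp_a - exp_b
    return sig_a, sig_b, exp_a

# 50.0 = 1.1001 x 2^5 (biased 132), 92.0 = 1.0111 x 2^6 (biased 133)
sig_50, sig_92, exponent = align(0xC80000, 132, 0xB80000, 133)
print(f"{sig_50:024b} {sig_92:024b} {exponent}")
```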
Step three is to take the two's complement of the second value, because we are subtracting 92.0, not adding it.
1.01110000000000000000000x2^6 inverted is 0.10001111111111111111111x2^6
0.10001111111111111111111x2^6 + 1 is 0.10010000000000000000000x2^6
The final step is to add the mantissas together:
0.11001000000000000000000x2^6
+ 0.10010000000000000000000x2^6
_______________________________
1.01011000000000000000000x2^6
Now this final part is where I get confused, because the final result of -42.0 in IEEE 754 format is
Sign Exp Mantissa
-42.0 = 1|10000100|01010000000000000000000
And obviously the Mantissa
01010000000000000000000 is not
01011000000000000000000
Does anyone have some insight as to what I am doing wrong? Thanks.
You have not used enough bits to handle two’s complement correctly, and you have not handled a negative result.
In complementing the positive 1.01110000000000000000000₂ × 2^6, you got 0.10010000000000000000000₂ × 2^6. The result should be a negative number, but a leading 0 in two’s complement indicates a positive number. In other words, your complement operation overflowed the format.
If you prefix a leading 0 and then complement, you will have 10.10010000000000000000000₂ × 2^6, and you will add this to 00.11001000000000000000000₂ × 2^6, which has also had a 0 prefixed. Then the sum is 11.01011000000000000000000₂ × 2^6.
The leading bit is 1, indicating the result is negative. So you can complement it again to see the absolute value, 00.10101000000000000000000₂ × 2^6, meaning the result is −00.10101000000000000000000₂ × 2^6.
Finally, you normalize this to −1.01010000000000000000000₂ × 2^5, which is −42.
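The steps above can be traced in a few lines of integer arithmetic. This is a minimal sketch, assuming the significands are already aligned 24-bit integers (0x640000 for 50.0 shifted to exponent 6, and 0xB80000 for 92.0) and that one extra sign bit is added on top, giving a 25-bit word. The names `WIDTH`, `twos_complement`, and `subtract_aligned` are illustrative, and rounding, subnormals, infinities, and NaNs are ignored.

```python
WIDTH = 25                 # 1 sign bit + 1 integer bit + 23 fraction bits
MASK = (1 << WIDTH) - 1

def twos_complement(value):
    """Negate value within a WIDTH-bit two's complement word."""
    return (-value) & MASK

def subtract_aligned(sig_a, sig_b, exponent):
    """Compute sig_a - sig_b for aligned 24-bit significands sharing one biased exponent.

    Returns (sign, biased exponent, 23-bit fraction) of the normalized result.
    """
    total = (sig_a + twos_complement(sig_b)) & MASK
    sign = total >> (WIDTH - 1)            # top bit set means the sum is negative
    magnitude = twos_complement(total) if sign else total
    if magnitude == 0:
        return 0, 0, 0                     # exact zero
    while not magnitude & 0x800000:        # normalize: move the leading 1 up to bit 23
        magnitude <<= 1
        exponent -= 1
    return sign, exponent, magnitude & 0x7FFFFF

sign, exponent, fraction = subtract_aligned(0x640000, 0xB80000, 133)
print(f"{sign}|{exponent:08b}|{fraction:023b}")   # 1|10000100|01010000000000000000000
```

With only 24 bits (`WIDTH = 24`) the sign information has nowhere to go, which is the overflow described above.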
Notes
This explanation is not an endorsement of using two’s complement. Implementing a direct subtractor may be preferred.
“Significand” is the preferred term for the fraction part of a floating-point number. “Mantissa” is an old term for the fraction part of a logarithm. Significands are linear (if the number increases by a factor of 1.2, the significand increases by a factor of 1.2, unless an exponent threshold is crossed), whereas mantissas are logarithmic (adding to the mantissa multiplies the value represented).
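As a sketch of the direct subtractor mentioned in the first note: instead of complementing, compare the two magnitudes, subtract the smaller significand from the larger, and set the sign from the comparison. The code below assumes both inputs are non-negative normal numbers, truncates instead of rounding, and does not handle subnormal results, infinities, or NaNs. All function names are hypothetical, and `struct` is used only to move values in and out for testing on a host machine.

```python
import struct

def f32_bits(x):
    """Raw 32-bit pattern of x as an IEEE 754 single."""
    return struct.unpack(">I", struct.pack(">f", x))[0]

def bits_f32(b):
    """Value of the IEEE 754 single with raw bit pattern b."""
    return struct.unpack(">f", struct.pack(">I", b))[0]

def sub_magnitudes(a_bits, b_bits):
    """Return the bit pattern of a - b for two non-negative normal singles."""
    sign = 0
    if a_bits < b_bits:
        # For non-negative floats, bit-pattern order matches value order,
        # so swapping here guarantees a non-negative difference below.
        a_bits, b_bits = b_bits, a_bits
        sign = 1
    exp_a = (a_bits >> 23) & 0xFF
    exp_b = (b_bits >> 23) & 0xFF
    sig_a = (a_bits & 0x7FFFFF) | 0x800000   # restore the implicit leading 1
    sig_b = (b_bits & 0x7FFFFF) | 0x800000
    sig_b >>= exp_a - exp_b                  # align to the larger exponent (truncating)
    diff = sig_a - sig_b
    if diff == 0:
        return 0                             # +0.0
    exponent = exp_a
    while not diff & 0x800000:               # normalize
        diff <<= 1
        exponent -= 1
    return (sign << 31) | (exponent << 23) | (diff & 0x7FFFFF)

print(bits_f32(sub_magnitudes(f32_bits(50.0), f32_bits(92.0))))   # -42.0
```

Because the subtraction always runs larger minus smaller, no extra sign bit is needed and the intermediate value never goes negative.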