Floating Point Operations
A brief overview of floating point numbers - representations and main operations. At the end, there are links to additional web-pages.
Floating point number - M*B^E, where M is the mantissa (significand), E the exponent and B the base. For everyday use, the base is 10, and both mantissa and exponent are decimal numbers - the mantissa is real and the exponent is an integer. When a separate sign S is used, the mantissa is non-negative and the floating point number is written as (-1)^S*M*B^E.
Computers and digital systems usually use binary numbers - B=2, M and E are binary numbers, and S is a single bit. The actual number of bits for the mantissa and exponent depends on the format used (see, e.g., the IEEE-754 standard). In addition, the mantissa is usually a fraction restricted to a certain range (normalized).
For simplicity, the format used in the examples below has 4 bits for the exponent and 8 bits for the mantissa (or 7 bits plus sign). The normalized mantissa is 0.5≤|M|<1.
Signed mantissa (2's complement, no extra bit for the sign):
| decimal | M₁₀*10^E₁₀ | binary | M₂*2^E₂ | M₂ | E₂ | [M₁₀*2^E₁₀] | [decimal] |
| 2.5 | 0.25*10^1 | 10.1 | 0.101*2^10 | 0.1010000 | 0010 | 0.625*2^2 | 2.5 |
| 1.1 | 0.11*10^1 | 1.0(0011) | 0.10(0011)*2^1 | 0.1000110 | 0001 | 0.546875*2^1 | 1.09375 |
| -2.2 | -0.22*10^1 | -10.(0011) | -0.10(0011)*2^10 | 1.0111010 | 0010 | -0.546875*2^2 | -2.1875 |
| -0.056 | -0.56*10^-1 | -0.0000111001010110... | -0.111001010110...*2^-100 | 1.0001110 | 1100 | -0.890625*2^-4 | -0.0556640625 |
Unsigned (positive) mantissa, extra bit for sign (S):
| decimal | M₁₀*10^E₁₀ | binary | M₂*2^E₂ | S | M₂ | E₂ | [M₁₀*2^E₁₀] | [decimal] |
| 2.5 | 0.25*10^1 | 10.1 | 0.101*2^10 | 0 | .1010000 | 0010 | 0.625*2^2 | 2.5 |
| 1.1 | 0.11*10^1 | 1.0(0011) | 0.10(0011)*2^1 | 0 | .1000110 | 0001 | 0.546875*2^1 | 1.09375 |
| -2.2 | -0.22*10^1 | -10.(0011) | -0.10(0011)*2^10 | 1 | .1000110 | 0010 | -0.546875*2^2 | -2.1875 |
| -0.056 | -0.56*10^-1 | -0.0000111001010110... | -0.111001010110...*2^-100 | 1 | .1110010 | 1100 | -0.890625*2^-4 | -0.0556640625 |
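The conversion shown in the two tables above can be sketched in Python. This is only an illustration under the assumptions of the toy format (7 fraction bits, cutting off instead of rounding, 0.5≤|M|<1); the names `encode`, `twos_complement` and `decode` are hypothetical helpers, not part of any library.

```python
from fractions import Fraction

FRACTION_BITS = 7  # toy format from the text: 7 fraction bits in the mantissa


def encode(x):
    """Return (sign, mantissa, exponent); mantissa is an integer scaled by 2**7.

    x should be a string such as "1.1" so that Fraction holds the exact
    decimal value. Extra bits are cut off (no rounding).
    """
    value = Fraction(x)
    if value == 0:
        return 0, 0, 0          # zero has no normalized form (assumed encoding)
    sign = 1 if value < 0 else 0
    magnitude = abs(value)
    exponent = 0
    while magnitude >= 1:       # too large: halve mantissa, increment exponent
        magnitude /= 2
        exponent += 1
    while magnitude < Fraction(1, 2):   # too small: double mantissa
        magnitude *= 2
        exponent -= 1
    mantissa = int(magnitude * 2**FRACTION_BITS)  # int() cuts off extra bits
    return sign, mantissa, exponent


def twos_complement(sign, mantissa, bits=8):
    """Bit pattern of the signed (2's complement) mantissa field."""
    return (-mantissa if sign else mantissa) & (2**bits - 1)


def decode(sign, mantissa, exponent):
    """Value actually represented by the three fields."""
    return (-1) ** sign * mantissa / 2**FRACTION_BITS * 2.0**exponent
```

For example, `encode("-2.2")` gives `(1, 0b1000110, 2)`, and `twos_complement(1, 0b1000110)` gives `0b10111010`, i.e. the signed mantissa 1.0111010 from the first table.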
In the examples below, notations used are "mmmm...m|ee..e" (signed mantissa and exponent) and "s|mmm...m|ee..e" (sign, unsigned mantissa and exponent). In actual implementations (e.g., IEEE-754) the order of fields in the word may be different.
It should be noted that with an unsigned normalized mantissa, the most significant bit (MSB) is always '1'. The IEEE format uses this feature: the bit is not stored, so the mantissa essentially gains one extra bit. However, extra encoding is needed to distinguish between normalized and un-normalized mantissas. This feature is not used in the examples below, where the MSB is always stored and is '0' for an un-normalized mantissa.
The representation M*2^E is used to illustrate the operations while taking into account the following issues:
One of the mantissas may need un-normalization when adding/subtracting (to make the exponents equal). For all operations, the resulting mantissa may need normalization together with the correction of the resulting exponent.
In the examples below, both operands and the result are represented by mantissa and exponent - A = MA * 2^EA, B = MB * 2^EB and R = MR * 2^ER.
To add or subtract mantissas when calculating A+B (or A-B), the exponents must be equal so that the common power of two can be factored out. In general, the operands have different exponents, and to make them equal the mantissa of one operand must be un-normalized. For instance, 6.0 == 0.75*2^3 == 1.5*2^2 == 0.375*2^4. Often the mantissa is kept <1 to avoid the need for additional bits in the integer part (left of the point). Therefore it is useful to keep the operand with the greater exponent unchanged and to change the mantissa and exponent of the operand with the smaller exponent. The exponent of the result will be equal to the greater exponent. However, there may be a need to correct both the mantissa and exponent of the result when its mantissa is not normalized. Based on that, for addition and subtraction the corrections of mantissa and exponent can be presented as follows: if EA≥EB, then MB is shifted right by EA-EB bits and ER=EA; otherwise MA is shifted right by EB-EA bits and ER=EB.
When subtracting, the mantissas and exponents of operands are corrected exactly in the same way. The mantissa and exponent of the result may need correction after both addition and subtraction. The largest value of the mantissa will be less than 2.0 and therefore an extra bit is needed in the integer part (or the extension of the sign bit). The smallest value of the mantissa can be 0 - when subtracting equal mantissas, for instance. Extra bits are also used in the fraction part to support rounding (for simplicity, only one extra bit in the examples). The subtraction is done like with integers when mantissa is a signed fixed-point number and 2's complement encoding is used.
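The alignment, addition and normalization steps described above could be sketched as follows. It is a simplified model, not a hardware description: mantissas are signed Python integers scaled by 2^7 (so 0.1010000 is 80), one extra fraction bit is kept as in the examples, the result is cut off rather than rounded, and exponent overflow is ignored. The name `fp_add` is made up for this illustration.

```python
FRAC = 7    # fraction bits of the stored mantissa
GUARD = 1   # extra fraction bit used during the operation


def fp_add(ma, ea, mb, eb, subtract=False):
    """Add (or subtract) two numbers given as signed mantissa and exponent."""
    if subtract:
        mb = -mb
    a = ma << GUARD
    b = mb << GUARD
    # Un-normalize the operand with the smaller exponent.
    # Python's >> on negative integers behaves like an arithmetic shift.
    if ea >= eb:
        b >>= ea - eb
        e = ea
    else:
        a >>= eb - ea
        e = eb
    s = a + b
    if s == 0:
        return 0, 0             # special case: zero cannot be normalized
    one = 1 << (FRAC + GUARD)   # the value 1.0 at the working scale
    while abs(s) >= one:        # |MR| >= 1: shift right, increment exponent
        s >>= 1
        e += 1
    while abs(s) < one // 2:    # |MR| < 0.5: shift left, decrement exponent
        s <<= 1
        e -= 1
    return s >> GUARD, e        # cut off the extra bit
```

With the operands of example 1 below, `fp_add(0b1010000, 2, 0b1000110, 1)` returns `(0b1110011, 2)`, i.e. 3.59375.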
Examples
1) 2.5 + 1.1 = 3.6
| MA | EA | MB | EB | MR | ER | Action / comments |
| 0.1010000 | 0010 | 0.1000110 | 0001 | | | EA>EB --> MB>>1 & EB+1, plus extra bits |
| 00.10100000 | 0010 | 00.01000110 | 0010 | 00.11100110 | 0010 | MA+MB, 0.5≤|MR|<1 --> OK, no need to round either |
| | | | | 0.1110011 | 0010 | 3.59375₁₀ |
2) 2.5 - 2.2 = 0.3
| MA | EA | MB | EB | MR | ER | Action / comments |
| 0.1010000 | 0010 | 0.1000110 | 0010 | | | EA=EB --> OK, plus extra bits |
| 00.10100000 | 0010 | 00.10001100 | 0010 | 00.00010100 | 0010 | MA-MB, |MR|<0.5 --> MR<<3 & ER-3 |
| | | | | 0.1010000 | 1111 | 0.3125₁₀ |
3) 1.1 - 1.1 = 0.0 -- how to represent zero?
| MA | EA | MB | EB | MR | ER | Action / comments |
| 0.1000110 | 0001 | 0.1000110 | 0001 | | | EA=EB --> OK, plus extra bits |
| 00.10001100 | 0001 | 00.10001100 | 0001 | 00.00000000 | 0001 | MA-MB, |MR|=0 --> a special case: shifting MR left never produces a '1', so normalization would end in exponent underflow! |
| | | | | 0.0000000 | 0000 | 0.0₁₀ - different formats may handle zero differently |
4) 1.1 + 0.56 = 1.66 -- rounding?
| MA | EA | MB | EB | MR | ER | Action / comments |
| 0.1000110 | 0001 | 0.1000111 | 0000 | | | EA>EB --> MB>>1 & EB+1, plus extra bits |
| 00.10001100 | 0001 | 00.01000111 | 0001 | 00.11010011 | 0001 | MA+MB, 0.5≤|MR|<1 --> OK but rounding may be useful |
| | | | | 0.1101001 | 0001 | 1.640625₁₀ - cutting off LSB |
| | | | | 0.1101010 | 0001 | 1.65625₁₀ - rounded (up) |
5) -2.2 - 2.5 = -4.7 -- rounding negative result?
| MA | EA | MB | EB | MR | ER | Action / comments |
| 1.0111010 | 0010 | 0.1010000 | 0010 | | | EA=EB --> OK, plus extra bits |
| 11.01110100 | 0010 | 00.10100000 | 0010 | 10.11010100 | 0010 | MA-MB, |MR|≥1 --> MR>>1 & ER+1 |
| | | | | 1.0110101 | 0011 | -4.6875₁₀ - the bits shifted out are zero here, so cutting off and rounding agree; in general, cutting off the LSB of a negative 2's complement mantissa rounds |MR| up |
To multiply, mantissas are multiplied and exponents added - A*B == MA*2^EA * MB*2^EB == (MA*MB) * (2^EA*2^EB) == (MA*MB)*2^(EA+EB). The resulting mantissa may need normalization by shifting it one bit left and decrementing the resulting exponent, because the absolute value of the resulting mantissa will be between 0.25 and 1.0 - 0.25≤|MR|<1.
It is assumed in the examples below that 8x8-bit fixed-point multiplier (sign and 7-bit fraction) is used. Because of that, 7 lower bits of the result must be "cut off" and/or rounding may be needed.
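A minimal sketch of this multiplication, under the same assumptions as the text (signed mantissas with 7 fraction bits, so the raw product has 14 fraction bits). Mantissas are modelled as signed Python integers scaled by 2^7; `fp_mul` and its `rounding` flag are hypothetical names, and rounding here means adding half an LSB before cutting off.

```python
FRAC = 7  # fraction bits of each mantissa; the raw product has 2*FRAC


def fp_mul(ma, ea, mb, eb, rounding=False):
    """Multiply two numbers given as signed mantissa and exponent."""
    p = ma * mb                       # product scaled by 2**14
    e = ea + eb                       # exponents are added
    if p == 0:
        return 0, 0
    if abs(p) < 1 << (2 * FRAC - 1):  # |MR| < 0.5: one left shift is enough
        p <<= 1
        e -= 1
    if rounding:
        p += 1 << (FRAC - 1)          # add half an LSB of the final mantissa
    m = p >> FRAC                     # cut off the 7 lower bits
    if abs(m) >= 1 << FRAC:           # rounding carried up to 1.0
        m >>= 1
        e += 1
    return m, e
```

For example 1 below, `fp_mul(0b1010000, 2, 0b1000110, 1)` returns `(0b1010111, 2)` (2.71875), and with `rounding=True` it returns `(0b1011000, 2)` (2.75).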
Examples
1) 2.5 * 1.1 = 2.75
| MA | EA | MB | EB | MR | ER | Action / comments |
| 0.1010000 | 0010 | 0.1000110 | 0001 | 00.01010111100000 | 0011 | |MR|<0.5 --> MR<<1 & ER-1 |
| | | | | 0.1010111 | 0010 | 2.71875₁₀ - cut off |
| | | | | 0.1011000 | 0010 | 2.75₁₀ - rounded |
2) 1.1 * (-2.2) = -2.42
| MA | EA | MB | EB | MR | ER | Action / comments |
| 0.1000110 | 0001 | 1.0111010 | 0010 | 11.10110011011100 | 0011 | |MR|<0.5 --> MR<<1 & ER-1 |
| | | | | 1.0110011 | 0010 | -2.40625₁₀ - cut off==rounded |
3) (-2.2) * (-0.056) = 0.1232
| MA | EA | MB | EB | MR | ER | Action / comments |
| 1.0111010 | 0010 | 1.0001110 | 1100 | 00.01111100101100 | 1110 | |MR|<0.5 --> MR<<1 & ER-1 |
| | | | | 0.1111100 | 1101 | 0.12109375₁₀ - cut off |
| | | | | 0.1111101 | 1101 | 0.1220703125₁₀ - rounded |
4) 0.8 * 0.8 = 0.64
| MA | EA | MB | EB | MR | ER | Action / comments |
| 0.1100110 | 0000 | 0.1100110 | 0000 | 00.10100010100100 | 0000 | 0.5≤|MR|<1 --> OK |
| | | | | 0.1010001 | 0000 | 0.6328125₁₀ - cut off==rounded |
To divide, mantissas are divided and exponents subtracted - A/B == (MA*2^EA) / (MB*2^EB) == (MA/MB) * (2^EA/2^EB) == (MA/MB)*2^(EA-EB). The resulting mantissa may need normalization by shifting it one bit right and incrementing the resulting exponent, because the absolute value of the resulting mantissa will be between 0.5 and 2.0 - 0.5≤|MR|<2. Rounding of the mantissa may also be needed.
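The division can be sketched in the same style. The assumptions: mantissas are signed Python integers scaled by 2^7, the quotient is computed on absolute values and cut off to 7 fraction bits, and both operands are normalized. `fp_div` is a made-up name for this illustration.

```python
FRAC = 7  # fraction bits of the mantissa


def fp_div(ma, ea, mb, eb):
    """Divide two numbers given as signed mantissa and exponent."""
    if mb == 0:
        raise ZeroDivisionError("division by zero mantissa")
    if ma == 0:
        return 0, 0
    negative = (ma < 0) != (mb < 0)
    # Quotient of the magnitudes with 7 fraction bits: 0.5 <= q/128 < 2.
    q = (abs(ma) << FRAC) // abs(mb)
    e = ea - eb                  # exponents are subtracted
    if q >= 1 << FRAC:           # |MR| >= 1.0: shift right, increment exponent
        q >>= 1
        e += 1
    return (-q if negative else q), e
```

For example 1 below, `fp_div(0b1010000, 2, 0b1000110, 1)` returns `(0b1001001, 2)`, i.e. 2.28125.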
Examples
1) 2.5 / 1.1 = 2.(27)
| MA | EA | MB | EB | MR | ER | Action / comments |
| 0.1010000 | 0010 | 0.1000110 | 0001 | 01.0010010 | 0001 | |MR|≥1.0 --> MR>>1 & ER+1 |
| | | | | 0.1001001 | 0010 | 2.28125₁₀ |
2) -2.2 / 1.1 = -2.0
| MA | EA | MB | EB | MR | ER | Action / comments |
| 1.0111010 | 0010 | 0.1000110 | 0001 | 11.0000000 | 0001 | |MR|≥1.0 --> MR>>1 & ER+1 |
| | | | | 1.1000000 | 0010 | -2.0₁₀ |
3) -0.056 / -2.2 = 0.025(45)
| MA | EA | MB | EB | MR | ER | Action / comments |
| 1.0001110 | 1100 | 1.0111010 | 0010 | 01.1010000 | 1010 | |MR|≥1.0 --> MR>>1 & ER+1 |
| | | | | 0.1101000 | 1011 | 0.025390625₁₀ |
4) 0.6 / 0.8 = 0.75
| MA | EA | MB | EB | MR | ER | Action / comments |
| 0.1001101 | 0000 | 0.1100110 | 0000 | 00.1100000 | 0000 | 0.5≤|MR|<1 --> OK |
| | | | | 0.1100000 | 0000 | 0.75₁₀ |
The representation (-1)^S*M*2^E is used to illustrate the operations while taking into account the following issues:
One of the mantissas may need un-normalization when adding/subtracting (to make the exponents equal). For all operations, the resulting mantissa may need normalization together with the correction of the resulting exponent.
Making exponents equal and correcting mantissas is done as in the case of signed mantissas. There are two ways (versions) to add/subtract mantissas: (1) convert the mantissas of negative operands into 2's complement form and add/subtract them as signed numbers; (2) analyze the signs of the operands and add/subtract the absolute values as follows:
| A+B | B<0 | B≥0 | A-B | B<0 | B≥0 |
| A<0 | -(|A|+|B|) | |B|-|A| | A<0 | |B|-|A| | -(|A|+|B|) |
| A≥0 | |A|-|B| | |A|+|B| | A≥0 | |A|+|B| | |A|-|B| |
In both cases, the resulting mantissa may be negative; then the absolute value of the mantissa must be found and the sign corrected.
In the following examples, both versions are shown when the calculation process is different.
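The sign analysis of the table above can be sketched as follows. The assumptions: mantissas are unsigned Python integers scaled by 2^7, the sign is a separate bit, one extra fraction bit is kept, and the result is cut off. `sm_add` is a hypothetical name; subtraction is modelled by inverting the sign of B, which yields the same cases as the A-B half of the table.

```python
FRAC = 7    # fraction bits of the stored mantissa
GUARD = 1   # extra fraction bit used during the operation


def sm_add(sa, ma, ea, sb, mb, eb, subtract=False):
    """Add or subtract sign-magnitude operands; returns (sign, mantissa, exponent)."""
    if subtract:
        sb ^= 1                     # A - B == A + (-B)
    a = ma << GUARD
    b = mb << GUARD
    if ea >= eb:                    # un-normalize the smaller operand
        b >>= ea - eb
        e = ea
    else:
        a >>= eb - ea
        e = eb
    if sa == sb:                    # equal signs: add the absolute values
        mag, s = a + b, sa
    else:                           # different signs: subtract them
        mag, s = a - b, sa
        if mag < 0:                 # negative difference: take |MR|, flip sign
            mag, s = -mag, s ^ 1
    if mag == 0:
        return 0, 0, 0
    one = 1 << (FRAC + GUARD)
    while mag >= one:               # |MR| >= 1: shift right
        mag >>= 1
        e += 1
    while mag < one >> 1:           # |MR| < 0.5: shift left
        mag <<= 1
        e -= 1
    return s, mag >> GUARD, e       # cut off the extra bit
```

For instance, `sm_add(1, 0b1000110, 2, 0, 0b1010000, 2, subtract=True)` (that is, -2.2 - 2.5) returns `(1, 0b1001011, 3)`, i.e. -4.6875.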
1) 2.5 + 1.1 = 3.6
| SA|MA | EA | SB|MB | EB | SR|MR | ER | Action / comments [versions 1 & 2] |
| 0|.1010000 | 0010 | 0|.1000110 | 0001 | | | A>0, B>0 --> A+B=|A|+|B| |
| 0.1010000 | 0010 | 0.1000110 | 0001 | | | EA>EB --> MB>>1 & EB+1, plus extra bits |
| 00.10100000 | 0010 | 00.01000110 | 0010 | 00.11100110 | 0010 | MA+MB, 0.5≤|MR|<1 --> OK, no need to round either |
| | | | | 0|.1110011 | 0010 | 3.59375₁₀ |
2) 2.5 - 2.2 = 0.3
| SA|MA | EA | SB|MB | EB | SR|MR | ER | Action / comments [versions 1 & 2] |
| 0|.1010000 | 0010 | 0|.1000110 | 0010 | | | A>0, B≥0 --> A-B=|A|-|B| |
| 0.1010000 | 0010 | 0.1000110 | 0010 | | | EA=EB --> OK, plus extra bits |
| 00.10100000 | 0010 | 00.10001100 | 0010 | 00.00010100 | 0010 | MA-MB, |MR|<0.5 --> MR<<3 & ER-3 |
| | | | | 0|.1010000 | 1111 | 0.3125₁₀ |
3) -2.2 - 2.5 = -4.7
| SA|MA | EA | SB|MB | EB | SR|MR | ER | Action / comments [version 1] |
| 1|.1000110 | 0010 | 0|.1010000 | 0010 | | | MA=-MA; EA=EB --> OK, plus extra bits |
| 11.01110100 | 0010 | 00.10100000 | 0010 | 10.11010100 | 0010 | MA-MB, |MR|≥1.0 --> MR>>1 & ER+1; MR<0 --> -MR, SR=1 |
| | | | | 1|.1001011 | 0011 | -4.6875₁₀ |
| SA|MA | EA | SB|MB | EB | SR|MR | ER | Action / comments [version 2] |
| 1|.1000110 | 0010 | 0|.1010000 | 0010 | | | A<0, B≥0 --> A-B=-(|A|+|B|) [NB! SR=1] |
| 0|.1000110 | 0010 | 0|.1010000 | 0010 | | | |A| + |B|: EA=EB --> OK, plus extra bits |
| 00.10001100 | 0010 | 00.10100000 | 0010 | 01.00101100 | 0010 | |MR|≥1.0 --> MR>>1 & ER+1; result is -(|A|+|B|) --> SR=1 |
| | | | | 1|.1001011 | 0011 | -4.6875₁₀ |
4) (-2.2) - (-2.5) = 0.3
| SA|MA | EA | SB|MB | EB | SR|MR | ER | Action / comments [version 1] |
| 1|.1000110 | 0010 | 1|.1010000 | 0010 | | | MA=-MA, MB=-MB; EA=EB --> OK, plus extra bits |
| 11.01110100 | 0010 | 11.01100000 | 0010 | 00.00010100 | 0010 | MA-MB, |MR|<0.5 --> MR<<3 & ER-3; MR>0 --> SR=0 |
| | | | | 0|.1010000 | 1111 | 0.3125₁₀ |
| SA|MA | EA | SB|MB | EB | SR|MR | ER | Action / comments [version 2] |
| 1|.1000110 | 0010 | 1|.1010000 | 0010 | | | A<0, B<0 --> A-B=|B|-|A| |
| 0|.1010000 | 0010 | 0|.1000110 | 0010 | | | |B| - |A|: EA=EB --> OK, plus extra bits |
| 00.10100000 | 0010 | 00.10001100 | 0010 | 00.00010100 | 0010 | |MR|<0.5 --> MR<<3 & ER-3; MR>0 --> SR=0 |
| | | | | 0|.1010000 | 1111 | 0.3125₁₀ |
5) -2.5 + 1.1 = -1.4
| SA|MA | EA | SB|MB | EB | SR|MR | ER | Action / comments [version 1] |
| 1|.1010000 | 0010 | 0|.1000110 | 0001 | | | MA=-MA; EA>EB --> MB>>1 & EB+1, plus extra bits |
| 11.01100000 | 0010 | 00.01000110 | 0010 | 11.10100110 | 0010 | MA+MB, |MR|<0.5 --> MR<<1 & ER-1; MR<0 --> -MR, SR=1 |
| | | | | 1|.1011010 | 0001 | -1.40625₁₀ |
| SA|MA | EA | SB|MB | EB | SR|MR | ER | Action / comments [version 2] |
| 1|.1010000 | 0010 | 0|.1000110 | 0001 | | | A<0, B≥0 --> A+B=|B|-|A| |
| 0|.1000110 | 0001 | 0|.1010000 | 0010 | | | |B| - |A|: EA>EB --> MB>>1 & EB+1, plus extra bits |
| 00.01000110 | 0010 | 00.10100000 | 0010 | 11.10100110 | 0010 | |MR|<0.5 --> MR<<1 & ER-1; MR<0 --> -MR, SR=1 |
| | | | | 1|.1011010 | 0001 | -1.40625₁₀ |
In practice, multiplication is done as with signed mantissas - mantissas are multiplied and exponents are added. The main difference is that the mantissas are always positive; normalization may still be needed. The sign of the result is defined by the signs of the operands - the result is positive when the signs are equal and negative when they differ.
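A short sketch of this sign-magnitude multiplication, assuming unsigned mantissas stored as Python integers scaled by 2^7 and a separate sign bit. The sign of the result is the XOR of the operand signs. `sm_mul` and its `rounding` flag are hypothetical names.

```python
FRAC = 7  # fraction bits of each mantissa


def sm_mul(sa, ma, ea, sb, mb, eb, rounding=False):
    """Multiply sign-magnitude operands; returns (sign, mantissa, exponent)."""
    s = sa ^ sb                     # equal signs -> 0 (positive), else 1
    p = ma * mb                     # unsigned product scaled by 2**14
    e = ea + eb
    if p == 0:
        return 0, 0, 0
    if p < 1 << (2 * FRAC - 1):     # MR < 0.5: shift left, decrement exponent
        p <<= 1
        e -= 1
    if rounding:
        p += 1 << (FRAC - 1)        # add half an LSB before cutting off
    m = p >> FRAC
    if m >= 1 << FRAC:              # rounding carried up to 1.0
        m >>= 1
        e += 1
    return s, m, e
```

Because the magnitude is cut off here, truncation moves the result toward zero, while cutting off a negative 2's complement mantissa moves it away from zero. That is why 1.1 * (-2.2) gives -2.375 in this representation when cut off (`sm_mul(0, 0b1000110, 1, 1, 0b1000110, 2)` returns `(1, 0b1001100, 2)`), but -2.40625 with signed mantissas.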
Examples
1) 2.5 * 1.1 = 2.75
| SA|MA | EA | SB|MB | EB | SR|MR | ER | Action / comments |
| 0|.1010000 | 0010 | 0|.1000110 | 0001 | | | A>0 & B>0 --> R>0; plus extra bits |
| 0.1010000 | 0010 | 0.1000110 | 0001 | 00.01010111100000 | 0011 | |MR|<0.5 --> MR<<1 & ER-1 |
| | | | | 0|.1010111 | 0010 | 2.71875₁₀ - cut off |
| | | | | 0|.1011000 | 0010 | 2.75₁₀ - rounded |
2) 1.1 * (-2.2) = -2.42
| SA|MA | EA | SB|MB | EB | SR|MR | ER | Action / comments |
| 0|.1000110 | 0001 | 1|.1000110 | 0010 | | | A>0 & B<0 --> R<0; plus extra bits |
| 0.1000110 | 0001 | 0.1000110 | 0010 | 00.01001100100100 | 0011 | |MR|<0.5 --> MR<<1 & ER-1 |
| | | | | 1|.1001100 | 0010 | -2.375₁₀ - cut off |
| | | | | 1|.1001101 | 0010 | -2.40625₁₀ - rounded |
3) (-2.2) * (-0.056) = 0.1232
| SA|MA | EA | SB|MB | EB | SR|MR | ER | Action / comments |
| 1|.1000110 | 0010 | 1|.1110010 | 1100 | | | A<0 & B<0 --> R>0; plus extra bits |
| 0.1000110 | 0010 | 0.1110010 | 1100 | 00.01111100101100 | 1110 | |MR|<0.5 --> MR<<1 & ER-1 |
| | | | | 0|.1111100 | 1101 | 0.12109375₁₀ - cut off |
| | | | | 0|.1111101 | 1101 | 0.1220703125₁₀ - rounded |
4) 0.8 * 0.8 = 0.64
| SA|MA | EA | SB|MB | EB | SR|MR | ER | Action / comments |
| 0|.1100110 | 0000 | 0|.1100110 | 0000 | | | A>0 & B>0 --> R>0; plus extra bits |
| 0.1100110 | 0000 | 0.1100110 | 0000 | 00.10100010100100 | 0000 | 0.5≤|MR|<1 --> OK |
| | | | | 0|.1010001 | 0000 | 0.6328125₁₀ - cut off==rounded |
In practice, division is also done as with signed mantissas - mantissas are divided and exponents are subtracted. The main difference is that the mantissas are always positive; normalization may still be needed. The sign of the result is defined by the signs of the operands - the result is positive when the signs are equal and negative when they differ.
Examples
1) 2.5 / 1.1 = 2.(27)
| SA|MA | EA | SB|MB | EB | SR|MR | ER | Action / comments |
| 0|.1010000 | 0010 | 0|.1000110 | 0001 | | | A>0 & B>0 --> R>0, plus extra bits |
| 0.1010000 | 0010 | 0.1000110 | 0001 | 01.0010010 | 0001 | |MR|≥1.0 --> MR>>1 & ER+1 |
| | | | | 0|.1001001 | 0010 | 2.28125₁₀ |
2) -2.2 / 1.1 = -2.0
| SA|MA | EA | SB|MB | EB | SR|MR | ER | Action / comments |
| 1|.1000110 | 0010 | 0|.1000110 | 0001 | | | A<0 & B>0 --> R<0, plus extra bits |
| 0.1000110 | 0010 | 0.1000110 | 0001 | 01.0000000 | 0001 | |MR|≥1.0 --> MR>>1 & ER+1 |
| | | | | 1|.1000000 | 0010 | -2.0₁₀ |
3) -0.056 / -2.2 = 0.025(45)
| SA|MA | EA | SB|MB | EB | SR|MR | ER | Action / comments |
| 1|.1110010 | 1100 | 1|.1000110 | 0010 | | | A<0 & B<0 --> R>0, plus extra bits |
| 0.1110010 | 1100 | 0.1000110 | 0010 | 01.1010000 | 1010 | |MR|≥1.0 --> MR>>1 & ER+1 |
| | | | | 0|.1101000 | 1011 | 0.025390625₁₀ |
4) 0.6 / 0.8 = 0.75
| SA|MA | EA | SB|MB | EB | SR|MR | ER | Action / comments |
| 0|.1001101 | 0000 | 0|.1100110 | 0000 | | | A>0 & B>0 --> R>0, plus extra bits |
| 0.1001101 | 0000 | 0.1100110 | 0000 | 00.1100000 | 0000 | 0.5≤|MR|<1 --> OK |
| | | | | 0|.1100000 | 0000 | 0.75₁₀ |
By its nature, the IEEE-754 format is similar to the representation with positive (unsigned) mantissas, but there are some additional features. First, the stored mantissa of a normalized number contains one bit less because the MSB (the integer part, sic!) is always 1 in this case; for an un-normalized (subnormal) mantissa the MSB is always 0. The value of the normalized mantissa is therefore at least 1.0 but less than 2.0. Second, the base is 2 or 10 and, in addition, the exponent is stored with a bias (it is not in 2's complement form). This makes it possible to encode special cases in the exponent field ("00...00" - zero and un-normalized mantissas, "11...11" - infinities (overflow) and NaN, etc.). In addition, there are several rounding modes.
The word lengths are also standardized but exceptions are allowed. The most used are the 32- and 64-bit formats (also known as single-precision and double-precision, respectively). For the details, please check the link below. The examples below use the 16-bit format (half-precision), where the first bit is the sign, followed by a 5-bit exponent (base 2, bias 15) and a 10-bit stored mantissa (the integer part 1.0 is not stored for normalized mantissas, which gives 11 bits of precision!).
| decimal | binary | M₂*2^E₂ | S | E₂+01111₂ | M₂ | [M₁₀*2^E₁₀] | [decimal] |
| 2.5 | 10.1 | 1.0100000000*2^1 | 0 | 10000 | 0100000000 | 1.25*2^1 | 2.5 |
| 1.1 | 1.0(0011) | 1.0001100110*2^0 | 0 | 01111 | 0001100110 | 1.099609375*2^0 | 1.099609375 |
| -2.2 | -10.(0011) | -1.0001100110*2^1 | 1 | 10000 | 0001100110 | -1.099609375*2^1 | -2.19921875 |
| -0.056 | -0.0000111001010110... | -1.1100101011*2^-101 | 1 | 01010 | 1100101011 | -1.7919921875*2^-5 | -0.055999755859375 |
| 1.0 | 1.0 | 1.0*2^0 | 0 | 01111 | 0000000000 | 1.0*2^0 | 1.0 |
| +0.0 | 0.0 | | 0 | 00000 | 0000000000 | | +0.0 |
| -0.0 | -0.0 | | 1 | 00000 | 0000000000 | | -0.0 |
| +∞ | | | 0 | 11111 | 0000000000 | | +∞ |
| -∞ | | | 1 | 11111 | 0000000000 | | -∞ |
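The rows of the table can be checked with Python's standard `struct` module, whose `e` format code packs a number as IEEE-754 half-precision. The helper names `half_fields` and `half_value` are hypothetical. Note that `struct` rounds to nearest while the table cuts off extra bits; for the values shown here both give the same bit patterns.

```python
import struct


def half_fields(x):
    """Return (sign, biased exponent, stored mantissa) of x as half-precision."""
    (bits,) = struct.unpack(">H", struct.pack(">e", x))
    return bits >> 15, (bits >> 10) & 0x1F, bits & 0x3FF


def half_value(sign, exponent, fraction):
    """Decode the three fields back into a Python float."""
    factor = (-1.0) ** sign
    if exponent == 0x1F:            # "11...11": infinity or NaN
        return float("nan") if fraction else factor * float("inf")
    if exponent == 0:               # "00...00": zero or un-normalized mantissa
        return factor * (fraction / 1024) * 2.0 ** -14
    # Normalized: the hidden integer bit 1 is added back, bias 15 removed.
    return factor * (1 + fraction / 1024) * 2.0 ** (exponent - 15)
```

For example, `half_fields(-0.056)` returns `(1, 0b01010, 0b1100101011)`, and decoding these fields with `half_value` gives -0.055999755859375.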
NB! There are more sources for sure.
Last modified - 2021.12.02.