Floating Point Operations

A brief overview of floating point numbers - representations and main operations. At the end, there are links to additional web pages.

1. General representation

Floating point number - M*B^E, where M - mantissa (significand), E - exponent, B - base. For everyday use, the base is 10, and both mantissa and exponent are decimal numbers - the mantissa is real and the exponent is integer. When a separate sign S is used, the mantissa is always positive and the floating point number is presented as (-1)^S*M*B^E.

Computers and digital systems usually use binary numbers - B=2, M and E are binary numbers, and S is a single bit. The actual number of bits for the mantissa and the exponent depends on the format used (see, e.g., the IEEE-754 standard). In addition, the mantissa is usually a fraction kept within a certain range (normal, normalized).

For simplicity, the format used in the examples below has 4 bits for the exponent and 8 bits for the mantissa (or 7 bits plus sign). The normalized mantissa is 0.5≤|M|<1.
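
The normalization 0.5≤|M|<1 happens to be the convention of Python's standard math.frexp function, so the split of a value into mantissa and exponent can be tried out directly. This is only a small illustration on double-precision values; the list of sample values is chosen here to match the tables below.

```python
import math

# frexp splits x into (m, e) with x == m * 2**e and 0.5 <= |m| < 1
for x in (2.5, 1.1, -2.2, -0.056, 6.0):
    m, e = math.frexp(x)
    assert math.ldexp(m, e) == x      # ldexp rebuilds the value exactly
    print(f"{x:>7} = {m:.10g} * 2^{e}")
```

For instance, math.frexp(2.5) returns (0.625, 2), i.e. 2.5 = 0.625*2^2.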

Signed mantissa (2's complement, no extra bit for the sign):
decimal  M10*10^E10   binary                   M2*2^E2                    M2         E2    [M10*2^E10]     [decimal]
2.5      0.25*10^1    10.1                     0.101*2^10                 0.1010000  0010  0.625*2^2       2.5
1.1      0.11*10^1    1.0(0011)                0.10(0011)*2^1             0.1000110  0001  0.546875*2^1    1.09375
-2.2     -0.22*10^1   -10.(0011)               -0.10(0011)*2^10           1.0111010  0010  -0.546875*2^2   -2.1875
-0.056   -0.56*10^-1  -0.0000111001010110...   -0.111001010110...*2^-100  1.0001110  1100  -0.890625*2^-4  -0.0556640625

Unsigned (positive) mantissa, extra bit for sign (S):
decimal  M10*10^E10   binary                   M2*2^E2                    S  M2        E2    [M10*2^E10]     [decimal]
2.5      0.25*10^1    10.1                     0.101*2^10                 0  .1010000  0010  0.625*2^2       2.5
1.1      0.11*10^1    1.0(0011)                0.10(0011)*2^1             0  .1000110  0001  0.546875*2^1    1.09375
-2.2     -0.22*10^1   -10.(0011)               -0.10(0011)*2^10           1  .1000110  0010  -0.546875*2^2   -2.1875
-0.056   -0.56*10^-1  -0.0000111001010110...   -0.111001010110...*2^-100  1  .1110010  1100  -0.890625*2^-4  -0.0556640625

In the examples below, notations used are "mmmm...m|ee..e" (signed mantissa and exponent) and "s|mmm...m|ee..e" (sign, unsigned mantissa and exponent). In actual implementations (e.g., IEEE-754) the order of fields in the word may be different.

It should be noted that with an unsigned normalized mantissa, the most significant bit (MSB) is always '1'. The IEEE format uses this feature: the bit is not stored, so the mantissa essentially gains one extra bit. However, extra encoding is then needed to distinguish between normalized and un-normalized mantissas. This feature is not used in the examples below, where the MSB is stored explicitly and is not '1' when the mantissa is un-normalized.
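
The two encodings from the tables above can be reproduced with a short sketch. It assumes that the magnitude of the mantissa is cut off (not rounded) to 7 fraction bits, which is what the table rows suggest; the function names normalize and encode are made up for this illustration, and exponent overflow is not checked.

```python
from fractions import Fraction

MANT_BITS = 7   # fraction bits of the mantissa (the sign is handled separately)
EXP_BITS = 4    # exponent width, 2's complement

def normalize(value):
    """Return (m, e) with value == m * 2**e and 0.5 <= |m| < 1 (m == 0 for zero)."""
    m, e = Fraction(value), 0
    if m == 0:
        return m, 0
    while abs(m) >= 1:
        m, e = m / 2, e + 1
    while abs(m) < Fraction(1, 2):
        m, e = m * 2, e - 1
    return m, e

def encode(value):
    """Return the 'mmmm...m|ee..e' and 's|mmm...m|ee..e' strings for value."""
    m, e = normalize(value)
    mag = int(abs(m) * 2**MANT_BITS)              # cut-off magnitude, 0..127
    sign = 1 if m < 0 else 0
    exp_bits = format(e % 2**EXP_BITS, f"0{EXP_BITS}b")
    # signed mantissa: 2's complement of the magnitude on 8 bits
    twos = format((-mag if sign else mag) % 2**(MANT_BITS + 1), "08b")
    signed = f"{twos[0]}.{twos[1:]}|{exp_bits}"
    sign_magnitude = f"{sign}|.{mag:07b}|{exp_bits}"
    return signed, sign_magnitude

for text in ("2.5", "1.1", "-2.2", "-0.056"):
    print(text, *encode(text))
```

Decimal strings are passed instead of floats so that Fraction holds the exact decimal value before the mantissa is cut off.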


2. Operations (signed mantissa)

The representation M*2^E is used to illustrate the operations while taking into account the following issues:

One of the mantissas may need un-normalization when adding/subtracting (to make the exponents equal). For all operations, the resulting mantissa may need normalization together with the correction of the resulting exponent.

In the examples below, both operands and the result are represented by a mantissa and an exponent - A = MA*2^EA, B = MB*2^EB and R = MR*2^ER.

2.1. Addition and subtraction

To add or subtract the mantissas when calculating A+B (or A-B), the exponents must be equal so that the common factor 2^E can be taken outside the parentheses. In general, the operands have different exponents, and to make them equal the mantissa of one operand must be un-normalized. For instance, 6.0 = 0.75*2^3 = 1.5*2^2 = 0.375*2^4. The mantissa is usually kept below 1 to avoid the need for additional bits in the integer part (left of the point). It is therefore convenient to keep the operand with the greater exponent unchanged and to change the mantissa and exponent of the operand with the smaller exponent. The exponent of the result is then equal to the greater exponent. However, both the mantissa and the exponent of the result may need correction when the resulting mantissa is not normalized. Based on that, the corrections of mantissa and exponent for addition and subtraction can be presented as follows:

  EA ≥ EB:  MR = MA ± (MB >> (EA-EB)),  ER = EA
  EA < EB:  MR = (MA >> (EB-EA)) ± MB,  ER = EB

When subtracting, the mantissas and exponents of the operands are corrected in exactly the same way. The mantissa and exponent of the result may need correction after both addition and subtraction. The absolute value of the resulting mantissa is always less than 2.0, so one extra bit is needed in the integer part (or an extension of the sign bit). The smallest value of the mantissa can be 0 - when subtracting equal mantissas, for instance. Extra bits are also used in the fraction part to support rounding (for simplicity, only one extra bit in the examples). When the mantissa is a signed fixed-point number in 2's complement encoding, the subtraction is done as with integers.
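
The alignment and normalization steps described above can be sketched as follows. This is a minimal illustration, not a hardware description: mantissas are assumed to be held as signed integers scaled by 2^7 (so 0.1010000 is 80 and 1.0111010 is -70), one extra fraction bit is used, the result is cut off rather than rounded, and zero is returned as (0, 0) by choice. The name add_signed is made up for this sketch.

```python
FRAC = 7    # stored fraction bits of the mantissa
GUARD = 1   # extra fraction bit kept during the operation

def add_signed(ma, ea, mb, eb, subtract=False):
    """Add or subtract two numbers given as (mantissa, exponent) pairs.

    Mantissas are signed integers scaled by 2**FRAC; exponents are integers.
    """
    a, b = ma << GUARD, mb << GUARD     # append the extra bit
    if ea >= eb:                        # un-normalize the operand
        b >>= ea - eb                   # with the smaller exponent
        er = ea
    else:
        a >>= eb - ea
        er = eb
    mr = a - b if subtract else a + b
    if mr == 0:                         # zero cannot be normalized
        return 0, 0
    one = 1 << (FRAC + GUARD)           # fixed-point 1.0
    while abs(mr) >= one:               # |MR| >= 1 --> MR>>1 & ER+1
        mr >>= 1
        er += 1
    while abs(mr) < one // 2:           # |MR| < 0.5 --> MR<<1 & ER-1
        mr <<= 1
        er -= 1
    return mr >> GUARD, er              # cut off the extra bit

m, e = add_signed(80, 2, 70, 1)         # 2.5 + 1.1
print(m, e, m / 2**FRAC * 2**e)
```

Python's >> on a negative integer is an arithmetic shift, which matches the sign extension of a 2's complement mantissa.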

Examples

1) 2.5 + 1.1 = 3.6
A (MA|EA)            B (MB|EB)            R (MR|ER)            Action / comments
0.1010000   | 0010   0.1000110   | 0001                        EA>EB --> MB>>1 & EB+1, plus extra bits
00.10100000 | 0010   00.01000110 | 0010   00.11100110 | 0010   MA+MB, 0.5≤|MR|<1 --> OK, no need to round either
                                          0.1110011   | 0010   = 3.59375 (decimal)

2) 2.5 - 2.2 = 0.3
A (MA|EA)            B (MB|EB)            R (MR|ER)            Action / comments
0.1010000   | 0010   0.1000110   | 0010                        EA=EB --> OK, plus extra bits
00.10100000 | 0010   00.10001100 | 0010   00.00010100 | 0010   MA-MB, |MR|<0.5 --> MR<<3 & ER-3
                                          0.1010000   | 1111   = 0.3125 (decimal)

3) 1.1 - 1.1 = 0.0 -- how to represent zero?
A (MA|EA)            B (MB|EB)            R (MR|ER)            Action / comments
0.1000110   | 0001   0.1000110   | 0001                        EA=EB --> OK, plus extra bits
00.10001100 | 0001   00.10001100 | 0001   00.00000000 | 0001   MA-MB, |MR|=0 (and so |MR|<0.5) --> a special case, because shifting MR left can never normalize it and the exponent would underflow!
                                          0.0000000   | 0000   = 0.0 (decimal) - different formats may handle zero differently

4) 1.1 + 0.56 = 1.66 -- rounding?
A (MA|EA)            B (MB|EB)            R (MR|ER)            Action / comments
0.1000110   | 0001   0.1000111   | 0000                        EA>EB --> MB>>1 & EB+1, plus extra bits
00.10001100 | 0001   00.01000111 | 0001   00.11010011 | 0001   MA+MB, 0.5≤|MR|<1 --> OK but rounding may be useful
                                          0.1101001   | 0001   = 1.640625 (decimal) - cutting off LSB
                                          0.1101010   | 0001   = 1.65625 (decimal) - rounded (up)

5) -2.2 - 2.5 = -4.7 -- rounding negative result?
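A (MA|EA)            B (MB|EB)            R (MR|ER)            Action / comments
1.0111010   | 0010   0.1010000   | 0010                        EA=EB --> OK, plus extra bits
11.01110100 | 0010   00.10100000 | 0010   10.11010100 | 0010   MA-MB, |MR|≥1 --> MR>>1 & ER+1
                                          1.0110101   | 0011   = -4.6875 (decimal) - the bits shifted out are 0, so cutting off and rounding agree

When the discarded bit of a negative 2's complement mantissa is 1, the two methods differ: cutting off the LSB of the 2's complement value is the same as rounding |MR| up, and rounding the 2's complement value up is the same as cutting off the LSB of |MR|.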

2.2. Multiplication

To multiply, the mantissas are multiplied and the exponents are added - A*B = (MA*2^EA) * (MB*2^EB) = (MA*MB) * (2^EA*2^EB) = (MA*MB)*2^(EA+EB). The absolute value of the resulting mantissa is between 0.25 and 1.0 (0.25≤|MR|<1), so normalization may require shifting the mantissa one bit left and decrementing the resulting exponent.

The examples below assume an 8x8-bit fixed-point multiplier (sign and 7-bit fraction). The product has 14 fraction bits, so its 7 lower bits must be "cut off" and/or rounding may be needed.
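
The multiplication steps can be sketched in the same integer model (mantissas as signed integers scaled by 2^7, which is an assumption of this illustration and not a description of a real multiplier). The optional rounding adds half an LSB before the lower 7 bits are dropped, i.e. ties go toward +∞; the name mul_signed and the (0, 0) result for zero are choices made for this sketch.

```python
FRAC = 7   # fraction bits of each mantissa; the raw product has 2*FRAC

def mul_signed(ma, ea, mb, eb, round_result=False):
    """Multiply two (mantissa, exponent) pairs with signed mantissas."""
    prod = ma * mb                      # 14 fraction bits
    er = ea + eb
    if prod == 0:
        return 0, 0
    if abs(prod) < 1 << (2 * FRAC - 1): # |MR| < 0.5 --> MR<<1 & ER-1
        prod <<= 1
        er -= 1
    if round_result:
        prod += 1 << (FRAC - 1)         # add half an LSB before cutting off
    mr = prod >> FRAC                   # drop the 7 lower bits (floor)
    if mr == 1 << FRAC:                 # rounding carried into the integer bit
        mr >>= 1
        er += 1
    return mr, er

print(mul_signed(80, 2, 70, 1))                     # 2.5 * 1.1, cut off
print(mul_signed(80, 2, 70, 1, round_result=True))  # 2.5 * 1.1, rounded
```

Because the cut-off is a floor operation on a 2's complement value, a negative product is cut off away from zero, which is why 1.1 * (-2.2) gives the same mantissa with and without rounding in this model.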

Examples

1) 2.5 * 1.1 = 2.75
A (MA|EA)          B (MB|EB)          R (MR|ER)                  Action / comments
0.1010000 | 0010   0.1000110 | 0001   00.01010111100000 | 0011   |MR|<0.5 --> MR<<1 & ER-1
                                      0.1010111         | 0010   = 2.71875 (decimal) - cut off
                                      0.1011000         | 0010   = 2.75 (decimal) - rounded

2) 1.1 * (-2.2) = -2.42
A (MA|EA)          B (MB|EB)          R (MR|ER)                  Action / comments
0.1000110 | 0001   1.0111010 | 0010   11.10110011011100 | 0011   |MR|<0.5 --> MR<<1 & ER-1
                                      1.0110011         | 0010   = -2.40625 (decimal) - cut off==rounded (the discarded bits 011100 are less than half an LSB)

3) (-2.2) * (-0.056) = 0.1232
A (MA|EA)          B (MB|EB)          R (MR|ER)                  Action / comments
1.0111010 | 0010   1.0001110 | 1100   00.01111100101100 | 1110   |MR|<0.5 --> MR<<1 & ER-1
                                      0.1111100         | 1101   = 0.12109375 (decimal) - cut off
                                      0.1111101         | 1101   = 0.1220703125 (decimal) - rounded

4) 0.8 * 0.8 = 0.64
A (MA|EA)          B (MB|EB)          R (MR|ER)                  Action / comments
0.1100110 | 0000   0.1100110 | 0000   00.10100010100100 | 0000   0.5≤|MR|<1 --> OK
                                      0.1010001         | 0000   = 0.6328125 (decimal) - cut off==rounded

2.3. Division

To divide, the mantissas are divided and the exponents are subtracted - A/B = (MA*2^EA) / (MB*2^EB) = (MA/MB) * (2^EA/2^EB) = (MA/MB)*2^(EA-EB). The absolute value of the resulting mantissa is between 0.5 and 2.0 (0.5≤|MR|<2), so normalization may require shifting the mantissa one bit right and incrementing the resulting exponent. Rounding of the mantissa may also be needed.
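
A minimal sketch of the division steps, again with mantissas modelled as signed integers scaled by 2^7. As a simplification that is an assumption of this sketch, the quotient is computed on the magnitudes and the sign is applied afterwards; the quotient is cut off, not rounded, and the name div_signed is made up here.

```python
FRAC = 7   # fraction bits of the mantissa

def div_signed(ma, ea, mb, eb):
    """Divide two (mantissa, exponent) pairs with signed mantissas."""
    if mb == 0:
        raise ZeroDivisionError("divisor mantissa is zero")
    if ma == 0:
        return 0, 0
    negative = (ma < 0) != (mb < 0)
    q = (abs(ma) << FRAC) // abs(mb)    # |MA|/|MB| cut off to FRAC fraction bits
    er = ea - eb
    if q >= 1 << FRAC:                  # |MR| >= 1.0 --> MR>>1 & ER+1
        q >>= 1
        er += 1
    return (-q if negative else q), er

m, e = div_signed(80, 2, 70, 1)         # 2.5 / 1.1
print(m, e, m / 2**FRAC * 2**e)
```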

Examples

1) 2.5 / 1.1 = 2.(27)
A (MA|EA)          B (MB|EB)          R (MR|ER)           Action / comments
0.1010000 | 0010   0.1000110 | 0001   01.0010010 | 0001   |MR|≥1.0 --> MR>>1 & ER+1
                                      0.1001001  | 0010   = 2.28125 (decimal)

2) -2.2 / 1.1 = -2.0
A (MA|EA)          B (MB|EB)          R (MR|ER)           Action / comments
1.0111010 | 0010   0.1000110 | 0001   11.0000000 | 0001   |MR|≥1.0 --> MR>>1 & ER+1
                                      1.1000000  | 0010   = -2.0 (decimal)

3) -0.056 / -2.2 = 0.025(45)
A (MA|EA)          B (MB|EB)          R (MR|ER)           Action / comments
1.0001110 | 1100   1.0111010 | 0010   01.1010000 | 1010   |MR|≥1.0 --> MR>>1 & ER+1
                                      0.1101000  | 1011   = 0.025390625 (decimal)

4) 0.6 / 0.8 = 0.75
A (MA|EA)          B (MB|EB)          R (MR|ER)           Action / comments
0.1001101 | 0000   0.1100110 | 0000   00.1100000 | 0000   0.5≤|MR|<1 --> OK
                                      0.1100000  | 0000   = 0.75 (decimal)


3. Operations (sign and positive mantissa)

The representation (-1)^S*M*2^E is used to illustrate the operations while taking into account the following issues:

One of the mantissas may need un-normalization when adding/subtracting (to make the exponents equal). For all operations, the resulting mantissa may need normalization together with the correction of the resulting exponent.

3.1. Addition and subtraction

Making exponents equal and correcting mantissas is done like in the case when mantissas are signed. There are two ways (versions) to add/subtract mantissas:

  1. For negative operand(s) (sign is 1), the mantissa(s) is/are made negative (e.g., in 2's complement form) and the operation is made directly.
  2. Mantissas are always positive but the operations are made according to the table below where the operand's absolute value is the unsigned mantissa, essentially. This approach is faster but the control logic is more complex.

    A+B   B<0          B≥0              A-B   B<0        B≥0
    A<0   -(|A|+|B|)   |B|-|A|          A<0   |B|-|A|    -(|A|+|B|)
    A≥0   |A|-|B|      |A|+|B|          A≥0   |A|+|B|    |A|-|B|

In both cases, the resulting mantissa may be negative and the absolute value of the mantissa must be found and the sign corrected.
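
The second version (positive mantissas, operation chosen from the table) can be sketched as follows. Mantissas are assumed to be unsigned integers scaled by 2^7, one extra fraction bit is used, and the result is cut off; the name add_sign_magnitude and the (0, 0, 0) result for zero are choices made for this illustration.

```python
FRAC = 7    # stored fraction bits of the mantissa
GUARD = 1   # extra fraction bit kept during the operation

def add_sign_magnitude(sa, ma, ea, sb, mb, eb, subtract=False):
    """Add or subtract two (sign, mantissa, exponent) triples."""
    if subtract:
        sb ^= 1                          # A - B == A + (-B)
    a, b = ma << GUARD, mb << GUARD
    if ea >= eb:                         # make the exponents equal
        b >>= ea - eb
        er = ea
    else:
        a >>= eb - ea
        er = eb
    if sa == sb:                         # |A|+|B| or -(|A|+|B|)
        mr, sr = a + b, sa
    elif sa:                             # A<0, B>=0: |B|-|A|
        mr, sr = b - a, 0
    else:                                # A>=0, B<0: |A|-|B|
        mr, sr = a - b, 0
    if mr < 0:                           # negative result: take the absolute
        mr, sr = -mr, 1                  # value and correct the sign
    if mr == 0:
        return 0, 0, 0
    one = 1 << (FRAC + GUARD)
    while mr >= one:                     # |MR| >= 1 --> MR>>1 & ER+1
        mr >>= 1
        er += 1
    while mr < one // 2:                 # |MR| < 0.5 --> MR<<1 & ER-1
        mr <<= 1
        er -= 1
    return sr, mr >> GUARD, er

print(add_sign_magnitude(1, 80, 2, 0, 70, 1))   # -2.5 + 1.1
```

Subtraction is folded into addition by inverting the sign of B first, so only the A+B half of the table has to be implemented.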

In the following examples, both versions are shown where the calculation process differs.

1) 2.5 + 1.1 = 3.6
A (SA|MA|EA)         B (SB|MB|EB)         R (SR|MR|ER)         Action / comments [versions 1 & 2]
0|.1010000  | 0010   0|.1000110  | 0001                        A>0, B>0 --> A+B=|A|+|B|
0.1010000   | 0010   0.1000110   | 0001                        EA>EB --> MB>>1 & EB+1, plus extra bits
00.10100000 | 0010   00.01000110 | 0010   00.11100110 | 0010   MA+MB, 0.5≤|MR|<1 --> OK, no need to round either
                                          0|.1110011  | 0010   = 3.59375 (decimal)

2) 2.5 - 2.2 = 0.3
A (SA|MA|EA)         B (SB|MB|EB)         R (SR|MR|ER)         Action / comments [versions 1 & 2]
0|.1010000  | 0010   0|.1000110  | 0010                        A>0, B≥0 --> A-B=|A|-|B|
0.1010000   | 0010   0.1000110   | 0010                        EA=EB --> OK, plus extra bits
00.10100000 | 0010   00.10001100 | 0010   00.00010100 | 0010   MA-MB, |MR|<0.5 --> MR<<3 & ER-3
                                          0|.1010000  | 1111   = 0.3125 (decimal)

3) -2.2 - 2.5 = -4.7
A (SA|MA|EA)         B (SB|MB|EB)         R (SR|MR|ER)         Action / comments [version 1]
1|.1000110  | 0010   0|.1010000  | 0010                        MA=-MA; EA=EB --> OK, plus extra bits
11.01110100 | 0010   00.10100000 | 0010   10.11010100 | 0010   MA-MB, |MR|≥1.0 --> MR>>1 & ER+1; MR<0 --> -MR, SR=1
                                          1|.1001011  | 0011   = -4.6875 (decimal)

A (SA|MA|EA)         B (SB|MB|EB)         R (SR|MR|ER)         Action / comments [version 2]
1|.1000110  | 0010   0|.1010000  | 0010                        A<0, B≥0 --> A-B=-(|A|+|B|) [NB! SR=1]
0|.1000110  | 0010   0|.1010000  | 0010                        |A| + |B|; EA=EB --> OK, plus extra bits
00.10001100 | 0010   00.10100000 | 0010   01.00101100 | 0010   |MR|≥1.0 --> MR>>1 & ER+1; MR>0 --> SR stays 1
                                          1|.1001011  | 0011   = -4.6875 (decimal)

4) (-2.2) - (-2.5) = 0.3
A (SA|MA|EA)         B (SB|MB|EB)         R (SR|MR|ER)         Action / comments [version 1]
1|.1000110  | 0010   1|.1010000  | 0010                        MA=-MA, MB=-MB; EA=EB --> OK, plus extra bits
11.01110100 | 0010   11.01100000 | 0010   00.00010100 | 0010   MA-MB, |MR|<0.5 --> MR<<3 & ER-3; MR>0 --> SR=0
                                          0|.1010000  | 1111   = 0.3125 (decimal)

A (SA|MA|EA)         B (SB|MB|EB)         R (SR|MR|ER)         Action / comments [version 2]
1|.1000110  | 0010   1|.1010000  | 0010                        A<0, B<0 --> A-B=|B|-|A|
0|.1010000  | 0010   0|.1000110  | 0010                        |B| - |A| (operands swapped); EA=EB --> OK, plus extra bits
00.10100000 | 0010   00.10001100 | 0010   00.00010100 | 0010   |MR|<0.5 --> MR<<3 & ER-3; MR>0 --> SR=0
                                          0|.1010000  | 1111   = 0.3125 (decimal)

5) -2.5 + 1.1 = -1.4
A (SA|MA|EA)         B (SB|MB|EB)         R (SR|MR|ER)         Action / comments [version 1]
1|.1010000  | 0010   0|.1000110  | 0001                        MA=-MA; EA>EB --> MB>>1 & EB+1, plus extra bits
11.01100000 | 0010   00.01000110 | 0010   11.10100110 | 0010   MA+MB, |MR|<0.5 --> MR<<1 & ER-1; MR<0 --> -MR, SR=1
                                          1|.1011010  | 0001   = -1.40625 (decimal)

A (SA|MA|EA)         B (SB|MB|EB)         R (SR|MR|ER)         Action / comments [version 2]
1|.1010000  | 0010   0|.1000110  | 0001                        A<0, B≥0 --> A+B=|B|-|A|
0|.1000110  | 0001   0|.1010000  | 0010                        |B| - |A| (operands swapped); EA>EB --> MB>>1 & EB+1, plus extra bits
00.01000110 | 0010   00.10100000 | 0010   11.10100110 | 0010   |MR|<0.5 --> MR<<1 & ER-1; MR<0 --> -MR, SR=1
                                          1|.1011010  | 0001   = -1.40625 (decimal)

3.2. Multiplication

In practice, the multiplication is done in the same way as with signed mantissas - the mantissas are multiplied and the exponents are added. The main difference is that the mantissas are always positive; normalization may still be needed. The sign of the result is defined by the signs of the operands - the result is positive when the signs are equal and negative when they are opposite.
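
With a separate sign bit, the sign rule is a single XOR and the cut-off or rounding acts on the magnitude. The sketch below assumes unsigned mantissas scaled by 2^7 and round-half-up on the magnitude; the name mul_sign_magnitude is made up for this illustration.

```python
FRAC = 7   # fraction bits of each mantissa; the raw product has 2*FRAC

def mul_sign_magnitude(sa, ma, ea, sb, mb, eb, round_result=False):
    """Multiply two (sign, mantissa, exponent) triples."""
    sr = sa ^ sb                        # equal signs --> positive result
    prod = ma * mb                      # 14 fraction bits, always >= 0
    er = ea + eb
    if prod == 0:
        return 0, 0, 0
    if prod < 1 << (2 * FRAC - 1):      # MR < 0.5 --> MR<<1 & ER-1
        prod <<= 1
        er -= 1
    if round_result:
        prod += 1 << (FRAC - 1)         # add half an LSB before cutting off
    mr = prod >> FRAC                   # drop the 7 lower bits
    if mr == 1 << FRAC:                 # rounding carried into the integer bit
        mr >>= 1
        er += 1
    return sr, mr, er

print(mul_sign_magnitude(0, 70, 1, 1, 70, 2))                     # cut off
print(mul_sign_magnitude(0, 70, 1, 1, 70, 2, round_result=True))  # rounded
```

Cutting off a magnitude always moves the value toward zero, so a negative product can differ by one LSB from the result obtained by cutting off a 2's complement mantissa.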

Examples

1) 2.5 * 1.1 = 2.75
A (SA|MA|EA)        B (SB|MB|EB)        R (SR|MR|ER)               Action / comments
0|.1010000 | 0010   0|.1000110 | 0001                              A>0 & B>0 --> R>0; plus extra bits
0.1010000  | 0010   0.1000110  | 0001   00.01010111100000 | 0011   |MR|<0.5 --> MR<<1 & ER-1
                                        0|.1010111        | 0010   = 2.71875 (decimal) - cut off
                                        0|.1011000        | 0010   = 2.75 (decimal) - rounded

2) 1.1 * (-2.2) = -2.42
A (SA|MA|EA)        B (SB|MB|EB)        R (SR|MR|ER)               Action / comments
0|.1000110 | 0001   1|.1000110 | 0010                              A>0 & B<0 --> R<0; plus extra bits
0.1000110  | 0001   0.1000110  | 0010   00.01001100100100 | 0011   |MR|<0.5 --> MR<<1 & ER-1
                                        1|.1001100        | 0010   = -2.375 (decimal) - cut off
                                        1|.1001101        | 0010   = -2.40625 (decimal) - rounded

3) (-2.2) * (-0.056) = 0.1232
A (SA|MA|EA)        B (SB|MB|EB)        R (SR|MR|ER)               Action / comments
1|.1000110 | 0010   1|.1110010 | 1100                              A<0 & B<0 --> R>0; plus extra bits
0.1000110  | 0010   0.1110010  | 1100   00.01111100101100 | 1110   |MR|<0.5 --> MR<<1 & ER-1
                                        0|.1111100        | 1101   = 0.12109375 (decimal) - cut off
                                        0|.1111101        | 1101   = 0.1220703125 (decimal) - rounded

4) 0.8 * 0.8 = 0.64
A (SA|MA|EA)        B (SB|MB|EB)        R (SR|MR|ER)               Action / comments
0|.1100110 | 0000   0|.1100110 | 0000                              A>0 & B>0 --> R>0; plus extra bits
0.1100110  | 0000   0.1100110  | 0000   00.10100010100100 | 0000   0.5≤|MR|<1 --> OK
                                        0|.1010001        | 0000   = 0.6328125 (decimal) - cut off==rounded

3.3. Division

In practice, the division is also done in the same way as with signed mantissas - the mantissas are divided and the exponents are subtracted. The main difference is that the mantissas are always positive; normalization may still be needed. The sign of the result is defined by the signs of the operands - the result is positive when the signs are equal and negative when they are opposite.

Examples

1) 2.5 / 1.1 = 2.(27)
A (SA|MA|EA)        B (SB|MB|EB)        R (SR|MR|ER)        Action / comments
0|.1010000 | 0010   0|.1000110 | 0001                       A>0 & B>0 --> R>0, plus extra bits
0.1010000  | 0010   0.1000110  | 0001   01.0010010 | 0001   |MR|≥1.0 --> MR>>1 & ER+1
                                        0|.1001001 | 0010   = 2.28125 (decimal)

2) -2.2 / 1.1 = -2.0
A (SA|MA|EA)        B (SB|MB|EB)        R (SR|MR|ER)        Action / comments
1|.1000110 | 0010   0|.1000110 | 0001                       A<0 & B>0 --> R<0, plus extra bits
0.1000110  | 0010   0.1000110  | 0001   01.0000000 | 0001   |MR|≥1.0 --> MR>>1 & ER+1
                                        1|.1000000 | 0010   = -2.0 (decimal)

3) -0.056 / -2.2 = 0.025(45)
A (SA|MA|EA)        B (SB|MB|EB)        R (SR|MR|ER)        Action / comments
1|.1110010 | 1100   1|.1000110 | 0010                       A<0 & B<0 --> R>0, plus extra bits
0.1110010  | 1100   0.1000110  | 0010   01.1010000 | 1010   |MR|≥1.0 --> MR>>1 & ER+1
                                        0|.1101000 | 1011   = 0.025390625 (decimal)

4) 0.6 / 0.8 = 0.75
A (SA|MA|EA)        B (SB|MB|EB)        R (SR|MR|ER)        Action / comments
0|.1001101 | 0000   0|.1100110 | 0000                       A>0 & B>0 --> R>0, plus extra bits
0.1001101  | 0000   0.1100110  | 0000   00.1100000 | 0000   0.5≤|MR|<1 --> OK
                                        0|.1100000 | 0000   = 0.75 (decimal)


4. Features of IEEE-754 format

By its nature, the IEEE-754 format is similar to the representation with positive (unsigned) mantissas, but there are some additional features. First, the stored mantissa of a normalized number is one bit shorter, because the MSB (integer part, sic!) is always 1 in this case and is not stored; for an unnormalized mantissa the MSB is always 0. The value of the (normalized) mantissa is therefore at least 1.0 but less than 2.0. Second, the base is 2 or 10 and, in addition, the exponent is stored with a bias (the exponent is not in 2's complement form). This makes it possible to reserve exponent codes for special cases ("00...00" - zero and unnormalized mantissas, "11...11" - infinities (overflow) and NaN). In addition, several rounding modes are defined.

The word lengths are also standardized, but exceptions are allowed. The most used are the 32- and 64-bit formats (also known as single precision and double precision, respectively). For the details, please check the link below. The examples use the 16-bit format (half precision), where the first bit is the sign, followed by a 5-bit exponent (base 2, bias 15) and a 10-bit stored mantissa (the integer part 1.0 is not stored for normalized mantissas, which gives 11 bits of precision!).

decimal  binary                   M2*2^E2                S  E2+01111  M2          [M10*2^E10]          [decimal]
2.5      10.1                     1.0100000000*2^1       0  10000     0100000000  1.25*2^1             2.5
1.1      1.0(0011)                1.0001100110*2^0       0  01111     0001100110  1.099609375*2^0      1.099609375
-2.2     -10.(0011)               -1.0001100110*2^1      1  10000     0001100110  -1.099609375*2^1     -2.19921875
-0.056   -0.0000111001010110...   -1.1100101011*2^-101   1  01010     1100101011  -1.7919921875*2^-5   -0.055999755859375
1.0      1.0                      1.0*2^0                0  01111     0000000000  1.0*2^0              1.0
+0.0     0.0                                             0  00000     0000000000                       +0.0
-0.0     -0.0                                            1  00000     0000000000                       -0.0
+∞                                                       0  11111     0000000000                       +∞
-∞                                                       1  11111     0000000000                       -∞
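
The half-precision bit patterns can be inspected with Python's standard struct module, whose 'e' format character packs a value as IEEE-754 binary16. The helper name half_fields is made up for this sketch; the conversion rounds to the nearest representable value.

```python
import struct

def half_fields(x):
    """Return 'S EEEEE MMMMMMMMMM' for x converted to IEEE-754 half precision."""
    (bits,) = struct.unpack(">H", struct.pack(">e", x))
    sign = bits >> 15
    exponent = (bits >> 10) & 0x1F      # biased exponent (bias 15)
    fraction = bits & 0x3FF             # stored mantissa, without the hidden 1
    return f"{sign} {exponent:05b} {fraction:010b}"

for x in (2.5, 1.1, -2.2, -0.056, 1.0, 0.0, -0.0, float("inf"), float("-inf")):
    print(f"{x:>7} {half_fields(x)}")
```

For the values in the table, rounding to nearest gives the same stored mantissas as cutting off the extra bits.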


Additional information

NB! There are more sources for sure.


Last modified - 2021.12.02.