Jump to content

Floating-point arithmetic

From Wikipedia, the free encyclopedia

This is an old revision of this page, as edited by William Ackerman (talk | contribs) at 15:16, 19 October 2006 (Accuracy problems: Express it in words and code snippets.). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.

You must add a |reason= parameter to this Cleanup template – replace it with {{Cleanup|August 2006|reason=<Fill reason here>}}, or remove the Cleanup template.

Floating-point is a numeral system for representing a subset of the real numbers in terms of digits or bits in a computer or calculator, similar to how scientific notation is used to represent exact values. A floating-point number is often stored as three parts:

  • A significand or mantissa (indicating the digits that define the number's magnitude)
  • An exponent or scale (indicating the position of the radix point)
  • A sign (indicating whether the number is positive or negative)

Computation with floating-point numbers plays a very important role in an enormous variety of applications in science, engineering, and industry. The ability to perform floating point operations is an important measure of performance for computers intended for such applications. The extent of this ability is measured in "FLOPS" (FLoating-point Operations Per Second).

This article will focus on the common formats used by nearly all modern floating-point computer hardware. These formats have been standardized as the IEEE 754 standard, commonly called "IEEE floating-point". The standard actually provides for many closely-related formats, differing in only a few details. Two of these formats are ubiquitous in computer hardware and languages:

  • Single precision, called "float" in the C language family, and "real" or "real*4" in Fortran. It occupies 32 bits (4 bytes) and gives about 7 digits of precision.
  • Double precision, called "double" in the C language family, and "doubleprecision" or "real*8" in Fortran. It occupies 64 bits (8 bytes) and gives about 16 digits of precision.

Floating-point numbers are intended to approximate the mathematical real numbers. But while the real numbers form a continuum that can be subdivided without limit, floating-point numbers have only finite resolution—they can only represent discrete points on the real number line. (In double precision representation, consecutive points differ by about 1 part in 1016.) That is, they can only represent a subset of the reals. Because of this, floating-point numbers are sometimes thought of as just an approximation to a real number, or as representing a real number to within some tolerance, or as representing a (narrow) range of real numbers, but this is not correct. A floating-point number (that is, a string of bits in a computer) represents a real number exactly. It just might not be the real number that is the intended value of a computation, if that value is not in the representable subset.

History

The need for computers to process data over a wide range of values, as in scientific notation, was recognized very early in the history of computers. The first computer to be able to do this in hardware appears to be the IBM 704 in 1954. For some time after that, floating-point hardware was an optional feature, and computers that had it were said to be "scientific computers", or to have "scientific computing" capability. All modern general-purpose computers have this ability.

Prior to the IEEE-754 standard, computers had many different formats (and, indeed, different word sizes), and there were also multiple formats in software emulations.

The IEEE-754 standard was created in the early 1980's, after word sizes of 32 bits (or 16 or 64) had been generally settled upon. Among its innovations are these:

  • A precisely specified encoding of the bits, so that all compliant computers would interpret bit patterns the same way. This made it possible to transfer floating-point numbers from one computer to another.
  • A precisely specified behavior of the arithmetic operations. This meant that a given program, with given data, would always produce the same result on any compliant computer. This helped reduce the almost mystical reputation that floating-point computation had for seemingly nondeterministic behavior.
  • The ability of exceptional conditions (overflow, divide by zero, etc.) to propagate through a computation in a benign manner and be handled by the software in a controlled way.

Other mechanisms

The floating-point behavior discussed in this article is widely used because, like twos complement binary integer arithmetic, nearly all computers can handle it extremely efficiently in hardware. However, there are other ways of handling real numbers. Fixed-point arithmetic is sometimes used in embedded processors for specific applications that don't require the exponent range that floating-point provides. There are also rational arithmetic software packages that handle quotients of integers, where the integers might be extremely large and are handled by arbitrary-precision arithmetic ("bignum") software. There are also "bigfloat" libraries that handle numbers similarly to standard floating-point hardware, but with enormous arrays to hold the significand. Computer algebra systems often denote numbers such as π or internally in a "formal" way, thereby doing exact arithmetic on numbers whose irrationality would otherwise make exact representation impossible.

Basics

A floating-point representation requires, first of all, a choice of base or radix for the significand, and a choice of the number of digits in the significand. In this article, the base will be denoted by b, and the number of digits, or precision, by p. The significand is a number consisting of p digits in radix b, so each digit lies in the range 0 through b-1. A base of 2 (that is, binary representation) is nearly always used in computer hardware, though some computers use b=10 or b=16. A base of 10 (that is, decimal representation) is used in the familiar scientific notation.

As an example, the revolution period of Jupiter's moon Io could be represented in scientific notation as 1.528535047 × 105 seconds. The string of digits "1528535047" is the significand, and the exponent is 5.

Now this could be represented in many different ways. For example the value can be any of

  • 1528.535047 × 102
  • 1528535047. × 10−4
  • 1528535047000000. × 10−10
  • 0.000001528535047 × 1011

All of these are valid representations.

One benefit of scientific notation is that the decimal point can be placed so that long strings of leading or trailing zeros can be avoided. Floating-point notation goes a step further, mandating a specific place for the point—typically just after the leftmost nonzero digit, and lets the exponent handle the rest. So, in this case, the correct representation is 1.528535047 × 105.

This, plus the requirement that the leftmost digit of the significand be nonzero, is called normalization. By doing this, one no longer needs to express the point explicitly; the exponent provides that information. In decimal floating-point notation with precision of 10, the revolution period of Io is simply an exponent e=5 and a significand s=1528535047. The implied decimal point is after the first digit of s (after the '1' and before the first '5').

When a (nonzero) floating-point number is normalized, its leftmost digit is nonzero. The value of the significand obeys 1 ≤ s < b. (Zero needs special treatment; this will be described below.)

The mathematical value of a floating-point number, using the convention given above, is s.ssssssss...sss × be.

In binary radix, the significand is a string of bits (1's and 0's) of length p, of which the leftmost bit is 1. The number π, represented in binary, is

11.0010010000111111011010101000100010000101101000110000100011010011... but is
11.0010010000111111011011 when rounded to a precision of 24 bits.

In binary floating-point, this is e=1 ; s=110010010000111111011011.

The actual real number that a floating-point number represents is the number that one could obtain by placing an infinite number of zeros to the right of the significand:

e=1; s=1100100100001111110110110000000000000000...

This number with a 24-bit significand has a decimal value of

3.1415927410125732421875 (exactly!), whereas the true value of π is
3.1415926535897932384626433832795...

The result of rounding π to 24-bit binary floating-point differs from the true value by about 0.03 parts per million, and matches the decimal representation of π in the first 7 digits. The difference is the discretization error.

The problem is that floating-point numbers with a limited number of digits can represent only a subset of the reals, and that π is not in that subset. The act of rounding a real number to a floating-point representation consists of finding the representable real number that is closest to the given real number.

One doesn't need numbers as sophisticated as π to exhibit this phenomenon. The decimal number 0.1 is not representable in binary floating-point of any finite precision. The exact binary representation would have a "1100" sequence continuing endlessly:

e=-4; s=1100110011001100110011001100110011..., but when rounded to 24 bits it becomes
e=-4; s=110011001100110011001101 which is actually 0.100000001490116119384765625 in decimal.

Single precision computer format has a precision of 24 bits. Double precision computer format has a precision of 53 bits.

Banker's rounding

When the bits after the last available result bit are 1000000000..... exactly (that is, it is known that there are an infinite number of zeros after the initial 1), rounding upward or downward is equally accurate. In this case, the technique of Banker's rounding is usually used. The rounding is done in the direction that makes the resulting significand even (this could leave a result of zero).

Mantissa

The word mantissa is often used as a synonym for significand. Purists may not consider this usage to be correct, since the mantissa is traditionally defined as the fractional part of a logarithm, while the characteristic is the integer part. This terminology comes from the way logarithm tables were used before computers became commonplace. Log tables were actually tables of mantissas. Therefore, a mantissa is the logarithm of the significand.

Floating point arithmetic operations

The usual rule for performing floating point arithmetic is that the exact mathematical value is calculated[1], and the result is then rounded to the nearest representable value in the specified precision. This is in fact the behavior mandated for IEEE-compliant computer hardware, aside from a few details that will be described below.

For ease of presentation and understanding, decimal radix with 7 digit precision will be used in the examples. The fundamental principles are the same in any radix or precision.

Addition and subtraction

To add or subtract two numbers, they must have their decimal (or binary) points lined up. This is done by comparing the exponent fields and shifting the smaller number to the right. In the example below, the second number is shifted right by three digits because it has the smaller exponent. It is unnormalized at this point. The actual addition is then performed:

  e=5;  s=1.234567     (123456.7)
+ e=2;  s=1.017654     (101.7654)

  e=5;  s=1.234567
+ e=5;  s=0.001017654  (after shifting)
--------------------
  e=5;  s=1.235584654  (true sum: 123558.4654)

This is the "true" result, relative to the exact meaning of the incoming operands, but it has too many digits. It must be rounded to seven digits (to match the precision) and then normalized if necessary. The final result is e=5; s=1.235585, or 123558.5.

Note that the low 3 digits of the second operand (654) are essentially lost. Except for their possible influence on carrying and rounding, they have no effect. This is round-off error. Whenever numbers of different magnitudes are added or subtracted, the lowest digits (or bits) of the smaller one are lost. In the most serious case of this, the smaller operand can be totally "absorbed" (that is, it has no effect at all):

  e=5;  s=1.234567
+ e=-3; s=9.876543

  e=5;  s=1.234567
+ e=5;  s=0.00000009876543 (after shifting)
----------------------
  e=5;  s=1.23456709876543 (true sum)
  e=5;  s=1.234567         (after rounding/normalization)

Another problem of loss of significance occurs when two nearly equal numbers are subtracted.

  e=5;  s=1.235585
- e=5;  s=1.234567
----------------
  e=5;  s=0.001018 (true difference)
  e=2;  s=1.018000 (after rounding/normalization)

The result of this operation is exact with respect to its inputs. However, if the addend is considered as an approximation to the result of the addition in the earlier example, we can see that the bottom three digits, that we created during normalization do not add any accuracy to the result. That is, they are zero, but this might simply be because the original operands didn't have enough precision. This is cancellation and illustrates the danger in assuming that all of the digits of a computed result are meaningful. It occurs when nearly equal numbers are subtracted, or numbers of opposite sign but nearly equal magnitude are added. Dealing with the consequences of cancellation and other concerns relating to error analysis are topics in numerical analysis.

Multiplication and division

To multiply, the significands are multiplied while the exponents are added, and the result is rounded and normalized.

  e=3;  s=4.734612
× e=5;  s=5.417242
-----------------------
  e=8;  s=25.648538980104 (true product)
  e=8;  s=25.64854        (after rounding)
  e=9;  s=2.564854        (after normalization)

Division is done similarly, but that is more complicated.

There are no cancellation or absorption problems with multiplication or division, though small errors may accumulate as operations are performed repeatedly. In practice, the way these operations are carried out in digital logic can be quite complex. (see Booth's multiplication algorithm and digital division)[2]

Computer representation

Floating-point numbers are typically packed into a computer datum as the sign bit, the exponent field, and the significand, from left to right. For the common formats they are apportioned as follows:

         sign   exponent  (exponent bias)  significand   total
single    1        8         (127)             23         32
double    1       11         (1023)            52         64

Since the exponent can be positive or negative, it has a fixed "bias" added to it, and the sum is stored in the actual exponent field as an unsigned number. A value of zero, or all 1's, in that field is reserved for special treatment. Therefore the legal exponent range for normalized numbers is [-126, 127] for single precision or [-1022, 1023] for double.

When a number is normalized, its leftmost significand bit is known to be 1. In the IEEE single and double precision formats that bit is not actually stored in the computer datum. It is called the "hidden" or "implicit" bit. Because of this, single precision format actually has 24 bits of significand precision, while double precision format has 53.

For example, it was shown above that π, rounded to 24 bits of precision, has:

  • sign = 0 ; e=1 ; s=110010010000111111011011 (including the hidden bit)

The sum of the exponent bias (127) and the exponent (1) is 128, so this is represented in single precision format as

  • 0 10000000 10010010000111111011011 = 40490FDB in hexadecimal

"Bad data"

When large collections of data are to be processed, or a long sequence of computations is to be executed, it can happen that something goes wrong, but not necessarily hopelessly awry. A datum might be missing from its place in a regular array, or the computation act wrongly. One approach is to associate with every datum a status indicator and employ suitable tests. This soon becomes tedious, and slows execution for the vast majority of untroubled computations. An alternative approach is to push this task into the hardware and introduce a special state for a datum, often called Not a Number, or NaN. In the design so far this can be done quite neatly. Because the significand is represented as a signed number, zero could have two manifestations, one for each sign. Clearly, a normal zero would have a sign bit of zero, and the NaN value a sign bit of one. Yet further extensions to the arithmetic hardware would be made, such that operations involving a NaN value would result in a NaN value, and so forth.

The use of such a facility will depend on careful distinctions. The prime idea is that a "bad" value should render the result of any further computation involving it also "bad", rather than offer a number whose value seems no less valid than any other number. For instance, subtracting two nearly-equal small numbers will through cancellation produce a result that could be too small to represent as a normalised number. The denormal representation is fully satisfactory for this because its reduced precision captures the reduced precision of the result: gradual underflow. Such cancellation is always a potential problem with subtraction and is expected. But, if two very small numbers are multiplied (or a small number is divided by a large), the resulting underflow is abrupt, and the conversion to the nearest representable denormal number will involve a significant loss of precision, an unexpected event in multiplication and division. In such a case, the Underflow code would mark all subsequent usages of the result as bad rather than allow the misleading appearance of a good result. Hopefully, a sequence such as (small × small × large) might be found and changed to (small × large × small). Without an Underflow trap, this might never be noticed.

The secondary usage would be to avoid referencing "bad" values that are present because of organisational convenience, somewhat as follows:

t:=0; n:=0;              %An array x is to be averaged, elements first to last.
for i:=first:last do     %Bad values are to be avoided, lest the total be ruined. 
 if Valid(x(i)) then t:=t + x(i), n:=n + 1;
next i;
if n > 0 then print "Average of ",n," is ",t/n
 else print "No valid values!";

Additional special values

At this stage the design could be declared complete. Every possible bit pattern is decodable to a distinct number, plus the two special values of zero and NaN.

But once some special values are encompassed, the dam has been broken and other special values could be supported to signify other special cases. Because of the limitations of finite-sized floating point numbers, it is possible that the result of a calculation is a number too big, or too small to be represented; this is ±Overflow and ±Underflow. The mathematics behind a calculation might be such as to make ±Infinity a useful indication, and "?" for "no result", then there arise indicators for erroneous operations such as divide by zero, square root of a negative number, log of a non-positive number, arctangent of a vector of zero length and so on, plus the more sordid mistakes of programming, such as referencing the value of a variable that has yet to be initialised. And if these are to be allowed, why not extend to allowing user-defined codes such as "missing datum", "unreadable value", "misprinted value", "malfunctioning meter", etc. as might prove useful?

With such a plethora, the obvious rule would be that an operation on two NaN values of the same type retain that type, whereas different types would result in a basic type, say NaN0. There would be a schedule of meanings for NaN1, NaN2, etc. and when a computation ends, the results would be scrutinised as to what might have caused the various NaNx codes that appear instead of normal numbers.

There are choices for representation. Reserving a single bit to signify normal/special is wasteful as with it set on, all other 31 bits become available to specify just a few states, whereas with it off, the storage for normal numbers is reduced by one bit. The other extreme would be to select a special value of the significand field, but this is untidy since all its bit patterns have a use. So then, the exponent: the highest possible exponent might be reserved (realtively easily identified as its bit pattern will be all 1), and the miscellaneous types distinguished via the values of the significand. Since the significand field has 23 bits, there is be much scope for encodement. If this is done, the highest exponent is no longer e = +127 but e = +126 and so the lowest exponent (needed for reciprocals) would be e = −127 instead of e = −128, thus the exponent offset could changed to 127. Or instead, a single NaN value be accepted for all purposes.

Implementation in actual computers

The IEEE has standardized the computer representation for binary floating-point numbers in IEEE 754. This standard is followed by almost all modern machines. Notable exceptions include IBM Mainframes, which support IBM's own format (in addition to IEEE 754 data types), and Cray vector machines, where the T90 series had an IEEE version, but the SV1 still uses Cray floating-point format.

The standard allows for many different precision levels, of which the 32 bit ("single") and 64 bit ("double") are by far the most common, since they are supported in common programming languages. Computer hardware (for example, the Intel Pentium series and the Motorola 68000 series) often provides an 80 bit extended precision format, with 15 exponent bits and 64 significand bits, with no hidden bit. There is controversy about the failure of most programming languages to make these extended precision formats available to programmers (although C and related programming languages usually provide these formats via the long double type on such hardware). System vendors may also provide additional extended formats (e.g. 128 bits) emulated in software.

As of 2000, the IEEE 754 standard is currently under revision. See IEEE 754r.

Behavior of computer arithmetic

The standard behavior of computer hardware is to round the ideal (infinitely precise) result of an arithmetic operation to the nearest representable value, and give that representation as the result. In practice, there are other options. IEEE-754-compliant hardware allows one to set the rounding mode to any of the following:

  • round to nearest (the default; by far the most common mode)
  • round up (toward +∞; negative results round toward zero)
  • round down (toward −∞; negative results round away from zero)
  • round toward zero (sometimes called "chop" mode; it is similar to the common behavior of float-to-integer conversions, which convert −3.9 to −3)

In the default rounding mode the IEEE 754 standard mandates the round-to-nearest behavior described above for all fundamental algebraic operations, including square root. ("Library" functions such as cosine and log are not mandated.) This means that IEEE-compliant hardware's behavior is completely determined in all 32 or 64 bits.

The mandated behavior for dealing with overflow and underflow is that the appropriate result is computed, taking the rounding mode into consideration, as though the exponent range were infinitely large. If that resulting exponent can't be packed into its field correctly, the overflow/underflow action described above is taken.

The arithmetical distance between two consecutive representable floating point numbers is called an "ULP", for Unit in the Last Place. For example, the numbers represented by 45670123 and 45670124 hexadecimal is one ULP. An ULP is about 10−7 in single precision, and 10−16 in double precision. The mandated behavior of IEEE-compliant hardware is that the result be within one-half of an ULP.

Exceptional values and exceptions under the IEEE standard

In addition to the "infinity" value that is produced when an overflow occurs, there is a special value "NaN" ("not a number") that is produced by operations as taking the square root of a negative number. NaN is encoded with the reserved exponent of 128 (or 1024), and a significand field that distinguishes it from infinity.

The intention of the INF and NaN values is that, under the most common circumstances, they can just propagate from one operation to the next (any operation with NaN as an operand produces NaN as a result), and they only need to be attended to at a point that the programmer chooses.

In addition to the creation of exceptional values, there are "events" that may occur, though some of them are quite benign:

  • An overflow occurs as described previously, producing an infinity.
  • An underflow occurs as described previously, producing a denorm.
  • A zerodivide occurs whenever a divisor is zero, producing an infinity of the appropriate sign. (The sign of zero is meaningful here.) Note that a very small but nonzero divisor can still cause an overflow and produce an infinity.
  • An "operand error" occurs whenever a NaN has to be created. This occurs whenever any operand to an operation is a NaN, or some other obvious thing happens, such a sqrt(-2.0) or log(-1.0).
  • An "inexact" event occurs whenever the rounding of a result changed that result from the true mathematical value. This occurs almost all the time, and is usually ignored. It is looked at only in the most exacting applications.

Computer hardware is typically able to raise exceptions ("traps") when these events occur. How these are presented to the software is very language- and system-dependent. Usually all exceptions are masked (disabled). Sometimes overflow, zerodivide, and operand error are enabled.

Accuracy problems

The facts that floating-point numbers cannot faithfully mimic the real numbers, and that floating-point operations cannot faithfully mimic true arithmetic operations, lead to many surprising situations.

For example, the non-representability of 0.1 and 0.01 means that the result of attempting to square 0.1 is neither 0.01 nor the representable number closest to it. In 24-bit (single precision) representation, 0.1 (decimal) was given previously as e=-4; s=110011001100110011001101, which is

.100000001490116119384765625 exactly.

Squaring this number gives

.010000000298023226097399174250313080847263336181640625 exactly.

Squaring it with single-precision floating-point hardware (with rounding) gives

.010000000707805156707763671875 exactly.

But the representable number closest to 0.01 is

.009999999776482582092285156250 exactly.

Also, the non-representability of π (and π/2) means that an attempted computation of tan(π/2) would not yield a result of infinity, nor would it even overflow. It is simply not possible for standard floating-point hardware to attempt to compute tan(π/2), because π/2 cannot be represented exactly. This computation in C:

  // Enough digits to be sure we get the correct approximation.
  double pi = 3.1415926535897932384626433832795;
  double z = tan(pi/2.0);

Will give a result of 16331239353195370.0. In single precision (using the tanf function), the result would be -22877332.0.

By the same token, a calculation of sin(π) would not yield zero. The result would be (approximately) .1225 × 10-15 in single precision, or -.8742 × 10-7 in single precision. [3]

In fact, while addition and multiplication are both commutative (a+b = b+a and a×b = b×a), they are not associative (a + b) + c = a + (b + c). Using 7-digit decimal arithmetic:

 1234.567 + 45.67844 = 1280.245
                       1280.245 + 0.0004 = 1280.245
 but 
 45.67844 + 0.0004 = 45.67884
                     45.67884 + 1234.567 = 1280.246

They are also not distributive (a + b)×c = a×c + b×c :

 1234.567 × 3.333333 = 4115.223
 1.234567 × 3.333333 = 4.115223
                       4115.223 + 4.115223 = 4119.338
 but 
 1234.567 + 1.234567 = 1235.802
                       1235.802 × 3.333333 = 4119.340

In addition to loss of significance, inability to represent numbers such as π and 0.1 exactly, and other slight inaccuracies, the following phenomena may occur:

  • Cancellation: subtraction of nearly equal operands may cause extreme loss of accuracy. This is perhaps the most common and serious accuracy problem.
  • Conversions to integer are unforgiving: converting (63.0/9.0) to integer yields 7, but converting (0.63/0.09) may yield 6. This is because conversions generally truncate rather than rounding.
  • Limited exponent range: results might overflow yielding infinity, or underflow yielding a denormal value or zero. If a denormal number results, precision will be lost.
  • Testing for safe division is problematical: Checking that the divisor is not zero does not guarantee that a division will not overflow and yield infinity.
  • Equality is problematical! Two computational sequences that are mathematically equal may well produce different floating-point values. Programmers often perform comparisons within some tolerance (often a decimal constant, itself not accurately represented), but that doesn't necessarily make the problem go away.

Minimizing the effect of accuracy problems

Because of the problems noted above, naive use of floating point arithmetic can lead to many problems. A good understanding of numerical analysis is essential to the creation of robust floating point software. The subject is actually quite complicated, and the reader is referred to the references at the bottom of this article.

In addition to careful design of programs, careful handling by the compiler is essential. Certain "optimizations" that compilers might make (for example, reordering operations) can work against the goals of well-behaved software. There is some controversy about the failings of compilers and language designs in this area. See the external references at the bottom of this article.

Floating point arithmetic is at its best when it is simply being used to measure real-world quantities over a wide range of scales (such as the orbital period of Io or the mass of the proton), and at its worst when it is expected to model the interactions of quantities expressed as decimal strings that are expected to be exact. An example of the latter case is financial calculations. For this reason, financial software tends not to use a binary floating-point number representation. See: http://www2.hursley.ibm.com/decimal/. The "decimal" data type of the C# and Java programming languages, and the IEEE 854 standard, are designed to avoid the problems of binary floating point, and make the arithmetic always behave as expected when numbers are printed in decimal.

Double precision floating point arithmetic is more accurate than just about any physical measurement one could make. For example, it could indicate the distance from the Earth to the Moon with an accuracy of about 50 nanometers. So, if one were designing an integrated circuit chip with 100 nanometer features, that stretched from the Earth to the Moon, double precision arithmetic would be somewhat problematical, but only somewhat.

What makes floating point arithmetic troublesome is that people write mathematical algorithms that perform operations an enormous number of times, and so small errors grow. A few examples are matrix inversion, eigenvector computation, and differential equation solving. These algorithms must be very carefully designed if they are to work well.

People often carry expectations from their mathematics training into the field of floating point computation. For example, it is known that , and that , and that eigenvectors are degenerate if the eigenvalues are equal. These facts can't be counted on when the quantities involved are the result of floating point computation.

While a treatment of the techniques for writing high-quality floating-point software is far beyond the scope of this article, here are a few simple tricks:

The use of the equality test (if (x==y) ...) is usually not a good idea when it is based on expectations from pure mathematics. Such things are sometimes replaced with "fuzzy" tests (if (abs(x-y) < 1.0E-13) ...). The wisdom of doing this varies greatly. It is often better to organize the code in such a way that such tests are unnecessary.

An awareness of when loss of significance can occur is useful. For example, if one is adding a very large number of numbers, the individual addends are very small compared with the sum. This can lead to loss of significance. Suppose, for example, that one needs to add many numbers, all approximately equal to 3. After 1000 of them have been added, the running sum is about 3000. A typical addition would then be something like

3253.671
+  3.141276
--------
3256.812

The low 3 digits of the addends are effectively lost. The Kahan summation algorithm may be used to reduce the errors.

Another thing that can be done is to rearrange the computation in a way that is mathematically equivalent but less prone to error. As an example, Archimedes approximated π by calculating the perimeters of polygons inscribing and circumscribing a circle, starting with hexagons, and successively doubling the number of sides. The recurrence formula for the circumscribed polygon is:

Here is a computation using IEEE "double" (53 bits of significand precision) arithmetic:

 i   6 × 2i × ti, first form    6 × 2i × ti, second form

 0   3.4641016151377543863      3.4641016151377543863
 1   3.2153903091734710173      3.2153903091734723496
 2   3.1596599420974940120      3.1596599420975006733
 3   3.1460862151314012979      3.1460862151314352708
 4   3.1427145996453136334      3.1427145996453689225
 5   3.1418730499801259536      3.1418730499798241950
 6   3.1416627470548084133      3.1416627470568494473
 7   3.1416101765997805905      3.1416101766046906629
 8   3.1415970343230776862      3.1415970343215275928
 9   3.1415937488171150615      3.1415937487713536668
10   3.1415929278733740748      3.1415929273850979885
11   3.1415927256228504127      3.1415927220386148377
12   3.1415926717412858693      3.1415926707019992125
13   3.1415926189011456060      3.1415926578678454728
14   3.1415926717412858693      3.1415926546593073709
15   3.1415919358822321783      3.1415926538571730119
16   3.1415926717412858693      3.1415926536566394222
17   3.1415810075796233302      3.1415926536065061913
18   3.1415926717412858693      3.1415926535939728836
19   3.1414061547378810956      3.1415926535908393901
20   3.1405434924008406305      3.1415926535900560168
21   3.1400068646912273617      3.1415926535898608396
22   3.1349453756585929919      3.1415926535898122118
23   3.1400068646912273617      3.1415926535897995552
24   3.2245152435345525443      3.1415926535897968907
25                              3.1415926535897962246
26                              3.1415926535897962246
27                              3.1415926535897962246
28                              3.1415926535897962246
              The true value is 3.1415926535897932385...

While the two forms of the recurrence formula are clearly equivalent, the first subtracts 1 from a number extremely close to 1, leading to huge cancellation errors. Note that, as the recurrence is applied repeatedly, the accuracy improves at first, but then it deteriorates. It never gets better than about 8 digits, even though 53-bit arithmetic should be capable of about 16 digits of precision. When the second form of the recurrence is used, the value converges to 15 digits of precision.

A few nice properties

One can sometimes take advantage of a few nice properties:

  • Any integer less than or equal to 224 can be exactly represented in the single precision format, and any integer less than or equal to 253 can be exactly represented in the double precision format. Furthermore, any reasonable power of 2 times such a number can be represented. This property is sometimes used in purely integer applications, to get 53-bit integers on machines that have double precision floats but only 32-bit integers.
  • The bit representations are monotonic, as long as exceptional values are avoided and the signs are handled properly. Floating point numbers are equal if and only if their integer bit representations are equal. Comparisons for larger or smaller can be done with integer comparisons on the bit patterns, as long as the signs match. However, the actual floating point comparisons provided by hardware typically have much more sophistication in dealing with exceptional values.
  • To a rough approximation, the bit representation of a floating point number is proportional to its base 2 logarithm, with an average error of about 3%. (This is because the exponent field is in the more significant part of the datum.) This can be exploited in some applications, such as volume ramping in digital sound processing.

See also

Notes and references

  1. ^ Computer hardware doesn't necessarily compute the exact value; it simply has to produce the equivalent rounded result as though it had computed the infinitely precise result.
  2. ^ The enormous complexity of modern division algorithms once led to a famous error. An early version of the Intel Pentium chip was shipped with a division instruction that, on rare occasions, gave slightly incorrect results. Many computers had been shipped before the error was discovered. Until the defective computers were replaced, patched versions of compilers were developed that could avoid the failing cases. See "Pentium FDIV Bug"
  3. ^ But a calculation of cos(π) yields -1 exactly. Since the derivative is nearly zero, the inaccuracy in the argument has almost no effect, and the rounded result is correct.