# Real Close to the Machine: Floating Point in D

## Introduction

*by Don Clugston*

Computers were originally conceived as devices for performing mathematics. The earliest computers spent most of their time solving equations. Although the engineering and scientific community now forms only a miniscule part of the computing world, there is a fantastic legacy from those former times: almost all computers now feature superb hardware for performing mathematical calculations accurately and extremely quickly. Sadly, most programming languages make it difficult for programmers to take full advantage of this hardware. An even bigger problem is the lack of documentation; even for many mathematical programmers, aspects of floating-point arithmetic remain shrouded in mystery.

As a systems programming language, the D programming language attempts to remove all barriers between the programmer and the compiler, and between the programmer and the machine. This philosophy is particularly evident in support for floating-point arithmetic. A personal anecdote may illustrate the importance of having an accurate understanding of the hardware.

My first floating-point nightmare occurred in a C++ program which hung once in every hundred runs or so. I eventually traced the problem to a while loop which occasionally failed to terminate. The essence of the code is shown in Listing 1.

```
double q[8];
...
int x = 0;
while (x < 8) {
if ( q[x] >= 0 ) return true;
if ( q[x] < 0 ) ++x;
}
return false;
```

Initially, I was completely baffled as to how this harmless-looking loop could fail. But eventually, I discovered that q had not been initialized properly; q[7] contained random garbage. Occasionally, that garbage had every bit set, which mean that q[7] was a Not-a-Number (NaN), a special code which indicates that the value of the variable is nonsense. NaNs were not mentioned in the compiler's documentation - the only information I could find about them was in Intel's assembly instruction set documentation! Any comparison involving a NaN is false, so q[7] was neither >= 0, nor < 0, killing my program. Until that unwelcome discovery, I'd been unaware that NaNs even existed. I had lived in a fool's paradise, mistakenly believing that every floating point number was either positive, negative, or zero.

My experience would have been quite different in D. The "strange" features of floating point have a higher visibility in the language, improving the education of numerical programmers. Uninitialized floating point numbers are initialized to NaN by the compiler, so the problematic loop would fail every time, not intermittently. Numerical programmers in D will generally execute their programs with the 'invalid' floating point exception enabled. Under those circumstances, as soon as the program accessed the uninitialized variable, a hardware exception would occur, summoning the debugger. Easy access to the "strange" features of floating point results in better educated programmers, reduced confusion, faster debugging, better testing, and hopefully, more reliable and correct numerical programs. This article will provide a brief overview of the support for floating point in the D programming language.

## Demystifying Floating-Point

D guarantees that all built-in floating-point types conform to IEEE 754 arithmetic, making behaviour entirely predictable (note that this is *not* the same as producing identical results on all platforms). IEEE 754-2008 is the latest revision of the IEEE 754 Standard for Floating-Point Arithmetic. D is progressing towards full compliance with 754-2008.

The IEEE standard floating point types currently supported by D are float and double. Additionally, D supports the real type, which is either 'IEEE 80-bit extended' if supported by the CPU; otherwise, it is the same as double. In the future, the new types from 754-2008 will be added: quadruple, decimal64, and decimal128.

The characteristics of these types are easily accessible in the language via *properties*. For example, float.max is the maximum value which can be stored in a float; float.mant_dig is the number of digits (bits) stored in the mantissa.

To make sense of mathematics in D, it's necessary to have a basic understanding of IEEE floating-point arithmetic. Fundamentally, it is a mapping of the infinitely many real numbers onto a small number of bytes. Only 4000 million distinct numbers are representable as an IEEE 32-bit float. Even with such a pathetically small representation space, IEEE floating point arithmetic does a remarkably good job of maintaining the illusion that mathematical real numbers are being used; but it's important to understand when the illusion breaks down.

Most problems arise from the distribution of these representable numbers. The IEEE number line is quite different to the mathematical number line.

```
+ +-----------+------------+ .. + .. +----------+----------+ + #
-infinity -float.max -1 -float.min_normal 0 float.min_normal 1 float.max infinity NaN
```

Notice that half of the IEEE number line lies between -1 and +1. There are 1000 million representable floats between 0 and 0.5, but only 8 million between 0.5 and 1. This has important implications for accuracy: the effective precision is incredibly high near zero. Several examples will be presented where numbers in the range -1 to +1 are treated seperately to take advantage of this.

Notice also the special numbers: ±∞; the so-called "subnormals" between ±float.min_normal and 0, which are represented at reduced precision; the fact that there are TWO zeros, +0 and -0, and finally "NaN"("Not-a-Number"), the nonsense value, which caused so much grief in Listing 1.

Why does NaN exist? It serves a valuable role: it *eradicates undefined behaviour* from floating-point arithmetic. This makes floating-point completely predictable. Unlike the int type, where 3/0 invokes a hardware division by zero trap handler, possibly ending your program, the floating-point division 3.0/0.0 results in ∞. Numeric overflow (eg, real.max*2) also creates ∞. Depending on the application, ∞ may be a perfectly valid result; more typically, it indicates an error. Nonsensical operations, such as 0.0 / 0.0, result in NaN; but *your program does not lose control*. At first glance, infinity and NaN may appear unnecessary -- why not just make it an error, just as in the integer case? After all, it is easy to avoid division by zero, simply by checking for a zero denominator before every division. The real difficulty comes from overflow: it is extremely difficult to determine in advance whether an overflow will occur in a multiplication.

Subnormals are necessary to prevent certain anomalies, and preserve important relationships such as: "x - y == 0 if and only if x == y".

Since ∞ can be produced by overflow, both +∞ and -∞ are required. Both +0 and -0 are required in order to preserve identities such as: if x>0, then 1/(1/x) > 0. In almost all other cases, however, there is no difference between +0 and -0.

It's worth noting that these ‘special values’ are usually not very efficient. On x86 machines, for example, a multiplication involving a NaN, an infinity, or a subnormal can be twenty to fifty times slower than an operation on normal numbers. If your numerical code is unexpectedly slow, it's possible that you are inadvertently creating many of these special values. Enabling floating-point exception trapping, described later, is a quick way to confirm this.

One of the biggest factor obscuring what the machine is doing is in the conversion between binary and decimal. You can eliminate this by using the "%a" format when displaying results. This is an invaluable debugging tool, and an enormously helpful aid when developing floating-point algorithms. The 0x1.23Ap+6 hexadecimal floating-point format can also be used in source code for ensuring that your input data is *exactly* what you intended.

## The Quantized Nature of Floating-Point

The fact that the possible values are limited gives access to some operations which are not possible on mathematical real numbers. Given a number x, nextUp(x) gives the next representable number which is greater than x. nextDown(x) gives the next representable number which is less than x.

Numerical analysts often describe errors in terms of "units in the last place"(ulp), a surprisingly subtle term which is often used rather carelessly. [footnote: The most formal definition is found in [J.-M. Muller, "On the definition of ulp(x)",INRIA Technical Report 5504 (2005).]: If x is a real number that lies between two finite consecutive floating-point numbers a and b of type F, without being equal to one of them, then ulp(x)=abs(b-a); otherwise ulp(x) = x*F.epsilon. Moreover, ulp(NaN) is NaN, and ulp(±F.infinity) = ±F.max*F.epsilon.] I prefer a far simpler definition: The difference in ulps between two numbers x and y is is the number of times which you need to call nextUp() or nextDown() to move from x to y. [Footnote: This will not be an integer if either x or y is a real number, rather than a floating point number.] The D library function feqrel(x, y) gives the number of bits which are equal between x and y; it is an easy way to check for loss-of-precision problems.

The quantized nature of floating point has some interesting consequences.

- ANY mathematical range [a,b), (a,b], or (a,b) can be converted into a range or the form [a,b]. (The converse does not apply: there is no (a,b) equivalent to [-∞, ∞]).
- A naive binary chop doesn't work correctly. The fact that there are hundreds or thousands of times as many representable numbers between 0 and 1, as there are between 1 and 2, is problematic for divide-and-conquer algorithms. A naive binary chop would divide the interval [0 .. 2] into [0 .. 1] and [1 .. 2]. Unfortunately, this is not a true binary chop, because the interval [0 .. 1] contains more than 99% of the representable numbers from the original interval!

## Condition number

Using nextUp, it's easy to approximately calculate the condition number.

```
real x = 0x1.1p13L;
real u = nextUp(x);
int bitslost = feqrel(x, u) - feqrel(exp(x), exp(u));
```

This shows that at these huge values of x, a one-bit error in x destroys 12 bits of accuracy in exp(x)! The error has increased by roughly 6000 units in the last place. The condition number is thus 6000 at this value of x.

## The semantics of float, double, and real

For the x86 machines which dominate the market, floating point has traditionally been performed on a descendent of the 8087 math coprocessor. These "x87" floating point units were the first to implement IEEE754 arithmetic. The SSE2 instruction set is an alternative for x86-64 processors, but x87 remains the only portable option for floating point 32-bit x86 machines (no 32-bit AMD processors support SSE2).

The x87 is unusual compared to most other floating-point units. It _only_ supports 80-bit operands, henceforth termed "real80". All double and float operands are first converted to 80-bit, all arithmetic operations are performed at 80-bit precision, and the results are reduced to 64-bit or 32-bit precision if required. This means that the results can be significantly more accurate than on a machine which supports at most 64 bit operations. However, it also poses challenges for writing portable code. (Footnote: The x87 allows you to reduce the mantissa length to be the same as 'double or float, but it retains the real80 exponent, which means different results are obtained with subnormal numbers. To precisely emulate double arithmetic slows down floating point code by an order of magnitude).

Apart from the x87 family, the Motorola 68K (but not ColdFire) and Itanium processors are the only ones which support 80-bit floating point.

A similar issue relates to the FMA (fused multiply and accumulate) instruction, which is available on an increasing number of processors, including PowerPC, Itanium, Sparc, and Cell. On such processors, when evaluating expressions such as x*y + z, the x*y is performed at twice the normal precison. Some calculations which would otherwise cause a total loss of precision, are instead calculated exactly. The challenge for a high-level systems programming language is to create an abstraction which provides predictable behaviour on all platforms, but which nonetheless makes good use of the available hardware.

D's approach to this situation arises from the following observations:

- It is extremely costly performance-wise to ensure identical behaviour on all processors. In particular, it is crippling for the x87.
- Very many programs will only run on a particular processor. It would be unfortunate to deny the use of more accurate hardware, for the sake of portability which would never be required.
- The requirements for portability and for high precision are never required simultaneously. If double precision is inadequate, increasing the precision on only some processors doesn't help.
- The language should not be tied to particular features of specific processors.

A key design goal is: it should be possible to write code such that, regardless of the processor which is used, the accuracy is never worse than would be obtained on a system which only supports the double type.

(Footnote: real is close to ‘indigenous’ in the Borneo proposal for the Java programming language[Ref Borneo]).

Consider evaluating x*y + z*w, where x, y, z and w are double.

- double r1 = x * y + z * w;
- double a = x * y; double r2 = a + z * w;
- real b = x * y; double r3 = b + z * w;

Note that during optimisation, (2) and (3) may be transformed into (1), but this is implementation-dependent. Case (2) is particularly problematic, because it introduces an additional rounding.

On a "simple" CPU, r1==r2==r3. We will call this value r0. On PowerPC, r2==r3, but r1 may be more accurate than the others, since it enables use of FMA. On x86, r1==r3, which may be more accurate than r0, though not as much as for the PowerPC case. r2, however, may be LESS accurate than r0.

By using real for intermediate values, we are guaranteed that our results are never worse than for a simple CPU which only supports double.

## Properties of the Built-in Types

The fundamental floating-point properties are epsilon, min_normal and max. The six integral properties are simply the log2 or log10 of these three.

float | double | real80 | quadruple | decimal64 | decimal128 | |
---|---|---|---|---|---|---|

epsilon | 0x1p-23 | 0x1p-52 | 0x1p-63 | 0x1p-112 | 1e-16 (1p-54) | 1e-34 (1p-113) |

[min_normal | 0x1p-126 | 0x1p-1022 | 0x1p-16382 | 0x1p-16382 | 1e-383 | 1e-6143 |

..max) | 0x1p+128 | 0x1p+1024 | 0x1p+16384 | 0x1p+16384 | 1e+385 | 1e+6145 |

binary properties | ||||||

mant_dig | 24 | 53 | 64 | 113 | 53 | 112 |

min_exp | -125 | -1021 | -16381 | -16381 | ||

max_exp | +128 | +1024 | +16384 | +16384 | ||

decimal properties | ||||||

dig | 6 | 15 | 18 | 33 | 16 | 34 |

min_10_exp | -37 | -307 | -4932 | -4932 | -382 | -6142 |

max_10_exp | +38 | +308 | +4932 | +4932 | 385 | +6145 |

When writing code which should adapt to different CPUs at compile time, use static if with the mant_dig property. For example, static if (real.mant_dig==64) is true if 80-bit reals are available.
For binary types, the dig property gives only the *minimum* number of valid decimal digits. To ensure that that every representable number has a unique decimal representation, two additional digits are required. Similarly, for decimal numbers, mant_dig is a lower bound on the number of valid binary digits.

## Useful relations for a floating point type F, where x and y are of type F

- The smallest representable number is F.min_normal * F.epsilon
- Any integer between 0 and (1/F.epsilon) can be stored in F without loss of precision. 1/F.epsilon is always a exact power of the base.
- If a number x is subnormal, x*(1/F.epsilon) is normal, and exponent(x) = exponent(x*(1/F.epsilon)) - (mant_dig-1).
- x>0 if and only if 1/(1/x) > 0; x<0 if and only if 1/(1/x) < 0.
- If x-y==0, then x==y && isFinite(x) && isFinite(y). Note that if x==y==infinity, then isNaN(x-y).
- F.max * F.min_normal = 4.0 for binary types, 10.0 for decimal types.

### Addition and subtraction

- Some loss of precision occurs with x±y if exponent(x)!=exponent(y). The number of digits of precision which are lost is abs(exponent(x)-exponent(y)).
- x±y has total loss of precision, if and only if (1) abs(x * F.epsilon) > abs(y), in which case x+y == x, x-y == x or (2) abs(y * F.epsilon) > abs(x), in which case x+y == y, x-y == -y
- Addition is commutative: a + b == b + a.
- Subtraction is not quite commutative: a - b == -(b - a), but produce +0 and -0 if a==b.
- Addition is not associative at all.

### Multiplication and division

- Multiplication and division are
*always*at risk of overflow or underflow. For any abs(x) > F.epsilon, there is at least one finite y such that x/y will overflow to ∞. For any abs(x) < F.epsilon, there is at least one finite y such that x/y will underflow to zero. For any abs(x) > 1, there is at least one finite y such that x*y will overflow to ∞. For any abs(x) < 1, there is at least one finite y such that x*y will underflow to zero. - x*x will overflow if abs(x)>sqrt(F.max), and underflow to zero if abs(x) < sqrt(F.min_normal*F.epsilon)
- Multiplication is commutative. a * b == b * a .
- Multiplication is not associative in general: a*(b*c) != (a*b)*c, because (1) there is a risk of overflow or underflow and (2) b*c may be an exact calculation, so that a*(b*c) contains only one round-off error, whereas (a*b)*c contains two. The roundoff errors may therefore accumulate at the rate of just under 1 ulp per multiplication.
- However, a limited form of associativity is possible if the type used for intermediate results is larger than any of the operands (which happens on x87 and Itanium machines). If R is the intermediate type, and F is the type being multiplied, up to min(R.max_exp/F.max_exp, R.epsilon/F.epsilon) values of type F can be multiplied together in any order without influencing the result. For example, if R is double, multiplication of 8 floats f1*f2*f3*f4*f5*f6*f7*f8 is completely associative. On x87, 130 floats can be safely multiplied together in any order, and 16 doubles can similarly be multiplied together safely. Strict distributivity does not hold even under these circumstances, as it may destroy the sign of -0.
- The distributive law almost never holds. For example, 4*x + 6*x != 10*x if x==nextDown(1.5). a*x + b*x == (a+b)*x for all x only if the operations a*x, b*x, and (a+b) are all exact operations, which is true only if a and b are exact powers of 2. Even then, if a==-b and x==-0, then a*x+b*x==0.0, (a+b)*x==-0.0.
- Performing a division by multiplication by the reciprocal returns a result which (in round-to-nearest mode) is at most 1.5 ulps from the correctly rounded result. For almost any denominator, the rounding is incorrect (>0.5ulps) for 27% of numerators. [Ref: N. Brisebarre, J-M Muller, and S.K. Raina, "Accelerating Correctly Rounded Floating-Point Division when the Divisor Is Known in Advance", IEEE Trans. on Computers, Vol 53, pp 1069-1072 (2004)].

### Powers and logarithms

- F.mant_dig = -log2(F.epsilon) for binary types;
- F.dig = -log10(F.epsilon) for decimal types.
- F.max = exp2(F.max_exp*(1-F.epsilon)) for binary types;
- F.max = exp10(F.max_10_exp*(1-F.epsilon)) for decimal types.
- For any positive finite x, F.min_exp - F.mant_dig <= log2(x) < F.max_exp for binary types, F.min_10_exp - F.dig <= log10(x) < F.max_10_exp for decimal types
- exp2(x) == 0 if x < F.min_exp - F.mant_dig, exp2(x) == infinity if x >= F.max_exp

## NaN payloads

According to the IEEE 754 standard, a ‘payload’ can be stored in the mantissa of a NaN. This payload can contain information about how or why it was generated. Historically, almost no programming languages have ever made use of this potentially powerful feature. In D, this payload consists of a positive integer.

- real NaN(ulong payload) -- create a NaN with a "payload", where the payload is a ulong.
- ulong getNaNPayload(real x) -- returns the integer payload. Note that if narrowing conversions have occured, the high-order bits may have changed.

*Never* store a pointer as an integer payload inside a NaN. The garbage collector will not be able to find it!

## NCEG comparison operations

As well as the usual <, >, <=, and >= comparison operators, D also supports the "NCEG" operators. Most of them are the direct negation of the ordinary operators. Additionally, <>, <>=, !<>, and !<>= are provided. These 8 new operators are different from the normal operators only when a NaN is involved, so for the most part they are quite obscure. They are useful mainly in eliminating the possibility of NaN before beginning a calculation. The most useful relationships are probably:

- x <>= y is the same as !isNaN(x) && !isNaN(y), (except that signalling NaNs may be triggered by <>=).
- x !<>= y is the same as isNaN(x) || isNaN(y).

## The IEEE Rounding Modes

The rounding mode is controlled within a scope. Rounding mode will be restored to its previous state at the end of that scope.
Four rounding modes can be set. The default mode, *Round to nearest*, is the most statistically accurate, but the least intuitive. In the event of tie, the result is rounded to an even number.

Rounding mode | rndint(4.5) | rndint(5.5) | rndint(-4.5) | Notes |
---|---|---|---|---|

Round to nearest | 4 | 6 | -4 | Ties round to an even number |

Round down | 4 | 5 | -5 | |

Round up | 5 | 6 | -4 | |

Round to zero | 4 | 5 | -4 |

There are very few reasons for changing the rounding mode. The round-up and round-down modes were created specifically to allow fast implementations of interval arithmetic; they are crucial to certain libraries, but rarely used elsewhere. The round-to-zero mode is used for casting floating-point numbers to integers. Since mode switching is slow, especially on Intel machines, it may be useful to switch to round-to-zero mode, in order to exactly duplicate the behaviour of cast(int) in an inner loop.

The only other commonly cited reason for changing the rounding mode is as a simple check for numerical stability: if the calculation produces wildly different results when the rounding mode is changed, it's a clear sign that it is suffering from round-off errors.

## The IEEE Exception Status Flags

All IEEE-compiliant processors include special status bits that indicate when "weird" things have happened that programs might want to know about. For example, ieeeFlags.divideByZero tells if any infinities have been created by dividing by zero. They are 'sticky' bits: once they have been set, they remain set until explicitly cleared. By only checking this once at the end of a calculation, it may be possible to avoid comparing thousands of comparisions that are almost never going to fail.

Here's a list of the weird things that can be detected:

- invalid
- This is set if any NaN's have been generated. This can happen with ∞ - ∞, ∞ * 0, 0 * ∞, 0/0, ∞/∞, ∞%∞, or x%0, for any number x. Several other operations, such as sqrt(-1), can also generate a NaN. The
*invalid*condition is also set when a 'signalling NaN' is accessed, indicating use of an uninitialized variable. This almost always indicates a programming error. - overflow
- Set if ∞ was generated by adding or multiplying two numbers that were so large that the sum was greater than real.max. This almost always indicates that the result is incorrect; and corrective action needs to be taken.
- divisionByZero
- Set if ±∞ was generated by dividing by zero. This usually indicates a programming error, but not always; some types of calculations return correct results even when division by zero occurs. (For example, 1/(1+ 1/x) == 0 if x == 0). Note that division by a tiny, almost-zero number also produces an infinite result, but sets the overflow flag rather than the divisionByZero flag.
- underflow
- This happens if two numbers are subtracted or divided and are so tiny that the result lost precision because it was subnormal. Extreme underflow produces a zero result. Underflow almost never creates problems, and can usually be ignored.
- inexact
- This indicates that rounding has occurred. Almost all floating point operations set this flag! It was apparently included in the hardware to support some arcane tricks used in the pioneering days of numerical analysis. It can always be ignored.

Floating-point traps can be enabled for any of the categories listed above. When enabled, a hardware exception will be generated. This can be an invaluable debugging aid. A more advanced usage, not yet supported on any platform(!) is to provide a nested function to be used as a hardware exception handler. This is most useful for the overflow and underflow exceptions.

## Floating point and ‘pure nothrow’

Every floating point operation, even the most trivial, is affected by the floating-point rounding state, and writes to the sticky flags. The status flags and control state are thus 'hidden variables', potentially affecting every pure function; and if the floating point traps are enabled, any floating point operation can generate a hardware exception. D provides a facility for the floating-point control mode and exception flags to be usable in limited circumstances even when pure and nothrow functions are called.

[TODO: I've made two proposals, but I haven't persauded Walter yet!].

## Conclusion

Although D is a general-purpose programming language and supports many high-level concepts, it gives direct and convenient access to almost all features of modern floating-point hardware. This makes it an excellent language for development of robust, high-performance numerical code. It is also a language which encourages a deep understanding of the machine, making it fertile ground for innovation and for developing new algorithms.

## References and Further Reading

- "What Every Computer Scientist Should Know About Floating-Point Arithmetic"
- "An Interview with the Old Man of Floating-Point: Reminiscences elicited from William Kahan by Charles Severance"
- N. Brisebarre, J-M Muller, and S.K. Raina, "Accelerating Correctly Rounded Floating-Point Division when the Divisor Is Known in Advance", IEEE Trans. on Computers, Vol 53, pp 1069-1072 (2004).
- "The Borneo language"