Floating Point

Floating Point Intermediate Values

On many computers, higher-precision operations take no longer than lower-precision ones, so it makes numerical sense to use the greatest precision available for internal temporaries. The philosophy is not to dumb down the language to the lowest common hardware denominator, but to enable the exploitation of the best capabilities of target hardware.

For floating point operations and expression intermediate values, a greater precision can be used than the type of the expression. Only the minimum precision is set by the types of the operands, not the maximum.

Implementation Note: On Intel x86 machines, for example, it is expected (but not required) that the intermediate calculations be done to the full 80 bits of precision implemented by the hardware.
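As a small illustration of the "minimum, not maximum" rule, the sketch below prints the mantissa width guaranteed by each built-in type through the standard .mant_dig property. The helper name showPrecisions is made up for this example, and the value reported for real depends on the target: 64 bits is typical for the x86 80-bit extended format, while other targets may give real the same width as double.

```d
import std.stdio;

// Hypothetical helper, not part of any library: reports how many
// mantissa bits each floating point type guarantees as a minimum.
void showPrecisions()
{
    writefln("float:  %s mantissa bits", float.mant_dig);
    writefln("double: %s mantissa bits", double.mant_dig);
    // real is the largest hardware type, so its width is target dependent.
    writefln("real:   %s mantissa bits", real.mant_dig);
}

// real never offers less precision than double.
static assert(real.mant_dig >= double.mant_dig);

void main()
{
    showPrecisions();
}
```

These figures are lower bounds for an expression of the given type; an implementation remains free to carry intermediates at a wider precision.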

Because optimized code makes greater use of temporaries and common subexpressions, it may produce a more accurate answer than unoptimized code.

Algorithms should be written to work based on the minimum precision of the calculation. They should not degrade or fail if the actual precision is greater. Float or double types, as opposed to the real (extended) type, should only be used for:

reducing memory consumption for large arrays
when speed is more important than accuracy
data and function argument compatibility with C
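The guidance above can be sketched as follows: the bulk data is stored as float to save memory, while the running total is kept in real. This is only an illustration, assuming a D2 compiler; sumOf is a hypothetical helper and not a library function.

```d
import std.stdio;

// Hypothetical helper: sums float data using a real accumulator.
// The array stays compact (float), while the running total keeps
// whatever extra precision the hardware offers.
real sumOf(const(float)[] data)
{
    real total = 0;
    foreach (x; data)
        total += x;
    return total;
}

void main()
{
    float[] samples = [0.5f, 0.25f, 0.125f];
    writefln("%s", sumOf(samples));
}
```

The function is correct at float precision and simply becomes more accurate if the accumulator is wider, which is the behavior the paragraph above asks for.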

Floating Point Constant Folding

Regardless of the type of the operands, floating point constant folding is done in real or greater precision. It always follows IEEE 754 rules and uses round-to-nearest.

Floating point constants are internally represented in the implementation in at least real precision, regardless of the constant's type. The extra precision is available for constant folding. Committing to the precision of the result is done as late as possible in the compilation process. For example:

const float f = 0.2f;
writefln(f - 0.2);

will print 0. A non-const static variable's value cannot be propagated at compile time, so:

static float f = 0.2f;
writefln(f - 0.2);

will print 2.98023e-09. Hex floating point constants can also be used when specific floating point bit patterns are needed that are unaffected by rounding. To find the hex value of 0.2f:

import std.stdio;

void main()
{
    writefln("%a", 0.2f);
}

which is 0x1.99999ap-3. Using the hex constant:

const float f = 0x1.99999ap-3f;
writefln(f - 0.2);

prints 2.98023e-09.

Different compiler settings, optimization settings, and inlining settings can affect opportunities for constant folding; therefore, the results of floating point calculations may differ depending on those settings.

Complex and Imaginary types

In existing languages, there is an astonishing amount of effort expended in trying to jam a complex type onto existing type definition facilities: templates, structs, operator overloading, etc., and it all usually ultimately fails. It fails because the semantics of complex operations can be subtle, and it fails because the compiler doesn't know what the programmer is trying to do, and so cannot optimize the semantic implementation.

This is all done to avoid adding a new type. Adding a new type means that the compiler can make all the semantics of complex work "right". The programmer can then rely on a correct (or at least fixable) implementation of complex.

Coming with the baggage of a complex type is the need for an imaginary type. An imaginary type eliminates some subtle semantic issues, and improves performance by not having to perform extra operations on the implied 0 real part.

Imaginary literals have an i suffix:

ireal j = 1.3i;

There is no particular complex literal syntax; just add a real value and an imaginary value:

cdouble cd = 3.6 + 4i;
creal c = 4.5 + 2i;

Complex, real and imaginary numbers have two properties:

.re	get real part (0 for imaginary numbers)
.im	get imaginary part as a real (0 for real numbers)

For example:

cd.re		is 3.6 double
cd.im		is 4 double
c.re		is 4.5 real
c.im		is 2 real
j.im		is 1.3 real
j.re		is 0 real
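On recent D2 compilers the built-in complex and imaginary types are deprecated, so the declarations above may not compile cleanly there. As a stand-in, the sketch below uses the library type from std.complex, which exposes the same .re and .im members. This is a swapped-in technique for illustration only, not the built-in types this section describes.

```d
import std.complex;
import std.stdio;

void main()
{
    // Library counterparts of the declarations in the text.
    auto cd = complex(3.6, 4.0);            // Complex!double
    Complex!real c = complex(4.5L, 2.0L);   // Complex!real

    writefln("cd.re = %s, cd.im = %s", cd.re, cd.im);
    writefln("c.re  = %s, c.im  = %s", c.re, c.im);

    // std.complex has no separate imaginary type, so 1.3i is modelled
    // here as a complex number with a zero real part.
    auto j = complex(0.0L, 1.3L);
    writefln("j.re  = %s, j.im  = %s", j.re, j.im);
}
```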

Rounding Control

IEEE 754 floating point arithmetic includes the ability to set 4 different rounding modes: round to nearest, round toward +∞, round toward -∞, and round toward zero. These are accessible via the functions in std.c.fenv.
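A minimal sketch of switching the rounding mode follows. It assumes a current D2 toolchain, where the C bindings are found in core.stdc.fenv rather than std.c.fenv; divideRounded is a hypothetical helper written for this example.

```d
import core.stdc.fenv;
import std.stdio;

// Hypothetical helper: evaluates a / b under the given rounding mode
// (one of the FE_* constants) and restores the previous mode afterwards.
// An optimizer that is unaware of the floating point environment may
// move or fold the division, so results can vary with compiler settings.
double divideRounded(double a, double b, int mode)
{
    immutable saved = fegetround();
    fesetround(mode);
    scope (exit) fesetround(saved);
    return a / b;
}

void main()
{
    double one = 1.0, three = 3.0;
    // 1/3 is not exactly representable, so the two modes may differ
    // in the last bit of the result.
    writefln("%a", divideRounded(one, three, FE_DOWNWARD));
    writefln("%a", divideRounded(one, three, FE_UPWARD));
}
```

Restoring the saved mode keeps the change local, so code that runs afterwards still sees the mode it expects.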

Exception Flags

IEEE 754 floating point arithmetic can set several flags based on what happened with a computation:

FE_INVALID
FE_DENORMAL
FE_DIVBYZERO
FE_OVERFLOW
FE_UNDERFLOW
FE_INEXACT

These flags can be set/reset via the functions in std.c.fenv.
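The following sketch clears the flags, performs a division, and then tests FE_DIVBYZERO. It assumes a current D2 toolchain with the bindings in core.stdc.fenv; divideChecked is a hypothetical helper, and the operands are passed as parameters so the division happens at run time.

```d
import core.stdc.fenv;
import std.stdio;

// Hypothetical helper: stores a / b in quotient and reports whether
// the division raised the divide-by-zero flag.
bool divideChecked(double a, double b, out double quotient)
{
    feclearexcept(FE_ALL_EXCEPT);   // start with all flags cleared
    quotient = a / b;
    return fetestexcept(FE_DIVBYZERO) != 0;
}

void main()
{
    double q;
    bool raised = divideChecked(1.0, 0.0, q);
    writefln("1/0 = %s, FE_DIVBYZERO raised: %s", q, raised);

    raised = divideChecked(1.0, 2.0, q);
    writefln("1/2 = %s, FE_DIVBYZERO raised: %s", q, raised);
}
```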

Floating Point Comparisons

In addition to the usual < <= > >= == != comparison operators, D adds more that are specific to floating point. These are !<>= <> <>= !<= !< !>= !> !<> and match the semantics for the NCEG extensions to C. See Floating point comparisons.

Floating Point Transformations

An implementation may perform transformations on floating point computations in order to reduce their strength, i.e. their runtime computation time. Because floating point math does not precisely follow mathematical rules, some transformations are not valid, even though some other programming languages still allow them.

The following transformations of floating point expressions are not allowed because under IEEE rules they could produce different results.

Disallowed Floating Point Transformations
transformation	comments
x + 0 → x	not valid if x is -0
x - 0 → x	not valid if x is ±0 and rounding is towards -∞
-x ↔ 0 - x	not valid if x is +0
x - x → 0	not valid if x is NaN or ±∞
x - y ↔ -(y - x)	not valid because (1 - 1) = +0 whereas -(1 - 1) = -0
x * 0 → 0	not valid if x is NaN or ±∞
x / c ↔ x * (1/c)	valid only if (1/c) yields an exact result
x != x → false	not valid if x is a NaN
x == x → true	not valid if x is a NaN
x !op y ↔ !(x op y)	not valid if x or y is a NaN

Of course, transformations that would alter side effects are also invalid.
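A few rows of the table can be checked directly. The sketch below assumes a D2 compiler with std.math; the helper functions are hypothetical and take x as a parameter so that the expressions are evaluated at run time rather than folded.

```d
import std.math : isNaN, signbit;
import std.stdio;

// Hypothetical helpers, one per transformation being checked.
double addZero(double x)    { return x + 0.0; }
double subSelf(double x)    { return x - x; }
double timesZero(double x)  { return x * 0.0; }
bool   equalsSelf(double x) { return x == x; }

void main()
{
    // (-0) + 0 is +0 under round-to-nearest; rewriting x + 0 as x
    // would yield -0 instead.
    writefln("sign bit of (-0.0 + 0): %s", signbit(addZero(-0.0)));

    // x - x and x * 0 are NaN for an infinite x, not 0.
    writefln("inf - inf is NaN: %s", isNaN(subSelf(double.infinity)));
    writefln("inf * 0 is NaN:   %s", isNaN(timesZero(double.infinity)));

    // x == x is false when x is a NaN.
    writefln("nan == nan: %s", equalsSelf(double.nan));
}
```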
