Numerics Programming

In certain circles, FORTRAN reigns supreme as the premier programming language for mathematical applications. Since the first computers were designed primarily for performing calculations, it should come as no surprise that a pioneering programming language like FORTRAN would be biased toward mathematical work. Later languages such as Pascal tend to leave math to the FORTRAN programmers.

C's primary disadvantage has been its complexity; optimizing C compilers are often forced to avoid code improvements that FORTRAN compilers can generate with confidence. However, Digital Mars' optimization technology combined with C++'s improved type-checking enable generating numerical applications that rival the performance of equivalent FORTRAN programs.

Digital Mars has implemented nearly all of the numerics additions to C defined in C99. These additions are also present as extensions to C++98.

This chapter is a guide to understanding and using the C and C++ numerics support by Digital Mars.

What's in This Chapter

Numeric Data Types
Using Mathematics
Floating Point Environment
Floating Point Library
Working with the Intel Numeric Coprocessor
Compiling for floating-point emulation and math coprocessors
Complex Numbers

Numeric Data Types

Two broad classes of numeric data types exist in C and C++. Integral data types represent whole numbers; floating-point data types hold approximations of numbers with fractions. Digital Mars C++ supports nine integral and nine floating-point numeric data types; each type has a range of values it can hold.

Integers

An integral value is a binary integer that has a fixed number of bits. An 8-bit integral type, for example, can hold bit values in the range 00000000 (decimal 0) through 11111111 (decimal 255). If an integral value is unsigned, the bits are read literally. For signed integral values, negative numbers are stored in twos complement form. For example, the 8-bit integral value 11111111 represents decimal 255 if it is unsigned. However, the signed value of 11111111 is -1.

Integral data types

The primary integral data types are:

Integral Data Types
char	8 bits
short	16 bits
int	16 or 32 bits
long	32 bits
long long	64 bits

which come in both signed and unsigned versions. int's are 16 bits for the 16 bit memory models, and are 32 bits for the 32 bit memory models. long long's are only supported in 32 bit memory models.

Note: The char type is signed by default. The -J compiler switch, however, changes char so that it is unsigned. When using 8-bit integral quantities, it is best to explicitly define values as signed char or unsigned char, depending upon your needs. The char type should be used primarily for working with characters and text strings. char, regardless of whether it is implicitly signed or unsigned, is a type distinct from signed char and unsigned char. Type compatibilities are covered in more detail in the section Using Mathematics.

Literal integral constants

Literal integral constants can be written in octal (base 8), decimal (base 10), and hexadecimal (base 16) formats. Any constant beginning with a number other than zero is assumed to be in decimal format. If the first two characters of the constant are 0x, the constant is in hexadecimal format. When the first two characters of the constant are 0b, the constant is in binary format. Octal format is assumed when the first character is zero and the second character is a valid octal digit. A few examples of literal integral constants are:

Integral Constants
15	decimal 15
015	octal 15 = decimal 13
0x15	hexadecimal 15 = decimal 21
0b1011	binary 1011, decimal 11

Octal constants may contain only the digits 0 through 7, decimal constants contain the digits 0 through 9, and hexadecimal constants contain the digits 0 through 9 and A through F. Letter digits in hexadecimal values can be in either uppercase or lowercase.

Literal constants have a type. Unless your program specifies otherwise, a literal constant has an assumed type based on its size. A decimal constant is given the first of these types into which its value can be stored: int, unsigned int, long, unsigned long. An octal or hexadecimal constant is typed int, unsigned int, long, or unsigned long, depending on its size. Some examples of constants and their types are:

Constant Types
1701	int
42000	unsigned int
3117426591	unsigned long
0x7F	int
0157152	unsigned int
0x49310	long

The type of a literal integral constant can be affected by a suffix. If a u or U is used as a suffix, the constant is unsigned. If a l or L suffix is applied, the constant is long. If both suffixes appear, the constant is an unsigned long. Some examples of how suffixes can affect the default typing of the constants given above are:

Suffixes
1701u	unsigned int
42000l	long
0x7Ful	unsigned long
0157152l	long
0x49310u	unsigned long

Floating point types

Floating point numbers contain scaled values that may have a fractional part. Digital Mars C++ supports three real floating-point types, float, double and long double, which differ in the number of decimal digits and in the magnitude of the numbers that they can represent. (Imaginary and complex numbers are detailed in Complex Numbers.)

The float and double types implement the single-and double-precision floating-point formats defined by the Institute of Electrical and Electronic Engineers (IEEE) standard 754-1985. A float is a 32-bit value, a double is 64-bits, and a long double is 80 bits. These bits in a floating- point value are divided into three components: a sign bit, an exponent, and a significand. Figure 28-1 shows the internal format of the types. The s indicates the sign bit.

Figure 28-1 Floating point data type formats

float:
s  exponent significand
31 30----23 22--------0

double:
s  exponent significand
63 62----52 51--------0

long double: (Win32 only)
s  exponent 1  significand
79 78----64 63 62--------0

The highest-order bit in a floating-point value is the sign bit. If the sign bit is 1, the value is negative; if the sign bit is zero, the value is positive.

In a float, the exponent occupies 8 bits and the significand uses the remaining 23 bits. A double has a 52-bit significand and an 11-bit exponent. In addition, the significand of float and double values has an implicit high-order bit of 1.

The significand holds a binary fraction greater than or equal to 1 (because the implied high bit is 1) and less than 2. The number of bits in the significand affects the accuracy of the floating-point value. A float has 6 decimal digits of accuracy, and a double (with its longer significand) is accurate to 15 decimal digits. Since the significand is a binary fraction, it cannot always exactly reflect a decimal value stored in it. For example, there is no binary fraction that can exactly represent the values 0.6 or 1/ 3. Floating point numbers represent an approximation of a decimal value, from which rounding errors come. The section Using Mathematics covers precision, significant digits, and rounding in detail.

The exponent is a binary number that represents the number of binary digits the significand is shifted left for a positive actual exponent or right for a negative actual exponent. The exponent is a biased value; you calculate the actual exponent value by subtracting a bias value from the exponent stored in the value. The bias for a float is 127; the bias for a double is 1023. Thus, a float value with an exponent of 150 would represent a number with an exponent of 23.

The following table shows the approximate minimum and maximum values that can be represented by the double and float data types.

Min Max Values
Data type	Lowest value	Highest value
float	1.175494*10-38	3.402823*10+38
double	2.225074*10-308	1.797693*10+308

Literal floating-point constants

A floating-point literal is a decimal value containing two parts: the significand and the exponent. The significand contains a series of decimal digits representing a signed whole number, a decimal point, and an optional fractional part of the value. The optional exponent consists of the letter e or E followed by an integer constant that defines the decimal magnitude of the number. Examples of floating-point constants are:

Floating Point Constants
0.0	zero
3.141592656	pi
1000.0	one thousand
1000.	one thousand
1.E+3	one thousand
5.e-3	five one-thousandths
0.005	five one-thousandths

The value of the exponent indicates the number of digits that the decimal point should be shifted. A positive exponent shifts the decimal point to the right; a negative exponent shifts it to the left. This is the same as the scientific notation we learned in high school; the exponent is the power of 10 applied to the significand.

Unless otherwise specified, C assumes that all floating-point literals are doubles. A suffix of f or F makes a constant a float. For example, 0.0F is a float-type zero.

You can define literal floating-point constants using hexadecimal notation. A hexadecimal floating-point constant consists of an optional sign, 0x prefix, a hexadecimal significand, a p to indicate the beginning of the exponent, a required exponent value, and an optional type suffix. Examples of hexadecimal constants are:

0x1.FFFFFEp127f
0x1p-23
-0x1.2ACp+10

Hexadecimal floating-point constants are literal bitmaps of a floating-point value. In a hexadecimal constant, the exponent value is the power of 2 to which the significand is raised.

Unusual floating-point values

Floating point numbers can have some unusual values. It is possible for a floating-point number to represent positive and negative infinity, for example. Or, a floating-point value can be in a special format that does not represent a valid number. Understanding the floating-point values can help you create better programs.

Infinities

The macro INFINITY, defined in fltpnt.h, expands to a constant interpreted as positive infinity. A floating-point value is infinity when the bits in the exponent are all 1 and the bits in the significand are all zero. A minus sign can be used to make a negative infinity: -INFINITY.

The result of a division operation, such as 1.0/ 0.0, is INFINITY. An FE_OVERFLOW exception is raised whenever the result of a calculation is INFINITY.

Zero

Digital Mars C++ supports +0.0 and -0.0. A floating-point number is zero when both its significand and its exponent are zero. The sign bit, however, can indicate either a positive or negative zero. +0.0 and -0. 0 are equal. When performing multiplication or division, the sign of a zero value affects the sign of the result.

NaNs

When is a number not a number? When its exponent is all ones and its significand contains any set of bits that is not all zeros, which would indicate an infinity. A value in this format is known as a NaN (Not a Number). The sign bit for a NaN is irrelevant.

Quiet NaN

When the high-order bit of a NaN's significand is 1, it is called a quiet NaN. Quiet NaNs can be used in calculations and as arguments in calls to library functions. Any mathematical operation involving a quiet NaN results in a quiet NaN. Any floating-point library function that is called with a quiet NaN for a parameter returns a quiet NaN. Also, some floating-point library functions return quiet NaNs when an error has occurred. The macro NAN, which expands to the value of a quiet NaN, is defined in fltpnt.h.

Signaling NaN

If the high-order bit of a NaN's significand is zero, it is known as a signaling NaN. Whenever a signaling NaN is an operand in a calculation, the result of the calculation is a quiet NaN, and the FE_INVALID exception is raised. This is known as triggering a signaling NaN. Unless otherwise specified, the functions in the floating-point library trigger signaling NaNs.

Signaling NaNs are never generated by function calls or calculations. The only way they appear in your programs is if you explicitly use the NANS macro (defined in fltpnt.h) in your programs.

Subnormals

A subnormal floating-point value is 1 in which the exponent and implied significand bit are zero and the significand is not zero. These are numbers that are too small to be accurately represented by the floating-point format. In general, subnormal numbers do not show up very often; they don't have any particular use and are usually the result of calculations involving very tiny numbers. If a subnormal value is generated, the FE_INEXACT exception (see the section "The Floating Point Environment") is raised.

Subnormal values involved in a calculation reduce the accuracy of the result. You may want to use the fpclassify macro to check for subnormal values when working with very tiny numbers.

Using Mathematics

Computers have advantages and disadvantages when compared to pencil-and-paper calculations performed by a person. On the plus side, computers are much faster than humans when it comes to performing lengthy and complex calculations. On the negative side, computers are very literal minded. Numbers on a computer are not quite like numbers on paper or in textbooks; mathematical programmers must understand how a computer performs calculations and how to get the computer's numbers to act like human numbers.

Conversions

The next set of examples assumes that these declarations have been made:

signed char c;
unsigned char uc;
short i1, i2;
unsigned short ui;
long l;
float f;
double d;

Conversions take place implicitly in calculations or explicitly when a cast is performed. The term source type refers to the type of value being converted; the term destination type refers to the type of the conversion's result.

Integral conversions

Let's begin by examining conversions between integral types. When the source type is smaller then the destination type, the conversion is called a promotion. Promotions require that C++ fill in the bits of the destination value that do not exist in the source value. For example:

c = -23;
i1 = (int) c;

In this code fragment, an 8-bit signed value, c, has been assigned to a 16-bit signed value, i1. A cast tells the compiler that the conversion is legitimate. Whenever converting between different data types, it is good programming practice to use a cast to declare explicitly that a conversion is expected.

First, the value of c is copied directly into the lower 8 bits of i1. Then, the high (sign) bit of c is copied into each of i1's upper 8 bits. Because negative numbers are stored in twos complement form, this preserves both the value and the sign of c when it is copied into i1. The table below provides a schematic view of the process of promotion.

Table 28-2 Conversion of char to int
Step	Binary value of i1	Comment
1	00000000 11101001	copy c into lower 8 bits
2	11111111 11101001	extend sign in upper 8 bits

As long as the destination type is signed and larger than the source type, the destination value equals the source value. This also holds true if both the source and destination types are unsigned. However, when the destination type is unsigned and the source type is signed, the destination cannot correctly hold the source value. This assignment demonstrates the problem:

ui = (unsigned int) c;

The compiler assigns the same binary pattern to ui that it does to i1. However, since ui is unsigned, those bits are construed differently from how they were when they were assigned to i1. Note: the only difference between signed and unsigned types of the same size is the way in which they decipher their bits. A signed value takes a high-order bit of 1 to mean that it is holding a negative value in twos complement form. An unsigned type is a positive binary integer. For i1, the bits 11111111 11101001 are interpreted as -23, the original value of c. For ui, the same bits result in the value 65513.

When converting from a larger source type to a smaller destination type, the compiler simply assigns the low-order bits of the source value to the destination. Here are several examples.

ui = 64; /* lower 8 bits are 01000000 */
c = (char) ui;/* c equals 64 */
uc = (unsigned char) ui;/* uc equals 64 */

In this case, the lower 8 bits of ui are assigned to c and uc. Since the high-order bit is not set (that is, c is positive), ui and c represent the same number.

ui = 30113; /* lower 8 bits are 10100001 */
c = (char) ui; /* c equals -95 */
uc = (unsigned char) ui;/* uc equals 161 */

In this case, c is negative since the high-order bit of the value copied from ui is 1.

i1 = -12345; /* lower 8 bits are 11000111 */
c = (char) i1; /* c equals -57 */
uc = (unsigned char) i1;/* uc equals 199 */

Again, the high-order bit is 1, so c has a negative value. As mentioned in the section "Integral data types", char, signed char, and unsigned char are distinct types. The Digital Mars compiler generates warnings if you mix the different character types. Furthermore, even if short and int are the same size, they are not the same type. If your programs assume that short and int are always identical, problems will occur. In 32-bit compilations, int is defined as 32 bits and short as 16 bits. Different types are different types, and you should never assume otherwise.

Floating point conversions

Assigning a double to a float type rounds the value of the double to the number of significant digits available in a float. If the value is larger than the maximum float value, an error occurs. For more information, see "Error trapping" later in this chapter.

Converting from a float to a double, the value is almost unchanged. Slight differences in value occur because of the differing number of significand bits in the types. The following program converts between float and double values, illustrating how the conversions alter the values being exchanged. For now, you can ignore the specifics of floating-point I/O, as well as the FLT_DIG and DBL_DIG macros. These subjects are covered in later sections. What is important is to see the results of converting between float and double values.

#include <stdio.h>
#include <float.h>
#define pi 3.14159265358979
int main()
{
  float f;
  double d;

  d = 0.0;
  printf("d <= 0.0,\td => %.* g\n", DBL_DIG, d);

  f = 3.606F;
  printf("f <= 3.606,\tf => %.* g\n", FLT_DIG, f);

  d = (double) f;
  printf("d <= (double) f,\td => %.* g\n", DBL_DIG, d);

  f = 2.0F / 3.0F;
  printf("f <= 2.0/ 3.0,\tf =>%.* g\n", FLT_DIG, f);

  d = (double) f;
  printf("d <= (double) f,\td => %.* g\n", DBL_DIG, d);

  d = d * 2.0;
  printf("d <= d * 2.0,\td =>%.* g\n", DBL_DIG, d);

  d = 2.0 / 3.0;
  printf("d <= 2.0/ 3.0,\td =>%.* g\n", DBL_DIG, d);

  d = d * 2.0;
  printf("d <= d * 2.0,\td =>%.* g\n", DBL_DIG, d);

  d = pi;
  printf("d <= pi,\td => %.* g\n", DBL_DIG, d);

  f = (float) d;
  printf("f<=(float) d,\tf => %.* g\n", FLT_DIG, f);

  d = (double) f;
  printf("d<=(double) f,\td => %.* g\n", DBL_DIG, d);

  return 0;
}

In a perfect world, where numbers would be stored as exact decimal digits, the output of the above program would look like this:

d <= 0.0, d => 0
f <= 3.606, f => 3.606
d <= (double) f, d => 3.606
f <= 2.0/ 3.0,f => 0.666667
d <= (double) f, d => 0.66666666666667
d <= d * 2.0,d => 1.33333333333333
d <= 2.0/ 3.0,d => 0.66666666666667
d <= d * 2.0,d => 1.33333333333333
d <= pi, d => 3.14159265358979
f <= (float) d, f => 3.14159
d <= (double) f, d => 3.14159

However, the actual output looks like this:

d <= 0.0, d => 0
f <= 3.606, f => 3.606
d <= (double) f, d => 3.60599994659424
f <= 2.0/ 3.0,f => 0.666667
d <= (double) f, d => 0.666666686534882
d <= d * 2.0,d => 1.33333337306976
d <= 2.0/ 3.0,d => 0.666666666666667
d <= d * 2.0,d => 1.33333333333333
d <= pi, d => 3.14159265358979
f <= (float) d, f => 3.14159
d <= (double) f, d => 3.14159274101257

The difference between the ideal output and the real output is due to the binary nature of floating-point numbers. The significand is only an approximation of a decimal fraction; the number of decimal digits that can be accurately approximated by a floating-point data types is determined by the number of binary digits in the significand.

A float has 23 bits in its significand, and a double has 52 bits. If you hand-calculate the largest possible value that can be stored in a float, you'll find that it can actually represent as many as nine decimal digits. However, since this is a binary fraction and not a decimal fraction, those nine digits may not be correct. Usually, the last two decimal digits represented by a floating-point type ensure that the other decimal digits are accurate. So, a float accurately represents only six decimal digits of a value, while a double represents up to 15 digits.

Note: The macros FLT_DIG and DBL_DIG, defined in float.h as 6 and 15 (and fully discussed in the section "Floating point characteristic constants" later in this chapter), represent the number of significant decimal digits for the float and double types. Use these macros with the %.* g format specifier when displaying floating-point values to their full precision with the printf family of functions.

You may be surprised when displaying all 15 digits of a double that have been assigned a float value. When a floats significand is copied into the significand of a double, the extra digits are filled with zeros. The "extra" precision bits of the float, which are not part of its six decimal digit accuracy, are part of the doubles 15- digit decimal accuracy. If you are interested in accurate math, do not mix floats and doubles in calculations. Once you have assigned a float value to a double, you can only assume that the double has six digits of decimal accuracy as opposed to 15.

Conversions between integer and floating-point

Mixing integer and floating-point types can also lead to the loss of significant digits. The fractional portion of a floating-point value is dropped when the value is assigned to an integral value. If the non-fractional part of the floating-point value is larger than the maximum value allowed for the integral type, a floating-point math error occurs. See the section "Error trapping" later in this chapter for more information.

When assigning an integral value to a floating-point value, the least significant digits may be lost if there are more digits in the integer than can be represented by the floating-point type. For example:

f = (float)2123456789UL;
printf("f <= (float)2123456789UL,\tf => %.*g\n",
    FLT_DIG, f);

The output of the printf statement will be:

f <= (float)2123456789UL, f => 2.12346E+009

A float has only six valid digits, whereas the original unsigned long value had 10 digits. So, the unsigned long value is rounded to the six digits when it is converted to a float.

Implicit conversions

An implicit conversion takes place when a calculation contains values of mixed types. C compilers break calculations into segments based on binary operators. Each binary operator has (obviously) two arguments. The argument with the smaller type is converted to an argument of the larger type. A type is smaller than another type if it contains fewer bits. Also, if both arguments are the same size, but one is signed and the other is unsigned, the signed value is converted to unsigned. In addition, all char and short arguments are automatically converted to int. Table 28-3 shows the order, from smallest to largest, of conversion.

Table 28-3 Conversion order
Order	Types
0	all char and short types
1	int
2	unsigned int
3	long
4	unsigned long
5	float
6	double

The type of the result obtained from a binary operation is the same as the type used for the calculation. For example:

i1 + c /* c converted to int; result is int */
l / f /* l converted to float; result is float*/

Based on Table 28-3, i1 (an int) has a higher order than c (a char) so the compiler converts c to an int before performing the addition. The result of the addition is an int. Using the same rule, l is converted to a float before it is divided by f. The result is a float.

Watch for implicit conversions! They can turn a simple math statement into a mysterious program bug. For example, assuming the following assignments:

i1 = 20000;
i2 = 30000;

these two statements do not assign the same value to l:

l = i1 * i2 * 2.0L; /* formula 1 */
l = 2.0L * i1 * i2; /* formula 2 */

Operations of the same precedence take place from left to right. In the first formula, i1 is multiplied by i2; these are both ints, making the result an int. The result of the multiplication is 600000000; an int cannot hold a value this large, so an overflow occurs, giving the result an actual value of 17920. That result is then converted to a long so that it can be multiplied by 2.0L. The end result stored in l is 35840, an answer that is far from what is expected.

The second formula rearranges the operations. The first calculation multiplies 2.0L by i1; i1 is converted to a long, and a resulting value (with type long) of 40000 is generated. This is multiplied by i2, which is converted to a long. The end result is 1200000000, the expected value.

The first formula can return the same value as the second formula, if you use casts or parentheses:

l = i1 * (i2 * 2.0L);
l = (long) i1 * (long) i2 * 2.0L;

Many a program has died because the programmer did not watch for implicit conversions. A careful programmer always casts the operands of a formula if they are different from the type of the result. A small amount of extra typing can save hours of debugging.

One final caution: comparisons are a form of calculation. When two values with different types are compared, the smaller type is converted into the larger type. Watch for this, particularly when comparing signed and unsigned values or when the comparands are different sized types.

Significant digits

Scientists and engineers often refer to the concept of "significant digits" when performing calculations. The accuracy of a result is only as good as the accuracy of the operands used in the calculation.

For example, if you ask Digital Mars C++ to perform these calculations:

float a = 1.5;
float b = 2.01;
float c, d;
c = a * b;

the compiler dutifully assigns to c the result of multiplying 1.5 by 2.01, which is 3.015. However, the accuracy of that result must be compared against the accuracy of the numbers used to calculate it. In this case, while b has four decimal digits, we know the accuracy of a only to two decimal places. Unless we know that a is exactly equal to 1.5, we cannot assume that c is exactly equal to 3. 015. It is a case of being sure of our results. If we are not certain of the value of a beyond the second digit, then we cannot be certain beyond the second decimal place about the value of any result that is calculated using a.

It is common in science to know a value to a specific number of decimal places without certainty that the value is exact. As long as there is doubt about the absolute accuracy of a value, all calculations involving that number are limited to the number of digits that we know are right. In the example above, the correct value for c is 3.0, since the least accurate of operands used in its calculation, a, has only two digits of accuracy.

Of course, the compiler knows nothing about this. It goes about its business of assigning 3.015 to c. What happens later in the program when another calculation involving c is made?

d = c * 250.; // assume 250. is exact

The compiler assigns d the value 3.015 * 250, or 753.75. The error continues to mount! The correct result should be 750, since c is known reliably only to two digits of accuracy, making it 3.0. The problem only grows worse as calculations continue.

The solution is to use the rounding features of Digital Mars C++ to fix the number of significant digits in a value. Here's an example of a function named setsigdig, which rounds a number to a specific number of significant digits.

#include <float.h>
#include <fenv.h>
#include <fltpnt.h>
#include <math.h>

double setsigdig(double x, unsigned int n)
{
  double shift, result;
  int oldmode;

  if ((n == 0U) || (n > DBL_DIG))
      result = x;
  else
  {
    // save the current rounding mode
    oldmode = fegetround();

    // set rounding to truncate
    // the fractional part
    fesetround(FE_TOWARDZERO);

    // adjust the number of significant
    // digits
    --n;

    // calculate the number of digits
    // to be shifted
    shift = pow(10.0, double n) * rint(log10(fabs(x)));

    // restore the original rounding mode
    fesetround(oldmode);

    // scale x up, round it,
    // and scale it back
    result = rint(x * shift) / shift;
  }
  return result;
}

The setsigdig function works by scaling a double value to an integer with the requested number of significant digits, rounding that value, and then scaling the number back to its original magnitude. The fegetround, fesetround, and rint functions are documented in the section "Rounding functions" later in this chapter.

Floating point comparisons

As discussed in previous sections, floating-point values can contain unusual values such as infinity and NaN. This complicates the comparison of floating-point data types; for example, how does a NaN compare to zero?

Table 28-4 describes the floating-point comparison operators, showing the conditions that generate a true (nonzero) or false (zero) result for a comparison. The four relations columns show the results of a comparison for each of four possible relationships between two operands: greater than (>), less than (<), equals (=), and unordered (?).

A comparison is unordered when one (or both) of its operands is a NaN.

Cross-reference the relationship of the operands with the operator; if the table is marked T, the relationship is true (nonzero). If the table is marked F, the relationship is false (a value of zero). Also, if the comparison operator has "yes" listed in the invalid column, the FE_INVALID exception is raised (see "Exceptions") for unordered operands.

Table 28-4 Floating point comparison operators
Operator	Relations	Invalid?	Description
	> < = ?
<	F T F F	yes	less than
>	T F F F	yes	greater than
<=	F T T F	yes	less than or equal to
>=	T F T F	yes	greater than or equal to
==	F F T F	no	equal to
!=	T T F T	no	unordered, less than, or greater than
!<>=	F F F T	no	unordered
<>	T T F F	yes	less than or greater than
<>=	T T T F	yes	less than, equal to, or greater than
!<=	T F F T	no	unordered or greater than
!<	T F T T	no	unordered, greater than, or equal to
!>=	F T F T	no	unordered or less than
!>	F T T T	no	unordered, less than, or equal to
!<>	F F T T	no	unordered or equal to

Here are some example comparisons:

Example Comparisons
NAN != 0.0	True
NAN == 0.0	False
NAN == NAN	False
NAN != NAN	True
NAN !<>= NAN	True
1.0 !< 2.0	True
2.0 <> 2.0	False
2.0 != 2.0	False

The operators that contain an exclamation point, also called a "bang," are true when the relationship is unordered. All of the floating-point operators can be overloaded in C++ programs.

Be sure to understand the result of negating comparisons. The statement !(a op b) reverses the true and false results given for op in Table 28-4. For example, !(a < b) is not the same thing as (a >= b).

Avoid using an equality condition to end a calculation loop, such as:

float x, float y = 3.5;
do {
    // calculate x
} while (x != y);

The calculation may vary the last few bits of x slightly, so that it never quite exactly matches y. That, of course, generates an infinite loop. Comparing the difference between two floating-point values is a better solution:

float x, float y = 3.5;
do {
    // calculate x
} while (fabs(x -y) > FLT_EPSILON);

FLT_EPSILON is a constant defined in float.h; it expands to the smallest possible difference between two floating-point values. The loop may be slightly slower, but minor bit fluctuations in x are less likely to make the loop run forever.

Arithmetic considerations

Calculations that involve infinities, NaNs, and signed zeros can be disconcerting for the uninitiated. Understanding how these numbers work mathematically is important to knowing how to write numerical software with Digital Mars C++.

Infinity math

The relationship -INFINITY < n < +INFINITY holds true for every n that is not a NaN. INFINITY often represents an arbitrarily large value.

INFINITY is generated by the programmer, when division by zero occurs, or when there is an overflow (see the "Exceptions" section later in this chapter). An FE_INVALID exception is raised when INFINITY is used in the following calculations:

+-INFINITY / +-INFINITY
+-0.0 * +-INFINITY
adding or subtracting +-INFINITY

When given INFINITY as an argument, some floating-point library functions generate an exception. See the section "The Floating Point Library" later in this chapter for the reaction of specific functions to INFINITY.

NaN math

What are NaNs good for, anyway? It's a good question. In the real world, there's no such thing as a number that isn't a number. Why would anyone use such a value in a computer program?

For one thing, NaNs breed new NaNs. Any binary operation involving a NaN generates a quiet NaN. With a few minor exceptions, all the floating-point library functions listed later return a quiet NaN if one or more of their arguments is a NaN. Once a quiet NaN is introduced into a series of calculations, it propagates itself. This can be useful when working with variables that may not have defined values. For example, this program has a subtle bug:

float a, b, c;

// lots of program code

b = a * 2.0f;   // b equals almost anything
c = 4.5 / b;    // and so c could equal anything

a was never assigned a value, and it may contain any random value. The problem is, the calculations of b and c proceed without any warning that a was never initialized. In fact, a could hold a semivalid random value, and b and c would subsequently be calculated as having numbers that may not look wrong but are! There's no way to know from looking at b or c that an invalid value was introduced into their calculation.

Never leave anything to chance. Always initialize variables immediately after they are declared. If you do not have a valid value yet, initialize a variable to NAN. Subsequent calculations result in the propagation of the NaN; if your final result turns out to be a NaN, you know that an important piece of data was not given a value.

float a = NAN, b = NAN, c = NAN;

// imagine lots of program code

b = a * 2.0f; // b is a NaN because a is a NaN
c = 4.5 / b;  // c is a NaN because b is a NaN

The common practice is to initialize undefined variables with a quiet NaN. Then, as the program proceeds, the quiet NaNs propagate themselves, showing where your program is lacking information. This is particularly useful when performing calculations involving matrices that contain unknown values. The NaNs propagate across calculations, and you can tell from the NaNs in the results how much vital information is missing. Or, you can use signaling NaNs instead of quiet ones, so that the first calculation involving an unknown quantity results in an exception.

The sign of a NaN is unimportant, although the state of sign bit can be determined using the signbit macro.

The sign bit

The value of a results sign follows the standard rules of arithmetic. Multiplying or dividing two values with the same sign generates a positive result; if the signs differ, the result is negative.

The sum of two equal operands with opposite signs (or the difference of two equal operands with like signs) results in a +0.0, unless the rounding mode is set to FE_DOWNWARD, in which case the result is -0.0. Positive and negative zero are considered equal for comparisons.

Random numbers

Some applications require the generation of pseudo-random numbers. While Digital Mars C++ provides functions that generate pseudo-random integers, it doesn't provide a similar function for floating-point values. You can easily remedy this problem through the use of a simple function, such as this:

double randomd()
{
  return (double) rand() / INT_MAX;
}

Arrays

Arrays in C and C++ are zero-based. This array

float fa[10];

has 10 elements, numbered from 0 to 9. A simple technique allows the indexing of the array's elements from 1 to 10 by exploiting the close relationship between pointers and arrays. Create a pointer to the array, set it to point to the elements just before the beginning of the array, and then index through the array via the pointer:

float fa[10], *fap;
fap = fa;
--fap;

Use caution to avoid indexing outside an array's defined range. C and C++ do not perform range checks; if you change the 23rd element of a 10 element array, you will overwrite portions of memory outside the array's bounds.

Floating Point Environment

The floating-point environment defines how mathematics operates in a program. A programmer has a choice between using a numeric coprocessor or a software floating-point emulator. A programmer can control the method by which numbers are rounded, and the precision to which those numbers are calculated. Finally, errors (called exceptions) may occur during floating-point calculations. A well-written program will be aware of the environment under which it is running.

Coprocessors and emulators

Nearly every small computer has a socket on its motherboard into which a math coprocessor can be inserted. The coprocessor understands floating-point numbers and, depending on the version, can perform logarithmic, trigonometric, and arithmetic operations on them.

Digital Mars C++ can generate programs that work only with the coprocessor. These are the fastest floating-point programs because they use the hardware most efficiently. However, this added speed comes at a cost: programs compiled specifically for the coprocessor do not run on machines that lack coprocessors.

If you do not have a coprocessor or do not want to rely upon one, Digital Mars C++ can generate a program that uses a software emulator to perform floating-point calculations. The emulator is many times slower than a coprocessor, but your programs can run on computers that do not have coprocessors. The emulator will use the coprocessor if one is available, but it will not use the coprocessor as efficiently as would a program compiled specifically for a coprocessor.

The difference in speed between a coprocessor-based program and an emulator-based program can be significant. For example, a program compiled with Digital Mars C++ ran for 43 seconds when using a coprocessor; run with the emulator, the same program took nearly 15 minutes to execute! Coprocessor-only programs are also slightly smaller than emulator-based programs. For any intensive mathematical work, a coprocessor is a requirement. If you are serious about floating-point performance, a coprocessor is well worth its purchase price.

Rounding

Rounding occurs in three circumstances:

When converting from a double to a float
When converting a floating-point value to an integer value
When the result of +, -, *, and / is inexact

Four different methods of rounding a number are available. The term fraction refers to the portion of the source value that is rounded off when converting to a destination value.

To nearest (FE_TONEAREST): The value is rounded to the nearest value. If the value is exactly between two values, the even value of the two is chosen.
Upward (FE_UPWARD): The result is rounded to the value closest to positive infinity.
Downward (FE_DOWNWARD): The result is rounded to the value closest to negative infinity.
Toward zero (FE_TOWARDZERO): The fraction is simply dropped, rounding the value closer to zero. This is also known as chop rounding.

The identifiers listed in parentheses above are the macro constants (defined in fenv.h) identifying the individual rounding modes.

When a program begins, it assumes that you want to round to the nearest value. A program can change the rounding mode at any time by calling the fesetround function, using one of the macro constants to select a rounding mode. For example, the statement

fesetround(FE_TOWARDZERO);

changes the rounding toward zero.

When changing the rounding mode, the previous rounding mode is lost. The fegetround function retrieves the current rounding mode as an int value. A well-written function saves the current rounding mode before changing it; when the function is done, it restores the original rounding mode. For example:

double math_function(double x)
{
  double result;
  int save_rounding;

  save_rounding = fegetround(); // save rounding mode
  fesetround(FE_UPWARD); // set rounding mode

  // calculate result

  fesetround(save_rounding); // restore rounding mode
  return result;
}

You can estimate the amount of rounding errors in a calculation by performing the calculation twice with two rounding modes. First, set the rounding mode to FE_UPWARD, perform the calculation, and save the result. Then, set the rounding mode to FE_DOWNWARD and perform the calculation again. The difference between the first and second calculations is the rounding error.

The complete specifications of the rounding macros and functions are provided in the section "Rounding functions" later in this chapter.

Changing the precision of calculations

When using the software floating-point emulator, calculations that result in a float value are performed using floats, and calculations resulting in a double value are performed using doubles. Since a double has more decimal digits of precision, calculations on doubles are more accurate than calculations performed on floats. Calculations on floats, however, are generally faster than those using doubles.

If a program is compiled for use with the coprocessor, all float and double calculations are carried out to a higher level of precision. The coprocessor uses 80-bit temporary floating-point values for its computations. This provides a very high level of accuracy without affecting the speed of calculations. Because the coprocessor uses more bits for its computations (increasing its accuracy), an emulator-based program may produce slightly different results when running on a machine equipped with a coprocessor.

A program compiled for the coprocessor can change the number of bits used by the coprocessor for calculations. At program startup, the coprocessor is set to use a full 80 bits of precision. The fesetprec function can tell the coprocessor to use fewer bits. While reducing the number of bits does not make the coprocessor perform more quickly, using fewer bits generates results that match those produced by the emulator. These are examples of calls to fesetprec:

fesetprec(FE_FLTPREC); // use float (32-bit) precision
fesetprec(FE_DBLPREC); // use double (64-bit) precision
fesetprec(FE_LDBLPREC); // use long double (80-bit) precision

The fegetprec function returns the current precision mode; use it to save the rounding mode when a function changes it, just as the fegetround function you use to save the rounding mode. The precision functions and macro constants are fully documented in the section "The Floating Point Library" later in this chapter.

Exceptions

It is easy for something to go wrong in a calculation. A number may be divided by zero, or a NaN may be used as an argument to a function. When an error condition is found, the appropriate exception is raised.

Digital Mars C++ supports five exceptions that indicate the problems that occur when floating-point calculations take place. These exceptions are:

Loss of precision (FE_INEXACT): When a calculation produces a result that is rounded or when an underflow or overflow occurs, the inexact exception is raised.
Divide by zero (FE_DIVBYZERO): If the divisor is zero, and the dividend is a finite nonzero value, the division by zero exception is raised. The result of a division by zero is an appropriately signed infinity.
Underflow (FE_UNDERFLOW): When the result of a floating-point calculation is too small to be represented by a floating- point type, the underflow exception is raised. This problem can occur when performing calculation with subnormal values, performing exponential calculations, multiplying very small numbers, or dividing small numbers by large numbers.
Overflow (FE_OVERFLOW): If the result of a floating-point calculation is too large to be stored in the floating-point type, the overflow exception is raised. This can occur when performing exponential operations, multiplying large numbers, dividing by very small numbers, and converting floating-point values, to a smaller format, such as converting from double to float or when converting from a floating-point value to an integral type.
Invalid operation (FE_INVALID): When an operand is invalid for the operation being performed, an invalid exception is raised. When a signaling NaN is an operand or argument, the arithmetic operations (+, -, *, /) and many of the floating-point library functions raise the invalid exception. Also when infinities are added, subtracted, or multiplied the invalid exception is raised. Division operations such as 0/ 0 and INFINITY / INFINITY raise the invalid exception, as do a conversion from floating-point to the integral of a value that cannot be accurately represented by the destination type.

The macros that represent these exceptions are listed in parentheses above.

Exceptions are raised via the feraiseexcept function. A well-written program uses these exceptions to indicate error conditions. For example, most floating-point functions should follow this general framework when handling NaN parameters:

double float_func(double x)
{
  double result;

  if (isnan(x))
  {
    if (fpclassify(x) == FP_NANS)
      feraiseexcept(FE_INVALID);
    return x;
  }

  // calculate result

  return result;
}

If a function parameter exceeds some limit, an FE_INVALID exception should be raised, and NAN should be the return value.

The fetestexcept function checks to see if one or more exceptions have been raised:

// is the overflow exception raised?
if (fetestexcept(FE_OVERFLOW))
{
  // print error message or
  // otherwise handle exception
}

Exceptions are "sticky"; once an exception is raised, it remains raised until the program explicitly clears the exception with the feclearexcept function. A calculation can proceed until its conclusion, at which point the program can check to see which exceptions, if any, have been raised. However, the program must also clear the exceptions before any calculations, so that leftover exceptions from previous calculations do not obfuscate newly raised exceptions. To clear all the exceptions, this function call can be used:

feclearexcept(FE_ALL_EXCEPT);

The arguments for feclearexcept, fetestexcept, and feraiseexcept can be one of the macro constants defining exceptions. You can combine several exception macros using the bitwise OR operator, so that more than one exception can be tested/ cleared/raised in one function call:

feraiseexcept(FE_OVERFLOW | FE_UNDERFLOW);

FE_ALL_EXCEPT is a macro that evaluates to a bitwise OR of all the exception macro's constants.

Function prototypes and macro definitions for handling exceptions can be found in the file fenv.h.

errno

The header file errno.h defines a variable and a set of macro constants that can be used for tracking some floating-point errors. The global variable errno is defined as an int; its value is set to various values when errors occur. The errno variable tracks more than just floating-point errors. For the purpose of floating-point, only two possible values have a meaning. The values are defined as macro constants:

EDOM domain error
ERANGE range error

A domain error occurs when a parameter passed to a floating-point function is outside the range of that function's calculation range. For example, the sqrt function sets errno to EDOM if it is asked to calculate the square root of a negative value. Range errors occur upon underflow and overflow.

In general, do not use errno for tracking floating-point errors. The facilities provided by exceptions provide more granularity in determining the type of error and are more flexible than errno.

Error trapping

A program may need to respond to floating-point errors when they occur without waiting until a calculation is finished to see which exceptions have been raised. The signal function is provided for that purpose. signal declares that a certain function is to be called whenever a certain class of errors occurs. Floating point math errors are a class of errors, and signal can be used to define a function to be called whenever an exception is raised. For example:

#include <string.h>
#include <stdio.h>
#include <fenv.h>
#include <setjmp.h>

// prototype
void _far _cdecl FloatError(int sig);

// global variable to hold
// setjmp/ longjmp information
volatile jmp_buf ErrJump;

// main function
int main()
{
  // assign floating-point exception handler
  signal(SIGFPE, FloatError);

  // set-up return address for errors
  if (setjmp(ErrJump))
    return 1;

  //*********************************
  // do something with floating-point
  //*********************************

  // outta here
  return 0;
}


// floating-point error handler
void _far _cdecl FloatError(int sig)
{
  char temp[256];

  strcpy(temp," Floating point error: ");

  if (fetestexcept(FE_INEXACT))
    strcat(temp,"[Inexact] ");

  if (fetestexcept(FE_DIVBYZERO))
    strcat(temp,"[Divide by 0] ");

  if (fetestexcept(FE_UNDERFLOW))
    strcat(temp,"[Underflow] ");

  if (fetestexcept(FE_OVERFLOW))
    strcat(temp,"[Overflow] ");

  if (fetestexcept(FE_INVALID))
    strcat(temp,"[Invalid] ");

  printf("\a%s\n", temp);
  longjmp(ErrJump, 1);
}

The detailed workings of the setjmp, longjmp, and signal functions are outside the scope of this manual. Suffice it to say that the call to signal tells the program to call FloatError whenever a floating-point error occurs. The setjmp function marks a program location to which FloatError returns (via longjmp) after processing the error.

Saving/Restoring the floating-point environment

Some functions make so many changes to the floating-point environment that the state of the environment needs to be preserved while the function performs its task. The header file fenv.h defines a data type, fenv_t, that represents the state of the floating-point environment. Two pairs of functions save and restore the environment in fenv_t objects.

To simply preserve the floating-point environment, the fesetenv and fegetenv functions can handle the task. fegetenv stores the current state of the floating-point in a fenv_t object; fesetenv sets the state of the floating-point environment to the value of a fenv_t object. So, a function that wishes to preserve the floating-point environment can perform these function calls:

double num_func(double x)
{
  double result;
  fenv_t saved_env;

  // save the floating-point environment
  fegetenv(&saved_env);

  [...] // change the environment and calculate a result

  // restore the original environment
  fesetenv(&saved_env);

  return result;
}

The feprocentry and feprocexit functions do more than preserve the floating-point environment; they also reset the environment to the startup conditions: round to nearest, FE_LDBLPREC precision, and no exceptions set. To use the default settings, thus avoiding any unpredictable changes made since the program started, use feprocentry and feprocexit instead of fegetenv and fesetenv:

double num_func(double x)
{
  double result;
  fenv_t saved_env;

  // save floating-point environment
  // & restore default
  feprocentry(&saved_env);

  [...]  // calculate a result

  // restore the program's previous
  // environment
  feprocexit(&saved_env);

    return result;
}

The Floating Point Library

Digital Mars C++ provides a powerful library of functions, macros, and definitions for use with floating-point numbers. This section covers the library in detail and is provided as a reference for mathematical programmers.

The function descriptions often have an associated Special Results table that describes how the function handles certain kinds of arguments. The column headings mean:

value of x: The value of the x parameter for this condition to exist.
value of y: The value of the y parameter for this condition to exist.
value of *ptr: The value that is stored through the pointer parameter ptr for this x (and possibly y) argument value.
return value: The value the function returns for these x and y values.
invalid?: Will the FE_INVALID exception be raised?
div by 0?: Will the FE_DIVBYZERO exception be raised?

All of these functions return a NaN if one of their arguments is a NaN. If the argument is a signaling NaN, the function also raises the FE_INVALID exception.

Header files

ANSI C defines two header files for working with floating-point math, math.h and float.h. NCEG has added new features to the two ANSI header files and also has defined two new header files named fltpnt.h and fenv.h. These four files contain function prototypes, macro definitions, type declarations, and constant definitions for use with floating-point math.

Before using any of the functions, macros, data types, or constants defined in these headers, #include the appropriate header file into your program.

Predefined macros

The NCEG document describes two predefined macros that can tell if a compiler supports the NCEG extensions to C. These macros are:

__FPCE__: If this macro is defined, it indicates a compiler's compliance with the NCEG document. This macro is defined by the compiler as a constant value of 1.
__FPCE_IEEE__: If this macro is defined, it indicates a compiler's compliance with the IEEE portion of the NCEG document. This macro is defined by the compiler as a constant value of 1.

General floating-point types

The efficiency of the standard float and double types depends upon the underlying computer architecture. For some systems, a double may actually be more efficient than a float. Knowing which data types are the most efficient requires detailed knowledge of the floating-point architecture of every machine upon which an application executes.

To ease the development of efficient and portable applications, NCEG created two compiler-defined floating-point types, float_t and double_t. These types are defined in fltpnt.h as representing the most efficient data types for single-and double-precision computations in the host environment. In the case of Digital Mars C++, float_t is defined as float, and double_t is defined as double.

Floating point characteristic constants

Programmers often need to know the characteristics of the floating-point data types supported by a compiler. The ANSI-defined file float.h declares a set of constants for float and double data types. These constants are:

Macro Name	Definition
FLT_DIG	The number of decimal digits that can be represented accurately by a float.
FLT_MANT_DIG	The number of base_{FLT_RADIX} digits in the significand of a float. In the case of binary floating-point implementations (like on Intel x86 processors), this number represents the number of binary digits in the significand.
FLT_MAX_10_EXP	Maximum positive integer n such that 10ⁿ within the range of normalized floating-point numbers
FLT_MAX_EXP	Maximum positive integer n such that FLT_RADIX^n-1 is representable
FLT_MIN_10_EXP	Minimum negative integer n such that 10ⁿ is within the range of normalized floating-point numbers
FLT_MIN_EXP	Minimum negative integer n such that FLT_RADIX^n-1 is a normalized floating-point number
FLT_RADIX	Radix of exponent representation
FLT_ROUNDS	Direction of rounding
FLT_MAX	Maximum representable positive floating-point number
FLT_MIN	Minimum representable normalized positive floating-point number
FLT_EPSILON	Minimum positive number x such that 1.0+x does not equal 1.0
DBL_xxx	double versions
LDBL_xxx	long double versions

Other compilers, even those that claim limited IEEE conformance, may not define the same values for these constants. For example, some compilers define FLT_DIG as 7, even though the IEEE single format used for float types actually has only six decimal digits of accuracy.

Environment types and constants

The floating-point environment is a combination of settings that control how calculations and errors are handled for floating-point calculations. It is controlled by the rounding mode, the precision mode, and the exception flags. The file fenv.h defines data types, constants, macros, and functions for controlling the floating-point environment.

The two data types defined in fenv.h are:

fenv_t: Can represent the state of the complete floating-point environment. The macro FE_DFL_ENV, defined in fenv.h, expands to a value with type fenv_t. FE_DFL_ENV represents the default (startup) floating-point environment.
fexcept_t: Represents an exception flag.

Three sets of macro constants are defined, providing symbolic constants for exception flags, rounding modes, and precision modes.

Exception macros

These macros represent exception flags; they expand to unique int values that are a power of 2.

FE_INEXACT
FE_DIVBYZERO
FE_UNDERFLOW
FE_OVERFLOW
FE_INVALID

This macro represents the bitwise-OR (|) of all five of the above-mentioned exception flags:

FE_ALL_EXCEPT

These macros are used with the feraiseexcept, fetestexcept, feclearexcept, fegetexcept, and fesetexcept functions.

Rounding macros

Four macros represent the available rounding modes:

FE_TONEAREST Round to the nearest value
FE_UPWARD Round toward positive infinity
FE_DOWNWARD Round toward negative infinity
FE_TOWARDZERO Round toward zero (drop fractional part)

Each macro expands to a unique, positive constant of type int. The fesetround and fegetround functions use these macros.

Precision macros

The following macros represent precision modes:

FE_FLTPREC Use float (32-bit) precision
FE_DBLPREC Use double (64-bit) precision
FE_LDBLPREC Use long double (80-bit) precision

Each of these macros expands to a unique constant of type int. The fesetprec and fegetprec functions use these macros. These macros have no effect unless a coprocessor is being used.

Classification macros

FP_NANS Signaling NaN
FP_NANQ Quiet NaN
FP_INFINITE Infinity
FP_NORMAL Any number not covered by other classifications
FP_SUBNORMAL Subnormal or denormal
FP_ZERO Zero

These macros expand to unique constant expressions of type int; they are used to classify floating-point values.

These macros can be compared to the result from the macro fpclassify, which is discussed later in this section.

NaNs and infinities

The following macros expand to constant expressions that represent special floating-point values:

NAN Expands to represent the value of a quiet NaN
NANS Expands to represent the value of signaling NaN
INFINITY Expands to represent the value of positive infinity

Unary + and -operators can set the sign of these values; for example, -INFINITY represents negative infinity.

Type determination macros

A set of macros defined in fltpnt.h can determine if a floating-point value belongs to a special class. These macros are:

fpclassify(x): Returns a macro constant that reflects the class of the floating-point expression x.
isfinite(x): Evaluates to a nonzero int expression if x is zero, subnormal, or normal. Otherwise, it evaluates to zero.
isnormal(x): Evaluates to a nonzero int expression if x is normal. Note that if x is zero, it is not normal. Otherwise, it evaluates to zero.
isnan(x): Evaluates to a nonzero int expression if x is a signaling or quiet NaN. Otherwise, it evaluates to zero.

Exception control functions

These functions test, raise, clear, and otherwise manipulate exceptions. In all cases, the excepts parameter is assumed to be a bitwise OR of the exception macro constants defined in the section "Exceptions" earlier in this chapter.

Rounding functions

The functions described in this section set or get the rounding mode. At the start of a program, the default rounding mode is FE_TONEAREST.

Precision functions

The functions described in this section get or set the precision mode. When the program begins, the precision mode is FE_LDBLPREC.

Note: The precision mode has meaning only when inline coprocessor instructions are used. The software floating-point emulator ignores the precision mode.

Trigonometric functions

These transcendental functions calculate trigonometric values.

Note: Parameter values representing angles larger than p/4 may result in long calculation times or FE_INVALID exceptions, depending upon the type of coprocessor installed in a given computer.

Conversion functions

Conversion Functions
Header	Prototype	Description	Returns
stdlib.h	double atof(const char *nptr);	Converts the characters in the string (pointed to by nptr) into a double. atof is implemented as if it were a call to strtod.	A double based on the text pointed to by nptr.
stdlib.h	float strtof(const char nptr, char * endptr);	Converts the character representation of a floating-point value to a float or double. The function assumes that the input string (pointed to by nptr) is in the same format as a literal floating-point constant. The first character encountered that cannot be a part of a literal floating-point constant terminates the translation process. The address of the terminating character is assigned to the pointer variable pointed to by endptr. +-NAN and +-INF are valid input values representing NaN and infinity.	A float value based on the string pointed to by nptr.
stdlib.h	double strtod(const char nptr, char * endptr);	Like strtof() but converts to a double.	A double value based on the string pointed to by nptr.

Sign functions

Examine sign (macro)

Function signbit

Header fltpnt.h

Purpose Examines the sign bit of a floating-point value, including
NaNs, infinities, and zero.

Returns Nonzero if the sign bit is 1; zero if the sign bit is 0.

Working with the Intel Numeric Coprocessor

Intel coprocessors have a pair of 16-bit values that are useful to serious mathematical programmers. The status word contains bits that examine the current condition of the coprocessor. Bit 7 is the error status bit; if it is set, an error has occurred. Bits 0 through 5 are the exception indicators:

Bit 0, invalid operation: If this bit is set, an error has occurred that is not covered by the other exception indicators. Trying to obtain the square root of a negative number, for example, sets this bit. This is equivalent to the FE_INVALID exception.
Bit 1, denormal exception: A denormal number has an exponent of zero and a signif-icand that is not zero. In general, denormalized numbers are caused by a calculation or formatting error.
Bit 2, zero divide: Dividing by zero generates this exception. This corresponds to the FE_DIVBYZERO exception.
Bit 3, overflow: This exception occurs when an operation that would result in a number too large to be represented by the coproces-sor. This corresponds to the FE_OVERFLOW exception.
Bit 4, underflow: This exception is generated by an operation that would re-sult in a number too small to be represented by the copro-cessor. This corresponds to the FE_UNDERFLOW exception.
Bit 5, precision lost: When a calculated result cannot be exactly represented by the coprocessor, this exception is generated. For example, dividing 1.0 by 3.0 generates a precision lost exception, since 1/3 cannot be exactly stored in a floating-point number. This corresponds to the FE_INEXACT exception.

The exception indicators are sticky bits; once they are set, they remain set until they are manually cleared. Most compilers provide special functions for clearing the status word.

The coprocessor's control word changes the behavior of the coprocessor. Its important bits are:

Bits 0 through 6, exception masks

These bits correspond to the exception indicators in the status register. When an exception occurs, the exception indicator in the status register is set. Then, the corresponding bit in the control word is checked. If the control bit is zero (unmasked), an interrupt is generated indicating that a problem has occurred. This is the interrupt captured for exception handlers installed with signal. If the control bit is 1 (masked), no interrupt is generated.

Bits 10 and 11, rounding control

The setting of these two bits affects how the coprocessor handles rounding. This field can be set to one of these values:

Rounding
00	round toward nearest or even
01	round toward negative infinity
10	round toward positive infinity
11	round toward zero (truncate)

The runtime library has functions and bit masks for examining and setting these words:

unsigned int _clear87(void)

Gets and clears the coprocessor's status word.

unsigned int _control87(unsigned new, unsigned mask)

Gets and sets the coprocessor's control word. If mask is zero, _control87 simply returns the current value of the control word. If mask is nonzero, the bits in the control word are changed using this formula:

if (mask != 0)
  cw = ((cw & ~ mask) | (new & mask));

where cw is the current value of the control word. Defini-tions of the bits in the control and status words are in float.h.

unsigned int _status87(void)

Gets the coprocessors status word. Bit masks defined in float.h simplify checking the status word's bits.

void _fpreset(void)

Resets the floating-point math package after an error or subprocess.

Calling the coprocessor-specific functions is not advised if the functions listed in the section "The Floating Point Library" (earlier in this chapter) are also being used to change the floating-point environment. The two sets of functions are unaware of each other, and conflicts may arise if one set of functions makes a change that the other set of functions is not aware of. The coprocessor control functions are provided for compatibility with other MS-DOS C and C++ compilers.

Assembly language functions that call the coprocessor can use all of the coprocessor's stack registers, provided that those registers are left empty when the function returns. When calling floating-point library functions from assembly language routines, assume that the function will use the entire coprocessor stack. It is a calling function's responsibility to save and restore coprocessor registers.

Compiling for floating-point emulation and math coprocessors

Digital Mars C++ can produce code to take advantage of a math coprocessor if one is available on the target system. Compiled programs decide at run-time whether to use native coprocessor instructions or software emulation. There is no separate floating-point library to be linked in. Digital Mars C++ provides a range of options for targeting floating-point operations to a particular hardware configuration.

Floating-point options

The following methods are offered to handle floating-point operations:

Does not perform any floating-point operations. Prevents the floating-point routines from being loaded from the library.
Does all floating-point using the emulator libraries. If a math coprocessor is present, does not use it.
Does floating-point using the software emulation. If a math coprocessor is present, uses it for computations. For math functions in the run-time library, change to hand-coded math coprocessor routines if a coprocessor is present. This option is the default.
Generates inline math coprocessor code. A coprocessor chip must be present on the target system for the program to run.

The advantages of each option are:

The size of the resulting program is significantly smaller than with the other three options, and run-time speed is maximized.
The program runs with or without a math coprocessor chip. The speed of calculation and the results obtained are repeatable across all 8086 machines. The compiler itself is compiled with this option.
If a math coprocessor is not present, this option works like option 1. If one is present, the program uses it and runs faster.
The fastest and smallest floating-point compilation is generated, obtaining potentially more accurate results (coprocessor math is done using 80 bits of precision, as opposed to 64 bits for the software floating point). The relative speed increase obtained by using the coprocessor is highly dependent on the relative clock speeds of the coprocessor and the regular CPU and on which version of the coprocessor you are using (80287, 80387, or clones). With a basic IBM PC, the improvement is dramatic for programs that make intensive use of floating-point arithmetic.

The disadvantages are:

The program cannot contain any floating-point operations.
This option produces the slowest code, less accurate calculations and a larger program than that produced by option 1.
The speed of the program and the accuracy of the results obtained depend on whether a coprocessor chip is present. The program is the same size as that produced by option 2.
The program requires a math coprocessor to operate. Executing the instructions without an coprocessor may crash the system.

Creating option 1

Option 1 is created automatically when you do not use any floating-point math. During compilation, Digital Mars C and C++ libraries do not detect whether floating-point math is needed. Floating-point emulation is linked in only if floating-point math is used.

Creating option 2

The following program stub creates option 2:

main()
{
  extern int _8087;
  _8087 = 0; /* never use 80x87 */
  ...
  ...
}

Creating option 3

Option 3 is the default when floating point math is used.

Creating option 4

Create option 4 by compiling code with the -f or -ff option to dmc. For example:

dmc file -f

Code compiled with the -f option crashes if a math coprocessor is not present. We recommend you insert the following code when using the -f option:

void main()
{
  extern int _8087;
  if (_8087 == 0) /* if no 80x87 present */
  {
    printf("This program requires an 80x87\n");
    exit(1); /* halt program */
  }
  ...
}

Experienced programmers using option 4 can modify and recompile the libraries to remove all the non-coprocessor floating-point code. This makes the program smaller and slightly faster.

All four options use the same Digital Mars libraries.

Note: The above options do not change call/return conventions between functions; therefore, you can link object modules compiled using any option.

For example, an application that does a matrix inversion can have two versions:

dmc -c -f matrix -DINLINE8087 -omatrix87. obj
dmc -c matrix
dmc main matrix. obj matrix87. obj

The file matrix.c contains:

#ifdef INLINE8087
void matrix87() /* different name for 87 version*/
#else
void matrix()
#endif
{
    ...
}

The file main.c contains:

extern int _8087;
...
if (_8087)    /* if there is an 80x87 */
  matrix87(); /* use 80x87 version */
else
  matrix();   /* use software floating */
              /* point version */

When you write an executable program this way, it can exploit all the advantages of the coprocessor but still work correctly if the target machine does not have one. You can find similar examples in the math.c library source code.

Floating-point options

Use the options described below to control how the compiler uses floating-point instructions.

Floating Point Compiler Switches
Switch	Description
-f	Generate IEEE 754 inline 8087 code
-fd	Generate IEEE 754 inline 8087 code with workaround for FDIV bug
-ff	Generate fast inline 8087 code

Complex Numbers

Starting with version 8.10, Digital Mars C and C++ implement the ANSI C99 specification for complex floating point types. DMC is the very first compiler to do so. The types are:

_Imaginary float
_Imaginary double
_Imaginary long double
_Complex float
_Complex double
_Complex long double

Note: Complex and imaginary types are only implemented for the Win32 memory model. The code generation relies on a numeric coprocessor, but the DOSX and 16 bit memory models cannot guarantee one.

The relevant header file is complex.h. The old C++ complex class conflicts with this, and was renamed to oldcomplex.h. There is no reason to use the complex class in new code, as the native types supercede all the functionality in the class.

A whole range of new library functions are added to deal with complex numbers. They're documented in Complex Numbers.

Tools

Compiling

Linking

Win32 Programming

DOS and Win16 Programming

C/C++ Extensions

Porting to DMC++

What's in This Chapter

Integers

Integral data types

Literal integral constants

Floating point types

Literal floating-point constants

Unusual floating-point values

Infinities

Zero

NaNs

Quiet NaN

Signaling NaN

Subnormals

Conversions

Integral conversions

Floating point conversions

Conversions between integer and floating-point

Implicit conversions

Significant digits

Arithmetic considerations

Infinity math

NaN math

The sign bit

Random numbers

Arrays

Coprocessors and emulators

Rounding

Changing the precision of calculations

Exceptions

errno

Error trapping

Saving/Restoring the floating-point environment

Header files

Predefined macros

General floating-point types

Floating point characteristic constants

Environment types and constants

Exception macros

Rounding macros

Precision macros

Classification macros

NaNs and infinities

Type determination macros

Exception control functions

Rounding functions

Precision functions

Trigonometric functions

Conversion functions

Sign functions

unsigned int _clear87(void)

unsigned int _control87(unsigned new, unsigned mask)

unsigned int _status87(void)

void _fpreset(void)

Floating-point options

Creating option 1

Creating option 2

Creating option 3

Creating option 4

Floating-point options