numerical analysis - mathematics of scientific computing

Chapter 2

COMPUTER ARITHMETIC

2.1 Floating-Point Numbers and Roundoff Errors

2.2 Absolute and Relative Errors; Loss of Significance

2.3 Stable and Unstable Computations; Conditioning

2.1 Floating-Point Numbers and Roundoff Errors

Most high-speed computers deal with real numbers in the binary system, in contrast to the decimal system that humans prefer to use. The binary system uses 2 as the base in the same way that the decimal system uses 10. To make this comparison, recall first how our familiar number representation works. When a real number such as 427.325 is written out in more detail, we have

$$427.325 = 4 \times 10^{2} + 2 \times 10^{1} + 7 \times 10^{0} + 3 \times 10^{-1} + 2 \times 10^{-2} + 5 \times 10^{-3}$$

The expression on the right employs powers of 10 together with the digits 0, 1, 2, 3, 4, 5, 6, 7, 8, 9. If we admit the possibility of having an infinite number of digits standing to the right of the decimal point, then any real number can be expressed in the manner just illustrated, with a sign (+ or -) affixed to it. Thus $-\pi$ is

$$-\pi = -3.14159265358979323846264338\ldots$$

The last 8 written here stands for $8 \times 10^{-26}$.

In the binary system, only the two digits 0 and 1 are used. A typical number in the binary system can also be written in detail, as for example

$$1001.11101 = 1 \times 2^{3} + 0 \times 2^{2} + 0 \times 2^{1} + 1 \times 2^{0} + 1 \times 2^{-1} + 1 \times 2^{-2} + 1 \times 2^{-3} + 0 \times 2^{-4} + 1 \times 2^{-5}$$

This is the same real number as 9.90625 in decimal notation. (Verify.)
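The verification can be carried out mechanically. The following is a minimal Python sketch that evaluates a digit string in a given base by summing digit times power of the base, exactly as in the two expansions above. The function name `positional_value` is an illustrative choice, and exact rational arithmetic is used so that no rounding enters the check.

```python
from fractions import Fraction

def positional_value(digits: str, base: int) -> Fraction:
    """Evaluate a digit string such as '1001.11101' in the given base, exactly."""
    whole, _, frac = digits.partition(".")
    value = Fraction(0)
    # Integer part: Horner's rule, equivalent to summing d_k * base**k.
    for ch in whole:
        value = value * base + int(ch, base)
    # Fractional part: the k-th digit after the point contributes d * base**(-k).
    scale = Fraction(1, base)
    for ch in frac:
        value += int(ch, base) * scale
        scale /= base
    return value

print(float(positional_value("1001.11101", 2)))  # 9.90625
print(positional_value("427.325", 10))           # 17093/40
```

The second line prints the decimal example as a reduced fraction, since 427.325 = 17093/40.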

In general, any integer $\beta > 1$ can be used as the base for a number system. Numbers represented in base $\beta$ will contain digits $0, 1, 2, 3, 4, \ldots, \beta - 1$. If the context does not make it clear what number base is being used for the number $N$, the notation $(N)_\beta$ can be employed. Thus we have, from above,

$$(1001.11101)_2 = (9.90625)_{10}$$


Since the typical computer communicates with its human users in the decimal system but works internally in the binary system, there must be conversion procedures that are executed by the computer. These come into use at input and output time.

Ordinarily the user need not be concerned with these conversions, but they do involve small roundoff errors, as we shall see later.

Computers are not able to operate using real numbers expressed with more than a fixed number of digits. The word length of the computer places a restriction on the precision with which real numbers can be represented.

Normalized Scientific Notation

In the decimal system, any real number can be expressed in normalized scientific notation. This means that the decimal point is shifted and appropriate powers of 10 are supplied so that all the digits are to the right of the decimal point, and the first digit displayed is not zero. Examples are

$$732.5051 = 0.7325051 \times 10^{3}$$

$$-0.005612 = -0.5612 \times 10^{-2}$$

In general, a nonzero real number $x$ can be represented in the form

$$x = \pm r \times 10^{n} \qquad (1)$$

where $r$ is a number in the range $\tfrac{1}{10} \le r < 1$ and $n$ is an integer (positive, negative, or zero). Of course, if $x = 0$, then $r = 0$; in all other cases, we can adjust $n$ so that $r$ lies in the given range.

In exactly the same way, we can use scientific notation in the binary system. Now we have

$$x = \pm q \times 2^{m} \qquad (2)$$

where $\tfrac{1}{2} \le q < 1$ (if $x \ne 0$) and $m$ is an integer. The number $q$ is called the mantissa and the integer $m$ the exponent. In a binary computer, both $q$ and $m$ will be represented as base 2 numbers.
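As an illustration of Equation (2), Python's standard `math.frexp` splits a floating-point number into a mantissa in the range [1/2, 1) and an integer exponent, which matches the normalization used here. The wrapper `normalized_binary` below is a hypothetical helper written for this sketch; it only separates out the sign.

```python
import math

def normalized_binary(x: float) -> tuple[int, float, int]:
    """Return (sign, q, m) with x == sign * q * 2**m and 1/2 <= q < 1 when x != 0."""
    if x == 0.0:
        return (1, 0.0, 0)          # zero has no normalized form; q = 0 by convention
    q, m = math.frexp(abs(x))       # frexp yields 0.5 <= q < 1 for finite nonzero input
    sign = -1 if x < 0 else 1
    return (sign, q, m)

print(normalized_binary(9.90625))   # (1, 0.619140625, 4)
```

Here 9.90625 = 0.619140625 x 2^4, and 0.619140625 is (.100111101) in binary, which is the earlier example with the binary point shifted four places to the left.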

Hypothetical Computer MARC-32

Within a typical computer, numbers are represented in the way just described, but with certain restrictions placed on q and m that are imposed by the available word length. To illustrate this, we consider a hypothetical computer that we call the MARC-32. It has a word length of 32 bits (binary digits) and is therefore similar to many modern personal computers. Suppose that the bits composing a word are allocated in the following way when representing real numbers:

sign of the real number x: 1 bit

sign of the exponent m: 1 bit

exponent (integer |m|): 7 bits

mantissa (real number |q|): 23 bits

Since a nonzero real number $x = \pm q \times 2^{m}$ can be normalized so that its mantissa is in the range $\tfrac{1}{2} \le q < 1$, the first bit in q can be assumed to be 1. It therefore does not require storage. The 23 bits reserved in the word for the mantissa can be used to store the 2nd, 3rd, ..., 24th bits in q. In effect, the machine has a 24-bit mantissa for its floating-point numbers.

A real number expressed as in Equation (2) is said to be in normalized floating-point form. If it can then be represented with |m| occupying 7 bits and q occupying 24 bits, then it is a machine number in the MARC-32. That is, it can be precisely represented within this particular computer. Most real numbers are not precisely representable within the MARC-32. When such a number occurs as an input datum or as the result of a computation, an inevitable error will arise in representing it as accurately as possible by a machine number.

The restriction that |m| require no more than 7 bits means that

$$|m| \le (1111111)_2 = 2^{7} - 1 = 127$$

Since $2^{127} \approx 10^{38}$, the MARC-32 can handle numbers roughly as small as $10^{-38}$ and as large as $10^{38}$. This is not a sufficiently generous range of magnitudes for some scientific calculations and, for this reason and others, we occasionally must write a program in double-precision or extended-precision arithmetic. A floating-point number in double-precision is represented in two computer words, and the mantissa usually has at least twice as many bits. Hence, there are roughly twice the number of decimal places of accuracy in double-precision as in single-precision. In double-precision, calculations are much slower than in single-precision, often by a factor of 2 or more.
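The range estimate can be checked with a few lines of Python. This sketch assumes the extreme normalized values are obtained by pairing the largest mantissa (24 one-bits) with m = 127 and the smallest mantissa (1/2) with m = -127; the variable names are illustrative only.

```python
from fractions import Fraction
import math

# Largest mantissa: (.111...1)_2 with 24 ones, i.e. 1 - 2**-24.
largest = (1 - Fraction(1, 2**24)) * 2**127
# Smallest normalized mantissa: (.100...0)_2 = 1/2, with exponent -127.
smallest = Fraction(1, 2) * Fraction(1, 2**127)

print(math.log10(2**127))        # about 38.23, so 2**127 is roughly 10**38
print(f"{float(largest):.3e}")   # 1.701e+38
print(f"{float(smallest):.3e}")  # 2.939e-39
```

Both extremes lie within a factor of ten of the rounded figures quoted in the text.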

The restriction that q require no more than 24 bits means that our machine numbers have a limited precision of roughly seven decimal places, since the least significant bit in the mantissa represents units of $2^{-24}$ (or approximately $10^{-7}$). Thus, numbers expressed with more than seven decimal digits will be approximated when given as input to the computer. Also, some simple decimal numbers such as 1/10 are not machine numbers on a binary computer! (See Equation (9) later in this section.)

Floating-point numbers in a binary computer are distributed rather unevenly, more of them being concentrated near 0. There are only a finite number of floating-point numbers in the computer, and between adjacent powers of 2 there are always the same number of machine numbers. Since gaps between powers of 2 are smaller near zero and larger away from zero, this produces a nonuniform distribution of floating-point numbers, with higher density near the origin.
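To see concretely why 1/10 is not a machine number, the following Python sketch rounds an exact rational to the nearest value with a 24-bit mantissa. It assumes round-to-nearest (the text says only "as accurately as possible") and ignores the limit on the exponent; the function name `round_to_mantissa_bits` is hypothetical.

```python
from fractions import Fraction

def round_to_mantissa_bits(x: Fraction, bits: int = 24) -> Fraction:
    """Round x to the nearest q * 2**m where q in [1/2, 1) has `bits` binary digits."""
    if x == 0:
        return Fraction(0)
    sign = -1 if x < 0 else 1
    x = abs(x)
    # Find m with 2**(m-1) <= x < 2**m, so that q = x / 2**m lies in [1/2, 1).
    m = 0
    while x >= Fraction(2) ** m:
        m += 1
    while x < Fraction(2) ** (m - 1):
        m -= 1
    scaled = x / Fraction(2) ** m * 2**bits   # q expressed in units of 2**-bits
    n = round(scaled)                         # nearest integer number of units
    return sign * Fraction(n, 2**bits) * Fraction(2) ** m

tenth = Fraction(1, 10)
approx = round_to_mantissa_bits(tenth)
print(approx)                 # 13421773/134217728
print(approx == tenth)        # False: 1/10 is not exactly representable
print(float(approx - tenth))  # about 1.49e-09
```

By contrast, a number such as 9.90625, whose binary expansion terminates within 24 bits, is returned unchanged.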

An integer can use all of the computer word in its representation except that a single bit must be reserved for the sign. Hence in the MARC-32, integers range from -(2³¹-1) to 2³¹- 1 = 2147483647. In scientific computations, purely integer calculations are not common.

The floating-point representation for a single-precision real number in the hypothetical 32-bit computer MARC-32 is divided into three fields as shown in Figure 2.1.

Here S is the bit representing the sign of x, s is the bit representing the sign of m, E is the 7-bit exponent, and F is the 23-bit fraction of the real number x that, together with an implicit leading 1, yields the significant digit field $(.1\,\_\,\_\,\cdots\,\_)_2$. Hence, nonzero normalized machine numbers are bit strings whose values are decoded as follows:
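From the field descriptions, the decoded value is the sign given by S, times the mantissa q = (.1F) in binary, times 2 raised to the exponent E with the sign given by s. A minimal Python sketch of this decoding follows. It assumes the bits are stored in the order S, s, E, F and that a sign bit of 1 means negative; neither convention is fixed by the text, and `decode_marc32` is a hypothetical name.

```python
from fractions import Fraction

def decode_marc32(word: str) -> Fraction:
    """Decode a 32-character bit string laid out as S | s | E (7 bits) | F (23 bits)."""
    if len(word) != 32 or set(word) - {"0", "1"}:
        raise ValueError("expected a string of exactly 32 bits")
    S = int(word[0])         # sign of x (assumed: 1 means negative)
    s = int(word[1])         # sign of the exponent m (assumed: 1 means negative)
    E = int(word[2:9], 2)    # |m|, an integer from 0 to 127
    F = int(word[9:], 2)     # the 23 stored mantissa bits
    # Implicit leading 1: q = (.1F)_2 = (2**23 + F) / 2**24, so 1/2 <= q < 1.
    q = Fraction(2**23 + F, 2**24)
    m = -E if s else E
    x = q * Fraction(2) ** m
    return -x if S else x

# 9.90625 = (.100111101)_2 x 2**4: E = 4, and F holds the bits after the leading 1.
word = "0" + "0" + "0000100" + "00111101" + "0" * 15
print(decode_marc32(word))   # 317/32
```

The result 317/32 equals 9.90625, the binary example from earlier in the section.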

