Example 1. This example follows up Example 4 of the previous section: given $p > 0$, $q > 0$, $p \gg q$, determine the root $y = -p + \sqrt{p^2+q}$ with smallest absolute value of the quadratic equation
$$y^2 + 2py - q = 0.$$
Input data: $p$, $q$. Result: $y = \varphi(p, q) = -p + \sqrt{p^2+q}$.
The problem was seen to be well conditioned for $p > 0$, $q > 0$. It was also shown that the relative input errors $\varepsilon_p$, $\varepsilon_q$ make the following contribution to the relative error of the result $y = \varphi(p, q)$:
$$\frac{-p}{\sqrt{p^2+q}}\,\varepsilon_p + \frac{q}{2y\sqrt{p^2+q}}\,\varepsilon_q = \frac{-p}{\sqrt{p^2+q}}\,\varepsilon_p + \frac{p+\sqrt{p^2+q}}{2\sqrt{p^2+q}}\,\varepsilon_q.$$
Since
$$\frac{p}{\sqrt{p^2+q}} \le 1, \qquad \frac{p+\sqrt{p^2+q}}{2\sqrt{p^2+q}} \le 1,$$
the inherent error $\Delta^{(0)}y$ satisfies
$$\mathrm{eps} \le \varepsilon_y^{(0)} := \frac{\Delta^{(0)}y}{y} \le 3\,\mathrm{eps}.$$
We will now consider two algorithms for computing $y = \varphi(p, q)$.
Algorithm 1:
$$s := p^2, \quad t := s + q, \quad u := \sqrt{t}, \quad y := -p + u.$$
Obviously, $p \gg q$ causes cancellation when $y := -p + u$ is evaluated, and it must therefore be expected that the roundoff error
$$\Delta u := \varepsilon\cdot\sqrt{t} = \varepsilon\cdot\sqrt{p^2+q},$$
generated during the floating-point calculation of the square root,
$$\mathrm{fl}(\sqrt{t}) = \sqrt{t}\,(1+\varepsilon), \qquad |\varepsilon| \le \mathrm{eps},$$
will be greatly amplified. Indeed, the above error contributes the following term to the relative error of $y$:
$$\frac{1}{y}\,\Delta u = \frac{\sqrt{p^2+q}}{-p+\sqrt{p^2+q}}\cdot\varepsilon = \frac{1}{q}\left(p\sqrt{p^2+q} + p^2 + q\right)\varepsilon = k\cdot\varepsilon.$$
Since $p, q > 0$, the amplification factor $k$ admits the following lower bound:
$$k > \frac{2p^2}{q} > 0,$$
which is large, since $p \gg q$ by hypothesis. Therefore, the proposed algorithm is not numerically stable, because the influence of the rounding of $\sqrt{p^2+q}$ alone exceeds that of the inherent error $\varepsilon_y^{(0)}$ by an order of magnitude.
Algorithm 2:
$$s := p^2, \quad t := s + q, \quad u := \sqrt{t}, \quad v := p + u, \quad y := q/v.$$
This algorithm does not cause cancellation when calculating $v := p + u$. The roundoff error $\Delta u = \varepsilon\sqrt{p^2+q}$, which stems from rounding $\sqrt{p^2+q}$, will be amplified according to the remainder map $\psi(u)$:
$$u \to p + u \to \frac{q}{p+u} =: \psi(u).$$
Thus it contributes the following term to the relative error of $y$:
$$\frac{1}{y}\,\frac{\partial\psi}{\partial u}\,\Delta u = \frac{-q}{y\,(p+u)^2}\cdot\Delta u = \frac{-q\,\sqrt{p^2+q}}{\bigl(-p+\sqrt{p^2+q}\bigr)\bigl(p+\sqrt{p^2+q}\bigr)^2}\cdot\varepsilon = \frac{-\sqrt{p^2+q}}{p+\sqrt{p^2+q}}\cdot\varepsilon = k\cdot\varepsilon.$$
The amplification factor $k$ remains small; indeed, $|k| < 1$, and Algorithm 2 is therefore numerically stable.
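The contrast between the two algorithms is easy to reproduce. The following sketch (our own illustration; the function names are not from the text) evaluates both algorithms in IEEE double precision, with eps ≈ 1.1 · 10^{-16} instead of the 40-bit arithmetic used for the numerical example that follows, for the data of that example:

```python
import math

def root_alg1(p, q):
    # Algorithm 1: s := p^2, t := s + q, u := sqrt(t), y := -p + u
    # cancellation occurs in the final subtraction when p >> q
    s = p * p
    t = s + q
    u = math.sqrt(t)
    return -p + u

def root_alg2(p, q):
    # Algorithm 2: s := p^2, t := s + q, u := sqrt(t), v := p + u, y := q/v
    # no cancellation: all quantities involved are positive
    s = p * p
    t = s + q
    u = math.sqrt(t)
    v = p + u
    return q / v

p, q = 1000.0, 0.018000000081
y_exact = 0.9e-5          # q was chosen so that y = 0.9 * 10^-5
err1 = abs(root_alg1(p, q) - y_exact) / y_exact
err2 = abs(root_alg2(p, q) - y_exact) / y_exact
# err1 is of the order k * eps with k > 2 p^2 / q of about 10^8,
# while err2 stays within a few units of eps
```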
The following numerical results illustrate the difference between Algorithms 1 and 2. They were obtained using floating-point arithmetic with a mantissa of 40 binary places (about 13 decimal places), as will be the case in subsequent numerical examples.
$p = 1000$, $q = 0.018\,000\,000\,081$

Result $y$ according to Algorithm 1: 0.900 030 136 108 · 10^{-5}
Result $y$ according to Algorithm 2: 0.899 999 999 999 · 10^{-5}
Exact value of $y$:                  0.900 000 000 000 · 10^{-5}

Example 2. For given fixed $x$, the value of $\cos kx$ may be computed recursively, using for $m = 1, 2, \ldots, k-1$ the formula
$$\cos(m+1)x = 2\cos x\,\cos mx - \cos(m-1)x.$$
In this case, a trigonometric-function evaluation has to be carried out only once, to find $c = \cos x$. Now let $|x| \ne 0$ be a small number. The calculation of $c$ causes a small roundoff error:
$$\tilde{c} = (1+\varepsilon)\cos x, \qquad |\varepsilon| \le \mathrm{eps}.$$
How does this roundoff error affect the calculation of $\cos kx$?
Now $\cos kx$ can be expressed in terms of $c$: $\cos kx = \cos(k \arccos c) =: f(c)$. Since
$$\frac{df}{dc} = \frac{k\sin kx}{\sin x},$$
the error $\varepsilon\cos x$ of $c$ causes, to first approximation, an absolute error
$$(1.4.1)\qquad \Delta\cos kx \doteq \frac{\varepsilon\cos x}{\sin x}\,k\sin kx = \varepsilon\cdot k\cot x\,\sin kx$$
in $\cos kx$.
On the other hand, the inherent error $\Delta^{(0)}c_k$ (1.3.19) of the result $c_k := \cos kx$ is
$$\Delta^{(0)}c_k = \bigl[k\,|x\sin kx| + |\cos kx|\bigr]\,\mathrm{eps}.$$
Comparing this with (1.4.1) shows that $\Delta\cos kx$ may be considerably larger than $\Delta^{(0)}c_k$ for small $|x|$, since then $|\cot x| \approx 1/|x|$ is large; hence the algorithm is not numerically stable.
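The first-order prediction (1.4.1) can be checked numerically. In the sketch below (our own illustration; the names are not from the text), a known relative error $\varepsilon$ is injected into $c = \cos x$, the recursion is run in IEEE double precision, whose own roundoff is negligible against the injected error, and the observed error in $\cos kx$ is compared with $\varepsilon\,k\cot x\,\sin kx$:

```python
import math

def cos_recursive(c, k):
    # cos(m+1)x = 2 cos x * cos mx - cos(m-1)x, started from cos 0 = 1, cos x = c
    cm_prev, cm = 1.0, c
    for _ in range(k - 1):
        cm_prev, cm = cm, 2.0 * c * cm - cm_prev
    return cm

x, k = 0.001, 1000
eps_inj = 1e-9                          # injected relative error in c
c_tilde = (1.0 + eps_inj) * math.cos(x)

observed = cos_recursive(c_tilde, k) - math.cos(k * x)
predicted = eps_inj * k * (math.cos(x) / math.sin(x)) * math.sin(k * x)  # (1.4.1)
# observed and predicted agree to about three digits; the amplification
# factor k * cot x is about 10^6 here, confirming the instability
```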
Example 3. For given $x$ and a "large" positive integer $k$, the numbers $\cos kx$ and $\sin kx$ are to be computed recursively using
$$\cos mx = \cos x\,\cos(m-1)x - \sin x\,\sin(m-1)x,$$
$$\sin mx = \sin x\,\cos(m-1)x + \cos x\,\sin(m-1)x, \qquad m = 1, 2, \ldots, k.$$
How do small errors $\varepsilon_c\cos x$, $\varepsilon_s\sin x$ in the calculation of $\cos x$, $\sin x$ affect the final results $\cos kx$, $\sin kx$? Abbreviating $c_m := \cos mx$, $s_m := \sin mx$, $c := \cos x$, $s := \sin x$, and putting
$$U := \begin{bmatrix} c & -s \\ s & c \end{bmatrix},$$
we have
$$\begin{bmatrix} c_m \\ s_m \end{bmatrix} = U \begin{bmatrix} c_{m-1} \\ s_{m-1} \end{bmatrix}, \qquad m = 1, \ldots, k.$$
Here $U$ is a unitary matrix, which corresponds to a rotation by the angle $x$. Repeated application of the formula above gives
$$\begin{bmatrix} c_k \\ s_k \end{bmatrix} = U^k \begin{bmatrix} c_0 \\ s_0 \end{bmatrix} = U^k \begin{bmatrix} 1 \\ 0 \end{bmatrix}.$$
Now
$$\frac{\partial U}{\partial c} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}, \qquad \frac{\partial U}{\partial s} = \begin{bmatrix} 0 & -1 \\ 1 & 0 \end{bmatrix} =: A,$$
and therefore
$$\frac{\partial}{\partial c}U^k = k\,U^{k-1},$$
$$\frac{\partial}{\partial s}U^k = A U^{k-1} + U A U^{k-2} + \cdots + U^{k-1} A = k\,A U^{k-1},$$
because $A$ commutes with $U$. Since $U$ describes a rotation in $\mathbb{R}^2$ by the angle $x$,
$$\frac{\partial}{\partial c}U^k = k \begin{bmatrix} \cos(k-1)x & -\sin(k-1)x \\ \sin(k-1)x & \cos(k-1)x \end{bmatrix}, \qquad \frac{\partial}{\partial s}U^k = k \begin{bmatrix} -\sin(k-1)x & -\cos(k-1)x \\ \cos(k-1)x & -\sin(k-1)x \end{bmatrix}.$$
The relative errors $\varepsilon_c$, $\varepsilon_s$ of $c = \cos x$, $s = \sin x$ effect the following absolute errors of $\cos kx$, $\sin kx$:
$$(1.4.2)\qquad \begin{bmatrix} \Delta c_k \\ \Delta s_k \end{bmatrix} \doteq \frac{\partial}{\partial c}U^k \begin{bmatrix} 1 \\ 0 \end{bmatrix}\varepsilon_c\cos x + \frac{\partial}{\partial s}U^k \begin{bmatrix} 1 \\ 0 \end{bmatrix}\varepsilon_s\sin x = \varepsilon_c\,k\cos x \begin{bmatrix} \cos(k-1)x \\ \sin(k-1)x \end{bmatrix} + \varepsilon_s\,k\sin x \begin{bmatrix} -\sin(k-1)x \\ \cos(k-1)x \end{bmatrix}.$$
The inherent errors $\Delta^{(0)}c_k$ and $\Delta^{(0)}s_k$ of $c_k = \cos kx$ and $s_k = \sin kx$, respectively, are given by
$$(1.4.3)\qquad \Delta^{(0)}c_k = \bigl[k\,|x\sin kx| + |\cos kx|\bigr]\,\mathrm{eps}, \qquad \Delta^{(0)}s_k = \bigl[k\,|x\cos kx| + |\sin kx|\bigr]\,\mathrm{eps}.$$
Comparison of (1.4.2) and (1.4.3) reveals that for big $k$ and $|kx| \approx 1$ the influence of the roundoff error $\varepsilon_c$ is considerably bigger than that of the inherent errors, while the roundoff error $\varepsilon_s$ is harmless. The algorithm is not numerically stable, albeit numerically more trustworthy than the algorithm of Example 2 as far as the computation of $c_k$ alone is concerned.
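The prediction (1.4.2) can likewise be tested numerically: inject a relative error $\varepsilon_c$ into $c = \cos x$ while leaving $s = \sin x$ exact (so $\varepsilon_s = 0$), and compare the observed errors in $c_k$, $s_k$ with the first column of (1.4.2). The following sketch (our own, in IEEE double precision) does this:

```python
import math

def cos_sin_recursive(c, s, k):
    # (c_m, s_m)^T = U (c_{m-1}, s_{m-1})^T with U = [[c, -s], [s, c]]
    cm, sm = 1.0, 0.0                  # c_0 = 1, s_0 = 0
    for _ in range(k):
        cm, sm = c * cm - s * sm, s * cm + c * sm
    return cm, sm

x, k = 0.001, 1000
eps_c = 1e-9                           # injected relative error in cos x
c_tilde = (1.0 + eps_c) * math.cos(x)
s = math.sin(x)

ck, sk = cos_sin_recursive(c_tilde, s, k)
# first column of (1.4.2), with eps_s = 0:
pred_dck = eps_c * k * math.cos(x) * math.cos((k - 1) * x)
pred_dsk = eps_c * k * math.cos(x) * math.sin((k - 1) * x)
obs_dck = ck - math.cos(k * x)
obs_dsk = sk - math.sin(k * x)
# observed and predicted errors agree to several digits
```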
Example 4. For small $|x|$, the recursive calculation of $c_m = \cos mx$, $s_m = \sin mx$, $m = 1, 2, \ldots$, based on
$$\cos(m+1)x = \cos x\,\cos mx - \sin x\,\sin mx, \qquad \sin(m+1)x = \sin x\,\cos mx + \cos x\,\sin mx,$$
as in Example 3, may be further improved numerically. To this end, we express the differences $dc_{m+1}$ and $ds_{m+1}$ of subsequent cosine and sine values as follows:
$$dc_{m+1} := \cos(m+1)x - \cos mx = 2(\cos x - 1)\cos mx - \sin x\,\sin mx - \cos x\,\cos mx + \cos mx = -4\sin^2\frac{x}{2}\,\cos mx + [\cos mx - \cos(m-1)x],$$
$$ds_{m+1} := \sin(m+1)x - \sin mx = 2(\cos x - 1)\sin mx + \sin x\,\cos mx - \cos x\,\sin mx + \sin mx = -4\sin^2\frac{x}{2}\,\sin mx + [\sin mx - \sin(m-1)x].$$
This leads to a more elaborate recursive algorithm for computing $c_k$, $s_k$ in the case $x > 0$:
$$dc_1 := -2\sin^2\frac{x}{2}, \quad t := 2\,dc_1, \quad ds_1 := \sqrt{-dc_1(2 + dc_1)}, \quad s_0 := 0, \quad c_0 := 1,$$
and for $m := 1, 2, \ldots, k$:
$$c_m := c_{m-1} + dc_m, \quad dc_{m+1} := t\cdot c_m + dc_m, \quad s_m := s_{m-1} + ds_m, \quad ds_{m+1} := t\cdot s_m + ds_m.$$
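A direct transcription of this algorithm (a sketch of ours, in IEEE double precision) confirms its accuracy for the data of the numerical example given below:

```python
import math

def cos_sin_diff(x, k):
    # difference form of the recursion, for x > 0
    dc = -2.0 * math.sin(0.5 * x) ** 2      # dc_1 = cos x - 1
    t = 2.0 * dc                            # t = -4 sin^2(x/2)
    ds = math.sqrt(-dc * (2.0 + dc))        # ds_1 = sin x
    c, s = 1.0, 0.0                         # c_0, s_0
    for _ in range(k):
        c += dc
        dc = t * c + dc
        s += ds
        ds = t * s + ds
    return c, s                             # cos kx, sin kx

ck, sk = cos_sin_diff(0.001, 1000)
err_c = abs(ck - math.cos(1.0))
err_s = abs(sk - math.sin(1.0))
# both errors stay close to machine precision, illustrating the
# improved stability compared with Examples 2 and 3
```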
For the error analysis, note that $c_k$ and $s_k$ are functions of $s = \sin(x/2)$:
$$c_k = \cos(2k\arcsin s) =: \varphi_1(s), \qquad s_k = \sin(2k\arcsin s) =: \varphi_2(s).$$
An error $\Delta s = \varepsilon_s \sin(x/2)$ in the calculation of $s$ therefore causes, to a first-order approximation, the following error in $c_k$:
$$\frac{\partial\varphi_1}{\partial s}\,\varepsilon_s\sin\frac{x}{2} = \varepsilon_s\,\frac{-2k\sin kx}{\cos(x/2)}\,\sin\frac{x}{2} = -2k\tan\frac{x}{2}\,\sin kx\cdot\varepsilon_s,$$
and in $s_k$:
$$\frac{\partial\varphi_2}{\partial s}\,\varepsilon_s\sin\frac{x}{2} = 2k\tan\frac{x}{2}\,\cos kx\cdot\varepsilon_s.$$
Comparison with the inherent errors (1.4.3) shows these errors to be harmless for small $|x|$. The algorithm is then numerically stable, at least as far as the influence of the roundoff error $\varepsilon_s$ is concerned.
Again we illustrate our analytical considerations with some numerical results.
Let $x = 0.001$, $k = 1000$.

Algorithm     Result for cos kx           Relative error
Example 2     0.540 302 121 124           -0.34 · 10^{-6}
Example 3     0.540 302 305 776           -0.17 · 10^{-9}
Example 4     0.540 302 305 865           -0.58 · 10^{-11}
Exact value   0.540 302 305 868 140...
Example 5. We will derive some results which will be useful for the analysis of algorithms for solving linear equations in Section 4.5. Given the quantities $c, a_1, \ldots, a_n, b_1, \ldots, b_{n-1}$ with $a_n \ne 0$, we want to find the solution $\beta_n$ of the linear equation
$$(1.4.4)\qquad c - a_1 b_1 - \cdots - a_{n-1} b_{n-1} - a_n \beta_n = 0.$$
Floating-point arithmetic yields the approximate solution
$$(1.4.5)\qquad b_n = \mathrm{fl}\left(\frac{c - a_1 b_1 - \cdots - a_{n-1} b_{n-1}}{a_n}\right)$$
as follows:
$$s_0 := c;$$
for $j := 1, 2, \ldots, n-1$:
$$(1.4.6)\qquad s_j := \mathrm{fl}(s_{j-1} - a_j b_j) = \bigl(s_{j-1} - a_j b_j(1+\mu_j)\bigr)(1+\alpha_j), \qquad b_n := \mathrm{fl}(s_{n-1}/a_n) = (1+\delta)\,s_{n-1}/a_n,$$
with $|\mu_j|, |\alpha_j|, |\delta| \le \mathrm{eps}$. If $a_n = 1$, as is frequently the case in applications, then $\delta = 0$, since $b_n := s_{n-1}$.
We will now describe two useful estimates for the residual
$$r := c - a_1 b_1 - \cdots - a_n b_n.$$
From (1.4.6) follow the equations
$$s_0 - c = 0,$$
$$s_j - (s_{j-1} - a_j b_j) = s_j - \left(\frac{s_j}{1+\alpha_j} + a_j b_j \mu_j\right) = s_j\,\frac{\alpha_j}{1+\alpha_j} - a_j b_j \mu_j, \qquad j = 1, 2, \ldots, n-1,$$
$$a_n b_n - s_{n-1} = \delta\,s_{n-1}.$$
Summing these equations yields
$$r = c - \sum_{i=1}^{n} a_i b_i = \sum_{j=1}^{n-1}\left(-s_j\,\frac{\alpha_j}{1+\alpha_j} + a_j b_j \mu_j\right) - \delta\,s_{n-1},$$
and thereby the first one of the promised estimates,
$$(1.4.7)\qquad |r| \le \frac{\mathrm{eps}}{1-\mathrm{eps}}\Bigl[\delta\cdot|s_{n-1}| + \sum_{j=1}^{n-1}\bigl(|s_j| + |a_j b_j|\bigr)\Bigr], \qquad \delta := \begin{cases} 0 & \text{if } a_n = 1, \\ 1 & \text{otherwise}. \end{cases}$$
The second estimate is cruder than (1.4.7). From (1.4.6),
$$(1.4.8)\qquad b_n = \left[c\prod_{k=1}^{n-1}(1+\alpha_k) - \sum_{j=1}^{n-1} a_j b_j (1+\mu_j) \prod_{k=j}^{n-1}(1+\alpha_k)\right]\frac{1+\delta}{a_n},$$
which can be solved for $c$:
$$(1.4.9)\qquad c = \sum_{j=1}^{n-1} a_j b_j (1+\mu_j) \prod_{k=1}^{j-1}(1+\alpha_k)^{-1} + a_n b_n (1+\delta)^{-1} \prod_{k=1}^{n-1}(1+\alpha_k)^{-1}.$$
A simple induction argument over $m$ shows that
$$1+\sigma = \prod_{k=1}^{m}(1+\sigma_k)^{\pm 1}, \qquad |\sigma_k| \le \mathrm{eps}, \qquad m\cdot\mathrm{eps} < 1$$
implies
$$|\sigma| \le \frac{m\cdot\mathrm{eps}}{1 - m\cdot\mathrm{eps}}.$$
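The required estimate can be obtained from the following elementary bounds (a sketch of ours; Bernoulli's inequality replaces the explicit induction step):

```latex
% Each factor satisfies  1-\mathrm{eps} \le (1+\sigma_k)^{\pm 1} \le (1-\mathrm{eps})^{-1},
% hence  (1-\mathrm{eps})^{m} \le 1+\sigma \le (1-\mathrm{eps})^{-m}.
% By Bernoulli's inequality, (1-\mathrm{eps})^{m} \ge 1 - m\,\mathrm{eps} > 0, so
\begin{align*}
1+\sigma &\le (1-\mathrm{eps})^{-m}
          \le \frac{1}{1 - m\,\mathrm{eps}}
          = 1 + \frac{m\,\mathrm{eps}}{1 - m\,\mathrm{eps}},\\
1+\sigma &\ge (1-\mathrm{eps})^{m}
          \ge 1 - m\,\mathrm{eps}
          \ge 1 - \frac{m\,\mathrm{eps}}{1 - m\,\mathrm{eps}},
\end{align*}
% and the two bounds together give  |\sigma| \le m\,\mathrm{eps}/(1 - m\,\mathrm{eps}).
```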
In view of (1.4.9) this ensures the existence of quantities $\varepsilon_j$ with
$$(1.4.10)\qquad c = \sum_{j=1}^{n-1} a_j b_j (1 + j\cdot\varepsilon_j) + a_n b_n \bigl(1 + (n-1+\delta)\,\varepsilon_n\bigr),$$
$$|\varepsilon_j| \le \frac{\mathrm{eps}}{1 - n\cdot\mathrm{eps}}, \qquad \delta := \begin{cases} 0 & \text{if } a_n = 1, \\ 1 & \text{otherwise}. \end{cases}$$
For $r = c - a_1 b_1 - a_2 b_2 - \cdots - a_n b_n$ we have consequently
$$(1.4.11)\qquad |r| \le \frac{\mathrm{eps}}{1 - n\cdot\mathrm{eps}}\Bigl[\sum_{j=1}^{n-1} j\,|a_j b_j| + (n-1+\delta)\,|a_n b_n|\Bigr].$$
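Both estimates can be verified experimentally. The sketch below (our own; the data are arbitrary, chosen with $a_n \ne 1$ so that $\delta = 1$) runs the loop (1.4.6) in IEEE double precision, computes the residual $r$ exactly with rational arithmetic, and checks it against (1.4.7) and (1.4.11), taking eps $= 2^{-53}$, the unit roundoff of double precision:

```python
from fractions import Fraction

# arbitrary test data; a_n != 1, hence delta = 1
a = [0.3, 0.7, 1.9, 2.6, 0.8]
b = [1.1, 0.4, 2.2, 0.6]           # b_1, ..., b_{n-1}
c = 5.0
n = len(a)
eps = 2.0 ** -53                   # unit roundoff of IEEE double precision

# the loop (1.4.6) in double precision, recording s_1, ..., s_{n-1}
s = c
partial = []
for j in range(n - 1):
    s = s - a[j] * b[j]
    partial.append(s)
b_n = s / a[n - 1]

# exact residual r = c - a_1 b_1 - ... - a_n b_n via rational arithmetic
bb = b + [b_n]
r = Fraction(c) - sum(Fraction(a[i]) * Fraction(bb[i]) for i in range(n))

delta = 0 if a[n - 1] == 1 else 1
bound1 = eps / (1 - eps) * (delta * abs(partial[-1])
         + sum(abs(sj) + abs(a[j] * b[j]) for j, sj in enumerate(partial)))
bound2 = eps / (1 - n * eps) * (sum((j + 1) * abs(a[j] * b[j]) for j in range(n - 1))
         + (n - 1 + delta) * abs(a[n - 1] * b_n))
# |r| <= bound1 and |r| <= bound2, as guaranteed by (1.4.7) and (1.4.11)
```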
In particular, (1.4.8) reveals the numerical stability of our algorithm for computing $\beta_n$. The roundoff error $\alpha_m$ contributes the amount
$$\frac{c - a_1 b_1 - a_2 b_2 - \cdots - a_m b_m}{a_n}\,\alpha_m$$
to the absolute error in $\beta_n$. This, however, is at most equal to the maximal value of
$$\left|\frac{c\,\varepsilon_c - a_1 b_1\,\varepsilon_{a_1} - \cdots - a_m b_m\,\varepsilon_{a_m}}{a_n}\right| \le \frac{\bigl(|c| + \sum_{i=1}^{m} |a_i b_i|\bigr)\,\mathrm{eps}}{|a_n|},$$
that is, no more than the influence of the input errors $\varepsilon_c$ and $\varepsilon_{a_i}$ of $c$ and $a_i$, $i = 1, \ldots, m$, respectively, provided $|\varepsilon_c|, |\varepsilon_{a_i}| \le \mathrm{eps}$. The remaining roundoff errors $\mu_k$ and $\delta$ are similarly shown to be harmless.
The numerical stability of the above algorithm is often shown by interpreting (1.4.10) in the sense of backward analysis: the computed approximate solution $b_n$ is the exact solution of the equation
$$c - \bar{a}_1 b_1 - \cdots - \bar{a}_n b_n = 0,$$
whose coefficients
$$\bar{a}_j := a_j(1 + j\cdot\varepsilon_j), \quad 1 \le j \le n-1, \qquad \bar{a}_n := a_n\bigl(1 + (n-1+\delta)\,\varepsilon_n\bigr),$$
have changed only slightly from their original values $a_j$. This kind of analysis, however, involves the difficulty of having to define how large $n$ can be so that errors of the form $n\varepsilon$, $|\varepsilon| \le \mathrm{eps}$, can still be considered as being of the same order of magnitude as the machine precision eps.