
The Steepest Descent Method

Geometric Significance of the Gradient Vector

In x1–x2 space (when n = 2) the gradient vector ∇f(xk) is normal, or perpendicular, to the constant objective curves passing through xk. More generally, if we construct a tangent hyperplane to the constant objective hypersurface, then the gradient vector will be normal to this plane. To prove this, consider any curve on the constant objective surface emanating from xk. Let s be the arc length along the curve. Along the curve we can write x = x(s) with x(0) ≡ xk. Since f(x(s)) = constant,

(d/ds) f(x(s))|s=0 = ∇f(xk)T ẋ(0) = 0,   where ẋ(0) = lim s→0 [x(s) − x(0)]/s

Since x(s) − x(0) is the chord joining xk to x(s), the limit of this chord as s → 0 is a vector that is tangent to the constant objective surface. Thus, ẋ(0) is a vector in the tangent plane. The preceding equation states that

∇f(xk) is normal to this vector. We arrived at this result by considering a single curve through xk. Since this is true for all curves through xk, we conclude that ∇f(xk) is normal to the tangent plane. See Fig. 3.3.

Further, ∇f(xk) points in the direction of increasing function value.
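As a numerical illustration of these two properties (a sketch added here, not from the text), the following Python snippet uses f = x1x2², the function of Example 3.7 below: moving perpendicular to the gradient changes f only to second order, consistent with the gradient being normal to the constant objective curve, while moving along the gradient increases f to first order.

```python
import numpy as np

def f(x):
    return x[0] * x[1] ** 2

def grad_f(x):
    return np.array([x[1] ** 2, 2.0 * x[0] * x[1]])

x = np.array([1.0, 2.0])
g = grad_f(x)
t = np.array([-g[1], g[0]]) / np.linalg.norm(g)   # unit vector perpendicular to grad f
eps = 1e-4

df_tangent = f(x + eps * t) - f(x)                        # second-order small (~1e-8 in magnitude)
df_uphill = f(x + eps * g / np.linalg.norm(g)) - f(x)     # first-order positive (~ eps*||grad f||)

print(df_tangent, df_uphill)
```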

Example 3.7

Given f = x1x2², x0 = (1, 2)T.

(i) Find the steepest descent direction at x0. (ii) Is d = (−1, 2)T a direction of descent?

We have the gradient vector ∇f = (x2², 2x1x2)T. The steepest descent direction at x0 is the negative of the gradient vector, which gives (−4, −4)T. Now, in (ii), to check whether d is a descent direction, we need to check the dot product of d with the gradient vector as given in (3.10). We have ∇f(x0)T d = (4, 4)(−1, 2)T = −4 + 8 = +4 > 0. Thus, d is not a descent direction.
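The arithmetic can be verified with a few lines of Python (an illustrative check, not part of the example):

```python
import numpy as np

def grad_f(x):                      # gradient of f = x1*x2^2
    return np.array([x[1] ** 2, 2.0 * x[0] * x[1]])

x0 = np.array([1.0, 2.0])
g0 = grad_f(x0)                     # (4, 4)
print(-g0)                          # steepest descent direction: (-4, -4)

d = np.array([-1.0, 2.0])
print(g0 @ d)                       # +4 > 0, so d is NOT a descent direction
```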

Line Search

We now return to the steepest descent method. After having determined a direction vector dk at the point xk, the question is “how far” to go along this direction.

Thus, we need to develop a numerical procedure to determine the step size αk along dk. This problem can be resolved once we recognize that it is one-dimensional – it involves only the variable α. Examples 3.5 and 3.6 illustrate this point. Specifically, as we move along dk, the design variables and the objective function depend only on α as

xk + α dk ≡ x(α)   (3.12)


[Figure 3.3. Geometric significance of the gradient vector in two and three dimensions: (a) in the x1–x2 plane, ∇f(x°) is normal to the contours f = 3, 5, 7; (b) in x1–x2–x3 space, ∇f(x°) is normal to the constant objective surfaces f = 10, 20, 30.]

and

f(xk + α dk) ≡ f(α)   (3.13)

The notation f(α) ≡ f(xk + α dk) will be used in this text. The slope or derivative f′(α) = df/dα is called the directional derivative of f along the direction d and is given by the expression

df(α̂)/dα = ∇f(xk + α̂ dk)T dk   (3.14)
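A short Python sketch (an illustration added here, using the function of Example 3.7) shows the one-dimensional restriction f(α) of (3.13), its slope per (3.14), and the fact that the slope at α = 0 is negative along the steepest descent direction, anticipating (3.15):

```python
import numpy as np

def f(x):
    return x[0] * x[1] ** 2

def grad_f(x):
    return np.array([x[1] ** 2, 2.0 * x[0] * x[1]])

xk = np.array([1.0, 2.0])
dk = -grad_f(xk)                              # steepest descent direction

def f_alpha(alpha):                           # f(alpha) = f(xk + alpha*dk), Eq. (3.13)
    return f(xk + alpha * dk)

def slope(alpha):                             # df/dalpha = grad f(xk + alpha*dk)^T dk, Eq. (3.14)
    return grad_f(xk + alpha * dk) @ dk

print(slope(0.0))                             # -||grad f(xk)||^2 = -32 < 0: a downhill start
print((f_alpha(1e-6) - f_alpha(0.0)) / 1e-6)  # finite-difference check of the slope at alpha = 0
```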

[Figure 3.4. Obtaining a three-point pattern: (a) the initial step satisfies f(α1) < f(0), slope f′(0) < 0, and the march proceeds forward through α1, α2, α3, . . . ; (b) the initial step gives f(α1) > f(0) and points are taken within [0, α1].]

In the steepest descent method, the direction vector is −∇f(xk), resulting in the slope at the current point, α = 0, being

df(0)/dα = [∇f(xk)]T [−∇f(xk)] = −||∇f(xk)||² < 0   (3.15)

implying a move in a downhill direction. See also Eq. (1.5b) in Chapter 1.

Exact Line Search: Here, we may choose α so as to minimize f(α):

minimize over α ≥ 0:   f(α) ≡ f(xk + α dk)   (3.16)

We will denote αk to be the solution to the preceding problem. The problem in (3.16) is called “exact line search” since we are minimizing f along the line dk. It is also possible to obtain a step size αk from approximate line search procedures, as discussed in Section 3.9. We give details of performing exact line search with a “Quadratic Fit.”

The main step is to obtain a “three-point pattern.” Once the minimum point is bracketed, the quadratic fit or other techniques discussed in Chapter 2 may be used to obtain the minimum. Consider a plot of f(α) versus α as shown in Fig. 3.4a. Bear in mind that f(0) ≡ f(xk). Since the slope of the curve at α = 0 is negative, we know that the function immediately decreases with α > 0. Now, choose an initial step α1 = s. If f(α1) < f(0), then we march forward by taking steps of αi+1 = αi + τ^i s, i = 1, 2, . . . , where τ = 0.618034 is the golden section ratio. We continue this march until the function starts to increase, obtaining f(αi+1) ≥ f(αi) < f(αi−1), whereupon [αi−1, αi, αi+1] is the three-point pattern.

However, if the initial step α1 results in f(α1) > f(0), as depicted in Fig. 3.4b, then we define points within [0, α1] by αi+1 = τ αi, i = 1, 2, . . . , until f(αi) > f(αi+1) ≤ f(0). We will then have [0, αi+1, αi] as our three-point pattern.

Again, once we have identified a three-point pattern, the quadratic fit or other techniques (see Chapter 2) may be used to determine the minimum αk. The reader can see how the aforementioned logic is implemented in program STEEPEST.
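A minimal Python sketch of the bracketing-plus-quadratic-fit idea follows. It is an illustration, not the actual program STEEPEST: the forward march here simply expands the step by the factor 1/τ, the shrinking branch follows the rule described above, and the helper names are hypothetical.

```python
import numpy as np

TAU = 0.618034  # golden section ratio

def three_point_pattern(f_alpha, s=0.1):
    """Bracket a minimizer of f(alpha), alpha >= 0, with points [a, b, c] such that
    f(b) <= f(a) and f(b) <= f(c). Sketch only; program STEEPEST may differ."""
    a, b = 0.0, s
    fa, fb = f_alpha(a), f_alpha(b)
    if fb > fa:
        # Initial step overshot (Fig. 3.4b): shrink within [0, s] until f decreases.
        while True:
            c, fc = b, fb
            b = TAU * b
            fb = f_alpha(b)
            if fb <= fa and fb < fc:
                return a, b, c
    step = s
    while True:
        step /= TAU                    # expand the step while marching forward
        c = b + step
        fc = f_alpha(c)
        if fc >= fb:                   # function started to increase: bracket found
            return a, b, c
        a, fa = b, fb
        b, fb = c, fc

def quadratic_fit_min(f_alpha, a, b, c):
    """Vertex of the parabola through (a, f(a)), (b, f(b)), (c, f(c))."""
    fa, fb, fc = f_alpha(a), f_alpha(b), f_alpha(c)
    num = (b - a) ** 2 * (fb - fc) - (b - c) ** 2 * (fb - fa)
    den = (b - a) * (fb - fc) - (b - c) * (fb - fa)
    return b - 0.5 * num / den

# Usage on a simple one-dimensional function with minimum at alpha = 2:
f_alpha = lambda alpha: (alpha - 2.0) ** 2 + 1.0
a, b, c = three_point_pattern(f_alpha)
print(a, b, c, quadratic_fit_min(f_alpha, a, b, c))   # quadratic fit recovers alpha = 2
```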

Exact line search as described previously, based on solving the problem in (3.16), may entail a large number of function calls. A sufficient decrease or Armijo condition is sometimes more efficient and considerably simplifies the computer code. Further details on this are given later in Section 3.9, so as not to disturb the presentation of the methods. However, all computer codes accompanying this text use the exact line search strategy, which is the most robust approach for general problems.
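As a hedged illustration of the sufficient-decrease idea (the details in Section 3.9 are the text's own version; the constants c1 = 1e-4 and the backtracking factor 0.5 used here are common default choices, not values from the text):

```python
import numpy as np

def armijo_backtracking(f, grad_f, xk, dk, alpha0=1.0, c1=1e-4, rho=0.5, max_iter=50):
    """Backtrack until f(xk + alpha*dk) <= f(xk) + c1*alpha*grad_f(xk)^T dk.
    c1 and rho are illustrative defaults, not the text's parameters."""
    fk = f(xk)
    slope0 = grad_f(xk) @ dk          # negative for a descent direction
    alpha = alpha0
    for _ in range(max_iter):
        if f(xk + alpha * dk) <= fk + c1 * alpha * slope0:
            return alpha
        alpha *= rho                  # reduce the step and try again
    return alpha

# Usage on the function of Example 3.8:
f = lambda x: (x[0] - 2.0) ** 4 + (x[0] - 2.0 * x[1]) ** 2
grad_f = lambda x: np.array([4.0 * (x[0] - 2.0) ** 3 + 2.0 * (x[0] - 2.0 * x[1]),
                             -4.0 * (x[0] - 2.0 * x[1])])
xk = np.array([0.0, 3.0])
dk = -grad_f(xk)
print(armijo_backtracking(f, grad_f, xk, dk))   # an acceptable, not optimal, step size
```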

Stopping Criteria

Starting from an initial point, we determine a direction vector and a step size, and obtain a new point as xk+1 = xk + αk dk. The question now is when to stop the iterative process. Two stopping criteria are discussed in the following and are implemented in the computer programs.

(I) Before performing line search, the necessary condition for optimality is checked:

||∇f(xk)|| ≤ εG   (3.17)

where εG is a tolerance on the gradient and is supplied by the user. If (3.17) is satisfied, then the process is terminated. We note that the gradient can also vanish for a local maximum point. However, we are using a descent strategy where the value of f reduces from iteration to iteration – as opposed to a root-finding strategy. Thus, the chances of converging to a local maximum point are remote.

(II) We should also check successive reductions in f as a criterion for stopping.

Here, we check if

|f(xk+1) − f(xk)| ≤ εA + εR |f(xk)|   (3.18)

where εA = absolute tolerance on the change in function value and εR = relative tolerance. Only if (3.18) is satisfied for two consecutive iterations is the descent process stopped.

The reader should note that both a gradient check and a function check are necessary for robust stopping criteria. We also impose a limit on the number of iterations. The algorithm is presented as follows.

Steepest Descent Algorithm

1. Select a starting design point x0 and parameters εG, εA, εR. Set iteration index k = 0.

2. Compute ∇f(xk). Stop if ||∇f(xk)|| ≤ εG. Otherwise, define a normalized direction vector dk = −∇f(xk)/||∇f(xk)||.

3. Obtain αk from exact or approximate line search techniques. Update xk+1 = xk + αk dk.

4. Evaluate f(xk+1). Stop if (3.18) is satisfied for two successive iterations. Otherwise set k = k + 1, xk = xk+1, and go to step 2.
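The following Python sketch implements steps 1–4. It is not the text's program STEEPEST: the exact line search is delegated to a bounded scalar minimizer from SciPy, and the tolerances, the bound alpha_max, and the iteration limit are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def steepest_descent(f, grad_f, x0, eps_g=1e-6, eps_a=1e-8, eps_r=1e-8,
                     alpha_max=10.0, max_iter=200):
    """Sketch of the steepest descent algorithm with the two stopping checks
    (3.17) and (3.18); parameter values are illustrative choices only."""
    x = np.asarray(x0, dtype=float)
    consecutive = 0                              # counts successive small changes in f
    for k in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) <= eps_g:           # gradient check, Eq. (3.17)
            break
        d = -g / np.linalg.norm(g)               # normalized steepest descent direction
        phi = lambda a: f(x + a * d)
        alpha = minimize_scalar(phi, bounds=(0.0, alpha_max), method="bounded").x
        x_new = x + alpha * d
        # function-value check, Eq. (3.18), required on two successive iterations
        if abs(f(x_new) - f(x)) <= eps_a + eps_r * abs(f(x)):
            consecutive += 1
            if consecutive >= 2:
                x = x_new
                break
        else:
            consecutive = 0
        x = x_new
    return x

# Usage on the function of Example 3.8:
f = lambda x: (x[0] - 2.0) ** 4 + (x[0] - 2.0 * x[1]) ** 2
grad_f = lambda x: np.array([4.0 * (x[0] - 2.0) ** 3 + 2.0 * (x[0] - 2.0 * x[1]),
                             -4.0 * (x[0] - 2.0 * x[1])])
print(steepest_descent(f, grad_f, [0.0, 3.0]))   # approaches the minimizer (2, 1)
```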

Convergence Characteristics of the Steepest Descent Method

The steepest descent method zigzags its way towards the optimum point. This is because each step is orthogonal to the previous step. This follows from the fact that αk is obtained by minimizing f(xk + α dk). Thus, d/dα {f(xk + α dk)} = 0 at α = αk. Upon differentiation, we get ∇f(xk + αk dk)T dk = 0 or ∇f(xk+1)T dk = 0. Since dk+1 = −∇f(xk+1), we arrive at the result dk+1T dk = 0. Note that this result is true only if the line search is performed exactly.
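This orthogonality can be checked numerically on the quadratic f = x1² + a x2² discussed next, for which the exact line search step has the closed form α = gTg/(gTAg) (a standard result for quadratics, not derived in the text):

```python
import numpy as np

a = 5.0
A = np.diag([2.0, 2.0 * a])        # Hessian of f = x1^2 + a*x2^2, i.e. f = (1/2) x^T A x

def exact_step(x):
    g = A @ x                      # gradient
    d = -g                         # steepest descent direction
    alpha = (g @ g) / (g @ A @ g)  # exact minimizer of f(x + alpha*d) for a quadratic
    return x + alpha * d, d

x0 = np.array([1.0, 1.0])
x1, d0 = exact_step(x0)
x2, d1 = exact_step(x1)
print(d1 @ d0)                     # ~0: consecutive exact-line-search directions are orthogonal
```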

Application of the steepest descent method to quadratic functions leads to further understanding of the method. Consider the quadratic function

f = x1² + a x2²

The steps taken by the steepest descent method to the optimum are shown in Fig. 3.5. The larger the value of a, the more elongated the contours of f and the slower the convergence rate. It has been proven that the speed of convergence of the method is related to the spectral condition number of the Hessian matrix. The spectral condition number κ of a symmetric positive definite matrix A is defined as the ratio of the largest to the smallest eigenvalue, or as

κ = λmax/λmin   (3.19)

For well-conditioned Hessian matrices, the condition number is close to unity, the contours are nearly circular, and the method is at its best. The higher the condition number, the more ill-conditioned the Hessian, the more elliptical the contours, the greater the amount of zigzagging as the optimum is approached, the smaller the step sizes and, thus, the poorer the rate of convergence. For the function f = x1² + a x2² considered


[Figure 3.5. The steepest descent method: zigzag iterates x0, x1, x2, . . . approaching the optimum x∗ in the x1–x2 plane.]

previously, we have

∇²f = | 2   0  |
      | 0   2a |

Thus, κ = a. Convergence is fastest when a = 1. For a nonquadratic function, the conditioning of the Hessian matrix at the optimum x∗ affects the convergence rate.

While details may be found in references given at the end of the chapter, and specific proofs must incorporate the line search strategy as well, the main aspects regarding convergence of the steepest descent method are:

(i) The method is convergent from any starting point under mild conditions. The term “globally convergent” is used, but this is not to be confused with finding the global minimum – the method only finds a local minimum. Luenberger [1965] contains an excellent discussion of global convergence requirements for unconstrained minimization algorithms. In contrast, pure Newton's method discussed next does not have this theoretical property.

(ii) The method has a linear rate of convergence. Specifically, if the Hessian at the optimum is positive definite with spectral condition number κ, then the sequence of function values at each iteration satisfies

fk+1 − f∗ ≈ [(κ − 1)²/(κ + 1)²] [fk − f∗]   (3.20)

where f∗ is the optimum value. The preceding relation can be written as ek+1 ≤ C ek^r, where e represents the error and r = 1 indicates a linear rate of convergence.
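As an illustrative check (not from the text), the sketch below runs exact-line-search steepest descent on f = x1² + a x2², for which κ = a, from the deliberately unfavorable starting point x0 = (a, 1); this starting point is an assumption chosen here so that the factor in (3.20) is actually attained at every step.

```python
import numpy as np

a = 5.0
A = np.diag([2.0, 2.0 * a])        # Hessian of f = x1^2 + a*x2^2
kappa = a
predicted = ((kappa - 1.0) / (kappa + 1.0)) ** 2

def f(x):
    return x[0] ** 2 + a * x[1] ** 2

x = np.array([a, 1.0])             # unfavorable start chosen to show worst-case behavior
for k in range(5):
    g = A @ x
    alpha = (g @ g) / (g @ A @ g)  # exact line search step for a quadratic
    x_new = x - alpha * g
    print(f(x_new) / f(x), predicted)   # observed vs predicted reduction ratio (f* = 0 here)
    x = x_new
```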

Typically, the steepest descent method takes only a few iterations to bring a far-off starting point into the “optimum region” but then takes hundreds of iterations to make very little progress toward the solution.

Scaling

The aforementioned discussion leads to the concept of scaling the design variables so that the Hessian matrix is well conditioned. For the function f = x1² + a x2², we can define new variables y1 and y2 as

y1 = x1,   y2 = √a x2

which leads to the function in y variables given by g = y1² + y2². The Hessian of the new function has the best conditioning possible, with contours corresponding to circles.

In general, consider a function f = f(x). If we scale the variables as

x = T y   (3.21)

then the new function is g(y) ≡ f(T y), and its gradient and Hessian are given by

∇g = T^T ∇f   (3.22)

∇²g = T^T ∇²f T   (3.23)

Usually T is chosen as a diagonal matrix. The idea is to choose T such that the Hessian of g has a condition number close to unity.
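A small sketch showing the effect on the Hessian condition number for f = x1² + a x2², with the scaling matrix T = diag(1, 1/√a) chosen to match the substitution y2 = √a x2 above (an illustration added here, not from the text):

```python
import numpy as np

a = 25.0
H = np.diag([2.0, 2.0 * a])              # Hessian of f in the x variables
T = np.diag([1.0, 1.0 / np.sqrt(a)])     # scaling x = T y, i.e. y2 = sqrt(a)*x2
H_scaled = T.T @ H @ T                   # Hessian of g(y) = f(T y), Eq. (3.23)

cond = lambda M: np.linalg.eigvalsh(M).max() / np.linalg.eigvalsh(M).min()
print(cond(H))          # kappa = a = 25 before scaling
print(cond(H_scaled))   # kappa = 1 after scaling: contours become circles
```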

Example 3.8

Consider f = (x1 − 2)⁴ + (x1 − 2x2)², x0 = (0, 3)T. Perform one iteration of the steepest descent method.

We have f(x0) = 52 and ∇f(x) = [4(x1 − 2)³ + 2(x1 − 2x2), −4(x1 − 2x2)]T. Thus, d0 = −∇f(x0) = [44, −24]T. Normalizing the direction vector to make it a unit vector, we have d0 = [0.8779, −0.4789]T. The solution to the line search problem, minimize f(α) = f(x0 + α d0) with α > 0, yields α0 = 3.0841. Thus, the new point is x1 = x0 + α0 d0 = [2.707, 1.523]T, with f(x1) = 0.365. The second iteration will now proceed by evaluating ∇f(x1), etc.
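The numbers of this iteration can be reproduced with the following sketch; the bounded scalar minimizer and its upper limit of 10 are assumptions made for illustration, not part of the example.

```python
import numpy as np
from scipy.optimize import minimize_scalar

f = lambda x: (x[0] - 2.0) ** 4 + (x[0] - 2.0 * x[1]) ** 2
grad_f = lambda x: np.array([4.0 * (x[0] - 2.0) ** 3 + 2.0 * (x[0] - 2.0 * x[1]),
                             -4.0 * (x[0] - 2.0 * x[1])])

x0 = np.array([0.0, 3.0])
g0 = grad_f(x0)                          # (-44, 24)
d0 = -g0 / np.linalg.norm(g0)            # (0.8779, -0.4789)

res = minimize_scalar(lambda a: f(x0 + a * d0), bounds=(0.0, 10.0), method="bounded")
x1 = x0 + res.x * d0
print(f(x0), d0, res.x, x1, f(x1))       # 52, (0.8779, -0.4789), ~3.084, ~(2.707, 1.523), ~0.365
```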