Notes on Optimization
A.3 Methods for Solving Optimization Problems
A.3.1 First Order Methods
A better idea than using simple coordinate descent is based on building a first order model of the objective, namely

    f(x_i + d_i) \approx f(x_i) + \operatorname{grad} f(x_i) \cdot d_i        (A.3)

and to follow a descent direction d that is obtained from the gradient in each iteration step. If the descent direction d_i = -grad f(x_i) is the negative gradient and a step length λ_i is chosen to minimize the objective along this direction, then the very greedy steepest descent method (in the Euclidean norm)² is obtained [NW99, BV04, Pol87]. It is surprising, but has been demonstrated again and again [NW99, BV04, Pol87], that the performance of unpreconditioned descent methods can be quite poor.
One reason for this poor behavior lies in the abrupt direction change in each minimization step: with an exact line search, consecutive steepest descent directions are orthogonal, d_{i-1} · d_i = 0, so the iterates zigzag towards the minimum.
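To make this concrete, here is a minimal sketch of steepest descent with a backtracking line search (illustrative Python; the routine and its parameters are not taken from the cited references):

    import numpy as np

    def steepest_descent(f, grad, x0, eps=1e-8, max_iter=10000):
        """Steepest descent with a backtracking (Armijo) line search.

        f is the objective, grad its gradient, x0 the starting point,
        eps the gradient-norm termination threshold."""
        x = x0.copy()
        for _ in range(max_iter):
            g = grad(x)
            if np.linalg.norm(g) <= eps:   # termination criterion
                break
            d = -g                         # steepest descent direction
            lam, fx = 1.0, f(x)
            # Shrink lam until a sufficient decrease is achieved.
            while f(x + lam * d) > fx - 1e-4 * lam * (g @ g):
                lam *= 0.5
            x = x + lam * d
        return x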
Theorem 2 (Convergence rate of steepest descent) Let f be twice continuously differentiable. If the steepest descent method is started sufficiently close to a minimum x* with positive definite Hess f(x*), then the sequence f(x_i) converges with rate

    \frac{f(x_{i+1}) - f(x^*)}{f(x_i) - f(x^*)} \le \left( \frac{\lambda_n - \lambda_1}{\lambda_n + \lambda_1} \right)^2        (A.4)

where λ_1 is the smallest and λ_n is the largest eigenvalue of Hess f(x*) (Theorem 3.4 in [NW99], p. 49). We also obtain convergence for the point sequence x_i,

    \frac{\|x_i - x^*\|}{\|x_0 - x^*\|} \le \left( \frac{\lambda_n - \lambda_1}{\lambda_n + \lambda_1} + \epsilon \right)^i        (A.5)

for any ε > 0, as Theorem 4 in [Pol87], p. 27 shows.

² This is slightly misleading, as in general ‖d‖ ≠ 1.
³ The "q" refers to "quotient" and is chosen to disambiguate the introduced terms.
The convergence rate is q-linear³ and easily observed in practice. The main problem is that the factor is close to 1 for many problems, even of very modest size: for a condition number λ_n/λ_1 = 100, for instance, the factor in (A.4) is (99/101)² ≈ 0.96, so roughly 60 iterations are needed per digit of accuracy. Because the factor depends on the condition number of the Hessian, it can be improved by carefully changing to a different set of variables defining the objective. In general this is not easy, and hence methods have been developed that are more robust to this problem.
It is quite surprising that following randomly chosen descent directions, as for instance in the SPSA method, is on average as good as following the exact gradient [Spa03, Pol87]. From this one should realize that following the gradient downhill is not a particularly original or efficient strategy. This argument should also show, on an intuitive level, that better downhill directions than the gradient exist (because the average includes many directions that perform worse than the gradient).
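The following sketch shows an SPSA-style iteration; the gain sequences a_i and c_i and their decay rates are illustrative choices, not the tuned values recommended in [Spa03]:

    import numpy as np

    def spsa(f, x0, n_iter=1000, a=0.1, c=0.1, rng=None):
        """SPSA sketch: estimate a descent direction from only two
        function evaluations along a random +/-1 perturbation."""
        rng = np.random.default_rng() if rng is None else rng
        x = x0.astype(float).copy()
        for i in range(1, n_iter + 1):
            a_i = a / i              # step length gain (illustrative decay)
            c_i = c / i ** (1 / 3)   # perturbation size (illustrative decay)
            delta = rng.choice([-1.0, 1.0], size=x.shape)
            # Simultaneous perturbation gradient estimate.
            g_hat = (f(x + c_i * delta) - f(x - c_i * delta)) / (2 * c_i) * (1 / delta)
            x = x - a_i * g_hat
        return x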
One idea that performs better than the steepest descent method is physically motivated and simulates a "heavy ball" rolling down the objective landscape. The momentum dampens the direction changes and, under ideal selection of the damping parameter, achieves the same convergence rate as the popular nonlinear conjugate gradient method (refer to [Pol87], p. 74).
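A minimal sketch of the heavy ball iteration x_{i+1} = x_i - α grad f(x_i) + β (x_i - x_{i-1}); the step length α and damping β are assumed fixed here, although in practice they must be tuned:

    import numpy as np

    def heavy_ball(grad, x0, alpha=0.01, beta=0.9, eps=1e-8, max_iter=10000):
        """Heavy ball method (gradient descent with momentum).

        beta damps the direction changes; beta = 0 recovers plain
        gradient descent with fixed step length alpha."""
        x_prev = x0.copy()
        x = x0.copy()
        for _ in range(max_iter):
            g = grad(x)
            if np.linalg.norm(g) <= eps:
                break
            x, x_prev = x - alpha * g + beta * (x - x_prev), x
        return x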
Observation 1 (Convergence rate of the linear conjugate gradient method) The linear conjugate gradient method with constant Hessian A converges with geometric (q-linear) rate. The error is bounded by

    \frac{\|x_i - x^*\|_A}{\|x_0 - x^*\|_A} \le 2 \left( \frac{\sqrt{\lambda_n / \lambda_1} - 1}{\sqrt{\lambda_n / \lambda_1} + 1} \right)^i.        (A.6)

(See equation 2.15 in [Kel95] and, for a sharper bound, Theorem 5.5 in [NW99].)
Compared to the steepest descent method the main difference is the square root on the spectral condition number λ_n/λ_1. This means that, at least in the linear setting of quadratic energies, the conjugate gradient method converges much faster than the steepest descent method. It also does not require any tuning of a damping parameter (as for the heavy ball method) to achieve this rate.
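For reference, a sketch of the linear conjugate gradient method for A x = b, i.e., for minimizing the quadratic energy (1/2) xᵀA x - bᵀx with constant Hessian A (assumed symmetric positive definite):

    import numpy as np

    def conjugate_gradient(A, b, x0, eps=1e-10, max_iter=None):
        """Linear CG for A x = b with symmetric positive definite A."""
        x = x0.copy()
        r = b - A @ x    # residual = negative gradient of the energy
        d = r.copy()     # first search direction
        max_iter = len(b) if max_iter is None else max_iter
        for _ in range(max_iter):
            if np.linalg.norm(r) <= eps:
                break
            Ad = A @ d
            alpha = (r @ r) / (d @ Ad)        # exact step length along d
            x = x + alpha * d
            r_new = r - alpha * Ad
            beta = (r_new @ r_new) / (r @ r)  # keeps directions A-conjugate
            d = r_new + beta * d
            r = r_new
        return x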
Different nonlinear extensions of the conjugate gradient method exist and behave similarly well in practice. But because they have a "memory" of previously encountered nonlinear data, their theoretical analysis is complex. The Fletcher-Reeves variant is often less efficient than the Polak-Ribière+ algorithm, which is recommended by different authors [NW99, PTVF92].⁴
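The two variants differ only in the parameter β used to build the next search direction d_new = -g_new + β d. A sketch of the two update rules, assuming g and g_new hold the gradients at the current and next iterate:

    import numpy as np

    def beta_fletcher_reeves(g, g_new):
        """Fletcher-Reeves update parameter."""
        return (g_new @ g_new) / (g @ g)

    def beta_polak_ribiere_plus(g, g_new):
        """Polak-Ribiere+ update parameter: the 'plus' clips negative
        values to zero, which restarts with the steepest descent
        direction (beta = 0 gives d_new = -g_new)."""
        return max(0.0, (g_new @ (g_new - g)) / (g @ g))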
One reason why the gradient is a poor downhill direction is its dependence even on a simple affine scaling of the variables:

    \operatorname{grad}(f(Ax)) = A^T \cdot \operatorname{grad} f(Ax)        (A.7)
This means a steepest descent algorithm's performance will depend on the chosen "parameterization", that is, on the scaling of the variables [NW99, BV04]. While this appears to be a minor problem, this sensitivity has great practical implications: for instance, in the parameterization problems discussed in Chapter 4, huge disparities in the edge lengths appear during the minimization. Such situations need to be handled in a robust way across all scales! Newton steps, in contrast, are independent of affine transformations [NW99].⁵
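A small illustration of this sensitivity (a hypothetical quadratic objective; the numbers are chosen only to make the effect visible): minimizing f(y) = (1/2)‖y‖² directly versus after the substitution y = A x, where (A.7) turns the gradient into AᵀA x.

    import numpy as np

    # The substituted objective g(x) = f(A x) has condition number 10^4.
    A = np.diag([1.0, 100.0])
    grad_f = lambda y: y              # gradient in well-scaled variables
    grad_g = lambda x: A.T @ A @ x    # grad g(x) = A^T grad f(A x), cf. (A.7)

    def count_iterations(grad, x0, lam, eps=1e-8):
        """Fixed-step gradient descent; returns the iteration count."""
        x, i = x0.copy(), 0
        while np.linalg.norm(grad(x)) > eps:
            x = x - lam * grad(x)
            i += 1
        return i

    x0 = np.array([1.0, 1.0])
    print(count_iterations(grad_f, x0, lam=1.0))        # 1 iteration
    print(count_iterations(grad_g, x0, lam=1.0 / 1e4))  # ~1.8e5 iterations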
Poor scaling of the gradient also has the potential to ruin a common termination criterion, which is based on the vanishing norm of the gradient. For this reason users are required to carefully pick an application dependent threshold ε and terminate when ‖grad f(x_i)‖ ≤ ε [BMMS04].
A range of methods has been developed to automatically define good preconditioning matrices and thereby obtain better search directions. Some (often positive definite) matrices B_i are updated in each iteration step from the available gradient information. The gradient direction is scaled with the inverse H_i = B_i^{-1} of this matrix to obtain the new step direction d_i = -H_i · grad f(x_i). These methods often try to converge to the Newton method and are commonly referred to as quasi-Newton iterations. Different construction rules for the B_i are, for instance, Broyden's method [Kel95, Pol87, NW99], the Davidon-Fletcher-Powell (DFP) method [Pol87, NW99], or the SR1 method [NW99].
What makes these methods particularly attractive is that direct update rules for the H_i exist. This means there is no need to invert a matrix in each iteration step!⁶ All the mentioned methods converge in a neighborhood of the minimum with q-linear or even q-superlinear rate [Kel95, Pol87, NW99]. But the q-superlinear rate is only achieved if the B_i converge to the Hessian, which for large dimension n is rarely practical to wait for.
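As an illustration, here is the DFP update applied directly to the inverse H_i (a standard form of the rule; s denotes the step taken and y the resulting change of the gradient):

    import numpy as np

    def dfp_inverse_update(H, s, y):
        """DFP rank-two update of the inverse Hessian approximation.

        s = x_{i+1} - x_i, y = grad f(x_{i+1}) - grad f(x_i).
        Assumes s @ y > 0, which a proper line search enforces.
        Updates H directly, so no matrix inversion is needed."""
        Hy = H @ y
        return H + np.outer(s, s) / (s @ y) - np.outer(Hy, Hy) / (y @ Hy)

    # One quasi-Newton step with the current approximation H:
    #   d = -H @ grad_f(x), followed by a line search along d
    #   and the update above.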
⁴ The simpler Polak-Ribière (no "plus") method performs well in practice but can fail to converge without periodic restarts.
⁵ The underlying problem does not magically disappear but is handed over to the linear equation solver.
⁶ But H_i might not be sparse.
Figure A.3: Without step length control the decrease of the objective function can be too small, either because of too short (left graph) or too long (right graph) step lengths.
Observation 2 (First order methods) The cost of first order methods is very low when counted on a per-iteration basis. Their strength lies in the following situations:

a) the sequence x_i is still far away from the solution,
b) the dimension n of the problem is reasonably small or the problem is well-posed,
c) or Hessian information cannot be obtained.

In practice the most efficient first order methods are the nonlinear conjugate gradient method for well-conditioned objectives and quasi-Newton methods for ill-conditioned objectives.