5.3.2 Thresholding Algorithms
The algorithms presented at the end of Section 5.3.1.2 both require inversion of a potentially massive matrix at each iteration. While fast methods for computing this inverse exist, we are now going to consider a class of algorithms that avoids this necessity. These algorithms will be variations on the iteration
\hat{x}^{k+1} = \eta\left\{ \hat{x}^k - \mu A^H \left( A\hat{x}^k - y \right) \right\}   (5.27)

where η{·} is a thresholding function. The motivation for considering an algorithm of this form becomes apparent if we examine the cost function for QP_λ. First, for ease of discussion, let us define two functions that represent the two terms in this cost function.
Adopting the notation used in [79],

f(x) = \left\| Ax - y \right\|_2^2
g(x) = \lambda \left\| x \right\|_1
24 Imagine, for example, the phase response of a flat plate to see intuitively why penalizing variation in the phase would be problematic.
25 MM relies on replacing the cost function of interest with a surrogate function that is strictly greater (hence majorization) but easier to minimize (hence minimization). The idea is to majorize the function near the current estimate, minimize the resulting approximation, and repeat. The perhaps more familiar expectation maximization algorithm is actually a special case of MM, as detailed in an excellent tutorial [78].
26 A particular step size must be selected to obtain this form for the algorithm in [75]. It is also worth emphasizing that this version can handle the more general problem with λ₂ ≠ 0. The resulting algorithm simply uses a more complicated expression for the function h.
The function f(·) is differentiable, and we can easily obtain its gradient as

\nabla f(x) = 2 A^H (Ax - y)
Notice that this term appears in our generic thresholding iteration (5.27). Indeed, we can rewrite this iteration as
\hat{x}^{k+1} = \eta\left\{ \hat{x}^k - \mu \nabla f(\hat{x}^k) \right\}   (5.28)

Now, the nature of the thresholding algorithms becomes obvious. At each iteration, we take a step of size μ^27 in the negative gradient direction of the ℓ₂ portion of our cost function. We then apply a thresholding operation that tweaks these steps to account for the ℓ₁ portion of the cost function encoded in g(·).
5.3.2.1 Soft Thresholding
We will consider two choices for the thresholding function that yield different performance characteristics and guarantees. The first is the soft thresholding operation defined for a scalar as
\eta_s(x, \alpha) =
\begin{cases}
\dfrac{|x| - \alpha}{|x|} \, x, & \text{if } |x| \ge \alpha \\
0, & \text{otherwise}
\end{cases}
   (5.29)

In other words, the amplitude of the scalar is reduced by α or set to zero if the amplitude is already α or less. Notice that we can apply this operation to real or complex data.
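To make the operation concrete, the following is a minimal NumPy sketch of the component-wise soft threshold in (5.29); the function name soft_threshold and the guard against division by zero are our own illustrative choices, not part of the original development.

```python
import numpy as np

def soft_threshold(x, alpha):
    """Component-wise soft threshold eta_s(x, alpha) from (5.29).

    Shrinks the magnitude of each entry by alpha and zeroes entries whose
    magnitude is already alpha or less. Only the magnitude is modified, so
    the sign (real case) or phase (complex case) is preserved.
    """
    x = np.asarray(x)
    mag = np.abs(x)
    # Scale factor (|x| - alpha)/|x| where |x| > alpha, and 0 otherwise.
    # np.maximum guards against division by zero for entries that are zeroed.
    scale = np.where(mag > alpha,
                     (mag - alpha) / np.maximum(mag, np.finfo(float).tiny),
                     0.0)
    return scale * x
```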
When acting on a vector, the function η_s(·, α) operates component-wise. The first soft thresholding algorithm we will consider is the iterative shrinkage-thresholding algorithm (ISTA), which has been developed and studied by several authors. We will follow the treatment provided in [79], which demonstrates that ISTA is an MM algorithm. To see this, consider the majorization^28 of f(x) + g(x) at the point b given by
Q_P(x, b) = f(b) + \operatorname{Re}\left\{ \langle x - b, \nabla f(b) \rangle \right\} + \frac{P}{2} \left\| x - b \right\|^2 + g(x)   (5.30)

where P is twice the maximum eigenvalue of A^H A. This value is the smallest Lipschitz constant of ∇f(x), which can be easily determined using the power iteration [20] without explicit access to this potentially enormous matrix.
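As an illustration of how P might be obtained in practice, here is a hedged sketch of the power iteration applied to A^H A through operator handles; the callables A_op and AH_op, the vector length n, and the iteration count are illustrative assumptions rather than details from the text.

```python
import numpy as np

def estimate_P(A_op, AH_op, n, iters=50, seed=0):
    """Estimate P = 2 * lambda_max(A^H A) by the power iteration.

    A_op(v) and AH_op(v) apply A and A^H to a vector, so the potentially
    enormous matrix A^H A is never formed explicitly.
    """
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(n) + 1j * rng.standard_normal(n)
    v /= np.linalg.norm(v)
    lam = 0.0
    for _ in range(iters):
        w = AH_op(A_op(v))        # apply A^H A via two operator products
        lam = np.linalg.norm(w)   # converges to lambda_max(A^H A)
        v = w / lam
    return 2.0 * lam
```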
We can use Q_P(x, b) to majorize our cost function at b = x̂^k. The next step in defining an MM algorithm is to minimize this majorization. We can compute the unique minimizer of this function for a fixed b as
\operatorname*{argmin}_x Q_P(x, b)
= \operatorname*{argmin}_x \operatorname{Re}\left\{ \langle x - b, \nabla f(b) \rangle \right\} + \frac{P}{2} \left\| x - b \right\|^2 + g(x)

= \operatorname*{argmin}_x g(x) + \frac{P}{2} \left( \langle x - b, x - b \rangle + \left\langle x - b, \frac{\nabla f(b)}{P} \right\rangle + \left\langle \frac{\nabla f(b)}{P}, x - b \right\rangle + \left\langle \frac{\nabla f(b)}{P}, \frac{\nabla f(b)}{P} \right\rangle \right)

= \operatorname*{argmin}_x g(x) + \frac{P}{2} \left\| x - \left( b - \frac{1}{P} \nabla f(b) \right) \right\|_2^2

= \operatorname*{argmin}_x \lambda \left\| x \right\|_1 + \frac{P}{2} \left\| x - \left( b - \frac{2}{P} A^H (Ab - y) \right) \right\|_2^2   (5.31)

= \eta_s\left( b - \frac{2}{P} A^H (Ab - y), \frac{\lambda}{P} \right)   (5.32)
27 The step size μ can be chosen using a variety of adaptive methods.
28 It is verified in [79, Lemma 2.1] that Q_P(x, b) majorizes f(x) + g(x) at b.
Notice that the term in parentheses in (5.31) is a constant for fixed b. Thus, the minimization in (5.31) can be carried out component-wise, yielding a simple analytical solution that corresponds to the application of the soft threshold. Combining these ideas, we obtain ISTA
\hat{x}^{k+1} = \eta_s\left( \hat{x}^k - \frac{2}{P} A^H (A\hat{x}^k - y), \frac{\lambda}{P} \right)
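Putting the pieces together, a minimal sketch of the resulting ISTA loop might look as follows. It reuses the soft_threshold helper and operator handles sketched earlier, P could come from an estimate such as the power-iteration sketch above, and the fixed iteration count is an arbitrary illustrative choice (a practical implementation would monitor convergence).

```python
def ista(A_op, AH_op, y, lam, P, x0, iters=200):
    """ISTA: a gradient step on f(x) = ||Ax - y||_2^2 with step size 1/P,
    followed by a soft threshold at level lam / P, as in (5.32)."""
    x = x0.copy()
    for _ in range(iters):
        x = soft_threshold(x - (2.0 / P) * AH_op(A_op(x) - y), lam / P)
    return x
```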
Unfortunately, ISTA has been shown to enjoy only a sublinear rate of convergence, that is, a rate proportional to 1/k [79, Theorem 3.1]. Beck and Teboulle [79] propose fast ISTA (FISTA), which uses a very simple modification to obtain a quadratic convergence rate. FISTA requires nearly identical computational cost, particularly for a large-scale problem, and is given by
z^1 = \hat{x}^0 = 0, \quad t^1 = 1

\hat{x}^k = \eta_s\left( z^k - \frac{2}{P} A^H (A z^k - y), \frac{\lambda}{P} \right)

t^{k+1} = \frac{1 + \sqrt{1 + 4 (t^k)^2}}{2}

z^{k+1} = \hat{x}^k + \frac{t^k - 1}{t^{k+1}} \left( \hat{x}^k - \hat{x}^{k-1} \right)
Intuitively, this algorithm uses knowledge of the previous two iterates to take faster steps toward the global minimum. No additional applications of A and AH are required com- pared with ISTA, and thus the computational cost of this modification is negligible for the large-scale problems of interest.
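For comparison with the ISTA sketch above, here is a corresponding sketch of the FISTA recursion. Again, the soft_threshold helper, the operator handles, and the fixed iteration count are illustrative assumptions rather than details from [79].

```python
import numpy as np

def fista(A_op, AH_op, y, lam, P, x0, iters=200):
    """FISTA: the ISTA shrinkage step applied at an extrapolated point z^k
    built from the previous two iterates."""
    x_prev = x0.copy()
    z = x0.copy()
    t = 1.0
    for _ in range(iters):
        x = soft_threshold(z - (2.0 / P) * AH_op(A_op(z) - y), lam / P)
        t_next = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        z = x + ((t - 1.0) / t_next) * (x - x_prev)
        x_prev, t = x, t_next
    return x_prev
```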
ISTA and FISTA^29 converge to true solutions of QP_λ at sublinear and quadratic convergence rates, respectively. Thus, these algorithms inherit the RIP-based performance guarantees already proven for minimization of this cost function, for example (5.15). Indeed, these algorithms are close cousins of the Kragh algorithm described in the previous section. The key difference is that the derivation is restricted to the p = 1 norm case to take advantage of a simple analytical result for the minimization step of the algorithm.
As a final note on these algorithms, in [61] the authors leverage the same previous work that inspired FISTA to derive the NESTA algorithm. This approach provides extremely fast computation, particularly when the forward operator A enjoys certain properties. In addition, the provided algorithm can solve more general problems, including minimization
29 We should mention that the FISTA algorithm given in [79] is more general than the result provided here, which has been specialized for our problem of interest.
of the TV norm and nondiagonal weighting on the ℓ₁ portion of the cost function considered here. Another strong contribution of [61] is an extensive numerical comparison of several leading algorithms, including examples not considered in this chapter, on a series of test problems.
5.3.2.2 Hard Thresholding
We now turn our attention toward hard thresholding algorithms. Iterative Hard Thresholding (IHT) applies the operator η_h{x, s} after each gradient step. This hard thresholding function leaves the s coefficients of x with the largest magnitudes unchanged and sets all others to zero. In particular, the IHT algorithm [80] is given by
\hat{x}^{k+1} = \eta_h\left\{ \hat{x}^k - \mu \nabla f(\hat{x}^k), s \right\}   (5.33)

By construction, every iteration produces a solution such that ||x̂^k||₀ ≤ s. Thus, if the sparsity parameter s is set too low, we are guaranteed a priori to never find the correct solution. Naturally, this choice is analogous to the choice of λ, τ, or σ when using the ℓ_p norm-based algorithms.
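A minimal sketch of IHT along the same lines is given below; hard_threshold implements η_h{·, s}, while the step size mu, the operator handles, and the iteration count are illustrative assumptions rather than details from [80].

```python
import numpy as np

def hard_threshold(x, s):
    """eta_h{x, s}: keep the s largest-magnitude entries of x, zero the rest."""
    out = np.zeros_like(x)
    keep = np.argsort(np.abs(x))[-s:]   # indices of the s largest magnitudes
    out[keep] = x[keep]
    return out

def iht(A_op, AH_op, y, s, mu, x0, iters=200):
    """IHT as in (5.33): a gradient step on f followed by hard thresholding,
    so every iterate satisfies ||x||_0 <= s."""
    x = x0.copy()
    for _ in range(iters):
        x = hard_threshold(x - 2.0 * mu * AH_op(A_op(x) - y), s)
    return x
```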
A RIP-based performance guarantee for IHT is provided in [80, Theorem 4]. Once the sparsity parameter s has been selected for IHT, then provided that R_{3s}(A) < 1/\sqrt{32} \approx 0.1768, we obtain the error bound

\left\| x_{true} - \hat{x} \right\|_2 \le 7 \left( \left\| x_{true} - x_{true}^s \right\|_2 + s^{-1/2} \left\| x_{true} - x_{true}^s \right\|_1 + \sigma \right)   (5.34)

This guarantee is very similar to the result for BPDN. However, the RIP requirement is more stringent, requiring a condition on signals of length 3s instead of 2s, and the resulting error bound is not as tight. While we do obtain a RIP condition of a similar form, it is important to note that these RIP bounds are sufficient but not necessary conditions. In addition, they are worst-case results, which may belie the performance observed for typical signals. The authors make this point at some length in [80] and provide simulations demonstrating that IHT exhibits inferior performance to BPDN and other ℓ₁ approaches when the RIP condition is violated. Thus, this simplified algorithm does come at some cost. Nonetheless, for sufficiently sparse signals, IHT performs beautifully and with excellent computational efficiency. Indeed, the need to store only a small number of nonzero coefficients at each iteration is particularly convenient for very large-scale problems. Recent work [81] has developed variations of IHT leveraging ideas from soft-thresholding schemes like FISTA, with very promising numerical performance and some analytical performance guarantees.