(a) A vector $\zeta$ belongs to $N_S^P(s)$ iff there exists $\sigma = \sigma(\zeta, s) \ge 0$ such that
$$\langle \zeta, s' - s \rangle \le \sigma \|s' - s\|^2 \quad \forall s' \in S.$$
(b) Furthermore, for any given $\delta > 0$, we have $\zeta \in N_S^P(s)$ iff there exists $\sigma = \sigma(\zeta, s) \ge 0$ such that
$$\langle \zeta, s' - s \rangle \le \sigma \|s' - s\|^2 \quad \forall s' \in S \cap B(s; \delta).$$
The only item requiring proof is the following:
1.6. Exercise. Prove that if the inequality of (b) holds for some $\sigma$ and $\delta$, then that of (a) holds for some possibly larger $\sigma$.
The previous proposition makes it evident that $N_S^P(s)$ is convex; however, it need be neither open nor closed. That $N_S^P(s)$ can be trivial (i.e., reduce to $\{0\}$) even when $S$ is a closed subset of $\mathbb{R}^n$ and $s$ lies in $\operatorname{bdry} S$ can easily be seen by considering the set
$$S := \{(x, y) \in \mathbb{R}^2 : y \ge -|x|\}.$$
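The triviality of the proximal normal cone at the corner can be probed numerically. The sketch below is our own construction, not from the text: the helper `proj_corner_set` and its brute-force boundary grid are assumptions made for illustration. It confirms that sample points below the corner never project onto the origin.

```python
import numpy as np

def proj_corner_set(p, grid=np.linspace(-3.0, 3.0, 120001)):
    """Nearest point in S = {(x, y): y >= -|x|} to p, by brute force.

    If p lies in S it is its own nearest point; otherwise the nearest
    point lies on the boundary y = -|x|, which we scan on a fine grid.
    """
    x0, y0 = p
    if y0 >= -abs(x0):
        return np.array(p, dtype=float)
    bdry = np.stack([grid, -np.abs(grid)], axis=1)
    d2 = np.sum((bdry - np.array(p, dtype=float)) ** 2, axis=1)
    return bdry[np.argmin(d2)]

# Points straight below the corner project to (|y0|/2, -|y0|/2) (or its
# mirror image), never to (0, 0) itself -- consistent with N_S^P(0,0) = {0}.
for y0 in (-1.0, -0.5, -0.1):
    q = proj_corner_set((0.0, y0))
    assert np.linalg.norm(q) > 1e-3
```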
There are no points outside $S$ whose closest point in $S$ is $(0, 0)$ (to put this another way: no ball whose interior fails to intersect $S$ can have $(0, 0)$ on its boundary). Thus $N_S^P(0, 0) = \{0\}$. A slightly more complicated but smoother example is the following:
1.7. Exercise. Consider $S$ defined as
$$S := \{(x, y) \in \mathbb{R}^2 : y \ge -|x|^{3/2}\}.$$
Show that for $(x, y) \in \operatorname{bdry} S$, $N_S^P(x, y) = \{(0, 0)\}$ iff $(x, y) = (0, 0)$.
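One half of this exercise can be checked numerically: at the origin, even the natural candidate $\zeta = (0, -1)$ fails the proximal normal inequality of Proposition 1.5 for every $\sigma$, because the cusp $y = -|x|^{3/2}$ is flatter than any parabola. A sketch (our own test harness, not part of the text):

```python
import numpy as np

# S = {(x, y): y >= -|x|**1.5}. At s = (0, 0), test the candidate
# normal zeta = (0, -1) against the proximal normal inequality
#     <zeta, s' - s> <= sigma * |s' - s|**2   for all s' in S,
# using boundary points s' = (x, -x**1.5). The left side is x**1.5,
# the right side is sigma * (x**2 + x**3); since x**1.5 / x**2 -> oo
# as x -> 0, the inequality fails for every fixed sigma.
def inequality_fails(sigma, xs=np.geomspace(1e-16, 1.0, 400)):
    lhs = xs ** 1.5                      # <(0, -1), (x, -x**1.5)>
    rhs = sigma * (xs ** 2 + xs ** 3)    # sigma * |s'|**2
    return bool(np.any(lhs > rhs))

assert all(inequality_fails(sigma) for sigma in (1.0, 1e3, 1e6))
```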
1.8. Exercise. Let $X = X_1 \oplus X_2$ be an orthogonal decomposition, and suppose $S \subseteq X$ is closed, $s \in S$, and $\zeta \in N_S^P(s)$. Write $s = (s_1, s_2)$ and $\zeta = (\zeta_1, \zeta_2)$ according to the given decomposition, and define $S_1 := \{s_1' : (s_1', s_2) \in S\}$, and similarly define $S_2$. Show that $\zeta_i \in N_{S_i}^P(s_i)$, $i = 1, 2$.
The next two propositions illustrate that the concept of a proximal normal generalizes two classical definitions: that of a normal direction to a $C^2$ manifold as defined in differential geometry, and that of a normal vector in the context of convex analysis.
Consider a closed subset $S$ of $\mathbb{R}^n$ that admits a representation of the form
$$S = \{x \in \mathbb{R}^n : h_i(x) = 0,\ i = 1, 2, \ldots, k\}, \qquad (1)$$
where each $h_i : \mathbb{R}^n \to \mathbb{R}$ is $C^1$. If the vectors $\{\nabla h_i(s)\}$ $(i = 1, 2, \ldots, k)$ are linearly independent at each $s \in S$, then $S$ is a $C^1$ manifold of dimension $n - k$.
1.9. Proposition. Let $s \in S$, where $S$ is given by (1), and assume that the vectors $\{\nabla h_i(s)\}$ $(i = 1, 2, \ldots, k)$ are linearly independent. Then:
(a) $N_S^P(s) \subseteq \operatorname{span}\{\nabla h_i(s) : i = 1, 2, \ldots, k\}$.
(b) If in addition each $h_i$ is $C^2$, then equality holds in (a).
Proof. Let $\zeta$ belong to $N_S^P(s)$. By Proposition 1.5, there exists a constant $\sigma > 0$ so that
$$\langle \zeta, s' - s \rangle \le \sigma \|s' - s\|^2$$
whenever $s'$ belongs to $S$. Put another way, this is equivalent to saying that the point $s$ minimizes the function $s' \mapsto -\langle \zeta, s' \rangle + \sigma \|s' - s\|^2$ over all points $s'$ satisfying $h_i(s') = 0$ $(i = 1, 2, \ldots, k)$. The Lagrange multiplier rule of classical calculus provides a set of scalars $\{\mu_i\}_{i=1}^k$ such that $\zeta = \sum_i \mu_i \nabla h_i(s)$, which establishes (a).
Now let $\zeta$ have the form $\sum_i \mu_i \nabla h_i(s)$, where each $h_i$ is $C^2$. Consider the $C^2$ function
$$g(x) := -\langle \zeta, x \rangle + \sum_i \mu_i h_i(x) + \sigma \|x - s\|^2,$$
where $\sigma > 0$. Then $g'(s) = 0$, and for $\sigma$ sufficiently large we have $g''(s) > 0$ (positive definite), from which it follows that $g$ admits a local minimum at $s$. Consequently, if $s'$ is near enough to $s$ and satisfies $h_i(s') = 0$ for each $i$, we have
$$g(s') = -\langle \zeta, s' \rangle + \sigma \|s' - s\|^2 \ge g(s) = -\langle \zeta, s \rangle.$$
This confirms the proximal normal inequality and completes the proof.
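To make part (b) concrete, take the unit circle in $\mathbb{R}^2$: $k = 1$, $h(x) = \|x\|^2 - 1$, $\nabla h(s) = 2s$, so the proposition predicts $N_S^P(s) = \operatorname{span}\{s\}$. The sketch below is our own numerical check, not from the text; the constant $\sigma = |\mu|/2$ is a hand-computed choice that makes the proximal normal inequality hold for $\zeta = \mu s$.

```python
import numpy as np

# Unit circle S = {h = 0} with h(x) = |x|**2 - 1: grad h(s) = 2s, so
# Proposition 1.9 gives N_S^P(s) = span{s}. Verify the proximal
# normal inequality <zeta, s' - s> <= sigma * |s' - s|**2 for
# zeta = mu * s with sigma = |mu| / 2 (hand-computed: both sides
# reduce to multiples of 1 - <s, s'>).
rng = np.random.default_rng(0)
s = np.array([np.cos(0.7), np.sin(0.7)])                 # point of S
thetas = rng.uniform(0.0, 2.0 * np.pi, 1000)
sp = np.stack([np.cos(thetas), np.sin(thetas)], axis=1)  # points s' in S

for mu in (-2.0, -0.5, 0.5, 3.0):
    zeta = mu * s
    lhs = sp @ zeta - zeta @ s                           # <zeta, s' - s>
    rhs = (abs(mu) / 2.0) * np.sum((sp - s) ** 2, axis=1)
    assert np.all(lhs <= rhs + 1e-9)
```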
The special case in which S is convex is an important one.
1.10. Proposition. Let $S$ be closed and convex. Then:
(a) $\zeta \in N_S^P(s)$ iff
$$\langle \zeta, s' - s \rangle \le 0 \quad \forall s' \in S.$$
(b) If $X$ is finite dimensional and $s \in \operatorname{bdry}(S)$, then $N_S^P(s) \ne \{0\}$.
Proof. The inequality in (a) holds iff the proximal normal inequality holds with $\sigma = 0$. Hence the "if" statement is immediate from Proposition 1.5(a).
To see the converse, let $\zeta \in N_S^P(s)$ and $\sigma > 0$ be chosen as in the proximal normal inequality. Let $s'$ be any point in $S$. Since $S$ is convex, the point $\tilde{s} := s + t(s' - s) = t s' + (1 - t)s$ also belongs to $S$ for each $t \in (0, 1)$. The proximal normal inequality applied to $\tilde{s}$ gives
$$\langle \zeta, t(s' - s) \rangle \le \sigma t^2 \|s' - s\|^2.$$
Dividing across by $t$ and letting $t \downarrow 0$ yields the desired inequality.
To prove (b), let $\{s_i\}$ be a sequence in $S$ converging to $s$ so that $N_S^P(s_i) \ne \{0\}$ for all $i$. Such a sequence exists by Exercise 1.2(d). Let $\zeta_i \in N_S^P(s_i)$ satisfy $\|\zeta_i\| = 1$, and passing to a subsequence if necessary, assume that $\zeta_i \to \zeta$ as $i \to \infty$, and note that $\|\zeta\| = 1$. By part (a), we have
$$\langle \zeta_i, s' - s_i \rangle \le 0 \quad \forall s' \in S.$$
Letting $i \to \infty$ yields
$$\langle \zeta, s' - s \rangle \le 0 \quad \forall s' \in S,$$
which, again by part (a), says that $\zeta \in N_S^P(s)$.
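As a concrete instance of part (a), consider the closed unit disk in $\mathbb{R}^2$ and a boundary point $s$: by Cauchy–Schwarz, $\zeta = s$ satisfies $\langle \zeta, s' - s \rangle = \langle s, s' \rangle - 1 \le 0$ for every $s'$ in the disk. A quick numerical confirmation (our own sketch, not from the text):

```python
import numpy as np

# Proposition 1.10(a) for the closed unit disk: at the boundary point
# s, the vector zeta = s is a proximal normal, and the convex-case
# characterization holds with sigma = 0: <zeta, s' - s> <= 0 on S.
rng = np.random.default_rng(1)
s = np.array([0.6, 0.8])                                # |s| = 1
pts = rng.uniform(-1.0, 1.0, size=(5000, 2))
disk = pts[np.linalg.norm(pts, axis=1) <= 1.0]          # samples of s'

lhs = disk @ s - s @ s                                  # <s, s'> - 1
assert np.all(lhs <= 1e-12)                             # Cauchy-Schwarz
```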
Let $0 \ne \zeta \in X$ and $r \in \mathbb{R}$. A hyperplane (with associated normal vector $\zeta$) is any set of the form $\{x \in X : \langle \zeta, x \rangle = r\}$, and a half-space is a set of the form $\{x \in X : \langle \zeta, x \rangle \le r\}$. Proposition 1.10(b) is a separation theorem, for it says that each point in the boundary of a convex set lies in some hyperplane, with the set itself lying in one of the associated half-spaces.
An example given in the end-of-chapter problems shows that this fact fails in general when X is infinite dimensional, although separation does hold under additional hypotheses.
We now turn our attention from sets to functions.
2 Proximal Subgradients
We begin by establishing some notation and recalling some facts about functions.
A quite useful convention prevalent in the theories of integration and optimization, which we will also adopt, is to allow for functions $f : X \to (-\infty, +\infty]$; that is, functions which are extended real-valued. As we will see, there are many advantages in allowing $f$ to actually attain the value $+\infty$ at a given point. To single out those points at which $f$ is not $+\infty$, we define the (effective) domain as the set
$$\operatorname{dom} f := \{x \in X : f(x) < \infty\}.$$
The graph and epigraph of $f$ are given, respectively, by
$$\operatorname{gr} f := \{(x, f(x)) : x \in \operatorname{dom} f\}, \qquad \operatorname{epi} f := \{(x, r) \in \operatorname{dom} f \times \mathbb{R} : r \ge f(x)\}.$$
Just as sets are customarily assumed to be closed, the usual background assumption on $f$ is that of lower semicontinuity. A function $f : X \to (-\infty, +\infty]$ is lower semicontinuous at $x$ provided that
$$\liminf_{x' \to x} f(x') \ge f(x).$$
This condition is clearly equivalent to saying that for all $\varepsilon > 0$, there exists $\delta > 0$ so that $y \in B(x; \delta)$ implies $f(y) \ge f(x) - \varepsilon$, where, as usual, $\infty - r$ is interpreted as $\infty$ when $r \in \mathbb{R}$.
Complementary to lower semicontinuity is upper semicontinuity: $f$ is upper semicontinuous at $x$ if $-f$ is lower semicontinuous at $x$. Lower semicontinuous functions are featured prominently in our development, but of course our results have upper semicontinuous analogues, although we will rarely state them. This preference for lower semicontinuity explains why $+\infty$ is allowed as a function value and not $-\infty$.
As is customary, we say that a function $f$ is continuous at $x \in X$ provided it is finite-valued near $x$ and for all $\varepsilon > 0$, there exists $\delta > 0$ so that $y \in B(x; \delta)$ implies $|f(x) - f(y)| \le \varepsilon$. For finite-valued $f$, this is equivalent to saying that $f$ is both lower and upper semicontinuous at $x$. If $f$ is lower semicontinuous (respectively, upper semicontinuous, continuous) at each point $x$ in an open set $U \subset X$, then $f$ is called lower semicontinuous (respectively, upper semicontinuous, continuous) on $U$.
To keep certain pathological functions from entering the discussion, we designate by $\mathcal{F}(U)$, where $U \subseteq X$ is open, the class of all functions $f : X \to (-\infty, \infty]$ which are lower semicontinuous on $U$ and such that $\operatorname{dom} f \cap U \ne \emptyset$. If $U = X$, then we simply write $\mathcal{F}$ for $\mathcal{F}(X)$.
Let $S$ be a subset of $X$. The indicator function of $S$, denoted either by $I_S(\cdot)$ or $I(\cdot\,; S)$, is the extended-valued function defined by
$$I_S(x) := \begin{cases} 0 & \text{if } x \in S, \\ +\infty & \text{otherwise.} \end{cases}$$
Let $U \subset X$ be an open convex set. A function $f : X \to (-\infty, \infty]$ is said to be convex on $U$ provided
$$f(tx + (1 - t)y) \le t f(x) + (1 - t) f(y) \quad \forall x, y \in U,\ 0 < t < 1.$$
A function $f$ which is convex on $X$ is simply said to be convex. Note that $\operatorname{dom} f$ is necessarily a convex set if $f$ is convex.
The following exercise contains some elementary properties of lower semicontinuous and convex functions. Parts (a) and (b) in particular help to demonstrate why the epigraph, rather than the graph, of a function plays the fundamental role in the analysis of lower semicontinuous functions.
Note that $X \times \mathbb{R}$, the space in which $\operatorname{epi} f$ lives, is always viewed as a Hilbert space with inner product
$$\langle (x, r), (x', r') \rangle := \langle x, x' \rangle + r r'.$$
2.1. Exercise. Suppose $f : X \to (-\infty, +\infty]$.
(a) Show that $f$ is lower semicontinuous on $X$ iff $\operatorname{epi} f$ is closed in $X \times \mathbb{R}$, and this is true iff each $r$-level set $\{x : f(x) \le r\}$, $r \in \mathbb{R}$, is closed. Note that $\operatorname{gr} f$ need not be closed when $f$ is lower semicontinuous.
(b) Show that $f$ is convex on $X$ iff $\operatorname{epi} f$ is a convex subset of $X \times \mathbb{R}$.
(c) When $f$ is an indicator function, $f = I_S$, then $f \in \mathcal{F}(X)$ iff $S$ is nonempty and closed, and $f$ is convex iff $S$ is convex.
(d) Suppose that $(\zeta, -\lambda) \in X \times \mathbb{R}$ belongs to $N_{\operatorname{epi} f}^P(x, r)$ for some $(x, r) \in \operatorname{epi} f$, where $f \in \mathcal{F}$. Prove that $\lambda \ge 0$, that $r = f(x)$ if $\lambda > 0$, and that $\lambda = 0$ if $r > f(x)$. In this last case, show that $(\zeta, 0) \in N_{\operatorname{epi} f}^P(x, f(x))$.
(e) Give an example of a continuous $f \in \mathcal{F}(\mathbb{R})$ such that at some point $x$ we have $(1, 0) \in N_{\operatorname{epi} f}^P(x, f(x))$.
(f) If $S = \operatorname{epi} f$, where $f \in \mathcal{F}$, prove that for all $x$, $d_S(x, r)$ is nonincreasing as a function of $r$.
A vector $\zeta \in X$ is called a proximal subgradient (or $P$-subgradient) of a lower semicontinuous function $f$ at $x \in \operatorname{dom} f$ provided that
$$(\zeta, -1) \in N_{\operatorname{epi} f}^P(x, f(x)).$$
The set of all such $\zeta$ is denoted $\partial_P f(x)$, and is referred to as the proximal subdifferential, or $P$-subdifferential. Note that because a cone is involved, if $\alpha > 0$ and $(\zeta, -\alpha) \in N_{\operatorname{epi} f}^P(x, f(x))$, then $\zeta/\alpha \in \partial_P f(x)$. It also follows immediately from our study of the proximal normal cone that $\partial_P f(x)$ is convex; however, it is not necessarily open, closed, or nonempty. The function $f : \mathbb{R} \to \mathbb{R}$ defined by $f(x) = -|x|$ is a simple example of a continuous function having $\partial_P f(0) = \emptyset$.
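The emptiness of $\partial_P f(0)$ for $f(x) = -|x|$ can be seen numerically via the quadratic lower estimate $f(y) \ge f(x) + \langle \zeta, y - x \rangle - \sigma |y - x|^2$ (for $y$ near $x$) that characterizes proximal subgradients, a characterization developed later in this chapter. The sketch below is our own harness; it shows the estimate failing at $x = 0$ for a range of $\zeta$ and $\sigma$:

```python
import numpy as np

# f(x) = -|x| has no proximal subgradient at x = 0: for every zeta
# and sigma >= 0 there are points y arbitrarily near 0 violating
#     f(y) >= f(0) + zeta * y - sigma * y**2.
f = lambda y: -np.abs(y)

def admits_estimate(zeta, sigma, ys):
    return bool(np.all(f(ys) >= f(0.0) + zeta * ys - sigma * ys ** 2))

# Symmetric sample points shrinking toward 0 from both sides.
ys = np.concatenate([-np.geomspace(1e-9, 1e-2, 100),
                     np.geomspace(1e-9, 1e-2, 100)])
for zeta in (-2.0, -1.0, 0.0, 1.0, 2.0):
    for sigma in (1.0, 100.0, 1e6):
        assert not admits_estimate(zeta, sigma, ys)
```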
Figure 1.3 illustrates the epigraph of a function $f$ together with some vectors of the form $(\zeta, -1)$, $\zeta \in \partial_P f(x)$. There exists a single $P$-subgradient at $x_1$, as well as at all the unlabeled points. At $x_2$, there are no $P$-subgradients, and there are multiple $P$-subgradients at the three remaining labeled points. At $x_4$, the proximal subdifferential is an unbounded set; the (horizontal) dashed arrow here is not associated with a $P$-subgradient, although it does represent a $P$-normal to $\operatorname{epi} f$.
The indicator function is one of several ways in which we pass between sets and functions. It is also useful in optimization: note that minimizing $f$ over a set $S$ is equivalent to minimizing the function $f + I_S$ globally.
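A toy illustration of this equivalence on a grid (our own sketch; the quadratic $f$ and the interval $S$ are arbitrary choices made for illustration):

```python
import numpy as np

# Minimizing f over S is the same problem as minimizing f + I_S over
# all of X: a grid check with f(x) = (x - 3)**2 and S = [-1, 1].
xs = np.linspace(-5.0, 5.0, 10001)
fvals = (xs - 3.0) ** 2
in_S = np.abs(xs) <= 1.0
indicator = np.where(in_S, 0.0, np.inf)            # I_S on the grid

unconstrained = xs[np.argmin(fvals + indicator)]   # argmin of f + I_S
constrained = xs[in_S][np.argmin(fvals[in_S])]     # argmin of f over S
assert unconstrained == constrained
```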
FIGURE 1.3. The epigraph of a function.
2.2. Exercise. Let $f = I_S$. Prove that for $x \in S$ we have
$$\partial_P f(x) = \partial_P I_S(x) = N_S^P(x).$$
The main theme of this chapter is to develop the calculus rules governing the proximal subgradient. We will see that, to a surprising degree, many of the usual properties enjoyed by the classical derivative carry over to the proximal subgradient $\partial_P f(x)$. As a first illustration of this, we give an exercise which echoes the vanishing of the derivative at a local minimum.
A point $x \in X$ is said to attain the minimum of $f$ on $S$ provided $x \in S \cap \operatorname{dom} f$ and
$$f(x) \le f(y) \quad \forall y \in S.$$
If there exists an open neighborhood $U$ of $x \in X$ on which $x$ attains the minimum of $f$, then $x$ is said to be a local minimum of $f$. If $x$ attains the minimum of $f$ on $U = X$, then $x$ is called a global minimum.
2.3. Exercise. Suppose $f \in \mathcal{F}$.
(a) Show that if $f$ attains a local minimum at $x$, then $0 \in \partial_P f(x)$.
(b) Suppose $S \subset X$ is compact and satisfies $S \cap \operatorname{dom} f \ne \emptyset$. Show that $f$ is bounded below on $S$ and attains its minimum over $S$.
Classical Derivatives
Before developing further properties of $P$-subgradients, we need to recall some facts about classical derivatives. We will do so rather quickly. The directional derivative of $f$ at $x \in \operatorname{dom} f$ in the direction $v \in X$ is defined as
$$f'(x; v) := \lim_{t \downarrow 0} \frac{f(x + tv) - f(x)}{t}, \qquad (1)$$
when the limit exists. We say that $f$ is Gâteaux differentiable at $x$ provided the limit in (1) exists for all $v \in X$, and there exists a (necessarily unique) element $f'_G(x) \in X$ (called the Gâteaux derivative) that satisfies
$$f'(x; v) = \langle f'_G(x), v \rangle \quad \forall v \in X. \qquad (2)$$
A function may possess a directional derivative at $x$ in every direction and yet fail to possess a Gâteaux derivative, as is evidenced by $f(x) = \|x\|$ at $x = 0$. In this case, we have $f'(0; v) = \|v\|$. Also, a lower semicontinuous function may have a Gâteaux derivative at a point $x$ but not be continuous there.
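The norm example can be checked directly with difference quotients: every directional derivative of $f(x) = \|x\|$ at $0$ exists and equals $\|v\|$, but $v \mapsto \|v\|$ is not linear, so no single element can represent it as in (2). A small numerical sketch (ours, not from the text):

```python
import numpy as np

# f(x) = ||x|| at x = 0: the one-sided difference quotient gives
# f'(0; v) = ||v|| for every direction v, but v -> ||v|| is not
# linear, so there is no Gateaux derivative at 0.
f = lambda x: np.linalg.norm(x)

def dir_deriv(v, t=1e-8):
    v = np.asarray(v, dtype=float)
    return (f(t * v) - f(np.zeros_like(v))) / t

assert abs(dir_deriv([3.0, 4.0]) - 5.0) < 1e-6
# Additivity fails: f'(0; v) + f'(0; -v) = 2||v||, not 0.
assert dir_deriv([1.0, 0.0]) + dir_deriv([-1.0, 0.0]) > 1.9
```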
Suppose that (2) holds at a point $x$, and in addition that the convergence in (1) is uniform with respect to $v$ in bounded subsets of $X$. We then say that $f$ is Fréchet differentiable at $x$, and in this case write $f'(x)$ (the Fréchet derivative) in place of $f'_G(x)$. Equivalently, this means that for all $r > 0$ and $\varepsilon > 0$, there exists $\delta > 0$ so that
$$\left| \frac{f(x + tv) - f(x)}{t} - \langle f'(x), v \rangle \right| < \varepsilon$$
holds for all $0 < |t| < \delta$ and $\|v\| \le r$.
The two notions of differentiability are not equivalent, even in finite dimensions. We can easily show that Fréchet differentiability at $x$ implies continuity at $x$, which is not the case for Gâteaux differentiability.
Many of the elementary properties of the derivative encountered in the multivariate calculus (i.e., when $X = \mathbb{R}^n$) have exact analogues using either Fréchet or Gâteaux derivatives, where $f'$ or $f'_G$ take the place of the usual gradient $\nabla f$. To illustrate in some detail, suppose $f, g : X \to \mathbb{R}$ have Fréchet derivatives at $x \in X$. Then $f \pm g$, $fg$, and $f/g$ (with $g(x) \ne 0$) all have Fréchet derivatives at $x$ obeying the classical rules:
$$(f \pm g)'(x) = f'(x) \pm g'(x),$$
$$(fg)'(x) = f'(x)g(x) + f(x)g'(x),$$
$$\left( \frac{f}{g} \right)'(x) = \frac{f'(x)g(x) - f(x)g'(x)}{g^2(x)}.$$
The proofs of these facts are the same as in the classical case.
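A quick sanity check of, say, the quotient rule in the scalar case $X = \mathbb{R}$ (our own spot check with $f = \sin$, $g = \exp$; the step size is an arbitrary choice):

```python
import numpy as np

# Quotient rule spot check at x = 0.7 with f = sin, g = exp:
# (f/g)'(x) = (f'(x) g(x) - f(x) g'(x)) / g(x)**2, compared with a
# central difference approximation of d/dx [sin(x) / exp(x)].
x, h = 0.7, 1e-6
formula = (np.cos(x) * np.exp(x) - np.sin(x) * np.exp(x)) / np.exp(x) ** 2
numeric = (np.sin(x + h) / np.exp(x + h)
           - np.sin(x - h) / np.exp(x - h)) / (2.0 * h)
assert abs(formula - numeric) < 1e-8
```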
The Mean Value Theorem can be stated as follows: suppose $f \in \mathcal{F}(X)$ is Gâteaux differentiable on an open neighborhood that contains the line segment $[x, y] := \{tx + (1 - t)y : 0 \le t \le 1\}$, where $x, y \in X$. That is, there exists an open set $U$ containing the line segment $[x, y]$ such that $f$ is differentiable at every point of $U$. Then there exists a point $z := tx + (1 - t)y$, $0 < t < 1$, so that
$$f(y) - f(x) = \langle f'_G(z), y - x \rangle.$$
A proof of the Mean Value Theorem can be given by applying the classical one-dimensional mean value theorem to the function $g : [0, 1] \to \mathbb{R}$ defined by $g(t) = f(x + t(y - x))$.
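The following sketch carries out this statement numerically for $f(x_0, x_1) = x_0^2 + x_1^3$ on $\mathbb{R}^2$, scanning the segment for an admissible $z$; the test function and the scan tolerance are our own choices, not from the text.

```python
import numpy as np

# Mean Value Theorem check for f(x) = x0**2 + x1**3 on R^2: find
# z = x + t (y - x), 0 < t < 1, with f(y) - f(x) = <grad f(z), y - x>.
f = lambda p: p[0] ** 2 + p[1] ** 3
grad = lambda p: np.array([2.0 * p[0], 3.0 * p[1] ** 2])

x, y = np.array([0.0, 0.0]), np.array([1.0, 2.0])
target = f(y) - f(x)                                   # = 9
ts = np.linspace(0.0, 1.0, 100001)
zs = x + ts[:, None] * (y - x)
# <grad f(z), y - x> at every grid point of the segment.
vals = np.stack([2.0 * zs[:, 0], 3.0 * zs[:, 1] ** 2], axis=1) @ (y - x)
t_star = ts[np.argmin(np.abs(vals - target))]
z = x + t_star * (y - x)

assert 0.0 < t_star < 1.0
assert abs(grad(z) @ (y - x) - target) < 1e-3
```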
Another useful result is the Chain Rule. In order to state it, we first need to extend the above notions of differentiability to maps between two Hilbert spaces. Suppose $X_1$ and $X_2$ are Hilbert spaces with norms $\|\cdot\|_1$ and $\|\cdot\|_2$, respectively, and suppose $F : X_1 \to X_2$ is a mapping between these spaces. We write $\mathcal{L}(X_1, X_2)$ for the space of bounded linear transformations from $X_1$ to $X_2$ endowed with the usual operator norm. The scalar case $X_2 = \mathbb{R}$ was discussed above, in which case $\mathcal{L}(X_1, \mathbb{R})$ was identified with $X_1$ in the usual way.
Let $x \in X_1$. The Gâteaux derivative, should it exist, of $F$ at $x$ is an element $F'_G(x) \in \mathcal{L}(X_1, X_2)$ that satisfies
$$\lim_{t \downarrow 0} \left\| \frac{F(x + tv) - F(x)}{t} - F'_G(x)(v) \right\|_2 = 0$$
for all $v \in X_1$. Should in addition the above limit hold uniformly over $v$ in bounded sets of $X_1$, then $F$ is Fréchet differentiable and we write $F'(x)$ in place of $F'_G(x)$.
As in the scalar case, the derivative of the sum of two functions mapping $X_1$ to $X_2$ is the sum of the derivatives. Let us now consider the Chain Rule. Suppose $X_1$, $X_2$, and $X_3$ are all Hilbert spaces, and $F : X_1 \to X_2$, $G : X_2 \to X_3$. Assume that $F$ is Fréchet differentiable at $x \in X_1$, and $G$ is Fréchet differentiable at $F(x) \in X_2$. Then the composition $G \circ F : X_1 \to X_3$ is Fréchet differentiable at $x$ and
$$(G \circ F)'(x) = G'(F(x)) \circ F'(x),$$
where $G'(F(x)) \circ F'(x) \in \mathcal{L}(X_1, X_3)$ signifies the composition of $F'(x)$ with $G'(F(x))$.
Suppose $U \subseteq X$ is open and $f : U \to \mathbb{R}$ is Fréchet differentiable on $U$. If $f'(\cdot) : U \to X$ is continuous on $U$, then we say that $f$ is $C^1$ on $U$, and write $f \in C^1(U)$. It turns out that if $f$ is Gâteaux differentiable on $U$ with a continuous derivative there, then $f \in C^1(U)$. Now suppose further that the map $f'(\cdot) : U \to X$ is itself Fréchet differentiable on $U$, with its derivative at $x \in U$ denoted by $f''(x) \in \mathcal{L}(X, X)$ (in the multivariate calculus, $f''(x)$ is the Hessian). For each $x \in U$, $f$ then admits a local second-order Taylor expansion with remainder, which means there exists a neighborhood $B(x; \eta)$ of $x$ so that for every $y \in B(x; \eta)$ we have
$$f(y) = f(x) + \langle f'(x), y - x \rangle + \tfrac{1}{2} \langle f''(z)(y - x), y - x \rangle,$$
where $z$ is some element on the line segment connecting $x$ and $y$. We note that if the norms of $f''(y)$ are bounded over $y \in B(x; \eta)$ by the constant $2\sigma > 0$, then this implies
$$f(y) \ge f(x) + \langle f'(x), y - x \rangle - \sigma \|y - x\|^2 \qquad (3)$$
for all $y \in B(x; \eta)$.
If it should also happen that $f'' : X \to \mathcal{L}(X, X)$ is continuous on $U$, then $f$ is said to be twice continuously differentiable on $U$, and we write $f \in C^2(U)$, or simply $f \in C^2$ if $U = X$. We note that if $f \in C^2(U)$, then for each $x \in U$ there exists a neighborhood $B(x; \eta)$ and a constant $\sigma$ so that (3) holds, since the continuity of $f''$ at $x$ implies that the norms of $f''$ are bounded in a neighborhood of $x$.
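For instance, with $f = \cos$ on $\mathbb{R}$ we have $|f''| = |\cos| \le 1 = 2\sigma$ for $\sigma = \tfrac{1}{2}$, and inequality (3) then holds, here in fact for all $y$, not only near $x$. A numerical confirmation (our own sketch):

```python
import numpy as np

# Inequality (3) for f = cos on R: since |f''| = |cos| <= 1 = 2*sigma
# with sigma = 1/2, we expect
#     f(y) >= f(x) + f'(x)(y - x) - (1/2)(y - x)**2.
x = 0.3
ys = np.linspace(x - 10.0, x + 10.0, 200001)
lower = np.cos(x) - np.sin(x) * (ys - x) - 0.5 * (ys - x) ** 2
assert np.all(np.cos(ys) >= lower - 1e-12)
```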