2 Proximal Subgradients

second-order Taylor expansion with remainder, which means there exists a neighborhood $B(x;\eta)$ of $x$ so that for every $y \in B(x;\eta)$ we have
$$f(y) = f(x) + \langle f'(x), y - x \rangle + \tfrac{1}{2} \langle f''(z)(y - x), y - x \rangle,$$
where $z$ is some element on the line segment connecting $x$ and $y$. We note that if the norms of $f''(y)$ are bounded over $y \in B(x;\eta)$ by the constant $2\sigma > 0$, then this implies
$$f(y) \ge f(x) + \langle f'(x), y - x \rangle - \sigma \|y - x\|^2 \tag{3}$$
for all $y \in B(x;\eta)$.
If it should also happen that $f'' \colon X \to L(X, X)$ is continuous on $U$, then $f$ is said to be twice continuously differentiable on $U$, and we write $f \in C^2(U)$, or simply $f \in C^2$ if $U = X$. We note that if $f \in C^2(U)$, then for each $x \in U$ there exists a neighborhood $B(x;\eta)$ and a constant $\sigma$ so that (3) holds, since the continuity of $f''$ at $x$ implies that the norms of $f''$ are bounded in a neighborhood of $x$.
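The lower bound (3) can be checked numerically in one dimension. The sketch below is only an illustration under assumed data: it takes $X = \mathbb{R}$ and $f = \sin$ (choices made here, not in the text), for which $|f''| \le 1$, so that $2\sigma = 1$.

```python
import math

def taylor_lower_bound(f, df, x, y, sigma):
    """Right-hand side of (3) for X = R: f(x) + f'(x)(y - x) - sigma*(y - x)^2."""
    return f(x) + df(x) * (y - x) - sigma * (y - x) ** 2

# Assumed example data (not from the text): f = sin, so |f''| <= 1 = 2*sigma.
x, eta, sigma = 0.3, 1.0, 0.5

# Sample the ball B(x; eta) on a uniform grid.
ys = [x - eta + 2 * eta * k / 200 for k in range(201)]

# Inequality (3), with a small tolerance for floating-point rounding.
holds = all(
    math.sin(y) >= taylor_lower_bound(math.sin, math.cos, x, y, sigma) - 1e-12
    for y in ys
)
print(holds)
```

A grid check of this kind is no proof, but it makes the role of the constant $\sigma$ concrete: a smaller $\sigma$ than half the bound on $|f''|$ may make the test fail.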
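for all $y \in B(x;\eta)$ and for all $\alpha \ge f(y)$. This in turn implies
$$\langle (\zeta, -1), (y, \alpha) - (x, f(x)) \rangle \le \sigma \|(y, \alpha) - (x, f(x))\|^2$$
for all points $(y, \alpha) \in \operatorname{epi}(f)$ near $(x, f(x))$. In view of Proposition 1.5, this implies that $(\zeta, -1) \in N^P_{\operatorname{epi} f}(x, f(x))$.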
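Let us now turn to the “only if” part. To this end, suppose that $(\zeta, -1) \in N^P_{\operatorname{epi} f}(x, f(x))$. Then by Proposition 1.3 there exists $\delta > 0$ such that
$$(x, f(x)) \in \operatorname{proj}_{\operatorname{epi} f}\bigl((x, f(x)) + \delta(\zeta, -1)\bigr).$$
This evidently implies
$$\|\delta(\zeta, -1)\|^2 \le \|(x, f(x)) + \delta(\zeta, -1) - (y, \alpha)\|^2$$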
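for all $(y, \alpha) \in \operatorname{epi} f$; see Figure 1.4. Upon taking $\alpha = f(y)$, the last inequality yields
$$\delta^2 \|\zeta\|^2 + \delta^2 \le \|x - y + \delta\zeta\|^2 + \bigl(f(x) - f(y) - \delta\bigr)^2,$$
which can be rewritten as
$$\bigl(f(y) - f(x) + \delta\bigr)^2 \ge \delta^2 + 2\delta \langle \zeta, y - x \rangle - \|x - y\|^2. \tag{5}$$
It is clear that the right-hand side of (5) is positive for all $y$ sufficiently near $x$, say for $y \in B(x;\eta)$. By shrinking $\eta > 0$ if necessary, we can also ensure (by the lower semicontinuity of $f$) that $y \in B(x;\eta)$ implies
$$f(y) - f(x) + \delta > 0.$$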
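Hence taking square roots of (5) gives us that
$$f(y) \ge g(y) := f(x) - \delta + \bigl\{\delta^2 + 2\delta \langle \zeta, y - x \rangle - \|x - y\|^2\bigr\}^{1/2} \tag{6}$$
for all $y \in B(x;\eta)$. Direct calculations show that $g'(x) = \zeta$ and that $g''$ exists and is bounded, say by $2\sigma > 0$, on a neighborhood of $x$ (Exercise 2.4).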
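Shrinking $\eta$ further if necessary, we have (as noted above in connection with the inequality (3))
$$g(y) \ge g(x) + \langle \zeta, y - x \rangle - \sigma \|y - x\|^2 \quad \forall y \in B(x;\eta).$$
But then by (6), and since $f(x) = g(x)$, we see that
$$f(y) \ge f(x) + \langle \zeta, y - x \rangle - \sigma \|y - x\|^2 \quad \forall y \in B(x;\eta),$$
which is (4) as required.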
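The claims $g(x) = f(x)$ and $g'(x) = \zeta$ for the function $g$ in (6) can be sanity-checked with a finite difference. The following is a minimal sketch assuming $X = \mathbb{R}$; the numerical values of $x$, $f(x)$, $\zeta$, and $\delta$ are hypothetical and chosen only for illustration.

```python
import math

def g(y, x, fx, zeta, delta):
    """The function g from (6), specialised to X = R."""
    radicand = delta ** 2 + 2 * delta * zeta * (y - x) - (x - y) ** 2
    return fx - delta + math.sqrt(radicand)

# Hypothetical data for illustration only.
x, fx, zeta, delta = 0.0, 1.0, 0.7, 0.5

# g touches f at x: g(x) = f(x) - delta + sqrt(delta^2) = f(x).
value_at_x = g(x, x, fx, zeta, delta)

# A central difference approximates g'(x), which should be close to zeta.
step = 1e-6
slope = (g(x + step, x, fx, zeta, delta) - g(x - step, x, fx, zeta, delta)) / (2 * step)

print(value_at_x, slope)
```

Analytically, differentiating the square root gives $g'(y) = \bigl(\delta\zeta - (y - x)\bigr) / \{\cdot\}^{1/2}$, which reduces to $\delta\zeta/\delta = \zeta$ at $y = x$.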
FIGURE 1.4. $\zeta$ belongs to $\partial_P f(x)$.
The definition of proximal subgradients via proximal normals to an epigraph is a geometric approach, and the characterization in Theorem 2.5 can also be interpreted geometrically. The proximal subgradient inequality (4) asserts that near $x$, $f(\cdot)$ majorizes the quadratic function
$$h(y) := f(x) + \langle \zeta, y - x \rangle - \sigma \|y - x\|^2,$$
with equality at $y = x$ (since obviously $h(x) = f(x)$). It is worth noting that this is equivalent to saying that $y \mapsto f(y) - h(y)$ has a local minimum at $y = x$ with minimum value equal to 0. Put into purely heuristic terms, the content of Theorem 2.5 is that the existence of such a parabola $h$ which “locally fits under” the epigraph of $f$ at $(x, f(x))$ is equivalent to the existence of a ball in $X \times \mathbb{R}$ touching the epigraph nonhorizontally at that point; this is, in essence, what the proof of the theorem shows. See Figure 1.4.
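The "parabola fitting under the epigraph" picture can be tested on a grid. The sketch below uses an example that is an assumption of this illustration, not taken from the text: $X = \mathbb{R}$, $f(y) = |y| - y^2$, base point $x = 0$, and $\sigma = 1$. For this $f$ the gap $f - h$ equals $|y| - \zeta y$, so one expects $\zeta \in [-1, 1]$ to pass and a slope such as $\zeta = 1.2$ to fail.

```python
def gap(y, zeta, sigma):
    """f(y) - h(y) at base point x = 0, for the assumed example f(y) = |y| - y^2."""
    f = abs(y) - y ** 2
    h = zeta * y - sigma * y ** 2  # h(y) = f(0) + zeta*y - sigma*y^2, with f(0) = 0
    return f - h

# Grid on the ball B(0; 0.5); the point y = 0 is included.
ys = [-0.5 + k / 200 for k in range(201)]

def fits_under(zeta, sigma=1.0):
    """True if h stays below f on the grid (up to rounding tolerance)."""
    return all(gap(y, zeta, sigma) >= -1e-12 for y in ys)

results = {zeta: fits_under(zeta) for zeta in (-1.0, 0.0, 0.5, 1.0, 1.2)}
print(results)
```

Note that this $f$ is not convex, so the parabola fits only locally; the example is meant to show the local character of inequality (4), nothing more.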
The description of proximal subgradients contained in Theorem 2.5 is generally more useful in analyzing lower semicontinuous functions than is a direct appeal to the definition. The first corollary below illustrates this, and relates $\partial_P f$ to classical differentiability. It also states that for convex functions, the inequality (4) holds globally in an even simpler form; this is the functional analogue of the simplified proximal normal inequality for convex sets (Proposition 1.10).
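2.6. Corollary. Let $f \in \mathcal{F}$ and $U \subset X$ be open.
(a) Assume that $f$ is Gâteaux differentiable at $x \in U$. Then
$$\partial_P f(x) \subseteq \{ f'_G(x) \}.$$
(b) If $f \in C^2(U)$, then
$$\partial_P f(x) = \{ f'(x) \}$$
for all $x \in U$.
(c) If $f$ is convex, then $\zeta \in \partial_P f(x)$ iff
$$f(y) \ge f(x) + \langle \zeta, y - x \rangle \quad \forall y \in X. \tag{7}$$
Proof.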
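(a) Suppose $f$ has a Gâteaux derivative at $x$ and that $\zeta \in \partial_P f(x)$. For any $v \in X$, if we write $y = x + tv$, the proximal subgradient inequality (4) implies that there exists $\sigma > 0$ such that
$$\frac{f(x + tv) - f(x)}{t} - \langle \zeta, v \rangle \ge -t\sigma \|v\|^2$$
for all sufficiently small positive $t$. Upon letting $t \downarrow 0$ we obtain
$$\langle f'_G(x) - \zeta, v \rangle \ge 0.$$
Since $v$ was arbitrary (in particular, the inequality holds for both $v$ and $-v$), the conclusion $\zeta = f'_G(x)$ follows.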
(b) If $f \in C^2(U)$ and $x \in U$, then we have $f'(x) \in \partial_P f(x)$ by Theorem 2.5, since (3) implies (4) if $\zeta$ is set equal to $f'(x)$. That $\partial_P f(x)$ contains only $f'(x)$ follows from part (a).
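(c) Obviously if $\zeta$ satisfies (7), then (4) holds with $\sigma = 0$ and any $\eta > 0$, so that $\zeta \in \partial_P f(x)$. Conversely, suppose $\zeta \in \partial_P f(x)$, and $\sigma$ and $\eta$ are chosen as in (4). Let $y \in X$. Then for any $t$ in $(0, 1)$ sufficiently small so that $(1 - t)x + ty \in B(x;\eta)$, we have by the convexity of $f$ and (4) (where we substitute $(1 - t)x + ty$ for $y$) that
$$(1 - t) f(x) + t f(y) \ge f\bigl((1 - t)x + ty\bigr) \ge f(x) + t \langle \zeta, y - x \rangle - t^2 \sigma \|y - x\|^2.$$
Simplifying and dividing by $t$, we conclude
$$f(y) \ge f(x) + \langle \zeta, y - x \rangle - t\sigma \|y - x\|^2.$$
Letting $t \downarrow 0$ yields (7).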
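Parts (b) and (c) of Corollary 2.6 can be illustrated together on a smooth convex function. The sketch below assumes $X = \mathbb{R}$, $f(y) = y^2$, and $x = 1$ (an example chosen here, not in the text), so that $f'(x) = 2$. It checks that $\zeta = 2$ satisfies the global inequality (7) on a wide grid, while the perturbed slope $\zeta = 2.1$ fails even the local inequality (4) with the generous, arbitrarily chosen constant $\sigma = 10$.

```python
def subgradient_gap(y, x, zeta, sigma=0.0):
    """f(y) - [f(x) + zeta*(y - x) - sigma*(y - x)^2] for the assumed f(t) = t^2."""
    f = lambda t: t * t
    return f(y) - (f(x) + zeta * (y - x) - sigma * (y - x) ** 2)

x = 1.0

# Global check of (7) with zeta = f'(x) = 2 on a wide grid around x.
wide = [x - 50 + 0.25 * k for k in range(401)]
global_ok = all(subgradient_gap(y, x, 2.0) >= -1e-9 for y in wide)

# Local check of (4) with a perturbed slope and sigma = 10 on B(x; 0.1).
near = [x + 1e-3 * k for k in range(-100, 101)]
perturbed_ok = all(subgradient_gap(y, x, 2.1, sigma=10.0) >= -1e-12 for y in near)

print(global_ok, perturbed_ok)
```

For $\zeta = 2$ the gap is exactly $(y - 1)^2$, which explains the global validity; for $\zeta = 2.1$ the gap behaves like $-0.1\,(y - 1)$ just to the right of $x$, which no quadratic correction can absorb.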
The containment in Corollary 2.6(a) is the best possible conclusion under the stated assumptions, since even when $X = \mathbb{R}$ and $f$ is continuously differentiable, the nonemptiness of the proximal subdifferential is not assured. The already familiar $C^1$ function $f(x) = -|x|^{3/2}$ admits no proximal subgradient at $x = 0$ (see Exercise 1.7).
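The failure for $f(x) = -|x|^{3/2}$ at $x = 0$ can be seen numerically. Since $f'(0) = 0$, Corollary 2.6(a) leaves $\zeta = 0$ as the only candidate, and (4) would then require $-|y|^{3/2} \ge -\sigma y^2$ near 0. The sketch below, an illustration only, tests this at the point $y = 0.5/\sigma^2$ for several arbitrarily chosen values of $\sigma$; that point shrinks to 0 as $\sigma$ grows, so it eventually lies in any ball $B(0;\eta)$.

```python
def violates(sigma):
    """True if (4) with zeta = 0 fails at y = 0.5 / sigma^2 for f(y) = -|y|^(3/2)."""
    y = 0.5 / sigma ** 2
    f = -abs(y) ** 1.5
    lower = -sigma * y ** 2  # f(0) + 0*y - sigma*y^2, with f(0) = 0
    return f < lower

# Arbitrary sample of constants sigma; larger sigma gives a test point closer to 0.
sigmas = [1.0, 10.0, 1e3, 1e6]
all_violated = all(violates(s) for s in sigmas)
print(all_violated)
```

The algebra behind the test point: $|y|^{3/2} \le \sigma y^2$ holds only when $|y| \ge 1/\sigma^2$, so every $y$ with $0 < |y| < 1/\sigma^2$ violates the inequality.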
The first part of the following corollary has already been observed (Exercise 2.3). Despite its simplicity, it is the fundamental fact that generates proximal subgradients on many occasions. The second part says that the “first-order” necessary condition for a minimum is also sufficient in the case of convex functions, which is a principal reason for their importance.
2.7. Corollary. Suppose $f \in \mathcal{F}$.
(a) If $f$ has a local minimum at $x$, then $0 \in \partial_P f(x)$.
(b) Conversely, if $f$ is convex and $0 \in \partial_P f(x)$, then $x$ is a global minimum of $f$.
Proof.
(a) The definition of a local minimum says there exists $\eta > 0$ so that
$$f(y) \ge f(x) \quad \forall y \in B(x;\eta),$$
which is the proximal subgradient inequality with $\zeta = 0$ and $\sigma = 0$. Thus Theorem 2.5 implies that $0 \in \partial_P f(x)$.
(b) Under these hypotheses, (7) holds with $\zeta = 0$. Thus $f(y) \ge f(x)$ for all $y \in X$, which says that $x$ is a global minimum of $f$.
The proximal subdifferential is a “one-sided” object suitable to the analysis of lower semicontinuous functions. For a theory applicable to upper semicontinuous functions $f$, the proximal superdifferential $\partial^P f(x)$ is the appropriate object, and can be defined simply as $-\partial_P(-f)(x)$. In the subsequent development, analogues for upper semicontinuous functions will usually not be stated because they require only evident modifications, such as replacing “sub” by “super,” “≤” by “≥,” “minimum” by “maximum,” and “convex” by “concave.” Nonetheless, we will have occasional use for supergradients.