
Neon2: Finding Local Minima via First-Order Oracles (version 1)∗

Zeyuan Allen-Zhu (zeyuan@csail.mit.edu), Microsoft Research
Yuanzhi Li (yuanzhil@cs.princeton.edu), Princeton University

November 17, 2017

Abstract

We propose a reduction for non-convex optimization that can (1) turn a stationary-point finding algorithm into a local-minimum finding one, and (2) replace the Hessian-vector product computations with only gradient computations. It works both in the stochastic and the deterministic settings, without hurting the algorithm’s performance.

As applications, our reduction turns Natasha2 into a first-order method without hurting its performance. It also converts SGD, GD, SCSG, and SVRG into local-minimum finding algorithms outperforming some best known results.

1 Introduction

Nonconvex optimization has become increasingly popular due to its ability to capture modern machine learning tasks at large scale. Most notably, training deep neural networks corresponds to minimizing a non-convex function f(x) = (1/n) Σ_{i=1}^n f_i(x) over x ∈ ℝ^d, where each training sample i corresponds to one loss function f_i(·) in the summation. This average structure allows one to perform stochastic gradient descent (SGD), which uses a random ∇f_i(x) (corresponding to computing backpropagation once) to approximate ∇f(x) and performs descent updates.

Motivated by such large-scale machine learning applications, we wish to design faster first-order non-convex optimization methods that outperform gradient descent, both in the online and offline settings. In this paper, we say an algorithm is online if its complexity is independent of n (so n can be infinite), and offline otherwise. In recent years, researchers across different communities have gathered together to tackle this challenging question. By far, known theoretical approaches mostly fall into one of the following two categories.

First-order methods for stationary points. In analyzing first-order methods, we denote by gradient complexity T the number of computations of ∇f_i(x). To achieve an ε-approximate stationary point, namely a point x with ‖∇f(x)‖ ≤ ε, it is folklore that gradient descent (GD) is offline and needs T = O(n/ε²), while stochastic gradient descent (SGD) is online and needs T = O(1/ε⁴). In recent years, the offline complexity has been improved to T = O(n^{2/3}/ε²) by the

∗The result of this paper was briefly discussed at a Berkeley Simons workshop on Oct 6 and internally presented at Microsoft on Oct 30. We started to prepare this manuscript on Nov 11, after being informed of the independent and similar work of Xu and Yang [28]. Their result appeared on arXiv on Nov 3. To respect the fact that their work appeared online before us, we have adopted their algorithm name Neon and called our new algorithm Neon2. We encourage readers citing this work to also cite [28].


SVRG method [3, 23], and the online complexity has been improved to T = O(1/ε^{10/3}) by the SCSG method [18]. Both of them rely on the so-called variance-reduction technique, originally discovered for convex problems [11, 16, 24, 26].

These algorithms SVRG and SCSG are only capable of finding stationary points, which may not necessarily be approximate local minima and are arguably bad solutions for neural-network training [9, 10, 14]. Therefore,

can we turn stationary-point finding algorithms into local-minimum finding ones?

Hessian-vector methods for local minima. It is common knowledge that using information about the Hessian, one can find ε-approximate local minima, namely a point x with ‖∇f(x)‖ ≤ ε and also ∇²f(x) ⪰ −ε^{1/C}·I.¹ In 2006, Nesterov and Polyak [20] showed that one can find an ε-approximate local minimum in O(1/ε^{1.5}) iterations, but each iteration requires an (offline) computation as heavy as inverting the matrix ∇²f(x).

To fix this issue, researchers proposed to study the so-called “Hessian-free” methods that, in addition to gradient computations, also compute Hessian-vector products. That is, instead of using the full matrix ∇²f_i(x) or ∇²f(x), these methods also compute ∇²f_i(x)·v for indices i and vectors v.² For Hessian-free methods, we denote by gradient complexity T the number of computations of ∇f_i(x) plus that of ∇²f_i(x)·v. The hope of using Hessian-vector products is to improve the complexity T as a function of ε.

Such improvement was first shown possible independently by [1, 7] for the offline setting, with complexity T = Õ(n/ε^{1.5} + n^{3/4}/ε^{1.75}), which is better than that of gradient descent. In the online setting, the first improvement was by Natasha2, which gives complexity T = Õ(1/ε^{3.25}) [2].

Unfortunately, it is argued by some researchers that Hessian-vector products are not general enough and may not be as simple to implement as evaluating gradients [8]. Therefore,

can we turn Hessian-free methods into first-order ones, without hurting their performance?

1.1 From Hessian-Vector Products to First-Order Methods

Recall that by the definition of derivative, we have ∇²f_i(x)·v = lim_{q→0} {(∇f_i(x + qv) − ∇f_i(x))/q}. Given any Hessian-free method, at least at a high level, can we replace every occurrence of ∇²f_i(x)·v with

w = (∇f_i(x + qv) − ∇f_i(x))/q for some small q > 0?

Note that the error introduced in this approximation is ‖∇²f_i(x)·v − w‖ ∝ q‖v‖². Therefore, as long as the original algorithm is sufficiently stable to adversarial noise, and as long as q is small enough, this can convert Hessian-free algorithms into first-order ones.
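As a concrete illustration, here is a minimal Python sketch of this gradient-difference trick. The toy function f(x) = x₀²·x₁, its hand-coded gradient, and the step size q are our own assumptions for illustration, not part of any algorithm in this paper:

```python
# Hedged illustration of the gradient-difference approximation
# w = (grad f(x + q*v) - grad f(x)) / q  ~  Hessian(f)(x) @ v.
# The toy function f(x) = x0^2 * x1 is an assumption for this sketch;
# its Hessian at x is [[2*x1, 2*x0], [2*x0, 0]].

def grad(x):
    """Exact gradient of f(x) = x0**2 * x1."""
    return [2.0 * x[0] * x[1], x[0] ** 2]

def hvp_approx(grad_fn, x, v, q=1e-5):
    """First-order (Hessian-free) approximation of the Hessian-vector product."""
    shifted = [xi + q * vi for xi, vi in zip(x, v)]
    g1, g0 = grad_fn(shifted), grad_fn(x)
    return [(a - b) / q for a, b in zip(g1, g0)]

# At x = (1, 2) the Hessian is [[4, 2], [2, 0]], so the exact product with
# v = (1, 1) is (6, 2); the approximation errs by O(q * ||v||^2).
w = hvp_approx(grad, [1.0, 2.0], [1.0, 1.0])
```

Shrinking q reduces the approximation error linearly, exactly the ∝ q‖v‖² behavior described above.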

In this paper, we demonstrate this idea by converting negative-curvature-search (NC-search) subroutines into first-order processes. NC-search is a key subroutine used in state-of-the-art Hessian-free methods that have rigorous proofs (see [1, 2, 7]). It solves the following simple task:

negative-curvature search (NC-search)

given a point x₀, decide if ∇²f(x₀) ⪰ −δI, or find a unit vector v such that v⊤∇²f(x₀)v ≤ −δ/2.

Online Setting. In the online setting, NC-search can be solved by Oja’s algorithm [21], which

¹We say A ⪰ δI if all the eigenvalues of A are no smaller than δ. In this high-level introduction, we focus only on the case when δ = ε^{1/C} for some constant C.


costs Õ(1/δ²) computations of Hessian-vector products (first proved in [6] and applied to NC-search in [2]).

In this paper, we propose a method Neon2online which solves the NC-search problem via only stochastic first-order updates. That is, starting from x₁ = x₀ + ξ where ξ is some random perturbation, we keep updating x_{t+1} = x_t − η(∇f_i(x_t) − ∇f_i(x₀)). In the end, the vector x_T − x₀ gives us enough information about the negative curvature.

Theorem 1 (informal). Our Neon2online algorithm solves NC-search using Õ(1/δ²) stochastic gradients, without Hessian-vector product computations.

This complexity Õ(1/δ²) matches that of Oja’s algorithm, and is information-theoretically optimal (up to log factors); see the lower bound in [6].

We emphasize that the independent work Neon by Xu and Yang [28] is actually the first recorded theoretical result that proposed this approach. However, Neon needs Õ(1/δ³) stochastic gradients, because it uses full gradient descent to find NC (on a sub-sampled objective), inspired by [15] and the power method; instead, Neon2online uses stochastic gradients and is based on our prior work on Oja’s algorithm [6].

By plugging Neon2online into Natasha2 [2], we achieve the following corollary (see Figure 1(c)):

Theorem 2 (informal). Neon2online turns Natasha2 into a stochastic first-order method, without hurting its performance. That is, it finds an (ε, δ)-approximate local minimum in T = Õ(1/ε^{3.25} + 1/(ε³δ) + 1/δ⁵) stochastic gradient computations, without Hessian-vector product computations.

(We say x is an approximate local minimum if ‖∇f(x)‖ ≤ ε and ∇²f(x) ⪰ −δI.)

Offline Setting. There are a number of ways to solve the NC-search problem in the offline setting using Hessian-vector products. Most notably, the power method uses Õ(n/δ) computations of Hessian-vector products, the Lanczos method [17] uses Õ(n/√δ) computations, and shift-and-invert [12] on top of SVRG [26] (which we call SI+SVRG) uses Õ(n + n^{3/4}/√δ) computations.

In this paper, we convert the Lanczos method and SI+SVRG into first-order ones:

Theorem 3 (informal). Our Neon2det algorithm solves NC-search using Õ(1/√δ) full gradients (or equivalently Õ(n/√δ) stochastic gradients), and our Neon2svrg solves NC-search using Õ(n + n^{3/4}/√δ) stochastic gradients.

We emphasize that, although analyzed in the online setting only, the work Neon by Xu and Yang [28] also applies to the offline setting, and seems to be the first result to solve NC-search using first-order gradients with a theoretical proof. However, Neon uses Õ(1/δ) full gradients instead of Õ(1/√δ). Their approach is inspired by [15], but our Neon2det is based on Chebyshev approximation theory (see the textbook [27]) and its recent stability analysis [5].

By putting Neon2det and Neon2svrg into the CDHS method of Carmon et al. [7], we have³

Theorem 4 (informal). Neon2det turns CDHS into a first-order method without hurting its performance: it finds an (ε, δ)-approximate local minimum in Õ(1/ε^{1.75} + 1/δ^{3.5}) full gradient computations. Neon2svrg turns CDHS into a first-order method without hurting its performance: it finds an (ε, δ)-approximate local minimum in T = Õ(n/ε^{1.5} + n/δ³ + n^{3/4}/ε^{1.75} + n^{3/4}/δ^{3.5}) stochastic gradient computations.

³We note that the original paper of CDHS only proved such complexity results (although requiring Hessian-vector products) for the special case of δ ≥ ε^{1/2}. In such a case, it requires either Õ(1/ε^{1.75}) full gradient computations or Õ(n/ε^{1.5} + n^{3/4}/ε^{1.75}) stochastic gradient computations.

[Figure 1: three log-log plots of gradient complexity T versus δ, comparing (a) Neon2+SGD vs. Neon+SGD (complexity regimes T = ε⁻⁴, T = δ⁻³ε⁻², T = δ⁻⁵, T = δ⁻⁷, with thresholds at δ = ε^{2/3}, ε^{4/7}, ε^{1/2}, ε^{1/4}); (b) Neon2+SCSG vs. Neon+SCSG (regimes T = ε^{−10/3}, T = δ⁻³ε⁻², T = δ⁻⁵, T = δ⁻⁶, thresholds at δ = ε^{2/3}, ε^{1/2}, ε^{4/9}, ε^{1/4}); and (c) Neon2+Natasha2 vs. Neon+Natasha (regimes T = ε^{−3.25}, T = δ⁻¹ε⁻³, T = δ⁻⁵, T = δ⁻⁶, thresholds at δ = ε^{3/4}, ε^{3/5}, ε^{1/2}, ε^{1/4}).]

Figure 1: Neon vs. Neon2 for finding (ε, δ)-approximate local minima. We emphasize that Neon2 and Neon are based on the same high-level idea, but Neon is arguably the first-recorded result to turn stationary-point finding algorithms (such as SGD, SCSG) into local-minimum finding ones, with theoretical proofs.

One should perhaps compare Neon2det to the interesting work “convex until guilty” by Carmon et al. [8]. Their method finds ε-approximate stationary points using Õ(1/ε^{1.75}) full gradients, and is arguably the first first-order method achieving a convergence rate better than the 1/ε² of GD. Unfortunately, it is unclear if their method guarantees local minima. In comparison, Neon2det on CDHS achieves the same complexity but guarantees its output to be an approximate local minimum.

Remark 1.1. All the cited works in this sub-section require the objective to have (1) Lipschitz-continuous Hessian (a.k.a. second-order smoothness) and (2) Lipschitz-continuous gradient (a.k.a. Lipschitz smoothness). One can argue that (1) and (2) are both necessary for finding approximate local minima, but if one only finds approximate stationary points, then only (2) is necessary. We shall formally discuss our assumptions in Section 2.

1.2 From Stationary Points to Local Minima

Given any (first-order) algorithm that finds only stationary points (such as GD, SGD, or SCSG [18]), we can hope to use the NC-search routine to identify whether or not its output x satisfies ∇²f(x) ⪰ −δI. If so, then automatically x becomes an (ε, δ)-approximate local minimum, so we can terminate. If not, then we can move in its negative curvature direction to further decrease the objective.

In the independent work of Xu and Yang [28], they proposed to apply their Neon method for NC-search, and thus turned SGD and SCSG into first-order methods finding approximate local minima. In this paper, we use Neon2 instead. We show the following theorem:

Theorem 5 (informal). To find an (ε, δ)-approximate local minimum,

(a) Neon2+SGD needs T = Õ(1/ε⁴ + 1/(ε²δ³) + 1/δ⁵) stochastic gradients;

(b) Neon2+SCSG needs T = Õ(1/ε^{10/3} + 1/(ε²δ³) + 1/δ⁵) stochastic gradients;

(c) Neon2+GD needs T = Õ(n/ε² + n/δ^{3.5}) stochastic gradients (so Õ(1/ε² + 1/δ^{3.5}) full gradients); and

(d) Neon2+SVRG needs T = Õ(n^{2/3}/ε² + n/δ³ + n^{5/12}/(ε²δ^{1/2}) + n^{3/4}/δ^{3.5}) stochastic gradients.

We make several comments as follows.

(a) We compare Neon2+SGD to Ge et al. [13], where the authors showed that SGD plus perturbation needs T = Õ(poly(d)/ε⁴) stochastic gradients to find (ε, ε^{1/4})-approximate local minima. To some extent, Theorem 5a is superior because we have (1) removed the poly(d) factor,⁴ (2) achieved T = Õ(1/ε⁴) as long as δ ≥ ε^{2/3}, and (3) a much simpler analysis.

We also remark that, if using Neon instead of Neon2, one achieves a slightly worse complexity T = Õ(1/ε⁴ + 1/δ⁷); see Figure 1(a) for a comparison.⁵

(b) Neon2+SCSG turns SCSG into a local-minimum finding algorithm. Again, if using Neon instead of Neon2, one gets a slightly worse complexity T = Õ(1/ε^{10/3} + 1/(ε²δ³) + 1/δ⁶); see Figure 1(b).

(c) We compare Neon2+GD to Jin et al. [15], where the authors showed that GD plus perturbation needs Õ(1/ε²) full gradients to find (ε, ε^{1/2})-approximate local minima. This is perhaps the first time that one can convert a stationary-point finding method (namely GD) into a local-minimum finding one, without hurting its performance.

To some extent, Theorem 5c is better because we use Õ(1/ε²) full gradients as long as δ ≥ ε^{4/7}.

(d) Our result for Neon2+SVRG does not seem to be recorded anywhere, even if Hessian-vector product computations are allowed.

Limitation. We note that there is a limitation of using Neon2 (or Neon) to turn an algorithm finding stationary points into one finding local minima. Namely, given any algorithm A, if the gradient complexity for A to find an ε-approximate stationary point is T, then after this conversion, it finds (ε, δ)-approximate local minima in a gradient complexity that is at least T. This is because the new algorithm, after combining Neon2 and A, tries to alternately find stationary points (using A) and escape from saddle points (using Neon2). Therefore, it must pay at least complexity T. In contrast, methods such as Natasha2 swing by saddle points instead of going to saddle points and then escaping. This has enabled it to achieve a smaller complexity T = O(ε^{−3.25}) for δ ≥ ε^{1/4}.

2 Preliminaries

Throughout this paper, we denote by ‖·‖ the Euclidean norm. We use i ∈_R [n] to denote that i is generated from [n] = {1, 2, …, n} uniformly at random. We denote by I[event] the indicator function of a probabilistic event.

We denote by ‖A‖₂ the spectral norm of a matrix A. For symmetric matrices A and B, we write A ⪰ B to indicate that A − B is positive semidefinite (PSD). Therefore, A ⪰ σI if and only if all eigenvalues of A are no less than σ. We denote by λ_min(A) and λ_max(A) the minimum and maximum eigenvalues of a symmetric matrix A.

Recall some definitions on smoothness (for other equivalent definitions, see the textbook [19]).

Definition 2.1. For a function f : ℝ^d → ℝ,

• f is L-Lipschitz smooth (or L-smooth for short) if
  ∀x, y ∈ ℝ^d, ‖∇f(x) − ∇f(y)‖ ≤ L‖x − y‖.

• f is second-order L₂-Lipschitz smooth (or L₂-second-order smooth for short) if
  ∀x, y ∈ ℝ^d, ‖∇²f(x) − ∇²f(y)‖₂ ≤ L₂‖x − y‖.

The following fact says that the variance of a random variable decreases by a factor m if we choose m independent copies and average them. It is trivial to prove; see for instance [18].

⁴We are aware that the original authors of [13] have a different proof to remove its poly(d) factor, but have not found it online at this moment.

⁵Their complexity might be improvable to Õ(1/ε⁴ + 1/δ⁶).

| goal | algorithm | gradient complexity T | Hessian-vector products |
|---|---|---|---|
| stationary | SGD (folklore) | O(1/ε⁴) | no |
| local minima | perturbed SGD [13] | Õ(poly(d)/ε⁴) | no |
| stationary | GD (folklore) | O(n/ε²) | no |
| local minima | perturbed GD [15] | Õ(n/ε²) | no |
| stationary | “guilty” [8] | Õ(n/ε^{1.75}) | no |
| local minima | FastCubic [1] | Õ(n/ε^{1.5} + n^{3/4}/ε^{1.75}) | needed |

Table 1: Complexity for finding ‖∇f(x)‖ ≤ ε and ∇²f(x) ⪰ −δI. Following tradition, in these complexity bounds, we assume the variance and smoothness parameters are constants, and only show the dependency on n, d, ε.

Remark 1. A variance bound is needed for online methods.

Remark 2. Lipschitz smoothness is needed for finding approximate stationary points.

Remark 3. Second-order Lipschitz smoothness is needed for finding approximate local minima.

Fact 2.2. If v₁, …, vₙ ∈ ℝ^d satisfy Σ_{i=1}^n v_i = 0, and S is a non-empty, uniform random subset of [n], then E_S ‖(1/|S|) Σ_{i∈S} v_i‖² ≤ (I[|S| < n]/|S|) · (1/n) Σ_{i=1}^n ‖v_i‖².

Throughout the paper we study the following minimization problem:

min_{x∈ℝ^d} { f(x) := (1/n) Σ_{i=1}^n f_i(x) }


Algorithm 1 Neon2online(f, x₀, δ, p)

Input: function f(x) = (1/n) Σ_{i=1}^n f_i(x), vector x₀, negative curvature δ > 0, confidence p ∈ (0, 1].

1: for j = 1, 2, …, Θ(log 1/p) do  ⋄ boost the confidence
2:   v_j ← Neon2online_weak(f, x₀, δ, p);
3:   if v_j ≠ ⊥ then
4:     m ← Θ(L² log(1/p)/δ²), v′ ← Θ(δ/L₂)·v_j.
5:     Draw i₁, …, i_m ∈_R [n].
6:     z_j ← (1/(m‖v′‖₂²)) Σ_{k=1}^m (v′)⊤(∇f_{i_k}(x₀ + v′) − ∇f_{i_k}(x₀))
7:     if z_j ≤ −3δ/4 then return v = v_j
8:   end if
9: end for
10: return v = ⊥.

Algorithm 2 Neon2online_weak(f, x₀, δ, p)

1: η ← δ/(C₀² L² log(d/p)), T ← C₀² log(d/p)/(ηδ);  ⋄ for a sufficiently large constant C₀
2: ξ ← a Gaussian random vector with norm σ, where σ := η(d/p)^{−2C₀} δ²/L₂.
3: x₁ ← x₀ + ξ.
4: for t ← 1 to T do
5:   x_{t+1} ← x_t − η(∇f_i(x_t) − ∇f_i(x₀)), where i ∈_R [n].
6:   if ‖x_{t+1} − x₀‖₂ ≥ r then return v = (x_{t+1} − x₀)/‖x_{t+1} − x₀‖₂  ⋄ r := (d/p)^{C₀} σ
7: end for
8: return v = ⊥.

where both f(·) and each f_i(·) can be nonconvex. We wish to find (ε, δ)-local minima, which are points x satisfying

‖∇f(x)‖ ≤ ε and ∇²f(x) ⪰ −δI.

We need the following three assumptions:

• Each f_i(x) is L-Lipschitz smooth.

• Each f_i(x) is second-order L₂-Lipschitz smooth.

(In fact, the gradient complexity of Neon2 in this paper only depends polynomially on the second-order smoothness of f(x) (rather than f_i(x)), and the time complexity depends logarithmically on the second-order smoothness of f_i(x). To keep notation simple, we simply assume each f_i(x) is L₂-second-order smooth.)

• Stochastic gradients have bounded variance: ∀x ∈ ℝ^d: E_{i∈_R[n]} ‖∇f(x) − ∇f_i(x)‖² ≤ V.

(This assumption is needed only for online algorithms.)

3 Neon2 in the Online Setting

We propose Neon2online formally in Algorithm 1.


Theorem 1 (Neon2online). Let f(x) = (1/n) Σ_{i=1}^n f_i(x), where each f_i is L-smooth and L₂-second-order smooth. For every point x₀ ∈ ℝ^d, every δ > 0, and every p ∈ (0, 1], the output

v = Neon2online(f, x₀, δ, p)

satisfies the following with probability at least 1 − p:

1. If v = ⊥, then ∇²f(x₀) ⪰ −δI.

2. If v ≠ ⊥, then ‖v‖₂ = 1 and v⊤∇²f(x₀)v ≤ −δ/2.

Moreover, the total number of stochastic gradient evaluations is O(log²(d/p)·L²/δ²).

The proof of Theorem 1 immediately follows from Lemma 3.1 and Lemma 3.2 below.

Lemma 3.1 (Neon2online_weak). In the same setting as Theorem 1, the output v = Neon2online_weak(f, x₀, δ, p) satisfies: if λ_min(∇²f(x₀)) ≤ −δ, then with probability at least 2/3, v ≠ ⊥ and v⊤∇²f(x₀)v ≤ −(51/100)·δ.

Proof sketch of Lemma 3.1. We explain why Neon2online_weak works as follows. Starting from a randomly perturbed point x₁ = x₀ + ξ, it keeps updating x_{t+1} ← x_t − η(∇f_i(x_t) − ∇f_i(x₀)) for a random index i ∈ [n], and stops either when T iterations are reached, or when ‖x_{t+1} − x₀‖₂ > r. Therefore, we have ‖x_t − x₀‖₂ ≤ r throughout the iterations, and thus can approximate ∇²f_i(x₀)(x_t − x₀) using ∇f_i(x_t) − ∇f_i(x₀), up to error O(r²). This is a small term based on our choice of r.

Ignoring the error term, our updates look like x_{t+1} − x₀ = (I − η∇²f_i(x₀))(x_t − x₀). This is exactly the same as Oja’s algorithm [21], which is known to approximately compute the minimum eigenvector of ∇²f(x₀) = (1/n) Σ_{i=1}^n ∇²f_i(x₀). Using the recent optimal convergence analysis of Oja’s algorithm [6], one can conclude that after T₁ = Θ(log(r/σ)/(ηλ)) iterations, where λ = max{0, −λ_min(∇²f(x₀))}, not only is ‖x_{t+1} − x₀‖₂ blown up, but it also aligns well with the minimum eigenvector of ∇²f(x₀). In other words, if λ ≥ δ, then the algorithm must stop before T.

Finally, one has to carefully argue that the error does not blow up in this iterative process. We defer the proof details to Appendix A.2.
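The iterative process just described is easy to prototype. Below is a hedged Python sketch of the weak NC-search loop on a toy single-summand objective (n = 1); the objective f(x) = ½(x₀² − x₁²), the step size, and the thresholds are illustrative assumptions rather than the paper's tuned parameters:

```python
import random

# Hedged sketch of the Neon2online_weak iteration: starting from a tiny
# random perturbation of x0, repeat
#     x_{t+1} = x_t - eta * (grad f_i(x_t) - grad f_i(x0))
# and return the normalized displacement once it exceeds a radius r.
# Toy objective (our assumption): f(x) = 0.5*(x0^2 - x1^2), whose Hessian
# diag(1, -1) has negative curvature along the second coordinate.

def grad(x):
    return [x[0], -x[1]]

def nc_search_sketch(grad_fn, x0, eta=0.01, r=1.0, sigma=1e-6, T=10000, seed=0):
    rng = random.Random(seed)
    x = [a + rng.gauss(0.0, sigma) for a in x0]    # x1 = x0 + xi
    g0 = grad_fn(x0)
    for _ in range(T):
        g = grad_fn(x)
        x = [xj - eta * (gj - g0j) for xj, gj, g0j in zip(x, g, g0)]
        d = [xj - x0j for xj, x0j in zip(x, x0)]
        norm = sum(dj * dj for dj in d) ** 0.5
        if norm >= r:                               # displacement blew up:
            return [dj / norm for dj in d]          # unit NC direction
    return None                                     # no strong negative curvature found

v = nc_search_sketch(grad, [0.0, 0.0])
```

On this toy instance the component along the negative-curvature direction is amplified by (1 + η) per step while the positive-curvature component shrinks, so the returned unit vector aligns with ±e₂ and its Rayleigh quotient v₀² − v₁² is close to the minimum eigenvalue −1.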

Our Lemma 3.2 below tells us that we can verify whether the output v of Neon2online_weak is indeed correct (up to an additive δ/4), so we can boost the success probability to 1 − p.

Lemma 3.2 (verification). In the same setting as Theorem 1, let x, v ∈ ℝ^d be vectors. Draw i₁, …, i_m ∈_R [n] and define

z = (1/m) Σ_{j=1}^m v⊤(∇f_{i_j}(x + v) − ∇f_{i_j}(x)).

Then, if ‖v‖ ≤ δ/(8L₂) and m = Θ(L² log(1/p)/δ²), with probability at least 1 − p,

| z/‖v‖₂² − v⊤∇²f(x)v/‖v‖₂² | ≤ δ/4.
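The verification in Lemma 3.2 amounts to estimating a Rayleigh quotient from gradient evaluations. A minimal sketch with a hypothetical single-summand objective f(x) = x₀²·x₁ (so m = 1 and no sampling is needed); at x = (1, 2) its Hessian is [[4, 2], [2, 0]]:

```python
# Hedged sketch of the verification step: estimate the Rayleigh quotient
# v^T (Hessian f)(x) v / ||v||^2 from two gradient evaluations, following
# the definition of z above.  The toy function f(x) = x0^2 * x1 is an
# assumption for illustration only.

def grad(x):
    """Exact gradient of f(x) = x0**2 * x1."""
    return [2.0 * x[0] * x[1], x[0] ** 2]

def rayleigh_estimate(grad_fn, x, v):
    g1 = grad_fn([a + b for a, b in zip(x, v)])
    g0 = grad_fn(x)
    z = sum(vi * (a - b) for vi, a, b in zip(v, g1, g0))
    return z / sum(vi * vi for vi in v)

# For the small vector v = (1e-4, 0), the estimate should be close to
# e1^T H e1 = 4, with error controlled by ||v||.
est = rayleigh_estimate(grad, [1.0, 2.0], [1e-4, 0.0])
```

This mirrors why the lemma requires ‖v‖ ≤ δ/(8L₂): the smaller ‖v‖ is, the smaller the second-order Taylor error in the estimate.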

4 Neon2 in the Deterministic Setting


Algorithm 3 Neon2det(f, x₀, δ, p)

Input: a function f, vector x₀, negative curvature target δ > 0, failure probability p ∈ (0, 1].

1: T ← C₁² log(d/p)·√(L/δ).  ⋄ for a sufficiently large constant C₁
2: ξ ← a Gaussian random vector with norm σ; σ := (d/p)^{−2C₁}·δ/(T³L₂)
3: x₁ ← x₀ + ξ; y₁ ← ξ, y₀ ← 0.
4: for t ← 1 to T do
5:   y_{t+1} = 2M(y_t) − y_{t−1};  ⋄ M(y) := −(1/L)(∇f(x₀ + y) − ∇f(x₀)) + (1 − 3δ/(4L))·y
6:   x_{t+1} = x₀ + y_{t+1} − M(y_t).
7:   if ‖x_{t+1} − x₀‖₂ ≥ r then return (x_{t+1} − x₀)/‖x_{t+1} − x₀‖₂.  ⋄ r := (d/p)^{C₁} σ
8: end for
9: return ⊥.

Theorem 3 (Neon2det). Let f(x) be a function that is L-smooth and L₂-second-order smooth. For every point x₀ ∈ ℝ^d, every δ > 0, and every p ∈ (0, 1], the output v = Neon2det(f, x₀, δ, p) satisfies the following with probability at least 1 − p:

1. If v = ⊥, then ∇²f(x₀) ⪰ −δI.

2. If v ≠ ⊥, then ‖v‖₂ = 1 and v⊤∇²f(x₀)v ≤ −δ/2.

Moreover, the total number of full gradient evaluations is O(log²(d/p)·√(L/δ)).

Proof sketch of Theorem 3. We explain the high-level intuition of Neon2det and the proof of Theorem 3 as follows. Define M := −(1/L)∇²f(x₀) + (1 − 3δ/(4L))·I. We immediately notice that

• all eigenvalues of ∇²f(x₀) in [−3δ/4, L] are mapped to eigenvalues of M in [−1, 1], and

• any eigenvalue of ∇²f(x₀) smaller than −δ is mapped to an eigenvalue of M greater than 1 + δ/(4L).

Therefore, as long as T ≥ Ω̃(L/δ), if we compute x_{T+1} = x₀ + M^T ξ for some random vector ξ, then by the theory of the power method, x_{T+1} − x₀ must be a negative-curvature direction of ∇²f(x₀) with value ≤ −δ/2. There are two issues with this approach.

The first issue is that the degree T of this matrix polynomial M^T can be reduced to T = Θ̃(√(L/δ)) if the so-called Chebyshev polynomial is used.

Claim 4.1. Let T_t(x) be the t-th Chebyshev polynomial of the first kind, defined as [27]:

T₀(x) := 1, T₁(x) := x, T_{n+1}(x) := 2x·T_n(x) − T_{n−1}(x).

Then T_t(x) satisfies:

T_t(x) ∈ [−1, 1] if x ∈ [−1, 1]; and T_t(x) = ½[(x + √(x²−1))^t + (x − √(x²−1))^t] if x > 1.

Since T_t(x) stays in [−1, 1] when x ∈ [−1, 1], and grows to ≈ (1 + √(x²−1))^t for x > 1, we can use T_T(M) in replacement of M^T. Then, any eigenvalue of M that is above 1 + δ/(4L) shall grow at a speed like (1 + √(δ/L))^T, so it suffices to choose T ≥ Θ̃(√(L/δ)). This is quadratically faster than applying the power method, so in Neon2det we wish to compute x_{t+1} ≈ x₀ + T_t(M)ξ.
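A quick numeric illustration of Claim 4.1 (all values chosen by us for the example, not from the paper): the three-term recurrence matches the closed form for x > 1, and the polynomial stays bounded on [−1, 1] while blowing up outside it.

```python
import math

# Numeric check of Claim 4.1: the Chebyshev recurrence
# T_{n+1}(x) = 2x*T_n(x) - T_{n-1}(x) matches the closed form
# 0.5*((x + sqrt(x^2-1))^t + (x - sqrt(x^2-1))^t) for x > 1.

def cheb(t, x):
    """t-th Chebyshev polynomial of the first kind via the recurrence."""
    t_prev, t_cur = 1.0, x          # T_0(x) = 1, T_1(x) = x
    if t == 0:
        return t_prev
    for _ in range(t - 1):
        t_prev, t_cur = t_cur, 2.0 * x * t_cur - t_prev
    return t_cur

x = 1.05                             # plays the role of an eigenvalue > 1
s = math.sqrt(x * x - 1.0)
closed = 0.5 * ((x + s) ** 10 + (x - s) ** 10)
recur = cheb(10, x)                  # should match `closed`, and exceed 1
inside = cheb(5, 0.5)                # T_5(0.5) = cos(5*arccos(0.5)) = 0.5
```

The rapid growth of T_t at arguments slightly above 1 is exactly what buys the quadratic speedup over the plain power method.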

The second issue is that we cannot compute My exactly for a vector y, and can only use a gradient difference to approximate it; that is, we can only use M(y) to approximate My, where

M(y) := −(1/L)(∇f(x₀ + y) − ∇f(x₀)) + (1 − 3δ/(4L))·y.

How does the error propagate if we compute T_t(M)ξ by replacing M with M(·)? Note that this is a very non-trivial question, because the coefficients of the polynomial T_t(x) are as large as 2^{O(t)}.

It turns out that the way the error propagates depends on how the Chebyshev polynomial is calculated. If the so-called backward recurrence formula is used, namely

y₀ = 0, y₁ = ξ, y_t = 2M(y_{t−1}) − y_{t−2},

and we set x_{T+1} = x₀ + y_{T+1} − M(y_T), then this x_{T+1} is sufficiently close to the exact value x₀ + T_t(M)ξ. This is known as the stability theory of computing Chebyshev polynomials, and is proved in our prior work [5].

We defer all the proof details to Appendix B.2.

5 Neon2 in the SVRG Setting

Recall that the shift-and-invert (SI) approach [12] on top of the SVRG method [26] solves the minimum eigenvector problem as follows. Given any matrix A = ∇²f(x₀), suppose its eigenvalues are λ₁ ≤ ⋯ ≤ λ_d. Then, if λ > −λ₁, we can define the positive semidefinite matrix B = (λI + A)^{−1}, and then apply the power method to find an (approximate) maximum eigenvector of B, which necessarily is an (approximate) minimum eigenvector of A.

The SI approach specifies a binary search routine to determine the shifting constant λ, and ensures that B = (λI + A)^{−1} is always “well conditioned,” meaning that it suffices to apply the power method on B for a logarithmic number of iterations. In other words, the task of computing the minimum eigenvector of A reduces to computing matrix-vector products By a poly-logarithmic number of times. Moreover, the stability of SI was shown in a number of papers, including [12] and [4]. This means it suffices for us to compute By approximately.

However, how can we compute By for an arbitrary vector y? It turns out that this is equivalent to minimizing a convex quadratic function that is of a finite-sum form:

g(z) := ½ z⊤(λI + A)z + y⊤z = (1/n) Σ_{i=1}^n (½ z⊤(λI + ∇²f_i(x₀))z + y⊤z).

Therefore, one can apply a variant of the SVRG method (arguably first discovered by Shalev-Shwartz [26]) to solve this task. In each iteration, SVRG needs to evaluate a stochastic gradient (λI + ∇²f_i(x₀))z + y at some point z for some random i ∈ [n]. Instead of evaluating it exactly (which would require a Hessian-vector product), we use ∇f_i(x₀ + z) − ∇f_i(x₀) to approximate ∇²f_i(x₀)·z.

Of course, one also needs to show that the SVRG method is stable to noise. Using similar techniques as in the previous two sections, one can show that the error term is proportional to O(‖z‖₂²), and thus as long as the norm of z is bounded (just as in the previous two sections), this should not affect the performance of the algorithm. We omit the detailed theoretical proof of this result, because it would complicate this paper.

Theorem 3 (Neon2svrg). Let f(x) = (1/n) Σ_{i=1}^n f_i(x), where each f_i is L-smooth and L₂-second-order smooth. For every point x₀ ∈ ℝ^d, every δ > 0, and every p ∈ (0, 1], the output v = Neon2svrg(f, x₀, δ, p) satisfies the following with probability at least 1 − p:

1. If v = ⊥, then ∇²f(x₀) ⪰ −δI.

2. If v ≠ ⊥, then ‖v‖₂ = 1 and v⊤∇²f(x₀)v ≤ −δ/2.

Moreover, the total number of stochastic gradient evaluations is Õ(n + n^{3/4}·√L/√δ).

6 Applications of Neon2

We show how Neon2online can be applied to existing algorithms such as SGD, GD, SCSG, SVRG, Natasha2, and CDHS. Unfortunately, we are unaware of a generic statement for applying Neon2online to any algorithm. Therefore, we have to prove them individually.⁶

Throughout this section, we assume that some starting vector x₀ ∈ ℝ^d and an upper bound ∆f are given to the algorithm, and that ∆f satisfies f(x₀) − min_x{f(x)} ≤ ∆f. This is only for the purpose of proving theoretical bounds. In practice, because ∆f only appears in specifying the number of iterations, one can just run enough iterations and then halt the algorithm, without the necessity of knowing ∆f.

6.1 Auxiliary Claims

Claim 6.1. For any x, using O((V/ε² + 1)·log(1/p)) stochastic gradients, we can decide, with probability 1 − p: either ‖∇f(x)‖ ≥ ε/2 or ‖∇f(x)‖ ≤ ε.

Proof. Suppose we generate m = O(log(1/p)) random uniform subsets S₁, …, S_m of [n], each of cardinality B = max{32V/ε², 1}. Then, denoting by v_j = (1/B) Σ_{i∈S_j} ∇f_i(x), we have according to Fact 2.2 that E_{S_j} ‖v_j − ∇f(x)‖² ≤ V/B = ε²/32. In other words, with probability at least 1/2 over the randomness of S_j, we have |‖v_j‖ − ‖∇f(x)‖| ≤ ‖v_j − ∇f(x)‖ ≤ ε/4. Since m = O(log(1/p)), with probability at least 1 − p, at least m/2 + 1 of the vectors v_j satisfy |‖v_j‖ − ‖∇f(x)‖| ≤ ε/4. Now, if we select v∗ = v_j where j ∈ [m] is the index that gives the median value of ‖v_j‖, then it satisfies |‖v∗‖ − ‖∇f(x)‖| ≤ ε/4. Finally, we check if ‖v∗‖ ≤ 3ε/4. If so, then we conclude that ‖∇f(x)‖ ≤ ε, and if not, we conclude that ‖∇f(x)‖ ≥ ε/2.
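The median-of-batches estimator in this proof can be sketched as follows; the 1-d summands f_i(x) = ½(x − c_i)², the batch sizes, and all constants are our illustrative assumptions:

```python
import random

# Hedged sketch of the estimator in Claim 6.1: average grad f_i over each
# of m random batches and report the batch whose norm is the median.
# Toy summands (our assumption): f_i(x) = 0.5*(x - c_i)^2 in 1-d, so
# grad f_i(x) = x - c_i and grad f(x) = x - mean(c).

rng = random.Random(1)
n = 200
c = [rng.uniform(-1.0, 1.0) for _ in range(n)]
mean_c = sum(c) / n

def grad_i(x, i):
    return x - c[i]

def estimate_grad_norm(x, batch=32, m=15):
    norms = []
    for _ in range(m):
        idx = [rng.randrange(n) for _ in range(batch)]
        avg = sum(grad_i(x, i) for i in idx) / batch
        norms.append(abs(avg))
    norms.sort()
    return norms[m // 2]             # median of the m batch norms

true_norm = abs(2.0 - mean_c)        # exact ||grad f(2.0)||
est = estimate_grad_norm(2.0)
```

Taking the median (rather than the mean) of the batch norms is what boosts a constant-probability estimate to confidence 1 − p with only log(1/p) repetitions.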

Claim 6.2. If v is a unit vector with v⊤∇²f(y)v ≤ −δ/2, and we choose y′ = y ± (δ/L₂)·v where the sign is random, then f(y) − E[f(y′)] ≥ δ³/(12L₂²).

Proof. Letting η = δ/L₂, by the second-order smoothness,

f(y) − E[f(y′)] ≥ E[⟨∇f(y), y − y′⟩] − ½ (y − y′)⊤∇²f(y)(y − y′) − (L₂/6)‖y − y′‖³
= −(η²/2)·v⊤∇²f(y)v − (L₂η³/6)‖v‖³ ≥ η²δ/4 − L₂η³/6 = δ³/(12L₂²).
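A numeric sanity check of this claim on a toy quadratic (all values are our illustrative assumptions; for a quadratic the L₂-cubic error term vanishes because the Hessian is constant):

```python
# Numeric check of the mechanism in Claim 6.2: for the toy quadratic
# f(y) = 0.5*(y0^2 - y1^2) and the unit direction v = (0, 1) with
# v^T (Hessian f) v = -1, stepping y' = y +/- eta*v with a random sign
# decreases f in expectation by -0.5*eta^2 * v^T (Hessian f) v = eta^2/2.

def f(y):
    return 0.5 * (y[0] ** 2 - y[1] ** 2)

y, v, eta = [0.3, 0.1], [0.0, 1.0], 0.05
y_plus = [a + eta * b for a, b in zip(y, v)]
y_minus = [a - eta * b for a, b in zip(y, v)]
# averaging the +/- steps cancels the first-order (gradient) term
expected_decrease = f(y) - 0.5 * (f(y_plus) + f(y_minus))
```

The random sign is what makes the linear term vanish in expectation, leaving the pure negative-curvature gain.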

6.2 Neon2 on SGD and GD

To apply Neon2 to turn SGD into an algorithm finding approximate local minima, we propose the following process Neon2+SGD (see Algorithm 4). In each iteration t, we first apply SGD with mini-batch size O(1/ε²) (see Line 4). Then, if SGD finds a point with small gradient, we apply Neon2online to decide if it has a negative curvature; if so, we move in the direction of the negative curvature (see Line 10). We have the following theorem:

⁶This is because stationary-point finding algorithms have somewhat different guarantees. For instance, in mini-batch SGD we have f(x_t) − E[f(x_{t+1})] ≥ Ω(‖∇f(x_t)‖²), but in SCSG the guarantee is instead on the gradient of the new point x_{t+1} (cf. Lemma 6.4).

Algorithm 4 Neon2+SGD(f, x₀, p, ε, δ)

Input: function f(·), starting vector x₀, confidence p ∈ (0, 1), ε > 0 and δ > 0.

1: K ← O(L₂²∆f/δ³ + L∆f/ε²);  ⋄ ∆f is any upper bound on f(x₀) − min_x{f(x)}
2: for t ← 0 to K − 1 do
3:   S ← a uniform random subset of [n] with cardinality |S| = B := max{8V/ε², 1};
4:   x_{t+1/2} ← x_t − (1/(L|S|)) Σ_{i∈S} ∇f_i(x_t);
5:   if ‖∇f(x_t)‖ ≥ ε/2 then  ⋄ estimate ‖∇f(x_t)‖ using O(ε^{−2}V log(K/p)) stochastic gradients
6:     x_{t+1} ← x_{t+1/2};
7:   else  ⋄ necessarily ‖∇f(x_t)‖ ≤ ε
8:     v ← Neon2online(f, x_t, δ, p/(2K));
9:     if v = ⊥ then return x_t;  ⋄ necessarily ∇²f(x_t) ⪰ −δI
10:    else x_{t+1} ← x_t ± (δ/L₂)·v;  ⋄ necessarily v⊤∇²f(x_t)v ≤ −δ/2
11:  end if
12: end for
13: (this line will not be reached, with probability at least 1 − p)

Algorithm 5 Neon2+GD(f, x₀, p, ε, δ)

Input: function f(·), starting vector x₀, confidence p ∈ (0, 1), ε > 0 and δ > 0.

1: K ← O(L₂²∆f/δ³ + L∆f/ε²);  ⋄ ∆f is any upper bound on f(x₀) − min_x{f(x)}
2: for t ← 0 to K − 1 do
3:   x_{t+1/2} ← x_t − (1/L)∇f(x_t);
4:   if ‖∇f(x_t)‖ ≥ ε/2 then
5:     x_{t+1} ← x_{t+1/2};
6:   else
7:     v ← Neon2det(f, x_t, δ, p/(2K));
8:     if v = ⊥ then return x_t;  ⋄ necessarily ∇²f(x_t) ⪰ −δI
9:     else x_{t+1} ← x_t ± (δ/L₂)·v;  ⋄ necessarily v⊤∇²f(x_t)v ≤ −δ/2
10:  end if
11: end for
12: (this line will not be reached, with probability at least 1 − p)
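The outer loops above can be caricatured in one dimension. In this hedged sketch, the objective f(x) = x⁴/4 − x²/2 (a saddle-like critical point at 0 with f″(0) = −1 and local minima at ±1), the oracle `nc_search` built from the exact second derivative, and all constants are our illustrative assumptions, not the paper's algorithms:

```python
import random

# 1-d caricature of the Neon2+SGD / Neon2+GD outer loop: take gradient
# steps while the gradient is large; otherwise run an NC-search, and
# either escape along the returned direction or stop.

def grad(x):
    return x ** 3 - x          # gradient of f(x) = x^4/4 - x^2/2

def second_deriv(x):
    return 3.0 * x ** 2 - 1.0

def nc_search(x, delta):
    """Stand-in for Neon2: a descent direction if f''(x) <= -delta."""
    return 1.0 if second_deriv(x) <= -delta else None

def neon2_loop_sketch(x0, eps=1e-3, delta=0.5, lr=0.1, steps=1000, seed=3):
    rng = random.Random(seed)
    x = x0
    for _ in range(steps):
        if abs(grad(x)) >= eps / 2.0:
            x -= lr * grad(x)                          # plain (S)GD step
        else:
            v = nc_search(x, delta)
            if v is None:
                return x                               # approximate local minimum
            x += rng.choice([-1.0, 1.0]) * delta * v   # escape the saddle
    return x

x_final = neon2_loop_sketch(0.0)   # ends near a local minimum x = +/-1
```

Started exactly at the critical point 0, the loop first escapes via the negative-curvature step and then descends to one of the two local minima, mirroring the interplay of Lines 4 and 10 in Algorithm 4.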

Theorem 5a. With probability at least 1 − p, Neon2+SGD outputs an (ε, δ)-approximate local minimum in gradient complexity T = Õ((V/ε² + 1)·(L₂²∆f/δ³ + L∆f/ε²) + (L²/δ²)·(L₂²∆f/δ³)).

Corollary 6.3. Treating ∆f, V, L, L₂ as constants, we have T = Õ(1/ε⁴ + 1/(ε²δ³) + 1/δ⁵).

One can similarly (and more easily) give an algorithm Neon2+GD, which is the same as Neon2+SGD except that the mini-batch SGD is replaced with full gradient descent, and the use of Neon2online is replaced with Neon2det. We have the following theorem:

Theorem 5c. With probability at least 1 − p, Neon2+GD outputs an (ε, δ)-approximate local minimum in Õ(L∆f/ε² + (L^{1/2}/δ^{1/2})·(L₂²∆f/δ³)) full gradient computations.


6.3 Neon2 on SCSG and SVRG

Background. We first recall the main idea of the SVRG method for non-convex optimization [3, 23]. It is an offline method, but is what SCSG is built on. SVRG divides iterations into epochs, each of length n. It maintains a snapshot point x̃ for each epoch, and computes the full gradient ∇f(x̃) only for snapshots. Then, in each iteration t at point x_t, SVRG defines the gradient estimator ∇̃f(x_t) := ∇f_i(x_t) − ∇f_i(x̃) + ∇f(x̃), which satisfies E_i[∇̃f(x_t)] = ∇f(x_t), and performs the update x_{t+1} ← x_t − α∇̃f(x_t) for a learning rate α.
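The SVRG estimator just described can be sketched in a few lines; the 1-d summands f_i(x) = ½a_i(x − c_i)² and all parameters are our illustrative assumptions:

```python
import random

# Hedged sketch of the SVRG gradient estimator:
#     est = grad f_i(x) - grad f_i(snapshot) + grad f(snapshot),
# which is unbiased over the random index i.

rng = random.Random(7)
n = 50
a = [rng.uniform(0.5, 1.5) for _ in range(n)]
c = [rng.uniform(-1.0, 1.0) for _ in range(n)]

def grad_i(x, i):
    return a[i] * (x - c[i])   # gradient of f_i(x) = 0.5*a_i*(x - c_i)^2

def full_grad(x):
    return sum(grad_i(x, i) for i in range(n)) / n

def svrg_estimator(x, snapshot, i):
    """Variance-reduced estimate of grad f(x) anchored at a snapshot."""
    return grad_i(x, i) - grad_i(snapshot, i) + full_grad(snapshot)

# Averaging the estimator over all indices recovers grad f(x) exactly,
# i.e. the unbiasedness property E_i[est] = grad f(x).
avg = sum(svrg_estimator(0.7, 0.2, i) for i in range(n)) / n
```

In an actual SVRG epoch, `full_grad(snapshot)` is computed once per epoch and reused, which is where the gradient-complexity savings come from; SCSG further replaces it by a large mini-batch average.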

The SCSG method of Lei et al. [18] proposed a simple fix to turn SVRG into an online method. They changed the epoch length of SVRG from n to B 1/ε2, and then replaced the computation of f(ex) with |S1|PiSfi(xe) where S is a random subset of [n] with cardinality |S| = B. To make this approach even more general, they also analyzed SCSG in the mini-batch setting, with mini-batch sizeb∈ {1,2, . . . , B}.7 Their Theorem 3.1 [18] says that,

Lemma 6.4 ([18]). There exist constant C >1 such that, if we run SCSG for an epoch of size B (so using O(B) stochastic gradients)8 with mini-batch b ∈ {1,2, . . . , B} starting from a point xt and moving tox+t, then

Ek∇f(x+

t )k2

≤C·L(b/B)1/3 f(xt)−E[f(x+t)]

+ 6V B .

Our Approach. In principle, one can apply the same idea ofNeon2+SGDonSCSG to turn it into an algorithm finding approximate local minima. Unfortunately, this is not quite possible because the left hand side of Lemma 6.4 is onEk∇f(x+

t )k2

, as opposed tok∇f(xt)k2 in SGD (see (C.1)). This means, instead of testing whetherxtis a good local minimum (as we did inNeon2+SGD), this time we need to test whetherx+t is a good local minimum. This creates some extra difficulty so we need a different proof.

Remark 6.5. As for the parameters of SCSG, we simply use B = max{1, 48V/ε²}. However, choosing mini-batch size b = 1 does not necessarily give the best complexity, so a tradeoff b = Θ( (ε²+V)·ε⁴·L₂⁶ / (δ⁹·L³) ) is needed. (A similar tradeoff was also discovered by the authors of Neon [28].) Note that this quantity b may be larger than B, and if this happens, SCSG becomes essentially equivalent to one iteration of SGD with mini-batch size b. Instead of analyzing this boundary case b > B separately, we decide to simply run Neon2+SGD whenever b > B happens, to simplify our proof.

We show the following theorem (proved in Appendix C).

Theorem 5b. With probability at least 2/3, Neon2+SCSG outputs an (ε, δ)-approximate local minimum in gradient complexity

T = Õ( ( L∆f/(ε^{4/3}·V^{1/3}) + L₂²∆f/δ³ ) · ( V/ε² + L²/δ² ) + (L∆f/ε²)·(L²/δ²) ) .

(To provide the simplest proof, we have shown Theorem 5b only with probability 2/3. One can for instance boost the confidence to 1−p by running O(log(1/p)) copies of Neon2+SCSG.)

Corollary 6.6. Treating ∆f, V, L, L₂ as constants, we have T = Õ( 1/ε^{10/3} + 1/(ε²δ³) + 1/δ⁵ ).

⁷That is, they reduced the epoch length to B/b, and replaced ∇f_i(x_t) − ∇f_i(x̃) with (1/|S′|)·Σ_{i∈S′}( ∇f_i(x_t) − ∇f_i(x̃) ) for some S′ that is a random subset of [n] with cardinality |S′| = b.

⁸We remark that Lei et al. [18] only showed that an epoch runs in an expectation of O(B) stochastic gradients. We assume it is exact here to simplify proofs. One can for instance stop SCSG after O(B·log(1/p)) stochastic gradient computations.

Algorithm 6 Neon2+SCSG(f, x₀, ε, δ)

Input: function f(·), starting vector x₀, ε > 0 and δ > 0.
1: B ← max{1, 48V/ε²};  b ← max{ 1, Θ( (ε²+V)·ε⁴·L₂⁶ / (δ⁹·L³) ) };
2: if b > B then return Neon2+SGD(f, x₀, 2/3, ε, δ);  ⋄ for cleaner analysis purposes, see Remark 6.5
3: K ← Θ( L·b^{1/3}·∆f / (ε^{4/3}·V^{1/3}) );  ⋄ ∆f is any upper bound on f(x₀) − min_x{f(x)}
4: for t ← 0 to K−1 do
5:   x_{t+1/2} ← apply SCSG on x_t for one epoch of size B = max{Θ(V/ε²), 1};
6:   if ‖∇f(x_{t+1/2})‖ ≥ ε/2 then  ⋄ estimate ‖∇f(x_{t+1/2})‖ using O(ε⁻²·V·log K) stochastic gradients
7:     x_{t+1} ← x_{t+1/2};
8:   else  ⋄ necessarily ‖∇f(x_{t+1/2})‖ ≤ ε
9:     v ← Neon2^online(f, x_{t+1/2}, δ, 1/(20K));
10:    if v = ⊥ then return x_{t+1/2};  ⋄ necessarily ∇²f(x_{t+1/2}) ⪰ −δI
11:    else x_{t+1} ← x_{t+1/2} ± (δ/L₂)·v;  ⋄ necessarily v^⊤∇²f(x_{t+1/2})v ≤ −δ/2
12:   end if
13: end for
14: ⋄ this line is not reached (with probability at least 2/3)
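To make the gradient-norm test in Line 6 concrete, here is a hedged sketch: it averages B = Θ(V/ε²) stochastic gradients and thresholds the norm of the average at ε/2. The Gaussian noise model, the constant 48, and the helper names (`stochastic_grad`, `norm_test`) are illustrative assumptions, not the paper's specification.

```python
import numpy as np

# Estimate ||grad f(x)|| from a minibatch of stochastic gradients and
# compare against eps/2; V bounds the variance of one stochastic gradient.
rng = np.random.default_rng(4)
d, eps, V = 5, 0.1, 1.0

def stochastic_grad(true_grad):
    # toy model: true gradient plus zero-mean noise of total variance V
    return true_grad + rng.standard_normal(d) * np.sqrt(V / d)

def norm_test(true_grad, eps, V):
    B = int(48 * V / eps**2)      # minibatch size B = Theta(V / eps^2)
    est = np.mean([stochastic_grad(true_grad) for _ in range(B)], axis=0)
    return bool(np.linalg.norm(est) >= eps / 2)

g_large = np.zeros(d); g_large[0] = 10 * eps   # clearly non-stationary point
g_zero = np.zeros(d)                           # stationary point
assert norm_test(g_large, eps, V)
assert not norm_test(g_zero, eps, V)
```

With B = Θ(V/ε²) samples the estimation error is O(ε) with high probability, which is why "est ≥ ε/2" separates the two branches of Line 6.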

As for SVRG, it is an offline method, and its one-epoch lemma looks like⁹

E‖∇f(x_t⁺)‖² ≤ C·(L/n^{1/3})·( f(x_t) − E[f(x_t⁺)] ) .

If one replaces the use of Lemma 6.4 with this new inequality, and replaces the use of Neon2^online with Neon2^svrg, then we get the following theorem:

Theorem 5d. With probability at least 2/3, Neon2+SVRG outputs an (ε, δ)-approximate local minimum in gradient complexity

T = Õ( ( L∆f/(ε²·n^{1/3}) + L₂²∆f/δ³ ) · ( n + n^{3/4}·√L/√δ ) ) .

For a clean presentation of this paper, we omit the pseudocode and proof, because they are only simpler than those for Neon2+SCSG.

6.4 Neon2 on Natasha2 and CDHS

The recent results of Carmon et al. [7] (which we refer to as CDHS) and Natasha2 [2] are both Hessian-free methods in which the only Hessian-vector product computations come from the exact NC-search process we study in this paper. Therefore, by replacing their NC-search with Neon2, we can directly turn them into first-order methods, without the need to compute Hessian-vector products.

We state the following two theorems, whose proofs are exactly the same as in the papers [7] and [2]. We state them directly assuming that ∆f, V, L, L₂ are constants, to simplify our notation.

Theorem 2. One can replace Oja's algorithm with Neon2^online in Natasha2 without hurting its performance, turning it into a first-order stochastic method.

Treating ∆f, V, L, L₂ as constants, Natasha2 finds an (ε, δ)-approximate local minimum in T = Õ( 1/ε^{3.25} + 1/(ε³δ) + 1/δ⁵ ) stochastic gradient computations.

Theorem 4. One can replace the Lanczos method with Neon2^det or Neon2^svrg in CDHS without hurting its performance, turning it into a first-order method.

Treating ∆f, L, L₂ as constants, CDHS finds an (ε, δ)-approximate local minimum in either Õ( 1/ε^{1.75} + 1/δ^{3.5} ) full gradient computations (if Neon2^det is used), or in T = Õ( n/ε^{1.5} + n/δ³ + n^{3/4}/ε^{1.75} + n^{3/4}/δ^{3.5} ) stochastic gradient computations (if Neon2^svrg is used).

Acknowledgements

We would like to thank Tianbao Yang and Yi Xu for helpful feedback on this manuscript.

Appendix

A Missing Proofs for Section 3

A.1 Auxiliary Lemmas

We use the following lemma to approximate Hessian-vector products:

Lemma A.1. If f(x) is L₂-second-order smooth, then for every point x ∈ ℝ^d and every vector v ∈ ℝ^d, we have

‖∇f(x+v) − ∇f(x) − ∇²f(x)·v‖₂ ≤ L₂‖v‖₂² .

Proof of Lemma A.1. We can write ∇f(x+v) − ∇f(x) = ∫₀¹ ∇²f(x+tv)·v dt. Subtracting ∇²f(x)·v, we have

‖∇f(x+v) − ∇f(x) − ∇²f(x)·v‖₂ = ‖ ∫₀¹ ( ∇²f(x+tv) − ∇²f(x) )·v dt ‖₂
 ≤ ∫₀¹ ‖∇²f(x+tv) − ∇²f(x)‖₂·‖v‖₂ dt ≤ L₂‖v‖₂² . □
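This lemma is exactly why first-order methods can emulate Hessian-vector products: the gradient difference ∇f(x+v) − ∇f(x) approximates ∇²f(x)·v up to an O(‖v‖²) error. A minimal numerical check on a toy function (the quartic f, the dimension, and the constant in the assertion are illustrative assumptions, chosen to dominate the local Hessian-Lipschitz constant):

```python
import numpy as np

# Toy smooth function f(x) = 0.25 * ||x||^4, with explicit gradient and Hessian.
def grad(x):
    return np.dot(x, x) * x                       # grad f(x) = ||x||^2 x

def hess(x):
    return np.dot(x, x) * np.eye(len(x)) + 2.0 * np.outer(x, x)

rng = np.random.default_rng(1)
x = rng.standard_normal(4)
v = 1e-3 * rng.standard_normal(4)

# Gradient-difference approximation of the Hessian-vector product:
hvp_approx = grad(x + v) - grad(x)
err = np.linalg.norm(hvp_approx - hess(x) @ v)

# Lemma A.1: the error is second order in ||v|| (constant 100 is generous
# for the local Hessian-Lipschitz constant of this f near x).
assert err <= 100.0 * np.linalg.norm(v) ** 2
```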

We also need the following auxiliary lemma, a martingale concentration bound:

Lemma A.2. Consider random events {F_t}_{t≥1} and random variables x₁, ..., x_T ≥ 0 and a₁, ..., a_T ∈ [−ρ, ρ] for ρ ∈ [0, 1/2], where each x_t and a_t depends only on F₁, ..., F_t. Let x₀ = 0, and suppose there exist a constant b ≥ 0 and a λ > 0 such that for every t ≥ 1:

x_t ≤ x_{t−1}(1 − a_t) + b  and  E[a_t | F₁, ..., F_{t−1}] ≥ −λ .

Then, for every p ∈ (0, 1):  Pr[ x_T ≥ T·b·e^{λT + 2ρ√(T·log(T/p))} ] ≤ p .

Proof. We know that

x_T ≤ (1 − a_T)x_{T−1} + b
 ≤ (1 − a_T)( (1 − a_{T−1})x_{T−2} + b ) + b
 = (1 − a_T)(1 − a_{T−1})x_{T−2} + (1 − a_T)b + b
 ≤ ⋯ ≤ Σ_{s=1}^{T} ( Π_{t=s+1}^{T} (1 − a_t) )·b .

For each s ∈ [T], we consider the random process defined as y_s := b and y_{t+1} := (1 − a_t)·y_t for t ≥ s, so that the s-th summand above is y_T. Therefore

log y_{t+1} = log(1 − a_t) + log y_t ,

where log(1 − a_t) ∈ [−2ρ, ρ] and E[ log(1 − a_t) | F₁, ···, F_{t−1} ] ≤ λ. Thus, we can apply the Azuma-Hoeffding inequality to log y_t and conclude that

Pr[ y_T ≥ b·e^{λT + 2ρ√(T·log(T/p))} ] ≤ p/T .

Taking a union bound over s, we complete the proof. □

A.2 Proof of Lemma 3.1

Proof of Lemma 3.1. Let i_t ∈ [n] be the random index i chosen when computing x_{t+1} from x_t in Line 5 of Neon2^online_weak. We will write the update rule of x_t in terms of the Hessian, before the algorithm stops. By Lemma A.1, we know that for every t ≥ 1,

‖∇f_{i_t}(x_t) − ∇f_{i_t}(x₀) − ∇²f_{i_t}(x₀)(x_t − x₀)‖₂ ≤ L₂‖x_t − x₀‖₂² .

Therefore, there exists an error vector ξ_t ∈ ℝ^d with ‖ξ_t‖₂ ≤ L₂‖x_t − x₀‖₂² such that

(x_{t+1} − x₀) = (x_t − x₀) − η∇²f_{i_t}(x₀)(x_t − x₀) + ηξ_t .

For notational simplicity, let us denote

z_t := x_t − x₀ ,  A_t := B_t + R_t ,  where B_t := ∇²f_{i_t}(x₀) and R_t := −ξ_t·z_t^⊤ / ‖z_t‖₂² ;

then it satisfies

z_{t+1} = z_t − ηB_t z_t + ηξ_t = (I − ηA_t)·z_t .

We have ‖R_t‖₂ ≤ L₂‖z_t‖₂ ≤ L₂·r. By the L-smoothness of f_{i_t}, we know ‖B_t‖₂ ≤ L, and thus ‖A_t‖₂ ≤ ‖B_t‖₂ + ‖R_t‖₂ ≤ L + L₂·r ≤ 2L.

Now, define

Φ_t := z_{t+1}·z_{t+1}^⊤ = (I − ηA_t)⋯(I − ηA₁)·ξξ^⊤·(I − ηA₁)⋯(I − ηA_t)  and  w_t := z_t/‖z_t‖₂ = z_t/(Tr(Φ_{t−1}))^{1/2} .

Then, before we stop, we have:

Tr(Φ_t) = Tr(Φ_{t−1})·( 1 − 2η·w_t^⊤A_t w_t + η²·w_t^⊤A_t^⊤A_t w_t )
 ≤ Tr(Φ_{t−1})·( 1 − 2η·w_t^⊤A_t w_t + 4η²L² )
 ≤ Tr(Φ_{t−1})·( 1 − 2η·w_t^⊤B_t w_t + 2η‖R_t‖₂ + 4η²L² )
 ≤① Tr(Φ_{t−1})·( 1 − 2η·w_t^⊤B_t w_t + 8η²L² ) .

Above, ① is because our choice of parameters satisfies r ≤ ηL²/L₂. Therefore,

log(Tr(Φ_t)) ≤ log(Tr(Φ_{t−1})) + log( 1 − 2η·w_t^⊤B_t w_t + 8η²L² ) .

Letting λ := −λ_min(∇²f(x₀)) = −λ_min(E_{B_t}[B_t]): since the randomness of B_t is independent of w_t, we know that w_t^⊤B_t w_t ∈ [−L, L] and, for every w_t, E_{B_t}[ w_t^⊤B_t w_t | w_t ] ≥ −λ. By the concavity of log, this also implies E[ log(1 − 2η·w_t^⊤B_t w_t + 8η²L²) ] ≤ 2ηλ + 8η²L², and log(1 − 2η·w_t^⊤B_t w_t + 8η²L²) ∈ [ −2(2ηL + 8η²L²), 2ηL + 8η²L² ] ⊆ [ −6ηL, 3ηL ].

Hence, applying the Azuma-Hoeffding inequality to log(Tr(Φ_t)), we conclude that with probability at least 1−p, Neon2^online_weak will not terminate until t ≥ T₀.

Next, we turn to accuracy. Let the "true" vector be v_{t+1} := …

Above, ① assumes without loss of generality that L ≥ 1 (as otherwise we can re-scale the problem). Therefore, applying the martingale concentration bound (Lemma A.2), we know Pr[…].

Now we can apply the recent result on Oja's algorithm [6, Theorem 4]. By our choice of the parameter η, with probability at least 99/100 the following holds:

1. Norm growth: ‖v_t‖₂ ≥ e^{(ηλ−2η·…)}·…

By our choice of parameters, we know that

e^{ηλT₁ + 2ηL·√(T₁·log(1/p))} ≥ 2r/σ ,

which implies that, with probability at least 98/100, Neon2^online_weak must terminate within T₁ ≤ T iterations (step ① here is due to our choice of r and σ in terms of δ and p).

Moreover, recall that with probability at least 1−p, Neon2^online_weak will not terminate before iteration T₀. Thus, at the time t ∈ [T₀, T₁] of termination, the stated guarantee holds with probability at least 98/100. Putting everything together, we complete the proof. □
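The heart of this proof is that the update z_{t+1} = (I − ηA_t)·z_t is a (stochastic) power method on I − η∇²f(x₀), implemented with only gradient queries. The following deterministic caricature illustrates it on a quadratic, where the gradient difference is exact; the matrix H, the step count, and all constants are illustrative assumptions, not the paper's parameter choices.

```python
import numpy as np

# On f(x) = 0.5 x^T H x, the update x <- x - eta*(grad f(x) - grad f(x0))
# is exactly the power method on (I - eta*H): the displacement x - x0
# aligns with the most negative eigendirection of H.
rng = np.random.default_rng(3)
d, delta, L = 8, 0.5, 4.0
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
eigs = np.linspace(0.2, L, d)
eigs[0] = -delta                         # one negative-curvature direction
H = Q @ np.diag(eigs) @ Q.T

def grad(x):
    return H @ x

eta = 1.0 / (2 * L)
x0 = rng.standard_normal(d)
x = x0 + 1e-6 * rng.standard_normal(d)   # tiny random perturbation
for _ in range(400):
    x = x - eta * (grad(x) - grad(x0))   # only gradient queries

v = (x - x0) / np.linalg.norm(x - x0)
# Rayleigh quotient certifies negative curvature: v^T H v <= -delta/2.
assert v @ H @ v <= -delta / 2
```

In Neon2^online the matrix is only accessed through stochastic gradients of individual f_i, which is why the error terms ξ_t and the martingale concentration above are needed.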

A.3 Proof of Lemma 3.2

Proof of Lemma 3.2. By Lemma A.1, we know that for every i ∈ [n], … From Lemma A.1, we conclude that Pr[…]. Plugging in our assumption on ‖v‖₂ and our choice of m finishes the proof. □

B Missing Proofs for Section 4

B.1 Stable Computation of Chebyshev Polynomials

We recall the following result from [5, Section 6.2] regarding one way to stably compute Chebyshev polynomials. Suppose we want to compute …

Definition B.1 (inexact backward recurrence). Let M be an approximate algorithm that satisfies …

Theorem B.2 (stable Chebyshev sum). For every N ∈ ℕ*, suppose the eigenvalues of M are in [a, b], and suppose there are parameters C_U ≥ 1, C_T ≥ 1, ρ ≥ 1, C_c ≥ 0 satisfying …

Then, if the inexact backward recurrence in Def. B.1 is applied with ε ≤ 1/(4N·C_U), we have

‖ŝ_N − s⃗_N‖ ≤ ε·2(1 + 2N·C_T)·N·C_U·C_c .

B.2 Proof of Theorem 3

Proof of Theorem 3. We can without loss of generality assume δ ≤ L. For notational simplicity, let us denote by T_t(·) the Chebyshev polynomials of the first kind. However, we cannot multiply M to vectors (because we are not allowed to use Hessian-vector products). We define the approximate operator M(·) and shall use it to approximate My, and then apply the backward recurrence

y₀ = 0 ,  y₁ = ξ ,  y_t = 2·M(y_{t−1}) − y_{t−2} .

Throughout the iterations of Neon2^det, we have

y_t = 2·M(y_{t−1}) − y_{t−2} = 2(x₀ − x_t + y_t) − y_{t−2}  ⟹  y_t − y_{t−2} = 2(x_t − x₀) .
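As a sanity check on the recurrence itself, the sketch below runs the three-term Chebyshev recurrence with exact matrix-vector products (a simplification: Neon2^det replaces each product M·y by a gradient difference M(y)). The random symmetric M and the degree N are illustrative assumptions.

```python
import numpy as np

# The backward recurrence y_t = 2 M y_{t-1} - y_{t-2} computes T_t(M) xi,
# where T_t is the t-th Chebyshev polynomial of the first kind.
rng = np.random.default_rng(2)
d, N = 6, 9
S = rng.standard_normal((d, d))
H = (S + S.T) / 2
M = H / (np.linalg.norm(H, 2) + 1.0)     # spectrum strictly inside (-1, 1)
xi = rng.standard_normal(d)

y_prev, y = xi, M @ xi                    # T_0(M) xi and T_1(M) xi
for _ in range(2, N + 1):
    y_prev, y = y, 2 * (M @ y) - y_prev   # now y = T_t(M) xi

# Cross-check via the eigendecomposition, using T_N(x) = cos(N arccos x)
# for x in [-1, 1]: T_N(M) = U diag(T_N(lam)) U^T.
lam, U = np.linalg.eigh(M)
TN = np.cos(N * np.arccos(lam))
assert np.allclose(y, U @ (TN * (U.T @ xi)))
```

The stability analysis of Theorem B.2 is what controls how the per-step approximation error of M(·) accumulates through this recurrence.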

Since we have ‖x_t − x₀‖₂ ≤ r for each t before termination, we know ‖y_t‖₂ ≤ 2tr. Using this upper bound, we can approximate each Hessian-vector product by a gradient difference: Lemma A.1 gives us

‖M(y_t) − M·y_t‖₂ ≤ L₂·…

Now, recall from Claim 4.1 that …

Theorem B.2 tells us that, for every t before termination,

‖x*_{t+1} − x_{t+1}‖₂ ≤ 32·L₂·r·t³·ρ^t·σ / L .

In order to prove Theorem 3, it suffices for us to show in the rest of the proof that, if λ_min(∇²f(x₀)) ≤ −δ, then with probability at least 1−p, it satisfies v ≠ ⊥, ‖v‖₂ = 1, and v^⊤∇²f(x₀)v ≤ −δ/2. In other words, we can assume λ ≥ δ.

The bound λ ≥ δ implies ρ ≥ 1 + √δ > 1, so we can let

T₁ := log( 4dr/(pσ) ) / log ρ ≤ T .

By Claim 4.1, we know that ‖T_{T₁}(M)‖₂ ≥ (1/2)·ρ^{T₁} = 2dr/(pσ). Thus, with probability at least 1−p,

‖x*_{T₁+1} − x₀‖₂ = ‖T_{T₁}(M)·ξ‖₂ ≥ 2r .

Moreover, at iteration T₁, we have:

‖x*_{T₁+1} − x_{T₁+1}‖₂ ≤ 32·L₂·r·T₁³·ρ^{T₁}·σ/L ≤ (32·L₂·r·T₁³·σ/L) × (4dr/(pσ)) = 128·d·L₂·T₁³·r²/(L·p) ≤① δ·r/(100L) ≤ r/16 .

Here, ① uses the fact that r ≤ δ·p/(12800·d·L₂·T₁³). This means ‖x_{T₁+1} − x₀‖₂ ≥ r, so the algorithm must terminate before iteration T₁ ≤ T.

On the other hand, since ‖T_t(M)‖₂ ≤ ρ^t, we know that the algorithm will not terminate until t ≥ T₀, for

T₀ := log( r/(2σ) ) / log ρ .

At the time t ≥ T₀ of termination, by the properties of Chebyshev polynomials (Claim 4.1), we know:

1. T_t(ρ) ≥ (1/2)·ρ^t ≥ (1/2)·ρ^{T₀} ≥ r/(4σ) = (d/p)^{Θ(1)} .
2. ∀x ∈ [−1, 1]: T_t(x) ∈ [−1, 1] .

Since all the eigenvalues of A that are ≥ −(3/4)δ are mapped to eigenvalues of M in [−1, 1], and the smallest eigenvalue of A is mapped to the eigenvalue ρ of M, we have, with probability at least 1−p, letting v_t := x*_{t+1} − x₀, that

v_t^⊤ A v_t / ‖v_t‖₂² ≤ −(5/8)·δ .

Therefore, denoting z_t := x_{t+1} − x₀, we have

z_t^⊤ A z_t / ‖z_t‖₂² = (‖v_t‖₂² / ‖z_t‖₂²) · ( z_t^⊤ A z_t / ‖v_t‖₂² )
 ≤ (16/15) · ( v_t^⊤ A v_t + 4L·‖z_t − v_t‖₂·‖v_t‖₂ ) / ‖v_t‖₂²
 ≤ (16/15) · ( v_t^⊤ A v_t / ‖v_t‖₂² + 4L·‖v_t − z_t‖₂ / ‖v_t‖₂ )
 ≤ (16/15) · ( v_t^⊤ A v_t / ‖v_t‖₂² + δ/25 )
 ≤ −(51/100)·δ .

C Missing Proofs for Section 6

C.1 Proof of Theorem 5a

Proof of Theorem 5a. Since both estimating ‖∇f(x_t)‖ in Line 5 (see Claim 6.1) and invoking Neon2^online (see Theorem 1) succeed with high probability, we can assume that they always succeed. This means that whenever we output x_t in an iteration, it already satisfies ‖∇f(x_t)‖ ≤ ε and ∇²f(x_t) ⪰ −δI. Therefore, it remains to show that the algorithm must output some x_t in an iteration, as well as to compute the final complexity.

Recall (from classical SGD theory) that if we update x_{t+1/2} ← x_t − (α/|S|)·Σ_{i∈S} ∇f_i(x_t), where α > 0 is …, then the algorithm must terminate and return x_t in one of its iterations. This ensures that Line 13 will not be reached. As for the total complexity, we note that each iteration of Neon2+SGD is dominated by Õ(B) = Õ(V/ε² + 1) stochastic gradient computations in Line 4 and Line 5, totaling Õ((V/ε² + 1)·K), as well as Õ(L²/δ²) stochastic gradient computations by Neon2^online; the latter will not happen more than O(L₂²∆f/δ³) times. Therefore, the total gradient complexity is as stated in Theorem 5a. □

C.2 Proof of Theorem 5b

Proof of Theorem 5b. We first note that in the special case b = Θ( (ε²+V)·ε⁴·L₂⁶/(δ⁹·L³) ) ≥ B, or equivalently δ³ ≤ O(L₂²·ε²/L), Theorem 5a gives us gradient complexity T = Õ(…).

Since both estimating ‖∇f(x_t)‖ in Line 6 (see Claim 6.1) and invoking Neon2^online (see Theorem 1) succeed with high probability, we can assume that they always succeed. This means that whenever we output x_{t+1/2} in an iteration, it already satisfies ‖∇f(x_{t+1/2})‖ ≤ ε and ∇²f(x_{t+1/2}) ⪰ −δI.

For analysis purposes, let us assume that, whenever the algorithm reaches v = ⊥ (in Line 10), it does not immediately halt and instead sets x_{t+1} = x_{t+1/2}. This modification ensures that the random variables x_t and x_{t+1/2} are well defined for t = 0, 1, ..., K−1.

Let N₁ and N₂ respectively be the number of times we reach Line 9 (so we invoke Neon2^online) and the number of times we reach Line 11 (so we update x_{t+1} = x_{t+1/2} ± (δ/L₂)·v). Both N₁ and N₂ are random variables, and N₁ ≥ N₂. To prove that Neon2+SCSG outputs in an iteration, we need to prove that N₁ > N₂ holds with at least constant probability.

Let us apply the key SCSG lemma (Lemma 6.4) to an epoch with size B = max{1, 48V/ε²} and mini-batch size b ≥ 1; we have

E‖∇f(x_{t+1/2})‖² ≤ C·L·(b/B)^{1/3}·( f(x_t) − E[f(x_{t+1/2})] ) + ε²/8 .

Note that the third case v ≠ ⊥ happens N₂ times. Therefore, combining the three cases together with (C.2), we have N₁ > N₂ with probability at least 2/3; this means the algorithm must terminate and output some x_{t+1/2} in an iteration.

Finally, the per-iteration complexity of Neon2+SCSG is dominated by Õ(B) stochastic gradient computations (for both SCSG and estimating ‖∇f(x_{t+1/2})‖), as well as Õ(L²/δ²) for invoking Neon2^online. This totals to the gradient complexity stated in Theorem 5b. □

References

[1] Naman Agarwal, Zeyuan Allen-Zhu, Brian Bullins, Elad Hazan, and Tengyu Ma. Finding Approximate Local Minima for Nonconvex Optimization in Linear Time. InSTOC, 2017. Full version available at

http://arxiv.org/abs/1611.01146.

[2] Zeyuan Allen-Zhu. Natasha 2: Faster Non-Convex Optimization Than SGD. ArXiv e-prints, abs/1708.08694, August 2017.

[3] Zeyuan Allen-Zhu and Elad Hazan. Variance Reduction for Faster Non-Convex Optimization. InICML, 2016. Full version available at http://arxiv.org/abs/1603.05643.

[4] Zeyuan Allen-Zhu and Yuanzhi Li. LazySVD: Even Faster SVD Decomposition Yet Without Agonizing Pain. InNIPS, 2016. Full version available athttp://arxiv.org/abs/1607.03463.

[5] Zeyuan Allen-Zhu and Yuanzhi Li. Faster Principal Component Regression and Stable Matrix Cheby-shev Approximation. In Proceedings of the 34th International Conference on Machine Learning, ICML ’17, 2017.

[6] Zeyuan Allen-Zhu and Yuanzhi Li. Follow the Compressed Leader: Faster Online Learning of Eigenvec-tors and Faster MMWU. InICML, 2017. Full version available athttp://arxiv.org/abs/1701.01722. [7] Yair Carmon, John C. Duchi, Oliver Hinder, and Aaron Sidford. Accelerated Methods for Non-Convex

Optimization. ArXiv e-prints, abs/1611.00756, November 2016.

[8] Yair Carmon, Oliver Hinder, John C. Duchi, and Aaron Sidford. ”Convex Until Proven Guilty”: Dimension-Free Acceleration of Gradient Descent on Non-Convex Functions. InICML, 2017.

[9] Anna Choromanska, Mikael Henaff, Michael Mathieu, G´erard Ben Arous, and Yann LeCun. The loss surfaces of multilayer networks. In AISTATS, 2015.

[10] Yann N Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, and Yoshua Ben-gio. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. InNIPS, pages 2933–2941, 2014.

[11] Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A Fast Incremental Gradient Method With Support for Non-Strongly Convex Composite Objectives. InNIPS, 2014.

[12] Dan Garber, Elad Hazan, Chi Jin, Sham M. Kakade, Cameron Musco, Praneeth Netrapalli, and Aaron Sidford. Robust shift-and-invert preconditioning: Faster and more sample efficient algorithms for eigen-vector computation. InICML, 2016.

[13] Rong Ge, Furong Huang, Chi Jin, and Yang Yuan. Escaping from saddle points—online stochastic gradient for tensor decomposition. InProceedings of the 28th Annual Conference on Learning Theory, COLT 2015, 2015.

[14] I. J. Goodfellow, O. Vinyals, and A. M. Saxe. Qualitatively characterizing neural network optimization problems. ArXiv e-prints, December 2014.

[15] Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M Kakade, and Michael I Jordan. How to Escape Saddle Points Efficiently. In ICML, 2017.

[16] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduc-tion. InAdvances in Neural Information Processing Systems, NIPS 2013, pages 315–323, 2013. [17] Cornelius Lanczos. An iteration method for the solution of the eigenvalue problem of linear differential

and integral operators. Journal of Research of the National Bureau of Standards, 45(4), 1950.

[18] Lihua Lei, Cheng Ju, Jianbo Chen, and Michael I Jordan. Nonconvex Finite-Sum Optimization Via SCSG Methods. ArXiv e-prints, abs/1706.09156, June 2017.

[19] Yurii Nesterov. Introductory Lectures on Convex Programming Volume: A Basic course, volume I. Kluwer Academic Publishers, 2004.

[20] Yurii Nesterov and Boris T. Polyak. Cubic regularization of Newton method and its global performance. Mathematical Programming, 108(1):177–205, 2006.


[21] Erkki Oja. Simplified neuron model as a principal component analyzer.Journal of mathematical biology, 15(3):267–273, 1982.

[22] Barak A Pearlmutter. Fast exact multiplication by the hessian. Neural computation, 6(1):147–160, 1994.

[23] Sashank J. Reddi, Ahmed Hefny, Suvrit Sra, Barnabas Poczos, and Alex Smola. Stochastic variance reduction for nonconvex optimization. InICML, 2016.

[24] Mark Schmidt, Nicolas Le Roux, and Francis Bach. Minimizing finite sums with the stochastic average gradient. arXiv preprint arXiv:1309.2388, pages 1–45, 2013. Preliminary version appeared in NIPS 2012.

[25] Nicol N Schraudolph. Fast curvature matrix-vector products for second-order gradient descent. Neural computation, 14(7):1723–1738, 2002.

[26] Shai Shalev-Shwartz. SDCA without Duality, Regularization, and Individual Convexity. In ICML, 2016.

[27] Lloyd N. Trefethen. Approximation Theory and Approximation Practice. SIAM, 2013.

[28] Yi Xu, Rong Jin, and Tianbao Yang. First-order Stochastic Algorithms for Escaping From Saddle Points in Almost Linear Time. ArXiv e-prints, abs/1711.01944, November 2017.


Figure 1: Neon vs. Neon2 for finding (ε, δ)-approximate local minima. We emphasize that Neon2 and Neon are based on the same high-level idea, but Neon is arguably the first-recorded result to turn stationary-point finding algorithms (such as SGD, SCSG) into local-minimum finding ones, with theoretical proofs.
