
Neon2: Finding Local Minima via First-Order Oracles (version 1)∗

Zeyuan Allen-Zhu (zeyuan@csail.mit.edu), Microsoft Research
Yuanzhi Li (yuanzhil@cs.princeton.edu), Princeton University

November 17, 2017

Abstract

We propose a reduction for non-convex optimization that can (1) turn a stationary-point finding algorithm into a local-minimum finding one, and (2) replace the Hessian-vector product computations with only gradient computations. It works both in the stochastic and the deterministic settings, without hurting the algorithm’s performance.

As applications, our reduction turns Natasha2 into a first-order method without hurting its performance. It also converts SGD, GD, SCSG, and SVRG into local-minimum finding algorithms outperforming some best known results.

1 Introduction

Nonconvex optimization has become increasingly popular due to its ability to capture modern machine learning tasks at large scale. Most notably, training deep neural networks corresponds to minimizing a non-convex function f(x) = (1/n) Σ_{i=1}^n f_i(x) over x ∈ ℝ^d, where each training sample i corresponds to one loss function f_i(·) in the summation. This average structure allows one to perform stochastic gradient descent (SGD), which uses a random ∇f_i(x) (corresponding to computing backpropagation once) to approximate ∇f(x) and performs descent updates.

Motivated by such large-scale machine learning applications, we wish to design faster first-order non-convex optimization methods that outperform gradient descent, both in the online and offline settings. In this paper, we say an algorithm is online if its complexity is independent of n (so n can be infinite), and offline otherwise. In recent years, researchers across different communities have gathered together to tackle this challenging question. By far, known theoretical approaches mostly fall into one of the following two categories.

First-order methods for stationary points. In analyzing first-order methods, we denote by gradient complexity T the number of computations of ∇f_i(x). To achieve an ε-approximate stationary point, namely a point x with ‖∇f(x)‖ ≤ ε, it is folklore that gradient descent (GD) is offline and needs T = O(n/ε²), while stochastic gradient descent (SGD) is online and needs T = O(1/ε⁴). In recent years, the offline complexity has been improved to T = O(n^{2/3}/ε²) by the

∗The result of this paper was briefly discussed at a Berkeley Simons workshop on Oct 6 and internally presented at Microsoft on Oct 30. We started to prepare this manuscript on Nov 11, after being informed of the independent and similar work of Xu and Yang [28]. Their result appeared on arXiv on Nov 3. To respect the fact that their work appeared online before us, we have adopted their algorithm name Neon and called our new algorithm Neon2. We encourage readers citing this work to also cite [28].


SVRG method [3, 23], and the online complexity has been improved to T = O(1/ε^{10/3}) by the SCSG method [18]. Both of them rely on the so-called variance-reduction technique, originally discovered for convex problems [11, 16, 24, 26].

These algorithms SVRG and SCSG are only capable of finding stationary points, which may not necessarily be approximate local minima and are arguably bad solutions for neural-network training [9, 10, 14]. Therefore,

can we turn stationary-point finding algorithms into local-minimum finding ones?

Hessian-vector methods for local minima. It is common knowledge that using information about the Hessian, one can find ε-approximate local minima, namely a point x with ‖∇f(x)‖ ≤ ε and also ∇²f(x) ⪰ −ε^{1/C}·I.¹ In 2006, Nesterov and Polyak [20] showed that one can find an ε-approximate local minimum in O(1/ε^{1.5}) iterations, but each iteration requires an (offline) computation as heavy as inverting the matrix ∇²f(x).

To fix this issue, researchers proposed to study the so-called “Hessian-free” methods that, in addition to gradient computations, also compute Hessian-vector products. That is, instead of using the full matrix ∇²f_i(x) or ∇²f(x), these methods also compute ∇²f_i(x)·v for indices i and vectors v.² For Hessian-free methods, we denote by gradient complexity T the number of computations of ∇f_i(x) plus that of ∇²f_i(x)·v. The hope of using Hessian-vector products is to improve the complexity T as a function of ε.

Such improvement was first shown possible independently by [1, 7] for the offline setting, with complexity T = Õ(n/ε^{1.5} + n^{3/4}/ε^{1.75}), which is better than that of gradient descent. In the online setting, the first improvement was by Natasha2, which gives complexity T = Õ(1/ε^{3.25}) [2].

Unfortunately, it is argued by some researchers that Hessian-vector products are not general enough and may not be as simple to implement as evaluating gradients [8]. Therefore,

can we turn Hessian-free methods into first-order ones, without hurting their performance?

1.1 From Hessian-Vector Products to First-Order Methods

Recall that by the definition of derivative, we have ∇²f_i(x)·v = lim_{q→0} {(∇f_i(x + qv) − ∇f_i(x))/q}. Given any Hessian-free method, at least at a high level, can we replace every occurrence of ∇²f_i(x)·v with

w = (∇f_i(x + qv) − ∇f_i(x))/q for some small q > 0?

Note that the error introduced in this approximation is ‖∇²f_i(x)·v − w‖ ∝ q‖v‖². Therefore, as long as the original algorithm is sufficiently stable to adversarial noise, and as long as q is small enough, this can convert Hessian-free algorithms into first-order ones.
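As a concrete illustration, here is a minimal Python sketch of this gradient-difference trick. The toy function f(x) = x₀²·x₁, its hand-coded gradient, and the step size q are our own assumptions for illustration, not part of any algorithm in this paper:

```python
# Hedged illustration of the gradient-difference approximation
# w = (grad f(x + q*v) - grad f(x)) / q  ~  Hessian(f)(x) @ v.
# The toy function f(x) = x0^2 * x1 is an assumption for this sketch;
# its Hessian at x is [[2*x1, 2*x0], [2*x0, 0]].

def grad(x):
    """Exact gradient of f(x) = x0**2 * x1."""
    return [2.0 * x[0] * x[1], x[0] ** 2]

def hvp_approx(grad_fn, x, v, q=1e-5):
    """First-order (Hessian-free) approximation of the Hessian-vector product."""
    shifted = [xi + q * vi for xi, vi in zip(x, v)]
    g1, g0 = grad_fn(shifted), grad_fn(x)
    return [(a - b) / q for a, b in zip(g1, g0)]

# At x = (1, 2) the Hessian is [[4, 2], [2, 0]], so the exact product with
# v = (1, 1) is (6, 2); the approximation errs by O(q * ||v||^2).
w = hvp_approx(grad, [1.0, 2.0], [1.0, 1.0])
```

Shrinking q reduces the approximation error linearly, exactly the ∝ q‖v‖² behavior described above.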

In this paper, we demonstrate this idea by converting negative-curvature-search (NC-search) subroutines into first-order processes. NC-search is a key subroutine used in state-of-the-art Hessian-free methods that have rigorous proofs (see [1, 2, 7]). It solves the following simple task:

negative-curvature search (NC-search)

given a point x₀, decide if ∇²f(x₀) ⪰ −δI, or find a unit vector v such that v⊤∇²f(x₀)v ≤ −δ/2.

Online Setting. In the online setting, NC-search can be solved by Oja’s algorithm [21], which

¹We say A ⪰ δI if all the eigenvalues of A are no smaller than δ. In this high-level introduction, we focus only on the case when δ = ε^{1/C} for some constant C.


costs Õ(1/δ²) computations of Hessian-vector products (first proved in [6] and applied to NC-search in [2]).

In this paper, we propose a method Neon2online which solves the NC-search problem via only stochastic first-order updates. That is, starting from x₁ = x₀ + ξ where ξ is some random perturbation, we keep updating x_{t+1} = x_t − η(∇f_i(x_t) − ∇f_i(x₀)). In the end, the vector x_T − x₀ gives us enough information about the negative curvature.

Theorem 1 (informal). Our Neon2online algorithm solves NC-search using Õ(1/δ²) stochastic gradients, without Hessian-vector product computations.

This complexity Õ(1/δ²) matches that of Oja’s algorithm, and is information-theoretically optimal (up to log factors); see the lower bound in [6].

We emphasize that the independent work Neon by Xu and Yang [28] is actually the first recorded theoretical result that proposed this approach. However, Neon needs Õ(1/δ³) stochastic gradients, because it uses full gradient descent to find NC (on a sub-sampled objective), inspired by [15] and the power method; instead, Neon2online uses stochastic gradients and is based on our prior work on Oja’s algorithm [6].

By plugging Neon2online into Natasha2 [2], we achieve the following corollary (see Figure 1(c)):

Theorem 2 (informal). Neon2online turns Natasha2 into a stochastic first-order method, without hurting its performance. That is, it finds an (ε, δ)-approximate local minimum in T = Õ(1/ε^{3.25} + 1/(ε³δ) + 1/δ⁵) stochastic gradient computations, without Hessian-vector product computations.

(We say x is an approximate local minimum if ‖∇f(x)‖ ≤ ε and ∇²f(x) ⪰ −δI.)

Offline Setting. There are a number of ways to solve the NC-search problem in the offline setting using Hessian-vector products. Most notably, the power method uses Õ(n/δ) computations of Hessian-vector products, the Lanczos method [17] uses Õ(n/√δ) computations, and shift-and-invert [12] on top of SVRG [26] (which we call SI+SVRG) uses Õ(n + n^{3/4}/√δ) computations.

In this paper, we convert the Lanczos method and SI+SVRG into first-order ones:

Theorem 3 (informal). Our Neon2det algorithm solves NC-search using Õ(1/√δ) full gradients (or equivalently Õ(n/√δ) stochastic gradients), and our Neon2svrg solves NC-search using Õ(n + n^{3/4}/√δ) stochastic gradients.

We emphasize that, although analyzed in the online setting only, the work Neon by Xu and Yang [28] also applies to the offline setting, and seems to be the first result to solve NC-search using first-order gradients with a theoretical proof. However, Neon uses Õ(1/δ) full gradients instead of Õ(1/√δ). Their approach is inspired by [15], but our Neon2det is based on Chebyshev approximation theory (see the textbook [27]) and its recent stability analysis [5].

By putting Neon2det and Neon2svrg into the CDHS method of Carmon et al. [7], we have³

Theorem 4 (informal). Neon2det turns CDHS into a first-order method without hurting its performance: it finds an (ε, δ)-approximate local minimum in Õ(1/ε^{1.75} + 1/δ^{3.5}) full gradient computations. Neon2svrg turns CDHS into a first-order method without hurting its performance: it finds an (ε, δ)-approximate local minimum in T = Õ(n/ε^{1.5} + n/δ³ + n^{3/4}/ε^{1.75} + n^{3/4}/δ^{3.5}) stochastic gradient computations.

³We note that the original paper of CDHS only proved such complexity results (although requiring Hessian-vector products) for the special case of δ ≥ ε^{1/2}. In such a case, it requires either Õ(1/ε^{1.75}) full gradient computations or Õ(n/ε^{1.5} + n^{3/4}/ε^{1.75}) stochastic gradient computations.

[Figure 1: three log-log plots of gradient complexity T versus δ, comparing (a) Neon2+SGD vs. Neon+SGD (complexity regimes T = ε⁻⁴, T = δ⁻³ε⁻², T = δ⁻⁵, T = δ⁻⁷, with thresholds at δ = ε^{2/3}, ε^{4/7}, ε^{1/2}, ε^{1/4}); (b) Neon2+SCSG vs. Neon+SCSG (regimes T = ε^{−10/3}, T = δ⁻³ε⁻², T = δ⁻⁵, T = δ⁻⁶, thresholds at δ = ε^{2/3}, ε^{1/2}, ε^{4/9}, ε^{1/4}); and (c) Neon2+Natasha2 vs. Neon+Natasha (regimes T = ε^{−3.25}, T = δ⁻¹ε⁻³, T = δ⁻⁵, T = δ⁻⁶, thresholds at δ = ε^{3/4}, ε^{3/5}, ε^{1/2}, ε^{1/4}).]

Figure 1: Neon vs. Neon2 for finding (ε, δ)-approximate local minima. We emphasize that Neon2 and Neon are based on the same high-level idea, but Neon is arguably the first-recorded result to turn stationary-point finding algorithms (such as SGD, SCSG) into local-minimum finding ones, with theoretical proofs.

One should perhaps compare Neon2det to the interesting work “convex until guilty” by Carmon et al. [8]. Their method finds ε-approximate stationary points using Õ(1/ε^{1.75}) full gradients, and is arguably the first first-order method achieving a convergence rate better than the 1/ε² of GD. Unfortunately, it is unclear if their method guarantees local minima. In comparison, Neon2det on CDHS achieves the same complexity but guarantees its output to be an approximate local minimum.

Remark 1.1. All the cited works in this sub-section require the objective to have (1) Lipschitz-continuous Hessian (a.k.a. second-order smoothness) and (2) Lipschitz-continuous gradient (a.k.a. Lipschitz smoothness). One can argue that (1) and (2) are both necessary for finding approximate local minima, but if one only finds approximate stationary points, then only (2) is necessary. We shall formally discuss our assumptions in Section 2.

1.2 From Stationary Points to Local Minima

Given any (first-order) algorithm that finds only stationary points (such as GD, SGD, or SCSG [18]), we can hope to use the NC-search routine to identify whether or not its output x satisfies ∇²f(x) ⪰ −δI. If so, then automatically x becomes an (ε, δ)-approximate local minimum, so we can terminate. If not, then we can move in its negative curvature direction to further decrease the objective.

In the independent work of Xu and Yang [28], they proposed to apply their Neon method for NC-search, and thus turned SGD and SCSG into first-order methods finding approximate local minima. In this paper, we use Neon2 instead. We show the following theorem:

Theorem 5 (informal). To find an (ε, δ)-approximate local minimum,

(a) Neon2+SGD needs T = Õ(1/ε⁴ + 1/(ε²δ³) + 1/δ⁵) stochastic gradients;

(b) Neon2+SCSG needs T = Õ(1/ε^{10/3} + 1/(ε²δ³) + 1/δ⁵) stochastic gradients;

(c) Neon2+GD needs T = Õ(n/ε² + n/δ^{3.5}) stochastic gradients (so Õ(1/ε² + 1/δ^{3.5}) full gradients); and

(d) Neon2+SVRG needs T = Õ(n^{2/3}/ε² + n/δ³ + n^{5/12}/(ε²δ^{1/2}) + n^{3/4}/δ^{3.5}) stochastic gradients.

We make several comments as follows.

(a) We compare Neon2+SGD to Ge et al. [13], where the authors showed that SGD plus perturbation needs T = Õ(poly(d)/ε⁴) stochastic gradients to find (ε, ε^{1/4})-approximate local minima. To some extent, Theorem 5a is superior because we have (1) removed the poly(d) factor,⁴ (2) achieved T = Õ(1/ε⁴) as long as δ ≥ ε^{2/3}, and (3) a much simpler analysis.

We also remark that, if using Neon instead of Neon2, one achieves a slightly worse complexity T = Õ(1/ε⁴ + 1/δ⁷); see Figure 1(a) for a comparison.⁵

(b) Neon2+SCSG turns SCSG into a local-minimum finding algorithm. Again, if using Neon instead of Neon2, one gets a slightly worse complexity T = Õ(1/ε^{10/3} + 1/(ε²δ³) + 1/δ⁶); see Figure 1(b).

(c) We compare Neon2+GD to Jin et al. [15], where the authors showed that GD plus perturbation needs Õ(1/ε²) full gradients to find (ε, ε^{1/2})-approximate local minima. This is perhaps the first time that one can convert a stationary-point finding method (namely GD) into a local-minimum finding one, without hurting its performance.

To some extent, Theorem 5c is better because we use Õ(1/ε²) full gradients as long as δ ≥ ε^{4/7}.

(d) Our result for Neon2+SVRG does not seem to be recorded anywhere, even if Hessian-vector product computations are allowed.

Limitation. We note that there is a limitation of using Neon2 (or Neon) to turn an algorithm finding stationary points into one finding local minima. Namely, given any algorithm A, if the gradient complexity for A to find an ε-approximate stationary point is T, then after this conversion, it finds (ε, δ)-approximate local minima in a gradient complexity that is at least T. This is because the new algorithm, after combining Neon2 and A, tries to alternately find stationary points (using A) and escape from saddle points (using Neon2). Therefore, it must pay at least complexity T. In contrast, methods such as Natasha2 swing by saddle points instead of going to saddle points and then escaping. This has enabled it to achieve a smaller complexity T = O(ε^{−3.25}) for δ ≥ ε^{1/4}.

2 Preliminaries

Throughout this paper, we denote by ‖·‖ the Euclidean norm. We use i ∈_R [n] to denote that i is generated from [n] = {1, 2, …, n} uniformly at random. We denote by I[event] the indicator function of a probabilistic event.

We denote by ‖A‖₂ the spectral norm of a matrix A. For symmetric matrices A and B, we write A ⪰ B to indicate that A − B is positive semidefinite (PSD). Therefore, A ⪰ σI if and only if all eigenvalues of A are no less than σ. We denote by λ_min(A) and λ_max(A) the minimum and maximum eigenvalues of a symmetric matrix A.

Recall some definitions on smoothness (for other equivalent definitions, see the textbook [19]).

Definition 2.1. For a function f : ℝ^d → ℝ,

• f is L-Lipschitz smooth (or L-smooth for short) if
  ∀x, y ∈ ℝ^d, ‖∇f(x) − ∇f(y)‖ ≤ L‖x − y‖.

• f is second-order L₂-Lipschitz smooth (or L₂-second-order smooth for short) if
  ∀x, y ∈ ℝ^d, ‖∇²f(x) − ∇²f(y)‖₂ ≤ L₂‖x − y‖.

The following fact says that the variance of a random variable decreases by a factor m if we choose m independent copies and average them. It is trivial to prove; see for instance [18].

⁴We are aware that the original authors of [13] have a different proof to remove its poly(d) factor, but have not found it online at this moment.

⁵Their complexity might be improvable to Õ(1/ε⁴ + 1/δ⁶).

| goal | algorithm | gradient complexity T | Hessian-vector products |
|---|---|---|---|
| stationary | SGD (folklore) | O(1/ε⁴) | no |
| local minima | perturbed SGD [13] | Õ(poly(d)/ε⁴) | no |
| stationary | GD (folklore) | O(n/ε²) | no |
| local minima | perturbed GD [15] | Õ(n/ε²) | no |
| stationary | “guilty” [8] | Õ(n/ε^{1.75}) | no |
| local minima | FastCubic [1] | Õ(n/ε^{1.5} + n^{3/4}/ε^{1.75}) | needed |

Table 1: Complexity for finding ‖∇f(x)‖ ≤ ε and ∇²f(x) ⪰ −δI. Following tradition, in these complexity bounds, we assume the variance and smoothness parameters are constants, and only show the dependency on n, d, ε.

Remark 1. A variance bound is needed for online methods.

Remark 2. Lipschitz smoothness is needed for finding approximate stationary points.

Remark 3. Second-order Lipschitz smoothness is needed for finding approximate local minima.

Fact 2.2. If v₁, …, vₙ ∈ ℝ^d satisfy Σ_{i=1}^n v_i = 0, and S is a non-empty, uniform random subset of [n], then E_S ‖(1/|S|) Σ_{i∈S} v_i‖² ≤ (I[|S| < n]/|S|) · (1/n) Σ_{i=1}^n ‖v_i‖².

Throughout the paper we study the following minimization problem:

min_{x∈ℝ^d} { f(x) := (1/n) Σ_{i=1}^n f_i(x) }


Algorithm 1 Neon2online(f, x₀, δ, p)

Input: function f(x) = (1/n) Σ_{i=1}^n f_i(x), vector x₀, negative curvature δ > 0, confidence p ∈ (0, 1].

1: for j = 1, 2, …, Θ(log 1/p) do  ⋄ boost the confidence
2:   v_j ← Neon2online_weak(f, x₀, δ, p);
3:   if v_j ≠ ⊥ then
4:     m ← Θ(L² log(1/p)/δ²), v′ ← Θ(δ/L₂)·v_j.
5:     Draw i₁, …, i_m ∈_R [n].
6:     z_j ← (1/(m‖v′‖₂²)) Σ_{k=1}^m (v′)⊤(∇f_{i_k}(x₀ + v′) − ∇f_{i_k}(x₀))
7:     if z_j ≤ −3δ/4 then return v = v_j
8:   end if
9: end for
10: return v = ⊥.

Algorithm 2 Neon2online_weak(f, x₀, δ, p)

1: η ← δ/(C₀² L² log(d/p)), T ← C₀² log(d/p)/(ηδ);  ⋄ for a sufficiently large constant C₀
2: ξ ← a Gaussian random vector with norm σ, where σ := η(d/p)^{−2C₀} δ²/L₂.
3: x₁ ← x₀ + ξ.
4: for t ← 1 to T do
5:   x_{t+1} ← x_t − η(∇f_i(x_t) − ∇f_i(x₀)), where i ∈_R [n].
6:   if ‖x_{t+1} − x₀‖₂ ≥ r then return v = (x_{t+1} − x₀)/‖x_{t+1} − x₀‖₂  ⋄ r := (d/p)^{C₀} σ
7: end for
8: return v = ⊥.

where both f(·) and each f_i(·) can be nonconvex. We wish to find (ε, δ)-local minima, which are points x satisfying

‖∇f(x)‖ ≤ ε and ∇²f(x) ⪰ −δI.

We need the following three assumptions:

• Each f_i(x) is L-Lipschitz smooth.

• Each f_i(x) is second-order L₂-Lipschitz smooth.

(In fact, the gradient complexity of Neon2 in this paper only depends polynomially on the second-order smoothness of f(x) (rather than f_i(x)), and the time complexity depends logarithmically on the second-order smoothness of f_i(x). To keep notation simple, we simply assume each f_i(x) is L₂-second-order smooth.)

• Stochastic gradients have bounded variance: ∀x ∈ ℝ^d: E_{i∈_R[n]} ‖∇f(x) − ∇f_i(x)‖² ≤ V.

(This assumption is needed only for online algorithms.)

3 Neon2 in the Online Setting

We propose Neon2online formally in Algorithm 1.


Theorem 1 (Neon2online). Let f(x) = (1/n) Σ_{i=1}^n f_i(x), where each f_i is L-smooth and L₂-second-order smooth. For every point x₀ ∈ ℝ^d, every δ > 0, and every p ∈ (0, 1], the output

v = Neon2online(f, x₀, δ, p)

satisfies the following with probability at least 1 − p:

1. If v = ⊥, then ∇²f(x₀) ⪰ −δI.

2. If v ≠ ⊥, then ‖v‖₂ = 1 and v⊤∇²f(x₀)v ≤ −δ/2.

Moreover, the total number of stochastic gradient evaluations is O(log²(d/p)·L²/δ²).

The proof of Theorem 1 immediately follows from Lemma 3.1 and Lemma 3.2 below.

Lemma 3.1 (Neon2online_weak). In the same setting as Theorem 1, the output v = Neon2online_weak(f, x₀, δ, p) satisfies: if λ_min(∇²f(x₀)) ≤ −δ, then with probability at least 2/3, v ≠ ⊥ and v⊤∇²f(x₀)v ≤ −(51/100)·δ.

Proof sketch of Lemma 3.1. We explain why Neon2online_weak works as follows. Starting from a randomly perturbed point x₁ = x₀ + ξ, it keeps updating x_{t+1} ← x_t − η(∇f_i(x_t) − ∇f_i(x₀)) for a random index i ∈ [n], and stops either when T iterations are reached, or when ‖x_{t+1} − x₀‖₂ > r. Therefore, we have ‖x_t − x₀‖₂ ≤ r throughout the iterations, and thus can approximate ∇²f_i(x₀)(x_t − x₀) using ∇f_i(x_t) − ∇f_i(x₀), up to error O(r²). This is a small term based on our choice of r.

Ignoring the error term, our updates look like x_{t+1} − x₀ = (I − η∇²f_i(x₀))(x_t − x₀). This is exactly the same as Oja’s algorithm [21], which is known to approximately compute the minimum eigenvector of ∇²f(x₀) = (1/n) Σ_{i=1}^n ∇²f_i(x₀). Using the recent optimal convergence analysis of Oja’s algorithm [6], one can conclude that after T₁ = Θ(log(r/σ)/(ηλ)) iterations, where λ = max{0, −λ_min(∇²f(x₀))}, not only is ‖x_{t+1} − x₀‖₂ blown up, but it also aligns well with the minimum eigenvector of ∇²f(x₀). In other words, if λ ≥ δ, then the algorithm must stop before T.

Finally, one has to carefully argue that the error does not blow up in this iterative process. We defer the proof details to Appendix A.2.
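The iterative process just described is easy to prototype. Below is a hedged Python sketch of the weak NC-search loop on a toy single-summand objective (n = 1); the objective f(x) = ½(x₀² − x₁²), the step size, and the thresholds are illustrative assumptions rather than the paper's tuned parameters:

```python
import random

# Hedged sketch of the Neon2online_weak iteration: starting from a tiny
# random perturbation of x0, repeat
#     x_{t+1} = x_t - eta * (grad f_i(x_t) - grad f_i(x0))
# and return the normalized displacement once it exceeds a radius r.
# Toy objective (our assumption): f(x) = 0.5*(x0^2 - x1^2), whose Hessian
# diag(1, -1) has negative curvature along the second coordinate.

def grad(x):
    return [x[0], -x[1]]

def nc_search_sketch(grad_fn, x0, eta=0.01, r=1.0, sigma=1e-6, T=10000, seed=0):
    rng = random.Random(seed)
    x = [a + rng.gauss(0.0, sigma) for a in x0]    # x1 = x0 + xi
    g0 = grad_fn(x0)
    for _ in range(T):
        g = grad_fn(x)
        x = [xj - eta * (gj - g0j) for xj, gj, g0j in zip(x, g, g0)]
        d = [xj - x0j for xj, x0j in zip(x, x0)]
        norm = sum(dj * dj for dj in d) ** 0.5
        if norm >= r:                               # displacement blew up:
            return [dj / norm for dj in d]          # unit NC direction
    return None                                     # no strong negative curvature found

v = nc_search_sketch(grad, [0.0, 0.0])
```

On this toy instance the component along the negative-curvature direction is amplified by (1 + η) per step while the positive-curvature component shrinks, so the returned unit vector aligns with ±e₂ and its Rayleigh quotient v₀² − v₁² is close to the minimum eigenvalue −1.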

Our Lemma 3.2 below tells us that we can verify whether the output v of Neon2online_weak is indeed correct (up to an additive δ/4), so we can boost the success probability to 1 − p.

Lemma 3.2 (verification). In the same setting as Theorem 1, let x, v ∈ ℝ^d be vectors. Draw i₁, …, i_m ∈_R [n] and define

z = (1/m) Σ_{j=1}^m v⊤(∇f_{i_j}(x + v) − ∇f_{i_j}(x)).

Then, if ‖v‖ ≤ δ/(8L₂) and m = Θ(L² log(1/p)/δ²), with probability at least 1 − p,

| z/‖v‖₂² − v⊤∇²f(x)v/‖v‖₂² | ≤ δ/4.
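The verification in Lemma 3.2 amounts to estimating a Rayleigh quotient from gradient evaluations. A minimal sketch with a hypothetical single-summand objective f(x) = x₀²·x₁ (so m = 1 and no sampling is needed); at x = (1, 2) its Hessian is [[4, 2], [2, 0]]:

```python
# Hedged sketch of the verification step: estimate the Rayleigh quotient
# v^T (Hessian f)(x) v / ||v||^2 from two gradient evaluations, following
# the definition of z above.  The toy function f(x) = x0^2 * x1 is an
# assumption for illustration only.

def grad(x):
    """Exact gradient of f(x) = x0**2 * x1."""
    return [2.0 * x[0] * x[1], x[0] ** 2]

def rayleigh_estimate(grad_fn, x, v):
    g1 = grad_fn([a + b for a, b in zip(x, v)])
    g0 = grad_fn(x)
    z = sum(vi * (a - b) for vi, a, b in zip(v, g1, g0))
    return z / sum(vi * vi for vi in v)

# For the small vector v = (1e-4, 0), the estimate should be close to
# e1^T H e1 = 4, with error controlled by ||v||.
est = rayleigh_estimate(grad, [1.0, 2.0], [1e-4, 0.0])
```

This mirrors why the lemma requires ‖v‖ ≤ δ/(8L₂): the smaller ‖v‖ is, the smaller the second-order Taylor error in the estimate.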

4 Neon2 in the Deterministic Setting


Algorithm 3 Neon2det(f, x₀, δ, p)

Input: a function f, vector x₀, negative curvature target δ > 0, failure probability p ∈ (0, 1].

1: T ← C₁² log(d/p)·√(L/δ).  ⋄ for a sufficiently large constant C₁
2: ξ ← a Gaussian random vector with norm σ; σ := (d/p)^{−2C₁}·δ/(T³L₂)
3: x₁ ← x₀ + ξ; y₁ ← ξ, y₀ ← 0.
4: for t ← 1 to T do
5:   y_{t+1} = 2M(y_t) − y_{t−1};  ⋄ M(y) := −(1/L)(∇f(x₀ + y) − ∇f(x₀)) + (1 − 3δ/(4L))·y
6:   x_{t+1} = x₀ + y_{t+1} − M(y_t).
7:   if ‖x_{t+1} − x₀‖₂ ≥ r then return (x_{t+1} − x₀)/‖x_{t+1} − x₀‖₂.  ⋄ r := (d/p)^{C₁} σ
8: end for
9: return ⊥.

Theorem 3 (Neon2det). Let f(x) be a function that is L-smooth and L₂-second-order smooth. For every point x₀ ∈ ℝ^d, every δ > 0, and every p ∈ (0, 1], the output v = Neon2det(f, x₀, δ, p) satisfies the following with probability at least 1 − p:

1. If v = ⊥, then ∇²f(x₀) ⪰ −δI.

2. If v ≠ ⊥, then ‖v‖₂ = 1 and v⊤∇²f(x₀)v ≤ −δ/2.

Moreover, the total number of full gradient evaluations is O(log²(d/p)·√(L/δ)).

Proof sketch of Theorem 3. We explain the high-level intuition of Neon2det and the proof of Theorem 3 as follows. Define M := −(1/L)∇²f(x₀) + (1 − 3δ/(4L))·I. We immediately notice that

• all eigenvalues of ∇²f(x₀) in [−3δ/4, L] are mapped to eigenvalues of M in [−1, 1], and

• any eigenvalue of ∇²f(x₀) smaller than −δ is mapped to an eigenvalue of M greater than 1 + δ/(4L).

Therefore, as long as T ≥ Ω̃(L/δ), if we compute x_{T+1} = x₀ + M^T ξ for some random vector ξ, then by the theory of the power method, x_{T+1} − x₀ must be a negative-curvature direction of ∇²f(x₀) with value ≤ −δ/2. There are two issues with this approach.

The first issue is that the degree T of this matrix polynomial M^T can be reduced to T = Θ̃(√(L/δ)) if the so-called Chebyshev polynomial is used.

Claim 4.1. Let T_t(x) be the t-th Chebyshev polynomial of the first kind, defined as [27]:

T₀(x) := 1, T₁(x) := x, T_{n+1}(x) := 2x·T_n(x) − T_{n−1}(x).

Then T_t(x) satisfies:

T_t(x) ∈ [−1, 1] if x ∈ [−1, 1]; and T_t(x) = ½[(x + √(x²−1))^t + (x − √(x²−1))^t] if x > 1.

Since T_t(x) stays in [−1, 1] when x ∈ [−1, 1], and grows to ≈ (1 + √(x²−1))^t for x > 1, we can use T_T(M) in replacement of M^T. Then, any eigenvalue of M that is above 1 + δ/(4L) shall grow at a speed like (1 + √(δ/L))^T, so it suffices to choose T ≥ Θ̃(√(L/δ)). This is quadratically faster than applying the power method, so in Neon2det we wish to compute x_{t+1} ≈ x₀ + T_t(M)ξ.
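A quick numeric illustration of Claim 4.1 (all values chosen by us for the example, not from the paper): the three-term recurrence matches the closed form for x > 1, and the polynomial stays bounded on [−1, 1] while blowing up outside it.

```python
import math

# Numeric check of Claim 4.1: the Chebyshev recurrence
# T_{n+1}(x) = 2x*T_n(x) - T_{n-1}(x) matches the closed form
# 0.5*((x + sqrt(x^2-1))^t + (x - sqrt(x^2-1))^t) for x > 1.

def cheb(t, x):
    """t-th Chebyshev polynomial of the first kind via the recurrence."""
    t_prev, t_cur = 1.0, x          # T_0(x) = 1, T_1(x) = x
    if t == 0:
        return t_prev
    for _ in range(t - 1):
        t_prev, t_cur = t_cur, 2.0 * x * t_cur - t_prev
    return t_cur

x = 1.05                             # plays the role of an eigenvalue > 1
s = math.sqrt(x * x - 1.0)
closed = 0.5 * ((x + s) ** 10 + (x - s) ** 10)
recur = cheb(10, x)                  # should match `closed`, and exceed 1
inside = cheb(5, 0.5)                # T_5(0.5) = cos(5*arccos(0.5)) = 0.5
```

The rapid growth of T_t at arguments slightly above 1 is exactly what buys the quadratic speedup over the plain power method.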

The second issue is that we cannot compute My exactly for a vector y, and can only use a gradient difference to approximate it; that is, we can only use M(y) to approximate My, where

M(y) := −(1/L)(∇f(x₀ + y) − ∇f(x₀)) + (1 − 3δ/(4L))·y.

How does the error propagate if we compute T_t(M)ξ by replacing M with M(·)? Note that this is a very non-trivial question, because the coefficients of the polynomial T_t(x) are as large as 2^{O(t)}.

It turns out that the way the error propagates depends on how the Chebyshev polynomial is calculated. If the so-called backward recurrence formula is used, namely

y₀ = 0, y₁ = ξ, y_t = 2M(y_{t−1}) − y_{t−2},

and we set x_{T+1} = x₀ + y_{T+1} − M(y_T), then this x_{T+1} is sufficiently close to the exact value x₀ + T_t(M)ξ. This is known as the stability theory of computing Chebyshev polynomials, and is proved in our prior work [5].

We defer all the proof details to Appendix B.2.

5 Neon2 in the SVRG Setting

Recall that the shift-and-invert (SI) approach [12] on top of the SVRG method [26] solves the minimum eigenvector problem as follows. Given any matrix A = ∇²f(x₀), suppose its eigenvalues are λ₁ ≤ ⋯ ≤ λ_d. Then, if λ > −λ₁, we can define the positive semidefinite matrix B = (λI + A)^{−1}, and then apply the power method to find an (approximate) maximum eigenvector of B, which necessarily is an (approximate) minimum eigenvector of A.

The SI approach specifies a binary search routine to determine the shifting constant λ, and ensures that B = (λI + A)^{−1} is always “well conditioned,” meaning that it suffices to apply the power method on B for a logarithmic number of iterations. In other words, the task of computing the minimum eigenvector of A reduces to computing matrix-vector products By a poly-logarithmic number of times. Moreover, the stability of SI was shown in a number of papers, including [12] and [4]. This means it suffices for us to compute By approximately.

However, how can we compute By for an arbitrary vector y? It turns out that this is equivalent to minimizing a convex quadratic function that is of a finite-sum form:

g(z) := ½ z⊤(λI + A)z + y⊤z = (1/n) Σ_{i=1}^n (½ z⊤(λI + ∇²f_i(x₀))z + y⊤z).

Therefore, one can apply a variant of the SVRG method (arguably first discovered by Shalev-Shwartz [26]) to solve this task. In each iteration, SVRG needs to evaluate a stochastic gradient (λI + ∇²f_i(x₀))z + y at some point z for some random i ∈ [n]. Instead of evaluating it exactly (which would require a Hessian-vector product), we use ∇f_i(x₀ + z) − ∇f_i(x₀) to approximate ∇²f_i(x₀)·z.

Of course, one also needs to show that the SVRG method is stable to noise. Using similar techniques as in the previous two sections, one can show that the error term is proportional to O(‖z‖₂²), and thus as long as the norm of z is bounded (just as in the previous two sections), this should not affect the performance of the algorithm. We omit the detailed theoretical proof of this result, because it would complicate this paper.

Theorem 3 (Neon2svrg). Let f(x) = (1/n) Σ_{i=1}^n f_i(x), where each f_i is L-smooth and L₂-second-order smooth. For every point x₀ ∈ ℝ^d, every δ > 0, and every p ∈ (0, 1], the output v = Neon2svrg(f, x₀, δ, p) satisfies the following with probability at least 1 − p:

1. If v = ⊥, then ∇²f(x₀) ⪰ −δI.

2. If v ≠ ⊥, then ‖v‖₂ = 1 and v⊤∇²f(x₀)v ≤ −δ/2.

Moreover, the total number of stochastic gradient evaluations is Õ(n + n^{3/4}·√L/√δ).

6 Applications of Neon2

We show how Neon2online can be applied to existing algorithms such as SGD, GD, SCSG, SVRG, Natasha2, and CDHS. Unfortunately, we are unaware of a generic statement for applying Neon2online to any algorithm. Therefore, we have to prove them individually.⁶

Throughout this section, we assume that some starting vector x₀ ∈ ℝ^d and an upper bound ∆f are given to the algorithm, and that ∆f satisfies f(x₀) − min_x{f(x)} ≤ ∆f. This is only for the purpose of proving theoretical bounds. In practice, because ∆f only appears in specifying the number of iterations, one can just run enough iterations and then halt the algorithm, without the necessity of knowing ∆f.

6.1 Auxiliary Claims

Claim 6.1. For any x, using O((V/ε² + 1)·log(1/p)) stochastic gradients, we can decide, with probability 1 − p: either ‖∇f(x)‖ ≥ ε/2 or ‖∇f(x)‖ ≤ ε.

Proof. Suppose we generate m = O(log(1/p)) random uniform subsets S₁, …, S_m of [n], each of cardinality B = max{32V/ε², 1}. Then, denoting by v_j = (1/B) Σ_{i∈S_j} ∇f_i(x), we have according to Fact 2.2 that E_{S_j} ‖v_j − ∇f(x)‖² ≤ V/B = ε²/32. In other words, with probability at least 1/2 over the randomness of S_j, we have |‖v_j‖ − ‖∇f(x)‖| ≤ ‖v_j − ∇f(x)‖ ≤ ε/4. Since m = O(log(1/p)), with probability at least 1 − p, at least m/2 + 1 of the vectors v_j satisfy |‖v_j‖ − ‖∇f(x)‖| ≤ ε/4. Now, if we select v∗ = v_j where j ∈ [m] is the index that gives the median value of ‖v_j‖, then it satisfies |‖v∗‖ − ‖∇f(x)‖| ≤ ε/4. Finally, we check if ‖v∗‖ ≤ 3ε/4. If so, then we conclude that ‖∇f(x)‖ ≤ ε, and if not, we conclude that ‖∇f(x)‖ ≥ ε/2.
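The median-of-batches estimator in this proof can be sketched as follows; the 1-d summands f_i(x) = ½(x − c_i)², the batch sizes, and all constants are our illustrative assumptions:

```python
import random

# Hedged sketch of the estimator in Claim 6.1: average grad f_i over each
# of m random batches and report the batch whose norm is the median.
# Toy summands (our assumption): f_i(x) = 0.5*(x - c_i)^2 in 1-d, so
# grad f_i(x) = x - c_i and grad f(x) = x - mean(c).

rng = random.Random(1)
n = 200
c = [rng.uniform(-1.0, 1.0) for _ in range(n)]
mean_c = sum(c) / n

def grad_i(x, i):
    return x - c[i]

def estimate_grad_norm(x, batch=32, m=15):
    norms = []
    for _ in range(m):
        idx = [rng.randrange(n) for _ in range(batch)]
        avg = sum(grad_i(x, i) for i in idx) / batch
        norms.append(abs(avg))
    norms.sort()
    return norms[m // 2]             # median of the m batch norms

true_norm = abs(2.0 - mean_c)        # exact ||grad f(2.0)||
est = estimate_grad_norm(2.0)
```

Taking the median (rather than the mean) of the batch norms is what boosts a constant-probability estimate to confidence 1 − p with only log(1/p) repetitions.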

Claim 6.2. If v is a unit vector with v⊤∇²f(y)v ≤ −δ/2, and we choose y′ = y ± (δ/L₂)·v where the sign is random, then f(y) − E[f(y′)] ≥ δ³/(12L₂²).

Proof. Letting η = δ/L₂, by the second-order smoothness,

f(y) − E[f(y′)] ≥ E[⟨∇f(y), y − y′⟩] − ½ (y − y′)⊤∇²f(y)(y − y′) − (L₂/6)‖y − y′‖³
= −(η²/2)·v⊤∇²f(y)v − (L₂η³/6)‖v‖³ ≥ η²δ/4 − L₂η³/6 = δ³/(12L₂²).
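A numeric sanity check of this claim on a toy quadratic (all values are our illustrative assumptions; for a quadratic the L₂-cubic error term vanishes because the Hessian is constant):

```python
# Numeric check of the mechanism in Claim 6.2: for the toy quadratic
# f(y) = 0.5*(y0^2 - y1^2) and the unit direction v = (0, 1) with
# v^T (Hessian f) v = -1, stepping y' = y +/- eta*v with a random sign
# decreases f in expectation by -0.5*eta^2 * v^T (Hessian f) v = eta^2/2.

def f(y):
    return 0.5 * (y[0] ** 2 - y[1] ** 2)

y, v, eta = [0.3, 0.1], [0.0, 1.0], 0.05
y_plus = [a + eta * b for a, b in zip(y, v)]
y_minus = [a - eta * b for a, b in zip(y, v)]
# averaging the +/- steps cancels the first-order (gradient) term
expected_decrease = f(y) - 0.5 * (f(y_plus) + f(y_minus))
```

The random sign is what makes the linear term vanish in expectation, leaving the pure negative-curvature gain.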

6.2 Neon2 on SGD and GD

To apply Neon2 to turn SGD into an algorithm finding approximate local minima, we propose the following process Neon2+SGD (see Algorithm 4). In each iteration t, we first apply SGD with mini-batch size O(1/ε²) (see Line 4). Then, if SGD finds a point with small gradient, we apply Neon2online to decide if it has a negative curvature; if so, we move in the direction of the negative curvature (see Line 10). We have the following theorem:

⁶This is because stationary-point finding algorithms have somewhat different guarantees. For instance, in mini-batch SGD we have f(x_t) − E[f(x_{t+1})] ≥ Ω(‖∇f(x_t)‖²), but in SCSG the guarantee is instead on the gradient of the new point x_{t+1} (cf. Lemma 6.4).

Algorithm 4 Neon2+SGD(f, x₀, p, ε, δ)

Input: function f(·), starting vector x₀, confidence p ∈ (0, 1), ε > 0 and δ > 0.

1: K ← O(L₂²∆f/δ³ + L∆f/ε²);  ⋄ ∆f is any upper bound on f(x₀) − min_x{f(x)}
2: for t ← 0 to K − 1 do
3:   S ← a uniform random subset of [n] with cardinality |S| = B := max{8V/ε², 1};
4:   x_{t+1/2} ← x_t − (1/(L|S|)) Σ_{i∈S} ∇f_i(x_t);
5:   if ‖∇f(x_t)‖ ≥ ε/2 then  ⋄ estimate ‖∇f(x_t)‖ using O(ε^{−2}V log(K/p)) stochastic gradients
6:     x_{t+1} ← x_{t+1/2};
7:   else  ⋄ necessarily ‖∇f(x_t)‖ ≤ ε
8:     v ← Neon2online(f, x_t, δ, p/(2K));
9:     if v = ⊥ then return x_t;  ⋄ necessarily ∇²f(x_t) ⪰ −δI
10:    else x_{t+1} ← x_t ± (δ/L₂)·v;  ⋄ necessarily v⊤∇²f(x_t)v ≤ −δ/2
11:  end if
12: end for
13: (this line will not be reached, with probability at least 1 − p)

Algorithm 5 Neon2+GD(f, x₀, p, ε, δ)

Input: function f(·), starting vector x₀, confidence p ∈ (0, 1), ε > 0 and δ > 0.

1: K ← O(L₂²∆f/δ³ + L∆f/ε²);  ⋄ ∆f is any upper bound on f(x₀) − min_x{f(x)}
2: for t ← 0 to K − 1 do
3:   x_{t+1/2} ← x_t − (1/L)∇f(x_t);
4:   if ‖∇f(x_t)‖ ≥ ε/2 then
5:     x_{t+1} ← x_{t+1/2};
6:   else
7:     v ← Neon2det(f, x_t, δ, p/(2K));
8:     if v = ⊥ then return x_t;  ⋄ necessarily ∇²f(x_t) ⪰ −δI
9:     else x_{t+1} ← x_t ± (δ/L₂)·v;  ⋄ necessarily v⊤∇²f(x_t)v ≤ −δ/2
10:  end if
11: end for
12: (this line will not be reached, with probability at least 1 − p)
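The outer loops above can be caricatured in one dimension. In this hedged sketch, the objective f(x) = x⁴/4 − x²/2 (a saddle-like critical point at 0 with f″(0) = −1 and local minima at ±1), the oracle `nc_search` built from the exact second derivative, and all constants are our illustrative assumptions, not the paper's algorithms:

```python
import random

# 1-d caricature of the Neon2+SGD / Neon2+GD outer loop: take gradient
# steps while the gradient is large; otherwise run an NC-search, and
# either escape along the returned direction or stop.

def grad(x):
    return x ** 3 - x          # gradient of f(x) = x^4/4 - x^2/2

def second_deriv(x):
    return 3.0 * x ** 2 - 1.0

def nc_search(x, delta):
    """Stand-in for Neon2: a descent direction if f''(x) <= -delta."""
    return 1.0 if second_deriv(x) <= -delta else None

def neon2_loop_sketch(x0, eps=1e-3, delta=0.5, lr=0.1, steps=1000, seed=3):
    rng = random.Random(seed)
    x = x0
    for _ in range(steps):
        if abs(grad(x)) >= eps / 2.0:
            x -= lr * grad(x)                          # plain (S)GD step
        else:
            v = nc_search(x, delta)
            if v is None:
                return x                               # approximate local minimum
            x += rng.choice([-1.0, 1.0]) * delta * v   # escape the saddle
    return x

x_final = neon2_loop_sketch(0.0)   # ends near a local minimum x = +/-1
```

Started exactly at the critical point 0, the loop first escapes via the negative-curvature step and then descends to one of the two local minima, mirroring the interplay of Lines 4 and 10 in Algorithm 4.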

Theorem 5a. With probability at least 1 − p, Neon2+SGD outputs an (ε, δ)-approximate local minimum in gradient complexity T = Õ((V/ε² + 1)·(L₂²∆f/δ³ + L∆f/ε²) + (L²/δ²)·(L₂²∆f/δ³)).

Corollary 6.3. Treating ∆f, V, L, L₂ as constants, we have T = Õ(1/ε⁴ + 1/(ε²δ³) + 1/δ⁵).

One can similarly (and more easily) give an algorithm Neon2+GD, which is the same as Neon2+SGD except that the mini-batch SGD is replaced with full gradient descent, and the use of Neon2online is replaced with Neon2det. We have the following theorem:

Theorem 5c. With probability at least 1 − p, Neon2+GD outputs an (ε, δ)-approximate local minimum in Õ(L∆f/ε² + (L^{1/2}/δ^{1/2})·(L₂²∆f/δ³)) full gradient computations.


6.3 Neon2 on SCSG and SVRG

Background. We first recall the main idea of the SVRG method for non-convex optimization [3, 23]. It is an offline method, but is what SCSG is built on. SVRG divides iterations into epochs, each of length n. It maintains a snapshot point x̃ for each epoch, and computes the full gradient ∇f(x̃) only for snapshots. Then, in each iteration t at point x_t, SVRG defines the gradient estimator ∇̃f(x_t) := ∇f_i(x_t) − ∇f_i(x̃) + ∇f(x̃), which satisfies E_i[∇̃f(x_t)] = ∇f(x_t), and performs the update x_{t+1} ← x_t − α∇̃f(x_t) for a learning rate α.
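The SVRG estimator just described can be sketched in a few lines; the 1-d summands f_i(x) = ½a_i(x − c_i)² and all parameters are our illustrative assumptions:

```python
import random

# Hedged sketch of the SVRG gradient estimator:
#     est = grad f_i(x) - grad f_i(snapshot) + grad f(snapshot),
# which is unbiased over the random index i.

rng = random.Random(7)
n = 50
a = [rng.uniform(0.5, 1.5) for _ in range(n)]
c = [rng.uniform(-1.0, 1.0) for _ in range(n)]

def grad_i(x, i):
    return a[i] * (x - c[i])   # gradient of f_i(x) = 0.5*a_i*(x - c_i)^2

def full_grad(x):
    return sum(grad_i(x, i) for i in range(n)) / n

def svrg_estimator(x, snapshot, i):
    """Variance-reduced estimate of grad f(x) anchored at a snapshot."""
    return grad_i(x, i) - grad_i(snapshot, i) + full_grad(snapshot)

# Averaging the estimator over all indices recovers grad f(x) exactly,
# i.e. the unbiasedness property E_i[est] = grad f(x).
avg = sum(svrg_estimator(0.7, 0.2, i) for i in range(n)) / n
```

In an actual SVRG epoch, `full_grad(snapshot)` is computed once per epoch and reused, which is where the gradient-complexity savings come from; SCSG further replaces it by a large mini-batch average.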

The SCSG method of Lei et al. [18] proposed a simple fix to turn SVRG into an online method. They changed the epoch length of SVRG from n to B 1/ε2, and then replaced the computation of f(ex) with |S1|PiSfi(xe) where S is a random subset of [n] with cardinality |S| = B. To make this approach even more general, they also analyzed SCSG in the mini-batch setting, with mini-batch sizeb∈ {1,2, . . . , B}.7 Their Theorem 3.1 [18] says that,

Lemma 6.4 ([18]). There exist constant C >1 such that, if we run SCSG for an epoch of size B (so using O(B) stochastic gradients)8 with mini-batch b ∈ {1,2, . . . , B} starting from a point xt and moving tox+t, then

Ek∇f(x+

t )k2

≤C·L(b/B)1/3 f(xt)−E[f(x+t)]

+ 6V B .

Our Approach. In principle, one can apply the same idea ofNeon2+SGDonSCSG to turn it into an algorithm finding approximate local minima. Unfortunately, this is not quite possible because the left hand side of Lemma 6.4 is onEk∇f(x+

t )k2

, as opposed tok∇f(xt)k2 in SGD (see (C.1)). This means, instead of testing whetherxtis a good local minimum (as we did inNeon2+SGD), this time we need to test whetherx+t is a good local minimum. This creates some extra difficulty so we need a different proof.

Remark 6.5. As for the parameters of SCSG, we simply use B = max{1, 48V/ε²}. However, choosing mini-batch size b = 1 does not necessarily give the best complexity, so a tradeoff b = Θ( (ε²+V)·ε⁴·L₂⁶ / (δ⁹·L³) ) is needed. (A similar tradeoff was also discovered by the authors of Neon [28].) Note that this quantity b may be larger than B, and if this happens, SCSG becomes essentially equivalent to one iteration of SGD with mini-batch size b. Instead of analyzing this boundary case b > B separately, we decide to simply run Neon2+SGD whenever b > B happens, to simplify our proof.

We show the following theorem (proved in Appendix C).

Theorem 5b. With probability at least 2/3, Neon2+SCSG outputs an (ε, δ)-approximate local minimum in gradient complexity

T = Õ( ( L∆f/(ε^{4/3}·V^{1/3}) + L₂²∆f/δ³ ) · ( V/ε² + L²/δ² ) + (L∆f/ε²)·(L²/δ²) ) .

(To provide the simplest proof, we have shown Theorem 5b only with probability 2/3. One can for instance boost the confidence to 1−p by running O(log(1/p)) copies of Neon2+SCSG.)

Corollary 6.6. Treating ∆f, V, L, L₂ as constants, we have T = Õ( 1/ε^{10/3} + 1/(ε²δ³) + 1/δ⁵ ).

⁷That is, they reduced the epoch length to B/b, and replaced ∇f_i(x_t) − ∇f_i(x̃) with (1/|S′|)·Σ_{i∈S′}( ∇f_i(x_t) − ∇f_i(x̃) ) for some S′ that is a random subset of [n] with cardinality |S′| = b.

⁸We remark that Lei et al. [18] only showed that an epoch runs in an expectation of O(B) stochastic gradients. We assume it is exact here to simplify proofs. One can for instance stop SCSG after O(B·log(1/p)) stochastic gradient computations.

Algorithm 6 Neon2+SCSG(f, x₀, ε, δ)

Input: function f(·), starting vector x₀, ε > 0 and δ > 0.
1: B ← max{1, 48V/ε²};  b ← max{ 1, Θ( (ε²+V)·ε⁴·L₂⁶ / (δ⁹·L³) ) };
2: if b > B then return Neon2+SGD(f, x₀, 2/3, ε, δ);  ⋄ for cleaner analysis purposes, see Remark 6.5
3: K ← Θ( L·b^{1/3}·∆f / (ε^{4/3}·V^{1/3}) );  ⋄ ∆f is any upper bound on f(x₀) − min_x{f(x)}
4: for t ← 0 to K−1 do
5:   x_{t+1/2} ← apply SCSG on x_t for one epoch of size B = max{Θ(V/ε²), 1};
6:   if ‖∇f(x_{t+1/2})‖ ≥ ε/2 then  ⋄ estimate ‖∇f(x_{t+1/2})‖ using O(ε⁻²·V·log K) stochastic gradients
7:     x_{t+1} ← x_{t+1/2};
8:   else  ⋄ necessarily ‖∇f(x_{t+1/2})‖ ≤ ε
9:     v ← Neon2^online(f, x_{t+1/2}, δ, 1/(20K));
10:    if v = ⊥ then return x_{t+1/2};  ⋄ necessarily ∇²f(x_{t+1/2}) ⪰ −δI
11:    else x_{t+1} ← x_{t+1/2} ± (δ/L₂)·v;  ⋄ necessarily v^⊤∇²f(x_{t+1/2})v ≤ −δ/2
12:   end if
13: end for
14: ⋄ this line is not reached (with probability at least 2/3)
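To make the gradient-norm test in Line 6 concrete, here is a hedged sketch: it averages B = Θ(V/ε²) stochastic gradients and thresholds the norm of the average at ε/2. The Gaussian noise model, the constant 48, and the helper names (`stochastic_grad`, `norm_test`) are illustrative assumptions, not the paper's specification.

```python
import numpy as np

# Estimate ||grad f(x)|| from a minibatch of stochastic gradients and
# compare against eps/2; V bounds the variance of one stochastic gradient.
rng = np.random.default_rng(4)
d, eps, V = 5, 0.1, 1.0

def stochastic_grad(true_grad):
    # toy model: true gradient plus zero-mean noise of total variance V
    return true_grad + rng.standard_normal(d) * np.sqrt(V / d)

def norm_test(true_grad, eps, V):
    B = int(48 * V / eps**2)      # minibatch size B = Theta(V / eps^2)
    est = np.mean([stochastic_grad(true_grad) for _ in range(B)], axis=0)
    return bool(np.linalg.norm(est) >= eps / 2)

g_large = np.zeros(d); g_large[0] = 10 * eps   # clearly non-stationary point
g_zero = np.zeros(d)                           # stationary point
assert norm_test(g_large, eps, V)
assert not norm_test(g_zero, eps, V)
```

With B = Θ(V/ε²) samples the estimation error is O(ε) with high probability, which is why "est ≥ ε/2" separates the two branches of Line 6.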

As for SVRG, it is an offline method, and its one-epoch lemma looks like⁹

E‖∇f(x_t⁺)‖² ≤ C·(L/n^{1/3})·( f(x_t) − E[f(x_t⁺)] ) .

If one replaces the use of Lemma 6.4 with this new inequality, and replaces the use of Neon2^online with Neon2^svrg, then we get the following theorem:

Theorem 5d. With probability at least 2/3, Neon2+SVRG outputs an (ε, δ)-approximate local minimum in gradient complexity

T = Õ( ( L∆f/(ε²·n^{1/3}) + L₂²∆f/δ³ ) · ( n + n^{3/4}·√L/√δ ) ) .

For a clean presentation of this paper, we omit the pseudocode and proof, because they are only simpler than those for Neon2+SCSG.

6.4 Neon2 on Natasha2 and CDHS

The recent results of Carmon et al. [7] (which we refer to as CDHS) and Natasha2 [2] are both Hessian-free methods in which the only Hessian-vector product computations come from the exact NC-search process we study in this paper. Therefore, by replacing their NC-search with Neon2, we can directly turn them into first-order methods, without the need to compute Hessian-vector products.

We state the following two theorems, whose proofs are exactly the same as in the papers [7] and [2]. We state them directly assuming that ∆f, V, L, L₂ are constants, to simplify our notation.

Theorem 2. One can replace Oja's algorithm with Neon2^online in Natasha2 without hurting its performance, turning it into a first-order stochastic method.

Treating ∆f, V, L, L₂ as constants, Natasha2 finds an (ε, δ)-approximate local minimum in T = Õ( 1/ε^{3.25} + 1/(ε³δ) + 1/δ⁵ ) stochastic gradient computations.

Theorem 4. One can replace the Lanczos method with Neon2^det or Neon2^svrg in CDHS without hurting its performance, turning it into a first-order method.

Treating ∆f, L, L₂ as constants, CDHS finds an (ε, δ)-approximate local minimum in either Õ( 1/ε^{1.75} + 1/δ^{3.5} ) full gradient computations (if Neon2^det is used), or in T = Õ( n/ε^{1.5} + n/δ³ + n^{3/4}/ε^{1.75} + n^{3/4}/δ^{3.5} ) stochastic gradient computations (if Neon2^svrg is used).

Acknowledgements

We would like to thank Tianbao Yang and Yi Xu for helpful feedback on this manuscript.

Appendix

A Missing Proofs for Section 3

A.1 Auxiliary Lemmas

We use the following lemma to approximate Hessian-vector products:

Lemma A.1. If f(x) is L₂-second-order smooth, then for every point x ∈ ℝ^d and every vector v ∈ ℝ^d, we have

‖∇f(x+v) − ∇f(x) − ∇²f(x)·v‖₂ ≤ L₂‖v‖₂² .

Proof of Lemma A.1. We can write ∇f(x+v) − ∇f(x) = ∫₀¹ ∇²f(x+tv)·v dt. Subtracting ∇²f(x)·v, we have

‖∇f(x+v) − ∇f(x) − ∇²f(x)·v‖₂ = ‖ ∫₀¹ ( ∇²f(x+tv) − ∇²f(x) )·v dt ‖₂
 ≤ ∫₀¹ ‖∇²f(x+tv) − ∇²f(x)‖₂·‖v‖₂ dt ≤ L₂‖v‖₂² . □
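This lemma is exactly why first-order methods can emulate Hessian-vector products: the gradient difference ∇f(x+v) − ∇f(x) approximates ∇²f(x)·v up to an O(‖v‖²) error. A minimal numerical check on a toy function (the quartic f, the dimension, and the constant in the assertion are illustrative assumptions, chosen to dominate the local Hessian-Lipschitz constant):

```python
import numpy as np

# Toy smooth function f(x) = 0.25 * ||x||^4, with explicit gradient and Hessian.
def grad(x):
    return np.dot(x, x) * x                       # grad f(x) = ||x||^2 x

def hess(x):
    return np.dot(x, x) * np.eye(len(x)) + 2.0 * np.outer(x, x)

rng = np.random.default_rng(1)
x = rng.standard_normal(4)
v = 1e-3 * rng.standard_normal(4)

# Gradient-difference approximation of the Hessian-vector product:
hvp_approx = grad(x + v) - grad(x)
err = np.linalg.norm(hvp_approx - hess(x) @ v)

# Lemma A.1: the error is second order in ||v|| (constant 100 is generous
# for the local Hessian-Lipschitz constant of this f near x).
assert err <= 100.0 * np.linalg.norm(v) ** 2
```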

We also need the following auxiliary lemma, a martingale concentration bound:

Lemma A.2. Consider random events {F_t}_{t≥1} and random variables x₁, ..., x_T ≥ 0 and a₁, ..., a_T ∈ [−ρ, ρ] for ρ ∈ [0, 1/2], where each x_t and a_t depends only on F₁, ..., F_t. Let x₀ = 0, and suppose there exist a constant b ≥ 0 and a λ > 0 such that for every t ≥ 1:

x_t ≤ x_{t−1}(1 − a_t) + b  and  E[a_t | F₁, ..., F_{t−1}] ≥ −λ .

Then, for every p ∈ (0, 1):  Pr[ x_T ≥ T·b·e^{λT + 2ρ√(T·log(T/p))} ] ≤ p .

Proof. We know that

x_T ≤ (1 − a_T)x_{T−1} + b
 ≤ (1 − a_T)( (1 − a_{T−1})x_{T−2} + b ) + b
 = (1 − a_T)(1 − a_{T−1})x_{T−2} + (1 − a_T)b + b
 ≤ ⋯ ≤ Σ_{s=1}^{T} ( Π_{t=s+1}^{T} (1 − a_t) )·b .

For each s ∈ [T], we consider the random process defined as y_s := b and y_{t+1} := (1 − a_t)·y_t for t ≥ s, so that the s-th summand above is y_T. Therefore

log y_{t+1} = log(1 − a_t) + log y_t ,

where log(1 − a_t) ∈ [−2ρ, ρ] and E[ log(1 − a_t) | F₁, ···, F_{t−1} ] ≤ λ. Thus, we can apply the Azuma-Hoeffding inequality to log y_t and conclude that

Pr[ y_T ≥ b·e^{λT + 2ρ√(T·log(T/p))} ] ≤ p/T .

Taking a union bound over s, we complete the proof. □

A.2 Proof of Lemma 3.1

Proof of Lemma 3.1. Let i_t ∈ [n] be the random index i chosen when computing x_{t+1} from x_t in Line 5 of Neon2^online_weak. We will write the update rule of x_t in terms of the Hessian, before the algorithm stops. By Lemma A.1, we know that for every t ≥ 1,

‖∇f_{i_t}(x_t) − ∇f_{i_t}(x₀) − ∇²f_{i_t}(x₀)(x_t − x₀)‖₂ ≤ L₂‖x_t − x₀‖₂² .

Therefore, there exists an error vector ξ_t ∈ ℝ^d with ‖ξ_t‖₂ ≤ L₂‖x_t − x₀‖₂² such that

(x_{t+1} − x₀) = (x_t − x₀) − η∇²f_{i_t}(x₀)(x_t − x₀) + ηξ_t .

For notational simplicity, let us denote

z_t := x_t − x₀ ,  A_t := B_t + R_t ,  where B_t := ∇²f_{i_t}(x₀) and R_t := −ξ_t·z_t^⊤ / ‖z_t‖₂² ;

then it satisfies

z_{t+1} = z_t − ηB_t z_t + ηξ_t = (I − ηA_t)·z_t .

We have ‖R_t‖₂ ≤ L₂‖z_t‖₂ ≤ L₂·r. By the L-smoothness of f_{i_t}, we know ‖B_t‖₂ ≤ L, and thus ‖A_t‖₂ ≤ ‖B_t‖₂ + ‖R_t‖₂ ≤ L + L₂·r ≤ 2L.

Now, define

Φ_t := z_{t+1}·z_{t+1}^⊤ = (I − ηA_t)⋯(I − ηA₁)·ξξ^⊤·(I − ηA₁)⋯(I − ηA_t)  and  w_t := z_t/‖z_t‖₂ = z_t/(Tr(Φ_{t−1}))^{1/2} .

Then, before we stop, we have:

Tr(Φ_t) = Tr(Φ_{t−1})·( 1 − 2η·w_t^⊤A_t w_t + η²·w_t^⊤A_t^⊤A_t w_t )
 ≤ Tr(Φ_{t−1})·( 1 − 2η·w_t^⊤A_t w_t + 4η²L² )
 ≤ Tr(Φ_{t−1})·( 1 − 2η·w_t^⊤B_t w_t + 2η‖R_t‖₂ + 4η²L² )
 ≤① Tr(Φ_{t−1})·( 1 − 2η·w_t^⊤B_t w_t + 8η²L² ) .

Above, ① is because our choice of parameters satisfies r ≤ ηL²/L₂. Therefore,

log(Tr(Φ_t)) ≤ log(Tr(Φ_{t−1})) + log( 1 − 2η·w_t^⊤B_t w_t + 8η²L² ) .

Letting λ := −λ_min(∇²f(x₀)) = −λ_min(E_{B_t}[B_t]): since the randomness of B_t is independent of w_t, we know that w_t^⊤B_t w_t ∈ [−L, L] and, for every w_t, E_{B_t}[ w_t^⊤B_t w_t | w_t ] ≥ −λ. By the concavity of log, this also implies E[ log(1 − 2η·w_t^⊤B_t w_t + 8η²L²) ] ≤ 2ηλ + 8η²L², and log(1 − 2η·w_t^⊤B_t w_t + 8η²L²) ∈ [ −2(2ηL + 8η²L²), 2ηL + 8η²L² ] ⊆ [ −6ηL, 3ηL ].

Hence, applying the Azuma-Hoeffding inequality to log(Tr(Φ_t)), we conclude that with probability at least 1−p, Neon2^online_weak will not terminate until t ≥ T₀.

Next, we turn to accuracy. Let the "true" vector be v_{t+1} := …

Above, ① assumes without loss of generality that L ≥ 1 (as otherwise we can re-scale the problem). Therefore, applying the martingale concentration bound (Lemma A.2), we know Pr[…].

Now we can apply the recent result on Oja's algorithm [6, Theorem 4]. By our choice of the parameter η, with probability at least 99/100 the following holds:

1. Norm growth: ‖v_t‖₂ ≥ e^{(ηλ−2η·…)}·…

By our choice of parameters, we know that

e^{ηλT₁ + 2ηL·√(T₁·log(1/p))} ≥ 2r/σ ,

which implies that, with probability at least 98/100, Neon2^online_weak must terminate within T₁ ≤ T iterations (step ① here is due to our choice of r and σ in terms of δ and p).

Moreover, recall that with probability at least 1−p, Neon2^online_weak will not terminate before iteration T₀. Thus, at the time t ∈ [T₀, T₁] of termination, the stated guarantee holds with probability at least 98/100. Putting everything together, we complete the proof. □
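The heart of this proof is that the update z_{t+1} = (I − ηA_t)·z_t is a (stochastic) power method on I − η∇²f(x₀), implemented with only gradient queries. The following deterministic caricature illustrates it on a quadratic, where the gradient difference is exact; the matrix H, the step count, and all constants are illustrative assumptions, not the paper's parameter choices.

```python
import numpy as np

# On f(x) = 0.5 x^T H x, the update x <- x - eta*(grad f(x) - grad f(x0))
# is exactly the power method on (I - eta*H): the displacement x - x0
# aligns with the most negative eigendirection of H.
rng = np.random.default_rng(3)
d, delta, L = 8, 0.5, 4.0
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
eigs = np.linspace(0.2, L, d)
eigs[0] = -delta                         # one negative-curvature direction
H = Q @ np.diag(eigs) @ Q.T

def grad(x):
    return H @ x

eta = 1.0 / (2 * L)
x0 = rng.standard_normal(d)
x = x0 + 1e-6 * rng.standard_normal(d)   # tiny random perturbation
for _ in range(400):
    x = x - eta * (grad(x) - grad(x0))   # only gradient queries

v = (x - x0) / np.linalg.norm(x - x0)
# Rayleigh quotient certifies negative curvature: v^T H v <= -delta/2.
assert v @ H @ v <= -delta / 2
```

In Neon2^online the matrix is only accessed through stochastic gradients of individual f_i, which is why the error terms ξ_t and the martingale concentration above are needed.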

A.3 Proof of Lemma 3.2

Proof of Lemma 3.2. By Lemma A.1, we know that for every i ∈ [n], … From Lemma A.1, we conclude that Pr[…]. Plugging in our assumption on ‖v‖₂ and our choice of m finishes the proof. □

B Missing Proofs for Section 4

B.1 Stable Computation of Chebyshev Polynomials

We recall the following result from [5, Section 6.2] regarding one way to stably compute Chebyshev polynomials. Suppose we want to compute …

Definition B.1 (inexact backward recurrence). Let M be an approximate algorithm that satisfies …

Theorem B.2 (stable Chebyshev sum). For every N ∈ ℕ*, suppose the eigenvalues of M are in [a, b], and suppose there are parameters C_U ≥ 1, C_T ≥ 1, ρ ≥ 1, C_c ≥ 0 satisfying …

Then, if the inexact backward recurrence in Def. B.1 is applied with ε ≤ 1/(4N·C_U), we have

‖ŝ_N − s⃗_N‖ ≤ ε·2(1 + 2N·C_T)·N·C_U·C_c .

B.2 Proof of Theorem 3

Proof of Theorem 3. We can without loss of generality assume δ ≤ L. For notational simplicity, let us denote by T_t(·) the Chebyshev polynomials of the first kind. However, we cannot multiply M to vectors (because we are not allowed to use Hessian-vector products). We define the approximate operator M(·) and shall use it to approximate My, and then apply the backward recurrence

y₀ = 0 ,  y₁ = ξ ,  y_t = 2·M(y_{t−1}) − y_{t−2} .

Throughout the iterations of Neon2^det, we have

y_t = 2·M(y_{t−1}) − y_{t−2} = 2(x₀ − x_t + y_t) − y_{t−2}  ⟹  y_t − y_{t−2} = 2(x_t − x₀) .
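As a sanity check on the recurrence itself, the sketch below runs the three-term Chebyshev recurrence with exact matrix-vector products (a simplification: Neon2^det replaces each product M·y by a gradient difference M(y)). The random symmetric M and the degree N are illustrative assumptions.

```python
import numpy as np

# The backward recurrence y_t = 2 M y_{t-1} - y_{t-2} computes T_t(M) xi,
# where T_t is the t-th Chebyshev polynomial of the first kind.
rng = np.random.default_rng(2)
d, N = 6, 9
S = rng.standard_normal((d, d))
H = (S + S.T) / 2
M = H / (np.linalg.norm(H, 2) + 1.0)     # spectrum strictly inside (-1, 1)
xi = rng.standard_normal(d)

y_prev, y = xi, M @ xi                    # T_0(M) xi and T_1(M) xi
for _ in range(2, N + 1):
    y_prev, y = y, 2 * (M @ y) - y_prev   # now y = T_t(M) xi

# Cross-check via the eigendecomposition, using T_N(x) = cos(N arccos x)
# for x in [-1, 1]: T_N(M) = U diag(T_N(lam)) U^T.
lam, U = np.linalg.eigh(M)
TN = np.cos(N * np.arccos(lam))
assert np.allclose(y, U @ (TN * (U.T @ xi)))
```

The stability analysis of Theorem B.2 is what controls how the per-step approximation error of M(·) accumulates through this recurrence.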

Since we have ‖x_t − x₀‖₂ ≤ r for each t before termination, we know ‖y_t‖₂ ≤ 2tr. Using this upper bound, we can approximate each Hessian-vector product by a gradient difference: Lemma A.1 gives us

‖M(y_t) − M·y_t‖₂ ≤ L₂·…

Now, recall from Claim 4.1 that …

Theorem B.2 tells us that, for every t before termination,

‖x*_{t+1} − x_{t+1}‖₂ ≤ 32·L₂·r·t³·ρ^t·σ / L .

In order to prove Theorem 3, it suffices for us to show in the rest of the proof that, if λ_min(∇²f(x₀)) ≤ −δ, then with probability at least 1−p, it satisfies v ≠ ⊥, ‖v‖₂ = 1, and v^⊤∇²f(x₀)v ≤ −δ/2. In other words, we can assume λ ≥ δ.

The bound λ ≥ δ implies ρ ≥ 1 + √δ > 1, so we can let

T₁ := log( 4dr/(pσ) ) / log ρ ≤ T .

By Claim 4.1, we know that ‖T_{T₁}(M)‖₂ ≥ (1/2)·ρ^{T₁} = 2dr/(pσ). Thus, with probability at least 1−p,

‖x*_{T₁+1} − x₀‖₂ = ‖T_{T₁}(M)·ξ‖₂ ≥ 2r .

Moreover, at iteration T₁, we have:

‖x*_{T₁+1} − x_{T₁+1}‖₂ ≤ 32·L₂·r·T₁³·ρ^{T₁}·σ/L ≤ (32·L₂·r·T₁³·σ/L) × (4dr/(pσ)) = 128·d·L₂·T₁³·r²/(L·p) ≤① δ·r/(100L) ≤ r/16 .

Here, ① uses the fact that r ≤ δ·p/(12800·d·L₂·T₁³). This means ‖x_{T₁+1} − x₀‖₂ ≥ r, so the algorithm must terminate before iteration T₁ ≤ T.

On the other hand, since ‖T_t(M)‖₂ ≤ ρ^t, we know that the algorithm will not terminate until t ≥ T₀, for

T₀ := log( r/(2σ) ) / log ρ .

At the time t ≥ T₀ of termination, by the properties of Chebyshev polynomials (Claim 4.1), we know:

1. T_t(ρ) ≥ (1/2)·ρ^t ≥ (1/2)·ρ^{T₀} ≥ r/(4σ) = (d/p)^{Θ(1)} .
2. ∀x ∈ [−1, 1]: T_t(x) ∈ [−1, 1] .

Since all the eigenvalues of A that are ≥ −(3/4)δ are mapped to eigenvalues of M in [−1, 1], and the smallest eigenvalue of A is mapped to the eigenvalue ρ of M, we have, with probability at least 1−p, letting v_t := x*_{t+1} − x₀, that

v_t^⊤ A v_t / ‖v_t‖₂² ≤ −(5/8)·δ .

Therefore, denoting z_t := x_{t+1} − x₀, we have

z_t^⊤ A z_t / ‖z_t‖₂² = (‖v_t‖₂² / ‖z_t‖₂²) · ( z_t^⊤ A z_t / ‖v_t‖₂² )
 ≤ (16/15) · ( v_t^⊤ A v_t + 4L·‖z_t − v_t‖₂·‖v_t‖₂ ) / ‖v_t‖₂²
 ≤ (16/15) · ( v_t^⊤ A v_t / ‖v_t‖₂² + 4L·‖v_t − z_t‖₂ / ‖v_t‖₂ )
 ≤ (16/15) · ( v_t^⊤ A v_t / ‖v_t‖₂² + δ/25 )
 ≤ −(51/100)·δ .

C Missing Proofs for Section 6

C.1 Proof of Theorem 5a

Proof of Theorem 5a. Since both estimating ‖∇f(x_t)‖ in Line 5 (see Claim 6.1) and invoking Neon2^online (see Theorem 1) succeed with high probability, we can assume that they always succeed. This means that whenever we output x_t in an iteration, it already satisfies ‖∇f(x_t)‖ ≤ ε and ∇²f(x_t) ⪰ −δI. Therefore, it remains to show that the algorithm must output some x_t in an iteration, as well as to compute the final complexity.

Recall (from classical SGD theory) that if we update x_{t+1/2} ← x_t − (α/|S|)·Σ_{i∈S} ∇f_i(x_t), where α > 0 is …, then the algorithm must terminate and return x_t in one of its iterations. This ensures that Line 13 will not be reached. As for the total complexity, we note that each iteration of Neon2+SGD is dominated by Õ(B) = Õ(V/ε² + 1) stochastic gradient computations in Line 4 and Line 5, totaling Õ((V/ε² + 1)·K), as well as Õ(L²/δ²) stochastic gradient computations by Neon2^online; the latter will not happen more than O(L₂²∆f/δ³) times. Therefore, the total gradient complexity is as stated in Theorem 5a. □

C.2 Proof of Theorem 5b

Proof of Theorem 5b. We first note that in the special case b = Θ( (ε²+V)·ε⁴·L₂⁶/(δ⁹·L³) ) ≥ B, or equivalently δ³ ≤ O(L₂²·ε²/L), Theorem 5a gives us gradient complexity T = Õ(…).

Since both estimating ‖∇f(x_t)‖ in Line 6 (see Claim 6.1) and invoking Neon2^online (see Theorem 1) succeed with high probability, we can assume that they always succeed. This means that whenever we output x_{t+1/2} in an iteration, it already satisfies ‖∇f(x_{t+1/2})‖ ≤ ε and ∇²f(x_{t+1/2}) ⪰ −δI.

For analysis purposes, let us assume that, whenever the algorithm reaches v = ⊥ (in Line 10), it does not immediately halt and instead sets x_{t+1} = x_{t+1/2}. This modification ensures that the random variables x_t and x_{t+1/2} are well defined for t = 0, 1, ..., K−1.

Let N₁ and N₂ respectively be the number of times we reach Line 9 (so we invoke Neon2^online) and the number of times we reach Line 11 (so we update x_{t+1} = x_{t+1/2} ± (δ/L₂)·v). Both N₁ and N₂ are random variables, and N₁ ≥ N₂. To prove that Neon2+SCSG outputs in an iteration, we need to prove that N₁ > N₂ holds with at least constant probability.

Let us apply the key SCSG lemma (Lemma 6.4) to an epoch with size B = max{1, 48V/ε²} and mini-batch size b ≥ 1; we have

E‖∇f(x_{t+1/2})‖² ≤ C·L·(b/B)^{1/3}·( f(x_t) − E[f(x_{t+1/2})] ) + ε²/8 .

Note that the third case v ≠ ⊥ happens N₂ times. Therefore, combining the three cases together with (C.2), we have N₁ > N₂ with probability at least 2/3; this means the algorithm must terminate and output some x_{t+1/2} in an iteration.

Finally, the per-iteration complexity of Neon2+SCSG is dominated by Õ(B) stochastic gradient computations (for both SCSG and estimating ‖∇f(x_{t+1/2})‖), as well as Õ(L²/δ²) for invoking Neon2^online. This totals to the gradient complexity stated in Theorem 5b. □

References

[1] Naman Agarwal, Zeyuan Allen-Zhu, Brian Bullins, Elad Hazan, and Tengyu Ma. Finding Approximate Local Minima for Nonconvex Optimization in Linear Time. InSTOC, 2017. Full version available at

http://arxiv.org/abs/1611.01146.

[2] Zeyuan Allen-Zhu. Natasha 2: Faster Non-Convex Optimization Than SGD. ArXiv e-prints, abs/1708.08694, August 2017.

[3] Zeyuan Allen-Zhu and Elad Hazan. Variance Reduction for Faster Non-Convex Optimization. InICML, 2016. Full version available at http://arxiv.org/abs/1603.05643.

[4] Zeyuan Allen-Zhu and Yuanzhi Li. LazySVD: Even Faster SVD Decomposition Yet Without Agonizing Pain. InNIPS, 2016. Full version available athttp://arxiv.org/abs/1607.03463.

[5] Zeyuan Allen-Zhu and Yuanzhi Li. Faster Principal Component Regression and Stable Matrix Cheby-shev Approximation. In Proceedings of the 34th International Conference on Machine Learning, ICML ’17, 2017.

[6] Zeyuan Allen-Zhu and Yuanzhi Li. Follow the Compressed Leader: Faster Online Learning of Eigenvec-tors and Faster MMWU. InICML, 2017. Full version available athttp://arxiv.org/abs/1701.01722. [7] Yair Carmon, John C. Duchi, Oliver Hinder, and Aaron Sidford. Accelerated Methods for Non-Convex

Optimization. ArXiv e-prints, abs/1611.00756, November 2016.

[8] Yair Carmon, Oliver Hinder, John C. Duchi, and Aaron Sidford. ”Convex Until Proven Guilty”: Dimension-Free Acceleration of Gradient Descent on Non-Convex Functions. InICML, 2017.

[9] Anna Choromanska, Mikael Henaff, Michael Mathieu, G´erard Ben Arous, and Yann LeCun. The loss surfaces of multilayer networks. In AISTATS, 2015.

[10] Yann N Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, and Yoshua Ben-gio. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. InNIPS, pages 2933–2941, 2014.

[11] Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A Fast Incremental Gradient Method With Support for Non-Strongly Convex Composite Objectives. InNIPS, 2014.

[12] Dan Garber, Elad Hazan, Chi Jin, Sham M. Kakade, Cameron Musco, Praneeth Netrapalli, and Aaron Sidford. Robust shift-and-invert preconditioning: Faster and more sample efficient algorithms for eigen-vector computation. InICML, 2016.

[13] Rong Ge, Furong Huang, Chi Jin, and Yang Yuan. Escaping from saddle points—online stochastic gradient for tensor decomposition. InProceedings of the 28th Annual Conference on Learning Theory, COLT 2015, 2015.

[14] I. J. Goodfellow, O. Vinyals, and A. M. Saxe. Qualitatively characterizing neural network optimization problems. ArXiv e-prints, December 2014.

[15] Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M Kakade, and Michael I Jordan. How to Escape Saddle Points Efficiently. In ICML, 2017.

[16] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduc-tion. InAdvances in Neural Information Processing Systems, NIPS 2013, pages 315–323, 2013. [17] Cornelius Lanczos. An iteration method for the solution of the eigenvalue problem of linear differential

and integral operators. Journal of Research of the National Bureau of Standards, 45(4), 1950.

[18] Lihua Lei, Cheng Ju, Jianbo Chen, and Michael I Jordan. Nonconvex Finite-Sum Optimization Via SCSG Methods. ArXiv e-prints, abs/1706.09156, June 2017.

[19] Yurii Nesterov. Introductory Lectures on Convex Programming Volume: A Basic course, volume I. Kluwer Academic Publishers, 2004.

[20] Yurii Nesterov and Boris T. Polyak. Cubic regularization of Newton method and its global performance. Mathematical Programming, 108(1):177–205, 2006.


[21] Erkki Oja. Simplified neuron model as a principal component analyzer.Journal of mathematical biology, 15(3):267–273, 1982.

[22] Barak A Pearlmutter. Fast exact multiplication by the hessian. Neural computation, 6(1):147–160, 1994.

[23] Sashank J. Reddi, Ahmed Hefny, Suvrit Sra, Barnabas Poczos, and Alex Smola. Stochastic variance reduction for nonconvex optimization. InICML, 2016.

[24] Mark Schmidt, Nicolas Le Roux, and Francis Bach. Minimizing finite sums with the stochastic average gradient. arXiv preprint arXiv:1309.2388, pages 1–45, 2013. Preliminary version appeared in NIPS 2012.

[25] Nicol N Schraudolph. Fast curvature matrix-vector products for second-order gradient descent. Neural computation, 14(7):1723–1738, 2002.

[26] Shai Shalev-Shwartz. SDCA without Duality, Regularization, and Individual Convexity. In ICML, 2016.

[27] Lloyd N. Trefethen. Approximation Theory and Approximation Practice. SIAM, 2013.

[28] Yi Xu, Rong Jin, and Tianbao Yang. First-order Stochastic Algorithms for Escaping From Saddle Points in Almost Linear Time. ArXiv e-prints, abs/1711.01944, November 2017.


Figure 1: Neon vs. Neon2 for finding (ε, δ)-approximate local minima. We emphasize that Neon2 and Neon are based on the same high-level idea, but Neon is arguably the first-recorded result to turn stationary-point finding algorithms (such as SGD, SCSG) into local-minimum finding ones, with theoretical proofs.
