Mathematics authors titles recent submissions

(1)

arXiv:1711.06644v1 [math.OC] 17 Nov 2017

Nearly Optimal Stochastic Approximation for Online

Principal Subspace Estimation

Xin Liang

∗

Zhen-Chen Guo

†

Ren-Cang Li

‡

Wen-Wei Lin

§

November 20, 2017

Abstract

Processing streaming data as they arrive is often necessary for high dimensional data analysis. In this paper, we analyse the convergence of a subspace online PCA iteration, as a followup of the recent work of Li, Wang, Liu, and Zhang [Math. Program., Ser. B, DOI 10.1007/s10107-017-1182-z] who considered the case for the most significant principal component only, i.e., a single vector. Under the sub-Gaussian assumption, we obtain a finite-sample error bound that closely matches the minimax information lower bound.

Key words. Principal component analysis, Principal component subspace, Stochastic approx-imation, High-dimensional data, Online algorithm, Finite-sample analysis

AMS subject classifications. 62H25, 68W27, 60H10, 60H15

1 Introduction

Principal component analysis (PCA) introduced in [7, 10] is one of the most well-known and popular methods for dimensional reduction in high-dimensional data analysis.

LetX∈Rd _{be a random vector with mean E}_{_X_}_{and covariance matrix}

Σ= E(X−E_{X})(X−E_{X})T .

To reduce the dimension ofX fromdtop, PCA looks for ap-dimensional linear subspace that is closest to the centered random vectorX−E{X}in a mean squared sense, through the independent and identically distributed samplesX(1), . . . , X(n).

Denote by Gp(Rd) the Grassmann manifold of p-planes in Rd, or equivalently, the set of all

p-dimensional subspaces of Rd. Without loss of generality, we assume E_{X} = 0. Then PCA corresponds to a stochastic optimization problem

min U∈Gp(Rd)

Ek(Id−ΠU)Xk22 , (1.1)

whereIdis thed×didentity matrix, andΠU is the orthogonal projector onto the subspaceU. Let

Σ=U ΛUT _{be the spectral decomposition of}_Σ_{, where}

Λ= diag(λ1, . . . , λd) withλ1≥ · · · ≥λd≥0, and orthogonalU= [u1, . . . , ud]. (1.2)

Ifλp > λp+1, then the unique solution to the optimization problem (1.1), namely thep-dimensional

principal subspace ofΣ, is_U_∗=_R([u1, . . . , up]), the subspace spanned byu1, . . . , up.

In practice, Σ is unknown, and we must use sample data to estimate _U_∗. The classical PCA does it by the spectral decomposition of the empirical covariance matrixΣb= 1

n Pn

i=1X(i)(X(i))T.

Specifically, the classical PCA uses _Ub_∗ = _R([bu1, . . . ,ubp]) to estimate U∗, where bui is the corre-sponding eigenvectors ofΣb. The important quantity is the distance between U∗ and Ub∗. Vu and Lei [21, Theorem 3.1] proved that ifp(d₋p)σ2∗

n is bounded, then

inf e

U∗∈Gp(Rd) sup X∈P0(σ_∗2,d)

EnksinΘ(Ub∗,U∗)k2F

o

≥cp(d−p)σ

2

∗

n, (1.3)

where c > 0 is an absolute constant, and _P0(σ2_∗, d) is the set of all d-dimensional sub-Gaussian

distributions for which the eigenvalues of the covariance matrix satisfy λ1λp+1

(λp−λp+1)2 ≤σ

2

∗. Note that λ1λp+1

(λp−λp+1)2 is the effective noise variance. In the classical PCA, obtaining the empirical covariance

matrix has time complexity O(nd2_{) and space complexity O(}_d2_{). So storing and calculating a}

large empirical covariance matrix are very expensive when the data are of high dimension, not to mention the cost of computing its eigenvalues and eigenvectors.

To reduce both the time and space complexities, Oja [14] proposed an online PCA iteration

e

u(n)=u(n−1)+β(n−1)X(n)(X(n))Tu(n−1), u(n)=ue(n)_keu(n)_k−21, (1.4)

to approximate the most significant principal component, whereβ(n)_>_{0 is a stepsize. Later Oja}

and Karhunen [15] proposed a subspace online PCA iteration

e

U(n)=U(n−1)+X(n)(X(n))TU(n−1)diag(β1(n−1), . . . , βp(n−1)), U(n)=Ue(n)R(n), (1.5)

to approximate the principal subspace_U_∗, whereβ_i(n)>0 for 1_≤i_≤pare stepsizes, and R(n) _is

a normalization matrix to makeU(n)_{have orthonormal columns. One such an}_R(n)_is

(3)

covariance matrix explicitly is completely avoided. In the online PCA, obtaining the principal subspace has time complexity O(p2_nd_{) and space complexity O(}_pd_{), which is much less than those}

required by the classical PCA.

Although the online PCA iteration (1.4) was proposed over 30 years ago, its convergence analysis is rather scarce. Some recent works [2, 8, 16] studied the convergence of the online PCA for the most significant principal component, i.e., u1, from different points of view and obtained some

results for the case where the samples are almost surely uniformly bounded. For such a case, De Sa, Olukotun, and R´e [4] studied a different but closely related problem, in which the angular part is equivalent to the online PCA, and obtained some convergence results. In contrast, for the distributions with sub-Gaussian tails (note that the samples of this kind of distributions may be unbounded), Li, Wang, Liu, and Zhang [11] proved a nearly optimal convergence rate for the iteration (1.4): if the initial guessu(0) _{is randomly chosen according to a uniform distribution and}

the stepsizeβ is chosen in accordance with the sample sizen, then there exists a high-probability eventA_∗ with P_{A_∗_{} ≥}1₋δ such that

En|tanΘ(u(n), u∗)|2

A_∗o_≤C(d, n, δ)lnn

n

1

λ1−λ2

d X

i=2

λ1λi

λ1−λi

(1.7a)

≤C(d, n, δ) λ1λ2 (λ1−λ2)2

(d₋1) lnn

n , (1.7b)

where δ_∈[0,1), u∗ =u1 in (1.2), and C(d, n, δ) can be approximately treated as a constant. It

can be seen that this bound matches the minimax low bound (1.3) up to a logarithmic factor ofn, hence,nearly optimal. Also, the convergence rate holds true as long as the initial approximation satisfies

|tanΘ(u(0), u∗)| ≤cd, (1.8) for some constantc >0, which meansnearly global. It is significant because a uniformly distributed initial value is nearly orthogonal to the principal component with high probability whendis large. This result is more general than previous ones in [2,8,16], because it is for distributions that can possibly be unbounded, and the convergence rate is nearly optimal and nearly global. For more details of comparison, the reader is referred to [11].

However, there is still no convergence result for the subspace online PCA, namely the subspace iteration (1.5). Our main purpose in this paper is to analyze the convergence of the subspace online PCA iteration (1.5), similarly to the effort in [11] which is for the special casep= 1. One of our results for the convergence rate states that: if the initial guessU(0) _{is randomly chosen to satisfy}

that_R(U(0)_{) is uniformly sampled from}_G_p(_Rd_{), and the stepsize}_β(n)

i for 1≤i≤pare chosen the same and in accordance with the same sizen, then there exists a high-probability eventH_∗ with P{H_∗_{} ≥}1−2δp2_{, such that}

En_ktanΘ(U(n)_{, U}

∗)k2F

H_∗o_≤C(d, n, δ)lnn

n

1

λp−λp+1

p X

j=1

d X

i=p+1

λjλi

λj−λi

(1.9a)

≤C(d, n, δ) λpλp+1 (λp−λp+1)2

p(d−p) lnn

n , (1.9b)

where the constantC(d, n, δ)→24ψ4/(1−δp2) asd→ ∞andn→ ∞, andψisX’s Orlicz norm. This is also nearly optimal, nearly global, and valid for any sub-Gaussian distribution. Whenp= 1, it degenerates to (1.7), as it should be. Although this result of ours look like a straightforward generalization, its proof, however, turns out to be nontrivially much more complicated.

(4)

the technical differences in proof also from [11] in section5. Due to their complexities, the proofs of these results are deferred to sections6 and 7. Finally, section 8 summarizes the results of the paper.

Notations. Rn×m_{is the set of all}_n_×_m_{real matrices,}_Rn₌_Rn×1_{, and}_R₌_R1_. _I

n (or simply

I if its dimension is clear from the context) is then×nidentity matrix andej is its jth column (usually with dimension determined by the context). For a matrix X, σ(X), kXk∞, kXk2 and

kXkF are the multiset of the singular values, the ℓ∞-operator norm, the spectral norm, and the Frobenius norm ofX, respectively. R(X) is the column space spanned by the columns ofX,X(i,j)

is the (i, j)th entry ofX, andX(k:ℓ,:) andX(:,i:j)are two submatrices of X consisting of its rowk

to rowℓand columnito columnj, respectively. X_◦Y is the Hadamard, i.e., entrywise, product of matrices (vector)X andY of the same size.

For any vector or matrixX, Y,X _≤Y (X < Y) meansX(i,j)≤Y(i,j)(X(i,j)< Y(i,j)) for any

i, j. X _≥Y (X > Y) if₋X _{≤ −}Y (₋X <₋Y);X _≤α(X < α) for a scalarαmeansX(i,j)≤α

(X(i,j)< α) for anyi, j; similarlyX ≥αandX > α.

For a subset or an eventA, Ac is the complement set ofA. By σ_{A1, . . . ,Ap}we denote the

σ-algebra generated by the eventsA1, . . . ,Ap. N= {1,2,3, . . .}. E{X;A} := E{X1A} denotes

the expectation of a random variableX over eventA. Note that

E_{X; A_}= E_{X|A_}P_{A_}. (1.10)

For a random vector or matrix X, E{X}:=EX(i,j) . Note that kE{X}kui ≤E{kXkui} for

ui = 2,F. Write cov_◦(_X,_Y) := E{[_X−E{_X}]◦[_Y −E{_Y}]} and var_◦(_X) := cov_◦(_X,_X).

2 Canonical Angles

For two subspacesX,Y ∈Gp(Rd_{), let}_{X, Y} _∈_Cd×p_{be the basis matrices of}_X _and_Y_{, respectively,} i.e., X = R(X) and Y = R(Y), and denote by σj for 1 ≤ j ≤ k in nondecreasing order, i.e.,

σ1≤ · · · ≤σp, the singular values of

(XT_X₎−1/2_XT_Y₍_YT_Y₎−1/2_.

Thekcanonical angles θj(_X,_Y) between_X to_Y are defined by

0≤θj(X,Y) := arccosσj ≤ π

2 for 1≤j≤p. (2.1)

They are in non-increasing order, i.e.,θ1(X,Y)≥ · · · ≥θp(X,Y). Set

Θ(X,Y) = diag(θ1(X,Y), . . . , θp(X,Y)). (2.2)

It can be seen that angles so defined are independent of the basis matrices X and Y, which are not unique. With the definition of canonical angles,

ksinΘ(_X,_Y)_kui for ui = 2,F

are metrics onGp(Rd) [17, Section II.4].

In what follows, we sometimes place a vector or matrix in one or both arguments of θj(_·,_·) and Θ(_·,_·) with the understanding that it is about the subspace spanned by the vector or the columns of the matrix argument.

For anyX _∈Rd×p, ifX(1:p,:) is nonsingular, then we can define

T₍_X_{) :=}_X₍_p_+1:_d,_:)_X−1

(1:p,:). (2.3)

Lemma 2.1. For X _∈Rd×p with nonsingularX(1:p,:), we have for ui = 2,F

tanΘ(X,

Ip 0

)

ui

(5)

Proof. Let Y =

Ip 0

∈ Rd×p_{. By definition,} _σ

j = cosθj(X, Y) for 1 ≤ j ≤ p are the singular values of

I+T₍_X₎TT₍_X₎−1/2

I T₍_X₎

T

I

0

=I+T₍_X₎TT₍_X₎−1/2_.

So ifτ1≥τ2≥ · · · ≥τp are the singular values ofT(X), then

σj = (1 +τj2)−1/2 ⇒ τj= q

1₋σ2

j

σj

= tanθj(X, Y),

and hence the identity (2.4).

3 Online PCA for Principal Subspace

LetX = [X1,X2, . . . ,Xd]Tbe a random vector inRd. Assume E{X}= 0. Its covariance matrix

Σ:= EnXXTohas the spectral decomposition

Σ=U ΛUT with U = [u1, u2, . . . , ud], Λ= diag(λ1, . . . , λd), (3.1)

where U _∈ Rd×d is orthogonal, and λi for 1 ≤ i ≤ d are the eigenvalues of Σ, arranged for convenience in non-increasing order. Assume

λ1≥ · · · ≥λp > λp+1≥ · · · ≥λd>0. (3.2) In section 1, we mention the subspace online PCA iteration (1.5) of Oja and Karhunen [15] for computing the principal subspace of dimensionp

U∗=R(U(:,1:p)) =R([u1, u2, . . . , up]). (3.3)

In this paper, we will use a fixed stepsizeβ for allβ_i(n)there. Then U(n)_{can be stated in a more}

explicit manner with the help of the following lemma.

Lemma 3.1. Let V ∈Rd×p withVTV =Ip,y∈Rd with yTy= 1, and0< β∈R, and let

W :=V +βyyTV = (Id+βyyT)V, V+:=W(WTW)−1/2.

If VT_y₆_{= 0}_{, then}

V+=V +βyyTV −[1−(1 +α)−1/2](V +βyyTV)zzT,

whereγ=_kVT_y

k2,z=VTy/γ, and α=β(2 +β)γ2. In particular, V+TV+=Ip.

Proof. Letz=VT_y

∈Rp. We have

WTW =VT[Id+βyyT]2V =VT[Id+β(2 +β)yyT]2V =Ip+αzzT. LetZ⊥∈Rp×(p−1)such that [z, Z⊥]T[z, Z⊥] =Ip. The eigen-decomposition of WTW is

WT_W _{= [}_{z, Z}

⊥]

1 +α Ip−1

[z, Z_⊥]T

which yields

(WT_W₎−1/2_{= [}_{z, Z}

⊥]

(1 +α)−1/2

Ip−1

[z, Z_⊥]T

(6)

Algorithm 3.1 Subspace Online PCA

1: ChooseU(0)∈Rd×p with (U(0)₎T_U(0) ₌_I_{, and choose stepsize}_{β >}_0.

2: forn= 1,2, . . . until convergencedo 3: Take anX’s sampleX(n)_;

4: Z(n)_{= (}_U(n−1)₎T_X(n)_,_α(n)₌_β_{2 +}_β₍_X(n)₎T_X(n)₍_Z(n)₎T_Z(n)_,_α_e(n)_{= (1 +}_α(n)₎−1/2_;

5: U(n)₌_U(n−1)₊_β_α_e(n)_X(n)₍_Z(n)₎T

− 1−αe(n)

(Z(n)₎T_Z(n)U(n−1)Z(n)(Z(n))T.

6: end for

= (1 +α)−1/2zzT+Ip−zzT =Ip−[1−(1 +α)−1/2]zzT. Therefore,

V+= (V +βyyTV){Ip−[1−(1 +α)−1/2]zzT} =V +βyyT_V

−[1₋(1 +α)−1/2_](_V ₊_βyyT_V₎_zzT_,

as expected.

With the help of this lemma, for the fixed stepsize β_i(n) =β for all 1 _≤i _≤p, we outline in Algorithm 3.1 a subspace online PCA algorithm based on (1.5) and (1.6). The iteration at its line 5 combines (1.5) and (1.6) as one. This seems like a minor reformulation, but it turns out to be one of the keys that make our analysis go through. The rest of this paper is devoted to analyze its convergence.

Remark 3.1. A couple of comments are in order for Algorithm 3.1.

1. Vectors X(n)

∈ Rd for n = 1,2, . . . are independent and identically distributed samples of

X.

2. If the algorithm converges, it is expected that

U(n)→U∗:=U

Ip 0

= [u1, u2, . . . , up]

in the sense that_ksinΘ(U(n)_{, U}

∗)kui→0 asn→ ∞.

Notations introduced in this section, except those in Lemma 3.1 will be adopted throughout the rest of this paper.

4 Main Results

We point out that any statement we will make is meant to holdalmost surely.

We are concerned with random variables/vectors that have a sub-Gaussian distribution which we will define next. To that end, we need to introduce the Orlicz ψα-norm of a random vari-able/vector. More details can be found in [19].

Definition 4.1. TheOrlicz ψα-norm of a random variableX ∈Ris defined as

kXkψα := inf

ξ >0 : E

expX

ξ

α

≤2

,

and theOrliczψα-norm of a random vectorX∈Rd is defined as

kXkψα := sup kvk2=1

kvT_X

kψα.

(7)

By the definition, we conclude that any bounded random variable/vector follows a sub-Gaussian distribution. To prepare our convergence analysis, we make a few assumptions.

Assumption 4.1. X = [X1,X2, . . . ,Xd]T∈Rd is a random vector.

(A-1) E_{X}= 0, andΣ:= EnXXTohas the spectral decomposition (3.1) satisfying (3.2);

(A-2) ψ:=kΣ−1/2_X_k

ψ2≤ ∞.

The principal subspace_U_∗ in (3.3) is uniquely determined under (A-1) of Assumption 4.1. On the other hand, (A-2) of Assumption4.1ensures that all 1-dimensional marginals of X have sub-Gaussian tails, or equivalently,X follows a sub-Gaussian distribution. This is also an assumption that is used in [11].

Assumption 4.2.

(A-3) There is a constantν _≥1, independent of d, such that

k(Id−ΠU∗)Xk

2

2≤(ν−1)pmax

kU_∗TXk2∞, λ1 . (4.1)

To intuitively understand Assumption4.2, we noteUT

∗X∈Rpand thuskΠU∗Xk2=kU

T

∗Xk2≤

√_p_k_UT

∗Xk∞. Hence (4.1) is implied by

k(Id−ΠU∗)Xk2≤

√

ν₋1_kΠU∗Xk2. (4.2) This inequality can be easily interpreted, namely, the optimal principle component subspaceU∗ should be inclusive enough so that the component of X in the orthogonal complement of U∗ is almost surely dominated by the component of X in U∗ by a constant factor √ν−1. The case

ν= 1 is rather special and it means (Id−ΠU∗)X = 0 almost surely, i.e.,X ∈ U∗ almost surely. In what follows, we will state our main results and leave their proofs to later sections because of their high complexity. To that end, first we introduce some quantities. The eigenvalue gap is

γ=λp−λp+1. (4.3)

Fors >0 and the stepsizeβ <1 such thatβγ <1, define

Ns(β) := minn_∈N: (1₋βγ)n_≤βs =

slnβ

ln(1₋βγ)

, (4.4)

where_⌈·⌉ is the ceiling function taking the smallest integer that is no smaller than its argument, and for 0< ε <1/7,

M(ε) := minm_∈N:β7ε/2−1/2

≤β(1−21−m

)(3ε−1/2) _{= 2 +}

&

ln1/2_ε−3ε ln 2

'

≥2. (4.5)

Theorem 4.1. Given

ε∈(0,1/7), ω >0, φ >0, κ >6[M(ε)−1]/2max{2φω1/2,√2}, (4.6a)

0< β <min   1,

√

2₋1

νpλ1

! 1

1−2ε

,

1 8κpλ1

2

1−4ε

,

γ2

130ν1/2_κ2_p2_λ2 1

1

ε  

. (4.6b)

LetU(n)_for_n_{= 1}_,₂_{, . . .} _{be the approximations of}_U

∗generated by Algorithm3.1. Under Assump-tions4.1 and4.2, if_ktanΘ(U(0)_{, U}

∗)k22≤φ2d−1, and

(8)

then there exist absolute constants1 _C

ψ,Cν,C◦ and a high-probability event Hwith

P_{H_{} ≥}1₋K[(2 + e)d+ ep] exp(₋Cνψβ−ε)

such that for anyn∈[N3/2−37ε/4(β), K]

En_ktanΘ(U(n), U_∗)_k2F; H

o

≤(1₋βγ)2(n−1)2pφ2d

+ 32ψ

4_β

2₋λ1β1−2ε

ϕ(p, d;Λ) +C_◦κ4ν2λ21γ−1p3

p

d₋pβ3/2−7ε, (4.8)

wheree = exp(1)is Euler’s number,Cνψ= max{Cνν−1/2, Cψψ−1}, and2

ϕ(p, d;Λ) := p X

j=1

d X

i=p+1

λjλi

λj−λi ∈

p(d₋p)λ1λd

λ1−λd

,p(d−p)λpλp+1 λp−λp+1

. (4.9)

The conclusion of Theorem4.1holds for any givenU(0)_satisfying

ktanΘ(U(0)_{, U}

∗)k22≤φ2d−1.

However, it is not easy, if at all possible, to verify this condition. Next we consider a randomly selectedU(0)_.

Suppose onGp(Rd_{) we use a uniform distribution, the one with the Haar invariant probability} measure (see [3, Section 1.4] and [9, Section 4.6]). We refer the reader to [3, Section 2.2] on how to generate such a uniform distribution onGp(Rd_{). Our assumption for a randomly selected} _U(0)_is

randomly selectedU(0) _{satisfies that}_R₍_U(0)_{) is}

uni-formly sampled fromGp(Rd_). (4.10)

Theorem 4.2. Under Assumptions4.1and4.2, for sufficiently largedand anyβ satisfying (4.6b)

withκ= 6[M(ε)−1]/2_max_{₂_C

p,√2}, and

p <(d+ 1)/2, ε∈(0,1/7), δ∈[0,1], K > N3/2−37ε/4(β),

whereCp is a constant only dependent onp, if (4.10)holds, and

dβ1−3ε≤δ2, K[(2 + e)d+ ep] exp(−Cνψβ−ε)≤δp

2

,

then there exists a high-probability event H_∗ with P_{H_∗_{} ≥}1₋2δp2

such that

En_ktanΘ(U(n), U∗)k2F; H∗ o

≤2(1₋βγ)2(n−1)p Cp2δ−2d

+ 32ψ

4_β

2₋λ1β1−2ε

ϕ(p, d;Λ) +C◦κ4ν2λ21γ−1p3

p

d₋pβ3/2−7ε (4.11)

for anyn∈[N3/2−37ε/4(β), K], whereϕ(p, d;Λ)is as in (4.9).

Finally, suppose that the number of principal componentspand the eigenvalue gapγ =λp−

λp+1is known in advance, and the sample size is fixed atN∗. We must choose a properβ to obtain the principal components as accurately as possible. A good choice turns out to be

β=β_∗:= 3 lnN∗ 2γN∗

. (4.12)

1_{We attach each with a subscript for the convenience of indicating their associations. They don’t change as the}

values of the subscript variables vary, by which we meanabsolute constants_{. Later in (6.5), we explicitly bound} these absolute constants.

2_{To see the inclusion in (4.9), we note the following: if 0}_≤_a_≤_{c < d}_≤_b_{, then}

0≤1

b ≤ 1 d<

1 c ≤

1 a ⇒

dc d−c =

1

c−

1

d ≥ ₁ 1

a−

1

b

(9)

Theorem 4.3. Under Assumptions4.1and4.2, for sufficiently large d_≥2pand sufficiently large N∗,δ∈[0,1]satisfying

dβ_∗1−3ε_≤δ2, N∗[(2 + e)d+ ep] exp(−Cνψβ−ε)≤δp

2

,

where β_∗ is given by (4.12), if (4.10) holds, then there exists a high-probability event H_∗ with

P_{H_∗_{} ≥}1₋2δp2

, such that

En_ktanΘ(U(N∗)_{, U}

∗)k2F; H∗ o

≤C∗(d, N∗, δ)_λϕ(p, d;Λ) p−λp+1

lnN∗

N_∗ , (4.13)

where the constantC_∗(d, N_∗, δ)_→24ψ4 _as_d

→ ∞, N_∗_{→ ∞}, andϕ(p, d;Λ)is as in (4.9).

In Theorems4.1,4.2and4.3, the conclusions are stated in term of the expectation ofktanΘ(U(n)_{, U}

∗)k2F

over some highly probable event. These expectations can be turned into conditional expectations, thanks to the relation (1.10). In fact, (1.9) is a consequence of (4.13) and (1.10).

The proofs of the three theorems are given in sections 6 and 7. Although overall our proofs follow the same structure of that in Li, Wang, Liu, and Zhang [11], there are inherently critical subtleties in going from one-dimension (p = 1) to multi-dimension (p > 1). In fact, one of key steps in proof works for p= 1 does not seem to work for p >1. More detail will be discussed in the next section.

5 Comparisons with Previous Results

Our three theorems in the previous section, namely Theorems4.1,4.2and4.3, are the analogs for

p >1 of Li, Wang, Liu, and Zhang’s three theorems [11, Theorems 1, 2, and 3] which are forp= 1 only. Naturally, we would like to know how our results when applied to the casep= 1 and proofs would stand against those in [11]. In what follows, we will do a fairly detailed comparison. But before we do that, let us state their theorems (in our notation).

Theorem 5.1 ( [11, Theorem 1]). Under Assumption 4.1 and p = 1, suppose there exists a constant φ > 1 such that tanΘ(U(0)_{, U}

∗) ≤ φ2d. Then for any ε ∈ (0,1/8), stepsize β > 0

satisfyingd[λ2

1γ−1β]1−2ε≤b1φ−2, and any t >1, there exists an eventH with

P_{H_{} ≥}1₋2(d+ 2)Nbo₍_{β, φ}_{) exp}

−C0[λ21γ−1β]−2ε

−4dNbt(β) exp ₋C1[λ21γ−1β]−2ε

,

such that for anyn_∈[Nb1(β) +Nbo(β, φ),Nbt(β)]

Entan2Θ(U(n), U∗); H o

≤(1₋βγ)2[n−Nbo(β,φ)]+C2βϕ(1, d;Λ) +C2

d X

i=2

λ1−λ2

λ1−λi

[λ21γ−1β]3/2−4ε,

(5.1)

whereb1∈(0,ln22/16),C0, C1, C2 are absolute constants, and

b

No(β, φ) := minn∈N: (1−βγ)n ≤[4φ2d]−1 =

−ln[4φ2_d_]

ln(1−βγ)

,

b

Ns(β) := minn_∈N: (1₋βγ)n _≤[λ21γ−1β]s =

sln[λ2 1γ−1β]

ln(1₋βγ)

.

What we can see that Theorem 4.1for p= 1 is essentially the same as Theorem5.1. In fact, since (1₋βγ)1−Nbo₍_β,φ₎

≤4φ2_d

≤(1₋βγ)−Nbo₍_β,φ₎

, the upper bounds by (4.8) for p= 1 and by (5.1) are comparable in the sense that they are in the same order ind, β, δ. The major difference is perhaps the use of Assumption4.2 by Theorem4.1, but not by Theorem5.1. In this sense, for the casep= 1, Theorem5.1 is stronger.

(10)

to generalize the proving techniques in [11] which is for the one-dimensional case (p= 1) to the multi-dimensional case (p >1) so that Assumption4.2becomes unnecessary. Indeed, we tried but didn’t succeed, due to we believe unsurmountable obstacles. We now explain. The basic struc-ture of the proof in [11] is to split the Grassmann manifoldGp(Rd_{) where the initial guess comes} from into two regions: the cold region and warm region. Roughly speaking, an approximation

U(n)_{in the warm region means that}_k_tan_Θ₍_U(n)_{, U}

∗)kF is small while it in the cold region means

that ktanΘ(U(n)_{, U}

∗)kF is not that small. U∗ sits at the “center” of the warm region which is wrapped around by the cold region. The proof is divided into two cases: the first case is when the initial guess is in the warm region and the other one is when it is in the code region. For the first case, they proved that the algorithm will produce a sequence convergent to the principal subspace (which is actually the most significant principal component because it is forp= 1) with high probability. For the second case, they first proved that the algorithm will produce a sequence of approximations that, after a finite number of iterations, will fall into the warm region with high probability, and then use the conclusion proved for the first case to conclude the proof because of the Markov property.

For our situation p >1, we still structure our proof in the same way, i.e., dividing the whole proof into two cases of the cold region and warm region. The proof in [11] for the warm region case can be carried over with a little extra effort, as we will see later, but we didn’t find that it was possible to use a similar argument in [11] to get the job done for the cold region case. Three major difficulties are as follows. In [11], essentiallykcotΘ(U(n)_{, U}

∗)kFwas used to track the the behavior

of a martingale along with the power iteration. Note cotΘ(U(n)_{, U}

∗) isp×pand thus it is a scalar whenp= 1, perfectly well-conditioned if treated as a matrix. But forp >1, it is a genuine matrix and, in fact, an inverse of a random matrix in the proof. The first difficulty is how to estimate the inverse because it may not even exist! We tried to separate the flow ofU(n)into two subflows: the ill-conditioned flow and the well-conditioned flow, and estimate the related quantities separately. Here the ill-conditioned flow at each step represents the subspace generated by the singular vectors of cotΘ(U(n)_{, U}

∗) whose corresponding singular values are tiny, while the well-conditioned flow at each step represents the subspace generated by the other singular vectors, of which the inverse (restricted to this subspace) is well conditioned. Unfortunately, tracking the two flows can be an impossible task because, due to the randomness, some elements in the ill-conditioned flow could jump to the well-conditioned flow during the iteration, and vice versa. This is the second difficulty. The third one is to build a martingale to go along with a proper power iteration, or equivalently, to find the Doob decomposition of the process, because the recursion formula of the main part of the inverse — the drift in the Doob decomposition, even if limited to the well-conditioned flow, is not a linear operator, which makes it impossible to build a proper power iteration.

In the end, to deal with the cold region, we gave up the idea of estimatingkcotΘ(U(n)_{, U}

∗)kF.

Instead, we invent another method: cutting the cold region into many layers, each wrapped around by another with the innermost one around the warm region. We prove the initial guess in any layer will produce a sequence of approximations that will fall into its inner neighbor layer (or the warm region if the layer is innermost) in a finite number of iterations with high probability. Therefore eventually, any initial guess in the cold region will lead to an approximation in the warm region within a finite number of iterations with high probability, returning to the case of initial guesses coming from the warm region because of the Markov property. This enables us to completely avoid the difficulties mentioned above. This technique can also be used for the one-dimensional case to simplify the proof in [11] if with Assumption4.2.

Theorem 5.2( [11, Theorem 2]). Under Assumption4.1andp= 1, suppose thatU(0)_{is uniformly}

sampled from the unit sphere. Then for anyε∈(0,1/8), stepsizeβ >0, δ >0 satisfying

d[λ21γ−1β]1−2ε≤b2δ2, 4dNb2(β) exp −C3[λ21γ−1β]−2ε

≤δ,

there exists an event H_∗ withP_{H_∗_{} ≥}1₋2δsuch that for anyn_∈[Nb2(β),Nb3(β)]

Entan2Θ(U(n), U_∗); H_∗o_≤C4(1−βγ)2nδ−4d2+C4βϕ(1, d;Λ) +C4

d X

i=2

λ1−λ2

λ1−λi

[λ21γ−1β]3/2−4ε,

(11)

whereb2, C3, C4 are absolute constants.

Theorem 5.3( [11, Theorem 3]). Under Assumption4.1andp= 1, suppose thatU(0)_{is uniformly}

sampled from the unit sphere and let β_∗ = 2 lnN∗

γN . Then for any ε ∈ (0,1/8), N∗ ≥ 1, δ > 0

satisfying

d[λ21γ−1β∗]1−2ε≤b3δ2, 4dNb2(β∗) exp −C6[λ21γ−1β∗]−2ε

≤δ, there exists an event H_∗ withP_{H_∗_{} ≥}1₋2δsuch that

Entan2Θ(U(N∗)_{, U}

∗); H∗ o

≤C∗(d, N∗, δ)ϕ_λ(1, d;Λ)

1−λ2

lnN∗

N∗ , (5.3)

where the constantC∗(d, N∗, δ)→C5 asd→ ∞, N∗→ ∞, andb3, C5, C6 are absolute constants.

Our Theorems4.2 and4.3 when applied to the casep= 1 do not exactly yield Theorems 5.2

and 5.3, respectively. But the resulting conditions and upper bounds have the same orders in variablesd, β, δ, and the coefficients ofβ and lnN∗

N in the upper bounds are comparable. But we note that the first term in right-hand side of (4.11) is proportional tod, notd2_{as in (}_5.2_{); so ours}

is tighter.

The proofs here for Theorems4.2and4.3are nearly the same as those in [11] for Theorems5.2

and 5.3 owing to the fact that the difficult estimates have already been taken care of by either Theorem 4.1or Theorem 5.1. But still there are some extras forp >1, namely, to estimate the marginal probability for the uniform distribution on the Grassmann manifold of dimension higher than 1. We cannot find it in the literature, and thus have to build it ourselves with the help of the theory of special functions of a matrix argument, rarely used in the statistical community.

It may also be worth pointing out that all absolute constants, exceptCp which has an explicit expression in (7.4) and Cψ, in our theorems are concretely bounded as in (6.5), whereas those in Theorems5.1to5.3are not.

6 Proof of Theorem

4.1

In this section, we will prove Theorem4.1. For that purpose, we build a quite amount of preparation material in sections 6.1 to 6.3 before we prove the theorem in section 6.4. Figure 6.1 shows a pictorial description of our proof process.

Lemma6.1

Lemma6.2 Lemma 6.3 Lemma6.4 Lemma6.6

Lemma6.5 Theorem4.1

section6.2

section6.3

Figure 6.1: Proof process for Theorem4.1

6.1 Simplification

Without loss of generality, we may assume that the covariance matrixΣdiagonal. Otherwise, we can perform a (constant) orthogonal transformation as follows. Recall the spectral decomposition

Σ=U ΛUT _{in (}_3.1_{). Instead of the random vector}_X_{, we equivalently consider}

Y ≡[Y1,Y2, . . . ,Yn]T:=UTX.

Accordingly, perform the same orthogonal transformation on all involved quantities:

Y(n)₌_UT_X(n)_, _V(n)₌_UT_U(n)_, _V

∗=UTU∗=

Ip 0

(12)

As consequences, we will have the equivalent versions of Algorithm3.1, Theorems4.1,4.2and4.3. Firstly, because

(V(n−1))TY(n)= (U(n−1))TX(n), (Y(n))TY(n)= (X(n))TX(n),

the equivalent version of Algorithm 3.1 is obtained by symbolically replacing all letters X, U by

Y, V while keeping their respective superscripts. If the algorithm converges, it is expected that

R(V(n)₎

→ R(V_∗). Secondly, noting

kΣ−1/2Xkψ2 =kU Λ−

1/2_UT

Xkψ2 =kΛ−

1/2

Ykψ2,

we can restate Assumptions4.1and4.2equivalently as

(A-1′_{) E}_{_Y_}_{= 0}_,_En_{Y Y}To₌_Λ_{= diag(}_λ

1, . . . , λd) with (3.2);

(A-2′₎ _ψ_:=_k_Λ−1/2_Y

kψ2 ≤ ∞.

(A-3′₎ _k₍_I_d₋_Π V∗)Yk

2

2 ≤(ν−1)pmax

kYk2

∞, λ1 , where 1 ≤ν is a constant, independent of d,

and_V∗=R(V∗).

Thirdly, all canonical angles between two subspaces are invariant under the orthogonal transforma-tion. Therefore the equivalent versions of Theorems4.1,4.2and4.3forY is also simply obtained by replacing all lettersX, U byY, V while keeping their respective superscripts.

In the rest of this section, we will prove the mentioned equivalent version of Theorem 4.1. Likewise in the next section, we will prove the equivalent versions of Theorems4.2and4.3. In the other words, essentially we assume thatΣis diagonal.

To facilitate our proof, we introduce new notations for two particular submatrices of anyV ∈

Rd×p_:

¯

V =V(1:p,:),

¯

V =V(p+1:d,:). (6.2)

In particular,T₍_V_{) =}

¯

VV¯−1 for the operatorT _{defined in (}_2.3_{), provided ¯}_V _{is nonsingular. Set}

¯

Λ= diag(λ1, . . . , λp),

¯

Λ= diag(λp+1, . . . , λd). (6.3)

Although the assignments to ¯Λand ¯

Λare not consistent with the extractions defined by (6.2), they don’t seem to cause confusions in our later presentations.

Forκ >1, defineS(κ) :=_{V _∈Rd×p:σ( ¯V)_⊂[1

κ,1]}. It can be verified that

V _∈S(κ)_{⇔ k}T₍_V₎_k₂_≤p_κ2₋₁_. _(6.4)

For the sequenceV(n)_{, define}

Nout{S(κ)}:= min{n:V(n)∈/ S(κ)}, Nin{S(κ)}:= min{n:V(n)∈S(κ)}.

Nout{S(κ)}is the first step of the iterative process at whichV(n)jumps fromS(κ) to outside, and

Nin{S(κ)}is the first step of the iterative process at whichV(n)jumps from outside toS(κ). Also,

define e

λ:=λ1β−2ε, Nqb{eλ}:= min

n

n_≥1 : max_{kY(n)

k∞,kZ(n)k∞} ≥eλ1/2 o

.

Nqb{eλ}is the first step of the iterative process at which an entry of Y(n) orZ(n)exceedsλe. For

n < Nqb{eλ}, we have

kY(n)_k2<(deλ)1/2, kZ(n)k2<(peλ)1/2.

Moreover, under (A-3′_),_k_Y(n)

k2<(νpλe)1/2.

For convenience, we will set T(n)₌_T₍_V(n)_{), and let}_F

n =σ{Y(1), . . . , Y(n)}be theσ-algebra filtration, i.e., the information known by step n. Also, since in this section ε, β are fixed, we suppress the dependency information ofM(ε) on εand Ns = Ns(β) onβ to simply write M =

(13)

Lastly, we discuss some of the important implications of the condition

0< β <min   1,

√

2₋1

νpλ1

! 1

1−2ε

,

1 8κpλ1

2

1−4ε

,

γ2

130ν1/2_κ2_p2_λ2 1

1

ε  

. (4.6b)

It guarantees that

(β-1) β <1 andβγ=β1/2+2ε_β1/2−2ε₍_λ

p−λp+1)≤₈_κpλ1 ₁λ1≤₁₆1√₃;

(β-2) dβeλ_≤β2ε_ωλ

1β−2ε≤ωλ1;

(β-3) pβeλ_≤νpβeλ=νpλ1β1−2ε≤√2−1.

Set

CV =5 2 +

7

2(νpλβe ) + 15

8 (νpeλβ)

2₊3

8(νpeλβ)

3

≤16 + 13 √

2

8 ≈4.298; (6.5a)

C∆= 2 + 1

2(νpeλβ) +CVpeλβ≤

22 + 7√2

8 ≈3.987; (6.5b)

CT =CV + 2C∆+ 2C∆CVpeλβ≤

251 + 122√2

16 ≈26.471; (6.5c)

Cκ=

(3₋√2)C2

∆ 64(CT+ 2C∆)2 ≤

565 + 171√2

21504 ≈0.038; (6.5d)

Cν = 4√2CTCκ≤223702 + 183539

√

2

86016 ≈5.618; (6.5e)

C◦= 29 + 8

√

2 16(3−√2) +

4CT (3−√2)C∆

β3ε+ [CT +29 + 8

√

2 32 ]β

1/2−3ε

+ 3C

2

T 2(3−√2)C2

∆

β1/2+3ε+2CT

C∆

β1−3ε+ C

2

T 2C2

∆

β3/2−3ε

≤2582968 + 1645155 √

2

14336 ≈342.464. (6.5f)

The condition (4.6b) also guarantees that

(β-4) 2C∆peλβ1/2κ= 2C∆pλ1β1/2−2εκ≤ 2C₈∆ <1, and thus 2C∆peλβκ <1;

(β-5) 4√2CTν1/2κ2p2λe2γ−1β5ε ≤1, and thus 4√2CTν1/2κ2p2eλ2γ−1β1/2+χ ≤1 for χ ∈[−1/2 + 5ε,0].

6.2 Increments of One Iteration

Lemma 6.1. For any fixedK≥1,

PnNqb{eλ}> K

o

≥1₋e(d+p)Kexp(₋Cψψ−1β−2ε),

whereCψ is an absolute constant.

Proof. Since n

Nqb{eλ} ≤K

o

⊂ [

n≤K n

kY(n)

k∞≥eλ1/2 o

∪nkZ(n)

k∞≥eλ1/2 o

⊂ [

n≤K 

 [

1≤i≤d n

|eT

i Y(n)| ≥eλ1/2 o

∪ [

1≤j≤p n

|eT

jZ(n)| ≥eλ1/2 o

(14)

= [

Proof. To avoid the cluttered superscripts, in this proof, we drop the superscript “_·(n)_{” and use}

the superscript “_·+_{” to replace “}

By Taylor’s expansion, there exists 0< ξ < αsuch that

(15)

whereR=₋β

2

2 (Z

T_Z_{)(2 +}_βYT_Y₎_Y ₊_ζβ2₍_ZT_Z₎_{V Z}₊_ζβ3₍_ZT_Z₎2_Y _{for which}

kR_k2≤

β2

2 (peλ)(2 +βνpeλ)(νpλe)

1/2₊_ζβ2₍_p_e_λ₎3/2₊_ζβ3₍_p_e_λ₎2₍_νp_λ_e₎1/2

=

1

2(2 +βνpeλ) + 3

8(2 +βνpλe)

2₊3

8(2 +βνpeλ)

2₍_βp_e_λ₎

ν1/2(peλ)3/2β2

=CVν1/2(peλ)3/2β2, as expected.

Lemma 6.3. Suppose that the conditions of Theorem4.1hold. Let τ=_kT(n)

k2, andCT be as in (6.5c). Ifn <minnNqb{λe}, Nout{S(κ)}

o

, then the following statements hold.

1. T(n) _and_T(n+1) _{are well-defined.}

2. DefineET(n)(V(n))by E

T(n+1)₋_T(n)_F

n =β( ¯

ΛT(n)₋_T(n)_Λ_¯_{) +}_E(n)

T (V(n)). Then

(a) supV∈S(κ)kE (n)

T (V)k2≤CTν1/2(peλβ)2(1 +τ2)3/2;

(b) _kT(n+1)

−T(n)

k2≤ν1/2(peλβ)(1 +τ2) +CTν1/2(peλβ)2(1 +τ2)3/2.

3. DefineR_◦ by var_◦ T(n+1)

−T(n)

|Fn

=β2_H

◦+R◦. Then

(a) H◦= var◦ Y_¯Y¯T≤16ψ4H, whereH = [ηij](d−p)×pwithηij=λp+iλj fori= 1, . . . , d−

p,j= 1, . . . , p;

(b) kR◦k2≤(νpλβe )2τ(1 +11₂τ+τ2+1₄τ3) + 4CTν(pλβe )3(1 +τ2)5/2+ 2CT2ν(pλβe )4(1 +τ2)3.

Proof. For readability, we will drop the superscript “_·(n)_{”, and use the superscript “}

·+_{” to replace}

“_·(n+1)_{” for}_{V, R}_{, drop the superscript “}

·(n+1)_{” on}_{Y, Z}_{, and drop the conditional sign “}

|Fn” in the computation of E_{·},var(_·),cov(_·) with the understanding that they are conditional with respect toFn. Finally, for any expression or variableF, we define∆F :=F+

−F.

Consider item 1. Sincen < Nout{S(κ)}, we haveV ∈S(κ) andτ =kTk2≤(κ2−1)1/2. Thus,

kV¯−1_k2≤κandT =

¯

VV¯−1 is well-defined. Recall (6.6) and the partitioning

Y =

1

p Y¯

d−p ¯

Y

, R=

1

p R¯

d−p ¯

R

.

We have∆V¯ =β( ¯Y ZT₋_{(1 +} β

2YTY) ¯V ZZT) + ¯RZT, and

¯

R=−β

2

2 (Z

T_Z_{)(2 +}_βYT_Y_{) ¯}_Y ₊_ζβ2₍_ZT_Z_{) ¯}_{V Z}₊_ζβ3₍_ZT_Z₎2_{Y .}_¯

Noticing_kY¯_k2≤(peλ)1/2, we find

k∆V¯k2≤β(peλ) +β(1 +β

2νpeλ)peλ+CV(peλ)

2_β2

≤

2 + β

2νpλe+CVpeλβ

peλβ=C∆peλβ,

where C∆ is as in (6.5b). Thus k∆V¯V¯−1k2 ≤ k∆V¯k2kV¯−1k2 ≤C∆peλβκ ≤1/2 by (β-4). As a result, ¯V+ is nonsingular, and

k( ¯V+)−1k2≤ k

¯

V−1k2

1_{− k}V¯−1∆V¯_k2 ≤

(16)

In particular,T+₌

¯

V+_{( ¯}_V+₎₋1 _{is well-defined. This proves item}₁_.

For item2, using the Shermann-Morrison-Woodbury formula [5, p. 95], we get

∆T = ¯

V+( ¯V+)−1₋ ¯

VV¯−1

= ( ¯

V +∆

¯

V)( ¯V +∆V¯)−1−V_¯V¯−1

= ( ¯

V +∆

¯

V)( ¯V−1₋V¯−1∆V¯( ¯V +∆V¯)−1)₋ ¯

VV¯−1

=∆

¯

VV¯−1₋

¯

VV¯−1∆V¯( ¯V +∆V¯)−1₋∆

¯

VV¯−1∆V¯( ¯V +∆V¯)−1 =∆

¯

VV¯−1−_¯VV¯−1∆V¯( ¯V−1−V¯−1∆V¯( ¯V +∆V¯)−1)−∆

¯

VV¯−1∆V¯( ¯V +∆V¯)−1 =∆

¯

VV¯−1₋T ∆V¯V¯−1+T ∆V¯V¯−1∆V¯( ¯V+)−1₋∆

¯

VV¯−1∆V¯( ¯V+)−1 = [∆

¯

V ₋T ∆V¯][I₋( ¯V+)−1∆V¯] ¯V−1.

WriteTL=−T I andTR=

I T

, and thenTLV = 0 andV =TRV¯. Thus,

∆T =TL∆V[I−( ¯V+)−1∆V¯]VTTR. Since∆V is rank-1,∆T is also rank-1. By Lemma6.2,

∆T =TL

βY ZT

−β(1 +β 2Y

T_Y₎_{V ZZ}T₊_RZT

I₋( ¯V+)−1_∆_V_¯_VT_T

R

=TLβY YTV +RZT I−( ¯V+)−1∆V¯VTTR =TL(βY YT_{V V}T₊_R

T)TR

=TL(βY YT+RT)TR,

(6.7)

whereRT =RZTVT−(βY +R)ZT( ¯V+)−1∆V V¯ T.Note that

TLY YTTR= ¯

YY¯T−TY¯

¯

YTT−TY¯Y¯T+ ¯

Y

¯

YTT, (6.8)

and

E ¯

YY¯T = 0, ETY¯Y¯T =TEY¯Y¯T =TΛ,¯ (6.9a) ETY¯

¯

YTT =TEY¯

¯

YT T = 0, E ¯

Y

¯

YTT = E ¯

Y

¯

YT T = ¯

ΛT, (6.9b)

Thus, E{∆T}=β( ¯

ΛT −TΛ¯) +ET(V), whereET(V) = E{TLRTTR}. SinceV ∈S(κ),kTk2≤(κ2−1)1/2by (6.4). Thus

kRTk2≤ kRk2(peλ)1/2+ [(νpeλ)1/2β+kRk2](peλ)1/22(1 +kTk22)1/2C∆peλβ

≤CVν1/2(peλ)2β2+ (1 +kTk22)1/2[1 +CVpλβe ]2C∆ν1/2(peλ)2β2

≤CTν1/2(peλβ)2(1 +kTk22)1/2,

(6.10)

whereCT =CV + 2C∆(1 +CVpλβe ). Therefore,

kET(V)k2≤E{kTLRTTRk2} ≤(1 +kTk22) E{kRTk2}.

Item2(a)holds. For item2(b), we have

k∆T_k2≤(1 +kTk22)(βkY YTV VTk2+kRTk2)

≤β(νpeλ)1/2(pλe)1/2(1 +kTk2

2) +CTν1/2(pλβe )2(1 +kTk22)3/2

≤ν1/2peλβ(1 +kTk2

2) +CTν1/2(peλβ)2(1 +kTk22)3/2.

Now we turn to item3. We have

(17)

whereR◦,1= cov◦ TLY YTTR, TLRTTR, andR◦,2= var◦(TLRTTR). By (6.8),

var◦ TLY YTTR= var◦ Y_¯Y¯T+R◦,0, (6.12)

where

R_◦,0= var◦ TY¯_¯YTT

+ var_◦ TY¯Y¯T+ var_◦ ¯

Y

¯

YTT

−2 cov_◦ ¯

YY¯T, TY¯

¯

YTT₋2 cov_◦ ¯

YY¯T, TY¯Y¯T+ 2 cov_◦ ¯

YY¯T,

¯

Y

¯

YTT

+ 2 cov_◦ TY¯

¯

YTT, TY¯Y¯T₋2 cov_◦ TY¯

¯

YTT,

¯

Y

¯

YTT₋2 cov_◦ TY¯Y¯T,

¯

Y

¯

YTT.

Examine (6.11) and (6.12) together to get H_◦ = var_◦ ¯

YY¯T and R_◦ =β2_R

◦,0+ 2βR◦,1+R◦,2.

We note

Yj =eTjY =eTjΛ1/2Λ−1/2Y =λ

1/2

j eTjΛ−1/2Y,

eTi var◦ Y_¯Y¯T

ej= var(eTi _¯YY¯Tej) = var(Yp+iYj) = E

Yp2+iYj2 . By [20, (5.11)],

EYj4 =λ2jE n

(eTjΛ−1/2Y)4 o

≤16λ2jkeTjΛ−1/2Yk4ψ2 ≤16λ

2

jkΛ−1/2Yk4ψ2= 16λ

2

jψ4. Therefore

eTi var◦ _¯YY¯T

ej≤[E

Yp4+i E

Yj4 ]1/2≤16λp+iλjψ4, i.e., H◦ = var◦ _¯YY¯T

≤16ψ4H. This proves item 3(a). To show item3(b), first we bound the entrywise variance and covariance. For any matrices A1, A2, by Schur’s inequality (which was

generalized to all unitarily invariant norm in [6, Theorem 3.1]),

kA1◦A2k2≤ kA1k2kA2k2, (6.13)

we have

kcov_◦(A1, A2)k2=kE{A1◦A2} −E{A1} ◦E{A2}k2

≤E{kA1◦A2k2}+kE{A1} ◦E{A2}k2

≤E_{kA1k2kA2k2}+kE{A1}k2kE{A2}k2, (6.14a)

kvar_◦(A1)k2≤E

kA1k22 +kE{A1}k22. (6.14b)

Apply (6.14) toR◦,1 andR◦,2 to get

kR_◦,1k2≤2CTν(pλe)3β2(1 +kTk22)5/2, kR◦,2k2≤2CT2ν(peλβ)4(1 +kTk22)3, (6.15)

upon using

kTLY YTTRk2=kTLY YTV VTTRk2≤ν1/2(pλe)(1 +kTk22),

kTLRTTRk2≤CTν1/2(peλβ)2(1 +kTk22)3/2.

ForR_◦,0, by (6.9), we have

kcov◦ Y_¯Y¯T, TY¯_¯YTTk2≤Ek

¯

YY¯T_k22 kTk22,

kcov◦ _¯YY¯T, TY¯Y¯Tk2≤Ek

¯

YY¯T_k2kY¯Y¯Tk2 kTk2,

kcov◦ _¯YY¯T,_¯Y_¯YTTk2≤Ek

¯

YY¯T_k2k

¯

Y

¯

YT_k2 kTk2,

kcov◦ TY¯_¯YTT, TY¯Y¯Tk2≤Ek

¯

YY¯T_k2kY¯Y¯Tk2 kTk32,

kcov◦ TY¯_¯YTT,Y_¯_¯YTTk2≤Ek

¯

YY¯T_k2k

¯

Y

¯

YT_k2 kTk32,

kvar◦ TY¯_¯YTTk2≤Ek

¯

(18)

kvar◦ TY¯Y¯Tk2≤EkY¯Y¯Tk22 kTk22+kTΛ¯k22,

6.3 Quasi-power iteration process

DefineD(n+1)₌_T(n+1)

norm induced by the matrix norm_k·kui. Recursively,

(19)

Lemma 6.4. Suppose that the conditions of Theorem4.1hold. Ifn <minnNqb{eλ}, Nout{S(κβχ)}

o

andV(0)

∈S(κβχ₎_{, then for any} _χ

∈(5ε₋1/2,0]andκ >√2, we have

P_{Mn(χ+ 1/2)_{} ≥}1₋2dexp(₋Cκγκ−2ν−1p−2λ1−2β−2ε), (6.20)

whereCκ is as in (6.5d).

Proof. Sinceκ >√2, we haveκ2_β2χ _>_{2 and}_κβχ_<_[2(_κ2_β2χ₋_1)]1/2_{. Thus, by (}_β_-5_),

4CTν1/2κ3p2eλ2γ−1β1+3χ(κ2β2χ−1)−1/2β−1/2−χ≤4

√

2CTν1/2κ2p2λe2γ−1β1/2+χ≤1.

For any n < minnNqb{eλ}, Nout{S(κβχ)}

o , V(n)

∈ S(κβχ_{) and thus}

kT(n)

k2 ≤

p

κ2_β2χ₋_{1 by} (6.4). Therefore, by item2(b)of Lemma 6.3, we have

kD(n+1)_k2=

T(n+1)₋T(n)₋EnT(n+1)₋T(n)Fno

2

≤ kT(n+1)₋T(n)_k2+ E

n

kT(n+1)₋T(n)_k2

Fn

o

≤2ν1/2_p_e_λβ_{(1 +}

kT(n)

k2

2)[1 +CTpeλβ(1 +kT(n)k22)1/2]

≤2κ2_ν1/2_p_λβ_e 1+2χ_{[1 +}_C

Tκpeλβ1+χ].

(6.21)

For anyn <minnNqb{eλ}, Nout{S(κβχ)}

o ,

kJ3k2≤

n X

s=1

kLkn2−skE (s−1)

T (V

(s−1)₎

k2

≤CTν1/2κ3(peλ)2β2+3χ n X

s=1

(1₋βγ)n−s

≤CTν

1/2_κ3₍_p_e_λ₎2_β2+3χ

βγ =CTν

1/2_κ3₍_p_λ_e₎2_γ−1_β1+3χ

≤1₄(κ2β2χ−1)1/2β1/2+χ.

Similarly,

kJ2k2≤

n X

s=1

kLkn−s

2 kD(s)k2

≤ 2κ

2_ν1/2_p_λβ_e 2χ_{(1 +}_C

Tκpeλβ1+χ)

γ

≤ 2κ

2_ν1/2_p_e_λβ2χ

γ +

1 2(κ

2_β2χ

−1)1/2_β1/2+χ_.

Also,_kJ1k2≤ kLkn2kT(0)k2≤ kT(0)k2≤(κ2β2χ−1)1/2. For fixedn >0 andβ >0,

 

M

(n) 0 :=L

n_T(0)_{, M}(n)

t :=LnT(0)+

min{t,Nout_X{S(κ)}−1}

s=1

Ln−sD(s):t= 1, . . . , n

  

forms a martingale with respect toFt, because

EnkMt(n)k2

o

≤ kJ1k2+kJ2k2<+∞,

and

EnMt(+1n)−M (n)

t Ft

o

= EnLn−t−1_D(t+1)

Ft o

=Ln−t−1_En_D(t+1)

Ft o

(20)

Use the matrix version of Azuma’s inequality [18, Section 7.2] to get, for anyα >0,

Pn_kM(n)

n −M

(n) 0 k2≥α

o

≤2dexp(₋α

2

2σ2),

where

σ2=

min{n,Nout_X{S(κ)}−1}

s=1

kLn−sD(s)_k22

≤[2κ2_ν1/2_p_e_λβ1+2χ_{(1 +}_C

Tκpeλβ1+χ)]2

min{n,Nout_X{S(κ)}−1}

s=1

(1₋βγ)2(n−s)

≤4κ

4_νp2_e_λ2_β2+4χ_{(1 +}_C

Tκpλβe 1+χ)2

βγ[2−βγ]

≤4κ

4_νp2_e_λ2_γ−1_β1+4χ_{(1 +} CT

2C∆)

2

3₋√2 by (β-4),eλβ

1/2

≤₂_κpC1

∆

=Cσκ4νp2γ−1eλ2β1+4χ,

andCσ=(CT+2C∆)

2

(3−√2)C2

∆. Thus, noticingJ2=M

(n)

n −M0(n) forn≤Nout{S(κ)} −1, we have

P_{kJ2k2≥α} ≤2dexp

− α

2

2Cσκ4νp2γ−1eλ2β1+4χ

.

Choosing α = 1 4(κ

2_β2χ

−1)1/2_βχ+1/2−3ε _{and noticing}

kJ3k2 ≤ 1₄(κ2β2χ −1)1/2βχ+1/2−3ε and

T(n)

− Ln_T(0)₌_J

2+J3, we have

P_{Mn(χ+ 1/2)c_}= P

kT(n)_{− L}nT(0)_k2≥

1 2(κ

2_β2χ

−1)1/2βχ+1/2−3ε

≤P

kJ2k2≥

1 4(κ

2_β2χ

−1)1/2βχ+1/2−3ε

≤2dexp

− κ

2_β2χ

−1 32Cσκ4νp2γ−1eλ2β2χ

β−6ε

≤2dexp

− κ

2_β2χ 64Cσκ4νp2γ−1eλ2β2χ

β−6ε

= 2dexp(₋Cκγκ−2ν−1p−2λ−12β−2ε),

whereCκ=₆₄1_C_σ which is the same as in (6.5d).

Lemma 6.5. Suppose that the conditions of Theorem 4.1hold. If

N2−m₍₁₋₆_ε₎<min n

Nqb{λe}, Nout{S(κβχ)}

o

andV(0)_∈_S₍_β(1−21−m

)(3ε−1/2)_κ

m/2)with m≥2, then for κm>√2

P{Hm} ≥1−2dN2−m₍₁₋₆_ε₎exp(−C_κγκ−_m2ν−1p−2λ−₁2β−2ε),

whereHm= n

Nin

n

S(p3/2β(1−22−m

)(3ε−1/2)_κ_m)o_≤_N

2−m₍₁₋₆_ε₎ o

.

Proof. By the definition of the eventTn,

Tn(2−m[1−6ε] + 3ε) =nkT(n)k2≤(κ2m−β(1−2

1−m₎₍₁

(21)

Forn_≥N2−m₍₁₋₆_ε₎and V(0)∈S(β(1−2

1−m₎₍₃_ε

−1/2)_κ

m/2),

Mn(2−m(1−6ε) + 3ε)⊂Tn(2−m(1−6ε) + 3ε)

because

kT(n)_k2≤ kT(n)− LnT(0)k2+kLkn2kT(0)k2

≤1₂κ2m−β(1−2

1−m₎₍₁

−6ε)1/2_β(1−22−m₎₍₃_ε −1/2)

+β2−m(1−6ε)κ

2

m 4 −β

(1−21−m₎₍₁

−6ε)1/2_β(1−21−m₎₍₃_ε −1/2)

≤κ2m−β(1−2

1−m₎₍₁

−6ε)1/2_β(1−21−m₎₍₃_ε −1/2)_.

Therefore, noticing

κ2m−β(1−2

1−m₎₍₁

−6ε)1/2_β(1−22−m₎₍₃_ε

−1/2)₌_β(1−22−m₎₍₆_ε −1)_κ2

m−β2

1−m₍₁

−6ε)1/2

≤3₂β(1−22−m)(6ε−1)κ2m−1 1/2

,

we get

MN₂₋m(1−6ε)(2

−m₍₁

−6ε) + 3ε)_⊂nNin

n

S(p3/2β(1−22−m)(3ε−1/2)κm)o_≤N2−m₍₁₋₆_ε₎ o

=:Hm.

Since

\

n≤minnN2−m(1−6ε),Nin

n

S(√3/2β(1−22−m)(3ε−1/2)_κ_m₎o₋₁o

Mn(2−m₍₁

−6ε) + 3ε)_∩Hc_m

⊂ \

n≤N₂−m(1−6ε)

Mn(2−m(1₋6ε) + 3ε)_⊂MN2−m(1−6ε)(2

−m₍₁

−6ε) + 3ε),

we have

\

n≤minnN2−m(1−6ε),Nin

n

S(√3/2β(1−22−m)(3ε−1/2)_κ_m₎o₋₁o

Mn(2−m₍₁

−6ε) + 3ε)_⊂Hm.

By Lemma6.4withχ= 2−m₍₁

−6ε) + 3ε₋1 2 = 2−

m₍₁

−2m−1₎₍₁

−6ε), we get

P_{Hc_m_{} ≤}P     

[

n≤minnN₂−m(1−6ε),Nin

n

S(√3/2β(1−22−m)(3ε−1/2)_κ

m) o

−1o

Mn(2−m(1−6ε) + 3ε)c     

≤minnN2−m₍₁₋₆_ε₎, N_in n

S(p3/2β(1−22−m)(3ε−1/2)κm)o₋1o

×2dexp(₋Cκγκ−m2ν−1p−2λ−12β−2ε)

≤2dN2−m₍₁₋₆_ε₎exp(−C_κγκ_m−2ν−1p−2λ−₁2β−2ε), as expected.

Lemma 6.6. Suppose that the conditions of Theorem 4.1 hold. IfV(0)

∈S(κ/2) with κ >2√2, K > N1−6ε, then there exists a high-probability event H1∩QK = Tn∈[N1/2−3ε,K]Tn(1/2)∩QK

satisfying

(22)

such that for anyn_∈[N1−6ε, K],

EnT(n)_◦T(n); H1∩QK o

≤ L2nT(0)_◦T(0)+ 2β2[I_{− L}2]−1[I_{− L}2n]H_◦+RE,

where _kREk2 ≤ C◦κ4γ−1ν2(pλe)2β3/2−3ε, H◦ = var◦ _¯YY¯T

≤ 16ψ4_H _{is as in} _item _3(a) _of

Lemma6.3, and C◦ is as in (6.5f).

Proof. First we estimate the probability of the eventH1. We knowTn(1/2)⊂kT(n)k2≤(κ2−1)1/2 .

IfK≥Nout{S(κ)}, then there exists somen≤K, such thatV(n)∈/S(κ), i.e.,kT(n)k2>(κ2−1)1/2

by (6.4). Thus,

{K_≥Nout{S(κ)}} ⊂

[

n≤K n

kT(n)_k2>(κ2−1)1/2

o

⊂ [

n≤K

Tn(1/2)c.

On the other hand, forn≥N1/2−3ε andV(0)∈S(κ/2), Mn(1/2)⊂Tn(1/2) because

kT(n)

k2≤ kT(n)− LnT(0)k2+kLkn2kT(0)k2

≤ 1₂(κ2₋1)1/2β1/2−3ε+β1/2−3εκ

2

4 −1 1/2

≤(κ2

−1)1/2_β1/2−3ε_. _(6.22) Therefore,

\

n∈[N1/2−3ε,K]

Mn(1/2)_⊂ \ n∈[N1/2−3ε,K]

Tn(1/2)_{⊂ {}K_≤Nout{S(κ)} −1},

and so

\

n≤min{K,Nout{S(κ)}−1}

Mn(1/2)_⊂ \

n∈[N1/2−3ε,min{K,Nout{S(κ)}−1}]

Mn(1/2)

= \

n∈[N1/2−3ε,K]

Mn(1/2)

⊂ \

n∈[N1/2−3ε,K]

Tn(1/2)

=:H1.

By Lemma6.4withχ= 0, we have

P   

[

Mn(1/2)c∩QK   

≤min_{K, Nout{S(κ)} −1} ·2dexp(−Cκγκ−2ν−1p−2λ−12β−2ε)

= 2dKexp(₋Cκγκ−2ν−1p−2λ−12β−2ε).

Thus, by Lemma6.1,

P_{(H1∩QK)c}= P{Hc1∪QcK}

= P_{Hc₁_∩QK}+ P{QcK}

≤P   

[

Mn(1/2)c∩QK   + P{Q

c

K}

(23)

Next we estimate the expectation. Since

H1=

\

n∈[N1/2−3ε,K]

Tn(1/2)⊂ \

n∈[N1/2−3ε,K] n

1Tn−1D

(n)₌_D(n)o_,

we have forn_∈[N1/2−3ε, K]

T(n)1H1∩QK =1QK

Ln_T(0)₊

N1/2−3ε−1

X

s=1

Ln−s_D(s)₊

n X

s=N1/2−3ε

Ln−s_D(s)₁

Ts−1+

n X

s=1

Ln−s_E(s−1)

T (V

(s−1)₎

=:J1+J21+J22+J3.

In what follows, we simply writeE_T(n)=E_T(n)(V(n)_{) for convenience. Then,}

EnT(n)_◦T(n); H1∩QK o

= EnT(n)_◦T(n)1H1∩QK o

= E_{J1◦J1}+ 2 E{J1◦J21}+ 2 E{J1◦J22}+ 2 E{J1◦J3}

+ E{[J21+J22]◦[J21+J22]}+ 2 E{[J21+J22]◦J3}+ E{J3◦J3}

≤E_{J1◦J1}+ 2 E{J1◦J21}+ 2 E{J1◦J22}+ 2 E{J1◦J3}

+ 2 E_{J21◦J21}+ 4 E{J21◦J22}+ 2 E{J22◦J22}+ 2 E{J3◦J3}.

In the following, we estimate each summand above forn_∈[N1−6ε, K]. We have the following. 1. E_{J1◦J1}=L2nT(0)◦T(0).

2. E_{J1◦J21}=

N1/_X2−3ε−1

s=1

L2n−sT(0)_◦EnD(s)1QK o

= 0, because

EnD(s)1QK o

= EnEnD(s)1QK Fs−1

oo = 0.

3. E{J1◦J22}=

n X

s=N1/2−3ε

L2n−s_T(0)

◦EnD(s)1Ts−11QK o

= 0, becauseTs−1⊂Fs−1and so

EnD(s)1Ts−11QK o

= P{Ts−1}E

n

D(s)1QK Ts−1

o

= P_{Ts−1}E

n

EnD(s)₁

QK

Fs−1o Ts−1

o = 0.

4. E_{J1◦J3}=

n X

s=1

L2n−sT(0)_◦EnE_T(s−1)1QK o

. Recall (6.13). By item2(a)of Lemma6.3, we

have

kE_{J1◦J3}k2≤

n X

s=1

kLk22n−skT(0)k2kE

n

E_T(s−1)1QK o

k2

≤

n X

s=1

(1−βγ)2n−s(κ

2

4 −1)

1/2_C

Tν1/2(peλβ)2κ3

≤(1₋βγ)n(κ2−1)1/2CTν1/2p2eλ2β2κ3 2βγ

(24)

(25)

≤(1−βγ)

2(n+1−N1/2−3ε)

βγ[2−βγ] E{kR◦k2}

≤1₂−βγ −βγ

(1−βγ)n

βγ E{kR◦k2}

≤1₂β1−6εγ−1βκ4ν(peλ)2 29 + 8

√

2

32 ν+ 4CTκ(peλβ) + 2C

2

Tκ2(peλβ)2 !

≤ 29 + 8 √

2

64 + 2CTκ(pλβe ) +C

2

Tκ2(peλβ)2 !

γ−1κ4ν2(peλ)2β2−6ε.

7. For E_{J22◦J22}, we have

E_{J22◦J22}=

n X

s=N1/2−3ε

L2(n−s)EnD(s)1QK1Ts−1◦D

(s)₁

QK1Ts−1

o

=β2

n X

s=N1/2−3ε

L2(n−s)H_◦+ n X

s=N1/2−3ε

L2(n−s)ER_◦1Ts−1 ,

because fors₆=s′_,

EnD(s)1QK1Ts−1◦D

(s′₎

1QK1Ts′−1

o

= EnD(s)_◦D(s′)1QK1Ts−11Ts′−1

o

= P_{Ts−1∩Ts′₋₁}E

n

D(s)_◦D(s′)1QK

Ts−1∩Ts′₋₁

o

= P_{Ts−1∩Ts′₋1}E

n

EnD(max{s,s′})1QK

Fmax{s,s′_}−1

o

◦D(min{s,s′})Ts−1∩Ts′₋1

o

= 0,

and

EnD(s)₁

QK1Ts−1◦D

(s)₁

QK1Ts−1

o

= EnD(s)

◦D(s′₎

1QK1Ts−1

o

= P_{Ts−1}E

n EnD(s)

◦D(s)₁

QK

Fs−1o Ts−1

o

≤β2H_◦+ ER_◦1Ts−1 .

We have

kR◦1Ts−1k2≤

29 + 8√2 32 κ

3_ν2₍_p_e_λβ₎2_τ

s−1+ 4CTκ5ν(peλβ)3+ 2CT2κ6ν(peλβ)4

≤29 + 8 √

2 32 κ

3_ν2₍_p_e_λβ₎2₍_κ2

−1)1/2_β1/2−3ε_{+ 4}_C

Tκ5ν(peλβ)3+ 2CT2κ6ν(peλβ)4

≤29 + 8 √

2 32 κ

4_ν2₍_p_e_λβ₎2_β1/2−3ε_{+ 4}_C

Tκ5ν(peλβ)3+ 2CT2κ6ν(peλβ)4.

WriteE22:=

n X

s=N1/2−3ε

L2(n−s)ER_◦1Ts−1 for which we have

kE22k2≤

n X

s=N1/2−3ε

kLk2(2n−s)E

kR◦1Ts−1k2

≤_βγ_[21 −βγ]E

kR◦1Ts−1k2

≤ 1

3−√2γ

−1_κ4_ν₍_p_e_λ₎2_β 29 + 8

√

2 32 νβ

1/2−3ε_{+ 4}_C

(26)

≤ 1

Collecting all estimates together, we obtain

EnT(n)◦T(n); H1∩QK

6.4 Proof of Theorem

4.1

(27)

Writeκm= 6(1−m)/2κform= 1, . . . , M ≡M(ǫ). Sincedβ1−7ε≤ω, we know

φd1/2_≤φω1/2β7ε/2−1/2_≤β(1−21−M)(3ε−1/2)κM/2.

The key to our proof is to divide the whole process intoM segments of iterations. Thanks to the strong Markov property of the process, we can use the final value of current segment as the initial guess of the very next one. By Lemma6.5, after the first segment of

n1:= min

n

Nin

n

S(p3/2β(1−22−M

)(3ε−1/2)_κ 1)

o

, N2−M₍₁₋₆_ε₎ o

iterations,V(n1) _{lies in} _S₍p₃_/₂_β(1−22−M)(3ε−1/2)_κ

1) =S(β(1−2

2−M₎₍₃_ε

−1/2)_κ

2/2) with high

prob-ability, which will be a good initial guess for the second segment. In general, theith segment of iterations starts withV(ni−1) _{and ends with}_V(ni)_{, where}

ni= min (

Nin

n

S(β(1−2i+1−M)(3ε−1/2)κi+1/2)

o

,

& _M X

m=M+1−i e

N2−m₍₁₋₆_ε₎ ')

.

At the end of the (M ₋1)st segment of iterations, V(nM−1) _{is produced and it is going to be}

used as an initial guess for the last step, at which we can apply Lemma 6.6. Now nM−1 =

minnNin{S(κM/2)},Kb o

, where Kb = lPM_m₌₂Ne2−m₍₁₋₆_ε₎ m

= lNe(1−21−M₎₍₁_/₂₋₃_ε₎ m

. By 22−M

≥

ε/2

1/2−3ε ≥21−M, we have

N1/2−7ε/2=

l e

N1/2−7ε/2

m

≤Kb _≤lNe1/2−13ε/4

m

≤N1/2−13ε/4.

Let

e

Hm= n

Nin

n

S(p3/2β(1−22−m

)(3ε−1/2)_κ

M+1−m) o

≤Ne2−m₍₁₋₆_ε₎+n_M₋_m o

form= 2, . . . , M ,

e

H1=

\

n∈[N1/2−3ε,K−Nin{S(κM/2)}]

T_n₊_N_in_{S(κM/2)}(1/2),

H= M \

m=1

e

Hm∩QK,

wheren0= 0. We have

P{Hc_}= P

( _M

[

m=1

e

Hc_m_∪Qc_K

)

≤

M X

m=1

PnHec_m_∪Qc_Ko

≤

M X

m=2

2dN2−m₍₁₋₆_ε₎exp(−C_κγκ−2

M+1−mν−

1_p−2_λ−2 1 β−2ε)

+ 2dK₋

M X

m=2

N2−m₍₁₋₆_ε₎

exp(₋Cκγκ−M2ν−

1_p−2_λ−2 1 β−

2ε₎

+ eK(d+p) exp ₋Cψψ−1β−2ε

≤2dKexp(₋Cκγκ−2ν−1p−2λ−12β−2ε) + eK(d+p) exp −Cψψ−1β−2ε

≤2dKexp(₋Cκ4√2CTν−1/2βεβ−2ε) + eK(d+p) exp −Cψψ−1β−εβ−ε

by (β-5)

≤2dKexp(−4√2CTCκν−1/2β−ε) + eK(d+p) exp −Cψψ−1β−ε

≤K[(2 + e)d+ ep] exp(₋max_{Cνν−1/2, Cψψ−1}β−ε),

Mathematics authors titles recent submissions

arXiv:1711.06644v1 [math.OC] 17 Nov 2017

Nearly Optimal Stochastic Approximation for Online

Principal Subspace Estimation

Xin Liang

Zhen-Chen Guo

Ren-Cang Li

Wen-Wei Lin

November 20, 2017

Contents

1

Introduction

2

Canonical Angles

3

Online PCA for Principal Subspace

4

Main Results

5

Comparisons with Previous Results

6

Proof of Theorem

4.1

6.1

Simplification

6.2

Increments of One Iteration

6.3

Quasi-power iteration process

6.4

Proof of Theorem

4.1