
2.8 Covariance estimation

2.8.3 Proofs of the supporting lemmas

We now establish the lemmas used in the proof of Theorem 2.15.

Proof of Lemma 2.19. The probability that λ_k(X) overestimates λ_k(A) is controlled with the sequence of inequalities

P{ λ_k(X) ≥ λ_k(A) + t } = P{ inf_{W ∈ V_{p−k+1}^p} λ_max(W^*XW) ≥ λ_k(A) + t }
                         ≤ P{ λ_max(W_+^*XW_+) ≥ λ_k(A) + t }.

We use a related approach to study the probability that λ_k(X) underestimates λ_k(A). Our choice of W_− implies that

λ_{p−k+1}(−A) = −λ_k(A) = −λ_min(W_−^*AW_−) = λ_max(W_−^*(−A)W_−).

It follows that

P{ λ_k(X) ≤ λ_k(A) − t } = P{ λ_{p−k+1}(−X) ≥ λ_{p−k+1}(−A) + t }
                         = P{ inf_{W ∈ V_k^p} λ_max(W^*(−X)W) ≥ λ_max(W_−^*(−A)W_−) + t }
                         ≤ P{ λ_max(W_−^*(−X)W_−) ≥ λ_max(W_−^*(−A)W_−) + t }
                         ≤ P{ λ_max(W_−^*(A − X)W_−) ≥ t }.

The final inequality follows from the subadditivity of the maximum eigenvalue mapping.
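
The variational facts used in this proof are easy to check numerically. The sketch below (NumPy) is only an illustration: the test matrix and dimensions are arbitrary, and it assumes that W_+ and W_− are the Stiefel points spanning the invariant subspaces of A attaining the Courant–Fischer extrema, since the proof above refers to a choice of W_+ and W_− made outside this excerpt.

```python
import numpy as np

rng = np.random.default_rng(0)
p, k = 8, 3

# a random real symmetric test matrix
A = rng.standard_normal((p, p))
A = (A + A.T) / 2

# eigenvalues in decreasing order, with matching eigenvectors
evals, evecs = np.linalg.eigh(A)            # eigh returns ascending order
evals, evecs = evals[::-1], evecs[:, ::-1]
lam_k = evals[k - 1]                        # lambda_k(A), k is 1-based

W_plus = evecs[:, k - 1:]                   # spans the p-k+1 smallest eigenvalues
W_minus = evecs[:, :k]                      # spans the k largest eigenvalues

# Courant-Fischer: these choices attain the extrema
assert np.isclose(np.linalg.eigvalsh(W_plus.T @ A @ W_plus).max(), lam_k)
assert np.isclose(np.linalg.eigvalsh(W_minus.T @ A @ W_minus).min(), lam_k)

# any other Stiefel point only increases lambda_max(W^T A W)
for _ in range(100):
    W, _ = np.linalg.qr(rng.standard_normal((p, p - k + 1)))
    assert np.linalg.eigvalsh(W.T @ A @ W).max() >= lam_k - 1e-10
```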

Proof of Lemma 2.20. We begin by taking S to be the positive-semidefinite square root of G. Let S = UΛU^* be the eigenvalue decomposition of S, and let γ be a N(0, I_p) random variable. Recalling that G is the covariance matrix of ξ, we see that ξ and UΛγ are identically distributed (indeed, ξ ∼ Sγ = UΛU^*γ, and U^*γ ∼ γ). Thus,

E(ξξ^* − G)^2 = E(UΛγγ^*ΛU^* − G)^2 = UΛ E(γγ^*Λ^2γγ^*) ΛU^* − G^2.   (2.8.5)

Consider the (i,j) entry of the matrix being averaged:

[E(γγ^*Λ^2γγ^*)]_{ij} = Σ_k E(γ_i γ_j γ_k^2) λ_k^2.

When i ≠ j, this entry is zero because the entries of γ are independent and symmetric.

Furthermore, the (i,i) entry satisfies

[E(γγ^*Λ^2γγ^*)]_{ii} = E(γ_i^4) λ_i^2 + Σ_{k≠i} E(γ_k^2) λ_k^2 = 2λ_i^2 + tr(Λ^2).

We have shown

E(γγ^*Λ^2γγ^*) = 2Λ^2 + tr(G)·I.

This equality and (2.8.5) imply the desired result.
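
As a quick sanity check on the identity E(γγ^*Λ^2γγ^*) = 2Λ^2 + tr(G)·I, the following Monte Carlo sketch (NumPy) compares an empirical average against the closed form; the dimension, eigenvalues, and sample size are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
p, n_samples = 4, 200_000

lam = rng.uniform(0.5, 2.0, size=p)          # eigenvalues of S, so Lambda^2 carries the eigenvalues of G
Lambda2 = np.diag(lam**2)

Gamma = rng.standard_normal((n_samples, p))  # rows are iid N(0, I_p) vectors
quad = Gamma**2 @ lam**2                     # gamma^T Lambda^2 gamma, one value per sample
est = (Gamma * quad[:, None]).T @ Gamma / n_samples   # average of quad * gamma gamma^T

exact = 2 * Lambda2 + np.trace(Lambda2) * np.eye(p)
print(np.max(np.abs(est - exact)))           # should be small relative to the entries of `exact`
```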

Proof of Lemma 2.21. Factor the covariance matrix of ξ as G = UΛU^*, where U is orthogonal and Λ = diag(λ_1, . . . , λ_p) is the matrix of eigenvalues of G. Let γ be a N(0, I_p) random variable.

Then ξ and UΛ^{1/2}γ are identically distributed, so

E(ξξ^*)^m = E[ (ξ^*ξ)^{m−1} ξξ^* ]
          = E[ (γ^*Λγ)^{m−1} UΛ^{1/2}γγ^*Λ^{1/2}U^* ]
          = UΛ^{1/2} E[ (γ^*Λγ)^{m−1} γγ^* ] Λ^{1/2}U^*.   (2.8.6)

Consider the (i,j) entry of the bracketed matrix in (2.8.6):

E[ (γ^*Λγ)^{m−1} γ_i γ_j ] = E[ ( Σ_{ℓ=1}^p λ_ℓ γ_ℓ^2 )^{m−1} γ_i γ_j ].   (2.8.7)

From this expression, and the independence and symmetry of the Gaussian variables {γ_i}, we see that this matrix is diagonal.

To bound the diagonal entries, use a multinomial expansion to further develop the sum in (2.8.7) for the (i,i) entry:

E[ (γ^*Λγ)^{m−1} γ_i^2 ] = Σ_{ℓ_1+···+ℓ_p = m−1} (m−1 choose ℓ_1, . . . , ℓ_p) λ_1^{ℓ_1} ··· λ_p^{ℓ_p} E[ γ_1^{2ℓ_1} ··· γ_p^{2ℓ_p} γ_i^2 ].

Now we use the generalized AM–GM inequality to replace the expectation of the product of Gaussians with the 2m-th moment of a single standard Gaussian g. Denote the L_r norm of a random variable X by

‖X‖_{L_r} = (E|X|^r)^{1/r}.

Since ℓ_1, . . . , ℓ_p are nonnegative integers summing to m−1, the generalized AM–GM inequality justifies the first of the following inequalities:

E[ γ_1^{2ℓ_1} ··· γ_p^{2ℓ_p} γ_i^2 ]
    ≤ E[ ( (ℓ_1|γ_1| + ··· + ℓ_p|γ_p| + |γ_i|) / m )^{2m} ]
    = (1/m)^{2m} ‖ |γ_i| + Σ_{j=1}^p ℓ_j |γ_j| ‖_{L_{2m}}^{2m}
    ≤ (1/m)^{2m} ( ‖γ_i‖_{L_{2m}} + Σ_{j=1}^p ℓ_j ‖γ_j‖_{L_{2m}} )^{2m}
    = ( (1 + ℓ_1 + ··· + ℓ_p) / m )^{2m} ‖g‖_{L_{2m}}^{2m} = E(g^{2m}).

The second inequality is the triangle inequality for L_r norms. Now we reverse the multinomial expansion to see that the diagonal terms satisfy the inequality

E[ (γ^*Λγ)^{m−1} γ_i^2 ] ≤ Σ_{ℓ_1+···+ℓ_p = m−1} (m−1 choose ℓ_1, . . . , ℓ_p) λ_1^{ℓ_1} ··· λ_p^{ℓ_p} E(g^{2m})
    = (λ_1 + ··· + λ_p)^{m−1} E(g^{2m}) = tr(G)^{m−1} E(g^{2m}).   (2.8.8)

Estimate E(g^{2m}) using the fact that Γ(x) is increasing for x ≥ 3/2:

E(g^{2m}) = (2^m/√π) Γ(m + 1/2) < (2^m/√π) Γ(m + 1) = (2^m/√π) m!   for m ≥ 1.

Combine this result with (2.8.8) to see that

E[ (γ^*Λγ)^{m−1} γγ^* ] ≼ (2^m/√π) m! · tr(G)^{m−1} · I.

Complete the proof by using this estimate in (2.8.6).
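
Two ingredients of this proof are also easy to spot-check numerically: the Gaussian moment formula E(g^{2m}) = (2^m/√π) Γ(m + 1/2), which equals the double factorial (2m−1)!!, and the semidefinite bound on E[(γ^*Λγ)^{m−1}γγ^*]. The sketch below (NumPy) uses arbitrary illustrative choices of m, p, the eigenvalues, and the sample size.

```python
import numpy as np
from math import gamma, factorial, sqrt, pi

m = 3

# E(g^{2m}) = (2^m / sqrt(pi)) Gamma(m + 1/2) = (2m - 1)!!
moment = 2**m * gamma(m + 0.5) / sqrt(pi)
assert np.isclose(moment, np.prod(np.arange(1, 2 * m, 2)))   # (2m-1)!! = 1*3*...*(2m-1)

rng = np.random.default_rng(2)
p, n_samples = 4, 200_000
lam = rng.uniform(0.5, 2.0, size=p)          # eigenvalues of G
trace_G = lam.sum()

Gamma_ = rng.standard_normal((n_samples, p))
quad = (Gamma_**2 @ lam) ** (m - 1)          # (gamma^T Lambda gamma)^{m-1} per sample
est = (Gamma_ * quad[:, None]).T @ Gamma_ / n_samples   # approximates E[(...)^{m-1} gamma gamma^T]

bound = (2**m / sqrt(pi)) * factorial(m) * trace_G**(m - 1)
print(np.linalg.eigvalsh(est).max(), "<=", bound)        # the bound holds with room to spare
```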

Chapter 3

Randomized sparsification in NP-hard norms

Massive matrices are ubiquitous in modern data processing. Classical dense matrix algorithms are poorly suited to such problems because their running times scale superlinearly with the size of the matrix. When the dataset is sparse, one prefers to use sparse matrix algorithms, whose running times depend more on the sparsity of the matrix than on the size of the matrix. Of course, in many applications the matrix is not sparse. Accordingly, one may wonder whether it is possible to approximate a computation on a large dense matrix with a related computation on a sparse approximant to the matrix.

Let ‖·‖ be a norm on matrices. Here is one way to frame this challenge mathematically:

Given a matrix A, how can one efficiently generate a sparse matrix X for which the approximation error ‖A − X‖ is small?

The literature has concentrated on the behavior of the approximation error in the spectral and Frobenius norms; however, these norms are not always the most natural choice. Sometimes it is more appropriate to consider the matrix as an operator from a finite-dimensional ℓ_p space to a finite-dimensional ℓ_q space, and investigate the behavior of the approximation error in the associated p→q operator norm. As an example, the problem of graph sparsification is naturally posed as a question of preserving the so-called cut norm of a matrix associated with the graph. The strong equivalence of the cut norm and the ∞→1 norm suggests that, for graph-theoretic applications, it may be fruitful to consider the behavior of the ∞→1 norm under sparsification. In other applications, e.g., the column subset selection algorithm in [Tro09], the ∞→2 norm is the norm of interest.

This chapter investigates the errors incurred by approximating a fixed real matrix with a random matrix.¹ Our results apply to any scheme in which the entries of the approximating matrix are independent and average to the corresponding entries of the fixed matrix. Our main contribution is a bound on the expected ∞→p norm error, which we specialize to the case of the ∞→1 and ∞→2 norms. We also use a result of Latała [Lat05] to bound the expected spectral approximation error, and we establish the subgaussianity of the spectral approximation error.

Our methods are similar to those of Rudelson and Vershynin in [RV07] in that we treat A as a linear operator between finite-dimensional Banach spaces and use some of the same tools of probability in Banach spaces. Whereas Rudelson and Vershynin consider the behavior of the norms of random submatrices of A, we consider the behavior of the norms of matrices formed by randomly sparsifying (or quantizing) the entries of A. This yields error bounds applicable to schemes that sparsify or quantize matrices entrywise. Since some graph algorithms depend more on the number of edges in the graph than the number of vertices, such schemes may be useful in developing algorithms for handling large graphs. In particular, the algorithm of [BSS09] is not suitable for sparsifying graphs with a large number of vertices. Part of our motivation for investigating the ∞→1 approximation error is the belief that the equivalence of the cut norm with the ∞→1 norm means that matrix sparsification in the ∞→1 norm might be useful for efficiently constructing optimal sparsifiers for such graphs.
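
To make the setting concrete, here is a minimal sketch (NumPy) of one scheme of the kind just described: each entry of the random matrix X is independent, and E x_jk = a_jk. The particular recipe below, which keeps an entry with probability proportional to its magnitude and rescales the survivors, is only an illustration and is not necessarily the scheme analyzed later in the chapter; the function name and the target_fraction parameter are ours.

```python
import numpy as np

def sparsify(A, target_fraction=0.1, rng=None):
    """Random sparse X with independent entries and E[X] = A (illustrative recipe)."""
    rng = np.random.default_rng() if rng is None else rng
    # keep-probabilities proportional to |a_jk|, aiming at the requested density
    probs = np.abs(A) / np.abs(A).sum() * target_fraction * A.size
    probs = np.clip(probs, 0.0, 1.0)
    keep = rng.random(A.shape) < probs
    X = np.zeros_like(A, dtype=float)
    X[keep] = A[keep] / probs[keep]          # rescale so each surviving entry is unbiased
    return X

rng = np.random.default_rng(3)
A = rng.standard_normal((200, 200))
X = sparsify(A, target_fraction=0.1, rng=rng)
print((X != 0).mean())                       # roughly 0.1 of the entries survive
```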

¹ The content of this chapter is adapted from the technical report [GT11] co-authored with Joel Tropp.

3.1 Notation

We establish the notation particular to this chapter.

All quantities are real. For 1 ≤ p ≤ ∞, the ℓ_p^n norm of x ∈ R^n is written as ‖x‖_p. Each space ℓ_p^n has an associated dual space ℓ_{p'}^n, where p', the conjugate exponent to p, is determined by the relation p^{−1} + (p')^{−1} = 1. The dual space of ℓ_1^n (respectively, ℓ_∞^n) is ℓ_∞^n (respectively, ℓ_1^n).

The kth column of the matrix A is denoted by A^{(k)}, and the (j,k)th element is denoted by a_jk. We treat A as an operator from ℓ_p^n to ℓ_q^m, and the p→q operator norm of A is written as ‖A‖_{p→q}. The spectral norm, i.e., the 2→2 operator norm, is written ‖A‖_2. Recall that, given an operator A : ℓ_p^n → ℓ_q^m, the associated adjoint operator (A^T, in the case of a matrix) maps from ℓ_{q'}^m to ℓ_{p'}^n. Further, the p→q and q'→p' norms are dual in the sense that

‖A‖_{p→q} = ‖A^T‖_{q'→p'}.

This chapter is concerned primarily with the spectral norm and the ∞→1 and ∞→2 norms.

The ∞→1 and ∞→2 norms are not unitarily invariant, so do not have simple interpretations in terms of singular values; in fact, they are NP-hard to compute for general matrices [Roh00]. We remark that ‖A‖_{∞→1} = ‖Ax‖_1 and ‖A‖_{∞→2} = ‖Ay‖_2 for certain vectors x and y whose components take values ±1. An additional operator norm, the 2→∞ norm, is of interest: it is the largest ℓ_2 norm achieved by a row of A. In the sequel we also encounter the column norm

‖A‖_col = Σ_k ‖A^{(k)}‖_2.
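
For a small matrix, all of these norms can be evaluated directly. The brute-force sketch below (NumPy) illustrates the definitions and the duality relation; the matrix and dimensions are arbitrary, and exhaustive enumeration over sign vectors is only feasible because the matrix is tiny, in line with the NP-hardness remark above.

```python
import itertools
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((5, 4))
sign_vectors = [np.array(s) for s in itertools.product((-1.0, 1.0), repeat=A.shape[1])]

norm_inf_to_1 = max(np.abs(A @ x).sum() for x in sign_vectors)      # ||A||_{inf->1}
norm_inf_to_2 = max(np.linalg.norm(A @ x) for x in sign_vectors)    # ||A||_{inf->2}
norm_2_to_inf = np.linalg.norm(A, axis=1).max()                     # largest row l2 norm
norm_col = np.linalg.norm(A, axis=0).sum()                          # sum of column l2 norms

# duality: ||A||_{2->inf} = ||A^T||_{1->2}, the largest column l2 norm of A^T
assert np.isclose(norm_2_to_inf, np.linalg.norm(A.T, axis=0).max())
print(norm_inf_to_1, norm_inf_to_2, norm_2_to_inf, norm_col)
```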

The variance of X is written Var X = E(X − EX)^2. The expectation taken with respect to one variable X, with all others fixed, is written E_X. The expression X ∼ Y indicates that the random variables X and Y are identically distributed. Given a random variable X, the symbol X' denotes a random variable independent of X such that X' ∼ X. The indicator variable of the event X > Y is written 1_{X>Y}. The Bernoulli distribution with expectation p is written Bern(p), and the binomial distribution of n independent trials, each with success probability p, is written Bin(n, p). We write X ∼ Bern(p) to indicate that X is Bernoulli with mean p.

3.1.0.1 Graph sparsification

Graphs are often represented and fruitfully manipulated in terms of matrices, so the problems of graph sparsification and matrix sparsification are strongly related. We now introduce the relevant notation before surveying the literature.

Let G = (V, E, ω) be a weighted simple undirected graph with n vertices, m edges, and adjacency matrix A given by

a_jk = ω_jk if (j,k) ∈ E, and a_jk = 0 otherwise.

Orient the edges of G in an arbitrary manner. Then define the corresponding 2m×n oriented incidence matrix B in the following manner: b_{2i−1,j} = b_{2i,k} = ω_i and b_{2i−1,k} = b_{2i,j} = −ω_i if edge i is oriented from vertex j to vertex k, and all other entries of B are identically zero.

A cut is a partition of the vertices of G into two blocks: V = S ∪ S^c. The cost of a cut is the sum of the weights of all edges in E which have one vertex in S and one vertex in S^c. Several problems relating to cuts are of considerable practical interest. In particular, the MAXCUT problem, to determine the cut of maximum cost in a graph, is common in computer science applications.

The cuts of maximum cost are exactly those that correspond to the cut-norm of the oriented incidence matrix B, which is defined as

‖B‖_C = max_{I ⊂ {1,...,2m}, J ⊂ {1,...,n}} | Σ_{i∈I, j∈J} b_ij |.

Finding the cut-norm of a general matrix is NP-hard, but in [AN04], the authors offer a randomized polynomial-time algorithm which finds a submatrix B̃ of B for which |Σ_{jk} b̃_jk| ≥ 0.56 ‖B‖_C. This algorithm thereby gives a feasible means of approximating the MAXCUT value for arbitrary graphs. A crucial point in the derivation of the algorithm is the fact that for general matrices the ∞→1 operator norm is strongly equivalent with the cut-norm:

‖A‖_C ≤ ‖A‖_{∞→1} ≤ 4‖A‖_C;

in fact, in the particular case of oriented incidence matrices, ‖B‖_C = (1/4)‖B‖_{∞→1}.
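
The incidence-matrix construction and the norm equivalence can be checked directly on a toy graph. In the sketch below (NumPy), the graph, weights, and orientation are arbitrary, and the helper functions cut_norm, norm_inf_to_1, and max_cut are ad hoc brute-force routines written for this illustration, viable only at this scale. The example also exhibits the cut-norm agreeing with the maximum cut cost and the upper bound ‖B‖_{∞→1} ≤ 4‖B‖_C being attained.

```python
import itertools
import numpy as np

edges = [(0, 1, 1.0), (1, 2, 2.0), (2, 3, 0.5), (0, 2, 1.5)]    # (j, k, weight), 0-based vertices
n, m = 4, len(edges)

# oriented incidence matrix as defined above (edge i oriented from j to k)
B = np.zeros((2 * m, n))
for i, (j, k, w) in enumerate(edges):
    B[2 * i, j] = B[2 * i + 1, k] = w
    B[2 * i, k] = B[2 * i + 1, j] = -w

def cut_norm(M):
    """max over row subsets I and column subsets J of |sum_{I x J} m_ij|."""
    best = 0.0
    for I in itertools.product((0.0, 1.0), repeat=M.shape[0]):
        row_sum = np.array(I) @ M
        for J in itertools.product((0.0, 1.0), repeat=M.shape[1]):
            best = max(best, abs(row_sum @ np.array(J)))
    return best

def norm_inf_to_1(M):
    """max over +-1 vectors x of ||Mx||_1."""
    return max(np.abs(M @ np.array(x)).sum()
               for x in itertools.product((-1.0, 1.0), repeat=M.shape[1]))

def max_cut(edges, n):
    """largest total weight of edges crossing a vertex bipartition."""
    return max(sum(w for j, k, w in edges if S[j] != S[k])
               for S in itertools.product((0, 1), repeat=n))

c, i1 = cut_norm(B), norm_inf_to_1(B)
assert np.isclose(c, max_cut(edges, n))     # cut-norm of B equals the maximum cut cost
assert c <= i1 <= 4 * c + 1e-9              # the general equivalence
assert np.isclose(i1, 4 * c)                # for incidence matrices the upper bound is attained
print(c, i1)
```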

In his thesis [Kar95] and the sequence of papers [Kar94a, Kar94b, Kar96], Karger introduces the idea of random sampling to increase the efficiency of calculations with graphs, with a focus on cuts. In [Kar96], he shows that by picking each edge of the graph with a probability inversely proportional to the density of edges in a neighborhood of that edge, one can construct a sparsifier, i.e., a graph with the same vertex set and significantly fewer edges that preserves the value of each cut to within a factor of (1 ± ε).

In [SS08], Spielman and Srivastava improve upon this sampling scheme, instead keeping an edge with probability proportional to its effective resistance, a measure of how likely it is to appear in a random spanning tree of the graph. They provide an algorithm which produces a sparsifier with O(n log n/ε²) edges, where n is the number of vertices in the graph. They obtain this result by reducing the problem to the behavior of projection matrices Π_G and Π_{G'} associated with the original graph and the sparsifier, and then appealing to a spectral-norm concentration result.

The log n factor in [SS08] seems to be an unavoidable consequence of using spectral-norm concentration. In [BSS09], Batson et al. prove that the log n factor is not intrinsic: they establish that every graph has a sparsifier with O(n) edges. The existence proof is constructive and provides a deterministic algorithm for constructing such sparsifiers in O(n³m) time, where m is the number of edges in the original graph.
