
2.8 Covariance estimation

2.8.3 Proofs of the supporting lemmas

We now establish the lemmas used in the proof of Theorem 2.15.

Proof of Lemma 2.19. The probability that λ_k(X) overestimates λ_k(A) is controlled with the sequence of inequalities

P{ λ_k(X) ≥ λ_k(A) + t } = P{ inf_{W ∈ V_{p−k+1}^p} λ_max(W^*XW) ≥ λ_k(A) + t }
                         ≤ P{ λ_max(W_+^*XW_+) ≥ λ_k(A) + t }.

We use a related approach to study the probability that λ_k(X) underestimates λ_k(A). Our choice of W_− implies that

λ_{p−k+1}(−A) = −λ_k(A) = −λ_min(W_−^*AW_−) = λ_max(W_−^*(−A)W_−).

It follows that

P{ λ_k(X) ≤ λ_k(A) − t } = P{ λ_{p−k+1}(−X) ≥ λ_{p−k+1}(−A) + t }
                         = P{ inf_{W ∈ V_k^p} λ_max(W^*(−X)W) ≥ λ_max(W_−^*(−A)W_−) + t }
                         ≤ P{ λ_max(W_−^*(−X)W_−) ≥ λ_max(W_−^*(−A)W_−) + t }
                         ≤ P{ λ_max(W_−^*(A − X)W_−) ≥ t }.

The final inequality follows from the subadditivity of the maximum eigenvalue mapping.
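
The variational facts used in this proof are easy to check numerically. The sketch below (NumPy) is only an illustration: the test matrix and dimensions are arbitrary, and it assumes that W_+ and W_− are the Stiefel points spanning the invariant subspaces of A attaining the Courant–Fischer extrema, since the proof above refers to a choice of W_+ and W_− made outside this excerpt.

```python
import numpy as np

rng = np.random.default_rng(0)
p, k = 8, 3

# a random real symmetric test matrix
A = rng.standard_normal((p, p))
A = (A + A.T) / 2

# eigenvalues in decreasing order, with matching eigenvectors
evals, evecs = np.linalg.eigh(A)            # eigh returns ascending order
evals, evecs = evals[::-1], evecs[:, ::-1]
lam_k = evals[k - 1]                        # lambda_k(A), k is 1-based

W_plus = evecs[:, k - 1:]                   # spans the p-k+1 smallest eigenvalues
W_minus = evecs[:, :k]                      # spans the k largest eigenvalues

# Courant-Fischer: these choices attain the extrema
assert np.isclose(np.linalg.eigvalsh(W_plus.T @ A @ W_plus).max(), lam_k)
assert np.isclose(np.linalg.eigvalsh(W_minus.T @ A @ W_minus).min(), lam_k)

# any other Stiefel point only increases lambda_max(W^T A W)
for _ in range(100):
    W, _ = np.linalg.qr(rng.standard_normal((p, p - k + 1)))
    assert np.linalg.eigvalsh(W.T @ A @ W).max() >= lam_k - 1e-10
```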

Proof of Lemma 2.20. We begin by taking S to be the positive-semidefinite square root of G. Let S = UΛU^* be the eigenvalue decomposition of S, and let γ be a N(0, I_p) random variable. Recalling that G is the covariance matrix of ξ, we see that ξ and UΛγ are identically distributed (indeed, ξ ∼ Sγ = UΛU^*γ, and U^*γ ∼ γ). Thus,

E(ξξ^* − G)^2 = E(UΛγγ^*ΛU^* − G)^2 = UΛ E(γγ^*Λ^2γγ^*) ΛU^* − G^2.   (2.8.5)

Consider the (i,j) entry of the matrix being averaged:

[E(γγ^*Λ^2γγ^*)]_{ij} = Σ_k E(γ_i γ_j γ_k^2) λ_k^2.

When i ≠ j, this entry is zero because the entries of γ are independent and symmetric.

Furthermore, the (i,i) entry satisfies

[E(γγ^*Λ^2γγ^*)]_{ii} = E(γ_i^4) λ_i^2 + Σ_{k≠i} E(γ_k^2) λ_k^2 = 2λ_i^2 + tr(Λ^2).

We have shown

E(γγ^*Λ^2γγ^*) = 2Λ^2 + tr(G)·I.

This equality and (2.8.5) imply the desired result.
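
As a quick sanity check on the identity E(γγ^*Λ^2γγ^*) = 2Λ^2 + tr(G)·I, the following Monte Carlo sketch (NumPy) compares an empirical average against the closed form; the dimension, eigenvalues, and sample size are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
p, n_samples = 4, 200_000

lam = rng.uniform(0.5, 2.0, size=p)          # eigenvalues of S, so Lambda^2 carries the eigenvalues of G
Lambda2 = np.diag(lam**2)

Gamma = rng.standard_normal((n_samples, p))  # rows are iid N(0, I_p) vectors
quad = Gamma**2 @ lam**2                     # gamma^T Lambda^2 gamma, one value per sample
est = (Gamma * quad[:, None]).T @ Gamma / n_samples   # average of quad * gamma gamma^T

exact = 2 * Lambda2 + np.trace(Lambda2) * np.eye(p)
print(np.max(np.abs(est - exact)))           # should be small relative to the entries of `exact`
```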

Proof of Lemma 2.21. Factor the covariance matrix of ξ as G = UΛU^*, where U is orthogonal and Λ = diag(λ_1, . . . , λ_p) is the matrix of eigenvalues of G. Let γ be a N(0, I_p) random variable.

Then ξ and UΛ^{1/2}γ are identically distributed, so

E(ξξ^*)^m = E[ (ξ^*ξ)^{m−1} ξξ^* ]
          = E[ (γ^*Λγ)^{m−1} UΛ^{1/2}γγ^*Λ^{1/2}U^* ]
          = UΛ^{1/2} E[ (γ^*Λγ)^{m−1} γγ^* ] Λ^{1/2}U^*.   (2.8.6)

Consider the (i,j) entry of the bracketed matrix in (2.8.6):

E[ (γ^*Λγ)^{m−1} γ_i γ_j ] = E[ ( Σ_{ℓ=1}^p λ_ℓ γ_ℓ^2 )^{m−1} γ_i γ_j ].   (2.8.7)

From this expression, and the independence and symmetry of the Gaussian variables {γ_i}, we see that this matrix is diagonal.

To bound the diagonal entries, use a multinomial expansion to further develop the sum in (2.8.7) for the (i,i) entry:

E[ (γ^*Λγ)^{m−1} γ_i^2 ] = Σ_{ℓ_1+···+ℓ_p = m−1} (m−1 choose ℓ_1, . . . , ℓ_p) λ_1^{ℓ_1} ··· λ_p^{ℓ_p} E[ γ_1^{2ℓ_1} ··· γ_p^{2ℓ_p} γ_i^2 ].

Now we use the generalized AM–GM inequality to replace the expectation of the product of Gaussians with the 2m-th moment of a single standard Gaussian g. Denote the L_r norm of a random variable X by

‖X‖_{L_r} = (E|X|^r)^{1/r}.

Since ℓ_1, . . . , ℓ_p are nonnegative integers summing to m−1, the generalized AM–GM inequality justifies the first of the following inequalities:

E[ γ_1^{2ℓ_1} ··· γ_p^{2ℓ_p} γ_i^2 ]
    ≤ E[ ( (ℓ_1|γ_1| + ··· + ℓ_p|γ_p| + |γ_i|) / m )^{2m} ]
    = (1/m)^{2m} ‖ |γ_i| + Σ_{j=1}^p ℓ_j |γ_j| ‖_{L_{2m}}^{2m}
    ≤ (1/m)^{2m} ( ‖γ_i‖_{L_{2m}} + Σ_{j=1}^p ℓ_j ‖γ_j‖_{L_{2m}} )^{2m}
    = ( (1 + ℓ_1 + ··· + ℓ_p) / m )^{2m} ‖g‖_{L_{2m}}^{2m} = E(g^{2m}).

The second inequality is the triangle inequality for L_r norms. Now we reverse the multinomial expansion to see that the diagonal terms satisfy the inequality

E[ (γ^*Λγ)^{m−1} γ_i^2 ] ≤ Σ_{ℓ_1+···+ℓ_p = m−1} (m−1 choose ℓ_1, . . . , ℓ_p) λ_1^{ℓ_1} ··· λ_p^{ℓ_p} E(g^{2m})
    = (λ_1 + ··· + λ_p)^{m−1} E(g^{2m}) = tr(G)^{m−1} E(g^{2m}).   (2.8.8)

Estimate E(g^{2m}) using the fact that Γ(x) is increasing for x ≥ 3/2:

E(g^{2m}) = (2^m/√π) Γ(m + 1/2) < (2^m/√π) Γ(m + 1) = (2^m/√π) m!   for m ≥ 1.

Combine this result with (2.8.8) to see that

E[ (γ^*Λγ)^{m−1} γγ^* ] ≼ (2^m/√π) m! · tr(G)^{m−1} · I.

Complete the proof by using this estimate in (2.8.6).
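
Two ingredients of this proof are also easy to spot-check numerically: the Gaussian moment formula E(g^{2m}) = (2^m/√π) Γ(m + 1/2), which equals the double factorial (2m−1)!!, and the semidefinite bound on E[(γ^*Λγ)^{m−1}γγ^*]. The sketch below (NumPy) uses arbitrary illustrative choices of m, p, the eigenvalues, and the sample size.

```python
import numpy as np
from math import gamma, factorial, sqrt, pi

m = 3

# E(g^{2m}) = (2^m / sqrt(pi)) Gamma(m + 1/2) = (2m - 1)!!
moment = 2**m * gamma(m + 0.5) / sqrt(pi)
assert np.isclose(moment, np.prod(np.arange(1, 2 * m, 2)))   # (2m-1)!! = 1*3*...*(2m-1)

rng = np.random.default_rng(2)
p, n_samples = 4, 200_000
lam = rng.uniform(0.5, 2.0, size=p)          # eigenvalues of G
trace_G = lam.sum()

Gamma_ = rng.standard_normal((n_samples, p))
quad = (Gamma_**2 @ lam) ** (m - 1)          # (gamma^T Lambda gamma)^{m-1} per sample
est = (Gamma_ * quad[:, None]).T @ Gamma_ / n_samples   # approximates E[(...)^{m-1} gamma gamma^T]

bound = (2**m / sqrt(pi)) * factorial(m) * trace_G**(m - 1)
print(np.linalg.eigvalsh(est).max(), "<=", bound)        # the bound holds with room to spare
```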

Chapter 3

Randomized sparsification in NP-hard norms

Massive matrices are ubiquitous in modern data processing. Classical dense matrix algorithms are poorly suited to such problems because their running times scale superlinearly with the size of the matrix. When the dataset is sparse, one prefers to use sparse matrix algorithms, whose running times depend more on the sparsity of the matrix than on the size of the matrix. Of course, in many applications the matrix is not sparse. Accordingly, one may wonder whether it is possible to approximate a computation on a large dense matrix with a related computation on a sparse approximant to the matrix.

Let ‖·‖ be a norm on matrices. Here is one way to frame this challenge mathematically:

Given a matrix A, how can one efficiently generate a sparse matrix X for which the approximation error ‖A − X‖ is small?

The literature has concentrated on the behavior of the approximation error in the spectral and Frobenius norms; however, these norms are not always the most natural choice. Sometimes it is more appropriate to consider the matrix as an operator from a finite-dimensional ℓ_p space to a finite-dimensional ℓ_q space, and investigate the behavior of the approximation error in the associated p→q operator norm. As an example, the problem of graph sparsification is naturally posed as a question of preserving the so-called cut norm of a matrix associated with the graph. The strong equivalence of the cut norm and the ∞→1 norm suggests that, for graph-theoretic applications, it may be fruitful to consider the behavior of the ∞→1 norm under sparsification. In other applications, e.g., the column subset selection algorithm in [Tro09], the ∞→2 norm is the norm of interest.

This chapter investigates the errors incurred by approximating a fixed real matrix with a random matrix.¹ Our results apply to any scheme in which the entries of the approximating matrix are independent and average to the corresponding entries of the fixed matrix. Our main contribution is a bound on the expected ∞→p norm error, which we specialize to the case of the ∞→1 and ∞→2 norms. We also use a result of Latała [Lat05] to bound the expected spectral approximation error, and we establish the subgaussianity of the spectral approximation error.

Our methods are similar to those of Rudelson and Vershynin in [RV07] in that we treat A as a linear operator between finite-dimensional Banach spaces and use some of the same tools of probability in Banach spaces. Whereas Rudelson and Vershynin consider the behavior of the norms of random submatrices of A, we consider the behavior of the norms of matrices formed by randomly sparsifying (or quantizing) the entries of A. This yields error bounds applicable to schemes that sparsify or quantize matrices entrywise. Since some graph algorithms depend more on the number of edges in the graph than the number of vertices, such schemes may be useful in developing algorithms for handling large graphs. In particular, the algorithm of [BSS09] is not suitable for sparsifying graphs with a large number of vertices. Part of our motivation for investigating the ∞→1 approximation error is the belief that the equivalence of the cut norm with the ∞→1 norm means that matrix sparsification in the ∞→1 norm might be useful for efficiently constructing optimal sparsifiers for such graphs.
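
To make the setting concrete, here is a minimal sketch (NumPy) of one scheme of the kind just described: each entry of the random matrix X is independent, and E x_jk = a_jk. The particular recipe below, which keeps an entry with probability proportional to its magnitude and rescales the survivors, is only an illustration and is not necessarily the scheme analyzed later in the chapter; the function name and the target_fraction parameter are ours.

```python
import numpy as np

def sparsify(A, target_fraction=0.1, rng=None):
    """Random sparse X with independent entries and E[X] = A (illustrative recipe)."""
    rng = np.random.default_rng() if rng is None else rng
    # keep-probabilities proportional to |a_jk|, aiming at the requested density
    probs = np.abs(A) / np.abs(A).sum() * target_fraction * A.size
    probs = np.clip(probs, 0.0, 1.0)
    keep = rng.random(A.shape) < probs
    X = np.zeros_like(A, dtype=float)
    X[keep] = A[keep] / probs[keep]          # rescale so each surviving entry is unbiased
    return X

rng = np.random.default_rng(3)
A = rng.standard_normal((200, 200))
X = sparsify(A, target_fraction=0.1, rng=rng)
print((X != 0).mean())                       # roughly 0.1 of the entries survive
```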

¹ The content of this chapter is adapted from the technical report [GT11] co-authored with Joel Tropp.

3.1 Notation

We establish the notation particular to this chapter.

All quantities are real. For 1 ≤ p ≤ ∞, the ℓ_p^n norm of x ∈ R^n is written as ‖x‖_p. Each space ℓ_p^n has an associated dual space ℓ_{p'}^n, where p', the conjugate exponent to p, is determined by the relation p^{−1} + (p')^{−1} = 1. The dual space of ℓ_1^n (respectively, ℓ_∞^n) is ℓ_∞^n (respectively, ℓ_1^n).

The kth column of the matrix A is denoted by A^{(k)}, and the (j,k)th element is denoted by a_jk. We treat A as an operator from ℓ_p^n to ℓ_q^m, and the p→q operator norm of A is written as ‖A‖_{p→q}. The spectral norm, i.e., the 2→2 operator norm, is written ‖A‖_2. Recall that, given an operator A : ℓ_p^n → ℓ_q^m, the associated adjoint operator (A^T, in the case of a matrix) maps from ℓ_{q'}^m to ℓ_{p'}^n. Further, the p→q and q'→p' norms are dual in the sense that

‖A‖_{p→q} = ‖A^T‖_{q'→p'}.

This chapter is concerned primarily with the spectral norm and the ∞→1 and ∞→2 norms.

The ∞→1 and ∞→2 norms are not unitarily invariant, so do not have simple interpretations in terms of singular values; in fact, they are NP-hard to compute for general matrices [Roh00]. We remark that ‖A‖_{∞→1} = ‖Ax‖_1 and ‖A‖_{∞→2} = ‖Ay‖_2 for certain vectors x and y whose components take values ±1. An additional operator norm, the 2→∞ norm, is of interest: it is the largest ℓ_2 norm achieved by a row of A. In the sequel we also encounter the column norm

‖A‖_col = Σ_k ‖A^{(k)}‖_2.
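
For a small matrix, all of these norms can be evaluated directly. The brute-force sketch below (NumPy) illustrates the definitions and the duality relation; the matrix and dimensions are arbitrary, and exhaustive enumeration over sign vectors is only feasible because the matrix is tiny, in line with the NP-hardness remark above.

```python
import itertools
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((5, 4))
sign_vectors = [np.array(s) for s in itertools.product((-1.0, 1.0), repeat=A.shape[1])]

norm_inf_to_1 = max(np.abs(A @ x).sum() for x in sign_vectors)      # ||A||_{inf->1}
norm_inf_to_2 = max(np.linalg.norm(A @ x) for x in sign_vectors)    # ||A||_{inf->2}
norm_2_to_inf = np.linalg.norm(A, axis=1).max()                     # largest row l2 norm
norm_col = np.linalg.norm(A, axis=0).sum()                          # sum of column l2 norms

# duality: ||A||_{2->inf} = ||A^T||_{1->2}, the largest column l2 norm of A^T
assert np.isclose(norm_2_to_inf, np.linalg.norm(A.T, axis=0).max())
print(norm_inf_to_1, norm_inf_to_2, norm_2_to_inf, norm_col)
```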

The variance of X is written Var X = E(X − EX)^2. The expectation taken with respect to one variable X, with all others fixed, is written E_X. The expression X ∼ Y indicates that the random variables X and Y are identically distributed. Given a random variable X, the symbol X' denotes a random variable independent of X such that X' ∼ X. The indicator variable of the event X > Y is written 1_{X>Y}. The Bernoulli distribution with expectation p is written Bern(p), and the binomial distribution of n independent trials, each with success probability p, is written Bin(n, p). We write X ∼ Bern(p) to indicate that X is Bernoulli with mean p.

3.1.0.1 Graph sparsification

Graphs are often represented and fruitfully manipulated in terms of matrices, so the problems of graph sparsification and matrix sparsification are strongly related. We now introduce the relevant notation before surveying the literature.

Let G = (V, E, ω) be a weighted simple undirected graph with n vertices, m edges, and adjacency matrix A given by

a_jk = ω_jk if (j,k) ∈ E, and a_jk = 0 otherwise.

Orient the edges of G in an arbitrary manner. Then define the corresponding 2m×n oriented incidence matrix B in the following manner: b_{2i−1,j} = b_{2i,k} = ω_i and b_{2i−1,k} = b_{2i,j} = −ω_i if edge i is oriented from vertex j to vertex k, and all other entries of B are identically zero.

A cut is a partition of the vertices of G into two blocks: V = S ∪ S^c. The cost of a cut is the sum of the weights of all edges in E which have one vertex in S and one vertex in S^c. Several problems relating to cuts are of considerable practical interest. In particular, the MAXCUT problem, to determine the cut of maximum cost in a graph, is common in computer science applications.

The cuts of maximum cost are exactly those that correspond to the cut-norm of the oriented incidence matrix B, which is defined as

‖B‖_C = max_{I ⊂ {1,...,2m}, J ⊂ {1,...,n}} | Σ_{i∈I, j∈J} b_ij |.

Finding the cut-norm of a general matrix is NP-hard, but in [AN04], the authors offer a randomized polynomial-time algorithm which finds a submatrix B̃ of B for which |Σ_{jk} b̃_jk| ≥ 0.56 ‖B‖_C. This algorithm thereby gives a feasible means of approximating the MAXCUT value for arbitrary graphs. A crucial point in the derivation of the algorithm is the fact that for general matrices the ∞→1 operator norm is strongly equivalent with the cut-norm:

‖A‖_C ≤ ‖A‖_{∞→1} ≤ 4‖A‖_C;

in fact, in the particular case of oriented incidence matrices, ‖B‖_C = (1/4)‖B‖_{∞→1}.
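
The incidence-matrix construction and the norm equivalence can be checked directly on a toy graph. In the sketch below (NumPy), the graph, weights, and orientation are arbitrary, and the helper functions cut_norm, norm_inf_to_1, and max_cut are ad hoc brute-force routines written for this illustration, viable only at this scale. The example also exhibits the cut-norm agreeing with the maximum cut cost and the upper bound ‖B‖_{∞→1} ≤ 4‖B‖_C being attained.

```python
import itertools
import numpy as np

edges = [(0, 1, 1.0), (1, 2, 2.0), (2, 3, 0.5), (0, 2, 1.5)]    # (j, k, weight), 0-based vertices
n, m = 4, len(edges)

# oriented incidence matrix as defined above (edge i oriented from j to k)
B = np.zeros((2 * m, n))
for i, (j, k, w) in enumerate(edges):
    B[2 * i, j] = B[2 * i + 1, k] = w
    B[2 * i, k] = B[2 * i + 1, j] = -w

def cut_norm(M):
    """max over row subsets I and column subsets J of |sum_{I x J} m_ij|."""
    best = 0.0
    for I in itertools.product((0.0, 1.0), repeat=M.shape[0]):
        row_sum = np.array(I) @ M
        for J in itertools.product((0.0, 1.0), repeat=M.shape[1]):
            best = max(best, abs(row_sum @ np.array(J)))
    return best

def norm_inf_to_1(M):
    """max over +-1 vectors x of ||Mx||_1."""
    return max(np.abs(M @ np.array(x)).sum()
               for x in itertools.product((-1.0, 1.0), repeat=M.shape[1]))

def max_cut(edges, n):
    """largest total weight of edges crossing a vertex bipartition."""
    return max(sum(w for j, k, w in edges if S[j] != S[k])
               for S in itertools.product((0, 1), repeat=n))

c, i1 = cut_norm(B), norm_inf_to_1(B)
assert np.isclose(c, max_cut(edges, n))     # cut-norm of B equals the maximum cut cost
assert c <= i1 <= 4 * c + 1e-9              # the general equivalence
assert np.isclose(i1, 4 * c)                # for incidence matrices the upper bound is attained
print(c, i1)
```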

In his thesis [Kar95] and the sequence of papers [Kar94a, Kar94b, Kar96], Karger introduces the idea of random sampling to increase the efficiency of calculations with graphs, with a focus on cuts. In [Kar96], he shows that by picking each edge of the graph with a probability inversely proportional to the density of edges in a neighborhood of that edge, one can construct a sparsifier, i.e., a graph with the same vertex set and significantly fewer edges that preserves the value of each cut to within a factor of (1 ± ε).

In [SS08], Spielman and Srivastava improve upon this sampling scheme, instead keeping an edge with probability proportional to its effective resistance, a measure of how likely it is to appear in a random spanning tree of the graph. They provide an algorithm which produces a sparsifier with O(n log n/ε²) edges, where n is the number of vertices in the graph. They obtain this result by reducing the problem to the behavior of projection matrices Π_G and Π_{G'} associated with the original graph and the sparsifier, and then appealing to a spectral-norm concentration result.

The log n factor in [SS08] seems to be an unavoidable consequence of using spectral-norm concentration. In [BSS09], Batson et al. prove that the log n factor is not intrinsic: they establish that every graph has a sparsifier with O(n) edges. The existence proof is constructive and provides a deterministic algorithm for constructing such sparsifiers in O(n³m) time, where m is the number of edges in the original graph.
