getdocbbe2. 135KB Jun 04 2011 12:04:54 AM

(1)

ELECTRONIC COMMUNICATIONS in PROBABILITY

QUANTITATIVE ASYMPTOTICS OF GRAPHICAL

PROJECTION PURSUIT

ELIZABETH S. MECKES1

220 Yost Hall, Department of Mathematics, Case Western Reserve University, 10900 Euclid Ave., Cleveland, OH 44106.

email: [email protected]

SubmittedJanuary 12, 2009, accepted in final formApril 14, 2009 AMS 2000 Subject classification: 60E15,62E20

Keywords: Projection pursuit, concentration inequalities, Stein’s method, Lipschitz distance

Abstract

There is a result of Diaconis and Freedman which says that, in a limiting sense, for large collections of high-dimensional data most one-dimensional projections of the data are approximately Gaus-sian. This paper gives quantitative versions of that result. For a set of deterministic vectors_{x_i_}n

i=1 inRd _with_n_and_d_{fixed, let}θ _∈Sd−1_{be a random point of the sphere and let}µθ

n denote the ran-dom measure which puts mass 1_n at each of the points

x1,θ

, . . . ,

xn,θ

. For a fixed bounded Lipschitz test function f, Z a standard Gaussian random variable andσ2_{a suitable constant, an} explicit bound is derived for the quantity P

Z

f dµθ

n−Ef(σZ)

> ε

. A bound is also given

forP_d_{B L}₍_µθ

n,N(0,σ

2₎₎_{> ε}

, wheredB Ldenotes the bounded-Lipschitz distance, which yields a lower bound on the waiting time to finding a non-Gaussian projection of the_{xi}if directions are tried independently and uniformly onSd−1_.

1 Introduction

A foundational tool of data analysis is the projection of high-dimensional data to a one- or two-dimensional subspace in order to visually represent the data, and, ideally, identify underlying structure. The question immediately arises: which projections are interesting? One would like to answer by saying that those projections which exhibit structure are interesting, however, identify-ing which projections those are is not quite as straightforward as one might think. In particular, there are several reasons that have led to the idea that one should mainly look for projections which are far from Gaussian in behavior; that Gaussian projections in fact do not generally exhibit interesting structure. One justification for this idea is the following result due to Persi Diaconis and David Freedman.

1_{RESEARCH SUPPORTED BY AN AMERICAN INSTITUTE OF MATHEMATICS FIVE-YEAR FELLOWSHIP}

(2)

Theorem 1(Diaconis-Freedman[1]). Let x₁, . . . ,x_nbe deterministic vectors inRd_{. Suppose that n,}

d and the x_i depend on a hidden indexν, so that asνtends to infinity, so do n and d. Suppose that there is aσ2_>₀_{such that, for all}_{ε >}_0,

1 n

¦

j_≤n:|x_j|2−σ2d > εd

©

ν→∞

−−→0, (1)

and suppose that

1 n2

¦

j,k≤n: ¬

xj,xk

¶ > εd

©

ν→∞

−−→0. (2)

Letθ_∈Sd−1_{be distributed uniformly on the sphere, and consider the random measure}µθ

ν which puts

mass 1_n at each of the points

θ,x1

, . . . ,

θ,xn

. Then asνtends to infinity, the measuresµθ ν tend

to_N(0,σ2₎_{weakly in probability.}

Heuristically, Theorem 1 can be interpreted as saying that, for a large number of high-dimensional data vectors, as long as they have nearly the same lengths and are nearly orthogonal, most one-dimensional projections are close to Gaussian regardless of the structure of the data. It is important to note that the conditions (1) and (2) are not too strong; in particular, even though only d vectors can be exactly orthogonal inRd_{, the 2}d_{vertices of a unit cube centered at the origin satisfy}

condition (2) for “rough orthogonality”.

A failing of the usual interpretation of Theorem 1 is that sometimes, projections of data look nearly Gaussian for a reason; that is, it is not always due to the central-limit type effect described by the theorem. Thus the question arises: is there a way to tell whether a Gaussian projection is interesting? A possible answer lies in quantifying the theorem, and then saying that a nearly-Gaussian projection is interesting if it is “too close” to nearly-Gaussian to simply be the result of the phenomenon described by Theorem 1. By way of analogy, one has the Berry-Esséen theorem stating that the rate of convergence to normal of the sum ofnindependent, identically distributed random variables is of the orderp1_n; if one has a sum ofnrandom variables converging to Gaussian significantly faster, it must be happening for some reason other than just the usual central-limit theorem. In order to implement this idea, it is necessary (as with the Berry-Esséen theorem) to have a sharp quantitative version of the limit theorem in question.

A second motivation for proving a quantitative version of Theorem 1 is the application to waiting times for discovering an interesting direction on which to project data. If a sequence of indepen-dent random projection directions is tried until the empirical distribution of the projected data is more than some threshhold away from Gaussian (in some metric on measures), and N is the number of trials needed to find such a direction, a one can easily give a lower bound forE_N _from

the type of quantitative theorem proved below.

Thus the goal of this paper is to provide a quantitative version of Theorem 1 in a fixed dimension d and for a fixed number of data vectorsn. To do this, it is first necessary to replace conditions (1) and (2) with non-asymptotic conditions. The conditions we will use are the following. Letσ2 be defined by 1

n

Pn

i=1|xi|2=σ2d. Suppose there existAandB, such that 1

n n

X

i=1

σ−2|x_i|2−d

≤A, (3)

and, for allθ_∈Sd−1_,

1 n

n

X

i=1

θ,xi

2

(3)

For a little perspective on the restrictiveness of these conditions, note that, as for the conditions of Diaconis and Freedman, they hold for the vertices of a unit cube inRd _(with_A₌_{0 and}_B₌ 1

4). Under these assumptions, the following theorems hold.

Theorem 2. Let{xi}ni=1be deterministic vectors inR

d_{, subject to conditions}₍₃₎_and₍₄₎_{above. For} a pointθ_∈Sd−1_{, let the measure}_µθ

n put equal mass at each of the points

random variable,θchosen uniformly on the sphere,σdefined as above, andε >maxp2πpB

d−1,

d_{, subject to conditions}₍₃₎_and₍₄₎_{above, and} again consider the measuresµθ

n. Ifθis chosen uniformly fromS

(i) It should be emphasized that the key difference between the results proved here and the result of Diaconis and Freedman is that Theorems 2 and 3 hold forfixeddimensiond and number of data vectorsn; there are no limits in the statements of the theorems.

(ii) It is not necessary for Aand B to be absolute constants; for the the results above to be of interest asd _{→ ∞}, it is easy to see from the statements that it is only necessary that A=o(d)andB = o(d)for Theorem 2 whileB = o(pd)for Theorem 3. The reader may also be wondering where the dependence onnis in the statements above; it is built into the definition ofB. Note that, by definition,B≥|xi|2

n for eachi; in particular,B≥

σ2_d

n . It is thus necessary thatn_{→ ∞}asd_{→ ∞}for Theorem 2 andn_≫pdfor Theorem 3.

(iii) For Theorem 2, consider the special case thatε2₌ C2

·25_B

d or smaller.

(4)

whereC′₌₉_·₂16_{C B}2_and_C′′₌₄₈p_πC−3/10_{. Thus, roughly speaking, the bounded} Lips-chitz distance from the random measureµθ

n to the Gaussian measure with mean zero and varianceσ2_{is unlikely to be more than a large multiple of}log(d−1)

d−1

1/5

. We make no claims of the sharpness of this result.

Theorem 3 can easily be used to give an estimate on the waiting time until a non-Gaussian di-rection is found, if didi-rections are tried randomly and independently. Specifically, we have the following corollary.

Corollary 4. Letθ₁,θ₂,θ₃, . . .be a sequence of independent, uniformly distributed random points on

Sd−1_{. Let T}_ε_:₌_min_{_j_:_d_{B L}₍_µθ_nj_,_N₍_0,_σ2₎_{> ε}_}_._{Then there are constants c,}_c′_{such that}

E_T_ε_≥ cε

3/2

p

B exp

c′₍_d₋₁₎_ε5 B2

.

2 Proofs

This section is mainly devoted to the proofs of Theorems 2 and 3, with some additional remarks following the proofs. For the proof of Theorem 2, several auxiliary results are needed. The first is an abstract normal approximation for bounding the distance of a random variable to a Gaussian random variable in the presence of a continuous family of exchangeable pairs. The theorem is an abstraction of an idea used by Stein in [6] to bound the distance to Gaussian of the trace of a power of a random orthogonal matrix.

Theorem 5 (Meckes [4]). Suppose that (W,W_ε) is a family of exchangeable pairs defined on a common probability space, such thatE_W ₌₀_andE_W2₌_σ2_{. Let}_F _{be a}_{σ-algebra on this space}

with σ(W) _{⊆ F}. Suppose there is a function λ(ε)and random variables E,E′ measurable with respect to_F, such that

(i) 1

λ(ε)E

Wε−W

F

L1

−−→_ε

→0 −W+E ′_.

(ii) 1 2λ(ε)σ2E

(Wε−W)2

F

L1

−−→_ε

→0 1+E. (iii) _λ1₍_ε₎E_|_W

ε−W|3 ε→0

−−→0.

Then if Z is a standard normal random variable,

d_{T V}(W,σZ)_≤EE +

Ç

π 2E

E′

.

The next result gives expressions for some mixed moments of entries of a Haar-distributed orthog-onal matrix. See[3], Lemma 3.3 and Theorem 1.6 for a detailed proof.

Lemma 6. If U=ui j

d

i,j=1is an orthogonal matrix distributed according to Haar measure, then

E h_Q

uri j

i j

i

is non-zero if and only if ri•:=

Pd

j=1ri j and r•j:=

Pd

(5)

(i) For all i,j,

E h

u2_{i j}i=1

d. (ii) For all i,j,r,s,α,β,λ,µ,

E

ui jursuαβuλµ

=₋ 1

(d−1)d(d+2)

h

δ_{i r}δ_αλδ_j_βδ_s_µ+δ_{i r}δ_αλδ_j_µδ_s_β+δ_i_αδ_r_λδ_jsδ_βµ

+δ_i_αδ_r_λδ_j_µδ_β_s+δ_i_λδ_r_αδ_jsδ_βµ+δ_i_λδ_r_αδ_j_βδ_s_µi

+ d+1

(d₋1)d(d+2)

h

δ_{i r}δ_αλδ_jsδ_βµ+δ_i_αδ_r_λδ_j_βδ_s_µ+δ_i_λδ_r_αδ_j_µδ_s_βi.

(iii) For the matrix Q=

qi j

d

i,j=1defined by qi j:=ui1uj2−ui2uj1,and for all i,j,ℓ,p,

E_q_{i j}_q_ℓ_p₌ 2

d(d₋1)

δiℓδj p−δi pδjℓ

.

Finally, we will need to make use of the concentration of measure on the sphere, in the form of the following lemma.

Lemma 7 (Lévy, see [5]). For a function F :Sd−1 _→ R_{, let M}

F denote its median with respect to the uniform measure (that is, for θ distributed uniformly on Sd−1_, P_F₍_θ₎ _≤ _M

F

≥ 1₂ and

P

F(θ)_≥MF

≥ 1₂) and let L denote its Lipschitz constant. Then

PF(θ)−M_F > ε

≤

Ç

π 2exp

−(d−1)ε

2

2L2

.

With these results, it is now possible to give the proof of Theorem 2.

Proof of Theorem 2. The proof divides into two parts. First, an “annealed” version of the theorem is proved using the infinitesimal version of Stein’s method given by Theorem 5. Then, for a fixed test

function f andZa standard Gaussian random variable, the quantityP

R

f dµθ

ν −Ef(σZ)

> ε

is bounded using the annealed theorem together with the concentration of measure phenomenon.

Letθ be a uniformly distributed random point ofSd−1_⊆Rd_{, and let} _I _{be a uniformly distributed}

element of_{1, . . . ,n_}, independent ofθ. Consider the random variableW:=

θ,x_I

. ThenE_W₌

0 by symmetry andE_W2₌_σ2_{by the condition}1

n

Pn

i=1|xi|2=σ2d. Theorem 5 will be used to bound the total variation distance fromWtoσZ, whereZis a standard Gaussian random variable. The family of exchangeable pairs needed to apply the theorem is constructed as follows. Forε >0 fixed, let

Aε:=



 p

1₋ε2 _ε

−ε p1₋ε2



⊕I_d₋₂ = I_d+ 

−

ε2

2 +δ ε

−ε ₋ε2 2 +δ



⊕0_d₋₂,

whereδ=O(ε4₎_{. Let}_U_{be a Haar-distributed}_d_×_d_{random orthogonal matrix, independent of}_θ andI, and letWε=

UAεUTθ,xI

(6)

LetK be the d×2 matrix made of the first two columns of UandC2=

KC₂KT _{(note that this is the same}_Q_{as in part (iii) of Theorem 6). Then by the construction of} Wε,

The conditions of Theorem 5 can be verified using the expressions in Lemma 6 as follows. By the lemma,E_{K K}T₌2

Condition (ii) of Theorem 5 is thus satisfied withE= 1

d−1

. Condition (iii) of the theorem is trivial by (5); it follows that

dT V(W,σZ)≤ This is the annealed statement referred to at the beginning of the proof.

We next use the concentration of measure on the sphere to show that, for a large measure of θ_∈Sd−1_{, the random measure}_µθ

nwhich puts mass 1

nat each of the

θ,x_i

is close to the average behavior. To do this, we make use of Lévy’s Lemma (Lemma 7). Let f :R _→ R _{be such that}

(7)

Sd−1_{. Then, using}_k_f_k

B L≤1 together with equation (4),

thus the Lipschitz constant ofF is bounded bypB. It follows from Lemma 7 that

PF(θ)−M_F

d−1, we may use concentration about the median ofF to obtain concentration about the mean, with only a loss in constants.

(8)

Proof of Theorem 3. The first two steps of the proof of Theorem 3 were essentially done already in the proof of Theorem 2. From that proof, we have that ifW=

θ,x_I

forθdistributed uniformly onSd−1_and_I _{independent of}_θ _{and uniformly distributed in}_{_{1, . . . ,}_n_}_{, then}

d_{T V}(W,σZ)_≤ A+2

d−1, (8)

forAas in equation (3). Furthermore, it follows from equation (7) in the proof of Theorem 2 that forF(θ):=R f dµθ_nandε > pπpB

d−1, then

P_[_|_F₍_θ₎₋E_F₍_θ₎_|_{> ε}_]_≤PF(θ)−M_F > ε−

M_f −EF(θ)

≤P

F(θ)−M_F > ε−

πpB 2pd−1

≤ Ç

π 2e

−(d8B−1)ε 2

.

(9)

In this proof, this last statement is used together with a series of successive approximations of arbitrary bounded Lipschitz functions as used by Guionnet and Zeitouni[2]to obtain a bound for

P_d

B L(µθn,N(0,σ

2₎₎_{> ε}

. By definition,

P_d_{B L}₍_µθ

n,Eµ

θ

n)> ε

=P

sup kfkB L≤1

Z

f dµθ_n₋E Z

f dµθ_n

> ε

.

First consider the subclass_F_{B L,K} =_{f :_kf_k_{B L} _≤1,supp(f)_⊆K_}for a compact set K_⊆R_{. Let}

∆ =ε

4; for f ∈ FB L,K, define the approximationf∆as in Guionnet and Zeitouni[2]as follows. Let x_o=infKand let

g(x) =







0 x_≤0; x 0≤x≤∆;

∆ x_≥∆.

Forx_∈K, define f_∆recursively by f_∆(x_o) =0 and

f∆(x) =

⌈x−∆xo⌉

X

i=0

2I_f₍_x

o+ (i+1)∆)≥f∆(xo+i∆)

−1g(x₋x_o₋i∆).

That is, the function f∆ is just an approximation off by a function which is piecewise linear and

has slope 1 or−1 on each of the intervals[xo+i∆,xo+ (i+1)∆]. Note that, becausekfkB L≤1, it follows that_kf −f∆k∞≤∆and the number of distinct functions whose linear span is used to approximatef in this way is bounded by|K|

∆, where|K|is the diameter ofK. If{hk}

N

(9)

set of functions used in the approximation f∆andεktheir coefficients, then forε2>8π|K|

(10)

assuming thatB_≥ε. Recall thatER _{f dµ}θ

n=Ef(W)forW=

θ,x_I

, and so by the bound (8),

sup f∈FB L

E

Z

f dµθ_n₋E_f₍_σZ₎ ≤

A+2 d₋1, thus forεbounded below as above and also satisfyingε > 2(A+2)

d−1 ,

P

dB L(W,σZ)> ε

≤ 48

p

πB ε3/2 exp

−(d−1)ε

5

9·216_B2

.

Proof of Corollary 4. The proof is essentially trivial. Note that

P_[_T_ε_>_m_]_≥

1₋c1

p

B ε3/2 exp

c2(d−1)ε5 B2

m

by independence of theθ_jand Theorem 3, sinceTε>mif and only if dB L(µ

θj

n,N(0,σ2)≤εfor all 1≤j≤m. This bound can be used in the identityE_T

ε=

P∞

m=0P[Tε>m]to obtain the bound

in the corollary.

Remark:One of the features of the proofs given above is that they can be generalized to the case ofk-dimensional projections of thed-dimensional data vectors{xi}, withkfixed or even growing with d. The proof of the higher-dimensional analog of Theorem 2 goes through essentially the same way. However, the analog of the proof of Theorem 3 from Theorem 2 is rather more involved in the multivariate setting and will be the subject of a future paper.

References

[1] Persi Diaconis and David Freedman. Asymptotics of graphical projection pursuit.Ann. Statist., 12(3):793–815, 1984. MR0751274

[2] A. Guionnet and O. Zeitouni. Concentration of the spectral measure for large matrices. Elec-tron. Comm. Probab., 5:119–136 (electronic), 2000. MR1781846

[3] Elizabeth Meckes. An infinitesimal version of Stein’s method of exchangeable pairs. Doctoral dissertation, Stanford University, 2006.

[4] Elizabeth Meckes. Linear functions on the classical matrix groups. Trans. Amer. Math. Soc., 360(10):5355–5366, 2008. MR2415077

[5] Vitali D. Milman and Gideon Schechtman. Asymptotic theory of finite-dimensional normed spaces, volume 1200 ofLecture Notes in Mathematics. Springer-Verlag, Berlin, 1986. With an appendix by M. Gromov. MR0856576