www.elsevier.com/locate/spa
Entropy inequalities and the Central Limit Theorem
Oliver Johnson
Statistical Laboratory, University of Cambridge, 16 Mill Lane, Cambridge, CB2 1SB, UK
Received 17 November 1997; accepted 3 January 2000
Abstract
Motivated by Barron (1986, Ann. Probab. 14, 336–342), Brown (1982, Statistics and Probability: Essays in Honour of C.R. Rao, pp. 141–148) and Carlen and Soffer (1991, Comm. Math. Phys. 140, 339–371), we prove a version of the Lindeberg–Feller Theorem, showing normal convergence of the normalised sum of independent, not necessarily identically distributed random variables, under standard conditions. We give a sufficient condition for convergence in the relative entropy sense of Kullback–Leibler, which is strictly stronger than L¹. In the IID case we recover the main result of Barron [1]. © 2000 Elsevier Science B.V. All rights reserved.
MSC: 60F05; 94A17; 62B10
Keywords: Normal convergence; Entropy; Fisher Information
1. Introduction and results
Consider a sequence of independent real-valued random variables X^(1), X^(2), …, with densities, zero mean and finite variances v_1, v_2, … . It will be necessary to consider various normalised sums, so we introduce a notation whereby sums are indexed by sets of integers. For a non-empty set of positive integers S, define X^(S) = Σ_{i∈S} X^(i), with variance v_S = Var(X^(S)) = Σ_{i∈S} v_i, and the normalised sum U^(S) = X^(S)/√v_S = (Σ_{i∈S} X^(i))/√v_S, with density g^(S). Further, define Y_τ^(S) = U^(S) + Z_τ, where Z_τ is N(0, τ) and independent of U^(S).
In general, we use Z_u and Z_u^(i) to represent normal N(0, u) random variables with mean zero and variance u; Φ_{μ,σ²} and φ_{μ,σ²} stand for the N(μ, σ²) distribution function and probability density, respectively. If μ = 0 or σ² = 1, we omit the subscript μ or σ² from this notation. Let D(f||g) be the Kullback–Leibler distance between probability densities f and g, and J(X) be the Fisher Information of a random variable X (see Section 2 for details).
E-mail address:otj1000@cam.ac.uk (O. Johnson).
Condition 1. There exists a decreasing function ψ, with ψ(R) → 0 as R → ∞, such that for all i and R:

E(X^(i))² I(|X^(i)| ≥ √v_i R) ≤ v_i ψ(R).
Condition 1 is a uniform version of the Lindeberg Condition.
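As a concrete illustration (a sketch, not part of the paper's argument): if each X^(i) is itself N(0, v_i), then by scaling E(X^(i))² I(|X^(i)| ≥ √v_i R) = v_i E Z² I(|Z| ≥ R) for Z standard normal, so Condition 1 holds with ψ(R) = 2(Rφ(R) + 1 − Φ(R)), which is decreasing with ψ(R) → 0. The sketch below checks this closed form against direct numerical integration:

```python
import math

def phi(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def Phi(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def psi(R):
    # closed form of E Z^2 I(|Z| >= R) for Z ~ N(0, 1)
    return 2.0 * (R * phi(R) + 1.0 - Phi(R))

def psi_numeric(R):
    # direct midpoint-rule evaluation of 2 * int_R^12 x^2 phi(x) dx
    n = 50_000
    h = (12.0 - R) / n
    total = 0.0
    for i in range(n):
        x = R + (i + 0.5) * h
        total += x * x * phi(x) * h
    return 2.0 * total

Rs = [0.5, 1.0, 2.0, 3.0]
pairs = [(psi(R), psi_numeric(R)) for R in Rs]
print(pairs)
```

The exact tail formula follows from integrating x²φ(x) by parts, since (−xφ(x))′ = x²φ(x) − φ(x).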
Condition 2 (Bounded variance). There exists V_0 such that v_i ≤ V_0 < ∞ for all i.
Definition 1.1. For τ > 0, define ρ(W, τ) = sup_{S: v_S ≥ W} J(Y_τ^(S)) − 1/(1 + τ).

By the Cramér–Rao lower bound, and by results of Lemma 2.5, we know that 1/(1 + τ) ≤ J(U + Z_τ) ≤ 1/τ. Hence ρ(V, τ) is always non-negative, and zero if and only if U^(S) is N(0, 1) for all S such that v_S ≥ V.
Condition 3. There exists V_1 such that ∫_0^∞ ρ(V_1, τ) dτ is finite.
Since J(Y_τ^(S)) ≤ 1/τ, Condition 3 will hold if there exists δ > 0 such that ∫_0^δ J(Y_τ^(S)) dτ is finite, uniformly in S. This holds, for example, if there exists J_0 such that J(X^(i)) ≤ J_0 < ∞ for all i. We prove the following two principal theorems:
Theorem 1. Under Conditions 1 and 2, the Fisher Information tends to its minimum. That is, for all τ > 0: lim_{W→∞} ρ(W, τ) = 0.
Theorem 2. Under Conditions 1–3,

lim_{W→∞} sup_{S: v_S ≥ W} D(g^(S)||φ) = 0.
Theorem 1 implies weak convergence: lim_{W→∞} sup_{S: v_S ≥ W} sup_x |P(U^(S) ≤ x) − Φ(x)| = 0. Theorem 2 implies L¹ convergence: lim_{W→∞} sup_{S: v_S ≥ W} ||g^(S) − φ||_{L¹(dx)} = 0. For weak convergence, the classical Lindeberg condition is known to be necessary and sufficient (see for example Gnedenko and Kolmogorov, 1954, Chapter 4, Theorem 4). An interesting open problem is to find necessary and sufficient conditions for convergence in L¹, or in Kullback–Leibler distance.
Notice that both Theorems 1 and 2 are trivially true if the sum of variances Σ_{i=1}^∞ v_i converges, hence we assume that Σ_{i=1}^∞ v_i = ∞. An example of a 'nearly IID' case where Conditions 1–3 hold is where X^(i) is the sum of independent variables W^(i) + Z_α^(i) for some α > 0. In that case Condition 1 holds by Lemma 5.3, Condition 2 has an obvious interpretation in requiring a uniform bound on Var(W^(i)), and Condition 3 holds because J(Y_τ^(S)) ≤ J(Z_{α+τ}) = 1/(α + τ), so that ∫_0^∞ ρ(V, τ) dτ ≤ log(1/α).
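The last bound is easy to check numerically (a sketch with an assumed value α = 0.1, not part of the paper): since ρ(V, τ) ≤ 1/(α + τ) − 1/(1 + τ), which decays like (1 − α)/τ², the improper integral converges, and its value is log(1/α):

```python
import math

alpha = 0.1  # assumed perturbation variance for the 'nearly IID' example

def integrand(t):
    # dominating bound for rho(V, tau): 1/(alpha + tau) - 1/(1 + tau)
    return 1.0 / (alpha + t) - 1.0 / (1.0 + t)

# Trapezoid rule on [0, T]; the tail beyond T contributes O((1 - alpha)/T).
T, n = 2000.0, 200_000
h = T / n
total = 0.5 * (integrand(0.0) + integrand(T))
for i in range(1, n):
    total += integrand(i * h)
total *= h

exact = math.log(1.0 / alpha)  # closed form of the improper integral
print(total, exact)
```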
Theorems 1 and 2 extend earlier results in Barron (1986) and Brown (1982) to the case of non-identically distributed random variables. More precisely, our proof can be easily adapted to recover the main result of Barron (1986): if X^(1), X^(2), … are IID with densities and D(g_m||φ) is finite for some m, then lim_{m→∞} D(g_m||φ) = 0. Here g_m is the density of (Σ_{i=1}^m X^(i))/√(mv).
Entropy-related methods of proving convergence to a Gaussian distribution originate from a paper of Linnik (1959). For the history of the question see Barron (1986). The techniques used in this paper are similar to those of Barron (1986), Brown (1982) and Carlen and Soffer (1991); in particular, Section 3 is motivated by Barron (1986) and Carlen and Soffer (1991), and Sections 4 and 5 by Brown (1982). Theorem 2 bears a close similarity to Theorem 2.2 of Carlen and Soffer (1991), except that we remove the condition that the variances be uniformly bounded away from zero. Furthermore, Proposition 3.2 and Theorem 1 give bounds more explicit than the continuity argument in Theorem 1.2 of Carlen and Soffer (1991). In particular, our Theorem 1 requires only Conditions 1 and 2 to obtain weak convergence.
2. Notation and denitions
Definition 2.1. Given probability densities f, g, we define the Kullback–Leibler distance (or 'relative entropy') D(f||g) to be

D(f||g) = ∫ f(x) log(f(x)/g(x)) dx.

In the case where f has mean zero and variance σ², and g = φ_{σ²},

D(f||g) = (1/2) log(2πeσ²) − H(f),

where H represents the differential entropy. In other words, we can measure entropy in terms of distance from the normal. The Kullback–Leibler distance is shift and scale invariant. D(f||g) is not a metric; it is asymmetric and does not satisfy the triangle inequality. However, D(f||g) ≥ 0, with equality iff f(x) = g(x) for almost all x ∈ ℝ. Furthermore (see Kullback, 1967 for details),

||f − g||_{L¹(dx)} ≤ √(2 D(f||g)).
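These facts can be checked numerically. The sketch below (assuming f uniform on [−√3, √3], so that f has mean zero, variance 1 and H(f) = log(2√3)) evaluates D(f||φ) by direct integration, compares it with the entropy-gap formula, and verifies the Kullback inequality:

```python
import math

a = math.sqrt(3.0)  # f uniform on [-a, a]: mean 0, variance 1

def f(x):
    return 1.0 / (2.0 * a) if -a <= x <= a else 0.0

def phi(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

# D(f||phi) by midpoint-rule integration over the support of f.
n = 200_000
h = 2.0 * a / n
D = 0.0
for i in range(n):
    x = -a + (i + 0.5) * h
    D += f(x) * math.log(f(x) / phi(x)) * h

# The identity D(f||phi_{sigma^2}) = (1/2) log(2 pi e sigma^2) - H(f),
# with sigma^2 = 1 and H(f) = log(2 sqrt(3)) for the uniform density.
D_formula = 0.5 * math.log(2.0 * math.pi * math.e) - math.log(2.0 * a)

# Kullback/Pinsker: ||f - phi||_L1 <= sqrt(2 D(f||phi)).
l1 = 0.0
m = 400_000
h2 = 16.0 / m
for i in range(m):
    x = -8.0 + (i + 0.5) * h2
    l1 += abs(f(x) - phi(x)) * h2

print(D, D_formula, l1, math.sqrt(2.0 * D_formula))
```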
The normal distribution maximises entropy amongst random variables of given variance, so the Central Limit Theorem corresponds to the entropy tending to its maximum. Fisher Information is used to prove that the maximum is achieved.
Definition 2.2. If U is a random variable with differentiable density g(u), we define the score function ρ_U(u) = (∂/∂u) log g(u) = I(g(u) > 0) g′(u)/g(u) and the Fisher Information

J(U) = E ρ_U²(U).
Lemma 2.3. If U is a random variable with density f and variance 1, and Z_u is independent of U, then

D(f||φ) = (1/2) ∫_0^∞ [J(U + Z_u) − 1/(1 + u)] du.

Proof. This is a rescaling of Lemma 1 of Barron (1986), which is an integral form of de Bruijn's identity (see Blachman (1965) or Stam (1959) for more details of de Bruijn's result).
Lemma 2.4. If U^(1), U^(2) are independent and U^(3) = U^(1) + U^(2), with score functions ρ^(1), ρ^(2) and ρ^(3), then with probability one, for any β ∈ [0, 1],

ρ^(3)(U^(3)) = E[β ρ^(1)(U^(1)) + (1 − β) ρ^(2)(U^(2)) | U^(3)].

Here and below, E[· | U^(3)] stands for the conditional expectation with respect to the σ-algebra generated by U^(3).
Proof. If U^(1) and U^(2) have densities p and q, then U^(1) + U^(2) has density r(w) = ∫ p(x) q(w − x) dx, and so

ρ^(3)(w) = r′(w)/r(w) = ∫ [p′(x) q(w − x)/r(w)] dx = ∫ [p′(x)/p(x)] [p(x) q(w − x)/r(w)] dx,

which equals, almost surely, the expected value E[ρ^(1)(U^(1)) | U^(3)](w). Similarly, we can produce an expression in terms of the score function ρ_V(y) = q′(y)/q(y). We add β times the first expression to (1 − β) times the second one.
Lemma 2.5. If U^(1), U^(2) are independent, then for any β ∈ [0, 1],

J(U^(1) + U^(2)) ≤ β² J(U^(1)) + (1 − β)² J(U^(2)),

and hence

J(√β U^(1) + √(1 − β) U^(2)) ≤ β J(U^(1)) + (1 − β) J(U^(2)).

Proof. Using Lemma 2.4 and Jensen's Inequality,

J(U^(3)) = E ρ^(3)(U^(3))²
= E[E(β ρ^(1)(U^(1)) + (1 − β) ρ^(2)(U^(2)) | U^(3))²]
≤ E[E((β ρ^(1)(U^(1)) + (1 − β) ρ^(2)(U^(2)))² | U^(3))]
= β² E ρ^(1)(U^(1))² + (1 − β)² E ρ^(2)(U^(2))²,

where the cross term vanishes since, by independence, E[ρ^(1)(U^(1)) ρ^(2)(U^(2))] = E ρ^(1)(U^(1)) E ρ^(2)(U^(2)) = 0. Substituting √β U^(1) and √(1 − β) U^(2) for U^(1) and U^(2), respectively, we recover the second result.
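The second inequality can be probed numerically. The sketch below (the mixture parameters are assumptions chosen for illustration) takes U^(1) to be a symmetric two-component normal mixture with variance 1 and U^(2) standard normal; all three densities then have closed forms, and J is estimated as ∫ f′²/f:

```python
import math

def gauss(x, mu, v):
    return math.exp(-(x - mu) ** 2 / (2.0 * v)) / math.sqrt(2.0 * math.pi * v)

def dgauss(x, mu, v):
    # derivative in x of the N(mu, v) density
    return -(x - mu) / v * gauss(x, mu, v)

def fisher(components):
    # components: list of (weight, mean, variance); returns int f'^2 / f dx
    lo, hi, n = -15.0, 15.0, 100_000
    h = (hi - lo) / n
    J = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * h
        f = sum(w * gauss(x, mu, v) for w, mu, v in components)
        fp = sum(w * dgauss(x, mu, v) for w, mu, v in components)
        if f > 1e-300:  # guard against underflow in the far tails
            J += fp * fp / f * h
    return J

J_std = fisher([(1.0, 0.0, 1.0)])    # sanity check: J(N(0,1)) = 1

m, s2, beta = 0.9, 0.19, 0.4         # assumed parameters; Var(U1) = m*m + s2 = 1
U1 = [(0.5, m, s2), (0.5, -m, s2)]   # density of U(1)
J1 = fisher(U1)
J2 = 1.0                             # U(2) standard normal

# sqrt(beta) U(1) + sqrt(1-beta) U(2) is again a two-component mixture:
vmix = beta * s2 + (1.0 - beta)
U3 = [(0.5, math.sqrt(beta) * m, vmix), (0.5, -math.sqrt(beta) * m, vmix)]
J3 = fisher(U3)

print(J_std, J1, J3, beta * J1 + (1.0 - beta) * J2)
```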
However, we require a stronger result. We will use the Hermite polynomials H_n(x) (see Szegő, 1958 for more details), which form an orthogonal basis in L²(φ_{τ/2}(x) dx).
Definition 2.6. The generating function of the Hermite polynomials with respect to the normal weight φ_{τ/2} is

G(x, t) = Σ_r (t^r/r!) H_r(x) = exp(−t²τ/4 + tx).

Using this, H_0(x) = 1, H_1(x) = x, H_2(x) = x² − τ/2, and the normalisation is given by

∫ H_r(x) H_s(x) φ_{τ/2}(x) dx = δ_{rs} r! (τ/2)^r.
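A numerical sketch of these relations for the assumed value τ = 1 (so the weight is φ_{1/2} and H_2(x) = x² − 1/2), checking orthogonality and the normalisation r!(τ/2)^r for r ≤ 2:

```python
import math

tau = 1.0
v = tau / 2.0  # variance of the weight phi_{tau/2}

def phi_w(x):
    return math.exp(-x * x / (2.0 * v)) / math.sqrt(2.0 * math.pi * v)

# H_0, H_1, H_2 for this weight
H = [lambda x: 1.0, lambda x: x, lambda x: x * x - tau / 2.0]

def inner(r, s):
    # int H_r(x) H_s(x) phi_{tau/2}(x) dx by the midpoint rule
    lo, hi, n = -10.0, 10.0, 50_000
    h = (hi - lo) / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * h
        total += H[r](x) * H[s](x) * phi_w(x) * h
    return total

diag = [inner(r, r) for r in range(3)]
expected = [math.factorial(r) * (tau / 2.0) ** r for r in range(3)]
off = [inner(0, 1), inner(0, 2), inner(1, 2)]
print(diag, expected, off)
```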
3. Convergence of densities
Definition 3.1. Given a function f(x) = Σ_{r=0}^∞ a_r H_r(x) ∈ L²(φ_{τ/2}(x) dx), define the projection map f ↦ Ψf by

(Ψf)(x) = Σ_{r=2}^∞ a_r H_r(x) = f(x) − ∫ f(y) (1 + 2xy/τ) φ_{τ/2}(y) dy,

and a seminorm || · || using

||f||² = ∫ (Ψf)(x)² φ_{τ/2}(x) dx = Σ_{r=2}^∞ a_r² r! (τ/2)^r.

This measures 'how far f is from being linear', which is useful since if ρ is the score function of a normal then ρ is linear and hence ||ρ|| = 0.
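The kernel form of Ψ can be verified directly: for an affine function f(y) = ay + b, the integral ∫ f(y)(1 + 2xy/τ) φ_{τ/2}(y) dy returns b + ax, so Ψf ≡ 0, while applying Ψ to x² leaves H_2(x) = x² − τ/2. A short numerical sketch (τ = 1, with an arbitrarily assumed affine score −0.7x + 0.3):

```python
import math

tau = 1.0
v = tau / 2.0

def phi_w(y):
    return math.exp(-y * y / (2.0 * v)) / math.sqrt(2.0 * math.pi * v)

def project(f, x):
    # (Psi f)(x) = f(x) - int f(y) (1 + 2 x y / tau) phi_{tau/2}(y) dy
    lo, hi, n = -10.0, 10.0, 50_000
    h = (hi - lo) / n
    integral = 0.0
    for i in range(n):
        y = lo + (i + 0.5) * h
        integral += f(y) * (1.0 + 2.0 * x * y / tau) * phi_w(y) * h
    return f(x) - integral

# An affine function (the score of a normal is linear) is annihilated:
lin = lambda x: -0.7 * x + 0.3
vals_lin = [project(lin, x) for x in (-2.0, 0.0, 1.5)]

# A genuinely non-linear function is not: Psi(x^2) = x^2 - 1/2 = H_2(x).
quad = lambda x: x * x
vals_quad = [project(quad, x) for x in (-2.0, 0.0, 1.5)]

print(vals_lin, vals_quad)
```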
Proposition 3.2. Let U^(1) and U^(2) be independent random variables with mean zero and variance one. Given τ > 0, define Y^(i) = U^(i) + Z_τ^(i), i = 1, 2, and Y_{β,τ}^(3) = √β Y^(1) + √(1 − β) Y^(2), with densities f^(1), f^(2), f^(3) and score functions ρ^(1), ρ^(2), ρ^(3). Here Z_τ^(1) and Z_τ^(2) are independent of U^(1), U^(2) and of each other. There exists a constant α_τ > 0 such that

β J(Y^(1)) + (1 − β) J(Y^(2)) − J(Y_{β,τ}^(3)) ≥ α_τ² β(1 − β)(||ρ^(1)||² + ||ρ^(2)||²)/2.
The normal perturbation considered in Proposition 3.2 guarantees the differentiability of the densities f^(i) and hence the existence and boundedness of the Fisher Information. The proof of Proposition 3.2 is presented in Section 4, where we use the natural generalisations of results of Brown (1982).
Definition 3.3. Given a positive function ψ where ψ(R) → 0 as R → ∞, define the set of random variables C_ψ by

C_ψ = {X: EX = 0, v_X = EX² < ∞, EX² I(|X| ≥ √v_X R) ≤ v_X ψ(R) for all R}.

The last inequality has the form E(X/√v_X)² I(|X/√v_X| ≥ R) ≤ ψ(R), which implies that the class C_ψ is scale invariant, and is a uniform integrability condition on X/√v_X. Note that Condition 1 guarantees that each X^(i) ∈ C_ψ. Lemma 3.5 below ensures that any sum X^(S) ∈ C_ψ*, for some positive function ψ* with ψ*(R) → 0 as R → ∞.
Proposition 3.4. Given τ > 0 and a random variable X ∈ C_ψ, define Y = X/√v_X + Z_τ, with score function ρ^(Y), where Z_τ is independent of X. There exists a function ε(δ) (depending only on ψ and τ), with ε(δ) → 0 monotonically as δ → 0, such that

J(Y) − 1/(1 + τ) ≤ ε(||ρ^(Y)||).
Lemma 3.5. If for all i, X^(i) ∈ C_ψ, then there exists a function ψ* such that for all S, X^(S) ∈ C_ψ*.
Proof. First we prove the assertion under the additional assumption that the X^(i) have even densities. Set V^(i) = X^(i) I(|X^(i)| < √v_i t); then

E(Σ_{i∈S} V^(i))⁴ ≤ Σ_{i∈S} E(V^(i))⁴ + 3(Σ_{i∈S} E(V^(i))²)² ≤ Σ_{i∈S} t² v_i E(V^(i))² + 3(Σ_{i∈S} E(V^(i))²)² ≤ (t² + 3)(Σ_{i∈S} v_i)².

Furthermore, by Cauchy–Schwarz, for any t,

E(X^(S))² I(|X^(S)| ≥ √v_S R)
= E(Σ_{i∈S} X^(i)(I(|X^(i)| ≥ √v_i t) + I(|X^(i)| < √v_i t)))² I(|X^(S)| ≥ √v_S R)
≤ 2 Σ_{i∈S} E(X^(i))² I(|X^(i)| ≥ √v_i t) + 2 √(E(Σ_{i∈S} X^(i) I(|X^(i)| < √v_i t))⁴) √(P(|X^(S)| ≥ √v_S R))
≤ 2 v_S ψ(t) + 2 √(t² + 3) v_S/R.

Taking t = √R, we can use ψ*(R) = 2 ψ(√R) + 2 √(R + 3)/R, which, as required, tends to zero as R → ∞.
To establish the assertion in the general case we pass from X^(i) to the symmetrised random variables X^(i) − X^(i)′, where X^(i)′ is an independent copy of X^(i). The inequalities above give control over the tails of X^(i) − X^(i)′:

E(X^(i) − X^(i)′)² I(|X^(i) − X^(i)′| ≥ R √(2v_i))
≤ 2 E(X^(i)² + X^(i)′²)(I(|X^(i)| ≥ R √(v_i/2)) + I(|X^(i)′| ≥ R √(v_i/2)))
≤ 4 v_i (ψ(R/√2) + 2/R²).

Conversely, if we have control over the tails of X^(S) − X^(S)′, we control the tail of X^(S), since Carlen and Soffer (1991, p. 354) shows

E X^(S)² I(|X^(S)| ≥ 2R √(2v_S)) ≤ E(X^(S) − X^(S)′)² I(|X^(S) − X^(S)′| ≥ R √(2v_S)).
Proof of Theorem 1. By Condition 2 the variances are bounded by V_0. Given a set S with v_S ≥ 3V_0, we can write S = S_1 ∪ S_2, with S_1 ∩ S_2 = ∅ and, setting β = v_{S_1}/v_S, 1/3 ≤ β ≤ 2/3. For example, defining r = min{l: v_{S∩(−∞,l]} ≥ v_S/3}, let S_1 = S ∩ (−∞, r]; then v_{S_1} ≥ v_S/3, and v_{S_1} ≤ v_S/3 + V_0 ≤ 2v_S/3.

According to Propositions 3.2 and 3.4, given η > 0, there are two possibilities:

1. β J(Y^(S_1)) + (1 − β) J(Y^(S_2)) − J(Y^(S)) ≥ η.
2. β J(Y^(S_1)) + (1 − β) J(Y^(S_2)) − J(Y^(S)) < η. In this case, since β(1 − β) ≥ 2/9, by Proposition 3.2, both ||ρ^(S_1)|| and ||ρ^(S_2)|| are less than 3 η^{1/2}/α_τ. Hence by Proposition 3.4, J(Y^(S_i)) − 1/(1 + τ) ≤ ε(3 η^{1/2}/α_τ) for i = 1, 2, and by Lemma 2.5 the same bound holds for J(Y^(S)) − 1/(1 + τ).

If v_S ≥ V, then for i = 1, 2, v_{S_i} ≥ V/3, so J(Y^(S_i)) − 1/(1 + τ) ≤ ρ(V/3, τ). Hence in the first case, J(Y^(S)) − 1/(1 + τ) ≤ ρ(V/3, τ) − η, and in the second, J(Y^(S)) − 1/(1 + τ) ≤ ε(3 η^{1/2}/α_τ). We conclude that

ρ(V, τ) = sup_{S: v_S ≥ V} J(Y^(S)) − 1/(1 + τ) ≤ max{ρ(V/3, τ) − η, ε(3 η^{1/2}/α_τ)}.

Hence, assuming ρ(V, τ) ≥ δ for all V and taking η = (α_τ ε^{−1}(δ)/3)², we have that ρ(V, τ) ≤ ρ(V/3, τ) − (α_τ ε^{−1}(δ)/3)² for all V. This provides a contradiction, since 0 ≤ ρ(V, τ) ≤ 1/(τ(1 + τ)). Therefore lim inf_{V→∞} ρ(V, τ) = 0. But ρ(V, τ) decreases monotonically in V, so lim_{V→∞} ρ(V, τ) = 0 for all τ.
Proof of Theorem 2. By Lemma 2.3,

sup_{S: v_S ≥ V} D(g^(S)||φ) = sup_{S: v_S ≥ V} (1/2) ∫_0^∞ [J(U^(S) + Z_τ) − 1/(1 + τ)] dτ ≤ (1/2) ∫_0^∞ ρ(V, τ) dτ.

Now ρ(V, τ) is monotone decreasing in V, converges to zero, and ∫_0^∞ ρ(V_1, τ) dτ is finite. By the Monotone Convergence Theorem, sup_{S: v_S ≥ V} D(g^(S)||φ) → 0.
In the case of IID variables, we recover the main result of Barron (1986): if X^(1), X^(2), … are IID with densities and D(g_m||φ) is finite for some m, then lim_{m→∞} D(g_m||φ) = 0. Here g_m is the density of Σ_{i=1}^m X^(i)/√(mv) and v = Var X^(i). This follows since, if we write ρ_k for the score function of W^(k) = Σ_{i=1}^{2^k} X^(i)/√(2^k v) + Z_τ, Proposition 3.2 becomes

J(W^(k)) − J(W^(k+1)) ≥ const. ||ρ_k||².

As in Barron (1986), we deduce that J(W^(k)) converges monotonically to its minimum, and this monotone convergence means that we only require the condition that D(g_m||φ) is ever finite to ensure its convergence to zero.
4. Proof of Proposition 3.2
Throughout this section, the notation of Proposition 3.2 is in place. Given ¿0, dene random variables Y(i)=U(i)+Z(i), i= 1;2, and Y; (3)=√Y(1)+√1−Y(2)
with densities f(1); f(2); f(3) and score functions (1); (2); (3). Here and below Z(1)
and Z(2) are independent of U(1), U(2) and of each other. In addition, we consider
independent variables Z=(i)2,i= 1;2 and setZ=(3)2=√Z=(1)2 +√1−Z=(2)2. Finally, {Hr}
will represent the Hermite polynomials with respect to=2.
Lemma 4.1 (Brown, 1982). There exists a constant α_τ > 0 such that for any random variable U with variance 1, the sum Y = U + Z_τ, where Z_τ is independent of U, has density f bounded below by α_τ φ_{τ/2}.

Proof. Since U has variance 1, by Chebyshev, P(|U| < 2) ≥ 3/4. Hence, if F_U is the distribution function of U, for any s ∈ ℝ,

f(s) = ∫_{−∞}^∞ φ_τ(s − t) dF_U(t)
≥ ∫_{−2}^2 φ_τ(s − t) dF_U(t)
≥ (3/4) min{φ_τ(s − t): |t| < 2} = (3/4) φ_τ(|s| + 2)
≥ (3/(4√2)) exp(−4/τ) φ_{τ/2}(s) = α_τ φ_{τ/2}(s).
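The bound, with the explicit constant α_τ = (3/(4√2)) e^{−4/τ} read off from the proof, can be checked numerically. A sketch assuming U uniform on [−√3, √3] (variance 1) and τ = 1, where f has a closed form in terms of the normal distribution function:

```python
import math

tau = 1.0
alpha = (3.0 / (4.0 * math.sqrt(2.0))) * math.exp(-4.0 / tau)  # constant from the proof

def Phi(x):  # standard normal distribution function
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

a = math.sqrt(3.0)  # U uniform on [-a, a] has variance 1
rt = math.sqrt(tau)

def f(s):
    # density of Y = U + Z_tau: the normal kernel averaged over [-a, a]
    return (Phi((s + a) / rt) - Phi((s - a) / rt)) / (2.0 * a)

def phi_half(s):  # N(0, tau/2) density
    return math.exp(-s * s / tau) / math.sqrt(math.pi * tau)

# check f(s) >= alpha * phi_{tau/2}(s) on a grid over [-6, 6]
ratios = [f(-6.0 + i * 0.01) / phi_half(-6.0 + i * 0.01) for i in range(1201)]
print(min(ratios), alpha)
```

The minimum of the ratio is far above α_τ for this choice of U; the constant in the lemma is, of course, a worst-case bound over all variance-1 distributions.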
Lemma 4.2.

β J(Y^(1)) + (1 − β) J(Y^(2)) − J(Y_{β,τ}^(3))
= E(√β ρ^(1)(Y^(1)) + √(1 − β) ρ^(2)(Y^(2)) − ρ^(3)(Y_{β,τ}^(3)))²
≥ α_τ² E(√β ρ^(1)(Z_{τ/2}^(1)) + √(1 − β) ρ^(2)(Z_{τ/2}^(2)) − ρ^(3)(Z_{τ/2}^(3)))².

Proof. Writing R = √β ρ^(1)(Y^(1)) + √(1 − β) ρ^(2)(Y^(2)) and T = ρ^(3)(Y_{β,τ}^(3)), by Lemma 2.4, T = E[R | Y_{β,τ}^(3)]. Hence, by standard orthogonality arguments, E(R − T)² = ER² − ET². Since U^(1) and U^(2) are independent, so are Y^(1) and Y^(2), and hence E[ρ^(1)(Y^(1)) ρ^(2)(Y^(2))] = E ρ^(1)(Y^(1)) E ρ^(2)(Y^(2)) = 0. Thus ER² = β E(ρ^(1)(Y^(1)))² + (1 − β) E(ρ^(2)(Y^(2)))², which by Definition 2.2 equals β J(Y^(1)) + (1 − β) J(Y^(2)). This gives the first equation.

The second inequality follows as Lemma 4.1 gives

∫∫ h²(s, t) f^(1)(s) f^(2)(t) ds dt ≥ α_τ² ∫∫ h²(s, t) φ_{τ/2}(s) φ_{τ/2}(t) ds dt

for any real h(s, t).
Lemma 4.3.

E(√β ρ^(1)(Z_{τ/2}^(1)) + √(1 − β) ρ^(2)(Z_{τ/2}^(2)) − ρ^(3)(Z_{τ/2}^(3)))²
≥ E(√β ρ^(1)(Z_{τ/2}^(1)) + √(1 − β) ρ^(2)(Z_{τ/2}^(2)) − w(Z_{τ/2}^(3)))²,

where

w(x) = √β ∫ ρ^(1)(√β x + √(1 − β) v) φ_{τ/2}(v) dv + √(1 − β) ∫ ρ^(2)(√(1 − β) x − √β v) φ_{τ/2}(v) dv.

Proof. Note that E(√β ρ^(1)(Z_{τ/2}^(1)) + √(1 − β) ρ^(2)(Z_{τ/2}^(2)) − w(Z_{τ/2}^(3)))² is minimised over w by taking w(x) = E[√β ρ^(1)(Z_{τ/2}^(1)) + √(1 − β) ρ^(2)(Z_{τ/2}^(2)) | Z_{τ/2}^(3) = x] (as the L² distance is minimised by taking projections).

Now (omitting the subscripts τ/2), f_{Z^(1)|Z^(3)}(s|x) = f_{Z^(1),Z^(2)}(s, u)/f_{Z^(3)}(x) = (πτ)^{−1/2} exp(−(s² + u² − x²)/τ), where √β s + √(1 − β) u = x. Rearranging, we obtain f_{Z^(1)|Z^(3)}(s|x) = φ_{τ/2}(v), where v = (s − √β x)/√(1 − β). Similarly, f_{Z^(2)|Z^(3)}(r|x) = φ_{τ/2}(v′), where v′ = (r − √(1 − β) x)/√β. Substituting for r and s, the result follows.
Lemma 4.4. The linear map Θ_β defined by

(Θ_β f)(x) = ∫_{−∞}^∞ f(√β x + √(1 − β) v) φ_{τ/2}(v) dv

maps H_r(x) into β^{r/2} H_r(x).

Proof. The action of Θ_β on the generating function is

Θ_β Σ_{r=0}^∞ (t^r/r!) H_r(x) = ∫ (πτ)^{−1/2} exp(−(v − t √(1 − β) τ/2)²/τ − β t² τ/4 + √β xt) dv
= exp(−β t² τ/4 + √β xt)
= Σ_{r=0}^∞ ((t √β)^r/r!) H_r(x).

This is the natural generalisation of Eq. (3.6) from Brown (1982), though our proof is different.
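The eigenfunction relation can be confirmed numerically for r = 2 (a sketch with assumed values τ = 1, β = 0.3): applying the integral map to H_2 should multiply it by β^{2/2} = β:

```python
import math

tau, beta = 1.0, 0.3
v = tau / 2.0  # variance of the weight phi_{tau/2}

def phi_w(x):
    return math.exp(-x * x / (2.0 * v)) / math.sqrt(2.0 * math.pi * v)

H2 = lambda x: x * x - tau / 2.0  # Hermite polynomial H_2 for this weight

def theta(f, x):
    # (Theta_beta f)(x) = int f(sqrt(beta) x + sqrt(1-beta) u) phi_{tau/2}(u) du
    lo, hi, n = -10.0, 10.0, 50_000
    h = (hi - lo) / n
    total = 0.0
    for i in range(n):
        u = lo + (i + 0.5) * h
        total += f(math.sqrt(beta) * x + math.sqrt(1.0 - beta) * u) * phi_w(u) * h
    return total

# Eigenvalue relation: Theta_beta H_r = beta^{r/2} H_r; for r = 2 the factor is beta.
xs = (-1.5, 0.0, 2.0)
lhs = [theta(H2, x) for x in xs]
rhs = [beta * H2(x) for x in xs]
print(lhs, rhs)
```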
Lemma 4.5. If ρ^(i)(x) ∈ L²(φ_{τ/2}(x) dx) with ρ^(i)(x) = Σ_r a_r^(i) H_r(x), and w is the function identified in Lemma 4.3, then

E(√β ρ^(1)(Z_{τ/2}^(1)) + √(1 − β) ρ^(2)(Z_{τ/2}^(2)) − w(Z_{τ/2}^(3)))²
≥ (β(1 − β)/2)(Σ_{r=2}^∞ (a_r^(1))² (τ/2)^r r! + Σ_{r=2}^∞ (a_r^(2))² (τ/2)^r r!)
= (β(1 − β)/2)(||ρ^(1)||² + ||ρ^(2)||²).
Proof. By Lemma 4.4,

w(x) = Σ_{r=0}^∞ ((√β)^{r+1} a_r^(1) + (√(1 − β))^{r+1} a_r^(2)) H_r(x).

Expanding, and using the normalisation of H_r, we know that

E(√β ρ^(1)(Z_{τ/2}^(1)) + √(1 − β) ρ^(2)(Z_{τ/2}^(2)))² − E w²(Z_{τ/2}^(3))
= Σ_{r=1}^∞ ((a_r^(1))²(β − β^{r+1}) + (a_r^(2))²((1 − β) − (1 − β)^{r+1}) − 2 a_r^(1) a_r^(2) (√(β(1 − β)))^{r+1}) (τ/2)^r r!
≥ Σ_{r=1}^∞ (A_r(β)(a_r^(1))² + A_r(1 − β)(a_r^(2))²) (τ/2)^r r!,

where A_r(x) = x − x^{r+1} − x^{(r+1)/2}(1 − x)^{(r+1)/2}. As A_1(x) ≡ 0, these terms may be removed. For fixed x ∈ [0, 1], A_r(x) is increasing in r, so we may replace A_r(β), A_r(1 − β) by the (positive) values A_2(β), A_2(1 − β). Now, note that

A_2(x) = x − x³ − x^{3/2}(1 − x)^{3/2} = x(1 − x)(1 + x − x^{1/2}(1 − x)^{1/2}) ≥ x(1 − x)/2.
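The elementary facts about A_r used above are easy to confirm numerically (a sketch over a grid of x values in (0, 1)):

```python
def A(r, x):
    # A_r(x) = x - x^{r+1} - x^{(r+1)/2} (1-x)^{(r+1)/2}
    return x - x ** (r + 1) - (x * (1.0 - x)) ** ((r + 1) / 2.0)

xs = [i / 100.0 for i in range(1, 100)]

# A_1 vanishes identically on (0, 1).
max_A1 = max(abs(A(1, x)) for x in xs)

# A_r(x) is increasing in r for fixed x, so A_2 gives the relevant lower bound.
monotone = all(A(r + 1, x) >= A(r, x) - 1e-12 for r in range(1, 8) for x in xs)

# A_2(x) = x(1-x)(1 + x - sqrt(x(1-x))) >= x(1-x)/2, since sqrt(x(1-x)) <= 1/2.
bound = all(A(2, x) >= x * (1.0 - x) / 2.0 - 1e-12 for x in xs)

print(max_A1, monotone, bound)
```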
5. Proof of Proposition 3.4
The notation of Proposition 3.4 holds throughout this section. That is, given ¿0 and a random variable X ∈ C , we dene Y =X=√vX +Z, with score function
(Y), where Z
is independent of X. As in the previous section {Hr} are the Hermite
Lemma 5.1. Given a differentiable positive probability density f with score
other x, (y − x)² is non-negative. Breaking up the region of integration into |x| ≤ |y|/2
Thus, taking the positive square root z =
ψ′ with the same property such that if X
Lemma 5.4. Fix a function ψ with lim_{R→∞} ψ(R) = 0; given a random variable, there
use the triangle inequality to break the integral into three parts:
inequality, we use Lemma 5.3, which implies that ∫ x² I(|x| ≥ z) f(x) dx ≤ ψ′(z), where ψ′(z) → 0 as z → ∞.
≤ 1 − ∫_{−z}^z f(y) dy + 2z sup_{y∈(−z,z)} |f(y) − φ_{1+τ}(y)| + 2(1 + τ)/z².
By Lemma 5.4, we deduce that f tends in L¹(dx) to φ_{1+τ}, the N(0, 1 + τ) density, as ||ρ^(Y)|| → 0, uniformly in random variables X ∈ C_ψ:

lim_{δ→0+} sup_{X∈C_ψ: ||ρ^(Y)||≤δ} ||f − φ_{1+τ}||_{L¹(dx)} = 0.
Note. The last fact gives uniform convergence of perturbed distribution functions: sup_x |P(Y ≤ x) − Φ_{1+τ}(x)| → 0 as ||ρ^(Y)|| → 0. If τ → 0, we get, using Eq. (4.2) of Brown (1982): sup_x |P(X ≤ x) − Φ(x)| → 0 as ||ρ^(Y)|| → 0. Here, and below, Φ_{1+τ} is the N(0, 1 + τ) and Φ the N(0, 1) distribution function.
Furthermore, this implies that f/φ_{1+τ} → 1 in N(0, 1 + τ) probability. Here, we regard convergence in probability of densities as convergence in probability of the random variable f(Z)/φ_{1+τ}(Z). In fact, by Lemmas 5.1 and 5.4, we know that ∫ (ρ^(Y)(x) + x/(1 + τ))² φ_{τ/2}(x) dx → 0. But convergence in L²(φ_{τ/2}(x) dx) implies convergence in N(0, τ/2) probability. As N(0, τ/2) and N(0, 1 + τ) are equivalent measures, we deduce that ρ^(Y) → −x/(1 + τ) in N(0, 1 + τ) probability.
The product of sequences convergent in N(0, 1 + τ) probability also converges. We deduce that as ||ρ^(Y)|| → 0, the ratio (ρ^(Y))² f/φ_{1+τ} → x²/(1 + τ)² in N(0, 1 + τ) probability, uniformly in C_ψ; that is, for all ε′ > 0,

lim_{δ→0+} sup_{X∈C_ψ: ||ρ^(Y)||≤δ} ∫ I(|ρ^(Y)(x)² f(x)/φ_{1+τ}(x) − x²/(1 + τ)²| ≥ ε′) φ_{1+τ}(x) dx = 0.
For a random variable X with zero mean and finite variance, write f and ρ for the density and score function of X/√v_X + Z_τ. We will show that {ρ² f/φ_{1+τ}} form a uniformly φ_{1+τ}-integrable family. We can use Lemma 3 from Barron (1986), which states that

ρ²(x) f(x) = f′²(x)/f(x) ≤ c_τ f_2(x),

where f_2 is the density of X/√v_X + Z_{2τ} and c_τ depends only on τ.
So we need only prove that {h} = {f_2/φ_{1+τ}} is a uniformly φ_{1+τ}-integrable family. Since entropy increases on convolution,

∫ h log h dφ_{1+τ} = −H(f_2) − ∫ f_2 log φ_{1+τ} dx ≤ −H(φ_{2τ}) − ∫ f_2 log φ_{1+τ} dx = L(τ).

But for all K > 0, ∫ |h| I(|h| ≥ K) dφ_{1+τ} ≤ ∫ h (log h/log K) I(|h| ≥ K) dφ_{1+τ} ≤ L/log K, so the uniform φ_{1+τ}-integrability of the density ratios follows. We therefore deduce that
lim_{δ→0+} sup_{X∈C_ψ: ||ρ^(Y)||≤δ} ∫ [ρ(x)² f(x)/φ_{1+τ}(x)] dφ_{1+τ}(x) = ∫ x²/(1 + τ)² φ_{1+τ}(x) dx = 1/(1 + τ).
Acknowledgements
Purdue University has also provided support and many helpful suggestions. The work was funded by an EPSRC grant, Award Reference Number 96000883.
References
Barron, A.R., 1986. Entropy and the Central Limit Theorem. Ann. Probab. 14, 336–342.
Blachman, N.M., 1965. The convolution inequality for entropy powers. IEEE Trans. Inform. Theory 11, 267–271.
Brown, L.D., 1982. A proof of the Central Limit Theorem motivated by the Cramér–Rao inequality. In: Kallianpur, G., Krishnaiah, P.R., Ghosh, J.K. (Eds.), Statistics and Probability: Essays in Honour of C.R. Rao. North-Holland, New York, pp. 141–148.
Carlen, E.A., Soffer, A., 1991. Entropy production by block variable summation and Central Limit Theorems. Comm. Math. Phys. 140, 339–371.
Gnedenko, B.V., Kolmogorov, A.N., 1954. Limit Distributions for Sums of Independent Random Variables. Addison-Wesley, Cambridge, MA.
Kullback, S., 1967. A lower bound for discrimination information in terms of variation. IEEE Trans. Inform. Theory 13, 126–127.
Linnik, Yu.V., 1959. An information-theoretic proof of the Central Limit Theorem with the Lindeberg Condition. Theory Probab. Appl. 4, 288–299.
Stam, A.J., 1959. Some inequalities satisfied by the quantities of information of Fisher and Shannon. Inform. and Control 2, 101–112.