Electronic Journal of Probability

Vol. 14 (2009), Paper no. 74, pages 2156–2181.
Journal URL: http://www.math.washington.edu/~ejpecp/

Normal approximation for isolated balls in an urn allocation model

Mathew D. Penrose

Department of Mathematical Sciences, University of Bath, Bath BA2 7AY, United Kingdom
[email protected]

Abstract

Consider throwing $n$ balls at random into $m$ urns, each ball landing in urn $i$ with probability $p_i$. Let $S$ be the resulting number of singletons, i.e., urns containing just one ball. We give an error bound for the Kolmogorov distance from the distribution of $S$ to the normal, and estimates on its variance. These show that if $n$, $m$ and $(p_i, 1 \le i \le m)$ vary in such a way that $\sup_i p_i = O(n^{-1})$, then $S$ satisfies a CLT if and only if $n^2 \sum_i p_i^2$ tends to infinity, and demonstrate an optimal rate of convergence in the CLT in this case. In the uniform case ($p_i \equiv m^{-1}$) with $m$ and $n$ growing proportionately, we provide bounds with better asymptotic constants. The proof of the error bounds is based on Stein's method via size-biased coupling.

Key words: Berry-Esseen bound, central limit theorem, occupancy scheme, size biased coupling, Stein's method.

AMS 2000 Subject Classification: Primary 60F05; Secondary 62E17, 60C05.
Submitted to EJP on April 8, 2009, final version accepted September 18, 2009.

1 Introduction

Consider the classical occupancy scheme, in which each of $n$ balls is placed independently at random in one of $m$ urns, with probability $p_i$ of going into the $i$th urn ($p_1 + p_2 + \cdots + p_m = 1$). If $N_i$ denotes the number of balls placed in the $i$th urn, then $(N_1, \ldots, N_m)$ has the multinomial distribution $\mathrm{Mult}(n; p_1, p_2, \ldots, p_m)$. A special case of interest is the so-called uniform case where all the $p_i$ are equal to $1/m$.
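To fix ideas, here is a minimal simulation sketch of one round of the occupancy scheme (Python with NumPy; the sizes $n$, $m$ and the seed are illustrative choices, not taken from the paper). The occupancy vector $(N_1, \ldots, N_m)$ is drawn directly as a multinomial sample.

```python
import numpy as np

rng = np.random.default_rng(0)

n, m = 1000, 500                       # illustrative sizes, not from the paper
p = np.full(m, 1.0 / m)                # uniform case: p_i = 1/m

# (N_1, ..., N_m) ~ Mult(n; p_1, ..., p_m)
N = rng.multinomial(n, p)

occupied = int(np.count_nonzero(N > 0))     # number of occupied urns
singletons = int(np.count_nonzero(N == 1))  # urns containing exactly one ball
print(occupied, singletons)
```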

A much-studied quantity is the number of occupied urns, i.e. the sum $\sum_i \mathbf{1}\{N_i > 0\}$. This quantity, scaled and centred, is known to be asymptotically normal as $n \to \infty$ in the uniform case with $m = \Theta(n)$, and a Berry-Esseen bound for the discrepancy from the normal, tending to zero at the optimum rate, was obtained for the uniform case by Englund [4] (see also [15]), and for the general (non-uniform) case, with a less explicit error bound, by Quine and Robinson [12]. More recently, Hwang and Janson [8] have obtained a local limit theorem. A variety of applications are mentioned in [8] ('coupon collector's problem, species trapping, birthday paradox, polynomial factorization, statistical linguistics, memory allocation, statistical physics, hashing schemes and so on'), and further applications of the occupancy scheme are listed in Feller (Section 1.2 of [5]) and in Gnedin et al. [6]. Also noteworthy are the monographs by Johnson and Kotz [9] and by Kolchin et al. [10]; the latter is mainly concerned with models of this type, giving results for a variety of limiting regimes for the growth of $m$ with $n$ (in the uniform case) and also in some of the non-uniform cases. There has also been recent interest in the case of infinitely many urns with the probabilities $p_i$ independent of $n$ [1; 6].

In this paper we consider the number of isolated balls, that is, the sum $\sum_i \mathbf{1}\{N_i = 1\}$. This quantity seems just as natural an object of study as the number of occupied urns, if one thinks of the model in terms of the balls rather than in terms of the urns. For example, in the well-known birthday paradox, this quantity represents the number of individuals in the group who have a unique birthday.

We mention some other applications where the number of isolated balls could be of particular interest. If the balls represent individuals and the urns represent their classification in a database according to certain characteristics (see [6]), then the number of isolated balls represents the number of individuals which can be identified uniquely from their classification. If the balls represent particles or biological individuals, and the urns represent their spatial locations, each particle occupying one of $m$ locations chosen at random, and if two or more particles sharing the same location annihilate one another, then the number of isolated balls is the number of surviving particles. If the balls represent the users of a set of communications channels at a given instant (these could be physical channels or wavebands for wireless communications), and each user randomly selects one of $m$ available channels, and two or more attempted uses of the same channel interfere with each other and are unsuccessful, then the number of isolated balls is the number of successful users.

In the uniform case, we obtain an explicit Berry-Esseen bound for the discrepancy of the number of isolated balls from the normal, tending to zero at the optimum rate when $m = \Theta(n)$. In the non-uniform case we obtain a similar result with a larger constant, also giving upper and lower bounds which show that the variance of the number of isolated balls is $\Theta(n^2\sum_i p_i^2)$. Together, these results for the non-uniform case show that provided $n\max_i p_i$ is bounded, a central limit theorem for the number of isolated balls holds if and only if $n^2\sum_i p_i^2$ tends to infinity, and give the rate of convergence to the normal for this case. The proof of the variance bounds, in Section 5, is based on martingale difference techniques and is somewhat separate from the other arguments in the paper.

Our error bounds are comparable to those obtained in [4] (in the uniform case) and [12] (in the non-uniform case) for the number of occupied urns. Our proofs, however, are entirely different. We adapt a method used recently by Goldstein and Penrose [7] for a problem in stochastic geometry (Theorem 2.1 of [7]). A more distantly related method has been used previously for Poisson approximation in the occupancy scheme; see Barbour et al. [2], pages 11 and 115.

Our method does not involve either characteristic functions, or first Poissonizing the total number of balls; in this, it differs from most of the approaches adopted in the past for normal approximation in occupancy-type schemes. As remarked in [8], 'almost all previous approaches rely, explicitly or implicitly, on the widely used Poissonization technique', and this remark also applies to [8] itself. One exception is Chatterjee [3], who uses a method not involving Poissonization to give an error bound with the optimal rate of decay (with unspecified constant) for the Kantorovich-Wasserstein distance (rather than the Kolmogorov distance, as here) between the distribution of the number of occupied urns and the normal, in the uniform case.

We believe that our approach can be adapted to the number of urns containing $k$ balls, for arbitrary fixed $k$, but this might require a significant amount of extra work, so we restrict ourselves here to the case $k = 1$.

Our approach is based on size-biased couplings. Given a nonnegative random variable $W$ with finite mean $\mu = \mathbb{E}W$, we say $W'$ has the $W$ size biased distribution if $P[W' \in dw] = (w/\mu)P[W \in dw]$, or more formally, if
$$\mathbb{E}[Wf(W)] = \mu\,\mathbb{E}f(W') \quad \text{for bounded continuous functions } f. \tag{1.1}$$
Lemma 3.1 below tells us that if one can find coupled realizations of $W$ and $W'$ which are in some sense close, then one may be able to find a good Berry-Esseen bound for $W$. It turns out that this can be done for the number of non-isolated balls.
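As an illustration of definition (1.1) (not part of the paper's argument): the sketch below samples a nonnegative $W$, here Poisson purely for concreteness, draws $W'$ by resampling the data with weights proportional to their values, and checks that $\mathbb{E}[Wf(W)] \approx \mu\,\mathbb{E}f(W')$ for a bounded test function.

```python
import numpy as np

rng = np.random.default_rng(1)

# W >= 0 with finite mean; Poisson(2) is an arbitrary concrete choice
w = rng.poisson(2.0, size=200_000).astype(float)
mu = w.mean()

# Size biased sampling: P[W' = k] proportional to k * P[W = k], approximated
# by resampling the data with weights w_i / sum(w)
w_sb = rng.choice(w, size=200_000, p=w / w.sum())

f = np.cos                        # any bounded continuous test function
print(np.mean(w * f(w)))          # E[W f(W)]
print(mu * np.mean(f(w_sb)))      # mu * E[f(W')]; should nearly agree with the above
```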

2 Results

Let $n \in \mathbb{N}$ and $m = m(n) \in \mathbb{N}$ with $m \ge 4$. Let $p^{(n)} = (p_x^{(n)}, 1 \le x \le m)$ be a probability mass function on $[m] := \{1, 2, \ldots, m\}$, with $p_x^{(n)} > 0$ for all $x \in [m]$. Let $X$ and $X_i$, $1 \le i \le n$, be independent and identically distributed random variables with probability mass function $p = p^{(n)}$ (we shall often suppress the superscript $(n)$). Define $Y = Y^{(n)}$ by
$$M_i := -1 + \sum_{j=1}^n \mathbf{1}\{X_j = X_i\}; \qquad Y := \sum_{i=1}^n \mathbf{1}\{M_i > 0\}. \tag{2.1}$$

In terms of the urn scheme described in Section 1, the probability of landing in urn $x$ is $p_x$ for each ball, $X_i$ represents the location of the $i$th ball, $M_i$ represents the number of other balls located in the same urn as the $i$th ball, and $Y$ represents the number of non-isolated balls, where a ball is said to be isolated if no other ball is placed in the same urn as it is. Thus $n - Y$ is the number of isolated balls, or in other words, the number of urns which contain a single ball.
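In code, (2.1) translates directly (a sketch; the sizes and seed are arbitrary): `counts[X] - 1` gives every $M_i$ at once, and $Y$ counts the balls with $M_i > 0$.

```python
import numpy as np

rng = np.random.default_rng(2)

n, m = 200, 100                          # illustrative sizes
p = np.full(m, 1.0 / m)                  # uniform case for concreteness

X = rng.choice(m, size=n, p=p)           # X_i = urn of ball i
counts = np.bincount(X, minlength=m)     # balls per urn
M = counts[X] - 1                        # M_i = other balls in ball i's urn, as in (2.1)
Y = int(np.sum(M > 0))                   # number of non-isolated balls
print(Y, n - Y)                          # n - Y = number of isolated balls
```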

Let $Z$ denote a standard normal random variable, and let $\Phi(t) := P[Z \le t] = (2\pi)^{-1/2}\int_{-\infty}^t \exp(-x^2/2)\,dx$. Given any random variable $W$ with finite mean $\mu_W$ and standard deviation $\sigma_W$ satisfying $0 < \sigma_W < \infty$, define
$$D_W := \sup_{t \in \mathbb{R}}\bigl|P[(W - \mu_W)/\sigma_W \le t] - \Phi(t)\bigr|,$$
the so-called Kolmogorov distance between the distribution of $W$ and the normal. We are concerned with estimating $D_Y$ (which is equal to $D_{n-Y}$).
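$D_Y$ is not available in closed form, but it can be estimated by simulation. The sketch below (uniform case; an approximation only, since it standardizes by empirical moments and uses the empirical CDF) returns a Monte Carlo estimate of the Kolmogorov distance for given $n$ and $m$.

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(3)

def simulate_Y(n, m, reps):
    """Draw `reps` independent copies of Y, the number of non-isolated balls."""
    out = np.empty(reps)
    for r in range(reps):
        X = rng.integers(0, m, size=n)
        counts = np.bincount(X, minlength=m)
        out[r] = np.sum(counts[X] > 1)
    return out

def kolmogorov_estimate(sample):
    """Sup-distance between the empirical CDF of the standardized sample and Phi."""
    z = np.sort((sample - sample.mean()) / sample.std())
    phi = np.array([0.5 * (1.0 + erf(t / sqrt(2.0))) for t in z])
    k = np.arange(1, len(z) + 1)
    return max(np.max(k / len(z) - phi), np.max(phi - (k - 1) / len(z)))

Y = simulate_Y(n=500, m=250, reps=20_000)
print(kolmogorov_estimate(Y))   # rough Monte Carlo estimate of D_Y
```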

We refer to the case where $p_x = m^{-1}$ for each $x \in [m]$ as the uniform case. Our main result for the uniform case provides a normal approximation error bound for $Y$, which is explicit modulo computation of $\mu_Y$ and $\sigma_Y$, and goes as follows.

For asymptotics in the uniform case, we allow $m = m(n)$ to vary with $n$. We concentrate on the case where $m = \Theta(n)$ (recall that $a_n = \Theta(b_n)$ means $a_n/b_n$ is bounded away from zero and infinity). In this case both $\mu_Y$ and $\sigma_Y^2$ turn out to be $\Theta(n)$ as $n \to \infty$, and thus Theorem 2.1 implies $D_Y$ is $O(n^{-1/2})$ in this regime. More formally, we have the following.

Theorem 2.2. Suppose $n$, $m$ both go to infinity in a linked manner, in such a way that $n/m \to \alpha \in (0, \infty)$.

Theorems 2.1 and 2.2 are proved in Section 4.

We now state our results for the general (non-uniform) case. Given $n$ we define the parameters
$$\|p\| := \sup_{x \in [m]}(p_x); \qquad \gamma = \gamma(n) := \max(n\|p\|, 1). \tag{2.7}$$

Theorem 2.3. It is the case that
$$D_Y \ge \min\bigl(1/6,\ (8\pi e)^{-1/2}\sigma_Y^{-1}\bigr), \tag{2.8}$$
and if
$$\|p\| \le 1/11 \tag{2.9}$$
and also
$$n \ge 83\gamma^2(1 + 3\gamma + 3\gamma^2)e^{1.05\gamma}, \tag{2.10}$$
then
$$D_Y \le 1633\gamma^2 e^{2.1\gamma}\Bigl(\sqrt{99 + 5C(\gamma)} + 6\Bigr)^2\sigma_Y^{-1}, \tag{2.11}$$
with
$$C(\gamma) := 10\bigl(82\gamma^7 + 82\gamma^6 + 80\gamma^5 + 47\gamma^4 + 12\gamma^3 + 12\gamma^2\bigr)^{1/2}. \tag{2.12}$$

It is of use in proving Theorem 2.3, and also of independent interest, to estimate the variance $\sigma_Y^2$ in terms of the original parameters $(p_x, x \in [m])$, and our next result does this. Throughout, we write $\sum_x$ for $\sum_{x=1}^m$.

Theorem 2.4. It is the case that
$$\mathrm{Var}\,Y \le 8n^2\sum_x p_x^2, \tag{2.13}$$
and if (2.9) and (2.10) hold, then
$$\mathrm{Var}\,Y \ge (7776)^{-1}\gamma^{-2}e^{-2.1\gamma}\,n^2\sum_x p_x^2. \tag{2.14}$$

If $\gamma(n)$ remains bounded, i.e. $\sup_n \gamma(n) < \infty$, then both (2.9) and (2.10) hold for large enough $n$. Hence, the following asymptotic result is immediate from Theorems 2.3 and 2.4. This result gives us, for cases with $\gamma(n)$ bounded, necessary and sufficient conditions in terms of the parameters $p_x$ for a central limit theorem to hold, and gives the rate of convergence of $D_Y$ to zero when it does hold.

Corollary 2.1. Suppose $\sup_n \gamma(n) < \infty$. Then the following three conditions are equivalent:

• $n^2\sum_x p_x^2 \to \infty$ as $n \to \infty$;

• $\sigma_Y \to \infty$ as $n \to \infty$;

• $(Y - \mathbb{E}Y)/\sigma_Y$ converges in distribution to $Z$ as $n \to \infty$.

If these conditions hold, then
$$D_Y = \Theta(\sigma_Y^{-1}) = \Theta\Bigl(\bigl(n^2\sum_x p_x^2\bigr)^{-1/2}\Bigr).$$

Remarks. In the uniform case, Theorem 2.2 provides an alternative proof of the central limit theorem for $Y$ when $m = \Theta(n)$ (see Theorem II.2.4 on page 59 of [10]), with error bounds converging to zero at the optimum rate. Corollary 2.1 shows that in the uniform case, if $n^2/m \to \infty$ and $n/m$ remains bounded, then $D_Y = \Theta((n^2/m)^{-1/2})$. Corollary 2.1 overlaps Theorem III.5.2 on page 147 of [10] but is under weaker conditions than those in [10], and provides error bounds not given in [10]. Mikhailov [11] previously provided error bounds for the non-uniform case which were non-optimal (for example, $O(n^{-1/12})$ when $\sum_x p_x^2 = \Theta(n^{-1})$).

The condition that $\gamma(n)$ remain bounded, in Corollary 2.1, is also required by [12] for the analogous Berry-Esseen type result for the number of occupied boxes, though not by [8] for the local limit theorem for that quantity. In (2.9), which is used for the non-asymptotic bounds, the bound of $1/11$ could be replaced by any constant less than $1/3$ without changing anything except the constants in (2.11) and (2.14).

As always (see remarks in [8], [13], [1], and for the uniform case [15]), it might be possible to obtain similar results to those presented here by other methods. However, to do so appears to be a non-trivial task, particularly in the non-uniform case. In [10] the count of the number of isolated balls is treated separately, and differently, from the count of occupied urns or the count of urns with $k$ balls, $k = 0$ or $k \ge 2$. Poisson approximation methods might be of use in some limiting regimes (see [2], Chapter 6), but not when the ratio between $\mathbb{E}[Y]$ and $\mathrm{Var}[Y]$ remains bounded but is not asymptotically 1, which is typically the case here.

Exact formulae can be written down for the probability mass function and cumulative distribution function of $Y$. For example, using (5.2) on page 109 of [5], the cumulative distribution of $Y$ may be written as
$$P[Y \le n - k] = P[n - Y \ge k] = \sum_{j=k}^m (-1)^{j-k}\binom{j-1}{k-1}S_j,$$
with $S_j$ a sum of probabilities that $j$ of the urns contain one ball each, i.e.
$$S_j = \sum_{x_1 < x_2 < \cdots < x_j \le m}\frac{n!}{(n-j)!}\,p_{x_1}\cdots p_{x_j}\Bigl(1 - \sum_{i=1}^j p_{x_i}\Bigr)^{n-j}.$$

We shall not use this formula in obtaining our normal approximation results.
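For tiny $n$ and $m$ the formula can be checked directly against exhaustive enumeration. The sketch below (uniform case, with sizes chosen only to make enumeration feasible; it is a check, not part of the paper's argument) computes $P[n - Y \ge k]$ both from the inclusion-exclusion display above and by brute force.

```python
import numpy as np
from itertools import product
from math import comb, perm

def S(j, n, m):
    """S_j in the uniform case: C(m,j) * n!/(n-j)! * m^(-j) * (1 - j/m)^(n-j)."""
    return comb(m, j) * perm(n, j) * float(m) ** (-j) * (1.0 - j / m) ** (n - j)

def p_at_least_k(k, n, m):
    """P[n - Y >= k] via the inclusion-exclusion formula."""
    return sum((-1) ** (j - k) * comb(j - 1, k - 1) * S(j, n, m)
               for j in range(k, m + 1))

def p_at_least_k_brute(k, n, m):
    """The same probability by enumerating all m^n equally likely allocations."""
    hits = 0
    for x in product(range(m), repeat=n):
        counts = np.bincount(np.array(x), minlength=m)
        hits += int(np.sum(counts == 1) >= k)
    return hits / m ** n

n, m = 5, 4
for k in range(1, n + 1):
    print(k, round(p_at_least_k(k, n, m), 6), round(p_at_least_k_brute(k, n, m), 6))
```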

3 Lemmas

A key tool in our proofs is the following result.

Lemma 3.1. Let $W \ge 0$ be a random variable with mean $\mu$ and variance $\sigma^2 \in (0, \infty)$, and let $W^s$ be defined on the same sample space, with the $W$-size biased distribution. If $|W^s - W| \le B$ for some $B \le \sigma^{3/2}/\sqrt{6\mu}$, then
$$D_W \le \frac{\mu}{5\sigma^2}\left(\sqrt{\frac{11B^2}{\sigma} + 5\Delta} + \frac{2B}{\sqrt{\sigma}}\right)^2 \tag{3.1}$$
$$\le \frac{2\mu}{\sigma^2}\left(\frac{3B^2}{\sigma} + \Delta\right), \tag{3.2}$$
where
$$\Delta := \sqrt{\mathrm{Var}\bigl(\mathbb{E}(W^s - W \mid W)\bigr)}. \tag{3.3}$$

Proof. The bound (3.1) is given as Lemma 3.1 of [7], where it is proved via Stein's method. The simpler bound (3.2) follows by applying the inequality $(x + y)^2 \le 2(x^2 + y^2)$, valid for all real $x, y$.

Our next lemma is concerned with the construction of variables with size-biased distributions.

Lemma 3.2. Suppose $Y$ is a random variable given by $Y = aP[A \mid \mathcal{F}]$, where $\mathcal{F}$ is some $\sigma$-algebra, $a > 0$ is a constant, and $A$ is an event with $0 < P[A] < 1$. Then $Y'$ has the $Y$ size biased distribution if $\mathcal{L}(Y') = \mathcal{L}(Y \mid A)$.

Proof. See Lemma 4.1 of [7].

Let $\mathrm{Bin}(n, p)$ denote the binomial distribution with parameters $n \in \mathbb{N}$ and $p \in (0, 1)$. The following lemma will be used for constructing the desired close coupling of our variable of interest $Y$ and its size biased version $Y'$, so as to be able to use Lemma 3.1.

Fix $\mathbf{x} \in F$. We group the urns into three 'boxes'. Let Box 1 consist of the urn containing Ball 1, and let Box 2 be the union of the urns containing Balls $2, 3, \ldots, k$; this could be the union of any number up to $k-1$ of urns depending on how many of $x_2, x_3, \ldots, x_k$ are distinct, but since we assume $\mathbf{x} \in F$, Box 2 does not overlap Box 1. Let Box 3 consist of all other urns except those in Box 1 or Box 2. For $i = 1, 2, 3$, let $N_i$ be the number of balls in Box $i$, other than Balls $1, \ldots, k$. Let $h(\ell)$ be the expected value of $\prod_{i=2}^k \psi_i(M_i)$, given $X^{(k)} = \mathbf{x}$ and given that $N_2 = \ell$. Let $P_{\mathbf{x}}$ and $\mathbb{E}_{\mathbf{x}}$ denote conditional probability and conditional expectation, given that $X^{(k)} = \mathbf{x}$. Then

$$\mathbb{E}_{\mathbf{x}}[W] = \mathbb{E}_{\mathbf{x}}\bigl[\mathbb{E}_{\mathbf{x}}[W \mid N_1, N_2]\bigr],$$
where the last line follows because $\mathbb{E}_{\mathbf{x}}\bigl[\prod_{i=2}^k \psi_i(M_i) \mid N_1, N_2\bigr]$ depends on $(N_1, N_2)$ only through $N_2$. Also, given $X^{(k)} = \mathbf{x}$, $(N_i)_{i=1}^3$ have the multinomial distribution $\mathrm{Mult}(n - k;\ 1/m,\ a/m,\ (m - a - 1)/m)$.

We give a coupling of $N_1$ to another random variable $N_1'$ with the same distribution as $M_1$ that is independent of $N_2$, for which we can give a useful bound on $P[N_1 \ne N_1']$. The idea here is that the distribution of $N_1$, given $N_2 = j$, is obtained from throwing $n - k - j$ balls conditioned to avoid Box 2; this conditioning can be effected by re-throwing those balls which land in Box 2. Hence $N_1$ with this conditioned distribution, coupled to $N_1'$ with the unconditioned distribution of $M_1$, can be obtained by first throwing $n - k - j$ balls, then throwing $j + k - 1$ extra balls to get $N_1'$, and re-throwing the balls which landed in Box 2 (and ignoring the extra balls) to get $N_1$.

To give this coupling in more detail, we imagine a colouring scheme. Consider throwing a series of coloured balls so each ball can land in one of the three boxes, where the probabilities of landing in Boxes 1, 2, 3 are $1/m$, $a/m$, $(m - a - 1)/m$ respectively. First, throw $n - k$ white balls.

Next, we adapt Lemma 3.4 to the non-uniform setting. In this case, we need to allow $\psi_i$ to depend on the location as well as the occupation number associated with the $i$th ball. Consequently, some modification of the proof is required, and the constants in Lemma 3.4 are better than those which would be obtained by simply applying the next lemma to the uniform case.

Lemma 3.5. Suppose that for $i = 1, 2, 3, 4$, $\psi_i$ is a real-valued function defined on $[m] \times \{0, 1, \ldots, n\}$.

The conditional distribution can be obtained by first throwing $n - 3 - j$ balls, then re-throwing those balls which land in $A$ (to get the conditional distribution) and throwing $3 + j$ extra balls (to get the unconditional distribution). Again we explain in more detail via a colouring scheme.

Suppose the balls in $A$ are painted white. Let the balls not in $A$ (including Ball 1 if it is not in $A$) be re-thrown (again, according to the distribution $p$). Those which land in $A$ when re-thrown are painted yellow, and the others are painted red.

Now introduce one green ball for each white ball, and if Ball 1 is white, let one of the green balls be labelled Ball G1. Throw the green balls using the same distribution $p$. Also, introduce a number of blue balls equal to the number of yellow balls, and if Ball 1 is yellow then label one of the blue balls as Ball B1. Throw the blue balls, but condition them to avoid $A$; that is, use the probability mass function $\bigl(p_x/(1 - \sum_{y \in A} p_y),\ x \in [m]\setminus A\bigr)$ for the blue balls.

Set $Z_1$ to be the location of Ball 1 (if it is white or red) or of Ball B1 (if Ball 1 is yellow). Set $Z_1'$ to be the location of Ball 1, if it is red or yellow, or the location of Ball G1 (if Ball 1 is white). Let $N_1^w$, $N_1^r$, and $N_1^b$ respectively denote the number of white, red, and blue balls at location $Z_1$, not counting Ball 1 or Ball B1 itself. Let $N_1^y$, $N_1^r$, and $N_1^g$ respectively denote the number of yellow, red, and green balls at location $Z_1'$, not counting Ball 1 or Ball G1 itself. Set
$$N_1 = N_1^w + N_1^r + N_1^b, \qquad N_1' = N_1^y + N_1^r + N_1^g.$$

Then $((Z_i, N_i))_{i=1}^4$ have the same joint distribution as $((X_i, M_i))_{i=1}^4$. Also, $(Z_1', N_1')$ has the same distribution as $(X_1, M_1)$, and $(Z_1', N_1')$ is independent of $((Z_i, N_i))_{i=2}^4$. Finally, if Ball 1 is red then $Z_1 = Z_1'$ and $N_1' = N_1 - N_1^b + N_1^g$, so that
$$P[(Z_1, N_1) \ne (Z_1', N_1')] \le \mathbb{E}[N_1^g] + \mathbb{E}[N_1^b] + 2P[Z_1' \in A]. \tag{3.10}$$
Now,
$$P[Z_1' \in A] \le \sum_{i=2}^4 P[X_1 = X_i] = 3\sum_x p_x^2. \tag{3.11}$$

Also, if $N^g$ denotes the number of green balls, not including Ball G1 (if Ball 1 is white), then by (2.7),
$$\mathbb{E}[N^g] \le 3 + 3n\|p\| \le 3(1 + \gamma),$$
and also $\mathbb{E}[N_1^g \mid N^g] \le N^g\sum_x p_x^2$, so that
$$\mathbb{E}[N_1^g] = \mathbb{E}\bigl[\mathbb{E}[N_1^g \mid N^g]\bigr] \le 3(1 + \gamma)\sum_x p_x^2. \tag{3.12}$$

If $N^y$ denotes the number of yellow balls, other than Ball 1, then by (2.7),
$$\mathbb{E}[N^y] \le 3n\|p\| \le 3\gamma, \tag{3.13}$$
and by (2.9),
$$\mathbb{E}[N_1^b \mid N^y] \le N^y\sum_x \frac{p_x^2}{(1 - 3\|p\|)^2} \le 2N^y\sum_x p_x^2,$$
so that
$$\mathbb{E}[N_1^b] = \mathbb{E}\bigl[\mathbb{E}[N_1^b \mid N^y]\bigr] \le 6\gamma\sum_x p_x^2. \tag{3.14}$$

Set $U := \prod_{i=2}^4 \psi_i(Z_i, N_i)$. By (3.10), (3.11), (3.12) and (3.14),
$$\bigl|\mathbb{E}\bigl[U(\psi_1(Z_1, N_1) - \psi_1(Z_1', N_1'))\bigr]\bigr| \le P[(Z_1, N_1) \ne (Z_1', N_1')]\,\mathrm{rng}(\psi_1)\prod_{i=2}^4 \|\psi_i\| \le (9 + 9\gamma)\,\mathrm{rng}(\psi_1)\Bigl(\prod_{i=2}^4 \|\psi_i\|\Bigr)\sum_x p_x^2. \tag{3.15}$$

Since $(Z_1', N_1')$ is independent of $((Z_i, N_i))_{i=2}^4$ with the same distribution as $(X_1, M_1)$, and $\mathbb{E}\psi_1(X_1, M_1) = 0$ by assumption, $\mathbb{E}[U\psi_1(Z_1', N_1')] = 0$. Hence,
$$\mathbb{E}\Bigl[\prod_{i=1}^4 \psi_i(X_i, M_i)\Bigr] = \mathbb{E}\bigl[U\psi_1(Z_1, N_1)\bigr] = \mathbb{E}\bigl[U(\psi_1(Z_1, N_1) - \psi_1(Z_1', N_1'))\bigr],$$
and then (3.9) follows by (3.15). The proof of (3.8) is similar, with the factors of 3 replaced by 1 in (3.11), (3.12) and (3.13).

4 Proof of Theorems 2.1 and 2.2

Proof of Theorem 2.1. Recall the definition (2.1) of $M_i$ and $Y$. Assume the uniform case, i.e. assume $p = (m^{-1}, m^{-1}, \ldots, m^{-1})$. Let $\xi_i := \mathbf{1}\{M_i > 0\}$ be the indicator of the event that ball $i$ is not isolated. Then $Y = \sum_{i=1}^n \xi_i$, and we claim that a random variable $Y'$ with the size-biased distribution of $Y$ can be obtained as follows. Let $I$ be a discrete uniform random variable over $[n]$, independent of $X_1, \ldots, X_n$. Given the value of $I$, let $X' = (X_1', \ldots, X_n') \in [m]^n$ be a random $n$-vector with $\mathcal{L}(X_1', \ldots, X_n') = \mathcal{L}(X_1, \ldots, X_n \mid \xi_I = 1)$. Set
$$Y' := \sum_{i=1}^n \mathbf{1}\bigl\{\textstyle\bigcup_{j \in [n]\setminus\{i\}}\{X_j' = X_i'\}\bigr\}.$$

Then $\mathcal{L}(Y') = \mathcal{L}(Y \mid \xi_I = 1)$, and since $Y = nP[\xi_I = 1 \mid X_1, \ldots, X_n]$, the claim follows from Lemma 3.2.

To apply Lemma 3.1 we need to find a random variable $Y''$, coupled to $Y$, such that $\mathcal{L}(Y'') = \mathcal{L}(Y')$ and for some constant $B$ we have $|Y'' - Y| \le B$ (almost surely). To check that $\mathcal{L}(Y'') = \mathcal{L}(Y')$, we shall use the fact that if $M_I'$ denotes the number of entries $X_j'$ of $X'$ that are equal to $X_I'$, other than $X_I'$ itself, then (i) given $I$ and $X_I'$, $M_I'$ has the distribution of a $\mathrm{Bin}(n-1, 1/m)$ variable conditioned to take a non-zero value, and (ii) given $I$, $X_I'$ and $M_I'$, the distribution of $X'$ is uniform over all possibilities consistent with the given values of $I$, $X_I'$ and $M_I'$.

Define the random $n$-vector $X := (X_1, \ldots, X_n)$. We can manufacture a random vector $X'' = (X_1'', \ldots, X_n'')$, coupled to $X$ and (we assert) with the same distribution as $X'$, as follows.

• Sample a value of $I$ from the discrete uniform distribution on $[n]$, independent of $X$.

• Sample a Bernoulli random variable $B$ with $P[B = 1] = \pi_{M_I}$, where $(\pi_k, k \ge 0)$ is given by (3.4) with $\nu = n - 1$ and $p = m^{-1}$. (By Lemma 3.3, $0 \le \pi_k \le 1$.)

• Sample a value of $J$ from the discrete uniform distribution on $[n]\setminus\{I\}$.

• Define $(X_1'', \ldots, X_n'')$ by
$$X_i'' = \begin{cases} X_I & \text{if } i = J \text{ and } B = 1, \\ X_i & \text{otherwise.}\end{cases}$$

Thus $X''$ is obtained from $X$ by changing a randomly selected entry of $X$ to the value of $X_I$, if $B = 1$, and leaving $X$ unchanged if $B = 0$.

We claim that $\mathcal{L}(X'') = \mathcal{L}(X')$. To see this define $N := M_I$, and set $N'' := -1 + \sum_{i=1}^n \mathbf{1}\{X_i'' = X_I''\}$. Then $N$ has the $\mathrm{Bin}(n-1, m^{-1})$ distribution, while $N''$ always takes the value either $N$ or $N + 1$, taking the latter value in the case where $B = 1$ and also $X_J \ne X_I$. Thus for any $k \in \{0, 1, \ldots, n-1\}$,
$$P[N'' > k] = P[N > k] + P[N = k]\,\pi_k\bigl(1 - k/(n-1)\bigr), \tag{4.1}$$
so by the definition (3.4) of $\pi_k$, $\mathcal{L}(N'') = \mathcal{L}(N \mid N > 0)$. This also applies to the conditional distribution of $N''$ given the values of $I$ and $X_I$.
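The definition (3.4) of $\pi_k$ does not survive in this copy of the paper, but (4.1) together with the requirement $\mathcal{L}(N'') = \mathcal{L}(N \mid N > 0)$ pins it down; the sketch below (illustrative sizes; SciPy is used for binomial probabilities) solves (4.1) for $\pi_k$, runs the resampling construction of $X''$, and checks the coupling bound $|Y'' - Y| \le 2$ on every draw.

```python
import numpy as np
from scipy.stats import binom

rng = np.random.default_rng(4)
n, m = 60, 30                                # illustrative sizes

# Solve (4.1) for pi_k, using P[N'' > k] = P[N > k | N > 0] with N ~ Bin(n-1, 1/m):
#   P[N > k] + P[N = k] pi_k (1 - k/(n-1)) = P[N > k] / P[N > 0]
k = np.arange(n - 1)
sf, pmf = binom.sf(k, n - 1, 1 / m), binom.pmf(k, n - 1, 1 / m)
p_pos = 1.0 - binom.pmf(0, n - 1, 1 / m)     # P[N > 0]
with np.errstate(divide="ignore", invalid="ignore"):
    pi = sf * (1.0 / p_pos - 1.0) / (pmf * (1.0 - k / (n - 1)))
pi = np.nan_to_num(np.clip(pi, 0.0, 1.0))    # Lemma 3.3: 0 <= pi_k <= 1

def Y_of(X):
    counts = np.bincount(X, minlength=m)
    return int(np.sum(counts[X] > 1))        # number of non-isolated balls

for _ in range(2000):
    X = rng.integers(0, m, size=n)
    I = int(rng.integers(n))
    M_I = int(np.sum(X == X[I])) - 1
    Xpp = X.copy()
    if rng.random() < pi[M_I]:               # B = 1 with probability pi_{M_I}
        J = int(rng.choice(np.delete(np.arange(n), I)))
        Xpp[J] = X[I]                        # move a random other ball to ball I's urn
    assert abs(Y_of(Xpp) - Y_of(X)) <= 2     # the bound used with Lemma 3.1 (B = 2)
```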

Given the values of $N''$, $I$ and $X_I''$, the conditional distribution of $X''$ is uniform over all possibilities consistent with these given values. Hence, $\mathcal{L}(X'') = \mathcal{L}(X')$. Therefore, setting
$$Y'' := \sum_{i=1}^n \mathbf{1}\bigl\{\textstyle\bigcup_{j \in [n]\setminus\{i\}}\{X_j'' = X_i''\}\bigr\},$$
we have that $\mathcal{L}(Y'') = \mathcal{L}(Y')$, i.e. $Y''$ has the size-biased distribution of $Y$.

The definition of $X''$ in terms of $X$ ensures that we always have $|Y - Y''| \le 2$ (with equality if $M_I = M_J = 0$); this is explained further in the course of the proof of Proposition 4.1 below. Thus we may apply Lemma 3.1 with $B = 2$. Theorem 2.1 follows from that result, along with the following:

Proposition 4.1. It is the case that $\mathrm{Var}(\mathbb{E}[Y'' - Y \mid Y]) \le \eta(n, m)$, where $\eta(n, m)$ is given by (2.4).

Proof. Let $\mathcal{G}$ be the $\sigma$-algebra generated by $X$. Then $Y$ is $\mathcal{G}$-measurable. By the conditional variance formula, as in e.g. the proof of Theorem 2.1 of [7],
$$\mathrm{Var}\bigl(\mathbb{E}[Y'' - Y \mid Y]\bigr) \le \mathrm{Var}\bigl(\mathbb{E}[Y'' - Y \mid \mathcal{G}]\bigr), \tag{4.2}$$
so it suffices to prove that
$$\mathrm{Var}\bigl(\mathbb{E}[Y'' - Y \mid \mathcal{G}]\bigr) \le \eta(n, m). \tag{4.3}$$

For $1 \le i \le n$, let $V_i$ denote the conditional probability that $B = 1$, given $X$ and given that $I = i$; that is, $V_i = \pi_{M_i}$.

Let $R_{ij}$ denote the increment in the number of non-isolated balls when the value of $X_j$ is changed to $X_i$. Then
$$\mathbb{E}[Y'' - Y \mid \mathcal{G}] = \frac{1}{n(n-1)}\sum_{(i,j):\, i \ne j} V_i R_{ij},$$
where $\sum_{(i,j):\, i \ne j}$ denotes summation over pairs of distinct integers $i, j$ in $[n]$. For $1 \le i \le n$ and $j \ne i$, let
$$S_i := \mathbf{1}\{M_i = 0\}; \qquad T_i := \mathbf{1}\{M_i = 0\} - \mathbf{1}\{M_i = 1\}; \qquad Q_{ij} := \mathbf{1}\{M_i = 1\}\mathbf{1}\{X_i = X_j\}.$$

Then we assert that $R_{ij}$, the increment in the number of non-isolated balls when ball $j$ is moved to the location of ball $i$, is given by $R_{ij} = S_i + T_j + Q_{ij}$. Indeed, if $X_i \ne X_j$ then $S_i$ is the increment (if any) due to ball $i$ becoming non-isolated, while $T_j$ is the increment (if any) due either to ball $j$ becoming non-isolated, or to another ball at the original location of ball $j$ becoming isolated when ball $j$ is moved to the location of ball $i$. The definition of $Q_{ij}$ ensures that if $X_i = X_j$ then $S_i + T_j + Q_{ij} = 0$. Thus,
$$\mathbb{E}[Y'' - Y \mid \mathcal{G}] = \frac{1}{n(n-1)}\sum_{(i,j):\, i \ne j} V_i(S_i + T_j + Q_{ij}) = \frac{1}{n}\sum_{i=1}^n V_i\tau_i + \frac{1}{n(n-1)}\sum_{(i,j):\, i \ne j} V_i T_j, \tag{4.4}$$
where we set
$$\tau_i := S_i + \frac{1}{n-1}\sum_{j:\, j \ne i} Q_{ij} = \mathbf{1}\{M_i = 0\} + \frac{1}{n-1}\mathbf{1}\{M_i = 1\}. \tag{4.5}$$

Put $a := \mathbb{E}[V_i]$ (this expectation does not depend on $i$). Then by (4.4),
$$\mathbb{E}[Y'' - Y \mid \mathcal{G}] = \frac{1}{n}\sum_{i=1}^n\bigl(V_i\tau_i + aT_i\bigr) + \sum_{(i,j):\, i \ne j}\frac{(V_i - a)T_j}{n(n-1)}.$$

Since $(x + y)^2 \le 2(x^2 + y^2)$ for any real $x, y$, it follows that
$$\mathrm{Var}\,\mathbb{E}[Y'' - Y \mid \mathcal{G}] \le 2\,\mathrm{Var}\Bigl(\frac{1}{n}\sum_{i=1}^n\bigl(V_i\tau_i + aT_i\bigr)\Bigr) + 2\,\mathrm{Var}\Bigl(\sum_{(i,j):\, i \ne j}\frac{(V_i - a)T_j}{n(n-1)}\Bigr). \tag{4.6}$$

From the definitions, the following inequalities hold almost surely:
$$-1 \le T_i \le 1; \qquad 0 \le V_i \le 1; \qquad 0 \le \tau_i \le 1; \tag{4.7}$$
and hence
$$|V_i\tau_i + aT_i| \le 2. \tag{4.8}$$
Set $Z_i := V_i\tau_i + aT_i$, and $\bar{Z}_i := Z_i - \mathbb{E}Z_i$. By (4.8), $\mathrm{Var}\,Z_1 \le \mathbb{E}Z_1^2 \le 4$. Also by (4.8), we have $|\bar{Z}_i| \le 3$; bounding the variance of the double sum in (4.6) in the same manner as with (6.25) below yields (4.3). This completes the proof of Proposition 4.1, and hence of Theorem 2.1.

Proof of Theorem 2.2. Suppose $n$, $m$ both go to infinity in a linked manner, in such a way that $n/m \to \alpha \in (0, \infty)$. Then it can be shown (see e.g. Theorem II.1.1 on pages 37-38 of [10]) that $\mathbb{E}Y \sim n(1 - e^{-\alpha})$, and
$$\mathrm{Var}(Y) \sim n\bigl(e^{-\alpha}(1 - e^{-\alpha}) + e^{-2\alpha}\alpha(1 - \alpha)\bigr) = n\,g(\alpha)^2.$$
Substituting these asymptotic expressions into (2.2) and (2.3), and using the fact that in this asymptotic regime $n\,\eta(n, m) \to 16 + 24\alpha(2 + 2\alpha)$, yields (2.5) and (2.6).
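A quick numerical sanity check of these asymptotics (a sketch; $\alpha = 2$ and the sample sizes are arbitrary choices) compares simulated moments of $Y$ in the uniform case with $n(1 - e^{-\alpha})$ and $n\,g(\alpha)^2$.

```python
import numpy as np

rng = np.random.default_rng(5)

n, m = 2000, 1000                      # so alpha = n/m = 2, an arbitrary choice
alpha = n / m
reps = 10_000

Y = np.empty(reps)
for r in range(reps):
    X = rng.integers(0, m, size=n)
    counts = np.bincount(X, minlength=m)
    Y[r] = np.sum(counts[X] > 1)       # number of non-isolated balls

ea = np.exp(-alpha)
print(Y.mean(), n * (1 - ea))                                      # EY ~ n(1 - e^-a)
print(Y.var(), n * (ea * (1 - ea) + ea**2 * alpha * (1 - alpha)))  # VarY ~ n g(a)^2
```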

5 The non-uniform case: proof of Theorem 2.4

Given any sequence $\mathbf{x} = (x_1, \ldots, x_k)$, set
$$H(\mathbf{x}) = \sum_{i=1}^k\Bigl(1 - \prod_{j \in [k]\setminus\{i\}}\bigl(1 - \mathbf{1}\{x_j = x_i\}\bigr)\Bigr), \tag{5.1}$$
which is the number of non-isolated entries in the sequence $\mathbf{x}$, so that in particular, $Y = H(X_n)$. We shall use the following consequence of Jensen's inequality: for all $k \in \mathbb{N}$,
$$(t_1 + t_2 + \cdots + t_k)^2 \le k(t_1^2 + \cdots + t_k^2), \qquad (t_1, \ldots, t_k) \in \mathbb{R}^k. \tag{5.2}$$

We shall also use several times the fact that $-t^{-1}\ln(1 - t)$ is increasing on $t \in (0, 1)$, so that if (2.9) holds, then for all $x \in [m]$ we have
$$\ln(1 - p_x) \ge 11 p_x \ln(10/11) \ge -1.05 p_x, \tag{5.3}$$
whereas $(1 - e^{-t})/t$ is decreasing on $t \in (0, \infty)$, so that by (2.7), for any $\alpha > 0$ and $x \in [m]$ we have
$$1 - e^{-\alpha np_x} \ge (1 - e^{-\alpha\gamma})(np_x/\gamma). \tag{5.4}$$

Proof of (2.13). We use Steele's variant of the Efron-Stein inequality [14]. This says, among other things, that when (as here) $X_1, \ldots, X_{n+1}$ are independent and identically distributed random variables and $H$ is a symmetric function on $\mathbb{R}^n$,
$$\mathrm{Var}\,H(X_1, \ldots, X_n) \le \frac{1}{2}\sum_{i=1}^n \mathbb{E}\bigl[(H(X_n) - H(X_{(n+1)\setminus i}))^2\bigr] = (n/2)\,\mathbb{E}\bigl[(H(X_n) - H(X_{(n+1)\setminus n}))^2\bigr].$$
Hence, by the case $k = 2$ of (5.2),
$$\mathrm{Var}\,Y \le n\bigl(\mathbb{E}[(H(X_n) - H(X_{n-1}))^2] + \mathbb{E}[(H(X_{(n+1)\setminus n}) - H(X_{n-1}))^2]\bigr) = 2n\,\mathbb{E}\bigl[(H(X_n) - H(X_{n-1}))^2\bigr].$$
With $M_j$ defined by (2.1), $H(X_n) - H(X_{n-1})$ is equal to $\mathbf{1}\{M_n \ge 1\} + \mathbf{1}\{M_n = 1\}$, so is nonnegative and bounded by $2\cdot\mathbf{1}\{M_n \ge 1\}$. Therefore,
$$\mathrm{Var}[Y] \le 8nP[M_n \ge 1] \le 8n\,\mathbb{E}M_n \le 8n^2\sum_x p_x^2.$$
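The upper bound (2.13) is easy to test numerically in a non-uniform example (a sketch with an arbitrary random choice of $p$; as expected, the bound is far from tight here).

```python
import numpy as np

rng = np.random.default_rng(6)

n, m = 300, 200
p = rng.random(m)
p /= p.sum()                              # an arbitrary non-uniform mass function

reps = 20_000
Y = np.empty(reps)
for r in range(reps):
    X = rng.choice(m, size=n, p=p)
    counts = np.bincount(X, minlength=m)
    Y[r] = np.sum(counts[X] > 1)

print(Y.var(), 8 * n**2 * np.sum(p**2))   # simulated Var Y vs the bound (2.13)
```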

Proof of (2.14). Construct a martingale as follows. Let $\mathcal{F}_0$ be the trivial $\sigma$-algebra, and for $i \in [n]$ let $\mathcal{F}_i := \sigma(X_1, \ldots, X_i)$, and write $\mathbb{E}_i$ for conditional expectation given $\mathcal{F}_i$. Define martingale differences $\Delta_i = \mathbb{E}_{i+1}Y - \mathbb{E}_i Y$. Then $Y - \mathbb{E}Y = \sum_{i=0}^{n-1}\Delta_i$, and by orthogonality of martingale differences,
$$\mathrm{Var}[Y] = \sum_{i=0}^{n-1}\mathbb{E}[\Delta_i^2] = \sum_{i=0}^{n-1}\mathrm{Var}[\Delta_i]. \tag{5.5}$$

We look for lower bounds for $\mathbb{E}[\Delta_i^2]$. Note that
$$\Delta_i = \mathbb{E}_{i+1}[W_i], \qquad \text{where } W_i := H(X_n) - H(X_{(n+1)\setminus(i+1)}). \tag{5.6}$$

Recall from (2.1) that for $i < n$, $M_{i+1}$ denotes the number of balls in the sequence of $n$ balls, other than ball $i+1$, in the same position as ball $i+1$. Similarly, define $M_{n+1}$ and $M_k^i$ (for $k \in [n+1]$) by
$$M_{n+1} := \sum_{j \in [n]}\mathbf{1}\{X_j = X_{n+1}\}; \qquad M_k^i := \sum_{j \in [i]\setminus\{k\}}\mathbf{1}\{X_j = X_k\}, \tag{5.7}$$
so that $M_{n+1}$ is the number of balls, in the sequence of $n$ balls, in the same location as ball $n+1$, while $M_k^i$ is similar to $M_k$, but defined in terms of the first $i$ balls, not the first $n$ balls.

Set $h_0(k) := \mathbf{1}\{k \ge 1\} + \mathbf{1}\{k = 1\}$. Then $H(X_n) - H(X_{n\setminus(i+1)}) = h_0(M_{i+1})$, and if $X_{n+1} \ne X_{i+1}$ then $H(X_{(n+1)\setminus(i+1)}) - H(X_{n\setminus(i+1)}) = h_0(M_{n+1})$, so that $W_i = h_0(M_{i+1}) - h_0(M_{n+1})$ in this case. For taking $\mathbb{E}_{i+1}$-conditional expectations, it is convenient to approximate $h_0(M_{i+1})$ and $h_0(M_{n+1})$ by $h_0(M_{i+1}^i)$ and $h_0(M_{n+1}^i)$ respectively. To this end, define
$$Z_i := W_i - \bigl(h_0(M_{i+1}^i) - h_0(M_{n+1}^i)\bigr). \tag{5.8}$$
Since $h_0(M_{i+1}^i)$ is $\mathcal{F}_{i+1}$-measurable, taking conditional expectations in (5.8) yields
$$h_0(M_{i+1}^i) = \mathbb{E}_{i+1}[W_i] + \mathbb{E}_{i+1}[h_0(M_{n+1}^i)] - \mathbb{E}_{i+1}[Z_i]. \tag{5.9}$$
Set $\delta := (288\gamma e^{1.05\gamma})^{-1}$. We shall show that for $i$ close to $n$, in the sense that $n - \delta n \le i \le n$, the variances of the terms on the right of (5.9), other than $\mathbb{E}_{i+1}[W_i]$, are small compared to the variance of the left hand side, essentially because $\mathbb{E}_{i+1}[h_0(M_{n+1}^i)]$ is more smoothed out than $h_0(M_{i+1}^i)$, while $P[Z_i \ne 0]$ is small when $i$ is close to $n$. These estimates then yield a lower bound on the variance of $\mathbb{E}_{i+1}[W_i]$.

First consider the left hand side $h_0(M_{i+1}^i)$. This variable takes the value 0 when $M_{i+1}^i = 0$, and takes a value at least 1 when $M_{i+1}^i \ge 1$. Hence,
$$\mathrm{Var}\bigl[h_0(M_{i+1}^i)\bigr] \ge (1/2)\min\bigl(P[M_{i+1}^i = 0],\ P[M_{i+1}^i \ge 1]\bigr). \tag{5.10}$$
For $i \le n$, by (5.3) and (2.7),
$$P[M_{i+1}^i = 0] = \sum_x p_x(1 - p_x)^i \ge \sum_x p_x(1 - p_x)^n \ge \sum_x p_x e^{-1.05np_x} \ge \gamma^{-1}e^{-1.05\gamma}\sum_x np_x^2. \tag{5.11}$$
For $i \ge (1 - \delta)n$ we have $i \ge n/2$, so by (5.4) and the fact that $\gamma \ge 1$ by (2.7),
$$P[M_{i+1}^i \ge 1] = \sum_x p_x\bigl(1 - (1 - p_x)^i\bigr) \ge \sum_x p_x\bigl(1 - e^{-np_x/2}\bigr) \ge \sum_x p_x\bigl(1 - e^{-\gamma/2}\bigr)\frac{np_x}{\gamma} \ge (1 - e^{-1/2})\gamma^{-1}\sum_x np_x^2. \tag{5.12}$$
Since $\gamma \ge 1$, and $e^{-1.05} < 1 - e^{-0.5}$, the lower bound in (5.11) is always less than that in (5.12), so combining these two estimates and using (5.10) yields
$$\mathrm{Var}\bigl[h_0(M_{i+1}^i)\bigr] \ge (1/2)\gamma^{-1}e^{-1.05\gamma}\sum_x np_x^2. \tag{5.13}$$

Now consider the second term $\mathbb{E}_{i+1}[h_0(M_{n+1}^i)]$ on the right hand side of (5.9). Combining the last two estimates on (5.14) and using assumption (2.10) bounds the variances of the two remaining terms in (5.9); rearranging this and using (5.13), (5.15), and (5.16) yields the lower bound

$$\mathrm{Var}\bigl(\mathbb{E}_{i+1}[W_i]\bigr) \ge \Bigl(\frac{1}{6} - \frac{2}{18}\Bigr)\frac{e^{-1.05\gamma}}{\gamma}\sum_x np_x^2 = \frac{e^{-1.05\gamma}}{18\gamma}\sum_x np_x^2$$
for $i \in [n - \delta n, n]$. Since the definition of $\delta$, the condition (2.10) on $n$ and the assumption $\gamma \ge 1$ guarantee that $\delta n \ge 2$, and since $\lfloor t\rfloor \ge 2t/3$ for $t \ge 2$, by (5.5) and (5.6) we have
$$\mathrm{Var}[Y] \ge \lfloor\delta n\rfloor(18\gamma e^{1.05\gamma})^{-1}\,n\sum_x p_x^2 \ge (\delta n)(27\gamma e^{1.05\gamma})^{-1}\,n\sum_x p_x^2,$$
which is (2.14).

6 Proof of Theorem 2.3

Proof of (2.8). Write $\sigma$ for $\sigma_Y$, and for $t \in \mathbb{R}$ set $F(t) := P[(Y - \mathbb{E}Y)/\sigma \le t]$. Set $z_0 := \sigma^{-1}(\lfloor\mathbb{E}Y\rfloor - \mathbb{E}Y)$, and set $z_1 := z_0 + (1 - \epsilon)\sigma^{-1}$, for some $\epsilon \in (0, 1)$. Then since $Y$ is integer-valued, $F(z_1) = F(z_0)$. On the other hand, by the unimodality of the normal density,
$$\Phi(z_1) - \Phi(z_0) \ge (1 - \epsilon)\sigma^{-1}(2\pi)^{-1/2}\exp\bigl(-1/(2\sigma^2)\bigr),$$
so that $D_Y$ is at least half the expression above. Making $\epsilon \downarrow 0$ and using the fact that $e^{-1/(2\sigma^2)} \ge e^{-1/2}$ for $\sigma \ge 1$ gives us (2.8) in the case where $\sigma \ge 1$.

When $\sigma < 1$, we can take $z_2 \le 0 \le z_3$, with $z_3 = z_2 + 1$ and $F(z_3) = F(z_2)$. By the 68-95-99.7 rule for the normal distribution, $\Phi(z_3) - \Phi(z_2) \ge 1/3$, so $D_Y \ge 1/6$, giving us (2.8) in the case where $\sigma < 1$.

So in Theorem 2.3, the difficulty lies entirely in proving the upper bound in (2.11), under assumptions (2.9) and (2.10), which we assume to be in force throughout the sequel. By (2.10) we always have $n \ge 1661$.

As before, set $h_0(k) := \mathbf{1}\{k \ge 1\} + \mathbf{1}\{k = 1\}$. Define for nonnegative integer $k$ the functions
$$h_1(k) := 1 - h_0(k) = \mathbf{1}\{k = 0\} - \mathbf{1}\{k = 1\};$$
$$h_2(k) := 2\cdot\mathbf{1}\{k = 1\} - \mathbf{1}\{k = 2\}; \qquad h_3(k) := \mathbf{1}\{k = 1\}.$$
The function $h_1(k)$ may be interpreted as the increment in the number of non-isolated balls should a ball in an urn containing $k$ other balls be removed from that urn, with the removed ball then deemed to be non-isolated itself. If $k = 0$ then the ball removed becomes non-isolated, so the increment is 1, while if $k = 1$ then the other ball in the urn becomes isolated, so the increment is $-1$.

The function $h_2(k)$ is chosen so that $h_2(k) + 2h_1(k)$ (for $k \ge 1$) is the increment in the number of non-isolated balls if two balls should be removed from an urn containing $k - 1$ other balls, with both removed balls deemed to be non-isolated. The interpretation of $h_3$ is given later.

For $x \in [m]$, let $(\pi_k(x), k \ge 0)$ be given by (3.4) with $\nu = n - 1$ and $p = p_x$. With the convention $0\cdot\pi_{-1}(x) := 0\cdot h_i(-1) := 0$, define
$$h_4(k, x) := \frac{k\,\pi_{k-1}(x)}{n-1} + \frac{(n-k-1)\,\pi_k(x)}{n-1} - 1;$$
$$h_5(k, x) := \pi_k(x)/(n-1); \qquad h_6(k) := k\,h_2(k);$$
$$h_7(k, x) := h_3(k) + h_4(k, x) - \frac{k(2 + h_4(k, x))h_1(k-1)}{n} - \frac{k\,h_5(k, x)(k-1)h_2(k-1)}{n}.$$
For $i = 0, 1, 2, 3, 6$ define $h_i(k, x) := h_i(k)$. For each $i$ define
$$\tilde{h}_i(k, x) := h_i(k+1, x)/(k+1). \tag{6.1}$$
Sometimes we shall write $\tilde{h}_i(k)$ for $\tilde{h}_i(k, x)$ when $i \in \{0, 1, 2, 3, 6\}$. Define $\|h_i\| := \sup_{k,x}|h_i(k, x)|$ and $\|\tilde{h}_i\| := \sup_{k,x}|\tilde{h}_i(k, x)|$.

Now we estimate some of the $h_i$ functions. Since $\pi_0(x) = 1$ we have $h_4(0, x) = h_7(0, x) = 0$ for all $x$, which we use later. Also, by Lemma 3.3,
$$-1 \le h_4(k, x) \le 0, \tag{6.2}$$
and $h_7(1, x) = 1 + h_4(1, x)(1 - n^{-1}) - 2/n$, so that $-1/n \le h_7(1, x) \le 1$. Also, since (2.10) implies $n \ge 1661$,
$$h_7(2, x) = h_4(2, x)(1 + 2n^{-1}) - \frac{4\pi_2(x)}{n(n-1)} + \frac{4}{n} \in [-1,\ 4/n].$$
Also, $h_3(3) = h_1(2) = 0$, so that by (6.2),
$$h_7(3, x) = \frac{6\pi_3(x)}{n(n-1)} + h_4(3, x) \in [-1, 1],$$
again since $n \ge 1661$. For $k \ge 4$, $h_7(k, x) = h_4(k, x) \in [-1, 0]$. Thus,
$$\|h_7\| \le 1; \qquad \|\tilde{h}_7\| \le 1; \tag{6.3}$$
$$\|h_0\| = 2; \qquad \|h_5\| \le (n-1)^{-1}; \qquad \|h_6\| = 2. \tag{6.4}$$

The strategy to prove Theorem 2.3 is similar to the one already used in the uniform case, but the construction of a random variable with the distribution of $Y'$, where $Y'$ is defined to have the $Y$ size biased distribution, is more complicated. As in the earlier case, by Lemma 3.2, if $I$ is uniform over $[n]$ then the distribution of the sum $Y$ conditional on $M_I > 0$ is the distribution of $Y'$. However, in the non-uniform case the conditional information that $M_I > 0$ affects the distribution of $X_I$. Indeed, for each $i$, by Bayes' theorem,
$$P[X_i = x \mid M_i > 0] = \frac{p_x\bigl(1 - (1 - p_x)^{n-1}\bigr)}{\sum_y p_y\bigl(1 - (1 - p_y)^{n-1}\bigr)} =: \hat{p}_x. \tag{6.5}$$
Therefore the conditional distribution of $(X_1, \ldots, X_n)$, given that $M_i > 0$, is obtained as follows: first sample the value of $X_i$, then $M_i$ according to the binomial $\mathrm{Bin}(n-1, p_{X_i})$ distribution conditioned to be at least one, then select a subset $\mathcal{J}$ of $[n]\setminus\{i\}$ uniformly at random from the sets of size $M_i$, let the values of $X_j$, $j \in \mathcal{J}$, be equal to $X_i$, and let the values of $X_j$, $j \notin \mathcal{J}$, be independently sampled from the distribution with the probability mass function of $X$ given that $X \ne X_i$.

Thus a random variable $Y''$, coupled to $Y$ and with the same distribution as $Y'$, can be obtained as follows. First sample $X_1, \ldots, X_n$ independently from the original distribution $p$, and set $X = (X_1, \ldots, X_n)$; then select $I$ uniformly at random from $[n]$. Then sample a further random variable $X_0$ with the probability mass function $\hat{p}$. Next, change the value of $X_I$ to that of $X_0$; next let $N$ denote the number of other values $X_j$, $j \in [n]\setminus\{I\}$, which are equal to $X_0$, and let $\pi_k = \pi_k(X_0)$ be defined by (3.4) with $\nu = n - 1$ and $p = p_{X_0}$. Next, sample a Bernoulli random variable $B$ with parameter $\pi_N$, and if $B = 1$ change the value of one of the $X_j$, $j \in [n]\setminus\{I\}$ ($j = J$, with $J$ sampled uniformly at random from all possibilities), to $X_0$. Finally, having made these changes, define $Y''$ in the same manner as $Y$ in the original sum (2.1) but in terms of the changed variables. Then $Y''$ has the same distribution as $Y'$, by a similar argument to that given around (4.1) in the uniform case.

Having defined coupled variables $Y$, $Y''$ such that $Y''$ has the $Y$ size biased distribution, we wish to use Lemma 3.1. To this end, we need to estimate the quantities denoted $B$ and $\Delta$ in that lemma. The following lemma makes a start. Let $\mathcal{G}$ be the $\sigma$-algebra generated by the value of $X$, and for $x \in [m]$ let $N_x := \sum_{i=1}^n \mathbf{1}\{X_i = x\}$ be the number of balls in urn $x$.

Lemma 6.1. It is the case that

$$|Y'' - Y| \le 3 \quad \text{a.s.}, \tag{6.6}$$
and
$$\mathbb{E}[Y'' - Y \mid \mathcal{G}] = 2 + \Bigl(\sum_x \hat{p}_x h_5(N_x, x)\Bigr)\Bigl(n^{-1}\sum_{i=1}^n h_6(M_i)\Bigr) + \Bigl(\sum_x \hat{p}_x h_7(N_x, x)\Bigr) - \Bigl(\sum_x \hat{p}_x h_4(N_x, x)\Bigr)\Bigl(n^{-1}\sum_{i=1}^n h_0(M_i)\Bigr) - \frac{2}{n}\sum_{i=1}^n h_0(M_i). \tag{6.7}$$

Proof. We have
$$\mathbb{E}[Y'' - Y \mid \mathcal{G}] = \mathbb{E}[Y'' - Y \mid X] = \sum_x \hat{p}_x\,\mathbb{E}_x[Y'' - Y \mid X], \tag{6.8}$$
where $\mathbb{E}_x[\cdot \mid X]$ is conditional expectation given the value of $X$ and given also that $X_0 = x$. The formula for $\mathbb{E}_x[Y'' - Y \mid X]$ will depend on $x$ through the value of $N_x$ and through the value of $p_x$. We distinguish between the cases where $I$ is selected with $X_I = x$ (Case I) and where $I$ is selected with $X_I \ne x$ (Case II). If $N_x = k$, then in Case I the value of $N$ on which is based the probability $\pi_N(x)$ of importing a further ball to $x$ is $k - 1$, whereas in Case II this value of $N$ is $k$. The probability of Case I occurring is $k/n$.

In each case the construction imports at least one of balls $I$ and $J$ to $x$ (note that $\pi_0(x) = 1$, so we never end up with an isolated ball at $x$), and in the right hand side the first sum comes from Case I and the other two sums come from Case II. The sums $\sum_x$ in (6.7) are over non-empty urns, so they can be expressed as sums over the balls, i.e. of the form $\sum_{i=1}^n$. We need further notation. Set $\xi_i := \hat{p}_{X_i}\tilde{h}_4(M_i, X_i)$ and $T_j := h_0(M_j)$. Set $b := \mathbb{E}[T_j]$ (this does not depend on $j$), and $\bar{T}_j := T_j - b$. Again write $\sum_{(i,j):\, i \ne j}$ for $\sum_{i=1}^n\sum_{j \in [n]\setminus\{i\}}$.

Lemma 6.2. It is the case that

(22)

Proof. As in Section 4, (4.2) holds here too, so it suffices to prove (6.9) with the left hand side replaced by $\mathrm{Var}(\mathbb{E}[Y'' - Y \mid \mathcal{G}])$. The last sum in (6.13) can be rewritten as a sum $\sum_x$ over the urns; substituting this into (6.13), and using (6.5), (6.14) and (6.15) for all $x \in [m]$, we can bound each of the resulting terms.

By Lemma 3.5,
$$3n^2\,\mathrm{Cov}(S_1, S_2) = 3n^2\,\mathbb{E}[\bar{S}_1\bar{S}_2] \le 9(1 + \gamma)\bigl(32\gamma^6/\mathrm{Var}\,Y\bigr)\bigl(16 + 16\gamma^{-2} + 4\gamma^{-4}\bigr) \le \frac{4608(\gamma^7 + \gamma^6 + \gamma^5 + \gamma^4) + 1152(\gamma^3 + \gamma^2)}{\mathrm{Var}\,Y}.$$
Combining this with (6.23) yields
$$3\,\mathrm{Var}\Bigl(\sum_{i=1}^n S_i\Bigr) = 3n\,\mathrm{Var}[S_1] + 3n(n-1)\,\mathrm{Cov}[S_1, S_2] \le \frac{100}{\mathrm{Var}\,Y}\bigl(47\gamma^7 + 47\gamma^6 + 64\gamma^5 + 47\gamma^4 + 12\gamma^3 + 12\gamma^2\bigr). \tag{6.24}$$

Consider now the double sum in (6.9). Writing $(n)_k$ for $n!/(n-k)!$, we have that
$$\mathrm{Var}\sum_{(i,j):\, i \ne j}\xi_i\bar{T}_j = (n)_4\,\mathrm{Cov}(\xi_1\bar{T}_2, \xi_3\bar{T}_4) + (n)_2\bigl(\mathrm{Var}(\xi_1\bar{T}_2) + \mathrm{Cov}(\xi_1\bar{T}_2, \xi_2\bar{T}_1)\bigr) + (n)_3\bigl(\mathrm{Cov}(\xi_1\bar{T}_2, \xi_1\bar{T}_3) + \mathrm{Cov}(\xi_2\bar{T}_1, \xi_3\bar{T}_1) + 2\,\mathrm{Cov}(\xi_1\bar{T}_2, \xi_3\bar{T}_1)\bigr). \tag{6.25}$$

For the first term of the right hand side of (6.25), observe that
$$\mathrm{Cov}(\xi_1\bar{T}_2, \xi_3\bar{T}_4) = \mathbb{E}[\xi_1\bar{T}_2\xi_3\bar{T}_4] - \mathbb{E}[\xi_1\bar{T}_2]\,\mathbb{E}[\xi_3\bar{T}_4] \le \mathbb{E}[\xi_1\bar{T}_2\xi_3\bar{T}_4],$$
and that $0 \ge \xi_i \ge -\hat{p}_{X_i}$ by (6.2), while $0 \le T_j \le 2$. So by Lemma 3.5, (6.17) and (6.18),
$$3n^2\,\mathrm{Cov}(\xi_1\bar{T}_2, \xi_3\bar{T}_4) \le 12n^2\|\hat{p}\|^2(9 + 9\gamma)\sum_x p_x^2 \le 108(1 + \gamma)\Bigl(\frac{32\gamma^6}{\mathrm{Var}\,Y}\Bigr) = (3456\gamma^6 + 3456\gamma^7)/\mathrm{Var}\,Y. \tag{6.26}$$
Now consider the last term in (6.25). By (6.20),
$$3n\,\mathrm{Cov}(\xi_1\bar{T}_2, \xi_1\bar{T}_3) \le 3n\,\mathbb{E}[\xi_1^2\bar{T}_2\bar{T}_3] \le 12n\,\mathbb{E}[\hat{p}_{X_1}^2] \le \frac{384\gamma^5}{\mathrm{Var}\,Y}, \tag{6.27}$$

while by (6.19) and (6.22),
$$3n\bigl(\mathrm{Cov}(\xi_2\bar{T}_1, \xi_3\bar{T}_1) + 2\,\mathrm{Cov}(\xi_1\bar{T}_2, \xi_3\bar{T}_1)\bigr) \le 3n\,\mathbb{E}[\bar{T}_1^2\xi_2\xi_3] + 6n\,\mathbb{E}[\xi_1\bar{T}_2\xi_3\bar{T}_1] \le 36n\,\mathbb{E}[\hat{p}_{X_1}\hat{p}_{X_2}] \le 144\gamma^4 n^{-1} \le \frac{1152\gamma^5}{\mathrm{Var}\,Y}. \tag{6.28}$$

The middle term in the right side of (6.25) is smaller; since $\mathbb{E}[\hat{p}_{X_1}\hat{p}_{X_2}] \le \mathbb{E}[\hat{p}_{X_1}^2]$ by the Cauchy-Schwarz inequality, (6.20) gives
$$3\bigl(\mathrm{Var}(\xi_1\bar{T}_2) + \mathrm{Cov}(\xi_1\bar{T}_2, \bar{T}_1\xi_2)\bigr) \le 24\,\mathbb{E}[\hat{p}_{X_1}^2] \le \frac{768\gamma^5}{n\,\mathrm{Var}\,Y},$$
and since $n \ge 1661$ by (2.10), combined with (6.25), (6.26), (6.27), and (6.28), this shows that
$$3n^{-2}\,\mathrm{Var}\sum_{(i,j):\, i \ne j}\xi_i\bar{T}_j \le \frac{100}{\mathrm{Var}\,Y}\bigl(35\gamma^7 + 35\gamma^6 + 16\gamma^5\bigr). \tag{6.29}$$

Also, by (6.22) we obtain
$$12(n-1)^{-2} \le \frac{12}{1660}\Bigl(\frac{n}{n-1}\Bigr)n^{-1} \le \frac{96 \times 1661}{1660^2}\cdot\frac{\gamma}{\mathrm{Var}\,Y} \le \frac{\gamma}{10\,\mathrm{Var}\,Y}.$$
Combining this with (6.9), (6.24) and (6.29) yields
$$\mathrm{Var}\bigl(\mathbb{E}[Y'' - Y \mid Y]\bigr) \le \frac{100}{\mathrm{Var}\,Y}\bigl(82(\gamma^7 + \gamma^6) + 80\gamma^5 + 47\gamma^4 + 12(\gamma^3 + \gamma^2)\bigr).$$

Proof of Theorem 2.3. It remains to prove (2.11). By (6.14),
$$\mathbb{E}Y = n\sum_x p_x\bigl(1 - (1 - p_x)^{n-1}\bigr) \le 1.05n^2\sum_x p_x^2,$$
so that by (2.14),
$$\mu_Y/\sigma_Y^2 \le 8165\gamma^2 e^{2.1\gamma}. \tag{6.30}$$
We can apply (3.1) from Lemma 3.1 with $B = 3$, by (6.6). By (6.30) and Lemma 6.3, this gives
$$D_Y \le \Bigl(\frac{8165\gamma^2 e^{2.1\gamma}}{5}\Bigr)\left(\sqrt{\frac{99 + 5C(\gamma)}{\sigma_Y}} + \frac{6}{\sqrt{\sigma_Y}}\right)^2,$$
which yields (2.11).

Acknowledgement. It is a pleasure to thank the Institut für Stochastik, Universität Karlsruhe, for their hospitality during the preparation of this manuscript, and Norbert Henze for valuable comments on an earlier draft.

References

[1] Barbour, A. D. and Gnedin, A. V. (2009). Small counts in the infinite occupancy scheme. Electron. J. Probab. 14, 365–384. MR2480545

[2] Barbour, A. D., Holst, L. and Janson, S. (1992). Poisson Approximation. Oxford University Press, New York. MR1163825

[3] Chatterjee, S. (2008). A new method of normal approximation. Ann. Probab. 36, 1584–1610. MR2435859

[4] Englund, G. (1981). A remainder term estimate for the normal approximation in classical occupancy. Ann. Probab. 9, 684–692. MR0624696

[6] Gnedin, A., Hansen, B. and Pitman, J. (2007). Notes on the occupancy problem with infinitely many boxes: general asymptotics and power laws. Probab. Surv. 4, 146–171. MR2318403

[7] Goldstein, L. and Penrose, M. D. (2008). Normal approximation for coverage models over binomial point processes. To appear in Ann. Appl. Probab. Arxiv: 0812.3084v2.

[8] Hwang, H.-K. and Janson, S. (2008). Local limit theorems for finite and infinite urn models. Ann. Probab. 36, 992–1022. MR2408581

[9] Johnson, N. L. and Kotz, S. (1977). Urn Models and their Application: An Approach to Modern Discrete Probability Theory. John Wiley & Sons, New York. MR0488211

[10] Kolchin, V. F., Sevast'yanov, B. A. and Chistyakov, V. P. (1978). Random Allocations. Winston, Washington D.C. MR0471016

[11] Mikhailov, V. G. (1981). The central limit theorem for a scheme of independent allocation of particles by cells. (Russian) Number theory, mathematical analysis and their applications. Trudy Mat. Inst. Steklov. 157, 138–152. MR0651763

[12] Quine, M. P. and Robinson, J. (1982). A Berry-Esseen bound for an occupancy problem. Ann. Probab. 10, 663–671. MR0659536

[13] Quine, M. P. and Robinson, J. (1984). Normal approximations to sums of scores based on occupancy numbers. Ann. Probab. 12, 794–804. MR0744234

[14] Steele, J. M. (1986). An Efron-Stein inequality for non-symmetric statistics. Ann. Statist. 14, 753–758. MR0840528
