(1)

Topic 7: Inequalities and Limit Theorems

Rohini Somanathan

Course 003, 2018

(2)

The Inference Problem

So far, our starting point has been a given probability space (S,F,P).

In the remainder of this course, we will make inferences about the probability space by analyzing a sample of outcomes. This is known as statistical inference.

Statistical inference involves both the estimation of population parameters and testing hypotheses about them.

We focus on parametric inference - we make assumptions about the probability distributions from which our sample is drawn (for example, each sample observation represents an outcome of a normally distributed random variable with unknown mean and unit variance).

When no such assumptions are made, inference is nonparametric.

(3)

Defining a Statistic

Definition: Any real-valued function T =r(X1, . . . ,Xn) is called a statistic.

A statistic is a random variable, but not all functions of random variables are statistics: a statistic may depend only on the observable sample values X_1, . . . , X_n, not on unknown parameters.

In which of the following examples is Y a statistic?

Y = (X − µ)/σ

Y = Σ_{i=1}^n X_i

Y = Σ_{i=1}^n X_i²

Statistics are useful because they have sample counterparts.

In a problem of estimating an unknown parameter, θ, our estimator will be a statistic whose value can be regarded as an estimate of θ.

It turns out that for large samples, the distributions of some statistics, such as the sample mean, are well-known.
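As a quick illustration of the distinction above, here is a minimal Python sketch (numpy assumed available; the sample and the parameter values are purely illustrative): the two sums can be computed from the data alone and are statistics, while the standardized mean requires the unknown µ and σ.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=50)   # an observed sample X_1, ..., X_n

# Statistics: functions of the observed sample only.
sum_x = x.sum()            # Y = sum of the X_i
sum_x_sq = (x ** 2).sum()  # Y = sum of the X_i^2

# Not a statistic: depends on the unknown parameters mu and sigma
# (set here only so the line can run).
mu, sigma = 2.0, 1.0
standardized = (x.mean() - mu) / sigma

print(sum_x, sum_x_sq, standardized)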

(4)

Markov’s Inequality

When we don't know the distribution of a random variable, we can estimate probabilities either through simulation or by distribution-free bounds such as those provided by the following inequalities.

Markov's Inequality: Let X be a random variable with density function f(x) such that P(X ≥ 0) = 1. Then for any given number a > 0,

P(X ≥ a) ≤ E(X)/a

Proof (for discrete distributions). E(X) = Σ_x x f(x) = Σ_{x < a} x f(x) + Σ_{x ≥ a} x f(x). All terms in these summations are non-negative by assumption, so we have

E(X) ≥ Σ_{x ≥ a} x f(x) ≥ Σ_{x ≥ a} a f(x) = a P(X ≥ a)

The inequality holds trivially for a ≤ E(X), since then E(X)/a ≥ 1; it is useful for bounding probabilities in the tails. For example, if the mean of X is 1, the probability of X taking values of 100 or more is at most .01.
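A minimal Python sketch of this bound (numpy assumed available; the exponential distribution and the cutoff a = 100 are illustrative choices, not part of the slides): for a non-negative variable with mean 1, the simulated tail probability P(X ≥ 100) is far below the Markov bound of .01.

import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=1_000_000)  # non-negative draws with E(X) = 1

a = 100.0
markov_bound = x.mean() / a      # approximately E(X)/a = 0.01
tail_prob = (x >= a).mean()      # simulated P(X >= 100), essentially 0
print(markov_bound, tail_prob)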

(5)

Chebyshev’s Inequality

Chebyshev's Inequality: Let X be a random variable with finite variance σ² and mean µ. Then, for every a > 0,

P(|X − µ| ≥ a) ≤ σ²/a²

or equivalently,

P(|X − µ| < a) ≥ 1 − σ²/a²

Proof. Use Markov’s inequality with Y = (X−µ)2 and use a2 in place of the constant a. Then Y takes only non-negative values and E(Y) = Var(X) = σ2.

In particular, this tells us that for any random variable, the probability that values taken by the variable will be more than 2 standard deviations away from the mean cannot exceed 1/4, and more than 3 standard deviations away cannot exceed 1/9:

P(|X − µ| ≥ 3σ) ≤ 1/9

For most distributions, this upper bound is considerably higher than the actual probability of this event. What are these probabilities for a Normal distribution?
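For reference, a minimal Python sketch (scipy assumed available) computing these exact normal probabilities next to the Chebyshev bounds: for a normal variable the 2σ and 3σ tail probabilities are roughly .0455 and .0027, well below 1/4 and 1/9.

from scipy.stats import norm

for k in (2, 3):
    exact = 2 * norm.sf(k)   # P(|X - mu| >= k*sigma) for a normal random variable
    bound = 1 / k**2         # the distribution-free Chebyshev bound
    print(k, round(exact, 4), round(bound, 4))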

This inequality will be used in the proof of the law of large numbers, one of our two main large sample theorems.

(6)

Probability bounds: an example

Chebyshev's Inequality can, in principle, be used to compute bounds on the probabilities of certain events. We will not use it if we know the distribution of the random variable (knowing the distribution allows us to compute the actual probabilities of tail events).

Example:

Let the density function of X be f(x) = (1/(2√3)) I_(−√3, √3)(x). In this case µ = 0 and σ² = (2√3)²/12 = 1. If a = 3/2, then

Pr(|X − µ| ≥ 3/2) = Pr(|X| ≥ 3/2) = 1 − ∫_{−3/2}^{3/2} (1/(2√3)) dx = 1 − √3/2 ≈ .13

Chebyshev's inequality gives us σ²/a² = 4/9, which is much higher.

If a = 2, the exact probability is 0, while our bound is 1/4.
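A minimal Python sketch checking both cases numerically (numpy assumed available; variable names are illustrative): for the uniform density on (−√3, √3), the exact tail probability is compared with the Chebyshev bound σ²/a².

import numpy as np

sigma2 = 1.0
half_width = np.sqrt(3)   # X is uniform on (-sqrt(3), sqrt(3)), so mu = 0 and variance 1

for a in (1.5, 2.0):
    exact = 1 - min(a, half_width) / half_width   # P(|X| >= a); zero once a > sqrt(3)
    bound = sigma2 / a**2                          # Chebyshev bound
    print(a, round(exact, 3), round(bound, 3))     # 0.134 vs 0.444, then 0.0 vs 0.25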

(7)

The sample mean

A particularly important statistic is the sample mean:

X̄_n = (1/n)(X_1 + · · · + X_n)

Recall its mean and variance:

E(X̄_n) = (1/n) Σ_{i=1}^n E(X_i) = (1/n) · nµ = µ

Var(X̄_n) = (1/n²) Var(Σ_{i=1}^n X_i) = (1/n²) Σ_{i=1}^n Var(X_i) = (1/n²) · nσ² = σ²/n

This tells us that the sample mean has the following properties:

has the same expectation as that of the population.

is more concentrated around its mean value µ than was the original distribution.

its variance is decreasing in sample size, n.

(8)

Sample size and precision of the sample mean

We can use Chebyshev's Inequality to ask how big a sample we should take if we want to ensure a certain level of precision when using the sample mean to estimate the population mean.

Suppose the random sample is picked from a distribution with unknown mean and variance equal to 4.

We want to ensure an estimate which is within 1 unit of the real mean with probability .99.

So we want Pr(|X̄_n − µ| ≥ 1) ≤ .01.

Applying Chebyshev's Inequality we get Pr(|X̄_n − µ| ≥ 1) ≤ 4/n. Since we want 4/n = .01, we take n = 400.

Note: We have not used any information on the distribution of X̄_n and this is probably much larger than what we actually need. That's the trade-off between the more efficient parametric procedures and more robust non-parametric ones.
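A minimal Python sketch of this calculation (scipy assumed available; the normal-approximation comparison is an added illustration of the note above, not part of the slides): the Chebyshev requirement gives n = 400, while treating X̄_n as approximately normal would require far fewer observations.

import math
from scipy.stats import norm

sigma2, eps, alpha = 4.0, 1.0, 0.01

# Distribution-free requirement from Chebyshev: sigma^2 / (n * eps^2) <= alpha.
n_chebyshev = math.ceil(sigma2 / (alpha * eps**2))       # 400

# If X-bar is treated as approximately normal: 2*P(Z >= eps*sqrt(n)/sigma) <= alpha.
z = norm.ppf(1 - alpha / 2)
n_normal = math.ceil(sigma2 * z**2 / eps**2)             # roughly 27

print(n_chebyshev, n_normal)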

(9)

Convergence of Real Sequences

We would like our estimators to be well-behaved. What does this mean?

One desirable property of an estimator is that our estimates get closer to the parameter that we are trying to estimate as our sample gets larger. We’re going to make precise this notion of getting closer.

Recall that a sequence is just a function from the set of natural numbers N to any set A (Examples: y_n = 2n, y_n = 1/n)

A real number sequence {y_n} converges to y if for every ε > 0, there exists N(ε) for which n ≥ N(ε) implies |y_n − y| < ε. In such a case, we say that {y_n} → y. Which of the above sequences converge?

If we have a sequence of functions {fn}, the sequence is said to converge to a function f if fn(x) →f(x) for all x in the domain of f.

In the case of matrices, a sequence of matrices converges if each of the sequences formed by the (i,j)th elements converges, i.e. Y_n[i,j] → Y[i,j].

(10)

Sequences of Random Variables

In a sequence of random variables, each term is a random variable. Examples: Y_n = (1/n) Σ_{i=1}^n X_i, where X_i ∼ N(µ, σ²), or Y_n = Σ_{i=1}^n X_i, where X_i ∼ Bernoulli(p).

We need to modify our notion of convergence, since the sequence {Y_n} no longer defines a given sequence of real numbers, but rather many different real number sequences, depending on the realizations of X_1, . . . , X_n.

Convergence questions can no longer be verified unequivocally since we are not referring to a given real sequence, but they can be assigned a probability of occurrence based on the probability space for the random variables involved.

The most relevant random variable sequence for us is one in which the nth element is our estimator for a sample size n. We want to know whether this estimator converges to the parameter we are estimating.

There are several types of random variable convergence discussed in the literature. We’ll focus on two of these:

– Convergence in Distribution
– Convergence in Probability

(11)

Convergence in Distribution

Definition: Let {Y_n} be a sequence of random variables, and let {F_n} be the associated sequence of cumulative distribution functions. If there exists a cumulative distribution function F such that F_n(y) → F(y) at every point y where F is continuous, then F is called the limiting CDF of {Y_n}. Denoting by Y the r.v. with the distribution function F, we say that Y_n converges in distribution to the random variable Y and denote this by Y_n −→d Y.

The notation Yn −→d F is also used to denote Yn −→d Y ∼ F

Convergence in distribution holds if there is convergence in the sequence of densities (f_n(y) → f(y)) or in the sequence of MGFs (M_{Y_n}(t) → M_Y(t)). In some cases, it may be easier to use these to show convergence in distribution.

Result: Let X_n −→d X, and let g(·) be a continuous function. Then g(X_n) −→d g(X).

Example: Suppose Zn −→d Z ∼ N(0, 1), then 2Zn +5 −→d 2Z+5 ∼ N(5, 4) (why?)

(12)

Convergence in Probability

This concept formalizes the idea that we can bring the outcomes of the random variable Yn arbitrarily close to the outcomes of the random variable Y for large enough n.

Definition: The sequence of random variables {Y_n} converges in probability to the random variable Y iff

lim_{n→∞} P(|Y_n − Y| < ε) = 1 for every ε > 0

We denote this by Yn −→p Y or plimYn =Y. This justifies using outcomes of Y as an approximation for outcomes of Yn since the two are very close for large n.

Notice that convergence in distribution is a statement about the distribution functions of Y_n and Y, whereas convergence in probability is a statement about the joint distribution of the outcomes y_n and y.

Distribution functions of very different experiments may be the same: an even number on a fair die and a head on a fair coin have the same distribution function, but the outcomes of these random variables are unrelated.

Convergence in probability is therefore the stronger notion: Y_n −→p Y implies Y_n −→d Y, but not conversely. In the special case where Y_n −→d c, a constant, we also have Y_n −→p c and the two are equivalent.

(13)

The Weak Law of Large Numbers

Consider now the convergence of the random variable sequence whose nth term is given by:

X̄_n = (1/n) Σ_{i=1}^n X_i

WLLN: Let {X_n} be a sequence of i.i.d. random variables with finite mean µ and variance σ². Then X̄_n −→p µ.

Proof. Using Chebyshev's Inequality,

P(|X̄_n − µ| < ε) ≥ 1 − σ²/(nε²)

Hence

lim_{n→∞} P(|X̄_n − µ| < ε) = 1, or plim X̄_n = µ

The WLLN will allow us to use the sample mean as an estimate of the population mean, under very general conditions.
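A minimal Python simulation of the WLLN (numpy assumed available; Bernoulli(0.3) is an illustrative choice): as n grows, the sample mean settles near the population mean p = 0.3.

import numpy as np

rng = np.random.default_rng(0)
p = 0.3
for n in (10, 100, 10_000, 1_000_000):
    x = rng.binomial(1, p, size=n)   # n i.i.d. Bernoulli(p) draws
    print(n, x.mean())               # sample means approach p = 0.3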

(14)

Central Limit Theorems

Central limit theorems specify conditions under which sequences of random variables converge in distribution to known families of distributions.

These are very useful in deriving asymptotic distributions of test statistics whose exact distributions are either cumbersome or difficult to derive.

There are a large number of theorems which vary by the assumptions they place on the random variables (dependent or independent, identically or non-identically distributed).

The Lindeberg-Lévy CLT: Let {X_n} be a sequence of i.i.d. random variables with E(X_i) = µ and Var(X_i) = σ² ∈ (0, ∞) ∀ i. Then

(X̄_n − µ)/(σ/√n) = √n(X̄_n − µ)/σ −→d N(0, 1)

In words: Whenever a random sample of size n (large) is taken from any distribution with mean µ and variance σ², the sample mean X̄_n will have a distribution which is approximately normal with mean µ and variance σ²/n.
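A minimal Python simulation of this result (numpy assumed available; the exponential distribution, n and the repetition count are illustrative): the standardized sample mean behaves like a standard normal variable even though the underlying distribution is highly skewed.

import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, reps = 1.0, 1.0, 500, 100_000   # exponential(1) has mean 1 and sd 1

samples = rng.exponential(scale=1.0, size=(reps, n))
z = np.sqrt(n) * (samples.mean(axis=1) - mu) / sigma   # standardized sample means

print(z.mean(), z.std())            # close to 0 and 1
print((np.abs(z) <= 1.96).mean())   # close to 0.95, as for a N(0, 1) variable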

(15)

Lindeberg-Lévy CLT: applications

Approximating Binomial Probabilities via the Normal Distribution: Let {X_n} be a sequence of i.i.d. Bernoulli(p) random variables. Then, by the LLCLT:

(X̄_n − p)/√(p(1 − p)/n) = (Σ_{i=1}^n X_i − np)/√(np(1 − p)) −→d N(0, 1), and Σ_{i=1}^n X_i is asymptotically N(np, np(1 − p))

In this case, Σ_{i=1}^n X_i is of the form aZ_n + b with a = √(np(1 − p)) and b = np, and since Z_n converges to Z in distribution, the asymptotic distribution of Σ_{i=1}^n X_i is normal with the mean and variance given above (based on our results on normally distributed variables).
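A minimal Python check of this approximation (scipy assumed available; n, p and the cutoff are illustrative values): the exact binomial tail probability is close to the tail probability of N(np, np(1 − p)).

import math
from scipy.stats import binom, norm

n, p, k = 400, 0.3, 135
exact = binom.sf(k, n, p)                                          # P(sum X_i > 135)
approx = norm.sf(k, loc=n * p, scale=math.sqrt(n * p * (1 - p)))   # normal approximation
print(exact, approx)   # the two are close for large n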

Approximating χ² Probabilities via the Normal Distribution: Let {X_n} be a sequence of i.i.d. chi-square random variables with 1 degree of freedom. Using the additivity property of variables with gamma distributions, we have Σ_{i=1}^n X_i ∼ χ²_n. Recall that the mean of a gamma distribution is α/β and its variance is α/β². For a χ²_n random variable, α = n/2 and β = 1/2, so its mean is n and its variance is 2n. Then, by the LLCLT:

(X̄_n − 1)/√(2/n) = (Σ_{i=1}^n X_i − n)/√(2n) −→d N(0, 1), and Σ_{i=1}^n X_i is asymptotically N(n, 2n)
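A minimal Python check of this approximation (scipy assumed available; the degrees of freedom and the cutoff are illustrative values): for large n, tail probabilities of a χ²_n variable are close to those of N(n, 2n).

import math
from scipy.stats import chi2, norm

n, q = 200, 230
exact = chi2.sf(q, df=n)                              # P(chi-square_n > 230)
approx = norm.sf(q, loc=n, scale=math.sqrt(2 * n))    # normal approximation
print(exact, approx)   # the two are close for large n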
