CHAPTER 2 Statistics
2.3 Inferring population parameters from sample parameters
Thus far we have focused on statistics that describe a sample in various ways. A sample, however, is usually only a subset of the population. Given the statistics of a sample, what can we infer about the corresponding population parameters? If the sample is small or if the population is intrinsically highly variable, then there is not much we can say about the population.
However, if the sample is large, there is reason to hope that the sample statistics are a good approximation to the population parameters. We now quantify this intuition.
Our point of departure is the central limit theorem, which states that the sum of n independent random variables, for large n, is approximately normally distributed (see Section 1.7.5 on page 30). Suppose we collect a set of m samples, each with n elements, from some population. (In the rest of the discussion we will assume that n is large enough that the central limit theorem applies.) If the elements of each sample are independently and randomly selected from the population, we can treat the sum of the elements of each sample as the sum of n independent and identically distributed random variables X1, X2,..., Xn. That is, the first element of the sample is the value assumed by the random variable X1, the second element is the value assumed by the random variable X2, and so on. From the central limit theorem, the sum of these random variables is normally distributed. The mean of each sample is this sum divided by the constant n, so the mean of each sample is also normally distributed. This fact allows us to determine a range of values in which, with high confidence, the population mean can be expected to lie.
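This behavior is easy to check numerically. The sketch below (standard-library Python; the exponential population and the sample sizes are illustrative assumptions, not from the text) draws many samples from a decidedly non-normal population and shows that the sample means cluster around the population mean μ with standard deviation close to σ/√n:

```python
import random
import statistics

random.seed(42)

# Hypothetical population: exponential with mean 1 and standard
# deviation 1 -- far from normally distributed.
m, n = 2000, 100  # m samples, each with n elements

# Record the mean of each sample.
sample_means = []
for _ in range(m):
    sample = [random.expovariate(1.0) for _ in range(n)]
    sample_means.append(sum(sample) / n)

# The sampling distribution of the mean is centered near mu = 1.0
# and has standard deviation near sigma/sqrt(n) = 1/sqrt(100) = 0.1,
# even though the population itself is skewed.
mean_of_means = statistics.mean(sample_means)
sd_of_means = statistics.stdev(sample_means)
```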
To make this more concrete, refer to Figure 3 and consider sample 1. The mean of this sample is x̄1 = (1/n)Σi x1i. Similarly, x̄2 = (1/n)Σi x2i, and, in general, x̄k = (1/n)Σi xki. Define the random variable X̄ as taking on the values x̄1, x̄2, ..., x̄m. The distribution of X̄ is called the sampling distribution of the mean. From the central limit theorem, X̄ is approximately normally distributed. Moreover, if the elements are drawn from a population with mean μ and variance σ², we have already seen that E(X̄) = μ (Equation 6) and V(X̄) = σ²/n (Equation 9). These are, therefore, the parameters of the corresponding normal distribution, i.e., X̄ ~ N(μ, σ²/n). Of course, we do not know the true values of μ and σ².
If we know σ², we can estimate a range of values in which μ will lie, with high probability, as follows. For any normally distributed random variable Y ~ N(μY, σY²), we know that 95% of the probability mass lies within 1.96 standard deviations of the mean, and 99% lies within 2.576 standard deviations of the mean. So, for any value y:
P(μY - 1.96σY < y < μY + 1.96σY) = 0.95 (EQ 16)
The left and right endpoints of this range are called the critical values at the 95% confidence level: an observation will lie beyond the critical values, assuming that the true mean is μY, in less than 5% of observed samples (less than 1% at the 99% level). This can be rewritten as:
P(|μY - y| < 1.96σY) = 0.95 (EQ 17)
Therefore, from symmetry of the absolute value:
P(y - 1.96σY < μY < y + 1.96σY) = 0.95 (EQ 18)
In other words, given any value y drawn from a normal distribution with mean μY, we can estimate a range of values in which μY lies with high probability (e.g., 95% or 99%). This is called the confidence interval for μY.
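A quick simulation illustrates this inversion: if we build the interval y ± 1.96σY around each observed y, the interval contains μY in roughly 95% of trials. (The population parameters below are arbitrary illustrative choices.)

```python
import random

random.seed(7)

# Illustrative parameters for a normal random variable Y.
mu_Y, sigma_Y = 10.0, 3.0

# For each trial, draw one value y and ask whether the interval
# y +/- 1.96 * sigma_Y contains the true mean mu_Y.
trials = 20000
covered = 0
for _ in range(trials):
    y = random.gauss(mu_Y, sigma_Y)
    if y - 1.96 * sigma_Y < mu_Y < y + 1.96 * sigma_Y:
        covered += 1

coverage = covered / trials  # should be close to 0.95
```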
We just saw that X̄ ~ N(μ, σ²/n). Therefore, given the sample mean x̄:
P(x̄ - 1.96σ/√n < μ < x̄ + 1.96σ/√n) = 0.95 (EQ 19)
DRAFT - Version 3 -Measures of variability
and
P(x̄ - 2.576σ/√n < μ < x̄ + 2.576σ/√n) = 0.99 (EQ 20)
Given the sample mean and σ², this allows us to compute the range of values in which the population mean will lie with 95% or 99% confidence.
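When σ² is known, this computation is direct. The helper below is a minimal sketch (the function name and the example numbers are illustrative, not from the text):

```python
import math

def mean_confidence_interval(sample_mean, sigma2, n, z=1.96):
    """Normal-theory confidence interval for the population mean
    when the population variance sigma2 is known.
    z = 1.96 gives a 95% interval; z = 2.576 gives a 99% interval."""
    half_width = z * math.sqrt(sigma2 / n)
    return (sample_mean - half_width, sample_mean + half_width)

# Illustrative numbers: sample mean 2.0, sigma^2 = 4, n = 100.
lo95, hi95 = mean_confidence_interval(2.0, 4.0, 100)
lo99, hi99 = mean_confidence_interval(2.0, 4.0, 100, z=2.576)
```

Note that the interval narrows as n grows, since the half-width shrinks like 1/√n.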
Note that a confidence interval is constructed from the observations in such a way that there is a known probability, such as 95% or 99%, of its containing the population parameter of interest. It is not the population parameter that is the random variable; the interval itself is the random variable.
FIGURE 4. Population and sample mean distributions
The situation is illustrated graphically in Figure 4. Here, we assume that the population is normally distributed with mean 1.
The variance of the sampling distribution (i.e., of X̄) is σ²/n, so it has a narrower spread than the population, with the spread decreasing as we increase the number of elements in each sample. A randomly chosen sample happens to have a mean of 2.0.
This mean is the value assumed by a random variable X̄ whose distribution is the sampling distribution of the mean; from this single sample, its expected value is estimated as 2 (if we were to take more samples, the means of these samples would converge towards the true mean of the sampling distribution). The double-headed arrow around x̄ in the figure indicates a confidence interval in which the population mean must lie with high probability.
In almost all practical situations, we do not know σ². All is not lost, however. Recall that an unbiased estimator for σ² is Σ(xi - x̄)²/(n-1), summed over i = 1 to n (Equation 15). Therefore, assuming that this estimator is of good quality (in practice, this is true when n > ~20), X̄ ~ N(μ, Σ(xi - x̄)²/(n(n-1))). Thus, when n is sufficiently large, we can still compute the confidence interval in which the population mean lies with high probability.
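When σ² must be estimated, we can plug the unbiased sample variance into the same interval computation. A minimal sketch (the function name and the data are hypothetical; note that Python's `statistics.variance` already divides by n - 1, matching Equation 15):

```python
import math
import statistics

def mean_ci_estimated_sigma(data, z=1.96):
    """Confidence interval for the population mean, substituting the
    unbiased sample variance for the unknown sigma^2.
    Reasonable when n is larger than about 20."""
    n = len(data)
    xbar = statistics.mean(data)
    s2 = statistics.variance(data)      # unbiased: divides by n - 1
    half_width = z * math.sqrt(s2 / n)  # z * estimated sd of the sample mean
    return (xbar - half_width, xbar + half_width)

# Hypothetical data set of 50 values with mean 3.
data = [1, 2, 3, 4, 5] * 10
lo, hi = mean_ci_estimated_sigma(data)
```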
EXAMPLE 5: CONFIDENCE INTERVALS
Consider the data values in Table 1 on page 42. What are the 95% and 99% confidence intervals in which the population mean lies?
Solution:
We will ignore the fact that n = 17 < 20, so that the central limit theorem is not likely to apply. Pressing on, we find the sample mean is 2.88. We compute Σ(xi - x̄)² as 107.76. Therefore, the variance of the sampling distribution of the mean is estimated as 107.76/(17*16) = 0.396, and the standard deviation of this distribution is estimated as its square root, i.e., 0.63.
Using ±1.96σ for the 95% confidence interval and ±2.576σ for the 99% confidence interval, the 95% confidence interval is [2.88 - 1.96*0.63, 2.88 + 1.96*0.63] = [1.65, 4.11] and the 99% confidence interval is [1.26, 4.5].
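The arithmetic of this example can be checked directly from its summary statistics (the raw values are in Table 1, which is not reproduced here):

```python
import math

# Summary statistics from Example 5: n = 17, sample mean 2.88,
# sum of squared deviations 107.76.
n, xbar, ss = 17, 2.88, 107.76

# Estimated variance of the sampling distribution of the mean,
# and its standard deviation (approximately 0.63).
var_of_mean = ss / (n * (n - 1))
sd = math.sqrt(var_of_mean)

ci95 = (xbar - 1.96 * sd, xbar + 1.96 * sd)
ci99 = (xbar - 2.576 * sd, xbar + 2.576 * sd)
```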
Because x̄ is normally distributed with mean μ and variance σ²/n, (x̄ - μ)/(σ/√n) is an N(0,1) variable, also called the standard Z variable. In practice, when n > 20, we can substitute m2(n/(n-1)) as an estimate for σ² when computing the standard Z variable.
So far, we have assumed that n is large, so that the central limit theorem applies. In particular, we have made the simplifying assumption that the estimated variance of the sampling distribution of the mean is identical to the actual variance of the sampling distribution. When n is small, this can lead to underestimating the variance of this distribution. To correct for this, we have to re-examine the distribution of the normally distributed standard random variable (x̄ - μ)/(σ/√n), which we actually estimate as the random variable (x̄ - μ)/√(Σ(xi - x̄)²/(n(n-1))). The latter variable is not normally distributed. Instead, it is called the standard t variable, which is distributed according to the t distribution with n-1 degrees of freedom (a parameter of the distribution). The salient feature of the t distribution is that, unlike the normal distribution, its shape varies with the degrees of freedom, with its shape for n > 20 becoming nearly identical to that of the normal distribution.
How does this affect the computation of confidence intervals? Given a sample, we compute the estimate of the mean x̄ as before. However, to compute the, say, 95% confidence interval, we need to change our procedure slightly. We have to find the range of values such that the probability mass under the t distribution (not the normal distribution) centered at that mean and with variance Σ(xi - x̄)²/(n(n-1)) is 0.95. Given the degrees of freedom (which is simply n-1), we can look this up in a standard t table. Then, we can state with 95% confidence that the population mean lies in this range.
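The procedure can be sketched as follows. The excerpt of two-sided 95% critical values is hard-coded for illustration; a real implementation would use a full t table or a library routine:

```python
import math
import statistics

# Small excerpt of two-sided 95% critical values from a standard
# t table, indexed by degrees of freedom (illustrative only).
T_95 = {5: 2.571, 10: 2.228, 16: 2.120, 20: 2.086, 30: 2.042}

def mean_ci_t(data, t_table=T_95):
    """95% confidence interval for the population mean using the
    t distribution with n - 1 degrees of freedom (small-n case)."""
    n = len(data)
    xbar = statistics.mean(data)
    # Estimated sd of the sampling distribution of the mean:
    # sqrt( sum((xi - xbar)^2) / (n(n-1)) ).
    sd_of_mean = math.sqrt(statistics.variance(data) / n)
    t = t_table[n - 1]  # critical value for n - 1 degrees of freedom
    return (xbar - t * sd_of_mean, xbar + t * sd_of_mean)

# Hypothetical sample of 17 values (so 16 degrees of freedom).
lo, hi = mean_ci_t(list(range(17)))
```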
EXAMPLE 6: CONFIDENCE INTERVALS FOR SMALL SAMPLES
Continuing with the sample in Table 1 on page 42, we will now use the t distribution to compute confidence intervals. The estimated standard deviation of the sampling distribution of the mean is 0.63. n = 17, so this corresponds to a t distribution with 16 degrees of freedom. We find from a standard t table that a t variable with 16 degrees of freedom reaches the 0.025 probability level at 2.12, so that there is 0.05 probability mass beyond 2.12 times the standard deviation on the two sides of the mean combined. Therefore, the 95% confidence interval is [2.88 - (2.12*0.63), 2.88 + (2.12*0.63)] = [1.54, 4.22]. Compare this to the 95% confidence interval of [1.65, 4.11]
obtained using the normal distribution in Example 5. Similarly, the t distribution reaches the 0.005 probability level at 2.921, leading to the 99% confidence interval of [2.88 - (2.921*0.63), 2.88 + (2.921*0.63)] = [1.04, 4.72], compared to [1.26, 4.5].
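Recomputing Example 6 from its summary numbers shows the t-based intervals are wider than the normal-based ones, reflecting the extra uncertainty from the small sample:

```python
# Summary numbers from Example 6: sample mean 2.88, estimated sd of
# the sampling distribution 0.63, and t critical values for 16
# degrees of freedom (2.120 at the 95% level, 2.921 at the 99% level).
xbar, sd = 2.88, 0.63
ci95 = (xbar - 2.120 * sd, xbar + 2.120 * sd)
ci99 = (xbar - 2.921 * sd, xbar + 2.921 * sd)

# Width comparison against the normal-based 95% interval of Example 5.
normal_width_95 = 2 * 1.96 * sd
t_width_95 = ci95[1] - ci95[0]
```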
So far, we have focused on estimating the population mean and variance, and have computed the range of values in which the mean is expected to lie with high probability. These are obtained by studying the sampling distribution of the mean. We can obtain corresponding confidence intervals for the population variance by studying the sampling distribution of the variance. It can be shown that if the population is normally distributed, this sampling distribution is the χ² distribution (discussed in Section 2.4.7 on page 58). However, this confidence interval is rarely derived in practice, and so we omit the details of this result.