
Estimates of population characteristics

Estimates of population means, proportions and variances can be obtained directly from the corresponding sample means, proportions and variances. An estimate of a population characteristic is denoted by placing the symbol ˆ (hat) over the symbol for the population parameter. For example, an estimate of the population proportion P is denoted as P̂, and the sample proportion is used for this purpose (i.e., P̂ = p).

With simple random sampling, x̄, s² and p as defined above are used to estimate μ, σ² and P respectively. These sample statistics (x̄, s², p) may not always be the correct estimates of the population parameters. For example, if a multistage sample incorporating such features as stratification and clustering was selected from the population, then the correct estimates might have to incorporate statistical weighting factors which reflect the complexity of the sample design used.
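As a simple illustration of these plug-in estimates, the following sketch (in Python, with invented data; the cut-off of 50 used to define the proportion is likewise arbitrary) computes x̄, s² and p from a simple random sample drawn without replacement:

    import random
    import statistics

    # Hypothetical population of 1000 measurements (invented for illustration).
    population = [random.gauss(50, 10) for _ in range(1000)]

    # Draw a simple random sample of n elementary units without replacement.
    n = 25
    sample = random.sample(population, n)

    x_bar = statistics.mean(sample)          # estimate of the population mean, mu
    s_squared = statistics.variance(sample)  # estimate of the population variance, sigma squared
    p = sum(x < 50 for x in sample) / n      # estimate of a population proportion, P
                                             # (here, the proportion of values below 50)
    print(x_bar, s_squared, p)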

Estimates of population parameters obtained from a particular sample can never be assumed to be equal to the true value of the population parameters. If we had taken a different sample, we would have obtained different estimates of these parameters, which may have been either closer or further away from the true parameter values than the estimates from the first sample. Since we never know the true value of the population parameters that we are estimating, we never know how close or how far our sample estimates really are from the true population values. If, however, our sampling plan uses a probability sample, then we can, through mathematics, obtain some insight into how far away from the unknown true values our sample estimates are likely to be. In order to do this, we must know something about the distribution of our estimates. This distribution encompasses all possible samples that can arise from the particular sampling plan being used.

Suppose that there are T possible samples from a given population and that a particular sample results in an estimate, θ̂, of θ. (The symbol θ here represents any population parameter [e.g., μ, P or σ²] while the symbol θ̂ represents an estimate of this parameter [e.g., x̄, p or s²].) The collection of values of θ̂ over the T possible samples is called the sampling distribution of θ̂ with respect to the specified sampling plan and estimation procedure.

To illustrate a sampling distribution, consider the following simple example.

Example 11.3.1

Suppose in a hypothetical district with thirty villages, it is necessary to estimate the proportion of the villages in which less than 50% of the eligible children are immunized against polio. It is decided to randomly select a sample of five villages in which the immunization histories of all the eligible children will be collected to determine the polio immunization level of each village. This hypothetical population of thirty villages is shown in Table 11.1. An indicator Y represents the immunization coverage, with a "1" indicating a village in which less than 50% of the eligible children are immunized against polio, and a "0" indicating an immunization level of 50% or more. (In real life the immunization levels of the villages would, of course, not be known in advance.)

Table 11.1 Data on immunization status in 30 villages

Village  Proportion immunized <50%  Y    Village  Proportion immunized <50%  Y
   1              yes               1       16              yes               1
   2              no                0       17              yes               1
   3              yes               1       18              no                0
   4              yes               1       19              yes               1
   5              yes               1       20              no                0
   6              no                0       21              yes               1
   7              no                0       22              no                0
   8              no                0       23              no                0
   9              yes               1       24              yes               1
  10              yes               1       25              no                0
  11              yes               1       26              no                0
  12              yes               1       27              yes               1
  13              no                0       28              yes               1
  14              yes               1       29              yes               1
  15              no                0       30              no                0

If a sampling plan is specified in which five villages are selected at random out of the thirty, such a procedure would yield 142 506 possible samples. Each of these samples has the same chance of being selected. In terms of the notation described in the definition of a sampling distribution, T = 142 506, P = 0.57, and each θ̂ = p, where P and p are the population and estimated population proportion, respectively. The sampling distribution of the estimated proportion, p, is given in Table 11.2.


Table 11.2 Sampling distribution of estimated proportion of samples of size five of the data of Table 11.1

  p      Frequency   Relative frequency
 0.0        1287          0.009
 0.2       12155          0.085
 0.4       38896          0.273
 0.6       53040          0.372
 0.8       30940          0.217
 1.0        6188          0.043

The relative frequencies shown in the last column of Table 11.2 show the fraction of all possible samples that can take on the corresponding values of p. These are shown graphically in Fig. 12.
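The entire sampling distribution in Table 11.2 can be reproduced by enumerating all 142 506 possible samples of size five. A minimal Python sketch, assuming the Y values of Table 11.1 are stored in a list:

    from collections import Counter
    from itertools import combinations
    from math import comb

    # Y values for villages 1-30 from Table 11.1 (1 = less than 50% immunized).
    y = [1, 0, 1, 1, 1, 0, 0, 0, 1, 1,
         1, 1, 0, 1, 0, 1, 1, 0, 1, 0,
         1, 0, 0, 1, 0, 0, 1, 1, 1, 0]

    T = comb(30, 5)   # number of possible samples of size five: 142 506

    # Tabulate the estimated proportion p over every possible sample.
    counts = Counter(sum(s) / 5 for s in combinations(y, 5))

    for p_hat in sorted(counts):
        print(f"{p_hat:.1f}  {counts[p_hat]:6d}  {counts[p_hat] / T:.3f}")  # matches Table 11.2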

[Figure: relative frequency (vertical axis) plotted against the estimated proportion of villages with inadequate vaccination coverage (horizontal axis, 0.0 to 1.0).]

Fig. 12 Relative frequency histogram of sampling distribution of p

Sampling distributions also have certain characteristics associated with them. For our purposes, the two most important of these are the mean and the variance (or its square root, the standard deviation).

The mean of the sampling distribution of an estimated parameter θ̂ is also known as its expected value, denoted by E(θ̂), and is defined by the equation

E(θ̂) = (1/T) Σ θ̂ᵢ,   the sum being taken over i = 1, ..., T,

where T is the number of possible samples and θ̂ᵢ is the sample statistic computed from the ith possible sample selected from the population. Note that some of the values of θ̂ᵢ may be the same from sample to sample, but each one appears in the sum even if there is duplication.
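Applied to the village example, this definition can be evaluated directly by summing p over all T = 142 506 possible samples, duplicated values included. A brief sketch, reusing the Y values of Table 11.1:

    from itertools import combinations

    # Y values for villages 1-30 from Table 11.1 (1 = less than 50% immunized).
    y = [1, 0, 1, 1, 1, 0, 0, 0, 1, 1,
         1, 1, 0, 1, 0, 1, 1, 0, 1, 0,
         1, 0, 0, 1, 0, 0, 1, 1, 1, 0]

    # Every possible sample contributes one term to the sum, even when the
    # same value of p occurs in many different samples.
    p_hats = [sum(s) / 5 for s in combinations(y, 5)]
    T = len(p_hats)

    expected_value = sum(p_hats) / T
    print(T, expected_value)   # 142506 and about 0.567, the population proportion P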

The variance of the sampling distribution of an estimated parameter θ̂, denoted by Var(θ̂), is defined as

Var(θ̂) = (1/T) Σ [θ̂ᵢ − E(θ̂)]²,   the sum again being taken over i = 1, ..., T.

The standard deviation of the sampling distribution of an estimated parameter θ̂ is more commonly known as the standard error of θ̂ and is simply the positive square root of the variance Var(θ̂) of the sampling distribution of θ̂. This quantity is denoted as

SE(θ̂) = √Var(θ̂).
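These definitions can be checked numerically against the sampling distribution of Table 11.2; a short sketch, assuming the table is stored as a dictionary mapping each value of p to its frequency:

    from math import sqrt

    # Sampling distribution of p from Table 11.2: value of p -> number of samples.
    distribution = {0.0: 1287, 0.2: 12155, 0.4: 38896,
                    0.6: 53040, 0.8: 30940, 1.0: 6188}

    T = sum(distribution.values())                                  # 142 506
    mean = sum(p * f for p, f in distribution.items()) / T          # E(p)
    variance = sum(f * (p - mean) ** 2
                   for p, f in distribution.items()) / T            # Var(p)
    standard_error = sqrt(variance)                                 # SE(p)

    print(mean, variance, standard_error)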

Using these general equations, it is possible to derive expressions for the mean (or expected value), variance and standard error of the sampling distribution of any estimator of interest. In particular, for the sample proportion, assuming simple random sampling of elementary units,

E(p) = P, Var(p) = P(1−P)/n

and SE(p) = √{P(1−P)/n}.

For the sample mean, x̄, the mean, variance and standard error of the sampling distribution are E(x̄) = μ, Var(x̄) = σ²/n and SE(x̄) = σ/√n, respectively.
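A minimal sketch of these two standard-error formulas, using invented values of P, σ and n purely for illustration:

    from math import sqrt

    def se_proportion(P: float, n: int) -> float:
        """Standard error of a sample proportion under simple random sampling."""
        return sqrt(P * (1 - P) / n)

    def se_mean(sigma: float, n: int) -> float:
        """Standard error of a sample mean under simple random sampling."""
        return sigma / sqrt(n)

    print(se_proportion(0.4, 100))   # about 0.049
    print(se_mean(12, 50))           # about 1.70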

A very important theorem in statistics, which concerns the sampling distributions of sample proportions and means, is known as the central limit theorem. This theorem states, in effect, that if the sample means and proportions are based on large enough sample sizes, their sampling distributions tend to be normal, irrespective of the underlying distribution of the original observations. (The normal distribution provides the foundation for much of statistical theory. Readers unfamiliar with the normal distribution and its properties are advised to refer to Dixon and Massey or other basic statistics textbooks.) Hence, assuming the sample size is large, the sampling distribution of p may be represented as follows:

[Figure: normal curve centred at P, with the horizontal axis marked at P − zσp, P and P + zσp and the area between these limits shaded.]

Fig. 13 Sampling distribution of p

Here z is the number of standard errors from the mean of the sampling distribution and σp = SE(p). How large the sample size (n) must be for the central limit theorem to hold depends on P. For practical purposes, n is sufficiently large if nP is greater than 5. At z equal to 1.96 in Fig. 13, the shaded area under the normal curve equals 95% of the total area. If z = 2.58, then the area equals 99%. Using the central limit theorem we can determine how close a sample proportion p is likely to be to the true proportion P in the population from which that sample was selected. This is discussed in greater detail in the following sections.
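The statement about the shaded area can be illustrated by simulation: with invented values P = 0.3 and n = 200 (so that nP is well above 5), roughly 95% of all sample proportions should fall within 1.96 standard errors of P. A rough sketch:

    import random
    from math import sqrt

    P, n, n_trials = 0.3, 200, 10_000      # invented illustrative values (nP = 60 > 5)
    se = sqrt(P * (1 - P) / n)             # SE(p) from the formula above

    # Simulate the sampling distribution of p by drawing repeated samples.
    p_hats = [sum(random.random() < P for _ in range(n)) / n for _ in range(n_trials)]

    # Fraction of sample proportions within z = 1.96 standard errors of P;
    # by the central limit theorem this should be close to 0.95.
    within = sum(abs(p - P) <= 1.96 * se for p in p_hats) / n_trials
    print(within)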