
The variance of the sampling distribution of an estimated parameter $\hat\theta$, denoted by Var($\hat\theta$), is defined as

$$\mathrm{Var}(\hat\theta) = \sum_{i=1}^{T}\left[\hat\theta_i - E(\hat\theta)\right]^2 / T,$$

where $\hat\theta_i$ is the estimate obtained from the ith of the T possible samples.

The standard deviation of the sampling distribution of an estimated parameter $\hat\theta$ is more commonly known as the standard error of $\hat\theta$ and is simply the positive square root of the variance Var($\hat\theta$) of the sampling distribution of $\hat\theta$. This quantity is denoted as

$$\mathrm{SE}(\hat\theta) = \sqrt{\mathrm{Var}(\hat\theta)}.$$

Using these general equations, it is possible to derive expressions for the mean (or expected value), variance and standard error of the sampling distribution of any estimator of interest. In particular, for the sample proportion, assuming simple random sampling of elementary units,

$$E(p) = P, \qquad \mathrm{Var}(p) = P(1-P)/n \qquad \text{and} \qquad \mathrm{SE}(p) = \sqrt{P(1-P)/n}.$$

For the sample mean, $\bar{x}$, the mean, variance and standard error of the sampling distribution are $E(\bar{x}) = \mu$, $\mathrm{Var}(\bar{x}) = \sigma^2/n$ and $\mathrm{SE}(\bar{x}) = \sigma/\sqrt{n}$, respectively.
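These identities can be checked by brute force on a toy population. The sketch below (plain Python; the five-unit population and the sample size n = 2 are hypothetical choices, and sampling is with replacement, under which Var(p) = P(1−P)/n holds exactly) enumerates every possible sample:

```python
from itertools import product

# Hypothetical toy population: 2 of 5 units have the characteristic, so P = 0.4.
population = [1, 1, 0, 0, 0]
P = sum(population) / len(population)
n = 2  # sample size

# Enumerate every ordered sample of size n drawn with replacement.
samples = list(product(population, repeat=n))
p_values = [sum(s) / n for s in samples]  # sample proportion p for each sample

E_p = sum(p_values) / len(p_values)
Var_p = sum((p - E_p) ** 2 for p in p_values) / len(p_values)

print(E_p)           # equals P = 0.4, so p is unbiased
print(Var_p)         # equals P*(1-P)/n = 0.4*0.6/2 = 0.12
print(Var_p ** 0.5)  # SE(p), the square root of Var(p)
```

Averaging p over all 25 possible samples reproduces E(p) = P and Var(p) = P(1−P)/n exactly.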

A very important theorem in statistics which concerns the sampling distributions of sample proportions and means is known as the central limit theorem. This theorem states, in effect, that if the sample means and proportions are based on large enough sample sizes, their sampling distributions tend to be normal, irrespective of the underlying distribution of the original observations. (The normal distribution provides the foundation for much of statistical theory. Readers unfamiliar with the normal distribution and its properties would be advised to refer to Dixon and Massey or other basic statistics textbooks.) Hence, assuming the sample size is large, the sampling distribution of p may be represented as follows:

[Fig. 13 Sampling distribution of p: a normal curve centred at P, with the points P − zσp and P + zσp marked on the horizontal axis]

Here z is the number of standard errors from the mean of the sampling distribution and σp = SE(p). How large the sample size (n) must be for the central limit theorem to hold depends on P. For practical purposes, n is sufficiently large if nP is greater than 5. At z equal to 1.96 in Fig. 13, the shaded area under the normal curve equals 95% of the total area. If z = 2.58, then the area equals 99%. Using the central limit theorem we can determine how close a sample proportion p is likely to be to the true proportion P in the population from which that sample was selected. This is discussed in greater detail in the following sections.
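The theorem can be illustrated by simulation. The sketch below (plain Python; the values P = 0.3, n = 100 and the number of replications are hypothetical choices for illustration) draws repeated samples and checks what fraction of the sample proportions fall within 1.96 standard errors of P:

```python
import random

random.seed(1)

# Hypothetical values chosen for illustration only.
P, n, reps = 0.3, 100, 20000
se = (P * (1 - P) / n) ** 0.5          # SE(p) = sqrt(P(1-P)/n)

# Draw `reps` samples of size n and record each sample proportion p.
props = [sum(random.random() < P for _ in range(n)) / n for _ in range(reps)]

# By the central limit theorem, about 95% of the p values should lie
# within P +/- 1.96*SE(p); slightly fewer here because p is discrete.
inside = sum(P - 1.96 * se <= p <= P + 1.96 * se for p in props) / reps
print(inside)
```

The observed fraction approaches 0.95 as n grows, which is exactly the behaviour the confidence intervals of the later sections rely on.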


Foundations of Sampling and Statistical Theory

where:

$P = [M/(Nm)]\sum_{i=1}^{m} N_i p_i$,
M = total number of clusters
m = number of clusters selected at the first stage
$N_i$ = number of sampling units in the ith cluster
N = number of sampling units in the population
$p_i$ = estimate of $P_i$ in the ith cluster.

In the above formula:

$N_i p_i$ estimates the total number of individuals with the characteristic in the ith cluster;

$\sum_{i=1}^{m} N_i p_i$ estimates the total number of individuals with the characteristic over all m clusters in the sample;

$\sum_{i=1}^{m} N_i p_i / m$ estimates the mean number of individuals with the characteristic per sample cluster;

$M \sum_{i=1}^{m} N_i p_i / m$ estimates the total number of individuals with the characteristic in all M clusters in the population; finally,

$[M/(Nm)]\sum_{i=1}^{m} N_i p_i$ estimates the proportion of individuals in the population with the characteristic.

The variance of P is composed of two parts and is expressed as follows:

$$\mathrm{Var}(P) = \frac{M(M-m)}{(M-1)N^2 m}\sum_{i=1}^{M}\left(N_i P_i - \frac{NP}{M}\right)^2 + \frac{M}{mN^2}\sum_{i=1}^{M}\left[\frac{N_i^2(N_i-n_i)}{N_i-1}\right]\left[\frac{P_i(1-P_i)}{n_i}\right]$$

The first part of this expression represents the variation due to selecting m clusters, or primary sampling units (PSUs), from the M available clusters. The second part is the sum of the sampling variation within clusters caused by selecting $n_i$ observations from the $N_i$ which are available.

Since the above expression is a population parameter depending upon knowledge concerning all M population clusters, and the true proportions within those clusters, it is necessary to specify an estimate of Var(P), which may be done as follows:

$$\widehat{\mathrm{Var}}(P) = \frac{M(M-m)}{N^2 m(m-1)}\sum_{i=1}^{m}\left[N_i p_i - \frac{1}{m}\sum_{i=1}^{m} N_i p_i\right]^2 + \frac{M}{mN^2}\sum_{i=1}^{m} N_i^2\left[\frac{N_i-n_i}{N_i}\right]\left[\frac{1}{n_i}\right]\left[\frac{1}{n_i-1}\right]\left[n_i p_i(1-p_i)\right]$$
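The estimate above can be transcribed directly into code. The sketch below uses entirely made-up cluster data (M, N and the per-cluster $N_i$, $n_i$, $p_i$ are hypothetical) purely to show the mechanics of the two terms:

```python
# Hypothetical two-stage cluster sample: M = 4 clusters in the population,
# N = 200 listing units in total, m = 2 clusters sampled.
M, N = 4, 200
clusters = [            # (N_i, n_i, p_i) for each sampled cluster
    (50, 10, 0.2),
    (60, 10, 0.5),
]
m = len(clusters)

# Point estimate: P = [M/(N*m)] * sum(N_i * p_i)
totals = [Ni * pi for Ni, ni, pi in clusters]
P_hat = M / (N * m) * sum(totals)

# First term: between-cluster variation.
mean_total = sum(totals) / m
between = (M * (M - m) / (N**2 * m * (m - 1))
           * sum((t - mean_total) ** 2 for t in totals))

# Second term: within-cluster variation.
within = M / (m * N**2) * sum(
    Ni**2 * ((Ni - ni) / Ni) * (1 / ni) * (1 / (ni - 1)) * (ni * pi * (1 - pi))
    for Ni, ni, pi in clusters
)

var_hat = between + within
print(P_hat, var_hat)
```

With these numbers the point estimate is P = 0.4; the standard error would be the square root of `var_hat`.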

This expression is quite involved, incorporating $N_i$, $n_i$ and $p_i$ for each of the m clusters selected in the sample. From a sample size determination point of view this is a problem, since different combinations of cluster sample sizes could be used in order to obtain some pre-specified level of precision. A simplified formula may be obtained by specifying that each population cluster be the same size (i.e., $N_i = \bar{N}$) and that the sample size selected from each cluster be the same (i.e., $n_i = \bar{n}$). In that special case, it follows that

$$\widehat{\mathrm{Var}}(P) = \frac{M-m}{Mm(m-1)}\sum_{i=1}^{m}(p_i - P)^2 + \frac{\bar{N}-\bar{n}}{\bar{N}Mm(\bar{n}-1)}\sum_{i=1}^{m} p_i(1-p_i)$$

where $P = \sum_{i=1}^{m} p_i/m$, and $N = M\bar{N}$. Also, P becomes

$$P = \frac{1}{m}\sum_{i=1}^{m} p_i = \frac{1}{m\bar{n}}\sum_{i=1}^{m}\sum_{j=1}^{\bar{n}} x_{ij},$$

where $x_{ij}$ equals 1 if the jth sampling unit selected in the ith cluster has the characteristic and 0 otherwise.

The advantage of the above formula is that once M, m and the $p_i$, i = 1, ..., m, are specified, it would be possible to solve for $\bar{n}$, the number of sampling units to select from each cluster.

However, it is rather unrealistic to assume that $N_i = \bar{N}$ for each of the clusters. A more reasonable approach would therefore be to use probability proportionate to size (PPS) cluster sampling. In this strategy, clusters are selected proportional to the number of listing units in the cluster. In this way, clusters with large $N_i$ have a greater chance of being included in the sample than clusters having small $N_i$. In PPS sampling the same number, $\bar{n}$, of listing units is generally sampled from each cluster selected at the first stage. The method is illustrated using the following example.

Example II.3.2

Consider the population of ten villages along with the number of families living in each village, as shown in Table II.3.

Table II.3 Distribution of families in ten hypothetical villages

Village     Number of      Cumulative     Random          Random number(s)
(cluster)   families, Ni   sum of Ni      numbers         chosen
 1           4288           4288          00001-04288     04285
 2           5036           9324          04289-09324
 3           1178          10502          09325-10502
 4            638          11140          10503-11140
 5          27010          38150          11141-38150     11883; 35700; 36699
 6           1122          39272          38151-39272
 7           2134          41406          39273-41406
 8           1824          43230          41407-43230
 9           4672          47902          43231-47902
10           2154          50056          47903-50056

How would a PPS sample of $4\bar{n}$ families be taken?

Solution

We first list these clusters and cumulate the number of listing units (families) as shown in column 3 of Table II.3. Numbers 1 through 4288 are associated with cluster 1; numbers 4289 through 9324 are associated with cluster 2; numbers 9325 through 10502 are associated with cluster 3; and so on. Four random numbers between 1 and 50056 (36699, 35700, 11883 and 4285) are selected, corresponding to clusters 5, 5, 5 and 1, respectively. For each of the four random numbers chosen, we take a simple random sample of $\bar{n}$ families from the village corresponding to the random number. In this example, three independent simple random samples of $\bar{n}$ families would be selected from village 5, since it corresponds to three of the random numbers, and one simple random sample of $\bar{n}$ families would be selected from village 1.
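The cumulate-and-look-up step of this solution is easy to mechanize. The sketch below reproduces the selection from the table above with the same four random numbers, using a binary search (`bisect_left`) on the cumulative totals:

```python
from bisect import bisect_left

# Family counts for the ten villages of the table above.
families = [4288, 5036, 1178, 638, 27010, 1122, 2134, 1824, 4672, 2154]

# Cumulative totals: village i+1 covers the numbers (cum[i-1], cum[i]].
cum = []
total = 0
for f in families:
    total += f
    cum.append(total)

# The four random numbers drawn between 1 and 50056 in the example.
draws = [36699, 35700, 11883, 4285]

# A draw r selects the first village whose cumulative total is >= r.
chosen = [bisect_left(cum, r) + 1 for r in draws]  # 1-based village numbers
print(chosen)  # [5, 5, 5, 1]
```

`bisect_left` returns the index of the first cumulative total not smaller than the draw, which is exactly the "random numbers" column rule of the table; in practice the draws would come from a random number generator rather than being fixed.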

With PPS cluster sampling, P is estimated by $\hat{P}_{pps}$, where

$$\hat{P}_{pps} = \frac{1}{m\bar{n}}\sum_{i=1}^{m}\sum_{j=1}^{\bar{n}} x_{ij} = \frac{1}{m}\sum_{i=1}^{m} p_i.$$

It should be noted that the expected value of this estimate is the true population proportion. Hence $\hat{P}_{pps}$ is an unbiased estimate* of P.

Also, the expression for the estimated variance of $\hat{P}_{pps}$ is

$$\widehat{\mathrm{Var}}(\hat{P}_{pps}) = \frac{1}{m(m-1)}\sum_{i=1}^{m}\left(p_i - \hat{P}_{pps}\right)^2.$$

Comparing these formulae to the ones presented earlier for non-PPS cluster sampling, one notes that the great advantage of using PPS cluster sampling is the resulting computational simplicity.
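That simplicity is evident in code. The sketch below uses hypothetical within-cluster proportions $p_i$ for m = 4 selected clusters; both the estimate and its estimated variance are one-line computations:

```python
# Hypothetical sample proportions from m = 4 clusters selected with PPS.
p = [0.2, 0.3, 0.4, 0.5]
m = len(p)

# Point estimate: the plain mean of the cluster proportions.
P_pps = sum(p) / m

# Estimated variance: {1/[m(m-1)]} * sum((p_i - P_pps)^2)
var_pps = sum((pi - P_pps) ** 2 for pi in p) / (m * (m - 1))

print(P_pps, var_pps)
```

Unlike the non-PPS estimator, no cluster sizes $N_i$ appear anywhere: only the m cluster proportions are needed.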

The question of sample size is basic to the planning of any cluster sample. Decisions have to be made on, first, the number of clusters, m, which should be selected from the M available clusters and, second, the number of sampling units, $\bar{n}$, which should be selected from each cluster. We want to determine m and $\bar{n}$ so that

* See pages 61-63.

It would seem intuitively clear that a desirable property of a sampling plan and estimation procedure is that it should yield estimates of population parameters for which the mean of the sampling distribution is equal to, or at least close to, the true unknown population parameter and for which the standard error is very small. In fact, the accuracy of an estimated population parameter is evaluated in terms of these two characteristics.

The bias of an estimate $\hat\theta$ of a population parameter $\theta$ is defined as the difference between the mean, E($\hat\theta$), of the sampling distribution of $\hat\theta$ and the true value of the unknown population parameter $\theta$:

$$\mathrm{Bias}(\hat\theta) = E(\hat\theta) - \theta.$$

An estimate $\hat\theta$ is said to be unbiased if Bias($\hat\theta$) = 0, or, in other words, if the mean of the sampling distribution of $\hat\theta$ is equal to $\theta$. The sample proportion and sample mean are examples of unbiased estimates when we have random sampling. Recall that the population variance is defined as

$$\sigma^2 = \sum_{i=1}^{N}(x_i - \mu)^2 / N.$$

One estimate of σ² might be a similar statistic computed on the n elements selected in the sample. That is,

$$\hat\sigma^2 = \sum_{i=1}^{n}(x_i - \bar{x})^2 / n.$$

It can be shown mathematically that

$$E(\hat\sigma^2) = [(n-1)/n]\,\sigma^2.$$

Hence, $\hat\sigma^2$ is a biased estimate of σ². Alternatively, consider the sample variance presented earlier as

$$s^2 = \sum_{i=1}^{n}(x_i - \bar{x})^2 / (n-1).$$

It can be shown that

$$E(s^2) = \sigma^2.$$

Hence, s² is an unbiased estimate of σ².
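Both expectations can be verified exactly by averaging over every possible sample from a toy population. The sketch below (plain Python; the three-unit population is hypothetical and sampling is with replacement) does so for n = 2:

```python
from itertools import product

population = [1, 2, 3]          # hypothetical toy population
N = len(population)
mu = sum(population) / N
sigma2 = sum((x - mu) ** 2 for x in population) / N   # population variance

n = 2  # sample size

def var_n(xs):
    """Estimate dividing by n: the biased estimate sigma-hat squared."""
    xbar = sum(xs) / len(xs)
    return sum((x - xbar) ** 2 for x in xs) / len(xs)

def var_n1(xs):
    """Estimate dividing by n - 1: the unbiased sample variance s squared."""
    xbar = sum(xs) / len(xs)
    return sum((x - xbar) ** 2 for x in xs) / (len(xs) - 1)

# All ordered samples of size n, drawn with replacement.
samples = list(product(population, repeat=n))

E_biased = sum(var_n(s) for s in samples) / len(samples)
E_s2 = sum(var_n1(s) for s in samples) / len(samples)

print(E_biased)  # equals [(n-1)/n] * sigma2
print(E_s2)      # equals sigma2
```

Averaging over all 9 possible samples reproduces the two expectations exactly: the n-divisor estimate is too small on average by the factor (n−1)/n, while s² hits σ².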

Expected values are based on the average over all possible samples which can be selected from a population. In any practical situation, however, the population is likely to be very large and only a single sample is selected from it. It is highly unlikely that the estimate produced from this single sample will exactly equal the population parameter. For this reason, we must define the term "sampling error".

The sampling error is the difference between a sample estimate and a population parameter. The population parameter could, in theory, be determined only if a complete


study were carried out of every individual in the population of interest. In practice, of course, this is never the case since populations are too large and are constantly changing; as a result, a well-selected sample is the best one can hope for.

To illustrate sampling error, consider the following example.

Example II.4.1

Suppose the proportion of individuals with hypertension living in a certain community is 0.23. (It should be noted that this proportion is based on a snapshot of the population at an instant in time. Shortly after this snapshot is taken, some members of the population die or move away, while new members arise by birth or in-migration. Hence, defining the population is a more complex issue than may seem at first.) The population parameter is represented as

P = 0.23.

Suppose a representative sample of the population is studied and the proportion of the sampled individuals with hypertension is 28%. This is denoted by

p = 0.28.

Using this notation, the sampling error is defined as p − P = 0.28 − 0.23 = 0.05.

In this example, the value 0.28 is called a point estimate. In general, if $\theta$ is some population parameter, and if $\hat\theta$ is an estimate of $\theta$, then $\hat\theta - \theta$ is the sampling error.

Note that sampling error refers to the relationship between an estimate resulting from a single sample and a population parameter. To account for the existence of sampling error, confidence intervals are used rather than point estimates when making statements about population parameters. A confidence interval has the general form:

$$\hat\theta - z\sqrt{\widehat{\mathrm{Var}}(\hat\theta)} \le \theta \le \hat\theta + z\sqrt{\widehat{\mathrm{Var}}(\hat\theta)}$$

and has the following interpretation. Suppose a sample of size n is selected from a population with an unknown parameter $\theta$; the estimate $\hat\theta_1$ would not be expected to equal $\theta$ exactly. Suppose this process of selecting a sample of size n was performed many times (say k times), each sample resulting in a new value $\hat\theta_i$, i = 1, ..., k. The collection of the k estimates of $\theta$ constitutes the sampling distribution which, as a result of the central limit theorem, can be approximated by the normal distribution with mean E($\hat\theta$) and variance Var($\hat\theta$). If the normal distribution adequately describes the sampling distribution, then 95% of all values of $\hat\theta$ will fall in the interval:

$$E(\hat\theta) - 1.96\sqrt{\mathrm{Var}(\hat\theta)} \quad \text{to} \quad E(\hat\theta) + 1.96\sqrt{\mathrm{Var}(\hat\theta)}.$$

Only 5% of the time will the estimate $\hat\theta$ fall more than $1.96\sqrt{\mathrm{Var}(\hat\theta)}$ units away from the mean, E($\hat\theta$). Since the value of Var($\hat\theta$) is a population parameter, it must be estimated from the sample data. This quantity, denoted by $\widehat{\mathrm{Var}}(\hat\theta)$, is used in the construction of confidence intervals for $\theta$.

For example, the 95% confidence interval would take the form:

$$\hat\theta - 1.96\sqrt{\widehat{\mathrm{Var}}(\hat\theta)} \le \theta \le \hat\theta + 1.96\sqrt{\widehat{\mathrm{Var}}(\hat\theta)}.$$

The interpretation of this interval is that if confidence intervals such as this were constructed for each of all possible samples which could be selected from a population, 95% of all such intervals would include the true value of $\theta$ in the population. Since any particular interval either does or does not cover the parameter, the "probability" of a particular interval being correct is either 0 or 1. "Confidence" relates to the theoretical concept of repeatedly sampling from the population. The value of 1.96 implies the inclusion of the central 95% of the area under the normal curve (see Fig. 13). Other choices would be used if different areas were desired (e.g., 2.58 would be used for a 99% confidence interval).

For some parameters, such as μ, the t-distribution, rather than the normal distribution, would be used for the purpose of describing the confidence interval if the variance was estimated from the data and if the sample size was small. (Readers unfamiliar with the t-distribution and its properties should refer to an appropriate basic textbook of statistics.) The following example illustrates the construction of a confidence interval for a population proportion.

Example II.4.2

In order to estimate the proportion of children in a particular school vaccinated against polio, a list of all students is assembled and a simple random sample of 25 of these students is selected. The proportion of students in the sample vaccinated against polio is observed to be 44%. The 95% confidence interval estimate of P, the true proportion vaccinated in the school, is

$$0.44 - 1.96\sqrt{0.44(1-0.44)/25} \le P \le 0.44 + 1.96\sqrt{0.44(1-0.44)/25}$$

or

$$0.25 \le P \le 0.63.$$

This states that if confidence intervals of this type were established for all possible random samples of size 25 which could be selected from this population, 95% of these would correctly incorporate the true proportion vaccinated in the population. The interval 0.25 to 0.63 either does or does not include the true P, so in that sense the "probability" of it being correct is either 0 or 1. We have "confidence" in this interval because of our knowledge of the nature of the sampling distribution as well as how such intervals perform upon repeated sampling. (Since the population proportion P is unknown, p(1−p)/n is used as an estimate of the variance of the sampling distribution, P(1−P)/n.)

Precision relates to the confidence interval and is defined in terms of repeated sampling. That is,

precision = [reliability coefficient] × [standard error]

where the reliability coefficient reflects the desired confidence and is represented by a z-value if normality is assumed. For example, if the desired confidence is 95%, then the reliability coefficient is the upper 97.5th percentile of the normal distribution with mean 0 and variance 1. This is represented by $z_{0.975}$.