In stratified random sampling, the population is divided into "L" strata, and simple random samples are selected from each such stratum. The proportion of individuals in the population who possess the characteristic of interest is
$$P = \sum_{h=1}^{L} \frac{N_h P_h}{N},$$
where $P_h$ is the proportion of individuals in stratum h possessing the characteristic, $N_h$ is the size of stratum h, and N is the total population size. That is, the population proportion is a weighted average of the stratum-specific proportions, where the weights are the relative sizes of the strata, $N_h/N$.
An unbiased estimate of P is obtained by computing
$$\hat{P} = \sum_{h=1}^{L} \frac{N_h \hat{P}_h}{N},$$
where $\hat{P}_h$ is the usual estimate of $P_h$ based on the $n_h$ sampling units selected from the hth stratum.
The variance of the estimated proportion is
$$\operatorname{Var}(\hat{P}) = \frac{1}{N^2} \sum_{h=1}^{L} N_h^2 \left[\frac{N_h - n_h}{N_h - 1}\right] \left[\frac{P_h(1-P_h)}{n_h}\right].$$
Finally, if we assume that the $N_h$ are large and much greater than $n_h$, then this expression simplifies to
$$\operatorname{Var}(\hat{P}) = \frac{1}{N^2} \sum_{h=1}^{L} N_h^2 \left[\frac{P_h(1-P_h)}{n_h}\right].$$
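The estimate and both variance expressions can be checked numerically. The following is a minimal Python sketch rather than anything given in the manual: the function names are illustrative, and the stratum sizes, proportions and per-stratum sample sizes are borrowed from Example 1.8.3 purely as sample inputs.

```python
def stratified_proportion(N_h, p_h):
    """Weighted average of stratum proportions with weights N_h / N."""
    N = sum(N_h)
    return sum(Nh * ph for Nh, ph in zip(N_h, p_h)) / N


def variance_of_estimate(N_h, n_h, P_h, fpc=True):
    """Variance of the stratified estimate of a proportion.

    fpc=True keeps the factor (N_h - n_h) / (N_h - 1) in each stratum;
    fpc=False uses the simplified form that assumes N_h >> n_h.
    """
    N = sum(N_h)
    total = 0.0
    for Nh, nh, Ph in zip(N_h, n_h, P_h):
        correction = (Nh - nh) / (Nh - 1) if fpc else 1.0
        total += Nh**2 * correction * Ph * (1 - Ph) / nh
    return total / N**2


# Illustrative inputs (an assumption of this sketch, taken from Example 1.8.3).
N_h = [2000, 3000, 5000]   # stratum sizes
n_h = [110, 165, 275]      # sample sizes per stratum
p_h = [0.10, 0.15, 0.20]   # stratum proportions

p_hat = stratified_proportion(N_h, p_h)
var_fpc = variance_of_estimate(N_h, n_h, p_h, fpc=True)
var_simple = variance_of_estimate(N_h, n_h, p_h, fpc=False)
print(p_hat, var_fpc, var_simple)
```

Because each correction factor is below 1 whenever more than one unit is sampled from a stratum, the variance with the finite population correction is never larger than the simplified one.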
Again, by applying the central limit theorem, the sampling distribution of $\hat{P}$ may be approximated by the normal distribution and, as a result, the precision may be set to d, which again may be specified either in terms of a defined distance or as a percentage of P.

Estimating P to within "d" percentage points
Under the simplifying assumptions presented above, the sample size n required to estimate P to within d percentage points with $100(1-\alpha)\%$ confidence is given by formula (30). (See pages 82-83.)
Statistical Methods for Sample Size Determination
$$n = \frac{z_{1-\alpha/2}^2 \sum_{h=1}^{L} \left[N_h^2 P_h(1-P_h)/w_h\right]}{N^2 d^2 + z_{1-\alpha/2}^2 \sum_{h=1}^{L} N_h P_h(1-P_h)} \qquad (30)$$
where $w_h = n_h/n$. That is, $w_h$ is the fraction of observations allocated to stratum h and is typically decided in advance of sampling by the particular allocation scheme used. Thus, if equal allocation is used, then $w_h = 1/L$ for all strata. Alternatively, if proportional allocation is used, then $w_h = N_h/N$ for the hth stratum. Other types of allocation strategies are commonly employed (for instance, the ones that incorporate such features as variability in the strata and differential costs of sampling within the strata) but will not be discussed further in this manual.
Using the simplified formula for $\operatorname{Var}(\hat{P})$ results in a somewhat simpler formula for sample size determination:
$$n = \frac{z_{1-\alpha/2}^2 \sum_{h=1}^{L} \left[N_h^2 P_h(1-P_h)/w_h\right]}{N^2 d^2} \qquad (30a)$$
The number of combinations of parameters in these equations makes the construction of tables impractical. The interested reader would be well advised to use a programmable calculator or a spreadsheet program on a microcomputer to assist with the calculations.
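In the same spirit as a programmable calculator or spreadsheet, formulas (30) and (30a) can be sketched as a short Python function. This is only an illustration: the function and parameter names are invented here, the normal deviate is passed in directly as `z` (1.960 for 95% confidence), and the two-stratum population at the bottom is hypothetical.

```python
def allocation_weights(N_h, scheme="proportional"):
    """Return w_h for equal (1/L) or proportional (N_h/N) allocation."""
    L = len(N_h)
    if scheme == "equal":
        return [1 / L] * L
    if scheme == "proportional":
        N = sum(N_h)
        return [Nh / N for Nh in N_h]
    raise ValueError("scheme must be 'equal' or 'proportional'")


def sample_size_absolute(N_h, P_h, w_h, d, z=1.960, fpc=True):
    """Total n to estimate P to within d (a proportion, e.g. 0.03).

    fpc=True follows formula (30); fpc=False follows formula (30a).
    z is the normal deviate z_{1-alpha/2}.
    """
    N = sum(N_h)
    numerator = z**2 * sum(
        Nh**2 * Ph * (1 - Ph) / wh for Nh, Ph, wh in zip(N_h, P_h, w_h)
    )
    denominator = N**2 * d**2
    if fpc:
        denominator += z**2 * sum(Nh * Ph * (1 - Ph) for Nh, Ph in zip(N_h, P_h))
    return numerator / denominator


# Hypothetical two-stratum population, invented for illustration only.
N_h = [1000, 4000]
P_h = [0.30, 0.10]
for scheme in ("equal", "proportional"):
    w_h = allocation_weights(N_h, scheme)
    n30 = sample_size_absolute(N_h, P_h, w_h, d=0.05)
    n30a = sample_size_absolute(N_h, P_h, w_h, d=0.05, fpc=False)
    print(scheme, round(n30, 2), round(n30a, 2))
```

The result is left unrounded so that the caller can round up to the next whole sampling unit, as is done in the worked examples.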
Example 1.8.3
A preliminary survey is made of 3 cities, A, B, and C, with population sizes 2000, 3000, and 5000, respectively. The proportion of families with 1 or more infant deaths within the last 5 years is estimated and presented in Table 1.2 as $P_h$.
Using these preliminary data, determine the sample size which would be needed to estimate the proportion of families with infant deaths if the precision is to be within ±3 percentage points of the true population P with 95% confidence and the sample is to be distributed using proportional allocation.
Table 1.2 Sample size computation by spreadsheet technique

City   Population  Weight  N_h^2     Proportion  N_h P_h  N_h P_h(1-P_h)  N_h^2 P_h(1-P_h)/w_h
       N_h         w_h               P_h
A      2000        0.2     4000000   0.10        200      180.0           1800000
B      3000        0.3     9000000   0.15        450      382.5           3825000
C      5000        0.5     25000000  0.20        1000     800.0           8000000
Total  10000                                     1650     1362.5          13625000

Solution
Using formula (30) it follows that
$$n = \frac{1.960^2\,[13625000]}{(10000)^2(0.03)^2 + 1.960^2(1362.5)} = 549.61$$
Hence, a total sample of 550 families should be selected. With proportional allocation, the sample would be distributed to the three strata as follows:
$n_1 = 550 \times 2000/10000 = 110$,
$n_2 = 550 \times 3000/10000 = 165$,
$n_3 = 550 \times 5000/10000 = 275$.
Using formula (30a) results in
$$n = \frac{1.960^2\,[13625000]}{(10000)^2(0.03)^2} = 581.58$$
which suggests that a sample of size 582 be used. Hence by using formula (30) which incorporates the finite population correction factors within each stratum, a reduced estimate of sample size is obtained.
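The spreadsheet-style calculation behind Example 1.8.3 can be retraced column by column. The sketch below is one possible Python rendering; the variable names are illustrative, and rounding the total up and the stratum allocations to the nearest unit is an assumption that happens to match the figures in the solution.

```python
from math import ceil

z, d = 1.960, 0.03  # 95% confidence, precision of 3 percentage points
strata = {"A": (2000, 0.10), "B": (3000, 0.15), "C": (5000, 0.20)}
N = sum(Nh for Nh, _ in strata.values())

sum_a = 0.0  # running total of N_h^2 P_h (1 - P_h) / w_h
sum_b = 0.0  # running total of N_h P_h (1 - P_h)
for city, (Nh, Ph) in strata.items():
    wh = Nh / N                      # proportional allocation weight
    a = Nh**2 * Ph * (1 - Ph) / wh
    b = Nh * Ph * (1 - Ph)
    print(city, Nh, wh, Nh**2, Ph, round(b, 1), round(a))
    sum_a += a
    sum_b += b

n_30 = z**2 * sum_a / (N**2 * d**2 + z**2 * sum_b)   # formula (30)
n_30a = z**2 * sum_a / (N**2 * d**2)                 # formula (30a)

n_total = ceil(n_30)  # round up to a whole number of families
allocation = {city: round(n_total * Nh / N) for city, (Nh, _) in strata.items()}
print(round(n_30, 2), round(n_30a, 2), n_total, allocation)
```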
Estimating P to within "ε" of P
For the alternative strategy, we set
$$\varepsilon P = z_{1-\alpha/2}\sqrt{\operatorname{Var}(\hat{P})}$$
and, solving for n and noting that $NP = \sum_{h=1}^{L} N_h P_h$, it follows that
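$$n = \frac{z_{1-\alpha/2}^2 \sum_{h=1}^{L} \left[N_h^2 P_h(1-P_h)/w_h\right]}{\varepsilon^2 \left(\sum_{h=1}^{L} N_h P_h\right)^2 + z_{1-\alpha/2}^2 \sum_{h=1}^{L} N_h P_h(1-P_h)} \qquad (31)$$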
If we assume that $N_h$ is large and much greater than $n_h$, then the simplified expression for $\operatorname{Var}(\hat{P})$ may be used, giving rise to the following, less complicated, sample size formula:
$$n = \frac{z_{1-\alpha/2}^2 \sum_{h=1}^{L} \left[N_h^2 P_h(1-P_h)/w_h\right]}{\varepsilon^2 \left(\sum_{h=1}^{L} N_h P_h\right)^2} \qquad (31a)$$

Example 1.8.4
In Example 1.8.3, suppose that we wish to estimate P to within 5% of the true value. How large a sample should be used?
Solution
Using the calculations in Table 1.2 and formula (31):
$$n = \frac{1.960^2\,[13625000]}{(0.05)^2(1650)^2 + 1.960^2(1362.5)} = 4347.17$$
Hence, a sample of 4348 families should be selected. Since we are using proportional allocation, the sample would be distributed to the three strata as follows:
$n_1 = 4348 \times 2000/10000 = 870$,
$n_2 = 4348 \times 3000/10000 = 1304$,
$n_3 = 4348 \times 5000/10000 = 2174$.
Using formula (31a) results in:
$$n = \frac{1.960^2\,[13625000]}{(0.05)^2(1650)^2} = 7690.26$$
which suggests that a sample of size 7691 be used. Hence by using formula (31) which incorporates the finite population correction factors within each stratum, a greatly reduced estimate of the sample size requirement was obtained.
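Formulas (31) and (31a) lend themselves to the same kind of check. The following Python sketch uses an illustrative function name and takes its inputs from Table 1.2 with proportional allocation; rounding the allocations to the nearest whole family is an assumption that agrees with the figures quoted in the solution.

```python
from math import ceil


def sample_size_relative(N_h, P_h, w_h, eps, z=1.960, fpc=True):
    """Total n to estimate P to within a fraction eps of P.

    fpc=True follows formula (31); fpc=False follows formula (31a).
    """
    numerator = z**2 * sum(
        Nh**2 * Ph * (1 - Ph) / wh for Nh, Ph, wh in zip(N_h, P_h, w_h)
    )
    denominator = eps**2 * sum(Nh * Ph for Nh, Ph in zip(N_h, P_h)) ** 2
    if fpc:
        denominator += z**2 * sum(Nh * Ph * (1 - Ph) for Nh, Ph in zip(N_h, P_h))
    return numerator / denominator


# Inputs from Table 1.2 with proportional allocation (w_h = N_h / N).
N_h = [2000, 3000, 5000]
P_h = [0.10, 0.15, 0.20]
N = sum(N_h)
w_h = [Nh / N for Nh in N_h]

n_31 = sample_size_relative(N_h, P_h, w_h, eps=0.05)
n_31a = sample_size_relative(N_h, P_h, w_h, eps=0.05, fpc=False)

n_total = ceil(n_31)  # round up to a whole number of families
allocation = [round(n_total * Nh / N) for Nh in N_h]
print(round(n_31, 2), round(n_31a, 2), n_total, allocation)
```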
There is no easy rule for deciding upon sample size for cluster sampling. The easiest approach is to compute n based on simple random sampling criteria and then multiply by the design effect to obtain the required total sample with cluster sampling.
If 2 were the chosen design effect, twice as many observations would be necessary with cluster sampling as with simple random sampling to obtain the desired level of precision.
A design effect of 2 specifies that the variance with cluster sampling will be twice as large as the variance with simple random sampling if the same total number, n, of observations were selected with both. It is normally cheaper to sample n observations with cluster sampling than with simple random sampling, because of the dramatically reduced cost and time involved in constructing sampling frames. Since, for the same cost, one can therefore afford to select many more sampling units, precision is generally greater with cluster sampling. Hence, one would decide how many observations were necessary using the formulae presented in Chapter 8 for simple random sampling and then multiply that number by the design effect.
Example 1.8.5
In Example 1.8.1 when simple random sampling was used to estimate P to within 5 percentage points with 95% confidence, 338 children would have been needed.
If cluster sampling were to be used, how many clusters, and of what size, would be needed for the same precision?
Solution
Assuming a design effect of 2, 676 (i.e. 2 x 338) children would be needed.
Some information on the heterogeneity of the clusters and the costs involved would be needed in order to determine the sizes of the clusters. For example the 676 children may be distributed in the following different ways:
Number of clusters   Cluster size
10                   68
15                   45
20                   34
25                   27
30                   23
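The cluster sizes listed above follow from dividing the inflated sample among the clusters. A minimal Python sketch is given below; rounding each size to the nearest whole child is an assumption made here because it reproduces the listed values, and the variable names are illustrative.

```python
n_srs = 338          # simple random sample size from Example 1.8.1
design_effect = 2    # design effect assumed in the solution
n_cluster = design_effect * n_srs  # total children needed with cluster sampling

cluster_sizes = {}
for m in (10, 15, 20, 25, 30):
    # Average cluster size, rounded to the nearest whole child.
    cluster_sizes[m] = round(n_cluster / m)
    print(m, cluster_sizes[m], m * cluster_sizes[m])
```

The third printed column shows the total actually obtained with each mix. Rounding to the nearest unit can leave it slightly below the target, so rounding up instead would be the more cautious choice.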
The mix of number of clusters (m) and cluster sizes ($\bar{n}$) depends upon the degree of heterogeneity of the clusters. When the variation between clusters is large, we would choose $\bar{n}$ small and m large. Alternatively, the closer the $P_i$ in the clusters are to 0.5, the larger will be the variability within the clusters. In this case, $\bar{n}$ should be large relative to m. Another factor in determining the mix of m and $\bar{n}$ is cost. The total cost of selecting a cluster sample can be expressed as
$$C = c_1 m + c_2 m \bar{n},$$
where c1 is the cost involved in selecting clusters (e.g., constructing frames, renting office space, etc.), and c2 is the cost involved in selecting sampling units (e.g., travel expenses, interviewer expenses, data processing costs). Common sense dictates that if c1 is large, we would tend to take more sampling units and fewer clusters. However, if c2 is large, we would take more clusters and fewer sampling units within each cluster.
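To see how the cost expression behaves across the mixes of Example 1.8.5, it can be evaluated directly. In the Python sketch below the unit costs are hypothetical values chosen only for illustration, and the function name is not from the manual.

```python
def total_cost(m, n_bar, c1, c2):
    """Cost of a cluster sample: c1 per cluster plus c2 per sampled unit."""
    return c1 * m + c2 * m * n_bar


# Hypothetical unit costs (assumptions of this sketch, not from the text).
c1 = 500.0   # cost of selecting one cluster
c2 = 10.0    # cost of selecting one sampling unit

# (number of clusters, cluster size) pairs from Example 1.8.5.
mixes = [(10, 68), (15, 45), (20, 34), (25, 27), (30, 23)]
costs = {(m, n_bar): total_cost(m, n_bar, c1, c2) for m, n_bar in mixes}
for (m, n_bar), cost in costs.items():
    print(m, n_bar, cost)
```

Since the product of m and the cluster size is roughly constant across these mixes, the second term hardly changes and the comparison is driven by the per-cluster cost. The calculation says nothing about precision, which depends on the heterogeneity of the clusters as discussed above.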
See page 86.