size does not seem to be an issue. However, a major consideration is that we need to know the variances $\sigma_i^2$ prior to using the weighted least squares approach, and in practice this information is almost never available. Therefore it is usually necessary to estimate the $\sigma_i^2$ from study data, in which case the weights are random variables rather than constants. So instead of (1.21) and (1.22) we have
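$$\hat{\theta} = \frac{1}{\hat{W}} \sum_{i=1}^{n} \hat{w}_i \hat{\theta}_i \tag{1.23}$$
and
$$\operatorname{var}(\hat{\theta}) = \frac{1}{\hat{W}} \tag{1.24}$$
where $\hat{w}_i = 1/\hat{\sigma}_i^2$ and $\hat{W} = \sum_{i=1}^{n} \hat{w}_i$. When the $\sigma_i^2$ are estimated from large samples the desirable properties of (1.21) and (1.22) described above carry over to (1.23) and (1.24), that is, $\hat{\theta}$ is asymptotically unbiased with minimum variance.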
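As an illustration of (1.23) and (1.24), the following Python sketch combines several estimates using weights computed from estimated variances. The function name `weighted_estimate` and the numerical inputs are hypothetical and chosen only for illustration.

```python
def weighted_estimate(estimates, variances):
    """Combine estimates with inverse-variance weights, following (1.23) and (1.24).

    estimates: the theta-hat_i values
    variances: the estimated variances sigma-hat_i^2 (all must be positive)
    """
    if len(estimates) != len(variances):
        raise ValueError("estimates and variances must have the same length")
    if any(v <= 0 for v in variances):
        raise ValueError("variances must be positive")

    weights = [1.0 / v for v in variances]        # w-hat_i = 1 / sigma-hat_i^2
    total_weight = sum(weights)                   # W-hat
    theta_hat = sum(w * t for w, t in zip(weights, estimates)) / total_weight
    var_theta_hat = 1.0 / total_weight            # (1.24)
    return theta_hat, var_theta_hat


# Hypothetical inputs: two estimates, the first four times as precise as the second.
theta_hat, var_theta_hat = weighted_estimate([0.10, 0.20], [0.01, 0.04])
print(theta_hat, var_theta_hat)
```

The more precise estimate receives the larger weight, so the combined estimate lies closer to it than a simple average would.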
$a$ of them are cases. The simple random sample estimate of the prevalence rate is $\hat{\pi}_{\mathrm{srs}} = a/r$, which has the variance $\operatorname{var}(\hat{\pi}_{\mathrm{srs}}) = \pi(1-\pi)/r$.

1.3.2 Stratified Random Sampling
Suppose that the prevalence rate increases with age. Simple random sampling ensures that, on average, the sample will have the same age distribution as the population. However, in a given prevalence study it is possible for a particular age group to be underrepresented or even absent from a simple random sample. Stratified random sampling avoids this difficulty by permitting the investigator to specify the proportion of the total sample that will come from each age group (stratum). For stratified random sampling to be possible it is necessary to know in advance the number of individuals in the population in each stratum. For example, stratification by age could be based on a census list, provided information on age is available. Once the strata have been created, a simple random sample is drawn from each stratum, resulting in a stratified random sample.
Suppose there are $n$ strata. For the $i$th stratum we make the following definitions: $N_i$ is the number of individuals in the population, $\pi_i$ is the prevalence rate, $r_i$ is the number of subjects in the simple random sample, and $a_i$ is the number of cases among the $r_i$ subjects $(i = 1, 2, \ldots, n)$. Let $N = \sum_{i=1}^{n} N_i$, $a = \sum_{i=1}^{n} a_i$, and
$$r = \sum_{i=1}^{n} r_i. \tag{1.25}$$
For a stratified random sample, along with the $N_i$, the $r_i$ must also be known prior to data collection. We return shortly to the issue of how to determine the $r_i$, given an overall sample size of $r$. For the moment we require only that the $r_i$ satisfy the constraint (1.25). Since a simple random sample is chosen in each stratum, an estimate of $\pi_i$ is $\hat{\pi}_i = a_i/r_i$, which has the variance $\operatorname{var}(\hat{\pi}_i) = \pi_i(1-\pi_i)/r_i$. The stratified random sample estimate of the prevalence rate is
$$\hat{\pi}_{\mathrm{str}} = \sum_{i=1}^{n} \left(\frac{N_i}{N}\right) \hat{\pi}_i \tag{1.26}$$
which is seen to be a weighted average of the $\hat{\pi}_i$. Since $E(\hat{\pi}_i) = \pi_i$, it follows from (1.7) that
$$E(\hat{\pi}_{\mathrm{str}}) = \sum_{i=1}^{n} \left(\frac{N_i}{N}\right) \pi_i = \pi$$
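and so $\hat{\pi}_{\mathrm{str}}$ is unbiased. Applying (1.8) to (1.26) gives
$$\operatorname{var}(\hat{\pi}_{\mathrm{str}}) = \sum_{i=1}^{n} \left(\frac{N_i}{N}\right)^2 \frac{\pi_i(1-\pi_i)}{r_i}. \tag{1.27}$$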
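The calculations in (1.26) and (1.27) can be sketched in Python as follows. The function name `stratified_prevalence` and the stratum counts are hypothetical. Note also an assumption that goes beyond the text: (1.27) is stated in terms of the true $\pi_i$, whereas this sketch substitutes the estimates $\hat{\pi}_i$ to obtain an estimated variance.

```python
def stratified_prevalence(N, a, r):
    """Stratified estimate (1.26) and a plug-in version of the variance (1.27).

    N: population sizes N_i for each stratum
    a: numbers of cases a_i in each stratum sample
    r: stratum sample sizes r_i (all must be positive)
    """
    if not (len(N) == len(a) == len(r)):
        raise ValueError("N, a and r must have the same length")
    if any(r_i <= 0 for r_i in r):
        raise ValueError("each stratum sample size must be positive")

    N_total = sum(N)
    estimate = 0.0
    variance = 0.0
    for N_i, a_i, r_i in zip(N, a, r):
        weight = N_i / N_total                  # N_i / N
        pi_hat_i = a_i / r_i                    # stratum estimate a_i / r_i
        estimate += weight * pi_hat_i           # term of (1.26)
        # Assumption: pi_i in (1.27) is replaced by its estimate pi_hat_i.
        variance += weight ** 2 * pi_hat_i * (1 - pi_hat_i) / r_i
    return estimate, variance


# Hypothetical study with three age strata.
pi_hat_str, var_hat_str = stratified_prevalence(
    N=[5000, 3000, 2000], a=[5, 6, 8], r=[50, 30, 20]
)
print(pi_hat_str, var_hat_str)
```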
We now consider the issue of determining the $r_i$. There are a number of approaches that can be followed, each of which places particular conditions on the $r_i$. For example, according to the method of optimal allocation, the $r_i$ are chosen so that $\operatorname{var}(\hat{\pi}_{\mathrm{str}})$ is minimized. It can be shown that, based on this criterion,
$$r_i = \left(\frac{N_i \sqrt{\pi_i(1-\pi_i)}}{\sum_{i=1}^{n} N_i \sqrt{\pi_i(1-\pi_i)}}\right) r. \tag{1.28}$$
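A minimal Python sketch of the allocation rule (1.28) is given below. The function name `optimal_allocation` and the stratum sizes and prevalence rates are hypothetical. The formula generally returns non-integer values, so in practice the $r_i$ would need to be rounded; the rounding rule is left open here.

```python
from math import sqrt


def optimal_allocation(N, pi, r):
    """Stratum sample sizes r_i from (1.28), before any rounding.

    N:  population sizes N_i for each stratum
    pi: assumed prevalence rates pi_i (for example, from earlier studies)
    r:  overall sample size
    """
    if len(N) != len(pi):
        raise ValueError("N and pi must have the same length")
    if any(p < 0 or p > 1 for p in pi):
        raise ValueError("each prevalence rate must lie between 0 and 1")

    # N_i * sqrt(pi_i * (1 - pi_i)) for each stratum
    scores = [N_i * sqrt(p * (1 - p)) for N_i, p in zip(N, pi)]
    total = sum(scores)
    if total == 0:
        raise ValueError("allocation is undefined when every pi_i is 0 or 1")
    return [r * s / total for s in scores]


# Hypothetical strata in which prevalence increases across strata.
allocation = optimal_allocation(N=[5000, 3000, 2000], pi=[0.1, 0.2, 0.5], r=100)
print(allocation)

# With equal prevalence rates the allocation is proportional to N_i.
proportional = optimal_allocation(N=[5000, 3000, 2000], pi=[0.2, 0.2, 0.2], r=100)
print(proportional)
```

Strata with prevalence rates closer to 0.5 receive a larger share of the sample than their population size alone would give them, because $\pi_i(1-\pi_i)$ is largest there.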
As can be seen from (1.28), in order to determine the $r_i$ it is necessary to know, or at least have reasonable estimates of, the $\pi_i$. Since estimating the $\pi_i$ is itself one of the purposes of the prevalence study, it is necessary to rely on findings from earlier prevalence studies or, when such studies are not available, on informed opinion.
Stratified random sampling should be considered only if it is known, or at least strongly suspected, that the $\pi_i$ vary across strata. Suppose that, unknown to the investigator, the $\pi_i$ are all equal, so that $\pi_i = \pi$ for all $i$. It follows from (1.28) that $r_i = (N_i/N)r$ and hence, from (1.27), that $\operatorname{var}(\hat{\pi}_{\mathrm{str}}) = \pi(1-\pi)/r$. This means that the variance obtained by optimal allocation, which is the smallest variance possible under stratified random sampling, equals the variance that would have been obtained from simple random sampling. Consequently, when there is a possibility that the $\pi_i$ are all equal, stratified random sampling should be avoided since the effort involved in stratification will not be rewarded by a reduction in variance.
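The step from (1.27) to the simple random sampling variance can be written out explicitly. Substituting $\pi_i = \pi$ and $r_i = (N_i/N)r$ into (1.27), and using $\sum_{i=1}^{n} N_i = N$, gives:

```latex
\operatorname{var}(\hat{\pi}_{\mathrm{str}})
  = \sum_{i=1}^{n} \left(\frac{N_i}{N}\right)^{2} \frac{\pi(1-\pi)}{(N_i/N)\,r}
  = \frac{\pi(1-\pi)}{r} \sum_{i=1}^{n} \frac{N_i}{N}
  = \frac{\pi(1-\pi)}{r}.
```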
Simple random sampling and stratified random sampling are conceptually and computationally straightforward. There are more complex methods of random sampling such as multistage sampling and cluster sampling. Furthermore, the various methods can be combined to produce even more elaborate sampling strategies. It will come as no surprise that as the method of sampling becomes more complicated so does the corresponding data analysis. In practice, most epidemiologic studies use relatively straightforward sampling procedures. Aside from prevalence studies, which may require complex sampling, the typical epidemiologic study is usually based on simple random sampling or perhaps stratified random sampling, but generally nothing more elaborate.
Most of the procedures in standard statistical packages, such as SAS (1987) and SPSS (1993), assume that data have been collected using simple random sampling or stratified random sampling. For more complicated sampling designs it is necessary to use a statistical package such as SUDAAN (Shah et al., 1996), which is specifically designed to analyze complex survey data. STATA (1999) is a statistical package that has capabilities similar to SAS and SPSS, but with the added feature of being able to analyze data collected using complex sampling. For the remainder of the book it will be assumed that data have been collected using simple random sampling unless stated otherwise.
CHAPTER 2
Measurement Issues in Epidemiology
Unlike laboratory research where experimental conditions can usually be carefully controlled, epidemiologic studies must often contend with circumstances over which the investigator may have little influence. This reality has important implications for the manner in which epidemiologic data are collected, analyzed, and interpreted.
This chapter provides an overview of some of the measurement issues that are important in epidemiologic research, an appreciation of which provides a useful perspective on the statistical methods to be discussed in later chapters. There are many references that can be consulted for additional material on measurement issues and study design in epidemiology; in particular, the reader is referred to Rothman and Greenland (1998).