In the preceding section we discussed the properties of distributions in general, and those of the normal, chi-square, binomial, and Poisson distributions in particular.
These distributions and others are characterized by parameters that, in practice, are usually unknown. This raises the question of how to estimate such parameters from study data.
In certain applications the method of estimation seems intuitively clear. For example, suppose we are interested in estimating the probability that a coin will land heads. A “study” to investigate this question is straightforward and involves tossing the coin $r$ times and counting the number of heads, a quantity that will be denoted by $a$. The question of how large $r$ should be is answered in Chapter 14. The proportion of tosses landing heads, $a/r$, tells us something about the coin, but in order to probe more deeply we require a probability model, the obvious choice being the binomial distribution. Accordingly, let $A$ be a binomial random variable with parameters $(\pi, r)$, where $\pi$ denotes the unknown probability that the coin will land heads.
Even though the parameter $\pi$ can never be known with certainty, it can be estimated from study data. From the binomial model, an estimate is given by the random variable $A/r$ which, in the present study, has the realization $a/r$. We denote $A/r$ by $\hat{\pi}$ and refer to $\hat{\pi}$ as a (point) estimate of $\pi$. In some of the statistics literature, $\hat{\pi}$ is called an estimator of $\pi$, the term estimate being reserved for the realization $a/r$. In keeping with our convention of intentionally ignoring the distinction between random variables and realizations, we use estimate to refer to both quantities.
The theory of binomial distributions provides insight into the properties of $\hat{\pi}$ as an estimate of $\pi$. Since $A$ has mean $E(A) = \pi r$ and variance $\operatorname{var}(A) = \pi(1-\pi)r$, it follows that $\hat{\pi}$ has mean $E(\hat{\pi}) = E(A)/r = \pi$ and variance $\operatorname{var}(\hat{\pi}) = \operatorname{var}(A)/r^2 = \pi(1-\pi)/r$. In the context of the coin-tossing study, these properties of $\hat{\pi}$ have the following interpretations: Over the course of many replications of the study, each based on $r$ tosses, the realizations of $\hat{\pi}$ will tend to be near $\pi$; and when $r$ is large there will be little dispersion of the realizations on either side of $\pi$. The latter interpretation is consistent with our intuition that $\pi$ will be estimated more accurately when there are many tosses of the coin.
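These interpretations can be checked by simulation. The following sketch in Python (the true value of $\pi$, the number of replications, and the values of $r$ are arbitrary, illustrative choices) generates many replications of the coin-tossing study and summarizes the realizations of $\hat{\pi}$.

    import numpy as np

    rng = np.random.default_rng(seed=1)
    pi = 0.5              # true probability of heads, assumed for the simulation
    replications = 10000  # number of replications of the study

    for r in (10, 100, 1000):
        a = rng.binomial(r, pi, size=replications)  # heads in each replication
        pi_hat = a / r                              # realizations of pi-hat
        print(f"r={r:5d}  mean={pi_hat.mean():.4f}  "
              f"variance={pi_hat.var():.6f}  theory={pi * (1 - pi) / r:.6f}")

In each case the empirical mean is close to $\pi$, while the empirical variance shrinks as $r$ grows, in agreement with $\operatorname{var}(\hat{\pi}) = \pi(1-\pi)/r$.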
With the above example as motivation, we now consider the general problem of parameter estimation. For simplicity we frame the discussion in terms of a discrete random variable, but the same ideas apply to the continuous case. Suppose that we wish to study a feature of a population which is governed by a probability function $P(X = x \mid \theta)$, where the parameter $\theta$ embodies the characteristic of interest. For example, in a population health survey, $X$ could be the serum cholesterol of a randomly chosen individual and $\theta$ might be the average serum cholesterol in the population.
Let $X_1, X_2, \ldots, X_n$ be a sample of size $n$ from the probability function $P(X = x \mid \theta)$. A (point) estimate of $\theta$, denoted by $\hat{\theta}$, is a random variable that is expressed in terms of the $X_i$ and that satisfies certain properties, as discussed below. In the preceding example, the survey could be conducted by sampling $n$ individuals at random from the population and measuring their serum cholesterol. For $\hat{\theta}$ we might consider using $\bar{X} = \left(\sum_{i=1}^{n} X_i\right)/n$, the average serum cholesterol in the sample.
There is considerable latitude when specifying the properties that $\hat{\theta}$ should be required to satisfy, but in order for a theory of estimation to be meaningful the properties must be chosen so that $\hat{\theta}$ is, in some sense, informative about $\theta$. The first property we would like $\hat{\theta}$ to have is that it should result in realizations that are “near” $\theta$. This is impossible to guarantee in any given study, but over the course of many replications of the study we would like this property to hold “on average.” Accordingly, we require the mean of $\hat{\theta}$ to be $\theta$, that is, $E(\hat{\theta}) = \theta$. When this property is satisfied we say that $\hat{\theta}$ is an unbiased estimate of $\theta$; otherwise $\hat{\theta}$ is said to be biased.
The second property we would like $\hat{\theta}$ to have is that it should make as efficient use of the data as possible. In statistics, notions related to efficiency are generally expressed in terms of the variance. That is, all other things being equal, the smaller the variance the greater the efficiency. Accordingly, for a given sample size, we require $\operatorname{var}(\hat{\theta})$ to be as small as possible.
In the coin-tossing study the parameter was $\theta = \pi$. We can reformulate the earlier probability model by letting $A_1, A_2, \ldots, A_n$ be independent binomial random variables, each having parameters $(\pi, 1)$. Setting $\bar{A} = \left(\sum_{i=1}^{n} A_i\right)/n$ we have $\hat{\pi} = \bar{A}$, and so $E(\bar{A}) = \pi$ and $\operatorname{var}(\bar{A}) = \pi(1-\pi)/n$. Suppose that instead of $\bar{A}$ we decide to use $A_1$ as an estimate of $\pi$; that is, we ignore all but the first toss of the coin. Since $E(A_1) = \pi$, both $\bar{A}$ and $A_1$ are unbiased estimates of $\pi$. However, $\operatorname{var}(A_1) = \pi(1-\pi)$ and so, provided $n > 1$, $\operatorname{var}(A_1) > \operatorname{var}(\bar{A})$. This means that $\bar{A}$ is more efficient than $A_1$. Based on the above criteria we would choose $\bar{A}$ over $A_1$ as an estimate of $\pi$.
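The same comparison can be made empirically; a brief sketch, again with illustrative values of $\pi$ and $n$:

    import numpy as np

    rng = np.random.default_rng(seed=2)
    pi, n, replications = 0.3, 25, 20000

    tosses = rng.binomial(1, pi, size=(replications, n))  # n tosses per study
    A_bar = tosses.mean(axis=1)   # estimate using all n tosses
    A_1 = tosses[:, 0]            # estimate using only the first toss

    print("var(A_bar):", A_bar.var(), " theory:", pi * (1 - pi) / n)
    print("var(A_1):  ", A_1.var(), " theory:", pi * (1 - pi))

Both estimates average out to roughly $\pi$, but the variance of $\bar{A}$ is about $1/n$ that of $A_1$.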
The decision to choose $\bar{A}$ in preference to $A_1$ was based on a comparison of variances. This raises the question of whether there is another unbiased estimate of $\pi$ with a variance that is even smaller than $\pi(1-\pi)/n$. We return now to the general case of an arbitrary probability function $P(X = x \mid \theta)$. For many of the probability functions encountered in epidemiology it can be shown that there is a number $b(\theta)$ such that, for any unbiased estimate $\hat{\theta}$, the inequality $\operatorname{var}(\hat{\theta}) \geq b(\theta)$ is satisfied. Consequently, $b(\theta)$ is at least as small as the variance of any unbiased estimate of $\theta$. There is no guarantee that for given $\theta$ and $P(X = x \mid \theta)$ there actually is an unbiased estimate with a variance this small; but, if we can find one, we clearly will have satisfied the requirement that the estimate has the smallest variance possible.
For the binomial distribution, it turns out that $b(\pi) = \pi(1-\pi)/n$, and so $b(\pi) = \operatorname{var}(\hat{\pi})$. Consequently $\hat{\pi}$ is an unbiased estimate of $\pi$ with the smallest variance possible (among unbiased estimates). For the binomial distribution, intuition suggests that $\hat{\pi}$ ought to provide a reasonable estimate of $\pi$, and it turns out that $\hat{\pi}$ has precisely the properties we require. However, such ad hoc methods of defining an estimate cannot always be relied upon, especially when the probability model is complex. We now consider two widely used methods of estimation which ensure that the estimate has desirable properties, provided asymptotic conditions are satisfied.
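In standard treatments this lower bound is the information (Cramér–Rao) bound. As a sketch of where $b(\pi) = \pi(1-\pi)/n$ comes from in the binomial case (using the Fisher information $I(\pi)$ of a single toss, a quantity not otherwise needed in this section):

$$\log P(A_i = a \mid \pi) = a \log \pi + (1-a)\log(1-\pi), \qquad I(\pi) = E\!\left[-\frac{d^2}{d\pi^2}\log P(A_i \mid \pi)\right] = \frac{1}{\pi(1-\pi)},$$

and for $n$ independent tosses the bound is $b(\pi) = 1/\bigl(n I(\pi)\bigr) = \pi(1-\pi)/n$, which equals $\operatorname{var}(\hat{\pi})$ computed earlier.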
1.2.1 Maximum Likelihood
The maximum likelihood method is based on a concept that is intuitively appealing and, at first glance, deceptively straightforward. Like many profound ideas, its apparent simplicity belies a remarkable depth. Let $X_1, X_2, \ldots, X_n$ be a sample from the probability function $P(X = x \mid \theta)$ and consider the observations (realizations) $x_1, x_2, \ldots, x_n$. Since the $X_i$ are independent, the (joint) probability of these observations is the product of the individual probability elements, that is,
$$\prod_{i=1}^{n} P(X_i = x_i \mid \theta) = P(X_1 = x_1 \mid \theta)\, P(X_2 = x_2 \mid \theta) \cdots P(X_n = x_n \mid \theta). \qquad (1.16)$$

Ordinarily we are inclined to think of (1.16) as a function of the $x_i$. From this perspective, (1.16) can be used to calculate the probability of the observations provided the value of $\theta$ is known. The maximum likelihood method turns this argument around and views (1.16) as a function of $\theta$. Once the data have been collected, values of the $x_i$ can be substituted into (1.16), making it a function of $\theta$ alone. When viewed this way we denote (1.16) by $L(\theta)$ and refer to it as the likelihood. For any value of $\theta$, $L(\theta)$ equals the probability of the observations $x_1, x_2, \ldots, x_n$. We can graph $L(\theta)$ as a function of $\theta$ to get a visual image of this relationship. The value of $\theta$ which is most in accord with the observations, that is, makes them most “likely,” is the one which maximizes $L(\theta)$ as a function of $\theta$. We refer to this value of $\theta$ as the maximum likelihood estimate and denote it by $\hat{\theta}$.
Example 1.7 Let $A_1, A_2, A_3, A_4, A_5$ be a sample from the binomial distribution with parameters $(\pi, 1)$, and consider the observations $a_1 = 0$, $a_2 = 1$, $a_3 = 0$, $a_4 = 0$, and $a_5 = 0$. The likelihood is
$$L(\pi) = \prod_{i=1}^{5} \pi^{a_i}(1-\pi)^{1-a_i} = \pi(1-\pi)^4.$$
From the graph of $L(\pi)$, shown in Figure 1.9, it appears that $\hat{\pi}$ is somewhere in the neighborhood of .2. Trial and error with larger and smaller values of $\pi$ confirms that in fact $\hat{\pi} = .2$.
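The trial-and-error step is easy to automate. A minimal sketch (the grid spacing is an arbitrary choice) evaluates $L(\pi) = \pi(1-\pi)^4$ over a fine grid of values and reports the maximizer:

    import numpy as np

    def likelihood(pi):
        # likelihood from Example 1.7: one head observed in five tosses
        return pi * (1 - pi) ** 4

    grid = np.linspace(0, 1, 10001)           # candidate values of pi
    pi_hat = grid[np.argmax(likelihood(grid))]
    print(pi_hat)                             # prints (approximately) 0.2

The same grid search applied to the likelihoods of Example 1.8 below would reproduce $\hat{\pi} = a/r$ to within the grid spacing.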
The above graphical method of finding a maximum likelihood estimate is feasible only in the simplest of cases. In more complex situations, in particular when there are several parameters to estimate simultaneously, numerical methods are required, such as those described in Appendix B. When there is a single parameter, the maximum likelihood estimate $\hat{\theta}$ can usually be found by solving the maximum likelihood equation,

$$L'(\hat{\theta}) = 0 \qquad (1.17)$$

where $L'(\theta)$ is the derivative of $L(\theta)$ with respect to $\theta$.
FIGURE 1.9 Likelihood for Example 1.7
Example 1.8 We now generalize Example 1.7. Let $A_1, A_2, \ldots, A_r$ be a sample from the binomial distribution with parameters $(\pi, 1)$, and denote the observations by $a_1, a_2, \ldots, a_r$. The likelihood is
$$L(\pi) = \prod_{i=1}^{r} \pi^{a_i}(1-\pi)^{1-a_i} = \pi^{a}(1-\pi)^{r-a} \qquad (1.18)$$

where $a = \sum_{i=1}^{r} a_i$. From the form of the likelihood we see that it is not the individual $a_i$ which are important but rather their sum $a$. Accordingly we might just as well have based the likelihood on $\sum_{i=1}^{r} A_i$, which is binomial with parameters $(\pi, r)$. In this case the likelihood is
$$L(\pi) = \binom{r}{a} \pi^{a}(1-\pi)^{r-a}. \qquad (1.19)$$
As far as maximizing (1.19) with respect to $\pi$ is concerned, the binomial coefficient is irrelevant and so (1.18) and (1.19) are equivalent from the likelihood perspective. It is straightforward to show that the maximum likelihood equation (1.17) simplifies to $a - \hat{\pi}r = 0$ and so the maximum likelihood estimate of $\pi$ is $\hat{\pi} = a/r$.
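One way to see the simplification (a sketch, working with the logarithm of (1.18); since $L(\pi) > 0$ for $0 < \pi < 1$, the derivative of $\log L(\pi)$ vanishes exactly where $L'(\pi)$ does):

$$\log L(\pi) = a \log \pi + (r-a)\log(1-\pi), \qquad \frac{d \log L(\pi)}{d\pi} = \frac{a}{\pi} - \frac{r-a}{1-\pi}.$$

Setting the derivative to zero at $\pi = \hat{\pi}$ and multiplying through by $\hat{\pi}(1-\hat{\pi})$ gives $a(1-\hat{\pi}) - (r-a)\hat{\pi} = a - \hat{\pi}r = 0$, hence $\hat{\pi} = a/r$.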
Maximum likelihood estimates have very attractive asymptotic properties. Specifically, if $\hat{\theta}$ is the maximum likelihood estimate of $\theta$ then $\hat{\theta}$ is asymptotically normal with mean $\theta$ and variance $b(\theta)$, where the latter is the lower bound described earlier. As a result, $\hat{\theta}$ satisfies, in an asymptotic sense, the two properties that were proposed above as being desirable features of an estimate: unbiasedness and minimum variance. In addition to parameter estimates, the maximum likelihood approach also provides methods of confidence interval estimation and hypothesis testing. As discussed in Appendix B, included among the latter are the Wald, score, and likelihood ratio tests.
It seems that the maximum likelihood method has much to offer; however, there are two potential problems. First, the maximum likelihood equation may be very complicated and this can make calculating $\hat{\theta}$ difficult in practice. This is especially true when several parameters must be estimated simultaneously. Fortunately, statistical packages are available for many standard analyses and modern computers are capable of handling the computational burden. The second problem is that the desirable properties of maximum likelihood estimates are guaranteed to hold only when the sample size is “large.”
1.2.2 Weighted Least Squares
In the coin-tossing study discussed above, we considered a sample $A_1, A_2, \ldots, A_n$ from a binomial distribution with parameters $(\pi, 1)$. Since $E(A_i) = \pi$ we can denote $A_i$ by $\hat{\pi}_i$, and in place of $\bar{A} = \left(\sum_{i=1}^{n} A_i\right)/n$ write $\hat{\pi} = \left(\sum_{i=1}^{n} \hat{\pi}_i\right)/n$. In this way we can express the estimate of $\pi$ as an average of estimates, one for each $i$. More generally, suppose that $\hat{\theta}_1, \hat{\theta}_2, \ldots, \hat{\theta}_n$ are independent unbiased estimates of a parameter $\theta$, that is, $E(\hat{\theta}_i) = \theta$ for all $i$. We do not assume that the $\hat{\theta}_i$ necessarily have the same distribution; in particular, we do not require that the variances $\operatorname{var}(\hat{\theta}_i) = \sigma_i^2$ be equal. We seek a method of combining the individual estimates $\hat{\theta}_i$ of $\theta$ into an overall estimate $\hat{\theta}$ which has the desirable properties outlined earlier. (Using the symbol $\hat{\theta}$ for both the weighted least squares and maximum likelihood estimates is a matter of convenience and is not meant to imply any connection between the two estimates.) For constants $w_i > 0$, consider the sum
$$\frac{1}{W} \sum_{i=1}^{n} w_i \left(\hat{\theta}_i - \hat{\theta}\right)^2 \qquad (1.20)$$

where $W = \sum_{i=1}^{n} w_i$. We refer to the $w_i$ as weights and to an expression such as (1.20) as a weighted average. It is the relative, not the absolute, magnitude of each $w_i$ that is important in a weighted average. In particular, we can replace $w_i$ with $w_i/W$ and obtain a weighted average in which the weights sum to 1. In this way, means (1.2) and variances (1.3) can be viewed as weighted averages.
Expression (1.20) is a measure of the overall weighted “distance” between the $\hat{\theta}_i$ and $\hat{\theta}$. The weighted least squares method defines $\hat{\theta}$ to be that quantity which minimizes (1.20). It can be shown that the weighted least squares estimate of $\theta$ is
$$\hat{\theta} = \frac{1}{W} \sum_{i=1}^{n} w_i \hat{\theta}_i \qquad (1.21)$$
which is seen to be a weighted average of the $\hat{\theta}_i$. Since each $\hat{\theta}_i$ is an unbiased estimate of $\theta$, it follows from (1.7) that
$$E(\hat{\theta}) = \frac{1}{W} \sum_{i=1}^{n} w_i E(\hat{\theta}_i) = \theta.$$
So $\hat{\theta}$ is also an unbiased estimate of $\theta$, and this is true regardless of the choice of weights. Not all weighting schemes are equally efficient in the sense of keeping the variance $\operatorname{var}(\hat{\theta})$ to a minimum. The variance $\sigma_i^2$ is a measure of the amount of information contained in the estimate $\hat{\theta}_i$. It seems reasonable that relatively greater weight should be given to those $\hat{\theta}_i$ for which $\sigma_i^2$ is correspondingly small. It turns out that the weights $w_i = 1/\sigma_i^2$ are optimal in the following sense: The corresponding weighted least squares estimate has minimum variance among all weighted averages of the $\hat{\theta}_i$ (although not necessarily among estimates in general). Setting $w_i = 1/\sigma_i^2$, it follows from (1.8) that
$$\operatorname{var}(\hat{\theta}) = \frac{1}{W^2} \sum_{i=1}^{n} w_i^2 \operatorname{var}(\hat{\theta}_i) = \frac{1}{W}. \qquad (1.22)$$
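With known variances, (1.21) and (1.22) amount to inverse-variance weighting. A minimal sketch, using invented estimates and variances rather than study data:

    import numpy as np

    # hypothetical independent unbiased estimates of theta and their known variances
    theta_i = np.array([2.1, 1.8, 2.4])
    sigma2_i = np.array([0.5, 0.2, 1.0])

    w = 1.0 / sigma2_i                     # optimal weights w_i = 1 / sigma_i^2
    W = w.sum()
    theta_hat = (w * theta_i).sum() / W    # equation (1.21)
    var_theta_hat = 1.0 / W                # equation (1.22)

    print(theta_hat, var_theta_hat)        # theta_hat is about 1.95, variance 0.125

Giving the most weight to the estimate with the smallest variance (here the second one) is what keeps $\operatorname{var}(\hat{\theta})$ at its minimum among weighted averages.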
Note that up to this point the entire discussion has been based on means and variances. In particular, nothing has been assumed about distributions or sample size.
It seems that the weighted least squares method has much to recommend it. Unlike the maximum likelihood approach, the calculations are straightforward, and sample
size does not seem to be an issue. However, a major consideration is that we need to know the variances $\sigma_i^2$ prior to using the weighted least squares approach, and in practice this information is almost never available. Therefore it is usually necessary to estimate the $\sigma_i^2$ from study data, in which case the weights are random variables rather than constants. So in place of (1.21) and (1.22) we have
$$\hat{\theta} = \frac{1}{\hat{W}} \sum_{i=1}^{n} \hat{w}_i \hat{\theta}_i \qquad (1.23)$$

and

$$\operatorname{var}(\hat{\theta}) = \frac{1}{\hat{W}} \qquad (1.24)$$

where $\hat{w}_i = 1/\hat{\sigma}_i^2$ and $\hat{W} = \sum_{i=1}^{n} \hat{w}_i$. When the $\sigma_i^2$ are estimated from large samples the desirable properties of (1.21) and (1.22) described above carry over to (1.23) and (1.24), that is, $\hat{\theta}$ is asymptotically unbiased with minimum variance.
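As a concluding sketch, consider the estimated-weight version in the coin-tossing setting: each $\hat{\theta}_i$ is the proportion of heads in a separate group of tosses, and $\hat{\sigma}_i^2$ is obtained by substituting $\hat{\theta}_i$ into the binomial variance formula. The true value of $\pi$ and the group sizes are invented for illustration.

    import numpy as np

    rng = np.random.default_rng(seed=3)
    pi = 0.4                             # true value, assumed for the simulation
    r_i = np.array([20, 50, 100])        # tosses per group (illustrative sizes)

    a_i = rng.binomial(r_i, pi)          # heads observed in each group
    theta_i = a_i / r_i                  # unbiased estimates of pi, one per group
    # estimated variances; a group with no heads or all heads would need special handling
    sigma2_hat = theta_i * (1 - theta_i) / r_i

    w_hat = 1.0 / sigma2_hat             # estimated weights
    W_hat = w_hat.sum()
    theta_hat = (w_hat * theta_i).sum() / W_hat   # equation (1.23)
    var_hat = 1.0 / W_hat                         # equation (1.24)

    print(theta_hat, var_hat)

Because the weights are now random variables, the unbiasedness and minimum-variance properties hold only asymptotically, as noted above.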