As remarked in Section 1.1.3, in statistics the term “exact” means that an actual probability function is being used to perform calculations, as opposed to a normal approximation. The advantage of exact methods is that they do not rely on asymptotic properties and hence are valid regardless of sample size. The drawback, as will soon become clear, is that exact methods can be computationally intensive, especially when the sample size is large. Fortunately, this is precisely the situation where a normal approximation is appropriate, and so between the two approaches it is usually possible to perform a satisfactory data analysis.
3.1.1 Hypothesis Test
Suppose that we wish to test the null hypothesis H0: π = π0, where π0 is a given value of the probability parameter. If H0 is rejected, we conclude that π does not equal π0. To properly interpret this finding it is necessary to have an explicit alternative hypothesis H1. For example, it may be that there are only two possible values of π, namely, π0 and π1. In this case the alternative hypothesis is necessarily H1: π = π1. If we believe that π cannot be less than π0, we are led to consider the one-sided alternative hypothesis H1: π > π0. Before proceeding with a one-sided test it is important to ensure that the one-sided assumption is valid. For example, in a randomized controlled trial of a new drug compared to usual therapy, it may be safe to assume that the innovative drug is at least as beneficial as standard treatment. The two-sided alternative hypothesis corresponding to H0: π = π0 is H1: π ≠ π0. Ordinarily it is difficult to justify a one-sided alternative hypothesis, and so in most applications a two-sided test is used. Except for portions of this chapter, in this book we consider only two-sided tests. In particular, all chi-square tests are two-sided.
To test H0: π = π0 we need to decide whether the observed outcome is likely or unlikely under the assumption that π0 is the true value of π. With a as the outcome, the lower and upper tail probabilities for the binomial distribution with parameters (π0, r) are defined to be
$$P(A \le a \mid \pi_0) = \sum_{x=0}^{a} \binom{r}{x} \pi_0^{x} (1-\pi_0)^{r-x} = 1 - \sum_{x=a+1}^{r} \binom{r}{x} \pi_0^{x} (1-\pi_0)^{r-x} \tag{3.1}$$
and
$$P(A \ge a \mid \pi_0) = \sum_{x=a}^{r} \binom{r}{x} \pi_0^{x} (1-\pi_0)^{r-x} = 1 - \sum_{x=0}^{a-1} \binom{r}{x} \pi_0^{x} (1-\pi_0)^{r-x} \tag{3.2}$$
respectively.
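As a computational aside, the tail probabilities (3.1) and (3.2) are easy to evaluate with standard software. The following is a minimal sketch, assuming scipy is available; the function names lower_tail and upper_tail are illustrative, not from the text.

```python
# A minimal sketch of the tail probabilities (3.1) and (3.2).
from scipy.stats import binom

def lower_tail(a, r, pi0):
    """P(A <= a | pi0), as in (3.1)."""
    return binom.cdf(a, r, pi0)

def upper_tail(a, r, pi0):
    """P(A >= a | pi0) = 1 - P(A <= a - 1 | pi0), as in (3.2)."""
    return 1.0 - binom.cdf(a - 1, r, pi0)

# For instance, with a = 2, r = 10, pi0 = .4 (the setting of Example 3.1 below):
print(lower_tail(2, 10, 0.4))   # approx. .1673
print(upper_tail(2, 10, 0.4))   # approx. .9536
```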
Let pmin be the smaller of P(A ≤ a|π0) and P(A ≥ a|π0). Then pmin is the probability of observing an outcome at least as “extreme” as a at that end of the distribution. For a one-sided alternative hypothesis, pmin is defined to be the p-value of the test. To compute the two-sided p-value we need a method of determining a corresponding probability at the “other end” of the distribution. The two-sided p-value is defined to be the sum of these two probabilities. One possibility is to define the second probability to be the largest tail probability at the other end of the distribution which does not exceed pmin. We refer to this approach as the cumulative method. An alternative is to define the second probability to be equal to pmin, in which case the two-sided p-value is simply 2 × pmin. We refer to this approach as the doubling method. Evidently the doubling method produces a two-sided p-value at least as large as the one obtained using the cumulative approach. When the distribution is approximately symmetrical, the two methods produce similar results. For the binomial distribution this will be the case when π0 is near .5. There does not appear to be a consensus as to whether the cumulative method or the doubling method is the best approach to calculating two-sided p-values (Yates, 1984).
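The following sketch implements both conventions as defined above, again assuming scipy; the helper name two_sided_p and its return convention are illustrative.

```python
# A minimal sketch of the doubling and cumulative two-sided p-values.
from scipy.stats import binom

def two_sided_p(a, r, pi0):
    lower = binom.cdf(a, r, pi0)             # P(A <= a | pi0), eq. (3.1)
    upper = 1.0 - binom.cdf(a - 1, r, pi0)   # P(A >= a | pi0), eq. (3.2)
    p_min = min(lower, upper)

    # Doubling method: the two-sided p-value is simply twice the smaller tail.
    p_doubling = min(1.0, 2.0 * p_min)

    # Cumulative method: add the largest tail probability at the other end
    # of the distribution that does not exceed p_min.
    if lower <= upper:
        other_tails = [1.0 - binom.cdf(x - 1, r, pi0) for x in range(a + 1, r + 1)]
    else:
        other_tails = [binom.cdf(x, r, pi0) for x in range(0, a)]
    other = max([p for p in other_tails if p <= p_min], default=0.0)
    p_cumulative = p_min + other

    return p_cumulative, p_doubling
```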
In order to make a decision about whether or not to reject H0, we need to select a value for α, the probability of a type I error (Section 2.1). According to the classical approach to hypothesis testing, when the p-value is less than α, the null hypothesis is rejected. An undesirable practice is to simply report a hypothesis test as either “statistically significant” or “not statistically significant,” according to whether the p-value is less than α or not, respectively. A more informative way of presenting results is to give the actual p-value. This avoids confusion when the value of α has not been made explicit, and it gives the reader the option of interpreting the hypothesis test according to other choices of α. In this book we avoid any reference to “statistical significance,” preferring instead to comment on the “evidence” provided by a p-value (relative to a given α). For descriptive purposes we adopt the current convention of setting α = .05. So, for example, when the p-value is “much smaller” than .05 we comment on this finding by saying that the data provide “little evidence” for H0.
Referring to the hypothesis test based on (3.1) and (3.2) as “exact” is apt to leave the impression that, when the null hypothesis is true and α = .05, over the course of many replications of the study the null hypothesis will be rejected 5% of the time. In most applications, an exact hypothesis test based on a discrete distribution will reject the null hypothesis less frequently than is indicated by the nominal value of α. The reason is that the tail probabilities of a discrete distribution do not assume all possible values between 0 and 1. Borrowing an example from Yates (1984), consider a study in which a coin is tossed 10 times. Under the null hypothesis that the coin is fair, the study can be modeled using the binomial distribution with parameters (.5, 10). In this case, P(A ≥ 8|.5) = 5.5% and P(A ≥ 9|.5) = 1.1%. It follows that, based on a one-sided test with α = .05, the null hypothesis will be rejected 1.1%, not 5%, of the time. For this reason, exact tests are said to be conservative.
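The conservatism in the coin-tossing example can be checked directly. A minimal sketch, assuming the one-sided test rejects for large values of A:

```python
# Find the rejection threshold of the exact one-sided test and its actual size.
from scipy.stats import binom

r, pi0, alpha = 10, 0.5, 0.05

# Smallest critical value c with P(A >= c | .5) <= alpha.
c = next(c for c in range(r + 1) if 1.0 - binom.cdf(c - 1, r, pi0) <= alpha)

print(c)                               # 9, since P(A >= 8 | .5) = .055 > .05
print(1.0 - binom.cdf(c - 1, r, pi0))  # actual size approx. .011, not the nominal .05
```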
In the examples presented in this book we routinely use more decimal places than would ordinarily be justified by the sample size under consideration. The reason is that we often wish to compare findings based on several statistical techniques, and in many instances the results are so close in value that a large number of decimal places is needed in order to demonstrate a difference. Most of the calculations in this book were performed on a computer. In many of the examples, one or more intermediate steps have been included rather than just the final answer. The numbers in the intermediate steps have necessarily been rounded and so may not lead to precisely the final answer given in the example.
Example 3.1 Let a = 2 and r = 10, and consider H0: π0 = .4. The binomial distribution with parameters (.4, 10) is given in Table 3.1. Since P(A ≤ 2|.4) = .167 and P(A ≥ 2|.4) = .954, it follows that pmin = .167.
TABLE 3.1 Probability Function (%) for the Binomial Distribution with Parameters (.4, 10)
a P(A=a|.4) P(A≤a|.4) P(A≥a|.4)
0 .60 .60 100
1 4.03 4.64 99.40
2 12.09 16.73 95.36
3 21.50 38.23 83.27
4 25.08 63.31 61.77
5 20.07 83.38 36.69
6 11.15 94.52 16.62
7 4.25 98.77 5.48
8 1.06 99.83 1.23
9 .16 99.99 .17
10 .01 100 .01
At the other end of the distribution the largest tail probability not exceeding pmin is P(A ≥ 6|.4) = .166. So the two-sided p-value based on the cumulative method is p = .167 + .166 = .334. According to the doubling approach the two-sided p-value is also p = 2(.167) = .334. In view of these results there is little evidence to reject H0.
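These figures can be verified directly; a short check in Python (the small discrepancies with the values quoted above reflect the rounding caveat noted earlier):

```python
# Direct check of Example 3.1 using unrounded tail probabilities.
from scipy.stats import binom

p_low = binom.cdf(2, 10, 0.4)           # P(A <= 2 | .4) = .16729...
p_high = 1.0 - binom.cdf(1, 10, 0.4)    # P(A >= 2 | .4) = .95364...
p_min = min(p_low, p_high)

print(p_min + (1.0 - binom.cdf(5, 10, 0.4)))  # cumulative: .16729 + .16624 = .33352
print(2.0 * p_min)                            # doubling: .33457; text rounds 2(.167) = .334
```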
3.1.2 A Critique of p-values and Hypothesis Tests
A “small” p-value means that, under the assumption that H0 is true, an outcome as extreme as, or more extreme than, the one that was observed is unlikely. We interpret such a finding as evidence against H0, but this is not the same as saying that evidence has been found in favor of H1. The reason for making this distinction is that the p-value is defined exclusively in terms of H0 and so does not explicitly contrast H0 with H1. Intuitively it seems that a decision about whether an outcome should be regarded as likely or unlikely ought to depend on a direct comparison with other possible outcomes. This comparative feature is missing from the p-value. For this reason and others, the p-value is considered by some authors to be a poor measure of “evidence” (Goodman and Royall, 1988; Goodman, 1993; Schervish, 1996). An approach that avoids this problem is to base inferential procedures on the likelihood ratio. This quantity is defined to be the quotient of the likelihood of the observed data under the assumption that H0 is true, divided by the corresponding likelihood under the assumption that H1 is true (Edwards, 1972; Clayton and Hills, 1993; Royall, 1997). Likelihood ratio methods involve considerations beyond the scope of this book and so, except for likelihood ratio tests, this approach to statistical inference will not be considered further.
From the epidemiologic perspective, another problem with the classical use of p-values is that they are traditionally geared toward all-or-nothing decisions: Based on the magnitude of the p-value and the agreed-upon α, H0 is either rejected or not. In epidemiology this approach to data analysis is usually unwarranted because the findings from a single epidemiologic study are rarely, if ever, definitive. Rather, the advancement of knowledge based on epidemiologic research tends to be cumulative, with each additional study contributing incrementally to our understanding. Not infrequently, epidemiologic studies produce conflicting results, making the evaluation of research findings that much more challenging. Given these uncertainties, in epidemiology we are usually interested not only in whether the true value of a parameter is equal to some hypothesized value but, more importantly, in the range of plausible values for the parameter. Confidence intervals provide this kind of information. In recent years the epidemiologic literature has become critical of p-values and has placed increasingly greater emphasis on the use of confidence intervals (Rothman, 1978; Gardner and Altman, 1986; Poole, 1987). Despite these concerns, hypothesis tests and p-values are in common use in the analysis of epidemiologic data, and so they are given due consideration in this book.
3.1.3 Confidence Interval
Let α be the probability of a type I error and consider a given method of testing the null hypothesis H0: π = π0. A (1 − α) × 100% confidence interval for π is defined to be the set of all parameter values π0 such that H0 is not rejected. In other words, the confidence interval is the set of all π0 that are consistent with the study data (for a given choice of α). Note that according to this definition, different methods of testing the null hypothesis, such as exact and asymptotic methods, will usually lead to somewhat different confidence intervals. The process of obtaining a confidence interval in this manner is referred to as inverting the hypothesis test. In Example 3.1 the “data” consist of a = 2 and r = 10. Based on the doubling method, the exact test of H0: π0 = .4 resulted in a two-sided p-value of .334. By definition, .4 is in the exact (1 − α) × 100% confidence interval for π when the null hypothesis H0: π = .4 is not rejected, and this occurs when the p-value is greater than α, that is, when α ≤ .334.
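As an illustration of test inversion, the following sketch scans a grid of candidate values π0 and retains those not rejected by the doubling-method exact test at α = .05; the grid resolution and the function name invert_test are illustrative choices.

```python
# A minimal sketch of inverting the exact (doubling-method) test.
import numpy as np
from scipy.stats import binom

def invert_test(a, r, alpha=0.05):
    kept = []
    for pi0 in np.linspace(0.0001, 0.9999, 9999):
        lower = binom.cdf(a, r, pi0)
        upper = 1.0 - binom.cdf(a - 1, r, pi0)
        if 2.0 * min(lower, upper) > alpha:   # H0: pi = pi0 is not rejected
            kept.append(pi0)
    return min(kept), max(kept)

print(invert_test(2, 10))   # approx. (.025, .556); compare Example 3.2 below
```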
A (1 − α) × 100% confidence interval for π will be denoted by $[\underline{\pi}, \overline{\pi}]$. We refer to $\underline{\pi}$ and $\overline{\pi}$ as the lower and upper bounds of the confidence interval, respectively. In keeping with established mathematical notation, square brackets are used to indicate that the confidence interval contains all possible values of the parameter that are greater than or equal to $\underline{\pi}$ and less than or equal to $\overline{\pi}$. It is sometimes said (incorrectly) that 1 − α is the probability that π is in $[\underline{\pi}, \overline{\pi}]$. This manner of speaking suggests that the confidence interval is fixed and that, for a given study, π is a random quantity which is either in the confidence interval or not. In fact, precisely the opposite is true. Both $\underline{\pi}$ and $\overline{\pi}$ are random variables and so $[\underline{\pi}, \overline{\pi}]$ is actually a random interval. This explains why a confidence interval is sometimes referred to as an interval estimate. An appropriate interpretation of $[\underline{\pi}, \overline{\pi}]$ is as follows: over the course of many replications of the study, (1 − α) × 100% of the “realizations” of the confidence interval will contain π.
An exact (1 − α) × 100% confidence interval for π is obtained by solving the equations

$$\frac{\alpha}{2} = P(A \ge a \mid \underline{\pi}) = \sum_{x=a}^{r} \binom{r}{x} \underline{\pi}^{x} (1-\underline{\pi})^{r-x} = 1 - \sum_{x=0}^{a-1} \binom{r}{x} \underline{\pi}^{x} (1-\underline{\pi})^{r-x} \tag{3.3}$$
and
$$\frac{\alpha}{2} = P(A \le a \mid \overline{\pi}) = \sum_{x=0}^{a} \binom{r}{x} \overline{\pi}^{x} (1-\overline{\pi})^{r-x} = 1 - \sum_{x=a+1}^{r} \binom{r}{x} \overline{\pi}^{x} (1-\overline{\pi})^{r-x} \tag{3.4}$$
for $\underline{\pi}$ and $\overline{\pi}$. Since there is only one unknown in each equation, the solutions can be found by trial and error. If a = 0, which sometimes happens when the probability of disease is low, we define $\underline{\pi} = 0$ and $\overline{\pi} = 1 - \alpha^{1/r}$ (Louis, 1981; Jovanovic, 1998). In StatXact (1998, §12.3), a statistical package designed to perform exact calculations, the upper bound is defined to be $\overline{\pi} = 1 - (\alpha/2)^{1/r}$. Since an exact confidence interval is obtained by inverting an exact test, when the distribution is discrete, the resulting exact confidence interval will be conservative, that is, wider than is indicated by the nominal value of α (Armitage and Berry, 1994, p. 123).
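The trial-and-error search can be automated with a numerical root finder. A minimal sketch, with scipy.optimize.brentq standing in for the manual search; note that the a = r branch mirrors the a = 0 convention by symmetry and is an assumption, since the text covers only a = 0.

```python
# A minimal sketch of solving (3.3) and (3.4) numerically.
from scipy.optimize import brentq
from scipy.stats import binom

def exact_ci(a, r, alpha=0.05):
    if a == 0:
        return 0.0, 1.0 - alpha ** (1.0 / r)   # convention cited in the text
    if a == r:
        return alpha ** (1.0 / r), 1.0         # symmetric analogue (assumption)
    # (3.3): P(A >= a | pi) = alpha/2 determines the lower bound.
    pi_lower = brentq(lambda p: (1.0 - binom.cdf(a - 1, r, p)) - alpha / 2,
                      1e-10, 1.0 - 1e-10)
    # (3.4): P(A <= a | pi) = alpha/2 determines the upper bound.
    pi_upper = brentq(lambda p: binom.cdf(a, r, p) - alpha / 2,
                      1e-10, 1.0 - 1e-10)
    return pi_lower, pi_upper

print(exact_ci(2, 10))   # approx. (.0252, .5561), i.e. [.025, .556] as in Example 3.2 below
```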
Example 3.2 Let a = 2 and r = 10. From

$$.025 = 1 - \sum_{x=0}^{1} \binom{10}{x} \underline{\pi}^{x} (1-\underline{\pi})^{10-x} = 1 - \left[(1-\underline{\pi})^{10} + 10\underline{\pi}(1-\underline{\pi})^{9}\right]$$

and

$$.025 = \sum_{x=0}^{2} \binom{10}{x} \overline{\pi}^{x} (1-\overline{\pi})^{10-x} = (1-\overline{\pi})^{10} + 10\overline{\pi}(1-\overline{\pi})^{9} + 45\overline{\pi}^{2}(1-\overline{\pi})^{8}$$
a 95% confidence interval for π is [.025, .556]. Note that the confidence interval includes π0 = .4, a finding that is consistent with the results of Example 3.1.
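As a cross-check, the solutions of (3.3) and (3.4) can also be expressed through beta-distribution quantiles, a standard identity linking binomial tail probabilities to the beta distribution (this identity is not used in the text itself):

```python
# Closed-form check of Example 3.2 via beta quantiles.
from scipy.stats import beta

a, r, alpha = 2, 10, 0.05
pi_lower = beta.ppf(alpha / 2, a, r - a + 1)       # approx. .0252
pi_upper = beta.ppf(1 - alpha / 2, a + 1, r - a)   # approx. .5561
print(round(pi_lower, 3), round(pi_upper, 3))      # .025 .556, as in the example
```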