4.2 EXACT CONDITIONAL METHODS FOR A SINGLE 2×2 TABLE
Hypergeometric Distribution
Once we have conditioned on $m_1$, the random variables $A_1$ and $A_2$ are no longer independent. Specifically, we have the constraint $A_1 + A_2 = m_1$, and so $A_2$ is completely determined by $A_1$ (and vice versa). As a result of conditioning on $m_1$ we have gone from two independent binomial random variables to a single random variable corresponding to the index cell. We continue to denote the random variable in question by $A_1$, allowing the context to make clear which probability model is being considered. As shown in Appendix C, conditioning on $m_1$ results in a (noncentral) hypergeometric distribution. The probability function is
$$
P(A_1 = a_1 \mid OR) = \frac{1}{C} \binom{r_1}{a_1} \binom{r_2}{m_1 - a_1} OR^{a_1} \tag{4.13}
$$
where
$$
C = \sum_{x=l}^{u} \binom{r_1}{x} \binom{r_2}{m_1 - x} OR^{x}.
$$
Viewed as a hypergeometric random variable, $A_1$ has the sample space $\{l, l+1, \ldots, u\}$, where $l = \max(0, r_1 - m_2)$ and $u = \min(r_1, m_1)$. Here max and min mean that $l$ is the maximum of 0 and $r_1 - m_2$, and $u$ is the minimum of $r_1$ and $m_1$. Since $r_1 - m_2 = (r - r_2) - (r - m_1) = m_1 - r_2$, $l$ is sometimes written as $\max(0, m_1 - r_2)$. Evidently, $l \geq 0$ and $u \leq r_1$, and so the hypergeometric sample space of $A_1$ is contained in the binomial sample space. For a given set of marginal totals, the hypergeometric distribution is completely determined by the parameter $OR$. Therefore, by conditioning on $m_1$ we have eliminated the nuisance parameter $\pi_2$. The numerator of (4.13) gives the distribution its basic shape, and the denominator $C$ ensures that (1.1) is satisfied. From (1.2) and (1.3), the hypergeometric mean and variance are
$$
E(A_1 \mid OR) = \frac{1}{C} \sum_{x=l}^{u} x \binom{r_1}{x} \binom{r_2}{m_1 - x} OR^{x} \tag{4.14}
$$
and
$$
\operatorname{var}(A_1 \mid OR) = \frac{1}{C} \sum_{x=l}^{u} \left[x - E(A_1 \mid OR)\right]^2 \binom{r_1}{x} \binom{r_2}{m_1 - x} OR^{x}. \tag{4.15}
$$
Unfortunately, (4.13), (4.14), and (4.15) do not usually simplify to less complicated expressions. An instance where simplification does occur is when $OR = 1$. In this case we say that $A_1$ has a central hypergeometric distribution. For the central hypergeometric distribution,
$$
P_0(A_1 = a_1) = \frac{\dbinom{r_1}{a_1} \dbinom{r_2}{m_1 - a_1}}{\dbinom{r}{m_1}}
= \frac{r_1!\, r_2!\, m_1!\, m_2!}{a_1!\,(m_1 - a_1)!\,(r_1 - a_1)!\,(r_2 - m_1 + a_1)!\, r!} \tag{4.16}
$$
$$
e_1 = E_0(A_1) = \frac{r_1 m_1}{r} \tag{4.17}
$$
and
$$
v_0 = \operatorname{var}_0(A_1) = \frac{r_1 r_2 m_1 m_2}{r^2 (r - 1)}. \tag{4.18}
$$
Since $m_1$ is now being treated as a constant, $e_1$ and $v_0$ are the exact mean and variance rather than just estimates. However, for the sake of uniformity of notation, we will denote these quantities by $\hat{e}_1$ and $\hat{v}_0$ in what follows. Observe that, other than $r!$, the denominator of the final expression in (4.16) is the product of factorials defined in terms of the interior cells of Table 4.7. A convenient method of tabulating a central hypergeometric probability function is to form each of the possible 2×2 tables and calculate probability elements using (4.16).
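To make the preceding calculations concrete, the following Python sketch tabulates the noncentral hypergeometric probability function (4.13) and its mean (4.14) and variance (4.15) directly from the marginal totals. The function names (hypergeom_pmf, hypergeom_mean_var) and the numerical check against (4.17) and (4.18) are illustrative choices, not part of the development above.

```python
# Illustrative sketch of (4.13)-(4.15); names and the test margins are assumptions.
from math import comb

def hypergeom_pmf(r1, r2, m1, OR=1.0):
    """Return {a1: P(A1 = a1 | OR)} over the sample space {l, ..., u}."""
    l, u = max(0, m1 - r2), min(r1, m1)
    weights = {x: comb(r1, x) * comb(r2, m1 - x) * OR**x for x in range(l, u + 1)}
    C = sum(weights.values())                      # normalizing constant in (4.13)
    return {x: w / C for x, w in weights.items()}

def hypergeom_mean_var(r1, r2, m1, OR=1.0):
    """Conditional mean (4.14) and variance (4.15)."""
    pmf = hypergeom_pmf(r1, r2, m1, OR)
    mean = sum(x * p for x, p in pmf.items())
    var = sum((x - mean) ** 2 * p for x, p in pmf.items())
    return mean, var

# Check the central case (OR = 1) against (4.17) and (4.18) for arbitrary margins.
r1, r2, m1 = 10, 15, 8
r, m2 = r1 + r2, r1 + r2 - m1
mean, var = hypergeom_mean_var(r1, r2, m1)
assert abs(mean - r1 * m1 / r) < 1e-12
assert abs(var - r1 * r2 * m1 * m2 / (r**2 * (r - 1))) < 1e-12
```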
Confidence Interval
Since the hypergeometric distribution involves the single parameter $OR$, the approach to exact interval estimation and hypothesis testing is a straightforward adaptation of the techniques described for the binomial distribution in Sections 3.1.1 and 3.1.2. An exact $(1 - \alpha) \times 100\%$ confidence interval for $OR$ is obtained by solving the equations
$$
\frac{\alpha}{2} = P(A_1 \geq a_1 \mid \underline{OR}_c)
= \frac{1}{\underline{C}_c} \sum_{x=a_1}^{u} \binom{r_1}{x} \binom{r_2}{m_1 - x} (\underline{OR}_c)^{x}
= 1 - \frac{1}{\underline{C}_c} \sum_{x=l}^{a_1 - 1} \binom{r_1}{x} \binom{r_2}{m_1 - x} (\underline{OR}_c)^{x}
$$
and
$$
\frac{\alpha}{2} = P(A_1 \leq a_1 \mid \overline{OR}_c)
= \frac{1}{\overline{C}_c} \sum_{x=l}^{a_1} \binom{r_1}{x} \binom{r_2}{m_1 - x} (\overline{OR}_c)^{x}
= 1 - \frac{1}{\overline{C}_c} \sum_{x=a_1 + 1}^{u} \binom{r_1}{x} \binom{r_2}{m_1 - x} (\overline{OR}_c)^{x}
$$
for $\underline{OR}_c$ and $\overline{OR}_c$, where $\underline{C}_c$ and $\overline{C}_c$ stand for $C$ with $\underline{OR}_c$ and $\overline{OR}_c$ substituted for $OR$, respectively.
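These equations can be solved numerically. The sketch below, which builds on the hypergeom_pmf function defined earlier, finds the limits by bisection on the odds ratio; the bracketing interval, tolerance, and the name exact_or_ci are assumptions made for illustration, and for very large tables the weights would need to be accumulated on a log scale to avoid overflow.

```python
# Illustrative root-finding sketch for the exact conditional limits; uses
# hypergeom_pmf from the earlier sketch. Bracket and tolerance are arbitrary.
def exact_or_ci(a1, r1, r2, m1, alpha=0.05, lo=1e-6, hi=1e6, tol=1e-10):
    def upper_tail(OR):  # P(A1 >= a1 | OR), increasing in OR
        return sum(p for x, p in hypergeom_pmf(r1, r2, m1, OR).items() if x >= a1)

    def lower_tail(OR):  # P(A1 <= a1 | OR), decreasing in OR
        return sum(p for x, p in hypergeom_pmf(r1, r2, m1, OR).items() if x <= a1)

    def solve(f, target, increasing):
        a, b = lo, hi
        for _ in range(200):
            mid = (a * b) ** 0.5             # geometric bisection: OR is a ratio
            if (f(mid) < target) == increasing:
                a = mid
            else:
                b = mid
            if b - a < tol * a:
                break
        return (a * b) ** 0.5

    l, u = max(0, m1 - r2), min(r1, m1)
    # At the endpoints of the sample space the corresponding limit is taken
    # to be 0 or infinity, by convention.
    or_lower = 0.0 if a1 == l else solve(upper_tail, alpha / 2, increasing=True)
    or_upper = float("inf") if a1 == u else solve(lower_tail, alpha / 2, increasing=False)
    return or_lower, or_upper
```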
Fisher’s Exact Test
It is possible to test hypotheses of the form $H_0: OR = OR_0$ for an arbitrary choice of $OR_0$ but, in practice, interest is mainly in the hypothesis of no association $H_0: OR = 1$. The exact test of association based on the central hypergeometric distribution is referred to as Fisher's (exact) test (Fisher, 1936; §21.02). The tail probabilities are
$$
P_0(A_1 \geq a_1) = \sum_{x=a_1}^{u} \frac{\dbinom{r_1}{x} \dbinom{r_2}{m_1 - x}}{\dbinom{r}{m_1}}
= 1 - \sum_{x=l}^{a_1 - 1} \frac{\dbinom{r_1}{x} \dbinom{r_2}{m_1 - x}}{\dbinom{r}{m_1}}
$$
and
$$
P_0(A_1 \leq a_1) = \sum_{x=l}^{a_1} \frac{\dbinom{r_1}{x} \dbinom{r_2}{m_1 - x}}{\dbinom{r}{m_1}}
= 1 - \sum_{x=a_1 + 1}^{u} \frac{\dbinom{r_1}{x} \dbinom{r_2}{m_1 - x}}{\dbinom{r}{m_1}}.
$$
Calculation of the two-sided $p$-value using either the cumulative or doubling method follows precisely the steps described for the binomial distribution in Section 3.1.1.
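As a companion to the tail probabilities above, the following self-contained sketch computes the one-sided tails and two-sided $p$-values for Fisher's test. The doubling method is simply twice the smaller tail; for the cumulative method the sketch adopts one common convention (summing the probabilities of all outcomes no more probable than the observed $a_1$), which is broadly consistent with the calculations in Examples 4.4 and 4.5 but may differ in detail from the rule of Section 3.1.1.

```python
# Illustrative sketch of Fisher's exact test for a single 2x2 table.
from math import comb

def fisher_exact_pvalues(a1, r1, r2, m1):
    """One-sided tail probabilities and two two-sided p-value conventions."""
    l, u = max(0, m1 - r2), min(r1, m1)
    C = comb(r1 + r2, m1)                     # denominator of (4.16), with r = r1 + r2
    pmf = {x: comb(r1, x) * comb(r2, m1 - x) / C for x in range(l, u + 1)}
    upper = sum(p for x, p in pmf.items() if x >= a1)   # P0(A1 >= a1)
    lower = sum(p for x, p in pmf.items() if x <= a1)   # P0(A1 <= a1)
    p_doubling = min(1.0, 2 * min(lower, upper))
    # "Cumulative" convention assumed here: all outcomes no more probable than a1.
    p_cumulative = sum(p for p in pmf.values() if p <= pmf[a1] * (1 + 1e-9))
    return lower, upper, p_cumulative, p_doubling
```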
Recall the discussion in Chapter 3 regarding the conservative nature of an exact test when the distribution is discrete. This conservatism, which is a feature of Fisher's test, is more pronounced when the sample size is small. This is precisely the condition under which an asymptotic test, such as Pearson's test, becomes invalid. These issues have led to a protracted debate regarding the relative merits of these two tests when the sample size is small. Currently, Fisher's test appears to be regarded more favorably (Yates, 1984; Little, 1989).
Example 4.3 (Hypothetical Data) Data from a hypothetical cohort study are given in Table 4.8. For these data, $l = 1$ and $u = 3$. Note that 0, which is an element of the binomial sample space of $A_1$, cannot be an element of the hypergeometric sample space since that would force the lower right cell count to be $-1$.
The central hypergeometric probability function is given in Table 4.9. The mean and variance are $\hat{e}_1 = 1.80$ and $\hat{v}_0 = .36$.
The noncentral hypergeometric probability function corresponding to Table 4.8 is
$$
P(A_1 = a_1 \mid OR) = \frac{1}{C} \binom{3}{a_1} \binom{2}{3 - a_1} OR^{a_1}
$$
where
$$
C = \sum_{x=1}^{3} \binom{3}{x} \binom{2}{3 - x} OR^{x} = 3\,OR + 6\,OR^2 + OR^3.
$$
TABLE 4.8 Observed Counts: Hypothetical Cohort Study

                 Disease
  Exposure    yes    no
  yes           2     1     3
  no            1     1     2
                3     2     5
TABLE 4.9 Central Hypergeometric Probability Function: Hypothetical Cohort Study

  a1    P0(A1 = a1)
  1     3!2!3!2!/(1!2!2!0!5!) = .3
  2     3!2!3!2!/(2!1!1!1!5!) = .6
  3     3!2!3!2!/(3!0!0!2!5!) = .1
The exact conditional 95% confidence interval for $OR$ is $[.013, 234.5]$, which is obtained by solving the equations
$$
.025 = \sum_{x=2}^{3} P(A_1 = x \mid \underline{OR}_c)
= \frac{6(\underline{OR}_c)^2 + (\underline{OR}_c)^3}{3\,\underline{OR}_c + 6(\underline{OR}_c)^2 + (\underline{OR}_c)^3}
$$
and
$$
.025 = \sum_{x=1}^{2} P(A_1 = x \mid \overline{OR}_c)
= \frac{3\,\overline{OR}_c + 6(\overline{OR}_c)^2}{3\,\overline{OR}_c + 6(\overline{OR}_c)^2 + (\overline{OR}_c)^3}
$$
for $\underline{OR}_c$ and $\overline{OR}_c$.
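As a check, the illustrative routines sketched earlier in this section reproduce these results numerically, assuming hypergeom_pmf and exact_or_ci are defined as above:

```python
# Numerical check of Example 4.3 (r1 = 3, r2 = 2, m1 = 3, a1 = 2), reusing the
# hypothetical hypergeom_pmf and exact_or_ci sketches defined earlier.
print(hypergeom_pmf(3, 2, 3))      # central pmf {1: 0.3, 2: 0.6, 3: 0.1}, as in Table 4.9
print(exact_or_ci(2, 3, 2, 3))     # approximately (0.013, 234.5)
```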
Example 4.4 (Antibody–Diarrhea) For the data in Table 4.3, $l = 3$ and $u = 14$.
The central hypergeometric distribution is given in Table 4.10.
The exact conditional 95% confidence interval for $OR$ is [1.05, 86.94], which is quite wide and just misses containing 1.
TABLE 4.10 Central Hypergeometric Probability Function (%): Antibody–Diarrhea
a1 P0(A1=a1) P0(A1≤a1) P0(A1≥a1)
3 <.01 <.01 100
4 .03 .03 99.99
5 .44 .47 99.97
6 3.08 3.55 99.53
7 11.43 14.98 96.45
8 24.01 38.99 85.02
9 29.35 68.34 61.01
10 20.96 89.31 31.66
11 8.58 97.88 10.69
12 1.91 99.79 2.12
13 .21 99.99 .21
14 .01 100 .01
TABLE 4.11 Central Hypergeometric Probability Function (%): Receptor Level–Breast Cancer

a1 P0(A1=a1) P0(A1≤a1) P0(A1≥a1)
... ... ... ...
3 <.01 <.01 100
4 .01 .02 99.99
5 .07 .09 99.98
... ... ... ...
11 9.91 23.13 86.78
12 12.88 36.01 76.87
13 14.54 50.55 63.99
14 14.33 64.88 49.45
15 12.37 77.25 35.12
16 9.39 86.64 22.75
... ... ... ...
22 .13 99.94 .19
23 .04 99.98 .06
24 .01 99.99 .02
... ... ... ...
The $p$-value for Fisher's test based on the cumulative method is $P_0(A_1 \geq 12) + P_0(A_1 \leq 5) = .026$, and based on the doubling method it is $2 \times P_0(A_1 \geq 12) = .042$. For these data, there is a noticeable difference between the cumulative and doubling results, but in either case we infer that low antibody level is associated with an increased risk of diarrhea. A comparison of the preceding results with those of Example 4.1 illustrates that exact confidence intervals tend to be wider than asymptotic ones, and exact $p$-values are generally larger than their asymptotic counterparts.
Example 4.5 (Receptor Level–Breast Cancer) For Table 4.5(a), $l = 0$ and $u = 48$. The central hypergeometric distribution is given, in part, in Table 4.11.
The exact conditional 95% confidence interval for $OR$ is [1.58, 7.07], and the $p$-value for Fisher's test based on the cumulative method is $P_0(A_1 \geq 23) + P_0(A_1 \leq 4) = .08\%$. The remark made in Example 4.4 about exact results being conservative holds here (except for Pearson's test), as may be seen from a comparison with Example 4.2. However, when the sample size is large, the differences between exact and asymptotic findings are often of little practical importance, as is the case here.
4.3 ASYMPTOTIC CONDITIONAL METHODS