4.5 EXACT LOGISTIC REGRESSION
Exact logistic regression is a method of constructing the Bernoulli distribution such that it is completely determined. The method is unlike maximum likelihood or iteratively reweighted least squares (IRLS), which are asymptotic methods of estimation. The model coefficients and p-values are accurate, but at the cost of computing a large number of permutations.
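To sketch the idea (the notation below is mine and is not used elsewhere in the chapter): let T_j = Σ_i x_ij y_i denote the sufficient statistic for coefficient β_j. Exact inference for β_j is based on the distribution of T_j conditional on the observed values of the sufficient statistics of the remaining coefficients,

    Pr(T_j = t_j | T_k = t_k, k ≠ j) = c(t_j) exp(β_j t_j) / Σ_u c(u) exp(β_j u),

where c(u) counts the binary response vectors that yield T_j = u while leaving the conditioning statistics unchanged. Enumerating these response permutations is what makes the inference exact, and also what makes it computationally demanding.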
Exact logistic and exact Poisson regressions were originally written for the Cytel Corporation product named LogXact in the late 1990s. SAS, SPSS, and Stata soon incorporated the procedures into their commercial packages. R's elrm package is the closest R has come to providing this functionality; it is a difficult function to employ, and I have not been able to obtain results similar to those of the other packages. I will use Stata's exlogistic command for an example of the method and its results. The SAS version of the code is at the end of this chapter; the results are the same as the Stata output.
Exact logistic regression is typically used by analysts when the size of the data being modeled is too small to yield well-fitted results. It is also used when the data are ill balanced; however, it should not be used when there is perfect prediction in the model. When that occurs, penalized logistic regression should be used, as we discussed in the previous section.
For an example of exact logistic regression, I shall use Arizona hospital data collected in 1991. The data consist of a random sample of heart procedures referred to as CABG and PTCA. CABG is an acronym for coronary artery bypass grafting surgery, and PTCA refers to percutaneous transluminal coronary angioplasty, a nonsurgical method of placing a type of balloon into a coronary artery in order to clear blockage caused by cholesterol. It is a substantially less severe procedure than CABG. We will model the probability of death within 48 hours of the procedure for the 34 patients who underwent either a CABG or a PTCA. The variable procedure is 1 for CABG and 0 for PTCA. It is adjusted in the model by the type of admission: type = 1 is an emergency or urgent admission, and 0 is an elective admission. Other variables in the data are not used. Patients are all aged 65 or older.
> data(azcabgptca34)
> head(azheart)
     died procedure age gender los     type
1    Died      CABG  65   Male  10 Elective
2 Survive      CABG  69   Male   7 Emer/Urg
3 Survive      PTCA  76 Female   7 Emer/Urg
4 Survive      CABG  65   Male   8 Elective
5 Survive      PTCA  69   Male   1 Elective
6 Survive      CABG  67   Male   7 Emer/Urg

A cross-tabulation of died on procedure is given as:
> library(Hmisc)
> table(azheart$died, azheart$procedure)

          PTCA CABG
  Survive   19    9
  Died       1    5
It is clear from the tabulation that more patients died having a CABG than a PTCA. A table of died on type of admission is displayed as:
> table(azheart$died, azheart$type)

          Elective Emer/Urg
  Survive       17       11
  Died           4        2
First we shall use a logistic regression to model died on procedure and type.
The model results are displayed in terms of odds ratios and associated statistics.
> exlr <- glm( died ~ procedure + type, family=binomial, data=azheart)
> toOR(exlr)
                   or   delta  zscore pvalue exp.loci. exp.upci.
(Intercept)    0.0389  0.0482 -2.6170 0.0089    0.0034    0.4424
procedureCABG 12.6548 15.7958  2.0334 0.0420    1.0959  146.1267
typeEmer/Urg   1.7186  1.9296  0.4823 0.6296    0.1903   15.5201
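The toOR() function above is a convenience wrapper and is not part of base R; if it is unavailable, essentially the same odds ratios and Wald confidence limits can be recovered directly. A base-R sketch, assuming the fitted model exlr from above:

> # exponentiate the coefficients and their Wald confidence limits
> exp(cbind(OR = coef(exlr), confint.default(exlr)))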
Note that there appears to be a statistically significant relationship between the probability of death and the type of procedure (p = 0.0420). Type of admission does not contribute to the model. Given the size of the data, and to adjust for the possibility of correlation in the data, we next fit the same data as a “quasibinomial” model. Earlier in the book I indicated that the quasibinomial option is nothing more than scaling (multiplying) the logistic model standard errors by the square root of the Pearson dispersion statistic.
> exlr1 <- glm( died ~ procedure + type, family=quasibinomial, data=azheart)
> toOR(exlr1)
                   or   delta  zscore pvalue exp.loci. exp.upci.
(Intercept)    0.0389  0.0478 -2.6420 0.0082    0.0035    0.4324
procedureCABG 12.6548 15.6466  2.0528 0.0401    1.1216  142.7874
typeEmer/Urg   1.7186  1.9114  0.4869 0.6264    0.1943   15.2007
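The scaling claim can be verified directly. A minimal check, assuming exlr and exlr1 from the calls above are still in the workspace (this is a check of the relationship, not part of the original example):

> # Pearson dispersion statistic of the binomial model
> disp <- sum(residuals(exlr, type = "pearson")^2) / df.residual(exlr)
> # binomial SEs scaled by sqrt(dispersion) should match the quasibinomial SEs
> cbind(scaled = sqrt(diag(vcov(exlr))) * sqrt(disp),
+       quasi  = sqrt(diag(vcov(exlr1))))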
The p-value of procedure is further reduced by scaling. The same is the case when the standard errors are adjusted by sandwich or robust variance estimators (not shown). We might accept procedure as a significant predictor of the probability of death, if it were not for the small sample size. If we took another sample from the population of procedures, would we have similar results? It is wise to set aside a validation sample to test our primary model.
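The sandwich adjustment mentioned above can be obtained with the sandwich and lmtest packages, which are not otherwise used in this example; the following is a sketch only, and its output is not reproduced here:

> library(sandwich)
> library(lmtest)
> # robust (sandwich) z-tests for the binomial model
> coeftest(exlr, vcov. = vcovHC(exlr, type = "HC0"))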
But suppose that we do not have access to additional data? In that case we subject the data to an exact logistic regression. The Stata code and output are given below.
. exlogistic died procedure type, nolog
Exact logistic regression                    Number of obs =       34
                                             Model score   = 5.355253
                                             Pr >= score   =   0.0864

        died   Odds Ratio   Suff.   2*Pr(Suff.)     [95% Conf. Interval]
   procedure     10.33644       5        0.0679     .8880104    589.8112
        type     1.656699       2        1.0000     .1005901    28.38678
The results show that procedure is not a significant predictor of died at the p = 0.05 criterion. This should not be surprising. Note that the odds ratio of procedure diminished from 12.65 to 10.33. In addition, the model score statistic given in the header informs us that the model does not fit the data well (p > 0.05).
When modeling small and/or unbalanced data, it is advisable to employ exact statistical methods if they are available.