Chapter 3: Logistic Regression
3.6 Logistic Regression Diagnostics
Once a logistic regression model has been fitted to the data, it is crucial to check whether the assumed model is valid. The appropriateness of the model can be studied using diagnostic testing. To check the adequacy of the fitted model, the residuals must be analysed and outliers identified. Sarkar et al. (2011) describe three ways in which an observation can be considered unusual, namely outliers, influence and leverage. They describe outliers as observations whose values deviate from the expected range and produce extremely large residuals that may indicate a sample peculiarity. They further describe an observation as influential if its deletion substantially changes the estimates of the coefficients, and argue that influence can be thought of as the product of leverage and outlyingness. Leverage, in turn, is a measure of how far an independent variable deviates from its mean. High-leverage points can have an unusually large effect on the estimates of the logistic regression coefficients (Cook, 1998).
Influence statistics measure how much some feature of the model changes when a particular observation is deleted from the model fit. The larger the value of each diagnostic, the greater the influence. Ideally, each observation should have roughly equal influence on the model. Failure to detect influential cases can severely distort the validity of inferences drawn from the model.
Pregibon (1981) provided the theoretical framework that extended linear regression diagnostics to logistic regression. The residual vector and a projection matrix are used as building blocks for the identification of outlying and influential points for the logistic regression model. Consider a
setup where the fitted model contains \(p\) covariates whose observed values form \(J\) covariate patterns, indexed by \(j = 1, 2, \ldots, J\), with \(m_j\) denoting the number of subjects sharing pattern \(\mathbf{x}_j\). In logistic regression the errors are binomial, hence the error variance is a function of the conditional mean:
\[
\operatorname{var}(y_j \mid \mathbf{x}_j) = m_j \pi(\mathbf{x}_j) \times \left[1 - \pi(\mathbf{x}_j)\right].
\]
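For example, a covariate pattern with \(m_j = 10\) subjects and \(\pi(\mathbf{x}_j) = 0.3\) has error variance \(10 \times 0.3 \times 0.7 = 2.1\).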
Begin with the residuals defined in equations 3.4.1 and 3.4.3, which have been “divided” by estimates of their standard errors, and let \(r_j\) and \(d_j\) denote the expressions given in these equations, respectively, for covariate pattern \(\mathbf{x}_j\): \(r_j\) is the Pearson residual and \(d_j\) the deviance residual. Since each residual is divided by an estimate of its standard error, it is expected that, if the logistic regression model is correct, these quantities will have a mean approximately equal to zero and a variance approximately equal to one.
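To make these quantities concrete, the following is a minimal Python sketch (numpy and statsmodels) on a small hypothetical grouped dataset; the data values, variable names, and the use of statsmodels for the fit are illustrative assumptions, and the residual formulas follow the usual grouped-data definitions consistent with equations 3.4.1 and 3.4.3:

    import numpy as np
    import statsmodels.api as sm

    # Hypothetical grouped data: J = 5 covariate patterns with m_j subjects each.
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])      # a single covariate
    m = np.array([10.0, 12.0, 9.0, 11.0, 10.0])  # subjects per pattern, m_j
    y = np.array([2.0, 4.0, 5.0, 8.0, 9.0])      # positive responses, y_j

    X = sm.add_constant(x)                       # J x (p+1) design matrix
    fit = sm.GLM(np.column_stack([y, m - y]), X,
                 family=sm.families.Binomial()).fit()
    pi_hat = fit.fittedvalues                    # estimated pi(x_j)

    # Pearson residual r_j.
    r = (y - m * pi_hat) / np.sqrt(m * pi_hat * (1 - pi_hat))

    # Deviance residual d_j; the np.where guards handle patterns with
    # y_j = 0 or y_j = m_j, where the corresponding log term vanishes.
    with np.errstate(divide="ignore", invalid="ignore"):
        t1 = np.where(y > 0, y * np.log(y / (m * pi_hat)), 0.0)
        t2 = np.where(m > y, (m - y) * np.log((m - y) / (m * (1 - pi_hat))), 0.0)
    d = np.sign(y - m * pi_hat) * np.sqrt(2.0 * (t1 + t2))

    print(r.mean(), r.var())  # roughly 0 and 1 if the model is adequate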
Let \(\mathbf{X}\) denote the \(J \times (p + 1)\) matrix containing the values for all \(J\) covariate patterns formed from the observed values of the \(p\) covariates, with the first column being a column of ones to reflect the presence of an intercept in the model. Pregibon (1981) used weighted least squares linear regression as a model to derive a linear approximation to the fitted values, which yields the hat matrix for logistic regression
\[
\mathbf{H} = \mathbf{V}^{1/2}\mathbf{X}\left(\mathbf{X}'\mathbf{V}\mathbf{X}\right)^{-1}\mathbf{X}'\mathbf{V}^{1/2}, \qquad (3.6.1)
\]
where \(\mathbf{V}\) is a \(J \times J\) diagonal matrix with general element \(v_j = m_j \hat{\pi}(\mathbf{x}_j)\left[1 - \hat{\pi}(\mathbf{x}_j)\right]\).
Let \(h_j\) denote the \(j\)th diagonal element of the matrix \(\mathbf{H}\) defined in equation (3.6.1). It then follows that
\[
h_j = m_j \hat{\pi}(\mathbf{x}_j)\left[1 - \hat{\pi}(\mathbf{x}_j)\right]\mathbf{x}_j'\left(\mathbf{X}'\mathbf{V}\mathbf{X}\right)^{-1}\mathbf{x}_j = v_j \times b_j, \qquad (3.6.2)
\]
where \(b_j = \mathbf{x}_j'\left(\mathbf{X}'\mathbf{V}\mathbf{X}\right)^{-1}\mathbf{x}_j\), \(\mathbf{x}_j' = \left(1, x_{1j}, x_{2j}, \ldots, x_{pj}\right)\) is the vector of covariate values, and \(\sum_{j=1}^{J} h_j = p + 1\), the number of parameters in the model.
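Continuing the hypothetical sketch above (reusing X, m, and pi_hat), equations (3.6.1) and (3.6.2) and the trace identity can be checked numerically; the variable names remain illustrative:

    v = m * pi_hat * (1 - pi_hat)                 # diagonal elements of V
    V_half = np.diag(np.sqrt(v))
    XtVX_inv = np.linalg.inv(X.T @ np.diag(v) @ X)

    H = V_half @ X @ XtVX_inv @ X.T @ V_half      # hat matrix, equation (3.6.1)
    h = np.diag(H)                                # leverages h_j

    b = np.einsum("ij,jk,ik->i", X, XtVX_inv, X)  # b_j = x_j'(X'VX)^{-1} x_j
    print(np.allclose(h, v * b))                  # h_j = v_j * b_j, equation (3.6.2)
    print(h.sum())                                # equals p + 1 = 2 in this sketch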
Hosmer & Lemeshow (1989) discuss the formulation and bounds of the diagonal elements of the hat matrix, and point out that diagnostic information is computed differently in various programs, a distinction that is important to keep in mind. For more information see Hosmer & Lemeshow (1989, p. 151).
Following Hosmer & Lemeshow (1989), consider the residual for the \(j\)th covariate pattern, \(y_j - m_j\hat{\pi}(\mathbf{x}_j) \approx \left(1 - h_j\right)y_j\). The variance of this residual is then
\[
m_j \hat{\pi}(\mathbf{x}_j)\left[1 - \hat{\pi}(\mathbf{x}_j)\right]\left(1 - h_j\right)^2,
\]
where \(\left(1 - h_j\right)^2 \approx \left(1 - h_j\right)\) for small \(h_j\). This suggests that the Pearson residuals do not have variance equal to 1 unless they are further standardized. Recalling that \(r_j\) denotes the Pearson residual given in equation 3.4.1, the standardized Pearson residual for covariate pattern \(\mathbf{x}_j\) is
\[
r_{sj} = \frac{r_j}{\sqrt{1 - h_j}}.
\]
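In the running sketch the standardization is a one-liner (reusing r and h from the blocks above):

    r_s = r / np.sqrt(1 - h)   # standardized Pearson residuals r_sj
    print(r_s.var())           # variance of the standardized residuals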
Hosmer & Lemeshow (1989) further discuss another useful diagnostic statistic, one that examines the effect that deleting all subjects with a particular covariate pattern has on the values of the estimated coefficients and on the overall summary measures of fit, \(X^2\) and \(D\). They argue that the change in the value of the estimated coefficients is analogous to the measure proposed by Cook (1977, 1979) for linear regression. It is obtained as the standardized difference between \(\hat{\boldsymbol{\beta}}\) and \(\hat{\boldsymbol{\beta}}_{(-j)}\), which represent the maximum likelihood estimates computed using all \(J\) covariate patterns and excluding the \(m_j\) subjects with pattern \(\mathbf{x}_j\), respectively, standardized via the estimated covariance matrix of \(\hat{\boldsymbol{\beta}}\). Pregibon (1981) has shown that, to a linear approximation, this quantity for logistic regression is
\[
\Delta\hat{\boldsymbol{\beta}}_j = \left(\hat{\boldsymbol{\beta}} - \hat{\boldsymbol{\beta}}_{(-j)}\right)'\left(\mathbf{X}'\mathbf{V}\mathbf{X}\right)\left(\hat{\boldsymbol{\beta}} - \hat{\boldsymbol{\beta}}_{(-j)}\right) = \frac{r_j^2\, h_j}{\left(1 - h_j\right)^2} = \frac{r_{sj}^2\, h_j}{1 - h_j}.
\]
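In the running sketch this becomes a single line (reusing r_s and h); the linear approximation avoids refitting the model once per covariate pattern:

    delta_beta = r_s**2 * h / (1 - h)   # Cook-type distance for each pattern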
A similar linear approximation can be used to show that the decrease in the value of the Pearson chi-square statistic due to deletion of the subjects with covariate pattern \(\mathbf{x}_j\) is
\[
\Delta X_j^2 = \frac{r_j^2}{1 - h_j} = r_{sj}^2. \qquad (3.6.3)
\]
A similar quantity for the change in deviance may be obtained,
\[
\Delta D_j = d_j^2 + \frac{r_j^2\, h_j}{1 - h_j}.
\]
If \(r_j^2\) is replaced by \(d_j^2\), the following approximation results:
\[
\Delta D_j = \frac{d_j^2}{1 - h_j},
\]
which is similar in form to the expression in equation 3.6.3.
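Both deletion diagnostics follow directly in the running sketch (reusing r, d, r_s, and h from above):

    delta_chi2 = r_s**2                  # change in Pearson chi-square, eq. (3.6.3)
    delta_D = d**2 + r**2 * h / (1 - h)  # change in deviance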
Hosmer & Lemeshow (1989) argue that these diagnostic statistics are conceptually appealing, as they allow one to identify the covariate patterns that are poorly fit (large values of \(\Delta X_j^2\) and/or \(\Delta D_j\)), and also those that have great influence on the values of the estimated parameters (large values of \(\Delta\hat{\boldsymbol{\beta}}_j\)).
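As a closing illustration, the three diagnostics from the running sketch can be tabulated per covariate pattern and unusually large values flagged; the cutoffs below are ad hoc choices for the sketch, not thresholds recommended by the sources cited:

    for j in range(len(m)):
        flag = "<--" if delta_beta[j] > 1.0 or delta_chi2[j] > 4.0 else ""
        print(f"pattern {j}: dX2={delta_chi2[j]:.2f}  dD={delta_D[j]:.2f}  "
              f"dbeta={delta_beta[j]:.2f} {flag}")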