4.1 CHECKING LOGISTIC MODEL FIT
4.1.3 Residual Analysis
The analysis of residuals plays an important role in assessing logistic model fit. The analyst can see how the model fits rather than simply looking at a statistic. Analysts have devised a number of residuals to view the relationships in a logistic model. Nearly all residuals used in logistic regression were discussed in Chapter 3, Section 3.2, although a couple were not. Table 4.1 summarizes the foremost logistic regression residuals, and Table 4.2 gives the R code for producing them. We will only use a few of these residuals in this book, but all have been used in various contexts to assess the worth of a logistic model.
Finally, I shall give the code and graphics for several of the residual analyses most often used in publications.
The Anscombe residuals require calculation of an incomplete beta function, which is not part of base R. An ibeta function is displayed below, together with code to calculate Anscombe residuals for the mymod model above. Paste the lines into the R editor and run; ans contains the logistic Anscombe residuals. They are identical to Stata and SAS results.
y <- medpar$died
mu <- mymod$fitted.values
a <- 2/3 ; b <- 2/3
ibeta <- function(x, a, b) { pbeta(x, a, b) * beta(a, b) }
A <- ibeta(y, a, b) ; B <- ibeta(mu, a, b)
ans <- (A - B) / (mu * (1 - mu))^(1/6)
> summary(ans)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-1.5450 -1.0090 -0.9020 -0.0872 1.5310 3.2270
Residual analysis for logistic models is usually based on what are known as n-asymptotics. However, some statisticians suggest that residuals should instead be based on m-asymptotically formatted data. Data in observation-based form, that is, one observation or case per line, are in n-asymptotic format. The datasets we have been using thus far for examples are in n-asymptotic form.
TABLE 4.1 Residuals for Bernoulli logistic regression

Raw              r      y − μ
Pearson          r_p    (y − μ)/√{μ(1 − μ)}
Deviance         r_d    sgn(y − μ)·√{2 ln(1/μ)} if y = 1;
                        sgn(y − μ)·√{2 ln(1/(1 − μ))} if y = 0
Stand. Pearson   r_sp   r_p/√(1 − h)
Stand. deviance  r_sd   r_d/√(1 − h)
Likelihood       r_l    sgn(y − μ)·√{h(r_p)² + (1 − h)(r_d)²}
Anscombe         r_A    {A(y) − A(μ)}/{μ(1 − μ)}^(1/6)
                        where A(z) = Beta(2/3, 2/3)·IncompleteBeta(z, 2/3, 2/3),
                        z = y or μ, and Beta(2/3, 2/3) = 2.05339. When z = 1,
                        the function reduces to the Beta (see Hilbe, 2009).
Cook's distance  r_CD   h(r_p)²/{C_n(1 − h)²}, C_n = number of coefficients
Delta Pearson    ΔChi²  (r_sp)²
Delta deviance   ΔDev   (r_d)² + h(r_sp)²
Delta beta       Δβ     h(r_p)²/(1 − h)²

m-asymptotic data occur when observations with the same values for all predictors are treated as a single observation, with an extra variable in the data giving the number of n-asymptotic observations sharing that covariate pattern. For example, let us consider the 1495-observation medpar data, for which we have kept only white, hmo, and type as predictors for died.
There are 1495 observations for the data in this format. In m-asymptotic format the data appear as:
   white hmo type   m
1      0   0    1  72
2      0   0    2  33
3      0   0    3  10
4      0   1    1   8
5      0   1    2   4
6      1   0    1 857
7      1   0    2 201
8      1   0    3  83
9      1   1    1 197
10     1   1    2  27
11     1   1    3   3
There are several ways to reduce the three-variable subset of the medpar data to m-asymptotic form. I will show a way that retains the died response variable, renamed dead since it is now a count rather than a binary variable, and then show how to duplicate the above table.
TABLE 4.2 Residual code

mu <- mymod$fitted.values            # predicted probability; fit
r <- medpar$died - mu                # raw residual
dr <- resid(mymod, type="deviance")  # deviance residual
pr <- resid(mymod, type="pearson")   # Pearson residual
hat <- hatvalues(mymod)              # hat matrix diagonal
stdr <- dr/sqrt(1-hat)               # standardized deviance
stpr <- pr/sqrt(1-hat)               # standardized Pearson
deltadev <- dr^2 + hat*stpr^2        # delta deviance
deltaChi2 <- stpr^2                  # delta Pearson
deltaBeta <- (pr^2*hat)/(1-hat)^2    # delta beta
ncoef <- length(coef(mymod))         # number of coefficients
# Cook's distance
cookD <- (pr^2 * hat) / ((1-hat)^2 * ncoef * summary(mymod)$dispersion)

> data(medpar)
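One residual listed in Table 4.1 but not in the residual code is the likelihood residual, which blends the Pearson and deviance residuals according to the hat value. A minimal sketch, assuming mymod has been fit to medpar as in the text:

```r
# Likelihood residuals per Table 4.1:
#   r_l = sgn(y - mu) * sqrt(h * r_p^2 + (1 - h) * r_d^2)
pr  <- resid(mymod, type = "pearson")   # Pearson residual
dr  <- resid(mymod, type = "deviance")  # deviance residual
hat <- hatvalues(mymod)                 # hat matrix diagonal
mu  <- fitted(mymod)                    # predicted probability
rl  <- sign(medpar$died - mu) * sqrt(hat * pr^2 + (1 - hat) * dr^2)
summary(rl)
```

Because h is small for most observations, r_l typically lies close to the deviance residual.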
> library(reshape)   # provides melt() and cast()
> test <- subset(medpar, select = c(died, white, hmo, type))
> white <- factor(test$white)
> hmo <- factor(test$hmo)
> type <- factor(test$type)
> mylgg <- na.omit(data.frame(cast(melt(test, measure="died"),
+       white + hmo + type ~ .,
+       function(x) {c(alive=sum(x==0), dead=sum(x==1))} )))
> mylgg$m <- mylgg$alive + mylgg$dead
> mylgg
   white hmo type alive dead   m
1      0   0    1    55   17  72
2      0   0    2    22   11  33
3      0   0    3     6    4  10
4      0   1    1     7    1   8
5      0   1    2     1    3   4
6      1   0    1   580  277 857
7      1   0    2   119   82 201
8      1   0    3    43   40  83
9      1   1    1   128   69 197
10     1   1    2    19    8  27
11     1   1    3     2    1   3
The code above produced the 11 covariate pattern m-asymptotic data, but I also provide dead and alive, which can be used for the grouped logistic models of the next chapter. m is simply the sum of alive and dead. For example, look at the top line. With m = 72, we know that there were 72 observations in the reduced medpar data for which white=0, hmo=0, and type=1. For that covariate pattern, died=1 (dead) occurred 17 times and died=0 (alive) occurred 55 times.
To obtain the identical covariate pattern list where only white, hmo, type, and m are displayed, the following code reproduces the table.
> white <- mylgg$white
> hmo<- mylgg$hmo
> type <- mylgg$type
> m <- mylgg$m
> m_data <- data.frame(white, hmo,type,m)
> m_data
   white hmo type   m
1      0   0    1  72
2      0   0    2  33
3      0   0    3  10
4      0   1    1   8
5      0   1    2   4
6      1   0    1 857
7      1   0    2 201
8      1   0    3  83
9      1   1    1 197
10     1   1    2  27
11     1   1    3   3
It should be noted that all but one of the possible covariate patterns exist in these data. Only the pattern [white=0, hmo=1, type=3] does not occur in the medpar dataset, and it is therefore absent from the m-asymptotic data format.
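The same reduction to covariate patterns can be done without the reshape package. A sketch using base R's aggregate(), assuming medpar is loaded and contains died, white, hmo, and type:

```r
# Base R alternative: one row per covariate pattern.
# cbind() in the formula aggregates both columns by sum:
#   dead = sum(died), m = count of observations in the pattern
test <- medpar[, c("died", "white", "hmo", "type")]
agg <- aggregate(cbind(dead = died, m = 1) ~ white + hmo + type,
                 data = test, FUN = sum)
agg$alive <- agg$m - agg$dead
agg   # should reproduce the 11 covariate patterns shown above
```

This avoids the melt/cast dependency at the cost of slightly less flexibility in naming the summary columns.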
I will provide the code for Figures 4.1 through 4.3, which are important when evaluating the fit of logistic models. Since R's glm function does not use an m-asymptotic format for residual analysis, I shall discuss the traditional n-asymptotic method. Bear in mind that when there are continuous predictors in the model, m-asymptotic data tend to reduce to n-asymptotic data. Continuous predictors usually take many more distinct values than binary and categorical predictors. A model with two or three continuous predictors typically results in a model where there is no difference between the m-asymptotic and n-asymptotic formats. Residual analysis on observation-based data is the traditional manner of constructing these plots, and it is the standard way of graphing in R. I am adding los (length of stay; number of days in hospital) back into the model to remain consistent with earlier modeling of the medpar data.
You may choose to construct residual graphs using m-asymptotic methods; the code to do this was provided above. However, we shall keep to the standard methods in this chapter. In the next chapter, on grouped logistic models, m-asymptotics is built into the model.
R code for creating the standard residuals found in literature related to logistic regression is given in Table 4.2. Code for creating a simple squared standardized deviance residual versus mu graphic (Figure 4.1) is given as:
data(medpar)
mymod <- glm(died ~ white + hmo + los + factor(type),
             family=binomial, data=medpar)
summary(mymod)
mu <- mymod$fitted.values            # predicted probability that died==1
dr <- resid(mymod, type="deviance")  # deviance residual
hat <- hatvalues(mymod)              # hat matrix diagonal
stdr <- dr/sqrt(1-hat)               # standardized deviance residual
plot(mu, stdr^2)
abline(h = 4, col="red")
Analysts commonly use the plot of the squared standardized deviance residuals versus mu to check for outliers in a fitted logistic model. Values in the plot greater than 4 are considered outliers. The values on the vertical axis are in terms of standard deviations of the residual; the horizontal axis shows predicted probabilities. All figures here are based on the medpar data.
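The outlying observations themselves can be extracted directly rather than read off the plot. A minimal sketch, assuming mymod and stdr from the code above (the cutoff of 4 is the one stated in the text):

```r
# Flag observations whose squared standardized deviance residual
# exceeds the cutoff of 4, then inspect their covariate values
outliers <- which(stdr^2 > 4)
medpar[outliers, c("died", "white", "hmo", "los", "type")]
```

Listing the flagged cases this way makes it easy to check whether the outliers share a common covariate pattern.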
Another good way of identifying outliers from a residual graph is to plot Anscombe residuals versus mu, the predicted probability that the response equals 1. Anscombe residuals adjust the residuals so that they are as normally distributed as possible. This matters when using 2, or 4 when the residual is squared, as the criterion for flagging an observation as an outlier; it is the approximate 95% criterion so commonly used by statisticians for determining statistical significance. Figure 4.2 differs little from Figure 4.1, which uses squared standardized deviance residuals, but the Anscombe plot is preferred.
> plot(mu, ans^2)
> abline(h = 4, lty = "dotted")
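Since the point of the Anscombe adjustment is near-normality, a normal Q-Q plot of the residuals is a reasonable complementary check. A sketch, assuming ans was computed with the ibeta code earlier:

```r
# If the Anscombe adjustment is working, the points should fall
# close to the reference line except in the tails
qqnorm(ans)
qqline(ans)
```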
A leverage or influence plot (Figure 4.3) may be constructed as:
> plot(stpr, hat)
> abline(v=0, col="red")
Large hat values indicate covariate patterns that differ from the average covariate pattern. Values at the horizontal extremes have high residuals. Values that are high on the hat scale and low on the residual scale, that is, high in the middle and close to the zero line, do not fit the model well. They are also difficult to detect as influential when using other graphics. There are some seven observations that fit this characterization. They can be identified by selecting hat values greater than 0.4 and standardized residuals within |2|.

FIGURE 4.1 Squared standardized deviance versus mu.
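The selection just described can be written directly in code. A sketch, assuming hat and stpr from the Table 4.2 code (the 0.4 and 2 cutoffs are those stated in the text):

```r
# High-leverage points with small standardized Pearson residuals:
# influential patterns that other diagnostics tend to miss
suspect <- which(hat > 0.4 & abs(stpr) < 2)
medpar[suspect, c("died", "white", "hmo", "los", "type")]
```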
A wide variety of graphics may be constructed from the residuals given in Table 4.2. See Hilbe (2009), Bilger and Loughin (2015), Smithson and Merkle (2014), and Collett (2003) for examples.