MODELS WITH A CATEGORICAL PREDICTOR


Wald or model-based confidence intervals usually differ little from profile confidence intervals. In the case of the logit2 model, however, where neither x nor the intercept is significant and there are only nine observations in the model, we expect a somewhat substantial difference in the confidence interval values.

Wald or Model-Based Confidence Intervals

> confint.default(logit2)
                2.5 %   97.5 %
(Intercept) -1.164557 3.361782
x           -4.389065 1.380910

Profile Confidence Intervals

> confint(logit2)
Waiting for profiling to be done...
                 2.5 %   97.5 %
(Intercept) -0.9568748 4.105099
x           -4.9264210 1.219928
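The difference between the two kinds of interval can be made concrete with a little arithmetic on the limits printed above. The following is only a sketch: it assumes the Wald limits were formed in the usual way, as the estimate plus or minus a normal quantile times the standard error, and the object names (wald, profile, lower_gap, upper_gap) are mine. No new model is fitted.

```r
# Confidence limits copied from the confint.default() and confint() output above.
wald    <- rbind("(Intercept)" = c(-1.164557, 3.361782),
                 x             = c(-4.389065, 1.380910))
profile <- rbind("(Intercept)" = c(-0.9568748, 4.105099),
                 x             = c(-4.9264210, 1.219928))

# Assumption: a Wald interval is estimate +/- z * SE, so its midpoint is the
# estimate and its half-width divided by z recovers the standard error.
z        <- qnorm(0.975)
estimate <- rowMeans(wald)
se       <- (wald[, 2] - wald[, 1]) / (2 * z)

# Distance from the estimate to each profile limit. Unequal distances
# mean the profile interval is not symmetric about the estimate.
lower_gap <- estimate - profile[, 1]
upper_gap <- profile[, 2] - estimate

round(cbind(estimate, se, lower_gap, upper_gap), 4)
```

The Wald interval is symmetric by construction, whereas the two gaps for the profile interval need not be equal. With only nine observations that asymmetry is what we would expect to see.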

Stata’s pllf command produces profile confidence intervals, but only for continuous predictors.

Scaled, sandwich (robust), and bootstrap confidence intervals will be discussed in Chapter 4 and compared with profile confidence intervals. We shall also discuss which should be used for a particular type of data.

2.4 MODELS WITH A CATEGORICAL PREDICTOR

> library(LOGIT)
> data(medpar)
> head(medpar)
  los hmo white died age80 type provnum
1   4   0     1    0     0    1  030001
2   9   1     1    0     0    1  030001
3   3   1     1    1     1    1  030001
4   9   0     1    0     0    1  030001
5   1   0     1    1     1    1  030001
6   4   0     1    1     0    1  030001

We can check how many are in each level of type

> table(medpar$type)

   1    2    3
1134  265   96

and we can find the proportion in each level

> prop.table(table(medpar$type))

         1          2          3
0.75852843 0.17725753 0.06421405

A no-frills frequency table may be produced by a little programming.

> Cnt <- table(medpar$type)               # counts per level
> Freq <- prop.table(table(medpar$type))  # proportions per level
> typetab <- data.frame(Cnt, Freq)        # columns: Var1, Freq, Var1.1, Freq.1
> my1 <- typetab[ ,1:2]                   # level and count
> Pct <- typetab[ ,4]                     # proportion
> data.frame(my1, Pct)
  Var1 Freq        Pct
1    1 1134 0.75852843
2    2  265 0.17725753
3    3   96 0.06421405

Statisticians have handled categorical predictors in a variety of ways.

When used with regression, and in particular with logistic regression, categorical predictors are nearly always factored into separate indicator or dummy variables. Each indicator variable has a value of 1 or 0 except for the reference level, which is excluded from the regression.

Think of each level except the reference as the x = 1 level, with the reference level as x = 0. If level 1 of a categorical predictor is taken as the reference level, then level 2 is interpreted with reference to level 1. Level 3 is also interpreted with reference to level 1. Level 1 is the default reference level for both R’s glm function and Stata’s regression commands. SAS uses the highest level as the default reference. Here it would be level 3.
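The indicator coding just described can be inspected directly. The sketch below uses a three-row toy data frame of my own (toy is not part of the medpar data) and base R's model.matrix function to display the columns that glm would construct internally for factor(type).

```r
# Toy data: one row per admission type (1 = elective, 2 = urgent, 3 = emergency).
toy <- data.frame(type = c(1, 2, 3))

# model.matrix() returns the design matrix implied by the formula.
# Level 1 is the reference, so it gets no indicator column of its own.
X <- model.matrix(~ factor(type), data = toy)
X
```

The row for level 1 has a 0 in both indicator columns, which is what it means for that level to serve as the reference. Levels 2 and 3 each have a 1 in their own column only.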

It is advisable to use either the lowest or the highest level as the reference, in particular whichever of the two has the most observations. More important still, choose the reference level that makes the most sense for the data being modeled.

You may let the software define your levels, or you may create them yourself. If it is likely that levels will have to be combined, then it may be wise to create separate indicator variables for the levels. First though, let us have the software create internal indicator variables, which are dropped at the conclusion of the display to screen.

> summary(logit3 <- glm(died ~ factor(type), family = binomial,
                        data = medpar))
. . .
Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept)   -0.74924    0.06361 -11.779  < 2e-16 ***
factor(type)2  0.31222    0.14097   2.215  0.02677 *
factor(type)3  0.62407    0.21419   2.914  0.00357 **
---
    Null deviance: 1922.9  on 1494  degrees of freedom
Residual deviance: 1911.1  on 1492  degrees of freedom
AIC: 1917.1

Note how the factor function excluded factor type1 (elective) from the output. It is the reference level though and is used to interpret both type2 (urgent) and type3 (emergency). I shall exponentiate the coefficients of type2 and type3 in order to better interpret the model. Both will be interpreted as odds ratios, with the denominator of the ratio being the reference level.

> exp(coef(logit3))
  (Intercept) factor(type)2 factor(type)3
    0.4727273     1.3664596     1.8665158

The interpretation is

• Urgent admission patients have nearly 37% greater odds of dying in the hospital than do elective admissions.

• Emergency admission patients have nearly 87% greater odds of dying in the hospital than do elective admissions.
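The percentages in these two statements come from exponentiating the coefficients and subtracting 1. The short sketch below repeats that arithmetic using the coefficients as printed in the summary output (rounded to five decimals). The vector names urgent and emergency and the object names are mine.

```r
# Coefficients copied from the summary(logit3) output above.
b <- c(urgent = 0.31222, emergency = 0.62407)

odds_ratio  <- exp(b)                  # odds relative to elective (the reference)
pct_greater <- 100 * (odds_ratio - 1)  # percent difference in odds

round(cbind(odds_ratio, pct_greater), 2)
```

An odds ratio above 1 gives a positive percentage (greater odds than the reference), and one below 1 gives a negative percentage.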

Analysts often find that they must change the reference level of a categorical predictor. This may be done with the following code. We will change from the default reference level 1 to reference level 3 using the relevel function.

> medpar$type <- factor(medpar$type)
> medpar$type <- relevel(medpar$type, ref=3)
> logit4 <- glm(died ~ factor(type), family=binomial, data=medpar)
> exp(coef(logit4))
  (Intercept) factor(type)1 factor(type)2
    0.8823529     0.5357576     0.7320911

Interpretation changes to read

• Elective patients have about half the odds of dying in the hospital that emergency patients have.

• Urgent patients have about three quarters of the odds of dying in the hospital that emergency patients have.
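Changing the reference level does not change the fitted model, only the comparisons that are reported. As a check, the odds ratios with emergency as the reference can be recovered from the earlier ones, where elective was the reference, by simple division. The sketch below uses only the values printed by exp(coef(logit3)), and the object names are mine.

```r
# Values copied from the exp(coef(logit3)) output (reference = elective).
odds_elective <- 0.4727273   # baseline odds of death for elective patients
or_ref1 <- c(urgent = 1.3664596, emergency = 1.8665158)

# Re-express everything relative to emergency patients.
elective_vs_emergency <- 1 / or_ref1[["emergency"]]
urgent_vs_emergency   <- or_ref1[["urgent"]] / or_ref1[["emergency"]]
odds_emergency        <- odds_elective * or_ref1[["emergency"]]

round(c(odds_emergency, elective_vs_emergency, urgent_vs_emergency), 7)
```

Up to rounding, these agree with the three values reported for logit4.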

I mentioned that indicator or dummy variables can be created by hand, and levels merged if necessary. Merging is considered when, for example, the level 2 coefficient (or odds ratio) is not significant relative to reference level 1. We see something similar in the model where type = 3 is the reference level. From looking at the models, it appears that levels 2 and 3 may not be statistically different from one another, and may be merged. I caution you against concluding this, though, since we may want to adjust the standard errors, which changes the p-values, to account for extra correlation in the data or for some other reason that we shall discuss in Chapter 4. On the surface, however, it appears that patients who were admitted as urgent are not significantly different from emergency patients with respect to death while hospitalized.

I mentioned before that combining levels is required if two levels do not significantly differ from one another. In fact, when the emergency level of type is the reference, level 2 (urgent) does not appear to be significant, indicating that type levels 2 and 3 might be combined. With R this can be done as

> table(medpar$type)

   1    2    3
1134  265   96
> medpar$type[medpar$type == 3] <- 2   # reclassify level 3 as level 2
> table(medpar$type)

   1    2
1134  361
> summary(logit6 <- glm(died ~ factor(type), family = binomial,
                        data = medpar))
. . .
Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept)   -0.74924    0.06361 -11.779  < 2e-16 ***
factor(type)2  0.39660    0.12440   3.188  0.00143 **
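The same merge can be carried out with a hand-made indicator variable, which leaves the original type variable untouched. The sketch below is offered under a stated assumption: the cell counts of deaths and survivors are not printed in the text, so I have back-calculated them from the reported level totals (1134, 265, 96) and the reported odds. They should be checked against the medpar data itself. The names grp, alive, type23, and fit are mine.

```r
# Assumed cell counts (not printed in the text): back-calculated from the
# reported level totals and the reported odds for each admission type.
grp <- data.frame(type  = c(1, 2, 3),
                  died  = c(364, 104, 45),
                  alive = c(770, 161, 51))

# Hand-made indicator: 1 for urgent or emergency, 0 for elective.
# The original type column is not overwritten.
grp$type23 <- as.numeric(grp$type %in% c(2, 3))

# Grouped binomial fit; the coefficients are the same as those from
# fitting the 0/1 outcome one patient per row.
fit <- glm(cbind(died, alive) ~ type23, family = binomial, data = grp)
round(coef(fit), 5)
```

If the assumed counts are right, the coefficients should agree with those of logit6 up to rounding. With the full medpar data, the analogous step would be to create the indicator from medpar$type and use it in place of factor(type).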


From the document Practical Guide to Logistic Regression (pages 45-49)