Logistic regression with a binary independent variable

> lines(newdata$log10.strep, predicted.line, col="blue")

> axis(side=1, at=-2:4, labels=as.character(10^(-2:4)))

> title(main="Relationship between mutan streptococci \n and probability of tooth decay", xlab="CFU", ylab="Probability of having decayed teeth")

Note the use of the '\n' in the command above to separate a long title into two lines.

[Figure: "Relationship between mutan streptococci and probability of tooth decay". x-axis: CFU (0.01 to 10000, log scale); y-axis: Probability of having decayed teeth (0.0 to 1.0).]

> glm0 <- glm(case ~ eclair.eat, family=binomial, data=.data)

> summary(glm0)

Coefficients:
                Estimate Std. Error z value Pr(>|z|)
(Intercept)       -2.923      0.265  -11.03   <2e-16
eclair.eatTRUE     3.167      0.276   11.48   <2e-16

=================== Lines omitted =================

The above part of the display is actually a matrix from the object 'coef(summary(glm0))'. Epicalc manipulates this matrix and gives rise to a display more understandable by most epidemiologists.

> logistic.display(glm0)

Logistic regression predicting diseased

                OR (95% CI)           P(Wald's test)  P(LR-test)
eating eclair   23.75 (13.82,40.79)   < 0.001         < 0.001

Log-likelihood = -527.6075
No. of observations = 977
AIC value = 1059.2

The odds ratio from the logistic regression is derived from exponentiation of the estimate, i.e. 23.75 is obtained from:

> exp(coef(summary(glm0))[2,1])

The 95% confidence interval of the odds ratio is obtained from

> exp(coef(summary(glm0))[2,1] + c(-1,1) * 1.96 * coef(summary(glm0))[2,2])

These values are close to those from the simple calculation on the 2-by-2 table discussed earlier in Chapter 9. The log-likelihood and the AIC value will be discussed later.
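The same arithmetic can be checked by hand. The following is a minimal sketch that assumes only the rounded estimate (3.167) and standard error (0.276) printed in the coefficient table above; the object names 'est', 'se', 'or' and 'ci' are introduced here for illustration. Because the inputs are rounded, the results agree with the logistic.display output only to about one decimal place.

```r
# Rounded estimate and standard error of 'eclair.eatTRUE',
# copied from the coefficient table of summary(glm0).
est <- 3.167
se  <- 0.276

# Odds ratio: exponentiate the log-odds estimate.
or <- exp(est)

# Wald 95% confidence limits: estimate +/- z * SE on the log scale,
# then exponentiate. qnorm(0.975) is approximately 1.96.
ci <- exp(est + c(-1, 1) * qnorm(0.975) * se)

round(c(OR = or, lower = ci[1], upper = ci[2]), 2)
```

Working on the log scale first and exponentiating afterwards is what makes the interval asymmetric around the odds ratio.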

The default values in logistic.display are 95% for the confidence intervals and the digits are shown to two decimal places. See the online help for details.

> args(logistic.display)

> help(logistic.display)

You can change the default values by adding the extra argument(s) in the command.

> logistic.display(glm0, alpha=0.01, decimal=2)

If the data frame has been specified in the glm command, the output will show the variable description instead of the variable names. The P value from Wald's test is the same as that seen from the coefficient matrix of 'summary(glm0)'.

The output from logistic.display also contains the 'LR-test' result, which checks whether the likelihood of the given model, 'glm0', would be significantly different from the model without 'eclair.eat', which in this case would be the "null" model.
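The LR-test described above can be sketched in base R without Epicalc. The example below uses simulated data, not the outbreak data set; the names 'sim', 'exposed', 'fit_null' and 'fit_exp' are hypothetical. It compares a model with one binary predictor against the null model and computes the test by hand from the two log-likelihoods.

```r
# Simulated data (an assumption for illustration only).
set.seed(1)
n <- 500
exposed <- rbinom(n, 1, 0.4) == 1
case <- rbinom(n, 1, plogis(-2 + 1.5 * exposed))
sim <- data.frame(case, exposed)

# Null model (intercept only) and model with the binary predictor.
fit_null <- glm(case ~ 1, family = binomial, data = sim)
fit_exp  <- glm(case ~ exposed, family = binomial, data = sim)

# LR statistic: twice the difference in log-likelihoods,
# referred to a chi-squared distribution with 1 degree of freedom.
lr_stat <- as.numeric(2 * (logLik(fit_exp) - logLik(fit_null)))
lr_p <- pchisq(lr_stat, df = 1, lower.tail = FALSE)

# Wald P value for the same predictor, for comparison.
wald_p <- coef(summary(fit_exp))["exposedTRUE", "Pr(>|z|)"]
c(LR = lr_p, Wald = wald_p)

# The same LR-test from base R's analysis-of-deviance table.
lr_table <- anova(fit_null, fit_exp, test = "Chisq")
lr_table
```

With a single binary predictor the two P values are usually of similar size, although they are not identical because the tests are computed differently.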

For an independent variable with two levels, the LR-test does not add further important information because Wald's test has already tested the hypothesis. When the independent variable has more than two levels, the LR-test is more important than Wald's test as the following example demonstrates.

> glm1 <- glm(case ~ eclairgr, family=binomial, data=.data)

> logistic.display(glm1)

Logistic regression predicting diseased

                                 OR(95%CI)             P(Wald's test)  P(LR-test)
pieces of eclair eaten: ref.=0                                         < 0.001
   1                             17.57 (9.21,33.49)    < 0.001
   2                             22.27 (12.82,38.66)   < 0.001
   >2                            43.56 (22.89,82.91)   < 0.001

Log-likelihood = -516.8236
No. of observations = 972
AIC value = 1041.6

Interpreting Wald's test alone, one would conclude that all levels of eclair eaten are significant. However, this depends on the reference level. By default, R takes the first level of an independent factor as the reference level. If we change the reference level to 2 pieces of eclair, Wald's test gives a different impression.

> eclairgr <- relevel(eclairgr, ref="2")

> pack()

> glm2 <- glm(case ~ eclairgr, family=binomial, data=.data)

> logistic.display(glm2)

Logistic regression predicting diseased

                                 OR(95%CI)           P(Wald's test)  P(LR-test)
pieces of eclair eaten: ref.=2                                       < 0.001
   0                             0.04 (0.03,0.08)    < 0.001
   1                             0.79 (0.52,1.21)    0.275
   >2                            1.96 (1.28,2.99)    0.002

==============================================================

The results show that eating only one piece of eclair does not reduce the risk significantly compared to eating two pieces.

While results from Wald's test depend on the reference level of the explanatory variable, the LR-test is concerned only with the contribution of the variable as a whole and ignores the reference level. We will return to this discussion in a later chapter.
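The contrast between the two tests can be illustrated with a small simulation. This is a sketch on made-up data, not the outbreak data set; 'grp', 'grp2', 'fit_a' and 'fit_b' are hypothetical names. The same four-level factor is fitted twice with different reference levels.

```r
# Simulated four-level exposure (an assumption for illustration only).
set.seed(2)
n <- 600
grp <- factor(sample(c("0", "1", "2", ">2"), n, replace = TRUE),
              levels = c("0", "1", "2", ">2"))
log_odds <- c("0" = -2.5, "1" = 0.3, "2" = 0.5, ">2" = 1.2)[as.character(grp)]
case <- rbinom(n, 1, plogis(log_odds))

# Same model, two different reference levels.
fit_a <- glm(case ~ grp, family = binomial)    # reference level "0"
grp2  <- relevel(grp, ref = "2")
fit_b <- glm(case ~ grp2, family = binomial)   # reference level "2"

# Wald P values are attached to individual contrasts,
# so they change with the reference level.
coef(summary(fit_a))[, "Pr(>|z|)"]
coef(summary(fit_b))[, "Pr(>|z|)"]

# The log-likelihood, and hence the LR-test for the variable
# as a whole, does not depend on the reference level.
logLik(fit_a)
logLik(fit_b)
```

Releveling only reparameterises the model: the fitted probabilities for each group are unchanged, which is why the two log-likelihoods coincide.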

Next, try 'saltegg' as the explanatory variable.

> glm3 <- glm(case ~ saltegg, family = binomial, data=.data)

> logistic.display(glm3)

Logistic regression predicting case

                     OR (95% CI)        P(Wald's test)  P(LR-test)
saltegg: Yes vs No   2.54 (1.53,4.22)   < 0.001         < 0.001

Log-likelihood = -736.998
No. of observations = 1089
AIC value = 1478

The odds ratio for 'saltegg' is statistically significant and similar to that seen from the cross-tabulation in Chapter 9. The number of valid records (1,089) is also higher than in the model containing 'eclairgr' (972).

Note: One should always be careful when analysing data that contain missing values. Methods to handle missing values are beyond the scope of this book and for reasons of simplicity are ignored here. Readers are advised to deal with missing values properly prior to conducting their analysis.

To check whether the odds ratio is confounded by 'eclairgr', the two explanatory variables are put together in the next model.

> glm4 <- glm(case ~ eclairgr + saltegg, family=binomial)

> logistic.display(glm4, crude.p.value=TRUE)

Logistic regression predicting case

                    crude OR(95%CI)    P value   adj. OR(95%CI)     P(Wald)   P(LR-test)
eclairgr: ref.=2                                                              < 0.001
   0                0.04 (0.03,0.08)   < 0.001   0.04 (0.03,0.08)   < 0.001
   1                0.79 (0.52,1.21)   0.275     0.79 (0.51,1.21)   0.279
   >2               1.96 (1.28,2.99)   0.002     1.96 (1.28,2.99)   0.002

saltegg: Yes vs No  2.37 (1.4,3.99)    0.001     1.01 (0.53,1.93)   0.975     0.975

Log-likelihood = -516.823
No. of observations = 972
AIC value = 1043.6

The odds ratios of the explanatory variables in glm4 are adjusted for each other.

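The crude odds ratios of 'eclairgr' are exactly the same as those from 'glm2', the model containing that variable alone. The crude odds ratio of 'saltegg' (2.37) differs from the one in 'glm3' (2.54), which is consistent with 'glm3' being based on 1,089 records whereas this display is based on 972. The crude P value of 'saltegg' is shown as 0.001 due to rounding. In fact, it is 0.00112, which is not less than 0.001. Epicalc, for aesthetic reasons, displays P values as '< 0.001' whenever the original value is less than 0.001.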

The adjusted odds ratios of 'eclairgr' do not change, suggesting that it is not confounded by 'saltegg', whereas the odds ratio of 'saltegg' is clearly changed towards unity and now has a very large P value. The difference between the adjusted odds ratio and the crude odds ratio is an indication that 'saltegg' is confounded by 'eclairgr', which is an independent risk factor. These adjusted odds ratios are close to those obtained from the Mantel-Haenszel method shown in Chapter 9.
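The size of the shift from crude to adjusted odds ratio can be put into numbers. The sketch below simply copies the rounded odds ratios from the 'glm4' display; the vector names and the percentage-change summary are introduced here for illustration and are not part of the Epicalc output.

```r
# Crude and adjusted odds ratios copied from the glm4 display.
crude    <- c(eclairgr_0 = 0.04, eclairgr_1 = 0.79,
              eclairgr_gt2 = 1.96, saltegg = 2.37)
adjusted <- c(eclairgr_0 = 0.04, eclairgr_1 = 0.79,
              eclairgr_gt2 = 1.96, saltegg = 1.01)

# Relative change of each odds ratio after adjustment, in percent.
pct_change <- 100 * (adjusted - crude) / crude
round(pct_change, 1)
```

The 'eclairgr' odds ratios do not move at all, whereas the 'saltegg' odds ratio falls by more than half, which is the pattern described in the text.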

Now that we have a model containing two explanatory variables, we can compare models 'glm4' and 'glm2' using the lrtest command.

> lrtest(glm4, glm2)

Likelihood ratio test for MLE method

Chi-squared 1 d.f. = 0.0009809 , P value = 0.975

The P value of 0.975 is the same as that from 'P(LR-test)' of 'saltegg' obtained from the preceding command. The test determines whether removing 'saltegg' from the model makes a significant difference compared with keeping it. When there is more than one explanatory variable, 'P(LR-test)' from logistic.display is actually obtained from the lrtest command, which compares the current model against one in which the particular variable is removed, while keeping all remaining variables.
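The P value reported by lrtest follows directly from the chi-squared statistic. As a brief check, assuming only the statistic and degrees of freedom printed above (the names 'chi_sq' and 'p_value' are introduced here):

```r
# Chi-squared statistic and degrees of freedom from the lrtest output.
chi_sq <- 0.0009809
df     <- 1

# Upper-tail probability of the chi-squared distribution.
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)
round(p_value, 3)   # 0.975
```

The statistic itself is twice the difference between the log-likelihoods of the two models, so a difference this small means that 'saltegg' adds almost nothing once 'eclairgr' is in the model.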

Logistic regression gives both the adjusted odds ratios simultaneously. The Mantel-Haenszel method only gives the odds ratio of the variable of main interest. An additional advantage is that logistic regression can handle multiple covariates simultaneously.

> glm5 <- glm(case~eclairgr+saltegg+sex, family=binomial)

> logistic.display(glm5)

Logistic regression predicting case

                      crude OR(95%CI)    adj. OR(95%CI)     P(Wald's test)  P(LR-test)
eclairgr: ref.=2                                                            < 0.001
   0                  0.04 (0.03,0.08)   0.04 (0.02,0.07)   < 0.001
   1                  0.79 (0.52,1.21)   0.75 (0.49,1.16)   0.2
   >2                 1.96 (1.28,2.99)   1.82 (1.19,2.8)    0.006

saltegg: Yes vs No    2.37 (1.41,3.99)   0.92 (0.48,1.76)   0.807           0.808

sex: Male vs Female   1.58 (1.19,2.08)   1.85 (1.35,2.53)   < 0.001         < 0.001

Log-likelihood = -509.5181
No. of observations = 972
AIC value = 1031.0

The third explanatory variable 'sex' is another independent risk factor. Since females are the reference level, males have about 85% higher odds than females (adjusted odds ratio 1.85). This variable is not a confounder of either of the preceding variables because it has not substantially changed their odds ratios (compared with 'glm4'). It cannot confound them because it is not associated with either of the preceding explanatory variables. In other words, males and females did not differ in terms of eating eclairs and salted eggs.

Interaction

An interaction term consists of at least two variables, at least one of which must be categorical. If an interaction is present, the effect of one variable will depend on the status of the other and thus they are not independent. In R the interaction term can be specified in two ways: 'x1*x2' or 'x1:x2'. The former is equivalent to 'x1+x2+x1:x2'.
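The equivalence of the two formula notations can be verified from the design matrix that R builds. The small data frame 'd' and the variables 'x1' and 'x2' below are hypothetical and serve only to show which columns each formula generates.

```r
# A tiny hypothetical data frame with two two-level factors.
d <- data.frame(
  x1 = factor(c("a", "b", "a", "b", "a", "b")),
  x2 = factor(c("no", "no", "yes", "yes", "no", "yes"))
)

# Design matrices built from the two ways of writing the model.
m_star  <- model.matrix(~ x1 * x2, data = d)
m_colon <- model.matrix(~ x1 + x2 + x1:x2, data = d)

# Both contain the intercept, the two main effects and the interaction.
colnames(m_star)
identical(colnames(m_star), colnames(m_colon))
```

Writing 'x1:x2' on its own would request only the interaction columns, which is why the star notation is the usual shorthand when the main effects are wanted as well.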

Examine the following model where the variables 'eclairgr' and 'beefcurry' are specified as an interaction term.

> glm6 <- glm(case ~ eclairgr*beefcurry, family=binomial)

> logistic.display(glm6, decimal=1)

Logistic regression predicting diseased

                                crude OR(95%CI)   adj. OR(95%CI)   P(Wald's test)  P(LR-test)
eclairgr: ref.=2                                                                   < 0.001
   0                            0 (0,0.1)         0.1 (0,0.5)      0
   1                            0.8 (0.5,1.2)     0.5 (0.1,2.5)    0.39
   >2                           2 (1.3,3)         0.5 (0.1,3)      0.41

beefcurry (Yes vs No):          2.7 (1.6,4.6)     1.4 (0.5,3.6)    0.53            < 0.001

eclairgr:beefcurry: ref.=2:No                                                      0.03
   0:Yes                        -                 0.3 (0.1,1.2)    0.09
   1:Yes                        -                 1.7 (0.3,9.7)    0.52
   >2:Yes                       -                 4.8 (0.7,33.9)   0.11

Log-likelihood = -511.8
No. of observations = 972
AIC value = 1039.6

The last term, 'eclairgr:beefcurry', is the interaction term. Interpretation of the P values from Wald's test suggests that the interaction may not be significant.

However, the P value from the LR-test is more important; in fact, it is decisive. The value of 0.03 indicates that 'eclairgr' and 'beefcurry' do not act independently of each other. Computation of the LR-test P values for the main effects, 'eclairgr' and 'beefcurry', is not possible since models without main effects (but with interaction terms) have the same log-likelihood as ones with the main effects included. The crude odds ratios for the interaction terms are also not available, which is why they appear as '-' in the output.

Readers may like to relevel the 'eclairgr' variable back to the original reference level (ref=0) and compare the output.

> eclairgr <- relevel(eclairgr, ref="0")

> pack()

> glm7 <- glm(case~eclairgr*beefcurry, family=binomial, data=.data)

> logistic.display(glm7)
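The earlier remark that a model without a main effect, but with the interaction term, has the same log-likelihood as the full model can be checked on simulated data. This is a sketch under stated assumptions: the data are made up, and 'grp', 'beef', 'fit_full' and 'fit_no_main' are hypothetical names, not objects from the outbreak analysis.

```r
# Simulated data with two categorical exposures (illustration only).
set.seed(3)
n <- 800
grp  <- factor(sample(c("0", "1", "2"), n, replace = TRUE))
beef <- factor(sample(c("No", "Yes"), n, replace = TRUE))
log_odds <- -1.5 + 0.6 * (grp == "1") + 1.0 * (grp == "2") +
  0.4 * (beef == "Yes") + 0.8 * (grp == "2" & beef == "Yes")
case <- rbinom(n, 1, plogis(log_odds))

# Full model: both main effects plus the interaction.
fit_full <- glm(case ~ grp * beef, family = binomial)

# Main effect of 'grp' dropped, interaction kept.
# R reports the redundant (aliased) coefficients as NA.
fit_no_main <- glm(case ~ beef + grp:beef, family = binomial)

# Both models fit one probability per combination of levels,
# so their log-likelihoods coincide.
logLik(fit_full)
logLik(fit_no_main)
```

Because the two log-likelihoods are equal, the LR statistic for removing the main effect is zero, so no meaningful P value can be attached to it.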
