levels(testdata$x)
[1] "C" "B" "D" "X"

lm(y ~ x, contrasts = list(x = contr.treatment), data = testdata)

Call:
lm(formula = y ~ x, data = testdata, contrasts = list(x = contr.treatment))

Coefficients:
(Intercept)           x2           x3           x4
      30.04       -19.93        19.98       -25.11
In the example above, we used the function ordered to make C the lowest level. Consequently, level C is left out of the regression and the remaining parameters are interpreted as differences between the corresponding levels and level C.
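The reference level can also be set directly with the relevel function; a minimal sketch, assuming the testdata used above (the choice of "B" as the new reference level is arbitrary):

## make "B" the reference level; with treatment contrasts it
## will then be the level that is left out of the regression
testdata$x <- relevel(factor(testdata$x, ordered = FALSE), ref = "B")
levels(testdata$x)
[1] "B" "C" "D" "X"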
The reorder function is used to order factor variables based on some other data. Suppose we want to order the levels of x in such a way that the lowest level has the smallest variance in y; then we use reorder as follows:
testdata$x <- reorder(testdata$x, testdata$y, var)
levels(testdata$x)
[1] "D" "X" "C" "B"
Level D has the smallest variance in y and will be left out in a regression that uses a treatment contrast for the regression variable x.
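To verify this, the variance of y within each level of x can be listed directly; a small check with tapply, using the reordered testdata from above:

## variance of y per level of x; after the reorder the
## levels appear in increasing order of variance
tapply(testdata$y, testdata$x, var)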
8.4 Logistic regression
Logistic regression can be used to model data where the response variable is binary: a factor variable with two levels, for example a variable Y with the two categories 'yes'/'no' or 'good'/'bad'. Many companies have built scorecards based on logistic regression, in which they try to separate 'good' customers from 'bad' customers. The logistic regression model calculates the probability of an outcome, P(Y = good), with P(Y = bad) = 1 − P(Y = good), given regression variables X1, ..., Xp as follows:
\[
P(Y = \text{good}) = \frac{\exp(\alpha_0 + \alpha_1 X_1 + \dots + \alpha_p X_p)}{1 + \exp(\alpha_0 + \alpha_1 X_1 + \dots + \alpha_p X_p)} \tag{8.2}
\]

The regression coefficients α0, ..., αp are estimated from a data set.
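To make the formula concrete, here is a minimal sketch of equation (8.2) in R; the helper logistic.prob is hypothetical and the coefficient values are made up (they happen to match the simulation in the next section):

## evaluate equation (8.2): P(Y = good) for coefficients
## alpha = (alpha0, ..., alphap) and covariates x = (x1, ..., xp)
logistic.prob <- function(alpha, x) {
  eta <- alpha[1] + sum(alpha[-1] * x)
  exp(eta) / (1 + exp(eta))
}
logistic.prob(alpha = c(2, 1, 3, -4), x = c(0.5, 0.5, 0.5))
[1] 0.8807971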
8.4.1 The modeling function glm
The function glm can be used to fit a logistic regression model. Let’s generate a data set and demonstrate the different aspects of building a logistic regression model.
nrecords <- 1000
ncols <- 3
x <- matrix(runif(nrecords * ncols), ncol = ncols)
y <- 2 + x[, 1] + 3 * x[, 2] - 4 * x[, 3]
y <- exp(y) / (1 + exp(y))
ppp <- runif(nrecords)
y <- ifelse(y > ppp, "good", "bad")
testdata <- data.frame(y, x)
To get an idea of which variables in your data set have an influence on your binary response variable, plot the observed fraction of 'good' against the (potential) regression variables:
• Divide the variable Xi into, say, ten buckets (equal intervals).
• For each bucket calculate the observed fraction of 'good'.
• Plot bucket number against observed fraction.
Some R code that plots observed fractions for a specific regression variable:
obs.prob <- function(x) {
  ## observed fraction of 'good', the
  ## second level in this case
  out <- table(x)
  out[2] / length(x)
}

plotfr <- function(y, x, n = 10) {
  tmp <- cut(x, n)
  p <- tapply(y, tmp, obs.prob)
  plot(p)
  lines(p)
  title(
    paste(deparse(substitute(y)),
          "and",
          deparse(substitute(x)))
  )
}
par(mfrow = c(2, 2))
plotfr(testdata$y, testdata$X1)
plotfr(testdata$y, testdata$X2)
plotfr(testdata$y, testdata$X3)
[Figure 8.3 here: three panels, 'testdata$y and testdata$X1', 'testdata$y and testdata$X2' and 'testdata$y and testdata$X3', each plotting the observed fraction p per bucket against the bucket index.]
Figure 8.3: Exploratory plots giving a first impression of the relation between the binary y variable and the x variables.
The plots in figure 8.3 show strong relations. For variable X3 there is a negative relation, just as we simulated. The formula object needed as input for glm is interpreted in the same way as in lm; for example, the : operator is also used here to specify interactions between variables (a sketch follows after the summary output below). The following code fits a logistic regression model and stores the output in the object test.glm.
test.glm <- glm(y ~ X1 + X2 + X3, family = binomial, data = testdata)
summary(test.glm)
Call:
glm(formula = y ~ X1 + X2 + X3, family = binomial, data = testdata)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.5457  -0.5829   0.3867   0.6721   1.9312
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   0.9616     0.2617   3.675 0.000238 ***
X1            1.6361     0.3065   5.338  9.4e-08 ***
X2            3.3955     0.3317  10.236  < 2e-16 ***
X3           -4.0446     0.3513 -11.515  < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1158.48  on 999  degrees of freedom
Residual deviance:  863.47  on 996  degrees of freedom
AIC: 871.47

Number of Fisher Scoring iterations: 5
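Since the formula interface is the same as for lm, an interaction would be specified with the : operator; a hypothetical variant of the model above (shown only to illustrate the syntax, not fitted here):

## main effects plus an X1:X2 interaction
glm(y ~ X1 + X2 + X3 + X1:X2, family = binomial, data = testdata)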
The object test.glm is a 'glm' object. As with the lm objects in the previous section, a glm object contains more information than is printed by default; enter print.default(test.glm) to see the entire object. The functions listed in table 8.3 can also be used on glm objects.
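For example, a few of these generic functions applied to test.glm (a brief sketch, output omitted):

coef(test.glm)                               # estimated coefficients
fitted(test.glm)[1:5]                        # fitted probabilities of 'good'
predict(test.glm, type = "response")[1:5]    # the same, via predict
residuals(test.glm, type = "deviance")[1:5]  # deviance residuals
anova(test.glm, test = "Chisq")              # sequential analysis of deviance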
8.4.2 Performance measures
To assess the quality of a logistic regression model, several performance measures can be calculated. The ROCR package contains functions to calculate the Receiver Operating Characteristic (ROC) curve, the area under the ROC curve and lift charts from any fitted glm model.
A logistic regression model can be used to predict 'good' or 'bad'. Since the model only calculates probabilities of 'good', a threshold t ∈ (0, 1) is chosen: if the predicted probability is above t, 'good' is predicted, otherwise 'bad'. Depending on this threshold t we obtain the numbers TP, FP, FN and TN, as displayed in the following so-called confusion matrix.
                            Observed
                        Good          Bad
Model       Good         TP            FP
predicted   Bad          FN            TN
                  (number of goods) (number of bads)

Table 8.4: Confusion matrix
Let ng be the number of observed goods and nb the number of observed bads; then we have:
1. TP (true positive) is the number of observations for which the model predicted good and that were observed good. True positive rate: TPR = TP/ng.
2. TN (true negative) is the number of observations for which the model predicted bad and that were observed bad. True negative rate: TNR = TN/nb.
3. FP (false positive) is the number of observations for which the model predicted good but that were observed bad. False positive rate: FPR = FP/nb = 1 − TNR.
4. FN (false negative) is the number of observations for which the model predicted bad but that were observed good. False negative rate: FNR = FN/ng = 1 − TPR.
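These rates are easy to compute from the fitted model for a single threshold; a minimal sketch, assuming the test.glm and testdata objects from above (the threshold 0.5 is an arbitrary choice):

thr <- 0.5
pred.lab <- ifelse(fitted(test.glm) > thr, "good", "bad")
conf <- table(predicted = pred.lab, observed = testdata$y)
conf                                # the confusion matrix of table 8.4
TP <- conf["good", "good"]; FP <- conf["good", "bad"]
FN <- conf["bad",  "good"]; TN <- conf["bad",  "bad"]
c(TPR = TP / (TP + FN),             # TP/ng
  TNR = TN / (TN + FP),             # TN/nb
  FPR = FP / (TN + FP))             # equals 1 - TNR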
The ROC curve is a parametric curve: for all thresholds t ∈ (0, 1) the points (FPR, TPR) are calculated and then plotted. A ROC curve demonstrates several things:
1. It shows the trade-off between sensitivity and specificity (any increase in sensitivity will be accompanied by a decrease in specificity).
2. The closer the curve follows the left-hand border and then the top border of the ROC space, the more accurate the test.
3. The closer the curve comes to the 45-degree diagonal of the ROC space, the less accurate the test.

The area under the ROC curve (AUROC) is a measure of how accurately the model can predict 'good'. A value of 1 indicates a perfect predictor, while a value of 0.5 indicates a worthless predictor, no better than random guessing.
The code below shows how to create a ROC curve and how to calculate the AUROC.
library(ROCR)
pred <- prediction(test.glm$fitted, testdata$y)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize = TRUE, lwd = 3)
abline(a = 0, b = 1)
performance(pred, measure = "auc")@y.values
[[1]]
[1] 0.8353878
8.4.3 Predictive ability of a logistic regression
Another way to look at the quality of a logistic regression model is to look at its predictive ability. When we predict P(Y = good) with the model, a pair of observations with different observed responses (one 'good', the other 'bad') is said to be concordant if the 'bad' observation has a lower predicted probability than the 'good' observation.
[Figure 8.4 here: true positive rate plotted against false positive rate, coloured by threshold, with the diagonal of a worthless model and the upper-left path of a perfect model for reference.]
Figure 8.4: The ROC curve to assess the quality of a logistic regression model
If the 'bad' observation has a higher predicted probability than the 'good' one, the pair is discordant. If the pair is neither concordant nor discordant, it is a tie.
Four measures of association for assessing the predictive ability of a model are available. These measures are based on the number of concordant pairs, nc, the number of discordant pairs, nd, the total number of pairs, t, and the number of observations, N.
1. The measure c, which is also an estimate of the area under the ROC curve: c = (nc + 0.5 × (t − nc − nd))/t.
2. Somers' D: D = (nc − nd)/t.
3. Kendall's tau-a, defined as (nc − nd)/(0.5 · N · (N − 1)).
4. Goodman-Kruskal Gamma, defined as (nc − nd)/(nc + nd).
Ideally, we would like nc to be high and nd low, so the larger these measures, the better the predictive ability of the model. The function lrm in the Design package can calculate the measures above.
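As an illustration, the four measures can also be computed by brute force over all good/bad pairs; a minimal sketch, assuming test.glm and testdata from above (this builds an ng × nb matrix, so it is only feasible for moderately sized data sets):

p.good <- fitted(test.glm)[testdata$y == "good"]
p.bad  <- fitted(test.glm)[testdata$y == "bad"]
d  <- outer(p.good, p.bad, "-")  # compare all good/bad pairs
nc <- sum(d > 0)                 # concordant pairs
nd <- sum(d < 0)                 # discordant pairs
tp <- length(d)                  # total number of pairs t
N  <- nrow(testdata)
c(c       = (nc + 0.5 * (tp - nc - nd)) / tp,  # estimate of the AUROC
  SomersD = (nc - nd) / tp,
  tau.a   = (nc - nd) / (0.5 * N * (N - 1)),
  gamma   = (nc - nd) / (nc + nd))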