ADJUSTING STANDARD ERRORS
5.2 FROM OBSERVATION TO GROUPED DATA
Many data sets we have to model are structured in the following format:
y  cases  x1  x2  x3
1    3     1   0   1
1    1     1   1   1
2    2     0   0   1
0    1     0   1   1
2    2     1   0   0
0    1     0   1   0
x1, x2, and x3 are all binary predictors. The variable cases gives the number of observations that share the same values of these three predictors, that is, the number of times each covariate pattern would appear if the data were in observation format. y indicates how many of the cases with a given covariate pattern have y = 1. The first line represents three observations having x1 = 1, x2 = 0, and x3 = 1. One of the three observations has y = 1, and two have y = 0. In observation format the above grouped data set appears as
y  x1  x2  x3   Line from grouped data above
1   1   0   1   1
0   1   0   1   1
0   1   0   1   1
1   1   1   1   2
1   0   0   1   3
1   0   0   1   3
0   0   1   1   4
1   1   0   0   5
1   1   0   0   5
0   0   1   0   6
This data set is in observation-based form, with y as 0 or 1. But you should be able to clearly see that both of the data sets are identical, providing exactly the same information. We shall model both to be sure. To do so, both must be put into separate data frames.
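The correspondence between the two layouts can also be produced mechanically. The following base R sketch expands the grouped table into one row per observation. The object names grp_demo and obs_demo are illustrative and do not appear in the text.

```r
# Grouped data from the table above (object names are illustrative)
grp_demo <- data.frame(
  y     = c(1, 1, 2, 0, 2, 0),
  cases = c(3, 1, 2, 1, 2, 1),
  x1    = c(1, 1, 0, 0, 1, 0),
  x2    = c(0, 1, 0, 1, 0, 1),
  x3    = c(1, 1, 1, 1, 0, 0)
)

# Repeat each covariate pattern once per case
idx <- rep(seq_len(nrow(grp_demo)), times = grp_demo$cases)
obs_demo <- grp_demo[idx, c("x1", "x2", "x3")]

# Within each pattern, list the y successes first, then the failures
obs_demo$y <- unlist(mapply(
  function(s, n) c(rep(1, s), rep(0, n - s)),
  grp_demo$y, grp_demo$cases,
  SIMPLIFY = FALSE
))
rownames(obs_demo) <- NULL
obs_demo
```

The order of 1s and 0s within a covariate pattern is arbitrary; any ordering carries the same information.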
Observation Data
> y <- c(1,0,0,1,1,1,0,1,1,0)
> x1 <- c(1,1,1,1,0,0,0,1,1,0)
> x2 <- c(0,0,0,1,0,0,1,0,0,1)
> x3 <- c(1,1,1,1,1,1,1,0,0,0)
> obser <- data.frame(y,x1,x2,x3)
> xx1 <- glm(y ~ x1 + x2 + x3, family = binomial, data = obser)
> summary(xx1)
. . .
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   1.2050     1.8348   0.657    0.511
x1            0.1714     1.4909   0.115    0.908
x2           -1.5972     1.6011  -0.998    0.318
x3           -0.5499     1.5817  -0.348    0.728

    Null deviance: 13.46  on 9  degrees of freedom
Residual deviance: 12.05  on 6  degrees of freedom
AIC: 20.05
Grouped Data
> y <- c(1,1,2,0,2,0)
> cases <- c(3,1,2,1,2,1)
> x1 <- c(1,1,0,0,1,0)
> x2 <- c(0,1,0,1,0,1)
> x3 <- c(1,1,1,1,0,0)
> grp <- data.frame(y,cases,x1,x2,x3)
> grp$noty <- grp$cases - grp$y
> xx2 <- glm( cbind(y, noty) ~ x1 + x2 + x3, family = binomial, data = grp)
> summary(xx2)
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   1.2050     1.8348   0.657    0.511
x1            0.1714     1.4909   0.115    0.908
x2           -1.5972     1.6011  -0.998    0.318
x3           -0.5499     1.5817  -0.348    0.728

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 9.6411  on 5  degrees of freedom
Residual deviance: 8.2310  on 2  degrees of freedom
AIC: 17.853
The coefficients, standard errors, z values, and p-values are identical.
However, the ancillary deviance and AIC statistics differ due to the number of observations in each model. But the information in the two data sets is the same. This point is important to remember.
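The relationship between the two sets of fit statistics can be checked directly. The sketch below refits both models and compares them; the object names (obs, grpd, fit_obs, fit_grp) are illustrative and not from the text. It relies on the fact that the grouped binomial log-likelihood includes the binomial coefficient for each covariate pattern, a constant that does not depend on the estimates.

```r
# Observation-level and grouped versions of the same data (names are illustrative)
obs <- data.frame(
  y  = c(1, 0, 0, 1, 1, 1, 0, 1, 1, 0),
  x1 = c(1, 1, 1, 1, 0, 0, 0, 1, 1, 0),
  x2 = c(0, 0, 0, 1, 0, 0, 1, 0, 0, 1),
  x3 = c(1, 1, 1, 1, 1, 1, 1, 0, 0, 0)
)
grpd <- data.frame(
  y     = c(1, 1, 2, 0, 2, 0),
  cases = c(3, 1, 2, 1, 2, 1),
  x1    = c(1, 1, 0, 0, 1, 0),
  x2    = c(0, 1, 0, 1, 0, 1),
  x3    = c(1, 1, 1, 1, 0, 0)
)
grpd$noty <- grpd$cases - grpd$y

fit_obs <- glm(y ~ x1 + x2 + x3, family = binomial, data = obs)
fit_grp <- glm(cbind(y, noty) ~ x1 + x2 + x3, family = binomial, data = grpd)

# Coefficients agree up to the convergence tolerance of the fitting algorithm
coef_gap <- max(abs(coef(fit_obs) - coef(fit_grp)))

# The deviances differ, but the drop from null to residual deviance is the same
lr_obs <- fit_obs$null.deviance - fit_obs$deviance
lr_grp <- fit_grp$null.deviance - fit_grp$deviance

# The AIC values differ by twice the summed log binomial coefficients
aic_gap   <- AIC(fit_obs) - AIC(fit_grp)
const_gap <- 2 * sum(lchoose(grpd$cases, grpd$y))

c(coef_gap = coef_gap, lr_obs = lr_obs, lr_grp = lr_grp,
  aic_gap = aic_gap, const_gap = const_gap)
```

Because the drop in deviance is the same in both formats, a likelihood ratio test of the predictors gives the same answer whichever format is used.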
Note that the response variable is cbind(y, noty) rather than y as in the standard model. R users tend to prefer having the response be formatted in terms of two columns of data: one for the number of 1s for a given covariate pattern, and the second for the number of 0s (not 1s). It is the only logistic regression software I know of that allows this manner of formatting the binomial response. However, one can create a variable representing the cbind(y, noty) and run it as a single term response. The results will be identical.
> grp2 <- cbind(grp$y, grp$noty)
> summary(xx3 <- glm( grp2 ~ x1 + x2 + x3, family = binomial, data = grp))
. . .
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   1.2050     1.8348   0.657    0.511
x1            0.1714     1.4909   0.115    0.908
x2           -1.5972     1.6011  -0.998    0.318
x3           -0.5499     1.5817  -0.348    0.728
In a manner more similar to that used in other statistical packages, the binomial denominator, cases, may be used directly in the response, but only if it is also supplied as a weighting variable. The following code produces the same output as above:
> summary(xx4 <- glm( y/cases ~ x1 + x2 + x3, family = binomial, weights = cases, data = grp))
. . .
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   1.2050     1.8348   0.657    0.511
x1            0.1714     1.4909   0.115    0.908
x2           -1.5972     1.6011  -0.998    0.318
x3           -0.5499     1.5817  -0.348    0.728
The advantage of using this method is that the analyst does not have to create the noty variable. The downside is that some postestimation functions do not accept a weighted model. Be aware that there are alternatives and use the one that works best for your purposes. The cbind() response appears to be the most popular, and seems to be used more in published research.
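When choosing among these response formats, it can be reassuring to confirm programmatically that they agree. The following is a minimal sketch that rebuilds the grouped data under illustrative names (grpw, fit_cbind, fit_wt) and compares the two-column response with the proportion-plus-weights response.

```r
# Grouped data under an illustrative name
grpw <- data.frame(
  y     = c(1, 1, 2, 0, 2, 0),
  cases = c(3, 1, 2, 1, 2, 1),
  x1    = c(1, 1, 0, 0, 1, 0),
  x2    = c(0, 1, 0, 1, 0, 1),
  x3    = c(1, 1, 1, 1, 0, 0)
)
grpw$noty <- grpw$cases - grpw$y

# Two-column response: successes and failures
fit_cbind <- glm(cbind(y, noty) ~ x1 + x2 + x3,
                 family = binomial, data = grpw)

# Proportion response, with the denominator supplied as weights
fit_wt <- glm(y / cases ~ x1 + x2 + x3,
              family = binomial, weights = cases, data = grpw)

# Both comparisons should report TRUE
all.equal(coef(fit_cbind), coef(fit_wt))
all.equal(deviance(fit_cbind), deviance(fit_wt))
```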
Stata and SAS use the grouping variable (cases, in this example) as the variable n in the binomial formulae listed in the last section and as given in the example directly above. The binomial response can be thought of as y = numerator and cases = denominator. Of course these term names will differ depending on the data being modeled. Check the end of this chapter for how Stata and SAS handle the binomial denominator.
At times a data set may be too large to convert by hand from observation to grouped format. We will see later in this chapter why converting a categorical observation logistic model to a grouped model is desirable. In any case, using the code discussed in Chapter 4, Section 4.1.3, we may convert the above observation data set, obser, to a cbind()-based grouped format and run the model. I will show the data again for clarity.
> y <- c(1,0,0,1,1,1,0,1,1,0)
> x1 <- c(1,1,1,1,0,0,0,1,1,0)
> x2 <- c(0,0,0,1,0,0,1,0,0,1)
> x3 <- c(1,1,1,1,1,1,1,0,0,0)
> obser <- data.frame(y,x1,x2,x3)
> xx1 <- glm(y ~ x1 + x2 + x3, family = binomial, data = obser)
> library(reshape)
> obser$x1 <- factor(obser$x1)
> obser$x2 <- factor(obser$x2)
> obser$x3 <- factor(obser$x3)
> grp <- na.omit(data.frame(cast(melt(obser, measure = "y"),
    x1 + x2 + x3 ~ .,
    function(x) { c(notyg = sum(x == 0), yg = sum(x == 1)) })))
> grp
  x1 x2 x3 notyg yg
1  0  0  1     0  2
2  0  1  0     1  0
3  0  1  1     1  0
4  1  0  0     0  2
5  1  0  1     2  1
6  1  1  1     0  1
> bin <- glm( cbind(notyg, yg) ~ x1 + x2 + x3, family = binomial, data = grp)
> summary(bin)
. . .
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -1.2050     1.8348  -0.657    0.511
x11          -0.1714     1.4909  -0.115    0.908
x21           1.5972     1.6011   0.998    0.318
x31           0.5499     1.5817   0.348    0.728
. . .
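Because notyg is listed first in cbind(), this model estimates the log-odds of y = 0, so the coefficients have the same magnitudes as those of xx1 but opposite signs; placing yg first, as in cbind(yg, notyg), reproduces the earlier estimates. The terms are labeled x11, x21, and x31 because the predictors were converted to factors.
The code used to convert an observation to a grouped data set is the same code that can convert an n-asymptotic data set to m-asymptotic for residual analysis. You can use the above code as a paradigm for converting any observation data to grouped format.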
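For readers who prefer not to depend on the reshape package, the same grouping can be sketched with aggregate() from base R. This is an alternative to the cast()/melt() approach shown above, not the method used in the text, and the object names (obs2, succ, total, grp_base, fit_base) are illustrative. The predictors are left numeric here, and the successes are placed first in cbind() so that the model is for y = 1.

```r
# Observation-level data (same values as obser; name is illustrative)
obs2 <- data.frame(
  y  = c(1, 0, 0, 1, 1, 1, 0, 1, 1, 0),
  x1 = c(1, 1, 1, 1, 0, 0, 0, 1, 1, 0),
  x2 = c(0, 0, 0, 1, 0, 0, 1, 0, 0, 1),
  x3 = c(1, 1, 1, 1, 1, 1, 1, 0, 0, 0)
)

# Successes and total cases per covariate pattern;
# both calls group identically, so rows come back in the same order
succ  <- aggregate(y ~ x1 + x2 + x3, data = obs2, FUN = sum)
total <- aggregate(y ~ x1 + x2 + x3, data = obs2, FUN = length)

grp_base <- data.frame(
  succ[c("x1", "x2", "x3")],
  yg    = succ$y,
  notyg = total$y - succ$y
)
grp_base

# Successes first, so the coefficients describe the log-odds of y = 1
fit_base <- glm(cbind(yg, notyg) ~ x1 + x2 + x3,
                family = binomial, data = grp_base)
coef(fit_base)
```

Only covariate patterns that occur in the data appear in grp_base, so no na.omit() step is needed in this version.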