
Chapter 3: Logistic Regression

3.4 Goodness of Fit Test

To assess the fit of an estimated logistic model, goodness-of-fit test statistics are used. These tests examine how close the model's predicted values are to the observed values.

In logistic regression there are a number of different possible ways to assess the difference between the observed values and the fitted values.

3.4.1 Pearson Chi-Square Statistic and Deviance

Suppose the fitted model contains $p$ independent variables, $x' = (x_1, x_2, \ldots, x_p)$, and let $J$ denote the number of distinct values of $x$ observed. If some subjects share the same value of $x$, then $J < n$. Denote the number of subjects with $x = x_j$ by $m_j$, $j = 1, 2, \ldots, J$. It then follows that $\sum_{j=1}^{J} m_j = n$. Further, let $y_j$ denote the number of positive responses, $y = 1$, among the $m_j$ subjects with $x = x_j$, and let $n_1 = \sum_{j=1}^{J} y_j$ denote the total number of subjects with $y = 1$. To emphasize that the fitted values in logistic regression are calculated for each covariate pattern and depend on the estimated probability for that covariate pattern, Hosmer & Lemeshow (2002) suggest denoting the fitted value for the $j$th covariate pattern as $\hat{y}_j$, where

$$\hat{y}_j = m_j \hat{\pi}_j = m_j \left[ \frac{\exp\{\hat{g}(x_j)\}}{1 + \exp\{\hat{g}(x_j)\}} \right],$$

where $\hat{g}(x_j)$ is the estimated logit.

Hosmer & Lemeshow (2002) further suggest considering two measures of the difference between the observed and the fitted values: the Pearson residual and the deviance residual. The Pearson residual for a particular covariate pattern can be defined as

$$r(y_j, \hat{\pi}_j) = \frac{y_j - m_j\hat{\pi}_j}{\sqrt{m_j\hat{\pi}_j\,(1 - \hat{\pi}_j)}}. \qquad (3.4.1)$$

The summary statistic based on these residuals is the Pearson chi-square statistic

$$X^2 = \sum_{j=1}^{J} r(y_j, \hat{\pi}_j)^2. \qquad (3.4.2)$$
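
As a concrete illustration, the Pearson residuals and their chi-square summary can be computed directly from grouped data. The counts `m`, `y` and fitted probabilities `pi_hat` below are hypothetical, not taken from the text:

```python
import numpy as np

# Hypothetical grouped data: for each of J = 4 covariate patterns,
# m[j] subjects, y[j] observed successes, pi_hat[j] fitted probability.
m = np.array([10, 8, 12, 9])
y = np.array([3, 5, 4, 7])
pi_hat = np.array([0.35, 0.55, 0.40, 0.70])

# Pearson residual for each covariate pattern
r = (y - m * pi_hat) / np.sqrt(m * pi_hat * (1 - pi_hat))

# Pearson chi-square statistic: sum of squared residuals over patterns
X2 = np.sum(r ** 2)
```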

The deviance residual can be defined as

$$d(y_j, \hat{\pi}_j) = \pm\left\{2\left[y_j \ln\!\left(\frac{y_j}{m_j\hat{\pi}_j}\right) + (m_j - y_j)\ln\!\left(\frac{m_j - y_j}{m_j(1 - \hat{\pi}_j)}\right)\right]\right\}^{1/2}, \qquad (3.4.3)$$

where the sign is the same as that of $(y_j - m_j\hat{\pi}_j)$.

The deviance residual for covariate patterns with $y_j = 0$ is

$$d(y_j, \hat{\pi}_j) = -\sqrt{2\,m_j\,\bigl|\ln(1 - \hat{\pi}_j)\bigr|}\,,$$

while the deviance residual when $y_j = m_j$ is

$$d(y_j, \hat{\pi}_j) = \sqrt{2\,m_j\,\bigl|\ln(\hat{\pi}_j)\bigr|}\,.$$

The summary statistic based on the deviance residuals is the deviance

$$D = \sum_{j=1}^{J} d(y_j, \hat{\pi}_j)^2. \qquad (3.4.4)$$

If the model is correct, the statistics $X^2$ and $D$ have approximately a chi-square distribution with degrees of freedom equal to $J - (p + 1)$. Further, $D$ is the likelihood ratio test statistic of a saturated model with $J$ parameters versus the fitted model with $p + 1$ parameters.
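
The deviance residuals, including the special cases $y_j = 0$ and $y_j = m_j$, can be sketched as follows; the grouped data are again hypothetical:

```python
import numpy as np

# Hypothetical grouped data, including one pattern with y = 0 and one with y = m.
m = np.array([10, 8, 12, 9, 5, 6])
y = np.array([3, 5, 4, 7, 0, 6])
pi_hat = np.array([0.35, 0.55, 0.40, 0.70, 0.30, 0.80])

def deviance_residual(y_j, m_j, p_j):
    """Deviance residual d(y_j, pi_hat_j), handling y_j = 0 and y_j = m_j."""
    if y_j == 0:
        return -np.sqrt(2 * m_j * abs(np.log(1 - p_j)))
    if y_j == m_j:
        return np.sqrt(2 * m_j * abs(np.log(p_j)))
    term = (y_j * np.log(y_j / (m_j * p_j))
            + (m_j - y_j) * np.log((m_j - y_j) / (m_j * (1 - p_j))))
    # Sign follows the sign of the raw difference y_j - m_j * pi_hat_j.
    return np.sign(y_j - m_j * p_j) * np.sqrt(2 * term)

d = np.array([deviance_residual(*t) for t in zip(y, m, pi_hat)])
D = np.sum(d ** 2)  # the deviance
```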

3.4.2 The Hosmer-Lemeshow Test

Hosmer and Lemeshow’s goodness-of-fit test is another method commonly used to assess the fit of a model. Hosmer and Lemeshow (1980) and Lemeshow and Hosmer (1982) suggested grouping based on the values of the estimated probabilities. The idea behind the test is that the predicted and observed probabilities should match closely and that the closer they match, the better the fit. Groups are created using predicted probabilities, and then observed and fitted counts of successes and failures are compared on those groups using a chi-squared statistic.

According to Hosmer & Lemeshow (2002), suppose $J = n$ and think of the $n$ columns as corresponding to the $n$ values of the estimated probabilities, with the first column corresponding to the smallest value and the $n$th column to the largest value. The first grouping strategy proposed involves collapsing the table based on percentiles of the estimated probabilities, whilst the second strategy involves collapsing the table based on fixed values of the estimated probability. The first method uses $g = 10$ groups, with the first group containing the $n'_1 = n/10$ subjects having the smallest estimated probabilities and the last group containing the $n'_{10} = n/10$ subjects having the largest estimated probabilities.

The second method, with $g = 10$ groups, results in cutoff points defined at the values $k/10$, $k = 1, 2, \ldots, 9$, and the groups contain all subjects with estimated probabilities between adjacent cutoff points. The first group contains all subjects whose estimated probabilities are less than or equal to 0.1, while the tenth group contains all subjects whose estimated probabilities are greater than 0.9.

For either grouping strategy, the Hosmer-Lemeshow test statistic $H$ is defined as

$$H = \sum_{k=1}^{g} \frac{(o_k - n'_k\bar{\pi}_k)^2}{n'_k\,\bar{\pi}_k\,(1 - \bar{\pi}_k)}\,,$$

where $n'_k$ is the total number of subjects in the $k$th group, $c_k$ denotes the number of covariate patterns in the $k$th decile,

$$o_k = \sum_{j=1}^{c_k} y_j$$

is the number of responses among the $c_k$ covariate patterns, and

$$\bar{\pi}_k = \sum_{j=1}^{c_k} \frac{m_j\hat{\pi}_j}{n'_k}$$

is the average estimated probability.
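
A minimal sketch of the percentile-based (decile-of-risk) version of the statistic, assuming ungrouped 0/1 data so that each subject is its own covariate pattern ($m_j = 1$); the simulated inputs are illustrative only:

```python
import numpy as np

# Hypothetical data: fitted probabilities and 0/1 outcomes for n = 200 subjects.
rng = np.random.default_rng(0)
pi_hat = rng.uniform(0.05, 0.95, size=200)
y = (rng.uniform(size=200) < pi_hat).astype(int)

def hosmer_lemeshow(y, pi_hat, g=10):
    """Hosmer-Lemeshow statistic with percentile-based grouping (deciles of risk)."""
    order = np.argsort(pi_hat)
    groups = np.array_split(order, g)   # g groups of (roughly) equal size
    H = 0.0
    for idx in groups:
        n_k = len(idx)                  # subjects in group k
        o_k = y[idx].sum()              # observed responses in group k
        pi_bar = pi_hat[idx].mean()     # average estimated probability
        H += (o_k - n_k * pi_bar) ** 2 / (n_k * pi_bar * (1 - pi_bar))
    return H

H = hosmer_lemeshow(y, pi_hat)
# Compare H to a chi-square distribution with g - 2 degrees of freedom.
```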

Hosmer and Lemeshow (1989) reported simulations showing that, under the null hypothesis that the fitted model is correct, the statistic $H$ has approximately a chi-square distribution with $g - 2$ degrees of freedom. A large $H$ value, that is, an $H$ value larger than the $100\alpha$ percentage point of the chi-square distribution (equivalently, a p-value less than $\alpha$), indicates that the model is inadequate.

Lee & Wang (2003) have reported that, as with other chi-square goodness-of-fit tests, the approximation depends on the estimated expected frequencies being reasonably large; if a large number (say, more than 20%) of the expected frequencies are less than 5, the approximation may not be appropriate and the p-value should be interpreted with caution. In this case, the proposed solution is to combine adjacent groups to increase the estimated expected frequencies. However, Hosmer & Lemeshow warn that if fewer than six groups are used to calculate $H$, the test would be insensitive and would almost always indicate that the model is adequate (Lee & Wang, 2003).

3.4.3 Area under the ROC Curve

Agresti (2007) has reported that the accuracy of a diagnostic test is often assessed with two conditional probabilities, namely, sensitivity and specificity. Chen et al. (2008) define sensitivity as a measure of accuracy for event prediction:

$$\text{Sensitivity} = P(\hat{y} = 1 \mid y = 1),$$

where "$\hat{y} = 1$" counts the default individuals who are screened as such and "$y = 1$" counts the total number of default individuals. Similarly,

$$1 - \text{Specificity} = 1 - P(\hat{y} = 0 \mid y = 0),$$

where "$\hat{y} = 0$" counts the normal-performing individuals who are screened as such and "$y = 0$" counts the total number of normal-performing individuals.

In other words, sensitivity is the probability of correctly classifying an observation with an outcome of an event and specificity is the probability of correctly classifying an observation with the outcome of a nonevent. Also, the positive predictive value (PPV) is the proportion of observations classified as events that are correctly classified and the negative predictive value (NPV) is the proportion of observations classified as nonevents that are correctly classified.
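
These four quantities can be computed from a cross-classification of observed and predicted outcomes at a single cutoff point; the outcomes, probabilities, and the 0.5 cutoff below are illustrative:

```python
import numpy as np

# Hypothetical 0/1 outcomes and predicted probabilities for 8 subjects.
y = np.array([1, 1, 0, 0, 1, 0, 1, 0])
p_hat = np.array([0.9, 0.4, 0.3, 0.6, 0.8, 0.2, 0.7, 0.1])
y_pred = (p_hat >= 0.5).astype(int)  # classify at a single cutoff point

tp = np.sum((y_pred == 1) & (y == 1))  # true positives
tn = np.sum((y_pred == 0) & (y == 0))  # true negatives
fp = np.sum((y_pred == 1) & (y == 0))  # false positives
fn = np.sum((y_pred == 0) & (y == 1))  # false negatives

sensitivity = tp / (tp + fn)  # P(y_pred = 1 | y = 1)
specificity = tn / (tn + fp)  # P(y_pred = 0 | y = 0)
ppv = tp / (tp + fp)          # positive predictive value
npv = tn / (tn + fn)          # negative predictive value
```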

Sensitivity and specificity rely on a single cutoff point to classify a test result as positive (Hosmer & Lemeshow, 2002). The area under the ROC (Receiver Operating Characteristic) curve originated from signal detection theory and reflects how well a receiver detects the existence of a signal in the presence of noise.

Hosmer & Lemeshow (2002) explain that it plots the probability of detecting true signal (sensitivity) against false signal (1 − specificity) over the entire range of cutoff points. The area under the ROC curve ranges from zero to one and provides a measure of the model's ability to discriminate between subjects that experience the outcome of interest and those that do not.

Hosmer & Lemeshow (2002) propose as a general rule:

If ROC = 0.5: this suggests no discrimination.

If 0.7 ≤ ROC < 0.8: this is considered acceptable discrimination.

If 0.8 ≤ ROC < 0.9: this is considered excellent discrimination.

If ROC ≥ 0.9: this is considered outstanding discrimination.

Agresti (2007) has reported that, for a given specificity, better predictive power corresponds to higher sensitivity; therefore, the better the predictive power, the higher the ROC curve. In essence, the higher the area under the curve, the better the predictive power of the model.
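
One way to compute the area under the ROC curve is the rank-based (Mann-Whitney) formulation, under which the AUC equals the probability that a randomly chosen event receives a higher predicted value than a randomly chosen nonevent. The scores below are illustrative:

```python
import numpy as np

# Hypothetical 0/1 outcomes and predicted scores.
y = np.array([0, 0, 1, 1, 0, 1, 0, 1])
scores = np.array([0.1, 0.35, 0.4, 0.8, 0.2, 0.7, 0.5, 0.9])

pos = scores[y == 1]  # scores of events
neg = scores[y == 0]  # scores of nonevents

# Count (event, nonevent) pairs where the event outranks the nonevent;
# tied pairs count one half.
wins = ((pos[:, None] > neg[None, :]).sum()
        + 0.5 * (pos[:, None] == neg[None, :]).sum())
auc = wins / (len(pos) * len(neg))
```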

3.4.4 Information Criteria

The Akaike information criterion (AIC) and the Schwarz criterion (SC) can be used to compare the goodness of fit of two nested models. These methods adjust the likelihood ratio statistic, which measures the deviation of the log-likelihood of the fitted model from the log-likelihood of the maximal possible model (Vittinghoff et al., 2005). The AIC was introduced by Akaike (1974) and judges a model by how close its fitted values tend to be to the true expected values, as summarized by a certain expected distance between the two. Agresti (2007) reports that the optimal model is the one that tends to have its fitted values closest to the true outcome probabilities.

The AIC is calculated as:

$$\text{AIC} = -2\log L + 2p\,,$$

where $p$ is the number of parameters used in the model. Usually, the model with the smallest AIC value is the best (Agresti, 2002).

The SC, introduced by Schwarz (1978), is calculated as:

$$\text{SC} = -2\log L + p\log n\,,$$

where $p$ is the number of parameters and $n$ is the size of the sample.
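
Both criteria can be sketched directly from their formulas; the log-likelihood, $p$, and $n$ below are illustrative values, not from the text:

```python
import numpy as np

def aic(log_likelihood, p):
    """Akaike information criterion: AIC = -2 log L + 2p."""
    return -2.0 * log_likelihood + 2 * p

def sc(log_likelihood, p, n):
    """Schwarz criterion: SC = -2 log L + p log n."""
    return -2.0 * log_likelihood + p * np.log(n)

# Illustrative fitted model: log L = -120.5, p = 4 parameters, n = 200.
aic_value = aic(-120.5, p=4)        # 249.0
sc_value = sc(-120.5, p=4, n=200)   # ≈ 262.19
```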

3.4.5 Measure of Association

Allison (1999) discusses four measures of association that measure how well one can predict the dependent variable based on the independent variables, namely, Kendall's Tau-a, Goodman-Kruskal's Gamma, Somers' D statistic, and the c statistic. The idea behind these measures is to pair the observations in different ways without pairing an observation with itself. Pairs that have either both 1's or both 0's on the dependent variable are ignored, while pairs in which one case has a 1 and the other case has a 0 are retained. A pair is considered concordant if the case with a 1 has a higher predicted value (based on the model) than the case with a 0, and discordant otherwise. If the two cases have the same predicted value, the pair is considered a tie.

Now, let $C$ be the number of concordant pairs, $D$ the number of discordant pairs, $T$ the number of ties, and $N$ the total number of pairs (before eliminating any). The four measures of association are then calculated as follows:

$$\text{Tau-}a = \frac{C - D}{N}\,, \qquad \text{Gamma} = \frac{C - D}{C + D}\,,$$

$$\text{Somers' } D = \frac{C - D}{C + D + T}\,, \qquad c = 0.5\,(1 + \text{Somers' } D)\,.$$


The Tau-a statistic is Kendall's rank order correlation coefficient without the adjustments for ties and the Gamma statistic is based on Kendall's coefficient but with the adjustments for ties.

Allison (1999) further details that the four measures of association described above vary between 0 and 1, with larger values corresponding to stronger associations between the predicted and observed values, and with Tau-a being closest to the generalized $R^2$.
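
Under the pairing scheme described above, the four measures can be computed from observed outcomes and predicted probabilities. The data are illustrative; note that $N$ counts all pairs before any are eliminated, while concordance is evaluated only over the retained (1, 0) pairs:

```python
import numpy as np
from itertools import product

# Hypothetical 0/1 outcomes and predicted probabilities.
y = np.array([1, 0, 1, 0, 1, 0])
p_hat = np.array([0.8, 0.3, 0.6, 0.7, 0.9, 0.2])

n = len(y)
N = n * (n - 1) // 2            # all pairs, before eliminating any
ones = p_hat[y == 1]            # predicted values for cases with y = 1
zeros = p_hat[y == 0]           # predicted values for cases with y = 0

C = sum(a > b for a, b in product(ones, zeros))   # concordant pairs
D = sum(a < b for a, b in product(ones, zeros))   # discordant pairs
T = len(ones) * len(zeros) - C - D                # tied pairs

tau_a = (C - D) / N
gamma = (C - D) / (C + D)
somers_d = (C - D) / (C + D + T)
c = 0.5 * (1 + somers_d)
```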
