Appendix 3: Additional Test of Calibration
Appendix 3a: External Validation of Novel Models using Calibration Plot
As an additional sensitivity analysis, we generated a calibration plot for each novel model by applying regression coefficients from our development sample (2010 data) to patients in our validation sample (2011 data), generating predicted probabilities of death, and then regressing these predicted probabilities on the binary outcome itself (died vs. did not die; see Steyerberg EW, et al. Epidemiology. 2010;21(1):128-138 or Steyerberg EW, et al. European Heart Journal. 2014;35(29):1925-1931). We then bootstrapped the final regression coefficients using 1,000 repetitions to generate 95% confidence intervals. Ideal calibration plots have an intercept of 0 (i.e., no systematic over- or under-estimation, known as “calibration-in-the-large”) and a slope of 1 (i.e., perfect correlation between observed and expected values). Since our 95% confidence intervals for both parameters in both models contain these values, we believe our models to be appropriately calibrated.
Calibration plot: estimate (95% confidence interval)
             Calibration-in-the-large    Calibration slope
And Model    -0.014 (-0.053, 0.026)      0.928 (0.843, 1.001)
Or Model     -0.036 (-0.078, 0.006)      0.962 (0.873, 1.037)
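The procedure described in Appendix 3a can be illustrated with a minimal sketch. It assumes the common logistic recalibration setup, in which the binary outcome is modelled on the logit of the predicted probability. This may differ from the exact specification we used. The function names (fit_calibration, bootstrap_ci) and the simulated data are hypothetical and serve only as illustration. They are not our analysis code or data.

```python
import math
import random


def fit_calibration(pred_probs, outcomes, n_iter=25):
    """Fit outcome ~ a + b * logit(p) by Newton-Raphson; return (a, b).

    a is the calibration intercept and b the calibration slope
    (assumed logistic recalibration formulation).
    """
    eps = 1e-12  # keep probabilities strictly inside (0, 1)
    x = [math.log(min(max(p, eps), 1 - eps) / (1 - min(max(p, eps), 1 - eps)))
         for p in pred_probs]
    a, b = 0.0, 1.0  # start at ideal calibration
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, outcomes):
            mu = 1.0 / (1.0 + math.exp(-(a + b * xi)))
            w = mu * (1.0 - mu)
            g0 += yi - mu            # score for the intercept
            g1 += (yi - mu) * xi     # score for the slope
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        if det <= 1e-12:
            break  # information matrix is singular; stop updating
        da = (h11 * g0 - h01 * g1) / det
        db = (h00 * g1 - h01 * g0) / det
        a, b = a + da, b + db
        if abs(da) + abs(db) < 1e-10:
            break
    return a, b


def bootstrap_ci(pred_probs, outcomes, n_boot=1000, seed=0):
    """Percentile 95% intervals for (intercept, slope) from resampled patients."""
    rng = random.Random(seed)
    n = len(outcomes)
    intercepts, slopes = [], []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        a, b = fit_calibration([pred_probs[i] for i in idx],
                               [outcomes[i] for i in idx])
        intercepts.append(a)
        slopes.append(b)

    def percentile_interval(values):
        v = sorted(values)
        return v[int(0.025 * (len(v) - 1))], v[int(0.975 * (len(v) - 1))]

    return percentile_interval(intercepts), percentile_interval(slopes)


if __name__ == "__main__":
    # Simulated, well-calibrated predictions (illustrative only).
    rng = random.Random(42)
    probs = [rng.uniform(0.02, 0.6) for _ in range(1000)]
    died = [1 if rng.random() < p else 0 for p in probs]
    print("intercept, slope:", fit_calibration(probs, died))
    print("95% intervals:", bootstrap_ci(probs, died, n_boot=200))
```

Under this formulation, a model is judged adequately calibrated when the interval for the intercept contains 0 and the interval for the slope contains 1, which is the criterion applied to the table above.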
Appendix 3b: Validation of Novel Models using Calibration Belt
We also generated graphical representations of these calibration plots using the calibration belt procedure for STATA (see Nattino G, Finazzi S, Bertolini G. Stat Med. 2016;35(5):709-720). Both plots indicate good overall fit: the 95% confidence belt for the Or model contains the bisector across the entire probability spectrum, whereas the And model slightly overpredicts the probability of death among patients at the highest risk.
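The calibration belt itself is not reproduced here. As a much simpler stand-in, and not the Nattino et al. procedure, the sketch below compares mean predicted and observed mortality within groups of patients ordered by predicted risk. Over- or under-prediction at the extremes of risk then shows up as a gap between the two columns. The function name calibration_by_risk_group and the toy data are hypothetical.

```python
def calibration_by_risk_group(pred_probs, outcomes, n_groups=10):
    """Return (mean predicted, observed rate, n) per risk group, lowest risk first."""
    pairs = sorted(zip(pred_probs, outcomes))
    n = len(pairs)
    rows = []
    for g in range(n_groups):
        # Split the sorted patients into roughly equal-sized groups.
        chunk = pairs[g * n // n_groups:(g + 1) * n // n_groups]
        if not chunk:
            continue
        mean_pred = sum(p for p, _ in chunk) / len(chunk)
        obs_rate = sum(y for _, y in chunk) / len(chunk)
        rows.append((mean_pred, obs_rate, len(chunk)))
    return rows


if __name__ == "__main__":
    # Toy data (illustrative only): two risk strata.
    preds = [0.1] * 50 + [0.9] * 50
    died = [0] * 45 + [1] * 5 + [1] * 45 + [0] * 5
    for mean_pred, obs_rate, size in calibration_by_risk_group(preds, died, n_groups=2):
        print(f"predicted={mean_pred:.3f} observed={obs_rate:.3f} n={size}")
```

Unlike the calibration belt, this grouped comparison gives no confidence band and depends on the number of groups chosen, so it is only a rough check.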
Appendix 3c: Calibration of Final Models using Calibration Belt
Finally, we ran the calibration belt procedure for each final model configuration to determine how well each model was calibrated in the full sample. Again, the novel models appeared to be well calibrated, with little to no deviation from the bisector. The Standard TQIP and All variables models also had relatively good overall fit, but demonstrated some potential calibration issues among patients with the lowest (Standard TQIP) and highest (All variables) predicted probabilities.
[Figure: calibration belt plots; panel titles: Or, And]