links.lww.com/TA/B367


Appendix 3: Additional Test of Calibration

Appendix 3a: External Validation of Novel Models using Calibration Plot

As an additional sensitivity analysis, we generated a calibration plot for each novel model by applying regression coefficients from our development sample (2010 data) to patients in our validation sample (2011 data), generating predicted probabilities of death, and then regressing the binary outcome (died vs. did not die) on these predicted probabilities (see Steyerberg EW, et al. Epidemiology. 2010;21(1):128-138 or Steyerberg EW, et al. European Heart Journal. 2014;35(29):1925-1931). We then bootstrapped the final regression coefficients using 1,000 repetitions to generate 95% confidence intervals. Ideal calibration plots have an intercept of 0 (i.e., no systematic over- or under-estimation, known as "calibration-in-the-large") and a slope of 1 (i.e., predicted risks that are neither too extreme nor too moderate relative to observed outcomes). Since the 95% confidence intervals for both parameters in both models contain these values, we believe our models to be appropriately calibrated.

Calibration plot

             Calibration-in-the-large (95% CI)    Calibration slope (95% CI)
And Model    -0.014 (-0.053, 0.026)               0.928 (0.843, 1.001)
Or Model     -0.036 (-0.078, 0.006)               0.962 (0.873, 1.037)
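The calibration intercept and slope described above can be illustrated with a minimal sketch. It is not the code used for this analysis. It assumes the common logistic recalibration form, in which the binary outcome is regressed on the logit of the predicted probability, and it uses simulated data in place of the 2011 validation sample. The function names (`fit_logistic`, `bootstrap_ci`, `percentile_ci`) are hypothetical, and the sketch uses 200 bootstrap repetitions rather than 1,000 only to keep the run time short.

```python
import math
import random


def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def fit_logistic(x, y, iters=25):
    """Fit logit(P(y=1)) = a + b*x by Newton-Raphson; return (a, b)."""
    a, b = 0.0, 1.0  # start at the ideal calibration values
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            z = max(-30.0, min(30.0, a + b * xi))  # guard against overflow
            mu = 1.0 / (1.0 + math.exp(-z))
            w = mu * (1.0 - mu)
            g0 += yi - mu            # score for the intercept
            g1 += (yi - mu) * xi     # score for the slope
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        if det <= 1e-12:
            break
        da = (h11 * g0 - h01 * g1) / det
        db = (h00 * g1 - h01 * g0) / det
        a, b = a + da, b + db
        if abs(da) + abs(db) < 1e-10:
            break
    return a, b


def percentile_ci(values, level=0.95):
    """Percentile interval from a list of bootstrap estimates."""
    s = sorted(values)
    lo = s[int((1.0 - level) / 2.0 * len(s))]
    hi = s[max(int((1.0 + level) / 2.0 * len(s)) - 1, 0)]
    return lo, hi


def bootstrap_ci(x, y, reps=200, seed=1):
    """Resample patients with replacement and refit the calibration model."""
    rng = random.Random(seed)
    n = len(x)
    intercepts, slopes = [], []
    for _ in range(reps):
        idx = [rng.randrange(n) for _ in range(n)]
        a, b = fit_logistic([x[i] for i in idx], [y[i] for i in idx])
        intercepts.append(a)
        slopes.append(b)
    return percentile_ci(intercepts), percentile_ci(slopes)


# Simulated validation data (an assumption for illustration only):
# outcomes are drawn from the predicted risks, so the predictions are
# well calibrated by construction.
rng = random.Random(0)
pred = [min(max(rng.betavariate(1, 8), 1e-6), 1 - 1e-6) for _ in range(1000)]
died = [1 if rng.random() < p else 0 for p in pred]

lp = [logit(p) for p in pred]
# Note: the intercept here is estimated jointly with the slope; a strict
# calibration-in-the-large estimate would fix the slope at 1.
intercept, slope = fit_logistic(lp, died)
ci_intercept, ci_slope = bootstrap_ci(lp, died)

print("intercept", round(intercept, 3), ci_intercept)
print("slope", round(slope, 3), ci_slope)
```

Under this specification, a model is judged well calibrated when the interval for the intercept contains 0 and the interval for the slope contains 1, which is the criterion applied to the And and Or models above.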

Appendix 3b: Validation of Novel Models using Calibration Belt

We also generated graphical representations of these calibration plots using the calibration belt procedure for Stata (see Nattino G, Finazzi S, Bertolini G. Stat Med. 2016;35(5):709-720). Both plots indicate good overall fit: the 95% confidence band for the Or model contains the bisector across the entire probability spectrum, whereas the And model slightly overpredicts the probability of death among patients at the highest risk.

Appendix 3c: Calibration of Final Models using Calibration Belt

Finally, we ran the calibration belt procedure for each final model configuration to determine how well each model was calibrated in the full sample. Again, the novel models appeared to be well calibrated, with little to no deviation from the bisector. The Standard TQIP and All variables models also had relatively good overall fit, but demonstrated some potential calibration issues among patients with the lowest (Standard TQIP) and highest (All variables) predicted probabilities.

[Calibration belt figure panels: Or model; And model]
