Appendix 1. Data Generation for Simulations Used in Figures 1 and 2
Simulations consisted of three independent uniform random variables (X1, X2, X3), a binary treatment (A), and a continuous variable (T) representing time to event. The relationships of the baseline covariates with treatment and with the outcome (event time) were simulated according to Equations 1 and 2:
logit(E[A | X1, X2, X3]) = α0 + α1X1 + α2X2 + α3X3   [1]
T = -log(u) / (λ exp(βA A + β1X1 + β2X2 + β3X3))   [2]
where α0 = -3.5, α1 = 1.0, α2 = 3.0, α3 = 3.0, β1 = 1.5, β2 = 4.8, β3 = 1.3, βA = 0.693 (= ln 2), λ = 1/852 (baseline event rate, scaled in years), and u ~ Uniform(0, 1).
Note that event times were generated to follow an exponential distribution by drawing a uniform random variable, u, and substituting it into Equation 2.1
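For concreteness, a minimal R sketch of this data-generating process is shown below (object names are ours; the complete script is available in the online repository2). The only step beyond Equations 1 and 2 is the use of plogis() to convert the logit in Equation 1 into a treatment probability.

# Minimal sketch of the data-generating process in Equations 1 and 2
set.seed(1)
n  <- 1e6
x1 <- runif(n); x2 <- runif(n); x3 <- runif(n)

# Equation 1: treatment assignment
p_trt <- plogis(-3.5 + 1.0 * x1 + 3.0 * x2 + 3.0 * x3)
a     <- rbinom(n, size = 1, prob = p_trt)

# Equation 2: exponential event time via the inversion method
lambda  <- 1 / 852
lp_out  <- 0.693 * a + 1.5 * x1 + 4.8 * x2 + 1.3 * x3
u       <- runif(n)
t_event <- -log(u) / (lambda * exp(lp_out))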
Each simulated cohort consisted of 1 million subjects, each of whom was followed until they experienced an event. The event times were divided into 100 intervals according to their percentiles, so that approximately 1% of the events occurred in each interval.
For Figure 1, three measures of effect were calculated within each interval: 1) the conditional (covariate-adjusted) hazard ratio, 2) the time-specific marginal hazard ratio, and 3) the population-averaged marginal hazard ratio. The population-averaged marginal hazard ratio is the average of the instantaneous (time-specific) marginal hazard ratios from baseline (time 0) through time t.
For Figure 2, three measures of effect were calculated: 1) the population-averaged marginal hazard ratio from baseline (time 0) through time t; 2) the baseline PS regression-adjusted HR, where quadratic splines of the baseline propensity score were included in the Cox outcome model; and 3) the time-specific PS regression HR, where the time-specific PS was included as a time-varying covariate in the Cox outcome model and was modeled using quadratic splines.
R code for the described simulation and analysis is provided online.2
Appendix 2. Conditional and Marginal Hazard Ratios
In this appendix we show:
• how the marginal hazard ratio, HRm, diverges over time from the conditional hazard ratio, HRc, toward the null (when the HRc ≠ 1.0);
• how PS-based estimators of the HRc tend to be biased over time, toward the HRm, by depletion from the cohort of people who have had outcome events; and
• how PS-based estimators of the HRm tend to be biased over time, toward the HRc, by random censoring, if the cohort is depleted of higher-risk individuals by outcome events.
We start by showing in Figure A1 what is meant by the marginal hazard ratio. The two survival curves show how the treatment and comparison groups are depleted over time by loss of the people who have an outcome event. Time in Figure A1 is scaled by the percentage of the entire cohort who have had an outcome event. At the outset, the cohort is balanced. It is a full counterfactual cohort, which means that it comprises pairs of “clones” such that each pair has a treated person and an untreated person with a matching covariate profile. The true HRc in this simulation is 2.0: at every point in time the treatment doubles the risk of an outcome event in every individual who is still at risk. Over time the treatment group gets depleted at a faster rate than the untreated group, and we see the red curve in Figure A1 diverge from the blue curve. (If the true HRc were 1.0 instead of 2.0, the two groups would be depleted at the same rate and both survival curves would lie precisely on the diagonal.)
Figure A1. Survival by Treatment Arm in a Simulated Counterfactual Cohort. Treatment Assignment and Survival Time Are Driven by Three Covariates (Uniformly Distributed, Uncorrelated). Mechanisms Generating Treatment and Outcome: Logit(Tx) = cov1 + 3 × cov2 + 3 × cov3 - 3.5. Exponential Survival Time with Linear Predictor = ln(2) × Tx + 1.5 × cov1 + 4.8 × cov2 + 1.3 × cov3, Baseline Rate = 1/852.
After 24 months, 25% of the entire cohort has had an outcome event and is no longer in follow-up. The green text points out what we mean by the “instantaneous” HRm at the 24th month and by the HRm for the two years of follow-up until the end of the 24th month. During the 24th month 1.60% of the treated group had an outcome event compared with 1.08% of the untreated group. Thus, the slope of the red survival curve at the 24th month is about 0.0160, and the slope of the blue survival curve at that point is
about 0.0108. At each timepoint, the instantaneous marginal HR is the ratio of the slope of the red curve to that of the blue curve. The HRm from T0 until a 2-year endpoint amounts to the average (on the log scale) of the instantaneous HRs over all times until the 2-year endpoint. The HRm can also be found by Cox
regression: it is the maximum likelihood estimate of the hazard ratio (on the log scale) that would be obtained from a Cox model, fitted to data on the full counterfactual cohort followed from T0 until the 2-year endpoint, regressing time-to-event on treatment status without adjustment for covariates. In a counterfactual cohort – which is like an ideal RCT – the HRm is the estimand of Cox regression without any adjustment for the baseline covariates, while the HRc is the estimand of Cox regression that is adjusted for the baseline covariates.
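As an illustration, the R sketch below (using the survival package) obtains both estimands from a full counterfactual cohort generated as in Appendix 1. The smaller sample size, the 2-year administrative endpoint, and the assumption that each pair of clones shares the same uniform draw are our simplifications.

library(survival)
set.seed(2)

# Full counterfactual cohort: each covariate profile appears once treated and once
# untreated; each pair of "clones" shares the same uniform draw, so the pair differs
# only by treatment.
n  <- 1e5
x1 <- runif(n); x2 <- runif(n); x3 <- runif(n); u <- runif(n)
cf <- data.frame(x1 = rep(x1, 2), x2 = rep(x2, 2), x3 = rep(x3, 2),
                 u  = rep(u, 2),  a  = rep(c(1, 0), each = n))
lambda     <- 1 / 852
cf$t_event <- -log(cf$u) / (lambda * exp(0.693 * cf$a + 1.5 * cf$x1 + 4.8 * cf$x2 + 1.3 * cf$x3))

# Follow the counterfactual cohort from T0 until a chosen endpoint (here, 2 years)
endpoint  <- 2
cf$time   <- pmin(cf$t_event, endpoint)
cf$status <- as.numeric(cf$t_event <= endpoint)

# HRm: unadjusted Cox regression on the counterfactual cohort
exp(coef(coxph(Surv(time, status) ~ a, data = cf)))

# HRc: Cox regression adjusted for the baseline covariates (targets 2.0)
exp(coef(coxph(Surv(time, status) ~ a + x1 + x2 + x3, data = cf)))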
Even if the HRc is always 2.0, the HRm diverges over time toward 1.0 as the cohort is depleted by outcome events. Table A1 shows this counterfactual cohort as entirely balanced at the outset of follow-up. The first row shows this initial balance: the mean of each covariate is 0.5 in each group, the mean PS in each group is 0.5, and the mean disease risk score (DRS) in each group is 3.8. The full cohort then loses its balance over time as the treated and untreated groups are differentially depleted of higher-risk individuals. In the second row, profiling the cohort after it has been depleted by the 5% who have had an outcome event, the “survivors” in each arm have a slightly lower risk profile with respect to each of the covariates. The treatment group lost more people to outcome events than the untreated group and became slightly less risky; its average DRS decreased more. Over time the two arms become more unbalanced as they become more depleted by outcome events.

Table A1. Loss of Balance in a Full Counterfactual Cohort With No Censoring as the Treatment (Trt) and Comparison (Comp) Groups Are Differentially Depleted by Outcomes. Means of the Covariates, Propensity Score (PS), and Disease Risk Score (DRS) by Time.a

Time     Covariate 1        Covariate 2        Covariate 3        PS                 DRS
Scalea   Trt      Comp      Trt      Comp      Trt      Comp      Trt      Comp      Trt      Comp
1        0.500    0.500     0.500    0.500     0.500    0.500     0.500    0.500     3.800    3.800
5        0.494    0.497     0.484    0.492     0.495    0.497     0.487    0.493     3.706    3.751
10       0.487    0.492     0.463    0.480     0.488    0.494     0.470    0.484     3.588    3.685
20       0.475    0.484     0.422    0.455     0.478    0.486     0.437    0.463     3.362    3.542
30       0.465    0.475     0.380    0.423     0.470    0.478     0.404    0.438     3.131    3.365
40       0.456    0.466     0.336    0.388     0.462    0.471     0.371    0.410     2.898    3.173
50       0.446    0.458     0.293    0.349     0.453    0.464     0.337    0.380     2.665    2.967
60       0.435    0.449     0.254    0.307     0.444    0.456     0.306    0.348     2.449    2.741
70       0.417    0.438     0.211    0.262     0.429    0.446     0.270    0.313     2.198    2.496
80       0.392    0.421     0.171    0.218     0.407    0.432     0.232    0.276     1.938    2.237
90       0.346    0.389     0.128    0.168     0.367    0.404     0.185    0.229     1.612    1.915

a Same cohort and time scale as Figure A1. When time = 10, 10% of the cohort has had an outcome event.
As the covariate profile of the treated survivors becomes less risky than that of the untreated survivors, the instantaneous HR (unadjusted for the covariates) diverges from 2.0, and the average of the
instantaneous HRs, averaged over the follow-up period from T0 to any endpoint of interest, also diverges from 2.0, though somewhat more gradually. This pattern is shown in Figure A2.
In surveillance of a real cohort rather than this counterfactual cohort, we can emulate the Figure A2 analysis of the HRm by using inverse probability of treatment weights (IPTW), or else by using our 1:1 PS-matched estimator with unstratified Cox regression. These estimators can consistently estimate the average HRm in Figure A2 if there is no censoring. However, in drug safety surveillance there is often heavy early censoring, as many new users quit their drug while others persist for years. Long-term estimates of the HRm are then derived disproportionately from the less-censored outcomes earlier on the timeline, before there was much differential depletion of higher-risk individuals, and consequently our IPTW and 1:1 matched estimators of the HRm yield estimates that are biased away from their target (the marginal hazard ratio in the counterfactual population in the absence of censoring before the end of follow-up) and toward the HRc. This bias can arise from the amount and timing of the censoring even if who gets censored is entirely unrelated to treatment or risk. A way to avoid this bias in our PS-based estimators of the HRm is to upweight the late, under-represented risk sets, as sketched below.
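The sketch below shows one concrete way such upweighting could be done, using Kaplan-Meier-based inverse-probability-of-censoring weights in a counting-process Cox model. This construction (and all names in it) is ours; the exact weighting used in our estimators may differ.

library(survival)

# Sketch: upweight late risk sets that were thinned by random censoring.
# dat: one row per subject with columns time, status (1 = outcome event), a (treatment).
upweighted_hrm <- function(dat) {
  G    <- survfit(Surv(time, 1 - status) ~ 1, data = dat)  # Kaplan-Meier curve of remaining uncensored
  Ghat <- stepfun(G$time, c(1, G$surv))                    # G(t) as a step function
  cuts <- sort(unique(dat$time[dat$status == 1]))
  long <- survSplit(Surv(time, status) ~ ., data = dat, cut = cuts,
                    start = "tstart", end = "tstop", event = "status")
  long$w <- 1 / pmax(Ghat(long$tstop), 0.05)               # larger weights for later risk sets
  coxph(Surv(tstart, tstop, status) ~ a, data = long, weights = long$w, robust = TRUE)
}

In practice this censoring weight would be multiplied by the IPTW (or matching) weight so that the weighted analysis still targets the HRm.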
Figure A2. Marginal Hazard Ratios Diverge from the Conditional Hazard Ratio Over Time in the Counterfactual Cohort (Same Cohort and Follow-Up Time Scale as Shown in Figure A1).
If we condition the analysis on the individual covariates (for example, by including them in a Cox regression model), it is possible to consistently estimate the HRc despite the challenges from “depletion of susceptibles”. However, if we condition on the baseline PS instead of the baseline covariates, the HR estimate tends to land between the HRc and the HRm. As the cohort is depleted of susceptibles, the associations of the baseline covariates with treatment status change (among the “survivors” who remain in follow-up), and so a PS derived from the cohort at baseline becomes less successful at adjusting analyses.
In Figure A3 the red curve shows how the HR estimates that were adjusted by the baseline PS fell between the true HRc and the HRm (on the log scale). This red curve shows average adjusted HR estimates (on the log scale) yielded by Cox models that were fit to the observed half of the same counterfactual cohort that was simulated for Figures A1 and A2. The correlation between the PS and
the DRS is 0.87 in the scenario simulated for Figures A1-A3. If we modify the outcome-generating mechanism so that the DRS correlates nearly 1.0 with the PS, the PS-adjusted estimator will land very close to the target HRc. This is shown in Figure A4. Conversely, if we modify the outcome-generating mechanism so that the PS is less correlated with the DRS, then our PS-regression estimator will land closer to the HRm. Figure A5 shows the PS-regression estimates converging with the HRm when the correlation of the PS with the DRS is nearly zero. Whether the PS-adjusted estimate lands closer to the HRc or the HRm depends on the absolute value of this PS-DRS correlation, not on whether it is positive or negative. In the baseline scenario examined in the main paper, the correlation between the PS and the DRS was -0.75.
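Because the true PS and the true DRS in these simulations are known functions of the three covariates, the PS-DRS correlation being varied across scenarios can be computed directly. In the sketch below we take the DRS to be the covariate portion of the outcome linear predictor (consistent with the baseline DRS mean of 3.8 in Table A1); object names are ours.

# True propensity score and disease risk score implied by the generating models
n   <- 1e6
x1  <- runif(n); x2 <- runif(n); x3 <- runif(n)
ps  <- plogis(-3.5 + 1.0 * x1 + 3.0 * x2 + 3.0 * x3)  # treatment model, Equation 1
drs <- 1.5 * x1 + 4.8 * x2 + 1.3 * x3                 # covariate part of the outcome model
cor(ps, drs)                                          # close to the 0.87 reported for this scenario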
Figure A3. Hazard Ratio (HR) Estimated by PS Regression Compared with True Conditional and Marginal HRs in Cohort Where the Correlation Between the True PS and the True Disease Risk Score (DRS) is 0.87. Red Curve at T = 10 Shows the PS-Adjusted Ln(HR) From T0 Until the Time When 10% of the Full Cohort has had an Event. True Conditional HR = 2 (Shown at 0.693 on Vertical Axis, Log Scale).
Figure A4. Hazard Ratio (HR) Over Time Estimated by PS Regression Compared with True Conditional and Marginal HRs in a Scenario Where the Correlation of the True PS with the Disease Risk Score (DRS) is Near 1.0.
Figure A5. Hazard Ratio (HR) Over Time Estimated by PS Regression Compared with True Conditional and Marginal HRs in a Scenario Where the Correlation of the True PS with the Disease Risk Score (DRS) is Near 0.
Appendix 3. Estimating Time-Specific Propensity Scores
For a PS-based method to consistently estimate the HRc, the PS adjustment needs to keep the estimator on target as the treatment groups are depleted differentially. PS adjustment can do this if the PS is updated periodically. The timeline can be divided into intervals brief enough so that only a small percentage (e.g. <1%) of either arm of the cohort has an outcome event before the PS is updated again.
For each interval, we can find a PS that best predicts treatment status during the interval. If the treatment effect (the HRc) and the covariates are very strong predictors of the outcome, then the PS needs to be updated frequently.
Keep in mind that we are not updating the values of the baseline covariates – only the function of them that predicts treatment status, conditional on still being at risk.
A PS-based HRc estimator can estimate its target more consistently if (a) the model yielding the time-varying PS at each downstream timepoint includes not only the baseline covariates but also preliminary estimates of propensity scores for earlier and later timepoints, and (b) the PS is updated at each
interval’s median risk set rather than its first risk set. The next two paragraphs elaborate on (a) and (b).
(a) A time-varying PS can be improved if it is derived from a model which includes interaction terms – such as the cross-product of a baseline PS and a later PS – that help keep the HRc estimator on target.
Even if no interaction terms are needed for an optimal PS at the start of follow-up, the need for interaction terms increases over time as susceptibles are depleted in a way that is related to the covariates. But there are many possible covariate interactions and polynomial terms that could be included in the models for updating the PS, and it can be difficult to ascertain which ones would be helpful. We bypass this difficulty by instead including, in the model for each PS update, interactions between preliminary versions of an earlier PS and a later PS. The preliminary versions of an earlier and a later PS are derived from models specified just as the model for the baseline PS was specified. When we update the PS downstream on the timeline, this inclusion in the PS model of interactions between an earlier PS and a later PS can help keep the HRc estimator on target, much as adjustment of the outcome model for a DRS would.
(b) If there are 1000 risk sets during an interval, then the regression model used to estimate the PS can be fit to the 500th risk set, i.e., the risk set anchored to the 500th outcome on the timeline during the interval. After we find the function of the baseline covariates that best predicts treatment status in this midpoint risk set, that function can be used to calculate a PS for everybody in every risk set during the interval, including the individuals who had outcome events early in the interval and so were not included in the midpoint risk set. This strategy is helpful if there is a nontrivial amount of depletion of susceptibles during the interval; if the interval is uneventful, it does not matter whether the PS is updated at the start of the interval or at its midpoint.
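The R sketch below illustrates the interval-by-interval updating described above: the PS is refit in the risk set anchored at each interval's median event time and then entered as a time-varying covariate (via a quadratic-spline basis) in a counting-process Cox model. Function and object names are ours, and for brevity the sketch omits the earlier/later-PS interaction terms described in (a).

library(survival)
library(splines)

# Hypothetical sketch of a time-specific PS analysis.
# dat: one row per subject with columns time, status (1 = event), a (treatment), x1, x2, x3.
time_specific_ps_cox <- function(dat) {
  event_times <- sort(dat$time[dat$status == 1])
  cut_pts <- unique(quantile(event_times, probs = seq(0.01, 0.99, by = 0.01)))  # ~1% of events per interval
  bnd     <- c(0, cut_pts, Inf)

  # Counting-process (tstart, tstop] data: one row per subject per interval at risk
  long <- survSplit(Surv(time, status) ~ ., data = dat, cut = cut_pts,
                    start = "tstart", end = "tstop", event = "status")

  # Update the PS once per interval, fitting it in the risk set at the interval's median event time
  long$ps_t <- NA_real_
  for (k in seq_len(length(bnd) - 1)) {
    mid     <- median(event_times[event_times > bnd[k] & event_times <= bnd[k + 1]])
    at_risk <- dat[dat$time >= mid, ]                       # the "midpoint" risk set
    fit_ps  <- glm(a ~ x1 + x2 + x3, family = binomial, data = at_risk)
    rows    <- long$tstart >= bnd[k] & long$tstart < bnd[k + 1]
    long$ps_t[rows] <- predict(fit_ps, newdata = long[rows, ], type = "response")
  }

  # Outcome model with the time-specific PS as a time-varying covariate (one possible quadratic-spline basis)
  coxph(Surv(tstart, tstop, status) ~ a + bs(ps_t, df = 4, degree = 2), data = long)
}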
Table A2 compares HR estimates that adjusted for covariates by alternative methods in a scenario with simulated time-to-death data where the true HRc = 2.00, and where the true HRm, in follow-up until 90% of the cohort dies, is 1.71. As expected, there was no bias in the HRc estimate when it was adjusted for the individual covariates. In early follow-up, before depletion of “susceptibles”, the baseline PS kept the HRc estimator on target (Table A2, row 1, column 4). But in later follow-up, the estimates adjusted by the baseline PS diverged from their 2.0 target, while the time-varying PS succeeded in keeping the average HRc estimate right on 2.0. This shows that a time-varying PS can avoid the bias that has been noticed in PS-based HRc estimators by Austin and others.1
The last column of Table A2 shows the attenuation of the HRm over time. If the cohort is followed
through the time shown in the bottom row, when only 10% of the cohort is still alive, the HRm is 1.71.
As follow-up is extended from the timepoint reported in the top row (when only 1% of the cohort has died) to the timepoint reported in the bottom row (when 90% of the cohort has died) the true HRm decreases from 2.00 to 1.71. On the log scale this decrease amounts to 22%, from 0.69 to 0.54.
Comparison of the IPTW estimator in the second-to-last column with its target HRm in the final column shows the bias that arises over time in the IPTW estimator. This bias is due to the 50% random censoring. The amount of the bias is trivial initially but amounts to 7% (on the log scale) by the last timepoint (where ln 1.78 / ln 1.71 = 1.07). The bias from random censoring would be trivial throughout follow-up if the amount of random censoring were only 5 or 10%, but the bias would appear earlier and be larger if there is much more censoring (as can happen in drug safety surveillance when individuals are censored once their supply of the drug runs out).
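The percentages quoted above follow directly from the tabulated hazard ratios; for example:

1 - 0.54 / 0.69          # ~0.22: the attenuation of the true HRm on the log scale (0.69 to 0.54)
log(1.78) / log(1.71)    # ~1.07: the ~7% upward bias of the IPTW estimate at the last timepoint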
Table A3 below shows similar results in a scenario where the HRc is 3.0 instead of 2.0. The divergence of the HRm from the HRc in the last column of Table A3 follows the same trajectory on the log scale as the last column of Table A2: on the log scale, the HRm diverges from the HRc by 3% when 10% have died, by 6% when 20% have died, and by 9%, 14%, and 22% at the times when the percentage who have died is 30, 50, and 90 percent, respectively. Thus, when we raised the true HRc from 2 to 3, we also raised the true HRm from 2 to 3 initially, but we did not change the trajectory of the attenuation of the HRm over time.
In both tables, the divergence of the HRm from the HRc increases if we strengthen the covariates’
impact on the outcome. In the Table A2 scenario, if the effect of each covariate on risk is doubled (by doubling each coefficient shown in the bottom row of the footnote), then the HRm in the last column decreases from 2.00 to 1.46, and if the covariate effects on risk are increased five-fold, then the HRm attenuates drastically from 2.00 to 1.20. However, in some drug safety surveillance settings, the covariate effects may be much smaller than in these scenarios. If each covariate effect (defined in the bottom row of the Table A2 footnote) is reduced by 50%, then the true HRm in the last column of Table A2 would attenuate only to 1.90 by the time in the bottom row; if each covariate effect is reduced by 80%, then the attenuation of the true HRm in Table A2 would be trivial – it would only attenuate to 1.98 by the time in the bottom row.
Appendix 4. Description of “Plasmode” Simulations
In the plasmode framework, empirical data are incorporated into the simulation process to more accurately reflect the complex relations that occur among baseline covariates in real-world data.3 We constructed simulations based on an empirical cohort comparing dabigatran versus warfarin with an outcome of major bleeding events, derived from the Truven MarketScan Database (2010-2013). This cohort consists of 79,265 individuals with 69 baseline covariates. After dichotomizing multi-level categorical variables, these baseline covariates comprise 92 binary terms. A full description of this cohort is provided elsewhere.4,5 R code is provided online.2
Of the 92 binary terms, we identified the top 50% (46 terms) with the strongest confounding effect as measured by the Bross formula.6 Multivariable associations of these selected variables with the outcome were simulated to be representative of those observed in the study cohort. This was done by modeling the hazard of the outcome event in the observed data as a function of the 46 selected terms and treatment. The predicted survival curve for each individual was then calculated by plugging that individual's covariate values into the fitted model. The treatment coefficient was altered to produce a desired treatment effect, and the intercept value for the outcome model was selected to generate the desired outcome incidence in the absence of censoring. The resulting dataset has a known exposure effect and simulated survival curves for each individual, while the correlation structure between the baseline covariates and treatment assignment remains unaltered.
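One way this construction could look is sketched below, assuming a Cox outcome model (in which the baseline hazard plays the role of the intercept mentioned above and can be rescaled to reach the target incidence). The placeholder cohort stands for the observed analytic dataset with follow-up time, event indicator, treatment trt, and the 46 selected binary terms; all names are ours.

library(survival)

fit  <- coxph(Surv(time, status) ~ ., data = cohort)     # outcome model: treatment + 46 selected terms
H0   <- basehaz(fit, centered = FALSE)                   # baseline cumulative hazard over follow-up
beta <- coef(fit)
beta["trt"] <- log(2)                                    # impose the desired treatment effect (e.g., HR = 2)
lp   <- as.matrix(cohort[, names(beta)]) %*% beta        # each individual's linear predictor
surv_mat <- exp(-outer(exp(drop(lp)), H0$hazard))        # S_i(t) = exp(-exp(lp_i) * H0(t)); rows = individuals
t_grid   <- H0$time                                      # time grid on which the curves are evaluated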
Simulated datasets were then created by sampling, with replacement, 40,000 individuals from this dataset. The event time for each individual was assigned by identifying the time at which that individual's simulated survival probability first dropped below a randomly drawn uniform variable. For scenarios involving censoring prior to the end of follow-up, censoring times were simulated by randomly drawing from the distribution of censoring times in the actual study population, independent of treatment status (uninformative censoring). For scenarios with no censoring prior to the end of follow-up, censoring times were assigned at the end of the study period. The simulated follow-up time was then taken to be the minimum of the event and censoring times. A detailed discussion of plasmode simulations is provided by Franklin et al.3
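Continuing the sketch above, the resampling, event-time assignment, and censoring steps could look like the following (surv_mat and t_grid are from the previous sketch; obs_cens is a placeholder vector of observed censoring times):

set.seed(3)
n_sim <- 40000
samp  <- sample(nrow(surv_mat), n_sim, replace = TRUE)   # resample individuals with replacement
u     <- runif(n_sim)

# Event time = first grid time at which the individual's survival curve drops below u
first_below <- function(surv_row, u) {
  idx <- which(surv_row < u)[1]
  if (is.na(idx)) Inf else t_grid[idx]                   # Inf = no event within the study period
}
t_event <- mapply(function(i, ui) first_below(surv_mat[i, ], ui), samp, u)

# Uninformative censoring, truncated at the administrative end of the study period
t_cens <- pmin(sample(obs_cens, n_sim, replace = TRUE), max(t_grid))
time   <- pmin(t_event, t_cens)
status <- as.numeric(t_event <= t_cens)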
For Scenarios 1 through 10, the correlation between the propensity score and the prognostic score was taken directly from the simulated data, which resulted in a moderate correlation between the two scores (Spearman correlation coefficient of -0.38). For Scenarios 11 through 20, we reduced this correlation by reversing the sign of the coefficient for a subset of variables in the outcome model, so that a larger proportion of covariates confounded in opposite directions (cancelling effects). This resulted in a reduced Spearman correlation coefficient of approximately -0.08.
Calculation of the true marginal hazard ratio in plasmode simulations: We calculated the true marginal hazard ratio among the treated population for all scenarios. This was done by simulating, for each individual in the observed population, a counterfactual outcome under the unobserved treatment. We then created a counterfactual dataset in which each individual was included twice, once under each of the two alternative treatment options (i.e., under treatment and under the comparator, with the corresponding simulated potential outcomes). We then obtained an unbiased estimate of the true marginal hazard ratio up until the study's endpoint by fitting a Cox model within this counterfactual population with exposure as the only independent variable.
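A sketch of this calculation is given below (names ours; pot is a placeholder data frame with one row per individual and columns t1 and t0, the simulated potential event times under treatment and under the comparator):

library(survival)

true_marginal_hr <- function(pot, end_of_study) {
  cf <- data.frame(trt   = rep(c(1, 0), each = nrow(pot)),   # every individual appears twice
                   t_pot = c(pot$t1, pot$t0))
  cf$time   <- pmin(cf$t_pot, end_of_study)
  cf$status <- as.numeric(cf$t_pot <= end_of_study)
  exp(coef(coxph(Surv(time, status) ~ trt, data = cf)))      # exposure as the only covariate
}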
Appendix 5. Supplemental Results
Supplemental Table 1. Plasmode simulation results for scenarios 1 through 16 with censoring
                True Log Hazard Ratio Values      Estimated Log Hazard Ratios (St. Error)a
Scenario   Marginal HR   Conditional HR   Unadjusted     PS Method 1    PS Method 2    PS Method 3    Outcome Reg
1          0.64          0.69             0.419 (.020)   0.663 (.027)   0.672 (.022)   0.695 (.022)   0.695 (.022)
2          0.67          0.69             0.433 (.033)   0.688 (.043)   0.690 (.036)   0.695 (.037)   0.695 (.036)
3          0.68          0.69             0.435 (.046)   0.697 (.064)   0.695 (.051)   0.696 (.052)   0.695 (.051)
4          0.69          0.69             0.437 (.069)   0.703 (.094)   0.699 (.076)   0.700 (.076)   0.697 (.076)
5          1.01          1.10             0.804 (.019)   1.046 (.025)   1.063 (.021)   1.096 (.022)   1.100 (.021)
6          1.07          1.10             0.829 (.032)   1.083 (.041)   1.093 (.034)   1.101 (.034)   1.102 (.034)
7          1.08          1.10             0.836 (.047)   1.097 (.062)   1.101 (.051)   1.101 (.052)   1.102 (.051)
8          1.09          1.10             0.840 (.066)   1.103 (.089)   1.105 (.071)   1.104 (.072)   1.101 (.071)
9          0.60          0.69             0.550 (.019)   0.634 (.024)   0.665 (.020)   0.692 (.021)   0.694 (.020)
10         0.66          0.69             0.600 (.031)   0.668 (.043)   0.682 (.034)   0.693 (.034)   0.695 (.034)
11         0.67          0.69             0.612 (.046)   0.679 (.064)   0.693 (.051)   0.691 (.051)   0.695 (.050)
12         0.68          0.69             0.617 (.066)   0.683 (.093)   0.694 (.074)   0.697 (.075)   0.697 (.072)
13         0.94          1.10             0.928 (.020)   1.000 (.026)   1.050 (.021)   1.101 (.022)   1.100 (.021)
14         1.04          1.10             0.993 (.031)   1.061 (.044)   1.084 (.034)   1.102 (.034)   1.100 (.034)
15         1.07          1.10             1.012 (.046)   1.079 (.066)   1.102 (.050)   1.102 (.051)   1.102 (.050)
16         1.08          1.10             1.019 (.065)   1.085 (.094)   1.104 (.072)   1.102 (.072)   1.101 (.071)
a Treatment effects were estimated using 1-to-1 matching on the baseline propensity score (PS Method 1), including quadratic splines of the baseline propensity score in the Cox outcome model (PS Method 2), including quadratic splines of the time-specific propensity scores in the Cox outcome model (PS Method 3), and outcome regression that adjusted for each of the baseline covariates directly in the Cox outcome model.
REFERENCES
1. Austin, P.C. The performance of different propensity score methods for estimating marginal hazard ratios. Stat Med. 2013;32(16):2837-2849.
2. Wyss, R. tdps-conditionalHR. GitHub Repository 2019. Available at https://github.com/richiewyss/tdps-conditionalHR.
3. Franklin, J.M., Schneeweiss, S., Polinski, J.M. & Rassen, J.A. Plasmode simulation for the evaluation of pharmacoepidemiologic methods in complex healthcare databases. Computational Statistics & Data Analysis. 2014;72:219-226.
4. Bohn, J., Schneeweiss, S., Glynn, R.J., et al. Confounding control with disease risk scores developed under different follow-up approaches, with applications to the comparative safety of anticoagulants. eGEMs. (accepted ahead of print)
5. Desai, R.J., Wyss, R., Jin, Y., et al. Extension of disease risk score-based confounding adjustments for multiple outcomes of interest: an empirical evaluation. American Journal of Epidemiology. 2018;187(11):2439-2448.
6. Bross, I.D. Spurious effects from an extraneous variable. Journal of Chronic Diseases. 1966;19:637-647.