APPENDIX
Notation and Definitions
Consider a longitudinal study of $n$ subjects, with each subject’s observations assumed to be i.i.d.
Age (in years) is the time scale of interest, indexed by $k$, for $k = 17, 18, 19, \dots, 90$, and $S$ denotes age at the beginning of follow-up. $V$ is a vector of baseline covariates. $A_k$ is a 15-year lagged measure of exposure, i.e., the REC level a participant was exposed to at age $k-15$. $W_k$ is a 15-year lagged binary variable for active employment status, taking the value one if a participant was actively employed at age $k-15$ and zero if not. $L_k$ is a 15-year lagged categorical variable for employment location at age $k-15$, with three levels: underground, surface, and inactive. Note that once $W_k = 0$ given $W_{k-1} = 1$, then $A_{k'} = 0$ for all $k' \ge k$, as participants are considered to be no longer exposed after termination of employment. Likewise, location $L_{k'}$ is considered inactive for all $k' \ge k$. $Y_k$ is an indicator for failure (here lung cancer death) at age $k$, while $D_k$ is an indicator for death from other causes (all non-lung cancer deaths). $C_k$ is an indicator for censoring due to loss to follow-up at age $k$. The order assumed at each age $k$ is $V_k, W_k, L_k, A_k, C_k, D_k, Y_k$. Histories of time-varying covariates are denoted with overbars, so that exposure history through age $k$ is $\bar{A}_k = \{A_S, A_{S+1}, A_{S+2}, \dots, A_k\}$.
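The lagging and the absorbing "inactive" state defined above can be illustrated with a minimal Python sketch. The function names (`lagged`, `enforce_termination`), the dictionary layout, and the default value for ages without a record are assumptions made for illustration; they are not taken from the study's SAS code.

```python
LAG_YEARS = 15  # lag stated in the text


def lagged(raw_by_age, ages, lag=LAG_YEARS, default=0.0):
    """Map each analysis age k to the raw value recorded at age k - lag.

    `raw_by_age` is a hypothetical dict {age: measurement}. Ages with no
    record at k - lag receive `default` (an assumption of this sketch).
    """
    return {k: raw_by_age.get(k - lag, default) for k in ages}


def enforce_termination(work, exposure):
    """Make W_k = 0 following W_{k-1} = 1 an absorbing state.

    `work` and `exposure` are lists ordered by age. From the first age at
    which work status drops from 1 to 0, work status and exposure are both
    set to zero for that age and every later age.
    """
    w_out, a_out = [], []
    terminated = False
    for w, a in zip(work, exposure):
        if w_out and w_out[-1] == 1 and w == 0:
            terminated = True
        if terminated:
            w, a = 0, 0.0
        w_out.append(w)
        a_out.append(a)
    return w_out, a_out


# Exposure measured at ages 30 and 31 appears at analysis ages 45 and 46.
print(lagged({30: 100.0, 31: 120.0}, range(45, 48)))
print(enforce_termination([1, 1, 0, 1], [5.0, 4.0, 3.0, 2.0]))
```

In the second call, the return to work at the fourth age is overridden, since termination is treated as permanent.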
The parametric g-formula
The cumulative incidence for lung cancer through age $k = 90$ is given below as the sum of the incidences at each age $k = 17, 18, 19, \dots, 90$, expressed as a function of the probability of the outcome and the joint distribution of exposure and covariates (20).
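\[
\begin{aligned}
\sum_{k=17}^{90} \sum_{v} \sum_{\bar{w}_{90}} \sum_{\bar{l}_{90}} \sum_{\bar{\alpha}_{90}}
& \Pr[Y_k = 1 \mid V = v, \bar{W}_k = \bar{w}_k, \bar{L}_k = \bar{l}_k, \bar{A}_k = \bar{\alpha}_k, \bar{Y}_{k-1} = \bar{D}_k = \bar{C}_k = 0, S \le k] \\
\times \prod_{j=17}^{k}
& \Pr[D_j = 0 \mid V = v, \bar{W}_j = \bar{w}_j, \bar{L}_j = \bar{l}_j, \bar{A}_j = \bar{\alpha}_j, \bar{Y}_{j-1} = \bar{D}_{j-1} = \bar{C}_{j-1} = 0, S \le j] \\
& \times \Pr[C_j = 0 \mid V = v, \bar{W}_j = \bar{w}_j, \bar{L}_j = \bar{l}_j, \bar{A}_j = \bar{\alpha}_j, \bar{Y}_{j-1} = \bar{D}_{j-1} = \bar{C}_{j-1} = 0, S \le j] \\
& \times f[A_j = \alpha_j \mid V = v, \bar{W}_j = \bar{w}_j, \bar{L}_j = \bar{l}_j, \bar{A}_{j-1} = \bar{\alpha}_{j-1}, \bar{Y}_{j-1} = \bar{D}_{j-1} = \bar{C}_{j-1} = 0, S \le j] \\
& \times f[L_j = l_j \mid V = v, \bar{W}_j = \bar{w}_j, \bar{L}_{j-1} = \bar{l}_{j-1}, \bar{A}_{j-1} = \bar{\alpha}_{j-1}, \bar{Y}_{j-1} = \bar{D}_{j-1} = \bar{C}_{j-1} = 0, S \le j] \\
& \times \Pr[W_j = 1 \mid V = v, \bar{W}_{j-1} = \bar{w}_{j-1}, \bar{L}_{j-1} = \bar{l}_{j-1}, \bar{A}_{j-1} = \bar{\alpha}_{j-1}, \bar{Y}_{j-1} = \bar{D}_{j-1} = \bar{C}_{j-1} = 0, S \le j] \\
& \times f(V = v) \\
& \times \Pr[Y_{j-1} = 0 \mid V = v, \bar{W}_{j-1} = \bar{w}_{j-1}, \bar{L}_{j-1} = \bar{l}_{j-1}, \bar{A}_{j-1} = \bar{\alpha}_{j-1}, \bar{Y}_{j-2} = \bar{D}_{j-1} = \bar{C}_{j-1} = 0, S \le j-1]
\end{aligned}
\]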
Under simulated interventions in which $A_k$ is intervened on (in our case deterministically) and, additionally, no participant is allowed to be lost to follow-up, the above quantity reduces to:
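\[
\begin{aligned}
\sum_{k=17}^{90} \sum_{v} \sum_{\bar{w}_{90}} \sum_{\bar{l}_{90}}
& \Pr[Y_k = 1 \mid V = v, \bar{W}_k = \bar{w}_k, \bar{L}_k = \bar{l}_k, \bar{A}_k = \bar{\alpha}_k, \bar{Y}_{k-1} = \bar{D}_k = \bar{C}_k = 0, S \le k] \\
\times \prod_{j=17}^{k}
& \Pr[D_j = 0 \mid V = v, \bar{W}_j = \bar{w}_j, \bar{L}_j = \bar{l}_j, \bar{A}_j = \bar{\alpha}_j, \bar{Y}_{j-1} = \bar{D}_{j-1} = \bar{C}_j = 0, S \le j] \\
& \times f[L_j = l_j \mid V = v, \bar{W}_j = \bar{w}_j, \bar{L}_{j-1} = \bar{l}_{j-1}, \bar{A}_{j-1} = \bar{\alpha}_{j-1}, \bar{Y}_{j-1} = \bar{D}_{j-1} = \bar{C}_{j-1} = 0, S \le j] \\
& \times \Pr[W_j = 1 \mid V = v, \bar{W}_{j-1} = \bar{w}_{j-1}, \bar{L}_{j-1} = \bar{l}_{j-1}, \bar{A}_{j-1} = \bar{\alpha}_{j-1}, \bar{Y}_{j-1} = \bar{D}_{j-1} = \bar{C}_{j-1} = 0, S \le j] \\
& \times f(V = v) \\
& \times \Pr[Y_{j-1} = 0 \mid V = v, \bar{W}_{j-1} = \bar{w}_{j-1}, \bar{L}_{j-1} = \bar{l}_{j-1}, \bar{A}_{j-1} = \bar{\alpha}_{j-1}, \bar{Y}_{j-2} = \bar{D}_{j-1} = \bar{C}_{j-1} = 0, S \le j-1]
\end{aligned}
\]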
This quantity is the extension of standardization to time-varying exposures (20, 34-35). The expression above is a sum over all possible covariate and exposure histories, which cannot be computed nonparametrically in high-dimensional data such as the dataset in question. We instead rely on parametric models and a Monte Carlo simulation to approximate the g-formula.
The parametric models give us estimates for the individual probabilities in the above expression.
For each intervention considered, exposure and covariate histories are generated for a Monte Carlo sample of the original data, using the predicted probabilities and densities from the fitted parametric models. The cumulative incidence for lung cancer mortality is then estimated in each simulated dataset.
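To make the link between the summation and its Monte Carlo approximation concrete, the following Python sketch compares the two in a deliberately small toy setting: two exposure times, one binary time-varying covariate, and a single outcome. The probability functions `p_l1` and `p_y` and all their coefficients are hypothetical values chosen for illustration; they do not come from the study.

```python
import random


def p_l1(a0):
    # Hypothetical P(L1 = 1 | A0 = a0): the covariate depends on past exposure.
    return 0.2 + 0.3 * a0


def p_y(a0, l1, a1):
    # Hypothetical P(Y = 1 | A0 = a0, L1 = l1, A1 = a1).
    return 0.05 + 0.10 * a0 + 0.15 * l1 + 0.10 * a1


def exact_risk(a0, a1):
    """g-formula by enumeration: sum over l1 of P(Y | a0, l1, a1) P(l1 | a0)."""
    risk = 0.0
    for l1 in (0, 1):
        prob_l1 = p_l1(a0) if l1 == 1 else 1.0 - p_l1(a0)
        risk += p_y(a0, l1, a1) * prob_l1
    return risk


def monte_carlo_risk(a0, a1, n_draws=200_000, seed=1):
    """Approximate the same quantity by simulating L1, then Y, per draw."""
    rng = random.Random(seed)
    events = 0
    for _ in range(n_draws):
        l1 = 1 if rng.random() < p_l1(a0) else 0
        if rng.random() < p_y(a0, l1, a1):
            events += 1
    return events / n_draws


print(exact_risk(1, 1), monte_carlo_risk(1, 1))
```

With only one binary covariate the enumeration is trivial. With many covariates over many ages the number of histories grows too quickly to enumerate, which is why simulation is used instead.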
The process is performed in a SAS macro in the following steps:
Parametric models
We fit parametric models to estimate the above probabilities and densities. Baseline covariates $V$ were entered in all models as follows: a cubic polynomial for age with user-defined knots and 3 degrees of freedom, a categorical variable for calendar year with one level per decade, a categorical variable for state, and indicators for male sex and non-white race. This vector function is collectively denoted $g_1(v)$ in the following models.
The parametric models fit were as follows:
a. A linear model for the level of annual daily average REC exposure at age $k$,
\[
E[A_k \mid \bar{W}_k = 1, V = v, \bar{L}_k = \bar{l}_k, \bar{A}_{k-1} = \bar{\alpha}_{k-1}, \bar{Y}_{k-1} = \bar{D}_{k-1} = \bar{C}_{k-1} = 0] = \beta_0 + \beta_1' g_1(v) + \beta_2' g_2(\bar{l}_k) + \beta_3' g_3(\bar{\alpha}_{k-1})
\]
where $\beta_1'$ is a vector of parameter coefficients for all baseline covariates as listed above, and $\beta_2'$ is a vector of parameter coefficients for the terms of a function of location history $g_2(\bar{l}_k)$, specifically categorical variable terms for location at ages $k$ and $k-1$. Finally, $\beta_3'$ is a vector of parameter coefficients for the terms of a function of exposure history $g_3(\bar{\alpha}_{k-1})$, specifically cubic splines with two degrees of freedom each, for the 15-year lagged exposure levels at ages $k-1$ and $k-2$.
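As a rough illustration of what fitting such a linear model provides, the Python sketch below fits an ordinary least squares line with a single predictor and returns the residual standard deviation alongside the coefficients. This is a heavily simplified stand-in: the model in (a) has many more terms (baseline covariates, location history, splines), and the function name `fit_simple_ols` is hypothetical.

```python
import math


def fit_simple_ols(x, y):
    """Closed-form least squares for y = intercept + slope * x.

    Returns (intercept, slope, sigma), where sigma is the residual standard
    deviation with n - 2 degrees of freedom. Here x might be the previous
    year's exposure and y the current year's exposure.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    sigma = math.sqrt(sum(r * r for r in residuals) / (n - 2))
    return intercept, slope, sigma


print(fit_simple_ols([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0]))
```

The coefficients give the predicted mean exposure, and a residual standard deviation of this kind is one plausible source for the random error term added to predicted exposure during simulation.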
b. An ordinal logistic model for the probability of each level of the categorical location variable at age $k$,
\[
\operatorname{logit}(\Pr[L_k = l_k \mid \bar{W}_k = 1, V = v, \bar{L}_{k-1} = \bar{l}_{k-1}, \bar{A}_{k-1} = \bar{\alpha}_{k-1}, \bar{Y}_{k-1} = \bar{D}_{k-1} = \bar{C}_{k-1} = 0]) = \gamma_0 + \gamma_1' g_1(v) + \gamma_2' g_2(\bar{l}_{k-1}) + \gamma_3' g_3(\bar{\alpha}_{k-1})
\]
where $\gamma_1'$ is a vector of parameter coefficients for all baseline covariates, $\gamma_2'$ is a vector of parameter coefficients for the terms of a function of location history $g_2(\bar{l}_{k-1})$, specifically categorical variable terms for location at ages $k-1$ and $k-2$, and $\gamma_3'$ is a vector of parameter coefficients for the terms of exposure history as described in (a). The models for prediction of $L_k$ are nested: we first predict the probability that location is underground and then, given that location is not underground, the probability that location is surface. The remaining level is inactive.
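The nested structure can be sketched in Python as follows: two logistic linear predictors are turned into probabilities for the three location levels, and a level is then drawn from a single uniform value. The function names and the example linear predictors are assumptions for illustration only.

```python
import math


def expit(x):
    """Inverse of the logit function."""
    return 1.0 / (1.0 + math.exp(-x))


def location_probs(eta_underground, eta_surface):
    """Combine two nested logistic predictions into three-level probabilities.

    eta_underground: linear predictor for P(underground).
    eta_surface: linear predictor for P(surface | not underground).
    """
    p_underground = expit(eta_underground)
    p_surface_given_not_ug = expit(eta_surface)
    return {
        "underground": p_underground,
        "surface": (1.0 - p_underground) * p_surface_given_not_ug,
        "inactive": (1.0 - p_underground) * (1.0 - p_surface_given_not_ug),
    }


def draw_location(probs, u):
    """Pick a level by comparing a uniform draw u with cumulative probabilities."""
    if u < probs["underground"]:
        return "underground"
    if u < probs["underground"] + probs["surface"]:
        return "surface"
    return "inactive"


probs = location_probs(0.0, 0.0)
print(probs, draw_location(probs, 0.6))
```

By construction the three probabilities sum to one, so no separate model is needed for the inactive level.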
c. A logistic model for the probability of active employment at age $k$,
\[
\operatorname{logit}(\Pr[W_k = 1 \mid \bar{W}_{k-1} = 1, V = v, \bar{L}_{k-1} = \bar{l}_{k-1}, \bar{A}_{k-1} = \bar{\alpha}_{k-1}, \bar{Y}_{k-1} = \bar{D}_{k-1} = \bar{C}_{k-1} = 0]) = \delta_0 + \delta_1' g_1(v) + \delta_2' g_2(\bar{l}_{k-1}) + \delta_3' g_3(\bar{\alpha}_{k-1})
\]
where $\delta_1'$ is a vector of parameter coefficients for all baseline covariates, and $\delta_2'$ and $\delta_3'$ are vectors of parameter coefficients for the terms of the respective location and exposure history functions as described in (b).
d. A logistic model for the probability of censoring at age $k$,
\[
\operatorname{logit}(\Pr[C_k = 1 \mid V = v, \bar{A}_k = \bar{\alpha}_k, \bar{L}_k = \bar{l}_k, \bar{W}_k = \bar{w}_k, \bar{Y}_{k-1} = \bar{D}_{k-1} = \bar{C}_{k-1} = 0]) = \zeta_0 + \zeta_1' g_1(v) + \zeta_2' g_2(\bar{l}_k) + \zeta_3' g_3(\bar{\alpha}_k) + \zeta_4' g_4(\bar{w}_k)
\]
where $\zeta_1'$ is a vector of parameter coefficients for all baseline covariates, $\zeta_2'$ is a vector of parameter coefficients for a function of location history $g_2(\bar{l}_k)$ as in (a), $\zeta_3'$ is a vector of parameter coefficients for a function of exposure history $g_3(\bar{\alpha}_k)$, specifically cubic splines with two degrees of freedom each, for the 15-year lagged exposure levels at ages $k$ and $k-1$ plus a linear term for 15-year lagged cumulative exposure up to age $k-2$, and $\zeta_4'$ is a vector of parameter coefficients for work status history $g_4(\bar{w}_k)$, specifically indicators for 15-year lagged work status at ages $k$ and $k-1$.
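The exposure-history summary entering models (d) to (f) can be sketched in Python as below. The sketch extracts only the three raw ingredients (exposure at age $k$, exposure at age $k-1$, and cumulative exposure up to age $k-2$); the cubic spline expansion of the first two is omitted, and the function name and the zero default before the start of follow-up are assumptions.

```python
def exposure_history_terms(exposure, idx):
    """Return (A_k, A_{k-1}, cumulative exposure through k-2).

    `exposure` is a list of lagged exposure values ordered by age, starting
    at the age of entry, and `idx` is the position of age k in that list.
    Values before the start of the list are taken as zero (an assumption).
    """
    a_k = exposure[idx]
    a_k_minus_1 = exposure[idx - 1] if idx >= 1 else 0.0
    cumulative = sum(exposure[: idx - 1]) if idx >= 2 else 0.0
    return a_k, a_k_minus_1, cumulative


history = [1.0, 2.0, 3.0, 4.0]
print(exposure_history_terms(history, 3))
```

Summarising older exposure by a single cumulative term keeps the number of model parameters fixed as the history grows longer.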
e. A logistic model for the probability of death due to causes other than lung cancer at age $k$,
\[
\operatorname{logit}(\Pr[D_k = 1 \mid V = v, \bar{A}_k = \bar{\alpha}_k, \bar{L}_k = \bar{l}_k, \bar{W}_k = \bar{w}_k, \bar{Y}_{k-1} = \bar{D}_{k-1} = \bar{C}_k = 0]) = \theta_0 + \theta_1' g_1(v) + \theta_2' g_2(\bar{l}_k) + \theta_3' g_3(\bar{\alpha}_k) + \theta_4' g_4(\bar{w}_k),
\]
where $\theta_1'$ is a vector of parameter coefficients for all baseline covariates, and $\theta_2'$, $\theta_3'$, and $\theta_4'$ are vectors of parameter coefficients for the terms of location, exposure, and work status histories respectively, as described in (d).
f. Finally, a logistic model for the probability of lung cancer death at age $k$,
\[
\operatorname{logit}(\Pr[Y_k = 1 \mid V = v, \bar{A}_k = \bar{\alpha}_k, \bar{L}_k = \bar{l}_k, \bar{W}_k = \bar{w}_k, \bar{Y}_{k-1} = \bar{D}_k = \bar{C}_k = 0]) = \psi_0 + \psi_1' g_1(v) + \psi_2' g_2(\bar{l}_k) + \psi_3' g_3(\bar{\alpha}_k) + \psi_4' g_4(\bar{w}_k)
\]
where $\psi_1'$ is a vector of parameter coefficients for all baseline covariates, and $\psi_2'$, $\psi_3'$, and $\psi_4'$ are vectors of parameter coefficients for the terms of location, exposure, and work status histories respectively, as described in (d).
Simulation
We next draw a Monte Carlo sample (in this case of the same size as the observed cohort). Age at entry $k = S$, values of the baseline covariates $V$, and baseline values of exposure and work status are set as observed. By definition, the following also holds: $D_{S-1} = C_{S-1} = Y_{S-1} = 0$.
For all $k > S$, $A_k$ and $W_k$ are predicted using the parameters from the fitted models described above and the simulated values of all covariate and exposure histories up to that age. Values for continuous variables (in this case the exposure $A_k$) are set to the model-predicted value plus a random error term, while values for discrete variables (here employment status $W_k$) are determined by comparing the model-predicted probability with a randomly drawn uniform value: a value of one is assigned if the uniform draw is below the predicted probability, and zero otherwise. Once a subject is predicted to be no longer actively employed ($W_k = 0$ given $W_{k-1} = 1$), they are set to be inactive throughout the rest of follow-up, $W_{k'} = 0$ for all $k' > k$, and exposure is set to zero, $A_{k'} = 0$ for all $k' \ge k$.
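The simulation step for a single subject might look like the following Python sketch. It is a simplification under stated assumptions: the subject is active at baseline, the probability of remaining employed is a constant `p_active` rather than a fitted logistic prediction, exposure depends only on the previous year's exposure, and the location variable is left out. The function name and parameters are hypothetical.

```python
import random


def simulate_work_and_exposure(a_start, n_ages, p_active, a_intercept,
                               a_slope, a_sigma, rng):
    """Simulate work status W and exposure A over n_ages consecutive ages."""
    work, exposure = [1], [a_start]  # baseline values, taken as observed
    for _ in range(n_ages - 1):
        if work[-1] == 0:
            # Termination is absorbing: inactive and unexposed from here on.
            work.append(0)
            exposure.append(0.0)
            continue
        # Discrete variable: 1 if the uniform draw falls below the probability.
        w = 1 if rng.random() < p_active else 0
        if w == 1:
            # Continuous variable: predicted mean plus a normal error term.
            a = a_intercept + a_slope * exposure[-1] + rng.gauss(0.0, a_sigma)
        else:
            a = 0.0
        work.append(w)
        exposure.append(a)
    return work, exposure


rng = random.Random(2024)
print(simulate_work_and_exposure(10.0, 6, 0.8, 1.0, 0.5, 0.3, rng))
```

Passing the random number generator in explicitly makes each simulated dataset reproducible from its seed.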
Interventions
Under an intervention setting an exposure limit $L$, the value of $A_k$ is intervened on whenever the predicted value is above $L$, by replacing the predicted exposure with $L$. The simulation continues using the intervened values for the prediction of subsequent exposure and covariate values. $C_k$ is set to zero at all times under all interventions, as we assume that no one is lost to follow-up.
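A minimal Python sketch of such a limit is given below. To keep the effect of the cap visible, exposure is predicted deterministically from the previous year's value with no error term. The function name, the coefficients, and the choice to leave the baseline value as observed are assumptions of this illustration.

```python
def simulate_exposure(a_start, n_ages, intercept, slope, limit=None):
    """Predict exposure year by year, optionally capping each prediction.

    With limit=None the natural course is returned. Otherwise every
    predicted value above `limit` is replaced by `limit`, and the capped
    value (not the original prediction) feeds the next year's prediction.
    """
    path = [a_start]  # baseline exposure kept as observed (an assumption)
    for _ in range(n_ages - 1):
        predicted = intercept + slope * path[-1]
        if limit is not None:
            predicted = min(predicted, limit)
        path.append(predicted)
    return path


print(simulate_exposure(100.0, 4, 10.0, 0.5))              # natural course
print(simulate_exposure(100.0, 4, 10.0, 0.5, limit=50.0))  # capped at 50
```

In the capped run the third value is lower than in the natural course even though neither exceeds the limit, because it was predicted from the capped second value.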
Cumulative Risk estimation
Death due to competing risks $D_k$ and the outcome $Y_k$ are then simulated probabilistically through age $k = 90$, by predicting the respective probabilities at each age $k$ using the models above and the simulated (and intervened-on) covariate and exposure histories.
Using the simulated dataset we then estimate a cumulative risk under each specific intervention using the following cumulative incidence estimator (36):
\[
\sum_{k=17}^{90} \Pr[Y_k = 1 \mid \bar{W}_k, \bar{A}_k, V, Y_{k-1} = D_k = C_k = 0] \times \Pr[Y_{k-1} = 0, D_k = 0].
\]
The estimated risk is averaged over the simulated population, i.e., summed over all covariate histories and weighted by the frequency of each history, giving the g-formula estimate described above. Risk ratios (RR) are estimated for each intervention by comparing the risk under that intervention to the risk under no intervention.
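The estimator can be sketched in Python as follows, assuming that for each simulated subject we already hold the predicted per-age probabilities of lung cancer death (`h_y`) and of death from other causes (`h_d`). Following the ordering given in the notation, competing death is evaluated before the outcome at each age. The function names and input layout are hypothetical.

```python
def cumulative_incidence(h_y, h_d):
    """Sum over ages of P(Y_k = 1 | still at risk) * P(Y_{k-1} = 0, D_k = 0)."""
    event_free = 1.0  # probability of no outcome and no other death so far
    total = 0.0
    for p_y, p_d in zip(h_y, h_d):
        at_risk = event_free * (1.0 - p_d)    # survived other causes at age k
        total += p_y * at_risk                # outcome occurs at age k
        event_free = at_risk * (1.0 - p_y)    # carried forward to age k + 1
    return total


def population_risk(subjects):
    """Average the subject-level cumulative incidence over the sample."""
    return sum(cumulative_incidence(h_y, h_d) for h_y, h_d in subjects) / len(subjects)


natural = [([0.1, 0.1], [0.0, 0.0]), ([0.2, 0.5], [0.5, 0.0])]
capped = [([0.05, 0.05], [0.0, 0.0]), ([0.1, 0.25], [0.5, 0.0])]
print(population_risk(capped) / population_risk(natural))  # risk ratio
```

The final line forms a risk ratio from two sets of made-up hazards, standing in for the intervention and no-intervention simulations.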
Confidence Intervals
As a measure of stability, we construct 95% confidence intervals (CI) by repeating the entire process in 2000 bootstrap samples (drawn from the observed data with replacement). The 95% CI bounds for the RRs are the 2.5th and 97.5th percentiles of the RR distribution across the 2000 bootstrap samples.
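The percentile bootstrap can be sketched in Python as below. The `statistic` argument stands in for the entire pipeline (model fitting, simulation, and RR estimation), which is far more involved in practice; here a simple mean is used so the example runs quickly. The function names and the linear interpolation rule for percentiles are assumptions.

```python
import random


def percentile(sorted_values, q):
    """Percentile by linear interpolation between order statistics (0 <= q <= 1)."""
    position = q * (len(sorted_values) - 1)
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] * (1.0 - fraction) + sorted_values[upper] * fraction


def bootstrap_ci(data, statistic, n_boot=2000, seed=0):
    """95% percentile interval from resampling `data` with replacement."""
    rng = random.Random(seed)
    n = len(data)
    estimates = []
    for _ in range(n_boot):
        resample = [data[rng.randrange(n)] for _ in range(n)]
        estimates.append(statistic(resample))
    estimates.sort()
    return percentile(estimates, 0.025), percentile(estimates, 0.975)


def mean(values):
    return sum(values) / len(values)


print(bootstrap_ci([float(i) for i in range(100)], mean))
```

Because each bootstrap replicate reruns the whole procedure on a resampled cohort, the resulting interval reflects uncertainty from the model fits as well as from the Monte Carlo step.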