
APPENDIX

Notation and Definitions

Consider a longitudinal study of $n$ subjects, with each subject’s observations assumed to be i.i.d.

Age (in years) is the time scale of interest, indexed by $k$, for $k = 17, 18, 19, \dots, 90$, and $S$ denotes age at the beginning of follow-up. $V$ is a vector of baseline covariates. $A_k$ is a 15-year lagged measure of exposure, i.e., the REC level a participant was exposed to at age $k-15$. $W_k$ is a 15-year lagged binary variable for active employment status, taking the value one if a participant is actively employed at age $k-15$ and zero if not. $L_k$ is a 15-year lagged categorical variable for employment location at age $k-15$, with three levels: underground, surface, and inactive. Note that once $W_k = 0$ given $W_{k-1} = 1$, then $A_{k'} = 0$ for all $k' \geq k$, as participants are considered to be no longer exposed after termination of employment. Likewise, location $L_{k'}$ is considered inactive for all $k' \geq k$. $Y_k$ is an indicator for failure (here, lung cancer death) at age $k$, while $D_k$ is an indicator for death from other causes (all non-lung cancer deaths). $C_k$ is an indicator for censoring due to loss to follow-up at age $k$. The order assumed at each age $k$ is $V_k, W_k, L_k, A_k, C_k, D_k, Y_k$. Histories of time-varying covariates are denoted with overbars, so that exposure history through age $k$ is $\bar{A}_k = \{A_S, A_{S+1}, A_{S+2}, \dots, A_k\}$.
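The 15-year lag in the definitions above can be illustrated with a short sketch. The function names, the dictionary layout (age mapped to exposure), and the choice to treat ages with no exposure record as zero exposure are assumptions made for illustration, not part of the original analysis.

```python
def lagged_exposure(exposure_by_age, age, lag=15):
    """Return the exposure recorded at (age - lag).

    Assumption for this sketch: ages with no exposure record
    (e.g. before hire) are treated as zero exposure.
    """
    return exposure_by_age.get(age - lag, 0.0)


def lagged_history(exposure_by_age, start_age, end_age, lag=15):
    """Lagged exposure history (A_S, ..., A_k) for ages start_age..end_age."""
    return [lagged_exposure(exposure_by_age, k, lag)
            for k in range(start_age, end_age + 1)]


# Hypothetical exposure records at ages 20 and 21.
exposure = {20: 1.5, 21: 2.0}
history = lagged_history(exposure, start_age=35, end_age=37)
```

Here `history` holds the lagged values for ages 35 to 37, i.e. the exposures recorded at ages 20, 21, and 22.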

The parametric g-formula

The cumulative incidence of lung cancer through age $k = 90$ is given below as the sum of the incidences at each age $k = 17, 18, 19, \dots, 90$, expressed as a function of the probability of the outcome and the joint distribution of exposure and covariates (20).

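\[
\begin{aligned}
\sum_{k=17}^{90} \sum_{v} \sum_{\bar{w}_{90}} \sum_{\bar{l}_{90}} \sum_{\bar{\alpha}_{90}} \Bigg\{ & \Pr[Y_k = 1 \mid V = v, \bar{W}_k = \bar{w}_k, \bar{L}_k = \bar{l}_k, \bar{A}_k = \bar{\alpha}_k, \bar{Y}_{k-1} = \bar{D}_k = \bar{C}_k = 0, S \leq k] \\
\times \prod_{j=17}^{k} \Big\{ & \Pr[D_j = 0 \mid V = v, \bar{W}_j = \bar{w}_j, \bar{L}_j = \bar{l}_j, \bar{A}_j = \bar{\alpha}_j, \bar{Y}_{j-1} = \bar{D}_{j-1} = \bar{C}_{j-1} = 0, S \leq j] \\
& \times \Pr[C_j = 0 \mid V = v, \bar{W}_j = \bar{w}_j, \bar{L}_j = \bar{l}_j, \bar{A}_j = \bar{\alpha}_j, \bar{Y}_{j-1} = \bar{D}_{j-1} = \bar{C}_{j-1} = 0, S \leq j] \\
& \times f[A_j = \alpha_j \mid V = v, \bar{W}_j = \bar{w}_j, \bar{L}_j = \bar{l}_j, \bar{A}_{j-1} = \bar{\alpha}_{j-1}, \bar{Y}_{j-1} = \bar{D}_{j-1} = \bar{C}_{j-1} = 0, S \leq j] \\
& \times f[L_j = l_j \mid V = v, \bar{W}_j = \bar{w}_j, \bar{L}_{j-1} = \bar{l}_{j-1}, \bar{A}_{j-1} = \bar{\alpha}_{j-1}, \bar{Y}_{j-1} = \bar{D}_{j-1} = \bar{C}_{j-1} = 0, S \leq j] \\
& \times \Pr[W_j = 1 \mid V = v, \bar{W}_{j-1} = \bar{w}_{j-1}, \bar{L}_{j-1} = \bar{l}_{j-1}, \bar{A}_{j-1} = \bar{\alpha}_{j-1}, \bar{Y}_{j-1} = \bar{D}_{j-1} = \bar{C}_{j-1} = 0, S \leq j] \\
& \times f(V = v) \\
& \times \Pr[Y_{j-1} = 0 \mid V = v, \bar{W}_{j-1} = \bar{w}_{j-1}, \bar{L}_{j-1} = \bar{l}_{j-1}, \bar{A}_{j-1} = \bar{\alpha}_{j-1}, \bar{Y}_{j-2} = \bar{D}_{j-1} = \bar{C}_{j-1} = 0, S \leq j-1] \Big\} \Bigg\}
\end{aligned}
\]
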

Under simulated interventions in which $A_k$ is intervened on (in our case deterministically) and, additionally, no participant is allowed to be lost to follow-up, the above quantity reduces to:

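\[
\begin{aligned}
\sum_{k=17}^{90} \sum_{v} \sum_{\bar{w}_{90}} \sum_{\bar{l}_{90}} \Bigg\{ & \Pr[Y_k = 1 \mid V = v, \bar{W}_k = \bar{w}_k, \bar{L}_k = \bar{l}_k, \bar{A}_k = \bar{\alpha}_k, \bar{Y}_{k-1} = \bar{D}_k = \bar{C}_k = 0, S \leq k] \\
\times \prod_{j=17}^{k} \Big\{ & \Pr[D_j = 0 \mid V = v, \bar{W}_j = \bar{w}_j, \bar{L}_j = \bar{l}_j, \bar{A}_j = \bar{\alpha}_j, \bar{Y}_{j-1} = \bar{D}_{j-1} = \bar{C}_{j-1} = 0, S \leq j] \\
& \times f[L_j = l_j \mid V = v, \bar{W}_j = \bar{w}_j, \bar{L}_{j-1} = \bar{l}_{j-1}, \bar{A}_{j-1} = \bar{\alpha}_{j-1}, \bar{Y}_{j-1} = \bar{D}_{j-1} = \bar{C}_{j-1} = 0, S \leq j] \\
& \times \Pr[W_j = 1 \mid V = v, \bar{W}_{j-1} = \bar{w}_{j-1}, \bar{L}_{j-1} = \bar{l}_{j-1}, \bar{A}_{j-1} = \bar{\alpha}_{j-1}, \bar{Y}_{j-1} = \bar{D}_{j-1} = \bar{C}_{j-1} = 0, S \leq j] \\
& \times f(V = v) \\
& \times \Pr[Y_{j-1} = 0 \mid V = v, \bar{W}_{j-1} = \bar{w}_{j-1}, \bar{L}_{j-1} = \bar{l}_{j-1}, \bar{A}_{j-1} = \bar{\alpha}_{j-1}, \bar{Y}_{j-2} = \bar{D}_{j-1} = \bar{C}_{j-1} = 0, S \leq j-1] \Big\} \Bigg\}
\end{aligned}
\]
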

This quantity is the extension of standardization for time-varying exposures (20, 34-35). The expression above is a sum over all possible covariate and exposure histories, which cannot be computed nonparametrically in high-dimensional data such as the dataset in question. We instead rely on the use of parametric models and a Monte Carlo simulation to approximate the g-formula.

The parametric models give us estimates for the individual probabilities in the above expression.

For each intervention considered, exposure and covariate histories are generated for a Monte Carlo sample of the original data, using the predicted probabilities and densities from the parametric models fitted. The Cumulative Incidence for lung cancer mortality is then estimated in each simulated dataset.


The process is performed in a SAS macro in the following steps:

Parametric models

We fit parametric models to estimate the above probabilities and densities. Baseline covariates $V$ were entered in all models as follows: a cubic polynomial for age with user-defined knots and 3 degrees of freedom, a categorical variable for calendar year with one level per decade, a categorical variable for state, and indicators for male sex and non-white race. This vector function is collectively denoted $g_1(v)$ in the following models.

The parametric models fit were as follows:

a. A linear model for the level of annual daily average REC exposure at age $k$,

\[
E[A_k \mid \bar{W}_k = 1, V = v, \bar{L}_k = \bar{l}_k, \bar{A}_{k-1} = \bar{\alpha}_{k-1}, \bar{Y}_{k-1} = \bar{D}_{k-1} = \bar{C}_{k-1} = 0] = \beta_0 + \beta_1' g_1(v) + \beta_2' g_2(\bar{l}_k) + \beta_3' g_3(\bar{\alpha}_{k-1})
\]

where $\beta_1'$ is a vector of parameter coefficients for all baseline covariates as listed above, and $\beta_2'$ is a vector of parameter coefficients for the terms of a function of location history $g_2(\bar{l}_k)$, specifically categorical variable terms for location at ages $k$ and $k-1$. Finally, $\beta_3'$ is a vector of parameter coefficients for the terms of a function of exposure history $g_3(\bar{\alpha}_{k-1})$, specifically cubic splines with two degrees of freedom each for the 15-year lagged exposure levels at ages $k-1$ and $k-2$.

b. An ordinal logistic model for the probability of each level of the categorical location variable at age $k$,

\[
\operatorname{logit}(\Pr[L_k = l(k) \mid \bar{W}_k = 1, V = v, \bar{L}_{k-1} = \bar{l}_{k-1}, \bar{A}_{k-1} = \bar{\alpha}_{k-1}, \bar{Y}_{k-1} = \bar{D}_{k-1} = \bar{C}_{k-1} = 0]) = \gamma_0 + \gamma_1' g_1(v) + \gamma_2' g_2(\bar{l}_{k-1}) + \gamma_3' g_3(\bar{\alpha}_{k-1})
\]

where $\gamma_1'$ is a vector of parameter coefficients for all baseline covariates, $\gamma_2'$ is a vector of parameter coefficients for the terms of a function of location history $g_2(\bar{l}_{k-1})$, specifically categorical variable terms for location at ages $k-1$ and $k-2$, and $\gamma_3'$ is a vector of parameter coefficients for the terms of exposure history as described in (a). The models for prediction of $L_k$ are nested, in that we first predict the probability that location is underground and then, given that location is not underground, the probability that location is surface. The remaining level is inactive.

c. A logistic model for the probability of active employment at age $k$,

\[
\operatorname{logit}(\Pr[W_k = 1 \mid \bar{W}_{k-1} = 1, V = v, \bar{L}_{k-1} = \bar{l}_{k-1}, \bar{A}_{k-1} = \bar{\alpha}_{k-1}, \bar{Y}_{k-1} = \bar{D}_{k-1} = \bar{C}_{k-1} = 0]) = \delta_0 + \delta_1' g_1(v) + \delta_2' g_2(\bar{l}_{k-1}) + \delta_3' g_3(\bar{\alpha}_{k-1})
\]

where $\delta_1'$ is a vector of parameter coefficients for all baseline covariates, and $\delta_2'$ and $\delta_3'$ are vectors of parameter coefficients for the terms of the respective location and exposure history functions as described in (b).

d. A logistic model for the probability of censoring at age $k$,

\[
\operatorname{logit}(\Pr[C_k = 1 \mid V = v, \bar{A}_k = \bar{\alpha}_k, \bar{L}_k = \bar{l}_k, \bar{W}_k = \bar{w}_k, \bar{Y}_{k-1} = \bar{D}_{k-1} = \bar{C}_{k-1} = 0]) = \zeta_0 + \zeta_1' g_1(v) + \zeta_2' g_2(\bar{l}_k) + \zeta_3' g_3(\bar{\alpha}_k) + \zeta_4' g_4(\bar{w}_k)
\]

where $\zeta_1'$ is a vector of parameter coefficients for all baseline covariates, $\zeta_2'$ is a vector of parameter coefficients for a function of location history $g_2(\bar{l}_k)$ as in (a), $\zeta_3'$ is a vector of parameter coefficients for a function of exposure history $g_3(\bar{\alpha}_k)$, specifically cubic splines with two degrees of freedom each for the 15-year lagged exposure levels at ages $k$ and $k-1$ plus a linear term for 15-year lagged cumulative exposure up to age $k-2$, and $\zeta_4'$ is a vector of parameter coefficients for work status history $g_4(\bar{w}_k)$, specifically indicators for 15-year lagged work status at ages $k$ and $k-1$.

e. A logistic model for the probability of death due to causes other than lung cancer at age $k$,

\[
\operatorname{logit}(\Pr[D_k = 1 \mid V = v, \bar{A}_k = \bar{\alpha}_k, \bar{L}_k = \bar{l}_k, \bar{W}_k = \bar{w}_k, \bar{Y}_{k-1} = \bar{D}_{k-1} = \bar{C}_k = 0]) = \theta_0 + \theta_1' g_1(v) + \theta_2' g_2(\bar{l}_k) + \theta_3' g_3(\bar{\alpha}_k) + \theta_4' g_4(\bar{w}_k),
\]

where $\theta_1'$ is a vector of parameter coefficients for all baseline covariates, and $\theta_2'$, $\theta_3'$, and $\theta_4'$ are vectors of parameter coefficients for the terms of location, exposure, and work status histories respectively, as described in (d).

f. Finally, a logistic model for the probability of lung cancer death at age $k$,

\[
\operatorname{logit}(\Pr[Y_k = 1 \mid V = v, \bar{A}_k = \bar{\alpha}_k, \bar{L}_k = \bar{l}_k, \bar{W}_k = \bar{w}_k, \bar{Y}_{k-1} = \bar{D}_k = \bar{C}_k = 0]) = \psi_0 + \psi_1' g_1(v) + \psi_2' g_2(\bar{l}_k) + \psi_3' g_3(\bar{\alpha}_k) + \psi_4' g_4(\bar{w}_k)
\]

where $\psi_1'$ is a vector of parameter coefficients for all baseline covariates, and $\psi_2'$, $\psi_3'$, and $\psi_4'$ are vectors of parameter coefficients for the terms of location, exposure, and work status histories respectively, as described in (d).
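The nested structure of the location model in (b) can be sketched as follows. This is a minimal illustration, not the SAS macro used in the analysis: the function names and the two linear-predictor arguments are hypothetical, and each linear predictor stands in for the fitted right-hand side of the corresponding logistic model.

```python
import math


def expit(x):
    """Inverse logit: maps a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def location_probabilities(lp_underground, lp_surface):
    """Combine two nested logistic models into three location probabilities.

    lp_underground: linear predictor for Pr[underground]
    lp_surface:     linear predictor for Pr[surface | not underground]
    (both are hypothetical inputs standing in for fitted model output)
    """
    p_under = expit(lp_underground)
    p_surface_given_not_under = expit(lp_surface)
    return {
        "underground": p_under,
        "surface": (1.0 - p_under) * p_surface_given_not_under,
        # The remaining level is inactive.
        "inactive": (1.0 - p_under) * (1.0 - p_surface_given_not_under),
    }


probs = location_probabilities(0.0, 0.0)
```

With both linear predictors equal to zero, each nested model returns a probability of one half, so the three levels receive probabilities 0.5, 0.25, and 0.25, which sum to one by construction.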


Simulation

We next draw a Monte Carlo sample (in this case of equal size to the observed cohort). Age at entry $k = S$, values for baseline covariates $V$, and baseline values for exposure and work status are set as observed. By definition the following is also true: $D_{S-1} = C_{S-1} = Y_{S-1} = 0$.

For all $k > S$, $A_k$ and $W_k$ are predicted using the parameters from the fitted models described above and the simulated values of all covariate and exposure histories up to that age. Values for variables modeled linearly (in this case the exposure term $A_k$) are determined using the model-predicted value plus a random error term, while values for discrete variables (here employment status $W_k$) are determined by comparing the model-predicted probability with a randomly drawn uniform value: a value of one is assigned if the uniform draw is below the predicted probability, and zero otherwise. Once a subject is predicted to no longer be actively employed ($W_k = 0$ given $W_{k-1} = 1$), they are set to be inactive throughout the rest of follow-up, $W_{k'} = 0$ for all $k' > k$, and exposure is set to zero, $A_{k'} = 0$ for all $k' \geq k$.
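The simulation step described above can be sketched in a few lines. This is an illustrative outline only: `predict_p_active` and `predict_mean_exposure` are hypothetical callables standing in for the fitted models (c) and (a), the normal distribution for the error term is an assumption, and the location variable is omitted for brevity.

```python
import random


def draw_binary(prob, rng):
    """Return 1 if a Uniform(0, 1) draw falls below prob, else 0."""
    return 1 if rng.random() < prob else 0


def simulate_subject(start_age, end_age, baseline, predict_p_active,
                     predict_mean_exposure, resid_sd, rng):
    """Simulate (age, W_k, A_k) from start_age through end_age.

    baseline is the observed (W_S, A_S) pair at entry. The two predict_*
    arguments are hypothetical stand-ins for fitted models; each receives
    the current age and the simulated history so far.
    """
    w0, a0 = baseline
    history = [(start_age, w0, a0)]
    active = (w0 == 1)
    for k in range(start_age + 1, end_age + 1):
        # Employment is only drawn while the subject is still active.
        w = draw_binary(predict_p_active(k, history), rng) if active else 0
        if w == 1:
            # Model-predicted mean plus a random error (assumed normal here).
            a = predict_mean_exposure(k, history) + rng.gauss(0.0, resid_sd)
        else:
            # Inactivity is absorbing: no further employment or exposure.
            active = False
            a = 0.0
        history.append((k, w, a))
    return history


rng = random.Random(1)
sim = simulate_subject(
    17, 22, (1, 1.0),
    predict_p_active=lambda k, hist: 0.0 if k == 20 else 1.0,
    predict_mean_exposure=lambda k, hist: 2.0,
    resid_sd=0.0, rng=rng)
```

In this toy call the subject stays employed with exposure 2.0 until age 19, leaves employment at age 20, and is inactive with zero exposure thereafter.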

Interventions

Under an intervention setting an exposure limit L, the values of $A_k$ are intervened on if the predicted value is above L, by replacing the predicted exposure with L. The simulation continues using the intervened values for the prediction of subsequent exposure and covariate values. $C_k$ is set to zero at all times under all interventions, as we assume that no one is lost to follow-up.
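A deterministic exposure-limit intervention of this kind amounts to capping each predicted exposure. The sketch below is illustrative; the function name and the use of `None` to denote the no-intervention (natural course) scenario are assumptions.

```python
def apply_exposure_limit(predicted_exposure, limit=None):
    """Replace a predicted exposure with the limit when it exceeds the limit.

    limit=None (a convention assumed for this sketch) leaves the predicted
    value unchanged, i.e. no intervention.
    """
    if limit is None:
        return predicted_exposure
    return min(predicted_exposure, limit)


# Hypothetical predicted exposures and a hypothetical limit of 25.
predicted = [10.0, 40.0, 25.0]
capped = [apply_exposure_limit(a, limit=25.0) for a in predicted]
```

In a full simulation the capped value, not the original prediction, would be stored in the subject's history so that later predictions depend on the intervened exposure.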

Cumulative Risk estimation

Death due to competing risks $D_k$ and the outcome $Y_k$ are then simulated probabilistically through age $k = 90$, by predicting the respective probabilities at each age $k$, using the models above and the simulated (and intervened on) covariate and exposure histories.

Using the simulated dataset we then estimate a cumulative risk under each specific intervention using the following cumulative incidence estimator (36):

\[
\sum_{k=17}^{90} \Pr[Y_k = 1 \mid \bar{W}_k, \bar{A}_k, V, Y_{k-1} = D_k = C_k = 0] \times \Pr[Y_{k-1} = 0, D_k = 0].
\]

The estimated risk is averaged over the simulated population, summed over all covariate histories and weighted by the frequency of covariate histories, giving the g-formula estimate described above. Risk ratios (RR) are estimated for each intervention by comparing the risk under that intervention to the risk under no intervention.
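The cumulative incidence estimator and the risk ratio can be sketched as follows. This is a simplified illustration that works directly with predicted per-age probabilities for each simulated subject; the input layout (a list of pairs per subject) and the function names are assumptions, and competing death is evaluated before the outcome at each age, following the ordering stated in the notation.

```python
def cumulative_incidence(probs_by_age):
    """Cumulative incidence for one simulated subject.

    probs_by_age: list of (p_other_death, p_lung_death) pairs, one per age,
    in increasing age order (hypothetical input layout).
    """
    event_free = 1.0  # Pr[no lung cancer death and no other death so far]
    risk = 0.0
    for p_d, p_y in probs_by_age:
        at_risk = event_free * (1.0 - p_d)  # Pr[Y_{k-1} = 0, D_k = 0]
        risk += p_y * at_risk               # incidence contributed at age k
        event_free = at_risk * (1.0 - p_y)
    return risk


def mean_risk(population):
    """Average the per-subject cumulative incidence over the simulated sample."""
    return sum(cumulative_incidence(p) for p in population) / len(population)


def risk_ratio(risk_intervention, risk_no_intervention):
    """Risk under an intervention relative to the risk under no intervention."""
    return risk_intervention / risk_no_intervention


one_subject = [(0.1, 0.2), (0.1, 0.2)]
ci = cumulative_incidence(one_subject)
```

For the toy subject above, the first age contributes 0.2 x 0.9 = 0.18 and the second contributes 0.2 x 0.648 = 0.1296, for a cumulative incidence of 0.3096.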

Confidence Intervals

As a measure of stability, we construct 95% confidence intervals (CI) by repeating the entire process in 2000 bootstrap samples (drawn from the observed data with replacement). The 95% CI bounds for the RRs are the 2.5th and 97.5th percentiles of the RR distribution across the 2000 bootstrap samples.
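The percentile bootstrap can be sketched as below. This is an outline under stated assumptions: `estimate_rr` is a hypothetical callable that would rerun the whole procedure (model fitting, simulation, and risk estimation) on a resampled cohort, and the linear-interpolation percentile rule is one common convention, not necessarily the one used in the original analysis.

```python
import random


def percentile(sorted_values, q):
    """Percentile (q in [0, 100]) of a sorted list, by linear interpolation."""
    pos = (len(sorted_values) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] * (1.0 - frac) + sorted_values[hi] * frac


def bootstrap_ci(subjects, estimate_rr, n_boot=2000, seed=0):
    """Percentile bootstrap 95% CI for a risk ratio.

    estimate_rr is a hypothetical function mapping a resampled list of
    subjects to a risk ratio estimate.
    """
    rng = random.Random(seed)
    n = len(subjects)
    estimates = []
    for _ in range(n_boot):
        # Resample subjects with replacement, keeping the cohort size.
        resample = [subjects[rng.randrange(n)] for _ in range(n)]
        estimates.append(estimate_rr(resample))
    estimates.sort()
    return percentile(estimates, 2.5), percentile(estimates, 97.5)


# Toy call with a stand-in estimator (the mean of made-up values).
toy = [0.8, 0.9, 1.0, 1.1, 1.2]
lower, upper = bootstrap_ci(toy, lambda s: sum(s) / len(s), n_boot=200)
```

Resampling whole subjects, rather than person-age records, keeps each subject's history intact within every bootstrap sample.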