Methodology for Generating and Analyzing Cohort Study Data

eAPPENDIX

Generating the raw data. First we specified the total size of a study cohort and the overall prevalence of exposure. We used these parameters to generate the total exposed (E+) and unexposed (E-) counts across disease categories. Then we specified the prevalence of the confounder in the exposed (PC|E+) and in the unexposed (PC|E-).

         Confounder=1      Confounder=0
         E+      E-        E+      E-
Total    N11     N01       N12     N02

E+=exposed, E-=unexposed, D+=diseased, D-=nondiseased

For example, to calculate the marginal cell counts N11, N01, N12 and N02 we specified:

(a) total cohort size (N=1000) such that N11+N01+N12+N02=1000

(b) overall prevalence of exposure (P_E); if P_E =0.5 then N11+N12=0.5*1000

(c) prevalence of the confounder in the exposed (PC|E+); if PC|E+=0.4 then N11/(N11+N12)=0.4

(d) prevalence of the confounder in the unexposed (PC|E-); if PC|E-=0.2 then N01/(N01+N02)=0.2

Using these four formulas, we could solve for the four unknown marginals: in the example above, N11=200, N01=100, N12=300, N02=400. Note that for C to be associated with the exposure (a criterion for a confounder), PC|E+ must differ from PC|E-.
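As a minimal sketch of steps (a) through (d), the following Python function computes the four marginal counts from the specified parameters. The function and argument names are illustrative assumptions and are not taken from the original analysis.

```python
def marginal_counts(n_total, p_exposed, p_conf_exposed, p_conf_unexposed):
    """Return (N11, N01, N12, N02) from the four specified parameters.

    Names are hypothetical; the arithmetic follows formulas (a)-(d).
    """
    n_exposed = p_exposed * n_total            # (b): N11 + N12
    n_unexposed = n_total - n_exposed          # (a): remainder is N01 + N02
    n11 = p_conf_exposed * n_exposed           # (c): confounder=1, exposed
    n12 = n_exposed - n11                      # confounder=0, exposed
    n01 = p_conf_unexposed * n_unexposed       # (d): confounder=1, unexposed
    n02 = n_unexposed - n01                    # confounder=0, unexposed
    return n11, n01, n12, n02


# Worked example from the text: N=1000, P_E=0.5, PC|E+=0.4, PC|E-=0.2
n11, n01, n12, n02 = marginal_counts(1000, 0.5, 0.4, 0.2)
print(round(n11), round(n01), round(n12), round(n02))
```

With the example parameters this reproduces N11=200, N01=100, N12=300, N02=400.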

After obtaining the marginal cell counts above, we generated the internal cell counts (A1, B1, C1, D1, A2, B2, C2, D2) by setting the risk ratios in each stratum (assumed the same across strata, i.e., no effect modification), the overall disease prevalence, and the relationship between confounder and disease in the exposed (note: because of how we generated the data, setting the confounder-disease RR in the exposed [RRC|E+] also set the confounder-disease RR in the unexposed [RRC|E-] to the same value).

Confounder=1        Confounder=0
     E+   E-             E+   E-
D+   A1   B1        D+   A2   B2
D-   C1   D1        D-   C2   D2

For example, a specified risk ratio of 2.0 for the exposure-disease relationship, a specified disease prevalence of 0.2, and a specified risk ratio of 3.0 for the association between the confounder and disease in the exposed yielded the following four formulas, which could be used to solve for the internal cell counts:

(e) set RR1=RR2=2.0, such that $\frac{A1/(A1+C1)}{B1/(B1+D1)} = 2.0$ and $\frac{A2/(A2+C2)}{B2/(B2+D2)} = 2.0$

(f) set disease prevalence =0.2 (total population =1000) such that: A1+B1+A2+B2= 0.2*1000

(g) set RRC|E+ =3.0, such that $\frac{A1/(A1+C1)}{A2/(A2+C2)} = 3.0$

Solving the equations in (e), (f), and (g) yields values of A1, B1, A2, and B2. Finally, the C1, D1, C2, and D2 cells were calculated by subtracting A1, B1, A2, and B2 from the marginals calculated previously.
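One way to carry out this solve can be sketched in Python. The closed form below is an assumption about how to solve the system, not necessarily the procedure the authors used: because the exposure-disease RR is common to both strata and the confounder-disease RR is the same in exposed and unexposed, every stratum risk is a multiple of a single baseline risk r (the risk in the unexposed with confounder=0), and the prevalence constraint in (f) fixes r. All function and variable names are hypothetical.

```python
def internal_cells(n11, n01, n12, n02, rr, prevalence, rr_conf):
    """Solve for the internal cells given marginals and the three settings.

    rr         : exposure-disease risk ratio in each stratum, as in (e)
    prevalence : overall disease prevalence, as in (f)
    rr_conf    : confounder-disease risk ratio in the exposed, as in (g)
    """
    n_total = n11 + n01 + n12 + n02
    # Expected cases per unit of baseline risk r, summed over the four groups.
    weight = n02 + rr * n12 + rr_conf * n01 + rr * rr_conf * n11
    r = prevalence * n_total / weight      # baseline risk: unexposed, confounder=0

    a1 = rr * rr_conf * r * n11            # exposed, confounder=1
    b1 = rr_conf * r * n01                 # unexposed, confounder=1
    a2 = rr * r * n12                      # exposed, confounder=0
    b2 = r * n02                           # unexposed, confounder=0
    # Nondiseased cells follow by subtraction from the marginals.
    return {"A1": a1, "B1": b1, "C1": n11 - a1, "D1": n01 - b1,
            "A2": a2, "B2": b2, "C2": n12 - a2, "D2": n02 - b2}


cells = internal_cells(200, 100, 300, 400, rr=2.0, prevalence=0.2, rr_conf=3.0)
print({name: round(value, 1) for name, value in cells.items()})
```

For the example settings the baseline risk is 0.08, which gives A1=96, B1=24, A2=48, and B2=32.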

Building on the example used in (a), (b), (c), and (d) above and solving the equations in (e), (f), and (g) yields the following 2x2 tables:

Confounder=1              Confounder=0
        E+    E-                  E+    E-
D+      96    24          D+      48    32
D-      104   76          D-      252   368
Total   200   100         Total   300   400

Generating the raw data for two-confounder scenarios. To generate the two-confounder scenarios we began with the two 2x2 tables generated above for Confounder 1. The goal is to split these two tables into four tables stratified by both Confounder 1 (C1) and Confounder 2 (C2).

C1=0, C2=0       C1=1, C2=0       C1=0, C2=1       C1=1, C2=1
     E+   E-          E+   E-          E+   E-          E+   E-
D+   a1   b1     D+   a2   b2     D+   a3   b3     D+   a4   b4
D-   c1   d1     D-   c2   d2     D-   c3   d3     D-   c4   d4

I. Calculate the d cells. From the derived data above in the 1-confounder (C1) scenario we know d1+d2+d3+d4=444. We first specify the prevalence of C2 among the disease-free and unexposed (P_{C2|E-, D-}). We set P_{C2|E-, D-} equal to 0.2, i.e., $\frac{d3+d4}{d1+d2+d3+d4} = 0.2$. Using the data above, we know that P_{C1|E-, D-} is 0.17 (from 76/444), and we also know the total across all d cells (d1+d2+d3+d4=444).

Using this information the four d cell counts can be calculated:

1. Set the marginal proportions of disease-free, unexposed (C1 proportions are known from above)

          C1=0    C1=1
   C2=0   d1      d2      0.8
   C2=1   d3      d4      0.2
          0.83    0.17

2. Calculate the proportion in each cell by multiplying the marginals

          C1=0    C1=1
   C2=0   0.663   0.137
   C2=1   0.166   0.034

3. Calculate the number in each cell using the known total across strata, in this example where d1+d2+d3+d4=444

          C1=0    C1=1
   C2=0   294.4   60.8
   C2=1   73.6    15.2

II. Calculate the b cells. To calculate the b cells we (1) set the prevalence of C2 in the overall unexposed and (2) set the joint distribution of C1 and C2 in the diseased unexposed by specifying the RR for C2 in C1=1 vs. C1=0. For example, setting the prevalence of C2 in the unexposed equal to 0.25 and setting the RR for C2 (in C1=1 vs. C1=0) equal to 0.8 yields the following two equations:

1. $\frac{b3+b4+d3+d4}{b1+b2+b3+b4+d1+d2+d3+d4} = 0.25$

2. $\frac{b4/(b2+b4)}{b3/(b1+b3)} = 0.8$

Based on the data generated in the 1-confounder example, we also know the following:

3. b1+b3=32

4. b2+b4=24

In this example, solving these four equations (using the known d cell counts) yields:

b1=9.4 b2=10.4 b3=22.6 b4=13.6
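Steps I and II can be sketched together in Python. This is an illustration under stated assumptions rather than the authors' code: the function names are hypothetical, the unexposed total of 500 comes from the one-confounder example (N01+N02), and the C2 risk ratio is read as the proportion with C2=1 among the diseased unexposed in C1=1 divided by the same proportion in C1=0.

```python
def d_cells(d_c1_0, d_c1_1, p_c2):
    """Split unexposed, disease-free counts across the C1 x C2 strata.

    Multiplies the C1 and C2 marginal proportions, then scales by the
    known total. Returns (d1, d2, d3, d4).
    """
    total = d_c1_0 + d_c1_1
    p_c1 = d_c1_1 / total                    # 76/444 in the example
    d1 = (1 - p_c1) * (1 - p_c2) * total     # C1=0, C2=0
    d2 = p_c1 * (1 - p_c2) * total           # C1=1, C2=0
    d3 = (1 - p_c1) * p_c2 * total           # C1=0, C2=1
    d4 = p_c1 * p_c2 * total                 # C1=1, C2=1
    return d1, d2, d3, d4


def b_cells(b_c1_0, b_c1_1, d, n_unexposed, p_c2_unexposed, rr_c2):
    """Solve the four b-cell equations. Returns (b1, b2, b3, b4)."""
    d1, d2, d3, d4 = d
    # Equation 1: everyone with C2=1 among the unexposed, minus the d cells.
    b_c2_1 = p_c2_unexposed * n_unexposed - (d3 + d4)    # b3 + b4
    # Equation 2 with b1+b3 and b2+b4 substituted: b4 = k * b3.
    k = rr_c2 * b_c1_1 / b_c1_0
    b3 = b_c2_1 / (1 + k)
    b4 = k * b3
    # Equations 3 and 4 give the remaining cells by subtraction.
    return b_c1_0 - b3, b_c1_1 - b4, b3, b4


d = d_cells(d_c1_0=368, d_c1_1=76, p_c2=0.2)
b = b_cells(b_c1_0=32, b_c1_1=24, d=d, n_unexposed=500,
            p_c2_unexposed=0.25, rr_c2=0.8)
print([round(x, 1) for x in d])
print([round(x, 1) for x in b])
```

Rounded to one decimal place, the results agree with the d cell counts (294.4, 60.8, 73.6, 15.2) and the b cell counts (9.4, 10.4, 22.6, 13.6) reported for this example.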

III. Calculate the a and c cells. Finally we solve for the a cells and c cells by setting the exposure-disease RR in all four strata. The specified exposure-disease RR is constrained to be the same across the four strata (i.e., no effect modification) and is different from the exposure-disease RR set above for the 1-confounder scenario (because the earlier specified exposure-disease RR is still confounded by C2). In this example, we set the exposure-disease RR=1.5 such that:

1. $\frac{a1/(a1+c1)}{b1/(b1+d1)} = 1.5$

2. $\frac{a2/(a2+c2)}{b2/(b2+d2)} = 1.5$

3. $\frac{a3/(a3+c3)}{b3/(b3+d3)} = 1.5$

4. $\frac{a4/(a4+c4)}{b4/(b4+d4)} = 1.5$

Based on the data generated in the 1-confounder example, we also know the following:

5. a1+a3=48

6. c1+c3=252

7. a2+a4=96

8. c2+c4=104

Solving these 8 equations yields:

a1=8.7 a2=20.5 a3=39.3 a4=75.5 c1=179.9 c2=72.8 c3=72.1 c4=31.2
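The eight equations separate into two independent pairs of strata (strata 1 and 3 share C1=0; strata 2 and 4 share C1=1), so each pair reduces to a small linear system. The Python sketch below illustrates one way to solve it. The function name is hypothetical, and the b inputs are assumed to be the unrounded solutions of the b-cell equations (9.375, 10.425, 22.625, 13.575), which round to the values reported above.

```python
def solve_exposed_pair(b_x, d_x, b_y, d_y, a_sum, c_sum, rr):
    """Solve the a and c cells for two strata with known a and c totals.

    Each stratum's exposed risk is rr times its unexposed risk.
    Returns (a_x, c_x, a_y, c_y). Assumes the two exposed risks differ;
    otherwise the split between the strata is not identified.
    """
    e_x = rr * b_x / (b_x + d_x)       # exposed risk in stratum x
    e_y = rr * b_y / (b_y + d_y)       # exposed risk in stratum y
    n_sum = a_sum + c_sum              # exposed people in both strata
    # Solve e_x*n_x + e_y*n_y = a_sum with n_x + n_y = n_sum.
    n_x = (a_sum - e_y * n_sum) / (e_x - e_y)
    n_y = n_sum - n_x
    a_x, a_y = e_x * n_x, e_y * n_y
    return a_x, n_x - a_x, a_y, n_y - a_y


# Strata 1 and 3 (C1=0): a1+a3=48, c1+c3=252
a1, c1, a3, c3 = solve_exposed_pair(9.375, 294.4, 22.625, 73.6, 48, 252, 1.5)
# Strata 2 and 4 (C1=1): a2+a4=96, c2+c4=104
a2, c2, a4, c4 = solve_exposed_pair(10.425, 60.8, 13.575, 15.2, 96, 104, 1.5)
print([round(x, 1) for x in (a1, a2, a3, a4, c1, c2, c3, c4)])
```

Rounded to one decimal place, the output matches the a and c cell counts listed above.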

Thus, the cell counts for this two-confounder example scenario are:

C1=0, C2=0           C1=1, C2=0           C1=0, C2=1           C1=1, C2=1
     E+     E-            E+     E-            E+     E-            E+     E-
D+   8.7    9.4      D+   20.5   10.4     D+   39.3   22.6     D+   75.5   13.6
D-   179.9  294.4    D-   72.8   60.8     D-   72.1   73.6     D-   31.2   15.2

Confounding and absolute AF bias. The figure presented below shows the confounding RR (crude RR/adjusted RR) plotted against the AF bias expressed as the absolute difference between the biased AF and the unbiased AF rather than as a ratio measure. Analogous to Figure 2A, for a given prevalence of exposure, as the confounding RR moves further from 1 (i.e., greater confounding), the AF was increasingly overestimated. Unlike Figure 2A, bias did not systematically change with the prevalence of exposure. That is, for a given confounding RR, a lower prevalence of exposure did not necessarily lead to greater AF bias. Likewise, for confounding RRs greater than 1, the AF was increasingly underestimated as the confounding RR increased, but there was no consistent pattern by prevalence of exposure. Results were similar for assessment of different exposure-disease RRs: the overall relationship between confounding RR and AF bias was consistent with the relative AF bias analyses, but for a given confounding RR, a weaker exposure-disease RR did not consistently lead to greater AF bias.

Appendix Figure 1. Confounding RR and attributable fraction bias expressed as an absolute difference (biased AF - unbiased AF) by prevalence of exposure (exposure-disease RR=2.0).
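To make the confounding RR (crude RR/adjusted RR) concrete, the following Python sketch computes it for the one-confounder example tables. It is an illustration only. The helper name is hypothetical, and the adjusted RR is assumed to equal the common stratum-specific RR, which holds here because the strata were generated with the same RR; this section does not state which adjustment method was used.

```python
def risk_ratio(a, b, c, d):
    """Risk ratio from a 2x2 table: a, c = exposed D+, D-; b, d = unexposed D+, D-."""
    return (a / (a + c)) / (b / (b + d))


# (A, B, C, D) for Confounder=1 and Confounder=0 in the one-confounder example
strata = [(96, 24, 104, 76), (48, 32, 252, 368)]

# Crude RR: collapse the strata cell by cell, ignoring the confounder.
crude_rr = risk_ratio(*[sum(cells) for cells in zip(*strata)])

# Stratum-specific RRs are equal by construction; use the shared value
# as the adjusted RR (an assumption, see the note above).
stratum_rrs = [risk_ratio(*table) for table in strata]
adjusted_rr = stratum_rrs[0]

confounding_rr = crude_rr / adjusted_rr
print(round(crude_rr, 3), round(adjusted_rr, 3), round(confounding_rr, 3))
```

In this example the collapsed table has 144 cases among 500 exposed and 56 cases among 500 unexposed, so the crude RR is about 2.57 against a stratum-specific RR of 2.0, for a confounding RR of about 1.29.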