SUPPLEMENTAL DIGITAL MATERIAL

Investigations of Gene-Disease Associations: Costs and Benefits of Environmental Data

Hao Luo, Igor Burstyn, Paul Gustafson

Corresponding author: Paul Gustafson, [email protected]

eAppendix A: BREAK-EVEN COST

The following calculation is based on the power calculation in Appendix B of the paper by Williamson et al.¹ Let denote the distribution function for the noncentral chi-squared distribution with q degrees of freedom and noncentrality parameter k. Let denote the solutions for equation , where s is the pre-specified significance level. Then the sample sizes needed for the two data types are

where

Then the break-even cost is just the ratio of sample size for (Y,G) data to sample size for (Y,G,X) data.

In particular, the break-even cost in the case of qualitative interaction (β2=0) takes the form:

Note that the first term is a function only of the significance level and the desired power. It increases with the desired power and decreases as the significance level increases. Once the desired type I error rate and power are specified, it is a constant.
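The role of the significance level and the power can be illustrated numerically. The sketch below solves for the noncentrality parameter that a q d.f. chi-squared test needs in order to reach a target power at significance level s, using only the Python standard library. The function names are hypothetical. The sample-size form n = k/δ, with δ a per-subject noncentrality parameter, is an assumption made for illustration and is not the paper's expression; so is the pairing of a 1 d.f. test with (Y,G) data and a 2 d.f. test with (Y,G,X) data.

```python
import math

def chi2_sf(x, df):
    """P(central chi-squared with integer df > x), built up two d.f. at a time."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        tail, nu = math.exp(-half), 2             # closed form for 2 d.f.
    else:
        tail, nu = math.erfc(math.sqrt(half)), 1  # closed form for 1 d.f.
    while nu < df:
        # Q(nu + 2) = Q(nu) + (x/2)^(nu/2) exp(-x/2) / Gamma(nu/2 + 1)
        tail += math.exp((nu / 2.0) * math.log(half) - half - math.lgamma(nu / 2.0 + 1.0))
        nu += 2
    return tail

def ncx2_sf(x, df, ncp, terms=150):
    """P(noncentral chi-squared(df, ncp) > x) as a Poisson(ncp/2) mixture; needs x > 0."""
    half, lam = x / 2.0, ncp / 2.0
    tail, weight, total, nu = chi2_sf(x, df), math.exp(-lam), 0.0, df
    for j in range(terms):
        total += weight * tail
        # Move the central tail from nu to nu + 2 degrees of freedom.
        tail += math.exp((nu / 2.0) * math.log(half) - half - math.lgamma(nu / 2.0 + 1.0))
        nu += 2
        weight *= lam / (j + 1)                   # next Poisson weight
    return total

def chi2_critical(df, s):
    """Critical value with upper-tail probability s, found by bisection."""
    lo, hi = 0.0, 200.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if chi2_sf(mid, df) > s:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def required_ncp(df, s=0.05, power=0.80):
    """Noncentrality parameter at which the df-d.f. test attains the given power."""
    crit = chi2_critical(df, s)
    lo, hi = 0.0, 60.0
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if ncx2_sf(crit, df, mid) < power:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def break_even_cost(delta_yg, delta_ygx, s=0.05, power=0.80):
    """Ratio of sample sizes, ASSUMING n = k / delta for each design (illustrative)."""
    n_yg = required_ncp(1, s, power) / delta_yg
    n_ygx = required_ncp(2, s, power) / delta_ygx
    return n_yg / n_ygx
```

For s = 0.05 and power 0.80, `required_ncp(1)` is approximately 7.85 and `required_ncp(2)` approximately 9.63. Their ratio depends only on s and the power, which is the sense in which a term of this kind is fixed once the type I error rate and power are chosen.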

eAppendix B: MIXED DATA TYPE

Suppose our sample consists of (Y,G,X) measurements on a proportion w of the subjects and (Y,G) measurements only on the remaining subjects. Then the budget used to obtain (Y,G) data for m subjects can also be used to obtain such mixed data for m/(cw+1-w) subjects. We apply model (1) for mixed data, noting that this model becomes non-identifiable in the w=0 limit. The information matrix for the parameters β0, β1, β2 and β3 takes the form

where I1 and I2 correspond to (Y,G,X) data and (Y,G) data respectively. In particular,

,

and

,

where A0, A1, and B0 to B3 are expressions from eAppendix A. Following the power calculation given in Williamson et al.,¹ we notice that maximizing power is equivalent to maximizing the noncentrality parameter of a noncentral chi-squared distribution. The noncentrality parameter, which is a function of w, is given by ncp(w) = θ^T C^T {C I^-1 C^T}^-1 C θ, where θ = (β2, β3)^T and C = (0_{2×2}, diag{1,1})^T.
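As a rough illustration of how ncp(w) could be evaluated, the sketch below computes the Wald noncentrality parameter for the last two of four parameters using the Schur complement I_θθ − I_θη I_ηη^-1 I_ηθ. This is algebraically the same as inverting the lower-right 2×2 block of I^-1. Several things here are assumptions rather than the paper's expressions: the per-subject information is taken to be the mixture w·I1 + (1−w)·I2, the number of subjects is m/(cw+1−w) as in the budget argument above, and all function names and any matrices passed in are hypothetical.

```python
def affordable_subjects(m, c, w):
    """Subjects obtainable with the budget that buys (Y,G) data on m subjects."""
    return m / (c * w + 1.0 - w)

def mixed_information(w, info_full, info_reduced):
    """ASSUMED per-subject information: w * I1 + (1 - w) * I2 (4x4 nested lists)."""
    return [[w * f + (1.0 - w) * r for f, r in zip(row_f, row_r)]
            for row_f, row_r in zip(info_full, info_reduced)]

def _inv2(mat):
    """Inverse of a 2x2 matrix; raises ZeroDivisionError if it is singular."""
    (a, b), (c, d) = mat
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def _mul2(a, b):
    """Product of two 2x2 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def wald_ncp(info, theta, n=1.0):
    """n * theta^T [I_tt - I_tn I_nn^-1 I_nt] theta for the last two parameters."""
    i_nn = [row[:2] for row in info[:2]]   # nuisance block (first two parameters)
    i_nt = [row[2:] for row in info[:2]]
    i_tn = [row[:2] for row in info[2:]]
    i_tt = [row[2:] for row in info[2:]]   # block for the tested parameters
    correction = _mul2(_mul2(i_tn, _inv2(i_nn)), i_nt)
    schur = [[i_tt[i][j] - correction[i][j] for j in range(2)] for i in range(2)]
    quad = sum(theta[i] * schur[i][j] * theta[j]
               for i in range(2) for j in range(2))
    return n * quad

def ncp_of_w(w, m, c, info_full, info_reduced, theta):
    """Noncentrality parameter of the mixed design at a fixed budget."""
    info = mixed_information(w, info_full, info_reduced)
    return wald_ncp(info, theta, n=affordable_subjects(m, c, w))
```

Evaluating `ncp_of_w` on a grid of w values in (0, 1] is one way to look for an interior maximum. At w = 0 the nuisance block built from (Y,G) data alone may be singular, in which case the sketch raises an error, mirroring the non-identifiability noted above.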

We revisit the parameter values of the factorial experiment as shown in Table 1. Necessarily, the power of the mixed-data test always goes to zero as w decreases to zero, since the components of β cannot be estimated in this limiting case. For some parameter settings, the power increases monotonically as w increases from zero to one. In other settings, the power reaches its maximum at a value of w between zero and one. However, we only ever observe the latter behavior when the 1 d.f. test based on (Y,G) data alone has higher power than the 2 d.f. test achieves at the optimal value of w. Put another way, if we use the break-even cost, then we always see the power of the mixed-data design increase monotonically with w. In all cases, then, either the exclusively (Y,G) design or the exclusively (Y,G,X) design provides the highest power. Therefore, the mixed data type is not considered further.

eAppendix C: MISCLASSIFICATION

When we treat Xo as if it were X, model (1) implies that we are actually working under a true relationship of the form

Thus, we need a new set of parameters (πG*, πX*, β0*, β1*, β2*, β3*) to capture the distribution of (Y,G,Xo). This new set of parameters can be derived from the original parameters by solving:

Then, we can plug these adjusted parameters into the power calculation shown in eAppendix A, so that we can obtain the power for misclassified data.

As a particular case of interest, say that β2=0 but β3≠0, i.e., a qualitative interaction holds. Also, consider the worst case of a 'useless' exposure classification having sensitivity equal to 1 – specificity. According to the equations above, this will induce zero values for the adjusted exposure and interaction coefficients but a nonzero adjusted gene coefficient, hence the (Y,Xo,G) data still have some power to detect the gene effect. Conversely, Xo and G will be conditionally independent given Y=1, which will render the case-only design completely powerless. This gives a sense in which this design is an order of magnitude more susceptible to exposure misclassification than the 'full data' design.
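The 'useless' classification case can be checked numerically. The sketch below is not the paper's derivation. It assumes that model (1) is a logistic regression of Y on binary G, binary X, and their product; that G and X are independent in the population; and that misclassification of X is nondifferential. Coefficients are named by role rather than by index, and all function names are hypothetical. Because the logistic model in (G, Xo) is saturated for two binary covariates, the induced coefficients follow directly from the four cell risks.

```python
import math

def expit(t):
    return 1.0 / (1.0 + math.exp(-t))

def logit(p):
    return math.log(p / (1.0 - p))

def induced_parameters(b_int, b_x, b_g, b_gx, pi_x, sens, spec):
    """Coefficients of the logistic model for Y given (G, Xo), by role."""
    # Prevalence of apparent exposure Xo = 1.
    p_xo = sens * pi_x + (1.0 - spec) * (1.0 - pi_x)
    # P(X = 1 | Xo = xo) by Bayes' rule; G drops out because G and X are
    # assumed independent and misclassification is assumed nondifferential.
    p_x_given_xo = {
        1: sens * pi_x / p_xo,
        0: (1.0 - sens) * pi_x / (1.0 - p_xo),
    }

    def risk(g, xo):
        # P(Y = 1 | G = g, Xo = xo): average the true model over X given Xo.
        q = p_x_given_xo[xo]
        exposed = expit(b_int + b_x + (b_g + b_gx) * g)
        unexposed = expit(b_int + b_g * g)
        return q * exposed + (1.0 - q) * unexposed

    l00, l01 = logit(risk(0, 0)), logit(risk(0, 1))
    l10, l11 = logit(risk(1, 0)), logit(risk(1, 1))
    return {
        "pi_xo": p_xo,
        "intercept": l00,
        "exposure": l01 - l00,
        "gene": l10 - l00,
        "interaction": l11 - l10 - l01 + l00,
    }

# Illustrative values only: no gene main effect, nonzero interaction,
# and sensitivity equal to 1 - specificity.
useless = induced_parameters(-2.0, 0.5, 0.0, 1.0, pi_x=0.3, sens=0.7, spec=0.3)
```

With these illustrative inputs the induced exposure and interaction coefficients vanish up to rounding error while the induced gene coefficient stays positive. Setting sens = spec = 1 recovers the original coefficients.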