Bayesian local false discovery rate for sparse count data with application to the discovery of hotspots in protein domains

Item Type Article

Authors Gauran, Iris Ivy M.; Park, Junyong; Rattsev, Ilia; Peterson, Thomas A.; Kann, Maricel G.; Park, DoHwan

Citation Gauran, I. I. M., Park, J., Rattsev, I., Peterson, T. A., Kann, M. G., & Park, D. (2022). Bayesian local false discovery rate for sparse count data with application to the discovery of hotspots in protein domains. The Annals of Applied Statistics, 16(3). https://doi.org/10.1214/21-aoas1551

Eprint version Publisher's Version/PDF

DOI 10.1214/21-aoas1551

Publisher Institute of Mathematical Statistics

Journal The Annals of Applied Statistics

Rights Archived with thanks to The Annals of Applied Statistics

Download date 2024-01-09 21:59:12

Link to Item http://hdl.handle.net/10754/679873

Supplementary Material to “Bayesian Local False Discovery Rate for sparse count data with application to the discovery of hotspots in protein domains”

APPENDIX A: POSTERIOR DISTRIBUTIONS IN SECTION 3.4

In this section, we derive the posterior distributions of $\phi_0 = (\eta, \lambda, \theta)$, $\phi_1$, $\pi_0$, $C$, $\tau$ and $\mathbf{z}_N$, which are mentioned in Section 3.4. Estimating the local FDR by Gibbs sampling requires the full conditional distributions to be specified and sampled from. To implement the Gibbs sampler, an ordering of the $(\ell_0 + \ell_1 + 2)$ parameters is necessary. We first specify the conditional posterior distributions of $C$ and $\tau$, and then discuss the conditional posterior distributions of $\pi_0$, $\phi_0$ and $\phi_1$.

When $f_0$ is ZIGP, the parameters are $\phi = (\phi_0, \phi_1, \pi_0, C)$ and the full conditional distributions are specified only up to a constant of proportionality. The posterior distribution of $\phi$ can be expressed in terms of the full likelihood function as

\[
f(\phi \mid \mathbf{x}_N, \mathbf{z}_N) \propto L(\phi \mid \mathbf{x}_N, \mathbf{z}_N)\, g(\phi), \tag{A.1}
\]

where $g(\phi) = g(C \mid \tau)\, g(\tau)\, g(\lambda)\, g(\eta)\, g(\theta)\, g(\pi_0)\, g(\phi_1)$.

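A.1. Conditional Posterior Distribution of $C$ and $\tau$. From the prior specification in (3.5) and (3.6), the conditional posterior density of $C$ given all the other parameters is

\[
\begin{aligned}
P(C = \ell \mid \mathbf{x}_N, \mathbf{z}_N, \phi_0, \phi_1, \pi_0, \tau)
&\propto L(\phi \mid \mathbf{x}_N, \mathbf{z}_N)\, g(C = \ell \mid \tau)\, g(\tau) \\
&= \frac{\prod_{j \le \ell} \{\pi_0 f_0(j \mid \phi_0)\}^{n_j} \prod_{j \ge \ell+1} f(j \mid \phi_0, \phi_1, \pi_0)^{n_j}\, P(\ell \mid \tau)\, g(\tau)}
        {\sum_{\ell'=0}^{\infty} \prod_{j \le \ell'} \{\pi_0 f_0(j \mid \phi_0)\}^{n_j} \prod_{j \ge \ell'+1} f(j \mid \phi_0, \phi_1, \pi_0)^{n_j}\, P(\ell' \mid \tau)\, g(\tau)},
\end{aligned}
\]

where $P(\ell \mid \tau) = e^{-\tau} \tau^{\ell} / \ell!$ and $g(\tau) \equiv G(\tau \mid \kappa_\tau, \vartheta_\tau)$ is the density function of the Gamma distribution with parameters $\kappa_\tau$ and $\vartheta_\tau$. In practice, we truncate the sum in the denominator, which gives an approximate distribution that is multinomial:

\[
C \mid \mathbf{x}_N, \mathbf{z}_N, \phi_0, \phi_1, \pi_0, \tau \approx \mathrm{Multinomial}(1, \mathbf{q}), \tag{A.2}
\]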

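where $\mathbf{q} = (q_0, q_1, \ldots, q_K)$ and $q_\ell$ is defined as

\[
q_\ell = \frac{\prod_{j \le \ell} \{\pi_0 f_0(j \mid \phi_0)\}^{n_j} \prod_{j \ge \ell+1} f(j \mid \phi_0, \phi_1, \pi_0)^{n_j}\, P(\ell \mid \tau)\, g(\tau)}
              {\sum_{\ell' \le K} \prod_{j \le \ell'} \{\pi_0 f_0(j \mid \phi_0)\}^{n_j} \prod_{j \ge \ell'+1} f(j \mid \phi_0, \phi_1, \pi_0)^{n_j}\, P(\ell' \mid \tau)\, g(\tau)} \tag{A.3}
\]

for $\ell = 0, 1, \ldots, K$, with $\sum_{\ell=0}^{K} q_\ell = 1$.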

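Moreover, the conditional posterior density of $\tau$ depends only on $C$, that is, $f(\tau \mid C) \propto g(C \mid \tau)\, g(\tau)$, where $g(\tau) \equiv G(\tau \mid \kappa_\tau, \vartheta_\tau)$ is the conjugate prior. After the necessary calculations, the conditional posterior distribution of $\tau$ given $C$ is

\[
f(\tau \mid \mathbf{x}_N, \mathbf{z}_N, \phi_0, \phi_1, \pi_0, C) \propto G(\tau \mid C + \kappa_\tau, \vartheta_\tau + 1), \tag{A.4}
\]

where $G(\cdot \mid a, b)$ is the density of the Gamma distribution with parameters $a$ and $b$.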
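The two updates in (A.2) to (A.4) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names and the inputs `log_null` and `log_mix` (the logs of $\pi_0 f_0(j \mid \phi_0)$ and $f(j \mid \phi_0, \phi_1, \pi_0)$) are hypothetical, and $\vartheta_\tau$ is assumed to be a rate parameter, which is what the conjugate update $\vartheta_\tau + 1$ suggests.

```python
import math
import random


def truncated_c_weights(n, log_null, log_mix, tau):
    """Return q = (q_0, ..., q_K) as in (A.3).

    n[j]        : number of positions with count j, for j = 0..K
    log_null[j] : log(pi0 * f0(j | phi0))             (hypothetical input)
    log_mix[j]  : log f(j | phi0, phi1, pi0)          (hypothetical input)
    The factor g(tau) does not depend on l, so it cancels on normalising.
    """
    K = len(n) - 1
    head = 0.0  # running sum over j <= l of n_j * log_null[j]
    tail = sum(n[j] * log_mix[j] for j in range(K + 1) if n[j] > 0)
    log_q = []
    for l in range(K + 1):
        if n[l] > 0:
            head += n[l] * log_null[l]   # count l moves into the null product
            tail -= n[l] * log_mix[l]    # and leaves the mixture product
        log_pois = -tau + l * math.log(tau) - math.lgamma(l + 1)
        log_q.append(head + tail + log_pois)
    top = max(log_q)                      # subtract the max for stability
    weights = [math.exp(v - top) for v in log_q]
    total = sum(weights)
    return [w / total for w in weights]


def draw_c_and_tau(n, log_null, log_mix, tau, kappa, vartheta, rng=random):
    """One draw of C from (A.2) followed by one draw of tau from (A.4)."""
    q = truncated_c_weights(n, log_null, log_mix, tau)
    c = rng.choices(range(len(q)), weights=q, k=1)[0]
    # Assumption: Gamma(shape = C + kappa, rate = vartheta + 1); the standard
    # library parameterises by scale, so scale = 1 / rate.
    new_tau = rng.gammavariate(c + kappa, 1.0 / (vartheta + 1.0))
    return c, new_tau, q
```

Working on the log scale avoids underflow when the counts $n_j$ are large, since the products in (A.3) can be extremely small.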

A.2. Conditional Posterior Distribution of $\mathbf{z}_N$ and $\pi_0$. Before we can specify the conditional posterior distribution of $\pi_0$, we need to look into the vector of latent variables, because the full likelihood includes the terms

\[
\pi_0^{\sum_{i=1}^{N} z_i}\, (1 - \pi_0)^{N - \sum_{i=1}^{N} z_i}.
\]

To specify the posterior distribution of $\mathbf{z}_N$, we take into account the zero assumption that $x_i$ is generated from $f_0$ when $x_i \le C$ for a given $C$. From the zero assumption, we have $z_i = 1$ with probability 1 when $x_i \le C$. Otherwise, we have $z_i \mid \phi_0, \phi_1, \pi_0, C \sim \mathrm{Bernoulli}(p_i)$, where

\[
p_i \equiv P(z_i = 1 \mid \phi_0, \phi_1, \pi_0, C) = \frac{\pi_0 f_0(x_i \mid \phi_0)}{f(x_i \mid \phi_0, \phi_1, \pi_0)}.
\]

From the key assumption that $f(x_i) = \pi_0 f_0(x_i \mid \phi_0)$ when $x_i \le C$, the value of $p_i$ indeed reduces to 1 for $x_i \le C$. Hence, we specify the conditional posterior distribution of $z_i$ as

\[
z_i \mid \mathbf{x}_N, \phi_0, \phi_1, \pi_0, C \sim \mathrm{Bernoulli}(p_i), \tag{A.5}
\]

where

\[
p_i = \max\left\{ I(x_i \le C),\ \frac{\pi_0 f_0(x_i \mid \phi_0)}{f(x_i \mid \phi_0, \phi_1, \pi_0)} \right\}
\]

for any $i = 1, 2, \ldots, N$, and $I(\cdot)$ is an indicator function. When a given $\mathbf{z}_N = (z_1, z_2, \ldots, z_N)$ is available, we can compute the numbers of samples from $f_0$ and $f_1$, denoted $N_0$ and $N_1$ respectively, as follows:

\[
N_0 = \sum_{j \le K} n_{0j} = \sum_{j \le K} \sum_{i \ge 1} z_i\, I(x_i = j), \tag{A.6}
\]

\[
N_1 = \sum_{j \le K} n_{1j} = \sum_{j \le K} \sum_{i \ge 1} (1 - z_i)\, I(x_i = j). \tag{A.7}
\]

Using (A.6) and (A.7), we can specify the posterior distribution of $\pi_0$ given the rest of the parameters as

\[
\pi_0 \mid \mathbf{x}_N, \mathbf{z}_N, \phi_0, \phi_1, C \sim B(\pi_0 \mid N_0 + 1, N_1 + 1), \tag{A.8}
\]

where $B(\cdot \mid N_0 + 1, N_1 + 1)$ is the Beta distribution with shape parameters $N_0 + 1$ and $N_1 + 1$.
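A short Python sketch of the updates (A.5) to (A.8) is given here for illustration. It assumes the two-group mixture $f = \pi_0 f_0 + (1 - \pi_0) f_1$; the function name and the callables `f0` and `f1` are hypothetical stand-ins for whichever null and alternative mass functions are in use.

```python
import random


def draw_latent_and_pi0(x, c, pi0, f0, f1, rng=random):
    """One Gibbs update of z_N via (A.5) and of pi0 via (A.8).

    x      : observed counts x_1, ..., x_N
    c      : current value of the cutoff C
    f0, f1 : callables giving the null / alternative pmf at a count
             (hypothetical interface, assumed for this sketch)
    """
    z = []
    for xi in x:
        if xi <= c:
            p = 1.0  # zero assumption: counts at or below C are null
        else:
            null_part = pi0 * f0(xi)
            # Assumed mixture density f = pi0 * f0 + (1 - pi0) * f1.
            p = null_part / (null_part + (1.0 - pi0) * f1(xi))
        z.append(1 if rng.random() < p else 0)
    n0 = sum(z)          # N0 in (A.6): positions assigned to f0
    n1 = len(z) - n0     # N1 in (A.7): positions assigned to f1
    new_pi0 = rng.betavariate(n0 + 1, n1 + 1)  # Beta(N0 + 1, N1 + 1), (A.8)
    return z, new_pi0
```

Because every observed count is at most $K$, summing $z_i$ over all positions gives the same $N_0$ as the double sum in (A.6).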

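A.3. Conditional Posterior Distribution of $\phi_0 = (\eta, \lambda, \theta)$. When $f_0$ is ZIGP, the conditional posterior distribution of the null distribution parameters given the rest of the parameters is given by

\[
\begin{aligned}
f(\phi_0 \mid \mathbf{x}_N, \mathbf{z}_N, \phi_1, C, \pi_0)
&\propto f(\mathbf{x}_N, \mathbf{z}_N \mid \phi_0, \phi_1, C)\, g(\phi_0) \\
&= g(\phi_0) \prod_{i \in \{i:\, z_i = 1,\ 1 \le i \le N\}} f_0(x_i) \\
&= g(\phi_0) \prod_{0 \le j \le K} f_0(j)^{n_{0j}},
\end{aligned}
\]

where $g(\phi_0) = g(\eta)\, g(\lambda)\, g(\theta) = I_{(0,1)}(\eta) \times I_{(0,1)}(\theta) \times \lambda^{-0.5} I_{(0,\infty)}(\lambda)$ and

\[
\prod_{0 \le j \le K} f_0(j)^{n_{0j}} \propto
\left[ \eta + (1 - \eta) e^{-\lambda} \right]^{n_{00}}
\left[ (1 - \eta) \lambda e^{-\lambda} \right]^{\sum_{j \ge 1} n_{0j}}
e^{-\theta \sum_{j \ge 1} j\, n_{0j}}
\times \prod_{j \ge 1} \left( \frac{(\lambda + \theta j)^{j-1}}{j!} \right)^{n_{0j}}.
\]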

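Writing $f(A \mid \text{Rest})$ for the conditional distribution of a variable $A$ given all other data and parameters, the above expression reduces to the following conditional posterior densities:

\[
\begin{aligned}
f(\lambda \mid \text{Rest}) &\propto \left[ \eta + (1 - \eta) e^{-\lambda} \right]^{n_{00}}
\lambda^{-0.5 + \sum_{j \ge 1} n_{0j}}\,
e^{-\lambda \sum_{j \ge 1} n_{0j}}
\prod_{j \ge 1} \left( \frac{(\lambda + \theta j)^{j-1}}{j!} \right)^{n_{0j}}, \\
f(\eta \mid \text{Rest}) &\propto \left[ \eta + (1 - \eta) e^{-\lambda} \right]^{n_{00}}
(1 - \eta)^{\sum_{j \ge 1} n_{0j}}, \\
f(\theta \mid \text{Rest}) &\propto e^{-\theta \sum_{j \ge 1} j\, n_{0j}}
\prod_{j \ge 1} \left( \frac{(\lambda + \theta j)^{j-1}}{j!} \right)^{n_{0j}}.
\end{aligned}
\]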

Sampling from the full conditionals $f(\lambda \mid \text{Rest})$, $f(\eta \mid \text{Rest})$ and $f(\theta \mid \text{Rest})$ is non-trivial because they do not reduce analytically to any well-known distribution with an available random variate generator. Hence, we rely on the Metropolis-Hastings algorithm instead.

Specifically, the MH algorithm is performed in vector form, that is, jumping in the three-dimensional space of $\phi_0 = (\eta, \lambda, \theta)$, so that the null distribution parameters are updated simultaneously. [1] stressed that there is no natural advantage to altering one parameter at a time except for potential computational savings. Meanwhile, $\phi_0$ belongs to the constrained parameter space $[0,1] \times (0,\infty) \times [0,1]$ and not $\mathbb{R}^3$. However, after the appropriate transformation, we can assume that the conditional posterior distribution of $\phi_0$ given the rest of the parameters is multivariate normal with known variance matrix $\Sigma$.

Following the work of [2], in order to obtain draws of the constrained parameters $(\eta, \lambda, \theta)$, we draw unconstrained random variables from the sampler and transform them to the constrained space. Suppose $\phi_0$ is the vector of the constrained parameters whose full conditional density is $f(\phi_0 \mid \text{Rest})$.

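Let $g$ be the bijection from the space of $\phi_0$ to the Euclidean space $\mathbb{R}^3$. The density of $\varphi_0 = g(\phi_0)$ is then $f(g^{-1}(\varphi_0) \mid \text{Rest}) \cdot |\det J(\varphi_0)|$, where $J = \partial \phi_0 / \partial \varphi_0$. Given $\varphi_0 = g(\phi_0)$, a proposed $\varphi_0^{\star}$ will be accepted with probability

\[
\min\left\{ 1,\ \frac{f(g^{-1}(\varphi_0^{\star}) \mid \text{Rest}) \cdot |\det J(\varphi_0^{\star})|}{f(g^{-1}(\varphi_0) \mid \text{Rest}) \cdot |\det J(\varphi_0)|} \right\}. \tag{A.9}
\]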
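The acceptance rule (A.9) can be illustrated with the Python sketch below. Several choices here are assumptions made only for the example and are not taken from the source: the bijection is taken to be $g(\eta, \lambda, \theta) = (\mathrm{logit}\,\eta, \log \lambda, \mathrm{logit}\,\theta)$, the proposal is an isotropic normal random walk with a fixed step size rather than one with covariance $\Sigma$, and all function names are hypothetical.

```python
import math
import random


def log_target(eta, lam, theta, n0):
    """log f(phi0 | Rest) up to a constant; n0[j] is the null count n_{0j}."""
    s1 = sum(n0[1:])                                  # sum_{j>=1} n_{0j}
    s2 = sum(j * n0[j] for j in range(1, len(n0)))    # sum_{j>=1} j * n_{0j}
    out = n0[0] * math.log(eta + (1.0 - eta) * math.exp(-lam))
    out += s1 * (math.log(1.0 - eta) + math.log(lam) - lam) - theta * s2
    out += sum(n0[j] * ((j - 1) * math.log(lam + theta * j) - math.lgamma(j + 1))
               for j in range(1, len(n0)))
    return out - 0.5 * math.log(lam)                  # prior g(lambda) = lambda^{-0.5}


def to_constrained(varphi):
    """Assumed inverse map g^{-1}: R^3 -> (eta, lambda, theta)."""
    a, b, c = varphi
    return 1.0 / (1.0 + math.exp(-a)), math.exp(b), 1.0 / (1.0 + math.exp(-c))


def log_abs_det_jacobian(eta, lam, theta):
    """log |det J| for J = d(phi0)/d(varphi0) under the assumed map."""
    return (math.log(eta * (1.0 - eta)) + math.log(lam)
            + math.log(theta * (1.0 - theta)))


def mh_step(varphi, n0, step, rng=random):
    """One random-walk update on the unconstrained scale, accepted via (A.9)."""
    proposal = [v + step * rng.gauss(0.0, 1.0) for v in varphi]

    def log_density(point):
        eta, lam, theta = to_constrained(point)
        return log_target(eta, lam, theta, n0) + log_abs_det_jacobian(eta, lam, theta)

    log_ratio = log_density(proposal) - log_density(varphi)
    if rng.random() < math.exp(min(0.0, log_ratio)):
        return proposal, True
    return varphi, False
```

The Jacobian term matters: without it the chain would target a different distribution on the constrained scale.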


APPENDIX B: ADAPTIVE METROPOLIS-HASTINGS WITHIN GIBBS SAMPLING

Bayesian methods using Markov Chain Monte Carlo (MCMC) simulation have radically influenced current statistical research and opened countless new avenues for performing inference [3,4]. As an overview, MCMC simulation is a general method based on drawing samples from approximate distributions and then correcting those draws to better approximate the target posterior distribution [1]. The sampling is performed sequentially, with the distribution of each sampled draw depending on the last value drawn, so that the draws form a Markov chain [1,3].

Among the MCMC simulation methods, we are interested in implementing the Gibbs sampler and the Metropolis-Hastings algorithm. Applied in the Bayesian context, Gibbs sampling is a technique for drawing dependent samples from a multidimensional posterior distribution of the model parameters [5, 6, 7]. Also referred to as “alternating conditional sampling”, each iteration of the Gibbs sampler cycles through the subvectors of the parameter vector, say $\theta$, drawing each parameter or set of parameters conditional on the values of all the others. As discussed in [1], suppose $\theta = (\theta_1, \theta_2, \ldots, \theta_p)$ can be divided into $p$ subvectors. An ordering of the $p$ subvectors is chosen and, at each iteration $t$, each $\theta_j^{(t)}$ is sampled from the conditional distribution given all the other components of $\theta$, that is,

\[
p(\theta_j \mid \theta_{-j}^{(t-1)}, y), \quad \text{where} \quad
\theta_{-j}^{(t-1)} = \left( \theta_1^{(t)}, \ldots, \theta_{j-1}^{(t)}, \theta_{j+1}^{(t-1)}, \ldots, \theta_p^{(t-1)} \right).
\]

This indicates that each subvector $\theta_j$ is updated conditional on the most recent values of the other components of $\theta$, which include the components already updated at iteration $t$ and the values of the remaining components at iteration $t-1$.
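To make the update order concrete, the following Python sketch runs a Gibbs sampler on a toy target that is not part of this paper: a bivariate normal with zero means, unit variances and correlation `rho`, whose full conditionals are $\theta_1 \mid \theta_2 \sim N(\rho\,\theta_2, 1 - \rho^2)$ and symmetrically for $\theta_2$. The function name and arguments are hypothetical.

```python
import math
import random


def gibbs_bivariate_normal(rho, n_iter, start=(0.0, 0.0), seed=0):
    """Alternating conditional sampling for a toy bivariate normal target."""
    rng = random.Random(seed)
    sd = math.sqrt(1.0 - rho * rho)   # conditional standard deviation
    theta1, theta2 = start
    draws = []
    for _ in range(n_iter):
        # theta1 is drawn given theta2 from iteration t - 1 ...
        theta1 = rng.gauss(rho * theta2, sd)
        # ... and theta2 is drawn given theta1 already updated at iteration t.
        theta2 = rng.gauss(rho * theta1, sd)
        draws.append((theta1, theta2))
    return draws
```

The second draw inside the loop uses the value of `theta1` produced a line earlier, which is exactly the "most recent values" rule described above.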

With the availability of inexpensive, high-speed computing, [8] mentioned that using the Gibbs sampler allows researchers to replace difficult analytical calculations with a sequence of easier ones. In line with this, [9] illustrated that by freeing the statistician from daunting calculations and numerical integration, the main focus can be shifted to the statistical aspects of the problem. Another advantage pointed out by [7] is that certain full conditionals reduce analytically to well-known distributions, for which special methods for efficient random variate generation are available.

Meanwhile, another MCMC technique useful for sampling from posterior distributions is the Metropolis-Hastings (MH) algorithm, which was developed by [10] and generalized by [11]. As described by [3], a chain $(\theta^{(t)})$ is generated by drawing $\theta^{\star}$ values from a proposal distribution $q_t(\theta^{\star} \mid \theta^{(t-1)})$ and setting

\[
\theta^{(t)} =
\begin{cases}
\theta^{\star}, & \text{with probability } \min\{r, 1\}, \quad r = \dfrac{p(\theta^{\star} \mid y)}{p(\theta^{(t-1)} \mid y)}, \\
\theta^{(t-1)}, & \text{otherwise.}
\end{cases}
\]

[1] pointed out that the algorithm requires the calculation of the ratio $r$ for all $(\theta^{(t-1)}, \theta^{\star})$, $t = 1, \ldots, T$, and that an iteration in which the jump is not accepted, that is, $\theta^{(t)} = \theta^{(t-1)}$, still counts as an iteration of the algorithm.

A key component in implementing the MH algorithm is the choice of the proposal distribution. It is crucial to specify a proposal distribution which closely approximates the posterior distribution because it affects the chain’s ability to move efficiently across the state space and the speed at which the chain converges [4]. When the acceptance rate is low and the correlation among the draws is high, the chain can be trapped indefinitely in a local mode, resulting in slow convergence [12,13].

[1] argued that it is difficult to provide general advice on efficient jumping rules. However, they provided some insights for the case of multivariate normal random walk proposal distributions, with the normal jumping kernel centered on the current point and with the same shape as the target distribution, that is,

\[
q(\theta^{\star} \mid \theta^{(t-1)}) = N_p(\theta^{\star} \mid \theta^{(t-1)}, c^2 \Sigma).
\]

Among this class of proposal distributions, [1] mentioned that the most efficient has scale $c \approx 2.4/\sqrt{p}$, where efficiency is measured relative to independent sampling from the posterior distribution. Further, they described that the optimal jumping rule has an acceptance rate of around 44% in one dimension, reducing to roughly 23% in high dimensions ($p > 5$).
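The scaling rule can be tried out with the small random-walk sampler below, written as a sketch under simplifying assumptions: the covariance $\Sigma$ is taken to be diagonal and supplied by the caller as `sigma_diag`, the target is passed as a log-density function, and the function name is hypothetical. It is meant to illustrate the rule $c = 2.4/\sqrt{p}$, not to reproduce any sampler used in the paper.

```python
import math
import random


def random_walk_metropolis(log_post, start, n_iter, sigma_diag, seed=0):
    """Random-walk Metropolis with proposal N_p(theta, c^2 * Sigma).

    log_post   : callable returning the log target density (up to a constant)
    sigma_diag : diagonal of Sigma (a diagonal Sigma is assumed for simplicity)
    """
    rng = random.Random(seed)
    p = len(start)
    c = 2.4 / math.sqrt(p)            # scale suggested in [1]
    theta = list(start)
    current_lp = log_post(theta)
    accepted = 0
    chain = []
    for _ in range(n_iter):
        proposal = [theta[k] + c * math.sqrt(sigma_diag[k]) * rng.gauss(0.0, 1.0)
                    for k in range(p)]
        proposal_lp = log_post(proposal)
        # Symmetric proposal, so the ratio r only involves the target density.
        if rng.random() < math.exp(min(0.0, proposal_lp - current_lp)):
            theta, current_lp = proposal, proposal_lp
            accepted += 1
        chain.append(list(theta))     # a rejected jump still counts as a draw
    return chain, accepted / n_iter
```

For a one-dimensional standard normal target, calling `random_walk_metropolis(lambda th: -0.5 * th[0] ** 2, [0.0], 20000, [1.0])` should give an empirical acceptance rate in the neighbourhood of the 44% figure quoted above.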

Given the pivotal role played by the proposal distribution, several extensions and adaptive methodologies have been proposed that use the preliminary draws to “tune” the proposal to the target. As an active research area, the literature on adaptive MH methods is wide-ranging and can be categorized into several groups. Among these groups, we focus on diminishing adaptation schemes. According to [4], a diminishing adaptation MH sampler performs the standard accept/reject step but, in contrast to a traditional MH algorithm, it updates the proposal distribution using the history of the draws. It is referred to as “diminishing” because the updating of the proposal distribution settles down asymptotically in terms of the number of iterations. Theoretical work involving diminishing adaptation was developed by [14], [15] and [16], among others.

Consequently, [4] argued that although more theoretical work on adaptive sampling can be expected, the existing framework already provides sufficient justification and guidelines to build adaptive MH samplers for challenging problems.

For this problem, implementing a Gibbs sampler initially seemed to be a natural choice. However, the zero-inflated null distribution is not conditionally conjugate and it is not trivial to draw samples from the full conditional distribution. Hence, we embedded the Metropolis-Hastings algorithm within a Gibbs sampler structure. The details on how the algorithm is carried out are provided in Section 4 of the main text.


APPENDIX C: PROPOSED METHODS IN SECTION 4

C.1. Semiparametric Model for Bayesian False Discovery Rate in Section 4.2. In contrast to the scenario described in Section 4.1, we consider a nonparametric distribution for $f_1$ instead of $f$. The prior distribution of $\Upsilon$ is given by $\mathcal{D}(\beta)$, where $\beta = (\epsilon, \epsilon, \ldots, \epsilon, \gamma, \gamma, \ldots, \gamma) = (\epsilon \cdot 1_{C+1}, \gamma \cdot 1_{P-C-1})$. To reflect the zero assumption, we assign $\epsilon \approx 0$ so that $\Upsilon_0, \ldots, \Upsilon_C$ are generated to be almost 0 for a given $C$. Note that, from the zero assumption, $z_i = 1$ whenever $x_i \le C$ for a given $C$. This leads to $n_{1j} = 0$ for $j \le C$. The posterior distribution of $\Upsilon$ is

\[
\Upsilon \mid (\mathbf{x}_N, \mathbf{z}_N, \beta, \phi_0, C) \sim \mathcal{D}(\beta), \tag{C.1}
\]

where the updated concentration parameters are

\[
\beta_j =
\begin{cases}
\epsilon, & 0 \le j \le C, \\
\gamma + n_{1j}, & C < j \le K, \\
\gamma, & K < j \le P,
\end{cases}
\]

for $j = 0, 1, 2, \ldots, P$, and $\sum_{j \le P} \Upsilon_j = 1$. The implementation of the algorithm with the zero assumption is similar to the algorithm described in Section 4.1, except for the modification of the concentration parameter $\beta$ in Step (8).

Algorithm for Semiparametric Model:

Step 8 Gibbs step for $\Upsilon$: Generate $\Upsilon^{(t)}$ from (C.1) and compute $\psi_j = \pi_0^{(t)} f_0^{(t)}(j \mid \phi_0^{(t)}) + (1 - \pi_0^{(t)})\, \Upsilon_j^{(t)}$ for $0 \le j \le P$.
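Step 8 can be sketched in Python as below. This is an illustration under assumptions rather than the authors' code: the Dirichlet draw is obtained by normalising independent Gamma variates, `eps` stands for the near-zero concentration assigned to $j \le C$, and the function names and argument layout are hypothetical.

```python
import random


def draw_upsilon(n1, c, p_max, eps, gamma, rng=random):
    """Draw Upsilon from the Dirichlet posterior in (C.1).

    n1[j] : number of positions with count j assigned to f1, for j = 0..K
    c     : current cutoff C
    p_max : largest count P carried by the nonparametric f1
    """
    K = len(n1) - 1
    beta = []
    for j in range(p_max + 1):
        if j <= c:
            beta.append(eps)              # zero assumption: almost no mass
        elif j <= K:
            beta.append(gamma + n1[j])    # prior weight plus observed count
        else:
            beta.append(gamma)            # counts beyond the observed range
    # Dirichlet(beta) as independent Gamma(beta_j, 1) draws divided by their sum.
    g = [rng.gammavariate(b, 1.0) for b in beta]
    total = sum(g)
    return [v / total for v in g], beta


def mixture_pmf(pi0, f0, upsilon):
    """psi_j = pi0 * f0(j) + (1 - pi0) * Upsilon_j for j = 0..P."""
    return [pi0 * a + (1.0 - pi0) * b for a, b in zip(f0, upsilon)]
```

Here `f0` is a list of null probabilities on the same support $0, \ldots, P$ as `upsilon`.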

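Once we obtain samples $\phi^{(t)}$ and $\mathbf{z}_N^{(t)}$, we compute the local false discovery rate using

\[
\begin{aligned}
\mathrm{fdr}(j \mid \mathbf{x}_N) &= E_{\mathbf{z}_N, \phi \mid \mathbf{x}_N}\left[ \mathrm{fdr}(j \mid \phi, \mathbf{x}_N, \mathbf{z}_N) \right] \\
&\approx \frac{1}{T} \sum_{t=1}^{T} \mathrm{fdr}(j \mid \phi^{(t)}, \mathbf{x}_N, \mathbf{z}_N^{(t)})
\end{aligned} \tag{C.2}
\]

for

\[
\mathrm{fdr}(j \mid \phi, \mathbf{x}_N, \mathbf{z}_N) = \frac{\pi_0 f_0(j \mid \phi, \mathbf{x}_N, \mathbf{z}_N)}{f(j \mid \phi, \mathbf{x}_N, \mathbf{z}_N)}
\]

and mutation counts $j = 0, 1, \ldots, K$. We reject the null hypothesis if

\[
\mathrm{fdr}(j \mid \mathbf{x}_N) \le \alpha = 0.05. \tag{C.3}
\]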
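The Monte Carlo average in (C.2) and the decision rule (C.3) amount to a few lines of Python. The sketch below assumes that each retained iteration has been stored as a tuple `(pi0, f0, psi)`, where `f0[j]` and `psi[j]` are the null and mixture probabilities at count `j`; this storage layout and the function name are hypothetical.

```python
def local_fdr(draws, K, alpha=0.05):
    """Average the per-iteration local fdr as in (C.2) and apply (C.3).

    draws : list of (pi0, f0, psi) tuples, one per retained MCMC iteration
            (hypothetical storage layout assumed for this sketch)
    K     : largest mutation count considered
    """
    T = len(draws)
    fdr = [0.0] * (K + 1)
    for pi0, f0, psi in draws:
        for j in range(K + 1):
            # Per-iteration local fdr: pi0 * f0(j) / f(j).
            fdr[j] += pi0 * f0[j] / psi[j] / T
    rejected = [j for j in range(K + 1) if fdr[j] <= alpha]
    return fdr, rejected
```

Counts in `rejected` are those declared significant at level `alpha`.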


C.2. Parametric Model for Bayesian False Discovery Rate in Section 4.3. Instead of a nonparametric distribution for $f_1$, we consider a parametric distribution in this framework. We use the shifted Generalized Poisson (GP) distribution to reflect the zero assumption such that, for a given $C$, $x \sim f_1(x)$ where $X = W + C$ and $W$ follows the given parametric distribution with $W \ge 0$. The conditional posterior density is

\[
f(\phi_1 \mid \text{Rest}) \propto f(\mathbf{x}_N, \mathbf{z}_N \mid \phi_1)\, f(\phi_1),
\]

where

\[
f(\mathbf{x}_N, \mathbf{z}_N \mid \phi_1) \propto \prod_{i \ge 1} f_1(x_i \mid \phi_1)^{1 - z_i} = \prod_{j > C} g(j - (C+1) \mid \phi_1)^{n_{1j}}
\]

and $g(x)$ is the density function of the GP distribution.

The main drawback of using the Poisson distribution to model count data is its inability to account for overdispersion. Hence, we considered the case where the alternative distribution is GP. However, since GP is not conditionally conjugate, we employ extra Metropolis-Hastings (MH) steps to draw samples of $\phi_1 = (\delta, \nu)$. We modify Step (1b) in Section 4.1 by specifying $\phi_1^{(0)}$ and $\Delta^{(0)}$, where the latter is the initial value of the covariance matrix for the proposal distribution. Then, we follow Steps (2) to (7). We replace Step (8) with the following MH steps:

Algorithm for Parametric Model:

Step 8 Generate $\phi_1^{(t)} = (\delta^{(t)}, \nu^{(t)})$ from the following algorithm:

(a) Randomly generate $u_t$ from the bivariate standard normal distribution and let $\varphi_1^{\star} = (\Delta^{(t)})^{1/2} u_t + \varphi_1^{(t)}$.

(b) Accept $\phi_1^{(t+1)} = h^{-1}(\varphi_1^{\star})$ with the probability defined in (A.9) in the Supplementary Material, where $h(\phi_1) = h(\delta, \nu) = \left( \log \delta, \log \frac{\nu}{1 - \nu} \right)$. Otherwise, set $\phi_1^{(t+1)} = \phi_1^{(t)}$.

Compute $\psi_j^{(t)} = \pi_0^{(t)} f_0^{(t)}(j) + (1 - \pi_0^{(t)}) f_1^{(t)}(j)$.

This is followed by Step (9), which is an updating procedure similar to Step (7) in Section 4.1 except that we replace $\Sigma^{(t)}$ by $\Delta^{(t)}$ and $\phi_0^{(t)}$ by $\phi_1^{(t)}$. Finally, we repeat Steps (2) to (9) for $t = 1, 2, \ldots, T$.
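A Python sketch of the MH update for $\phi_1 = (\delta, \nu)$ follows. It rests on assumptions that the text above does not fix: the GP mass function is taken in the form $g(w \mid \delta, \nu) = \delta (\delta + \nu w)^{w-1} e^{-\delta - \nu w} / w!$, the prior $f(\phi_1)$ is taken to be flat, `chol` is a lower-triangular square root of $\Delta^{(t)}$ supplied by the caller, and all function names are hypothetical.

```python
import math
import random


def gp_log_pmf(w, delta, nu):
    """log g(w | delta, nu) for the assumed Generalized Poisson form."""
    return (math.log(delta) + (w - 1) * math.log(delta + nu * w)
            - delta - nu * w - math.lgamma(w + 1))


def shifted_gp_loglik(n1, c, delta, nu):
    """sum over j > C of n_{1j} * log g(j - (C + 1) | phi1)."""
    return sum(n1[j] * gp_log_pmf(j - (c + 1), delta, nu)
               for j in range(c + 1, len(n1)) if n1[j] > 0)


def h(delta, nu):
    """Map (delta, nu) to the unconstrained scale (log, logit)."""
    return math.log(delta), math.log(nu / (1.0 - nu))


def h_inv(a, b):
    """Inverse of h."""
    return math.exp(a), 1.0 / (1.0 + math.exp(-b))


def mh_step_phi1(phi1, n1, c, chol, rng=random):
    """One random-walk MH update of phi1 on the unconstrained scale."""
    def log_density(varphi):
        delta, nu = h_inv(*varphi)
        # Flat prior assumed; |det d(phi1)/d(varphi1)| = delta * nu * (1 - nu).
        return (shifted_gp_loglik(n1, c, delta, nu)
                + math.log(delta) + math.log(nu) + math.log(1.0 - nu))

    current = h(*phi1)
    u1, u2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    proposal = (current[0] + chol[0][0] * u1,
                current[1] + chol[1][0] * u1 + chol[1][1] * u2)
    log_ratio = log_density(proposal) - log_density(current)
    if rng.random() < math.exp(min(0.0, log_ratio)):
        return h_inv(*proposal), True
    return phi1, False
```

The shift by $C + 1$ in `shifted_gp_loglik` follows the likelihood expression in this section, so that the smallest alternative count $j = C + 1$ corresponds to $w = 0$.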


APPENDIX D: ADDITIONAL RESULTS

We performed additional simulation studies on different numbers of positions in a protein domain, $N$, to provide a clearer picture of the difference between the full and empirical Bayesian approaches.

When the true $f_0$ and $f_1$ are well-separated, as exemplified by the scenario in which the true null distribution $f_0$ is ZIP with $\theta = 0$, the results for all methods, both fully Bayesian and empirical Bayes, coincide. In this case, all empirical and full Bayesian methods control the FDR, and the TPR of all methods approaches 1 as $N$ increases. In Figure 1, it can be observed that when $N$ is at least 500, all methods “catch up”. Also, the variability in the FDR and TPR of all the methods can be distinguished when $N$ is small ($0 \le N < 200$) or moderately small ($200 \le N < 500$).

Meanwhile, the histograms for the case in which $f_0$ and $f_1$ are heavily mixed are displayed in Figure 2. The corresponding numerical comparison of $\widehat{\mathrm{FDR}}$ when $f_0$ is ZIGP($\eta = 0.4, \lambda, \theta$), $f_1$ is shifted Binomial and $\pi_0 = 0.80$, across varying values of $N$, $\lambda$ and $\theta$, is presented in Figure 3. Table 1 in the main manuscript is a specific case where $\lambda = 2$ and $\theta = 0.3$.

According to our simulation studies, the fully Bayesian methods tend to control FDR when $N$ is less than 200, while the empirical Bayes (EB) approaches have inflated FDR. As discussed in the main text, as $N$ increases, the EB methods show improvement in controlling FDR and attain higher TPR than the full Bayesian methods. We recommend the use of full Bayesian methods when $N < 200$ to ensure control of FDR. We propose the use of EB methods when the number of positions is large, i.e. when $N$ is at least 800. On the other hand, when $N$ is moderately small ($200 \le N < 500$) or moderately large ($500 \le N < 800$), we tend to see different results in terms of superiority of the methods. An EB method (One-Stage procedure) needs at least a moderately small number of positions to (marginally) control a given level of FDR. We reiterate our proposal to estimate the overdispersion parameter to inform these guidelines, rather than relying solely on the number of positions. There is no universally superior method among the proposed fully Bayesian models and the empirical Bayesian methods used as benchmarks.

Fig 1: Numerical comparison when the true $f_0$ and $f_1$ are well-separated. Panels: (a) FDR, $C = 10$; (b) TPR, $C = 10$; (c) FDR, $C = 5$; (d) TPR, $C = 5$. Horizontal axis: $N$. Procedures compared: Bayesian-NP, Bayesian-P, Bayesian-SP, EB-One Stage, EB-Two Stage, Storey.

Fig 2: Histograms when $f_0$ is ZIGP($\eta = 0.40, \lambda, \theta = 0.30$), $f_1$ is Binomial, $\pi_0 = 0.80$, and $C = 5$. Panels: (a) $N = 50$, $\lambda = 2$; (b) $N = 100$, $\lambda = 2$; (c) $N = 200$, $\lambda = 2$; (d) $N = 50$, $\lambda = 3$; (e) $N = 100$, $\lambda = 3$; (f) $N = 200$, $\lambda = 3$; (g) $N = 50$, $\lambda = 4$; (h) $N = 100$, $\lambda = 4$; (i) $N = 200$, $\lambda = 4$.

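Fig 3: Numerical comparison of $\widehat{\mathrm{FDR}}$ when $f_0$ is ZIGP($\eta = 0.4, \lambda, \theta$), $f_1$ is shifted Binomial, $\pi_0 = 0.80$, across varying values of $N$, $\lambda$ and $\theta$. Table 1 in the main manuscript is a specific case where $\lambda = 2$ and $\theta = 0.3$.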

REFERENCES

[1] Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A. and Rubin, D. B. (2014). Bayesian Data Analysis.

[2] Raim, A. M., Neerchal, N. K. and Morel, J. G. (2017). An extension of generalized linear models to finite mixture outcome distributions. Journal of Computational and Graphical Statistics, (just-accepted).

[3] Robert, C. P. and Casella, G. (2005). Monte Carlo Statistical Methods. Springer Texts in Statistics.

[4] Giordani, P. and Kohn, R. (2010). Adaptive independent Metropolis-Hastings by fast estimation of mixtures of normals. Journal of Computational and Graphical Statistics, 19(2), 243-259.

[5] Geman, S. and Geman, D. (1993). Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images. Journal of Applied Statistics, 20(5-6), 25-62.

[6] Gelfand, A. E. and Smith, A. F. (1990). Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association, 85(410), 398-409.

[7] Gilks, W. R., Best, N. G. and Tan, K. K. C. (1995). Adaptive Rejection Metropolis sampling within Gibbs sampling. Applied Statistics, 455-472.

[8] Casella, G. and George, E. I. (1992). Explaining the Gibbs sampler. The American Statistician, 46(3), 167-174.

[9] Smith, A. F. and Gelfand, A. E. (1992). Bayesian statistics without tears: a sampling–resampling perspective. The American Statistician, 46(2), 84-88.

[10] Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H. and Teller, E. (1953). Equation of state calculations by fast computing machines. The Journal of Chemical Physics, 21(6), 1087-1092.

[11] Hastings, W. K. (1970). Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1), 97-109.

[12] Liu, J. S. (2008). Monte Carlo strategies in Scientific Computing. Springer Science & Business Media.

[13] Luengo, D. and Martino, L. (2013). Fully adaptive Gaussian mixture Metropolis-Hastings algorithm. In Proceedings: IEEE International Conference on Acoustics, Speech and Signal Processing.

[14] Holden, L., Hauge, R. and Holden, M. (2009). Adaptive independent Metropolis–Hastings. The Annals of Applied Probability, 19(1), 395-413.

[15] Haario, H., Saksman, E. and Tamminen, J. (2001). An adaptive Metropolis algorithm. Bernoulli, 7(2), 223-242.

[16] Atchadé, Y. F. and Rosenthal, J. S. (2005). On adaptive Markov Chain Monte Carlo algorithms. Bernoulli, 11(5), 815-828.
