
…computational complexity involved. Multiple-parameter models and extensive amounts of data can all be worked with using this same approach.

The rest of this chapter discusses this algorithm further, taking a closer look at the prior choices, what a likelihood function is, and how the posterior is calculated.

Priors

Priors fall into two broad classes. The first, informative priors, are used when prior knowledge about the parameter can be quantified. The second class applies to the situation where there is little or no information about what the prior distribution is. Such priors are referred to as noninformative. Let's explore both classes of priors a bit further.

Conjugate Priors

The type of prior we will use for all our examples in this and coming chapters is referred to as an informative prior or a conjugate prior. Conjugate priors have the property that the posterior distribution follows the same parametric form as the prior distribution when combined with a likelihood model of a particular form (the likelihood form is not the same as the prior and posterior form). This property is called conjugacy. In our example the beta prior combined with a binomial likelihood produces a beta posterior. This is always the case given these models: whenever a binomial model is used for the data, a beta prior produces a beta posterior distribution. The example used earlier in this chapter for a coin toss can be applied to numerous scenarios.

You may have noticed that we are mixing a discrete model for the data, the binomial, with a continuous model for the parameter p, the beta. That's OK. The subjective belief is quantified on a continuous scale for all possible values of p (0 < p < 1), the data are Bernoulli trials modeled with a binomial model, and the result is a posterior on a continuous scale. There is no requirement for conjugate pairs to be both continuous or both discrete distributions. In an estimation problem, parameters are usually on a continuous scale, and hence the prior and posterior distributions are continuous.
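
As a quick sketch of this beta-binomial update (the prior parameters and counts here are made up for illustration, not taken from the text), suppose a beta(2,2) prior on p and 7 successes observed in 10 Bernoulli trials; the posterior is then beta(2 + 7, 2 + 3) = beta(9,5):

> # hypothetical numbers: beta(2,2) prior on p, 7 successes in 10 trials
> a <- 2; b <- 2                      # prior beta parameters
> s <- 7; n <- 10                     # observed successes and trials
> p <- seq(0, 1, by = 0.001)
> plot(p, dbeta(p, a, b), type = "l", lty = 2, ylim = range(0:4), ylab = "", xlab = "p")
> lines(p, dbeta(p, a + s, b + n - s))    # beta(9,5) posterior: same parametric form

The posterior is again a beta distribution; only its parameters have been updated by the data.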

There are other conjugate prior-likelihood pairs that yield a posterior of the same form as the prior. Let's consider one more example using standard univariate distributions, the conjugate pair of a Poisson likelihood with a gamma prior.

Recall the form of the Poisson distribution (from chapter 7). This is the data model for the occurrence of x events with rate lambda (the rate of occurrence, or average number of events per unit time or space). Let's use the symbol theta (θ) instead of lambda, where theta represents the rate parameter (all we are doing here is changing the symbol used, a perfectly legal mathematical maneuver).

If we observe a single data value x that follows the Poisson distribution with rate theta then the probability distribution of such a single “Poisson count” is:

P(x|θ) = θ^x e^(-θ) / x!

where x can take values 0, 1, 2, 3, …

Note that the rate can be any positive (real) number. If we observe n independent occurrences x1, x2, …, xn (or Poisson counts), they are modeled using the likelihood:

P(data|model) = P(x1, x2, …, xn|θ) = ∏(i=1 to n) θ^(xi) e^(-θ) / xi! = θ^(Σxi) e^(-nθ) / ∏(i=1 to n) xi!

Leaving out the constant term (factorial in denominator):

P(data|model) ∝ θ^(Σxi) e^(-nθ)
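
As a quick check of this proportionality (a sketch, not from the text, using made-up counts), the full Poisson likelihood and the kernel θ^(Σxi) e^(-nθ) differ only by the constant factorial term:

> # sketch: the Poisson likelihood is proportional to theta^sum(x) * exp(-n*theta)
> x <- c(1, 0, 2, 1, 3)                           # hypothetical Poisson counts
> n <- length(x)
> theta <- c(0.5, 1, 1.4, 2)                      # a few candidate rates
> full <- sapply(theta, function(t) prod(dpois(x, t)))
> kernel <- theta^sum(x) * exp(-n * theta)
> full / kernel                                    # constant ratio: 1/prod(factorial(x))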

Recall also the gamma model, a continuous model with great flexibility, whose shape and rate parameters, alpha and beta, make it useful for modeling data which are greater than zero and continuous in nature but do not follow the normal distribution. The gamma is the conjugate prior for the Poisson distribution and can be used to model the prior distribution of the rate parameter theta:

f(θ) = (β^α / Γ(α)) θ^(α-1) e^(-βθ)

Or reducing the model (eliminating the constant term) to produce a prior for theta distributed as a gamma:

p(θ) ∝ θ^(α-1) e^(-βθ)
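
As a sanity check (a sketch, not from the text), this kernel differs from R's dgamma density only by the constant β^α/Γ(α); for the alpha = 2, beta = 5 prior used below, that constant is 25:

> # sketch: theta^(alpha-1)*exp(-beta*theta) is proportional to dgamma(theta, alpha, beta)
> alpha <- 2; beta <- 5
> theta <- seq(0.1, 2, by = 0.1)
> kernel <- theta^(alpha - 1) * exp(-beta * theta)
> range(dgamma(theta, shape = alpha, rate = beta) / kernel)   # constant 25 = beta^alpha / gamma(alpha)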

Let's use this model to measure the rate of a particular mutation in a particular gene, which has rate theta. According to prior knowledge, the rate per 1000 people at which this mutation occurs is 0.4. Based on this we can model the prior with a gamma distribution with alpha parameter 2 and beta parameter 5 (the mean of which, alpha/beta = 0.4, follows from our discussion in chapter 7). That is, the prior distribution of theta (here denoting the rate per 1000 people) is gamma(2,5). We then obtain new data: in n = 10 samples of 1000 people each, the numbers of mutations are (0,2,1,0,1,2,0,1,1,0). Using the Poisson likelihood model and our gamma prior, we need to determine our posterior distribution based on these data, where

Posterior ∝ Likelihood * Prior
P(model|data) ∝ P(data|model) * P(model)
P(θ|x) ∝ P(x|θ) * P(θ)
P(θ|x) ∝ θ^(Σxi) e^(-nθ) θ^(α-1) e^(-βθ)

Adding exponents, this produces a posterior distribution with alpha(post) = Σxi + alpha(prior) and beta(post) = beta(prior) + n. Since in our 10 samples we found a total of eight people with the mutation, Σxi is 8. So our updated parameters are alpha(post) = 8 + 2 = 10 and beta(post) = 5 + 10 = 15, giving a gamma(10,15) distribution for theta. Let's look at a plot (Figure 9-6) of the prior and posterior distributions in light of the new data:

> theta <- seq(0, 2, 0.001)
> y <- dgamma(theta, 2, 5)                 # gamma(2,5) prior density
> plot(theta, y, ylim = range(0:3), ylab = "", xlab = "theta")
> y2 <- dgamma(theta, 10, 15)              # gamma(10,15) posterior density
> lines(theta, y2)
> legend(0, 2.5, "Prior")
> legend(0.6, 2.8, "Posterior")

Figure 9-6
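
Before interpreting the figure, a quick numerical check (a sketch, not part of the text) recomputes the update and compares the prior mean, the sample mean, and the posterior mean:

> # sketch: recompute the conjugate update for the mutation-rate example
> x <- c(0, 2, 1, 0, 1, 2, 0, 1, 1, 0)     # the observed mutation counts
> alpha.post <- 2 + sum(x)                  # 2 + 8 = 10
> beta.post <- 5 + length(x)                # 5 + 10 = 15
> c(prior = 2/5, data = mean(x), posterior = alpha.post / beta.post)

The posterior mean, 10/15 ≈ 0.67, sits between the prior mean of 0.4 and the sample mean of 0.8.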

Note in Figure 9-6 that the posterior distribution of theta shifts toward higher values (the empirical data alone would suggest a theta of 0.8 per thousand) but not as high as the data would suggest. The sample size (10) is not very large, an issue we will return to below when discussing how the posterior is calculated. Working with conjugate pairs greatly simplifies the mathematics and modeling involved in Bayesian data analysis. Table 9-2 summarizes commonly used conjugate pairs.

Table 9-2: Conjugate Pairs

Prior                  Likelihood                                    Posterior

Univariate Models
Beta                   Binomial                                      Beta
Gamma                  Poisson                                       Gamma
Gamma                  Exponential                                   Gamma
Normal                 Normal (known variance, unknown mean)         Normal

Multivariable Models
Dirichlet              Multinomial                                   Dirichlet
Multivariate Normal    Multivariate Normal (known variance matrix)   Multivariate Normal
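
As one more illustration of a row from Table 9-2 (a sketch with made-up numbers, not an example from the text), the normal prior on an unknown mean with known data variance yields a normal posterior whose precision is the sum of the prior precision and the data precision:

> # sketch: normal prior for an unknown mean, normal likelihood with known variance
> mu0 <- 0; tau0 <- 1                 # prior mean and prior standard deviation
> sigma <- 2                          # known data standard deviation
> x <- c(1.2, 0.8, 1.5, 0.9, 1.1)     # hypothetical observations
> n <- length(x)
> post.prec <- 1 / tau0^2 + n / sigma^2                       # posterior precision
> post.mean <- (mu0 / tau0^2 + n * mean(x) / sigma^2) / post.prec
> c(mean = post.mean, sd = sqrt(1 / post.prec))               # normal posterior parameters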

Noninformative Priors

What do we do when we want to use a Bayesian model but know little or nothing about the prior distribution and have little, if any, information to go on? The solution is to use a so-called noninformative prior. The study of noninformative priors in Bayesian statistics is relatively complicated and an active area of research. Other terms used for noninformative prior include reference prior, vague prior, diffuse prior, and flat prior.

Basically, a noninformative prior is used so that the Bayesian model retains a correct parametric form. The choice of noninformative prior may be purely mathematical, made so that the posterior distribution is proper (integrates to 1) and is thus a valid density function. Noninformative priors may be special distribution forms, or versions of common probability distributions that reflect no knowledge about the distribution of the random variable(s) in the prior model.

Here we will introduce only one specific example of a noninformative prior, because it is a prior distribution of interest in modeling proportion data and hence of interest in bioinformatics. If there is no specific prior information known about the proportion of something (say the proportion of a nucleotide in a sequence motif), then we can use a noninformative form of the beta distribution. The beta(1,1) distribution, shown in Figure 9-7, is a flat noninformative prior that is uniform across all possible proportions. It is a special case of a uniform distribution (any distribution with constant density). If you want to use a Bayesian model for proportion data with no prior information, you should use a beta(1,1) prior. The multivariate version of the beta(1,1) is a Dirichlet model with prior parameters set to 1, which will be used in examples in coming chapters.

> x <- seq(0, 1, by = .001)
> plot(x, dbeta(x, 1, 1), ylab = "", main = "Beta (1,1) Prior")

Figure 9-7
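
The Dirichlet examples are deferred to coming chapters; as a hedged sketch in base R (the helper function below is ours, not the book's, and relies on the standard fact that independent gamma draws, once normalized, follow a Dirichlet distribution), a flat Dirichlet(1,1,1,1) prior over, say, four nucleotide proportions can be sampled like this:

> # sketch: sample from a flat Dirichlet prior by normalizing gamma(1,1) draws
> rdirichlet.flat <- function(n, k) {
+   g <- matrix(rgamma(n * k, shape = 1), nrow = n)   # n draws of k independent gamma(1,1)
+   g / rowSums(g)                                    # normalize each row to sum to 1
+ }
> p <- rdirichlet.flat(5, 4)          # five draws of (pA, pC, pG, pT)
> rowSums(p)                          # each draw sums to 1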
