
Extending Bayes’ Rule to Working with Distributions


We could use this information and try to produce a joint probability table for the two events – having the gene and having the disease. Or, now that we know there is Bayes’ rule, we can use it to solve this problem.

From the information above we are given:

P(Gene) = 1/1000
P(Disease) = 1/100
P(Disease|Gene) = 0.5

Note that what we are doing here is inverting the probability P(Disease|Gene) to calculate P(Gene|Disease), in light of the prior knowledge that P(Gene) = 1/1000 and P(Disease) = 1/100. P(Disease) can be taken as the marginal probability of having the disease in the population, across those with and without gene X.

P(Gene|Disease) = P(Disease|Gene) P(Gene) / P(Disease)

Solving this is as simple as plugging in the numbers:

P(Gene|Disease) = (0.5 * (1/1000)) / (1/100) = 0.05

This result is interpreted as follows: if you have the disease, the probability that you also have the gene is 5%. Note that this is very different from the conditional probability we started with, which said that if you have the gene, there is a 50% probability you will have the disease.
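As a quick check, the same calculation can be reproduced in R (a minimal sketch; the variable names are ours, not from the text):

> p.gene <- 1/1000                 # P(Gene)
> p.disease <- 1/100               # P(Disease)
> p.disease.given.gene <- 0.5      # P(Disease|Gene)
> p.disease.given.gene * p.gene / p.disease   # P(Gene|Disease)
[1] 0.05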

Bayes’ rule could be applied over and over in order to update the probabilities of hypotheses in light of new evidence, a process known as Bayesian updating in artificial intelligence and related areas. In such a sequential framework, the posterior from one updating step becomes the prior for the subsequent step. The evidence from the data usually starts to drive the results fairly quickly, and the influence of the initial (subjective) prior diminishes. However, since we are interested in Bayes’ formula with respect to Bayesian statistics and probability models, our discussion here will continue in that direction.

To understand how these principles apply to distributions, we first looked at Bayes’ rule (above); now we will extend Bayes’ rule and apply it to working with probability models (distributions).

Let’s return to the “evidence” and “hypothesis” view of Bayes’ rule, and apply it to our discussion at the beginning of the chapter regarding how a Bayesian would view the probability of a fair coin landing on its head.

P(H|E) = P(E|H) P(H) / P(E)

We have already discussed that this can be written as a proportional relationship eliminating P(E), which is a constant for a given scenario, so:

P(H|E)∝P(E|H)P(H)

Recall that P(H) represents the prior degree of belief in the hypothesis before the evidence. In this case, this will be the Bayesian’s subjective belief of the proportion of times the coin will land on heads before the coin is flipped. Recall that the Bayesian would view this as a probability distribution. Therefore, we have to model P(H) with a distribution. We refer to the distribution modeled with P(H) as a prior.

But what distribution should we use for the prior? We have not collected empirical data so we have nothing to base a binomial distribution on. All we have is the Bayesian’s subjective belief of the proportion of times that the coin will land on its head. Let’s say the Bayesian believes that this proportion is ¼.

Recall that the beta distribution is used to model probability distributions of proportion data for one random variable. How can we fit a beta distribution to model our P(H) for a proportion of ¼? Recall the beta model presented in chapter 7:

f(x) = [1/B(α, β)] x^(α-1) (1-x)^(β-1),   0 < x < 1

where B(α, β) is the beta function.

So really all we need here are some alpha and beta parameters which center the distribution around 1/4 (or 0.25). Having some experience with understanding how using different alpha and beta parameters affects the shape of the beta distribution (making a page of different beta distributions with various combinations of parameters is helpful for learning these things), we suspect that using alpha=2 and beta=5 for parameters will produce an appropriate distribution.
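For example, the "page of different beta distributions" mentioned above could be sketched in R along these lines (a suggestion, not code from the text; the parameter combinations shown are arbitrary):

> p <- (0:1000)/1000
> par(mfrow=c(2,2))                 # 2 x 2 grid of plots
> plot(p, dbeta(p,1,1), type="l", main="beta(1,1)")
> plot(p, dbeta(p,2,5), type="l", main="beta(2,5)")
> plot(p, dbeta(p,5,2), type="l", main="beta(5,2)")
> plot(p, dbeta(p,20,60), type="l", main="beta(20,60)")
> par(mfrow=c(1,1))                 # reset the plotting layout

As a check on the choice alpha=2, beta=5: the mean of a beta(α, β) distribution is α/(α+β) = 2/7 ≈ 0.29 and its mode is (α-1)/(α+β-2) = 0.2, so the distribution does sit roughly around 0.25.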

Let’s go ahead in R and graph the prior distribution.

> x<-rbeta(5000,2,5)                             # 5000 random draws from the beta(2,5) prior
> plot(x,dbeta(x,2,5),main="Prior dist. P(H)")   # density evaluated at the sampled points

Figure 9-2 shows the resulting distribution. Is this an adequate prior? It is centered near 0.25, our Bayesian’s subjective belief, and it has a reasonable spread, with most of its mass between 0 and 0.5. This reflects that the Bayesian is fairly certain of his beliefs, but not so certain as to use an overly precise (nearly invariable) prior.

You may think this reasoning for defining a prior is quite unscientific. The purpose of a prior, however, is to model the initial subjective belief of the Bayesian regarding the parameter of interest, before data are collected. We will discuss priors in more detail later in this chapter.

Figure 9-2

We now have the P(H) part of the model P(H|E)∝P(E|H)P(H). In order to calculate P(H|E) we need P(E|H), the probability of the evidence given the hypothesis. In order to get this, we need some evidence. Evidence, of course, is data. In order to get data, we need to perform an experiment, so let’s collect some evidence.

Suppose we toss the coin 10 times and record whether it lands on heads or tails.

Each trial of tossing the coin is a Bernoulli trial, and the combination of 10 trials can be modeled using the binomial distribution, which is (from chapter 7):

P(k) = P(k successes in n trials) = C(n, k) p^k (1-p)^(n-k)

(where C(n, k) denotes the binomial coefficient).

Suppose our coin tossing experiments resulted in 5 of the 10 coins landing on heads. In frequentist statistics our estimate for the Binomial parameter p would be p=0.5 (the sample proportion) and the frequentist estimated binomial model for k=0 to 10 would be

P(k) = P(k successes in 10 trials) = C(10, k) 0.5^k * 0.5^(10-k)

which can be modeled in R using the simple code:

> plot(0:10, dbinom(0:10,10,0.5), type='h', xlab="", main="Data dist. P(E|H)")

The resulting plot is shown in Figure 9-3.

Figure 9-3

Since in the Bayesian paradigm p is a random variable, there are infinitely many plausible models for the data, namely one for each value of p. Two such models are depicted in Figure 9-3. The dependence of the probability of the data on the parameter value is also referred to as the likelihood, or likelihood function. In our case, since the experiment resulted in k=5, the likelihood is

Likelihood(p) = P(5 successes in 10 trials | parameter p) = C(10, 5) p^5 (1-p)^5
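Although the text does not plot it at this point, the likelihood can be visualized in R as a function of p (a minimal sketch, assuming the k=5, n=10 data above):

> p <- (0:1000)/1000
> plot(p, dbinom(5,10,p), type="l", ylab="Likelihood(p)",
+ main="Likelihood for k=5 successes in n=10 trials")

The curve peaks at p=0.5, the sample proportion, which is exactly the frequentist estimate mentioned above.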

Likelihoods will be discussed more later in this chapter, but let’s continue working with our extension of Bayes’ rule to apply it to distributions.

With our prior distribution model and our data distribution model, we have now completed the right side of the proportion

P(H|E)∝P(E|H)P(H)

Now what we want is P(H|E), the updated degree of belief in the hypothesis given the evidence. We want this, of course, in the form of a probability model, which we call the posterior.

Recall our beta model for P(H)

f(x) = [1/B(α, β)] x^(α-1) (1-x)^(β-1),   0 < x < 1

We concluded that alpha=2 and beta=5 are adequate parameters for modeling our P(H) with a Bayesian prior estimate of 0.25. Note that x in this formula represents p, the proportion value, so we can rewrite it as

f(p) = [1/B(α, β)] p^(α-1) (1-p)^(β-1),   0 < p < 1

Again, looking at the binomial data model equation:

P(k) = C(n, k) p^k (1-p)^(n-k)

It should be apparent that the two models BOTH contain a term of p raised to an exponent and a term of (1-p) raised to an exponent. Hopefully you recall a day in math class sometime in secondary school when you encountered the rule for multiplying powers of the same base: a^m * a^n = a^(m+n).

Let’s combine our models:

P(H|E)∝P(E|H)P(H)

P(H|E) ∝ C(n, k) p^k (1-p)^(n-k) * [1/B(α, β)] p^(α-1) (1-p)^(β-1)

Multiplication produces the posterior:

P(H|E) ∝ C(n, k) [1/B(α, β)] p^(k+α-1) (1-p)^(n-k+β-1)

Combining the terms that are constant (for this particular problem) and replacing them with c (c is a normalizing constant which we will not be concerned with evaluating here):

P(H|E) ∝ c p^(k+α-1) (1-p)^(n-k+β-1)

Note that this follows the form of a beta distribution with parameters alpha(new) = k + alpha(old) and beta(new) = n - k + beta(old).

Thus our posterior distribution has parameters alpha = 5+2 = 7 and beta = 10-5+5 = 10, so we have a beta(7,10) distribution to model the posterior distribution P(H|E). Remember that in the Bayesian paradigm P(H|E) now represents the plausibility of the values of p given our data, which resulted in k=5 in a binomial model with parameter p. This plausibility is naturally represented by the posterior probability density function of the beta(7,10) distribution.
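The conjugate update can also be expressed as a couple of lines of R arithmetic (a sketch; the object names are ours):

> alpha.prior <- 2; beta.prior <- 5      # prior beta(2,5)
> k <- 5; n <- 10                        # observed data
> alpha.post <- alpha.prior + k          # posterior alpha = 2 + 5 = 7
> beta.post <- beta.prior + n - k        # posterior beta = 5 + 10 - 5 = 10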

Let’s use R to graphically analyze our posterior solution (Figure 9-4):

> x<-rbeta(5000,7,10)                                 # draws from the beta(7,10) posterior
> plot(x,dbeta(x,7,10),main="Posterior dist P(H|E)")

Figure 9-4

A more interesting plot can be produced in R using the following code, which uses the lines function to add the posterior distribution to the plot of the prior and the legend function to label the two distributions, as shown in Figure 9-5.

> p <- (0:1000)/1000

> plot(p,dbeta(p,2,5),ylim=range(0:4),type="l",
+ ylab="f(p)", main="Comparing Prior and Posterior")

> lines(p,dbeta(p,7,10))

> legend(0,3,legend="Prior")

> legend(0.4,3.6,legend="Posterior")

Notice in Figure 9-5 that the posterior is shifted toward higher proportions, reflecting the updating of the model based on the data. Note though that the posterior is still not nearly centered on 0.5, the data value for p. We will return to discussing how the likelihood and prior influence the posterior distribution calculation later in this chapter.

We could update the posterior again, this time using the current posterior distribution as the prior and repeating the algorithm, assuming the same data were obtained in a duplicate experiment. Readers are encouraged to carry out this exercise on their own and produce the resulting plots in R; one possible sketch follows. The result would be a posterior with updated parameters, which graphically shifts further toward 0.5.
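One way to sketch this in R (assuming the second experiment again yields k=5 heads in n=10 tosses, so the beta(7,10) posterior becomes the prior and the new posterior is beta(7+5, 10+5) = beta(12,15)):

> p <- (0:1000)/1000
> plot(p, dbeta(p,7,10), type="l", ylim=range(0:5), ylab="f(p)",
+ main="Second update")
> lines(p, dbeta(p,12,15))    # posterior after the repeated experiment

The mean of the beta(12,15) distribution is 12/27 ≈ 0.44, noticeably closer to 0.5 than the beta(7,10) mean of 7/17 ≈ 0.41.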

Figure 9-5: Comparing Prior and Posterior

What is critical to understand in this example is not the details of the distributions used (which have previously been discussed) but the essence of the Bayesian algorithm as it applies to working with probability distribution models:

P(Model|data) ∝ P(Model) P(data|Model)

Posterior ∝ Prior * Likelihood

The essence of this is that the Bayesian’s subjective prior model is updated by the data model (the Likelihood) to produce a posterior distribution that combines the Bayesian subjective knowledge with the objective data observed.
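For readers who like to see this numerically, the proportion Posterior ∝ Prior * Likelihood can also be evaluated directly on a grid of p values, without appealing to the beta-binomial conjugacy (a sketch using the same prior and data as above; the grid approach and object names are ours, not the book's):

> p <- (0:1000)/1000
> prior <- dbeta(p, 2, 5)                        # prior density at each grid point
> likelihood <- dbinom(5, 10, p)                 # P(data|p) for k=5, n=10
> unnorm <- prior * likelihood                   # proportional to the posterior
> posterior <- unnorm / (sum(unnorm) * 0.001)    # normalize so the density integrates to about 1
> plot(p, posterior, type="l", ylab="f(p)", main="Grid approximation of the posterior")
> lines(p, dbeta(p, 7, 10), lty=2)               # exact beta(7,10) posterior for comparison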

Although applications of this algorithm are broad and can be mathematically complex, the algorithm itself remains the same across Bayesian methods. Only simple examples are used in this book, which merely touch on the computational complexity involved, but multiple-parameter models and extensive amounts of data can all be handled using this same approach.

The rest of this chapter discusses this algorithm further, taking a closer look at the prior choices, what a likelihood function is, and how the posterior is calculated.
