The whole purpose of setting up the model, determining a prior, and collecting data is, of course, to evaluate the posterior distribution and obtain the desired results. In the examples used so far, the posterior distribution was quite simple: the trick of using conjugate pairs produced a familiar posterior distribution whose parameters could easily be determined by taking advantage of the property of conjugacy. This is not so easy for more complicated models, and this is where the beauty of a computational package such as R comes in. Let’s take a look at some of the issues involved in the posterior outcome of the Bayesian model.
The Relative Role of the Prior and Likelihood
The posterior distribution reflects both the prior and the likelihood models, although in most cases not equally. Let’s investigate the roles of the likelihood and of the prior in determining the posterior.
First, let’s look at the role of the likelihood. It is fairly straightforward: the more data there are, the more the data influence the posterior, and eventually the choice of prior becomes irrelevant. For example, let’s return to the coin-tossing example used earlier in this chapter. There we had a prior estimate for p of 0.25, which we modeled with a beta(2,5) prior distribution. We collected n = 10 data values and obtained k = 5 successes (the coin landed on heads) for the binomial likelihood model. The comparison of prior and posterior was shown in Figure 9-5.
What would happen in this example if instead of 10 data points we obtained 100 data points, with the coin landing on heads 50 times? For the posterior we can again use a beta distribution, with parameters alpha(new) = k + alpha(old) and beta(new) = n - k + beta(old). With n = 100 and k = 50 we get alpha(new) = 52 and beta(new) = 55.
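This conjugate update is easy to check directly in R; a minimal sketch using the values from this example:
> alpha.old <- 2; beta.old <- 5    # beta(2,5) prior
> n <- 100; k <- 50                # 100 tosses, 50 heads
> k + alpha.old                    # alpha(new) = 52
> n - k + beta.old                 # beta(new) = 55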
Let’s compare the posterior obtained with n = 10 with the posterior obtained with n = 100 and the same success rate in the coin toss:
> p <- seq(0, 1, by = 0.001)
> plot(p, dbeta(p, 2, 5), ylim = range(0:10), ylab = "f(p)",
+     type = "l", main = "Effect of n on Posterior")
> lines(p, dbeta(p, 7, 10))
> lines(p, dbeta(p, 52, 55))
> legend(0, 4, "Prior")
> legend(0.2, 6, "Post.1")
> legend(0.5, 9, "Post.2")
Figure 9-8 shows the result of this analysis. Note that the second posterior is centered on the data value of p = 0.5 and is also much more “precise,” with less spread around this value. This reflects the general rule that, for the same prior, the more data we collect, the more the data drive the posterior.
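The narrowing can be quantified by comparing the standard deviations of the two posterior beta distributions; a small sketch using the standard formula sd = sqrt(a*b/((a+b)^2*(a+b+1))) for a beta(a, b) distribution:
> beta.sd <- function(a, b) sqrt(a * b / ((a + b)^2 * (a + b + 1)))
> beta.sd(7, 10)     # posterior spread after n = 10, roughly 0.12
> beta.sd(52, 55)    # posterior spread after n = 100, roughly 0.05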
In terms of the role of the prior, the more vague the prior, the less it influences the posterior. Let’s see what would happen in the example above if we used the same n = 10 data but started instead from a noninformative beta(1,1) prior. Our posterior would then be a beta distribution with parameters alpha = 6, beta = 6. Let’s analyze and graph this model and compare it with the prior and first posterior in Figure 9-8.
> p <- seq(0.001, 0.999, by = 0.001)
> plot(p, dbeta(p, 1, 1), ylim = range(0:10), ylab = "f(p)",
+     type = "l", main = "Effect of noninformative prior")
> lines(p, dbeta(p, 6, 6))
> legend(0.5, 4, "Posterior")
> legend(0, 3, "Prior")
Figure 9-9 shows the resulting posterior when a noninformative beta prior is used. The figure is intentionally drawn on the same scale as Figure 9-8 for comparison. Posterior 1 in Figure 9-8 is updated with the same data as the posterior distribution in Figure 9-9; the only difference is that Figure 9-9 starts from a noninformative beta prior. The noninformative prior has essentially no effect on the result.
When the noninformative prior is used, the data dominate the model, whereas when the informative prior of Figure 9-8 is used, the prior influences the posterior along with the data.
Figure 9-9. Effect of a noninformative prior on the posterior
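A quick numerical comparison makes the difference concrete. The mean of a beta(a, b) distribution is a/(a + b), so the two posteriors obtained from the same n = 10 data can be compared directly (a minimal sketch):
> 7 / (7 + 10)    # mean of posterior 1 (informative beta(2,5) prior), about 0.41
> 6 / (6 + 6)     # mean of the posterior with the beta(1,1) prior, exactly 0.5
The informative prior, centered near 0.25, pulls the posterior mean below the observed proportion of 0.5, whereas the flat prior leaves it at the observed proportion.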
The important point is that in Bayesian analysis the posterior distribution reflects both the data model and the prior model, with relative weights. With a large dataset the data may dominate; with a small dataset and a precise prior, the prior will dominate. This relative influence of the data and of prior knowledge is an important aspect of Bayesian analysis.
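For the beta-binomial pair used in this chapter, the relative weighting can be written out explicitly: the posterior mean (alpha + k)/(alpha + beta + n) is a weighted average of the prior mean alpha/(alpha + beta) and the sample proportion k/n, with the prior acting like alpha + beta pseudo-observations. A small sketch verifying this with the n = 10 example:
> a <- 2; b <- 5; n <- 10; k <- 5
> w <- (a + b) / (a + b + n)              # weight given to the prior mean
> w * a / (a + b) + (1 - w) * k / n       # weighted average, about 0.41
> (a + k) / (a + b + n)                   # posterior mean: the same value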
The goal in Bayesian analysis is often to determine the posterior distribution and sample from it. Using analytical (paper-and-pencil) methods without a computer, it is nearly impossible to evaluate a posterior except in simple cases.
However, computationally intensive methods using software tools such as R, together with a little programming skill, give us powerful ways to evaluate posterior distributions. We shall see that Markov chains and the other methods introduced later are in fact tools for evaluating a posterior distribution of interest.
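When the posterior has a known form, as in the conjugate examples above, sampling from it in R takes one line; the computational methods discussed later provide such samples when no closed form exists. A minimal sketch for the beta(52,55) posterior from the n = 100 example:
> set.seed(1)                          # for reproducibility
> draws <- rbeta(10000, 52, 55)        # 10,000 draws from the posterior
> mean(draws)                          # close to 52/107, about 0.49
> quantile(draws, c(0.025, 0.975))     # an approximate 95% credible interval for p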
The Normalizing Constant
Notice that in many examples in this chapter the model was written with a proportionality sign; that is, we left out the constant terms and ignored them, working with the model
Posterior ∝ Likelihood * Prior
P(model|data) ∝ P(data|model) * P(model)
Don’t forget, though, that with the proportionality sign the models are only specified “up to a constant” and hence are not completely defined. To make the models fully explicit we would need to add back a constant term c:
(Note that the notation “c” is generic; the constant c takes a different value in each of the two lines below.)
Posterior = c * Likelihood * Prior
P(model|data) = c * P(data|model) * P(model)
The constant terms, grouped as c, are called the normalizing constant. This constant can be calculated so that the posterior distribution integrates to 1, but it is only a scaling factor and does not otherwise affect the distribution, so it is acceptable to ignore c in many calculations and to compute it only when necessary.
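For the coin-tossing example the normalizing constant can be computed numerically. A small sketch using R’s integrate() on the unnormalized posterior (beta(2,5) prior, n = 10, k = 5); dividing by the constant recovers the conjugate beta(7,10) density:
> unnorm <- function(p) dbinom(5, 10, p) * dbeta(p, 2, 5)   # likelihood * prior
> const <- integrate(unnorm, 0, 1)$value                    # the normalizing constant c
> unnorm(0.3) / const    # normalized posterior density at p = 0.3
> dbeta(0.3, 7, 10)      # agrees with the conjugate beta(7,10) result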
One- versus Multiparameter Models
So far we have used only one-parameter models in our discussion. Bayesian statistics can of course be applied to models with multiple parameters, and those are of primary interest in bioinformatics. We worked with some multivariable models in the previous chapter, and in the rest of this book the examples will focus on the multinomial-Dirichlet conjugate pair.
Often in our analysis we are concerned only with some parameters (so-called main parameters) and not with others (so-called nuisance parameters). Nuisance parameters can be eliminated easily in Bayesian models by calculating marginal and conditional probability distributions for the main parameters.
Rather than introduce techniques here for dealing with multiparameter situations, we shall only note that they exist; we will work with them in coming chapters, where we learn computationally intensive algorithms that do most of the work of evaluating such models to understand the parameters of interest. In particular, we will devote a chapter to some special algorithms that may be used to evaluate multiparameter Bayesian models. We also cover Markov chain Monte Carlo methods, which are increasingly popular in bioinformatics applications. Working with multiparameter models simply builds on the concepts used with single-parameter models, and an understanding of these basic concepts is essential for proceeding further with such models.
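As a preview of the multinomial-Dirichlet conjugate pair mentioned above, here is a minimal sketch of the update and of sampling from the posterior in base R. The counts are hypothetical, and the helper rdirich() is built from gamma variates because base R has no Dirichlet sampler:
> prior <- c(1, 1, 1, 1)           # Dirichlet(1,1,1,1) noninformative prior
> counts <- c(12, 7, 20, 11)       # hypothetical multinomial counts
> post <- prior + counts           # conjugate update: Dirichlet(prior + counts)
> rdirich <- function(n, alpha) {  # Dirichlet draws via normalized gamma variates
+     g <- matrix(rgamma(n * length(alpha), alpha), nrow = n, byrow = TRUE)
+     g / rowSums(g)
+ }
> draws <- rdirich(10000, post)
> mean(draws[, 1])                 # marginal posterior mean of the first proportion, near 13/54
The marginal posterior of any single proportion from a Dirichlet is itself a beta distribution (here beta(13, 41) for the first component), which is how nuisance parameters are integrated out with essentially no extra effort.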