Markov Chain Monte Carlo (MCMC) simulation is a numerical method that is often used in Bayesian analysis to construct the posterior distribution when no analytical expression is available. MCMC simulation works by generating random samples from the target distribution (typically a Bayesian posterior), and is especially powerful when dealing with multivariate distributions. MCMC methods can be used whenever the target density is known at least up to a proportionality constant; they are thus well suited to Bayesian analysis, where the complicated integral expression for the normalizing constant can be very difficult to evaluate.
The idea behind the Markov Chain Monte Carlo method is to construct a Markov chain whose stationary distribution is exactly the distribution of interest (in this case the Bayesian posterior). The particular MCMC implementation used in this work is known as the Metropolis algorithm (Metropolis et al., 1953; Chib and Greenberg, 1995), and it is a form of rejection sampling. The algorithm is fairly simple to implement, and it can be used to generate samples from both univariate and multivariate densities. Suppose one wants to generate samples from a univariate density $f(x)$ that can be evaluated up to a proportionality constant, i.e. $\tilde{f}(x)$ is known, where $\tilde{f}(x) \propto f(x)$ (in Bayesian inference, $\tilde{f}(x) = \pi(x)L(x)$). In this case, the Metropolis algorithm can be implemented as follows:
1. Set $i = 0$ and choose a starting value, $x_0$.
2. Initialize the list of samples: $X = \{x_0\}$.
3. Repeat the following steps many times:
   (a) Sample a candidate $x^*$ from the proposal density function $q(x^* \mid x_i)$.
   (b) Calculate the acceptance ratio $\alpha = \min\left[1, \tilde{f}(x^*)/\tilde{f}(x_i)\right]$.
   (c) Generate a random number $u$ from the uniform distribution on $[0, 1]$.
   (d) If $u < \alpha$, set $x_{i+1} = x^*$; otherwise set $x_{i+1} = x_i$.
   (e) Augment the list of sampled values, $X$, by $x_{i+1}$.
   (f) Increment $i$.
4. After convergence is reached, the list of samples $X$ can be used to construct an approximation to the target density $f(x)$.
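The steps above can be sketched in Python as follows. This is a minimal illustration, not the implementation used in this work; the function name, the Gaussian random-walk proposal, and the standard-normal example target are all chosen for this example.

```python
import math
import random

def metropolis(f_tilde, x0, proposal_std, n_samples):
    """Metropolis sampler for a univariate target known up to a constant.

    f_tilde      : unnormalized target density, f_tilde(x) proportional to f(x)
    x0           : starting value
    proposal_std : std. dev. of the symmetric Gaussian random-walk proposal
    n_samples    : number of iterations to run
    """
    samples = [x0]
    x = x0
    for _ in range(n_samples):
        # (a) sample a candidate x* from the symmetric proposal q(x* | x_i)
        x_star = x + random.gauss(0.0, proposal_std)
        # (b) acceptance ratio alpha = min(1, f~(x*) / f~(x_i))
        alpha = min(1.0, f_tilde(x_star) / f_tilde(x))
        # (c)-(d) accept the move with probability alpha
        if random.random() < alpha:
            x = x_star
        # (e) augment the list of sampled values
        samples.append(x)
    return samples

# Example target: a standard normal, known only up to a constant
random.seed(1)
chain = metropolis(lambda x: math.exp(-0.5 * x * x), x0=0.0,
                   proposal_std=1.0, n_samples=20000)
```

After convergence, a histogram of `chain` approximates the normalized target density even though only $\tilde{f}$ was ever evaluated.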
The proposal density $q(x^* \mid x_i)$ defines a probability density that generates random moves $x^*$ based on the current point $x_i$. In theory, the only restriction on the choice of proposal density $q(\cdot \mid \cdot)$ is that it be symmetric with respect to its arguments, i.e. the probability of going from $x_i$ to $x^*$ is the same as that of going from $x^*$ to $x_i$. An extension of this algorithm, known as the Metropolis-Hastings algorithm, allows the proposal density to have any form.
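When the proposal is not symmetric, the Metropolis-Hastings acceptance ratio carries the correction factor $q(x_i \mid x^*)/q(x^* \mid x_i)$. The following sketch is a hypothetical example, not from this work: it uses a multiplicative (log-normal) random-walk proposal for a positive-valued target, for which the correction factor reduces to $x^*/x_i$.

```python
import math
import random

def metropolis_hastings(f_tilde, x0, sigma, n_samples):
    """Metropolis-Hastings with an asymmetric log-normal random-walk proposal.

    For a target on (0, inf), propose x* = x_i * exp(sigma * Z), Z ~ N(0, 1).
    This proposal is not symmetric in its arguments, so the acceptance ratio
    includes the Hastings correction q(x_i | x*) / q(x* | x_i), which for the
    log-normal proposal simplifies to x* / x_i.
    """
    samples = [x0]
    x = x0
    for _ in range(n_samples):
        x_star = x * math.exp(sigma * random.gauss(0.0, 1.0))
        # acceptance ratio with the Hastings correction x*/x
        alpha = min(1.0, (f_tilde(x_star) / f_tilde(x)) * (x_star / x))
        if random.random() < alpha:
            x = x_star
        samples.append(x)
    return samples

# Example target: Exponential(1), known only up to a constant
random.seed(2)
chain = metropolis_hastings(lambda x: math.exp(-x), x0=1.0,
                            sigma=1.0, n_samples=20000)
```

Without the correction factor the chain would converge to the wrong stationary distribution; with it, the sample mean approaches the Exponential(1) mean of 1.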
Convergence of the chain is generally achieved fairly quickly. However, a poor choice of starting value, $x_0$, may cause the chain to take many samples to reach its stationary distribution. A simple method for assessing convergence is to look at a trace plot of the samples.
In general, the user must specify only the starting value and the proposal density, $q(\cdot \mid \cdot)$.
Unfortunately, the performance of the algorithm can be sensitive to both of these choices, particularly the choice of proposal density. The most commonly used proposal is the random-walk density, in which the candidate point is given by $x^* = x_i + \eta$, where $\eta$ is a random variable chosen to be symmetric about the origin. The choice of the variance of $\eta$ is critical to the performance of the algorithm. If the moves are very small, the acceptance probability is very high and most moves are accepted, but the chain takes a large number of iterations to explore the distribution. If the moves are large, they are likely to fall in the tails of the posterior distribution and result in a low value of the acceptance ratio. One wants to cover the parameter space in a computationally efficient fashion. Many studies have been done on optimal acceptance rates, and the results indicate that 0.45–0.5 is the optimal acceptance rate for one-dimensional problems, whereas 0.23–0.25 is optimal for high-dimensional problems (Gilks et al., 1996).
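The trade-off between step size and acceptance rate can be seen directly by running short random-walk chains with different proposal scales. This is an illustrative experiment, not from this work; the standard-normal target and the particular step sizes are chosen only to make the effect visible.

```python
import math
import random

def acceptance_rate(f_tilde, x0, proposal_std, n_iter):
    """Fraction of accepted random-walk Metropolis moves for a given step size."""
    x, accepted = x0, 0
    for _ in range(n_iter):
        x_star = x + random.gauss(0.0, proposal_std)
        if random.random() < min(1.0, f_tilde(x_star) / f_tilde(x)):
            x = x_star
            accepted += 1
    return accepted / n_iter

# Standard normal target, known only up to a constant
target = lambda x: math.exp(-0.5 * x * x)

random.seed(3)
rates = {std: acceptance_rate(target, 0.0, std, n_iter=10000)
         for std in (0.1, 2.4, 25.0)}
# Tiny steps: nearly everything is accepted, but the chain explores slowly;
# huge steps: most candidates land in the tails and are rejected.
```

With a step size of 0.1 nearly every move is accepted, with 25.0 almost none are, and an intermediate scale lands near the acceptance rates quoted above.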
When the target density is multivariate, there are two options for generating random samples: the candidate moves can be made in all dimensions simultaneously, or the moves can be made on one component at a time. Choosing a joint proposal density that is a good approximation to the target can be a difficult task, and the added complexity of working with multivariate densities often makes it undesirable to generate multi-dimensional candidate moves.
The componentwise scheme discussed by Hastings (1970) and Chib and Greenberg (1995) allows candidate moves to be made on each component independently. A proposal density is specified for each component of $x$, and the acceptance ratio for a move on component $i$ is given by
$$\alpha_i = \min\left[1, \frac{f(x_i^* \mid x_{-i})}{f(x_i \mid x_{-i})}\right],$$
where $f(x_i \mid x_{-i})$ denotes the full conditional density of the $i$th component. Thus, the components of $x$ are sampled sequentially from their respective full conditional densities. This method is similar to Gibbs sampling, and it is often referred to as Metropolis-within-Gibbs. Note that the full conditional density of $x_i$ is given by
$$f(x_i \mid x_{-i}) = \frac{f(x_i, x_{-i})}{f(x_{-i})}.$$
In computing the acceptance ratio, the marginal density $f(x_{-i})$ cancels because only the $i$th component of $x$ is varying. Thus, the acceptance ratio becomes
$$\alpha_i = \min\left[1, \frac{f(x_i^*, x_{-i})}{f(x_i, x_{-i})}\right],$$
which can be computed as long as the joint density $f(x)$ is known up to a proportionality constant.
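The componentwise scheme can be sketched as follows. This is an illustrative implementation, not the one used in this work; the correlated bivariate normal target and the value $\rho = 0.8$ are hypothetical choices made for the example.

```python
import math
import random

def componentwise_metropolis(f_tilde, x0, proposal_stds, n_samples):
    """Metropolis-within-Gibbs: random-walk updates on one component at a time.

    f_tilde takes the full vector; each single-component move is accepted with
    probability min(1, f(x_i*, x_-i) / f(x_i, x_-i)), so the marginal density
    f(x_-i) never needs to be evaluated.
    """
    x = list(x0)
    samples = [tuple(x)]
    for _ in range(n_samples):
        for i in range(len(x)):
            x_star = list(x)
            x_star[i] = x[i] + random.gauss(0.0, proposal_stds[i])
            # ratio of joint densities; the marginal of the other components cancels
            if random.random() < min(1.0, f_tilde(x_star) / f_tilde(x)):
                x = x_star
        samples.append(tuple(x))
    return samples

# Example target: correlated bivariate normal, known only up to a constant
rho = 0.8
def target(v):
    x, y = v
    return math.exp(-(x * x - 2 * rho * x * y + y * y) / (2 * (1 - rho ** 2)))

random.seed(4)
chain = componentwise_metropolis(target, x0=(0.0, 0.0),
                                 proposal_stds=(1.0, 1.0), n_samples=20000)
```

Each sweep updates one coordinate while holding the others fixed, yet only the joint density $\tilde{f}$ is ever evaluated, exactly as the cancellation argument above requires.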
The nature of MCMC sampling is that the samples obtained in this fashion will almost always show a strong degree of serial correlation, depending on the particular proposal distribution being used. For this reason, one generally makes inference about the posterior distribution using a very large number of samples (typically on the order of 10,000 or more). The nature of the resulting Markov chain should also be kept in mind if one wants to generate a "small" random sample from the posterior distribution. For example, if the posterior distribution is simulated using 20,000 MCMC samples, and 100 random samples from the posterior are needed, it would not be appropriate to take 100 consecutive samples from the chain. Instead, one might either choose the samples from the chain at evenly spaced intervals of 200, or choose the 100 samples randomly from the 20,000 available.
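Both subsampling strategies are one-liners. In this sketch a list of integers stands in for a real 20,000-sample chain; the variable names are illustrative.

```python
import random

random.seed(5)
chain = list(range(20000))  # stand-in for 20,000 MCMC samples

# Option 1: thinning -- keep every 200th sample, giving 100 widely spaced draws
thinned = chain[199::200]

# Option 2: draw 100 samples uniformly at random, without replacement
subsample = random.sample(chain, 100)
```

Either way the 100 retained draws are far apart in the chain, so the strong serial correlation between consecutive iterations is largely broken.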