EXTENDED ANALYSIS DETAILS
B.4 Markov Chain Monte Carlo fitting procedure
We will often want to infer parameter values under conditions where the posterior distribution is difficult to work with directly. In these cases, Markov Chain Monte Carlo (MCMC) methods can be used. MCMC gets its name from two components: Monte Carlo sampling and Markov chains. Monte Carlo refers to estimating features of a distribution by randomly drawing samples from it. For example, one could estimate the mean or standard deviation of a distribution by drawing random samples and computing the mean and standard deviation of those samples. Fig. B.3 (A) schematizes this process.
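To make this concrete, the short sketch below (an illustration, not part of the original analysis) draws samples from a distribution and estimates its mean and standard deviation from the samples alone.

```python
# Minimal Monte Carlo illustration: estimate the mean and standard deviation of a
# distribution purely from random draws. The Gaussian here is a stand-in for a
# distribution we can sample from but do not want to treat analytically.
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(loc=-0.5, scale=0.2, size=10_000)

print("Monte Carlo estimate of the mean:", samples.mean())            # ~ -0.5
print("Monte Carlo estimate of the std. dev.:", samples.std(ddof=1))  # ~ 0.2
```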
A Markov chain is a particular way of generating those random samples. In this approach, schematized in Fig. B.3 (B), a random value is added to the current parameter values to generate the next sample.
[Figure B.3 panels: (A) Monte Carlo schematic, (B) Markov chain schematic, (C) mutual information versus MCMC iteration (x1000), (D) frequency histogram of ∆εR (kBT).]
Figure B.3: Parameter inference using Markov Chain Monte Carlo. (A) A “Monte Carlo” technique refers to the process of taking random samples from a
distribution and using these samples to estimate properties of the distribution. (B) A “Markov Chain” refers to a process of generating random samples in which an operation is performed only on the current sample to generate the next sample. The identity of each sample depends only on the sample immediately preceding it, and lacks memory of previous samples. (C) A short MCMC example chain for
inferring a parameter. (D) A histogram generated from the chain plotted in (C).
In a Markov chain there is no memory of any previous samples; each new step is calculated using only information from the previous step. Each new step is accepted or rejected based on the relative likelihood of the new model compared to the old model. Successive iterations of MCMC increase the mutual information up to a plateau, as shown in Fig. B.3 (C). The early period before the plateau is known as the "burn-in" period and is discarded from the analysis. The samples after burn-in can be used to determine the posterior distribution of the parameter, an example of which is shown in Fig. B.3 (D).
A full discussion of Markov Chain Monte Carlo methods is beyond the scope of this study; here we provide only a brief explanation, and a complete treatment can be found in Neal, 1993. MCMC methods are often used when functions are not amenable to analytical solutions or calculations. They allow us to estimate the expectation value of a given parameter, and its uncertainty, without requiring full access to the underlying probability distribution. As is often the case in biology, the true underlying probability distribution is complicated and difficult to access.
As proven by Kinney, 2008, the likelihood of a model that predicts gene output is
$$ L(\theta \mid \mu, s) \propto 2^{N I_{\mathrm{smooth}}(\mu, z)}, \qquad (B.17) $$
where $N$ is the total number of independent sequences and $I_{\mathrm{smooth}}$ is the smoothed mutual information between gene expression ($\mu$) and DNA sequence ($z$).
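To illustrate how Eq. B.17 might be evaluated in code, the sketch below uses a plain 2D-histogram estimate of mutual information in place of the smoothed estimator used in the actual analysis; the function names and binning are assumptions for illustration only.

```python
# Hypothetical sketch of Eq. B.17: log2 L(theta) = N * I(mu, z) + const.
# A simple 2D-histogram estimate of the mutual information stands in for the
# smoothed estimator used in practice.
import numpy as np

def mutual_information_bits(x, y, bins=10):
    """Histogram estimate of the mutual information I(x; y) in bits."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)   # marginal of x
    py = pxy.sum(axis=0, keepdims=True)   # marginal of y
    nz = pxy > 0
    return float(np.sum(pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz])))

def log2_likelihood(predictions, expression):
    """log2 of Eq. B.17 up to an additive constant: N * I(mu, z)."""
    return len(predictions) * mutual_information_bits(predictions, expression)
```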
The probability distribution $p(\theta)$ is very difficult to handle analytically. The reason we use MCMC is that properties of the target probability distribution can be estimated without needing to know the distribution itself. For example, we can estimate $\langle\theta\rangle$ by drawing many samples of $\theta$ using MCMC and taking the mean of those samples.
We therefore need to construct a Markov chain whose stationary distribution converges to the distribution of interest $p(\theta)$. A Markov chain is a sequence of values $\theta_1, \theta_2, \theta_3, \ldots, \theta_N$ over $N$ steps. We can then find $\langle\theta\rangle$ with
$$ \langle\theta\rangle = \frac{1}{N} \sum_{n=1}^{N} \theta_n. \qquad (B.18) $$
A Markov chain has no memory. That is, the probability that the $N$th value in the chain takes a value $\theta_N$ depends only on the $(N-1)$th value in the chain. To make things a bit more concrete, let us leave aside $\theta$ for the time being. Imagine that we have a light switch, and we know the switch is "on" 25% of the time and "off" 75% of the time. For each "step" in our Markov chain, we can change the state of the switch: if the switch is on, we turn it off with some rate $k_{\mathrm{off}}$, and if the switch is currently off, we turn it on with rate $k_{\mathrm{on}}$. This generates a sequence of states, for example off, off, off, on, on, on, off. These states constitute a Markov chain, and if the chain is continued for long enough, its stationary distribution will converge such that $p_{\mathrm{on}} = 0.25$ and $p_{\mathrm{off}} = 0.75$, as we knew going in.
A Markov chain is stationary if detailed balance is satisfied between its states. Detailed balance holds when the total rate of transitions from on to off is the same as the total rate of transitions from off to on. Mathematically, this condition can be written as
$$ k_{\mathrm{on}} \times p_{\mathrm{off}} = k_{\mathrm{off}} \times p_{\mathrm{on}}. \qquad (B.19) $$
The above equation fixes the ratio of $k_{\mathrm{on}}$ to $k_{\mathrm{off}}$:
$$ \frac{k_{\mathrm{on}}}{k_{\mathrm{off}}} = \frac{p_{\mathrm{on}}}{p_{\mathrm{off}}} = \frac{1}{3}. \qquad (B.20) $$
As long as the transition rates satisfy this ratio and enough steps are taken, the chain will converge to the proper stationary distribution.
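As a quick illustration, the simulation below runs the light-switch chain with per-step flip probabilities chosen to respect the ratio in Eq. B.20; the specific rate values are arbitrary.

```python
# Simulate the two-state light-switch Markov chain. Any k_on, k_off with
# k_on / k_off = 1/3 (Eq. B.20) should give occupancies p_on ~ 0.25, p_off ~ 0.75.
import numpy as np

rng = np.random.default_rng(0)
k_on, k_off = 0.1, 0.3                 # per-step flip probabilities, ratio 1/3
state, n_on, n_steps = 0, 0, 100_000   # state: 1 = on, 0 = off

for _ in range(n_steps):
    if state == 1 and rng.random() < k_off:
        state = 0                      # on -> off
    elif state == 0 and rng.random() < k_on:
        state = 1                      # off -> on
    n_on += state

print("empirical p_on:", n_on / n_steps)   # converges to ~0.25
```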
For the far more complicated task of estimating $p(\theta)$ we can fall back on the standard Metropolis-Hastings sampling algorithm. Here $\theta$ is a matrix in which $\theta_{ij}$ is the energetic contribution to the energy matrix from the $i$th position and the $j$th base, where $i \in \{1, 2, 3, \ldots, L\}$ and $j \in \{A, C, G, T\}$. We then follow this procedure:
[Figure B.4: energy matrix heat map, bases A, C, G, T versus position along the binding site, entries colored by energy (kT).]
Figure B.4: An example energy matrix for the YieP binding site of ykgE. An example energy matrix that describes the binding energy between the transcription factor YieP and DNA is shown. By convention, the wild-type nucleotides have zero energy, and every other entry represents the change in binding energy upon mutation from the wild-type nucleotide to the new nucleotide at that position.
1. Start with a random energy matrix $\theta_0$.
2. Make a random perturbation $d\theta$ to $\theta_0$. This perturbation makes a small adjustment to each element of $\theta$.
3. Compute the model likelihoods $L(\theta_0)$ and $L(\theta_0 + d\theta)$ using equation B.17.
4. If $L(\theta_0 + d\theta) > L(\theta_0)$, accept the new parameter values $\theta_0 + d\theta$ as the next element in the Markov chain ($\theta_1$). Otherwise, accept $\theta_0 + d\theta$ with probability $L(\theta_0 + d\theta)/L(\theta_0)$. If the step is rejected, the next element $\theta_1$ in the Markov chain is set to the previous value ($\theta_0$). These acceptance/rejection probabilities ensure that detailed balance is satisfied between the states $\theta_0$ and $\theta_0 + d\theta$.
5. Repeat steps 2-4 until the chain converges to the stationary distribution. In practice this can be determined by monitoring when the mutual information plateaus, as can be seen in Fig. B.3 (C).
6. To be certain that the distribution has in fact converged to the proper stationary distribution, multiple Markov Chains should be run starting with step 1. If all the chains converge to the same distribution, then they have properly converged.
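The sketch below is a minimal rendering of steps 1-6 under stated assumptions: the hypothetical log2_likelihood(theta) is taken to map an energy matrix to the quantity in Eq. B.17 (for instance via the mutual-information sketch above), and the step size and chain length are illustrative. The production inference is performed by MPAthic, not by this code.

```python
# Hypothetical Metropolis-Hastings sketch of steps 1-6 for an L x 4 energy matrix.
import numpy as np

def run_chain(log2_likelihood, L, n_steps=20_000, step_size=0.05, seed=0):
    rng = np.random.default_rng(seed)
    theta = rng.normal(size=(L, 4))                 # step 1: random energy matrix
    logL = log2_likelihood(theta)
    chain = []
    for _ in range(n_steps):
        d_theta = rng.normal(scale=step_size, size=theta.shape)   # step 2: small perturbation
        new_logL = log2_likelihood(theta + d_theta)               # step 3: likelihoods (Eq. B.17)
        # step 4: accept if more likely, otherwise accept with probability L_new / L_old.
        # Comparing log2(u) with the log2-likelihood difference implements that ratio.
        if np.log2(rng.random()) < new_logL - logL:
            theta, logL = theta + d_theta, new_logL
        chain.append(theta.copy())
    # steps 5-6: monitor the likelihood / mutual information for a plateau, and run
    # several chains from step 1 to check that they converge to the same distribution.
    return np.array(chain)
```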
The end result of these model-fitting efforts is an optimized linear binding energy matrix like the one shown in Fig. B.4. A measure of the uncertainty in $\theta_{ij}$ can be obtained by forming a confidence interval from the distribution of parameter values in the Markov chain. This inference is performed using the MPAthic software (Ireland and Kinney, 2016).
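Continuing the sketch above, the confidence interval on each $\theta_{ij}$ can be read off from percentiles of the post-burn-in samples; the burn-in cutoff of 5,000 steps is illustrative only.

```python
# `chain` is the output of run_chain above, with shape (n_steps, L, 4).
import numpy as np

post = chain[5_000:]                                       # discard burn-in samples
theta_hat = post.mean(axis=0)                              # point estimate <theta_ij>
lower, upper = np.percentile(post, [2.5, 97.5], axis=0)    # 95% interval per element
```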
Uncertainty in Reg-Seq due to number of independent sequences
1400 promoter variants were ordered from TWIST Bioscience for each promoter studied. Due to errors in synthesis, additional mutations were introduced into the ordered oligos; as a result, an average of 2200 variants per promoter were ultimately received. To test whether the number of promoter variants is a significant source of uncertainty in the experiment, we computationally reduced the number of promoter variants used in the analysis of the zapAB -10 RNAP region. Each sub-sampling was performed 3 times. The results, displayed in Figure B.5, show only a small effect on the resulting sequence logo until the library is reduced to approximately 500 promoter variants.
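A sketch of this sub-sampling test is shown below; fit_energy_matrix is a placeholder for the full inference pipeline of section B.4, and variants stands for the list of sequenced promoter variants of one promoter.

```python
# Hypothetical sketch of the library sub-sampling test. `fit_energy_matrix`
# stands in for the full MCMC inference described above.
import numpy as np

def subsample_test(variants, fit_energy_matrix,
                   sizes=(2000, 1500, 1000, 500), n_repeats=3, seed=0):
    rng = np.random.default_rng(seed)
    results = {}
    for size in sizes:
        fits = []
        for _ in range(n_repeats):                 # each sub-sampling repeated 3 times
            idx = rng.choice(len(variants), size=size, replace=False)
            fits.append(fit_energy_matrix([variants[i] for i in idx]))
        results[size] = fits                       # compare resulting matrices / logos
    return results
```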