The identification of transcriptional driving processes

Chapter VII: Model identification and selection

7.2 The identification of transcriptional driving processes

the experiment fails to capture some molecules, the distributions are identical to those obtained by deflating the transcriptional noise intensity. In other words, even though technical noise affects the molecules, its theoretical effects are indistinguishable from decreasing the variability of the transcriptional process. As the noise levels increase, the RNA distributions are pushed toward the indistinguishable Poisson limit at the bottom edge of the reduced parameter space. We quantify how rapidly the information degrades by plotting smaller circles in Figure 7.1e-f to indicate the effect of 50%, 75%, and 85% dropout, in that order from top to bottom.

This result is an extremely general and fundamental consequence of the form of the solution (Section A.8.4).

In sum, in certain overdispersed regimes, candidate drivers are mutually distinguishable, and the identification of transcriptional models is qualitatively and quantitatively facilitated by the collection of multimodal data. In addition, by exploiting the mathematical structure of the ODEs defining the transcriptional processes, we find that the impact of the simplest form of dropout noise can be conceptualized as the reduction of the transcription rate scale, rendering these parameters non-identifiable.
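The equivalence between dropout and a reduced rate scale can be checked numerically in a simple stand-in case. The sketch below is not one of the Γ-OU or CIR solutions: it assumes a Gamma-Poisson (negative binomial) count distribution and independent per-molecule capture with probability `p_capture`, and all function names and parameter values are hypothetical. Under these assumptions, binomially thinning the counts should give the same distribution as multiplying the Gamma scale by `p_capture`.

```python
import math

def nb_pmf(n, shape, scale):
    """PMF of a Poisson distribution whose rate is Gamma(shape, scale)-distributed."""
    log_p = (math.lgamma(n + shape) - math.lgamma(n + 1) - math.lgamma(shape)
             + n * math.log(scale / (1.0 + scale)) - shape * math.log(1.0 + scale))
    return math.exp(log_p)

def thinned_pmf(k, shape, scale, p_capture, n_max=600):
    """PMF of observed counts when each molecule is captured with probability p_capture.

    The sum over true counts n is truncated at n_max (an assumption that is
    adequate for the small illustrative parameters used here).
    """
    total = 0.0
    for n in range(k, n_max):
        log_binom = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
                     + k * math.log(p_capture) + (n - k) * math.log(1.0 - p_capture))
        total += nb_pmf(n, shape, scale) * math.exp(log_binom)
    return total

shape, scale = 2.0, 5.0  # hypothetical parameters
max_abs_diff = 0.0
for p_capture in (0.5, 0.25, 0.15):  # capture probabilities for 50%, 75%, 85% dropout
    for k in range(30):
        explicit = thinned_pmf(k, shape, scale, p_capture)
        rescaled = nb_pmf(k, shape, p_capture * scale)  # same law, deflated scale
        max_abs_diff = max(max_abs_diff, abs(explicit - rescaled))

print(max_abs_diff)
```

In this toy case the capture probability and the scale enter the observed distribution only through their product, which is the sense in which the two cannot be separately identified from the counts alone.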

Figure 7.2: Genes from comparable single-cell RNA sequencing datasets can be consistently assigned to a particular biophysical model of transcription.

a. By fitting models in the limiting regimes and calculating model Akaike weights, visualized on a ternary diagram, we can obtain coarse gene model assignments (colors: regimes predicted by the partial fit; red: Γ-OU-like genes; blue: CIR-like genes; violet: mixture-like genes; gray: genes not consistently assigned to a limiting regime).

b. Likelihood ratios for selected genes are consistent across biological replicates, and favor categories consistent with predictions (colors: regimes predicted by the partial fit; points: likelihood ratios; horizontal line markers: Bayes factors; vertical lines: Bayes factor ranges; Bayes factor values beyond the plot bounds have been omitted. 𝑛 = 4 biologically independent animals, with 5,343, 6,604, 5,892, and 4,497 cells per animal).

c. The differences between model best fits are reflected in raw count data (title colors: predicted regimes; lines: model fits at maximum likelihood parameter estimates; line colors: models; histograms: count data).

d. Non-distinguishable genes tend to lie in the slow-reversion and high-gain parameter regime; distinguishable genes vary more, but tend to have relatively high gain (colors: predicted regimes; large dots: genes illustrated in panel c. Genes with absolute log-likelihood ratios above 150 have been excluded).

Γ-OU and CIR models can be supported by data, we fit the two models’ (distinct) burst-like limits, where 𝑥, 𝑦 → 1, and (identical) mixture-like limits, where 𝑥 → 0 and 𝑦 → 1, using Monod, assuming no technical noise, to five glutamatergic neuron subtypes from a single mouse. The burst-like limit of the Γ-OU model is given in Section 4.6.2 and the mixture-like limit in Section 4.6.3, while the burst-like limit of the CIR model has the following generating function:

$$\log G(\mathbf{u}) = \frac{1}{2} \int_0^\infty \left[ 1 - \sqrt{1 - 4 b\, U_N(\mathbf{u}, s)} \right] ds, \tag{7.8}$$

where 𝑏 = 𝜃/𝜅 and 𝑈_𝑁 takes the usual form in Equation 4.55 (Section A.8.2). This somewhat degenerate⁵ limit describes driving by a process with infinitely many jumps in each finite time interval. Although this driver has been encountered before in the mathematical finance literature [292], the solution does not appear to have been previously reported [20, 22].

We computed the Akaike weights of the three limits for all genes (results for one subtype shown in Figure 7.2a). Finally, we selected genes that most consistently agreed with the distributions in these limits (colored points in Figure 7.2a), and extracted the genes with the best fits to the optimal models.
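The Akaike weight calculation for a single gene can be sketched as follows. This is a minimal illustration using the standard definitions (AIC = 2k − 2 log L, with weights proportional to exp(−ΔAIC/2)), not the code used for the analysis. The log-likelihood values, the assumption that all three limits have the same number of free parameters, and the function name are hypothetical.

```python
import math

def akaike_weights(log_likelihoods, n_params):
    """Return Akaike weights for a set of candidate models fit to the same data."""
    # AIC for each model: 2 * (number of parameters) - 2 * (maximized log-likelihood).
    aic = [2 * k - 2 * ll for ll, k in zip(log_likelihoods, n_params)]
    best = min(aic)
    # Relative likelihood of each model with respect to the lowest-AIC model.
    rel = [math.exp(-0.5 * (a - best)) for a in aic]
    total = sum(rel)
    return [r / total for r in rel]

# Hypothetical maximized log-likelihoods for one gene, in the order:
# Γ-OU burst-like limit, CIR burst-like limit, mixture-like limit.
weights = akaike_weights([-1050.2, -1047.9, -1049.5], [3, 3, 3])
print(weights)
```

Because the three weights are non-negative and sum to one, each gene maps to a single point on a ternary diagram; genes near a vertex are the ones a coarse filter of this kind would assign to the corresponding limit.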

For the 80 genes that passed the filtering step, we fit the Γ-OU and CIR models to glutamatergic neuron data from four mice, using gradient descent to find the maximum likelihood parameter set, and computed the likelihood ratios for the models (Equation 3.45), discarding poorly fit genes. The likelihood ratios for the remaining 73 genes are depicted in Figure 7.2b (points). To ensure that the likelihood ratios we obtained were not distorted by the omission of uncertainty in estimates, or by potentially suboptimal fits, we further fit twelve of the genes using a Bayesian procedure, displaying the distribution of Bayes factors (Equation 3.46) on the same axes (horizontal markers).
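The logic of comparing two fitted models through a likelihood ratio can be illustrated on a toy problem. The sketch below does not implement the Γ-OU or CIR likelihoods or Equation 3.45: it assumes two stand-in count models with closed-form maximum likelihood estimates (Poisson and geometric), and the count data and function names are hypothetical.

```python
import math

def poisson_loglik(counts):
    """Maximized Poisson log-likelihood; the MLE of the rate is the sample mean."""
    lam = sum(counts) / len(counts)
    return sum(n * math.log(lam) - lam - math.lgamma(n + 1) for n in counts)

def geometric_loglik(counts):
    """Maximized log-likelihood for P(n) = (1 - q) * q**n on n = 0, 1, 2, ..."""
    mean = sum(counts) / len(counts)
    q = mean / (1.0 + mean)  # closed-form MLE
    return sum(n * math.log(q) + math.log(1.0 - q) for n in counts)

# Hypothetical overdispersed counts for a single gene.
counts = [0, 0, 0, 1, 1, 2, 3, 5, 8, 13]

# Positive values favor the geometric model, negative values the Poisson model.
log_lr = geometric_loglik(counts) - poisson_loglik(counts)
print(log_lr)
```

A ratio of this kind is evaluated at point estimates only, so it ignores parameter uncertainty; that is the gap a Bayes factor comparison is meant to cover.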

The predictions from the coarse filter were largely concordant with the results from the full model, suggesting that it is effective for selecting genes of interest from transcriptome-wide data. The model assignments were typically consistent among datasets. Although orthogonal targeted experiments are necessary to identify whether the proposed models effectively recapitulate the live-cell transcriptional dynamics, the reproducibility of the findings suggests directions and candidate genes for such investigations. Finally, the Bayes factors were largely quantitatively consistent with the likelihood ratios, suggesting that the approximations made in the gradient descent procedure do not substantially degrade the quality of the statistical results. However, we did observe several discrepancies between likelihood ratios and Bayes factors, confirming that the more computationally facile gradient descent procedure does not perfectly recapitulate the full Bayesian fit (cf. results for Ccdc39 and Birc6), possibly due to substantial omitted uncertainty in some genes’ parameters.

Five example fits are depicted in Figure 7.2c, with the corresponding gene names color-coded according to the best-fit model (red: Γ-OU, blue: CIR, purple: mixture). Model distinctions mostly appear to be due to differences in probability near distribution peaks. Interestingly, for some of the genes depicted here, only one of the two marginals (nascent or mature) exhibits obvious visual differences between model fits, further motivating the use of multimodal data.

The location of each best-fit parameter set in the qualitative regimes space is shown in Figure 7.2d. Most Γ-OU fits lie in the top right corner, suggesting we are effectively fitting a standard geometric burst model in these cases. Nonetheless, there are a number of genes for which the parameter sets reside somewhere in the center, indicating that the full complexity of the Γ-OU or CIR models is necessary to describe the corresponding data.

Despite the models’ simplicity, the results suggest that single-cell RNA sequencing data may be sufficiently rich to enable Bayesian model discrimination between superficially similar regulatory schema. Certain genes demonstrate reproducible differences between the two considered SDE drivers, which may imply differences in the underlying regulatory motifs. Interpreting the specific biochemical meaning of the findings is challenging without accounting for features which have been omitted in the discussion thus far, such as technical noise and additional complexities in downstream processing of RNA. Nevertheless, fine details of transcription, including DNA mechanics and gene regulation, appear to have signatures in single-cell data, and a model-based, hypothesis-driven paradigm can help identify them. Further, these fine details can be probed using a range of tools, some more approximate and suited to genome-wide exploratory analysis, others more statistically rigorous and suited to detailed study of gene targets.
