Chapter II: Technologies, desiderata, and axioms

2.2 Motivations for mechanistic models

This section adapts portions of [115] by G.G., J.J.V., and L.P., [113] by G.G., J.J.V., M.F., and L.P., and [112] by G.G., M.F., T.C., and L.P. This review of motivations was conceptualized by G.G. and L.P.

The essential take-away from Section 2.1 is that sequencing and fluorescence transcriptomics analyses are concerned with the same problem: the summary and interpretation of noisy RNA copy number datasets. To treat this problem, they even use similar tools, such as the negative binomial distribution. However, these superficial similarities hide profound conceptual differences: the meaning attributed to these tools is different in the two subfields; single-molecule stochasticity is front and center in fluorescence transcriptomics, but sidelined and treated as purely technical in sequencing transcriptomics.

This observation is somewhat troubling: single-molecule stochasticity is ubiquitous [244], and its omission makes sequencing analyses inconsistent with known biology. That said, in spite of these discrepancies, it is not accurate to claim that the scRNA-seq field has entirely neglected the results from fluorescence transcriptomics. Several articles explicitly point to transcriptional variation as a source of biological variability, and either directly use the solutions to mechanistic models [8, 69, 124] or augment them with a model of technical noise [37, 116, 159, 278, 279, 308]. However, this approach is comparatively rare, and has not yet gained traction as part of typical pipelines.

Here, it is reasonable to ask: why do the theoretical foundations matter? So far, all we have demonstrated is that both subfields use similar tools, e.g., the negative binomial distribution. Even if their bases and interpretations are subtly different, the end result is much the same. What is the actual impact of adopting one or another worldview?

It turns out that these latent problems come to a head when we attempt to treat broader questions and types of data. For example, typical single-cell analyses use the mature transcriptome, i.e., only the counts corresponding to exonic regions. Yet it is also possible to align to intronic regions to obtain two data matrices: the usual mature RNA matrix, as well as a nascent RNA matrix, containing all counts associated with intronic regions; as introns are typically removed during RNA processing, the nascent molecules represent an earlier stage in the RNA life-cycle. The usual descriptive analyses do not have a prescription for simultaneously treating these data types: single-cell analyses omit the nascent RNA; single-nucleus analyses add the two matrices; the “best” approach is controversial, and there does not appear to be a straightforward basis for choosing between the two (Section 8.3).

Yet, in the mechanistic worldview, the solution is almost trivial. There is a causal relationship between the two modalities: nascent RNA are eventually converted to mature RNA. If we are confident in the premise that transcription is bursty, we can immediately write down a reasonable model that unifies the two data types:

\[
\varnothing \xrightarrow{k} B \times X_N \xrightarrow{\beta} X_M \xrightarrow{\gamma} \varnothing. \tag{2.7}
\]

Of course, this model is simplistic: the binary assignment may be overly reductive (Section B.1). Further, we have omitted ambiguities; for example, purely exonic reads may arise from either nascent or mature molecules (Section B.2). Nevertheless, we have successfully encoded the transcriptional biophysics and the causal relationship between the two species, and created a theoretical substrate for representing more sophisticated phenomena, such as technical variability. Indeed, a principled approach to “data integration” is only one of the benefits of adopting the mechanistic worldview, and there is a variety of arguments for its broader adoption.

The biological motivation. By investigating data through the lens of biophysical parameters, we can learn something about the mechanisms that give rise to the data, going beyond data summary to characterize the underlying biological processes. For example, finding that a gene’s burst size has changed is more interpretable and actionable than finding that a negative binomial distribution’s scale parameter has changed, even if these discoveries are mathematically identical: the former proposes a specific transcriptional mechanism. Just as valuably, this perspective allows us to falsify models: if the observed distributions cannot be reproduced by a mathematical model, our conceptualization of the underlying physics is somehow incomplete and must be adjusted.
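The equivalence between burst size and the negative binomial scale parameter can be made concrete with a short worked example. This is a sketch under stated assumptions, not a restatement of the models defined elsewhere in this chapter: it assumes a one-species bursty model in which bursts arrive at rate $k$, burst sizes $B$ are geometrically distributed on $\{0, 1, 2, \dots\}$ with mean $b$, and each molecule is degraded at rate $\gamma$. The symbol $b$ and the geometric burst law are assumptions introduced here for illustration.

```latex
% Steady-state distribution of the assumed one-species bursty model:
% negative binomial with shape k/gamma and scale b.
\[
P(X = x) = \frac{\Gamma(x + k/\gamma)}{\Gamma(k/\gamma)\, x!}
           \left(\frac{b}{1+b}\right)^{x}
           \left(\frac{1}{1+b}\right)^{k/\gamma},
\qquad
\mathbb{E}[X] = \frac{k b}{\gamma},
\qquad
\frac{\operatorname{Var}[X]}{\mathbb{E}[X]} = 1 + b.
\]
```

Under these assumptions the fitted shape is the burst frequency in units of the degradation rate, and the fitted scale is the mean burst size, so a change in the scale parameter reads directly as a change in burst size.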

The physical motivation. The discovery, design, and falsification of biophysical laws deserve special mention: they are part of a broad, interdisciplinary effort to ground the study of biology in physical foundations. The origins of this effort date back to the mid-twentieth century [24, 25, 145], and recent work in this direction [201, 221] can be bolstered by the integration of genome-wide data.

The statistical motivation, pt. I. As discussed above, to make confident summaries and predictions, accounting for uncertainty is mandatory. Although certain alternatives, such as the central limit theorem and binarization, can help, discrete models produce more statistical power in the sparse, low-copy number limit relevant to scRNA-seq data.

The statistical motivation, pt. II. The statistical advantages of parametric, mechanistic models range beyond loss function book-keeping. By instantiating models and performing a thorough mathematical analysis, we can discover which features are readily identifiable, which are more challenging to infer, and which are entirely impossible to characterize given a particular type of data. For example, the models in Equations 2.5 and 2.6 produce identical distributions at steady state, so attempting to distinguish them purely based on counts of X is futile.

The experimental motivation. We can use the results of statistical investigations to design readouts or control experiments that answer questions of interest. For instance, the aforementioned negative binomial models can be distinguished with two-species data (in the vein of Equation 2.7, and as discussed in Section 7.1). In addition, the explicit modeling of technical artifacts can provide a quantitative understanding of the differences between experimental workflows (Chapter 8).

The synthesis motivation. As alluded to elsewhere, if we wish to compare sequencing data to other modalities, such as fluorescence transcriptomics, we need to, on one hand, encode the premise that the underlying biology is identical, and, on the other, attribute any differences to specific technical artifacts (Section 8.2). This is most easily done through biophysical modeling.

The control motivation. Even if we choose not to invest all of our efforts into the analysis of mechanistic models, an understanding of common axioms lets us generate realistic simulated data to benchmark sequencing workflows. In addition, the mathematical framework allows us to systematically investigate implicit limitations and contradictions of common data analysis procedures (Sections 6.1 and 8.4).
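As a minimal sketch of how such simulated data might be generated, the two-species model of Equation 2.7 can be sampled with Gillespie's stochastic simulation algorithm. The function names, the parameter values, and the choice of a geometric burst size distribution with mean `b` are assumptions made for illustration; technical noise is deliberately left out.

```python
import math
import random


def sample_burst(mean_burst, rng):
    """Draw a burst size on {0, 1, 2, ...}; the geometric law is an assumption."""
    p = 1.0 / (1.0 + mean_burst)  # gives mean (1 - p) / p = mean_burst
    u = 1.0 - rng.random()        # uniform on (0, 1], so log(u) is finite
    return int(math.log(u) / math.log(1.0 - p))


def simulate_cell(k, b, beta, gamma, t_end, rng):
    """Gillespie simulation of one cell; returns (nascent, mature) at t_end."""
    t, nascent, mature = 0.0, 0, 0
    while True:
        # Propensities: burst arrival, nascent-to-mature conversion, degradation.
        rates = (k, beta * nascent, gamma * mature)
        total = sum(rates)
        t += rng.expovariate(total)  # waiting time to the next reaction
        if t > t_end:
            return nascent, mature
        r = rng.random() * total     # pick a reaction in proportion to its rate
        if r < rates[0]:
            nascent += sample_burst(b, rng)
        elif r < rates[0] + rates[1]:
            nascent -= 1
            mature += 1
        else:
            mature -= 1


def simulate_population(n_cells, k, b, beta, gamma, t_end, seed=0):
    """Simulate independent cells, each started with zero molecules."""
    rng = random.Random(seed)
    return [simulate_cell(k, b, beta, gamma, t_end, rng) for _ in range(n_cells)]


if __name__ == "__main__":
    counts = simulate_population(2000, k=1.0, b=5.0, beta=2.0, gamma=1.0, t_end=20.0)
    mean_nascent = sum(n for n, _ in counts) / len(counts)
    mean_mature = sum(m for _, m in counts) / len(counts)
    # Compare with the steady-state means k*b/beta and k*b/gamma.
    print(mean_nascent, mean_mature)
```

Running each cell for many degradation lifetimes approximates steady state, where the first-moment equations of the model give mean nascent and mature counts of kb/β and kb/γ. A benchmark would then pass the simulated count matrices through the workflow under test and check whether the known parameters are recovered.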

The financial motivation. Experiments are expensive; computational data analysis is less so, but still requires non-negligible investment; theory is cheap. It is financially responsible to understand the limitations of experiments and analyses (i.e., which questions can we confidently answer based on a particular dataset?) before collecting any data, instead of discovering these limitations post hoc. In addition, a thorough, physically grounded investigation of production pipelines can help identify otherwise obscure technical artifacts and prevent target-oriented industry investigations from pursuing dead ends.

The ethical motivation. The collection of sequencing data is necessarily invasive: it requires the isolation and destruction of living cells. In a scientific context, this entails raising and euthanizing animal test subjects. In a therapeutic context, this entails collecting samples from severely ill or deceased patients. Both of these scenarios involve complicated ethical questions, but it appears most justifiable to strive to minimize invasive procedures by making the most of fewer and smaller datasets.

The synthetic biology motivation. The characterization of transcriptional kinetics has an additional, longer-term perspective: the design of synthetic gene circuits. To design a system, it is essential to understand the physics of its constituent parts; for transcriptional systems, an understanding of single-molecule stochasticity is mandatory.