Thesis submitted in partial fulfilment of the requirements for the degree of


The aim of the thesis is to overcome some of the challenges faced by computer algorithms in the reconstruction of time-varying gene regulatory networks. In step 4, the score of the second subset is calculated and stored inside curr.score.

Contributions of the Thesis

Nevertheless, we observe that the memory requirement of the final selection step increases exponentially with the number of shortlisted regulator candidates. At the same time, the former maintains the same recalls as the latter.

Table 1.1: The Runtime of TGS and ARTIVA for the Benchmark Datasets. For each dataset, the fastest runtime is boldfaced.

Organisation of the Thesis

At the same time, the discovery of interactions whose effect can potentially cancel out that of harmful interactions is essential for the development of therapeutic strategies. By monitoring an apparently healthy individual at the molecular level, can we predict whether such a disease is developing?

How to Answer

How to Answer: The Systems Biology Approach

The input is a data matrix of dimensions (V ×N); it contains N measurements for each of the V variables of interest. Depending on the characteristics of the network and the desired operations, an efficient data structure can be chosen.
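As an illustration of this choice, the following Python sketch (not part of the thesis) switches between a dense adjacency matrix and an adjacency list depending on the observed edge density; the function name and the `density_threshold` value are arbitrary placeholders.

```python
import numpy as np

def choose_graph_representation(adjacency: np.ndarray, density_threshold: float = 0.1):
    """Illustrative helper: return a dense matrix for dense networks,
    otherwise an adjacency list (dict of neighbour sets) for sparse ones."""
    v = adjacency.shape[0]
    n_edges = int(np.count_nonzero(adjacency)) // 2        # undirected, symmetric matrix assumed
    density = n_edges / (v * (v - 1) / 2)
    if density >= density_threshold:
        return adjacency                                    # dense: O(1) edge lookup
    return {i: set(np.nonzero(adjacency[i])[0]) for i in range(v)}  # sparse: O(V + E) memory
```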

Chapter Summary

Input Data: Single system, Multiple time points

Different samples can be collected at different times or from different tissues or under different conditions. The goal is to generate a network (or graph) G = (V, E), where the edge set E represents pairwise co-expression relationships between the system components in V.

Correlation Networks

When each sample is collected at a specific time point, the samples form a time-series dataset with N time points.
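For concreteness, here is a minimal sketch of how such a co-expression network could be built from the (V × N) matrix, assuming Pearson correlation and an illustrative absolute-correlation threshold of 0.8 (the thesis does not prescribe a specific threshold).

```python
import numpy as np

def correlation_network(D: np.ndarray, threshold: float = 0.8) -> np.ndarray:
    """Build an undirected co-expression network from a (V x N) data matrix D.

    An edge (i, j) is added when |Pearson correlation(row i, row j)| >= threshold.
    The threshold value is an illustrative choice, not one prescribed by the thesis.
    """
    corr = np.corrcoef(D)                    # (V x V) pairwise Pearson correlations
    adjacency = np.abs(corr) >= threshold    # boolean adjacency matrix
    np.fill_diagonal(adjacency, False)       # no self-loops
    return adjacency
```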

Figure 3.1: Input data matrix D. Given a system under observation, let us consider that D is a (p × N) matrix

Information Theoretic Models

To identify potential false positive edges, ARACNE uses the inverse of the Data Processing Inequality (DPI) principle. ARACNE monotonically reduces the number of false positive edges in mutual information relevance networks.
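The following is a minimal sketch of DPI-style pruning in the spirit of ARACNE, assuming a symmetric mutual information matrix; the tolerance parameter `eps` and the function name are illustrative, not ARACNE's actual implementation.

```python
import numpy as np

def dpi_prune(mi: np.ndarray, eps: float = 0.0) -> np.ndarray:
    """Remove likely indirect edges from a mutual information (MI) matrix.

    For every gene triplet (i, j, k), the weakest of the three MI values is
    treated as an indirect interaction and set to zero, in the spirit of
    ARACNE's DPI step. `eps` is an optional tolerance (0 = strict comparison).
    """
    pruned = mi.copy()
    v = mi.shape[0]
    for i in range(v):
        for j in range(i + 1, v):
            for k in range(v):
                if k in (i, j):
                    continue
                # (i, j) is dropped if it is the weakest edge of the triplet (i, j, k)
                if mi[i, j] < min(mi[i, k], mi[k, j]) - eps:
                    pruned[i, j] = pruned[j, i] = 0.0
                    break
    return pruned
```

In this sketch the pruning decisions are made against the original MI values, so the result does not depend on the order in which the triplets are visited.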

Conditional Independence (CI) Models

Full Conditional Independence (CI) Models / Markov Random Fields

The Full CI models ask whether or not the observed correlation between two random variables can be explained by the rest of the variables. By definition, two system components vi and vj are connected by an undirected edge if and only if the corresponding random variables Xi and Xj are not conditionally independent given Xrest = X \ {Xi, Xj}, the rest of the random variables in X.
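Under a joint-Gaussian assumption, this conditional (in)dependence can be read off the precision (inverse covariance) matrix; the sketch below is an illustrative implementation of that idea, with an arbitrary cut-off on the estimated partial correlations rather than a formal statistical test.

```python
import numpy as np

def full_ci_network(D: np.ndarray, threshold: float = 0.05) -> np.ndarray:
    """Minimal sketch of a full conditional-independence (Gaussian graphical) model.

    Assuming jointly Gaussian variables, X_i and X_j are conditionally independent
    given all remaining variables exactly when the (i, j) entry of the precision
    (inverse covariance) matrix is zero. The threshold below is an illustrative
    cut-off on the estimated partial correlations.
    """
    cov = np.cov(D)                      # (V x V) sample covariance of the rows of D
    theta = np.linalg.pinv(cov)          # precision matrix (pseudo-inverse for stability)
    d = np.sqrt(np.outer(np.diag(theta), np.diag(theta)))
    partial_corr = -theta / d            # partial correlation of each pair given the rest
    adjacency = np.abs(partial_corr) >= threshold
    np.fill_diagonal(adjacency, False)
    return adjacency
```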

Low Order Conditional Independence (CI) Models

Bayesian Network (BayesNet)

Therefore, the computational complexity increases exponentially with the number of observed variables, making exact inference intractable in high-dimensional settings. A scoring function is defined to calculate the fitness of each model in the search space with respect to the given data.

Joint Network Inference (JNI) Models

  • Input Data: Single system, Multiple conditions, Multiple time points
  • Independent Network Inference (INI) with Gaussian Graphical Models
  • Joint Network Inference (JNI) with Gaussian Graphical Models
  • Joint Network Inference (JNI) for Joint Estimation of Multiple Graphical Models

On the other hand, there are methods that can model the temporal progression of the system-specific dependency structure by reconstructing several time-varying networks for each individual system. The first publication in the trilogy (Oates et al., 2014) assumes that all system-specific networks reside at the same level of the tree, the topology of which is known a priori. For example, the edge (j2, j3) is not added to the static DBN because the edge (Di;j2, Di;j3) (the red arrow) only appears in 50% of the total transitions in the unrolled DBN.

Figure 3.4: Difference between INI and JNI strategies. The whole input dataset is denoted by D

Chapter Summary: Abridged Literature Survey

Input: Time-series Gene Expression Dataset

Output: Time-varying Gene Regulatory Networks

Notation D(X; Y; Z) is used to denote the observed values of genes X at times Y in time series Z. The time interval between two consecutive time points in a dataset is usually large enough for regulators to have an effect on the expression of the target gene. It follows from the first-order Markovian assumption that the expression of vj at time t(p+1) depends only on its regulators at time tp.
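A small sketch of how this notation and the first-order Markovian assumption translate into regression samples, assuming the dataset is stored as a NumPy tensor of shape (V, T, S); the function and variable names are illustrative.

```python
import numpy as np

def markov_order_one_samples(D: np.ndarray, j: int):
    """Sketch of the D(X; Y; Z) indexing under the first-order Markovian assumption.

    D has shape (V genes, T time points, S time series). For target gene vj,
    every transition (tp -> t(p+1)) in every time series contributes one sample:
    the predictors are the expressions of all genes at tp and the response is
    the expression of vj at t(p+1).
    """
    V, T, S = D.shape
    predictors, responses = [], []
    for s in range(S):
        for p in range(T - 1):
            predictors.append(D[:, p, s])       # D(all genes; tp; s)
            responses.append(D[j, p + 1, s])    # D(vj; t(p+1); s)
    return np.array(predictors), np.array(responses)  # shapes ((T-1)*S, V) and ((T-1)*S,)
```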

Figure 4.1: Input time-series gene expression data D is a three dimensional tensor with the dimensions (V genes, T time points, S time series)

Benchmark Datasets

Evaluation Metrics

Comparative Study of the Existing Algorithms

Implementations

The ARTIVA source code is publicly available as an R package of the same name (ARTIVA, version: 1.2.3). The source codes of TVDBN-0, TVDBN-bino-hard, TVDBN-bino-soft, TVDBN-exp-hard and TVDBN-exp-soft are publicly available as a package named 'EDISON' (EDISON, version: 1.1.1).

Results

Problem Statement

Initiatives such as The Precision Medicine Initiative (PMI) (https://www.whitehouse.gov/.precision-medicine) and the Google Baseline Study (GBS) (https://en.wikipedia.org/wiki/Baseline_Study) are expected to generate such huge datasets.

Chapter Summary

Development of the Baseline Algorithm

Biologically, this implies that the expression level of vi at time point tp has no regulatory effect on that of vj during the time interval (tp, tp+1). On the other hand, the presence of that edge implies that there is a non-zero probability that the expression level of vi at tp has influenced that of vj during the time interval (tp, tp+1). The Bayesian Information Criterion (BIC) is used with Bene to calculate the scores of sets of candidate regulators.
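As a rough illustration of BIC scoring of a candidate regulator set (Bene itself works on discretised data with its own scoring routine, so the linear-Gaussian likelihood below is only a stand-in, and all names are hypothetical):

```python
import numpy as np

def bic_score(X: np.ndarray, y: np.ndarray, parents: tuple) -> float:
    """Illustrative BIC of regressing the target y on a candidate regulator set.

    X has shape (n samples, V candidate regulators); `parents` indexes columns of X.
    A linear-Gaussian likelihood is assumed here purely for illustration.
    Higher (less negative) scores are better.
    """
    n = len(y)
    Z = np.column_stack([np.ones(n)] + [X[:, p] for p in parents])  # intercept + parents
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    rss = float(np.sum((y - Z @ beta) ** 2))
    k = Z.shape[1]                                    # number of free parameters
    log_lik = -0.5 * n * np.log(max(rss, 1e-12) / n)  # Gaussian log-likelihood up to constants
    return log_lik - 0.5 * k * np.log(n)              # BIC = log L - (k/2) log n
```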

Development of a Novel Algorithm: The TGS Algorithm (short

However, the Achilles heel of this strategy is that the prediction is strongly dependent on the user-defined threshold value (Liu et al., 2016, Section 'Effects of the threshold parameters'). It reconstructs a weighted MI network over all genes from a gene expression dataset without requiring a user-defined threshold. Therefore, for a high-throughput human genome-scale time-series gene expression dataset where (T−1) = o(V) and Mf = o(lg V), the time complexity of TGS tends asymptotically to polynomial, while that of TBN remains exponential.
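This threshold-free weighted MI network appears to come from the CLR step of TGS (also referred to in the results below). The sketch that follows shows CLR-style weighting, in which each MI value is converted into combined z-scores against the two genes' background MI distributions; it is an approximation for illustration, not the exact implementation used in the thesis.

```python
import numpy as np

def clr_weights(mi: np.ndarray) -> np.ndarray:
    """Sketch of a CLR-style step: turn a raw MI matrix into a weighted network.

    Each MI value is compared against the background distribution of MI values of
    the two genes involved (z-scores against row mean and standard deviation),
    so no user-defined threshold is required to obtain edge weights.
    """
    mean = mi.mean(axis=1, keepdims=True)
    std = mi.std(axis=1, keepdims=True) + 1e-12
    z = np.maximum((mi - mean) / std, 0.0)   # z-score of MI(i, j) w.r.t. gene i's background
    weights = np.sqrt(z ** 2 + z.T ** 2)     # combine the two genes' z-scores
    np.fill_diagonal(weights, 0.0)
    return weights
```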

Figure 5.1: Graphical Flowchart (Part 1) of the TGS Algorithm. The flowchart is continued in Figure 5.2

Results

  • Discretisation of the Datasets
  • Implementations
  • Learning From Dataset Ds10n
  • Learning From Datasets Ds50n and Ds100n
  • Effects of Noise on Learning Power and Speed

The reason behind this is that the CLR step in TGS captures 7 out of 10 true edges, even from this noisy dataset; the high recall of the CLR step allows the downstream Bene step to identify at least as many true edges as are identified by TBN, while avoiding the search over a large number of potential false edges. The two ordered values in each cell of rows 'TBN' and 'TGS' represent the application of two different data discretisation algorithms – 2L.wt and 2L.Tesla, respectively. For the larger number of relationships that do not exist, ARTIVA is less likely than TGS to mistake them for true relationships.

Table 5.1: Learning Power of the Selected Algorithms on Dataset Ds10n. TP = True Positive, FP = False Positive

Discussion and Future Work

In fact, its main memory requirement grows exponentially with the number of genes (and thus the number of candidate regulators for each gene) in a given data set. In the current implementation of TGS, the maximum number of candidate regulators is limited to fourteen for each gene to avoid this problem. Relaxing this constraint is a significant challenge, as the true number of regulators for a gene is not known in advance.

Contributions

Chapter Summary

This algorithm offers recall competitive with that of TGS and precision competitive with that of ARTIVA. Instead, it feeds the raw mutual information matrix to the ARACNE algorithm (Section 3.1.3.2), which refines the matrix. Such false-positive mutual information values are detected by ARACNE and reduced to zero in the mutual information matrix.

Results

Implementations

This trade-off between true positives and false positives can be very useful for the users. A user who wants to experimentally verify the predicted edges one by one would prefer fewer false positive edges, even at the cost of fewer true positives. On the other hand, a user who wants to apply other computational methods to the predicted network would prefer to have as many true positive edges as possible for further processing, even at the cost of a larger number of false positives.

Learning from the Benchmark Datasets

The reason is that every false positive edge leads to an unnecessary set of experiments, causing waste of valuable resources. Thus, two variants of the TGS algorithm meet two different sets of demands from the users.

Discussion and Future Work

In TGS, shortlisting is performed based on the "raw" mutual information matrix estimated from the dataset. This algorithm is well known for removing a significant number of false-positive mutual information values at the cost of a reasonable number of true-positive values. It has been observed empirically that TGS+ produces significantly fewer false-positive edges than TGS.

Contributions

In the second step, the shortlist is thoroughly examined to select the final set of regulators.

Chapter Summary

In Section 7.2, we measure our progress in the previous chapters and discuss its limitations. In the first step, the 'expressions' of the genes in question are measured at specific time intervals over a predetermined period of time. Therefore, the second step is to reverse-engineer (hereafter 'reconstruct') the time-varying GRN structures from the given dataset.

Limitations of the Previously Proposed Algorithms

Measuring the expression values of all the genes in question at all time points results in a time-series gene expression dataset. This particular step is known as "time-varying GRN reconstruction from time-series gene expression data". Each time series contains measured expressions of V genes across T time points.

Figure 7.1: The Workflow of a Time-varying GRN Reconstruction Algorithm. The algorithm takes a time-series gene expression data D as input

Investigations into the Origin of the Limitations

In this step, the main memory requirement (hereafter simply 'memory requirement') increases exponentially with the number of candidate regulators. This parameter accepts a positive integer value from the user and limits the maximum number of candidate regulators to that value. For example, if 'max fan-in = 14', then at most fourteen candidate regulators can be shortlisted for each node.
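A minimal sketch of such shortlisting under the max fan-in cap follows, assuming a weighted (e.g. CLR-style) matrix whose (i, j) entry scores gene i as a candidate regulator of gene j; the function name and the positivity filter are illustrative.

```python
import numpy as np

def shortlist_with_max_fanin(weights: np.ndarray, j: int, max_fanin: int = 14) -> list:
    """Illustrative shortlisting of candidate regulators for gene j under 'max fan-in'.

    The exhaustive selection step must examine up to 2**k regulator subsets for k
    shortlisted candidates, so k is capped at `max_fanin` (fourteen in the thesis)
    by keeping only the highest-weighted candidates.
    """
    scores = weights[:, j].copy()
    scores[j] = -np.inf                           # a gene is not its own candidate regulator
    top = np.argsort(scores)[::-1][:max_fanin]    # indices of the strongest candidates
    return sorted(int(i) for i in top if np.isfinite(scores[i]) and scores[i] > 0)
```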

A Novel Idea for Overcoming the Limitations

For each iteration, the highest curr.score is stored in best.score and the corresponding curr.set in best.set. If curr.score is greater than best.score, the values of curr.set and curr.score are copied to best.set (the best subset so far) and best.score (the best score so far), respectively. Otherwise, if curr.score is less than or equal to best.score, best.score and best.set are kept unchanged.
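The idea can be sketched as follows: all 2^|V(j;(p+1))| subsets are generated one at a time, so only the current and the best subset (with their scores) are ever held in memory. The `score` argument is a placeholder for whichever per-subset scoring function (e.g. a BIC-style score) is used.

```python
from itertools import combinations

def best_subset(candidates: list, score) -> tuple:
    """Minimal sketch of the constant-memory subset search described above.

    All 2**k subsets of the shortlisted candidates are generated one at a time;
    only curr_set/curr_score and best_set/best_score are kept in memory.
    """
    best_set, best_score = (), float("-inf")
    for r in range(len(candidates) + 1):
        for curr_set in combinations(candidates, r):   # generate the next subset
            curr_score = score(curr_set)               # score it ...
            if curr_score > best_score:                # ... and keep it only if it is the best so far
                best_set, best_score = curr_set, curr_score
    return best_set, best_score
```

Because `combinations` yields subsets lazily, the memory footprint stays constant in the number of subsets examined.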

Figure 7.2: Illustration of the Idea for Finding the Highest Scoring Subset with 2^|V(j;(p+1))| Subsets, Two Scores and Two Pointers

Design of Novel Algorithms Based on the Novel Idea

Experimental Results

  • Comparative Study Against a Random Classifier
  • Comparative Study Against Alternative Algorithms
  • Comparative Study Against Time-invariant Algorithms
  • Results with a Large-scale Dataset

On the other hand, TGS-Lite occupies only 0.7% of the memory at the same time. Finally, we compare the running time of BTA with that of the proposed algorithms (Table 7.7). It is observed that the running time of BTA is in hours, while that of the proposed algorithms is in minutes.

Figure 7.3: Illustration of the Idea for Finding the Highest Scoring Subset amongst 2^|V(j;(p+1))| Subsets with Only Two Variables, Two Scores and One Subset-generation Script

Discussion and Future Work

TGS+.mf14 is able to process the data from the embryo stage, the longest stage, in only 44 minutes. This subset is specific to a particular time interval, as the selection is based on that interval's gene expression data. However, if such a regulator is omitted from the shortlist, there is no way to capture that regulator in the final list of the relevant time interval.

Figure 7.8: The Effects of the Max Fan-in Parameter on the Correctness (A, B) and Runtime (C, D) of the TGS-Lite and TGS-Lite+ Algorithms

Contributions

Chapter Summary

The goal of the previously proposed algorithms is to identify the regulators of each gene during each time interval. In the second step, they select a subset of the short-listed regulators for each time interval. Therefore, the genes that do not share significantly high mutual information over the entire time series with the gene in question are less likely to be shortlisted as the candidate regulators of the latter gene.

A Novel Idea for Overcoming the Limitations

The shortlist is time-invariant because the framework uses the entire time-series dataset to calculate the mutual information values of other genes with the gene of interest; then the genes that share statistically significant mutual information with the gene of interest are shortlisted. This strategy is useful for shedding those genes that have no regulatory effects on the gene of interest during any interval. However, the strategy can also reject genes that have regulatory effects on the gene of interest for a small number of time intervals.

Design of Novel Algorithms Based on the Novel Idea

The Issue with Extending TGS+ and TGS-Lite+

In TGS+ and TGS-Lite+, the shortlisting step is performed based on "refined" mutual information values. If the relationship is found to be indirect, the "refined" mutual information between vi and vj is considered to be zero. Therefore, we need to develop an algorithm that can produce one refined mutual information matrix for each time interval.

Figure 8.3: Illustration of the Modified Framework with an Example. In this example, the input dataset is comprised of two time series – s1 and s2

Developing a Time-varying Refinement Strategy

More specifically, our claim is that vi^tp shares a non-zero mutual information with one of the true regulators of vj^t(p+1). The only reason behind this observation is that vi^tp shares a non-zero mutual information with vk^tp. Since vi^tp shares a non-zero mutual information with vk^tp, which in turn shares a non-zero mutual information with vj^t(p+1), vi^tp may share a non-zero mutual information with vj^t(p+1).
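In other words, the claim is an instance of the Data Processing Inequality applied to the chain vi^tp → vk^tp → vj^t(p+1) of Figure 8.5, written here in standard notation under the assumption that the chain is Markov:

```latex
% DPI for the Markov chain  v_i^{t_p} -> v_k^{t_p} -> v_j^{t_{p+1}}  of Figure 8.5
I\!\left(v_i^{t_p};\, v_j^{t_{p+1}}\right) \;\le\;
\min\!\left\{\, I\!\left(v_i^{t_p};\, v_k^{t_p}\right),\;
                I\!\left(v_k^{t_p};\, v_j^{t_{p+1}}\right) \right\}
```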

Figure 8.5: An Example of the DPI. In this example, gene vi regulates gene vk and vk in turn regulates gene vj

Section Summary

The full version of ARACNE-T takes an ordered list of time-varying raw mutual information matrices as input. As a result, the output of ARACNE-T is an ordered list of time-varying, refined mutual information matrices. From this dataset, an ordered list of time-varying raw mutual information matrices is estimated.
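A compact sketch of this interval-by-interval refinement follows, assuming the DPI-based pruning of a single MI matrix is available as a function (such as the `dpi_prune` sketch given earlier); the name `aracne_t` is used only for illustration.

```python
from typing import Callable, List
import numpy as np

def aracne_t(raw_mi_list: List[np.ndarray],
             prune: Callable[[np.ndarray], np.ndarray]) -> List[np.ndarray]:
    """Sketch of an ARACNE-T-style workflow: interval-by-interval refinement.

    `raw_mi_list` is an ordered list of (V x V) raw mutual information matrices,
    one per time interval. `prune` is any DPI-based refinement of a single MI
    matrix; the ordered output contains one refined matrix per time interval.
    """
    return [prune(raw_mi) for raw_mi in raw_mi_list]
```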

Figure 8.7: Illustration of the Workflow of ARACNE-T with an Example. In this example, the input dataset is discretised and comprised of two time series – s1 and s2

Experimental Setup

Evaluation Strategy

The latter (right) differs from the former (left) in two places: first, ARACNE is replaced by ARACNE-T; second, CLR is replaced with CLR-T.

Implementations

In the case of TGS, all 84 cases of G4 are compared with those of the other genes. However, this does not guarantee that the performances of the proposed algorithms are better than those of a random classifier. For two of the three benchmark datasets, TGS-T captures significantly more edges than the previously proposed algorithms do.

At the same time, the former captures as many true edges as the latter algorithm does. Although TGS-T+ makes fewer false-positive predictions than TGS-T, the numbers remain higher than those of the previously proposed algorithms.

Table 8.2: Performances of the Selected Algorithms on Dataset Ds10n. TP = True Positive, FP = False Positive

