The aim of the thesis is to overcome some of the challenges faced by computer algorithms in the reconstruction of time-varying gene regulatory networks. In step 4, the score of the second subset is calculated and stored inside curr.score.
Contributions of the Thesis
Nevertheless, we observe that the memory requirement of the final selection step increases exponentially with the number of shortlisted regulator candidates. At the same time, the former maintains the same recalls as the latter.
Organisation of the Thesis
At the same time, the discovery of interactions whose effect can potentially cancel out that of harmful interactions is essential for the development of therapeutic strategies. By monitoring an apparently healthy individual at the molecular level, can we predict whether such a disease is developing?
How to Answer
How to Answer: The Systems Biology Approach
The input is a data matrix of dimensions (V × N); it contains N measurements for each of the V variables of interest. Depending on the characteristics of the network and the desired operations, an efficient data structure can be chosen.
Chapter Summary
Input Data: Single system, Multiple time points
Different samples can be collected at different times, from different tissues, or under different conditions. The goal is to generate a network (or graph) G = (V, E) where the edge set E represents pairwise co-expression relationships between the system components in V.
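As a rough illustration of how such a graph can be derived from a (V × N) data matrix, the sketch below connects two components when the absolute Pearson correlation of their measurements reaches a cut-off. The function names, the toy data, and the threshold value are assumptions made for this example; they are not taken from the thesis, and the code assumes that no row is constant.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equally long, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlation_network(data, threshold):
    """Return the undirected edge set E as pairs (i, j) with i < j.

    data is a list of V rows, each holding N measurements of one variable.
    The threshold is a user-chosen cut-off (an assumption of this sketch).
    """
    edges = set()
    for i, j in combinations(range(len(data)), 2):
        if abs(pearson(data[i], data[j])) >= threshold:
            edges.add((i, j))
    return edges


# Toy (3 x 4) matrix: rows 0 and 1 are almost perfectly correlated.
data = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.1],
    [4.0, 1.0, 3.0, 2.0],
]
edges = correlation_network(data, threshold=0.8)
```

With this toy matrix only the pair (0, 1) passes the cut-off; the other two pairs have an absolute correlation of about 0.4.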
Correlation Networks
When each sample is collected at a specific time point, the samples form a time series with N time points.
Information Theoretic Models
To identify potential false positive edges, ARACNE uses the inverse of the Data Processing Inequality (DPI) principle. ARACNE monotonically reduces the number of false positive edges in mutual information relevance networks.
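The pruning idea can be sketched as follows, under the common reading of DPI-based pruning: in every fully connected triple of nodes, the edge with the smallest mutual information is treated as a likely indirect relationship and set to zero. The function name `dpi_prune`, the tolerance parameter `eps`, and the toy matrix are assumptions of this sketch, not the ARACNE implementation.

```python
from itertools import combinations


def dpi_prune(mi, eps=0.0):
    """Return a pruned copy of a symmetric mutual information matrix.

    For every triple (i, j, k) whose three edges are all non-zero, the
    weakest edge is marked if it is strictly weaker than both other edges
    (after applying the tolerance eps). Marked edges are zeroed only after
    all triples have been examined, so the result does not depend on the
    order in which the triples are visited.
    """
    v = len(mi)
    to_remove = set()
    for i, j, k in combinations(range(v), 3):
        triple = {(i, j): mi[i][j], (i, k): mi[i][k], (j, k): mi[j][k]}
        if min(triple.values()) <= 0:
            continue  # not a fully connected triple
        weakest = min(triple, key=triple.get)
        others = [w for edge, w in triple.items() if edge != weakest]
        if triple[weakest] < min(others) * (1.0 - eps):
            to_remove.add(weakest)
    pruned = [row[:] for row in mi]
    for a, b in to_remove:
        pruned[a][b] = pruned[b][a] = 0.0
    return pruned


# Toy matrix: the 0-2 edge is the weakest edge of the only triple.
mi = [
    [0.0, 0.9, 0.3],
    [0.9, 0.0, 0.8],
    [0.3, 0.8, 0.0],
]
refined = dpi_prune(mi)
```

Since edges are only ever removed, the number of non-zero entries in the refined matrix can never exceed that of the input matrix.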
Conditional Independence (CI) Models
Full Conditional Independence (CI) Models / Markov Random
The Full CI models ask whether or not the observed correlation between two random variables can be explained by the rest of the variables. By definition, two system components are connected by an undirected edge if and only if the corresponding random variables Xi and Xj are not conditionally independent given Xrest = X \ {Xi, Xj}, the rest of the random variables in X.
Low Order Conditional Independence (CI) Models
Bayesian Network (BayesNet)
Therefore, the computational complexity increases exponentially with the number of observed variables, making inference impossible in high-dimensional settings. Define a scoring function to calculate fitness for each model in the search space with the given data.
Joint Network Inference (JNI) Models
- Input Data: Single system, Multiple conditions, Multiple time
- Independent Network Inference (INI) with Gaussian Graphical
- Joint Network Inference (JNI) with Gaussian Graphical Models
- Joint Network Inference (JNI) for Joint Estimation of Multiple
On the other hand, there are methods that can model temporal progression of the system-specific dependency structure by reconstructing several time-varying networks for each individual system. The first publication in the trilogy (Oates et al., 2014) assumes that all system-specific networks reside at the same level of the tree, the topology of which is known a priori. For example, the edge (j2, j3) is not added to the static DBN because the edge (Di;j2, Di;j3) (the red arrow) only appears in 50% of the total transitions in the unrolled DBN.
Chapter Summary: Abridged Literature Survey
Input: Time-series Gene Expression Dataset
Output: Time-varying Gene Regulatory Networks
The notation D(X;Y;Z) denotes the observed values of genes X at times Y in time series Z. The time interval between two consecutive time points in a data set is usually large enough for regulators to have an effect on the expression of the target gene. It follows from the first-order Markovian assumption that the expression of vj at time t(p+1) depends only on its regulators at time tp.
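The notation and the first-order Markovian assumption can be made concrete with a small sketch. Here the dataset is assumed to be stored as a nested list indexed as data[series][time][gene]; this layout and the helper names `D` and `transitions` are illustrative choices, not the thesis's data structures.

```python
def D(data, genes, times, series):
    """Observed values of the given genes at the given times in the given series.

    data is assumed to be indexed as data[series][time][gene].
    The result keeps the same nesting order: series, then time, then gene.
    """
    return [[[data[z][y][x] for x in genes] for y in times] for z in series]


def transitions(num_time_points):
    """Pairs (p, p + 1) of consecutive time point indices.

    Under the first-order Markovian assumption, expression at index p + 1
    is modelled as depending only on expression at index p.
    """
    return [(p, p + 1) for p in range(num_time_points - 1)]


# Toy dataset with 2 series, 3 time points, and 4 genes.
# Each value encodes its own position: 100 * series + 10 * time + gene.
data = [[[100 * z + 10 * y + x for x in range(4)] for y in range(3)]
        for z in range(2)]

values = D(data, genes=[1, 3], times=[0], series=[1])
pairs = transitions(3)
```

In this toy dataset, `values` holds the entries for genes 1 and 3 at the first time point of the second series, and `pairs` lists the two transitions between the three time points.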
Benchmark Datasets
Evaluation Metrics
Comparative Study of the Existing Algorithms
Implementations
The ARTIVA source code is publicly available as an R package of the same name (ARTIVA, version: 1.2.3). The source codes of TVDBN-0, TVDBN-bino-hard, TVDBN-bino-soft, TVDBN-exp-hard and TVDBN-exp-soft are publicly available as a package named 'EDISON' (EDISON, version: 1.1.1).
Results
Problem Statement
Initiatives such as the Precision Medicine Initiative (PMI) (https://www.whitehouse.gov/.precision-medicine) and the Google Baseline Study (GBS) (https://en.wikipedia.org/wiki/Baseline_Study) are expected to generate such huge data sets.
Chapter Summary
Development of the Baseline Algorithm
Biologically, this implies that the expression level of vi at time point tp has no regulatory effect on that of vj during the time interval (tp, t(p+1)). On the other hand, the presence of that edge implies that there is a non-zero probability that the expression level of vi at tp has influenced that of vj during the time interval (tp, t(p+1)). The Bayesian Information Criterion (BIC) is used with Bene to calculate the scores of sets of candidate regulators.
Development of a Novel Algorithm: The TGS Algorithm (short
However, the Achilles heel of this strategy is that the prediction is strongly dependent on the user-defined threshold value (Liu et al., 2016, Section 'Effects of the threshold parameters'). It reconstructs a weighted MI network over all genes from a gene expression dataset without requiring a user-defined threshold. Therefore, for a high-throughput human genome-scale time series gene expression dataset where (T − 1) = o(V) and Mf = o(lg V), the time complexity of TGS tends asymptotically to a polynomial, while that of TBN remains exponential.
Results
- Discretisation of the Datasets
- Implementations
- Learning From Dataset Ds10n
- Learning From Datasets Ds50n and Ds100n
- Effects of Noise on Learning Power and Speed
The reason is that the CLR step in TGS captures 7 out of 10 true edges, even from this noisy dataset; the downstream Bene step exploits the high recall of the CLR step to identify at least as many true edges as TBN does, while avoiding the search over as many potential false edges as possible. The two ordered values in each cell of rows 'TBN' and 'TGS' correspond to two different data discretisation algorithms, 2L.wt and 2L.Tesla, respectively. For the larger number of relationships that do not exist, ARTIVA is less likely than TGS to mistake them for true relationships.
Excerpt and Future Work
In fact, its main memory requirement grows exponentially with the number of genes (and thus the number of candidate regulators for each gene) in a given data set. In the current implementation of TGS, the maximum number of candidate regulators is limited to fourteen for each gene to avoid this problem. Relaxing this constraint is a significant challenge, as the true number of regulators for a gene is not known in advance.
Contributions
Chapter Summary
This algorithm offers recall competitive with that of TGS and precision competitive with that of ARTIVA. Instead, it feeds the raw mutual information matrix to the ARACNE algorithm (Section 3.1.3.2), which refines the matrix. Such false-positive mutual information values are detected by ARACNE and reduced to zero in the mutual information matrix.
Results
Implementations
This trade-off between true positives and false positives can be very useful for users. A user who wants to experimentally verify the predicted edges one by one would prefer fewer false-positive edges, even at the cost of fewer true positives. On the other hand, a user who wants to apply other computational methods to the predicted network would prefer as many true-positive edges as possible for further processing, even at the cost of a larger number of false positives.
Learning from the Benchmark Datasets
The reason is that every false positive edge leads to an unnecessary set of experiments, causing waste of valuable resources. Thus, two variants of the TGS algorithm meet two different sets of demands from the users.
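The trade-off can be quantified with precision and recall computed over predicted and true edge sets. The sketch below uses the standard definitions; the function name and the toy edge sets are assumptions made for illustration and do not reproduce any result from the thesis.

```python
def precision_recall(predicted, true):
    """Precision and recall of a predicted directed edge set.

    Both arguments are sets of (regulator, target) pairs.
    Precision = TP / (TP + FP); recall = TP / (TP + FN).
    A ratio with a zero denominator is reported as 0.0.
    """
    tp = len(predicted & true)
    fp = len(predicted - true)
    fn = len(true - predicted)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical edge sets: the direction of an edge matters,
# so ("d", "a") does not match the true edge ("a", "d").
true_edges = {("a", "b"), ("b", "c"), ("c", "d"), ("a", "d")}
predicted_edges = {("a", "b"), ("b", "c"), ("d", "a")}

precision, recall = precision_recall(predicted_edges, true_edges)
```

A user who verifies edges experimentally would favour the variant with the higher precision, whereas a user who post-processes the network computationally would favour the one with the higher recall.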
Excerpt and Future Work
In TGS, shortlisting is performed based on the "raw" mutual information matrix estimated from the dataset. This algorithm is well known for removing a significant number of false-positive mutual information values at the cost of a reasonable number of true-positive values. It has been observed empirically that TGS+ produces significantly fewer false-positive edges than TGS.
Contributions
In the second step, the shortlist is thoroughly examined to select the final set of regulators.
Chapter Summary
In Section 7.2, we measure our progress in the previous chapters and discuss its limitations. In the first step, the 'expressions' of the genes in question are measured at specific time intervals over a predetermined period of time. Therefore, the second step is to reverse-engineer (hereafter 'reconstruct') the time-varying GRN structures from the given dataset.
Limitations of the Previously Proposed Algorithms
Measuring the expression values of all the genes in question at all time points results in a time series gene expression data set. This particular step is known as "time-varying GRN reconstruction from time-series gene expression data". Each time series contains the measured expressions of V genes across T time points.
Investigations into the Origin of the Limitations
In this step, the main memory requirement (hereafter simply 'memory requirement') increases exponentially with the number of candidate regulators. This parameter accepts a positive integer value from the user and limits the maximum number of candidate regulators to that value. For example, if 'max fan-in = 14', then at most fourteen candidate regulators can be shortlisted for each node.
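A minimal sketch of how such a cap might work is given below, assuming that the candidates with the highest mutual information are the ones retained; this ranking rule and both function names are assumptions of the sketch, not a description of the TGS source code. The second function shows why the cap matters: a shortlist of k candidates has 2^k possible regulator subsets.

```python
def shortlist_candidates(weights, max_fanin=14):
    """Keep at most max_fanin candidates, strongest mutual information first.

    weights maps each candidate regulator to its mutual information with
    the target gene; candidates with a non-positive weight are discarded.
    """
    positive = [(gene, w) for gene, w in weights.items() if w > 0]
    positive.sort(key=lambda pair: pair[1], reverse=True)
    return [gene for gene, _ in positive[:max_fanin]]


def num_candidate_subsets(k):
    """Number of subsets (including the empty set) of k shortlisted candidates."""
    return 2 ** k


# Hypothetical weights for one target gene.
weights = {"g1": 0.5, "g2": 0.0, "g3": 0.9, "g4": 0.2}
top_two = shortlist_candidates(weights, max_fanin=2)
subsets_at_cap = num_candidate_subsets(14)
```

With 'max fan-in = 14', the final selection step has 16384 candidate subsets to consider per node, and every additional shortlisted candidate doubles that number.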
A Novel Idea for Overcoming the Limitations
Store the highest curr.score in best.score and the corresponding curr.set in best.set for each iteration. Subsequently, the values of curr.set and curr.score are copied to best.set (best subset so far) and best.score (best score so far), respectively. Otherwise, if curr.score is less than or equal to best.score, best.score and best.set are kept unchanged.
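The bookkeeping described above can be sketched as a simple loop over candidate subsets. The names mirror curr.set, curr.score, best.set and best.score; starting from the empty set, enumerating subsets by increasing size, and the toy scoring function are assumptions of this sketch rather than details confirmed by the text.

```python
from itertools import combinations


def best_regulator_set(candidates, score):
    """Scan all subsets of the candidates and keep the highest-scoring one.

    score is any function that maps a tuple of candidates to a number.
    best_set and best_score are replaced only when curr_score is strictly
    greater than best_score; otherwise they are kept unchanged.
    """
    best_set = ()
    best_score = score(best_set)
    for size in range(1, len(candidates) + 1):
        for curr_set in combinations(candidates, size):
            curr_score = score(curr_set)
            if curr_score > best_score:
                best_set, best_score = curr_set, curr_score
    return best_set, best_score


def toy_score(subset):
    """Hypothetical score: reward two 'true' regulators, penalise set size."""
    return len(set(subset) & {"g1", "g3"}) - 0.1 * len(subset)


best_set, best_score = best_regulator_set(["g1", "g2", "g3"], toy_score)
```

Because only the current and the best subset are held at any time, a loop of this shape needs memory proportional to the size of a single subset rather than to the number of subsets, at the cost of still visiting every subset once.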
Design of Novel Algorithms Based on the Novel Idea
Experimental Results
- Comparative Study Against a Random Classifier
- Comparative Study Against Alternative Algorithms
- Comparative Study Against Time-invariant Algorithms
- Results with a Large-scale Dataset
On the other hand, TGS-Lite occupies only 0.7% of the memory at the same time. Finally, we compare the running time of BTA with that of the proposed algorithms (Table 7.7). It is observed that the running time of BTA is in hours, while that of the proposed algorithms is in minutes.
Excerpt and Future Work
TGS+.mf14 is able to process the data from the embryo stage, the longest stage, in only 44 minutes. This subset is specific to a particular time interval, as the selection is based on that time interval's gene expression data. However, if such a regulator is omitted from the shortlist, there is no way to capture that regulator in the final list of the relevant time interval.
Contributions
Chapter Summary
The goal of the previously proposed algorithms is to identify the regulators of each gene during each time interval. In the second step, they select a subset of the short-listed regulators for each time interval. Therefore, the genes that do not share significantly high mutual information over the entire time series with the gene in question are less likely to be shortlisted as the candidate regulators of the latter gene.
A Novel Idea for Overcoming the Limitations
The shortlist is time invariant because the framework uses the entire time series dataset to calculate the mutual information values of other genes with the gene of interest; then the genes that share statistically significant mutual information with the gene of interest are shortlisted. This strategy is useful for shedding those genes that have no regulatory effects on the gene of interest during any interval. However, the strategy can also reject genes that have regulatory effects on the gene of interest for a small number of time intervals.
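The following toy example illustrates how a regulator that is active for only a few time points can be diluted when mutual information is estimated over the entire series. It uses a plug-in estimate on discretised values; the function name and the data are made up for illustration, and the example ignores both the time lag between regulator and target and the statistical significance test used for shortlisting.

```python
import math
from collections import Counter


def mutual_information(x, y):
    """Plug-in mutual information (in nats) of two discrete sequences."""
    n = len(x)
    count_x, count_y = Counter(x), Counter(y)
    count_xy = Counter(zip(x, y))
    return sum(
        (c / n) * math.log(c * n / (count_x[a] * count_y[b]))
        for (a, b), c in count_xy.items()
    )


# Hypothetical discretised expressions: the target copies the regulator
# during the first four time points and is unrelated to it afterwards.
regulator = [0, 1, 0, 1] + [0, 0, 1, 1] * 3
target = [0, 1, 0, 1] + [0, 1, 0, 1] * 3

mi_whole = mutual_information(regulator, target)
mi_window = mutual_information(regulator[:4], target[:4])
```

In this example the window-specific estimate equals log 2, while the estimate over the whole series is much smaller, so a shortlist built from the whole series could overlook this regulator.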
Design of Novel Algorithms Based on the Novel Idea
The Issue with Extending TGS+ and TGS-Lite+
In TGS+ and TGS-Lite+, the shortlisting step is performed based on "refined" mutual information values. If the relationship is found to be indirect, the "refined" mutual information between vi and vj is considered to be zero. Therefore, we need to develop an algorithm that can produce one refined mutual information matrix for each time interval.
Developing a Time-varying Refinement Strategy
More specifically, our claim is that vi tp shares a non-zero mutual information with one of the true regulators of vj t(p+1). The only reason behind this observation is that vi tp shares a non-zero mutual information with vk tp. Since vi tp shares a non-zero mutual information with vk tp, which in turn shares a non-zero mutual information with vj t(p+1), vi tp can share a non-zero mutual information with vj t(p+1), i.e.
Section Summary
The full version of ARACNE-T takes an ordered list of time-varying raw mutual information matrices as input. As a result, the output of ARACNE-T is an ordered list of time-varying, refined mutual information matrices. From this dataset, an ordered list of time-varying raw mutual information matrices is estimated.
Experimental Setup
Evaluation Strategy
The latter (right) differs from the former (left) in two places: first, ARACNE is replaced by ARACNE-T; second, CLR is replaced with CLR-T.
Implementations
In the case of TGS, all 84 cases of G4 are compared with those of the other genes. However, this does not guarantee that the proposed algorithms perform better than a random classifier. For two of the three benchmark datasets, TGS-T captures significantly more edges than the previously proposed algorithms.
At the same time, the former captures as many true edges as the latter algorithm. Although TGS-T+ makes fewer false-positive predictions than TGS-T, the numbers remain higher than those of the previously proposed algorithms.