Algorithm 3 ARACNE
1: Input: Dataset D, vertex-set V, random vector X.
2: Output: An undirected graph G.
3: Edge-set E ← ∅.
4:
5: ## Reconstruct a mutual information relevance network
6: for each pair of random variables {Xi, Xj} ∈ X do
7:     Compute Mij = mutual information of their expression vectors.
8:     if Mij ≥ predefined threshold then
9:         E ← E ∪ {(i, j)}. ▷ Add an undirected edge.
10:     end if
11: end for
12: G = (V, E).
13:
14: ## Prune false-positive edges
15: for each fully connected triplet (i.e. a triangle) {i, j, k} ∈ G do
16:     if Mik ≤ min(Mij, Mjk) then
17:         E ← E \ {(i, k)}. ▷ Remove edge (i, k).
18:     end if
19: end for
20: return G = (V, E).
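As a concrete illustration of Algorithm 3, the following minimal Python sketch reconstructs a thresholded mutual-information network and then applies DPI-style pruning. It is not the reference ARACNE implementation: the binned estimator "mutual_info" and the cut-off "mi_threshold" are placeholder assumptions.

from itertools import combinations

import numpy as np
from sklearn.metrics import mutual_info_score

def aracne(data, mi_threshold=0.1, n_bins=10):
    """ARACNE-style sketch. data: (p, N) array; row i holds gene i's expression vector.
    Returns the edge set of an undirected graph over {0, ..., p-1}."""
    p, _ = data.shape

    # Crude MI estimate via equal-width binning (placeholder estimator).
    def mutual_info(x, y):
        cx = np.digitize(x, np.histogram_bin_edges(x, bins=n_bins))
        cy = np.digitize(y, np.histogram_bin_edges(y, bins=n_bins))
        return mutual_info_score(cx, cy)

    # Stage 1: mutual information relevance network.
    mi = np.zeros((p, p))
    edges = set()
    for i, j in combinations(range(p), 2):
        mi[i, j] = mi[j, i] = mutual_info(data[i], data[j])
        if mi[i, j] >= mi_threshold:
            edges.add((i, j))

    # Stage 2: DPI pruning -- in every complete triangle, mark the weakest edge.
    to_remove = set()
    for i, j, k in combinations(range(p), 3):
        triangle = [(i, j), (j, k), (i, k)]
        if all(e in edges for e in triangle):
            to_remove.add(min(triangle, key=lambda e: mi[e[0], e[1]]))
    return edges - to_remove

For example, "aracne(np.random.rand(20, 50))" applied to a (20 genes × 50 samples) matrix returns the pruned edge set.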
Merit(s) of ARACNE
• ARACNE monotonically reduces the number of false positive edges in the mutual information relevance networks.
Limitation(s) of ARACNE
• It assumes that the converse of the DPI principle holds, which is not necessarily true. Hence, ARACNE may prune true positive edges that represent direct interactions. The authors of ARACNE acknowledge this limitation, as noted in (Bansal et al., 2007, Section ‘ARACNE’).
In CI models, two system components i and j are connected with an edge if and only if the corresponding random variables Xi and Xj are not conditionally independent given Xk, a subset of the rest of the random variables. Whether the edge is directed or undirected depends on the specific model. Different classes of CI models are obtained depending on how Xk is defined:
• Full CI Models: Xk = X \ {Xi, Xj}
• Low Order CI Models: Xk ⊂ X \ {Xi, Xj}
• Bayesian Network: Xk = XS for every XS ⊆ X \ {Xi, Xj}
These CI models are comparatively reviewed in the next sections.
3.2.1 Full Conditional Independence (CI) Models / Markov Random Fields (MRFs) / Markov Networks (MNs)
The Full CI models ask whether the observed correlation between two random variables can be explained by the rest of the variables. By definition, two system components i and j are connected with an undirected edge if and only if the corresponding random variables Xi and Xj are not conditionally independent given Xrest = X \ {Xi, Xj}, the rest of the random variables in X.
3.2.1.1 Gaussian Graphical Models (GGMs)
Full CI models are called GGMs when the given random vector X follows a multivariate Gaussian distribution N(µ, Σ). Here µ is the mean vector of dimension p (i.e. a p-vector) and Σ = (σij) is the covariance matrix of dimension (p × p). Σ is a symmetric matrix since σij = σji = the covariance between the variables Xi and Xj. GGM assumes that Σ is invertible, i.e. K = Σ⁻¹ exists. K is called the precision matrix or concentration matrix. Since the inverse of an invertible symmetric matrix is also symmetric (proof at Deriso (2013)), K is a symmetric matrix. The value

ρij = −kij / √(kii kjj)

is known as the partial correlation coefficient (pcc) between Xi and Xj, where kij stands for the (i, j)-th entry in K. It satisfies the following bi-implications for i ≠ j:

kij = 0 ⇐⇒ ρij = 0 ⇐⇒ Xi ⊥ Xj | Xrest    (3.2)

An undirected edge is drawn between two system components i and j if and only if kij has a non-zero value, i.e. the correlation between Xi and Xj cannot be explained by the rest of the variables (Algorithm 4).
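As an illustration of Equation (3.2) and Algorithm 4, the Python sketch below (assuming N > p so that the sample covariance is invertible) estimates the precision matrix, converts it to partial correlations, and adds an edge wherever the pcc is non-negligible; the tolerance "tol" is an assumed numerical cut-off, since estimated entries are never exactly zero.

import numpy as np

def ggm_edges(data, tol=1e-2):
    """Minimal GGM reconstruction sketch (valid only when N > p).
    data: (p, N) array; row i holds the N observations of X_i.
    Returns a list of undirected edges (i, j) with |pcc_ij| > tol."""
    cov = np.cov(data)                 # (p, p) sample covariance Sigma
    K = np.linalg.inv(cov)             # precision / concentration matrix
    d = np.sqrt(np.diag(K))
    pcc = -K / np.outer(d, d)          # rho_ij = -k_ij / sqrt(k_ii * k_jj)
    p = data.shape[0]
    return [(i, j) for i in range(p) for j in range(i + 1, p)
            if abs(pcc[i, j]) > tol]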
In practice, GGM faces a challenge when p ≫ N, which is a very common scenario in high-throughput experiments. In such a case, Σ becomes non-invertible.
3.2.1. Theorem. Σ is non-invertible when p > N.
Proof.
Σ = (1 / (N − 1)) (D − D̄)(D − D̄)ᵀ

(Schäfer and Strimmer, 2005, Section ‘Graphical Gaussian models’), where D = (dij) and D̄ = (d̄ij) such that d̄ij = d̄i· = (1/N) ∑_{j=1}^{N} dij, i.e. the mean value of the random variable Xi.
Algorithm 4 Gaussian Graphical Model (GGM) Reconstruction
1: Input: Dataset D, vertex-set V, random vector X.
2: Output: An undirected graph G.
3: Edge-set E ← ∅.
4: Estimate the concentration matrix K = Σ⁻¹.
5: for each pair of random variables {Xi, Xj} ∈ X do
6:     if kij ≠ 0 then
7:         ## kij = (i, j)-th entry in K.
8:         E ← E ∪ {(i, j)}. ▷ Add an undirected edge (i, j).
9:     end if
10: end for
11: return G = (V, E).
Thus both the D and D̄ matrices are of dimension (p × N). Hence D − D̄ is a matrix of dimension (p × N). Therefore

rank(D − D̄) = rank((D − D̄)ᵀ) ≤ min(p, N) ≤ N.

Since for any two multipliable matrices A and B, rank(AB) ≤ min(rank(A), rank(B)),

rank(Σ) = rank((D − D̄) · (D − D̄)ᵀ) ≤ min(rank(D − D̄), rank((D − D̄)ᵀ)) ≤ N < p

⟹ Σ is non-invertible. □
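A quick numerical check of the theorem, using Gaussian data simulated with NumPy (an illustrative assumption), shows that the sample covariance matrix is rank-deficient, and hence singular, whenever p > N:

import numpy as np

rng = np.random.default_rng(0)
p, N = 100, 20                      # many more variables than samples
data = rng.normal(size=(p, N))      # rows = variables, columns = observations

sigma = np.cov(data)                # (p, p) sample covariance matrix
rank = np.linalg.matrix_rank(sigma)

print(rank)                         # at most N - 1 = 19, far below p = 100
print(rank < p)                     # True: Sigma is singular, inv() would fail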
In a seminal paper, Schäfer and Strimmer (2005) address this issue. Algorithm 5 is proposed and implemented as an R package named ‘GeneTS’.
The major contribution of GeneTS is twofold:
• It replaces the matrix inverse with the Moore–Penrose pseudoinverse (Penrose, 1955) Σ⁺ = V D⁻¹ Uᵀ, where V and U hold the eigenvectors of ΣᵀΣ and ΣΣᵀ, respectively, and D is a square diagonal matrix of rank ≤ min(p, N) whose non-zero diagonal entries are trivially inverted. These factors are obtained by Singular Value Decomposition (SVD) of Σ into U D Vᵀ. This strategy eliminates the challenge of not being able to estimate the GGM when p ≫ N.
• The Bootstrap Aggregation (bagging) scheme (Breiman, 1996) is incorporated to reduce the variance of the estimate. (A minimal sketch combining both ideas is given after Algorithm 5.)
A computationally efficient algorithm for biological GGM inference is proposed in Wang et al. (2016) and implemented as an R package called ‘FastGGM’. Through simulation, it demonstrates the ability to accurately (Area Under ROC Curve (AUROC) = 0.96) reconstruct a GGM in a high-dimensional setting {N = 1,000, p = 10,000} (Wang et al., 2016, Table 1). FastGGM makes the sparsity assumption that the maximum degree of the vertices in the true GGM is o(√N / log p). In addition to providing the partial correlation coefficient between the random variables corresponding to the two end points of an edge, it also calculates a p-value and confidence interval for the edge.
Algorithm 5 GeneTS GGM Reconstruction
1: Input: Dataset D, vertex-set V, random vector X, number of bootstrap samples to generate B.
2: Output: An undirected graph G.
3: Generate B bootstrap samples with replacement from D. ▷ Bootstrap Aggregation (bagging).
4: for each bootstrap sample b do
5:     Estimate the concentration matrix Kb = Σb⁺. ▷ Moore–Penrose pseudoinverse.
6:     Initialize the adjacency matrix Ab of dimension (p × p) for the to-be-reconstructed graph with zeroes in each cell.
7:     for each pair of random variables {Xi, Xj} ∈ X do
8:         if kij ≠ 0 then
9:             ## kij = (i, j)-th entry in Kb.
10:             Ab(i, j) and Ab(j, i) ← kij. ▷ Add an undirected edge (i, j).
11:         end if
12:     end for
13: end for
14: Final adjacency matrix A ← mean(Ab) over all b.
15: return A.
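A minimal Python sketch of the Algorithm 5 idea (bagging plus pseudoinverse), not the actual GeneTS implementation, could look like the following; the number of bootstrap samples B and the use of np.linalg.pinv are assumptions made for illustration.

import numpy as np

def genets_style_ggm(data, B=100, seed=0):
    """Bagged pseudoinverse-based GGM sketch (not the GeneTS R code).
    data: (p, N) array; row i holds the N observations of X_i.
    Returns the (p, p) averaged adjacency (concentration) matrix A."""
    rng = np.random.default_rng(seed)
    p, N = data.shape
    A = np.zeros((p, p))

    for _ in range(B):
        # Bootstrap sample: draw N columns (observations) with replacement.
        cols = rng.integers(0, N, size=N)
        sample = data[:, cols]

        # Moore-Penrose pseudoinverse of the sample covariance matrix.
        K = np.linalg.pinv(np.cov(sample))

        # Off-diagonal entries of K play the role of edge weights.
        A += K

    A /= B
    np.fill_diagonal(A, 0.0)   # keep only edge weights, drop the diagonal
    return A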
3.2.2 Low Order Conditional Independence (CI) Models
Let us discuss the motivation behind this class of models with the help of a practical example. Suppose a system component i simultaneously activates system components j and k. In such a case, the Pearson correlation coefficient between the random variables Xj and Xk could be very high, yet their partial correlation coefficient (pcc) given Xi is highly likely to be very low, indicating the absence of any direct dependency. Thus a single third variable can be sufficient to identify an indirect dependency between a pair of variables. This significantly reduces the computational complexity of testing each pair w.r.t. all other variables, especially when they are large in number. In the case of gene expression data, it is hypothesized that every gene is regulated by only a handful of other genes. If that is the case, then a small subset of genes should be able to explain the indirect correlation, if any, between a given pair of genes. This insight motivates the First Order CI model (Magwene and Kim, 2004; De La Fuente et al., 2004; Wille and Bühlmann, 2006; Wille et al., 2004) for GGM, where an undirected edge is drawn between two system components i and j if and only if Xi ⊥̸ Xj | Xk for all Xk ∈ X \ {Xi, Xj}. This model automatically eliminates the p ≫ N problem.
For each test, instead of considering the whole (p × N)-dimensional data matrix, only a sub-matrix of dimension (3 × N) is adequate for reconstruction. This results in an exact estimation of the GGM when N ≥ 3. Similarly, the nth Order CI model asks whether the observed correlation, if any, between two variables can be explained by any n-sized subset of the rest of the variables. Given a dataset, the choice of n depends on biological insight, computational constraints, etc.
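A minimal sketch of the first-order test, assuming Gaussian data so that the first-order partial correlation can be computed from pairwise Pearson correlations, might look as follows; the cut-off "tol" is an assumed threshold standing in for a proper statistical test.

import numpy as np

def first_order_ci_edges(data, tol=0.05):
    """First Order CI model sketch: keep edge (i, j) only if no single
    third variable k explains the correlation between X_i and X_j away.
    data: (p, N) array; row i holds the N observations of X_i."""
    p = data.shape[0]
    r = np.corrcoef(data)   # (p, p) Pearson correlation matrix
    edges = []
    for i in range(p):
        for j in range(i + 1, p):
            direct = True
            for k in range(p):
                if k in (i, j):
                    continue
                # First-order partial correlation of X_i and X_j given X_k
                # (assumes |r[i, k]|, |r[j, k]| < 1).
                pcc = (r[i, j] - r[i, k] * r[j, k]) / np.sqrt(
                    (1 - r[i, k] ** 2) * (1 - r[j, k] ** 2))
                if abs(pcc) < tol:    # some X_k explains the correlation away
                    direct = False
                    break
            if direct:
                edges.append((i, j))
    return edges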
3.2.3 Bayesian Network (BayesNet)
Pearl (1985) introduces BayesNet to generalize the idea of Conditional Independence (CI) models. Unlike other CI models, it produces a Directed Acyclic Graph (DAG).
A directed edge is drawn from system component i to another component j if and only if their correlation cannot be explained by any subset of the rest of the variables. Formally, Xi ⊥̸ Xj | XS for all XS ⊆ X \ {Xi, Xj}. Thus BayesNet envelops correlation networks and all the previously discussed CI models (a sketch of this subset-wise test is given after the list below):
• Correlation networks when XS ← ∅.
• Full CI models when XS ← X \ {Xi, Xj}.
• First Order CI models when XS ← Xk for all Xk ∈ X \ {Xi, Xj}.
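The following Python fragment sketches the exhaustive subset-wise criterion for a single pair (i, j); the Boolean helper "cond_independent" is a hypothetical placeholder for whatever conditional-independence test one uses, and only skeleton discovery (not edge orientation) is shown.

from itertools import chain, combinations

def has_edge(i, j, variables, cond_independent):
    """BayesNet-style skeleton criterion for one pair (i, j).
    variables: list of variable indices.
    cond_independent(i, j, S): hypothetical CI test returning True if
        X_i is independent of X_j given the subset S of variables.
    An edge is kept iff NO subset S of the remaining variables renders
    X_i and X_j conditionally independent."""
    rest = [v for v in variables if v not in (i, j)]
    # All subsets of the remaining variables, from the empty set upwards;
    # there are 2^(p-2) of them, as counted later among the limitations.
    all_subsets = chain.from_iterable(
        combinations(rest, r) for r in range(len(rest) + 1))
    return all(not cond_independent(i, j, S) for S in all_subsets)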
3.2.3.1 Merit(s) of BayesNet
• Due to testing w.r.t. all orders of Conditional Independence (CI), this model is guaranteed to produce the lowest number of edges among all CI models.
• The edge direction implies possible regulator-regulatee relationship.
• Each random variable Xv is described by a Local Probability Distribution (LPD) p(Xv). A key property of BayesNet is that the Joint Probability Distribution (JPD) P(·) over all the variables can be factorized as follows:

P(X) = ∏_{v∈V} p(Xv | Xpa(v))    (3.3)

where pa(v) is the set of all vertices from which there is an edge to vertex v, and Xpa(v) is the set of all random variables corresponding to the vertex set pa(v). With the help of this property, the variables can be partitioned into families. Efficient network estimation and analysis techniques based on the divide-and-conquer paradigm have been devised to exploit this property, especially when the network is very large (Nikolova et al., 2013). A worked example of the factorization is sketched below.
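As an illustration of Equation (3.3), consider a hypothetical three-vertex DAG X1 → X2 → X3 with binary variables; the sketch below evaluates the joint probability as the product of the local distributions (the probability tables are made-up numbers used purely for illustration).

# Hypothetical DAG: X1 -> X2 -> X3, all variables binary (0/1).
# Local Probability Distributions (LPDs); numbers are illustrative only.
p_x1 = {1: 0.3, 0: 0.7}                                  # p(X1)
p_x2_given_x1 = {(1, 1): 0.8, (1, 0): 0.1,               # p(X2 | X1)
                 (0, 1): 0.2, (0, 0): 0.9}
p_x3_given_x2 = {(1, 1): 0.6, (1, 0): 0.05,              # p(X3 | X2)
                 (0, 1): 0.4, (0, 0): 0.95}

def joint(x1, x2, x3):
    """Equation (3.3): P(X) = product over v of p(X_v | X_pa(v))."""
    return (p_x1[x1]
            * p_x2_given_x1[(x2, x1)]
            * p_x3_given_x2[(x3, x2)])

# Example: P(X1=1, X2=1, X3=0) = 0.3 * 0.8 * 0.4 = 0.096
print(joint(1, 1, 0))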
3.2.3.2 Limitation(s) of BayesNet
• For each ordered pair of random variables (Xi, Xj), the number of tests to be performed is |{XS : XS ⊆ X \ {Xi, Xj}}| = 2^(p−2). There are 2·C(p, 2) = p(p − 1) such ordered pairs. Hence, a total of p(p − 1)·2^(p−2) tests are needed to reconstruct a BayesNet; for example, even with only p = 20 variables this already amounts to 380 · 2^18 ≈ 10^8 tests. Therefore, the computational complexity increases exponentially with the number of observed variables, rendering inference infeasible in high-dimensional settings. As discussed in (Markowetz and Spang, 2007, Section ‘Score based structure learning’), the following strategy is generally adopted to overcome this issue:
– Define a search space of possible models; if possible, reduce the search space by applying prior knowledge.
– Define a scoring function to compute the fitness of each model in the search space with respect to the given data.
– Construct an optimization problem to find the fittest model.
– Solve the optimization problem.
• BayesNet strictly prohibits cyclic dependencies. This is an unrealistic constraint, since biological systems are known to possess feedback loops that facilitate adaptivity, one of the salient features of dynamic living systems.
• The biggest theoretical challenge with BayesNet is Markov equivalence. Two different BayesNets are called Markov equivalent if they possess the same underlying undirected graph (skeleton) and the same directed collider sub-graph(s) of the form Xi → Xj ← Xk. For example, Xi → Xj → Xk and Xi ← Xj ← Xk share the same skeleton and contain no colliders, hence they are Markov equivalent. Two such BayesNets are statistically indistinguishable. This implies that, given a dataset, we can only infer to which Markov equivalence class the output BayesNet is likely to belong; we cannot point out the exact member of the class that represents the true dependency structure, even when N → ∞.