
\begin{align}
T_{\mathrm{TGS}}(V) &= O\!\left(V^2\right) + o\!\left((T-1)\, V\, M_f^2\, 2^{(M_f-2)}\right) \tag{5.1}\\
&= O\!\left(V^2\right) + o\!\left(o(V)\, V\, M_f^2\, 2^{(M_f-2)}\right) && \because (T-1) = o(V) \nonumber\\
&= O\!\left(V^2\right) + o\!\left(o(V)\, V\, o(\lg V)^2\, 2^{(o(\lg V)-2)}\right) && \because M_f = o(\lg V) \nonumber\\
&= O\!\left(V^2\right) + o\!\left(o\!\left(\frac{V^2 (\lg V)^2\, 2^{o(\lg V)}}{4}\right)\right) \nonumber\\
&= O\!\left(V^2\right) + o\!\left(o\!\left(V^2 (\lg V)^2\, o(V)\right)\right) \nonumber\\
&= O\!\left(V^2\right) + o\!\left(V^3 (\lg V)^2\right) \nonumber\\
&= o\!\left(V^3 (\lg V)^2\right) \tag{5.2}
\end{align}
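As a sanity check on the derivation above, the following minimal Python sketch evaluates the two terms of Equation 5.1 for a few illustrative values of V, assuming (T - 1) = V and Mf = ceil(lg V) purely for illustration. These assumptions are not part of the thesis and the printed numbers carry no empirical meaning; they only show how the per-gene search term comes to dominate the CLR term.

```python
# Illustrative evaluation of the two terms of Equation 5.1.
# Assumptions (for illustration only): (T - 1) = V and Mf = ceil(lg V).
import math

def tgs_terms(V):
    T_minus_1 = V                          # stand-in for (T - 1) = o(V)
    Mf = max(2, math.ceil(math.log2(V)))   # stand-in for Mf = o(lg V)
    clr_term = V ** 2                              # cost of the CLR step
    search_term = T_minus_1 * V * Mf ** 2 * 2 ** (Mf - 2)  # per-gene Bene searches
    return clr_term, search_term

for V in (10, 50, 100, 1000):
    clr, search = tgs_terms(V)
    print(f"V = {V:5d}: CLR term ~ {clr:.2e}, per-gene-search term ~ {search:.2e}")
```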

challenge, while Ds10 was released after the challenge; therefore, the reader can compare the performance of the algorithms in this study with that of the algorithms employed during the challenge.

In the current study, TBN and TGS both perform better than ARTIVA in terms of learning power, except in FP (Table 5.1)². They also outperform ARTIVA in speed (Table 5.2). Between TBN and TGS, TGS is found to have the faster learning speed. This is expected, since the regulator search space for each gene is monotonically smaller in TGS than in TBN. The more interesting observation is that TGS, a heuristic-based approximate search algorithm, also performs competitively with TBN, an exhaustive search algorithm, in every metric of learning power. The reason is that the CLR step in TGS captures 7 out of 10 true edges even from this noisy dataset; this high recall of the CLR step is exploited by the downstream Bene step to identify at least as many true edges as TBN does, while avoiding the search over many potential false edges. This reasoning is supported by the fact that TGS suffers from considerably fewer FPs than TBN. It is also observed that discretising the input gene expression data based on domain-specific knowledge (here, the wild-type expression values of the genes) improves learning compared to the domain-independent alternative (Table 5.1).
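For reference, the recall, precision and F1 values in Table 5.1 follow directly from TP, FP and the size of the gold-standard GRN; the recall values imply that the Ds10n gold standard contains 10 true edges. The following minimal Python sketch reproduces the TBN and TGS rows for the 2L.wt discretisation from these quantities. It only illustrates the metric definitions and is not the evaluation code used in this study.

```python
# Illustrative recomputation of recall, precision and F1 from TP and FP,
# assuming the Ds10n gold-standard GRN has 10 true edges (as implied by
# the recall values in Table 5.1). Not the evaluation code of this study.

def power_metrics(tp, fp, true_edges):
    recall = tp / true_edges
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return recall, precision, f1

# TBN and TGS with the 2L.wt discretisation (first value of each pair in Table 5.1)
for name, tp, fp in [("TBN", 3, 17), ("TGS", 3, 10)]:
    r, p, f1 = power_metrics(tp, fp, true_edges=10)
    print(f"{name}: recall={r:.3f}, precision={p:.3f}, F1={f1:.3f}")
# Expected: TBN recall 0.300, precision 0.150, F1 0.200;
#           TGS recall 0.300, precision 0.231, F1 0.261
```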

Table 5.1: Learning Power of the Selected Algorithms on Dataset Ds10n. TP = True Positive, FP = False Positive. The two ordered values in each cell of the ‘TBN’ and ‘TGS’ rows correspond to the two data discretisation algorithms 2L.wt and 2L.Tesla, respectively. The other rows have a single value per cell, since the other algorithms do not require the dataset to be discretised. The numerical values are rounded off to three decimal places. In each column, the best value(s) is boldfaced.

Algorithm           TP       FP        Recall      Precision       F1
TBN                 (3, 1)   (17, 25)  (0.3, 0.1)  (0.15, 0.038)   (0.2, 0.056)
TGS                 (3, 2)   (10, 12)  (0.3, 0.2)  (0.231, 0.143)  (0.261, 0.167)
ARTIVA              0        9         0           0               0
TVDBN-0             0        1         0           0               0
TVDBN-bino-hard     1        7         0.1         0.125           0.111
TVDBN-bino-soft     2        9         0.2         0.182           0.190

5.2.4 Learning From Datasets Ds50n and Ds100n

Due to Bene’s main memory requirement of 2^(V+2) bytes (Silander and Myllymäki, 2006, Section 5), both TBN and TGS have the same inherent exponential memory requirement. In theory, that should enable them to learn a network with V ≤ 32 with 31 GB of main memory, since 2^(32+2) bytes = 16 GB < 31 GB. However, it is found empirically that the bnstruct implementation of Bene can only learn a network with V ≤ 15 with that configuration without any segmentation faults. Therefore, the max fan-in variant of TGS is employed for Ds50n and Ds100n with Mf = 14, since that restricts each atomic network-learning problem to a maximum of 15 nodes (1 regulatee and at most 14 candidate regulators). TBN does not have any such provision and hence cannot be applied to these datasets. As a result, TBN is excluded from the current study.
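The memory argument can be verified with a few lines of arithmetic. The sketch below is illustrative only; the 2^(V+2)-byte requirement, the 31 GB budget, the empirical 15-node limit and Mf = 14 are all taken from the discussion above.

```python
# Illustrative check of Bene's main-memory requirement of 2^(V+2) bytes
# (Silander and Myllymaki, 2006, Section 5) against a 31 GB budget.
BUDGET_BYTES = 31 * 2**30   # 31 GB of main memory, as in the text

def bene_memory_bytes(v):
    return 2 ** (v + 2)

for v in (32, 33):
    mem = bene_memory_bytes(v)
    verdict = "fits in" if mem <= BUDGET_BYTES else "exceeds"
    print(f"V = {v}: 2^({v}+2) bytes = {mem / 2**30:.0f} GB, {verdict} the 31 GB budget")

# Empirically, the bnstruct implementation handles at most 15 nodes per problem,
# so the max fan-in variant uses Mf = 14 candidate regulators per regulatee:
Mf = 14
print("Atomic problem size:", 1 + Mf, "nodes (1 regulatee + Mf candidate regulators)")
```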

² Please note that the algorithms TVDBN-exp-hard and TVDBN-exp-soft result in errors for Ds10n.

Table 5.2: Runtime of the Selected Algorithms on Dataset Ds10n. The two ordered values in each cell of the ‘TBN’ and ‘TGS’ rows correspond to the two data discretisation algorithms 2L.wt and 2L.Tesla, respectively. The other rows have a single value per cell, since the other algorithms do not require the dataset to be discretised. In TGS, the CLR step takes 0.003 seconds for 2L.wt and 2L.Tesla each.

Algorithm           Ds10n
TBN                 (7.119s, 6.867s)
TGS                 (5.789s, 5.76s)
ARTIVA              10m 20s
TVDBN-0             2m 24s
TVDBN-bino-hard     2m 15.2s
TVDBN-bino-soft     2m 14.6s

In this study, ARTIVA consistently outperforms TGS in FP, with considerable margins (Tables 5.3 and 5.4)³. Since the true GRNs are believed to be sparse in nature, only a few of all possible regulatory relationships are expected to truly exist. For the much larger number of relationships that do not exist, ARTIVA is less likely than TGS to mistake them for true relationships. However, ARTIVA tends to over-estimate non-existence: it mistakes a large number of true relationships as non-existent, as is evident from its considerably lower TP compared to that of TGS. Another major concern with ARTIVA is its runtime. It takes almost 32 hours to reconstruct the 100-gene GRNs, which is certainly a bottleneck for its application to reconstructing human genome-scale GRNs (Table 5.5). In comparison, TGS consumes only about 18 minutes. Moreover, TGS’s runtime grows almost linearly as the number of genes grows (Figure 5.3). These observations indicate that TGS is substantially more suitable than ARTIVA for reconstructing large-scale GRNs.
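The runtime comparison can be quantified directly from Table 5.5. The following illustrative arithmetic (not part of the study’s pipeline) converts the reported TGS and ARTIVA runtimes to seconds and computes their growth factors when the gene count doubles from 50 to 100.

```python
# Illustrative arithmetic on the runtimes reported in Table 5.5 (TGS uses
# the 2L.wt discretisation), converted to seconds.
tgs_runtime_s = {50: 7 * 60 + 36, 100: 17 * 60 + 49}
artiva_runtime_s = {50: 4 * 3600 + 30 * 60 + 15, 100: 31 * 3600 + 52 * 60 + 54}

print(f"TGS    50 -> 100 genes: runtime x{tgs_runtime_s[100] / tgs_runtime_s[50]:.2f}")
print(f"ARTIVA 50 -> 100 genes: runtime x{artiva_runtime_s[100] / artiva_runtime_s[50]:.2f}")
```

When the gene count doubles, TGS’s runtime grows by roughly a factor of 2.3, consistent with the near-linear trend in Figure 5.3, whereas ARTIVA’s runtime grows by roughly a factor of 7.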

Table 5.3: Learning Power of the Selected Algorithms on Dataset Ds50n. TP = True Positive, FP = False Positive. Algorithm 2L.wt is used for data discretisation in TGS. The numerical values are rounded off to three decimal places. In each column, the best value(s) is boldfaced.

Algorithm           TP    FP    Recall   Precision   F1
TGS                 15    342   0.195    0.042       0.069
ARTIVA              6     64    0.078    0.086       0.082
TVDBN-0             7     199   0.091    0.034       0.049
TVDBN-bino-hard     11    410   0.143    0.026       0.044
TVDBN-bino-soft     14    395   0.182    0.034       0.058

³ Please note that the algorithms TVDBN-exp-hard and TVDBN-exp-soft result in errors for Ds50n and Ds100n.

Table 5.4: Learning Power of the Selected Algorithms on Dataset Ds100n. TP = True Positive, FP = False Positive. Algorithm 2L.wt is used for data discretisation in TGS. The numerical values are rounded off to three decimal places. In each column, the best value(s) is boldfaced.

Algorithm           TP    FP     Recall   Precision   F1
TGS                 28    790    0.169    0.034       0.057
ARTIVA              14    158    0.084    0.081       0.083
TVDBN-0             9     678    0.054    0.013       0.021
TVDBN-bino-hard     26    1304   0.157    0.020       0.035
TVDBN-bino-soft     18    1296   0.108    0.014       0.024

Table 5.5: Runtime of the Selected Algorithms on Datasets Ds50n and Ds100n. For TGS, algorithm 2L.wt is used for data discretisation. The CLR step in TGS takes 0.005 and 0.013 seconds for Ds50n and Ds100n, respectively.

Algorithm           Ds50n        Ds100n
TGS                 7m 36s       17m 49s
ARTIVA              4h 30m 15s   31h 52m 54s
TVDBN-0             11m 59s      52m 17s
TVDBN-bino-hard     9m 38s       2h 53m 32s
TVDBN-bino-soft     8m 8s        17m 20s

Figure 5.3: Runtime of the TGS Algorithm w.r.t. the Number of Genes in the Benchmark Datasets (Table 4.1). The black and grey lines represent the noisy and noiseless versions of the datasets, respectively.

5.2.5 Effects of Noise on Learning Power and Speed

TGS is evaluated on all the noisy and noiseless datasets with different numbers of genes. From Figures 5.3 and 5.4, it can be observed that the presence of noise negatively impacts runtime and precision. This observation can be explained by analysing the effect of noise on the CLR step (Table 5.6). In the absence of noise, the CLR step is able to eliminate more potential false regulators from the candidate set of regulators of each regulatee, resulting in a smaller and more precise shortlist of candidate regulators. That, in turn, improves the precision and speed of the overall algorithm.
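A rough indication of why larger shortlists slow the search down can be obtained from Table 5.6. The sketch below assumes, purely for illustration, that an exhaustive search over a shortlist of k candidate regulators enumerates up to 2^k regulator subsets; the actual Bene search and the max fan-in cap (Mf = 14 for the 50- and 100-gene datasets) make the real cost different, so the numbers only indicate the direction of the effect of noise.

```python
# Indicative only: with k shortlisted candidate regulators for a gene, an
# unrestricted exhaustive parent-set search considers up to 2^k subsets.
# Values of k are the maximum CLR-network neighbourhoods from Table 5.6.
max_neighbours = {  # genes: (noiseless, noisy)
    10: (4, 7),
    50: (24, 33),
    100: (43, 84),
}

for genes, (clean_k, noisy_k) in max_neighbours.items():
    print(f"{genes:3d} genes: noiseless <= 2^{clean_k} subsets, "
          f"noisy <= 2^{noisy_k} subsets (2^{noisy_k - clean_k} times larger)")
```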

Figure 5.4: Precision of the TGS Algorithm w.r.t. the Number of Genes in the Benchmark Datasets. The black and grey bars represent the noisy and noiseless versions of the datasets, respectively. The 2L.wt algorithm is used for data discretisation.

Table 5.6: Maximum Number of Neighbours a Gene has in the CLR Network. Algorithm 2L.wt is used for data discretisation.

Total Number of Genes   Noiseless Dataset   Noisy Dataset
10                      4                   7
50                      24                  33
100                     43                  84