A Unified Approach for Simultaneous Gene Clustering and Differential Expression Identification by Beta-Divergence Method
Md.Shahjaman Department of Statistics Begum Rokeya University Rangpur, Bangladesh [email protected]
Md. Nurul Haque Mollah Department of Statistics University of Rajshahi Rajshahi, Bangladesh [email protected]
Abstract—Clustering and identification of differentially expressed (DE) genes are equally important in most microarray data analyses, but little work has addressed the two tasks together. In this article we propose a method based on the β-divergence that extracts gene-clusters sequentially. A detected gene-cluster is separated from the dataset using a weight function known as the β-weight function, and the same procedure is then applied to the remaining data points to extract the next gene-cluster. The β-weight function produces larger weights for the detected gene-cluster and smaller weights for the clusters not yet recovered. The differential gene-expression pattern between two individual groups of each detected cluster is tested using Hotelling's T2-statistic. In this way all DE gene-clusters can be extracted sequentially. Simulation results show that the proposed method recovers all the simulated gene-clusters.
Key words—Gene-expression, Minimum beta-divergence method, Beta weight function, Gene-cluster, Differential expression, Hotelling's T2, Sequential clustering.
I. INTRODUCTION
Microarray and DNA technology have proved to be essential tools in studying gene expression. Genes are segments of DNA that provide the code for producing proteins.
Different organisms contain different numbers of genes. For example, a human being contains an estimated 30,000 genes, whereas the fruit fly contains only about 13,000 genes. Different genes can also have different expression levels. Because of the large number of genes and the complexity of gene regulation networks, clustering is a useful exploratory technique for analyzing these data. It identifies groups of genes that have similar expression profiles across samples. Clustering can reduce the effort of studying individual genes and, more importantly, it can expose the functional groups among genes.
Various approaches have been developed for this task in the context of microarray experiments [1]. For example, hierarchical clustering, K-means, and partitioning around medoids have all been applied in high-throughput studies.
Partitioning methods require the number of clusters to be known in advance and assume that each subject belongs to exactly one group. This may not be appropriate for clustering genes,
subgroups, but cannot represent a partial overlap between groups as may be required. There is also a relation between clustering and the identification of differentially expressed genes in most microarray studies. Unfortunately, the two tasks are often conducted without regard to each other, which can give misleading results. It is equally important to know which cluster a gene belongs to and whether or not the gene is differentially expressed. Therefore, in this work we propose a procedure that simultaneously clusters genes and identifies differentially expressed genes by the minimum β-divergence method.
II. UNIFIED MODELING APPROACH BY BETA-DIVERGENCE
The β-divergence between two probability density functions (pdf's) p(x) and q(x) is defined as
\[ D_\beta(p, q) = \int \left[ \frac{1}{\beta} \left\{ p^{\beta}(x) - q^{\beta}(x) \right\} p(x) - \frac{1}{\beta + 1} \left\{ p^{\beta + 1}(x) - q^{\beta + 1}(x) \right\} \right] dx, \]
which is non-negative, i.e. $D_\beta(p(x), q(x)) \ge 0$, and equal to zero if and only if p = q [2]. We note that the β-divergence reduces to the K-L divergence when β → 0, that is
\[ \lim_{\beta \to 0} D_\beta(p, q) = \int p(x) \log \frac{p(x)}{q(x)} \, dx = D_{KL}(p, q). \]
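As a numerical illustration of the limit above, the following minimal sketch evaluates the β-divergence between two univariate normal densities on a grid and compares it with the K-L divergence as β shrinks. The grid, the two densities and the function names are assumptions made only for this illustration; they are not part of the proposed method.

```python
import numpy as np

def beta_divergence(p, q, dx, beta):
    """Riemann-sum approximation of D_beta(p, q) for densities sampled on a grid."""
    term1 = (p**beta - q**beta) * p / beta
    term2 = (p**(beta + 1) - q**(beta + 1)) / (beta + 1)
    return float(np.sum(term1 - term2) * dx)

def kl_divergence(p, q, dx):
    """Riemann-sum approximation of the K-L divergence."""
    return float(np.sum(p * np.log(p / q)) * dx)

def normal_pdf(x, mean, sd):
    return np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

# Illustrative densities (assumed): N(0, 1) and N(1, 1) on a fine grid.
x = np.linspace(-10.0, 10.0, 20001)
dx = x[1] - x[0]
p = normal_pdf(x, 0.0, 1.0)
q = normal_pdf(x, 1.0, 1.0)

kl = kl_divergence(p, q, dx)
for beta in (0.5, 0.1, 0.01, 0.001):
    # The printed values approach the K-L divergence as beta decreases.
    print(beta, beta_divergence(p, q, dx, beta))
print("KL", kl)
```

For these two densities the K-L divergence is 0.5 analytically, which gives a simple check on the numerical integration.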
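The minimum β-divergence estimators for the mean vector $\mu$ and covariance matrix $V$ are obtained [3] iteratively as follows:
\[ \mu_{t+1} = \frac{\sum_{i=1}^{n} \psi_\beta(x_i \mid \mu_t, V_t) \, x_i}{\sum_{i=1}^{n} \psi_\beta(x_i \mid \mu_t, V_t)} \tag{1} \]
and
\[ V_{t+1} = \frac{\sum_{i=1}^{n} \psi_\beta(x_i \mid \mu_t, V_t) (x_i - \mu_t)(x_i - \mu_t)^T}{(1 + \beta)^{-1} \sum_{i=1}^{n} \psi_\beta(x_i \mid \mu_t, V_t)} \tag{2} \]
where
\[ \psi_\beta(x \mid \mu, V) = \exp\left\{ -\frac{\beta}{2} (x - \mu)^T V^{-1} (x - \mu) \right\}. \]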
This is known as the β-weight function. It produces a weight for each data point. The performance of the minimum β-divergence method depends on the value of the tuning parameter β. We select an appropriate β by cross-validation [4].
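The iteration in (1) and (2) can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the starting values, the stopping rule, the value β = 0.2 and the toy data (a main group plus a smaller distant group) are all assumptions chosen for the example.

```python
import numpy as np

def beta_weights(X, mu, V, beta):
    """beta-weight exp{-(beta/2) (x - mu)^T V^{-1} (x - mu)} for each row of X."""
    diff = X - mu
    # Squared Mahalanobis distance of every row, without forming V^{-1} explicitly.
    d2 = np.einsum("ij,ij->i", diff, np.linalg.solve(V, diff.T).T)
    return np.exp(-0.5 * beta * d2)

def min_beta_estimate(X, beta, mu0, V0, max_iter=200, tol=1e-8):
    """Iterate updates (1) and (2) from an assumed starting point (mu0, V0)."""
    mu = np.asarray(mu0, dtype=float)
    V = np.asarray(V0, dtype=float)
    for _ in range(max_iter):
        w = beta_weights(X, mu, V, beta)
        mu_new = w @ X / w.sum()                                 # update (1)
        diff = X - mu                                            # centred at the current mean
        V_new = (1.0 + beta) * ((diff.T * w) @ diff) / w.sum()   # update (2)
        shift = max(np.abs(mu_new - mu).max(), np.abs(V_new - V).max())
        mu, V = mu_new, V_new
        if shift < tol:  # assumed stopping rule
            break
    return mu, V

# Toy data (assumed): 500 points around the origin and 100 points around (8, 8).
rng = np.random.default_rng(0)
main = rng.normal(0.0, 1.0, size=(500, 2))
other = rng.normal(8.0, 1.0, size=(100, 2))
X = np.vstack([main, other])

# Start from one data point and an identity covariance (both assumptions).
mu_hat, V_hat = min_beta_estimate(X, beta=0.2, mu0=X[0], V0=np.eye(2))
print(mu_hat)          # stays close to the main group
print(X.mean(axis=0))  # the ordinary mean is pulled towards the second group
```

Because the second group receives weights close to zero, the estimates describe the main group only, which is the property the sequential clustering step relies on.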
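III. GENE CLUSTERING BY MINIMIZING BETA-DIVERGENCE
We compute the β-weight of each data point $x_t$ from the estimates $\hat{\mu}$ and $\hat{V}$ obtained with (1) and (2) as follows:
\[ \psi_\beta(x_t \mid \hat{\mu}, \hat{V}) = \exp\left\{ -\frac{\beta}{2} (x_t - \hat{\mu})^T \hat{V}^{-1} (x_t - \hat{\mu}) \right\} \]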
A detected gene-cluster is separated from the dataset using this β-weight function, and the same procedure is then applied to the remaining data points for sequential extraction of gene-clusters. The β-weight function produces larger weights for the detected gene-cluster and smaller weights for the clusters not yet recovered. We separate c clusters sequentially in the following way:
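\[ D_k = \left\{ x_t \in D \mid \psi_\beta(x_t; \hat{\mu}_{(k)}, \hat{V}_{(k)}) > \delta_k ; \; t = 1, 2, \ldots, n \right\}, \quad k = 1, 2, \ldots, c, \tag{3} \]
where we choose the value of $\delta_k$ by
\[ \delta_k = (1 - \eta) \min_{x_t \in D_k} \psi_\beta(x_t; \hat{\mu}_{(k)}, \hat{V}_{(k)}) + \eta \max_{x_t \in D_k^c} \psi_\beta(x_t; \hat{\mu}_{(k)}, \hat{V}_{(k)}) \]
with, heuristically, $0.01 \le \eta \le 0.05$.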
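The sequential separation can be sketched as below. The sketch makes several simplifying assumptions that are not taken from the paper: a single fixed threshold `delta` replaces the data-driven threshold, the covariance iteration starts from the identity matrix, a hypothetical `min_size` rule stops the loop, and the toy data consist of three well-separated groups.

```python
import numpy as np

def beta_weights(X, mu, V, beta):
    """beta-weight of each row of X given mean mu and covariance V."""
    diff = X - mu
    d2 = np.einsum("ij,ij->i", diff, np.linalg.solve(V, diff.T).T)
    return np.exp(-0.5 * beta * d2)

def fit(X, beta, mu0, n_iter=100):
    """Fixed number of passes of the updates (1) and (2); V starts at the identity (assumed)."""
    mu, V = np.asarray(mu0, dtype=float), np.eye(X.shape[1])
    for _ in range(n_iter):
        w = beta_weights(X, mu, V, beta)
        diff = X - mu
        mu, V = w @ X / w.sum(), (1.0 + beta) * ((diff.T * w) @ diff) / w.sum()
    return mu, V

def extract_clusters(X, beta, delta, min_size=5, seed=0):
    """Peel off one cluster at a time; returns a list of index arrays into X."""
    rng = np.random.default_rng(seed)
    remaining = np.arange(len(X))
    clusters = []
    while len(remaining) >= min_size:
        data = X[remaining]
        mu0 = data[rng.integers(len(data))]   # random data point as the initial mean
        mu, V = fit(data, beta, mu0)
        members = beta_weights(data, mu, V, beta) > delta  # rule (3) with a fixed threshold
        if members.sum() < min_size:          # hypothetical stopping rule
            break
        clusters.append(remaining[members])
        remaining = remaining[~members]       # continue with the rest of the data
    return clusters

# Toy data (assumed): three groups of 100 points each in two dimensions.
rng = np.random.default_rng(0)
centres = np.array([[0.0, 0.0], [15.0, 0.0], [0.0, 15.0]])
X = np.vstack([rng.normal(c, 1.0, size=(100, 2)) for c in centres])

clusters = extract_clusters(X, beta=0.2, delta=0.05)
print([len(c) for c in clusters])
```

With real expression data the threshold would be chosen per step as described in the text, and β would be re-selected by cross-validation before each extraction.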
IV. DIFFERENTIAL EXPRESSION PATTERN IDENTIFICATION
We consider a microarray experiment composed of $n_D$ samples from a disease group and $n_N$ samples from a normal group. Suppose that the expression levels of J genes are measured and used as variables to construct a $T^2$ statistic. Let $X_{ij}^D$ be the expression level for gene j of sample i from the disease group and $X_{kj}^N$ be the expression level for gene j of sample k from the normal group. The expression level vectors for samples i and k from the disease and normal groups can be expressed as $X_i^D = (X_{i1}^D, \ldots, X_{iJ}^D)^T$ and $X_k^N = (X_{k1}^N, \ldots, X_{kJ}^N)^T$, respectively. The mean expression levels of gene j in the disease and normal groups can be expressed as $\bar{X}_j^D = \frac{1}{n_D} \sum_{i=1}^{n_D} X_{ij}^D$ and $\bar{X}_j^N = \frac{1}{n_N} \sum_{k=1}^{n_N} X_{kj}^N$, respectively. The mean expression level vectors for J genes in the disease and normal groups can be expressed as $\bar{X}^D = (\bar{X}_1^D, \ldots, \bar{X}_J^D)^T$ and $\bar{X}^N = (\bar{X}_1^N, \ldots, \bar{X}_J^N)^T$, respectively. The pooled variance–covariance matrix of expression levels of J genes for the disease and normal samples is then defined as
\[ S = \frac{(n_D - 1) S_D + (n_N - 1) S_N}{n_D + n_N - 2} = \frac{1}{n_D + n_N - 2} \left[ \sum_{i=1}^{n_D} (X_i^D - \bar{X}^D)(X_i^D - \bar{X}^D)^T + \sum_{k=1}^{n_N} (X_k^N - \bar{X}^N)(X_k^N - \bar{X}^N)^T \right] \]
where $S_D$ and $S_N$ are the variance–covariance matrices of expression levels for J genes in the disease and normal groups, respectively. The covariance terms in $S_D$ and $S_N$ account for the correlation and interdependence (interactions) of gene expression levels [7].
Hotelling's $T^2$ statistic for gene differential expression studies is then defined as
\[ T^2 = \frac{n_D n_N}{n_D + n_N} (\bar{X}^D - \bar{X}^N)^T S^{-1} (\bar{X}^D - \bar{X}^N). \]
This statistic combines information from the mean and dispersion of all the variables (genes being tested) in microarray experiments [5]. The central limit theorem dictates that
\[ \frac{n_D + n_N - J - 1}{J (n_D + n_N - 2)} T^2 \]
is asymptotically F-distributed with J degrees of freedom for the numerator and $n_D + n_N - J - 1$ for the denominator.
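The statistic and its F transformation can be computed as in the sketch below. The function name, the array layout (samples in rows, genes in columns) and the tiny one-gene example are assumptions made for illustration.

```python
import numpy as np

def hotelling_t2(XD, XN):
    """Two-sample Hotelling T^2 with pooled covariance.

    XD: (n_D, J) array of disease samples; XN: (n_N, J) array of normal samples.
    Returns (T2, F, df1, df2), where F is the scaled statistic that is
    referred to an F distribution with (df1, df2) degrees of freedom.
    """
    nD, J = XD.shape
    nN = XN.shape[0]
    mean_diff = XD.mean(axis=0) - XN.mean(axis=0)
    SD = np.cov(XD, rowvar=False)  # divides by n_D - 1
    SN = np.cov(XN, rowvar=False)  # divides by n_N - 1
    S = np.atleast_2d(((nD - 1) * SD + (nN - 1) * SN) / (nD + nN - 2))
    T2 = nD * nN / (nD + nN) * (mean_diff @ np.linalg.solve(S, mean_diff))
    df1, df2 = J, nD + nN - J - 1
    F = df2 / (df1 * (nD + nN - 2)) * T2
    return float(T2), float(F), df1, df2

# One-gene example (assumed data): T^2 reduces to the squared pooled t statistic.
XD = np.array([[1.0], [2.0], [3.0]])
XN = np.array([[2.0], [4.0], [6.0]])
T2, F, df1, df2 = hotelling_t2(XD, XN)
print(T2, F, df1, df2)
```

The pooled matrix S is invertible only when the number of genes J does not exceed n_D + n_N − 2, so the test is applied to gene sets that are small relative to the number of samples. A p-value can be obtained from the upper tail of the F distribution with (df1, df2) degrees of freedom.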
Algorithm
(i) Select an appropriate β by cross-validation.
(ii) Compute the estimates $\hat{\mu}$ and $\hat{V}$ of $\mu$ and $V$ using (1) and (2).
(iii) Determine the β-weight function and calculate the weight of each data point.
(iv) Separate the first gene-cluster from the dataset using the β-weight function.
(v) Separate the second gene-cluster from the remaining data using the β-weight function.
(vi) Test the differential gene-expression pattern between the two gene-clusters using Hotelling's T2-statistic.
(vii) Repeat (i) to (vi) until all gene-clusters are recovered.
V. SIMULATION RESULTS
To investigate the performance of the proposed method, we consider original clusters based on an unobservable matrix $S_{m \times n}$ with m=10000 and n=40, where the rows of S represent the clustered genes and the columns of S represent the clustered sample labels. From Fig. 1a we see that there are 7 sets of genes, G1 = {g1, g2, …, g100}, G2 = {g101, g102, …, g200}, G3 = {g201, g202, …, g300}, G4 = {g301, g302, …, g400}, G5 = {g401, g402, …, g500}, G6 = {g501, g502, …, g600} and G7 = {g601, g602, …, g1000}, which belong to seven different clusters in the rows of S based on the similarity of intensities. The last group, G7, is the equally expressed gene group. Four sets of sample labels, S1 = {s1, s2, …, s10}, S2 = {s11, s12, …, s20}, S3 = {s21, s22, …, s30} and S4 = {s31, s32, …, s40}, belong to four different clusters in the columns of S based on the similarity of intensities.
Fig. 1. (a) Image of the original gene clusters (S) based on an unobservable matrix $S_{m \times n}$, where the rows of S represent the clustered genes and the columns of S represent the clustered sample labels. (b) Unclustered image (random allocation, X) based on an observable matrix $X_{m \times n}$; the rows and columns of X are a random allocation of the rows and columns of S.
Figure 1b shows the unclustered image based on the observable matrix $X_{m \times n}$. The rows and columns of X are a random allocation of the rows and columns of S. Our aim is to recover the unobservable matrix S from the observable matrix X to obtain the original gene-clusters.
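A data set with this layout can be generated as in the following sketch. The block means, the noise level and the random seed are hypothetical values chosen for illustration, since the paper does not report them, and the sketch uses the 1000 rows covered by the gene sets G1 to G7 listed above.

```python
import numpy as np

rng = np.random.default_rng(1)

gene_sizes = [100] * 6 + [400]  # G1..G6 and the equally expressed group G7
sample_sizes = [10] * 4         # S1..S4

# Hypothetical block means: rows are gene groups, columns are sample groups.
# G7 has the same mean in every sample group (equally expressed).
block_means = np.array([
    [ 2.0,  2.0, -2.0, -2.0],
    [-2.0, -2.0,  2.0,  2.0],
    [ 2.0, -2.0,  2.0, -2.0],
    [-2.0,  2.0, -2.0,  2.0],
    [ 2.0, -2.0, -2.0,  2.0],
    [-2.0,  2.0,  2.0, -2.0],
    [ 0.0,  0.0,  0.0,  0.0],
])

gene_labels = np.repeat(np.arange(7), gene_sizes)      # group label of each row
sample_labels = np.repeat(np.arange(4), sample_sizes)  # group label of each column

# Clustered matrix S: block mean plus Gaussian noise (assumed noise sd 0.5).
S = block_means[np.ix_(gene_labels, sample_labels)] + rng.normal(0.0, 0.5, (1000, 40))

# Observable matrix X: rows and columns of S in random order.
row_perm = rng.permutation(S.shape[0])
col_perm = rng.permutation(S.shape[1])
X = S[np.ix_(row_perm, col_perm)]

# Applying the inverse permutations restores S, which is what clustering aims to do.
recovered = X[np.ix_(np.argsort(row_perm), np.argsort(col_perm))]
print(X.shape, np.array_equal(recovered, S))
```

In the simulation the permutations are unknown, so the clustering method has to recover the row grouping from X alone.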
At the first step we randomly select a gene from the dataset as the initial mean vector. To separate the first cluster by the proposed method, we select the tuning parameter β = 0.03 by K-fold cross-validation (K = 10). Then we calculate the weight of each data point. Figure 2a shows the weights of the data points at step 1, and Fig. 2h shows the recovered first gene-cluster when k=1 in equation (3), treating the other clusters as outliers. Clearly, at step 1 a gene cluster is separated properly by our method. We then remove the detected cluster from the dataset and apply our method to the rest of the dataset to separate the second gene cluster at step 2. We select the tuning parameter β = 0.04 by cross-validation as before and again calculate the weight of each data point. Figure 2b shows the weights of the data points at step 2, and Fig. 2i shows the recovered second gene-cluster when k=2 in equation (3), treating the rest of the dataset as outliers.
Clearly, at step 2 the second gene-cluster is properly recovered. We then apply Hotelling's $T^2$ statistic to detect DE genes between the two detected gene clusters. Next we remove the second detected cluster from the dataset and repeat the same procedure. Thus we can sequentially extract all the DE gene-clusters from the dataset. Figure 2c shows the weights of the data points at step 3 with β = 0.02, and Fig. 2j shows the recovered third gene-cluster when k=3 in equation (3), treating the other clusters as outliers. Clearly, a gene cluster is separated properly at step 3.
Again we apply Hotelling's $T^2$ statistic to detect DE genes between two detected gene clusters. We then remove the detected cluster from the dataset and estimate the weights of the data points again with β = 0.03.
Fig. 2. (a)–(g) Weights of each data point at steps 1–7, respectively. (h)–(n) Recovered gene cluster at steps 1–7, respectively.
Figure 2d shows the weights of the data points at step 4, and Fig. 2k shows the recovered fourth gene-cluster when k=4 in equation (3), treating the other clusters as outliers. It is apparent from Fig. 2k that the equally expressed (EE) gene group is recovered. At step 7 (Fig. 2n), cluster seven is recovered. The weights in Fig. 2g suggest that there are no more clusters in the dataset, so the method terminates automatically once all the gene clusters are recovered. Thus all the DE gene clusters are detected sequentially by our proposed method. We combine all the DE genes in Fig. 3 and observe that all the DE gene clusters are detected as in the original in Fig. 1a.
CONCLUSION
Many clustering algorithms, such as hierarchical clustering, k-means, PAM and SOM, have been used to analyze microarray gene expression data. However, clustering and DE gene detection are both essential for microarray data analysis. From that point of view we discuss a new method based on minimizing the β-divergence, which simultaneously separates similar gene groups and identifies DE gene classes. Using our proposed method, all hidden gene classes can be explored sequentially from the entire data space.
In the simulation study, the proposed method recovered all the gene clusters of the original data.
REFERENCES
[1] Eisen, M. B., Spellman, P. T., Brown, P. O., and Botstein, D. (1998). Cluster analysis and display of genome-wide expression patterns. Proceedings of the National Academy of Sciences of the United States of America 95, 14863–14868.
[2] Minami, M. and Eguchi, S. (2002). Robust blind source separation by beta-divergence. Neural Computation 14, 1859–1886.
[3] Mollah, M. N. H., Minami, M. and Eguchi, S. (2007). Robust prewhitening for ICA by minimizing β-divergence and its application to FastICA. Neural Processing Letters 25(2), 91–110.
[4] Hastie, T., Tibshirani, R., and Friedman, J. (2001). The Elements of Statistical Learning. New York: Springer.
[5] Lu, Y., Liu, P.-Y., Xiao, P., and Deng, H.-W. (2005). Hotelling's T2 multivariate profiling for detecting differential expression in microarrays. Bioinformatics 21(14), 3105–3113.
[6] Parmigiani, G., Garrett, E. S., Irizarry, R., and Scott, S. L. (eds). (2003). The Analysis of Gene Expression Data.
[7] Chilingaryan, A., et al. (2002). Multivariate approach for selecting sets of differentially expressed genes. Math. Biosci. 176, 59–69.