Complementary Clustering for Microarray Gene-Expression Data Analysis using Factor Analyzers
Abu Shaleh Mahmud, Md. Mushfiqur Rahman, Md. Alim Hossen, Md. Nurul Haque Mollah
Department of Statistics University of Rajshahi Rajshahi-6205, Bangladesh
Md. Asif Ahsan Zhejiang University
China.
Abstract—Hierarchical clustering (HC) is one of the most widely used unsupervised statistical techniques for analyzing microarray gene expression data. When applied to gene expression data to cluster individuals, most HC algorithms generate clusters based on the highly differentially expressed (DE) genes that have very similar expression patterns. These highly DE genes may sometimes be irrelevant to the biological process of interest. The serious problem is that such irrelevant but highly expressed genes can drown out the lowly expressed genes that have important, or already known, biological functions. To overcome this problem, complementary hierarchical clustering (CHC) was proposed in 2008 to uncover the structures arising from lowly expressed genes with important biological functions. For the same purpose, in this paper we propose an alternative method named “Complementary Clustering using Factor Analyzers (CCFA)”. The advantage of CCFA over CHC is that the factor scores of CCFA may be utilized in eQTL analysis to discover the regulation pathways of the genes.
Keywords—Gene expression data, hierarchical clustering (HC), complementary hierarchical clustering (CHC), factor analyzers
I. INTRODUCTION
Hierarchical clustering algorithms are very useful tools for analyzing microarray data. They provide a simple and appealing way of displaying the organizational structure of the data using a tree diagram called a dendrogram. An example of the application of hierarchical clustering algorithms to microarray data is given in [1]. To motivate the procedure described in this paper, suppose we are clustering RNA samples based on their gene expression profiles. What is often observed is that the clustering pattern is dominated by a small group of highly differentially expressed genes with closely related expression patterns. These “strong” genes can drown out “weaker” genes that are not as highly expressed. A problem arises when these weaker genes are responsible for structures in the data that have important biological relevance, because this information cannot be discerned from the clustering pattern. The goal of the complementary hierarchical clustering (CHC) procedure is to uncover the structure arising from these weaker genes. For the same purpose, in this paper we propose an alternative method named “Complementary Clustering using Factor Analyzers (CCFA)”. The advantage of CCFA over CHC is that the factor scores of CCFA may be utilized in eQTL analysis to discover the regulation pathways of the genes.
Before delving into the details of this procedure, we will give a very brief review of clustering methods. Further details on cluster analysis and various clustering methods can be found in [2] [3]. Cluster analysis is an unsupervised learning procedure with the goal of grouping data into clusters, with members within a cluster being closer to each other than to members outside that cluster. In order to quantify how close one data point is to another, a distance measure is required. A typical distance measure used with microarray data is one minus the correlation between the gene expression profiles of RNA samples.
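As an illustration of this distance, the following minimal Python sketch computes one minus the Pearson correlation between every pair of samples. It assumes the data are stored as a genes × samples NumPy array; the function name `correlation_distance` and the toy matrix are illustrative only.

```python
import numpy as np

def correlation_distance(X):
    """Return the (samples x samples) matrix of 1 - Pearson correlation.

    X is assumed to be a (genes x samples) array, so each column is the
    expression profile of one RNA sample.
    """
    R = np.corrcoef(X, rowvar=False)  # rowvar=False: columns are the variables
    return 1.0 - R

# Toy data: sample 1 is a scaled copy of sample 0, sample 2 is its mirror image.
X_toy = np.array([[1.0, 2.0, -1.0],
                  [2.0, 4.0, -2.0],
                  [3.0, 6.0, -3.0]])
D_toy = correlation_distance(X_toy)
```

With this definition the distance lies between 0 (perfectly correlated profiles) and 2 (perfectly anti-correlated profiles).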
Most clustering methods fall into two categories: partitioning methods and hierarchical methods. Partitioning methods try to find the optimal grouping of the data into a predetermined number of clusters. A well-known example is the K-means algorithm. Hierarchical methods produce clusters of a hierarchical nature. The lowest level of the hierarchy consists of the individual data points, and at each level the clusters are obtained by merging clusters from the previous lower level. As mentioned above, the hierarchical nature enables the clustering pattern to be displayed as a dendrogram. Hierarchical methods also require the definition of an intercluster distance measure.
One example of such a measure is the maximum pairwise distance between an element from one cluster and an element from the other cluster. There exist many hierarchical clustering algorithms, and they can differ in aspects such as the intercluster distance measure or whether the hierarchy is constructed in a top-down or a bottom-up manner. More recent work on the application of clustering to microarray data includes model-based clustering, spectral clustering, and biclustering. Model-based clustering methods are based on the assumption that the microarray data are generated from some underlying probabilistic model. A common example is to assume that the gene expression profiles of the RNA samples are generated from a mixture of normal distributions, with each component of the mixture corresponding to a cluster. Other examples of model-based clustering can be found in [4]-[6].
International Conference on Materials, Electronics & Information Engineering, ICMEIE-2015
05-06 June, 2015, Faculty of Engineering, University of Rajshahi, Bangladesh www.ru.ac.bd/icmeie2015/proceedings/
ISBN 978-984-33-8940--4
Spectral clustering is a technique in which the microarray data are treated as a graph with a set of vertices and weighted edges, and an optimal partition of the vertices is sought. The problem is solved via an eigenvector algorithm involving the matrix of weights. Applications to microarray data are given in [7]-[9].
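To make the eigenvector step concrete, the sketch below shows one simple variant of spectral partitioning: a two-way split read off the eigenvector of the second-smallest eigenvalue of the unnormalized graph Laplacian. This is an illustrative choice, not necessarily the formulation used in [7]-[9], which rely on normalized cuts and other refinements; the function name `spectral_bipartition` and the toy weight matrix are hypothetical.

```python
import numpy as np

def spectral_bipartition(W):
    """Split the vertices of a weighted graph into two groups.

    W is a symmetric (n x n) matrix of non-negative edge weights. The split
    is read off the signs of the eigenvector that belongs to the
    second-smallest eigenvalue of the unnormalized Laplacian D - W.
    """
    W = np.asarray(W, dtype=float)
    laplacian = np.diag(W.sum(axis=1)) - W
    eigval, eigvec = np.linalg.eigh(laplacian)  # eigenvalues in ascending order
    fiedler = eigvec[:, 1]
    return (fiedler > 0).astype(int)

# Toy graph: samples 0-2 and samples 3-5 are strongly connected within each
# group (weight 1) and only weakly connected across groups (weight 0.01).
W_toy = np.full((6, 6), 0.01)
W_toy[:3, :3] = 1.0
W_toy[3:, 3:] = 1.0
np.fill_diagonal(W_toy, 0.0)
groups = spectral_bipartition(W_toy)
```

In this toy graph the two strongly connected blocks end up in different groups; which block receives label 0 depends on the arbitrary sign of the eigenvector.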
Biclustering methods attempt to simultaneously cluster both the samples and the genes with the goal of finding “biclusters,” subsets of genes that seem to be closely related for a given subset of samples. For more details on biclustering, including both model-based and spectral approaches, see [10]-[13].
Complementary hierarchical clustering is a procedure that can be applied using any hierarchical clustering algorithm, as the only requirement is the ability of the clustering pattern to be represented as a dendrogram. The main idea behind this procedure is to use the information contained in the dendrogram to remove the main structural features from the data and subsequently uncover the structure arising from the weaker genes. The procedure can be broken down into 3 steps.
First, we perform an “initial” clustering on the original data. Second, the original data are modified, and third, we perform a “complementary” clustering on the modified data. The key to uncovering the structure lies in the modification of the original data in the second step.
II. A SHORT REVIEW OF THE CONVENTIONAL COMPLEMENTARY HIERARCHICAL CLUSTERING (CHC) ALGORITHM
Complementary hierarchical clustering (CHC) is a procedure for the sequential extraction of gene sets with similar expression patterns. A major step in the implementation of the CHC procedure is to use the information contained in the HC dendrogram to remove the structure based on highly expressed genes from the data and to subsequently uncover the structure arising from the lowly expressed genes. The CHC procedure consists of three steps. It represents the gene expression data by a p × n matrix X, with rows corresponding to p genes and columns to n individuals, and proceeds as follows:
(i) Apply HC to the original dataset X.
(ii) The clustering results can be represented by a dendrogram.
Let H be a random variable that follows a uniform distribution on (0, h), where h represents the total height of the dendrogram [14]. For a natural grouping of the samples, cut the dendrogram at height H. Let z(H) denote the vector of labels corresponding to the groups of samples obtained by the cut at height H. The residual matrix corresponding to the linear regression of each row of X onto z(H) is denoted by E(H). Therefore, define
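A minimal Python sketch of steps (i) and (ii) follows. It rests on two assumptions made for illustration: that the modified data matrix is taken to be the average of E(H) over H uniform on (0, h), and that regressing a row of X onto the labels z(H) amounts to subtracting the within-group mean. The exact definition should be taken from [14]. The function names, the complete-linkage choice and the toy data are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

def group_mean_residuals(X, labels):
    """E(H): subtract, row by row, the mean within each group of samples."""
    E = np.array(X, dtype=float)
    for g in np.unique(labels):
        cols = labels == g
        E[:, cols] -= E[:, cols].mean(axis=1, keepdims=True)
    return E

def sample_dendrogram(X, method="complete"):
    """Cluster the samples (columns of X) using distance 1 - correlation."""
    return linkage(pdist(X.T, metric="correlation"), method=method)

def complementary_matrix(X, Z):
    """Average E(H) over H ~ Uniform(0, h), with h the dendrogram height.

    The partition changes only at the merge heights, so the average is a
    finite sum weighted by the gaps between consecutive merge heights.
    """
    heights = Z[:, 2]
    h = heights[-1]
    X_mod = np.zeros(X.shape, dtype=float)
    lower = 0.0
    for upper in heights:
        if upper > lower:
            labels = fcluster(Z, t=0.5 * (lower + upper), criterion="distance")
            X_mod += (upper - lower) / h * group_mean_residuals(X, labels)
            lower = upper
    return X_mod

# Hypothetical toy data: 20 "strong" genes split the 16 samples into halves,
# 20 "weak" genes split them in a different way, 20 genes carry noise only.
rng = np.random.default_rng(0)
s1 = np.repeat([1.0, -1.0], 8)
s2 = np.tile(np.repeat([1.0, -1.0], 4), 2)
signal = np.vstack([np.tile(8.0 * s1, (20, 1)),
                    np.tile(3.0 * s2, (20, 1)),
                    np.zeros((20, 16))])
X = signal + rng.normal(0.0, 0.5, size=signal.shape)

Z_initial = sample_dendrogram(X)                 # step (i): initial clustering
X_mod = complementary_matrix(X, Z_initial)       # step (ii): modified data
Z_complementary = sample_dendrogram(X_mod)       # clustering of modified data
initial_split = fcluster(Z_initial, t=2, criterion="maxclust")
complementary_split = fcluster(Z_complementary, t=2, criterion="maxclust")
```

In this toy example the top split of the initial dendrogram follows the strong genes, while the top split of the dendrogram built from `X_mod` follows the weak genes, which is the behaviour the complementary step is meant to produce.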
III. FACTOR ANALYSIS
FA is a statistical method used to uncover the latent dimensions of a set of variables. It is widely used in behavioral science, social science, marketing, economics and, most recently, bioinformatics. In general, the aim of FA is to extract the most informative m-dimensional unobservable random vector F = (f1, f2, · · · , fm)^T, known as the factors, from an observable vector x = (x1, x2, · · · , xp)^T of dimension p ≥ m.
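The FA model is written in matrix notation in [15] [16]. The mean vector and the covariance matrix of x are estimated from the n observations x_1, · · · , x_n by

$$\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i \quad \text{and} \quad \hat{V} = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \hat{\mu})(x_i - \hat{\mu})^T.$$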
However, standardizing the data vector avoids the problem of one variable with a large variance unduly influencing the determination of the factor loadings, and the sample covariance matrix of the standardized data vector is the sample correlation matrix R.
Therefore, we use the correlation matrix in place of the covariance matrix. The correlation matrix can also be computed directly from the covariance matrix.
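The conversion from covariance to correlation is presumably the standard one, R = D^{-1/2} V D^{-1/2}, where D is the diagonal matrix holding the variances on the diagonal of V. A short Python sketch of this relation is given below; the function name `cov_to_corr` and the random test data are illustrative assumptions.

```python
import numpy as np

def cov_to_corr(V):
    """Convert a covariance matrix V into R = D^{-1/2} V D^{-1/2}."""
    sd = np.sqrt(np.diag(V))          # standard deviations of the variables
    return V / np.outer(sd, sd)       # divide entry (j, k) by sd_j * sd_k

# Five variables on very different scales, 50 observations each.
rng = np.random.default_rng(1)
scales = np.array([[1.0], [10.0], [0.1], [3.0], [7.0]])
data = rng.normal(size=(5, 50)) * scales
V_hat = np.cov(data)                  # rows are variables, divisor n - 1
R_hat = cov_to_corr(V_hat)
```

The result agrees with the correlation matrix computed directly from the data, so the large differences in scale between the variables no longer matter.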
IV. COMPLEMENTARY CLUSTERING USING FACTOR ANALYZERS
Microarray gene expression data analysis for the identification of important genes is one of the most popular research fields in bioinformatics. In high-dimensional microarray data analysis, researchers are often interested in detecting differentially expressed (DE) genes and classifying or clustering them, as well as in classifying or clustering patients or individuals. For these purposes, at least two different statistical algorithms are typically used in the literature [17] [18]. To avoid the need for two separate algorithms, Ahsan et al. (2012) proposed two-way clustering using factor analyzers as an alternative approach that serves both purposes. However, in some cases the highly DE genes may not be relevant to the biological process. The problem is that such irrelevant genes with high expression can drown out the lowly expressed genes with important biological functions. To avoid this problem, in this paper we propose complementary clustering using factor analyzers for microarray gene expression data analysis. Clustering is done by the following steps:
1) Select an appropriate number of factors m using a scree plot computed from the data matrix X.
2) Compute the m×p dimensional loading matrix L based on R using equation (6).
3) Using the m factor loadings, separate m groups of differentially expressed genes based on high and low absolute values of the factor loading scores.
4) Apply two-way clustering to the m groups of genes.
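Steps 1) to 3) can be sketched in Python as follows. This is only an illustration under stated assumptions: the loadings are computed by the principal component method described in [16], which is an assumption about the form of equation (6); the loadings are stored as a p × m array (the transpose of the m × p orientation used above); the cut-off 0.5 on the absolute loadings and the toy data are hypothetical; step 4) is not shown. The gene groups in the toy data have deliberately unequal sizes so that the leading eigenvalues of R are well separated; without that, unrotated loadings may mix the groups.

```python
import numpy as np

def pc_factor_loadings(X, m):
    """Loadings of the first m factors by the principal component method.

    X is (genes x individuals); the genes play the role of the variables, so
    R is the (p x p) correlation matrix between genes. Returns the (p x m)
    loadings and all eigenvalues of R in descending order (the values that
    would be drawn in a scree plot).
    """
    R = np.corrcoef(X)
    eigval, eigvec = np.linalg.eigh(R)        # ascending order
    order = np.argsort(eigval)[::-1]
    eigval, eigvec = eigval[order], eigvec[:, order]
    loadings = eigvec[:, :m] * np.sqrt(np.clip(eigval[:m], 0.0, None))
    return loadings, eigval

def group_genes(loadings, threshold=0.5):
    """Assign each gene to the factor with its largest absolute loading.

    Genes whose largest absolute loading is below the (hypothetical)
    threshold get the label -1, i.e. they belong to no group.
    """
    abs_l = np.abs(loadings)
    labels = abs_l.argmax(axis=1)
    labels[abs_l.max(axis=1) < threshold] = -1
    return labels

# Hypothetical toy data: three gene groups driven by three different sample
# patterns (effect sizes 8, 6 and 3) plus ten noise-only genes.
rng = np.random.default_rng(0)
n_ind = 96
s1 = np.repeat([1.0, -1.0], n_ind // 2)
s2 = np.tile(np.repeat([1.0, -1.0], n_ind // 4), 2)
s3 = np.tile(np.repeat([1.0, -1.0], n_ind // 8), 4)
signal = np.vstack([np.tile(8.0 * s1, (12, 1)),
                    np.tile(6.0 * s2, (8, 1)),
                    np.tile(3.0 * s3, (5, 1)),
                    np.zeros((10, n_ind))])
X = signal + rng.normal(0.0, 1.0, size=signal.shape)

loadings, eigval = pc_factor_loadings(X, m=3)   # m read off the scree values
gene_groups = group_genes(loadings)
```

Here `eigval` shows three dominant values followed by a clear drop, which is what a scree plot would reveal, and `gene_groups` separates the three groups of genes from the noise-only genes.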
V. SIMULATION STUDY
To demonstrate the performance of our proposed method, we simulated microarray gene expression data. The data set has 3 effects, corresponding to 3 sets of significant genes. The first effect is represented by the G1 rows, the second effect by the G2 rows and the third effect by the G3 rows. The G4 genes are insignificant.
To randomize the gene expression, Gaussian noise drawn from N(0, σ^2) is added. Thus, we generate a gene expression dataset using Table I with parameters a = 8, b = 6, c = 3 and σ^2 = 1, which is displayed in Figure 1 (a), where n1 = 10 genes denoted by {G1.1, G1.2, . . ., G1.10} ∈ G1, n2 = 10 genes denoted by {G1.11, G1.12, . . ., G1.20} ∈ G2, n3 = 10 genes denoted by {G1.21, G1.22, . . ., G1.30} ∈ G3 and n4 = 10 genes denoted by {G1.31, G1.32, . . ., G1.40} ∈ G4. This original data structure, denoted by S, is unobservable.
To make the dataset resemble real gene expression data, we randomly permute the rows (genes) and columns (individuals) of S; the result is denoted by X and is displayed in Figure 1 (b).
Here the main objective is to recover the three groups of DE genes from X and to cluster them according to gene importance.
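The simulation design can be sketched in Python as below. The parameters a = 8, b = 6, c = 3, σ^2 = 1 and the four groups of 10 genes are taken from the text; the number of individuals (24) and the ±1 sample patterns attached to each effect are hypothetical stand-ins for the data generating model, chosen only to make the example runnable.

```python
import numpy as np

rng = np.random.default_rng(2015)
a, b, c, sigma = 8.0, 6.0, 3.0, 1.0
n_genes_per_group = 10
n_ind = 24                      # assumed number of individuals

# Hypothetical +/-1 sample patterns, one per effect.
s1 = np.repeat([1.0, -1.0], n_ind // 2)
s2 = np.tile(np.repeat([1.0, -1.0], n_ind // 4), 2)
s3 = np.tile(np.repeat([1.0, -1.0], n_ind // 8), 4)

signal = np.vstack([
    np.tile(a * s1, (n_genes_per_group, 1)),    # G1 rows
    np.tile(b * s2, (n_genes_per_group, 1)),    # G2 rows
    np.tile(c * s3, (n_genes_per_group, 1)),    # G3 rows
    np.zeros((n_genes_per_group, n_ind)),       # G4 rows: no effect
])
S = signal + rng.normal(0.0, sigma, size=signal.shape)  # unobservable structure

# Randomly permute the genes (rows) and the individuals (columns) to obtain X.
row_perm = rng.permutation(S.shape[0])
col_perm = rng.permutation(S.shape[1])
X = S[row_perm][:, col_perm]
```

Keeping `row_perm` and `col_perm` makes it possible to check afterwards whether a clustering of X recovers the original groups G1 to G4.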
TABLE I. DATA GENERATING MODEL
Fig. 1: Two-way clustering using factor loading scores.
Fig. 2: Panels 2a, 2c and 2e show the clusters G1, G2 and G3, respectively, obtained by applying CHC. Panels 2b, 2d and 2f show the same clusters G1, G2 and G3, respectively, obtained by applying CCFA. We observe that CHC returns the DE genes together with equally expressed (EE) genes, whereas CCFA returns the DE genes and excludes the EE genes.
VI. CONCLUSION
In this paper we have discussed complementary clustering using factor analyzers (CCFA). The advantage of CCFA over CHC is that the factor scores of CCFA may be utilized in eQTL analysis to discover the regulation pathways of the genes.
ACKNOWLEDGMENT
This work is supported by HEQEP sub-project (CP-3603, R-3, W-2), Bioinformatics Lab, Department of Statistics, University of Rajshahi, Bangladesh. The authors would like to thank the anonymous reviewers for their helpful comments.
REFERENCES
[1] M. B. Eisen, P. T. Spellman, P. O. Brown, and D. Botstein, “Cluster analysis and display of genome-wide expression patterns,” Proceedings of the National Academy of Sciences, vol. 95, no. 25, pp. 14863–14868, 1998.
[2] T. Hastie, R. Tibshirani, and J. Friedman, “The elements of statistical learning: data mining, inference and prediction,” New York: Springer-Verlag, vol. 1, no. 8, pp. 371–406, 2001.
[3] A. Gordon, “Classification,” Chapman & Hall/CRC, Boca Raton, FL, 1999.
[4] K. Y. Yeung, C. Fraley, A. Murua, A. E. Raftery, and W. L. Ruzzo, “Model-based clustering and data transformations for gene expression data,” Bioinformatics, vol. 17, no. 10, pp. 977–987, 2001.
[5] W. Pan, J. Lin, and C. T. Le, “Model-based cluster analysis of microarray gene-expression data,” Genome Biol, vol. 3, no. 2, pp. 1–0009, 2002.
[6] W. Pan, “Incorporating gene functions as priors in model-based clustering of microarray gene expression data,” Bioinformatics, vol. 22, no. 7, pp. 795–801, 2006.
[7] D. J. Higham, G. Kalna, and M. Kibble, “Spectral clustering and its use in bioinformatics,” Journal of computational and
[8] Y. Kluger, R. Basri, J. T. Chang, and M. Gerstein, “Spectral biclustering of microarray data: coclustering genes and conditions,” Genome Research, vol. 13, no. 4, pp. 703–716, 2003.
[9] E. P. Xing and R. M. Karp, “Cliff: clustering of high-dimensional microarray data via iterative feature filtering using normalized cuts,” Bioinformatics, vol. 17, no. suppl 1, pp. S306–S315, 2001.
[10] Y. Cheng and G. M. Church, “Biclustering of expression data,” in ISMB, vol. 8, 2000, pp. 93–103.
[11] S. C. Madeira and A. L. Oliveira, “Biclustering algorithms for biological data analysis: a survey,” IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB), vol. 1, no. 1, pp. 24–45, 2004.
[12] H. L. Turner, T. C. Bailey, W. J. Krzanowski, and C. A. Hemingway, “Biclustering models for structured microarray data,” IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB), vol. 2, no. 4, pp. 316–329, 2005.
[13] Q. Sheng, Y. Moreau, and B. De Moor, “Biclustering microarray data by Gibbs sampling,” Bioinformatics, vol. 19, no. suppl 2, pp. ii196–ii205, 2003.
[14] G. Nowak and R. Tibshirani, “Complementary hierarchical clustering,” Biostatistics, vol. 9, no. 3, pp. 467–483, 2008.
[15] T. W. Anderson, “An introduction to multivariate statistical analysis,” 1958.
[16] R. A. Johnson and D. W. Wichern, “Applied multivariate statistical analysis,” Pearson Education, 1992.
[17] I. Pournara and L. Wernisch, “Factor analysis for gene regulatory networks and transcription factor activity profiles,” BMC Bioinformatics, vol. 8, no. 1, p. 61, 2007.
[18] S.-L. Wang, J. Gui, and X. Li, “Factor analysis for cross-platform tumor classification based on gene expression profiles,” Journal of Circuits, Systems, and Computers, vol. 19, no. 01, pp. 243–258, 2010.