Robust Clustering for Gene-Expression Data Analysis
Abstract—A group of genes with similar expression profiles often participates in the same biological functions, so clustering genes with similar profiles is a crucial step in revealing potential relationships among genes. Many clustering methods, including hierarchical and partitioning based clustering, are widely used in the literature. Hierarchical clustering (HC) is the most widely used and is more advantageous than partitioning based clustering for gene-expression data analysis. Generally, biotechnologists require a smaller group of important genes. Partitioning methods are not well suited to finding such a small important gene-set from a large gene cluster, while HC methods can select such a gene-set by taking representative genes from each subgroup of a large hierarchical gene cluster. HC results are exhibited based on highly differentially expressed (DE) genes, but sometimes those highly DE genes are not relevant to the biological function. To overcome this problem, complementary HC (CHC) has been proposed, which can discover biologically important gene-sets with low expression. However, most existing HC algorithms are very sensitive to outlying observations, which is an important issue in gene-expression data analysis. To avoid such problems, our proposal is to use robust HC (RHC) and robust CHC (RCHC), which predict critically important genes in the presence of outliers.
Keywords-Microarray gene expression; Clustering; Robust clustering
I. INTRODUCTION

Microarray gene expression data allow us to quantitatively and simultaneously monitor the expression of thousands of genes under different conditions [1]. Microarray gene expression datasets are huge in size, and the groups of genes with similar patterns are unknown in advance.
Identification of genes with similar expression patterns is a crucial task in microarray gene expression analysis. This task can be accomplished by exploratory techniques such as unsupervised clustering analysis. A large number of statistical and computational clustering approaches have been applied to gene expression data analysis. There are two main types of unsupervised clustering algorithms: partitioning based clustering and hierarchical clustering (HC). The number of data clusters in the entire data space must be known in advance for partitioning based clustering methods such as k-means clustering [2] [3], partitioning around medoids [4], the Self-Organizing Map [5] [6], fuzzy
clustering, model-based clustering [7]-[10], and so on. On the other hand, hierarchical clustering [11], with both agglomerative (bottom-up) and divisive (top-down) variants, is a way of investigating the grouping of feature variables or objects in which the number of data clusters in the entire data space need not be known in advance. It provides a very simple and appealing way of displaying the organizational structure of the data using a tree diagram called a dendrogram, and it allows us to decide what level or scale of clustering is most appropriate for our application. The hierarchical clustering algorithm groups individuals or phenotypic outcomes by highly differentially expressed genes that have closely related expression patterns. Sometimes these highly differentially expressed genes are not relevant to the biological process. The problem is that such highly expressed irrelevant genes can potentially drown out the effects of other genes that have novel biological functions [12] and are very important for phenotypic variations such as cancer and diabetes. To overcome this problem, complementary HC (CHC) was proposed in [12] to discover biologically important gene sets with relatively low expression instead of highly expressed genes.
However, in the presence of outliers, the complementary hierarchical clustering algorithm may also produce misleading results. Gene expression microarray data are often contaminated by outliers because of the many steps involved in the experimental process, from hybridization to image processing. We therefore need robust techniques to overcome such problems. In this paper, an attempt is made to investigate the performance of classical and robust clustering methods.
II. PARTITIONING BASED CLUSTERING ALGORITHMS

A. K-means Clustering
K-means is a well-known partitioning method that requires prior knowledge of the number of clusters among the genes analyzed. The algorithm starts by randomly choosing k patterns as initial means for the clusters. The patterns are then assigned to the clusters by finding each pattern's closest mean. A new mean is calculated for each cluster, and the patterns are reassigned to the new means. This process is iterated until no pattern moves from one cluster to another.
K-means uses only the Euclidean distance and therefore cannot work with categorical patterns. The method is sensitive to outliers and noisy data, which are frequently present in gene expression data. One of its most significant downsides is that it gives different results for the same data if the starting conditions are varied, as illustrated in Figure 1.
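To make the iterative assign-and-update loop described above concrete, here is a minimal NumPy sketch (not from the paper); the toy data, seed, and convergence check are illustrative assumptions:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal k-means: choose k random patterns as initial means,
    assign each pattern to its closest mean, recompute the means,
    and repeat until no pattern changes cluster."""
    rng = np.random.default_rng(seed)
    means = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # assign each pattern to its closest mean (Euclidean distance)
        d = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        new_means = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                              else means[j] for j in range(k)])
        if np.allclose(new_means, means):   # no pattern moved: converged
            break
        means = new_means
    return labels, means

# two well-separated toy "expression profiles"
X = np.vstack([np.random.default_rng(1).normal(0, 0.3, (20, 2)),
               np.random.default_rng(2).normal(5, 0.3, (20, 2))])
labels, means = kmeans(X, k=2)
```

Running the sketch with a different seed can produce a different partition of the same data, which is exactly the initial-condition sensitivity noted above.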
Md. Alim Hossen (1), Abu Shaleh Mahmud (2), Md. Mushfiqur Rahman (3), Md. Nurul Haque Mollah (4)
Dept. of Statistics, University of Rajshahi, Rajshahi-6205, Bangladesh
(1) [email protected], (2) [email protected], (3) [email protected], (4) [email protected]
Md. Bahadur Badsha
Dept. of Bioscience and Bioinformatics, Kyushu Institute of Technology, 680-4 Kawazu, Iizuka, Fukuoka 820-8502, Japan
International Conference on Materials, Electronics & Information Engineering, ICMEIE-2015, 05-06 June 2015, Faculty of Engineering, University of Rajshahi, Bangladesh
www.ru.ac.bd/icmeie2015/proceedings/
ISBN 978-984-33-8940-4
Figure 1. K-means clustering of 230 genes with different K values. (A) Normalized expression patterns consisting of six expression values (K=1). (B) K-means clustering with K=4. (C) K-means clustering with K=8.
B. Partitioning Around Medoids
Partitioning around medoids (PAM) [4] is a well-known K-medoids algorithm and a more general version of K-means: it can work with any distance measure, whereas K-means works only with the Euclidean metric. PAM is more robust to noise and outliers than K-means because it minimizes a sum of pairwise dissimilarities instead of a sum of squared Euclidean distances.
The PAM function takes a prespecified number of clusters K and searches for K representative medoids. The objective function of PAM finds the K medoids that minimize the sum of the dissimilarities of the patterns to their nearest medoids. The disadvantage of this method is that it is relatively costly; its complexity is O(I K(n-K)^2), where I is the total number of iterations, K is the total number of clusters and n is the total number of patterns.
It is also sensitive to the initial medoid selection and can give different results for the same data if the initial medoids are varied.
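The swap-based cost minimization described above can be sketched as follows (a naive illustration, not the optimized PAM of [4]; the initial medoids, data, and greedy swap strategy are simplifying assumptions):

```python
import numpy as np

def pam(D, k, n_iter=50):
    """Naive PAM sketch on a precomputed n x n dissimilarity matrix D:
    repeatedly try swapping a medoid with a non-medoid and accept any
    swap that lowers the total cost (sum of dissimilarities of each
    pattern to its nearest medoid)."""
    n = len(D)
    medoids = list(range(k))                      # deterministic initial medoids
    cost = D[:, medoids].min(axis=1).sum()
    for _ in range(n_iter):
        improved = False
        for m in range(k):
            for h in range(n):
                if h in medoids:
                    continue
                trial = medoids.copy()
                trial[m] = h
                trial_cost = D[:, trial].min(axis=1).sum()
                if trial_cost < cost:
                    medoids, cost, improved = trial, trial_cost, True
        if not improved:
            break
    labels = D[:, medoids].argmin(axis=1)
    return medoids, labels, cost

# toy 1-D patterns in two groups; any dissimilarity matrix would do
p = np.array([0.0, 0.1, 0.2, 5.0, 5.1, 5.2])
D = np.abs(p[:, None] - p[None, :])
medoids, labels, cost = pam(D, k=2)
```

Because only a dissimilarity matrix is needed, any distance measure can be plugged in, which is the key generalization over K-means.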
C. Model-based Clustering Method
Model-based clustering methods are based on the assumption that the microarray data are generated from some underlying probabilistic model. A common example is to assume that the gene expression profiles are generated from a mixture of Gaussian distributions, with each component of the mixture corresponding to a cluster. In mixture model-based clustering, each component in the mixture is assumed to model a group of samples. In order to obtain the probabilities of a sample belonging to each cluster (group), density functions can be used and mixing coefficients can be defined for the mixture. For a given number k, mixture models can be efficiently computed with the Expectation-Maximization (EM) algorithm. The EM algorithm works iteratively by calculating the probability of each sample under the components and then recomputing the parameters of the individual Gaussian densities until convergence is reached. In practice, k-means can be regarded as an oversimplification of a mixture of Gaussians, where every sample is assigned to a single cluster. In this simplified context, clusters tend to have the same size and lie within spherical regions of the Euclidean space.
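The EM iteration described above can be sketched for a one-dimensional two-component Gaussian mixture (a deliberate simplification, not the paper's method; the deterministic extreme-value initialisation and the toy data are assumptions made for reproducibility):

```python
import numpy as np

def em_gmm_1d(x, k=2, n_iter=200):
    """Tiny EM sketch for a 1-D Gaussian mixture: the E-step computes the
    posterior probability (responsibility) of each sample under each
    component; the M-step re-estimates mixing weights, means, variances."""
    pi = np.full(k, 1.0 / k)
    mu = np.array([x.min(), x.max()])   # deterministic init for the sketch
    var = np.full(k, x.var())
    for _ in range(n_iter):
        # E-step: responsibilities, shape (n, k)
        dens = np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
        r = pi * dens
        r /= r.sum(axis=1, keepdims=True)
        # M-step: update mixing weights, means and variances
        nk = r.sum(axis=0)
        pi = nk / len(x)
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return pi, mu, var

x = np.concatenate([np.random.default_rng(1).normal(0, 1, 200),
                    np.random.default_rng(2).normal(8, 1, 200)])
pi, mu, var = em_gmm_1d(x)
```

Hardening the responsibilities to 0/1 would recover the k-means special case mentioned above.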
D. Self-Organizing Map (SOM)
The self-organizing map is a method for producing ordered low-dimensional representations of an input data space. Typically such input data are complex and high dimensional, with data elements related to each other in a nonlinear fashion. These maps can successfully approximate a high-dimensional input space by extracting invariant features of the input signals and maintaining the topological relationships between them in lower dimensions. A multidimensional microarray gene-expression data matrix is constructed, with each node representing a point in the microarray data. A random node is then selected and iteratively adjusted in the n-dimensional space according to the pattern of expression. Self-organizing maps thus impose structure on the data, with neighboring nodes tending to define related clusters. These clusters become nodes of the lower-dimension matrix. The self-organizing map method is ideally suited
Figure 2. Principle of SOMs. Initial geometry of nodes in 3 × 2 rectangular grid is indicated by solid lines connecting the nodes. Hypothetical trajectories of nodes as they migrate to fit data during successive iterations of SOM algorithm are shown. Data points are represented by black dots, six nodes of SOM by large circles, and trajectories by arrows.
for explorative data analysis where prior information about the distribution of the data is not available. Also the computational algorithms are relatively easy to implement, fast, and scalable to large data sets. The results are easy to visualize and interpret.
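The node-migration principle of Figure 2 can be sketched with a minimal SOM (here a 1-D chain of six nodes rather than the 3 × 2 grid, purely to keep the sketch short; the learning-rate and radius schedules are common illustrative choices, not the paper's):

```python
import numpy as np

def som_1d(X, n_nodes=6, n_iter=2000, seed=0):
    """Minimal SOM sketch: nodes live on a 1-D grid; for each random input
    the best-matching node (BMU) and its grid neighbours are pulled toward
    the input, with the learning rate and neighbourhood radius shrinking
    over the iterations so the map gradually freezes onto the data."""
    rng = np.random.default_rng(seed)
    nodes = rng.normal(size=(n_nodes, X.shape[1]))
    grid = np.arange(n_nodes)
    for t in range(n_iter):
        x = X[rng.integers(len(X))]
        bmu = np.linalg.norm(nodes - x, axis=1).argmin()
        lr = 0.5 * (1 - t / n_iter)                       # decaying learning rate
        radius = max(1.0, n_nodes / 2 * (1 - t / n_iter)) # decaying radius
        h = np.exp(-(grid - bmu) ** 2 / (2 * radius ** 2))
        nodes += lr * h[:, None] * (x - nodes)
    return nodes

# two toy expression clusters; the trained nodes should migrate to cover them
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (30, 2)), rng.normal(6, 0.3, (30, 2))])
nodes = som_1d(X)
```

After training, each node acts as a cluster prototype, and neighbouring nodes on the grid represent related clusters, as described above.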
E. Fuzzy Clustering
Crisp algorithms such as K-means, HC, SOM and so on are unable to identify genes whose expression levels are similar to multiple distinct groups of genes. In addition, crisp clustering methods may yield inaccurate clusters that lead to incorrect conclusions when analyzing large gene expression data sets collected under different conditions, since genes are likely to be co-expressed with different groups of genes under different conditions [13]. Many approaches have been proposed to model such complex relationships between objects. Fuzzy clustering employs the fuzzy logic method for grouping patterns and provides a systematic and unbiased way to turn precise values into several descriptors of cluster membership. Fuzzy clustering methods can be used to assign genes that are tightly clustered to one, two or several groups. Therefore, fuzzy clustering uncovers information about the relative likelihood of each gene belonging to each of a predefined number of clusters. Applications of the fuzzy logic method to microarray data analysis include Fuzzy C-Means (F-CM) [14] [15], Fuzzy J-Means (F-JM) and variable neighborhood search (VNS) [16]. The main drawback of fuzzy clustering is that the assignment of genes to each cluster depends on a membership cutoff. A very high membership cutoff may leave out genes with highly correlated expression patterns across all of the experiments, while a very low membership cutoff will assign most of the genes to every cluster. Therefore, the membership cutoff should be carefully chosen to obtain the desired outcome; otherwise misclustering may result.
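The membership idea above can be sketched with a small Fuzzy C-Means loop (a textbook F-CM sketch, not the specific implementations of [14]-[16]; the deterministic initialisation and toy data are assumptions):

```python
import numpy as np

def fuzzy_cmeans(X, c=2, m=2.0, n_iter=100):
    """Fuzzy C-means sketch: each pattern gets a membership in [0, 1] to
    every cluster (each row of U sums to 1) instead of a hard assignment.
    m > 1 is the fuzzifier controlling how soft the memberships are."""
    # deterministic start for the sketch: first and last patterns as centers
    centers = X[[0, -1]].astype(float).copy()
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None] - centers[None], axis=2) + 1e-12
        inv = d ** (-2 / (m - 1))
        U = inv / inv.sum(axis=1, keepdims=True)      # membership update
        w = U ** m
        centers = (w.T @ X) / w.sum(axis=0)[:, None]  # weighted center update
    return U, centers

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(5, 0.3, (20, 2))])
U, centers = fuzzy_cmeans(X)
```

Thresholding the rows of U at a membership cutoff yields the gene-to-cluster assignments discussed above, which is where the cutoff sensitivity enters.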
III. HIERARCHICAL CLUSTERING (HC)
Hierarchical clustering algorithms are very useful tools for analyzing microarray data. They provide a very simple and appealing way of displaying the organizational structure of the data using a tree diagram called a dendrogram. The graphical representation of hierarchical clustering results allows users to visualize global expression patterns in gene expression data, which makes this method a favorite among biologists [17]. Although widely used, this method is very sensitive to outlying observations, which is an important issue in gene-expression data analysis. To avoid such problems, we propose to use a robust HC (RHC) method [18]. RHC recovers the original structure in the presence of outliers.
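To make the classical workflow concrete, here is a small sketch of correlation-based hierarchical clustering with SciPy (assuming SciPy is available; the toy profiles, the average-linkage choice, and the cut into three clusters are illustrative assumptions):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# toy expression matrix: 3 groups of "genes" (rows) with distinct profiles
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (5, 4)) + p
               for p in ([0, 0, 5, 5], [5, 5, 0, 0], [2, 5, 2, 5])])

# correlation-based dissimilarity d_ij = 1 - r_ij, as commonly used for genes
R = np.corrcoef(X)
D = 1 - R

# scipy expects a condensed distance vector; then build the dendrogram
Z = linkage(squareform(D, checks=False), method='average')
labels = fcluster(Z, t=3, criterion='maxclust')
```

Cutting the dendrogram `Z` at other levels (via `fcluster`) is what lets the analyst choose the scale of clustering mentioned above.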
A. Robust Hierarchical Clustering
Robust HC is a simple and intuitively appealing method that replaces the classical correlation matrix derived from X by the minimum β-divergence estimator of the correlation matrix. To calculate the minimum β-divergence estimator of the correlation matrix, the mean vector and the covariance matrix are obtained iteratively as follows:
$$\mu_{t+1} = \frac{\sum_{i=1}^{n} \varphi_\beta(x_i;\, \mu_t, V_t)\, x_i}{\sum_{i=1}^{n} \varphi_\beta(x_i;\, \mu_t, V_t)}$$

and

$$V_{t+1} = \frac{(1+\beta) \sum_{i=1}^{n} \varphi_\beta(x_i;\, \mu_t, V_t)\, (x_i - \mu_t)(x_i - \mu_t)^T}{\sum_{i=1}^{n} \varphi_\beta(x_i;\, \mu_t, V_t)},$$

where $\varphi_\beta(x;\, \mu, V) = \exp\left\{-\frac{\beta}{2}(x-\mu)^T V^{-1} (x-\mu)\right\}$ is the β-weight function.
The $n \times n$ β-dissimilarity matrix $D_\beta = (d_{ij})_{n \times n}$ is then constructed, where $d_{ij}$ is computed using one of (i) $d_{ij} = 1 - r_{ij}$, (ii) $d_{ij} = 1 - |r_{ij}|$, and (iii) $d_{ij} = (1 - r_{ij}^2)^{1/2}$, with $r_{ij} = v_{ij}/\sqrt{v_{ii} v_{jj}}$ the robust Pearson correlation coefficient between the $i$-th and $j$-th variables obtained from the minimum β-divergence estimate $V = (v_{ij})$. Definition (i) is used in RHC. The β-dissimilarity matrix $D_\beta$ is used to generate the dendrogram of HC. The classical HC method is sensitive to outliers or contaminated data, whereas the RHC method is highly robust against them.
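The iterative β-weighted estimators above can be sketched in NumPy (a sketch of the update equations only, under the assumption of a fixed β and a simple contamination model; the toy data are illustrative):

```python
import numpy as np

def beta_weights(X, mu, Vinv, beta):
    """phi_beta(x; mu, V) = exp(-(beta/2) (x-mu)^T V^{-1} (x-mu))."""
    d = X - mu
    return np.exp(-0.5 * beta * np.einsum('ij,jk,ik->i', d, Vinv, d))

def min_beta_divergence_estimates(X, beta=0.2, n_iter=50):
    """Sketch of the iterative minimum beta-divergence estimators of the
    mean vector and covariance matrix: observations far from the bulk get
    exponentially small weights, so outliers barely influence the fit."""
    mu, V = X.mean(axis=0), np.cov(X, rowvar=False)
    for _ in range(n_iter):
        w = beta_weights(X, mu, np.linalg.inv(V), beta)
        mu = (w[:, None] * X).sum(axis=0) / w.sum()
        d = X - mu
        V = (1 + beta) * (w[:, None] * d).T @ d / w.sum()
    return mu, V

rng = np.random.default_rng(0)
X = rng.normal(0, 1, (200, 3))
X[:5] += 20                               # five gross outliers
mu_b, V_b = min_beta_divergence_estimates(X)

# robust correlations and the beta-dissimilarity of definition (i)
r = V_b / np.sqrt(np.outer(np.diag(V_b), np.diag(V_b)))
D = 1 - r
```

Feeding `D` into any dendrogram-building routine (e.g. average linkage) then yields the RHC tree, with the outlying rows effectively ignored by the near-zero weights.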
B. Complementary Hierarchical Clustering Algorithm (CHC)
The results of HC algorithms exhibit highly differentially expressed (DE) genes that have similar expression patterns. Sometimes these highly DE genes are not relevant to the biological process. The problem is that such highly DE irrelevant genes can potentially drown out the effects of other important, lowly expressed genes that have novel biological functions. To overcome this problem, complementary HC (CHC) was proposed in [12]; it can discover biologically important gene-sets with low expression. CHC is a procedure that can be applied using any hierarchical clustering algorithm, as the only requirement is that the clustering pattern can be represented as a dendrogram [12].
Thus, the CHC procedure considers the many groupings present in the initial clustering while focusing on removing the structures arising from the strong genes. However, this CHC approach is very sensitive to outliers, so it produces misleading results in their presence.
C. Robust Complementary Hierarchical Clustering (RCHC)
Complementary hierarchical clustering (CHC) is not robust against outlying expressions and often produces misleading results if there exist some contaminations in the gene expression data. To overcome this problem, a robust CHC named RCHC was proposed in [19] by maximizing the β-likelihood function (see [19]-[21]) of the dummy-variable linear regression model for the $g$-th gene expression in $n$ individuals with the $h$-th cut of the dendrogram, written in matrix notation as

$$x_g = Z_h \gamma_{gh} + \epsilon_{gh}, \qquad \epsilon_{gh} \sim N(0, \sigma_{gh}^2 I).$$

The maximum β-likelihood estimators for the parameters $\gamma_{gh}$ and $\sigma_{gh}^2$ are obtained iteratively as follows:
$$\gamma_{gh,t+1} = \left[Z_h^T (W_{gt} \,\#\, Z_h)\right]^{-1} Z_h^T (W_{gt} \,\#\, x_g) \quad \text{(in matrix notation)}$$

and

$$\sigma_{gh,t+1}^2 = \frac{(x_g - Z_h \gamma_{gh,t+1})^T \left[W_{gt} \,\#\, (x_g - Z_h \gamma_{gh,t+1})\right]}{(1+\beta)^{-1}\, \mathbf{1}^T W_{gt}},$$

where $\#$ denotes the elementwise product and

$$W_{gt} = \left(\exp\left\{-\frac{\beta\, (x_{gi} - z_{hi}^T \gamma_{gh,t})^2}{2\, \sigma_{gh,t}^2}\right\}\right)_{i=1}^{n},$$
which is called the β-weight vector. It produces almost zero weights for outlying gene expressions.
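The iterative β-likelihood updates above can be sketched as a weighted least-squares loop (a sketch under assumed values of β and a toy two-group dummy design, not the paper's implementation):

```python
import numpy as np

def beta_weight(x, Z, gamma, sigma2, beta):
    """w_i = exp(-beta (x_i - z_i^T gamma)^2 / (2 sigma^2)):
    outlying expressions get near-zero weight."""
    resid = x - Z @ gamma
    return np.exp(-beta * resid ** 2 / (2 * sigma2))

def beta_likelihood_fit(x, Z, beta=0.3, n_iter=50):
    """Sketch of the iterative maximum beta-likelihood estimators of the
    dummy-variable regression x_g = Z_h gamma + eps used in RCHC:
    alternate between beta-weights and weighted least squares."""
    gamma, _, _, _ = np.linalg.lstsq(Z, x, rcond=None)  # ordinary LS start
    sigma2 = np.mean((x - Z @ gamma) ** 2)
    for _ in range(n_iter):
        w = beta_weight(x, Z, gamma, sigma2, beta)
        W = np.diag(w)
        gamma = np.linalg.solve(Z.T @ W @ Z, Z.T @ W @ x)
        resid = x - Z @ gamma
        # sigma^2 update with the (1+beta)^{-1} 1^T W denominator
        sigma2 = (resid @ (w * resid)) / (w.sum() / (1 + beta))
    return gamma, sigma2

# dummy design: first 4 samples in group 1, last 4 in group 2
Z = np.repeat(np.eye(2), 4, axis=0)
x = np.array([8, 8, 8, 8, -8, -8, -8, -8], dtype=float) \
    + np.random.default_rng(0).normal(0, 0.5, 8)
x[7] = 40.0                     # one outlying expression
gamma, sigma2 = beta_likelihood_fit(x, Z)
```

An ordinary least-squares fit would drag the second group's estimate far from -8 because of the outlier, whereas the β-weights shrink its influence toward zero over the iterations.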
IV. ARTIFICIALLY GENERATED DATA

We generated a microarray dataset for the simulation study, as displayed in Figure 3, where the dataset has three sets of different gene expressions with respect to the individuals {1-8}. We used the generated data with the three gene-sets {1-20}, {21-40} and {41-60} as
Figure 3. Artificially generated gene expression data. The 60 × 8 matrix of artificially generated gene expression data with three different sets of DE genes is illustrated. The rows of the matrix correspond to the genes and the columns the individuals/ samples. The genes {1-20} are highly DE, whose expression intensities are assumed +8 (positive intensity) and -8 (negative intensity) for the individuals {1, 2, 3, 4} and {5, 6, 7, 8}, respectively.
Similarly, the genes {21-40} are medium DE, whose expression intensities are assumed +6 (positive intensity) and -6 (negative intensity) for the individuals {1, 2, 5, 6} and {3, 4, 7, 8}, respectively. The genes {41-60} are low DE, whose expression intensities are assumed +3 (positive intensity) and -3 (negative intensity) for the individuals {1, 3, 5, 7} and {2, 4, 6, 8}, respectively. To randomize the gene expressions among the individuals, the Gaussian noise with N (0, 1) is added to the expression of each gene.
the reference data to examine the performance of the classical CHC and the proposed RCHC for the sequential extraction of the gene-sets with the proper groups of individuals. Classical CHC and robust CHC (RCHC) exhibit the same results for the generated gene expression data in the absence of outliers, as illustrated in Figure 4.
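The 60 × 8 design described in Figure 3 can be reproduced directly from its specification (the random seed is an assumption; everything else follows the stated intensities):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.zeros((60, 8))
X[0:20,  [0, 1, 2, 3]] = +8; X[0:20,  [4, 5, 6, 7]] = -8   # highly DE genes
X[20:40, [0, 1, 4, 5]] = +6; X[20:40, [2, 3, 6, 7]] = -6   # medium DE genes
X[40:60, [0, 2, 4, 6]] = +3; X[40:60, [1, 3, 5, 7]] = -3   # low DE genes
X += rng.normal(0, 1, X.shape)                              # N(0, 1) noise
```

Appending an outlying final row to this matrix, as done for Figure 5, gives the contaminated matrix X* used to compare CHC and RCHC.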
To examine the robustness of the RCHC method in comparison with the CHC method, we added outliers to the final row of the data matrix X. Figure 5(a) shows the data matrix in the presence of outliers (X*). RCHC was then applied to the contaminated data set X*. Figure 5(e) shows that the RCHC method precisely classified the contaminated data without being affected by the outliers.
From these results, it is seen that the performances of HC and CHC are very much influenced by the outliers. However, the results of RHC and the proposed RCHC are not influenced at all by the outliers. Therefore, to extract the medium or low DE important gene-sets, we need to apply the robust CHC algorithm to the full gene expression dataset.
Figure 4. Performance investigation of CHC and RCHC by simulation study in the absence of outliers. (a) Original data set X, (b-c) classical HC and CHC results, and (d-e) RHC and RCHC results, respectively.
Figure 5. Performance investigation of CHC and RCHC by simulation study in the presence of outliers. (a) Outlier-contaminated data (X*), with outliers added to the final row of the dataset shown in Figure 3, (b-c) classical HC and CHC results, and (d-e) RHC and RCHC results, respectively.
V. CONCLUSIONS

Clustering is an approach to identifying groups of genes that have similar expression patterns across a group of microarray experiments. The grouping of genes according to their expression patterns is performed by measuring the similarity of the genes' expression patterns. The underlying true clustering assignment in real data is generally unknown, and the concept of a cluster is not mathematically well defined in unsupervised learning. Very often a method works well on some datasets but performs poorly on others owing to differences in data structure and characteristics. The number of data clusters in the entire data space must be known in advance for partitioning based clustering, which is difficult in microarray gene expression data analysis. One of the most significant downsides of these methods is that they give different results for the same data if the initial clusters are varied. Traditional unsupervised clustering algorithms are not robust against outliers and may produce misleading clustering results in their presence. HC algorithms exhibit results for highly differentially expressed (DE) genes that have similar expression patterns. Sometimes these highly DE genes are not relevant to the biological process. The problem is that such highly DE irrelevant genes can potentially drown out the effects of other important, lowly expressed genes that have novel biological functions. CHC can discover biologically important gene-sets with low expression. But this method is not robust against outlying expressions and often produces misleading results if there exist some contaminations in the gene expression data. To overcome this problem, our proposal is to use the robust CHC (RCHC) method, which can sequentially extract DE gene-sets with similar expression patterns. To investigate the performance of the RCHC method in comparison with the traditional approach, we applied it to artificially generated gene expression data.
RCHC exhibits robustness against outlying expressions. The use of the robust CHC method significantly improves performance over the traditional approach in the presence of outliers.
ACKNOWLEDGEMENT

This work is supported by the HEQEP sub-project (CP-3603, W-2, R-3), Bioinformatics Lab, Dept. of Statistics, University of Rajshahi, Rajshahi-6205, Bangladesh. The authors would like to thank the anonymous reviewers for their helpful comments.
REFERENCES

[1] Brown, P.O. and Botstein, D., "Exploring the new world of the genome with DNA microarrays," Nature Genetics, vol. 21, pp. 33–37, 1999.
[2] MacQueen,J.B.,”Some methods for classification and analysis of Multi- variate observations,” Proc. fifth Berkeley Symp. Math. Stat. Prob. vol. 1, pp. 281–297, 1967.
[3] Hartigan, J.A. and Wong, M.A., "A K-means clustering algorithm," Appl. Stat., vol. 28, 1979.
[4] Kaufman,L. and Rousseeuw,P. “Finding Groups in Data: An Introduction to Cluster Analysis,”in Wiley, New York, 1990.
[5] Kohonen,T.,”The self-organizing map,” Proc. IEEE, vol. 78, pp.1464- 1480, 1990.
[6] Tamayo,P., Slonim,D., Mesirov,J., Zhu,Q., Kitareewan,S.,Dmitrovsky,E., Lander,E.S. and Golub,T.R., “ Interpreting patterns of gene expression with self- organizing maps: methods and application to hematopoietic differentiation,” Proc. Natl Acad. Sci.USA, vol. 96, pp.2907–2912, 1999.
[7] Yeung, K. Y., C. Fraley, A. Murua, A. E. Raftery, and W. L. Ruzzo, “Model-based clustering and data transformations for gene expression data,” Bioinformatics, vol. 17, pp.977–987, 2001a.
[8] Fraley,C. and Raftery,A.E., “Model-based clustering, discriminant analysis and density estimation,” J. Am. Stat. Assoc., vol. 97, pp.611–631, 2002b.
[9] Medvedovic,M. and Sivaganesan,S., “Bayesian infinite mixture model based clustering of gene expression profiles,” Bioinformatics, vol. 18, pp.1194–1206, 2002.
[10] McLachlan,G.J., Bean,R.W. and Peel,D., “A mixture model-based approach to the clustering of microarray expression data,”Bioinformatics, vol.18, pp. 413–422, 2002.
[11] Eisen, M. B., Spellman, P. T., Brown, P. O. and Botstein, D. “Cluster analysis and display of genome-wide expression Patterns,” in Proceedings of the National Academy of Science of the United States of America, vol. 95, pp. 14863–14868, 1998.
[12] Gen Nowak and Robert Tibshirani, “Complementary Hierarchical Clustering”, Biostatistics, vol. 9, pp.467-483, 2008.
[13] Gasch, A.U. and Eisen, M.B., “Exploring the conditional co-regulation of yeast gene expression through fuzzy Kmeans clustering”, Genome Biol, vol. 3, pp. 1−22, 2002.
[14] Dembe´le´,D. and Kastner,P., “Fuzzy C-means method for clustering microarray data,” Bioinformatics,vol. 19, pp. 973-980, 2003.
[15] Wang, J., Bo, T.H., Jonassen, I., and Hovig, E. “Tumor classification and marker gene prediction by feature selection and fuzzy c-means clustering using microarray data,” BMC Bioinformatics, vol. 4, pp. 60, 2003.
[16] Belacel, N., Čuperlović-Culf, M., Laflamme, M., and Ouellette,R. “Fuzzy J-means and VNS methods for clustering genes from microarray data,”
Bioinformatics, vol. 20, pp. 1690−1701, 2004.
[17] Tseng,G.C. and Wong,W.H. “Tight clustering: a resampling-based approach for identifying stable and tight patterns in data,” Biometrics, vol.61, pp.10–16, 2005.
[18] Mollah, M. N. H., Mari, P., Komori, O., and Eguchi, S., "Robust hierarchical clustering for gene expression data analysis," in Proceedings of the 2nd International Conference on Bioinformatics and Systems Biology (BSB-2009), Leipzig, Germany, 23-25 March 2009.
[19] Badsha, M. B., et al., "Robust complementary hierarchical clustering for gene expression analysis by β-divergence," J. Biosci. Bioeng., 2013.
[20] Mollah, M.N.H., Sultana, N., Minami, M. and Eguchi, S., "Robust extraction of local structures by the minimum β-divergence method," Neural Networks, vol. 23, pp. 223-238, 2010.
[21] Mollah, M.N.H., Minami, M. and Eguchi, S., "Robust prewhitening for ICA by the minimum β-divergence and its application to FastICA," Neural Processing Letters, vol. 25, pp. 91-110, 2007.