Data Mining in Bioinformatics
Day 7: Clustering in Bioinformatics
Karsten Borgwardt
February 25 to March 10
Clustering in bioinformatics
Microarrays
Clustering is a widely used tool in microarray analysis
Class discovery is an important problem in microarray studies for two reasons:
either the classes are completely unknown beforehand
or known classes may contain unknown subclasses
Examples
Classes unknown:
Does a disease affect gene expression in a particular tissue?
Does gene expression differ between two groups in a particular condition?
Subclasses unknown:
Are there subtypes of a disease?
Popularity
Clustering tools are available in the large microarray database NCBI Gene Expression Omnibus (GEO)
http://www.ncbi.nlm.nih.gov/geo/
Distance metrics
Euclidean distance
Euclidean distance of gene x and y of n samples or sample x and y of n genes:
\[ d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \]
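The Euclidean distance is the square root of the summed squared differences between two profiles. A minimal sketch, assuming each gene (or sample) is stored as a plain list of floats; the function name and the example values are illustrative only.

```python
import math

def euclidean_distance(x, y):
    """Euclidean distance between two equal-length expression profiles."""
    if len(x) != len(y):
        raise ValueError("profiles must have the same length")
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

# Hypothetical expression values of two genes across n = 3 samples.
gene_x = [1.0, 2.0, 3.0]
gene_y = [4.0, 6.0, 3.0]
print(euclidean_distance(gene_x, gene_y))  # 5.0
```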
Un-centered correlation coefficient
Un-centered correlation coefficient of gene x and y of n samples or sample x and y of n genes:
\[ r^u_{xy} = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2} \, \sqrt{\sum_{i=1}^{n} y_i^2}} \]
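The un-centered correlation coefficient divides the sum of products by the product of the two root sums of squares, without subtracting the means first (it equals the cosine of the angle between the two profiles). A minimal sketch under the same list-of-floats assumption; the function name and data are hypothetical.

```python
import math

def uncentered_correlation(x, y):
    """Un-centered correlation of two equal-length profiles (no mean subtraction)."""
    if len(x) != len(y):
        raise ValueError("profiles must have the same length")
    numerator = sum(xi * yi for xi, yi in zip(x, y))
    denominator = math.sqrt(sum(xi ** 2 for xi in x)) * math.sqrt(sum(yi ** 2 for yi in y))
    if denominator == 0.0:
        raise ValueError("an all-zero profile has no defined correlation")
    return numerator / denominator

# Hypothetical profiles: y is a scaled copy of x, so the coefficient is close to 1.
print(uncentered_correlation([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
# Orthogonal profiles give 0.
print(uncentered_correlation([1.0, 0.0], [0.0, 1.0]))
```

To use the coefficient as a dissimilarity for clustering, one common choice is 1 minus the coefficient.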
Clustering algorithms
Hierarchical Clustering
Single linkage: The linking distance is the minimum pairwise distance between members of the two clusters.
Complete linkage: The linking distance is the maximum pairwise distance between members of the two clusters.
Average linkage/UPGMA: The linking distance is the average of all pairwise distances between members of the two clusters. Since all genes and samples carry equal weight, the linkage is an Unweighted Pair Group Method with Arithmetic Means (UPGMA).
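The three linkage rules differ only in how the pairwise distances between two clusters are summarised. The sketch below is a naive agglomerative procedure written for readability, not efficiency; it assumes Euclidean distance, and the function names and toy profiles are hypothetical.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def linkage_distance(c1, c2, points, method):
    """Summarise all pairwise distances between members of two clusters."""
    distances = [euclidean(points[i], points[j]) for i in c1 for j in c2]
    if method == "single":
        return min(distances)
    if method == "complete":
        return max(distances)
    if method == "average":  # UPGMA
        return sum(distances) / len(distances)
    raise ValueError("method must be 'single', 'complete' or 'average'")

def agglomerate(points, n_clusters, method="average"):
    """Merge the two closest clusters until n_clusters remain."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage_distance(clusters[a], clusters[b], points, method)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # b > a, so index a is still valid
    return [sorted(c) for c in clusters]

# Hypothetical 2-dimensional profiles forming two tight pairs and one outlier.
profiles = [[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9], [9.0, 0.0]]
print(agglomerate(profiles, 3, "single"))  # [[0, 1], [2, 3], [4]]
```

Recording the merge order and linking distances instead of stopping at a fixed number of clusters would give the dendrogram usually shown for microarray data.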
‘Flat’ Clustering
The two-sample problem
Interpretation of clusters
Clustering introduces ‘structure’ into microarray datasets
But is there a statistical or biomedical meaning of these classes?
Biomedical meaning has to be established in experiments
‘Statistical meaning’ can be measured with a statistical test, the so-called two-sample test
Data diversity
Molecular biology produces a wealth of information
The problem is that these data are generated
on different platforms and by different protocols
under different levels of noise
Hence data from different labs show
different scales
different ranges
different distributions
Main problem:
The two-sample problem
Given two samples X and Y.
Were they generated by the same distribution?
Previous approaches
t-test
A test of the null hypothesis that the means of two normally distributed populations are equal
unpaired/independent (versus paired)
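For equal sample sizes n and equal variances, the t statistic to test whether the means are different can be calculated as follows:
\[ t = \frac{\bar{x} - \bar{y}}{s_p \sqrt{2/n}}, \qquad s_p = \sqrt{\frac{s_x^2 + s_y^2}{2}} \]
where s_x^2 and s_y^2 are the unbiased sample variances of the two samples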
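For the equal-size, equal-variance case, the statistic is the difference of the sample means divided by the pooled standard deviation times sqrt(2/n). A minimal sketch using only the standard library; the function name and the data are hypothetical, and turning t into a p-value (Student's t distribution with 2n - 2 degrees of freedom) is left out.

```python
import math
from statistics import mean, variance

def t_statistic_equal_n(x, y):
    """Unpaired t statistic for two samples of equal size and equal variance."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two samples of the same size, at least 2 each")
    n = len(x)
    # statistics.variance is the unbiased (n - 1) sample variance.
    pooled_sd = math.sqrt((variance(x) + variance(y)) / 2.0)
    return (mean(x) - mean(y)) / (pooled_sd * math.sqrt(2.0 / n))

# Hypothetical expression values of one gene in two groups of five samples.
group_a = [1.0, 2.0, 3.0, 4.0, 5.0]
group_b = [2.0, 3.0, 4.0, 5.0, 6.0]
print(t_statistic_equal_n(group_a, group_b))  # close to -1
```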
New challenges in bioinformatics
high-dimensional
structured (strings and graphs)
low sample size
MMD key idea
Key Idea
Avoid density estimator, use means in feature spaces
Maximum Mean Discrepancy (Fortet and Mourier, 1953)
\[ D(p, q, F) := \sup_{f \in F} \left( E_p[f(x)] - E_q[f(y)] \right) \]
Theorem
D(p, q, F) = 0 iff p = q, when F = C_0(X).
Follows directly, e.g. from Dudley, 1984.
MMD statistic
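Goal: Estimate D(p, q, F)
For F the unit ball of a reproducing kernel Hilbert space with kernel k, the squared discrepancy is
\[ D^2(p, q, F) = E_{p,p}[k(x, x')] - 2 E_{p,q}[k(x, y)] + E_{q,q}[k(y, y')] \]
U-Statistic: Empirical estimate D^2(X, Y, F) from samples X = (x_1, ..., x_m) and Y = (y_1, ..., y_m)
\[ D^2(X, Y, F) = \frac{1}{m(m-1)} \sum_{i \neq j} \left[ k(x_i, x_j) + k(y_i, y_j) - k(x_i, y_j) - k(x_j, y_i) \right] \]
Estimate σ^2 from data.
Reject null hypothesis that p = q if D^2(X, Y, F) exceeds a threshold chosen for the significance level α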
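The U-statistic estimate only needs kernel evaluations between pairs of data points. The sketch below assumes two samples of equal size m and a Gaussian kernel with a hand-picked bandwidth; both the kernel choice and the function names are assumptions for illustration, and the threshold computation from σ^2 is not shown.

```python
import math

def gaussian_kernel(a, b, bandwidth=1.0):
    """Gaussian (RBF) kernel; any positive definite kernel could be used instead."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))

def mmd2_unbiased(X, Y, kernel=gaussian_kernel):
    """U-statistic estimate of the squared MMD for two samples of equal size."""
    m = len(X)
    if m != len(Y) or m < 2:
        raise ValueError("need two samples of the same size, at least 2 each")
    total = 0.0
    for i in range(m):
        for j in range(m):
            if i != j:  # the U-statistic skips the diagonal terms
                total += (kernel(X[i], X[j]) + kernel(Y[i], Y[j])
                          - kernel(X[i], Y[j]) - kernel(X[j], Y[i]))
    return total / (m * (m - 1))

# Hypothetical one-dimensional samples that are far apart: estimate is large.
near_zero = [[0.0], [0.1], [0.2]]
near_ten = [[10.0], [10.1], [10.2]]
print(mmd2_unbiased(near_zero, near_ten))
# Identical samples: estimate is 0.
print(mmd2_unbiased(near_zero, near_zero))
```

The double loop makes the quadratic cost in the number of data points explicit. Because the estimate is unbiased, it can come out slightly negative when the two samples are drawn from the same distribution.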
Attractive for bioinformatics
MMD
two-sample test in terms of kernels
Computationally attractive
search infinite space of functions by evaluating one expression
Wide applicability
for one- and higher-dimensional vectorial data, but also for structured data!
two-sample problems can now be tackled on
strings: protein and DNA sequences
graphs: molecules, protein interaction networks
time series: time series of microarray data
Cross-platform comparability
Data
microarray data from two breast cancer studies
one on cDNA platform (Gruvberger et al., 2001)
other on oligonucleotide microarray platform (West et al., 2001)
Task
Can MMD help to find out if two sets of observations were generated by
the same study (both from Gruvberger or both from West)?
different studies (one from Gruvberger, one from West)?
Experiment
sample size each: 25
dimension of each datapoint: 2,116
significance level: α = 0.05
100 times: 1 sample from Gruvberger, 1 from West
100 times: both from Gruvberger or both from West
report percentage of correct decisions
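The protocol above (repeat the test on pairs of samples, then report the percentage of correct decisions) can be sketched as follows. Everything here is a stand-in: the data are synthetic Gaussian draws rather than the two breast cancer studies, the kernel is Gaussian, the sizes are scaled down for speed, and the rejection threshold comes from a permutation test rather than from an estimate of σ^2. All function names are hypothetical.

```python
import math
import random

def gaussian_kernel(a, b, bandwidth=1.0):
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))

def mmd2_unbiased(X, Y):
    """U-statistic estimate of the squared MMD for two samples of equal size."""
    m = len(X)
    total = 0.0
    for i in range(m):
        for j in range(m):
            if i != j:
                total += (gaussian_kernel(X[i], X[j]) + gaussian_kernel(Y[i], Y[j])
                          - gaussian_kernel(X[i], Y[j]) - gaussian_kernel(X[j], Y[i]))
    return total / (m * (m - 1))

def rejects_null(X, Y, rng, alpha=0.05, n_perm=99):
    """Permutation test (swapped in for the sigma^2-based threshold)."""
    observed = mmd2_unbiased(X, Y)
    pooled = list(X) + list(Y)
    m = len(X)
    at_least_as_large = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # random relabelling of the pooled points
        if mmd2_unbiased(pooled[:m], pooled[m:]) >= observed:
            at_least_as_large += 1
    p_value = (at_least_as_large + 1) / (n_perm + 1)
    return p_value <= alpha

def draw(rng, m, shift, dim=2):
    """Synthetic stand-in for one study: Gaussian points centred at `shift`."""
    return [[rng.gauss(shift, 1.0) for _ in range(dim)] for _ in range(m)]

def percent_correct(rng, same_source, trials=10, m=8):
    correct = 0
    for _ in range(trials):
        X = draw(rng, m, 0.0)
        Y = draw(rng, m, 0.0 if same_source else 3.0)
        rejected = rejects_null(X, Y, rng)
        # Correct decision: reject for different sources, accept for the same source.
        if rejected != same_source:
            correct += 1
    return 100.0 * correct / trials

rng = random.Random(0)
print("different sources:", percent_correct(rng, same_source=False))
print("same source:", percent_correct(rng, same_source=True))
```

With real data, `draw` would be replaced by subsampling 25 arrays from each study and the bandwidth would need tuning for 2,116 dimensions.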
Kernel-based statistical test
novel statistical test for two-sample problem:
easy to implement
non-parametric
first for structured data
best on high-dimensional data
quadratic runtime w.r.t. the number of data points
impressive accuracy in our experiments
kernel method for two-sample problem:
all kernels recently defined in molecular biology can be re-used for data integration
Biclustering
Clustering in two dimensions
alternative names: co-clustering, two-mode clustering
A bicluster is a subset of genes that show similar activity patterns under a subset of conditions.
Cluster patients and conditions
Earliest work by Hartigan, 1972: Divide a matrix into submatrices with minimum variance.
Most interesting cases are NP-complete.
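The minimum-variance criterion scores a candidate submatrix by the variance of its entries, so a perfectly constant bicluster scores 0. The sketch below only evaluates one given choice of rows and columns and does not search for the best partition (the hard part); the function name and the toy matrix are hypothetical.

```python
def submatrix_variance(matrix, rows, cols):
    """Variance of the entries selected by a subset of rows and a subset of columns."""
    values = [matrix[r][c] for r in rows for c in cols]
    if not values:
        raise ValueError("the submatrix must contain at least one entry")
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

# Hypothetical expression matrix: rows 0-1 and columns 0-1 form a constant block.
expression = [
    [5.0, 5.0, 0.1, 9.0],
    [5.0, 5.0, 7.3, 0.4],
    [1.2, 8.8, 3.0, 2.5],
]
print(submatrix_variance(expression, rows=[0, 1], cols=[0, 1]))  # 0.0
print(submatrix_variance(expression, rows=[0, 1, 2], cols=[0, 1, 2, 3]))
```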
References and further reading
References