9.5.1 Overview
In several statistical procedures, the objects on which measurements are made are assumed to be homogeneous. However, in cluster analysis, the focus is on the possibility of dividing a set of objects into a number of subsets of objects that display systematic differences.
Cluster analysis represents a group of multivariate techniques used primarily to identify similar entities on the basis of the characteristics they possess. It places objects that are very similar to one another into groups or “clusters” with respect to some predetermined selection criteria, while the groups exhibit significant differences between them regarding the same criteria. The resulting clusters thus show high within-cluster homogeneity and high between-cluster heterogeneity. Figure 9.2 illustrates the within-cluster and between-cluster variations.
There is no fixed guideline on what constitutes a cluster; ultimately it depends on the value judgment of the user, who determines the pattern of clusters inherent in the data. An almost endless number of clustering algorithms are found in the literature. All of these depend on high-speed computing capacity and aim to satisfy some criterion that maximizes the between-cluster variation relative to the within-cluster variation. Many different approaches have been used to measure interobject similarity, and numerous algorithms have been developed to form clusters. However, no universally agreed-upon choice among these methods has emerged. As such, cluster analysis remains much more of an art than an exact science.
9.5.2 Phases of cluster analysis
The process of cluster analysis may be divided into three phases (Hair et al., 1987). These are
• Partitioning
• Interpretation
• Profiling
Figure 9.2 Within-cluster and between-cluster variations.
9.5.2.1 Partitioning phase
During the first phase, an appropriate measure is selected for measuring interobject similarity. The proximity or closeness between each pair of objects is used as a measure of similarity. Since distance is the complement of similarity, interobject distance is commonly used as this measure.
9.5.2.1.1 Distance type measurement. This type of measurement is possible for quantitative data. The general Minkowski metric for distance measurement is defined by Equation 9.1 (Dillon and Goldstein, 1984).
d_{ij} = \left\{ \sum_{k=1}^{p} \left| X_{ik} - X_{jk} \right|^{r} \right\}^{1/r}    (9.1)
where d_ij is the distance between two objects i and j, X_ik is the value of the kth variable for object i, p is the number of variables, and r (r ≥ 1) is the Minkowski parameter.
When r = 2, the Minkowski equation reduces to the familiar Euclidean distance between objects i and j, given by Equation 9.2.
d_{ij} = \left\{ \sum_{k=1}^{p} \left( X_{ik} - X_{jk} \right)^{2} \right\}^{1/2}    (9.2)
When r = 1, the Minkowski equation reduces to the city-block metric, given by Equation 9.3.
d_{ij} = \sum_{k=1}^{p} \left| X_{ik} - X_{jk} \right|    (9.3)
Several other options are available in various computer programs. One option is to use the sum of squared differences as a measure of similarity.
The raw data are converted to Z-scores before computing distances. This step is taken to eliminate the spurious effect of unequal variances of the variables. Another very useful distance measure is the Mahalanobis distance. The Mahalanobis D2 generalized distance measure is comparable to R2 in regression analysis and is superior to the different versions of the Euclidean distance measure. It is given by Equation 9.4.
D^{2} = (X_{i} - X_{j})' S^{-1} (X_{i} - X_{j})    (9.4)
where S is the pooled within-group covariance matrix, and X_i and X_j are the respective vectors of measurements on objects i and j. This distance measurement has the advantage of explicitly accounting for any correlations that might exist between the variables (Dillon and Goldstein, 1984).
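A sketch of the standardization and Mahalanobis steps follows, again assuming NumPy and SciPy; the data matrix is hypothetical, and the total-sample covariance matrix stands in for the pooled within-group matrix S, which would require group labels to compute.

    import numpy as np
    from scipy.stats import zscore
    from scipy.spatial import distance

    # Hypothetical data: 4 objects measured on 3 variables
    data = np.array([[2.0, 4.0, 3.0],
                     [5.0, 1.0, 7.0],
                     [4.0, 2.0, 5.0],
                     [3.0, 5.0, 2.0]])

    # Convert the raw data to Z-scores to remove the effect of unequal variances
    z = zscore(data, axis=0, ddof=1)

    # Equation 9.4: Mahalanobis D^2 between objects 0 and 1; SciPy returns the
    # square root of the quadratic form, so the result is squared here
    S = np.cov(data, rowvar=False)     # stand-in for the pooled within-group S
    d2 = distance.mahalanobis(data[0], data[1], np.linalg.inv(S)) ** 2
    print(d2)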
9.5.2.1.2 Match-type measurement. For qualitative data, a match-type or association measure is suitable. The association measure generally takes the value “0” to indicate the absence of an attribute and “1” to indicate its presence. Two objects or individuals are considered similar if they share the same attributes and dissimilar if they do not. We can visualize the absence (0) and presence (1) of attributes in a contingency table (Table 9.1).
Similarity may be measured by counting the total number of matches, either (0, 0) or (1, 1), between X and Y and dividing the total by the number of attributes (8). The similarity between X and Y in this case is then given by
Similarity between X and Y = {No. of (1, 1)’s or (0, 0)’s / No. of attributes} × 100% = (5/8) × 100% = 62.5%    (9.5)
The resulting association table is given by Table 9.2.
The association measure can be computed in several different ways, and unfortunately these result in different values for the same data sets (Dillon and Goldstein, 1984). Hence, it is essential to assign 1’s and 0’s on the basis of their importance to the user.
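A minimal sketch of this simple matching computation in Python, using the X and Y vectors of Table 9.1:

    # Binary attribute vectors from Table 9.1
    X = [0, 1, 1, 0, 1, 1, 1, 1]
    Y = [1, 1, 1, 0, 0, 1, 0, 1]

    matches = sum(x == y for x, y in zip(X, Y))  # counts (1, 1) and (0, 0) pairs
    similarity = matches / len(X) * 100          # Equation 9.5
    print(f"Similarity between X and Y = {similarity}%")  # 62.5%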
9.5.2.1.3 Clustering algorithms. The next step is to select a particular type of computational algorithm. The commonly used clustering algorithms are of two types: hierarchical and non-hierarchical. The clustering process strives to maximize between-cluster variability and minimize within-cluster variability; in other words, subjects within a cluster are most similar, and each cluster is markedly different from the others. Clustering techniques have been applied to a wide variety of research problems. For example, in the field of medicine, clustering diseases, cures for diseases, or symptoms of diseases can lead to very useful taxonomies (Hartigan and Wong, 1979). In the field of psychiatry, the correct diagnosis of clusters of symptoms, such as paranoia and schizophrenia, is essential for successful therapy. In general, whenever one needs to classify a large amount of information into manageable piles, cluster analysis is of great utility.
9.5.2.1.4 Hierarchical clustering. Hierarchical procedures construct a tree-like structure. There are basically two types of procedures: agglomerative and divisive. In the agglomerative method, each case starts in its own cluster, and cases are then combined into smaller and smaller numbers of clusters. In the divisive method, by contrast, all cases start in the same cluster, and the process commences by dividing the cases into two groups. The group with the most internal variation, the least homogeneous, is split into two so that there are three groups, and so on; the process continues until it can no longer find a statistical justification to continue (Hartigan and Wong, 1979).
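A sketch of the agglomerative procedure, using SciPy’s hierarchical clustering routines on hypothetical data:

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    rng = np.random.default_rng(0)
    objects = rng.standard_normal((20, 3))           # 20 hypothetical objects, 3 variables

    Z = linkage(objects, method="ward")              # builds the agglomerative tree
    labels = fcluster(Z, t=3, criterion="maxclust")  # cut the tree into 3 clusters
    print(labels)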
9.5.2.1.5 Non-hierarchical clustering—K-means clustering method. This technique of clustering is gaining popularity for large databases and can be used once agreement is reached with regard to the number of clusters. A non-hierarchical procedure does not involve a tree-like construction process but needs to select a cluster center or seed, and all objects within a specified distance are included in the resulting cluster.
Table 9.1 Contingency Table of Similarity
Object    1   2   3   4   5   6   7   8
X         0   1   1   0   1   1   1   1
Y         1   1   1   0   0   1   0   1
Table 9.2 Association Table for X and Y
                      Object 2 (Y)
                      +       –       Total
Object 1 (X)    +     4       2       6
                –     1       1       2
Total                 5       3       8
There are three different approaches for non-hierarchical clustering, based on sequential threshold, parallel threshold, or optimizing procedures.
The K-means clustering splits a set of objects into a selected number of groups by maximizing between-cluster variation relative to within-cluster variation (Green and Rao, 1969). In general, the K-means method will produce exactly K different clusters of the greatest possible distinction (Sherman and Seth, 1977; Ling and Li, 1998). In the K-means clustering procedure, the value of K, the number of clusters, has to be decided before processing. There appears to be no standard method for this, but some guidelines are available (Hair et al., 1987). The clustering process may be stopped when the distance between clusters at successive steps exceeds a preselected value. Alternatively, an intuitive number of clusters may be tried and, based on some preselected criteria, the best among the alternatives selected. Frequently, judgments of practicality regarding comprehension and communication become the deciding factor.
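A sketch of the K-means procedure with scikit-learn, assuming the value of K has been decided in advance; the data matrix is hypothetical:

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    data = rng.standard_normal((100, 3))   # 100 hypothetical objects, 3 variables

    K = 3                                  # number of clusters, decided before processing
    km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(data)
    print(km.labels_[:10])                 # cluster membership of the first 10 objects
    print(km.cluster_centers_)             # the K cluster centers (final seeds)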
Cluster analysis packages may be purchased as “off-the-shelf” software that use segmentation techniques based on the neighborhood-type approach (TargetPro Version 4.5, 2003). These software programs use prepackaged approaches to multivariate statistical clustering, which fundamentally follow the same concept, and several of these packages use a customized version of non-hierarchical cluster analysis, known as K-means clustering. This approach consists of testing a number of different classifications and searching for a set of clusters that maximizes the similarity of all the geographic units assigned to the same cluster and, at the same time, maximizes the statistical differences between individual clusters.
9.5.2.2 Interpretation phase
This phase involves determining the nature of the clusters by examining the criteria used to develop them. One way is to determine the average value of the objects in each cluster for each raw variable and develop average profiles from these data. One cluster may favor one attitude, while another favors a different one. From this analysis, each cluster’s attitudes may be evaluated and significant interpretations developed. The interpretations facilitate assigning a label that represents the nature of each cluster.
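A sketch of computing such average profiles with pandas, reusing the hypothetical data and K-means setup from the sketch above:

    import numpy as np
    import pandas as pd
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    data = rng.standard_normal((100, 3))   # hypothetical data, as above
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(data)

    # Average value of each raw variable within each cluster (the cluster profile)
    profiles = pd.DataFrame(data, columns=["var1", "var2", "var3"]).groupby(labels).mean()
    print(profiles)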
9.5.2.3 Profiling phase
This phase involves describing the characteristics of each cluster to explain how the clusters differ on relevant dimensions. Demographics, behavioral patterns, buying habits, consumption characteristics, and other traits relevant to a particular study are usually included in the analysis for profiling. For example, more affluent and younger customers may represent one cluster, while another may represent older and more conservative persons. This analysis has to focus on characteristics that differ significantly from those of the other clusters and that are different from the ones used to develop the clusters.
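A sketch of the profiling step, comparing clusters on hypothetical descriptor variables (age and income here) that were not used to form the clusters; the cluster labels come from the same K-means setup as above:

    import numpy as np
    import pandas as pd
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    data = rng.standard_normal((100, 3))   # clustering variables, as above
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(data)

    # Hypothetical descriptors for the same 100 objects, not used in clustering
    descriptors = pd.DataFrame({
        "age": rng.integers(20, 70, size=100),
        "income": rng.normal(50_000, 15_000, size=100),
    })

    # Mean of each descriptor per cluster characterizes who is in each cluster
    print(descriptors.groupby(labels).mean())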
9.5.3 Testing validity of clustering solution
Assessment consists of examining the following:
• Distinctiveness of clusters, presented by profiling.
• Optimum number of clusters, depending on a balance between the extent of homogeneity within clusters and the number of clusters.
• Goodness of fit, indicated by a high rank-order correlation between the input and the solution output. Because clusters are generated by maximizing the between-cluster sums of squares, the usual ANOVA test of significance (F; α, ν1, ν2) cannot be conducted in the case of cluster analysis (Dillon and Goldstein, 1984). Instead, the maximum value of the F-statistic among the different alternative groupings is used as an indication of best fit.
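This pseudo-F criterion (the ratio of between-cluster to within-cluster mean squares) corresponds to what scikit-learn exposes as the Calinski-Harabasz score. A sketch that selects the grouping maximizing it, reusing the hypothetical data from the sketches above:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import calinski_harabasz_score

    rng = np.random.default_rng(0)
    data = rng.standard_normal((100, 3))   # hypothetical data, as above

    # Pseudo-F for alternative groupings with different numbers of clusters
    scores = {}
    for k in range(2, 7):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(data)
        scores[k] = calinski_harabasz_score(data, labels)

    best_k = max(scores, key=scores.get)   # grouping with the maximum F-statistic
    print(scores, best_k)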