
Analysing the effect of different Distance Measures in K-means Clustering Algorithm

Trushali Jambudi & Savita Gandhi¹

¹ Research Scholar, Department of Computer Science, Gujarat University.
Dean, FCAIT & FCT, GLS University.

Abstract

Distance metrics measure the distance between two objects and are the principal means of deciding the similarity or dissimilarity between the data to be clustered. Different clustering algorithms apply different distance measures to group objects into clusters. The choice of distance metric can greatly affect the performance of a clustering algorithm and hence its outcome. In this paper, we analyse the impact of various distance measures on the performance of the K-means algorithm. We first describe the distance measures commonly used with the K-means algorithm, and then apply the K-means clustering algorithm with each of these distance measures to various synthetic and real standard clustering data sets. To measure the impact of each distance measure on the performance of the K-means algorithm, we employ several performance evaluation metrics.

Keywords: Data Clustering, Unsupervised Learning, Data Mining, K-means Clustering, Distance Measures

I. Introduction

Data segregation or clustering, also called segmentation analysis, is the practice of dividing data objects into groups, that is, forming clusters based on the data properties, in such a way that the objects in one cluster are highly similar to one another but dissimilar to the objects of other clusters. This process of partitioning data into meaningful groups or categories is based on the data characteristics, and the grouped data is then used to derive the characteristics of each group. In clustering, the objects are grouped so as to maximize the intra-cluster similarity and minimize the inter-cluster similarity. After clustering, a class label can be assigned to each group based on the characteristics of the data in the group. The similarity or dissimilarity between records is determined by the distance function applied to the attributes of the records being clustered [1, 2].

In many cases, clustering is the first data mining task applied to a given data set and is used to explore any underlying patterns in the data. Examples of clustering tasks include dividing plants and animals into groups, grouping symptoms into different diseases, forming student groups, crime pattern recognition, etc. Cluster mining is a challenging task, and its application areas pose their own extraordinary requirements. The primary metric for evaluating clustering algorithms is the quality of the clusters they produce, i.e., the extent to which the records in the same cluster are closer to each other than to records in other clusters. This degree of closeness is determined by the distance measure, which measures the distance of an object from other objects: objects close to each other are considered similar and placed in the same cluster, whereas objects that are far from each other are considered dissimilar and are not placed together in the same cluster. Which distance metric should be used depends upon the data type of the attributes in the data set being clustered and the purpose for which it is to be used. For each data type a different distance metric exists.

For example, for numeric data, distance is measured using distance functions such as Euclidean, Manhattan and Chebyshev, whereas for measuring distance between binary attributes, the asymmetric binary similarity is computed [3]. In this paper we focus on computing distance between numeric attributes for the K-means algorithm.

The paper continues as follows. We describe the K-means algorithm in Section II. The most popular distance measures used in the K-means algorithm are discussed in Section III. In Section IV, we review the work done in the literature with respect to applying distance measures in the K-means algorithm. In Section V, we explain the parameters used for measuring the performance of the K-means algorithm, while in Section VI we describe our experiments and the results obtained, and in Section VII we conclude and explain our findings regarding the impact of each distance measure on the performance of the K-means algorithm.

II. The K-means Algorithm

K-means [3, 4] is the most popularly used partition-based clustering algorithm, which partitions the data into groups or clusters based on similarity. The K-means algorithm is iterative in nature and works as follows:

1. Input k, the number of clusters to be formed.

2. k initial "centers" are initialized at random from the domain of the data.

3. k clusters are formed by assigning every observation to its nearest center.

4. The following steps are repeated until the cluster assignments stop changing:

a. For each cluster, the mean of the observations within that cluster is taken as the center.

b. Each data point is reassigned to the cluster whose center is nearest to that data point.


The mean value computed from all the objects in a cluster defines that cluster's centre. The traditional distance measure used in the K-means algorithm to determine the similarity of an object to a cluster's mean is the Euclidean distance.
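To make the iteration above concrete, the following is a minimal sketch of the K-means procedure in Python with NumPy. It is written for this article rather than taken from the authors' implementation; the function name `kmeans`, the random initialization scheme and the convergence test on centre movement are illustrative choices. The `distance` argument lets any of the measures discussed in Section III be plugged in.

```python
import numpy as np

def kmeans(X, k, distance, max_iter=100, seed=0):
    """Minimal K-means sketch with a pluggable distance function.

    X        : (n, p) array of numeric observations
    k        : number of clusters
    distance : callable mapping (points, centre) -> array of n distances
    """
    rng = np.random.default_rng(seed)
    # Step 2: choose k initial centres at random from the data
    centres = X[rng.choice(len(X), size=k, replace=False)]

    for _ in range(max_iter):
        # Steps 3 / 4b: assign every observation to its nearest centre
        dists = np.stack([distance(X, c) for c in centres], axis=1)
        labels = dists.argmin(axis=1)

        # Step 4a: recompute each centre as the mean of its cluster
        # (an empty cluster keeps its previous centre)
        new_centres = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centres[j]
            for j in range(k)
        ])

        # Stop once the centres no longer move
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    return labels, centres
```

For example, `kmeans(X, 3, lambda pts, c: np.linalg.norm(pts - c, axis=1))` runs the sketch with the Euclidean distance.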

III. Distance Measures for K-means

The commonly used distance measures for calculating the dissimilarity of objects described by numerical data are the Euclidean, Manhattan and Chebyshev distances, which are particular cases of the Minkowski distance for different values of the parameter $h$. In other words, the Minkowski distance measure is a generalization of the Euclidean, Manhattan and Chebyshev distance measures [3].

Let $i = (x_{i1}, x_{i2}, \ldots, x_{ip})$ and $j = (x_{j1}, x_{j2}, \ldots, x_{jp})$ be two objects described by $p$ numeric attributes. The most popular distance measure is the Euclidean distance. The Euclidean distance between objects $i$ and $j$ is defined as

$$d(i, j) = \sqrt{(x_{i1} - x_{j1})^2 + (x_{i2} - x_{j2})^2 + \cdots + (x_{ip} - x_{jp})^2}$$

Another well-known measure, the Manhattan (or city block) distance, is given by

$$d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|$$

whereas the Chebyshev distance is defined by

$$d(i, j) = \max_{k=1}^{p} |x_{ik} - x_{jk}|$$

The general formula for the Minkowski distance between two objects $i = (x_{i1}, x_{i2}, \ldots, x_{ip})$ and $j = (x_{j1}, x_{j2}, \ldots, x_{jp})$ is

$$d(i, j) = \left( |x_{i1} - x_{j1}|^h + |x_{i2} - x_{j2}|^h + \cdots + |x_{ip} - x_{jp}|^h \right)^{1/h}$$


where $h \geq 1$. It is evident that the Euclidean and Manhattan distances can be derived from the Minkowski distance by substituting the value of the parameter $h$ as 2 and 1, respectively. The Minkowski distance transforms into the Chebyshev distance in the limit $h \to \infty$. The Chebyshev distance is therefore defined more formally as

$$d(i, j) = \lim_{h \to \infty} \left( \sum_{k=1}^{p} |x_{ik} - x_{jk}|^h \right)^{1/h} = \max_{k=1}^{p} |x_{ik} - x_{jk}|$$

If we can assign a weight to each attribute based on the attribute's apparent significance, then we can calculate the weighted Euclidean distance as

$$d(i, j) = \sqrt{w_1 (x_{i1} - x_{j1})^2 + w_2 (x_{i2} - x_{j2})^2 + \cdots + w_p (x_{ip} - x_{jp})^2}$$

Similarly, weights can be applied to other distance measures as well.
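As an illustration, the distance measures above can be written as small NumPy functions. This is a sketch assuming NumPy arrays of equal length; the function names are chosen here for readability and are not taken from any particular library.

```python
import numpy as np

def euclidean(x, y):
    # square root of the sum of squared attribute differences
    return np.sqrt(np.sum((x - y) ** 2, axis=-1))

def manhattan(x, y):
    # sum of absolute attribute differences (city block distance)
    return np.sum(np.abs(x - y), axis=-1)

def chebyshev(x, y):
    # largest absolute attribute difference
    return np.max(np.abs(x - y), axis=-1)

def minkowski(x, y, h):
    # h = 1 gives Manhattan, h = 2 gives Euclidean; h -> infinity approaches Chebyshev
    return np.sum(np.abs(x - y) ** h, axis=-1) ** (1.0 / h)

def weighted_euclidean(x, y, w):
    # w holds one non-negative weight per attribute
    return np.sqrt(np.sum(w * (x - y) ** 2, axis=-1))
```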

Moreover, distance metrics must satisfy the following mathematical properties:

1. Non-negativity: 𝑑(𝑖, 𝑗) ≥ 0: Distance cannot be negative.

2. Identity of indiscernibles: 𝑑(𝑖, 𝑖) = 0: The distance from an object to itself is 0.

3. Symmetry: 𝑑(𝑖, 𝑗) = 𝑑(𝑗, 𝑖): Distance remains same, whether measured from object 𝑖 to object 𝑗 or object 𝑗 to object 𝑖.

4. Triangle inequality: 𝑑(𝑖, 𝑗) ≤ 𝑑(𝑖, 𝑘) + 𝑑(𝑘, 𝑗): The direct distance between objects 𝑖 and 𝑗 is never larger than a detour over any other object 𝑘.

A measure that satisfies these conditions is known as a metric.

IV. Literature Review

In this section, we review the analyses of data clustering algorithms performed in the literature by various authors and summarize our observations based on these reviews. In [5], three different distance metrics, namely Euclidean, Manhattan and Minkowski (with the value of 𝑝 taken as 4, 6, 8, 10, 12 and 14), are used for computing distance in the K-means algorithm. The K-means algorithm is implemented and the results of using each of these distance measures are compared on dummy data which is not disclosed. The results are plotted as histograms and distortion is used as the means for comparing performance. They claim that the distortion in K-means using the Manhattan metric is lower than the distortion in K-means using the Euclidean distance metric, that the performance of K-means using the Euclidean distance metric is the best, and that the performance of K-means using the Manhattan distance metric is the worst.

In paper [6], a study is carried out on the performance of K-means clustering using the Euclidean and Manhattan distance functions. These distance functions are evaluated and compared using the number of iterations, the within-cluster sum of squared errors and the time taken to build the full model. The results show that K-means clustering using the Euclidean distance outperforms K-means clustering using the Manhattan distance in terms of the number of iterations, the sum of squared errors and the time taken to build the model. The experiments are carried out on an online user data set consisting of 200 instances, comprising attributes describing user behaviour on a particular social networking site.

In [7], the K-means algorithm is implemented using the Euclidean, Canberra and Manhattan distances on the Iris data set, and their performance is measured by computing the accuracy produced by a combination of the Z-score and Min-Max normalization methods, and also by a cluster homogeneity test using the Silhouette Coefficient method. The results show that the Canberra method is superior to Euclidean and Manhattan on the Iris data set.

In paper [8], two distance metrics, Euclidean and Manhattan, are analysed through experiments on one synthetic data set, BIRCH, and two real data sets, Iris and Diabetes. The performance of these two distance metrics is measured by using them in the K-means algorithm and comparing the number of iterations required for convergence. From the experimental results, they show that the Euclidean method outperforms the Manhattan method.

In [9], the authors measure the performance of the K-means algorithm using the Euclidean and Manhattan distance measures and conclude through experiments that the Manhattan distance method performs better than the Euclidean distance method. They apply both distance metrics on the Iris data set with the number of clusters k varied from 2 to 9, and conclude that the results are better for k values of 3 and 4 than for higher or lower values of k.

In [10], the authors assess the performance of the traditional K-means algorithm with the Euclidean, Manhattan and Canberra distance metrics. K-means using each of these distance metrics is applied on the iris, wine, vowel, ionosphere and crude oil data sets while varying the value of k. From the outcomes of these experiments it is evident that the performance of the K-means algorithm differs based on the distance metric as well as the data set. Different distance metrics yield different performance on different data sets when used in K-means, and no single distance measure can be identified that gives the best result for all data sets.

(6)

54

V. Parameters Used for Measuring Performance of the K-means Algorithm

In this study, we measure the impact of using different distance metrics in the K-means algorithm through the following performance evaluation criteria: CPU Time, Sum of Squared Errors (SSE) [4] and Purity [3]. CPU Time is the total processing time taken by the K-means algorithm for grouping the input data into meaningful clusters; it is measured in seconds [4].

Sum of Squared Errors (SSE) is the sum of the squared differences between each observation and its cluster's mean, and can be used as a measure of variation within a cluster. The SSE value depends on the data set and its distribution; the smaller the value of SSE, the better [4].

Purity is a measure of the extent to which clusters contain a single class [3]. It is calculated as follows:

For each cluster, count the number of data points belonging to the most common class in that cluster, take the sum over all clusters, and divide by the total number of data points.
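As an illustration of these two cluster-quality criteria, the snippet below computes SSE and purity for a clustering result. It is a sketch assuming NumPy arrays, with `labels` holding the cluster index of each point, `centres` the cluster means, and `true_classes` the known class labels (available only for the real data sets).

```python
import numpy as np

def sse(X, labels, centres):
    # Sum over clusters of squared differences between points and their cluster mean
    return sum(np.sum((X[labels == j] - c) ** 2) for j, c in enumerate(centres))

def purity(labels, true_classes):
    # For each cluster, count the points of its most frequent true class,
    # sum these counts over all clusters and divide by the number of points
    total = 0
    for j in np.unique(labels):
        _, counts = np.unique(true_classes[labels == j], return_counts=True)
        total += counts.max()
    return total / len(labels)
```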

VI. Experiments and Results

We have analysed the effect of using different distance measures in the K-means algorithm by performing experiments on real and synthetic data sets taken from the UCI Machine Learning Repository [11, 12].

The experiments are carried out using the Python programming language. In Table 1 we compare the computation time (CPU Time) of the K-means algorithm with the four distance measures, viz. Euclidean, Squared Euclidean, Manhattan and Chebyshev, with the intent of studying their behavioural and performance characteristics when used in the K-means algorithm. In each execution, a distance measure is selected in the K-means algorithm and applied to each data set, and the CPU Time is noted in Table 1.

In Figure 1, this comparison is shown as a chart. In Table 2 we compare the SSE values obtained using each of the four distance measures in the K-means algorithm for each data set, and in Table 3 we compare the purity percentage obtained using the different distance functions in the K-means algorithm for the two real data sets, Iris and Wine. Calculating purity requires the class labels of the data; the class labels for the real data sets are available in the data repository, whereas the class labels for the synthetic data sets are not known.
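A sketch of the experiment loop is given below. The data set loading, the number of clusters per data set and the helper functions (`kmeans`, `sse` and the distance functions from the earlier sketches) are illustrative assumptions rather than the authors' code; scikit-learn is assumed only as a convenient way to load the two real data sets, and CPU time is taken with `time.process_time` as one plausible way of obtaining figures like those in Table 1.

```python
import time
from sklearn.datasets import load_iris, load_wine  # assumption: scikit-learn for the real data sets

datasets = {"Iris": load_iris().data, "Wine": load_wine().data}
n_clusters = {"Iris": 3, "Wine": 3}                 # known number of classes in each data set

metrics = {
    "Euclidean": euclidean,
    "Squared Euclidean": lambda a, b: euclidean(a, b) ** 2,
    "Manhattan": manhattan,
    "Chebyshev": chebyshev,
}

for name, X in datasets.items():
    for metric_name, dist in metrics.items():
        start = time.process_time()                 # CPU time, as reported in Table 1
        labels, centres = kmeans(X, n_clusters[name], distance=dist)
        elapsed = time.process_time() - start
        print(f"{name:5s} {metric_name:18s} CPU={elapsed:.3f}s  SSE={sse(X, labels, centres):.2f}")
```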

Table 1: Comparison of CPU Time using different Distance metrics in K-means

CPU Time (seconds)

Data Set   Euclidean   Squared Euclidean   Manhattan   Chebyshev
R15        0.21        0.06                0.09        0.05
S3         0.5         0.28                0.21        0.22
D31        0.57        0.22                0.16        0.16
Iris       0.09        0.01                0.015       0.01
Wine       0.07        0.02                0.02        0.02

Figure 1: Comparison of CPU Time using different Distance metrics in K-means

Table 2: Comparison of SSE using different Distance metrics in K-means

Sum of Squared Error

Data Set   Euclidean   Squared Euclidean   Manhattan   Chebyshev
R15        108.62      875.80              614.18      303.12
S3         1.68E+13    1.88E+13            3.28E+08    2.3E+08
D31        3393.32     5194.61             4167.69     3193.21
Iris       37.12       38                  81.6        58.34
Wine       95.55       95.55               142.24      101.46

Table 3: Comparison of purity using different distance metrics in K-means

Purity (%)


Data Set   Euclidean   Squared Euclidean   Manhattan   Chebyshev
Iris       94          94                  95.33       93.33
Wine       70.06       70.06               70.62       70.06

VII. Conclusion

In this paper we have analysed the impact of using four distance measures in the K-means algorithm, viz. Euclidean, Squared Euclidean, Manhattan and Chebyshev, by measuring the performance of the K-means algorithm through CPU Time, SSE and purity. We have compared the performance after applying each of these four distance measures in the traditional K-means algorithm to three synthetic and two real data sets.

From Table 1 and Figure 1, it is evident that when the Euclidean distance is used as the distance measure in the K-means algorithm, the algorithm requires more time to converge than with the other three distance measures. From Table 2 it is apparent that the best value of SSE is obtained in three out of the five data sets when the Euclidean distance metric is used in K-means, whereas the Chebyshev distance metric gives the best value of SSE in two out of the five data sets. The Squared Euclidean metric also gives good results on the Wine and Iris data sets, and the Manhattan distance does well on the S3 data set. So, we conclude that using the Euclidean distance in K-means gives a very good value of SSE; however, the algorithm takes longer to converge. Using the Chebyshev distance as the distance measure leads to faster convergence of the algorithm while also giving a good value of SSE. The results of using the Squared Euclidean and Manhattan distance metrics are also comparable for certain data sets. From the purity percentages given in Table 3, we observe marginally better purity for both the Iris and Wine data sets when the Manhattan distance metric is used in K-means. Hence, we conclude that while the Euclidean distance can be used for most data sets, other distance measures also give good results depending on the data set being used.

References

1. Revathi, S., & Nalini, D. T. (2013). Performance comparison of various clustering algorithm. International Journal of Advanced Research in Computer Science and Software Engineering, 3(2), 67-72.

2. Xu, R., & Wunsch, D. (2005). Survey of clustering algorithms. IEEE Transactions on Neural Networks, 16(3), 645-678.


3. Han, J., Kamber, M., & Pei, J. (2011). Data Mining: Concepts and Techniques (3rd ed.). Morgan Kaufmann.

4. Jambudi, T., & Gandhi, S. (2021). Analytical Review of K-means based Algorithms and Evaluation Methods. GRENZE International Journal of Engineering and Technology (GIJET), 7(1), 479-486. Grenze ID: 01.GIJET.7.1.8_1.

5. Singh, A., Yadav, A., & Rana, A. (2013). K-means with Three different Distance Metrics. International Journal of Computer Applications, 67(10).

6. Kapil, S., & Chawla, M. (2016, July). Performance evaluation of k-means clustering algorithm with various distance metrics. In 2016 IEEE 1st International Conference on Power Electronics, Intelligent Control and Energy Systems (ICPEICES) (pp. 1-4). IEEE.

7. Faisal, M., & Zamzami, E. M. (2020, June). Comparative Analysis of Inter-Centroid K-Means Performance using Euclidean Distance, Canberra Distance and Manhattan Distance. In Journal of Physics: Conference Series (Vol. 1566, No. 1, p. 012112). IOP Publishing.

8. Sinwar, D., & Kaushik, R. (2014). Study of Euclidean and Manhattan distance metrics using simple k-means clustering. Int. J. Res. Appl. Sci. Eng. Technol, 2(5), 270-274.

9. Suwanda, R., Syahputra, Z., & Zamzami, E. M. (2020, June). Analysis of Euclidean Distance and Manhattan Distance in the K-Means Algorithm for Variations Number of Centroid K. In Journal of Physics: Conference Series (Vol. 1566, No. 1, p. 012058). IOP Publishing.

10. Thakare, Y. S., & Bagal, S. B. (2015). Performance evaluation of K-means clustering algorithm with various distance metrics. International Journal of Computer Applications, 110(11), 12-16.

11. Patel, S., Trivedi, D., Bhatt, A., & Shanti, C. (2021). Web visibility and research productivity of NIRF ranked universities in India: A Webometric study. Library Philosophy and Practice (E- Journal). https://digitalcommons.unl.edu/libphilprac/5326/

12. Dua, D., & Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
