International Journal of Recent Advances in Engineering & Technology (IJRAET)
ISSN (Online): 2347-2812, Volume-3, Issue-3, 2015
A Comparative Study of Distance-based Similarity Measures
1Deepa B. Patil, 2Yashwant V. Dongre
1,2Department of Computer Engineering, Vishwakarma Institute of Information Technology Pune, Maharashtra, India Email: 1[email protected], 2[email protected]
Abstract — In today's world of the internet, with a whole lot of e-documents such as HTML pages and digital libraries occupying a considerable amount of cyberspace, organizing these documents has become a practical need. Clustering is an important technique that organizes a large number of objects into smaller coherent groups. This helps in the efficient and effective use of these documents for information retrieval and other NLP tasks. A clustering algorithm requires a metric to quantify how similar or different two given documents are. This difference is often measured by distance measures such as Euclidean distance, cosine similarity, and the Jaccard coefficient, to name a few. These measures are called similarity measures, and they play an important role in text classification and clustering. In this work, we use a similarity measure called SMTP (Similarity Measure for Text Processing) for processing e-mail data with the fuzzy c-means clustering algorithm.
I. INTRODUCTION
Text processing plays an important role in information retrieval, data mining, and web search. As the amount of digital documents has been increasing dramatically over the years with the growth of the internet, information management, search, and retrieval have become important problems. Clustering is a technique that organizes a large number of objects into smaller coherent groups. Clustering aims at grouping similar documents into one class and separating this group as much as possible from the ones that contain information on entirely different topics. The World Wide Web has huge applications of text clustering, such as clustering of search engine results for users and grouping of comments to suggest products on online stores. A similarity measure plays an important role in deciding how similar or different two documents are. A lot of similarity measures exist for computing the similarity between two documents. Euclidean distance is one of the well-known similarity metrics, taken from the field of Euclidean geometry.
Cosine similarity is a measure taking the cosine of the angle between two vectors. The Jaccard coefficient is a statistic used for comparing the similarity of two sample sets and is defined as the size of their intersection divided by the size of their union. IT-Sim, an information-theoretic measure for document similarity, is a phrase-based measure that computes similarity based on the suffix-tree document model. Pairwise-adaptive similarity dynamically selects a number of features out of two documents d1 and d2. Clustering requires the definition of a distance measure that assigns a numeric value to the extent of the difference between two documents, which the clustering algorithm uses to form groups of a given dataset.
No single measure works best in all scenarios; the selection depends on which distance measure best catches the essence of the important distinguishing characteristics of the given document set. A bag-of-words model is commonly used in information retrieval and text processing tasks, wherein a document is modeled as the collection of unique words it contains together with their frequencies. Word order is completely ignored. All the documents are converted into a matrix where each word adds a new dimension, forming a row of the matrix, and each document is represented as a column vector. Each entry in the matrix thus gives the frequency of a word occurring in a particular document. It is easy to see that the matrix will be sparse. The higher the frequency of a word in a document, the more descriptive it is of that document.
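As a concrete illustration, here is a minimal Python sketch of building such a term-document frequency matrix; the function name, whitespace tokenization, and lower-casing are simplifying assumptions:

```python
from collections import Counter

def term_document_matrix(documents):
    """Build a term-document frequency matrix.

    Rows correspond to vocabulary terms and columns to documents;
    entry (i, j) is the frequency of term i in document j.
    """
    counts = [Counter(doc.lower().split()) for doc in documents]
    vocabulary = sorted({term for c in counts for term in c})
    # Each document becomes a column vector over the shared vocabulary;
    # most entries are zero, so the matrix is sparse in practice.
    matrix = [[c[term] for c in counts] for term in vocabulary]
    return vocabulary, matrix

vocab, tdm = term_document_matrix(["the cat sat on the mat", "the dog sat"])
# "the" occurs twice in the first document and once in the second.
```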
II. INFORMATION RETRIEVAL MODELS
An information retrieval (IR) system is responsible for the storage, organization, and representation of data, and for easy access to desired information. The ultimate goal of the similarity techniques defined here is to facilitate the information retrieval process. Similarity measures in IR processes are applied to a set of data objects to select relevant objects, produce a ranked set of relevant objects, and identify the best matches. Several IR models exist that make use of various similarity measuring techniques along with appropriate data processing methodologies, for example:
Set-theoretic Models:- These are based on set theory. The data objects form sets, and set-theoretic operations are used to derive similarities.
Algebraic Models:- In algebraic models the information is represented as vectors/matrices, and the similarity between pieces of information is a scalar value.
Probabilistic Models:- These IR models are based on probabilistic inference. Information is retrieved based on the probability of relevance between data objects.
Knowledge based Models:- The knowledge based IR models use structural and domain knowledge and formalized linguistic information to discover semantic relevance between objects.
Structure based Models:- The structural information retrieval models combine the content and structural characteristics to achieve greater retrieval efficiencies in many applications.
Algebraic models are among the most widely used IR models. Among the various algebraic models, such as latent semantic analysis based models, neural networks, and the vector space model (VSM), the VSM is the most popular. In the vector space model, information is represented as vectors in a multidimensional space, where each dimension corresponds to a possible feature of the information, for example a term in a document. A distance function is applied to the information vectors to match and rank information. VSM-based information retrieval is a good mathematical framework for processing large amounts of information. It provides the possibility of partial matching and ranked result output. However, this approach lacks the control of the Boolean model and has no means to handle semantic or syntactic information.
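A rough sketch of VSM-style matching and ranking follows; the function names and the simple dot-product similarity are illustrative assumptions, and any of the measures surveyed in Section V could be substituted:

```python
def rank_documents(query_vec, doc_vecs, similarity):
    """Score each document vector against the query and rank best-first."""
    scores = [(i, similarity(query_vec, d)) for i, d in enumerate(doc_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# A simple dot product stands in for the distance/similarity function.
dot = lambda u, v: sum(a * b for a, b in zip(u, v))
print(rank_documents([1, 0, 1], [[0, 1, 0], [1, 0, 1], [1, 1, 0]], dot))
# [(1, 2), (2, 1), (0, 0)] -- document 1 matches best; partial matches rank below.
```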
III. DOCUMENT CLUSTERING
Given a set of documents, we want to partition them into a pre-determined number k of subsets, such that the documents assigned to each subset are more similar to each other than to documents assigned to different subsets. Document clustering techniques mostly rely on single-term analysis of the document data set, such as the vector space model. To achieve more accurate document clustering, more informative features, including phrases and their weights, are very important in such scenarios. Document clustering is particularly useful in many applications, such as grouping search engine results, automatic categorization of documents, and building taxonomies of documents. Each document in a corpus corresponds to an m-dimensional vector d, where m is the total number of terms.
Document vectors are often subjected to a weighting scheme, such as the standard term frequency-inverse document frequency (TF-IDF), and normalized to unit length.
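To make the partitional idea concrete, here is a minimal hard-assignment k-means sketch over such document vectors. It is only an illustration: the function name and iteration count are arbitrary, and the work described in the abstract uses fuzzy c-means rather than this hard variant:

```python
import random

def kmeans(vectors, k, iterations=20):
    """Partition document vectors into k subsets by nearest centroid."""
    centroids = random.sample(vectors, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        # Assign each vector to its closest centroid (squared Euclidean distance).
        for v in vectors:
            dists = [sum((a - b) ** 2 for a, b in zip(v, c)) for c in centroids]
            clusters[dists.index(min(dists))].append(v)
        # Recompute each centroid as the mean of its cluster.
        for i, cluster in enumerate(clusters):
            if cluster:  # keep the old centroid if a cluster becomes empty
                centroids[i] = [sum(dim) / len(cluster) for dim in zip(*cluster)]
    return clusters

print(kmeans([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]], k=2))
```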
IV. DOCUMENT PROCESSING STEPS
Tokenization:- This is the process whereby a document is treated as a string (or bag of words) and then partitioned into a list of tokens.
Removing stop words:- Stop words are frequently occurring, insignificant words. This step eliminates the stop words; examples are articles appearing in sentences, such as a, an, and the.
Stemming:- This step is the process of converting tokens to their root form. For example, the word 'buying' is converted into 'buy'.
Document representation:- Generating the N distinct words from the corpus and calling them index terms (the vocabulary). The document collection is then represented as N-dimensional vectors in the term space.
TFIDF Analysis:- By taking into account two factors, term frequency (TF) and inverse document frequency (IDF), it is possible to assign weights to search results and therefore order them statistically. A search result's ranking score is the product of TF and IDF:
TFIDF = TF * IDF, where:
TF = C / T, where C = the number of times a given word appears in a document and T = the total number of words in that document.
IDF = log(D / DF), where D = the total number of documents in the corpus and DF = the number of documents containing the given word (the logarithm is commonly applied to dampen the IDF factor). A sketch of these preprocessing and weighting steps is given below.
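The following small Python sketch covers the steps above; the stop-word list is an illustrative subset, and the one-rule suffix stripper merely stands in for a real stemmer such as Porter's algorithm:

```python
import math
import re
from collections import Counter

STOP_WORDS = {"a", "an", "the", "is", "of", "on", "in", "and"}  # illustrative subset

def preprocess(document):
    # Tokenization: treat the document as a bag of lower-cased word tokens.
    tokens = re.findall(r"[a-z]+", document.lower())
    # Stop-word removal: drop frequently occurring, insignificant words.
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # Stemming (crude): strip an "-ing" suffix, e.g. "buying" -> "buy".
    return [t[:-3] if t.endswith("ing") else t for t in tokens]

def tfidf(tokenized_docs):
    """TF-IDF weights per the definitions above: TF = C / T, IDF = log(D / DF)."""
    D = len(tokenized_docs)
    df = Counter()
    for doc in tokenized_docs:
        df.update(set(doc))  # DF counts documents, not word occurrences
    weights = []
    for doc in tokenized_docs:
        T = len(doc)
        counts = Counter(doc)
        weights.append({w: (c / T) * math.log(D / df[w]) for w, c in counts.items()})
    return weights

docs = [preprocess(d) for d in ["The cat is buying a mat", "A dog sat on the mat"]]
print(tfidf(docs))  # terms appearing in every document get weight 0
```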
V. RELATED WORK
We are living in a cyber world flooded with a whole lot of information, and efficient and accurate retrieval of that information is important for the survival of many web portals. Thus, the similarity measure plays an important role in text classification and clustering algorithms. The process of identifying relevant information is generally termed "information retrieval". Measuring the similarity or distance between two information entities is a core requirement for all information discovery tasks, whether in IR or in data mining. Use of appropriate measures not only improves the quality of information selection but also helps reduce the time and processing costs. Several similarity measures have been proposed:
1. Distance-Based Similarity Measures
a. Minkowski distance
b. Manhattan/city block distance
c. Euclidean distance
d. Chebyshev distance
e. Jaccard distance
f. Dice's coefficient
g. Cosine similarity
h. Hamming distance
i. Levenshtein distance
j. Soundex distance
2. Feature-Based Similarity Measures
a. Contrast model
3. Probabilistic Similarity Measures
a. Maximum likelihood estimation (MLE)
b. Maximum a posteriori (MAP) estimation
4. Extended/Additional Measures
a. Similarity measures based on fuzzy set theory
b. Similarity measures based on graph theory

Distance-Based Similarity Measures:-
Distance-based similarity measures are among the most popular. According to a very popular theoretical assumption, perceived similarity can be inversely associated with distance in some suitable feature space, which in most cases is considered to be a metric space. Many psychological theories assume that closer objects are more similar than objects that are far apart. A similarity measure (s) can be derived from a distance measure (d) using s = 1 − d; for example, a normalized distance of 0.25 corresponds to a similarity of 0.75. The most common form of dissimilarity calculation refers to distance calculation in a metric space.
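A trivial sketch of this conversion, assuming the distance has already been normalized into [0, 1]:

```python
def similarity_from_distance(d):
    # s = 1 - d; only meaningful when d is normalized into [0, 1].
    return 1 - d

print(similarity_from_distance(0.25))  # 0.75
```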
Some measures which have been popularly adopted for computing the similarity between two documents are briefly presented here.
Euclidean distance:-
The Euclidean distance is a special case of the Minkowski distance (with p = 2). It is the most commonly used measure to determine the distance between two points x and y. It is described as

d(x, y) = sqrt( Σ_i (x_i − y_i)² )      (1)
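A direct Python rendering of (1); the function name is an illustrative choice:

```python
import math

def euclidean_distance(x, y):
    # Square root of the summed squared coordinate differences, as in (1).
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

print(euclidean_distance([1, 0, 2], [0, 0, 0]))  # sqrt(5) ~= 2.236
```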
Jaccard distance:-
The Jaccard distance measures dissimilarity between sample sets. It is complementary to the Jaccard coefficient (the size of the intersection divided by the size of the union of the sample sets) and is obtained by subtracting the Jaccard coefficient from 1:

d_J(A, B) = 1 − |A ∩ B| / |A ∪ B|      (2)
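In Python, over term sets (a hypothetical helper, not a prescribed API):

```python
def jaccard_distance(a, b):
    # 1 minus the Jaccard coefficient of the two term sets, as in (2).
    a, b = set(a), set(b)
    if not (a | b):
        return 0.0  # two empty sets are treated as identical
    return 1 - len(a & b) / len(a | b)

print(jaccard_distance({"cat", "sat", "mat"}, {"dog", "sat", "mat"}))  # 0.5
```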
Dice's Coefficient:-
The Dice coefficient similarity measure is defined as twice the number of terms common to the compared entities/strings (n_t) divided by the total number of terms in both tested strings (n_x + n_y):

s(X, Y) = 2·n_t / (n_x + n_y)      (3)
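A corresponding sketch over term sets (illustrative naming):

```python
def dice_coefficient(x, y):
    # Twice the number of shared terms over the total terms in both sets, as in (3).
    x, y = set(x), set(y)
    total = len(x) + len(y)
    return (2 * len(x & y) / total) if total else 0.0

print(dice_coefficient({"cat", "sat", "mat"}, {"dog", "sat", "mat"}))  # ~0.667
```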
Cosine similarity:-
Cosine similarity is a vector-based similarity measure used in text mining and information retrieval. In this approach the compared strings are transformed into a vector space and the Euclidean cosine rule is applied to calculate similarity. This approach is often paired with other approaches to limit the dimensionality of the vector space:

cos(x, y) = (x · y) / (‖x‖ ‖y‖)      (4)
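A minimal implementation of (4); the zero-norm guard is a defensive assumption for all-zero vectors:

```python
import math

def cosine_similarity(x, y):
    # Dot product divided by the product of the vector lengths, as in (4).
    dot = sum(a * b for a, b in zip(x, y))
    norms = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norms if norms else 0.0

print(cosine_similarity([1, 1, 0], [1, 0, 1]))  # 0.5
```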
Hamming distance:-
It is considered to be the most popular measure for binary attributes. It is defined as the number of bits that differ between two binary strings, i.e., the number of bits that need to be changed to turn one string into the other. For example, the bit strings 1011101 and 1001001 have a Hamming distance of 2 (two bits are dissimilar). This measure is only applicable to strings of equal length.
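A sketch matching the worked example in the text:

```python
def hamming_distance(s, t):
    # Number of positions at which two equal-length strings differ.
    if len(s) != len(t):
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(a != b for a, b in zip(s, t))

print(hamming_distance("1011101", "1001001"))  # 2
```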
VI. CONCLUSION
It can be concluded that not all similarity measures are suitable in all scenarios; some similarity measures work best in one scenario while others work best in another. A survey of the various measures in the distance-based category suggests that, except for the Euclidean distance measure, the measures have comparable effectiveness for the partitional text document clustering task. The Jaccard and Pearson coefficient measures can form more coherent clusters. Despite the above differences, the overall performance of these measures is similar.
REFERENCES
[1] D. Lin, "An information-theoretic definition of similarity," in Proc. 15th Int. Conf. Mach. Learn., San Francisco, CA, USA, 1998.
[2] A. Strehl and J. Ghosh, "Value-based customer grouping from large retail data-sets," in Proc. SPIE, vol. 4057, Orlando, FL, USA, Apr. 2000, pp. 33–42.
[3] J. A. Aslam and M. Frost, "An information-theoretic measure for document similarity," in Proc. 26th SIGIR, Toronto, ON, Canada, 2003, pp. 449–450.
[4] J. Han and M. Kamber, Data Mining: Concepts and Techniques, 2nd ed. San Francisco, CA, USA: Morgan Kaufmann; Boston, MA, USA: Elsevier, 2006.
[5] P.-N. Tan, M. Steinbach, and V. Kumar, Introduction to Data Mining. Boston, MA, USA: Addison-Wesley, 2006.
[6] T. W. Schoenharl and G. Madey, “Evaluation of measurement techniques for the validation of agent-based simulations against streaming data,” in Proc. ICCS, Kraków, Poland, 2008.
[7] C. G. González, W. Bonventi, Jr., and A. L. V. Rodrigues, "Density of closed balls in real-valued and autometrized Boolean spaces for clustering applications," in Proc. 19th Brazilian Symp. Artif. Intell., Salvador, Brazil, 2008, pp. 8–22.
[8] J. D'hondt, J. Vertommen, P.-A. Verhaegen, D. Cattrysse, and J. R. Duflou, "Pairwise-adaptive dissimilarity measure for document clustering," Inf. Sci., vol. 180, no. 12, pp. 2341–2358, 2010.
[9] Y.-S. Lin, J.-Y. Jiang, and S.-J. Lee, "A similarity measure for text classification and clustering," IEEE Trans. Knowl. Data Eng., vol. 26, no. 7, Jul. 2014.