
Technium Social Sciences Journal, 43/2023. A new decade for social changes.

WordNet-based Computation of Semantic Similarity Between Two Vietnamese Nouns

Trang Thi My Phan¹, An Thi Duong², Phuong Thi Minh Tran³

¹Saigon Technology University, Ho Chi Minh City, Vietnam

²University of Science, Vietnam National University Ho Chi Minh City, Vietnam

³University of Social Sciences and Humanities, Vietnam National University Ho Chi Minh City, Vietnam

[email protected], [email protected], [email protected]

Abstract. This study aimed to measure the semantic distance between Vietnamese nouns to determine their similarities and differences in terms of semantics. The study utilized a dataset of 1000 randomly selected noun pairs from the Vietnamese WordNet and applied different semantic distance measurement formulas, such as the Path-length, Leacock & Chodorow, Wu & Palmer, Resnik, Lin, and Jiang-Conrath measures. The results showed that the Resnik measure is suitable for evaluating noun similarity among primitive concepts of the same class and for measuring proximity between two classes of primitive concepts. On the other hand, the Leacock & Chodorow measure is appropriate for measuring the semantic distance between nouns belonging to two distant classes of primitive concepts. The methods used in this study have various potential applications in different fields of linguistics, such as determining synonyms, identifying semantically similar words, defining lexical-semantic fields, and detecting plagiarism.

Keywords. Semantic distance, Vietnamese nouns, WordNet, similarity measurement, synonyms

1. Introduction

Semantic distance is a measurement that quantifies the degree of similarity or difference in meaning between two language units, which can include words, phrases, sentences, paragraphs, or documents (Mohammad, 2008) [1]. Semantic distance measurement is crucial in applied linguistics and has numerous practical implications. Its applications include detecting academic plagiarism [2], disambiguating word senses [3], and conducting cross-lingual analysis [4]. Lexical semantic distance is an important aspect of semantic distance that measures the similarity or dissimilarity between words based on their meanings. Among the parts of speech, nouns are particularly useful for studying lexical semantic distance because they often represent tangible objects and concepts that can be visualized and compared.

There has been extensive research on semantic similarity in English, much of which has used resources such as the WordNet database. For instance, Tsatsaronis et al. (2010) suggested a semantics-based method for detecting text plagiarism that utilized WordNet and Wikipedia as knowledge bases to address synonymy and hyponymy/hypernymy issues [2]. In 2015, Czerski et al. presented an algorithm using sentence hashing that replaced certain words with representative synonyms from WordNet, a thesaurus, and an “IS A” ontology. The goal of this method was to reduce the number of sentences in a document while ensuring that the remaining sentences only contained lemmas [2]. Voorhees (1994) incorporated synonyms from WordNet to broaden search queries [3], while Sussna (1993) proposed a method that utilized the WordNet graph to calculate the semantic distance between topics. This method assigned a sense to each noun within a given context so as to minimize the semantic distance function between the selected senses [3].

Technium Social Sciences Journal, Vol. 43, 532-545, May, 2023, ISSN: 2668-7798, www.techniumscience.com

Despite these efforts, there is still a lack of research on semantic similarity in the Vietnamese language, which highlights the need for further investigation in this field.

Therefore, this study aims to fill the research gap in the measurement of semantic similarity in Vietnamese, with a focus on noun distance. The proposed methodology employs a corpus-based approach to develop a module for measuring semantic similarities between Vietnamese nouns.

The findings of this study have practical applications in Vietnamese texts and can be a valuable resource for researchers and specialists seeking to understand the meaning of Vietnamese nouns.

2. Literature Review

Semantic similarity is a critical concept in applied linguistics and natural language processing. WordNet, a lexical database, is widely used to measure semantic similarity between words in many languages. WordNet represents each word as a set of synsets, which are groups of synonyms that share the same meaning [5] and are connected by various semantic relationships. These relationships include hyponymy, hypernymy, meronymy, and holonymy [6]. While WordNet includes nouns, verbs, adjectives, and adverbs [5], this research will focus only on nouns to measure semantic relatedness.

There are several approaches to measuring semantic similarity between words in WordNet. The Path-length measure finds the shortest path between two synsets in the WordNet graph and assigns a similarity score based on the length of that path [7]. The Leacock & Chodorow measure scales the path length by the maximum depth of the hierarchy [8], [9], while the Wu & Palmer measure relates the depth of the nearest common parent of two nodes to the depths of the nodes themselves [8], [9]. In contrast, the Resnik measure calculates similarity from information content, which is derived from the frequency of each synset in a large corpus of text [9]. The Lin measure normalizes the information shared by two concepts by the information needed to describe both of them [9]. Finally, the Jiang-Conrath measure is derived from the difference between the information content of the two concepts and that of their nearest common parent [9].

Several studies have used WordNet to compute the semantic similarity of words in English. For example, Tyler et al. (2021) found that the simple WordNet path length measure was the most effective in predicting the recall of semantically related and unrelated word lists [7]. Hanane et al. (2019) developed a new method to calculate semantic similarity in WordNet that uses both synsets and glosses. This method has applications in fields such as information retrieval and plagiarism detection [10]. Richardson et al. (1994) proposed using WordNet as a knowledge base for measuring semantic similarity between words in information retrieval tasks. Their approach addressed challenges related to natural language and replaced pattern matching [11].

Despite these studies, there is still limited research on measuring the semantic similarity of Vietnamese words using WordNet. Therefore, this study aims to investigate the semantic similarity between pairs of Vietnamese nouns using WordNet.


3. Methodology

The following section outlines the methodology employed in this study for calculating the semantic similarity of two Vietnamese nouns using WordNet.

3.1 Noun selection

In this study, a dataset of 1000 Vietnamese noun pairs was used. It was sourced from the VietNet dictionary, a Vietnamese WordNet manually translated by the Multilingual KimTuDien Dictionary company using WordNet 2.1 as a reference. WordNet organizes English nouns under 25 primitive concepts [5], some of which are grouped together because of their similar meanings. As a result of this grouping, only 11 primitive concepts are used: entity, abstraction, psychological feature, natural phenomenon, activity, event, group, location, possession, shape, and state.

The 1000 Vietnamese noun pairs were mapped to the WordNet word list and categorized into the 11 primitive concepts of English nouns. The pairs were selected so that approximately 9% fell under each primitive concept, and pairs of nouns that shared the same semantic category were then identified. 334 pairs of nouns were selected for level 1 analysis (same class), 333 pairs of nouns for level 2 analysis (proximity between classes), and 333 pairs of nouns for level 3 analysis (semantic distance between distant classes). Finally, the 1000 Vietnamese noun pairs and their corresponding primitive categorization were documented in a spreadsheet for further analysis.

3.2 Synset identification

The paper presents a method for identifying synsets using the Natural Language Toolkit (NLTK) in Python. Synset identification involves identifying sets of nouns that share similar meanings or contexts, which can greatly enhance natural language processing tasks. The paper describes how to use NLTK to identify synsets and how to analyze the relationships between nouns.

3.3 WordNet-based Semantic Distance Measures

This study aimed to evaluate six different WordNet-based methods for measuring semantic similarity between Vietnamese nouns. The chosen methods included Path-length, Leacock & Chodorow, Wu & Palmer, Resnik, Lin, and Jiang-Conrath measures, which are commonly used in this area of research.

3.3.1 The Path-length measure

The Path-length similarity measure is based on the shortest path between two synsets in the WordNet database. It is the simplest of the similarity measures used in WordNet-based studies. The measure takes the path-length distance, i.e. the length of the shortest path between two nodes, and converts it to a similarity with the following formula:

$$\mathrm{sim}_{Path}(c_1, c_2) = \frac{1}{\mathrm{pathlen}(c_1, c_2) + 1} \tag{1}$$
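Formula (1) can be illustrated with a minimal Python sketch. The hypernym links below are a hypothetical toy hierarchy, not data from VietNet, and the sketch assumes that pathlen counts the edges on the shortest path, so that a concept compared with itself scores 1. The names `HYPERNYM`, `build_graph`, `pathlen`, and `sim_path` are illustrative only.

```python
from collections import deque

# Hypothetical toy hypernym links (child -> parent); not taken from VietNet.
HYPERNYM = {
    "animal": "entity",
    "artifact": "entity",
    "fish": "animal",
    "bird": "animal",
    "herring": "fish",
    "goby": "fish",
    "seat": "artifact",
}


def build_graph(hypernym):
    """Turn child -> parent links into an undirected adjacency map."""
    graph = {}
    for child, parent in hypernym.items():
        graph.setdefault(child, set()).add(parent)
        graph.setdefault(parent, set()).add(child)
    return graph


def pathlen(graph, c1, c2):
    """Number of edges on the shortest path between c1 and c2 (breadth-first search)."""
    queue = deque([(c1, 0)])
    seen = {c1}
    while queue:
        node, dist = queue.popleft()
        if node == c2:
            return dist
        for neighbour in graph[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    raise ValueError("the two concepts are not connected")


def sim_path(graph, c1, c2):
    """Formula (1): reciprocal of the shortest path length plus one."""
    return 1.0 / (pathlen(graph, c1, c2) + 1)


GRAPH = build_graph(HYPERNYM)
```

In this toy hierarchy, `sim_path(GRAPH, "herring", "goby")` is larger than `sim_path(GRAPH, "herring", "seat")`, because the first pair is two edges apart and the second pair is five edges apart.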

3.3.2 The Leacock & Chodorow measure

The Leacock and Chodorow similarity measure uses the path length between synsets in WordNet and refines the similarity between two concepts by incorporating the maximum depth of the hierarchical tree. The formula is expressed as follows:

$$\mathrm{sim}_{Lch}(c_1, c_2) = -\log\left(\frac{\mathrm{pathlen}(c_1, c_2)}{2 \cdot D}\right) \tag{2}$$

Where:

• pathlen: the shortest path length between the two concepts c1 and c2.

• D: the maximum depth of the whole hierarchy, counted with the root node at depth 1. In WordNet, the maximum depth is 20 for nouns and 14 for verbs.
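As a rough illustration of formula (2), the sketch below evaluates the measure for a given path length and maximum depth. It assumes the natural logarithm and assumes that the path length is counted so that it is at least 1, since the formula is undefined for a path length of 0. The function name `sim_lch` is hypothetical.

```python
import math


def sim_lch(pathlen, max_depth):
    """Formula (2): negative log of the path length scaled by twice the maximum depth.

    Assumptions: natural logarithm; pathlen >= 1 (the logarithm is undefined at 0).
    """
    if pathlen < 1:
        raise ValueError("pathlen must be at least 1")
    if max_depth < 1:
        raise ValueError("max_depth must be at least 1")
    return -math.log(pathlen / (2.0 * max_depth))


# With the noun depth D = 20 quoted above, shorter paths give higher similarity.
near = sim_lch(4, 20)
far = sim_lch(12, 20)
```

The score falls to 0 when the path length reaches 2·D, which is why the measure stays non-negative for any path that fits inside the hierarchy.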

3.3.3 The Wu & Palmer measure

Wu and Palmer proposed a similarity measure that combines the depth of the nearest common parent with the depths of the two nodes in the hierarchical tree, although in general this measure still depends on the edges linking the nodes.

$$\mathrm{sim}_{Wup}(c_1, c_2) = \frac{2 \cdot \mathrm{depth}(\mathrm{LCS}(c_1, c_2))}{\mathrm{depth}(c_1) + \mathrm{depth}(c_2)} \tag{3}$$

Where:

• depth: the node’s depth in the hierarchy.

• LCS: the nearest common parent node of the two nodes.

According to the above formula, the measure satisfies 0 < sim_Wup ≤ 1. It cannot be zero because the depth of the common parent node is always non-zero (the root node’s depth is 1). The measure has a value of 1 when the two concepts are on the same node, that is, when they are completely similar.
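Formula (3) and the bounds just described can be checked on a small example. The sketch below uses a hypothetical toy hierarchy stored as parent pointers (not VietNet data) and counts the root at depth 1, as stated above. The names `PARENT`, `ancestors`, `depth`, `lcs`, and `sim_wup` are illustrative.

```python
# Hypothetical toy hierarchy as child -> parent pointers; the root has no parent.
PARENT = {
    "entity": None,
    "animal": "entity",
    "artifact": "entity",
    "fish": "animal",
    "herring": "fish",
    "goby": "fish",
    "seat": "artifact",
}


def ancestors(concept):
    """Chain of nodes from the concept itself up to the root, inclusive."""
    chain = []
    while concept is not None:
        chain.append(concept)
        concept = PARENT[concept]
    return chain


def depth(concept):
    """Depth of a node, with the root counted at depth 1."""
    return len(ancestors(concept))


def lcs(c1, c2):
    """Nearest common parent: the first ancestor of c2 that is also an ancestor of c1."""
    candidates = set(ancestors(c1))
    for node in ancestors(c2):
        if node in candidates:
            return node
    raise ValueError("the two concepts share no ancestor")


def sim_wup(c1, c2):
    """Formula (3): twice the depth of the LCS over the sum of the two depths."""
    return 2.0 * depth(lcs(c1, c2)) / (depth(c1) + depth(c2))
```

Here "herring" and "goby" share the parent "fish" at depth 3, while "herring" and "seat" only meet at the root, so the second pair receives a lower but still non-zero score.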

3.3.4 The Resnik measure

The Resnik measure belongs to the group of algorithms that use information content, which typically produces more precise measurement results than the group of algorithms that rely on path-length. This approach to measuring word similarity takes into account both the structure of the WordNet network and the content of the information, including the probability of its appearance in a given corpus. The formula for calculating the probability of the concept c is:

$$P(c) = \frac{\sum_{w \in \mathrm{words}(c)} \mathrm{count}(w)}{N} \tag{4}$$

Where:

• words(c) is the set of words w that have the concept c.

• N is the total number of words in the corpus.

The information content is then calculated as:

$$IC(c) = -\log P(c) \tag{5}$$

Because P(c) is a probability between 0 and 1, the information content is always greater than or equal to 0. Its upper limit can be quite large and varies with the size of the corpus used to determine the information content value.
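Formulas (4) and (5) can be sketched as follows. The word counts and the concept-to-word mapping are hypothetical toy values, not corpus statistics from the study, and the natural logarithm is an assumption. The names `WORD_COUNTS`, `CONCEPT_WORDS`, `probability`, and `information_content` are illustrative.

```python
import math

# Hypothetical corpus counts for four words (toy values, not from the study).
WORD_COUNTS = {"herring": 2, "goby": 1, "fish": 5, "seat": 8}

# Hypothetical mapping from a concept to the words counted under it.
CONCEPT_WORDS = {
    "entity": ["herring", "goby", "fish", "seat"],
    "fish": ["herring", "goby", "fish"],
    "seat": ["seat"],
}


def probability(concept):
    """Formula (4): summed counts of the concept's words over the corpus size N."""
    corpus_size = sum(WORD_COUNTS.values())
    concept_count = sum(WORD_COUNTS[word] for word in CONCEPT_WORDS[concept])
    return concept_count / corpus_size


def information_content(concept):
    """Formula (5): negative log probability (natural log assumed)."""
    return -math.log(probability(concept))
```

In this toy setting the root concept covers every word, so its probability is 1 and its information content is 0, while more specific concepts receive higher values.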

The Resnik similarity measure of two concepts is calculated by the amount of information content of the nearest common parent node:

$$\mathrm{sim}_{Resnik}(c_1, c_2) = IC(\mathrm{LCS}(c_1, c_2)) \tag{6}$$

3.3.5 The Lin measure

The information carried by two objects A and B can be divided into two parts: the information they share and the information in which they differ. The differing information is what remains of the information describing A and B once the shared information is removed.

• The similar information: IC(common(A,B))

• The different information: IC(description(A,B)) - IC(common(A,B))

Lin improved the Resnik similarity measure as follows:

$$\mathrm{sim}_{Lin}(A, B) = \frac{IC(\mathrm{common}(A, B))}{IC(\mathrm{description}(A, B))} \tag{7}$$

Mapping formula (7) onto the concepts of a hierarchical tree in a semantic network gives the following formula:

$$\mathrm{sim}_{Lin}(c_1, c_2) = \frac{2 \cdot IC(\mathrm{LCS}(c_1, c_2))}{IC(c_1) + IC(c_2)} \tag{8}$$

Where:

• IC is the information content.

• LCS is the nearest common parent of c1 and c2.
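To show how formulas (6) and (8) relate, the sketch below computes both measures from three information content values. The inputs in the example are hypothetical numbers chosen for easy arithmetic, and the function names `sim_resnik` and `sim_lin` are illustrative.

```python
def sim_resnik(ic_lcs):
    """Formula (6): the information content of the nearest common parent."""
    return ic_lcs


def sim_lin(ic_c1, ic_c2, ic_lcs):
    """Formula (8): shared information normalized by the information of both concepts."""
    total = ic_c1 + ic_c2
    if total <= 0:
        # Formula (8) is undefined when both concepts carry no information.
        raise ValueError("IC(c1) + IC(c2) must be positive")
    return 2.0 * ic_lcs / total


# Hypothetical values: IC(c1) = 4.0, IC(c2) = 6.0, IC(LCS) = 3.0.
resnik_score = sim_resnik(3.0)
lin_score = sim_lin(4.0, 6.0, 3.0)
```

Because the nearest common parent is never more informative than either concept, the Lin score stays between 0 and 1, whereas the Resnik score has no fixed upper limit.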

3.3.6 The Jiang-Conrath measure

Also using content-based similarity measures, the Jiang-Conrath measure took a completely different approach from Lin. The Jiang-Conrath distance represents the different information between the two concepts.

Jiang-Conrath distance:

$$\mathrm{dist}_{JC}(c_1, c_2) = IC(c_1) + IC(c_2) - 2 \cdot IC(\mathrm{LCS}(c_1, c_2)) \tag{9}$$

Jiang-Conrath measure:

$$\mathrm{sim}_{JC}(c_1, c_2) = \frac{1}{\mathrm{dist}_{JC}(c_1, c_2)} \tag{10}$$
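Formulas (9) and (10) can be sketched in a few lines. Since formulas (6) and (8) share the same terms, the Jiang-Conrath score can also be recovered from a pair's Resnik and Lin scores; the example below does this for the Resnik (6.7049) and Lin (0.6254) scores reported in Table 2 for “qui trình” and “phẫu thuật mở khí quản”. Returning infinity for identical concepts is an assumption of this sketch, and all function names are illustrative.

```python
def dist_jc(ic_c1, ic_c2, ic_lcs):
    """Formula (9): information of the two concepts that is not shared."""
    return ic_c1 + ic_c2 - 2.0 * ic_lcs


def sim_jc(ic_c1, ic_c2, ic_lcs):
    """Formula (10): reciprocal of the Jiang-Conrath distance."""
    distance = dist_jc(ic_c1, ic_c2, ic_lcs)
    if distance == 0:
        # Assumption: identical concepts are treated as infinitely similar.
        return float("inf")
    return 1.0 / distance


def sim_jc_from_resnik_lin(resnik, lin):
    """Recover the JC score from Resnik and Lin scores (assumes 0 < lin < 1)."""
    ic_sum = 2.0 * resnik / lin  # IC(c1) + IC(c2), rearranged from formula (8)
    return 1.0 / (ic_sum - 2.0 * resnik)  # formulas (9) and (10)


# Resnik and Lin scores reported in Table 2 for one Level 1 noun pair.
example = sim_jc_from_resnik_lin(6.7049, 0.6254)
```

The value of `example` comes out at roughly 0.1245, which agrees with the Jiang-Conrath score listed for the same pair in Table 2, up to the rounding of the published figures.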

3.4 Similarity score calculation

The semantic distance between pairs of nouns can be measured at three levels, with appropriate distance measures used for each level. For example, at level 1, Wu-Palmer or the Leacock-Chodorow similarity measure can be used to measure the semantic distance between pairs of nouns that belong to the same class of primitive concept. At level 2, Lin or the Resnik similarity measure can be used to measure the semantic distance between pairs of nouns that belong to two proximate classes. At level 3, Jiang-Conrath, Lin, or Resnik similarity measure can be used to measure the semantic distance between pairs of nouns that belong to two distant classes of primitive concepts.

3.5 Data analysis

The authors calculated the semantic similarity scores for noun pairs at three levels, and then calculated the mean and standard deviation of the scores in order to understand the distribution in their dataset. The authors also conducted an analysis to identify the semantic similarity between two Vietnamese nouns and gain insights into their relatedness and semantic meanings, using Vietnamese WordNet. The results of the analysis will be presented in the next section.
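The descriptive statistics described above can be sketched with Python's standard `statistics` module. The input list holds the five Path-length scores shown for the selected Level 1 pairs in Table 2, used here only as sample input. Whether the study used the sample or the population form of the standard deviation is not stated, so the sample form is an assumption, and the function name `describe` is illustrative.

```python
import statistics

# Path-length scores of the five selected Level 1 pairs listed in Table 2.
path_scores = [0.0833, 0.1429, 0.2500, 0.2500, 0.1111]


def describe(scores):
    """Minimum, maximum, mean, standard deviation and variance of a score list.

    Assumption: sample (n - 1) standard deviation and variance.
    """
    return {
        "minimum": min(scores),
        "maximum": max(scores),
        "mean": statistics.mean(scores),
        "std_dev": statistics.stdev(scores),
        "variance": statistics.variance(scores),
    }


summary = describe(path_scores)
```

Applying `describe` to each measure's scores at each level would yield one row of a descriptive table with the columns Minimum, Maximum, Mean, Std. Deviation, and Variance.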

4. Results

We computed the semantic similarity of 1000 pairs of Vietnamese nouns using Vietnamese WordNet and six different measures: the Path-length, Leacock & Chodorow, Wu & Palmer, Resnik, Lin, and Jiang-Conrath measures.


4.1 Level 1: Measuring primitive concepts of the same class

According to the findings presented in Table 1, six distinct semantic measures were employed to evaluate the similarity between 334 pairs of nouns. All the nouns used for level 1 belong to the same category and represent primitive concepts of the same class.

Table 1: Descriptive Statistics for 334 Vietnamese Noun Pairs at Level 1 (measuring primitive concepts of the same class)

| Semantic measure | Minimum | Maximum | Mean | Std. Deviation | Variance |
|---|---|---|---|---|---|
| Path-length | 0.0476 | 0.3333 | 0.1104 | 0.0436 | 0.0019 |
| Leacock & Chodorow | 0.2877 | 2.2336 | 1.0609 | 0.3603 | 0.1298 |
| Wu & Palmer | 0.1667 | 0.8889 | 0.5620 | 0.1277 | 0.0163 |
| Resnik | 0.5181 | 8.0353 | 3.1043 | 1.3761 | 1.8936 |
| Lin | 0.0383 | 0.6254 | 0.2342 | 0.1052 | 0.0111 |
| Jiang-Conrath | 0.0364 | 0.1245 | 0.0507 | 0.0111 | 0.0001 |

- The Path-length measure had a range from 0.0476 to 0.3333, with a mean of 0.1104, a standard deviation of 0.0436, and a variance of 0.0019.

- The Leacock & Chodorow measure revealed a range from 0.2877 to 2.2336, with a mean of 1.0609, a standard deviation of 0.3603, and a variance of 0.1298.

- The Wu & Palmer measure had a range from 0.1667 to 0.8889, with a mean of 0.5620, a standard deviation of 0.1277, and a variance of 0.0163.

- The Resnik measure produced a range from 0.5181 to 8.0353, with a mean of 3.1043, a standard deviation of 1.3761, and a variance of 1.8936.

- The Lin measure had a range from 0.0383 to 0.6254, with a mean of 0.2342, a standard deviation of 0.1052, and a variance of 0.0111.

- The Jiang-Conrath measure showed a range from 0.0364 to 0.1245, with a mean of 0.0507, a standard deviation of 0.0111, and a variance of 0.0001.

Table 2: Semantic similarity scores of selected Vietnamese noun pairs at Level 1 (measuring primitive concepts of the same class)

| Noun pair | Path-length | Leacock & Chodorow | Wu & Palmer | Resnik | Lin | Jiang-Conrath |
|---|---|---|---|---|---|---|
| chứng sợ ban đêm (nyctophobia) - xơ cứng xương (osteosclerosis) | 0.0833 | 0.8473 | 0.5217 | 4.4810 | 0.3041 | 0.0488 |
| một họ cá bống (a family of gobies) - cá trích Mỹ (American herring) | 0.1429 | 1.3863 | 0.8125 | 7.2621 | 0.5119 | 0.0722 |
| qui trình (process) - phẫu thuật mở khí quản (tracheostomy) | 0.2500 | 1.9459 | 0.8235 | 6.7049 | 0.6254 | 0.1245 |
| khu vực bị trúng bom (bomb-damaged area) - sân trái (the left court) | 0.2500 | 1.9459 | 0.8235 | 6.4288 | 0.4732 | 0.0699 |
| sự bơi lội (swimming) - bóng chày (baseball) | 0.1111 | 1.1350 | 0.6667 | 7.4900 | 0.5753 | 0.0904 |

Table 2 displays the semantic similarity scores for selected Vietnamese noun pairs, which belong to the same primitive concept class.

- The first pair, “chứng sợ ban đêm” (nyctophobia) and “xơ cứng xương” (osteosclerosis), showed a low Jiang-Conrath score of 0.0488 and a low Path-length score of 0.0833, but a high Resnik score of 4.4810.

- The second pair, “một họ cá bống” (a family of gobies) and “cá trích Mỹ” (American herring), yielded a low Jiang-Conrath score of 0.0722, a moderate Leacock & Chodorow score of 1.3863, and a high Resnik score of 7.2621.

- The third pair, “qui trình” (process) and “phẫu thuật mở khí quản” (tracheostomy), produced a moderate Leacock & Chodorow score of 1.9459 and a high Resnik score of 6.7049.

- The fourth pair, “khu vực bị trúng bom” (bomb-damaged area) and “sân trái” (the left court), had a low Jiang-Conrath score of 0.0699, a moderate Leacock & Chodorow score of 1.9459, and a high Resnik score of 6.4288.

- The last pair, “sự bơi lội” (swimming) and “bóng chày” (baseball), had a low Jiang-Conrath score of 0.0904, a moderate Leacock & Chodorow score of 1.1350, and a high Resnik score of 7.4900.

4.2 Level 2: Measuring the proximity between two classes of primitive concepts

Table 3 presents the results of applying six semantic measures to evaluate the similarity between 333 pairs of nouns used to assess the proximity between two classes of primitive concepts.

Table 3: Descriptive Statistics for 333 Vietnamese Noun Pairs at Level 2 (measuring the proximity between two classes of primitive concepts)

| Semantic measure | Minimum | Maximum | Mean | Std. Deviation | Variance |
|---|---|---|---|---|---|
| Path-length | 0.0455 | 0.2000 | 0.0837 | 0.0258 | 0.0007 |
| Leacock & Chodorow | 0.2412 | 1.7228 | 0.8109 | 0.2806 | 0.0787 |
| Wu & Palmer | 0.1667 | 0.7059 | 0.4354 | 0.1025 | 0.0105 |
| Resnik | 0.5181 | 4.3375 | 1.7648 | 0.5981 | 0.3577 |
| Lin | 0.0378 | 0.3418 | 0.1338 | 0.0482 | 0.0023 |
| Jiang-Conrath | 0.0373 | 0.0755 | 0.0440 | 0.0057 | 0.0000 |

- The Path-length measure exhibited a range of values between 0.0455 and 0.2000, with a mean value of 0.0837, a standard deviation of 0.0258, and a variance of 0.0007.

- The Leacock & Chodorow measure showed a range of values between 0.2412 and 1.7228, with a mean value of 0.8109, a standard deviation of 0.2806, and a variance of 0.0787.

- The Wu & Palmer measure demonstrated a range of values between 0.1667 and 0.7059, with a mean value of 0.4354, a standard deviation of 0.1025, and a variance of 0.0105.

- The Resnik measure had a range of values between 0.5181 and 4.3375, with a mean value of 1.7648, a standard deviation of 0.5981, and a variance of 0.3577.

- The Lin measure exhibited a range of values between 0.0378 and 0.3418, with a mean value of 0.1338, a standard deviation of 0.0482, and a variance of 0.0023.

- The Jiang-Conrath measure revealed a range of values between 0.0373 and 0.0755, with a mean value of 0.0440, a standard deviation of 0.0057, and a variance of 0.0000.

Table 4: Semantic similarity scores of selected Vietnamese noun pairs at Level 2 (measuring the proximity between two classes of primitive concepts)

| Noun pair | Path-length | Leacock & Chodorow | Wu & Palmer | Resnik | Lin | Jiang-Conrath |
|---|---|---|---|---|---|---|
| người theo thuyết skinner (Skinnerian) - người di chuyển (traveler) | 0.1667 | 1.5404 | 0.6316 | 2.4662 | 0.1738 | 0.0427 |
| tiêu chuẩn (standard) - phương trình số mũ (exponential equation) | 0.1250 | 1.2528 | 0.5333 | 4.3375 | 0.3089 | 0.0515 |
| người hay bới móc (faultfinder) - bình luận viên (commentator) | 0.1250 | 1.2528 | 0.5714 | 2.4662 | 0.1828 | 0.0453 |
| người dạy thú (animal trainer) - cầu thủ sân ngoài (winger) | 0.1000 | 1.0296 | 0.5217 | 2.4662 | 0.1808 | 0.0448 |
| người hảo tâm (philanthropist) - người chứng kiến (witness) | 0.2000 | 1.7228 | 0.6667 | 2.4662 | 0.2715 | 0.0755 |

Table 4 displays the semantic similarity scores of several Vietnamese noun pairs used to assess the closeness between two classes of primitive concepts.

- The noun pair “người theo thuyết skinner” (Skinnerian) and “người di chuyển” (traveler) had a low Jiang-Conrath score of 0.0427, a moderate Leacock & Chodorow score of 1.5404, and a high Resnik score of 2.4662.

- The noun pair “tiêu chuẩn” (standard) and “phương trình số mũ” (exponential equation) had a low Jiang-Conrath score of 0.0515, a moderate Leacock & Chodorow score of 1.2528, and a high Resnik score of 4.3375.

- The noun pair “người hay bới móc” (faultfinder) and “bình luận viên” (commentator) had a low Jiang-Conrath score of 0.0453, a moderate Leacock & Chodorow score of 1.2528, and a high Resnik score of 2.4662.

- The noun pair “người dạy thú” (animal trainer) and “cầu thủ sân ngoài” (winger) had a low Jiang-Conrath score of 0.0448, a moderate Leacock & Chodorow score of 1.0296, and a high Resnik score of 2.4662.

- The noun pair “người hảo tâm” (philanthropist) and “người chứng kiến” (witness) had a low Jiang-Conrath score of 0.0755, a moderate Leacock & Chodorow score of 1.7228, and a high Resnik score of 2.4662.

4.3 Level 3: Measuring the semantic distance between pairs of nouns that belong to two distant classes of primitive concepts

In the analysis presented in Table 5, six semantic measures were employed to evaluate the similarity between 333 pairs of nouns. This was done to assess the semantic distance between pairs of nouns belonging to two distant classes of primitive concepts.

Table 5: Descriptive Statistics for 333 Vietnamese Noun Pairs at Level 3 (measuring the semantic distance between pairs of nouns that belong to two distant classes of primitive concepts)

| Semantic measure | Minimum | Maximum | Mean | Std. Deviation | Variance |
|---|---|---|---|---|---|
| Path-length | 0.0385 | 0.2000 | 0.0670 | 0.0185 | 0.0003 |
| Leacock & Chodorow | 0.0741 | 1.7228 | 0.5975 | 0.2414 | 0.0583 |
| Wu & Palmer | 0.0741 | 0.7500 | 0.1920 | 0.1093 | 0.0120 |
| Resnik | -0.0336 | 4.6752 | 0.4127 | 0.6742 | 0.4546 |
| Lin | -0.0041 | 0.3428 | 0.0313 | 0.0519 | 0.0027 |
| Jiang-Conrath | 0.0339 | 0.0716 | 0.0393 | 0.0053 | 0.0000 |

- The Path-length measure ranged from a minimum of 0.0385 to a maximum of 0.2000, with a mean of 0.0670, a standard deviation of 0.0185, and a variance of 0.0003.

- For the Leacock & Chodorow measure, the minimum value was 0.0741, the maximum was 1.7228, the mean was 0.5975, the standard deviation was 0.2414, and the variance was 0.0583.

- The Wu & Palmer measure had a minimum value of 0.0741, a maximum value of 0.7500, a mean value of 0.1920, a standard deviation of 0.1093, and a variance of 0.0120.

- The Resnik measure produced a minimum value of -0.0336, a maximum value of 4.6752, a mean value of 0.4127, a standard deviation of 0.6742, and a variance of 0.4546.

- The Lin measure ranged from a minimum of -0.0041 to a maximum of 0.3428, with a mean value of 0.0313, a standard deviation of 0.0519, and a variance of 0.0027.

- Finally, the Jiang-Conrath measure had a minimum value of 0.0339, a maximum value of 0.0716, a mean value of 0.0393, a standard deviation of 0.0053, and a variance of 0.0000.

Table 6: Semantic similarity scores of selected Vietnamese noun pairs at Level 3 (measuring the semantic distance between pairs of nouns that belong to two distant classes of primitive concepts)

| Noun pair | Path-length | Leacock & Chodorow | Wu & Palmer | Resnik | Lin | Jiang-Conrath |
|---|---|---|---|---|---|---|
| thạch cao (gypsum) - bài hát (song) | 0.0833 | 0.8473 | 0.2667 | 0.5181 | 0.0483 | 0.0490 |
| giới giáo sĩ (clergy) - siêu kiến thức (genius) | 0.0909 | 0.9343 | 0.2857 | 0.5181 | 0.0384 | 0.0385 |
| sự đề xuất (proposal) - chỗ ngồi (seat) | 0.0909 | 0.9343 | 0.2857 | 0.5181 | 0.0420 | 0.0423 |
| nguyên tử các-bon (carbon atom) - túi đựng trà (tea bag) | 0.0833 | 0.8473 | 0.2667 | 0.6517 | 0.0442 | 0.0355 |
| thông điệp giáo hoàng (papal message) - chất dầu thơm (essential oil) | 0.0909 | 0.9343 | 0.5000 | 2.2190 | 0.1542 | 0.0411 |

Table 6 displays the semantic similarity scores of selected Vietnamese noun pairs, which assess the semantic distance between pairs of nouns belonging to two distant classes of primitive concepts.


- The first pair of nouns, “thạch cao” (gypsum) and “bài hát” (song), received its lowest score from the Lin measure (0.0483) and a low Jiang-Conrath score of 0.0490. However, the Leacock & Chodorow measure generated the highest score, 0.8473.

- The second pair of nouns, “giới giáo sĩ” (clergy) and “siêu kiến thức” (genius), received its lowest score from the Lin measure (0.0384) and a low Jiang-Conrath score of 0.0385. Nevertheless, the Leacock & Chodorow measure produced the highest score, 0.9343.

- The third pair of nouns, “sự đề xuất” (proposal) and “chỗ ngồi” (seat), received its lowest score from the Lin measure (0.0420) and a low Jiang-Conrath score of 0.0423. However, the Leacock & Chodorow measure produced the highest score, 0.9343.

- The fourth pair of nouns, “nguyên tử các-bon” (carbon atom) and “túi đựng trà” (tea bag), received its lowest score from the Jiang-Conrath measure (0.0355) and its highest score from the Leacock & Chodorow measure (0.8473).

- The fifth pair of nouns, “thông điệp giáo hoàng” (papal message) and “chất dầu thơm” (essential oil), received its lowest score from the Jiang-Conrath measure (0.0411) and its highest score from the Resnik measure (2.2190).

5. Discussion

5.1 Level 1: Measuring primitive concepts of the same class

The study examined the effectiveness of six semantic measures in evaluating the similarity between 334 pairs of Vietnamese nouns belonging to the same category in level 1.

The results presented in Table 1 showed that the Resnik and Leacock & Chodorow measures were most effective, with mean similarity values of 3.1043 and 1.0609, respectively, in capturing subtle semantic differences. In contrast, the Jiang-Conrath and Path-length measures had the lowest mean similarity values of 0.0507 and 0.1104, respectively, and were less effective in evaluating noun similarity. The Lin measure had a lower mean similarity value of 0.2342, suggesting lower sensitivity to subtle semantic differences. The Wu & Palmer measure yielded a mean similarity value of 0.5620, indicating moderate sensitivity to subtle semantic differences. Overall, the study emphasizes the importance of choosing the appropriate semantic measure for evaluating noun similarity and relatedness, as the choice can significantly impact the results and interpretation of the findings.

Table 2 presents valuable insights into the effectiveness of different measures for measuring semantic similarity among Vietnamese noun pairs. The Resnik measure consistently received the highest scores, which indicates its reliability in capturing semantic similarities, as seen in the score for “sự bơi lội” (swimming) and “bóng chày” (baseball). In contrast, the Jiang-Conrath measure consistently received the lowest scores, suggesting its lower reliability, as in the score for “chứng sợ ban đêm” (nyctophobia) and “xơ cứng xương” (osteosclerosis). The Leacock & Chodorow measure provided moderate scores, indicating that it is effective but not as reliable as the Resnik measure, as observed for “qui trình” (process) and “phẫu thuật mở khí quản” (tracheostomy). Overall, the Resnik measure was the most reliable for measuring semantic similarity between Vietnamese nouns belonging to the same primitive concept class.

5.2 Level 2: Measuring the proximity between two classes of primitive concepts

Table 3 presents the results of using six semantic measures to assess the similarity between 333 pairs of nouns in level 2, revealing a wide range of mean values. The Resnik measure had the highest mean value of 1.7648, indicating its suitability for applications requiring a high degree of semantic similarity. In contrast, the Jiang-Conrath measure had the lowest mean value of 0.0440, suggesting its lower sensitivity to nuances in meaning. The Path-length measure had a low mean value of 0.0837, indicating that it may be less sensitive compared to other measures. The Leacock & Chodorow, Wu & Palmer, and Lin measures recorded mean values ranging from 0.1338 to 0.8109, indicating different degrees of sensitivity to noun similarity and relatedness. The study highlights the importance of choosing a semantic measure based on the specific task and desired balance between sensitivity and robustness.

Table 4 shows semantic similarity scores for Vietnamese noun pairs, revealing that the Resnik measure consistently achieved the highest scores, indicating its effectiveness in measuring semantic similarity. For example, the noun pair “tiêu chuẩn” and “phương trình số mũ” scored 4.3375. In contrast, the Jiang-Conrath measure scored the lowest for all pairs, indicating lower effectiveness. For instance, “người theo thuyết skinner” and “người di chuyển” scored 0.0427. The Leacock & Chodorow measure produced moderate scores for most pairs, indicating that it is effective in measuring semantic similarity but not as reliable as the Resnik measure. For example, “người dạy thú” and “cầu thủ sân ngoài” scored 1.0296. Overall, the Resnik measure appears to be the most reliable for measuring semantic similarity between the selected Vietnamese noun pairs.

5.3 Level 3: Measuring the semantic distance between pairs of nouns that belong to two distant classes of primitive concepts

Table 5 underlines the importance of selecting an appropriate semantic measure for assessing noun similarity and relatedness among the 333 pairs of nouns in level 3. The analysis indicates that the Leacock & Chodorow measure is suitable for measuring the semantic distance between nouns of two distant classes of primitive concepts, as it had the highest mean value of 0.5975. However, the Lin measure is not suitable, as it had the lowest mean value of 0.0313. The Resnik measure is appropriate for applications requiring a broader range of similarities, with a high mean value of 0.4127 and a wide range of values. The Jiang-Conrath measure is not suitable for evaluating noun similarity and relatedness, with a low mean value of 0.0393. Moreover, the Path-length measure had a relatively low mean value of 0.0670, suggesting it may be less sensitive. The Wu & Palmer measure had an intermediate mean value of 0.1920, indicating varying degrees of sensitivity to noun similarity and relatedness. Therefore, researchers should carefully select the appropriate semantic measure based on their specific application requirements.

Table 6 shows the semantic similarity scores of selected Vietnamese noun pairs, which assess the semantic distance between pairs of nouns belonging to two distant classes of primitive concepts. The Leacock & Chodorow measure was found to be the most effective for most noun pairs, including the pairs “giới giáo sĩ” (clergy) and “siêu kiến thức” (genius), “sự đề xuất”

(proposal) and “chỗ ngồi” (seat), and “thông điệp giáo hoàng” (papal message) and “chất dầu thơm” (essential oil). The Resnik measure was also effective for pairs with higher relatedness, such as the pair “thông điệp giáo hoàng” (papal message) and “chất dầu thơm” (essential oil).

However, the Jiang-Conrath measure consistently produced low scores for most noun pairs, particularly for pairs with lower relatedness, like “nguyên tử các-bon” (carbon atom) and “túi đựng trà” (tea bag). Overall, the results indicate that the Leacock & Chodorow measure is the most effective for measuring semantic similarity between pairs of nouns of two distant classes of primitive concepts in Vietnamese, followed by the Resnik measure.
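The Resnik, Lin, and Jiang-Conrath measures discussed above all rely on information content (IC). As a hedged illustration, the sketch below computes them on a small hypothetical taxonomy with made-up concept counts. The taxonomy, the counts, and all function names are assumptions for illustration only and do not come from Vietnamese WordNet or from the corpus used in this study. The formulas follow the commonly used formulations: IC(c) = -log p(c), Resnik as IC(LCS), Lin as 2 × IC(LCS) / (IC(a) + IC(b)), and Jiang-Conrath similarity as the inverse of the distance IC(a) + IC(b) - 2 × IC(LCS).

```python
import math

# Hypothetical toy is-a taxonomy (child -> parent); NOT taken from Vietnamese WordNet.
PARENT = {
    "entity": None,
    "person": "entity",
    "artifact": "entity",
    "trainer": "person",
    "player": "person",
    "outfielder": "player",
    "seat": "artifact",
}

# Made-up counts of each concept's own occurrences in an imaginary corpus.
COUNT = {
    "entity": 0, "person": 0, "artifact": 4,
    "trainer": 4, "player": 2, "outfielder": 2, "seat": 4,
}


def ancestors(node):
    """Return the chain from node up to the root, starting with node itself."""
    chain = []
    while node is not None:
        chain.append(node)
        node = PARENT[node]
    return chain


def lcs(a, b):
    """Lowest common subsumer: the deepest ancestor shared by a and b."""
    seen = set(ancestors(a))
    for node in ancestors(b):
        if node in seen:
            return node
    raise ValueError("no common ancestor")


def cumulative_counts():
    """Add each concept's own count to every one of its ancestors."""
    total = dict(COUNT)
    for node in PARENT:
        anc = PARENT[node]
        while anc is not None:
            total[anc] += COUNT[node]
            anc = PARENT[anc]
    return total


CUM = cumulative_counts()
TOTAL = CUM["entity"]  # the root subsumes every occurrence


def ic(node):
    """Information content -log p(node), written as log(TOTAL / count)."""
    return math.log(TOTAL / CUM[node])


def resnik_similarity(a, b):
    return ic(lcs(a, b))


def lin_similarity(a, b):
    denom = ic(a) + ic(b)
    # Returning 0.0 when both concepts carry no information is a choice of this sketch.
    return 2.0 * ic(lcs(a, b)) / denom if denom > 0 else 0.0


def jcn_similarity(a, b):
    distance = ic(a) + ic(b) - 2.0 * ic(lcs(a, b))
    # Identical concepts have distance 0; infinity is used here as a placeholder.
    return math.inf if distance == 0 else 1.0 / distance


for pair in [("trainer", "outfielder"), ("trainer", "seat")]:
    print(pair, resnik_similarity(*pair), lin_similarity(*pair), jcn_similarity(*pair))
```

With these made-up counts, the within-class pair ("trainer", "outfielder") has a Resnik score of about 0.693 and a Lin score of 0.4, while the cross-class pair ("trainer", "seat"), whose only shared ancestor is the root, scores 0 on both. Its Jiang-Conrath score stays positive but low (about 0.361), because that measure depends on the IC of the two concepts themselves as well as on their common subsumer.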

The implications of these findings on measuring semantic similarity in Vietnamese are significant for linguistic tasks such as identifying synonyms, defining lexical-semantic fields, and detecting plagiarism. However, the study’s limitations include a sample size of only 1000 noun pairs, which may not represent the entire Vietnamese language. Future studies could expand the number of pairs tested and explore semantic similarity in other parts of speech. Additionally, only six measures were used, and future studies could examine other measures to provide a more comprehensive analysis of noun similarity and relatedness.

6. Conclusion

This paper aimed to measure the semantic distance between Vietnamese nouns using six different methods and a dataset of 1000 noun pairs from Vietnamese WordNet. The study concluded that the Resnik measure is suitable for assessing noun similarity within the same class and for measuring proximity between two classes, while the Leacock & Chodorow measure is more appropriate for measuring semantic distance between distant classes. The findings suggest that these methods are effective in computing semantic similarity and reflecting semantic relationships between noun pairs.

The methodology presented in the paper has practical applications in various linguistic fields, including identifying synonyms, defining lexical-semantic fields, and detecting plagiarism. By using Vietnamese WordNet to compute semantic similarity, the study provides valuable insights into the semantic relations between Vietnamese nouns and their distance, thereby contributing to the advancement of semantic similarity research in the Vietnamese language. This research could also lead to more advanced approaches for measuring the semantic distance of other parts of speech, such as verbs or adjectives, with implications for natural language processing and language-based application development.

References

[1] Mohammad, S. (2008). Measuring semantic distance using distributional profiles of concepts. University of Toronto.

[2] Vrbanec, T., & Meštrović, A. (2017, May). The struggle with academic plagiarism: Approaches based on semantic similarity. In 2017 40th International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO) (pp. 870-875). IEEE.

[3] Mittal, K., & Jain, A. (2015). Word sense disambiguation method using semantic similarity measures and owa operator. ICTACT Journal on Soft Computing, 5(2).

[4] Pandit, R., Sengupta, S., Naskar, S. K., Dash, N. S., & Sardar, M. M. (2019, May).
Improving semantic similarity with cross-lingual resources: A study in bangla—a low resourced language. In Informatics (Vol. 6, No. 2, p. 19). MDPI.

[5] Miller, G. A., et al. (1990). Introduction to WordNet: An on-line lexical database. International Journal of Lexicography, 3(4), 235-244.

[6] Jurafsky, D., & Martin, J. (2018). WordNet: Word Relations Senses and Disambiguation. Speech Lang. Process.

[7] Ensor, T. M., MacMillan, M. B., Neath, I., & Surprenant, A. M. (2021). Calculating semantic relatedness of lists of nouns using WordNet path length. Behavior Research Methods, 1-9.

[8] Amine, E. L., Madani, Y., El Ayachi, R., & Erritali, M. (2019). A new semantic similarity approach for improving the results of an Arabic search engine. Procedia Computer Science, 151, 1170-1175.

[9] Gupta, A., & Goyal, K. K. (2019). Classification of Semantic Similarity Technique between Word Pairs using Word Net. International Journal of Engineering and Advanced Technology, 9(2), 4397-4402.

[10] Ezzikouri, H., Madani, Y., Erritali, M., & Oukessou, M. (2019). A new approach for calculating semantic similarity between words using WordNet and set theory. Procedia Computer Science, 151, 1261-1265.

[11] Richardson, R., Smeaton, A., & Murphy, J. (1994). Using WordNet as a knowledge base for measuring semantic similarity between words.