
Network Analysis of Maritime English Corpus with Multi-word Compounds:


Academic year: 2023


The Maritime English Corpus (MEC) contains tagged multi-word compounds, which can be regarded as specific purpose terms in maritime English. The first research question is how to build a corpus of maritime English that represents specific purpose concepts such as multi-word compounds.

Introduction

Outline of the Thesis

I also review basic concepts of language network construction, previous studies, definitions and the types of language network construction in order to give explanatory power to the corpus data in Chapter 4. Chapter 4 proposes language network analysis to provide further explanatory power when comparing the corpus descriptions by keyword.

Literature Review

Maritime English as English for Specific Purposes

The IMO has shown interest in maritime English training for seafarers because it is one of the most important issues for the international maritime community and the maritime industry. Maritime English terms are difficult to understand because almost all domain-specific terms are used only in the maritime industry.

Keywords in Text

  • Strategies for a Reference Corpus
  • Statistical Measures for Keyword Analysis
  • Problems of Previous Keyword Analysis

There is some statistical debate about keyword analysis, with arguments for preferring a log-likelihood test over a chi-square test. The log-likelihood calculation uses an expected value, the observed frequency of a word in corpus one, and the observed frequency of the same word in corpus two.
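The log-likelihood calculation described above can be sketched as follows. This is a minimal illustration of the commonly used two-corpus log-likelihood formula, not the thesis's own code; the function name, parameter names and the frequencies in the usage lines are assumptions made for the example.

```python
import math


def log_likelihood(freq_study, size_study, freq_ref, size_ref):
    """Log-likelihood keyness of one word compared across two corpora.

    freq_* are the observed frequencies of the word, size_* the corpus sizes
    in tokens. All names are illustrative.
    """
    total_freq = freq_study + freq_ref
    total_size = size_study + size_ref
    # Expected frequencies if the word were spread evenly over both corpora.
    expected_study = size_study * total_freq / total_size
    expected_ref = size_ref * total_freq / total_size

    ll = 0.0
    for observed, expected in ((freq_study, expected_study),
                               (freq_ref, expected_ref)):
        if observed > 0:  # a zero frequency contributes nothing to the sum
            ll += observed * math.log(observed / expected)
    return 2 * ll


# Hypothetical counts: a word occurring 50 times in a 1,000-token study corpus
# and 10 times in a 1,000-token reference corpus.
score = log_likelihood(50, 1000, 10, 1000)
print(round(score, 2))
```

A score above 3.84 corresponds to p < 0.05 with one degree of freedom, which is a common cut-off when ranking keyword candidates.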

Table 2.1 Keyness in a genre of maritime law

Collocations in Text

  • Types of Collocations
  • Statistical Measures for Window Collocations
  • Problems of Previous Collocation Analysis

Because collocation results depend on several variables, such as window span, frequency and the statistical measure, no single tool can evaluate all of them. It is therefore important that individual researchers test the available statistical measures to find the most appropriate method, in both general and specialized corpora.

Visualization in Corpus Linguistics

  • Text Visualizations
  • Collocation Networks

They adopted a corpus-oriented approach to the new dictionary using collocation networks, as shown in Figure 2.7(b). He advocated some of the advantages of visualization in collocation studies and proposed directed collocation networks based on Delta P, as shown in Figure 2.11 (Brezina et al.). The software offers 12 different statistical measures for identifying collocations, as well as window span options.
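To make the idea of a directed collocation measure concrete, the sketch below computes Delta P in both directions for one word pair. It assumes the standard definition, ΔP(w2|w1) = P(w2|w1) − P(w2|not w1), and, for brevity, counts only adjacent word pairs rather than a wider window. The function name and the toy token list are illustrative and do not come from the thesis or from any particular tool.

```python
def delta_p(tokens, w1, w2):
    """Return (delta P of w2 given w1, delta P of w1 given w2).

    Counts are taken over adjacent pairs in which w1 is the left word and
    w2 is the right word (a simplifying assumption for this sketch).
    """
    a = b = c = d = 0
    for left, right in zip(tokens, tokens[1:]):
        if left == w1 and right == w2:
            a += 1          # w1 followed by w2
        elif left == w1:
            b += 1          # w1 followed by something else
        elif right == w2:
            c += 1          # something else followed by w2
        else:
            d += 1          # neither

    def ratio(x, y):
        return x / y if y else 0.0

    forward = ratio(a, a + b) - ratio(c, c + d)    # how well w1 predicts w2
    backward = ratio(a, a + c) - ratio(b, b + d)   # how well w2 predicts w1
    return forward, backward


tokens = "ballast water management ballast water treatment fresh water".split()
forward, backward = delta_p(tokens, "ballast", "water")
print(forward, backward)
```

Because the two directions generally differ, each collocation can be drawn as an arrow pointing from the better predictor to the predicted word, which is what makes the resulting network directed.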

Fig. 2.1 Collocate clouds visualization

Language Networks

  • Basic Concepts
  • Previous Studies
  • Definitions
  • Types of Language Network Constructions

Social network analysis studies the elements that make up human societies, such as individuals and organizations. Since corpus data follow a power-law distribution, as described by Zipf's law, my data can be subjected to language network analysis.

Fig. 2.12 Four types of link of keywords

Maritime English Corpus

Corpus Design

To collect data for the academic genre, I used Springer's database (http://www.springer.com), which offers many journals to the scientific and professional communities, and Elsevier's ScienceDirect (http://www.sciencedirect.com), which is operated by one of the largest publishers in the world. The journals sampled include “Maritime Studies”, “Gyroscopy and Navigation”, “Aegean Review of the Law of the Sea and Maritime Law” and “WMU Journal of Maritime Affairs”.

Table 3.2 List of news website sources

Corpus Compilation

  • Stratified Random Sampling
  • Web Crawling and Cleansing
  • Converting PDF to Texts

The second step is to determine how many words to extract by setting the number of words in WANT_WORDS_CNT, as shown in Figure 3.1. The next step is to extract only sentences from the collected HTML documents using a Python program.

Fig. 3.1 Python coding for random sampling
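The code in Figure 3.1 is not reproduced here. As a rough sketch of the sampling step, the example below draws random sentences from each genre (stratum) until a word target is reached. Only the name WANT_WORDS_CNT comes from the text; its value, the function names, the seed and the toy sentences are assumptions for illustration.

```python
import random

WANT_WORDS_CNT = 12  # target number of words per genre; illustrative value


def sample_genre(sentences, want_words_cnt, rng):
    """Randomly pick sentences until at least want_words_cnt words are collected."""
    pool = list(sentences)
    rng.shuffle(pool)
    sample, total = [], 0
    for sentence in pool:
        if total >= want_words_cnt:
            break
        sample.append(sentence)
        total += len(sentence.split())
    return sample, total


def stratified_sample(corpus_by_genre, want_words_cnt, seed=0):
    """Sample every genre separately so that each one is represented."""
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    return {genre: sample_genre(sentences, want_words_cnt, rng)
            for genre, sentences in corpus_by_genre.items()}


# Toy data standing in for cleaned sentences from two genres.
corpus = {
    "law": ["ships shall carry certificates", "masters shall keep records",
            "parties shall notify administrations", "surveys shall cover equipment",
            "owners shall maintain insurance"],
    "news": ["port traffic increased", "freight rates fell", "new vessel launched",
             "crew returned home", "storm delayed ferries", "canal reopened today"],
}

for genre, (sample, total) in stratified_sample(corpus, WANT_WORDS_CNT).items():
    print(genre, total, sample)
```

Sampling each genre on its own, rather than sampling from the pooled corpus, keeps smaller genres from being crowded out by larger ones.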

Multi-word Compounds

My thesis deals with the third type of multi-word compounds, such as coast guard and ballast water management. To find out recent research trends, I reviewed papers from the multi-word expression workshops at NAACL HLT 2013 (Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies) and EACL 2014 (Conference of the European Chapter of the Association for Computational Linguistics). The approaches fall into four main areas: statistical, linguistic, machine learning and combined approaches.

Although these previous studies provide a number of algorithms for finding compounds, none offers a complete solution for identifying multi-word compounds in ESP vocabulary studies. However, I found a useful and practical method that uses reference dictionaries to identify multi-word compounds. To cover recently added terms, I used a supplementary source, Wikipedia's "Glossary of Nautical Terms (2015)", https://en.wikipedia.org/wiki/Glossary_of_nautical_terms.

Table 3.5 Types of general English compounds and maritime English compounds

Critical Evaluation and Tagging for Multi-word Compounds

The combined list of entries from the three sources contains a variety of multi-word compounds ranging from 2-grams to 9-grams, as shown in Table 3.7. These statistics show how important tagging multi-word compounds is in ESP studies. To tag multi-word compounds, I combined all the entries from three different sources into a reference maritime English dictionary.

The second is a list of multi-word compounds, which I call a reference list of multi-word compounds. For the extraction, I wrote Python code to create a corpus annotated with multi-word compounds. Finally, I compiled a new reference list of multi-word compounds that includes the lemmatized forms and all word forms derived from them.
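The thesis's tagging program is not shown in this summary. One plausible way to annotate a token stream from a reference list is greedy longest-match lookup, sketched below. The underscore joining mirrors forms such as continue_to_sea quoted later in the text. The function name, the matching strategy and the tiny reference list are assumptions.

```python
def tag_compounds(tokens, reference_list, max_n=9):
    """Join tokens that match a reference compound into one underscore-linked token.

    At each position the longest matching compound wins (greedy longest match).
    max_n=9 reflects the 2-gram to 9-gram range reported for the reference list.
    """
    lookup = {tuple(entry.lower().split()) for entry in reference_list}
    tagged, i = [], 0
    while i < len(tokens):
        # Try the longest candidate first, down to 2-grams.
        for n in range(min(max_n, len(tokens) - i), 1, -1):
            candidate = tuple(t.lower() for t in tokens[i:i + n])
            if candidate in lookup:
                tagged.append("_".join(tokens[i:i + n]))
                i += n
                break
        else:
            tagged.append(tokens[i])  # no compound starts here
            i += 1
    return tagged


reference = ["coast guard", "ballast water", "ballast water management"]
sentence = "the coast guard checked ballast water management records".split()
print(tag_compounds(sentence, reference))
```

Preferring the longest match means that ballast water management is tagged as one unit rather than as ballast_water followed by management, so longer terms are not split by the shorter compounds nested inside them.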

Table 3.7 Percentage of multi-word compounds in a reference multi-word compound list

Comparison of With and Without Compounds

  • Comparison of Basic Statistics
  • Comparison of Word Lists, N-gram Lists, and Keyword Lists
  • Comparison of Visualizations
    • Dispersion Plots
    • GraphColl 1.0

First, the number of tokens for each genre is reduced when multi-word compounds are tagged in the corpora. As can be seen from Tables 3.10 and 3.11, the difference in types is due to the newly added multi-word compounds. It is important to note that the basic statistics of a corpus differ significantly depending on whether multi-word compounds are included.
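The effect of tagging on basic statistics can be checked with a few lines of code. The sketch below counts tokens and types and computes a standardized type-token ratio (STTR) as the mean type-token ratio over consecutive full chunks, expressed as a percentage. The chunk size of 1,000 tokens, the function name and the toy sentences are assumptions, and the figures it prints are not results from the thesis.

```python
def basic_stats(tokens, chunk_size=1000):
    """Return token count, type count and STTR (percent) for a token list."""
    words = [t.lower() for t in tokens]
    ratios = []
    # Only full chunks are used; a shorter final remainder is ignored.
    for start in range(0, len(words) - chunk_size + 1, chunk_size):
        chunk = words[start:start + chunk_size]
        ratios.append(len(set(chunk)) / chunk_size)
    sttr = 100 * sum(ratios) / len(ratios) if ratios else None
    return {"tokens": len(words), "types": len(set(words)), "sttr": sttr}


untagged = "port state control inspects port facility and flag state".split()
tagged = ["port_state_control", "inspects", "port_facility", "and", "flag_state"]

print(basic_stats(untagged, chunk_size=3))
print(basic_stats(tagged, chunk_size=3))
```

In this toy case the token count drops from 9 to 5 once compounds are joined. How the type count and the STTR change depends on the corpus, which is why the tagged and untagged versions need to be compared on the full data.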

The corpus tagged with multi-word compounds has more entries in its word list than the untagged corpus. In the 4-gram lists, the tagged law corpus is also found to contain multi-word compounds, as in ships ballast water and sediments, fit to continue_to_sea without, and gross tonnage calculated in accordance. As can be seen in Table 3.16 above, the hits column shows that the keywords without multi-word compound tagging occur more often than those with multi-word compound tagging.

Table 3.10 Statistical results before tagging multi-word compounds

Summary and Implications

As seen in Table 4.7, eigenvector centrality in the keyword networks showed similar percentages: 7 specific purpose terms (35%) and 13 general purpose terms (65%). I counted all the linked keywords of the specific purpose terms and of the general purpose terms. As seen in Table 4.13, two keywords listed as general purpose terms appear in two different groups.

As seen in Table 4.16, four keywords listed as general purpose terms appear in four different groups. For example, collocation network structures can be used to identify general purpose terms using eigenvector centrality. On the other hand, the cohesion (community) structures produced by eigenvector and betweenness analysis in keyword networks separate the group of specific purpose terms from the group of general purpose terms.

Language Network Structure Analysis

Frameworks of Network Analysis

  • Source Nodes and Target Nodes
  • Two Mode Structures and One Mode Structures
  • Centrality and Cohesion Algorithms

To create both keyword networks and collocation networks, it is necessary to determine source nodes and target nodes for the two networks. In this study, a source node refers to a keyword in both keyword networks and collocation networks. As seen in Table 4.1, the number of word types in the word list is counted.

Using Kamada-Kawai's spring algorithm, the two-mode visualization of collocation networks is shown in Figure 4.2. As a next step in creating one-mode networks, I use a transform menu to convert the two-mode keyword networks and collocation networks into one-mode networks. For this purpose, I transform two-mode data into one-mode data using a proximity measure, such as cosine similarity, to find statistically significant relationships between keywords.
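The two-mode to one-mode transformation can be illustrated without any network software. In the sketch below each keyword (source node) is described by a row of counts over target nodes, and two keywords are linked in the one-mode network when the cosine similarity of their rows exceeds a threshold. The data layout, the threshold, the function names and the toy counts are assumptions, not the settings used in the thesis.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def to_one_mode(two_mode, threshold=0.0):
    """Turn {keyword: counts over target nodes} into weighted keyword-keyword links."""
    keywords = sorted(two_mode)
    edges = {}
    for i, first in enumerate(keywords):
        for second in keywords[i + 1:]:
            similarity = cosine(two_mode[first], two_mode[second])
            if similarity > threshold:
                edges[(first, second)] = similarity
    return edges


# Toy two-mode data: three keywords counted over three target nodes.
two_mode = {
    "ship": [2, 0, 1],
    "vessel": [4, 0, 2],
    "cargo": [0, 3, 0],
}
print(to_one_mode(two_mode))
```

Here ship and vessel have proportional rows and are linked with a similarity of about 1, while cargo shares no target node with either keyword and stays unconnected.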

Table 4.1 Percentage of single words and multi-word compounds in a study corpus

Comparison of Keyword Networks and Collocation Networks

  • Centrality Structures: Eigenvector and Betweenness
  • Cohesion Structures: Eigenvector and Betweenness

The results show that cohesion analysis is effective only for keyword networks, because it separates specific purpose terms from general purpose terms. This finding suggests that eigenvector community analysis can be used to find specific purpose and general purpose terms among maritime English vocabulary items. Based on these results, I hypothesize that specific purpose terms have positive term specialty values, while general purpose terms have negative term specialty values.

It is therefore likely that eigenvector community analysis is an effective tool for identifying specific purpose terms and general purpose terms in maritime English. It is interesting that amendment is classified into group 6, which represents specific purpose terms. This means that the difference is related more to specific purpose terms than to general purpose terms.
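For readers unfamiliar with the measure, eigenvector centrality can be approximated by power iteration, as in the sketch below. It covers the centrality score only, not the community detection step, and it is not the software used in the thesis. The edge list uses made-up placeholder nodes and weights. Adding each node's own score at every step (iterating on A + I instead of A) is a standard adjustment that keeps the iteration from oscillating on bipartite graphs.

```python
def eigenvector_centrality(edges, iterations=200):
    """Approximate eigenvector centrality for an undirected weighted graph.

    edges maps (node_a, node_b) pairs to weights. Scores are scaled so that
    the most central node has a score of 1.0.
    """
    nodes = sorted({node for pair in edges for node in pair})
    neighbours = {node: {} for node in nodes}
    for (a, b), weight in edges.items():
        neighbours[a][b] = weight
        neighbours[b][a] = weight

    score = {node: 1.0 for node in nodes}
    for _ in range(iterations):
        # Each node's new score is its own score plus its neighbours' weighted scores.
        updated = {
            node: score[node] + sum(w * score[other]
                                    for other, w in neighbours[node].items())
            for node in nodes
        }
        top = max(updated.values())
        score = {node: value / top for node, value in updated.items()}
    return score


# Hypothetical one-mode network: a triangle k1-k2-k3 plus k4 attached to k1.
edges = {("k1", "k2"): 1.0, ("k1", "k3"): 1.0, ("k2", "k3"): 1.0, ("k1", "k4"): 1.0}
for node, value in sorted(eigenvector_centrality(edges).items()):
    print(node, round(value, 3))
```

A node scores highly when it is connected to other high-scoring nodes, so k1 comes out on top and the peripheral node k4 at the bottom.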

Fig. 4.3 Centrality structure using eigenvector for keyword networks

Critical Evaluation

For cohesion structures, only eigenvector community analysis divides the terms into two large groups that distinguish specific purpose terms from general purpose terms. Most general purpose terms have negative values with magnitudes between 0.1 and 1, except for one word, packaging, which shows a positive value of 0.36. On the other hand, all specific purpose terms have positive values ranging from 0.14 to 1.

These results show that term specialty can also be used to explain why each group favors specific purpose terms or general purpose terms. On the other hand, a keyword listed as a specific purpose term also appears in another group. As shown in Table 4.17, the critical evaluation from the 40-keyword experiment, with 20 specific purpose key terms and 20 general purpose key terms, supports the result obtained from the previous experiment.

Table 4.14 Centrality and cohesion of top 20 general purpose terms versus 20 single word specific purpose terms

Summary and Implications

Therefore, I conclude that the specificity or generality of ESP texts can be identified by keyword network structures regardless of whether the terms are single words or multi-word compounds. Finally, I created a term specialty equation to explain why a particular community can be separated into two groups for specific or general purposes, showing that the degree of specificity and generality is a continuum rather than an absolute distinction. The results of this chapter have pedagogical implications for creating ESP vocabulary lists.

Using corpus linguistics methodology, teachers can compile lists of keywords to answer this question. In addition, the cohesion structures of keyword networks show that the group of specific purpose terms and the group of general purpose terms are well separated by eigenvector community analysis. ESP teachers therefore have an advantage in deciding which vocabulary to teach and how to classify ESP vocabulary through language network analysis.

Conclusion

Findings and Implications

First, the standardized type-token ratio (STTR) showed a wider variety of vocabulary items in the MEC tagged with multi-word compounds. Fourth, the MEC tagged with multi-word compounds yielded more keywords than the untagged MEC.


Figures

Fig. 2.1 Collocate clouds visualization
Fig. 2.6 Network examples of keywords per document section
Fig. 2.7 Collocational network from the verb to treat
Fig. 2.8 Character Network of “King Lear”
