Network Analysis of Maritime English Corpus with Multi-word Compounds:

The Maritime English Corpus (MEC) contains tagged multi-word compounds, which can be regarded as specific purpose terms in maritime English. The first question is how to build a corpus of Maritime English that represents specific purpose concepts such as multi-word compounds.

Introduction

Outline of the Thesis

I also review the basic concepts, previous studies, definitions, and types of language network construction in order to give greater explanatory power to the corpus data in Chapter 4. Language network analysis is proposed in order to provide further explanatory power when comparing corpus descriptions by keyword.

Literature Review

Maritime English as English for Specific Purposes

The IMO has shown interest in maritime English training for seafarers because it is one of the most important concerns of the international maritime community and the maritime industry. Maritime English terms are difficult to understand because almost all of the domain-specific terms are used only in the maritime industry.

Keywords in Text

Strategies for a Reference Corpus
Statistical Measures for Keyword Analysis
Problems of Previous Keyword Analysis

There is some statistical debate about keyword analysis, with arguments for preferring a log-likelihood test over a chi-square test. The log-likelihood calculation uses an expected value, the frequency of the word in corpus one, and the frequency of the word in corpus two.
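The calculation described above can be sketched as follows. This is a minimal illustration of the two-corpus log-likelihood statistic as it is commonly formulated in corpus linguistics, not the thesis's own code; the function and parameter names are assumptions, since the original symbols are not preserved in the text.

```python
import math


def log_likelihood(freq1, size1, freq2, size2):
    """Two-corpus log-likelihood keyness for a single word.

    freq1, freq2: observed frequency of the word in corpus one and two.
    size1, size2: total number of tokens in corpus one and two.
    (All names are illustrative assumptions.)
    """
    total = freq1 + freq2
    # Expected frequencies if the word were equally common in both corpora.
    expected1 = size1 * total / (size1 + size2)
    expected2 = size2 * total / (size1 + size2)
    ll = 0.0
    for observed, expected in ((freq1, expected1), (freq2, expected2)):
        if observed > 0:  # the term 0 * ln(0) is treated as 0
            ll += observed * math.log(observed / expected)
    return 2 * ll
```

A word with the same relative frequency in both corpora scores 0, and larger values indicate a stronger difference between the study corpus and the reference corpus.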

Table 2.1 Keyness in a genre of maritime law

Collocations in Text

Types of Collocations
Statistical Measures for Window Collocations
Problems of Previous Collocation Analysis

Because collocation results depend on several important variables, such as window span, frequency, and statistical measure, no tool can evaluate all of them. It is therefore important that individual researchers test the available statistical measures to find the most appropriate methods for both general and specialized corpora.

Visualization in Corpus Linguistics

Text Visualizations
Collocation Networks

They adopted a corpus-oriented approach to the new dictionary using collocation networks, as shown in Figure 2.7(b). He advocated some of the advantages of visualization in collocation studies and claimed to be dealing with directed collocation networks based on Delta P, as shown in Figure 2.11 (Brezina et al.). The software offers 12 different statistical measures to identify collocations, as well as window range options.
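To illustrate why Delta P yields directed networks, the sketch below computes it in both directions from a 2x2 contingency table. It is a minimal sketch under assumed cell names (a, b, c, d); how the counts are collected within a window, and how any particular tool implements the measure, is not shown here.

```python
def delta_p(a, b, c, d):
    """Delta P in both directions from a 2x2 contingency table.

    a: node and collocate co-occur
    b: node occurs without the collocate
    c: collocate occurs without the node
    d: neither occurs
    (Cell names are illustrative assumptions.)
    """
    # How much the node raises the probability of the collocate.
    collocate_given_node = a / (a + b) - c / (c + d)
    # How much the collocate raises the probability of the node.
    node_given_collocate = a / (a + c) - b / (b + d)
    return collocate_given_node, node_given_collocate
```

Because the two values usually differ, each edge in the network can be drawn with a direction, which a symmetric measure cannot provide.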

Language Networks

Basic Concepts
Previous Studies
Definitions
Types of Language Network Constructions

Social network analysis studies the elements that make up human societies, such as individuals and organizations. Since corpus data follow a power-law (Zipf's law) distribution, my data can be subjected to language network analysis.

Fig. 2.12 Four types of link of keywords

Maritime English Corpus

Corpus Design

To collect data for the academic genre, I used Springer's database (http://www.springer.com), which offers many journals to the scientific and professional communities, and Elsevier's Science Direct (http://www.sciencedirect.com), which is one of the largest publishers in the world. The journals include “Maritime Studies”, “Gyroscopy and Navigation”, “Aegean Review of Maritime Law and Maritime Law” and “WMU Journal of Marine Affairs”.

Corpus Compilation

Stratified Random Sampling
Web Crawling and Cleansing
Converting PDF to Texts

The second step is to determine how many words to extract by assigning the number of words to WANT_WORDS_CNT, as shown in Figure 3.1. The next step is to extract only sentences from the collected HTML documents using a Python program.
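The sampling step can be sketched roughly as follows. The code in Figure 3.1 is not reproduced in this text, so this is only an assumed reconstruction of the idea: WANT_WORDS_CNT is the name used above, while the function name, the seed parameter, and the whitespace word count are hypothetical.

```python
import random

WANT_WORDS_CNT = 20  # target number of words to extract (name from the text)


def sample_sentences(sentences, want_words, seed=0):
    """Randomly pick sentences until the target word count is reached.

    Hypothetical helper: words are counted by whitespace splitting, and the
    total can exceed the target by at most one sentence.
    """
    rng = random.Random(seed)
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    picked, count = [], 0
    for sentence in shuffled:
        if count >= want_words:
            break
        picked.append(sentence)
        count += len(sentence.split())
    return picked, count
```

For stratified sampling, a function like this could be called once per genre with that genre's own word target.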

Fig. 3.1 Python coding for random sampling

Multi-word Compounds

My thesis deals with the third type of multi-word compounds, such as coast guard and ballast water management. To find out recent research trends, I reviewed papers from the multi-word expression workshops of the 2013 NAACL-HLT (Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies) and the 2014 EACL (Conference of the European Chapter of the Association for Computational Linguistics). The approaches fall into four main areas: statistical, linguistic, machine learning, and combined approaches.

Although these previous studies provide a number of algorithms for finding compounds, there is no perfect solution for identifying multi-word compounds in ESP vocabulary studies. However, I found a useful and practical method that uses reference dictionaries to identify multi-word compounds. To cover the most recently added terms, I used Wikipedia's "Glossary of Nautical Terms" (2015), https://en.wikipedia.org/wiki/Glossary_of_nautical_terms, as a supplementary source.

Table 3.5 Types of general English compounds and maritime English compounds

Critical Evaluation and Tagging for Multi-word Compounds

The list of all entries combined from the three sources contains a variety of multi-word compounds ranging from two-grams to nine-grams, as shown in Table 3.7. These statistics show how important tagging multi-word compounds is in ESP studies. To tag multi-word compounds, I combined all the entries from three different sources to make a reference maritime English dictionary.

The second is a list of multi-word compounds, which I call a reference list of multi-word compounds. For the extraction method, I wrote Python code to create a corpus annotated with multi-word compounds. Finally, I compiled a new reference list of multi-word compounds that includes the lemmatized forms and all word forms derived from them.
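The annotation step can be illustrated with a greedy longest-match tagger. This is a minimal sketch, not the thesis's program: the function name is hypothetical, joining the words with an underscore is an assumption, and the nine-word limit follows the nine-gram maximum reported for the reference list.

```python
def tag_compounds(tokens, compounds, max_len=9):
    """Join the longest matching multi-word compound into a single token.

    tokens: list of word tokens from one sentence.
    compounds: iterable of reference compounds, e.g. "ballast water management".
    Matching is case-insensitive; matched words are joined with "_".
    """
    compound_set = {tuple(c.lower().split()) for c in compounds}
    tagged, i = [], 0
    while i < len(tokens):
        match_len = 0
        # Try the longest candidate first so that longer compounds win.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            if tuple(t.lower() for t in tokens[i:i + n]) in compound_set:
                match_len = n
                break
        if match_len:
            tagged.append("_".join(tokens[i:i + match_len]))
            i += match_len
        else:
            tagged.append(tokens[i])
            i += 1
    return tagged
```

For example, with the reference entries "coast guard", "ballast water", and "ballast water management", the sentence "the coast guard checked ballast water management plans" is reduced from eight tokens to five, with coast_guard and ballast_water_management each counted as one word.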

Table 3.7 Percentage of multi-word compounds in a reference multi-word compound list

Comparison of With and Without Compounds

Comparison of Basic Statistics
Comparison of Word Lists, N-gram Lists, and Keyword Lists
Comparison of Visualizations

Dispersion Plots
GraphColl 1.0

First, the number of tokens for each genre is reduced when the corpora are tagged with multi-word compounds. As can be seen from Tables 3.10 and 3.11, the difference in types is due to the newly added multi-word compounds. The basic statistics of a corpus therefore differ significantly depending on whether multi-word compounds are included.
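One basic statistic that is sensitive to this change is the standardized type-token ratio (STTR), which the findings report for the tagged corpus. The sketch below shows one common way to compute it, as the mean type-token ratio over consecutive equal-sized chunks; the function name and the chunk size of 1,000 tokens are assumptions rather than settings confirmed by the thesis.

```python
def sttr(tokens, chunk_size=1000):
    """Standardized type-token ratio in percent.

    The text is cut into consecutive chunks of chunk_size tokens, the
    type-token ratio is computed for each full chunk, and the ratios are
    averaged. A trailing partial chunk is ignored.
    """
    ratios = []
    for start in range(0, len(tokens) - chunk_size + 1, chunk_size):
        chunk = tokens[start:start + chunk_size]
        ratios.append(len(set(chunk)) / chunk_size * 100)
    if not ratios:
        raise ValueError("text is shorter than one chunk")
    return sum(ratios) / len(ratios)
```

Running the same function on the token list before and after compound tagging gives two directly comparable values.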

The corpus tagged with multi-word compounds has more words than the untagged corpus. In the 4-gram list, the tagged law corpus is also found to contain multi-word compounds such as ships ballast water and sediments, fit to continue_to_sea without, and gross tonnage calculated in accordance. As can be seen in Table 3.16 above, the hits column shows that the keywords without multi-word compound tagging appear more often than those with multi-word compound tagging.

Table 3.10 Statistical results before tagging multi-word compounds

Summary and Implications

As seen in Table 4.7, eigenvector centrality in the keyword networks showed similar percentages: 7 specific purpose terms (35%) and 13 general purpose terms (65%). I counted all the linked keywords of the specific purpose terms and of the general purpose terms. As seen in Table 4.13, two keywords listed as general purpose terms appear in two different groups.

As seen in Table 4.16, four keywords listed as general purpose terms appear in four different groups. For example, collocation network structures can be used to identify general purpose terms using eigenvector centrality. On the other hand, the cohesion community structures created by eigenvector and betweenness in keyword networks distinguish the group of specific purpose terms from the group of general purpose terms.

Language Network Structure Analysis

Frameworks of Network Analysis

Source Nodes and Target Nodes
Two Mode Structures and One Mode Structures
Centrality and Cohesion Algorithms

To create both keyword networks and collocation networks, it is necessary to determine the source nodes and target nodes for the two networks. In this study, a source node refers to a keyword in both keyword networks and collocation networks. As seen in Table 4.1, the number of word types in a word list is counted.

Using Kamada-Kawai's spring algorithm, the two-mode visualization of collocation networks is shown in Figure 4.2. As the next step, to create one-mode networks, I use a transform menu to convert the two-mode keyword networks and collocation networks into one-mode networks. For this purpose, I transform the two-mode data into one-mode data using a proximity measure such as cosine similarity to find statistically significant relationships between keywords.
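The two-mode to one-mode transformation can be sketched as follows. This is a minimal illustration, assuming that each source keyword is described by a dictionary of weighted target nodes; the function names, the data layout, and the threshold parameter are hypothetical, and the sketch stands in for, rather than reproduces, the transform menu of the software used.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def to_one_mode(two_mode, threshold=0.0):
    """Project {keyword: {target node: weight}} onto keyword-keyword edges.

    An edge is kept only when the cosine similarity of the two keywords'
    target profiles is above the threshold.
    """
    keywords = sorted(two_mode)
    edges = {}
    for i, first in enumerate(keywords):
        for second in keywords[i + 1:]:
            similarity = cosine(two_mode[first], two_mode[second])
            if similarity > threshold:
                edges[(first, second)] = similarity
    return edges
```

Two keywords that share the same targets in the same proportions receive a similarity of 1, while keywords with no shared targets receive no edge at all.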

Table 4.1 Percentage of single words and multi-word compounds in a study corpus

Comparison of Keyword Networks and Collocation Networks

Centrality Structures: Eigenvector and Betweenness
Cohesion Structures: Eigenvector and Betweenness

The results show that cohesion analysis is effective only for keyword networks, because there the special purpose terms and general purpose terms are separated into distinct groups. This finding suggests that the eigenvector community can be used to find special purpose and general purpose terms among maritime English vocabulary items. Based on these results, I can hypothesize that special purpose terms have positive term specialty values, while general purpose terms have negative term specialty values.

So, it is likely that eigenvector community analysis is an effective tool to identify specific purpose terms and general purpose terms in maritime English. It is interesting that the keyword amendment is classified into group 6, which represents specific purpose terms. This means that the difference is related more to specific purpose terms than to general purpose terms.
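For readers unfamiliar with eigenvector centrality, the measure used for the centrality structures, the sketch below computes it by power iteration on a small undirected network. It is an illustrative implementation only: the function name, the adjacency layout, and the scaling of the largest score to 1 are assumptions, and it is not the network software used in the thesis.

```python
def eigenvector_centrality(adjacency, iterations=100, tol=1e-9):
    """Eigenvector centrality by power iteration.

    adjacency: {node: {neighbor: weight}} for an undirected graph, with
    every neighbor also present as a key. Scores are scaled so that the
    most central node has a score of 1.
    """
    nodes = sorted(adjacency)
    score = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Adding the node's own score (A + I) keeps the iteration stable
        # on bipartite graphs without changing the eigenvector.
        new = {
            n: score[n] + sum(w * score[m] for m, w in adjacency[n].items())
            for n in nodes
        }
        norm = max(new.values()) or 1.0
        new = {n: value / norm for n, value in new.items()}
        if max(abs(new[n] - score[n]) for n in nodes) < tol:
            return new
        score = new
    return score
```

A keyword scores highly when it is linked to other highly scored keywords, which is why the measure can pick out terms that sit at the core of a keyword or collocation network.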

Fig. 4.3 Centrality structure using eigenvector for keyword networks

Critical Evaluation

For cohesion structures, only the eigenvector community divides the terms into two large groups that distinguish between special purpose and general purpose terms. Most general purpose terms have negative values with magnitudes between 0.1 and 1, except for one word, packaging, which shows a positive value of 0.36. On the other hand, all specific purpose terms have positive values ranging from 0.14 to 1.

These results indicate that term specialty can also be used to explain why each group prefers specific purpose terms or general purpose terms. On the other hand, a keyword listed as a specific purpose term also appears in another group. As shown in Table 4.17, the critical evaluation from the 40-keyword experiment, with 20 specific purpose key terms and 20 general purpose key terms, supports the result obtained from the previous experiment.

Table 4.14 Centrality and cohesion of top 20 general purpose terms versus 20 single word specific purpose terms

Summary and Implications

Therefore, I conclude that the specificity or generality of ESP texts can be identified by keyword network structures, regardless of whether the terms are single words or multi-word compounds. Finally, I created a term specialty equation to explain why a particular community can be separated into two groups for specific or general purposes, showing that specificity and generality form a continuum rather than an absolute distinction. The results of this chapter can be used to provide pedagogical implications for creating ESP vocabulary lists.

Using corpus linguistics methodology, teachers can compile lists of keywords to answer this question. In addition, the cohesion structures of keyword networks tell us that the group of special purpose terms and the group of general purpose terms are well separated using the eigenvector community. Therefore, ESP teachers have an advantage in deciding which vocabulary to teach and how to classify vocabulary in ESP through language network analysis.

Conclusion

Findings and Implications

First, STTR showed a wider variety of vocabulary items in the MEC tagged with multi-word compounds. Fourth, there were more keywords in the MEC tagged with multi-word compounds than in the untagged MEC.