The MEC contains tagged multi-word compounds, which can be regarded as specific purpose terms in Maritime English. First, how can we build a corpus of Maritime English that represents specific purpose concepts such as multi-word compounds?
Introduction
Outline of the Thesis
I also review the basic concepts of language network construction, previous studies, definitions, and the types of language network construction, in order to give explanatory power to the corpus data in Chapter 4. The thesis proposes language network analysis to add further explanatory power when comparing the corpus descriptions by keyword.
Literature Review
Maritime English as English for Specific Purposes
The IMO has shown interest in Maritime English training for seafarers because it is one of the most important concerns for the international maritime community and the maritime industry. Maritime English terms are difficult to understand because almost all of the domain-specific terms are used only in the maritime industry.
Keywords in Text
- Strategies for a Reference Corpus
- Statistical Measures for Keyword Analysis
- Problems of Previous Keyword Analysis
Several statistical debates about keyword analysis argue for the log-likelihood test over the chi-square test. The log-likelihood calculation uses an expected value, the frequency of the word in corpus one, and the frequency of the word in corpus two.
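The calculation can be sketched as follows. This is a minimal sketch that assumes the log-likelihood formulation commonly used for corpus comparison, in which each expected value is the combined word frequency scaled by the relative corpus size. The function and parameter names are illustrative and are not taken from the thesis.

```python
import math


def log_likelihood(freq1, freq2, size1, size2):
    """Log-likelihood keyness of one word across two corpora.

    freq1, freq2: frequency of the word in corpus one and corpus two.
    size1, size2: total number of tokens in corpus one and corpus two.
    """
    total_freq = freq1 + freq2
    # Expected frequencies if the word were spread evenly over both corpora.
    expected1 = size1 * total_freq / (size1 + size2)
    expected2 = size2 * total_freq / (size1 + size2)
    ll = 0.0
    for observed, expected in ((freq1, expected1), (freq2, expected2)):
        if observed > 0:  # a zero count contributes nothing to the sum
            ll += observed * math.log(observed / expected)
    return 2 * ll
```

A word with the same relative frequency in both corpora scores 0, and the score grows as the observed frequencies move away from the expected ones.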
Collocations in Text
- Types of Collocations
- Statistical Measures for Window Collocations
- Problems of Previous Collocation Analysis
Because collocation analysis depends on several important variables, such as window span, frequency, and statistical measure, no single tool can evaluate all of them. Therefore, it is important that individual researchers test all statistical measures to find the most appropriate methods, both in general and specialized corpora.
Visualization in Corpus Linguistics
- Text Visualizations
- Collocation Networks
They adopted a corpus-oriented approach to the new dictionary using collocation networks, as shown in Figure 2.7(b). He advocated some of the advantages of visualization in collocation studies and presented directed collocation networks based on Delta P, as shown in Figure 2.11 (Brezina et al.). The software offers 12 different statistical measures for identifying collocations, as well as window span options.
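Delta P is directional, which is what makes directed collocation networks possible. The sketch below assumes the usual definition from a 2x2 co-occurrence table: the probability of one word given the other, minus its probability in the absence of the other. The function name and the cell names are illustrative assumptions, and the counts in the usage note are invented for the example.

```python
def delta_p(both, only_w1, only_w2, neither):
    """Return (Delta P of w2 given w1, Delta P of w1 given w2).

    both:     contexts containing w1 and w2
    only_w1:  contexts containing w1 but not w2
    only_w2:  contexts containing w2 but not w1
    neither:  contexts containing neither word
    Assumes every row and column total of the table is non-zero.
    """
    forward = both / (both + only_w1) - only_w2 / (only_w2 + neither)
    backward = both / (both + only_w2) - only_w1 / (only_w1 + neither)
    return forward, backward
```

With the invented counts `delta_p(10, 90, 0, 900)`, the first value is small and the second is large: w2 strongly predicts w1, but w1 only weakly predicts w2, so the two directions of the edge carry different weights.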
Language Networks
- Basic Concepts
- Previous Studies
- Definitions
- Types of Language Network Constructions
Social network analysis studies the elements that make up human societies, such as individuals and organizations. Since corpus data follow a power-law or Zipfian distribution, my data can be subjected to language network analysis.
Maritime English Corpus
Corpus Design
To collect data for the academic genre, I used Springer's database (http://www.springer.com), which offers many journals to the scientific and professional communities, and Science Direct (http://www.sciencedirect.com) from Elsevier, which is one of the largest publishers in the world. The journals include “Maritime Studies”, “Gyroscopy and Navigation”, “Aegean Review of Maritime Law and Maritime Law” and “WMU Journal of Marine Affairs”.
Corpus Compilation
- Stratified Random Sampling
- Web Crawling and Cleansing
- Converting PDF to Texts
The second step is to determine how many words to extract by assigning the number of words to WANT_WORDS_CNT, as shown in Figure 3.1. The next step is to extract only sentences from the collected HTML documents using a Python program.
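The two steps above can be sketched with the standard library alone. This is a hypothetical reconstruction, not the program shown in Figure 3.1: only the name WANT_WORDS_CNT comes from the text, while the class, the function, the sample page, and the rule that a sentence must end in terminal punctuation are assumptions made for illustration.

```python
import re
from html.parser import HTMLParser

WANT_WORDS_CNT = 8  # illustrative value; the thesis sets its own word target

BLOCK_TAGS = {"p", "div", "li", "br", "tr", "title",
              "h1", "h2", "h3", "h4", "h5", "h6"}


class TextExtractor(HTMLParser):
    """Collect visible text, skipping script and style content."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")  # keep block boundaries apart

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def extract_sentences(html, want_words=WANT_WORDS_CNT):
    """Return sentences until at least want_words words are collected."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    sentences, total = [], 0
    for block in "".join(parser.parts).split("\n"):
        block = re.sub(r"\s+", " ", block).strip()
        for sentence in re.split(r"(?<=[.!?])\s+", block):
            if total >= want_words:
                return sentences
            # Headings and menu items lack terminal punctuation and are dropped.
            if sentence.endswith((".", "!", "?")):
                sentences.append(sentence)
                total += len(sentence.split())
    return sentences


SAMPLE_HTML = (
    "<html><head><style>p{color:red}</style></head><body>"
    "<h1>Ballast Water</h1>"
    "<p>The <b>ship</b> discharged ballast water. "
    "The coast guard inspected the vessel.</p>"
    "<script>var x = 1;</script>"
    "<p>Port state control followed.</p></body></html>"
)
```

Calling `extract_sentences(SAMPLE_HTML)` keeps the two sentences of the first paragraph and stops there, because the word target has been reached before the second paragraph is read.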
Multi-word Compounds
My thesis deals with the third type of multi-word compounds, such as coast guard and ballast water management. To find out recent research trends, I reviewed papers from the multi-word expression workshops of the 2013 NAACL HLT (Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies) and the 2014 EACL (Conference of the European Chapter of the Association for Computational Linguistics). The approaches fall into four main areas: statistical, linguistic, machine learning, and hybrid approaches.
Although these previous studies provide a number of algorithms for finding compounds, there are no perfect solutions for identifying multi-word compounds in ESP vocabulary studies. However, I found a useful and practical method that uses reference dictionaries to identify multi-word compounds. To cover the most recently added terms, I used a supplementary source from Wikipedia's "Glossary of Nautical Terms (2015)", https://en.wikipedia.org/wiki/Glossary_of_nautical_terms.
Critical Evaluation and Tagging for Multi-word Compounds
The list of all entries combined from the three sources contains a variety of multi-word compounds ranging from two-grams to nine-grams, as shown in Table 3.7. Therefore, the statistics show how important multi-word compound tagging is in ESP studies. To carry out the multi-word compound tagging, I combined all the entries from the three different sources to make a reference Maritime English dictionary.
The second is a list of multi-word compounds, which I call a reference list of multi-word compounds. For the extraction method, I wrote Python code to create a multi-word compound annotated corpus. Finally, I compiled a new reference list of multi-word compounds that includes the lemmatized words and all word forms derived from them.
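The tagging step can be illustrated with a short sketch. It assumes a greedy longest-match strategy over the reference list and joins the words of each matched compound with underscores. The function name and the sample data are hypothetical, and the thesis's own program may work differently.

```python
def tag_compounds(tokens, compounds):
    """Join reference-list compounds found in tokens into single tokens.

    tokens:    list of word tokens in running order
    compounds: iterable of space-separated compounds from the reference list
    """
    lookup = {tuple(c.lower().split()) for c in compounds}
    max_len = max((len(c) for c in lookup), default=1)
    tagged, i = [], 0
    while i < len(tokens):
        # Try the longest possible match first, down to two words.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            window = tuple(t.lower() for t in tokens[i:i + n])
            if window in lookup:
                tagged.append("_".join(tokens[i:i + n]))
                i += n
                break
        else:
            # No compound starts here, so keep the single word.
            tagged.append(tokens[i])
            i += 1
    return tagged


reference_list = ["coast guard", "ballast water", "ballast water management"]
sentence = "the coast guard enforces ballast water management rules".split()
tagged_sentence = tag_compounds(sentence, reference_list)
# tagged_sentence is
# ['the', 'coast_guard', 'enforces', 'ballast_water_management', 'rules']
```

Trying the longest match first ensures that ballast water management is tagged as one unit rather than as ballast water followed by management. The example also shows why tagging lowers the token count: eight tokens become five.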
Comparison of With and Without Compounds
- Comparison of Basic Statistics
- Comparison of Word Lists, N-gram Lists, and Keyword Lists
- Comparison of Visualizations
- Dispersion Plots
- GraphColl 1.0
First, the number of tokens for each genre is reduced when the corpora are tagged with multi-word compounds, because each tagged compound is counted as a single token. As can be seen from Tables 3.10 and 3.11, the difference in types is due to the newly added multi-word compounds. It is important to mention that the basic statistics of a corpus will be significantly different depending on the inclusion of multi-word compounds.
The multi-word compound tagged corpus has more distinct words than the untagged corpus. On the 4-grams, the tagged law corpus is also found to contain multi-word compounds such as ships ballast water and sediments, fit to continue_to_sea without, and gross tonnage calculated in accordance. As can be seen in Table 3.16, the hits column shows that the keywords without multi-word compound tagging occur more often than those with multi-word compound tagging.
Summary and Implications
As seen in Table 4.7, eigenvector centrality in the keyword networks showed similar percentages, with 7 specific purpose terms (35%) and 13 general purpose terms (65%). I counted all the linked keywords of the specific purpose terms and of the general purpose terms. As seen in Table 4.13, two keywords listed as general purpose terms appear in two different groups.
As seen in Table 4.16, four keywords listed as general purpose terms appear in four different groups. For example, collocation network structures can be used to identify general purpose terms using eigenvector centrality. On the other hand, the cohesion community structures created by eigenvector and betweenness in keyword networks distinguish a group of specific purpose terms from a group of general purpose terms.
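Eigenvector centrality scores a node highly when its neighbours are themselves well connected. The sketch below computes it by power iteration on an undirected, unweighted network. It is a generic illustration under those assumptions, not the procedure of the network software used in the thesis, and the edge list is invented.

```python
def eigenvector_centrality(edges, iterations=200):
    """Eigenvector centrality by power iteration, scaled so the maximum is 1."""
    nodes = sorted({n for edge in edges for n in edge})
    neighbours = {n: set() for n in nodes}
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    score = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Adding the node's own score (iterating with A + I) leaves the
        # eigenvectors unchanged and avoids oscillation on bipartite graphs.
        new = {n: score[n] + sum(score[m] for m in neighbours[n])
               for n in nodes}
        norm = max(new.values())
        score = {n: value / norm for n, value in new.items()}
    return score


# Invented keyword links, for illustration only.
example_edges = [("ship", "vessel"), ("ship", "cargo"),
                 ("ship", "port"), ("vessel", "cargo")]
centrality = eigenvector_centrality(example_edges)
```

In this toy network the node ship is linked to every other node and receives the highest score, while port, which is linked only to ship, receives the lowest.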
Language Network Structure Analysis
Frameworks of Network Analysis
- Source Nodes and Target Nodes
- Two Mode Structures and One Mode Structures
- Centrality and Cohesion Algorithms
To create both keyword networks and collocation networks, it is necessary to determine source nodes and target nodes for the two networks. In this study, a source node refers to a keyword in both keyword networks and collocation networks. As seen in Table 4.1, the number of word types on the word list is counted.
Using Kamada-Kawai's spring algorithm, the two-mode visualization of collocation networks is shown in Figure 4.2. As a next step, to create one-mode networks, I use a transform menu to transform the two-mode keyword networks and collocation networks into one-mode networks. For this purpose, I transform two-mode data into one-mode data using a proximity measure, such as cosine similarity, to find statistically significant relationships between keywords.
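The transformation from two-mode to one-mode data can be sketched as follows. The example assumes that each keyword (source node) is represented by a row of counts over the target nodes, and that the weight of a one-mode link between two keywords is the cosine similarity of their rows. The function names and the counts are illustrative assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def to_one_mode(two_mode):
    """Map each keyword pair to the cosine similarity of their rows.

    two_mode: dict from keyword to its row of counts over the target nodes.
    """
    keywords = sorted(two_mode)
    return {
        (k1, k2): cosine(two_mode[k1], two_mode[k2])
        for i, k1 in enumerate(keywords)
        for k2 in keywords[i + 1:]
    }


# Invented counts: three keywords over three target nodes.
two_mode_counts = {
    "ballast": [3, 0, 1],
    "cargo": [3, 0, 1],
    "port": [0, 2, 0],
}
one_mode_links = to_one_mode(two_mode_counts)
```

Keywords with identical profiles (ballast and cargo here) receive a similarity close to 1, and keywords that share no target nodes receive 0. A threshold on these values can then decide which one-mode links to keep.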
Comparison of Keyword Networks and Collocation Networks
- Centrality Structures: Eigenvector and Betweenness
- Cohesion Structures: Eigenvector and Betweenness
The results show that cohesion analysis is effective only for keyword networks, because it separates specific purpose terms from general purpose terms. This finding suggests that the community eigenvector can be used to find specific purpose and general purpose terms among Maritime English vocabulary items. Based on these results, I can hypothesize that specific purpose terms have positive term specialty, while general purpose terms have negative term specialty.
It is therefore likely that eigenvector community analysis is an effective tool for identifying specific purpose terms and general purpose terms in Maritime English. The specialty of the term indicates that ... It is interesting that amendment is classified into group 6, which represents specific purpose terms. This means that the difference is related more to specific purpose terms than to general purpose terms.
Critical Evaluation
For cohesion structures, only the community eigenvector divides the network into two large groups that distinguish specific purpose terms from general purpose terms. Most general purpose terms have negative values with magnitudes between 0.1 and 1, except for one word, packaging, which shows a positive value of 0.36. On the other hand, all specific purpose terms have positive values ranging from 0.14 to 1.
These results indicate that term specialty can also be used to explain why each group prefers specific purpose terms or general purpose terms. On the other hand, a keyword listed as a specific purpose term also appears in another group. As shown in Table 4.17, the critical evaluation from the 40-keyword experiment, with 20 specific purpose key terms and 20 general purpose key terms, supports the result obtained from the previous experiment.
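A split into one group with positive values and one with negative values is what a leading-eigenvector community method produces. The sketch below assumes that the community eigenvector corresponds to the leading eigenvector of the network's modularity matrix, which may differ from the exact routine used in the thesis. The function name and the example network are hypothetical.

```python
def leading_eigenvector_split(edges, iterations=500):
    """Return each node's entry in the leading modularity eigenvector.

    Nodes whose entries share a sign belong to the same community.
    """
    nodes = sorted({n for edge in edges for n in edge})
    index = {n: i for i, n in enumerate(nodes)}
    size = len(nodes)
    adj = [[0.0] * size for _ in range(size)]
    for a, b in edges:
        adj[index[a]][index[b]] = adj[index[b]][index[a]] = 1.0
    degree = [sum(row) for row in adj]
    two_m = sum(degree)
    # Modularity matrix: observed links minus links expected from degrees.
    mod = [[adj[i][j] - degree[i] * degree[j] / two_m for j in range(size)]
           for i in range(size)]
    # Shift by the largest absolute row sum so that power iteration finds
    # the most positive eigenvalue, not the largest one in magnitude.
    shift = max(sum(abs(x) for x in row) for row in mod)
    vec = [float(i + 1) for i in range(size)]  # deliberately non-uniform start
    for _ in range(iterations):
        new = [shift * vec[i] + sum(mod[i][j] * vec[j] for j in range(size))
               for i in range(size)]
        norm = max(abs(x) for x in new)
        vec = [x / norm for x in new]
    return {n: vec[index[n]] for n in nodes}


# Two tightly linked triangles joined by a single bridge (invented network).
toy_edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
values = leading_eigenvector_split(toy_edges)
same_side_as_zero = {n for n, v in values.items()
                     if (v > 0) == (values[0] > 0)}
```

In the toy network the entries for nodes 0, 1 and 2 share one sign and those for nodes 3, 4 and 5 share the other, so the sign alone recovers the two groups. The overall sign of an eigenvector is arbitrary, so only agreement in sign between nodes is meaningful in this sketch.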
Summary and Implications
Therefore, I conclude that the specificity or generality of ESP texts can be identified by keyword network structures, regardless of whether the terms are single words or multi-word compounds. Finally, I created a term specialty equation to explain why a particular community can be separated into two groups for specific or general purposes, showing that the degree of specificity and generality is a continuum rather than an absolute distinction. The results of this chapter can be used to provide pedagogical implications for creating ESP vocabulary lists.
Using corpus linguistics methodology, teachers can compile lists of keywords to answer this question. In addition, the cohesion structures of keyword networks tell us that the group of specific purpose terms and the group of general purpose terms are well separated by the community eigenvector. Therefore, ESP teachers have an advantage in deciding which vocabulary to teach and how to classify vocabulary in ESP through language network analysis.
Conclusion
Findings and Implications
First, STTR showed a wider variety of vocabulary items in the multi-word compound tagged MEC. Fourth, there were more keywords in the multi-word compound tagged MEC than in the untagged MEC.
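STTR (standardized type-token ratio) can be sketched as the mean type-token ratio over consecutive chunks of equal length, expressed as a percentage. The chunk size of 1,000 tokens is a common default and is an assumption here, as are the function name and the sample tokens.

```python
def sttr(tokens, chunk_size=1000):
    """Mean type-token ratio (%) over consecutive full chunks of tokens."""
    ratios = []
    # A trailing chunk shorter than chunk_size is ignored.
    for start in range(0, len(tokens) - chunk_size + 1, chunk_size):
        chunk = tokens[start:start + chunk_size]
        ratios.append(len(set(chunk)) / chunk_size)
    if not ratios:
        raise ValueError("need at least one full chunk of tokens")
    return 100 * sum(ratios) / len(ratios)


sample_tokens = "ballast water ballast water coast guard coast guard".split()
sample_sttr = sttr(sample_tokens, chunk_size=4)  # 50.0 for this toy sample
```

Because the ratio is averaged over chunks of the same length, corpora of different sizes can be compared, which is what allows the tagged and untagged versions of the MEC to be set side by side.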