Chapter 3. Maritime English Corpus
3.5 Critical Evaluation and Tagging for Multi-word Compounds
Although previous studies offer a number of algorithms for finding compounds, no method identifies multi-word compounds perfectly in ESP vocabulary studies. One useful and practical approach, however, is to use reference dictionaries. I therefore selected two maritime English dictionaries and one supplementary source to tag multi-word compounds within the corpus:
“Dictionary of Maritime and Transportation Terms” (Monroe and Stewart, 2005) and “International Maritime Dictionary” (Kerchove, 1961). To cover more recently added terms, I used one supplementary source, Wikipedia’s “Glossary of Nautical Terms” (2015), https://en.wikipedia.org/wiki/Glossary_of_nautical_terms.
boat, and deck bulkhead. For this type of word formation, I changed these multi-word compounds to anchor deck, boat deck, and bulkhead deck, in accordance with actual language use in the maritime domain. Second, to build a word list from Wikipedia’s “Glossary of Nautical Terms”, I copied the data from the Web and pasted it into a free text editor, Notepad++, for further cleaning. To speed up the cleaning process, I removed unnecessary lines and symbols using regular expressions. In this way, I obtained a list of all entry items from the source.
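The regular-expression cleaning described above could be sketched in Python as follows. This is only an illustration: the function name clean_glossary_lines, the assumed "term: definition" line layout, and the two patterns are hypothetical, not the rules actually applied in Notepad++.

```python
import re

def clean_glossary_lines(raw_text):
    """Return candidate entry terms from pasted glossary text.

    Assumption (hypothetical layout): each useful line looks like
    "term: definition"; all other lines are treated as noise.
    """
    terms = []
    for line in raw_text.splitlines():
        # Drop bracketed residue such as [1] or [edit]
        line = re.sub(r"\[[^\]]*\]", "", line)
        # Collapse runs of whitespace and trim the ends
        line = re.sub(r"\s+", " ", line).strip()
        if ":" not in line:
            continue  # blank lines and lines without a headword
        term = line.split(":", 1)[0].strip().lower()
        if term:
            terms.append(term)
    return terms
```

With a different page layout, only the headword-splitting rule would need to change; the residue and whitespace patterns are independent of it.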
The basic statistics for the lists of entry items from these three sources are as follows. The “Dictionary of Maritime and Transportation Terms” has 6,265 terms, the “International Maritime Dictionary” has 11,044 terms, and
the “Glossary of Nautical Terms” has 1,297 terms. The total number of tokens of entry items is 18,636. After removing overlapping entry items, the total number of types of entry items is 16,729. The combined list of entry items from the three sources contains multi-word compounds ranging from 2-grams to 9-grams, as shown in Table 3.7.
Table 3.7 Percentage of multi-word compounds in a reference multi-word compound list
N-grams       Types    %      Examples of Multi-word Compounds
Single words  5,697    34.05  bow, starboard
2-grams       8,770    52.42  harbor master
3-grams       1,654    9.89   bow chock plate
4-grams       475      2.84   anchor by the stern
5-grams       86       0.51   dry bulk self unloader ship
6-grams       31       0.19   global maritime distress and safety system
7-grams       11       0.07   left-hand draft in this set of marks
8-grams       3        0.02   between the devil and the deep blue sea
9-grams       2        0.01   International Convention for the Safety of Life at Sea
Total         16,729   100
As seen in Table 3.7, it is striking that 2-grams are the dominant type, with 8,770 compounds (52.42%). This indicates that multi-word compounds should be taken into account when extracting word lists from a corpus.
In particular, an ESP vocabulary list should include multi-word compounds in order to count the proportions of specific-purpose vocabulary items correctly. The second most frequent type is single words, with 5,697 items (34.05%).
The third is 3-grams, with 1,654 items (9.89%), considerably fewer than the 2-grams. The remaining types are 475 4-grams (2.84%), 86 5-grams (0.51%), 31 6-grams (0.19%),
11 7-grams (0.07%), 3 8-grams (0.02%), and 2 9-grams (0.01%). Overall, 11,032 entry items (65.95%) are multi-word compounds, which is a notable finding. These statistics show how important multi-word compound tagging is in ESP studies.
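A distribution of this kind can be computed by counting the whitespace-separated words in each entry type. The sketch below is a minimal illustration under that assumption; the function name ngram_distribution is hypothetical, and hyphenated forms such as left-hand are counted as one word, which matches how the examples in Table 3.7 are classified.

```python
from collections import Counter

def ngram_distribution(entries):
    """Map n-gram length to (number of types, percentage of all types)."""
    # Normalise case and spacing, then deduplicate to obtain entry types
    types = {" ".join(e.lower().split()) for e in entries if e.strip()}
    counts = Counter(len(t.split()) for t in types)
    total = len(types)
    return {n: (c, round(100 * c / total, 2)) for n, c in sorted(counts.items())}
```

For example, ngram_distribution(["bow", "harbor master", "Harbor Master"]) treats the two spellings of harbor master as a single type.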
Table 3.8 displays the first 20 entry items and the last entry item from each of the three sources used for the reference multi-word compound list.
Table 3.8 Top 20 and last terms out of all entries on a reference multi-word compound list
No      Dictionary of Maritime      No       International           No      Wikipedia’s Glossary of
        and Transportation Terms             Maritime Dictionary             Nautical Terms
1       a1                          1        aak                     1       abaft
2       aa                          2        aalboot                 2       abaft the beam
3       ab                          3        aback                   3       abandon ship
4       abaft the beam              4        abaft                   4       abeam
5       abandon                     5        abaft the beam          5       abel brown
6       abandoned goods             6        abandonment             6       able seaman
7       abandonment                 7        abandonment clause      7       aboard
8       abatement                   8        abeam                   8       above board
9       abc analysis                9        able seaman             9       above-water hull
10      abel tester                 10       aboard                  10      absentee pennant
11      ablation                    11       above deck girder       11      absolute bearing
12      able-bodied seaman          12       abox                    12      accommodation ladder
13      able-bodled seaman          13       abreast                 13      accommodation ship
14      about ship                  14       absence flag            14      accommodation hulk
15      aboveboard                  15       absolute contraband     15      act of pardon
16      abovedeck                   16       aburton                 16      act of grace
17      abrasion                    17       acceleration            17      action stations
18      abreast                     18       accident boat           18      admiral
19      absent flag                 19       accommodation           19      admiralty
20      absolute accuracy           20       accommodation berth     20      admiralty law
...     ...                         ...      ...                     ...     ...
6,265   zone price                  11,044   zee bar                 1,297   yawl boat
The above table is worth considering for compound tagging. A large portion of the dictionary entries consists of multi-word expressions. The maritime dictionaries include many entries made up of several words separated by spaces, exemplified in the three sources by able-bodied seaman, abaft the beam, and abandon ship, respectively. It is therefore not possible to extract these multi-word compound entries from other corpora, because none of the available corpus analysis tools provides a list of compound words. They can extract only single words or words joined by hyphens.
In order to prepare for multi-word compound tagging, I combined all the entry items from the three sources into a reference maritime English dictionary. I then divided it into two word lists: a list of single words, which I call the reference single word list, and a list of multi-word compounds, which I call the reference multi-word compound list. All three word lists are shown in Table 3.9.
Table 3.9 All the entry items listed from three different sources
No      All the Entry Items  No      Single Words   No      Multi-word Compounds
1       a1                   1       a1             1       abaft the beam
2       aa                   2       aa             2       abandon ship
3       aak                  3       aak            3       abandoned goods
4       aalboot              4       aalboot        4       abandonment
5       ab                   5       ab             5       abandonment clause
6       aback                6       aback          6       abc analysis
7       abaft                7       abaft          7       abel brown
8       abaft the beam       8       abandon        8       abel tester
9       abandon              9       abandonment    9       able seaman
10      abandon ship         10      abatement      10      able-bodied seaman
11      abandoned goods      11      abeam          11      about ship
12      abandonment          12      ablation       12      above deck girder
13      abandonment clause   13      able seaman    13      above-water hull
14      abatement            14      aboard         14      absence flag
15      abc analysis         15      abovedeck      15      absent flag
16      abeam                16      abox           16      absentee pennant
17      abel brown           17      abrasion       17      absolute accuracy
18      abel tester          18      abreast        18      absolute bearing
19      ablation             19      absorption     19      absolute contraband
20      able seaman          20      aburton        20      absolute pressure
...     ...                  ...     ...            ...     ...
16,730  zulu                 8,583   zulu           8,147   zone price
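The merging and splitting step that produces the three lists could look roughly like the sketch below. It assumes that an entry counts as a multi-word compound whenever it contains a space after normalisation; the function name build_reference_lists and this criterion are illustrative assumptions, not a description of the scripts actually used.

```python
def build_reference_lists(*sources):
    """Merge several entry lists and split the result by word count.

    Returns (all entries, single words, multi-word compounds), each sorted.
    """
    # Lower-case, collapse spacing, and deduplicate across all sources
    merged = sorted(
        {" ".join(e.lower().split()) for src in sources for e in src if e.strip()}
    )
    single_words = [e for e in merged if " " not in e]
    compounds = [e for e in merged if " " in e]
    return merged, single_words, compounds
```

Under this criterion a hyphenated item without spaces, such as aboveboard or above-water, would fall into the single word list.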
For the extraction method, I wrote Python scripts to create a multi-word compound tagged corpus. First, I tokenized the study corpus to make a word list, lemmatized the word list using TreeTagger, and then identified all the word types of the lemmatized words. Second, I lemmatized the reference multi-word compound list. Third, I matched the two lemmatized word lists. Fourth, I found all the types of multi-word compounds. Finally, I compiled a new reference multi-word compound list that includes both the lemmatized words and all the word types derived from them. This new reference multi-word compound list was used to tag the MEC. The whole process is shown in Figure 3.6.
Fig. 3.6 Creation of multi-word compound tagged MEC
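One possible reading of the matching and expansion steps above is sketched below. It assumes that lemmatization has already been carried out (for instance with TreeTagger) and is available as (surface form, lemma) pairs; the function name expand_compounds and the data shapes are hypothetical. Each reference compound whose lemmas all occur in the corpus is expanded into every combination of attested surface forms.

```python
from collections import defaultdict
from itertools import product

def expand_compounds(tagged_tokens, lemmatized_compounds):
    """Expand lemmatized compounds into surface-form variants.

    tagged_tokens: iterable of (surface_form, lemma) pairs from the corpus.
    lemmatized_compounds: iterable of tuples of lemmas, one per compound.
    """
    forms_by_lemma = defaultdict(set)
    for surface, lemma in tagged_tokens:
        forms_by_lemma[lemma.lower()].add(surface.lower())

    expanded = set()
    for lemmas in lemmatized_compounds:
        # Keep only compounds whose every lemma is attested in the corpus
        if all(lemma in forms_by_lemma for lemma in lemmas):
            # Each slot may be filled by the lemma itself or any attested form
            options = [forms_by_lemma[lemma] | {lemma} for lemma in lemmas]
            for combo in product(*options):
                expanded.add(" ".join(combo))
    return expanded
```

Combining forms freely in this way can generate variants that never occur in running text; since they simply never match during tagging, the over-generation is harmless in this sketch.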
Fig. 3.7 Result of compound tagged specific vocabulary terms
Multi-word compounds are marked by joining their component words with the ‘_’ symbol. For example, significant_wave_height is tagged with two ‘_’ symbols. Other examples include en_route, at_sea, direct_route, oil_tankers, gross_tonnage, ballast_water, clean_ballast, and oil_record_book. The excerpt from maritime law texts in Figure 3.7 shows how this tagging system works.
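The underscore tagging itself can be illustrated with a greedy longest-match pass over the token sequence. This is a minimal sketch, not the original script: the function name tag_compounds is hypothetical, and the default window of nine words is an assumption taken from the longest entries in Table 3.7.

```python
def tag_compounds(tokens, compounds, max_len=9):
    """Join tokens that form a known compound with '_' (longest match first)."""
    compound_set = {tuple(c.lower().split()) for c in compounds}
    tagged = []
    i = 0
    while i < len(tokens):
        match_len = 0
        # Try the longest window first so that oil_record_book
        # is preferred over any shorter compound starting at the same word
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            window = tuple(t.lower() for t in tokens[i:i + n])
            if window in compound_set:
                match_len = n
                break
        if match_len:
            tagged.append("_".join(tokens[i:i + match_len]))
            i += match_len
        else:
            tagged.append(tokens[i])
            i += 1
    return tagged
```

Matching is case-insensitive, while the original casing of the tokens is preserved in the tagged output.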