Chapter 3. Maritime English Corpus
3.2 Corpus Design
The experts who advised me work at several institutions, including the International Maritime Organization (IMO), World Maritime University (WMU), Korea Maritime and Ocean University (KMOU)16), Mokpo National Maritime University, the Korean Register of Shipping (KR), the Korea Maritime Institute (KMI), and the Korea Institute of Maritime and Fisheries Technology.
I decided to compile a four-million-word corpus, selecting equal numbers of words from four genres: academic writing, news, law, and textbooks. The size was determined by several practical considerations. One was the running time needed to process corpus data on personal computers in a data-driven learning (DDL) environment.
When teaching with DDL methods, corpus size affects both teaching and learning. If students extract keywords, keyword linked lists, and n-grams, obtaining the results takes a long time when the corpus contains more than four million words. In addition, a sub-corpus of the MEC can be compared with a sub-corpus of BNC Baby to identify the characteristics of an ESP genre. BNC Baby, which represents the full BNC, consists of academic writing, newspaper texts, imaginative writing, and spontaneous conversation, with about one million words in each genre. Moreover, a corpus of around four million words is appropriate for language network analysis because it yields a suitable number of keywords, linked keywords, and collocates for software to visualize the network and run network analysis algorithms. For these reasons, the MEC was designed to comprise four million words, with each of the four genres forming a one-million-word sub-corpus.
16) A small corpus of maritime English was compiled in a previous study (Hong and Jhang, 2010).
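The comparison between an MEC sub-corpus and a BNC Baby sub-corpus can be illustrated with a minimal keyword-extraction sketch. The text does not state which keyness statistic is used; the log-likelihood measure below is an assumption, chosen only because it is common in corpus keyword analysis. The function names and the simple regular-expression tokenizer are likewise hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return alphabetic word tokens."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def log_likelihood(a, b, c, d):
    """Log-likelihood keyness for a word that occurs a times in a study
    corpus of c tokens and b times in a reference corpus of d tokens."""
    expected_study = c * (a + b) / (c + d)
    expected_ref = d * (a + b) / (c + d)
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / expected_study)
    if b > 0:
        ll += b * math.log(b / expected_ref)
    return 2 * ll


def keywords(study_text, reference_text, top=10):
    """Return the top words that are relatively more frequent in the
    study text than in the reference text, ranked by log-likelihood."""
    study = Counter(tokenize(study_text))
    ref = Counter(tokenize(reference_text))
    c, d = sum(study.values()), sum(ref.values())
    if c == 0 or d == 0:
        return []
    scored = []
    for word, a in study.items():
        b = ref.get(word, 0)
        # Keep only words over-represented in the study corpus.
        if a / c > b / d:
            scored.append((word, log_likelihood(a, b, c, d)))
    # Highest score first; ties broken alphabetically for stable output.
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored[:top]
```

In use, `study_text` would hold one MEC sub-corpus (for example, the law genre) and `reference_text` the BNC Baby sub-corpus it is compared with. Because every token of both corpora is counted, the running time grows with corpus size, which is the practical concern raised above.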
To collect data for the academic genre, I used Springer’s database (http://www.springer.com), which provides numerous journals to scientific and professional communities, and Elsevier’s ScienceDirect (http://www.sciencedirect.com), operated by one of the largest publishers in the world. I selected the most relevant maritime academic journals: “Maritime Policy and Management”, “Journal for Maritime Research”, “Maritime Studies”, “Gyroscopy and Navigation”, “Aegean Review of the Law of the Sea and Maritime Law”, and “WMU Journal of Marine Affairs”. All articles in these journals were saved manually as PDF files. Table 3.1 lists the journal sources.
Table 3.1 List of academic journal sources
Text_ID  Titles                                                Sources
A01      Maritime Policy and Management                        http://www.tandfonline.com/toc/TMPM20/current#.Vb8g7O2wfIU
A02      Journal for Maritime Research                         http://www.tandfonline.com/toc/rmar20/current#.Vb8hFe2wfIU
A03      Maritime Studies                                      http://www.maritimestudiesjournal.com/
A04      Gyroscopy and Navigation                              http://www.springer.com/engineering/mechanical+engineering/journal/13140
A05      Aegean Review of the Law of the Sea and Maritime Law  http://www.springer.com/law/international/journal/12180
A06      WMU Journal of Marine Affairs                         http://www.wmu.se/publications/wmu-journal
The news genre consists of official institution texts and commercial news texts. The official institution sources are “IMO Press Briefings” and “World Maritime University News”. The commercial news comes from specialized maritime news websites that the experts regard as hub sites: “World Maritime News”, “The Maritime Executive”, “Marinelink”, and “Maritime Today News”. Since these websites contain numerous articles, I used a Wget crawler to collect them. After collection, an NLP program written in Python automatically extracted only the sentences from these texts. Table 3.2 shows a list of news website sources.
Table 3.2 List of news website sources
Text_ID  Websites                   Sources
N01      IMO Press Briefings        http://www.imo.org/MediaCentre/PressBriefings
N02      World Maritime News        http://worldmaritimenews.com/archives
N03      Marinelink                 http://www.marinelink.com/
N04      World Maritime University  http://www.wmu.se/news
N05      Maritime Today News        http://www.maritimetoday.com/
N06      The Maritime Executive     http://www.maritime-executive.com/offshore-news
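The sentence-extraction step applied to the crawled news pages can be sketched as follows. The actual program is not shown in this chapter, so this is only an illustration under stated assumptions: the pages are taken to be HTML files saved by the crawler, and the class name `BlockTextExtractor`, the function `extract_sentences`, the list of block-level tags, and the `min_words` threshold are all hypothetical. Only the Python standard library is used.

```python
import re
from html.parser import HTMLParser

# Assumed tag sets: content of SKIP_TAGS is ignored; BLOCK_TAGS end a text block.
SKIP_TAGS = {"script", "style"}
BLOCK_TAGS = {"p", "div", "li", "br", "td", "th", "tr", "title",
              "h1", "h2", "h3", "h4", "h5", "h6"}

# Split after ., ! or ? when followed by whitespace and a capital or quote.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+(?=[A-Z\"'])")


class BlockTextExtractor(HTMLParser):
    """Collect visible text, one string per block-level element."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._buffer = []
        self._skip_depth = 0

    def _flush(self):
        text = re.sub(r"\s+", " ", "".join(self._buffer)).strip()
        if text:
            self.blocks.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._buffer.append(data)

    def close(self):
        super().close()
        self._flush()


def extract_sentences(html, min_words=4):
    """Return text spans that look like full sentences: they start with a
    capital letter, end with ., ! or ?, and have at least min_words words."""
    parser = BlockTextExtractor()
    parser.feed(html)
    parser.close()
    sentences = []
    for block in parser.blocks:
        for candidate in SENTENCE_END.split(block):
            candidate = candidate.strip()
            if (len(candidate.split()) >= min_words
                    and candidate[0].isupper()
                    and candidate[-1] in ".!?"):
                sentences.append(candidate)
    return sentences
```

A filter of this kind drops menus, link labels, and headlines without final punctuation. The simple splitting rule will mis-split at abbreviations such as "U.S.", so a real pipeline would likely need a dedicated sentence tokenizer.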
The law genre is a collection of the regulations and codes recently released by the IMO. To collect these formal regulations and codes, I reached an agreement with KR that allows me to use the IMO official legal texts for academic purposes, so the IMO data could be included in the law genre with permission. The KR department in charge of the IMO official legal texts provided some of these data on CD in a UNIX format. Table 3.3 shows a list of maritime law sources.
Table 3.3 List of maritime law sources
Text_ID  Titles                                    Sources
L01      AFS 2001                                  http://www.krs.co.kr
L02      Bunker 2001                               http://www.krs.co.kr
L03      BWM Convention                            http://www.krs.co.kr
L04      COLREG 2014 Consolidated Edition          http://www.krs.co.kr
L05      FSS Code 2014 Consolidated Edition        http://www.krs.co.kr
L06      FTP Code 2014 Consolidated Edition        http://www.krs.co.kr
L07      IBC 2014                                  http://www.krs.co.kr
L08      IGC 2014                                  http://www.krs.co.kr
L09      III Code                                  http://www.krs.co.kr
L10      IMDG Code 2014                            http://www.krs.co.kr
L11      ISM Code                                  http://www.krs.co.kr
L12      ISMBC Code 2014 Consolidated Edition      http://www.krs.co.kr
L13      ISPS Code                                 http://www.krs.co.kr
L14      LSA Code                                  http://www.krs.co.kr
L15      MARPOL 2014 Consolidated Edition          http://www.krs.co.kr
L16      MLC 2014 Consolidated Edition             http://www.krs.co.kr
L17      Noise Code                                http://www.krs.co.kr
L18      RO Code 2014 Consolidated Edition         http://www.krs.co.kr
L19      Ship Recycling 2014 Consolidated Edition  http://www.krs.co.kr
L20      SOLAS 2014 Consolidated Edition           http://www.krs.co.kr
L21      STCW Convention & Codes                   http://www.krs.co.kr
L22      TONNAGE 1969                              http://www.krs.co.kr
Maritime-related textbooks were selected for the last genre. To cover a variety of fields, I chose books on economics, safety, transport, history, policy, and other topics. Thirty textbooks were collected, all in PDF format. These PDF files were later converted into txt files, which were then filtered and extracted by an NLP process. Table 3.4 shows a list of book sources.
Table 3.4 List of textbook sources
Text_ID    Titles    Sources
T01    A Global Union for Global Workers: Collective Bargaining and Regulatory Politics in Maritime Shipping    Routledge
T02    Admiral Lord Keith and the Naval War Against Napoleon    University Press of Florida
T03    Maritime Communities and Vegetation of Open Habitats    Cambridge University Press
T04    Command of the Sea    Charles Scribner’s Sons
T05    International Maritime Transport    Routledge
T06    Island Disputes and Maritime Regime    Springer
T07    Jurisdiction and Arbitration Clause    Springer Berlin Heidelberg
T08    Maritime Delimitation    Martinus Nijhoff
T09    Maritime Economics    Routledge
T10    Maritime Fiction Sailors and the Sea    Palgrave Macmillan
T11    Maritime Law and Policy in China    Routledge-Cavendish
T12    Maritime Safety Law    Springer
T13    Maritime Security in the South China Sea    Ashgate
T14    Maritime Security    Routledge
T15    Maritime Transportation Safety    Routledge
T16    Maritime Work Law Fundamentals: Responsible Shipowners, Reliable Seafarers    Springer
T17    Oceans Governance    Allen & Unwin
T18    Places of Refuge for Ship    Martinus Nijhoff Publishers Boston
T19    Random Seas and Design of Maritime    World Scientific Publishing Company
T20    Review of Maritime Transport 2006    United Nations
T21    Roots of Strategy Book 4    Stackpole Books
T22    Security for Airport and Aerospace, Maritime and Port, and High-Threat Targets in Belgium    ICON Group International
T23    State Responsibility for Interferences with the Freedom of Navigation in Public International Law    Springer
T24    Sustainable Maritime transportation and Exploitation of Sea Resources Proceedings of the 14th International Congress    CRC Press
T25    The Carriage of Dangerous Goods by Sea    Springer Berlin Heidelberg
T26    The Evolving Maritime Balance of Power in the Asia-Pacific    World Scientific Pub Co. Inc.
T27    The Maritime Dimension of International Security    RAND Corporation
T28    The Maritime Engineering Reference book    Butterworth-Heinemann
T29    The Unforgiving Coast Maritime    Oregon State University Press
T30    Towards Principled Oceans Governance    Routledge
The collected data total far more than 400 million words. The following chapters describe how the texts were collected and how each sub-corpus was compiled using NLP.
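Since more text was collected than the design requires, each genre has to be reduced to a one-million-word sub-corpus. The procedure actually used is described in the following chapters; the sketch below is only one possible approach under stated assumptions: whole texts are taken in a seeded random order, words are counted by whitespace splitting, and the last text is truncated to hit the target exactly. The function name `build_subcorpus` and its parameters are hypothetical.

```python
import random


def build_subcorpus(texts, target_tokens=1_000_000, seed=0):
    """Select texts until target_tokens words are reached.

    texts maps a text ID (e.g. "N01") to its raw text. Returns a dict
    mapping each selected ID to the list of word tokens kept from it.
    """
    rng = random.Random(seed)  # fixed seed keeps the selection reproducible
    ids = sorted(texts)
    rng.shuffle(ids)

    selected = {}
    remaining = target_tokens
    for text_id in ids:
        if remaining <= 0:
            break
        tokens = texts[text_id].split()  # assumption: whitespace tokenization
        kept = tokens[:remaining]        # truncate the final text if needed
        if kept:
            selected[text_id] = kept
            remaining -= len(kept)
    return selected
```

Running the same function once per genre with the same `target_tokens` gives four sub-corpora of equal size, provided each genre has at least that many words available.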