Implementation Details
6.6 Development of User Interface
6.4. We restrict unauthorized access to the tagged corpus to minimize user conflicts, to prevent multiple users from verifying the tagging of the same parts of a text, and to reduce junk tagging of the tagged output data. Although we use a limited crowdsourcing technique, we are hopeful that the system will help us obtain a corrected annotated corpus in less time.
Figure 6.4: Snapshot of User Interface
Once the user has finished modifying the full text, the “Complete tagging” button is enabled, and by clicking it the user submits the modified text. After completing the task, the user is not allowed to log into the page again. All user-modified word/tag combinations, along with the time taken to update the tags, are stored in a temporary database.
There is an administrator/language expert page where an expert can view and modify the user-modified word-tag pairs. After the expert intervention, words with modified tags are stored in the main database tables for future use. We introduced this expert intervention because we assume that not all native users are language experts, and any incorrect word-tag combination in the main database can lead to erroneous tagging of words. The following snapshot shows the expert view of the user-modified word-tag pairs. The expert can click on any of the words for further modification (Fig 6.7 on Page 82).
Figure 6.5: Snapshot of multiple word tagging
Figure 6.6: Logout/Timeout screenshot
After logging in, the language expert sees the following page (Fig 6.8 on Page 82), where the expert can change the user-assigned tag and add optional extra information about the word (root word, prefix, suffix, etc.). This information supports better analysis and helps build a better database of root words. Note that intervention by a language expert at this stage is not necessary for the tagging process. If a language expert reviews the temporary table, he or she can increase the size of the knowledge base by copying the tagged words to the submitted table.
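As an illustration only, the copy from the temporary table to the submitted table could look like the sketch below. The table layouts, the column names, the function `promote` and the placeholder values are assumptions made for this example, and SQLite is used purely for convenience. None of these details are taken from the actual system.

```python
import sqlite3

# Hypothetical schema: the real table layouts are not given in the text.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE temporary (word TEXT, user_tag TEXT, seconds REAL)")
conn.execute("CREATE TABLE submitted (word TEXT PRIMARY KEY, tag TEXT, root TEXT)")
conn.execute("INSERT INTO temporary VALUES ('w1', 'NN', 4.2)")  # placeholder row


def promote(conn, word, expert_tag=None, root=None):
    """Copy one reviewed word-tag pair from temporary to submitted.

    The expert may override the user-assigned tag and optionally add
    the root word. Returns False if the word is not in temporary.
    """
    row = conn.execute(
        "SELECT user_tag FROM temporary WHERE word = ?", (word,)
    ).fetchone()
    if row is None:
        return False
    tag = expert_tag if expert_tag is not None else row[0]
    conn.execute(
        "INSERT OR REPLACE INTO submitted (word, tag, root) VALUES (?, ?, ?)",
        (word, tag, root),
    )
    conn.commit()
    return True


promoted = promote(conn, "w1", expert_tag="NNP", root="w")
```

In this sketch the row stays in the temporary table after the copy, since the text describes copying rather than moving the tagged words.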
Figure 6.7: Screenshot of Expert view
Figure 6.8: Expert Page for editing
Experimental Results
In the previous chapter, we discussed the procedure of automatic tagging, the native users’ verification of the tagger output and the database tables used in the process. The users play a major role in identifying the compound proper nouns, compound common nouns and proper nouns in submitted texts, and in reducing the conflict among these tags. The time native speakers need for verification is a major part of the developed POS tagging process, and efficiency drops as that time grows. Considering all these factors, we have developed a user-friendly, easy-to-use User Interface (UI) for native users’ verification, to expedite the verification and modification of the tagged output of the semi-automated tagger after the second step.
We have populated the database tables with the information and data gathered from the analysis of language experts and from a priori knowledge of the language. The prefix table contains the prefixes of the Assamese language, which are limited in number. The suffix table contains an exhaustive list of suffixes, ranging from single-character suffixes to longer concatenated suffix forms, along with the best possible tags. Initially this table was filled with the most common available suffixes. More were then inserted after the experts' manual analysis of the 2500 corpus.
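A lookup against such a suffix table can be sketched as follows. The longest-match strategy, the function name `longest_suffix_match` and the synthetic entries are assumptions made for illustration. The text only states that the table ranges from single-character to longer concatenated suffixes, not how matches are chosen.

```python
def longest_suffix_match(word, suffix_table):
    """Return (stem, suffix, tag) for the longest listed suffix, or None.

    Candidates are tried from longest to shortest, and the stem must be
    non-empty, so a concatenated suffix wins over its shorter tail.
    """
    for start in range(1, len(word)):
        suffix = word[start:]
        if suffix in suffix_table:
            return word[:start], suffix, suffix_table[suffix]
    return None


# Synthetic placeholder entries, not real Assamese suffixes or tags.
suffix_table = {"x": "TAG_A", "yx": "TAG_B"}
result = longest_suffix_match("stemyx", suffix_table)
```

Here the concatenated form "yx" is preferred over the single-character suffix "x", which is one plausible reason for storing the longer forms explicitly.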
The root table contains root words of the language along with the best possible tag(s) for each word. We keep two tag fields for each word: the first contains the most common tag of the word and the second contains its second most likely tag. For all but a few words, the second tag field is blank. Root words with two tags therefore have a higher probability of ambiguous instances.
We have created the bigram table by inserting the bigram values of each pair of consecutive tags, $tag_i$ and $tag_{i+1}$, of the words $word_i$ and $word_{i+1}$ in the sentences of the tagged corpus C1. Since the corpus was tagged by linguistic experts, we presume that it contains no incorrect tag pairs.
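The construction of the bigram table can be sketched as below. This is a minimal illustration under stated assumptions: sentences are represented as lists of (word, tag) pairs, the function name `build_bigram_table` is hypothetical, and raw counts are stored because the text does not say whether the bigram values are counts or probabilities.

```python
from collections import Counter


def build_bigram_table(tagged_sentences):
    """Count every pair of consecutive tags within each sentence."""
    counts = Counter()
    for sentence in tagged_sentences:
        tags = [tag for _word, tag in sentence]
        # zip pairs each tag with its successor; pairs never cross
        # a sentence boundary because each sentence is handled alone.
        counts.update(zip(tags, tags[1:]))
    return counts


# Toy corpus with placeholder words and tags, not data from C1.
corpus = [
    [("w1", "NN"), ("w2", "VB")],
    [("w3", "NN"), ("w4", "JJ"), ("w5", "NN")],
]
bigrams = build_bigram_table(corpus)
```

A tag pair that never occurs in the corpus is simply absent from the table, which matches the presumption that only valid pairs are recorded.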
To reduce the time spent analysing words with the prefix, suffix and root tables, we have created another table, named the submitted table. This table contains the expert-verified word-tag pairs from the tagged corpus. Initially it contained only the word-tag pairs of the C1 corpus. In later phases, more word-tag pairs were inserted from the tagged corpus generated in the experiment, after verification by the experts.
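The time saving comes from consulting the submitted table before any morphological analysis. A minimal sketch, assuming a dictionary stands in for the submitted table and a hypothetical callback `analyze` stands in for the prefix/suffix/root analysis:

```python
def tag_word(word, submitted_table, analyze):
    """Return (tag, source), trying the submitted table before analysis."""
    if word in submitted_table:
        return submitted_table[word], "submitted"
    # Fall back to the slower analysis only for unseen words.
    return analyze(word), "analysis"


# Hypothetical stand-ins for the real table and analyser.
submitted_table = {"w1": "NN"}
calls = []


def analyze(word):
    calls.append(word)  # record which words needed full analysis
    return "UNK"


first = tag_word("w1", submitted_table, analyze)
second = tag_word("w2", submitted_table, analyze)
```

Only the unseen word reaches the analyser in this example, so the cost of analysis falls as the submitted table grows.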
The proper noun and common noun tables were initially empty. During the confusion matrix calculation we inserted a few entries into the proper noun table. More entries were inserted from the tagged corpus newly generated by the POS tagger, after the native users’ verification and validation by experts.
The output table was empty before the experiment. It contains all the new word-tag pairs found in neither the root table nor the submitted table, along with the tags modified by native users, so that the experts can verify those new and modified word-tag pairs.
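The selection of rows for the output table can be sketched as follows. The function name `collect_output_rows`, the row fields and the placeholder data are assumptions for illustration, and the actual schema of the output table may differ.

```python
def collect_output_rows(tagged_words, root_table, submitted_table, user_edits):
    """Keep pairs that are new or whose tag was changed by a native user."""
    rows = []
    for word, tag in tagged_words:
        is_new = word not in root_table and word not in submitted_table
        user_tag = user_edits.get(word)  # None if the user left the tag as is
        if is_new or user_tag is not None:
            rows.append({"word": word, "tagger_tag": tag, "user_tag": user_tag})
    return rows


# Placeholder data, not taken from the experiment.
tagged_words = [("w1", "NN"), ("w2", "VB"), ("w3", "JJ")]
root_table = {"w1": "NN"}
submitted_table = {"w2": "VB"}
user_edits = {"w2": "NN"}
rows = collect_output_rows(tagged_words, root_table, submitted_table, user_edits)
```

Known and unmodified words are skipped, so the experts only review entries that can change the knowledge base.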