
Tagging Technique Applied To Assamese


I would like to thank the faculty members and technical staff of the Department of Computer Science & Engineering, especially Prof.

The errors found by the expert are used to estimate the accuracy of the main process.

Introduction

  • Part of Speech Tagger
  • Different Approaches of POS tagging
    • Rule-based approach
    • Stochastic approach
  • The Problem & Our Approach
    • Dissertation Outline
  • A chapter describing the implementation details of our approach: the evaluation procedure, algorithms, tuning of the tagger, and the user interface design

That is, the probability of a token Wn given the preceding token Wn-1 is equal to the probability of their bigram, i.e. of the joint occurrence of the two tokens, divided by the probability of the preceding token:

    P(Wn | Wn-1) = P(Wn-1, Wn) / P(Wn-1)

Memory-based learning is one of the classification-based approaches to POS tagging.
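
A minimal sketch of this maximum-likelihood estimate from raw counts (the helper function and toy sentence are ours for illustration, not from the thesis):

    from collections import Counter

    def bigram_probability(tokens, prev_word, word):
        """P(word | prev_word) estimated as count(prev_word, word) / count(prev_word)."""
        unigrams = Counter(tokens)
        bigrams = Counter(zip(tokens, tokens[1:]))
        if unigrams[prev_word] == 0:
            return 0.0
        return bigrams[(prev_word, word)] / unigrams[prev_word]

    tokens = "the cat sat on the mat the cat ran".split()
    print(bigram_probability(tokens, "the", "cat"))   # 2/3 ≈ 0.667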

Figure 1.1: Different Approaches of POS Tagging

POS Tagging: Different Languages & Approaches

English Language

  • Issues with English Corpora

The dynamic programming methods were similar to the Viterbi algorithm already in use in other fields, which is why this algorithm came to be adopted for POS tagging. Although English has the largest corpora, no POS tagging method trained even on such large quantities of data achieves 100% accuracy.
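
For reference, a compact sketch of Viterbi decoding for an HMM tagger; the tag set and probability tables below are toy values for illustration only, not any model from the thesis:

    def viterbi(words, tags, start_p, trans_p, emit_p):
        """Return the most probable tag sequence for `words` under an HMM."""
        # V[i][t]: probability of the best path ending in tag t at position i
        V = [{t: start_p[t] * emit_p[t].get(words[0], 1e-6) for t in tags}]
        back = [{}]
        for i in range(1, len(words)):
            V.append({}); back.append({})
            for t in tags:
                best = max(tags, key=lambda p: V[i-1][p] * trans_p[p][t])
                V[i][t] = V[i-1][best] * trans_p[best][t] * emit_p[t].get(words[i], 1e-6)
                back[i][t] = best
        path = [max(tags, key=lambda t: V[-1][t])]    # best final tag
        for i in range(len(words) - 1, 0, -1):        # backtrace
            path.insert(0, back[i][path[0]])
        return path

    tags = ["NN", "VB"]
    start_p = {"NN": 0.7, "VB": 0.3}
    trans_p = {"NN": {"NN": 0.4, "VB": 0.6}, "VB": {"NN": 0.7, "VB": 0.3}}
    emit_p = {"NN": {"dog": 0.4}, "VB": {"runs": 0.5}}
    print(viterbi(["dog", "runs"], tags, start_p, trans_p, emit_p))  # ['NN', 'VB']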

Other Foreign Languages

They trained an HMM on a pre-tagged corpus using the Baum-Welch algorithm and then used the Viterbi algorithm to find the most likely tag sequences. They evaluated the methods using the Brown corpus and an Indonesian-language corpus ([26]).

POS Tagging and Indian Languages

  • Hindi Language
  • Bengali /Bangla Language
  • Other Indian Languages
  • Assamese Language
  • Crowd Sourcing and Semi-Automatic Techniques

In this approach, they achieved an accuracy of 82% thanks to the use of the root list ([60]). It has been observed that mostly supervised approaches have been tested across the different Indian languages.

Assamese Language

  • Description of Assamese
  • Phonological Features
  • Morphology and Grammar
  • Grammatical Features
  • Assamese Unicode based corpus
  • Tagset for Assamese

This matra (vowel sign) can appear above, below, to the right, or to the left of the consonant to which it is attached. Another striking phonological feature of the Assamese language is the extensive use of the velar nasal /ŋ/. The perfective or completive aspect indicates the completion of the action of the verb.

In Assamese, the negative ন /n/ is prefixed to the verb, followed by a vowel that is an exact copy of the vowel of the first syllable of the verb, as in নালাগে /nalage/, meaning 'not wanting' (1st, 2nd, 3rd person). Designing a tagset that covers the morphosyntactic details of the language is a very difficult task. Additionally, some tagsets (for example, the AU-KBC Tamil tagset) are language-specific and do not scale to other Indian languages.
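
As a toy illustration of this vowel-copy rule over romanized forms (real processing would operate on Assamese Unicode text; the helper is hypothetical):

    VOWELS = "aeiou"

    def negate(verb):
        """Prefix n- plus a copy of the first vowel of the verb (romanized)."""
        first_vowel = next(c for c in verb if c in VOWELS)
        return "n" + first_vowel + verb

    print(negate("lage"))   # 'nalage', i.e. 'not wanting'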

Table 3.1: List of the tagset used

Assessment of Probabilistic Approaches

  • Default Tagger
  • n-gram Tagger
  • Backoff Tagger
  • Brill Tagger
  • Evaluation of the Techniques
  • Analysis

Since a unigram tagger always considers only the current token, the context of the word plays no role in choosing the tag. As a consequence, there is a trade-off between the accuracy and the coverage of the results. An n-gram tagger's context is limited to the preceding tags, whereas a Brill tagger is context-sensitive and generates transformation rules based on the context found in the training set.

For example, in one case we took 80% of the corpus (four parts) as the training set and used the remaining 20% (one part) as the test set. From the results of the different taggers, we observed that performance increases with increasing training data for the Unigram, Backoff and Brill taggers. Sometimes the performance of the taggers dropped drastically on cross-domain corpora, because word-tag combinations in the test corpus had not been learned from the training corpus.
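
For the NLTK-based experiments (see Figure 4.1), a training run of this kind might be set up as follows; `tagged_sents` stands for the pre-tagged Assamese sentences, and the default tag, template set and rule limit are illustrative choices, not the thesis's exact settings:

    from nltk.tag import DefaultTagger, UnigramTagger, BigramTagger
    from nltk.tag.brill import fntbl37
    from nltk.tag.brill_trainer import BrillTaggerTrainer

    # tagged_sents: list of sentences, each a list of (word, tag) pairs
    split = int(0.8 * len(tagged_sents))                      # four parts training,
    train, test = tagged_sents[:split], tagged_sents[split:]  # one part test

    t0 = DefaultTagger('NN')                  # backoff of last resort
    t1 = UnigramTagger(train, backoff=t0)     # per-word most likely tag
    t2 = BigramTagger(train, backoff=t1)      # previous-tag context

    # Brill tagger: transformation rules learned on top of the n-gram chain
    brill = BrillTaggerTrainer(t2, fntbl37()).train(train, max_rules=100)
    print(brill.accuracy(test))               # .evaluate(test) on older NLTK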

Figure 4.1: Snapshot of Brill tagger using NLTK

Knowledge Base Creation

  • Our Approach
  • Initial Resources
  • Creation of Root Table
  • Generation of Prefix and Suffix Tables
  • Tuning of the Tagger
    • Confusion Matrix calculation
    • Reducing Compound Common Noun – Noun Conflict
  • Database Tables

Table 5.1 on Page 51 lists the 20 most frequent words found in the corpus. We stored all unique word-tag pairs from the C1-GA corpus in a table called the 'submitted table'. We found that most errors occurred for the tags Proper Noun (NNP), Common Noun (NNC), and Compound Proper Noun (NNPC).

We also observed that a few words were tagged as verbs instead of nouns because they contain verb suffixes. Some of the NNP and NNPC words already present in the databases are labeled correctly as NNP or NNPC. The root table initially contains about 1885 root words of the language along with the best possible tag(s) for each word.
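
A confusion matrix over (expert tag, system tag) pairs makes such error patterns easy to spot; a minimal sketch, where the helper and the toy tag sequences are ours for illustration:

    from collections import Counter

    def confusion_matrix(gold_tags, system_tags):
        """Count (gold, predicted) tag pairs from two parallel tag lists;
        off-diagonal cells are tagging errors."""
        return Counter(zip(gold_tags, system_tags))

    gold   = ["NNP", "NNC", "VB", "NNPC"]
    system = ["NNC", "NNC", "VB", "NNP"]
    for (g, s), n in confusion_matrix(gold, system).items():
        if g != s:
            print(f"{g} tagged as {s}: {n} time(s)")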

Figure 5.1: Flow Chart of our Approach

Implementation Details

  • Support for other languages
  • Root Word Generation Algorithm
  • Exception Rules
  • Experimental Results of Root Word Generation Technique
  • Evaluation Procedure
  • Development of User Interface
  • Experimental Results using the User Interface

Thus, based on the stem word and the suffix category, the likely grammatical category(ies) of the newly generated root word are determined. We also used words from different Internet sources to test the accuracy of the system. The root database contains only a single instance of each root word with its relevant grammatical category(ies).

Although the rules are language-dependent, they increase the efficiency of the root word generator. We applied the bigram method to the output text of tagging step-I. The root table contains the root words of the language along with the best possible tags for each word.
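
A minimal sketch of such a suffix-stripping step, assuming the root and suffix tables are loaded as Python dictionaries; the data layout and the longest-suffix-first strategy are our assumptions, not necessarily the thesis's exact algorithm (see Figure 6.1):

    def generate_root(word, root_table, suffix_table):
        """Strip a known suffix and accept the remainder only if it is a
        known root; return the root with its likely tag(s), else None."""
        if word in root_table:                    # the word is itself a root
            return word, root_table[word]
        for suffix in sorted(suffix_table, key=len, reverse=True):
            stem = word[:-len(suffix)]
            if word.endswith(suffix) and stem in root_table:
                return stem, root_table[stem]
        return None   # unknown word, left for manual verification

    # illustrative romanized entries, not actual table contents
    print(generate_root("lage", {"lag": ["VB"]}, {"e": ["3rd-person"]}))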

Figure 6.1: Flow chart of root word generation

Correctness of the newly generated tagged corpus

It was observed that not all of the modified tags had been wrongly tagged by the system. We calculated the tags modified by users and experts, the percentage of finer tagging by experts, the percentage of tagging errors, and the quality and fidelity of the new tagged corpus. Fidelity is the percentage of words correctly tagged by the tagger when coarse (roughly correct) tags are counted as correct.

Accuracy is the percentage of words correctly tagged by the tagger when coarse tags are counted as errors. Similarly, System Accuracy is the percentage of words correctly tagged by the System Tagger with coarse tags counted as errors, and System Fidelity is the percentage of words correctly tagged by the System Tagger with coarse tags counted as correct.
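
A minimal sketch of these two definitions; the word counts below are invented for illustration:

    def quality_metrics(total, fine_correct, coarse_correct):
        """Accuracy counts coarse (roughly correct) tags as errors;
        fidelity also counts them as correct."""
        accuracy = 100.0 * fine_correct / total
        fidelity = 100.0 * (fine_correct + coarse_correct) / total
        return accuracy, fidelity

    # e.g. 1000 words: 930 with the exact tag, 40 more with a coarser tag
    print(quality_metrics(1000, 930, 40))   # (93.0, 97.0)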

Figure 7.1: Corpus Compare diagram

Comparison of Annotation Time with Manual Tagging

We took two linguists and three different sets of corpora, ranging from simple sentences of common, frequently used words to longer compound sentences with less frequently used words, and found that approximately 58 words per hour were tagged by the linguists at their best speed. The following figure (Figure 7.2 on Page 92) depicts the number of words tagged by the experts on the three different corpora. We added the time required by the system and the time required by the native speakers to update the word-tag pairs to calculate the total time required by our approach, T_Total = T_System + T_User.

It was observed that about 12312 words were tagged per hour by our approach, which is more than 200 times faster than the expert's manual tagging. Based on the time the experts required to check a random sub-corpus of 5276 words, we calculated the probable maximum time an expert would require to verify the correctness of the full corpus of 554054 words, which gives the expert time requirement T_Expert. Adding T_Expert to the total time, T_Total = T_User + T_Expert, we found that about 2029 words were tagged per hour by our approach, which is still more than 34 times faster than the expert's manual tagging.
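
The speed-up factors follow directly from the reported throughput figures; a quick sanity check:

    # words/hour figures reported above
    approach_wph = 12312     # system + native users
    manual_wph = 58          # expert manual tagging
    print(approach_wph / manual_wph)      # ~212, i.e. more than 200x faster

    with_expert_wph = 2029   # including expert verification time
    print(with_expert_wph / manual_wph)   # ~35, i.e. more than 34x faster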

Figure 7.2: Tagging speed diagram

Analysis of the Results

Using the user interface, the native speakers accurately tagged the newly encountered proper, common, compound proper, and compound common nouns in the sentences. Since we provided full sentences to each native user, they could easily identify the compound proper nouns and compound common nouns from the context. In the statistical approaches, by contrast, compound proper and compound common nouns drastically reduce the accuracy.

It was also observed that inter- and intra-domain effects are negligible in our approach, whereas their effect is very high in stochastic approaches, as we observed while using NLTK. By using a hybrid method we obtained better results, with no single parameter becoming critical in determining the accuracy of the results. We have built up a fairly large knowledge base, and a lot of work has gone into analyzing suffixes and formulating language rules to deal with exceptions.

Conclusion and Future Work

Contributions

An additional 250,000 words have already been annotated, and it is proposed to use web-based crowdsourcing at the native-user level. While the results show that accuracy increases when native users are involved, the system also performed very well on its own. Compared with manual tagging, the time spent by our scheme is two orders of magnitude lower even with the intervention of native users (about 200 times faster).

We do not take into account the time required to create the knowledge base, as this is a one-time exercise. The interesting result is that even if we include the time experts need to verify and correct the output of our tagger, the time required is still much shorter than for a completely manual tagging workflow in which experts use the Sanchay software to perform the tagging. Experiments showed that the main contribution of the native users is the identification of proper nouns (singular and compound) and compound nouns.

Language-Independent Approach

If a reader goes through an untagged corpus and fills in the proper noun tables by identifying the proper nouns in it, very little help from native speakers may be needed, and this last step may be sped up further. The native-speaker community can contribute to the development of the Assamese language through the various user interfaces developed in this work. The tool is available online, so the large community of native speakers can be tapped to contribute to the development of a tagged corpus of Assamese through crowdsourcing.

Future Work

Appendix

List of Corpora

Some of the language-specific rules that have been implemented in the POS tagger

References

[10] Eric Brill: A Simple Rule-Based Part of Speech Tagger. In Proceedings of the Third Conference on Applied Natural Language Processing.
[26] Stéphane Bressan, Lily Suryana Indradjaja: Part-of-Speech Tagging Without Training. In Proceedings of the IFIP International Conference (INTELLCOMM), Bangkok, 2004, pp. 112-119.
[56] Navanath Saharia, Dhrubajyoti Das, Utpal Sharma, Jugal Kalita: Part of Speech Tagger for Assamese Text. In Proceedings of the ACL-IJCNLP Conference, 2009, pp.
[57] Mizanur Rahman, Sufal Das, Utpal Sharma: Part-of-Speech Tagged Assamese Text Analysis. In International Journal of Computer Science Issues, Vol.
In Proceedings of the 14th International Conference on Computational Linguistics and Intelligent Text Processing (CICLing), Samos, Greece, March, pp.
In Proceedings of the 12th International Conference on Computational Linguistics and Intelligent Text Processing, Volume Part I, p.

Figures

Figure 1.1: Different Approaches of POS Tagging
Table 3.1: List of the tagset used
Figure 4.1: Snapshot of Brill tagger using NLTK
Table 4.1: Results of various taggers