Chapter 1 Hierarchical Phrase-Based Machine Translation

Phrase-based models (Koehn et al. [2003]) improved upon previous machine translation methods by generalizing translation from words to phrases. The basic phrase-based model is an instance of the noisy-channel approach (Brown et al. [1990]). Other phrase-based models exist, such as those that model the joint distribution P(e,f) or those that use P(e) and P(f|e) as features of a log-linear model.

Phrase-based models are very good at phrase-level translations that have been observed in the training data. When we ran a phrase-based MT system like Pharaoh on the Chinese sentence, we got the second sentence. The hierarchical model uses sub-phrases to remove the problems associated with phrase-based MT.

We give some examples of phrase-based translation to understand how redundancy is introduced in 2.1. In essence, the hierarchical model not only reduces the size of the grammar, but also combines the power of both rule-based and phrase-based machine translation systems. Thus, the hierarchical model uses a phrase-based reordering technique to learn the reordering of phrases.

Phrase-based models can work well for translations that are localized to substrings and have been observed previously in the training corpus.

Figure 1.2: Hindi to English translation showing reordering

The model

Once we have the parse tree in one language, we can construct the parse tree in the other language: applying the transfer rules to the source-side tree yields the parse tree in the target language. In the case of reordering, the transfer rules cause the terminals or nonterminals to rotate around a nonterminal that has a corresponding rule in the reordering grammar.
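The rotation of children around a nonterminal can be illustrated with a minimal sketch. The tree encoding, the rule format, and the placeholder words are assumptions made for this illustration only; they are not the representation used by any particular system.

```python
def apply_transfer(tree, reorder_rules):
    """Reorder the children of every node that has a matching rule.

    tree: a leaf string, or a (label, children) tuple.
    reorder_rules: maps (label, child_labels) to a tuple giving the new
    order of the children (a hypothetical rule format for this sketch).
    """
    if isinstance(tree, str):          # leaf (a word): nothing to reorder
        return tree
    label, children = tree
    children = [apply_transfer(child, reorder_rules) for child in children]
    child_labels = tuple(c if isinstance(c, str) else c[0] for c in children)
    order = reorder_rules.get((label, child_labels))
    if order is not None:              # rotate the children around this node
        children = [children[k] for k in order]
    return (label, children)


def leaves(tree):
    """Read the words off the tree from left to right."""
    if isinstance(tree, str):
        return [tree]
    return [word for child in tree[1] for word in leaves(child)]


# Toy verb-final source tree; the words are placeholders, not real data.
source_tree = ("S", [("NP", ["subject"]),
                     ("VP", [("NP", ["object"]), ("V", ["verb"])])])
rules = {("VP", ("NP", "V")): (1, 0)}   # VP -> NP V becomes VP -> V NP
target_tree = apply_transfer(source_tree, rules)
print(" ".join(leaves(target_tree)))    # prints "subject verb object"
```

Only the order of the children changes here; translating the words themselves would be a separate step.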

The functions have been divided into three sets according to the way they are evaluated. The number of co-occurrences of sentences γ and α can be obtained directly from the bitext and used to estimate the probabilities P(γ|α) and P(α|γ). The former function is found in noisy channel models, but the latter function was also found useful in obtaining the alignment matrix discussed last.

Pw(γ|α) and Pw(α|γ) are features that estimate how well the words in sentence γ translate the words in sentence α (Koehn et al.). Given a sentence pair ⟨γ, α⟩ and a word alignment a between the foreign word positions i = 1..n and the English word positions j = 0, 1..m, the lexical weight Pw is calculated from the word translation probabilities of the aligned word pairs. Consider, as an example, a French sentence f and an English sentence e together with their alignment matrix.

The alignment matrix shows the correspondence between the two sentences: a cell is filled with a double hash for an alignment and a double blank for a non-alignment. Based on the alignments and the formula proposed by Koehn, we obtain the probability of translating the English phrase into the French phrase given the alignment a in equation 1.4.6. This feature is also similar to Koehn's phrase penalty, which gives the model some flexibility in favoring shorter or longer derivations.
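As a rough illustration of the lexical weight described above, the sketch below follows the form of lexical weighting usually attributed to Koehn et al.: for each foreign word, average the word translation probabilities over the English words it is aligned to, then multiply the averages. The function name, the use of None for the NULL word (English position 0 in the text), and the toy probabilities are assumptions for this example.

```python
def lexical_weight(f_words, e_words, alignment, w):
    """Sketch of P_w(f | e, a).

    alignment: set of (i, j) links between f_words[i] and e_words[j]
               (0-based here, unlike the 1-based positions in the text).
    w: dict mapping (foreign_word, english_word) to a probability;
       an unaligned foreign word is scored against the NULL word (None).
    """
    weight = 1.0
    for i, f in enumerate(f_words):
        linked = [j for (fi, j) in alignment if fi == i]
        if linked:
            # average over all English words linked to this foreign word
            weight *= sum(w.get((f, e_words[j]), 0.0) for j in linked) / len(linked)
        else:
            weight *= w.get((f, None), 0.0)
    return weight


# Toy values, invented for illustration only.
f_sentence = ["la", "maison"]
e_sentence = ["the", "house"]
links = {(0, 0), (1, 1)}
w_table = {("la", "the"): 0.5, ("maison", "house"): 0.8}
print(lexical_weight(f_sentence, e_sentence, links, w_table))
```

Swapping the roles of the two sentences (and using the reverse word translation table) would give the weight in the other direction.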

where p_lm is the language model and exp(λ_wp |e|), the word penalty, gives some control over the length of the English output.
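The equation these terms belong to is not reproduced here, so the following is only a sketch under an assumption: that a derivation is scored by multiplying the rule weights, the language model probability raised to a weight λ_lm, and the word penalty exp(λ_wp |e|). Working in log space turns the product into a sum. All parameter names are hypothetical.

```python
import math


def derivation_log_score(rule_log_weights, lm_logprob, english_len,
                         lambda_lm, lambda_wp):
    """Log of an assumed derivation weight:
    prod(rule weights) * p_lm(e) ** lambda_lm * exp(lambda_wp * |e|).
    """
    return (sum(rule_log_weights)          # log of the product of rule weights
            + lambda_lm * lm_logprob       # weighted language model term
            + lambda_wp * english_len)     # word penalty term


# Toy numbers, invented for illustration only.
score = derivation_log_score(
    rule_log_weights=[math.log(0.5), math.log(0.25)],
    lm_logprob=math.log(0.1),
    english_len=3,
    lambda_lm=1.0,
    lambda_wp=-0.5,
)
print(score)
```

With a negative λ_wp each extra output word lowers the score, so shorter outputs are favored; a positive value favors longer ones.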

Decoding

Basic Algorithm

The goal item is [S, 0, n], where S is the start symbol of the grammar and n is the length of the input string f. Given a synchronous CFG, we could convert its French-side grammar to Chomsky normal form and then, for each sentence, find the best parse using CKY. It would then be straightforward to convert the best parse from Chomsky normal form back to the original form and map it to the corresponding English tree, whose yield is the output translation.

However, because we have already limited the number of nonterminal symbols in our rules to two, it is more convenient to use a modified CKY algorithm that works directly on our grammar, without any conversion to Chomsky normal form. Converting a CFG to CNF can make the grammar exponentially larger, so it is better to keep the grammar, which is already a million lines long, in its original form. In the next section, the above technique for transferring a tree to a string is demonstrated with an example of Odia to English translation.
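A minimal sketch of a CKY-style chart that works directly on rules mixing terminals and nonterminals, with no conversion to Chomsky normal form, is given below. It is only a recognizer: it proves items [X, i, j] bottom-up and checks for the goal item [S, 0, n], without weights, pruning, or back-pointers. The toy source-side grammar and the tokens f1..f4 are placeholders invented for this example.

```python
def recognize(words, rules, start="S"):
    """Return True if the goal item [start, 0, n] can be proved.

    rules: list of (lhs, rhs) pairs; rhs is a tuple of symbols, and a
    symbol counts as a nonterminal if it appears as some rule's lhs.
    """
    n = len(words)
    nonterminals = {lhs for lhs, _ in rules}
    chart = set()                      # proved items (X, i, j) over words[i:j]

    def matches(rhs, i, j):
        """Can the symbol sequence rhs cover words[i:j] using proved items?"""
        if not rhs:
            return i == j
        head, rest = rhs[0], rhs[1:]
        if head not in nonterminals:   # terminal: must equal the next word
            return i < j and words[i] == head and matches(rest, i + 1, j)
        # nonterminal: try every non-empty sub-span [i, k) already proved
        return any((head, i, k) in chart and matches(rest, k, j)
                   for k in range(i + 1, j + 1))

    for length in range(1, n + 1):             # shorter spans first
        for i in range(0, n - length + 1):
            j = i + length
            changed = True
            while changed:                     # repeat for unary chains (S -> X)
                changed = False
                for lhs, rhs in rules:
                    if (lhs, i, j) not in chart and matches(rhs, i, j):
                        chart.add((lhs, i, j))
                        changed = True
    return (start, 0, n) in chart


# Toy source-side grammar: two glue-style rules plus one hierarchical rule.
toy_rules = [
    ("S", ("S", "X")),
    ("S", ("X",)),
    ("X", ("f1", "X", "f3", "X")),
    ("X", ("f2",)),
    ("X", ("f4",)),
]
print(recognize("f1 f2 f3 f4".split(), toy_rules))   # prints True
```

A real decoder would store the best-scoring derivation for each item rather than a plain set, but the order in which spans are filled is the same.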

Training

Illustration of word alignment algorithm
Illustration of phrase alignment algorithm using heuristic
Demerits of rule based phrase alignment and solutions to their problems
Glue Rules
Intuition behind using a SCFG

The intuition behind this rule is that the phrase f_i^j is a translation of the phrase e_{i'}^{j'} if and only if there is some word of the French sentence f at an index k inside the phrase that is aligned with some word of the English sentence at an index k' inside the phrase. The second and third rules state that no word in the phrase f_i^j is aligned with any word outside the phrase e_{i'}^{j'}, and no word in the phrase e_{i'}^{j'} is aligned with any word outside the phrase f_i^j. Other phrases can be made, but are ignored for the sake of translation.
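The three conditions above can be checked mechanically. The sketch below assumes the alignment is given as a set of (French index, English index) links and that spans are inclusive; the function names, the length limit, and the toy alignment are illustrative assumptions.

```python
def is_initial_phrase_pair(alignment, i, j, i2, j2):
    """Check the three conditions for the spans f[i..j] and e[i2..j2]."""
    has_inside_link = False
    for k, k2 in alignment:
        f_inside = i <= k <= j
        e_inside = i2 <= k2 <= j2
        if f_inside and e_inside:
            has_inside_link = True     # condition 1: some link inside the pair
        elif f_inside != e_inside:
            return False               # conditions 2 and 3: a link crosses out
    return has_inside_link


def extract_initial_phrase_pairs(n_f, n_e, alignment, max_len=10):
    """Enumerate all span pairs that satisfy the conditions (brute force)."""
    pairs = []
    for i in range(n_f):
        for j in range(i, min(n_f, i + max_len)):
            for i2 in range(n_e):
                for j2 in range(i2, min(n_e, i2 + max_len)):
                    if is_initial_phrase_pair(alignment, i, j, i2, j2):
                        pairs.append(((i, j), (i2, j2)))
    return pairs


# Toy alignment for a 3-word sentence pair: word 0 aligns monotonically,
# words 1 and 2 are swapped. Invented for illustration only.
toy_alignment = {(0, 0), (1, 2), (2, 1)}
for pair in extract_initial_phrase_pairs(3, 3, toy_alignment):
    print(pair)
```

The brute-force enumeration is quadratic in each sentence length and is meant only to make the definition concrete.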

To obtain a synchronous CFG, more complex rules must be constructed that contain sub-phrases (X). Nonterminals are not allowed to be adjacent on the French side, since adjacency is a major cause of spurious ambiguity. In the first step, we can extract CFG rules for the source-side language (Odia) from the SCFG rules, and parse the source-side sentence with the obtained CFG rules.
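The sketch below illustrates, under stated assumptions, how a rule with sub-phrases might be built by replacing an aligned sub-phrase pair with a co-indexed nonterminal, how the adjacency restriction can be checked, and how the source-side CFG rule can be read off an SCFG rule. The "[X,k]" notation, the helper names, and the placeholder tokens are all hypothetical.

```python
def is_nonterminal(symbol):
    """Nonterminals are written "[X,k]" in this sketch (an assumed convention)."""
    return symbol.startswith("[X,") and symbol.endswith("]")


def subtract(src, tgt, src_span, tgt_span, index):
    """Replace src[i:j] and tgt[i2:j2] (half-open spans) by the same nonterminal."""
    (i, j), (i2, j2) = src_span, tgt_span
    nt = f"[X,{index}]"
    return src[:i] + [nt] + src[j:], tgt[:i2] + [nt] + tgt[j2:]


def has_adjacent_nonterminals(side):
    """True if two nonterminals stand next to each other on this side."""
    return any(is_nonterminal(a) and is_nonterminal(b)
               for a, b in zip(side, side[1:]))


def source_cfg_rule(lhs, src):
    """Project an SCFG rule onto its source side, dropping the co-indices."""
    return lhs, tuple("X" if is_nonterminal(s) else s for s in src)


# Placeholder phrase pair; tokens f1..f4 and e1..e6 are not real data.
src = ["f1", "f2", "f3", "f4"]
tgt = ["e1", "e2", "e3", "e4", "e5", "e6"]
src, tgt = subtract(src, tgt, (1, 2), (4, 6), index=1)   # f2 <-> e5 e6
src, tgt = subtract(src, tgt, (3, 4), (1, 3), index=2)   # f4 <-> e2 e3
print(src, tgt)
print(has_adjacent_nonterminals(src))
print(source_cfg_rule("X", src))
```

A rule whose source side fails the adjacency check would simply be discarded during extraction.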

Let us walk through an Odia to English translation and see the stages a sentence passes through on its way to the output. Suppose a user gives our system a test sentence in Odia and expects the English sentence given below.

Table 1.7.1: Odia to English Alignment shown below.

Testing on Odia to English translation

Parse tree in Odia

Apply transfer rules

Text shown in red has already been translated into English, while text shown in white has not yet been translated.

Figure 1.7: The right top corner shows one rule in red which has been applied while the second rule in white is next to be applied to the parse tree

Apply rule 4

Open source hierarchical phrase-based machine translation system
JOSHUA

Main functionalities
Language Model
MERT

Moses

Factored Translation Model

Example of phrase based MT lagging

Toolkit

However, most of the systems mentioned above are not open source and are therefore not readily available for research. In the following sections, we present two well-known open source hierarchical phrase-based MT systems. Joshua implements all the algorithms needed for synchronous CFGs: chart parsing, n-gram language model integration, beam and cube pruning, and k-best extraction.

The toolkit has been shown to achieve state-of-the-art translation performance on the WMT09 French-English translation task. Instead of using the entire corpus for grammar extraction, only part of the corpus is used, as suggested by Kishore Papineni. When a sentence is selected, the count of every n-gram in W that occurs in it is increased by the number of times it occurs in that sentence.
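The text gives only the counting step of the subsampling procedure, so the sketch below fills in the rest with assumptions that should be read as such: W is taken to hold the n-grams of the sentences to be translated, and a training sentence is selected when it contains an n-gram from W whose count is still below a threshold k. The function names, the threshold, and the maximum n-gram length are hypothetical.

```python
from collections import Counter


def ngrams(tokens, max_n):
    """All n-grams of the token list up to length max_n, as tuples."""
    return [tuple(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]


def subsample(training_sources, test_sources, k=2, max_n=3):
    """Return the indices of the selected training sentences."""
    W = {}                                   # n-gram -> count seen so far
    for sentence in test_sources:
        for gram in ngrams(sentence.split(), max_n):
            W[gram] = 0
    selected = []
    for index, sentence in enumerate(training_sources):
        occurrences = Counter(g for g in ngrams(sentence.split(), max_n)
                              if g in W)
        # assumed selection test: some needed n-gram is still under-covered
        if any(W[g] < k for g in occurrences):
            selected.append(index)
            for gram, count in occurrences.items():
                W[gram] += count             # the counting step from the text
    return selected


# Toy data, invented for illustration only.
picked = subsample(["a b c", "a b", "a x", "z z"], ["a b"], k=2, max_n=2)
print(picked)    # prints [0, 1]
```

In this toy run the third sentence is skipped because its only relevant n-gram has already been seen k times, and the fourth shares nothing with the test data.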

Hierarchical phrase-based MT requires grammars extracted from a parallel corpus, but in real translation tasks the grammars are too large and often violate memory constraints. In this part, we describe the different sub-functionalities of the decoding algorithms as described in Li et al. Grammar Formalism: The decoder implements a synchronous context-free grammar (SCFG) of the kind used in Hiero.

Chart Parsing: Given a source sentence, the decoder produces 1-best and k-best translations using a CKY parser. Language Model: The Java implementation can read ARPA files produced by the SRILM toolkit, so the decoder can be used independently of SRILM. The authors also developed their own code that allows the decoder to use the SRILM toolkit to read and score n-grams.

Moses recently started developing hierarchical phrase-based MT in order to become a complete toolkit, and it offers a complete set of translation tools for academic research. The decoder is the central component of the toolkit and was adopted from Pharaoh in order to attract Pharaoh's users.

Phrase based translation of a Hindi sentence to English sentence
Example to establish reordering
Websites for Gazetteer list
Examples of noisy data in CoNLL corpus
Grammar correction example
Single reference translation
Multiple Reference Translations
Translation models

Factor-Based Translation Model

Hindi-english translation

In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, NAACL '03, pages 48-54, Stroudsburg, PA, USA.
In Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR, WMT '10, pages 133-137, Stroudsburg, PA, USA.
In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, ACL-44, pages 609-616, Stroudsburg, PA, USA.

In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, ACL '05, pages 271–279, Stroudsburg, PA, USA.

For example (this mapping was observed during training): भारत का प्रधान मंत्री → Prime Minister of India. If a similar sentence appears during testing,

Input for the Hi-En translation system: सेंट्रल लंडन में गिरा प्लेन
Expected output: plane down in central London.

System A: Responsibility of Israeli officials for airport security
Reference: Israeli officials are responsible for airport security
System B: airport security Israeli officials are responsible

Israeli officials are responsible for airport security.
Israel is responsible for security at this airport.
The security work for this airport is the responsibility of the Israeli government.
The Israeli side was responsible for the security of this airport.

English Rule-Based Voter from Lucene. The configuration for factors was done as follows: