Journal of Science, Quy Nhon University - No. 4, Vol. VIII, 2014

A NEW APPROACH TO EXTRACT PARALLEL CORPUS

LE QUANG HUNG*

1. INTRODUCTION

Parallel corpora play an important role in the quality of statistical machine translation and multilingual natural language processing. Statistical machine translation uses parallel sentences as the input for the alignment module, which produces word translation probabilities. Cross-language information retrieval uses parallel texts for determining corresponding information in both questioning and answering. Parallel texts are also used for the acquisition of lexical translations.

Unfortunately, only a few high-quality parallel corpora are publicly available, such as the United Nations proceedings and the Canadian Parliament proceedings. These corpora are usually small, specialized in narrow domains, subject to fees and licensing restrictions, or sometimes out of date. The number of web sites containing web pages in parallel translation is increasing considerably, which makes it easier to construct high-quality parallel corpora on a large scale. For that reason, many studies have turned to mining parallel corpora from the Web. Potential parallel texts usually exist in web sites containing multilingual news.

Parallel web pages within a site generally have comparable structures and contents. Therefore, a large number of these studies focus on characteristics of HTML structure such as URL links, filenames, and HTML tags. Some of them additionally use translation information from bilingual dictionaries and treat it as a clue for determining parallel pages [8], [1], [6], [17], [16]. The remaining related studies focus on determining parallel sentences [14] or parallel sub-sentential fragments [15] from comparable bilingual texts to increase the quality of a statistical machine translation system.

Suppose that we are considering two web pages, each of which has translation relationships with the other. From our observation, there are three kinds of relationships between bilingual web pages, including:

- One page is the full translation of the other. In this kind, the translation is normally done at the sentence level.

- One page is created by translating some parts of the other. In these parts, the texts are closely translated. The remaining parts are heavily modified in the target language and keep only the main idea. We have found that the translated parts are usually paragraphs in bilingual English-Vietnamese web pages.

- One page is only loosely based on the other. In this kind, the translator just keeps some information from the original page, or even combines information from different pages to generate a new page in another language.

The parallel data that are useful for multilingual natural language processing come at different levels, including texts, sentences, phrases, and lexicons. Current studies mostly focus on the acquisition of parallel texts from the Web. However, they do not pay enough attention to the different kinds of bilingual web pages described above. In these studies, all potentially comparable web pages are first considered as candidates, and bilingual web page pairs are then chosen if they receive high evaluation scores. The evaluation is based on corresponding information from the web pages' structures and lexicon translations. We have investigated this approach and found that relying on structural information alone yields many wrong page pairs. Using lexicon translations, or even a translation system, on whole pages is only suitable for the first kind of bilingual web pages. Therefore, only a small number of parallel pages are extracted, and increasing the number of obtained parallel web pages also increases the number of noisy ones.

Another direction in these studies is extracting parallel sentences from comparable corpora. However, this approach is evaluated only through a statistical machine translation system, so the quality of the extracted parallel sentences themselves is not clear. It is also easy to see that high-quality parallel sentences cannot be acquired from noisy parallel corpora.

Unlike previous studies, we observe that only the first and second kinds of bilingual web pages classified above provide useful translation information. It is worth emphasizing that parallel sentences can be obtained from the second kind as well as from the first. This motivates us to determine parallel web pages by identifying translation segments between the two bilingual pages, such as documents, paragraphs, sentences, and words. That will help us to extract proper translation units in bilingual web pages. In this work, we focus on identifying translation paragraphs; a fully parallel page is a special case of this. As encouraged by [8], we formulate this problem as a classification problem and use a support vector machine on various kinds of features designed from structural information and paragraph translations. The proposed method is evaluated on the English-Vietnamese language pair.

The rest of the paper is organized as follows. Section 2 presents our view of the relationship between parallel web pages and parallel paragraphs. Section 3 shows how structural features and content-based features are designed and computed. In Section 4, we formulate the task as a classification problem and show how to choose features for this model. Section 5 presents our experiments. Related work is presented in Section 6. Finally, Section 7 is the conclusion.


2. PARALLEL WEB PAGES AND PARALLEL PARAGRAPHS

It is well known that once a parallel corpus has been obtained, algorithms are applied to extract parallel sentences from it. These sentence pairs are then used to extract word or phrase translations, which are used directly for statistical machine translation or other related NLP applications. Parallel sentence pairs can be considered the essential data for multilingual natural language processing, and obtaining them is also the main objective of automatically acquiring parallel corpora. It is worth recalling that the first and second kinds of bilingual web pages mentioned in Section 1 are suitable for this objective.

Suppose that we are working with the two languages English and Vietnamese. From this point of view, it is worth representing the text of a web page as a sequence of paragraphs. Let Epage, Etext, Vpage, and Vtext denote an English web page, the corresponding English body text, a Vietnamese web page, and the Vietnamese body text, respectively. Then Etext is represented as the sequence $pe_1 pe_2 \ldots pe_m$ and Vtext as the sequence $pv_1 pv_2 \ldots pv_n$, where $pe_i$ and $pv_j$ stand for paragraphs in the English text and the Vietnamese text, respectively.

Previous studies usually compare the whole contents of Etext and Vtext to determine whether they are parallel. In our opinion, if two bilingual pages contain a number of parallel paragraphs, those paragraphs are worth mining. This requires only a partial evaluation of the two pages, and thus avoids a drawback of previous studies, which return low scores even when the two pages contain parallel paragraphs. Compared with approaches that aim to extract parallel sentences [14] or parallel sub-sentential fragments [15], we believe that high-quality parallel sentences, as well as parallel sub-sentential fragments, can be extracted from the obtained parallel paragraphs. Furthermore, working at the paragraph level accommodates 1-n or n-1 translations between sentences of the two languages, which is consistent with our observations. Note that paragraphs can be identified easily in HTML pages from the tags <p> and </p>.

We now must find out which paragraph in one language is the translation of a paragraph in the other language. This information is then used to determine whether the two pages are parallel. Conversely, corresponding features of the two pages can also provide useful clues for determining parallel paragraphs. To measure this correspondence, we first design a function $Similarity_{para}(pe, pv)$ that measures the translation relationship between $pe$ and $pv$; it can be computed, for example, using lexicon translations from a bilingual dictionary, or by other methods. For each $pe_i$, we must find the most appropriate paragraph $pv_j$ among the $n$ paragraphs of Vtext, as given in Equation (1).

$$pv_j = \arg\max_{pv_k} Similarity_{para}(pe_i, pv_k), \quad k = 1, \ldots, n \qquad (1)$$


From this pairing of paragraphs, we can evaluate the degree of translation relationship between Etext and Vtext. This degree is then expressed as a set of features, with their weights, in the learning model used for determining parallel pages. The next section describes these features in detail (we call them content features) and how to compute them.

3. FEATURES

3.1. Features of Content Similarity based on Paragraph Translation

Previous studies usually use lexicon translations taken from a bilingual dictionary to measure the degree of content similarity between the two texts, as in [8], [16]. This approach may face difficulty because a word can be translated in many ways. In contrast, we use the Google translator, which lets us exploit the advantages of statistical machine translation: it helps to resolve lexical ambiguity, translate phrases, and reorder words.

Equation (1) shows how to choose the best translation of a paragraph. We use the BLEU score (as the instance of $Similarity_{para}$) as the degree of translation relationship from $pe_i$ to $pv_j$. Using this degree, from the two sequences $pe_1 pe_2 \ldots pe_m$ and $pv_1 pv_2 \ldots pv_n$ we can easily find an alignment such that each $pe_i$ is aligned with its best translation among the Vietnamese paragraphs $pv_j$ (excluding paragraphs that have already been chosen). The paragraph pair with the higher BLEU score is chosen earlier. Paragraphs that do not belong to any pair are aligned with an empty paragraph, and their similarity degree is zero. From these paragraph pairs we select only those whose BLEU score is greater than a threshold. The selected pairs are considered parallel paragraphs.
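The greedy pairing described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the names `align_paragraphs` and `token_overlap` are hypothetical, and `token_overlap` is a simple stand-in for the BLEU score computed on machine-translated text, used here only so that the example runs without external services.

```python
from typing import Callable, List, Tuple


def token_overlap(text_a: str, text_b: str) -> float:
    """Placeholder similarity in [0, 1]: Jaccard overlap of lowercase tokens.

    Assumption: this replaces the BLEU score between a machine translation
    of one paragraph and the other paragraph. It is not the paper's measure.
    """
    a, b = set(text_a.lower().split()), set(text_b.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0


def align_paragraphs(
    pe: List[str],
    pv: List[str],
    similarity: Callable[[str, str], float],
    threshold: float,
) -> List[Tuple[int, int, float]]:
    """Greedy one-to-one alignment; higher-scoring pairs are chosen first."""
    scored = sorted(
        ((similarity(e, v), i, j) for i, e in enumerate(pe) for j, v in enumerate(pv)),
        key=lambda t: (-t[0], t[1], t[2]),
    )
    used_e, used_v, pairs = set(), set(), []
    for score, i, j in scored:
        if i in used_e or j in used_v:
            continue  # each paragraph may belong to at most one pair
        used_e.add(i)
        used_v.add(j)
        if score > threshold:  # keep only pairs above the threshold
            pairs.append((i, j, score))
    # Paragraphs left out of `pairs` correspond to the "empty paragraph" case.
    return sorted(pairs)
```

For example, `align_paragraphs(english_paragraphs, translated_vietnamese_paragraphs, token_overlap, 0.5)` returns triples of (English index, Vietnamese index, score); the threshold value 0.5 is illustrative only.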

We now design two content-based features, which are used as evidence for determining whether an Etext is parallel to a Vtext.

The first content feature is the average of $Similarity_{para}$ over the selected paragraph pairs (denoted $avgSim_{para}$). This value estimates the translation degree of the parts of the two texts that have been determined to have a sufficient translation relationship. A high value indicates that these paragraphs are closely parallel.

The second feature is the ratio between the number of selected (parallel) paragraphs and the total number of paragraphs of Etext and Vtext (denoted $rate_{para}$). A value close to 1 means that Vtext is a full translation of Etext, and a value of 0 means the opposite. A value between 0 and 1 means that Vtext is partly a translation of Etext.
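As a rough sketch, the two content features can be computed from the selected paragraph pairs as below. The function name `content_features` is hypothetical, and the exact normalization of the rate is an assumption: here each selected pair counts as two paragraphs (one per language), so that a fully translated page with equal paragraph counts gets a rate of 1.

```python
from typing import List, Tuple


def content_features(
    pairs: List[Tuple[int, int, float]], n_english: int, n_vietnamese: int
) -> Tuple[float, float]:
    """Return (avg_sim, rate) for one candidate page pair.

    `pairs` holds the selected (english_index, vietnamese_index, score) triples.
    Assumption: rate = 2 * len(pairs) / (n_english + n_vietnamese), which is
    one possible reading of the definition in the text.
    """
    if not pairs:
        return 0.0, 0.0
    avg_sim = sum(score for _, _, score in pairs) / len(pairs)
    total = n_english + n_vietnamese
    rate = 2 * len(pairs) / total if total else 0.0
    return avg_sim, rate
```

With two selected pairs scoring 0.75 and 1.0, taken from pages of 2 and 3 paragraphs, this gives an average similarity of 0.875 and a rate of 0.8.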

The next section will present how to utilize these content features in a machine learning framework in order to determine parallel web pages.


3.2. Structural Features

Although content-based features contain essential information for determining whether two texts are parallel, it is also useful to use structure-based features from HTML pages for aligning two web pages. Many studies have used this information for quickly filtering out bad alignments and yielding candidate parallel web pages. Furthermore, structural features provide complementary information for this task, and many studies, such as [8], [16], have therefore combined structural and content information to improve their systems' performance.

In this paper we utilize the structural features designed in [8], in which structural filtering relies on analysis of the pages' underlying HTML. A linear sequence of tokens is first produced from the HTML of each page, with three kinds of token: [START: elementlabel], [END: elementlabel], and [Chunk: length]. The chunk length is measured in non-whitespace bytes. The sequence is produced for both languages, and the two sequences are then aligned using a standard dynamic programming technique. From this alignment, [8] computes four values, which are used as the structural features:

- dp: the difference percentage of the alignment, computed from the number of tokens that are not aligned.

- n: the number of aligned non-markup text chunks of unequal length.

- r: the correlation of the lengths of the aligned non-markup chunks.

- p: the significance level of the correlation r.
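To make the linearization step concrete, the following sketch turns a page into START/END/Chunk tokens and estimates dp. It relies on assumptions: the names `TagLinearizer`, `linearize`, and `difference_percentage` are hypothetical, and Python's `difflib.SequenceMatcher` is swapped in for the dynamic programming alignment of [8], so the values will not necessarily match those of the original system. The features n, r, and p are not covered.

```python
import difflib
from html.parser import HTMLParser
from typing import List


class TagLinearizer(HTMLParser):
    """Turn an HTML page into a flat list of START/END/Chunk tokens."""

    def __init__(self) -> None:
        super().__init__()
        self.tokens: List[str] = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append(f"[START:{tag}]")

    def handle_endtag(self, tag):
        self.tokens.append(f"[END:{tag}]")

    def handle_data(self, data):
        # Chunk length is counted in non-whitespace bytes (UTF-8 assumed).
        length = len("".join(data.split()).encode("utf-8"))
        if length:
            self.tokens.append(f"[Chunk:{length}]")


def linearize(html: str) -> List[str]:
    parser = TagLinearizer()
    parser.feed(html)
    parser.close()
    return parser.tokens


def _kinds(tokens: List[str]) -> List[str]:
    # Compare chunks by kind only, so chunks of different length can align.
    return ["[Chunk]" if t.startswith("[Chunk:") else t for t in tokens]


def difference_percentage(tokens_a: List[str], tokens_b: List[str]) -> float:
    """Fraction of tokens (over both sequences) not covered by a match."""
    matcher = difflib.SequenceMatcher(
        None, _kinds(tokens_a), _kinds(tokens_b), autojunk=False
    )
    matched = sum(block.size for block in matcher.get_matching_blocks())
    total = len(tokens_a) + len(tokens_b)
    return (total - 2 * matched) / total if total else 0.0
```

Two pages with identical markup but different text, such as `<p>Hello world</p>` and `<p>Xin chao</p>`, yield a difference percentage of 0 under this sketch, since only the chunk lengths differ.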

[8] found that the directory structure of many web sites reflects a parallel organization when pages are translations of each other. This characteristic is sometimes reflected in the URL (i.e. the path of the HTML file), and it is therefore used as a feature for determining parallel web pages, as in [16], [1]. In our case, we work with web sites that do not exhibit this characteristic, and we use the directory structure only to distinguish the two languages. Furthermore, the creation date of a page is important information: two pages that are translations of each other rarely have dates far apart, especially for news. Some studies therefore use this information as an external feature for the task, such as [3], [18]. In this paper, we also extract the date from the page structure and group it into the set of structural features. It is also used for discarding bad page pairs and for filtering candidate parallel web pages.
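The date-based filtering of candidate pairs might look like the sketch below. The tuple layout, the function name `filter_by_date`, and the default window of 3 days are all assumptions made for illustration; the paper does not state the window it uses.

```python
from datetime import date
from typing import Iterable, List, Tuple

# (english_url, english_date, vietnamese_url, vietnamese_date); layout is hypothetical.
Candidate = Tuple[str, date, str, date]


def filter_by_date(candidates: Iterable[Candidate], max_days: int = 3) -> List[Candidate]:
    """Keep pairs whose publication dates differ by at most `max_days` days."""
    return [c for c in candidates if abs((c[1] - c[3]).days) <= max_days]
```

A pair published one day apart survives the default filter, while a pair published a month apart is discarded before any content comparison is attempted.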

The content-based features and structure-based features will be combined in a common framework. The next section will present it.

4. CLASSIFICATION MODELING

As a result of content-based and structure analysis of each pair of web pages, we obtain the features which are divided into the two categories:

1. Structural features: dp, n, r, p, and publication date;

2. Content features: $avgSim_{para}$ and $rate_{para}$.

It is now easy to formulate the task as a classification problem. Each candidate pair of web pages is represented by a vector of these features. Let $F = \{f_1, f_2, \ldots, f_k\}$ be the set of features, $D = \{d_1, d_2, \ldots, d_N\}$ be the set of all candidate pairs, and $C = \{0, 1\}$ be the set of categories (0: non-parallel, 1: parallel). Each candidate pair $d_i \in D$ is represented by a feature vector $d_i = (f_{1i}, f_{2i}, \ldots, f_{ki})$. We label each pair 1 if it is parallel and 0 otherwise, which gives us the training data. In our method, we use a Support Vector Machine algorithm to train a classification system. For a new pair of web pages, we first extract its features to obtain its vector representation. This vector is passed through the classification system, which outputs 1 or 0. More formally, the classification problem can be stated as the task of learning a target function $\varphi(d, C) \in \{0, 1\}$ (as illustrated in Figure 1). The output tells us whether the pair is parallel or not.

[Figure: input "A new pair of web pages"; output "1: Parallel, 0: Non-parallel"]

Figure 1. Classification model for determining parallel web pages

5. EXPERIMENTS

5.1. Evaluation measures

To evaluate our proposed approach, we use the standard IR measures of precision and recall. These measures are briefly described in Figure 2.

Figure 2. Precision and recall measures

$$Precision = \frac{|X \cap Y|}{|X|}, \qquad Recall = \frac{|X \cap Y|}{|Y|} \qquad (2)$$

where

- X is the set of pairs labeled 1 in the output data, and
- Y is the set of pairs labeled 1 in the test data.
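A small sketch of these measures, treating X and Y as sets of pair identifiers, is given below. The function name `precision_recall_f` is hypothetical; the third returned value is the F-score of Equation (3), the harmonic mean of precision and recall.

```python
from typing import Set, Tuple


def precision_recall_f(predicted: Set[str], gold: Set[str]) -> Tuple[float, float, float]:
    """Return (precision, recall, f_score) for predicted vs. gold positive pairs."""
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    denominator = precision + recall
    f_score = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f_score
```

For instance, if the system labels four pairs as parallel and two of them are among three truly parallel pairs, precision is 0.5 and recall is 2/3.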

We further evaluated the balance of precision and recall with the F-score, which is defined as

$$F\_Score = \frac{2 \times Precision \times Recall}{Precision + Recall} \qquad (3)$$

5.2. Experimental setup

We perform a host crawl of two web sites: VietnamPlus and VOA News. The HTML pages are downloaded from these domains. We use thresholds on the content-based features to filter candidate pairs. Consequently, we obtain 1,170 pairs, which are the candidates to be classified as parallel or not. Next, we extract structural features and content-based features for these candidates (dp, n, r, p, publication date, $avgSim_{para}$, and $rate_{para}$). We then label each candidate pair 1 if it is parallel and 0 otherwise. Of the 1,170 candidate pairs, 433 are labeled 1 and 737 are labeled 0. After that, we store the data in the format <label> <index1>:<value1> <index2>:<value2> ..., which is suitable for the LIBSVM tool.
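Writing one candidate pair in this sparse text format can be sketched as follows. The function name `to_libsvm_line` and the feature values in the usage note are illustrative assumptions, not data from the experiments; indices are 1-based and ascending, as LIBSVM expects.

```python
from typing import Sequence


def to_libsvm_line(label: int, features: Sequence[float]) -> str:
    """Format one example as '<label> 1:<v1> 2:<v2> ...' with 1-based indices."""
    parts = [f"{index}:{value:g}" for index, value in enumerate(features, start=1)]
    return " ".join([str(label)] + parts)
```

For a pair labeled 1 with the seven features in the order dp, n, r, p, date difference, avgSim, rate, a call such as `to_libsvm_line(1, [0.12, 3, 0.85, 0.01, 0, 0.62, 0.8])` produces a single line beginning with `1 1:0.12 2:3`.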

5.3. Experimental results

We conducted three experiments corresponding to three methods. For each method, we use 5-fold cross-validation; each fold has 234 test items and 936 training items. To evaluate the effectiveness of our proposed method, we compare its performance with two previous approaches: structure-based [7] (the first experiment) and content-based [10] (the second experiment).
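The fold sizes follow directly from splitting the 1,170 candidates into five equal parts. The sketch below shows one way to build such splits; the function name `k_fold_indices`, the shuffling, and the fixed seed are assumptions, since the paper does not describe how the folds were drawn.

```python
import random
from typing import List, Tuple


def k_fold_indices(n_items: int, k: int = 5, seed: int = 0) -> List[Tuple[List[int], List[int]]]:
    """Return k (train_indices, test_indices) splits over range(n_items)."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # assumed: random assignment to folds
    folds = [indices[i::k] for i in range(k)]
    return [
        ([idx for j, fold in enumerate(folds) if j != i for idx in fold], folds[i])
        for i in range(k)
    ]
```

Calling `k_fold_indices(1170)` gives five splits, each with 936 training indices and 234 test indices, matching the sizes reported above.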

- Experiment 1. We use only structural features, as in the original STRAND system [7]. Experimental results are shown in Table I.

Table I. Structure-based method

                Fold 1   Fold 2   Fold 3   Fold 4   Fold 5   Average
Precision (%)   40.9     51.8     39.7     45.1     44.4     44.4
Recall (%)      62.0     61.4     61.4     76.3     65.4     65.3
F_Score (%)     49.3     56.2     48.2     56.7     52.9     52.9

- Experiment 2. We use a bilingual dictionary to match pairs word by word, as in the BITS system [10]. Table II shows the experimental results of the content-based method.

Table II. Content-based method

                Fold 1   Fold 2   Fold 3   Fold 4   Fold 5   Average
Precision (%)   68.8     64.7     64.3     60.1     68.2     65.2
Recall (%)      48.4     47.8     54.8     56.9     52.8     52.1
F_Score (%)     56.8     55.0     59.1     58.4     59.5     57.8

- Experiment 3. We combine the structure-based method with identifying translation segments. Experimental results are shown in Tables III-V. The experiment is conducted at different levels of translation segments: document, paragraph, and sentence. The results show that identifying translation segments at the paragraph level (90% precision) is more effective than at the document level (64.2% precision). The results also show that the difference between the paragraph and sentence levels is not large (90% and 89.9% precision, respectively).

Table III. Identifying translation at document level

                Fold 1   Fold 2   Fold 3   Fold 4   Fold 5   Average
Precision (%)   61.8     66.6     58.0     69.0     65.4     64.2
Recall (%)      67.1     63.4     71.8     54.5     53.4     62.0
F_Score (%)     64.3     65.0     64.2     60.9     58.8     62.6

Table IV. Identifying translation at paragraph level

                Fold 1   Fold 2   Fold 3   Fold 4   Fold 5   Average
Precision (%)   92.0     91.1     89.7     90.4     86.6     90.0
Recall (%)      81.6     75.3     77.7     80.2     77.5     78.5
F_Score (%)     86.5     82.4     83.3     85.0     81.8     83.8

Table V. Identifying translation at sentence level

                Fold 1   Fold 2   Fold 3   Fold 4   Fold 5   Average
Precision (%)   89.6     92.0     87.7     90.9     88.9     89.8
Recall (%)      82.8     75.8     77.5     88.8     79.5     80.9
F_Score (%)     86.1     83.1     82.3     89.8     83.9     85.0

Table VI. Overall results of each method

Method            Precision (%)   Recall (%)   F_Score (%)
Structure-based   44.4            65.3         52.9
Content-based     65.2            52.1         57.8
Our method        90.0            78.5         83.8

Table VI shows the overall results of each method in our experiments. It is worth noting that in this task (extracting a parallel corpus), precision is the most important criterion for evaluating the effectiveness of the system. According to the results in the above tables, the precision of the content-based method (65.2%) is much higher than that of the structure-based method (44.4%). Our approach (90% precision) is in turn much better than the approach in [10], which obtains only 65.2% precision. These results show that the proposed method is effective: it achieves the highest precision, recall, and F-score of the three methods.

6. RELATED WORK

As far as we know, studies on mining parallel web pages all align pages using information that includes URL patterns and HTML tags, and some of them additionally use lexical translation knowledge from a bilingual dictionary. For example, [1] first searches for hosts containing parallel texts and then performs a host crawl; candidate pairs are found based on several external and internal features such as the URL. BITS [10] takes a different approach: all pages from a specified domain are crawled exhaustively, and a bilingual dictionary is used to perform matching at the word level between parallel documents. STRAND [8] has a similar approach to PTMiner [17], except that it handles the case where URL matching requires multiple substitutions. Structural filtering, with a tuning parameter optimized by machine learning (a decision tree), gives it the ability not to examine all possible combinations as BITS does. This system also proposed a new method that combines content-based and structure-based matching by using a cross-language similarity score as an additional parameter of the structure-based method.

Recently, [6] proposed an effective method that uses the Document Object Model (DOM) to represent web pages (using textual chunks, hyperlinks, and HTML tags as features of DOM trees). This model helps to increase the coverage of web pages and the quality of sentence alignment. In Vietnam, natural language processing is at an early stage. There are a few studies on mining parallel corpora from the Web, one of which is presented in [3].

From the parallel documents, we also aim at extracting parallel sentence pairs, which directly support statistical machine translation. The most popular technique for sentence alignment is based on sentence length, as presented in [4]. Later studies use additional lexical translation information for that task, such as [11]. Recently, much research attention has been paid to aligning sentences in comparable documents [13].

7. CONCLUSION

We have presented a new approach based on extending the definition of parallel texts to match translation segments. This helps us to extract proper translation units in bilingual web pages. We also combine the content-based features we proposed with structural features in a machine learning framework. The experimental results show that the proposed model is quite successful in extracting parallel texts from the Web. Our approach has some advantages:

- We combine both structural features and content-based features to improve the performance of the system.

- By using the Google translator, we can exploit the advantages of statistical machine translation, which helps to resolve lexical ambiguity, translate phrases, and reorder words.

- Our approach can be applied to other language pairs, because the features used in the proposed model are language-independent.

We plan to use this system to collect large corpora for the English-Vietnamese language pair. Although further work remains to be done, we can conclude that it is possible to automatically construct an English-Vietnamese parallel corpus from the Web.

ACKNOWLEDGEMENT

This paper is supported by the project B2014-28-07 funded by the Ministry of Education & Training.

REFERENCES

[1] Chen J., Nie J. Y., Automatic construction of parallel English-Chinese corpus for cross-language information retrieval, Proc. ANLP, pp. 21-28, Seattle, (2000).

[2] Brown P. F., J. C. Lai and R. L. Mercer, Aligning Sentences in Parallel Corpora, In Proceedings of 29th Annual Meeting of the Association for Computational Linguistics, (1991).

[3] Van B. Dang, Ho Bao-Quoc, Automatic Construction of English-Vietnamese Parallel Corpus through Web Mining, Proceedings of 5th IEEE International Conference on Computer Science-Research, Innovation and Vision of the Future (RIVF'2007), Hanoi, Vietnam, (2007).

[4] Gale W. A. and K. Church, A Program for Aligning Sentences in Parallel Corpora, In Proceedings of 29th Annual Meeting of the Association for Computational Linguistics, (1991).

[5] Huyen, N. T. M., Vu Xuan Luong, Le Hong Phuong, A case study of the probabilistic tagger QTAG for Tagging Vietnamese Texts, Proceedings of the First National Symposium on Research, Development and Application of Information and Communication Technology, no. 1, (2003).

[6] Lei Shi, Cheng Niu, Ming Zhou, Jianfeng Gao, A DOM Tree Alignment Model for Mining Parallel Data from the Web, ACL, (2006).

[7] Philip Resnik, Mining the Web for Bilingual Text, 37th Annual Meeting of the Association for Computational Linguistics, College Park, Maryland, June, (1999).

[8] P. Resnik and N. A. Smith, The Web as a Parallel Corpus, Computational Linguistics, 29 (3), pp. 349-380, (2003).

[9] Pascale Fung and Percy Cheung, Multi-level Bootstrapping for Extracting Parallel Sentences from a Quasi-Comparable Corpus, In Proceedings of Coling 2004, pp. 1051-1057, (2004).

[10] Xiaoyi Ma, Mark Liberman, BITS: A method for bilingual text search over the web, Machine Translation Summit VII, September, (1999).

[11] Zhao B. and S. Vogel, Adaptive Parallel Sentences Mining From Web Bilingual News Collection, In 2002 IEEE International Conference on Data Mining, (2002).

[12] Bing Zhao, Klaus Zechner, Stephan Vogel, and Alex Waibel, Efficient Optimization for Bilingual Sentence Alignment based on Linear Regression, In Proceedings of the HLT/NAACL 2003 Workshop on Building and Using Parallel Texts: Data Driven Machine Translation and Beyond, (2003).

[13] Utiyama, M. and H. Isahara, Reliable Measures for Aligning Japanese-English News Articles and Sentences, In Proceedings of 41st Annual Meeting of the Association for Computational Linguistics, ACL, (2003).

[14] Dragos Munteanu and Daniel Marcu, Improving Machine Translation Performance by Exploiting Comparable Corpora, Computational Linguistics, 31 (4), pp. 477-504, December, (2005).

[15] Dragos Munteanu and Daniel Marcu, Extracting Parallel Sub-Sentential Fragments from Non-Parallel Corpora, ACL 2006, pp. 81-88, (2006).

[16] Chen, J., Chau, R. and Yeh, C.-H., Discovering Parallel Text from The World Wide Web, In Proc. Australasian Workshop on Data Mining and Web Intelligence (DMWI2004), (2004).

[17] J. Chen, J. Y. Nie, Parallel Web text mining for cross-language IR, RIAO, Paris, pp. 62-77, April, (2000).

[18] Jiang Chen and Jian-Yun Nie, Automatic construction of parallel English-Chinese corpus for cross-language information retrieval, Proceedings of the Sixth Conference on Applied Natural Language Processing, pp. 21-28, April 29 - May 04, (2000).

SUMMARY

A NEW APPROACH TO EXTRACT PARALLEL CORPUS

Le Quang Hung

A parallel corpus is a crucial resource for many Natural Language Processing (NLP) systems such as statistical machine translation and cross-language information retrieval. Obtaining such corpora manually is very costly, while a large amount of parallel text is available in various forms on the Web, such as the pages of bilingual web sites; therefore, automatically extracting parallel texts from the Web has become an important task in NLP research.

In this paper, we develop a new approach based on extending the definition of parallel texts to match translation segments. This helps us to extract proper translation units in bilingual web pages. We also formulate the problem as a classification problem and use both kinds of knowledge resources: structural information of web pages and translation information between the two languages. The experiments, conducted on the English-Vietnamese language pair, show promising results.

ABSTRACT

A NEW APPROACH TO EXTRACTING BILINGUAL CORPORA

Le Quang Hung

Bilingual corpora are a valuable resource for many natural language processing (NLP) systems such as statistical machine translation and cross-language information retrieval. Building corpora manually is very costly, while a large amount of bilingual material is already available on the Web in many different forms, such as the pages of bilingual web sites. Therefore, automatically extracting bilingual texts from the Web has become an important task in NLP research. In this paper, we develop a new approach based on extending the definition of bilingual texts to match translation segments. This helps us to extract translation units in bilingual web pages. In addition, we treat the task as a classification problem and use both sources of knowledge: the structural information of web pages and the translation information between the two languages. Experiments conducted on the English-Vietnamese language pair show that this approach achieves promising results.

*Faculty of Information Technology, Quy Nhon University. Received: 24/3/2014; Accepted: 15/10/2014.