Article

Building a Bilingual Dictionary with Scarce Resources: A Genetic Algorithm Approach


Abstract

Current corpus-based machine translation systems usually require a significant amount of parallel text to build a useful bilingual dictionary for translation. To alleviate this data dependency, I propose a novel approach based on genetic algorithms to improve translations by fusing different linguistic hypotheses. A preliminary evaluation is also reported.


... In what follows we comment on several proposals applying EAs to different statistical models of MT. One of these works (Echizen-ya et al. 1996) deals with translation rule learning, others focus on sentence translation and alignment (Otto and Rojas 2004; Rodríguez et al. 2006), while others are devoted to particular aspects of statistical MT, such as bilingual dictionary construction (Han 2001) or the parameter tuning of a translation system (Nabhan and Rafea 2005). Echizen-ya et al. (1996) propose the use of a GA to acquire translation rules. ...
... The system has been able to improve the translation quality by 11% in the experiments carried out. Han (2001) has used a GA for the automatic construction of a bilingual dictionary in the particular case in which there is little data for the source language, as happens with minority or indigenous languages. In this case the statistical models usually applied to the problem cannot be used because of the lack of data. ...
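Neither the abstract nor the survey excerpts reproduce Han's actual algorithm, but the general shape of a GA for this task can be sketched: a chromosome encodes a candidate source-to-target word mapping, fitness measures how well the mapping explains the few aligned sentence pairs available, and standard crossover and mutation recombine mappings. The toy data, fitness function and parameters below are illustrative assumptions, not details from the paper.

```python
# Hypothetical sketch of a GA that induces a word-for-word bilingual
# dictionary from a very small set of aligned sentence pairs.
# The data, fitness function, and parameters are illustrative assumptions.
import random

# Toy "parallel corpus": source sentences paired with their translations.
CORPUS = [
    (["the", "dog", "runs"], ["le", "chien", "court"]),
    (["the", "cat", "sleeps"], ["le", "chat", "dort"]),
    (["a", "dog", "sleeps"], ["un", "chien", "dort"]),
]
SRC_VOCAB = sorted({w for s, _ in CORPUS for w in s})
TGT_VOCAB = sorted({w for _, t in CORPUS for w in t})

def random_individual():
    """A chromosome maps every source word to some target word."""
    return {w: random.choice(TGT_VOCAB) for w in SRC_VOCAB}

def fitness(mapping):
    """Count how often a mapped source word actually occurs in the
    translation of a sentence containing it."""
    hits = 0
    for src, tgt in CORPUS:
        hits += sum(1 for w in src if mapping[w] in tgt)
    return hits

def crossover(a, b):
    """Uniform crossover: each source word inherits its translation
    from one of the two parents."""
    return {w: random.choice((a[w], b[w])) for w in SRC_VOCAB}

def mutate(ind, rate=0.1):
    for w in SRC_VOCAB:
        if random.random() < rate:
            ind[w] = random.choice(TGT_VOCAB)
    return ind

def evolve(pop_size=40, generations=60):
    pop = [random_individual() for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[: pop_size // 2]
        children = [mutate(crossover(*random.sample(survivors, 2)))
                    for _ in range(pop_size - len(survivors))]
        pop = survivors + children
    return max(pop, key=fitness)

if __name__ == "__main__":
    best = evolve()
    print(fitness(best), best)
```

The "hypotheses" fused here are simply competing mappings in the population; the actual system presumably combines richer linguistic hypotheses, which this sketch does not attempt to model.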
Article
Full-text available
Statistical natural language processing (NLP) and evolutionary algorithms (EAs) are two very active areas of research which have been combined many times. In general, statistical models applied to NLP tasks require designing specific algorithms to be trained and applied to process new texts. The development of such algorithms may be hard. This makes EAs attractive since they offer a general design yet provide high performance in particular conditions of application. In this article, we present a survey of many works which apply EAs to different NLP problems, including syntactic and semantic analysis, grammar induction, summaries and text generation, document clustering and machine translation. The review concludes by identifying which problems, or which particular aspects of those problems, are best suited to being solved with an evolutionary algorithm.
... Genetic algorithms were successfully used in a wide range of fields including computational linguistics. In computational linguistics they were employed to improve the performance of anaphora resolution methods [9, 10], resolve anaphora [11], study optimal vowel and tonal systems [12], build bilingual dictionaries [13], improve queries for information retrieval [14], and learn syntactic rules [15]. For applications in fields other than computational linguistics, a good overview can be found in [16]. ...
Article
Full-text available
This paper presents two methods which automatically produce annotated corpora for text summarisation on the basis of human-produced abstracts. Both methods identify the set of sentences from the document which best conveys the information in the human-produced abstract. The first method relies on a greedy algorithm, whilst the second one uses a genetic algorithm. The methods make it possible to specify the number of sentences to be annotated, which constitutes an advantage over existing methods. Comparison between the two approaches investigated here revealed that the genetic algorithm is appropriate in cases where the number of sentences to be annotated is less than the number of sentences in an ideal gold standard with no length restrictions, whereas the greedy algorithm should be used in other cases.
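As a rough illustration of the genetic variant (not the authors' code), the sketch below evolves fixed-size subsets of sentence indices, with word overlap against the abstract as a stand-in fitness measure; every detail, including the fitness, is an illustrative assumption.

```python
# Minimal sketch of a GA that picks the k document sentences whose words
# best cover a human-written abstract. Fitness measure and data are
# illustrative assumptions, not the paper's method.
import random

def coverage(selected, abstract_words):
    """Fitness: abstract word tokens covered by the selected sentences."""
    covered = {w.strip(".,").lower() for s in selected for w in s.split()}
    return sum(1 for w in abstract_words if w in covered)

def ga_select(sentences, abstract, k, pop_size=30, generations=100):
    abstract_words = [w.strip(".,").lower() for w in abstract.split()]
    n = len(sentences)

    def fitness(indices):
        return coverage([sentences[i] for i in indices], abstract_words)

    # An individual is a set of k sentence indices.
    pop = [random.sample(range(n), k) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[: pop_size // 2]
        children = []
        while len(children) < pop_size - len(survivors):
            a, b = random.sample(survivors, 2)
            pool = set(a) | set(b)
            if random.random() < 0.2:          # mutation: inject a random index
                pool.add(random.randrange(n))
            if len(pool) >= k:
                child = random.sample(sorted(pool), k)
            else:
                extra = [i for i in range(n) if i not in pool]
                child = sorted(pool) + random.sample(extra, k - len(pool))
            children.append(child)
        pop = survivors + children
    return sorted(max(pop, key=fitness))

if __name__ == "__main__":
    doc = ["The cat sat on the mat.", "Dogs bark loudly.",
           "Cats like mats.", "It rained all day."]
    print(ga_select(doc, "cats and mats", k=2))
```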
Conference Paper
Full-text available
Current corpus-based machine translation techniques do not work very well when given scarce linguistic resources. To examine the gap between human and machine translators, we created an experiment in which human beings were asked to translate an unknown language into English on the sole basis of a very small bilingual text. Participants performed quite well, and debriefings revealed a number of valuable strategies. We discuss these strategies and apply some of them to a statistical translation system.
Conference Paper
Full-text available
This paper describes a unified framework for bilingual text matching by combining existing hand-written bilingual dictionaries and statistical techniques. The process of bilingual text matching consists of two major steps: sentence alignment and structural matching of bilingual sentences. Statistical techniques are applied to estimate word correspondences not included in bilingual dictionaries. Estimated word correspondences are useful for improving both sentence alignment and structural matching.
Article
Full-text available
We present an algorithm for aligning texts with their translations that is based only on internal evidence. The relaxation process rests on a notion of which word in one text corresponds to which word in the other text that is essentially based on the similarity of their distributions. It exploits a partial alignment of the word level to induce a maximum likelihood alignment of the sentence level, which is in turn used, in the next iteration, to refine the word level estimate. The algorithm appears to converge to the correct sentence alignment in only a few iterations.
Article
Full-text available
In the vast majority of genetic algorithm implementations, the operator probabilities are fixed throughout a given run. However, it can be convincingly argued that these probabilities should vary over the course of a genetic algorithm run --- so as to account for changes in the ability of the operators to produce children of increased fitness. This dissertation describes an empirical investigation into this question. The effect upon genetic algorithm performance of adaptation methods upon both well-studied theoretical problems, and a hard problem from Operations Research --- the flowshop sequencing problem, is examined.
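As a concrete illustration of adapting operator probabilities during a run, the sketch below credits each operator whenever it produces a child fitter than its parent and re-estimates the probabilities from those success rates each generation. The update rule, the OneMax test problem and all parameters are illustrative choices, not taken from the dissertation.

```python
# Illustrative sketch of operator-probability adaptation in a GA:
# each operator's probability is nudged toward its recent success at
# producing children fitter than their parents. Details are assumptions.
import random

GENOME_LEN = 30

def fitness(bits):                      # simple OneMax test problem
    return sum(bits)

def mutate(parent):
    child = parent[:]
    i = random.randrange(GENOME_LEN)
    child[i] ^= 1
    return child

def crossover(parent, mate):
    cut = random.randrange(1, GENOME_LEN)
    return parent[:cut] + mate[cut:]

def run(pop_size=30, generations=80):
    pop = [[random.randint(0, 1) for _ in range(GENOME_LEN)] for _ in range(pop_size)]
    probs = {"mutate": 0.5, "crossover": 0.5}   # adapted over the run
    for _ in range(generations):
        wins = {"mutate": 1, "crossover": 1}    # Laplace-smoothed credit
        tries = {"mutate": 2, "crossover": 2}
        new_pop = []
        for _ in range(pop_size):
            parent = max(random.sample(pop, 3), key=fitness)   # tournament
            op = random.choices(list(probs), weights=list(probs.values()))[0]
            if op == "mutate":
                child = mutate(parent)
            else:
                mate = max(random.sample(pop, 3), key=fitness)
                child = crossover(parent, mate)
            tries[op] += 1
            if fitness(child) > fitness(parent):
                wins[op] += 1
            new_pop.append(child)
        # Re-estimate operator probabilities from observed success rates.
        rates = {op: wins[op] / tries[op] for op in probs}
        total = sum(rates.values())
        probs = {op: r / total for op, r in rates.items()}
        pop = new_pop
    return max(pop, key=fitness), probs

if __name__ == "__main__":
    best, probs = run()
    print(fitness(best), probs)
```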
Article
Full-text available
We propose a new algorithm, DK-vec, for aligning pairs of Asian/Indo-European noisy parallel texts without sentence boundaries. The algorithm uses frequency, position and recency information as features for pattern matching. Dynamic Time Warping is used as the matching technique between word pairs. This algorithm produces a small bilingual lexicon which provides anchor points for alignment. 1 Introduction While much work has already been done on the automatic alignment of parallel corpora (Brown et al. 1991; Kay & Roscheisen 1993; Gale & Church 1993; Church 1993; Chen 1993; Wu 1994), there are several problems which have not been fully addressed by many of these alignment algorithms. First, many corpora are noisy; segments from the source language can be totally missing from the target language or can be substituted with a target language segment which is not a translation. Similarly, the target language may include segments whose translation is totally missing from the source corpus. ...
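The core data structure is easy to illustrate: a recency vector is the sequence of gaps between successive occurrences of a word, and two such vectors are compared with Dynamic Time Warping. The sketch below is a reconstruction of that idea only, not the DK-vec implementation; the toy occurrence positions are invented.

```python
# Sketch of the recency-vector idea behind DK-vec: a word is represented
# by the gaps between its successive occurrence positions, and two words
# from different languages are compared with Dynamic Time Warping.
# This is an illustrative reconstruction, not the authors' implementation.

def recency_vector(positions):
    """Gaps between successive offsets of a word's occurrences."""
    return [b - a for a, b in zip(positions, positions[1:])]

def dtw(a, b):
    """Classic O(len(a) * len(b)) dynamic-time-warping distance."""
    INF = float("inf")
    n, m = len(a), len(b)
    d = [[INF] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]

# Toy example: occurrence positions of a source word and two target words.
src = recency_vector([3, 40, 95, 160])
tgt_good = recency_vector([5, 44, 101, 158])   # similar distribution
tgt_bad = recency_vector([10, 15, 20, 300])    # very different distribution
print(dtw(src, tgt_good) < dtw(src, tgt_bad))  # True: good candidate wins
```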
Article
Full-text available
This paper describes a unified framework for bilingual text matching by combining existing hand-written bilingual dictionaries and statistical techniques. The process of bilingual text matching consists of two major steps: sentence alignment and structural matching of bilingual sentences. Statistical techniques are applied to estimate word correspondences not included in bilingual dictionaries. Estimated word correspondences are useful for improving both sentence alignment and structural matching. 1 Introduction Bilingual (or parallel) texts are useful as resources of linguistic knowledge as well as in applications such as machine translation. One of the major approaches to analyzing bilingual texts is the statistical approach. The statistical approach involves the following: alignment of bilingual texts at the sentence level using statistical techniques (e.g. Brown, Lai and Mercer (1991), Gale and Church (1993), Chen (1993), and Kay and Roscheisen (1993)), statistical machine trans...
Article
Full-text available
This article presents a combination of unsupervised and supervised learning techniques for the generation of word segmentation rules from a raw list of words. First, a language bias for word segmentation is introduced and a simple genetic algorithm is used in the search for a segmentation that corresponds to the best bias value. In the second phase, the words segmented by the genetic algorithm are used as an input for the first order decision list learner Clog. The result is a set of first order rules which can be used for segmentation of unseen words. When applied on either the training data or unseen data, these rules produce segmentations which are linguistically meaningful, and to a large degree conforming to the annotation provided. Keywords: unsupervised machine learning, inductive logic programming, natural language, word segmentation 1. Introduction Word segmentation is the task of splitting words into a number of constituents or morphemes, e.g. sleep-ing, dis-member-ed, i...
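The paper's language bias is not reproduced here, but the chromosome-as-boundary-mask idea can be sketched: each word carries a binary mask of cut points, and a stand-in fitness rewards segmentations built from a small set of reused chunks. Everything in the sketch, including the fitness, is an assumption made for illustration.

```python
# Hypothetical sketch of GA-driven word segmentation: a chromosome is a
# binary mask of boundary positions inside each word, and the fitness is
# a stand-in "bias": favour segmentations built from few, frequent chunks.
# The real bias used in the paper is not reproduced here.
import random
from collections import Counter

WORDS = ["sleeping", "walking", "walked", "sleeps", "talking", "talked"]

def segments(word, mask):
    """Cut a word at every position whose mask bit is 1."""
    out, start = [], 0
    for i, cut in enumerate(mask, 1):
        if cut:
            out.append(word[start:i])
            start = i
    out.append(word[start:])
    return out

def fitness(individual):
    """Stand-in bias: reward reuse of chunks, penalise lexicon size."""
    chunks = Counter()
    for word, mask in zip(WORDS, individual):
        chunks.update(segments(word, mask))
    reuse = sum(c for c in chunks.values() if c > 1)
    return reuse - 0.5 * len(chunks)

def random_individual():
    return [[random.randint(0, 1) for _ in range(len(w) - 1)] for w in WORDS]

def mutate(ind, rate=0.05):
    return [[bit ^ (random.random() < rate) for bit in mask] for mask in ind]

def evolve(pop_size=40, generations=200):
    pop = [random_individual() for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]
        pop = parents + [mutate(random.choice(parents)) for _ in parents]
    best = max(pop, key=fitness)
    return {w: segments(w, m) for w, m in zip(WORDS, best)}

if __name__ == "__main__":
    print(evolve())   # e.g. {'sleeping': ['sleep', 'ing'], ...} if the bias cooperates
```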
Article
Full-text available
In this paper we address the issue of efficiently and effectively handling the problem of extragrammaticality in a large-scale spontaneous spoken language system. We propose and argue in favor of ROSE, a domain independent parse-and-repair approach to the problem of interpreting extragrammaticalities in spontaneous language input. We argue that in order for an approach to robust interpretation to be practical, it must be domain independent, efficient, and effective. Where previous approaches to robust interpretation possessed one or two of these qualities, the ROSE approach uniquely possesses all three. We evaluate our approach by comparing its performance in terms of parse time and quality with a parameterized version of the minimum distance parsing (MDP) approach as well as with a more restrictive partial parser, two alternative domain independent approaches. Our analysis demonstrates that the ROSE approach performs significantly faster than a limited version of MD...
Article
We describe and experimentally evaluate an alternative algorithm for aligning and extracting vocabulary from parallel texts using recency vectors and a similarity measure based on Levenshtein distance. The work is largely inspired by Fung and McKeown's DK-vec, though we use a simpler algorithm. The technique is tested on two sets of parallel corpora involving English, French, German, Dutch, Spanish, and Japanese. We attempt to evaluate the importance of parameters such as frequency of words chosen as candidates, the effect of different language pairings, and differences between the two corpora.
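The difference from DK-vec above is the similarity measure: an edit-distance comparison of recency vectors rather than DTW. The sketch below quantises the gaps into coarse bins before computing Levenshtein distance; the binning step is my own simplification, not the paper's procedure.

```python
# Illustrative variant: compare recency vectors with Levenshtein distance
# after quantising the gaps into coarse symbols. The binning scheme is an
# assumption introduced for this sketch, not taken from the paper.

def quantise(gaps, bin_size=25):
    return [g // bin_size for g in gaps]

def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]

print(levenshtein(quantise([37, 55, 65]), quantise([39, 57, 57])))  # small distance
```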
Article
The grammars of natural languages may be learned by using genetic algorithms that reproduce and mutate grammatical rules and part-of-speech tags, improving the quality of later generations of grammatical components. Syntactic rules are randomly generated and then evolve; those rules resulting in improved parsing and occasionally improved retrieval and filtering performance are allowed to further propagate. The LUST system learns the characteristics of the language or sublanguage used in document abstracts by learning from the document rankings obtained from the parsed abstracts. Unlike the application of traditional linguistic rules to retrieval and filtering applications, LUST develops grammatical structures and tags without the prior imposition of some common grammatical assumptions (e.g. part-of-speech assumptions), producing grammars that are empirically based and are optimized for this particular application.
Conference Paper
A method for generating a machine translation (MT) dictionary from parallel texts is described. This method utilizes both statistical information and linguistic information to obtain corresponding words or phrases in parallel texts. By combining these two types of information, translation pairs which cannot be obtained by a linguistic-based method can be extracted. Over 70% accurate translations of compound nouns and over 50% of unknown words are obtained as the first candidate from small Japanese/English parallel texts containing severe distortions.
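The abstract combines statistical and linguistic information; as a hedged illustration of the statistical half alone, the sketch below scores candidate word pairs from sentence-aligned text with the Dice coefficient over co-occurrence counts. The scoring choice and the toy data are assumptions, not the method described in the paper.

```python
# Rough illustration of the statistical side of dictionary extraction:
# score source/target word pairs from sentence-aligned text with the
# Dice coefficient over sentence co-occurrence counts. Purely illustrative.
from collections import Counter

PAIRS = [
    ("the dog runs", "le chien court"),
    ("the cat sleeps", "le chat dort"),
    ("a dog sleeps", "un chien dort"),
]

src_freq, tgt_freq, joint = Counter(), Counter(), Counter()
for src, tgt in PAIRS:
    s_words, t_words = set(src.split()), set(tgt.split())
    src_freq.update(s_words)
    tgt_freq.update(t_words)
    joint.update((s, t) for s in s_words for t in t_words)

def dice(s, t):
    return 2 * joint[(s, t)] / (src_freq[s] + tgt_freq[t])

# Best translation candidate for each source word, by Dice score.
for s in sorted(src_freq):
    best = max(tgt_freq, key=lambda t: dice(s, t))
    print(s, "->", best, round(dice(s, best), 2))
```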
Conference Paper
The paper argues that a promising way to improve the success rate of preference-based anaphora resolution algorithms is the use of machine learning. The paper outlines MARS - a program for automatic resolution of pronominal anaphors - and describes an experiment which we conducted to optimise the success rate of MARS with the help of a genetic algorithm. After the optimisation we noted an improvement of up to 8% for some files. The results obtained after optimisation are discussed.
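MARS itself is not reproduced here, but the optimisation step can be sketched: a chromosome is a vector of weights over antecedent indicators, and fitness is resolution accuracy on annotated examples. The indicator layout, toy data and GA settings below are placeholders, not the actual MARS factors.

```python
# Hedged sketch of GA-based weight tuning for a preference-based anaphora
# resolver. Indicator scores and annotated examples are invented placeholders.
import random

# Each example: candidate antecedents, each a vector of indicator scores
# (e.g. recency, agreement, ...), plus the index of the correct antecedent.
EXAMPLES = [
    ([[0.9, 1.0], [0.2, 0.0], [0.5, 1.0]], 0),
    ([[0.1, 1.0], [0.8, 1.0], [0.3, 0.0]], 1),
    ([[0.4, 0.0], [0.2, 1.0], [0.9, 1.0]], 2),
]
N_INDICATORS = 2

def accuracy(weights):
    correct = 0
    for candidates, gold in EXAMPLES:
        scores = [sum(w * f for w, f in zip(weights, cand)) for cand in candidates]
        correct += scores.index(max(scores)) == gold
    return correct / len(EXAMPLES)

def evolve(pop_size=30, generations=50):
    pop = [[random.uniform(0, 1) for _ in range(N_INDICATORS)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=accuracy, reverse=True)
        parents = pop[: pop_size // 2]
        children = []
        for _ in range(pop_size - len(parents)):
            a, b = random.sample(parents, 2)
            # Blend crossover with small Gaussian mutation.
            children.append([(x + y) / 2 + random.gauss(0, 0.1) for x, y in zip(a, b)])
        pop = parents + children
    return max(pop, key=accuracy)

if __name__ == "__main__":
    w = evolve()
    print(round(accuracy(w), 2), [round(x, 2) for x in w])
```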
Conference Paper
The Linguistic Data Consortium (LDC) is currently involved in a major effort to expand its multilingual text resources, in particular for machine translation, message understanding and information retrieval research. The main sources for data acquisition are governmental and international organizations, newswire services, and diverse publishers. This paper describes some of the research that is being done to identify potential resources, discusses some of the process involved in negotiating the broadest possible access to the material for the human language technology research community, and identifies key issues and considerations in transducing the text into common and well documented formats.
Article
David Goldberg's Genetic Algorithms in Search, Optimization and Machine Learning is by far the bestselling introduction to genetic algorithms. Goldberg is one of the preeminent researchers in the field--he has published over 100 research articles on genetic algorithms and is a student of John Holland, the father of genetic algorithms--and his deep understanding of the material shines through. The book contains a complete listing of a simple genetic algorithm in Pascal, which C programmers can easily understand. The book covers all of the important topics in the field, including crossover, mutation, classifier systems, and fitness scaling, giving a novice with a computer science background enough information to implement a genetic algorithm and describe genetic algorithms to a friend.
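The book's listing is in Pascal; the sketch below is a loose Python paraphrase of the same kind of simple-GA loop (roulette-wheel selection with linear fitness scaling, one-point crossover, bit-flip mutation) rather than a transcription of Goldberg's code.

```python
# Loose Python paraphrase of a "simple GA" loop of the kind the book
# describes: roulette-wheel selection with linear fitness scaling,
# one-point crossover, and bit-flip mutation. Not the book's code.
import random

GENOME_LEN, POP_SIZE, P_CROSS, P_MUT = 20, 30, 0.7, 0.01

def fitness(bits):                      # toy objective: count of 1-bits
    return sum(bits)

def scale(fits, c=2.0):
    """Linear fitness scaling: stretch so the best is roughly c * average."""
    avg, best = sum(fits) / len(fits), max(fits)
    if best == avg:
        return [1.0] * len(fits)
    a = (c - 1.0) * avg / (best - avg)
    b = avg * (1.0 - a)
    return [max(a * f + b, 0.0) for f in fits]

def roulette(pop, weights):
    return random.choices(pop, weights=weights, k=1)[0]

def step(pop):
    weights = scale([fitness(ind) for ind in pop])
    nxt = []
    while len(nxt) < POP_SIZE:
        p1, p2 = roulette(pop, weights), roulette(pop, weights)
        if random.random() < P_CROSS:                  # one-point crossover
            cut = random.randrange(1, GENOME_LEN)
            p1, p2 = p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]
        # Bit-flip mutation on both children.
        nxt += [[bit ^ (random.random() < P_MUT) for bit in child] for child in (p1, p2)]
    return nxt[:POP_SIZE]

pop = [[random.randint(0, 1) for _ in range(GENOME_LEN)] for _ in range(POP_SIZE)]
for _ in range(60):
    pop = step(pop)
print(max(fitness(ind) for ind in pop))   # should approach GENOME_LEN
```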
Article
Technical term translation represents one of the most difficult tasks for human translators since (1) most translators are not familiar with term- and domain-specific terminology and (2) such terms are not adequately covered by printed dictionaries. This paper describes an algorithm for translating technical words and terms from noisy parallel corpora across language groups. Given any word which is part of a technical term in the source language, the algorithm produces a ranked candidate match for it in the target language. Potential translations for the term are compiled from the matched words and are also ranked. We show how this ranked list helps translators for technical term translation. Most algorithms for lexical and term translation focus on Indo-European language pairs, and most use a sentence-aligned clean parallel corpus without insertion, deletion or OCR noise. Our algorithm is language and character-set independent, and is robust to noise in the corpus. We show how our a...
Article
An Example-Based Machine Translation system is supplied with a sentence-aligned bilingual corpus, but no other knowledge sources. Using the knowledge implicit in the corpus, it generates a bilingual word-for-word dictionary for alignment during translation. With such an automatically-generated dictionary, the system covers (with equivalent quality) more of its input on unseen texts than the same system does when provided with a manually-created general-purpose dictionary and other knowledge sources. 1 Introduction Previous work ([Brown, 1996, Frederking and Brown, 1996]) on the Pangloss Example-Based Machine Translation engine (PanEBMT) has always assumed the availability of knowledge sources in addition to the sentence-aligned bilingual corpus, particularly a large bilingual dictionary. Although more readily available and/or acquired than, for example, the ontologies and other knowledge sources for a knowledge-based translation system, generating these additional EBMT knowledge sour...
Article
Automatic part of speech tagging is an area of natural language processing where statistical techniques have been more successful than rule-based methods. In this paper, we present a simple rule-based part of speech tagger which automatically acquires its rules and tags with accuracy comparable to stochastic taggers. The rule-based tagger has many advantages over these taggers, including: a vast reduction in stored information required, the perspicuity of a small set of meaningful rules, ease of finding and implementing improvements to the tagger, and better portability from one tag set, corpus genre or language to another. Perhaps the biggest contribution of this work is in demonstrating that the stochastic method is not the only viable method for part of speech tagging. The fact that a simple rule-based tagger that automatically learns its rules can perform so well should offer encouragement for researchers to further explore rule-based tagging, searching for a better and more expressive set of rule templates and other variations on the simple but effective theme described below.
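A miniature of how such transformation rules are applied at tagging time is sketched below: every word starts with its most likely tag, then ordered, context-triggered rewrite rules fire in sequence. The lexicon and rules are invented for the example and are not taken from the paper.

```python
# Miniature of Brill-style rule application (not the original tagger):
# start from each word's most likely tag, then apply ordered,
# context-triggered transformation rules. Lexicon and rules are invented.

MOST_LIKELY_TAG = {"the": "DET", "can": "MD", "fish": "NN", "rusts": "VBZ"}

# Each rule: (from_tag, to_tag, trigger), where trigger(prev_tag) says
# whether the rewrite fires given the previous word's current tag.
RULES = [
    ("MD", "NN", lambda prev: prev == "DET"),   # "the can" -> noun reading
    ("NN", "VB", lambda prev: prev == "MD"),    # "can fish" -> verb reading
]

def tag(words):
    tags = [MOST_LIKELY_TAG.get(w, "NN") for w in words]   # initial tagging
    for frm, to, trigger in RULES:
        for i in range(1, len(tags)):
            if tags[i] == frm and trigger(tags[i - 1]):
                tags[i] = to
    return list(zip(words, tags))

print(tag(["the", "can", "rusts"]))   # [('the','DET'), ('can','NN'), ('rusts','VBZ')]
print(tag(["can", "fish"]))           # [('can','MD'), ('fish','VB')]
```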
Article
Example-Based Machine Translation (EBMT) using partial exact matching against a database of translation examples has proven quite successful, but requires a large amount of pre-translated text in order to achieve broad coverage of unrestricted text. By adding linguistically tagged entries to the example base and permitting recursive matches that replace the matched text with the associated tag, substantial reductions in the required amount of pre-translated text can be achieved. A modest investment of time -- on the order of two person-weeks -- adding linguistic knowledge reduces the required example text by a factor of six or more, while retaining comparable translation quality. This reduction makes EBMT more attractive for so-called "low-density" languages for which little data is available. 1 Introduction The example-based machine translation engine used in the Pangloss and DIPLOMAT projects (Brown 1996) has, until now, been purely lexical. Unlike other EBMT systems which...
Bilingual Text Matching Using Bilingual Dictionary and Statistics
  • T Utsuro
Utsuro, T. et al. (1994) Bilingual Text Matching Using Bilingual Dictionary and Statistics. In Proceedings of the International Conference on Computational Linguistics, pp. 1076-1082, Kyoto.
Automated Dictionary Extraction for "Knowledge-Free" Example-Based Translation
  • R Brown
Brown, R. (1997) Automated Dictionary Extraction for "Knowledge-Free" Example-Based Translation. In Proceedings of the Seventh International Conference on Theoretical and Methodological Issues in Machine Translation, pp. 111-118, Santa Fe.
A Technical Word- and Term-Translation Aid Using Noisy Parallel Corpora Across Language Groups
  • P Fung
  • K McKeown
Fung, P. and McKeown, K. (1997) A Technical Word- and Term-Translation Aid Using Noisy Parallel Corpora Across Language Groups. Machine Translation, Vol. 12, Nos. 1-2, pp. 53-87.
Multilingual Text Resources at the Linguistic Data Consortium
  • D Graff
  • R Finch
Graff, D. and Finch, R. (1994) Multilingual Text Resources at the Linguistic Data Consortium. In Proceedings of the 1994 ARPA Human Language Technology Workshop. Morgan Kaufmann.
High-Precision Bilingual Text Alignment Using Statistical and Dictionary Information
  • M Haruno
  • T Yamazaki
Haruno, M. and Yamazaki, T. (1996) High-Precision Bilingual Text Alignment Using Statistical and Dictionary Information. In Proceedings of the Annual Conference of the Association for Computational Linguistics, pp. 131-138.
Some More Experiments in Bilingual Text Alignments
  • H Somers
  • A Ward
Somers, H. and Ward, A. (1996) Some More Experiments in Bilingual Text Alignments. In Oflazer, K. and Somers, H. (eds) Proceedings of the Second International Conference on New Methods in Language Processing, pp. 66-78, Ankara.