Article

Errors of Omission in Translation


Abstract

Automatic detection of translation errors represents one of the more promising applications of NLP techniques to this domain. This paper concentrates on one class of error, the inadvertent omission. To a greater extent than 'false friends', terminological inconsistency, etc., the detection of omissions raises problems both theoretical and practical in nature. These problems are discussed, and a technique is presented for identifying possible omissions in a completed translation by employing a model of translational equivalence between words. Examples are taken from a varied corpus of French-English bitext, and illustrate how different settings of the parameters of the system affect its performance. The approach is implemented as part of a translation-checking program.

1 Introduction

It has long been recognized that the provision of aids for translators is a promising area for the application of NLP techniques in the domain of translation. A recurring theme (Bashkansky et al....
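The sketch below is a rough illustration of the kind of check the abstract describes: it flags contiguous source-text spans whose words have no plausible counterpart in the corresponding target segment. It is a minimal sketch only, assuming a sentence-aligned bitext and a simple bilingual lexicon; the function name, the lexicon format and the min_run threshold are illustrative stand-ins for the paper's actual model of translational equivalence and its parameters.

# Minimal sketch (not the paper's implementation): flag contiguous runs of
# source tokens that have no plausible translation in the target segment.
# The lexicon format and the min_run threshold are illustrative assumptions.

from typing import Dict, List, Set, Tuple

def find_omission_candidates(
    source_tokens: List[str],
    target_tokens: List[str],
    lexicon: Dict[str, Set[str]],   # source word -> plausible target words
    min_run: int = 4,               # shortest uncovered run worth flagging
) -> List[Tuple[int, int]]:
    """Return (start, end) spans of source tokens with no counterpart
    in the target segment."""
    target_vocab = set(target_tokens)
    uncovered = [
        tok not in target_vocab and not (lexicon.get(tok, set()) & target_vocab)
        for tok in source_tokens
    ]
    spans: List[Tuple[int, int]] = []
    start = None
    for i, miss in enumerate(uncovered + [False]):  # sentinel closes a final run
        if miss and start is None:
            start = i
        elif not miss and start is not None:
            if i - start >= min_run:
                spans.append((start, i))
            start = None
    return spans

Raising min_run would trade recall for precision, since short uncovered runs are more often explained by free translation than by a genuine omission; this is the kind of parameter setting whose effect the abstract says the paper illustrates.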


... The basic idea here is to assimilate an omission to a particular type of alignment, where an important contiguous set of words present in the source text cannot be aligned at the word level with the target text. For this we rely on mechanisms similar to those described in Russell [15]. We can distinguish between small omissions (a couple of sentences) and big omissions (anything bigger than a few paragraphs). ...
Conference Paper
Over the past decade or so, a lot of work in computational linguistics has been directed at finding ways to exploit the ever increasing volume of electronic bilingual corpora. These efforts have allowed for substantial expansion of the computational toolbox. We describe a system, TransCheck, which makes intensive use of these new tools in order to detect potential translation errors in preliminary or non-revised translations.
... There have been some suggestions to check for deceptive cognates (Isabelle et al. 1993) or omissions in translation (Russell 1999), and to choose articles (Chander 1998). At present, these techniques are not yet very useful in practice, probably because the revision systems are not able to consider the meaning of the text. ...
Article
Full-text available
In this paper, I suggest several ways to improve machine translation, based on the best practices of human translators, as described in Nida's (1964) Toward a Science of Translating. I call this approach multi-pass machine translation (MPMT), as it crucially relies on processing the text more than once. It is similar to the opportunistic bricoleur approach of Gdaniec (1999) in that it sets out to use the means at hand, adding to or changing them as necessary. As Schutz (2001) points out, much of the research in the past decade has concentrated on the important but non-core issues of integrating MT into DTP formats and HTML. In this paper I concentrate on improving the MT engine itself. The resulting approach integrates much recent research into a single system.
... The basic idea here is to assimilate an omission to a particular type of alignment, where an important contiguous set of words present in the source text cannot be aligned at the word level with the target text. For this we rely on mechanisms similar to those described in Russell [15]. ...
Article
Over the past decade or so, a lot of work in computational linguistics has been directed at finding ways to exploit the ever increasing volume of electronic bilingual corpora. These efforts have allowed for substantial expansion of the computational toolbox. We describe a system, TransCheck, which makes intensive use of these new tools in order to detect potential translation errors in preliminary or non-revised translations.
Article
Full-text available
We examine two North American case studies, each of which illustrates a different strategy for coming to terms with high-volume, high-quality translation. The first eschews MT in favour of translation memory technology; the second employs a controlled language to simplify the input to an MT system. Both strategies betray a certain dissatisfaction with the current state of machine translation, although neither alternative, it turns out, fully lives up to its expectations.
Conference Paper
Full-text available
In this paper we present a model for the future use of Machine Translation (MT) and Computer Assisted Translation. In order to accommodate the future needs in middle value translations, we discuss a number of MT techniques and architectures. We anticipate a hybrid environment that integrates data- and rule-driven approaches where translations will be routed through the available translation options and consumers will receive accurate information on the quality, pricing and time implications of their translation choice.
Conference Paper
Full-text available
Researchers in both machine translation (e.g., Brown et al., 1990) and bilingual lexicography (e.g., Klavans and Tzoukermann, 1990) have recently become interested in studying parallel texts, texts such as the Canadian Hansards (parliamentary proceedings) which are available in multiple languages (French and English). This paper describes a method for aligning sentences in these parallel texts, based on a simple statistical model of character lengths. The method was developed and tested on a small trilingual sample of Swiss economic reports. A much larger sample of 90 million words of Canadian Hansards has been aligned and donated to the ACL/DCI.
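As a hedged illustration of the length-based idea described above, the sketch below computes a match cost for a candidate pair of segments from their character lengths alone. The constants c (expected target/source length ratio) and s2 (variance per character) are the values commonly cited for this family of methods, not figures taken from the paper, and the dynamic program over 1-1, 1-0, 0-1, 2-1, 1-2 and 2-2 sentence beads that the full method requires is omitted.

# Hedged sketch of a character-length match cost in the style of length-based
# sentence alignment. Constants and the simplified variance term are
# illustrative assumptions; the full method also needs a dynamic program
# over candidate sentence beads, which is not shown here.

import math

def length_match_cost(len_src: int, len_tgt: int,
                      c: float = 1.0, s2: float = 6.8) -> float:
    """Cost (negative log of a two-tailed normal probability) that two
    segments of the given character lengths are mutual translations."""
    if len_src == 0 and len_tgt == 0:
        return 0.0
    mean = (len_src + len_tgt / c) / 2.0
    delta = (len_tgt - len_src * c) / math.sqrt(mean * s2)
    # two-tailed probability of a deviation at least this large under N(0, 1)
    tail = 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(delta) / math.sqrt(2.0))))
    return -math.log(max(tail, 1e-300))

Lower cost means the two lengths are more compatible; an aligner would minimize the total cost over a path of such beads spanning the two texts.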
Article
Full-text available
We present an algorithm for aligning texts with their translations that is based only on internal evidence. The relaxation process rests on a notion of which word in one text corresponds to which word in the other text that is essentially based on the similarity of their distributions. It exploits a partial alignment of the word level to induce a maximum likelihood alignment of the sentence level, which is in turn used, in the next iteration, to refine the word level estimate. The algorithm appears to converge to the correct sentence alignment in only a few iterations.
Article
Full-text available
The Corpus Encoding Standard (CES) is an application of SGML (ISO 8879:1986, Information Processing--Text and Office Systems--Standard Generalized Markup Language), conformant to the TEI Guidelines for Electronic Text Encoding and Interchange (Sperberg-McQueen and Burnard, 1994). It provides encoding conventions for linguistic corpora designed to be optimally suited for use in language engineering and to serve as a widely accepted set of encoding standards for corpus-based work. The CES identifies a minimal encoding level that corpora must achieve to be considered standardized in terms of descriptive representation (marking of structural and linguistic information). It also provides encoding conventions for more extensive encoding and for linguistic annotation, as well as a general architecture for representing corpora annotated for linguistic features. The CES has been developed taking into account several practical realities surrounding the encoding of corpora intended for use in l...
Article
Full-text available
The BAF is a corpus of English and French translations, hand-aligned at the sentence level, which was developed by the University of Montreal's RALI laboratory within the "Action de recherche concertée" (ARC) A2, a cooperative research project initiated and financed by the AUPELF-UREF. The corpus, which totals approximately 800 000 words, is primarily intended as an evaluation tool in the development of automatic bilingual text alignment methods. In this paper, we discuss why this corpus was assembled, how it was produced, and what it contains. We also describe some of the computer tools that were developed and used in the process.

1 Introduction

The BAF is a corpus of English and French bitext: it consists of pairs of English and French documents, which are translations of one another, and whose sentences have been aligned. The corpus was produced by researchers at the CITI, a Canadian government research laboratory, as part of their contribution to the "Action de recherche con...
Article
Full-text available
In a recent paper, Gale and Church describe an inexpensive method for aligning bitext, based exclusively on sentence lengths [Gale and Church, 1991]. While this method produces surprisingly good results (a success rate around 96%), even better results are required to perform such tasks as the computer-assisted revision of translations. In this paper, we examine some of the weaknesses of Gale and Church's program, and explain how just a small amount of linguistic knowledge would help to overcome these weaknesses. We discuss how cognates provide for a cheap and reasonably reliable source of linguistic knowledge. To illustrate this, we describe a modification to the program in which the criterion is cognates rather than sentence lengths. Finally, we show how better and more efficient results may be obtained by combining the two criteria: length and "cognateness". Our method can be generalized to accommodate other sources of linguistic knowledge, and experimentation shows that it produc...
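Below is a sketch of the kind of cognate criterion this line of work relies on. The four-character shared-prefix rule is the heuristic commonly associated with cognate-based alignment; the exact test and the greedy counting scheme here are illustrative assumptions, not the authors' code.

# Hedged sketch of a crude cognate test and a per-segment cognate count.
# The prefix length and the greedy matching are illustrative assumptions.

from typing import List

def are_cognates(w1: str, w2: str, prefix_len: int = 4) -> bool:
    """Identical tokens, or alphabetic tokens of length >= prefix_len
    sharing their first prefix_len characters."""
    w1, w2 = w1.lower(), w2.lower()
    if w1 == w2:
        return True
    if (w1.isalpha() and w2.isalpha()
            and len(w1) >= prefix_len and len(w2) >= prefix_len):
        return w1[:prefix_len] == w2[:prefix_len]
    return False

def cognate_count(src_tokens: List[str], tgt_tokens: List[str]) -> int:
    """Greedy count of cognate pairs between two candidate segments."""
    remaining = list(tgt_tokens)
    count = 0
    for s in src_tokens:
        for i, t in enumerate(remaining):
            if are_cognates(s, t):
                count += 1
                del remaining[i]
                break
    return count

A segment pair with many cognates is strong evidence for an alignment, which is why combining this signal with length works better than either criterion alone.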
Article
Full-text available
Much work has been done on the statistical analysis of text. In some cases reported in the literature, inappropriate statistical methods have been used, and the statistical significance of results has not been addressed. In particular, asymptotic normality assumptions have often been used unjustifiably, leading to flawed results. This assumption of normal distribution limits the ability to analyze rare events. Unfortunately, rare events do make up a large fraction of real text. However, more applicable methods based on likelihood ratio tests are available that yield good results with relatively small samples. These tests can be implemented efficiently, and have been used for the detection of composite terms and for the determination of domain-specific terms. In some cases, these measures perform much better than the methods previously used. In cases where traditional contingency table methods work well, the likelihood ratio tests described here are nearly identical. This paper describes the basis of a measure based on likelihood ratios that can be applied to the analysis of text.
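The measure described above can be written down concretely for the common 2x2 case. The sketch below computes the log-likelihood ratio statistic (G^2) from a contingency table of co-occurrence counts, where k11 counts joint occurrences, k12 and k21 count one event without the other, and k22 counts neither; the variable names are illustrative.

# Log-likelihood ratio (G^2) for a 2x2 contingency table of counts.
# Variable names are illustrative; the formula is the standard one.

import math

def _xlogx(x: float) -> float:
    """x * log(x), with the conventional value 0 at x = 0."""
    return x * math.log(x) if x > 0 else 0.0

def log_likelihood_ratio(k11: int, k12: int, k21: int, k22: int) -> float:
    """G^2 statistic for a 2x2 contingency table."""
    row1, row2 = k11 + k12, k21 + k22
    col1, col2 = k11 + k21, k12 + k22
    n = k11 + k12 + k21 + k22
    return 2.0 * (_xlogx(k11) + _xlogx(k12) + _xlogx(k21) + _xlogx(k22)
                  - _xlogx(row1) - _xlogx(row2)
                  - _xlogx(col1) - _xlogx(col2)
                  + _xlogx(n))

Large values indicate that the two events co-occur more (or less) often than independence would predict, and, unlike normal-approximation tests, the statistic remains well behaved for the rare events that dominate real text.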
Conference Paper
A description is given of the present state of development of a workstation that has been designed to provide the translator with efficient and easy-to-use computational tools. The aim is to offer translators fast and flexible on-line access to existing dictionary databases and bilingual text archives and also to supply them with facilities for updating, adding to and personalizing the system data archives with their own material.
Conference Paper
Although the problem of full machine translation (MT) remains unsolved, computer-aided translation (CAT) is making progress. In this field we created a work environment for the monolingual translator. This package of tools generally enables a user who masters a source language to translate texts into a target language which the user does not master. The application is for the Hebrew-to-Russian case, emphasizing specific problems of these languages, but it can be adapted for other pairs of languages as well. After Source Text Preparation, Morphological Analysis provides all the meanings for every word. The ambiguity problem is very serious in languages with incomplete writing, like Hebrew. But the main problem is the translation itself. The mapping of word meanings between languages is M:M, i.e., almost every source word has a number of possible translations, and almost every target word can be a translation of several words. Many methods for resolving these ambiguities propose using large databases, like dictionaries with semantic fields based on θ-theory. The amount of information needed to deal with general texts is prohibitively large. We propose here to solve ambiguities by a new method: Accumulation with Inversion and then Weighted Selection, plus Learning, using only two regular dictionaries: from source to target and from target to source languages. The method is built from a number of phases: (1) during Accumulation with Inversion, all the possible translations to the target language of every word are brought, and every one of them is translated back to the source language; (2) Selection of suitable suggestions is made by the user in the source language; this is the only manual phase; (3) Weighting of the selection's results is done by software and determines the most suitable translation into the target language; (4) Learning of the word's context will provide the preferable translation in the future. Target Text Generation is based on morphological records in the target language that are produced by the disambiguation phase. To complete the missing features for word building, we propose here a method of Features Expansion. This method is based on assumptions about feature flow through the sentence, and on the dependence of grammatical phenomena in the two languages. The software of the workstation combines four tools: Source Text Preparation, Morphological Analysis, Disambiguation and Target Text Generation. The application includes an elaborated windows interface, on which the user's work is based.
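As a rough, hedged reading of the Accumulation with Inversion and Weighted Selection phases described above, the sketch below collects the candidate target translations of a source word together with their back-translations, and then scores each candidate by its overlap with the source-language suggestions the user approved. The dictionary formats, function names and the overlap-count weighting are illustrative assumptions, not the authors' implementation.

# Hedged sketch: accumulate candidate translations with their
# back-translations, then pick the candidate whose back-translations
# best overlap the user's approved source-language suggestions.

from typing import Dict, List

def accumulate_with_inversion(
    word: str,
    src_to_tgt: Dict[str, List[str]],
    tgt_to_src: Dict[str, List[str]],
) -> Dict[str, List[str]]:
    """Map each candidate target translation of `word` to its
    back-translations in the source language."""
    return {cand: tgt_to_src.get(cand, [])
            for cand in src_to_tgt.get(word, [])}

def weighted_selection(candidates: Dict[str, List[str]],
                       approved: List[str]) -> str:
    """Pick the candidate whose back-translations overlap most with the
    source-language suggestions the user approved."""
    if not candidates:
        return ""
    approved_set = set(approved)
    return max(candidates,
               key=lambda c: len(approved_set & set(candidates[c])))

The appeal of the scheme is that it needs only the two ordinary bilingual dictionaries, with the user's judgements supplying the disambiguation signal in the source language.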
Article
This article presents methods for biasing statistical translation models to reflect these properties. Analysis of the expected behavior of these biases in the presence of sparse data predicts that they will result in more accurate models. The prediction is confirmed by evaluation with respect to a gold standard: translation models that are biased in this fashion are significantly more accurate than a baseline knowledge-poor model. This article also shows how a statistical translation model can take advantage of various kinds of pre-existing knowledge that might be available about particular language pairs. Even the simplest kinds of language-specific knowledge, such as the distinction between content words and function words, are shown to reliably boost translation model performance on some tasks. Statistical translation models that are informed by pre-existing knowledge about the model domain combine the best of both the rationalist and empiricist traditions.
Article
A model of co-occurrence in bitext is a boolean predicate that indicates whether a given pair of word tokens co-occur in corresponding regions of the bitext space. Co-occurrence is a precondition for the possibility that two tokens might be mutual translations. Models of co-occurrence are the glue that binds methods for mapping bitext correspondence with methods for estimating translation models into an integrated system for exploiting parallel texts. Different models of co-occurrence are possible, depending on the kind of bitext map that is available, the language-specific information that is available, and the assumptions made about the nature of translational equivalence. Although most statistical translation models are based on models of co-occurrence, modeling co-occurrence correctly is more difficult than it may at first appear.

1 Introduction

Most methods for estimating translation models from parallel texts (bitexts) start with the following intuition: Words that are translatio...
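A minimal sketch of such a predicate is given below, assuming the bitext map is available as a list of corresponding token-index spans; that representation, the type alias and the function names are illustrative, not taken from the article.

# Hedged sketch of a boolean co-occurrence predicate over a segment-aligned
# bitext. The (source_span, target_span) representation of the bitext map
# is an illustrative assumption.

from typing import Callable, List, Tuple

Span = Tuple[int, int]  # half-open [start, end) token-index range

def make_cooccurrence_predicate(
    bitext_map: List[Tuple[Span, Span]],
) -> Callable[[int, int], bool]:
    """Return cooccur(i, j): True iff source token i and target token j
    fall inside corresponding regions of the bitext map."""
    def cooccur(i: int, j: int) -> bool:
        return any(s0 <= i < s1 and t0 <= j < t1
                   for (s0, s1), (t0, t1) in bitext_map)
    return cooccur

Different bitext maps (sentence-level, paragraph-level, or finer) simply yield different span lists, which is the sense in which different models of co-occurrence are possible.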
Bashkansky, Guy and Uzzi Ornan: 1998, 'Monolingual Translator Workstation', in D. Farwell, L. Gerber and E. Hovy (eds.) Machine Translation and the Information Soup: Proceedings of the Third Conference of the Association for Machine Translation in the Americas, Lecture Notes in Artificial Intelligence no. 1529, Berlin: Springer, pp. 136–149.

Kay, Martin and Martin Röscheisen: 1993, 'Text-Translation Alignment', Computational Linguistics 19(1), pp. 121–142.

Macklovitch, Elliott: 1993, 'PTT-3: A Third Version of the CITI's Workstation for Translators', Technical Report, CITI.

Melamed, I. Dan: 1996, 'Automatic Detection of Omissions in Translation', in 16th International Conference on Computational Linguistics, Copenhagen, pp. 764–769.