Article

Bilingual Concordancing and Bilingual Lexicography

... One of the main activities associated with building such a corpus is developing software for parallel concordancing, in which a user can enter a search string in one language and see all citations for that string in the search language as well as corresponding sentences in the target language. Aligned bilingual corpora have in fact proved useful in many tasks, including machine translation (Brown et al. 1990; Sadler 1989), sense disambiguation (Brown et al. 1991a; Dagan et al. 1991; Gale et al. 1991), cross-language information retrieval (Davis and Dunning 1995; Landauer and Littman 1990; Oard 1997) and bilingual lexicography (Klavans and Tzoukermann 1990; Warwick and Russell 1990). ...
Article
Full-text available
In recent years the exploitation of large text corpora in solving various kinds of linguistic problems, including those of translation, is commonplace. Yet a large-scale English-Persian corpus is still unavailable, because of certain difficulties and the amount of work required to overcome them. The project reported here is an attempt to constitute an English-Persian parallel corpus composed of digital texts and Web documents containing little or no noise. The Internet is useful because translations of existing texts are often published on the Web. The task is to find parallel pages in English and Persian, to judge their translation quality, and to download and align them. The corpus so created is of course open; that is, more material can be added as the need arises. One of the main activities associated with building such a corpus is to develop software for parallel concordancing, in which a user can enter a search string in one language and see all the citations for that string in the search language as well as the corresponding sentences in the target language. Our intention is to construct general translation memory software using the present English-Persian parallel corpus.
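The parallel-concordancing operation this abstract describes can be sketched in a few lines, assuming the corpus has already been sentence-aligned into (source, target) pairs; the sample sentences and Persian transliterations below are invented for illustration:

```python
# Minimal parallel-concordance lookup over pre-aligned sentence pairs.
# The list-of-tuples corpus format is an assumption for this sketch;
# a real system would load aligned files or a translation-memory store.

def concordance(pairs, query):
    """Return every (source, target) pair whose source sentence
    contains the query string (case-insensitive)."""
    q = query.lower()
    return [(src, tgt) for src, tgt in pairs if q in src.lower()]

# Toy aligned corpus; the Persian transliterations are illustrative only.
pairs = [
    ("The house is red.", "khane qermez ast."),
    ("I saw the house.", "man khane ra didam."),
    ("The sky is blue.", "aseman abi ast."),
]

for src, tgt in concordance(pairs, "house"):
    print(src, "|", tgt)
```

A production concordancer would additionally index the corpus for fast lookup and highlight the matched string in context.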
... At the end of the 20th century, natural language processing underwent a paradigm shift toward corpus-based methods exploiting corpus resources (monolingual language corpora and parallel bilingual corpora), with pioneering research in bilingual lexicography (for example, Warwick and Russell, 1990) and machine translation (for example, Sadler, 1990). A parallel corpus is a collection of texts which is translated into one or more languages in addition to the original (EAGLES, 1996). ...
Conference Paper
Full-text available
The TTC project (Terminology Extraction, Translation Tools and Comparable Corpora) has contributed to leveraging computer-assisted translation tools, machine translation systems and multilingual content (corpora and terminology) management tools by generating bilingual terminologies automatically from comparable corpora in seven EU languages, as well as Russian and Chinese. This paper presents the main concept of TTC, discusses the issue of parallel corpora scarceness and potential of comparable corpora, and briefly describes the TTC terminology extraction workflow. The TTC terminology extraction workflow includes the collection of domain-specific comparable corpora from the web, extraction of monolingual terminology in the two domains of wind energy and mobile technology, and bilingual alignment of extracted terminology. We also present TTC usage scenarios, the way in which the project deals with under-resourced and disconnected languages, and report on the project midterm progress and results achieved during the two years of the project. And finally, we touch upon the problem of under-resourced languages (for example, Latvian) and disconnected languages (for example, Latvian and Russian) covered by the project.
... The growing availability of bilingual, machine-readable texts has stimulated interest in methods for extracting linguistically valuable information from such texts. For example, a number of recent papers deal with the problem of automatically obtaining pairs of aligned sentences from parallel corpora (Warwick and Russell 1990; Brown, Lai, and Mercer 1991; Gale and Church 1991b; Kay 1991). Brown et al. (1990) assert, and Brown, Lai, and Mercer (1991) and Gale and Church (1991b) both show, that it is possible to obtain such aligned pairs of sentences without inspecting the words that the sentences contain. ...
Article
Full-text available
We describe a series of five statistical models of the translation process and give algorithms for estimating the parameters of these models given a set of pairs of sentences that are translations of one another. We define a concept of word-by-word alignment between such pairs of sentences. For any given pair of such sentences each of our models assigns a probability to each of the possible word-by-word alignments. We give an algorithm for seeking the most probable of these alignments. Although the algorithm is suboptimal, the alignment thus obtained accounts well for the word-by-word relationships in the pair of sentences. We have a great deal of data in French and English from the proceedings of the Canadian Parliament. Accordingly, we have restricted our work to these two languages; but we feel that because our algorithms have minimal linguistic content they would work well on other pairs of languages. We also feel, again because of the minimal linguistic content of our algorithms, that it is reasonable to argue that word-by-word alignments are inherent in any sufficiently large bilingual corpus.
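A minimal sketch of the kind of word-by-word translation model this abstract describes is IBM Model 1 trained with EM. The sketch below omits the NULL word, distortion, and fertility components of the fuller models in the series, and the toy sentence pairs are invented:

```python
from collections import defaultdict

def ibm_model1(pairs, iterations=10):
    """EM estimation of word-translation probabilities t(f, e) from
    (English, foreign) sentence pairs -- a stripped-down Model 1
    without the NULL word used in the full formulation."""
    f_vocab = {f for _, fs in pairs for f in fs.split()}
    # Uniform initialisation over the foreign vocabulary.
    t = defaultdict(lambda: 1.0 / len(f_vocab))
    for _ in range(iterations):
        count = defaultdict(float)   # expected co-occurrence counts
        total = defaultdict(float)   # normalisers per English word
        for es, fs in pairs:
            e_words, f_words = es.split(), fs.split()
            for f in f_words:
                norm = sum(t[(f, e)] for e in e_words)
                for e in e_words:
                    c = t[(f, e)] / norm
                    count[(f, e)] += c
                    total[e] += c
        for (f, e), c in count.items():   # M-step: renormalise
            t[(f, e)] = c / total[e]
    return t

pairs = [("the house", "la maison"),
         ("the book", "le livre"),
         ("a house", "une maison")]
t = ibm_model1(pairs)
# After training, t[("maison", "house")] dominates t[("maison", "the")].
```

Even on three toy pairs, EM resolves the ambiguity: "maison" co-occurs with both "the" and "house", but "the" also co-occurs with "livre" and "le", which pulls its probability mass away.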
... Text-based algorithms use lexical information across the language boundary to align sentences. Warwick-Armstrong and Russell (Warwick-Armstrong & Russell, 1990 ) used a bilingual dictionary to select word pairs in sentences from a parallel corpus and then aligned the sentences based on the word correspondence information. Another type of lexical information, which is helpful in alignment of European language pairs, is called cognate (Simard et al., 1992). ...
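The dictionary-based matching strategy attributed here to Warwick-Armstrong and Russell can be illustrated with a simple word-overlap score; the scoring function and toy dictionary below are assumptions for the sketch, not the authors' actual procedure:

```python
def dict_overlap_score(src_sent, tgt_sent, bilingual_dict):
    """Fraction of source words that have at least one dictionary
    translation present in the target sentence -- a crude proxy for
    the word-correspondence information used in text-based alignment."""
    src_words = src_sent.lower().split()
    tgt_words = set(tgt_sent.lower().split())
    hits = sum(1 for w in src_words
               if any(t in tgt_words for t in bilingual_dict.get(w, [])))
    return hits / len(src_words) if src_words else 0.0

# Toy English-French entries; a real system would use a full dictionary.
bilingual_dict = {"house": ["maison"], "red": ["rouge"]}
score = dict_overlap_score("the house is red", "la maison est rouge",
                           bilingual_dict)
```

Candidate sentence pairs with a high overlap score would then be accepted as alignments, with length or position constraints breaking ties.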
Article
Cross-lingual semantic interoperability has drawn significant attention in recent digital library and World Wide Web research as the information in languages other than English has grown exponentially. Cross-lingual information retrieval (CLIR) across different European languages, such as English, Spanish, and French, has been widely explored; however, CLIR across European languages and Oriental languages is still in the initial stage. To cross the language boundary, the corpus-based approach is promising for overcoming the limitations of the knowledge-based and controlled-vocabulary approaches, but collecting parallel corpora between European and Oriental languages is not an easy task. Length-based and text-based approaches are two major approaches to aligning parallel documents. In this paper, we investigate several techniques using these approaches and compare their performances in aligning English and Chinese titles of parallel documents available on the Web.
... Most bilingual concordance programs such as ISSCO's BCP program mentioned in footnote 1 of (Warwick and Russel, 1990 ...
Conference Paper
Full-text available
Researchers in both machine translation (e.g., Brown et al., 1990) and bilingual lexicography (e.g., Klavans and Tzoukermann, 1990) have recently become interested in studying parallel texts (also known as bilingual corpora), bodies of text such as the Canadian Hansards (parliamentary debates) which are available in multiple languages (such as French and English). Much of the current excitement surrounding parallel texts was initiated by Brown et al. (1990), who outline a self-organizing method for using these parallel texts to build a machine translation system.
... (Isabelle, 1992) There has recently been quite a bit of interest in programs that align parallel texts such as the Canadian Parliamentary Debates (Hansards), which are available in both English and French, e.g., Brown et al. (1991), Chen (1993), Church (1993), Gale and Church (1993), Dagan et al. (1993), Kay and Röscheisen (to appear), Kupiec (1993), Matsumoto et al. (1993), Simard et al. (1992), Warwick-Armstrong and Russell (1990). ...
... Although the corpora in this case are not parallel but only composed of comparable texts (see footnote 1), it is evident that the use of parallel corpora will become an invaluable new resource for lexicographers. The pre-alignment of texts into large units such as paragraphs or sentences has made it possible for lexicographers to use bilingual concordances to quickly scan a large number of translations of a given word and detect important contextual features such as collocations (Warwick & Russell, 1990; Church & Gale, 1991; Hartmann, 1994; Langlois, 1996; Roberts & Montgomery, 1996). Although using parallel corpora is not yet a standard procedure in the preparation of commercial dictionaries, it has already become crucial in terminology and in the design of computerised lexicons. ...
Article
This introductory chapter provides a survey of the processing and use of parallel texts, i.e., texts accompanied by their translation. Throughout the chapter, the various authors' contributions to the book are considered and related to the state of the art in the field. Three themes are addressed, corresponding to the three parts of the book: (i) techniques and methodology for the alignment of parallel texts at various levels such as sentences, clauses or words; (ii) applications of parallel texts in fields such as translation, lexicography, and information retrieval; and (iii) available corpus resources and evaluation of alignment methods.
Chapter
This introductory chapter provides a survey of the processing and use of parallel texts, i.e., texts accompanied by their translation. Throughout the chapter, the various authors’ contributions to the book are considered and related to the state of the art in the field. Three themes are addressed, corresponding to the three parts of the book: (i) techniques and methodology for the alignment of parallel texts at various levels such as sentences, clauses or words; (ii) applications of parallel texts in fields such as translation, lexicography, and information retrieval; and (iii) available corpus resources and evaluation of alignment methods.
Chapter
Ramesh Krishnamurthy was born in Madras, India, and has degrees in French and German from the University of Cambridge, and Sanskrit and Indian religions from the University of London. He worked for the COBUILD project at Birmingham University from 1984 to 2003, where he compiled and edited dictionaries, grammars, and other publications, and contributed to the development of corpora, software, and electronic products. He has been an honorary research fellow at the universities of Birmingham and Wolverhampton, and has taught on undergraduate and postgraduate courses and supervised postgraduate research. He has contributed to several European linguistics projects, and conducted workshops and courses on corpus linguistics and lexicography in several countries. He is currently Lecturer in English Studies at Aston University, Birmingham, UK.
Conference Paper
Full-text available
In this paper, we will tackle the problem raised by the automatic alignment of sentences belonging to bilingual text pairs. The method that we advocate here is inspired by what a person with a fair knowledge of the other language would do intuitively. It is based on the matching of the elements which are similar in both sentences. However, to match these elements correctly, we first have to match the sentences that contain them. There seems to be a vicious circle here. We will show how to break it. On the one hand, we will describe the hypotheses we made and, on the other hand, the algorithms which ensued. The experiments are carried out with French-English and French-Arabic text pairs. We will show that matching sentences and, later, expressions amounts to raising a new problem in the machine translation field, i.e. the problem of recognition instead of that of translation, strictly speaking.
Article
In this paper, we will tackle the problem raised by the automatic alignment of sentences belonging to bilingual text pairs. The method that we advocate here is inspired by what a person with a fair knowledge of the other language would do intuitively. It is based on the matching of the elements which are similar in both sentences. However, to match these elements correctly, we first have to match the sentences that contain them. There seems to be a vicious circle here. We will show how to break it. On the one hand, we will describe the hypotheses we made and, on the other hand, the algorithms which ensued. The experiments are carried out with French-English and French-Arabic text pairs. We will show that matching sentences and, later, expressions amounts to raising a new problem in the machine translation field, i.e. the problem of recognition instead of that of translation, strictly speaking.
Conference Paper
Cross-lingual semantic interoperability has drawn significant research attention recently, as the number of digital libraries in non-English languages has grown exponentially. Cross-lingual information retrieval (CLIR) across different European languages, such as English, Spanish and French, has been widely explored, but CLIR across European and Oriental languages is still at the initial stages. To cross the language boundary, a corpus-based approach shows promise of overcoming the limitations of knowledge-based and controlled vocabulary approaches. However, collecting parallel corpora between European and Oriental languages is not an easy task. Length-based and text-based approaches are two major approaches to align parallel documents. In this paper, we investigate several techniques using these approaches, and compare their performance in aligning English and Chinese titles of parallel documents available on the Web.
Article
Technical term translation represents one of the most difficult tasks for human translators since (1) most translators are not familiar with term and domain-specific terminology and (2) such terms are not adequately covered by printed dictionaries. This paper describes an algorithm for translating technical words and terms from noisy parallel corpora across language groups. Given any word which is part of a technical term in the source language, the algorithm produces a ranked candidate match for it in the target language. Potential translations for the term are compiled from the matched words and are also ranked. We show how this ranked list helps translators for technical term translation. Most algorithms for lexical and term translation focus on Indo-European language pairs, and most use a sentence-aligned clean parallel corpus without insertion, deletion or OCR noise. Our algorithm is language and character-set independent, and is robust to noise in the corpus. We show how our a...
Article
A system to process bilingual/multilingual text corpora is described. The system includes components for cross-language querying on parallel (i.e. translation-equivalent) and comparable (i.e. domain-specific) collections of texts in more than one language. Both sets of procedures are dependent on lexical resources (bilingual lexical databases) and linguistic tools (morphological procedures). The system was originally designed to meet the requirements of various types of contrastive language studies. However, we are now studying applications to cross-language retrieval. Background: In the last few years, natural language processing (NLP) techniques and tools have been incorporated into information retrieval (IR) systems with varying degrees of success (Smeaton 1992). The recent emergence of the field of Cross-Language Information Retrieval as an independent area of interest has clearly reinforced this trend. In order to be successful, cross-language applications frequently need access to methodologies ...
Article
Full-text available
Various methods have been proposed for aligning texts in two or more languages such as the Canadian Parliamentary Debates (Hansards). Some of these methods generate a bilingual lexicon as a by-product. We present an alternative alignment strategy which we call K-vec, that starts by estimating the lexicon. For example, it discovers that the English word fisheries is similar to the French pêches by noting that the distribution of fisheries in the English text is similar to the distribution of pêches in the French. K-vec does not depend on sentence boundaries.
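The core K-vec idea can be sketched as follows: split each half of the corpus into K equal segments, record which segments contain each word, and compare the two occurrence vectors. For brevity this sketch scores the vectors with the Dice coefficient rather than the mutual-information-based statistic of the original work, and the toy texts are invented:

```python
def kvec_sets(words, word, k):
    """Which of the k equal-sized segments of the text contain `word`."""
    seg = max(1, len(words) // k)
    return {min(i // seg, k - 1) for i, w in enumerate(words) if w == word}

def kvec_similarity(src_text, tgt_text, w_src, w_tgt, k=10):
    """Dice coefficient between the two segment-occurrence sets.
    (The original K-vec work uses an MI-style score; Dice keeps
    this illustration short.)"""
    vs = kvec_sets(src_text.split(), w_src, k)
    vt = kvec_sets(tgt_text.split(), w_tgt, k)
    if not vs and not vt:
        return 0.0
    return 2 * len(vs & vt) / (len(vs) + len(vt))

# Toy "corpus": fisheries and peches occur in the same relative positions.
en = "fisheries x x x x x x x x x fisheries x x x x x x x x x"
fr = "peches y y y y y y y y y peches y y y y y y y y y"
sim = kvec_similarity(en, fr, "fisheries", "peches", k=10)
```

Because the comparison uses only segment positions, no sentence boundaries are needed, which is exactly the property the abstract highlights.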
Article
Full-text available
There have been a number of recent papers on aligning parallel texts at the sentence level, e.g., Brown et al. (1991), Gale and Church (to appear), Isabelle (1992), Kay and Röscheisen (to appear), Simard et al. (1992), Warwick-Armstrong and Russell (1990). On clean inputs, such as the Canadian Hansards, these methods have been very successful (at least 96% correct by sentence). Unfortunately, if the input is noisy (due to OCR and/or unknown markup conventions), then these methods tend to break down because the noise can make it difficult to find paragraph boundaries, let alone sentences. This paper describes a new program, char_align, that aligns texts at the character level rather than at the sentence/paragraph level, based on the cognate approach proposed by Simard et al. 1. Introduction Parallel texts have recently received considerable attention in machine translation (e.g., Brown et al., 1990), bilingual lexicography (e.g., Klavans and Tzoukermann, 1990), and terminology research ...
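The cognate notion from Simard et al. that char_align builds on can be approximated by a simple prefix test; the helper below is an illustration of that heuristic (identical words, or a shared four-character prefix), not char_align itself:

```python
def is_cognate(w1, w2, n=4):
    """Treat two words as likely mutual translations if they are
    identical (covers numbers and punctuation) or share their
    first n characters -- the prefix heuristic of Simard et al."""
    w1, w2 = w1.lower(), w2.lower()
    if w1 == w2:
        return True
    return len(w1) >= n and len(w2) >= n and w1[:n] == w2[:n]
```

For example, English "parliament" and French "parlement" share the prefix "parl" and count as cognates, while "fisheries"/"pêches" do not; char_align exploits such anchor points at the character level so that noisy or unmarked sentence boundaries do not matter.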
Article
In this paper we describe a statistical technique for aligning sentences with their translations in two parallel corpora. In addition to certain anchor points that are available in our data, the only information about the sentences that we use for calculating alignments is the number of tokens that they contain. Because we make no use of the lexical details of the sentences, the alignment computation is fast and therefore practical for application to very large collections of text. We have used this technique to align several million sentences in the English-French Hansard corpora and have achieved an accuracy in excess of 99% in a randomly selected set of 1000 sentence pairs that we checked by hand. We show that even without the benefit of anchor points the correlation between the lengths of aligned sentences is strong enough that we should expect to achieve an accuracy of between 96% and 97%. Thus, the technique may be applicable to a wider variety of texts than we have yet tried.
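The length-only alignment described here can be sketched as a small dynamic program over token counts. The bead costs and penalties below are illustrative stand-ins for the paper's actual statistical model, and the token counts in the example are invented:

```python
import math

def align_by_length(src_lens, tgt_lens, penalty=3.0):
    """Align two sentence sequences using only their token counts,
    allowing 1-1, 1-0, 0-1, 2-1 and 1-2 beads via dynamic programming.
    Costs are illustrative, not the paper's exact statistical model."""
    def bead_cost(s, t):
        if s == 0 or t == 0:          # insertion or deletion bead
            return penalty
        return abs(math.log(s / t))   # penalise mismatched lengths
    n, m = len(src_lens), len(tgt_lens)
    INF = float("inf")
    best = [[INF] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if best[i][j] == INF:
                continue
            for di, dj in ((1, 1), (1, 0), (0, 1), (2, 1), (1, 2)):
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                c = best[i][j] + bead_cost(sum(src_lens[i:ni]),
                                           sum(tgt_lens[j:nj]))
                if di + dj == 3:      # mild penalty for 2-1 / 1-2 beads
                    c += 0.5
                if c < best[ni][nj]:
                    best[ni][nj] = c
                    back[ni][nj] = (i, j)
    beads, i, j = [], n, m            # recover the bead sequence
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append(((pi, i), (pj, j)))
        i, j = pi, pj
    return list(reversed(beads))

# Second source sentence (22 tokens) maps to two target sentences (12 + 10).
beads = align_by_length([10, 22], [11, 12, 10])
```

Because only token counts enter the computation, the program runs in time proportional to the product of the sequence lengths with no lexical lookups, which is why the approach scales to millions of sentences.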