Article (PDF available)

Abstract and Figures

We present an algorithm for aligning texts with their translations that is based only on internal evidence. The relaxation process rests on a notion of which word in one text corresponds to which word in the other, based essentially on the similarity of their distributions. The algorithm exploits a partial alignment at the word level to induce a maximum likelihood alignment at the sentence level, which is used in turn, in the next iteration, to refine the word-level estimate. The algorithm appears to converge to the correct sentence alignment in only a few iterations.
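The iteration the abstract describes — word correspondences inferred from distributional similarity feeding a sentence alignment, which in turn refines the word correspondences — can be sketched roughly as follows. This is an illustrative reconstruction, not the paper's actual algorithm: the similarity measure (Dice on sentence-occurrence sets), the voting scheme, the threshold, and all names are assumptions.

```python
from collections import defaultdict

def word_positions(sentences):
    """Map each word to the set of sentence indices it occurs in."""
    pos = defaultdict(set)
    for i, sent in enumerate(sentences):
        for w in sent:
            pos[w].add(i)
    return pos

def dice(a, b):
    """Dice coefficient between two occurrence sets."""
    return 2 * len(a & b) / (len(a) + len(b))

def align(src, tgt, iterations=3, threshold=0.5):
    """Iteratively refine a sentence alignment.

    Word pairs whose occurrence distributions look similar through the
    current alignment (Dice above threshold) vote for aligning the
    sentences they occur in; the best-supported sentence pairs form the
    next alignment estimate.
    """
    n, m = len(src), len(tgt)
    # Initial guess: align sentences proportionally by position.
    alignment = {i: round(i * (m - 1) / max(n - 1, 1)) for i in range(n)}
    for _ in range(iterations):
        sp, tp = word_positions(src), word_positions(tgt)
        votes = defaultdict(int)
        for ws, sset in sp.items():
            for wt, tset in tp.items():
                # Compare distributions through the current alignment.
                mapped = {alignment[i] for i in sset}
                if dice(mapped, tset) >= threshold:
                    for i in sset:
                        for j in tset:
                            votes[(i, j)] += 1
        # Each source sentence keeps its best-supported target.
        for i in range(n):
            best = max((votes.get((i, j), 0), j) for j in range(m))
            if best[0] > 0:
                alignment[i] = best[1]
    return alignment
```

On a toy parallel text the fixed point is reached immediately; on real text one would start from a small set of reliable anchors and let the correspondence set grow across iterations.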
... The idea of using bilingual (parallel) translation corpora in CBMT is not entirely new. Although it dates back to the early days of MT, it was not used in practice until 1984 (Kay and Röscheisen 1993). The CBMT system is based on information acquired from the analysis of bilingual parallel corpora. ...
... To do so, the starting point is a preliminary alignment of words with a bilingual dictionary. Certainly, a rough alignment will yield satisfactory results at the sentence level (Kay and Röscheisen 1993), especially when supported by various statistical methods (Brown et al. 1990) with minimal formalisation of the major syntactic phenomena of the texts (Brown et al. 1993). ...
Book
Full-text available
The author makes an effort to interweave the computer and the language corpus in an interactive interface, with the goal of developing new systems and resources for second language teaching, compiling dictionaries with digitized corpora, introducing new approaches to the study of dialects, designing robust systems for word sense understanding, and developing a new method for machine translation. The primary goal of the book is to make people aware of the functional and referential benefits of language corpora in works of applied linguistics. The academic relevance of this book lies in its direct focus on the Indian context of applied linguistic work, as well as in its sincere appeal for redirecting the focus of Indian applied linguists towards this new approach for the benefit of the discipline. The issues addressed in this book have academic and functional relevance in the areas of corpus linguistics, computational linguistics, applied linguistics, language technology, cognitive linguistics, and language processing, as well as in mainstream linguistics. It is enriched with references to recent work carried out on advanced languages in various parts of the world. The book will help readers see how novel approaches make valuable improvements over the traditional systems and techniques normally used in applied linguistics. The book is suitable for use as a course book at both undergraduate and postgraduate levels. It can also be used as a reference book for teachers employed in language teaching and researchers working in areas of applied linguistics and language technology. People working in other areas will also find this book useful for information, observation, and interpretation.
... (Lin, 1998a; Lin, 1998b). ... (Kay and Röscheisen, 1993) is monotonic in Jaccard's coefficient (van Rijsbergen, 1979), so its inclusion in our experiments would be redundant. Finally, we did not use the KL divergence because it requires a smoothed base language model. ...
Preprint
We study distributional similarity measures for the purpose of improving probability estimation for unseen cooccurrences. Our contributions are three-fold: an empirical comparison of a broad range of measures; a classification of similarity functions based on the information that they incorporate; and the introduction of a novel function that is superior at evaluating potential proxy distributions.
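The trade-off noted in the snippet above — Jaccard's coefficient needs only the sets of observed co-occurrences, while the KL divergence is undefined unless the base distribution is smoothed — can be made concrete. The add-alpha smoothing below is one common choice, not necessarily the one any of the cited papers evaluates:

```python
import math

def jaccard(a, b):
    """Jaccard coefficient on the sets of observed co-occurrences."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def kl_divergence(p_counts, q_counts, alpha=0.5):
    """KL(p || q) with add-alpha smoothing, since KL is undefined
    whenever q assigns zero probability to an event observed in p."""
    vocab = set(p_counts) | set(q_counts)
    pn = sum(p_counts.values()) + alpha * len(vocab)
    qn = sum(q_counts.values()) + alpha * len(vocab)
    kl = 0.0
    for w in vocab:
        p = (p_counts.get(w, 0) + alpha) / pn
        q = (q_counts.get(w, 0) + alpha) / qn
        kl += p * math.log(p / q)
    return kl
```

Jaccard compares only which contexts were seen; the smoothed KL also weighs how often, at the price of the smoothing assumption.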
... When applied to an automatic question-answering system [10][11][12], this technology can automatically identify user-searched queries and match them against the system database to produce the most relevant answer. When applied to translation [13,14], it can assess the fidelity of the translation between the source and target sentences. It can also be used in automatic abstract generation [15,16], to compare the generated abstract with the original. ...
Article
Full-text available
Semantic text similarity measurement is fundamental in natural language processing (NLP). With the advancement of NLP technology, the research and application value of similarity measurement has become prominent. This paper uses Google Scholar as the primary search tool to collect 179 documents; after filtering, 50 key documents are retained. The paper then summarizes the research progress of semantic text similarity measurement and develops a more comprehensive classification system for text similarity algorithms: string-based, corpus-based, knowledge-based, deep learning-based, traditional pretraining-based, and state-of-the-art pretraining-based methods. For each category, it introduces typical models and methods and discusses their advantages and disadvantages. This systematic treatment enables a quick grasp of the methods, summarizing and analyzing both classic and the latest research in text similarity measurement. The paper also lists evaluation metrics used in the field and concludes by discussing potential future research directions, with the aim of providing a reference for related research and applications.
... if the string of character boxes and the string of text characters on a page were placed one above the other, with alignment lines drawn between the boxes and their corresponding labels, the lines should not intersect. These restrictions suggest an alignment algorithm similar to those used to align translations [7], with the supplementary constraint that no intersections of alignment lines are allowed; the image-to-text alignment matrix should therefore be a straight diagonal. ...
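The no-crossing constraint in the snippet above is exactly a monotonic alignment, which a dynamic program in the style of edit distance enforces by construction. A minimal sketch, with illustrative unit costs for mismatched links and skipped items:

```python
def monotone_align(boxes, chars, mismatch=1.0, skip=1.0):
    """Monotonic (non-crossing) alignment of two sequences by dynamic
    programming, as in edit distance: each position is either linked
    or skipped, and links can never reorder or intersect."""
    n, m = len(boxes), len(chars)
    # cost[i][j] = cheapest alignment of boxes[:i] with chars[:j]
    cost = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i * skip
    for j in range(1, m + 1):
        cost[0][j] = j * skip
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            link = cost[i - 1][j - 1] + (0.0 if boxes[i - 1] == chars[j - 1] else mismatch)
            cost[i][j] = min(link, cost[i - 1][j] + skip, cost[i][j - 1] + skip)
    # Trace back the linked pairs; the path is a monotone "diagonal".
    links, i, j = [], n, m
    while i > 0 and j > 0:
        link = cost[i - 1][j - 1] + (0.0 if boxes[i - 1] == chars[j - 1] else mismatch)
        if cost[i][j] == link:
            links.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif cost[i][j] == cost[i - 1][j] + skip:
            i -= 1
        else:
            j -= 1
    return list(reversed(links))
```

For example, `monotone_align(list("abXc"), list("abc"))` links positions (0, 0), (1, 1), and (3, 2), skipping the spurious box `X`.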
... The work on automatic word alignment started in the early nineties (41,86,87,123). Brown et al. (1993) introduced five statistical lexical alignment models, known as the IBM models, to build statistical machine translation systems. The models have been widely used, and various modifications and improvements have been introduced to enhance their robustness and quality (64). ...
Thesis
Full-text available
Translation alignment is an essential task in Digital Humanities and Natural Language Processing, and it aims to link words/phrases in the source text with their translation equivalents in the translation. In addition to its importance in teaching and learning historical languages, translation alignment builds bridges between ancient and modern languages through which various linguistic annotations can be transferred. This thesis focuses on word-level translation alignment applied to historical languages in general and Ancient Greek and Latin in particular. As the title indicates, the thesis addresses four interdisciplinary aspects of translation alignment. The starting point was developing Ugarit, an interactive annotation tool for manual alignment, with the aim of gathering training data for an automatic alignment model. This effort resulted in more than 190k accurate translation pairs that I later used for supervised training. Ugarit has been used by many researchers and scholars, including in the classroom at several institutions for teaching and learning ancient languages, which resulted in a large, diverse crowd-sourced aligned parallel corpus allowing us to conduct experiments and qualitative analysis to detect recurring patterns in annotators' alignment practice and the generated translation pairs. Further, I employed recent advances in NLP and language modeling to develop an automatic alignment model for low-resource historical languages, experimenting with various training objectives and proposing a training strategy for historical languages that combines supervised and unsupervised training with mono- and multilingual texts. Then, I integrated this alignment model into other development workflows to project cross-lingual annotations and induce bilingual dictionaries from parallel corpora. Evaluation is essential to assess the quality of any model.
To ensure employing the best practice, I reviewed the current evaluation procedure, defined its limitations, and proposed two new evaluation metrics. Moreover, I introduced a visual analytics framework to explore and inspect alignment gold standard datasets and support quantitative and qualitative evaluation of translation alignment models. Besides, I designed and implemented visual analytics tools and reading environments for parallel texts and proposed various visualization approaches to support different alignment-related tasks employing the latest advances in information visualization and best practice. Overall, this thesis presents a comprehensive study that includes manual and automatic alignment techniques, evaluation methods and visual analytics tools that aim to advance the field of translation alignment for historical languages.
... When the texts being compared are in different languages (also called parallel texts or parallel corpora), the task is more specifically called translation alignment. The result often takes the form of a list of pairs of items, which can be larger text chunks like documents or paragraphs, but more frequently sentences and words (Kay and Röscheisen, 1993; Véronis, 2000). Translation alignment is a very important task in Natural Language Processing. ...
Conference Paper
Full-text available
This paper illustrates a workflow for developing and evaluating automatic translation alignment models for Ancient Greek. We designed an annotation Style Guide and a gold standard for the alignment of Ancient Greek-English and Ancient Greek-Portuguese, measured inter-annotator agreement and used the resulting dataset to evaluate the performance of various translation alignment models. We proposed a fine-tuning strategy that employs unsupervised training with mono- and bilingual texts and supervised training using manually aligned sentences. The results indicate that the fine-tuned model based on XLM-Roberta is superior in performance, and it achieved good results on language pairs that were not part of the training data.
... Translation alignment is defined as the operation of comparing two or more parallel texts in different languages to find correspondences between their textual units, through manual or automated methods (Kay and Röscheisen, 1993). The result often takes the form of a list of pairs of items, which can be larger text chunks like documents or paragraphs, but more frequently sentences and words. ...
Article
This article presents a study of several parallel corpora of historical languages and their translations. The aligned corpora are the result of a large crowdsourcing project, named Ugarit, aimed at supporting translation alignment for ancient and historical languages: the study of the resulting translation pairs allows us to observe cross-linguistic dynamics in a range of languages, some of which have never been systematically aligned before. The corpora considered are divided into two distinct groups: English translations of ancient languages, including Greek, Latin, Persian, and Coptic; and translations of ancient Greek into other languages, including Latin, English, Georgian, Italian, and Persian. We evaluated different ratios of word matching across each language pair (one-to-one, one-to-many, many-to-one, and many-to-many), and analyzed the resulting trends across the corpus. We propose some observations on how and why different types of alignment links are established in a given language pair, and what factors affect their creation beyond the control of the user: we propose two complementary hypotheses to explain the changes, one based on structural linguistic factors and the other based on cultural difference.
Chapter
As explained in Chap. 1 and later developed in Chap. 6, Machine Translation (MT) engines need to be trained with large numbers of parallel sentences or segments. The quantity and diversity of existing parallel text is limited, however. This motivates the search for parallel sentences in comparable corpora. By exploring a larger share of the levels of comparability introduced in Sect. 1.2, a much larger source of multilingual data can be obtained. Strongly comparable corpora such as Wikipedia entries [1, 62] or news text [2] are rife with parallel sentences and have been among the first to be explored.
Article
This article presents the first results of the analysis of a literary corpus composed of poems and an essay, together with their 16 back-translations into French after translation into 8 typologically and diachronically diverse languages (Italian, German, Arabic, Farsi, Japanese, Korean, Latin, and Ancient Greek). This collection is approached in an interdisciplinary way, examining back-translation from a textometric and stylistic point of view. Starting from an assessment of the translational distance between the aligned texts (originals and back-translations), based on length, the Dice index, and the stability of lemmas, three main hypotheses are tested: are texts transformed more when they have passed through a language distant from French? Do the poems undergo greater distortions than the prose essay? Can the back-translated corpus serve as a hermeneutic tool for analyzing the original text? To study these different aspects from the perspective of tool-assisted corpus linguistics and French stylistics, we propose a methodology of multi-text alignment and textometric measures, relying on the tagging and lemmatization of units. The stylistic study draws on these measures to approach this rich and complex multi-text. Conducted on a small collection, the analysis aims to establish procedures that are reproducible on other corpora of versions of a text translated into the same language.
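Of the three distance measures named in the abstract above (length, the Dice index, the stability of lemmas), the Dice index over lemma sets is the simplest to state. A minimal sketch, assuming lemmatization has already been done upstream:

```python
def dice_index(lemmas_a, lemmas_b):
    """Dice coefficient between the lemma sets of two aligned texts:
    2|A ∩ B| / (|A| + |B|). 1.0 means identical vocabulary; 0.0, none shared."""
    a, b = set(lemmas_a), set(lemmas_b)
    return 2 * len(a & b) / (len(a) + len(b))

def length_ratio(text_a, text_b):
    """Relative length of a back-translation against its original."""
    return len(text_b) / len(text_a)
```

A back-translation that drifts far from the original would show a low Dice index (few shared lemmas) and a length ratio far from 1.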
Conference Paper
Full-text available
Researchers in both machine translation (e.g., Brown et al., 1990) and bilingual lexicography (e.g., Klavans and Tzoukermann, 1990) have recently become interested in studying parallel texts, texts such as the Canadian Hansards (parliamentary proceedings) which are available in multiple languages (French and English). This paper describes a method for aligning sentences in these parallel texts, based on a simple statistical model of character lengths. The method was developed and tested on a small trilingual sample of Swiss economic reports. A much larger sample of 90 million words of Canadian Hansards has been aligned and donated to the ACL/DCI.
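A simplified version of such a character-length model scores a candidate sentence pair by how improbably the target length deviates from what the source length predicts, under a Gaussian assumption. The constants below (c = 1, s² = 6.8) follow published estimates for this family of models, but the exact cost function used in the paper may differ:

```python
import math

def length_cost(len_src, len_tgt, c=1.0, s2=6.8):
    """Penalty for pairing passages of these character lengths: delta
    measures, in standard deviations, how far the target length strays
    from the length the source predicts; the cost is the negative log
    of the two-tailed probability of a deviation at least that large."""
    mean = len_src * c
    delta = (len_tgt - mean) / math.sqrt(len_src * s2)
    # Two-tailed tail probability under a standard normal.
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(delta) / math.sqrt(2))))
    return -math.log(max(p, 1e-12))
```

Summed over a candidate segmentation, these costs can then be minimized by dynamic programming to pick the best sentence alignment.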
Article
For a long time, the origin of cosmic radiation has represented a challenge to the imagination of astrophysicists. This radiation, which was discovered in 1912, is raining down on the earth from all directions at a uniform rate. Now, however, a major source of cosmic radiation has finally been found in an object called Cygnus X-3. This object is the third-brightest X-ray emitter in the constellation Cygnus. It was first observed by X-ray astronomers in the late 1960s. Recently, it was found that Cygnus X-3 is also the source of high-energy gamma rays. On the basis of the observed gamma rays the object was identified as a source of cosmic radiation. Cygnus X-3 is a binary star system located at a distance of at least 37,000 light years at the edge of the Galaxy. Attention is given to models of Cygnus X-3, mechanisms involved in the production of cosmic rays, problems regarding the identification of the sources of cosmic rays, studies of Cygnus X-3, and details regarding cosmic rays.
Conference Paper
An essential problem of example-based translation is how to utilize more than one translation example for translating a single source sentence. This paper proposes a method to solve this problem. We introduce a representation, called the matching expression, which represents a combination of fragments of translation examples. The translation process consists of three steps: (1) make the source matching expression from the source sentence; (2) transfer the source matching expression into the target matching expression; (3) construct the target sentence from the target matching expression. This mechanism generates several candidate translations. To select the best translation among them, we define a score for each translation.
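A toy rendering of the three steps, collapsing matching-expression construction and transfer into a greedy longest-fragment cover of the source sentence; the example base, the greedy strategy, and the absence of scoring are illustrative simplifications, not the paper's method:

```python
def translate(tokens, examples):
    """Cover the source greedily with the longest example fragments
    (a stand-in for building the source matching expression), then
    emit each fragment's stored translation in order (a stand-in for
    transfer and target-sentence construction)."""
    out, i = [], 0
    while i < len(tokens):
        for j in range(len(tokens), i, -1):  # try longest fragment first
            frag = tuple(tokens[i:j])
            if frag in examples:
                out.extend(examples[frag])
                i = j
                break
        else:
            out.append(tokens[i])  # untranslated token passes through
            i += 1
    return out
```

With an example base `{("good", "morning"): ["guten", "Morgen"], ("world",): ["Welt"]}`, the source `["good", "morning", "world"]` is covered by two fragments from different examples and yields `["guten", "Morgen", "Welt"]`.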
Article
Chapter 5 is concerned with sorting into order, covering internal sorting and external sorting. Chapter 6 deals with the problem of searching for specified items in tables or files; it is subdivided into methods that search sequentially, by comparison of keys, by digital properties, or by 'hashing.' It then discusses the more difficult problem of secondary key retrieval.
Conference Paper
A stochastic model based on insights of Mandelbrot (1953) and Simon (1955) is discussed against the background of new criteria of adequacy that have recently become available as a result of studies of the similarity relations between words found in large computerized text corpora.
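Simon's (1955) scheme at the heart of such models is easy to simulate: with probability alpha mint a new word; otherwise repeat a word drawn with probability proportional to its current frequency. A minimal sketch (the parameter values are illustrative):

```python
import random

def simon_process(n_tokens, alpha=0.1, seed=0):
    """Generate a text by Simon's (1955) urn scheme. Picking a token
    uniformly from the text so far selects each word type with
    probability proportional to its current frequency, so frequent
    words get more frequent and a Zipf-like distribution emerges."""
    rng = random.Random(seed)
    text = [0]   # word ids; start with a single word
    next_id = 1
    for _ in range(n_tokens - 1):
        if rng.random() < alpha:
            text.append(next_id)  # coin a brand-new word
            next_id += 1
        else:
            text.append(rng.choice(text))  # rich-get-richer reuse
    return text
```

A run of a few thousand tokens already shows the characteristic skew: a handful of very frequent words and a long tail of words seen once.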