Article

Annotating the Dutch Parallel Corpus

Abstract

Proceedings of the Workshop on Annotation and Exploitation of Parallel Corpora AEPC 2010. Editors: Lars Ahrenberg, Jörg Tiedemann and Martin Volk. NEALT Proceedings Series, Vol. 10 (2010), 63-72. © 2010 The editors and contributors. Published by Northern European Association for Language Technology (NEALT) http://omilia.uio.no/nealt . Electronically published at Tartu University Library (Estonia) http://hdl.handle.net/10062/15893 .

... This allowed in fact the improvement and extension in multi-lingual perspective of approaches originally developed for single languages, also increasing the portability of NLP tools and the availability of data useful for their comparison and study. As suggested in (Paulussen and Macken, 2010), the use of the same annotating tools and formats for each monolingual corpus may also have a positive impact on the following exploitation and processing of the resulting parallel corpora. On the other hand, the availability of multi-format annotations for parallel treebanks, like that described in (Francom and Hulden, 2008), can be of some help in the analysis of the adequateness of specific format for particular languages and phenomena. ...
Article
The paper introduces an ongoing project for the development of a parallel treebank for Italian, English and French, i.e. Parallel–TUT, or simply ParTUT. For the development of this resource, both the dependency and constituency-based formats of the Italian Turin University Treebank (TUT) have been applied to a preliminary dataset, which includes the whole text of the Universal Declaration of Human Rights, sentences from the JRC-Acquis Multilingual Parallel Corpus and the Creative Commons licence. The focus of the project is mainly on the quality of the annotation and the investigation of some issues related to the alignment of data that can be allowed by the TUT formats, also taking into account the availability of conversion tools for displaying data in standard ways, such as Tiger–XML and CoNLL formats. It is, in fact, our belief that increasing the portability of our treebank could give us the opportunity to access resources and tools provided by other research groups, especially at this stage of the project, where no particular tool compatible with the TUT format is available to tackle the alignment problems.
Article
Nowadays, text corpora play an important role in language research and all fields involving language study, including theoretical and applied linguistics, language technology, translation studies and CALL (Computer Assisted Language Learning). Multilingual corpora, especially translated corpora, are not always readily available for Dutch. Much depends on the private initiative of individuals, and the data are often available only under restrictions. The DPC project (Dutch Parallel Corpus), which is carried out within the STEVIN program (Odijk et al. 2004), intends to fill the gap for this type of corpus for Dutch. This paper gives an overview of the DPC project. First, the main parallel corpora containing Dutch are surveyed and discussed. Then the DPC project is described, focusing on those aspects that make the DPC different from existing parallel corpora. Finally, the choice of an XML-based format is explained.
Article
The paper discusses the compilation of massively multilingual corpora, the EU ACQUIS corpus, and the corpus annotation tool "totale". The ACQUIS text collection has recently become available on the Web, and contains EU law texts (the Acquis Communautaire) in all the languages of the current EU, and more, i.e. parallel texts in over twenty different languages. Such document collections can serve as the basis for multilingual parallel corpora of unprecedented size and variety of language, useful as training and testing dataset for a host of different HLT applications. The paper describes the steps that were undertaken to turn the text collection into a linguistically annotated text corpus. In particular, we discuss the harvesting and wrapper induction of the corpus, and the usage of its annotation with EuroVoc descriptors. Next, the text annotation tool "totale" which does multilingual text tokenization, tagging and lemmatisation is presented. The tool implements a simple pipelined architecture which is, for the most part, fully trainable, requiring a word-level syntactically annotated text corpus and, optionally, a morphological lexicon. To train totale for seven different languages we have used the MULTEXT-East corpus and lexicons; we describe this resource and the training of totale, and its application to the ACQUIS corpus. Finally, we turn to the current experiments in aligning the corpus, and developments we plan to undertake in the future.
Article
We describe a series of five statistical models of the translation process and give algorithms for estimating the parameters of these models given a set of pairs of sentences that are translations of one another. We define a concept of word-by-word alignment between such pairs of sentences. For any given pair of such sentences each of our models assigns a probability to each of the possible word-by-word alignments. We give an algorithm for seeking the most probable of these alignments. Although the algorithm is suboptimal, the alignment thus obtained accounts well for the word-by-word relationships in the pair of sentences. We have a great deal of data in French and English from the proceedings of the Canadian Parliament. Accordingly, we have restricted our work to these two languages; but we feel that because our algorithms have minimal linguistic content they would work well on other pairs of languages. We also feel, again because of the minimal linguistic content of our algorithms, that it is reasonable to argue that word-by-word alignments are inherent in any sufficiently large bilingual corpus.
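As an illustration of what "assigning a probability to each word-by-word alignment" can look like, the following is a minimal sketch of the simplest lexical translation model of this family (commonly known as IBM Model 1), trained with EM. The function names `train_model1` and `best_alignment`, the `NULL` token convention and the iteration count are assumptions made for this sketch; it does not reproduce the five models of the abstract.

```python
from collections import defaultdict


def train_model1(pairs, iterations=10):
    """Estimate lexical probabilities t(f|e) with EM.

    pairs: list of (e_tokens, f_tokens) sentence pairs (hypothetical input format).
    """
    f_vocab = {f for _, fs in pairs for f in fs}
    # Uniform initialisation: every f is equally likely given any e.
    t = defaultdict(lambda: 1.0 / len(f_vocab))
    for _ in range(iterations):
        count = defaultdict(float)   # expected count of (f, e) links
        total = defaultdict(float)   # expected count of e being a link source
        for es, fs in pairs:
            es = ["NULL"] + es       # position 0 lets a word align to nothing
            for f in fs:
                z = sum(t[(f, e)] for e in es)
                for e in es:
                    c = t[(f, e)] / z    # E-step: posterior of this link
                    count[(f, e)] += c
                    total[e] += c
        for (f, e), c in count.items():  # M-step: renormalise per e
            t[(f, e)] = c / total[e]
    return t


def best_alignment(es, fs, t):
    """For each f word, return the index of its most probable e word (0 = NULL)."""
    es = ["NULL"] + es
    return [max(range(len(es)), key=lambda i: t[(f, es[i])]) for f in fs]
```

Because each target word is aligned independently in this simplified model, the most probable alignment can be read off word by word; the richer models described in the abstract need a search procedure instead.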
Article
With a view to reinforcing the position of Dutch as a language for international communication, the Dutch Language Union (Nederlandse Taalunie) and several Dutch language area ministries have decided to invest in the development of a Machine Translation system for Dutch (from and into English and French). With additional funding from the European Union (provided in the framework of the Multilingual Information Society Programme) and from the technology supplier (Systran) a shared-cost project with a total budget of 2,4 million € was set up. The development started in July 2000 and will take two years. The operational objective of the project is to develop a system that will provide translations of a quality such that the combined input (resources) needed for Machine Translation (MT) plus post-editing by a human translator is less than the input needed for a "normal" human translation: one might call this FASQT, Fully Automatic Sufficient Quality Translation… It can be shown that even a marginal increase in productivity leads to a surprisingly short payback time. Likewise, the potential for future savings is huge. We therefore contend that the question "Why MT?" could even be answered on purely economic grounds, notwithstanding the political motivations often surrounding such investment decisions.
Article
2006 saw the start of a project for compiling a multifunctional parallel corpus with Dutch as a pivotal language: the Dutch Parallel Corpus (DPC). Among other things, parallel corpora can be a useful tool in the translation business. They can help to improve idiomatic language usage, provide translation suggestions or serve for filling up translation memory with high-quality data. The advantages of parallel corpora over multilingual comparable corpora or dictionaries/glossaries are search speed and reliability. They contain a great amount of aligned data, examples from which can be viewed in the surrounding context. Besides, parallel corpora offer their users the benefit of metadata with additional information allowing for a finer-tuned search of the corpus. The corpus design and text typology are crucial for the usability of the corpus. Insights from cognitive linguistics on basic-level categories have proven to be useful for elaborating such a design and typology assuring (i) text type diversity containing translation samples from different areas of expertise; (ii) high translation quality providing reliable translation solutions; (iii) a well-structured taxonomy for prompt data retrieval.
Chapter
Memory-based language processing--a machine learning and problem solving method for language technology--is based on the idea that the direct re-use of examples using analogical reasoning is more suited for solving language processing problems than the application of rules extracted from those examples. This book discusses the theory and practice of memory-based language processing, showing its comparative strengths over alternative methods of language modelling. Language is complex, with few generalizations, many sub-regularities and exceptions, and the advantage of memory-based language processing is that it does not abstract away from this valuable low-frequency information. © Walter Daelemans and Antal van den Bosch 2005 and Cambridge University Press, 2009.
Conference Paper
Researchers in both machine translation (e.g., Brown et al., 1990) and bilingual lexicography (e.g., Klavans and Tzoukermann, 1990) have recently become interested in studying parallel texts, texts such as the Canadian Hansards (parliamentary proceedings) which are available in multiple languages (French and English). This paper describes a method for aligning sentences in these parallel texts, based on a simple statistical model of character lengths. The method was developed and tested on a small trilingual sample of Swiss economic reports. A much larger sample of 90 million words of Canadian Hansards has been aligned and donated to the ACL/DCI.
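The idea of aligning sentences from character lengths alone can be sketched with a small dynamic program. The version below is a simplified illustration, not the model of the paper: the bead penalties in `PENALTY`, the `VARIANCE` constant and the squared-difference cost in `length_cost` are assumptions chosen for readability, and the names are hypothetical.

```python
import math

# Penalty per bead type (number of source sentences, number of target sentences).
# Illustrative values only; they are assumptions of this sketch.
PENALTY = {(1, 1): 0.0, (1, 0): 4.5, (0, 1): 4.5, (2, 1): 2.3, (1, 2): 2.3}
VARIANCE = 6.8  # assumed variance of the length difference per character


def length_cost(len_src, len_tgt):
    """Squared, variance-normalised difference between two character lengths."""
    if len_src == 0 and len_tgt == 0:
        return 0.0
    mean = (len_src + len_tgt) / 2.0
    delta = (len_tgt - len_src) / math.sqrt(mean * VARIANCE)
    return delta * delta / 2.0


def align(src, tgt):
    """Return the cheapest sequence of beads as (src_indices, tgt_indices) pairs."""
    n, m = len(src), len(tgt)
    ls = [len(s) for s in src]
    lt = [len(t) for t in tgt]
    inf = float("inf")
    best = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if best[i][j] == inf:
                continue
            # Try every bead type starting at position (i, j).
            for (di, dj), penalty in PENALTY.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                cost = best[i][j] + penalty + length_cost(sum(ls[i:ni]), sum(lt[j:nj]))
                if cost < best[ni][nj]:
                    best[ni][nj] = cost
                    back[ni][nj] = (i, j)
    # Walk the back-pointers from the end to recover the bead sequence.
    beads = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    return beads[::-1]
```

Only sentence lengths enter the cost, so the sketch needs no lexicon; this is also why such a method can be expected to struggle when long stretches of text are missing on one side.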
Conference Paper
The explicit introduction of morphosyntactic information into statistical machine translation approaches is receiving considerable attention. The current freely available Part of Speech (POS) taggers for the French language are based on a limited tagset which does not account for some flectional particularities. Moreover, there is a lack of a unified framework of training and evaluation for this kind of linguistic resource. Therefore in this paper, three standard POS taggers (Treetagger, Brill's tagger and the standard HMM POS tagger) are trained and evaluated in the same conditions on the French MULTITAG corpus. This POS-tagged corpus provides a tagset richer than the usual ones, including gender and number distinctions, for example. Experimental results show significant differences of performance between the taggers. According to the tagging accuracy estimated with a tagset of 300 items, taggers may be ranked as follows: Treetagger (95.7%), Brill's tagger (94.6%), HMM tagger (93.4%). Examples of translation outputs illustrate how considering gender and number distinctions in the POS tagset can be relevant.
Conference Paper
After three years of work the Dutch Parallel Corpus (DPC) project has reached an end. The finalized corpus is a ten-million-word high-quality sentence-aligned bidirectional parallel corpus of Dutch, English and French, with Dutch as central language. In this paper we present the corpus and try to formulate some basic data collection principles, based on the work that was carried out for the project. Building a corpus is a difficult and time-consuming task, especially when every text sample included has to be cleared from copyrights. The DPC is balanced according to five text types (literature, journalistic texts, instructive texts, administrative texts and texts treating external communication) and four translation directions (Dutch-English, English-Dutch, Dutch-French and French-Dutch). All the text material was cleared from copyrights. The data collection process necessitated the involvement of different text providers, which resulted in drawing up four different licence agreements. Problems such as an unknown source language, copyright issues and changes to the corpus design are discussed in close detail and illustrated with examples so as to be of help to future corpus compilers.
Conference Paper
A wide spectrum of multilingual applications have aligned parallel corpora as their prerequisite. The aim of the project described in this paper is to build a multilingual corpus where all sentences are aligned at very high precision with minimal human effort involved. The experiments on a combination of sentence aligners with different underlying algorithms described in this paper showed that by verifying only those links which were not recognized by at least two aligners, the error rate can be reduced by 93.76% as compared to the performance of the best aligner. Such manual involvement concerned only a small portion of all data (6%). This significantly reduces the load of manual work necessary to achieve nearly 100% accuracy of alignment.
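The selection rule described here (accept links that at least two aligners agree on, send the rest to manual verification) can be sketched as a simple vote count. The function name `triage_links`, the `min_votes` parameter and the representation of a link as a pair of index tuples are assumptions of this sketch, not details given in the abstract.

```python
from collections import Counter


def triage_links(aligner_outputs, min_votes=2):
    """Split alignment links into automatically accepted and to-be-verified sets.

    aligner_outputs: one iterable of links per aligner; a link is assumed to be
    a hashable pair such as ((source sentence ids), (target sentence ids)).
    """
    votes = Counter()
    for links in aligner_outputs:
        votes.update(set(links))  # each aligner votes at most once per link
    accepted = {link for link, n in votes.items() if n >= min_votes}
    to_verify = {link for link, n in votes.items() if n < min_votes}
    return accepted, to_verify


# Hypothetical outputs of three aligners for the same document pair.
aligner_a = [((0,), (0,)), ((1,), (1, 2))]
aligner_b = [((0,), (0,)), ((1,), (1,)), ((2,), (2,))]
aligner_c = [((0,), (0,)), ((1,), (1, 2))]
accepted, to_verify = triage_links([aligner_a, aligner_b, aligner_c])
```

With these toy inputs, the two links proposed by a single aligner only are the ones that end up in `to_verify`, which keeps the manual workload limited to the disputed part of the data.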
Article
This paper describes the lemmatisation and tagging guidelines developed for the "Spoken Dutch Corpus", and lays out the philosophy behind the high granularity tagset that was designed for the project. To bootstrap the annotation of large quantities of material (10 million words) with this new tagset we tested several existing taggers and tagger generators on initial samples of the corpus. The results show that the most effective method, when trained on the small samples, is a high quality implementation of a Hidden Markov Model tagger generator. 1. Introduction The Dutch-Flemish project "Corpus Gesproken Nederlands" (1998-2003) aims at the collection, transcription and annotation of ten million words of spoken Dutch (Oostdijk, 2000). The first layer of linguistic annotation concerns the assignment of base forms and morphosyntactic tags to each of those ten million words. The first part of this paper presents the lemmatisation guidelines and the tagset which have been devised for thi...
Article
In this paper, we show how the paradigm of evaluation can function as a language resource producer for high quality and low cost validated language resources. First the paradigm of evaluation is presented, the main points of its history are recalled, from the first deployment that took place in the USA during the DARPA/NIST evaluation campaigns, up to the latest efforts in Europe (SENSEVAL2/ROMANSEVAL2, CLEF, CLASS etc.). Then the principle behind the method used to produce high-quality validated language resources at low cost from the byproducts of an evaluation campaign is explained. It was inspired by the ROVER experiments (Recognizer Output Voting Error Reduction) performed during speech recognition evaluation campaigns in the USA and consists of combining the outputs of the participating systems with a simple voting strategy to obtain higher performance results. Here we make a link with the existing strategies for system combination studied in machine learning. As an illustration we describe how the MU...
Article
The first step in most empirical work in multilingual NLP is to construct maps of the correspondence between texts and their translations (bitext maps). The Smooth Injective Map Recognizer (SIMR) algorithm presented here is a generic pattern recognition algorithm that is particularly well-suited to mapping bitext correspondence. SIMR is faster and significantly more accurate than other algorithms in the literature. The algorithm is robust enough to use on noisy texts, such as those resulting from OCR input, and on translations that are not very literal. SIMR encapsulates its language-specific heuristics, so that it can be ported to any language pair with a minimal effort.
Moore, R. C. (2002). Fast and accurate sentence alignment of bilingual corpora. In Proceedings of the fifth Conference of the Association for Machine Translation in the Americas (AMTA), Machine Translation: from research to real users, Tiburon, California, pp. 135-144.
Danielsson, P. and Ridings, D. (1997). Practical presentation of a "vanilla aligner". In Proceedings of the TELRI Workshop on Alignment and Exploitation of Texts, Ljubljana.