Article

Towards a multilingual corpus for contrastive analysis and translation studies

Abstract

A report is given on the Oslo Multilingual Corpus, with special reference to a new trilingual project focusing on English, Norwegian, and German. As an example, the paper examines the English verb spend and its correspondences in Norwegian and German. Correspondences are either syntactically congruent, usually containing the Norwegian verb tilbringe or the German verb verbringen, or they involve a restructuring of the clause. The patterns of correspondence are broadly comparable in Norwegian and German. Although there is a great deal of restructuring, there is also evidence of overuse of congruent structures. The findings testify to the usefulness of research based on multilingual corpora.

... Reflexive cases of sentirse / se sentir / sentirsi were not included since they need a more detailed analysis and a proper treatment. In the description of our results we focus mainly on syntactically congruent matches (that is, through a verbal expression) and will not go into detail about the cases of zero correspondence (Johansson 2002) or inventive translations (Salkie 2002) as illustrated by the examples below, which could be the object of a more thorough analysis: ...
... I felt the fear of death and I could do absolutely nothing.' The rather high amount of cases of zero correspondence should not be surprising, given the literary nature of the compiled corpus and the consequently potentially high number of inventive translations, as indicated by Johansson (1998, 2002). It is not within the scope of this article to resort to advanced statistical techniques. ...
Article
Recent linguistic studies on perception have focused mainly on verbs referring to the dominant visual and auditory modalities (e.g., English see/look and hear/listen) and have largely ignored the minor verbs. The present paper seeks to fill this gap by comparing the complex semantics of the cognate verbs sentir(e) in three Romance languages, namely Spanish, French and Italian. Since the objective study of semantics is a problematic issue, we pay special attention to methodological problems and opt for a "combined corpus approach" involving both a translation corpus and comparable data. Evidence from both corpora indicates that, notwithstanding the fact that the rich polysemy of the three verbs partly coincides, each individual verb has undergone semantic specializations differentiating the morphological cognates.
... and, e.g., Stig Johansson (2001). The data in Tables 2, 3 and 4 are based on the OMC as of July 2002. ...
... The first large electronic corpora, such as the BROWN Corpus (Francis and Kučera 1979) and the LOB Corpus (Johansson et al. 1986) for English, and COSMAS I, the Digitales Wörterbuch der deutschen Sprache DWDS (Geyken 2007) or the TIGER Treebank (Brants et al. 2002) for German, were purely monolingual in orientation. In recent years there has been a tendency to build corpora increasingly multilingually, as with the German-English CroCo Corpus used in the present work, the English-French HANSARD Corpus (Roukos et al. 1997), consisting of transcripts of debates in the Canadian Parliament, the Oslo Multilingual Corpus (Johansson 2000), the Prague Czech-English Dependency Treebank (PCEDT, Čmejrek et al. 2004) or the EUROPARL Corpus (Koehn 2005), consisting of transcripts of debates in the EU Parliament. Parallel corpora (the term used in what follows for corpora of originals and their translations, after Baker 1996; Granger 2003) open up the possibility of investigating valency within individual languages and cross-linguistically on a broad basis. ...
... The great advantages provided by (parallel) corpora have been described in numerous previous studies (see e.g. Church and Mercer 1993; Greenbaum 1996; Kennedy 1998; Johansson 2002). A key point in (Parallel) Corpus Linguistics is that the evidence comes from existing data and, therefore, we do not "loo[k] at what is theoretically possible in a language, [but] we study the actual language used in naturally occurring texts" (Biber, Conrad and Reppen 1998: 1). ...
... As I have previously examined Norwegian and German correspondences of spend (Johansson 2002), I will now focus on correspondences in Swedish, as they appear in the fiction texts of the English-Swedish Parallel Corpus (Aijmer and Altenberg 2000). Figure 2 gives an overview of the distribution of spend and tillbringa in the fiction texts of the corpus. Although the exact numbers differ, the overall pattern is very much the same as in Figure 1. ...
Article
Full-text available
The starting-point for my paper is the observation by Martin Gellerstam (1996: 59) that the Swedish verb tillbringa (‘spend [time]’) is overused in translations from English, presumably under the influence of the English verb spend. The same is true of Norwegian tilbringe; see Figure 1. For spend in expressions of time there is an opposite translation effect. These translation effects show that there is a tendency for translators to move on the surface of discourse and resort to formally similar structures where this is possible, rather than find forms that are more in line with usage in original texts in the target language. The question I would like to ask is this: what do Swedes and Norwegians do when they do not spend time? What alternatives are there for conveying the English notion of spending time? But first we must examine how the English verb is used.
... Parallel corpora consist of original texts from several source languages together with translations into one or more target languages. The Oslo Multilingual Corpus is an example of a parallel corpus, and we describe this corpus in more detail in section 4 (see also Johansson, 2000). Such corpora can be aligned at the sentence or word level, so that one can quickly see which units correspond to each other in the source and target languages. ...
Article
Full-text available
Corpora are a type of digital language resource (machine-readable text collections) widely used in modern linguistics as empirical support for the study of various linguistic phenomena. Corpora can be used to good effect in foreign language teaching, but have so far not been sufficiently taken up in Norwegian schools and higher education. In this article we review the existing research literature in the field, with emphasis on contributions from the Anglo-American language area. We then go through two corpora that are available to Norwegian users, the Norwegian-Spanish Parallel Corpus and the Oslo Multilingual Corpus. We show how these corpora can be used in teaching various linguistic phenomena, from vocabulary to sociolinguistic phenomena. On the basis of existing research on pedagogical applications of corpora and our own proposals for teaching plans, there is reason to believe that the use of corpora could be a valuable contribution to foreign language teaching in Norwegian classrooms. Nevertheless, certain challenges prevent corpora from being adopted. Lack of time, group sizes and technological obstacles can all stand in the way. The use of corpora in the classroom of the future is also closely tied to various trends in both technology and research. First, computer technology is becoming ever more accessible, in parallel with the growing digital competence of pupils and students. Both phenomena should point to increased use of corpora. Furthermore, there is a strong movement to open up access to research data and resources (Open Access), and this wave can also be used to open up research resources, including corpora, for learning purposes. Ultimately, however, the question of corpus use in the classroom depends on a closer dialogue between teachers and corpus linguists, both in teacher education and through professional development during working life.
... The value of parallel corpora has been shown in various NLP applications and research disciplines. Some of them are data-driven machine translation (Brown et al. 1993, Brown 1996), multilingual lexicon/terminology extraction (Gale and Church 1991, Smadja et al. 1996, Hiemstra 1998, Gaussier 1998, Tufis and Barbu 2001), word sense disambiguation (Ide 2000, Diab and Resnik 2002) and general translation studies (Johansson 2002), to mention just a few. However, in contrast to monolingual language corpora, there are still only a few parallel corpora available, especially ones containing more than two languages. ...
Article
Full-text available
In this paper on-going work of creating an extensive multilingual parallel corpus of movie subtitles is presented. The corpus currently contains roughly 23,000 pairs of aligned subtitles covering about 2,700 movies in 29 languages. Subtitles mainly consist of transcribed speech, sometimes in a very condensed way. Insertions, deletions and paraphrases are very frequent, which makes them a challenging data set to work with, especially when applying automatic sentence alignment. Standard alignment approaches rely on translation consistency either in terms of length or term translations or a combination of both. In the paper, we show that these approaches are not applicable for subtitles and we propose a new alignment approach based on time overlaps specifically designed for subtitles. In our experiments we obtain a significant improvement of alignment accuracy compared to standard length-based approaches.
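The idea of aligning subtitles by time overlap can be illustrated with a minimal sketch. This is not the authors' algorithm, which the abstract does not specify; it only assumes that each subtitle carries a start and end time in seconds and links every source subtitle to the target subtitle with which it shares the most screen time. The names `Subtitle`, `overlap` and `align_by_time` are hypothetical.

```python
from typing import List, Tuple

# Hypothetical representation: (start_seconds, end_seconds, text).
Subtitle = Tuple[float, float, str]


def overlap(a: Subtitle, b: Subtitle) -> float:
    """Return the length in seconds of the time interval shared by a and b."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


def align_by_time(src: List[Subtitle], tgt: List[Subtitle]) -> List[Tuple[int, int]]:
    """Link each source subtitle to the target subtitle it overlaps most.

    Source subtitles with no overlapping target subtitle are left unlinked.
    """
    links = []
    for i, s in enumerate(src):
        best_j, best_overlap = -1, 0.0
        for j, t in enumerate(tgt):
            shared = overlap(s, t)
            if shared > best_overlap:
                best_j, best_overlap = j, shared
        if best_j >= 0:
            links.append((i, best_j))
    return links


# Invented toy data: two English subtitles, three Norwegian ones.
english = [(0.0, 2.0, "Hello."), (2.5, 5.0, "How are you?")]
norwegian = [(0.1, 1.9, "Hei."), (2.4, 3.5, "Hvordan"), (3.6, 5.1, "har du det?")]
links = align_by_time(english, norwegian)
```

A length-based aligner would compare character counts instead; the sketch uses only timing, which is why insertions and paraphrases in the text itself do not affect the links.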
Article
Automatische Extraktion von bilingualen Valenzwörterbüchern (Automatic extraction of bilingual valency dictionaries). Language teaching, translation and machine translation have long benefited from bilingual valency dictionaries. Whereas these dictionaries were previously compiled through laborious manual work, multilingual corpora open up new prospects for the (semi-)automatic compilation of valency dictionaries from real language data. Parallel corpora in particular, that is, text collections of originals and their translations, move into the focus of attention, since they make it easier, at least in theory, to find equivalents. In practice this is hampered by the fact that original and translation are not always fully congruent, at the syntactic as well as the semantic level. This book describes experiments that use a German-English parallel corpus to investigate how syntactic divergences between German and English can be automatically detected and described on the basis of multi-level annotation and alignment. Practical uses, such as implementation as transfer rules or in hypertextual dictionaries, are outlined, and possible causes and implications of semantic divergences are examined.
... MULTEXT (Ide and Véronis 1994) and PAROLE (Kruyt 1998, de Does and van der Voort/van der Kleij 2002) are typical examples of projects that focus on harmonization of multilingual corpus standards, but they contain no translations for the Dutch text samples. Table 1 gives an overview of the main presently available parallel corpora containing a Dutch component: the Namur Corpus (Paulussen 1999), the European Corpus Initiative Multilingual Corpus I (ECI/MCI) corpus, the MLCC corpus, the Scania corpus (Tjong Kim Sang 1996), the Oslo Multilingual Corpus (Johansson 2002a, Johansson 2002b), the Europarl corpus (Koehn 2005), and the OPUS corpus (Tiedemann and Nygaard 2004). The corpora are sorted according to their creation period. ...
Article
Full-text available
Nowadays, text corpora play an important role in language research and all fields involving language study, including theoretical and applied linguistics, language technology, translation studies and CALL (Computer Assisted Language Learning). Multilingual corpora, especially translated corpora, are not always readily available for Dutch. Much depends on the private initiative of individuals, and access to the data is often restricted. The DPC project (Dutch Parallel Corpus), which is carried out within the STEVIN program (Odijk et al. 2004), intends to fill the gap for this type of corpora for Dutch. This paper gives an overview of the DPC project. First, an overview and a discussion are given of the main parallel corpora containing Dutch. Then the DPC project is described, focusing on those aspects that make the DPC different from existing parallel corpora. Finally, the choice of an XML-based format is explained.
Article
Full-text available
This paper reports on a cross-linguistic study of ‘be’ verbs in Czech, English and Norwegian, viz. být, be and være, drawing on data from the fiction part of the International Comparable Corpus (Čermáková et al. 2021). The study identifies two main uses of ‘be’ verbs: auxiliary and linking, plus an ‘other’ category which includes minor (often) language-specific uses. The study reveals marked proportional differences in how the three languages exploit the grammatical and functional potential of their respective ‘be’ verbs, notably there is a marked preference for linking uses in English and Norwegian and a more even distribution between auxiliary and linking uses in Czech. In a case study of the linking use, the languages are shown to behave similarly, but with some minor differences regarding choice of adjective to describe fictional subjects. The methodology highlights the importance of a carefully crafted tertium comparationis at several levels, not only in relation to datasets and linguistic phenomena investigated, but also as regards terminology and grammatical traditions of description for the languages compared.
Article
This article compares phraseological tendencies in translated vs. non-translated English through functionally classified 3-word sequences. The study builds on previous research that compared 3-grams in fiction texts originally written in English with fiction texts translated from Norwegian. The current investigation adds English translations from two additional languages – German and Swedish – with the aim of establishing to what extent the tendencies noted for English translations from Norwegian extend to English translations from other languages. Thus the study contributes to the discussion of translation universals and translation as a third code. At the level of 3-gram functions, it has been uncovered that English originals and translations share similar functional characteristics in eight of the fourteen categories identified. Of the remaining six, four show statistically significant differences between originals and translations, regardless of source language. Based on a more qualitative study of four specific 3-grams from two of these categories, it is concluded, in line with the previous studies, that the most likely explanations are source language(s) shining through and the (potentially universal) tendency for translators to use a smaller and more fixed set of expressions in their translations.
Chapter
Parallel corpora are a unique resource in language acquisition that enables learners to conceptualize a target language through the established schemas of their first language by providing parallel representations of text in two or more languages. Parallel corpora are defined as specialized translation corpora that consist of source texts in one language that are aligned with translation texts in one or more additional languages. The following chapter thoroughly explores the pedagogical application of parallel corpora in general, before taking an in-depth look at how English L1 beginning-level learners of Mandarin Chinese applied a Chinese–English parallel corpus. In addition to elucidating the specific observed outcomes of parallel corpora in this unique learning context, numerous parallel corpus resources are detailed with suggestions for pedagogical application, and an extensive review of potential further applications based on continued research in the field is enumerated and analyzed.
Article
Full-text available
This paper proposes an effective method to extract printed and handwritten characters from multilingual document images to build a corpus. To extract the characters from the document images, a connected component analysis method is used to remove the graphics. After that, multiple types of features and the AdaBoost algorithm are introduced to classify printed and handwritten characters in a more versatile and robust way. Firstly, the content of the image is divided into several text patches, which are then used to distinguish different languages. Secondly, we use the multiple types of features and the AdaBoost algorithm to train the classifiers based on the segmented patches. Finally, we can separate printed and handwritten parts of a new image set with the trained classifiers. The proposed method improves the precision of the extraction of written materials in text images of different languages. Experimental results demonstrate that the proposed method is more accurate in terms of precision and recall rate compared with the state-of-the-art methods.
Article
Full-text available
The purpose of this paper is to assess the current status of English-Arabic Contrastive Analysis (CA) in Iraqi universities & to suggest some redefinitions of the goals of this analysis accordingly. A sample of 25 theses that have been randomly chosen has been investigated and chronologically appended. The historical background will have a bird's eye view on the various phases CA in general has undergone so far. One of the limitations is consulting English references only, admitting that Arab scholars did have their own contribution as well.
Article
It is sometimes said that part of speech (POS) tags are likely to be the same for translation equivalent words. If this is correct, we could formulate the following hypothesis: It should be possible to use POS tagging for one language in combination with a word alignment system, in order to obtain a (partial) POS tagging for another language. This hypothesis is investigated both empirically—an experiment is described where POS tags were transferred from a POS tagged German text to a parallel Swedish text by automatic word alignment—and theoretically, in the form of a review of relevant linguistic work on the typology of POS systems. The conclusions are that the hypothesis seems to hold at least for closely related languages, that the findings of typological research do not contradict it (or a slightly modified form of it), but that further empirical research is needed.
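The hypothesis in this abstract, that POS tags can be carried across a word alignment, can be sketched in a few lines. This is only an illustration under stated assumptions, not the experiment described in the paper: the alignment is supplied by hand rather than by an automatic word aligner, the tag names are merely illustrative, and the function name `project_pos` is hypothetical.

```python
from typing import List, Optional, Tuple


def project_pos(
    src_tags: List[str],
    alignment: List[Tuple[int, int]],
    tgt_len: int,
) -> List[Optional[str]]:
    """Copy POS tags from source tokens to aligned target tokens.

    alignment holds (source_index, target_index) pairs. Target tokens
    without an alignment link stay None, so the tagging is only partial.
    If a target token has several links, the first one wins.
    """
    tgt_tags: List[Optional[str]] = [None] * tgt_len
    for s, t in alignment:
        if tgt_tags[t] is None:
            tgt_tags[t] = src_tags[s]
    return tgt_tags


# Invented toy example: German "Der Hund schläft" and Swedish "Hunden sover".
german_tags = ["ART", "NN", "VVFIN"]      # illustrative tag names
hand_alignment = [(1, 0), (2, 1)]         # Hund-Hunden, schläft-sover
swedish_tags = project_pos(german_tags, hand_alignment, tgt_len=2)
```

The German article has no Swedish counterpart here because Swedish marks definiteness with a suffix, which is one reason a projected tagging can only be partial.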