Julia Trushkina’s research while affiliated with North-West University and other places


Publications (7)


The North-West University Bible corpus: A multilingual parallel corpus for South African languages
  • Article

June 2008

·

49 Reads

·

1 Citation

Language Matters

Julia Trushkina

This article reports on a project on the development of language resources and linguistic tools for Afrikaans. In the first part of the article, the multilingual parallel corpus that was created in the project is described. The corpus consists of aligned translations of the Bible in three languages, English, Dutch and Afrikaans. The second part of the article discusses one of the applications of a multilingual parallel corpus: induction of linguistic tools for a resource-scarce language.


Table 4 summarizes the results of the evaluation.
Sentence Alignment in DPC: Maximizing Precision, Minimizing Human Effort.
  • Conference Paper
  • Full-text available

January 2008

·

79 Reads

·

5 Citations

A wide spectrum of multilingual applications has aligned parallel corpora as a prerequisite. The aim of the project described in this paper is to build a multilingual corpus in which all sentences are aligned at very high precision with minimal human effort. Experiments combining sentence aligners with different underlying algorithms showed that by manually verifying only those links that were not recognized by at least two aligners, the error rate can be reduced by 93.76% compared to the performance of the best aligner. This manual involvement concerned only a small portion of the data (6%), which significantly reduces the manual work needed to achieve nearly 100% alignment accuracy.
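The agreement-based filtering described in this abstract can be illustrated with a minimal sketch. The abstract does not name the aligners or specify how links are represented, so the link format (tuples of source and target sentence indices), the function name `split_links`, and the toy data below are all assumptions made for illustration, not the authors' implementation.

```python
from collections import Counter


def split_links(aligner_outputs, min_votes=2):
    """Partition proposed links into auto-accepted and needs-review sets.

    aligner_outputs: one iterable of links per aligner. A link is assumed
    (hypothetically) to be a pair of tuples:
    (source sentence indices, target sentence indices).
    A link is accepted automatically when at least `min_votes` aligners
    propose it; every other proposed link is queued for manual checking.
    """
    votes = Counter()
    for links in aligner_outputs:
        # set() ensures each aligner votes at most once per link
        votes.update(set(links))
    accepted = {link for link, n in votes.items() if n >= min_votes}
    to_review = set(votes) - accepted
    return accepted, to_review


# Toy output of three hypothetical aligners on a four-sentence text.
aligner_a = [((0,), (0,)), ((1,), (1,)), ((2, 3), (2,))]
aligner_b = [((0,), (0,)), ((1,), (1,)), ((2,), (2,)), ((3,), (3,))]
aligner_c = [((0,), (0,)), ((1,), (1, 2)), ((2, 3), (2,))]

accepted, to_review = split_links([aligner_a, aligner_b, aligner_c])
print(sorted(accepted))
print(sorted(to_review))
```

In this sketch only the links in `to_review` would go to a human annotator, which mirrors the idea of restricting manual verification to the links on which the aligners disagree.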



Development of a Multilingual Parallel Corpus and a Part-of-Speech Tagger for Afrikaans

September 2006

·

26 Reads

This paper describes the design and creation of a multilingual parallel corpus for South African languages. One application of the corpus, the induction of a part-of-speech tagger for Afrikaans from the data, is also presented. The development of the Afrikaans part-of-speech tagger is based on a modified version of the method for inducing linguistic tools from parallel corpora originally proposed by Yarowsky and Ngai (2001).


Automatic error detection in second language learner's writing

January 2006

·

37 Reads

·

1 Citation

Language Matters

This article presents a system designed for the automatic detection of errors in second language learners' writing. The system contains two modules: one aimed at detecting and correcting spelling errors, and a second providing broad detection of higher-level language errors. The article describes extensions to existing error detection algorithms and introduces a novel method of context-based ranking of spelling correction candidates based on probabilistic context-free grammars. The performance of the system is evaluated on real data with manually marked and classified errors.


Figure 1: simplified DTD sample for a book
Dutch Parallel Corpus: a multifunctional and multilingual corpus

January 2006

·

350 Reads

·

15 Citations

Text corpora now play an important role in language research and in all fields involving language study, including theoretical and applied linguistics, language technology, translation studies and CALL (Computer Assisted Language Learning). Multilingual corpora, especially translated corpora, are not always readily available for Dutch. Much depends on the private initiative of individuals, and the data are often available only under restrictions. The DPC project (Dutch Parallel Corpus), which is carried out within the STEVIN program (Odijk et al. 2004), intends to fill the gap for this type of corpus for Dutch. This paper gives an overview of the DPC project. First, the main parallel corpora containing Dutch are surveyed and discussed. Then the DPC project is described, focusing on those aspects that make the DPC different from existing parallel corpora. Finally, the choice of an XML-based format is explained.


Figure 1: Sub-sentential alignment 
Dutch Parallel Corpus: MT Corpus and translator's aid.

129 Reads

·

7 Citations

This paper reports on the development of the Dutch Parallel Corpus: a high-quality, sentence-aligned parallel corpus of 10 million words for the language pairs Dutch-English and Dutch-French. The corpus is composed of different text types. All steps of processing the corpus, including alignment and linguistic annotation, undergo quality control on different levels. Four categories of potential users of the DPC can be distinguished: developers of HLT applications, linguists conducting more fundamental research, human translators and language learners. This paper focuses on two types of intended users: MT developers and human translators. The paper describes different characteristics of the corpus relevant for such users, concentrating on corpus design, processing of the corpus data and the exploitation of the corpus.

Citations (6)


... They established that spelling errors were the single biggest contributor to tagging error. Truskina (2006) builds on this work by developing a computational tool to correct spelling errors automatically in the data. She extended the spelling correction work by also looking at a number of other errors that can be detected automatically in learner corpus data through the evaluation of unlikely or low-frequency sequences of part-of-speech tags. ...

Reference:

Learner Corpus Research in South Africa (1989–2019)
Automatic error detection in second language learner's writing
  • Citing Article
  • January 2006

Language Matters

... In translation dictionaries, at any rate, they count as a translation pair. This is also confirmed by an overview of the translation pattern of the two markers in the Dutch Parallel Corpus (Macken et al. 2007). In literary texts, dus is translated as donc in 42% of cases, and in political speeches this percentage rises to 65%. ...

Dutch Parallel Corpus: A Multilingual Annotated Corpus

... There has been a growing interest in NLP in Africa [3,4,5,6]. TALAf (Traitement Automatique des Langues Africaines (text and speech)) is a workshop (held at the JEP-TALN-RECITAL conference, 2016) with the aim of bringing together researchers in the NLP field working on African Indigenous Languages (AIL) through: meetings at the workshop; extracting knowledge using open source tools, standards (ISO, Unicode), and publishing the tools developed with an open license to avoid losses when a project stops and cannot be reopened for lack of resources; developing a set of best practices based on the researchers' acquaintances; setting up simple and effective methodologies based on free, or almost free, software for the development of tools; communicating methods that can eschew the use of non-existent tools; and refraining from loss of time and energy. AFLAT is an African Language Technology body interested in language technology research for AIL, aiming to catalogue resources (such as corpora, dictionaries, and NLP tools) for the majority of resource-scarce AIL (both current and extinct) for the benefit of researchers interested in African language technology. ...

The North-West University Bible corpus: A multilingual parallel corpus for South African languages
  • Citing Article
  • June 2008

Language Matters

... For the extraction and the validation step of the bootstrapping process we extracted two subcorpora from the Dutch Parallel Corpus [19]. The Dutch Parallel Corpus has a balanced composition and contains five text types: administrative texts, texts treating external communication, literary texts, journalistic texts and instructive texts. ...

Dutch Parallel Corpus: MT Corpus and translator's aid.