About
95
Publications
14,086
Reads
601
Citations
Additional affiliations
September 2012 - December 2012

Tilde Company
Position
- TTC
Description
- TTC is a collaborative project funded within the FP7-ICT-2009-4 call, action ICT-2009.2.2 (Language-based interaction), under grant agreement no. 248005. More information can be found at http://www.ttc-project.eu.
June 2012 - June 2014

Tilde Company
Position
- TaaS
Description
- TaaS is a collaborative project funded within the European Union Seventh Framework Programme (FP7/2007-2013), grant agreement no. 296312. More information can be found at http://www.taas-project.eu.
Education
October 2011 - May 2015
October 2008 - June 2009
Publications (95)
This paper investigates a hybrid method for translation from English into Latvian by chaining an NMT system with an SMT system in order to cover out-of-vocabulary word translation. Different from other works, the primary translation is handled by the NMT system, and the SMT system acts as a secondary system. Automatic evaluation results have shown...
The aim of this doctoral thesis is to research methods and develop tools for successfully integrating bilingual terminology into statistical machine translation systems, so that both the translation quality of terminology and the overall translation quality of the source text increase. The author presents novel methods fo...
In this paper the author presents methods for dynamic terminology integration in statistical machine translation systems using a source text pre-processing workflow. The workflow consists of exchangeable components for term identification, inflected form generation for terms, and term translation candidate ranking. Automatic evaluation for three...
In this paper the author presents a new context independent method for bilingual term mapping using maximised character alignment maps. The method tries to particularly address mapping of multi-word terms and compound terms that are extracted from comparable corpora. The method allows integrating linguistic resources (e.g., probabilistic dictionari...
Although term extraction has been researched for more than 20 years, only a few studies focus on under-resourced languages. Moreover, bilingual term mapping from comparable corpora for these languages has attracted researchers only recently. This paper presents methods for term extraction, term tagging in documents, and bilingual term mapping from...
Machine Translation (MT) is one of the oldest language technologies having been researched for more than 70 years. However, it is only during the last decade that it has been widely accepted by the general public, to the point where in many cases it has become an indispensable tool for the global community, supporting communication between nations...
In this paper, we examine the development and usage of six low-resource machine translation systems translating between the Ukrainian language and each of the official languages of the Baltic states. We developed these systems in reaction to the escalating Ukrainian refugee crisis caused by the Russian military aggression in Ukraine in the hope tha...
Knowledge bases such as Wikidata amass vast amounts of named entity information, such as multilingual labels, which can be extremely useful for various multilingual and cross-lingual applications. However, such labels are not guaranteed to match across languages from an information consistency standpoint, greatly compromising their usefulness for f...
Consolidated access to current and reliable terms from different subject fields and languages is necessary for content creators and translators. Terminology is also needed in AI applications such as machine translation, speech recognition, information extraction, and other natural language processing tools. In this work, we facilitate standards-bas...
This contribution describes the German EU Council Presidency Translator (EUC PT), a machine translation service created for the German EU Council Presidency in the second half of 2020, which is open to the general public. Following a series of earlier presidency translators, the German version exhibits important extensions and improvements. The Ger...
The majority of language domains require prudent use of terminology to ensure clarity and adequacy of information conveyed. While the correct use of terminology for some languages and domains can be achieved by adapting general-purpose MT systems on large volumes of in-domain parallel data, such quantities of domain-specific data are seldom availab...
Most of the recent work on terminology integration in machine translation has assumed that terminology translations are given already inflected in forms that are suitable for the target language sentence. In day-to-day work of professional translators, however, it is seldom the case as translators work with bilingual glossaries where terms are give...
This paper describes Tilde's submission to the WMT2020 shared task on news translation for both directions of the English-Polish language pair in both the constrained and the unconstrained tracks. We follow our submissions from the previous years and build our baseline systems to be morphologically motivated sub-word unit-based Transformer base mod...
In this paper, we present various pre-training strategies that aid in improving the accuracy of the sentiment classification task. We, at first, pre-train language representation models using these strategies and then fine-tune them on the downstream task. Experimental results on a time-balanced tweet evaluation set show the improvement over the prev...
When translating "The secretary asked for details." to a language with grammatical gender, it might be necessary to determine the gender of the subject "secretary". If the sentence does not contain the necessary information, it is not always possible to disambiguate. In such cases, machine translation systems select the most common translation opti...
Pipeline-based speech translation methods may suffer from errors found in speech recognition system output. Therefore, it is crucial that machine translation systems are trained to be robust against such noise. In this paper, we propose two methods for parallel data augmentation for pipeline-based speech translation system development. The first me...
Neural machine translation systems typically are trained on curated corpora and break when faced with non-standard orthography or punctuation. Resilience to spelling mistakes and typos, however, is crucial as machine translation systems are used to translate texts of informal origins, such as chat conversations, social media posts and web pages. We...
The paper describes the Latvian e-government language technology platform HUGO.LV. It provides an instant translation of text snippets, formatting-rich documents and websites, an online computer-assisted translation tool with a built-in translation memory, a website translation widget, speech recognition and speech synthesis services, a terminology...
In this paper, we present various pre-training strategies that aid in improving the accuracy of the sentiment classification task. At first, we pre-train language representation models using these strategies and then fine-tune them on the downstream task. Experimental results on a time-balanced tweet evaluation set show the improvement over the pre...
Manual processes in accounting can introduce errors that affect business decisions. Automation (or at least partial automation of accounting processes) can help to minimise human errors. In this paper, we investigate methods for the automation of one of the processes involved in invoice posting – the assignment of account codes to posting entries –...
The paper describes the development process of Tilde's NMT systems for the WMT 2019 shared task on news translation. We trained systems for the English-Lithuanian and Lithuanian-English translation directions in constrained and unconstrained tracks. We build upon the best methods of the previous year's competition and combine them with recent advan...
The tools that were developed through the ACCURAT project and are presented in this book are packed into the ACCURAT toolkit (Pinnis et al. 2012a)—a collection of tools that are capable of collecting comparable corpora, analysing and extracting parallel data. The ACCURAT toolkit produces
Comparable corpora may comprise different types of single-word and multi-word phrases that can be considered as reciprocal translations, which may be beneficial for many different natural language processing tasks. This chapter describes methods and tools developed within the ACCURAT project that allow utilising comparable corpora in order to (1) i...
This chapter describes how semi-parallel and parallel data extracted from comparable corpora can be used in enhancing machine translation (MT) systems: what are the methods used for this task in statistical and rule-based machine translation systems; what kinds of showcases exist that illustrate the usage of such enhanced MT systems. The impact of...
The paper describes parallel corpus filtering methods that allow reducing noise of noisy "parallel" corpora from a level where the corpora are not usable for neural machine translation training (i.e., the resulting systems fail to achieve reasonable translation quality; well below 10 BLEU points) up to a level where the trained systems show decent...
The paper describes the development process of Tilde's NMT systems that were submitted for the WMT 2018 shared task on news translation. We describe the data filtering and pre-processing workflows, the NMT system training architectures, and automatic evaluation results. For the WMT 2018 shared task, we submitted seven systems (both constrained...
We present the Latvian Tweet Corpus and its application in sentiment analysis by comparing four different machine learning algorithms and a lexical classification method. We show that the best results are achieved by an averaged perceptron classifier. In our experiments, the more complex neural network-based classification methods (using recurrent...
Online learning has been an active research area in statistical machine translation. However, as we have identified in our research, the implementation of successful online learning capabilities in the Moses SMT system can be challenging. In this work, we show how to use open source and freely available tools and methods in order to successfully im...
In this paper, we present results of employing multilingual and multi-way neural machine translation approaches for morphologically rich languages, such as Estonian and Russian. We experiment with different NMT architectures that allow achieving state-of-the-art translation quality and compare the multi-way model performance to one-way model perfor...
In this paper, we present Tilde MT, a custom machine translation (MT) platform that provides linguistic data storage (parallel, monolingual corpora, multilingual term collections), data cleaning and normalisation, statistical and neural machine translation system training and hosting functionality, as well as wide integration capabilities (a machin...
The paper describes Tilde's work on developing a neural machine translation (NMT) tool for the 2017-2018 Presidency of the Council of the European Union. The tool was developed by combining the European Commission's eTranslation service with a set of customized, domain-adapted NMT systems built by Tilde. The central aim of the tool is to assist sta...
In this paper, we present Tilde's work on boosting the output quality and availability of Estonian machine translation systems, focusing mostly on the less resourced and morphologically complex language pairs between Estonian and Russian. We describe our efforts on collecting parallel and monolingual data for the development of better neural machin...
In this paper, we describe a tool for debugging the output and attention weights of neural machine translation (NMT) systems and for improved estimations of confidence about the output based on the attention. We dive deeper into ways for it to handle output from transformer-based NMT models. Its purpose is to help researchers and developers find we...
The recent technological shift in machine translation from statistical machine translation (SMT) to neural machine translation (NMT) raises the question of the strengths and weaknesses of NMT. In this paper, we present an analysis of NMT and SMT systems' outputs from narrow domain English-Latvian MT systems that were trained on a rather small amoun...
This paper analyses issues of rare and unknown word splitting with byte pair encoding for neural machine translation and proposes two methods that allow improving the quality of word splitting. The first method linguistically guides byte pair encoding and the second method limits splitting of unknown words. We also evaluate corpus re-translation fo...
The paper evaluates neural machine translation systems and phrase-based machine translation systems for highly inflected and small languages. It analyses two translation scenarios: (1) when translating broad domain data from a morphologically rich language into a morphologically rich language or English (and vice versa), and (2) when translating na...
The paper describes findings of a large post-editing project in the medical domain carried out by Tilde. It analyzes the efficacy of post-editing of highly technical texts in a specialized domain and provides answers to questions important to localization service providers that consider the introduction of post-editing in their translation workflow...
In this paper the authors present a speech corpus designed and created for the development and evaluation of dictation systems in Latvian. The corpus consists of over nine hours of orthographically annotated speech from 30 different speakers. The corpus features spoken commands that are common for dictation systems for text editors. The corpus is e...
In this paper we make two contributions. First, we describe a multi-component system called BiTES (Bilingual Term Extraction System) designed to automatically gather domain-specific bilingual term pairs from Web data. BiTES components consist of data gathering tools, domain classifiers, monolingual text extraction systems and bilingual term aligner...
This paper presents a case study about the development of MT systems for two Baltic governments. The governments of Latvia and Lithuania presented Tilde with a need to expand their communication to reach multilingual citizens. In order to meet this need, Tilde collected a vast amount of domain-specific data and trained MT system to produce high-qu...
In this paper we share our experience from implementing machine translation in localization into relatively small languages of the three Baltic countries – Latvian, Lithuanian, and Estonian. We describe our approach in improving terminology translation and consistency by pre-processing of the source text and performing term integration. We present...
Transliteration dictionaries are an important resource for the development of machine transliteration systems. The paper describes and analyses a large multilingual transliteration dictionary extracted from probabilistic dictionaries for 24 European languages containing approximately 1.25 million transliterated word pairs. The transliteration dicti...
In this paper, the authors present the results of ongoing research on Large Vocabulary Automatic Speech Recognition for the Latvian language. The paper describes the initial acoustic model, phoneme set, filler and noise models, and grapheme-to-phoneme modelling. The second part of this work is focused on language modelling. Different word and class...
Grapheme to phoneme modelling is one of the key features in automated speech recognition and speech synthesis. In this paper, the authors compare two different approaches: a statistical machine translation based method using the phonetically transcribed Latvian Speech Recognition Corpus and a rule-based method for phonetic transcription of words fr...
The authors present a service-based model for semi-automatic generation of multilingual terminology resources which, if performed manually, is very time consuming. In this model, the automation of individual terminology work tasks is rendered as a set of interoperable cloud-based services integrated into workflows. These services automate the iden...
This paper presents a set of principles and practical guidelines for terminology work in the national scenario to ensure a harmonized approach in term localization. These linguistic principles and guidelines are elaborated by the Terminology Commission in Latvia in the domain of Information and Communication Technology (ICT). We also present a nove...
In this paper the authors present the first Latvian speech corpus designed specifically for speech recognition purposes. The paper outlines the decisions made in the corpus designing process through analysis of related work on speech corpora creation for different languages. The authors also provide guidelines that were used for the creation of the...
Bilingual dictionaries can be automatically generated using the GIZA++ tool. However, these dictionaries contain a lot of noise, which degrades the output quality of tools that rely on them. In this work, we present three different methods for cleaning noise from automatically generated bilingual dictionari...
This paper evaluates the impact of machine translation on the software localization process and the daily work of professional translators when SMT is applied to low-resourced languages with rich morphology. Translation from English into six low-resourced languages (Czech, Estonian, Hungarian, Latvian, Lithuanian and Polish) from different lan...
In this demonstration paper we present TaaS ("Terminology as a Service"), an innovative cloud-based platform for acquiring, cleaning up, sharing, and reusing raw terminological data. The platform serves, among others, the needs of specialised lexicography. The proposed solution aims at filling the gap of collaborative terminology ma...