Mārcis Pinnis
Tilde Company · MT Group

Dr.Sc.Comp.

About

95 Publications
14,086 Reads
601 Citations
Citations since 2017
51 Research Items
434 Citations
[Chart: yearly values for 2017 to 2023, axis scale 0 to 80]
Additional affiliations
January 2017 - present
Ekonomikas un kultūras augstskola
Position
  • Professor (Assistant)
Description
  • Artificial Intelligence (since spring 2017), Systems Modelling Basics (since fall 2018)
September 2012 - December 2012
Tilde Company
Position
  • TTC
Description
  • TTC is a collaborative project funded within the FP7-ICT-2009-4 call, action ICT-2009.2.2: Language-based interaction, under grant agreement no. 248005. More information can be found at http://www.ttc-project.eu.
June 2012 - June 2014
Tilde Company
Position
  • TaaS
Description
  • TaaS is a collaborative project funded within the European Union Seventh Framework Programme (FP7/2007-2013), grant agreement no. 296312. More information can be found at http://www.taas-project.eu.
Education
October 2011 - May 2015
University of Latvia
Field of study
  • Computer Science
October 2008 - June 2009
University of Cambridge
Field of study
  • Computer Speech, Text and Internet Technology

Publications (95)
Conference Paper
This paper investigates a hybrid method for translation from English into Latvian by chaining an NMT system with an SMT system in order to cover out-of-vocabulary word translation. Different from other works, the primary translation is handled by the NMT system, and the SMT system acts as a secondary system. Automatic evaluation results have shown...
Thesis
Full-text available
The aim of this doctoral thesis is to research methods and develop tools that allow successfully integrating bilingual terminology into statistical machine translation systems so that the translation quality of terminology would increase and that the overall translation quality of the source text would increase. The author presents novel methods fo...
Conference Paper
Full-text available
In this paper the author presents methods for dynamic terminology integration in statistical machine translation systems using a source text pre-processing workflow. The workflow consists of exchangeable components for term identification, inflected form generation for terms, and term translation candidate ranking. Automatic evaluation for three...
Conference Paper
Full-text available
In this paper the author presents a new context independent method for bilingual term mapping using maximised character alignment maps. The method tries to particularly address mapping of multi-word terms and compound terms that are extracted from comparable corpora. The method allows integrating linguistic resources (e.g., probabilistic dictionari...
Conference Paper
Full-text available
Although term extraction has been researched for more than 20 years, only a few studies focus on under-resourced languages. Moreover, bilingual term mapping from comparable corpora for these languages has attracted researchers only recently. This paper presents methods for term extraction, term tagging in documents, and bilingual term mapping from...
Chapter
Full-text available
Machine Translation (MT) is one of the oldest language technologies having been researched for more than 70 years. However, it is only during the last decade that it has been widely accepted by the general public, to the point where in many cases it has become an indispensable tool for the global community, supporting communication between nations...
Preprint
Full-text available
In this paper, we examine the development and usage of six low-resource machine translation systems translating between the Ukrainian language and each of the official languages of the Baltic states. We developed these systems in reaction to the escalating Ukrainian refugee crisis caused by the Russian military aggression in Ukraine in the hope tha...
Chapter
Knowledge bases such as Wikidata amass vast amounts of named entity information, such as multilingual labels, which can be extremely useful for various multilingual and cross-lingual applications. However, such labels are not guaranteed to match across languages from an information consistency standpoint, greatly compromising their usefulness for f...
Preprint
Full-text available
Consolidated access to current and reliable terms from different subject fields and languages is necessary for content creators and translators. Terminology is also needed in AI applications such as machine translation, speech recognition, information extraction, and other natural language processing tools. In this work, we facilitate standards-bas...
Preprint
Full-text available
Knowledge bases such as Wikidata amass vast amounts of named entity information, such as multilingual labels, which can be extremely useful for various multilingual and cross-lingual applications. However, such labels are not guaranteed to match across languages from an information consistency standpoint, greatly compromising their usefulness for f...
Article
Full-text available
This contribution describes the German EU Council Presidency Translator (EUC PT), a machine translation service created for the German EU Council Presidency in the second half of 2020, which is open to the general public. Following a series of earlier presidency translators, the German version exhibits important extensions and improvements. The Ger...
Preprint
Full-text available
The majority of language domains require prudent use of terminology to ensure clarity and adequacy of information conveyed. While the correct use of terminology for some languages and domains can be achieved by adapting general-purpose MT systems on large volumes of in-domain parallel data, such quantities of domain-specific data are seldom availab...
Preprint
Full-text available
Most of the recent work on terminology integration in machine translation has assumed that terminology translations are given already inflected in forms that are suitable for the target language sentence. In day-to-day work of professional translators, however, it is seldom the case as translators work with bilingual glossaries where terms are give...
Preprint
Full-text available
This paper describes Tilde's submission to the WMT2020 shared task on news translation for both directions of the English-Polish language pair in both the constrained and the unconstrained tracks. We follow our submissions from the previous years and build our baseline systems to be morphologically motivated sub-word unit-based Transformer base mod...
Preprint
In this paper, we present various pre-training strategies that aid in improving the accuracy of the sentiment classification task. We, at first, pre-train language representation models using these strategies and then fine-tune them on the downstream task. Experimental results on a time-balanced tweet evaluation set show the improvement over the prev...
Preprint
Full-text available
When translating "The secretary asked for details." to a language with grammatical gender, it might be necessary to determine the gender of the subject "secretary". If the sentence does not contain the necessary information, it is not always possible to disambiguate. In such cases, machine translation systems select the most common translation opti...
Chapter
Full-text available
Pipeline-based speech translation methods may suffer from errors found in speech recognition system output. Therefore, it is crucial that machine translation systems are trained to be robust against such noise. In this paper, we propose two methods for parallel data augmentation for pipeline-based speech translation system development. The first me...
Chapter
Full-text available
Neural machine translation systems typically are trained on curated corpora and break when faced with non-standard orthography or punctuation. Resilience to spelling mistakes and typos, however, is crucial as machine translation systems are used to translate texts of informal origins, such as chat conversations, social media posts and web pages. We...
Chapter
Full-text available
The paper describes the Latvian e-government language technology platform HUGO.LV. It provides an instant translation of text snippets, formatting-rich documents and websites, an online computer-assisted translation tool with a built-in translation memory, a website translation widget, speech recognition and speech synthesis services, a terminology...
Chapter
Full-text available
In this paper, we present various pre-training strategies that aid in improving the accuracy of the sentiment classification task. At first, we pre-train language representation models using these strategies and then fine-tune them on the downstream task. Experimental results on a time-balanced tweet evaluation set show the improvement over the pre...
Preprint
Full-text available
Neural machine translation systems typically are trained on curated corpora and break when faced with non-standard orthography or punctuation. Resilience to spelling mistakes and typos, however, is crucial as machine translation systems are used to translate texts of informal origins, such as chat conversations, social media posts and web pages. We...
Chapter
Manual processes in accounting can introduce errors that affect business decisions. Automation (or at least partial automation of accounting processes) can help to minimise human errors. In this paper, we investigate methods for the automation of one of the processes involved in invoice posting – the assignment of account codes to posting entries –...
Conference Paper
Full-text available
The paper describes the development process of Tilde's NMT systems for the WMT 2019 shared task on news translation. We trained systems for the English-Lithuanian and Lithuanian-English translation directions in constrained and unconstrained tracks. We build upon the best methods of the previous year's competition and combine them with recent advan...
Chapter
The tools that were developed through the ACCURAT project and are presented in this book are packed into the ACCURAT toolkit (Pinnis et al. 2012a)—a collection of tools that are capable of collecting comparable corpora, analysing and extracting parallel data. The ACCURAT toolkit produces
Chapter
Comparable corpora may comprise different types of single-word and multi-word phrases that can be considered as reciprocal translations, which may be beneficial for many different natural language processing tasks. This chapter describes methods and tools developed within the ACCURAT project that allow utilising comparable corpora in order to (1) i...
Chapter
This chapter describes how semi-parallel and parallel data extracted from comparable corpora can be used in enhancing machine translation (MT) systems: what are the methods used for this task in statistical and rule-based machine translation systems; what kinds of showcases exist that illustrate the usage of such enhanced MT systems. The impact of...
Conference Paper
Full-text available
The paper describes parallel corpus filtering methods that allow reducing noise of noisy "parallel" corpora from a level where the corpora are not usable for neural machine translation training (i.e., the resulting systems fail to achieve reasonable translation quality; well below 10 BLEU points) up to a level where the trained systems show decent...
Conference Paper
Full-text available
The paper describes the development process of Tilde's NMT systems that were submitted for the WMT 2018 shared task on news translation. We describe the data filtering and pre-processing workflows, the NMT system training architectures, and automatic evaluation results. For the WMT 2018 shared task, we submitted seven systems (both constrained...
Conference Paper
We present the Latvian Tweet Corpus and its application in sentiment analysis by comparing four different machine learning algorithms and a lexical classification method. We show that the best results are achieved by an averaged perceptron classifier. In our experiments, the more complex neural network-based classification methods (using recurrent...
Conference Paper
Online learning has been an active research area in statistical machine translation. However, as we have identified in our research, the implementation of successful online learning capabilities in the Moses SMT system can be challenging. In this work, we show how to use open source and freely available tools and methods in order to successfully im...
Conference Paper
Full-text available
In this paper, we present results of employing multilingual and multi-way neural machine translation approaches for morphologically rich languages, such as Estonian and Russian. We experiment with different NMT architectures that allow achieving state-of-the-art translation quality and compare the multi-way model performance to one-way model perfor...
Conference Paper
Full-text available
In this paper, we present Tilde MT, a custom machine translation (MT) platform that provides linguistic data storage (parallel, monolingual corpora, multilingual term collections), data cleaning and normalisation, statistical and neural machine translation system training and hosting functionality, as well as wide integration capabilities (a machin...
Conference Paper
Full-text available
The paper describes Tilde's work on developing a neural machine translation (NMT) tool for the 2017-2018 Presidency of the Council of the European Union. The tool was developed by combining the European Commission's eTranslation service with a set of customized, domain-adapted NMT systems built by Tilde. The central aim of the tool is to assist sta...
Conference Paper
In this paper, we present Tilde's work on boosting the output quality and availability of Estonian machine translation systems, focusing mostly on the less resourced and morphologically complex language pairs between Estonian and Russian. We describe our efforts on collecting parallel and monolingual data for the development of better neural machin...
Article
Full-text available
In this paper, we describe a tool for debugging the output and attention weights of neural machine translation (NMT) systems and for improved estimations of confidence about the output based on the attention. We dive deeper into ways for it to handle output from transformer-based NMT models. Its purpose is to help researchers and developers find we...
Conference Paper
Full-text available
The recent technological shift in machine translation from statistical machine translation (SMT) to neural machine translation (NMT) raises the question of the strengths and weaknesses of NMT. In this paper, we present an analysis of NMT and SMT systems' outputs from narrow domain English-Latvian MT systems that were trained on a rather small amoun...
Conference Paper
This paper analyses issues of rare and unknown word splitting with byte pair encoding for neural machine translation and proposes two methods that allow improving the quality of word splitting. The first method linguistically guides byte pair encoding and the second method limits splitting of unknown words. We also evaluate corpus re-translation fo...
Conference Paper
The paper evaluates neural machine translation systems and phrase-based machine translation systems for highly inflected and small languages. It analyses two translation scenarios: (1) when translating broad domain data from a morphologically rich language into a morphologically rich language or English (and vice versa), and (2) when translating na...
Conference Paper
Full-text available
The paper describes findings of a large post-editing project in the medical domain carried out by Tilde. It analyzes the efficacy of post-editing of highly technical texts in a specialized domain and provides answers to questions important to localization service providers that consider the introduction of post-editing in their translation workflow...
Conference Paper
Full-text available
In this paper the authors present a speech corpus designed and created for the development and evaluation of dictation systems in Latvian. The corpus consists of over nine hours of orthographically annotated speech from 30 different speakers. The corpus features spoken commands that are common for dictation systems for text editors. The corpus is e...
Article
Full-text available
In this paper we make two contributions. First, we describe a multi-component system called BiTES (Bilingual Term Extraction System) designed to automatically gather domain-specific bilingual term pairs from Web data. BiTES components consist of data gathering tools, domain classifiers, monolingual text extraction systems and bilingual term aligner...
Conference Paper
Full-text available
This paper presents a case study about the development of MT systems for two Baltic governments. The governments of Latvia and Lithuania presented Tilde with a need to expand their communication to reach multilingual citizens. In order to meet this need, Tilde collected a vast amount of domain-specific data and trained MT system to produce high-qu...
Conference Paper
Full-text available
In this paper we share our experience from implementing machine translation in localization into relatively small languages of the three Baltic countries – Latvian, Lithuanian, and Estonian. We describe our approach in improving terminology translation and consistency by pre-processing of the source text and performing term integration. We present...
Conference Paper
Full-text available
Transliteration dictionaries are an important resource for the development of machine transliteration systems. The paper describes and analyses a large multilingual transliteration dictionary extracted from probabilistic dictionaries for 24 European languages containing approximately 1.25 million transliterated word pairs. The transliteration dicti...
Data
Transliteration dictionaries are an important resource for the development of machine transliteration systems. The paper describes and analyses a large multilingual transliteration dictionary extracted from probabilistic dictionaries for 24 European languages containing approximately 1.25 million transliterated word pairs. The transliteration dicti...
Conference Paper
Full-text available
In this paper, the authors present the results of ongoing research on Large Vocabulary Automatic Speech Recognition for the Latvian language. The paper describes the initial acoustic model, phoneme set, filler and noise models, and grapheme-to-phoneme modelling. The second part of this work is focused on language modelling. Different word and class...
Conference Paper
Full-text available
Grapheme to phoneme modelling is one of the key features in automated speech recognition and speech synthesis. In this paper, the authors compare two different approaches: a statistical machine translation based method using the phonetically transcribed Latvian Speech Recognition Corpus and a rule-based method for phonetic transcription of words fr...
Conference Paper
Full-text available
The authors present a service-based model for semi-automatic generation of multilingual terminology resources which, if performed manually, is very time consuming. In this model, the automation of individual terminology work tasks is rendered as a set of interoperable cloud-based services integrated into workflows. These services automate the iden...
Conference Paper
Full-text available
This paper presents a set of principles and practical guidelines for terminology work in the national scenario to ensure a harmonized approach in term localization. These linguistic principles and guidelines are elaborated by the Terminology Commission in Latvia in the domain of Information and Communication Technology (ICT). We also present a nove...
Conference Paper
Full-text available
In this paper the authors present the first Latvian speech corpus designed specifically for speech recognition purposes. The paper outlines the decisions made in the corpus designing process through analysis of related work on speech corpora creation for different languages. The authors also provide guidelines that were used for the creation of the...
Conference Paper
Full-text available
Bilingual dictionaries can be automatically generated using the GIZA++ tool. However, these dictionaries contain a lot of noise, which degrades the output quality of tools that rely on them. In this work, we present three different methods for cleaning noise from automatically generated bilingual dictionari...
Conference Paper
Full-text available
This paper evaluates the impact of machine translation on the software localization process and the daily work of professional translators when SMT is applied to low-resourced languages with rich morphology. Translation from English into six low-resourced languages (Czech, Estonian, Hungarian, Latvian, Lithuanian and Polish) from different lan...
Conference Paper
Full-text available
In this demonstration paper we present an innovative platform TaaS "Terminology as a Service" for acquiring raw terminological data, cleaning up, sharing, and reusing them, based on cloud computing. The platform serves, among others, the needs of specialised lexicography. The proposed solution aims at filling the gap of collaborative terminology ma...