About
54
Publications
8,709
Reads
229
Citations
Introduction
Document-level neural machine translation
Current institution
Additional affiliations
June 2019 - present
Position
- Researcher
November 2014 - August 2015
Education
October 2014 - May 2019
September 2012 - June 2014
September 2008 - June 2012
Publications (54)
In this article, we describe a tool for visualizing the output and attention weights of neural machine translation systems and for estimating confidence about the output based on the attention. Our aim is to help researchers and developers better understand the behaviour of their NMT systems without the need for any reference translations. Our tool...
Attention distributions of the generated translations are a useful by-product of attention-based recurrent neural network translation models and can be treated as soft alignments between the input and output tokens. In this work, we use attention distributions as a confidence metric for output translations. We present two strategies of using the at...
In this paper, we present results of employing multilingual and multi-way neural machine translation approaches for morphologically rich languages, such as Estonian and Russian. We experiment with different NMT architectures that allow achieving state-of-the-art translation quality and compare the multi-way model performance to one-way model perfor...
Large parallel corpora that are automatically obtained from the web, documents or elsewhere often exhibit many corrupted parts that are bound to negatively affect the quality of the systems and models that learn from these corpora. This paper describes frequent problems found in data and how such data affects neural machine translation systems, as well...
This research builds upon the Latvian Twitter Eater Corpus (LTEC), which is focused on the narrow domain of tweets related to food, drinks, eating and drinking. LTEC has been collected for more than 12 years and contains almost 3 million tweets with basic information as well as extended, automatically and manually annotated metadata. In this pap...
At the beginning of 2022, a simplistic word-guessing game took the world by storm and was further adapted to many languages beyond the original English version. In this paper, we examine the strategies of daily word-guessing game players that have evolved during a period of over two years. A survey gathered from 25% of frequent players reveals thei...
This research focuses on food price-related discussions on Twitter, specifically in the Latvian context, with the aim of tracing the impact of Russia's 2022 invasion of Ukraine, which caused a global food crisis. Analysing the Latvian Twitter Eater Corpus (LTEC), the research reveals 12 years of food price-related data and includes the effect of CO...
Food choice is a complex phenomenon influenced by factors such as taste, environment, culture, weather and many others. Although people spend most of their lives indoors, weather conditions remain influential, both in shaping seasonal food cultures in particular geographical areas and in influencing individual choices. With the recent increase in t...
Food choice is a complex phenomenon shaped by factors such as taste, ambience, culture or weather. In this paper, we explore food-related tweeting in different weather conditions. We inspect a Latvian food tweet dataset spanning the past decade in conjunction with a weather observation dataset consisting of average temperature, precipitation, and o...
We introduce BERT-NAR-BERT (BnB) – a pre-trained non-autoregressive sequence-to-sequence model, which employs BERT as the backbone for the encoder and decoder for natural language understanding and generation tasks. During the pre-training and fine-tuning with BERT-NAR-BERT, two challenging aspects are considered by adopting the length classificati...
In this paper, we describe adaptation of a simple word guessing game that occupied the hearts and minds of people around the world. There are versions for all three Baltic countries and even several versions of each. We specifically pay attention to the Latvian version and look into how people form their guesses given any already uncovered hints. T...
This article describes linguistic collections of Livonian, development of digital resources, and their usage. The main motivation is to ensure the accessibility of Livonian intangible cultural heritage and linguistic materials for the Livonian and research communities as well as using digital technologies for the purpose of ensuring accessibility a...
How a food, or a dish, is named and how its components and attributes are described can all influence the perception and the enjoyment of the food. Therefore, tracing patterns in food descriptions and determining their role can be of value. The aims of this study were the following: (1) to describe the multisensory food experience as represented in...
One of the most popular methods for context-aware machine translation (MT) is to use separate encoders for the source sentence and context as multiple sources for one target sentence. Recent work has cast doubt on whether these models actually learn useful signals from the context or are improvements in automatic evaluation metrics just a side-effe...
Most machine translation (MT) research has focused on sentences as translation units (sentence-level MT), and has achieved acceptable translation quality for sentences where cross-sentential context is not required in mainly high-resourced languages. Recently, many researchers have worked on MT models that can consider a cross-sentential context. T...
We analysed sentiment and frequencies related to smell, taste and temperature expressed by food tweets in the Latvian language. To get a better understanding of the role of smell, taste and temperature in the mental map of food associations, we looked at such categories as 'tasty' and 'healthy', which turned out to be mutually exclusive. By analysi...
Sentence-level (SL) machine translation (MT) has reached acceptable quality for many high-resourced languages, but document-level (DL) MT has not: it is difficult to 1) train with only a small amount of DL data; and 2) evaluate, as the main methods and data sets focus on SL evaluation. To address the first issue, we present a document-aligned Japanese-Eng...
The Latvian Twitter Eater Corpus is a set of tweets in the narrow domain related to food, drinks, eating and drinking. We also separate two sub-corpora of question and answer tweets and sentiment annotated tweets. We analyse contents of the corpus and demonstrate use-cases for the sub-corpora by training domain-specific question-answering and senti...
We present the Latvian Twitter Eater Corpus - a set of tweets in the narrow domain related to food, drinks, eating and drinking. The corpus has been collected over a time span of more than 8 years and includes over 2 million tweets accompanied by additional useful data. We also separate two sub-corpora of question and answer tweets and sentiment annotated...
While the progress of machine translation of written text has come far in the past several years thanks to the increasing availability of parallel corpora and corpora-based training technologies, automatic translation of spoken text and dialogues remains challenging even for modern systems. In this paper, we aim to boost the machine translation qua...
This paper aims to combine output from various machine translation (MT) systems so that the overall translation quality of the source text would increase. Applicability of the developed methods for small, morphologically rich and under-resourced languages is evaluated, especially Latvian and Estonian. Existing methods have been analysed, and severa...
The paper describes the development process of Tilde's NMT systems for the WMT 2019 shared task on news translation. We trained systems for the English-Lithuanian and Lithuanian-English translation directions in constrained and unconstrained tracks. We build upon the best methods of the previous year's competition and combine them with recent advan...
This thesis aims to research methods and develop tools that make it possible to successfully combine output from various machine translation (MT) systems so that the overall translation quality of the source text would increase. Applicability of the developed methods for small, morphologically rich and under-resourced languages is evaluated, especially Latvia...
The paper describes the development process of the Tilde's NMT systems that were submitted for the WMT 2018 shared task on news translation. We describe the data filtering and pre-processing workflows, the NMT system training architectures, and automatic evaluation results. For the WMT 2018 shared task, we submitted seven systems (both constrained...
In this paper, we describe a tool for debugging the output and attention weights of neural machine translation (NMT) systems and for improved estimations of confidence about the output based on the attention. The purpose of the tool is to help researchers and developers find weak and faulty example translations that their NMT systems produce withou...
Online learning has been an active research area in statistical machine translation. However, as we have identified in our research, the implementation of successful online learning capabilities in the Moses SMT system can be challenging. In this work, we show how to use open source and freely available tools and methods in order to successfully im...
In this paper, we present Tilde's work on boosting the output quality and availability of Estonian machine translation systems, focusing mostly on the less resourced and morphologically complex language pairs between Estonian and Russian. We describe our efforts on collecting parallel and monolingual data for the development of better neural machin...
In this paper, we describe a tool for debugging the output and attention weights of neural machine translation (NMT) systems and for improved estimations of confidence about the output based on the attention. We dive deeper into ways for it to handle output from transformer-based NMT models. Its purpose is to help researchers and developers find we...
Processing of multi-word expressions (MWEs) is a known problem for any natural language processing task. Even neural machine translation (NMT) struggles to overcome it. This paper presents results of experiments on investigating NMT attention allocation to the MWEs and improving automated translation of sentences that contain MWEs in English->Latvi...
This paper presents the comparison of how using different neural network based language modelling tools for selecting the best candidate fragments affects the final output translation quality in a hybrid multi-system machine translation setup. Experiments were conducted by comparing perplexity and BLEU scores on common test cases using the same tra...
The tool described in this article has been designed to help machine translation (MT) researchers to combine and evaluate various MT engine outputs through a web-based graphical user interface using syntactic analysis and language modelling. The tool supports user provided translations as well as translations from popular online MT system applicati...
This paper presents an attempt to improve a specific baseline hybrid machine translation (MT) combination system by using brute force and searching through all possibilities for the best-combined translation instead of incrementally building the translation piece by piece. The result is an improved phrase-based multi-system MT system that allows im...
Processing of multiword expressions (MWEs) is a well-known problem for natural language processing tasks, one of which is machine translation. This paper presents results of experiments on automated translation of MWEs in an English-Latvian statistical machine translation system. In this study, three approaches were investigated: (1) bilingual pairs...
This paper describes a hybrid machine translation system that explores a parser to acquire syntactic chunks of a source sentence, translates the chunks with multiple online machine translation (MT) system application program interfaces (APIs) and creates output by combining translated chunks to obtain the best possible translation. The selection of...
This paper describes a hybrid machine translation system that explores using a parser to acquire syntactic chunks of a source sentence, translates the chunks with multiple online MT system APIs and creates output by combining translated chunks to obtain the best possible translation. The selection of the best translation hypothesis is performed by...
This paper presents a hybrid machine translation (HMT) system that pursues syntactic analysis to acquire phrases of source sentences, translates the phrases using multiple online machine translation (MT) system application program interfaces (APIs) and generates output by combining translated chunks to obtain the best possible translation. The aim...
This paper describes a hybrid machine translation (HMT) system that employs several online MT system application program interfaces (APIs) forming a Multi-System Machine Translation (MSMT) approach. The goal is to improve the automated translation of English – Latvian texts over each of the individual MT APIs. The selection of the best hypothesis t...
The objective of this master’s thesis is to explore the popular language processing tool Maltparser and the algorithms used in its sentence syntactic analyzer – the dependency parsing Arc-eager Shift-reduce algorithm and the machine learning SVM algorithm. In this thesis the SVM algorithm will be compared to two other machine learning algorithms –...
The objective of this thesis is to explore how data from the Twitter social network can be analyzed and to find the most useful methods that would help to analyze such data. This thesis will describe the methods, solutions and tools that could be used for analyzing Twitter data of any topic. In the course of the work a universal tool for analyzing...