Antonio Toral

Antonio Toral
University of Groningen | RUG

PhD in Computational Linguistics

About

124
Publications
24,978
Reads
How we measure 'reads'
A 'read' is counted each time someone views a publication summary (such as the title, abstract, and list of authors), clicks on a figure, or views or downloads the full-text. Learn more
1,473
Citations
Additional affiliations
March 2014 - September 2016
Dublin City University
Position
  • Research Fellow in Machine Translation

Publications

Publications (124)
Preprint
Character-based representations have important advantages over subword-based ones for morphologically rich languages. They come with increased robustness to noisy input and do not need a separate tokenization step. However, they also have a crucial disadvantage: they notably increase the length of text sequences. The GBST method from Charformer gro...
Preprint
Full-text available
We introduce DivEMT, the first publicly available post-editing study of Neural Machine Translation (NMT) over a typologically diverse set of target languages. Using a strictly controlled setup, 18 professional translators were instructed to translate or post-edit the same set of English documents into Arabic, Dutch, Italian, Turkish, Ukrainian, and...
Preprint
This paper investigates very low resource language model pretraining, when less than 100 thousand sentences are available. We find that, in very low resource scenarios, statistical n-gram language models outperform state-of-the-art neural models. Our experiments show that this is mainly due to the focus of the former on a local context. As such, we...
Preprint
Although text style transfer has witnessed rapid development in recent years, there is as yet no established standard for evaluation, which is performed using several automatic metrics, lacking the possibility of always resorting to human judgement. We focus on the task of formality transfer, and on the three aspects that are usually evaluated: sty...
Preprint
Full-text available
This article presents the results of a study involving the translation of a short story by Kurt Vonnegut from English to Catalan and Dutch using three modalities: machine-translation (MT), post-editing (PE) and translation without aid (HT). Our aim is to explore creativity, understood to involve novelty and acceptability, from a quantitative perspe...
Article
Full-text available
This article presents the results of a study involving the translation of a short story by Kurt Vonnegut from English to Catalan and Dutch using three modalities: machine-translation (MT), post-editing (PE) and translation without aid (HT). Our aim is to explore creativity, understood to involve novelty and acceptability, from a quantitative perspe...
Preprint
We exploit the pre-trained seq2seq model mBART for multilingual text style transfer. Using machine translated data as well as gold aligned English sentences yields state-of-the-art results in the three target languages we consider. Besides, in view of the general scarcity of parallel data, we propose a modular approach for multilingual formality tr...
Preprint
This paper describes the methods behind the systems submitted by the University of Groningen for the WMT 2021 Unsupervised Machine Translation task for German--Lower Sorbian (DE--DSB): a high-resource language to a low-resource one. Our system uses a transformer encoder-decoder architecture in which we make three changes to the standard training pr...
Preprint
Style transfer aims to rewrite a source text in a different target style while preserving its content. We propose a novel approach to this task that leverages generic resources, and without using any task-specific parallel (source-target) data outperforms existing unsupervised approaches on the two most popular style transfer tasks: formality trans...
Preprint
Scarcity of parallel data causes formality style transfer models to have scarce success in preserving content. We show that fine-tuning pre-trained language (GPT-2) and sequence-to-sequence (BART) models boosts content preservation, and that this is possible even with limited amounts of parallel data. Augmenting these models with rewards that targe...
Preprint
Full-text available
This article presents the results of a study involving the translation of a fictional story from English into Catalan in three modalities: machine-translated (MT), post-edited (MTPE) and translated without aid (HT). Each translation was analysed to evaluate its creativity. Subsequently, a cohort of 88 Catalan participants read the story in a random...
Preprint
In this chapter we build a machine translation (MT) system tailored to the literary domain, specifically to novels, based on the state-of-the-art architecture in neural MT (NMT), the Transformer (Vaswani et al., 2017), for the translation direction English-to-Catalan. Subsequently, we assess to what extent such a system can be useful by evaluating...
Preprint
We combine character-level and contextual language model representations to improve performance on Discourse Representation Structure parsing. Character representations can easily be added in a sequence-to-sequence model in either one encoder or as a fully separate encoder, with improvements that are robust to different language models, languages a...
Article
Full-text available
This article presents the results of a study involving the translation of a fictional story from English into Catalan in three modalities: machine-translated (MT), post-edited (MTPE) and translated without aid (HT). Each translation was analysed to evaluate its creativity. Subsequently, a cohort of 88 Catalan participants read the story in a random...
Preprint
This research presents a fine-grained human evaluation to compare the Transformer and recurrent approaches to neural machine translation (MT), on the translation direction English-to-Chinese. To this end, we develop an error taxonomy compliant with the Multidimensional Quality Metrics (MQM) framework that is customised to the relevant phenomena of...
Preprint
Full-text available
We reassess the claims of human parity and super-human performance made at the news shared task of WMT 2019 for three translation directions: English-to-German, English-to-Russian and German-to-English. First we identify three potential issues in the human evaluation of that shared task: (i) the limited amount of intersentential context available,...
Preprint
Full-text available
The quality of machine translation has increased remarkably over the past years, to the degree that it was found to be indistinguishable from professional human translation in a number of empirical investigations. We reassess Hassan et al.'s 2018 investigation into Chinese to English news translation, showing that the finding of human-machine parit...
Article
Full-text available
The quality of machine translation has increased remarkably over the past years, to the degree that it was found to be indistinguishable from professional human translation in a number of empirical investigations. We reassess Hassan et al.'s 2018 investigation into Chinese to English news translation, showing that the finding of human–machine parit...
Preprint
Full-text available
Post-editing (PE) machine translation (MT) is widely used for dissemination because it leads to higher productivity than human translation from scratch (HT). In addition, PE translations are found to be of equal or better quality than HTs. However, most such studies measure quality solely as the number of errors. We conduct a set of computational a...
Article
Neural methods have had several recent successes in semantic parsing, though they have yet to face the challenge of producing meaning representations based on formal semantics. We present a sequence-to-sequence neural semantic parser that is able to produce Discourse Representation Structures (DRSs) for English sentences with high accuracy, outperf...
Article
Full-text available
In the context of recent improvements in the quality of machine translation (MT) output and new use cases being found for that output, this article reports on an experiment using statistical and neural MT systems to translate literature. Six professional translators with experience of literary translation produced English-to-Catalan translations un...
Preprint
Full-text available
Neural methods have had several recent successes in semantic parsing, though they have yet to face the challenge of producing meaning representations based on formal semantics. We present a sequence-to-sequence neural semantic parser that is able to produce Discourse Representation Structures (DRSs) for English sentences with high accuracy, outperf...
Article
Full-text available
This paper presents a quantitative fine-grained manual evaluation approach to comparing the performance of different machine translation (MT) systems. We build upon the well-established Multidimensional Quality Metrics (MQM) error taxonomy and implement a novel method that assesses whether the differences in performance for MQM error types between...
Preprint
Full-text available
We reassess a recent study (Hassan et al., 2018) that claimed that machine translation (MT) has reached human parity for the translation of news from Chinese into English, using pairwise ranking and considering three variables that were not taken into account in that previous study: the language in which the source side of the test set was original...
Chapter
Given the rise of the new neural approach to machine translation (NMT) and its promising performance on different text types, we assess the translation quality it can attain on what is perceived to be the greatest challenge for MT: literary text. Specifically, we target novels, arguably the most popular type of literary text. We build a literary-ad...
Article
Full-text available
We conduct the first experiment in the literature in which a novel is translated automatically and then post-edited by professional literary translators. Our case study is Warbreaker, a popular fantasy novel originally written in English, which we translate into Catalan. We translated one chapter of the novel (over 3,700 words, 330 sentences) with...
Article
Full-text available
Given the rise of a new approach to MT, Neural MT (NMT), and its promising performance on different text types, we assess the translation quality it can attain on what is perceived to be the greatest challenge for MT: literary text. Specifically, we target novels, arguably the most popular type of literary text. We build a literary-adapted NMT syst...
Article
Full-text available
We present a widely applicable methodology to bring machine translation (MT) to under-resourced languages in a cost-effective and rapid manner. Our proposal relies on web crawling to automatically acquire parallel data to train statistical MT systems if any such data can be found for the language pair and domain of interest. If that is not the case...
Conference Paper
In machine-learning applications, data selection is of crucial importance if good runtime performance is to be achieved. Feature Decay Algorithms (FDA) have demonstrated excellent performance in a number of tasks. While the decay function is at the heart of the success of FDA, its parameters are initialised with the same weights. In this paper, we...
Article
Full-text available
Reordering is one of the most important factors affecting the quality of the output in statistical machine translation (SMT). A considerable number of approaches that proposed addressing the reordering problem are discriminative reordering models (DRM). The core component of the DRMs is a classifier which tries to predict the correct word order of...
Article
Full-text available
We compare three approaches to statistical machine translation (pure phrase-based, factored phrase-based and neural) by performing a fine-grained manual evaluation via error annotation of the systems’ outputs. The error types in our annotation are compliant with the multidimensional quality metrics (MQM), and the annotation is performed by two anno...
Article
We present a syntax-based reordering model (RM) for hierarchical phrase-based statistical machine translation (HPB-SMT) enriched with semantic features. Our model brings a number of novel contributions: (i) while the previous dependency-based RM is limited to the reordering of head and dependant constituent pairs, we also model the reordering of pa...
Article
Full-text available
We aim to shed light on the strengths and weaknesses of the newly introduced neural machine translation paradigm. To that end, we conduct a multifaceted evaluation in which we compare outputs produced by state-of-the-art neural machine translation and phrase-based machine translation systems for 9 language directions across a number of dimensions....
Conference Paper
We describe our participation in TweetMT for three language pairs in both directions: Spanish from/to Catalan, Basque and Portuguese. We used a range of techniques: statistical and rule-based MT, morph segmentation, data selection with ParFDA and system combination. As for resources, our focus was on crawling vast amounts of tweets to perform monol...
Conference Paper
Full-text available
We introduce TweetMT, a parallel corpus of tweets in four language pairs that combine five languages (Spanish from/to Basque, Catalan, Galician and Portuguese), all of which have an official status in the Iberian Peninsula. The corpus has been created by combining automatic collection and crowdsourcing approaches, and it is publicly available. It i...
Article
Full-text available
Language models (LMs) are an essential element in statistical approaches to natural language processing for tasks such as speech recognition and machine translation (MT). The advent of big data leads to the availability of massive amounts of data to build LMs, and in fact, for the most prominent languages, using current techniques and hardware, it...
Conference Paper
Full-text available
This paper presents the systems submitted by the Abu-MaTran project to the Englishto-Finnish language pair at the WMT 2016 news translation task. We applied morphological segmentation and deep learning in order to address (i) the data scarcity problem caused by the lack of in-domain parallel data in the constrained task and (ii) the complex morphol...
Article
Full-text available
Contrary to perceived wisdom, we explore the role of machine translation (MT) in assisting with the translation of literary texts, considering both its limitations and its potential. Our motivations to explore this subject are twofold, arising from: (1) recent research advances in MT, and (2) the recent emergence of the ebook, which together allow...
Conference Paper
Full-text available
This article presents an overview of the shared task that took place as part of the TweetMT workshop held at SEPLN 2015. The task consisted in translating collections of tweets from and to several languages. The article outlines the data collection and annotation process, the development and evaluation of the shared task, as well as the results ach...
Article
This paper provides an overview of the research and development activities carried out to alleviate the language resources' bottleneck in machine translation within the Abu-MaTran project. We have developed a range of tools for the acquisition of the main resources required by the two most popular approaches to machine translation, i.e. statistical...
Conference Paper
Full-text available
We explore the feasibility of applying machine translation (MT) to the translation of literary texts. To that end, we measure the translata-bility of literary texts by analysing parallel corpora and measuring the degree of freedom of the translations and the narrowness of the domain. We then explore the use of domain adaptation to translate a novel...
Conference Paper
Full-text available
This paper presents the machine translation systems submitted by the Abu-MaTran project for the Finnish‐English language pair at the WMT 2015 translation task. We tackle the lack of resources and complex morphology of the Finnish language by (i) crawling parallel and monolingual data from the Web and (ii) applying rule-based and unsupervised method...
Article
Full-text available
This paper explores the use of linguistic information for the selection of data to train language models. We depart from the state-of-the-art method in perplexity-based data selection and extend it in order to use word-level linguistic units (i.e. lemmas, named entity categories and part-of-speech tags) instead of surface forms. We then present two...
Conference Paper
Full-text available
This paper aims to automatically identify which linguistic phenomena represent barriers to better MT quality. We focus on the translation of news data for two bidirectional language pairs: EN↔ES and EN↔DE. Using the diagnostic MT evaluation toolkit DELiC4MT and a set of human reference translations, we relate translation quality barriers to a selec...
Article
Full-text available
We present OLE, a dissemination activity of the EU Abu-MaTran project which aims to establish a linguistic Olympiad in Spain. We introduce the Linguistic Olympiads, our rationale and objectives for setting up OLE as well as our implementation plan. We foresee our work to be useful for other countries looking to start a Linguistic Olympiad.
Conference Paper
Full-text available
This paper presents the machine translation systems submitted by the AbuMaTran project to the WMT 2014 translation task. The language pair concerned is English‐French with a focus on French as the target language. The French to English translation direction is also considered, based on the word alignment computed in the other direction. Large langu...
Article
Full-text available
In this paper, we tackle the problem of domain adaptation of statistical machine translation (SMT) by exploiting domain-specific data acquired by domain-focused crawling of text from the World Wide Web. We design and empirically evaluate a procedure for automatic acquisition of monolingual and parallel text and their exploitation for system trainin...
Chapter
This contribution, aims at highlighting the strong interconnection between lexicons, terminologies and ontologies and especially the fundamental role that ontologies and lexica mutually play. Our view is that lexical resources are evolving in nature, from ontologically based lexicons we are going towards lexically based ontologies. We explore diffe...
Conference Paper
Full-text available
This paper addresses diagnostic evaluation of machine translation (MT) systems for Indian languages, English to Hindi MT in particular, using linguistic checkpoints. The proposed scheme is based on identification of linguistic checkpoints for evaluation of English to Hindi MT. We use the diagnostic evaluation tool DELiC4MT to analyze the contributi...
Conference Paper
This paper addresses diagnostic evaluation of machine translation (MT) systems for Indian languages, English to Hindi MT in particular, assessing the performance of MT systems on relevant linguistic phenomena (checkpoints). We use the diagnostic evaluation tool DELiC4MT to analyze the performance of MT systems on various PoS categories (e.g. nouns,...
Article
We develop an open-source large-scale finite-state morphological processing toolkit (AraComLex) for Modern Standard Arabic (MSA) distributed under the GPLv3 license (http://aracomlex.sourceforge.net). The morphological transducer is based on a lexical database specifically constructed for this purpose. In contrast to previous resources, the databas...
Article
Full-text available
The aim of this report is to define the interfaces for the tools used in the MT development and evaluation scenarios as included in the QTLaunchPad (QTLP) infrastructure. Specification of the interfaces is important for the interaction and interoperability of the tools in the developed QTLP infrastructure. In addressing this aim, the report provide...
Article
Full-text available
The objective of the PANACEA ICT-2007.2.2 EU project is to build a platform that automates the stages involved in the acquisition, production, updating and maintenance of the large language resources required by, among others, MT systems. The development of a Corpus Acquisition Component (CAC) for extracting monolingual and bilingual data from the...
Conference Paper
This paper addresses diagnostic evaluation of machine translation (MT) systems for Indian languages, English to Hindi translation in particular. Evaluation of MT output is an important but difficult task. The difficulty arises primarily from some inherent characteristics of the language pairs, which range from simple word-level discrepancies to...
Article
Full-text available
This paper demonstrates DELiC4MT, a piece of software that allows the user to perform diagnostic evaluation of machine translation systems over linguistic checkpoints, i.e., source-language lexical elements and grammatical constructions specified by the user. Our integrated tool builds upon best practices, software components and formats developed...