
Dimitar Sht. Shterionov
Tilburg University | UVT · Cognitive Science and Artificial Intelligence
PhD
Publications: 46
Reads: 9,557
Citations: 573
Introduction
Additional affiliations
August 2020 - present
January 2020 - June 2020
November 2017 - December 2019
Publications (46)
While quality estimation (QE) can play an important role in the translation process, its effectiveness relies on the availability and quality of training data. For QE in particular, high-quality labeled data is often lacking due to the high cost and effort associated with labeling such data. Aside from the data scarcity challenge, QE models should...
Automatic translation from signed to spoken languages is an interdisciplinary research domain at the intersection of computer vision, machine translation (MT), and linguistics. While the domain is growing in terms of popularity (the majority of scientific papers on sign language (SL) translation have been published in the past five years), research in...
The effectiveness of Neural Machine Translation (NMT) models largely depends on the vocabulary used at training; small vocabularies can lead to out-of-vocabulary problems, and large ones to memory issues. Subword (SW) tokenization has been successfully employed to mitigate these issues. The choice of vocabulary and SW tokenization has a significant...
Businesses and customers can gain valuable information from product reviews. The sheer number of reviews often necessitates ranking them based on their potential helpfulness. However, only a few reviews ever receive any helpfulness votes on online marketplaces. Sorting all reviews based on the few existing votes can cause helpful reviews to go unno...
This paper presents the results of the First WMT Shared Task on Sign Language Translation (WMT-SLT22). This shared task is concerned with automatic translation between signed and spoken languages. The task is novel in the sense that it requires processing visual information (such as video frames or human pose estimation) beyond the well-known p...
Sign Languages (SLs) are the primary means of communication for at least half a million people in Europe alone. However, the development of SL recognition and translation tools is slowed down by a series of obstacles concerning resource scarcity and standardization issues in the available data. The former challenge relates to the volume of data ava...
With the increase in machine translation (MT) quality over recent years, it has now become common practice to integrate MT in the workflow of language service providers (LSPs) and other actors in the translation industry. With MT having a direct impact on the translation workflow, it is important not only to use high-quality MT systems, but a...
Machine Translation (MT) has become an irreplaceable part of translation industry workflows. With a direct impact on productivity, it is very important for human post-editors and project managers to be informed about the translation quality of MT.
MT Quality estimation (QE) is the task of predicting the quality of a translation without human refer...
Automatic translation from signed to spoken languages is an interdisciplinary research domain, lying at the intersection of computer vision, machine translation and linguistics. Nevertheless, research in this domain is performed mostly by computer scientists in isolation. As the domain is becoming increasingly popular - the majority of scientific p...
Continuously-growing data volumes lead to larger generic models. Specific use-cases are usually left out, since generic models tend to perform poorly in domain-specific cases. Our work addresses this gap with a method for selecting in-domain data from generic-domain (parallel text) corpora, for the task of machine translation. The proposed method r...
Recent years have seen an increasing need for gender-neutral and inclusive language. Within the field of NLP, there are various mono- and bilingual use cases where gender inclusive language is appropriate, if not preferred due to ambiguity or uncertainty in terms of the gender of referents. In this work, we present a rule-based and a neural approac...
This paper addresses the tasks of sign segmentation and segment-meaning mapping in the context of sign language (SL) recognition. It aims to give an overview of the linguistic properties of SL, such as coarticulation and simultaneity, which make these tasks complex. A better understanding of SL structure is the necessary ground for the design and d...
General-domain corpora are becoming increasingly available for Machine Translation (MT) systems. However, using those that cover the same or comparable domains allows achieving high translation quality of domain-specific MT. It is often the case that domain-specific corpora are scarce and cannot be used in isolation to effectively train (domain-spec...
This article presents a review of the evolution of automatic post-editing, a term that describes methods to improve the output of machine translation systems, based on knowledge extracted from datasets that include post-edited content. The article describes the specificity of automatic post-editing in comparison with other tasks in machine translat...
Recent studies in the field of Machine Translation (MT) and Natural Language Processing (NLP) have shown that existing models amplify biases observed in the training data. The amplification of biases in language technology has mainly been examined with respect to specific phenomena, such as gender bias. In this work, we go beyond the study of gende...
In a translation workflow, machine translation (MT) is almost always followed by a human post-editing step, where the raw MT output is corrected to meet required quality standards. To reduce the number of errors human translators need to correct, automatic post-editing (APE) methods have been developed and deployed in such workflows. With the advan...
Machine translation (MT) has benefited from using synthetic training data originating from translating monolingual corpora, a technique known as backtranslation. Combining backtranslated data from different sources has led to better results than when using such data in isolation. In this work we analyse the impact that data translated with rule-bas...
Neural Machine Translation (NMT) models achieve their best performance when large sets of parallel data are used for training. Consequently, techniques for augmenting the training set have become popular recently. One of these methods is back-translation (Sennrich et al., 2016), which consists in generating synthetic sentences by translating a set...
Neural Machine Translation (NMT) models achieve their best performance when large sets of parallel data are used for training. Consequently, techniques for augmenting the training set have become popular recently. One of these methods is back-translation (Sennrich et al., 2016a), which consists in generating synthetic sentences by translating a set...
This work presents an empirical approach to quantifying the loss of lexical richness in Machine Translation (MT) systems compared to Human Translation (HT). Our experiments show how current MT systems indeed fail to render the lexical diversity of human-generated or translated text. The inability of MT systems to generate diverse outputs and its te...
The quality of e-Commerce services largely depends on the accessibility of product content as well as its completeness and correctness. Nowadays, many sellers target cross-country and cross-lingual markets via active or passive cross-border trade, fostering the desire for seamless user experiences. While machine translation (MT) is very helpful for...
We present our system for the CLIN29 shared task on cross-genre gender detection for Dutch. We experimented with a multitude of neural models (CNN, RNN, LSTM, etc.), more "traditional" models (SVM, RF, LogReg, etc.), different feature sets as well as data pre-processing. The final results suggested that using tokenized, non-lowercased data works be...
Neural machine translation (NMT) has recently gained substantial popularity not only in academia, but also in industry. For its acceptance in industry it is important to investigate how NMT performs in comparison to the phrase-based statistical MT (PBSMT) model, which until recently was the dominant MT paradigm. In the present work, we compare the q...
A prerequisite for training corpus-based machine translation (MT) systems -- either Statistical MT (SMT) or Neural MT (NMT) -- is the availability of high-quality parallel data. This is arguably more important today than ever before, as NMT has been shown in many studies to outperform SMT, but mostly when large parallel corpora are available; in ca...
Neural Machine Translation (NMT) is a recently emerged paradigm for Machine Translation (MT) that has shown promising results as well as great potential to solve challenging MT tasks. One such task is how to provide good MT for languages with sparse training data. In this paper we investigate a Zero Shot Translation (ZST) approach for such lang...
We present a multilingual preordering component tailored for a commercial Statistical Machine Translation platform. In commercial settings, issues such as processing speed as well as the ability to adapt models to the customers’ needs play a significant role and have a big impact on the choice of approaches that are added to the custom pipeline to...
Neural Machine Translation (NMT) has recently gained substantial popularity not only in academia, but also in industry. In the present work, we compare the quality of Phrase-Based Statistical Machine Translation (PBSMT) and NMT solutions of a commercial platform for Custom Machine Translation (CMT) that are tailored to accommodate large-scale trans...
In recent years, Statistical Machine Translation (SMT) has established a dominant position among the variety of machine translation paradigms. Industrial Machine Translation systems, such as KantanMT, deliver fast, high-performance SMT solutions to the end user. KantanMT is a cloud-based platform that allows its users to build custom...
In recent years, statistical machine translation (SMT) has been widely deployed in translators' workflow with significant improvement of productivity. However, prior to invoking an SMT system to translate an unknown text, an SMT engine needs to be built. As such, building speed of the engine is essential for the translation workflow, i.e., the soon...
Knowledge compilation converts Boolean formulae for which some inference tasks are computationally expensive into a representation where the same tasks are tractable. ProbLog is a state-of-the-art Probabilistic Logic Programming system that uses knowledge compilation to reduce the expensive probabilistic inference to an efficient weighted model cou...
In order to handle real-world problems, state-of-the-art probabilistic logic and learning frameworks, such as ProbLog, reduce the expensive inference to efficient Weighted Model Counting. To do so, ProbLog employs a sequence of transformation steps, called an inference pipeline. Each step in the probabilistic inference pipeline is called a pipeli...
ProbLog [7, 10] is a general-purpose Probabilistic Logic Programming (PLP) language. It extends Prolog with uncertain knowledge encoded as probabilistic facts. A probabilistic fact p::f states that the fact f is true with probability p.
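The meaning of probabilistic facts can be illustrated with a minimal Python sketch. It assumes the facts are mutually independent and computes a query's probability by brute-force enumeration of all truth assignments ("possible worlds"). The fact names, their probabilities, and the `alarm` rule are hypothetical, chosen only for illustration. This enumeration is exponential in the number of facts and is not how ProbLog itself performs inference; it only shows what the probability of a query means.

```python
from itertools import product

# Hypothetical probabilistic facts (written p::f in ProbLog syntax):
#   0.1::burglary.   0.2::earthquake.
facts = {"burglary": 0.1, "earthquake": 0.2}


def alarm(world):
    # Hypothetical rules: alarm :- burglary.  alarm :- earthquake.
    return world["burglary"] or world["earthquake"]


def query_probability(facts, query):
    """Sum the weights of all possible worlds in which the query holds."""
    names = list(facts)
    total = 0.0
    for values in product([True, False], repeat=len(names)):
        world = dict(zip(names, values))
        # A world's weight is the product of p for each true fact
        # and (1 - p) for each false fact (independence assumption).
        weight = 1.0
        for name in names:
            p = facts[name]
            weight *= p if world[name] else 1.0 - p
        if query(world):
            total += weight
    return total


print(round(query_probability(facts, alarm), 6))  # 0.28
```

The result agrees with the closed form 1 - (1 - 0.1)(1 - 0.2) = 0.28, the probability that at least one of the two independent facts is true.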
Probabilistic logic languages, such as ProbLog and CP-logic, are probabilistic generalizations of logic programming that allow one to model probability distributions over complex, structured domains. Their key probabilistic constructs are probabilistic facts and annotated disjunctions to represent binary and multi-valued random variables, respectiv...
Probabilistic logic programs are logic programs in which some of the facts are annotated with probabilities. This paper investigates how classical inference and learning tasks known from the graphical model community can be tackled for probabilistic logic programs. Several such tasks, such as computing the marginals given evidence and learning from...
Deriving knowledge from real-world systems is a complex task, targeted by many scientific fields. Such systems can be viewed as collections of highly detailed data elements and interactions between them. The more details the data include, the more accurate the system representation is, but the higher the computational requirements become. Using abstr...
Inference in probabilistic logic languages such as ProbLog, an extension of Prolog with probabilistic facts, is often based on a reduction to a propositional formula in DNF. Calculating the probability of such a formula involves the disjoint-sum-problem, which is computationally hard. In this work we introduce a new approximation method for ProbLog...