Stephan Vogel

Stephan Vogel
none · none

About

214
Publications
29,605
Reads
How we measure 'reads'
A 'read' is counted each time someone views a publication summary (such as the title, abstract, and list of authors), clicks on a figure, or views or downloads the full-text. Learn more
5,316
Citations
Introduction
Skills and Expertise
Additional affiliations
July 2014 - present
Qatar Computing Research Institute
Position
  • Research Director
July 2006 - October 2011
Carnegie Mellon University
Position
  • Assistant Research Professor
January 2000 - present
Rheinisch-Westfälische Technische Hochschule Aachen

Publications

Publications (214)
Preprint
We address the problem of simultaneous translation by modifying the Neural MT decoder to operate with dynamically built encoder and attention. We propose a tunable agent which decides the best segmentation strategy for a user-defined BLEU loss and Average Proportion (AP) constraint. Our agent outperforms previously proposed Wait-if-diff and Wait-if...
Conference Paper
Full-text available
We address the problem of simultaneous translation by modifying the Neural MT decoder to operate with dynamically built encoder and attention. We propose a tunable agent which decides the best segmentation strategy for a user-defined BLEU loss and Average Proportion (AP) constraint. Our agent outperforms previously proposed Wait-if-diff and Wait-if...
Conference Paper
Full-text available
This article presents the WAW Corpus, an interpreting corpus for English/Arabic, which can be used for teaching interpreters, studying the characteristics of interpreters' work, as well as to train machine translation systems. The corpus contains recordings of lectures and speeches from international conferences, their interpretations, the transcri...
Chapter
Full-text available
We present research towards bridging the language gap between migrant workers in Qatar and medical staff. In particular, we present the first steps towards the development of a real-world Hindi-English machine translation system for doctor-patient communication. As this is a low-resource language pair, especially for speech and for the medical doma...
Conference Paper
In this work, we present Qatar Computing Research Institute»s live speech translation system. Our system works with both Arabic and English. It is designed using an array of modern web technologies to capture speech in real time, and transcribe and translate it using state-of-the-art Automatic Speech Recognition (ASR) and Machine Translation (MT) s...
Conference Paper
Full-text available
End-to-end training makes the neural machine translation (NMT) architecture simpler , yet elegant compared to traditional statistical machine translation (SMT). However, little is known about linguistic patterns of morphology, syntax and semantics learned during the training of NMT systems, and more importantly, which parts of the architecture are...
Article
Full-text available
We explore the idea of automatically crafting a tuning dataset for Statistical Machine Translation (SMT) that makes the hyper-parameters of the SMT system more robust with respect to some specific deficiencies of the parameter tuning algorithms. This is an under-explored research direction, which can allow better parameter tuning. In this paper, we...
Article
Full-text available
This paper describes the Arabic MGB-3 Challenge - Arabic Speech Recognition in the Wild. Unlike last year's Arabic MGB-2 Challenge, for which the recognition task was based on more than 1,200 hours broadcast TV news recordings from Aljazeera Arabic TV programs, MGB-3 emphasises dialectal Arabic using a multi-genre collection of Egyptian YouTube vid...
Book
Full-text available
This workshop addresses BOTH the most recent developments in contributions of NLP to translation/interpreting and the contributions of translation/interpreting to NLP/MT. In this way it addresses the interests of researchers & specialists in both areas and their joint collaborations, aiming for example to improve their own tasks with the techniques...
Article
Full-text available
Word segmentation plays a pivotal role in improving any Arabic NLP application. Therefore, a lot of research has been spent in improving its accuracy. Off-the-shelf tools, however, are: i) complicated to use and ii) domain/dialect dependent. We explore three language-independent alternatives to morphological segmentation using: i) data-driven sub-w...
Article
Full-text available
In this paper, we explore alternative ways to train a neural machine translation system in a multi-domain scenario. We investigate data concatenation (with fine tuning), model stacking (multi-level fine tuning), data selection and weighted ensemble. We evaluate these methods based on three criteria: i) translation quality, ii) training time, and ii...
Conference Paper
Full-text available
We present the first prototype of the SUMMA Platform: an integrated platform for multilingual media monitoring. The platform contains a rich suite of low-level and high-level natural language processing technologies: automatic speech recognition of broadcast media, machine translation, automated tagging and classification of named entities, semanti...
Conference Paper
Full-text available
We present QCRI's Arabic-to-English speech translation system. It features modern web technologies to capture live audio, and broadcasts Arabic transcriptions and English translations simultaneously. Our Kaldi-based ASR system uses the Time Delay Neural Network architecture , while our Machine Translation (MT) system uses both phrase-based and neur...
Article
Full-text available
This paper describes QCRI's machine translation systems for the IWSLT 2016 evaluation campaign. We participated in the Arabic->English and English->Arabic tracks. We built both Phrase-based and Neural machine translation models, in an effort to probe whether the newly emerged NMT framework surpasses the traditional phrase-based systems in Arabic-En...
Conference Paper
Full-text available
This paper describes QCRI's machine translation systems for the IWSLT 2016 evaluation campaign. We participated in the Arabic→English and English→Arabic tracks. We built both Phrase-based and Neural machine translation models, in an effort to probe whether the newly emerged NMT framework surpasses the traditional phrase-based systems in Arabic-Engl...
Article
Full-text available
We present research towards bridging the language gap between migrant workers in Qatar and medical staff. In particular, we present the first steps towards the development of a real-world Hindi-English machine translation system for doctor-patient communication. As this is a low-resource language pair, especially for speech and for the medical doma...
Article
Mining parallel data from comparable corpora is a promising approach for overcoming the data sparseness in statistical machine translation and other natural language processing applications. In this paper, we address the task of detecting parallel phrase pairs embedded in comparable sentence pairs. We present a novel phrase alignment approach that...
Article
Full-text available
The paper describes the Egyptian Arabic-to-English statistical machine translation (SMT) system that the QCRI-Columbia-NYUAD (QCN) group submitted to the NIST OpenMT'2015 competition. The competition focused on informal dialectal Arabic, as used in SMS, chat, and speech. Thus, our efforts focused on processing and standardizing Arabic, e.g., using...
Conference Paper
Full-text available
Poorly translated text is often disfluent and difficult to read. In contrast, well-formed translations require less time to process. In this paper, we model the differences in reading patterns of Machine Translation (MT) evalua-tors using novel features extracted from their gaze data, and we learn to predict the quality scores given by those evalua...
Conference Paper
Nowadays, commercial optical character recognition (OCR) software achieves very high accuracy on high-quality scans of modern Arabic documents. However, a large fraction of Arabic heritage collections in libraries is usually more challenging -- e.g. consisting of typewritten documents, early prints, and historical manuscripts. In this paper, we pre...
Conference Paper
Full-text available
We present research towards bridging the language gap between migrant workers in Qatar and medical staff. In particular, we present the first steps towards the development of a real-world Hindi-English machine translation system for doctor-patient communication. As this is a low-resource language pair, especially for speech and for the medical doma...
Conference Paper
Full-text available
Joint models have recently shown to improve the state-of-the-art in machine translation (MT). We apply EM-based mixture modeling and data selection techniques using two joint models, namely the Operation Sequence Model or OSM — an ngram-based translation and reordering model, and the Neural Network Joint Model or NNJM — a continuous space translati...
Conference Paper
Full-text available
We present novel models for domain adaptation based on the neural network joint model (NNJM). Our models maximize the cross entropy by regularizing the loss function with respect to in-domain model. Domain adaptation is carried out by assigning higher weight to out-domain sequences that are similar to the in-domain data. In our alternative model we...
Conference Paper
Full-text available
This paper describes our recognition system for handwritten Arabic. We propose novel text line image normalization procedures and a new feature extraction method. Our recognition system is based on the Kaldi recognition toolkit which is widely used in automatic speech recognition (ASR) research. We show that the combination of sophisticated text im...
Conference Paper
Full-text available
Since Arabic script has a strong baseline, many state-of-the-art recognition systems for handwritten Arabic make use of baseline-dependent features. For printed Arabic, the baseline can be detected reliably by finding the maximum in the horizontal projection profile or the Hough transformed image. However, the performance of these methods drops sig...
Conference Paper
Full-text available
One of the basic challenges in page layout analysis of scanned document images is the estimation of the document skew. Precise skew correction is particularly important when the document is to be passed to an optical character recognition system. In this paper, we propose a very generic and robust method which can cope with a wide variety of docume...
Article
The evolution of mobile technology is moving at a very fast pace. Smartphones are currently considered a primary communication platform where people exchange voice calls, text messages and emails. The human-smartphone interaction, however, is generally optimized for sighted people through the use of visual cues on the touchscreen, e.g., typing text...
Conference Paper
Full-text available
The paper describes the Egyptian Arabic-to-English statistical machine translation (SMT) system that the QCRI-Columbia-NYUAD (QCN) group submitted to the NIST OpenMT'2015 competition. The competition focused on informal dialectal Arabic, as used in SMS, chat, and speech. Thus, our efforts focused on processing and standardizing Arabic, e.g., using...
Conference Paper
Zero-resource Automatic Speech Recognition (ZR ASR) addresses target languages without given pronunciation dictionary, transcribed speech, and language model. Lexical discovery for ZR ASR aims to extract word-like chunks from speech. Lexical discovery benefits from the availability of written translations in another source language [1, 2, 3]. In th...
Article
Full-text available
In this paper we present a recipe and language resources for training and testing Arabic speech recognition systems using the KALDI toolkit. We built a prototype broadcast news system using 200 hours GALE data that is publicly available through LDC. We describe in detail the decisions made in building the system: using the MADA toolkit for text nor...
Conference Paper
Full-text available
We study the impact of source length and verbosity of the tuning dataset on the performance of parameter optimizers such as MERT and PRO for statistical machine translation. In particular, we test whether the verbosity of the resulting translations can be modified by varying the length or the verbosity of the tuning sentences. We find that MERT lea...
Conference Paper
Full-text available
In this paper we describe and compare two techniques for the automatic diacritization of Arabic text: First, we treat diacritization as a monotone machine translation problem, proposing and evaluating several translation and language models, including word and character-based models separately and com- bined as well as a model which uses statisti-...
Article
In this paper, we study methods to discover words and extract their pronunciations from audio data for non-written and under- resourced languages. We examine the potential and the challenges of pronunciation extraction from phoneme sequences through cross-lingual word-to-phoneme alignment. In our scenario a human translator produces utterances in t...
Conference Paper
Full-text available
In this paper we tackle the task of bootstrapping an Automatic Speech Recognition system without an a priori given language model, a pronunciation dictionary, or transcribed speech data for the target language Slovene – only untranscribed speech and translations to other resource-rich source languages of what was said are avail-able. Therefore, our...
Conference Paper
Full-text available
This paper presents the AMARA corpus of on-line educational content: a new parallel corpus of educational video subtitles, multi-lingually aligned for 20 languages, i.e. 20 monolingual corpora and 190 parallel corpora. This corpus includes both resource-rich languages such as English and Arabic, and resource-poor languages such as Hindi and Thai. I...
Article
Full-text available
This paper describes a detailed comparison of several state-of-the-art speech recognition techniques applied to a limited Arabic broadcast news dataset. The different approaches were all trained on 50 hours of transcribed audio from the Al-Jazeera news channel. The best results were obtained using i-vector-based speaker adaptation in a training sce...
Conference Paper
Full-text available
In this paper, we introduce a new parallel corpus of subtitles of educational videos: the AMARA corpus for online edu-cational content. We crawl a multilingual collection com-munity generated subtitles, and present the results of pro-cessing the Arabic–English portion of the data, which yields a parallel corpus of about 2.6M Arabic and 3.9M English...
Conference Paper
Hiero translation models have two limitations compared to phrase-based models: 1) Limited hypothesis space; 2) No lexicalized reordering model. We propose an extension of Hiero called Phrasal-Hiero to address Hiero's second problem. Phrasal-Hiero still has the same hypothesis space as the original Hiero but incorporates a phrase-based distance cost...
Conference Paper
Full-text available
With the help of written translations in a source language, we cross-lingually segment phoneme sequences in a target language into word units using our new alignment model Model 3P [17]. From this, we deduce phonetic transcriptions of target language words, introduce the vocabulary in terms of word IDs, and extract a pronunciation dictionary. Our a...
Conference Paper
Full-text available
While experimenting with tuning on long sentences, we made an unexpected discovery: that PRO falls victim to monsters - overly long negative examples with very low BLEU+1 scores, which are unsuitable for learning and can cause testing BLEU to drop by several points absolute. We propose several effective ways to address the problem, using length- an...
Conference Paper
Full-text available
Research on statistical machine translation has focused on particular translation directions, typically with English as the target language, e.g., from Arabic to English. When we reverse the translation direction, the multiple reference translations turn into multiple possible inputs, which offers both challenges and opportunities. We propose and e...
Conference Paper
Full-text available
We present our new alignment model Model 3P for cross-lingual word-to-phoneme alignment, and show that unsupervised learning of word segmentation is more accurate when information of another language is used. Word segmentation with cross-lingual information is highly relevant to bootstrap pronunciation dictionaries from audio data for Automatic Spe...
Article
Full-text available
Most statistical machine translation systems use phrase-to-phrase translations to capture lo-cal context information, leading to better lexical choice and more reliable local reordering. The quality of the phrase alignment is crucial to the quality of the resulting translations. Here, we propose a new phrase alignment method, not based on the Viter...
Conference Paper
In this paper we explore the challenges in crowdsourcing the task of translation over the web in which remotely located translators work on providing translations independent of each other. We then propose a collaborative workflow for crowdsourcing translation to address some of these challenges. In our pipeline model, the translators are working i...
Conference Paper
Full-text available
We present a framework for the analysis of Machine Translation performance. We use multivariate linear models to determine the impact of a wide range of features on translation performance. Our assumption is that variables that most contribute to predict translation performance are the key to understand the differences between good and bad translat...
Article
Full-text available
We describe the systems developed by the team of the Qatar Computing Research Institute for the WMT12 Shared Translation Task. We used a phrase-based statistical machine translation model with several non-standard settings, most notably tuning data selection and phrase table combination. The evaluation results show that we rank second in BLEU and T...
Conference Paper
Full-text available
We study a problem with pairwise ranking optimization (PRO): That it tends to yield too short translations. We find that this is partially due to the inadequate smoothing in PRO's BLEU+1, which boosts the precision component of BLEU but leaves the brevity penalty unchanged, thus destroying the balance between the two, compared to BLEU. It is also p...
Conference Paper
In past Evaluations for Machine Translation of European Languages, it could be shown that the translation performance of SMT systems can be increased by integrating a bilingual language model into a phrase-based SMT system. In the bilingual language model, target words with their aligned source words build the tokens of an n-gram based language mod...
Conference Paper
This paper describes the statistical machine translation system submitted to the WMT11 Featured Translation Task, which involves translating Haitian Creole SMS messages into English. In our experiments we try to address the issue of noise in the training data, as well as the lack of parallel training data. Spelling normalization is applied to reduc...
Conference Paper
Supervised learning algorithms for identifying comparable sentence pairs from a dominantly non-parallel corpora require resources for computing feature functions as well as training the classifier. In this paper we propose active learning techniques for addressing the problem of building comparable data for low-resource languages. In particular we...
Conference Paper
Mining parallel data from comparable corpora is a promising approach for overcoming the data sparseness in statistical machine translation and other NLP applications. Even if two comparable documents have few or no parallel sentence pairs, there is still potential for parallelism in the sub-sentential level. The ability to detect these phrases crea...
Conference Paper
In this paper we present a novel approach of utilizing Semantic Role Labeling (SRL) information to improve Hierarchical Phrase-based Machine Translation. We propose an algorithm to extract SRL-aware Synchronous Context-Free Grammar (SCFG) rules. Conventional Hiero-style SCFG rules will also be extracted in the same framework. Special conversion rul...
Conference Paper
Full-text available
In this work we propose methods to label probabilistic synchronous context-free grammar (PSCFG) rules using only word tags, generated by either part-of-speech analysis or unsupervised word class induction. The proposals range from simple tag-combination schemes to a phrase clustering model that can incorporate an arbitrary number of features. Our m...
Conference Paper
Full-text available
We present an approach of expanding parallel corpora for machine translation. By utilizing Semantic role labeling (SRL) on one side of the language pair, we extract SRL substitution rules from existing parallel corpus. The rules are then used for generating new sentence pairs. An SVM classifier is built to filter the generated sentence pairs. The f...
Conference Paper
Full-text available
Word alignment has an exponentially large search space, which often makes exact inference infeasible. Recent studies have shown that inversion transduction grammars are reasonable constraints for word alignment, and that the constrained space could be efficiently searched using synchronous parsing algorithms. However, spurious ambiguity may occur i...
Conference Paper
As researchers embrace micro-task markets for eliciting human input, the nature of the posted tasks moves from those requiring simple mechanical labor to requiring specific cognitive skills. On the other hand, increase is seen in the number of such tasks and the user population in micro-task market places requiring better search interfaces for prod...
Article
In this paper, we propose that MT is an important technology in crisis events, something that can and should be an integral part of a rapid-response infrastructure. By integrating MT services directly into a messaging infrastructure (whatever the type of messages being serviced, e.g., text messages, Twitter feeds, blog postings, etc.), MT can be us...