Wolfgang Macherey's research while affiliated with Google Inc. and other places

Publications (64)

Preprint
Full-text available
We present Mu$^{2}$SLAM, a multilingual sequence-to-sequence model pre-trained jointly on unlabeled speech, unlabeled text and supervised data spanning Automatic Speech Recognition (ASR), Automatic Speech Translation (AST) and Machine Translation (MT), in over 100 languages. By leveraging a quantized representation of speech as a target, Mu$^{2}$SL...
Preprint
Full-text available
In this paper we share findings from our effort to build practical machine translation (MT) systems capable of translating across over one thousand languages. We describe results in three research domains: (i) Building clean, web-mined datasets for 1500+ languages by leveraging semi-supervised pre-training for language identification and developing...
Preprint
Multilingual neural machine translation models are trained to maximize the likelihood of a mix of examples drawn from multiple language pairs. The dominant inductive bias applied to these models is a shared vocabulary and a shared set of parameters across languages; the inputs and labels corresponding to examples drawn from different language pairs...
Article
Full-text available
Human evaluation of modern high-quality machine translation systems is a difficult problem, and there is increasing evidence that inadequate evaluation procedures can lead to erroneous conclusions. While there has been considerable research on human evaluation, the field still lacks a commonly accepted standard procedure. As a step toward this goal...
Preprint
Full-text available
Self-supervised pre-training of text representations has been successfully applied to low-resource Neural Machine Translation (NMT). However, it usually fails to achieve notable gains on resource-rich NMT. In this paper, we propose a joint training approach, $F_2$-XEnDec, to combine self-supervised and supervised learning to optimize NMT models. To...
Preprint
Full-text available
Human evaluation of modern high-quality machine translation systems is a difficult problem, and there is increasing evidence that inadequate evaluation procedures can lead to erroneous conclusions. While there has been considerable research on human evaluation, the field still lacks a commonly-accepted standard procedure. As a step toward this goal...
Preprint
Full-text available
We propose a simple and effective method for machine translation evaluation which does not require reference translations. Our approach is based on (1) grounding the entity mentions found in each source sentence and candidate translation against a large-scale multilingual knowledge base, and (2) measuring the recall of the grounded entities found i...
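The metric described (entity grounding plus recall) reduces to a simple set computation once entities are linked. Below is a minimal sketch assuming the grounding step has already produced knowledge-base IDs; the IDs and the `kobe_score` name are illustrative, not the paper's API, and a real system would obtain the entity sets from a multilingual entity linker.

```python
def kobe_score(src_entities, hyp_entities):
    """Reference-free adequacy proxy in the spirit described above:
    recall of the source sentence's grounded entities in the candidate
    translation. Inputs are sets of knowledge-base entity IDs produced
    by an (assumed) multilingual entity linker."""
    if not src_entities:
        return 1.0  # nothing to recall
    return len(set(src_entities) & set(hyp_entities)) / len(set(src_entities))

# Source mentions "Paris" (Q90) and "Seine" (Q1471); the candidate
# translation preserved only the entity for "Paris".
print(kobe_score({"Q90", "Q1471"}, {"Q90"}))  # 0.5
```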
Preprint
Full-text available
In this paper, we propose a new adversarial augmentation method for Neural Machine Translation (NMT). The main idea is to minimize the vicinal risk over virtual sentences sampled from two vicinity distributions, of which the crucial one is a novel vicinity distribution for adversarial sentences that describes a smooth interpolated embedding space c...
Preprint
There has been great progress in improving streaming machine translation, a simultaneous paradigm where the system appends to a growing hypothesis as more source content becomes available. We study a related problem in which revisions to the hypothesis beyond strictly appending words are permitted. This is suitable for applications such as live cap...
Preprint
We investigate the problem of simultaneous machine translation of long-form speech content. We target a continuous speech-to-text scenario, generating translated captions for a live audio feed, such as a lecture or play-by-play commentary. As this scenario allows for revisions to our incremental translations, we adopt a re-translation approach to s...
Preprint
We introduce our efforts towards building a universal neural machine translation (NMT) system capable of translating between any language pair. We set a milestone towards this goal by building a single massively multilingual NMT model handling 103 languages trained on over 25 billion examples. Our system demonstrates effective transfer learning abi...
Preprint
Simultaneous machine translation begins to translate each source sentence before the source speaker is finished speaking, with applications to live and streaming scenarios. Simultaneous systems must carefully schedule their reading of the source sentence to balance quality against latency. We present the first simultaneous translation system to lea...
Preprint
Full-text available
Neural machine translation (NMT) is often vulnerable to noisy perturbations in the input. We propose an approach to improving the robustness of NMT models, which consists of two parts: (1) attack the translation model with adversarial source examples; (2) defend the translation model with adversarial target inputs to improve its ro...
Preprint
We present an attention-based sequence-to-sequence neural network which can directly translate speech from one language into speech in another language, without relying on an intermediate text representation. The network is trained end-to-end, learning to map speech spectrograms into target spectrograms in another language, corresponding to the tra...
Preprint
Multilingual Neural Machine Translation (NMT) models are capable of translating between multiple source and target languages. Despite various approaches to train such models, they have difficulty with zero-shot translation: translating between language pairs that were not seen together during training. In this paper we first diagnose why state-of-t...
Preprint
Full-text available
Lingvo is a TensorFlow framework offering a complete solution for collaborative deep learning research, with a particular focus on sequence-to-sequence models. Lingvo models are composed of modular building blocks that are flexible and easily extensible, and experiment configurations are centralized and highly customizable. Distributed trainin...
Preprint
End-to-end Speech Translation (ST) models have many potential advantages when compared to the cascade of Automatic Speech Recognition (ASR) and text Machine Translation (MT) models, including lowered inference latency and the avoidance of error compounding. However, the quality of end-to-end ST is often limited by a paucity of training data, since...
Preprint
Transferring representations from large supervised tasks to downstream tasks has shown promising results in AI fields such as Computer Vision and Natural Language Processing (NLP). In parallel, the recent progress in Machine Translation (MT) has enabled one to train multilingual Neural MT (NMT) systems that can translate between multiple languages...
Preprint
Full-text available
Translating characters instead of words or word-fragments has the potential to simplify the processing pipeline for neural machine translation (NMT), and improve results by eliminating hyper-parameters and manual feature engineering. However, it results in longer sequences in which each symbol contains less information, creating both modeling and c...
Preprint
The past year has witnessed rapid advances in sequence-to-sequence (seq2seq) modeling for Machine Translation (MT). The classic RNN-based approaches to MT were first outperformed by the convolutional seq2seq model, which was then outperformed by the more recent Transformer model. Each of these new approaches consists of a fundamental architecture...
Article
Neural Machine Translation (NMT) is an end-to-end learning approach for automated translation, with the potential to overcome many of the weaknesses of conventional phrase-based translation systems. Unfortunately, NMT systems are known to be computationally expensive both in training and in translation inference. Also, most NMT systems have difficu...
Patent
Full-text available
Systems, methods, and apparatuses including computer program products for machine translation. A method is provided that includes generating a plurality of machine translation systems using a single machine translation engine, and generating a consensus translation from a plurality of candidate translations for a source sentence, where each candida...
Patent
Full-text available
Methods, systems, and apparatus, including computer program products, for language translation are disclosed. In one implementation, a method is provided. The method includes accessing a hypothesis space, where the hypothesis space represents a plurality of candidate translations; performing decoding on the hypothesis space to obtain a translation...
Patent
Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for presenting alternative translations. In one aspect, a method includes receiving source language text; receiving translated text corresponding to the source language text from a machine translation system; receiving segmentation data for the transl...
Patent
Full-text available
Systems, methods, and computer program products are provided for statistical machine translation. In some implementations a method is provided. The method includes receiving multi-lingual parallel text associating a source language, a target language, and one or more bridge languages, determining an alignment between the source language and the tar...
Conference Paper
We present a simple and effective infrastructure for domain adaptation for statistical machine translation (MT). To build MT systems for different domains, it trains, tunes and deploys a single translation system that is capable of producing adapted domain translations and preserving the original generic accuracy at the same time. The approach unif...
Conference Paper
This paper presents efficient algorithms for expected similarity maximization, which coincides with minimum Bayes decoding for a similarity-based loss function. Our algorithms are designed for similarity functions that are sequence kernels in a general class of positive definite symmetric kernels. We discuss both a general algorithm and a more...
Conference Paper
Minimum Error Rate Training (MERT) and Minimum Bayes-Risk (MBR) decoding are used in most current state-of-the-art Statistical Machine Translation (SMT) systems. The algorithms were originally developed to work with N-best lists of translations, and recently extended to lattices that encode many more hypotheses than typical N-best lists. We he...
Conference Paper
Minimum Error Rate Training (MERT) is an effective means to estimate the feature function weights of a linear model such that an automated evaluation criterion for measuring system performance can directly be optimized in training. To accomplish this, the training procedure determines for each feature function its exact error surface on a given...
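The key observation behind MERT's exact error surface is that along a fixed search direction each hypothesis's model score is linear in the step size, so the error of the 1-best hypothesis is piecewise constant. A brute-force toy sketch (not the paper's efficient envelope algorithm; names and data are illustrative):

```python
def mert_error_surface(hyps):
    """MERT line optimization on a toy N-best list. Each hypothesis is
    (slope a, offset b, error_count e): its model score along the chosen
    search direction is a*gamma + b, linear in gamma, so the error of the
    1-best hypothesis is piecewise constant in gamma. Returns a list of
    (gamma_boundary, error) pairs, probing just right of each boundary."""
    pts = {0.0}
    # candidate boundaries: intersections of every pair of score lines
    for i, (a1, b1, _) in enumerate(hyps):
        for a2, b2, _ in hyps[i + 1:]:
            if a1 != a2:  # parallel lines never swap rank
                pts.add((b2 - b1) / (a1 - a2))
    surface = []
    for g in sorted(pts):
        mid = g + 1e-6  # evaluate just right of the boundary
        best = max(hyps, key=lambda h: h[0] * mid + h[1])
        surface.append((g, best[2]))
    return surface

# Three hypotheses whose scores along the line are gamma, 1 - gamma, 0.4.
hyps = [(1.0, 0.0, 2), (-1.0, 1.0, 0), (0.0, 0.4, 1)]
print(mert_error_surface(hyps))  # error jumps from 0 to 2 at gamma = 0.5
```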
Conference Paper
We present Minimum Bayes-Risk (MBR) decoding over translation lattices that compactly encode a huge number of translation hypotheses. We describe conditions on the loss function that will enable efficient implementation of MBR decoders on lattices. We introduce an approximation to the BLEU score (Papineni et al., 2001) that satisfies these...
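For intuition, here is MBR decoding over a plain N-best list (the paper's contribution is the lattice generalization, not shown here); the similarity function below is a toy n-gram-overlap stand-in for sentence-level BLEU, and all names and data are illustrative.

```python
from collections import Counter

def ngram_overlap(hyp, ref, n=2):
    """Toy gain function standing in for sentence-level BLEU: clipped
    fraction of hyp n-grams (orders 1..n) also found in ref."""
    hyp_t, ref_t = hyp.split(), ref.split()
    score = total = 0
    for k in range(1, n + 1):
        h = Counter(tuple(hyp_t[i:i + k]) for i in range(len(hyp_t) - k + 1))
        r = Counter(tuple(ref_t[i:i + k]) for i in range(len(ref_t) - k + 1))
        score += sum((h & r).values())  # clipped matches
        total += sum(h.values())
    return score / total if total else 0.0

def mbr_decode(nbest):
    """nbest: list of (hypothesis, posterior probability). Returns the
    hypothesis maximizing expected gain (equivalently, minimizing
    expected loss) under the model posterior."""
    best, best_gain = None, -1.0
    for hyp, _ in nbest:
        gain = sum(p * ngram_overlap(hyp, ref) for ref, p in nbest)
        if gain > best_gain:
            best, best_gain = hyp, gain
    return best

nbest = [("the cat sat on the mat", 0.40),
         ("a cat sat on the mat", 0.35),
         ("the dog sat on a mat", 0.25)]
# MBR prefers the "consensus" hypothesis over the 1-best.
print(mbr_decode(nbest))  # -> a cat sat on the mat
```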
Conference Paper
We describe an approach to improve Statistical Machine Translation (SMT) performance using multi-lingual, parallel, sentence-aligned corpora in several bridge languages. Our approach consists of a simple method for utilizing a bridge language to create a word alignment system and a procedure for combining word alignment systems from multiple bridge...
Conference Paper
Full-text available
Discriminative training criteria have been shown to consistently outperform maximum likelihood trained speech recognition systems. In this paper we employ the Minimum Classification Error (MCE) criterion to optimize the parameters of the acoustic model of a large scale speech recognition system. The statistics for both the correct and the competi...
Conference Paper
Full-text available
In this paper the methods we used in the 2005 ImageCLEF content-based image retrieval evaluation are described. For the medical retrieval task, we combined several low-level image features with textual information retrieval. Combining these two information sources, clear improvements over the use of one of these sources alone are possible. Addition...
Conference Paper
Full-text available
In this paper we present the minimum exact word error (exactMWE) training criterion to optimise the parameters of large scale speech recognition systems. The exactMWE criterion is similar to the minimum word error (MWE) criterion, which minimises the expected word error, but uses the exact word error instead of an approximation based on time alignm...
Conference Paper
Full-text available
Discriminative training techniques have proved to be a powerful method for improving large vocabulary speech recognition systems based on Gaussian mixture hidden Markov models. Typically, the optimization of discriminative objective functions is done using the extended Baum algorithm. Since for continuous distributions no proof of fast and stable c...
Article
We integrate the tangent method into a statistical framework for classification analytically and practically. The resulting consistent framework for adaptation allows us to efficiently estimate the tangent vectors representing the variability. The framework improves classification results on two real-world pattern recognition tasks from the domains...
Article
Full-text available
We integrate the tangent method into a statistical framework for classification analytically and practically. The resulting consistent framework for adaptation allows us to efficiently estimate the tangent vectors representing the variability. The framework improves classification results on two real-world pattern recognition tasks from the domains...
Article
Full-text available
While Maximum Entropy (ME) based learning procedures have been successfully applied to text based natural language processing, there have been few investigations into using ME for acoustic modeling in automatic speech recognition. In this paper we show that the well known Generalized Iterative Scaling (GIS) algorithm can be used as an alternative m...
Article
Accessing information in multimedia databases encompasses a wide range of applications in which spoken document retrieval (SDR) plays an important role. In SDR, a set of automatically transcribed speech documents constitutes the files for retrieval, to which a user may address a request in natural language. This paper deals with two probabilistic a...
Conference Paper
Full-text available
Accessing information in multimedia databases encompasses a wide range of applications in which spoken document retrieval (SDR) plays an important role. In the recent past, research increasingly focused on the development of heuristic and probabilistic retrieval metrics that are suitable for retrieving spoken documents. So far, many heuristic retri...
Article
Full-text available
When setting up a speech recognition system for a new domain, a lot of manual effort is spent on corpus preparation, i.e., data acquisition, cutting and segmentation of the audio material, generation of pronunciation lexica, as well as the definition of suitable training and test sets. In this paper we describe several methods that help to automate...
Conference Paper
Full-text available
In many applications, modelling techniques are necessary which take into account the inherent variability of given data. In this paper, we present an approach to model class specific pattern variation based on tangent distance within a statistical framework for classification. The model is an effective means to explicitly incorporate invariance wit...
Conference Paper
Full-text available
In this paper we present a new approach to variance modelling in automatic speech recognition (ASR) that is based on tangent distance (TD). Using TD, classifiers can be made invariant w.r.t. small transformations of the data. Such transformations generate a manifold in a high dimensional feature space when applied to an observation vector. W...
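In its simplest one-sided, single-tangent form, tangent distance is a small least-squares problem: the distance from an observation to the line through the prototype along the tangent vector. A minimal sketch (the papers above use richer tangent sets and a full statistical model; names here are illustrative):

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def tangent_distance(x, p, t):
    """One-sided tangent distance with a single tangent vector t at
    prototype p: min over a of ||x - (p + a*t)||. The minimizer has the
    closed form a* = t.(x - p) / t.t (one-variable least squares)."""
    diff = [xi - pi for xi, pi in zip(x, p)]
    a = dot(t, diff) / dot(t, t)
    resid = [d - a * ti for d, ti in zip(diff, t)]
    return math.sqrt(dot(resid, resid))

# t spans the direction of an allowed transformation of prototype p;
# x is p shifted along that direction, so it should cost nothing.
p, t = (1.0, 0.0), (0.0, 1.0)
x = (1.0, 0.5)
print(tangent_distance(x, p, t))  # 0.0: x lies on the tangent line
print(math.dist(x, p))            # 0.5: plain Euclidean distance
```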
Article
The aim of this work is to build up a common framework for a class of discriminative training criteria and optimization methods for continuous speech recognition. A unified discriminative criterion based on likelihood ratios of correct and competing models with optional smoothing is presented. The unified criterion leads to particular criteria thro...
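A hedged sketch of such a unified criterion, in my own notation rather than necessarily the paper's: with acoustic model $p_\Lambda(X\mid W)$, language model $P(W)$, hypothesis set $\mathcal{M}_r$ per training utterance $r$, smoothing function $f$, and exponent $\eta$,

```latex
F(\Lambda) \;=\; \sum_{r=1}^{R} f\!\left(
  \log \frac{p_\Lambda(X_r \mid W_r)\, P(W_r)}
            {\Bigl[\, \sum_{W \in \mathcal{M}_r} p_\Lambda(X_r \mid W)^{\eta}\, P(W)^{\eta} \Bigr]^{1/\eta}}
\right)
```

Choosing $f(z) = z$ with $\mathcal{M}_r$ the full hypothesis set yields an MMI-style criterion; choosing a smoothed sigmoid for $f$ and restricting $\mathcal{M}_r$ to competing hypotheses yields an MCE-style criterion.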
Conference Paper
Full-text available
In this work a method for splitting continuous mixture density hidden Markov models (HMM) is presented. The approach combines a model evaluation measure based on the Maximum Mutual Information (MMI) criterion with subsequent standard Maximum Likelihood (ML) training of the HMM parameters. Experiments were performed on the SieTill corpus for...
Conference Paper
A formally unifying approach for a class of discriminative training criteria including maximum mutual information (MMI) and minimum classification error (MCE) criterion is presented, together with the optimization methods of the gradient descent (GD) and extended Baum-Welch (EB) algorithm. Comparisons are discussed for the MMI and the MCE criterion...
Article
Full-text available
In this work we compare two parameter optimization techniques for discriminative training using the MMI criterion: the extended Baum-Welch (EBW) algorithm and the generalized probabilistic descent (GPD) method. Using Gaussian emission densities we found special expressions for the step sizes in GPD, leading to reestimation formula very similar t...
Article
In this paper, a formally unifying approach for a class of discriminative training criteria including Maximum Mutual Information (MMI) and Minimum Classification Error (MCE) criterion is presented, including the optimization methods gradient descent (GD) and extended Baum-Welch (EB) algorithm. Comparisons are discussed for the MMI and the MCE crite...
Article
Discriminative training has become an important means for estimating model parameters in many statistical pattern recognition tasks. While standard learning methods based on the Maximum Likelihood criterion optimize the parameters of each class individually, discriminative approaches benefit from taking all competing classes into account, t...

Citations

... Hidden variables may lead to a non-convex objective function, and hence standard training algorithms cannot be applied. However, a modified version of the training algorithm for MaxEnt modeling that can discriminatively estimate the parameters of the state Gaussian densities is developed in (Macherey and Ney, 2003). ...
... Multilingual Mix, Bapna et al. (2022): To optimize multilingual NMT models, Bapna et al. (2022) introduce a multilingual crossover approach that promotes sharing inputs and outputs across languages. ...
... We give an example of this prompt in Figure 1. The idea is to prompt ChatGPT to generate a human-like evaluation like MQM (Freitag et al., 2021) by ❶ identifying major and minor errors, and ❷ scoring the translations according to the severity of these errors. In addition, we also explore the potential of ChatGPT compared with modern neural metrics like COMET (Rei et al., 2020), BERTScore (Zhang et al., 2020) and BLEURT (Sellam et al., 2020). ...
... Reference-based evaluation metrics require matching the MT output against reference translations, as in lexical metrics such as BLEU, TER, and NIST. In the case of KoBE [85], by contrast, evaluation is based on a multilingual knowledge base, linking the entities of the source sentences to those in the MT output sentences; the recall of these grounded entities in the MT output is then measured. ...
... Streaming translation models either adopt fixed policies [3,7,8] or adaptive policies [1,2,4,9,10] to find the READ-WRITE paths and need to balance translation quality and latency. Retranslation models re-translate each successive source prefix to revise previous partial translations, requiring careful control of the flicker in the translation [11,12]. However, there is a thought-provoking phenomenon. ...
... Liu et al. (2019) incorporated noisy input with similar pronunciations into NMT training to address homophone noise. Cheng, Jiang, and Macherey (2019); Cheng et al. (2020) proposed to use adversarial examples to address synonym problem. However, they only focus on the noisy input, part of the source diversity phenomenon. ...
... Recently, remarkable progress has been made by simultaneous machine translation (SimulMT) models [1,2], consisting of streaming translation models [3,4] that do not revise translations and re-translation models [5,6] with revision. Streaming translation models either adopt fixed policies [3,7,8] or adaptive policies [1,2,4,9,10] to find the READ-WRITE paths and need to balance translation quality and latency. ...
... To date, cascade systems consisting of automatic speech recognition (ASR), machine translation (MT), and text-to-speech (TTS) systems remain dominant in high-resource language pair settings. By contrast, direct S2ST methods do not rely on text generation as an intermediate step, and have been gaining attention due to their low computational cost and the ability to translate unwritten languages [2][3][4][5]. However, these studies often focus on accurate semantic translation, overlooking the para-linguistic information in speech, which plays a crucial role in speech communication. ...
... A traditional streaming simultaneous translation (ST) system is usually formed by cascading a streaming auto-speechrecognition (ASR) component with a streaming machine translation (MT) component [1,2]. Most of the previous work focuses on simultaneous text translation [3,4,5,6], where main research direction is to figure out a reasonable READ policy and WRITE policy. Recent work generally splits into two classes: • READ Policies determine whether to wait for another source word (READ). ...