Syntax-aware Transformers for Neural Machine Translation: The Case of Text to Sign Gloss Translation
Santiago Egea Gómez
Universitat Pompeu Fabra
santiago.egea@upf.edu

Euan McGill
Universitat Pompeu Fabra
euan.mcgill@upf.edu

Horacio Saggion
Universitat Pompeu Fabra
horacio.saggion@upf.edu
Abstract
It is well established that Sign Languages (SLs) are the preferred mode of communication of the deaf and hard of hearing (DHH) community, but they are considered low-resource languages as far as natural language processing technologies are concerned. In this paper we study the problem of text to SL gloss Machine Translation (MT) using Transformer-based architectures. Despite the significant advances of MT for spoken languages over the past couple of decades, MT is in its infancy when it comes to SLs. We enrich a Transformer-based architecture by aggregating syntactic information extracted from a dependency parser with word embeddings. We test our model on a well-known dataset, showing that the syntax-aware model obtains performance gains in terms of MT evaluation metrics.
1 Introduction
Access to information is a human right, and crossing language barriers is essential for global information exchange and unobstructed, fair communication. However, the goal of making information accessible to all is still far from being a reality. The World Health Organisation (WHO) reports that there are some 466 million people in the world today with disabling hearing loss (https://www.who.int/news-room/fact-sheets/detail/deafness-and-hearing-loss); moreover, it is estimated that this number will double by 2050. According to the World Federation of the Deaf (WFD), over 70 million people are deaf and communicate primarily via a sign language (SL).

It is well established that SLs are the preferred mode of communication of the deaf and hard of hearing (DHH) community (Stoll et al., 2020), but they are considered extremely low-resource languages (Moryossef et al., 2021) and lag further behind in terms of the language technologies available to DHH people. 150 SLs have been classified around the world (Eberhard et al., 2021), while there may be upwards of 400 according to SIL International (https://www.sil.org/sign-languages). Creating accessible-to-all technological solutions may also mitigate the effect of the more variable reading literacy rate in the DHH community (Berke et al., 2018). The written language is usually the ambient spoken language of the geographical area where signers are found (e.g. English in the British Sign Language area), and providing resources in native SL could benefit the provision and uptake of sign language technology.
Machine translation (MT) (Koehn, 2009) is a core technique for reducing language barriers that has advanced, and seen many breakthroughs, since it began in the 1950s (Johnson et al., 2017), reaching quality levels comparable to humans (Hassan et al., 2018). Despite the significant advances of MT for spoken languages over the past couple of decades, MT is in its infancy when it comes to SLs.
The output of MT between spoken languages tends to be text, but there are further considerations for researchers doing Sign Language translation (SLT). Full writing systems exist for SL (e.g. HamNoSys (Hanke, 2004), SiGML (Zwitserlood et al., 2004)), but they are not always the output, or used at all, in SLT. SL glosses are a lexeme-based representation of signs in which classifier predicates and manual and non-manual cues (Porta et al., 2014) are distilled into a lexical representation, usually in the ambient spoken language. The articulators in SLs include hand configuration and trajectory, facial articulators such as lip position and eyebrow configuration, and spatial articulation such as eye gaze and body position (Mukushev et al., 2020), all used to convey meaning. Glosses, and the Text2Gloss process, are an essential step in the MT pipeline between spoken and sign languages, even though some researchers consider them a flawed representation which hinders the extraction of meaning (Yin and Read, 2020). Although some current approaches to SL translation follow an end-to-end paradigm, translating into glosses offers an intermediate representation which could drive the generation of the actual virtual signs, e.g. by an avatar (Almeida et al., 2015; López-Ludeña et al., 2014). A growing number of researchers (Jantunen et al., 2021) have been using innovative methods to leverage the limited supply of SL gloss corpora and resources for SL technology.
In spite of the impressive results achieved by Neural Machine Translation (NMT) when massive parallel datasets are available for training using just token-level information, recent research (Armengol Estapé and Ruiz Costa-Jussà, 2021) shows that morphological and syntactic information extracted from linguistic processors can help for out-of-domain machine translation or for morphologically rich languages.
In this work, we make transformer models for NMT 'syntax-aware': syntactic information embeddings are included alongside word embeddings in the encoder part of the model. The rationale for including syntactic embeddings draws from the success of word embeddings in improving natural language processing tasks, including syntactic parsing itself (Socher et al., 2013), and from the context-sensitive embeddings pioneered in transformer models (Vaswani et al., 2017; Devlin et al., 2019; Liu et al., 2020). We posit that encoding syntactic information will in turn boost the performance of Text2Gloss, as we show with our experimental results.
The rest of the paper is organised as follows: in the next section we briefly introduce the project in the context of which this work is being carried out. Then, in Section 3, we present related work on SL translation and background on NMT, and in Section 4 we describe the NMT architecture we use in our experiments. In Section 5 we describe the experimental methodology, including data and evaluation metrics, while in Section 6 we present quantitative results. Section 7 analyses the results, while Section 8 closes the paper and discusses further work which could expand this avenue of research.
2 The SignON project
SignON (https://signon-project.eu/) is a Horizon 2020 project which aims to develop a communication service that translates between sign and spoken (in both text and audio modalities) languages and caters for the communication needs between DHH and hearing individuals (Saggion et al., 2021). Currently, human interpreters are the main medium for sign-to-spoken, spoken-to-sign and sign-to-sign language translation. The availability and cost of these professionals is often a limiting factor in communication between signers and non-signers. The SignON communication service will translate between sign and spoken languages, bridging language gaps when professional interpretation is unavailable. A key piece of this project is the server which will host the translation engine, which imposes demanding requirements in terms of latency and efficiency.
3 Related Work
The bottleneck to creating SL technology primarily lies in the available training data, such as existing corpora and lexica. Certain corpora may be overly domain-specific (San-Segundo et al., 2010), contain only sentence fragments or example signs as part of a lexicon (Cabeza et al., 2016), have little variation in individual signers or in the framing of the signer in 3D space (Nunnari et al., 2021), or simply be too small to train large neural models alone (Jantunen et al., 2021).

The next section describes current methods to mitigate the data-scarcity problem, and state-of-the-art models and studies with sign language gloss data, including Text2Gloss, Gloss2Text, and efforts towards end-to-end (E2E) SLT.
3.1 Transformer models for NMT
The Transformer architecture has been successful in covering a large number of language pairs with great accuracy in MT tasks, most notably in models such as BART (Lewis et al., 2020) and mBART (Liu et al., 2020). mT5 (Xue et al., 2021) also performs well with an even larger set of languages, many of which are considered low-resource. These models are also highly adaptable to other NLP tasks by means of fine-tuning (Lewis et al., 2020). In addition, recent work has shown that transformer models that include embeddings with linguistic information improve performance on a low-resource language pair by 1.2 BLEU over a baseline (Armengol Estapé and Ruiz Costa-Jussà, 2021), also when using arbitrary features derived from neural models (Sennrich and Haddow, 2016). Their 'Factored Transformer' model inserts embeddings for lemmas, part-of-speech tags, lexical dependencies, and morphological features in the encoder of their attentional encoder-decoder architecture.

Table 1: T2G production examples

Spoken: Später breiten sich aber nebel oder hochnebelfelder aus
        (EN) Later, however, fog or high-fog fields are widening
Gloss:  ABER IM-VERLAUF NEBEL HOCH NEBEL IX
        (EN) BUT IN-COURSE FOG HIGH FOG IX

(The IX gloss indicates that the signer needs to point to something or someone.)
In this work, a syntax-aware transformer model is proposed for Text2Gloss translation, one step in the SLT pipeline. Although current steps towards E2E SLT using transformer-based NMT systems look promising (Nunnari et al., 2021), using glosses as an intermediate representation still improves performance even in these state-of-the-art systems (Camgoz et al., 2020; Yin and Read, 2020). Our model exploits lexical dependency information to assist in learning the intrinsic grammatical rules involved in translating from text to glosses. Unlike other works, we consider model simplicity a key feature to fulfil the efficiency requirements of the SignON Project. Thus, we apply a simple aggregation scheme to inject syntactic information into the model and choose a relatively simple neural architecture. Using only lexical dependency features also allows us to examine the impact of this individual linguistic feature on model performance.
4 System Overview: A Syntax-aware Transformer for Text2Gloss
Our model is an Encoder-Decoder architecture in which the input embeddings to the Encoder are augmented with lexical dependency information. As can be noted from Table 1, gloss production from spoken text is essentially based on word permutations, stemming and deletions. In many cases, those transformations depend on the syntactic functions of words; for example, determiners are always removed to produce glosses. Consequently, we believe that word dependency tags might assist in modelling the syntactic rules which are intrinsic to gloss production.

Importantly, our Text2Gloss model has been developed considering the efficiency requirements of the SignON Project. Therefore, the size of the architecture has been selected to produce accurate translations while remaining lightweight. Figure 1 shows the different modules composing our system.
The neural architecture employed is based on multi-attention layers (Vaswani et al., 2017), which have produced excellent results when modelling long input sequences. More specifically, the Encoder and Decoder are each composed of three multi-attention layers with four attention heads. The internal dimension of the fully connected network is set to 1024 and the output units to 512. The Encoder transforms inputs into latent vectors, whilst the Decoder produces word probabilities from the encoded latent representations.
Our system augments the discriminative power of the embeddings input to the Encoder by aggregating syntactic information with the word embeddings. Unlike Armengol Estapé and Ruiz Costa-Jussà (2021), who added encoders to manage the injected features, we integrate an additional table that contains the vector embeddings for the syntactic tags. The word and syntax embeddings are summed, producing an aggregated embedding that is input to the Encoder. Both tables were set to have a vector length of 512.
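To make the aggregation concrete, the following minimal sketch (our illustration, not the authors' released code; the dependency-tag vocabulary size is an assumption) shows how a word-embedding table and a dependency-tag embedding table can be summed into the Encoder input:

```python
import tensorflow as tf

VOCAB_SIZE = 3000   # SentencePiece vocabulary size stated in the paper
NUM_DEP_TAGS = 50   # assumed size of the TIGER dependency tag set
EMB_DIM = 512       # embedding length of both tables, as stated


class SyntaxAwareEmbedding(tf.keras.layers.Layer):
    """Sums word-piece embeddings with dependency-tag embeddings."""

    def __init__(self):
        super().__init__()
        self.word_table = tf.keras.layers.Embedding(VOCAB_SIZE, EMB_DIM)
        self.syntax_table = tf.keras.layers.Embedding(NUM_DEP_TAGS, EMB_DIM)

    def call(self, word_ids, dep_tag_ids):
        # Both inputs have shape (batch, seq_len); the dependency tag of a
        # word is repeated over its subword tokens so the sequences align.
        return self.word_table(word_ids) + self.syntax_table(dep_tag_ids)
```

The aggregated embeddings would then feed the Transformer Encoder described above (3 layers, 4 heads, feed-forward dimension 1024, model dimension 512).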
To accommodate the input text to the neural model, we process it using subword tokenisation, and dependency tags are produced using the de_core_news_sm model available in the spaCy library (https://spacy.io/). The dependency tags we incorporate are from the TIGER dependency bank (Albert et al., 2003), included in the German model and designed specifically to categorise words in German (Brants et al., 2004). An example of these tags for a German sentence is shown in Figure 2. Word and syntax tokens were then aligned with the corresponding words, as shown in Figure 1. For the tokeniser, a SentencePiece model (Kudo and Richardson, 2018) was trained using only the training corpus, with a vocabulary size of 3000, keeping some tokens for control.
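As an illustration of this preprocessing step, the sketch below (ours; the training-file path and the printed output are hypothetical) extracts TIGER dependency labels with spaCy and trains the SentencePiece tokeniser:

```python
import spacy
import sentencepiece as spm

# German pipeline shipping TIGER dependency labels
# (install with: python -m spacy download de_core_news_sm).
nlp = spacy.load("de_core_news_sm")

def dependency_tags(sentence: str):
    """Return (token, TIGER dependency label) pairs for a German sentence."""
    return [(tok.text, tok.dep_) for tok in nlp(sentence)]

print(dependency_tags("am wochenende wird es etwas waermer"))
# e.g. [('am', 'mo'), ('wochenende', 'nk'), ('wird', 'ROOT'), ...]

# SentencePiece model trained only on the training corpus,
# with the vocabulary size of 3000 stated in the paper.
spm.SentencePieceTrainer.train(
    input="train.de.txt",      # hypothetical path to the training text
    model_prefix="text2gloss",
    vocab_size=3000,
)
```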
Figure 1: Syntax-Aware Text2Gloss model.

Figure 2: Lexical dependency tree diagram of the sentence "On the weekend it gets a little warmer". Key to tags: ep = expletive es, mo = modifier, nk = noun kernel element, pd = predicate.

Regarding training, the Adam optimiser with a learning rate of 10^-5 and a batch size of 64 was applied to optimise categorical cross-entropy for 500 epochs. Text generation was carried out using Beam Search decoding with 5 beams.
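A minimal, self-contained sketch of this optimisation set-up follows; the toy model merely stands in for the Transformer of this section, and the commented-out tensors are hypothetical:

```python
import tensorflow as tf

# Optimisation set-up stated above: Adam, learning rate 1e-5,
# batch size 64, categorical cross-entropy, 500 epochs.
VOCAB_SIZE, EMB_DIM = 3000, 512

# Toy stand-in for the 3-layer/4-head Transformer described above.
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, EMB_DIM),
    tf.keras.layers.Dense(VOCAB_SIZE),  # per-token word logits
])
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)
# model.fit(train_tokens, train_gloss_ids, batch_size=64, epochs=500)
# (train_tokens / train_gloss_ids would come from the data of Section 5;
#  beam search with 5 beams is applied at generation time.)
```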
5 Methods & Materials
In this section, we present the methods and materials used in this research. Firstly, we introduce the dataset used; then, the performance metrics and other implementation details are described.
5.1 Dataset: RWTH-PHOENIX-2014-T
The parallel corpus selected for our experiments is RWTH-PHOENIX-2014-T (Camgoz et al., 2018). It is publicly available (https://www-i6.informatik.rwth-aachen.de/~koller/RWTH-PHOENIX-2014-T/) and widely adopted for SLT research. The dataset contains images, together with transcriptions in German text and German Sign Language (DGS) glosses, of weather forecast news from a public TV station. The large vocabulary (1,066 different signs) and number of signers (nine) make this dataset promising for SLT research, albeit in a limited semantic domain. In this study, we only consider the text and gloss transcriptions.

The authors included development and test partitions in their dataset with patterns unseen in the training data. We used the development subset to control overfitting, and performances are reported on the test subset. Information about the different subsets included in RWTH-PHOENIX-2014-T is presented in Table 2.

Table 2: Data partition information

       #Samples  #Words  #Glosses
Train  7096      2887    1085
Dev    519       951     393
Test   642       1001    411
5.2 Performance Metrics
In order to fairly evaluate our approach, we have selected performance metrics that are extensively used in NMT. The metrics are introduced below:

Translation Edit Rate (TER):
TER (Snover et al., 2006) measures the quality of system translations by counting the number of edits needed to transform the produced text into the reference.

SacreBLEU:
SacreBLEU (Post, 2018) is a very popular metric for NMT. It provides a standard implementation of BLEU (Papineni et al., 2002) and standardises the input to the metric by means of tokenisation and normalisation, which makes scores from different works directly comparable. BLEU aims to correlate with human judgements of quality by comparing the system output against a reference translation.

ROUGE-L F1:
ROUGE-L (Lin, 2004) was primarily conceived for evaluating text summarisation models, but it has become popular for other NLP tasks. It measures the longest common subsequence between the given reference and the model output, without pre-defining an n-gram length. We report the F1 score, as also done in other works on this dataset (Camgoz et al., 2018; Yin and Read, 2020).

METEOR:
METEOR (Banerjee and Lavie, 2005) is an MT evaluation metric based on unigram matching. It combines unigram precision and recall to score word alignments, with recall having more influence on the score. It is considered to correlate better with human judgement than BLEU.

Generation time:
Finally, generation time is reported to assess our system in terms of computational efficiency. It is reported in seconds for each model.
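For illustration, corpus-level scores for two of these metrics could be computed with the sacrebleu package as sketched below (our example with toy hypothesis/reference strings; the paper itself also uses pyter, NLTK and a ROUGE implementation from google/seq2seq):

```python
import sacrebleu
from sacrebleu.metrics import TER

# One hypothesis stream and one parallel reference stream (toy data).
hypotheses = ["JETZT WETTER MORGEN SAMSTAG ZWOELF SEPTEMBER"]
references = [["JETZT WETTER MORGEN SAMSTAG ZWOELF SEPTEMBER"]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
ter = TER().corpus_score(hypotheses, references)
print(f"sacreBLEU: {bleu.score:.2f}")
print(f"TER: {ter.score:.2f}")
```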
5.3 Implementation Details
The experiments reported here were carried out using TensorFlow as the Deep Learning framework. The Embedding Tables, Encoder and Decoder implementations were inherited from the HuggingFace Transformers library (https://huggingface.co/transformers/), and spaCy was employed to produce word-dependency features. Finally, NLTK and other third-party code (https://github.com/BramVanroy/pyter, https://github.com/mjpost/sacrebleu, and https://github.com/google/seq2seq/blob/master/seq2seq/metrics/rouge.py) were used to compute the performance metrics adopted here. We make our code publicly available on GitHub (https://github.com/LaSTUS-TALN-UPF/Syntax-Aware-Transformer-Text2Gloss).
6 Results
Here, we present the results of our experiment. As the objective of this research is to evaluate the benefits of injecting syntactic information for Text2Gloss translation, we compare two models with the same architecture: one including lexical dependency information and one without it. These models are denoted Syntax and No-syntax, respectively, in this and subsequent sections.
6.1 Performance vs Epochs
Figure 3 presents the evolution of the performance metrics, measured every 5 training epochs while the models are being trained. It is apparent that including the syntactic information brings notable benefits for most of the metrics adopted, with the exception of METEOR.

Focusing on the sacreBLEU score, the Syntax model produces substantially better translations after 80 training epochs. After this point, the models converge and the difference in the sacreBLEU score between the models becomes more evident. Namely, the greatest difference between the two models occurs at epoch 165, when the Syntax model produces a sacreBLEU score 5.7 points higher than No-syntax.
As for TER, the differences between the curves are more remarkable. The Syntax model produces notably better TER scores than No-syntax; the score becomes stable after 95 epochs and its oscillations tend to diminish. At this point the Syntax model outperforms the No-syntax model by around 0.15 TER.
According to the ROUGE-L (F1) scores obtained, we also observe a slight improvement of the Syntax model over No-syntax, although this increase is not clear until epoch 150. In this case the differences are not as clear as for the metrics discussed above, but they still amount to improvements of more than 0.01 on this metric.
The METEOR score is the only metric that does not improve when syntactic information is included. In this regard, the No-syntax model produced better translations in terms of this score over the whole training phase. Once the models converge after 100 epochs, the greatest difference between the models occurs at epoch 350, when No-syntax overcomes the Syntax model by 0.029 points. It is also remarkable that the differences between the models are not higher than 0.015 at most points after convergence. The reason why No-syntax produces a slightly better METEOR score than Syntax might be that METEOR rewards unigram recall and the No-syntax model tends to repeat words, as we show in the next section. Nonetheless, we will further analyse this observation in future research.

Figure 3: Performance Metrics evolution during training.
Finally, as efficiency is one of the goals of our project, we turn to generation time. From the generation time curves shown in Figure 3, we can observe that injecting syntactic information does not lead to marked increases in generation time. We include the extra time necessary to produce the lexical dependency tags. For the training subset, the tagging process took around 20.9 seconds; this constitutes an overhead of 2.95 milliseconds per sentence compared to not using syntax tags. For the test subset, the tagging process took 3.23 seconds in total, which is not a marked increase considering the total generation times and that Syntax is up to 60 seconds faster than No-syntax (this is the case between epochs 155 and 180). The cause behind the large differences in generation times might be that Beam Search decoding produces more precise hypotheses and needs fewer decoding iterations when syntax tags are employed.
6.2 Best-performing points
From the previous analysis, we have identified the points at which the neural models converge and high variation is no longer present in the metric curves. In this section, we focus on the points at which the metrics reach their maximum values after the convergence point, which is located around epoch 100. Table 3 shows the best-performing values for all metrics.

From Table 3, we observe that the Syntax model reaches its maximum values in fewer epochs than No-syntax. This observation indicates that syntactic information might also benefit model learning, leading to shorter training times. Another observation is that most of the metrics are improved by injecting syntactic information, with the exception of METEOR.

Table 3: Best scores for the models. This table contains the maximum values for all metrics after convergence. The values in parentheses denote the epoch at which those values are produced.

           SacreBLEU    TER           ROUGE-L (F1)  METEOR
Syntax     53.52 (400)  0.722 (330)   0.467 (115)   0.407 (190)
No-syntax  51.06 (485)  0.814 (485)   0.461 (140)   0.424 (210)
Diff       2.46 (85)    -0.092 (155)  0.006 (35)    -0.017 (-20)
7 Discussion
In the previous section, we described quantitatively the results produced by our selected metrics. Complementing that, this section presents a qualitative analysis of the benefits that including lexical information in the transformer model brings to Text2Gloss translation. Table 4 contains two examples of how both models produce glosses at different training points.
As can be noted in both examples, the No-syntax model needs more epochs to produce coherent translations and tends to repeat some patterns, leading to corrupted outputs in some cases. This effect is quite remarkable in the second example, for which No-syntax keeps repeating patterns after 100 epochs while Syntax produces more coherent translations. This might explain why the No-syntax model obtains a slightly higher METEOR score than Syntax (see Section 6.1), while Syntax substantially outperforms its competitor in terms of sacreBLEU.
The fast-learning capacity exhibited by the Syntax model could be advantageous for our project, since domain adaptation is an expected feature of the system under development. We have also shown that injecting syntactic information into the encoder enables more accurate models without wholesale architecture modifications. The feature injection could be extended to other lexical features, such as part-of-speech tags, by integrating a new embedding table, as sketched below.
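Such an extension could reuse the aggregation pattern sketched in Section 4; the following fragment is hypothetical (the POS tagset size is assumed, and word_emb/syntax_emb denote embeddings already looked up):

```python
import tensorflow as tf

EMB_DIM = 512
NUM_POS_TAGS = 20  # assumed size of a coarse POS tagset

# One extra embedding table per injected linguistic feature;
# all feature embeddings are summed into a single Encoder input.
pos_table = tf.keras.layers.Embedding(NUM_POS_TAGS, EMB_DIM)

def aggregate(word_emb, syntax_emb, pos_ids):
    # word_emb, syntax_emb: (batch, seq_len, EMB_DIM); pos_ids: (batch, seq_len)
    return word_emb + syntax_emb + pos_table(pos_ids)
```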
8 Conclusion
In this paper we presented a syntax-aware transformer for Text2Gloss. To make the model syntax-aware, we inject word dependency tags to augment the discriminative power of the embeddings input to the Encoder. The way in which we extend transformers to include lexical dependency features involves only minor modifications to the neural architecture, with negligible impact on the computational complexity of the model.

As the results of this research show, injecting syntax dependencies can boost Text2Gloss model performance. Namely, our syntax-aware model overcame a traditional transformer in terms of BLEU, TER and ROUGE-L F1, while the METEOR metric was slightly worse for our model. Furthermore, we have shown that syntax information can also assist model learning, leading to faster modelling of complex patterns.

This preliminary research constitutes a promising starting point towards the objectives of the SignON Project, in which it is planned to deploy resource-hungry translation models on cloud-based computing servers.
Further work could compare the impact of other linguistic features, individually or in combination, such as part-of-speech tags, which are used in other studies applying syntactic tagging to NMT (Sennrich and Haddow, 2016; Armengol Estapé and Ruiz Costa-Jussà, 2021). It could also use more widely-adopted lexical dependency tags, such as those of the Universal Dependencies treebank (Borges Völker et al., 2019). Moreover, we are currently exploring data augmentation techniques to mitigate the scarce availability of SL data.
Acknowledgements
We thank the reviewers for their comments and suggestions. This work has been conducted within the SignON project. SignON is a Horizon 2020 project, funded under the Horizon 2020 programme ICT-57-2020 "An empowering, inclusive, Next Generation Internet" with Grant Agreement number 101017255.
Table 4: Some translation examples

Example 1
Source: und nun die wettervorhersage für morgen samstag den zwölften september
        (EN) And now the weather forecast for tomorrow Saturday the twelfth of September
Target: JETZT WETTER MORGEN SAMSTAG ZWOELF SEPTEMBER
        (EN) NOW WEATHER TOMORROW SATURDAY TWELVE SEPTEMBER

Syntax
  Epoch 5:   JETZT WETTER WETTER
             (EN) NOW WEATHER WEATHER
  Epoch 50:  JETZT WETTER WIE-AUSSEHEN MORGEN SAMSTAG FUENFTE MAI
             (EN) NOW WEATHER LOOK TOMORROW SATURDAY FIFTH MAY
  Epoch 100: JETZT WETTER WIE-AUSSEHEN MORGEN SAMSTAG ZWOELF SEPTEMBER
             (EN) NOW WEATHER LOOK TOMORROW SATURDAY TWELVE SEPTEMBER
  Epoch 150: JETZT WETTER WIE-AUSSEHEN MORGEN SAMSTAG ZWOELF SEPTEMBER
             (EN) NOW WEATHER LOOK TOMORROW SATURDAY TWELVE SEPTEMBER

No-syntax
  Epoch 5:   JETZT WETTER WIE WIE WIE-AUSSE...AUSSEAUSS
             (EN) NOW WEATHER HOW HOW AUSSE...AUSSEAUSS
  Epoch 50:  JETZT WETTER WIE-AUSSEHEN MORGEN SAMSTAG FUENFZEHN SEPTEMBER
             (EN) NOW WEATHER LOOK TOMORROW SATURDAY FIFTEEN SEPTEMBER
  Epoch 100: JETZT MORGEN WETTER WIE-AUSSEHEN SAMSTAG ZWOELF SEPTEMBER
             (EN) NOW TOMORROW WEATHER LOOK SATURDAY TWELVE SEPTEMBER
  Epoch 150: JETZT MORGEN WETTER WIE-AUSSEHEN SAMSTAG ZWOELF SEPTEMBER
             (EN) NOW TOMORROW WEATHER LOOK SATURDAY TWELVE SEPTEMBER

Example 2
Source: vom nordmeer zieht ein kräftiges tief heran und bringt uns ab den morgenstunden heftige schneefälle zum teil auch gefrierenden regen
        (EN) From the North Sea a strong low moves in and brings us heavy snowfalls from the morning hours, partly also freezing rain
Target: KRAEFTIG AB MORGEN FRUEH MEISTENS SCHNEE SCHNEIEN KALT REGEN
        (EN) STRONG FROM TOMORROW EARLY MOSTLY SNOW SNOW COLD RAIN

Syntax
  Epoch 5:   KOMMEN REGION KOMMEN
             (EN) COME REGION COME
  Epoch 50:  TIEF KOMMEN MORGEN KOMMEN REGEN KOMMEN REGEN KOMMEN
             (EN) DEEP COME TOMORROW COME RAIN COME RAIN COME
  Epoch 100: TIEF KOMMEN REGEN KOMMEN MITTE BERG KOMMEN
             (EN) DEEP COME RAIN COME MIDDLE MOUNTAIN COME
  Epoch 150: JETZT IN-KOMMEND TIEF KOMMEN REGEN KOMMEN MILD
             (EN) NOW IN-COMING DEEP COME RAIN COME MILD

No-syntax
  Epoch 5:   REGION KOMMEN REGION KOMMEN REGEN
             (EN) REGION COME REGION COME RAIN
  Epoch 50:  MORGEN KOMMEN TIEF KOMMEN REGEN KOMMEN REGEN KOMMEN REGEN KOMMEN REGEN KOMMEN
             (EN) TOMORROW COME DEEP COME RAIN COME RAIN COME RAIN COME RAIN COME
  Epoch 100: MORGEN REGEN TIEF KOMMEN REGION KOMMEN REGEN KOENNEN SCHNEE REGEN GEFRIEREN GLATT GEFAHR GLATT GEFAHR
             (EN) TOMORROW RAIN DEEP COME REGION COME RAIN CAN SNOW RAIN FREEZE SMOOTH DANGER SMOOTH DANGER
  Epoch 150: MORGEN MEISTENS SCHNEE REGEN GLATT REGION KOMMEN REGEN GEFAHR GLATT REGEN GEFAHR GLATT REGEN GEFAHR
             (EN) TOMORROW MOSTLY SNOW RAIN SMOOTH REGION COME RAIN DANGER SMOOTH RAIN DANGER SMOOTH RAIN DANGER

References

Stefanie Albert, Jan Anderssen, Regine Bader, Stephanie Becker, Tobias Bracht, Sabine Brants, Thorsten Brants, Vera Demberg, Stefanie Dipper, and Peter Eisenberg. 2003. TIGER Annotationsschema. Universität des Saarlandes, Universität Stuttgart and Universität Potsdam, pages 1–148.
Inês Almeida, Luísa Coheur, and Sara Candeias. 2015. From European Portuguese to Portuguese Sign Language. In Proceedings of SLPAT 2015: 6th Workshop on Speech and Language Processing for Assistive Technologies, pages 140–143, Dresden, Germany. Association for Computational Linguistics.

Jordi Armengol Estapé and Marta Ruiz Costa-Jussà. 2021. Semantic and syntactic information for neural machine translation: Injecting features to the transformer. Machine Translation, 35(3):3–17.

Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72, Ann Arbor, Michigan. Association for Computational Linguistics.

Larwan Berke, Sushant Kafle, and Matt Huenerfauth. 2018. Methods for evaluation of imperfect captioning tools by deaf or hard-of-hearing users at different reading literacy levels. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, CHI '18, pages 1–12, New York, NY, USA. Association for Computing Machinery.

Emanuel Borges Völker, Maximilian Wendt, Felix Hennig, and Arne Köhn. 2019. HDT-UD: A very large Universal Dependencies treebank for German. In Proceedings of the Third Workshop on Universal Dependencies (UDW, SyntaxFest 2019), pages 46–57, Paris, France. Association for Computational Linguistics.

Sabine Brants, Stefanie Dipper, Peter Eisenberg, Silvia Hansen-Schirra, Esther König, Wolfgang Lezius, Christian Rohrer, George Smith, and Hans Uszkoreit. 2004. TIGER: Linguistic interpretation of a German corpus. Journal of Language and Computation, 2:597–620.

Carmen Cabeza, José María García-Miguel, Carmen García-Mateo, and Jose Luis Alba-Castro. 2016. CORILSE: a Spanish Sign Language repository for linguistic analysis. In Proceedings of the Language Resources and Evaluation Conference, Portorož (Slovenia), pages 23–28.

Necati Camgoz, Simon Hadfield, Oscar Koller, Hermann Ney, and Richard Bowden. 2018. Neural sign language translation. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7784–7793.

Necati Cihan Camgoz, Oscar Koller, Simon Hadfield, and Richard Bowden. 2020. Sign language transformers: Joint end-to-end sign language recognition and translation. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10020–10030.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

David M. Eberhard, Gary F. Simons, and Charles D. Fennig. 2021. Ethnologue: Languages of the World, twenty-fourth edition. SIL International, Dallas, TX, USA.

Thomas Hanke. 2004. HamNoSys: Representing sign language data in language resources and language processing contexts. In LREC 2004, Workshop proceedings: Representation and processing of sign languages, pages 1–6, Paris, France.

Hany Hassan, Anthony Aue, C. Chen, Vishal Chowdhary, J. Clark, C. Federmann, Xuedong Huang, Marcin Junczys-Dowmunt, W. Lewis, M. Li, Shujie Liu, T. Liu, Renqian Luo, Arul Menezes, Tao Qin, F. Seide, Xu Tan, Fei Tian, Lijun Wu, Shuangzhi Wu, Yingce Xia, Dongdong Zhang, Zhirui Zhang, and M. Zhou. 2018. Achieving human parity on automatic Chinese to English news translation. ArXiv, abs/1803.05567:1–25.

Tommi Jantunen, Rebekah Rousi, Päivi Raino, Markku Turunen, Mohammad Valipoor, and Narciso García. 2021. Is There Any Hope for Developing Automated Translation Technology for Sign Languages?, pages 61–73.

Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2017. Google's multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the Association for Computational Linguistics, 5:339–351.

Philipp Koehn. 2009. Statistical Machine Translation. Cambridge University Press.

Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66–71, Brussels, Belgium. Association for Computational Linguistics.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880, Online. Association for Computational Linguistics.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.

Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, and Luke Zettlemoyer. 2020. Multilingual denoising pre-training for neural machine translation. CoRR, abs/2001.08210:1–17.

V. López-Ludeña, C. González-Morcillo, J.C. López, R. Barra-Chicote, R. Cordoba, and R. San-Segundo. 2014. Translating bus information into sign language for deaf people. Engineering Applications of Artificial Intelligence, 32:258–269.

Amit Moryossef, Kayo Yin, Graham Neubig, and Yoav Goldberg. 2021. Data augmentation for sign language gloss translation. CoRR, abs/2105.07476:1–7.

Medet Mukushev, Arman Sabyrov, Alfarabi Imashev, Kenessary Koishybay, Vadim Kimmelman, and Anara Sandygulova. 2020. Evaluation of manual and non-manual components for sign language recognition. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 6073–6078, Marseille, France. European Language Resources Association.

Fabrizio Nunnari, Cristina España-Bonet, and Eleftherios Avramidis. 2021. A data augmentation approach for sign-language-to-text translation in-the-wild. In Proceedings of the 3rd Conference on Language, Data and Knowledge (LDK 2021), September 1–3, Zaragoza, Spain, volume 93 of OpenAccess Series in Informatics (OASIcs). Dagstuhl Publishing.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.

Jordi Porta, Fernando López-Colino, Javier Tejedor, and José Colás. 2014. A rule-based translation from written Spanish to Spanish Sign Language glosses. Computer Speech & Language, 28:788–811.

Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191, Brussels, Belgium. Association for Computational Linguistics.

H. Saggion, D. Shterionov, G. Labaka, T. Van de Cruys, V. Vandeghinste, and J. Blat. 2021. SignON: Bridging the gap between Sign and Spoken Languages. In Proceedings of the 37th Conference of the Spanish Society for Natural Language Processing, Málaga, Spain (held online). SEPLN.

Rubén San-Segundo, Verónica López, Raquel Martín, David Sánchez, and Adolfo García. 2010. Language resources for Spanish - Spanish Sign Language (LSE) translation. In Proceedings of the 4th Workshop on the Representation and Processing of Sign Languages: Corpora and Sign Languages Technologies, pages 208–211.

Rico Sennrich and Barry Haddow. 2016. Linguistic input features improve neural machine translation. In Proceedings of the First Conference on Machine Translation: Volume 1, Research Papers, pages 83–91, Berlin, Germany. Association for Computational Linguistics.

Matthew Snover, Bonnie Dorr, Rich Schwartz, Linnea Micciulla, and John Makhoul. 2006. A study of translation edit rate with targeted human annotation. In Proceedings of the 7th Conference of the Association for Machine Translation in the Americas, pages 223–231.

Richard Socher, John Bauer, Christopher D. Manning, and Andrew Y. Ng. 2013. Parsing with compositional vector grammars. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 455–465, Sofia, Bulgaria. Association for Computational Linguistics.

Stephanie Stoll, Necati Cihan Camgöz, Simon Hadfield, and Richard Bowden. 2020. Text2Sign: Towards sign language production using neural machine translation and generative adversarial networks. Int. J. Comput. Vis., 128(4):891–908.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS'17, pages 6000–6010.

Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, and Colin Raffel. 2021. mT5: A massively multilingual pre-trained text-to-text transformer. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 483–498, Online. Association for Computational Linguistics.

Kayo Yin and Jesse Read. 2020. Better sign language translation with STMC-transformer. In Proceedings of the 28th International Conference on Computational Linguistics, pages 5975–5989, Barcelona, Spain (Online). International Committee on Computational Linguistics.

I. Zwitserlood, M. Verlinden, J. Ros, and Sanny van der Schoot. 2004. Synthetic signing for the deaf: eSIGN.