Visual Pivoting Unsupervised Multimodal Machine Translation in
Low-Resource Distant Language Pairs
Turghun Tayir1, Lin Li1, Xiaohui Tao2, Mieradilijiang Maimaiti3
Ming Li1, Jianquan Liu4
1Wuhan University of Technology, 2University of Southern Queensland
3Chinese Academy of Sciences, 4NEC Corporation
1{hotpes, cathylilin, liming7677}@whut.edu.cn, 2xiaohui.tao@usq.edu.au
3miradel_51@hotmail.com, 4jqliu@nec.com
Abstract
Unsupervised multimodal machine translation
(UMMT) aims to leverage vision information
as a pivot between two languages to achieve
better performance on low-resource language
pairs. However, there is presently a challenge:
how to handle alignment between distant lan-
guage pairs (DLPs) in UMMT. To this end,
this paper proposes a visual pivoting UMMT
method for DLPs. Specifically, we first con-
struct a dataset containing two DLPs, includ-
ing English-Uyghur and Chinese-Uyghur. We
then apply the visual pivoting method to both
pre-training and fine-tuning, and we ob-
serve that the images on the encoder and de-
coder of UMMT have noticeable effects on
DLPs. Finally, we introduce informative multi-
granularity image features to facilitate further
alignment of the latent space between the two
languages. Experimental results show that the
proposed method significantly outperforms sev-
eral baselines on DLPs and close language
pairs (CLPs). Our dataset Multi30k-Distant
and code are available at:
https://github.com/WUT-IDEA/VP-UMMT.
1 Introduction
Neural machine translation (MT) (Sutskever et al.,
2014; Cho et al., 2014; Vaswani et al., 2017) has be-
come a promising approach to MT, but it depends
on the availability of large-scale parallel corpora.
However, preparing such corpora for
low-resource languages is extremely challenging,
and existing studies (Zoph et al., 2016) have shown
that neural MT achieves much worse translation
quality than statistical MT when only a small
amount of data is available. Therefore, developing methods to allevi-
ate the need for annotation of large parallel corpora
has attracted increasing attention from researchers.
To alleviate this problem, unsupervised
MT (Lample et al.,2018;Artetxe et al.,2018)
has been proposed, which relies on monolingual
corpora and trains an MT model in an unsupervised
manner. Since the alignment between source and
target sentences in unsupervised MT is uncertain,
it is highly sensitive to initialization. Therefore,
studies (Su et al., 2019; Huang et al., 2020; Li
et al.,2023b) have found that exploiting visual
content for unsupervised MT while leveraging
a language model pre-trained on large-scale
monolingual data is a feasible way to improve
translation quality. Visual content is qualified to
improve alignment in the latent space of language
because the physical visual perception of people
who speak different languages is similar. However,
previous works (Su et al.,2019;Huang et al.,
2020;Li et al.,2023b,a;Huang et al.,2021)
mainly consider high-resource CLPs, such as
English-German and English-French. However,
unsupervised MT is intended to achieve high-quality
translation for low-resource languages, and studying it
solely on high-resource CLPs makes it challenging
to assess its effectiveness in low-resource languages,
diminishing its applicability and hindering the
advancement of UMMT. In DLPs, even initialization
with a monolingual pre-trained model does not
yield significant improvements, as unsupervised
MT performs well when the monolingual data
in both languages belong to the same language
family (Marchisio et al.,2020). Therefore, the
UMMT task needs to be extended to translation
between low-resource DLPs, which is beneficial
for a more comprehensive exploration of the
influence of linguistic distance and the contribution
of visual content.
To address these challenges, we propose a visual
pivoting UMMT approach for DLPs. Specifically,
we first manually translate the mainstream multi-
modal MT dataset Multi30k (Elliott et al.,2016),
which primarily contains high-resource CLPs, into
Chinese and Uyghur. Both Chinese and Uyghur
belong to different language families from the lan-
guages of the Multi30k. Even their scripts and
Close language pair (En-De):
En: A baseball player is fielding a ball.
De: Ein Baseballspieler spielt den Ball.
Distant language pairs (En-Uy, Zh-Uy): the Uyghur and
Chinese captions of the same image (shown in their native
scripts in the original figure).
Figure 1: Simple examples of a CLP from Multi30k and
DLPs from our dataset. Each example consists of an image and its
descriptions in four languages: English (En), German
(De), Uyghur (Uy), and Chinese (Zh). Words with the
same color have the same meaning across languages.
grammar structures are different, as shown in Fig-
ure 1; for example, Uyghur is a subject-object-
verb language, while English and Chinese are
subject-verb-object languages. Moreover, Uyghur
is a low-resource language, hence the main data
studied in this paper are the low-resource DLPs
composed of English-Uyghur and Chinese-Uyghur
sentences. We then extend MLM (Conneau and
Lample,2019) by leveraging visual information to
generate a visual pre-training language modeling
(VPLM) model, which is subsequently applied to
initialize a UMMT model. Finally, we use images
as a pivot to semantically align source-target lan-
guages into a shared latent space. Specifically, the
image is introduced into the encoder to correct the
pseudo-sentence, while the input image in the de-
coder is treated as a pivot between the source and
target languages. We conducted experiments on
DLPs and CLPs, and the results show that the proposed
model consistently outperforms several baselines.
Overall, we make the following contributions: (i)
We construct a dataset with DLPs and implement
UMMT on it, which provides a bench-
mark for further research on this challenging task.
(ii) We find that visual content is particularly effective at
improving the latent-space alignment of DLPs. (iii)
The experimental results show that in unsupervised
MT between gendered and gender-neutral languages,
images contribute to improving gender accuracy.
2 Related Work
2.1 Multimodal Machine Translation Datasets
Existing commonly employed multimodal MT
datasets include Multi30k (Elliott et al., 2016),
IKEA, IAPR TC-12 (Grubinger et al., 2006) and
MS-COCO (Lin et al., 2014),
and these datasets all focus on high-resource
CLPs such as English and German. The IKEA
and IAPR TC-12 datasets contain relatively few images and
description sentences. The Multi30k dataset is not only
immediately applicable to research on a wide range
of tasks, it is also collected from a wider range of domains.
Moreover, Multi30k is the most commonly used
dataset, containing 31k images with high-quality
descriptions of everyday scenes. Therefore, we have manually translated
it into Chinese and Uyghur and generated a low-
resource DLP dataset. The MS-COCO dataset con-
tains 164k images, each with five different English
descriptive sentences. We automatically translate the
English sentences of MS-COCO into Chinese and
Uyghur for the pre-training dataset.
2.2 Unsupervised Multimodal Machine
Translation
While supervised MT relies on bilingual parallel
corpora (Cho et al.,2014;Bahdanau et al.,2015),
this approach often fails to effectively utilize mono-
lingual corpora. To address this limitation, some
recent studies (Lample et al.,2018;Artetxe et al.,
2018) have proposed unsupervised MT that lever-
ages monolingual corpora. However, the lack of
target language supervision information poses a
challenge, making it difficult for unsupervised MT
to achieve the same high-quality translation as su-
pervised MT. Therefore, improving model perfor-
mance by incorporating visual information into
unsupervised MT has gained significant attention
from researchers (Su et al.,2019;Wang et al.,2021;
Huang et al.,2021;Li et al.,2023a).
UMMT investigates the possibility of using im-
age disambiguation and improving unsupervised
MT. Its core assumption, intuitively based on the
immutability of images, suggests that descriptions
of the same visual content in different languages
should remain largely similar. However, existing
research has primarily focused on high-resource
CLPs, limiting the practical application of UMMT.
To address this limitation and investigate UMMT
in the context of DLPs, we construct a dataset con-
taining DLPs. Furthermore, we incorporate image
information into both pre-training and fine-tuning
to improve translation performance.
3 Our Dataset
3.1 Distant Language Pairs
Language pairs are generally divided into CLPs
and DLPs (Sun et al., 2021). Language similarity
is determined by whether two languages belong
to the same language family, whether they share
words, whether their sentences follow the same word order, and
so on. Most languages belong to different language
families, and many of them suffer from a lack of
resources. As shown in Figure 1, there are clear
gaps within DLPs such as English and Uyghur:
they are written in different directions, and their
scripts and word order differ. The same holds
for Chinese-Uyghur. Moreover, Uyghur is a
low-resource language; thus, English-Uyghur and
Chinese-Uyghur constitute low-resource DLPs.

Table 1: Corpus-level statistics of Multi30k-Distant.

Splits            Sentences   Uyghur             Chinese            English
                              Tokens   Avg-len   Tokens   Avg-len   Tokens   Avg-len
Train             29,000      343,342  11.83     391,903  13.51     357,172  11.9
Validation        1,014       12,077   11.91     13,855   13.66     13,308   13.1
Test (Test2016)   1,000       11,834   11.83     13,566   13.57     12,968   13.0
3.2 Data Collection
Multimodal MT mainstream corpus Multi30k (El-
liott et al.,2016) contains 31k images and their
descriptions in CLPs, e.g. English-German. To
study UMMT on DLPs, we manually translate En-
glish sentences from Multi30k into Chinese and
Uyghur. For Chinese, three native Chinese speak-
ers on our team with good English skills, all of them
master's students, are involved in the translation. For
Uyghur, three native speakers with good English
skills, all with bachelor’s degrees, participate in the
translation. During the translation, the translator
can access both the image and the English sentence,
which facilitates the correct translation according
to the image. To ensure the quality of translation,
each translation sentence is further reviewed by
another translator. It took about three months to
complete the translation work. Statistics about our
dataset Multi30k-Distant are shown in Table 1.
4 Method
In this section, we first detail the visual pivoting for
UMMT and multimodal alignment, and then intro-
duce the UMMT model and the training strategy.
4.1 Visual Pivoting for UMMT
Unsupervised MT assumes the availability of a
monolingual corpus during training. It defines the
input $T = [t_1, \dots, t_l]$ as an $l$-length sentence. Our
model extends unsupervised MT by adding visual
features $Z = [z_1, \dots, z_j]$, where $j$ is the number of
the most confident regions of an image. As shown
in Figure 2, the image is provided to the model in two ways,
at the encoder input and at the encoder output.
4.1.1 Encoder Input
We assume the availability of sentence-image
pairs and redefine the input as:

M = [t_1, \dots, t_l, z_1, \dots, z_j]    (1)

As shown in Figure 2, each input to the encoder
consists of a sentence and its corresponding image
features. Specifically, for the source, the input is
a concatenation of the source language sentence
and its corresponding image features, denoted as
$M_x = [x_1, \dots, x_n, z_{x_1}, \dots, z_{x_j}]$. Similarly, for the
target, the input is a concatenation of the target
language sentence and its corresponding image features,
denoted as $M_y = [y_1, \dots, y_m, z_{y_1}, \dots, z_{y_j}]$.
Here, $x$ and $y$ ($\{x\} \cap \{y\} = \varnothing$) denote the source
and target sentences, and $z_x$ and $z_y$ ($\{z_x\} \cap \{z_y\} = \varnothing$)
denote their corresponding image features.
In this approach, text and image are concatenated
and fed into the encoder. The image supplements
information missing from the pre-trained model, allow-
ing the model to learn to represent complete sen-
tences accurately. In translation, the image promotes
the alignment of words between the two languages
by calibrating the incomplete pseudo-sentences.
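To make the concatenated input of Eq. (1) concrete, the following is a minimal PyTorch sketch (not the authors' code); the class name, the embedding sizes, and the linear projection of region features into the model dimension are our own assumptions.

```python
import torch
import torch.nn as nn


class MultimodalEncoderInput(nn.Module):
    """Builds M = [t_1..t_l, z_1..z_j] from Eq. (1): word embeddings
    followed by linearly projected image region features."""

    def __init__(self, vocab_size: int, d_model: int = 512, d_region: int = 1536):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, d_model)
        # Project Faster R-CNN region features into the model dimension.
        self.img_proj = nn.Linear(d_region, d_model)

    def forward(self, token_ids: torch.Tensor, regions: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, l); regions: (batch, j, d_region)
        text = self.word_emb(token_ids)           # (batch, l, d_model)
        image = self.img_proj(regions)            # (batch, j, d_model)
        return torch.cat([text, image], dim=1)    # (batch, l + j, d_model)


if __name__ == "__main__":
    module = MultimodalEncoderInput(vocab_size=10000)
    tokens = torch.randint(0, 10000, (2, 12))     # two sentences of length 12
    regions = torch.randn(2, 36, 1536)            # 36 region features per image
    print(module(tokens, regions).shape)          # torch.Size([2, 48, 512])
```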
4.1.2 Encoder Output
In this method, text and image features are fused
by an attention-gate structure (AGS). The encoded
sequence $E$ and the image features $Z$ are first combined
by attention. Then, a gate is used to further integrate
$E$ and the attention output $H$; this can be written as:

H = \mathrm{Softmax}\left( E Z^{\top} / \sqrt{d} \right) Z    (2)

g = \mathrm{Sigmoid}(W_e E + W_h H)    (3)

H_f = (1 - g) \cdot E + g \cdot H    (4)
where $W_e$ and $W_h$ are trainable matrices. In this
approach, images enable the model to avoid los-
ing information during the encoding process, thus
improving the prediction of the VPLM. In translation,
the images serve as an approximate pivot that connects
the non-parallel sentences and thus improves the
quality of the translation.

Figure 2: The framework of our multimodal fusion model. The figure shows a single
sentence, "A batter is getting ready to hit the ball coming at him.", and its corresponding region feature inputs.
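The AGS fusion of Eqs. (2)-(4) can be sketched in PyTorch as follows. This is our own illustrative reading, assuming Eq. (2) is standard scaled dot-product attention of the encoded sequence over the image features; the module name and tensor shapes are assumptions.

```python
import math
import torch
import torch.nn as nn


class AttentionGateStructure(nn.Module):
    """Sketch of the attention-gate structure (AGS), Eqs. (2)-(4): attention of
    the encoded sequence E over image features Z, then a sigmoid gate that
    mixes E with the attention output H."""

    def __init__(self, d_model: int = 512):
        super().__init__()
        self.w_e = nn.Linear(d_model, d_model, bias=False)  # W_e in Eq. (3)
        self.w_h = nn.Linear(d_model, d_model, bias=False)  # W_h in Eq. (3)

    def forward(self, E: torch.Tensor, Z: torch.Tensor) -> torch.Tensor:
        # E: (batch, L, d) encoded sequence; Z: (batch, j, d) projected regions
        attn = torch.softmax(E @ Z.transpose(1, 2) / math.sqrt(E.size(-1)), dim=-1)
        H = attn @ Z                                   # Eq. (2)
        g = torch.sigmoid(self.w_e(E) + self.w_h(H))   # Eq. (3)
        return (1.0 - g) * E + g * H                   # Eq. (4): H_f
```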
4.2 Multimodal Alignment
We employ contrastive learning (Sohn,2016) in
cross-modal retrieval to align inputs in shared
multilingual semantic space, where inputs are
close when they are semantically related or paired.
Specifically, we take the encoder out-
put $E$ and the attention (Eq. (2)) output $H$. Fine-
grained alignment is then measured by the cosine
similarity $s(e_k, h_k)$ between the $k$-th tokens of
$E$ and $H$, where $e_k, h_k \in \mathbb{R}^d$. Finally, to bring the
visual and textual modalities closer, we use noise
contrastive estimation (van den Oord et al., 2018).
\mathcal{L}^{eh}_{CL} = -\frac{1}{K} \sum_{k=1}^{K} \log \frac{\exp(s(e_k, h_k))}{\sum_{l=1}^{K} \exp(s(e_k, h_l))}

\mathcal{L}^{he}_{CL} = -\frac{1}{K} \sum_{k=1}^{K} \log \frac{\exp(s(h_k, e_k))}{\sum_{l=1}^{K} \exp(s(h_k, e_l))}

\mathcal{L}_{CL} = \frac{1}{2} \left( \mathcal{L}^{eh}_{CL} + \mathcal{L}^{he}_{CL} \right)    (5)
where $K$ is the sum of the sentence length and the
number of regions in the image. We also utilize a
mean squared error (MSE) loss to further mini-
mize the distance between $e_k$ and $h_k$:

\mathcal{L}_{MSE} = \frac{1}{2K} \sum_{k=1}^{K} \| e_k - h_k \|_2^2    (6)
Finally, the multimodal alignment loss function is:

\mathcal{L}_{MA} = \mathcal{L}_{CL} + \lambda_1 \mathcal{L}_{MSE}    (7)

where the hyper-parameter $\lambda_1$ is set to 1.
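A compact sketch of the alignment objective, Eqs. (5)-(7), is given below for a single example; the cosine similarity is realized by normalizing both sequences, and the temperature `tau` (set to 1, which reduces to the equations above) is the only added knob.

```python
import torch
import torch.nn.functional as F


def multimodal_alignment_loss(E: torch.Tensor, H: torch.Tensor,
                              lambda_1: float = 1.0, tau: float = 1.0) -> torch.Tensor:
    """Sketch of Eqs. (5)-(7): symmetric token-level contrastive loss between
    the encoder output E and the attention output H, plus an MSE term.
    E, H: (K, d), where K = sentence length + number of image regions."""
    e = F.normalize(E, dim=-1)
    h = F.normalize(H, dim=-1)
    sim = e @ h.t() / tau                       # s(e_k, h_l) for all pairs
    targets = torch.arange(E.size(0))
    l_eh = F.cross_entropy(sim, targets)        # e_k -> h_k direction of Eq. (5)
    l_he = F.cross_entropy(sim.t(), targets)    # h_k -> e_k direction of Eq. (5)
    l_cl = 0.5 * (l_eh + l_he)                  # L_CL in Eq. (5)
    l_mse = 0.5 * ((E - H) ** 2).sum(dim=-1).mean()   # Eq. (6)
    return l_cl + lambda_1 * l_mse              # Eq. (7)
```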
4.3 Unsupervised Multimodal Machine
Translation
Our UMMT model consists of a multimodal de-
noising auto-encoding (MDA) and a multimodal
back-translation (MBT) model.
4.3.1 Multimodal Denoising Auto-encoding
MDA is extended by incorporating image features
into denoising auto-encoding (Vincent et al.,2008).
MDA is constructed by connecting the Transformer
decoder to the output of Figure 2. It aims to im-
prove the model learning ability by reconstructing
noisy sentences in the same language. We create it
separately for the unpaired source sentence $x$ and
the target sentence $y$. The process for $x$ is:

\mathrm{Dec}_x(\mathrm{Enc}_x(N(x), z_x), z_x) \rightarrow \hat{x}    (8)
where $N(\cdot)$ is the artificial noise function, which
includes random deletion, swapping, and blanking.
First, the noisy source sentence $N(x)$ and its cor-
responding image feature $z_x$ are fed into the
source language encoder $\mathrm{Enc}_x(\cdot)$. The encoded out-
put and the image are then fed into the source
language decoder $\mathrm{Dec}_x(\cdot)$, which produces the reconstructed
sentence $\hat{x}$ of $N(x)$. Finally, supervised
training is performed between $x$ and $\hat{x}$. The recon-
struction process on the target side is similar to that on
the source side. The total MDA loss over $x$ and $y$ is:

\mathcal{L}_{MDA} = \mathrm{CE}(\hat{x}, x) + \mathrm{CE}(\hat{y}, y)    (9)

where $\mathrm{CE}(\cdot, \cdot)$ denotes the cross-entropy loss.
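The noise function N(·) is only named in the text (random deletion, swapping, and blanking); the sketch below is one plausible implementation, and the probabilities and window size are assumptions, since the paper does not report them.

```python
import random
from typing import List


def add_noise(tokens: List[str], p_drop: float = 0.1, p_blank: float = 0.1,
              k_shuffle: int = 3, blank_token: str = "<blank>") -> List[str]:
    """One plausible N(.): random word deletion, blanking, and local swapping."""
    # Random deletion: drop each token with probability p_drop (keep at least one).
    out = [t for t in tokens if random.random() > p_drop] or tokens[:1]
    # Random blanking: replace some surviving tokens with a blank symbol.
    out = [blank_token if random.random() < p_blank else t for t in out]
    # Local swapping: permute tokens within a window of roughly k_shuffle positions.
    keys = [i + random.uniform(0, k_shuffle) for i in range(len(out))]
    return [t for _, t in sorted(zip(keys, out), key=lambda p: p[0])]


# Example: add_noise("a batter is getting ready to hit the ball".split())
```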
4.3.2 Multimodal Back-Translation
MDA's training inputs and outputs still involve only
one language, even though the goal of MT is to map input
sentences from the source/target language to the
target/source language. For cross-language train-
ing, we use MBT, which extends back-translation
(Sennrich et al., 2016a) by adding image features.
Figure 3: The MBT framework for a source sentence $x$; since the framework for the target language is similar, we show
only the source side. The encoder and decoder are Transformer modules. AGS denotes the fusion of text and
image through the attention-gate structure shown in Figure 2.
It explicitly guarantees that the model has
translation ability without paired sentences. As
shown in Figure 3, the image is input to both the
encoder and the decoder of MBT. The MBT is carried
out on the source sentence $x$ and the target sentence $y$
respectively; we analyze the source side in detail
here. As shown in Figure 3, given $x$ and its
corresponding image $z_x$, we first apply the source lan-
guage encoder $\mathrm{Enc}_x(\cdot)$ and the target language decoder
$\mathrm{Dec}_y(\cdot)$ trained in MDA to translate $x$ into the pseudo
target sentence $\tilde{y}_x$. $\mathrm{Enc}_x(\cdot)$ and $\mathrm{Dec}_y(\cdot)$ are frozen
and used only for inference:

\mathrm{Dec}_y(\mathrm{Enc}_x(x, z_x), z_x) \rightarrow \tilde{y}_x    (10)
$x$ is a high-quality input, and $z_x$ supplements
the information lost during encoding and decoding,
thereby improving $\tilde{y}_x$. $\tilde{y}_x$ and $z_x$ are then fed to
the target language encoder $\mathrm{Enc}_y(\cdot)$, and the source
language decoder $\mathrm{Dec}_x(\cdot)$ translates $\tilde{y}_x$ into $\tilde{x}$:

\mathrm{Dec}_x(\mathrm{Enc}_y(\tilde{y}_x, z_x), z_x) \rightarrow \tilde{x}    (11)

The total process $(x, z_x) \rightarrow (\tilde{y}_x, z_x) \rightarrow \tilde{x}$ is:

\mathrm{Dec}_x(\mathrm{Enc}_y([\mathrm{Dec}_y(\mathrm{Enc}_x(x, z_x), z_x)], z_x), z_x) \rightarrow \tilde{x}    (12)
The pseudo-input $\tilde{y}_x$ is a corrupted version of the unknown
$y_x$, and noisy inputs result in degraded transla-
tion performance. Therefore, $z_x$ is introduced into
$\mathrm{Enc}_y(\cdot)$ to correct the pseudo-sentence and elim-
inate the noise, whereas the input $z_x$ in $\mathrm{Dec}_x(\cdot)$
is treated as a pivot between $y_x$ and $x$. Since the
two sentences are translations of each other, they
correspond to the same image.
Training on the target side is similar to the
source side; the target-side process is:

\mathrm{Dec}_y(\mathrm{Enc}_x([\mathrm{Dec}_x(\mathrm{Enc}_y(y, z_y), z_y)], z_y), z_y) \rightarrow \tilde{y}    (13)
The total MBT loss over $x$ and $y$ is:

\mathcal{L}_{MBT} = \mathrm{CE}(\tilde{x}, x) + \mathrm{CE}(\tilde{y}, y)    (14)
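One source-side MBT step, Eq. (12), can be sketched as below. The encoder/decoder interfaces (`enc_x(x, z_x)`, `dec_y.generate(...)`) and the `criterion` (e.g. `nn.CrossEntropyLoss()`) are hypothetical placeholders for the Transformer modules; only the data flow follows the equations.

```python
import torch


def mbt_source_step(x, z_x, enc_x, dec_y, enc_y, dec_x, criterion):
    """Sketch of Eq. (12): x is translated into a pseudo target sentence with the
    frozen Enc_x/Dec_y (inference only), and the trainable Enc_y/Dec_x then
    reconstruct x from (pseudo target, image). Returns CE(x~, x), half of Eq. (14)."""
    # 1) Inference pass, no gradients: (x, z_x) -> pseudo target sentence.
    with torch.no_grad():
        y_pseudo = dec_y.generate(enc_x(x, z_x), z_x)      # Eq. (10)
    # 2) Training pass: (y~_x, z_x) -> reconstruction logits, supervised by x.
    x_recon_logits = dec_x(enc_y(y_pseudo, z_x), z_x)      # Eq. (11)
    # Logits: (batch, length, vocab) -> (batch, vocab, length) for cross-entropy.
    return criterion(x_recon_logits.transpose(1, 2), x)
```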
4.4 Training Strategy
4.4.1 Pre-training
MLM defines the input $T = [t_1, \dots, t_l]$ as an $l$-length
sentence in the text corpus $\mathcal{T}$. Our pre-training
model VPLM extends MLM (Conneau and Lample,
2019) by adding image features $Z = [z_1, \dots, z_j]$ in
the image domain $\mathcal{Z}$, and aims to learn multimodal cross-
language representations. The framework of VPLM
is built by adding a prediction layer to the output
of Figure 2. Similar to MLM, 15% of the text tokens and
image regions are randomly selected for masking.
The objective function of VPLM is a combination
of the MLM loss and a masked region classification loss,
computed for the masked text $\bar{t}$ and regions $\bar{z}$ against the ground-
truth text targets $\check{t}$ and region labels $\check{z}$:

\mathcal{L}_{VPLM} = -\frac{1}{|\mathcal{T}|} \sum_{t \in \mathcal{T}} \log p(\check{t} \mid \bar{t}; \theta_p) - \frac{1}{|\mathcal{Z}|} \sum_{z \in \mathcal{Z}} \log p(\check{z} \mid \bar{z}; \theta_p)    (15)
where $\theta_p$ denotes the model parameters. VPLM is trained
jointly with multimodal alignment:

\mathcal{L}_{Pre} = \mathcal{L}_{VPLM} + \lambda_2 \mathcal{L}_{MA}    (16)

where the hyper-parameter $\lambda_2$ is set to 1.
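The masking step of VPLM can be sketched as follows; replacing masked region features with zeros and using a `[MASK]` id for text are assumptions, since the paper only states that 15% of tokens and regions are selected.

```python
import torch


def mask_for_vplm(token_ids: torch.Tensor, regions: torch.Tensor,
                  mask_id: int, p: float = 0.15):
    """Randomly select ~15% of text tokens and image regions for masking.
    Masked tokens are replaced by mask_id; masked region features are zeroed
    (an assumption). The model then predicts the original token ids and the
    region labels of the masked positions, as in Eq. (15)."""
    text_mask = torch.rand(token_ids.shape) < p          # which tokens to mask
    region_mask = torch.rand(regions.shape[:2]) < p      # which regions to mask

    masked_tokens = token_ids.masked_fill(text_mask, mask_id)
    masked_regions = regions * (~region_mask).unsqueeze(-1).float()
    return masked_tokens, masked_regions, text_mask, region_mask
```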
4.4.2 Fine-tuning
MDA and MBT are initialized with VPLM and
then fine-tuned; they are also trained jointly with mul-
timodal alignment:

\mathcal{L}_{Fin} = \mathcal{L}_{MDA} + \lambda_3 \mathcal{L}_{MBT} + \lambda_4 \mathcal{L}_{MA}    (17)

where the hyper-parameters $\lambda_3$ and $\lambda_4$ are set to
1. MDA and MBT are cycle-trained, and their
parameters are fully shared between them.
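A sketch of one fine-tuning epoch under Eq. (17) is given below; the `model.*_loss` interface is hypothetical shorthand for the shared-parameter MDA, MBT, and alignment branches described above.

```python
def fine_tune_epoch(batches, model, optimizer, lambda_3: float = 1.0, lambda_4: float = 1.0):
    """Cycle MDA and MBT on the same shared-parameter model and optimize
    them jointly with the multimodal alignment loss, Eq. (17)."""
    for x, z_x, y, z_y in batches:
        l_mda = model.mda_loss(x, z_x) + model.mda_loss(y, z_y)              # Eq. (9)
        l_mbt = model.mbt_loss(x, z_x) + model.mbt_loss(y, z_y)              # Eq. (14)
        l_ma = model.alignment_loss(x, z_x) + model.alignment_loss(y, z_y)   # Eq. (7)
        loss = l_mda + lambda_3 * l_mbt + lambda_4 * l_ma                    # Eq. (17)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```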
5 Experiments
5.1 Dataset and Preprocessing
Pre-training We use the training and validation
set of the MS-COCO (Lin et al.,2014) dataset.
To construct the monolingual data, this dataset is
randomly split into two disjoint subsets. Each
set contains 64,542 images and five English de-
scriptive sentences for each image. Then we ap-
ply
Lingvanex1
translator to translate English sen-
tences into German, Chinese, and Uyghur.
Fine-tuning We performed experiments on both
the Multi30k and our dataset Multi30k-Distant,
which contain 29K training, 1K validation, and 1K
(Test2016) test samples. To ensure that the model
does not learn from parallel sentences, as with the
pre-training data, the training set of each language is
randomly divided in two, producing two non-parallel
corpora of 14,500 training samples each.
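The non-parallel split can be illustrated with the small sketch below (our own illustration): each half of the shuffled training set keeps captions in only one language, so no sentence pair is ever observed in parallel.

```python
import random


def make_nonparallel_splits(pairs, seed: int = 0):
    """pairs: list of (image, source_sentence, target_sentence) triples.
    Returns two disjoint monolingual halves (14.5k samples each for Multi30k):
    one with (image, source sentence), the other with (image, target sentence)."""
    rng = random.Random(seed)
    idx = list(range(len(pairs)))
    rng.shuffle(idx)
    half = len(idx) // 2
    src_half = [(pairs[i][0], pairs[i][1]) for i in idx[:half]]
    tgt_half = [(pairs[i][0], pairs[i][2]) for i in idx[half:]]
    return src_half, tgt_half
```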
Preprocessing For Chinese, we use the tok-
enizer of Chang et al. (2008). For all other lan-
guages,
the Moses (Koehn et al., 2007) toolkit is used
for lowercasing, punctuation normalization, and tok-
enization of all sentences. We use Byte Pair Encoding
(BPE) (Sennrich et al., 2016b) and fastBPE2
to learn the BPE codes and split words into sub-
word units. Our Uyghur data is written in the Arabic
script, as shown in Figure 1. We transliterate it
into Latin script using a transliterator3. This
operation allows some common bytes to be shared
between English and Uyghur.
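The Moses-style preprocessing can be sketched with the sacremoses re-implementation as a stand-in for the Moses scripts actually used; BPE segmentation with fastBPE and the Uyghur Latin transliteration would be applied afterwards and are not shown.

```python
from sacremoses import MosesPunctNormalizer, MosesTokenizer


def preprocess(sentence: str, lang: str = "en") -> str:
    """Punctuation normalization, tokenization, and lowercasing, as described above."""
    normalizer = MosesPunctNormalizer(lang=lang)
    tokenizer = MosesTokenizer(lang=lang)
    return tokenizer.tokenize(normalizer.normalize(sentence), return_str=True).lower()


# preprocess("A baseball player is fielding a ball.")
# -> "a baseball player is fielding a ball ."
```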
5.2 Experimental Setup
Model and feed-forward dimensions are
set to 512 and 2,048, respectively. The Adam (Kingma and Ba,
2015) optimizer with a learning rate of $1 \times 10^{-4}$
is used for optimization. Experiments are imple-
mented on a machine with a single 12GB TITAN
Xp GPU. For image features, we follow Caglayan
et al. (2021), using Faster R-CNN (Ren et al., 2015)
models to extract features $[z_1, \dots, z_j]$, where the
number of regions $j$ is set to 36 and $z_i \in \mathbb{R}^{1536}$.
5.3 Baselines and Evaluation Metrics
Baselines To verify the performance of our model,
we mainly compare with the following existing models: (1)
XLM (Conneau and Lample,2019) is a monolin-
gual text-only unsupervised MT based on MLM.
(2) UMNMT (Su et al.,2019) is established on
multimodal monolingual data by two training paths,
such as auto-encoding loss and cycle-consistency
loss. (3) M-Transformer (Huang et al.,2021)
1https://lingvanex.com/translate
2https://github.com/glample/fastBPE
3https://cis.temple.edu/anwar/code/Latin2Uyghur.html
uses additional visual modalities to recover sen-
tences that have previously masked some words.
(4) IVTA (Li et al.,2023a) is semi-supervised MT
and includes both unsupervised and supervised
training components. (5) VUMMT (Tayir and
Li,2024) is the most recent work, which stud-
ies the effect of practice measures for UMMT. (6)
Game-MMT (Chen et al.,2018) is a reinforce-
ment learning-based UMMT. (7) Knwl. (Huang
et al.,2023) introduces knowledge entities as an
additional modality to enhance the representation.
Since Game-MMT, Progressive, and Knwl. only
translate between English, German, and
French, and they rely on pre-trained
models and knowledge entities that exist only for
those languages, they cannot be repro-
duced on our dataset.
Evaluation metrics We apply MT metrics such
as RIBES (Isozaki et al.,2010), BLEU (Papineni
et al.,2002), METEOR (Lavie and Agarwal,2007)
and TER (Snover et al.,2006) to evaluate transla-
tion quality. RIBES is mainly utilized to evaluate
the translation quality between DLPs, while ME-
TEOR is not supported for Uyghur and Chinese.
Moreover, Uyghur is a gender-neutral language
(e.g. “he”, “she” and “it” are all translated into
“u”), whereas others are gendered languages. Al-
most 14% of the sentences in the fine-tuned data
contain gender pronouns, which affects the trans-
lation between Uyghur and other languages. This
paper argues that the image provides information
to correct the gender accuracy of the translation.
Therefore, we introduce gender accuracy as an addi-
tional evaluation metric. We score the correctness
of each gender pronoun by comparing the pronoun in
the translation with the one in its reference sentence
(wrong: 0, correct: 1). The gender accuracy is then
obtained by dividing the total score over the test set by
the number of gender pronouns in the reference set.
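The gender-accuracy computation described above can be sketched as follows; the paper scores pronouns manually, so the automatic token-matching rule and the pronoun list here are our simplifying assumptions.

```python
def gender_accuracy(hypotheses, references, pronouns=("he", "she", "it")):
    """For each gender pronoun in a reference sentence, score 1 if it also appears
    in the hypothesis and 0 otherwise; divide the total score by the number of
    gender pronouns in the reference set."""
    correct, total = 0, 0
    for hyp, ref in zip(hypotheses, references):
        hyp_tokens = set(hyp.lower().split())
        for tok in ref.lower().split():
            if tok in pronouns:
                total += 1
                correct += int(tok in hyp_tokens)
    return correct / total if total else 0.0
```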
5.4 Comparison with the Baselines
5.4.1 Results on Distant Language Pairs
In Table 2, the baselines are reproduced using our
datasets. UMNMT is a multimodal model with
monolingual pre-training. IVTA is a multimodal
semi-supervised model that uses 300 parallel sentence pairs
and is trained from scratch.
Text-only models The results of XLM show that
the translation quality of the model deteriorates sig-
nificantly when using text-only data. Although
XLM is initialized with a pre-trained model trained
Table 2: Results for DLP translation. Since Uyghur and Chinese are not supported by METEOR, METEOR is not
reported here. The TER score is inversely proportional to the quality of the translated sentence.

                  En→Uy                  Uy→En                  Zh→Uy                  Uy→Zh
                  RIBES  BLEU  TER       RIBES  BLEU  TER       RIBES  BLEU  TER       RIBES  BLEU  TER
XLM (text-only)   53.2   2.6   96.4      54.4   3.1   87.1      51.9   2.6   92.3      58.8   3.9   89.4
UMNMT             65.1   7.4   83.2      65.9   8.0   74.9      67.4   10.6  75.8      71.0   14.1  74.7
M-Transformer     70.4   11.5  76.1      70.2   11.3  75.6      70.7   17.2  71.0      73.7   21.2  68.8
IVTA              69.8   13.2  74.9      71.0   13.7  69.8      76.9   22.4  63.1      77.5   24.5  61.3
VUMMT             73.3   15.7  72.3      75.1   16.0  75.5      81.9   28.7  53.1      79.8   33.2  52.4
Ours              76.4   20.9  66.1      81.1   20.6  64.8      86.5   32.2  50.4      85.9   37.0  46.7
Table 3: Results for CLP translation. Following the
original papers, the comparison uses two metrics; missing values are marked "-".

                  En→De               De→En
                  BLEU   METEOR       BLEU   METEOR
XLM (text-only)   26.4   45.2         29.8   29.9
Game-MMT          16.6   -            19.6   -
UMNMT             23.5   26.1         26.4   29.7
M-Transformer     26.7   -            29.8   -
Knwl.             28.9   -            31.8   -
IVTA              22.9   39.7         25.5   29.2
VUMMT             29.4   48.8         33.2   32.5
Ours              30.7   50.1         34.4   33.4
on 322,710 monolingual sentences, it fails to trans-
late complete sentences. Compared to this, XLM
on CLPs provides quite satisfactory experimental
results, as shown in Table 3.
Multimodal models The performance of the
model is significantly improved when images are
introduced for fine-tuning (XLM vs. UMNMT,
M-Transformer). For example, for En→Uy, the
BLEU of UMNMT is 7.4, which is larger than that
of XLM, i.e., 2.6. Meanwhile, for Uy→Zh, their
BLEU gap is 10.2. IVTA achieves comparatively
better results, which indicates that a small number
of parallel multimodal corpora can significantly im-
prove the translation. Among the baseline models,
VUMMT yields the best results, while our model
benefits from the outstanding image introduction
method, resulting in a significant improvement.
5.4.2 Results on Close Language Pair
Table 3 shows the results from the original papers and our experi-
ments. Our model yields superior per-
formance, outperforming both the text-only and
multimodal baselines. Notably, its BLEU scores are
1.3 and 1.2 points higher than those of VUMMT.
Compared to DLPs, XLM yields better results on the CLP
because the relationship between the
two languages can be learned without images.
5.5 Effects of Image Pivoting
To measure the contribution of image pivoting, we
summarize the following two points.
5.5.1 Multimodal Inputs and Alignment
Compared to the text-only model (the first row),
the model with concatenated image features (Eq. (1))
shows a significant improvement; for example, the En→De
BLEU increases by 3.0 to 29.4, as shown in Table 4.
Both the concatenated image introduction (Eq. (1))
and the AGS image introduction (Eq. (2))
bring large improvements to DLPs,
while the CLP achieves its best result with Eq. (1) alone. When
these two methods are used together, the transla-
tion quality of DLPs continues to improve, while
that of the CLP decreases significantly. This suggests
that, to some extent, richer image information serves to bridge
the gap between DLPs, thus improving translation
performance between them. Moreover, the transla-
tion of both language pairs improves with the
addition of the multimodal alignment ($\mathcal{L}_{MA}$) method.
In terms of gender accuracy, the two image introduc-
tion methods and multimodal alignment gradually
improve gender recognition, which validates our
hypothesis that image fusion is conducive to the
correct translation of gender pronouns.
5.5.2 Images on Different Branch Models
As described in Section 4, our model consists of
three branch modules: VPLM, MDA, and MBT.
As shown in Table 5, this section explores the ef-
fects of images on each of them. Compared to the text-only
model (the first row of Table 5), the performance
of the fine-tuned translation model containing im-
ages is improved on both language pairs. Images
Table 4: Experimental results (BLEU) of different multimodal inputs and alignments. Gender Accuracy (introduced in
Section 5.3) is the average of the accuracy in both Uy→En and Uy→Zh. Eq(1): text and image concatenated as input;
Eq(2): text and image fused via AGS; LMA: multimodal alignment.

Configuration         En-Uy          Zh-Uy          En-De          Gender
                      →      ←       →      ←       →      ←       Accuracy
None (text-only)      2.6    3.1     2.6    3.9     26.4   29.8    18.2
Eq(1)                 15.7   16.0    28.7   33.2    29.4   33.2    59.8
Eq(2)                 15.3   15.6    28.1   32.9    28.0   32.6    58.2
Eq(1)+Eq(2)           19.3   19.2    31.4   35.5    28.1   32.3    65.5
Eq(2)+LMA             16.5   16.3    29.0   34.2    28.7   32.9    60.4
Eq(1)+Eq(2)+LMA       20.4   19.9    31.8   36.7    28.4   32.5    67.2
Table 5: Experimental results (BLEU) of images on different branch models. VPLM: visual pre-training language modeling;
MDA: multimodal denoising auto-encoding model; MBT: multimodal back-translation model.

Image in                  En-Uy          Zh-Uy          En-De
VPLM   MDA   MBT          →      ←       →      ←       →      ←
-      -     -            2.6    3.1     2.6    3.9     26.4   29.8
-      ✓     ✓            7.4    8.3     9.2    12.7    28.6   32.8
✓      -     -            9.3    10.8    26.6   30.9    28.2   31.6
✓      ✓     -            8.6    9.5     16.9   18.7    17.6   25.4
✓      -     ✓            15.4   16.2    28.4   33.4    28.8   32.9
✓      ✓     ✓            15.7   16.0    28.7   33.2    29.4   33.2
Table 6: Human evaluations on DLPs. Com., Amb.,
and Flu. stand for Completeness, Ambiguity, and Fluency,
respectively. Results are averaged over En↔Uy and Zh↔Uy.

                 Avg.    Human evaluations
                 BLEU    Com.   Amb.   Flu.
M-Transformer    15.6    4.3    7.2    4.6
IVTA             17.1    4.5    7.1    4.9
GPT-4            17.8    5.1    6.8    5.1
Ours             25.5    5.6    6.1    5.7
from pre-trained models provide a significant rein-
forcement to DLPs. Images have a positive effect
on all branches, and the best translation results are
achieved when they all are fused with the image.
However, introducing images into the MDA
without also using images for MBT impairs
the model's performance, and the effect of the image
in the MDA alone is not significant.
5.6 Human Evaluation
For each model, we randomly sampled 100 sen-
tences from its test translations and rated each sen-
tence on a scale from 0 to 10 according to its
quality. As listed in Table 6, our model achieves the
highest BLEU among the three manually evaluated
models. We also compare ours with the translation
results of the large language model GPT-4. Our
model's BLEU reaches 25.5, a 43.2% to 63.4%
improvement over GPT-4 and M-Transformer. In
terms of Flu., which measures translation cohesion
and fluency, our model also performs best among the
manually evaluated models.
5.7 Case Study
To further demonstrate the validity of our model,
we show the translation results generated by dif-
ferent models, as shown in Table 7. We can see
that our added regional visual information is more
helpful when the translated object encounters am-
biguity. XLM translates the source word “blue”
to “sëriq renglik” (yellow), and the model with the
added image translates correctly. It is interesting to
observe that our model, as defined in Eq.1, extracts
more information from images in complex scenes
and translates information that is not present in the
reference sentence but is present in the image, e.g.,
“öz’ara paranglishiwatidu” (talking to each other).
Moreover, we also compare ours with the trans-
lation results of GPT-4. The translation result indi-
cates that although the objects in the input sentence
are correctly translated, the relations between them
4https://openai.com/gpt-4
Table 7: Case study. Since the translation between Chinese and Uyghur is similar, those translation results are
not presented. Eq(1) denotes the model in the second row of Table 4.
SRC(En): a group of men in blue uniforms are standing together.
REF(Uy): bir top kök renglik forma kiygen erler bille turidu.
XLM(Text-only): bir top sëriq renglik kiyim kiygen.
Eq(1): bir top kök renglik kiyim kiygen erler öz’ara paranglishiwatidu.
Ours: bir top kök renglik forma kiygen erler bille turdi.
GPT-4: kök uniformadiki bir gurup er adamlar birge turdu.
Table 8: Supervised results (BLEU) on Multi30k-Distant.

                          En-Uy          Zh-Uy
                          →      ←       →      ←
Transformer (text-only)   40.4   36.0    61.9   61.2
Selective-attn            41.2   36.6    62.1   61.2
VTLM                      42.5   38.2    64.5   64.1
Ours                      44.8   39.8    65.3   64.9
are not. In addition, iFLYTEK's5 large model did
not produce a successful translation.
5.8 Supervised Case
While this paper primarily focuses on unsupervised
MT with images as pivots, we are also interested in
exploring supervised translation on our dataset. To
gain deeper insights, we conducted supervised MT
experiments by switching from back-translation
to a transformer-based framework with additional
image features. We benchmarked recent super-
vised MT models, including Transformer(text-
only) (Vaswani et al.,2017), Selective-attn (Li
et al.,2022), and VTLM (Caglayan et al.,2021).
VTLM and our model are both pre-trained and fine-
tuned on our dataset Multi30k-Distant.
The experimental results show that
our method performs best among the compared baselines.
Compared with VTLM, our fusion method is more
effective for supervised MT. It is noteworthy that
images provide only marginal improvements in super-
vised DLP translation.
6 Conclusion
In this work, we first create a dataset containing
two DLPs to investigate UMMT on low-resource
DLPs. We then found that cross-language align-
ment in shared latent spaces can be improved by
incorporating visual content in both pre-trained
and fine-tuned models. Compared to the baseline
5https://xinghuo.xfyun.cn/
model, our model has 5.2 and 4.6 BLEU score
improvements in English-Uyghur translation, and
3.5 and 3.8 BLEU score improvements in Chinese-
Uyghur translation. Moreover, the experimental
results show that images contribute to improving
gender accuracy in translation between gender and
gender-neutral languages.
7 Limitations
Although our proposed method achieves good trans-
lation results, it also has some limitations in dealing
with DLPs. As can be seen from Figure 4, incorpo-
rating more image features may hurt the accuracy
of a high-scoring translated sentence. More annotators
would be needed for our human evaluation, since
translation is subjective to some degree, and the differ-
ences among annotators could then be analyzed
in detail.
Acknowledgments
This work is partially supported by NSFC, China
(No.62276196). Thanks to all reviewers for their
comments.
References
Mikel Artetxe, Gorka Labaka, Eneko Agirre, and
Kyunghyun Cho. 2018. Unsupervised neural
machine translation. In International Conference
on Learning Representations, pages 1–12.
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua
Bengio. 2015. Neural machine translation by
jointly learning to align and translate. In Interna-
tional Conference on Learning Representations,
pages 1–15.
Ozan Caglayan, Menekse Kuyu, Mustafa Ser-
can Amac, Pranava Madhyastha, Erkut Erdem,
Aykut Erdem, and Lucia Specia. 2021. Cross-
lingual visual pre-training for multimodal ma-
chine translation. In Conference of the European
Chapter of the Association for Computational
Linguistics, pages 1317–1324.
Pi-Chuan Chang, Michel Galley, and Christo-
pher D. Manning. 2008. Optimizing chinese
word segmentation for machine translation per-
formance. In Proceedings of the Third Workshop
on Statistical Machine Translation, pages 224–
232.
Yun Chen, Yang Liu, and Victor O. K. Li. 2018.
Zero-resource neural machine translation with
multi-agent communication game. In Proceed-
ings of the Thirty-Second AAAI Conference on
Artificial Intelligence, (AAAI-18), the 30th in-
novative Applications of Artificial Intelligence
(IAAI-18), and the 8th AAAI Symposium on Ed-
ucational Advances in Artificial Intelligence,
pages 5086–5093.
Kyunghyun Cho, Bart van Merrienboer, Dzmitry
Bahdanau, and Yoshua Bengio. 2014. On
the properties of neural machine translation:
Encoder-decoder approaches. In Proceedings
of SSST@EMNLP 2014, Eighth Workshop on
Syntax, Semantics and Structure in Statistical
Translation, pages 103–111.
Alexis Conneau and Guillaume Lample. 2019.
Cross-lingual language model pretraining. In
Advances in Neural Information Processing Sys-
tems, pages 7057–7067.
Desmond Elliott, Stella Frank, Khalil Sima’an, and
Lucia Specia. 2016. Multi30k: Multilingual
english-german image descriptions. In Proceed-
ings of the 5th Workshop on Vision and Lan-
guage, pages 70–74.
Michael Grubinger, Paul Clough, Henning Müller,
and Thomas Deselaers. 2006. The iapr tc-12
benchmark: A new evaluation resource for vi-
sual information systems. In International work-
shop ontoImage, pages 1–11.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and
Jian Sun. 2016. Deep residual learning for image
recognition. In IEEE Conference on Computer
Vision and Pattern Recognition, pages 770–778.
Ping Huang, Shiliang Sun, and Hao Yang. 2021.
Image-assisted transformer in zero-resource
multi-modal translation. In International Confer-
ence on Acoustics, Speech and Signal Process-
ing, pages 7548–7552.
Ping Huang, Jing Zhao, Shiliang Sun, and Yichu
Lin. 2023. Knowledge enhanced zero-resource
machine translation using image-pivoting. Appl.
Intell., 53(7):7484–7496.
Po-Yao Huang, Junjie Hu, Xiaojun Chang, and
Alexander G. Hauptmann. 2020. Unsupervised
multimodal neural machine translation with
pseudo visual pivoting. In Proceedings of the
58th Annual Meeting of the Association for Com-
putational Linguistics, pages 8226–8237.
Hideki Isozaki, Katsuhito Sudoh, Hajime Tsukada,
and Kevin Duh. 2010. Head finalization: A sim-
ple reordering rule for SOV languages. In Pro-
ceedings of the Joint Fifth Workshop on Statis-
tical Machine Translation and MetricsMATR,
pages 244–251.
Diederik P. Kingma and Jimmy Ba. 2015. Adam: A
method for stochastic optimization. In International
Conference on Learning Representations, pages 1–15.
Philipp Koehn, Hieu Hoang, Alexandra Birch,
Chris Callison-Burch, Marcello Federico, Nicola
Bertoldi, Brooke Cowan, Wade Shen, Christine
Moran, Richard Zens, Chris Dyer, Ondrej Bojar,
Alexandra Constantin, and Evan Herbst. 2007.
Moses: Open source toolkit for statistical ma-
chine translation. In Proceedings of the 45th
Annual Meeting of the Association for Computa-
tional Linguistics, pages 177–180.
Guillaume Lample, Alexis Conneau, Ludovic De-
noyer, and Marc’Aurelio Ranzato. 2018. Unsu-
pervised machine translation using monolingual
corpora only. In 6th International Conference
on Learning Representations, pages 1–14.
Alon Lavie and Abhaya Agarwal. 2007. METEOR:
an automatic metric for MT evaluation with high
levels of correlation with human judgments. In
The Second Workshop on Statistical Machine
Translation, pages 228–231.
Bei Li, Chuanhao Lv, Zefan Zhou, Tao Zhou, Tong
Xiao, Anxiang Ma, and Jingbo Zhu. 2022. On vi-
sion features in multimodal machine translation.
In Proceedings of the 60th Annual Meeting of
the Association for Computational Linguistics,
pages 6327–6337.
Lin Li, Turghun Tayir, Yifeng Han, Xiaohui Tao,
and Juan D. Velásquez. 2023a. Multimodality
information fusion for automated machine trans-
lation. Information Fusion, 91:352–363.
Mingjie Li, Po-Yao Huang, Xiaojun Chang, Jun-
jie Hu, Yi Yang, and Alex Hauptmann. 2023b.
Video pivoting unsupervised multi-modal ma-
chine translation. IEEE Trans. Pattern Anal.
Mach. Intell., 45(3):3918–3932.
Tsung-Yi Lin, Michael Maire, Serge J. Belongie,
James Hays, Pietro Perona, Deva Ramanan, Pi-
otr Dollár, and C. Lawrence Zitnick. 2014. Mi-
crosoft COCO: common objects in context. In
Computer Vision - ECCV 2014 - 13th European
Conference, pages 740–755.
Kelly Marchisio, Kevin Duh, and Philipp Koehn.
2020. When does unsupervised machine transla-
tion work? In Proceedings of the Fifth Confer-
ence on Machine Translation, pages 571–583.
Graham Neubig, Zi-Yi Dou, Junjie Hu, Paul
Michel, Danish Pruthi, and Xinyi Wang. 2019.
compare-mt: A tool for holistic comparison of
language generation systems. In Proceedings of
NAACL-HLT 2019, pages 35–41.
Kishore Papineni, Salim Roukos, Todd Ward, and
Wei-Jing Zhu. 2002. Bleu: a method for auto-
matic evaluation of machine translation. In Pro-
ceedings of the 40th Annual Meeting of the As-
sociation for Computational Linguistics, pages
311–318.
Shaoqing Ren, Kaiming He, Ross B. Girshick, and
Jian Sun. 2015. Faster R-CNN: towards real-
time object detection with region proposal net-
works. In Conference on Neural Information
Processing Systems, pages 91–99.
Rico Sennrich, Barry Haddow, and Alexandra
Birch. 2016a. Improving neural machine trans-
lation models with monolingual data. In Pro-
ceedings of the 54th Annual Meeting of the As-
sociation for Computational Linguistics, pages
86–96.
Rico Sennrich, Barry Haddow, and Alexandra
Birch. 2016b. Neural machine translation of rare
words with subword units. In Annual Meeting of
the Association for Computational Linguistics,
pages 1715–1725.
Matthew Snover, Bonnie Dorr, Richard Schwartz,
Linnea Micciulla, and John Makhoul. 2006. A
study of translation edit rate with targeted human
annotation. In Proceedings of Association for
Machine Translation in the Americas, pages 223–
231.
Kihyuk Sohn. 2016. Improved deep metric learn-
ing with multi-class n-pair loss objective. In
Advances in Neural Information Processing Sys-
tems 29: Annual Conference on Neural Informa-
tion Processing Systems, pages 1849–1857.
Yuanhang Su, Kai Fan, Nguyen Bach, C.-C. Jay
Kuo, and Fei Huang. 2019. Unsupervised multi-
modal neural machine translation. In IEEE Con-
ference on Computer Vision and Pattern Recog-
nition, pages 10482–10491.
Haipeng Sun, Rui Wang, Masao Utiyama, Ben-
jamin Marie, Kehai Chen, Eiichiro Sumita, and
Tiejun Zhao. 2021. Unsupervised neural ma-
chine translation for similar and distant language
pairs: An empirical study. ACM Trans. Asian
Low Resour. Lang. Inf. Process., 20(1):1–17.
Ilya Sutskever, Oriol Vinyals, and Quoc V. Le.
2014. Sequence to sequence learning with neu-
ral networks. In Advances in neural information
processing systems, pages 3104–3112.
Turghun Tayir and Lin Li. 2024. Unsupervised
multimodal machine translation for low-resource
distant language pairs. ACM Trans. Asian Low-
Resour. Lang. Inf. Process., pages 1–22.
Aäron van den Oord, Yazhe Li, and Oriol Vinyals.
2018. Representation learning with contrastive
predictive coding. CoRR, abs/1807.03748.
Ashish Vaswani, Noam Shazeer, Niki Parmar,
Jakob Uszkoreit, Llion Jones, Aidan N. Gomez,
Lukasz Kaiser, and Illia Polosukhin. 2017. At-
tention is all you need. In Advances in Neural
Information Processing Systems, pages 5998–
6008.
Pascal Vincent, Hugo Larochelle, Yoshua Bengio,
and Pierre-Antoine Manzagol. 2008. Extracting
and composing robust features with denoising
autoencoders. In Machine Learning, Proceed-
ings of the Twenty-Fifth International Confer-
ence, pages 1096–1103.
Yijun Wang, Tianxin Wei, Qi Liu, and Enhong
Chen. 2021. Unpaired multimodal neural ma-
chine translation via reinforcement learning. In
Database Systems for Advanced Applications -
26th International Conference, pages 168–185.
Barret Zoph, Deniz Yuret, Jonathan May, and
Kevin Knight. 2016. Transfer learning for low-
resource neural machine translation. In Pro-
ceedings of the 2016 Conference on Empirical
Methods in Natural Language Processing, pages
1568–1575.
A Appendix
A.1 Experimental Results on Image Features
with Different Granularity
We argue that regional features are extracted based
on object confidence, which may ignore the rela-
tionships between objects and their background
information. Therefore, we also discuss the grid
features $G \in \mathbb{R}^{49 \times 2048}$ extracted using ResNet-
101 (He et al., 2016). As shown in Table 9, we
conducted experiments on the region and grid fea-
tures individually and together. Two image features
are combined in a concatenated manner. The exper-
imental results in Table 9 show that DLPs are
significantly improved in both features, while CLP
is in grid features. This validates our hypothesis
that DLPs require richer image features.
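A sketch of extracting the 49 x 2048 grid features with torchvision is shown below; the input resolution (224 x 224) and the ImageNet weights are assumptions, and the paper's exact feature extractor may differ.

```python
import torch
import torchvision
from torchvision import transforms
from PIL import Image


def extract_grid_features(image_path: str) -> torch.Tensor:
    """Take the last convolutional feature map of ResNet-101 (2048 x 7 x 7 for a
    224 x 224 input) and flatten the spatial grid into 49 x 2048 features."""
    weights = torchvision.models.ResNet101_Weights.IMAGENET1K_V1
    model = torchvision.models.resnet101(weights=weights)
    backbone = torch.nn.Sequential(*list(model.children())[:-2])  # drop avgpool + fc
    backbone.eval()

    prep = transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225]),
    ])
    x = prep(Image.open(image_path).convert("RGB")).unsqueeze(0)  # (1, 3, 224, 224)
    with torch.no_grad():
        feats = backbone(x)                                       # (1, 2048, 7, 7)
    return feats.flatten(2).squeeze(0).transpose(0, 1)            # (49, 2048)
```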
A.2 Images in Inferencing
Are images helpful when they are used during train-
ing but are not readily available during the infer-
ence (which is likely in practice)? From the XLM
results in Table 10, it can be seen that when train-
ing without images, the introduction of images in
the test leads to performance degradation, which
may be because the parameters of the image lin-
ear transformation are not well trained. Models trained
with images still have a considerable advantage
over models trained on text only. This suggests
that images can be used as a pivot and as additional
information during training, leading the model to
converge to a better optimum. It also indicates that if
images are unavailable, the requirement for paired
images and monolingual text during the testing
phase can be relaxed to text-only data.
A.3 Image Enhancement to Pre-training and
Fine-tuning
Although we discussed the effects of images on
different branches in Section 5.5.2, we did not de-
tail the benefits of different amounts of images. In
this section, we mainly discuss the importance of
Table 9: Experimental results (BLEU) of image features
with different granularity. The experiments are based on the
best models of Tables 4 and 5. Reg. and Gri. denote region
and grid features, and Reg.&Gri. denotes both features.

            En-Uy          Zh-Uy          En-De
            →      ←       →      ←       →      ←
Reg.        20.4   19.9    31.8   36.7    29.4   33.2
Gri.        20.2   20.3    29.9   33.5    30.7   34.4
Reg.&Gri.   20.9   20.6    32.2   37.0    29.0   32.3
Table 10: Experimental results with and without images
on the test set. Note that XLM is a model trained
without images.

                    Uy→En            Uy→Zh
                    BLEU    TER      BLEU    TER
Testing without images
XLM (text-only)     2.6     96.4     3.1     87.1
Ours                13.7    71.5     25.7    59.1
Testing with images
XLM (text-only)     1.6     98.4     2.5     92.5
Ours                20.6    64.8     37.0    46.7
images on different language pairs, and observe
them separately on pre-training and fine-tuning.
Our model benefits from the image pivoting of the
source sentence-image, image-target sentence and
source sentence-image-target sentence.
Without pre-training From the first two rows
of Table 11, performance improves when images
are added for fine-tuning, compared to the text-only
model 1. Model 2 shows increases of up to 8.5
to 10.1 BLEU points for both language pairs, with
even greater improvements for DLPs.
For small amounts of pre-training corpora
When both the pre-training and fine-tuning models are
trained on text-only data (model 1 vs. model 3),
translation on the CLP improves, while translation
on DLPs degrades. This verifies that unsuper-
vised translation performance deteriorates rapidly
when the source and target corpora come from dif-
ferent domains. Even with fine-tuning on text and
images, pre-training on text only reduces transla-
tion quality on DLPs (model 2 vs. model 4), though the result
is still better than the text-only model 3. The translation quality on
the CLP still retains its upward trend. How-
ever, what is common to both DLPs is that images
in pre-training are more beneficial than images in
fine-tuning. Clearly, the best translation quality is
achieved when images are included in both pre-
training and fine-tuning. The above experimental
results show that larger gains can be observed in
Table 11: Experimental results (BLEU) of models trained with different amounts of corpora. Image and text are fused by
concatenation, Eq(1).

Num.  Pre-training      Fine-tuning       En-Uy          Zh-Uy          En-De
      (#images/sents)   (#images/sents)   →      ←       →      ←       →      ←
1     (0/0)             (0/14.5k)         1.7    2.9     2.0    3.3     6.3    6.9
2     (0/0)             (14.5k/14.5k)     7.6    8.7     11.2   13.4    14.8   16.1
3     (0/14.5k)         (0/14.5k)         1.5    2.2     2.0    3.2     11.8   14.0
4     (0/14.5k)         (14.5k/14.5k)     3.0    4.4     7.9    10.7    15.5   19.1
5     (14.5k/14.5k)     (0/14.5k)         7.0    8.1     14.0   16.1    20.8   24.5
6     (14.5k/14.5k)     (14.5k/14.5k)     13.3   13.4    23.2   26.4    23.1   25.3
7     (0/322.7k)        (0/14.5k)         2.6    3.1     2.6    3.9     26.4   29.8
8     (0/322.7k)        (14.5k/14.5k)     7.4    8.3     9.2    12.7    28.6   32.8
9     (322.7k/322.7k)   (0/14.5k)         9.3    10.8    26.6   30.9    28.2   31.6
10    (322.7k/322.7k)   (14.5k/14.5k)     15.7   16.0    28.7   33.2    29.4   33.2
the visual pivoting, which aligns and reduces un-
certainty in the language latent space.
For large amounts of pre-training corpora
We increase the pre-training data to 322.7k monolingual multimodal
sentences. Experimental
results for model 7 show that the massive increase
in text-only pre-training data leads to only a small im-
provement in translation quality on DLPs, while the
significant enhancement on the CLP even outperforms the
previous multimodal translation. This result im-
plies that the CLP learns the alignment between the two
languages from text-only training, while DLPs fail
to align unpaired sentences. The rest of
the results mirror those obtained with the small
pre-training corpus. Compared with the CLP, visual
information on DLPs is more helpful for improving
cross-language alignment in unsupervised transla-
tion through pivoting.
B Analyses
This section analyzes the quality of the translated
sentences from two aspects, which are bucketed
analysis and case study respectively. XLM is the
text-only model and Eq(1) is the image and text
concatenation input method.
B.1 Bucketed Analysis
To find significant differences between translated
sentences, we used the
compare-mt
toolkit (Neu-
big et al., 2019) for analysis. Bucketed analy-
sis includes word accuracy analysis and sentence hierarchy
analysis. This paper provides an example of sen-
tence hierarchy analysis, as shown in Figure 4. In
the number of translated sentences with different
BLEU values, it can be found from figure that when
Figure 4: An example of translation accuracy analysis
in the En→Uy task. Since the Zh→Uy translation is
similar, its analysis is not presented.
Looking at the number of translated sentences in each
BLEU bucket, Figure 4 shows that for buckets with
BLEU below 20, the ordering of sentence counts is
inversely related to the models' output quality on
the whole test set, whereas for buckets with BLEU above
20 the ordering of sentence counts is consistent with
the models' scores on the whole test set. XLM produces
no translated sentences with BLEU above 60.