Improving the Cross-Lingual Generalisation in Visual Question Answering
Farhad Nooralahzadeh, Rico Sennrich
Department of Computational Linguistics, University of Zurich
farhad.nooralahzadeh@uzh.ch, sennrich@cl.uzh.ch
Abstract
While multilingual vision-language pretrained models offer several benefits, recent benchmarks across various tasks and languages have shown poor cross-lingual generalisation when these models are applied to non-English data, with a large gap between (supervised) English performance and (zero-shot) cross-lingual transfer. In this work, we explore the poor per-
formance of these models on a zero-shot cross-lingual vi-
sual question answering (VQA) task, where models are fine-
tuned on English visual-question data and evaluated on 7
typologically diverse languages. We improve cross-lingual
transfer with three strategies: (1) we introduce a linguistic
prior objective to augment the cross-entropy loss with a sim-
ilarity-based loss to guide the model during training, (2) we
learn a task-specific subnetwork that improves cross-lingual
generalisation and reduces variance without model modifica-
tion, and (3) we augment training examples using synthetic
code-mixing to promote alignment of embeddings between
source and target languages. Our experiments on xGQA us-
ing the pretrained multilingual multimodal transformers UC2
and M3P demonstrate the consistent effectiveness of the pro-
posed fine-tuning strategy for 7 languages, outperforming ex-
isting transfer methods with sparse models.
1 Introduction
Multimodal pretraining has established state-of-the-art performance for many multimedia tasks such as image-text retrieval, visual question answering, video localization, and speech recognition. Pretrained models outperform traditional methods by providing stronger representations of different modalities, learned in an unsupervised fashion (e.g. Radford et al. 2021; Schneider et al. 2019; Sun
et al. 2019). However, progress in this area has been limited mostly to the English language, as the main multimodal datasets consist only of English data. In order to generalize these achievements to non-English languages, recent works (e.g. Zhou et al. 2021; Ni et al. 2021; Liu et al. 2021; Bapna et al. 2022) attempt to learn universal representations that map objects occurring in different modalities, or texts expressed in various languages, into a shared semantic space.
IGLUE (Bugliarello et al. 2022), a recent benchmark spanning various tasks and languages, has shown that per-
formance degrades significantly when existing multilingual
vision-language models are applied to non-English data, and
there is a large gap between supervised performance and
(zero-shot) cross-lingual transfer. This gap is most notice-
able for resource-poor languages and languages that are dis-
tinct from English, attributed mostly to misalignment of text
embeddings between the source and target languages (Liu
et al. 2021; Pfeiffer et al. 2022).
In this work, we address a number of deficiencies in
how these multilingual vision-language models are trained
and evaluated on xGQA (Pfeiffer et al. 2022), a multilin-
gual evaluation benchmark for the visual question answering
task, where the source English dataset is extended to 7 typo-
logically diverse languages. Specifically, we address the fol-
lowing issues: (i) The standard cross-entropy loss function fails to properly assess the different incorrect model outputs and thus treats all incorrect predictions equally during training, (ii) The label space is largely derived from the source language (i.e. English), resulting in language bias in the training material and hurting generalization to other languages, and (iii) Unrestricted fine-tuning of multilingual vision-language models likely neglects their task-specific and language-neutral components, resulting in over-fitting on the source language and poor cross-lingual generalisation. Our contributions are as follows:
1. We design an effective fine-tuning strategy that incorporates a linguistic prior, a task-specific sparse subnetwork, and synthetic code-mixing augmentation to address the low performance of pretrained multilingual vision-language models on the cross-lingual VQA task. Our strategy does not introduce extra trainable parameters or
layers, and even reduces the number of model parame-
ters. Code and data to reproduce our findings are publicly
available.1
2. We evaluate the proposed strategy on cross-lingual zero-
shot learning, across a total of 7 languages and observe
consistent improvements over strong multilingual multi-
modal transformers including UC2 (Zhou et al. 2021) and
M3P (Ni et al. 2021), achieving a substantial +13.12% and +12.63% gain in average accuracy over all languages in xGQA against the UC2 and M3P baselines, respectively.
3. We perform an error analysis highlighting a substantial number of confusions between semantically related labels in xGQA, including synonyms, hypernyms, and hyponyms. We propose a metric that treats all synonyms of the ground truth label as correct.

1 https://github.com/nooralahzadeh/CLG-VQA

Figure 1: A standard setup (Bugliarello et al. 2022; Pfeiffer et al. 2022) to perform the VQA task using UC2 or M3P.

Figure 2: Illustration of how we augment the cross-entropy loss with a similarity-based loss using linguistic prior knowledge in the VQA task. The model receives a question-image pair where the question is Who is flying through the sky? and the ground truth label is skateboarder. The example is taken from the xGQA (Pfeiffer et al. 2022) dataset.
2 Background
Given an image and a question, the task in visual question answering (VQA) is to provide an answer based on both modalities. Most VQA benchmark datasets treat this as a classification task, where the underlying model selects one or multiple answers from a set of predefined labels. Recently, Pfeif-
fer et al. (2022) introduce a typologically diverse multilin-
gual and multimodal benchmark for VQA task by extending
the monolingual English-only GQA (Hudson and Manning
2019) dataset. They utilize 12,578 questions and 398 im-
ages from the test and development set of GQA, where the
questions are manually translated into 7 different languages,
covering 5 different scripts: Bengali (Bn), German (De), In-
donesian (Id), Korean (Ko), Portuguese (Pt), Russian (Ru)
and simplified Chinese (Zh). The xGQA benchmark also
consists of new fixed data splits to guide cross-lingual few-
shot learning experiments, where only a small number of ex-
amples in the target language are available. This dataset has
been used in recent studies on cross-lingual transfer learn-
ing of vision-language models (e.g. Liu et al. 2022; Zeng
et al. 2022) and includes several types of structured ques-
tions about an image. In this work, we base our approach on
two state-of-the-art pretrained multilingual vision-language
architectures, namely UC2 (Zhou et al. 2021) and M3P (Ni
et al. 2021). These two transformer-based multimodal models take as input the concatenation of image region features, extracted with an object detector (i.e. Faster R-CNN (Ren et al. 2015)), and a sequence of BPE subword tokens (Sennrich, Haddow, and Birch 2016), produced by a SentencePiece model (Kudo and Richardson 2018), representing the question. This
input is then processed by a BERT-like encoder (Devlin et al.
2019) to obtain multimodal, contextualised representations.
They are initialized from XLM-R (Conneau et al. 2020) and
mainly differ in their pretraining strategy.
As Figure 1 depicts, the standard setup (Pfeiffer et al.
2022; Bugliarello et al. 2022) for the cross-lingual VQA task is to fine-tune the pretrained multilingual image-text model in the source language (i.e. English). Then, the rep-
resentation of the [CLS] token as a multimodal and con-
textualized representation is fed into a non-linear two-layer
feed-forward classifier head to predict an answer for a given
image-question pair. For zero-shot cross-lingual evaluation,
the fine-tuned model is evaluated on the multilingual test
data, whereas in the few-shot cross-lingual scenario, the fine-
tuned model is additionally trained on image-question ex-
amples available in the target language.
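To make this setup concrete, the following is a minimal PyTorch sketch of such a classifier head; the hidden size, the intermediate width, the GELU activation, and the label-set size of 1842 (the number of xGQA labels mentioned in Section 4) are illustrative assumptions rather than the exact UC2/M3P implementation.

import torch.nn as nn

class VQAClassifierHead(nn.Module):
    # Non-linear two-layer feed-forward head over the multimodal [CLS] representation.
    def __init__(self, hidden_size=768, num_answers=1842):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(hidden_size, hidden_size * 2),  # assumed intermediate width
            nn.GELU(),                                # assumed non-linearity
            nn.Linear(hidden_size * 2, num_answers),
        )

    def forward(self, cls_representation):
        # cls_representation: (batch, hidden_size) vector taken from the [CLS] position
        return self.classifier(cls_representation)    # logits over the predefined answer set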
3 Fine-Tuning Strategies
Multilingual vision-language pretrained models often suf-
fer from poor cross-lingual generalisation compared with
their corresponding monolingual baseline, achieving much
better performance in the source language than in target
languages unseen during fine-tuning. In this work, we aim
to address the poor performance of these models on the
xGQA benchmark, where models are fine-tuned on English
data and evaluated on 7 typologically diverse languages.
In particular, we investigate the impact of three fine-tuning strategies: (i) Incorporating a Linguistic Prior, (ii) Task-specific Sparse Fine-tuning, and (iii) Multilingual Code-Switching (i.e. Code-Mixing) data augmentation. In this section, we describe these three strategies in detail.
Incorporating Linguistic Prior We identify a number of deficiencies in how multilingual vision-language models are trained and evaluated cross-lingually in the VQA task.
(i) The loss function fails to properly assess the different incorrect model outputs and thus treats all incorrect predictions equally during training, (ii) Training examples
are only annotated with one label, where intuitively multi-
ple labels are often plausible (e.g. lady vs. woman; couch
vs. sofa), and (iii) The label space depends heavily on the source language (i.e. English), which hurts generalization to other languages. For instance, there are singular and plural labels such as car/cars, woman/women and laptop/laptops, while in some target languages, such as Chinese, most nouns are not marked for grammatical number. In this section, we
aim to address these issues.
Given an image i and a question q, the model f_θ for the VQA task provides a probability distribution ŷ = p_θ(y | i, q) over a set of predefined answers. Commonly, VQA models are trained using the cross-entropy loss, in which the parameters θ of the underlying model are optimized with the following objective function:

$\mathcal{L}_{ce} = -\sum_{c=1}^{C} y^{*}_{c} \log p_\theta(y_c \mid i, q)$

where C is the number of classes in the answer set, and y* is the one-hot vector that represents the ground truth answer. This objective encourages the model to assign a large probability mass to the correct class. It compares the predicted and ground-truth label with a once-for-all matching strategy, evaluating every prediction as either correct or incorrect and ignoring how similar an incorrect prediction is to the correct answer. As an example, in
the question-image pair shown in Figure 2, if the model receives the question Who is flying through the sky? and the ground truth label is skateboarder, the loss function penalizes wrong predicted labels such as skater, man, t-shirt or car equally. We argue that incorrect training predictions may be quite diverse, and making the model aware of which incorrect predictions are more or less incorrect than others may guide it more effectively during training. Therefore, in our example, similar labels such as skater and man should be penalized much less than dissimilar words like t-shirt or car.
In order to alleviate the issue, we add a linguistic prior ob-
jective to augment the cross-entropy loss with a similarity-
based loss. The loss can be conceived as a form of risk min-
imization, where the risk function is the distance dbetween
a ground truth label y∗and the predicted label yc. In other
words, the objective function should give a small loss if the
predicted and ground truth label are similar, and penalize
dissimilar answers:
Algorithm 1: Iterative Magnitude Pruning (IMP) with rewinding step (Han, Mao, and Dally 2016).
Input: Model f(·; θ) initialized with pretrained parameters θ_0.
Parameter: p%: a pruning rate
Output: M
1: Set the initial pruning mask to M = 1^{|θ|}.
2: while not done do
3:   Train f(·; M ⊙ θ_0) to step t: f(·; M ⊙ θ_t).
4:   Prune p% of the remaining weights of M ⊙ θ_t and update M accordingly.
5: end while
6: Return f(·; M ⊙ θ_0).
$\mathcal{L}_{prior} = \sum_{c} p_\theta(y_c \mid i, q)\, d(y_c, y^{*})$

$\mathcal{L} = \mathcal{L}_{ce} + \alpha \mathcal{L}_{prior}$

The risk d is weighted by the probability distribution over all target labels p_θ(y | i, q), provided by the classification layer. We formalize the distance score d(y_c, y*) between the ground truth label and the other labels in the label space using two sources of linguistic knowledge:
WordNet (priorwn): We extract the explicit relations among the labels using the synset structure of the English lexical database WordNet (Fellbaum 1998). To be more precise, we derive the synonymy, hyponymy, and hypernymy relations and formulate the distance as:

$d(y_c, y^{*}) = \begin{cases} 0 & \text{if } y_c \text{ and } y^{*} \text{ are synonyms}\\ d_1 & \text{if } y_c \text{ is a hyponym of } y^{*}\\ d_2 & \text{if } y_c \text{ is a hypernym of } y^{*}\\ 1 & \text{otherwise} \end{cases}$

where $0 < d_1, d_2 < 1$.
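As a minimal sketch, such a WordNet-based distance can be computed with NLTK (the library used for this purpose in Section 4); whether the original implementation uses direct relations or, as here, the transitive hyponym/hypernym closures is our assumption, and multi-word labels are looked up with underscores.

from nltk.corpus import wordnet as wn  # requires the NLTK 'wordnet' corpus to be downloaded

def wordnet_distance(y_c, y_star, d1=0.8, d2=0.8):
    # d(y_c, y*) derived from synonymy, hyponymy, and hypernymy relations (priorwn).
    syn_c = set(wn.synsets(y_c.replace(" ", "_")))
    syn_star = set(wn.synsets(y_star.replace(" ", "_")))
    if syn_c & syn_star:                   # the labels share a synset -> synonyms
        return 0.0
    hyponyms = {h for s in syn_star for h in s.closure(lambda x: x.hyponyms())}
    hypernyms = {h for s in syn_star for h in s.closure(lambda x: x.hypernyms())}
    if syn_c & hyponyms:                   # y_c is a hyponym of y*
        return d1
    if syn_c & hypernyms:                  # y_c is a hypernym of y*
        return d2
    return 1.0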
Word Embeddings (priorem): A distance d is derived from the implicit semantic proximity within pretrained word embeddings. We use the cosine distance between the embedding of y* and that of every other label:

$d(y_c, y^{*}) = \text{CosineDistance}(emb_{y^{*}}, emb_{y_c})$
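The combined objective can then be implemented along the following lines; this sketch assumes a precomputed distance matrix D (a torch tensor with D[c, g] = d(y_c, y_g), built from either prior above) and already includes the top-k restriction described in Section 4. All names are illustrative.

import torch
import torch.nn.functional as F

def prior_augmented_loss(logits, targets, distance_matrix, alpha=10.0, k=10):
    # L = L_ce + alpha * L_prior, where L_prior weights the distances d(y_c, y*) by p_theta(y_c | i, q).
    ce = F.cross_entropy(logits, targets)            # standard cross-entropy term
    probs = logits.softmax(dim=-1)                   # p_theta(y_c | i, q), shape (batch, C)
    d = distance_matrix[:, targets].T                # d(y_c, y*) per example, shape (batch, C)
    topk = probs.topk(k, dim=-1).indices             # keep only the k most probable answers
    mask = torch.zeros_like(probs).scatter_(1, topk, 1.0)
    prior = (probs * d * mask).sum(dim=-1).mean()    # expected risk under the model distribution
    return ce + alpha * prior

Classes outside the top k simply contribute zero here; whether the probabilities are renormalized over the k retained answers is not specified, so no renormalization is applied in this sketch.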
Task-specific Sparse Fine-tuning (SFT) The success of multilingual pretrained models in cross-lingual generalisation is often attributed to task-specific and language-neutral components, which capture commonalities among languages (Libovický, Rosa, and Fraser 2020; Foroutan et al. 2022). To this end, we are inspired by previous works (Frankle and Carbin 2019; Chen et al. 2020; Ansell et al. 2022) that claim there exists a sparse, separately trainable subnetwork (i.e. a winning ticket) capable of matching or even outperforming the original neural network. Similarly, we design a task-specific sparse fine-tuning strategy, here dubbed SFT, consisting of two steps:
Figure 3: The code-mixed question, where a set of 4 words is randomly selected and replaced by their translations into 4 randomly selected target languages of xGQA, using the bilingual dictionaries of MUSE (Lample et al. 2018).
Step0: Considering the VQA model f(·; θ) initialized with pretrained weights θ_0, we obtain a subnetwork f(·; M ⊙ θ), where M ∈ {0, 1}^{|θ|} is a binary mask and ⊙ denotes element-wise multiplication. More specifically, as shown in Algorithm 1, we use Iterative Magnitude Pruning (IMP) (Han, Mao, and Dally 2016) to discover the pruning mask M during the fine-tuning of the VQA model on English-only data. After each epoch, we prune a certain fraction (e.g. p%) of the original parameters. Then, we continue fine-tuning after resetting the remaining parameters to their original values from the pretrained initialization θ_0.
Step1: Given the pruning mask M, the model parameters are initialized with their original values θ_0 and fine-tuned again. However, in this step, only the unmasked parameters are trained while the masked ones are kept frozen. Following previous works (Zhou et al. 2019; Chen et al. 2020), the masked parameters are set to zero.
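Step0 can be sketched with PyTorch's pruning utilities (the module referenced in Section 4); modules_to_prune (e.g. the encoder's linear modules) and train_one_epoch are hypothetical placeholders for the model-specific parameter selection and training loop.

import torch
import torch.nn.utils.prune as prune

def imp_with_rewinding(model, train_one_epoch, modules_to_prune, epochs=5, p=0.10):
    # Keep a copy of the pretrained weights theta_0 of every prunable module for rewinding.
    initial_weights = {id(m): m.weight.detach().clone() for m in modules_to_prune}
    to_prune = [(m, "weight") for m in modules_to_prune]
    for _ in range(epochs):
        train_one_epoch(model)  # fine-tune the currently unmasked weights on English-only data
        # Globally remove p% of the lowest-magnitude remaining weights; masks accumulate across epochs.
        prune.global_unstructured(to_prune, pruning_method=prune.L1Unstructured, amount=p)
        # Rewind the surviving weights to their pretrained values theta_0.
        with torch.no_grad():
            for m in modules_to_prune:
                m.weight_orig.copy_(initial_weights[id(m)])
    # Each pruned module now carries a weight_mask buffer; these masks define the subnetwork used in Step1.
    return model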
Code-Mixing (CDM) As long as fine-tuning only involves questions in English, the VQA model cannot properly benefit from the cross-lingual alignment information that exists in multilingual vision-language models. To make full use of this cross-lingual alignment information and to improve fine-tuning, we construct code-mixed data in the target languages. To generate the code-mixed questions, we follow the multilingual code-switching data augmentation (CoSDA) mechanism proposed by Qin et al. (2020). First, a set of words is randomly chosen in each question. Second, for each selected word, we randomly specify a target language to translate into. Third, we replace the word with its translation in the selected language. If the word has multiple translations in the target language, one of them is randomly selected for replacement. To increase data diversity during training, Qin et al. (2020) propose to reset the replacements after each epoch and to replace different words at different epochs.² Figure 3 shows the result of applying the code-mixing procedure to our example.
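A sketch of this augmentation step, assuming the MUSE bilingual dictionaries have already been loaded into per-language Python dicts mapping an English word to a list of translations (loading omitted); whitespace tokenization and the variable names are simplifications.

import random

def code_mix(question, dictionaries, languages, num_words=4):
    # Replace randomly chosen words of an English question with translations into random target languages.
    tokens = question.split()
    positions = random.sample(range(len(tokens)), k=min(num_words, len(tokens)))
    for pos in positions:
        lang = random.choice(languages)                           # pick a target language at random
        translations = dictionaries[lang].get(tokens[pos].lower())
        if translations:                                          # keep the word if no translation exists
            tokens[pos] = random.choice(translations)             # pick one translation at random
    return " ".join(tokens)

# Example call (dictionaries are hypothetical); the replacements are resampled at every epoch:
# code_mix("Who is flying through the sky?", muse_dicts, ["bn", "de", "id", "ko", "pt", "ru", "zh"])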
4 Experiments
To evaluate our proposed strategies, as explained in Section
2, we benchmark two state-of-the-art multilingual vision-
language transformers, namely UC2 and M3P, as the base
2For further details regarding CoSDA we refer the reader to the
original work.
models. We study the impact of each strategy by fine-tuning
the model on the monolingual English GQA dataset³, then evaluating the cross-lingual transfer on the multilingual extension of GQA, known as xGQA. We adopt the codebase of the IGLUE benchmark⁴ to implement our proposed approach, and we keep the model and training hyper-parameters equal to those reported by Bugliarello et al. (2022). For each experiment, results are averaged over five different runs.
Model Configurations and Notation
On both UC2 and M3P models, we experiment with three
different setups:
With priorxx: The fine-tuning process uses the similarity-based loss together with the cross-entropy loss. The prior-knowledge-based distances d are computed as follows: (i) priorwn: the WordNet-based distance is computed using the NLTK library (Loper and Bird 2002), and (ii) priorem: to compute the cosine distance among the 1842 labels in xGQA, we use the spaCy⁵ toolkit, where an embedding emb_y ∈ R^300 of each label is derived from GloVe (Pennington, Socher, and Manning 2014) pretrained word embeddings.
To mitigate the negative influence of non-probable classes on the similarity-based loss, we consider only the k most probable answers according to their probability p_θ(y_c | i, q) in both setups. We set the hyper-parameters d_1 = 0.8, d_2 = 0.8, k = 10 and α = 10 based on validation set performance.⁶
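For illustration, the embedding-based distance matrix over the label set can be precomputed once along the following lines; the spaCy model name is an assumption (any model shipping 300-dimensional word vectors would serve), and multi-word labels are represented by the average of their token vectors.

import numpy as np
import spacy

nlp = spacy.load("en_core_web_lg")  # assumed: a spaCy model with 300-d word vectors

def build_distance_matrix(labels):
    # D[c, g] = cosine distance between the embeddings of label c and label g.
    vecs = np.stack([nlp(label).vector for label in labels])  # multi-word labels -> averaged vectors
    norms = np.linalg.norm(vecs, axis=1, keepdims=True)
    norms[norms == 0] = 1.0                                    # guard against zero vectors (unknown labels)
    unit = vecs / norms
    return 1.0 - unit @ unit.T                                 # cosine distance = 1 - cosine similarity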
+SFT: Using the PyTorch pruning module⁷, we extract the subnetwork from the pretrained weights θ_0 following Algorithm 1 and Step0 of the SFT strategy. More specifically, we consider the weight matrices of the encoder part of the model (see Figure 1), excluding the image and text embeddings as well as the classifier layer, in both UC2 and M3P.
Since the network architecture varies between UC2 and M3P, the pruning is applied to a different set of parameters. We perform IMP and globally prune the lowest-magnitude weights throughout the network after each fine-tuning epoch (number of epochs = 5). Based on preliminary experiments, we iteratively prune a fixed fraction of the lowest-magnitude weights (i.e. p = 10%) at each epoch, which results in a final sparsity level of around 40% among the pruned parameters in both models. Considering the exclusion of some parameters, the overall level of sparsity is 12.28% for UC2 and 13.44% for M3P.⁸ As our focus is not on conducting a large-scale analysis over different sparsity levels, we leave this topic for future work.
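As a rough consistency check (our own arithmetic, assuming exactly 10% of the remaining weights are pruned in each of the 5 epochs): 1 - 0.9^5 ≈ 0.41 of the considered parameters end up pruned, which corresponds to about 0.41 × 85.52M / 281.66M ≈ 12.4% of all UC2 parameters and 0.41 × 123.67M / 376.90M ≈ 13.4% of all M3P parameters, in line with the reported overall sparsity levels.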
3 We consider the balanced subset of GQA, as recommended in the IGLUE benchmark (Bugliarello et al. 2022).
4 https://iglue-benchmark.github.io/
5 https://spacy.io/
6 We performed a grid search using different values for these hyper-parameters. Note that L_prior is typically much smaller than L_ce, hence the large α.
7 https://pytorch.org/tutorials/intermediate/pruning_tutorial.html
8 UC2 has around 281.66M parameters, of which 85.52M are involved in the pruning process. M3P has 376.90M parameters, of which 123.67M are considered for pruning.
Model En Bn De Id Ko Pt Ru Zh Avg
Fine-tune model on English training set (Zero-Shot)
UC2 Our Baseline 54.92 19.99 42.00 28.44 22.40 30.92 28.55 31.19 29.07
Baseline (Bugliarello et al. 2022) 55.19 19.98 42.85 28.67 21.36 30.41 30.99 31.15 29.35
Liu et al. (2022) 58.57±0.2 26.23±1.5 49.51±1.1 38.92±1.3 36.48±1.3 39.76±0.6 41.72±0.3 46.52±0.9 39.87
With priorwn 55.77±0.02 23.66±0.76 47.93±0.19 35.67±1.43 34.57±1.81 37.46±1.35 40.08±0.54 40.08±4.31 37.06
With priorem 56.09±0.14 23.97±2.56 48.13±0.78 36.87±1.90 34.14±3.56 38.18±2.55 41.07±0.86 41.76±1.89 37.73
With priorem+SFT 56.56±0.10 23.53±1.97 49.54±0.27 36.79±0.46 34.56±0.49 38.95±0.19 41.18±0.23 43.40±0.21 38.28
With priorem+CDM 54.37±0.01 27.38±0.02 46.66±1.70 20.88±2.33 36.32±1.11 40.81±2.06 43.48±0.18 30.62±1.46 35.16
With priorem+SFT+CDM 55.21±0.08 30.96±1.33 50.30±0.22 41.68±0.74 39.57±0.65 43.43±0.60 44.58±0.92 44.80±0.78 42.19
M3P Our Baseline 54.02 17.24 32.40 23.77 25.57 32.91 32.32 27.39 27.37
Baseline (Bugliarello et al. 2022) 53.75 18.64 33.42 32.48 25.11 31.40 27.50 28.65 28.17
Liu et al. (2022) 46.70±0.7 29.75±1.4 39.52±1.3 36.73±1.6 35.67±1.1 37.59±0.8 37.93±0.9 36.15±0.9 36.19
With priorwn 55.91±0.20 22.38±0.38 39.48±1.73 29.31±2.27 35.15±0.86 39.00±0.17 38.92±0.31 35.74±0.79 34.28
With priorem 56.33±0.09 22.93±3.19 40.10±0.54 30.63±0.05 35.35±2.14 38.85±1.09 39.95±0.03 36.97±0.07 34.97
With priorem+SFT 56.18±0.00 22.07±0.46 40.29±0.17 27.04±0.11 34.62±0.06 38.39±0.24 39.44±0.00 36.32±0.49 34.02
With priorem+CDM 54.35±0.64 28.71±1.73 43.57±0.33 38.89±2.07 38.06±0.44 41.93±0.59 41.64±1.11 38.80±1.22 38.80
With priorem+SFT+CDM 55.58±0.12 31.53±1.47 46.19±0.54 34.60±0.49 40.21±0.91 42.87±0.58 42.32±1.19 42.25±0.60 40.00
Translate everything to English and use the English-only model (Translate-Test)
UC2 (Bugliarello et al. 2022) 55.19 49.31 52.61 50.34 48.62 52.17 49.95 48.32 50.19
M3P (Bugliarello et al. 2022) 53.75 47.79 51.01 49.35 47.64 51.21 47.76 47.04 48.83
Table 1: Accuracy results on the xGQA test set for zero-shot transfer. Columns indicate the target languages. We also report the average (Avg.) accuracy across languages, excluding English. For our baseline, we fine-tuned the model on the English balanced subset of GQA and evaluated it on the test set of xGQA. In With priorxx, the original cross-entropy loss is augmented with a similarity-based loss, using either WordNet (i.e. With priorwn) or word embeddings (i.e. With priorem). In With priorem+SFT, we apply the task-specific sparse fine-tuning strategy along with the word-embedding-based loss. With priorem+SFT+CDM is our final design, where we employ the code-mixing augmentation on top of the previous strategy. For comparison, we report results from Bugliarello et al. (2022) and Liu et al. (2022). The performance of all proposed strategies is averaged over five runs with different random seeds.
Furthermore, following Step1 of the SFT strategy, we fine-tune the model using the pruning mask M. In both steps, we incorporate the similarity-based loss using the word-embedding prior (i.e. priorem) in our experiments. This effectively leads to a two-stage pruning and sparse fine-tuning process, termed With priorem+SFT.
+CDM: To perform code-mixing, the English questions and the bilingual dictionaries of MUSE (Lample et al. 2018) are used as the basis. We use all 7 target languages of xGQA during the code-mixing augmentation. For the With priorem+SFT+CDM experiments, we continue fine-tuning by applying the code-mixing during Step1 of SFT, after pruning the model according to Step0 of SFT. We find that including code-mixing during the pruning step (i.e. Step0) negatively impacts model performance in the experiments that follow the With priorem+SFT+CDM strategy.
Baselines and Previous Results:
We create our baselines by directly evaluating the monolingually fine-tuned models on the test sets of the target languages. For each model, we report another baseline using
the results in Bugliarello et al. (2022). Moreover, we com-
pare our model with a previous study (Liu et al. 2022),
where the low performance of multilingual vision-language
models (i.e. UC2 and M3P) in the xGQA dataset has been
addressed through sophisticated classification architectures,
fine-tuning strategies, and modifications of the model input
via question-type conditioning. In addition, we report re-
sults of Translate-Test setup from Bugliarello et al. (2022)
where target language test data is translated to English and
an English-only fine-tuned model is evaluated on the trans-
lated test set.
5 Results and Discussion
In this section, we report the results of the different fine-
tuning strategies. The proposed strategies result in the best-
performing models across all 7 target languages in the cross-
lingual visual question answering task. A summary of the
results with various strategies is provided in Table 1.
With priorxx: Our first set of experimental results shows the advantage of using the proposed loss along with the standard cross-entropy loss for the VQA task. The proposed strategy (i.e. With priorxx) improves the average cross-lingual zero-shot transfer accuracy by +7.99 and +8.66 points over the UC2 baseline using WordNet and GloVe embeddings, respectively. At the same time, it yields gains of +6.9 and +7.6 absolute accuracy points for the other model (i.e. M3P) with priorwn and priorem, respectively. The results indicate that the similarity-based
loss obtained from linguistic priors can effectively guide
the models during training. They also support our hypoth-
esis that incorporating additional semantic prior knowledge
about the label space improves the cross-lingual generali-
sation. Among the proposed semantic distances, the GloVe
embeddings-based distance delivers the greatest improve-
ments in almost all languages. One major conceptual difference between our WordNet- and GloVe-based distances that could explain this performance gap is that the
former is sparse and heuristic, whereas the latter is dense
and continuous. GloVe will also capture relations such as
antonym labels (e.g. male/female, boy/girl, or yes/no).
With priorem+SFT: The results demonstrate the importance of a task-specific sparse fine-tuning strategy (i.e. SFT) for adapting the multilingual vision-language models to the downstream VQA task without modifying the model. The SFT strategy brings further improvements (i.e. +0.55) over the With priorem strategy for UC2. Even though it does not surpass the previous strategy for M3P and yields slightly lower performance for some of the target languages with UC2, such as Bengali (Bn) and Indonesian (Id), it delivers considerably more stable (lower variance) performance across random seeds in all 7 target languages. It is
also worth noting that the SFT strategy offers a task-specific
and parameter-efficient structure for both models, where a
fraction of the encoder’s parameters (12.28% of parameters
in UC2 and 13.44% of parameters in M3P) are masked and
ignored during the fine-tuning. These results suggest that
SFT is successful in discovering language-neutral and task-
specific parameters that generalise well cross-lingually for
xGQA, similar to the finding by Foroutan et al. (2022) for
text-only tasks.
With priorem+SFT+CDM: The highest zero-shot transfer performance in our experiments is obtained by applying the code-mixing strategy on top of the previous best strategy (i.e. With priorem+SFT). This combination outperforms the baselines by a large margin on both transformer models: the improvement is +13.12 and +12.63 points in average accuracy over the UC2 and M3P baselines, respectively. This approach also outperforms the previous work by Liu et al. (2022) across most of the target languages, with higher accuracy and lower variance. Notably, our final strategy reaches an average accuracy across the 7 languages of 42.19 versus 39.87 for UC2 and 40.00 versus 36.19 for M3P. This confirms that our approach can better adapt multilingual vision-language models to the cross-lingual VQA task.
We further aim to understand the impact of CDM in isolation, i.e. without SFT. As Table 1 shows, applying CDM as the only additional strategy results in a large performance drop for the UC2 model in some of the target languages, especially Indonesian (Id) and Chinese (Zh). It also leads to higher variance compared to its counterpart that only uses the SFT strategy, for both models. This result demonstrates synergies between the proposed strategies: the combination of CDM, which promotes alignment of word representations between source and target languages, and SFT, which discovers subnetworks that may be more language-neutral, achieves a large improvement, whereas the effects are more moderate (or negative) when the strategies are applied in isolation.
Model Avg. (w/o Syn.) Avg. (w Syn.) Diff.
UC2 Our Baseline 29.07 29.96 +0.89
With priorwn 37.06 38.91 +1.85
With priorem 37.73 39.06 +1.33
With priorem+SFT 38.28 39.67 +1.39
With priorem+SFT+CDM 42.19 43.90 +1.71
M3P Our Baseline 27.37 31.83 +4.56
With priorwn 34.28 37.70 +3.42
With priorem 34.97 38.85 +3.88
With priorem+SFT 34.02 38.25 +4.23
With priorem+SFT+CDM 40.00 43.52 +3.52
Table 2: Results of adjusting the evaluation metric to consider synonyms of the target label as correct predictions (w Syn.). The w/o Syn. column indicates the results before the adjustment.
It is worth noting that we also conducted experiments with the CDM strategy alone (i.e. excluding the With priorem strategy). However, the results were lower than those of With priorem+CDM (e.g. Avg = 32.76 compared to Avg = 35.16 using UC2).
6 Further Analysis
To further investigate the effect of synonymy relations
among the target labels on xGQA evaluation results, we
modify the evaluation metric to consider synonyms of the
ground truth label as a correct prediction. We use WordNet synsets for this purpose. For instance, we consider couch correct if the ground truth label is sofa, or girls if the ground truth label is girl. We note that confusion be-
tween synonymous labels is relatively common in xGQA; if
we consider synonyms to be correct answers, model perfor-
mance is actually higher than reported by the original accu-
racy by 0.9-1.85 and 3.42-4.56 percentage points with UC2
and M3P, respectively (see Table 2).
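A sketch of this adjusted metric, counting a prediction as correct when it equals the ground truth or shares at least one WordNet synset with it; the handling of multi-word labels is our assumption.

from nltk.corpus import wordnet as wn

def are_synonyms(prediction, gold):
    # True if the two labels are identical or share at least one WordNet synset (e.g. couch/sofa, girls/girl).
    if prediction == gold:
        return True
    syn_p = set(wn.synsets(prediction.replace(" ", "_")))
    syn_g = set(wn.synsets(gold.replace(" ", "_")))
    return bool(syn_p & syn_g)

def synonym_aware_accuracy(predictions, golds):
    correct = sum(are_synonyms(p, g) for p, g in zip(predictions, golds))
    return correct / len(golds)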
Table 3 shows the 5 most-confused labels for each lan-
guage, specifically where the UC2 model predicts a syn-
onym, hypernym, or hyponym of the target label. While the synonym confusions are predominantly due to inflectional differences (singular/plural), we also find a large number of “wrong”
predictions that are in a hypernymy/hyponymy relationship
with the ground truth, and semantically plausible (girl or
lady vs. woman). Although the confusion between similar
labels motivated our use of linguistic priors, the performance
improvement that we observe is not predominantly due to
a reduction in this confusion. In fact, with our best strat-
egy, the number of “wrong” predictions that are semantically
plausible even increases for UC2, especially for some low-
resource languages such as Bengali (Bn) and Korean (Ko),
which we take as a positive result: our strict accuracy results
in Table 1 already show a substantial improvement for these
languages, and with a more permissive evaluation metric,
gains over the baseline would be even greater. Similar results
are observed when we only take into account synonymy re-
lationships.
Model Lang. 5 most-confused labels, each shown as label:prediction (rel.) count
Our Baseline En girl:woman (hyp) 27 material:color (hpo) 23 lady:woman (hyp) 18 coffee table:table (hyp) 17 zebras:zebra (syn) 16
Bn sailboats:sailboat (syn) 3 skater:skateboarder (hpo) 3 plain:field (syn) 2 trees:tree (syn) 2 tank top:shirt (hyp) 1
De girl:woman (hyp) 33 material:color (hpo) 21 lady:woman (hyp) 16 woman:girl (hpo) 13 street sign:sign (hyp) 13
Id girl:woman (hyp) 28 lady:woman (hyp) 18 skater:skateboarder (hpo) 15 woman:girl (hpo) 14 zebras:zebra (syn) 12
Ko girl:woman (hyp) 7 skater:skateboarder (hpo) 7 boy:man (hyp) 2 fire truck:truck (hyp) 2 gown:dress (hyp) 1
Pt girl:woman (hyp) 22 skater:skateboarder (hpo) 17 lady:woman (hyp) 13 zebras:zebra (syn) 12 woman:girl (hpo) 11
Ru girl:woman (hyp) 32 skater:skateboarder (hpo) 17 lady:woman (hyp) 17 woman:girl (hpo) 14 cabinets:cabinet (syn) 12
Zh girl:woman (hyp) 26 chairs:chair (syn) 15 cabinets:cabinet (syn) 15 skater:skateboarder (hpo) 15 lady:woman (hyp) 15
Our Best Strategy En girl:woman (hyp) 28 material:color (hpo) 24 cabinets:cabinet (syn) 20 woman:girl (hpo) 18 zebras:zebra (syn) 16
Bn cabinets:cabinet (syn) 29 girl:woman (hyp) 19 skater:skateboarder (hpo) 15 woman:girl (hpo) 12 lady:woman (hyp) 12
De girl:woman (hyp) 32 material:color (hpo) 23 lady:woman (hyp) 18 cabinets:cabinet (syn) 17 woman:girl (hpo) 16
Id girl:woman (hyp) 27 cabinets:cabinet (syn) 24 woman:girl (hpo) 17 chairs:chair (syn) 17 lady:woman (hyp) 17
Ko cabinets:cabinet (syn) 39 girl:woman (hyp) 34 elephants:elephant (syn) 20 woman:girl (hpo) 17 chairs:chair (syn) 17
Pt material:color (hpo) 25 girl:woman (hyp) 24 woman:girl (hpo) 20 zebras:zebra (syn) 15 lady:woman (hyp) 15
Ru girl:woman (hyp) 33 cabinets:cabinet (syn) 25 material:color (hpo) 19 woman:girl (hpo) 18 lady:woman (hyp) 16
Zh cabinets:cabinet (syn) 32 girl:woman (hyp) 27 chairs:chair (syn) 26 zebras:zebra (syn) 25 elephants:elephant (syn) 24
Table 3: The 5 most-confused labels for each language, specifically where the UC2 model predicts a synonym (syn), hypernym (hyp), or hyponym (hpo) of the target label. The number of “wrong” predictions that are in a synonymy/hypernymy/hyponymy relationship (rel.) with the ground truth label is reported next to each label:prediction pair.
7 Related Works
The primary motivation for this work is the low cross-
lingual generalization of multilingual vision-language pre-
trained models. There are a number of works addressing this
problem. Zeng et al. (2022) introduce cross-view language modeling, which treats image-caption pairs and parallel sentence pairs as two different views of the same object and trains the model to align the two views by maximizing the mutual information between them with conditional masked language modeling and a contrastive loss. While they report state-of-the-art zero-shot cross-lingual performance on xGQA, their method requires a pretraining step as well as high computational resources and multilingual vision-language datasets. In contrast, our proposed strategy can be applied on top of any multilingual vision-language pretrained model as an adaptation step. Our approach is similar to the work by Liu et al. (2022), who propose a set of methods that improve the previously low transfer performance and thus substantially reduce the gap to monolingual English performance. However, their approach is more complex, and our final strategy provides better performance with a sparse encoder.
Similarity-based loss: There is increasing interest in incorporating prior domain knowledge into neural NLP downstream tasks. Prior knowledge of the language has recently been applied to language generation. Li et al. (2020) introduce a technique that imposes a prior derived from (linguistic) data over sequence prediction models and improves performance in typical language generation tasks, including machine translation, text summarization, and image captioning. Chousa, Sudoh, and Nakamura (2018) propose an NMT loss function that incorporates word similarity in the form of distances in a word embedding space, which leads to substantial gains in machine translation.
Sparse fine-tuning: Our approach is inspired by studies of sparse fine-tuning methods (Ansell et al. 2022; Liang et al. 2021; Foroutan et al. 2022). Ansell et al. (2022) and Liang et al. (2021) argue that unrestricted fine-tuning of multilingual models is prone to over-fitting on the source language as well as catastrophic forgetting, and they attribute this degradation partly to parameter interference. Foroutan et al. (2022) suggest that language-specific and language-neutral subnetworks play a prominent role in the cross-lingual generalisation of multilingual language models (i.e. multilingual BERT). In this work, we follow the above-mentioned ideas by looking at the structure and weights of multilingual vision-language models in the VQA task.
Code-Switching: Data augmentation with code-switching offers significant improvements for low-resource languages, as it helps the model explicitly learn the relationships among words in different languages. It has been applied to the training of various multimodal multilingual models such as M3P (Ni et al. 2021) and CCLM (Zeng et al. 2022). Raj Khan, Gupta, and Ekbal (2021) create a multilingual and code-mixed VQA dataset in eleven different language setups, considering multiple Indian and European languages as well as their code-mixed versions. They propose a knowledge distillation approach to extend an English vision-language model (teacher) into a multilingual and code-mixed model (student). However, this dataset is not as diverse as xGQA in terms of covering low-resource languages.
8 Conclusion
We present a series of strategies to fine-tune multilingual
vision-language pretrained models for better cross-lingual
generalisation in the visual question answering task. Our ap-
proach is based on various adaptation techniques aimed to
mitigate the number of issues that we discovered regarding
the training and evaluation of multilingual vision-language
models on xGQA. Comparing our approach with the base-
line and previous similar work in several pretrained models,
the results indicate substantial improvements across target
languages. The improvement is +13.12 and +12.63 in av-
erage accuracy over all 7 languages in xGQA compared to
UC2 and M3P baselines, respectively.
We perform an analysis of closely related target labels in
xGQA, proposing a new metric that rewards synonymous
predictions and further demonstrates the success of the pro-
posed strategies. This analysis also highlights the need for
future research on the label space and evaluation metrics for
cross-lingual VQA.
Acknowledgments
We would like to thank Xin Sennrich and Alham Fikri
Aji for their helpful feedback on language resources. This
work was funded by the Swiss National Science Founda-
tion (project MUTAMUR; no. 176727) at the University of
Zurich.
References
Ansell, A.; Ponti, E.; Korhonen, A.; and Vulić, I. 2022.
Composable Sparse Fine-Tuning for Cross-Lingual Trans-
fer. In Proceedings of the 60th Annual Meeting of the Asso-
ciation for Computational Linguistics (Volume 1: Long Pa-
pers), 1778–1796. Dublin, Ireland: Association for Compu-
tational Linguistics.
Bapna, A.; Cherry, C.; Zhang, Y.; Jia, Y.; Johnson, M.;
Cheng, Y.; Khanuja, S.; Riesa, J.; and Conneau, A. 2022.
mSLAM: Massively multilingual joint pre-training for
speech and text. arXiv preprint arXiv:2202.01374.
Bugliarello, E.; Liu, F.; Pfeiffer, J.; Reddy, S.; Elliott, D.;
Ponti, E. M.; and Vulić, I. 2022. IGLUE: A Benchmark
for Transfer Learning across Modalities, Tasks, and Lan-
guages. In Proceedings of the 39th International Confer-
ence on Machine Learning, volume 162 of Proceedings of
Machine Learning Research, 2370–2392. PMLR.
Chen, T.; Frankle, J.; Chang, S.; Liu, S.; Zhang, Y.; Wang,
Z.; and Carbin, M. 2020. The lottery ticket hypothesis for
pre-trained bert networks. Advances in neural information
processing systems, 33: 15834–15846.
Chousa, K.; Sudoh, K.; and Nakamura, S. 2018. Train-
ing neural machine translation using word embedding-based
loss. arXiv preprint arXiv:1807.11219.
Conneau, A.; Khandelwal, K.; Goyal, N.; Chaudhary, V.;
Wenzek, G.; Guzmán, F.; Grave, E.; Ott, M.; Zettlemoyer,
L.; and Stoyanov, V. 2020. Unsupervised Cross-lingual Rep-
resentation Learning at Scale. In Proceedings of the 58th
Annual Meeting of the Association for Computational Lin-
guistics, 8440–8451. Online: Association for Computational
Linguistics.
Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2019.
BERT: Pre-training of Deep Bidirectional Transformers for
Language Understanding. In Proceedings of the 2019 Con-
ference of the North American Chapter of the Association
for Computational Linguistics: Human Language Technolo-
gies, Volume 1 (Long and Short Papers), 4171–4186. Min-
neapolis, Minnesota: Association for Computational Lin-
guistics.
Fellbaum, C. 1998. WordNet: An Electronic Lexical
Database. Bradford Books.
Foroutan, N.; Banaei, M.; Lebret, R.; Bosselut, A.; and
Aberer, K. 2022. Discovering Language-neutral Sub-
networks in Multilingual Language Models. ArXiv,
abs/2205.12672.
Frankle, J.; and Carbin, M. 2019. The Lottery Ticket Hy-
pothesis: Finding Sparse, Trainable Neural Networks. In In-
ternational Conference on Learning Representations.
Han, S.; Mao, H.; and Dally, W. J. 2016. Deep Compression:
Compressing Deep Neural Network with Pruning, Trained
Quantization and Huffman Coding. In Bengio, Y.; and Le-
Cun, Y., eds., 4th International Conference on Learning
Representations, ICLR 2016, San Juan, Puerto Rico, May
2-4, 2016, Conference Track Proceedings.
Hudson, D. A.; and Manning, C. D. 2019. GQA: A New
Dataset for Real-World Visual Reasoning and Composi-
tional Question Answering. 2019 IEEE/CVF Conference on
Computer Vision and Pattern Recognition (CVPR), 6693–
6702.
Kudo, T.; and Richardson, J. 2018. SentencePiece: A sim-
ple and language independent subword tokenizer and deto-
kenizer for Neural Text Processing. In Proceedings of the
2018 Conference on Empirical Methods in Natural Lan-
guage Processing: System Demonstrations, 66–71. Brussels,
Belgium: Association for Computational Linguistics.
Lample, G.; Conneau, A.; Ranzato, M.; Denoyer, L.; and
Jégou, H. 2018. Word translation without parallel data. In
6th International Conference on Learning Representations,
ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018,
Conference Track Proceedings. OpenReview.net.
Li, Z.; Wang, R.; Chen, K.; Utiyama, M.; Sumita, E.; Zhang,
Z.; and Zhao, H. 2020. Data-dependent Gaussian Prior Ob-
jective for Language Generation. In ICLR.
Liang, J.; Zhao, C.; Wang, M.; Qiu, X.; and Li, L. 2021.
Finding Sparse Structures for Domain Specific Neural Ma-
chine Translation. In AAAI.
Libovický, J.; Rosa, R.; and Fraser, A. 2020. On the Lan-
guage Neutrality of Pre-trained Multilingual Representa-
tions. In Findings of the Association for Computational Lin-
guistics: EMNLP 2020, 1663–1674. Online: Association for
Computational Linguistics.
Liu, C.; Pfeiffer, J.; Korhonen, A.; Vulic, I.; and Gurevych,
I. 2022. Delving Deeper into Cross-lingual Visual Question
Answering. ArXiv, abs/2202.07630.
Liu, F.; Bugliarello, E.; Ponti, E. M.; Reddy, S.; Collier, N.;
and Elliott, D. 2021. Visually Grounded Reasoning across
Languages and Cultures. In Proceedings of the 2021 Confer-
ence on Empirical Methods in Natural Language Process-
ing, 10467–10485. Online and Punta Cana, Dominican Re-
public: Association for Computational Linguistics.
Loper, E.; and Bird, S. 2002. NLTK: The Natural Lan-
guage Toolkit. In Proceedings of the ACL-02 Workshop
on Effective Tools and Methodologies for Teaching Natural
Language Processing and Computational Linguistics, 63–
70. Philadelphia, Pennsylvania, USA: Association for Com-
putational Linguistics.
Ni, M.; Huang, H.; Su, L.; Cui, E.; Bharti, T.; Wang, L.;
Zhang, D.; and Duan, N. 2021. M3P: Learning
Universal Representations via Multitask Multilingual Multi-
modal Pre-training. In 2021 IEEE/CVF Conference on Com-
puter Vision and Pattern Recognition (CVPR), 3976–3985.
Pennington, J.; Socher, R.; and Manning, C. 2014. GloVe:
Global Vectors for Word Representation. In Proceedings
of the 2014 Conference on Empirical Methods in Natural
Language Processing (EMNLP), 1532–1543. Doha, Qatar:
Association for Computational Linguistics.
Pfeiffer, J.; Geigle, G.; Kamath, A.; Steitz, J.-M.; Roth, S.;
Vulić, I.; and Gurevych, I. 2022. xGQA: Cross-Lingual Vi-
sual Question Answering. In Findings of the Association for
Computational Linguistics: ACL 2022, 2497–2511. Dublin,
Ireland: Association for Computational Linguistics.
Qin, L.; Ni, M.; Zhang, Y.; and Che, W. 2020. CoSDA-
ML: Multi-Lingual Code-Switching Data Augmentation for
Zero-Shot Cross-Lingual NLP. In IJCAI.
Radford, A.; Kim, J. W.; Hallacy, C.; Ramesh, A.; Goh, G.;
Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.;
Krueger, G.; and Sutskever, I. 2021. Learning Transfer-
able Visual Models From Natural Language Supervision. In
Meila, M.; and Zhang, T., eds., Proceedings of the 38th In-
ternational Conference on Machine Learning, volume 139
of Proceedings of Machine Learning Research, 8748–8763.
PMLR.
Raj Khan, H.; Gupta, D.; and Ekbal, A. 2021. Towards De-
veloping a Multilingual and Code-Mixed Visual Question
Answering System by Knowledge Distillation. In Findings
of the Association for Computational Linguistics: EMNLP
2021, 1753–1767. Punta Cana, Dominican Republic: Asso-
ciation for Computational Linguistics.
Ren, S.; He, K.; Girshick, R. B.; and Sun, J. 2015. Faster
R-CNN: Towards Real-Time Object Detection with Region
Proposal Networks. In Cortes, C.; Lawrence, N. D.; Lee,
D. D.; Sugiyama, M.; and Garnett, R., eds., Advances in
Neural Information Processing Systems 28: Annual Confer-
ence on Neural Information Processing Systems 2015, De-
cember 7-12, 2015, Montreal, Quebec, Canada, 91–99.
Schneider, S.; Baevski, A.; Collobert, R.; and Auli, M. 2019.
wav2vec: Unsupervised Pre-Training for Speech Recogni-
tion. In Proc. Interspeech 2019, 3465–3469.
Sennrich, R.; Haddow, B.; and Birch, A. 2016. Neural
Machine Translation of Rare Words with Subword Units.
In Proceedings of the 54th Annual Meeting of the Associ-
ation for Computational Linguistics (Volume 1: Long Pa-
pers), 1715–1725. Berlin, Germany: Association for Com-
putational Linguistics.
Sun, C.; Myers, A.; Vondrick, C.; Murphy, K. P.; and
Schmid, C. 2019. VideoBERT: A Joint Model for Video
and Language Representation Learning. 2019 IEEE/CVF In-
ternational Conference on Computer Vision (ICCV), 7463–
7472.
Zeng, Y.; Zhou, W.; Luo, A.; and Zhang, X. 2022. Cross-
View Language Modeling: Towards Unified Cross-Lingual
Cross-Modal Pre-training. ArXiv, abs/2206.00621.
Zhou, H.; Lan, J.; Liu, R.; and Yosinski, J. 2019. Decon-
structing Lottery Tickets: Zeros, Signs, and the Supermask.
In Advances in Neural Information Processing Systems.
Zhou, M.; Zhou, L.; Wang, S.; Cheng, Y.; Li, L.; Yu, Z.;
and Liu, J. 2021. UC2: Universal Cross-lingual Cross-modal
Vision-and-Language Pre-training. In IEEE Conference on
Computer Vision and Pattern Recognition (CVPR 2021).