Show Your Faith: Cross-Modal Confidence-Aware Network
for Image-Text Matching
Huatian Zhang1, Zhendong Mao1*, Kun Zhang1, Yongdong Zhang1, 2
1University of Science and Technology of China, Hefei, China
2Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, Hefei 230022, China
{huatianzhang, kkzhang}@mail.ustc.edu.cn, {zdmao, zhyd73}@ustc.edu.cn
Abstract
Image-text matching bridges vision and language, which is
a crucial task in the field of multi-modal intelligence. The
key challenge lies in how to measure image-text relevance
accurately as matching evidence. Most existing works aggre-
gate the local semantic similarities of matched region-word
pairs as the overall relevance, and they typically assume that
the matched pairs are equally reliable. However, although
a region-word pair is locally matched across modalities, it
may be inconsistent/unreliable from the global perspective
of image-text, resulting in inaccurate relevance measurement.
In this paper, we propose a novel Cross-Modal Confidence-
Aware Network to infer the matching confidence that indi-
cates the reliability of matched region-word pairs, which is
combined with the local semantic similarities to refine the
relevance measurement. Specifically, we first calculate the
matching confidence via the relevance between the semantic
of image regions and the complete described semantic in the
image, with the text as a bridge. Further, to richly express the
semantic of regions, we extend the region to its visual context
in the image. Then, local semantic similarities are weighted
with the inferred confidence to filter out unreliable matched
pairs in aggregating. Comprehensive experiments show that
our method achieves state-of-the-art performance on bench-
marks Flickr30K and MSCOCO.
1 Introduction
Image-text matching, which refers to image searching given
descriptions or text retrieval given image queries, is benefi-
cial to many multi-modal tasks (Anderson et al. 2018)(Xu
et al. 2018)(Xu et al. 2019)(Yu et al. 2020) such as image captioning, text-to-image synthesis, and visual question answering. The matching aims to bridge vision and language
so as to reduce the visual-semantic discrepancy between
these two heterogeneous modalities. Despite the remarkable progress in recent years, the key challenge of image-text matching remains how to measure image-text relevance accurately as matching evidence.
To explore efficacious approaches to capturing cross-modal semantic interplays for image-text relevance measuring, plenty of research has been conducted. The common paradigm
is to first align vision and language semantically, and then
*Corresponding author.
Copyright © 2022, Association for the Advancement of Artificial
Intelligence (www.aaai.org). All rights reserved.
[Figure 1: example similarity/confidence scores over a photo captioned "A man is standing inside a cherry picker."; panels (a) existing methods and (b) our method; legend: r = image-text relevance, c = matching confidence, s = semantic similarity]
Figure 1: Illustration of the necessity of matching confi-
dence. (a) Existing methods typically measure the overall
image-text relevance with aggregating local semantic simi-
larities, assuming all matched region-word pairs are reliable.
The word “man” in the text will align to all man regions, even though those with s2, s3, s4 are not really referred to. (b) Our method further
infers the matching confidence to distinguish the reliability
of each matched region-word pair from the global perspec-
tive, and filters out the unreliable matched pairs (e.g., regions
with red box) to achieve more accurate semantic alignment.
measure cross-modal semantic similarity as relevance based
on resulting alignments. There are two main strategies:
global aligning based and local aligning based. Global align-
ing based methods (Wang, Li, and Lazebnik 2016)(Liu
et al. 2017) (Gu et al. 2018) (Huang et al. 2018)(Shi et al.
2019)(Li et al. 2019) infer cross-modal semantic similarity
directly from the global alignment between the whole image
and full text in a common embedding space. Local align-
ing based methods aggregate the overall relevance from lo-
cal semantic alignments between detected salient image re-
gions and text words. Recent works mainly probe into local
aligning to discover fine-grained visual-semantic similarity
at region-word level. (Lee et al. 2018) proposes a stacked
cross attention network to capture all latent local alignments
by attending to image regions and words with each other,
which achieves promising performance and inspires a se-
ries of works (Wang et al. 2019b) (Hu et al. 2019) (Zhang
et al. 2020b) (Chen et al. 2020) (Wu et al. 2019)(Wehrmann,
Kolling, and Barros 2020) (Chen and Luo 2020a). They han-
dle sophisticated semantic interactions reasonably between
modalities, for obtaining discriminative visual-semantic rep-
resentations to facilitate cross-modal aligning. (Liu et al.
2020) (Diao et al. 2021) focus on exploring local align-
ments aggregating mechanisms such as graph convolution
and attentional reasoning to enhance meaningful alignments
in overall relevance measurement. In general, most local
aligning based methods match image regions and text words
with associating visual-semantic locally, and aggregate se-
mantic similarities between matched region-word pairs me-
chanically to measure the overall image-text relevance.
However, in most existing works, local semantic similarities, i.e., the relevance of matched region-word pairs, are aggregated with an equal default confidence of 1, which is ill-considered since the matching confidence, i.e., the reliability of matched region-word pairs, depends on the global image-text semantic context and differs from pair to pair. That is, a region-word pair may be locally matched across modalities, yet inconsistent/unreliable from the global perspective of image-text. Thus, in order to reveal the real contribution
level of local semantic similarities to the overall cross-modal
relevance, it is necessary to explicitly indicate the confidence
of region-word pairs in matching. Without considering the
confidence, the inconsistent region-word pairs will be ag-
gregated indiscriminately and thus interfere with the overall
relevance measurement. More seriously, redundant inconsistent region-word pairs may even overwhelm the matched ones, causing the effects of the relatively few matched pairs that are critical for matching to be diluted. As shown in
Figure 1, the word “man” locally aligns to all man regions
in Figure 1(a), including the man who is not standing inside the
cherry picker, which results in inaccurate semantic align-
ment and interferes with the relevance measurement. By taking the matching confidence into account in Figure 1(b),
the interferences from the man regions irrelevant to text se-
mantic can be filtered out.
To address the above issues, we propose a novel Cross-
Modal Confidence-Aware Network (CMCAN) for image-
text matching, which takes the confidence of matched
region-word pairs into account and combines it with the lo-
cal semantic similarities to measure cross-modal relevance
accurately. CMCAN infers the matching confidence from
the relevance between the semantic of image regions and the
complete described semantic in the image, with the text as a
bridge. Specifically, the confidence is measured by the inner
product between the semantic similarity of the region-text
and the semantic similarity of the whole image-text, which
are connected by the full text. Moreover, to express the se-
mantic of the image region richly, we extend the region to
its visual context in the image. In detail, our method con-
tains three modules: 1) Feature Representing: we first ex-
tract the representations of detected image regions and text
words for global and local aligning. In order to fully exploit
region semantic in the image, we extend each region with
its surrounding scene together as its visual context based
on the natural neighboring relationship; 2) Matching Con-
fidence Inferring: the matching confidence is inferred from
how much semantic similarity between visual context of re-
gions and the full text can be contained in the overall seman-
tic similarity of image-text, since it indicates the relative ex-
tent to which regions are described in text from the perspec-
tive of the whole image; 3) Cross-Modal Relevance Mea-
suring: we weight each region-queried local semantic sim-
ilarity with the corresponding inferred confidence, and im-
plement self-attentional reasoning on global similarity with
both weighted region-queried local similarities and word-
queried local similarities separately to measure the overall
image-text relevance.
Our contributions are summarized as follows:
• We propose a novel Cross-Modal Confidence-Aware Network, which is, to the best of our knowledge, the first to infer the confidence of matched region-word pairs from a global perspective in image-text matching, filtering out inconsistent locally matched region-word pairs to enable more accurate relevance measurement.
• We propose a delicately designed matching confidence inferring method, which uses the full text as a bridge to measure the faith of whether the regions are really described in the text, relative to the global semantic similarity of the whole image-text.
• The experimental results demonstrate that our method achieves state-of-the-art performance on the public benchmarks Flickr30K and MSCOCO.
2 Related Work
Extensive efforts have been made to align visual-semantic
between heterogeneous modalities and measure cross-modal
relevance for image-text matching which is more compli-
cated than unimodal retrieval (Cui et al. 2019)(Zhu et al.
2020). (Wang, Li, and Lazebnik 2016)(Liu et al. 2017)(Gu
et al. 2018)(Shi et al. 2019)(Li et al. 2019) conform to the
global aligning paradigm and mainly focus on exploring the
ways of feature fusion or exploiting latent scene semantic to
learn more discriminative representations.
To capture fine-grained cross-modal interplays, (Nam,
Ha, and Kim 2017)(Huang, Wang, and Wang 2017) attempt
to learn region-word level correspondences locally but can
only attend to limited alignments because of the high cou-
pling in alignment aggregating. Significantly, (Lee et al.
2018) proposes a stacked cross attention to mine region-
word local alignments by attending to image regions and
words with each other as context and aggregates the lo-
cal alignments by average or LogSumExp to measure over-
all cross-modal relevance. Under the local aligning frame-
work inspired by (Lee et al. 2018), (Wang et al. 2019b) (Hu
et al. 2019) (Zhang et al. 2020b) (Chen et al. 2020) (Wu
et al. 2019)(Wehrmann, Kolling, and Barros 2020) (Chen
and Luo 2020a) aim to design reasonable cross-modal align-
ing mechanisms to meet visual semantic interactions in or-
der to facilitate the relevance measuring. (Wang et al. 2020)
models scene graph (Xu et al. 2020) to describe the natural
scene in images. (Liu et al. 2020) introduces relative spa-
tial position of image regions and syntactic dependency tree
of text to model the semantic associations between regions
and words respectively, and then aggregates local align-
ment between regions and words, based on graph convolu-
tional network(Kipf and Welling 2017). (Diao et al. 2021)
enhances global alignment and local alignments mutually
with the help of attentional reasoning. (Chen et al. 2021)
discovers that simple pooling can outperform well-designed
complex methods in feature aggregating, and automatically
learns the best pooling strategy. (Yan, Yu, and Xie 2021) ex-
plicitly transforms features from heterogeneous modalities
into a common embedding space with attention mechanism,
which optimizes attention weights towards evaluation met-
rics, based on policy gradient.
In summary, most existing works aggregate fine-
grained local semantic similarities or global and local se-
mantic similarities mechanically either by read-out func-
tions or with weights inferred from inter-alignment reason-
ing to measure cross-modal relevance, without taking the
inherent reliability of matched region-word pairs from the
global image-text perspective into account. That is, in most existing works, interference from matching relationships that are locally matched but inconsistent with the global perspective is aggregated into the overall image-text relevance without screening.
3 Methodology
In this section, we elaborate on the matching confidence in-
ferring and how to introduce the inferred confidence into
cross-modal relevance measurement. As illustrated in Fig-
ure 2, our CMCAN is composed of three modules. Firstly,
the way to learn visual and textual representations and ex-
tend the semantic of detected image regions is introduced in
section 3.1. Secondly, how to infer the matching confidence
of matched region-word pairs from the global image-text
perspective is proposed in section 3.2. Finally, our vision-
language self-attentional reasoning method for measuring
cross-modal relevance is presented in section 3.3, and the
objective function for training is mentioned in section 3.4.
3.1 Feature Representing
Image Representation To extract image regions with ex-
pressive visual semantics, bottom-up attention has been
widely employed in multi-modal tasks (Zhang et al. 2020a),
which imitates how humans spontaneously focus on salient objects or other regions. Following (Anderson et al. 2018) (Lee
et al. 2018), we utilize Faster R-CNN (Ren et al. 2015)
with ResNet-101(He et al. 2016) as backbone to imple-
ment the bottom-up attention, which is pretrained on the Visual Genome dataset (Krishna et al. 2017). Specifically, the
Faster R-CNN is utilized to detect salient regions in an im-
age $I$ and encode a visual representation $x_i$ for each detected image region $r_i$. Then we transform $x_i$ to a $D$-dimensional $v_i$ via linear projection:
$$v_i = W_v x_i + b_i \tag{1}$$
The image $I$ can be denoted as $\{v_i \mid i = 1, 2, \cdots, N,\ v_i \in \mathbb{R}^D\}$, where $N$ is the number of regions in image $I$.
Further, the global representation $v^{glo}$ of the whole image $I$ is encoded by an attention mechanism with the average feature $v^{ave} = \frac{1}{N}\sum_{i=1}^{N} v_i$ as the query. Concretely, $v^{glo}$ is aggregated from the detected regions as follows:
$$v^{glo} = \frac{\sum_{i=1}^{N} w_i v_i}{\left\|\sum_{i=1}^{N} w_i v_i\right\|_2} \tag{2}$$
where the attention weight $w_i$ is the normalized similarity between $v_i$ and the query $v^{ave}$.
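To make the aggregation concrete, the following is a minimal PyTorch-style sketch of this average-query attention pooling (Eq. 2); cosine similarity as the scoring function and a softmax normalization of the weights are assumptions, since the text only states that $w_i$ is the normalized similarity to the query.

```python
import torch
import torch.nn.functional as F

def global_pooling(v, temperature=1.0):
    """Attention pooling of region features v: (N, D) -> (D,).

    Sketch of Eq. (2): the mean region feature acts as the query,
    cosine similarity scores each region, and the weighted sum is
    L2-normalized. The scoring/temperature choices are assumptions.
    """
    v_ave = v.mean(dim=0, keepdim=True)                 # (1, D) query
    scores = F.cosine_similarity(v, v_ave, dim=-1)      # (N,)
    w = F.softmax(scores / temperature, dim=0)          # normalized weights
    pooled = (w.unsqueeze(-1) * v).sum(dim=0)           # (D,)
    return F.normalize(pooled, dim=-1)                  # L2 normalization
```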
Text Representation We extract text semantic informa-
tion at word level in order to capture the fine-grained in-
terplay between vision and language. We first map one-hot
encodings $\{w_1, w_2, \cdots, w_M\}$ of words in text $T$ to distributed representations through a learnable word embedding layer as $t_i = W_e w_i$. To enhance the text representation with context semantics, we utilize a bi-directional GRU (Bahdanau, Cho, and Bengio 2015) to encode both forward and backward information as follows:
$$\overrightarrow{h_i} = \overrightarrow{\mathrm{GRU}}\left(t_i, \overrightarrow{h_{i-1}}\right), \quad i \in [1, M] \tag{3}$$
$$\overleftarrow{h_i} = \overleftarrow{\mathrm{GRU}}\left(t_i, \overleftarrow{h_{i+1}}\right), \quad i \in [1, M] \tag{4}$$
where $\overrightarrow{h_i}$ and $\overleftarrow{h_i}$ denote hidden states from the forward and backward GRU, respectively. The context-enhanced word representation $u_i$ is defined as the mean of the bi-directional hidden states:
$$u_i = \frac{\overrightarrow{h_i} + \overleftarrow{h_i}}{2}, \quad i \in [1, M] \tag{5}$$
The text $T$ can be denoted as $\{u_i \mid i = 1, 2, \cdots, M,\ u_i \in \mathbb{R}^D\}$. Similarly, the global representation $u^{glo}$ of the full text $T$ is obtained in the same way as $v^{glo}$ in Eq. 2.
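As a rough illustration of Eqs. 3-5, a bidirectional GRU word encoder could be sketched as below; the vocabulary size and hidden dimension are placeholders, and averaging the two directions implements Eq. 5.

```python
import torch
import torch.nn as nn

class WordEncoder(nn.Module):
    """Bi-GRU word encoder (Eqs. 3-5): one context-enhanced vector per word."""

    def __init__(self, vocab_size=10000, embed_dim=300, hidden_dim=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)          # t_i = W_e w_i
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True,
                          bidirectional=True)

    def forward(self, token_ids):                                 # (B, M)
        t = self.embed(token_ids)                                 # (B, M, embed_dim)
        h, _ = self.gru(t)                                        # (B, M, 2*hidden_dim)
        fwd, bwd = h.chunk(2, dim=-1)                             # split the two directions
        return (fwd + bwd) / 2                                    # Eq. 5: mean of both directions
```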
Semantic Extending To make the representations of image regions more discriminative from each other, we take a further step and extract the visual context of each region for semantic extending. Moreover, considering that the surrounding scene of a region usually contains its related semantics, we extend a region with its neighboring regions as the visual context. To be specific, for a region $v_i$, we divide its surrounding scene into four equal scopes with $v_i$ as the center, and extract the $K$ nearest detected regions from each scope (i.e., top, bottom, left or right). Then we gather the indexes of all extracted image regions as well as the center, denoted $idx_i$, as:
$$idx_i = \Big\{\bigcup_{scope} idx_{scope},\ i\Big\}, \quad scope \in \{\mathrm{top}, \mathrm{bottom}, \mathrm{left}, \mathrm{right}\} \tag{6}$$
where $idx_{scope}$ denotes the indexes of the $K$ extracted nearest regions in one scope. The surrounding scene of region $v_i$ is thus disassembled into its neighboring regions indexed by $idx_i$. Furthermore, we formulate the scene $v^{neig}_i$ as:
$$v^{neig}_i = \frac{\sum_{j \in idx_i} w_j v_j}{\left\|\sum_{j=1}^{N} w_j v_j\right\|_2} \tag{7}$$
where $w_j$ in Eq. 7 shares the same attention weights as in Eq. 2.
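The sketch below shows one plausible way to gather the $K$ nearest regions per directional scope from region box centers (Eq. 6). The exact geometric rule for assigning a neighbor to the top/bottom/left/right scope is not specified above, so the dominant-offset-axis heuristic used here is an assumption.

```python
import torch

def neighbor_indexes(centers, i, K=3):
    """Gather Eq. (6)-style neighbor indexes for region i.

    centers: (N, 2) tensor of region box centers (x, y).
    The scope assignment (by dominant offset axis) is an assumed heuristic.
    """
    offsets = centers - centers[i]                      # (N, 2) offsets from the center region
    dists = offsets.norm(dim=-1)                        # (N,) Euclidean distances
    scopes = {"top": [], "bottom": [], "left": [], "right": []}
    for j in range(centers.size(0)):
        if j == i:                                      # skip the center region itself
            continue
        dx, dy = offsets[j].tolist()
        if abs(dy) >= abs(dx):
            scopes["top" if dy < 0 else "bottom"].append(j)
        else:
            scopes["left" if dx < 0 else "right"].append(j)
    idx = {i}                                           # include the center, as in Eq. (6)
    for members in scopes.values():
        members.sort(key=lambda j: dists[j].item())     # K nearest per scope
        idx.update(members[:K])
    return sorted(idx)
```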
Figure 2: Illustration of our proposed CMCAN. The entire method consists of three modules: feature representing, matching
confidence inferring, and cross-modal relevance measuring. The confidence is inferred from the relevance between the visual
context of regions and the complete described semantic in the image, with the text as a bridge.
3.2 Matching Confidence Inferring
Cross-Modal Aligning To characterize the detailed correspondence between vision and language and align visual semantics across modalities, inspired by (Diao et al. 2021), we embody the semantic similarity between heterogeneous modalities with a normalized distance-based representation. Specifically, the local semantic similarity $s^v_i$ between image region $v_i$ and its semantically matched relevant words in the text is represented as:
$$s^v_i = \frac{W^v_s\,|v_i - a^u_i|^2}{\left\|W^v_s\,|v_i - a^u_i|^2\right\|_2} \tag{8}$$
where $W^v_s \in \mathbb{R}^{P \times D}$ is a learnable parameter matrix. The text context $a^u_i$ is attended by region $v_i$ with $a^u_i = \sum_{j=1}^{M} \alpha_{ij} u_j$ as in (Lee et al. 2018), where $\alpha_{ij} = \frac{\exp(\lambda \hat{c}_{ij})}{\sum_{j=1}^{M} \exp(\lambda \hat{c}_{ij})}$, $\hat{c}_{ij} = [c_{ij}]_+ / \sqrt{\sum_{i=1}^{N} [c_{ij}]_+^2}$, and $c_{ij}$ is the cosine similarity between region $v_i$ and word $u_j$. That is, the semantic similarity $s^v_i$ is queried by image region $v_i$. Similarly, the semantic similarity $s^u_j$ between word $u_j$ and its matched visual context $a^v_j$ in the image is captured by $s^u_j = \frac{W^u_s\,|u_j - a^v_j|^2}{\left\|W^u_s\,|u_j - a^v_j|^2\right\|_2}$.
We further measure the global semantic similarity $s^{glo}$ between the whole image $v^{glo}$ and the full text $u^{glo}$:
$$s^{glo} = \frac{W^g_s\,|v^{glo} - u^{glo}|^2}{\left\|W^g_s\,|v^{glo} - u^{glo}|^2\right\|_2} \tag{9}$$
where $W^g_s \in \mathbb{R}^{P \times D}$ is a learnable parameter matrix.
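For intuition, here is a minimal sketch of the distance-based similarity vector used in Eqs. 8-10: the element-wise squared difference of two $D$-dimensional vectors is projected to a $P$-dimensional similarity representation and L2-normalized. Reading $|\cdot|^2$ as an element-wise squared difference follows (Diao et al. 2021) and is an assumption about the notation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimilarityVector(nn.Module):
    """Distance-based similarity representation: two (D,) vectors -> one (P,) vector."""

    def __init__(self, dim_d=1024, dim_p=256):
        super().__init__()
        self.proj = nn.Linear(dim_d, dim_p, bias=False)   # plays the role of W_s in Eqs. 8-10

    def forward(self, x, y):
        diff = (x - y).pow(2)                             # element-wise squared difference
        return F.normalize(self.proj(diff), dim=-1)       # L2-normalized similarity vector
```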
Cross-Modal Confidence When salient image regions are viewed separately, their visual semantics are fragmented, so a locally aligned region-word pair may be inconsistent with the global image-text semantics. The confidence indicates the degree to which each region is consistent with the global perspective of the image-text pair, and can thus filter out inconsistent matched region-word pairs. Specifically, we first extend each region $v_i$ to its visual context $v^{neig}_i$, in order to make the representation of each region more discriminative. The extended visual context can be exploited to verify how much of the region's semantics is described in the global text semantics, which is measured by the alignment between the visual context $v^{neig}_i$ and the full text $u^{glo}$ as:
$$s^{neig}_i = \frac{W^n_s\,|v^{neig}_i - u^{glo}|^2}{\left\|W^n_s\,|v^{neig}_i - u^{glo}|^2\right\|_2} \tag{10}$$
where $W^n_s \in \mathbb{R}^{P \times D}$ is a learnable parameter matrix.
Referring to the given text, we have already obtained how much of the whole image's semantics is described in the global text semantics, namely $s^{glo}$ in Eq. 9. Then, bridged by the full text, we measure the matching confidence $c_i$ from the normalized relevance between the global semantic similarity $s^{glo}$ and the corresponding $s^{neig}_i$ as:
$$\epsilon_i = w_n \left(s^{glo} \odot s^{neig}_i\right), \quad i = 1, 2, \cdots, N \tag{11}$$
$$c = \sigma\left(\mathrm{LayerNorm}\left([\epsilon_1, \epsilon_2, \cdots, \epsilon_N]\right)\right) \tag{12}$$
where $c = [c_1, c_2, \cdots, c_N]$, $w_n \in \mathbb{R}^{1 \times P}$ is a learnable parameter vector, $\odot$ indicates the element-wise product, $\sigma$ indicates the sigmoid function, and LayerNorm denotes the layer normalization operation.
Method | Text Retrieval: R@1 R@5 R@10 | Image Retrieval: R@1 R@5 R@10 | R@sum
CAMP (Wang et al. 2019b) 68.1 89.7 95.2 51.5 77.1 85.3 466.9
SCAN (Lee et al. 2018) 67.4 90.3 95.8 48.6 77.7 85.2 465.0
SGM (Wang et al. 2020) 71.8 91.7 95.5 53.5 79.6 86.5 478.6
MMCA (Wei et al. 2020) 74.2 92.8 96.4 54.8 81.4 87.8 487.4
CAAN (Zhang et al. 2020b) 70.1 91.6 97.2 52.8 79.0 87.9 478.6
DPRNN (Chen and Luo 2020b) 70.2 91.6 95.8 55.5 81.3 88.2 482.6
PFAN (Wang et al. 2019a) 70.0 91.8 95.0 50.4 78.7 86.1 472.0
VSRN (Li et al. 2019) 71.3 90.6 96.0 54.7 81.8 88.2 482.6
IMRAM (Chen et al. 2020) 74.1 93.0 96.6 53.9 79.4 87.2 484.2
GSMN (Liu et al. 2020) 76.4 94.3 97.3 57.4 82.3 89.0 496.8
SGRAF(Diao et al. 2021) 77.8 94.1 97.4 58.5 83.0 88.8 499.6
CMCAN (ours) 79.5 95.6 97.6 60.9 84.3 89.9 507.8
Table 1: Comparisons with state-of-the-art methods on Flickr30K. The best results are in bold.
Note that the key idea here is
that the matching confidence is inferred from how much se-
mantic similarity between visual context of regions and the
full text can be contained in the overall semantic similarity
of image-text, since it indicates the relative extent to which the region is really described from the global perspective of
image-text.
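A compact sketch of the confidence inference in Eqs. 11-12 might look as follows; `s_neig` would come from a similarity-vector module applied to each region's visual context and the global text embedding, and the module and argument names are illustrative rather than taken from any released code.

```python
import torch
import torch.nn as nn

class ConfidenceHead(nn.Module):
    """Infer per-region matching confidence from similarity vectors (Eqs. 11-12)."""

    def __init__(self, dim_p=256, num_regions=36):
        super().__init__()
        self.w_n = nn.Linear(dim_p, 1, bias=False)       # w_n in Eq. 11
        self.norm = nn.LayerNorm(num_regions)            # LayerNorm over the N region scores

    def forward(self, s_glo, s_neig):
        # s_glo: (B, P) global image-text similarity, s_neig: (B, N, P) region-context similarities
        eps = self.w_n(s_glo.unsqueeze(1) * s_neig).squeeze(-1)    # (B, N), Eq. 11
        return torch.sigmoid(self.norm(eps))                        # (B, N), Eq. 12
```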
3.3 Cross-Modal Relevance Measuring
To distinguish the matching confidence of region-word pairs and to filter out, in the overall cross-modal relevance measurement, the local semantic similarities contributed by unreliable matched pairs, i.e., pairs that are locally matched although the regions are not really referred to in the global text semantics, we first multiply each region-queried semantic similarity $s^v_i$ by the corresponding $c_i$. Thus, we collect the global semantic similarity and the scaled local similarities together as:
$$S^v = \left[s^{glo}, c_1 s^v_1, \cdots, c_N s^v_N\right] \tag{13}$$
Meanwhile, the global similarity $s^{glo}$ and the word-queried semantic similarities $s^u_1, s^u_2, \cdots, s^u_M$ are collected together as $S^u = \left[s^{glo}, s^u_1, \cdots, s^u_M\right]$.
We implement multi-layer self-attentional reasoning on the collected $S^v$ and $S^u$ separately, in order to obtain modality-specific enhanced global alignments:
$$S^{l+1} = \mathrm{ReLU}\left(W^l_r \cdot \mathrm{softmax}\left(W^l_q S^l \cdot \left(W^l_k S^l\right)^{\top}\right) \cdot S^l\right) \tag{14}$$
where $W^l_q \in \mathbb{R}^{P \times P}$ and $W^l_k \in \mathbb{R}^{P \times P}$ are parameter matrices that transform the attention query and key in the $l$-th layer respectively, and $W^l_r \in \mathbb{R}^{P \times P}$ is a parameter matrix that maps the attended features to the next $(l+1)$-th layer. Note that $S^l_v$ and $S^l_u$ are both denoted as $S^l$ in Eq. 14.
Further, we concatenate the reasoned vision-enhanced global semantic similarity $s^{glo,L}_v$ and the language-enhanced global semantic similarity $s^{glo,L}_u$ from the last ($L$-th) layer, and then feed the concatenated vision-language enhanced global similarity into a fully connected layer activated by the sigmoid function to measure the overall cross-modal relevance $r$ between image $I$ and text $T$:
$$r(I, T) = \sigma\left(w_s \left[s^{glo,L}_v : s^{glo,L}_u\right]\right) \tag{15}$$
where $w_s \in \mathbb{R}^{1 \times 2P}$ is a learnable parameter vector that maps the concatenated similarity vector to a scalar relevance score.
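A rough sketch of one reasoning layer and the final relevance head (Eqs. 14-15) is given below. Treating each similarity vector as a token of dimension $P$ and applying the attention along the token axis is our reading of Eq. 14, and `w_s` is assumed to be a linear layer mapping the concatenated $2P$-dimensional vector to a scalar.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ReasoningLayer(nn.Module):
    """One self-attentional reasoning step over a set of similarity vectors (Eq. 14)."""

    def __init__(self, dim_p=256):
        super().__init__()
        self.w_q = nn.Linear(dim_p, dim_p, bias=False)
        self.w_k = nn.Linear(dim_p, dim_p, bias=False)
        self.w_r = nn.Linear(dim_p, dim_p, bias=False)

    def forward(self, s):                        # s: (B, T, P) similarity "tokens"
        attn = torch.softmax(self.w_q(s) @ self.w_k(s).transpose(1, 2), dim=-1)
        return F.relu(self.w_r(attn @ s))        # (B, T, P) reasoned similarities

def relevance_score(s_glo_v, s_glo_u, w_s):
    """Eq. 15: sigmoid of a linear map over the concatenated enhanced global vectors.

    w_s is assumed to be nn.Linear(2 * dim_p, 1).
    """
    return torch.sigmoid(w_s(torch.cat([s_glo_v, s_glo_u], dim=-1)))  # (B, 1)
```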
3.4 Objective Function
To cluster matched image-text pairs and push unmatched ones away from each other in the shared embedding space, ranking objectives are widely employed in matching. Following (Faghri et al. 2018), we adopt the bi-directional triplet loss for end-to-end training, focusing on the hard negatives within a minibatch for computational efficiency:
$$\mathcal{L}(I, T) = \left[\lambda - r(I, T) + r\big(I, \hat{T}_h\big)\right]_+ + \left[\lambda - r(I, T) + r\big(\hat{I}_h, T\big)\right]_+ \tag{16}$$
where $\lambda$ is a margin constraint, $[x]_+ = \max(x, 0)$, and $r(\cdot)$ is the cross-modal semantic relevance measurement defined by Eq. 15. Given a positive pair $(I, T)$, $\hat{I}_h = \arg\max_{I' \neq I} r(I', T)$ and $\hat{T}_h = \arg\max_{T' \neq T} r(I, T')$ are the hardest negatives within the training minibatch.
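As an illustration, a batched form of this hard-negative triplet loss (Eq. 16) could be written as below, given a square matrix of relevance scores for all image-text pairs in a minibatch; this mirrors the VSE++ formulation rather than any released CMCAN code.

```python
import torch

def hard_negative_triplet_loss(scores, margin=0.2):
    """Eq. 16 over a batch: scores[i, j] = r(image_i, text_j), positives on the diagonal."""
    batch_size = scores.size(0)
    positives = scores.diag().view(batch_size, 1)
    mask = torch.eye(batch_size, dtype=torch.bool, device=scores.device)

    # image as query: hinge costs against all texts, positives excluded
    cost_text = (margin + scores - positives).clamp(min=0).masked_fill(mask, 0)
    # text as query: hinge costs against all images, positives excluded
    cost_image = (margin + scores - positives.t()).clamp(min=0).masked_fill(mask, 0)

    # keep only the hardest negative per query in each direction
    return cost_text.max(dim=1)[0].sum() + cost_image.max(dim=0)[0].sum()
```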
4 Experiments
4.1 Datasets and Evaluation Metrics
We evaluate our method on Flickr30K (Young et al. 2014)
and MSCOCO (Lin et al. 2014) datasets. Flickr30K con-
tains 31,000 images and each image is captioned with 5 descriptions. Following the dataset splits in (Lee et al. 2018), we
use 29,000 images for training, 1,000 images for validation,
and 1,000 images for testing. MSCOCO contains 133,287
images and each image is annotated with 5 sentences. We
use 123,287 images for training, 5,000 images for valida-
tion, and 5,000 images for testing, and the results are re-
ported by both averaging over 5 folds of 1,000 test images
Method | Text Retrieval: R@1 R@5 R@10 | Image Retrieval: R@1 R@5 R@10 | R@sum
CAMP (Wang et al. 2019b) 72.3 94.8 98.3 58.5 87.9 95.0 506.8
SCAN (Lee et al. 2018) 72.7 94.8 98.4 58.8 88.4 94.8 507.9
SGM (Wang et al. 2020) 73.4 93.8 97.8 57.5 87.3 94.3 504.1
MMCA (Wei et al. 2020) 74.8 95.6 97.7 61.6 89.8 95.2 514.7
CAAN (Zhang et al. 2020b) 75.5 95.4 98.5 61.3 89.7 95.2 515.6
DPRNN (Chen and Luo 2020b) 75.3 95.8 98.6 62.5 89.7 95.1 517.0
PFAN (Wang et al. 2019a) 76.5 96.3 99.0 61.6 89.6 95.2 518.2
VSRN (Li et al. 2019) 76.2 94.8 98.2 62.8 89.7 95.1 516.8
IMRAM (Chen et al. 2020) 76.7 95.6 98.5 61.7 89.1 95.0 516.6
GSMN (Liu et al. 2020) 78.4 96.4 98.6 63.3 90.1 95.7 522.5
SGRAF(Diao et al. 2021) 79.6 96.2 98.5 63.2 90.7 96.1 524.3
CMCAN (ours) 81.2 96.8 98.7 65.4 91.0 96.2 529.3
Table 2: Comparisons with state-of-the-art methods on MSCOCO 1K test images. The best results are in bold.
and testing on the full 5,000 test images. As common in in-
formation retrieval, we measure the performance by R@K
(recall at K) defined as the percentage of queries that are
correctly matched in the closest K queried instances. R@1,
R@5, R@10 are adopted as metrics. The higher R@K in-
dicates better performance. To show overall matching per-
formance, we sum up all recall values as R@sum at both
image-to-text and text-to-image directions.
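For reference, a plain sketch of computing R@K from a score matrix is shown below; it assumes one ground-truth match per query on the diagonal, whereas Flickr30K and MSCOCO pair each image with five captions, so a full evaluation script would need to account for that.

```python
import torch

def recall_at_k(scores, k):
    """R@K for one retrieval direction, assuming scores[i, i] is the ground-truth pair."""
    ranks = scores.argsort(dim=1, descending=True)                     # (Q, C) ranked candidate indexes
    gt = torch.arange(scores.size(0), device=scores.device).view(-1, 1)
    hits = (ranks[:, :k] == gt).any(dim=1).float()                     # 1 if ground truth is in top K
    return 100.0 * hits.mean().item()                                  # percentage of queries recalled
```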
4.2 Implementation Details
We utilize the Faster R-CNN detector to extract $N = 36$ region proposals from each image and then obtain a 2048-dimensional feature for each region. We set the word embedding dimension as 300. The dimension of the vision-language shared embedding space $D$ is set as 1024 and the dimension of the distance-based similarity vectors $P$ is 256. In region semantic extending, we extract the $K = 3$ nearest detected regions in each of the top, bottom, left, and right scopes. For regions near the image border whose scopes are incomplete, we use the region itself to fill the missing slots, and further randomly discard one region among the scopes to reduce the visual-context distortion caused by this supplementation. The layer number $L$ of the self-attentional mech-
anism for relevance measuring is 3. The Adam optimizer
with 0.0002 as the initial learning rate is employed for model
optimization. The learning rate is decayed by a factor of 10 after
40 epochs in training on Flickr30K, and after 20 epochs in
training on MSCOCO. The margin λin triplet loss function
is empirically set as 0.2. Source codes will be released.
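For quick reference, the hyperparameters stated above can be collected into a single configuration sketch; the values are those reported in this section, while the dictionary layout itself is illustrative rather than part of any released code.

```python
# Hyperparameters reported in Section 4.2 (dictionary layout is illustrative).
CONFIG = {
    "num_regions": 36,          # N: Faster R-CNN proposals per image
    "region_feat_dim": 2048,    # raw region feature size
    "word_embed_dim": 300,
    "shared_dim": 1024,         # D: vision-language shared embedding space
    "sim_dim": 256,             # P: distance-based similarity vectors
    "neighbors_per_scope": 3,   # K: nearest regions per top/bottom/left/right scope
    "reasoning_layers": 3,      # L: self-attentional reasoning depth
    "optimizer": "Adam",
    "lr": 2e-4,                 # decayed x0.1 after 40 (Flickr30K) / 20 (MSCOCO) epochs
    "margin": 0.2,              # lambda in the triplet loss
}
```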
4.3 Comparisons with State-of-the-art Methods
We compare our proposed CMCAN with recent state-of-the-
art methods on the Flickr30K and MSCOCO datasets (for a fair comparison, all methods use the same feature extraction backbones, i.e., Faster R-CNN for images and a Bi-GRU for text). The experimental results are cited di-
rectly from respective papers. Comparison results are shown
in Table 1 and Table 2 for Flickr30K and MSCOCO 1K,
respectively. Note that our proposed CMCAN can achieve
performance improvements on all metrics, compared to the
state-of-the-art methods. On the Flickr30K test set, CMCAN
outperforms other methods with R@1=79.5% for text re-
trieval and R@1=60.9% for image retrieval, obtaining per-
formance improvements of 1.7% and 2.4%, respectively. On
the MSCOCO 1K test set, our proposed CMCAN achieves
the performance with R@1=81.2% for text retrieval and
R@1=65.4% for image retrieval, which is a remarkable im-
provement. Our proposed CMCAN can outperform state-
of-the-art methods by a large margin of 8.2% and 5.0%
in terms of the overall performance R@sum on Flickr30K
and MSCOCO, respectively. As shown in Table 3, CMCAN
outperforms state-of-the-art models on almost all evaluation
metrics when testing on the MSCOCO 5K test set, reaching R@1=61.5% for text retrieval and R@1=44.0% for image retrieval, improvements of over 3.7% and 2.1%, respectively. The
consistently remarkable performance of CMCAN demon-
strates its effectiveness and robustness.
Method | Text Retrieval: R@1 R@10 | Image Retrieval: R@1 R@10
CAMP (Wang et al. 2019b) 50.1 89.7 39.0 80.2
SCAN (Lee et al. 2018) 50.4 90.0 38.6 80.4
CAAN (Zhang et al. 2020b) 52.5 90.9 41.2 82.9
VSRN (Li et al. 2019) 53.0 89.4 40.5 81.1
IMRAM (Chen et al. 2020) 53.7 91.0 39.7 79.8
MMCA (Wei et al. 2020) 54.0 90.7 38.7 80.8
SGRAF (Diao et al. 2021) 57.8 91.6 41.9 81.3
CMCAN (ours) 61.5 92.9 44.0 82.6
Table 3: Comparisons on MSCOCO 5K test images.
4.4 Ablation Study
To show the effectiveness of the matching confidence in cross-modal relevance measurement, we enumerate the performance of relevance measuring with and without matching confidence on Flickr30K in Table 4.
Figure 3: Visualization of the matching confidence. Brighter regions receive higher confidence w.r.t. the text, i.e., the consistency
degree with the global perspective of image-text. Results show CMCAN can accurately locate the really described regions.
Method | Text Retrieval: R@1 R@5 R@10 | Image Retrieval: R@1 R@5 R@10 | R@sum
without confidence 75.9 93.6 97.2 58.4 82.3 86.6 494.0
with confidence 77.5 94.3 96.9 58.8 82.9 88.9 499.3
CMCAN 79.5 95.6 97.6 60.9 84.3 89.9 507.8
Table 4: Ablation on Flickr30K. “without confidence” in-
dicates the cross-modal relevance measuring without confi-
dence, and “with confidence” is the opposite. CMCAN aver-
ages the relevance scores of two trained models in inference.
The cross-
modal relevance measurement with matching confidence
outperforms that without the confidence on almost all met-
rics in both image retrieval and text retrieval directions.
Specifically, the relevance measurement with matching con-
fidence obtains improvements of 1.6% on R@1 and 0.7%
on R@5 for text retrieval, 2.3% on R@10 for image re-
trieval, and 5.3% on the overall performance R@sum. CM-
CAN, which averages the cross-modal relevance scores of
two trained models, outperforms the relevance measurement
without confidence by 13.8% and measurement with confi-
dence by 8.5% on the overall performance R@sum.
4.5 Qualitative Analysis
To verify the effectiveness of CMCAN, we visualize the
learned matching confidence in Figure 3, which shows only the highest confidence in each image for brevity. The confidence is able to highlight the image regions that are really semantically consistent with the text to be matched, and guides the model to focus on the key scene in image-text matching. We also show the top-3 retrieval results given both image queries and text queries in Figure 4. It can be seen that images with
similar contents are distinguished, since the inferred match-
ing confidence can capture subtle visual clues.
Figure 4: Case study, where green texts or boxes denote results consistent with the ground truth, and red ones denote mismatches.
5 Conclusion
In this paper, we present a novel Cross-Modal Confidence-
Aware Network for image-text matching, which infers the confidence of matched region-word pairs from the global perspective, enabling the model to be aware of whether a locally matched pair is really described in the text, so as to refine the image-text relevance measurement. Moreover, bridged by the
full text, we propose a delicately designed matching confi-
dence measuring method via the whole image and the visual
context of image regions. Extensive experiments are con-
ducted to demonstrate that the proposed method can significantly outperform the state of the art. Future work includes applying the confidence-aware network to other multi-modal tasks, such as image captioning and visual question answering.
Acknowledgments
This work is supported in part by National Natural Science
Foundation of China under Grant U19A2057, Science Fund
for Creative Research Groups under Grant 62121002, and
Fundamental Research Funds for the Central Universities
under Grants WK3480000008 and WK3480000010.
References
Anderson, P.; He, X.; Buehler, C.; Teney, D.; Johnson, M.;
Gould, S.; and Zhang, L. 2018. Bottom-up and top-down at-
tention for image captioning and visual question answering.
In Proceedings of the IEEE conference on computer vision
and pattern recognition, 6077–6086.
Bahdanau, D.; Cho, K. H.; and Bengio, Y. 2015. Neural ma-
chine translation by jointly learning to align and translate. In
3rd International Conference on Learning Representations,
ICLR 2015.
Chen, H.; Ding, G.; Liu, X.; Lin, Z.; Liu, J.; and Han, J.
2020. Imram: Iterative matching with recurrent attention
memory for cross-modal image-text retrieval. In Proceed-
ings of the IEEE/CVF conference on computer vision and
pattern recognition, 12655–12663.
Chen, J.; Hu, H.; Wu, H.; Jiang, Y.; and Wang, C. 2021.
Learning the best pooling strategy for visual semantic em-
bedding. In Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition, 15789–15798.
Chen, T.; and Luo, J. 2020a. Expressing objects just like
words: Recurrent visual embedding for image-text match-
ing. In Proceedings of the AAAI Conference on Artificial
Intelligence, volume 34, 10583–10590.
Chen, T.; and Luo, J. 2020b. Expressing Objects Just Like
Words: Recurrent Visual Embedding for Image-Text Match-
ing. In AAAI, 10583–10590.
Cui, H.; Zhu, L.; Li, J.; Yang, Y.; and Nie, L. 2019. Scalable
deep hashing for large-scale social image retrieval. IEEE
Transactions on image processing, 29: 1271–1284.
Diao, H.; Zhang, Y.; Ma, L.; and Lu, H. 2021. Similarity
Reasoning and Filtration for Image-Text Matching. In AAAI.
Faghri, F.; Fleet, D. J.; Kiros, J. R.; and Fidler, S. 2018.
Vse++: Improving visual-semantic embeddings with hard
negatives. In BMVC.
Gu, J.; Cai, J.; Joty, S. R.; Niu, L.; and Wang, G. 2018. Look,
Imagine and Match: Improving Textual-Visual Cross-Modal
Retrieval With Generative Models. In Proceedings of the
IEEE Conference on Computer Vision and Pattern Recogni-
tion (CVPR).
He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep resid-
ual learning for image recognition. In Proceedings of the
IEEE conference on computer vision and pattern recogni-
tion, 770–778.
Hu, Z.; Luo, Y.; Lin, J.; Yan, Y.; and Chen, J. 2019. Multi-
Level Visual-Semantic Alignments with Relation-Wise Dual
Attention Network for Image and Text Matching. In IJCAI,
789–795.
Huang, Y.; Wang, W.; and Wang, L. 2017. Instance-Aware
Image and Sentence Matching With Selective Multimodal
LSTM. In Proceedings of the IEEE Conference on Com-
puter Vision and Pattern Recognition (CVPR).
Huang, Y.; Wu, Q.; Song, C.; and Wang, L. 2018. Learning
semantic concepts and order for image and sentence match-
ing. In Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, 6163–6171.
Kipf, T. N.; and Welling, M. 2017. Semi-supervised classi-
fication with graph convolutional networks. In 5th Interna-
tional Conference on Learning Representations, ICLR 2017.
Krishna, R.; Zhu, Y.; Groth, O.; Johnson, J.; Hata, K.;
Kravitz, J.; Chen, S.; Kalantidis, Y.; Li, L.-J.; Shamma,
D. A.; et al. 2017. Visual genome: Connecting language and
vision using crowdsourced dense image annotations. Inter-
national journal of computer vision, 123(1): 32–73.
Lee, K.-H.; Chen, X.; Hua, G.; Hu, H.; and He, X. 2018.
Stacked cross attention for image-text matching. In Pro-
ceedings of the European Conference on Computer Vision
(ECCV), 201–216.
Li, K.; Zhang, Y.; Li, K.; Li, Y.; and Fu, Y. 2019. Visual se-
mantic reasoning for image-text matching. In Proceedings
of the IEEE/CVF International Conference on Computer Vi-
sion, 4654–4662.
Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ra-
manan, D.; Dollár, P.; and Zitnick, C. L. 2014. Microsoft
coco: Common objects in context. In European conference
on computer vision, 740–755. Springer.
Liu, C.; Mao, Z.; Zhang, T.; Xie, H.; Wang, B.; and Zhang,
Y. 2020. Graph structured network for image-text matching.
In Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition, 10921–10930.
Liu, Y.; Guo, Y.; Bakker, E. M.; and Lew, M. S. 2017.
Learning a recurrent residual fusion network for multimodal
matching. In Proceedings of the IEEE International Confer-
ence on Computer Vision, 4107–4116.
Nam, H.; Ha, J.-W.; and Kim, J. 2017. Dual attention net-
works for multimodal reasoning and matching. In Proceed-
ings of the IEEE conference on computer vision and pattern
recognition, 299–307.
Ren, S.; He, K.; Girshick, R.; and Sun, J. 2015. Faster r-cnn:
Towards real-time object detection with region proposal net-
works. Advances in neural information processing systems,
28: 91–99.
Shi, B.; Ji, L.; Lu, P.; Niu, Z.; and Duan, N. 2019. Knowl-
edge Aware Semantic Concept Expansion for Image-Text
Matching. In IJCAI, volume 1, 2.
Wang, L.; Li, Y.; and Lazebnik, S. 2016. Learning Deep
Structure-Preserving Image-Text Embeddings. In Proceed-
ings of the IEEE Conference on Computer Vision and Pat-
tern Recognition (CVPR).
Wang, S.; Wang, R.; Yao, Z.; Shan, S.; and Chen, X. 2020.
Cross-modal Scene Graph Matching for Relationship-aware
Image-Text Retrieval. In WACV, 1497–1506.
Wang, Y.; Yang, H.; Qian, X.; Ma, L.; Lu, J.; Li, B.; and Fan,
X. 2019a. Position Focused Attention Network for Image-
Text Matching. In IJCAI, 3792–3798.
Wang, Z.; Liu, X.; Li, H.; Sheng, L.; Yan, J.; Wang, X.; and
Shao, J. 2019b. Camp: Cross-modal adaptive message pass-
ing for text-image retrieval. In Proceedings of the IEEE/CVF
International Conference on Computer Vision, 5764–5773.
Wehrmann, J.; Kolling, C.; and Barros, R. C. 2020. Adaptive
cross-modal embeddings for image-text alignment. In Pro-
ceedings of the AAAI Conference on Artificial Intelligence,
volume 34, 12313–12320.
Wei, X.; Zhang, T.; Li, Y.; Zhang, Y.; and Wu, F. 2020.
Multi-Modality Cross Attention Network for Image and
Sentence Matching. In CVPR, 10941–10950.
Wu, Y.; Wang, S.; Song, G.; and Huang, Q. 2019. Learning
fragment self-attention embeddings for image-text match-
ing. In Proceedings of the 27th ACM International Con-
ference on Multimedia, 2088–2096.
Xu, N.; Liu, A.-A.; Wong, Y.; Nie, W.; Su, Y.; and Kankan-
halli, M. 2020. Scene graph inference via multi-scale con-
text modeling. IEEE Transactions on Circuits and Systems
for Video Technology, 31(3): 1031–1041.
Xu, N.; Zhang, H.; Liu, A.-A.; Nie, W.; Su, Y.; Nie, J.;
and Zhang, Y. 2019. Multi-level policy and reward-based
deep reinforcement learning framework for image caption-
ing. IEEE Transactions on Multimedia, 22(5): 1372–1383.
Xu, T.; Zhang, P.; Huang, Q.; Zhang, H.; Gan, Z.; Huang,
X.; and He, X. 2018. Attngan: Fine-grained text to image
generation with attentional generative adversarial networks.
In Proceedings of the IEEE conference on computer vision
and pattern recognition, 1316–1324.
Yan, S.; Yu, L.; and Xie, Y. 2021. Discrete-continuous Ac-
tion Space Policy Gradient-based Attention for Image-Text
Matching. In Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition, 8096–8105.
Young, P.; Lai, A.; Hodosh, M.; and Hockenmaier, J. 2014.
From image descriptions to visual denotations: New simi-
larity metrics for semantic inference over event descriptions.
Transactions of the Association for Computational Linguis-
tics, 2: 67–78.
Yu, J.; Zhang, W.; Lu, Y.; Qin, Z.; Hu, Y.; Tan, J.; and Wu,
Q. 2020. Reasoning on the relation: Enhancing visual rep-
resentation for visual question answering and cross-modal
retrieval. IEEE Transactions on Multimedia, 22(12): 3196–
3209.
Zhang, C.; Yang, Z.; He, X.; and Deng, L. 2020a. Multi-
modal intelligence: Representation learning, information fu-
sion, and applications. IEEE Journal of Selected Topics in
Signal Processing, 14(3): 478–493.
Zhang, Q.; Lei, Z.; Zhang, Z.; and Li, S. Z. 2020b. Context-
aware attention network for image-text retrieval. In Proceed-
ings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition, 3536–3545.
Zhu, L.; Lu, X.; Cheng, Z.; Li, J.; and Zhang, H. 2020.
Deep collaborative multi-view hashing for large-scale im-
age search. IEEE Transactions on Image Processing, 29:
4643–4655.