Enhancing Video-Language Representations
with Structural Spatio-Temporal Alignment
Hao Fei, Member, IEEE, Shengqiong Wu, Meishan Zhang,
Min Zhang, Tat-Seng Chua and Shuicheng Yan, Fellow, IEEE
Abstract—While pre-training large-scale video-language models (VLMs) has shown remarkable potential for various downstream video-language tasks, existing VLMs can still suffer from certain commonly seen limitations, e.g., coarse-grained cross-modal alignment, under-modeling of temporal dynamics, and a detached video-language view. In this work, we target enhancing VLMs with a fine-grained structural spatio-temporal alignment learning method (namely Finsta). First of all, we represent the input texts and videos with fine-grained scene graph (SG) structures, both of which are further unified into a holistic SG (HSG) for bridging the two modalities. Then, an SG-based framework is built, where the textual SG (TSG) is encoded with a graph Transformer, while the video dynamic SG (DSG) and the HSG are modeled with a novel recurrent graph Transformer for spatial and temporal feature propagation. A spatial-temporal Gaussian differential graph Transformer is further devised to strengthen the sense of the changes in objects across spatial and temporal dimensions. Next, based on the fine-grained structural features of the TSG and DSG, we perform object-centered spatial alignment and predicate-centered temporal alignment, respectively, enhancing the video-language grounding in both spatiality and temporality. We design our method as a plug-and-play system, which can be integrated into existing well-trained VLMs for further representation augmentation, without training from scratch or relying on SG annotations in downstream applications. On 6 representative VL modeling tasks over 12 datasets in both standard and long-form video scenarios, Finsta consistently improves 13 existing strong-performing VLMs, and refreshes the current state-of-the-art end-task performance significantly in both the fine-tuning and zero-shot settings.
Index Terms—Video-Language Understanding, Structured Semantics Learning, Spatio-Temporal Grounding, Scene Graphs
1 INTRODUCTION
Recently, large language model (LLM) pre-training over various data modalities (e.g., texts, images and videos) has shown amazing potential in ushering in human-level intelligence, such as GPT4 [1], PaLM-E [2], BLIP-2 [3], Flamingo [4] and LLaVA [5]. Among them, the topic of video-language model (VLM) pre-training has received increasing research attention [6], [7], [8], [9], [10]. Compared with vision-text modeling, which focuses mainly on individual visual semantic understanding, video understanding goes beyond static images and requires comprehension of both spatial semantics and temporal dynamics, due to the nature of a sequence of frames over time. Extensive efforts have been made to learn effective VLMs, facilitating a broad range of downstream video-language (VL) tasks [11].
Despite the promising progress, existing VLMs can still be subject to certain common yet crucial issues that are intrinsically raised by the nature of video-text modality heterogeneity. Consequently, the performance of downstream VL tasks may still fall short of optimal.
First, coarse-grained cross-modal aligning. Current works extensively perform alignment either between the overall video and text representations [12], [13], or over extracted frame patches [14], [15]. Nevertheless, these two modalities are unequal in carrying information, e.g., texts are limited and succinct while videos encompass dense surplus contents, which inevitably leads to ineffective alignment under such a coarse-grained manner. For example, in video captioning, this can lead to generating brief captions without sufficient detail. As highlighted in Figure 1, the correspondences between video and text can actually be fine-grained, i.e., at the level of co-referred objects of interest.
Hao Fei, Shengqiong Wu and Tat-Seng Chua are with the School of Computing, National University of Singapore, Singapore. E-mail: {haofei37, dcscts}@nus.edu.sg, swu@u.nus.edu
Meishan Zhang and Min Zhang are with Harbin Institute of Technology (Shenzhen), China. (Corresponding author: Meishan Zhang.) E-mail: {zhangmeishan, zhangmin2021}@hit.edu.cn
Shuicheng Yan is with Skywork AI, Kunlun 2050 Research, Singapore. E-mail: shuicheng.yan@kunlun-inc.com
Second, under-modeling of temporal dynamics. Language uses abstract words (e.g., predicates and adverbial modifiers) to express complicated actions, while videos depict events through the dynamic changes of specific scenes across consecutive frames. For example, in Figure 1, the textual action 'sit down on' is depicted by the tracking process of the 'man' object in the video. This suggests that a delicate motion-level alignment is needed to model the temporal-dynamics correspondences of video-text data. Unfortunately, existing research mostly adopts a straightforward manner of video temporality modeling, i.e., temporal attention or pooling over the overall frames [16], [17]. Consequently, e.g., for the video temporal localization task where modeling temporal dynamics is required, VLMs largely fail to accurately match textual content with the corresponding video temporal content.
Last, detached video-language view. While the two modalities come with shared features, there can also be rich differentiated information. Intuitively, texts provide abstract expressions (e.g., emotions and feelings), while videos are visually sensible signals (e.g., colors and appearances).
[Figure 1 shows the input text "A young man with his knitted scarf and skateboard quickly sits down on a bench in an autumn park and drinks coffee hastily out of a paper cup.", the input video, and the textual and video scene graphs parsed from each, with co-referred concepts linked across the two modalities.]
Fig. 1: Representing the text and video with the corresponding scene graphs (SGs) enables more fine-grained control of
video-language correspondence learning (same colors denote same concepts), as the SG representations depict the intrinsic
modal-agnostic semantic structures of texts or videos. Best viewed in color.
Such distinctions can serve complementary roles to each other for a holistic multimodal semantic understanding. Yet current work focuses exclusively on VL alignment, while mostly treating the unaligned part of the features as noise and aggressively discarding it without modeling [6], [18]. In reasoning-intensive scenarios, e.g., video question answering, such VLMs cannot fully leverage both modality-shared and modality-complementary information for in-depth reasoning.
We argue that fine-grained structured representations of video and text are indispensable for comprehensive VL understanding, as we human beings always grasp the underlying semantic structure before further reasoning over VL. In this work, we consider representing the input video and text with scene graph (SG) representations [19]. As shown in Figure 1, by depicting the intrinsic semantic relations of the contents in texts or videos with structured, modal-agnostic representations, SGs enable fine-grained control of VL learning. Based on the SG representations, we further make certain customizations before using them for our purpose. First, we slightly retrofit the existing textual SG (TSG) definition [20] by adding a predicate attribute node to support the adverbial modifiers of actions (e.g., 'quickly', 'hastily'), such that the TSG expression of motions is augmented. Second, for the dynamic SG (DSG) of the video [21], we connect the SG sequence of the DSG as a whole by creating a type of temporal coreference edges across different DSG frames. Finally, we unify the TSG and DSG with a type of cross-modal coreference edges, resulting in an overall holistic SG (HSG) of both video and language. Via the HSG (cf. Figure 2), we are able to secure a comprehensive view of the multimodal semantics.
Built upon the TSG, DSG and HSG, we then propose a fine-grained structural spatio-temporal alignment learning (namely Finsta) framework. Our framework has a dual-stream-sum architecture, as shown in Figure 3. We first adopt a graph Transformer (GTrm) [22] model for the highly-parallel graph encoding of the TSG. Based on the GTrm, we then devise a novel recurrent graph Transformer (R-GTrm) for the spatio-temporal propagation of the DSG and HSG. We further propose a spatial-temporal Gaussian differential graph Transformer (STGD-GTrm) for strengthening the perception of the changes in objects across spatial and temporal dimensions, which learns to differentiate between moving nodes and stationary nodes. Next, based on the structural features, we perform fine-grained VL alignment learning with respect to both spatiality and temporality. Specifically, a high-order object-centered spatial contrastive (OSC) learning and a high-order predicate-centered temporal contrastive (PTC) learning are introduced for the cross-modal alignment learning between the TSG and DSG. Finally, we present a representation transfer mechanism, in which we inject the well-aligned VL feature representations into a host VLM, which can be any existing VLM. With such a plug-and-play design, we can enhance the VL representations of any well-trained VLM by post-training it with our Finsta module, without expensive pre-training from scratch, meanwhile avoiding introducing the noise of SG parsing when applying the model to downstream applications.
Extensive experiments are conducted on 6 representative video-language modeling tasks over 12 datasets in both standard and long-form video scenarios, i.e., Video Action Recognition [23], Video Captioning [24], Video-Text Retrieval [25], Video Question Answering [26], Video-Paragraph Retrieval [27] and Long-Form Video Question Answering [16]. Results show that our Finsta framework consistently improves 10 existing strong-performing VLMs and 3 recent LVLMs, and helps push the state of the art of VL end tasks significantly in both the fine-tuning and zero-shot settings. Via further analyses, we verify that the proposed method effectively addresses the aforementioned bottlenecks of VL learning, including coarse-grained cross-modal aligning, under-modeling of temporal dynamics and insufficient VL collaboration. We also show empirical analyses to quantify the contributions of each module in Finsta, and explore the effects of a range of potential factors. Finally, we discuss the system's efficiency,
and offer a collection of case studies to provide a direct
insight into how Finsta’s advancements unfold.
In summary, this work contributes in four key aspects:
• To our knowledge, we are the first to comprehensively enhance video-language representation learning with structured fine-grained spatio-temporal alignment learning based on SG representations.
• Based on the GTrm, we devise a novel R-GTrm model for the spatial-temporal feature encoding of video. We further propose an STGD-GTrm to strengthen the perception of the changes in objects across spatial and temporal dimensions, differentiating between moving and stationary nodes.
• We propose novel high-order object-centered spatial contrastive and high-order predicate-centered temporal contrastive learning strategies to realize the fine-grained spatio-temporal cross-modal alignment.
• Our method empirically boosts the current state-of-the-art VLMs on a broad range of downstream VL understanding tasks. Besides, our framework is designed as a plug-and-play module, which can be easily applied to many existing VLMs.
2 RELATED WORKS
2.1 Video Language Modeling And Learning
Recently, large-scale language models have demonstrated stunning abilities in human-level semantic understanding [2]. Such triumph of text-based LLMs has soon expanded to multimodal models, such as GPT4 [1], BLIP-2 [3] and Flamingo [4], by further associating different modalities together, e.g., text, vision and video. Among them, VLMs have received increasing research attention [6], [7], [8]. Building on top of LLMs, video LLMs (i.e., large VLMs, a.k.a. LVLMs) have also witnessed rapid development, such as Video-LLaMA [28], Video-ChatGPT [9] and Video-LLaVA [10]. VLM pre-training aims to learn a strong joint representation between the two modalities by training models on large-scale video-text pairs. VLMs greatly facilitate a wide range of downstream VL understanding tasks, such as Video Captioning [29], [30], Video-Text Retrieval [25] and Video Question Answering [31], [32]. Aligned with the success of the prior pre-training of image-language models (ILMs) [15], [33], [34], some VLMs directly extend the cross-modal features of existing ILMs into VLMs. This is intuitive, as ILMs and VLMs can share much similar multimodal feature representations. By taking existing ILMs as initialization, one can build a VLM at much lower cost, i.e., with fewer video-text pairs and without training from scratch [35].
Based on such intuition, most of the existing VLMs take a similar idea of vision-text alignment learning as in ILMs, where the major focus has been paid to the vision-language alignment [12], [33], [34]. However, videos can be essentially different from static visual inputs. Compared with vision-text modeling, which focuses mainly on individual visual semantic understanding, video understanding goes beyond static images, requiring comprehension of both spatial semantics and temporal dynamics, due to the nature of a sequence of frames over time. Unfortunately, several key characteristics of the video-language modality heterogeneity have not been well considered in existing VLMs. The first one is coarse-grained video-text alignment. Existing VLMs mostly perform alignment either between the overall video and text representations [12], [36], or over extracted regional frame patches [14], [15]. Yet, texts and videos are unequal in carrying information, e.g., texts are limited and succinct while videos encompass dense surplus contents, which inevitably leads to ineffective alignment under such a coarse-grained manner. Besides, the temporal dynamics is the pivotal crux of video modality understanding, which however is not carefully modeled in existing VLMs. As we stressed earlier, language has a natural gap with video. Language often uses abstract words (e.g., predicates and adverbial modifiers) to express complicated actions, while videos describe events through the dynamic changes of specific scenes across consecutive frames. Thus, a delicate motion-level alignment for the temporal-dynamics correspondence modeling of video-text data is required. In short, fine-grained consideration should be paid to both the spatial and temporal modeling of VLMs. To our knowledge, we are the first to model the fine-grained video-language spatio-temporal alignment by making use of SG representations.
2.2 Scene Graph Representations
This study also closely relates to the application of scene graph representations [19]. In SGs, the object and attribute nodes are connected by certain relations, and such graph structures intrinsically describe the underlying semantic meaning of the input data. With such characteristics, SGs have been extensively adopted as auxiliary features for downstream applications, e.g., Image Retrieval [19] and Vision Captioning [37]. In this work, we take advantage of SG representations and perform cross-modal spatial-temporal alignment upon the SG structures. We consider the textual SG [20] to represent the language and the dynamic SG [21] for the videos, both of which depict the modality-agnostic semantics. We further construct a type of holistic SG over the TSG and DSG to maintain an overall unified VL representation. To our knowledge, this is also the first attempt in the literature to connect the SG representations of both video and language.
In this paper, we also investigate the encoding approach of the fine-grained SG representations. An SG is naturally depicted as a graph structure; thus graph neural models such as GCN [38] and GAT [39] can be used for SG encoding. However, we consider the Transformer architecture [40] for the graph modeling, as the self-attention calculation of the Transformer allows highly parallel computation. Moreover, the Transformer architecture adapts more easily to existing VLM backbones, such as ViT [41], VIOLET [42], BEiT [43] and more [3], [33]. We thus adopt the existing graph Transformer (GTrm) [22] to model the TSG structure. On the other hand, the DSG has a temporal sequential characteristic. Instead of using an RGNN [39] for the DSG encoding, we newly devise a recurrent graph Transformer (R-GTrm), for which we draw the main inspiration from recurrent networks [44]. Built upon the GTrm propagation, the video dynamics is additionally modeled in R-GTrm through the temporal coreference edges of nodes.
3 SCENE GRAPH CONSTRUCTION
As aforementioned, we represent the input video and text with the DSG and TSG representations.
[Figure 2 panels: the Textual Scene Graph parsed from the text, the Dynamic Scene Graph parsed from the video, and the Holistic Scene Graph unifying the two; the legend distinguishes object, attribute and relation nodes, temporal coreference edges, and cross-modal coreference edges (linked at the first occurrence).]
Fig. 2: We represent the input text and video with textual scene graph (TSG) and dynamic scene graph (DSG), respectively.
We unify the TSG and DSG into a holistic SG (HSG) by adding the cross-modal coreference edges.
In the following, we describe the construction of the DSG and TSG for the video and text, as well as the HSG that combines the two.
3.1 Dynamic Scene Graph (DSG)
A DSG describes a video as a sequence of temporally consecutive SGs. Typically, each single visual SG comprises three types of nodes: object, attribute, and relation nodes. As illustrated in Figure 2, the visual object nodes are connected by certain relations, and objects are attached to their attributes. The raw DSG maintains a visual SG for each video frame; however, video frames are usually redundant in content and cause huge computation costs. Thus, we first perform keyframe extraction for the video, such that the dense redundant video frames can be effectively compacted. A clustering-based method [45] is used to extract the significant keyframes that faithfully keep the salient event contents at a proper sampling rate. We record the raw timestamps $\tau_i$ of the resulting frames in the raw video as the key temporal information. Then, these frames are fed into a parser to produce a static visual SG for each keyframe [46].
We follow the most common practice and employ Faster R-CNN [47] as the object detector to obtain all the object nodes, where for each node $v^D_i$ we use 1) the object's neural representation $f_i$, 2) the object category label $c_i$ in the Visual Genome (VG) dataset, and 3) the bounding box of the object $b_i$ (the 2D coordinates in the image, i.e., $(x_l, x_r, y_u, y_d)$). We then use MOTIFS as a relation classifier to obtain the relational edges as well as the relation labels, and an attribute classifier to obtain the attribute nodes. All nodes (i.e., $v^D_i$ and $v^D_j$) are connected with edges $e^D_{i,j}$.
Since each single SG of the DSG is separate in the graph sequence, we consider connecting them as a whole. We create a type of temporal coreference edges, $e^D_{t-1 \rightarrow t}$, for the objects across different SG frames, which is essentially a process of object tracking. We realize this by measuring the intersection over union (IoU) of the bounding boxes ($b^{D,t-1}_i$ and $b^{D,t}_j$) of two objects with the same object label ($c^{D,t-1}_i = c^{D,t}_j$):
$$a^D(v^{D,t-1}_i, v^{D,t}_j) = \mathrm{IoU}(b^{D,t-1}_i, b^{D,t}_j \mid c^{D,t-1}_i = c^{D,t}_j). \quad (1)$$
Essentially, the same object across time always comes with consecutive spatial movements under a consistent label. We compare the object pairs between two consecutive SGs, and those with an IoU value $a^D(v^{D,t-1}_i, v^{D,t}_j) > \gamma^D$ are considered co-referred nodes, between which we create a temporal coreference edge. By assembling the SGs of all the video keyframes via the temporal coreference edges, we obtain the resulting DSG ($\mathcal{G}^D$) for the overall video, as illustrated in Figure 2. Formally,
$$\mathcal{G}^D = \{\mathcal{G}^D_1, \cdots, \mathcal{G}^D_t \mid \mathcal{E}^D_{1\rightarrow 2}, \cdots, \mathcal{E}^D_{t-1\rightarrow t}\}, \quad (2)$$
where each SG frame $\mathcal{G}^D_t = (\mathcal{V}^D_t; \mathcal{E}^D_t)$, with
$$\mathcal{V}^D_t = \{v^{D,t}_i\} = \{(f_i, c_i, b_i, \tau)^{D,t}\}, \quad \mathcal{E}^D_t = \{e^{D,t}_{i,j}\}, \quad (3)$$
and the temporal coreference edges
$$\mathcal{E}^D_{t-1\rightarrow t} = \{e^D_{1\rightarrow 2}, \cdots, e^D_{t-1\rightarrow t}\}. \quad (4)$$
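For illustration, the following is a minimal sketch of how the temporal coreference edges of Eq. (1) could be built between two consecutive keyframe SGs; the node container, field names and threshold name are our own illustrative assumptions rather than the paper's actual data structures.

```python
# Illustrative sketch of temporal coreference edge creation (Eq. 1); data structures are assumed.
def iou(box_a, box_b):
    """Boxes as (x_left, x_right, y_up, y_down); returns intersection-over-union."""
    xl, xr = max(box_a[0], box_b[0]), min(box_a[1], box_b[1])
    yu, yd = max(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, xr - xl) * max(0.0, yd - yu)
    area = lambda b: (b[1] - b[0]) * (b[3] - b[2])
    union = area(box_a) + area(box_b) - inter
    return inter / union if union > 0 else 0.0

def temporal_coreference_edges(prev_nodes, curr_nodes, gamma_d=0.6):
    """Each node is a dict with 'label' (category) and 'box'.
    Returns index pairs (i_prev, j_curr) to connect with a temporal coreference edge."""
    edges = []
    for i, u in enumerate(prev_nodes):
        for j, v in enumerate(curr_nodes):
            # Same category label and sufficient bounding-box overlap -> co-referred object.
            if u["label"] == v["label"] and iou(u["box"], v["box"]) > gamma_d:
                edges.append((i, j))
    return edges
```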
3.2 Textual Scene Graph (TSG)
The key difference between the TSG and DSG lies in that the TSG comes with only one single graph frame. Similar to the visual SGs, TSGs also include three types of nodes: object, attribute, and relation nodes. The objects are the textual entities within a scene, and each object has affiliated attributes connected to it to describe its properties. Note that the object nodes in visual SGs are image regions, while the object nodes in the TSG are textual tokens, which also serve as the category labels of those objects. So we only maintain the token/label $c^T_i$ as the node $v^T_i$. Different nodes (i.e., $v^T_i$ and $v^T_j$) are connected with edges $e^T_{i,j}$. Here relations can be either persistent correlations (e.g., 'in', 'with' and 'next to') or dynamic predicate words (e.g., 'drink', 'sit down on' and 'hold'). However, the raw TSG definition fails to support the adverbial modifiers of the dynamic predicates. For example, in the sentence of Figure 1, the SG does not include the action modifiers 'quickly' for 'sit down on' and 'hastily' for 'drink'. We note that this leads to important information loss, since in the VL scenario the video can naturally depict such action states via its temporal characteristics. Thus we introduce a type of dynamic attribute node for predicates, i.e., adverbial modifiers. We illustrate the retrofitted TSG in Figure 2.
In practice, we can obtain the TSG of a text via an off-the-shelf TSG parser [20].
[Figure 3(a) sketches the Finsta framework, with the TSG, DSG and HSG parsed from the input text and video and fed into the TSG encoder, DSG encoder and HSG encoder; Figure 3(b) details the R-GTrm cell, with stacked layers of graph attention heads over nodes and edges, each followed by concatenation, Add&Norm and feed-forward sublayers.]
Fig. 3: (a) The high-level view of our fine-grained structural spatio-temporal alignment learning (Finsta) framework based
on the dual-stream-sum architecture. And (b) the detailed dataflow of the recurrent graph Transformer (R-GTrm).
We first convert the sentences into dependency trees with a dependency parser [48], which are then transformed into graphs based on the rules defined in [20]. To make the TSG support the adverbial modifiers of the dynamic predicates, we retrofit the existing TSG parser and keep the adverbial words or phrases within the dependency tree, such that the TSG gains a type of attribute node attached to predicates. Formally, we denote the resulting TSG as $\mathcal{G}^T$:
$$\mathcal{G}^T = (\mathcal{V}^T; \mathcal{E}^T), \quad (5)$$
where
$$\mathcal{V}^T = \{v^T_i\} = \{(c_i)^T\}, \quad \mathcal{E}^T = \{e^T_{i,j}\}. \quad (6)$$
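As a small illustration of the retrofitted TSG, the sketch below attaches predicate-attribute nodes for adverbial modifiers on top of an already-parsed set of TSG triples; the triple format and the "advmod" relation label follow common dependency-parsing conventions and are assumptions here, not the exact interfaces of [20] or [48].

```python
# Hedged sketch: attach adverbial modifiers as attribute nodes of predicates in a TSG.
# Assumed inputs: tsg_edges as (subject, predicate, object) triples from a base TSG parser,
# and dep_arcs as (head_word, dep_label, dependent_word) triples from a dependency parser.
def add_predicate_attributes(tsg_edges, dep_arcs):
    predicates = {pred for (_, pred, _) in tsg_edges}            # e.g., "sit down on", "drink"
    new_attrs = []
    for head, label, dependent in dep_arcs:
        if label != "advmod":                                    # keep only adverbial modifiers
            continue
        for pred in predicates:
            if head in pred.split():                             # head word belongs to a predicate
                new_attrs.append((pred, "has_attr", dependent))  # e.g., ("sit down on", ..., "quickly")
                break
    return tsg_edges + new_attrs

# Toy usage on (hypothetical) parses of the Figure 1 sentence:
edges = [("man", "sit down on", "bench"), ("man", "drink", "coffee")]
arcs = [("sit", "advmod", "quickly"), ("drink", "advmod", "hastily")]
print(add_predicate_attributes(edges, arcs))
```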
3.3 Holistic Scene Graph (HSG)
Given a text-video pair, we expect their semantic contents to be well matched. Yet there is inevitably a disparity between a paired text and video. To make full use of the differentiated information and secure a comprehensive view of the multimodal semantics, we consider a combined view of these two modalities. Technically, with the above paired TSG and DSG at hand, we can unify them by creating a type of cross-modal coreference edges, $e^C_{v^T_i \rightarrow v^D_j}$, via which the objects in the TSG link to the corresponding objects in the DSG. Specifically, we measure the semantic similarity between any pair of text and image objects from the TSG and DSG, respectively, via the CLIP encoder [34]. We match the text label ($c^T_i$) against both the visual representation ($f^D_j$) and the visual object label ($c^D_j$):
$$a^C(v^T_i, v^D_j) = \frac{1}{2}\big[\,\mathrm{CLIP}(c^T_i, c^D_j) + \mathrm{CLIP}(c^T_i, f^D_j)\,\big]. \quad (7)$$
Pairs with a matching score $a^C(v^T_i, v^D_j)$ higher than $\gamma^C$ are considered cross-modal co-referred nodes, between which we create a cross-modal coreference edge. Note that the corresponding objects in the DSG are linked only at their first occurring SG frame, i.e., we only link the TSG nodes to the first potentially occurring nodes of the DSG. This results in an overall holistic SG (HSG) of both video and language, marked as $\mathcal{G}^C$, as shown in Figure 2. Formally, the HSG can be given as:
$$\mathcal{G}^C = \{\mathcal{G}^C_0, \mathcal{G}^C_1, \cdots, \mathcal{G}^C_t \mid \mathcal{E}^C_{v^T_i \rightarrow v^D_j}\}, \quad (8)$$
where the HSG merges the single frame of the TSG into the frame sequence of the DSG as the first frame $\mathcal{G}^C_0$:
$$\mathcal{G}^C_0 = \mathcal{G}^T, \quad \{\mathcal{G}^C_1, \cdots, \mathcal{G}^C_t\} = \{\mathcal{G}^D_1, \cdots, \mathcal{G}^D_t\}, \quad (9)$$
and the cross-modal coreference edges are
$$\mathcal{E}^C_{v^T v^D} = \{e^C_{v^T_i v^D_j}\}. \quad (10)$$
4 ARCHITECTURE OF FINSTA FRAMEWORK
We now present the fine-grained structural spatio-temporal alignment learning (dubbed Finsta) framework to encode the TSG, DSG and HSG representations, which constitute the overall VLM system. As shown in Figure 3(a), Finsta has a dual-stream-sum architecture.
4.1 Spatiality Encoding with Graph Transformer (GTrm)
First, for the TSG $\mathcal{G}^T$, only the fine-grained spatiality of the scene needs to be taken care of. Thus we consider adopting a graph Transformer (GTrm) [22] to model the TSG. Compared with general graph neural networks [38], [39], the GTrm
advances in both graph topology modeling and the highly-parallel computation of the Transformer architecture [40]. The GTrm has $L$ stacked layers, where the update of the representation $h^l_i$ of node $v_i$ at the $l$-th layer is given by:
$$h^{l+1}_i = O^l_h \,\big\Vert_{k=1}^{H} \Big( \sum_{j \in \mathcal{N}_i} w^{k,l}_{i,j} V^{k,l} h^l_j \Big), \quad (11)$$
$$h^{l+1}_i = \mathrm{Norm}(h^{l+1}_i + h^l_i), \quad (12)$$
$$h^{l+1}_i = \mathrm{Norm}(\mathrm{FFN}(h^{l+1}_i) + h^{l+1}_i), \quad (13)$$
where $k$ indexes the $H$ attention heads, $O^l_h \in \mathbb{R}^{d\times d}$ is the output projection over the attention heads, and $\Vert$ denotes concatenation. The concatenation is followed by the feed-forward layer (FFN) and the Add&Norm layer with residual connection. The attention weight $w^{k,l}_{i,j}$ is computed by the $k$-th self-attention head:
$$w^{k,l}_{i,j} = \mathrm{Softmax}_j\!\left(\frac{Q^{k,l} \cdot K^{k,l}}{\sqrt{d_{k_1}}}\right)\cdot E^{k,l}, \quad (14)$$
where $E^{k,l} \in \mathbb{R}^{d\times d} = W_E\{e_{i,j}\}$ is the embedding of edge $e^T_{i,j}$ in the TSG. Here all $K, Q, V \in \mathbb{R}^{d\times d}$ are derived from the node representations of the previous layer via Eq. (13):
$$K^{k,l} = W_K\{h^l_j\}, \quad Q^{k,l} = W_Q\{h^l_j\}, \quad V^{k,l} = W_V\{h^l_j\}. \quad (15)$$
The initial node representation $h^0_i$ is the embedding of the textual label of the TSG node. We gather all $h^{l+1}_i$ into $H^{l+1}$.
For the graph edge representation, a similar process to the node propagation is applied:
$$e^{l+1}_{i,j} = O^l_e \,\big\Vert_{k=1}^{H} \big( w^{k,l}_{i,j} \big), \quad (16)$$
$$e^{l+1}_{i,j} = \mathrm{Norm}(e^{l+1}_{i,j} + e^l_{i,j}), \quad (17)$$
$$e^{l+1}_{i,j} = \mathrm{Norm}(\mathrm{FFN}(e^{l+1}_{i,j}) + e^l_{i,j}). \quad (18)$$
By gathering all $e^{l+1}_{i,j}$, we obtain the resulting $E^{l+1}$.
4.2 Spatiality-Temporality Encoding with Recurrent
Graph Transformer (R-GTrm)
The DSG is characterized by temporal dynamics, compared with the single-frame TSG. Thus, we devise a novel recurrent graph Transformer (R-GTrm), for which we draw the main inspiration from recurrent networks [44]. As shown in Figure 3(b), built upon the GTrm propagation, the dynamics are additionally modeled in the R-GTrm through the temporal coreference edges of nodes within the DSG ($\mathcal{G}^D = \{\mathcal{G}^D_1, \cdots, \mathcal{G}^D_t\}$), essentially modeling the tracking of objects over time:
$$h^{l+1,t}_i = O^{l,t}_h \,\big\Vert_{k=1}^{H} \Big( \sum_{j \in \mathcal{N}_i} w^{k,l,t}_{i,j} V^{k,l,t} h^{l,t}_j \Big), \quad (19)$$
$$h^{l+1,t}_i = \mathrm{Norm}(h^{l+1,t}_i + h^{l,t}_i), \quad (20)$$
$$h^{l+1,t}_i = \mathrm{Norm}(\mathrm{FFN}(h^{l+1,t}_i) + h^{l+1,t}_i). \quad (21)$$
Each attention head at the $t$-th time frame in the DSG is given by:
$$w^{k,l,t}_{i,j} = \mathrm{Softmax}_j\!\left(\frac{\hat{Q}^{k,l,t} \cdot K^{k,l,t}}{\sqrt{d_{k_2}}}\right)\cdot E^{k,l,t}. \quad (22)$$
Here $E^{k,l,t} \in \mathbb{R}^{d\times d} = W_E\{e^t_{i,j}\}$ is the embedding of edge $e^{D,t}_{i,j}$ at time step $t$ in the DSG. As in Eq. (15), $K, Q, V$ are all derived from the node representations of the corresponding frames via Eq. (21). The initial node representation $h^{0,t}_i$ is the concatenation of 1) the exported object neural representation $f^t_i$, 2) the embedding of the node label $c^t_i$, and 3) the embedding of the raw frame timestamp $\tau^t_i$. Compared with the GTrm, the update of the R-GTrm representation $h^{l,t}_i$ of node $v_i$ at time step $t$ further fuses the feature of the prior frame $t{-}1$, via an automatic gate $\eta^t_q$:
$$\hat{Q}^{k,l,t} = (1 - \eta^t_q)\cdot Q^{k,l,t-1} + \eta^t_q\cdot Q^{k,l,t}, \quad (23)$$
$$\eta^t_q = \mathrm{Sigmoid}(W_q \cdot Q^{k,l,t} \cdot K^{k,l,t}). \quad (24)$$
With $\hat{Q}^{k,l,t}$, we perform the same follow-up propagation. The graph edge propagation shares the same process as in the GTrm encoding of the TSG:
$$e^{l+1,t}_{i,j} = O^{l,t}_e \,\big\Vert_{k=1}^{H} \big( w^{k,l,t}_{i,j} \big), \quad (25)$$
$$e^{l+1,t}_{i,j} = \mathrm{Norm}(e^{l+1,t}_{i,j} + e^{l,t}_{i,j}), \quad (26)$$
$$e^{l+1,t}_{i,j} = \mathrm{Norm}(\mathrm{FFN}(e^{l+1,t}_{i,j}) + e^{l,t}_{i,j}). \quad (27)$$
By gathering all $e^{l+1,t}_{i,j}$, we obtain the resulting $E^{l+1,t}$.
As the HSG has the same temporal property as the DSG, we use another R-GTrm to encode the HSG, i.e., $\mathcal{G}^C$:
$$\{H^C_0, H^C_1, \cdots, H^C_t\} = \text{R-GTrm}^C(\{\mathcal{G}^C_0, \mathcal{G}^C_1, \cdots, \mathcal{G}^C_t\}). \quad (28)$$
For simplicity, we denote the final TSG node feature matrix from the GTrm as $H^T = \{h^T_1, \cdots, h^T_i\}$, the matrix of the DSG from the R-GTrm as $H^D = \{H^D_1, \cdots, H^D_t\} = \{h^D_{1,1}, \cdots, h^D_{t,i}\}$ ($t$ denotes the frame dimension), and that of the HSG as $H^C = \{H^C_0, H^C_1, \cdots, H^C_t\}$. Also, we use the resulting node representations of the TSG and DSG from the GTrm and R-GTrm (i.e., at the last, $L$-th layer), respectively, to initialize the node representations of the HSG:
$$\{H^{C,0}_0, H^{C,0}_1, \cdots, H^{C,0}_t\} \leftarrow \{H^{T,L}, H^{D,L}_1, \cdots, H^{D,L}_t\}. \quad (29)$$
4.3 Spatial-Temporal Gaussian Differential Graph Transformer (STGD-GTrm)
The above R-GTrm still fails to adequately perceive the changing spatial positions of objects. A significant consequence is the under-modeling of the distinction between stationary and moving objects. We emphasize that a key step in modeling video dynamics is to distinguish between objects in motion and those at rest (i.e., foreground vs. background). To this end, based on the R-GTrm, we further design a spatial-temporal Gaussian differential graph Transformer (STGD-GTrm). The key idea is to enable the graph Transformer to perceive the changes in objects across spatial and temporal dimensions. To illustrate this, we plot the strength (i.e., the distribution density, described by the following distribution kernel) of the changes of objects between two consecutive frames (or keyframes) in Figure 4. We see that such spatial-temporal change might intrinsically follow a Gaussian distribution, where objects that move more markedly tend to have larger and sharper energy, while stationary objects that move slowly tend to have lower strength. Thus, we propose to model the spatio-temporal differential of the graph nodes of the DSG or HSG along their tracks given by the temporal coreference edges, with a Gaussian distribution.
[Figure 4 panels: the temporally co-referred DSG with object tracking across video frames; the Gaussian distribution of the spatial-temporal change (differential); and the STGD-GTrm inserted between the R-GTrm cells at steps t-1 and t, using Gaussian-kernel attention heads followed by concatenation, feed-forward and Add&Norm sublayers.]
Fig. 4: Illustration of the STGD-GTrm for modeling the spatio-
temporal changes.
Technically, the spatial-temporal proximity of a node $v_i$ between the consecutive times from $t'$ to $t$ is captured with a Gaussian kernel, depicted as follows:
$$\kappa(v^t_i \mid \mathcal{N}_i) = \exp\left(-\frac{\Vert \mathcal{N}_i \Vert \cdot \Vert p^t_i - p^{t-\triangle t}_i \Vert^2}{\sigma_z \cdot \sum_{j\neq i,\, j\in\mathcal{N}_i} \Vert p^t_j - p^{t-\triangle t}_j \Vert^2}\right), \quad (30)$$
where $\triangle t = t - t'$, $\mathcal{N}_i$ denotes the neighborhood of nodes adjacent to $v_i$, $p^t_i$ is the centroid of the bounding box of object $v_i$ (marked as $b_i$), and $\sigma_z$ is a spatial scale of the proximity. By observing the spatial movement (position change) against the neighbors, it is also possible to distinguish those false-positive moving objects caused by camera motion.
As shown in Figure 4, during the propagation of the R-GTrm from $t{-}1$ to $t$ at the $l$-th layer, an STGD-GTrm at this layer is inserted between these two R-GTrm frames. With the $k$-th Gaussian kernel attention head $\kappa(v^t_i)$ (denoted as $\kappa^{k,l,t}$ for unification), the self-attention encoder is given as:
$$z^{k,l,t}_i = \mathrm{Softmax}_j\!\left(\frac{Q^{k,l,t} \cdot \kappa^{k,l,t}}{\sqrt{d_{k_3}}}\right)\cdot V^{k,l,t}. \quad (31)$$
With $k$ attention heads, we construct the resulting representation via concatenation, followed by the FFN transformation and residual connection, the same operations as in the R-GTrm:
$$Z^{l,t} = O^{l,t}_z \,\big\Vert_{k=1}^{H} \Big( \sum_i z^{k,l,t}_i \Big), \quad (32)$$
$$Z^{l,t} = \mathrm{Norm}(\mathrm{FFN}(Z^{l,t}) + Z^{l,t}). \quad (33)$$
With $Z^{l,t}$, we then retrofit the R-GTrm gating mechanism in Eq. (23) by fusing both the query feature of the prior frame $t{-}1$ and this STGD-GTrm feature:
$$\hat{Q}^{k,l,t} = (1 - \eta^t_q - \eta^t_z)\cdot Q^{k,l,t-1} + \eta^t_q\cdot Q^{k,l,t} + \eta^t_z\cdot Z^{l,t}, \quad (34)$$
$$\eta^t_q = \mathrm{Sigmoid}(W_q \cdot Q^{k,l,t} \cdot K^{k,l,t}), \quad (35)$$
$$\eta^t_z = \mathrm{Sigmoid}(W_z \cdot Z^{l,t} \cdot K^{k,l,t}). \quad (36)$$
With the new $\hat{Q}^{k,l,t}$, the subsequent calculations in the R-GTrm are carried out. In this way, the system learns to better capture the spatiality-temporality changes of objects, and is also able to automatically recognize static and dynamic nodes during the graph representation learning.
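The Gaussian differential kernel of Eq. (30) can be sketched as below, operating on the bounding-box centroids of a tracked node and its neighbors between two consecutive keyframes; the sign convention inside the exponent follows the standard Gaussian form, and the tensor layout and epsilon guard are our assumptions.

```python
# Sketch of the spatial-temporal Gaussian differential kernel (Eq. 30); layout is assumed.
import torch

def stgd_kernel(p_curr, p_prev, neighbor_curr, neighbor_prev, sigma_z=1.0, eps=1e-6):
    """p_curr/p_prev: [2] centroid of node v_i at frames t and t-dt.
    neighbor_curr/neighbor_prev: [M, 2] centroids of its M adjacent nodes at the two frames."""
    self_disp = (p_curr - p_prev).pow(2).sum()                  # ||p_i^t - p_i^{t-dt}||^2
    nbr_disp = (neighbor_curr - neighbor_prev).pow(2).sum()     # summed neighbor displacements
    m = neighbor_curr.size(0)                                   # ||N_i||
    return torch.exp(-(m * self_disp) / (sigma_z * nbr_disp + eps))

# A node whose displacement differs strongly from its neighbors' (true motion) gets a very
# different kernel value than one moving in lockstep with them (e.g., camera motion),
# which the STGD-GTrm attention of Eq. (31) then exploits.
```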
5 VIDEO-LANGUAGE REPRESENTATION LEARNING
With the fine-grained structural features learned via the
Finsta framework, we now perform representation learning,
through which we enhance the VL representations of existing
host VLMs. In what follows, we first elaborate on cross-
modal alignment learning. Then we introduce how to apply
our Finsta to an existing VLM.
5.1 Fine-grained Structural Spatio-Temporal Alignment
Learning
We divide the VL alignment learning into the spatiality and temporality perspectives, where the former focuses on the fine-grained static object-level semantic matching, while the latter concentrates on the fine-grained dynamic motion-level semantic matching. These two learning processes are carried out between the DSG and TSG encoding modules. (Note that, although the alignment learning happens only between the DSG and TSG encoders, the learned features further propagate into the HSG encoder via the subsequent feature injection and initialization, cf. Eq. (29).) Also, both fine-grained alignment learning methods are carried out automatically and unsupervisedly, i.e., without relying on external annotations or human interference.
1) High-order Object-centered Spatial Contrasting (OSC). Our idea is to encourage the object nodes in the TSG to find their correct correspondences in the DSG. We adopt contrastive learning [49] to pull semantically identical node pairs together and push apart those that differ. The fine-grained VL modeling could be carried out over single objects of texts and videos within the TSG and DSG. However, we consider a more informative manner: we perform the matching over a high-order region that is centered on each object. Intuitively, a textual object and a visual object should be treated as more similar when the object pair as well as their modifying contexts (i.e., specific attributes and even relational neighbor objects) are all matched. Supplementary Material Section 1 illustrates the high-order neighbor modeling mechanism.
Fig. 5: The predicate-centered temporal contrasting mech-
anism, where we extract the spatial region and temporal
interval for the predicate-centered temporal alignment.
For a TSG object $v^T_i$, we traverse its $n$-th order (e.g., 1st, 2nd, or 3rd order) neighbors $\mathcal{N}^T_i$, and then obtain the region representation $h^T_i$ via a pooling operation. Likewise, for a DSG object $v^D_{t,j}$, we also obtain the $n$-order neighbor representation $h^D_{t,j}$. We then measure the bipartite similarity between these two region representations, and produce the learning target:
$$S^o_{i,t,j} = \frac{(h^T_i)^{\mathrm{T}} \cdot h^D_{t,j}}{\Vert h^T_i \Vert \, \Vert h^D_{t,j} \Vert}, \quad (37)$$
$$\mathcal{L}_{\mathrm{OSC}} = -\sum_{i\in\mathcal{G}^T,\, j\in\mathcal{G}^D_t} \sum_t \log \frac{\exp(S^o_{i,t,j}/\tau_o)}{Z}, \quad (38)$$
where $\tau_o > 0$ is an annealing factor. We also define a threshold $\rho_o$ to decide the matching confidence, i.e., pairs with $S^o_{i,t,j} > \rho_o$ are considered aligned. Here $j$ represents a DSG region that is positive with respect to region $i$ in the TSG, i.e., $S^o_{i,t,j} > \rho_o$.
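A minimal sketch of the object-centered spatial contrastive objective of Eqs. (37)-(38) is given below, assuming the pooled n-order region features of TSG and DSG objects are already computed and that the positive (aligned) index pairs are known; the InfoNCE-style normalization over candidate DSG regions is our reading of the unexpanded Z term.

```python
# Hedged sketch of the OSC loss (Eqs. 37-38); the positives and the normalizer Z are assumptions.
import torch
import torch.nn.functional as F

def osc_loss(h_tsg, h_dsg, positive_pairs, tau_o=0.8):
    """h_tsg: [Nt, d] pooled n-order region features of TSG objects.
    h_dsg: [Nd, d] pooled region features of DSG objects (all frames flattened).
    positive_pairs: list of (i, j) indices treated as aligned (e.g., S > rho_o)."""
    sim = F.normalize(h_tsg, dim=-1) @ F.normalize(h_dsg, dim=-1).T   # cosine S^o (Eq. 37)
    logits = sim / tau_o
    loss = 0.0
    for i, j in positive_pairs:
        # -log( exp(S_ij/tau) / sum_j' exp(S_ij'/tau) ), i.e., InfoNCE over DSG regions.
        loss = loss - (logits[i, j] - torch.logsumexp(logits[i], dim=0))
    return loss / max(len(positive_pairs), 1)
```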
2) High-order Predicate-centered Temporal Contrasting (PTC). Modeling only the spatiality is not enough, which motivates the modeling of predicate-oriented dynamic semantics. The predicate-centered temporal alignment has a similar formulation to OSC. The aim is to find the correspondence between the textual predicates in the TSG and the dynamic motions in the DSG. Slightly different from the OSC learning, we take a predicate-centered temporal contrasting learning. Our targets are the dynamic relation nodes (i.e., predicates) in the TSG; centered on a predicate node $v^T_i$, we first find its $n$-order neighbor spatial region (whose representation is marked as $\hat{h}^T_i$) within the TSG.
Then, in the DSG, we also find such an $n$-order region for each predicate node $v^D_{t,j}$, and further slice the DSG sequence with a temporal interval, $t:t{+}m$, i.e., starting from the $t$-th frame and ending at the $(t{+}m)$-th frame in the DSG sequence. We take the pooled representation over the region features ($\hat{h}^D_{t:t+m,j}$) as the candidate counterpart of the DSG. This is illustrated in Figure 5. Thereafter, we perform the PTC learning:
$$S^p_{i,t:t+m,j} = \frac{(\hat{h}^T_i)^{\mathrm{T}} \cdot \hat{h}^D_{t:t+m,j}}{\Vert \hat{h}^T_i \Vert \, \Vert \hat{h}^D_{t:t+m,j} \Vert}, \quad (39)$$
$$\mathcal{L}_{\mathrm{PTC}} = -\sum_{i\in\mathcal{G}^T,\, j\in\mathcal{G}^D_t} \sum_t \sum_m \log \frac{\exp(S^p_{i,t:t+m,j}/\tau_p)}{Z}, \quad (40)$$
where $\tau_p$ is the annealing factor, and $\rho_p$ is the threshold for PTC. Such 'textual predicate'-'visual object tracking' alignment vividly simulates the temporal dynamics of the two modalities.
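The PTC objective of Eqs. (39)-(40) differs from OSC mainly in that the DSG side is a temporal interval; the sketch below pools a predicate-centered region over frames t..t+m before scoring, with the mean-pooling of the interval being our assumption.

```python
# Hedged sketch of the PTC similarity (Eq. 39); interval pooling is a mean over frames.
import torch
import torch.nn.functional as F

def ptc_similarity(h_pred_tsg, h_region_dsg, t, m):
    """h_pred_tsg: [d] n-order region feature centered on a TSG predicate.
    h_region_dsg: [T, d] per-frame n-order region features of a DSG predicate track.
    Returns the cosine similarity S^p between the predicate and the interval t..t+m."""
    interval = h_region_dsg[t:t + m + 1].mean(dim=0)          # pooled counterpart of the DSG
    return F.cosine_similarity(h_pred_tsg, interval, dim=0)
```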
Fig. 6: A stereoscopic illustration of registering the represen-
tations from Finsta system into a host VLM.
5.2 Representation Transfer Learning
Via the above alignment learning, the TSG and DSG representations of text and video can be well matched, and are expected to better facilitate downstream VL tasks. However, directly applying Finsta as a VLM can be problematic, because our system relies heavily on SG annotations, while parsing SG labels for all potential incoming data will inevitably introduce noise and result in inefficient application. Meanwhile, training a VLM with Finsta from scratch can be very resource-costly (i.e., requiring around 100M VL pairs), and parsing such a large number of SG annotations is impractical. To this end, we design our Finsta as a plug-and-play module and inject the well-aligned VL feature representations into a host VLM. Based on any existing VLM with a similar dual-stream-sum architecture, with a warm start, we can incrementally perform the alignment more efficiently.
Technically, we register Finsta to a host VLM with the knowledge distillation (KD) technique [50], [51]. Figure 6 illustrates the mechanism. Before Finsta propagates the messages, we first import the first-layer representations of the text encoder, video encoder and multimodal encoder of the host VLM, respectively, into Finsta's three key modules as the initial feature representations of the various SG modelings, which are seen as well-aligned visual-language embeddings. We use $R^T$, $R^D$ and $R^C$ to denote the representations of the text, video and multimodal encoders in the host VLM, respectively. Specifically, the text/video/multimodal representations of the first layer of the host VLM, $R^{T/D/C, l_1}$, are copied to Finsta as the initial input feature embeddings, i.e., $H^{T/D/C, l_0} \leftarrow X + R^{T/D/C, l_1}$, where $X$ is the node embedding of the input TSG/DSG/HSG in Finsta. By injecting the well-initialized VL feature representations into Finsta, we can warm-start the following post-training of Finsta with the host VLM.
Afterward, we distill the features of Finsta into the host VLM. The Finsta encoders perform propagation over the SG data,
and finally obtain the resulting well-aligned spatio-temporal VL features $H^{T/D/C}$ in the last layer. Next, we distill them from Finsta into the host VLM via KD, such that the host VLM encoding features learn to be similar to those in Finsta, i.e., the fine-grained spatial-temporally aligned features:
$$\mathcal{L}_{\mathrm{TRS}} = \sum_d \Big( \lambda_T \Vert R^T - H^T \Vert^2 + \lambda_D \sum_t \Vert R^D_t - H^D_t \Vert^2 + \lambda_C \sum_t \Vert R^C_t - H^C_t \Vert^2 \Big), \quad (41)$$
where $\lambda_{T/D/C}$ are the learning coefficients. We note that the representation used for distillation is the overall instance-level representation, i.e., we take the output representation of the '[CLS]' token as the transfer target. Also, the KD for feature injection only happens during the post-training phase. During fine-tuning or inference on the downstream task and data, the host VLM is able to make better predictions alone, without the attendance of Finsta. This way, the SG annotations are only needed during the post-training stage.
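Eq. (41) amounts to an L2 feature-matching distillation on the instance-level ('[CLS]') representations; the sketch below shows this for one text-video pair, with the detaching of the Finsta (teacher) features and the per-sample formulation being our assumptions.

```python
# Hedged sketch of the representation-transfer loss (Eq. 41) for one sample.
import torch

def transfer_loss(r_text, r_video, r_mm, h_text, h_video, h_mm,
                  lam_t=0.2, lam_d=0.5, lam_c=0.3):
    """r_*: [CLS]-level features from the host VLM's text / per-frame video / per-frame
    multimodal encoders (r_video, r_mm: [T, d]; r_text: [d]). h_*: corresponding Finsta
    features, treated here as (detached) distillation targets."""
    loss_t = (r_text - h_text.detach()).pow(2).sum()
    loss_d = (r_video - h_video.detach()).pow(2).sum(dim=-1).sum()   # summed over frames t
    loss_c = (r_mm - h_mm.detach()).pow(2).sum(dim=-1).sum()
    return lam_t * loss_t + lam_d * loss_d + lam_c * loss_c
```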
Remark. We expect the host VLM to have the same architecture (i.e., text encoder, video encoder and cross-modal encoder) as Finsta to achieve the plug-and-play functionality. An exactly identical architecture, however, is not a strict requirement. While the text-video-multimodal encoding architecture has been the standard paradigm in most of the existing VLMs [42], there are also a number of VLMs that do not come with strictly the dual-stream-sum architecture [10], [14]. Even if any of the three encoders is absent in the host VLM, Finsta can still work, by not distilling the Finsta features of the counterpart encoder to the absent encoder in the host VLM (i.e., removing any of the three terms in Eq. 41). But in such a case, the efficacy of Finsta will be sacrificed to a certain extent, which we analyze in the experiment section 7.3.4.
5.3 Training of Overall System
The training of the overall framework takes a warm-up paradigm. We first pre-train Finsta alone on the text-video pairs with TSG and DSG annotations, using the alignment learning objectives ($\mathcal{L}_{\mathrm{PTC}}$ and $\mathcal{L}_{\mathrm{OSC}}$). When Finsta tends to converge, we then perform the knowledge distillation and inject the Finsta representations into the host VLM as described above. The joint training involves three learning objectives: $\mathcal{L}_{\mathrm{PTC}}$, $\mathcal{L}_{\mathrm{OSC}}$ and $\mathcal{L}_{\mathrm{TRS}}$. In addition, there are also the standard VL learning objectives in the host VLM, such as the masked language modeling $\mathcal{L}_{\mathrm{MLM}}$ [16] and the overall coarse-grained video-text alignment learning $\mathcal{L}_{\mathrm{VLA}}$ [18]. We summarize all the learning objectives together:
$$\mathcal{L} = \lambda_{\mathrm{OSC}} \mathcal{L}_{\mathrm{OSC}} + \lambda_{\mathrm{PTC}} \mathcal{L}_{\mathrm{PTC}} + \lambda_{\mathrm{TRS}} \mathcal{L}_{\mathrm{TRS}} + \lambda_{\mathrm{MLM}} \mathcal{L}_{\mathrm{MLM}} + \lambda_{\mathrm{VLA}} \mathcal{L}_{\mathrm{VLA}}, \quad (42)$$
where the $\lambda$ are coefficients that change dynamically via a linear learning scheduler [52].
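Putting Eq. (42) together with the schedule reported later in Section 6.1, a hedged sketch of the weighted total loss with a simple linear scheduler for the λ coefficients is shown below; the exact scheduler of [52] and the step granularity are assumptions.

```python
# Hedged sketch of the total objective (Eq. 42) with linearly scheduled weights.
def linear(start, end, step, total_steps):
    frac = min(max(step / max(total_steps, 1), 0.0), 1.0)
    return start + (end - start) * frac

def total_loss(losses, step, total_steps):
    """losses: dict with keys 'osc', 'ptc', 'trs', 'mlm', 'vla' holding scalar tensors."""
    lam_osc = linear(0.5, 0.15, step, total_steps)   # decayed, as reported in Sec. 6.1
    lam_ptc = linear(0.5, 0.15, step, total_steps)
    lam_trs = linear(0.2, 0.5, step, total_steps)    # gradually increased
    lam_mlm, lam_vla = 0.3, 0.3                      # kept unchanged
    return (lam_osc * losses["osc"] + lam_ptc * losses["ptc"] + lam_trs * losses["trs"]
            + lam_mlm * losses["mlm"] + lam_vla * losses["vla"])
```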
6 EXPERIMENTS AND MAIN RESULTS
6.1 Experimental Settings
1) Video-Language Understanding Tasks. There is a series of representative VL modeling tasks, which in our experiments we divide into four groups: video-to-text transformation (e.g., Video Action Recognition, Video Captioning), text-to-video transformation (e.g., Video-Text Retrieval), VL collaboration (e.g., Video Question Answering), and the more challenging scenario of long-form VL understanding (e.g., Long-Form Video Question Answering, Video-Paragraph Retrieval). For each task, we use the representative datasets and measure the performance with metrics following the common practice. In Supplementary Material Section 2 we give a detailed description of all tasks with respect to the task definitions, datasets and metrics.
2) Implementation Details. Our Finsta takes a 12-layer GTrm for TSG encoding and a 12-layer R-GTrm & STGD-GTrm for DSG encoding ($L$=12). The HSG R-GTrm & STGD-GTrm encoder is a 6-layer version. The attention head number is 8 ($k$=8) throughout. We set all dimensions to 768 in our system. The hyper-parameters during post-training are set as follows, which help achieve the best effects. The initial annealing factors $\tau_o$ and $\tau_p$ are both set to 0.8. The coefficients in Eq. 41 are set as [$\lambda_T$=0.2, $\lambda_D$=0.5, $\lambda_C$=0.3] for video-to-text transformation tasks, [$\lambda_T$=0.3, $\lambda_D$=0.35, $\lambda_C$=0.35] for VL co-comprehension tasks, and [$\lambda_T$=0.35, $\lambda_D$=0.3, $\lambda_C$=0.35] for text-to-video transformation tasks. The initial weights are set as $\lambda_{\mathrm{OSC}}$=0.5 and $\lambda_{\mathrm{PTC}}$=0.5, both of which are linearly decreased from 0.5 to 0.15 along the training, while $\lambda_{\mathrm{TRS}}$=0.2 gradually increases to 0.5. $\lambda_{\mathrm{VLA}}$=0.3 and $\lambda_{\mathrm{MLM}}$=0.3 are kept unchanged. The threshold for building the temporal coreference edges, $\gamma^D$, is set to 0.6, and for the cross-modal coreference edges, $\gamma^C$, to 0.9. The $n$ in the $n$-order neighboring calculation for alignment learning is set to 3 for OSC and 4 for PTC. The alignment confidence threshold $\rho_o$ is set to 0.7 for OSC, and $\rho_p$ to 0.6 for PTC.
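For reference, the architecture and threshold settings just listed can be collected into a single configuration object, as sketched below; the field names are our own and carry no special meaning beyond mirroring the text above.

```python
# Illustrative configuration mirroring the settings reported above; field names are ours.
from dataclasses import dataclass

@dataclass
class FinstaConfig:
    tsg_layers: int = 12        # GTrm depth for the TSG encoder
    dsg_layers: int = 12        # R-GTrm & STGD-GTrm depth for the DSG encoder
    hsg_layers: int = 6         # R-GTrm & STGD-GTrm depth for the HSG encoder
    heads: int = 8
    hidden_dim: int = 768
    tau_o: float = 0.8          # OSC annealing factor
    tau_p: float = 0.8          # PTC annealing factor
    gamma_d: float = 0.6        # temporal coreference threshold
    gamma_c: float = 0.9        # cross-modal coreference threshold
    n_order_osc: int = 3
    n_order_ptc: int = 4
    rho_o: float = 0.7          # OSC alignment confidence threshold
    rho_p: float = 0.6          # PTC alignment confidence threshold
```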
3) Baselines and Backbone VLMs. We compare with strong-performing baselines on different benchmarks. We consider existing state-of-the-art video-language models as our backbones, including 10 VLMs and 3 LVLMs. We adopt VLMs which either have 1) dual-stream-sum architectures with the three text-video-multimodal encoders, such as HDVILA [18], Clover [7] and LFVILA [16], or 2) certain encoder(s) absent, such as VideoCLIP [14] & CLIP4Clip [12], which miss the cross-modal encoder, and All-in-one [53], which only has one multimodal encoder. Different (L)VLMs are pre-trained on different amounts of corpus and with different volumes of parameters. We elaborate all the VLMs' architectures, parameter sizes and pre-training data in Supplementary Material Section 3.1.
4) Post-Training Details. In the post-training, we first tune Finsta alone for 2 epochs as warm-up, using a batch size of 300. We use the AdamW optimizer with a weight decay of 5e-3 and betas (0.9, 0.98). The learning rate is first warmed up over 1 epoch to 5e-3 and then decays. All training is conducted on 16 NVIDIA A100 GPUs. For the post-training of Finsta-VLMs on normal (short-form) videos, we use a total of 50K VL pairs, with 25K sampled from WebVid-2.5M [54] and 25K sampled from HD-VILA-100M [18]. For the post-training of Finsta-LFVILA (we denote the version post-trained on the normal, short-form videos as S-Vid, and that on the long-form data as L-Vid), we further consider the use of long-form data matching what LFVILA has been trained on, i.e., 30K VL pairs sampled from LF-VILA-8M [16], where the average video duration is 100.2 sec. and the average text length is 307.9 tokens. Supplementary Material Section 3.2 extends the details of the post-training datasets.
TABLE 1: Video Action Recognition results (Acc. on Top-1 &
Top-5) on two datasets. The best results are in blue, while the
existing state-of-the-art results are
underlined
.Red scores
are the improvement of Finsta over the backbone VLMs.
Method K400 [55] SSV2 [56]
Top-1 Top-5 Top-1 Top-5
TimeSformer [57] 78.0 93.7 59.5 -
Frozen [54] 78.5 94.1 61.6 85.7
OmniVL [11] 79.1 94.5 62.5 86.2
ViViT [58] 84.9 95.8 65.4 89.8
Swin [59] 84.9 96.7 69.6 92.7
UniFormer-V2 [60] 85.4 97.0 73.0 94.5
VideoMAE [61] 87.4 97.6 75.4 95.2
HDVILA 78.6 94.0 61.3 86.2
Finsta-HDVILA 83.4 +4.8 97.6 +3.6 65.2 +3.9 90.3 +4.1
Clover 78.8 95.3 62.3 86.9
Finsta-Clover 81.2 +2.4 97.1 +1.8 64.1 +1.8 91.4 +4.5
InternVideo 91.1 98.9 77.2 95.4
Finsta-InternVideo 93.7 +2.6 99.2 +0.3 80.5 +3.3 96.7 +1.3
Video-LLaMA 86.9 97.1 74.7 93.6
Finsta-Video-LLaMA 91.2 +4.3 98.0 +0.9 76.8 +2.1 94.5 +0.9
Video-LLaVA 88.3 97.8 75.8 94.0
Finsta-Video-LLaVA 92.4 +4.2 98.8 +1.0 77.0 +1.2 95.6 +1.6
5) Scene Graph Parsing. For the TSG annotations, we mainly
follow the prior practice of SG applications. We also perform
filtering to remove objects, relations, and attributes that
appear less than 5 times in all the parsed scene graphs.
After such filtering, we obtain 7,021 objects, 2,256 relations,
and 4,895 attributes in TSGs. For each video, we flexibly
extract 10-50 keyframes while preserving their order. The key
process of DSG parsing has been elaborated in Section 3.1.
6) End-Task Fine-tuning Details. For the input videos of the host VLM, we resize and center-crop the video frames to 256×256 and split them into patches of size 16×16, giving H=W=16. The joint training with the host VLM runs for 5 to 20 epochs, using a batch size in [100, 150, 200], flexibly depending on the task and dataset used. For different downstream tasks, we mainly keep the same configuration as described in the above implementation details, with only a few settings further tuned. The scores of the different models from our implementations are averaged over five runs with random seeds, and the results of the other baselines are copied from their original papers (for which we mark their citations).
6.2 Main Results on VL Modeling Tasks
We first evaluate the fine-tuning performance of Finsta on a wide range of VL tasks. (Due to the space limit, we present the complete results on more datasets with more metrics for each task in Supplementary Material Section 4.)
6.2.1 Results on Video-to-Text Transformation Tasks
Video Action Recognition. Table 1 presents the overall performance on the K400 and SSV2 datasets, where the Finsta-enhanced VLMs are also compared with the existing strong-performing systems. As seen, InternVideo has been the existing state-of-the-art system for this task. However, our Finsta further helps InternVideo improve by 2.6% and 3.3% Top-1 accuracy on the two datasets, respectively, which makes Finsta-InternVideo the new state of the art on both benchmarks. Overall, all the VLMs witness performance increases to different extents. Notably, HDVILA is enhanced by above 4% accuracy on average.
TABLE 2: Video Captioning results on two datasets. M:
METEOR; B@4: BLEU@4.
Method YouCook2 [62] MSR-VTT [29]
M B@4 M B@4
VideoBERT [35] 11.0 4.1 - -
OmniVL [11] 14.8 8.7 - -
IcoCap [63] - - 30.3 46.1
HiTeA [13] - - 30.7 49.2
VLAB [64] - - 33.4 54.6
UniVL [6] 22.4 17.3 29.7 45.7
Finsta-UniVL 23.6 +1.2 18.4 +1.1 33.4 +3.7 49.0 +3.3
HDVILA 13.5 8.2 32.4 46.0
Finsta-HDVILA 18.8 +5.3 12.7 +4.5 36.9 +4.5 48.6 +2.6
Clover 14.2 9.0 34.1 47.5
Finsta-Clover 18.6 +4.4 12.5 +3.5 38.8 +4.7 49.3 +1.8
All-in-one 12.0 7.2 32.7 48.3
Finsta-All-in-one 13.1 +1.1 7.6 +0.4 34.2 +1.5 50.0 +1.7
Video-LLaMA 18.5 12.7 30.7 56.3
Finsta-Video-LLaMA 21.3 +2.8 16.3 +3.6 33.0 +2.3 57.1 +0.8
Video-LLaVA 20.0 16.4 33.5 57.0
Finsta-Video-LLaVA 23.0 +3.0 17.0 +0.6 35.3 +1.8 58.9 +1.9
TABLE 3: Video-Text Retrieval results on two datasets.
Method LSMDC [31] DiDeMo [65]
R@1 R@5 R@1 R@5
OA-Trans [66] 18.2 34.3 34.8 64.4
ALPRO [67] - - 35.9 67.5
Frozen [54] 15.0 30.8 31.0 59.8
CAMoE [68] 22.5 42.6 43.8 71.4
UMT-L [69] 43.0 65.5 70.4 90.1
VIOLET 16.1 36.6 32.6 62.8
Finsta-VIOLET 20.4 +4.3 39.1 +2.5 37.8 +5.2 66.0 +3.2
VideoCLIP 18.6 36.0 32.8 63.0
Finsta-VideoCLIP 21.8 +3.2 38.3 +2.3 35.5 +2.7 64.5 +1.5
CLIP4Clip 21.6 41.8 43.4 70.2
Finsta-CLIP4Clip 22.9 +1.3 43.0 +1.2 45.2 +1.8 71.3 +1.1
MCQ 17.9 35.4 37.0 62.2
Finsta-MCQ 22.8 +4.9 40.2 +4.8 41.7 +4.7 66.5 +4.3
HDVILA 17.4 34.1 28.8 57.4
Finsta-HDVILA 25.3 +7.9 46.3 +12.2 41.3 +12.5 70.9 +13.5
Clover 24.8 44.0 50.1 76.7
Finsta-Clover 28.9 +4.1 48.8 +4.8 56.0 +5.9 82.8 +6.1
All-in-one 20.5 38.8 32.7 61.4
Finsta-All-in-one 22.0 +1.5 39.9 +1.1 33.2 +0.5 62.1 +0.7
Video-LLaMA 38.0 62.5 67.5 82.4
Finsta-Video-LLaMA 44.8 +6.8 66.7 +4.2 70.2 +2.7 85.7 +3.3
Video-LLaVA 40.6 61.5 71.2 88.7
Finsta-Video-LLaVA 43.5 +2.9 67.2 +5.7 73.6 +2.4 90.3 +1.6
3.3% Top-1 accuracy on the two datasets, respectively, which makes Finsta-InternVideo the new state-of-the-art on both benchmarks. Overall, all the VLMs show performance increases to different extents. Notably, HDVILA is enhanced by more than 4% accuracy on average.
Video Captioning. Table 2 shows the results on the two benchmarks. First of all, we can see that all the VLMs receive clear improvements from Finsta, to varied extents. With Finsta, the state-of-the-art performance on both datasets is further refreshed. This validates the effectiveness of our method for enhancing video-to-language understanding tasks. Also, besides the dual-stream-sum VLMs, the All-in-one VLM, which only has one shared multimodal encoder, is included. Interestingly, compared with other combinations, Finsta-All-in-one receives the least
TABLE 4: Video Question Answering results (Acc.) on three datasets: MSR-VTT-QA [26], MSVD-QA [26], and TGIF-Frame [32]. MC: multiple-choice type; OE: open-ended type.
Method MSR-VTT MSVD TGIF
MC OE OE OE
ClipBERT [15] 88.2 37.4 48.7 64.7
ALPRO [67] - 42.1 45.9 -
OmniVL [11] - 44.1 51.0 -
HiTeA [13] 97.2 45.4 55.6 73.2
VLAB [64] - 49.6 61.0 79.0
VIOLET 91.9 43.9 47.9 65.3
Finsta-VIOLET 94.6 +2.7 47.4 +3.5 54.6 +6.7 69.8 +4.5
VideoCLIP 92.1 41.3 48.9 67.0
Finsta-VideoCLIP 95.3 +3.2 45.8 +4.5 53.7 +4.8 69.2 +2.2
HDVILA 93.1 40.0 50.7 68.3
Finsta-HDVILA 96.3 +3.2 45.4 +5.4 53.3 +2.6 71.8 +3.5
Clover 95.0 42.5 51.1 71.6
Finsta-Clover 97.9 +2.9 47.8 +5.3 54.6 +3.5 75.8 +4.2
All-in-one 92.3 44.3 47.9 66.0
Finsta-All-in-one 94.0 +1.7 45.2 +0.9 50.0 +2.1 68.1 +2.1
Video-LLaMA 95.3 48.3 57.2 75.3
Finsta-Video-LLaMA 98.1 +2.8 51.3 +3.0 62.7 +5.5 82.4 +7.1
Video-LLaVA 96.8 48.0 65.7 77.5
Finsta-Video-LLaVA 99.4 +2.6 51.7 +3.7 72.5 +6.8 83.1 +5.6
improvements. This is largely because of the absence of two key modules, i.e., the separate textual encoder and video encoder.
6.2.2 Results on Text-to-Video Transformation Task
Video-Text Retrieval. We use a total of 9 VLMs as back-
bone, where we further compare with 5 strong-performing
baselines. We present the overall results in Table 3. As seen,
our Finsta still consistently improves all the VLMs by clear
margins. Among them, the Finsta-Video-LLaMA and the
Finsta-Video-LLaVA have been the new state-of-the-arts on
all the datasets. Notably, we find that HDVILA benefits
from Finsta the most, with an average of 10.3% recall rate.
Likewise, the multimodal-encoder-only All-in-one VLM has
received the least boosts from Finsta. Also, without the
cross-modal encoder, the enhancements for VideoCLIP and
CLIP4Clip VLMs are limited, compared with all the rest
VLMs with a full dual-stream-sum architecture.
6.2.3 Results on Video-Text Collaboration Task
Video Question Answering. Table 4 presents the results
on both the multi-choice QA and open-ended QA. Similar to
the aforementioned tasks, all different backbone VLMs are
improved via our Finsta system. Among them, the Finsta-
Video-LLaMA and the Finsta-Video-LLaVA surpass all the
existing performances, setting new state-of-the-art records
on all the datasets in different QA settings. Most significantly,
Finsta boosts Video-LLaVA by 6.8% accuracy in MSVD-QA
data. Also there is a similar trend for the All-in-one VLM,
which gets the most conservative improvement.
6.2.4 Results on Long-form Video-Text Tasks
Video-Paragraph Retrieval. As presented in Table 5,
Finsta helps all different VLMs achieve different levels of
improvement, in which two points can be observed. First,
we see that the LFVILA (L-Vid) has shown much stronger
performance than the LFVILA (S-Vid). Also, Finsta-LFVILA
(L-Vid) becomes the new state-of-the-art on two datasets.
Such a gap indicates the importance of training VLMs with
long-form videos (and texts) for long-form scenarios. Besides,
TABLE 5: Video-Paragraph Retrieval results on two datasets.
Method QuerYD [70] ActivityNet [24]
R@1 R@5 R@1 R@5
TeachText [71] 14.4 37.7 - -
Frozen [54] 53.8 75.7 28.8 60.9
TESTA [27] 83.4 93.8 54.8 80.8
LFVILA 69.7 85.7 35.3 65.4
Finsta-LFVILA (S-Vid) 72.0 +2.3 87.4 +1.7 37.7 +2.4 68.0 +2.6
Finsta-LFVILA (L-Vid) 78.6 +8.9 94.5 +8.8 69.8 +34.5 92.5 +27.1
Video-LLaMA 71.5 86.2 49.4 68.4
Finsta-Video-LLaMA 78.5 +7.0 90.2 +4.0 56.5 +7.1 73.8 +5.4
Video-ChatGPT 84.4 90.5 66.8 86.1
Finsta-Video-ChatGPT 86.8 +2.4 94.1 +3.6 69.3 +2.5 89.7 +3.6
Video-LLaVA 76.8 88.0 68.4 89.3
Finsta-Video-LLaVA 80.1 +3.3 91.0 +3.0 71.6 +3.2 91.8 +2.5
TABLE 6: Long-Form Video Question-Answering results
(Acc.) on two datasets.
Method How2QA [72] VIOLIN [73]
ResNet-SF [74] 74.3 -
GVE [75] - 68.4
HERO [72] 74.3 68.6
LFVILA 76.1 70.9
Finsta-LFVILA (S-Vid) 77.5 +1.4 71.7 +0.8
Finsta-LFVILA (L-Vid) 84.8 +8.7 78.0 +7.1
Video-LLaMA 79.6 75.3
Finsta-Video-LLaMA 84.0 +4.4 77.8 +2.5
Video-ChatGPT 80.7 73.4
Finsta-Video-ChatGPT 84.5 +3.8 76.9 +3.5
Video-LLaVA 83.5 75.0
Finsta-Video-LLaVA 87.8 +4.3 77.3 +2.3
among three different LVLMs, Video-LLaMA receives the
most significant boost from Finsta.
Long-form Video Question Answering. Table 6 reports the results on two datasets. Likewise, all four systems are enhanced consistently, with the L-Vid LFVILA obtaining the biggest improvements. Finsta strikingly boosts the raw LFVILA VLM by 8.7% and 7.1% accuracy on the two datasets under the L-Vid setting, respectively. Finsta-Video-LLaVA and Finsta-LFVILA (L-Vid) become the new state-of-the-art on the two datasets, respectively. The above consistent improvements of our Finsta system in the long-form setting clearly verify its efficacy in improving VL modeling and understanding.
6.3 Zero-shot Video-Language Understanding Results
Here we examine Finsta's efficacy in the zero-shot setting, where VLMs make predictions on downstream VL tasks without fine-tuning on the corresponding training data. We representatively test on three VL tasks: Video Action Recognition, Video-Text Retrieval and Video Question Answering, each using different dataset(s). The results are presented in Table 7, from which we can gain several key observations. First of all, all VLMs show significant improvements from Finsta, notably with Finsta-InternVideo and Finsta-Video-LLaVA emerging as the new zero-shot state-of-the-art. Second, when compared to the prior fine-tuning results, the enhancements brought by Finsta in the zero-shot scenario are strikingly more evident. This underscores Finsta's role in fine-grained structured VL alignment learning, essentially providing a crucial signal that aids unsupervised
TABLE 7: Zero-shot results of Video Action Recognition, Video-Text Retrieval and Video Question Answering.
Method K400 [55] LSMDC [31] MSVD [76]
Top-1 R@1 R@5 OE
Frozen [54] - 9.3 22.0 8.9
OmniVL [11] - 8.6 18.7 -
HiTeA [13] - 15.5 31.1 18.2
ImageBind [77] 50.0 10.4 27.9 14.5
LanguageBind [78] 64.0 16.4 30.8 12.3
VIOLET 44.8 8.6 23.2 11.0
Finsta-VIOLET 59.7 +14.9 15.5 +6.9 35.5 +12.3 27.2 +16.2
VideoCLIP 36.4 9.5 16.3 7.6
Finsta-VideoCLIP 46.7 +10.3 12.4 +2.9 19.8 +3.5 15.7 +8.1
CLIP4Clip 48.3 11.3 22.7 10.6
Finsta-CLIP4Clip 54.5 +6.2 13.9 +2.6 26.7 +4.0 18.5 +7.9
MCQ 41.3 12.2 25.9 8.2
Finsta-MCQ 55.2 +13.9 15.8 +3.6 37.5 +11.6 25.8 +17.6
InternVideo 53.3 11.0 25.8 13.5
Finsta-InternVideo 68.9 +15.6 21.9 +10.9 39.3 +13.5 29.8 +16.3
Video-LLaMA 60.3 13.5 26.3 20.3
Finsta-Video-LLaMA 69.7 +9.4 23.9 +10.4 37.8 +11.5 36.6 +16.3
Video-LLaVA 64.3 15.5 29.8 25.1
Finsta-Video-LLaVA 72.2 +7.9 22.0 +6.5 43.5 +13.7 45.3 +20.2
learning. Finsta takes full advantage of the external semantic scene structure features (i.e., SGs) to enhance VL understanding, which gives rise to these improvements. Lastly, in contrast to the other VLMs with a full dual-stream-sum architecture, the lowest improvements of VideoCLIP and CLIP4Clip suggest that the absence of certain modules can significantly limit the benefit brought by Finsta.
7 DISCUSSIONS AND IN-DEPTH ANALYSES
In this section, we provide further discussion via a series of in-depth analyses to reveal how the system advances.4 In the following, we aim to answer five key research questions:
RQ-1: Does Finsta improve VLMs by really addressing
the bottlenecks of VLMs?
RQ-2: How much does each module contribute to the
overall Finsta?
RQ-3: What factors influence Finsta’s performance? How
do different factors impact Finsta?
RQ-4: What is Finsta's computational cost? Is Finsta efficient relative to its efficacy?
RQ-5: How exactly does Finsta advance better VLMs?
7.1 RQ-1: Does Finsta Address VLM Bottlenecks?
In the Introduction we highlight the imperative of resolving three key bottlenecks of existing strong-performing VLMs in modeling video and language. Here we investigate whether the enhancement of Finsta over backbone VLMs really comes from addressing these main challenges. To directly answer the question, we consider a human evaluation of VL modeling. We select four VLMs: VIOLET, HDVILA, InternVideo and Video-LLaVA. We sample 300 long video-language pairs from the ActivityNet [24] test set for zero-shot Video Captioning, Question Answering and VL Retrieval, with each video containing more than 8 actions
4 Due to the space limit, we present more experiments and analysis results in Supplementary Material Section 5.
[Plot for Fig. 7: scores from 0.3 to 1.0 on Video Temporal Dynamics, Video-Language Aligning and Video-Language Collaboration, with results for Finsta-InternVideo, InternVideo, Finsta-VIOLET, VIOLET, Finsta-Video-LLaMA, Video-LLaVA, Finsta-HDVILA and HDVILA.]
Fig. 7: Video-Paragraph Retrieval results with varying num-
bers of video actions in ActivityNet data.
[Plots for Fig. 8: R@1 and R@3 for (a) static entity-object correspondence and (b) dynamic predicate-action tracking correspondence, comparing Finsta-HDVILA and Finsta-Clover with their w/o L_OSC and w/o L_PTC variants.]
Fig. 8: Evaluation of fine-grained TSG and DSG node
correspondence.
to imitate the challenging action-complex video scenario. We ask 5 trained volunteers to assess the performance of each VLM in terms of Video-Language Aligning, Video Temporal Dynamics and Video-Language Collaboration based on the VLMs' outputs. In Figure 7 we plot the change before and after equipping the VLMs with Finsta, where Finsta evidently enhances the abilities in these three aspects.
Further, we evaluate how well Finsta achieves fine-grained video-language alignment/grounding. We probe the exact correspondences between the two modalities, in terms of the static token-object alignment and the dynamic predicate-action tracking alignment, respectively. We measure each pair of nodes from the TSG and DSG, i.e., the bipartite scores S^o_{i,t,j}(v^T_i, v^D_{t,j}) and S^p_{i,t:t+m,j}(v^T_i, v^D_{t:t+m,j}). Note that here we degrade the order n to 0, i.e., only considering the object node itself. We mainly examine the HDVILA and Clover VLMs equipped with our Finsta. The results are shown in Figure 8. As seen, both 1) the spatial textual-entity and visual-object alignments and 2) the dynamic temporal predicate-action grounding are well captured by the Finsta system.
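As a concrete reference for this probing, the sketch below computes a cosine-similarity bipartite matching between TSG and DSG node embeddings and reports recall@k. It is our own simplification under the n = 0 setting above, and omits the high-order neighborhood terms that the full scores S^o and S^p involve.

import torch
import torch.nn.functional as F

def node_recall_at_k(tsg_nodes, dsg_nodes, gold_match, k=1):
    """tsg_nodes: (N_t, d) text-node features; dsg_nodes: (N_v, d) visual-node
    features; gold_match[i] is the index of the DSG node aligned with TSG node i.
    Returns recall@k of the cosine-similarity bipartite matching (order n = 0)."""
    sim = F.normalize(tsg_nodes, dim=-1) @ F.normalize(dsg_nodes, dim=-1).T
    topk = sim.topk(k, dim=-1).indices                 # (N_t, k) best DSG candidates
    hits = (topk == gold_match.unsqueeze(-1)).any(dim=-1).float()
    return hits.mean().item()

# Toy usage with random features
t, v = torch.randn(8, 512), torch.randn(20, 512)
gold = torch.randint(0, 20, (8,))
print(node_recall_at_k(t, v, gold, k=3))               # R@3 over the toy pairs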
One direct assessment of temporal dynamics modeling is through the Language Video Localization task, which requires precisely localizing specific temporal moments in an untrimmed video given a language query. We experiment with the same challenging ActivityNet data, and compare our VLMs with state-of-the-art models including
TABLE 8: Model ablation of Finsta-HDVILA and Finsta-LFVILA on different datasets.
Method K400 (VAR, Top-1) MSR-VTT (VC, M) DiDeMo (VTR, R@1) MSVD (VQA, Acc.) ActivityNet (VPR, R@1) How2QA (LF-VQA, Acc.) AVG
(The first four columns are evaluated with HDVILA as the backbone and the last two with LFVILA (L-Vid); the trailing value in each ablation row is the average change relative to the full Finsta-[VLM].)
Finsta-[VLM] 83.4 36.9 49.3 53.3 69.8 84.8 62.9
SG Representation Integration
w/o Temporal coref. edge in DSG 82.1 33.3 47.0 52.2 59.7 81.0 59.2 -3.7
w/o Adverbial modifier in TSG 83.0 34.0 47.2 51.9 61.1 82.9 60.0 -2.9
w/o X-modal coref. edge in HSG 82.7 35.0 48.2 52.6 62.4 82.0 60.5 -2.4
SG Encoding
GTrm → GAT 82.5 36.2 48.9 53.0 57.7 81.5 59.9 -2.9
R-GTrm → RGNN 82.0 35.2 48.5 52.7 55.3 80.1 58.9 -3.9
w/o STGD-GTrm 81.7 35.4 47.2 52.0 52.1 78.6 57.8 -5.1
Alignment Learning
w/o L_OSC 80.9 33.4 46.8 51.6 50.9 78.6 57.0 -5.9
w/o high-order 83.0 35.9 48.8 53.1 57.0 79.7 59.6 -3.3
w/o L_PTC 79.5 32.8 46.1 51.0 43.5 77.2 55.0 -7.9
w/o high-order 82.6 34.0 47.3 51.9 52.4 78.9 57.8 -5.1
[Plot for Fig. 9: R@1 at IoU=0.5 and IoU=0.7 for DRN, VSLNet, LF-VILA, Finsta-LFVILA (S-Vid), Finsta-LFVILA (L-Vid), Video-LLaVA and Finsta-Video-LLaVA.]
Fig. 9: Language Video Localization on ActivityNet data.
DRN [79] and VSLNet [80]. As shown in Figure 9, our Finsta
boosts the backbone VLMs with striking gains in grounding
the video temporality from language.
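For clarity on the metric behind Figure 9, the sketch below implements the commonly used R@1, IoU=m protocol for language video localization: a predicted moment counts as correct when its temporal IoU with the ground-truth segment reaches the threshold. This is a generic illustration of the standard protocol, not code tied to any of the compared systems.

def temporal_iou(pred, gold):
    """pred/gold: (start_sec, end_sec) temporal segments."""
    inter = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = max(pred[1], gold[1]) - min(pred[0], gold[0])
    return inter / union if union > 0 else 0.0

def recall_at_1(predictions, references, iou_thr=0.5):
    """predictions/references: lists of (start, end) pairs, one per query."""
    hits = [temporal_iou(p, g) >= iou_thr for p, g in zip(predictions, references)]
    return sum(hits) / len(hits)

# Example: a prediction of [12.0, 30.5] vs. gold [10.0, 28.0] has IoU ~= 0.78,
# so it counts as a hit under both IoU=0.5 and IoU=0.7.
print(recall_at_1([(12.0, 30.5)], [(10.0, 28.0)], iou_thr=0.7))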
7.2 RQ-2: How Much Does Each Module Contribute?
To understand the exact contribution of each component, we present an ablation study here. We analyze Finsta-HDVILA and Finsta-LFVILA on three aspects: SG representations, SG encoders and alignment learning. The results are shown in Table 8. First, by 1) canceling the adverbial modifier nodes in the TSG, 2) removing the temporal coreference edges in the DSG and 3) dropping the cross-modal coreference edges in the HSG, respectively, we observe different levels of performance decrease. Among them, the temporal coreference edges in the DSG matter the most, with an average drop of -3.7%. Further, we replace the Transformer-based GTrm with GAT [39] for TSG encoding, and replace the R-GTrm with an RGNN [81] encoder for the DSG & HSG encoding, respectively. There are considerable performance drops accordingly, indicating the effectiveness of the GTrm and R-GTrm encoders. More significantly, removing the STGD-GTrm causes a drop of -5.1%, highlighting the importance of modeling the moving changes in videos. Finally, when we cancel the alignment learning of either the object-centered spatial contrasting (L_OSC) or the predicate-centered temporal contrasting (L_PTC), there is the most significant performance degradation compared with any other factor. In particular, the temporal alignment (L_PTC) shows
Fig. 10: Analysis of the impact of thresholds γD and γC.
the most striking influence among all the modules. Further, if we downgrade the high-order feature modeling of VL alignment to a first-order manner, we also see considerable drops. This certifies the importance of the high-order feature modeling.
7.3 RQ-3: What Factors Influence Finsta?
In this part, we analyze the main factors that influence Finsta's performance.
7.3.1 Influence of Hyperparameters
We mainly study three key sets of hyperparameters: the γD & γC in constructing temporal and cross-modal coreference edges in DSG and HSG; the orders of neighbor features for OSC and PTC alignment learning; and the alignment confidence thresholds ρo & ρp.
1) Influence of Different Thresholds for Building Temporal and Cross-Modal Coreference Edges. We vary the values of γD and γC, use the constructed SG data for training the VLMs, and then examine the performance of Finsta-HDVILA on various VL tasks. In Figure 10 we plot the results. As seen, different values of γD and γC lead to distinct end-task results. We find that when γD is set to 0.6, the best quality of temporal coreference edges in the DSG can be obtained. And setting γC to 0.9 seems to be optimal
[Plots for Fig. 11: METEOR on MSR-VTT (VC), R@1 on DiDeMo (VTR) and Acc. on MSVD (VQA) under (a) n-th order feature modeling for OSC and (b) n-th order feature modeling for PTC, with neighbor depths from 1 to 5, for Finsta-HDVILA and Finsta-Clover.]
Fig. 11: Finsta-HDVILA and Finsta-Clover performance with
different n-order neighboring contexts.
[Plots for Fig. 12: METEOR on MSR-VTT (VC), R@1 on DiDeMo (VTR) and Acc. on MSVD (VQA) under (a) the alignment confidence threshold ρo for OSC and (b) the alignment confidence threshold ρp for PTC, varied from 0.1 to 0.9.]
Fig. 12: Finsta-HDVILA performance on different tasks with varied alignment thresholds ρo and ρp.
for building the cross-modal coreference edges in the HSG. This is because the video and language modalities have a larger semantic gap, which naturally demands a higher threshold for finding reliable matches.
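To illustrate how γD and γC act, the sketch below adds a coreference edge whenever the similarity between two node embeddings exceeds the threshold, with γD applied between DSG nodes of neighboring frames and γC between TSG and DSG nodes. This is our own simplification that assumes node embeddings are already available; the exact coreference criterion used in Finsta may differ.

import torch
import torch.nn.functional as F

def coref_edges(src_nodes, tgt_nodes, gamma):
    """Return index pairs (i, j) whose cosine similarity exceeds gamma.
    src_nodes: (N_s, d) and tgt_nodes: (N_t, d) node embeddings."""
    sim = F.normalize(src_nodes, dim=-1) @ F.normalize(tgt_nodes, dim=-1).T
    return (sim > gamma).nonzero(as_tuple=False)       # (E, 2) edge list

# Toy usage (with random features the edge lists may well be empty)
frame_t, frame_t1 = torch.randn(6, 256), torch.randn(7, 256)
tsg, dsg = torch.randn(5, 256), torch.randn(13, 256)
temporal_edges = coref_edges(frame_t, frame_t1, gamma=0.6)   # gamma_D = 0.6 for DSG
cross_modal_edges = coref_edges(tsg, dsg, gamma=0.9)         # gamma_C = 0.9 for HSG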
2) Influence of High-order Neighboring Modeling. Intuitively, higher-order (n) feature modeling allows larger context windows, yet at the cost of introducing more noise. In Figure 11 we present the trends for both the Finsta-HDVILA and Finsta-Clover models. As seen, the OSC learning mostly relies on 3rd-order features for the static spatial alignment, while the PTC learning needs a 4th-order context of regions for the best dynamic temporal alignment. This is reasonable, as the alignment learning for temporal dynamics modeling depends more on the contexts of the two modalities.
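The n-th-order context can be pictured as the set of nodes reachable within n hops on the SG; the sketch below gathers such an n-hop neighborhood mask from a dense adjacency matrix. It is only an illustration of the neighborhood notion, not the paper's exact high-order feature aggregation.

import torch

def n_hop_neighbors(adj, n):
    """adj: (N, N) boolean adjacency matrix (no self-loops needed).
    Returns a boolean mask of nodes reachable within n hops of each node."""
    reach = torch.eye(adj.size(0), dtype=torch.bool)
    frontier = reach.clone()
    for _ in range(n):
        frontier = (frontier.float() @ adj.float()) > 0   # expand by one hop
        reach |= frontier
    return reach

# Toy chain 0-1-2-3 plus an isolated node 4
adj = torch.zeros(5, 5, dtype=torch.bool)
for a, b in [(0, 1), (1, 2), (2, 3)]:
    adj[a, b] = adj[b, a] = True
print(n_hop_neighbors(adj, n=3)[0])   # node 0 reaches nodes 0..3 within 3 hops

Under this view, the 3rd-order setting for OSC and the 4th-order setting for PTC correspond to aggregating features from 3-hop and 4-hop neighborhoods, respectively.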
3) Influence of Different Threshold Values for Alignment Learning. We further probe the impact of setting different thresholds ρo and ρp for the fine-grained spatial and temporal alignment learning. In Figure 12 we present the results. We see that the trends of ρo and ρp can be slightly different. Overall, the best values of ρo are higher than those of ρp. This indicates that the alignment of temporal dynamics
[Plot for Fig. 13: R@5 against the amount of SG post-training data (x-axis ticks 10-50 and 200-700), with results for Finsta, HDVILA, Finsta-HDVILA, Clover and Finsta-Clover.]
Fig. 13: Finsta-VLM performance on the Video-Text Retrieval
task (DiDeMo) by post-training with varied SG data amount.
[Plots for Fig. 14: MSVD (VQA Acc.) against (a) DSG parsing performance (R@50, R@100) and (b) TSG parsing performance (R@5, R@10), for Finsta-HDVILA and Finsta-Clover.]
Fig. 14: Influences of the SG parser quality.
requires more evidence across the two modalities. In summary, we set ρo to 0.7 for OSC and ρp to 0.6 for PTC, which secures the best performance.
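Concretely, the confidence thresholds act as a gate on which node pairs are treated as positives in the alignment objectives; the sketch below is our own abstraction of such a gated contrastive step (the actual L_OSC and L_PTC losses are more involved), with ρo = 0.7 and ρp = 0.6 as selected above.

import torch
import torch.nn.functional as F

def gated_contrastive_loss(text_nodes, video_nodes, rho, tau=0.07):
    """Toy gated contrastive alignment: only text-video node pairs whose cosine
    similarity exceeds rho are kept as positives; for each such text node, the
    softmax over all video nodes supplies the negatives."""
    cos = F.normalize(text_nodes, dim=-1) @ F.normalize(video_nodes, dim=-1).T
    pos_mask = cos > rho                              # confidence gate
    if not pos_mask.any():                            # nothing confident enough
        return cos.new_zeros(())
    log_prob = F.log_softmax(cos / tau, dim=-1)       # temperature-scaled softmax
    return -(log_prob[pos_mask]).mean()

# Toy usage (random features rarely pass the gate, so the loss may be 0 here)
t, v = torch.randn(6, 128), torch.randn(10, 128)
print(gated_contrastive_loss(t, v, rho=0.7))          # rho_o = 0.7 for OSC
print(gated_contrastive_loss(t, v, rho=0.6))          # rho_p = 0.6 for PTC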
7.3.2 Influence of Post-training Data Amount
One general viewpoint is that the larger the training data, the better the resulting performance. For post-training Finsta, we use quite a limited amount of data, i.e., a total of 50K samples for normal-scene videos, which is only 0.94% of the data used for pre-training the raw Clover (5.3M), and 0.037% of that for the raw HDVILA (136M). In Figure 13 we verify this claim by evaluating the end-task (Video-Text Retrieval) performance of Finsta-VLMs post-trained with different amounts of SG data. As seen, even with fewer than 50K SG samples, both Finsta-HDVILA and Finsta-Clover quickly climb to their best performance, building on the pre-trained HDVILA and Clover VLMs. This is because the well pre-trained backbone VLMs provide a warm start for more rapid training convergence. However, if we treat Finsta as a standalone VLM and (pre-)train it from scratch with SG annotations, we find that the pre-training process requires much more data, and also results in a lower performance peak.
7.3.3 Influence of SG Parsing Quality
Our proposed Finsta system relies heavily on the availability of SGs. Here we study how the SG parser quality affects the final results. We consider varying the performance of the scene graph generation (SGG) step in SG parsing. We
[Plot for Fig. 15: R@1 on HDVILA (gains of +12.5, +10.8, +5.9, +3.5, +7.3) and Clover (gains of +5.9, +4.1, +3.6, +1.5, +2.3); legend: X, Finsta-X, Finsta-X (w/o T), Finsta-X (w/o D), Finsta-X (w/o T&D), Finsta-X (w/o C).]
Fig. 15: Influence of