ISPRS Journal of Photogrammetry and Remote Sensing 212 (2024) 422–439
https://doi.org/10.1016/j.isprsjprs.2024.05.001
Received 5 August 2023; Received in revised form 9 April 2024; Accepted 3 May 2024; Available online 18 May 2024
0924-2716/© 2024 Published by Elsevier B.V. on behalf of International Society for Photogrammetry and Remote Sensing, Inc. (ISPRS).
EarthVQANet: Multi-task visual question answering for remote sensing
image understanding
Junjue Wang a, Ailong Ma a,∗, Zihang Chen a, Zhuo Zheng b, Yuting Wan a, Liangpei Zhang a,
Yanfei Zhong a
a State Key Laboratory of Information Engineering in Surveying, Mapping, and Remote Sensing, Wuhan University, Wuhan 430074, China
b Department of Computer Science, Stanford University, Stanford, CA 94305, USA
∗ Corresponding author. E-mail address: maailong007@whu.edu.cn (A. Ma).
ARTICLE INFO
Keywords:
Visual question answering
Semantic segmentation
Multi-modal fusion
Multi-task learning
Knowledge reasoning
ABSTRACT
Monitoring and managing Earth’s surface resources is critical to human settlements, encompassing essential
tasks such as city planning, disaster assessment, etc. To accurately recognize the categories and locations of
geographical objects and reason about their spatial or semantic relations, we propose a multi-task framework
named EarthVQANet, which jointly addresses segmentation and visual question answering (VQA) tasks.
EarthVQANet contains a hierarchical pyramid network for segmentation and semantic-guided attention for
VQA, in which the segmentation network aims to generate pixel-level visual features and high-level object
semantics, and semantic-guided attention performs effective interactions between visual features and language
features for relational modeling. For accurate relational reasoning, we design an adaptive numerical loss
that incorporates distance sensitivity for counting questions and mines hard-easy samples for classification
questions, balancing the optimization. Experimental results on the EarthVQA dataset (city planning for
Wuhan, Changzhou, and Nanjing in China), RSVQA dataset (basic statistics for general objects), and FloodNet
dataset (disaster assessment for Texas in America attacked by Hurricane Harvey) show that EarthVQANet
surpasses 11 general and remote sensing VQA methods. EarthVQANet simultaneously achieves segmentation
and reasoning, providing a solid benchmark for various remote sensing applications. Data is available at
http://rsidea.whu.edu.cn/EarthVQA.htm
1. Introduction
High-spatial resolution (HSR) remote sensing images provide rich
details about object structures, facilitating macro-level observations of
human settlements and ecosystems (Wang et al.,2022b). Large-scale
HSR images often depict various ground objects with diverse spatial
and spectral patterns, which can be highly complex and heteroge-
neous (Wieland et al., 2023). To characterize land-cover semantics from HSR images, intelligent interpretation methods have been studied over the past few decades (Zhang et al., 2020b; Dimitrovski
et al.,2023). Traditional methods for recognizing land-cover objects in
HSR images rely on handcrafted visual characteristics such as object
indices (Chen et al.,2004), filter responses (Pelletier et al.,2016),
and object image analysis (Hossain and Chen,2022). However, these
methods have limited performance due to their reliance on expert ex-
perience and shallow features that cannot adequately capture complex
urban scenes (Xiao et al.,2023). In recent years, data-driven methods
represented by convolutional neural networks (CNNs) have gradually
replaced handcrafted methods, offering powerful representation abil-
ities thanks to the end-to-end hierarchical learning of deep semantic
features (Martins et al.,2020). CNNs and their variants have advanced
the development of HSR interpretation, particularly in the areas of
object detection (Kellenberger et al.,2018), road extraction (Chen
et al.,2021), and scene classification (Carbonneau et al.,2020).
To further advance comprehensive understanding, we construct a multi-level understanding process for HSR images (Fig. 1). The #1 stage denotes object extraction, which provides intuitive mapping results for users. It is implemented by a segmentation task and can also be replaced by object detection. The #2 stage involves relational reasoning about objects of interest, which provides comprehensive knowledge. Many mature solutions (Zhao et al., 2022) already exist to accurately extract the locations and categories of land-cover objects from HSR images. However, they only provide land-cover maps and cannot achieve the object-relational reasoning required by the #2 stage.
Fig. 1. Multi-level understanding process for remote sensing images. The #1 stage performs information extraction of object categories and locations, and the #2 stage involves relational reasoning for comprehensive knowledge.

Visual question answering (VQA) (Antol et al., 2015) is a task aimed at answering customized questions by searching for visual clues in the provided image. The properties of tasks are determined by the questions
asked, and various analysis processes can be embedded according to
specific applications (Abdelnour et al.,2023). VQA bridges the gap
between vision and language, and plays a crucial role in assisting
humans to understand images, such as blind reading (Tu et al.,2021),
robot conversation (Gao et al.,2021a), etc.
Preliminary research on remote sensing VQA already exists (Lobry et al., 2020; Zheng et al., 2021), and most methods directly fuse global image and language features. However, this global fusion mechanism cannot capture relations involving small-scale objects and fails to model accurate relations between multiple objects. Many computer vision VQA studies (Tan and Bansal, 2019; Anderson et al., 2018) show that local object features are very important for relational reasoning. For HSR images especially, multiple objects with varied scales require more refined analysis.
To this end, we implement the #2 stage with a novel semantic-guided VQA task. In this paper, we propose a multi-task framework, called EarthVQANet, to seamlessly combine the segmentation and VQA architectures for simultaneous learning. Guided by refined object semantics from the segmentation network, EarthVQANet can pay attention to each individual land-cover object and even its internal features, achieving complex reasoning. The main contributions of this paper are:
(1) Task-Seg: Hierarchical pyramid segmentation network for
visual features. As information extraction serves as a foun-
dation for VQA, a hierarchical pyramid network is designed
to obtain accurate land-cover object semantics. The segmenta-
tion network consists of a scalable encoder, a pyramid pooling
module and a pyramid decoder. The encoder shares the uni-
versal semantics for multi-tasks and is compatible with both
CNN and Transformer architectures. The pyramid pooling module
and pyramid decoder hierarchically capture multi-scale features
for segmentation-specific representation. Overall, the hierarchical
pyramid network provides visual features and pseudo masks with
object semantics for VQA.
(2) Task-VQA: Semantic-guided attention mechanism for relational reasoning. The semantic-guided attention shares the universal semantics with segmentation and utilizes pseudo masks as semantic guidance, which spatially injects object semantics at multiple stages of the hierarchical attention structure. The semantic-guided self-attention first enhances visual features with object semantics, reasoning about relations between objects. The semantic-guided cross-attention then performs multi-modal interactions, aggregating visual clues via keywords in questions. The bidirectional mechanism is constructed to fully fuse visual and language features, summarizing knowledge according to the posed questions.
(3) Adaptive numerical loss for balanced optimization. Faced
with questions with diverse difficulties and imbalanced answers,
we design an adaptive numerical loss for balanced optimiza-
tion. Specifically, the numerical difference penalty is added to
regression questions and hard example mining is adopted for
classification questions, jointly revising the optimization direc-
tions. Enhanced by these two strategies, adaptive numerical loss
is finally combined with segmentation loss for multi-task learning.
The proposed EarthVQANet has demonstrated remarkable superi-
ority in comprehensive tasks, providing not only accurate semantic
maps but also interactive knowledge reasoning answers. EarthVQANet
has achieved superior results on the EarthVQA dataset, RSVQA dataset
and has also been successfully applied to disaster assessment using the
FloodNet dataset. The rest of this paper is organized as follows. Section 2 introduces the related work for VQA. Sections 3 and 4 describe the details of EarthVQANet and the EarthVQA dataset. Experimental results are analyzed in Sections 5 and 6. Section 7 concludes the paper.
2. Related work
2.1. Land-cover semantic segmentation
Remote sensing semantic segmentation aims to assign a specific category to each pixel of a given image. The fully convolutional
network (FCN), as a well-established architecture, has already achieved
promising results in HSR segmentation tasks (Liu et al.,2023). Con-
sidering the rich details in HSR images, ResUNet (Diakogiannis et al.,
2020) was designed with residual connections, atrous convolutions,
and pyramid scene parsing pooling. These advanced modules con-
tribute to multi-scale object recognition in land-cover mapping tasks.
To preserve object integrity, OCNN (Martins et al., 2020) integrates deep learning features with traditional object image analysis, reducing semantically inconsistent results. To further achieve large-scale
mapping, LoveCS (Wang et al.,2022b) explores model transferability
between different sensors. With the abandonment of inductive bias,
various Transformer architectures (Wang et al.,2022a) have been
introduced into HSR segmentation. UNetFormer (Wang et al.,2022c)
designs an efficient global–local attention mechanism in the decoder,
and achieves a good trade-off between accuracy and efficiency. Due
to the excellent generalization ability, Swin-Transformer (Liu et al.,
2021) has shown great potential in many competitions, such as land-
slide detection (Ghorbanzadeh et al.,2022), land-cover mapping (Hän-
sch et al.,2022), etc. These advanced segmentation methods serve
as a solid benchmark for information extraction. EarthVQANet ex-
hibits scalability and compatibility with both FCN and Transformer
architectures.
Table 1
Comparison between EarthVQA and existing remote sensing VQA datasets.
Datasets Source Image size Resolution (m) #QA pairs Bas Ju Bas Co Rel Ju Rel Co Obj An Com An Sem Mask Land-Use
RSVQA-LR (Lobry et al.,2020) Auto 256 10 77K × × × ×
RSVQA-HR (Lobry et al.,2020) Auto 512 0.15 955K × × × ×
RSVQAxBen (Lobry et al., 2021) Auto 120 10–60 15M × × × ×
RSIVQA (Zheng et al., 2021) Semi 512–4000 0.38 111K × × × ×
HRVQA (Li et al.,2023b) Semi 1024 0.08 1070K × × × ×
CDVQA (Yuan et al.,2022b) Auto 512 0.53 122K × × × × ×
FloodNet (Rahnemoonfar et al.,2021) Auto 3000–4000 0.015 11K × ××
TextRS-VQA (Bashmal et al.,2023) Manual 256 0.06–5 6245 × × × ×
EarthVQA Semi 1024 0.3 208K
The abbreviations are: Bas Ju (Basic Judging), Rel Ju (Relational-based Judging), Bas Co (Basic Counting), Rel Co (Relational-based Counting), Obj An (Object Situation Analysis),
Com An (Comprehensive Analysis), Sem Mask (Semantic Mask).
2.2. Visual question answering
Early research considered VQA as the fusion of global image and
language features (Antol et al.,2015). The image and language features
are individually processed by CNN and Recurrent Neural Network
(RNN), and multimodal global features are fused to predict the final
answer. Stacked attention network (SAN) (Yang et al.,2016) designs
multiple attention layers to locate the visual clues layer by layer. In
order to reason about complex relations efficiently, Bottom-Up-Top-Down (BUTD) (Anderson et al., 2018) used Faster-RCNN features to introduce object features. The image regions function as a restricted attention mechanism, so the fusion model can efficiently capture key objects. Pythia (Jiang et al., 2018) further re-implemented the BUTD model and ensembled diverse models trained with different settings, achieving better performance. Cubic visual attention (Song et al.,
2018) has applied a channel and spatial attention on object regions
to enhance the associated convolutional features. By imitating the
human mislabeling behavior, semantic noisy label correction (Zhang
et al.,2023a) has improved the robustness of VQA under noisy labels.
VILBERT (Lu et al.,2019) designs a two-stream BERT to process both
visual and textual inputs in separate streams that interact through co-
attentional transformer layers. To simplify and unify this design, VL-BERT (Su et al., 2020) regards the visual and language features as inputs to one BERT model without any restriction on the attention patterns, which facilitates early and free interaction of the features. Modular Co-
Attention Network (MCAN) (Yu et al.,2019) further implements the
fusion module with Transformer to interact the visual and language
features. D-VQA (Wen et al.,2021a) integrates a unimodal bias de-
tection module, thereby efficaciously alleviating negative biases. With
the advent of large multi-modal models, the conditional generative
models (i.e., BLIP-2 Li et al.,2023a, Instruct-BLIP Dai et al.,2024) also
show promising results on generic VQA tasks. These large multi-modal
methods can be fine-tuned by injection of a few learnable parameters
when applied to remote sensing VQA tasks. Recently, many advanced
VQA algorithms (Lin et al.,2022;Gao et al.,2022;Zhang et al.,2020a)
introduce external knowledge databases, e.g., Wikidata, to improve generalizability. Besides, various research (Gao et al., 2021b; Zeng
et al.,2022) has applied VQA from single-frame images to video
interactions.
In the remote sensing community, there has been preliminary
research on VQA. As for datasets, RSVQA-LR (Lobry et al.,2020) and
RSVQA-HR (Lobry et al.,2020) constructed QA pairs from Open Street
Map (OSM) properties following the handcrafted rules. RSVQAxBen
(Lobry et al.,2021) was built on the 2018 CORINE Land Cover
database, querying land-cover types at different levels. RSIVQA dataset
(Zheng et al.,2021) collects images from existing classification and
object detection datasets (AID (Xia et al.,2017), HRRSD (Zhang et al.,
2019), etc.) and automatically generates answers from their semantic
labels. In addition, some handcrafted QA pairs were added for complex-
ity. Similar to the RSIVQA dataset, the TextRS-VQA dataset (Bashmal
et al.,2023) also collects scene classification images from AID (Xia
et al.,2017), PatternNet (Zhou et al.,2018), UC Merced (Yang and
Newsam,2010), and NWPU45 (Cheng et al.,2017) datasets. The QA
pairs are manually designed to ensure balanced questions. Similarly,
the HRVQA (Li et al.,2023b) dataset constructed automatic QA pairs
from PDOK open-data source and manually annotated some relational
reasoning-based samples. CDVQA (Yuan et al.,2022b) generates the
QA pairs for change details based on the SECOND (Yang et al.,2021)
change detection dataset. FloodNet (Rahnemoonfar et al.,2021) focuses
on the Harvey Hurricane event, and designs disaster assessment QA
pairs. As shown in Table 1, the EarthVQA dataset involves more complex and practical questions to meet city planning requirements.
Most remote sensing VQA methods (Lobry et al.,2020;Rahnemoon-
far et al.,2021) directly predict answers from the raw images, ignoring
the important information extraction stage. Prompt-RSVQA (Chappuis et al., 2022) first uses a ResNet-50 to recognize salient objects via a multi-class classification task, and then fuses them with question features via BERT to obtain the answers. The self-paced curriculum learning
(SPCL) (Yuan et al.,2022a) model learns VQA tasks in an easy-to-hard
way, using simple questions to guide the learning of difficult questions.
Spatial hierarchical reasoning network (SHRNet) (Zhang et al.,2023b)
designs a hash-based spatial multi-scale module to adaptively capture
key regions guided by question features. For complex relational reason-
ing, SOBA (Wang et al.,2024) leverages semantic features to guide the
downstream VQA task. However, the semantic segmentation and VQA
tasks are individually optimized, and directly training SOBA via multi-
task losses inevitably causes mutual exclusivity in gradients. To address
this, we propose EarthVQANet to explore simultaneous learning for
segmentation and VQA, achieving mutual promotion.
3. EarthVQANet
An overview of EarthVQANet is shown in Fig. 2, which depicts:
(a) hierarchical pyramid segmentation network, (b) semantic-guided
attention, and (c) adaptive numerical optimization. The hierarchical
pyramid segmentation network is designed to obtain land-cover seman-
tic mapping results while also generating visual features and semantic
guidance for the VQA task. The semantic-guided attention performs
interactions between visual and language features, hierarchically fusing
them for the final answer. The segmentation network and semantic-
guided attention share a siamese encoder, contributing to universal
representation learning. The adaptive numerical optimization individu-
ally models the regression and classification tasks in VQA, introducing
numerical sensitivity and sample balancing. Based on EarthVQANet,
the segmentation and VQA tasks can be achieved under a universal
architecture.

Fig. 2. Overview of the proposed universal segmentation and VQA framework (EarthVQANet). EarthVQANet mainly contains: (a) hierarchical pyramid segmentation network, (b) semantic-guided attention, and (c) adaptive numerical optimization.
3.1. Hierarchical pyramid segmentation network
Motivation. Early studies (Antol et al., 2015; Yang et al., 2016) directly predicted the final answers from the raw images. They utilize a CNN as the global feature extractor, and the pooling and flattening operations lose the spatial details. This limits the model's ability
to fit complex scenes with multiple objects. To address this, BUTD (An-
derson et al.,2018) introduced object features by using a detector
(Faster-RCNN). The bounding boxes refer to the recognized objects,
and local features within objects are aggregated with average poolings.
These object features explicitly provide initial semantic representation,
i.e., semantic categories, number of objects, etc. Inspired by BUTD,
we aim to incorporate more spatial and refined features into the
object features, thus enhancing the model’s ability to represent complex
scenes.
Design. To this end, we utilize a semantic segmenter (FCN) to
generate more refined details. The segmenter is implemented with
a hierarchical pyramid segmentation network, which comprises a siamese scalable encoder, a pyramid pooling module (PPM), and a pyramid decoder. We have tested the compatibility of the encoder with various advanced CNN and Transformer architectures, which allows for increased flexibility and accuracy in our analysis. The siamese encoder processes the input image hierarchically into four different spatial scales, namely 1/4, 1/8, 1/16, and 1/32 with respect to the raw
image. PPM (Zhao et al.,2017) is utilized to capture the multi-scale
features from the encoder outputs. In the implementation, the encoder
outputs are firstly processed by four average poolings with different
scales (1, 2, 3, 6). The resulting multi-scale features are then reduced
by 1 ×1 convolutions, resized to the same size, concatenated, and
finally passed through a 3 ×3 convolution. Similar to the feature
pyramid network (Kirillov et al.,2019;Ma et al.,2022), the pyramid
decoder gradually restores the spatial resolution through a top-down
pathway and skip-connections. The multi-scale features from different
layers are then fused using a mean function, yielding the final semantic
segmentation output.
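To make the module layout concrete, the following is a minimal PyTorch sketch of a pyramid pooling module following the description above (four average poolings with bin sizes 1, 2, 3, and 6, 1×1 channel reductions, resizing, concatenation, and a 3×3 fusion convolution). Class names, channel widths, and the usage example are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PPM(nn.Module):
    """Pyramid pooling over the deepest encoder feature map (bins 1, 2, 3, 6)."""
    def __init__(self, in_channels: int, reduction_channels: int = 256, bins=(1, 2, 3, 6)):
        super().__init__()
        # One average pooling + 1x1 convolution per pooling scale to reduce channels.
        self.stages = nn.ModuleList([
            nn.Sequential(
                nn.AdaptiveAvgPool2d(bin_size),
                nn.Conv2d(in_channels, reduction_channels, kernel_size=1, bias=False),
                nn.BatchNorm2d(reduction_channels),
                nn.ReLU(inplace=True),
            )
            for bin_size in bins
        ])
        # A 3x3 convolution fuses the original feature map with all pooled branches.
        self.fuse = nn.Sequential(
            nn.Conv2d(in_channels + len(bins) * reduction_channels,
                      reduction_channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(reduction_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h, w = x.shape[-2:]
        pyramids = [x]
        for stage in self.stages:
            # Pool to a fixed bin size, then resize back to the input resolution.
            pyramids.append(F.interpolate(stage(x), size=(h, w),
                                          mode="bilinear", align_corners=False))
        return self.fuse(torch.cat(pyramids, dim=1))

# Example: deepest (1/32) feature maps of two 1024x1024 images with 768 channels.
feat = torch.randn(2, 768, 32, 32)
print(PPM(768)(feat).shape)  # torch.Size([2, 256, 32, 32])
```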
The PPM and pyramid decoder capture the multi-scale features from
different structure levels. Possessed with scalability and multi-scale
processing ability, the hierarchical pyramid segmentation network pro-
vides a solid foundation of feature representation. This contributes
significantly to the segmentation and VQA tasks.
3.2. Semantic-guided attention for VQA
Motivation. To leverage the object features provided by the seg-
mentation network, we have designed a semantic-guided attention. As
the segmentation features contain rich semantic and spatial details, the
internal features of the encoder are shared. Besides, the segmentation
outputs are utilized as semantic guidance due to the object semantic
boundaries.
Design. The semantic-guided attention includes semantic-guided
self-attention and semantic-guided cross-attention. As shown in Fig. 2,
the semantic-guided self-attention is designed to enhance the visual
features. The semantic-guided cross-attention aims to promote multi-
modal interaction between visual and language features. The visual
features $\mathbf{F}_v \in \mathbb{R}^{H_{1/32} \times W_{1/32} \times C}$ are generated from the outputs of the scalable encoder, which preserve the spatial details and locations. $H$ and $W$ denote the spatial size of the raw image, and $C$ represents the feature dimension. The VQA and segmentation tasks share the siamese encoder to learn universal representations and boost each other.
3.2.1. Semantic guidance
In order to introduce object semantic features, we utilize the seg-
mentation output from the pyramid decoder to guide the VQA tasks.
As shown in Fig. 3, the generation of semantic guidance involves (a) spatial scaling, (b) spatial dynamic weighting, and (c) spatial
reduction. Because the segmentation output is only used for guidance,
we interrupt the gradient flow to avoid the interference of the VQA task
on the segmentation decoder. This guarantees that task-specific features
are embedded in the pyramid decoder and semantic-guided attention
separately. The related experiments are also analyzed in Table 5.
The spatial scaling aims to align the size of the decoder output with the size of $\mathbf{F}_v$. Specifically, the scaling process is implemented with three 3×3 convolutions with strides of 2, each followed by batch normalization and a ReLU activation. The spatial size of the output is successively reduced, yielding $\mathbf{S}_v$. The spatial dynamic weighting is designed to modulate semantic features based on spatial characteristics. Inspired by existing spatial attention mechanisms (Woo et al., 2018), we first reduce the channel dimension from two aspects, i.e., max pooling and average pooling. A 7×7 convolution is then adopted to aggregate the spatial characteristics via a large receptive field. After aggregation, the key regions are enhanced and normalized via a sigmoid function. The spatial scores are used to weight $\mathbf{S}_v$, which is finally flattened into $\mathbf{G}_v \in \mathbb{R}^{P \times d_m}$, where $P = H_{1/32} \times W_{1/32}$ denotes the number of visual tokens and $d_m$ is the hidden size.

Fig. 3. Generation process of semantic guidance. The semantic guidance includes object semantic features, which contribute to object consistency, spatial locations, etc. The process includes: (a) spatial scaling, (b) spatial dynamic weighting, and (c) spatial reduction.

Fig. 4. Implementation of semantic-guided self-attention and semantic-guided cross-attention. The semantic guidance is integrated at different attention stages.
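The following is a minimal sketch of this guidance path, assuming PyTorch: three stride-2 3×3 convolutions for spatial scaling, a spatial weighting branch built from channel-wise max and average pooling, a 7×7 convolution and a sigmoid, and a final flattening into tokens. Module names, channel widths, and the hidden size are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class SemanticGuidance(nn.Module):
    """Turns the (detached) segmentation output into token-wise guidance G_v."""
    def __init__(self, num_classes: int, hidden_size: int = 384):
        super().__init__()
        # (a) Spatial scaling: three stride-2 3x3 convolutions align the decoder output with F_v.
        layers, channels = [], num_classes
        for out_channels in (64, 128, hidden_size):
            layers += [nn.Conv2d(channels, out_channels, 3, stride=2, padding=1, bias=False),
                       nn.BatchNorm2d(out_channels), nn.ReLU(inplace=True)]
            channels = out_channels
        self.scale = nn.Sequential(*layers)
        # (b) Spatial dynamic weighting: 7x7 conv over [max, avg] channel maps, then sigmoid.
        self.spatial_weight = nn.Sequential(nn.Conv2d(2, 1, 7, padding=3), nn.Sigmoid())

    def forward(self, seg_logits: torch.Tensor) -> torch.Tensor:
        # Gradients are blocked so the VQA branch does not disturb the segmentation decoder.
        s = self.scale(seg_logits.detach())
        attn = self.spatial_weight(torch.cat([s.max(dim=1, keepdim=True).values,
                                              s.mean(dim=1, keepdim=True)], dim=1))
        s = s * attn
        # (c) Spatial reduction: flatten the H*W positions into P visual tokens.
        return s.flatten(2).transpose(1, 2)  # (B, P, d_m)

seg_logits = torch.randn(1, 8, 256, 256)              # 1/4-scale logits of a 1024x1024 image
print(SemanticGuidance(8)(seg_logits).shape)          # torch.Size([1, 1024, 384])
```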
3.2.2. Semantic-guided self-attention
To capture global relations between geospatial objects, we utilize
the vision transformer (Dosovitskiy et al., 2020) to construct the self-attention module. As shown in Fig. 4, the self-attention module consists of
$N_e$ Transformer blocks. Each Transformer block contains a multi-head self-attention (MSA) and a feed-forward network (FFN), embedded with layer normalizations and residual connections. MSA utilizes $M$ parallel heads with independent learnable weights. Different heads encode the feature similarities from diverse aspects. FFN consists of two linear transformation layers with a GELU activation, further performing hierarchical representation.

Before transformation, the visual features $\mathbf{X}$ are first summed with the semantic guidance $\mathbf{G}_v$. At each Transformer block, the guided visual features are transformed into query, key, and value: $\mathbf{Q} = (\mathbf{X}+\mathbf{G}_v)\mathbf{W}_q$, $\mathbf{K} = (\mathbf{X}+\mathbf{G}_v)\mathbf{W}_k$, $\mathbf{V} = (\mathbf{X}+\mathbf{G}_v)\mathbf{W}_v$. $\mathbf{W}_q, \mathbf{W}_k, \mathbf{W}_v \in \mathbb{R}^{d_m \times d_v}$ are implemented with three linear projection layers, and $d_v = d_m / M$ represents the feature dimension of each head. The self-attention utilizes the feature relations between patches to improve the representations of these patches. The process of self-attention is formulated as follows:

$$\mathrm{Attention}(\mathbf{Q},\mathbf{K},\mathbf{V}) = \mathrm{softmax}\!\left(\frac{(\mathbf{X}+\mathbf{G}_v)\mathbf{W}_q\left((\mathbf{X}+\mathbf{G}_v)\mathbf{W}_k\right)^{T}}{\sqrt{d_v}}\right)(\mathbf{X}+\mathbf{G}_v)\mathbf{W}_v \tag{1}$$

There are multiple attention operations in parallel to jointly attend to different representation subspaces. After concatenation, a linear projection layer is utilized to fuse these outputs. Formally,

$$\mathrm{MSA}(\mathbf{Q},\mathbf{K},\mathbf{V}) = \mathrm{Concat}(\mathrm{head}_1,\ldots,\mathrm{head}_M)\mathbf{W}_O, \quad \text{where } \mathrm{head}_i = \mathrm{Attention}(\mathbf{Q}_i,\mathbf{K}_i,\mathbf{V}_i) \tag{2}$$

$\mathbf{W}_O \in \mathbb{R}^{M d_v \times d_m}$ represents the learnable weights of the fusion layer. MSA dynamically models the relations of visual features between objects. FFN is designed to further improve the representations via non-linear transformations. Specifically, FFN contains two linear projection layers with a GELU in between. The process of FFN is formulated as follows:

$$\mathrm{FFN}(\mathbf{X}) = \mathrm{GELU}(\mathbf{X}\mathbf{W}_1)\mathbf{W}_2 \tag{3}$$

where $\mathbf{W}_1 \in \mathbb{R}^{d_m \times d_f}$ and $\mathbf{W}_2 \in \mathbb{R}^{d_f \times d_m}$ are projection weights. $d_f$ represents the hidden dimension.
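A minimal PyTorch sketch of one semantic-guided self-attention block is shown below, where the queries, keys, and values are all projected from $\mathbf{X}+\mathbf{G}_v$ as in Eq. (1). It relies on torch.nn.MultiheadAttention and a pre-norm layout, so the exact normalization placement, head count, and sizes are assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class SemanticGuidedSelfAttention(nn.Module):
    """One Transformer block where Q, K, and V are all projected from X + G_v (Eqs. (1)-(3))."""
    def __init__(self, d_m: int = 384, num_heads: int = 8, d_f: int = 1536):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_m)
        self.attn = nn.MultiheadAttention(d_m, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_m)
        self.ffn = nn.Sequential(nn.Linear(d_m, d_f), nn.GELU(), nn.Linear(d_f, d_m))

    def forward(self, x: torch.Tensor, g_v: torch.Tensor) -> torch.Tensor:
        # The semantic guidance is added before projecting queries, keys, and values.
        guided = self.norm1(x + g_v)
        x = x + self.attn(guided, guided, guided, need_weights=False)[0]
        return x + self.ffn(self.norm2(x))

tokens = torch.randn(1, 1024, 384)    # P visual tokens
guidance = torch.randn(1, 1024, 384)  # G_v from the segmentation branch
print(SemanticGuidedSelfAttention()(tokens, guidance).shape)  # torch.Size([1, 1024, 384])
```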
3.2.3. Semantic-guided cross-attention
The semantic-guided cross-attention is designed to fuse the visual
and language features. The language features can be encoded with
advanced methods, i.e., LSTM (Hochreiter and Schmidhuber,1997),
BERT (Kenton and Toutanova, 2019), etc. As for language encoding, a two-layer LSTM with a hidden size of 384, BERT-Base (https://huggingface.co/bert-base-uncased), and DistillBERT-Base (https://huggingface.co/distilbert-base-uncased) were adopted for comparison. The LSTM was randomly initialized, and the two BERT models were pre-trained on BooksCorpus (800M words) and English Wikipedia (2500M words). All the input language features were reduced to 384 via a linear projection. The visual clues are searched according to keywords in the question, and then summarized into the final answer. The cross-attention consists of two series of transformer blocks, where the fusion mechanisms are bidirectional. Similarly, the visual features from the self-attention are first fused with $\mathbf{G}_v^2$ to achieve semantic guidance. According to Fig. 4, the guided features are transformed into a query vector $\mathbf{Q} = (\mathbf{X}+\mathbf{G}_v^2)\mathbf{W}_q$, while
key and value are generated from the language features $\mathbf{Y}$. This form of attention aggregates language features to enhance the visual representations $\mathbf{X}_f$. In the second stage, we regard the language features as query, and the visual features as key and value. The semantic guidance is once again applied, with $\mathbf{K} = (\mathbf{X}_f+\mathbf{G}_v^3)\mathbf{W}_k$ and $\mathbf{V} = (\mathbf{X}_f+\mathbf{G}_v^3)\mathbf{W}_v$. The second series of transformer blocks then hierarchically aggregates visual features into language representations $\mathbf{Y}_f$. The keywords of the objects of interest automatically collect the corresponding visual features. After multi-modal reasoning, we fuse $\mathbf{X}_f$ and $\mathbf{Y}_f$ to obtain the final answer.
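The bidirectional fusion can be sketched as follows, again assuming PyTorch: in the first stage the guided visual tokens query the language tokens, and in the second stage the language tokens query the re-guided visual tokens. Module names and the single-block structure are illustrative simplifications of the stacked transformer blocks described above.

```python
import torch
import torch.nn as nn

class BidirectionalCrossAttention(nn.Module):
    """Stage 1: visual tokens query language tokens; stage 2: language tokens query visual tokens."""
    def __init__(self, d_m: int = 384, num_heads: int = 8):
        super().__init__()
        self.v2l = nn.MultiheadAttention(d_m, num_heads, batch_first=True)
        self.l2v = nn.MultiheadAttention(d_m, num_heads, batch_first=True)

    def forward(self, x, y, g_v2, g_v3):
        # Stage 1: Q = (X + G_v^2) W_q, K and V come from the language features Y.
        x_f = x + self.v2l(x + g_v2, y, y, need_weights=False)[0]
        # Stage 2: Q comes from Y, K = V = (X_f + G_v^3) W_{k,v}; keywords gather visual clues.
        y_f = y + self.l2v(y, x_f + g_v3, x_f + g_v3, need_weights=False)[0]
        return x_f, y_f

x = torch.randn(1, 1024, 384)   # visual tokens from the self-attention
y = torch.randn(1, 20, 384)     # encoded question tokens
g2 = torch.randn(1, 1024, 384)  # second semantic guidance
g3 = torch.randn(1, 1024, 384)  # third semantic guidance
x_f, y_f = BidirectionalCrossAttention()(x, y, g2, g3)
print(x_f.shape, y_f.shape)     # torch.Size([1, 1024, 384]) torch.Size([1, 20, 384])
```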
3.3. Adaptive numerical optimization
Motivation. The VQA tasks involve different types of questions,
including classification and regression (object counting) questions. The
existing research simply treats it as a multi-classification task and
utilizes the standard cross-entropy (CE) loss for optimization. However,
this optimization target is not suitable for regression tasks, due to its
lack of distance awareness. Additionally, classification questions often
have imbalanced simple and hard samples (Wen et al.,2021b), where
simple questions dominate the optimization. To this end, we propose
adaptive numerical optimization to automatically balance task and
sample differences.
Design. We first revisit the multi-classification CE loss:
$$CE(p, y) = -y\,\log(p) = -\sum_{i=1}^{class} y_i \log(p_i) \tag{4}$$

where $y$ and $p$ denote the ground truth and the predicted probabilities, and $y$ is the one-hot encoded vector. To add distance awareness to the CE loss, we design the modulating factor $d = \alpha|\mathbf{y}_{diff}|^{\gamma} = \alpha|\mathbf{y}_{pr} - \mathbf{y}_{gt}|^{\gamma}$, where $\mathbf{y}_{pr}$ and $\mathbf{y}_{gt}$ denote the predicted and ground-truth numbers for regression tasks. As the counting answers are embedded, their indexes represent $\mathbf{y}$ (Fig. 5). $\alpha \geq 0$ and $\gamma \geq 0$ are hyperparameters that control the intensity of the distance penalty, and $d$ grows with the distance difference $\mathbf{y}_{diff}$. The numerical difference loss for regression tasks can be formulated as follows:

$$ND(p, y) = -(1 + d)\,y\,\log(p) = -(1 + \alpha|\mathbf{y}_{diff}|^{\gamma})\,y\,\log(p) = -(1 + \alpha|\mathbf{y}_{pr} - \mathbf{y}_{gt}|^{\gamma})\sum_{i=1}^{class} y_i \log(p_i) \tag{5}$$

Fig. 5. The index embedding of counting answers. $\mathbf{y}$ is obtained from the answer's index.
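A minimal PyTorch sketch of Eq. (5) is given below. It assumes that counting answers are embedded by their indexes (Fig. 5), so the class index directly serves as the predicted number; the alpha and gamma values are placeholders rather than the tuned settings.

```python
import torch
import torch.nn.functional as F

def numerical_difference_loss(logits, targets, alpha=1.0, gamma=1.0):
    """Eq. (5): cross-entropy scaled by 1 + alpha * |y_pr - y_gt|^gamma for counting answers."""
    ce = F.cross_entropy(logits, targets, reduction="none")   # per-sample CE loss
    y_pred = logits.argmax(dim=1).float()                     # predicted count = class index
    distance = (y_pred - targets.float()).abs()               # |y_pr - y_gt|
    return ((1.0 + alpha * distance.pow(gamma)) * ce).mean()

logits = torch.randn(4, 15)                # 15 counting classes (answers 0..14)
targets = torch.tensor([0, 3, 7, 12])
print(numerical_difference_loss(logits, targets))
```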
As for the classification tasks, we design a hard example mining strategy to adaptively select the effective samples. Online hard example mining (OHEM) only optimizes hard samples with low confidence, i.e., a low probability at the true label position. Hence, OHEM filters the easy samples with a probability threshold. However, a fixed probability threshold can lead to training instability as the number of classification samples in mini-batches changes. To address this, we modify the filter to select a fixed proportion of hard samples with high losses, controlled by the parameter $k$. $k \in (0, 1]$ denotes the retained ratio of hard samples with the top losses. This allows the selection ratio of hard samples to be adjusted adaptively. For example, assume that one mini-batch has 12 classification samples with losses {0.81, 0.23, 0.34, 0.54, 0.63, 0.72, 0.12, 0.33, 0.54, 0.44, 0.09, 0.14}. If $k = 0.8$, we only select $\lfloor 12 \times 0.8 \rfloor = \lfloor 9.6 \rfloor = 9$ samples with high losses to optimize. In this case, the easy samples {0.09, 0.12, 0.14} are filtered out. As the training process continues, the loss of each sample changes in each epoch, so the model automatically selects the samples that are most worth training on. The universal version of our adaptive numerical loss is formulated as follows:

$$AN(p, y) = \begin{cases} ND(p, y), & \texttt{task} = reg. \\ Topk\{CE(p, y)\}, & \texttt{task} = cls. \end{cases} \tag{6}$$
The proposed adaptive numerical optimization combines the clas-
sification and regression VQA tasks in one unified loss. This automati-
cally balances task and sample differences during the training without
any additional parameters.
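A compact sketch of Eq. (6), reusing the numerical_difference_loss sketch above, is given below. The boolean mask separating counting and classification questions, and the assumption that counting classes are indexed by their counts, are illustrative simplifications rather than the exact implementation.

```python
import torch
import torch.nn.functional as F

def topk_hard_example_ce(logits, targets, k: float = 0.8):
    """Top-k hard example mining for classification answers (keeps the k highest losses)."""
    ce = F.cross_entropy(logits, targets, reduction="none")
    num_kept = max(1, int(ce.numel() * k))          # e.g. 12 samples, k = 0.8 -> 9 kept
    return ce.topk(num_kept).values.mean()

def adaptive_numerical_loss(logits, targets, is_counting, alpha=1.0, gamma=1.0, k=0.8):
    """Eq. (6): ND loss for counting answers, top-k CE for the remaining classification answers."""
    loss = 0.0
    if is_counting.any():
        loss = loss + numerical_difference_loss(logits[is_counting], targets[is_counting],
                                                alpha, gamma)  # defined in the previous sketch
    if (~is_counting).any():
        loss = loss + topk_hard_example_ce(logits[~is_counting], targets[~is_counting], k)
    return loss

logits = torch.randn(6, 166)                        # 166 unique answers in EarthVQA
targets = torch.randint(0, 166, (6,))
is_counting = torch.tensor([True, True, False, False, False, True])
print(adaptive_numerical_loss(logits, targets, is_counting))
```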
The simplified version of the proposed multi-task framework is
shown in Fig. 6. By sharing the encoder weights, the segmentation
features support the VQA tasks according to the forward path. Si-
multaneously, the VQA loss supervises the encoder optimization via
the backward path, which then refines the segmentation results. By
implicitly modeling constraints, ‘There are buildings exist in the scene’
will increase the probability of buildings appearing in segmentation,
and ‘The area of buildings occupies 10%–20%’ controls the number of
pixels that are predicted to ‘building’.
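This forward/backward coupling can be summarized by the schematic training step below. The module interfaces (an encoder returning multi-scale features, a seg_head, and a vqa_head) are hypothetical and only illustrate the shared encoder, the detached semantic guidance, and the summed multi-task loss.

```python
def training_step(encoder, seg_head, vqa_head, criterion_seg, criterion_vqa,
                  image, question, seg_mask, answer, is_counting):
    """One schematic joint step of the multi-task framework in Fig. 6 (hypothetical interfaces)."""
    feats = encoder(image)                      # shared siamese encoder (forward path)
    seg_logits = seg_head(feats)                # hierarchical pyramid decoder output
    # The segmentation output is detached before being used as guidance, so the VQA
    # loss does not back-propagate into the pyramid decoder (Section 3.2.1).
    answer_logits = vqa_head(feats[-1], seg_logits.detach(), question)
    loss_seg = criterion_seg(seg_logits, seg_mask)
    loss_vqa = criterion_vqa(answer_logits, answer, is_counting)   # adaptive numerical loss
    # Both losses reach the shared encoder through the backward path (mutual promotion).
    return loss_seg + loss_vqa
```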
4. EarthVQA dataset
The EarthVQA dataset (Wang et al.,2024) was extended from
the LoveDA dataset (Wang et al.,2022b), which was originally de-
signed for the land-cover segmentation task. The images were col-
lected from Wuhan, Nanjing, and Changzhou cities including 18 urban
and rural administrative districts (536.15 km2). With the addition of
matched QA pairs based on the city planning requirements, the Earth-
VQA dataset contains 6000 HSR images, 6000 semantic land-cover
masks, and 205,593 QA pairs. The size of each image is 1024 ×1024,
with a spatial resolution of 0.3 m.
4.1. Dataset statistics
Fig. 7 shows the distributions of questions, and each question owns a considerable number of samples. Each urban image is equipped with 42 QA pairs, and each rural image has 29 QA pairs. Because urban (rural) samples share the same questions and have similar answers, we evenly divide the regions into train (4 urban, 4 rural), val (2 urban, 2 rural), and test (3 urban, 3 rural) sets to ensure the balance of QA pairs, following dataset standards.
The whole dataset was divided into train (2522 images with 88,166 QA pairs), val (1669 images with 57,202 QA pairs), and test (1809 images with 62,744 QA pairs) sets. The EarthVQA dataset includes six types of
questions, i.e., basic judging, basic counting, relational-based judging,
relational-based counting, object situation analysis, and comprehensive
analysis. The semantic categories include background, building, road,
water, barren, forest, agricultural and playground.
Fig. 8 shows more detailed distributions of questions and answers.
The number of unique answers in the EarthVQA dataset is 166. The
distributions of questions with different types are shown in Fig. 8(a).
Relational-based judging has the most samples (83.2k) and relational-based counting has the fewest (6k). This is because relational-based judging covers many questions of concern (Fig. 7). Each question type also includes a different number of questions, further intensifying the differences in difficulty between question types.
Fig. 6. The simplified version of our multi-task framework. By sharing the encoder weights, the segmentation features support the VQA tasks according to the forward path.
Simultaneously, the VQA loss supervises the encoder optimization via the backward path, which then refines the segmentation results.
Fig. 7. The distributions of questions in the EarthVQA dataset (Wang et al.,2024). Based on city planning requirements, the EarthVQA dataset includes six types of questions,
i.e., basic judging, basic counting, relational-based judging, relational-based counting, object situation analysis, and comprehensive analysis. The distributions of questions are
relatively balanced, ensuring comprehensive and adequate training and evaluation.
The answer distributions are complex and reveal many practical chal-
lenges for the HSR VQA task. As shown in Fig. 8(b), the counting
answers showcase a long-tail distribution. As for judging questions, the
answers in (d) and (e) show different distributions. The ‘Yes’ occupies
a larger proportion in basic judging, but ‘No’ has more samples in relational-based judging. As for object situation analysis and comprehensive analysis, some questions such as (c) and (f) have relatively balanced answer distributions, but (g) has an imbalanced answer distribution. In conclusion, each question has different distributions of
answers, which brings more challenges when faced with the actual
Earth’s environment. These challenges inspire some new directions to
advance complex reasoning in remote sensing VQA. As for imbalanced
answer distributions, it is worth using the denoising diffusion proba-
bilistic model (Ho et al.,2020) to generate samples with rare answers.
As for fine-grained answers in object situation analysis, contrastive
learning (Wang et al.,2021b) can be explored to enhance the model’s
discriminative ability.
Fig. 9 shows two samples from urban and rural scenes, respectively.
The urban scenes focus on the residents, traffic situation, green spaces,
water sources, urban villages, etc. The rural scenes focus on water
governance, road improvement, agricultural cultivation, etc.

Fig. 8. More detailed statistical distributions of questions and answers.

Fig. 9. Representative samples from the EarthVQA dataset. The urban and rural scenes focus on the different schemes. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Fig. 10. Representative samples from the EarthVQA dataset. The urban and rural scenes focus on the different schemes.
4.2. Annotation procedure
The basic questions are automatically generated from semantic
masks, similar to RSVQA (Lobry et al.,2020) and others are manually
annotated. Fig. 10 shows some annotation procedures of representative
samples. (1) Basic judging. The basic judging questions focus on
the presence or absence of land cover types. Each semantic category
is separately queried, and the answer is generated from the corre-
sponding mask. (2) Basic counting. The basic counting questions are
used to estimate the area of features. Considering that each image has the same spatial resolution of 0.3 m, the area range is evenly divided into ten intervals. The upper bound of the area is 1024 × 1024 × 0.3 (m) × 0.3 (m) = 94371.84 m². (3) Object situation analysis. As for ‘What are the types of residential buildings?’, the candidate answers include
commercial buildings, private buildings and no residential buildings.
The commercial buildings denote houses contracted and sold by real
estate developers, which often show consistent appearances and neat
layouts. The private buildings are often built by individuals and have
inconsistent appearances as well as heights. It is possible that there
are both commercial housing and private housing in the same scene.
(4) Comprehensive analysis. As for ‘What are the roads around the
village?’, we first search for the village, which is formed of compact
buildings (more than 20 buildings). The aggregated buildings form a
polygon of the village. Most of the roads are cement, but a small section of road has not yet been paved. These roads are near (<100 m) to the village; thus, we obtain the final answer ‘There are unsurfaced and
cement roads’. (5) Relational-based counting. The relational-based
counting requires reasoning about spatial or semantic relations that are not included in the original semantic masks. For ‘How many intersections are
in this scene?’, the annotators need to judge the topologies of roads
and then count the number of crossed parts. For ‘How many eutrophic
waters are in this scene?’, the annotators need to judge the sub-
properties of all waters and count the number of waters with abnormal
spectra. (6) Relational-based judging. The relational-based judging
involves spatial or semantic reasoning with several objects. As illus-
trated in ‘Are there intersections near the school?’, a stadium, several
teaching buildings, and playgrounds are close to each other, forming
the school. Besides, two roads are crossed to form an intersection. The
annotators used the ArcGIS toolbox to calculate the polygon-to-polygon
distance between the school and the intersection, obtaining 23 m <
100 m. Hence, the final answer is ‘Yes’. Besides, some land-use types
(commercial, industrial, educational) are determined using OSM data as auxiliary information. The distance judgment threshold is unified to 100 m, and
our dataset does not involve ambiguous questions such as geographical
orientations.
The annotation thresholds can be summarized as follows: (1) The
distance threshold judging ‘near’ is 100 m; (2) ‘more than 20 com-
pact buildings’ form a resident; (3) The private buildings in residents
have inconsistent appearances as well as heights; (4) The commercial
buildings show consistent appearances and neat layouts. (5) The waters
with green algae and other floating vegetation are eutrophic waters.
(6) Some land-use types with socioeconomic attributes (commercial,
industrial, etc.) are determined using OSM properties as auxiliary. (7)
If the leaf area index (vegetation area/total area) in the residential area is less than 30%, green space needs to be supplemented.
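As a toy illustration, these thresholds and the basic-counting area binning from Section 4.2 can be encoded as simple predicates; the function names, the area clipping of the last interval, and the exact rounding are assumptions for illustration only.

```python
def is_near(distance_m: float) -> bool:
    """Threshold (1): two objects are 'near' if their polygon-to-polygon distance is below 100 m."""
    return distance_m < 100.0

def forms_resident(num_compact_buildings: int) -> bool:
    """Threshold (2): more than 20 compact buildings form a resident."""
    return num_compact_buildings > 20

def needs_green_space(vegetation_area_m2: float, total_area_m2: float) -> bool:
    """Threshold (7): greening is needed if the leaf area index (vegetation/total) is below 30%."""
    return vegetation_area_m2 / total_area_m2 < 0.30

def area_interval(area_m2: float, image_size: int = 1024, gsd_m: float = 0.3, bins: int = 10) -> int:
    """Basic-counting answers: the scene area (1024 * 0.3 m per side, i.e. 94371.84 m^2)
    is evenly divided into ten intervals and the answer is the interval index."""
    upper_bound = (image_size * gsd_m) ** 2
    return min(int(area_m2 / upper_bound * bins), bins - 1)

print(is_near(23.0), forms_resident(35), needs_green_space(2500.0, 10000.0), area_interval(50000.0))
# True True True 5
```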
5. Experiments
5.1. Experimental settings
Evaluation metrics. Since our framework includes both semantic
segmentation and VQA tasks, we report their performances separately
using different metrics. For semantic segmentation tasks, we report the overall performance using the mean Intersection over Union (mIoU), as recommended in Wang et al. (2022b). For VQA tasks, we report clas-
sification accuracy and root-mean-square error (RMSE) as commonly
used (Lobry et al.,2020). The overall performances include the overall
accuracies (OA) and overall root-mean-square error (OR).
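For clarity, the two families of VQA metrics can be sketched as follows; the helper names are illustrative, and OR is treated here as the RMSE pooled over all counting questions, which is an assumption about the exact aggregation.

```python
import math

def overall_accuracy(pred_answers, gt_answers):
    """OA: fraction of questions whose predicted answer exactly matches the ground truth."""
    correct = sum(p == g for p, g in zip(pred_answers, gt_answers))
    return correct / len(gt_answers)

def counting_rmse(pred_counts, gt_counts):
    """RMSE over counting questions; pooling all counting questions gives the OR."""
    squared_errors = [(p - g) ** 2 for p, g in zip(pred_counts, gt_counts)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

print(overall_accuracy(["yes", "no", "3"], ["yes", "no", "4"]))  # 0.666...
print(counting_rmse([3, 5, 0], [4, 5, 1]))                       # ~0.816
```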
Reference methods. For fairness, nine general VQA methods and three remote sensing VQA methods are selected for reference. The
general VQA methods include SAN (Yang et al.,2016), MAC (Hudson
and Manning,2018), BUTD (Anderson et al.,2018), BAN (Kim et al.,
2018a), MCAN (Yu et al.,2019), LXMERT (Tan and Bansal,2019), D-
VQA (Wen et al.,2021b), BLIP-2 (Li et al.,2023a), Instruct-BLIP (Dai
et al.,2024). The remote sensing methods are RSVQA (Lobry et al.,
2020), RSIVQA (Zheng et al.,2021) and SOBA (Wang et al.,2024).
Following the original settings, BLIP-2 and Instruct-BLIP used the pre-
trained ViT-g/14 and FlanT5XL as encoders. The Q-Former was trained
to bridge the multi-modal features for conditional generation. As BUTD,
BAN, D-VQA, MCAN, and LXMERT require local semantic features
as inputs, we adopt the proposed multi-task learning strategy fairly,
and the visual features are generated by the hierarchical pyramid
segmentation network. ConvNeXt-Tiny (Liu et al., 2022) is adopted as the default CNN encoder, and the other settings are identical to those in the original literature.
Training implementations. As for the EarthVQA dataset, all VQA
methods are trained for 55k steps with a batch size of 16. For multi-
task learning, we only trained the segmentation network in the first
15k steps for initialization, and the rest is utilized for joint training.
The data augmentations include random flipping, rotation, and color
jittering. Consistent with LoveDA dataset (Wang et al.,2021a), we set
scales = {0.5, 0.75, 1.0, 1.25, 1.5, 1.75} for multi-scale training. 512 × 512 patches are randomly cropped as segmentation inputs, and all images are resized to 768 × 768 as VQA inputs to preserve information integrity. We set the initial learning rate to 1e-4 and use a ‘poly’
schedule for adaptation. To analyze the contributions of each proposed
module, we have conducted comprehensive analysis experiments as
follows. Because the urban and rural scenes have different answer
distributions according to their characteristics, we evenly divided the
urban and rural scenes into Train/Val/Test sets (Wang et al.,2021a) to
ensure the diversity of model training and evaluation.
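The ‘poly’ schedule can be sketched as follows; the decay power of 0.9 is a common default and an assumption here, since the text only specifies the initial learning rate of 1e-4 and the 55k-step budget.

```python
def poly_lr(step: int, total_steps: int = 55_000, base_lr: float = 1e-4, power: float = 0.9) -> float:
    """'Poly' schedule: lr = base_lr * (1 - step / total_steps) ** power."""
    return base_lr * (1.0 - step / total_steps) ** power

for step in (0, 15_000, 40_000, 54_000):
    print(step, f"{poly_lr(step):.2e}")
```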
Table 2
Comparative results with other VQA methods on the EarthVQA test set.
Method mIoU (%) Accuracy (%) OA (%) RMSE OR
Bas Ju Rel Ju Bas Co Rel Co Obj An Com An Bas Co Rel Co
Only Segm. 57.27
General methods
SAN (Yang et al.,2016) 88.73 80.61 77.16 64.98 52.73 42.70 75.52 1.0253 1.2197 1.0516
MAC (Hudson and Manning,2018) 82.71 78.60 71.82 55.64 45.71 38.10 71.19 1.4250 1.3576 1.4167
BUTD (Anderson et al.,2018) 57.11 90.36 82.00 79.45 67.58 55.89 46.80 77.55 0.7919 1.1787 0.8499
BAN (Kim et al.,2018b) 56.46 90.83 82.27 78.66 63.93 48.65 45.14 76.74 0.9155 1.2367 0.9615
D-VQA (Wen et al.,2021b) 56.95 90.07 81.94 79.47 66.26 58.09 45.88 77.54 0.8233 1.2071 0.8805
MCAN (Yu et al.,2019) 57.07 90.62 81.03 79.90 68.14 58.06 47.09 77.53 0.7977 1.1208 0.8449
LXMERT (Tan and Bansal,2019) 56.35 90.00 81.93 80.54 68.26 54.87 45.07 77.46 0.7980 1.1731 0.8539
BLIP-2 (Li et al.,2023a) 88.13 81.92 70.26 58.58 42.72 28.34 71.07 1.8790 1.3200 1.8186
Instruct-BLIP (Dai et al.,2024) 89.67 79.69 76.96 63.34 59.72 45.68 75.25 0.7994 1.2170 0.8627
RS methods
RSVQA (Lobry et al.,2020) 82.38 79.14 71.80 55.42 42.98 36.27 70.95 1.5162 1.3605 1.4976
RSIVQA (Zheng et al.,2021) 87.83 80.66 77.03 64.71 54.02 42.35 75.41 0.9715 1.2206 1.0060
SOBAa(Wang et al.,2024) 57.17 91.10 81.56 80.84 67.09 58.34 47.94 78.24 0.7642 1.1258 0.8182
SOBA-MT (Wang et al.,2024) 56.01 89.82 81.18 79.69 68.69 58.34 47.07 77.42 0.7790 1.1021 0.8263
EarthVQANet (w/o Segm.) 89.23 80.38 77.42 65.00 53.67 44.82 76.47 0.9934 1.2045 1.0214
EarthVQANet 57.43 90.52 82.09 81.90 67.92 59.06 48.64 78.54 0.7490 1.0808 0.7980
The abbreviations are: Bas Ju (Basic Judging), Rel Ju (Relational-based Judging), Bas Co (Basic Counting), Rel Co (Relational-based Counting), Obj An (Object Situation Analysis),
Com An (Comprehensive Analysis). OA (Overall Accuracy), OR (Overall RMSE).
aDenotes the cascade training for SOBA in the original setting.
5.2. Comparative experiments
The comparative multi-task results on the EarthVQA dataset are
presented in Table 2. As the segmentation performance provides a basis
for VQA, the Only Segm. setting shows that the proposed hierarchical
pyramid segmentation network obtains a baseline mIoU of 57.27%
on the segmentation task. The methods that directly utilize global
visual features, i.e., SAN, MAC, RSVQA, and RSIVQA, achieve low
VQA performances on both classification and regression tasks. With the
global features as inputs, it is hard for the EarthVQANet (w/o Segm.) to
answer the relational-based questions. As for Instruct-BLIP and BLIP-2,
the visual encoders pre-trained on ImageNet are not suitable for remote
sensing VQA tasks. As for the basic counting task, our proposed EarthVQANet achieves the best accuracy (81.90%) and RMSE (0.7490). As for the relational-based counting task, our proposed EarthVQANet (Acc = 67.92%, RMSE = 1.0808) fails to exceed BAN (Acc = 69.52%, RMSE = 1.0656). This may be because the relational-based counting samples in our dataset are limited (only 2.87% according to Fig. 8(a)), so our numerically sensitive AN loss may cause some overfitting. In the future, we will develop regularization constraints to improve the AN loss and reduce its overfitting risk. Implemented with the proposed
multi-task learning strategy, the existing Faster-RCNN-based methods,
i.e., BUTD, BAN, D-VQA, and MCAN, are also compatible with our
proposed visual features, significantly exceeding the global methods.
This demonstrates the necessity and importance of high-quality visual
features for complex VQA tasks. Our proposed EarthVQANet generally
outperforms the existing general and remote sensing methods. Specif-
ically, for object counting tasks, compared with the best referenced
method (MCAN, OR =0.8449), EarthVQANet significantly reduces the
error by 5%. Directly training SOBA simultaneously (SOBA-MT) loses considerable performance on the segmentation and VQA tasks, because directly sharing image encoder and decoder gradients causes optimization contradictions (Vandenhende et al., 2022). Hence, we first adopt the PPM for non-linear projections to weaken the negative gradient interactions.
The semantic-guided attention leverages residual connections to further
guarantee the original VQA gradient flow. Moreover, the gradients from
semantic guidance are blocked to avoid the effects of the global VQA
semantics for image decoder training. In conclusion, EarthVQANet also
improves the segmentation accuracy, proving its potential to achieve
the mutual promotion of segmentation and VQA. The universal and
task-specific semantics are effectively encoded in our siamese encoder
and different decoders. With the rise of large multi-modal models, exploring their application to remote sensing VQA is very promising. There are some new directions for designing remote sensing-specific large multi-modal models: (1) Visual feature alignment. As the vision encoders are pre-trained using CLIP (Radford et al., 2021) or MAE (He et al., 2022) on natural images, it is necessary to align the general visual features with remote sensing images. (2) Task-driven application. The current large language models (LLMs) are constructed for general scenes via causal language modeling (Liu et al., 2024) or instruction tuning (Dai et al., 2024). It is worth exploring practical requirements and scenes in remote sensing tasks (such as city planning, disaster assessment, etc.).
In order to show the comparative results intuitively, two representa-
tive samples are selected from urban and rural scenes for visualizations
(see Fig. 11). As for the urban scene, all referenced methods fail to
recognize the intersection in the judging question (#Q1) but part of
them succeed in answering the counting question (#Q2). The roads in
the north-south direction are blocked by trees and the school scene is
not obvious, which increases the difficulty of spatial reasoning. Com-
pared with the referenced methods, EarthVQANet successfully achieves
the right answers to these two questions and has good reasoning
consistency. #Q3 requires recognition of high-level functional zones,
and the global fusing methods (RSVQA and SAN) generate wrong
answers due to the lack of object details. MCAN and BUTD wrongly
decided that there is a viaduct (#Q4) due to shadows cast by tall
buildings, but EarthVQANet successfully avoids this inference error.
The selected rural scene presents relatively simple distributions; the methods with local visual features (MCAN, BUTD, and EarthVQANet) also outperform the global fusion methods (RSVQA and SAN). The
proposed EarthVQANet performs well in both semantic extraction and
knowledge reasoning stages and exhibits high reasoning consistency
across different questions.
5.3. Segmentation network analysis
Scalable image and language encoders. As the image encoder de-
termines the universal representation ability, we evaluate its scalability
with varied CNN and Transformer architectures, including ResNet (He
et al., 2016), HRNet (Wang et al., 2020), ConvNeXt (Liu et al., 2022), MiT (Xie et al., 2021), and Swin-Transformer (Liu et al., 2021).
Fig. 11. Visualizations of predicted samples from urban and rural areas. Several typical questions are selected for comparison. EarthVQANet shows strong performances on questions
of varying difficulty.
Table 3
Encoder ablation study.
Image Language Param(M) mIoU (%) OA (%) OR
ResNet-50 LSTM 65.95 54.56 77.54 0.8276
ResNet-50aLSTM 65.95 52.48 75.08 0.8674
ResNet-101 LSTM 80.06 55.12 77.76 0.8175
ResNet-101aLSTM 80.06 53.02 75.78 0.8658
HRNet-W32 LSTM 55.41 55.67 77.89 0.8691
HRNet-W40 LSTM 71.88 55.72 78.08 0.8463
ConvNeXt-Tiny LSTM 65.64 57.43 78.54 0.7980
ConvNeXt-Tiny BERT-Base 173.23 57.28 78.88 0.8101
ConvNeXt-Tiny DistillBERT-Base 130.11 57.27 78.75 0.7983
ConvNeXt-Small LSTM 87.28 57.34 78.62 0.8040
ConvNeXt-Small BERT-Base 194.87 57.14 78.84 0.8088
ConvNeXt-Small DistillBERT-Base 151.75 57.25 78.82 0.8101
MiT-B2 LSTM 60.17 56.36 77.25 0.8389
MiT-B3 LSTM 74.72 56.83 77.59 0.8287
Swin-Tiny LSTM 65.34 56.10 77.35 0.8481
Swin-Tiny BERT-Base 172.93 54.82 77.80 0.8403
Swin-Tiny DistillBERT-Base 129.81 55.13 77.78 0.8351
Swin-Small LSTM 86.66 56.34 77.52 0.8498
Swin-Small BERT-Base 194.25 56.02 78.03 0.8452
Swin-Small DistillBERT-Base 151.13 56.10 78.05 0.8422
aDenotes the frozen ResNets of Faster-RCNN pre-trained on Visual Genome.
Table 3 shows that the advanced encoders achieve higher segmentation
results and lead to better VQA performances. For similar structures, a
larger number of parameters tend to result in higher VQA performance.
Therefore, given sufficient computing power and time, larger image
encoders are recommended. Compared with LSTM, BERT-Base further
improves the VQA performances due to powerful Transformer structure
and pre-trained weights on a large external corpus. As a student model of BERT, DistillBERT contains fewer parameters and achieves comparable performance. Besides, we also added comparative experi-
ments with the general local feature extractors (Tan and Bansal,2019;
Anderson et al.,2018) pre-trained on Visual Genome (Krishna et al.,
2017). The compared results in Table 3 demonstrate the effectiveness
of the remote sensing visual feature extractors. Ablation experiments
demonstrate the good scalability and compatibility of the proposed
EarthVQANet.
Hierarchical pyramid decoder. The hierarchical pyramid decoder
aims to capture multi-scale features and achieves accurate segmen-
tation results and pseudo semantic guidance. The ablation study of
two pyramid modules has been performed, with the results presented
in Table 4. Compared with the naive decoder, the PPM significantly
enhances segmentation results by 1.25% mIoU and VQA accuracies
Table 4
Ablation study of hierarchical pyramid decoder.
Structure PPM PD #Channel Params (M) mIoU (%) OA (%) OR
Naive decoder 256 46.80 53.66 76.36 1.5761
+PPM 256 61.35 54.91 77.62 0.8415
+PD 256 51.34 56.35 78.26 0.8313
+PPM+PD (Ours) 256 65.64 57.43 78.54 0.7980
384 69.34 57.27 78.71 0.7960
512 74.21 56.71 78.38 0.9313
Table 5
Ablation study of semantic guidance.
Semantic guidance Types Gradient mIoU (%) OA (%) OR
S V L – × 56.67 77.55 0.8650
S_G V L SG × 57.17 78.15 0.8481
S_G V L SG ✓ 56.83 78.04 0.8673
S V_G L CG × 56.86 78.05 0.8410
S V L_G CG × 57.39 77.36 0.8636
S V_G L_G CG × 57.18 78.16 0.8290
S V_G L_G CG ✓ 56.79 77.32 0.8269
S_G V_G L_G SG+CG × 57.43 78.54 0.7980
S_G V_G L_G SG+CG ✓ 56.88 77.91 0.8304
S_G5 V_G L_G SG+CG × 57.25 78.09 0.8066
S_G V_G5 L_G SG+CG × 57.44 78.23 0.8215
S_G V_G L_G5 SG+CG × 57.07 78.01 0.8241
S_G5 V_G5 L_G5 SG+CG × 56.81 77.90 0.8448
by 1.26% OA. The pyramid decoder (PD) structure boosts both the
segmentation (+2.68% mIoU) and VQA (+0.71% OA) performances.
Because PPM and PD enhance multi-scale features at the cell and
structure layers respectively, their fusion brings complementary gains.
Increasing the inner channels of the decoder slightly improves VQA per-
formance, but it may also lead to overfitting. Overall, the combination
of PPM and PD achieves effective information extraction and exhibits
strong compatibility with VQA tasks.
5.4. Semantic-guided attention
Semantic guidance. As semantic guidance introduces object se-
mantics and spatial details, we have performed comparative experi-
ments for guidance types and positions to evaluate the effects. Fig. 4
illustrates that the default setting comprises three types of guidance.
The three guidances are located respectively before self-attention mod-
ules, one-stage cross-attention modules, and two-stage cross-attention
modules.

Fig. 12. VQA overall and counting accuracies with varied numbers of attention layers. The performance first improves as the number of layers increases and then gradually converges.

Fig. 13. VQA overall and counting accuracies with varied numbers of heads and feature dimensions. The performance first improves and then gradually converges.

This configuration can be expressed as S_G V_G L_G, where G represents the guidance, and V and L represent the bidirectional cross-attentions with visual features as query and with language features as query, respectively. Since each attention module is composed of five transformer blocks, we also added guidance before each block to test its effect, denoted as G5. For example, V_G indicates that, among the five cross-attention blocks which use visual features (V) as query, only the first cross-attention adopts the semantic guidance (G). V_G5 denotes that all five cross-attentions adopt the semantic guidance. Table 5
presents the results of the ablation study of semantic guidance. The
addition of semantic guidance on self-attention (𝑆𝐺𝑉 𝐿) significantly
improves the VQA overall accuracy (+0.61% OA). As for the bidirec-
tional cross-attention, the guided visual features (𝑆𝑉𝐺𝐿) also boost
the VQA performance with visual features as query. Without pre-
enhancement, the late guidance 𝑆𝑉 𝐿𝐺may cause semantic confusion
and lead to negative effects on the VQA performance. The combination
of different guidances (𝑆𝐺𝑉𝐺𝐿𝐺) results in the best VQA performance
at a system level. In any case, adding a gradient return to the semantic
guidance causes negative effects. This is due to the direct supervi-
sion from VQA tasks conflicting with the segmentation representation.
Besides, intensive guidance cannot lead to further improvements but
induce overfitting. In conclusion, we selected 𝑆𝐺𝑉𝐺𝐿𝐺configuration
for the semantic guidance.
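To illustrate how the selected S_G V_G L_G configuration could be wired, the sketch below is a hypothetical PyTorch implementation in which detached semantic-guidance tokens are added to the visual tokens once before the self-attention module and once before each cross-attention direction. The module names, dimensions, and residual form are assumptions, not the authors' code.

```python
# Hypothetical sketch of the S_G V_G L_G setting: guidance injected once per
# attention module, with gradients detached (the best setting in Table 5).
import torch.nn as nn

class SemanticGuidedAttention(nn.Module):
    def __init__(self, dim=384, heads=8, blocks=5):
        super().__init__()
        def stack():
            return nn.ModuleList(
                nn.MultiheadAttention(dim, heads, batch_first=True)
                for _ in range(blocks))
        self.self_att, self.v_cross, self.l_cross = stack(), stack(), stack()
        self.proj_guid = nn.Linear(dim, dim)   # projects semantic-guidance tokens

    def guide(self, tokens, guidance):
        # Inject object semantics; detach() so VQA gradients do not flow back
        # into the segmentation branch.
        return tokens + self.proj_guid(guidance.detach())

    def forward(self, vis, lang, guidance):
        # vis, lang, guidance: (batch, tokens, dim)
        vis = self.guide(vis, guidance)        # S_G: guided self-attention
        for att in self.self_att:
            vis = vis + att(vis, vis, vis)[0]
        v = self.guide(vis, guidance)          # V_G: visual tokens as query
        for att in self.v_cross:
            v = v + att(v, lang, lang)[0]
        kv = self.guide(v, guidance)           # L_G: language tokens as query
        l = lang
        for att in self.l_cross:
            l = l + att(l, kv, kv)[0]
        return v, l
```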
Ablation on the number of attention layers. The semantic-guided attention includes self-attention and cross-attention, each consisting of several transformer blocks. Since the number of attention layers directly affects the VQA representation capability, we varied it to evaluate the effects. For simplicity, we set N = N_E = N_D. Fig. 12 shows the VQA overall and counting performances with varying numbers of layers. As the number of layers increases, the performance initially improves and then gradually converges. The best OA and OR are achieved at N = 5 and N = 3, reaching 78.54% and 0.792, respectively. When faced with a new dataset, the layers can be gradually deepened to find a good trade-off between accuracy and efficiency.
Ablation on the number of heads and feature dimension. As key hyperparameters of the transformer blocks, the number of heads (M) and the feature dimension (d_m) determine the width of the network. To evaluate their effects, we conducted ablation experiments with M = 2, 4, 8, 16 and d_m = 128, 256, 384, 512. The comparative results in Fig. 13(a) show two different trends. When d_m = 128, increasing the number of heads M leads to a gradual drop in performance. However, as d_m increases, the model becomes better able to resist the loss of representation caused by the multi-head division, because each head retains more learnable feature dimensions. When d_m = 512, increasing M brings positive gains. In other words, a low feature dimension offers limited representation, and forcibly dividing it renders each head powerless, whereas a large feature dimension such as d_m = 512 benefits from a larger number of heads, which also helps to prevent overfitting. Fig. 13(b) shows similar trends for the counting task, and d_m = 384 yields the best performance across the different indicators.
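To make the head-splitting trade-off concrete, the short snippet below simply tabulates the per-head dimension d_m / M for the settings ablated above; the values are purely illustrative arithmetic.

```python
# Illustrative only: per-head dimension d_m / M for the ablated settings.
for d_m in (128, 256, 384, 512):
    for M in (2, 4, 8, 16):
        print(f"d_m={d_m:3d}, M={M:2d} -> {d_m // M:3d} dims per head")
# e.g. d_m=128 with M=16 leaves only 8 dims per head, which is consistent with
# the performance drop observed in Fig. 13(a) for small feature dimensions.
```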
5.5. Adaptive numerical optimization
The proposed adaptive numerical optimization introduces three hyperparameters: k controls the intensity of the hard example mining for the classification tasks, while α and γ control the distance-difference penalty for the counting tasks.
Fig. 14. Relation between the penalty d and the distance difference y_diff under the controlling factors α and γ.
Fig. 15. Experimental results with varied hyperparameters for the adaptive numerical optimization. The proposed adaptive numerical optimization outperforms the CE loss over a wide range of hyperparameter selections. (b) The mean values and standard deviations are reported over three runs.
Fig. 15(a) analyzes the optimization sensitivity with varied k. It shows that when k > 0.6, our balance strategy for the classification optimization stably exceeds the CE loss. The overall performance first increases and then slightly decreases as k drops from 1.0 to 0.5, because a low keep ratio reduces the number of effective samples in each batch, leading to model underfitting. Setting k = 0.8, we then varied α and γ from 0.125 to 2 and report the overall accuracies in Fig. 15(b). Each setting was run three times, and the mean values and standard deviations are reported. The results show that the proposed adaptive numerical optimization outperforms the CE loss over a wide range of hyperparameter selections, i.e., α ∈ (0, 1.75] and γ ∈ (0, 1.25]. To ensure the stability of the VQA performance, it is recommended to restrict γ to (0, 1]; the performance becomes more unstable when γ > 1 because the influence curve changes from concave to convex (Fig. 14). Overall, setting these hyperparameters reasonably leads to stable improvements in the VQA performance.
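To make the roles of k, α, and γ concrete, the following is a minimal PyTorch-style sketch of one plausible form of the adaptive numerical loss. It assumes that the classification branch keeps only the hardest top-k fraction of per-sample cross-entropy losses and that the counting branch adds a penalty of the form α·|y_pred − y_true|^γ; the exact formulation used in EarthVQANet may differ.

```python
# Hypothetical sketch, not the authors' exact loss.
import torch
import torch.nn.functional as F

def adaptive_numerical_loss(cls_logits, cls_targets, cnt_preds, cnt_targets,
                            k=0.8, alpha=1.0, gamma=1.0):
    # Classification branch: hard example mining with keep ratio k.
    ce = F.cross_entropy(cls_logits, cls_targets, reduction="none")
    num_keep = max(1, int(k * ce.numel()))
    cls_loss = torch.topk(ce, num_keep).values.mean()

    # Counting branch: distance-sensitive penalty on the numerical difference;
    # gamma <= 1 keeps the influence curve concave (cf. Fig. 14).
    diff = (cnt_preds - cnt_targets).abs()
    cnt_loss = (diff + alpha * diff.pow(gamma)).mean()
    return cls_loss + cnt_loss

# Example usage with random tensors.
logits = torch.randn(16, 10)            # 16 classification answers, 10 classes
labels = torch.randint(0, 10, (16,))
counts = torch.rand(8) * 20             # 8 counting answers
preds = counts + torch.randn(8)
loss = adaptive_numerical_loss(logits, labels, preds, counts)
```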
Additionally, we compare the proposed adaptive numerical optimization with three existing methods designed to address the sample imbalance problem, namely focal loss (Lin et al., 2017), CB loss (Cui et al., 2019), and OHEM (Shrivastava et al., 2016). Table 6 shows that focal loss slightly improves the overall accuracy but reduces the counting accuracy: although it balances the hard and easy samples, it cannot adapt to the regression task. The CB loss considers both the sample difficulty and the class distribution, and also enhances the classification performance (+0.18% OA). The uncertain number of optimized samples in each batch makes OHEM training unstable, leading to negative effects on the VQA tasks. Adding the constraints to the classification and regression tasks separately results in different improvements, and by combining these constraints, the proposed EarthVQANet achieves the best performance.
5.6. Overall architecture analysis
To comprehensively evaluate the contributions of each proposed
module, comprehensive analysis experiments are performed. Earth-
VQANet was disassembled into five sub-modules, which are: (1) self-
attention, (2) bidirectional cross-attention, (3) multi-task strategy, (4)
semantic guidance, and 5) adaptive numerical optimization. Table 7
shows that each module contributes to the performance improvements
in different ways. Bidirectional cross-attention achieves higher accu-
racies compared to self-attention, which highlights the importance of
interaction between multi-modal features. Multi-task training further
Fig. 16. Visualization of the attention maps in the semantic-guided cross-attention layers with language features as the queries. From left to right are the raw image, the semantic prediction, and the visualized attention maps from l_1 to l_5. The three examples are queried by different keywords: 'intersections', 'residents', and 'water'. As the layers deepen, the model progressively focuses on the regions matching the queried words.
Table 6
Comparison with similar optimization algorithms.

| Optimization                     | Classification | Regression | OA (%) | OR     |
|----------------------------------|----------------|------------|--------|--------|
| CE loss (baseline)               |                |            | 78.02  | 0.8146 |
| Focal loss (Lin et al., 2017)    | ✓              |            | 78.11  | 0.8202 |
| CB loss (Cui et al., 2019)       | ✓              |            | 78.20  | 0.8154 |
| OHEM (Shrivastava et al., 2016)  | ✓              |            | 77.67  | 0.8362 |
| +balance on cls.                 | ✓              |            | 78.18  | 0.8189 |
| +penalty on reg.                 |                | ✓          | 78.23  | 0.7923 |
| EarthVQANet                      | ✓              | ✓          | 78.54  | 0.7980 |
The addition of semantic guidance benefits the difficult questions that require object relational reasoning and significantly improves the counting accuracy. All the proposed modules are compatible with each other within the EarthVQANet framework. Apart from the necessary attention blocks, EarthVQANet maintains a lightweight design and efficiently performs information extraction and knowledge reasoning.
Regarding the model efficiency, we report both theoretical and practical indices. The theoretical indices include the floating-point operations (FLOPs) and the model parameters, while the practical metric is the inference speed in samples per second (FPS). We report the inference speed measured after 500 runs on a single 24 GB RTX 4090 GPU. The comparative efficiency results are shown in Table 8. Without the segmentation part, RSVQA achieves the highest efficiency but the lowest accuracy. Equipped with the segmentation network, BUTD and MCAN are supervised by the multi-task training, which decreases the model efficiency. Compared with MCAN, EarthVQANet without guidance adds 6.22M parameters and 1.03 GFLOPs of prediction complexity. The semantic guidance further adds 2.14M parameters and slightly slows down the model inference (by 4.24 samples/second).
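The following is a minimal sketch of how such efficiency indices can be measured, assuming a PyTorch model with a forward pass taking an image and a question tensor and a CUDA device; the authors' exact tooling is not stated. FLOPs could be obtained with a profiler such as fvcore's FlopCountAnalysis.

```python
# Hypothetical measurement sketch: parameter count and inference speed (FPS).
import time
import torch

@torch.no_grad()
def benchmark(model, image, question, runs=500, device="cuda"):
    model = model.to(device).eval()
    image, question = image.to(device), question.to(device)
    for _ in range(10):                      # warm-up iterations
        model(image, question)
    torch.cuda.synchronize()                 # assumes a CUDA device
    start = time.time()
    for _ in range(runs):
        model(image, question)
    torch.cuda.synchronize()
    fps = runs * image.shape[0] / (time.time() - start)
    params_m = sum(p.numel() for p in model.parameters()) / 1e6
    return params_m, fps
```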
5.7. Attention visualizations for keywords
The attention maps in the semantic-guided cross-attention layers
are visualized to analyze the mechanism of multi-modal interactions.
Since the attention map size is 24 × 24, we use bilinear interpolation to scale it to the size of the raw image for a better comparison. Fig. 16(a) shows an urban scene, where 'intersections' is used to query the visual features. The shallow layers (l_1 and l_2) spread their attention over all the roads in the scene, including the arterial roads and the internal paths of the residential areas. However, as the layers deepen, the model gradually narrows its focus to the crossing of the main roads and ultimately concentrates on the intersection. Similarly, 'residents' is selected as the second query word in Fig. 16(b). The attention map in l_1 wrongly focuses on the road and barren areas, but the attention gradually shifts to the correct buildings in l_4 and l_5. The third example is a rural scene, and 'water' is selected to query the visual features. Due to the similar spectral signatures of water and agriculture, l_2 divides its attention between the water and vegetation areas. However, from l_3 onwards, the model starts to filter out the uninteresting agricultural areas and concentrates on the water areas. Overall, the reasoning in the semantic-guided cross-attention layers is a step-by-step process that utilizes the keywords to search for visual clues, reason about the spatial and semantic relationships, and finally summarize the knowledge.
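A minimal sketch of the visualization step described above is given below, assuming the 24 × 24 word-to-patch attention weights are available as a tensor; the function name, variable names, and the default image size are illustrative.

```python
# Hypothetical visualization sketch: upsample a 24x24 attention map to the raw
# image size with bilinear interpolation and normalise it as a heat map.
import torch
import torch.nn.functional as F

def upsample_attention(attn_24x24: torch.Tensor, image_hw=(1024, 1024)):
    # attn_24x24: (24, 24) attention weights for one query word.
    attn = attn_24x24[None, None]                    # -> (1, 1, 24, 24)
    attn = F.interpolate(attn, size=image_hw, mode="bilinear",
                         align_corners=False)
    attn = (attn - attn.min()) / (attn.max() - attn.min() + 1e-8)  # to [0, 1]
    return attn[0, 0]                                # (H, W) heat map

heat = upsample_attention(torch.rand(24, 24))        # example usage
```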
6. Evaluation on other VQA datasets
6.1. Experiments on the FloodNet dataset
To test the generalizability of the proposed EarthVQANet, we also performed comparative experiments on the FloodNet dataset (Rahnemoonfar et al., 2021). FloodNet consists of 2343 post-disaster images obtained by unmanned aerial vehicles in the aftermath of Hurricane Harvey, covering Fort Bend County, Texas, and other directly impacted areas. Because the QA pairs for the validation and test sets are not released, we re-arranged the training samples.
Table 7
Overall architecture analysis.

| Self-Att. | Bi. Cross-Att. | Multi-task | Sem. Guid. | AN optim. | Params (M) | mIoU (%) | OA (%) | OR     |
|-----------|----------------|------------|------------|-----------|------------|----------|--------|--------|
| ✓         |                |            |            |           | 52.44      | –        | 75.16  | 1.1287 |
|           | ✓              |            |            |           | 56.90      | –        | 76.77  | 0.9943 |
| ✓         | ✓              |            |            |           | 58.83      | –        | 76.96  | 0.9918 |
| ✓         | ✓              | ✓          |            |           | 63.13      | 56.97    | 77.55  | 0.8651 |
| ✓         | ✓              | ✓          | ✓          |           | 65.64      | 57.28    | 78.02  | 0.8146 |
| ✓         | ✓              | ✓          | ✓          | ✓         | 65.64      | 57.43    | 78.54  | 0.7980 |
Table 8
Efficiency analysis and comparison.

| Method                      | Segm. | VQA | Guidance | Params (M) | FLOPs (G) | FPS (samples/second) | mIoU (%) | OA (%) |
|-----------------------------|-------|-----|----------|------------|-----------|----------------------|----------|--------|
| Segm. (ConvX-T)             | ✓     |     |          | 33.69      | 108.35    | 46.32                | 52.27    | –      |
| RSVQA                       |       | ✓   |          | 38.32      | 59.54     | 71.29                | –        | 70.95  |
| BUTD                        | ✓     | ✓   |          | 53.59      | 161.23    | 25.86                | 57.11    | 77.55  |
| MCAN                        | ✓     | ✓   |          | 56.91      | 163.98    | 22.45                | 57.07    | 77.53  |
| EarthVQANet (w/o guidance)  | ✓     | ✓   |          | 63.13      | 165.01    | 23.41                | 57.15    | 78.11  |
| EarthVQANet                 | ✓     | ✓   | ✓        | 65.54      | 226.29    | 19.17                | 57.43    | 78.54  |
Table 9
Compared results with other VQA methods on the FloodNet Val set.

| Method                              | mIoU (%) | Com. Ju Acc. (%) | Road Si. Acc. (%) | Sim. Co Acc. (%) | Com. Co Acc. (%) | OA (%) | Sim. Co RMSE | Com. Co RMSE | OR     |
|-------------------------------------|----------|------------------|-------------------|------------------|------------------|--------|--------------|--------------|--------|
| Only Segm.                          | 72.61    | –                | –                 | –                | –                | –      | –            | –            | –      |
| General methods                     |          |                  |                   |                  |                  |        |              |              |        |
| SAN (Yang et al., 2016)             | –        | 98.25            | 99.11             | 32.00            | 31.15            | 78.94  | 2.5171       | 2.9927       | 2.7768 |
| MAC (Hudson and Manning, 2018)      | –        | 98.84            | 98.45             | 36.80            | 29.71            | 79.16  | 5.0812       | 2.3494       | 4.0213 |
| BUTD (Anderson et al., 2018)        | 72.69    | 99.41            | 98.89             | 35.20            | 35.50            | 80.18  | 2.2396       | 2.4657       | 2.3609 |
| BAN (Kim et al., 2018b)             | 71.94    | 98.83            | 98.45             | 38.40            | 38.40            | 80.74  | 2.0842       | 3.1021       | 2.6672 |
| D-VQA (Wen et al., 2021b)           | 71.96    | 98.83            | 98.89             | 32.80            | 36.95            | 79.98  | 1.8654       | 2.4286       | 2.1792 |
| MCAN (Yu et al., 2019)              | 73.07    | 98.25            | 98.67             | 39.20            | 38.40            | 80.85  | 1.9819       | 2.5036       | 2.2706 |
| RS methods                          |          |                  |                   |                  |                  |        |              |              |        |
| RSVQA (Lobry et al., 2020)          | –        | 98.25            | 98.23             | 38.40            | 34.78            | 79.95  | 3.4871       | 4.1090       | 3.8260 |
| RSIVQA (Zheng et al., 2021)         | –        | 97.67            | 98.67             | 40.80            | 30.43            | 79.72  | 2.5059       | 3.1450       | 2.8591 |
| EarthVQANet                         | 75.07    | 98.47            | 98.85             | 48.03            | 45.71            | 83.22  | 1.5163       | 2.4885       | 2.0834 |

The abbreviations are: Com. Ju (complex judging), Road Si. (road situation analysis), Sim. Co (simple counting), Com. Co (complex counting), OA (overall accuracy), OR (overall RMSE).
Fig. 17. Representative samples from the FloodNet dataset.
Specifically, 1156 images and 3587 QA pairs were split off for the Train set, and the rest were used as the Val set. The original semantic categories are: background, building-flooded, building-non-flooded, road-flooded, road-non-flooded, vehicle, pool, tree, water, and grass. The question types cover simple counting, complex counting, complex judging, and road situation analysis.
For the urban disaster experiments, we set the initial learning rate to 5e−5 and the number of attention layers to N = 5, with the other settings unchanged. The comparative results are shown in Table 9. All the compared methods perform well on the complex judging and road situation questions, as the candidate answers are simple and easy to identify (see Fig. 17). However, for the complex counting questions, all the methods obtain low counting accuracies, mainly due to the wide variation in the number of affected buildings in the urban areas and the small number of training samples. As a result, models trained on the FloodNet dataset are susceptible to overfitting. Consistent with the results on the EarthVQA dataset, the models leveraging visual features demonstrate better generalizability than the global fusion methods, which further underscores the effectiveness of the collaborative learning of the segmentation and VQA tasks. Despite the limited samples and challenging disaster scenes, the proposed EarthVQANet achieves the best overall performance, highlighting its potential to provide accurate VQA solutions in disaster scenarios.
Table 10
Compared results with other VQA methods on the RSVQA-HR test set.

| Method                              | Count (%) | Presence (%) | Compare (%) | Area (%) | OA (%) |
|-------------------------------------|-----------|--------------|-------------|----------|--------|
| General methods                     |           |              |             |          |        |
| SAN (Yang et al., 2016)             | 68.42     | 91.47        | 90.77       | 85.25    | 84.37  |
| MAC (Hudson and Manning, 2018)      | 68.08     | 90.94        | 90.68       | 85.57    | 84.09  |
| BUTD (Anderson et al., 2018)        | 69.98     | 92.17        | 91.71       | 86.66    | 85.41  |
| BAN (Kim et al., 2018b)             | 69.35     | 92.04        | 91.36       | 86.38    | 85.05  |
| D-VQA (Wen et al., 2021b)           | 68.17     | 91.01        | 90.79       | 85.63    | 84.21  |
| MCAN (Yu et al., 2019)              | 69.77     | 92.30        | 91.85       | 86.62    | 85.30  |
| RS methods                          |           |              |             |          |        |
| RSVQA (Lobry et al., 2020)          | 69.03     | 91.35        | 91.04       | 83.22    | 83.77  |
| RSVQA-R152 a (Lobry et al., 2020)   | 68.63     | 90.43        | 88.19       | 85.24    | 83.23  |
| RSIVQA (Zheng et al., 2021)         | 69.03     | 91.22        | 66.33       | 86.25    | 76.54  |
| EarthVQANet                         | 70.75     | 92.45        | 92.43       | 87.10    | 85.98  |

a. The accuracies reported in the original paper using ResNet152.
6.2. Experiments on the RSVQA dataset
As a traditional automatically generated VQA dataset, the RSVQA-HR dataset (Lobry et al., 2020) covers the Portland, Manhattan, and Philadelphia areas in America. Each 512 × 512 image has a resolution of 0.15 m and was extracted from the High Resolution Orthoimagery data collection of the United States Geological Survey. The whole dataset includes 10,659 images and 1,066,316 QA pairs, which are split into training (61.5%), validation (11.2%), and test (27.3%) sets. As the RSVQA-HR dataset does not include semantic masks, we convert all the models into global feature fusion types. We trained all the models using the Adam optimizer with a learning rate of 1e−5 for 35 epochs; the batch size was set to 16, and the image augmentations include random vertical and horizontal flipping. As shown in Table 10, compared with the referenced VQA methods, the proposed EarthVQANet achieves the best accuracy on each type of question.
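A hypothetical sketch of this training setup is shown below; the model and dataset objects are placeholders, while the stated settings (Adam, learning rate 1e−5, 35 epochs, batch size 16, random vertical/horizontal flips) come from the text.

```python
# Hypothetical training-setup sketch for the RSVQA-HR experiments.
import torch
from torch.utils.data import DataLoader
import torchvision.transforms as T

augment = T.Compose([          # applied inside the (user-provided) dataset
    T.RandomVerticalFlip(p=0.5),
    T.RandomHorizontalFlip(p=0.5),
    T.ToTensor(),
])

def train(model, dataset, epochs=35, lr=1e-5, batch_size=16, device="cuda"):
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True, num_workers=4)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    model = model.to(device).train()
    for _ in range(epochs):
        for image, question, answer in loader:
            optimizer.zero_grad()
            # Placeholder interface: the model is assumed to return the loss.
            loss = model(image.to(device), question.to(device), answer.to(device))
            loss.backward()
            optimizer.step()
```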
7. Conclusion
To explore the spatial and semantic relations between objects, the semantic segmentation and VQA tasks are collaboratively integrated for information extraction and knowledge reasoning. Specifically, the proposed EarthVQANet consists of a hierarchical pyramid segmentation network and semantic-guided attention. The hierarchical pyramid segmentation network is constructed for information extraction and provides accurate visual features as well as object semantics. The semantic-guided attention first utilizes self-attention to reason about the object relations based on the guided visual features; the cross-attention is then designed to perform the multi-modal interactions, searching for visual clues according to the questions and summarizing the knowledge. Furthermore, we propose adaptive numerical optimization to unify the classification and regression tasks for VQA. Given an image and a question, EarthVQANet can automatically generate the corresponding semantic map and a comprehensive answer. In the experiments, EarthVQANet outperformed ten general and remote sensing VQA methods, including large multi-modal models (BLIP-2 and Instruct-BLIP), and exhibited good scalability and low sensitivity to the hyperparameter settings. We will extend the framework and dataset to other complex scenarios and applications.
CRediT authorship contribution statement
Junjue Wang: Conceptualization, Data curation, Formal analysis, Methodology, Validation, Writing – original draft. Ailong Ma: Conceptualization, Funding acquisition, Supervision, Writing – review & editing. Zihang Chen: Data curation, Visualization. Zhuo Zheng: Conceptualization, Investigation. Yuting Wan: Data curation, Validation, Visualization. Liangpei Zhang: Conceptualization, Supervision, Writing – review & editing. Yanfei Zhong: Conceptualization, Supervision.
Declaration of competing interest
The authors declare that they have no known competing finan-
cial interests or personal relationships that could have appeared to
influence the work reported in this paper.
Acknowledgments
This work was supported by the National Natural Science Founda-
tion of China under Grant 42325105 and 42171336.
References
Abdelnour, J., Rouat, J., Salvi, G., 2023. NAAQA: A neural architecture for acoustic
question answering. IEEE Trans. Pattern Anal. Mach. Intell. 45 (4), 4997–5009.
http://dx.doi.org/10.1109/TPAMI.2022.3194311.
Anderson, P., He, X., Buehler, C., Teney, D., Johnson, M., Gould, S., Zhang, L.,
2018. Bottom-up and top-down attention for image captioning and visual question
answering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition. CVPR.
Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Zitnick, C.L., Parikh, D., 2015. Vqa:
Visual question answering. In: Proceedings of the IEEE International Conference on
Computer Vision. pp. 2425–2433.
Bashmal, L., Bazi, Y., Melgani, F., Ricci, R., Al Rahhal, M.M., Zuair, M., 2023. Visual
question generation from remote sensing images. IEEE J. Sel. Top. Appl. Earth Obs.
Remote Sens. 16, 3279–3293. http://dx.doi.org/10.1109/JSTARS.2023.3261361.
Carbonneau, P.E., Dugdale, S.J., Breckon, T.P., Dietrich, J.T., Fonstad, M.A.,
Miyamoto, H., Woodget, A.S., 2020. Adopting deep learning methods for airborne
RGB fluvial scene classification. Remote Sens. Environ. (ISSN: 0034-4257) 251,
112107. http://dx.doi.org/10.1016/j.rse.2020.112107.
Chappuis, C., Zermatten, V., Lobry, S., Le Saux, B., Tuia, D., 2022. Prompt-RSVQA:
Prompting visual context to a language model for remote sensing visual question
answering. In: Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition. pp. 1372–1381.
Chen, J., Jönsson, P., Tamura, M., Gu, Z., Matsushita, B., Eklundh, L., 2004. A simple
method for reconstructing a high-quality NDVI time-series data set based on the
Savitzky-Golay filter. Remote Sens. Environ. 91 (3–4), 332–344.
Chen, D., Zhong, Y., Zheng, Z., Ma, A., Lu, X., 2021. Urban road mapping based on an
end-to-end road vectorization mapping network framework. ISPRS J. Photogramm.
Remote Sens. 178, 345–365.
Cheng, G., Han, J., Lu, X., 2017. Remote sensing image scene classification: Benchmark
and state of the art. Proc. IEEE 105 (10), 1865–1883.
Cui, Y., Jia, M., Lin, T.-Y., Song, Y., Belongie, S., 2019. Class-balanced loss based
on effective number of samples. In: Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition. pp. 9268–9277.
Dai, W., Li, J., Li, D., Tiong, A.M.H., Zhao, J., Wang, W., Li, B., Fung, P.N.,
Hoi, S., 2024. Instructblip: Towards general-purpose vision-language models with
instruction tuning. Adv. Neural Inf. Process. Syst. 36.
Diakogiannis, F.I., Waldner, F., Caccetta, P., Wu, C., 2020. ResUNet-a: A deep
learning framework for semantic segmentation of remotely sensed data. ISPRS J.
Photogramm. Remote Sens. 162, 94–114.
Dimitrovski, I., Kitanovski, I., Kocev, D., Simidjievski, N., 2023. Current trends in
deep learning for Earth Observation: An open-source benchmark arena for image
classification. ISPRS J. Photogramm. Remote Sens. 197, 18–35.
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T.,
Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al., 2020. An image is
worth 16x16 words: Transformers for image recognition at scale. arXiv preprint
arXiv:2010.11929.
Gao, L., Lei, Y., Zeng, P., Song, J., Wang, M., Shen, H.T., 2021b. Hierarchical
representation network with auxiliary tasks for video captioning and video question
answering. IEEE Trans. Image Process. 31, 202–215.
Gao, F., Ping, Q., Thattai, G., Reganti, A., Wu, Y.N., Natarajan, P., 2022. Transform-
retrieve-generate: Natural language-centric outside-knowledge visual question
answering. In: Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition. pp. 5067–5077.
Gao, C., Zhu, Q., Wang, P., Li, H., Liu, Y., Van den Hengel, A., Wu, Q., 2021a.
Structured multimodal attentions for textvqa. IEEE Trans. Pattern Anal. Mach.
Intell. 44 (12), 9603–9614.
Ghorbanzadeh, O., Xu, Y., Zhao, H., Wang, J., Zhong, Y., Zhao, D., Zang, Q., Wang, S.,
Zhang, F., Shi, Y., Zhu, X.X., Bai, L., Li, W., Peng, W., Ghamisi, P., 2022. The
outcome of the 2022 Landslide4Sense competition: Advanced landslide detection
from multisource satellite imagery. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens.
15, 9927–9942. http://dx.doi.org/10.1109/JSTARS.2022.3220845.
Hänsch, R., Persello, C., Vivone, G., Navarro, J.C., Boulch, A., Lefevre, S., Le Saux, B.,
2022. The 2022 IEEE GRSS data fusion contest: Semisupervised learning [technical
committees]. IEEE Geosci. Remote Sens. Mag. 10 (1), 334–337.
He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R., 2022. Masked autoencoders are
scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition. pp. 16000–16009.
He, K., Zhang, X., Ren, S., Sun, J., 2016. Deep residual learning for image recog-
nition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition. pp. 770–778.
Ho, J., Jain, A., Abbeel, P., 2020. Denoising diffusion probabilistic models. Adv. Neural
Inf. Process. Syst. 33, 6840–6851.
Hochreiter, S., Schmidhuber, J., 1997. Long short-term memory. Neural Comput. 9 (8),
1735–1780.
Hossain, M.D., Chen, D., 2022. A hybrid image segmentation method for building
extraction from high-resolution RGB images. ISPRS J. Photogramm. Remote Sens.
192, 299–314.
Hudson, D.A., Manning, C.D., 2018. Compositional attention networks for machine rea-
soning. In: 6th International Conference on Learning Representations, ICLR 2018,
Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings.
OpenReview.net.
Jiang, Y., Natarajan, V., Chen, X., Rohrbach, M., Batra, D., Parikh, D., 2018. Pythia v0.1: The winning entry to the VQA Challenge 2018. arXiv preprint arXiv:1807.09956.
Kellenberger, B., Marcos, D., Tuia, D., 2018. Detecting mammals in UAV images: Best
practices to address a substantially imbalanced dataset with deep learning. Remote
Sens. Environ. 216, 139–153.
Kenton, J.D.M.-W.C., Toutanova, L.K., 2019. BERT: Pre-training of deep bidirectional
transformers for language understanding. In: Proceedings of NAACL-HLT. pp.
4171–4186.
Kim, J.-H., Jun, J., Zhang, B.-T., 2018a. Bilinear attention networks. Adv. Neural Inf.
Process. Syst. 31.
Kim, J.-H., Jun, J., Zhang, B.-T., 2018b. Bilinear Attention Networks. In: Advances in
Neural Information Processing Systems. Vol. 31, pp. 1571–1581.
Kirillov, A., Girshick, R., He, K., Dollár, P., 2019. Panoptic feature pyramid networks.
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition. pp. 6399–6408.
Krishna, R., Zhu, Y., Groth, O., Johnson, J., Hata, K., Kravitz, J., Chen, S., Kalantidis, Y.,
Li, L.-J., Shamma, D.A., et al., 2017. Visual genome: Connecting language and
vision using crowdsourced dense image annotations. Int. J. Comput. Vis. 123,
32–73.
Li, J., Li, D., Savarese, S., Hoi, S.C.H., 2023a. BLIP-2: bootstrapping language-
image pre-training with frozen image encoders and large language models. In:
Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., Scarlett, J. (Eds.),
International Conference on Machine Learning, ICML 2023, 23-29 July 2023,
Honolulu, Hawaii, USA. In: Proceedings of Machine Learning Research, vol. 202,
PMLR, pp. 19730–19742.
Li, K., Vosselman, G., Yang, M.Y., 2023b. HRVQA: A visual question answering
benchmark for high-resolution aerial images. arXiv preprint arXiv:2301.09460.
Lin, T.-Y., Goyal, P., Girshick, R., He, K., Dollár, P., 2017. Focal loss for dense object
detection. In: Proceedings of the IEEE International Conference on Computer Vision.
pp. 2980–2988.
Lin, Y., Xie, Y., Chen, D., Xu, Y., Zhu, C., Yuan, L., 2022. Revive: Regional visual
representation matters in knowledge-based visual question answering. Adv. Neural
Inf. Process. Syst. 35, 10560–10571.
Liu, H., Li, C., Wu, Q., Lee, Y.J., 2024. Visual instruction tuning. Adv. Neural Inf.
Process. Syst. 36.
Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B., 2021. Swin
transformer: Hierarchical vision transformer using shifted windows. In: Proceedings
of the IEEE/CVF International Conference on Computer Vision. pp. 10012–10022.
Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S., 2022. A convnet for
the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition. pp. 11976–11986.
Liu, Y., Zhong, Y., Ma, A., Zhao, J., Zhang, L., 2023. Cross-resolution national-scale
land-cover mapping based on noisy label learning: A case study of China. Int. J.
Appl. Earth Obs. Geoinf. 118, 103265.
Lobry, S., Demir, B., Tuia, D., 2021. RSVQA meets bigearthnet: A new, large-scale,
visual question answering dataset for remote sensing. In: 2021 IEEE International
Geoscience and Remote Sensing Symposium IGARSS. pp. 1218–1221. http://dx.doi.
org/10.1109/IGARSS47720.2021.9553307.
Lobry, S., Marcos, D., Murray, J., Tuia, D., 2020. RSVQA: Visual question answering
for remote sensing data. IEEE Trans. Geosci. Remote Sens. 58 (12), 8555–8566.
Lu, J., Batra, D., Parikh, D., Lee, S., 2019. Vilbert: Pretraining task-agnostic visiolin-
guistic representations for vision-and-language tasks. Adv. Neural Inf. Process. Syst.
32.
Ma, A., Wang, J., Zhong, Y., Zheng, Z., 2022. FactSeg: Foreground activation-
driven small object semantic segmentation in large-scale remote sensing imagery.
IEEE Trans. Geosci. Remote Sens. 60, 1–16. http://dx.doi.org/10.1109/TGRS.2021.
3097148.
Martins, V.S., Kaleita, A.L., Gelder, B.K., da Silveira, H.L., Abe, C.A., 2020. Exploring
multiscale object-based convolutional neural network (multi-OCNN) for remote
sensing image classification at high spatial resolution. ISPRS J. Photogramm.
Remote Sens. 168, 56–73.
Pelletier, C., Valero, S., Inglada, J., Champion, N., Dedieu, G., 2016. Assessing the
robustness of random forests to map land cover with high resolution satellite image
time series over large areas. Remote Sens. Environ. 187, 156–168.
Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G.,
Askell, A., Mishkin, P., Clark, J., Krueger, G., Sutskever, I., 2021. Learning trans-
ferable visual models from natural language supervision. In: Meila, M., Zhang, T.
(Eds.), Proceedings of the 38th International Conference on Machine Learning. In:
Proceedings of Machine Learning Research, vol. 139, PMLR, pp. 8748–8763, URL
https://proceedings.mlr.press/v139/radford21a.html.
Rahnemoonfar, M., Chowdhury, T., Sarkar, A., Varshney, D., Yari, M., Murphy, R.R.,
2021. FloodNet: A high resolution aerial imagery dataset for post flood scene
understanding. IEEE Access 9, 89644–89654.
Shrivastava, A., Gupta, A., Girshick, R., 2016. Training region-based object detectors
with online hard example mining. In: Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition. pp. 761–769.
Song, J., Zeng, P., Gao, L., Shen, H.T., 2018. From pixels to objects: cubic visual
attention for visual question answering. In: Proceedings of the 2