Deep Learning for Downscaling Remote Sensing Images: Fusion and Super-Resolution

MARIA SDRAKA, IOANNIS PAPOUTSIS, BILL PSOMAS, KONSTANTINOS VLACHOS, KONSTANTINOS IOANNIDIS, KONSTANTINOS KARANTZALOS, ILIAS GIALAMPOUKIDIS, AND STEFANOS VROCHIDIS

Digital Object Identifier 10.1109/MGRS.2022.3171836
Date of current version: 2 June 2022
The past few years have seen an accelerating integration of deep learning (DL) techniques into various remote sensing (RS) applications, highlighting their power to adapt and achieve unprecedented advancements. In the present review, we provide an exhaustive exploration of the DL approaches proposed specifically for the spatial downscaling of RS imagery. A key contribution of our work is the presentation of the major architectural components and models, metrics, and data sets available for this task as well as the construction of a compact taxonomy for navigating through the various methods. Furthermore, we analyze the limitations of the current modeling approaches and provide a brief discussion on promising directions for image enhancement, following the paradigm of general computer vision (CV) practitioners and researchers as a source of inspiration and constructive insight.
MOTIVATION
Recent technological advances have significantly increased
the volume and distribution rate of RS data, reaching the
level of tens of terabytes on a daily basis. For that reason,
such data have become a ubiquitous source of information
for the monitoring of Earth’s physical, chemical, and bio-
logical systems, assisting with atmospheric, geological, and
oceanic research as well as hazard assessment and resource
management applications, to name a few.
Satellite RS currently drives Earth observation (EO) research and applications. There are many operational satellites orbiting Earth mounted with active and passive RS sensors, providing a continuous stream of information on various aspects of the planet's physical processes. Satellite imagery from these sensors is characterized by its spatial, spectral, temporal, and radiometric resolutions [1]. The spatial resolution (or the ground sample distance) refers to the size of a single satellite image pixel on the ground and corresponds to the level of spatial detail that can be acquired with this particular sensor. Spectral resolution refers to the range of the electromagnetic (EM) spectrum (wavebands) that the sensor acquires observations in, while temporal resolution (or revisit time) refers to the time interval between two consecutive image acquisitions of the same location. Finally, radiometric resolution refers to the numerical precision or bit depth of a single pixel. Unfortunately, due to technical and financial constraints, there are usually tradeoffs among these factors, and no available sensor can capture information at the highest possible spatial and temporal resolution across all wavebands.
Therefore, one of the hottest topics in RS is the fusion of
multisource data with the aim to combine their strengths
and enhance the resolution along the spatial, spectral, or
temporal dimension. In this particular study, we focus on the spatial downscaling problem, which can be greatly aided by the integration of DL methods and makes up an essential part of the pipeline of various RS research fields, such as land use and land cover classification [2], [3], defor-
estation monitoring [4], [5], crop yield forecasting, precipi-
tation forecasting [6], disaster monitoring [7], [8], stream
flow monitoring [9], and many more.
Several review articles were published recently that, to
a certain extent, address the problem of image downscal-
ing with deep neural networks. The present study aims to
differ and, ultimately, add a methodological framework as
well as a valuable summary of the most recent literature on enhancing the spatial resolution of satellite imagery data,
specifically, using advanced DL architectures. These DL
models are tailored to EO data with their unique and het-
erogeneous spatial, temporal, and spectral characteristics,
which differ significantly from the imagery traditionally
used by the CV community.
In fact, research on CV applications has motivated the production of valuable review articles, mainly for (nonsatellite) image super-resolution (SR), like [10]–[15] and [16]. Our work exclusively targets the RS field and provides a broader overview of methods and applications than [17]–[19], which focus solely on pansharpening approaches, or [20], which only examines single-image SR (SISR) non-DL methods. Additionally, a number of noteworthy studies [21]–[23] provide a thorough analysis of the use of DL techniques in RS, but they are not limited to the spatial downscaling problem and address the entire spectrum of applications.
([1] and [24]) focus on the state of the art of multimodal data
fusion, partially addressing image resolution enhancement
without focusing on DL techniques. Finally, a study similar
to ours [25] reviews the literature up to mid-2019, therefore
missing the most recent state-of-the-art approaches.
Indeed, the last three years have been productive for scientific works on image downscaling with DL. For example, while publications on RS image SR have been steadily increasing, the ratio of studies that use DL has grown sharply, from 5% in 2017 to almost 40% in 2020 (Figure 1). Similarly, in CV, publications on DL for image SR [26] have exhibited a steady increase.
In this review article, we present the recent advancements (up to July 2021) of spatial downscaling on satellite imaging through DL approaches and analyze their strengths and shortcomings. We are only interested in the enhancement of surface reflectance products and do not ad-
dress geophysical variables, such as land surface tempera-
ture (LST), vegetation indexes, and so on.
TERMINOLOGY
Before moving forward, we need to clarify which terminology is used in this article as far as spatial resolution increase/decrease is concerned. In climate and meteorological (e.g., [27]) as well as RS [28] studies, the term downscale refers to the transition from low to high resolution, i.e., from a less to a more detailed representation. However, in the CV field, it is the term upscale that refers to the increase of (spatial) resolution, and downscale refers to its decrease (e.g., [29]); these terms are synonymous with upsample and downsample, respectively. Zhan et al. [30] conducted research on LST downscaling terminology, among others, and found that terms such as enhancement, sharpening, fusion, SR, unmixing, subpixel, and disaggregation are also relevant to spatial resolution increase. In this article, we use the term downscale.
DEEP LEARNING FOR REMOTE SENSING
The governing principle of DL is the construction of arti-
ficial neural networks with a large number of layers (in-
dicated by the adjective deep in the term), which mostly
comprise convolutional, pooling, and fully connected
units. Although several architectures with these building
blocks have been proposed, some of which have been care-
fully handcrafted for a specific task, the main idea is the construction of a hierarchy of features extracted from raw
input data. This hierarchy is computed through representa-
tion learning approaches that can be supervised, semisu-
pervised, or unsupervised. Overall, the strongest advantage
of DL is its ability to process raw data, thus mitigating the
need for manual feature extraction, and unravel complex
nonlinear dependencies in the input.
One critical factor for the success of any DL method is
the existence of a large and diverse data set to train on.
The abundance and availability of data in EO, therefore,
provide a fertile ground for the application of advanced
machine learning algorithms, and notable progress has
been made over the last decade ([21]–[23]). For example,
a number of works that exploit deeper architectures have
recently been published and achieve impressive results in
problems such as land use and land cover classification
FIGURE 1. The number of published papers related to image SR
for traditional and DL-based techniques for the satellite RS and CV
fields [26].
[31], scene classification [32], object detection [33], im-
age fusion [1], and image registration [34], [35], high-
lighting the great potential of DL in RS applications and
research.
However, EO poses a unique challenge for DL since it
involves the manipulation of multimodal and multitempo-
ral data. Remote sensors acquire information from multiple
segments of the EM spectrum, differentiating themselves
from typical CV data, which lie mostly in the red, green,
blue (RGB) range. In addition, time is quite an important
variable in EO applications. When studying dynamic sys-
tems, information is captured at regular time intervals, and
successive observations must be assessed and compared.
Finally, RS images often suffer from information loss, due
to either hardware failure or atmospheric conditions that
are difficult for certain sensor types to penetrate (e.g., cloud
coverage, haze, and so on are common obstacles for opti-
cal sensors). Therefore, any researcher willing to design and
implement novel DL algorithms for EO must take all of
these points into consideration.
PROBLEM DEFINITION
Given a set of $n$ low-resolution (LR) images $(x_1, x_2, \ldots, x_n)$, where $x_i \in X^{H \times W}$, and their corresponding high-resolution (HR) images $(y_1, y_2, \ldots, y_n)$, where $y_i \in Y^{kH \times kW}$, the goal is to estimate a downscaling function $f: X \rightarrow Y$. Note that $H$ is the image height, $W$ is the image width, and $k$ is the scaling factor. This survey presents the approaches that have been proposed for the estimation of this nonlinear downscaling function $f$ through deep neural networks.
IMAGING MODEL
The process of obtaining the LR image $x$ from its HR equivalent $y$ is commonly represented in the literature by the imaging model

$$x = (y \otimes b)\downarrow_k + n, \qquad (1)$$

where $\otimes b$ denotes the convolution with a blurring kernel $b$, $\downarrow_k$ is the downsampling operation by a scaling factor $k$, and $n$ is a noise term. This formula is a simple model of the image degradation taking place during the capture of the scene and attempts to simulate the physics inside the imaging sensor. Some researchers have proposed modifications of this model that account for parameters like the motion blur, quantization error of the compression process, zooming effects, exposure time, white balancing, and so on. For a thorough investigation of the imaging model and its many extensions, please refer to [14].
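To make the degradation concrete, the following minimal sketch simulates (1) for a single-band image; the Gaussian blurring kernel, scaling factor, and noise level are illustrative assumptions rather than values tied to any particular sensor:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def degrade(hr, scale=4, blur_sigma=1.5, noise_std=0.01):
    """Simulate x = (y convolved with b), downsampled by k, plus noise n."""
    blurred = gaussian_filter(hr, sigma=blur_sigma)         # convolution with blurring kernel b
    lr = blurred[::scale, ::scale]                          # downsampling by the scaling factor k
    return lr + np.random.normal(0.0, noise_std, lr.shape)  # additive noise term n

hr = np.random.rand(256, 256).astype(np.float32)            # stand-in HR reflectance patch
lr = degrade(hr, scale=4)
print(hr.shape, lr.shape)                                    # (256, 256) (64, 64)
```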
WALD’S PROTOCOL
Due to the lack of paired LR–HR images in most cases, an alternative approach described by Wald's protocol [36] is employed (Figure 2). This protocol assumes that the performance of data fusion models is independent of the scale, provided that certain conditions hold. In their seminal work, Wald et al. suggest first degrading the input image according to a factor k, thus creating LR–HR image pairs, and proceed to design a model tasked to downscale it to the original resolution. Then, the developed method can be transferred to downscale the original image into one of much higher resolution according to the same downscaling factor k. Effectively, this is a self-supervised modeling approach. Note that, throughout this document, we refer to the LR images as coarse (C) and the HR images as fine (F).
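A minimal sketch of Wald's protocol as a self-supervised pipeline is given below; it reuses the degrade() helper from the previous snippet, and the model object with fit()/predict() methods is hypothetical:

```python
def make_wald_pairs(originals, scale=4):
    """Build (coarse, fine) training pairs by degrading each original image,
    so the originals themselves act as the fine (F) targets."""
    return [(degrade(img, scale=scale), img) for img in originals]

# Training stage: learn to recover the original resolution from the degraded inputs.
# pairs = make_wald_pairs(original_images, scale=4)   # original_images: list of arrays
# model.fit(pairs)                                    # hypothetical training call
#
# Inference stage: apply the trained model to the original (non-degraded) images,
# producing outputs downscaled by the same factor beyond the native resolution.
# enhanced = [model.predict(img) for img in original_images]
```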
METRICS
Several quality metrics have been proposed to assess the
output of image restoration algorithms. Depending on the
availability of a reference HR image, these metrics can be
divided into three broad categories [37]:
Full reference: A complete HR reference image is required
for comparison with the reconstructed image.
No reference: Only the reconstructed image is required.
Reduced reference: Only a set of features extracted from an HR image is available and used for comparison.
Table 1 presents some of the most popular quality metrics
found in the literature for the task of spatial enhancement.
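As a point of reference, full-reference metrics such as the PSNR and SSIM from Table 1 are available in off-the-shelf libraries; a small sketch with scikit-image, assuming image values scaled to [0, 1]:

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

reference = np.random.rand(128, 128)                         # stand-in HR reference image
reconstruction = np.clip(reference + 0.05 * np.random.randn(128, 128), 0.0, 1.0)

psnr = peak_signal_noise_ratio(reference, reconstruction, data_range=1.0)
ssim = structural_similarity(reference, reconstruction, data_range=1.0)
print(f"PSNR: {psnr:.2f} dB, SSIM: {ssim:.4f}")
```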
PERCEPTION–DISTORTION TRADEOFF
Full-reference metrics are also referred to as distortion metrics and, typically, measure the similarity/dissimilarity between the reconstructed image and the corresponding HR image. The goal of such metrics is to assess the reconstruction algorithm's ability to respect the structure and semantic content of the target image, and they can be generally formulated as

$$\Delta(I_{HR}, \hat{I}_{HR}), \qquad (2)$$
FIGURE 2. The Wald's protocol pipeline. The original image (middle) is upscaled by a /k factor, and the resulting pair is used for model training. The trained model is then transferred to downscale the original image by a ×k factor.
where $\Delta$ is a similarity metric, $I_{HR}$ is the HR image, and $\hat{I}_{HR}$ is the reconstructed one.
Accordingly, no-reference metrics are also known as perceptual quality metrics, and they aim to quantify the "natural look" of a reconstructed image, i.e., how close it looks to a valid natural image, regardless of its similarity to the corresponding $I_{HR}$. Such metrics tend to approximate the perceptual quality of the human visual system and can be formulated as
TABLE 1. THE MOST POPULAR METRICS FOR IMAGE QUALITY ASSESSMENT.
METRIC | RANGE | DESCRIPTION | CATEGORY
Mean square error (MSE) | [0, ∞) | Pixel-based mean square error | FR
Root-mean-square error (RMSE) | [0, ∞) | Pixel-based root-mean-square error | FR
Mean absolute error (MAE) | [0, ∞) | Pixel-based mean absolute error | FR
Correlation coefficient (CC) | [–1, 1] | Pixel-based correlation | FR
Coefficient of determination (R2) | [0, 1] | Per-pixel proportion of total variation | FR
Signal-to-reconstruction-error ratio (SRE) | [0, ∞) | Error relative to the mean image intensity | FR
Peak signal-to-noise ratio (PSNR) | (–∞, ∞) | Peak SNR based on the MSE and expressed in decibels | FR
Weighted peak signal-to-noise ratio (WPSNR) [38] | (–∞, ∞) | Weighted PSNR to evaluate differently specific regions of the image | FR
Universal image quality index (UIQI or UQI) [39] | [–1, 1] | Local differences in correlation, luminance, and contrast | FR
Structural similarity index (SSIM) [37] | [–1, 1] | Based on the UQI and measures local differences in luminance, contrast, and structure | FR
Multiscale structural similarity index (MS-SSIM) [40] | [–1, 1] | A combination of the SSIM at various scales | FR
Information fidelity criterion (IFC) [41] | [0, ∞) | Utilizes natural scene statistics, defined as Gaussian scale mixtures in the wavelet domain | FR
Visual information fidelity (VIF) [42] | [0, ∞) | An extension of the IFC obtained by normalizing over the reference image content | FR
Noise quality measure (NQM) [43] | (–∞, ∞) | The SNR based on contrast pyramid variations | FR
Feature similarity index (FSIM) [44] | [0, 1] | Similar to the SSIM and utilizes phase congruency and gradient magnitude | FR
Gradient similarity measure (GSM) [45] | [0, 1] | Similar to the SSIM and measures gradient similarity | FR
Spectral angle mapper (SAM) [46] | [0, π] | Compares the angle between the two spectra | FR
Erreur relative globale adimensionnelle de synthèse (ERGAS) [47] | [0, ∞) | Mean of the normalized average error of each band | FR
Most apparent distortion (MAD) [48] | [0, ∞) | Weighted geometric mean of the local error in the luminance domain and the subband local statistics | FR
VGG loss [49] | [0, ∞) | The MSE between feature maps extracted from intermediate layers of a VGG network for both prediction and target images | FR
Blind/referenceless image spatial quality evaluator (BRISQUE) [50] | [0, ∞) | Support vector regression model trained on natural scene statistics of locally normalized luminance coefficients accompanied with differential mean opinion scores (for different distortions) | NR
Natural image quality evaluator (NIQE) [51] | [0, ∞) | Multivariate Gaussian model trained on natural scene statistics, similar to BRISQUE (but for nondistorted images only) | NR
Perception-based image quality evaluator (PIQE) [52] | [0, 1] | Natural scene statistics, similar to BRISQUE, extracted from blocks of the distorted image and then pooled based on variance | NR
QMA [53] | [0, ∞) | Linear regression on the outputs of three independent regression forests trained on extracted features of local frequency, global frequency, and spatial discontinuity along with the corresponding perceptual scores | NR
Perception index (PI) [54] | [0, ∞) | The linear combination of QMA and NIQE | NR
Learned perceptual image patch similarity (LPIPS) [55] | [0, ∞) | L2 (Euclidean) norm and averaging between features extracted from machine learning models on supervised, self-supervised, or unsupervised settings | NR
Quality with no reference (QNR) [56] | [0, 1] | One's complements of two spectral and spatial distortion indexes based on the band correlation, each raised to a real-valued exponent | NR
FR = full reference; NR = no reference.
$$d(p_{I_{HR}}, p_{\hat{I}_{HR}}), \qquad (3)$$

where $d$ is a distribution similarity metric, $p_{I_{HR}}$ is the distribution of the natural HR images, and $p_{\hat{I}_{HR}}$ is the distribution of the reconstructed images.
Reduced-reference metrics provide an intermediate approach to full- and no-reference metrics, and they can be regarded as either distortion or perceptual depending on the extracted features. Such metrics are primarily used for the quality-of-service monitoring of image-/video-broadcasting systems, where only a selected number of features are transmitted along with the compressed image to assess the transmission quality. In the image enhancement domain, no such metrics are noted to be in wide use.
It was empirically observed and then mathematically proven [57] that distortion and perceptual quality metrics act in a complementary yet competitive manner. The perception–distortion tradeoff theorem dictates that, as the distortion error of an algorithm decreases, the best achievable visual quality must also decrease, and vice versa. In practice, pursuing a low distortion rate results in blurrier and oversmoothed images because the produced output approximates the statistical average of the possible HR solutions to this one-to-many problem, whereas a sharper, more natural-looking result is usually not consistent with the initial LR image. It has also been proven that there is an unattainable region in the perception–distortion plane whose boundary is monotonic. This means that no reconstruction method can achieve an arbitrarily low distortion error and an arbitrarily high perceptual quality at the same time, but attempts are made to design an algorithm as close to the boundary as possible. Figure 3 illustrates the perception–distortion plane and the aforementioned boundary.
An interesting conclusion of [57] is that the method that converges closest to the perception–distortion bound is the generative adversarial network (GAN) [58]. Researchers show that such models are usually trained to minimize a weighted sum of a distortion and a perceptual quality metric by modifying the loss function of the generator as follows:

$$l_G = \mathbb{E}_{I_{HR}}\big[\Delta(I_{HR}, \hat{I}_{HR})\big] + \lambda \, d(p_{I_{HR}}, p_{\hat{I}_{HR}}), \qquad (4)$$

where $\lambda$ is the weight of the perceptual quality factor, and $d(p_{I_{HR}}, p_{\hat{I}_{HR}})$ is usually approximated by the standard adversarial loss. Therefore, GANs are usually able to produce images of a low distortion error and with the highest perceptual quality possible for this distortion error.
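A minimal PyTorch-style sketch of the generator objective in (4), with the distortion term taken to be the L1 error and the perceptual term approximated by the standard adversarial loss; the discriminator network and the weight value are assumptions of the sketch, not a specific published configuration:

```python
import torch
import torch.nn.functional as F

def generator_loss(sr, hr, discriminator, lam=1e-3):
    """Weighted sum of a distortion term and an adversarial (perceptual) term."""
    distortion = F.l1_loss(sr, hr)                     # Delta(I_HR, I_HR_hat)
    fake_logits = discriminator(sr)                    # discriminator scores for the SR output
    adversarial = F.binary_cross_entropy_with_logits(  # approximates d(p_IHR, p_IHR_hat)
        fake_logits, torch.ones_like(fake_logits))
    return distortion + lam * adversarial
```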
STANDARD DEEP LEARNING METHODS FOR
DOWNSCALING IN COMPUTER VISION
Resolution enhancement has been thoroughly investigated
in the field of general CV over the past decades. Certain
methods and algorithms have been established and often
serve as the basis of further investigation and improve-
ments when developing novel approaches for RS downscal-
ing. We present these methods in this section and then use
them throughout our article as core modules.
BUILDING BLOCKS
In this section, we briefly present some of the most fun-
damental building blocks of downscaling DL architectures.
UPSAMPLING LAYERS
Resize convolution: This was one of the first techniques proposed for feature downscaling. This operation involves upsampling the input by a traditional interpolation method, such as nearest neighbor, bilinear, or bicubic interpolation, and then performing a convolution on the result [Figure 4(a)]. Although it is a simple approach, it has been successfully applied to a number of studies in the field of CV.
Transposed convolution: This layer is also called the deconvolutional layer [59], which is a quite inaccurate term since deconvolution in CV aims to revert the operation of a normal convolution and is rarely used in DL. Conversely, transposed convolution aims to produce a feature map of higher dimensions by first expanding the input with zero insertions and then performing a convolution [Figure 4(b)]. The transposed convolutional layer is widely used in downscaling architectures, but caution is required since it is quite susceptible to producing checkerboard artifacts, affecting the overall quality of the output [60].
Subpixel convolution: Also called pixel shuffle [61], this layer comprises a convolution operation followed by a specific reshape that rearranges the input features of shape $H \times W \times Cr^2$ to $rH \times rW \times C$ [Figure 4(c)]. This layer achieves a larger receptive field than transposed convolution and causes fewer artifacts in the final output [62].
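The subpixel (pixel shuffle) layer maps directly onto standard DL primitives; a minimal PyTorch sketch, with the channel count and kernel size chosen only for illustration:

```python
import torch
import torch.nn as nn

class SubpixelUpsample(nn.Module):
    """Convolution producing C*r^2 channels, followed by a pixel shuffle to the x r scale."""
    def __init__(self, channels, scale):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels * scale ** 2, kernel_size=3, padding=1)
        self.shuffle = nn.PixelShuffle(scale)          # (B, C*r^2, H, W) -> (B, C, rH, rW)

    def forward(self, x):
        return self.shuffle(self.conv(x))

features = torch.randn(1, 64, 32, 32)                  # a batch of 64-channel feature maps
print(SubpixelUpsample(64, 2)(features).shape)         # torch.Size([1, 64, 64, 64])
```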
RESIDUAL LEARNING
The aim of downscaling is to learn a mapping between one
(or multiple) LR image(s) and an HR image. This formulates
an image-to-image translation task where the input (LR) is
highly correlated with the output (HR) regardless of the
scaling factor. To simplify this task and avoid learning such
FIGURE 3. The perception–distortion plane and the monotonic
boundary separating the unattainable region. (Source: [57]; used
with permission.)
a complex translation, several studies employ global resid-
ual learning architectures [63] that focus on learning solely
the residual, or difference, between the input and output.
Provided that a considerable part of the image remains ba-
sically unchanged, such a model is tasked with retrieving
only the high-frequency details needed for the reconstruc-
tion of the HR counterpart, so it generally converges faster
and avoids bad minima.
In addition to global residual learning, local residual
learning connections [64] are also commonly employed in
downscaling architectures to alleviate vanishing gradients
as the model gets deeper and more complex. Local residual
learning shortcuts are inserted between intermediate lay-
ers, while a global residual learning connection is used be-
tween the input and output.
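A compact sketch of global residual learning in a preupsampling setting (the network predicts only the high-frequency residual and adds it back to the interpolated input; depth and width are illustrative):

```python
import torch.nn as nn

class GlobalResidualSR(nn.Module):
    """Predicts the residual between an upsampled LR input and the HR target."""
    def __init__(self, channels=1, features=64, depth=5):
        super().__init__()
        layers = [nn.Conv2d(channels, features, 3, padding=1), nn.ReLU(inplace=True)]
        for _ in range(depth - 2):
            layers += [nn.Conv2d(features, features, 3, padding=1), nn.ReLU(inplace=True)]
        layers.append(nn.Conv2d(features, channels, 3, padding=1))
        self.body = nn.Sequential(*layers)

    def forward(self, x_upsampled):
        return x_upsampled + self.body(x_upsampled)    # global residual connection
```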
LAPLACIAN PYRAMID STRUCTURE
First proposed in [65], the Laplacian pyramid structure is
a feature extractor based on the Gaussian pyramid struc-
ture, which operates simultaneously at different scales and
exploits the image difference (residuals) between levels.
Applied to a DL setting, an input LR image is progressively
upsampled s times through convolutional and upsampling
layers, and the residual of each consecutive pair of upsam-
pled outputs is computed. This results in the production of
s residual images at different scales that contain features at
different levels of abstraction. Such structures have been
extensively used in image downscaling since they split the
problem into smaller manageable tasks of smaller scale and
help the model converge to better optima.
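The idea can be sketched as a progressive ×2-per-level pipeline in which each level adds a predicted residual to an upsampled copy of the previous output; the single-convolution residual head below is a drastic simplification of the feature extraction used in practice:

```python
import torch.nn as nn
import torch.nn.functional as F

class LaplacianPyramidSR(nn.Module):
    """Progressively upsamples by x2 per level, adding a predicted residual each time."""
    def __init__(self, channels=1, levels=2):
        super().__init__()
        self.residual_heads = nn.ModuleList(
            [nn.Conv2d(channels, channels, 3, padding=1) for _ in range(levels)])

    def forward(self, x):
        outputs = []
        for head in self.residual_heads:
            x = F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False)
            x = x + head(x)                # residual at this pyramid level
            outputs.append(x)              # intermediate scales can be supervised separately
        return outputs                     # e.g., x2 and x4 outputs for levels=2
```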
ATTENTION MECHANISM
Through the attention mechanism, the underlying neural
network manages to isolate and focus on the most impor-
tant feature details for the task at hand. Multiple types of
attention mechanisms have been proposed over the years
and can be categorized based on the dimension on which
they operate. For example, channel attention considers the
interdependence of the feature maps between channels
and attributes a different weight on each one, while spatial
attention emphasizes interesting regions in the spatial do-
main. Popular implementations of the channel attention
mechanism include the squeeze-and-excitation (SE) block
[66] and the efficient channel attention (ECA) [67], while a
spatial attention mechanism commonly used in practice is
the coordinate attention module (CAM) [68]. Several stud-
ies also use a combination of channel and spatial attention,
such as the bottleneck attention module (BAM) [69], the
convolutional block attention module (CBAM) [70] and
the triplet attention [71]. An interesting overview of the at-
tention mechanisms used in downscaling architectures is
presented in [72].
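As an example of the channel attention family, a minimal squeeze-and-excitation block can be sketched as follows (the reduction ratio is the usual hyperparameter of such gates):

```python
import torch.nn as nn

class SEAttention(nn.Module):
    """Squeeze-and-excitation: global average pooling followed by a per-channel gating MLP."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)            # squeeze: B x C x 1 x 1
        self.gate = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid())

    def forward(self, x):
        return x * self.gate(self.pool(x))             # reweight each feature map
```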
FIGURE 4. An example of the three basic convolution schemes for upsampling a single-channel 3 × 3 feature map by a ×2 factor: the (a) resize, (b) transposed, and (c) subpixel convolutions. The red dashed lines refer to a simple 3 × 3 convolution.
UPSAMPLING FRAMEWORKS
Although different DL architectures can vary greatly, four basic downscaling frameworks that describe all approaches present in the literature can be discerned. These frameworks
are outlined in Figure 5 and represent the possible ways to
design a downscaling DL model with convolutional and
upsampling/downsampling layers as basic components.
PREUPSAMPLING FRAMEWORK
This is the first framework explored in the literature for im-
age downscaling via DL approaches. In its most common
form, a traditional upsampling algorithm, e.g., bicubic
interpolation, is utilized to upsample the image to the re-
quired scale. Then, a convolutional neural network (CNN)
model is applied that refines the upsampled image and pro-
duces the HR result. Such an approach provides a simpler
learning pipeline since the network is relieved of the bur-
den to properly upsample the image and is only tasked to
sharpen and cleanse the input. Another advantage of the preupsampling framework is the ability to handle images of arbitrary size and scale. On the other hand, the compu-
tational cost is increased since all operations are performed
in a higher-dimensional space while the preceding upsam-
pling procedure often amplifies noise and significantly in-
creases blurring.
POSTUPSAMPLING FRAMEWORK
Mitigating the complexity and high cost of the preupsam-
pling approach, in the postupsampling framework, an end-
to-end model undertakes the upsampling task via trainable
layers located at the end of the architecture. In the most
common approach, a DL network performs feature extrac-
tion on the low-dimensional space of the LR image and fi-
nally increases the resolution to obtain the HR output. A
disadvantage of this framework is the fixed scaling factor,
which forms an integral part of the architecture; thus, a
different model must be designed and trained for different
scales. In addition, performance is highly affected by the
magnitude of the scaling factor. Since upsampling is performed in a single step, high factors (e.g., ×8, ×10) increase the learning difficulty and make the models considerably harder to train.
PROGRESSIVE UPSAMPLING FRAMEWORK
In this framework, a model upsamples the image in a pro-
gressive manner through consecutive convolutional and
upsampling layers. At each stage, the input is upsampled
to a higher resolution, finally obtaining the required scale
at the output. This approach facilitates the learning pro-
cess since the downscaling task is decomposed into much
simpler steps. Such architectures are also able to handle re-
quirements for multiscale output since each stage produces
an upsampled image of intermediate scale. However, pro-
gressive upsampling models require more complex archi-
tectures and are, thus, harder to design and train.
ITERATIVE UP- AND DOWNSAMPLING FRAMEWORK
This framework exploits consecutive up- and downsam-
pling layers, which refine the reconstruction error on HR-
to-LR projections, thus extracting more information on the
FIGURE 5. The possible downscaling frameworks present in the DL literature: (a) preupsampling, (b) postupsampling, (c) progressive upsampling, and (d) iterative up- and downsampling. The convolutional, upsampling, and downsampling layers are all trainable. Layers enclosed by dashed boxes denote stackable blocks.
relationship and correlations between the two spaces. Such
models usually achieve higher-quality results and are able
to handle higher scaling factors successfully.
MODELS
One of the first robust DL methods for downscaling was presented in [73] (SRCNN), where a shallow CNN was fed an upsampled version of an image and produced a sharpened HR output. It was trained and tested on subsets of ImageNet and outperformed equivalent non-DL methods. A similar approach was adopted by Kim et al. [74] (VDSR), who designed a deeper, VGG-like architecture [75] with a global residual connection and managed to outperform SRCNN on the test set.
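A rough sketch of such a preupsampling pipeline in the spirit of SRCNN is shown below; the layer widths and kernel sizes only echo the flavor of the original design and are not an exact reimplementation:

```python
import torch.nn as nn
import torch.nn.functional as F

class SRCNNLike(nn.Module):
    """Feature extraction, nonlinear mapping, and reconstruction on a bicubically upsampled input."""
    def __init__(self, channels=1):
        super().__init__()
        self.extract = nn.Conv2d(channels, 64, kernel_size=9, padding=4)
        self.map = nn.Conv2d(64, 32, kernel_size=1)
        self.reconstruct = nn.Conv2d(32, channels, kernel_size=5, padding=2)

    def forward(self, lr, scale=4):
        x = F.interpolate(lr, scale_factor=scale, mode="bicubic", align_corners=False)
        x = F.relu(self.extract(x))        # feature extraction on the upsampled input
        x = F.relu(self.map(x))            # nonlinear mapping
        return self.reconstruct(x)         # HR reconstruction
```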
Shi et al. [61], [62] (ESPCN) subsequently introduced
the subpixel convolution, which later became a popular
upsampling technique for DL models. This trick helps re-
duce the model’s number of parameters without compro-
mising its representational power.
The next landmark article [76] (LapSRN) introduced a multiscale architecture that integrates the Laplacian pyramid structure and produces intermediate images downscaled by smaller factors (×2, ×4, and ×8) in a single pass. The intermediate outputs are supervised via separate Charbonnier loss functions, and this progressive upsampling scheme helps the model retain high accuracy in higher scales.
Ledig et al. [77] (SRGAN) introduced an adversarial ap-
proach to spatially enhance natural images. The generator,
named SRResNet, consists of a series of residual blocks,
local and global residual connections, and subpixel con-
volutional layers for downscaling. The discriminator is a
VGG-like network that performs the real/fake binary clas-
sification. The generator’s loss function is a combination
of the adversarial loss and a term comparing the produced
downscaled and the target HR image. Based on this model,
Wang et al. [78] (ESRGAN) propose a number of improve-
ments to achieve sharper results. They replace the residual
blocks with novel residual-in-residual dense blocks, which
actually comprise dense blocks with global residual con-
nections, as seen in Figure 6, and use the relativistic average
discriminator introduced in [79].
Following the success of the baseline SRGAN, Lim et al. [81] (EDSR/MDSR) extend the SRResNet architecture by removing the rectified linear unit (ReLU) activations outside the residual blocks and deepening the model. The authors name this architecture EDSR and train it separately for the scaling factors ×2, ×3, and ×4. They also noted that, by fine-tuning a pretrained ×2 model when training for ×3 or ×4 downscaling, the entire training process is accelerated, and the algorithm converges much faster. Based on this observation, the authors argue that downscaling at multiple scales involves interrelated tasks, so they design an alternative model, namely, MDSR, which handles multiple scales simultaneously. Subsequently, Yu et al. [82] (WDSR) introduce two novel residual blocks to the EDSR architecture. These blocks employ a wide activation approach by constricting the features of the identity mapping pathway and widening the features before activation.
Another robust technique was proposed in [83] (RDN). The authors present a residual dense block (RDB) that comprises a dense block with three novelties:
Contiguous memory, where the output of an RDB is fed to each layer of the next RDB.
Local feature fusion, which is a concatenation and a 1 × 1 convolution layer at the end of an RDB that adaptively controls the output information, making the network easier to train.
Local residual learning, which is a residual connection between the input and output of the RDB.
Utilizing a sequence of such RDBs and subpixel upsampling layers, the final RDN architecture is formed and then trained with the MAE loss function.
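A condensed sketch of such an RDB, with dense connections, 1 × 1 local feature fusion, and a local residual link (the growth rate and number of inner layers are illustrative):

```python
import torch
import torch.nn as nn

class ResidualDenseBlock(nn.Module):
    """Densely connected convolutions, 1x1 local feature fusion, and local residual learning."""
    def __init__(self, channels=64, growth=32, layers=4):
        super().__init__()
        self.convs = nn.ModuleList(
            [nn.Conv2d(channels + i * growth, growth, 3, padding=1) for i in range(layers)])
        self.fusion = nn.Conv2d(channels + layers * growth, channels, kernel_size=1)

    def forward(self, x):
        features = [x]
        for conv in self.convs:
            # each layer sees the concatenation of all preceding feature maps
            features.append(torch.relu(conv(torch.cat(features, dim=1))))
        return x + self.fusion(torch.cat(features, dim=1))   # local feature fusion + residual
```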
A number of methods, such as [85] (DBPN and D-DBPN)
and [86] (SRFBN), opt for an iterative up- and downsam-
pling strategy in the main core of their model. Specifically, several consecutive layers alternately perform up- and
downprojection operations, learning different types of im-
age degradation, which then contribute to the construction
of the final HR image. This procedure provides an error
feedback mechanism for projection errors at each stage and
manages to extract better representations of the various fea-
tures.
FIGURE 6. The residual-in-residual block (RIRB). It contains multiple dense blocks and residual connections both between blocks and between the input and output of the RIRB. Here, β refers to the residual scaling parameter. (Source: [80]; used with permission.) Conv: convolutional layer; LReLU: leaky rectified linear unit.
Some methods (DRCN [87] and DRRN [84]) propose
the use of recursive structures inside the model. Arguing
that the addition of more layers makes a network ineffi-
cient and more likely to overfit, the aforementioned studies
introduce recursive convolutional layers, which apply the
same convolution multiple times. Therefore, weights are
shared between consecutive convolutional operations, and
more stable convergence is achieved. Figure 7 displays the
structural differences between DRCN and DRRN for bet-
ter understanding. A similar extension is also proposed for
the LapSRN model in [88] (MS-LapSRN). In particular, the
network parameters across pyramid levels are shared since
they perform a similar task via a similar structure, and the
feature embedding subnetwork of each pyramid level is re-
placed by a series of recursive convolutional layers to in-
crease the robustness of the model without increasing the
number of parameters accordingly.
Finally, Zhang et al. [89] (RCAN) propose a channel attention module that consists of a global average pooling layer and a gating mechanism that adds attention to the pooled features and enables the model to focus on the informative feature maps. Multiple such attention modules are incorporated inside residual-in-residual blocks, and the final downscaling is performed by subpixel convolutions. When combined with a self-ensembling strategy, RCAN outperforms several robust DL methods.
Table 2 summarizes the most popular models in CV for image downscaling via DL, vis-à-vis the building blocks employed, the upsampling framework adopted, whether a GAN pipeline is used or not, and the number of the model parameters. The last attribute is useful to assess the complexity of each model and therefore weigh its proneness to overfit given the training data available.
DOWNSCALING TAXONOMY IN REMOTE SENSING
Based on the dimensions and modalities to be combined,
a variety of downscaling schemes have been proposed in
the context of EO. Figure 8 provides a simple yet complete
FIGURE 7. An overview of the classic ResNet, VDSR, DRCN, and DRRN architectures. Global residual connections are marked by a purple line, ⊕ refers to elementwise addition, and outputs in blue are supervised. (a) ResNet. The green dashed box signifies a residual block. (b) VDSR. (c) DRCN. The blue dashed box refers to a recursive layer whose convolutional layers are marked in green and share the same weights. (d) DRRN. The red dashed box refers to a recursive block, and the green dashed box marks the residual units. The corresponding convolutional layers marked in green and red share the same weights. W1 and W4 are learnable weights assigned to each intermediate hidden state output during recursion. (Source: [84]; used with permission.)
taxonomy of the methodological approaches used in the
literature according to our review.
Given this taxonomy, one can discern three fundamen-
tal groups of satellite image downscaling approaches for
RS, depending on whether spectral, temporal, or no exter-
nal information is used:
Spatiospectral fusion (SSF): Images of different spatial and
spectral resolutions are fused to produce an image of the
highest possible spatial resolution in the coarser bands.
Spatiotemporal fusion (STF): Images of high spatial but
low temporal resolution (HSLT) are fused with images
of low spatial but high temporal resolution (LSHT) to
produce images of the highest resolution in both dimen-
sions.
SR: A single image or multiple images is/are downscaled
without any additional external information.
In more detail, when the downscaling process is assisted
with information on different spectra, then SSF techniques
are used. These techniques are further discriminated based
on the type of input spectra at hand, resulting in multispec-
tral (MS) fusion (two MS images with different spectral in-
formation), pansharpening [an MS image and a panchro-
matic (PAN) image], and MS/hyperspectral (HS) fusion (an
MS and an HS image).
TABLE 2. AN OVERVIEW OF THE MOST POPULAR DOWNSCALING MODELS IN CV.
MODEL | BUILDING BLOCKS USED | UPSAMPLING FRAMEWORK | GAN | NUMBER OF PARAMETERS
SRCNN [73] (1) | Simple CNN | Preupsampling | No | 57,000
VDSR [74] | VGG based and residual connections | Preupsampling | No | 665,000
ESPCN [61] | Simple CNN and subpixel convolution | Postupsampling | No | 20,000
LapSRN [76] (2) | Laplacian pyramid structure | Progressive upsampling | No | 821,000
SRGAN [77] | Subpixel convolution and residual connections | Postupsampling | Yes | Generator: 734,000; discriminator: 5.2 m
ESRGAN [78] (3) | Subpixel convolution and residual-in-residual blocks | Postupsampling | Yes | Generator: 16.7 m; discriminator: 14.5 m
EDSR [81] (4) | Subpixel convolution, residual connections, and pretraining | Postupsampling | No | 43 m
MDSR [81] (4) | Multiscale EDSR | Postupsampling | No | 8 m
WDSR [82] (5) | EDSR with wide activation modules | Postupsampling | No | Small model: 1.2 m; big model: 37.9 m
RDN [83] (6) | RDBs, local residual connections, and subpixel convolution | Postupsampling | No | 22.3 m
DBPN [85] (7) | Residual connections and transposed convolution | Iterative up- and downsampling | No | 188,000–2.2 m
D-DBPN [85] (7) | Residual connections and transposed convolution | Iterative up- and downsampling | No | 10.3 m
SRFBN [86] (8) | Residual connections, transposed convolution, and recurrent layers | Iterative up- and downsampling | No | 3.6 m
DRCN [87] | Recursive convolutions and residual connections | Preupsampling | No | 1.8 m
DRRN [84] (9) | DRCN with recursive blocks and added local residual connections | Preupsampling | No | 297,000
MS-LapSRN [88] (2) | LapSRN with shared weights and recursive blocks | Progressive upsampling | No | 222,000
RCAN [89] (10) | Channel attention, subpixel convolution, residual-in-residual blocks, and residual connection | Postupsampling | No | 16 m
Parameters are an estimation for the ×4 scaling factor, and links to the official code repositories are provided where possible.
(1) http://mmlab.ie.cuhk.edu.hk/projects/SRCNN.html
(2) https://github.com/phoenix104104/LapSRN
(3) https://github.com/xinntao/ESRGAN
(4) https://github.com/LimBee/NTIRE2017
(5) https://github.com/JiahuiYu/wdsr_ntire2018
(6) https://github.com/yulunzhang/RDN
(7) https://www.toyota-ti.ac.jp/Lab/Denshi/iim/members/muhammad.haris/projects/DBPN.html
(8) https://github.com/Paper99/SRFBN_CVPR19
(9) https://github.com/tyshiwo/DRRN_CVPR17
(10) https://github.com/yulunzhang/RCAN
In contrast, when the same spectra are available at dif-
ferent time steps and different spatial resolutions, then STF
methods come into play, where temporal differences are ad-
ditionally exploited for the spatial downscaling. This fam-
ily of methods includes two subfamilies depending on the
time points of the input data.
Finally, when no external information is available, and
downscaling can only be performed directly on the initial LR data, then SR techniques can be employed. There are three method subfamilies depending on the number of input images and whether additional features extracted from the same LR data are used as auxiliary input.
Figures 9 and 10 present an overview of the aforemen-
tioned method families, graphically highlighting the dif-
ferent approaches, whereas Figures 11–13 show downscal-
ing examples of each family. In the following sections,
we base our review on this discrimination and provide
a detailed examination of the approaches shaping each
method family.
SPATIOSPECTRAL FUSION
Satellites are equipped with various different sensors that
operate in different parts of the EM spectrum and capture
information on different features of the scanned location.
These features can have variable spatial resolution; thus, an
advanced method called SSF is usually employed to elabo-
rately blend the fine spatial resolution of a band $B_{HR}$ into the coarser spatial resolution of a target band $B_{LR}$ and obtain a new image in the target band of much higher quality.
We discern three families of SSF: MS image fusion, pan-
sharpening, and HS image downscaling. These are pre-
sented next, while, in Table 3, we summarize the main DL
models developed for SSF.
MULTISPECTRAL IMAGE FUSION
Using information from a single satellite source has the
advantage of consistent satellite orbit characteristics (e.g.,
the altitude, inclination, and so on) and atmospheric con-
ditions. Some satellites carry multiple sensors that allow
simultaneous capture of multiresolution images, thus pro-
viding an ideal setting for SSF and a common data source.
For example, the constellation of Sentinel-2 satellites (A/B)
launched by the European Space Agency acquires an image
with 13 discrete bands, four of which have a 10-m spatial
FIGURE 8. The proposed taxonomy of DL downscaling methods in the literature. MISR: multiple-image SR; RefSR: reference SR.
FIGURE 9. (a) SSF: an image of coarse spatial resolution is fused
with an image of fine spatial resolution containing different bands.
The result is a version of the former image downscaled to the spa-
tial resolution of the latter. (b) STF: an image of high temporal
(t1, t2, and t3) but low spatial resolution (LSR) is fused with an
image of low temporal (t1 and t3) but high spatial resolution.
The result is an image of the highest spatial resolution in time t2.
resolution; six have 20 m, and three have 60 m [93]. Several methods (DSen2 and VDSen2 [94], FUSE [95], [96], and SPRNet [97]) use two input sets, one for the $B_{HR}$ and one for the $B_{LR}$ resampled to match the target resolution, as input to the CNN models, which aim to transfer high-frequency details from $B_{HR}$ to $B_{LR}$ to spatially enhance the latter accordingly. DSen2, VDSen2, and the model proposed by Palsson et al. [95] use a concatenation of both sets in the input, while FUSE and SPRNet process each set in parallel and then fuse the results.
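As a rough illustration of this preupsampling input layout, the coarser Sentinel-2 bands can be resampled to the 10-m grid and stacked with the native 10-m bands before entering the CNN; the band counts and the bilinear resampling below are assumptions made for the sketch:

```python
import torch
import torch.nn.functional as F

def build_fusion_input(bands_10m, bands_20m):
    """Stack native 10-m bands with 20-m bands resampled onto the 10-m grid.

    bands_10m: tensor of shape (B, 4, H, W), e.g., Sentinel-2 B2/B3/B4/B8.
    bands_20m: tensor of shape (B, 6, H/2, W/2), the coarser bands to be enhanced.
    """
    upsampled = F.interpolate(bands_20m, size=bands_10m.shape[-2:],
                              mode="bilinear", align_corners=False)
    return torch.cat([bands_10m, upsampled], dim=1)    # (B, 10, H, W) network input

x = build_fusion_input(torch.randn(1, 4, 64, 64), torch.randn(1, 6, 32, 32))
print(x.shape)                                          # torch.Size([1, 10, 64, 64])
```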
In a similar setting, Luo et al. [98] (FusGAN) propose a GAN framework consisting of an ESRGAN generator and a PatchGAN discriminator [99], which takes, as the input, a downsampled concatenation of HR and LR Sentinel-2 bands to recover the original LR bands (Figure 14). On the other hand, Nguyen et al. [100] (S2SUCNN) propose a multiscale model that takes as the input the bands in their original resolution and progressively upsamples the lower-resolution ones guided by the extracted features of the higher-resolution bands to finally obtain all Sentinel-2 bands in a 10-m spatial resolution. The final result is subsequently degraded to be compared with the original input in an MAE loss function.
Finally, an interesting approach is presented in [101], where the FUSE model is evaluated under an unsupervised training scheme. Contrary to the original FUSE study, which employs a preupsampling framework and, thus, relies upon the primary creation of synthetic training data, the authors propose a reversed pipeline, where the model is applied on the original images, and its output is then downsampled and compared with the initial input. Subsequently, a second term is added to the loss function, which is calculated on the local correlation between the $B_{HR}$ and $B_{LR}$ bands and accounts for the preservation of high-frequency details. The preliminary results showcase the potential of this approach, which, however, is still below the level of the supervised learning scheme.
FIGURE 10. (a) SISR: a single LR image is downscaled without using any external information.
(b) MISR: multiple LR images of the same scene are used to acquire an image of higher spatial
resolution of that scene. (c) RefSR: an LR image is downscaled by combining information from
features extracted from it.
FIGURE 11. An example of pansharpening on WorldView-3 data: (a) an HR “ground-truth” image, (b) panchromatic, (c) LR multispectral
image, and (d)–(j) the pansharpening results obtained by different DL approaches. (Source: [90]; used with permission.)
Shao et al. [102] (ESRCNN) propose a framework that extends the SRCNN architecture (Table 2) and utilizes auxiliary information from Sentinel-2 to downscale Landsat-8 images. The Landsat-8 satellite provides observations in the visible, near-infrared (NIR), and shortwave infrared (SWIR) spectra at 30 m and a PAN band at 15-m spatial resolution every 16 days [103], so the goal of this study is to produce the equivalent Landsat images at 10-m spatial resolution. The whole process can be broken down into two separate steps. First is the self-adaptive fusion of Sentinel-2, where the 20-m Sentinel-2 bands (11 and 12) are resampled to 10 m using k-nearest neighbors (k-NN) interpolation and are then concatenated with the native 10-m bands as the input to the proposed ESRCNN model. The output is bands 11 and 12 downscaled to 10-m resolution. Following this is the multitemporal fusion of Landsat-8 and Sentinel-2, where the 30-m Landsat bands (1–7) and the PAN band are resampled to 10 m, again using k-NN interpolation, and are concatenated with the native 10-m Sentinel-2 bands and the downscaled 20-m Sentinel-2 bands. These are fed to the ESRCNN, which outputs a downscaled version of the Landsat bands 1–7.
A distinct advantage of this method versus traditional
approaches is the ability to fuse Sentinel-2 and Landsat
data obtained on different, albeit close, dates. Using the
same satellite sources, Chen et al. [2] propose the fusion
of Sentinel-2 and Landsat images to enhance the latter to a
spatial resolution of 10 m. They proved that an adversarial
approach is superior to a nonadversarial one, and the pro-
posed model resembles the architecture of the ESRGAN
trained on a composite of the RGB bands for both satellites.
The authors also tested whether the GAN model could be
improved by pretraining on natural instead of satellite im-
ages using the DIV2K data set (see the “Data Sets” section),
but the results were not favorable.
FIGURE 12. An example of STF. (a) An LR image at time t1. (b) An HR image at time t1. (c) An LR image at time t2. (d) An HR image at time t2, which is the target. (e)–(j) The prediction results at time t2 obtained by different approaches. (Source: [91]; used with permission.)
FIGURE 13. An example of SISR: (a) an LR image and (b)–(e) the prediction results obtained by different approaches for a scaling factor of ×4. (Source: [92]; used with permission.)
TABLE 3. SUMMARY OF THE STATE-OF-THE-ART DL MODELS FOR SSF FOR IMAGE DOWNSCALING IN RS.
MODEL | FUSION TYPE | FUSION DATA | CV MODEL | BUILDING BLOCKS | UPSAMPLING FRAMEWORK | ARCHITECTURE | CODE AVAILABLE/NUMBER OF PARAMETERS
DSen2 [94] | MS | Sentinel-2 | — | Residual learning | Preupsampling | CNN | Yes/1.8 m
VDSen2 [94] | MS | Sentinel-2 | — | Residual learning | Preupsampling | CNN | Yes/37.8 m
Palsson et al. [95] | MS | Sentinel-2 | — | Residual learning | Preupsampling | CNN | No/—
FUSE [96] | MS | Sentinel-2 | — | Residual learning | Preupsampling | CNN | No/28,000
FusGAN [98] | MS | Sentinel-2 | ESRGAN | Residual learning and subpixel convolution | Postupsampling | GAN | No/—
S2SUCNN [100] | MS | Sentinel-2 | — | Residual learning | Progressive upsampling | CNN | Yes/—
Ciotola et al. [101] | MS | Sentinel-2 | — | Residual learning | — | CNN | No/—
SPRNet [97] | MS | Sentinel-2 | — | Residual learning | Preupsampling | CNN | No/—
ESRCNN [102] | MS | Multitemporal Landsat-8 and Sentinel-2 | SRCNN | — | Preupsampling | CNN | Yes/—
Chen et al. [2] | MS | Landsat-8 and Sentinel-2 | ESRGAN | Residual learning and subpixel convolution | Postupsampling | GAN | No/—
RRSGAN [104] | MS | WorldView-2 and GaoFen-2 | — | Residual learning, subpixel convolution, and attention mechanism | Progressive upsampling | GAN | Yes/7.47 m
PNN [106] | PAN + MS | IKONOS, GeoEye-1, and WorldView-2 | SRCNN | — | Preupsampling | CNN | Yes/310,000
PanNet [109] | PAN + MS | IKONOS, WorldView-2, and WorldView-3 | — | Residual learning and high-pass filtering | Progressive upsampling | CNN | No/250,000
DRPNN [108] | PAN + MS | IKONOS, WorldView-2, and QuickBird | SRCNN | Residual learning | Preupsampling | CNN | No/1.6 m
DML-GMME [111] | PAN + MS | IKONOS, WorldView-2, QuickBird, and GaoFen-2 | Stacked sparse autoencoders [145] | — | Preupsampling | CNN | No/8,000
MSDCNN [112] | PAN + MS | IKONOS, WorldView-2, and QuickBird | — | Residual learning | Preupsampling | 2 CNNs | No/—
L1-RL-FT [110] | PAN + MS | WorldView-2 and WorldView-3 | SRCNN | Residual learning | Preupsampling | CNN | Yes/—
DiCNN [113] | PAN + MS | WorldView-2 Washington, IKONOS Hobart, and QuickBird Sundarbans | SRCNN | — | Preupsampling | 2 CNNs | No/180,000
DIRCNN [119] | PAN + MS | IKONOS, QuickBird, Gaofen-1, and Gaofen-2 | — | Residual learning, attention mechanism, and auxiliary gradient data | Preupsampling | CNN | No/1.6 m
MIPSM [115] | PAN + MS | IKONOS and QuickBird | — | Residual learning and high-pass filtering | Preupsampling | 2 CNNs | No/—
Fusion-Net [116] | PAN + MS | WorldView-2, WorldView-3, QuickBird, and Gaofen-2 | — | Residual learning | Preupsampling | CNN | Yes/230,000
SRPPNN [117] | PAN + MS | QuickBird, WorldView-3, and Landsat-8 | — | Residual learning and high-pass filtering | Preupsampling | CNN | No/—
UP-SAM [120] | PAN + MS | GeoEye-1, IKONOS, WorldView-2, and WorldView-3 | — | Residual learning, attention mechanism, and subpixel accuracy | Preupsampling | CNN | No/—
Luo et al. [114] | PAN + MS | Gaofen-2 and WorldView-2 | — | Residual learning and attention mechanism | Preupsampling | CNN | No/—
GTP-PNet [123] | PAN + MS | WorldView-2, Gaofen-2, and QuickBird | — | Residual learning and gradient information | Preupsampling | 2 CNNs | No/—
PSCSC-Net [124] | PAN + MS | GeoEye-1, IKONOS, and WorldView-2 | — | Deep unfolding and variational optimization | Preupsampling | CNN | No/1.1 m
(Continued)
In their study, Dong et al. [104] (RRSGAN and RRSNet) argue that RS images coming from different sources must be carefully aligned before processing due to differences in the altitude, viewpoint, or angle. They form a data set consisting of WorldView-2 (0.5-m) and GaoFen-2 (0.8-m) observations as well as the corresponding images from Google Earth (0.6 m). The proposed model is a GAN where image alignment is assisted by the extraction of gradients.
In particular, a CNN is fed the input images and their gradients and proceeds to extract features that are then aligned via a pyramid with deformable convolutional layers [105].
TABLE 3. SUMMARY OF THE STATE-OF-THE-ART DL MODELS FOR SSF FOR IMAGE DOWNSCALING IN RS. (Continued)

MODEL | FUSION TYPE | FUSION DATA | CV MODEL | BUILDING BLOCKS | UPSAMPLING FRAMEWORK | ARCHITECTURE | CODE AVAILABLE/NUMBER OF PARAMETERS
VO+Net [125] | PAN + MS | WorldView-3, WorldView-2, and QuickBird | — | Variational optimization | Preupsampling | CNN | No/—
SC-PNN [126] | PAN + MS | WorldView-3, GeoEye-1, and SPOT5 | — | Saliency analysis and hybrid and deformable convolution | Preupsampling | CNN + fully convolutional network | No/—
NLRNet [90] | PAN + MS | WorldView-3 and QuickBird | — | Residual learning and attention mechanism | Preupsampling | CNN | No/—
LPPNet [118] | PAN + MS | Pavia Center, Houston, and Los Angeles | — | Laplacian pyramid decomposition | Preupsampling | CNN | No/—
Scarpa et al. [110] | PAN + MS | GeoEye-1 and WorldView-2 | — | Residual learning | Preupsampling | CNN | No/—
Ciotola et al. [130] | PAN + MS | GeoEye-1, WorldView-2, and WorldView-3 | — | — | — | CNN | No/—
PSGAN [131] | PAN + MS | QuickBird, GaoFen-2, and WorldView-2 | — | — | Preupsampling | GAN | Yes/1.88M
Pan-GAN [132] | PAN + MS | GaoFen-2 and WorldView-2 | — | Two discriminators: spatial and spectral | Preupsampling | GAN | No/—
MDSSC-GAN SAM [133] | PAN + MS | Pléiades and WorldView-3 | — | Two discriminators: spatial and spectral; residual learning; and attention mechanism | Preupsampling | GAN | Yes/—
PanColorGAN [134] | PAN + MS | Pléiades, WorldView-2, and WorldView-3 | — | Self-supervised and noise/color injection | Preupsampling | GAN | No/—
Palsson et al. [135] | MS + HS | Pavia Center and IKONOS | — | — | Preupsampling | 3D CNN | No/—
DHSIS [136] | MS + HS | CAVE and Harvard | — | Self-supervised and noise injection | Preupsampling | GAN | Yes/—
PFCN [137] | MS + HS | Botswana; Washington, D.C.; and Pavia Center | — | Residual learning | Preupsampling | CNN | No/—
CF-BPNN [138] | MS + HS | AVIRIS and Pavia Center | — | k-Means clustering | Preupsampling | NN | No/—
HyperPNN [139] | MS + HS | Washington, D.C. National Mall; Moffett Field; and Salinas Scene | — | — | Preupsampling | CNN | No/—
DDLPS [140] | MS + HS | Moffett Field, Chikusei, and Salinas Scene | LapSRN | — | Preupsampling | CNN | No/—
TONWMD [141] | MS + HS | CAVE, Harvard, and Pavia Center | — | Residual learning and matrix decomposition | Preupsampling | CNN | No/—
MHF-Net [142] | MS + HS | CAVE, Chikusei, Houston, and Pavia Center | — | — | Preupsampling | CNN | Yes/—
UMAG-Net [143] | MS + HS | CAVE and Harvard | — | Attention mechanism | Preupsampling | CNN and AE | No/—
SSR-Net [144] | MS + HS | Pavia Center; Botswana; and Washington, D.C. National Mall | — | — | Preupsampling | CNN | Yes/—

CV Model refers to the models presented in Table 2. M: millions of trainable parameters. AE: autoencoder; AVIRIS: airborne visible/infrared imaging spectrometer; CAVE: Columbia computer vision laboratory; NN: neural network; NLRNet: nonlocal attention residual network.
Subsequently, a relevance attention module combines the aligned features by focusing on the relevant information, and a series of upsampling blocks performs the final downscaling. For the adversarial training, two discriminators are employed, one for the downscaled image and one for the gradient of the downscaled image produced by the generator. The loss function is a weighted sum of 1) the MAE between the downscaled and HR images, 2) the adversarial loss for the downscaled and HR images, 3) the VGG loss between the downscaled and HR images, 4) the MAE between the gradients of the downscaled and HR images, and 5) the adversarial loss for the gradients of the downscaled and HR images. The results show that both the adversarial RRSGAN and the nonadversarial RRSNet perform better than numerous other DL methods, with RRSGAN producing more high-frequency details.
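As a minimal illustration (not the authors' implementation), the following PyTorch sketch assembles such a five-term generator objective; the discriminators d_img and d_grad, the VGG feature extractor vgg, and the weights w are placeholders chosen for the example only.

```python
import torch
import torch.nn.functional as F

def gradients(x):
    # Finite-difference gradient magnitude, padded back to the input size.
    dh = F.pad(x[..., 1:, :] - x[..., :-1, :], (0, 0, 0, 1))
    dw = F.pad(x[..., :, 1:] - x[..., :, :-1], (0, 1))
    return torch.sqrt(dh ** 2 + dw ** 2 + 1e-12)

def adversarial(d, fake):
    # Nonsaturating generator-side adversarial term for a logit-producing discriminator.
    logits = d(fake)
    return F.binary_cross_entropy_with_logits(logits, torch.ones_like(logits))

def composite_generator_loss(sr, hr, d_img, d_grad, vgg, w=(1.0, 1e-3, 6e-3, 1.0, 1e-3)):
    # Weighted sum of the five terms listed above; the weights are purely illustrative.
    loss = w[0] * F.l1_loss(sr, hr)                         # 1) MAE on images
    loss += w[1] * adversarial(d_img, sr)                   # 2) adversarial loss on images
    loss += w[2] * F.l1_loss(vgg(sr), vgg(hr))              # 3) VGG (perceptual) loss
    loss += w[3] * F.l1_loss(gradients(sr), gradients(hr))  # 4) MAE on gradients
    loss += w[4] * adversarial(d_grad, gradients(sr))       # 5) adversarial loss on gradients
    return loss
```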
In conclusion, considering single-source data for MS image fusion, the available solutions cover a variety of needs. For example, when all LR input images have the same spatial resolution (e.g., 20 m), SPRNet seems to be a more suitable and robust approach. On the other hand, when hardware and/or time restrictions apply, FUSE provides a lightweight candidate since it contains very few trainable parameters (~28,000) compared to other methods, although it has only been applied with a 2× scaling factor. Finally, for an end-to-end approach where all multiresolution input bands are downscaled in a single forward pass, FusGAN seems to produce more accurate and sharp results. In the case of multisource input data, ESRCNN tackles the lack of clear, cloudless HR input images on the required date by enabling the use of multiple HR images acquired at arbitrarily close dates. The authors observe that, especially when more than three Sentinel-2 images are used, the model is able to additionally capture land use/land cover changes in the landscape. On the contrary, when the HR input images are inevitably contaminated by clouds or even absent in some cases, RRSGAN is able to overcome the loss of information and produce downscaled results of acceptable quality thanks to its robust feature extraction and attention mechanisms.
PANSHARPENING
Pansharpening refers to a downscaling process aided by a
PAN band. This special type of band allows the acquisition
of a single measurement for the total intensity of visible
light in a single pixel; thus, PAN sensors are able to detect
brightness changes at quite small spatial scales.
The first work to introduce CNNs to pansharpening is [106] (PNN). Inspired by the SR field of CV, Masi et al. [106] build upon the SRCNN and improve it by augmenting the input with a number of radiometric indexes tailored to features relevant for RS applications [the normalized difference vegetation index (NDVI), the normalized difference water index (NDWI), and so on]. Following the three steps of sparse coding SR [107], they make use of a three-layer CNN named PNN, as shown in Figure 15. Their method follows the preupsampling framework.
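A minimal PyTorch sketch of this idea is given below, assuming a generic sensor; the band positions used for the NDVI/NDWI computation and the layer widths are illustrative choices, not those of the original PNN.

```python
import torch
import torch.nn as nn

class PNNLike(nn.Module):
    """Three-layer SRCNN-style pansharpening network (a sketch of the PNN idea)."""
    def __init__(self, in_channels):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 64, kernel_size=9, padding=4), nn.ReLU(inplace=True),
            nn.Conv2d(64, 32, kernel_size=5, padding=2), nn.ReLU(inplace=True),
            # Output: the MS bands only (input minus the PAN band and the two indexes).
            nn.Conv2d(32, in_channels - 3, kernel_size=5, padding=2),
        )
    def forward(self, x):
        return self.net(x)

def build_input(ms_up, pan, nir_idx=3, red_idx=2, green_idx=1):
    """Preupsampling input: interpolated MS bands, PAN, and two radiometric indexes.
    Band indices depend on the sensor and are only assumptions here."""
    eps = 1e-6
    nir = ms_up[:, nir_idx:nir_idx + 1]
    red = ms_up[:, red_idx:red_idx + 1]
    green = ms_up[:, green_idx:green_idx + 1]
    ndvi = (nir - red) / (nir + red + eps)
    ndwi = (green - nir) / (green + nir + eps)
    return torch.cat([ms_up, pan, ndvi, ndwi], dim=1)
```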
Motivated by the high nonlinearity of deeper networks and inspired by SRCNN and PNN, Wei et al. propose a deep residual network named DRPNN [108], in which they add some pansharpening-specific improvements. Yang et al. also propose a deep residual network named PanNet [109] that preserves both spatial and spectral resolution. For spectral preservation, they directly add the upsampled MS images to the network output, while, for spatial preservation, they train the network in the high-pass filtering domain rather than the image domain, as this is expected to generalize better among different satellites (Figure 16).
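The sketch below conveys the two ideas (structure from high-pass details, spectra from the upsampled MS image) under stated assumptions: the box-filter high-pass, the plain stacked convolutions, and the bicubic upsampling are simplifications of the actual PanNet design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def high_pass(x, kernel_size=5):
    # Remove low-frequency content via a box-filter blur (an illustrative choice).
    low = F.avg_pool2d(x, kernel_size, stride=1, padding=kernel_size // 2)
    return x - low

class PanNetLike(nn.Module):
    """Sketch of the PanNet idea: a residual body fed with high-pass details,
    whose output is added to the upsampled MS image (spectral preservation)."""
    def __init__(self, ms_bands, width=32, depth=4):
        super().__init__()
        layers = [nn.Conv2d(ms_bands + 1, width, 3, padding=1), nn.ReLU(True)]
        for _ in range(depth):
            layers += [nn.Conv2d(width, width, 3, padding=1), nn.ReLU(True)]
        layers += [nn.Conv2d(width, ms_bands, 3, padding=1)]
        self.body = nn.Sequential(*layers)

    def forward(self, ms_lr, pan):
        ms_up = F.interpolate(ms_lr, size=pan.shape[-2:], mode="bicubic", align_corners=False)
        details = torch.cat([high_pass(ms_up), high_pass(pan)], dim=1)
        return ms_up + self.body(details)  # structure from details, spectra from ms_up
```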
FIGURE 14. The FusGAN generator network. ResDensBlock is the RDB as described in ESRGAN. (Source: [98]; used with permission.)
G-Net: generator of the model.
Starting from PNN, too, Scarpa et al. [110] explore a num-
ber of variations to improve its performance and robust-
ness. They propose the use of the MAE loss, which boosts
performance and allows fast convergence; exploit skip con-
nections; and add a target-adaptive fine-tuning phase. Their
ablation study shows that shallow architectures are able to
perform as well as the deeper ones; thus, they use a three-
layer CNN (L1-RL-FT) with residuals.
A different approach inspired by metric learning that
makes use of stacked autoencoders is introduced in [111].
Upscaled PAN images are divided into patches, grouped
according to their geometry, and fed as the input to autoencoders that are utilized to map them into hierarchical feature spaces that accurately capture nonlinear manifolds while, at the same time, preserving their local geometry in the
embedding space. Based on the assumption that MS and
their corresponding PAN patches form the same geometric
manifolds, the geometric multimanifold embedding mod-
el (DML-GMME) using a metric learning loss function is
trained to estimate HRMS image patches.
A two-branch network named MSDCNN is proposed
in [112]. While one branch is a three-layer CNN, the
other one is a deep residual network with multiscale con-
volutional blocks. Multiscale refers to the fact that the au-
thors use convolutional filters with different sizes to e xtract
feature maps. The t wo subnetworks are jointly trained, and
the final estimation is a sum over the estimation of each
subnetwork.
In [113], DiCNN, a general detail injection formulation of pansharpening, is proposed. DiCNN comprises two CNNs, DiCNN1 and DiCNN2, both utilizing the preupsampling framework. DiCNN1 adds a skip connection to the PNN architecture, while DiCNN2 works under the assumption that, ideally, the MS spatial details should match and be relevant only to the PAN image. Thus, it utilizes only the PAN image as an input to the network, while the preinterpolated MS image is used only at its end. Structural comparisons among PNN, DRPNN, DiCNN1, and DiCNN2 can be seen in Figure 17.
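A minimal sketch of the detail injection principle (in the spirit of DiCNN1, with arbitrary layer widths) is the following: the network predicts only the missing spatial details, which are added to the preinterpolated MS image through a skip connection.

```python
import torch
import torch.nn as nn

class DetailInjectionCNN(nn.Module):
    """DiCNN1-flavored sketch: the CNN estimates spatial details, which are
    injected into the preinterpolated MS image via a skip connection."""
    def __init__(self, ms_bands):
        super().__init__()
        self.details = nn.Sequential(
            nn.Conv2d(ms_bands + 1, 64, 3, padding=1), nn.ReLU(True),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(True),
            nn.Conv2d(64, ms_bands, 3, padding=1),
        )

    def forward(self, ms_up, pan):
        # ms_up: MS image already interpolated to the PAN grid (preupsampling framework).
        return ms_up + self.details(torch.cat([ms_up, pan], dim=1))
```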
FIGURE 15. An outline of PNN. The network comprises three layers that are expected to match the three steps of sparse coding SR. (Source:
[106]; used with permission.)
FIGURE 16. An outline of PanNet. The network decouples the structural from the spectral preservation. (Source: [109]; used with permission.)
Liu et al. [115] propose a method named MIPSM that combines a shallow–deep convolutional network (SDCN) and a spectral discrimination-based detail injection (SDDI) model. The SDCN consists of a three-layer shallow network and a deep residual network that can capture midlevel and high-level spatial features from PAN images. The SDCN works on the high-pass filtering domain. The SDDI is developed to merge the spatial details extracted by the SDCN into MS images with minimal spectral distortion. The SDCN and SDDI are jointly trained.
Inspired by component substitution and multiresolu-
tion analysis, Deng et al. [116] design two deep residual net-
works named CS-Net and MRA-Net that extract details and
have a solid physical justification. They also design a net-
work that is directly fed with details extracted by differenc-
ing the single PAN image with each MS band. This network
is called Fusion-Net. They make use of the preupsampling
framework using a polynomial kernel.
Cai et al. [117] propose a progressive downscaling pan-
sharpening neural network named SRPPNN. It includes
three components: 1) a downscaling process that extracts
the inner spatial detail that is present in the MS image
and combines it with the spatial detail of the PAN image
to generate fused results; 2) progressive pansharpening to
separate the spatial resolution improvement process, which achieves a gradual and stable pansharpening process; and
3) a high-pass residual module that helps by directly inject-
ing spatial detail from PAN images and achieves better spa-
tial preservation.
Dong et al. [118] propose a Laplacian pyramid network
called LPPNet that has a clear physical interpretation of
pansharpening; follows t he general idea of multiresolution
analysis; and divides pansharpening into two processes: de-
tail extraction and reconstr uction. For the detail extraction,
they use the Laplacian pyramid to decompose the PAN im-
age into multiple levels that can distinguish the details of
different scales. They build a simple detail extraction sub-
network for each level that can help fully extract the details of each level. For reconstruction, the subband residuals
estimated at each level are injected into the respective level
of the MS image, while they are upsampled and fed as the
input to the ne xt subnetwork, which can help make full use
of complementary details between different levels.
Instead of focusing on the architecture, Jiang et al. [119]
focus on the input/output of the network. They introduce
three novelties: 1) the differential information mapping
strategy, 2) the auxiliary gradient information strategy, and
3) the combination of an attention module with residual
blocks. Taking into account the underutilization of the PAN
image in the input, they propose copying and assigning the
PAN image to each band of the downscaled MS image.
Motivated by the existence of mixed pixels in satellite images, where each pixel tends to cover more than one constituent material, Qu et al. [120] propose a method based on the self-attention mechanism (SAM) [121] that works at the subpixel level. A method using skip connections inspired by [122] is introduced in [114], in which Luo et al. propose a novel loss function that utilizes spatial constraints, spectral consistency, and the quality with no reference (QNR) index (see the "Metrics" section). Instead of using simple stacked convolutional layers and separating the feature extraction, their network architecture adopts an iterative way to jointly extract and fuse the features. An outline of their method can be seen in Figure 17(e).
Zhang and Ma [123] propose a model comprising two networks: a gradient information network (TNet) and a pansharpening network (PNet). TNet is a residual network committed to seeking the nonlinear mapping between
gradients of PAN and HRMS images, which essentially is
a spatial relationship regression of imaging bands in dif-
ferent ranges. PNet is a spatial attention residual network
used to generate HRMS images, which is not only super-
vised by the HRMS reference image but also constrained
by the trained TNet.
FIGURE 17. A structural comparison between (a) PNN, (b) DRPNN, (c) DiCNN1, and (d) DiCNN2 (source: [113]; used with permission) as well as (e) the model of Luo et al. (source: [114]; used with permission).
Inspired by the learned iterative soft-thresholding algorithm, Yin [124] proposes a deep PNet that integrates the detail injection, variational optimization, and DL schemes into a single framework. It consists of an input convolutional layer, a Conv-ISTA module (deep unfolded network), a fusion module, and an output convolutional layer. The weighted use of variational optimization with DL is proposed in VO+Net [125], too. For the variational optimization modeling, a general detail injection term inspired by the classical multiresolution analysis is proposed as a spatial fidelity term, and a spectral fidelity term employing the MS sensor's modulation transfer functions is also incorporated. For the DL injection, a weighted regularization term is designed to introduce DL into the variational model. The final convex optimization problem is efficiently solved by the designed alternating-direction method of multipliers.
Zhang et al. [126] (SC-PNN) propose a saliency cascade CNN that consists of two parts: 1) a dilated deformable fully convolutional network (DDCN) for saliency analysis and 2) a saliency cascade residual dense network (SC-RDN) for pansharpening. DDCN is a network based on hybrid and deformable convolution aiming to separate salient regions, like residential areas, from nonsalient areas, like mountains and vegetation areas. SC-RDN is composed of three stages: 1) detail maps of MS and PAN images are extracted via the dual-tree complex wavelet transform (DT-CWT) [127], 2) a deep regression network based on RDBs takes those detail maps as the input and produces the primarily sharpened image with high spatial and spectral quality, and 3) a saliency enhancement module emphasizes the impact of the obtained saliency map via the saliency-weighted region convolution (SW-RC). More details about this method can be seen in Figure 18.
Given that the convolution operation is focused on the local region and, thus, position-independent global information is difficult to obtain, Lei et al. [90] propose an efficient nonlocal attention residual network (called NLRNet) to capture the similar contextual dependencies of all pixels. Motivated by the unavoidable absence of the ground truth, which often results in networks trained solely in a reduced-resolution domain, Vitale and Scarpa [128] propose a new learning strategy involving a loss function with terms computed at both reduced- and full-resolution images, thus enforcing cross-scale consistency. Their method is based on A-PNN [110], an advanced version of the PNN with 1) a different loss function for training (the MAE instead of the mean square error [MSE]), 2) a residual learning configuration, and 3) a target-adaptive scheme.
In the same direction, Ciotola et al. [130] introduce a
full-resolution training framework in which training takes
place in the HR domain, relying only on the original PAN
and MS pairs (with no downgrading), thus avoiding any
loss of information. They design a new compound loss
function with two components accounting separately for
spatial and spectral consistency.
Apart from CNNs, one of the first attempts to utilize GANs for producing high-quality pansharpened images is introduced by Liu et al. in [131] (PSGAN).
FIGURE 18. An outline of SC-PNN. (Source: [126]; used with permission.)
PSGAN comprises a generator, which takes PAN images as the input and maps them to the desired HRMS images, and a discriminator, which implements the adversarial training strategy for generating higher-fidelity pansharpened images. Making the assumptions that 1) the spectral distribution of the fused image should be consistent with that of the LRMS image and 2) the spatial distribution of the fused image should be consistent with that of the PAN image at the same resolution, Ma et al. propose the use of a GAN with two discriminators in [132] (Pan-GAN). The generator of Pan-GAN attempts to generate an HRMS image containing the major spectral information of the LRMS image together with additional image gradients of the PAN image.
A similar GAN architecture called MDSSC-GAN SAM that jointly exploits the spatial and spectral information sources is proposed in [133], in which Gastineau et al. make use of two discriminators, too: one to preserve the texture and geometry of the images by taking as the input the luminance Y and NIR bands of the images and the other to preserve the color and the spectral resolution by comparing the chroma components Cb and Cr.
A different approach, in which pansharpening is treated as a colorization problem, is introduced by Ozcelik et al. in [134] (PanColorGAN). In contrast with the ordinary method, the authors give, as the input, the gray-scale-transformed MS image and train the model to learn its colorization. The model learns to generate an original MS image by taking, as the input, the corresponding reduced-resolution and gray-scale versions. PanColorGAN is trained using both a reconstruction (MAE) loss and an adversarial loss. This can be interpreted as meaning that the model learns to separate the spectral and spatial components of the MS image during training.
In conclusion, when hardware and/or time restrictions apply, L1-RL-FT is a great solution, as it is lightweight and trains very fast. It also seems to have a good generalization ability and to solve the problem of insufficient data with its target-adaptive tuning phase. DML-GMME is a unique approach that utilizes deep metric learning and autoencoders. Because it comes with a rich ablation study and is a lightweight model, a researcher would gain useful insights experimenting with it. Accurate and sharp results seem to be produced by LPPNet, a network that simplifies the pansharpening problem into several pyramid-level learning problems. LPPNet makes use of the Laplacian pyramid decomposition technique to decompose the image into different levels that can differentiate large- and small-scale details, thus achieving great visual appearance.
Novel ideas that a researcher might want to consider are presented by Zhang et al. [123] and Luo et al. [114]. Zhang et al. design a special gradient transformation network that learns the nonlinear mapping between the gradients of PAN and MS images. Luo et al. propose a PAN-guided strategy that continuously extracts and fuses features from the PAN image. VO+Net is a framework that can be put on top of other approaches to improve the end result. Finally, SC-PNN is a solution that successfully makes use of saliency maps and provides great visual results.
HYPERSPECTRAL/MULTISPECTRAL FUSION
HS image sharpening aims at fusing an observable low-spatial-resolution HS image with a high-spatial-resolution MS image of the same scene to acquire an HR HS image. One of the first works to utilize CNNs for HS/MS fusion is introduced by Palsson et al. in [135], where the authors propose the use of a 3D CNN with three layers for the HS/MS fusion. The dimensionality of the HS image is reduced using principal component analysis to constrain the computational cost and increase robustness.
Dian et al. [136] propose a deep HS image-sharpening method called DHSIS that directly learns the priors of the HRHS image via CNN-based residual learning. They first initialize the HRHS image by solving a Sylvester equation. Then, to learn the priors, they utilize the initialized HRHS image as the input of the CNN to map the residuals between the reference HRHS image and the initialized HRHS image. This initialization can fully utilize the constraints of the fusion framework, thus improving the quality of the input data. The learned priors of the HRHS image are returned to the fusion framework to reconstruct the final estimated HRHS image, which can further improve the performance (Figure 19).
Zhou et al. [137] introduce a pyramid fully convolutional network (PFCN) consisting of two subnetworks: 1) an encoder aiming to encode the LRHS image into a latent image and 2) a pyramid fusion subnetwork that utilizes this latent image together with an HRMS pyramid image to progressively reconstruct the HRHS image in a global-to-local way. More details about the method can be seen in Figure 20.
Instead of formulating the task of HS/MS fusion as the spatial downscaling of an LRHS image, Han et al. [138] formulate it as the spectral downscaling of an HRMS image. Their method, CF-BPNN, consists of three stages: 1) the fusion problem is formulated as a nonlinear spectral mapping from an HRMS image to an HRHS image with the help of an LRHS image, 2) a cluster-based learning method using multibranch neural networks is utilized to ensure a more reasonable spectral mapping for each cluster, and 3) an associative spectral clustering is proposed to ensure that training and fusion clusters are consistent.
He et al. [139] introduce HyperPNN, an HS image-sharpening method via spectrally predictive CNNs, exploiting the spectral convolution structure to strengthen the spectral prediction. Li et al. [140] propose a detail-based deep Laplacian pansharpening model (DDLPS) to improve the spatial resolution of HS imagery. Their method includes three main components: downscaling, detail injection, and optimization. They make use of the well-known Laplacian pyramid SR network LapSRN (see the "Standard Deep Learning Methods for Downscaling in Computer Vision" section) to improve the resolution of each band. Then, a guided image filter and a gain matrix are used to combine the spatial and spectral details, with an optimization problem formed to adaptively select an injection coefficient.
Shen et al. [141] propose a twice-optimizing net with
matrix decomposition (TONWMD). They first decouple
the fusion problem into a spectral and a spatial optimiza-
tion task with the help of matrix decomposition. These two
problems are handled sequentially by solving a linear (Syl-
vester) equation. Then, they train a deep residual network
to establish the mapping between the initial and reference
images. Finally, the predicted result is returned to the opti-
mization procedure to get the final fusion image.
In [142], Xie et al. propose MHF-Net, a network having clear physical meaning and great interpretability. They first construct an HS/MS fusion model that merges the generalization models of LR images and the low-rankness prior knowledge of an HRHS image into a concise formulation. Then, they build the network by unfolding the proximal gradient algorithm to solve the proposed model.
Liu et al. [143] propose UMAG-Net, a network comprising a multiattention autoencoder network and a multiscale feature-guided network (MSFG). First, the multiattention autoencoder network extracts deep multiscale features of the MS image, and, then, a loss function containing a pair of HS and MS images is used to iteratively update the parameters of the network and learn prior knowledge of the fused image. The MSFG is used to construct the final HRHS image. Nonlocal blocks are used to better retain spectral and spatial details of the image. Laplacian blocks are used to connect the multiattention autoencoder network with the MSFG to achieve better fusion results while ensuring feature alignment. Although UMAG-Net does not use satellite HS data, the expansion into them is straightforward. Figure 21 shows the method.
Zhang et al. [144] propose SSR-Net, an interpretable spatial–spectral reconstruction network that consists of three components: 1) cross-mode message inserting (CMMI), an operation producing a preliminary fused HRHS image; 2) a spatial reconstruction network (SpatRN) that focuses on reconstructing the lost spatial information of the LRHS image with the guidance of a spatial edge loss; and 3) a spectral reconstruction network (SpecRN) that aims to reconstruct the lost spectral information of the HRMS image under the constraint of a spectral edge loss.
In conclusion, even though the architectures proposed for HS/MS fusion are limited in number, they exhibit remarkable variability (CNNs, 3D CNNs, GANs, and so on). The MHF-Net is an interpretable network showing superiority both visually and quantitatively. A bright idea that researchers should take into account is presented in the PFCN. The authors propose encoding the spectral information of the LRHS image into a latent image and then decoding this image with an HRMS image pyramid into a sharp HRHS image. The drawback of this method is the fact that experiments are conducted on simulated images. The SSR-Net treats HS/MS fusion as a spatial–spectral reconstruction problem. The authors provide a good ablation study and useful insights.
Finally, a complete solution that has not yet been tested on RS data is proposed in the UMAG-Net. This solution combines great ideas like the use of multiattention, nonlocal blocks, Laplacian blocks, and a loss function that measures both the spectral and the spatial similarity between pairs of images.
SPATIOTEMPORAL FUSION
Apart from their spectral signatures, satellites are also characterized by their unique revisit times. STF aims to integrate images of HSLT with images of LSHT. A typical data set for the STF problem consists of LSHT–HSLT image pairs at one or multiple time steps, and the aim is to predict an HR image on a future or intermediate target time $t_{target}$. All images must contain similar spectral information, including the number of bands and the bandwidths. For example, the Moderate-Resolution Imaging Spectroradiometer (MODIS) captures images daily (high temporal resolution) at a scale of 250 m to 1 km (low spatial resolution) [146], whereas Landsat-8's Operational Land Imager (OLI) captures images every 16 days (low temporal resolution) at a 30-m scale (high spatial resolution) [103].
FIGURE 19. An outline of DHSIS, a deep HS image-sharpening method. (Source: [136]; used with permission.)
Both sensors operate on the visible and infrared spectra; therefore, one could combine pairs of MODIS (LSHT) and Landsat-8 OLI (HSLT) images on different dates to produce high-spatial-resolution images on a prediction date $t_{target}$.
The various STF methods present in the literature follow a context-assisted (C-A) or context- and target-assisted (CT-A) scheme depending on the availability of target data during the training phase. CT-A approaches use additional LSHT information on $t_{target}$, whereas C-A approaches exploit LSHT–HSLT pairs from nontarget times only (Figure 22).
We must note here that a couple of other discriminant factors can also be observed among STF studies. First, some methods perform a preprocessing step where time difference images, defined as $I_{ij} = I_j - I_i$ for the time steps $t_i$ and $t_j$, are computed and used as additional inputs to the model. Such an approach is followed by [91] and [147]–[153]. Second, whereas the most common strategies involve data from times prior to $t_{target}$, there are cases where future observations are also required, as in [147] and [150]–[158]. For simplicity, in this work, we solely employ the C-A versus CT-A classification and separately describe each category in the following sections, while, in Table 4, we provide an overview of all STF methods. Note that we refer to the HSLT images on time t as $F_t$ and the LSHT images as $C_t$, respectively.
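As a small illustration of this preprocessing step (a sketch only; array shapes and stacking order are assumptions, not a specific method's specification), the coarse time difference image can be computed and stacked with the fine reference image before being fed to a model:

```python
import numpy as np

def time_difference(img_i: np.ndarray, img_j: np.ndarray) -> np.ndarray:
    """I_ij = I_j - I_i for two coregistered acquisitions of the same scene."""
    return img_j.astype(np.float32) - img_i.astype(np.float32)

def build_stf_input(c1: np.ndarray, c2: np.ndarray, f1: np.ndarray) -> np.ndarray:
    """Stack the coarse difference C_12 with the fine image F_1 (band-first arrays assumed)."""
    c12 = time_difference(c1, c2)
    return np.concatenate([c12, f1.astype(np.float32)], axis=0)
```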
CONTEXT- AND TARGET-ASSISTED METHODS
Several researchers argue that the spatial resolution gap between certain sensors, such as those carried by MODIS and Landsat, is quite large and that data coming from both sources undergo different atmospheric and geometric corrections. Therefore, they design models that produce intermediate images enhanced by a smaller scaling factor to facilitate the downscaling process. For example, Song et al. [154] (STFDCNN) (Figure 23) propose a two-stage model that takes as the input an arbitrary pair of Landsat-5/7 (25-m) and MODIS (500-m) images and learns to predict an intermediate enhanced image of 250-m spatial resolution. The intermediate image is computed in a preupsampling fashion, while the final 25-m image is computed via a postupsampling SRCNN structure. During the inference, features are extracted from MODIS images at times $t_1$, $t_2$, and $t_3$ (where $t_2$ is the prediction date), which are linearly combined with the corresponding Landsat images on $t_1$ and $t_3$ to produce the final HR result. Building on this, Zheng et al. [158] (VDCNSTF) propose deeper network architectures and redesign the SRCNN stage as a multiscale model producing images at 125 m and 25 m.
A slightly different approach is followed by Liu et al. [147] (StfNet), who argue that the temporal changes expressed by a time difference image are highly correlated with the contents of the original images.
FIGURE 20. An outline of PFCN comprising an encoder subnetwork and a pyramid fusion subnetwork. (Source: [137]; used with permission.) DeConv: deconvolutional layer;
GDL: gradient difference loss.
Therefore, they design a model that takes as the input an LSHT MODIS image (250–300 m) at the prediction date $t_2$, a date before ($t_1$), and a date after ($t_3$) the prediction date as well as the corresponding HSLT Landsat images at dates $t_1$ and $t_3$; produces time difference images; and then reconstructs the HR image on date $t_2$ by transferring information from these temporal relations.
More specifically, they propose two CNNs that take as the input a concatenation of the MODIS time difference image and the Landsat image and produce a Landsat time difference image. They employ these networks to learn the following mappings: 1) $(C_{13}, F_1) \rightarrow F_{13}$ and $(C_{13}, F_3) \rightarrow F_{13}$ and 2) $(C_{12}, F_1) \rightarrow F_{12}$ and $(C_{23}, F_3) \rightarrow F_{23}$. Mapping 1 can be supervised by the label $F_{13}$, which is available in the training data, forming the time difference reconstruction term of the loss function. The results of mapping 2 are summed to obtain a predicted $F_{13}$, which is compared to the label $F_{13}$, forming the temporal consistency term of the loss function. The total loss function is a weighted sum of these two terms. Finally, the predicted $F_{12}$ and $F_{23}$ are combined with $F_1$ and $F_3$ through an adaptive local weighting strategy to obtain the target image $F_2$. A schematic outline of the method is presented in Figure 24. Compared with non-DL and DL approaches, the proposed StfNet achieves sharper results with fewer visible artifacts.
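A hedged PyTorch sketch of this two-term objective follows; the placeholder net(c_diff, f) stands in for the time-difference prediction CNNs, and the L1 criterion and weight lam are illustrative choices rather than the StfNet settings.

```python
import torch
import torch.nn.functional as F

def stfnet_style_loss(net, C13, C12, C23, F1, F3, F13, lam=0.5):
    """Two-term training objective sketched from the StfNet description.
    net(c_diff, f) predicts a fine (Landsat-like) time difference image."""
    # 1) Time difference reconstruction: both predictions of F13 are supervised directly.
    F13_from_t1 = net(C13, F1)
    F13_from_t3 = net(C13, F3)
    l_rec = F.l1_loss(F13_from_t1, F13) + F.l1_loss(F13_from_t3, F13)

    # 2) Temporal consistency: the sum of F12 and F23 should also reproduce F13.
    F12_pred = net(C12, F1)
    F23_pred = net(C23, F3)
    l_cons = F.l1_loss(F12_pred + F23_pred, F13)

    return lam * l_rec + (1.0 - lam) * l_cons
```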
Tan et al. [159] (DCSTFN) propose a two-branch CNN that takes as the input the LSHT MODIS image on the prediction date $t_2$ along with a pair of HSLT Landsat-8 and LSHT MODIS (500-m) images on a date $t_1$ prior but close to the prediction date. The first branch of the model learns a mapping from LSHT to HSLT images in a postupsampling scheme, while the second one extracts information from the HSLT image with a sequence of convolutional layers. The three outputs, which share the same width and height, are then concatenated following the assumption of the traditional spatial and temporal adaptive reflectance fusion model (STARFM) algorithm [160], $F_2 - C_2 = F_1 - C_1$, for dates $t_1$ and $t_2$ and enter a series of convolutions for the final reconstruction.
In a subsequent publication [161] (EDCSTFN), the authors propose an enhancement over the DCSTFN model: instead of processing solely the LSHT images on the first branch, it takes as the input both the LSHT images and the HSLT image concatenated along the channel dimension and extracts information on their spectrum differences. Finally, the authors describe a novel flexible training scheme where more than one reference pair can be used as the input during either the training or the inference phase, depending on data availability.
FIGURE 21. An outline of UMAG-Net comprising an encoder and a decoder with spatial cross attention mechanism. (Source: [143];
used with permission.) BN: batch normalization; LG: Laplacian guide; NL: nonlocal block; S: stride; SCA: spatial cross attention;
UG: upsampling guide.
FIGURE 22. The data used for (a) CT-A and (b) C-A STF during training. F refers to the HSLT image, C refers to the LSHT image, and tprev and tnext are one or multiple dates before and after the target date ttarget, respectively.
The proposed EDCSTFN model manages to outperform DCSTFN and StfNet in most cases while displaying more stable and consistent behavior.
Li et al. [148] (DMNet) propose a complex CNN architecture with two multiscale mechanisms including parallel convolutions with either different kernel sizes or different dilation rates for a more efficient feature extraction. The model takes as the input the MODIS time difference image $C_{12}$ and the Landsat image $F_1$ and learns to predict $F_2$. In a follow-up study [149] (AMNet), the authors propose progressive upsampling at three scales (4×, 8×, and 16×) through deconvolutional layers, while a third model segment combines the feature maps at each scale to extract more spatial details and temporal dependencies. The output of this segment is then fed to a channel attention mechanism and a spatial attention mechanism in sequence. The final results respect the spatial and temporal changes of the data but are significantly blurred.
A number of studies have also focused on the application of GANs to the CT-A STF problem. For example, Shang et al. [91] (GASTFN) propose an adversarial version of the DCSTFN model where an EDSR-like generator performs the spatial enhancement task. Experiments showed that the proposed model yields sharper and more accurate results compared to the nonadversarial DCSTFN. Bouabid et al. [162] propose a model similar to the popular pix2pix GAN [163], which comprises a conditional GAN with a U-Net architecture for the generator and a PatchGAN architecture for the discriminator.
Chen et al. [155] (CycleGAN-STF) employ a cycle GAN architecture [164] to enhance the traditional flexible spatiotemporal data fusion (FSDAF) algorithm [165]. The main framework consists of the following four stages:
1) Generation: A cycle GAN takes as the input the HSLT image pair $(F_{t-1}, F_{t+1})$ and produces an $F_t^{GAN}$ in the output. The GAN produces a single image each time, so an iterative generation scheme is introduced to generate multiple in-between images.
2) Selection: A single $F_t^{GAN}$ image is selected based on mutual information metrics of the HSLT and LSHT images.
3) Enhancement: The discrete wavelet transform is used to enhance the quality of the selected image, borrowing information from $C_t$.
4) Fusion: The result of the previous steps along with $C_t$ and $C_{t-1}$ are inserted into the FSDAF algorithm to obtain the final prediction.
The model was only compared with traditional non-DL algorithms. Experiments showed that CycleGAN-STF outperformed the other approaches in preserving spatial details but resulted in a loss of spectral information.
Zhang et al. [156] (STFGAN) propose a cascade of two SRGAN-like structures that learn to produce an HR Landsat image for a target date $t_2$ based on Landsat-5/7 data from dates $t_1$ and $t_3$ as well as MODIS data from dates $t_1$, $t_2$, and $t_3$. The first GAN takes as the input the two Landsat and all of the corresponding MODIS images and produces an intermediate Landsat image $F_2^{int}$.
TABLE 4. A SUMMARY OF THE STATE-OF-THE-ART DL MODELS FOR STF FOR IMAGE DOWNSCALING IN RS.

MODEL | INPUT ASSISTANCE | TIME DIFFERENCE IMAGES | PRIOR DATES ONLY | CV MODEL | ARCHITECTURE | CODE AVAILABLE/NUMBER OF PARAMETERS
STFDCNN [154] | CT-A | No | No | SRCNN | CNN | No/—
VDCNSTF [158] | CT-A | No | No | VDSR | CNN | No/—
StfNet [147] | CT-A | Yes | No | — | CNN | No/—
DCSTFN [159] | CT-A | No | Yes | — | CNN | Yes/409,000
EDCSTFN [161] | CT-A | No | Yes | — | CNN | Yes/282,000
DMNet [148] | CT-A | Yes | Yes | — | CNN | No/327,000
AMNet [149] | CT-A | Yes | Yes | — | CNN | No/—
GASTFN [91] | CT-A | Yes | No | EDSR | GAN | No/—
Bouabid et al. [162] | CT-A | No | Yes | — | GAN | Yes/—
CycleGAN-STF [155] | CT-A | No | No | — | GAN | No/—
STFGAN [156] | CT-A | No | No | SRGAN | GAN | No/—
GAN-STFM [166] | CT-A | No | Yes | — | GAN | Yes/578,000 + 3.6M
Teo and Fu [169] | CT-A | No | Yes | VDSR | GAN | No/—
DL-SDFM [150] | C-A | Yes | No | — | CNN | No/—
HDLSFM [170] | C-A | No | Yes | LapSRN | CNN | No/—
STF3DCNN [152] | C-A | Yes | No | — | CNN | No/—
BiaSTF [153] | C-A | Yes | No | — | CNN | No/—

CV Model refers to the models presented in Table 2. M: millions of trainable parameters.
Due to the limited ability of the SRGAN to perform spatial enhancement at such a large scaling factor (16×), this image is far from optimal. Therefore, a second GAN is used that takes as the input the Landsat images along with a downsampled version of these Landsat images and the intermediate $F_2^{int}$ to produce the final $F_2$ image.
A different approach is followed by Tan et al. [166] (GAN-STFM), who propose a conditional GAN architecture for downscaling MODIS images with a Landsat reference. The generator follows a U-Net architecture, and the inputs are the coarse MODIS image at the prediction date $t$, $C_t$, and a fine Landsat image $F_{t^*}$ at a different date $t^*$ arbitrarily close to the target. Similarly, the discriminator takes as the input a concatenation of either the coarse $C_t$ and the corresponding ground truth $F_t$ or the coarse $C_t$ and the predicted $F_t^{pred}$ to perform a fake/real classification. All convolutional blocks in both networks are replaced by custom residual blocks with switchable normalization [167] in the generator and spectral normalization [168] in the discriminator.
The authors further propose the use of a multiscale discriminator where all inputs are additionally downsampled by factors of 2 and 4 and are used to train three different discriminators with similar architectures at different scales. The proposed method is compared with non-DL approaches and EDCSTFN, showing the superiority of the random Landsat reference selection against the temporal proximity imposed by STF in terms of computational cost without compromising the downscaling quality.
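The multiscale part can be sketched as below (a simplified illustration, assuming a placeholder make_d() factory for a patch-style discriminator and average pooling for the downsampling); each discriminator sees the coarse/fine pair at a different scale.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleDiscriminator(nn.Module):
    """Three discriminators with the same topology applied at full, 1/2, and 1/4 scale."""
    def __init__(self, make_d):
        super().__init__()
        self.discs = nn.ModuleList([make_d() for _ in range(3)])

    def forward(self, coarse, fine):
        x = torch.cat([coarse, fine], dim=1)  # condition on the coarse image
        outputs = []
        for i, d in enumerate(self.discs):
            scale = 2 ** i
            xs = F.avg_pool2d(x, scale) if scale > 1 else x
            outputs.append(d(xs))
        return outputs  # one real/fake prediction per scale
```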
Different DL approaches for blending Landsat-8 with Formosat-2 (8-m) images to increase the number of cloud-free observations have been studied by Teo and Fu [169]. First, Landsat images were resampled to 8 m and then blended with the rest via a simple STARFM algorithm. Second, pairs of Formosat and Landsat images obtained on the same date were fed to a VDSR model that learned to predict the residual between the LR and HR features. This prediction was then used to estimate the final spatially enhanced image. The last two experiments, nicknamed blend-then-SR and SR-then-blend, tested the hybrid approaches of applying STARFM for blending and then VDSR for downscaling or applying VDSR for downscaling and then STARFM for blending, respectively. The study concludes that the SR-then-blend approach yielded the best results overall, which implies that spatially enhancing the LR images before fusion can reduce the variation between the two image sets.
CONTEXT-ASSISTED METHODS
A C-A approach that aims to integrate temporal change into an end-to-end model is proposed by Jia et al. [150] (DL-SDFM). They design a two-stream CNN, with one branch ($M_1$) learning a temporal change-based mapping and the other ($M_2$) learning a spatial change-based mapping. Each branch consists of inception modules containing dilated convolutions with different dilation factors, and the overall model is trained with two types of input data: in a time-forward pass, the time differences are computed forward in time, whereas in a time-backward pass, they are computed backward in time. In the former case, the learned mappings are $M_1: (C_{13}, F_1) \rightarrow \hat{F}_3^1$ and $M_2: (C_3, F_1 - C_1) \rightarrow \hat{F}_3^2$, and, in the latter case, they are $M_1': (C_{31}, F_3) \rightarrow \hat{F}_1^1$ and $M_2': (C_1, F_3 - C_3) \rightarrow \hat{F}_1^2$. All outputs are supervised by the given labels. Then, in the prediction phase, the model produces the mappings $M_1: (C_{12}, F_1) \rightarrow \hat{F}_2^1$ and $M_2: (C_2, F_1 - C_1) \rightarrow \hat{F}_2^2$ for the forward pass and $M_1': (C_{32}, F_3) \rightarrow \hat{F}_2^1$ and $M_2': (C_2, F_3 - C_3) \rightarrow \hat{F}_2^2$ for the backward pass. Figure 25 presents the entire pipeline.
FIGURE 23. An outline of the STFDCNN method. (Source: [154]; used with permission.)
The authors compared DL-SDFM with two traditional approaches and the DL-based STFDCNN model and argue that their method manages to capture phenological change and achieve results closer to the ground truth but slightly inferior to STFDCNN visually.
Jia et al. [170] (HDLSFM) propose a hybrid approach that involves a LapSRN model for spatial downscaling and a linear model for extracting temporal changes. To alleviate the problem of large radiation differences between LR and HR images, the LapSRN is trained on MODIS–Landsat pairs to produce an intermediate output at the 2× scale following the progressive upsampling scheme. During inference, temporal changes are captured by a linear model that extracts information from both $F_1$ and the intermediate output of LapSRN for images $C_1$ and $C_2$. In the final downscaled image, considerable blurring was observed in heterogeneous areas of the underlying scene.
Downscaling a time series of MODIS images based on Landsat observations captured on sparser dates is addressed by Peng et al. [152] (STF3DCNN). The proposed approach takes as the input the time difference MODIS images between each consecutive pair of dates, and a 3D CNN model is trained to produce the corresponding time difference Landsat images of the in-between dates. The output is added to the original Landsat series to produce the final prediction. The presented method manages to capture abrupt changes in the observed scene.
A novel idea was presented in [153] (BiaSTF), where it is argued that, when different sensors capture changes with differences in spectral and spatial viewpoints, a considerable bias between these sensors is introduced. No previously published method accounts for this bias, so the authors propose a pipeline with two CNNs, one for learning the spectral/spatial changes and the other for learning the sensor bias. Both networks are trained with a separate MSE loss and take as the input pairs of MODIS and Landsat observations. The final prediction is obtained by summing the outputs of the two networks along with the initial HSLT image. The results showed that this inclusion of the sensor bias lets the model converge to a lower minimum, and its predictions exhibit fewer spatial and spectral distortions.
In conclusion, the studies presented in this section provide a variety of methods for tackling the spatiotemporal variation of the observed landscape. The lack of a common benchmark data set, again, renders the direct comparison of all methods infeasible, but certain useful characteristics can be discerned. First, models such as EDCSTFN, GASTFN, and GAN-STFM require a minimal number of input images, thus facilitating the downscaling task in areas with severe cloud contamination. Among these approaches, GAN-STFM has the additional advantage of using fine images at arbitrary dates prior to the target date, which provides an extra level of freedom concerning the selection of images for training and/or inference.
Second, EDCSTFN, DMNet, STF3DCNN, and BiaSTF employ simple architectures with a limited number of trainable parameters, which makes them ideal candidates for quick experimentation and testing. Finally, considering the spectral correlation between the different bands enables the model to exploit complementary information to better uncover land cover and phenological changes. The models accepting multiband input are EDCSTFN, GASTFN, STFGAN, GAN-STFM, DL-SDFM, and STF3DCNN.
SUPER-RESOLUTION
SR is a broad family of methods that aim to enhance the spatial resolution of an image without the need to blend information from auxiliary sources in either the spectral or the temporal dimensions.
FIGURE 24. An outline of the StfNet method. DCNN refers to a three-layer deep CNN. (Source: [147]; used with permission.)
FIGURE 25. The DL-SDFM pipeline. (Source: [150]; used with permission.) Adam: adaptive movement estimation algorithm.
For better assessment, SR methods can be categorized into SISR, multiple-image SR (MISR), and reference SR (RefSR). These are presented next, while, in Table 5, we summarize the main DL models developed for SR. In the "Super-resolution for Synthetic Aperture Radar and Aerial Imagery" section, we examine SR architectures that are specific to synthetic aperture radar (SAR) and aerial imagery.
SINGLE-IMAGE SUPER-RESOLUTION
SISR aims to recover an HR version of a single LR input im-
age. However, lost pixel information in the LR image can
never be fully retrieved but only hallucinated, which means
that multiple possible HR images can be constructed from
one LR source. This renders the SISR problem mathemati-
cally ill posed and noninvertible, but it is often the only
viable approach when only a single LR input is available.
Therefore, several attempts have been made to employ DL
techniques in the SISR domain for RS.
MULTISCALE APPROACHES
Lei et al. [171] (LGCNet) (Figure 26) design a CNN model
that combines feature maps produced by previous layers to
extract information at different scales and levels of detail.
The model was evaluated on the University of California
(UC), Merced data set and selected Gaofen-2 images, and
it managed to outperform traditional image enhancement
methods, such as bicubic interpolation and sparse cod-
ing, but showed only marginal improvements compared
to other established DL models. Haut et al. [172] experi-
ment on the same data with a residual model containing a
sequence of convolutional layers for feature extraction and
an inception module followed by upsampling layers for the
final downscaling. Their method achieved a performance
similar to that of LGCNet.
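To make the idea of combining features from several depths concrete, the following is a minimal PyTorch sketch of a pre-upsampling CNN in the spirit of LGCNet; it is not the authors' code, and the layer count, channel width, and scaling factor are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiScaleFeatureSR(nn.Module):
    """Toy LGCNet-style network: local features from several depths are
    concatenated ("local-global combination") before the reconstruction."""

    def __init__(self, channels=3, width=32, scale=2):
        super().__init__()
        self.scale = scale
        self.conv1 = nn.Conv2d(channels, width, 3, padding=1)
        self.conv2 = nn.Conv2d(width, width, 3, padding=1)
        self.conv3 = nn.Conv2d(width, width, 3, padding=1)
        # Fuse the feature maps collected from different depths.
        self.fuse = nn.Conv2d(3 * width, width, 1)
        self.reconstruct = nn.Conv2d(width, channels, 3, padding=1)

    def forward(self, lr):
        # Pre-upsampling framework: interpolate first, then refine.
        x = F.interpolate(lr, scale_factor=self.scale, mode="bicubic",
                          align_corners=False)
        f1 = F.relu(self.conv1(x))
        f2 = F.relu(self.conv2(f1))
        f3 = F.relu(self.conv3(f2))
        fused = F.relu(self.fuse(torch.cat([f1, f2, f3], dim=1)))
        # Global residual connection to the interpolated input.
        return x + self.reconstruct(fused)


if __name__ == "__main__":
    sr = MultiScaleFeatureSR()(torch.rand(1, 3, 64, 64))
    print(sr.shape)  # torch.Size([1, 3, 128, 128])
```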
Lu et al. [173] (MRNN) propose a preupsampling ar-
chitecture with parallel convolutional layers and design a
network with three parallel branches containing residual
blocks of different convolutional kernel sizes. Each branch
is initially trained separately with interpolated versions of
the original LR image var ying in size, and then all branches
are combined for the final image reconstruction and fine-
tuned in an end-to-end setting. Experimental results show
promising improvements over other state-of-the-art DL
methods, especially for larger scaling factors. In another
multiscale approach, Xu et al. [174] employ a U-Net-resem-
bling architecture, adding a module with sequential dilated
convolutions at the bottleneck section, a global residual
connection, and pixel shuffle operations before the final
output. The dilated convolutions have different dilation
rates, allowing the model to extract information using dif-
ferent receptive fields and scales.
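The effect of such a dilation module can be illustrated with a short, hedged sketch: stacking 3 × 3 convolutions with increasing dilation rates enlarges the receptive field without increasing the per-layer parameter count. The rates and width below are assumptions rather than the configuration used in [174].

```python
import torch
import torch.nn as nn


class DilatedBottleneck(nn.Module):
    """Sequential dilated convolutions with growing receptive fields,
    roughly in the spirit of the bottleneck module described above."""

    def __init__(self, channels=64, rates=(1, 2, 4, 8)):
        super().__init__()
        layers = []
        for r in rates:
            # padding = dilation keeps the spatial size unchanged for 3x3 kernels.
            layers += [nn.Conv2d(channels, channels, 3, padding=r, dilation=r),
                       nn.ReLU(inplace=True)]
        self.body = nn.Sequential(*layers)

    def forward(self, x):
        # Local residual connection around the dilated stack.
        return x + self.body(x)


if __name__ == "__main__":
    out = DilatedBottleneck()(torch.rand(1, 64, 32, 32))
    print(out.shape)  # torch.Size([1, 64, 32, 32])
```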
MULTITASK LEARNING
In their study, Yan and Chang [175] (MSF) exploit a mul-
titask learning procedure to improve the generalization of
the underlying network to different degradation models.
According to the standard approach, an image is downs-
ampled by convolving with a Gaussian blur kernel, apply-
ing bicubic interpolation and then adding some noise. The
authors argue that a model trained on images degraded by
a single Gaussian kernel may perform quite well on such
images but fail to generalize to different kernels. There-
fore, they propose a model trained in a multitask setting
where each task represents a separate Gaussian kernel and
is learned by a dedicated CNN.
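The standard degradation model referred to above (Gaussian blur, bicubic downsampling, and additive noise) can be written compactly. The sketch below is a generic illustration with arbitrary parameter values, not the exact pipeline of any cited study.

```python
import torch
import torch.nn.functional as F


def gaussian_kernel(size=7, sigma=1.5):
    """Isotropic 2D Gaussian blur kernel."""
    ax = torch.arange(size, dtype=torch.float32) - (size - 1) / 2
    g = torch.exp(-ax ** 2 / (2 * sigma ** 2))
    k = torch.outer(g, g)
    return k / k.sum()


def degrade(hr, scale=4, sigma=1.5, noise_std=0.01):
    """LR = downsample(blur(HR)) + noise, the degradation commonly assumed
    by Wald's-protocol training pipelines."""
    b, c, h, w = hr.shape
    k = gaussian_kernel(sigma=sigma).to(hr).expand(c, 1, -1, -1)
    blurred = F.conv2d(hr, k, padding=k.shape[-1] // 2, groups=c)
    lr = F.interpolate(blurred, scale_factor=1 / scale, mode="bicubic",
                       align_corners=False)
    return (lr + noise_std * torch.randn_like(lr)).clamp(0, 1)


if __name__ == "__main__":
    print(degrade(torch.rand(1, 3, 128, 128)).shape)  # torch.Size([1, 3, 32, 32])
```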
ADDITIONAL POSTPROCESSING
A study by Qin et al. [176] (DGANet-ISE) presented a cus-
tom postprocessing pipeline for the improvement of the
output of an SR model. Their architecture is heavily based
on EDSR (see the “Standard Deep Learning Methods for
Downscaling in Computer Vision” section) and is trained
with a custom loss function that additionally considers the
gradient similarity between the prediction and target. The
model’s output is then iteratively improved via a proposed
image-specific enhancement (ISE) algorithm that back-
projects the error between the SR output and the LR input
image and, accordingly, updates the prediction. This algorithm alleviates the possible variation between the training
and testing data sets that might arise from different sensing platforms, lighting conditions, and so on.
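A rough sketch of such an iterative, image-specific refinement is given below: the current SR estimate is re-degraded, the residual against the observed LR input is back-projected, and the estimate is updated. The bicubic degradation operator, step size, and iteration count are assumptions and do not reproduce the exact ISE algorithm of [176].

```python
import torch
import torch.nn.functional as F


def back_project_refine(sr, lr, scale=4, iters=10, step=1.0):
    """Iteratively enforce consistency between the SR estimate and the LR
    observation: sr <- sr + step * upsample(lr - downsample(sr))."""
    sr = sr.clone()
    for _ in range(iters):
        # Simulated re-degradation of the current estimate (bicubic here).
        sr_down = F.interpolate(sr, scale_factor=1 / scale, mode="bicubic",
                                align_corners=False)
        residual = lr - sr_down
        sr = sr + step * F.interpolate(residual, scale_factor=scale,
                                       mode="bicubic", align_corners=False)
    return sr.clamp(0, 1)


if __name__ == "__main__":
    lr = torch.rand(1, 3, 32, 32)
    sr0 = F.interpolate(lr, scale_factor=4, mode="bicubic", align_corners=False)
    print(back_project_refine(sr0, lr).shape)  # torch.Size([1, 3, 128, 128])
```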
DIFFERENT SOURCES FOR THE INPUT AND OUTPUT
Contrary to most approaches in this categor y that exploit
Wald’s protocol, a number of methods have been proposed
that utilize different sources for the input and output. Galar
et al. [177] (S2PS) propose the use of PlanetScope images
as the target to downscale the four Sentinel-2 10-m bands.
They train a modified version of the EDSR separately for
each of the NIR and red bands, accounting also for the style
transfer loss [178] between the prediction and target.
Pouliot et al. [179] (DCR-SRCNN) use Sentinel-2 ob-
servations to downscale the corresponding Landsat-8 and
Landsat-5 images from three regions in Canada through
an SRCNN architecture with denser residual connections
trained to predict a single band. Landsat–Sentinel train-
ing pairs were selected based on a minimum change vector
across time, and the authors noted that better results were
obtained for Sentinel observations closest to the prediction
date due to the dynamic behavior of land cover types, such
as croplands.
Finally, Collins et al. [180] apply an SRCNN on the two
Resourcesat sensors. The constellation of Indian Resourc-
esat satellites (1/2) provides multitemporal and multiresolution observations in the same spectra with coincident
captures, enabling the use of SISR techniques. Both satellites carry the sensors linear imaging self scanning (LISS)
III, which captures information in the green, red, NIR, and
SWIR bands with 24-m spatial resolution and a 24-day re-
visit cycle, and advanced wide field sensor (AWiFS), which
captures the same bands with 56-m spatial resolution and
a five-day revisit cycle. The authors used a training set
TABLE 5. A SUMMARY OF THE STATE-OF-THE-ART DL MODELS FOR SR IN RS.

MODEL | SR TYPE | DESCRIPTION/NOVELTY | CV MODEL | BUILDING BLOCKS | UPSAMPLING FRAMEWORK | ARCHITECTURE | CODE AVAILABLE/NUMBER OF PARAMETERS
LGCNet [171] | SISR | Multiscale approach and features from different layers | — | Residual learning | Preupsampling | CNN | No/—
Haut et al. [172] | SISR | Multiscale approach with inception module | — | Residual learning and subpixel convolution | Postupsampling | CNN | No/—
MRNN [173] | SISR | Multiscale approach and parallel feature extraction from different scales of the LR input | — | Residual learning | Preupsampling | CNN | No/—
Xu et al. [174] | SISR | Multiscale approach and U-Net model with dilation module at the bottleneck | — | Residual learning and subpixel convolution | Postupsampling | CNN | No/—
MSF [175] | SISR | Multitask learning and a different model for each Gaussian kernel | — | Residual learning | Preupsampling | CNN | No/—
DGANet-ISE [176] | SISR | Postprocessing algorithm and gradient loss term | EDSR | Residual learning and subpixel convolution | Postupsampling | CNN | No/—
S2PS [177] | SISR | Downscaling of Sentinel-2 images using PlanetScope as the target | EDSR | Residual learning and subpixel convolution | Postupsampling | CNN | No/—
DCR-SRCNN [179] | SISR | Downscaling of Landsat-5/8 images using Sentinel-2 as the target | SRCNN | Residual learning | Preupsampling | CNN | No/993,000
Collins et al. [180] | SISR | Downscaling of coarser AWiFS images using sharper LISS III images from Resourcesat | SRCNN | — | Preupsampling | CNN | No/—
Zhang et al. [183] | SISR | Unsupervised model that learns multiple image degradations | — | Residual learning and bilinear upsampling layers | Postupsampling | GAN | No/—
EUSR [181] | SISR | Dense network, with the resulting image downsampled and compared with the LR input | — | Bilinear upsampling layers | Postupsampling | CNN | No/—
WTCRR [185] | SISR | Approach assisted by the discrete wavelet transform and use of recurrent blocks | DRRN | Residual learning | Preupsampling | CNN | No/—
DWTSR [186] | SISR | Approach assisted by the discrete wavelet transform and stationary wavelet transform | — | Residual learning | Preupsampling | CNN | No/—
RRDGAN [187] | SISR | Approach assisted by the discrete wavelet transform and the total variation loss function | ESRGAN | Residual learning and subpixel convolution | Postupsampling | GAN | No/—
MPSR [189] | SISR | Multiscale approach with residual connections and channel attention | — | Residual learning, subpixel convolution, and attention mechanism | Postupsampling | CNN | No/—
DRSEN [190] | SISR | Approach with channel attention | EDSR | Residual learning, subpixel convolution, and attention mechanism | Postupsampling | CNN | No/8.6 m
Haut et al. II [191] | SISR | Approach with channel attention | — | Residual learning, subpixel convolution, and attention mechanism | Postupsampling | CNN | No/—
MSAN and SAMSAN [192] | SISR | Approach with channel attention and scene-adaptive learning | WDSR | Residual learning, subpixel convolution, and attention mechanism | Postupsampling | CNN | No/—
DSSR [193] | SISR | Approach with channel attention and chain training | WDSR | Residual learning, subpixel convolution, and attention mechanism | Postupsampling | CNN | No/9.1 m
AMFFN [194] | SISR | Multiscale approach with channel attention | — | Residual learning, subpixel convolution, and attention mechanism | Postupsampling | CNN | No/—
IRAN [195] | SISR | Approach with inception modules and both channel and spatial attention | — | Residual learning, subpixel convolution, and attention mechanism | Postupsampling | CNN | No/1.88 m
NLASR [196] | SISR | Multiscale approach with nonlocal modules and both channel and spatial attention | — | Residual learning, subpixel convolution, and attention mechanism | Iterative up- and downsampling | CNN | No/10.7 m
PGCNN [198] | SISR | Approach with channel attention | EDSR | Residual learning, subpixel convolution, and attention mechanism | Postupsampling | CNN | No/1.44 m
HSENet [199] | SISR | Attention for multiscale recurring features | — | Residual learning, subpixel convolution, and attention mechanism | Postupsampling | CNN | Yes/—
BCLSR [200] | SISR | Recurrent convolutional model | — | Residual learning and subpixel convolution | Postupsampling | CNN | Yes/170,000
CDGAN [201] | SISR | Coupled discriminator | ESRGAN | Residual learning and subpixel convolution | Postupsampling | GAN | No/1.4 m
DRGAN [202] | SISR | RDN-like generator | RDN | Residual learning and subpixel convolution | Postupsampling | GAN | No/—
RS-ESRGAN [203] | SISR | Multiple training phases on different data sets | ESRGAN | Residual learning | Preupsampling | GAN | Yes/—
udGAN [204] | SISR | Multiscale generator with ultradense residual blocks | — | Residual learning and subpixel convolution | Postupsampling | GAN | No/2.4 m
Shin et al. [205] | SISR | Multiscale generator with pyramidal structure and discriminator with difference of Gaussian kernels on feature maps | — | Residual learning and subpixel convolution | Progressive upsampling | GAN | No/—
Enlighten-GAN [206] | SISR | Multiscale generator with intermediate output and the clipping-and-merging method | ESRGAN | Residual learning and subpixel convolution | Progressive upsampling | GAN | No/—
EEGAN [207] | SISR | Downscaling assisted by edge enhancement and attention | — | Residual learning, subpixel convolution, and attention mechanism | Progressive upsampling | GAN | Yes/—
E-DBPN [92] | SISR | DBPN-like generator with channel attention on multiple layers | DBPN | Residual learning, transposed convolution, and attention mechanism | Iterative up- and downsampling | GAN | No/—
SRAGAN [208] | SISR | Generator and discriminator with local and global channel and spatial attention modules | — | Residual learning, attention mechanism, and subpixel convolution | Postupsampling | GAN | No/4.8 m
EvoNet [209] | MISR | Approach assisted by evolutionary image model algorithm | — | Residual learning | Preupsampling | CNN | No/—
Märtens et al. [211] | MISR | Simple CNN for PROBA-V images that takes as the input a concatenation of the LR images | — | — | Preupsampling | CNN | No/119,000
DeepSUM [212] | MISR | SR of each input separately and fusion of results | — | Residual learning | Preupsampling | CNN | Yes/—
DeepSUM++ [213] | MISR | Extension of DeepSUM with graph convolutional operations | — | Residual learning | Preupsampling | CNN | No/—
HighRes-Net [214] | MISR | Paired SR of an LR image and the chosen reference LR as well as ShiftNet for registration of results | — | Residual learning and transposed convolution | Postupsampling | CNN | Yes/600,000 + 34 m
MISR-GRU [216] | MISR | LR images regarded as a time series; paired SR performed at each time step, similar to HighRes-Net; uses ConvGRU layers and ShiftNet | — | Residual learning and transposed convolution | Postupsampling | CNN | Yes/900,000
RAMS [218] | MISR | Approach assisted by 3D convolutions and attention modules | — | Residual learning, subpixel convolution, and attention mechanism | Postupsampling | 3D CNN | Yes/1 m
SD-GAN [219] | RefSR | Saliency information used as reference | — | Residual learning and subpixel convolution | Postupsampling | GAN | No/—
(Continued)
with coincident images from the two satellites to down-
scale the AWiFS data to match the spatial resolution of
the corresponding LISS III data. The model was evaluated
only against simple baselines and produced better peak
signal-to-noise ratio (PSNR) and structural similarity index
(SSIM) scores.
DIFFERENT DEGRADATIONS
Sheikholeslami et al. [181] (EUSR) employ a dense network
with a bilinear upsampling layer for the reconstruction.
Contrary to the majority of studies in the literature, the
authors downsample the initial data set via the Lanczos3
kernel [182] to be used in the model’s training following
TABLE 5. (CONTINUED) A SUMMARY OF THE STATE-OF-THE-ART DL MODELS FOR SR IN RS.

MODEL | SR TYPE | DESCRIPTION/NOVELTY | CV MODEL | BUILDING BLOCKS | UPSAMPLING FRAMEWORK | ARCHITECTURE | CODE AVAILABLE/NUMBER OF PARAMETERS
SG-FBGAN [220] | RefSR | Extension of SD-GAN with a triplet of discriminators and recursive layers in the generator; curriculum learning also used | — | Residual learning and subpixel convolution | Postupsampling | GAN | Yes/—
SR-GAN [223] | SISR | — | SRGAN | Residual learning and subpixel convolution | Postupsampling | GAN | No/—
NF-GAN [224] | SISR | Generator based on residual encoder–decoder, discriminator based on ResNet50, and embodies despeckling component | — | Residual learning and transposed convolution | Preupsampling | GAN | No/—
Di-GAN [225] | SISR | Generator based on U-Net and discriminator based on PatchGAN-like network | — | Residual learning and transposed convolution | Preupsampling | GAN | No/—
FSRCNN [226] | SISR | — | — | Residual learning | Preupsampling | CNN | No/—
PSSR [227] | SISR | Learnable preupsampling, uses a complex structure block for complex numbers, uses residual compensation approach, and uses fully polSAR data | — | Residual learning and transposed convolution | Preupsampling | CNN | No/—
WDCCN [228] | SISR | Imports weighted dense connections | DRCN | Residual learning | Preupsampling | CNN | No/—
MSSRRC [229] | SISR | Uses residual compensation and uses fully polSAR data | VDSR | Residual learning | Preupsampling | CNN | No/—

CV Model refers to the models presented in Table 2. ConvGRU: convolutional gated recurrent unit; LISS: linear imaging self scanning; polSAR: polarimetric synthetic aperture radar; SAR: synthetic aperture radar.
FIGURE 26. A high-level overview of LGCNet. Blue boxes represent convolutional layers followed by ReLU activation, orange boxes represent the concatenation of selected feature maps via a convolutional layer, and the green box represents the last convolutional layer for the final reconstruction. (Source: [171]; used with permission.)
Wald’s protocol. The resulting image is then downsampled
again with the same kernel and compared with the initial
LR image in a PSNR-based loss function. Experiments show
that results are similar to other methods, but the proposed
approach prevails when larger input images are used.
Arguing that most published studies following Wald's protocol produce synthetic LR images through a specific distortion model and develop methods that focus solely on the enhancement of such LR images, Zhang et al. [183] propose an unsupervised model to handle multidegradation schemes. In particular, their approach involves a postupsampling generator network that produces an SR image and a degrader network that distorts this SR result. The final loss function is the MSE between the degraded image and the original LR, thus alleviating the need to compare the result to an HR ground truth. For the degrader, the authors adopt the same pipeline as in [184]. Results on the UC Merced and NWPU-RESISC45 data sets (see the “Data Sets” section) and Jilin-1 satellite images showed that the proposed method outperformed state-of-the-art DL approaches when distortions other than bicubic interpolation were used for the LR input. It managed to produce results closer to the ground truth and retain edges and object shapes more correctly.
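The core of this unsupervised idea, comparing a re-degraded SR prediction with the observed LR image instead of an HR ground truth, can be sketched as follows. The tiny generator and degrader below are stand-ins; the networks in [183] are considerably more elaborate.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyGenerator(nn.Module):
    """Stand-in postupsampling generator (not the architecture of [183])."""

    def __init__(self, channels=3, scale=4):
        super().__init__()
        self.scale = scale
        self.body = nn.Sequential(nn.Conv2d(channels, 32, 3, padding=1), nn.ReLU(),
                                  nn.Conv2d(32, channels, 3, padding=1))

    def forward(self, lr):
        up = F.interpolate(lr, scale_factor=self.scale, mode="bilinear",
                           align_corners=False)
        return up + self.body(up)


class TinyDegrader(nn.Module):
    """Learnable degrader that maps the SR prediction back to the LR size."""

    def __init__(self, channels=3, scale=4):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(channels, 32, 3, stride=scale, padding=1),
                                 nn.ReLU(), nn.Conv2d(32, channels, 3, padding=1))

    def forward(self, sr):
        return self.net(sr)


if __name__ == "__main__":
    lr = torch.rand(4, 3, 32, 32)
    gen, deg = TinyGenerator(), TinyDegrader()
    sr = gen(lr)
    # Self-supervised consistency loss: no HR reference is required.
    loss = F.mse_loss(deg(sr), lr)
    loss.backward()
    print(sr.shape, float(loss))
```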
WAVELETS
A large family of traditional non-DL approaches perform the SR task in the frequency domain, usually through the wavelet transform. The general pipeline is to decompose the image into a number of frequency components, separately enhance the components, and then apply the inverse transformation to obtain the final SR image. A number of DL methods have been proposed (WTCRR [185], DWTSR [186], and RRDGAN [187]) that use the 2D discrete wavelet transform and design a DL network to undertake the task of component enhancement. In WTCRR, residual blocks of a ResNet are replaced with recurrent blocks to reduce the number of parameters and increase the network depth without overfitting. On the other hand, DWTSR uses a simpler architecture but employs the 2D stationary wavelet transform along with the 2D discrete wavelet transform for richer features. Finally, RRDGAN enhances the ESRGAN architecture with denser connections, a relativistic discriminator, and a total variation loss [188] to separately enhance the four components of the Haar wavelet transform. All of the aforementioned studies achieve good results, indicating that the frequency domain may offer more useful information to a DL model and is, thus, worth exploring further.
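As a hedged illustration of the wavelet-domain pattern, the sketch below performs a single-level Haar decomposition of a bicubically upsampled input, refines the stacked subbands with a small CNN, and applies the inverse transform; it is a minimal stand-in rather than any of the WTCRR, DWTSR, or RRDGAN architectures.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def haar_dwt(x):
    """Single-level 2D Haar transform; returns (LL, LH, HL, HH)."""
    a, b = x[..., 0::2, 0::2], x[..., 0::2, 1::2]
    c, d = x[..., 1::2, 0::2], x[..., 1::2, 1::2]
    return ((a + b + c + d) / 2, (a - b + c - d) / 2,
            (a + b - c - d) / 2, (a - b - c + d) / 2)


def haar_idwt(ll, lh, hl, hh):
    """Exact inverse of haar_dwt."""
    a = (ll + lh + hl + hh) / 2
    b = (ll - lh + hl - hh) / 2
    c = (ll + lh - hl - hh) / 2
    d = (ll - lh - hl + hh) / 2
    out = torch.zeros(*ll.shape[:-2], ll.shape[-2] * 2, ll.shape[-1] * 2,
                      device=ll.device, dtype=ll.dtype)
    out[..., 0::2, 0::2], out[..., 0::2, 1::2] = a, b
    out[..., 1::2, 0::2], out[..., 1::2, 1::2] = c, d
    return out


class WaveletSR(nn.Module):
    """Enhance the four Haar subbands of a bicubically upsampled image with a
    small CNN, then transform back to the pixel domain."""

    def __init__(self, channels=3, scale=2):
        super().__init__()
        self.scale = scale
        self.enhance = nn.Sequential(nn.Conv2d(4 * channels, 64, 3, padding=1),
                                     nn.ReLU(),
                                     nn.Conv2d(64, 4 * channels, 3, padding=1))

    def forward(self, lr):
        up = F.interpolate(lr, scale_factor=self.scale, mode="bicubic",
                           align_corners=False)
        subbands = torch.cat(haar_dwt(up), dim=1)
        refined = subbands + self.enhance(subbands)
        ll, lh, hl, hh = torch.chunk(refined, 4, dim=1)
        return haar_idwt(ll, lh, hl, hh)


if __name__ == "__main__":
    print(WaveletSR()(torch.rand(1, 3, 48, 48)).shape)  # torch.Size([1, 3, 96, 96])
```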
ATTENTION MECHANISM
Several studies also employ attention mechanisms to aid
the downscaling process and help the model focus on the
high-frequency details of the image. For example, Dong et
al. [189] (MPSR) and Gu et al. [190] (DRSEN) design archi-
tectures with various residual connectivity schemes and
channel attention modules similar to the squeeze-and-ex-
citation blocks proposed in [66]. Haut et al. [191] utilize the
residual channel attention block (RCAB) attention module [89] inside convolutional blocks with residual connections at multiple levels. RCAB is also adopted by Zhang et al. [192] (MSAN and SAMSAN), who additionally propose a scene-adaptive learning framework where a separate model is fine-tuned on each possible scene depicted in an RS image, and Dong et al. [193] (DSSR) also present a chain learning strategy where a 2k× model is based on a pretrained k× model.
A similar architecture to DSSR is proposed by Wang
et al. [194] (AMFFN), where both squeeze-and-excitation
and RCAB modules are applied on a multiscale feature ex-
traction framework containing parallel convolutions with
varying kernel sizes. Lei and Liu [195] (IRAN) propose a
network comprising a series of inception modules followed
by channel (squeeze-and-excitation) and spatial attention
mechanisms. Similarly, Wang et al. [196] (NLASR) design a
model with nonlocal blocks [197] that follows the iterative
up- and downsampling scheme with channel and spatial
attention modules.
Finally, based on the popular EDSR architecture, Peng
et al. [198] (PGCNN) propose a gated residual block that
encourages the model to focus on high-frequency details,
whereas Lei and Shi [199] (HSENet) employ custom atten-
tion modules that aim to discover information recurring at multiple scales inside the image. All of the aforementioned studies show that the inclusion of such attention mechanisms boosts the model's performance and helps achieve a sharper downscaled result closer to the HR ground truth.
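The squeeze-and-excitation-style channel attention shared by many of these models fits in a few lines; the reduction ratio and channel width below are assumptions.

```python
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style channel attention, as used inside the
    residual blocks of several of the models discussed above."""

    def __init__(self, channels=64, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)          # "squeeze": global context
        self.fc = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),                            # per-channel weights in (0, 1)
        )

    def forward(self, x):
        return x * self.fc(self.pool(x))             # "excitation": reweight channels


class RCABlock(nn.Module):
    """Residual block with channel attention (RCAB-like)."""

    def __init__(self, channels=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), ChannelAttention(channels))

    def forward(self, x):
        return x + self.body(x)


if __name__ == "__main__":
    print(RCABlock()(torch.rand(1, 64, 32, 32)).shape)  # torch.Size([1, 64, 32, 32])
```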
RECURSION
Chang and Luo [200] (BCLSR) present a novel approach
by employing a recursive framework on images obtained
from the GaoFen-2 satellite. Their model comprises multiple densely connected convolutional blocks that share their parameters and feed their outputs to a bidirectional convolutional long short-term memory (BiConvLSTM) layer. The
output is then downscaled via a subpixel convolution. The
results show that this method outperformed several estab-
lished DL models and produced sharper results without los-
ing substantial high-frequency details.
GENERATIVE NETWORKS
A multitude of studies have also explored the adaptation of
GAN models for SR. In an interesting approach, Lei et al. [201]
(CDGAN) present the “discrimination-ambiguity” problem,
which states that RS images contain more low-frequency com-
ponents than natural images, thus impairing the discrimina-
tor’s ability to decide whether a given input is real or fake. To
tackle this issue, they propose a “coupled discriminator” that
takes as the input both the predicted SR image and its cor-
responding HR ground truth shuffled by a random gate and
is then tasked with deciding whether the input constitutes a
real–fake pair (one) or a fake–real pair (zero). The genera-
tor architecture is based on ESRGAN. The model competed
against a number of DL methods on the UC Merced and
Wuhan University-Remote Sensing (WHU-RS19) data sets
(see the “Data Sets” section) as well as selected GaoFen-2 im-
ages and produced less blurry results with fewer artifacts.
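A minimal sketch of how a coupled-discriminator input can be assembled is shown next: the SR prediction and its HR counterpart are concatenated in a randomly gated order, and the label encodes that order. The small discriminator is a stand-in and not the network used in [201].

```python
import torch
import torch.nn as nn


class TinyPairDiscriminator(nn.Module):
    """Stand-in discriminator that scores a channel-concatenated image pair."""

    def __init__(self, channels=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2 * channels, 32, 3, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 1))

    def forward(self, pair):
        return self.net(pair)


def coupled_batch(sr, hr):
    """Randomly gate the order of (SR, HR) per sample: label 1 for a
    real-fake pair and 0 for a fake-real pair."""
    gate = torch.rand(sr.shape[0], 1, 1, 1, device=sr.device) < 0.5
    first = torch.where(gate, hr, sr)
    second = torch.where(gate, sr, hr)
    labels = gate.float().view(-1, 1)
    return torch.cat([first, second], dim=1), labels


if __name__ == "__main__":
    sr, hr = torch.rand(4, 3, 64, 64), torch.rand(4, 3, 64, 64)
    pairs, labels = coupled_batch(sr, hr)
    logits = TinyPairDiscriminator()(pairs)
    loss = nn.functional.binary_cross_entropy_with_logits(logits, labels)
    print(pairs.shape, labels.view(-1), float(loss))
```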
A number of studies have also proposed minor adjust-
ments of popular SR architectures to fit the needs of the RS
domain. For example, Ma et al. [202] (DRGAN) utilize an
RDN-like architecture for the generator with subpixel con-
volution for downscaling and a VGG loss function. Their
model was evaluated on the NWPU-RESISC45 data set (see
the “Data Sets” section) and several other CV benchmarks
and achieved sharper images with cleaner object bound-
aries as compared with other state-of-the-art DL methods.
Salgueiro Romero et al. [203] (RS-ESRGAN) adapt the ESR-
GAN model in a preupsampling framework and train the
generator in three stages: first, it is trained on a set of World-
View images only; then, it is fine-tuned on pairs of World-
View and Sentinel-2 images; and, finally, it is trained in an
adversarial manner with WorldView and Sentinel-2 pairs.
The final image is formed by a linear combination of the
generator’s output trained with and without the adversarial
scheme, which helps the user calibrate the perception–dis-
tortion tradeoff.
MULTISCALE GENERATORS
Dense and multilevel connections have also been intro-
duced to different generator architectures with the aim of
extracting more accurate representations of both small- and
large-scale objects. For example, Wang et al. [204] (udGAN) design a novel ultradense residual block that contains parallel convolutions and additional diagonal connections, while features at each level are concatenated through a bottleneck 1 × 1 convolution to limit the channel size. Their
study illustrates the value of this new connectivity scheme
by surpassing several other established DL methods in the
sharpness and quality of the produced images.
Shin et al. [205] propose a multiscale generator compris-
ing multiple parallel streams in a pyramidal fashion, each
of which is formed by a series of RDBs. A reconstruction
module fuses the output of all streams and produces the
final SR image. Before entering the discriminator, an HR
or SR image is f irst fed to a pretrained VGG network, and a
number of intermediate feature maps are selected. A set of
blurring Gaussian kernels is applied on these feature maps,
and the results are then fed to a discriminator model with
a PatchGAN architecture. Both networks are illustrated in
Figure 27. The proposed method achieved much better re-
sults compared to EEGAN and CDGAN, and it managed
to capture and recover even small-scale details in the pro-
duced images, which the other techniques failed to do.
Another multiscale approach was introduced by [206]
(Enlighten-GAN) that improves on the ESRGAN by adding
an “enlighten block ” to the generator. This block outputs an
intermediate SR image and helps the generator learn high-
frequency information in a progressive manner. The loss
FIGURE 27. The (a) generator and (b) discriminator for the GAN proposed in [205]. (Source: [205]; used with permission.)
function has a self-supervised hierarchical perceptual loss
component, where an autoencoder is trained from scratch
on RS images, and the distance between the correspond-
ing feature maps of the SR and HR images is computed.
Finally, the authors present a novel large image tiling and
batching approach for downscaling overlapping satellite
image patches separately (Figure 28). Experimental results
showed that Enlighten-GAN produces sharper images with much fewer artifacts than other GAN-based methods
while, at the same time, retaining the true hues and shapes
of the objects.
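A hedged sketch of the clipping-and-merging idea follows: overlapping patches are super-resolved independently, half of each overlap region is clipped, and the clipped tiles are stitched back together. The tile size and scaling factor mirror the example in Figure 28, a square input is assumed, and any SR callable can replace the bicubic placeholder.

```python
import torch
import torch.nn.functional as F


def bicubic_sr(patch, scale=4):
    """Placeholder SR operator; a trained model can be substituted here."""
    return F.interpolate(patch, scale_factor=scale, mode="bicubic",
                         align_corners=False)


def clip_and_merge(img, sr_fn=bicubic_sr, patch=96, scale=4):
    """2 x 2 clipping-and-merging: overlapping patches are super-resolved
    independently, half of each overlap is clipped, and the clipped tiles
    are joined into the full SR image. Assumes a square input."""
    _, _, h, w = img.shape
    tiles, keep = [], (h * scale) // 2          # output rows/cols kept per tile
    for top in (0, h - patch):
        for left in (0, w - patch):
            sr = sr_fn(img[..., top:top + patch, left:left + patch], scale)
            # Keep the half of the SR tile nearest to its image corner.
            r0 = 0 if top == 0 else sr.shape[-2] - keep
            c0 = 0 if left == 0 else sr.shape[-1] - keep
            tiles.append(sr[..., r0:r0 + keep, c0:c0 + keep])
    top_row = torch.cat(tiles[0:2], dim=-1)
    bottom_row = torch.cat(tiles[2:4], dim=-1)
    return torch.cat([top_row, bottom_row], dim=-2)


if __name__ == "__main__":
    out = clip_and_merge(torch.rand(1, 3, 168, 168))
    print(out.shape)  # torch.Size([1, 3, 672, 672])
```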
GENERATIVE ADVERSARIAL
NETWORKS AND ATTENTION
Attempting to improve the output of an SR GAN model,
multiple studies exploit attention mechanisms. Jiang et al. [207] (EEGAN) propose a generator that first enhances the input and then extracts and sharpens its edges (Figure 29). A mask branch with an attention mechanism is also employed during the edge-enhancement step to focus on the useful information. The model outperforms SRGAN,
VDSR, and SRCNN on the Kaggle Draper Satellite Image
Chronology data set (see the “Data Sets” section).
In addition, Yu et al. [92] (E-DBPN) propose an exten-
sion of the popular DBPN model in a GAN setting. The gen-
erator adopts the DBPN architecture where each up-projec-
tion unit is followed by a squeeze-and-excitation channel
attention mechanism, and the features extracted from mul-
tiple levels of the network are fused in a sequential manner.
The authors pretrain the generator with the MSE loss and,
then, fine-tune it in an adversarial setting. The results show
that the proposed model produces sharper results closer to the ground truth, with fewer blurring effects and artifacts.
Finally, Li et al. [208] (SRAGAN) design a complex GAN with local and global channel and spatial attention modules both in the generator and the discriminator network to capture short- as well as long-range dependencies between pixels. Several experiments proved the superiority of the
proposed model, especially at higher scaling factors.
MULTIPLE-IMAGE SUPER-RESOLUTION
In an MISR setting, a model takes as the input multiple
LR images of the same scene taken from different angles/
viewpoints and aims to synthesize a single HR image. The
main advantage of this approach is the fact that the minor
geometric displacements and distortions among the LR
FIGURE 28. An example of the clipping-and-merging method pipeline. The input image has a size of 168 × 168 and is cropped into four overlapping patches, each with a size of 96 × 96. The patches are independently downscaled by an SR algorithm (denoted SRR here), producing four 384 × 384 images. Half of the overlap region of each patch is then clipped, ending up with four 336 × 336 images, which are then joined to produce the final SR prediction. (Source: [206]; used with permission.)
FIGURE 29. The pipeline of the edge-enhancement procedure for EEGAN. (Source: [207]; used with permission.) I: image; C: combination; I_Base: the intermediate SR result; I_Edge: edge map extracted from the intermediate SR result; I_Edge′: edge map extracted from the final SR result.
images offer a richer source of information for a candidate
downscaling model than any individual LR image alone,
thus usually obtaining better results than SISR. Also, a key
difference from STF or SSF is the fact that both LR and HR
images contain information on the same spectra, whereas
their acquisition times are never coincident.
Such an MISR method is described in [209] (EvoNet), where a number of shifted LR images are used to produce a single HR image. In the proposed model, each LR image is independently enhanced through a ResNet, and, then, the individual SR outputs are coregistered and fed to the evolutionary image model algorithm [210], which constructs the final output. One experiment employed artificially shifting and downsampling images for the creation of training data, whereas another experiment utilized a number of Sentinel-2 images to produce a Satellite pour l'Observation de la Terre (SPOT)-like HR output downscaled by a 2× factor. EvoNet achieved better results than several traditional SISR and MISR approaches in both distortion and perceptual quality metrics at the expense of higher computational time. On a qualitative basis, EvoNet produced results similar to SRGAN but less blurry and with more artifacts.
A common source of data for the MISR problem is the
Project for On-Board Autonomy-Vegetation (PROBA-V) satel-
lite, which is able to capture MS images at 300-m spatial
resolution every day and 100-m spatial resolution every five days. Since both observations lie in the same spectral bands and are never paired on the same date, a number of studies exploit the LR images for the construction of the corresponding HR image in an MISR approach, with the au-
thors in [211] proposing a PROBA-V data set exclusively for
this problem setting. They also design a simple four-layer
CNN for benchmarking and propose a custom metric that
takes into account spatial displacements between the pre-
diction and the ground truth.
In their study [212] (DeepSUM) (Figure 30), Molini et
al. design a network that downscales an NIR or red band
of PROBA-V data. The model takes as the input a single image and performs feature extraction. All extracted features are then coregistered and fused in the feature space. Before the final fusion, a mutual inpainting process is employed to replace unreliable pixels in a feature map (such as clouds, shadows, and so on) with values taken from the corresponding feature maps of other images. The authors claim that end-to-end training of this model leads to many local optima, so they choose to train each step separately. Evaluated
against other MISR methods, the proposed model achieved
better results and sharper output scoring first in the PROBA-
V SR challenge issued by the European Space Agency [211].
In a subsequent publication [213] (DeepSUM++), the authors
extend the feature extraction part with graph convolutional
operations to exploit nonlocal correlations among pixels.
Another popular method for the PROBA-V data set was
proposed by Deudon et al. [214] (HighRes-Net). The authors
argue that the set of LR images contain redundant low-fre-
quency information, so they select the median LR image as
the reference and pair each LR image with this. Then, they
train a model to extract a shared representation for each pair, which allows it to highlight differences in multiple LR views and focus on the important high-frequency features. The extracted embeddings are then recursively fused using a mechanism with shared weights, and the common
representation is downscaled to predict the final SR image.
Another model, called ShiftNet, is also proposed; it regis-
ters the SR with the target HR image to properly calculate
FIGURE 30. An overview of the DeepSUM model. SISRNet performs the feature extraction, RegNet the feature registration, and FusionNet the final feature fusion and reconstruction. The global dynamic convolution (GDC) is a convolution between an image and the corresponding learned filter for the image registration. (Source: [212]; used with permission.) N: number of input images.
the loss function. Without such a registration, the model outputs blurry results to compensate for the misalignment
between the SR and the target HR. The architecture follows
HomographyNet, proposed by [215], but is trained coop-
eratively with HighRes-Net in an end-to-end setting and
achieves results similar to DeepSUM.
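To make the recursive fusion idea concrete, the sketch below pairs every LR view with the median reference, encodes each pair with a shared encoder, and halves the number of embeddings per step with a shared pairwise fusion block. Widths, depths, and the power-of-two number of views are assumptions, and the ShiftNet registration step is omitted.

```python
import torch
import torch.nn as nn


class PairwiseFusionSR(nn.Module):
    """HighRes-Net-flavored MISR: encode (view, median reference) pairs with a
    shared encoder, recursively fuse embeddings pairwise, then upsample."""

    def __init__(self, channels=1, width=32, scale=3):
        super().__init__()
        self.encode = nn.Sequential(nn.Conv2d(2 * channels, width, 3, padding=1),
                                    nn.ReLU(), nn.Conv2d(width, width, 3, padding=1))
        self.fuse = nn.Sequential(nn.Conv2d(2 * width, width, 3, padding=1),
                                  nn.ReLU())
        self.decode = nn.Sequential(
            nn.ConvTranspose2d(width, width, scale, stride=scale),
            nn.ReLU(), nn.Conv2d(width, channels, 3, padding=1))

    def forward(self, lr_views):                      # (B, N, C, H, W), N a power of 2
        ref = lr_views.median(dim=1).values           # shared low-frequency reference
        states = [self.encode(torch.cat([lr_views[:, i], ref], dim=1))
                  for i in range(lr_views.shape[1])]
        while len(states) > 1:                        # halve the set at each step
            states = [self.fuse(torch.cat([states[i], states[i + 1]], dim=1))
                      for i in range(0, len(states), 2)]
        return self.decode(states[0])


if __name__ == "__main__":
    views = torch.rand(2, 8, 1, 32, 32)               # eight 32x32 LR views
    print(PairwiseFusionSR()(views).shape)            # torch.Size([2, 1, 96, 96])
```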
Rifat Arefin et al. [216] (MISR-GRU) (Figure 31) choose to tackle the MISR problem in a time series setting by regarding the LR input images as a temporal sequence. At each time step, their model takes as the input one LR image and the median of all LR inputs, coregisters them, and produces a unified feature map. The output of this stage is then fed to a stack of convolutional gated recurrent unit (ConvGRU) modules [217], and the output is globally averaged across the temporal dimension and downscaled. The final prediction is also registered following the ShiftNet strategy introduced
by [214], and the loss function is a custom negative PSNR
that involves a brightness bias. MISR-GRU achieved the
highest score compared with FSRCNN, SRResNet, Deep-
SUM, and HighRes-Net, and the authors conclude that the
proposed model’s accuracy is highly affected by the number
of LR inputs and the amount of occlusion observed in the
LR images.
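The kind of bias-corrected negative-PSNR objective mentioned above can be sketched as follows; here the brightness bias is simply the mean difference between target and prediction, while the pixel-shift search and clearance masks used in the PROBA-V setting are omitted for brevity.

```python
import torch


def neg_psnr_with_bias(sr, hr, max_val=1.0, eps=1e-8):
    """Negative PSNR with a per-image brightness-bias correction: the mean
    difference between target and prediction is removed before the MSE."""
    dims = (1, 2, 3)
    bias = (hr - sr).mean(dim=dims, keepdim=True)      # per-image brightness offset
    mse = ((hr - sr - bias) ** 2).mean(dim=dims)
    psnr = 10.0 * torch.log10(max_val ** 2 / (mse + eps))
    return -psnr.mean()                                # minimize the negative PSNR


if __name__ == "__main__":
    sr = torch.rand(4, 1, 128, 128, requires_grad=True)
    hr = (sr + 0.05).detach().clamp(0, 1)              # shifted copy: bias is absorbed
    loss = neg_psnr_with_bias(sr, hr)
    loss.backward()
    print(float(loss))
```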
A more complex model was proposed by Salvetti et al.
[218] (RAMS); it employs 3D convolutions and attention
mechanisms on both the temporal and spatial domains to
downscale a single band of PROBA-V data. The 3D convolu-
tions are able to assess the interrelations across the different
dimensions, whereas the attention modules focus on the similarity between the input LR images (temporal attention) or the useful high-frequency details to retain on the spatial domain of the LR feature maps (feature attention). The model performed quite similarly to MISR methods, such as HighRes-Net and DeepSUM. The authors also experimented with a temporal self-ensembling strategy and observed a significant increase in the output accuracy but at
the expense of computational speed.
REFERENCE SUPER-RESOLUTION
In RefSR, the input of the model is accompanied by an auxiliary (reference) image, which provides additional information to assist in the downscaling process. A number of studies have explored using features extracted from the original data as the reference input, and, hereafter, we highlight a selection of the most promising attempts in the literature.
An adversarial RefSR approach is
proposed by a series of publications
([219]–[221]) that focus on the sa-
liency information of the input im-
ages. In [219] (SD-GAN) (Figure 32),
the authors discriminate the high-
ly salient areas of an image as the
foreground and the less salient as the background, and they
argue that, by applying different reconstruction principles based on the level of saliency, the GAN will be able to produce more realistic images stripped of hallucinated pseudotextures. For that reason, they propose the extraction of a saliency map for each input image through a weakly supervised learning scheme [222] and design a generator that takes as the input the LR image concatenated with its corresponding saliency map along the channel dimension and produces an SR output. Additionally, a paired discriminator is used for the adversarial learning, one for the salient (foreground) and one for the nonsalient (background) areas. Experimentation on GeoEye-1 PAN images showed that SD-GAN outperformed other DL approaches, such as SRCNN, ESPCN, VDSR, and SRGAN. A qualitative analysis proved that it managed to produce fewer pseudotextures in salient areas than SRGAN.
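A rough sketch of the saliency-guided input and the foreground/background split is given below: the saliency map is appended as an extra input channel for the generator, and two stand-in critics score the masked salient and nonsalient regions separately. All module shapes and the loss form are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyCritic(nn.Module):
    """Stand-in discriminator used for either the salient or the nonsalient area."""

    def __init__(self, channels=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, 32, 3, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 1))

    def forward(self, x):
        return self.net(x)


def saliency_guided_input(lr, saliency):
    """Concatenate the LR image with its saliency map along the channel axis."""
    return torch.cat([lr, saliency], dim=1)


def generator_adv_loss(sr, saliency, fg_critic, bg_critic):
    """Score salient (foreground) and nonsalient (background) regions with
    separate critics; a simple nonsaturating GAN loss is assumed here."""
    fg_logit = fg_critic(sr * saliency)
    bg_logit = bg_critic(sr * (1 - saliency))
    ones = torch.ones_like(fg_logit)
    return (F.binary_cross_entropy_with_logits(fg_logit, ones) +
            F.binary_cross_entropy_with_logits(bg_logit, ones))


if __name__ == "__main__":
    lr, sal_lr = torch.rand(2, 3, 32, 32), torch.rand(2, 1, 32, 32)
    print(saliency_guided_input(lr, sal_lr).shape)     # torch.Size([2, 4, 32, 32])
    sr, sal_hr = torch.rand(2, 3, 128, 128), torch.rand(2, 1, 128, 128)
    print(float(generator_adv_loss(sr, sal_hr, TinyCritic(), TinyCritic())))
```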
Extending their previous work in a subsequent study
[220] (SG-FBGAN), the same research group proposes a
recursive generator architecture and a triplet of discrimina-
tors. More precisely, the generator performs parallel pro-
cessing of salient and nonsalient information in a recursive
fashion, and the final output of the network is the output of the last iteration. Similar to SD-GAN, a salient area discriminator and a nonsalient area discriminator are employed
along with a global discriminator that takes as the input
SR or HR images and learns to classify them. Then, the out-
puts of all discriminators over all iterations are averaged
to calculate an overall discriminator loss. When compared
with VDSR, RDN, EDSR, SRFBN, SRGAN, SD-GAN, and D-DBPN, the proposed method achieved superior results, producing more realistic images with fewer pseudotextures and artifacts. The authors also experiment with curriculum learning and more complex degradation schemes, and the results were superior to those of the other DL approaches, especially for higher scaling factors (3× and 4×).
FIGURE 31. An overview of the MISR-GRU model, where l_i is the ith LR input image, H is the predicted downscaled image, and h_i is the ith hidden state of the ConvGRU layer. In the original article, the encoder comprises two convolutional layers and two residual blocks (each with two convolutional layers and parametric ReLU activation), while the decoder consists of a deconvolutional layer and two 1 × 1 convolutional layers. GAP: global average pooling layer. (Source: [216]; used with permission.)
To summarize this analysis, there are two main ap-
proaches a researcher can take, depending on the number
of available images in the data set at hand. When only a
single LR image can be acquired per occasion, SISR and
RefSR methods can be applied. In particular, several of
the aforementioned models offer a robust solution to the
downscaling problem, proving that certain mechanisms
and modules can further boost performance and achieve
sharp results. For example, attention mechanisms (e.g.,
MPSR, DRSEN, DSSR, Haut et al. II [191], and NLASR) can
always assist the discovery and preservation of high-fre-
quency components, whereas multiscale feature extraction
structures (e.g., NLASR and Shin et al. [205]) can unravel
nonlocal correlations inside the image and expand the re-
ceptive field of basic convolutional layers.
Furthermore, a number of novel techniques seem to improve the efficiency of the underlying model, e.g., the diagonal connectivity scheme proposed in udGAN or the clipping-and-merging postprocessing technique and the autoencoder loss proposed in Enlighten-GAN. Finally, certain methods (EUSR, DWTSR, DRSEN, DSSR, DGANet-ISE, NLASR, Shin et al. [205], and SG-FBGAN) manage to
perform better at larger scaling factors, whereas Zhang et al.
[183] provide an interesting candidate when different dis-
tortions have taken place during the LR image acquisition.
Unfortunately, up to this point in time, only a handful of
RefSR methods have been developed, and none seems to
match the efficiency and robustness of the SISR domain.
On the other hand, when multiple LR images can be ob-
tained for each training/testing sample, then MISR models
can be employed. In this family of methods, MISR-GRU
and RAMS, in particular, seem to prevail in terms of both
the resulting image quality and the number of trainable pa-
rameters. It is worth noting that a common challenge faced
by all MISR approaches is the coregistration of the input
LR images, which is handled differently by each proposed
model, either inside the network or as a separate prepro-
cessing step in the pipeline. In addition, this coregistration
may incur minor shifts in the output, which, in turn, can
potentially affect the computation of the loss function dur-
ing training and encourage a blurry result. This phenom-
enon has been successfully handled through the ShiftNet
module, which was proposed in HighRes-Net and, subse-
quently, used in other studies. Finally, it is again proven
that attention mechanisms enhance the downscaled output
and, also, that the number and clarity of the input LR im-
ages can greatly affect the final result.
SUPER-RESOLUTION FOR SYNTHETIC APERTURE
RADAR AND AERIAL IMAGERY
SYNTHETIC APERTURE RADAR
Most of the SAR spatial resolution enhancement techniques
related to deep neural networks use the SISR approach,
which makes the data collection, processing, and experi-
mentation fairly straightforward and easier compared
FIGURE 32. The SD-GAN model. (Source: [219]; used with permission.)
to optical data. However, SAR data inherently introduce
speckle noise, which few authors explicitly consider when
building SR pipelines.
Wang et al. [223] used an SISR approach by applying an
SRGAN on TerraSAR-X images after having been despeck-
led using a CNN, as described in [230]. The HR image is
downsampled by a factor of four using a Gaussian kernel,
while both the generator and discriminator elements are
CNN based. The generator element produces the SR image
using the LR image, while the discriminator compares the
SR image with the HR image. The loss function compris-
es a perceptual loss with a content (pixelwise MSE) and a
weighted adversarial (probability-based) component of the
discriminator.
Gu et al. [224] propose a transfer learning GAN-based paradigm for dealing with speckle noise using a so-called noise-free GAN (NF-GAN) to preserve the high-frequency image details as much as possible. They experiment with the horizontal–horizontal (HH) polarization channel of Airborne SAR data. The generator element consists of a despeckling network and the reconstruction network, while the discriminator element is ResNet based. The despeckling network is pretrained using optical images with speckle noise added to them, and it uses an MSE loss. Its input is a noisy LR image (an HR version downsampled by a factor of two). As with the previous case, the NF-GAN objective function is defined by an adversarial and a pixelwise (MSE) component. The authors train their network pipeline with and without the despeckling component and show that the former, indeed, works better.
Li et al. [225] tried to solve the problem of increased sys-
tem integration time and low azimuth resolution of geo-
synchronous SAR (GEO SAR) using a CNN-based G AN ap-
proach. GEO SAR is an active area of research in developing
a SAR satellite system in geosynchronous orbit, which will
significantly assist in operational disaster monitoring by
increasing the temporal resolution compared to low-Ear th
orbit satellite systems. In particular, the authors generated
synthetic GEO SA R data based on advanced land observ-
ing satellite phased array type L -band SAR (ALOS PALSAR)
characteristics. They use a dialectical GAN (Di-GAN) [231]
with the generator element comprising a U-Net and the discriminator a PatchGAN-like network. The generator takes the simulated LR GEO SAR image as the input, and the produced SR image is compared with the ALOS PALSAR HR image in the discriminator. The authors claim a noticeable improve-
ment of the resolution, which is mostly based on a qualita-
tive comparison.
Cen et al. [226] propose a three-module CNN-based
network named FSRCNN for downscaling bistatic SAR
images. The first module is used for feature extraction in
various scales of the LR images. The second module adds
together the resulting feature maps that were learned from
the first module. The third module consists of a reconstruc-
tion CNN that computes the final SR image. The authors
compare their results with bilinear, bicubic, and SRCNN
approaches using PSNR and SSIM and show an overall best
performance of the proposed FSRCNN.
Helal-Kelany et al. [232] aimed to enhance the coregis-
tration accuracy between two single-look complex images
of European Remote Sensing-1/2 (ERS-1/2) data. They train a
scale-invariant SR CNN (SINV CNN) model using both the
amplitude and phase, which mainly takes advantage of the
feature extraction and residual block components. Their re-
sult is evaluated based on descriptive statistics of the coher-
ence between SINV CNN and sinc interpolation instead of the metrics commonly used in CV, which may make their output difficult to compare with other approaches.
Shen et al. [227] present a rather complete work where
they apply their technique (PSSR) to full polarimetric SAR
(PolSAR) images. Unlike [232], they do not treat the real and imaginary image parts separately, since information is lost through such a separation, but handle them jointly with a dedicated structure block. They use various satellite sensors, such as Radarsat-2, experimental SAR (ESAR), and polarimetric and interferometric SAR (PiSAR), whose data they despeckle first. They compare their approach (along with the residual compensation strategy) with the conventional non-DL approach and multichannel SAR SR (MSSR) using the PSNR and MAE. They also use the equivalent number of looks, which is used to spot whether artifacts are introduced after SR. Notably, they experiment with the presence of speckle noise and show that their approach is superior to the traditional methods.
Lin et al. [229] also use PolSAR data and propose a re-
sidual compensated MSSR (MSSRRC) to tackle issues of the
conventional (non-DL-based) SR approaches, such as the
insufficient use of polarimetric information and decreased reconstruction of details. Their network is a VDSR adjusted for multichannel (full-PolSAR) input applied on RadarSat-2 data that is compensated for by residuals between LR reconstructed and original images. Prior to the training, all data are despeckled. PSNR, SSIM, and qualitative evaluation show better performance with and without residual compensation compared to conventional SR approaches.
Yu et al. [228] propose a weighted dense connected
convolutional network (WDCCN), which they claim is a
better alternative to fast SR CNNs and DRCN. Their net-
work is based on DRCN as well as the notion of weighted
dense connections, and it tries to combat the restricted fea-
ture propagation issue. They compare their approach with
SRCNN and DRCN using PSNR, which suggests a better
performance.
In conclusion, before one starts searching for baseline
models for SAR image downscaling based on the currently published literature, there are certain decisions that must be made. For example, the processing level of the input data, ranging from single-look complex to coregistered and/or geometrically corrected, speckle filtered, and so on, plays a role in designing fit-to-purpose downscaling models.
Similarly, the preferred type of products (e.g., fully PolSAR,
interferometric wide swath mode, and so on) is important.
We now provide some general directions, which should be taken with care and should not discourage authors from further experimentation, since SAR image downscaling is still in its research infancy. Results from architectures such as NF-GAN and PSSR indicate that speckle noise needs special treatment that should be integrated into the overall architecture, thus leading to end-to-end approaches. As a baseline, researchers could begin with general noise suppression architectures established in the CV field or dive deeper by adapting architectures dedicated to speckle noise reduction that already exist in the literature. Residual block components also seem to add value to the overall learning. In addition, if one decides to experiment with single-look complex images, using a dedicated structure block would be more fruitful (e.g., PSSR) compared to the opposite (e.g., SINV CNN), as would adopting activations other than ReLU (e.g., parametric ReLU, leaky ReLU, and so on) that will not freeze the filters' weight updates. Finally, we suggest that more focus can be placed on GAN-based architectures in SAR downscaling since they can exploit more types of inputs and explicitly take into consideration the unique characteristics of SAR imaging.
AERIAL IMAGERY FROM UNMANNED
AERIAL VEHICLES/DRONES
Since their initial mass production and market distribution, unmanned aerial vehicles (UAVs) have represented one of the most practical and simple means of data acquisition, influencing a plethora of applications, including RS. Simple architectures as well as easy-to-use and low-cost solutions contributed to increasing their usage and expanding their applicability for various objectives. The simplicity of integrating widely used sensory systems, such as optronics, played a significant role in substituting for core RS systems, as UAVs overcome many applicability limitations. Nonetheless, despite their efficiency and robustness as data acquisition systems, simple cameras mounted on a UAV cannot entirely substitute for satellite alternatives, as the latter exhibit enhanced payload sensor technical specifications, such as higher spatial resolution.
Aiming to exploit UAV systems in specific RS applications and to achieve higher spatial resolution for the acquired images, numerous SR approaches have been proposed and validated in real use cases. Depending on the availability of the input images, resolution enhancement techniques are typically divided into MISR and SISR methods, as for satellite imagery SR. However, no DL models have been developed for the MISR case; therefore, hereafter, we focus only on the SISR approach.
Targeting the identification of higher frequencies in images, wavelet multiscale representations have been used for training a CNN and, in turn, the CNN for their estimation [233]. A shallower CNN architecture was proposed in Gonzalez et al. [234] to be integrated onboard a UAV so that computational resources and power requirements could be kept at low levels. The combination of two sequential CNNs along with a bicubic upsampling stage produces sufficient spatial imagery data. A similar technique was also deployed in Truong et al. [235], where the LR image is inserted into a deep CNN with a residual skip connection and network-in-network for generating the HR images.
To reduce resource consumption by decreasing the to-
tal number of network parameters, a deep recursive dense
network [236] (DRDN) has been proposed. The recursive
dense block can extract abundant local features and adap-
tively combine dif ferent hierarchical features of the input
image. A dedicated implementation of SRGAN (see the “Standard Deep Learning Methods for Downscaling in Computer Vision” section) for UAV operations has been
incorporated as an initial processing step by Zhou et al.
[237] (SAIC). The main target of the proposed pipeline was
to deliver a high-precision detection framework. Nonethe-
less, the spatial increment of the aerial image’s resolution
as an initial processing step is considered imperative to at-
tain high detection performances.
A similar objective was shared in Chen et al. [238],
where a synergistic CNN for spatial resolution enhance-
ment along with a modified object detection algorithm,
which processes the enhanced image, were established. Fi-
nally, dedicated CNN-based models were utilized by Asla-
hishahri et al. [239], targeting the enhancement of aerial
spatial resolution for producing details in plant phenotyp-
ing, showcasing that such models could be application ori-
ented depending on the data set availability.
In conclusion, most approaches applied to the resolution enhancement of aerial images follow similar schemes, as the problem is translated into a CV counterpart. The majority of the corresponding architectures rely on the extraction of features from pretrained models, which eventually limits the necessity of dedicated models apart from application-driven solutions. Due to the fundamental operational nature of UAV systems, the overall performance is meaningful mostly in near real-time operations, which, eventually, is a prerequisite in many cases. Hence, dedicated lightweight architectures for specific drone applications exhibit better performance in terms of both accuracy and execution time with respect to more universal, generic, and heavyweight modeling solutions.
DATA SETS
Despite the abundance of RS images, there is still a notice-
able gap in the availability of public benchmark data sets
for the evaluation of downscaling methods. This is hardly
surprising since such a benchmark data set would require ex-
tremely careful handling and elaborate preprocessing pipe-
lines during assembly to meet the following basic conditions:
Each HR image must be paired with one or more LR im-
ages.
All LR/HR pairs must share the same scaling factor.
All LR/HR pairs must be aligned and coregistered.
All images must contain minimum obstructions (e.g.,
clouds, haze, corr upt pixels, and so on).
Authorized licensed use limited to: National Technical University of Athens (NTUA). Downloaded on June 12,2022 at 13:02:30 UTC from IEEE Xplore. Restrictions apply.
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.
MONTH 2 022 IEEE GEOSCIENCE AND REMOTE SENSING M AGA ZIN E 41
The depicted scenes must be as diverse as possible. Espe-
cially for STF, the temporal/phenological changes must
be as diverse as possible.
A large number of images are required to avoid overfit-
ting DL models with thousands/millions of trainable
parameters.
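As a minimal illustration of the first two conditions (the helper function and the toy patch shapes below are hypothetical and not tied to any particular data set), a simple consistency check over candidate LR/HR pairs could look as follows:

import numpy as np

def check_pairs(pairs, scale):
    """Verify that every (LR, HR) pair of arrays obeys a single integer scaling factor."""
    for i, (lr, hr) in enumerate(pairs):
        if hr.shape[0] != lr.shape[0] * scale or hr.shape[1] != lr.shape[1] * scale:
            raise ValueError(f"Pair {i}: HR shape {hr.shape[:2]} is not {scale}x the LR shape {lr.shape[:2]}")

# Toy example: a 64 x 64 LR patch paired with a 256 x 256 HR patch at scale factor 4.
lr = np.zeros((64, 64, 3), dtype=np.float32)
hr = np.zeros((256, 256, 3), dtype=np.float32)
check_pairs([(lr, hr)], scale=4)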
Apart from a handful of data sets proposed specifically
for the task of spatial downscaling, several data sets ad-
dressing different RS problems, such as object detection or
scene classification, have been systematically used by most
downscaling studies since they offer a ready-to-use collec-
tion of high-quality satellite images. In the following list,
we present the most popular of such data sets and their cor-
responding characteristics.
UC Merced [240] contains 2,100 aerial RGB images com-
ing from the U.S. Geological Survey National Map Ur-
ban Area Imagery depicting 21 different land use classes
at 0.3-m resolution from several U.S. regions.
WHU-RS19 [241] contains 950 aerial RGB images from Google Earth depicting 19 classes of land use at different spatial resolutions reaching up to 0.5 m. Images originate from different regions around the world.
WHU-RS20 [242] is an extension of the WHU-RS19 data set with an extra land use class and a total of 5,000 aerial RGB images.
Remote sensing Scene classification (RSSCN7) [243] contains 2,800 aerial RGB images from Google Earth depicting seven land use classes.
Remote scene classification (RSC11) [244] contains
1,232 aerial RGB images from Google Earth depicting
11 land use classes at 0.2-m spatial resolution. Images
come from several U.S. cities.
The Aerial Image Dataset (AID) [245] contains 10,000 aerial RGB images coming from Google Earth at resolutions ranging from 0.5 to 8 m. They depict 30 land use classes from different countries around the world and at different times and seasons.
NWPU-RESISC45 [246] contains 31,500 aerial RGB images from Google Earth depicting 45 land use classes with spatial resolutions ranging from 0.2 to 30 m. Images come from several different regions around the world.
RS-IDEA Research Group-WU (SIRI-WHU) [247] contains 2,400 aerial RGB images from Google Earth depicting 12 land use classes at a spatial resolution of 2 m. The images mainly cover urban areas in China.
The Brazilian coffee scene data set [248] contains 2,876 SPOT images (green, red, and NIR bands) over four regions in Brazil for binary image classification based on the presence or absence of coffee crops.
Sentinel 1-2 (SEN1-2) [249] contains 282,384 pairs of
Sentinel-1 and Sentinel-2 RGB images at 10-m spatial
resolution from around the world at different seasons.
Sentinel-1/2 MODIS (SEN12MS) [250] contains 180,662 triplets of Sentinel-1 dual-polarization SAR, Sentinel-2 MS, and MODIS land cover images at 10-m spatial resolution coming from all around the globe and at different times.
Dataset of Object deTection in Aerial images (DOTA)
[251] contains 2,806 aerial images from different sen-
sors along with GaoFen-2 and Jilin-1 satellite images. This
data set is targeted toward object detection and includes
labels spanning more than 15 object categories.
DIOR [33] contains 23,463 aerial RGB images from
Google Earth with spatial resolutions ranging from 0.5
to 30 m. The images cover several regions around the
globe, and their labels span more than 20 object catego-
ries.
Coleambally irrigation area (CIA) [252] contains 17 Landsat/MODIS pairs from Coleambally Irrigation Area, Australia, at 25-m spatial resolution. Images were obtained during a single summer season but have strong spatial heterogeneity.
Lower Gwydir Catchment (LGC) [252] contains 14 Landsat/MODIS pairs from LGC, Australia, at 25-m spatial resolution. Images were obtained during a whole year, which also included a major flood. This renders the data set ideal for the study of abrupt and unpredictable changes in time series.
Ar Horqin Banner (AHB) [253] contains 27 Landsat/MODIS pairs from AHB, China, over a span of five years. It is intended for the study of phenological changes in rural areas.
Tianjin [253] contains 27 Landsat/MODIS pairs from
Tianjin, China, over a span of six years. It is intended for
the study of phenological changes in urban areas.
Daxing [253] contains 29 Landsat/MODIS pairs from
Daxing, China, over a span of six years. It is intended for
the study of land cover changes.
The Gaofen Image Data Set [254] contains 150 Gaofen-2
images (RGB and NIR bands) from many regions in Chi-
na with 4-m spatial resolution. It is intended for scene
classification and land cover segmentation.
Kelvin’s PROBA-V SR Data Set [211] contains 1,160 imag-
es from the PROBA-V satellite (red and NIR bands) from
several locations around the globe at different points in
time. Each data point contains an HR image of 100-m
resolution and several LR images of 300-m resolution.
Kaggle’s Draper Satellite Image Chronolog y [255] con-
tains 1,720 aerial RGB images from California, United
States, over a period of five days.
Diverse Real-World Image SR [256] contains 31,970 LR
image patches including aerial images.
Pavia Center [118] was acquired by the reflective optics system imaging spectrometer (ROSIS) over the city of Pavia, Italy, in the wavelength range of 430 to 860 nm. It contains 115 spectral bands and is of size 1,096 × 1,096 pixels.
Houston [118] was acquired by an ITRES-compact airborne spectrographic imager (CASI) 1500 HS sensor over the campus of the University of Houston and its neighboring urban areas. Each HS image comprises 144 bands covering the spectral range of 380 to 1,050 nm,
and each band contains 349 × 1,905 pixels with a spatial resolution of 2.5 m.
Los Angeles [118] was acquired over a port in the city of Los Angeles by the Hyperion sensor mounted on the Earth Observing-1 (EO-1) satellite. The HS image contains 242 spectral bands with a spatial resolution of 30 m.
Botswana [257] was acquired over the Okavango Delta in Botswana by the Hyperion sensor mounted on the EO-1 satellite. The HS image contains 242 spectral bands with a spatial resolution of 30 m.
Hobart [113], acquired by the IKONOS sensor, represents an urban and harbor area of Hobart, Australia. The MS sensor is characterized by four bands (RGB and NIR) as well as a PAN channel with a band range from 450 to 900 nm. The resolution of the MS bands is 4 m and that of the PAN band is 1 m.
Sundarbans [113], obtained by the QuickBird sensor, rep-
resents a forest area of Sundarbans in India. This data set
provides an HR PAN image with a spectral cover range
from 760 to 850 nm and a resolution of 0.6 m as well as
a four-band (RGB and NIR) MS image with a resolution
of 2.4 m.
Washington DC Mall [139] covers an urban area in the Washington, D.C., National Mall. The size of the degraded HS image is 256 × 60 and that of the PAN image is 1,280 × 300.
Moffett Field [139] covers a mixed urban/rural area in Moffett Field, California. The size of the degraded HS image is 79 × 37 with 10-m resolution and that of the PAN image is 395 × 185 with 20-m resolution.
Salinas Scene [139] covers a rural area in Salinas Valley, California. The size of the degraded HS image is 102 × 43 and that of the PAN image is 510 × 215.
Chikusei [258] was captured by Headwall's Hyperspec Visible and Near-Infrared, series C imaging sensor over Chikusei, Ibaraki, Japan, on 29 July 2014. The data set contains 128 bands in the spectral range of 363–1,018 nm. The PAN image has 300 × 300 pixels with a spatial resolution of 2.5 m.
Foster [258] has 33 spectral channels from 400 to 720 nm with 10 nm per band. The original size of each HS image in the Foster data set is 1,341 × 1,022.
ADVANCEMENTS IN COMPUTER VISION
Spatial enhancement, or SR, is being thoroughly investigated in general CV, and a great number of methods have been proposed that build on previous research and expand the state of the art. Hence, in the CV field, some informative review articles have been published in the last couple of years focusing on CV DL algorithms for image downscaling, such as [12] and [16]. In this section, we present some of the most promising and innovative studies in CV published over the last few years that, to the best of our knowledge, have not yet been used in an RS context, hoping to provide a source of inspiration for further applications in the RS field.
Most of the studies found in the literature train models on synthetic data sets whose LR counterparts are constructed via a single predefined degradation algorithm, such as bicubic interpolation. This raises the question of whether such a model can properly generalize to real-world images that have undergone arbitrary degradation processes. To that end, a number of publications (e.g., SFTMD [259] and DAN [260], [261]) explore deep networks that are trained to jointly handle the downscaling task and learn the appropriate blur kernel in an end-to-end fashion. This family of methods is usually referred to as blind SR.
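For intuition, the classic degradation model that blind SR methods try to estimate and invert can be sketched as follows (a minimal PyTorch illustration; the kernel size, blur strength, and noise level are arbitrary assumptions, not values reported by the cited works):

import torch
import torch.nn.functional as F

def gaussian_kernel(size=21, sigma=2.0):
    # Isotropic Gaussian blur kernel; blind SR methods attempt to estimate such a kernel from the LR image itself.
    ax = torch.arange(size, dtype=torch.float32) - (size - 1) / 2
    xx, yy = torch.meshgrid(ax, ax, indexing="ij")
    k = torch.exp(-(xx**2 + yy**2) / (2 * sigma**2))
    return k / k.sum()

def degrade(hr, kernel, scale=4, noise_std=0.01):
    # Classic degradation model: LR = (HR convolved with k), downsampled by s, plus additive noise.
    c = hr.size(1)
    k = kernel.expand(c, 1, *kernel.shape)
    blurred = F.conv2d(hr, k, padding=kernel.size(0) // 2, groups=c)
    lr = blurred[:, :, ::scale, ::scale]
    return lr + noise_std * torch.randn_like(lr)

hr = torch.rand(1, 3, 128, 128)        # stand-in HR patch
lr = degrade(hr, gaussian_kernel())    # synthetic LR observation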
In some cases, the available data set comprises LR imag-
es that need to be downscaled, along with a number of HR
reference images of the same domain that, however, do not
correspond to the LR data. A family of methods attempts to
exploit such HR information through domain translation
approaches and the adaptation of the CycleGAN [164] idea.
For example, [262] (CinCGAN), [263] (DDGAN), [264]
(UISRPS), and [265] (MCinCGAN) propose GAN architec-
tures that are trained to translate the LR images to cleaned,
synthetic LR counterparts and then further downscale the
result to an HR output. The use of cycle-consistency loss
circumvents the need for paired data, so any HR data of the
same domain can be used.
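A minimal sketch of the cycle-consistency idea behind these methods is given below; G_lr2hr and G_hr2lr are placeholder generator networks, and the adversarial and identity terms used in practice are omitted for brevity:

import torch
import torch.nn.functional as F

def cycle_consistency_loss(G_lr2hr, G_hr2lr, lr_batch, hr_batch):
    """Cycle-consistency terms used when LR and HR images are unpaired:
    LR -> HR -> LR should recover the input, and likewise HR -> LR -> HR."""
    lr_cycle = G_hr2lr(G_lr2hr(lr_batch))   # LR -> fake HR -> reconstructed LR
    hr_cycle = G_lr2hr(G_hr2lr(hr_batch))   # HR -> fake LR -> reconstructed HR
    return F.l1_loss(lr_cycle, lr_batch) + F.l1_loss(hr_cycle, hr_batch)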
An emerging trend in the field of SR approaches is diffusion models. Initially proposed in [266], diffusion models employ a Markov chain to slowly add Gaussian noise to the input data and a trainable model to stochastically learn the reverse process of gradually removing this noise. Saharia et al. [267] (SR3) adapt this idea to the image SR of faces and natural images by training a U-Net to iteratively refine Gaussian noise conditioned on the LR image. Their method achieved results of remarkable sharpness and realism while remaining true to the LR input. In addition, by cascading multiple such models, higher scaling factors can be targeted (e.g., 8× and 16×) without compromising the final image quality. This breakthrough study showed that diffusion models can outperform GANs and set an interesting research field for future exploration.
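The core of such an SR3-style training step can be sketched as follows (a simplified, hypothetical illustration; the denoiser architecture and noise schedule are placeholders and not the exact configuration of [267]):

import torch
import torch.nn.functional as F

def q_sample(x0, t, alphas_cumprod, noise):
    # Forward diffusion: x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * noise.
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
    return a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise

def sr_diffusion_training_step(denoiser, hr, lr, alphas_cumprod):
    # Upsample the LR image to the HR size and use it as conditioning input.
    lr_up = F.interpolate(lr, size=hr.shape[-2:], mode="bicubic", align_corners=False)
    t = torch.randint(0, len(alphas_cumprod), (hr.size(0),), device=hr.device)
    noise = torch.randn_like(hr)
    x_t = q_sample(hr, t, alphas_cumprod, noise)
    pred = denoiser(torch.cat([lr_up, x_t], dim=1), t)   # the network predicts the added noise
    return F.mse_loss(pred, noise)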
DISCUSSION
A number of key findings have emerged from the present
literature review that showcase the limitations of the cur-
rent approaches. In the following sections, we highlight
some essential topics for further exploration and research
in the task of image downscaling, focused especially on the
field of RS.
UNIVERSAL METRICS
An important conclusion of the “Metrics” section is the fact that there exist no established evaluation metrics for downscaling models. To be sure, a limited subset of the metrics presented in Table 1 have become more popular and widely used in recent studies; however, none of them can entirely capture and assess the quality of a produced SR image. The design of a universal metric (or set of metrics) able to account for both low distortion and high perceptual quality of an image is still an open field of research, and the DL
community will greatly benefit from any advancement in this area.
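In practice, many studies therefore report a distortion-oriented score next to a learned perceptual one. The sketch below is an illustrative pairing rather than a proposed universal metric; it combines PSNR with the LPIPS perceptual distance and assumes the inputs are NumPy arrays in [0, 1]:

import torch
import lpips                                # pip install lpips
from skimage.metrics import peak_signal_noise_ratio

def report(sr, hr):
    """Report a distortion score (PSNR) next to a learned perceptual score (LPIPS).
    sr, hr: NumPy float arrays in [0, 1] with shape (H, W, 3)."""
    psnr = peak_signal_noise_ratio(hr, sr, data_range=1.0)
    loss_fn = lpips.LPIPS(net="alex")
    to_t = lambda x: torch.from_numpy(x).permute(2, 0, 1).unsqueeze(0).float() * 2 - 1  # [0,1] -> [-1,1]
    perceptual = loss_fn(to_t(sr), to_t(hr)).item()
    return {"psnr_db": psnr, "lpips": perceptual}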
MODEL INTERPRETABILITY
The definition of universal quality indexes for EO image downscaling would contribute to robustness against the hallucinations inherent in superresolved images and increase the trust in and interpretability of the proposed SR models. Indeed, generative networks, widely used for image downscaling and thoroughly presented in this review, are able to achieve impressive aesthetic results; however, they are prone to creating hallucinations and/or artifacts. Controlling and quantifying the tradeoff between SR performance vis-à-vis the expected hallucination level remains an open issue. In addition, it may be that a single metric characterizing the overall model performance is not enough, and an additional gridded output with uncertainty estimates should be produced.
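One possible way to obtain such a gridded uncertainty estimate, sketched below under the assumption that the SR network contains dropout layers, is Monte Carlo dropout: several stochastic forward passes are averaged, and the per-pixel standard deviation serves as an uncertainty map.

import torch

@torch.no_grad()
def mc_dropout_uncertainty(model, lr, n_samples=20):
    # Keep dropout stochastic at inference time (note: any batch normalization layers
    # would need to be kept in eval mode for this simple sketch to behave as intended).
    model.train()
    preds = torch.stack([model(lr) for _ in range(n_samples)])
    return preds.mean(dim=0), preds.std(dim=0)   # SR estimate and per-pixel uncertainty map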
Therefore, we consider it critical to develop algorithms that will help both ML practitioners and end users to better understand, interpret, and trust the DL model outputs. Explainable artificial intelligence (xAI) algorithms [129] are essential tools toward an enhanced understanding and transparency of the developed DL models, especially for facilitating the operational uptake of EO image downscaling models.
BENCHMARK DATA SETS
The availability and abundance of RS images has greatly
facilitated the formulation of data sets that satisfy the
needs of complex DL models. Many researchers choose to
directly download RS images from the respective provid-
ers; perform the preprocessing pipeline that best suits their
analysis; and, subsequently, evaluate the model output on
a held-out subset. However, there is an urgent need for spe-
cific, carefully designed benchmark data sets tailored to the
downscaling task, which will help to objectively evaluate
and compare different models, thus gaining more concrete
insight into their generalization and applicability.
MODEL PERFORMANCE
In addition to the points discussed, the adoption of best practices during and after model-building procedures is also necessary. In the former case, ablation studies can be adopted more widely, while, in the latter case, results can be accompanied by some sort of evidence of statistical strength when comparing models. Practices such as these, among others, may lead to more understandable architectures and transparent results as well as less biased and weak inference regarding model performance.
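As a simple example of such statistical evidence, per-image scores of two models on the same test set can be compared with a paired, nonparametric test; the numbers below are purely illustrative:

from scipy.stats import wilcoxon

# Per-image PSNR (dB) of two models on the same held-out set (illustrative values only).
psnr_model_a = [31.2, 29.8, 33.1, 30.4, 28.9, 32.5, 30.1, 29.3]
psnr_model_b = [30.9, 29.9, 32.4, 30.1, 28.7, 32.0, 29.8, 29.5]

stat, p_value = wilcoxon(psnr_model_a, psnr_model_b)
print(f"Wilcoxon statistic = {stat:.2f}, p = {p_value:.3f}")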
OPEN SOURCE CODE AND REPRODUCIBILITY
During our study, we observed a glaring lack of source code availability for the presented methods. This prevents an objective evaluation and hinders quick advancements in the field. Transparency, reproducibility, and testability of the reported results and comparison with novel approaches require publicly accessible source code of the whole pipeline as well as a permissive license of use (e.g., Massachusetts Institute of Technology (MIT), Berkeley Software Distribution (BSD), GNU, and so on). In this way, faster scientific progress can be achieved, which, from a model's perspective, means that it can go up the technology readiness level faster.
To this end, a possible contribution from the authors,
in addition to open source code, would be to explicitly
make reference to the number of trainable parameters of
their models. This information provides intuition to data
scientists. Depending on the problem at hand, the available
data for training, and the computing resources, the model
size provides useful indications for training time and effec-
tiveness, although other factors, such as the use of recursive
architectures, can affect these.
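Reporting this number costs a single line in most frameworks; a PyTorch sketch (with a purely illustrative toy model) is shown below:

import torch.nn as nn

def count_trainable_parameters(model):
    # Number of parameters that receive gradient updates, a rough proxy for model size.
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Example with a small toy model:
model = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(), nn.Conv2d(64, 3, 3, padding=1))
print(count_trainable_parameters(model))   # 1,792 + 1,731 = 3,523 parameters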
BEYOND A SINGLE DEGRADATION SCHEME
When the acquisition of LR–HR image pairs is too expensive or overall impossible, Wald's protocol often comes to the rescue. Even though it offers an outlet for the formulation of an appropriate training data set, LR images are usually constructed with a single degradation algorithm. Consequently, a model trained on such a data set learns to “reverse” this particular degradation scheme and, therefore, may fail to generalize on different degradation/distortion operations. Further study is required for the development of models able to handle diverse types of image distortion that are applicable in real-world scenarios during the sensor capture of an image.
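One straightforward mitigation, sketched below with arbitrary parameter ranges, is to randomize the degradation applied to each training sample rather than committing to a single operator (the image is assumed to be a float array in [0, 1]):

import random
import numpy as np
import cv2   # OpenCV, used here only for blurring and resizing

def random_degrade(hr, scale=4):
    """Synthesize an LR training sample with a randomly chosen degradation instead of a
    single fixed operator, so the model does not learn to invert only one scheme."""
    h, w = hr.shape[:2]
    img = hr.copy()
    if random.random() < 0.7:                          # random Gaussian blur
        sigma = random.uniform(0.5, 3.0)
        img = cv2.GaussianBlur(img, (0, 0), sigma)
    interp = random.choice([cv2.INTER_CUBIC, cv2.INTER_LINEAR, cv2.INTER_AREA])
    lr = cv2.resize(img, (w // scale, h // scale), interpolation=interp)
    if random.random() < 0.5:                          # additive Gaussian noise
        lr = lr + np.random.normal(0, 0.01, lr.shape).astype(lr.dtype)
    return np.clip(lr, 0.0, 1.0)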
MULTIMODAL FUSION
The spectral fusion of images can greatly assist the down-
scaling process (see the “Spatiospectral Fusion” section).
However, apart from captures lying in the visible and in-
frared spectra, new approaches can be investigated for the
fusion of other spectral ranges. For example, radar imaging
can provide complementary information to optical imag-
ing, such as surface topography, and is also able to pene-
trate canopies and clouds/smoke. Therefore, an interesting
topic of study would be the fusion of SAR and optical data for the purpose of downscaling, which, to our knowledge,
has not yet been investigated in the DL field.
GENERATIVE ADVERSARIAL NETWORKS OR ELSE
GANs manage to better approximate the boundary of the perception–distortion plane and achieve more realistic and perceptually convincing results (see the “Metrics” section). Therefore, a further study of the GAN framework is needed to exploit its potential to the full extent. Additionally, an exploration of novel architectures and training schemes may lead to performances even closer to the boundary. For example, recent studies have unveiled the great power of diffusion models, and future research may well establish them as the successors of GANs in the downscaling state of the art.
UNSUPERVISED LEARNING
Acquiring ground-truth HR labels for the training data set is often a time-consuming and expensive task, while, in some cases, it may also be practically infeasible. On the other hand, a synthetic training data set can be developed through Wald's protocol, but this process introduces additional degradation and loss of high-frequency information. To tackle this problem, some studies employ a completely unsupervised learning scheme with specially designed loss functions. Even though these models still struggle to match the performance of their supervised competitors, they tend to preserve high-frequency details and stay faithful to the spectral content of the LR input. Therefore, we believe that unsupervised learning offers a potential outlet for handling the lack of training targets in downscaling, and further research is likely to yield fruitful results.
COMPUTER VISION PARADIGM
The field of general CV has made much more progress on the task of downscaling, and novel architectures and ideas have been introduced recently. We believe that the RS domain could greatly benefit from an adaptation and expansion of these developments. We introduce some of these methods in the “Advancements in Computer Vision” section. However, caution is needed when directly applying such approaches since scaling factors in the RS domain are usually considerably larger and may hinder the model's performance. For example, SR in natural images usually involves a magnification factor much smaller than those in the RS domain (ranging from 2× to 4× compared with 8× to 16×), where texture information is severely distorted, and high-frequency details are almost impossible to retrieve. Therefore, a simple transfer learning approach is not possible, and specialized architectures must be designed when it comes to RS data.
DOWNSCALING SYNTHETIC
APERTURE RADAR IMAGERY
The techniques proposed in the literature for SAR image enhancement are few, and they mostly compare well-established techniques borrowed from CV research on SISR. However, special care is needed to downscale SAR data since they present properties that need to be either taken explicitly into account by tailored model architectures or eliminated beforehand. For example, few authors use fully PolSAR data, and even fewer incorporate the complex-valued nature of SAR data in their models. In addition, preprocessing steps need to be presented in a clearer way, while, in our review, only a number of authors apply SR techniques on data at the same level of preprocessing. This may lead to SAR-unique properties, such as speckle noise and geometric distortions (e.g., foreshortening and layover), affecting the model performance or resulting in misleading outcomes. Therefore, we believe that there is room for significant improvement in SAR imagery SR modeling by focusing on the unique SAR properties and designing proper model architectures, loss functions, and accuracy metrics.
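As a small illustration of one such SAR-specific consideration, the complex-valued nature of single-look complex (SLC) data can be preserved by feeding the real and imaginary parts as separate channels rather than working on amplitude alone (a hypothetical preprocessing sketch, not a method from the cited studies):

import numpy as np
import torch

def slc_to_tensor(slc_patch):
    """Stack the real and imaginary parts of a complex SLC patch as two input channels."""
    real = np.real(slc_patch).astype(np.float32)
    imag = np.imag(slc_patch).astype(np.float32)
    return torch.from_numpy(np.stack([real, imag], axis=0))   # shape: (2, H, W)

patch = (np.random.randn(64, 64) + 1j * np.random.randn(64, 64)).astype(np.complex64)  # dummy SLC patch
x = slc_to_tensor(patch)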
Last but not least, other potential future research orientations could be toward the adaptation of MISR and the expansion of SISR approaches using SAR data acquired from different SAR imaging sensors. This will provide new external information to assist the downscaling process, exploiting different view geometries through incidence angle diversity, radar frequency bands (e.g., the C, X, and L bands), imaging modes (e.g., StripMap, wide swath, spotlight, and so on), and the availability of polarimetric data.
CONCLUSION
In this survey, we offer a detailed overview of the methods available in the literature for the spatial downscaling of RS imagery. We explore the different types of spatial enhancement and introduce a comprehensive taxonomy of the various approaches. Additionally, we conduct a thorough investigation of the most popular metrics and data sets for this task, and we analyze the tradeoff between perception and distortion as a key factor for the selection of an appropriate loss function and training scheme. Finally, we discuss the weaknesses and shortcomings of the current state of the art in the field and briefly present recent advancements in the general CV community as a source of inspiration.
As seen from our analysis, although there is a strong
presence of the DL paradigm in RS, and the publication
rates are ever increasing, there is still plenty of room for im-
provement and exploration. Various facets of the downscal-
ing problem could benefit from new contributions, such
as universal evaluation metrics and model interpretability
algorithms toward xAI, multimodal data sets, innovative
upsampling layers/frameworks, novel training schemes,
original architectures, and many more. Due to the wide range of RS data and applications, there is and will be an incessant need for better, more efficient, and trustworthy DL models. We hope that this survey further stimulates the research community and assists in avoiding common pitfalls in the design, development, and assessment of new DL techniques.
ACKNOWLEDGMENTS
This work received funding from the European Union's Horizon 2020 research and innovation projects 1) DeepCube, under grant agreement 101004188 (Maria Sdraka and Ioannis Papoutsis); 2) NEANIAS, under grant agreement 863448 (Bill Psomas and Konstantinos Karantzalos); and 3) CALLISTO, under grant agreement 101004152 (Konstantinos Vlachos, Konstantinos Ioannidis, Ilias Gialampoukidis, and Stefanos Vrochidis).
AUTHOR INFORMATION
Maria Sdraka (masdra@noa.gr) received her M.Sc. degree
in electrical and computer engineering from the National
Technical University of Athens, Greece, in 2016. She is cur-
rently working toward a Ph.D. degree from the Institute of
Astronomy, Astrophysics, Space Applications, and Remote
Sensing, National Observatory of Athens, Athens, 15236,
Greece. Her research interests include the application of ar-
tificial intelligence techniques on remote sensing data for
earth observation tasks, especially damage assessment of
forest wildfires. She has worked on signal processing, data
fusion, image enhancement and segmentation as well as
change detection through the assistance of deep learning
algorithms.
Ioannis Papoutsis (ipapouts@gmail.com) received his diploma in electrical and computer engineering from the National Technical University of Athens, Greece, in 2002, his M.Sc. degree in technologies for broadband communications from the Department of Electronic and Electrical Engineering, University College London, London, U.K., in 2003, and his Ph.D. degree in remote sensing from the National Technical University of Athens in 2014. In 2019, he was elected associate researcher with the Institute of Astronomy, Astrophysics, Space Applications, and Remote Sensing, National Observatory of Athens, Athens, 15236, Greece, where he leads OrionLab, a research unit of artificial intelligence for big earth observation data. He has been the Operations Manager with the Greek node of European Space Agency (ESA) Hubs that distribute Sentinel data. He has also acted as the Copernicus Emergency Management Services Manager for Risk and Recovery activations. He has participated in and coordinated several research projects funded by the European Commission and ESA. His research interests include the exploitation, management, and processing of big satellite data, and machine learning for knowledge extraction and fusion of multimodal EO data. He is a Member of IEEE.
Bill Psomas (psomasbill@mail.ntua.gr) received his integrated master's degree in rural, surveying and geoinformatics engineering from the National Technical University of Athens, Greece, in 2018. He continued his studies with an M.Sc. degree in data science and information technologies, specializing in big data and artificial intelligence, at the National and Kapodistrian University of Athens, Greece, where he graduated in 2020. He is currently a Ph.D. student at the National Technical University of Athens, Athens, 15780, Greece, working on representation learning. Previously, he worked at Inria Rennes Bretagne-Atlantique, France, and the Athena Research Center, Greece. His research interests lie in the intersection of deep learning with computer vision. He has worked on metric learning, self-supervised learning, and continual learning.
Konstantinos Vlachos (kostasvlachosgrs@iti.gr) re-
ceived his B.Sc. degree in geology from the University of
Patras, Greece, in 2015, where he specialized in quantita-
tive spatial analysis. In 2019, he received his M.Sc. and engi-
neering degrees in applied earth sciences from the Geosci-
ence and Remote Sensing Department at Delft University
of Technology, The Netherlands, where he specialized in
the fusion of multi-sensor satellite data using machine/
deep learning for sea level estimation—funded by Deltares,
an Institute for Applied Research, The Netherlands. Since
2021, he has been a research associate at the Information
Technologies Institute, Center for Research and Technology Hellas, Thessaloniki, Thessaloniki, 57001, Greece, and a member of the Multimodal Data Fusion and Analytics (M4D) Group of the Multimedia Knowledge and Social Media Analytics Lab. His current research interests lie in the interdisciplinary domains of Earth science, Earth observation, and artificial intelligence, focusing on spatiotemporal analysis and fusion for downscaling and change detection.
Konstantinos Ioannidis (kioannid@iti.gr) received his diploma and Ph.D. degrees from the Department of Electrical and Computer Engineering, Democritus University of Thrace, Greece, in 2006 and 2013, respectively. Currently, he is a senior researcher at the Information Technologies Institute, Center for Research and Technology Hellas, Thessaloniki, Thessaloniki, 57001, Greece. He is a member of the Multimodal Data Fusion and Analytics (M4D) Group of the Multimedia Knowledge and Social Media Analytics Lab. His research interests mainly include path planning, collective behavior in swarm robotics, autonomous navigation and formation control, as well as a variety of computer vision techniques (object detection, 3D representation, aerial imagery, image enhancement, photogrammetry, SLAM, and many others) using both learning-based (machine and deep learning) models and fundamental approaches.
Konstantinos Karantzalos (karank@central.ntua.gr)
received his diploma degree in engineering from the Na-
tional Technical University of Athens, Greece, in 2000, and
his Ph.D. degree from the National Technical University of
Athens in collaboration with Ecole Nationale de Ponts et
Chaussees, Champs-sur-Marne, France, in 2007. In 2007,
he joined the Department of Applied Mathematics, Ecole Centrale de Paris, Gif-sur-Yvette, France, as a postdoc. He is
an associate professor of remote sensing with the National
Technical University of Athens, Athens, 15780, Greece.
His teaching and research interests include geoscience and
earth observation, geospatial data analytics, spectral data
analysis, and machine learning with applications in, e.g.,
environmental monitoring and precision agriculture. He
has several publications in top-rank international journals
and conferences and a number of awards and honors for his
research contributions. He serves on the board of directors
of the Greek Space Center. He is a Senior Member of IEEE.
Ilias Gialampoukidis (heliasgj@iti.gr) received his bach-
elor’s degree in mathematics and his M.Sc. degree in
statistics and modeling from the Aristotle University of
Thessaloniki, Greece. He also received a Ph.D. degree in
mathematics, with a special interest in applied mathemat-
ics, time series analysis, stochastic modelling, and network
analytics. He is a senior postdoctoral researcher at the In-
formation Technologies Institute, Center for Research and
Technology Hellas, Thessaloniki, Thessaloniki, 57001,
Greece. He has extensive experience in EC-funded research
projects through work package leaderships and critical
roles in several projects. His research interests involve mul-
timodal information retrieval, Earth observation, big data
analytics, multimodal fusion, supervised (deep) and unsu-
pervised learning, and social media mining and network
analytics. He has coauthored more than 60 publications in
international journals and conferences.
Stefanos Vrochidis (stefanos@iti.gr) received his di-
ploma degree in electrical engineering from Aristotle Uni-
versity of Thessaloniki, Greece, his M.Sc. degree in radio
frequency communication systems from the University of
Southampton, and his Ph.D. degree in electronic engineer-
ing from Queen Mary University of London, U.K. Currently,
he is a senior researcher (grade C) at the Information Tech-
nologies Institute, Center for Research and Technology Hel-
las, Thessaloniki, Thessaloniki, 57001, Greece, and the head
of the Multimodal Data Fusion and Analytics (M4D) Group
of the Multimedia Knowledge and Social Media Analyt-
ics Lab. His research interests include multimedia analysis
and retrieval, multimodal fusion, computer vision, multi-
modal analytics, and artificial intelligence, as well as media and arts, and environmental and security applications. He has participated in more than 50 European and National projects (in more than 15 as project coordinator, scientific or technical manager) and has been a member of the organi-
zation team of several conferences and workshops relevant
to the aforementioned research areas. He has edited three
books and authored more than 250 related scientific jour-
nal, conference and book chapter publications.
REFERENCES
[1] P. Ghamisi et al., “Multisource and multitemporal data fusion
in remote sensing: A comprehensive review of the state of the
art,” IEEE Geosci. Remote Sens. Mag. (replaces Newslett.), vol. 7, no.
1, pp. 6–39, Mar. 2019, doi: 10.1109/MGRS. 2018 .2 890023.
[2] B. Chen, J. Li, and Y. Jin, “Deep learning for feature-level data
fusion: Higher resolution reconstruction of histor ical landsat
archive,” Remote Sens., vol. 13, no. 2, p. 167, Ja n. 2021, doi:
10. 3390/rs13 020167.
[3] A. O. Onojeghuo, G. A. Blackburn, Q. Wang, P. M. Atkinson,
D. Kindred, and Y. Miao, “Mapping paddy rice f ields by apply-
ing machine learning algor ithms to multi-temporal Sentinel-
1A and Landsat data,” Int. J. Remote Sens., vol. 39, no. 4, pp.
10421067, Feb. 2018, doi: 10.108 0/01431161.2017.1395969.
[4] Y. Zhang, P. M. Atkinson, X. Li, F. Ling, Q. Wang, and Y. Du,
Learning-based spatial–temporal superresolution mapping
of forest cover with MODIS images,” IEEE Trans. Geosci. Re-
mote Sens., vol. 55, no. 1, pp. 600614, Jan . 2017, doi: 10.1109/
TGRS.2016.2613140.
[5] Y. Feng, D. Lu, E. Moran, L. Dutra, M. Calvi, and M. de Olivei-
ra, “Examining spatial distribution and dynamic change of
urban land covers in the Brazilian Amazon using multitem-
poral multisensor hig h spatial resolution satellite imager y,”
Remote Sens., vol. 9, no. 4, p. 381, A pr. 2017, doi: 10. 339 0/
rs9040381.
[6] A. Y. Sun and G. Tang, “Downscaling satellite and reanalysis precipitation products using attention-based deep convolutional neural nets,” Front. Water, vol. 2, p. 536,743, Nov. 2020, doi: 10.3389/frwa.2020.536743.
[7] I. K. Lee, J. C. Trinder, and A. Sowmya, “Application of U-net convolutional neural network to bushfire monitoring in Australia with Sentinel-1/-2 data,” ISPRS – Int. Arch. Photogram., Remote Sens. Spatial Inf. Sci., vol. XLIII-B1-2020, pp. 573–578, Aug. 2020, doi: 10.5194/isprs-archives-XLIII-B1-2020-573-2020.
[8] M. M. Pinto, R. Libonati, R. M. Trigo, I. F. Trigo, and C. C. Da-
Camara, “A deep learning approach for mapping and dating
burned areas using temporal sequences of satellite images,” IS-
PRS J. Photogram. Remote Sens., vol. 16 0, pp. 260274, Feb. 2020,
doi: 10.1016/j.i sprsjprs.2019.12.014.
[9] D. Garcia et al., “Pix 2St reams: Dy namic hydrolog y maps from
satellite-LiDAR fusion,” Nov. 2020, arXi v: 2 011. 07584.
[10] W. Yang, X. Zhang, Y. Tian, W. Wa ng , J.-H. Xue, and Q. Liao,
Deep learning for single image super-resolution: A brief re-
view,” IEEE Trans. Multimedia, vol. 21, no. 12, pp. 31063121,
Dec. 2019, doi: 10.1109/ TMM.2019. 2919431.
[11] J. J. Danker Khoo, K. H. Lim, and J. T. Sien Phang, “A review on
deep learning super resolution techniques,” in Proc. IEEE 8th
Conf. Syst., Process Control (ICSPC), Dec. 2020, pp. 134139, doi:
10.1109/ IC SPC50992. 2020.9305806.
[12] H. Chen, X. He, L. Qing , Y. Wu, C. Ren, and C. Zhu, “Real-world
single image super-resolution: A brief review,” Mar. 2021. [On-
line]. Available: http://arxiv.org /abs/2103.02368
[13] S. M. A. Bashir, Y. Wang, and M. Khan, “A comprehensive re-
view of deep learning-based single image super-resolution,”
Feb. 2021, a rXiv : 2102.09351.
[14] K. Nasrollahi and T. B. Moeslund, “Super-resolution: A compre-
hensive sur vey,” Mach. Vis. Appl., vol. 25, no. 6, pp. 142 31468,
Au g. 2014, doi: 10.10 07/s00138- 014 - 0623- 4.
[15] H.-I. Kim and S. B. Yoo, “Trends in super-high-definition imaging techniques based on deep neural networks,” Mathematics, vol. 8, no. 11, p. 1907, Oct. 2020, doi: 10.3390/math8111907.
[16] Z. Wang, J. Chen, and S. C. H. Hoi, “Deep learning for image
super-resolution: A survey,” IEEE Trans. Pattern Anal. Mach. In-
tell., vol. 43, no. 10, pp. 3365 3387, Oct. 2021, doi: 10.1109/
TPAMI.2020.2982166.
[17] F. Dadrass Javan, F. Samadzadegan, S. Mehravar, A. Toosi, R.
Khatami, and A. Stein, “A review of image fusion techniques
for pan-sharpening of high-resolution satellite imager y,” ISPRS
J. Photogram. Remote Sens., vol. 171, pp. 101117, Jan. 2021, doi:
10.1016/j.isprsjprs.2020.11.001.
[18] G. Kaur, K. S. Saini, D. Singh, and M. Kaur, “A comprehensive
study on computational panshar pening techniques for remote
sensing images,” Arch. Comput. Methods Eng., vol. 28, no. 7, Feb.
2021, doi: 10.100 7/s11831- 021-0 9565-y.
[19] X. Meng, H. Shen, H. Li, L. Zhang, and R. Fu, “Review of the
pansharpening methods for remote sensing images based on
the idea of meta-analysis: Practical discussion and challeng-
es,” Inf. Fusion, vol. 46, pp. 102113, Mar. 2019, doi: 10.1016/j.
inff us.2018.05.006.
[20] R. Fernandez-Beltran, P. Latorre-Carmona, and F. Pla, “Sin-
gle-frame super-resolution in remote sensing: A practical
overview,” Int. J. Remote Sens., vol. 38, no. 1, pp. 314 354, Jan.
2017, doi: 10.10 80/01431161.2016.1264027.
[21] Q. Yuan et al., “Deep learning in environmental remote sens-
ing: Ac hievements and challenges,” Remote Sens. Environ., vol.
241, p. 111,716, May 2020, doi: 10.1016/ j. r s e.2 020.111716 .
[22] X . X. Zhu et al., “Deep lear ning in remote sensing: A compre-
hensive re view and list of resources,” IEEE Geosci. Remote Sens.
Mag. (replaces Newslett.), vol. 5, no. 4, pp. 8–36, Dec. 2017, doi:
10.1109/MGRS.2017.2762307.
[23] L. Ma, Y. Liu, X. Zhang, Y. Ye, G. Yin, and B. A. Johnson, “Deep
learning in remote sensing applications: A meta-analysis and
review,” ISPRS J. Photogram. Remote Sens., vol. 152, pp. 166 177,
Jun. 2019, doi: 10.1016/j.isprsjprs.2019.04.015.
[24] X. Zhu, F. Cai, J. Tian, and T. Williams, “Spatiotemporal fu-
sion of multisource remote sensing data: Literature sur vey,
taxonomy, principles, applications, and future direct ions,”
Remote Sens., vol. 10, no. 4, p. 527, Mar. 2018, doi: 10.3390/
rs10040527.
[25] G. Tsagkata kis, A. Aidini, K. Fotiadou, M. Giannopoulos, A.
Pentari, and P. Tsakalides, “Survey of deep-learning approaches
for remote sensing observation enhancement,” Sensors, vol. 19,
no. 18, p. 3929, Sep. 2019, doi: 10.3390/s19183929.
[26] “
Web of Science.” Accessed: Sep. 9, 2021. [Online]. Available: https://webofknowledge.com
[27] A. W. Wood, L. R. Leung, V. Sridhar, and D. P. Lettenmaier, “Hydrologic implications of dynamical and statistical approaches to downscaling climate model outputs,” Climatic Change, vol. 62, nos. 1–3, pp. 189–216, Jan. 2004, doi: 10.1023/B:CLIM.0000013685.99609.9e.
[28] P. M . Atkinson, “Downscaling in remote sensing,” Int. J. Appl.
Earth Observ. Geoinf., vol. 22, pp. 106 114, Jun. 2013, doi:
10.1016/j.jag.2012.04.012.
[29] W. Sun and Z. Chen, “Learned image dow nscaling for upscal-
ing using content adaptive resampler,” IEEE Trans. Image Pro-
cess., vol. 29, pp. 40274040, Feb. 2020, doi: 10.1109/ T IP.2020.
2970248.
[30] W. Zhan et al., “Disaggregation of remotely sensed land surface
temperature: Literature sur vey, ta xonomy, issues, and caveats,”
Remote Sens. Environ., vol. 1 31, pp. 119139, Apr. 2013, doi:
10.1016/j.r se.2012.12 .014.
[31] Y. Li, H. Zhang, X. Xue, Y. Jiang, and Q. Shen, “Deep learning
for remote sensing image classif icat ion: A surve y,” Wiley Inter-
disciplinary Rev.: Data Mining Knowledge Discovery, vol. 8, no. 6,
Nov. 2018. [Online]. Available: http s://onli nelibr a r y.w i ley. com/
doi /ab s/10.1002/w id m.126 4
[32] G. Cheng, X. Xie, J. Han, L. Guo, and G.-S. Xia, “Remote sens-
ing image scene classification meets deep lear ning: Challenges,
methods, benchmarks, and oppor tunities,” Jun. 2020, arXiv:
2005.010 94 .
[33] K. Li, G. Wan, G. Cheng, L. Meng, and J. Han, “Object detection
in optical remote sensing images: A sur vey and a new bench-
mark,” ISPRS J. Photogram. Remote Sens., vol. 159, pp. 296307,
Jan. 2020, doi: 10.1016/j.i sprsjprs.2019.11.02 3.
[34] W. Ma, J. Zhang, Y. Wu, L. Jiao, H. Zhu, and W. Zhao, “A nov-
el two-step registration method for remote sensing images
based on deep and local features,” IEEE Trans. Geosci. Remote
Sens., vol. 57, no. 7, pp. 48344843, Jul. 2019, doi: 10.1109/
TGRS.2019.2893310.
[35] N. Merkle, W. Luo, S. Auer, R. Müller, and R. Urtasun, “Exploiting deep matching and SAR data for the geo-localization accuracy improvement of optical satellite images,” Remote Sens., vol. 9, no. 6, p. 586, Jun. 2017, doi: 10.3390/rs9060586.
[36] L. Wald, T. Ranchin, and M. Mangolini, “Fusion of satellite
images of different spatial resolutions: Assessing the qualit y
of resulting images,” Photogram. Eng. Remote Sens., vol. 63, no.
6, pp. 691699, 1997. [Online]. Available: https://hal.archives
-ouvertes.fr/hal- 00365304
[37] Z. Wang, A. Bovik, H. Sheikh, and E. Simoncelli, “Image quality
assessment: from er ror visibility to structural similar ity,” IEEE
Trans. Image Process., vol. 13, no. 4, pp. 600612, Apr. 2004, doi:
10.1109/ T I P.20 03.819861.
[38] C. R. Helmrich, S. Bosse, M. Siekmann, H. Schwarz, D.
Marpe, and T. Wiegand, “Perceptually optimized bit-alloca-
tion and associated distortion measure for bloc k-based im-
age or video coding ,” in Proc. Data Compress. Conf. (DCC),
Snowbird, U T, USA , Mar. 2019, pp. 172181, doi: 10.1109/
DCC.2019.00025.
[39] Z. Wa ng and A. Bovik, “A universal image qualit y inde x,” IEEE
Signal Process. Lett., vol. 9, no. 3, pp. 8184, Mar. 2002, doi:
10.1109/97.995823.
[40] Z. Wang, E. Simoncelli, and A. Bovik, “Multiscale structural
similarit y for image quality assessment,” in Proc. 37th Asilomar
Conf. Signals, Syst. Comput., 2003, pp. 13981402. [On line].
Available: http://ieeexplore.ieee.org/document/1292216/
[41] H. Sheikh, A. Bovik, and G. de Veciana , “An infor mation fidel-
ity criterion for image quality assessment using natural scene
statistics,” IEEE Trans. Image Process., vol. 14, no. 12, pp. 2117
2128, Dec. 2005, doi: 10.1109/TIP.2005.859389.
[42] H. Sheikh and A. Bovik, “Image information and visual qual-
ity,” IEEE Trans. Image Process., vol. 15, no. 2, pp. 430444, Feb.
2006, doi: 10 .1109/ T I P.20 05.859378.
[43] N. Damera-Venkata, T. Kite, W. Geisler, B. Evans, and A. Bo-
vik, “Image quality assessment based on a degradation model,”
IEEE Trans. Image Process., vol. 9, no. 4, pp. 636650, Apr. 2000,
doi: 10.1109/83. 84194 0.
[44] L. Zhang, L. Zhang, X. Mou, and D. Zhang, “FSIM: A feature
similarit y inde x for image quality assessment,” IEEE Trans.
Image Process., vol. 20, no. 8, pp. 23782386, Aug . 2011, doi:
10.1109/ T I P.2011. 2109730.
[45] A. Liu, W. Lin, and M. Narwaria, “Image quality assess-
ment based on gradient similarity,” IEEE Trans. Image Pro-
cess., vol. 21, no. 4, pp. 15001512 , Apr. 2012, doi: 10.1109/
TI P.2011.2175935.
[46] R. H. Yuhas, A. F. Goetz, and J. W. Boardman, “Discrimination among semi-arid landscape endmembers using the spectral angle mapper (SAM) algorithm,” in Proc. 3rd Annu. JPL Airborne Earth Sci. Workshop, Pasadena, CA, USA, Jun. 1992. [Online]. Available: https://aviris.jpl.nasa.gov/proceedings/workshops/92_docs/52.PDF
[47] L. Wald, “Quality of high resolution synthesised images: Is
there a simple criterion?” in Proc. 3rd Conf. Fusion Earth Data:
Merging Point Meas., Raster Maps Remotely Sensed Images, Jan.
2000, pp. 99103.
[48] D. M. Chandler, “Most apparent distortion: Full-refer-
ence image quality assessment and t he role of strateg y,”
J. Electron. Imaging, vol. 19, no. 1, p. 11,006 , Jan. 2010, doi:
10.1117/1.3267105.
[49] J. Johnson, A. Alahi, and L. Fei-Fei, “Perceptual losses for re-
al-time style transfer and super-resolution,” M a r. 2016, arXiv:
1603.08155.
[50] A. Mittal, A. K. Moorthy, and A. C. Bovik, “No-reference im-
age quality assessment in the spatial domain,” IEEE Trans. Image
Process., vol. 21, no. 12, pp. 46954708, Dec. 2012, doi: 10.1109/
TIP.2012.2214050.
[51] A. Mittal, R. Soundararajan, and A. C. Bovik, “Making a com-
pletely blind image qualit y analyzer,” IEEE Signal Process.
Lett., vol. 20, no. 3, pp. 209212, Mar. 2013, doi: 10.1109/
LSP.2012.2227726.
[52] N. Venkatanath, D. Praneeth, B. Maruthi Chandrasekhar, S.
S. Channappayya, and S. S. Medasani, “Blind image qual-
ity evaluation using perception-based features,” in Proc. 21st
Nat. Conf. Commun. (NCC), Feb. 2015, pp. 1–6, doi: 10.1109/
NCC.2015.7084843.
[53] C. Ma, C.-Y. Yang , X. Yang, and M.-H. Yang, “Learning a no-ref-
erence qualit y metric for single-image super-resolution,” Dec.
2016 , arXi v:1612.05890.
[54] Y. Blau, R. Mechrez, R. Timofte, T. Michaeli, and L. Zelnik-
Manor, “The 2018 PIRM c hallenge on perceptual image super-
resolution,” Jan . 2019, arXiv : 18 09.07 517.
[55] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang,
The unreasonable effectiveness of deep features as a percep-
tual metric,” Apr. 2018, a rX iv: 1801.03924.
[56] L. Alparone, B. Aiazzi, S. Baronti, A. Garzelli, F. Nencini, and
M. Selva, “Multispectral and panchromatic data fusion assess-
ment without reference,” Photogram. Eng. Remote Sens., vol. 74 ,
no. 2, pp. 193200, Feb. 2008, doi: 10.14358/PER S .74. 2.193.
[57] Y. Blau and T. Michaeli, “ The perception-distortion tradeoff,”
in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun . 2018,
arXiv: 1711.06077.
[58] I. J. Goodfellow et al., “Generative adversarial nets,” in Proc.
27th Int. Conf. Neural Inf. Process. Syst. – vol. 2 (NIPS’ 14). Cam-
bridge, M A, USA: MIT Press, 2014, pp. 26722680.
[59] M. D. Zeiler, D. Krishnan, G. W. Taylor, and R. Fergus, “Decon-
volutional networks,” in Proc. IEEE Comput. Soc. Conf. Comput.
Vis. Pattern Recognit., Jun. 2010, pp. 2528 2535.
[60] A. Odena, V. Dumoulin, and C. Olah, “Deconvolution and
checkerboard artifacts,” Distill, vol. 1, no. 10, 2016, doi:
10.23915/distill.00003.
[61] W. Shi et al., “Real-time single image and v ideo super-resolution
using an eff icient sub-pixel convolutional neural net work,”
Se p. 2016, arXiv: 1609.05158.
[62] W. Shi et al., “Is the deconvolut ion layer the same as a convolu-
tional layer?Se p. 2016, arXiv: 1609.07009.
[63] K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang, “Beyond a
Gaussian denoiser: Residual learning of deep CNN for image
denoising,” IEEE Trans. Image Process., vol. 26, no. 7, pp. 3142
3155, Ju l. 2017, doi: 10.1109/TIP.2017.2662206.
[64] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for
image recognition,” Dec. 2015, a rXiv: 1512.03385.
[65] P. Burt and E. Adelson, “The Laplacian pyramid as a compact
image code,” IEEE Trans. Commun., vol. 31, no. 4, pp. 532540,
Apr. 1983, doi: 10.1109/T C OM .1983.1095851.
[66] J. Hu, L. Shen, S. Albanie, G. Sun, and E. Wu, “Squeeze-and-
excitation networks,” May 2019, a rXi v : 1709. 0150 7.
[67] Q. Wang, B. Wu, P. Zhu, P. Li, W. Zuo, and Q. Hu, “ECA-Net:
Eff icient channel attention for deep convolutional neural net-
works,” Apr. 2020, arXiv: 1910.03151.
[68] Q. Hou, D. Zhou, and J. Feng, “Coordinate attention for effi-
cient mobile network design,” Mar. 2021, arXiv: 2103.02907.
[69] J. Park, S. Woo, J.-Y. Lee, and I. S. Kweon, “BAM: Bottleneck at-
tention module,” Jul. 2018, arXiv: 1807.06514.
[70] S. Woo, J. Park, J.-Y. Lee, and I. S. Kweon, “CBAM: Convolu-
tional block attention module,” Jul. 2018, arXiv: 1807.06521.
[71] D. Misra, T. Nalamada, A. U. Arasanipalai, and Q. Hou, “Rotate
to attend: Convolutional triplet attention module,” Nov. 2020,
arXiv: 2010.03045.
[72] H. Zhu, C. Xie, Y. Fei, and H. Tao, “Attention mechanisms in
CNN-based single image super-resolution: A brief review and a
new perspective,” Electronics, vol. 10, no. 10, p. 1187, May 2021,
doi: 10.3390/electron ic s10101187.
[73] C. Dong, C. C. Loy, K. He, and X. Tang , “Image super-reso-
lution using deep convolut ional network s,” Jul. 2015, arXiv:
1501.00092.
[74] J. Kim, J. K. Lee, and K. M. Lee, “Accurate image super-resolu-
tion using very deep convolutional networks,” Nov. 2 016, arXiv:
1511. 045 87.
[75] K. Simonyan and A. Zisserman, “Very deep convolutional
networks for large-scale image recognition,” Apr. 2015, arXiv:
140 9.1556 .
[76] W.-S . Lai, J.-B. Huang, N. A huja, and M.-H. Ya ng, “Deep L apla-
cian pyramid networks for fast and accurate super-resolution,”
Oc t . 2017, a rX iv: 1704. 03915.
[77] C. Ledig et al., “Photo-realistic single image super-resolution
using a generative adversar ial network,” May 2017, arXi v:
1609.04802.
[78] X. Wa ng et al., “ESRGAN: Enhanced super-resolution generative
adversarial net works,” Sep. 2018, arXiv: 1809.00219.
[79] A. Jolicoeur-Martineau, “The relativistic discriminator: A key ele-
ment missing from standard GAN,” Se p. 2018, arXiv: 1807.00734.
[80] Xintao, “xinntao/ESRGAN,” Aug. 31, 2018. [Online]. Available: https://github.com/xinntao/ESRGAN
[81] B. Lim, S. Son, H. Kim, S. Nah, and K. M. Lee, “Enhanced deep
residual networks for single image super-resolution,” Jul. 2017,
arXiv: 1707.02921.
[82] J. Yu et al., “Wide activat ion for efficient and accurate image
super-resolution,” De c. 2018, arXiv : 180 8.0 8718.
[83] Y. Zhang, Y. Tian, Y. Kong, B. Zhong, and Y. Fu, “Residual dense
network for image super-resolut ion,” Ma r. 2018, arXi v: 18 02.
08797.
[84] Y. Tai , J. Ya ng, and X. Liu, “Image super-resolution via deep re-
cursive residual net work,” in Proc. IEEE Conf. Comput. Vis. Pat-
tern Recognit. (CVPR), Ju l . 2 017, pp. 27902798, doi: 10.1109/
CVPR.2017.298.
[85] M. Haris, G. Shakhnarovic h, and N. Ukita, “Deep bac k-
projection networks for super-resolution,” Ma r. 2018 , arXiv:
1803.02735.
[86] Z. Li, J. Yang, Z. Liu, X. Yang, G. Jeon, and W. Wu, “Feedback net-
work for image super-resolution,” Jun. 2019, arX iv: 190 3.0 9814.
[87] J. Kim, J. K. Lee, and K. M. Lee, “Deeply-recursive convolu-
tional network for image super-resolution,” N ov. 2016, arXiv:
1511. 04491.
[88] W.- S . Lai, J.-B. Huang, N. A huja, and M.-H. Yan g, “Fast and ac-
curate image super-resolution with deep laplacian pyramid net-
works,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 41, no. 11, pp.
25992613, Nov. 2019, doi: 10.1109/ T PA M I.2018.2865304.
[89] Y. Zhang, K. Li, K. Li, L. Wang, B. Zhong, and Y. Fu, “Image
super-resolution using ver y deep residual channel attention
networks,” Jul. 2018, arX i v : 18 07.027 5 8 .
[90] D. Lei, H. Chen, L. Zhang, and W. Li, “NLRnet: An efficient
nonlocal attention ResNet for pansharpening,” IEEE Trans.
Geosci. Remote Sens., vol. 60, pp. 1–13, Ma r. 2021, doi: 10.1109/
TGRS.2021.3067097.
[91] C. Shang et al., “Spatiotemporal reflectance fusion using a gen-
erative adversarial network,” IEEE Trans. Geosci. Remote Sens.,
vol. 60, pp. 115, M ar. 2021, doi: 10.1109/ T GRS. 2021.3065418.
[92] Y. Yu, X. Li, and F. Liu, “E-DBPN: Enhanced deep back-projec-
tion networks for remote sensing scene image superresolut ion,”
IEEE Trans. Geosci. Remote Sens., vol. 58, no. 8, pp. 55035515,
Aug. 2020, doi: 10.1109/ T GRS. 2020.2966 669.
[93] “
Sentinel-2 – Overview.” [Online]. Available: https://sentinel.esa.int/web/sentinel/missions/sentinel-2/overview
[94] C. Lanaras, J. Bioucas-Dias, S. Galliani, E. Baltsav ias, and K.
Schindler, “Super-resolution of Sentinel-2 images: Learning a
globally applicable deep neural network,” ISPRS J. Photogram.
Remote Sens., vol. 146, pp. 305319, De c. 2018, doi: 10.1016/j.
isprsjprs.2018.09.018.
[95] F. Palsson, J. Sveinsson, and M. Ulfarsson, “Sentinel-2 image
fusion using a deep residual network,” Remote Sens., vol. 10, no.
8, p. 1290, Au g. 2018 , doi: 10.3390/rs100 81290.
[96] M. Gargiulo, A. Mazza, R. Gaetano, G. Ruello, and G. Scarpa,
Fast super-resolution of 20 m Sentinel-2 bands using convolu-
tional neural networks,” Remote Sens., vol. 11, no. 22, p. 2635,
Nov. 2019, doi: 10.3390/rs11222635.
[97] J. Wu, Z. He, and J. Hu, “Sentinel-2 sharpening via parallel re-
sidual network,” Remote Sens., vol. 12, no. 2, p. 279, Jan. 2020,
doi: 10.3390/rs12020279.
[98] X. Luo, X. Tong , and Z. Hu, “Improving satellite image fu-
sion via generative adversarial training ,” IEEE Trans. Geosci.
Remote Sens., vol. 59, no. 8, pp. 114, 2020, doi: 10.1109/
TGRS.2020.3025821.
[99] C. Li and M. Wand, “Precomputed real-time texture synthesis
with markov ian generat ive adversarial networks,” Apr. 2016,
arXiv: 1604.04382.
[10 0] H. V. Nguyen, M. O. Ulfarsson, J. R. Sveinsson, and M. D.
Mura, “Sentinel-2 shar pening using a single unsupervised
convolutional neural net work with MT F-based degradation
model,” IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens.,
vol. 14, pp. 68826896, Jun. 2021, doi: 10.1109/JSTA R S.2021.
3092286.
[101] M. Ciotola, M. Ragosta, G. Poggi, and G. Scarpa, “A full-res-
olution training framework for Sentinel-2 image fusion,” in
Proc. IEEE Int. Geosci. Remote Sens. Symp.(IGARSS), Jul. 2021, pp.
12601263, doi: 10.1109/IG A R SS47720.2021.9553199.
[102] Z. Shao, J. Cai, P. Fu, L. Hu, and T. Liu, “Deep learning-based
fusion of L andsat-8 and Sentinel-2 images for a harmoni zed
surface ref lectance product,” Remote Sens. Environ., vol. 235, p.
111,42 5, Dec. 2019, doi: 10.1016/j.rse .2019.111425.
[103] “Landsat 8,” NASA, Washington, DC, USA. [Online]. Available: https://landsat.gsfc.nasa.gov/landsat-8/landsat-8-overview
[104] R. Dong, L. Zhang, and H. Fu, “R RSGA N: Reference-based
super-resolution for remote sensing image,” IEEE Trans. Geos-
ci. Remote Sens., vol. 60, pp. 1–17, Jan. 2021, doi: 10.1109/
TGRS.2020.3046045.
[105] J. Dai et al., “Deformable convolutional networks,” Ju n . 2017,
arXiv: 1703.06211.
[106] G. Masi, D. Cozzolino, L. Verdoliva, and G. Scarpa, “Panshar p-
ening by convolutional neural net works,” Remote Sens., vol. 8,
no. 7, p, 594, 2016, doi: 10.3390/rs8070594.
[107] J. Yang, J. Wrig ht, T. S. Huang, and Y. Ma, “Image super-resolu-
tion vi a sparse repres entation,” IEEE Trans. Image Process., vol. 19,
no. 11, pp. 28 612873, 2010, doi: 10.1109/ T I P.2010.2050625.
[108] Y. Wei, Q. Yuan, H. Shen, and L. Zhang, “Boosting the accura-
cy of mult i-spectral image pan-sharpening by learning a deep
residual network,” IEEE Geosci. Remote Sens. Lett., vol. 14 ,
no. 10, pp. 17951799, Oct . 2017, doi: 10.1109/LGR S.2017.
2736020.
[109] J. Yang , X. Fu, Y. Hu, Y. Huang, X. Ding, and J. Paisley, “PanNet:
A deep net work architecture for pan-sharpening,” in Proc. IEEE
Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 1753 1761, doi:
10.1109/ IC C V.2017.193.
[110] G. Scarpa, S. Vitale, and D. Cozzolino, “Target-adaptive
CNN-based pansharpening,” IEEE Trans. Geosci. Remote Sens.,
vol. 56, no. 9, pp. 54435457, 2018, doi: 10.1109/TG R S.2018.
2817393.
[111] Y. Xing, M. Wang, S. Ya ng, a nd L. Jiao, “Pan-sharpening via deep
metric learning,” ISPRS J. Photogram. Remote Sens., vol. 145, pp.
165183, Nov. 2018 , doi: 10.1016/j.isprsjprs.2018.01.016.
[112] Q. Yuan, Y. Wei, X. Meng, H. Shen, and L. Zhang, “A multi-
scale and multidepth convolutional neural network for remote
sensing imagery pan-sharpening,” IEEE J. Sel. Topics Appl. Earth
Observ. Remote Sens., vol. 11, no. 3, pp. 978989, 2018, doi:
10.1109/ JST A R S.2018.27948 88 .
[113] L. He et al., “Pansharpening via detail injec tion based convo-
lutional neural networks,” IEEE J. Sel. Topics Appl. Ear th Observ.
Remote Sens., vol. 12, no. 4, pp. 1188 1204, 2019, doi: 10.1109/
JSTARS.2019.2898574.
[114] S. Luo, S. Zhou, Y. Feng, and J. Xie, “Panshar pening via un-
super vised convolut ional neural networks,” IEEE J. Sel. Topics
Appl. Earth Observ. Remote Sens., vol. 13, pp. 4295 4310, Jul.
2020, doi: 10.1109/J STA R S .2020.30 08047.
[115] L. Liu et al., “Shallow–deep convolutional network and spec-
tral-discrimination-based detail injection for mult ispectral
imagery pan-sharpening,” IEEE J. Sel. Topics Appl. Earth Observ.
Remote Sens., vol. 13, pp. 17721783, Mar. 2020, doi: 10.1109/
JSTARS.2020.2981695.
Authorized licensed use limited to: National Technical University of Athens (NTUA). Downloaded on June 12,2022 at 13:02:30 UTC from IEEE Xplore. Restrictions apply.
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.
IEEE GEOSCIENCE AND R EMOTE SENSING MAGAZINE MONTH 2022
50
[116] L.-J. Deng , G. Vivone, C. Jin, and J. Chanussot, “Detail injec-
tion-based deep convolutional neural networks for pansharp-
ening,” IEEE Trans. Geosci. Remote Sens., vol. 59, no. 8, pp. 1–16,
2020, doi: 10.1109/TG R S.2020.3 031366.
[117] J. Cai and B. Huang, “Super-resolution-guided progressive pan-
sharpening based on a deep convolutional neural network,”
IEEE Trans. Geosci. Remote Sens., vol. 59, no. 6, pp. 520 65220,
2021, doi: 10.1109/ TGRS .2020.3015878.
[118] W. Dong, T. Zhang, J. Qu, S. Xiao, J. Liang, and Y. Li, “Lapla-
cian pyramid dense network for hyperspect ral pansharpening ,”
IEEE Trans. Geosci. Remote Sens., vol. 60, pp. 1–13, M ay 2021,
doi: 10.1109/ T GRS. 2021.3076768 .
[119] M. Jiang, H. Shen, J. Li, Q. Yua n, and L. Zhang, “A differen-
tial information residual convolutional neural network for
pansharpening,” ISPRS J. Photogram. Remote Sens., vol. 163, pp.
257271, May 2020, doi: 10.1016/j.isprsjprs.2020.03.006.
[120] Y. Qu, R. K. Baghbaderani, H. Qi, and C. Kwan, “Unsupervised
pansharpening based on self-attention mechanism,” IEEE
Trans. Geosci. Remote Sens., vol. 59, no. 4, pp. 31923208, 2021,
doi: 10.1109/ T GRS. 2020.300 9207.
[121] A. Vaswani et al., “Attention is all you need,” in Proc. Adv. Neural
Inf. Process. Syst. (NIPS 2017), 2017.
[12 2] T. Dai, J. Cai, Y. Zhang , S .-T. Xia, and L. Zhang, “Second-order
attention network for single image super-resolution,” in Proc.
IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2019, pp.
11,05711,0 66, doi: 10.1109/CVPR .2019.01132.
[12 3] H. Zhang and J. Ma, “GTP-PNET: A residual learning network
based on gradient transformation prior for pansharpening,” IS-
PRS J. Photogram. Remote Sens., vol. 172, pp. 223239, 2021, doi:
10.1016/j.isprsjprs.2020.12.014.
[124] H. Yin, “PSCSC-Net: A deep coupled convolutional sparse
coding network for pansharpening,” IEEE Trans. Geosci. Remote
Sens., vol. 60, pp. 1–16, Jun. 2021, doi: 10.1109/ TGRS .2021.
3088 313 .
[12 5] Z.-C. Wu, T.-Z. Huang, L.-J. Deng, J.-F. Hu, and G. Vivone,
VO+Net: An adaptive approach using variational optimizat ion
and deep learning for panchromatic sharpening,” IEEE Trans.
Geosci. Remote Sens., vol. 60, pp. 1–16, Mar. 2021, doi: 10.1109/
TGRS.2021.3066425.
[126] L. Zhang , J. Zhang, J. Ma, and X. Jia, “SC-PNN: Salienc y cas-
cade convolutional neural network for pansharpening,” IEEE
Trans. Geosci. Remote Sens., vol. 59, no. 11, pp. 1–19, 2021, doi:
10.1109/ T GRS. 2021.305 4641.
[127] I. Selesnic k, R. Baraniuk, and N. Kingsbury, “The dual-tree
complex wavelet transform,” IEEE Signal Process. Mag., vol. 22,
no. 6, pp. 123151, 2005, doi: 10.1109/MS P.20 05.1550194.
[12 8] S. Vitale and G. Scarpa, “A detail-preser ving cross-scale learn-
ing strateg y for CNN-based pansharpening,” Remote Sens., vol.
12, no. 3, p. 348, 2020, doi: 10.3390/rs12030348.
[12 9] A. Barredo Arrieta et al., “Explainable artificial intelligence
(XAI): Concepts, taxonomies, opportunities and challenges to-
ward responsible AI,” Inf. Fusion, vol. 58, pp. 82115, Jun. 2020,
doi: 10.1016/j.i nffus.2019.12 .012.
[13 0] M. Ciotola, S. Vitale, A. Mazza, G. Poggi, and G. Scarpa, “Pan-
sharpening by convolutional neural networks in the full resolu-
tion framework,” 2021, arXiv :2111.08334.
[131] X. Liu, Y. Wa ng , and Q. Liu, “PSGAN: A generative adversarial
network for remote sensing image pan-sharpening,” in Proc.
25th IEEE Int. Conf. Image Process. (ICIP), 2018, pp. 873877, doi:
10.1109/ IC IP.2018.8 4510 49.
[132] J. Ma, W. Yu, C. Chen, P. Liang, X. Guo, and J. Jiang, “Pan-G A N:
An unsuper vised pan-sharpening method for remote sensing
image fusion,” Inf. Fusion, vol. 62, pp. 110120, Oct. 2020, doi:
10.1016/j.i nffu s.2020.0 4. 006.
[133] A. Gastineau, J.-F. Aujol, Y. Berthoumieu, and C. Germain, “Gen-
erative adversarial network for pansharpening with spect ral
and spatial discr iminators,” IEEE Trans. Geosci. Remote Sens., vol.
60, pp. 1–11, Mar. 2021, doi: 10.1109/TGR S.2021.3060958.
[13 4] F. Ozcelik, U. Alganci, E. Sertel, and G. Unal, “Rethinking
CNN-based pansharpening: Guided colorization of panchro-
matic images via GA Ns,” IEEE Trans. Geosci. Remote Sens., vol.
59, no. 4, 2020, doi: 10.1109/ T GRS. 2020. 30104 41.
[135] F. Palsson, J. R. Sveinsson, and M. O. Ulfarsson, “Multispectral
and hyperspect ral image fusion using a 3-d-convolut ional neu-
ral net work,” IEEE Geosci. Remote Sens. Lett., vol. 14 , no. 5, pp.
639643, M ay 2017, doi: 10.1109/L GRS.2017.2668299.
[136] R. Dian, S. Li, A. Guo, and L. Fang, “Deep hyperspectral image
sharpening,” IEEE Trans. Neural Netw. Learn. Syst., vol. 29, no.
11, pp. 53455355, 2018, doi: 10.1109/TN NLS. 2018.2798162.
[137] F. Zhou, R. Hang, Q. Liu, and X. Yuan, “Pyramid fully con-
volutional network for hyperspectral and multispectral im-
age fusion,” IEEE J. Sel. Topics Appl. Earth Observ. Remote
Sens., vol. 12, no. 5, pp. 15491558, Ma y 2019, doi: 10.1109/
JSTARS.2019.2910990.
[138] X. Han, J. Yu, J. Luo, and W. Sun, “Hyperspectral and multispec-
tral image fusion using cluster-based multi-branch BP neural
networks,” Remote Sens., vol. 11, no. 10, p. 1173, Jan. 2019, doi:
10. 3390/rs11101173.
[139] L. He et al., “HyperPNN: Hyperspectral pansharpening via
spectrally predictive convolutional neural networks,” IEEE J.
Sel. Topics Appl. Earth Observ. Remote Sens., vol. 12, no. 8, pp.
30923100 , 2019, doi: 10.1109/JSTARS.2019.2917584.
[140] K. Li, W. X ie, Q. Du, and Y. Li, “DDLPS: Detail-based deep la-
placian pansharpening for hyperspectral imagery,” IEEE Trans.
Geosci. Remote Sens., vol. 57, no. 10, pp. 80118025, 2019, doi:
10.1109/ T GRS. 2019.2917759.
[141] D. Shen, J. Liu, Z. Xiao, J. Yang, and L. Xiao, “A twice optimiz-
ing net with matrix decomposition for hyperspectral and mul-
tispectral image fusion,” IEEE J. Sel. Topics Appl. Earth Observ.
Remote Sens., vol. 13, pp. 40954110, Jul. 2020, doi: 10.1109/
JSTA RS.2020.3009250.
[142] Q. Xie, M. Zhou, Q. Zhao, Z. Xu, and D. Meng, “MHF-Net: An
interpretable deep net work for multispectral and hyperspectral
image fusion,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 44, no.
3, pp. 14571473, 2020, doi: 10.1109/ T PA M I.2020.304 5010.
[143] S. Liu, S. Miao, J. Su, B. Li, W. Hu, and Y.- D. Zhang, “UMAG-
Net: A new unsupervised multiattention-guided network for
hyperspectral and multispectral image fusion,” IEEE J. Sel. Top-
ics Appl. Earth Observ. Remote Sens., vol. 14, pp. 73737385, Jul.
2021, doi: 10.1109/JS TARS. 2021.3097178.
[144] X. Zhang, W. Huang, Q. Wa ng , and X. Li, “SSR-net: Spa-
tial–spectral reconstruction network for hyperspect ral and
Authorized licensed use limited to: National Technical University of Athens (NTUA). Downloaded on June 12,2022 at 13:02:30 UTC from IEEE Xplore. Restrictions apply.
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.
MONTH 2 022 IEEE GEOSCIENCE AND REMOTE SENSING M AGA ZIN E 51
multispectral image fusion,” IEEE Trans. Geosci. Remote Sens., vol.
59, no. 7, pp. 5953 5965, Jul. 2021, doi: 10.1109/TGR S .2020.
3018732.
[145] Y. Bengio, “Learning deep arc hitectures for A I,” Found.
Trends Mach. Learn., vol. 2, no. 1, pp. 1–127, Jan. 2009, doi:
10.1561/2200000006.
[14 6] “
MODIS technical specif ications
,” NASA, Washington, DC,
USA. Accessed: Jul. 8, 2021. [Online]. Available: https://modis.
gsfc.nasa.gov/about/specifications.php
[147] X. L iu, C. Deng, J. Chanussot, D. Hong, and B. Zhao, “STF-
Net: A two-stream convolutional neural network for spatio-
temporal image f usion,” IEEE Trans. Geosci. Remote Sens., vol.
57, no. 9, pp. 65526564, Sep. 2019, doi: 10.1109/ T GRS.2019.
29 07310.
[148] W. Li, X. Zhang, Y. Peng, and M. Dong, “DMNet: A network
architecture using dilated convolution and multiscale mec ha-
nisms for spatiotemporal fusion of remote sensing images,”
IEEE Sensors J., vol. 20, no. 20, pp. 12,19012,202, Oct. 2020,
doi: 10.1109/ JSE N.2020.3000249.
[149] W. Li, X. Zhang, Y. Peng, and M. Dong, “Spatiotemporal f u-
sion of remote sensing images using a convolutional neural
network with attention and mult iscale mechanisms,” Int. J.
Remote Sens., vol. 42, no. 6, pp. 19731993, Mar. 2021, doi:
10.1080/01431161.2020.180 9742.
[150] D. Jia, C. Song , C. Cheng, S. Shen, L. Ning, and C. Hui, “A
novel deep learning-based spatiotemporal f usion method
for combining satellite images with different resolutions
using a two-stream convolutional neural network,” Remot e
Sens., vol. 12, no. 4, p. 698, Feb. 2020, doi: 10.3390/r s1
2040698.
[151] S. Yang and X. Wang, “Sparse representation and SRCNN based
spatio-temporal information fusion method of multi-sensor
remote sensing data,” J. Network Intell., vol. 6, no. 1, pp. 4053,
2021.
[152] M. Peng, L. Zhang, X. Sun, Y. Cen, and X. Zhao, “A fast three-di-
mensional convolutional neural network-based spat iotemporal
fusion method (STF3DCNN) using a spatial-temporal-spectral
dataset,” Remote Sens., vol. 12, no. 23, p. 3888, Nov. 2020, doi:
10.3390/r s122 33888.
[153] Y. Li, J. Li, L. He, J. Chen, and A. Plaza, “A new sensor bias-
driven spatio-temporal fusion model based on convolutional
neural networks,” Sci. China Inf. Sci., vol. 63, no. 4, p. 140,302,
Apr. 2020, doi: 10.1007/s11432- 019 -2805-y.
[15 4] H. Song , Q. Liu, G. Wa ng , R. Hang, and B. Huang, “Spatio-
temporal satellite image fusion using deep convolutional
neural networks,” IEEE J. Sel. Topics Appl. Earth Observ. Remote
Sens., vol. 11, no. 3, pp. 821829, Ma r. 2018, doi: 10.1109/
JSTARS.2018.2797894.
[155] J. Chen, L. Wang, R. Feng, P. Liu, W. Han, and X. Chen, “Cycle-
GAN-STF: Spatiotemporal fusion via CycleG AN-based image
generation,” IEEE Trans. Geosci. Remote Sens., vol. 59, no. 7, pp.
1–15, 2020, doi: 10.1109/ T GRS. 2020.3023432.
[156] H. Zhang, Y. Song, C. Han, and L. Zhang, “Remote sensing im-
age spatiotemporal fusion using a generative adversarial net-
work,” IEEE Trans. Geosci. Remote Sens., vol. 59, no. 5, pp. 1–14 ,
2021, doi: 10.1109/ TGRS .2020.3010530 .
[157] X. Wang and X. Wang, “Spatiotemporal f usion of remote sens-
ing image based on deep learning,” J. Sensors, vol. 2020, pp.
1–11, Jun. 2020, doi: 10.1155/2020/8873079.
[158] Y. Zheng, H. Song, L. Sun, Z. Wu, and B. Jeon, “Spatiotempo-
ral fusion of satellite images via ver y deep convolutional net-
works,” Remote Sens., vol. 11, no. 22, p. 2701, N ov. 2019, doi:
10.3390/rs11222701.
[159] Z. Tan , P. Yue, L. Di, and J. Tang, “Deriving high spatiotempo-
ral remote sensing images using deep convolutional network,”
Remote Sens., vol. 10, no. 7, p. 1066 , Jul. 2018 , doi: 10. 339 0/
rs10 071066.
[160 ] F. Gao, J. Masek, M. Schwaller, and F. Hall, “On the blend-
ing of the Landsat and MODIS surface ref lec tance: Predicting
daily Landsat surface reflectance,” IEEE Trans. Geosci. Remote
Sens., vol. 44, no. 8, pp. 22072218, Aug. 2006, doi: 10.1109/
TGRS.2006.872081.
[161] Z. Tan , L. Di, M. Zhang, L. Guo, and M. Gao, “An enhanced
deep convolutional model for spatiotemporal image fusion,”
Remote Sens., vol. 11, no. 24, p. 2898, Dec. 2019, doi: 10.3390/
rs11242898.
[162] S. Bouabid, M. Chernetskiy, M. Rischard, and J. Gamper, “Pre-
dicting landsat ref lectance wit h deep generative fusion,” No v.
2020, arXi v: 2011.0 4762.
[163] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-image
translation w ith conditional adversarial networks,” Nov. 2018 ,
arXi v: 1611.07004 .
[164] J.-Y. Zhu, T. Park, P. Isola, and A . A. Efros, “Unpaired image-to-
image translation using cycle-consistent adversarial networks,”
Aug. 2020, arXiv: 1703.10593.
[165] X. Zhu, E. H. Helmer, F. Gao, D. Liu, J. Chen, and M. A. Lefsky,
A flexible spatiotemporal method for fusing satellite images
with different resolutions,” Remote Sens. Environ., vol. 172 , pp.
165177, Ja n. 2 016, doi: 10.1016/j.rs e.2015.11.016.
[166] Z. Tan, M. Gao, X. Li, and L. Jiang, “A fle xible reference-insen-
sitive spat iotemporal fusion model for remote sensing images
using conditional generative adversarial network,” IEEE Trans.
Geosci. Remote Sens., vol. 60, pp. 1–13, Ja n. 2021, doi: 10.1109/
TGRS .2021.3050551.
[167] P. Luo, J. Ren, Z. Peng, R. Zhang, and J. Li, “Differentiable learn-
ing-to-normalize via switchable normali zation,” Apr. 2019,
arXiv: 1806.10779.
[168 ] T. Miyato, T. Kataoka, M. Koyama, and Y. Yosh ida , “Spectral
normalization for generative adversarial networks,” Fe b. 2018,
arX iv: 1802.05957.
[169] T.-A . Teo and Y.-J. Fu, “Spatiotemporal fusion of Formosat-2
and Landsat-8 satellite images: A comparison of ‘super resolu-
tion-then-blend’ and ‘blend-then-super resolution’ approach-
es,” Remote Sens., vol. 13, no. 4, p. 606, Feb. 2021, doi: 10.3390/
rs13040606.
[170] D. Jia, C. Cheng, C. Song, S. Shen, L. Ning , and T. Zhang, “A
hybrid deep learning-based spatiotemporal fusion method
for combining satellite images with different resolutions,”
Remote Sens., vol. 13, no. 4, p. 645, Feb. 2021, doi: 10.3390/
rs13040645.
[171] S. Lei, Z. Shi, and Z. Zou, “Super-resolution for remote sens-
ing images via local–global combined network,” IEEE Geosci.
Authorized licensed use limited to: National Technical University of Athens (NTUA). Downloaded on June 12,2022 at 13:02:30 UTC from IEEE Xplore. Restrictions apply.
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.
IEEE GEOSCIENCE AND R EMOTE SENSING MAGAZINE MONTH 2022
52
Remote Sens. Lett., vol. 14, no. 8, pp. 12431247, Aug. 2017, doi:
10.1109/ LGR S .2017.270 412 2.
[172] J. M. Haut, M. E. Paoletti, R. Fernandez-Beltran, J. Plaza, A.
Plaza, and J. Li, “Remote sensing single-image superresolution
based on a deep compendium model,” IEEE Geosci. Remote Sens.
Lett., vol. 16, no. 9, pp. 14321436, Sep. 2019, doi: 10.1109/
LGRS.2019.2899576.
[173] T. Lu, J. Wang, Y. Zhang, Z. Wang, and J. Jiang, “Satellite image
super-resolution via multi-scale residual deep neural network,”
Remote Sens., vol. 11, no. 13, p. 1588, Ju l. 2019, doi: 10.3390/
rs1113158 8.
[174] W. Xu, C. Zhang, and M. Wu, “Multi-scale deep residual net-
work for satellite image super-resolution reconstruc tion,” in
Pattern Recognition and Computer Vision (Lecture Notes in Com-
puter Science), Z. Lin et al., Eds. Cham: Springer International
Publishing, 2019, vol. 118 59, pp. 332340. [Online]. Available:
http://link .springer.com/10.1007/978-3-030-31726-3\_ 28
[175] L. Yan and K. Chang, “A new super resolution framework based
on multi-task learning for remote sensing images,” Sensors, vol.
21, no. 5, p. 1743, Mar. 2021, doi: 10. 339 0/s 21051743.
[176] M. Qin et al., “Remote sensing single-image resolution improve-
ment using a deep gradient-aware network with image-specific
enhancement,” Remote Sens., vol. 12, no. 5, p. 758, Feb. 2020,
doi: 10.3390/rs12050758.
[177] M. Galar, R. Sesma, C. Ayala, L. Albizua, a nd C. Aranda, “Learn-
ing super-resolution for Sentinel-2 images wit h real ground
truth data from a reference satellite,” ISPRS Ann. Photogram.,
Remote Sens. Spatial Inf. Sci., vol. V-1-2020, pp. 9–16, Aug. 2020,
doi: 10.5194/isprs-annals-V-1-2020-9-2020.
[178] L. A. Gatys, A. S. Ecker, and M. Bethge, “Image st yle transfer us-
ing convolutional neural networks,” in Proc. IEEE Conf. Comput.
Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 24142423.
[179] D. Pouliot, R. Latifovic, J. Pasher, and J. Duffe, “L andsat super-
resolution enhancement using convolution neural networks
and Sentinel-2 for training,” Remote Sens., vol. 10, no. 3, p. 394,
Ma r. 2018, doi: 10.3390/rs10030394.
[180] C. B. Collins, J. M. Beck, S. M. Bridges, J. A. Rushing, and S.
J. Graves, “Deep learning for multisensor image resolut ion en-
hancement,” in Proc. 1st Workshop on Artif. Intell. Deep Learning
Geographic Knowledge Discovery. Los Angeles, CA, USA: ACM,
Nov. 2 017, pp. 3744, doi: 10.1145/314 980 8.3149815 .
[181] M. M. Sheikholeslami, S. Nadi, A. A. Naeini, and P. Ghamisi,
An efficient deep unsuper vised superresolution model for
remote sensing images,” IEEE J. Sel. Topics Appl. Earth Observ.
Remote Sens., vol. 13, pp. 19371945, May 2020, doi: 10.1109/
JSTARS.2020.2984589.
[182] K. Tur kow ski, “Filters for common resampling tasks,” in Gra phics
Gems. Amsterdam, The Net herlands: Elsevier, 1990, pp. 147165.
[183] N. Zhang et al., “A multi-degradation aided method for unsu-
pervised remote sensing image super resolution with convolu-
tion neural networks,” IEEE Trans. Geosci. Remote Sens., vol. 60,
pp. 1–14, Dec. 2020, doi: 10.1109/ TGR S .2020.30 42460.
[184] K. Zhang, W. Zuo, and L. Zhang, “Learning a single convolu-
tional super-resolution network for multiple degradations,” in
Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 3262
3271, doi: 10.1109/C V PR.2018.0 034 4.
[185] W. Ma, Z. Pan, J. Guo, and B. Lei, “Achieving super-resolu-
tion remote sensing images via the wavelet transform com-
bined wit h the recursive ResNet,” IEEE Trans. Geosci. Remote
Sens., vol. 57, no. 6, pp. 35123527, Ju n. 2019, doi: 10.1109/
TGRS.2018.2885506.
[186] Q. Qin, J. Dou, and Z. Tu, “Deep ResNet based remote sensing
image super-resolut ion reconstr uction in discrete wavelet do-
main,” Pattern Recognit. Image Anal., vol. 30, no. 3, pp. 541550,
Jul. 2020, doi: 10.1134/S1054 661820 0302 32.
[187] X. Feng, W. Zhang, X. Su, and Z. Xu, “Optical remote sensing
image denoising and super-resolution reconstructing using
optimized generative network in wavelet t ransform domain,”
Remote Sens., vol. 13, no. 9, p. 1858, May 2021, doi: 10.3390/
rs130 91858.
[188] A. Mahendran and A. Vedaldi, “Understanding deep image
representations by inverting them,” in Proc. IEEE Conf. Comput.
Vis. Pattern Recognit. (CVPR), Ju n. 2015, pp. 5188 5196, doi:
10.1109/CVPR .2015.7299155.
[189] X. Dong, Z. Xi, X. Sun, and L. Gao, “Transferred multi-percep-
tion attention networks for remote sensing image super-reso-
lution,” Remote Sens., vol. 11, no. 23, p. 2857, D ec . 2019, doi:
10.3390/rs11232857.
[19 0] J. Gu, X. Sun, Y. Zhang, K. Fu, and L. Wang, “Deep residual
squeeze and e xcitation network for remote sensing image su-
per-resolution,” Remote Sens., vol. 11, no. 15, p. 1817, Aug. 2019,
doi: 10. 3390/rs11151817.
[191] J. M. Haut, R. Fernandez-Beltran, M. E. Paoletti, J. Plaza,
and A. Plaza, “Remote sensing image superresolution using
deep residual channel attention,” IEEE Trans. Geosci. Remote
Sens., vol. 57, no. 11, pp. 92779289, Nov. 2019, doi: 10.1109/
TGRS.2019.2924818.
[192] S. Zhang, Q. Yuan, J. Li, J. Sun, and X. Zhang, “Scene-adaptive
remote sensing image super-resolution using a multiscale at-
tention network,” IEEE Trans. Geosci. Remote Sens., vol. 58,
no. 7, pp. 47644779, Jul. 2020, doi: 10.1109/ T GRS.2020.
2966805.
[193] X. Dong , X. Sun, X. Jia, Z. Xi, L. Gao, and B. Zhang, “Remote
sensing image super-resolution using novel dense-sampling
networks,” IEEE Trans. Geosci. Remote Sens., vol. 59, no. 2, pp.
16181633 , Feb. 2021, doi: 10.1109/TG R S.2020.2 9942 53.
[194] X. Wang, Y. Wu, Y. Ming, and H. Lv, “Remote sensing imag-
ery super resolution based on adaptive multi-scale feature fu-
sion network,” Sensors, vol. 20, no. 4, p. 1142, Feb. 2020, doi:
10.3390/s20041142.
[195] P. Lei and C. Liu, “Incept ion residual attention network
for remote sensing image super-resolution,” Int. J. Re-
mote Sens., vol. 41, no. 24, pp. 95659587, Dec. 2020, doi:
10.1080/01431161.2020.180 0129.
[196] H. Wang, Q. Hu, C. Wu, J. Chi, and X. Yu, “Non-locally up-
down convolutional attention network for remote sensing im-
age super-resolution,” IEEE Access, vol. 8, pp. 166,304–166, 319,
Sep. 2020, doi: 10.1109/ACC ESS.2020. 3022882.
[197] X. Wang , R. Girshick, A. Gupta, and K. He, “Non-local neu-
ral net works,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit.
(CV P R), Jun . 2018, pp. 77947803, doi: 10.1109/C VPR .2018.
00 813.
Authorized licensed use limited to: National Technical University of Athens (NTUA). Downloaded on June 12,2022 at 13:02:30 UTC from IEEE Xplore. Restrictions apply.
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.
MONTH 2 022 IEEE GEOSCIENCE AND REMOTE SENSING M AGA ZIN E 53
[198] Y. Peng , X. Wang, J. Zhang, and S. Liu, “Pre-training of gated
convolution neural network for remote sensing image super-
resolution,” IET Image Process., vol. 15, no. 5, pp. 1179 1188 ,
Apr. 2021, doi: 10.1049/ipr2.12096.
[199] S. Lei and Z. Shi, “Hybrid-scale self-similarity e xploitation for
remote sensing image super-resolution,” IEEE Trans. Geosci.
Remote Sens., vol. 60, pp. 110, Apr. 2021, doi: 10.1109/
TGRS .2021.3069889.
[200] Y. Chang and B. Luo, “Bidirectional convolutional LSTM neu-
ral net work for remote sensing image super-resolution,” Re-
mote Sens., vol. 11, no. 20, p. 2333, Oct. 2019, doi: 10.3390/
rs11202333.
[201] S. Lei, Z. Shi, and Z. Zou, “Coupled adversar ial t raining for
remote sensing image super-resolution,” IEEE Trans. Geosci.
Remote Sens., vol. 58, no. 5, pp. 36333643, May 2020, doi:
10.1109/ T GRS. 2019.2959020.
[202] W. Ma, Z. Pan, F. Yuan, and B. Lei, “Super-resolut ion of remote
sensing images via a dense residual generative adversarial net-
work,” Remote Sens., vol. 11, no. 21, p. 2578, Nov. 2019, doi:
10.3390/rs11212578.
[203] L. Salgueiro Romero, J. Marcello, and V. Vilaplana, “Super-
resolution of Sentinel-2 imagery using generative adversarial
networks,” Remote Sens., vol. 12, no. 15, p. 2424, Jul. 2020, doi:
10.3390/rs12152424.
[204] Z. Wang, K. Jiang, P. Yi, Z. Han, and Z. He, “Ultra-dense GAN
for satellite image ry super-resolution,” Neurocomputing, vol. 398,
pp. 328337, Jul. 2020, doi: 10.1016/j.neucom.2019.03.106.
[205] C. Shin, S. Kim, and Y. Kim, “Satellite image target super-
resolution with adversarial shape disc riminator,” IEEE Geos-
ci. Remote Sens. Lett., vol. 19, pp. 1–5, 2020, doi: 10.1109/
LGRS.2020.3042238.
[206] Y. Gong et al., “Enlighten-G AN for super resolution reconst ruc-
tion in mid-resolution remote sensing images,” Remote Sens.,
vol. 13, no. 6, p. 1104 , Mar. 2021, doi: 10.3390/rs13061104.
[207] K. Jiang, Z. Wang, P. Yi, G. Wang , T. Lu, and J. Jiang, “Edge-en-
hanced GAN for remote sensing image superresolution,” IEEE
Trans. Geosci. Remote Sens., vol. 57, no. 8, pp. 57995812, Aug.
2019, doi: 10.1109/ TGR S .2019.29 02431.
[208] Y. Li et al., “Single-image super-resolution for remote sens-
ing images using a deep generative adversarial network with
local and global attention mechanisms,” IEEE Trans. Geos-
ci. Remote Sens., vol. 60, pp. 1–24, Jul. 2021, doi: 10.1109/
TGRS.2021.3093043.
[209] M. Kawulok, P. Benecki, S. Piechaczek, K. Hrynczenko, D.
Kostrzewa, and J. Nalepa, “Deep learning for multiple-image
super-resolution,” IEEE Geosci. Remote Sens. Lett., vol. 17,
no. 6, pp. 10621066, Jun. 2020, doi: 10.1109/L GRS. 2019.
2940483.
[210] M. Kawulok, P. Benecki, D. Kostrzewa, and L. Skonieczny,
Towards evolutionar y super-resolution,” in Applications of Evo-
lutionary Computation (Lecture Notes in Computer Science), K.
Sim and P. Kaufmann, Eds. Cham: Springer International Pub-
lishing, 2018, vol. 1078 4, pp. 480496, doi: 10.10 07/978 -3 -319
-77538 -8\_33.
[211] M. Märtens, D. Izzo, A. Kr zic, and D. Cox, “Super-resolution
of PROBA-V images using convolutional neural networks,”
Astrodynamics, vol. 3, no. 4, pp. 387402, Dec . 2019, doi:
10.1007/s4206 4-019-0059-8 .
[212] A. B. Molini, D. Valsesia, G. Fracastoro, and E. Magli, “Deep-
SUM: Deep neural network for super-resolution of unregistered
multitemporal images,” IEEE Trans. Geosci. Remote Sens., vol.
58, no. 5, pp. 36443656, May 2020, arXiv: 1907.06490, doi:
10.1109/ T GRS. 2019.2959248.
[213] A. B. Molini, D. Valsesia, G. Fracastoro, and E. Magli, “Deep-
sum++: Non-local deep neural network for super-resolution of
unregistered multitemporal images,” in Proc. IEEE Int. Geosci.
Remote Sens. Symp. Waikoloa, HI, USA, Sep. 2020, pp. 609612,
doi: 10.1109/ IG A R S S390 84.2020.9324 418.
[214] M. Deudon et al., “HighRes-Net: Recursive fusion for multi-
frame super-resolution of satellite imagery,” Feb. 2020, arXiv:
2002.06 460.
[215] D. DeTon e, T. Malisiewicz, and A. Rabinovich, “Deep image ho-
mography estimation,” Jun . 2016, arXiv: 1606.03798.
[216] M. Rifat Arefin et al., “Multi-image super-resolution for re-
mote sensing using deep recurrent net works,” in Proc. IEEE/
CVF Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW),
Jun. 2020, pp. 816 825, doi: 10.1109/C V PRW50 498. 2020.
00111.
[217] N. Ballas, L. Yao, C. Pal, and A. Cour ville, “Delving deeper into
convolutional networks for learning video representations,”
Ma r. 2016 , arXiv: 1511.06432.
[218] F. Salvetti, V. Mazzia, A. K haliq, and M. Chiaberge, “Multi-im-
age super resolution of remotely sensed images using residual
attention deep neural networks,” Remote Sens., vol. 12, no. 14,
p. 2207, Jul. 2020, doi: 10.339 0/rs12142 207.
[219] J. Ma, L. Zhang, and J. Zhang , “SD-GAN: Salienc y-discrim-
inated G AN for remote sensing image superresolution,” IEEE
Geosci. Remote Sens. Lett., vol. 17, no. 11, pp. 19731977, Nov.
2020, doi: 10.1109/L GRS.2019.2956969.
[220] H. Wu, L. Zhang, and J. Ma, “Remote sensing image super-
resolution via salienc y-g uided feedback GA Ns,” IEEE Trans.
Geosci. Remote Sens., vol. 60, pp. 1–16 , Dec. 2020, doi: 10.1109/
TGRS.2020.3042515.
[2 21] L. Zhang, D. Chen, J. Ma, and J. Zhang, “Remote-sensing im-
age superresolution based on visual saliency analysis and
unequal reconst ruction networks,” IEEE Trans. Geosci. Remote
Sens., vol. 58, no. 6, pp. 40994115, Jun. 2020, doi: 10.1109/
TGRS.2019.2960781.
[222] L. Zhang, J. Ma, X. Lv, and D. Chen, “Hierarchical weakly
super vised learning for residential area semantic segmenta-
tion in remote sensing images,” IEEE Geosci. Remote Sens.
Lett., vol. 17, no. 1, pp. 117121, Jan. 2020, doi: 10.1109/
LGRS.2019.2914490.
[223] L. Wang, M. Zheng, W. Du, M. Wei, and L. Li, “Super-resolution
SAR image reconstruction via generative adversarial net work,”
in Proc. 12th Int. Symp. Antennas, Propag. EM Theory (ISAPE),
2018, pp. 1–4, doi: 10.1109/IS A PE.2018.863 43 45.
[224] F. Gu, H. Zhang, C. Wang, and F. Wu, “SAR image super-
resolution based on noise-free generative adversarial net-
work,” in Proc. IEEE Int. Geosci. Remote Sens. Symp. (IGARSS
2019), 2019, pp. 25752578, doi: 10 .1109/ IG A R S S.2019.
8899202.
Authorized licensed use limited to: National Technical University of Athens (NTUA). Downloaded on June 12,2022 at 13:02:30 UTC from IEEE Xplore. Restrictions apply.
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.
IEEE GEOSCIENCE AND R EMOTE SENSING MAGAZINE MONTH 2022
54
[225] Y. Li, D. Ao, C. O. Dumitru, C. Hu, and M. Datcu, “Super-resolu-
tion of geosync hronous synthetic aper ture radar images using
dialectical GANs,” Sci. China Inf. Sci., vol. 62, no. 10, p. 209,3 02,
Apr. 2019, doi: 10.10 07/s11432- 018-9668- 6.
[226] X. Cen, X. Song, Y. Li, and C. Wu, “A deep lear ning-based super-
resolution model for bistatic sa r image,” in Proc. Int. Conf. Elec-
tron., Circuits Inf. Eng. (ECIE), 2021, pp. 228233, doi: 10.1109/
ECI E52353.2021.00056.
[227] H. Shen, L. Lin, J. Li, Q. Yuan, and L. Zhao, “A residual convo-
lutional neural network for polarimetric SAR image super-res-
olution,” ISPRS J. Photogram. Remote Sens., vol. 161, pp. 90108,
2020, doi: 10.1016/j.isprsjprs.2020.01.006.
[228] J. Yu, W. Li, Z. Li, J. Wu, H. Yang, and J. Yang , “SAR image
super-resolution base on weighted dense connected con-
volutional network,” in Proc. IEEE Int. Geosci. Remote Sens.
Symp. (IGA RSS 2020), 2020, pp. 21012104, doi: 10.1109/
IGARSS39084.2020.9324079.
[229] L. Lin, J. Li, Q. Yuan, and H. Shen, “Polarimetric SAR image
super-resolution via deep convolutional neural network,” in
Proc. IEEE Int. Geosci. Remote Sens. Symp. (IGARSS 2019), 2019,
pp. 32053208, doi: 10.1109/ IG A R S S.2019.8898160.
[230] P. Wang, H. Zhang, and V. M . Patel, “SAR image despeckling
using a convolutional neural network,” IEEE Signal Process.
Lett., vol. 24, no. 12, pp. 17631767, D e c . 2 017, doi: 10.1109/
LSP.2017.2758203.
[231] D. Ao, C. O. Dumitru, G. Schwarz, and M. Datcu, “Dialectical
gan for SA R image translation: From Sentinel-1 to Terrasar-X,”
Remote Sens., vol. 10, no. 10, 2018 , doi: 10. 339 0/rs10101597.
[232] K. A . H. Kelany, A. Baniasadi, N. Dimopoulos, and M. Gara, “Im-
proving I nSAR image qualit y and co-registration t hrough CNN-
based super-resolution,” in Proc. IEEE Int. Symp. Circuits Syst. (IS-
CA S), 2020, pp. 1–5, doi: 10.1109/ISC A S 45731.2020.9180733.
[233] T. Wang, W. Sun, H. Qi, and P. Ren, “Aerial image super reso-
lution via wavelet multiscale convolutional neural networks,”
IEEE Geosci. Remote Sens. Lett., vol. 15, no. 5, pp. 769773, 2018,
doi: 10.1109/ LGR S .2018.2810893.
[234] D. Gon lez, M. A. Patricio, A. Berlanga, and J. M. Molina,
A super-resolution enhancement of UAV images based on a
convolutional neural net work for mobile devices,” Personal
Ubiquitous Comput., pp. 1–12, 2019, doi: 10.10 07/s 00779-019
-01355-5.
[235] N. Q. Truong , P. H . Nguyen, S. H. Nam, and K. R. Park, “Deep
learning-based super-resolution reconstr uction and marker
detection for drone landing,” IEEE Access, vol. 7, pp. 61,639
61,655, May 2019, doi: 10.1109/AC CESS .2019.2915944.
[236] F. Liu, Q. Yu, L. Chen, G. Jeon, M. K. Albertini, and X. Yang,
Aerial image super-resolution based on deep recursive dense
network for disaster area sur veillance,” Personal Ubiquitous Com-
put., pp. 1–10, 2021, doi: 10.1007/s 00779-020 - 01516-x.
[237] J. Zhou, C.-M. Vong , Q. Liu, and Z. Wang, “Scale adaptive image
cropping for UAV object detection,” Neurocomputing, vol. 366,
pp. 305313, Nov. 2019, doi: 10.1016/j.neucom. 2019.07.073.
[238] H. Chen, Z. He, B. Shi, and T. Zhong, “Researc h on recogni-
tion method of electr ical components based on Yolo v3,” IEEE
Access, vol. 7, pp. 157,818 157, 829, Oct . 2019, doi: 10.1109/AC
CESS.2019.2950053.
[239] M. Aslahishahri, K. G. Stanley, H. Duddu, S. Shirtlif fe, S. Vail,
and I. Stavness, “Spatial super resolution of real-world aerial
images for image-based plant phenot yping,” Remote Sens., vol.
13, no. 12, p. 2308, 2021, doi: 10.3390/rs13122308.
[240] Y. Ya ng and S. Newsam, “Bag- of-visual-words and spatial e x-
tensions for land-use classif icat ion,” in Proc. 18th SIGSPATIAL
Int. Conf. Adv. Geographic Inf. Syst. (GIS ‘10), 2010, p. 270, doi:
10.1145/1869790.1869829.
[241] G. Sheng, W. Yang , T. Xu, and H. Sun, “High-resolution sat-
ellite scene classification using a sparse coding based mul-
tiple feature combination,” Int. J. Remote Sens., vol. 33, no.
8, pp. 23952412, A pr. 2012, doi: 10.10 80/01431161. 2011.
60 8740.
[242] J. Hu, T. Jiang , X. Tong, G.-S. Xia, and L. Zhang, “A bench-
mark for scene classification of high spatial resolution re-
mote sensing imagery,” in Proc. IEEE Int. Geosci. Remote Sens.
Symp. (IGARSS), Ju l. 2015, pp. 50035006, doi: 10.1109/
IGAR SS.2015.7326956.
[243] Q. Zou, L. Ni, T. Zhang, and Q. Wang, “Deep learning based
feature selec tion for remote sensing scene classif ication,” IEEE
Geosci. Remote Sens. Lett., vol. 12, no. 11, pp. 23212325, No v.
2015, doi: 10.1109/ LGR S .2015.2475299.
[244] L. Zhao, P. Ta ng , and L. Huo, “Feature significance-based mult i-
bag-of-visual-words model for remote sensing image scene clas-
sification,” J. Appl. Remote Sens., vol. 10, no. 3, p. 35,0 04, Jul.
2016 , doi: 10.1117/1.JR S.10.03500 4.
[245] G.-S. Xia et al., “AID: A benchmark data set for performance
evaluation of aerial scene classification,” IEEE Trans. Geosci. Re-
mote Sens., vol. 55, no. 7, pp. 39653981, Ju l. 2017, doi: 10.1109/
TGRS.2017.2685945.
[246] G. Cheng, J. Han, and X. Lu, “Remote sensing image scene
classification: Benchmark and state of the art,” Proc. IEEE,
vol. 105, no. 10, pp. 1865 1883, O c t . 2017, doi: 10.1109/
JPROC.2017.2675998.
[247] B. Zhao, Y. Zhong, G.-S. Xia, and L. Zhang, “Dirichlet-derived
multiple topic scene classification model for high spatial res-
olution remote sensing imagery,” IEEE Trans. Geosci. Remote
Sens., vol. 54, no. 4, pp. 2108 2123, Apr. 2 016 , doi: 10.1109/
TG R S.2015. 2496185.
[248] O. A. B. Penatti, K. Nogueira, and J. A. dos Santos, “Do deep
features generalize from everyday objects to remote sensing
and aerial scenes domains?” in Proc. IEEE Conf. Comput. Vis.
Pattern Recognit. Workshops (CVPRW), Jun. 2015, pp. 4451, doi:
10.1109/CVPRW.2015.7301382 .
[249] M. Schmitt, L . H. Hughes, and X. X. Zhu, “The SE N1-2 datas et
for deep learning in SA R-optical data f usion,” Ju l. 2018 , arXi v:
18 0 7.01 5 69.
[250] M. Schmitt, L. H. Hughes, C. Qiu, and X. X. Zhu, “SEN12MS –
A curated dataset of georeferenced multi-spectral Sentinel-1/2
imagery for deep learning and data fusion,” Jun. 2019, arXiv:
1906.07789.
[251] G.-S. Xia et al., “DOTA: A large-scale dataset for object detection
in aerial images,” May 2019, arXiv : 1711.1039 8 .
[252] I. V. Emelyanova, T. R . McVicar, T. G. Van Niel, L. T. Li, and A. I.
van Dijk, “Assessing the accurac y of blending Landsat–MODIS
surface ref lectances in two landscapes wit h contrasting spatial
Authorized licensed use limited to: National Technical University of Athens (NTUA). Downloaded on June 12,2022 at 13:02:30 UTC from IEEE Xplore. Restrictions apply.
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.
MONTH 2 022 IEEE GEOSCIENCE AND REMOTE SENSING M AGA ZIN E 55
and temporal dynamics: A framework for algorithm selection,”
Remote Sens. Environ., vol. 133, pp. 193209, Jun. 2013, doi:
10.1016/j.r se.2013.02.007.
[253] J. Li, Y. Li, L. He, J. Chen, and A. Plaza, “Spatio-temporal f usion
for remote sensing data: An over view and new benchmark,” Sci.
China Inf. Sci., vol. 63, no. 4, p. 140,301, Apr. 2020. https://link.
spr i ng er.com/10.10 07/s11432-019-2785-y, doi: 10.1007/s11432
-019-2785-y.
[254] X.-Y. Tong et al., “Land-cover classif ication with high-resolution
remote sensing images using transferable deep models,” Remote
Sens. Environ., vol. 237, p. 111,322, Feb. 2020, doi: 10.1016/j.
rs e .2019.111 32 2.
[255]
“Draper satellite image chronology
.” Kaggle.com. [On-
line]. Available: https://kaggle.com/c/draper-satellite-image
-chronolog y
[256] P. Wei et al., “Component divide-and-conquer for real-world
image super-resolution,” in Proc. Eur. Conf. Comput. Vis., 2020,
pp. 101117.
[257] Y. Zheng, J. Li, Y. Li, J. Guo, X. Wu, and J. Chanussot, “Hyper-
spectral pansharpening using deep prior and dual attention re-
sidual network,” IEEE Trans. Geosci. Remote Sens., vol. 58, no. 11,
pp. 8059 8076, 2020, doi: 10.1109/TG R S.2020.2 986313.
[258] W. Xie, Y. Cui, Y. Li, J. Lei, Q. Du, and J. Li, “HPGAN: Hy-
perspectral pansharpening using 3-d generative adversarial
networks,” IEEE Trans. Geosci. Remote Sens., vol. 59, no. 1, pp.
463477, 2021, doi: 10.1109/ T GRS.2020. 2994 238.
[259] J. Gu, H. Lu, W. Zuo, and C. Dong, “Blind super-resolution with
iterative kernel cor rection,” in Proc. IEEE Conf. Comput. Vis. Pat-
tern Recognit. (CVPR), Jun . 2019, pp. 1604 1613 , doi: 10.1109/
CV PR.2019.0 0170.
[260] V. Cornillère, A. Djelouah, W. Yifan, O. Sorkine-Hor nung, and
C. Schroers, “Blind image super resolution with spatially vari-
ant degradations,” ACM Trans. Graph., vol. 38, no. 6, pp. 1–13,
2019, doi: 10.1145/3355089.3356575.
[261] Z. Luo, Y. Huang, S. Li, L. Wang, and T. Tan, “Unfolding the al-
ternating optimization for blind super resolut ion,” Nov. 2020,
arX iv: 2010.02631.
[262] Y. Yua n, S. Liu, J. Zhang, Y. Zhang, C. Dong, and L. Lin, “Unsu-
pervised image super-resolution using cycle-in-cycle generative
adversarial net works,” Sep. 2018, arXiv: 1809.00437.
[263] G. Kim et al., “Unsuper vised real-world super resolution wit h
cycle generative adversarial network and domain discr imi-
nator,” in Proc. IEEE/CV F Conf. Comput. Vis. Pattern Recognit.
Wor ksh ops (C VPRW), Jun. 2020, pp. 18621871, doi: 10.1109/
CV PRW50498.2020.00236.
[264] S. Maeda, “Unpaired image super-resolution using pseudo-su-
pervision,” Feb. 2020, arXiv : 20 02.11 39 7.
[265] Y. Zhang, S. Liu, C. Dong, X. Zhang, and Y. Yuan, “Mult iple
cycle-in-cycle generative adversarial networks for unsuper vised
image super-resolution,” IEEE Trans. Image Process., vol. 29, pp.
11011112, Sep. 2019, doi: 10.1109/TIP.2019.29383 47.
[266] J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Gan-
guli, “Deep unsuper vised learning using nonequilibrium ther-
modynamics,” in Proc. Int. Conf. Mach. Learn., 2015, pp. 2256
2265.
[267] C. Saharia, J. Ho, W. Chan, T. Salimans, D. J. Fleet, and M. No-
rouzi, “Image super-resolut ion via iterative ref inement,” 2021.
[Online]. Available: ht tps://arxi v.or g /ab s/2104.07636
... Once deployed, they continuously supply valuable data over vast areas, supporting ecological and agricultural monitoring [1], [2]. However, satellite observations sometimes fall short of advanced application requirements regarding spatial and temporal resolution, limiting their applicability for specific uses, like wildfire management, air quality monitoring, or precision agriculture [3], [4], [5]. ...
... These advancements have significantly contributed to various climate-related studies [4], [3], [9]. The high cost and societal importance of atmospheric modeling and observation systems further highlight the promise of DL-based SR techniques. ...
... Additionally, SR models exhibit strong adaptability, enabling their application across various climates, temporal scales, and geographic locations when adequately trained [11], [12], [13], [14]. Indeed, DL-based SR methods have already demonstrated significant success in Earth sciences, improving the spatial and temporal resolution of climate data derived from satellite observations and chemical transport model simulations [15], [4], [16], [3], [2], driving progress in multiple research domains [5], [8], [10], [17], [18]. ...
Preprint
Full-text available
Remote sensing plays a crucial role in monitoring Earth's ecosystems, yet satellite-derived data often suffer from limited spatial resolution, restricting their applicability in atmospheric modeling and climate research. In this work, we propose a deep learning-based Super-Resolution (SR) framework that leverages land cover information to enhance the spatial accuracy of Biogenic Volatile Organic Compounds (BVOCs) emissions, with a particular focus on isoprene. Our approach integrates land cover priors as emission drivers, capturing spatial patterns more effectively than traditional methods. We evaluate the model's performance across various climate conditions and analyze statistical correlations between isoprene emissions and key environmental information such as cropland and tree cover data. Additionally, we assess the generalization capabilities of our SR model by applying it to unseen climate zones and geographical regions. Experimental results demonstrate that incorporating land cover data significantly improves emission SR accuracy, particularly in heterogeneous landscapes. This study contributes to atmospheric chemistry and climate modeling by providing a cost-effective, data-driven approach to refining BVOC emission maps. The proposed method enhances the usability of satellite-based emissions data, supporting applications in air quality forecasting, climate impact assessments, and environmental studies.
... On the other hand, the non-linear nature of the mapping task lends itself to the use of neural networks. Several models have been adapted from traditional single image digital super-resolution in computer vision literature [19]. Existing deep learning models in single image super-resolution are primarily dominated by Convolutional Neural Network (CNN) based models. ...
Preprint
Full-text available
The frequency of extreme flood events is increasing throughout the world. Daily, high-resolution (30m) Flood Inundation Maps (FIM) observed from space play a key role in informing mitigation and preparedness efforts to counter these extreme events. However, the temporal frequency of publicly available high-resolution FIMs, e.g., from Landsat, is at the order of two weeks thus limiting the effective monitoring of flood inundation dynamics. Conversely, global, low-resolution (~300m) Water Fraction Maps (WFM) are publicly available from NOAA VIIRS daily. Motivated by the recent successes of deep learning methods for single image super-resolution, we explore the effectiveness and limitations of similar data-driven approaches to downscaling low-resolution WFMs to high-resolution FIMs. To overcome the scarcity of high-resolution FIMs, we train our models with high-quality synthetic data obtained through physics-based simulations. We evaluate our models on real-world data from flood events in the state of Iowa. The study indicates that data-driven approaches exhibit superior reconstruction accuracy over non-data-driven alternatives and that the use of synthetic data is a viable proxy for training purposes. Additionally, we show that our trained models can exhibit superior zero-shot performance when transferred to regions with hydroclimatological similarity to the U.S. Midwest.
... Furthermore, the limitations of assessment metrics, which are designed to enhance perceptual similarity to account for the human visual system [44], may not faithfully reproduce climate imagery representing physical factors. Significantly, while current modeling approaches excel in generating perceptually pleasing imagery, they continue to suffer from hallucinations [72]. However, such hallucinations could distort physical climate imagery. ...
Article
Full-text available
Surface albedo is a key variable influencing ground-reflected solar irradiance, which is a vital factor in boosting the energy gains of bifacial solar installations. Therefore, surface albedo is crucial towards estimating photovoltaic power generation of both bifacial and tilted solar installations. Varying across daylight hours, seasons, and locations, surface albedo is assumed to be constant across time by various models. The lack of granular temporal observations is a major challenge to the modeling of intra-day albedo variability. Though satellite observations of surface reflectance, useful for estimating surface albedo, provide wide spatial coverage, they too lack temporal granularity. Therefore, this paper considers a novel approach to temporal downscaling with imaging time series of satellite-sensed surface reflectance and limited high-temporal ground observations from surface radiation (SURFRAD) monitoring stations. Aimed at increasing information density for learning temporal patterns from an image series and using visual redundancy within such imagery for temporal downscaling, we introduce temporally shifted heatmaps as an advantageous approach over Gramian Angular Field (GAF)-based image time series. Further, we propose Multispectral-WaveMix, a derivative of the mixer-based computer vision architecture, as a high-performance model to harness image time series for surface albedo forecasting applications. Multispectral-WaveMix models intra-day variations in surface albedo on a 1 min scale. The framework combines satellite-sensed multispectral surface reflectance imagery at a 30 m scale from Landsat and Sentinel-2A and 2B satellites and granular ground observations from SURFRAD surface radiation monitoring sites as image time series for image-to-image translation between remote-sensed imagery and ground observations. The proposed model, with temporally shifted heatmaps and Multispectral-WaveMix, was benchmarked against predictions from models image-to-image MLP-Mix, MLP-Mix, and Standard MLP. Model predictions were also contrasted against ground observations from the monitoring sites and predictions from the National Solar Radiation Database (NSRDB). The Multispectral-WaveMix outperformed other models with a Cauchy loss of 0.00524, a signal-to-noise ratio (SNR) of 72.569, and a structural similarity index (SSIM) of 0.999, demonstrating the high potential of such modeling approaches for generating granular time series. Additional experiments were also conducted to explore the potential of the trained model as a domain-specific pre-trained alternative for the temporal modeling of unseen locations. As bifacial solar installations gain dominance to fulfill the increasing demand for renewables, our proposed framework provides a hybrid modeling approach to build models with ground observations and satellite imagery for intra-day surface albedo monitoring and hence for intra-day energy gain modeling and bifacial deployment planning.
... However, despite the revolutionary breakthroughs that deep learning has brought to remote sensing image interpretation, it still faces a series of challenges. The high resolution, multispectral features, and complex scenes with cross-scale targets in remote sensing images pose significant challenges to traditional deep learning models [24][25][26]. To address these challenges, an increasing number of studies are exploring hybrid models that combine traditional methods with deep learning in order to better extract multi-level Electronics 2025, 14, 349 3 of 21 spatial features [27,28]. ...
Article
Full-text available
As the demand for land use monitoring continues to grow, high-precision remote sensing products have become increasingly important. Compared to traditional methods, deep learning networks demonstrate significant advantages in automatic feature extraction, handling complex scenes, and improving classification accuracy. However, as the complexity of these networks increases, so does the computational cost. To address this challenge, we propose an innovative knowledge distillation model, integrating two key modules—spatial-global attention feature distillation (SGAFD) and channel attention-based relational distillation (CARD). This model enables a lightweight “student” network to be guided by a large “teacher” network, enhancing classification performance while maintaining a compact model size. We validated our approach on the large-scale public remote sensing datasets GID15 and LoveDA, and the results show that these modules effectively improve classification performance, overcoming the limitations of lightweight models and advancing the practical applications of land use monitoring.
... Deep learning models are currently becoming a widely used method for image spatial super-resolution reconstruction due to their powerful feature extraction capabilities to learn the complex nonlinear relationship between coarse and fine spatial resolution images. Image spatial super-resolution reconstruction methods can be further divided into single-image super-resolution methods [27] and image-fusion-based super-resolution methods [28]. In practical applications, compared to image-fusion-based super-resolution methods, single-image super-resolution methods avoid the complex processes of image alignment, fusion, and spatiotemporal synchronization, making them more practical. ...
Article
Full-text available
Mixed pixels often hinder accurate cropland mapping from remote sensing images with coarse spatial resolution. Image spatial super-resolution reconstruction technology is widely applied to address this issue, typically transforming coarse-resolution remote sensing images into fine spatial resolution images, which are then used to generate fine-resolution land cover maps using classification techniques. Deep learning has been widely used for image spatial super-resolution reconstruction; however, collecting training samples is often difficult for cropland mapping. Given that the quality of spatial super-resolution reconstruction directly impacts classification accuracy, this study aims to assess the impact of different types of training samples on image spatial super-resolution reconstruction and cropland mapping results by employing a Residual Channel Attention Network (RCAN) model combined with a spatial attention mechanism. Four types of samples were used for spatial super-resolution reconstruction model training, namely fine-resolution images and their corresponding coarse-resolution images, including original Sentinel-2 and degraded Sentinel-2 images, original GF-2 and degraded GF-2 images, histogram-matched GF-2 and degraded GF-2 images, and registered original GF-2 and Sentinel-2 images. The results indicate that the samples acquired by the histogram-matched GF-2 and degraded GF-2 images can resolve spectral band mismatches when simulating training samples from fine spatial resolution imagery, while the other three methods have limitations in their inability to fully address spectral and spatial mismatches. The histogram-matched method yielded the best image quality with PSNR, SSIM, and QNR values of 42.2813, 0.9778, and 0.9872, respectively, and produced the best mapping results, achieving an overall accuracy of 0.9306. By assessing the impact of training samples on image spatial super-resolution reconstruction and classification, this study addresses data limitations and contributes to improving the accuracy of cropland mapping, which is crucial for agricultural management and decision-making.
... The core concept of assimilating remote sensing into crop growth model shares common ground with high spatial resolution data for acquiring spatial texture details. This makes it a key tool for obtaining continuous and accurate remote sensing observations of crop growth throughout the growing season [29]. Bai et al. utilized ESTARFM to acquire high spatiotemporal resolution spectral parameters and, based on this, performed the inversion of regional high spatiotemporal resolution ET and analyzed the interannual variations in crop ET proportions since the implementation of large-scale irrigation water-saving transformations [30]. ...
Article
Full-text available
Remote sensing spatiotemporal fusion technology can provide abundant data source information for assimilating crop growth model data, enhancing crop growth monitoring, and providing theoretical support for crop irrigation management. This study focused on the winter wheat planting area in the southeastern part of the Loess Plateau, a typical semi-arid region, specifically the Linfen Basin. The SEBAL and ESTARFM were used to obtain 8 d, 30 m evapotranspiration (ET) for the growth period of winter wheat. Then, based on the ‘localization’ of the CERES-Wheat model, the fused results were incorporated into the data assimilation process to further determine the optimal assimilation method. The results indicate that (1) ESTARFM ET can accurately capture the spatial details of SEBAL ET (R > 0.9, p < 0.01). (2) ESTARFM ET can accurately capture the spatial details of SEBAL ET (R > 0.9, p < 0.01). The calibrated CERES-Wheat ET characteristic curve effectively reflects the ET variation throughout the winter wheat growth period while being consistent with the trend and magnitude of ESTARFM ET variation. (3) The correlation between Ensemble Kalman filter (EnKF) ET and ESTARFM ET (R² = 0.7119, p < 0.01) was significantly higher than that of Four-Dimensional Variational data assimilation (4DVar) ET (R² = 0.5142, p < 0.01) and particle filter (PF) ET (R² = 0.5596, p < 0.01). The results of the study provide theoretical guidance to improve the yield and water use efficiency of winter wheat in the region, which will help promote sustainable agricultural development.
Article
In remote sensing image processing for Earth and environmental applications, super-resolution is a crucial technique for enhancing the resolution of low-resolution images. In this study, we proposed a novel algorithm of Frequency Domain Super-Resolution with Reconstruction from Compressed Representation. The algorithm follows a multi-step procedure: first, a low-resolution image in the space domain is transformed to the frequency domain using a Fourier transform. The frequency-domain representation is then expanded to the desired size (number of pixels) of a high-resolution image. This expanded frequency-domain image is subsequently inverse Fourier transformed back to the spatial domain, yielding an initial high-resolution image. A final high-resolution image is then reconstructed from the initial high-resolution image using a low-rank regularization model that incorporates a non-local Smoothed Rank Function. We evaluated the performance of the new algorithm by comparing the reconstructed high-resolution images with those generated by several commonly used super-resolution algorithms, including: (1) bicubic interpolation, (2) sparse representation, (3) adaptive sparse domain selection and adaptive regularization, (4) fuzzy-rule-based algorithm, (5) super-resolution convolutional neural network, (6) fast super-resolution convolutional neural networks, (7) practical degradation model for deep blind image super-resolution, (8) the frequency separation for real-world super-resolution, and (9) the enhanced super-resolution generative adversarial networks. The algorithms were tested on Landsat-8 and Moderate Resolution Imaging Spectrora-diometer multi-resolution images over various locations, as well as on images with artificially added noise to assess robustness of each algorithm. Results show that: (1) the proposed new algorithm outperforms the others in terms of peak signal-to-noise ratio, structure similarity, and root-mean-square error; and (2) it effectively suppresses noise during high-resolution reconstruction from noisy low-resolution images, overcoming a key limitation of existing super-resolution methods.
Article
Full-text available
A critical approach to achieving dual carbon targets is to maintain strong decoupling (SD) between land use carbon emissions (LUCE) and ecological environment quality (EEQ). However, the spatiotemporal decoupling mechanisms of LUCE and EEQ at the county scale remain understudied. This study aims to explore the decoupling dynamics of LUCE and EEQ across 107 counties in Shaanxi Province from 2000 to 2020. Using the PIE-Engine and Google Earth Engine, we constructed LUCE and Remote Sensing Ecological Index (RSEI) models and applied the Tapio decoupling theory to analyze the decoupling trends. The results reveal that from 2000 to 2020, LUCE in Shaanxi tripled, while EEQ increased by 20.93%, albeit with fluctuations. County-level variations in LUCE and EEQ exhibited pronounced spatiotemporal heterogeneity, with decoupling types undergoing significant transitions in over 95% of counties. In 2005, 28.04% of counties achieved a decoupled state, but this deteriorated sharply by 2020, when nearly half of the counties displayed strong negative decoupling (SND), and no counties maintained SD. These findings suggest that achieving SD remains challenging and requires targeted strategies based on regional decoupling characteristics. This study offers a theoretical and practical framework for understanding county-level LUCE and EEQ decoupling, which is crucial for sustainable development.
Article
Full-text available
We present SR3, an approach to image Super-Resolution via Repeated Refinement. SR3 adapts denoising diffusion probabilistic models [1], [2] to image-to-image translation, and performs super-resolution through a stochastic iterative denoising process. Output images are initialized with pure Gaussian noise and iteratively refined using a U-Net architecture that is trained on denoising at various noise levels, conditioned on a low-resolution input image. SR3 exhibits strong performance on super-resolution tasks at different magnification factors, on faces and natural images. We conduct human evaluation on a standard 8× face super-resolution task on CelebA-HQ for which SR3 achieves a fool rate close to 50%, suggesting photo-realistic outputs, while GAN baselines do not exceed a fool rate of 34%. We evaluate SR3 on a 4× super-resolution task on ImageNet, where SR3 outperforms baselines in human evaluation and classification accuracy of a ResNet-50 classifier trained on high-resolution images. We further show the effectiveness of SR3 in cascaded image generation, where a generative model is chained with super-resolution models to synthesize high-resolution images with competitive FID scores on the class-conditional 256×256 ImageNet generation challenge.
Article
Full-text available
In recent years, there has been a growing interest in deep learning-based pansharpening. Thus far, research has mainly focused on architectures. Nonetheless, model training is an equally important issue. A first problem is the absence of ground truths, which is unavoidable in pansharpening. This is often addressed by training networks in a reduced-resolution domain and using the original data as ground truth, relying on an implicit scale invariance assumption. However, on full-resolution images, results are often disappointing, suggesting that such invariance does not hold. A further problem is the scarcity of training data, which causes a limited generalization ability and a poor performance on off-training-test images. In this article, we propose a full-resolution training framework for deep learning-based pansharpening. The framework is fully general and can be used for any deep learning-based pansharpening model. Training takes place in the high-resolution domain, relying only on the original data, thus avoiding any loss of information. To ensure spectral and spatial fidelity, a suitable two-component loss is defined. The spectral component enforces consistency between the pansharpened output and the low-resolution multispectral input. The spatial component, computed at high resolution, maximizes the local correlation between each pansharpened band and the panchromatic input. At testing time, the target-adaptive operating modality is adopted, achieving good generalization with a limited computational overhead. Experiments carried out on WorldView-3, WorldView-2, and GeoEye-1 images show that methods trained with the proposed framework deliver consistently good performance in terms of both full-resolution numerical indexes and visual quality.
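The two-component loss described above can be sketched as follows: a spectral term compares the pansharpened product, degraded back to the multispectral resolution, with the original MS input, while a spatial term rewards correlation between each sharpened band and the panchromatic image. This is a hedged approximation: average pooling stands in for the sensor MTF, and a global correlation replaces the local windowed correlation of the paper.

import torch
import torch.nn.functional as F

def full_resolution_loss(fused, ms_lr, pan, scale=4, lam=1.0):
    # Spectral consistency: the fused product, reduced to MS resolution, should match the MS input.
    fused_lr = F.avg_pool2d(fused, kernel_size=scale)      # crude MTF surrogate
    spectral = F.l1_loss(fused_lr, ms_lr)

    # Spatial consistency: correlation between each fused band and the PAN image.
    def zncc(a, b, eps=1e-6):
        a = a - a.mean(dim=(2, 3), keepdim=True)
        b = b - b.mean(dim=(2, 3), keepdim=True)
        return (a * b).mean(dim=(2, 3)) / (a.std(dim=(2, 3)) * b.std(dim=(2, 3)) + eps)

    spatial = 1.0 - zncc(fused, pan.expand_as(fused)).mean()
    return spectral + lam * spatial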
Article
Full-text available
Single image super-resolution (SISR) has been widely studied in recent years as a crucial technique for remote sensing applications. In this paper, a dense residual generative adversarial network (DRGAN)-based SISR method is proposed to promote the resolution of remote sensing images. Different from previous super-resolution (SR) approaches based on generative adversarial networks (GANs), the novelty of our method mainly lies in the following factors. First, we made a breakthrough in terms of network architecture to improve performance. We designed a dense residual network as the generative network in the GAN, which can make full use of the hierarchical features from low-resolution (LR) images. We also introduced a contiguous memory mechanism into the network to take advantage of the dense residual blocks. Second, we modified the loss function and altered the model of the discriminative network according to the Wasserstein GAN with a gradient penalty (WGAN-GP) for stable training. Extensive experiments were performed using the NWPU-RESISC45 dataset, and the results demonstrated that the proposed method outperforms state-of-the-art methods in terms of both objective evaluation and subjective perspective.
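As an illustration of the kind of generator building block such a network relies on, the snippet below implements a compact dense residual block with local feature fusion and a residual connection; the layer count, growth rate, and the paper's exact contiguous memory wiring are assumptions made here for brevity.

import torch
import torch.nn as nn

class DenseResidualBlock(nn.Module):
    # Densely connected convolutions, local feature fusion, and a local residual connection.
    def __init__(self, channels=64, growth=32, layers=4):
        super().__init__()
        self.convs = nn.ModuleList(
            [nn.Conv2d(channels + i * growth, growth, 3, padding=1) for i in range(layers)]
        )
        self.fuse = nn.Conv2d(channels + layers * growth, channels, 1)  # local feature fusion

    def forward(self, x):
        feats = [x]
        for conv in self.convs:
            feats.append(torch.relu(conv(torch.cat(feats, dim=1))))
        return x + self.fuse(torch.cat(feats, dim=1))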
Article
Full-text available
To reconstruct images with high spatial resolution and high spectral resolution, one of the most common methods is to fuse a low-resolution hyperspectral image (HSI) with a high-resolution (HR) multispectral image (MSI) of the same scene. Deep learning has been widely applied in the field of HSI-MSI fusion, but such methods are limited by hardware constraints. In order to break these limits, we construct an unsupervised multiattention-guided network named UMAG-Net that requires no training data to better accomplish HSI-MSI fusion. UMAG-Net first extracts deep multiscale features of the MSI by using a multiattention encoding network. Then, a loss function built from the HSI-MSI pair is used to iteratively update the parameters of UMAG-Net and learn prior knowledge of the fused image. Finally, a multiscale feature-guided network is constructed to generate an HR-HSI. The experimental results show the visual and quantitative superiority of the proposed method compared to other methods.
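The self-supervised objective behind such training-data-free fusion can be summarized in a short sketch: the estimated HR-HSI must reproduce the LR-HSI once spatially degraded, and reproduce the HR-MSI once projected through the sensor's spectral response. The pooling operator and the spectral response matrix srf below are placeholders, and this generic consistency loss is not necessarily UMAG-Net's exact formulation.

import torch
import torch.nn.functional as F

def unsupervised_fusion_loss(fused_hrhsi, lr_hsi, hr_msi, srf, scale=8):
    # Spatial consistency: degrading the fused HR-HSI should recover the LR-HSI.
    spatial = F.l1_loss(F.avg_pool2d(fused_hrhsi, scale), lr_hsi)
    # Spectral consistency: projecting the HSI bands through the spectral response
    # matrix srf (n_msi_bands x n_hsi_bands) should recover the HR-MSI.
    projected = torch.einsum('mc,bchw->bmhw', srf, fused_hrhsi)
    spectral = F.l1_loss(projected, hr_msi)
    return spatial + spectral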
Article
Full-text available
Super-resolution (SR) technology is an important way to improve spatial resolution under the condition of sensor hardware limitations. With the development of deep learning (DL), some DL-based SR models have achieved state-of-the-art performance, especially convolutional neural network (CNN)-based models. However, considering that remote sensing images usually contain a variety of ground scenes and objects with different scales, orientations, and spectral characteristics, previous works usually treat important and unnecessary features equally or only apply different weights in the local receptive field, which ignores long-range dependencies; it is still a challenging task to exploit features at different levels and reconstruct images with realistic details. To address these problems, an attention-based generative adversarial network (SRAGAN) is proposed in this article, which applies both local and global attention mechanisms. Specifically, we apply local attention in the SR model to focus on structural components of the Earth's surface that require more attention, and global attention is used to capture long-range interdependencies in the channel and spatial dimensions to further refine details. To optimize the adversarial learning process, we also use local and global attention in the discriminator model to enhance the discriminative ability, apply the gradient penalty in the form of a hinge loss, and use a loss function that combines L1 pixel loss, L1 perceptual loss, and relativistic adversarial loss to promote rich details. The experiments show that SRAGAN achieves performance improvements and reconstructs better details compared with current state-of-the-art SR methods. A series of ablation investigations and model analyses validate the efficiency and effectiveness of our method.
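A hedged sketch of such a composite generator objective is given below, combining an L1 pixel term, an L1 perceptual term computed on fixed feature maps, and a relativistic average adversarial term; the weights and the exact feature extractor are placeholders rather than the values used by SRAGAN.

import torch
import torch.nn.functional as F

def generator_loss(sr, hr, feat_sr, feat_hr, d_sr, d_hr,
                   w_pix=1.0, w_perc=1.0, w_adv=5e-3):
    pixel = F.l1_loss(sr, hr)                      # L1 pixel loss
    perceptual = F.l1_loss(feat_sr, feat_hr)       # L1 loss on fixed deep features
    # Relativistic average adversarial loss for the generator (d_sr, d_hr are logits).
    adv = 0.5 * (
        F.binary_cross_entropy_with_logits(d_sr - d_hr.mean(), torch.ones_like(d_sr)) +
        F.binary_cross_entropy_with_logits(d_hr - d_sr.mean(), torch.zeros_like(d_hr))
    )
    return w_pix * pixel + w_perc * perceptual + w_adv * adv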
Article
The classical spatio-temporal fusion algorithms STARFM and SPSTFM produce large fusion errors when phenological changes or land cover type changes occur. In this paper, based on the spatial feature information of the image, we propose a new spatio-temporal information fusion method that combines SRCNN (Super-Resolution Convolutional Neural Network) and sparse representation. First, the feature reconstruction of the reflectance change image is completed by combining SRCNN and sparse representation; the reconstructed change image is then combined with temporal weights to obtain the predicted reflectance image. Experiments show that the proposed method outperforms the classic spatio-temporal fusion algorithms STARFM and SPSTFM.
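The temporal weighting step can be pictured with a minimal sketch: two fine-resolution predictions, each built by adding a reconstructed change image to a known fine image, are blended with weights inversely proportional to the magnitude of the reconstructed change. This STARFM-style heuristic and the variable names are assumptions for illustration, not the paper's exact weighting scheme.

import numpy as np

def temporal_weighted_prediction(fine_t0, fine_t1, delta_0p, delta_1p, eps=1e-6):
    # delta_0p, delta_1p: reconstructed reflectance change images from t0->tp and t1->tp.
    pred_from_t0 = fine_t0 + delta_0p
    pred_from_t1 = fine_t1 + delta_1p
    # Smaller observed change -> larger confidence in that base date.
    w0 = 1.0 / (np.abs(delta_0p).mean() + eps)
    w1 = 1.0 / (np.abs(delta_1p).mean() + eps)
    return (w0 * pred_from_t0 + w1 * pred_from_t1) / (w0 + w1)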
Article
Single image super-resolution (SISR), which aims to reconstruct a high-resolution (HR) image from a low-resolution (LR) observation, has been an active research topic in the area of image processing in recent decades. In particular, deep learning-based super-resolution (SR) approaches have drawn much attention and have greatly improved the reconstruction performance on synthetic data. However, recent studies show that simulation results on synthetic data usually overestimate the capacity to super-resolve real-world images. In this context, more and more researchers are devoting themselves to developing SR approaches for realistic images. This article aims to provide a comprehensive review of real-world single image super-resolution (RSISR). More specifically, this review covers the critical publicly available datasets and assessment metrics for RSISR, and four major categories of RSISR methods, namely degradation modeling-based RSISR, image pairs-based RSISR, domain translation-based RSISR, and self-learning-based RSISR. Comparisons are also made among representative RSISR methods on benchmark datasets, in terms of both reconstruction quality and computational efficiency. In addition, we discuss challenges and promising research topics on RSISR.