Deep Super-Resolution of Sentinel-2
Time Series
Master Thesis
to obtain the degree
Master of Science
Department of Geoinformatics, Faculty of Digital and Analytical
Sciences, Paris-Lodron Universität Salzburg
Faculté Sciences & Sciences de l'Ingénieur,
Université Bretagne Sud
by
Simon DONIKE
Supervisors
Assoc. Prof. Charlotte PELLETIER
Université Bretagne Sud, France
Assoc. Prof. Nicolas AUDEBERT
Conservatoire National des Arts et Métiers, France
Assoc. Prof. Dirk TIEDE
Paris Lodron Universität Salzburg, Austria
Vannes (France), June 15th 2022
Abstract
Deep convolutional networks for super-resolution are an intensively researched method
in the field of computer vision. Applying these technologies to the field of remote sensing
can provide many benefits, mainly since a high temporal and a high spatial resolution are
mutually exclusive for remote sensing products. Super-resolution has the possibility of
bridging that gap by upsampling the high frequency imagery to a higher spatial resolu-
tion. Many challenges arise when adapting super-resolution techniques to the
specific characteristics of remote sensing data, as well as ensuring that the needs of the
remote sensing community regarding the resulting data are met.
This thesis compares how well several different models with different architectures can
adapt to the task of performing super-resolution for the region of Brittany in France, using
data from the SPOT-6 satellite as high-resolution ground truth to super-resolute Sentinel-2
imagery. We also demonstrate that employing a histogram matching approach can suc-
cessfully bridge the spectral gap between the sensors. The experimentation shows that the
standard Super-Resolution Convolutional Neural Network (SRCNN) and Residual Chan-
nel Attention Network (RCAN) succeed in adapting the color mapping, but fail to deliver
realistic-appearing super-resolution images. On the other hand, we show that the Super-
Resolution Generative Adversarial Network (SRGAN) can produce visually impressive,
plausible super-resoluted images, but that the products lose the connection to the under-
lying real-world phenomena that the satellite imagery depicts.
The method of including time-series information by a simple band-stacking approach
into the generative adversarial model is shown not to be sufficient, while the fusion of
encoded information of multi-temporal low-resolution imagery via a recursive network
shows promising potential. In order to build on that potential, the main challenges to
overcome are identified as formulating useful loss functions as well as the publishing of
a common dataset to provide a comparison baseline, which is geared towards a real-life
application.
Keywords: Deep Learning, GAN, Remote Sensing, Residual Network, Sentinel-2,
SPOT-6, Super-Resolution, Time Series
Acknowledgements
I would like to thank my supervisors for this project, Assoc. Prof. Charlotte Pelletier,
Assoc. Prof. Nicolas Audebert and Assoc. Prof. Dirk Tiede. Their continuous support
and patience were crucial to the success of this project. While Assoc. Prof. Pelletier
and Assoc. Prof. Audebert helped me understand the technological nuances and assisted in
the implementations, Assoc. Prof. Tiede’s input was crucial to keep the remote sensing
perspective in mind. I also thank my supervisors for enabling a research visit to CNAM
in Paris. Without this support, a project of this magnitude would have been too daunting to
even attempt.
We acknowledge the Dinamis Group of Interest for providing us with the SPOT-6 data
(AIRBUS DS 2018).
This project was supported by the Erasmus+ Programme of the European Union, through
the Erasmus Mundus Joint Master Degree Copernicus Master in Digital Earth.
Contents
Abstract
Acknowledgements
Table of contents
1 Introduction
2 Literature Review
2.1 Problem Definition
2.2 Single Image Super-Resolution
2.3 Super-Resolution in Remote Sensing
2.3.1 Super-Resolution across Sensors and Sensor Types
2.3.2 Multiple Image Super-Resolution in Remote Sensing
3 Data and Methods
3.1 Study Area
3.2 Data
3.2.1 Satellite Data
3.2.2 Preprocessing
3.2.3 Matching Sensor Values
3.3 Losses and Performance Metrics
3.3.1 Validity of Performance Metrics in Super-Resolution
3.4 Model Selection, Model Architectures and Hyperparameters
3.4.1 SRCNN
3.4.2 RCAN
3.4.3 SRGAN
3.4.4 Augmenting SRGAN with Temporal Data
4 Experimental Settings
4.1 Training and Testing Data
4.2 Experiments and Hyperparameter Tuning
4.2.1 SRCNN Model Training
4.2.2 RCAN Model Training
4.2.3 SRGAN Model Training
4.2.4 Temporally enhanced SRGAN Model Training
5 Results
5.1 Standard Model Performance Comparison
5.2 Temporally enhanced SRGAN Performance
5.3 Model Complexities vs. Performance
6 Discussion
7 Conclusion
1 Introduction
Optical satellite imagery has drawn sustained attention for decades from researchers and
the commercial sector alike. The applications developed span many fields, including ex-
amples such as agricultural uses (Weiss et al., 2020), land cover mapping and change
detection (Khatami et al., 2016), monitoring of climate change (Yang et al., 2013) and
ecosystems (Bresciani et al., 2022), and natural disaster management and prevention
(Westen, 2000). The launch of ESA's Sentinel-2 optical Earth Observation (EO) satellites,
starting with Sentinel-2A in 2015, further increased the availability of remote sensing
products. In combination with the recent trend of making these taxpayer-funded
products available to researchers, companies, and the general public free of charge, the
intensity of research and usage of satellite imagery has increased considerably (Wulder
et al., 2012).
Satellites that fall into the same category as Sentinel-2, such as NASA's Landsat program,
have multispectral sensors on board that capture large areas with their rather sizeable
swath widths (185 km for Landsat-7, 265 km for Sentinel-2). Acquiring images of larger
areas leads to a short revisit time of only several days for most points on Earth, increasing
the temporal resolution of these sensors. On the other hand, the spatial resolution is
decreased by the wider field-of-view apertures. For example, the MultiSpectral Instru-
ment sensor aboard the Sentinel-2 satellites has a spatial resolution of 10 by 10 meters per
pixel for the visible spectral bands between 490 and 665 nanometers (ESA, 2022). High-
resolution multispectral satellites, such as WorldView-3, Pléiades, or SPOT-6 approach
and surpass spatial resolutions of 1 meter. Unfortunately, the high-resolution products
from these satellites are only available commercially at substantial prices. Additionally,
these satellites have much longer revisit intervals, meaning the image acquisition is likely to
have a large temporal difference to the phenomenon of interest. Because of these sensor-
specific restrictions, a high temporal and a high spatial resolution are mutually exclusive
for remote sensing products.
Super-resolution has the potential to enhance all existing workflows based on low-resolution
remote sensing imagery. Not only can more visually appealing images be produced more
frequently, but a truthful increase in spatial resolution immediately improves the accuracy
of existing applications such as land cover classification.
In the field of computer vision, the process of increasing the resolution of an image is
referred to as Super-Resolution (SR). The naive approach is to interpolate the newly cre-
ated pixels, but more sophisticated methods have been developed to leverage the power of
machine learning to increase the resolution more accurately. Applying these methods to
remote sensing imagery can bridge the gap between higher and lower resolution imagery,
combining the benefits of the free availability and high temporal resolution of satellites
such as the Sentinel-2 with the higher resolution of satellites such as SPOT-6.
1
This project intends to derive a workflow that performs SR on remote sensing imagery in
a real-world scenario, overcoming the fact that the bands of the different sensors cover
slightly different spectral ranges and were acquired at different dates and times.
The models implemented are also intended to handle the land cover change that might
have happened between the two acquisitions. Finally, it is tested whether the high temporal
dimension of the low-resolution imagery can be used to augment and improve the SR
performance.
The research objectives are therefore twofold:
1. performing SR, overcoming the inherent differences of the two datasets and com-
paring the performances of different models
2. enhancing SR performance by including the temporal dimension of the Sentinel-2
data
The structure of this thesis is as follows: after reviewing the relevant literature for both
single image SR as well as multi-image SR, the study area is described. The acquisition of
the two datasets, as well as their characteristics and the necessary preprocessing steps to
align the data is then outlined. The methods and data section describes the splitting of the
data for the experimentation. Before detailing the inner workings of three SR models, we
present the loss functions that are used to measure SR performance. Then, the structure of
the temporally augmented model is presented. After discussing the experiments to gauge
performance as well as the tuning of model hyperparameters, the results of all models are
compared before formulating a conclusion regarding the research objectives.
2 Literature Review
The main advancements in SR stem from the field of computer vision, which is more focused
on methodology than on applications. Solving the problem itself is interesting for many
areas of application, most famously for the enhancement of medical imagery (Greenspan,
2009). With growing interest, SR has also received attention from a large variety of other
interest groups, for example in the fields of medical imaging, photographic processing, and
remote sensing, and continues to be developed further.
Within the developmental history of SR, the methods can be categorized into traditional and
deep learning approaches. According to Yang et al. (2014), traditional methods can be
classified into (i) simple interpolation methods, (ii) edge enhancement methods, which
can lack texture approximation, and (iii) statistical methods that exploit gradient
distributions. These initial approaches focus more on signal processing
techniques (Park et al., 2003) but have recently been outperformed by deep learning
methods, as shown by performance comparisons in recent literature (Anwar et al., 2021,
p.24:60). Given these developments, traditional approaches are not discussed further;
instead, the focus is put on deep learning-based approaches that attempt to learn the degra-
dation prior.
2.1 Problem Definition
Given a low-resolution image, the target is the corresponding high-resolution image. The
relationship between these two images can be described by Equation 1, where the low-
resolution image x is equal to the high-resolution image y to which the degradation θ has
been applied. The degradation, consisting of noise and the scaling factor, in combination
with the original image, makes up the degradation function Φ.

$$x = \Phi(y; \theta) \tag{1}$$
SR tries to reverse the degradation applied to the high-resolution image by approximating
the degradation function (Anwar et al., 2021, p.60:3).

$$\hat{y} = \Phi^{-1}(x; \theta) \tag{2}$$
The strength of the degradation function is mainly influenced by the scaling factor. For
a factor of 4, e.g. going from 10 m to 2.5 m spatial resolution, one single pixel needs to be
replaced by 16 pixels. Since the relationship between the factor and the number of replaced
pixels is not linear but quadratic (the square of the factor), reversing the degradation gets
increasingly difficult for rising scaling factors. Therefore, factors of over 4 are uncommon
or only used to push methods to their limits.
In the literature, synthetic datasets are usually used, meaning a high-resolution image is
downsampled to the lower resolution and then upsampled again by the SR method.
While this does not introduce additional noise, using datasets from different sensors in-
troduces a significant amount of noise.
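To make the distinction concrete, the synthetic setting can be emulated in a few lines: the low-resolution input is simply a bicubically downsampled copy of the high-resolution image, with no cross-sensor noise involved. The following minimal PyTorch sketch (function and variable names are illustrative, the factor of 4 matches the setup used later in this thesis) is not part of the original workflow.

    import torch
    import torch.nn.functional as F

    def synthetic_lr(hr: torch.Tensor, factor: int = 4) -> torch.Tensor:
        """Create a synthetic low-resolution image by bicubic downsampling.

        hr: tensor of shape (N, C, H, W) with values in [0, 1]. In the synthetic
        setting this resampling is the only degradation; a cross-sensor dataset
        additionally differs in spectral response, acquisition date and
        georeferencing, which no fixed operator captures.
        """
        return F.interpolate(hr, scale_factor=1.0 / factor, mode="bicubic", align_corners=False)

    # A 300x300 SPOT-6-like patch becomes a 75x75 Sentinel-2-like patch.
    lr = synthetic_lr(torch.rand(1, 3, 300, 300))  # -> shape (1, 3, 75, 75)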
2.2 Single Image Super-Resolution
The term Single-Image Super-Resolution (SISR) describes the process of taking a single
low-resolution image to approximate the high-resolution image. While SISR has a rich
history in the era before deep learning was introduced, deep learning models will be the
focus of this section because of their performance benefits over traditional methods. The
classifications in this section follow the taxonomy outlined by Anwar et al. (2021).
Standard Neural Networks Deep learning was demonstrated to be capable of perform-
ing SR for the first time in 2014 by Dong et al. (2014, 2015), showing that convolutional
layers can be used to perform SR. The design is quite basic, consisting only of three con-
volutional layers, the first two being followed by a Rectified Linear Unit (ReLU). As an
input, the model takes an interpolated version of the low-resolution image upsampled to
the target resolution. The authors intend the first convolution to extract feature maps from
the input images, which are then converted into a higher-dimensional space by the follow-
ing convolution. The final convolution produces the high-resolution image by aggregating
the feature maps, for which a Mean Squared Error (MSE) loss is computed to advance the
training (see Figure 16).
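A minimal PyTorch sketch of this three-layer design is given below; the 9-1-5 kernel sizes and 64/32 filters follow the commonly cited configuration rather than necessarily the exact hyperparameters used in this thesis (those are given in Section 3.4.1).

    import torch
    import torch.nn as nn

    class SRCNN(nn.Module):
        """Three convolutions, the first two followed by a ReLU.

        The input is the low-resolution image already interpolated to the target
        resolution, so the network does not change the spatial dimensions."""

        def __init__(self, channels: int = 3):
            super().__init__()
            self.extract = nn.Conv2d(channels, 64, kernel_size=9, padding=4)      # feature extraction
            self.map = nn.Conv2d(64, 32, kernel_size=1)                           # non-linear mapping
            self.reconstruct = nn.Conv2d(32, channels, kernel_size=5, padding=2)  # reconstruction

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            x = torch.relu(self.extract(x))
            x = torch.relu(self.map(x))
            return self.reconstruct(x)

    # Training against the high-resolution target with an MSE loss, e.g.:
    # loss = nn.MSELoss()(SRCNN()(upsampled_lr), hr)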
This basic design acted as a foundation for many other models, which apply an interpola-
tion first and then a set of convolutional layers. The Very Deep Super Resolution (VDSR)
model (Kim et al., 2016) for example uses the popular Visual Geometry Group Network
(VGGnet) (Simonyan and Zisserman, 2015) that employs twenty convolutional layers
with a kernel size of 3 and ReLU activation functions for each hidden layer. The network
predicts a residual image that depicts the differences between the low and high-resolution
images, which is then added to the original image to produce the super-resoluted ver-
sion. Because the attention is only on the residual image, the target information is of
high frequency and therefore leads to faster convergence. Other linear models that do
not perform an initial interpolation, but instead a late upsampling, include FSRCNN (Fast
Super-Resolution Convolutional Neural Network) by Dong et al. (2016), an improvement
of the SRCNN model, and Efficient Sub-Pixel Convolutional Neural Network (ESPCN)
by Shi et al. (2016), intended for real time usage on video streams. The latter employs a
sub-pixel convolution in the final layer to project the low-resolution feature maps into the
high-resolution space.
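The sub-pixel convolution can be sketched as follows: a convolution in low-resolution space produces r² times as many channels as the output image, and a pixel-shuffle rearranges them onto the r-times finer grid. This is an illustrative example, not the exact ESPCN configuration.

    import torch
    import torch.nn as nn

    class SubPixelHead(nn.Module):
        """Late upsampling: convolve in low-resolution space, then rearrange
        out_channels * r * r feature maps into an r-times larger image."""

        def __init__(self, in_channels: int = 64, out_channels: int = 3, factor: int = 4):
            super().__init__()
            self.conv = nn.Conv2d(in_channels, out_channels * factor ** 2, kernel_size=3, padding=1)
            self.shuffle = nn.PixelShuffle(factor)

        def forward(self, features: torch.Tensor) -> torch.Tensor:
            return self.shuffle(self.conv(features))

    # 64 feature maps of a 75x75 patch become a 3x300x300 prediction.
    out = SubPixelHead()(torch.rand(1, 64, 75, 75))  # -> shape (1, 3, 300, 300)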
Residual Networks One of the drawbacks of performing the interpolation before the
image is fed into the model is that the layers have large dimensions from the start, lead-
ing to large computational requirements and memory consumption. Most of the more
recent models therefore perform the upsampling within the network. One of these models
is the Enhanced Deep Super-Resolution (EDSR) model introduced by Lim et al. (2017).
As before, another widely successful deep learning architecture has been adapted to per-
form SR, in this case Residual Network (ResNet) developed by He et al. (2015) which
was originally intended for image classification. By eliminating the batch normalization
present in the original ResNet residual blocks as well as the ReLU outside of the residual
blocks, EDSR is streamlined and according to the authors better suited for SR.
The Super-Resolution Residual Network (SRResNet), which Lim et al. (2017) reference
and claim to surpass in performance, and which is shown in Figure 1, is more closely modeled after
ResNet and retains the batch normalization within the residual blocks. This model forms
the generator of the Generative Adversarial Network (GAN) described in Section 3.4.3.
Figure 1: Residual block design of ResNet (left), SRResNet (center), EDSR (right). Im-
age from Lim et al. (2017, p.3)
On par with ResNet, UNet (Ronneberger et al., 2015) is also a widely used image seg-
mentation network employing residual blocks that has been adapted for SR purposes by
Mao et al. (2016) under the name Residual Encoder Decoder Network (REDNet). Anwar
et al. (2021) classify it as a multi-stage ResNet, since the encoder downsamples the input
while the decoder performs the upsampling. Each convolution and deconvolution block
is followed by a ReLU. Skip connections add the encoder output to the mirrored decoder
before a final ReLU activation function, as well as an intermediate encoder output to an
intermediate decoder step.
Attention-based Networks The defining characteristic of attention-based networks is
that they do not consider the whole input image to be equally important, but instead se-
lectively extract features considered to be significant. For example, Choi and Kim (2017)
implement the selective attention by inserting 'selective units' between the convolutions,
which consist of a ReLU activation followed by a 1×1 convolution and a sigmoid
activation function. The selective unit is supposed to take control over which feature maps
from the previous convolution can be passed on to the next one.
The Residual Channel Attention Network (RCAN) (Zhang et al., 2018c) achieves the
selective attention through a different strategy, which is shrinking the filter spatial dimensions
from H×W×C down to 1×1×C. The authors describe the result of
this process as ”[...] a collection of the local descriptors, whose statistics contribute to
express the whole image” (Zhang et al., 2018c, p.7). In addition, RCAN employs residual
connections within residual blocks. While long skip connections circumvent the residual
blocks before an element-wise sum with the prediction, a short skip connection exists
within each residual block, which is summed element-wise with the output of the channel
attention blocks. This model is described in detail in Section 3.4.2.
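The channel attention mechanism can be summarised in a short sketch: global average pooling shrinks each H×W feature map to a single value per channel, a small bottleneck derives one weight per channel, and the feature maps are rescaled accordingly. This is a generic illustration of the mechanism rather than the exact RCAN block, which is detailed in Section 3.4.2.

    import torch
    import torch.nn as nn

    class ChannelAttention(nn.Module):
        """Squeeze H x W x C feature maps to 1 x 1 x C descriptors, derive a
        per-channel weight and rescale the input feature maps with it."""

        def __init__(self, channels: int = 64, reduction: int = 16):
            super().__init__()
            self.pool = nn.AdaptiveAvgPool2d(1)  # (N, C, H, W) -> (N, C, 1, 1)
            self.gate = nn.Sequential(
                nn.Conv2d(channels, channels // reduction, kernel_size=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(channels // reduction, channels, kernel_size=1),
                nn.Sigmoid(),
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return x * self.gate(self.pool(x))   # channel-wise rescaling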
Generative Adversarial Networks GANs are characterised by a game-theoretic approach
in which two networks are in competition with each other; the two networks are known as
the generator and the discriminator. A related formulation, the "adversarial autoencoder",
is described by Makhzani et al. (2016). While many implementations and different types exist, the general outline
is that the generator predicts a fake (super-resoluted) image, either based on random noise
or an input. The resulting prediction is fed to the discriminator, which returns a perfor-
mance score. The discriminator gets trained on the target as well as the fake image, lead-
ing to an improved ability in distinguishing real from fake inputs. Based on the feedback
from the discriminator the generator can be trained to generate better predictions. This
mutual improvement ideally continues until a balance is achieved, where the discrimina-
tor can not differentiate between 'real' and 'fake' inputs any more (Creswell et al., 2018).
Within this general framework, the technique has been adapted for many use cases (Wang
et al., 2017), including remote sensing (Ben Hamida et al., 2018; Tao et al., 2017; Li
et al., 2021). Adapted for SR, GANs have been shown to provide benefits over other
deep learning methods, especially when handling noisy inputs (You et al., 2020). These
improvements are mainly visual, though; the use of GANs can also potentially introduce
hallucination-like artifacts. The main difference between SR GANs is the structure of
the generator, which can be designed after many possible models, while the discriminator
network adds the adversarial element to the loss function. Since the generator of a GAN
is capable of performing SR on its own, many authors include the performance of their
generator individually in their papers, in addition to the generator-discriminator combi-
nation to show how the adversarial loss influences the prediction (Ledig et al., 2017). A
GAN model is described in detail in Section 3.4.3.
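The alternating optimisation can be sketched as a single simplified training step with a binary adversarial loss and an additional pixel-wise term for the generator; the actual SRGAN losses and weighting used in this thesis are described in Sections 3.3 and 3.4.3, so the sketch below is only illustrative.

    import torch
    import torch.nn as nn

    bce = nn.BCEWithLogitsLoss()

    def gan_step(generator, discriminator, g_opt, d_opt, lr_img, hr_img, adv_weight=1e-3):
        """One simplified adversarial training step."""
        # Discriminator: learn to separate real high-resolution images from fakes.
        d_opt.zero_grad()
        fake = generator(lr_img).detach()
        real_pred, fake_pred = discriminator(hr_img), discriminator(fake)
        d_loss = bce(real_pred, torch.ones_like(real_pred)) + \
                 bce(fake_pred, torch.zeros_like(fake_pred))
        d_loss.backward()
        d_opt.step()

        # Generator: fool the discriminator while staying close to the target.
        g_opt.zero_grad()
        sr = generator(lr_img)
        adv_pred = discriminator(sr)
        g_loss = nn.functional.l1_loss(sr, hr_img) + adv_weight * bce(adv_pred, torch.ones_like(adv_pred))
        g_loss.backward()
        g_opt.step()
        return d_loss.item(), g_loss.item()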
2.3 Super-Resolution in Remote Sensing
Given the aforementioned importance of spatial resolution for all tasks involving remote
sensing imagery, it is natural that many SR techniques have been applied to this kind
of imagery as well. SR methods as classified and described by Fernandez-Beltran et al.
(2017) use approaches such as image reconstruction to mitigate the effect of interpola-
tion. The authors also state that learning-based approaches, which seek to approximate
the degradation function as outlined in Section 2.1, are "probably one of the most relevant
research directions within the SR field" (Fernandez-Beltran et al., 2017, p.322). This re-
search has heavily borrowed from SR methodologies developed in the computer vision
field and transfers this knowledge to the domain of remote sensing. Following the sim-
plified taxonomy of the aforementioned authors, the models presented in this project can
be classified as a ”mapping” technique, considering SR as a ”regression problem between
high-resolution and low-resolution spaces” (Fernandez-Beltran et al., 2017, p.323). A fi-
nal comparison drawn by these authors uses a set of different SR methodologies, some of
which have already been introduced earlier, on a synthetic dataset. The dataset consists
of aerial images, as opposed to satellite imagery, and applies different degradation func-
tions to these images to obtain low-resolution versions of the images. Because of these
differences, this overview can serve as an entry point to SR in remote sensing but is not
transferable to real-world scenarios on multi-sensor SR imagery, as researched for this
project.
Notable examples in the subcategory of real-world SR applications include Li and Li
(2021), who created a globally applicable SRGAN that is trained on degraded Sentinel-2
images. The authors showed that the GAN-based approach outperforms standard and
attention-based networks, which is one of the reasons why a GAN is one of the models
implemented in this project. Other examples are Pouliot et al. (2018), Galar et al. (2019),
Haut et al. (2019), and Lu et al. (2019).
A large issue in remote sensing remains the disconnect between the super-resoluted infor-
mation and the physical phenomena that the original data represent. Some methods are
less likely to obfuscate these phenomena, while more imaginative models such as GANs
can have the effect of adding textural inaccuracies (Bhadra et al., 2021).
2.3.1 Super-Resolution across Sensors and Sensor Types
Some authors perform SR across sensors: Wang et al. (2021) created a dataset called
OLI2MSI that contains 30m Landsat imagery that is super-resoluted to the 10m resolu-
tion of the Sentinel-2 MSI instrument. While the GAN presented in this paper is im-
pressive and shows promising results, it is of limited interest within the focus of this
project of delivering a real-world benefit. The Landsat and Sentinel data in their study
are both freely available and offer a high temporal resolution; therefore, it is of no real use
to super-resolute Landsat imagery when it is possible to download the higher-resolution
Sentinel data instead. A real benefit for the remote sensing community is only provided
if the SR product offers an added value by approximating data that is unavailable, either
because it is cost-prohibitive or a suitable sensor does not exist. Closer to real-life
applications are therefore studies that apply SR techniques to data from different sensors
and perform SR in a way that the product delivers an added value to the remote
sensing community. The added value consists of transforming the low-spatial but high-
temporal-resolution information from a certain sensor type (like Landsat or Sentinel-2) to
high-spatial and high-temporal information.
This shift from a theoretical approach based on synthetic datasets towards real-life appli-
cations has seen an increase in attention recently. For example, a dataset has just been
released, containing multispectral Sentinel-2 and Venµs 5m resolution images (Michel
et al., 2022). This dataset has the potential to enable a comparison between models, but
the release date (15th of May 2022) was too late for it to be considered in this project. Ad-
ditionally, the Venµs satellite mission only covers selected regions of interest, prohibiting
this exact method to be applied worldwide.
SR within the previously described usefulness characteristics has most notably been ap-
plied within the SR4RS toolbox implementation by Cresson (2022), where SPOT-6 im-
agery has been used to train the SRGAN of Ledig et al. (2017) to super-resolute Sentinel-2
imagery. Razzak et al. (2021) also perform a sophisticated GAN-based SR method, using
the SpaceNet-7 (Van Etten et al., 2021) dataset based on 4.77m resolution PlanetScope
images to super-resolute multispectral Sentinel-2 data. Another example is the study
conducted by Galar et al. (2019), being one of the first examples of implementing across-
sensor SR. In this example, RapidEye 5m imagery was used to super-resolute Sentinel-2
RGB imagery. Additionally, approaches exist that merge optical imagery with informa-
tion from different data types, such as radar imagery (Li et al., 2021; Garioud et al., 2021),
but the focus of this project is on optical data.
2.3.2 Multiple Image Super-Resolution in Remote Sensing
Extending on SISR, Multi-Image Super-Resolution (MISR), sometimes also called Multi-
Frame Super-Resolution (Deudon et al., 2020), includes several low-resolution images of
the same scene in the workflow. The general idea is that a set of depictions of the same
scene at different points in time contains more relevant information than a single image
(Kawulok et al., 2020). Information on how the ground changes can contain information
on the appearance of the ground itself; e.g., knowing that frequent changes in vegetation
cover occur shows that the area is most likely agricultural. Also, image and sensor
noise as well as atmospheric disturbances can be identified and averaged out by the
inclusion of multiple images. While algorithms have been implemented to, for example,
include the following and previous frames of a video into the SR (Haris et al., 2019), several
remote sensing SR workflows also include more than one image as their basis for
SR. ESA's PROBA-V Super Resolution Competition on time series imagery in 2018 kick-
started several remote sensing SR research projects focusing on MISR. In this case, the
data was neither preprocessed nor cleaned, meaning the participants in these challenges
had to take these factors into account, leading to many different fusion methodologies
not uniquely focusing on merging the images (Razzak et al., 2021). It can be seen that
most approaches fuse the images after an initial feature extraction (Deudon et al., 2020;
Salvetti et al., 2020). Deudon et al. (2020) implement a special fusion technique in their
HighRes-Net, which recursively fuses the closest temporal images in the feature space
until all images have been fused to the H×W×RGB dimension. The result of that fusion
is fed into SISR workflows that have already been presented. In this project, the generator
of Ledig et al.'s (2017) SRGAN is utilized since it has been identified as the model with the
highest reconstruction accuracy.
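The recursive fusion idea can be sketched as follows: encoded low-resolution acquisitions are fused pairwise, temporally closest first, until a single feature map remains, which is then handed to a standard SISR head. Module and function names below are hypothetical and only illustrate the principle; the concrete architecture used here is described in Section 3.4.4.

    import torch
    import torch.nn as nn

    class PairwiseFusion(nn.Module):
        """Fuse two encoded low-resolution views of the same scene into one."""

        def __init__(self, channels: int = 64):
            super().__init__()
            self.merge = nn.Sequential(
                nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
            )

        def forward(self, a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
            return self.merge(torch.cat([a, b], dim=1))

    def recursive_fuse(encoded, fuse):
        """Recursively halve a list of encoded (N, C, H, W) views until one remains.

        The list is assumed to be ordered by temporal distance to the reference."""
        while len(encoded) > 1:
            if len(encoded) % 2 == 1:     # duplicate the last view if the count is odd
                encoded = encoded + [encoded[-1]]
            encoded = [fuse(encoded[i], encoded[i + 1]) for i in range(0, len(encoded), 2)]
        return encoded[0]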
3 Data and Methods
3.1 Study Area
The study area is the historic region of Brittany (Figure 2) on the western coast of France.
Situated on a latitude of approximately 48° North bordering the Atlantic Ocean, the penin-
sula is mild in temperature with a small annual range without freezing winters. Classified
as oceanic climate (Cfb) by the Köppen-Geiger scale (Peel et al., 2007), the natural in-
land vegetation mainly consists of humid inland forests and thickets. The coast is quite
fragmented by the presence of the Gulf of Morbihan in the south and several large estu-
aries along the shoreline. While the center of Brittany reaches up to 240m in altitude, the
coastal areas are quite flat with only moderate, slow-climbing hills (Service, 2022).
Most densely populated is the shoreline, with cities like Saint-Malo, Saint-Brieuc, Brest,
Quimper, Lorient and Vannes (from northeast to south). The largest population center,
Rennes, with a population of over 220,000 inhabitants is situated inland in the east.
Since the satellite imagery covers the whole extent of Brittany, the land cover is very important
to understand how heterogeneous the dataset is intrinsically, and what different kinds
of land surfaces are present. As seen in Table 1, the large majority of Brittany is covered
by agricultural areas. Almost 13% is covered by forests or other seminatural vegetation,
while just over 1.1 percent is covered by water and (tidal) wetlands.
Since the high-resolution dataset stems from the summer of 2018, this period is selected
as the study period. The exact dates are described in the corresponding dataset descrip-
tions (Section 3.2.1).
                               Square Kilometers   Percentage
Agricultural Areas             21792.3             78.9
Artificial Surfaces            2023.6              7.3
Forest and seminatural Areas   3512.0              12.7
Water Bodies                   119.3               0.4
Wetlands                       185.2               0.7
Total                          27632.5             100
Table 1: Area of CORINE Land Cover classes in Brittany.
Figure 2: CORINE Land Cover of 2018, showing highest-taxonomy classifications as
well as Brittany's location within metropolitan France.
3.2 Data
Other than some auxiliary data like the CORINE Land Cover, this project is based on
public domain Sentinel-2 imagery funded by the EU Copernicus Programme and operated
by ESA, and SPOT-6 imagery kindly provided by IGN. While the SPOT-6 imagery in general is
proprietary, the images have been extracted from the mosaic of France published annually.¹
3.2.1 Satellite Data
Data from two different satellite platforms is used in this thesis. The source for low-spatial
but high-temporal resolution data is the Sentinel-2 program launched by ESA, while the low-
temporal but high-spatial resolution data is acquired by the SPOT-6 satellite operated by the
French national space agency, the Centre National d'Études Spatiales (CNES). The SPOT-6
data has been provided by the National Institute of Geographic and Forest Information
(IGN).
¹ IGN data published under the INSPIRE directive: https://geoservices.ign.fr/services-web-inspire (May 2022)
Sentinel-2 Within the broader Sentinel program of EO satellites by ESA, the Sentinel-
2 mission launched two satellites in 2015 and 2017. The sensors, called MultiSpectral
Instrument (MSI), are optical wide swath sensors with 13 bands. Both satellites are struc-
turally identical and fly on the same orbit but are phased at 180°, enabling a revisit time of
5 days at the equator. The minuscule differences in radiometric resolution between
the sensors are for most intents and purposes negligible.
The data was acquired via Theia, which is a data and services center founded by sev-
eral French public and research institutions in order to expedite the availability and usage
of imagery, as well as to ensure data quality and continuity. The imagery is Level-2A cor-
rected already and therefore represents the atmospherically corrected ground reflectance.
Theia uses the Maccs-Atcor Joint Algorithm (MAJA) preprocessing algorithm developed
by CNES, CESBIO (Centre d'Études Spatiales de la Biosphère) and DLR (Deutsches
Zentrum für Luft- und Raumfahrt), which is an evolution of the MACCS (Multi-sensor
Atmospheric Correction and Cloud Screening) algorithm. MAJA uses the temporal res-
olution to create cloud and snow masks, but also estimates atmospheric absorption and
aerosol optical thickness to approximate the ground reflectance. The MAJA algorithm has
been shown to be more precise than its more widely used competitor Sen2Cor (Baetens
et al., 2019). The data value range after correction is between 0 and 1, but is stretched to
between 0 and 10,000 to enable more efficient storage as raster files.
Spatial Resolution (m)   Band   Central Wavelength (nm)   Bandwidth (nm)   Description
10                       2      492.4                     66               Blue
10                       3      559.8                     36               Green
10                       4      664.6                     31               Red
10                       8      832.8                     106              NIR
20                       5      704.1                     15               Vegetation red edge
20                       6      740.5                     15               Vegetation red edge
20                       7      782.8                     20               Vegetation red edge
20                       8a     864.7                     21               Narrow NIR
20                       11     1613.7                    91               SWIR
20                       12     2202.4                    175              SWIR
60                       1      442.7                     21               Coastal Aerosol
60                       9      945.1                     20               Water Vapour
60                       10     1373.5                    31               SWIR - Cirrus
Table 2: Sentinel-2A bands.
The many bands of the MSI sensor, outlined in Table 2, enable users from many fields
to incorporate this data into their workflows. In this analysis, the focus is only on the
RGB bands with a spatial resolution of 10 meters.
For the analysis in this thesis, the Theia API has been queried for Level-2A data between
the 1st of April 2018 and the 31st of August 2018, filtering for images with a cloud cover
under 5%. Since the SPOT-6 data was acquired in the summer of 2018 (Figure 4), the
Sentinel-2 time period was chosen to include images from before and after the SPOT-6
acquisitions while staying in a range of similar seasonal and climate conditions.
Figure 3: Footprints of Sentinel-2 tiles that have been queried for images under 5% cloud
cover for the study period.
SPOT-6 The SPOT-6 platform was launched in 2012 within the commercial Satellite
Pour l’Observation de la Terre (SPOT) satellite program of CNES. Since then, it delivers
high-resolution EO imagery for commercial customers. Since the customers are able to
schedule acquisition dates and areas to their needs, there is no fixed revisit rate for larger
areas of the earth. In theory, the revisit time could be as short as one day (ASTRIUM, 2012), but
only if the sensor focuses on a certain area and neglects all others. With a daily acquisition
capacity of 6 million square kilometers, it could theoretically depict the whole surface of
the Earth in just under 100 days, if the scheduling of customer missions did not inhibit
the acquisitions.
As outlined in Table 3, the sensor covers only the visible spectrum plus a NIR band with a
resolution of 6 m. The panchromatic band, because of the higher radiation intensity collected
over a large spectral window, can acquire high-resolution 1.5 m images. After pansharpening,
the final product used in this project consists of RGB images with a spatial resolution of 1.5 m. With an 8-
bit value range, the product is a radiometrically and geometrically corrected L2A image.
The SPOT-6 dataset consists of 3352 RGB image tiles with acquisition dates between the
19th of April and the 3rd of August 2018 (Figure 4).
Figure 4: Footprints and acquisition dates of SPOT-6 image tiles.
Spatial Resolution (m)   Spectral Band (nm)   Description
1.5                      450-745              Panchromatic
6                        450-525              Blue
6                        530-590              Green
6                        625-695              Red
6                        760-890              NIR
Table 3: Spectral bands and resolutions of the sensor onboard the SPOT-6 platform.
Ambiguity of Spatial Resolution The first and easiest metric of spatial resolution is
the raster cell size of the obtained image product. This metric can differ quite significantly
from the actual physical processes going on from the moment the energy is
reflected off the ground and captured in the sensor. Firstly, assuming a perfectly nadir
image, the flight height and the field of view of the sensor on board a satellite determine
its Ground Resolved Cell (GRC). This cell is the physical area from which optical energy
is detected by a singular cell on the sensor. The size of this cell is described as the Ground
Resolved Distance (GRD), and it measures the radius of the area of the GRC. Modern
satellites with push-broom acquisition technology like Sentinel-2 and SPOT-6 have, due
to the design of their sensors, different GRDs across-track and along-track in relation to
the orbital path of the satellite. Additionally, the slightly off-nadir edges further influence
the GRD (Figure 5).
Figure 5: Schema showing Ground Resolved Cell (GRC), Ground Resolved Distance
(GRD) and Ground Sampling Distance (GSD).
The GRD and GRC determine the Ground Sampling Distance (GSD), which is the dis-
tance between the centers of the GRCs. Ideally, GSD and GRD would be equal. In reality
and most notably in off-nadir conditions, the GRD is usually larger, leading to spectral
mixing at pixel-level. Before being delivered to the end-users, the data is resampled to
the spatial grid present in the final product. Therefore, the GSD reports a lower resolution
than the actual final raster grid size. These physical factors and postprocessing methods
are sensor-specific, meaning the spatial resolution of image rasters can be more or less
affected by the issues above.
The SPOT-6 sensor has a relatively high discrepancy between the GSD and the advertised
pixel size. The multispectral bands have a median GSD of 8.8m, contrary to the product
raster pixel size of 6m. The pansharpened band is published in a raster size of 1.5m while
the median GSD is closer to 1.75m (ASTRIUM, 2012). Practically, this means that real-
world features of a size of 6 m can not be distinguished with the multispectral bands of
the SPOT-6 sensor, even though the pixel size might theoretically allow it. While
the exact figures are not published for ESA's Sentinel-2 mission, the modulation transfer
function is given as 0.15 to 0.3 for the 10 m bands (ESA, 2022). This function models
the low-pass filtering effect introduced by the image degradation, enabling a comparison
between sensors and an estimation of how well the GSD correlates to the given pixel raster
size (Faran et al., 2009).
One of the reasons for the downsampling performed in Section 3.2.2 is to bring the
raster cell size closer to the median GSD of the pansharpened SPOT-6 product, so that
the raster no longer suggests finer detail than the sensor actually resolves. This is expected
to mitigate the problems introduced by the above-mentioned mechanisms.
Image Georeferencing The satellite imagery providers georeference their images be-
fore publishing. In the case of the Level-2A products used in this project, the data is also
already geometrically corrected. While the Sentinel-2 mission aims for a georeferencing
accuracy of 3 meters for 95 percent of the images, the real accuracy is currently closer
to 12 meters for 95 percent of the acquisitions (Pandˇ
zic et al., 2016). The SPOT-6 im-
agery has been verified to be accurate within 2.5 meters (Sukojo and Ode Binta, 2018).
Since image patches from the two sensors are overlayed and fed into the networks, it is
important that the physical features align, meaning that they occupy the same spatial cells
in the image rasters. Figure 6 shows the extraction of the edges using the Canny Edge
Detector (Canny, 1986) for an interpolated Sentinel-2 image and the according SPOT-6
image. The detected edges are overlayed in Figure 7. While some differences in the edges
exist, these sections that were identified in both images are within several pixels (3-5) to
each other. Since the grid size is 2.5x2.5 meter in the visualization, the small distance
between the edges is not larger than the low-resolution pixel size. Since the job of the
super-resolution is to find the exact location of the edges and place them accordingly, the
small distance is not expected to cause problems further down the line. Even if larger dis-
crepancies exist, the shifts are expected to occur in the same direction, since the relative
accuracy of the referencing is quite high. These shifts could be learned by the network.
For other datasets with larger pixel sizes and inaccuracies, other authors additionally train
networks to correct for possible misalignments (e.g. ’ShiftNet’ by Razzak et al. (2021)).
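A check of this kind can be reproduced with OpenCV in a few lines; the thresholds below are illustrative and would need tuning per image, and the helper is not part of the thesis pipeline.

    import cv2
    import numpy as np

    def edge_overlay(s2_rgb: np.ndarray, spot6_rgb: np.ndarray) -> np.ndarray:
        """Overlay Canny edges of an interpolated Sentinel-2 patch (red channel)
        and the corresponding SPOT-6 patch (green channel). Both inputs are
        8-bit RGB arrays of identical size, e.g. 300 x 300 pixels at 2.5 m."""
        s2_edges = cv2.Canny(cv2.cvtColor(s2_rgb, cv2.COLOR_RGB2GRAY), 50, 150)
        spot_edges = cv2.Canny(cv2.cvtColor(spot6_rgb, cv2.COLOR_RGB2GRAY), 50, 150)
        overlay = np.zeros((*s2_edges.shape, 3), dtype=np.uint8)
        overlay[..., 0] = s2_edges    # red: Sentinel-2 edges
        overlay[..., 1] = spot_edges  # green: SPOT-6 edges
        return overlay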
Figure 6: Canny Edge detection for an interpolated Sentinel-2 and SPOT-6 image.
Figure 7: Overlay of detected edges for a Sentinel-2 (red) and the corresponding SPOT-6
(green) image.
Intersection of Sentinel-2 and SPOT-6 data Given the high temporal resolution of the
Sentinel-2 data, the vast majority of tiles have a very small difference from the SPOT-6
image to the temporally closest Sentinel-2 image, with over 92% of the SPOT-6 images
having a corresponding Sentinel-2 image with a difference of 10 days or closer (Fig-
ure 9). The temporal difference of the fourth-closest image of the Sentinel-2 time series
is 65 days, with 50% of the fourth-closest images being closer than 30 days. The minimal
number of Sentinel-2 images in the time series is 8; the maximum is 26 (Figure 8). If a
harvest, which happens regularly in the summer, occurs between the two acquisition dates,
a lot of noise is introduced into the dataset. The SPOT-6 acquisition dates are provided by
IGN, but due to their preprocessing some image aggregations might have been performed.
The given dates might therefore not correspond exactly to the acquisition date of subsets
of the images. Still, we assume this date as the acquisition date for the following analy-
sis. Figure 10 shows the temporal difference for the 4 closest Sentinel-2 acquisitions in
relation to the reference SPOT-6 image.
Figure 8: Amount of overlapping Sentinel-2 images at each patch location.
Figure 9: Minimal time difference between Sentinel-2 and SPOT-6 acquisitions per tile.
Figure 10: Time difference between the acquisitions for the 4 temporally closest Sentinel-
2 images to the reference SPOT-6 image. The time difference here is calculated for each
750x750m patch that has been saved to disc. The x axis shows the time differences of
each image for the given temporal distance, while the y axis shows the amount of image
patches.
3.2.2 Preprocessing
While the satellite imagery itself is atmospherically corrected already, some preprocessing
operations specific to this project need to be applied. Not only do the images need to
be brought into a format and resolution suitable for their use in deep learning, but also
the intrinsic spectral differences of their sensors need to be taken into consideration and
matched to each other.
Preprocessing of image files To reach the target SR factor, the SPOT-6 data is down-
sampled with a bicubic interpolation from 1.5m to 2.5m, which in relation to the 10m
Sentinel-2 resolution brings the factor down from 6.66 to 4. The downsampling is expected
to mitigate the issues outlined in Section 3.2.1 by bringing the raster cell size closer to
the median GSD of the pansharpened SPOT-6 product.
Chopped into 3×75×75px and 3×300×300px blocks for Sentinel-2 and SPOT-6 respec-
tively, the image patches are saved to disc via windowed reading. These patches cover
an area of 750×750 meters. For each coordinate window, all available Sentinel-2 ac-
quisitions are saved if they do not have a majority of undefined pixels. The images are
distributed by Theia using a tiling grid that does not perfectly overlap with the acquisition
strip of the Sentinel-2 sensor. Hence, it is possible that only a small section of the tile
contains data. By dividing by 10,000 for Sentinel-2 and 255 for SPOT-6, the images are
brought to the same value range between 0 and 1.
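A hedged sketch of this windowed patch extraction with rasterio is shown below; paths, the handling of the tiling grid and the no-data test are simplified compared with the actual preprocessing pipeline.

    import rasterio
    from rasterio.windows import Window

    def extract_patches(path: str, patch_px: int, out_prefix: str, nodata: int = 0):
        """Read a raster in patch_px x patch_px windows and save every patch that
        is not dominated by undefined pixels.

        patch_px is 75 for Sentinel-2 (10 m) and 300 for the resampled SPOT-6
        data (2.5 m), so that each patch covers 750 x 750 m on the ground."""
        with rasterio.open(path) as src:
            for row in range(0, src.height - patch_px + 1, patch_px):
                for col in range(0, src.width - patch_px + 1, patch_px):
                    window = Window(col, row, patch_px, patch_px)
                    patch = src.read(window=window)          # (bands, patch_px, patch_px)
                    if (patch == nodata).mean() > 0.5:        # skip mostly undefined patches
                        continue
                    profile = src.profile.copy()
                    profile.update(height=patch_px, width=patch_px,
                                   transform=src.window_transform(window))
                    with rasterio.open(f"{out_prefix}_{row}_{col}.tif", "w", **profile) as dst:
                        dst.write(patch)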
3.2.3 Matching Sensor Values
Unlike in the majority of the SR literature, the dataset in this case is not synthetic. Therefore,
special care needs to be taken when preparing the data to be fed into the models. Stemming
from different sensors, the images and their values can be quite different. Not only
do changes on the ground introduce noise, but so do the differences in weather and radiation
conditions at the time of acquisition, as well as the differences in the spectral bands of the
two sensors (Table 2, Table 3). Bridging these gaps and making the images spectrally as
similar as possible is a priority, so that the models do not have to learn a color mapping
before performing actual SR. Because many plausible color mappings exist that in theory
could be true, this is an ill-posed problem. If the reconstruction loss metrics are too
large and the model has problems finding the correct color mapping, it is likely that the
model will never improve. Therefore, it is important to adapt the SPOT-6 histogram
as closely as possible to the Sentinel-2 distribution. Since the high-resolution data is not
present in the inference stage, it is more useful to perform the matching in this direction
than the other way around.
Another possibility is to first train on a higher-resolution image and its downsampled
version and to then fine-tune the model on the desired data, as implemented
by Galar et al. (2019). This method has the drawback that the model is mainly
trained on a different data source, potentially picking up sensor-specific noise and overfitting
towards the spectral bands of the first sensor, hindering the transferability towards another
sensor.
Standardization Normalization, bringing the inputs into the same value range, is a cru-
cial step in all machine learning algorithms using gradient descent. The common MinMax-
normalization is achieved by dividing the values by the original value range to bring the
data between 0 and 1, with the original value ranges being 255 for SPOT-6 and 10,000 for
Sentinel-2. This operation is done for each color band separately. Standardization goes a
bit further, by setting the mean of the data to 0 and the standard deviation to 1. While this
assumes that the data follows a Gaussian distribution, it helps to achieve convergence
by creating a more even loss surface. While in theory the networks are able to learn the
standardization and reach the same result, it has been shown that standardization helps the
model to converge better and faster as well as reach improved results in practice (Shanker
et al., 1996).
$$x' = \frac{x - \mu}{\sigma} \tag{3}$$
To standardize the data, the mean (µ) is subtracted from the image (x), and the result is divided
by the standard deviation (σ). In practice, with over 5.4 billion pixels per channel for the
whole dataset, it is not feasible to accumulate all values in order to find the standard deviation.
Therefore, the statistics are calculated per image patch. While the mean of the per-patch
means is the same as the overall mean, this is not the case for the standard deviation.
The median of the standard deviations of the image patches is therefore used to perform
the overall standardization, which might introduce a small error by not scaling with the
exact overall standard deviation.
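A small numpy sketch of this per-patch approximation, assuming the patches are already available as arrays of equal size (the helper names are illustrative):

    import numpy as np

    def dataset_statistics(patches):
        """Approximate per-band statistics over a list of (C, H, W) patches.

        The overall mean equals the mean of the per-patch means (equal patch
        sizes); the overall standard deviation is approximated by the median of
        the per-patch standard deviations."""
        means = np.stack([p.mean(axis=(1, 2)) for p in patches])  # (n_patches, C)
        stds = np.stack([p.std(axis=(1, 2)) for p in patches])
        return means.mean(axis=0), np.median(stds, axis=0)

    def standardize(patch, mean, std):
        """Apply Equation 3 band-wise: subtract the mean, divide by the standard deviation."""
        return (patch - mean[:, None, None]) / std[:, None, None]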
Figure 11 shows the effect of standardization on the images. While the data is successfully
mean-centered and matched in its standard deviation, it is clear that there is a color
mismatch. While this mismatch can in principle be learned by the network, harvested fields
or other changes might cause it to fail to find a color mapping. This could prevent the
network from occupying itself with the actual SR, instead focusing too much on matching
the colors of the images.
Figure 11: Standardization applied to the images, with the corresponding histograms and CDF.
The CDF displayed has been calculated for a grayscale conversion of the image for demonstration
purposes, while the images show the originals.
Moment Matching In order to bring the value ranges closer together, a statistical mo-
ment matching was also implemented. This is done by moving the mean and standard
deviation of the SPOT-6 data to match the Sentinel-2 data. This technique has seen ap-
plication for this similar problem by e.g. Gadallah et al. (2000), where a linear difference
between two datasets is assumed.
$$\hat{y} = \frac{y - \mu_y + \mu_x}{\sigma_y^2} \tag{4}$$

Following the notation of Section 2.1, y denotes the SPOT-6 image and x the Sentinel-2
image. To obtain the matched image ŷ, the SPOT-6 mean (µ_y) is subtracted from the SPOT-6
image to mean-center the data, the Sentinel-2 mean (µ_x) is added to shift the values to the
Sentinel-2 mean, and a division by the SPOT-6 variance (σ²_y) is performed to adapt the
spread of the values.
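A band-wise numpy sketch of moment matching in its common form is given below, where the source is mean-centred, rescaled by the ratio of standard deviations and shifted to the reference mean; this is an illustration of the general technique rather than a literal transcription of Equation 4.

    import numpy as np

    def moment_match(source: np.ndarray, reference: np.ndarray) -> np.ndarray:
        """Match the per-band mean and standard deviation of `source` (SPOT-6)
        to those of `reference` (Sentinel-2). Both arrays have shape (C, H, W)."""
        matched = np.empty_like(source, dtype=np.float64)
        for b in range(source.shape[0]):
            src, ref = source[b].astype(np.float64), reference[b].astype(np.float64)
            matched[b] = (src - src.mean()) / (src.std() + 1e-8) * ref.std() + ref.mean()
        return matched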
Figure 12: Moment matching applied to the images, with the corresponding histograms and CDF.
The CDF displayed has been calculated for a grayscale conversion of the image for demonstration
purposes, while the images show the originals.
As seen in Figure 12, this method adapts the color to a much larger degree than the
standardization. The centering of standardization and moment matching is the same,
leaving the shape of the Cumulative Distribution Function (CDF) of both images unchanged as well.
Shifting the whole SPOT-6 histogram towards the Sentinel-2 histogram did have the desired
effect on color, while keeping the original variance within the data intact. A two percent
MinMax stretching has also been applied to the image to eliminate outliers. Moment
matching has been used by Cresson (2022) for a very similar approach, where an SRGAN
is trained to perform SR for the same sensor combination as in this study.
Histogram Matching The final matching technique is histogram matching, sometimes
called histogram specification, which adapts the CDF of the source image to match that of
the reference exactly. While histogram equalization matches the histogram to a uniform
distribution, histogram matching adapts the contrast of the source image to a reference
image. This process has found widespread use for normalizing large image datasets in the
medical field and for the calibration of imaging systems (Gonzalez et al., 2009, 3.17). Each
pixel value in the source image is mapped to the pixel value of the reference image that has
the equivalent cumulative probability value. The Probability Density Function (PDF)
(Equation 5) denotes the probability of a pixel value occurring in a histogram bin, dividing
the number of occurrences of this value (n_j) by the total number of pixels (n). This PDF
can then be mapped into a CDF (Equation 6) by cumulatively summing over the values,
giving the probability of any given pixel to fall between 0 and the value in question.
Histogram matching has been used with success in the SR of Sentinel-2 imagery with a
ground truth from a different sensor by Galar et al. (2019) and Deudon et al. (2020).
$$\mathrm{PDF}(j) = \frac{n_j}{n} \tag{5}$$

$$\mathrm{CDF}(k) = \sum_{j=0}^{k} \frac{n_j}{n} \tag{6}$$
To perform histogram matching, each pixel value of the source image (X) is mapped to
the value of the reference image (Y) that has the same cumulative probability, so that their
CDFs are identical:
$$\mathrm{CDF}_X(k) = \sum_{j=0}^{k} \frac{n_j}{n}, \qquad \mathrm{CDF}_Y(k) = \sum_{j=0}^{k} \frac{n_j}{n}, \qquad \text{so that} \quad \mathrm{CDF}_X = \mathrm{CDF}_Y \tag{7}$$
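Band-wise histogram matching of this kind is available off the shelf, for example in scikit-image; a minimal sketch is shown below (the thesis dataloader applies the equivalent operation to every SPOT-6 patch; depending on the scikit-image version the keyword may be multichannel instead of channel_axis).

    import numpy as np
    from skimage.exposure import match_histograms

    def match_spot6_to_sentinel2(spot6: np.ndarray, sentinel2: np.ndarray) -> np.ndarray:
        """Adapt the SPOT-6 patch so that its per-band CDF matches the Sentinel-2
        reference. Both arrays are expected in (H, W, C) layout, values in [0, 1]."""
        return match_histograms(spot6, sentinel2, channel_axis=-1)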
Figure 13 shows the result of the histogram matching process. Note that the matching is
applied separately to each band, while the shown histograms and CDF are displayed for
an RGB conversion of the original images. Examining the Sentinel-2 and SPOT-6 images,
it is clear that the color characteristics of the Sentinel image have been successfully applied
to the SPOT image.
Figure 13: Histogram matching applied to the images, with the corresponding histograms and CDF.
Comparison The three aforementioned matching techniques were applied to 10,000
randomly selected image pairs, after which the MAE and MSE (see Section 3.3) between
the interpolated low-resolution image and the high-resolution image are calculated. While
the error is naturally quite high since no SR has been performed yet, the decrease in error
metrics of the matching techniques compared with the unmatched errors shows which
method returns the closest match. Figure 14 shows that the histogram matching technique
returns the best result; therefore, this matching is applied within the dataloader for the
whole dataset.
Figure 14: MAE and MSE values for different matching techniques between the low- and
high-resolution images. For comparison purposes, the error between unmatched images
is also shown.
3.3 Losses and Performance Metrics
In order to train the networks, a loss function needs to be formulated that can judge the
performance and progress of the network. During the training phase, the weights are
constantly adapted by the gradient descent algorithm, which needs the loss function to
decide whether the output is improving or not and how to adjust the weights. Additionally,
the final performance of the model on a test set needs to be accurately evaluated using the
same metrics. The discussion on which losses perform best (Ballester et al., 2022; Jo
et al., 2020; Johnson et al., 2016a) for SR tasks is ongoing. Which final metrics to use to
select the best model and judge its performance is also up for debate, with some authors
going as far as to not trust any deterministic values but instead create a metric from public
polling (Ledig et al., 2017).
The two main types of losses and metrics are pixel-based losses, which compute the error
between the super-resoluted and original pixels, and perceptual losses and metrics, which
analyse the perceived structures and textures more akin to human perception.
MAE and MSE The most widely used loss metrics in computer vision, Mean Absolute
Error (MAE) and Mean Squared Error (MSE) are pixel-based losses that compare the
pixels of the predicted and reference images and state how different they are. While MAE
reports the mean of the absolute errors, MSE returns the mean of the squared errors.
Naturally, lower values indicate more similar images. These types of
losses are also called reconstruction loss.
$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|Y_i - \hat{Y}_i\right|, \qquad \mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(Y_i - \hat{Y}_i\right)^2 \tag{8}$$
The act of squaring the error heavily weighs outliers, punishing prediction images that
contain outliers more heavily. While mainly used as loss functions, MAE and MSE can
also be used as a metric of performance.
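In PyTorch these reconstruction losses correspond directly to the built-in L1 and MSE criteria; a brief illustrative sketch:

    import torch
    import torch.nn as nn

    mae = nn.L1Loss()   # Mean Absolute Error, Equation 8 (left)
    mse = nn.MSELoss()  # Mean Squared Error, Equation 8 (right)

    sr = torch.rand(1, 3, 300, 300)  # super-resoluted prediction (placeholder)
    hr = torch.rand(1, 3, 300, 300)  # SPOT-6 ground truth (placeholder)
    print(mae(sr, hr).item(), mse(sr, hr).item())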
These pixel-based metrics are easy to implement but very susceptible to color changes.
The data used in this project, stemming from different sensors, is intrinsically quite different
under pixel-based comparisons. The networks first have to learn the color mapping between
the input and target tensors, which additionally differ due to changes in the physical
world. Overcoming these high errors to focus on SR is a challenge.
PSNR Even though heavily criticised (Korhonen and You, 2012), Peak Signal-to-Noise
Ratio (PSNR) is still one of the most widely used quality metrics in SR. It sees heavy use
when comparing competing model performances (Anwar et al., 2021) and in competitions
(for example ESA's PROBA-V Super Resolution Competition in 2018).
PSNR is also pixel based, describing the ratio between signal strength of the reference
compared to the signal strength of the degraded version. It is logarithmic in nature and
therefore given in the decibel scale. A higher value indicates more similarity, should both
images be identical the PSNR would be not defined due to division by zero. It is defined
as
\[
\mathrm{PSNR} = 10 \cdot \log_{10}\!\left(\frac{\mathrm{MAX}^2}{\mathrm{MSE}}\right)
= 20 \cdot \log_{10}\!\left(\frac{\mathrm{MAX}}{\sqrt{\mathrm{MSE}}}\right)
= 20 \cdot \log_{10}(\mathrm{MAX}) - 10 \cdot \log_{10}(\mathrm{MSE})
\tag{9}
\]
where MAX represents the maximum value of the possible pixel value range. The MSE
is calculated individually for all color channels.
PSNR suffers from the same issues as the other pixel-based metrics, with the additional
problem that it is highly sensitive to the intrinsic differences of the dataset used in this
project, since it relies on the MSE. Color or signal differences between the images might
be due to the sensors, leading to high error metrics even when the perceptual SR works
to satisfaction. Also, when it comes to human perception, studies show that PSNR
correlates poorly with perceived image quality (Girod, 1993). In computer vision research
this metric has its established place in the evaluation step, but it has clear limitations in
the field of remote sensing (see Section 6).
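A minimal helper following Equation 9 could look as follows; it assumes reflectance values scaled to [0, 1] and computes a single MSE over all bands, whereas the per-channel variant described above would average channel-wise values.
\begin{verbatim}
import torch

def psnr(sr_pred: torch.Tensor, hr_true: torch.Tensor,
         max_value: float = 1.0) -> torch.Tensor:
    """PSNR in decibels, computed from the MSE of an image pair (Equation 9)."""
    mse = torch.mean((sr_pred - hr_true) ** 2)
    return 10.0 * torch.log10(max_value ** 2 / mse)
\end{verbatim}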
SSIM The Structural Similarity Index Measure (SSIM) (Wang et al., 2003) is a perception-
based metric ranging from 0 to 1, with higher values indicating a closer similarity. SSIM
builds on the assumption that spatially close pixels carry structural information and
are therefore interdependent. SSIM has been shown to be effective for the structural
comparison of images (Wang and Bovik, 2009). The window size investigated by SSIM
is a parameter to be set manually. Measuring the reconstruction quality of these local
features is intended to give feedback on the structural and textural perception of the
degraded and original image.
Given the ground truth $Y$ and prediction $\hat{Y}$ of image pair windows, SSIM is calculated as
follows (Wang et al., 2003):
\[
\mathrm{SSIM}(\hat{Y}, Y) =
\frac{\left(2\mu_{\hat{Y}}\mu_{Y} + c_1\right)\left(2\sigma_{\hat{Y}Y} + c_2\right)}
{\left(\mu_{\hat{Y}}^2 + \mu_{Y}^2 + c_1\right)\left(\sigma_{\hat{Y}}^2 + \sigma_{Y}^2 + c_2\right)}
\tag{10}
\]
where $\mu_x$ is the average of $x$, $\mu_y$ is the average of $y$, $\sigma_x^2$ is the variance of $x$, $\sigma_y^2$ is the
variance of $y$, and $\sigma_{xy}$ is the covariance of $x$ and $y$. $c_1$ and $c_2$ are variables that stabilize
the division in the presence of a small denominator, defined as $c_1 = (0.01 \cdot \text{dynamic range})^2$
and $c_2 = (0.03 \cdot \text{dynamic range})^2$. The formula describes the computation for one image
window, which in this study is 11×11 pixels. The overall structural similarity score of an
image is the average over all image windows.
While SSIM has also received some criticism, mainly because it is closely correlated with
the MAE (Dosselmann and Yang, 2011), it is widely used for measuring image restoration
quality (e.g. Huynh-Thu and Ghanbari (2012)).
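One way to compute SSIM with the 11×11 window used in this study is via scikit-image; the array shapes and the [0, 1] value range below are assumptions for illustration.
\begin{verbatim}
import numpy as np
from skimage.metrics import structural_similarity

# Hypothetical image pair as float arrays in [0, 1], shape (height, width, bands).
hr_true = np.random.rand(300, 300, 3)
sr_pred = np.random.rand(300, 300, 3)

# 11x11 window; scikit-image averages the window scores internally and
# combines the per-channel results.
score = structural_similarity(sr_pred, hr_true, win_size=11,
                              data_range=1.0, channel_axis=-1)
print(f"SSIM: {score:.3f}")
\end{verbatim}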
LPIPS It has been demonstrated that the hidden layers of CNNs carry meaning in a
way similar to how humans perceive visual information. Having this information for an
original and a reconstructed image enables the calculation of a distance between the two.
The Learned Perceptual Image Patch Similarity (LPIPS) (Zhang et al., 2018a) is one of
these metrics, focusing on human perception instead of pixel-based comparisons. The
authors show that ”networks trained to solve challenging visual prediction and modelling
tasks end up learning a representation of the world that correlates well with perceptual
judgments” (Zhang et al., 2018a, p.8). After calculating the deep embeddings of selected
convolutional layers of the network, the activations are normalized and scaled for both
the input and the reference image. Between the two image representations, the MSE is
calculated and averaged across channels and over the image (Figure 15, left). A small
network, trained to judge the perceptual distance, then returns the final judgement
(Figure 15, right).
LPIPS is a popular method to evaluate the results of SR (Jo et al., 2020; Ahn et al., 2022;
Ledig et al., 2017; Anwar et al., 2021) and has also seen usage in remote sensing SR
(Wang et al., 2021). While the original authors propose judgement based on the features
extracted from several networks, the LPIPS in this project is calculated based on the
AlexNet backbone (Krizhevsky et al., 2012).
Figure 15: Calculation of LPIPS. Image Source: Zhang et al. (2018a, p.6)
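In practice, the metric can be computed with the reference lpips package published by Zhang et al. (2018a); the tensor sizes below are illustrative, and the inputs are expected in the [-1, 1] range.
\begin{verbatim}
import torch
import lpips  # reference implementation accompanying Zhang et al. (2018a)

lpips_alex = lpips.LPIPS(net='alex')  # AlexNet backbone, as used in this project

# Illustrative image pair, shape (batch, 3, height, width), scaled to [-1, 1].
sr_pred = torch.rand(1, 3, 300, 300) * 2 - 1
hr_true = torch.rand(1, 3, 300, 300) * 2 - 1

distance = lpips_alex(sr_pred, hr_true)  # lower value = perceptually more similar
print(distance.item())
\end{verbatim}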
3.3.1 Validity of Performance Metrics in Super-Resolution
A fundamental problem in SR is finding robust performance metrics to judge the quality
of the reconstruction (Jo et al., 2020). Metrics such as MSE and MAE are often employed
for SR models (Anwar et al., 2021) and compute the differences between the pixels of the
original and the super-resoluted image. For this study, these metrics can be useful to judge
whether the models can learn the color mapping between the images of the different sensors.
Especially when it comes to high-frequency information, these pixel-based methods are
inadequate for judging textural information and optimize for smooth, blurry results that
are not necessarily perceptually similar (Bruna et al., 2016). On the other hand, perceptual
losses such as LPIPS perform well for traditional computer vision SR applications, but
are prone to valuing perceptual quality higher than staying true to the original reflectance
values (Johnson et al., 2016b). In remote sensing, staying true to the original reflectance
data and the physical phenomena they represent is crucial. Optimizing for perceptually
beautiful images is therefore not ideal for producing valid super-resoluted images either.
Ledig et al. (2017) go as far as having images judged by humans instead of trusting the
performance metrics. In the field of remote sensing, a balance between pixel-based and
perceptual metric performance needs to be found that produces imagery valid for analysis.
Another interesting approach is to verify the SR images against a specific application,
checking whether the super-resoluted product improves its performance
(Razzak et al., 2021).
3.4 Model Selection, Model Architectures and Hyperparameters
Out of the many SR models present in computer vision and remote sensing literature,
the three models described in this thesis have been selected for different reasons. The SR-
CNN, which is the pioneering model of SR using convolutional layers, is very simple in its
structure and easy to train. Additionally, it has been identified by Fernandez-Beltran et al.
(2017) as a viable solution for remote sensing SR. The very deep and residual channel-
attention structure of the RCAN is intended to pick up on more complex features and
expected to be deep enough to handle the intrinsic variety in the dataset. The SRGAN is
selected in part because of its generative adversarial approach. Additionally, it has
been successfully implemented for remote sensing applications by Wang et al. (2017) and
has shown its capability of performing SR across different sensors in the work of Cresson (2022)
for the same Sentinel-2 to SPOT-6 workflow. The SRGAN, or sometimes only the
generator of this network, has consistently performed well in remote sensing applications
and is used as a baseline against which new models are compared (Razzak et al., 2021).
The following section expands on the model descriptions given in Section 2.2 and details
their variables and hyperparameters.
3.4.1 SRCNN
The SRCNN (Dong et al., 2014), already briefly introduced in Section 2.2, is the most basic
model used in this project. Consisting of only three convolutional layers and a ReLU
activation (Figure 16), it is fast to train but less accurate than other models (Anwar et al.,
2021). As input, it takes a low-resolution image interpolated to the target resolution.
The authors propose versions in both the RGB and the YCbCr color space (Dong
et al., 2014, p.192, Implementation Details), which are both implemented in this project.
Dong et al. (2016) use an MSE loss for their training, which is here also substituted with MAE
and SSIM losses. Originally, an Adam optimizer (Kingma and Ba, 2017) is implemented
by the authors, which was also swapped with a Stochastic Gradient Descent (SGD)
optimizer for testing purposes. The only real hyperparameters, other than experimenting
with implementation details, are the kernel sizes and the learning rate. Since the kernel
size depends on the input dimension of the image, the original ratio proposed by the
authors has not been changed. The final convolutional layer has a learning rate of 1/10 of
the general learning rate of the model. While the overall learning rate is experimented
with, the ratio of the final layer's learning rate to the overall learning rate stays the same.
Figure 16: SRCNN Model. The input, an interpolated low-resolution image, is super-
resoluted by running through three convolutional layers followed by a ReLU activation.
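A minimal PyTorch sketch of such a three-layer network is given below; the 9-5-5 kernel ratio and the 64/32 channel counts follow the original publication, and the exact configuration and activation placement trained in this thesis may differ.
\begin{verbatim}
import torch.nn as nn

class SRCNN(nn.Module):
    """Sketch of a three-layer SRCNN operating on an interpolated LR image."""

    def __init__(self, channels: int = 3):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv2d(channels, 64, kernel_size=9, padding=4),  # patch extraction
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 32, kernel_size=5, padding=2),        # non-linear mapping
            nn.ReLU(inplace=True),
            nn.Conv2d(32, channels, kernel_size=5, padding=2),  # reconstruction
        )

    def forward(self, x):
        # x: low-resolution image bicubically interpolated to the target size
        return self.layers(x)
\end{verbatim}
One way to reproduce the reduced learning rate of the final layer is via optimizer parameter groups, assigning the last convolution one tenth of the base rate.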
3.4.2 RCAN
The RCAN architecture (Zhang et al., 2018b), classified by Anwar et al. (2021) as an
attention-based network, is based on the ”Residual-in-Residual” architecture and the
reduction of filter dimensions already briefly outlined in Section 2.2. Reducing the filter
dimensions in the very first convolutional layer (see Figure 17) is supposed to extract ”shallow
features” first, followed by the main operations of the network. The very
deep features are extracted in the residual groups, each consisting of a number of residual
channel attention blocks. A skip connection around the residual groups stabilizes the training
by ”easing the flow of information across Residual Groups”. The short skip connections
within a residual group, spanning the residual channel attention blocks and culminating in an
element-wise sum, are supposed to force the network to pay attention to the more informative
features identified within the residual channel attention blocks. These blocks themselves
are founded on the assumption that interdependencies along the feature channels exist;
the channels are collapsed and condensed by a global pooling layer, which then serves as a
feature descriptor. The resulting per-channel statistic, collapsed from the spatial dimension,
is presumed to express the whole image through its representation of image regions. These
statistics are subsequently reprojected to the image scale by an element-wise product,
which constitutes the channel attention of the network. After having passed through the
residual groups, the extracted image information is upscaled and passed through a final
convolutional layer. The very deep residual approach, as claimed by the authors, results
in performance enhancements.
The basic assumption is that the focus on important image regions by the residual
channel attention blocks will enable the network to more accurately reconstruct intricate
regions with many edges, while focusing less on uniform areas. This strategy is expected to
adapt well to remote sensing applications. More uniform regions, such as fields, are
quite easy to super-resolute, achieving high accuracy values even with a simple interpolation.
Spending more resources and attention on complex areas, such as human structures,
is therefore an approach applicable to the scope of this project. The original authors train this
model with an MAE loss and an Adam optimizer. Variables in this model include the number
of residual groups and channel attention blocks within these groups, as well as the
number of feature maps extracted and the learning rate.
Figure 17: Schema of the RCAN model, which consists of 20 residual groups and 10
residual channel attention blocks.
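The channel attention mechanism described above can be sketched as a small squeeze-and-excitation-style module; the reduction factor of 16 is taken from the original RCAN publication, and the block below is a simplified illustration rather than the full residual channel attention block.
\begin{verbatim}
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Sketch of RCAN-style channel attention: global pooling collapses the
    spatial dimensions into a per-channel descriptor, which is re-weighted
    and multiplied back onto the feature maps."""

    def __init__(self, channels: int = 64, reduction: int = 16):
        super().__init__()
        self.attention = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                        # (N, C, 1, 1) channel statistic
            nn.Conv2d(channels, channels // reduction, 1),  # condense
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),  # re-expand
            nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.attention(x)  # element-wise product = channel attention
\end{verbatim}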
3.4.3 SRGAN
Figure 18: Schema of the SRGAN generator.
GANs are based on the increasingly popular game-theory approach of creating a competition
between a generator and a discriminator network (see Section 2.2). The SRGAN
was first implemented by Ledig et al. (2017) for SISR. The generator of this network,
which super-resolutes the low-resolution input image, is a deep residual-type network not
unlike the RCAN. After a first convolutional block and a following Parametric Rectified
Linear Unit (PReLU), a number of residual blocks follow; in the implementation
of this project, 16 blocks are created. The residual blocks are inspired by Johnson
et al. (2016a), consisting of a convolution followed by batch normalization and a
PReLU layer, before finishing with another convolution and a batch normalization. Each
residual block is spanned by a skip connection, resulting in an element-wise sum. All
residual blocks are additionally surrounded by a long skip connection that also results in an
element-wise sum. Following the residual blocks, the upsampling of the extracted features
is performed by upsampling blocks (Figure 18). Each of these blocks upsamples by a
factor of two, meaning two upsampling blocks are implemented in this project to reach a total
factor of four. These blocks consist of a convolution followed by a pixel shuffler, which
performs sub-pixel convolutions, and another PReLU. This network, which can perform
SR by itself with a high accuracy (Anwar et al., 2021), is called SRResNet. When used
standalone, it is trained with an MSE loss, meaning it is only evaluated on the
pixel-wise reconstruction error. Since this comes with the aforementioned drawbacks, the
loss of the generator is enhanced by including the opinion of the discriminator.
The discriminator takes a high-resolution image and is continuously trained to judge
whether the image is the result of SR or an original high-resolution image. To perform this
task, it consists of 8 blocks of a convolution followed by batch normalization and a leaky
ReLU, with the batch normalization being absent in the first block. The tail end of
the network transforms the feature maps via a dense layer of 1024 neurons, a leaky ReLU,
another dense layer of a single neuron and a final sigmoid function (Figure 19). The result is a
'realness' score between 0 and 1. The result of the discriminator is combined
with the reconstruction loss of the super-resolved images to serve as the total loss, which
is used to train the generator. Simultaneously to the optimization of the generator, the
discriminator is also trained. Since it is known whether the images fed into the discriminator
are real or not, the discriminator can be backpropagated with this information.
Ideally, the adversarial training continues until an equilibrium occurs in which the generator
creates images of such high quality that the discriminator cannot distinguish them from
real high-resolution imagery.
Figure 19: Schema of the SRGAN Discriminator.
The adversarial loss is known to introduce 'hallucination'-like artifacts, especially imagining
high-frequency texture information where it does not exist (Zhang et al., 2019). On
the other hand, this creation of structures can considerably enhance the perceptual quality
of the reconstruction. The loss chosen to train the generator is therefore of importance:
pixel-based losses promote a stronger adherence of the prediction to the original values,
limiting hallucinations, whereas implementing a perception-based loss instead was shown
to increase the reconstruction of high-frequency information (Ledig et al., 2017, p.8).
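A sketch of such a combined generator loss is given below; the binary cross-entropy adversarial term and the 1e-3 weight follow the weighting of Ledig et al. (2017), while the exact content term used in this project may differ.
\begin{verbatim}
import torch
import torch.nn.functional as F

def generator_loss(sr_pred, hr_true, disc_on_sr, adv_weight=1e-3):
    """Combined SRGAN-style generator loss: a pixel-based content term plus a
    weighted adversarial term from the discriminator's score on the SR image."""
    content = F.mse_loss(sr_pred, hr_true)
    # The generator is rewarded when the discriminator labels its output as real (1).
    adversarial = F.binary_cross_entropy(disc_on_sr, torch.ones_like(disc_on_sr))
    return content + adv_weight * adversarial
\end{verbatim}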
3.4.4 Augmenting SRGAN with Temporal Data
The previously implemented SRGAN is a SISR approach, taking into account only the
Sentinel-2 acquisition closest in time to the ground truth. More information can be extracted
from the time series of Sentinel-2 images, as outlined in Section 2.3.2. Two different approaches
are implemented to achieve that goal.
Band Stacking The naive approach is to stack the bands of multiple images of the time
series. Adapting the input dimensions of the generator of the SRGAN from three (a single
image) to 12 (four images) is the easiest way to feed the multi-temporal information into
the network. Since the output dimension is still three, a single super-resoluted image
containing the extracted information from the time series is created. Since the depth of
the layers following the input layer depends on the input size, the network grows
considerably in size, leading to a reduced batch size and therefore most likely less stable
training.
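The stacking itself is a single concatenation along the channel axis; the patch size in the sketch below is a placeholder.
\begin{verbatim}
import torch

# Hypothetical time series: four three-channel Sentinel-2 patches of the same tile.
time_series = [torch.rand(3, 75, 75) for _ in range(4)]

# Naive band stacking: concatenate along the channel axis to obtain the
# 12-channel input expected by the adapted generator.
stacked = torch.cat(time_series, dim=0)  # shape (12, 75, 75)
batch = stacked.unsqueeze(0)             # add batch dimension -> (1, 12, 75, 75)
\end{verbatim}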
Figure 20: Schema of SRGAN generator, containing the recursive network that fuses the
encoded images.
Image Fusion Network A more sophisticated approach was introduced by Deudon
et al. (2020) in their super-resolution network called HighresNet. Their fusion approach
is transplanted from their network into the SRGAN.
Instead of just a single three-channel image, the dataloader loads four stacked three-channel
images, which are then sent through the standard generator encoder. After
encoding the multi-temporal images individually, the encodings are fused recursively.
This recursive approach fuses two encoded image representations at a time, before fusing
the already fused representations, in order to obtain a representation of all encoded images
in the dimensions of a single original image. The fusion is repeated until all time
series images have been fused into a representation with the dimensions of a single image.
This fusing network is implemented within the workflow of the SRGAN, meaning four
multi-temporal images are encoded and fused before being super-resoluted by the generator
and fed into the normal SRGAN discriminator. The fusing network is trained by the same
optimizer as the generator, meaning the extraction of information within the fusion is
trained with the same losses as the generator. The fusion of two images is shown in Figure 21.
Figure 21: Schema of fusion implementation within SRGAN generator.
The fusion itself consists of a residual block containing two convolutional layers, which
takes two encoded images as input and produces an output with the dimensions of a single
encoded image. In this case, the convolutional layers take 2 × 64 = 128 encoded channels as
input and return a merged representation of 64 channels. Since the encoder is responsible for
upsampling the original 3 channels to 64 channels, the fusion network only performs the
recursive fusion, and the normal generator workflow can then continue.
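A simplified sketch of this pairwise, recursive fusion is given below; the exact residual wiring and layer configuration of the implemented fusion block are not reproduced here.
\begin{verbatim}
import torch
import torch.nn as nn

class FuseBlock(nn.Module):
    """Merges two 64-channel encodings (concatenated to 128 channels) back
    into a single 64-channel representation with two convolutions."""

    def __init__(self, channels: int = 64):
        super().__init__()
        self.merge = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1),
            nn.PReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, enc_a, enc_b):
        return self.merge(torch.cat([enc_a, enc_b], dim=1))

def fuse_recursively(encodings, fuse_block):
    """Fuse a list of encoded images pairwise until one representation remains.
    Assumes the number of encodings is a power of two (here: four)."""
    while len(encodings) > 1:
        encodings = [fuse_block(encodings[i], encodings[i + 1])
                     for i in range(0, len(encodings), 2)]
    return encodings[0]
\end{verbatim}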
4 Experimental Settings
This section describes the training process, first outlining the split of the dataset before
showing the process of tuning the hyperparameters of the standard and temporally en-
hanced models.
4.1 Training and Testing Data
The dataset is split 70%-30% into training and testing areas. This is done by grouping 4 by
4 grids of SPOT-6 tiles together, resulting in a grid of 10 km by 10 km rectangles covering
the study area. 70 percent of this grid is randomly selected to serve as training areas, while
the remaining 30 percent serve as the test area (see Figure 22). Splitting the smaller SPOT-6
tiles directly into two datasets would have resulted in a very fine grid, meaning inference
would have been performed spatially very close to clusters of training areas. Creating
larger squares results in bigger, more cohesive blocks. Out of 53,590 image patch pairs,
37,513 are training and 16,077 are testing patches. 10 percent of the training data is split
off the training set to serve as a validation dataset. The performance of the models on this
dataset is used to select the best-performing models during the training phase. Based on
the aforementioned CLC data and classes, the dataset is skewed towards agricultural areas.
With such a large amount of data, we have not observed an adverse effect from not stratifying
the data, but stratifying the dataset and potentially even training different models for
different land cover classes might be looked into in the future.
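The block-wise split can be expressed compactly; the block identifiers and counts below are placeholders, and every image patch inherits the split of the 10 km block that contains it.
\begin{verbatim}
import random

# Hypothetical identifiers of the 10 km x 10 km blocks covering the study area.
blocks = [f"block_{i}" for i in range(540)]

random.seed(42)  # reproducible split
random.shuffle(blocks)

n_train = int(0.7 * len(blocks))
train_blocks = set(blocks[:n_train])  # 70 % training areas
test_blocks = set(blocks[n_train:])   # 30 % test areas

# Each image patch pair is then assigned to the split of its containing block,
# keeping test patches spatially separated from the training areas.
\end{verbatim}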
Figure 22: Train-Test split of the dataset
4.2 Experiments and Hyperparameter Tuning
The experiments are conducted on the dataset splits described in Section 4.1, tuning the
hyperparameters outlined in Section 3.4. For the evaluation of the experiments, we consider
both pixel-based and perceptual losses. While these metrics come with their drawbacks,
they can serve as indicators of whether the training is progressing as expected. In the case
of the GAN approaches, where the PSNR does not improve considerably, the LPIPS is
used to select the best model.
This section outlines the fine-tuning of the models and shows the development of the
relevant metrics during the training phase. These indicators are interpreted in Section 5.1.
4.2.1 SRCNN Model Training
While the main SRCNN hyperparameter is the learning rate, the optimizer and the loss
function used are also variables that are experimented with. Even though the original authors
did not use a perceptual loss function, this was experimented with as well (Table 4).
The Adam optimizer performs best compared to SGD, with the learning rate gradually
declining by a factor of 0.1. Since the SRCNN is fast to train, even a small initial learning
rate leads to quick convergence. The perceptual loss implementation itself quite naturally
leads to a low PSNR, since it does not take the pixel reconstruction error into account. While
certain corresponding spatial shapes become visible, the training does not lead to satisfying
results, since the color of the image loses all correlation to the real world. In order to
force the reconstruction to also take the color, not only the textures, into account, the
perceptual and pixel-based losses are combined in a weighted sum (0.001 × LPIPS + MSE or
0.001 × LPIPS + MAE). The actual weights are not decided experimentally, but are instead
inspired by the weighting of similar metrics by Ledig et al. (2017).
Optimizer   Loss                  LR           PSNR
Adam        MAE                   0.001        36.19
Adam        MAE                   Step Decay   37.77
Adam        MSE                   0.001        36.29
Adam        MSE                   Step Decay   38.74
Adam        LPIPS                 Step Decay   10.68
Adam        0.001 × LPIPS + MSE   Step Decay   30.88
SGD         MAE                   0.001        23.11
SGD         MAE                   Step Decay   28.64
SGD         MSE                   0.001        26.64
SGD         MSE                   Step Decay   31.45
SGD         0.001 × LPIPS + MSE   Step Decay   9.96

Table 4: Comparison of PSNR results on the validation dataset for several configurations.
These comparisons are calculated on a dataset that has been histogram matched and
translated to the YCbCr color space. For this architecture, the ideal model is selected
based on the PSNR value.
The best performance was reached by using the MSE loss, a step decay of the learning
rate by a factor of 0.1 for each epoch (the large epoch sizes permit quick LR steps) and the
Adam optimizer. This corresponds with the implementation of Dong et al. (2014, 2015).
Including perceptual losses does not have the desired effect but instead prevents the color
mapping from being learnt; changing the optimizer to SGD also does not increase the
performance. The training progress for the optimal implementation is visible in Figure 23.
The progress is plotted over iterations; with a batch size of 16, one backpropagation step
therefore occurs for every 16 images that pass through the network. While the PSNR reaches
the desired high level, the SSIM decreases, indicating a lower correlation between the
super-resoluted prediction and the reference image. On the other hand, the LPIPS, where a
lower value stands for a more similar image, indicates that during the training process the
images become more and more similar.
Figure 23: SRCNN training progress.
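The step decay used here corresponds to PyTorch's StepLR scheduler; the model and optimizer below are placeholders, not the SRCNN training setup itself.
\begin{verbatim}
import torch

model = torch.nn.Conv2d(3, 3, kernel_size=3, padding=1)  # placeholder model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.1)

for epoch in range(3):
    # ... one training epoch over batches of 16 images would run here ...
    optimizer.step()
    scheduler.step()  # decay the learning rate by a factor of 0.1 after each epoch
    print(epoch, scheduler.get_last_lr())
\end{verbatim}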
4.2.2 RCAN Model Training
We experiment with the general hyperparameters described in Section 3.4. Generally, the
other intrinsic settings are not changed from the implementation by the original authors
(Zhang et al., 2018c). These settings include the number of residual groups and residual
channel attention blocks, their kernel sizes, the number of feature maps (64) and the
patch size (192). The optimization therefore mainly consists of experimenting with the
loss functions used as well as the learning rate. Earlier experimentation with the adaptation
of the two datasets outlined in Section 3.2.3 showed that the best results are obtained
by performing histogram matching before feeding the images into the network. Table 5
shows the best performance obtained during the testing process, reached with the L1 loss
and a step decay. Optimizing for structural or perceptual losses does not yield any usable
results, since these configurations return solid black or solid green predictions. RCAN is a
very deep network with many trainable parameters, so training takes a considerable amount
of time. Additionally, since the model already takes up a large amount of memory on the GPU,
the batch size has to be reduced to 8. Training the model takes several days on an NVIDIA
GeForce RTX 3060 GPU.
Optimizer   Loss                  LR           PSNR
Adam        MAE                   0.001        31.73
Adam        MAE                   Step Decay   34.90
Adam        SSIM                  Step Decay   no convergence
Adam        LPIPS                 Step Decay   no convergence
Adam        0.001 × LPIPS + MSE   Step Decay   no convergence

Table 5: Comparison of PSNR results on the validation dataset for several configurations.
The spatial and perceptual losses resulted in solid black and green predictions, therefore
the PSNR results her