Small Convolutional Neural Networks and Random
Forest in Land Use, Land-Use Change and Forestry
(LULUCF)
Gabriel A. Carneiro
Institute of Systems and Computer Engineering, Technology and Science (INESC-TEC)
Jan Svoboda
Charles University
António Cunha
Institute of Systems and Computer Engineering, Technology and Science (INESC-TEC)
Joaquim J. Sousa
Institute of Systems and Computer Engineering, Technology and Science (INESC-TEC)
Přemysl Štych
Charles University
Research Article
Keywords: convolutional neural networks, deep learning, random forest, LULUCF, Sentinel-2, land cover classification
Posted Date: June 24th, 2024
DOI: https://doi.org/10.21203/rs.3.rs-4159206/v1
License: This work is licensed under a Creative Commons Attribution 4.0 International License. 
Additional Declarations: No competing interests reported.
Small Convolutional Neural Networks and
Random Forest in Land Use, Land-Use Change
and Forestry (LULUCF)
Gabriel A. Carneiro1,2, Jan Svoboda3, António Cunha1,2, Joaquim J. Sousa1,2, Přemysl Štych3*
1*Institute of Systems and Computer Engineering, Technology and Science (INESC-TEC), Porto, 4200-465, Portugal.
2University of Trás-os-Montes e Alto Douro, Vila Real, 5000-801, Portugal.
3EO4Landscape Research Team, Department of Applied Geoinformatics and Cartography, Faculty of Science, Charles University, Prague, 12843, Czechia.
*Corresponding author(s). E-mail(s): stych@natur.cuni.cz;
Contributing authors: gabrielc@utad.pt; svoboj25@natur.cuni.cz; acunha@utad.pt; jjsousa@utad.pt;
Abstract
The focus of this paper is to present an innovative approach to classify land cover,
with specific emphasis on Land Use, Land-Use Change and Forestry (LULUCF)
monitoring. LULUCF is a sector in greenhouse gas inventory that tracks changes
in greenhouse gas levels in the atmosphere due to land use and land-use change. In
this study, we employed Deep Learning classifiers and Random Forest to classify
land cover/land use in Czechia, adhering to LULUCF regulations. We evaluated
the effectiveness of 2D and 3D Convolutions in Convolutional Neural Networks,
with varying filter sizes and training methods, alongside the use of Random Forest
classifier. We used Sentinel-2 bands with 10 m and 20 m spatial resolution, NDVI,
NDVI variance, and SRTM altitude data to create input patches of 5×5 pixels.
The results indicate that the 3D model trained with classical training and 3×3-
pixel filters achieved the best F1 Score of 0.84. One significant advantage of using
convolutional neural networks is their ability to include information from a pixel’s
neighbourhood in the classification process, in contrast to solely considering the
pixel itself.
Keywords: convolutional neural networks; deep learning; random forest; LULUCF;
Sentinel-2; land cover classification
1 Introduction
The United Nations Secretariat on Climate Change and the Paris Agreement adopted under the United Nations Framework Convention on Climate Change have declared the monitoring of land cover/land-use change highly relevant due to its significant impact on climate change and the global carbon cycle. For these purposes, binding regulation is provided for the inventory and reporting of relevant land use classes, the so-called LULUCF (land use, land-use change and forestry; see Decision 529/2013/EU, European Commission 2013). LULUCF information is collected and reported on an international scale and is one of the main input data sources for climate change modelling and GHG (greenhouse gas) emission estimates within the Intergovernmental Panel on Climate Change (IPCC). Within LULUCF, the following six classes are distinguished: Cropland (CL), Settlements (ST), Grassland (GL), Forestland (FL), Wetlands (WL), and Other land (OL). The area of the classes and their changes are reported annually to the IPCC. Researchers are increasingly developing sophisticated strategies to represent the global dimension of land use and assess its impact on climate mitigation (Michetti, 2012). For this reason, the role of systematic LULUCF monitoring and reporting has been stressed, and its high relevance for both research and stakeholders is evident; see, e.g., Michetti (2012) or Ellison, Lundblad, and Petersson (2014). In most countries, official records (cadastral data) are used as input data for LULUCF. However, the use of cadastral data for this purpose presents several limitations (Micek, Feranec, & Stych, 2020): (1) the spatial and temporal resolution of cadastral maps might not be able to capture the changes in land use over time; (2) they only include information about the legal ownership of land, which may not be accurate in cases where land use changes have occurred, or when there are discrepancies between the legal and actual land use; (3) the accuracy may also be affected by errors in the data collection process, such as outdated information or mistakes made during data entry; and (4) they may not provide enough information about the spatial patterns and characteristics of land use.
Earth observation (EO) has proven to be a highly effective and promising tool for monitoring LCLUC (Svoboda, Štych, Laštovička, Paluba, & Kobliuk, 2022; Alcantara, Kuemmerle, Prishchepov, & Radeloff, 2012; Hansen, Stehman, & Potapov, 2010). The launch of the Sentinel-2 satellites by the European Space Agency (ESA) has been a game-changer, offering remarkable advancements in spatial, spectral, and temporal resolutions. These technological attributes have opened up new avenues for gaining deeper insights into Earth's dynamic changes, as elaborated in Close, Petit, Beaumont, and Hallot (2021). The advent of Sentinel-2's capabilities has enabled the execution of numerous studies that combine Sentinel-2 imagery with machine learning algorithms. Notable examples include the use of Support Vector Machines (SVM) in studies by Nguyen, Doan, Tomppo, and McRoberts (2020), Cavur, Duzgun, Kemec, and Demirkan (2019), Rana and Suryanarayana (2020), and Ghayour et al. (2021). Additionally, Random Forest (RF) has been employed in investigations led by Svoboda et al. (2022), Nguyen et al. (2020), and Rana and Suryanarayana (2020), while Artificial Neural Networks have been utilized in studies conducted by da Silva, Cicerelli, Almeida, Neumann, and de Souza (2020) and Ghayour et al. (2021) to facilitate land cover classification.
Additionally, the application of Random Forest (RF) has been a common practice in the remote sensing literature for land cover classification (Vali, Comai, & Matteucci, 2020). This approach builds decision trees that determine the classification of each feature (pixel) using threshold values. The class most commonly assigned to the feature by the trees (the mode) is then chosen as the final classification outcome. Regarding LULUCF, Svoboda et al. (2022) extensively studied the capabilities of RF for its classification. The authors performed pre-processing, classification, and post-processing of Sentinel-2 data to produce a top-performing model with an overall accuracy of 0.89. However, this method does not consider the spatial information of the target pixel's neighborhood, as only the spectral signature, Normalized Difference Vegetation Index (NDVI), its variance, and terrain altitude were used for classification.
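As a concrete illustration of this per-pixel RF setup, the following minimal sketch uses scikit-learn (an assumption for illustration; Svoboda et al. (2022) worked in Google Earth Engine, whose RF implementation differs):

```python
# Minimal sketch of pixel-wise Random Forest classification, assuming
# scikit-learn; the synthetic features below stand in for the 13 inputs
# (10 Sentinel-2 bands, NDVI, NDVI variance, altitude) used in this study.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.random((1000, 13))      # one row per pixel: its spectral signature
y = rng.integers(0, 6, 1000)    # LULUCF class labels 0..5

rf = RandomForestClassifier(n_estimators=300)  # 300 trees, as in Section 3.1
rf.fit(X, y)

# Each tree votes per pixel; the mode of the votes is the final class.
predictions = rf.predict(rng.random((10, 13)))
```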
Recent results of relevant studies show that exploring only the spectral dimension made classifiers ignore spatial relations, which can carry important information about shape, context, and layout; in addition, neighbouring pixels have a high probability of belonging to the same class (Jia et al., 2021; Signoroni, Savardi, Baronio, & Benini, 2019; C. Zhang et al., 2019). Probabilistic models, such as Markov Random Fields, or adaptive neighborhood systems that utilize morphological profiles or filters, have been extensively used to examine spatial relationships (Vali et al., 2020). Deep Learning (DL) emerged in 2012 and nowadays represents the state of the art in most computer vision tasks, such as image classification, object detection, and image segmentation. The field of remote sensing has seen a surge in the number of studies over the past few years that utilize DL-based techniques to analyse remote sensing images (RSI) (Ma et al., 2019). In addition to land cover classification, DL has been tested and evaluated for various remote sensing tasks, including image fusion, image registration, scene classification, object detection, semantic segmentation, and object-based image analysis.
Among the different architectures employed in studies applying DL to RSI, Convolutional Neural Networks (CNNs) are widely used when spatial-dimension analysis is required. CNNs are designed to process data in the form of multidimensional arrays: 1D for signals and sequences, 2D for images and audio spectrograms, and 3D for video or volumetric images (LeCun, Bengio, & Hinton, 2015). They are composed mainly of three components: convolutional layers, pooling layers, and fully connected/dense layers. According to LeCun et al. (2015), the chaining of these components, in addition to local connections and shared weights, lets CNNs take advantage of the properties of natural signals. CNNs are advantageous because they can learn to extract features from the data, resulting in effective representations (Chollet, 2017). The ability of CNNs to process spectral and spatial dimensions has made them well-suited for analysing remote sensing (RS) data. One-dimensional (1D) CNNs can handle the spectral dimension, two-dimensional (2D) CNNs can process the spatial dimension, and three-dimensional (3D) CNNs can handle both simultaneously. Despite their suitability, the application of CNNs to RSI comes with certain challenges: i) RSIs can contain hundreds of bands, which would require a large number of parameters in the networks, resulting in huge models; ii) CNNs require a lot of data to be trained, and this requirement increases with the size of the model; however, few labelled samples are available for land cover classification when it comes to RSI; iii) RSIs are more complex than natural scene images, as they consist of various types of objects in different sizes, colours, locations, and rotations, and are acquired with different sensors; and iv) besides visual patterns, the spectral signature may also provide important information across different bands and should be considered in the tasks (Ben Hamida, Benoit, Lambert, & Ben Amar, 2018; Li, Zhang, Xue, Jiang, & Shen, 2018; Ma et al., 2019; Zhu et al., 2017).
DL methods require a large amount of data to be trained and to generalize well, and these requirements are directly related to the depth of the model: as the model becomes deeper, more data is needed to train it. This limitation is overcome by using transfer learning, which involves pre-trained models from larger generalist datasets, such as ImageNet (Deng et al., 2009) or MS COCO (T. Lin et al., 2014). RSI is distinct from general scene images in that it may consist of multiple bands, making it challenging to utilize pre-trained models in this context. Although one option is to rely solely on the RGB bands, as is typical with general scene images, doing so results in a significant reduction in classification information. To address the issue of limited labelled data for RSI, alternative approaches have emerged, such as data augmentation techniques, the generation of artificial data, and the use of unsupervised DL architectures such as auto-encoders (Vali et al., 2020). Aside from the significant data requirements, more advanced DL models demand greater computational resources for both training and inference. Multispectral and hyperspectral RSI are by nature high-dimensional data, demanding high computational power to process. To address this issue, techniques such as Principal Component Analysis (PCA) and feature selection are used to decrease the data dimension (Signoroni et al., 2019). Taking these factors into consideration, one approach for harnessing DL in land cover classification is window-pixel-wise classification. This method aims to classify a target pixel by considering its individual characteristics and information from its surrounding neighborhood. This approach facilitates the processing of images in smaller portions, even when they contain numerous bands, by taking into account both spatial and spectral relationships.
Several studies (Ben Hamida et al., 2018; Corbane et al., 2021; Helber, Bischke, Dengel, & Borth, 2019; Sharma, Liu, Yang, & Shi, 2017; Storie & Henry, 2018; Syrris et al., 2019) have explored DL, through CNNs, for automatically extracting features and classifying land cover using images from Unmanned Aerial Vehicles (UAV) and satellites. Sharma et al. (2017) accomplished an accuracy of 85.60% by training a small CNN model on patch-based samples of 5x5 pixels from Landsat-8 imagery to classify 8 distinct classes. Meanwhile, Ben Hamida et al. (2018) employed patch-based samples and a 3D model to classify data from various hyperspectral benchmarks, including the KSC dataset with a spatial resolution of 18 m. By utilizing 3D models, it becomes possible to extract features that rely on both spatial and spectral dimensions; however, this comes at the cost of an increased number of parameters compared to 2D models. In a patch-based approach, Syrris et al. (2019) and Corbane et al. (2021) employed four bands from Sentinel-2 imagery (blue, green, red, and NIR) to classify various land cover classes. Syrris et al. (2019) compared the effectiveness of DL-based segmentation architectures, a 2D classification architecture, and an RF method for classifying nine different classes, ultimately achieving an F1-score of 0.87 using a slide-window of 5x5 pixels. Building on Syrris et al. (2019)'s results, Corbane et al. (2021) also applied a slide-window approach with several 2D models to classify human settlements globally. In contrast, Helber et al. (2019) fine-tuned pre-trained deep models using the blue, green, and red bands of Sentinel-2 imagery to classify 10 different classes, ultimately achieving an accuracy of 98.57%. Karra et al. (2021) used a U-Net (Ronneberger, Fischer, & Brox, 2015) architecture with the red, green, blue, NIR, SWIR1, and SWIR2 bands of Sentinel-2 imagery to segment 10 different classes, achieving an accuracy of 85%. Similarly, Storie and Henry (2018) adapted a Fully Convolutional Network to segment 18 classes using six bands from the Landsat 8/7 satellites with a spatial resolution of 30 m, ultimately achieving an accuracy of 88%.
These findings collectively suggest that small CNNs (comprising fewer than 1 million parameters and fewer than 20 layers) can effectively perform Sentinel-2 pixel-wise classification in a 5x5-pixel manner. However, it is noteworthy that most studies have predominantly utilized only a few bands and have not explored 3D convolutions. Furthermore, pixel-wise classification facilitates the processing of images in smaller portions, even when they contain numerous bands. This approach is particularly suitable for datasets such as the one employed by Svoboda et al. (2022), where there is no temporal component and training points are derived from homogeneous circular polygons with diameters ranging from 60 to 80 meters. These polygons exclusively contain pixels of a single class. In such cases, the small-CNN method is a practical choice. However, the same characteristics that make this approach feasible present challenges for the use of deep learning-based segmentation architectures, such as U-Net (Ronneberger et al., 2015), SegNet (Badrinarayanan, Kendall, & Cipolla, 2017), and Fully Convolutional Networks (Shelhamer, Long, & Darrell, 2017). Deep learning-based segmentation architectures typically rely on feature extractors, often based on large models pre-trained for classification, unsupervised methods like autoencoders, or self-supervised pre-training, and they require a significant amount of data for training. Moreover, the scene homogeneity in our data presents an additional challenge: pixel-wise classification benefits from polygon border pixels to delineate region boundaries, while deep learning-based segmentation architectures have no annotations for pixels outside the polygons, making the task more complex. For a detailed comparison between deep learning-based segmentation architectures and the small-CNN approach for land cover classification, readers can refer to the study conducted by Syrris et al. (2019).
One of the distinctive aspects of our study is the consideration of spatial information in the classification process. In contrast to methods that solely leverage spectral features, our approach incorporates spatial context by examining the target pixel's neighborhood. By doing so, we aim to enhance the accuracy and contextual understanding of land cover classification, a crucial aspect of the evolving field of remote sensing. Therefore, in the context of this study, our objective is to assess the use of small CNNs for classifying LULUCF classes in the Czech Republic. Our experiment involves testing both 2D and 3D convolutions with various filter sizes. We also incorporate data augmentation and contrastive learning techniques into the analysis. The input data encompass Sentinel-2 bands with spatial resolutions of 10 meters and 20 meters, alongside metrics like NDVI, NDVI variance, and SRTM altitude data. To evaluate performance, we compared the results of our method with those obtained using RF. This study endeavors to make several key contributions, including:
- Assessing the applicability of deep and machine learning methods for LULUCF classification;
- Investigating the influence of convolutional filter sizes on LULUCF classification through the implementation of a compact 3D convolution-based architecture;
- Contrasting the performance of 3D CNNs and RF for land cover classification using Sentinel-2 imagery;
- Delving into the implications of treating LULUCF classification as a contrastive learning problem.
The following structure is used in this paper: Section 2 introduces the utilized dataset, trained models, and selected hyperparameters for training. Section 3 describes the obtained results. Section 4 critically examines the results and methods used, and compares them with similar studies. Lastly, Section 5 presents the main conclusions and contributions of the study.
2 Materials and Methods
Our approach involves using CNN models to classify pixels in accordance with LULUCF classes, based on Sentinel-2 imagery. The traditional pipeline was followed for training the Deep Learning models, with the basic workflow illustrated in Figure 1. The first step involved dividing and reshaping the data into patches of 5×5 pixels, so that each pixel belonging to an annotated polygon became a patch together with its 24 surrounding pixels. These patches were then utilized to train four different models, which were trained, evaluated, and tested. Finally, each trained model was used to classify the data, and the results were validated (accuracy assessment) and interpreted.
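For illustration, this patch extraction step can be sketched as follows (a minimal NumPy sketch; the array layout is an assumption, as the study does not publish code):

```python
import numpy as np

def extract_patch(mosaic, row, col, size=5):
    # mosaic: (height, width, bands) array; the returned window has the
    # target pixel at its centre and its 24 neighbours around it.
    half = size // 2
    return mosaic[row - half:row + half + 1,
                  col - half:col + half + 1, :]

mosaic = np.zeros((500, 500, 13), dtype=np.float32)  # toy 13-band mosaic
patch = extract_patch(mosaic, 100, 200)
assert patch.shape == (5, 5, 13)
```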
The initial methodological step involved the creation of a classification nomenclature that conforms to the LULUCF regulations as outlined in IPCC (2003). These regulations define and report the status and development of various classes of land, including Forest land, Cropland, Grassland, Wetlands, Settlements, and Other land.
2.1 Area of Interest
The annual LULUCF reporting to the IPCC is done for the national area, which in the case of Czechia is covered by 22 Sentinel-2 tiles and cannot be completely covered by a single day's imagery; hence the necessity of mosaicking. High cloud cover is also common in Central Europe, and completely cloudless images are relatively rare, especially for mountainous areas (Laštovička et al., 2020). To obtain an image for classification of the whole area despite cloud cover, it is advisable to composite images over some time span so that there is a measurement for each classified pixel. For testing the land cover classification methodology according to the LULUCF nomenclature, the original composite mosaic from Svoboda et al. (2022) was used.
The area of interest covers two NUTS 2 regions, namely Jihovýchod (CZ06) and Střední Morava (CZ07), as depicted in Figure 2. The selection of the regions of interest was guided by the criteria of the project "Developing supports for monitoring and reporting of GHG emissions and removals from land use, land use change and forestry", from which this study originated¹. The total area of the region is approximately 23,217 km², with elevations ranging from the lowest point at the confluence of the rivers Morava and Dyje (150 m above sea level) to the highest point at the mountain Praděd (1492 m above sea level). The elevation of the area gradually increases from south to west, east, and north. The largest cities in the region of interest include Brno (381,346 inhabitants), Olomouc (100,663 inhabitants), Zlín (74,935 inhabitants), and Jihlava (51,216 inhabitants).
¹ https://www.copernicus-user-uptake.eu/user-uptake/details/developing-support-for-monitoring-and-reporting-of-ghg-emissions-and-removals-from-land-use-land-use-change-and-forestry-73, accessed on 22 September 2021
2.2 Data
To compare the results of the RF and CNN methods, the dataset of Svoboda et al. (2022) is used. All the mosaic creation was done in the cloud-based Google Earth Engine platform, which allows huge amounts of data to be processed free of charge for scientific purposes. Sentinel-2 multispectral images from the joint ESA/European Commission Copernicus Mission were used. The images were acquired during the late spring and early summer of 2018 and were extracted from atmospherically corrected Sentinel-2A and Sentinel-2B data from 2018. These data are provided in a 10 m spatial resolution for the Blue, Green, Red, and NIR bands (B2, B3, B4, and B8), and in a 20 m spatial resolution for the Vegetation red edge bands (B5, B6, B7, B8A) and SWIR bands (B11, B12). Additionally, the data include the NDVI, NDVI variance, and the digital elevation model from SRTM (Shuttle Radar Topography Mission) in a 30 m resolution. A detailed description of the composite mosaic used for classification is in Svoboda et al. (2022).
Figure 2 Area of interest, Source: Svoboda et al. (2022), data source: CORINE Land Cover 2018
In an effort to preserve the highest spatial resolution of the data, the 20 m bands (and the SRTM height band) were resampled onto the 10 m grid when downloaded to the hard disk. For this, the default Nearest Neighbour method in GEE was used, so the values of the lower-resolution bands were not changed.
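The effect of this resampling can be reproduced with a small NumPy sketch (a simplification assuming the 20 m and 10 m grids are aligned):

```python
import numpy as np

def nearest_neighbour_upsample(band_20m):
    # Each 20 m pixel is duplicated into a 2 x 2 block of 10 m pixels,
    # so the original band values are preserved unchanged.
    return np.repeat(np.repeat(band_20m, 2, axis=0), 2, axis=1)

b11 = np.arange(4.0).reshape(2, 2)          # toy 20 m SWIR band
b11_10m = nearest_neighbour_upsample(b11)   # shape (4, 4), same values
```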
Training and validation data were also taken from Svoboda et al. (2022). The Copernicus Corine Land Cover 2018 database, the Base map of the Czech Republic at 1:10 000, and the Land Parcel Identification System (LPIS) for the year 2018 were used to generate these datasets. To verify training polygons and for accuracy assessment, historical orthophotos from 2017, 2018, and 2019 from the Czech Office for Surveying, Mapping and Cadastre (Czech: Český úřad zeměměřický a katastrální, ČÚZK) and historical imagery in the Google Earth Pro software were used.
The CNN models were trained using a pixel slide-window of size 5x5 as input, which consists of a central pixel as the classification target and its surrounding neighbourhood, as illustrated in Figure 1. This configuration enables the representation of both spectral and spatial relationships using the different bands and neighbourhood information (Corbane et al., 2021; Signoroni et al., 2019). We used the same data configuration as Svoboda et al. (2022), consisting of 10 Sentinel-2 data bands, to train our models. The original validation subset from Svoboda et al. (2022) was randomly divided into two subsets, validation and test slide-windows, with each subset containing 50% of the samples of each class. Table 1 displays the number of samples per class for each subset.
Table 1 Amount of samples per category in each subset.
Class Training - polygons Training - lines Training - total Validation Test
Settlement 2240 1612 3852 81 81
Cropland 10 576 900 11 476 461 462
Woodland 5673 2367 8040 385 386
Grassland 2048 0 2048 161 161
Wetland 448 463 911 12 12
Other land 174 0 174 16 16
Total 21 159 5342 26 501 1116 1118
Samples for training were taken from the same polygons as in Svoboda et al. (2022), but modified for the methods used in this study. The original polygons were transformed from circles with a radius of 40 m to squares with a side length of 80 m in order to increase the number of samples: from the original 50 samples per polygon, the number of samples increased to 64 (14 more samples). The Other land class consists of relatively small areas (smaller than the above-mentioned 80 m squares, e.g. blockfields). Due to this, it was necessary to manually select and define individual samples for training the Other land class.
Svoboda et al. (2022) utilized a random forest that used the spectral signature of pixels as input, whereas the CNNs utilized small patches (5x5 pixels) representing the spectral signature of the neighbourhood. After testing the classification, we observed that the training dataset lacked the representation of line features and borders. In order to improve the classification of the ecotones, for both CNN and RF, training lines were added to the training data. These lines cover the relevant types of landscape objects and were created by manual vectorization over the aerial image (Ortofoto ČÚZK). The inclusion of pixels belonging to line features provided important information for the detection of significant landscape elements from the LULUCF point of view. These lines cover various line elements such as roads, railways, rivers, and windbreaks. An example of the preparation of these lines can be seen in Figure 3.
Due to the lower number of samples in the Wetland and Other land classes compared to the other classes, the training subset for these classes was subjected to data augmentation. Each sample was rotated by 90 and 180 degrees and flipped along the horizontal and vertical axes, resulting in four new samples per original image. These augmentations were selected because they only alter the location of the pixels while maintaining their original values.
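A minimal NumPy sketch of these value-preserving augmentations:

```python
import numpy as np

def augment(patch):
    # Four augmented copies of a (5, 5, bands) patch: 90- and 180-degree
    # rotations plus vertical and horizontal flips. Only pixel positions
    # change; the spectral values themselves are untouched.
    return [np.rot90(patch, k=1, axes=(0, 1)),
            np.rot90(patch, k=2, axes=(0, 1)),
            np.flip(patch, axis=0),
            np.flip(patch, axis=1)]
```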
Finally, a data ablation study was conducted to verify the importance of all the data used in this study (Sentinel-2 10 m and 20 m bands, NDVI, NDVI variance, and altitude). The model with the best performance was trained with parts of the data (RGB + NIR; RGB + NIR + SWIR; all Sentinel-2 10 m and 20 m bands; all Sentinel-2 10 m and 20 m bands + NDVI and its variance; and all Sentinel-2 10 m and 20 m bands + altitude).
Figure 3 Training lines samples generation. The manual vectorization technique was used to create a windbreak line that separates two agricultural plots, and training pixels of the woodland class were identified in the pixels intersected by this line. One of the training pixels and its 5 x 5 pixel neighborhood is highlighted in the image. Note that only training pixels are included in the Random Forest training process, while both training pixels and their neighbourhoods are included in the CNN training process.
2.3 Deep Learning Classifiers
2.3.1 Architectures
The objective of this study is to develop CNN architectures capable of performing pixel-wise classification of 6 distinct LULUCF classes in 10 m resolution Sentinel-2 images. Three CNN models are trained and evaluated: a 2D baseline model; a 3D model that incorporates depth into the filters to analyse patterns between channels; and a Multiscale 3D model that utilizes different kernel sizes in parallel to extract multiple representations of the data. The 2D model utilizes 2D convolutional filters to learn patterns from the spatial dimension, while the other models expand on this approach using 3D convolutional filters. A schematic representation of the architectures is depicted in Figure 4.
Figure 4 Visual representations depicting the schematic layout of the various layers. The Net2Vis tool (Bäuerle, van Onzenoodt, & Ropinski, 2021) is utilized to generate or partially generate visualizations of the layers comprising each architecture presented in this study.
The development of the architectures in this study was guided by certain principles, including:
- The classes targeted in this study often span more than one pixel at 10 m spatial resolution, so the architectures must be capable of extracting features from different pixel-group sizes that relate to the target pixel and the surrounding contextual information. Previous studies (Corbane et al., 2021; Sharma et al., 2017; Syrris et al., 2019) have suggested using a pixel neighbourhood of 24 pixels, composed of a slide-window of 5×5 pixels with the target pixel at the centre, which was also adopted in this study;
- All the data utilized in Svoboda et al. (2022) must be incorporated into the classification. Some studies (Corbane et al., 2021; Helber et al., 2019; Karra et al., 2021; Syrris et al., 2019) that utilized CNNs to classify or segment land cover used only a few bands, but this study utilized 10 Sentinel-2 bands, as well as NDVI, its variance, and the altitude. These additional data inputs were treated as extra bands in the input data, resulting in slide-windows with the dimension 5×5×13. Therefore, fine-tuning pre-trained architectures was not applied;
- The resulting architectures should be lightweight, making them easily trainable and able to perform inference on computers that lack dedicated GPUs. This would allow them to be adopted in other classification tasks that do not require substantial computational resources, and their knowledge could be utilized as a source in transfer-learning tasks.
All models were constructed using a basic convolution block as a base. This block comprised two consecutive convolutional layers, each containing 32 filters with side sizes of 3 pixels (3×3 for 2D convolutions and 3×3×3 for 3D convolutions), followed by a Max Pooling layer with a size of 2 and a Batch Normalization layer. This configuration was adapted for each model, with the 2D models utilizing 2D convolutional layers and 2D Max Pooling, while the 3D models used 3D convolutional layers and 3D Max Pooling. The models consisted of two convolution blocks, considering the quantity of data and the input size.
Activation was performed using Rectified Linear Units (ReLU), with "same" padding for the convolutional layers. The Gaussian Error Linear Unit and hyperbolic tangent activations were also tested, but no improvements were found. Padding was applied to avoid information loss at the start of the models, with "same" padding chosen due to the behaviour of spectral data: in small slide-windows, it is common to have homogeneous pixels with similar values across different bands. Filters with a side size of 3 were chosen to allow the network to obtain more information about the neighbourhood compared to smaller filters, as the target data typically consist of large groups of pixels, except at the borders. Filters with sizes 2 and 4 were also tested for the architecture with the best performance. Max Pooling was employed to reduce the data dimension, while Batch Normalization was used to decrease the probability of instability during training.
The fully connected (FC) part of the classifier consisted of 40 neurons with ReLU activation, taking the flattened topmost activations of each model as input. Global Average Pooling was tested to reduce the dimensionality, but it resulted in a loss of information and decreased the models' performance. The model's output was an FC layer with 6 neurons, the number of classes being classified, with SoftMax activation. Additionally, a 25% Dropout was applied between the output layer and the previous layer as a regularization technique to prevent overfitting.
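Under these choices, the 3D variant can be sketched as follows (a minimal Keras reconstruction from the description above; the study does not publish its code, so the framework and exact input layout are assumptions):

```python
import tensorflow as tf
from tensorflow.keras import layers

def conv_block_3d(x):
    # Two 3x3x3 convolutions (32 filters, ReLU, "same" padding),
    # followed by max pooling and batch normalization.
    x = layers.Conv3D(32, 3, padding="same", activation="relu")(x)
    x = layers.Conv3D(32, 3, padding="same", activation="relu")(x)
    x = layers.MaxPooling3D(pool_size=2)(x)
    return layers.BatchNormalization()(x)

# 5x5 spatial window, 13 bands along the depth axis, one value channel.
inputs = tf.keras.Input(shape=(5, 5, 13, 1))
x = conv_block_3d(inputs)
x = conv_block_3d(x)
x = layers.Flatten()(x)
x = layers.Dense(40, activation="relu")(x)
x = layers.Dropout(0.25)(x)                          # regularization
outputs = layers.Dense(6, activation="softmax")(x)   # six LULUCF classes
model = tf.keras.Model(inputs, outputs)
```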
Although all models had the same configurations regarding the number of layers,
filters, and classifiers, each had its own unique characteristics. The 2D model, unlike
traditional machine learning techniques, utilized learned 2D convolutional filters to
extract information about the neighbourhood surrounding the target data. This served
as the baseline model for the study since 2D convolutions are the primary tool in deep
learning for image classification.
The 3D model goes beyond the capacity of 2D convolutions by incorporating 3D convolutional filters that can learn relationships between different channels of data. The time component was not explored; the 3D filters were applied with the aim of learning relations across channels. In this specific application, the channels represent the different bands, so models utilizing 3D convolutions are capable of considering both spectral and spatial dimensions in the learned representation. In addition to the 3×3×3 filter, we also tested filters of 2×2×2 and 4×4×4, aiming to verify the impact of different filter sizes on the architecture.
The 3D multiscale CNN model takes the 3D model further by adding two additional
layers that extract features in parallel with the initial convolutional layer, as depicted
in Fig. 4(c). The added layers utilize filters with dimensions of 2×2×1 and 1×1×5.
This configuration enables the model to extract relationships between close spatial
neighbourhoods (in the first added layer) and more widely spread bands in the spectral
dimension (in the second added layer). However, this comes at the expense of increasing
the models’ complexity and computational requirements.
2.3.2 Training
The common supervised training pipeline for multi-classification was used to train the
three proposed architectures. For the sake of simplicity, this method will be referred to
as classical training for the rest of this text. However, the architecture that exhibited
the highest performance was also trained using a contrastive approach, where pairs of
images were used to convert the training into a metric learning task. The idea behind
contrastive learning is pull together an anchor and a an sample that belongs to its
class together in embedding space, and push apart the anchor from samples that are
not from the same class (Khosla et al.,2021).
Classical training
The models were trained using the common supervised training pipeline for multi-class classification with the Adam optimizer and a learning rate (LR) of 10⁻³. The default parameter configuration of Adam was used, with β₁ = 0.9, β₂ = 0.999 and ε = 10⁻⁷. Additionally, if the learning reached a plateau, the LR was reduced by a factor of 0.2. The models were trained for 50 epochs with early stopping after 20 epochs without improvement. The Cross-Entropy (CE) loss was used as the loss function. The CE loss is defined by Equation 1, where t_i and s_i are the ground truth and the model score, respectively, for each class i in the set of classes C. Due to the dataset's significant class imbalance, Focal Loss (FL) (T.-Y. Lin, Goyal, Girshick, He, & Dollár, 2017) and Class-Balanced Cross-Entropy (Cui, Jia, Lin, Song, & Belongie, 2019) were tested, but no improvement was observed. The batch size was set to 256 samples. Hyperparameters such as the optimizer, learning rate, weight initialization, batch size, and number of epochs were kept consistent for all experiments, based on Syrris et al. (2019)'s conclusion that these basic hyperparameters did not significantly impact the results.
CE = -\sum_{i}^{C} t_i \log(s_i)    (1)
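These settings translate into the following training sketch (a hedged Keras reconstruction; the patience of the LR-reduction callback is not reported in the text and is left at its default here):

```python
import tensorflow as tf

def train_classic(model, train_x, train_y, val_x, val_y):
    # Classical supervised training as described above: Adam with
    # LR = 1e-3 and the CE loss of Equation (1).
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    callbacks = [
        # Reduce the LR by a factor of 0.2 on a validation-loss plateau.
        tf.keras.callbacks.ReduceLROnPlateau(factor=0.2),
        # Stop when no improvement is seen for 20 epochs.
        tf.keras.callbacks.EarlyStopping(patience=20,
                                         restore_best_weights=True),
    ]
    return model.fit(train_x, train_y,
                     validation_data=(val_x, val_y),
                     batch_size=256, epochs=50, callbacks=callbacks)
```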
Contrastive training
The training using the contrastive approach was implemented as a Deep Siamese Network (Koch, 2015), as shown in Figure 5. The training procedure consisted of using pairs of images to train a neural network that learns a distance metric between the images; the cosine distance was employed. The goal was to extract features from each image in the pair that decrease the intra-class distance while increasing the inter-class distance, effectively transforming the classification task into a metric learning task. During training, the convolutional part of the Siamese network was shared between both images in the pair, meaning that the same weights were applied to both images and both images were projected onto the same representation space. Note that, unlike the 3D model, a new layer of 256 neurons responsible for the image representation was added in the Siamese network. One of the advantages of contrastive learning is that it can make use of the large number of negative pairs, which are pairs formed by images of different classes (Khosla et al., 2021). Consequently, the use of contrastive learning could reduce the effects caused by the lack of data for the Wetland and Other land classes.
Figure 5 The process of contrastive training is divided into two steps. First, in step 1 (a), the
classification task is transformed into a metric learning task, and the model is trained with the
contrastive loss. In step 2 (b), a shallow classifier is added to the space representation, and transfer
learning combined with fine-tuning is applied.
During the first step of the training process (depicted in Figure 5 (a)), the Contrastive Loss (Hadsell, Chopra, & LeCun, 2006) was employed as the loss function, using Root Mean Squared Propagation (RMSprop) as the optimizer with a learning rate of 10⁻³ and a batch size of 64 samples, keeping the remaining hyperparameters the same as in the classical training. The contrastive loss function is expressed mathematically in Equation 2, where y represents the ground truth, d represents the distance between image features, and α represents the margin, which indicates the maximum intra-class distance and minimum inter-class distance. In our case, the margin was set to 1, since we used the cosine distance. Pairs of images are crucial in contrastive training, so we used 3500 pairs for each class, with 20% being positive and 80% negative. Positive pairs were created from different combinations of positive samples, while negative pairs were equally distributed between classes. Furthermore, pairs with the same images in different positions were considered distinct pairs. The Mean Squared Error was used to evaluate this step.
CL = (1 - y)\, d^2 + y\, \max(\alpha - d, 0)^2    (2)
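Equation 2 can be written directly as a loss function (a sketch assuming, following Hadsell et al. (2006), that y = 0 labels a positive pair and y = 1 a negative pair):

```python
import tensorflow as tf

def contrastive_loss(y_true, distance, margin=1.0):
    # Equation (2): positive pairs (y = 0) are pulled together, negative
    # pairs (y = 1) are pushed apart until at least 'margin' away.
    # 'distance' is the cosine distance between the two embeddings.
    y = tf.cast(y_true, distance.dtype)
    positive_term = (1.0 - y) * tf.square(distance)
    negative_term = y * tf.square(tf.maximum(margin - distance, 0.0))
    return tf.reduce_mean(positive_term + negative_term)
```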
Next, transfer learning combined with fine-tuning was implemented as the second step. Here, the classifier was added on top of the space representation, and the distance layer was removed. This updated model was then trained in two stages. In the first stage, only the classifier was trained while the weights of the convolutional part and the space representation were kept frozen; in this stage, the knowledge learned from the contrastive learning was transferred to the new model, and all hyperparameters were kept consistent with those of the classical training. In the second stage, all weights were unfrozen and the model was fully trained, making fine adjustments to the weights; the initial learning rate was reduced to 10⁻⁵, while keeping the other hyperparameters the same as in the classical training. The performance of the model trained using the contrastive approach was evaluated using the same metrics.
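A sketch of this two-stage transfer, assuming a Keras classifier built on top of the Siamese backbone (the number of frozen layers is an assumption):

```python
import tensorflow as tf

def two_stage_finetune(model, n_backbone_layers, train_data, val_data):
    # Stage 1: train only the classifier head; the backbone and space
    # representation stay frozen, transferring the contrastive knowledge.
    for layer in model.layers[:n_backbone_layers]:
        layer.trainable = False
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
                  loss="categorical_crossentropy")
    model.fit(*train_data, validation_data=val_data,
              batch_size=256, epochs=50)

    # Stage 2: unfreeze everything and fine-tune with LR = 1e-5.
    for layer in model.layers:
        layer.trainable = True
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-5),
                  loss="categorical_crossentropy")
    model.fit(*train_data, validation_data=val_data,
              batch_size=256, epochs=50)
```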
2.3.3 Validation/Accuracy Assessment
Four metrics were used to evaluate the models: Producer's Accuracy (PA), User's Accuracy (UA), F1 Score (F1), and Overall Accuracy (OA). These metrics were calculated using Equations 3, 4, 5, and 6, which incorporate True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN).
PA = TP / (TP + FP)    (3)
UA = TP / (TP + FN)    (4)
F1 = (2 \cdot PA \cdot UA) / (PA + UA)    (5)
OA = (TP + TN) / (TP + FP + TN + FN)    (6)
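Given a confusion matrix, the four metrics follow directly (a NumPy sketch implementing Equations 3-6 as printed above):

```python
import numpy as np

def accuracy_metrics(cm):
    # cm[i, j]: number of samples of true class i predicted as class j.
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp      # predicted as the class but not it
    fn = cm.sum(axis=1) - tp      # of the class but predicted otherwise
    pa = tp / (tp + fp)                   # Equation (3)
    ua = tp / (tp + fn)                   # Equation (4)
    f1 = 2 * pa * ua / (pa + ua)          # Equation (5)
    oa = tp.sum() / cm.sum()              # Equation (6), multi-class form
    return pa, ua, f1, oa
```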
In addition to these metrics, our objective was to understand which pixels within the neighborhood received more attention from the model when correctly classifying samples. To achieve this, we applied the Gradient-weighted Class Activation Mapping (Grad-CAM) technique (Selvaraju et al., 2016). Grad-CAM assigns an importance score to each pixel in an image I classified by a model M, using the gradient of the score for class C with respect to the activations of a convolutional layer L of M. Our evaluation using Grad-CAM involved three steps. First, we generated 3D heatmaps for each correctly classified sample with respect to its class, utilizing the topmost convolutional layer in the architecture to capture high-level data representations. Next, we reduced the dimensionality of the activations by computing the average over the bands of each sample, resulting in 2D heatmaps. Finally, we calculated the average pixel importance for each class by aggregating the data from all samples within that class, generating a 2D heatmap per class. This allowed us to determine the average importance of each pixel in the classification of a specific class. Our objective was to assess whether pixels closer to the center were more crucial for classification than those located at the borders.
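A minimal Grad-CAM sketch for the first step (assuming the Keras model sketched in Section 2.3.1; the averaging over bands and over samples, steps two and three, is omitted):

```python
import numpy as np
import tensorflow as tf

def grad_cam_3d(model, patch, class_index, conv_layer_name):
    # Map the input to (topmost conv activations, class scores).
    grad_model = tf.keras.Model(
        model.input,
        [model.get_layer(conv_layer_name).output, model.output])
    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(patch[np.newaxis, ...])
        class_score = preds[:, class_index]
    # Gradient of the class score w.r.t. the conv activations.
    grads = tape.gradient(class_score, conv_out)
    # Weight each filter by its average gradient and combine.
    weights = tf.reduce_mean(grads, axis=(0, 1, 2, 3))
    heatmap = tf.nn.relu(tf.reduce_sum(conv_out[0] * weights, axis=-1))
    return heatmap.numpy()    # 3D importance map for this sample
```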
2.4 Post-processing
The convolutional neural networks produced multiple classifications, and the one with
the highest accuracy was selected as the best model. Additionally, postprocessing was
conducted on the best model and the Random Forest classification to compare their
results. The postprocessing had two main objectives: (a) adjusting the grid size to
match the minimum mapping unit (MMU) to ensure that the smallest objects in the
map were accurately represented, and (b) dividing the Woodland category into Forest
land and non-forest woody vegetation, which was part of the Other land class, in
order to generate land cover rasters that adhered to the LULUCF nomenclature. To
achieve the MMU, the smallest unit of measurement was set to two pixels (200 m²),
and single pixels were removed using a majority filter. Finally, the accuracy assessment
was repeated on the processed raster to evaluate the accuracy of the final output (note
that the validation and test dataset did not distinguish between Forest land and Other
land within the Woodland class).
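The lone-pixel removal can be sketched with a simple 3x3 majority filter (a SciPy sketch; the exact filter used in the study is not specified beyond "majority filter"):

```python
import numpy as np
from scipy import ndimage

def remove_lone_pixels(classified):
    # Reassign pixels with no same-class neighbour to the majority class
    # of their 3 x 3 window, enforcing the two-pixel (200 m2) MMU.
    def vote(window):
        centre = window[window.size // 2]
        values, counts = np.unique(window, return_counts=True)
        if counts[values == centre][0] > 1:   # centre has a supporter
            return centre
        return values[np.argmax(counts)]      # otherwise: majority class
    return ndimage.generic_filter(classified, vote, size=3)
```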
The Woodland class was divided into two categories, Forest land and Other land, based on size and width. A minimum width of 20 meters is required for a stand to be considered Forest land according to Filip and Radim (2010). This corresponds to a width of two pixels in the classified raster. Therefore, Woodland objects that were only one pixel wide were excluded from the Forest land category. Convolution filters were used to remove one-pixel-wide objects, filtering out horizontal objects with a width of one pixel followed by vertical objects with a width of one pixel. The non-forest woody vegetation objects that were filtered out were reclassified into the Other land class.
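One possible reading of this filtering, sketched with simple convolution kernels (the study's exact kernels are not published, so these are assumptions):

```python
import numpy as np
from scipy import ndimage

def split_woodland(woodland):
    # woodland: boolean raster of the Woodland class. A pixel belongs to
    # a one-pixel-wide object when it has no Woodland neighbour on either
    # side horizontally, or on either side vertically.
    w = woodland.astype(int)
    horiz = ndimage.convolve(w, np.array([[1, 0, 1]]), mode="constant")
    vert = ndimage.convolve(w, np.array([[1], [0], [1]]), mode="constant")
    narrow = woodland & ((horiz == 0) | (vert == 0))
    forest_land = woodland & ~narrow
    non_forest_woody = narrow      # reclassified into Other land
    return forest_land, non_forest_woody
```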
After these morphological operations, new lone pixels of the Other land class appeared in the classifications. To maintain the MMU of the classified objects, these lone pixels were reclassified as Forest land, because they were always neighbouring the Forest land object of which they had been a part. Next, Forest land objects with an area of 4 pixels or less were selected and reclassified to the Other land class as non-forest woody vegetation. This final step was necessary because, according to Regulation (EU) 2018/841 of the European Parliament, the Forest land MMU for the Czech Republic is 0.05 ha, which corresponds to an object with an area of five pixels in the resulting classifications.
3 Results
The primary aim of this research was to develop a classification for LULUCF reporting utilizing compact convolutional models. To achieve this, various convolution dimensions, filter sizes, data augmentation strategies, and contrastive learning were experimented with. The performance of the models was assessed using Producer's Accuracy, User's Accuracy, Overall Accuracy, and F1 Score, while Grad-CAM was utilized to assess the importance of pixels in the classification. Furthermore, a post-processing step was performed to enhance the classification, and a comparison of overall quality was conducted between the best DL-based classifier and the RF classifier to determine the impact of DL on the classification relative to Svoboda et al. (2022).
3.1 Architectures comparison
Although the 3D model yielded the best result, all the proposed DL architectures achieved an F1 score greater than 0.82. Note that all DL models were trained with augmented datasets for the Wetland and Other land classes, and the hyperparameters remained consistent with those outlined in Section 2.3.2. Table 2 summarizes the results obtained for each CNN architecture presented in this study, as well as the performance of RF applied to our data. The hyperparameters for the RF method were set to Number of Trees = 300, Variables per Split = 2, and Bag Fraction = 0.1, resulting from a grid search with the same parameters defined in Svoboda et al. (2022).
Table 2 Accuracy assessment of the classifiers for the different tested architectures
2D 3D 3D Multiscale RF
(OA = 88.70 %) (OA = 89.40 %) (OA = 88.30 %) (OA = 85.60 %)
Class PA UA F1 PA UA F1 PA UA F1 PA UA F1
Settlement 0.79 0.84 0.81 0.83 0.85 0.84 0.82 0.83 0.82 0.69 0.64 0.67
Cropland 0.92 0.92 0.92 0.93 0.92 0.93 0.93 0.92 0.92 0.88 0.93 0.91
Woodland 0.90 0.95 0.92 0.88 0.97 0.92 0.89 0.94 0.92 0.85 0.97 0.90
Grassland 0.82 0.69 0.75 0.83 0.70 0.82 0.77 0.68 0.72 0.86 0.56 0.68
Wetland 0.71 0.83 0.77 0.90 0.75 0.82 0.77 0.83 0.80 1.00 0.75 0.86
Other land 0.92 0.69 0.79 1.00 0.63 0.77 0.76 0.81 0.79 1.00 0.19 0.32
Average 0.84 0.82 0.83 0.90 0.80 0.84 0.82 0.84 0.83 0.88 0.67 0.72
The 3D model achieved the highest average F1 score, although the RF model performed better in terms of PA and F1 for some classes. All test samples that fell into the underrepresented Wetland and Other land classes were correctly classified by RF, so the extent of the Wetlands class would likely be overestimated in the case of the 3D model. The 3D model outperformed all other classifiers in classifying Settlements, while RF achieved the best Recall for Cropland and the 3D and Multiscale models achieved the best PA. However, RF misclassified many Grassland samples as Cropland, leading to an overestimation of the Cropland class compared to the other classifiers. All classifiers were similarly successful in classifying the Woodland class, with the 3D model slightly outperforming the others in terms of Recall and F1. Conversely, RF is expected to underestimate the extent of the Grassland class, since almost 44% of the samples representing this class were misclassified as other classes (UA = 0.56).
Figure 6 displays the mean heatmaps obtained by the 3D model for each class in
the dataset, depicting the correctly classified pixels. Despite minor variations, all the
heatmaps assign more significance to pixels located closer to the centre of the image
compared to those that are farther away. It appears that the activation of central
pixels has been transmitted to the surrounding pixels, as the last convolutional layer’s
activation maps have a size of 2×2×6, and the final step of Grad-CAM involves resizing
the heatmap to match the original input’s size. Therefore, we can conclude that central
pixels are crucial for classifying all classes, and their importance surpasses that of
non-central pixels.
Figure 6 Heatmaps representing the pixels that were accurately classified by the 3D model. The
intensity of red indicates higher significance, whereas white signifies lower importance. Each class’s
most crucial score is highlighted by the green square
3.2 Application of Data Augmentation
The classification was improved by using data augmentation on the classes that had fewer samples. Table 3 summarizes the outcomes of training the 3D model without augmentation, with augmentation of the smaller classes only, and with augmentation of all classes. The model's performance decreased when augmentation was applied to all classes, and not using data augmentation at all also resulted in worse performance. Consequently, all other experiments in this study utilized data augmentation exclusively for the Wetland and Other land classes, which had fewer samples.
Table 3 The outcomes of training the 3D model with and without augmentation
No augmentation Smaller classes augmented All classes augmented
Class PA UA F1 PA UA F1 PA UA F1
Settlement 0.82 0.79 0.81 0.83 0.85 0.84 0.77 0.79 0.78
Cropland 0.92 0.90 0.91 0.93 0.92 0.93 0.87 0.88 0.88
Woodland 0.91 0.94 0.93 0.88 0.97 0.92 0.89 0.93 0.91
Grassland 0.74 0.78 0.76 0.83 0.70 0.82 0.75 0.65 0.70
Wetland 0.71 0.83 0.77 0.90 0.75 0.82 0.59 0.83 0.69
Other land 0.73 0.50 0.59 1.00 0.63 0.77 0.91 0.63 0.74
Average 0.81 0.79 0.79 0.90 0.80 0.84 0.80 0.78 0.78
3.3 Data Ablation
Using all the Sentinel-2 bands plus NDVI, NDVI variance, and altitude led the 3D model to its best result. Table 4 summarises the results of the ablation study carried out with the 3D model. All the models were trained applying data augmentation to the classes with fewer samples (see Sec. 3.2). Using all the bands employed in Svoboda et al. (2022) outperformed training with fewer bands. The same model trained with only the 10 m Sentinel-2 bands (RGB+NIR) had the worst performance. Training only with the Sentinel-2 RGB, NIR, and SWIR bands obtained an F1 Score 0.06 lower than training with all the data (0.78 vs 0.84).
Table 4 Results from the ablation study (per-class F1 scores). We trained the 3D CNN model using only parts of the data: RGB+NIR (10 m), RGB+NIR+NDVI+VAR(NDVI), RGB+NIR+SWIR, all Sentinel-2 bands at 10 and 20 m (S2), all Sentinel-2 bands plus altitude (S2+ALT), and all Sentinel-2 bands plus NDVI and its variance (S2+NDVI+VAR(NDVI)), separately. The best result was achieved by combining all the bands used in this study.
Class RGB+NIR RGB+NIR+NDVI+VAR RGB+NIR+SWIR S2 S2+ALT S2+NDVI+VAR All
Settlement 0.78 0.78 0.79 0.78 0.77 0.81 0.84
Cropland 0.81 0.91 0.85 0.88 0.88 0.92 0.93
Woodland 0.86 0.91 0.89 0.91 0.92 0.91 0.92
Grassland 0.51 0.69 0.57 0.70 0.69 0.76 0.82
Wetland 0.58 0.62 0.67 0.69 0.69 0.52 0.82
Other land 0.41 0.69 0.50 0.74 0.58 0.87 0.77
Average 0.66 0.77 0.71 0.78 0.76 0.81 0.84
3.4 Filter size comparison and contrastive learning effect
The 3×3-pixel convolutions exhibited slightly better performance than the 2×2 and 4×4 convolutions. Table 5 summarizes the results obtained for the different filter sizes in the convolutional layers, as well as for the 3D model trained using the contrastive approach. Overall, there was no significant difference in the achieved metrics across filter sizes. Contrastive training did not improve the overall classification performance: the classical training of the 3D model outperformed the model trained with the contrastive method, as shown in Table 5. However, the contrastive training resulted in better UA for the Wetland and Other land classes.
Table 5 Summary of the outcomes obtained for various convolutional filter sizes used in the 3D
model and the model trained using the contrastive approach.
2×2 3×3 4×4 3×3 (Contrastive)
Class PA UA F1 PA UA F1 PA UA F1 PA UA F1
Settlement 0.84 0.83 0.83 0.83 0.85 0.84 0.80 0.84 0.82 0.79 0.83 0.81
Cropland 0.92 0.92 0.92 0.93 0.92 0.93 0.93 0.92 0.92 0.91 0.91 0.91
Woodland 0.89 0.97 0.93 0.88 0.97 0.92 0.88 0.95 0.92 0.87 0.95 0.91
Grassland 0.85 0.68 0.76 0.83 0.70 0.82 0.81 0.70 0.75 0.77 0.60 0.68
Wetland 0.63 0.83 0.71 0.90 0.75 0.82 0.75 0.75 0.75 0.71 0.83 0.77
Other land 1.00 0.50 0.67 1.00 0.63 0.77 1.00 0.63 0.77 0.85 0.69 0.76
Average 0.85 0.79 0.80 0.90 0.80 0.84 0.86 0.80 0.82 0.82 0.80 0.81
3.5 Accuracy assessment after lone pixel filtering
There was no change in the 3D model’s accuracy after removing single pixels, although
most of the RF metrics improved. Table 6summarizes the individual accuracy evalu-
ation indices for the classified rasters (3D and RF) after eliminating lone pixels. The
improvement observed in RF after filtering was not statistically significant, with only
six more samples accurately identified, but there was no decline in any class’s met-
rics. Additionally, all samples that were correctly classified before filtering remained so
19
after filtering. Thus, it can be assumed that removing lone pixels refined the RF classi-
fication. Despite the RF’s improvement in observed metrics, the 3D classification still
outperformed it, with an OA of 89.4% for the filtered 3D rasters versus 86.1% for RF,
and this difference is statistically significant. Only the Wetlands class classification
metrics were superior in RF.
Table 6 Evaluation of classification accuracy after removing single
pixels
3D post-processed
(OA = 89.40%)
RF post-processed
(OA = 86.10%)
Class PA UA F1 PA UA F1
Settlement 0.83 0.85 0.84 0.71 0.68 0.69
Cropland 0.93 0.92 0.93 0.88 0.93 0.91
Woodland 0.88 0.97 0.92 0.85 0.97 0.91
Grassland 0.83 0.70 0.82 0.87 0.56 0.68
Wetland 0.90 0.75 0.82 1.00 0.75 0.86
Other land 1.00 0.63 0.77 1.00 0.20 0.32
Average 0.90 0.80 0.84 0.89 0.68 0.73
Table 7 shows the confusion matrix for the 3D architecture after post-processing. Classes with fewer samples, such as Wetland, Other land, and Grassland, are misclassified more often than the other classes, resulting in lower UA metrics due to higher false negatives. Similar situations were reported by Syrris et al. (2019) and Svoboda et al. (2022) for classes with few samples in their classifications. Training CNNs on large imbalanced datasets can bias the model towards the frequent classes in the training data, causing it to overlook classes with fewer samples. The Grassland class was often misclassified as Cropland due to their similarity, while the confusion between Woodland and Grassland can be explained by the reforestation of forest areas after drought and bark beetle calamities. The Woodland class had the highest number of false positives, followed by Cropland, as these two classes represent 72% of the training data. This behaviour can be attributed to the bias induced by the imbalanced training data.
Table 7 Confusion Matrix for the 3D model. Class order for both rows and columns: Settlement (ST), Cropland (CL), Woodland (WD), Grassland (GL), Wetland (WT), Other land (OL).
     ST  CL  WD  GL  WT  OL
ST   69   5   5   2   0   0
CL    7 425  16  14   0   0
WD    4   1 374   7   0   0
GL    2  25  20 113   1   0
WT    0   0   3   0   9   0
OL    1   0   5   0   0  10
There was a concern that the inclusion of the 20 m Sentinel-2 bands, oversampled to the 10 m grid by the nearest neighbour method, would cause misclassification at the boundaries between different LULUCF classes. Resampling each 20 m pixel results in a 2 × 2 block of target pixels with the same value (per band) as the original pixel. The boundaries between the original pixels, when resampled to higher detail, could thus imply boundaries between different objects that would be resolved more accurately from the 10 m Sentinel-2 bands alone. In Figure 7, it can be seen in selected subsets that the inclusion of the 20 m bands in the classification did not cause a clear degradation at the locations of boundaries between objects with different LULUCF classes.
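The effect described above can be expressed compactly: upsampling a 20 m band onto the 10 m grid replicates each pixel into a 2 × 2 block of identical values. A minimal NumPy sketch of this block replication (the band values are made up for illustration) is:

import numpy as np

band_20m = np.array([[0.31, 0.40],
                     [0.27, 0.35]])               # toy 20 m reflectance values
band_10m = np.kron(band_20m, np.ones((2, 2)))     # each 20 m pixel -> 2x2 block at 10 m
# band_10m[0] is [0.31, 0.31, 0.40, 0.40]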
Figure 7 Comparison of the CNN classification of the 10 m bands with the classification of the 10/20 m composite mosaic bands. The upper panel shows a subset with the boundary between Wetland and the adjacent Forest land; the boundary appears better defined in the classification with the 10/20 m bands. The middle panel shows the boundary between Cropland and Forest land; again, a better result was obtained with the Sentinel-2 10/20 m bands. The bottom panel shows a subset of a road (Settlement) surrounded by two parcels of Cropland. In the classification with 10 m bands, the road was slightly too narrow, whereas in the classification with the resampled 20 m bands it was clearly too wide (pixels adjacent to the road are often classified as Settlement). In this case, the inclusion of the 20 m bands led to an overestimation of the roads. However, since supporting technical infrastructure often runs along roads, the wider delimitation of the roads is a positive effect from the point of view of LULUCF classification (both the roads and the infrastructure belong to Settlement). Moreover, in the bottom panel, the boundary between Grassland and Cropland in the left part of the subset is much better defined in the classification with the resampled 20 m bands.
3.6 Results of the classification
Table 8 compares the areas of the individual LULUCF classes obtained from the classification methods used in this study with Svoboda et al. (2022), CORINE Land Cover (CLC) and LPIS. According to the accuracy assessment, the 3D model's results are closer to reality than those of the RF for most of the classes. In the area of interest, the dominant surface is Forest land, closely followed by the Cropland class, according to both the RF and 3D classifications, as shown in Table 8. The Cropland class has the largest absolute difference between RF and 3D (over 700 km²). Cropland and Forest land account for more than 80% of the entire area of interest in both classifications. The 3D model provides a more accurate representation of settlements than RF, as per the results in Table 6, while RF underestimates them. Other classes are also better classified by the 3D model than by RF, as indicated in Table 6. However, the Wetland class is overestimated in the 3D model, and thus the actual area of this class may be smaller. The larger part of the Other land area is composed of non-forest woody vegetation: 90.7% in the 3D model and as much as 99.7% in RF. This is likely due to the overestimation of the non-woody Other land class in the 3D classification, as shown in Table 7. At the same time, non-forest woody vegetation is less represented in the 3D model (3D = 117.3 km², RF = 162.2 km²).
Table 8 Comparison of the areas of the individual LULUCF classes obtained from the classification methods used in this study with Svoboda et al. (2022), CORINE Land Cover (CLC) and LPIS. The table provides a useful overview of the results of the different classification and data collection methods. The method published in Svoboda et al. (2022) was used for the conversion of CLC classes to LULUCF nomenclature. The LPIS database documents only agricultural land classes.

        3D               RF               Svoboda et al. (2022)   CLC 2018          LPIS 2018
Class   km²      %       km²      %       km²      %              km²       %       km²      %
ST      1680.0   7.24    1294.4   5.58    1059.8   4.56           1489.4    6.42    -        -
CL      9030.9   38.98   9766.5   42.07   9936.4   42.80          10435.9   44.95   8940.3   38.51
FL      9829.9   42.34   9807.3   42.24   8259.6   35.58          7773.6    33.48   -        -
GL      2281.7   9.83    1998.8   8.61    3560.7   15.34          3407.6    14.68   2221.9   9.64
WL      265.3    1.14    187.4    0.81    178.6    0.77           118.9     0.51    -        -
OL      129.2    0.56    162.7    0.70    162.2    0.69           -         -       -        -
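The areas in Table 8 can be derived from a classified raster by counting pixels per class; at 10 m resolution each pixel covers 100 m², i.e. 1e-4 km². The sketch below illustrates this conversion; the function name and the returned structure are illustrative.

import numpy as np

def class_areas(classified: np.ndarray, pixel_size_m: float = 10.0):
    """Return {class_code: (area_km2, percent)} for a classified 2D raster."""
    values, counts = np.unique(classified, return_counts=True)
    km2 = counts * (pixel_size_m ** 2) / 1e6        # pixel count -> km2
    pct = 100.0 * counts / counts.sum()
    return {int(v): (float(a), float(p)) for v, a, p in zip(values, km2, pct)}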
Figure 8 illustrates the land cover classification results of the RF and 3D methods at specific locations. In the selected slices, it can be seen that the objects (regions of pixels with the same class) produced by the 3D CNN classification are more compact and less noisy than in the RF case. On the other hand, the object boundaries are slightly shifted in the case of the convolutional neural network classification compared to reality (and to RF), which is probably due to the inclusion of the neighbourhood in the classification. Figure 9 presents subsets illustrating how land cover was classified for linear features in the landscape. While the RF classification does not capture the course of some linear features at all, the CNNs detect the lines better, although they are often extended compared to their real delimitation in the landscape.
Figure 8 Results of the classification of planar objects. The right side of the figure displays CIR (colour infrared) aerial images, in which vegetation appears in red hues and water is represented by a dark colour, while light or blue colours indicate bare ground or built-up areas. The top subset depicts a rural development surrounded by Grassland and Cropland. RF tends to misclassify pixels within the development as Cropland or Forest land, while the 3D method shows sparser misclassifications, resulting in more compact classified objects. The middle subset shows a largely cleared forest, due to bark beetle disturbance, which is expected to be restored in the future. The 3D classifier has fewer non-Woodland pixels than RF at this location. The bottom subset shows a quarry area, where RF misclassifies some logging locations as Cropland, while the 3D classifier shows no Cropland classification in any pixel within the quarry area. However, the vegetation island (Woodland) in the centre appears larger in the CIR image than in the 3D classification. This is an effect of the neighbourhood in the classification, where the Settlement category is predominant.
4 Discussion
The main objective of this study was to assess the suitability of CNNs for LULUCF classification using open Sentinel-2 data. The study tested and evaluated 2D and 3D convolutions in CNNs with various filter sizes, as well as the use of data augmentation and contrastive training. The data input included Sentinel-2 bands with spatial resolutions of 10 m and 20 m, NDVI, NDVI variance, and SRTM altitude. The performance and accuracy of the CNNs were compared with the RF methods used in Svoboda et al. (2022). Comparing the CNN architectures and the RF methods, the 3D model achieved the highest OA (89.4%) and the highest average F1 score across all the classes. Concerning the individual classes, the 3D model outperformed (based on F1) all the other classifiers in Settlement, Cropland, Woodland and Grassland.
Figure 9 Results of the classification of linear elements. The top subset shows two meandering watercourses (Wetland) flowing through a floodplain forest. The 3D method appears to be considerably more successful in classifying these watercourses, although they still do not form continuous linear features even in the 3D case. The middle subset shows the road network (Settlement), which the 3D method is also more successful in classifying. However, the roads appear slightly wider in the 3D classification than in the CIR aerial image, potentially due to the inclusion of roadside verges and ditches (infrastructure) in the classification. This demonstrates the ability of the 3D CNN to detect linear objects, albeit with a tendency towards overestimation in some cases.
RF was more successful in classifying Wetland, while the 2D and 3D Multiscale models performed best in Other land. The methods used in this study brought several improvements in the classification of the LULUCF classes. For Grassland, for example, Svoboda et al. (2022) documented a coverage of 15.34%, compared to RF = 8.55% and 3D = 9.77% in this study. The reason for this difference is that a significant area of young forest was mistakenly classified as Grassland in Svoboda et al. (2022), so the Grassland class was overestimated there. However, the misclassification of Grassland as Cropland (as shown in Table 7) remains the most common error, consistent with Svoboda et al. (2022)'s study. The 3D model trained with data augmentation for the underrepresented classes achieved better performance than the 3D model with augmentation for all classes. We propose that this phenomenon is associated with the bias introduced into the model during the augmentation process. As discussed in Balestriero, Bottou, and LeCun (2022), data augmentation can introduce bias, leading to reduced accuracy in certain classes due to the uncertainty introduced by the generation of new samples. This assertion finds support in the observed increase in PA for all classes except Woodland, accompanied by a decrease in UA for the Grassland and Wetland classes.
The number of trainable parameters in all the models used in this study was much smaller than in DL segmentation architectures such as U-Net, Fully Convolutional Networks, or SegNet. On the other hand, the inference time of the proposed models can be longer than that of these architectures, since patch-based models classify a single pixel per forward pass, whereas segmentation networks label an entire tile at once. The architecture proposed by Syrris et al. (2019) performed poorly on the same data despite comprising many more filters than the proposed models. Increasing the size of the convolutions, using 3D convolutions, and using the ReLU activation function helped the proposed models to learn more from the data.
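To make the setting concrete, the following PyTorch sketch shows a small patch-based classifier with 3D convolutions and 3 × 3 × 3 filters of the kind evaluated here, operating on 5 × 5 patches with 13 stacked layers (ten Sentinel-2 bands, NDVI, NDVI variance, and SRTM altitude) treated as the spectral depth axis. The number of layers and filters is an illustrative assumption, not the exact architecture of this study.

import torch
import torch.nn as nn

class Small3DCNN(nn.Module):
    def __init__(self, n_bands: int = 13, n_classes: int = 6):
        super().__init__()
        self.features = nn.Sequential(
            # input: (batch, 1, n_bands, 5, 5); the 3x3x3 filters slide over the
            # spectral axis as well as the 5x5 spatial neighbourhood of the pixel
            nn.Conv3d(1, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv3d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        self.classifier = nn.Linear(32 * n_bands * 5 * 5, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = self.features(x)
        return self.classifier(z.flatten(start_dim=1))

patches = torch.randn(8, 1, 13, 5, 5)     # a batch of 5x5 patches with 13 layers
logits = Small3DCNN()(patches)            # (8, 6): one score per LULUCF class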
In addition, newer data augmentation techniques could be applied. Among the image transformations, MixUp (H. Zhang, Cisse, Dauphin, & Lopez-Paz, 2018) and CutOff (Shen, Zheng, Shen, Qu, & Chen, 2020) can be applied to generate new samples, even given the nature of RSI data patched in sliding windows of 5×5 pixels; a minimal sketch of MixUp follows this paragraph. MixUp merges images from different classes by linearly combining the values of corresponding pixels, and does the same to the labels, while CutOff removes random pixels from the image. In terms of policies, a reduced AutoAugment (Cubuk, Zoph, Mane, Vasudevan, & Le, 2018) could be employed to learn good transformations for each sample in the data, using a huge RSI dataset such as BigEarthNet (Sumbul, Charfuelan, Demir, & Markl, 2019) as a proxy task.
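As a concrete example, a minimal MixUp sketch for batches of 5 × 5 multi-band patches could look as follows; the shapes and the alpha parameter are illustrative assumptions.

import numpy as np

def mixup(x: np.ndarray, y: np.ndarray, alpha: float = 0.2):
    """x: (batch, bands, 5, 5) patches; y: (batch, n_classes) one-hot labels."""
    lam = np.random.beta(alpha, alpha)            # mixing weight
    perm = np.random.permutation(len(x))          # pair each sample with another
    x_mix = lam * x + (1.0 - lam) * x[perm]       # convex combination of patches
    y_mix = lam * y + (1.0 - lam) * y[perm]       # ... and of the soft labels
    return x_mix, y_mix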
By using a large number of negative pairs in contrastive learning, the 3D model was able to classify the classes with fewer samples more accurately, but at the expense of lower performance on the other classes. In contrast, the 3D Multiscale model performed better than the 3D model with contrastive training, indicating that using different scales to extract features is more effective in preventing false negatives in these classes (a sketch of the underlying contrastive objective is given after this paragraph). Increasing the size of the convolutional filters can improve the classification of smaller sparse settlement areas, since more pixels are considered when extracting features. Conversely, smaller filter sizes are better for identifying small elements such as narrow streets, which can be obscured by the high presence of other classes in the surroundings. For this application, a filter size of 3×3 is the best trade-off between classifying spread-out and small areas. In future work, concatenating the activations of different blocks could provide more information about the data, albeit at the cost of more model parameters. Dropout could also be used to add noise during training by masking some activations in the convolutional part of the model.
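For reference, the pairwise contrastive objective of Hadsell, Chopra, and LeCun (2006), which underlies the contrastive training discussed above, can be sketched as follows: embeddings of same-class pairs are pulled together, while negative pairs are pushed at least a margin apart. The margin value and the function signature are illustrative.

import torch
import torch.nn.functional as F

def contrastive_loss(z1, z2, same_class, margin: float = 1.0):
    """z1, z2: (batch, dim) embeddings; same_class: (batch,) 1.0 for positive pairs."""
    d = F.pairwise_distance(z1, z2)
    pos = same_class * d.pow(2)                           # pull positives together
    neg = (1.0 - same_class) * F.relu(margin - d).pow(2)  # push negatives apart
    return (pos + neg).mean()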
Comparing the results of this study with similar work, Syrris et al. (2019) used convolutional networks to classify land cover from Sentinel-2 data. They employed a model similar to the 2D model in this work and used a 5×5 pixel neighbourhood for classification. The key difference is that Syrris et al. (2019) used only the Sentinel-2 bands with a resolution of 10 m (B2, B3, B4, and B8). This study confirmed, through a data ablation study, that the 20 m red-edge/SWIR bands are crucial for distinguishing vegetation types such as Grassland, Forest land, and Cropland, in line with Svoboda et al. (2022).
In this study, the convolutional neural networks used for classification exhibited a solid ability to detect linear objects. However, an overestimation of the width of some features can be found, likely due to the selection of training samples: the training process uses every pixel that intersects a training line, but not all of these pixels lie within the line object itself. To address this issue, modifications to the training of line data may be necessary in future studies.
After classification, the 3D model was identified as the superior DL model and post-processing, i.e. filtering out lone pixels, was performed on it. In the case of the 3D model, lone pixels were sparsely distributed, and filtering them had little impact on the accuracy assessment compared to RF. However, it would be worthwhile to investigate the lone pixels of the 3D classification in more detail to determine whether they represent correctly classified LULUCF classes or salt-and-pepper noise; for certain DL models, filtering lone pixels may not be appropriate. The noise suppression observed in DL models may be due to the contribution of surrounding pixel values in the classification process, which can override outlier values of the classified pixel, such as a clearing in Forest land, a pool or garden in Settlement, a boat in Wetland, or a tree in Grassland.
From the research point of view, the important task was to test advanced classification methods for the purpose of LULUCF classification. LULUCF classes often comprise surfaces that differ greatly in spectral reflectance. The positive finding is that the use of CNN classifiers improves the accuracy of most of the classes. Evidently, the classification based on the 3D model better detects the classes in terms of their spatial delimitation (spatial pattern); see Figs. 8 and 9. This is a significant research contribution of this study, and it would be useful to compare its results with the new Copernicus Land product CLC+ (https://land.copernicus.eu/pan-european/clc-plus). Moreover, an investigation of the Minimal Mapping Unit (MMU) appears to be a relevant problem for subsequent studies. Svoboda et al. (2022) utilized a minimum object size of 0.5 ha (50 pixels at 10 m resolution) as the sole criterion for splitting the Woodland class into non-forest woody vegetation and Forest land, inspired by relevant land cover databases and nomenclatures such as the High Resolution Layer Forests (Copernicus Land Monitoring Service, 2021), the FAO and Kyoto Protocol definitions of forest (Ministry of Environment of the Czech Republic, 2006), and the Czech National Forest Inventory (UHUL). However, official EU regulations on LULUCF reporting allow the MMU of Forest land to differ across European countries. The Czech Republic and Austria are the only EU countries employing a relatively small minimum Forest land object size of 0.05 ha. If a standardized approach to LULUCF classification for the entire European Union or the world is considered, a value of 0.5 ha seems the most suitable, given its prevalence in numerous definitions and its status as the most widely used minimum area for a forest object in the European Union: it is used in over a third of EU countries, covering 45% of the total EU area, including France, Sweden, Finland, Italy, Hungary, and others (refer to Regulation (EU) 2018/841 of the European Parliament).
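A minimal sketch of applying such an MMU is given below: connected components of the Forest land class smaller than 0.5 ha (50 pixels at 10 m resolution) are relabelled, here to a hypothetical non-forest woody vegetation code. The class codes and the 4-connectivity are illustrative assumptions.

import numpy as np
from scipy.ndimage import label

def apply_mmu(classified: np.ndarray, forest_code: int = 3,
              woody_code: int = 6, mmu_pixels: int = 50) -> np.ndarray:
    """Relabel Forest land objects smaller than the MMU (0.5 ha at 10 m)."""
    components, n = label(classified == forest_code)   # 4-connected forest objects
    out = classified.copy()
    for i in range(1, n + 1):
        obj = components == i
        if obj.sum() < mmu_pixels:                     # below the 0.5 ha threshold
            out[obj] = woody_code
    return out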
5 Conclusions
Monitoring land use and land cover change is crucial, as it has a significant impact on climate change and the global carbon cycle. Regulations have therefore been put in place for the inventory and reporting of the relevant land use classes, in the sector called land use, land-use change, and forestry (LULUCF). Despite the availability of satellite multispectral imagery and of methods to automatically classify land cover classes, these methods often do not consider spatial and spectral relations. In this study, we successfully employed small Convolutional Neural Networks to classify land cover, with the objective of generating a LULUCF report for southern Czechia. We explored different convolution dimensions, filter sizes, data augmentation techniques, and contrastive learning. Our best model achieved an F1 score of 0.84, outperforming the previous approach presented in Svoboda et al. (2022) by 0.12.
Regarding architecture, our results showed that: i) 3D convolutions performed better than 2D convolutions and the Random Forest classifier; ii) data augmentation applied to the classes with fewer samples improved classification in the DL models; iii) the best filter size was 3×3×3, offering the best trade-off between classifying spread-out and small areas in our data; and iv) contrastive learning did not improve the overall classification. In terms of classification quality, the 3D model delimits and detects the spatial pattern of most classes, as defined by LULUCF, better than the RF results do.
Despite the promising results, there is still room for improvement. Newer data augmentation techniques could be employed to generate more data, particularly for the classes with fewer samples. In terms of architecture, Dropout and the concatenation of activations from different blocks could be incorporated to enhance overall performance. From the point of view of LULUCF reporting, the following research steps should focus on the possibilities of combining satellite data (Sentinel-2, Planet.com) and cadastral data, making maximum use of the advantages of these data sources for accurate and temporally consistent reporting.
Acknowledgements. This research activity was supported by the “DATI - Digital Agriculture Technologies for Irrigation efficiency” project, PRIMA Partnership for Research and Innovation in the Mediterranean Area (Research and Innovation activities), financed by the states participating in the PRIMA partnership and by the European Union through Horizon 2020, by FCT - Portuguese Foundation for Science and Technology under the project UIDB/04033/2020, and by Charles University, project GA UK No. 412922 (Charles Univ, Fac Sci, 2022–2024): “Klasifikace Land Use/Land-use Change and Forestry (LULUCF) Česka pomocí metod strojového učení” (Classification of Land Use, Land-Use Change and Forestry (LULUCF) of Czechia using machine learning methods).
Statements and Declarations
Conflict of interest
The authors report there are no competing interests to declare.
Data availability
The data that support the findings of this study are available from the corresponding author, Přemysl Štych, upon reasonable request.
Materials availability
Not applicable
Code availability
Not applicable
References
Alcantara, C., Kuemmerle, T., Prishchepov, A.V., Radeloff, V.C. (2012). Mapping abandoned agriculture with multi-temporal MODIS satellite data. Remote Sensing of Environment, 124, 334–347. https://doi.org/10.1016/j.rse.2012.05.019
Badrinarayanan, V., Kendall, A., Cipolla, R. (2017). SegNet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(12), 2481–2495. https://doi.org/10.1109/TPAMI.2016.2644615 (arXiv:1511.00561)
Balestriero, R., Bottou, L., LeCun, Y. (2022). The effects of regularization and data augmentation are class dependent. https://doi.org/10.48550/arXiv.2204.03632 (arXiv:2204.03632)
Ben Hamida, A., Benoit, A., Lambert, P., Ben Amar, C. (2018). 3-D deep learning approach for remote sensing image classification. IEEE Transactions on Geoscience and Remote Sensing, 56(8), 4420–4434. https://doi.org/10.1109/TGRS.2018.2818945
Bäuerle, A., van Onzenoodt, C., Ropinski, T. (2021). Net2Vis – a visual grammar for automatically generating publication-tailored CNN architecture visualizations. IEEE Transactions on Visualization and Computer Graphics, 27(6), 2980–2991.
Cavur, M., Duzgun, H.S., Kemec, S., Demirkan, D.C. (2019). Land use and land cover classification of Sentinel-2A: St. Petersburg case study. The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, XLII-1/W2, 13–16. https://doi.org/10.5194/isprs-archives-XLII-1-W2-13-2019
Chollet, F. (2017). Deep learning with Python. Manning Publications.
Close, O., Petit, S., Beaumont, B., Hallot, E. (2021). Evaluating the potentiality of Sentinel-2 for change detection analysis associated to LULUCF in Wallonia, Belgium. Land, 10(1), 55. https://doi.org/10.3390/land10010055
Copernicus Land Monitoring Service (2021). Tree cover/forest and change 2015–2018 (Technical report). Retrieved from https://land.copernicus.eu/user-corner/technical-library/forest-2018-user-manual.pdf
Corbane, C., Syrris, V., Sabo, F., Politis, P., Melchiorri, M., Pesaresi, M., ... Kemper, T. (2021). Convolutional neural networks for global human settlements mapping from Sentinel-2 satellite imagery. Neural Computing and Applications, 33(12), 6697–6720. https://doi.org/10.1007/s00521-020-05449-7
Cubuk, E.D., Zoph, B., Mane, D., Vasudevan, V., Le, Q.V. (2018). AutoAugment: Learning augmentation policies from data, 113–123. https://doi.org/10.48550/arXiv.1805.09501 (arXiv:1805.09501)
Cui, Y., Jia, M., Lin, T.-Y., Song, Y., Belongie, S. (2019). Class-balanced loss based on effective number of samples. https://doi.org/10.48550/arXiv.1901.05555 (arXiv:1901.05555)
da Silva, J.F., Cicerelli, R.E., Almeida, T., Neumann, M.R.B., de Souza, A.L.F. (2020). Land use/cover (LULC) mapping in Brazilian Cerrado using neural network with Sentinel-2 data. Floresta, 50(3), 1430–1438.
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. 2009 IEEE Conference on Computer Vision and Pattern Recognition (pp. 248–255).
Ellison, D., Lundblad, M., Petersson, H. (2014). Reforming the EU approach to LULUCF and the climate policy framework. Environmental Science & Policy, 40, 1–15. https://doi.org/10.1016/j.envsci.2014.03.004
Filip, H., & Radim, A. (2010). Digital photogrammetric survey in the National Forest Inventory (NFI) in the Czech Republic. Retrieved from https://www.researchgate.net/publication/228984487
Ghayour, L., Neshat, A., Paryani, S., Shahabi, H., Shirzadi, A., Chen, W., ... Ahmad, A. (2021). Performance evaluation of Sentinel-2 and Landsat 8 OLI data for land cover/use classification using a comparison between machine learning algorithms. Remote Sensing, 13(7), 1349. https://doi.org/10.3390/rs13071349
Hadsell, R., Chopra, S., LeCun, Y. (2006). Dimensionality reduction by learning an invariant mapping. 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06) (Vol. 2, pp. 1735–1742).
Hansen, M.C., Stehman, S.V., Potapov, P.V. (2010). Quantification of global gross forest cover loss. Proceedings of the National Academy of Sciences, 107(19), 8650–8655. https://doi.org/10.1073/pnas.0912668107
Helber, P., Bischke, B., Dengel, A., Borth, D. (2019). EuroSAT: A novel dataset and deep learning benchmark for land use and land cover classification. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 12(7), 2217–2226. https://doi.org/10.1109/JSTARS.2019.2918242
IPCC (2003). Good practice guidance for land use, land-use change and forestry. Published by the Institute for Global Environmental Strategies for the IPCC.
Jia, S., Jiang, S., Lin, Z., Li, N., Xu, M., Yu, S. (2021). A survey: Deep learning for hyperspectral image classification with few labeled samples. Neurocomputing, 448, 179–204. (arXiv:2112.01800)
Karra, K., Kontgis, C., Statman-Weil, Z., Mazzariello, J.C., Mathis, M., Brumby, S.P. (2021). Global land use/land cover with Sentinel-2 and deep learning. 2021 IEEE International Geoscience and Remote Sensing Symposium (IGARSS) (pp. 4704–4707).
Khosla, P., Teterwak, P., Wang, C., Sarna, A., Tian, Y., Isola, P., ... Krishnan, D. (2021). Supervised contrastive learning. https://doi.org/10.48550/arXiv.2004.11362 (arXiv:2004.11362)
Koch, G.R. (2015). Siamese neural networks for one-shot image recognition.
Laštovička, J., Švec, P., Paluba, D., Kobliuk, N., Svoboda, J., Hladký, R., Štych, P. (2020). Sentinel-2 data in an evaluation of the impact of the disturbances on forest vegetation. Remote Sensing, 12(12), 1914. https://doi.org/10.3390/rs12121914
LeCun, Y., Bengio, Y., Hinton, G. (2015). Deep learning. Nature, 521(7553), 436–444. https://doi.org/10.1038/nature14539
Li, Y., Zhang, H., Xue, X., Jiang, Y., Shen, Q. (2018). Deep learning for remote sensing image classification: A survey. WIREs Data Mining and Knowledge Discovery, 8(6), e1264. https://doi.org/10.1002/widm.1264
Lin, T., Maire, M., Belongie, S.J., Bourdev, L.D., Girshick, R.B., Hays, J., ... Zitnick, C.L. (2014). Microsoft COCO: Common objects in context. CoRR, abs/1405.0312. Retrieved from http://arxiv.org/abs/1405.0312
Lin, T.-Y., Goyal, P., Girshick, R.B., He, K., Dollár, P. (2017). Focal loss for dense object detection. CoRR, abs/1708.02002. Retrieved from http://arxiv.org/abs/1708.02002
Ma, L., Liu, Y., Zhang, X., Ye, Y., Yin, G., Johnson, B.A. (2019). Deep learning in remote sensing applications: A meta-analysis and review. ISPRS Journal of Photogrammetry and Remote Sensing, 152, 166–177. https://doi.org/10.1016/j.isprsjprs.2019.04.015
Micek, O., Feranec, J., Stych, P. (2020). Land use/land cover data of the Urban Atlas and the Cadastre of Real Estate: An evaluation study in the Prague metropolitan region. Land, 9(5), 153. https://doi.org/10.3390/land9050153
Michetti, M. (2012). Modelling land use, land-use change, and forestry in climate change: A review of major approaches (Technical report). Milano. Retrieved from http://hdl.handle.net/10419/59742
Ministry of Environment of the Czech Republic (2006). Czech Republic's initial report under the Kyoto Protocol (Technical report). Retrieved from https://unfccc.int/files/national reports/initial reports under the kyoto protocol/application/pdf/initial rep
Nguyen, H.T.T., Doan, T.M., Tomppo, E., McRoberts, R.E. (2020). Land use/land cover mapping using multitemporal Sentinel-2 imagery and four classification methods - a case study from Dak Nong, Vietnam. Remote Sensing, 12(9), 1367. https://doi.org/10.3390/rs12091367
Rana, V.K., & Suryanarayana, T.M.V. (2020). Performance evaluation of MLE, RF and SVM classification algorithms for watershed scale land use/land cover mapping using Sentinel 2 bands. Remote Sensing Applications: Society and Environment, 19, 100351. https://doi.org/10.1016/j.rsase.2020.100351
Ronneberger, O., Fischer, P., Brox, T. (2015). U-Net: Convolutional networks for biomedical image segmentation. CoRR, abs/1505.04597. Retrieved from http://arxiv.org/abs/1505.04597
Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., Batra, D. (2019). Grad-CAM: Visual explanations from deep networks via gradient-based localization. International Journal of Computer Vision, 128(2), 336–359. https://doi.org/10.1007/s11263-019-01228-7 (arXiv:1610.02391)
Sharma, A., Liu, X., Yang, X., Shi, D. (2017). A patch-based convolutional neural network for remote sensing image classification. Neural Networks, 95, 19–28. https://doi.org/10.1016/j.neunet.2017.07.017
Shelhamer, E., Long, J., Darrell, T. (2017). Fully convolutional networks for semantic segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(4), 640–651. https://doi.org/10.1109/TPAMI.2016.2572683
Shen, D., Zheng, M., Shen, Y., Qu, Y., Chen, W. (2020). A simple but tough-to-beat data augmentation approach for natural language understanding and generation. Retrieved from http://arxiv.org/abs/2009.13818 (arXiv:2009.13818)
Signoroni, A., Savardi, M., Baronio, A., Benini, S. (2019). Deep learning meets hyperspectral image analysis: A multidisciplinary review. Journal of Imaging, 5(5), 52. https://doi.org/10.3390/jimaging5050052
Storie, C.D., & Henry, C.J. (2018). Deep learning neural networks for land use land cover mapping. 2018 IEEE International Geoscience and Remote Sensing Symposium (IGARSS) (pp. 3445–3448).
Sumbul, G., Charfuelan, M., Demir, B., Markl, V. (2019). BigEarthNet: A large-scale benchmark archive for remote sensing image understanding. IGARSS 2019 - 2019 IEEE International Geoscience and Remote Sensing Symposium (pp. 5901–5904). Retrieved from http://arxiv.org/abs/1902.06148
Svoboda, J., Štych, P., Laštovička, J., Paluba, D., Kobliuk, N. (2022). Random Forest classification of Land Use, Land-Use Change and Forestry (LULUCF) using Sentinel-2 data - a case study of Czechia. Remote Sensing, 14(5), 1189. https://doi.org/10.3390/rs14051189
Syrris, V., Hasenohr, P., Delipetrev, B., Kotsev, A., Kempeneers, P., Soille, P. (2019). Evaluation of the potential of convolutional neural networks and random forests for multi-class segmentation of Sentinel-2 imagery. Remote Sensing, 11(8), 907. https://doi.org/10.3390/rs11080907
Vali, A., Comai, S., Matteucci, M. (2020). Deep learning for land use and land cover classification based on hyperspectral and multispectral earth observation data: A review. Remote Sensing, 12(15), 2495. https://doi.org/10.3390/rs12152495
Zhang, C., Sargent, I., Pan, X., Li, H., Gardiner, A., Hare, J., Atkinson, P.M. (2019). Joint deep learning for land cover and land use classification. Remote Sensing of Environment, 221, 173–187. https://doi.org/10.1016/j.rse.2018.11.014
Zhang, H., Cisse, M., Dauphin, Y.N., Lopez-Paz, D. (2018). mixup: Beyond empirical risk minimization. Retrieved from http://arxiv.org/abs/1710.09412 (arXiv:1710.09412)
Zhu, X.X., Tuia, D., Mou, L., Xia, G.-S., Zhang, L., Xu, F., Fraundorfer, F. (2017). Deep learning in remote sensing: A comprehensive review and list of resources. IEEE Geoscience and Remote Sensing Magazine, 5(4), 8–36. https://doi.org/10.1109/MGRS.2017.2762307