ISPRS Journal of Photogrammetry and Remote Sensing 173 (2021) 24–49
Available online 16 January 2021
0924-2716/© 2020 International Society for Photogrammetry and Remote Sensing, Inc. (ISPRS). Published by Elsevier B.V. All rights reserved.
Review Article

Review on Convolutional Neural Networks (CNN) in vegetation remote sensing

https://doi.org/10.1016/j.isprsjprs.2020.12.010
Received 28 July 2020; Received in revised form 5 November 2020; Accepted 15 December 2020
Teja Kattenborn a,*, Jens Leitloff b, Felix Schiefer c, Stefan Hinz b

a Remote Sensing Centre for Earth System Research, Leipzig University, Talstr. 35, 04103 Leipzig, Germany
b Institute of Photogrammetry and Remote Sensing (IPF), Karlsruhe Institute of Technology (KIT), Englerstr. 7, 76131 Karlsruhe, Germany
c Institute for Geography and Geoecology (IFGG), Karlsruhe Institute of Technology (KIT), Kaiserstr. 12, 76131 Karlsruhe, Germany
* Corresponding author. E-mail address: teja.kattenborn@uni-leipzig.de (T. Kattenborn).
ARTICLE INFO
Keywords:
Convolutional Neural Networks (CNN)
Deep learning
Vegetation
Plants
Remote sensing
Earth observation
ABSTRACT
Identifying and characterizing vascular plants in time and space is required in various disciplines, e.g. in forestry, conservation and agriculture. Remote sensing has emerged as a key technology revealing both spatial and temporal vegetation patterns. Harnessing the ever growing streams of remote sensing data for the increasing demands on vegetation assessments and monitoring requires efficient, accurate and flexible methods for data analysis. In this respect, the use of deep learning methods is trend-setting, enabling high predictive accuracy while learning the relevant data features independently in an end-to-end fashion. Very recently, a series of studies have demonstrated that the deep learning method of Convolutional Neural Networks (CNN) is very effective at representing spatial patterns, enabling the extraction of a wide array of vegetation properties from remote sensing imagery. This review introduces the principles of CNN and distils why they are particularly suitable for vegetation remote sensing. The main part synthesizes current trends and developments, including considerations about spectral resolution, spatial grain, different sensor types, modes of reference data generation, sources of existing reference data, as well as CNN approaches and architectures. The literature review showed that CNN can be applied to various problems, including the detection of individual plants or the pixel-wise segmentation of vegetation classes, while numerous studies have evinced that CNN outperform shallow machine learning methods. Several studies suggest that the ability of CNN to exploit spatial patterns particularly facilitates the value of very high spatial resolution data. The modularity of common deep learning frameworks allows high flexibility for the adaptation of architectures, from which especially multi-modal or multi-temporal applications can benefit. An increasing availability of techniques for visualizing features learned by CNNs will not only help to interpret such models but also to learn from them and improve our understanding of remotely sensed signals of vegetation. Although CNNs have not been around for long, it seems obvious that they will usher in a new era of vegetation remote sensing.
1. Introduction
Locating and characterizing vascular plants in time and space is key to various tasks: For instance, nature conservation in the context of global change and biodiversity decline can only be successfully implemented and supervised with accurate spatial representations of the state, structure and functioning of ecosystems and their flora (Nagendra et al., 2013; Pettorelli et al., 2017; Turner et al., 2003). Forestry requires regular and extensive information on forest stands, including their structure, timber volume, species composition, and forest damage (Fassnacht et al., 2016; McRoberts and Tomppo, 2007; White et al., 2016). In agriculture, there is a growing demand for geoinformation that facilitates resource efficiency and a reduction of environmental impacts (cf. precision farming), including fine-scale predictions of yield, weed infestations, and plant vigor (Atzberger et al., 2013; Mulla, 2013). For all of these tasks and requirements, remote sensing is continuously establishing itself as a key technology.
In the last decades, various technological advances have resulted in a growing availability of remote sensing data revealing vegetation patterns in both spatial and temporal domains (Colomina and Molina, 2014; Toth et al., 2016). Novel remote sensing platforms, such as swarms of microsatellites or unmanned aerial vehicles (UAV), facilitate
a bird's eye view on vegetation canopies with increasing spatial detail. Synthetic-aperture radar (SAR) and terrestrial or airborne laser scanning enable capturing the three-dimensional structure of multi-layered canopies. Additionally, there is an ongoing trend of data sharing and open access (cf. OpenAerialMap, the NEON programme of the US National Science Foundation, the EU's and ESA's Copernicus Open Access Hub).
These growing opportunities for vegetation remote sensing come
hand in hand with several challenges, including increased data volumes
and computational loads as well as more diverse data structures with
increasing dimensions (spatial, temporal, spectral) often featuring
complex relationships. Moreover, the various vegetation-related tasks and application fields can differ greatly in their inherent processes and requirements. Hence, harnessing remote sensing data for vegetation assessments and monitoring requires efficient, accurate, and flexible analytical methods.
In the context of image analysis and computer vision, deep learning is currently paving new avenues for remote sensing analysis (Chollet, 2017; Huang et al., 2018; Ronneberger et al., 2015; Zhu et al., 2017; Hoeser and Kuenzer, 2020; Zhang et al., 2019). In contrast to the previous shallow neural network approaches that have been under investigation for decades, deep learning is characterized by a significantly increased number of successively connected neural layers. This increased number of layers and transformations can reveal higher-level features and more abstract concepts, uncovering more complex and hierarchical relationships. A series of studies has demonstrated that this increased depth can indeed enhance the retrieval of vegetation-related information contained in remote sensing data (cf. Section 3.6). At the same time, increasing transformations and, thus, deeper levels of complexity commonly require more training data and computational loads. Nevertheless, deep learning became very popular due to several corresponding technical developments, including efficient data processing techniques (e.g. data augmentation or non-linear activation functions, see Sections 2.3 and 3.2), high-performance graphics cards, cloud computing, as well as open data initiatives providing annotated data. These developments enable an efficient calculation of countless non-linear transformations of the respective input data and, thus, form the core of the essential strength of deep learning, namely the ability of end-to-end learning.
Previous data analysis methods in remote sensing usually require feature engineering, i.e. the heuristic selection of appropriate transformations and the hand-crafting of latent variables from the input data prior to modelling. Examples in the field of vegetation remote sensing are spectral indices (Adam et al., 2010) or texture metrics (Haralick, 1979), whereas the numerous ways to derive such variables often make it impossible or inefficient to derive the most effective set of predictors. Moreover, defining the most appropriate predictors for vegetation analysis can be challenging, as this may not only require knowledge of the biochemical and structural plant properties but also of how these interact with the electromagnetic signal measured by the sensor. By contrast, with deep learning, the neural network itself can learn the data transformations that are best suited to solve the problem at hand.
The class of deep learning algorithms most commonly used for spatial pattern analysis are convolutional neural networks (CNNs or ConvNets). CNNs are designed to learn the spatial features, e.g. edges, corners, textures, or more abstract shapes, that best describe the target class or quantity. At the core of learning these features are manifold and successive transformations of the input data (convolutions) at different spatial scales (e.g. via pooling operations). This facilitates identifying and combining both low-level features and high-level concepts. The functioning of a CNN can, hence, be regarded as a mimicry of the animal cortex (Angermueller et al., 2016; Cadieu et al., 2014), where, analogously, numerous visual stimuli at varying scales are perceived in the field of vision (the counterpart of an image) and the contained spatial features and their spatial context serve to identify objects. For example, the shape of a leaf does not necessarily indicate the corresponding vegetation type, but its close proximity to branches and the tall and bulky canopy suggest that it belongs to a tree and not to a herb.
The effectiveness of deep learning and particularly CNNs has undoubtedly revolutionized our possibilities to analyse spatial patterns in Earth observation data. Reference is made here to previous and valuable comments and reviews, including a review by Zhu et al. (2017) on the general principles and potentials of deep learning in remote sensing, Hoeser and Kuenzer (2020) summarizing common frameworks and giving an in-depth overview of architectures for Earth observation data analysis, a comment by Brodrick et al. (2019) highlighting potentials of CNN for segmentation tasks in ecology, and Reichstein et al. (2019) providing perspectives on how deep learning in general can advance Earth system science.
The remote sensing of vegetation is characterized by special requirements and challenges, such as the often complex acquisition of reference data, or the understanding of the vegetation-specific radiative transfer, the resulting sensor-specific electromagnetic signals and their dynamics across the phenology. The present review therefore concentrates specifically on CNN applications in the field of vegetation remote sensing. A series of recent studies have demonstrated that CNNs enable accurate spatial representations of vegetation properties, such as detecting individual plant organs or individuals, classifying species and communities, or quantifying plant traits, from all kinds of remote sensing sensors and platforms. Still, CNN-based vegetation remote sensing is a very topical but young field of research. People with a background in remote sensing or vegetation science may require procedural knowledge on the working principles of CNNs and the anticipated potentials for vegetation mapping. In contrast, people from computer sciences may require declarative knowledge on application tasks in vegetation science, on types and availability of remote sensing data suitable for vegetation analysis, or on the relationship between remotely sensed signals and vegetation properties. Thus, the overall aim of this review is to link procedural and declarative knowledge and to provide an introduction and synthesis of the current state of the art on the utility of CNNs for vegetation remote sensing.
The present review is organized into three main sections: Chapter 2 briefly introduces the basic principles and the general functioning of CNNs and deduces why they are such a promising method for remote sensing of vegetation. Chapter 3 provides a summary and meta-analysis of the corresponding literature and synthesizes the current state of the art and challenges, including:

- common CNN approaches, architectures and strategies for the retrieval of vegetation properties,
- an overview of common application tasks and demonstrated potentials in the context of agriculture, forestry, and conservation,
- challenges and corresponding solutions regarding reference data quantity and quality of continuous and discrete vegetation variables,
- a consideration of spatial and spectral resolution for CNN-based vegetation remote sensing and considerations towards different sensors, platforms and combinations thereof.

Lastly, chapter 4 gives concluding remarks and discusses possible future directions and developments.
2. Principles of CNNs and relevance for vegetation remote sensing
This chapter introduces the basic principles of CNN, including the
functioning of convolutions, features that make convolutions suitable
for vegetation analysis, and how a CNN is commonly implemented and
trained.
2.1. Basic functioning and structure of CNNs
As any typical neural network-type model, CNNs are based on
neurons that are organized in layers and can, hence, learn hierarchical
representations. The neurons between layers are connected through weights and biases. The initial layer is the input layer, e.g. remote sensing data, and the last layer is the output, such as a predicted classification into plant species. In between are hidden layers transforming the feature space of the input in a way that it matches the output. CNNs include at least one convolutional layer as a hidden layer to exploit patterns (in the context of this review predominantly spatial patterns), but can also include other, non-convolutional layers. Convolutional layers include multiple optimizable filters (Fig. 1) that transform the input or preceding hidden layers. The number of filters defines the depth of a convolutional layer. The resulting transformations are aimed to reveal patterns that are decisive for the problem at hand. The decisive patterns are iteratively learned through convolving, which is essentially the sliding of the filter over the layer and the calculation of the dot product of the filter and the layer's values. The result is a new layer of dot products for each filter, also called a feature map (Fig. 1). The early feature maps in a CNN may include simple and fine-scaled patterns, such as corners, circles, or edges. The derived feature maps then serve as input for the next layer, e.g. another convolutional layer or a final layer that predicts an outcome based on the detected features. In deeper layers of a network, convolving usually reveals more abstract patterns and higher-level concepts, such as leaf forms, branching patterns or habit. During model training, randomly initialized filters are iteratively optimized to detect the relevant image features (the training procedure is described in Section 2.3). The combination of several successive convolutional layers with their numerous filters, hence, enables the network to learn and combine even subtle image features, revealing whether a class is present in an image or not (see Fig. 1 for a tree-species-specific activation of the network; for details on class activation mapping see Section 3.6.2).
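The sliding dot product described above can be written out explicitly. The following minimal NumPy sketch (stride 1, no padding; the image, filter values and array names are purely illustrative) computes one feature map from a single-band image and one filter:

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide a filter over an image and return the resulting feature map."""
    kh, kw = kernel.shape
    h, w = image.shape
    feature_map = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(feature_map.shape[0]):
        for j in range(feature_map.shape[1]):
            # Dot product of the filter and the underlying image patch
            feature_map[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return feature_map

# Example: a 3-by-3 vertical-edge filter applied to a random 8-by-8 image
image = np.random.rand(8, 8)
kernel = np.array([[1., 0., -1.], [1., 0., -1.], [1., 0., -1.]])
print(convolve2d(image, kernel).shape)   # (6, 6)
```

In a CNN, the values of such filters are not fixed as in this sketch but are the parameters optimized during training.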
Between sequences of multiple convolutional layers, the feature maps are commonly spatially down-sampled using spatial pooling operations (see Fig. 1). Pooling describes the transformation of multiple cells into one cell, similar to resampling an image to a coarser spatial resolution. Pooling has several advantages: It reduces the data size while preserving discriminant information, which in turn decreases the number of model parameters, and thus the computational load and the chance of overfitting; and it enables detecting more abstract features as well as spatial context across scales and thereby condenses semantic information. Pooling is defined by a filter size, a stride (the distance between consecutive pooling operations), and a reduction operation. The most typical pooling operation is max-pooling. The idea of max-pooling (instead of, for instance, average pooling) is that strong activations (e.g. edge or line features) are conserved within the network and not averaged out. A typical max-pooling operation with a 2-by-2 filter size and a stride of 2 reduces the size of the input feature map by a factor of 4, whereby each output cell contains the maximum value of the 4 input cells within the 2-by-2 filter.
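Such a 2-by-2 max-pooling operation with stride 2 can likewise be sketched in a few lines of NumPy (a minimal illustration; deep learning frameworks provide optimized equivalents):

```python
import numpy as np

def max_pool_2x2(feature_map):
    """2-by-2 max pooling with stride 2: keep the maximum of each 4-cell block."""
    h, w = feature_map.shape
    trimmed = feature_map[:h - h % 2, :w - w % 2]   # drop odd remainder rows/columns
    return trimmed.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

fmap = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool_2x2(fmap))   # [[ 5.  7.] [13. 15.]] -- map size reduced by a factor of 4
```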
The layers of CNNs, e.g. convolutional or pooling layers, can be combined in very different ways, commonly described as the CNN architecture. CNNs can, hence, have very different architectures, which are basically defined by the task. The task can be the classification of images, the segmentation of multiple classes, or the localization of individual objects within a scene (presented in more depth in chapter 3.2.2). The suitability of a CNN architecture largely depends on the complexity of the task: A more complex problem usually requires a deeper and more sophisticated network. In contrast, limited availability of training data constrains model complexity due to an increased risk of overfitting. The complexity and general performance of a CNN architecture further depend on the hyper-parameters, which define, amongst others, the number and characteristics of hidden layers, pooling operations, regularization techniques, or cost functions. Accordingly, there exists a wide array of options to implement a CNN towards the specific use case as well as predefined and established architectures. Examples are given in the literature review in chapter 3.2. A comprehensive overview of different architectures is given in Hoeser and Kuenzer (2020) and Zhu et al. (2017).
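As a concrete illustration of such an architecture, the following tf.keras sketch stacks four convolutional layers with pooling for an image-classification task in the spirit of Fig. 1 (a minimal, hypothetical example; the tile size of 128 × 128 pixels, the 4 input bands and the 8 output classes are assumptions, not taken from any reviewed study):

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(128, 128, 4)),                       # image tile with 4 spectral bands
    layers.Conv2D(16, 3, padding="same", activation="relu"),
    layers.MaxPooling2D(2),
    layers.Conv2D(32, 3, padding="same", activation="relu"),
    layers.MaxPooling2D(2),
    layers.Conv2D(64, 3, padding="same", activation="relu"),
    layers.MaxPooling2D(2),
    layers.Conv2D(128, 3, padding="same", activation="relu"),
    layers.GlobalAveragePooling2D(),
    layers.Dense(8, activation="softmax"),                  # e.g. 8 tree species classes
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```

Swapping the final layers, e.g. for a decoder producing a per-pixel output, would turn the same building blocks into a segmentation architecture.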
2.2. Why CNN for vegetation remote sensing?
The physiology and morphology of vascular plant canopies is primarily optimized towards the absorption of solar energy using the photosynthetic machinery and the corresponding assimilation of carbon for maintenance, further growth, and reproduction. Despite these common goals among vascular plants, plant life can differ greatly on multiple scales, ranging from various morphological features of the individual, including leaf tissue properties, leaf form, branching patterns, canopy structure, and the general habitus, to large-scale patterns of vegetation communities. Furthermore, anthropogenic land use can determine spatial vegetation patterns, either through indirect influences on floral vitality and species composition or directly through economic activities. Examples include dendritic or fishbone-like deforestation structures in rain forests, crop rows on plantations, or directed changes in species composition as a result of gradual nutrient inputs from agricultural land.
Fig. 1. Scheme of a CNN composed of four convolutional layers and subsequent pooling operations, trained for tree species classification. The visualizations of convolutional filters (top) indicate characteristic patterns the CNN is looking for and were derived by gradient ascent, a technique revealing artificial images that maximize each filter's activation. The feature maps (center) are the dot products of the preceding layer and individual filters. Feature attribution maps (bottom) can reveal individual pixels that were decisive for the tree species assignment (details on feature attribution in Section 3.6.2).

Remote sensing offers several sensors and acquisition techniques
that are sensitive to physiological and morphological properties of vegetation and, hence, allow for spatial representations of vegetation patterns at scales from plant organs to entire landscapes. This includes close-range observations from terrestrial platforms (e.g., farming robots), fine-resolution data from airborne platforms (UAV or airplanes), as well as coarser-resolution satellite-based acquisitions that are usually focused on large-scale applications.
So why are CNNs suitable for vegetation analysis with such remote
sensing data? CNNs are indeed a revolutionary technique but they do
not do magic, meaning that they cannot reveal more information than is
contained in the data. The crucial advancement of CNNs is how they can
extract information from spatial data. Previous parametric or machine
learning methods applied in vegetation remote sensing usually required
feature engineering, i.e. the careful screening of redundancies in the
input data and the extraction of latent variables that best describe the
response variable. Simply put, the model needs to be taught how to see
the relevant features before it can start solving the problem. Feature
engineering is, hence, based on an understanding of a system and its
processes. This makes it possible to control the model with pre-knowledge but is certainly limited in the case of unknown systems that potentially inherit many dimensions and complex interactions. Especially for the analysis of 2D or 3D patterns, there is a plethora of transformations that can be applied to extract spatial features and textures. Examples are Grey Level Co-Occurrence Matrices (Haralick, 1979), Fourier Transformations (Bone et al., 1986), or 3D multi-scale metrics derived from point clouds (Brodu and Lague, 2012; Weinmann et al., 2015). These numerous types of transformations can moreover be applied with different hyper-parameters (e.g., kernel function or size). The potential number of latent variables extracted this way explodes, considering that one can extract latent variables with such transformations based on the different input data available, e.g., different wavelengths of a multispectral sensor or snapshots from a time series. Thus, identifying the best combination of possible predictors on a heuristic basis is often a very inefficient task and often hardly possible.
In contrast, a CNN itself learns the ability to see by iteratively optimizing the transformations, i.e., the convolutional layers, during the training process. This end-to-end learning principle can make feature engineering obsolete and, thus, providing the raw data (e.g. spectral bands or the point cloud) can already be sufficient. Additional feature engineering, e.g., transformations like vegetation indices or pre-processing such as speckle reduction, may even introduce an information loss and decrease the model accuracy (Hartling et al., 2019; Geng et al., 2017; Sothe et al., 2020). In contrast to statistical modeling or machine learning, deep learning, hence, shifts the focus from what a model should learn to how a model should learn. The latter is primarily defined by the model architecture and the optimization of its hyper-parameters, as discussed in the following sections.
2.3. The training process
Training a CNN model for vegetation mapping requires the remote
sensing data and matching reference annotations, also called labels or
targets. While machine learning algorithms, such as random forests or
support vector machines, require relatively simple array-type data
structures, CNN-based training is performed using more sophisticated
data structures called tensors. Tensors are essentially stacked arrays
that typically have 4 dimensions, including the individual samples, the
spatial dimensions (x, y), a feature dimension (z, e.g. intensity or
reectance), and a layer dimension (e.g. the corresponding wavelength).
During training, the CNN weights are optimized for a certain task, e.g., detecting a certain plant species. This detection is realized by transforming the input data through convolutional and other hidden layers while being propagated through the network. The neurons between layers are connected through activation functions determining if a neuron is active, also referred to as firing, or not (ReLU, the most frequently used activation function, is described in chapter 3.2.1.1). If activated, the intensity of a neuron's output is determined by its weights and biases. The weights and biases are usually optimized using the gradient descent algorithm, which can briefly be described as follows: The term gradient descent implies the progressive minimization (descent) of errors along a slope (gradient). Gradient descent is performed in iterations, in which predictions of a model with its momentary parameterization are compared to the annotations of the training data using a loss function. The gradients are derived using the backpropagation algorithm. Given a neural network with an input layer (a tensor), an output layer (prediction) and n hidden layers in between (e.g. convolutional layers), the backpropagation algorithm calculates the gradient of the loss function with respect to the weights and biases between the hidden layers. This gradient is then used to evaluate and update the model weights and biases through gradient descent, i.e. trying to find a global minimum in the high-dimensional feature space. The gradient descent procedure is performed for multiple samples, followed by averaging the calculated weights and biases of the hidden layers.
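The update rule behind gradient descent can be illustrated with a toy example (a deliberately simplified sketch with a single weight, a mean-squared-error loss and made-up numbers; real CNNs apply the same principle to millions of weights via backpropagation):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])            # roughly y = 2 * x
w, learning_rate = 0.0, 0.01

for step in range(200):
    prediction = w * x
    # Gradient of the mean-squared-error loss with respect to w (chain rule)
    gradient = np.mean(2.0 * (prediction - y) * x)
    w -= learning_rate * gradient             # step "down" along the loss gradient
print(round(w, 2))                            # converges towards ~2.0
```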
Training a CNN is usually computationally very intensive as the explanatory variables, e.g. image data or point cloud representations, are rich in dimensions (geolocation + layers), resulting in a myriad of feature maps that depict different spatial features and context at varying scales. This obviously results in excessive amounts of data to be processed during CNN training, especially considering that model training may require many samples to memorize the decisive features of the target class or value. These data volumes may, thus, not fit into the memory of our system at once. To overcome this, training is commonly performed sequentially in batches comprising only a share of the entire dataset. The model weights and biases are updated based on one average gradient for the entire batch. Separating the dataset into batches enables training the model iteratively until it has seen all samples, which is called an epoch. The number of iterations to finish an epoch is, thus, the total number of observations divided by the batch size.
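In code, batch size and epochs are usually just arguments of the training routine. The following sketch (the numbers are hypothetical, and the fit call assumes a compiled model and training tensors defined elsewhere) shows the relation between sample count, batch size and iterations per epoch:

```python
n_samples, batch_size, n_epochs = 8000, 32, 50
iterations_per_epoch = n_samples // batch_size
print(iterations_per_epoch)                   # 250 weight updates per epoch

# With tf.keras, batching and epochs are handled by the training routine, e.g.:
# model.fit(x_train, y_train, batch_size=batch_size, epochs=n_epochs,
#           validation_split=0.2)
```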
Generally, it is unlikely that a CNN trained in a single epoch already reaches maximum performance. For instance, observations (in the form of batches) that were shown to the model at the beginning of the training phase may again be useful to extract more features at a later stage of the training process. Moreover, multiple steps in the training procedure described above feature stochasticity: The convolutions are based on randomly initialized filters, the assignment of observations into batches is random, and the gradient descent has a random nature (hence also referred to as stochastic gradient descent). For this very reason, CNNs are commonly optimized within a series of subsequent epochs until the model performance stops advancing (the model converges) or even decreases (the model overfits). The number of epochs eventually depends on the complexity of the problem and the model structure.
The fact that gradient descent is an iterative algorithm opens several
interesting avenues for CNN-based modelling: Firstly, models can be
updated with unseen data at any time without training the model again
from scratch, substantially saving computation loads and processing
time. Secondly, models that have seen a lot of data, e.g. from generic
image databases such as ImageNet, can be shared and optimized for a
specic problem (further discussed in Section 3.2.1.2). The third and
probably most future-oriented avenue is federated learning, which is
the training of local models with local data on distributed clients and the
simultaneous sharing of weights coordinated by a central server
(Bonawitz et al., 2019). The server thereby merges the locally derived
gradients without ever seeing the data. Federated learning thus follows the principle of bringing the code to the data, instead of the data to the code, which will be inevitable in the geosciences due to constantly growing data streams. Besides reducing communication costs, this
approach avoids problems related to data access rights, security, or
privacy.
2.4. Implementation, libraries and frameworks
Most deep learning frameworks can be used on standard operating systems (Linux-based, Windows, macOS) and provide bindings for different programming languages. Currently, Python is the most common language in DL research. Training and inference of deep learning models consist of millions of simple computations, i.e. multiplications and additions. Thus, it is helpful to use graphics processing units (GPU) rather than central processing units (CPU). In contrast to CPUs, GPUs have rather simple cores but thousands of them, which are optimized to handle concurrent operations, leading to a drastic reduction of time for training and inference. Mostly NVIDIA GPUs are used, as these feature the CUDA Deep Neural Network (cuDNN) library, which is utilized by common DL frameworks. The cuDNN library provides highly performant primitives for convolutions, pooling operations, normalization and activation functions. Furthermore, AMD provides different tools for deep learning on Linux-based platforms with the Radeon Open Compute Platform. In case of missing hardware it is nowadays possible to use (partially free) cloud platforms with GPU support, such as Alibaba Cloud, Amazon Web Services, Microsoft Azure or Google Cloud platforms such as Google Earth Engine. These platforms have completely configured containers for many frameworks. Google Colab (https://colab.research.google.com) even provides free access to (limited) computing resources including GPUs with no setup.
CNNs can be implemented through different frameworks. Overviews of former and current frameworks are given in Hoeser and Kuenzer (2020), Nguyen et al. (2019) and on the corresponding Wikipedia page (https://en.wikipedia.org/wiki/Comparison_of_deep_learning_software). The currently most prominent deep learning frameworks are PyTorch and TensorFlow (Nguyen et al., 2019). Both provide high-level APIs (e.g. Keras) and various tools for training, data augmentation, and visualization (e.g. TensorBoard). Furthermore, many vintage and modern DL architectures can be used directly and with pretrained weights. Extensive documentation, many tutorials, and Jupyter notebooks allow an easy start with both open-source frameworks.
Fig. 2. Number of yearly publications based on the literature search, indicating a steep increase of studies applying CNNs for vegetation remote sensing. Counts for 2020 were extrapolated based on the number of publications until November.

Fig. 3. Network analysis of terms contained in the titles and abstracts of the reviewed studies. The frequency of the terms is represented by their size and their color represents statistically derived clusters (determined using the node-repulsion LinLog method; Noack, 2007). The analysis was performed using VOSviewer (Van Eck and Waltman, 2010). A detailed description of the corresponding workflow is given in the Appendix.
Additionally, the Open Neural Network Exchange (ONNX) format allows interoperability between many frameworks such as PyTorch, TensorFlow, Keras, mxnet, scikit-learn, Matlab, SAS, and many more. Thus, already implemented and trained models can be transferred to a favored framework. Links to various tools, models, and quick-start tutorials are provided in Section 5.
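As a minimal sketch of such an exchange (assuming PyTorch is installed; the toy model and file name are purely illustrative), a trained PyTorch model can be exported to ONNX and then loaded in another framework or runtime:

```python
import torch

model = torch.nn.Sequential(                   # toy stand-in for a trained CNN
    torch.nn.Conv2d(3, 8, kernel_size=3),
    torch.nn.ReLU(),
)
dummy_input = torch.randn(1, 3, 64, 64)        # one RGB tile, channels-first
torch.onnx.export(model, dummy_input, "model.onnx")
```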
3. Literature review on CNN-based vegetation remote sensing
The literature review was based on a survey on Google Scholar and the search terms CNN, convolutional neural networks, vegetation, plants, forestry, agriculture, land cover, conservation, mapping, Remote Sensing, RGB, multispectral, LiDAR, TLS, ALS, SAR, RADAR, airborne, satellite, and UAV. The search results were first filtered by the title, by the abstract and then by the content. We only considered primary research articles that underwent a peer-review process. This resulted in a total of 101 research studies considered in the literature review. All studies were published after 2016 and more than 75% of the studies were published in 2019 or later (see Fig. 2), underlining that CNN-based vegetation remote sensing is a very young but rapidly developing field.

The resulting literature is very heterogeneous in terms of application areas, vegetation types, target variables, CNN implementations, and remote sensing data (compare Fig. 3). Accordingly, several criteria were defined to structure the literature and identify general trends, including the underlying CNN architecture, remote sensing platform, sensor, spatial resolution of the remote sensing data, mode of reference data acquisition (in-situ or by visual interpretation), number of training and test observations, response type (e.g., object detection or semantic segmentation), geographic location of the study area, accuracy metrics, area of application (agriculture, forestry, conservation or miscellaneous) and specific task (e.g., detecting weed infestations or tree cover mapping). A corresponding spreadsheet including all assessed criteria and studies is available in the Appendix. For the accuracy metrics, we constrained our analysis to the most frequently reported metrics (overall accuracy, precision, recall, F-score and intersection over union). Whenever a study reported multiple accuracy metrics, e.g. when comparing multiple methods, we recorded the best result. The geographic locations of the study areas were derived from place names using the Google Geocoding API, unless the manuscripts explicitly included the longitude and latitude of the study area.
3.1. Reference data
3.1.1. Reference data sources
As with any supervised modelling approach, training and validating a CNN requires reference observations, also referred to as annotations, labels, or targets. The large number of parameters in CNNs and the corresponding ability to detect even subtle patterns are associated with the risk of training a model that is based on overly specific details and does not generalize well, i.e. a model that is overfitting. Accordingly, an independent validation of CNNs prior to model deployment is of great importance to evaluate their robustness and transferability. Ideally, such validation should not solely involve iteratively shuffling training and validation data, as frequently done in remote sensing studies (e.g., as with cross-validation or bootstrapping), but be based on entirely independent data that the model has never seen before. Therefore, most CNN-related studies split their reference data into (1) a training data set, which commonly is split again into training and validation data during the model training process, and (2) a testing data set used to independently evaluate the eventual predictive performance of the final model. Typically, a share of 20 to 30% of the reference data is used for independent testing (median 21%).
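Such a split of the reference data might look as follows (a minimal sketch with random placeholder data; scikit-learn's train_test_split is one common way to implement it, and the split fractions are only examples):

```python
import numpy as np
from sklearn.model_selection import train_test_split

tiles = np.random.rand(1000, 128, 128, 4)      # placeholder image tiles
labels = np.random.randint(0, 5, size=1000)    # placeholder class annotations

# Hold out 25% of the labelled data for fully independent testing of the final model ...
x_trainval, x_test, y_trainval, y_test = train_test_split(
    tiles, labels, test_size=0.25, random_state=42)
# ... and split the remainder again into training and validation data
x_train, x_val, y_train, y_val = train_test_split(
    x_trainval, y_trainval, test_size=0.2, random_state=42)
```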
In the eld of remote sensing of vegetation, reference data was so far
most commonly acquired in ground-based surveys in the form of in-
situ plot or point observations (Fassnacht et al., 2016). However, the
quantity of reference data of ground-based surveys is generally limited
as these involve high logistic efforts and costs for transportation,
equipment, and personnel. In particular for studies in natural environ-
ments, limited accessibility can also greatly hamper the sampling fre-
quency. The effectiveness of ground-based surveys for CNN modelling
may, hence, be limited as the latter often requires ample reference data.
In particular for complex tasks, such as the differentiation of classes that
only differ in subtle features, the quantity of available reference data can
be the critical factor for a successful model training and convergence.
Moreover, tasks as object detection or the segmentation of individual
crown components (Section 3.2.2) require reference data that is
spatially explicit and in exact correspondence with the remote sensing
data. Especially for analysis of very high spatial resolution remote
sensing data at centimetre scale, GNSS-coded reference data acquired in
the eld is often not directly applicable for two main reasons: Firstly,
geolocation errors of GNSS-measurement typically exceed 0.11 m;
particularly under dense vegetation canopies (Valbuena et al., 2013;
Kaartinen et al., 2015; Branson et al., 2018). Secondly, for practical
reasons, eld data is usually measured in form of point observations (e.
g., stem position of a tree) or using circular or rectangular plots, which
does commonly not allow for a spatially explicit link with remote
sensing data (Kattenborn and Schmidtlein, 2019; Anderson, 2018; Lei-
tao et al., 2018). Correspondingly, only 14% of the studies reviewed
here used in-situ data as exclusive reference input.
Instead of using in-situ observations, reference data in CNN-based
studies are most often (62%) directly acquired in the primary or sec-
ondary (e.g., higher resolution) remote sensing data using visual
interpretation.
In contrast to common in-situ point or plot observations, reference data acquired by visual interpretation is commonly spatially explicit, as it is directly derived from the imagery or point cloud. Furthermore, there is no position error, as long as the same input data is used for the CNN and the visual interpretation. If secondary data (e.g. of higher resolution) is used for visual interpretation, the geolocation error is relative to the spatial agreement of primary and secondary data. Visual interpretation provides a very efficient mode of generating reference data, given that the variable of interest is clearly identifiable in the imagery. Accordingly, this mode of reference data acquisition is in particular applicable for discrete classes (e.g. species, plant communities, crop or vegetation types, individuals). The term visual interpretation implies a rather imprecise capture of the target metric, but it should be noted that in-situ observations do not necessarily represent (ground) truth: As with visual image interpretation, mapping species in the field is commonly based on visual interpretation and, hence, can also be prone to errors and bias (Lunetta et al., 1991; Leps and Hadincova, 1992).
Annotations from visual interpretation are often derived by delineating target classes in a GIS environment. This includes the identification of individuals by points, as often performed for image-based object detection in agricultural environments (Csillik et al., 2018; Freudenberg et al., 2019), or the delineation of vegetation components (e.g. in the form of polygons) for semantic or instance segmentation (Flood et al., 2019; Kattenborn et al., 2019a). Many studies have also used special interfaces for efficient labeling, such as RectLabel, LabelMe, Labelbox or LabelImg (Russell et al., 2008). Instead of manually labeling the spatial extent of target classes, a semi-automatic approach using a prior segmentation may be used. For instance, dos Santos Ferreira et al. (2017) automatically segmented canopy components in RGB imagery of soybean fields using SLIC (Simple Linear Iterative Clustering) superpixels (Achanta et al., 2012), and assigned each segment to weeds or crops by visual interpretation. Natesan et al. (2019) labeled segments derived from a watershed-based segmentation using a digital surface model. In particular for LiDAR-based point cloud data, region growing algorithms may be used to efficiently segment points belonging to individual plants (Wang et al., 2019) or plant components (e.g. stems, branches or foliage; Xi et al., 2018).
Despite the above-mentioned advantages, obtaining reference data
by visual interpretation does not rule out misinterpretation. Yet, using the example of mapping plant species, it has been shown that CNNs can to some extent compensate for flawed or noisy labels (Kattenborn et al., 2020; Hamdi et al., 2019).
Although in-situ data may not be ideal for training and validating CNNs, it may be an essential requirement in case the target class (e.g. species) is not readily identifiable in the remote sensing data by means of visual interpretation alone. According to our review, 22% of the studies that acquired reference data by visual interpretation also incorporated in-situ data for training or validation. 84% of these studies were either related to forestry or conservation tasks and thus to rather complex environments, in which visual interpretation alone may not be sufficient. For instance, Schiefer et al. (2020) and Kattenborn et al. (2019a) used ground-based full inventory data as a basis to annotate tree species in UAV imagery in temperate forests in Germany, and in highly heterogeneous and complex natural forests in Waitutu, New Zealand, respectively. Similarly, Sun et al. (2019) used in-situ data on tree species to map the species diversity in tropical wetlands. Field data may also provide an independent source to validate CNN-based predictions (Flood et al., 2019). Especially in cases where a bias by visual interpretation is assumed, a validation using in-situ reference data is highly recommended.
Visual interpretation may be more efficient for data annotation than using in-situ data alone, but even human labeling through visual interpretation can be very tedious, especially for large datasets or complex vegetation canopies that require very detailed annotations. The effort of annotating data may be reduced by specific training strategies, such as weakly- or semi-supervised learning (see Section 3.2.1.3), that compensate for few or coarse annotations. Alternatively, if no knowledge of a vegetation expert is required, crowdsourcing can be used for labeling. Commercial services are now also available for this purpose. For example, Branson et al. (2018) used the service Amazon Mechanical Turk™ to locate individual trees in Google Street View imagery.

Although visual interpretation is an effective labeling approach for many tasks, it should be noted that there are many vegetation-related applications where it is not applicable. Particularly for continuous quantities, such as crop yield or forest biomass (Ayrey and Hayes, 2018; Yang et al., 2019; Castro et al., 2020), reference data acquisition is conceptually more difficult, as these are often not directly measurable from the remote sensing data. Here, in-situ measurements or other physically-based retrieval procedures may often present the only applicable solution. A physically-based retrieval of reference data was presented by Du et al. (2020), who aimed at mapping wetland inundation extent in forests on large spatial scales with satellite data (WorldView-2). For parts of their study area, LiDAR data was available, enabling accurate detection of surface waters due to their strong absorption in near-infrared wavelengths. Reference data acquisition on yield or biomass in an agricultural context may be automatized by integrating measurement devices on harvesting machines. For instance, Nevavuori et al. (2019) trained a CNN to predict wheat and malting barley yield from UAV imagery using training data derived from a yield measurement device (John Deere Greenstar 1) that was coupled with a GNSS receiver and mounted on a harvester.
Concerning biochemical and structural plant traits, an interesting approach is to train CNNs with simulated data derived from physically-based models. Such hybrid approaches, i.e. coupling statistical and process-based models, may not only provide data for training but also enable including priors and realistic constraints in model training (Reichstein et al., 2019). For instance, Annala et al. (2020) trained a 1D-CNN with reflectance spectra simulated with the radiative transfer model (RTM) SLOP (Maier et al., 1999). Although SLOP is a relatively simple leaf reflectance model, Annala et al. (2020) demonstrated promising tests of this hybrid inversion method for UAV hyperspectral acquisitions of forest canopies. More sophisticated RTMs, e.g. PROSAIL (Jacquemoud et al., 2009), may allow producing more robust models, enabling to account for bidirectional reflectance effects in plant canopies, whereas 3D-RTMs such as FLIGHT (North, 1996) or DART (Gastellu-Etchegorry et al., 1996) may provide interesting sources for generating synthetic training data for 2D-CNNs (see Section 3.2 for details on 1D-, 2D- and 3D-CNNs).
3.1.2. Reference data quantity
The quantity of reference data required for the convergence of a CNN depends particularly on the complexity of the algorithm and, most importantly, on the contrast of the features that are decisive for the vegetation property of interest. Fewer reference data may be required if the vegetation property of interest is easily identifiable in the remote sensing data (e.g., due to a distinct canopy structure or contrasting flowers). Subtle differences and complex relationships in turn require more complex algorithms and more samples to identify the relevant features. Accordingly, the effects of varying the training data size cannot be generalized. The results of Weinstein et al. (2020) suggest that the accuracy first increases rapidly with increasing reference data quantity and then stagnates. In the context of tree species mapping in urban environments, Hartling et al. (2019) showed that using 10% of their available training samples decreased the overall accuracy from 82.58% to 70.77%. Using 200–3940 samples and multiple CNN architectures, Fromm et al. (2019) showed that the reference data quantity can have a large influence on the overall accuracy for tree seedling mapping (up to 18%). Using UAV data for segmenting growth forms in wetlands, Liu et al. (2018a) demonstrated that the effect of sample size (700–3500 samples) can greatly differ across different model architectures and complexities.
Overall, the amount of reference data used in the reviewed studies differed greatly, most notably between studies using different remote sensing platforms. Studies based on terrestrial data acquisitions, e.g., terrestrial or mobile LiDAR scanning, used around 340 reference observations (median). UAV- or airborne-related studies used a median of 2795 reference observations, and studies based on satellite observations 6001 observations. These large differences may be the result of two factors: Firstly, studies at the satellite scale typically cover larger spatial extents and are, hence, more likely to benefit from previously acquired reference data sets (cf. Schmitt et al., 2020), whereas the coarser spatial resolutions also allow incorporating reference data with higher geolocation errors. Secondly, data acquired at higher resolutions, often TLS or MLS LiDAR data, contains finer information on vegetation structures and may thus include more characteristic features. This may, hence, facilitate model convergence and decrease the amount of reference data required.
A common training strategy that aims to compensate for few reference observations is data augmentation, which inflates the number of reference data by introducing small manipulations to the existing data, or by creating synthetic data (see details in Section 3.2.1.1). Instead of collecting new reference data, it may be more efficient to use existing reference data, e.g. from previous research projects or authorities (e.g. environmental agencies, forestry offices). Accordingly, the establishment of open access databases incorporating labeled remote sensing data is increasingly demanded but still lacking (Zhu et al., 2017). Such databases would not only facilitate the efficiency of model training due to ample training data, but would also allow assessing and improving the extrapolation and transferability of these models to new domains. This is particularly important as geoscientific models are often under-constrained due to the limited representativeness of the training data (Reichstein et al., 2019). Accordingly, databases can enable testing and improving the model transferability towards new domains, such as different remote sensing acquisitions, vegetation types, or growth stages. Moreover, databases of sufficient size could also play an important role in developing backbones that are specifically oriented to vegetation remote sensing (further discussed in Section 3.2.1.2). Freely accessible databases can also facilitate more comprehensive and universal comparisons of algorithms and the identification of improvement opportunities.
Despite the described benets, there exist still only a few databases
providing labeled remote sensing data, which may be explained by the novelty of the scientific field (cf. Fig. 2), the associated costs for data storing and sharing (especially with regard to high-resolution data), the various fields of application with individual annotation requirements, and lastly the diversity in remote sensing sensors, acquisition and processing modes. A prime example is the voluntarily organized ImageCLEF initiative (imageclef.org). The latter hosts an evaluation platform and mostly annually recurring competitions for cross-language annotation of images (Kelly et al., 2019). The first competition was hosted in 2003 and aimed at classifications of generic photograph datasets, whereas in 2011 the first vegetation-specific competition followed, which was centered on plant species identification from ordinary photographs. Since 2017, ImageCLEF also hosts the GeoCLEF competitions, which focus on plant species identification by means of environmental and remote sensing data, including high-resolution remote sensing imagery and respective land cover products. Another example is the NSF NEON database (Kao et al., 2012; Kampe et al., 2010; Marconi et al., 2019), including a wide array of (partly multitemporal) reference and remote sensing data (most importantly from RGB, LiDAR and hyperspectral airborne campaigns) on natural and semi-natural ecosystems. This database has already proven to be of immense value to train and validate models across ecosystems and remote sensing acquisitions (Weinstein et al., 2020; Ayrey and Hayes, 2018). For instance, Weinstein et al. (2020) tested the cross-flight performance of a CNN for tree crown segmentation in different environments. Their results underlined the value of large databases for model training, as the model generalization with additional datasets greatly improved, even when the target class was not present in all datasets. Examples centered on developing and benchmarking deep learning towards vegetation types and land-cover mapping with Sentinel-2 imagery are the SEN12MS (Schmitt et al., 2019), BigEarthNet (Sumbul et al., 2019) and EuroSAT (Helber et al., 2019) datasets. In the agricultural context, the Global Wheat Dataset (global-wheat.com) includes standardized images of wheat (1024 × 1024 pixels) with sub-centimetre resolution, providing the basis for public challenges, such as the 2020 challenge on counting wheat ears (David et al., 2020).
An alternative approach could also be the use of databases that only refer to vegetation information but can be linked to existing remote sensing data in other ways, e.g. by taxonomic identities or geo-coordinates. Valuable resources in this context are the TRY database (try-db.org; Kattge et al., 2020), which contains a wealth of morphological, physiological and phenological plant traits, the opentrees database (opentrees.org), providing species and location information of individual trees in urban areas, or GBIF (gbif.org), providing several huge datasets of citizen-science-based plant photographs together with species names and geo-coordinates, including the popular iNaturalist dataset.
3.2. Common CNN approaches and architectures
3.2.1. Training strategies
Training a CNN can be challenging due to a restricted amount of labeled observations, the computational load required for model convergence, and model overfitting. This chapter lists the most common strategies and methods applied during training to alleviate these challenges.
3.2.1.1. Normalization and regularization techniques. A famous problem in training artificial neural networks with gradient-based learning is the vanishing or exploding gradient problem (Hochreiter, 1991; Hochreiter, 1998). During backpropagation, the weights of each node are updated proportionally to their gradient with respect to the loss. The gradients are derived by calculating the derivative of an activation function. For a common sigmoid function, this derivative becomes increasingly small for very low or high values. The derivative of a layer is calculated by the chain rule, and so gradients and corresponding updates of weights in earlier layers of the network can approach zero (vanish). The opposite effect, i.e. exploding gradients, can occur for large derivatives. This imbalance in the network ultimately impairs the network's ability to find the ideal updates for the weights.
A common counter-measure is batch normalization, which is
applied in 26% of the reviewed studies, particularly in networks with
many parameters such as for semantic segmentations (Wagner et al.,
2019; Kattenborn et al., 2019a; Ronneberger et al., 2015). Batch
normalization normalizes the output of activation functions to zero-
mean and unit variance and thereby prevents the network from
becoming imbalanced due to excessively high or low activations. This
smooths the optimization problem of the gradient descent function and
allows for larger ranges of learning rates and hence facilitates network
convergence.
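In practice, batch normalization is typically inserted between a convolution and its activation, as in the following tf.keras sketch (a hypothetical building block, not taken from any reviewed architecture; filter number and kernel size are arbitrary):

```python
from tensorflow.keras import layers

def conv_bn_block(x, filters):
    """Convolution followed by batch normalization and a ReLU activation."""
    x = layers.Conv2D(filters, 3, padding="same", use_bias=False)(x)
    x = layers.BatchNormalization()(x)   # zero-mean / unit-variance activations per batch
    return layers.Activation("relu")(x)
```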
The vanishing gradient problem can also be greatly reduced by using the Rectified Linear Unit (ReLU) activation function. The output of the ReLU function equals the weighted sum of the inputs as long as this sum is >0 (values <0 are ignored). For values >0, ReLU is a simple linear function such that the derivative is always 1, hence preventing the vanishing gradient problem. The probably more important characteristics of ReLU are its non-linearity and its regularizing function in the network. The large amount of parameters in deep networks makes them prone to overfitting and, therefore, regularization aims to facilitate a network's ability to generalize. ReLU regularizes the network by reducing the parameters of the model as it ignores values <0; these values are in theory not activated anyway. The reduction of parameters also greatly decreases the computing time in contrast to conventional hyperbolic tangent functions (Krizhevsky et al., 2012). Only few studies reported that they used other activation functions, suggesting that in fact most of the studies used ReLU.
One of the most common and effective regularization techniques is Dropout (Srivastava et al., 2014), used in at least 31% of the reviewed studies, which randomly removes a fraction (typically 50%) of a layer's output features during the training process (these output features are set to zero). The core idea of dropout is to artificially introduce stochasticity to the training process, preventing the model from learning statistical noise in the data.
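A hedged sketch of how dropout is commonly placed in a classification head is given below; the dense-layer sizes and the number of target classes are assumptions, while the 50% dropout rate corresponds to the typical value mentioned above.

```python
import torch.nn as nn

# Minimal sketch of a classification head with dropout; feature dimensions and
# class count are illustrative assumptions.
head = nn.Sequential(
    nn.Flatten(),
    nn.Linear(32 * 8 * 8, 256),
    nn.ReLU(),
    nn.Dropout(p=0.5),    # randomly set 50% of the features to zero during training only
    nn.Linear(256, 5),    # e.g., five vegetation classes
)
```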
Still, overfitting does not only depend on the number of parameters in the model, but also on the representativeness of the sampling. Particularly in the context of vegetation mapping, samples are often taken under limited conditions, while a model is deployed to other, unseen conditions. The associated risk is therefore an overfitting of the model to the situation with limited conditions and representativeness (e.g., with regard to scene illumination or local vegetation properties). An obvious
solution is a larger amount of training data or covered variation,
respectively. To reduce the costs of creating labeled observations, a
commonly applied procedure is to synthetically increase the sample
quantity and diversity using data augmentation procedures (Chatfield
et al., 2014; Krizhevsky et al., 2012). Data augmentation is the process of
producing more samples from existing data by introducing manipulations to them (Shorten and Khoshgoftaar, 2019). These changes may
include randomly changing the spatial extent of the imagery, e.g., to
make a model more robust for detecting individuals of a plant species
with varied sizes. Random transformations, such as flipping, rotating or
translating the imagery, can increase the generality towards varying
sun-azimuth angles and corresponding cast-shadows (also described as
rotational invariance). Random spectral shifts may compensate for
variation in illumination caused by topography or atmospheric conditions and may further alleviate data calibration issues or sensor-specific
differences. In most CNN-related studies using LiDAR data, the detection
process is not based on the point cloud, but 2D projections derived from
the point cloud (cf. Section 3.5.2). Here, data augmentation can be
performed by varying the viewing geometry prior to generating the 2D
image, also referred to as multi-view data generation (Ko et al., 2018;
Su et al., 2015; Zou et al., 2017; Jin et al., 2018). The overall effec-
tiveness of data augmentation is highlighted by the fact that 47% of the
studies used data augmentation. Fromm et al. (2019) and Safonova et al.
(2019) explicitly tested the effect of data augmentation and found
signicant improvements for the detection of tree seedlings and bark
beetle-infected trees, respectively.
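As a minimal sketch (the exact transformations and parameter values are illustrative assumptions, not recommendations from the reviewed studies), such augmentations can be chained with standard image-processing libraries such as torchvision:

```python
import torchvision.transforms as T

# Minimal sketch of geometric and spectral augmentations commonly applied to
# remote sensing tiles; parameter values are illustrative assumptions.
augment = T.Compose([
    T.RandomHorizontalFlip(p=0.5),                     # robustness to varying sun-azimuth angles
    T.RandomVerticalFlip(p=0.5),
    T.RandomRotation(degrees=90),                      # rotational invariance
    T.RandomResizedCrop(size=128, scale=(0.8, 1.0)),   # vary the apparent object size
    T.ColorJitter(brightness=0.2, contrast=0.2),       # mimic illumination and calibration differences
])
```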
Data augmentation may also be performed by not introducing minor
manipulations, but creating new, synthetic observations from the
existing data. Gao et al. (2020) presented an automated procedure for
the creation of synthetic images and labels from original images for
detecting weed infestation (Calystegia sepium) in sugar beet fields. Their
approach involved the creation of masks for individual plants from the
original images used for cropping and transferring the corresponding
RGB information to other base images. Adding data created from this (put simply, a 'copy & paste') approach to the original training data indeed increased the precision from 0.75 to 0.83. For training a CNN for
detecting individual tree crowns, Braga et al. (2020) used the same
principle and created synthetic Worldview-3 observations by randomly
placing manually-delineated tree crowns on background tiles.
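The underlying copy-and-paste idea can be sketched in a few lines of NumPy; the function below is a purely illustrative assumption of how masked plant pixels may be transferred onto a background tile and is not the implementation used by the cited studies.

```python
import numpy as np

def paste_plant(background: np.ndarray, plant: np.ndarray, mask: np.ndarray,
                row: int, col: int) -> np.ndarray:
    """Sketch of the 'copy & paste' augmentation: transfer the masked pixels of
    a plant crop onto a background tile. Shapes are assumed to be (H, W, 3)
    for the images and (H, W) boolean for the mask."""
    synthetic = background.copy()
    h, w = mask.shape
    target = synthetic[row:row + h, col:col + w]   # view into the output tile
    target[mask] = plant[mask]                     # overwrite only the plant pixels
    return synthetic
```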
Probably the most elegant framework for generating synthetic data is
Generative Adversarial Networks (GANs). Inspired by game theory,
GANs are driven by the competition of a generator module, creating
synthetic data (e.g., images) and a discriminator module aiming to
disambiguate between synthetic and real data (Goodfellow et al., 2014;
Frid-Adar et al., 2018). During training, a GAN, hence, simultaneously
improves on how to synthesize observations from noise and how to
classify them (synthetic vs real data or further classes). Using the example of segmenting weed infestation in crop fields in UAV imagery, Kerdegari
et al. (2019) demonstrated a GAN architecture, composed of a generator
and the discriminator modules with four convolutional layers each. The
proposed GAN produced realistic synthetic visual and near-infrared
scenes. Moreover, it was demonstrated that using the discriminator
module for semantic segmentation of unknown images resulted in
comparable accuracy to a pure CNN - even when using only 50% of the
available labels. The fact that the discriminator was originally trained to solve another problem, i.e. differentiating synthetic from real data, suggests that applying this trained discriminator to real-world problems could also be considered as a form of transfer learning - an approach discussed in more detail in the next section.
3.2.1.2. Transfer learning and backbones. As described earlier, training
data for vegetation attributes is often limited as its acquisition is
commonly costly and limited by accessibility. Furthermore, the training
itself is often associated with high computing costs.
A common practice to alleviate this problem is to apply transfer
learning during CNN model training. Transfer learning includes pre-
training of the CNN model on other, presumably very large and het-
erogeneous datasets. Such datasets do not necessarily have to include
the target metric or class (e.g. a certain plant species) and can, for
instance, be derived from public and generic databases. Popular exam-
ples are the image databases MSCOCO or ImageNet, which contain
thousands of images from various objects, such as cars, buildings, or
people. A very elegant approach to transfer learning is to build on pre-trained models directly, commonly referred to as a 'pre-trained backbone', which can potentially reduce data storage and processing costs.
The principle of transfer learning can be described as a process where very generic images, not necessarily belonging to vegetation-related situations, are used to teach the CNN the ability 'to see' in a general sense. The subsequent step of adjusting the network can be understood as teaching the CNN how to apply this ability to a very specific problem, such as the differentiation of certain plant species.
There exist various transfer learning approaches (Too et al., 2019;
Tuia et al., 2016; Pires de Lima and Marfurt, 2020), which can be
roughly grouped into two primary strategies: The shallow strategy
adopts very general, lower-level image features such as edge detectors
from the pre-trained backbone or the generic training dataset. Only the last layers of the CNN are then fine-tuned for higher-level and task-specific features using imagery corresponding to the specific problem (e.g. plant species detection). The deep strategy, in contrast, involves fine-tuning the entire network, i.e. starting back-propagation with all layers of the pre-trained network.
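The two strategies can be sketched with a pre-trained backbone from torchvision; the backbone choice, class count, and layer names below are assumptions for illustration.

```python
import torch.nn as nn
import torchvision.models as models

# Minimal sketch: ImageNet-pretrained ResNet-50 backbone with a new output layer.
model = models.resnet50(pretrained=True)
model.fc = nn.Linear(model.fc.in_features, 4)   # e.g., four target plant species

# 'Shallow' strategy: freeze the pre-trained feature extractor and train only the new head.
for param in model.parameters():
    param.requires_grad = False
for param in model.fc.parameters():
    param.requires_grad = True

# 'Deep' strategy: leave all parameters trainable and fine-tune the entire
# network, typically with a small learning rate.
```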
The use of pre-trained backbones is restricted to available architec-
tures. Yet, backbones can be customized with output layers (e.g. to apply
it on regression or classication problems), cost functions, and other
components or integrated in existing CNNs. There exist a variety of
backbones for popular CNN architectures (cf. Section 3.2.2), such as
VGG, ResNet or Inception. It should be noted that the popular backbones
are usually trained on 3-channel (RGB) data, whereas remote sensing
information often provides more predictors, such as multiple bands,
time steps, or sensor types. In this case, band selection or feature
reduction algorithms provide a promising avenue (Rezaee et al., 2018).
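One common workaround, sketched below under the assumption of a torchvision ResNet-50 backbone and six input bands, is to replace the first convolution and to reuse the pre-trained RGB filters for the first three channels:

```python
import torch
import torch.nn as nn
import torchvision.models as models

# Sketch: adapting an RGB-pretrained backbone to multi-band imagery (here 6
# bands, an assumption) by replacing the first convolution.
model = models.resnet50(pretrained=True)
old_conv = model.conv1                              # pre-trained Conv2d(3, 64, 7x7)
new_conv = nn.Conv2d(6, 64, kernel_size=7, stride=2, padding=3, bias=False)
with torch.no_grad():
    new_conv.weight[:, :3] = old_conv.weight        # keep the RGB filters
    new_conv.weight[:, 3:] = old_conv.weight.mean(dim=1, keepdim=True)  # initialize extra bands
model.conv1 = new_conv
```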
Of the studies reviewed here, 30.5% used pre-trained backbones (e.
g., Gao et al., 2020; Brahimi et al., 2018; Fromm et al., 2019; Mah-
dianpari et al., 2018; Rezaee et al., 2018; Branson et al., 2018). Ghazi
et al. (2017) compared the utility of three backbones based on GoogLe-
Net, AlexNet, VGGNet, to identify plant species in photographs. Brahimi
et al. (2018) assessed the value of pre-training for plant disease recog-
nition based on RGB imagery and multiple CNN architectures. They showed that the deep pre-training strategy, i.e. back-propagation on all layers of the pre-trained model, delivered the highest accuracy. The shallow strategy was usually worse than training a model from scratch. Fromm et al. (2019) showed that pre-training did not always significantly improve the detection of tree seedlings and that the value of pre-training depends on the network's complexity, with shallower architectures being less likely to benefit from pre-training. Mahdianpari et al. (2018) report that full training resulted in better accuracy than fine-tuning existing backbones trained on ImageNet. This suggests that the detection of vegetation patterns may not necessarily benefit from features learned on generic
datasets. This also agrees with recent research by He et al. (2018) sug-
gesting that transfer learning may indeed be useful if training data is
scarce and computation power limited, but otherwise an exhaustive
training on task-specic data will result in higher accuracy than using
generic datasets.
3.2.1.3. Weakly- and semi-supervised learning. Besides a lack of refer-
ence data, it may occur that reference data already exist, but do not meet
the ideal requirements for the intended application. Accordingly, several
concepts and strategies have evolved to compensate for limited avail-
ability or conceptual incompatibilities of reference data.
The aim of Weakly supervised learning is to decrease costs for
human labeling or to make use of existing, lower quality reference data.
This concept is particularly interesting for semantic segmentation tasks,
where usually an annotation for each sample (point or pixel) is required.
Weakly supervised learning can, for instance, involve annotations at an
image level instead of at a pixel level, or sparsely annotated data at a
pixel level, such as bounding boxes, lines, or points. Adhikari et al.
(2019) applied weakly supervised learning using the principle of se-
mantic graphics to map crop rows and individual weed plants in rice
paddies. Semantic graphics defines target objects or concepts through abstract forms. Accordingly, Adhikari et al. (2019) defined crop rows as line features and weeds as solid circles and showed that an encoder-decoder CNN is capable of accurately learning and mapping these concepts. Their findings are particularly interesting because plant rows are rather fuzzy and not clearly delimitable. The higher-level concept of a row, however, is clearly definable for humans by abstracting the spatial context of the individual plants and obviously also reproducible by
CNNs. The concept of weakly supervised learning is also applicable
when explicit 'ground truth' is scarce but frequent datasets from other studies exist that come with their own errors or lower spatial resolutions.
Promising results of this approach were presented by Schmitt et al.
(2020), who predicted vegetation types with Sentinel data and used
training data derived from MODIS land cover maps at 500 m resolution.
Using high resolution imagery, they demonstrated that the Sentinel-
based predictions reached even higher accuracy than the MODIS-
based land cover data used for training.
Another promising avenue of weakly supervised learning is directed
towards semantic segmentation using saliency maps. The basis for this
approach is a CNN trained for image classification, which can be analyzed through class activation mapping (cf. Section 3.6.2 and Fig. 1 showing an example for tree species) to identify those pixels that are decisive for assigning an image j to a class i. These pixels are then used to segment the target class i, based on the assumption that these pixels highlight the components of the respective class in the image j (e.g., the
canopy of a tree species). Although no study has been published to date
that has applied this approach to vegetation remote sensing, the po-
tential has been demonstrated several times in other disciplines (Li et al.,
2018; Lee et al., 2019). This approach could, hence, provide a promising
way for an efcient and automatic segmentation (e.g., of plant species)
based on large image databases without spatially explicit labels, such as
the iNaturalist data.
Semi-supervised learning describes the training of a model with
only a small number of reference data and, hence, can be located be-
tween supervised and unsupervised learning. Weinstein et al. (2019) applied a semi-supervised learning framework for detecting single tree crowns in airborne imagery using a two-step approach: The first step, which can be considered as unsupervised or weakly-supervised learning, involved training a CNN with labels (bounding boxes, n = 435,551)
derived automatically from LiDAR data and a tree crown segmentation
algorithm (Roussel et al., 2017). In the second step, the CNN was opti-
mized using a few hand-annotated samples derived from the airborne
imagery (n = 2,848). Thereby, Weinstein et al. (2019) demonstrated that only a few high-quality samples may be required for training a robust
CNN.
However, the number of samples required for a specific task is difficult to estimate in advance. In this regard, Active learning, which can be considered as a special case of supervised learning, can be an efficient solution. Active learning describes the iterative optimization of a model by repeatedly adding new reference data until the predictive accuracy saturates or reaches a desired threshold. Ghosal et al. (2019) exemplified an active learning approach for sorghum head detection in UAV imagery. The starting point was a single image together with bounding
boxes of sorghum heads to train a CNN, which was then applied to
another random image. The image and predictions were afterward fed
into an annotation app, in which a human interpreter corrected the
predictions before they were added to the training dataset. The initial
model was then optimized using the enlarged training dataset and the
entire procedure was repeated in multiple iterations. In their case study, the model accuracy already converged after 5–10 iterations, highlighting the efficiency of active learning for finding the right balance between costs of human labeling and model performance.
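The iterative procedure can be summarized in a schematic Python loop; note that the callables passed in (predict, correct, retrain) are purely hypothetical placeholders for a detector, a human annotation step, and a training routine, and do not refer to any existing API.

```python
# Schematic active-learning loop; the passed-in callables are hypothetical
# placeholders, not an existing library interface.
def active_learning(model, unlabeled_images, train_set, predict, correct, retrain,
                    n_iterations=10):
    for _ in range(n_iterations):
        image = unlabeled_images.pop()           # next (e.g., random) unlabeled image
        predictions = predict(model, image)      # e.g., bounding boxes of sorghum heads
        labels = correct(image, predictions)     # human interpreter corrects the predictions
        train_set.append((image, labels))
        model = retrain(model, train_set)        # optimize the model on the enlarged set
    return model
```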
3.2.2. Approaches and architectures
Depending on the components and architecture, CNNs can be
implemented in many different ways, which in turn enables a wide range of different applications in the field of vegetation remote sensing. CNNs
can initially be grouped into 1D-, 2D- and 3D-CNNs, where the number
refers to the dimensions of the kernel. 1D-CNNs are less often used (8%
of the reviewed studies) since they do not explicitly consider spatial
context and are, hence, primarily applied to analyze optical spectra or
multitemporal data (Xi et al., 2019; Annala et al., 2020; Zhong et al.,
2019; Liao et al., 2020; Guidici and Clark, 2017; Kussul et al., 2017).
Most studies applied 2D-CNNs (88%), as these readily exploit spatial
patterns in common imagery (e.g., RGB or multispectral imagery, cf.
Kattenborn et al., 2020; Weinstein et al., 2019; Wagner et al., 2020;
Fromm et al., 2019; Neupane et al., 2019; Milioto et al., 2017). The
added value of spatial patterns, i.e. of 2D versus 1D-CNNs, was even
demonstrated with relatively coarse-resolution Landsat data (Kussul
et al., 2017). 3D-CNNs are rarely used (4%), but are the means of choice
when successive layers have a directional relationship to be considered
(e.g. canopy height proles, hyperspectral reectance, or time-series
data, e.g., Nezami et al., 2020; Zhong et al., 2019; Lottes et al., 2018;
Ayrey and Hayes, 2018; Liao et al., 2020; Barbosa et al., 2020; Jin et al.,
2019). 2D- and 3D-CNNs can be applied to solve different problems,
including assigning values or classes to entire images, detecting indi-
vidual objects within images, segmenting the extent of classes, or
simultaneously detecting individual objects and segmenting their extent
(Fig. 10b). The major differences, including the required structure of
labels and resulting outputs, are described in the following sections:
3.2.2.1. Image classification/regression. Image classification is the assignment of a class to an entire image (Fig. 9a). For example, an image may be assigned to the class 'shrub' if at least a fraction is covered with Ulex europaeus or Sambucus nigra. Training image classification or regression-based CNNs requires comparably simple annotations in the form of class correspondences or continuous values, respectively, for each image. Typical CNN architectures for image classification and regression include VGG, ResNet, Inception or EfficientNet. VGG uses blocks of consecutive convolutions and non-linear activations. Between those building blocks, max-pooling with a stride of 2 reduces the
Fig. 4. Schematic diagram of the VGG-16 architecture. The 16 stands for the number of convolutional and dense layers. Frequently used alternatives are VGG-8 and
VGG-19.
Fig. 5. Schematic diagram of a residual building block used in repeated
sequence in common ResNet architectures.
resolution of the layers. The filter size of the convolutions is restricted to 3 × 3, leading to fewer parameters and thus more possible layers. This small filter size is still common in more recent networks. Finally, some fully connected layers are added for classifying the output of the building blocks (Fig. 4). ResNet also consists of building blocks with consecutive convolutions and activations (Fig. 5) but with some major differences: First, the depth of the layers is drastically reduced before the 3 × 3 convolution with a bottleneck 1 × 1 convolution. Thus, the number of parameters is much lower compared to VGG, even though ResNet has up to 10 times more layers. Second, to compensate for the vanishing gradient problem (cf. Section 3.2.1.1) with such a high number of layers (e.g. 152), skip connections with identity or convolution shortcuts are introduced. Such skip connections are still used in current designs, allowing very deep networks. Third, ResNet only uses one max-pooling layer. Instead, convolutions with stride 2 are used for resolution reduction. Most modern architectures such as EfficientNet also dismiss the max-pooling operation to reduce possible information loss during pooling.
A typical procedure to map vegetation patterns in remote sensing
imagery with CNN-based image classification or regression is to subset
the original imagery into regular tiles (e.g., 128 x 128 pixels) on which
the model is subsequently applied (details see Section 3.5.1). This pro-
cedure was for instance applied to LiDAR and airborne imagery to map
tree species (Sun et al., 2019) or to detect forest types using a
combination of high-resolution satellite imagery and LiDAR data (Sothe
et al., 2020). Image classification or regression may also be applied to
segments derived from previously applied unsupervised image seg-
mentation methods (dos Santos Ferreira et al., 2017; Ko et al., 2018;
Hartling et al., 2019; Liu and Abd-Elrahman, 2018). Image regression is
used when a continuous quantity is assigned to an entire tile. For example, Kattenborn et al. (2020) predicted continuous cover values [%] of plant species and communities in UAV-based tiles (2–5 m) along
smooth vegetation gradients. Yang et al. (2019) and Castro et al. (2020)
estimated rice grain yield and forage biomass in pastures, respectively,
from UAV-based tiles. Barbosa et al. (2020) mapped continuous crop
yield on coarser scales based on satellite data. Ayrey and Hayes (2018)
used regression on airborne LiDAR data to predict forest biomass and
tree density.
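As an illustration of CNN-based image regression, a standard classification backbone can be given a single continuous output per tile; the sketch below (backbone, output scaling, and loss are assumptions, not the setups of the cited studies) predicts, e.g., a fractional cover value.

```python
import torch.nn as nn
import torchvision.models as models

# Sketch: image regression with a single continuous output per tile.
backbone = models.resnet18(pretrained=False)
backbone.fc = nn.Sequential(
    nn.Linear(backbone.fc.in_features, 1),
    nn.Sigmoid(),            # constrain the output to [0, 1], i.e. 0-100% cover
)
loss_fn = nn.MSELoss()       # mean squared error replaces the cross-entropy loss
```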
3.2.2.2. Object detection. Object detection aims at locating individual
occurrences of a class (e.g. trees) within an image (Fig. 9b). The detec-
tion typically includes the localization of the object center and an
approximation of its extent using a simple rectangular bounding box.
Widely applied architectures for object detection are region-based
CNNs (R-CNN, Girshick et al., 2014), which involve a two-step approach: region proposals of the object's location and extent followed by a classification. R-CNN was followed by two successors, i.e.
Fast R-CNN (Girshick, 2015) and the most widely applied and efficient Faster R-CNN (Ren et al., 2017). The more recent Faster R-CNN forwards
feature maps (often derived using a VGG-type backbone) to a region
proposal branch that performs an initial prediction on potential object
locations (also referred to as anchors). These rather rough region pro-
posals are then used to crop areas of the feature maps as input for a fine-scaled object localization and classification (Fig. 6).
Object detection is suitable for countable 'things' with a definable spatial extent within the field of view. Such conditions are often found in
agricultural settings and accordingly 45% of the studies related to
agriculture apply object detection techniques, such as locating and
counting palm or tree individuals in plantations (Csillik et al., 2018;
Freudenberg et al., 2019), individual maize plants in TLS-point-clouds of
crop elds (Jin et al., 2018) or individual strawberry fruits and owers
in sub-centimeter UAV-imagery (Chen et al., 2019). The application of
object detection in natural environments is less frequent, which can be
explained by the presence of continuous gradients and smooth transi-
tions in species cover, traits, and communities. In forestry or conserva-
tion, only 14% and 10% of the studies used object detection. Examples
include the localization of fir trees infested by bark beetle (Safonova
et al., 2019), the mapping of individual tree crowns across several
ecosystems (Weinstein et al., 2020) or the detection of Cactae (Lopez-
Jimenez et al., 2019).
Object detection-based CNNs are typically trained using bounding
boxes of desired classes as labels. Several tools exist for a fast annotation of bounding boxes (see Section 3.1.1). However, a problem with bounding boxes in vegetation analysis is that they often do not explicitly define vegetation boundaries (vegetation is not rectangular). This in turn can make validation difficult, as inaccurate reference data do not allow a final assessment of the prediction (Weinstein et al., 2019;
Weinstein et al., 2020). From this point of view semantic (Section
3.2.2.3) or instance segmentation (Section 3.2.2.4) may be more
spatially explicit, but also require more sophisticated annotations.
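For illustration, an off-the-shelf Faster R-CNN detector can be adapted to a single vegetation class with torchvision; the class count and the reuse of a pre-trained model are assumptions for illustration.

```python
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

# Sketch: Faster R-CNN with a replaced box predictor for one target class
# (e.g., individual tree crowns) plus background.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes=2)

# During training, each image is paired with a target dict containing
# 'boxes' (N x 4 tensor of bounding boxes) and 'labels' (N class indices).
```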
3.2.2.3. Semantic segmentation. While image classification and object
detection aim to detect the presence or location of an object, semantic
segmentation aims to delineate the explicit spatial extent of the target
class within the image (Fig. 9c). In contrast to object detection, semantic
segmentation assigns all pixels in an image to a class. It is especially
suited to segment uncountable and amorphous 'stuff' (a frequently used term to illustrate the contrast to countable 'things', cf. Kirillov et al.,
2019). The training process is typically based on labels in the form of
spatially explicit masks to provide a class assignment for each single
pixel (e.g., absence or presence of species a, b, c).
The challenge with semantic segmentation is that CNNs usually
include multiple pooling operations to reveal spatial context in the
feature maps derived from the convolutions and, thereby, spatial reference and detail are initially lost. One solution, often referred to as
Fig. 6. Faster-R-CNN and Mask-RCNN, respectively.
'patch-based', is to perform a semantic segmentation by predicting only values for the center pixel of the input image and iteratively sliding the field of view over the image data until every pixel has received a label (Rezaee et al., 2018; Baeta et al., 2017; Mahdianpari et al., 2018; Fricker et al., 2019; Zhang et al., 2018; Kussul et al., 2017). However, this method requires an individual prediction for each pixel and is rather inefficient considering that the CNN analyses the neighbouring pixels at the same time anyway. A more elegant and effective way is to build a semantic segmentation on fully convolutional networks (FCN), as first
demonstrated by Long et al. (2015). FCN conserve the spatial reference by memorizing the pixels that caused activations in earlier stages of the network and forwarding them to an output segmentation map (see Fig. 7). This way, FCN not only allow detecting the presence of a target class
within an image (e.g., a species) but also the individual pixels that
correspond to the target class. A more recent and frequently applied
architecture for semantic segmentation is the U-Net (named after its U-like shape, Ronneberger et al., 2015). U-Net features an encoder-decoder structure, whereby the spatial scale is successively reduced by consecutive pooling operations along a contracting path and increased again along an expanding path (see
Fig. 8). The activations from the contracting path are forwarded using
skip connections to the expanding path to reconstruct the spatial iden-
tity. Further commonly applied CNN-architectures for semantic seg-
mentation are SegNet (Badrinarayanan et al., 2017) or FC-DenseNet
(Jegou et al., 2017). Semantic segmentation is widely used in several
contexts, ranging from mapping of plant species (Fricker et al., 2019)
and plant communities (Wagner et al., 2019; Kattenborn et al., 2019a),
to mapping deadwood (Fricker et al., 2019; Jiang et al., 2019). Torres
et al. (2020) compared, amongst other architectures, U-Net, SegNet, and FC-DenseNet for mapping Dipteryx alata trees in an urban context. Their results suggest that the segmentation accuracy of these three algorithms was quite similar, whereas simpler architectures (e.g., U-Net) were found to require less effort for model training.
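The encoder-decoder principle with skip connections can be illustrated with a heavily reduced sketch (a single pooling stage and arbitrary channel numbers; real U-Net implementations are considerably deeper):

```python
import torch
import torch.nn as nn

class MiniUNet(nn.Module):
    """Heavily reduced sketch of a U-Net-style encoder-decoder with one skip
    connection; depth and channel counts are illustrative assumptions."""
    def __init__(self, in_bands=3, n_classes=2):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(in_bands, 32, 3, padding=1), nn.ReLU())
        self.pool = nn.MaxPool2d(2)                         # contracting path: halve resolution
        self.bottleneck = nn.Sequential(nn.Conv2d(32, 64, 3, padding=1), nn.ReLU())
        self.up = nn.ConvTranspose2d(64, 32, 2, stride=2)   # expanding path: restore resolution
        self.dec = nn.Sequential(nn.Conv2d(64, 32, 3, padding=1), nn.ReLU())
        self.head = nn.Conv2d(32, n_classes, 1)             # per-pixel class scores

    def forward(self, x):
        e = self.enc(x)
        b = self.bottleneck(self.pool(e))
        d = self.up(b)
        d = torch.cat([d, e], dim=1)                        # skip connection restores spatial identity
        return self.head(self.dec(d))
```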
3.2.2.4. Instance segmentation. Instance segmentation aims at detecting
Fig. 7. Schematic diagram of the FCN architecture as proposed by Long et al. (2015). Predictions (also referred to as scores) within the network are forwarded to
deeper layers to relate respective activations to the original spatial resolution.
Fig. 8. Schematic diagram of the U-Net architecture depicting its encoder-decoder structure using a contracting and an expanding path.
Fig. 9. Schemes illustrating the conceptual differences between different CNN approaches, including (a) image classification, where the entire image is assigned to a
class; (b) object detection, where individual occurrences are localized and their extent estimated with bounding boxes; (c) semantic segmentation, which assigns each
pixel of the input image to the target classes; and (d) instance segmentation, where individuals belonging to a class are mapped.
individual 'things', such as individual plants or plant elements, and seg-
menting their spatial extent. Instance segmentation may, hence, be
considered as a combination of object detection and semantic segmen-
tation (Fig. 9d). A few studies used CNN-based object detection and
subsequently applied segmentation techniques, such as region growing
in the case of point cloud data, to detect individuals (Wang et al., 2019).
However, here we define instance segmentation as an end-to-end, CNN-
based segmentation of individuals.
One of the most popular algorithms for end-to-end instance segmentation is Mask R-CNN (He et al., 2017), a derivative of R-CNN described in Section 3.2.2.2. Like Faster R-CNN, it comprises a two-step approach, including an initial region proposal followed by the localization and classification of the feature maps, while in the case of Mask R-CNN, the proposed region is additionally passed to a segmentation branch (Fig. 6). Similar to semantic segmentation, fully convolutional layers are used to create masks at the original resolution of the input imagery. Despite the
potential utility of instance segmentation, the literature search only comprised a few respective studies: Jin et al. (2019) used instance segmentation to map individual leaves and stems of maize plants, Braga et al. (2020) delineated individual tree crowns in tropical forests, and Chiang et al. (2020) detected individual dead trees. The rare use of
instance segmentation could be explained by the more sophisticated
collection of reference data, which involves both the identification of
individuals and delineating their explicit spatial extent. In an agricul-
tural context, the identification of instances of multiple classes may
often not be necessary, as most tasks are situated in mono-cultures.
Instance segmentation in a forestry or conservation context may often
not be applicable because natural canopies often feature smooth tran-
sitions or overlapping crowns.
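For illustration, a Mask R-CNN model is available off the shelf in torchvision; its use for, e.g., individual tree crowns is sketched below (the pre-trained weights are an assumption for illustration).

```python
import torchvision

# Sketch: Mask R-CNN extends Faster R-CNN with a per-instance segmentation branch.
model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)

# For training, each target dict additionally contains binary 'masks'
# (N x H x W) delineating every individual instance, e.g. single tree crowns.
```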
3.3. Geographic and thematic areas of CNN application
CNN-based vegetation remote sensing has already been applied in
many countries (see Fig. 11), whereas a large number of studies were carried out in Europe, the USA, Brazil, and China. The pattern suggests that CNN applications are found in many of the world's biomes and are hence applicable for a wide range of vegetation types and applications.
Our literature survey revealed that CNN-based vegetation remote
sensing is applied to a wide spectrum of thematic categories (Fig. 12). A classification of the studies into broad categories showed that 44% of the studies are related to agriculture, while 26% of the studies have relevance for both conservation and forestry. 8% and 22% exclusively tackled research questions for forestry and conservation, respectively. Within these broad categories, the specific tasks are very diverse (the interested reader can find the explicit references for each task in the appendix):
Examples in the context of agriculture include the mapping of individual crop fields at regional scales using medium and high-resolution satellite data, e.g. coffee crop fields (Baeta et al., 2017), rice paddies (Zhang et al., 2018), safflower, corn, alfalfa, tomatoes, and vineyards
(Zhong et al., 2019). Several studies used high-resolution imagery from
airborne and satellite platforms to map individual plants in plantations,
e.g. citrus trees, palm trees or bananas (Csillik et al., 2018; Freudenberg
et al., 2019; Li et al., 2017; Mubin et al., 2019; Neupane et al., 2019).
Besides detecting individual citrus trees, Ampatzidis and Partel (2019)
quantied their crown diameter, health status (NDVI-based), and
respective canopy gaps in plantation rows. A large share of the studies
used imagery with milli- or centimeter pixel size acquired terrestrially or
from UAVs. A prime example of such detailed input data is the detection
of weed infestations, e.g., in soybean (dos Santos Ferreira et al., 2017) or sugar beet fields (Gao et al., 2020; Sa et al., 2018; Milioto et al., 2017).
Lottes et al. (2018) presented an automatic approach for mapping weed
infestation in imagery acquired by a farming robot equipped with a
mechanical actuator that can stamp detected weeds into the ground.
Adhikari et al. (2019) used subcentimeter imagery to map crop lines of rice plants in paddy fields to aid navigation of weeding robots for the
eradication of weeds (Panicum miliaceum). Jin et al. (2018) tested the
detection and height estimation of individual maize plants. Other
studies used high-resolution imagery for yield estimation, e.g., based on counting individual flowers at sub-centimeter resolution as a proxy for
strawberries yield (Chen et al., 2019), segmenting sorghum panicles
(Malambo et al., 2019) or applying CNN-based regression for rice grain
yield estimation (Yang et al., 2019).
In the forestry context, most studies use high-resolution data from
UAV or airborne platforms. Ayrey and Hayes (2018) used airborne
LiDAR data to map forest biomass and tree density in temperate forests.
Weinstein et al. (2020) tested the localization of individual tree crowns
(object detection) across ecosystems using airborne data. Braga et al.
(2020) used very high-resolution satellite data to delineate individual
tree crowns (instance segmentation) in tropical forests. A series of
studies dealt with the mapping of tree species or genera in forests
Fig. 10. Barplots characterizing the reviewed literature in terms of frequency of (a) different architectures, including direct implementations as well as modifications of the original architecture and (b) different approaches.
Fig. 11. Study areas of the reviewed studies.
(Kattenborn et al., 2020; Pinheiro et al., 2020; Nezami et al., 2020;
Fricker et al., 2019; Natesan et al., 2019; Trier et al., 2018; Zou et al.,
2017; Schiefer et al., 2020) and urban areas (Hartling et al., 2019; Torres
et al., 2020; Santos et al., 2019). Fromm et al. (2019) tested the detec-
tion of individual conifer seedlings in high-resolution airborne imagery
for monitoring of tree regeneration. A substantial interest exists towards
assessments of forest damage, e.g., caused by wind throw (Hamdi et al.,
2019; Korznikov, 2020) or bark beetle infestations (Safonova et al.,
2019).
Examples in conservation with medium resolution data include the
mapping of wetland types at regional scales with multispectral Landsat
and polarimetric RADARSAT-2 data (Pouliot et al., 2019; Mahdianpari
et al., 2018; Mohammadimanesh et al., 2019). de Bem et al. (2020)
mapped deforestation in the Amazon using stacked pairs of Landsat
imagery from consecutive years. In the context of the dryland mapping program by the FAO (Food and Agriculture Organization of the United Nations), Guirado et al. (2020) mapped tree cover (%) using airborne orthoimagery and exemplified that CNN-based mapping outperformed
previous assessments by FAO based on photo-interpretation. Examples
for mapping at high spatial resolution include the mapping of rainforest
types and disturbance (Wagner et al., 2019), plant succession stages in a
glacier-related chronosequence (Kattenborn et al., 2019a), herbaceous
and woody invasive species in several environments (Kattenborn et al., 2019a; Qian et al., 2020; Liu et al., 2018b), shrub cover (Guirado et al., 2017), ecosystem structure-relevant
plant communities in the Arctic tundra (Langford et al., 2019) or the
rehabilitation of native tussock grass (Lomandra longifolia) after weed
eradication campaigns (Hamylton et al., 2020).
3.4. Remote sensing platforms
Approximately 17% of the studies acquired data from the ground or
terrestrial platforms, including stationary photography (Ma et al.,
2019), mobile mapping data from Google Street View (Branson et al.,
2018; Barbierato et al., 2020), farming robots (Lottes et al., 2018), and
terrestrial laser scanning (e.g., Wang et al., 2019; Bingxiao et al., 2020).
The major part of studies using terrestrial platforms took place in an
agriculture context with a focus on precision farming.
With 36%, the largest share of studies assessed in this review used
data captured from UAV. This can be explained by two important features of UAVs: they enable the autonomous acquisition of spatially continuous data with automated georeferencing - a capability that recently revolutionized the possibilities for fast, flexible, repeated, and cost-efficient remote sensing data acquisition for vegetation analysis. At the same
time, UAV can be operated at low altitudes capturing vegetation can-
opies with high spatial detail. High-resolution data acquired by UAV and
CNN-based pattern analysis provide powerful synergies for spatially
continuous vegetation analysis. Due to the inevitable trade-off of spatial
resolution and image footprint, a drawback of any high-resolution remote sensing is the limited area coverage, which decreases the efficiency of vegetation assessments at large scales. One approach to overcome
this limitation is the spatial up-scaling of UAV-based vegetation maps
with satellite data (Kattenborn et al., 2019b), where UAV-based maps
are used as a reference for coarse-resolution but large-scale satellite-
based predictions.
Depending on the spatial scale of the vegetation analysis and the size of the decisive spatial features, airplanes may offer a more efficient compromise between area coverage and resolution. 11% of the studies
Fig. 12. Frequency of studies in the context of agriculture, forestry, and conservation. The class forestry/conservation includes studies that are relevant for both fields.
in this review used airborne sensors. In addition to increased spatial
coverage, an advantage of airplane platforms is their increased potential
payload supporting more sophisticated and high-quality sensors.
Accordingly, a large proportion of airplane-related studies used LiDAR
or hyperspectral data or a combination of both.
Aerial data from UAV and airplanes are often generated by matching
single frames from imaging sensors in concert with photogrammetric
processing techniques. Due to the relatively low height of both plat-
forms, the single image frames usually feature a substantial variation in viewing geometry and bidirectional reflectance effects. At first sight, this
may challenge the retrieval of vegetation characteristics, but as Liu and
Abd-Elrahman (2018) and Liu et al. (2018b) have shown, this variation
can also be a valuable source for increasing the amount of training data
and generating more robust models. In a case study on mapping vege-
tation types in UAV imagery, they demonstrated increasing model per-
formance when using a multi-view approach that combined tiles from
orthoimagery and the spatially corresponding single image frames.
In total 35% of the studies used data acquired from satellites. The
potential of CNN-based pattern recognition combined with the unprec-
edented amount of high-resolution satellite data was demonstrated by
Brandt et al. (2020), who mapped more than 1.8 billion trees across the
Sahara and Sahel zone with a mosaic of 11,128 satellite scenes (GeoEye-
1, WorldView-2, WorldView-3 and QuickBird-2). This pioneering study suggests how high-resolution data from small satellites (weight <500 kg) and microsatellites (weight <100 kg) will offer ground-breaking opportunities for CNN-based vegetation analysis. Examples are the Planet Labs constellation of PlanetScope satellites, which images the entire Earth's surface on a daily basis at 3.7 m resolution, or SkySat, which enables imaging targeted areas at 0.72 m resolution. These satellite constellations may provide sufficient spatial detail for various large-scale CNN-based vegetation assessments.
3.5. Sensors, spatial and spectral resolution
CNNs are most frequently applied to passive optical sensor data (RGB, multispectral, or hyperspectral). Only a few studies (7%) used products from SAR systems. Passive optical and SAR data are commonly analyzed with raster-based methods and are, hence, discussed together in Section 3.5.1. The second-largest share of studies (10%) incorporated LiDAR data, of which 3% used terrestrial LiDAR data and 7% used airborne LiDAR. The common methods for the analysis of LiDAR-based point
clouds are presented in Section 3.5.2. The fusion of multiple sensor types
is discussed in Section 3.5.3.
3.5.1. Passive optical and SAR data analysis
CNNs involve numerous transformations of the input data and the
available (mostly GPU-based) memory may, hence, limit the maximum
size of the input data. However, raster data, such as airborne or space-
borne acquisitions from passive optical or SAR-sensors, usually feature
multiple layers (e.g., bands of different wavelengths or multitemporal
data) and can, thus, occupy large data volumes. Moreover, for some CNN
approaches, e.g. image classication, it would not be meaningful to
make a single prediction for an entire raster, but, instead, make multiple
smaller-scaled predictions to reveal the spatial variation within the area
covered by the raster. For these reasons, CNN training and inference is
not performed on entire rasters but instead on equally sized tiles
extracted from a raster. The trained CNN can then be used to create
spatial maps using a sliding window principle. Thereby, the CNN is
applied to regularly extracted tiles that have the same size as the tiles
used for training.
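A minimal sketch of such tile extraction is given below, assuming the raster is held in memory as a NumPy array of shape (bands, height, width); tile size and stride are illustrative, and a seamless extraction corresponds to stride equal to the tile size.

```python
import numpy as np

def extract_tiles(raster: np.ndarray, tile_size: int = 128, stride: int = 128):
    """Sketch of sliding-window tile extraction from a (bands, height, width)
    raster; stride < tile_size yields overlapping tiles."""
    _, height, width = raster.shape
    for row in range(0, height - tile_size + 1, stride):
        for col in range(0, width - tile_size + 1, stride):
            yield row, col, raster[:, row:row + tile_size, col:col + tile_size]
```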
The most efcient approach is the seamless extraction of tiles
without overlap, whereas combining the results of multiple, overlapping
tiles may be useful to increase redundancy and compensate for edge
effects (Du et al., 2020; Brandt et al., 2020). Similarly, Neupane et al.
(2019) showed that combining the tiling results from different ortho-
photos acquired at multiple resolutions enhances the detection of palm
trees. Generally, the tile size should be maximized within the given memory capacities, as larger sizes increase the CNN's field of view and, hence, amplify the available spatial context and thus the accuracy of the model. This effect was demonstrated by Kattenborn et al. (2020), where the accuracy in estimating the cover of plant species and communities from UAV imagery increased considerably from smaller (2 m) to larger tile sizes (5 m). Likewise, using the example of predicting crop yield from
UAV imagery, Nevavuori et al. (2019) demonstrated that larger tile sizes
(10, 20, 40 m) resulted in more accurate predictions. Especially for very
high-resolution data, it should also be considered that increasing the tile size can furthermore decrease the effect of spatially inaccurate reference data (e.g., geolocation errors of in-situ data or inaccurately delineated
masks or bounding boxes). However, in the case of image regression or
classication (Section 3.2.2.1), which results in a single prediction per
tile, increasing the tile size decreases the spatial grain of the mapping
output (Kattenborn et al., 2020). For segmentation approaches (Section
3.2.2.3), the spatial extent of the input tiles will have no effect on the
output resolution. The processing speed of the sliding window approach
can be enhanced by first pre-filtering areas of the target raster using a region proposal. For instance, in the context of shrub cover segmenta-
tion in arid areas, Guirado et al. (2017) used brightness thresholds and
edge-detectors, as these are already a good indicator to show the general
occurrence of shrubs.
In addition to the spatial context or tile size, the spatial resolution is
a decisive factor. The spatial resolution most strongly varies with the
remote sensing platform (Fig. 13) and additionally depends on operating
altitude and sensor properties. Although CNN applications are designed
for pattern analysis, the highest possible resolution will not ultimately
be the most operational solution, as higher resolution comes with
increased storage and computation loads. In addition, data acquisition at
higher spatial resolution leads to smaller area coverage. The ideal spatial
resolution is determined by the spatial scale at which the characteristic
patterns of the target class or quantity occur. For instance in the context
of tree species mapping, Schiefer et al. (2020) showed that decreasing the spatial resolution from 2 to 8 cm decreases the accuracy (F-score) by at least 25%. Fromm et al. (2019) showed that the detection accuracy for tree seedlings based on different UAV-image resolutions (0.3–6.3 cm)
can vary up to 20%. Similarly, Neupane et al. (2019) found a 17%
decrease in detection accuracy for banana palms in plantations when
decreasing the pixel size from 40 cm to 60 cm. Weinstein et al. (2020)
assessed the relationship between object size and spatial resolution the
other way around. They did not change the spatial resolution of the
remote sensing data, but analyzed different ecosystems with charac-
teristic tree sizes and concluded that treetop detection for small trees (in
alpine forests) was the least accurate.
Regarding spectral resolution of passive optical sensors, the
Fig. 13. Frequency distribution of spatial resolutions by different remote
sensing platforms among the reviewed studies (only raster prod-
ucts considered).
literature search revealed that with 52% the largest share of studies used
RGB imagery, whereas only 31% used multispectral (defined as RGB and at least one additional band) and only 9% used hyperspectral data (defined here as >20 spectral bands). The fact that multispectral and hyperspectral data are less frequently used is not a surprise; multispectral and hyperspectral sensors feature larger pixel sizes, as narrower spectral bands receive less radiation and the amount of radiation received by the sensor must clearly surpass its signal-to-noise ratio. Accordingly, everything else being equal, multispectral and hyperspectral sensors have a lower spatial resolution than RGB data. As CNNs are particularly designed for pattern analysis, RGB data may often be preferred.
Accordingly, the results of several studies suggest that for many tasks no high-spectral-resolution information may be needed: For
instance, Osco et al. (2020) found that counting citrus trees did not
clearly improve when combining multispectral with RGB data. Zhao
et al. (2019) found no improvement in using multispectral over RGB
data for rice damage assessments (rice lodging). Yang et al. (2019)
showed that the added value of multispectral on top of RGB information
only slightly improved the estimation accuracy of rice grain yield. In the
context of tree species classification, Nezami et al. (2020) did not report clear improvements in using UAV-based hyperspectral data over RGB data. Similarly, Kattenborn et al. (2019a) showed that CNN-based species identification is more accurate than a pixel-based hyperspectral classification of plant species (Lopatin et al., 2019; Kattenborn et al., 2019b).
Yet, for several fields of application spectral data may be absolutely necessary. For instance, analyses related to chemical constituents in plant tissue, e.g. as a proxy for plant health status or plant diseases (Zarco-Tejada et al., 2018; Zarco-Tejada et al., 2019), may not be possible without sufficient spectral information, as biochemistry particularly changes absorption properties and not patterns.
Finally, it should be noted that high spectral and spatial resolution
can also be combined. For example, pan-sharpening algorithms, such as
local-mean variance matching or Gram-Schmidt spectral sharpening, can
be used to sharpen coarser multi-spectral bands with spatially high-
resolution imagery. Such pan-sharpening algorithms are often applied
to imagery from very high-resolution satellite sensors that feature a
panchromatic band, as for instance WorldView, QuickBird, or Pleiades
(cf. Li et al., 2017; Hartling et al., 2019; Korznikov, 2020; Braga et al.,
2020). Recently, more sophisticated pan-sharpening algorithms based
on CNNs were proposed (Masi et al., 2016; Yuan et al., 2018, see also
Section 3.5.3).
SAR backscatter is known to be particularly sensitive to vegetation
3D-structure and therefore has a great potential for differentiating
vegetation types and growth forms. The fact that microwaves penetrate clouds makes it especially suitable for extracting continuous temporal features and for large-scale assessments. Accordingly, SAR data was most frequently used as input for CNNs for land cover and vegetation type
mapping (Mohammadimanesh et al., 2019; DeLancey et al., 2020; Liao
et al., 2020). Although SAR data have been used overall relatively rarely
so far in combination with CNNs, it can be assumed that CNNs are
excellently suited to unravel the relatively complex SAR signals and will
thus play a major role in Earth observation in the long term (see Zhu
et al., 2020 for a review on analyzing SAR data with deep learning).
3.5.2. LiDAR-based point cloud analysis
The analysis of spatial point clouds is basically more computationally
intensive than for raster data since there is no spatial discretization (and
thus no normalization) in cells, which often results in larger data sets
and more complex spatial representations. A strategy to increase the
processing speed is to run the analysis on subsets of point clouds, for
instance, by detecting key features of the target plant, which are then
used as seeds to apply region growing algorithms. This approach was for
instance applied for detecting individual maize plants Zea parviglumis
(Jin et al., 2018) and rubber plants Hevea brasiliensis (Wang et al., 2019).
The most frequently applied strategy to handle point clouds (mostly
terrestrial LiDAR) is the conversion to simpler and discrete feature
representations prior to the CNN analysis, including 3D voxels or 2D
projections (e.g. depth maps) (Zou et al., 2017; Ko et al., 2018; Jin et al.,
2018; Windrim and Bryson, 2020).
Voxels are volumetric representations of the point cloud that are
dened by regular and non-overlapping 3D cube-like cells. During the
conversion of point clouds to voxel datasets, a voxel is created in a
delimitable area (x,y,z) if it contains one or a minimum number of
points. Voxels can be analyzed in a similar way as multi-layered rasters,
where a layer corresponds to an elevation section of the original point
cloud. Jin et al. (2019) used a 0.4 cm voxel space with terrestrial LiDAR
data to separate leaves and stems from individual maize plants. Ayrey
and Hayes (2018) used 25 × 25 × 33 cm voxels created from airborne LiDAR-based point clouds to map forest properties, whereby each voxel was assigned the number of points it included.
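The basic conversion can be sketched in a few lines of NumPy; voxel size and the minimum point count are illustrative assumptions.

```python
import numpy as np

def voxelize(points: np.ndarray, voxel_size: float = 0.25, min_points: int = 1):
    """Sketch of point-cloud voxelization: points is an (N, 3) array of x, y, z
    coordinates; returns the occupied voxel indices and their point counts."""
    indices = np.floor((points - points.min(axis=0)) / voxel_size).astype(int)
    cells, counts = np.unique(indices, axis=0, return_counts=True)
    keep = counts >= min_points          # create a voxel only above the point threshold
    return cells[keep], counts[keep]
```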
Projections are 2D representations of the point cloud from a certain
position (x,y,z) and viewing angle (azimuth, zenith). The projections can
be created by different spatial or spectral criteria, e.g. as depth maps,
previously extracted 3D metrics describing the local neighbourhood, in-
tensity, or color information (Ko et al., 2018; Jin et al., 2018; Zou et al.,
2017). For airborne LiDAR data, projections are commonly created
using nadir view, for instance, to extract digital height models or to
extract height percentiles. For terrestrial LiDAR, projections are typi-
cally created using oblique viewing angles. The transformation from
TLS-based point clouds to depth images (2D) is usually applied multiple
times using different viewing geometries, which can be considered as a
form of data augmentation (see Section 3.2.1.1). For instance, to train a
CNN for detecting individual maize plants in TLS point clouds, Jin et al.
(2018) created 32 2D-projections for each target location with varying
oblique angles.
Despite such possibilities to decrease computation load, it has to be
considered that projections or voxel representations of the point cloud
will result in a loss of the original spatial detail. Therefore, it may be
desirable to use end-to-end learning directly with the raw point cloud
data as input. Using the raw point clouds instead of voxel or projections
may be more computationally demanding but it can be assumed that
ongoing developments in processing and algorithms will advance ca-
pabilities to harness point clouds directly. Another challenge is that
point clouds are unordered sets of vectors (in contrast to elements in
raster layers) and their analysis requires a spatial invariance with
respect to rotations and translations. A well-known CNN architecture that addresses these challenges is PointNet, which, hence, enables efficient end-to-end learning on point clouds. The foundation of PointNet are symmetric functions that ensure permutation invariance with regard to the unordered input, and a transformation of the data into a canonical feature space to ensure spatial invariance. Even though PointNet or similar algorithms have been used comparatively rarely so far, the results are very promising: Jin et al. (2020) applied PointNet to detect ground points
under dense forest canopies and found greater accuracy than for tradi-
tional non-deep learning methods. Briechle et al. (2020) tested PointNet
to classify temperate tree species in UAV LiDAR data and reported an
overall accuracy of up to 90%. Bingxiao et al. (2020) and Windrim and
Bryson (2020) used modified versions of PointNet, which besides point coordinates also consider the LiDAR return intensity, and demonstrated
high accuracy in differentiating woody elements and foliage for multiple coniferous and deciduous tree species (up to 93–96% overall accuracy).
The results of the aforementioned studies are especially remarkable,
considering that this approach performs a classification at the highest
possible detail, i.e. at the level of individual points.
3.5.3. Sensor and data fusion
Multimodal remote sensing analysis or data fusion is the combina-
tion of acquisitions of different sensor types (LiDAR, SAR, passive op-
tical). The different characteristics of the sensor types result in different
sensitivities towards plant properties: Passive optical data is largely
shaped by absorption and scattering properties at the top of the canopy.
SAR signals are composed of directional scattering processes originating
in a few centimeters or even meters depth in the canopy (depending on
the wavelength). LiDAR measures backscattered radiation of commonly
very small footprints enabling to look deep into plant canopies. These
different sensing modes can hence reveal different plant characteristics
and their synergistic use can be used to harness complementary
information.
A conceptually rather simple fusion approach is to merge the
resulting predictions of multiple, dataset-specic CNNs. This can, for
instance, be done by majority voting (Baeta et al., 2017) or by proba-
bilistic approaches, such as Conditional Random Fields (Branson et al.,
2018). However, this way only the output space is combined, but not the
features contained in the different data sources, so that their synergies
cannot be directly integrated and exploited. Therefore it is usually more
expedient to simultaneously integrate the different data sources in a
single neural network - also known as feature level fusion. Feature
level fusion requires either preprocessing of the data or an adaptation of the CNN architectures to comply with different data structures (e.g. point cloud vs. raster data) or sensing modalities (e.g., viewing angles from oblique SAR vs. nadir passive optical acquisitions).
A frequently used approach for feature level fusion is converting and
normalizing the spatial dimensions of the different sensor products and
subsequently stacking them into a common tensor. Based on this tensor, a CNN
can be applied to simultaneously extract features from both data sour-
ces. This approach is easy to implement and most frequently applied. For
instance, Trier et al. (2018) stacked hyperspectral data with normalized
Digital Surface Models (also referred to as canopy height model) for
classifying tree species. Hartling et al. (2019) stacked LiDAR intensities,
hyperspectral and panchromatic bands for tree species classification in
urban areas. Prior to applying the CNN, they also used the LiDAR data to
extract tree crown segments by height. In the context of large scale
mapping of vegetation types in the arctic, Langford et al. (2019) stacked
multiple satellite products, including high spatial resolution SPOT data,
high spectral resolution EO1-Hyperion data and a height model derived
from SAR-interferometry. In the context of mapping crop cover types,
Liao et al. (2020) stacked multi-temporal polarimetric RADARSAT-2
SAR data with VENμS multispectral data using a 1D-CNN. While multi-
spectral data was superior to SAR data, combining multi-temporal SAR data
with multispectral data increased the model performance. Nezami et al.
(2020), Kattenborn et al. (2019a), Kattenborn et al. (2020), Sothe et al.
(2020) used UAV imagery for mapping plant species and stacked RGB
orthoimagery and canopy height models (CHM) derived from photo-
grammetric processing pipelines. Interestingly, Nezami et al. (2020)
found minor improvements when using CHM information for UAV-
based tree species classification. Sothe et al. (2020) found that CHM
information does not significantly improve the accuracy, whereas
Kattenborn et al. (2020) suggested that at these high spatial resolutions
the information represented by the CHM is already indirectly visible in
the orthoimagery itself through shadows and illumination differences.
In contrast, for the example of coarser-resolution satellite imagery and
forest type classification, Sothe et al. (2020) reported that stacking
LiDAR-derived canopy height information with pan-sharpened World-
View-2 data contributed important information.
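To make the procedure concrete, the following sketch illustrates the stacking strategy with hypothetical file names and band counts (assuming both rasters cover the same extent): a canopy height model is resampled to the grid of a multispectral orthoimage and both are stacked into one tensor that a standard CNN can consume.

```python
import numpy as np
import rasterio
from rasterio.enums import Resampling

# hypothetical inputs: a multispectral orthoimage and a canopy height model
with rasterio.open("orthoimage.tif") as src:
    optical = src.read().astype("float32")      # shape: (bands, rows, cols)
    rows, cols = src.height, src.width

with rasterio.open("chm.tif") as src:
    # resample the CHM to the spatial grid of the optical data
    # (assumes both rasters share the same spatial extent)
    chm = src.read(1, out_shape=(rows, cols),
                   resampling=Resampling.bilinear).astype("float32")

# normalize each input to a comparable value range before stacking
optical /= np.maximum(optical.max(), 1e-6)
chm /= np.maximum(chm.max(), 1e-6)

# stack to a common tensor of shape (bands + 1, rows, cols);
# 'stacked' can now be tiled and fed into any standard CNN
stacked = np.concatenate([optical, chm[np.newaxis, ...]], axis=0)
```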
Overall, these studies demonstrated that merging the different data
sources into a single tensor can potentially facilitate the extraction of
complementary signals through convolutions. This approach is easy to
implement as it does not require manipulating common CNN structures.
However, stacking datasets may not be ideal, as the normalization to a
common tensor may introduce a critical loss of the original infor-
mation, e.g., by converting point clouds to coarse voxels or depth maps
(cf. Section 3.5.2), or because the viewing geometries and acquisition modes
may not be directly compatible, e.g., oblique SAR vs. nadir optical data.
Instead of fusing datasets through a common tensor, it may, therefore,
be more advantageous to process the different data sources in parallel
branches and perform a feature concatenation at a later stage in the
network; that is, linking the activations or feature maps derived from
multiple, sensor- or data-specific CNNs. These networks are also referred
to as multi-stream networks. For the example of mapping rice grain
yield from UAV imagery, Yang et al. (2019) applied a concatenation of
feature maps resulting from two CNN branches, namely RGB imagery
with high and multispectral imagery with low spatial resolution,
respectively. A prime example of how feature concatenation enables to
integrate different data types and structures was presented by Branson
et al. (2018), who classified tree species in an urban environment by
concatenating a branch fed with nadir airborne RGB imagery and a
branch fed with multiple Google Street View scenes extracted with
varying viewing angles and zoom levels. Lottes et al. (2018) used feature
concatenations for detecting crop plants and weed infestations in image
sequences taken by a farming robot. Their approach takes into account
that planting patterns in agricultural fields (e.g. row structures) provide
additional spatial information for differentiating crops from weeds.
Accordingly, their approach included the parallel segmentation of suc-
cessive image frames using encoder-decoder CNN structures and the
subsequent concatenation of the resulting feature maps.
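The following PyTorch sketch illustrates the basic structure of such a multi-stream network (layer sizes, band counts and the concatenation point are arbitrary assumptions, not the architecture of any of the cited studies): each data source is processed by its own convolutional branch, and the resulting features are concatenated before the final prediction layers.

```python
import torch
import torch.nn as nn

def conv_branch(in_channels):
    """Small convolutional branch producing a 64-dimensional feature vector."""
    return nn.Sequential(
        nn.Conv2d(in_channels, 32, 3, padding=1), nn.ReLU(),
        nn.MaxPool2d(2),
        nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    )

class TwoStreamCNN(nn.Module):
    """Sketch of feature-level fusion with two sensor-specific branches."""

    def __init__(self, n_classes=4):
        super().__init__()
        self.rgb_branch = conv_branch(in_channels=3)   # e.g. RGB orthoimagery
        self.ms_branch = conv_branch(in_channels=5)    # e.g. multispectral bands
        self.head = nn.Sequential(
            nn.Linear(64 + 64, 64), nn.ReLU(),
            nn.Linear(64, n_classes),
        )

    def forward(self, rgb, ms):
        # concatenate the branch-specific features before the prediction head
        fused = torch.cat([self.rgb_branch(rgb), self.ms_branch(ms)], dim=1)
        return self.head(fused)

# the two inputs may differ in spatial resolution, since each branch pools
# its own feature maps before the concatenation
model = TwoStreamCNN(n_classes=4)
out = model(torch.rand(2, 3, 128, 128), torch.rand(2, 5, 32, 32))
```

Where exactly the concatenation is placed, e.g. between convolutional blocks or only after fully connected layers, is a design choice; an explicit comparison of such choices was performed by Barbosa et al. (2020), as discussed below.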
Barbosa et al. (2020) compared data fusion via both dataset stacking
and feature concatenation for crop yield mapping based on
heterogeneous input data, including remote sensing reflectance and
elevation data and in-situ maps on nitrogen, seed rate, and soil elec-
troconductivity. They tested multi-stream approaches with branches
being concatenated at an early stage and a later stage in the network,
that is before and after applying fully connected layers, respectively. The
best performance was achieved with a concatenation after fully con-
nected layers, followed by a feature concatenation at an earlier stage in
the network. The worst performance was found when stacking all pre-
dictors before applying the CNN, which was attributed to a sometimes
complex relationship among different input datasets.
Another noteworthy application of multi-stream networks is CNN-
based pan-sharpening, i.e. the process of fusing high spectral informa-
tion from the coarser-resolution bands with high spatial resolution in-
formation. Pan-sharpening is frequently applied to data from very high-
resolution satellites as these are often equipped with pan-chromatic
bands that have wider spectral bandwidths enabling an increased
sensitivity for incoming radiance and thus higher spatial resolution than
the other bands with narrower bandwidths. The fusion of spatial and
spectral information requires the representation of highly complex and
non-linear relationships - an application for which CNN are ideally
suited (Yuan et al., 2018; Dong et al., 2016). A case study on this seminal
technique was presented by Brook et al. (2020), who used a multi-scale
pan-sharpening algorithm (Yuan et al., 2018) to fuse both multispectral
and -temporal information from Sentinel-2 satellite data with the high
spatial resolution information from UAV imagery at the centimetre scale. This
case study demonstrated that the approach can reveal
the temporal variation of the leaf biochemical status of individual vineyard
rows.
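For illustration, a residual pan-sharpening CNN can be sketched as follows; this is a generic example in the spirit of CNN-based sharpening approaches, not the specific algorithm of Yuan et al. (2018) or the workflow of Brook et al. (2020). The coarse multispectral bands are upsampled to the panchromatic grid, stacked with the panchromatic band, and a small CNN predicts a residual correction.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimplePanSharpenCNN(nn.Module):
    """Illustrative residual pan-sharpening CNN (not a published model)."""

    def __init__(self, n_ms_bands=4):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(n_ms_bands + 1, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, n_ms_bands, 3, padding=1),
        )

    def forward(self, ms, pan):
        # ms: (B, n_ms_bands, h, w); pan: (B, 1, H, W) with H > h, W > w
        ms_up = F.interpolate(ms, size=pan.shape[-2:], mode="bilinear",
                              align_corners=False)
        # the CNN learns a residual that sharpens the upsampled bands
        residual = self.body(torch.cat([ms_up, pan], dim=1))
        return ms_up + residual

# example: sharpen 4 multispectral bands to the panchromatic resolution
model = SimplePanSharpenCNN(n_ms_bands=4)
sharpened = model(torch.rand(1, 4, 64, 64), torch.rand(1, 1, 256, 256))
```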
It should be noted that multitemporal analysis (e.g., change detec-
tion, time series analysis) can also be considered as feature level fusion.
As discussed in more detail in Section 3.5.4, multitemporal analysis can
be performed using both of the modes presented above, that is, stacking
multidate inputs (de Bem et al., 2020) or concatenating them in mul-
tiple CNN branches operating in parallel (Branson et al., 2018; Mazzia
et al., 2019).
3.5.4. Multi-temporal analysis
Almost all plant life is subject to seasonal variation as a consequence
of recurring changes of abiotic factors, such as radiation driving
photosynthesis, temperature controlling its efficiency, or water input
providing the primary oxidation source. The seasonal phases or dy-
namics of plants, also known as phenology, are expressed through
biochemical and structural properties, which in turn determine how
plants are represented in remote sensing data. This implies that temporal
variation in plant traits can limit the transferability of our models
through time. At the same time, temporal dynamics can also provide
essential information for plant characterization, e.g. phenological fea-
tures such as owers revealing the taxonomic identity and or the length
of the growing season as an essential factor for productivity and yield.
A few studies assessed model performances based on comparing or
combining multitemporal datasets. For instance, Ma et al. (2019)
assessed biomass estimation with sub-centimetre imagery in wheat
crops across 17 acquisition dates and found a strong variation in accu-
racy (R² = 0.60–0.89), highlighting that timing can play an important role.
Rezaee et al. (2018) successfully tested the temporal transferability of a
CNN for wetland segmentation on a RapidEye scene that was not
included in the training process. Yang et al. (2019) tested the trans-
ferability of CNN models across time for rice grain yield estimation, in
terms of how well a CNN trained on one or multiple phenological
phases generalizes to a phenological phase it has not seen before. As
expected, the models became better the more acquisition dates were considered in
the training process. Similarly, Zhang et al. (2018) showed that stacking
multidate Landsat scenes increased the accuracy of segmenting rice
paddies.
In the context of satellite-based land cover classification, Mazzia
et al. (2019) incorporated spatial patterns of temporal dynamics by
concatenating the pixel-wise branches of recurrent neural networks
(RNNs), followed by the subsequent application of a CNN. RNNs are a
class of neural networks designed to reveal recurring patterns and are
therefore particularly suitable for multitemporal remote sensing analysis
(Zhu et al., 2017; Zhong et al., 2019). A primary strength of RNNs is
their ability to resemble temporal patterns despite the presence of dat