EVOLUTIONARY OPTIMIZATION OF NEURAL ARCHITECTURES
IN REMOTE SENSING CLASSIFICATION PROBLEMS
Daniel Coquelin1, Rocco Sedona 2,3, Morris Riedel 2,3, Markus Götz1
1Steinbuch Centre for Computing (SCC), Karlsruhe Institute of Technology (KIT), Germany
2Juelich Supercomputing Centre (JSC), Forschungszentrum Jülich (FZJ), Germany
3School of Engineering and Natural Sciences (SENS), University of Iceland (UoI), Iceland
ABSTRACT
BigEarthNet is one of the standard large remote sensing datasets. It has been shown previously that neural networks are effective tools to classify the image patches in this dataset. However, finding the optimal network hyperparameters and architecture to accurately classify the image patches in BigEarthNet remains a challenge. Searching for more accurate models manually is extremely time-consuming and labour-intensive. Hence, a systematic approach is advisable. One possibility is an automated evolutionary Neural Architecture Search (NAS). With this NAS, the manual selection of many commonly used network hyperparameters, such as the loss function, is eliminated and a more accurate network is determined.
Index Terms: Neural Architecture Search, NAS, Evolutionary Algorithms, Remote Sensing, Classification
1. INTRODUCTION
Deep learning has achieved unprecedented results across many domains in the last decade, and Remote Sensing (RS) is no exception. As deep neural networks require a large amount of data to train, the constant stream of measurements from satellite-borne sensors makes RS a frequent use case. In recent years, publicly accessible, labeled, RS-specific datasets [1, 2] have been released, allowing the RS community to train neural networks tailored to specific research questions.
ResNet-50 is a convolutional neural network (CNN) with skip connections which has been shown to be effective for computer vision tasks [3]. It has been tested extensively and is the basis for many CNNs used in a multitude of different fields, including Remote Sensing.
Fig. 1. Example patches and labels for Sentinel-2 tiles [6].

There are multiple ways of evaluating the effectiveness of a neural network, many of which use the ideas of precision and recall. The F1 score is one such measure. It is the harmonic mean of precision and recall and is often used in the multi-class, multi-label case. The result can be calculated globally, i.e. the micro-F1 score, or it can be the unweighted average of the F1 scores for each class, i.e. the macro-F1 score [4]. There are other methods, however they are not relevant to this work.
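For clarity, the two aggregation schemes can be written as follows; these are the standard definitions and are not specific to this work. TP, FP, and FN denote true positives, false positives, and false negatives, and C is the number of classes.

```latex
F_1 = \frac{2\,\mathrm{precision}\cdot\mathrm{recall}}{\mathrm{precision}+\mathrm{recall}},
\qquad
\text{micro-}F_1 = \frac{2\sum_{c=1}^{C}\mathrm{TP}_c}
                        {2\sum_{c=1}^{C}\mathrm{TP}_c + \sum_{c=1}^{C}\mathrm{FP}_c + \sum_{c=1}^{C}\mathrm{FN}_c},
\qquad
\text{macro-}F_1 = \frac{1}{C}\sum_{c=1}^{C} F_{1,c}
```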
BigEarthNet (http://bigearth.net/) is a large-scale remote-sensing dataset containing patches extracted from 125 Sentinel-2 tiles (Level-2A) acquired from June 2017 to May 2018 [1]. The archive consists of 590,326 patches, each of which is assigned one or more of the 19 available labels. The label nomenclature is an adaptation of the CORINE Land Cover [5], which consists of labels from 10 European countries updated in 2018 [6]. Each patch has 12 spectral bands at various resolutions: the 3 RGB bands and band 8 at 10 m resolution (120 px × 120 px); bands 5, 6, 7, 8a, 11, and 12 at 20 m resolution (60 px × 60 px); and bands 1 and 9 at 60 m resolution (20 px × 20 px). The cirrus-sensitive band 10 is omitted, as are patches covered with snow or clouds [7]. Example patches are shown in fig. 1.
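The band layout described above can be summarized in a small data structure. The dictionary below is purely illustrative; the Sentinel-2 band names and the grouping follow the description in the text, not any particular loader API.

```python
# Illustrative summary of the BigEarthNet band layout used in this work:
# resolution in metres -> (patch size in pixels, Sentinel-2 band names)
BAND_GROUPS = {
    10: (120, ["B02", "B03", "B04", "B08"]),                # RGB bands and band 8
    20: (60,  ["B05", "B06", "B07", "B8A", "B11", "B12"]),
    60: (20,  ["B01", "B09"]),                              # included here, excluded in [6]
}

# Band 10 (cirrus) is not part of the archive; patches covered by snow or
# clouds are likewise excluded [7].
total_bands = sum(len(bands) for _, bands in BAND_GROUPS.values())
assert total_bands == 12
```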
In [6], multiple network types were trained on the BigEarthNet dataset with varying degrees of success. Those experiments excluded bands 1 and 9, utilized the Adam optimizer, the sigmoid cross entropy loss, and an initial learning rate of 0.001, and were trained for 100 epochs. It was found that ResNet-50 was the most accurate network architecture, achieving a micro-F1 score of 77.11 and a macro-F1 score of 67.33.
As a portion of the data (bands 1 and 9) was excluded in those experiments, including it may increase the effectiveness of a network. This, combined with a different network architecture or training configuration, has the potential to greatly improve both training time and accuracy.
2. EVOLUTIONARY OPTIMIZATION
The structure of a neural network is determined by its hyperparameters. These include the optimizer, the number of training epochs, the network architecture, the loss function, and many others. As the number of possible hyperparameter combinations for a network can become extremely large, it is typically infeasible to determine effective combinations manually. Therefore, this is typically done automatically via a hyperparameter search and a Neural Architecture Search (NAS). A NAS typically takes a large amount of time to complete, as it must train a network for each set of hyperparameters to determine that set's effectiveness or quality. The algorithms for finding the optimal hyperparameters within a NAS vary widely [8]; this work utilizes an evolutionary NAS to search for an optimal configuration of the ResNet-50 network used in [6].
The evolutionary NAS used here begins with a population of random hyperparameter sets, known as offspring. After these networks are trained, they are sorted into the mating population. When an offspring is added to the mating population, the worst-performing network is removed. New generations of offspring are created by breeding two random members of the mating population, i.e. randomly sampling hyperparameters from each parent's set. During the mating process, mutations can occur, i.e. a small portion of the offspring's hyperparameter set is modified. These new networks are then trained, and the process is repeated either for a specified number of generations or until an external factor stops it.
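A minimal sketch of this loop is given below. It is not the propulate implementation used later in this paper; the toy search space, population size, mutation rate, and the train_and_evaluate function are hypothetical placeholders, and the real search evaluates many candidates in parallel.

```python
import random

# Toy subset of the search space in Table 1, used only to illustrate the loop.
SEARCH_SPACE = {
    "optimizer": ["Adadelta", "SGD", "Adam"],
    "loss": ["binary_crossentropy", "categorical_hinge"],
    "activation": ["relu", "softmax", "linear"],
}

def breed(parent_a, parent_b, mutation_rate=0.1):
    """Sample each hyperparameter from one of the two parents, with occasional mutation."""
    child = {}
    for name, candidates in SEARCH_SPACE.items():
        child[name] = random.choice([parent_a[name], parent_b[name]])
        if random.random() < mutation_rate:
            child[name] = random.choice(candidates)  # mutation: re-draw from the search space
    return child

def evolutionary_nas(train_and_evaluate, pop_size=16, generations=100):
    """Return the best (score, hyperparameter set) pair found by the search."""
    # Initial offspring: random hyperparameter sets.
    population = [{k: random.choice(v) for k, v in SEARCH_SPACE.items()} for _ in range(pop_size)]
    mating_pool = [(train_and_evaluate(ind), ind) for ind in population]

    for _ in range(generations):
        parent_a, parent_b = random.sample([ind for _, ind in mating_pool], 2)
        child = breed(parent_a, parent_b)
        mating_pool.append((train_and_evaluate(child), child))
        # Adding an offspring removes the worst-performing network from the mating pool.
        mating_pool.sort(key=lambda pair: pair[0], reverse=True)
        mating_pool.pop()
    return mating_pool[0]
```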
3. EXPERIMENTS
The hyperparameter search space was divided into six groups: optimizers, learning rate (LR) schedulers, activation functions, loss functions, the number of filters in each convolutional block, and the activation order. The search space is shown in table 1, excluding the number of filters, which spans the range from two to 256 on an exponential scale. For this network, the number of filters in each convolutional block is determined by a fixed ratio relative to the number of filters in the first block. The parameters of the learning rate schedulers and optimizers were also included in the search space.
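One way to encode such a search space is as a plain mapping from hyperparameter names to candidate values, with additional entries for the scheduler and optimizer parameters. The dictionary below is an illustrative sketch based on Table 1; the continuous parameter ranges and the exact encoding used by the actual search are assumptions.

```python
# Illustrative encoding of the search space in Table 1 (continuous ranges are assumptions).
SEARCH_SPACE = {
    "optimizer": ["Adadelta", "Adagrad", "Adam", "AMSGrad", "Adamax",
                  "Ftrl", "Nadam", "RMSprop", "SGD"],
    "lr_scheduler": ["ExponentialDecay", "InverseTimeDecay", "PolynomialDecay"],
    "activation": ["elu", "exponential", "hard_sigmoid", "linear", "relu", "selu",
                   "sigmoid", "softmax", "softplus", "softsign", "swish", "tanh"],
    "activation_order": ["original", "bn_after_addition", "activation_before_addition",
                         "activation_only_pre_activation", "full_pre_activation"],
    "loss": ["binary_crossentropy", "categorical_crossentropy", "categorical_hinge",
             "hinge", "kl_divergence", "squared_hinge"],
    # Number of filters in the first convolutional block; later blocks scale by a fixed ratio.
    "filters": [2 ** p for p in range(1, 9)],  # 2, 4, ..., 256
    # Example continuous parameters of the optimizers and LR schedulers.
    "initial_learning_rate": (1e-4, 1.0),
    "rho": (0.5, 0.99),
}
```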
It has been shown in [13] that the order of the activation, batch normalization (BN), and convolution layers within residual building blocks can affect the accuracy of a network. The configurations shown in fig. 2 are used when referring to the activation order.
Fig. 2. The various orders of the activation, batch normalization (BN), and convolution layers within the residual building blocks used in a network. ReLU is shown as an example activation function; during the NAS the activation function is defined by the network hyperparameters [13].
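To make the difference concrete, the sketch below builds a simplified residual block in the two extreme orderings from fig. 2: the original post-activation design and full pre-activation. It is a minimal Keras sketch for illustration only, not the ResNet-50 blocks used in the experiments (which also contain bottleneck 1x1 convolutions and projection shortcuts).

```python
import tensorflow as tf
from tensorflow.keras import layers

def original_block(x, filters, activation="relu"):
    """Original ordering: conv -> BN -> activation, with the activation applied after the addition.
    Assumes x already has `filters` channels so the identity shortcut matches."""
    y = layers.Conv2D(filters, 3, padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.Activation(activation)(y)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = layers.BatchNormalization()(y)
    y = layers.Add()([x, y])
    return layers.Activation(activation)(y)

def full_pre_activation_block(x, filters, activation="relu"):
    """Full pre-activation: BN -> activation -> conv on the residual branch, no activation after the addition.
    Assumes x already has `filters` channels so the identity shortcut matches."""
    y = layers.BatchNormalization()(x)
    y = layers.Activation(activation)(y)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = layers.BatchNormalization()(y)
    y = layers.Activation(activation)(y)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    return layers.Add()([x, y])
```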
The data is prepared analogously to [6], although no image augmentation is performed. The network was implemented in TensorFlow [17] and the NAS was controlled by the open source package propulate [18]. propulate maintains the status of the population via MPI [19]. The experiments were performed on various numbers of NVIDIA A100 GPUs [20] on ForHLR 2 at KIT. The networks use early stopping to exit training if the loss measured on the validation set has not improved for ten epochs.
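In TensorFlow, this early-stopping behaviour can be expressed with the built-in callback; the snippet below is a sketch of one plausible configuration, as the exact settings used in the experiments are not reported.

```python
import tensorflow as tf

# Stop training once the validation loss has not improved for ten consecutive epochs,
# and restore the weights from the best epoch seen so far.
early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",
    patience=10,
    restore_best_weights=True,
)

# model.fit(train_ds, validation_data=val_ds, epochs=100, callbacks=[early_stopping])
```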
Hyperparameter combinations which contained the optimizers Adam, Adamax, Nadam, and RMSprop failed more frequently than their counterparts. These optimizers will return NaN values if the combination of loss function, optimizer, and their parameters is unfavorable, most likely due to their adaptive algorithms. This behavior is commonly referred to as instability. It is important to note that an optimizer's stability is typically a poor indicator of its effectiveness. Since other optimizers were more stable, the more unstable optimizers were quickly excluded from the search space. To compensate for this effect, individual NASs were performed for Adam, Adamax, Nadam, and RMSprop.
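Such unstable configurations can be detected and abandoned early during the search. The snippet below shows one way to do this with a standard Keras callback; treating a NaN-terminated run as a minimal fitness value is an assumption about how a search could handle these failures, not a description of the exact mechanism used here.

```python
import math
import tensorflow as tf

def evaluate_candidate(model, train_ds, val_ds):
    """Train a candidate network and return its fitness, guarding against NaN losses."""
    history = model.fit(
        train_ds,
        validation_data=val_ds,
        epochs=100,
        callbacks=[tf.keras.callbacks.TerminateOnNaN()],  # abort once the loss becomes NaN
        verbose=0,
    )
    losses = history.history.get("loss", [])
    if not losses or math.isnan(losses[-1]):
        return float("-inf")  # unstable configuration: assign the worst possible fitness
    # Placeholder fitness; a validation F1 metric could be used instead.
    return -min(history.history["val_loss"])
```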
As poorly performing networks are removed from the population, the relative frequency with which a hyperparameter is selected can be used as a proxy for the stability and accuracy of the networks using that hyperparameter. This measurement is conducted across all of the completed searches so as to avoid excluding the hyperparameter options which perform better with the less stable optimizers. By this measure, the most successful loss functions were binary cross entropy and categorical hinge. The least effective loss functions were hinge, Kullback-Leibler divergence, and squared hinge. Exponential decay, inverse time decay, and polynomial decay were used at roughly the same frequency, indicating that, when given suitable parameters, all three are stable and effective choices for learning rate schedulers.
Table 1. Neural architecture and hyperparameter search space. The Activation entry lists the candidate activation functions; activation order details are shown in fig. 2. ELU is the exponential linear unit, ReLU is the rectified linear unit, SELU is the scaled exponential linear unit, K-L divergence is the Kullback-Leibler divergence, and tanh is the hyperbolic tangent.

Optimizer [9, 10]: Adadelta, Adagrad, Adam, AMSGrad, Adamax, Ftrl, Nadam, RMSprop, SGD
LR Scheduler [11]: Exponential Decay, Inverse Time Decay, Polynomial Decay
Activation [12]: ELU, Exponential, Hard sigmoid, Linear, ReLU, SELU, Sigmoid, Softmax, Softplus, Softsign, Swish, tanh
Activation Order [13]: Original, BN after addition, Activation before addition, Activation-only pre-activation, Full pre-activation
Loss [14, 15, 16]: Binary Cross Entropy, Categorical Cross Entropy, Categorical Hinge, Hinge, K-L Divergence, Squared Hinge
The use of this metric to determine the effectiveness of activation functions is flawed, as activation function effectiveness was found to be correlated with the network optimizer. For example, Adamax was most effective with ELU, whereas Adadelta was most effective with Softmax. Nonetheless, the most commonly chosen activation function was a linear function. It is theorized that this is due to its stability with many optimizers: it is chosen in the early stages of the NAS and slowly disregarded as other, more effective, activation functions are found. Similarly to the usage of the linear activation function, the number of filters chosen during the early stages of the NAS is primarily four. As the NAS progresses, the number of filters is predominantly chosen as either 32 or 64. The most effective activation order is full pre-activation (as shown in fig. 2).
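A sketch of this frequency-based analysis is shown below; the population format is assumed to be a list of hyperparameter dictionaries, as in the earlier sketches.

```python
from collections import Counter

def selection_frequencies(population, hyperparameter):
    """Relative frequency of each value of one hyperparameter among the surviving individuals."""
    counts = Counter(individual[hyperparameter] for individual in population)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}

# Example: how often each loss function survives across all completed searches.
# selection_frequencies(surviving_individuals, "loss")
```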
As mentioned, some optimizers were sequestered into separate searches, and thus the measure of their selection frequency is less useful. They are instead judged on the F1 scores of the networks they produce. Ftrl did not fail more frequently than the others; however, it did not continue to produce competitive results and was slowly eliminated from the population in the general NAS. The individual searches with Adamax, RMSprop, and Adam all produced networks competitive with those of the collective NAS, showing that the networks excluded for stability issues were unfairly removed from the population.
The most effective optimizers were Adadelta and SGD, both of which produced multiple configurations with results greater than or equal to the target micro-F1 score of 77.11. The most accurate configuration converged to a micro-F1 score of 77.25 and a macro-F1 score of 69.57 in 10 epochs. Both of these are accuracy improvements over the networks presented in [6], and the found network converged in 90 fewer epochs. The final network configuration is as follows: the optimizer was Adadelta, with rho of 0.8831 and epsilon of 3.4771e-06; the learning rate scheduler was polynomial decay with an initial learning rate of 0.8666, 10585 decay steps, an ending learning rate of 0.0035, a power of 0.8693, and without cycling; 128 filters; the softmax activation function; the binary cross entropy loss; and full pre-activation.
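The reported configuration maps directly onto standard TensorFlow components. The snippet below is a sketch of how it could be instantiated; it reconstructs only the optimizer, learning rate schedule, and loss, not the full modified ResNet-50 architecture.

```python
import tensorflow as tf

# Polynomial learning rate decay with the reported parameters.
lr_schedule = tf.keras.optimizers.schedules.PolynomialDecay(
    initial_learning_rate=0.8666,
    decay_steps=10585,
    end_learning_rate=0.0035,
    power=0.8693,
    cycle=False,
)

# Adadelta optimizer with the reported rho and epsilon.
optimizer = tf.keras.optimizers.Adadelta(
    learning_rate=lr_schedule,
    rho=0.8831,
    epsilon=3.4771e-06,
)

# Binary cross entropy loss for the multi-label classification over the 19 classes.
loss = tf.keras.losses.BinaryCrossentropy()

# model.compile(optimizer=optimizer, loss=loss)
```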
At a class level, the found network noticeably outperforms the networks used in [6] for certain classes. Namely, coastal wetlands; moors, heathland, and sclerophyllous vegetation; and agro-forestry areas were 9.60%, 5.05%, and 6.86% more accurate than the presented ResNet-50 implementation. However, the class "beaches, dunes, and sands" performed 17.4% worse. As the overall performance is very similar, the large differences in class accuracies are attributed to the inclusion of more spectral bands in the training data.
4. CONCLUSION
The recent increase in the availability of large RS datasets has allowed researchers to achieve unprecedented results in tasks such as land cover classification. However, this comes at the price of spending a considerable amount of time manually searching for optimal deep learning (DL) architectures and hyperparameters.
When utilized correctly, a NAS can determine network
hyperparameters which result in accelerated convergence
rates, improved accuracy, or both. However, if the search
space of a NAS contains hyperparameters of functions with
varying stability, some hyperparameters can be unfairly ex-
cluded. To avoid this, either careful selection of the search
space hyperparameters or multiple NASs are required. This
is exemplified by the various optimizers which required indi-
vidual NASs to produce competitive results.
It has been shown that an evolutionary NAS can increase
the accuracy and greatly increase the convergence rate of the
target network. The use of Adadelta, polynomial decay, Soft-
max, and binary cross entropy has been shown to slightly out-
perform the initial network.
Some of the classes showed greatly improved accuracy with the inclusion of more spectral bands. However, one category (beaches, dunes, and sands) was significantly less accurate than in previous results. Therefore, the inclusion of more bands should be decided based on the intended use case of the network after training. More research is required to determine the effectiveness of each band for the accurate classification of an image.
This result, and the result of any NAS, is limited by its search space. As the functions within the search space were selected based on how commonly they are used, there are many cutting-edge functions which were not included. To greatly outperform the given network, a combination of a more advanced network architecture and the selection of more useful loss functions is required.
Acknowledgments
This work was performed on the supercomputer ForHLR funded by the Ministry of Science, Research and the Arts Baden-Württemberg and by the Federal Ministry of Education and Research. This work is supported by the Helmholtz Association's Initiative and Networking Funds under the Helmholtz AI platform grant and the Helmholtz Analytics Framework collaboration.
5. REFERENCES
[1] Gencer Sumbul, Marcela Charfuelan, Begüm Demir,
and Volker Markl, “BigEarthNet: A Large-Scale Bench-
mark Archive For Remote Sensing Image Understand-
ing,” arXiv preprint 1902.06148, 2019.
[2] Xiao Xiang Zhu, Jingliang Hu, Chunping Qiu, Yilei Shi, Jian Kang, et al., “So2Sat LCZ42: A Benchmark Data Set for the Classification of Global Local Climate Zones [Software and Data Sets],” IEEE Geoscience and Remote Sensing Magazine, vol. 8, no. 3, pp. 76–89, 2020.
[3] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian
Sun, “Deep Residual Learning for Image Recogni-
tion,” in 2016 IEEE Conference on Computer Vision
and Pattern Recognition (CVPR). June 2016, pp. 770–
778, IEEE.
[4] Zachary Chase Lipton, Charles Elkan, and Balakrish-
nan Narayanaswamy, “Thresholding Classifiers to Max-
imize F1 Score,” 2014.
[5] M. Bossard, J. Feranec, and J. Otahel, “CORINE land cover technical guide – Addendum 2000,” Tech. Rep., European Environment Agency, 2000.
[6] Gencer Sumbul, Jian Kang, Tristan Kreuziger, Filipe
Marcelino, Hugo Costa, et al., “BigEarthNet Dataset
with A New Class-Nomenclature for Remote Sensing
Image Understanding,” 2020.
[7] “Scripts to Remove Cloudy and Snowy Patches.”
[8] Thomas Elsken, Jan Hendrik Metzen, and Frank Hutter,
“Neural Architecture Search: A Survey,” 2019.
[9] Shiliang Sun, Zehui Cao, Han Zhu, and Jing Zhao, “A Survey of Optimization Methods from a Machine Learning Perspective,” arXiv preprint 1906.06821, 2019.
[10] H. Brendan McMahan, “Analysis Techniques for Adaptive Online Learning,” arXiv preprint 1403.3465, 2014.
[11] Yanzhao Wu, Ling Liu, Juhyun Bae, et al., “Demystifying Learning Rate Policies for High Accuracy Training of Deep Neural Networks,” arXiv preprint 1908.06477, 2019.
[12] Chigozie Nwankpa, Winifred Ijomah, Anthony Gachagan, and Stephen Marshall, “Activation Functions: Comparison of Trends in Practice and Research for Deep Learning,” arXiv preprint 1811.03378, 2018.
[13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian
Sun, “Identity Mappings in Deep Residual Networks,”
arXiv preprint 1603.05027, 2016.
[14] S. Kullback and R. A. Leibler, “On Information and Sufficiency,” Ann. Math. Statist., vol. 22, no. 1, pp. 79–86, 03 1951.
[15] Qi Wang, Yue Ma, Kun Zhao, and Yingjie Tian, “A
Comprehensive Survey of Loss Functions in Machine
Learning,” Annals of Data Science, Apr 2020.
[16] Shruti Jadon, “A survey of loss functions for semantic segmentation,” 2020 IEEE Conference on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB), Oct 2020.
[17] Martín Abadi, Ashish Agarwal, Paul Barham, et al.,
“TensorFlow: Large-Scale Machine Learning on Het-
erogeneous Systems,” 2015, Software available from
tensorflow.org.
[18] Oskar Taubert, “Propulate,” https://github.com/oskar-taubert/propulate, 2020.
[19] Message Passing Forum, “MPI: A Message-Passing Interface Standard,” Tech. Rep., University of Tennessee, USA, 1994.
[20] Yuhsiang Mike Tsai, Terry Cojean, and Hartwig Anzt,
“Evaluating the Performance of NVIDIA’s A100 Am-
pere GPU for Sparse Linear Algebra Computations,”
2020.
Conference Paper
Deep residual networks have emerged as a family of extremely deep architectures showing compelling accuracy and nice convergence behaviors. In this paper, we analyze the propagation formulations behind the residual building blocks, which suggest that the forward and backward signals can be directly propagated from one block to any other block, when using identity mappings as the skip connections and after-addition activation. A series of ablation experiments support the importance of these identity mappings. This motivates us to propose a new residual unit, which makes training easier and improves generalization. We report improved results using a 1001-layer ResNet on CIFAR-10 (4.62 % error) and CIFAR-100, and a 200-layer ResNet on ImageNet. Code is available at: https:// github. com/ KaimingHe/ resnet-1k-layers.