Multi-Scale Convolutional SVM Networks for Multi-Class Classification Problems of Remote Sensing Images


Abstract

The classification of land-cover classes in remote sensing images supports a variety of interdisciplinary applications, such as the interpretation of natural and man-made processes on the Earth's surface. The Convolutional Support Vector Machine (CSVM) network was recently proposed as a binary classifier for the detection of objects in Unmanned Aerial Vehicle (UAV) images. The training phase of the CSVM is based on convolutional layers that learn the kernel weights via a set of linear Support Vector Machines (SVMs). This paper proposes the Multi-scale Convolutional Support Vector Machine (MCSVM) network, an ensemble of CSVM classifiers that process patches of different spatial sizes and can deal with multi-class classification problems. The experiments are carried out on the EuroSAT Sentinel-2 dataset and the results are compared to those obtained with recent transfer learning approaches based on pre-trained Convolutional Neural Networks (CNNs).
Gabriele Cavallaro¹, Yakoub Bazi², Farid Melgani³ and Morris Riedel¹,⁴
¹ Jülich Supercomputing Centre, Forschungszentrum Jülich, Germany
² Department of Computer Engineering, King Saud University, Saudi Arabia
³ Department of Information Engineering and Computer Science, University of Trento, Italy
⁴ School of Engineering and Natural Sciences, University of Iceland, Iceland
Index Terms: Multi-scale convolutional support vector machine (MCSVM) network, supervised feature generation, multiclass classification, Sentinel-2, remote sensing
1. Introduction

Passive optical sensors are a class of remote sensing instruments able to collect natural radiation from the Earth’s surface and convert it into imagery [1]. A wide variety of resolutions, ranging from panchromatic to hyperspectral images, is nowadays available to serve different thematic applications. Among all the possible products that can be derived from remote sensing images, classification products are among the most frequently utilized. Supervised classification algorithms can be used to distinguish between different types of land-cover classes (e.g., streets, houses, grass, etc.) in order to interpret processes, such as monitoring of urban growth, land cover mapping, road network extraction, impacts of natural disasters, crop monitoring, object detection, etc. [2].

The results of this research were achieved through the Human Brain Project PCP Pilot Systems at the Juelich Supercomputing Centre and the DEEP-EST project, which received co-funding from the European Union’s Horizon 2020 research and innovation programme under the Grant Agreement No. 604102 and No. 754304, respectively.
Recently, deep learning has brought revolutionary achievements in many applications, including the classification of remote sensing images [3]. State-of-the-art results have been achieved by Convolutional Neural Networks (CNNs) [4, 5], whose hierarchical structure can extract more hidden and deeper features than classic machine learning methods based on handcrafted features (i.e., shallow classifiers). The performance of a CNN classifier depends considerably on the amount of available training data (i.e., the larger the training set, the lower the chance that overfitting occurs). Despite the recent advances in Earth observation benchmark data creation (e.g., RSI-CB128 with 36,000 images and 45 annotated classes), the gap with the data size of computer vision datasets (e.g., ImageNet with 14,197,122 images) is still a key limiting factor for the development of effective deep network classifiers for remote sensing data. The scarcity of large, reliable and open-source annotated datasets is largely due to the inherent interpretation complexity of remote sensing data (e.g., RADAR images) and the time, effort and cost involved in the collection of training samples. One effective approach to handle small/medium datasets is transfer learning [6]. It consists of acquiring the knowledge from networks that were pre-trained on an auxiliary recognition task with a larger amount of labeled data (e.g., ResNet [7]) instead of performing the training from scratch.
Bazi et al. [8] proposed the Convolutional Support Vector Machine (CSVM) network for binary classification of datasets that have a limited number of annotated samples. The CSVM adopts a learning strategy based on SVMs [9] as an alternative to the standard backpropagation algorithm [5]. The novelty was the introduction of SVM convolutional layers that learn the kernel weights via a forward supervised learning strategy based on a set of linear SVMs. This paper introduces the Multi-Scale CSVM network for multiclass classification problems. It is an ensemble of CSVM networks, where each CSVM takes as input a set of batches of a defined spatial size. The experiments are conducted on the EuroSAT Sentinel-2 dataset, which poses the task of classifying 10 land cover classes. The comparison of the preliminary results with those obtained with pretrained CNNs confirms the effectiveness of the SVM convolutional layers for multi-class classification problems.

978-1-5386-9154-0/19/$31.00 ©2019 IEEE, IGARSS 2019

Fig. 1. Example of an MCSVM network that receives as input batches of two different spatial sizes. The architecture includes two SVM convolutional layers, two reduction layers, a concatenation and feature generation step, and a classification layer placed on top.
2.1. SVM Convolutional Layer
The MCSVM network (see Figure 1) is composed of $N$ CSVM networks that receive as input batches with different sizes $K = \{k_j\}_{j=1}^{N}$, with $k_j \in \mathbb{Z}$. Each CSVM network includes many SVM convolutional layers, which differ from the convolutional layers of standard CNNs. In the following sub-sections, the structure of the first SVM convolutional layer (i.e., $SVM_1^{(j,1)}, SVM_2^{(j,1)}, \ldots, SVM_{n^{(1)}}^{(j,1)}$) of the $j$th CSVM network is described. The generalization to the next layers is straightforward.
2.1.1. Formation of the Global Training Set
Let $\{I_i, y_i\}_{i=1}^{M}$ be the training set composed of $M$ multi-band images with $I_i \in \mathbb{R}^{r \times c \times b}$, where $r$, $c$ and $b$ refer to the number of rows, columns and bands of the image, respectively, while $y_i \in \mathbb{Z}$ denotes the class label. From each image $I_i$, a set of non-overlapping patches of size $k_j \times k_j \times b$ is extracted and reshaped as feature vectors $x_i$ of dimension $d_j = k_j \times k_j \times b$. The result is a global training set $T_j^{(1)} = \{x_i, y_i\}_{i=1}^{m^{(1)}}$ of size $m^{(1)}$.
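The formation of the global training set described in Section 2.1.1 can be sketched with NumPy as follows. This is a minimal illustration rather than the authors' implementation: the function names `extract_patches` and `build_global_training_set` are hypothetical, border pixels that do not fill a complete patch are assumed to be dropped, and each patch is assumed to inherit the label of its source image.

```python
import numpy as np

def extract_patches(image, k):
    """Split an (r, c, b) image into non-overlapping k x k x b patches.

    Returns an array of shape (num_patches, k * k * b). Border pixels that
    do not fill a complete patch are dropped (an assumption of this sketch).
    """
    r, c, b = image.shape
    rows, cols = r // k, c // k
    cropped = image[:rows * k, :cols * k, :]
    # (rows, k, cols, k, b) -> (rows, cols, k, k, b) -> (rows * cols, k * k * b)
    blocks = cropped.reshape(rows, k, cols, k, b).transpose(0, 2, 1, 3, 4)
    return blocks.reshape(rows * cols, k * k * b)

def build_global_training_set(images, labels, k):
    """Stack the patch vectors of all images; each patch inherits its image label."""
    X = [extract_patches(img, k) for img in images]
    y = [np.full(len(p), lab) for p, lab in zip(X, labels)]
    return np.concatenate(X), np.concatenate(y)

# Toy example: two 64 x 64 images with 3 bands, patch size k = 8.
rng = np.random.default_rng(0)
images = rng.random((2, 64, 64, 3))
labels = np.array([0, 1])
X, y = build_global_training_set(images, labels, k=8)
print(X.shape, y.shape)  # (128, 192) (128,)
```

With $k_j = 8$ and $b = 3$, each 64 × 64 image yields 64 patches of dimension $d_j = 8 \times 8 \times 3 = 192$.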
2.1.2. Training the set of SVM Filter Banks
A set of linear SVM filters is learned on distinct sub-training sets $T_{sub_j}^{(1)} = \{x_i, y_i\}_{i=1}^{l}$ by solving the unconstrained optimization problem described in [10] (i.e., with the L2-loss function). The $l$ samples are randomly extracted from the global training set $T_j^{(1)}$. After training, the SVM filters $\{w_{j,z}^{(1)}\}_{z=1}^{n^{(1)}}$ are computed. The term $w_{j,z}^{(1)} \in \mathbb{R}^{g \times d_j}$ refers to the $z$th SVM filter weight matrix, while $n^{(1)}$ is the number of filters. Each filter matrix includes the weights that are assigned to the features, with $g = n_{class} \times (n_{class} - 1)/2$ (i.e., the output attribute coef_ for multiclass cases with the linear kernel). The complete weights of this convolutional layer can be grouped into a filter bank $W^{(1)} \in \mathbb{R}^{g \times d_j \times n^{(1)}}$.
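To make the dimensions in Section 2.1.2 concrete, the following short sketch evaluates $g$ and $d_j$ for the configuration reported in the experimental section (10 classes, 7 filters per layer, 14 bands, $K = \{8, 10, 12\}$). The helper names are hypothetical, and $g$ is read here as the number of class pairs, which is what the expression $n_{class}(n_{class}-1)/2$ counts.

```python
def pairwise_classifier_count(n_class):
    """Number of class pairs, g = n_class * (n_class - 1) / 2."""
    return n_class * (n_class - 1) // 2

def patch_dimension(k, b):
    """Length d_j of a flattened k x k x b patch."""
    return k * k * b

n_class = 10      # EuroSAT land-cover classes
n_filters = 7     # SVM filters per convolutional layer, n^(1)
bands = 14        # 13 Sentinel-2 bands plus NDVI

g = pairwise_classifier_count(n_class)
for k in (8, 10, 12):
    d = patch_dimension(k, bands)
    print(f"k={k}: filter matrix {g} x {d}, filter bank {g} x {d} x {n_filters}")
```

For 10 classes this gives $g = 45$, and the patch dimensions are 896, 1400 and 2016 for the three scales.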
2.1.3. Generation of the Convolutional Feature Maps
Table 1. Classification results of different pretrained CNNs and the proposed MCSVM for the EuroSAT dataset. The experiments of the first four result columns use the three features Red, Green and Blue, while the last column includes 14 features (i.e., the original 13 bands of Sentinel-2 concatenated with the NDVI). For each configuration, the average and the standard deviation (in brackets) of several metrics are reported (the values result from 5 different generated training sets).

| Metrics | DenseNet169 (RGB) | ResNet-50 (RGB) | VGG16 (RGB) | MCSVM (RGB) | MCSVM (14 features: Sentinel-2 + NDVI) |
| OA [%] | 77.52 (3.77) | 90.65 (0.56) | 92.85 (0.26) | 76.28 (0.96) | 93.60 (0.37) |
| AA [%] | 82.04 (1.49) | 90.45 (0.55) | 92.59 (0.28) | 77.51 (0.63) | 93.43 (0.34) |
| Kappa [x100] | 75.01 (4.19) | 89.59 (0.63) | 92.04 (0.29) | 73.56 (1.07) | 92.88 (0.41) |
| Train+Test time [s] | 1043.17 (129.69) | 405.12 (4.35) | 277.20 (5.84) | 2658.40 (8.08) | 928.39 (11.05) |

Each training image $\{I_i, y_i\}_{i=1}^{M}$ is convolved with the SVM filters to generate a set of 3-D hyper-feature maps $\{H_{j,i}^{(1)}\}_{i=1}^{M}$. Here, $H_{j,i}^{(1)} \in \mathbb{R}^{r^{(1)} \times c^{(1)} \times b^{(1)}}$ is the new feature representation of image $I_i$ composed of $n^{(1)}$ feature maps. To obtain the $z$th feature map $h_{z,j,i}^{(1)}$, the $z$th SVM filter is convolved with a set of sliding windows of size $k_j \times k_j \times b$ (with a predefined stride) over the training image $I_i$:

$$h_{z,j,i}^{(1)} = f\left(I_i * w_{j,z}^{(1)}\right), \quad z = 1, \ldots, n^{(1)} \qquad (1)$$

where $*$ is the convolution operator and $f$ is the ReLU activation function. The spatial size of the feature maps $\{H_{j,i}^{(1)}\}_{i=1}^{M}$ is then reduced via a pooling operation (i.e., either max or mean), as is done in a pooling layer of a CNN.
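Equation (1) followed by the reduction step can be illustrated with a small NumPy sketch. Several simplifications are assumed here and are not taken from the paper: a single weight vector of length $k \times k \times b$ stands in for one row of the filter matrix, no bias term and no padding are used, and the sliding-window product is computed without flipping the kernel, as is common in CNN implementations. The function names `svm_convolve` and `max_pool` are hypothetical.

```python
import numpy as np

def svm_convolve(image, w, k, stride):
    """Apply one linear filter w (length k*k*b) to every k x k x b window, then ReLU."""
    r, c, _ = image.shape
    out_r = (r - k) // stride + 1
    out_c = (c - k) // stride + 1
    fmap = np.empty((out_r, out_c))
    for u in range(out_r):
        for v in range(out_c):
            window = image[u * stride:u * stride + k, v * stride:v * stride + k, :]
            fmap[u, v] = window.reshape(-1) @ w
    return np.maximum(fmap, 0.0)  # ReLU activation f

def max_pool(fmap, window, stride):
    """Reduce a 2-D feature map by taking the maximum over sliding windows."""
    out_r = (fmap.shape[0] - window) // stride + 1
    out_c = (fmap.shape[1] - window) // stride + 1
    pooled = np.empty((out_r, out_c))
    for u in range(out_r):
        for v in range(out_c):
            pooled[u, v] = fmap[u * stride:u * stride + window,
                                v * stride:v * stride + window].max()
    return pooled

# Toy example: a 16 x 16 image with 3 bands and a random stand-in filter.
rng = np.random.default_rng(0)
image = rng.random((16, 16, 3))
w = rng.standard_normal(4 * 4 * 3)
h = svm_convolve(image, w, k=4, stride=2)
pooled = max_pool(h, window=3, stride=2)
print(h.shape, pooled.shape)  # (7, 7) (3, 3)
```

In a trained network, `w` would come from a learned SVM filter rather than from a random generator.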
2.2. Generation of the Convolutional Feature Maps
At the last SVM computing layer $L$ (convolution or reduction, depending on the architecture) of each CSVM, the hyper-feature maps $\{H_{j,i}^{(L)}, y_i\}_{i=1}^{M}$ are output. These maps are then concatenated: $\{H_{1,i}^{(L)}, H_{2,i}^{(L)}, \ldots, H_{N,i}^{(L)}, y_i\}_{i=1}^{M}$. Subsequently, the feature generation step takes as input each concatenated hyper-feature map $\{H_{1,i}^{(L)}, H_{2,i}^{(L)}, \ldots, H_{N,i}^{(L)}\}$ for the training image $I_i$ and computes the mean or max value for each feature map. A nonlinear SVM classifier with the Radial Basis Function (RBF) kernel is finally trained over these features.
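The feature generation step can be sketched as a global reduction over the spatial axes of each hyper-feature map, followed by concatenation across the $N$ scales. The function name `generate_features` and the array layout (rows, columns, maps) are assumptions made for this illustration.

```python
import numpy as np

def generate_features(hyper_maps, reduce="mean"):
    """Collapse each feature map to one scalar and concatenate across the N scales.

    hyper_maps: list of N arrays, one per CSVM, each of shape (rows, cols, n_maps).
    """
    op = np.mean if reduce == "mean" else np.max
    return np.concatenate([op(H, axis=(0, 1)) for H in hyper_maps])

# Toy example: three scales, each producing 7 feature maps of size 5 x 5.
rng = np.random.default_rng(0)
hyper_maps = [rng.random((5, 5, 7)) for _ in range(3)]
features_mean = generate_features(hyper_maps, reduce="mean")
features_max = generate_features(hyper_maps, reduce="max")
print(features_mean.shape, features_max.shape)  # (21,) (21,)
```

One such vector per image forms the input to the RBF-kernel SVM classifier of the final layer.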
3.1. Dataset
The experiments have been carried out on the open-source EuroSAT dataset [11]. It includes two subsets of Sentinel-2 satellite image patches (i.e., with a spatial size of 64 × 64 pixels): one with the RGB colors and the other with all 13 original spectral bands acquired by the Sentinel-2 multispectral sensor. Both subsets consist of 10 classes with 27,000 labeled images in total. The Normalised Difference Vegetation Index (NDVI) is also considered as an additional feature for the experiments of the MCSVM network.
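The construction of the 14-feature input can be illustrated as follows, using the usual definition NDVI = (NIR − Red) / (NIR + Red). The band positions are assumptions of this sketch (red band B04 at index 3 and near-infrared band B08 at index 7 of the 13-band array) and should be checked against the actual band order of the data files; the small constant `eps` is added only to avoid division by zero.

```python
import numpy as np

RED_INDEX = 3   # assumed position of band B04 (red)
NIR_INDEX = 7   # assumed position of band B08 (near infrared)

def append_ndvi(image, eps=1e-8):
    """Return the (r, c, 13) image with NDVI appended as a 14th band."""
    image = image.astype(np.float64)
    red = image[..., RED_INDEX]
    nir = image[..., NIR_INDEX]
    ndvi = (nir - red) / (nir + red + eps)
    return np.concatenate([image, ndvi[..., None]], axis=-1)

# Toy example: a random 64 x 64 patch with 13 non-negative bands.
rng = np.random.default_rng(0)
patch = rng.random((64, 64, 13))
patch14 = append_ndvi(patch)
print(patch14.shape)  # (64, 64, 14)
```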
3.2. Experimental Setup
Preliminary results are presented for a single-layer MCSVM network architecture. The EuroSAT dataset is split into a training and a test set at a ratio of 80:20. The MCSVM ensembles three CSVM networks, which receive as input batches with the following dimensions: $K = \{8, 10, 12\}$. Each CSVM has only one SVM convolutional layer, which includes 7 linear SVM classifiers. The 2D convolutional operations are performed with the stride set to 5 and combined with a reduction layer (max pooling) with window shape and stride set to 3 and 2, respectively. Each SVM filter is trained with $l = 1000$ patches (100 for each class) that are randomly extracted. The penalty parameter $C$ of each SVM is estimated via a 5-fold cross-validation procedure in the range $[10^{-1}, 10^{2}]$.
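As a worked example of this configuration, the spatial size of the feature maps at each scale can be computed from the image size, the window size and the stride. The sketch assumes that neither the convolution nor the pooling uses padding, which the text does not state; the helper name `output_size` is hypothetical.

```python
def output_size(size, window, stride):
    """Side length after a sliding-window operation without padding."""
    return (size - window) // stride + 1

IMAGE_SIZE = 64                     # EuroSAT patches are 64 x 64 pixels
CONV_STRIDE = 5
POOL_WINDOW, POOL_STRIDE = 3, 2

sizes = {}
for k in (8, 10, 12):
    conv = output_size(IMAGE_SIZE, k, CONV_STRIDE)
    pooled = output_size(conv, POOL_WINDOW, POOL_STRIDE)
    sizes[k] = (conv, pooled)
    print(f"k={k}: convolution {conv} x {conv}, after max pooling {pooled} x {pooled}")
```

Under this assumption the convolution outputs are 12 × 12, 11 × 11 and 11 × 11 for the three scales, and all of them reduce to 5 × 5 after max pooling.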
The experiments are performed on the JURON supercomputer system at the Jülich Supercomputing Centre. The CSVM network is implemented in Python 3.6.1 with the following software packages and versions: TensorFlow 1.7.0 (GPU) for the convolution operations, ThunderSVM (GPU) for the SVM algorithm and h5py 2.8.0 for interfacing with the Hierarchical Data Format (HDF5) files.
3.3. Evaluation
The performance of the MCSVM is evaluated in terms of the standard metrics Overall Accuracy (OA), Average Accuracy (AA) and Cohen’s Kappa coefficient (Kappa), as reported in Table 1. To rate the results, three popular pretrained CNNs are run with the same input setting: DenseNet169 [12], ResNet-50 [7] and VGG-16 [13]. For each pretrained network, the weights resulting from training on the ImageNet dataset are frozen for all the layers (i.e., not trainable). Two additional layers are added on top (a fully connected layer followed by a softmax layer with 10 outputs) and trained for 20 epochs. The classification results that the MCSVM obtains with the RGB dataset are not satisfactory, especially when compared with those of the VGG16 network (76.28% versus 92.85% OA). However, with the 14-band dataset (i.e., 13 bands of Sentinel-2 and the NDVI), the MCSVM achieves competitive classification results (93.60% OA). The pretrained networks are not evaluated with this dataset since their weights have been trained with 3-channel images and their original architectures would have to be modified. Table 1 also reports the processing times (i.e., training plus test). The training time of the pretrained networks covers only the 20 epochs spent learning the two additional top layers.
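For reference, the three accuracy metrics can be computed from a confusion matrix as in the sketch below. It uses the usual definitions (OA as the fraction of correctly classified samples, AA as the mean of the per-class accuracies, and Cohen's kappa as agreement corrected for chance); the paper does not spell out its formulas, so this is an assumption, and the confusion matrix is a made-up toy example.

```python
import numpy as np

def classification_metrics(confusion):
    """OA, AA and Cohen's kappa from a confusion matrix (rows: true, columns: predicted)."""
    confusion = np.asarray(confusion, dtype=np.float64)
    total = confusion.sum()
    oa = np.trace(confusion) / total
    aa = np.mean(np.diag(confusion) / confusion.sum(axis=1))
    # Agreement expected by chance from the row and column marginals.
    expected = (confusion.sum(axis=0) * confusion.sum(axis=1)).sum() / total ** 2
    kappa = (oa - expected) / (1.0 - expected)
    return oa, aa, kappa

# Toy 3-class confusion matrix with 50 samples per class.
cm = [[50, 0, 0],
      [5, 40, 5],
      [0, 10, 40]]
oa, aa, kappa = classification_metrics(cm)
print(f"OA={oa:.4f} AA={aa:.4f} Kappa={kappa:.4f}")  # OA=0.8667 AA=0.8667 Kappa=0.8000
```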
The good performance of the MCSVM network is especially encouraging when considered within the whole satellite image data framework. On the one hand, deep learning has proved capable of outperforming traditional machine learning classifiers. On the other hand, the majority of the proposed deep networks operate with the limited RGB information. Given that many Earth observation programmes invest considerable effort and resources in designing sensors with higher resolutions, new methods have to be developed in order to avoid their underutilization.
4. Conclusions

This paper proposed a novel MCSVM network for multi-class classification problems of remote sensing images. Its architecture can include several CSVM networks that take as input batches with different spatial sizes. With this configuration, the classifier can exploit both the spatial and the spectral information and provide competitive classification results compared to recent solutions based on knowledge transfer from pretrained CNNs.

As future work, it is intended to investigate in more detail the effect of the different tunable parameters and of additional layers on the performance of the network. Furthermore, thanks to the availability of distributed-GPU systems at the Jülich Supercomputing Centre, it is planned to scale up the training phase by distributing the compute load (SVM training and convolution operations) over different GPUs.
[1] W. G. Rees, Physical Principles of Remote Sensing, 3rd ed. Cambridge University Press, 2013, vol. 37.
[2] M. J. Canty, Image Analysis, Classification and Change Detection in Remote Sensing: With Algorithms for ENVI/IDL and Python, Third Edition. Taylor & Francis, 2014.
[3] J. E. Ball, D. T. Anderson, and C. S. Chan, “A Comprehensive Survey of Deep Learning in Remote Sensing: Theories, Tools and Challenges for the Community,” SPIE Journal of Applied Remote Sensing (JARS), Special Section on Feature and Deep Learning in Remote Sensing Applications, Sep. 2017. [Online].
[4] A. Romero, C. Gatta, and G. Camps-Valls, “Unsupervised Deep Feature Extraction for Remote Sensing Image Classification,” IEEE Transactions on Geoscience and Remote Sensing, vol. 54, no. 3, pp. 1–14, 2015.
[5] L. Zhang, L. Zhang, and B. Du, “Deep Learning for Remote Sensing Data: A Technical Tutorial on the State of the Art,” IEEE Geoscience and Remote Sensing Magazine, vol. 4, no. 2, pp. 22–40, Jun. 2016.
[6] S. J. Pan and Q. Yang, “A Survey on Transfer Learning,” IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 10, pp. 1345–1359, 2010.
[7] K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image Recognition,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[8] Y. Bazi and F. Melgani, “Convolutional SVM Networks for Object Detection in UAV Imagery,” IEEE Transactions on Geoscience and Remote Sensing, vol. 56, no. 6, pp. 3107–3118, 2018.
[9] C. Cortes and V. Vapnik, “Support-vector Networks,” Machine Learning, vol. 20, no. 3, pp. 273–297, 1995.
[10] K.-W. Chang, C.-J. Hsieh, and C.-J. Lin, “Coordinate Descent Method for Large-scale L2-loss Linear Support Vector Machines,” Journal of Machine Learning Research, 2008.
[11] P. Helber, B. Bischke, A. Dengel, and D. Borth, “EuroSAT: A Novel Dataset and Deep Learning Benchmark for Land Use and Land Cover Classification,” Computing Research Repository - arXiv, vol. abs/1709.0, Aug. 2017. [Online]. Available:
[12] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger, “Densely Connected Convolutional Networks,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2261–2269, 2017.
[13] K. Simonyan and A. Zisserman, “Very Deep Convolutional Networks for Large-Scale Image Recognition,” Sep. 2014. [Online]. Available:
... In recent time, deep learning has brought revolutionary achievements in many applications, including the classification of remote sensing images. The state-of-theart results have been achieved by the Convolutional Neural Networks (CNNs) due to their structure being able to extract more hidden and deeper features than classic machine learning methods [38]. This concept lies at the basis of many deep learning algorithms: models (networks) composed of many layers that transform input data (e.g., images) to outputs (e.g., disease present/absent) while learning increasingly high-level features [39]. ...
Full-text available
Deep learning approaches are applied for a wide variety of problems, they are being used in the remote sensing field of study and showed high performance. Recent studies have demonstrated the efficiency of using spectral indexes in classification problems, because of accuracy and F1 score increasing in comparison with the usage of only RGB channels. The paper studies the problem of classification satellite images on the EuroSAT dataset using the proposed convolutional neural network. In the research set of the most used spectral indexes have been selected and calculated on the EuroSAT dataset. Then, a novel comparative analysis of spectral indexes was carried out. It has been established that the most significant set of indexes (NDVI, NDWI, GNDVI) increased classification accuracy from 64.72% to 84.19% and F1 score from 63.89% to 84.05%. The biggest improvement was obtained for River, Highway and PermanentCrop classes.
... In this subsection, an adaptive SCNet is applied on these two specific features for feature fusion and classification. The importance levels of different sensor features are usually not considered by the simply fusion strategies (e.g., stacked concatenated operation [50][51][52][53] and averaged operation [9]). For further considering the channel-dependencies and sensorimportance, an attention-based selective kernel unit is designed for multi-sensor feature fusion, which is shown in Figure 2c. ...
Full-text available
Multi-sensor image can provide supplementary information, usually leading to better performance in classification tasks. However, the general deep neural network-based multi-sensor classification method learns each sensor image separately, followed by a stacked concentrate for feature fusion. This way requires a large time cost for network training, and insufficient feature fusion may cause. Considering efficient multi-sensor feature extraction and fusion with a lightweight network, this paper proposes an attention-guided classification method (AGCNet), especially for multispectral (MS) and panchromatic (PAN) image classification. In the proposed method, a share-split network (SSNet) including a shared branch and multiple split branches performs feature extraction for each sensor image, where the shared branch learns basis features of MS and PAN images with fewer learn-able parameters, and the split branch extracts the privileged features of each sensor image via multiple task-specific attention units. Furthermore, a selective classification network (SCNet) with a selective kernel unit is used for adaptive feature fusion. The proposed AGCNet can be trained by an end-to-end fashion without manual intervention. The experimental results are reported on four MS and PAN datasets, and compared with state-of-the-art methods. The classification maps and accuracies show the superiority of the proposed AGCNet model.
... In Ref. 24, OA of 93.2% with a semi-supervised generative adversarial networks was achieved. In Ref. 25, OA of 93.6% with a multi-scale convolutional support vector machine network was reported. In Ref. 26, a 96.6% with an efficient convolutional neural networks for multi-spectral image classification was achieved. ...
The use of remote sensing data has become very useful to generate statistical information about society and its environment. In this sense, land use and land cover classification (LULC) are tasks related to determining the cover on the Earth’s surface. In the decision-making process, this kind of information is relevant to handle in the best way how the information about events such as earthquakes or cadastre information can be used. To attend this, a methodology to perform supervised classification based on the combination of statistical, textural, and shape features for the LULC classification problem is proposed. For the experiments, thousands of Landsat and Sentinel images covering all of Mexico and Europe, respectively, were used. Twelve LULC classes were established for Mexico territory, and 10 were established for Europe. This methodology was applied to both datasets and benchmarking with multiple well-known classifiers (random forest, support vector machines, extra trees, and artificial neural network). As a result, the overall accuracy (OA) for Landsat and Sentinel-2 was reached 77.1% and 96.7%, respectively. The performance of the different types of features was compared using Mexico and Europe images. Interesting results were achieved and some conclusions about using traditional and non-traditional features were found in the LULC classification task.
Sentinel-1 C-band radar backscatter satellite images provide a repeating sequence of fine-resolution (10-m) observations that can be used for a number of applications, but the 12-day interval between satellite observations is too infrequent for many applications, such as measuring moisture dynamics. For a variety of applications, moisture information is demanded at high temporal frequency and fine spatial resolution over large areas. Machine learning approaches have been used to predict higher spatial resolutions than the original satellite images, but little effort has been made to increase the temporal resolution of acquired backscatter images. This study extends machine learning approaches to infer fine-resolution backscatter between observations relying on auxiliary data observations, including elevation and daily gridded weather. Several variations of Multi-modal Fully Convolutional Neural Network architectures, problem setup, and training methods are explored for a predominantly rural area in southwest Oklahoma near the transition between humid subtropical and semiarid climates. The training area lies in the overlap zone for adjacent Sentinel-1 satellite tracks, allowing for training with several different temporal offsets. We find that the UNET architecture produced the most accurate and robust estimated backscatter patterns, with superior prediction compared to a prior observation baseline in nearly all cases investigated when geography was included in the training data. This superior performance also generalized to nearby areas when training data for a given geography was not available, where 86% of predictions performed superior compared to a prior observation baseline.
In the general case, the performance of deep learning-based classification models depends on the ability of capturing features. When a sample has various appearances, the increased features may lower the performance of these models. In this case, training more models on different appearances can be a choice to improve the accuracy. In this paper, we built a new framework that generates a network of models to improve the accuracy. First, our framework built a strategy to increase the number of models to well capture the increased features. We then utilise our recursive Bayesian methods on the selected outputs of trained models, which is to reduce the similarity among these outputs for higher accuracy. The experimental results show that our framework can be a good choice to improve the performance of deep learning applications.
Full-text available
Recent work has shown that convolutional networks can be substantially deeper, more accurate and efficient to train if they contain shorter connections between layers close to the input and those close to the output. In this paper we embrace this observation and introduce the Dense Convolutional Network (DenseNet), where each layer is directly connected to every other layer in a feed-forward fashion. Whereas traditional convolutional networks with L layers have L connections, one between each layer and its subsequent layer (treating the input as layer 0), our network has L(L+1)/2 direct connections. For each layer, the feature maps of all preceding layers are treated as separate inputs whereas its own feature maps are passed on as inputs to all subsequent layers. Our proposed connectivity pattern has several compelling advantages: it alleviates the vanishing gradient problem and strengthens feature propagation; despite the increase in connections, it encourages feature reuse and leads to a substantial reduction of parameters; its models tend to generalize surprisingly well. We evaluate our proposed architecture on five highly competitive object recognition benchmark tasks. The DenseNet obtains significant improvements over the state-of-the-art on all five of them (e.g., yielding 3.74% test error on CIFAR-10, 19.25% on CIFAR-100 and 1.59% on SVHN).
Full-text available
In this paper, we address the challenge of land use and land cover classification using Sentinel-2 satellite images. The Sentinel-2 satellite images are openly and freely accessible provided in the Earth observation program Copernicus. We present a novel dataset based on Sentinel-2 satellite images covering 13 spectral bands and consisting out of 10 classes with in total 27,000 labeled and geo-referenced images. We provide benchmarks for this novel dataset with its spectral bands using state-of-the-art deep Convolutional Neural Network (CNNs). With the proposed novel dataset, we achieved an overall classification accuracy of 98.57%. The resulting classification system opens a gate towards a number of Earth observation applications. We demonstrate how this classification system can be used for detecting land use and land cover changes and how it can assist in improving geographical maps. The geo-referenced dataset EuroSAT is made publicly available at
Conference Paper
Full-text available
Recent work has shown that convolutional networks can be substantially deeper, more accurate, and efficient to train if they contain shorter connections between layers close to the input and those close to the output. In this paper, we embrace this observation and introduce the Dense Convolutional Network (DenseNet), which connects each layer to every other layer in a feed-forward fashion. Whereas traditional convolutional networks with L layers have L connections - one between each layer and its subsequent layer - our network has L(L+1)/2 direct connections. For each layer, the feature-maps of all preceding layers are used as inputs, and its own feature-maps are used as inputs into all subsequent layers. DenseNets have several compelling advantages: they alleviate the vanishing-gradient problem, strengthen feature propagation, encourage feature reuse, and substantially reduce the number of parameters. We evaluate our proposed architecture on four highly competitive object recognition benchmark tasks (CIFAR-10, CIFAR-100, SVHN, and ImageNet). DenseNets obtain significant improvements over the state-of-the-art on most of them, whilst requiring less memory and computation to achieve high performance. Code and models are available at
Full-text available
This paper introduces the use of single layer and deep convolutional networks for remote sensing data analysis. Direct application to multi- and hyper-spectral imagery of supervised (shallow or deep) convolutional networks is very challenging given the high input data dimensionality and the relatively small amount of available labeled data. Therefore, we propose the use of greedy layer-wise unsupervised pre-training coupled with a highly efficient algorithm for unsupervised learning of sparse features. The algorithm is rooted on sparse representations and enforces both population and lifetime sparsity of the extracted features, simultaneously. We successfully illustrate the expressive power of the extracted representations in several scenarios: classification of aerial scenes, as well as land-use classification in very high resolution (VHR), or land-cover classification from multi- and hyper-spectral images. The proposed algorithm clearly outperforms standard Principal Component Analysis (PCA) and its kernel counterpart (kPCA), as well as current state-of-the-art algorithms of aerial classification, while being extremely computationally efficient at learning representations of data. Results show that single layer convolutional networks can extract powerful discriminative features only when the receptive field accounts for neighboring pixels, and are preferred when the classification requires high resolution and detailed results. However, deep architectures significantly outperform single layers variants, capturing increasing levels of abstraction and complexity throughout the feature hierarchy.
Nowadays, unmanned aerial vehicles (UAVs) are viewed as effective acquisition platforms for several civilian applications. They can acquire images with an extremely high level of spatial detail compared to standard remote sensing platforms. However, these images are highly affected by illumination, rotation, and scale changes, which further increases the complexity of analysis compared to those obtained using standard remote sensing platforms. In this paper, we introduce a novel convolutional support vector machine (CSVM) network for the analysis of this type of imagery. Basically, the CSVM network is based on several alternating convolutional and reduction layers ended by a linear SVM classification layer. The convolutional layers in CSVM rely on a set of linear SVMs as filter banks for feature map generation. During the learning phase, the weights of the SVM filters are computed through a forward supervised learning strategy unlike the backpropagation algorithm widely used in standard convolutional neural networks (CNNs). This makes the proposed CSVM particularly suitable for detecting problems characterized by very limited training sample availability. The experiments carried out on two UAV data sets related to vehicles and solar-panel detection issues, with a 2-cm resolution, confirm the promising capability of the proposed CSVM network compared to recent state-of-the-art solutions based on pretrained CNNs.
In recent years, Deep Learning (DL), a re-branding of neural networks (NNs), has risen to the top in numerous areas, namely Computer Vision (CV), speech recognition, natural language processing, etc. Whereas remote sensing (RS) possesses a number of unique challenges, primarily related to sensors and applications, inevitably RS draws from many of the same theories as CV; e.g., statistics, fusion, and machine learning, to name a few. This means that the RS community should not only be aware of advancements like DL, but also be leading researchers in this area. Herein, we provide the most comprehensive survey of state-of-the-art RS DL research. We also review recent new developments in the DL field that can be used in DL for RS. Namely, we focus on theories, tools and challenges for the RS community. Specifically, we focus on unsolved challenges and opportunities as they relate to (i) inadequate data sets, (ii) human-understandable solutions for modelling physical phenomena, (iii) Big Data, (iv) non-traditional heterogeneous data sources, (v) DL architectures and learning algorithms for spectral, spatial and temporal data, (vi) transfer learning, (vii) an improved theoretical understanding of DL systems, (viii) high barriers to entry, and (ix) training and optimizing the DL.
The support-vector network is a new learning machine for two-group classification problems. The machine conceptually implements the following idea: input vectors are non-linearly mapped to a very high-dimension feature space. In this feature space a linear decision surface is constructed. Special properties of the decision surface ensure high generalization ability of the learning machine. The idea behind the support-vector network was previously implemented for the restricted case where the training data can be separated without errors. We here extend this result to non-separable training data. High generalization ability of support-vector networks utilizing polynomial input transformations is demonstrated. We also compare the performance of the support-vector network to various classical learning algorithms that all took part in a benchmark study of Optical Character Recognition.
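The idea of mapping inputs non-linearly into a high-dimensional feature space can be made concrete with a degree-2 polynomial kernel. The sketch below is a minimal illustration, not the paper's implementation: it checks that the kernel value (x·z + 1)^2 for 2-D inputs equals an ordinary dot product after an explicit 6-dimensional feature map. The helper names `poly_kernel` and `explicit_map` are hypothetical.

```python
import math


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def poly_kernel(x, z):
    """Degree-2 polynomial kernel: (x . z + 1)^2."""
    return (dot(x, z) + 1.0) ** 2


def explicit_map(x):
    """Explicit feature map for 2-D input whose dot product equals poly_kernel."""
    x1, x2 = x
    r2 = math.sqrt(2.0)
    return [1.0, r2 * x1, r2 * x2, x1 * x1, x2 * x2, r2 * x1 * x2]


x = [1.0, 2.0]
z = [3.0, -1.0]
kernel_value = poly_kernel(x, z)                     # computed in input space
mapped_value = dot(explicit_map(x), explicit_map(z))  # computed in feature space
```

Because the two values agree, a linear decision surface in the 6-dimensional space can be evaluated using only kernel computations in the original 2-dimensional space.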
Deep-learning (DL) algorithms, which learn the representative and discriminative features in a hierarchical manner from the data, have recently become a hotspot in the machine-learning area and have been introduced into the geoscience and remote sensing (RS) community for RS big data analysis. Considering the low-level features (e.g., spectral and texture) as the bottom level, the output feature representation from the top level of the network can be directly fed into a subsequent classifier for pixel-based classification. As a matter of fact, by carefully addressing the practical demands in RS applications and designing the input-output levels of the whole network, we have found that DL is actually everywhere in RS data analysis: from the traditional topics of image preprocessing, pixel-based classification, and target recognition, to the recent challenging tasks of high-level semantic feature extraction and RS scene understanding.
Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers, 8x deeper than VGG nets but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers. The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions, where we also won 1st place on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.
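The reformulation of layers as residual functions can be sketched in a few lines: instead of learning a mapping H(x) directly, the block learns F(x) and outputs F(x) + x. The example below is a minimal fully connected version under stated assumptions (two weight matrices `W1` and `W2`, no biases, ReLU activations, equal input and output width); it is not the convolutional architecture evaluated in the paper, and all names are hypothetical.

```python
def relu(v):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, a) for a in v]


def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * a for w, a in zip(row, v)) for row in W]


def residual_block(x, W1, W2):
    """Compute relu(F(x) + x) with F(x) = W2 * relu(W1 * x).

    The '+ x' term is the identity shortcut: the weights only have to
    model the residual F(x), not the full mapping.
    """
    f = matvec(W2, relu(matvec(W1, x)))
    return relu([fi + xi for fi, xi in zip(f, x)])


# With all-zero weights the residual F(x) vanishes, so the block
# passes a non-negative input through unchanged.
zeros = [[0.0, 0.0], [0.0, 0.0]]
identity_out = residual_block([1.0, 2.0], zeros, zeros)
```

The zero-weight case shows why very deep stacks of such blocks are easier to optimize: a block can fall back to the identity mapping simply by driving its weights toward zero.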