All content in this area was uploaded by Morris Riedel on Dec 08, 2019
SMART DATA ANALYTICS METHODS FOR REMOTE SENSING APPLICATIONS
Gabriele Cavallaro (a), Morris Riedel (a,b), Jon Atli Benediktsson (a), Markus Goetz (a,b),
Tomas Runarsson (a), Kristjan Jonasson (a), Thomas Lippert (b)
(a) Faculty of Electrical and Computer Engineering, University of Iceland, Reykjavik, Iceland
(b) Jülich Supercomputing Center, Forschungszentrum, Jülich, Germany
ABSTRACT
Big data analytics has emerged as an approach that can be interpreted
as extracting information from large quantities of scientific data in
a systematic way. To make this term more concrete, we refer to its
refinement as smart data analytics: examining large quantities of
scientific data to uncover hidden patterns and unknown correlations,
or to extract information in cases where there is no exact formula
(e.g. known physical laws). Our concrete big data problem is the
classification of land cover types in image-based datasets created
using remote sensing technologies, where the resolution can be high
(i.e. large volumes) and the images come in various types, such as
panchromatic images or different bands like red, green, blue, and
near infrared (i.e. large variety). We investigate smart data
analytics methods that take advantage of machine learning algorithms
(here, support vector machines) and state-of-the-art parallelization
approaches in order to overcome the limitations of big data
processing with non-scalable serial approaches.
Index Terms: Data Analytics, Support Vector Machines, Parallel Computing, Remote Sensing, Classification
1. INTRODUCTION
Besides the traditional sources and collection methods of data, with
all their limitations, satellite remote sensing [1] remains one of
the largest sources of data collections. Remote sensing takes
advantage of satellite and airborne sensors to observe, measure, and
record the radiation reflected or emitted by the Earth and its
environment. It can significantly enhance the information available
from traditional data sources (e.g., by providing synoptic views of
large portions of the Earth), which can be used for subsequent data
processing. The big data problem arises from the rapid improvement of
remote sensing capabilities, such as the availability of remotely
sensed images with very high geometrical resolution (QuickBird,
0.6 m). In addition, the available spectral detail is constantly
increasing (AVIRIS, 224 spectral channels), and the amount of data
grows continuously as images become more numerous, precise, and
frequent, but also more complex.
In this context, remote sensing makes use of several analysis
methods, such as image processing, automatic classification,
multitemporal processing and data fusion, in order to handle
different real applications. The availability of the data raises a
demand for smart data analytics techniques such as image
classification, which is one of the most significant application
areas of remote sensing but faces serious limitations when performed
with traditional serial tools (e.g. R, Matlab, etc.). Classification
aims to assign every pixel in a digital image to a meaningful feature
or land cover class in the scene. In order to obtain a satisfactory
level of detection accuracy, we perform a detailed physical analysis
that exploits the availability of high spatial resolution images. As
a further optimization technique, we consider attribute filters,
flexible operators that can transform an image according to many
different attributes (e.g., geometrical, textural and spectral).
This paper offers one solution to the scientific case described above
in the field of remote sensing applications by applying smart data
analytics methods to one specific big dataset. We provide a scalable
analytics solution for image classification that takes advantage of
one of the most successful classification methods, support vector
machines (SVMs) [2]. To overcome the limitations of the wide variety
of traditional serial SVM data analysis tools, we survey and apply
existing open source SVM tools for big data analytics that take
advantage of parallelization techniques. The contribution of this
paper is thus the design of a parallel smart data analytics method
tailored to this scientific case that reduces the training time of
the SVM classifier without a loss of accuracy. The work has been
performed and discussed within the Research Data Alliance (RDA) Big
Data Analytics Interest Group (IG) [3].
This paper is structured as follows. After this introduction to the
problem domain, Section 2 provides the necessary technical
background, summarizes the methods, and surveys related work.
Section 3 presents results from our approach, and Section 4 ends the
paper with some concluding remarks.
2. BACKGROUND AND RELATED WORK
The required technical background lies at the intersection of two
distinct fields: a theoretical classification method from the field
of machine learning and its practical parallelization approaches,
which originate from the field of parallel and distributed computing.
Related work is reviewed in the light of this background, with an
emphasis on parallel tools that are available and actually work.
2.1. Methods Summary
The method we have chosen to perform image classification is known as
Support Vector Machines (SVMs) [2], because they are one of the most
powerful classification and regression tools today. Our problem
domain is a multi-class classification problem, and SVMs solve this
problem given the following n input data instances (i.e. labelled
training data):
Training set $T = \{(x_1, y_1), \ldots, (x_n, y_n)\}$
The SVM method maps these input data instances into a
high-dimensional feature space with a non-linear mapping function
$\phi$ and then performs linear classification in this
high-dimensional feature space. In accordance with Cover's theorem
[4], the transformed data instances are more likely to be linearly
separable in this space. The mapped data instances belonging to
different classes (i.e. multi-class) are separated by maximum margin
(decision) hyperplanes in this higher-dimensional space. Since
maximizing the distance of the data instances to the optimal decision
hyperplane is equivalent to minimizing the norm of the weight vector
$w$, SVMs solve the following constrained optimization problem:
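$$\min_{w,\,\xi_i,\,b} \left( \frac{1}{2} \|w\|^2 + C \sum_i \xi_i \right) \tag{1}$$
subject to:
$$y_i \left( \langle \phi(x_i), w \rangle + b \right) \geq 1 - \xi_i \quad \forall i = 1, \ldots, n \tag{2}$$
$$\xi_i \geq 0 \quad \forall i = 1, \ldots, n \tag{3}$$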
SVMs use a bias $b$, and the important parameter $C$ controls the
generalization capability of the SVM classifier, that is, how well it
will behave out of sample once it has been trained on data in sample
(i.e. on the training dataset instances). Each labelled data instance
$x_i$ has the label $y_i$, while the $\xi_i$ are non-negative slack
variables that allow for permitted errors.
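To make the roles of $C$ and the slack variables concrete, the following minimal sketch evaluates the objective (1) for a fixed hyperplane. It assumes a linear feature map $\phi(x) = x$ and uses the smallest slacks that satisfy (2) and (3), namely $\xi_i = \max(0, 1 - y_i(\langle x_i, w \rangle + b))$. The function name and the toy samples are hypothetical and are not taken from the experiments in this paper.

```python
def primal_objective(w, b, C, samples):
    """Evaluate objective (1) for a fixed hyperplane (w, b).

    Assumes a linear feature map phi(x) = x. For each sample the smallest
    slack that satisfies constraints (2) and (3) is used.
    """
    slacks = []
    for x, y in samples:
        margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
        slacks.append(max(0.0, 1.0 - margin))
    norm_sq = sum(wi * wi for wi in w)
    return 0.5 * norm_sq + C * sum(slacks), slacks


# Hypothetical toy data: two samples outside the margin, one inside the
# margin, and one on the wrong side of the hyperplane.
samples = [((2.0, 0.0), 1), ((-2.0, 0.0), -1), ((0.5, 0.0), 1), ((0.5, 0.0), -1)]
value, slacks = primal_objective(w=(1.0, 0.0), b=0.0, C=10.0, samples=samples)
```

A larger $C$ makes each unit of slack more expensive, so the optimizer tolerates fewer margin violations at the cost of a larger weight norm.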
Formula (1) can be transformed into its dual problem, which in turn
can be solved using quadratic programming (QP) mechanisms. Model
selection, i.e. choosing the right value for $C$, is performed using
cross-validation techniques; the slack variables $\xi_i$ are
determined by the optimization itself. A complete introduction to
SVMs is out of scope, and we refer to Cortes and Vapnik [2] for more
technical details.
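For orientation, the dual of (1) to (3) has the following standard form. The Lagrange multipliers $\alpha_i$ and the kernel function $K$ are notation introduced here for illustration only; the full derivation is given in [2].

$$\max_{\alpha} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i, x_j)$$
subject to:
$$0 \leq \alpha_i \leq C \quad \forall i = 1, \ldots, n, \qquad \sum_{i=1}^{n} \alpha_i y_i = 0$$
with
$$K(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle, \qquad w = \sum_{i=1}^{n} \alpha_i y_i \phi(x_i)$$

Only instances with $\alpha_i > 0$ contribute to $w$; these are the support vectors. The mapping $\phi$ enters only through $K$, so it never has to be evaluated explicitly.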
2.2. Related Work
There is a wide variety of related work in the field. We focus on
parallel SVM analytics tools, which we can further divide into three
categories.
(a) High Throughput Computing (HTC) based SVM tools that leverage the
map-reduce paradigm [5] are available, but either their functionality
or their stability needs to be improved. Apache Mahout [7] is a
Java-based machine learning framework built on Apache Hadoop [6], but
according to its website there is no strategy for an SVM
implementation, although activities have started to implement at
least an initial serial version. We have also been using the parallel
SVM implementation by Sun and Fox described in [8], which is based on
an iterative map-reduce framework called Twister [9]. Since the code
has many dependencies on messaging frameworks and map-reduce, we
found the solution interesting, but the stability of the whole stack
and of the parallel SVM implementation needs to be improved. The code
is based on LibSVM [10], the de-facto standard serial implementation
of SVMs. Closely related to map-reduce is Spark [11], an alternative
to map-reduce that still works with Hadoop, but its MLlib only
supports binary classification and linear SVMs, while our approach
requires multi-class classification.
(b) High Performance Computing (HPC) based SVM tools that leverage
the traditional message passing interface (MPI) paradigm [12] are
available, but they are either old codes, beta releases, or limited
in scalability. In this paper we use the PiSVM [13] implementation of
SVMs, which is also internally based on LibSVM. It is an older code
that shows several I/O limits but is stable to use. Another parallel
SVM implementation using MPI is the pSVM tool [14], which is
available in the Google Code repository, but it is also an old code
that is publicly available only as a beta version.
(c) Several implementations are emerging in the field of SVMs
accelerated by Graphics Processing Units (GPUs), such as GPU-LibSVM
[15], which is also based on the original LibSVM and uses the CUDA
framework. We have already deployed the software package and started
to use it, but due to the page limit we keep our focus on the study
of the MPI implementation.
Finally, parallel versions of SVMs have also been used in remote
sensing before. In [16], the performance of a parallel SVM
implementation based on the parallelization of the incomplete
Cholesky factorization is evaluated, and novel implementations of the
standard master-worker decomposition are presented. Because of the
page limit, we cannot go into detail on each of them, but our
approach differs mainly in combining a tailored SVM with a
mathematical morphology approach.
3. RESULTS
The multispectral data set used in our experiments is an image of
Rome, Italy, acquired by the QuickBird satellite. It consists of a
high-resolution (0.6 m) panchromatic image and a lower-resolution
(2.4 m) multispectral image with the four bands Red, Green, Blue and
Near Infrared. Labelled data exists in the form of ground truth for 9
different land-cover classes, which are shown in Table 1. We
generated a set of training samples by randomly selecting 10% of the
reference samples and used the remaining labels as the set of test
samples.
Table 1. Rome data set: number of training and test samples

Class          Training      Test
Buildings         18126    163129
Blocks            10982     98834
Roads             16353    147176
Light Train        1606     14454
Vegetation         6962     62655
Trees              9088     81792
Bare Soil          8127     73144
Soil               1506     13551
Tower              4792     43124
Total             77542    697859
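The counts in Table 1 are consistent with drawing 10% of each class separately and rounding to the nearest integer. The text does not state that the selection was stratified by class, so the following is only a sketch under that assumption. The function names, the fixed seed, and the toy labels are hypothetical; the Rome ground truth itself is not reproduced here.

```python
import random
from collections import defaultdict

# (training, test) counts copied from Table 1.
TABLE_1 = {
    "Buildings": (18126, 163129),
    "Blocks": (10982, 98834),
    "Roads": (16353, 147176),
    "Light Train": (1606, 14454),
    "Vegetation": (6962, 62655),
    "Trees": (9088, 81792),
    "Bare Soil": (8127, 73144),
    "Soil": (1506, 13551),
    "Tower": (4792, 43124),
}


def rounded_share(total, percent):
    """Nearest-integer share of `total` (half rounds up), integer arithmetic only."""
    return (total * percent + 50) // 100


def stratified_split(labels, percent=10, seed=0):
    """Randomly draw `percent` of each class for training; the rest is the test set."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = rounded_share(len(indices), percent)
        train_idx.extend(indices[:n_train])
        test_idx.extend(indices[n_train:])
    return sorted(train_idx), sorted(test_idx)


# Check the assumption against Table 1: every class matches a rounded 10% share.
table_is_consistent = all(
    rounded_share(train + test, 10) == train for train, test in TABLE_1.values()
)

# Toy labels standing in for the per-pixel ground truth.
toy_labels = ["Roads"] * 50 + ["Trees"] * 30 + ["Tower"] * 20
train_idx, test_idx = stratified_split(toy_labels)
```

Splitting per class keeps small classes such as Soil represented in the training set in the same proportion as large classes such as Buildings.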
The experimental analysis is organized in three main steps, as shown
in Fig. 1. The first step uses morphological attribute filters [17]
to extract spatial information from the input images. We considered
the attribute area and selected 10 different threshold values. The
attribute filter is applied to all the input images, and its output
is a so-called Self Dual Attribute Profile (SDAP) [18] containing 55
features.
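The idea of filtering by the attribute area at several thresholds can be sketched on a small binary image. Note that this is a plain binary area opening, a much simpler relative of the self-dual attribute filters of [17, 18], and not the operator used in our experiments. The feature count at the end assumes that the profile of each of the five input images stacks the original image plus one filtered image per threshold, which is one reading that yields 55 features. All names and the toy image are hypothetical.

```python
from collections import deque


def area_open(binary, min_area):
    """Keep only 4-connected foreground components with at least min_area pixels."""
    rows, cols = len(binary), len(binary[0])
    out = [[0] * cols for _ in range(rows)]
    seen = [[False] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if binary[r][c] == 0 or seen[r][c]:
                continue
            # Breadth-first search collects one connected component.
            component, queue = [], deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                component.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and binary[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if len(component) >= min_area:
                for y, x in component:
                    out[y][x] = 1
    return out


def area_profile(binary, thresholds):
    """Original image followed by one filtered image per area threshold."""
    return [binary] + [area_open(binary, t) for t in thresholds]


# Toy image with components of area 4 (top left), 1 (top right), 3 (bottom right).
image = [
    [1, 1, 0, 0, 1],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 0, 1],
]
profile = area_profile(image, thresholds=[2, 4])

# Assumed assembly of the feature vector: 5 images x (1 original + 10 thresholds).
features_per_pixel = 5 * (1 + 10)
```

Each pixel then carries one value per profile layer, so objects of different sizes respond at different thresholds and become separable for the classifier.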
The purpose of the remaining steps is to use the extracted features
for the classification task. We conducted experiments with PiSVM
[13], which in turn uses MPI for communication between several nodes.
The training set consists of 77542 vectors with 55 features each. In
order to demonstrate the impact of using parallel smart data
analytics, we compare the processing time of the training phase
(finding the support vectors) for two cases. In the first case, we
ran the training phase in the serial Matlab environment and measured
a processing time of 1277 seconds. In the second case, to take
advantage of parallel analytics, we used the JUDGE cluster [19] at
the Juelich Supercomputing Centre in Germany and trained the SVM
using different numbers of processors. Fig. 2 shows that the walltime
for training decreases as the number of nodes involved in the cluster
computation increases, reaching a speed-up limit at roughly 16 nodes.
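The comparison of serial and parallel training times can be expressed as speed-up and parallel efficiency. The sketch below uses the serial Matlab time of 1277 seconds from the text; the parallel time and the process count are hypothetical placeholders chosen only to show the arithmetic and are not measured values.

```python
def speedup(serial_seconds, parallel_seconds):
    """Ratio of serial to parallel walltime."""
    return serial_seconds / parallel_seconds


def efficiency(serial_seconds, parallel_seconds, n_processes):
    """Speed-up per process; 1.0 would be ideal linear scaling."""
    return speedup(serial_seconds, parallel_seconds) / n_processes


SERIAL_MATLAB_SECONDS = 1277.0         # reported in the text
HYPOTHETICAL_PARALLEL_SECONDS = 100.0  # placeholder, NOT a measurement
HYPOTHETICAL_PROCESSES = 16            # placeholder process count

s = speedup(SERIAL_MATLAB_SECONDS, HYPOTHETICAL_PARALLEL_SECONDS)
e = efficiency(SERIAL_MATLAB_SECONDS, HYPOTHETICAL_PARALLEL_SECONDS,
               HYPOTHETICAL_PROCESSES)
```

An efficiency that falls as processes are added is one way to quantify the speed-up limit described above.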
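[Figure 1: flow diagram. Input images: Panch, Red, Green, Blue, Infrared. Morphological analysis: Attribute Filter (Area, Thresholds) produces the SDAP. Training phase: the Training Set, drawn from the Ground Truth, is used to Train SVM, yielding the SVM Model. Test phase: the SVM Classifier is applied to the Test Set to obtain the Classification accuracy.]
Fig. 1. Structural layout of the approach.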
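The test phase in Fig. 1 ends with a classification accuracy. A minimal sketch of how such a figure could be computed from true and predicted labels is given below. The function names and the toy labels are hypothetical; the sketch does not reproduce the evaluation code used in our experiments.

```python
from collections import Counter


def overall_accuracy(true_labels, predicted_labels):
    """Fraction of test samples whose predicted class equals the ground truth."""
    if len(true_labels) != len(predicted_labels) or not true_labels:
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


def per_class_accuracy(true_labels, predicted_labels):
    """Accuracy computed separately for each ground-truth class."""
    totals = Counter(true_labels)
    hits = Counter(t for t, p in zip(true_labels, predicted_labels) if t == p)
    return {label: hits[label] / totals[label] for label in totals}


# Toy labels for illustration only.
toy_truth = ["Roads", "Roads", "Trees", "Trees", "Tower"]
toy_pred = ["Roads", "Trees", "Trees", "Trees", "Tower"]
overall = overall_accuracy(toy_truth, toy_pred)
by_class = per_class_accuracy(toy_truth, toy_pred)
```

Reporting the per-class values next to the overall value helps with imbalanced class sizes such as those in Table 1, where large classes dominate the overall figure.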
[Figure 2: plot of processing time [s] (axis from 0 to 1200) against the number of processes (axis from 1 to 16).]
Fig. 2. Using the MPI-based parallel SVM implementation PiSVM significantly reduces the training time compared to serial approaches with Matlab.
4. CONCLUSIONS
Because of the page limit, we only provide a few highlights of our
research on smart data analytics based on parallel SVM tools. One of
the first findings is that openly available parallel implementations
of SVMs are rare, simple, or could be improved in terms of stability
and scalability. In this paper we have shown how big data analytics
can be tailored to a parallel smart data analytics solution for image
classification in the remote sensing community. We can draw two
conclusions from this study.
First, the parallel PiSVM implementation enabled a speed-up in
overall processing time compared to the serial Matlab approach.
Second, this significant reduction in training time did not affect
the accuracy, which we obtained by also running the SVM predictions
in parallel: it was always roughly 97%, like that of the serial
Matlab approach. The PiSVM implementation is stable enough for basic
smart analytics applications, but we observed some limitations with
respect to I/O and to scaling to a higher number of cores.
In order to support the emerging approaches towards 'reproducible
science', we have uploaded all datasets and the runtimes to the
B2SHARE EUDAT service. Hence, the data and the PiSVM implementation
can be used to reproduce the findings of this paper. The described
approach and its concrete application also contribute to the findings
of the RDA Big Data Analytics Interest Group.
Finally, future work will be the detailed investigation of other
parallel implementations, with a focus on the GPU-LibSVM library.
5. REFERENCES
[1] S. Khorram, F. H. Koch, C. F. Wiele, and S. A. C. Nelson,
Remote Sensing, vol. 7, 2012.
[2] C. Cortes and V. Vapnik, “Support-vector networks,”
Machine Learning, vol. 20(3), pp. 273–297, 1995.
[3] Research Data Alliance, “Big data analytics interest group
website,” Website, 2014, Available online at
https://rd-alliance.org/group/big-data-analytics-ig.html.
[4] T.M. Cover, “Geometrical and statistical properties of
systems of linear inequalities with application in pattern
recognition,” IEEE Transactions on Electronic Comput-
ers, vol. 14, pp. 326–334, 1965.
[5] J. Dean and S. Ghemawat, “Mapreduce: simplified data
processing on large clusters,” Communications of the
ACM, vol. 51(1), pp. 107–113, 2008.
[6] T. White, Hadoop: The Definitive Guide, O'Reilly, 2009.
[7] A. Robin, T. Dunning, and E. Friedman, Mahout in Action,
Manning, 2011.
[8] Y. Sun and G. Fox, “Study on parallel svm based
on mapreduce,” International Conference on Paral-
lel and Distributed Processing Techniques and Applica-
tions, 2012, pp. 16–19.
[9] J. Ekanayake, J. Li, B. Zhang, T. Gunarathne, T. Bae, J. Qiu,
and J. Fox, “Twister: a runtime for iterative mapreduce,”
19th ACM International Symposium on High Performance
Distributed Computing, 2010, pp. 810–818.
[10] C. Chang and C. Lin, “Libsvm: a library for support
vector machines,” ACM Transactions on Intelligent Sys-
tems and Technology (TIST), vol. 2(3), pp. 27, 2011.
[11] M. Zaharia, M. Chowdhury, M. Franklin, S. Shenker, and
I. Stoica, “Spark: cluster computing with working sets,”
2nd USENIX conference on Hot topics in cloud computing, 2010,
pp. 10–10.
[12] W. Gropp, E. Lusk, and A. Skjellum, Using MPI: portable
parallel programming with the message-passing interface,
MIT Press, 1999.
[13] PiSVM, “Pisvm - parallel svm based on mpi,” Website, 2014,
Available online at http://pisvm.sourceforge.net/index.html.
[14] E.Y. Chang, K. Zhu, H. Wang, H. Bai, J. Li, Y. Qiu, and
H. Cui, “Psvm: Parallelizing support vector machines
on distributed computers,” NIPS, 2007.
[15] A. Athanasopoulos, A. Dimou, V. Mezaris, and I. Kompatsiaris,
“Gpu acceleration for support vector machines,”
12th International Workshop on Image Analysis for
Multimedia Interactive Services, 2011.
[16] J. Muñoz, A. Plaza, J. A. Gualtieri, and G. Camps-Valls,
“Parallel Implementation of SVM in Earth Observation
Applications,” in Parallel Programming and Applications
in Grid, P2P and Networking Systems, F. Xhafa,
Ed., pp. 292–312. IOS Press, 2009.
[17] M. Dalla Mura, A. Villa, J. A. Benediktsson, J. Chanus-
sot, and L. Bruzzone, “Classification of hyperspec-
tral images by using morphological attribute filters and
Independent Component Analysis,” IEEE Geoscience
and Remote Sensing Letters, vol. 8, no. 3, pp. 542–546,
2010.
[18] M. Dalla Mura, J. A. Benediktsson, and L. Bruzzone,
“Self-dual attribute profiles for the analysis of remote
sensing images,” in Mathematical Morphology and Its
Applications to Image and Signal Processing, P. Soille,
M. Pesaresi, and G. Ouzounis, Eds., pp. 320–330. Springer
Berlin Heidelberg, 2011.
[19] Juelich Supercomputing Centre, “Judge cluster,”
Website, 2014, Available online at
http://www.fz-juelich.de/ias/jsc/EN/Expertise/Supercomputers/JUDGE/judge_node.html.