SMART DATA ANALYTICS METHODS FOR REMOTE SENSING APPLICATIONS
Gabriele Cavallaro (a), Morris Riedel (a,b), Jon Atli Benediktsson (a), Markus Goetz (a,b), Tomas Runarsson (a), Kristjan Jonasson (a), Thomas Lippert (b)
(a) Faculty of Electrical and Computer Engineering, University of Iceland, Reykjavik, Iceland
(b) Jülich Supercomputing Center, Forschungszentrum, Jülich, Germany
ABSTRACT
Big data analytics can be interpreted as the systematic extraction of information from large quantities of scientific data. To give the term a more concrete meaning, we refer to its refinement as smart data analytics: examining large quantities of scientific data to uncover hidden patterns and unknown correlations, or to extract information in cases where no exact formula (e.g. a known physical law) is available. Our concrete big data problem is the classification of land cover types in image-based datasets created with remote sensing technologies, where the resolution can be high (i.e. large volumes) and the data come in various types, such as panchromatic images or different spectral bands like red, green, blue, and near infrared (i.e. large variety). We investigate smart data analytics methods that take advantage of machine learning algorithms (i.e. support vector machines) and state-of-the-art parallelization approaches in order to overcome the limitations of big data processing with non-scalable serial approaches.
Index Terms: Data Analytics, Support Vector Machines, Parallel Computing, Remote Sensing, Classification
1. INTRODUCTION
Besides the traditional sources and collection methods of data, with all their limitations, satellite remote sensing [1] remains one of the largest sources of data. Remote sensing takes advantage of satellite and airborne sensors to observe, measure, and record the radiation reflected or emitted by the Earth and its environment. It can significantly enhance the information available from traditional data sources (e.g., by providing synoptic views of large portions of the Earth), which can be used for subsequent data processing. The big data problem arises from the rapid improvement of remote sensing capabilities, such as the availability of remotely sensed images with very high geometrical resolution (QuickBird: 0.6 m). In addition, the level of spectral detail (AVIRIS: 224 spectral channels) is constantly increasing, and the amount of data is growing continuously, with images becoming more numerous, precise, and frequent, but also more complex.
In this context, remote sensing makes use of several analysis methods, such as image processing, automatic classification, multitemporal processing, and data fusion, in order to handle different real applications. The availability of the data raises a demand for smart data analytics techniques such as image classification, which is one of the most significant application areas of remote sensing but faces serious limitations when performed with traditional serial tools (e.g. R, Matlab). Classification aims to assign every pixel of a digital image to a meaningful feature or land cover class in the scene. In order to obtain a satisfactory level of detection accuracy, we perform a detailed physical analysis that exploits the availability of high spatial resolution images. As a further optimization technique we consider attribute filters, flexible operators that can transform an image according to many different attributes (e.g., geometrical, textural, and spectral).
This paper offers one solution to the scientific case described above by applying smart data analytics methods to one specific big dataset in the field of remote sensing. We provide a scalable analytics solution for image classification based on one of the most successful classification methods, support vector machines (SVMs) [2]. To overcome the limitations of the wide variety of traditional serial SVM data analysis tools, we survey and apply existing open source SVM tools for big data analytics that take advantage of parallelization techniques. The contribution of this paper is thus the design of a parallel smart data analytics method, tailored to this scientific case, that reduces the training time of the SVM classifier without a loss in accuracy. The work has been performed and discussed within the Research Data Alliance (RDA) Big Data Analytics Interest Group (IG) [3].
This paper is structured as follows. After this introduction to the problem domain, Section 2 provides the necessary technical background, summarizes the methods, and surveys related work. Section 3 presents the results of our approach, and Section 4 closes the paper with concluding remarks.
2. BACKGROUND AND RELATED WORK
The required technical background lies at the intersection of two distinct fields: a theoretical classification method from machine learning and its practical parallelization approaches, which originate from parallel and distributed computing. Related work is reviewed in the light of this background, with an emphasis on parallel tools that are available and actually work.
2.1. Methods Summary
The method we have chosen for image classification is known as support vector machines (SVMs) [2], because they are among the most powerful classification and regression tools available today. Our problem domain is a multi-class classification problem, and SVMs address it starting from the following n input data instances (i.e. labelled training data):
Training set T = {(x_1, y_1), ..., (x_n, y_n)}
The SVM method maps these input data instances into a high-dimensional feature space with a non-linear mapping function φ and then performs linear classification in that space. In accordance with Cover's theorem [4], this mapping makes the transformed data instances more likely to be linearly separable. The mapped data instances belonging to different classes (i.e. multi-class) are separated by maximum margin (decision) hyperplanes in this higher dimensional space. Since maximizing the distance of the data instances to the optimal decision hyperplane is equivalent to minimizing the norm of the weight vector w, SVMs solve the following constrained optimization problem:
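$$\min_{w,\,\xi_i,\,b} \left( \frac{1}{2}\|w\|^2 + C \sum_i \xi_i \right) \tag{1}$$
subject to:
$$y_i \left( \langle \phi(x_i), w \rangle + b \right) \ge 1 - \xi_i, \quad i = 1, \ldots, n \tag{2}$$
$$\xi_i \ge 0, \quad i = 1, \ldots, n \tag{3}$$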
SVMs use a bias b, and the parameter C controls the generalization capability of the SVM classifier, i.e. how well it behaves out of sample once it has been trained in sample (on the training dataset instances). Each labelled data instance x_i has the label y_i, while the ξ_i are non-negative slack variables that allow for permitted errors.
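To make the role of the slack variables and of C concrete, the following minimal sketch evaluates objective (1) for a given linear decision function. It assumes an identity feature map (φ(x) = x) and binary labels in {-1, +1}; the function names and the toy data are illustrative and not taken from the paper. For fixed w and b, the smallest slack that satisfies (2) and (3) is ξ_i = max(0, 1 - y_i(⟨w, x_i⟩ + b)).

```python
def slack_variables(w, b, xs, ys):
    """Smallest xi_i satisfying constraints (2) and (3) for fixed w and b."""
    slacks = []
    for x, y in zip(xs, ys):
        margin = y * (sum(wj * xj for wj, xj in zip(w, x)) + b)
        slacks.append(max(0.0, 1.0 - margin))
    return slacks


def primal_objective(w, b, xs, ys, C):
    """Value of objective (1): 0.5 * ||w||^2 + C * sum_i xi_i."""
    norm_sq = sum(wj * wj for wj in w)
    return 0.5 * norm_sq + C * sum(slack_variables(w, b, xs, ys))


# Toy data (hypothetical): two points per class in the plane.
xs = [(2.0, 0.0), (0.5, 0.0), (-2.0, 0.0), (-0.5, 0.0)]
ys = [1, 1, -1, -1]
w, b = (1.0, 0.0), 0.0

slacks = slack_variables(w, b, xs, ys)
obj_small_C = primal_objective(w, b, xs, ys, C=1.0)
obj_large_C = primal_objective(w, b, xs, ys, C=10.0)
```

A larger C penalizes margin violations more heavily, which pushes the optimizer towards fewer training errors at the risk of poorer out-of-sample behaviour.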
Problem (1) can be transformed into its dual problem, which in turn can be solved using quadratic programming (QP) mechanisms. Model selection, i.e. choosing a suitable value for C (the slack variables ξ_i are determined by the optimization itself), is performed using cross-validation techniques. A complete introduction to SVMs is out of scope, and we refer to Cortes and Vapnik [2] for more technical details.
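For reference, the dual of (1)-(3) in its standard textbook form is sketched below. The Lagrange multipliers α_i and the kernel K(x_i, x_j) = ⟨φ(x_i), φ(x_j)⟩ are notation introduced here for illustration and do not appear in the text above; see [2] for the derivation.

```latex
\begin{aligned}
\max_{\alpha}\quad & \sum_{i=1}^{n} \alpha_i
  - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n}
    \alpha_i \alpha_j \, y_i y_j \, K(x_i, x_j) \\
\text{subject to}\quad & 0 \le \alpha_i \le C, \quad i = 1, \ldots, n, \\
& \sum_{i=1}^{n} \alpha_i y_i = 0 .
\end{aligned}
```

In this form the mapping φ enters only through K, and the training instances with α_i > 0 are the support vectors, with w = Σ_i α_i y_i φ(x_i).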
2.2. Related Work
There is a wide variety of related work in the field. We focus on parallel SVM analytics tools, which can be further divided into three categories.
(a) High Throughput Computing (HTC) based SVM tools that leverage the new map-reduce paradigm [5] are available, but either their functionality or their stability needs to be improved. Apache Mahout [7] is a Java-based machine learning framework built on Apache Hadoop [6], but according to its website there is no strategy for an SVM implementation, although activities have started to implement at least a serial version initially. We have also used the parallel SVM implementation of Sun and Fox described in [8], which is based on an iterative map-reduce framework called Twister [9]. Because the code has many dependencies on messaging frameworks and map-reduce, we found the solution interesting, but the stability of the whole stack and of the parallel SVM implementation needs to be improved. The code is based on LibSVM [10], the de-facto standard serial implementation of SVMs. Closely related to map-reduce is Spark [11], an alternative to map-reduce that still works with Hadoop, but its MLlib only supports binary classification with linear SVMs, whereas our approach requires multi-class classification.
(b) High Performance Computing (HPC) based SVM tools that leverage the traditional Message Passing Interface (MPI) paradigm [12] are available, but they are either old codes or beta releases, or they show limits in scalability. In this paper we use the PiSVM [13] implementation of SVMs, which is also internally based on LibSVM. It is an older code that shows several I/O limits, but it is stable to use. Another parallel SVM implementation using MPI is the pSVM tool [14], available in the Google Code repository, but it is also an old code that is publicly available only as a beta version.
(c) Several implementations are emerging in the field of SVMs accelerated by Graphics Processing Units (GPUs), such as GPU-LibSVM [15], which is also based on the original LibSVM and uses the CUDA framework. We have already deployed this software package and started to use it, but due to the page restriction we keep our focus on the study of the MPI implementation.
Finally, parallel versions of SVMs have also been used in remote sensing before. In [16], a parallel SVM implementation based on the parallelization of the incomplete Cholesky factorization is evaluated, and novel implementations of the standard master-worker decomposition are presented. Because of the page restriction we cannot go into the details of each of these approaches, but ours differs mainly in combining a tailored SVM with a mathematical morphology approach.
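The multi-class requirement mentioned under (a) is commonly met by combining binary classifiers. The sketch below illustrates one-vs-one voting, the strategy used by LibSVM: one binary classifier per pair of classes and a majority vote at prediction time. The binary learner here is a deliberately trivial nearest-class-mean rule standing in for a binary SVM, and all names and the toy data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def train_binary(points, labels, class_a, class_b):
    """Stand-in binary learner (NOT an SVM): stores the two class means."""
    def mean(cls):
        members = [p for p, l in zip(points, labels) if l == cls]
        return tuple(sum(c) / len(members) for c in zip(*members))
    return {class_a: mean(class_a), class_b: mean(class_b)}


def predict_binary(model, x):
    """Return the class whose stored mean is closest to x."""
    def dist_sq(m):
        return sum((a - b) ** 2 for a, b in zip(x, m))
    return min(model, key=lambda cls: dist_sq(model[cls]))


def train_one_vs_one(points, labels):
    """Train one binary model per unordered pair of classes."""
    classes = sorted(set(labels))
    return [train_binary(points, labels, a, b)
            for a, b in combinations(classes, 2)]


def predict_one_vs_one(models, x):
    """Each pairwise model casts one vote; the most voted class wins."""
    votes = Counter(predict_binary(m, x) for m in models)
    return votes.most_common(1)[0][0]


# Hypothetical toy data with three classes in the plane.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0), (10.0, 1.0)]
labels = ["roads", "roads", "trees", "trees", "soil", "soil"]

models = train_one_vs_one(points, labels)
prediction = predict_one_vs_one(models, (9.0, 0.5))
```

With k classes this scheme trains k(k-1)/2 binary classifiers, so a problem with nine classes would need 36 of them.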
3. RESULTS
The multispectral data set used in our experiments is an image of Rome, Italy, acquired by the QuickBird satellite. It consists of a high-resolution (0.6 m) panchromatic image and a low-resolution (2.4 m) multispectral image with the four bands red, green, blue, and near infrared. Labelled data exist in the form of ground truth for 9 different land-cover classes, which are listed in Table 1. We generated a set of training samples by randomly selecting 10% of the reference samples and used the remaining labelled samples as the test set.
Table 1. Rome data set: number of training and test samples
Class         Training    Test
Buildings        18126  163129
Blocks           10982   98834
Roads            16353  147176
Light Train       1606   14454
Vegetation        6962   62655
Trees             9088   81792
Bare Soil         8127   73144
Soil              1506   13551
Tower             4792   43124
Total            77542  697859
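The counts in Table 1 are consistent with drawing the 10% training sample separately within each class. The sketch below first computes that share from the table and then shows one possible way to draw such a per-class random split. The function name, the fixed seed, and the rounding rule are assumptions for illustration, not the procedure documented by the authors.

```python
import random

# (training, test) counts per class, copied from Table 1.
TABLE_1 = {
    "Buildings": (18126, 163129),
    "Blocks": (10982, 98834),
    "Roads": (16353, 147176),
    "Light Train": (1606, 14454),
    "Vegetation": (6962, 62655),
    "Trees": (9088, 81792),
    "Bare Soil": (8127, 73144),
    "Soil": (1506, 13551),
    "Tower": (4792, 43124),
}

# Share of each class's reference samples that ended up in the training set.
training_share = {
    name: train / (train + test) for name, (train, test) in TABLE_1.items()
}


def stratified_split(labels, fraction, seed=0):
    """Randomly pick `fraction` of the indices of every class for training."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = round(fraction * len(indices))
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return sorted(train), sorted(test)


# Hypothetical label vector: 50 "Roads" pixels followed by 30 "Trees" pixels.
labels = ["Roads"] * 50 + ["Trees"] * 30
train_idx, test_idx = stratified_split(labels, fraction=0.1)
```

Splitting per class keeps small classes such as Soil and Light Train represented in the training set in the same proportion as the large ones.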
The experimental analysis is organized in three main steps, as shown in Fig. 1. The first step uses morphological attribute filters [17] to extract spatial information from the input images. We considered the attribute area and selected 10 different threshold values. The attribute filter is applied to all input images, and its output is a so-called Self-Dual Attribute Profile (SDAP) [18] containing 55 features.
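One reading that is consistent with the 55 features is that each of the five input images (panchromatic, red, green, blue, near infrared) contributes the original image plus one filtered image per threshold, i.e. 5 × (1 + 10) = 55. The sketch below illustrates this stacking under that assumption. As a stand-in for the self-dual attribute filters of [17, 18], which operate on grey-level images, it uses a much simpler binary area opening; the function names, the toy image, and the threshold values are hypothetical.

```python
def area_opening_binary(image, min_area):
    """Remove 4-connected foreground components smaller than min_area pixels."""
    rows, cols = len(image), len(image[0])
    output = [[0] * cols for _ in range(rows)]
    seen = [[False] * cols for _ in range(rows)]
    for r0 in range(rows):
        for c0 in range(cols):
            if image[r0][c0] != 1 or seen[r0][c0]:
                continue
            # Flood fill to collect one connected component.
            component, stack = [], [(r0, c0)]
            seen[r0][c0] = True
            while stack:
                r, c = stack.pop()
                component.append((r, c))
                for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
                    inside = 0 <= nr < rows and 0 <= nc < cols
                    if inside and image[nr][nc] == 1 and not seen[nr][nc]:
                        seen[nr][nc] = True
                        stack.append((nr, nc))
            # Keep the component only if its area reaches the threshold.
            if len(component) >= min_area:
                for r, c in component:
                    output[r][c] = 1
    return output


def build_profile(images, thresholds, attribute_filter):
    """Stack each input image with its filtered versions, one per threshold."""
    profile = []
    for image in images:
        profile.append(image)
        profile.extend(attribute_filter(image, t) for t in thresholds)
    return profile


# Hypothetical 3x4 binary image: one component of 3 pixels, one of 1 pixel.
toy = [
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [0, 0, 0, 0],
]
filtered = area_opening_binary(toy, min_area=2)

# Five inputs and ten thresholds (the threshold values are placeholders).
inputs = [toy] * 5
thresholds = list(range(1, 11))
profile = build_profile(inputs, thresholds, area_opening_binary)
```

Each pixel then carries one value per layer of the profile, which gives the feature vector passed on to the classifier.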
The remaining steps use the extracted features for the classification task. We conducted experiments with PiSVM [13], which uses MPI for communication between several nodes. The training set comprises 77542 vectors with 55 features each. In order to demonstrate the impact of parallel smart data analytics, we report the processing time of the training phase (finding the support vectors) for two cases. In the first case, we ran the training phase in the serial Matlab environment and measured a processing time of 1277 seconds. In the second case, to take advantage of parallel analytics, we used the JUDGE cluster [19] at the Juelich Supercomputing Centre in Germany and trained the SVM with different numbers of processors. Fig. 2 shows that the wall time for training decreases as the number of nodes involved in the cluster computation increases, reaching a speed-up limit at roughly 16 nodes.
[Figure 1: flow diagram. Input images: Panch, Red, Green, Blue, Infrared. Morphological analysis: Attribute Filter (Area, Thresholds) producing the SDAP. Training phase: Training Set and Ground Truth, Train SVM, SVM Model. Test phase: Test Set, SVM Classifier, Classification accuracy.]
Fig. 1. Structural layout of the approach.
[Figure 2: plot of processing time [s] (axis 0 to 1200) against number of processes (axis 1 to 16). Inset table with columns NP and TIME; values as extracted: 156,57; 239,32; 429,59.]
Fig. 2. Using the MPI-based parallel SVM implementation PiSVM significantly reduces the training time compared to serial approaches with Matlab.
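The reported saturation of the speed-up can be put in numbers with the usual definitions of speed-up and parallel efficiency. The sketch below uses the serial Matlab time of 1277 seconds from the text; the parallel wall time and the parallel fraction are hypothetical placeholders, and Amdahl's law is used only as a simple illustrative model, not as the analysis performed in the paper.

```python
def speedup(serial_time, parallel_time):
    """Ratio of serial to parallel wall time."""
    return serial_time / parallel_time


def efficiency(serial_time, parallel_time, processes):
    """Speed-up divided by the number of processes."""
    return speedup(serial_time, parallel_time) / processes


def amdahl_speedup(parallel_fraction, processes):
    """Upper bound on speed-up when only part of the work is parallel."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processes)


SERIAL_MATLAB_SECONDS = 1277.0          # from the text
HYPOTHETICAL_PARALLEL_SECONDS = 127.7   # placeholder, not a measured value

s = speedup(SERIAL_MATLAB_SECONDS, HYPOTHETICAL_PARALLEL_SECONDS)
e = efficiency(SERIAL_MATLAB_SECONDS, HYPOTHETICAL_PARALLEL_SECONDS, processes=16)

# With an assumed parallel fraction of 0.9 the bound flattens quickly.
bounds = {p: amdahl_speedup(0.9, p) for p in (1, 4, 16, 64)}
```

In this model the speed-up can never exceed 1 / (1 - parallel fraction), which is one possible explanation for a plateau such as the one in Fig. 2.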
4. CONCLUSIONS
Because of the page restriction, we provide only a few highlights of our research on smart data analytics based on parallel SVM tools. One of the first findings is that open source parallel implementations of SVMs are rare or simple, or could be improved in terms of stability and scalability. In this paper we have shown how big data analytics can be tailored to a parallel smart data analytics solution for image classification in the remote sensing community. We can draw two conclusions from this study.
Firstly, the parallel implementation PiSVM enabled a speed-up in overall processing time compared to the serial Matlab approach. This significant reduction in training time did not affect the accuracy: running the SVM predictions in parallel as well, we always obtained roughly 97%, as with the serial Matlab approach. Secondly, the PiSVM implementation is stable enough for basic smart analytics applications, but we observed some limitations with respect to scaling to higher numbers of cores and with respect to I/O.
In order to support the increasingly important move towards 'reproducible science', we have uploaded all datasets and the runtimes to the B2SHARE EUDAT service. Hence, the data and the PiSVM implementation can be used to reproduce the findings of this paper. Moreover, the approach described here, with its concrete application, contributes to the findings of the RDA Big Data Analytics Interest Group.
Finally, future work will be the detailed investigation of other parallel implementations, with a focus on the GPU-LibSVM library.
5. REFERENCES
[1] S. Khorram, F. H. Koch, C. F. Wiele, and S. A. C. Nelson, Remote Sensing, vol. 7, 2012.
[2] C. Cortes and V. Vapnik, “Support-vector networks,” Machine Learning, vol. 20(3), pp. 273–297, 1995.
[3] Research Data Alliance, “Big data analytics interest group website,” Website, 2014. Available online at https://rd-alliance.org/group/big-data-analytics-ig.html.
[4] T. M. Cover, “Geometrical and statistical properties of systems of linear inequalities with application in pattern recognition,” IEEE Transactions on Electronic Computers, vol. 14, pp. 326–334, 1965.
[5] J. Dean and S. Ghemawat, “MapReduce: simplified data processing on large clusters,” Communications of the ACM, vol. 51(1), pp. 107–113, 2008.
[6] T. White, Hadoop: The Definitive Guide, O'Reilly, 2009.
[7] A. Robin, T. Dunning, and E. Friedman, Mahout in Action, Manning, 2011.
[8] Y. Sun and G. Fox, “Study on parallel SVM based on MapReduce,” International Conference on Parallel and Distributed Processing Techniques and Applications, 2012, pp. 16–19.
[9] J. Ekanayake, J. Li, B. Zhang, T. Gunarathne, T. Bae, J. Qiu, and J. Fox, “Twister: a runtime for iterative MapReduce,” 19th ACM International Symposium on High Performance Distributed Computing, 2010, pp. 810–818.
[10] C. Chang and C. Lin, “LibSVM: a library for support vector machines,” ACM Transactions on Intelligent Systems and Technology (TIST), vol. 2(3), pp. 27, 2011.
[11] M. Zaharia, M. Chowdhury, M. Franklin, S. Shenker, and I. Stoica, “Spark: cluster computing with working sets,” 2nd USENIX Conference on Hot Topics in Cloud Computing, 2010, pp. 10–10.
[12] W. Gropp, E. Lusk, and A. Skjellum, Using MPI: Portable Parallel Programming with the Message-Passing Interface, MIT Press, 1999.
[13] PiSVM, “PiSVM - parallel SVM based on MPI,” Website, 2014. Available online at http://pisvm.sourceforge.net/index.html.
[14] E. Y. Chang, K. Zhu, H. Wang, H. Bai, J. Li, Y. Qiu, and H. Cui, “PSVM: Parallelizing support vector machines on distributed computers,” NIPS, 2007.
[15] A. Athanasopoulos, A. Dimou, V. Mezaris, and I. Kompatsiaris, “GPU acceleration for support vector machines,” 12th International Workshop on Image Analysis for Multimedia Interactive Services, 2011.
[16] J. Muñoz, A. Plaza, J. A. Gualtieri, and G. Camps-Valls, “Parallel implementation of SVM in Earth observation applications,” in Parallel Programming and Applications in Grid, P2P and Networking Systems, F. Xhafa, Ed., pp. 292–312. IOS Press, 2009.
[17] M. Dalla Mura, A. Villa, J. A. Benediktsson, J. Chanussot, and L. Bruzzone, “Classification of hyperspectral images by using morphological attribute filters and Independent Component Analysis,” IEEE Geoscience and Remote Sensing Letters, vol. 8, no. 3, pp. 542–546, 2010.
[18] M. Dalla Mura, J. A. Benediktsson, and L. Bruzzone, “Self-dual attribute profiles for the analysis of remote sensing images,” in Mathematical Morphology and Its Applications to Image and Signal Processing, P. Soille, M. Pesaresi, and G. Ouzounis, Eds., pp. 320–330. Springer Berlin Heidelberg, 2011.
[19] Juelich Supercomputing Centre, “JUDGE cluster,” Website, 2014. Available online at http://www.fz-juelich.de/ias/jsc/EN/Expertise/Supercomputers/JUDGE/judge_node.html.