
Cloud Deep Networks for Hyperspectral Image Analysis

Juan M. Haut, Student Member, IEEE, Jose A. Gallardo, Mercedes E. Paoletti, Student Member, IEEE, Gabriele Cavallaro, Member, IEEE, Javier Plaza, Senior Member, IEEE, Antonio Plaza, Fellow, IEEE, and Morris Riedel, Member, IEEE

Abstract—Advances in remote sensing hardware have led to a significantly increased capability for high-quality data acquisition, which allows the collection of remotely sensed images with very high spatial, spectral and radiometric resolution. This trend calls for the development of new techniques to enhance the way such unprecedented volumes of data are stored, processed and analyzed. An important approach for dealing with massive volumes of information is data compression, i.e., reducing the size of the data before storage or transmission. Hyperspectral images (HSIs), for instance, are characterized by hundreds of spectral bands. In this context, high performance computing (HPC) and high throughput computing (HTC) offer interesting alternatives. In particular, distributed solutions based on cloud computing can manage and store huge amounts of data in fault-tolerant environments by interconnecting distributed computing nodes, so that no specialized hardware is needed. This strategy greatly reduces processing costs, making the processing of high volumes of remotely sensed data a natural and even cheap solution. In this paper, we present a new cloud-based technique for spectral analysis and compression of HSIs. Specifically, we develop a cloud implementation of a popular deep neural network for non-linear data compression, known as auto-encoder (AE). Apache Spark serves as the backbone of our cloud computing environment by connecting the available processing nodes using a master-slave architecture. Our newly developed approach has been tested using two widely available HSI datasets. Experimental results indicate that cloud computing architectures offer an adequate solution for managing big remotely sensed data sets.

Index Terms—High performance computing (HPC), high throughput computing (HTC), cloud computing, hyperspectral images (HSIs), auto-encoder (AE), dimensionality reduction (DR), speed-up.

This paper was supported by Ministerio de Educación (Resolución de 26 de diciembre de 2014 y de 19 de noviembre de 2015, de la Secretaría de Estado de Educación, Formación Profesional y Universidades, por la que se convocan ayudas para la formación de profesorado universitario, de los subprogramas de Formación y de Movilidad incluidos en el Programa Estatal de Promoción del Talento y su Empleabilidad, en el marco del Plan Estatal de Investigación Científica y Técnica y de Innovación 2013-2016). This work has also been supported by Junta de Extremadura (decreto 14/2018, ayudas para la realización de actividades de investigación y desarrollo tecnológico, de divulgación y de transferencia de conocimiento por los Grupos de Investigación de Extremadura, Ref. GR18060) and by MINECO project TIN2015-63646-C5-5-R. (Corresponding author: Juan M. Haut)

J. M. Haut, J. A. Gallardo, M. E. Paoletti, J. Plaza and A. Plaza are with the Hyperspectral Computing Laboratory, Department of Technology of Computers and Communications, Escuela Politécnica, University of Extremadura, 10003 Cáceres, Spain (e-mail: juanmariohaut@unex.es; mpaoletti@unex.es; jplaza@unex.es; aplaza@unex.es).
G. Cavallaro is with the Jülich Supercomputing Center, Wilhelm-Johnen-Straße, 52428 Jülich, Germany (e-mail: g.cavallaro@fz-juelich.de).
M. Riedel is with the Jülich Supercomputing Center, Wilhelm-Johnen-Straße, 52428 Jülich, Germany, and with the University of Iceland, 107 Reykjavik, Iceland (e-mail: m.riedel@fz-juelich.de).

I. INTRODUCTION

EARTH Observation (EO) has evolved dramatically in the

last decades due to the technological advances incor-

porated into remote sensing instruments in the optical and

microwave domains [1]. With their hundreds of contiguous and

narrow channels within the visible, near-infrared and short-

wave infrared spectral ranges, hyperspectral images (HSIs)

have been used for the retrieval of bio-, geo-chemical and

physical parameters that characterize the surface of the earth.

These data are now used in a wide-range of applications, aimed

at monitoring and implementing new policies in the domain of

agriculture, geology, assessment of environmental resources,

urban planning, military/defense, disaster management, etc.

[2], [3], [4].

Most of the developments carried out over the last decades

in the ﬁeld of imaging spectroscopy have been achieved

via spectrometers on board airborne platforms. For instance,

the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS)

[5], designed and operated by the NASA’s Jet Propulsion

Laboratory (JPL), was the ﬁrst full spectral range imaging

spectrometer. It has been dedicated to remote sensing of the

earth in a large number of experiments and ﬁeld campaigns

since the late 1980s. Other examples of airborne missions

include the European Space Agency (ESA)’s Airborne Prism

Experiment (APEX) (2011-2016) [6], or the Compact Air-

borne Spectrographic Imager (CASI) [7] (1989-today), among

many others.

The vast amount of data collected by airborne platforms

has paved the way for EO satellite hyperspectral missions.

The Hyperion instrument on-board NASA’s Earth Observing

One (EO-1) spacecraft (2000-2017) [8] and the Compact

High Resolution Imaging Spectrometer (CHRIS) on ESA’s

Proba-1 microsatellite [9] (2001-today) have been two of the

main sources of space-based HSI data in the last decades.

Currently, there are several HSI missions under development,

including the Environmental Mapping and Analysis Program

(EnMAP) [10], the Prototype Research Instruments and Space

Mission technology Advancement (PRISMA) [11], among

others. Their main objective is to ﬁll the current gap in

space-based imaging spectroscopy data and achieve better

radiometric performance than the precursor missions.

The adoption of an open and free data policy by the National

Aeronautics and Space Administration (NASA) [12] and, more

recently, by ESA’s Copernicus initiative (the largest single

EO programme) [13] is now producing an unprecedented


amount of data to the research community. For instance, in

2017 the Sentinel Data Access System provided an estimated

10.04 TB/day with an average download volume of 93.5

TB/day1. Even though the Copernicus space component (i.e., the Sentinels) has not included a hyperspectral instrument yet (Sentinel-10 is an HSI mission expected to be operational around 2025-2030), it has been shown that the vast amount

of open data currently available calls for re-deﬁnition of

the challenges within the entire HSI life cycle (i.e., data

acquisition, processing and application phases). It is not by

coincidence that remote sensing data are now described under

the big data terminology, with characteristics such as volume

(increasing scale of acquired/archived data), velocity (rapidly

growing data generation rate, real-time processing needs),

variety (data acquired from multiple sources), veracity (data

uncertainty/accuracy), and value (extracted information) [14],

[15].

In this context, traditional processing methods such as

desktop approaches (i.e., MATLAB, R, SAS, ENVI, etc.) offer

limited capabilities when dealing with such large amounts of

data, especially regarding the velocity component (i.e., the

demand for real-time applications). Although modern desktop computers and laptops are becoming increasingly powerful, with multi-core and many-core capabilities including

graphics processing units (GPUs), the limitations in terms of

memory and core availability currently limit the processing of

large HSI data archives. Therefore, the use of highly scalable

parallel processing approaches is a mandatory solution to

improve the access to and the analysis of such great amount of

complex data, in order to provide decision-makers with clear,

timely, and useful information [16], [17].

Many changes have been introduced to parallel and dis-

tributed architectures over the past 30 years. In particular,

research has been focused on how to leverage many-core

architectures (e.g., GPUs) to deal with the growing demand

of domain-speciﬁc applications for handling computationally

intense problems. Other parallel architectures such as clusters

[18], grids [19], or clouds [20], [21] have also been widely

exploited for remotely sensed data processing, since they pro-

vide tremendous storage/computation capacity and outstanding

scalability. Parallel and distributed computing approaches can

be categorized into high performance computing (HPC) or

high throughput computing (HTC) solutions. Contrary to an

HPC system [22] (generally, a supercomputer that includes

a massive number of processors connected through a fast

dedicated network), an HTC system is more focused on the

execution of independent and sequential jobs that can be

individually scheduled on many different computing resources,

regardless of how fast an individual job can be completed.

A classic example of an HPC system is a cluster, while a

typical example of an HTC system is a grid. Cloud com-

puting is the natural evolution of grid computing, adopting

its backbone and infrastructure [21] but delivering computing

resources as a service over the network connection [23]. In

other words, the cloud moves desktop and laptop computing

1 https://sentinel.esa.int/web/sentinel/news/-/article/sentinel-data-access-annual-report-2017

(via the Internet) to a service-oriented platform using large

remote server clusters, and massive storage to data centres. In

this scenario, computing relies on sharing a pool of physical

and/or virtual resources, rather than on deploying local or

personal hardware and software. The process of virtualization

has enabled the cost-effectiveness and simplicity of cloud

computing solutions [24] (i.e., it exempts users from the

need to purchase and maintain complex computing hardware)

such as IaaS (infrastructure as a service), PaaS (platform as

a service), or SaaS (software as a service). Several cloud

computing resources are currently available commercially, on

a pay-as-you-go model from providers such as Amazon Web

Services (AWS) [25], Microsoft Azure [26], and Google’s

Compute Engine [27].

Cloud computing infrastructures can rely on several comput-

ing frameworks that support the processing of large data sets

in a distributed environment. For example, the MapReduce

model [28] is the basis of a large number of open-source im-

plementations. The most popular ones are Apache Hadoop [29]

and its variant, Apache Spark [30] (an in-memory computing

framework). Despite the recent advances in cloud computing

technology, not enough efforts have been devoted to exploiting

cloud computing infrastructures for the processing of HSI data.

However, cloud computing offers a natural solution for the

processing of large HSI databases, as well as an evolution of

previously developed techniques for other kinds of computing

platforms, mainly due to the capacity of cloud computing to

provide internet-scale, service-oriented computing [31], [32],

[33].

In this work, we focus on the problem of how to develop

scalable data analysis and compression techniques [34], [35],

[4], [36] with the goal of facilitating the management of

remotely sensed HSI data. Dimensionality reduction (DR) of

HSIs is a fundamental pre-processing step that is applied

before many data transfer, storage and processing operations. On

the one hand, when HSI data are efﬁciently compressed, they

can be handled more efﬁciently on-board satellite platforms

with limited storage and downlink bandwidth. On the other

hand, since HSI data lives primarily in a subspace [37], a

few informative features can be extracted from the hundreds

of highly correlated spectral bands that comprise HSI data

[38] without signiﬁcantly affecting the data quality (lossy

compression of HSIs can still retain informative data for

subsequent processing steps).

Speciﬁcally, this paper develops a new cloud implemen-

tation of HSI data compression. As in [39], we adopt the

Hadoop distributed ﬁle system (HDFS) and Apache Spark

as well as a map-reduce methodology [24] to carry out our

implementation. However, we address the DR problem

using a non-linear deep auto-encoder (AE) neural network

instead of the standard linear principal component analysis

(PCA) algorithm. The performance of our newly proposed

cloud-based AE is validated using two widely available and

known HSI data sets. Our experimental results show that

the proposed implementation can effectively exploit cloud

computing technology to efﬁciently perform non-linear com-

pression of large HSI data sets, while accelerating signiﬁcantly

the processing time in a distributed environment.


The remainder of the paper is organized as follows. Sec-

tion II provides an overview of the theoretical and opera-

tional details of the considered AE neural network for HSI

data compression, and the considered optimization method.

Section III presents our cloud-distributed AE network for

HSI data compression, describing the details of the network

conﬁguration and the distributed implementation. Section IV

evaluates the performance of the proposed approach using

two widely available HSI data sets, taking into account the

quality of the compression and signal reconstruction, and also

the computational efﬁciency of the implementation in a real

cloud environment. Finally, Section V concludes the work,

summarizing the obtained results and suggesting some future

research directions.

II. BACKGROUND

HSI data are characterized by their intrinsically complex

spectral characteristics, where samples of the same class

exhibit high variability due to data acquisition factors or

atmospheric and lighting interferers. DR and feature extraction

(FE) methods are fundamental tools for the extraction of dis-

criminative features that reduce the intra-class variability and

inter-class similarity [40] present in HSI data sets. Furthermore,

by reducing the high spectral dimensionality of HSIs, these

methods are able to alleviate the curse of dimensionality [41],

which makes HSI data difﬁcult to interpret by supervised

classiﬁers due to the Hughes phenomenon [42].

Several methods have been developed to perform DR and

FE from HSIs, such as the independent component analysis

(ICA) [43], [44] or the maximum noise fraction (MNF) [45],

[46], with PCA [47], [48], [49] being one of the most widely used methods for FE purposes. This unsupervised, linear algorithm

reduces the original high-dimensional and correlated feature

space to a lower-dimensional space of uncorrelated factors

(also called principal components or PCs) by applying an

orthogonal transformation through a projection matrix, which

makes it a simple yet efﬁcient algorithm. However, PCA is re-

stricted to a linear map-projection and is not able to learn non-

linear transformations. In this context, auto-associative neural

networks such as AEs [50] offer a more ﬂexible architecture

for FE and DR purposes, managing the non-linearities of the

data through an architecture made up of stacked layers and

non-linear activation functions (called stacked AE or SAE)

that can provide more detailed data representations from the

original input image (one per layer), which can be reused by

other HSI processing methods.

A. Auto-encoder (AE) Neural Network

Fig. 1. Graphic representation of a traditional auto-encoder for spectral compression and restoration of hyperspectral images.

Let us consider an HSI data cube $\mathbf{X} \in \mathbb{R}^{n_1 \times n_2 \times n_{bands}}$, where $n_1 \times n_2$ are the spatial dimensions and $n_{bands}$ is the number of spectral bands. $\mathbf{X}$ is traditionally observed by pixel-based algorithms as a collection of $n_1 \times n_2$ spectral samples, where each $\mathbf{x}_i \in \mathbb{R}^{n_{bands}} = [x_{i,1}, x_{i,2}, \cdots, x_{i,n_{bands}}]$ contains the spectral signature of the observed surface material. In this sense, the goal of DR methods is to obtain, for each $\mathbf{x}_i$, a vector $\mathbf{c}_i \in \mathbb{R}^{n_{new}}$ that captures the most representative information of $\mathbf{x}_i$ in a lower feature space, with $n_{new} \ll n_{bands}$. To achieve this goal, the SAE applies

an unsupervised symmetrical deep neural network to encode

the data in a lower-dimensional latent space, performing a

traditional embedding, and then decoding it to the original

space through a reconstruction stage. In fact, the SAE can

be interpreted as a mirrored net, where three main parts

can be identiﬁed (as shown in Fig. 1): i) the encoder or

mapping layers, ii) the middle or bottleneck layer, and iii)

the decoder or demapping layers. Based on the traditional

multilayer perceptron (MLP), the l-th layer deﬁned in the SAE

performs an affine transformation between the input data $\mathbf{x}_i^{(l)}$ and its set of weights $\mathbf{W}^{(l)}$ and biases $\mathbf{b}^{(l)}$, as Eq. (1) indicates:

$$\mathbf{x}_i^{(l+1)} = \mathcal{H}\left(\mathbf{x}_i^{(l)} \cdot \mathbf{W}^{(l)} + \mathbf{b}^{(l)}\right), \qquad (1)$$

where $\mathbf{x}_i^{(l+1)} \in \mathbb{R}^{n^{(l)}}$ is an abstract representation (or feature representation) of the original input data $\mathbf{x}_i$ in the feature space obtained by the $n^{(l)}$ neurons that compose the $l$-th layer, where the output of the $k$-th neuron is obtained as the dot product between the $n^{(l-1)}$ outputs of the previous layer and its weights, passed through an activation function that is usually implemented by the Rectified Linear Unit (ReLU) [51], i.e. $\mathcal{H}(x) = \max(0, x)$. Finally, the $k$-th feature in $\mathbf{x}_i^{(l+1)}$ can be obtained as:

$$x_{i,k}^{(l+1)} = \mathcal{H}\left(\sum_{j=1}^{n^{(l-1)}} x_{i,j}^{(l)} \cdot w_{k,j}^{(l)} + b^{(l)}\right). \qquad (2)$$

With this in mind, the SAE applies two main processing steps to each input sample $\mathbf{x}_i$. The first one, known as the coding stage, performs the embedding of the data, mapping it from the $\mathbb{R}^{n_{bands}}$ space to the $\mathbb{R}^{n_{new}}$ latent space. That is, the $n_{encoder}$ layers of the encoder map their input data to a projected representation following Eqs. (1) and (2), until reaching the bottleneck layer. As a result, the bottleneck layer contains the projection of each $\mathbf{x}_i \in \mathbb{R}^{n_{bands}}$ in its latent space, defined by its $n_{new}$ neurons, $\mathbf{c}_i \in \mathbb{R}^{n_{new}}$. Thus,


the SAE can generate compressed ($n_{new} < n_{bands}$), extended ($n_{new} > n_{bands}$), or equally-sized ($n_{new} = n_{bands}$) representations, depending on the final dimension of the code vector $\mathbf{c}_i$.

The second stage performs the opposite operation, i.e. the

decoding, where the network tries to recover the original

information, obtaining an approximate reconstruction of the

original input vector [52]. In this case, the $n_{decoder}$ layers of the decoder demap the code vector $\mathbf{c}_i$ until reaching the output layer, where a reconstructed sample $\mathbf{x}'_i$ is obtained. Eq. (3) gives an overview of the encoding-decoding process followed by the SAE:

$$\begin{aligned}
\mathbf{c}_i &\leftarrow \text{for } l \text{ in } n_{encoder}: \; \mathbf{x}_i^{(l+1)} = \mathcal{H}\left(\mathbf{x}_i^{(l)} \cdot \mathbf{W}^{(l)} + \mathbf{b}^{(l)}\right) \\
\mathbf{x}'_i &\leftarrow \text{for } ll \text{ in } n_{decoder}: \; \mathbf{c}_i^{(ll+1)} = \mathcal{H}\left(\mathbf{c}_i^{(ll)} \cdot \mathbf{W}^{(ll)} + \mathbf{b}^{(ll)}\right)
\end{aligned} \qquad (3)$$

In order to obtain a lower-dimensional (but more discriminative) representation of the input data, the network parameters are iteratively adjusted in an unsupervised fashion, where the optimizer minimizes the reconstruction error between the input data at the encoding stage, $\mathbf{x}_i$, and its reconstruction at the end of the decoding stage, $\mathbf{x}'_i$. This error function, given by Eq. (4), is usually implemented in the form of a mean squared error (MSE):

$$E(\mathbf{X}) = \min \|\mathbf{X} - \mathbf{X}'\|^2 = \min \sum_{i=1}^{n_1 \cdot n_2} \|\mathbf{x}_i - \mathbf{x}'_i\|^2. \qquad (4)$$
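To make the encoding-decoding pass of Eqs. (1)-(4) concrete, the following is a minimal NumPy sketch (not the actual distributed implementation described in Section III); the layer sizes follow the topology later summarized in Table I, and all variable names are illustrative only.

```python
import numpy as np

def relu(z):
    # Activation H(x) = max(0, x) used in Eq. (1)
    return np.maximum(0.0, z)

def init_layer(n_in, n_out, rng):
    # Small random weights and zero biases for one dense layer
    return rng.standard_normal((n_in, n_out)) * 0.01, np.zeros(n_out)

rng = np.random.default_rng(0)
n_bands = 220                                   # spectral bands (e.g., AVIRIS/Hyperion)
sizes = [n_bands, 140, 60, 140, n_bands]        # encoder -> bottleneck -> decoder
layers = [init_layer(a, b, rng) for a, b in zip(sizes[:-1], sizes[1:])]

X = rng.random((1000, n_bands))                 # a batch of spectral pixels (one per row)

# Forward pass: Eq. (1) applied layer by layer; the 60-neuron output is the
# compressed code c_i, and the last layer gives the reconstruction x'_i
A, codes = X, None
for idx, (W, b) in enumerate(layers):
    A = relu(A @ W + b)
    if idx == 1:                                # bottleneck layer l(3)
        codes = A
X_rec = A

# Reconstruction error following Eq. (4), averaged over the batch (MSE)
mse = np.mean(np.sum((X - X_rec) ** 2, axis=1))
print(codes.shape, X_rec.shape, mse)
```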

B. Broyden-Fletcher-Goldfarb-Shanno (BFGS) Algorithm

After describing the operational procedure of SAEs, it is

now important to observe the network optimization process.

As any artiﬁcial neural network with back-propagation, the

optimizer tries to ﬁnd the set of parameters (synaptic weights

and biases) that, for a given network architecture, minimize

the error function E(X)deﬁned by Eq. (4). This function

evaluates how well the neural network fits the dataset $\mathbf{X}$, and depends on the adaptive and learnable parameters of the network, which can be denoted as $\mathcal{W}$, so we write $E(\mathbf{X},\mathcal{W})$. As $E(\mathbf{X},\mathcal{W})$

is non-linear, its optimization must be carried out iteratively,

reducing its value until an adequate stopping criterion is

reached. In this sense, standard optimizers back-propagate the

error signal through the network architecture calculating, for

each learnable parameter, the gradient of the error, i.e. the

direction and displacement that the parameter must undergo

in order to minimize the ﬁnal error (also interpreted as the

importance of that parameter when obtaining the ﬁnal error).

Mathematically, the updating of $\mathcal{W}$ in the $t$-th epoch can be calculated by Eq. (5):

$$\mathcal{W}_{t+1} = \mathcal{W}_t + \Delta\mathcal{W}, \quad \text{with } \Delta\mathcal{W} = \mu_t \cdot \mathbf{p}_t, \qquad (5)$$

where $\mu_t$ and $\mathbf{p}_t$ are the learning rate (a positive scalar) and

the descent search direction, respectively [53]. The main goal

of any optimizer is to obtain the correct ptin order to descend

properly in the error function until the minimum is reached.

As opposed to standard optimizers, traditional Newton-based methods determine the descent direction $\mathbf{p}_t$ using the second-derivative information contained in the Hessian matrix, rather than just the gradient information, thus stabilizing the process:

$$\begin{aligned}
\mathbf{H}_t \cdot \mathbf{p}_t &= -\nabla E(\mathbf{X}, \mathcal{W}_t) \\
\mathbf{p}_t &= -\mathbf{H}_t^{-1} \cdot \nabla E(\mathbf{X}, \mathcal{W}_t) \\
\mathcal{W}_{t+1} &= \mathcal{W}_t - \mu_t \cdot \mathbf{H}_t^{-1} \cdot \nabla E(\mathbf{X}, \mathcal{W}_t),
\end{aligned} \qquad (6)$$

where $\nabla E(\mathbf{X},\mathcal{W}_t)$ is the gradient of the error function evaluated with the network's parameters at the $t$-th epoch, $\mathcal{W}_t$, and $\mathbf{H}_t$ and $\mathbf{H}_t^{-1}$ are, respectively, the Hessian matrix and its inverse, obtained at the $t$-th epoch. However, these methods

obtain the Hessian matrix and its inverse at each epoch,

which is quite expensive to compute, requiring a large amount

of memory. Instead of that, the Broyden-Fletcher-Goldfarb-

Shanno (BFGS) method [54] performs an estimation of how

the Hessian matrix has changed in each epoch, obtaining an

approximation (instead of the full matrix) that is improved

every epoch. In fact, as any algorithm of the family of

multivariate minimization quasi-Newton methods, the BFGS algorithm modifies the last expression of Eq. (6) as follows:

$$\mathcal{W}_{t+1} = \mathcal{W}_t - \mu_t \cdot \mathbf{G}_t \cdot \nabla E(\mathbf{X}, \mathcal{W}_t), \qquad (7)$$

where $\mathbf{G}_t$ is the inverse Hessian approximation matrix (usually, when $t = 0$, the initial approximation matrix is the identity matrix, $\mathbf{G}_0 = \mathbf{I}$). This $\mathbf{G}_t$ is updated at each epoch by means of an update matrix:

$$\mathbf{G}_{t+1} = \mathbf{G}_t + \mathbf{U}_t. \qquad (8)$$

However, such update needs to comply with the quasi-Newton condition, which is described below. Assuming that $E(\mathbf{X},\mathcal{W})$ is continuous for $\mathcal{W}_t$ and $\mathcal{W}_{t+1}$ (with gradients $\mathbf{g}_t = \nabla E(\mathbf{X},\mathcal{W}_t)$ and $\mathbf{g}_{t+1} = \nabla E(\mathbf{X},\mathcal{W}_{t+1})$, respectively) and the Hessian $\mathbf{H}$ is constant, then Eq. (9) is satisfied:

$$\begin{aligned}
&\mathbf{q}_t \equiv \mathbf{g}_{t+1} - \mathbf{g}_t \quad \text{and} \quad \mathbf{p}_t \equiv \mathcal{W}_{t+1} - \mathcal{W}_t \\
&\text{Secant condition on the Hessian: } \mathbf{q}_t = \mathbf{H} \cdot \mathbf{p}_t \\
&\text{Secant condition on the inverse: } \mathbf{H}^{-1} \cdot \mathbf{q}_t = \mathbf{p}_t
\end{aligned} \qquad (9)$$

Since $\mathbf{G} = \mathbf{H}^{-1}$, the last expression in Eq. (9) can be modified to $\mathbf{G} \cdot \mathbf{q}_t = \mathbf{p}_t$, so the approximation matrix $\mathbf{G}$ can be obtained (at each epoch $t$) as a combination of the linearly independent directions and their respective gradients. Following the Davidon, Fletcher and Powell (DFP) rank-2 formula [55], $\mathbf{G}$ can be updated using Eq. (10):

$$\mathbf{G}_{t+1} = \mathbf{G}_t + \frac{\mathbf{p}_t \cdot \mathbf{p}_t^\top}{\mathbf{p}_t^\top \cdot \mathbf{q}_t} - \frac{\mathbf{G}_t \cdot \mathbf{q}_t \cdot \mathbf{q}_t^\top}{\mathbf{q}_t^\top \cdot \mathbf{G}_t \cdot \mathbf{q}_t} \cdot \mathbf{G}_t. \qquad (10)$$

Finally, the BFGS method updates its approximation matrix by computing the complementary formula of the DFP method, changing $\mathbf{G}$ by $\mathbf{H}$ and $\mathbf{p}_t$ by $\mathbf{q}_t$, so Eq. (10) is finally modified as follows:

$$\mathbf{H}_{t+1} = \mathbf{H}_t + \frac{\mathbf{q}_t \cdot \mathbf{q}_t^\top}{\mathbf{q}_t^\top \cdot \mathbf{p}_t} - \frac{\mathbf{H}_t \cdot \mathbf{p}_t \cdot \mathbf{p}_t^\top}{\mathbf{p}_t^\top \cdot \mathbf{H}_t \cdot \mathbf{p}_t} \cdot \mathbf{H}_t. \qquad (11)$$


As the BFGS method intends to compute the inverse of $\mathbf{H}$ and $\mathbf{G} = \mathbf{H}^{-1}$, it inverts Eq. (11) to analytically obtain the final update of the approximation matrix:

$$\mathbf{G}_{t+1} = \mathbf{G}_t + \left(1 + \frac{\mathbf{q}_t^\top \cdot \mathbf{G}_t \cdot \mathbf{q}_t}{\mathbf{q}_t^\top \cdot \mathbf{p}_t}\right) \frac{\mathbf{p}_t \cdot \mathbf{p}_t^\top}{\mathbf{p}_t^\top \cdot \mathbf{q}_t} - \frac{\mathbf{p}_t \cdot \mathbf{q}_t^\top \cdot \mathbf{G}_t + \mathbf{G}_t \cdot \mathbf{q}_t \cdot \mathbf{p}_t^\top}{\mathbf{q}_t^\top \cdot \mathbf{p}_t} \qquad (12)$$

Algorithm 1 Broyden-Fletcher-Goldfarb-Shanno Algorithm
1: procedure BFGS($\mathcal{W}_t$: current parameters of the neural network, $E(\mathbf{X},\mathcal{W})$: error function, $\mathbf{G}_t$: current approximation to the inverse Hessian)
2:   $\mathbf{g}_t = \nabla E(\mathbf{X}, \mathcal{W}_t)$
3:   $\mathbf{p}_t = -\mathbf{G}_t \cdot \mathbf{g}_t$
4:   $\mathcal{W}_{t+1} = \mathcal{W}_t + \mu_t \cdot \mathbf{p}_t$   ($\mu_t$ obtained by line search)
5:   $\mathbf{g}_{t+1} = \nabla E(\mathbf{X}, \mathcal{W}_{t+1})$
6:   $\mathbf{q}_t = \mathbf{g}_{t+1} - \mathbf{g}_t$
7:   $\mathbf{p}_t = \mathcal{W}_{t+1} - \mathcal{W}_t$
8:   $\mathbf{A} = \left(1 + \frac{\mathbf{q}_t^\top \cdot \mathbf{G}_t \cdot \mathbf{q}_t}{\mathbf{q}_t^\top \cdot \mathbf{p}_t}\right) \frac{\mathbf{p}_t \cdot \mathbf{p}_t^\top}{\mathbf{p}_t^\top \cdot \mathbf{q}_t}$
9:   $\mathbf{B} = \frac{\mathbf{p}_t \cdot \mathbf{q}_t^\top \cdot \mathbf{G}_t + \mathbf{G}_t \cdot \mathbf{q}_t \cdot \mathbf{p}_t^\top}{\mathbf{q}_t^\top \cdot \mathbf{p}_t}$
10:  $\mathbf{G}_{t+1} = \mathbf{G}_t + \mathbf{A} - \mathbf{B}$
     return $\mathcal{W}_{t+1}$, $\mathbf{G}_{t+1}$
11: end procedure
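For reference, a minimal NumPy sketch of one update of Algorithm 1 is given below, applied to a toy quadratic function instead of the network error $E(\mathbf{X},\mathcal{W})$; the fixed step size used here is an assumption of the sketch that replaces the line search of step 4.

```python
import numpy as np

def bfgs_step(w, G, grad_fn, mu=0.1):
    """One BFGS update as in Algorithm 1 (G approximates the inverse Hessian)."""
    g = grad_fn(w)                         # step 2: gradient at current parameters
    p = -G @ g                             # step 3: descent direction
    w_new = w + mu * p                     # step 4: fixed step instead of line search
    q = grad_fn(w_new) - g                 # step 6: gradient difference
    s = w_new - w                          # step 7: parameter difference
    qs = q @ s
    A = (1.0 + (q @ G @ q) / qs) * np.outer(s, s) / qs      # step 8
    B = (np.outer(s, q) @ G + G @ np.outer(q, s)) / qs      # step 9
    return w_new, G + A - B                                  # step 10

# Toy example: minimize 0.5 * w^T D w with D diagonal
D = np.diag([1.0, 10.0])
grad = lambda w: D @ w
w, G = np.array([1.0, 1.0]), np.eye(2)
for _ in range(20):
    w, G = bfgs_step(w, G, grad)
print(w)   # approaches the minimizer [0, 0]
```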

Algorithm 1 provides a general overview of how the BFGS

method works in one epoch. A weakness of BFGS is that it

requires the computation of the gradient on the full dataset,

consuming a large amount of memory to properly run the

optimization. Taking into account the dimensionality of HSIs,

we can conclude that this method is not able to scale with the

number of samples [56]. In order to overcome this limitation,

and with the aim of speeding up the computation of both

the forward (afﬁne transformations) and backward (optimizer)

steps of the AE for DR of HSIs, in the following section we

develop a distributed solution for cloud computing environ-

ments.

III. PROPOSED IMPLEMENTATION

A. Distributed Framework

We have developed a completely new distributed AE for HSI

data analysis2. In this context, two problems have been specif-

ically addressed in this work: i) the computing engine, and ii)

the distributed programming model over the cloud architecture.

Regarding the ﬁrst problem, our distributed implementation

of the network model is run on top of a standalone Spark

cluster, due to its capacity to provide fast processing of

large data volumes on distributed platforms, in addition to

fault-tolerance. Furthermore, the Spark cluster is characterized

by a master-slave architecture, which makes it very ﬂexible.

Specifically, a Spark cluster is formed by a master node, which manages how the resources are used and distributed in the cluster by hosting a Java virtual machine (JVM) driver and the scheduler (which distributes the tasks among the execution nodes), and by N worker nodes (there can be more than one worker per node) that execute the program tasks by creating a Java distributed agent, called executor (where tasks are computed), and store the data partitions (see Fig. 2).

2 Code available on: https://github.com/jgallardst/cloud-nn-hsi

Fig. 2. Graphic representation of a generic Spark cluster, which is composed of one client node and N worker nodes, where in each node several executor Java virtual machines run in parallel over several data partitions.

In relation to the second point, the adopted programming

model to perform the implementation of the distributed AE

is based on organizing the original HSI data in tuples or

key/value pairs, in order to apply the MapReduce model [39],

which divides the data processing task into two distributed

operations: i) mapping, which processes a set of data-tuples,

generating intermediate key-value pairs, and ii) reduction,

which gathers all the intermediate pairs obtained by the

mapping to generate the ﬁnal result. In order to achieve

this behavior, data information in Spark is abstracted and

encapsulated into a fault-tolerant data structure called Resilient

Distributed Dataset (RDD). These RDDs are organized as

distributed collections of data, which are scattered by Spark

across the worker nodes when they are needed on the succes-

sive computations, being persisted on the memory of the nodes

or on the disk. This architecture allows for the parallelization

of the executions, achieved by performing MapReduce tasks

over the RDDs on the nodes. Moreover, two basic operations

can be performed on a RDD: i) the so-called transformations,

which are based on applying an operation to every row on

a partition, resulting in another RDD, and ii) actions, which

retrieve a value or a set of values that can be both RDD data or

the result of an operation where some RDD data are involved.

Operations are queued until an action is called; the needed

transformations are placed into a dependency graph, where

each node is a job stage, following a lazy execution paradigm.

This means that operations are not performed until they are

really needed, avoiding the repetition of a single operation

more than once.
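To make the distinction between transformations and actions (and the lazy execution paradigm) concrete, the following hypothetical PySpark fragment can be considered; it is illustrative only, since the implementation of this work is written in Scala (see Section IV-A).

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lazy-rdd-demo").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(range(10), numSlices=4)   # distributed collection, 4 partitions

# Transformations: nothing is executed yet, only the dependency graph grows
squared = rdd.map(lambda x: x * x)
evens = squared.filter(lambda x: x % 2 == 0)

# Action: the queued transformations run now, once, across the partitions
total = evens.reduce(lambda a, b: a + b)
print(total)   # 120 = 0 + 4 + 16 + 36 + 64

spark.stop()
```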

In order to enable a simple and easy mechanism for man-

aging large data sets, the Spark environment provides another


level of abstraction that uses the concept of a Dataframe. These Dataframes allow data to be organized into named columns, which makes them easier to manipulate (as in relational tables, columns can be accessed by the column name instead of by the index).
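A hypothetical PySpark fragment illustrating the named-column access provided by Dataframes (again illustrative only; the actual implementation uses the Scala API):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dataframe-demo").getOrCreate()

# Each row could hold, e.g., a pixel id and two spectral features
rows = [(0, 0.12, 0.34), (1, 0.56, 0.78)]
df = spark.createDataFrame(rows, schema=["pixel_id", "band_1", "band_2"])

# Columns are selected by name rather than by positional index
df.select("pixel_id", "band_2").show()

spark.stop()
```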

With this in mind, the Spark standalone cluster functionality

can be summarized as follows:

1) The master node (also called the driver node) creates and

manages the Spark driver (see Fig. 2), a Java process

that contains the SparkContext of the application.

2) The driver context performs the data partitioning and

parallelization between the worker nodes, assigning to

each one a number of partitions, which depends on two

main aspects: the block size and the way the data are

stacked. Also, the driver creates the executors on the

worker nodes, which store data partitions on the worker

node and perform tasks on them.

3) When an action is called, a job is launched and the

master coordinates how the associated tasks are dis-

tributed into the different executors. In order to reduce the data exchange time, the Spark driver attempts to perform "smart" task allocation, preferring to assign each task to an executor located in the worker that already holds the data partition on which the task operates.

4) When all the tasks on a given stage are ﬁnished, the

Scheduler allocates another stage of the job (if it was

a transformation), or retrieves the ﬁnal output (if it was

an action).

Algorithm 2 shows a general overview of how our algorithm

is pipelined in the considered Spark cluster.

Algorithm 2 Iterative Process
1: procedure SPARK FLOW
2:   PartitionedData ← Spark.parallelizeData()
3:   t ← 0
4:   while t < n_iterations do
5:     broadcastOutputData()
6:     for each partition ∈ PartitionedData do
7:       PartitionedData.applyTask()
8:     end for
9:     retrieveOutputData()
10:    t ← t + 1
11:  end while
12: end procedure
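The following sketch mirrors the loop of Algorithm 2 with hypothetical PySpark/NumPy code: the driver broadcasts the current weights, each partition computes a partial loss and gradient with a map-style task, and the results are gathered back on the driver for the update. The functions partition_loss_and_grad and update_weights are placeholders for the per-partition forward/backward pass and the L-BFGS step detailed in the next subsection, not actual library calls.

```python
import numpy as np
from pyspark.sql import SparkSession

def partition_loss_and_grad(rows, weights):
    # Placeholder for the per-partition forward/backward pass of Section III-B:
    # returns (local loss, local gradient, number of samples) for this partition.
    data = np.array(list(rows))
    if data.size == 0:
        return []
    loss = float(np.mean((data - data.mean()) ** 2))   # stand-in loss
    grad = np.full_like(weights, loss)                  # stand-in gradient
    return [(loss, grad, data.shape[0])]

def update_weights(weights, grad, mu=0.01):
    # Placeholder for the optimizer step performed on the driver.
    return weights - mu * grad

spark = SparkSession.builder.appName("distributed-ae-loop").getOrCreate()
sc = spark.sparkContext

pixels = np.random.rand(1000, 8)                  # toy "HSI matrix"
partitioned = sc.parallelize(pixels.tolist(), 4)  # Algorithm 2, line 2
weights = np.zeros(8)

for t in range(5):                                # Algorithm 2, lines 4-11
    bw = sc.broadcast(weights)                    # broadcast current parameters
    results = partitioned.mapPartitions(
        lambda rows: partition_loss_and_grad(rows, bw.value)).collect()
    losses, grads, counts = zip(*results)
    grad = np.average(grads, axis=0, weights=counts)   # aggregate gradients
    weights = update_weights(weights, grad)
    print(t, np.average(losses, weights=counts))

spark.stop()
```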

B. Cloud Implementation

This section describes in detail the full distributed training

process, from the parallelization of HSI data across nodes to

the intrinsic logic of each training step, explaining the beneﬁts

of our distributed training algorithm. Fig. 3 gives a general

overview of the full data pipeline developed in this work.

Fig. 3. Data pipeline of our distributed auto-encoder, where the input HSI cube is first reshaped into a matrix and then split into several partitions allocated to the Spark worker nodes, each composed of several rows where each one contains BS spectral pixels. These data partitions are then scaled and duplicated in order to obtain the input network data and the corresponding output network data. The AE is then executed and, for each iteration t, the gradients are collected by the Spark driver, which calculates the final gradient and performs the optimization with the L-BFGS algorithm. The updated weights are finally broadcast to each neural model contained in the cluster.

In the beginning, the original 3-dimensional HSI data cube $\mathbf{X} \in \mathbb{R}^{n_1 \times n_2 \times n_{bands}}$, where $n_1 \times n_2$ are the spatial dimensions (height and width) and $n_{bands}$ is the spectral dimension given by the number of spectral channels, is reshaped into an HSI matrix $\mathbf{X} \in \mathbb{R}^{n_{pixels} \times n_{bands}}$, where $n_{pixels} = n_1 \times n_2$, i.e., each row collects a full spectral pixel and each column stores the corresponding value in the spectral band. This matrix $\mathbf{X}$ is read by the Spark driver, which collects the original HSI data and partitions it into $P$ smaller subsets that are delivered to the worker nodes in parallel. These workers store the obtained partitions on their local disks. In this sense, each data partition composes an RDD.

It must be noted that complex neural network topologies lead to heavy RAM usage on the driver node. Since Spark transformations apply an operation to every row of the RDD, the fewer the number of rows, the fewer the number of operations that must be carried out. In order to improve the computation of the distributed model, a block size (BS) hyperparameter is provided, which indicates how many pixels should be stacked into a single row in order to compute them together. With this observation in mind, the $p$-th data partition (with $p = 1, \cdots, P$) can be seen as a 2-dimensional matrix ${}^{(p)}\mathbf{D} \in \mathbb{R}^{n_{rows} \times (BS \cdot n_{bands})}$ composed of $n_{rows}$ rows, where each one stores $BS$ concatenated spectral


pixels, i.e. ${}^{(p)}\mathbf{d}_j \in \mathbb{R}^{(BS \cdot n_{bands})} = [\mathbf{x}_i, \mathbf{x}_{i+1}, \cdots, \mathbf{x}_{i+BS}]$. In the end, each data partition ${}^{(p)}\mathbf{D}$ stores $BS \cdot n_{rows}$ pixels.

The resulting partitions are then distributed across the worker

nodes. Such distribution allows the executors, located in each

worker node, to apply the subsequent tasks to those partitions

that each worker receives.
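A small NumPy sketch of this reshaping and block-stacking step is shown below, under the simplifying assumption of a toy cube and a block size BS that divides the number of pixels exactly (padding or a shorter last row would be needed otherwise):

```python
import numpy as np

n1, n2, n_bands = 6, 4, 220           # toy spatial/spectral dimensions
BS = 8                                 # pixels stacked per row

cube = np.random.rand(n1, n2, n_bands)

# Cube -> matrix: one spectral pixel per row
X = cube.reshape(n1 * n2, n_bands)     # shape (n_pixels, n_bands)

# Stack BS consecutive pixels per row, as done before partitioning
n_rows = X.shape[0] // BS              # assumes BS divides n_pixels
D = X[:n_rows * BS].reshape(n_rows, BS * n_bands)
print(X.shape, D.shape)                # (24, 220) and (3, 1760)
```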

After distributing the data into RDDs, a distributed data

analysis process begins prior to the application of neural

network-based processing. In the ﬁrst step, the data contained

in each partition ${}^{(p)}\mathbf{D}$ are scaled in a distributed way, taking advantage of the cloud architecture and the available parallelization of resources. In this sense, each partition's row ${}^{(p)}\mathbf{d}_j$ (and, internally, each pixel contained within) is transformed based on the global maximum and minimum features ($\mathbf{x}_{max}$ and $\mathbf{x}_{min}$) of the whole image $\mathbf{X}$, and the column-local maximum and minimum features (${}^{(p)}\mathbf{d}_{max}$ and ${}^{(p)}\mathbf{d}_{min}$) of the $p$-th partition where the data are allocated:

$$\begin{aligned}
{}^{(p)}\hat{\mathbf{d}}_j &= \frac{{}^{(p)}\mathbf{d}_j - {}^{(p)}\mathbf{d}_{min}}{{}^{(p)}\mathbf{d}_{max} - {}^{(p)}\mathbf{d}_{min}} \\
{}^{(p)}\mathbf{d}_j &= {}^{(p)}\hat{\mathbf{d}}_j \cdot (\mathbf{x}_{max} - \mathbf{x}_{min}) + \mathbf{x}_{min}
\end{aligned} \qquad (13)$$
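A minimal NumPy sketch of the scaling in Eq. (13), applied to one partition held by a worker, could look as follows (the small epsilon that avoids division by zero is an addition of the sketch, not part of Eq. (13)):

```python
import numpy as np

def scale_partition(D, x_min, x_max):
    # Eq. (13): local min-max normalization per column, then re-expansion
    # to the global dynamic range of the whole image X
    d_min, d_max = D.min(axis=0), D.max(axis=0)
    D_hat = (D - d_min) / (d_max - d_min + 1e-12)   # small eps avoids /0
    return D_hat * (x_max - x_min) + x_min

X = np.random.rand(1000, 220) * 4095.0        # toy image, e.g. 12-bit radiances
x_min, x_max = X.min(), X.max()
partition = X[:250]                           # rows held by one worker
print(scale_partition(partition, x_min, x_max).shape)
```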

Once the HSI data has been split into partitions and scaled,

the next step consists of the application of the AE model.

The proposed AE is composed of 5 layers, as summarized in Table I. These layers are: $l^{(1)}$, the input layer that receives the spectral signature contained in each pixel $\mathbf{x}_i$ of $\mathbf{X}$ (i.e., the rows of the distributed partitions), composed of as many neurons as spectral bands; $l^{(2)}$, $l^{(3)}$ and $l^{(4)}$, the hidden AE layers; and $l^{(5)}$, the output layer that obtains the reconstructed signature $\mathbf{x}'_i$, also composed of as many neurons as spectral bands.

TABLE I
TOPOLOGY OF THE PROPOSED AUTO-ENCODER NEURAL NETWORK FOR HYPERSPECTRAL IMAGE ANALYSIS

Layer ID          | l(1)    | l(2) | l(3) | l(4) | l(5)
Neurons per layer | n_bands | 140  | 60   | 140  | n_bands

With the topology described in Table I in mind, the encoder part is composed of $l^{(1)}$, $l^{(2)}$ and $l^{(3)}$, which perform the mapping from the original spectral space to the latent space of the bottleneck layer $l^{(3)}$. In addition, the decoder part is composed of $l^{(3)}$, $l^{(4)}$ and $l^{(5)}$, which perform the de-mapping from the latent space of $l^{(3)}$ back to the original spectral space.

At this point, it is interesting to briefly comment on the operation of the AE network. In order to correctly propagate the data through the network, from each partition ${}^{(p)}\mathbf{D} \in \mathbb{R}^{n_{rows} \times (BS \cdot n_{bands})}$, a matrix of unstacked pixels ${}^{(p)}\mathbf{X} \in \mathbb{R}^{(BS \cdot n_{rows}) \times n_{bands}}$ is extracted, i.e. the $BS$ spectral pixels contained in each ${}^{(p)}\mathbf{d}_j = [\mathbf{x}_i, \mathbf{x}_{i+1}, \cdots, \mathbf{x}_{i+BS}]$ (with $j = 1, \cdots, n_{rows}$ and $i = 1, \cdots, n_{pixels}$) are extracted one by one to create the rows of ${}^{(p)}\mathbf{X}$, denoted as ${}^{(p)}\mathbf{x}_k$ [with $k = 1, \cdots, (BS \cdot n_{rows})$], which is the level at which the AE operates.

Every training iteration $t$ is performed using the traditional neural network forward-backward procedure, in addition to a tree-aggregate operation that computes and sums the executors' gradients and losses to return a single loss value and a matrix of gradients. Each executor computes its loss by forwarding the input network data ${}^{(p)}\mathbf{X}$ through the AE layers, and comparing the $l^{(5)}$ layer's output vector with the vector of input features, following Eq. (4) and obtaining (at each $t$) the corresponding MSE of the partition: ${}^{(p)}\mathrm{MSE}_t = E({}^{(p)}\mathbf{X}, \mathcal{W}_t)$. Gradients are then computed by back-propagating the error signal through the AE, obtaining for each partition the ${}^{(p)}\mathbf{G}_t$ matrix at iteration $t$. Each gradient matrix is reduced in the driver, which runs the optimizer in order to obtain the final matrix $\Delta\mathcal{W}_t$. This matrix indicates how much each neuron weight should be modified before finishing the $t$-th training iteration, based on how that neuron impacts the output. Fig. 4 gives a graphical overview of the adopted training procedure.

If we denote by $P$ the total number of partitions and by ${}^{(p)}\mathbf{X} \in \mathbb{R}^{(BS \cdot n_{rows}) \times n_{bands}}$ the $p$-th unstacked partition data, composed of $(BS \cdot n_{rows})$ normalized rows/feature vectors of $n_{bands}$ spectral features, i.e. ${}^{(p)}\mathbf{x}_k \in \mathbb{R}^{n_{bands}} = \left[{}^{(p)}x_{k,1}, \cdots, {}^{(p)}x_{k,n_{bands}}\right]$, and considering the $l$-th layer of the AE model, composed of $n^{(l)}_{neurons}$ neurons, its output is denoted by ${}^{(p)}\mathbf{X}^{(l+1)}$ and it is computed by adapting Eq. (1) into Eq. (14) as the matrix multiplication:

$${}^{(p)}\mathbf{X}^{(l+1)} = \mathcal{H}\left({}^{(p)}\mathbf{X}^{(l)} \cdot \mathbf{W}^{(l)} + \mathbf{b}^{(l)}\right), \qquad (14)$$

where the meaning of each term is:
• ${}^{(p)}\mathbf{X}^{(l+1)} \in \mathbb{R}^{(BS \cdot n_{rows}) \times n^{(l)}_{neurons}}$ is the matrix that represents the output of the neurons in layer $l$, with size $(BS \cdot n_{rows}) \times n^{(l)}_{neurons}$, where $n^{(l)}_{neurons}$ is the number of neurons of the $l$-th layer (in the case that $l = 5$, $n^{(5)}_{neurons} = n_{bands}$).
• ${}^{(p)}\mathbf{X}^{(l)} \in \mathbb{R}^{(BS \cdot n_{rows}) \times n^{(l-1)}_{neurons}}$ is the matrix that serves as input to the $l$-th layer, which contains the $(BS \cdot n_{rows})$ pixel vectors represented in the feature space of the previous layer, defined by $n^{(l-1)}_{neurons}$ neurons.
• $\mathbf{W}^{(l)} \in \mathbb{R}^{n^{(l-1)}_{neurons} \times n^{(l)}_{neurons}}$ is the matrix of weights, which connects the current $n^{(l)}_{neurons}$ neurons with the $n^{(l-1)}_{neurons}$ neurons of the previous layer, and $\mathbf{b}^{(l)}$ is the bias of the current layer.
• $\mathcal{H}$ is the ReLU activation function, which gives the following non-linear output: $\mathrm{ReLU}(x) = \max(0, x)$.

After data forwarding, the reconstructed data ${}^{(p)}\mathbf{X}'$ in the $p$-th partition at the $t$-th iteration is compared to the original input ${}^{(p)}\mathbf{X}$ by applying the MSE function defined by Eq. (4) on each executor. Executors then retrieve the error computed from the data they hold, obtaining a value ${}^{(p)}\mathrm{MSE}_t$ per partition. Then, the final error is obtained as the mean of all executor errors, as shown in Eq. (15):

$$\begin{aligned}
{}^{(p)}\mathrm{MSE}_t &= \frac{1}{(BS \cdot n_{rows})} \sum_{k=1}^{(BS \cdot n_{rows})} \|{}^{(p)}\mathbf{x}_k - {}^{(p)}\mathbf{x}'_k\|^2 \\
\mathrm{MSE}_t &= \frac{1}{P} \sum_{p=1}^{P} {}^{(p)}\mathrm{MSE}_t,
\end{aligned} \qquad (15)$$


Fig. 4. Distributed forward and backward pipelines of the training stage (at iteration t) after unstacking the hyperspectral pixels in each distributed data

partition (each one allocated to a different worker node).

where $(BS \cdot n_{rows})$ is the number of pixels that compose the $p$-th data partition, whereas ${}^{(p)}\mathbf{x}_k \in {}^{(p)}\mathbf{X}$ and ${}^{(p)}\mathbf{x}'_k \in {}^{(p)}\mathbf{X}'$ are the original input sample and the output reconstructed sample in the $p$-th data partition, respectively. Those partition errors are then back-propagated to compute the gradient matrix ${}^{(p)}\mathbf{G}_t$ of each partition at iteration $t$. In this sense, for each layer in the neural model (using the resulting outputs) the impact that each neuron has on the final error is obtained as the result of the ReLU's derivative of every output, which is defined as follows:

$$\mathcal{H}'(x) = \begin{cases} 0, & \text{if } x \leq 0 \\ 1, & \text{if } x > 0 \end{cases} \qquad (16)$$

Such impact can be denoted as ${}^{(p)}\mathbf{g}^L_t = [{}^{(p)}\mathbf{g}^{(1)}_t, \cdots, {}^{(p)}\mathbf{g}^{(5)}_t]$, where the $l$-th element ${}^{(p)}\mathbf{g}^{(l)}_t$ stores the impact of the $n^{(l)}_{neurons}$ neurons allocated in the $l$-th layer of the network.

The gradient of each partition, ${}^{(p)}\mathbf{G}_t$, is then computed by applying the double-precision general matrix-matrix multiplication (DGEMM) operation where, given three input matrices ($\mathbf{A}$, $\mathbf{B}$ and $\mathbf{C}$) and two constants ($\alpha$ and $\beta$), the obtained result is calculated by Eq. (17) and stored in $\mathbf{C}$:

$$\mathbf{C} = \alpha \cdot \mathbf{A} \cdot \mathbf{B} + \beta \cdot \mathbf{C}. \qquad (17)$$

DGEMM is performed to compute the entire gradient matrix in parallel, instead of computing each layer's gradient vectors separately. This allows us to make neural computations faster and more efficient in terms of power consumption. In this sense, each item of Eq. (17) has been replaced by:

• $\alpha = \frac{1}{n_{bands}}$ is a parameter regularizer.
• $\mathbf{A} = {}^{(p)}\mathbf{X}$ is the input data partition matrix.
• $\mathbf{B} = {}^{(p)}\mathbf{g}^L_t$ is the matrix representing the impact of each neuron on every layer of the neural network.
• $\beta = 1$ is also a parameter regularizer. As $\mathbf{C}$ should be unchanged, it has been set to 1.
• $\mathbf{C} = {}^{(p)}\mathbf{G}_{t-1}$ is initially the old gradient matrix of the $p$-th partition. After the updates resulting from the DGEMM operation, the current gradient ${}^{(p)}\mathbf{G}_t$ is stored in $\mathbf{C}$.

Finally, the gradient matrix $\mathbf{G}_t$ of the whole network is computed as the average of the sum of all partition gradients ${}^{(p)}\mathbf{G}_t$. The entire training process on each data partition is graphically illustrated in Fig. 4.
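As an illustration of the DGEMM-style gradient computation, the following NumPy sketch builds the gradient of a single dense layer for one partition as one matrix product accumulated into the previous gradient matrix, with $\alpha = 1/n_{bands}$ and $\beta = 1$ as in the substitutions above. Note that the data matrix is transposed here so that the product has the shape of a weight matrix, and that the impact matrix delta is a stand-in for the back-propagated signal ${}^{(p)}\mathbf{g}^L_t$; both details are assumptions of the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_bands, n_neurons = 512, 220, 140

X_p = rng.random((n_samples, n_bands))          # (p)X: unstacked partition data
delta = rng.random((n_samples, n_neurons))      # stand-in for the impact matrix
G_prev = np.zeros((n_bands, n_neurons))         # (p)G_{t-1}: previous gradient

alpha, beta = 1.0 / n_bands, 1.0
# DGEMM pattern C = alpha * A * B + beta * C, computed as one matrix product
G_t = alpha * (X_p.T @ delta) + beta * G_prev
print(G_t.shape)                                # (220, 140)
```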

The ﬁnal optimization step is performed locally on the

master node using a variant of the BFGS algorithm, called limited-memory BFGS (L-BFGS). Since BFGS needs a huge amount

of memory for the computation of the matrices, L-BFGS limits

the memory usage, so it ﬁts better into our implementation.

The optimizer uses the computed gradients and a step-size

procedure to get closer to a minimum of Eq. (4). The procedure

is repeated until a desired number of iterations, niterations, is

reached.

IV. EXPERIMENTAL EVALUATION

A. Conﬁguration of the Environment

In order to test our newly developed implementation, a

dedicated hardware and software environment based on a high-

end cloud computing paradigm has been adopted. The virtual resources have been provided by the Jetstream Cloud Services3 at the Indiana University Pervasive Technology Institute (PTI). Its user interface is based on the Atmosphere computing platform4 and it uses OpenStack5 as the operational software environment.

The hardware environment consists of a collection of cloud

computing nodes. In particular, the cluster consists of one

master node and eight slave nodes, which are hosted in virtual

machines with six virtual cores at 2.5 GHz each. Every node

has 16 GB of RAM and 60 GB of storage via a Block Storage

File System. As mentioned before, Spark performs as the

backbone for node interconnection, meanwhile data transfers

are supported by a local 4x40 Gbps dedicated network.

Each virtual machine runs Ubuntu 16.04 as operating sys-

tem, with Spark 2.1.1 and Java 1.8 serving as running plat-

forms. The Spark framework provides a distributed machine

learning library known as Spark Machine Learning Library

3https://jetstream-cloud.org/

4https://www.atmosphereiot.com/platform.html

5https://www.openstack.org/


(MLLib6), which is used as support for the implementation

of our distributed AE network for remotely sensed HSI data

analysis. Moreover, the proposed implementation has been

coded in Scala 2.11, compiled into Java bytecode and inter-

preted in JVMs. Finally, mathematical operations from MLLib

are handled by Breeze (the numerical processing library for

Scala), in its 0.12 version, and by Netlib 1.1.2. In this sense, Netlib wraps JVM calls into low-level Basic Linear Algebra Subprograms (BLAS) calls, so these operations execute faster than equivalent pure-JVM implementations.

B. Hyperspectral Datasets

With the aim of testing the performance of our newly

developed cloud-based and distributed AE network model,

two different HSI data sets have been considered in our

experiments. These data sets correspond to the full version of

the AVIRIS Indian Pines scene, referred hereinafter as the big

Indian Pines scene (BIP), and a set of images corresponding to

six different zones captured by the Hyperion spectrometer [57]

onboard NASA’s EO-1 Satellite, which we have designated as

Hyperion data set (HDS). Both data sets are characterized by

their huge size, which makes them ideal to be processed in a

cloud-distributed environment. In the following, we provide a

description of the aforementioned data sets.

• The big Indian Pines (BIP) scene (see Fig. 5) was collected by AVIRIS in 1992 [5] over agricultural fields in northwestern Indiana. The image comprises a full flightline with a total of 2678 × 614 pixels (with 20 meters per pixel spatial resolution), covering 220 spectral bands from 400 to 2500 nm.

•The Hyperion data set (HDS) is composed by six full

ﬂightlines (see Fig. 6) collected in 2016 by the Hyperion

spectrometer mounted on NASA’s EO-1 satellite, which

collects spectral signatures using 220 spectral channels

ranging from 357 to 2576 nm with 10-nm bandwidth. The

captured scenes have a spatial resolution of 30 meters per

pixel. The standard scene width and length are 7.7 km

and 42 km, respectively, with an optional increased scene

length of 185 km. In particular, the considered images

have been stacked and treated together as a single image

comprising 20401 ×256 pixels with the spectral range

mentioned above. These images have been provided by

the Earth Resources Observation and Science (EROS)

Center in GEOTIFF format7. Also, each scene is accompanied by an identifier in the format YDDDXXXML, which indicates the day of acquisition (DDD), the sensors that recorded the image (XXX, denoting Hyperion, ALI or AC, with 0 = off and 1 = on), the pointing mode (M, which can be N for Nadir, P for Pointed within path/row, or K for Pointed outside path/row), and the scene length (L, which can be F for Full scene, P for Partial scene, Q for Second partial scene, and S for Swath). Also, other letters can be used to create distinct entity IDs, for example to indicate the Ground/Receiving Station (GGG) or the Version Number (VV). In this case, the identifiers of the six considered images are: 065110KU, 035110KU, 212110KR, 247110KW, 261110KR and 321110KR.

6 https://spark.apache.org/mllib
7 These scenes are available online from the Earth Explorer site, https://earthexplorer.usgs.gov

C. Experiments and Discussion

Three different experiments have been conducted in order

to validate the performance of our cloud-distributed AE for

HSI data compression:

1) The ﬁrst experiment analyzes the scalability of our

cloud-distributed AE, using a medium-sized data set. For

this purpose, the BIP data set has been processed with a

ﬁxed number of training samples in the cloud environ-

ment described above, using one master and different

numbers of worker nodes. Here, we have reduced the

dimensionality of the BIP data set using PCA, retaining

the ﬁrst 60 principal components that account for most

of the variance in the original data.

2) The second experiment illustrates the internal paral-

lelization (at the core level) of the worker nodes. For

this purpose, the HDS has been processed using four

different percentages of training data and 8 worker nodes

in the considered cluster, each with 6 virtual cores. As in

the previous experiment, we reduced the dimensionality

of the HDS data set using PCA, retaining the ﬁrst

60 principal components that account for most of the

variance in the original data.

3) Finally, the third experiment tests the performance of our

cloud-distributed AE using different numbers of training

samples and worker nodes over a large data set. This ex-

periment allows us to understand the internal operation

of data partitions. In this sense, the HDS data set used in

the previous experiment has been considered again using

4 different training percentages and 6 different numbers

of worker nodes.

1) Experiment 1: Our ﬁrst experiment evaluates the per-

formance of the distributed implementation of the proposed

AE, using the BIP scene (reduced to 60 principal components

extracted by PCA) using 80% randomly selected samples to

create the training set and the remaining 20% of the samples

to create the test set. In order to demonstrate the scalability

of our cloud-distributed AE, the cloud environment has been

conﬁgured with one master node and different numbers of

worker nodes, speciﬁcally: 1, 2, 4, 8, 12 and 16 workers. In

order to show the robustness of our model, ﬁve Monte Carlo

runs have been executed, obtaining as a result the average and

the standard deviation of those executions.

Fig. 7 shows the obtained speed-up in a graphical way.

Such speed-up has been calculated as $T_1/T_n$, where $T_1$ is the execution time of the slowest execution with one worker node and $T_n$ is the average time of the executions with $n$ worker nodes. Comparing the theoretical and real speed-up

values obtained, it can be observed that the model is able to

scale very well, reaching a speed-up value that is very close

to the theoretical one with 2, 4 and 8 workers. However, for

12 workers and beyond, we can see that the communication

times between the nodes hamper the speed-up due to the

insufﬁcient amount of data, a fairly common behaviour in


Fig. 5. False RGB color map of the big Indian Pines (BIP) scene, using the spectral bands 88, 111 and 150.

Fig. 6. False RGB color map of the Hyperion data set (HDS), using the spectral bands 35, 110 and 150.

cloud environments, in which the main bottleneck occurs in

the communication between nodes. As a result, it is important

to make sure that there exists an adequate balance between

the total amount of data to be processed and the number

of processing nodes. Table II tabulates the performance data

collected in this experiment, which comprises reconstruction

errors, computation times and speed-ups. As we can observe in

Table II, the reconstruction errors achieved by the AE network

are very similar for different numbers of workers (with slight

changes due to the random selection of samples), keeping a

continuous value as the speed-up increases and more nodes

are introduced into the cluster. Also, it is worth noting that the

standard deviations of the error are very low, demonstrating

that the proposed implementation remains highly robust in

all cases. These very low errors are ﬁnally reﬂected in Fig.

8, which shows three reconstructed signatures of different

materials in the BIP scene. As it can be seen in Fig. 8, the

reconstructed signatures are extremely similar to the original

ones, a fact that allows for their exploitation in advanced

processing tasks such as classiﬁcation or spectral unmixing.

2) Experiment 2: Our second experiment explores the in-

ternal parallelization of each worker node (at the core level).

For this purpose, the cloud-distributed AE has been tested on

the HDS dataset, again reducing the spectral dimensionality to

60 principal components and randomly collecting 20%, 40%,

60% and 80% of training samples to create the training set,

and the remaining 80%, 60%, 40% and 20% to create the test

set. Moreover, 1 master node and 8 worker nodes (each one

with 6 virtual cores) have been considered to implement the

cloud environment.

Fig. 9(a) shows the results obtained in this experiment. If

we compare the theoretical speed-up values and the real ones


TABLE II
RECONSTRUCTION ERRORS (OBTAINED AS THE MSE BETWEEN THE ORIGINAL TEST SAMPLES AND THE ONES RECONSTRUCTED BY THE PROPOSED CLOUD-DISTRIBUTED AE), ALONG WITH THE PROCESSING TIMES AND SPEEDUPS OBTAINED FOR DIFFERENT NUMBERS OF WORKERS WHEN PROCESSING THE BIP IMAGE.

Workers    | 1        | 2       | 4       | 8       | 12      | 16
Loss (MSE) | 7.93e-5  | 7.92e-5 | 8.35e-5 | 9.51e-5 | 8.60e-5 | 8.35e-5
Time (s)   | 17398.74 | 8991.12 | 4518.39 | 2354.91 | 1803.27 | 1288.69
Speedup    | 1        | 1.9308  | 3.8506  | 7.3882  | 9.6484  | 13.5011

Fig. 7. Scalability of the proposed cloud-distributed network when processing

the BIP dataset with 1, 2, 4, 8, 12 and 16 worker nodes and 1 master node.

The red line indicates the theoretical speed-up value and the blue bars indicate

the actual values reached.

obtained, it can be seen that our implementation is able to

reach a speed-up that is almost identical to the theoretical one.

This is quite important, as the obtained results indicate that the scalability achieved within each node is almost linear with respect to the size of the HSI data assigned to it, thanks to the cores available in each node. In this way, the proposed

cloud-distributed AE implementation takes full advantage of

all the available resources, both in parallel (multi-core) and

distributed fashion.

3) Experiment 3: Our last experiment evaluates the scalability of the proposed cloud AE for HSI data compression using a very large data set. The HDS images have been considered for this purpose. Due to the large amount of data, this experiment has been split in two parts. The first part performs a comparison over a cloud environment composed of 1, 2, 4, 8, 12 and 16 worker nodes and 1 master node, employing 20% and 40% of the samples to create the training set, and the remaining 80% and 60% of the data to create the test set. However, due to the memory limitations of the workers, the second part performs a comparison over a cloud environment composed of 2, 4, 8, 12 and 16 worker nodes and 1 master node, employing 60% and 80% of the samples to create the training set, and the remaining 40% and 20% of the data to create the test set. In this context, it must be noted that, while in the first part the speed-up is computed with respect to the implementation with 1 worker node, in the second part it is computed with respect to the implementation with 2 worker nodes.

Figs. 9(b) and 9(c) show the results obtained in the two parts of this experiment in graphical form. In this case, it is interesting to observe that the theoretical speed-up and the linear speed-up values do not coincide. When we talk about linear speed-up, we normally refer to the expected speed-up when linear partitioning is performed in the cluster. However, in a real environment the partitioning is not always linear. In fact, we can observe a performance gap in the 8-node configuration. This can be explained by the relationship between the total number of cores in the cluster, C (obtained as the number of cores per node multiplied by the number of nodes), and the number of existing data partitions, P, given by Eq. (18):

(λ − 1) · C < P < λ · C,   (18)

where λ is a scalar; for instance, for the 8-node configuration its value is set to λ = 2. Taking Eq. (18) into consideration, and assuming that the cluster cores execute tasks as soon as they become free, the non-compliance of Eq. (18) means that some cores remain idle after finishing their first allocated tasks, so fine-grained parallelism is not fully exploited in this case.

In the considered cluster, since each node has 6 cores, a total of C = 6 × N working cores can be exploited. Furthermore, these C working cores allow the data partitions to be processed in batches of at most C tasks. For instance, when a configuration of 8 nodes is used, the cluster environment is made up of a total of C = 6 × 8 = 48 working cores. This indicates that, in one processing batch, Spark will launch at most 48 tasks. As Spark splits the HDS data into 58 partitions, 58 tasks must be executed, one per partition. However, only 48 tasks can be performed in each batch. This means that two batches must be run: the first one with 48 tasks and the second one with only 10. As a result, the second batch cannot fully exploit fine-grained parallelism, since only 10 cores are being used while the remaining 38 cores sit idle. This results in an unnecessary waste of computing resources.
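The batch arithmetic described above can be made explicit with a small helper; the 8-node and 12-node figures below reproduce the core and partition counts discussed in the text (6 cores per node and 58 partitions):

import math

def batch_schedule(nodes, cores_per_node, partitions):
    # One task per data partition, at most one task per core in each batch.
    total_cores = nodes * cores_per_node
    batches = math.ceil(partitions / total_cores)
    tasks_in_last_batch = partitions - (batches - 1) * total_cores
    idle_cores_in_last_batch = total_cores - tasks_in_last_batch
    return batches, idle_cores_in_last_batch

print(batch_schedule(8, 6, 58))   # (2, 38): second batch runs 10 tasks on 48 cores
print(batch_schedule(12, 6, 58))  # (1, 14): a single batch on 72 cores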

However, when the idle cores from the second batch can be put to work, the performance improves. This is the case of the 12-node configuration (C = 72), where the partitioning becomes more efficient, complying with Eq. (18). The linear speed-up obtained at the worker level combines with this core-level speed-up, leading to an overall speed-up calculated as the product of the two, as indicated by Eq. (19):

(T_1^w / T_n^w) · (T_1^p / T_n^p),   (19)

where T_n^w is the processing time at the worker level and T_n^p is the processing time at the core level.
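A minimal numerical sketch of Eq. (19) is given below; the timings are purely illustrative (not measured values) and only serve to show how the two partial speed-ups are combined:

def combined_speedup(t_w_base, t_w_n, t_p_base, t_p_n):
    # Product of the worker-level speed-up (T_1^w / T_n^w)
    # and the core-level speed-up (T_1^p / T_n^p), as in Eq. (19).
    return (t_w_base / t_w_n) * (t_p_base / t_p_n)

# Illustrative example: a 4x worker-level gain combined with a 3x core-level gain.
print(combined_speedup(1200.0, 300.0, 90.0, 30.0))  # 12.0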


Fig. 8. Comparison between the original (in blue) and reconstructed (in dotted red) spectral signatures extracted from the BIP scene by the proposed cloud

AE implementation using 8 workers.

Fig. 9. Scalability of the proposed cloud-distributed network when processing the HDS data set in experiments 2 and 3: (a) with 8 worker nodes and 1 master node, considering 20%, 40%, 60% and 80% of training data (experiment 2); (b) with 1, 2, 4, 8, 12 and 16 worker nodes and 1 master node, considering 20% and 40% of training data (experiment 3, first part); and (c) with 2, 4, 8, 12 and 16 worker nodes and 1 master node, considering 60% and 80% of training data (experiment 3, second part). The numbers in parentheses indicate the total amount of data used in MB. The red lines indicate the theoretical speed-up (continuous red line) and the linear speed-up (dotted red line), while the blue and orange bars indicate the actual values reached.

With the aforementioned observations in mind, and focusing on the results of the first part of the experiment, reported in Fig. 9(b), we can observe that, for each configuration trained with 20% and 40% of the available samples, the proposed AE exhibits quite similar speed-ups, with slight variations due to the distribution of data and the role of idle cores. It is interesting to observe that with 1-8 nodes the speed-up is quite close to the theoretical one, while with 12-16 nodes the differences between the obtained and theoretical speed-up values are larger, indicating that the proposed AE with only 20% or 40% of the training samples does not take full advantage of the cloud environment's potential.

On the other hand, Fig. 9(c) reports the results obtained in the second part of this experiment. In this case, the baseline implementation of the AE is run on a cloud environment with 2 worker nodes, employing 60% and 80% of training data. With 2 and 4 worker nodes the obtained speed-up values are very similar when employing 60% and 80% of the available samples, while with 8, 12 and 16 nodes it is clear that the version with more training data achieves a superior speed-up, reaching a value very close to the theoretical one with 16 nodes. This indicates that the amount of data handled in this case is better suited to take full advantage of the way Spark organizes data partitions and tasks in batches, achieving better parallelization at the core level (fine-grained parallelism) and also better distribution at the worker level (coarse-grained parallelism). These conclusions are supported by the data tabulated in Table III, where the speed-up for 20% and 40% of training data has been obtained taking the cloud environment with 1 worker node as the baseline, while for 60% and 80% of training data the speed-up is computed with respect to the environment composed of 2 worker nodes, due to memory exhaustion with a single worker. Once again, the MSE remains constant for different numbers of nodes, varying only slightly with the training percentage, which indicates that the network is able to optimize very well, without overfitting the parameters when 60% or 80% of the available training samples are used, but also without underfitting when few samples are used for training purposes.

V. CONCLUSION AND FUTURE LINES

This paper presents a new cloud-based AE neural network for remotely sensed HSI data analysis in distributed fashion. This kind of artificial neural network finds non-linear solutions when compressing the data, as opposed to traditional techniques such as PCA. In this sense, the proposed approach is more suitable for complex data sets such as HSIs. The proposed AE implementation over a Spark cluster exhibits great performance, not only in terms of data compression and reconstruction error, but also in terms of scalability when processing huge data volumes, which cannot be achieved by traditional (sequential) AE implementations. Such sequential algorithms may be a valid option when the data to be managed and analyzed can be stored in a single machine with limited processing and memory resources.


TABLE III
Reconstruction errors (obtained as the MSE between the original test samples and the ones reconstructed by the proposed cloud-distributed AE), along with the processing times and speed-ups obtained for different numbers of workers when processing the HDS data set.

                                Number of workers
Training percentage (size)     1          2          4          8          12         16
Loss (MSE)
  20% (1838 MB)                6.09e-5    6.12e-5    6.01e-5    5.92e-5    6.47e-5    5.60e-5
  40% (3676 MB)                6.56e-5    5.80e-5    6.30e-5    6.44e-5    6.67e-5    6.13e-5
  60% (5515 MB)                N/A        5.88e-5    6.49e-5    6.27e-5    6.44e-5    5.70e-5
  80% (7353 MB)                N/A        6.59e-5    6.29e-5    6.25e-5    6.30e-5    6.20e-5
Time (s)
  20% (1838 MB)                14919.73   7632.64    4433.11    2952.27    1606.94    1171.87
  40% (3676 MB)                30526.79   15709.24   9087.66    5721.97    3311.36    2182.60
  60% (5515 MB)                N/A        21505.27   12458.54   8456.84    4536.73    3122.92
  80% (7353 MB)                N/A        32645.44   18900.49   11633.75   6084.69    4103.02
Speedup
  20% (1838 MB)                1          1.9547     3.3655     5.0536     9.2845     12.7315
  40% (3676 MB)                1          1.9432     3.3591     5.2796     9.2187     13.9864
  60% (5515 MB)                N/A        1          1.7247     2.5191     4.7402     6.8862
  80% (7353 MB)                N/A        1          1.7268     2.8304     5.3651     7.9564

However, for large amounts of HSI data, sequential implementations can easily run out of memory or require a vast amount of computing time, which is not acceptable when reliable processing is needed within a reasonable time. In this regard, both HPC and HTC alternatives have provided new paths to solve those problems, including parallelization on GPUs and distribution/parallelization on clusters with cloud computing-based solutions. The experiments carried out in this work demonstrate that cloud versions of HSI data processing methods provide efficient and effective HPC-HTC alternatives that successfully solve the inherent problems of sequential versions by increasing hardware capabilities at a lower cost than other solutions such as grid computing. Also, the obtained results reveal that the computational performance of cloud-based solutions readily scales with larger data sets, taking advantage of the distribution of the computational load when there is a good balance between the amount of data and the complexity of the cluster. Encouraged by the good results obtained in this work, in the future we will develop other implementations of HSI processing techniques in cloud computing environments. Further work will also explore the design of more sophisticated scheduling algorithms in order to circumvent the negative impact introduced by idle processing cores in our current implementation.

ACKNOWLEDGEMENT

The authors would like to express their gratitude to the

Jetstream initiative, led by the Indiana University Pervasive

Technology Institute (PTI), for providing the cloud computing

environment and hardware resources used in this work.

