A straightforward implementation of a GPU-accelerated
ELM in R with NVIDIA graphic cards
M. Alia-Martinez, J. Antonanzas, F. Antonanzas-Torres, A. Pernía-Espinoza, and R.
Urraca
EDMANS Group, University of La Rioja, Logroño, Spain
edmans@dim.unirioja.es
http://www.mineriadatos.com
Abstract. General purpose computing on graphics processing units (GPGPU) is a promising technique for coping with today's computational challenges, given the suitability of GPUs for parallel processing. Several libraries and functions are being released to boost the use of GPUs in real-world problems. However, many of these packages require a deep knowledge of GPU architecture and of low-level programming. As a result, end users have trouble exploiting the advantages of GPGPU. In this paper, we focus on the GPU-acceleration of a prediction technique specially designed to deal with big datasets: the extreme learning machine (ELM). The intent of this study is to develop a user-friendly library in the open source R language and to release the code at https://github.com/maaliam/EDMANS-elmNN-GPU.git, so that R users can freely use it with the only requirement of having an NVIDIA graphics card. The most computationally demanding operations were identified by performing a sensitivity analysis. As a result, only matrix multiplications were executed on the GPU, as they take around 99% of the total execution time. A speedup rate of up to 15 times was obtained with this GPU-accelerated ELM in the most computationally expensive scenarios. Moreover, the applicability of the GPU-accelerated ELM was also tested in a typical case of model selection, in which genetic algorithms were used to fine-tune an ELM and thousands of models had to be trained. In this case, a speedup of 6 times was still obtained.
Keywords: ELM; GPU; gputools; CUDA; R Software; big data; optimization techniques
1 Introduction
In this day and age, data creation is steadily increasing at rates nobody could have envisioned a few years ago. Sources of data in society are diverse, from social media to machines, sensing or transactions [4]. This new phenomenon is commonly referred to as big data, and it has emerged as one of the most challenging topics of our time. A vast amount of data has to be processed at the same rate it is recorded in order to obtain valuable information, given the existing computing limitations. What is more, high computational demands can still arise when processing relatively small databases if thousands of operations have to be executed. A typical case is the automation of the model selection process. In order to obtain a fine-tuned predictive model, both model parameter optimization (MPO) and feature selection (FS) are usually executed simultaneously. The problem basically consists
of minimizing a loss function by applying iterative meta-heuristics, i.e., optimization techniques that imply the computation of hundreds or even thousands of models. For instance, [2] tuned an extreme learning machine (ELM) for classification purposes using genetic algorithms (GAs), setting both the population size and the number of generations to 100 and implementing 10-fold cross-validation (10-CV) to obtain a more robust error measurement. As a consequence, 100 individuals × 100 generations × 10 folds = 100,000 ELMs had to be computed.
Any of the above-mentioned situations entails execution times that are too large for practical applications. A huge effort is being made in the development of new processing techniques, as well as more powerful computational tools, to cope with these challenges. For instance, one of the most basic and well-known solutions is the use of parallel computing with multi-core processors, which is commonly available in any commercial software. More advanced solutions exist in different fields, such as high performance computing (HPC) and supercomputing; some applications are the use of computer clusters, computer grids or cloud computing, among others. In this study, we focus on one of these emerging solutions: the use of the graphics processing unit (GPU) instead of the traditional central processing unit (CPU) to run demanding computations.
GPUs are evolving at faster rates than the traditional CPUs [8]. Though they were
initially built to manage the graphic display in computers, their multi-core structure
with usually more than 200 units makes them adequate for parallel processing. The
release of compute unified device architecture (CUDA) software from NVIDIA first
enabled their use for running non-graphical computations, also known as general pur-
pose computing on graphics processing units (GPGPU). Afterwards, several libraries
for CUDA users have been released as well as software for other GPU developers, such
as FireStream for AMD or the more general OpenCL. Nevertheless, most of these applications require extensive knowledge of GPU architecture and of parallel processing in specific programming languages such as C or Python. On the contrary, typical end users are researchers in the fields of statistics, chemistry, biology or finance, who possess limited programming abilities and work in friendlier environments such as Matlab, Mathematica or R. Consequently, there is still a need to develop user-friendly tools and functions so that these end users are able to fully exploit the advantages of GPGPU.
Along these lines, this study focuses on accelerating ELMs, an appropriate technique to deal with big datasets [12]. The major ELM operations are simple matrix-matrix multiplications, inversions and transposes, so the algorithm has an adequate structure for parallel processing. Some previous works related to the GPU-acceleration of ELMs exist [5], but they require strong programming skills, as most of them are based on programming in the C language. Therefore, our proposal is implemented in the open source R language [11], widely used in the data analysis field. A new library is freely released at https://github.com/maaliam/EDMANS-elmNN-GPU.git for future implementations. Computations were run with NVIDIA GPUs in order to take advantage of the numerous CUDA-related libraries available, such as the R package gputools [1].
2 Methodology
2.1 ELM
ELM [7] implements a simplification of traditional learning algorithms used in a Single-
Hidden Layer Feedforward Network (SLFN) [9]. Instead of using an iterative proce-
dure, ELM proposes a straightforward solution to the optimization problem behind a
SLFN, while keeping its generalization capacity.
Mathematically, the algorithm is described as follows. An ELM predicts an output variable $t_i$ given a training set $(x_i, t_i)$ with $x_i \in \mathbb{R}^d$, $t_i \in \mathbb{R}$, $i = 1, \dots, N$, where $x_i$ is the set of inputs, $d$ is the dimension of the input feature space, and $N$ is the number of instances. Random values are assigned to the weight vector $w_i$ between the inputs and the hidden layer, and to the bias $b_i$. As a consequence, the hidden-layer output matrix $H$ is directly obtained:

$$
H(x_i) = \begin{bmatrix} h(x_1) \\ \vdots \\ h(x_N) \end{bmatrix}
       = \begin{bmatrix}
           g(w_1 \cdot x_1 + b_1) & \cdots & g(w_L \cdot x_1 + b_L) \\
           \vdots & \ddots & \vdots \\
           g(w_1 \cdot x_N + b_1) & \cdots & g(w_L \cdot x_N + b_L)
         \end{bmatrix}
\quad (1)
$$

where $L$ is the number of neurons in the hidden layer and $g$ is the activation function. Initially, the algorithm essentially maps the set of points from the $d$-dimensional feature space to an $L$-dimensional neuron space. Once matrix $H$ is obtained, the problem is reduced to calculating the weight vector $\beta$ between the hidden layer and the outputs ($H\beta = T$). This typical optimization problem is solved by using the minimal-norm least-squares method, in which the Moore-Penrose generalized inverse $H^{\dagger}$ of matrix $H$ has to be computed:

$$
\beta = H^{\dagger} T \quad (2)
$$
In this study, a variation of the traditional methods to compute this inverse is utilized. Following the theory explained in [6], a regularization parameter is included. This approach is strongly recommended when dealing with databases with a high number of samples:

$$
\beta = \left( \frac{I}{C} + H^T H \right)^{-1} H^T T \quad (3)
$$

where $C$ stands for the cost parameter. Based on ridge regression theory, the addition of a positive value to the diagonal of matrix $H^T H$ improves the robustness of the model.
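The following minimal R sketch illustrates the training step described by Eqs. (1)-(3); it only illustrates the math, it is not the code of the released package. The function name elm_train_sketch, the data layout (inputs as a d x N matrix, targets as a 1 x N matrix) and the sigmoid activation are assumptions made for the example.

# Minimal sketch of ELM training with the regularized minimal-norm
# least-squares solution of Eq. (3). Matrices follow the transposed
# (one column per sample) layout also used by the elmNN code below.
elm_train_sketch <- function(X, Y, L, C = 1) {
  # X: d x N input matrix, Y: 1 x N target matrix, L: hidden neurons, C: cost
  d <- nrow(X)
  W <- matrix(runif(L * d, -1, 1), nrow = L)  # random input weights (L x d)
  b <- runif(L, -1, 1)                        # random biases (length L)
  H <- 1 / (1 + exp(-(W %*% X + b)))          # hidden-layer output (L x N), sigmoid g
  # beta = (I/C + H H^T)^-1 H Y^T, i.e. Eq. (3) in the transposed layout
  beta <- solve(diag(L) / C + H %*% t(H), H %*% t(Y))
  list(inpweight = W, biashid = b, outweight = beta)
}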
2.2 GPU-acceleration
The original ELM functions of package elmNN [3] were used as baseline to develop
the GPU-accelerated ELM. Two functions from this package were modified:
– elmtrain.default(), a method of the general function elmtrain().
– predict.elmNN(), a method of the general function predict().
In both functions, an additional argument labeled GPU has been included. With the default value, GPU=FALSE, the computations are run on the CPU; when the argument is set to GPU=TRUE, the GPU-accelerated version is activated.
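As a usage illustration, the sketch below shows how the two functions can be called once the modified package is installed (assuming it still loads under the elmNN name). Apart from the GPU argument described above, the remaining argument names (formula, data, nhid, actfun and the cost C of the regularized version) are assumed to follow the elmNN conventions and may differ slightly in the released code.

library(elmNN)  # GPU-enabled fork released in the EDMANS repository

# train an ELM with 500 hidden neurons on the GPU (hypothetical data frames)
model <- elmtrain(y ~ ., data = train_set, nhid = 500, actfun = "sig",
                  C = 1, GPU = TRUE)
# predictions are also computed on the GPU
pred <- predict(model, newdata = test_set, GPU = TRUE)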
First, a sensitivity study was performed (Subsect. 3.1) in order to detect the most
demanding operations in each function. The nomenclature used is based on the package
elmNN. According to this package, all matrices and vectors are transposed compared
to the traditional theoretical explanation of Subsect. 2.1.
The training function based on elmtrain.default() computes the weight vector $\beta$ given $(x_i, t_i)$, following the equations described in Subsect. 2.1. This function comprises ten different R instructions:
#1  tempH <- inpweight %*% P
#2  biasMatrix <- matrix(rep(biashid, ncol(P)), nrow = nhid,
                         ncol = ncol(P), byrow = FALSE)
#3  tempH <- tempH + biasMatrix
#4  H <- 1 / (1 + exp(-1 * tempH))
#5  H.Prod <- H %*% t(H)
#6  C.Mat <- diag(ncol(H.Prod)) / C + H.Prod
#7  inverse <- solve(C.Mat)
#8  mult <- inverse %*% H
#9  outweight <- mult %*% t(T)
#10 Y <- t(t(H) %*% outweight)
The predict function predict.elmNN() computes the output vector $t_i$ based on the set of input samples $x_i$ and on the previously calculated weight vector $\beta$. The function comprises six instructions:
#11 TV.P <- t(x)
#12 tmpHTest <- inpweight %*% TV.P
#13 biasMatrixTE <- matrix(rep(biashid, ncol(TV.P)), nrow = nhid,
                           ncol = ncol(TV.P), byrow = FALSE)
#14 tmpHTest <- tmpHTest + biasMatrixTE
#15 HTest <- 1 / (1 + exp(-1 * tmpHTest))
#16 TY <- t(t(HTest) %*% outweight)
The time spent by each instruction was recorded to establish the influence of each operation on the overall execution time. Preliminary trials (Subsect. 3.1) indicated that matrix-matrix multiplications were the most computationally demanding operations, so they were the only ones executed on the GPU (instructions # 1, # 5, # 8 and # 12). These multiplications were implemented using the gpuMatMult() function of package gputools. Basically, this function is a wrapper for the cublasDgemm function of the NVIDIA CUDA Basic Linear Algebra Subroutines (cuBLAS) library. The remaining operations were still implemented on the CPU.
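A minimal sketch of this dispatch is shown below; mat_mult() is a hypothetical helper written only for illustration, while gputools::gpuMatMult() is the actual wrapper around cublasDgemm mentioned above.

library(gputools)

# route a matrix product to the GPU only when requested; otherwise use plain R
mat_mult <- function(A, B, GPU = FALSE) {
  if (GPU) {
    gpuMatMult(A, B)  # executed on the NVIDIA card through cuBLAS
  } else {
    A %*% B           # standard CPU multiplication
  }
}

# e.g. instruction # 5 would become: H.Prod <- mat_mult(H, t(H), GPU = GPU)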
3 Experiments
3.1 Sensitivity Analysis
Initially, the computational cost of each instruction when using the CPU was compared with that of the proposed GPU-accelerated ELM. A two-dimensional grid of databases was designed to perform the comparisons by varying the number of samples $N$ and the number of features $d$. Besides, the main parameter of an ELM, the number of neurons in the hidden layer $L$, was also altered. Consequently, a three-dimensional grid of models was created with the following ranges and intervals of parameters:
– Number of samples $N$: from 5,000 to 50,000 in intervals of 5,000 samples.
– Number of neurons in the hidden layer $L$: from 100 to 1,000 in intervals of 100 neurons.
– Number of features $d$: 10, 100 and 1,000.
All databases were randomly generated using numbers from 0 to 1, as the only goal was to evaluate the execution times. Therefore, the cost parameter $C$ was held constant and equal to 1. To perform an even comparison, exactly the same randomly generated bias $b_i$ and input weight vector $w_i$ were used in both types of ELMs.
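The sketch below illustrates how one cell of this grid can be generated and timed; the call signature of elmtrain() (in particular the C and GPU arguments) is assumed from the description in Subsect. 2.2, and the random seed is reset so that the random bias and input weights are identical in both runs.

N <- 5000; d <- 100; L <- 500; C <- 1
X <- matrix(runif(N * d), nrow = N)  # random inputs in [0, 1]
y <- runif(N)                        # random targets in [0, 1]

set.seed(1)
t_cpu <- system.time(elmtrain(x = X, y = y, nhid = L, actfun = "sig",
                              C = C, GPU = FALSE))["elapsed"]
set.seed(1)
t_gpu <- system.time(elmtrain(x = X, y = y, nhid = L, actfun = "sig",
                              C = C, GPU = TRUE))["elapsed"]
t_cpu / t_gpu                        # speedup for this grid cell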
Computations were performed on a workstation with the following hardware specifications: an NVIDIA GeForce GTX 650 with 384 cores and 2 GB of GDDR5 memory, a dual-core processor (AMD Athlon 64 X2 @ 1.8 GHz) and 4 GB of RAM.
3.2 Case of study: Estimation of Daily Global Solar Irradiation
The efficiency of the GPU-accelerated ELM was further analyzed with a real case. Global solar irradiation was predicted from thirteen meteorological input variables related to rainfall, temperature, extraterrestrial irradiation, wind speed and humidity. Daily measurements recorded at 4 different locations of southern Spain (Cordoba, Jaen, Nijar and Puebla Rio) from 2009 to 2013 were used. After removing spurious records, a database composed of 6,605 samples was obtained. The period 2009-2012 was used for training (5,249 samples) whilst the 2013 samples were used for testing (1,356 samples).
This case study is a practical example where high computational resources are required with a relatively small database, as a vast amount of models has to be run to perform model selection. Genetic algorithms were used to simultaneously execute MPO and FS. Therefore, a hybrid chromosome was utilized, where a binary part models the set of input variables chosen and a real-coded part stands for the parameters of the model [13]. The population size and the number of generations were set to 64 and 20, respectively. Individuals were first ranked according to a fitness function $J$ that accounts for the prediction error ($J = \mathrm{MAE}_{val}$), and they were subsequently re-ranked according to model complexity. The re-ranking works as follows. First, the complexity of each individual was evaluated based on the generalized degrees of freedom (GDF) theory [14]. Using GDF, complexity is defined as the sensitivity of each fitted value $\hat{y}_i$ to a randomly perturbed target value $y_i$. In this case, the "horizontal" estimate with 10 repetitions and Gaussian noise $N(0, 0.05)$ was utilized, as described in [10]. Then, models inside the same error interval (relative difference between errors of less than 0.1%) were re-ranked according to this complexity value. The 16 best individuals (25% elitism) were selected as parents for the next generations. A mutation rate of 10% was chosen, but the best individual was not mutated.
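As an illustration (not the authors' GA code), the sketch below shows one way such a hybrid chromosome can be decoded into a feature subset and ELM parameters; the helper name decode_chromosome and the parameter ranges are assumptions, only the 13 candidate features come from the description above.

# decode a hybrid chromosome: a binary part selecting features and a
# real-coded part carrying the ELM parameters (hidden neurons and cost)
decode_chromosome <- function(chrom, n_features = 13) {
  mask <- chrom[1:n_features] > 0.5                  # binary part: feature mask
  L    <- round(100 + chrom[n_features + 1] * 900)   # neurons, assumed range 100-1000
  logC <- -5 + chrom[n_features + 2] * 20            # cost exponent, assumed range -5..15
  list(features = which(mask), nhid = L, C = 2^logC)
}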
GA computations were run on the workstation described in Subsect. 3.1. When working only with the CPU, a single core was used in order to carry out an even comparison and, consequently, only one model was computed at a time.
4 Results and Discussion
Table 1 depicts the execution time in seconds taken by each instruction of the train and predict functions when only the CPU was used to run the computations. Results proved that the most time-consuming operations were matrix-matrix multiplications: operations # 1, # 5 and # 8 in elmtrain.default() and instruction # 12 in predict.elmNN(). Operations # 1 and # 12 are essentially the same, as both compute matrix $H$ based on the randomly generated input weight vector $w_i$ and the set of input variables $x_i$:

$$
H_{[L \times N]} = w_{i\,[L \times d]} \; x_{i\,[d \times N]} \quad (4)
$$

Instructions # 5 and # 8 are the two matrix-matrix multiplications required to implement minimal-norm least squares with the regularization parameter. Operation # 5 computes the matrix multiplication between $H$ and its transpose:

$$
\mathrm{H.Prod}_{[L \times L]} = H_{[L \times N]} \; t(H)_{[N \times L]} \quad (5)
$$

while operation # 8 multiplies the inverted matrix obtained after adding the regularization term and $H$:

$$
\mathrm{mult}_{[L \times N]} = \mathrm{inverse}_{[L \times L]} \; H_{[L \times N]} \quad (6)
$$
Looking at the dimensions of these multiplications, the number of features $d$ only influences operations # 1 and # 12. Accordingly, the results of Table 1 show that the execution time of operation # 1 increased 115 times, from 0.44 to 50.82 seconds, when the database with 1,000 features was used. Other operations remained nearly constant when the number of features was varied.
Modifying the number of samples $N$ and the number of neurons in the hidden layer $L$ had a similar impact on the execution time. The execution time of multiplications # 1, # 5, # 8 and # 12 grew considerably. On the contrary, the execution time of some of the remaining operations, such as the activation function (# 4), the calculation of the regularization term (# 6), the matrix-vector multiplications (# 10 and # 16) or computing the inverse (# 7), experienced a virtually negligible increase. This was the reason why only matrix-matrix multiplications were implemented on the GPU.
Table 2 shows the execution times in seconds obtained with the same databases but using the GPU-accelerated ELM. Remarkable time reductions were obtained in the most demanding computations. For instance, with the 50,000-sample database, the time required by the matrix-matrix multiplications was around 2 seconds, which is now in the same range as other computations such as the arithmetic operations implemented by the sigmoid activation function (# 4). What is more, using 1,000 neurons in the hidden layer, the time required by the matrix multiplications was roughly half of the time spent obtaining the inverse (# 7).
Table 1. Execution time in seconds taken by each instruction using the CPU. Results of the three-
dimensional (neurons, samples, features) grid of models are summarized in three blocks. In each
block, the variation of one parameter was individually studied while averaging the other two.
elmtrain.default() predict.elmNN()
# 1 # 2 # 3 # 4 # 5 # 6 # 7 # 8 # 9 # 10 # 11 # 12 # 13 # 14 # 15 # 16
Features
10 0.44 0.74 0.16 1.49 35.78 0.01 1.26 35.43 0.14 0.44 0.00 0.43 0.74 0.16 1.48 0.43
100 4.43 0.73 0.16 1.49 35.81 0.01 1.28 35.37 0.14 0.45 0.02 4.43 0.74 0.16 1.48 0.44
1,000 50.82 0.74 0.16 1.49 35.79 0.01 1.28 35.39 0.14 0.44 0.35 50.84 0.73 0.15 1.64 0.50
Neurons
100 3.46 0.13 0.02 0.27 1.02 0.00 0.01 0.68 0.03 0.07 0.12 3.46 0.13 0.02 0.26 0.07
200 6.75 0.26 0.05 0.54 3.87 0.00 0.03 2.70 0.05 0.15 0.12 6.75 0.26 0.05 0.54 0.15
300 10.04 0.41 0.09 0.81 8.56 0.00 0.11 8.37 0.08 0.23 0.12 10.04 0.41 0.09 0.81 0.23
400 13.34 0.53 0.11 1.08 15.08 0.00 0.28 14.84 0.10 0.31 0.12 13.36 0.53 0.11 1.08 0.31
500 16.65 0.66 0.14 1.35 23.36 0.00 0.54 23.10 0.12 0.39 0.12 16.63 0.66 0.14 1.35 0.39
600 19.99 0.80 0.18 1.62 33.52 0.01 0.93 33.24 0.15 0.47 0.12 19.95 0.80 0.17 1.62 0.47
700 23.30 0.96 0.21 1.89 45.52 0.01 1.46 45.22 0.17 0.56 0.12 23.41 0.95 0.21 1.89 0.56
800 27.29 1.07 0.23 2.16 59.31 0.01 2.14 59.04 0.20 0.64 0.12 27.31 1.06 0.22 2.21 0.66
900 30.67 1.20 0.26 2.45 75.02 0.01 3.06 74.63 0.22 0.73 0.12 30.68 1.20 0.26 2.61 0.83
1,000 34.16 1.33 0.29 2.75 92.67 0.02 4.17 92.14 0.25 0.88 0.12 34.12 1.33 0.28 2.99 0.93
Samples
5,000 3.38 0.13 0.03 0.27 6.49 0.01 1.27 6.43 0.03 0.06 0.02 3.37 0.13 0.02 0.27 0.06
10,000 6.75 0.27 0.05 0.54 13.00 0.01 1.27 12.93 0.05 0.15 0.05 6.76 0.27 0.05 0.54 0.15
15,000 10.16 0.41 0.09 0.81 19.50 0.01 1.27 19.36 0.08 0.23 0.06 10.13 0.41 0.09 0.82 0.23
20,000 13.47 0.53 0.11 1.08 26.05 0.01 1.27 25.70 0.10 0.31 0.09 13.48 0.53 0.11 1.09 0.32
25,000 16.85 0.66 0.14 1.37 32.53 0.01 1.27 32.20 0.12 0.39 0.10 16.87 0.66 0.14 1.36 0.40
30,000 20.22 0.80 0.17 1.63 39.07 0.01 1.27 38.65 0.15 0.48 0.14 20.26 0.80 0.17 1.64 0.49
35,000 23.63 0.95 0.21 1.90 45.56 0.01 1.27 45.05 0.17 0.56 0.14 23.60 0.95 0.20 1.92 0.57
40,000 26.99 1.06 0.23 2.16 52.09 0.01 1.28 51.48 0.20 0.67 0.20 27.02 1.07 0.23 2.27 0.69
45,000 30.38 1.20 0.26 2.44 58.57 0.01 1.28 57.87 0.22 0.74 0.19 30.40 1.20 0.26 2.55 0.77
50,000 33.81 1.33 0.29 2.72 65.08 0.01 1.28 64.29 0.25 0.84 0.25 33.80 1.33 0.28 2.91 0.91
Table 2. Execution time in seconds taken by each instruction using the GPU. Results of the three-
dimensional (neurons, samples, features) grid of models are summarized in three blocks. In each
block, the variation of one parameter was individually studied while averaging the other two.
elmtrain.default() predict.elmNN()
# 1 # 2 # 3 # 4 # 5 # 6 # 7 # 8 # 9 # 10 # 11 # 12 # 13 # 14 # 15 # 16
Features
10 0.13 0.74 0.16 1.48 1.10 0.01 1.26 0.84 0.08 0.37 0.00 0.12 0.74 0.16 1.49 0.37
100 0.23 0.73 0.16 1.48 1.10 0.01 1.28 0.84 0.08 0.37 0.02 0.22 0.74 0.16 1.48 0.37
1,000 1.20 0.74 0.16 1.49 1.10 0.01 1.27 0.84 0.08 0.37 0.63 1.25 0.73 0.15 1.52 0.39
Neurons
100 0.24 0.13 0.02 0.26 0.09 0.00 0.00 0.05 0.02 0.06 0.12 0.14 0.13 0.02 0.26 0.06
200 0.22 0.26 0.05 0.54 0.23 0.00 0.03 0.13 0.03 0.13 0.12 0.22 0.26 0.05 0.54 0.12
300 0.31 0.41 0.09 0.81 0.40 0.00 0.11 0.26 0.04 0.20 0.12 0.31 0.41 0.09 0.81 0.20
400 0.39 0.53 0.11 1.08 0.59 0.00 0.28 0.41 0.06 0.26 0.12 0.39 0.53 0.11 1.08 0.26
500 0.47 0.66 0.14 1.35 0.82 0.00 0.54 0.59 0.07 0.33 0.12 0.47 0.66 0.14 1.35 0.32
600 0.55 0.80 0.17 1.62 1.08 0.01 0.93 0.81 0.09 0.40 0.12 0.55 0.80 0.17 1.62 0.39
700 0.63 0.96 0.20 1.89 1.37 0.01 1.45 1.06 0.10 0.47 0.12 0.62 0.95 0.21 1.89 0.47
800 0.71 1.07 0.23 2.15 1.70 0.01 2.14 1.33 0.11 0.54 0.12 0.72 1.06 0.22 2.16 0.55
900 0.81 1.20 0.26 2.43 2.15 0.01 3.04 1.70 0.13 0.61 0.34 0.96 1.20 0.26 2.50 0.64
1,000 0.89 1.33 0.29 2.71 2.58 0.02 4.15 2.05 0.15 0.73 0.85 0.95 1.33 0.28 2.75 0.72
Samples
5,000 0.19 0.13 0.02 0.26 0.18 0.01 1.27 0.15 0.02 0.05 0.02 0.09 0.13 0.02 0.27 0.05
10,000 0.19 0.27 0.05 0.54 0.40 0.01 1.27 0.31 0.03 0.13 0.05 0.19 0.27 0.05 0.54 0.13
15,000 0.28 0.41 0.09 0.81 0.60 0.01 1.27 0.46 0.04 0.20 0.06 0.28 0.41 0.09 0.81 0.20
20,000 0.37 0.53 0.11 1.08 0.81 0.01 1.27 0.61 0.06 0.28 0.09 0.37 0.53 0.11 1.08 0.27
25,000 0.46 0.66 0.14 1.35 1.00 0.01 1.27 0.76 0.07 0.34 0.10 0.46 0.66 0.14 1.35 0.34
30,000 0.55 0.80 0.17 1.63 1.21 0.01 1.27 0.91 0.09 0.41 0.14 0.57 0.80 0.17 1.64 0.41
35,000 0.65 0.95 0.21 1.89 1.40 0.01 1.27 1.07 0.10 0.47 0.14 0.65 0.95 0.20 1.90 0.48
40,000 0.74 1.06 0.23 2.15 1.60 0.01 1.27 1.22 0.12 0.54 0.20 0.86 1.07 0.23 2.17 0.55
45,000 0.83 1.20 0.26 2.43 1.80 0.01 1.27 1.37 0.13 0.61 0.92 0.86 1.20 0.26 2.44 0.61
50,000 0.93 1.33 0.29 2.70 2.01 0.01 1.27 1.52 0.15 0.69 0.47 0.97 1.33 0.28 2.74 0.69
[Figure: speedup of elmtrain.default() versus the number of neurons in the hidden layer (200-1,000), with one panel per number of samples (5,000-50,000) and one curve per number of features (10, 100 and 1,000); vertical axis: speedup, 0-15.]
Fig. 1. Execution time speedup (vertical axis) in the elmtrain.default() function for the different combinations of samples, features and neurons in the hidden layer
Along these lines, Fig. 1 depicts the overall time reduction between the two ELM versions. A striking speedup of up to 15 times was obtained when using the database with 1,000 variables. This speedup was up to 10 times when smaller databases with 10 or 100 features were used. The slight fall in the speedup curve for databases with 1,000 features and 40,000, 45,000 or 50,000 samples was caused by RAM limitations in the workstation.
The case of 100 features is further analyzed in Fig. 2 with a contour plot, where samples and neurons were selected as the x and y axes, respectively, while speedup was the third dimension. It can be seen that the higher speedups were obtained when both the number of samples and the number of neurons increased proportionally. On the contrary, the speedup increased only slightly, and even decreased, when only the number of neurons or the number of samples was raised alone. This explains why in Fig. 1, when a low number of samples was used, the speedup curve stagnated or even started to decrease when the number of
neurons was raised. Similar patterns were observed with 10 and 1,000 features (data
not shown).
[Figure: contour plot titled "Speedup of the execution time (100 Features)"; x axis: samples (10,000-50,000), y axis: neurons (200-1,000), contour labels ranging from 3.5 to 11.5.]
Fig. 2. Contour plot of execution time speedup in the elmtrain.default() function for the different number of samples and neurons in the hidden layer in the case of 100 features
Finally, the applicability of the GPU-accelerated ELM was evaluated with a real case study of model selection using GAs, where 12,800 ELMs were computed. Although the GA procedure introduces some sources of uncertainty in different steps, such as the creation of the initial generation or the mutations, preliminary results have proved the robustness of this methodology by repeating the GA several times. This robustness was again verified in this case study, as the evolution followed by both ELM versions during the GA presented a similar trend (data not shown). Besides, the final solutions reached (best individual of the last generation) by both ELMs were roughly equivalent. With the CPU version, an ELM with 8 selected features, 472 neurons in the hidden layer and a cost of $2^{9}$ was obtained, with a normalized testing MAE of 0.059, while with the GPU version, an ELM with the same 8 selected features, 524 neurons in the hidden layer and a cost of $2^{11}$ was obtained, with a normalized testing MAE of 0.060.
As a result, the computational cost through each generation was comparable in terms of complexity. However, when looking at execution times, Fig. 3 shows how the GPU-accelerated ELM significantly outperforms the CPU version, achieving a total overall speedup of roughly 6 times. The CPU version took around 123 hours to run all generations whilst the proposed GPU-accelerated ELM spent only 20 hours.
[Figure: bars show the execution time per generation (left axis, 0-400 min) and lines the cumulative total execution time (right axis, 0-8,000 min) over generations 0-20 for the CPU and GPU versions.]
Fig. 3. Total execution time (lines) and execution time per generation (bars) in minutes spent by CPU and GPU versions of ELM through the GA optimization process
5 Conclusions and Future Work
General purpose computing on graphics processing units (GPGPU) is emerging as one of the most appealing technologies to deal with today's computational challenges in data analysis. In this study, a GPU-accelerated version of the ELM has been proposed by modifying the R package elmNN. Preliminary trials showed how, by running only the matrix-matrix multiplications on the GPU, speedups of up to 15 times were obtained in the most demanding situations, with high numbers of features, samples and neurons in the hidden layer. The methodology also proved useful in a real case where a relatively small database was used but a vast amount of models had to be computed in the context of model-selection meta-heuristics. In this case, an overall speedup of around 6 times was obtained using GAs to fine-tune an ELM.
The GPU-accelerated version of package elmNN has been freely released at https://github.com/maaliam/EDMANS-elmNN-GPU.git for future applications by R users, with the only requirement of owning a workstation with an NVIDIA graphics card.
Finally, several aspects could be explored in the future, such as running the whole ELM algorithm on the GPU, using a GPU cluster in order to compute several ELMs at the same time, and implementing the same methodology in other well-known prediction techniques such as support vector machines or the multilayer perceptron.
Acknowledgments R. Urraca and J. Antonanzas would like to acknowledge the fellow-
ship FPI-UR-2014 granted by the University of La Rioja. F. Antonanzas-Torres would
like to express his gratitude for the FPI-UR-2012 and ATUR grant No. 03061402 at
the University of La Rioja. We are also greatly indebted to Banco Santander for the
PROFAI-13/06 fellowship, to the Agencia de Desarrollo Económico de La Rioja for
the ADER-2012-I-IDD-00126 (CONOBUILD) fellowship and to the Instituto de Estu-
dios Riojanos (IER) for funding parts of this research.
References
1. Buckner, J., Wilson, J., Seligman, M., Athey, B., Watson, S., Meng, F.: The gputools package
enables GPU computing in R. Bioinformatics 26(1), 134–135 (2010)
2. Chyzhyk, D., Savio, A., Graña, M.: Evolutionary ELM wrapper feature selection for
alzheimer’s disease CAD on anatomical brain MRI. Neurocomputing 128, 73–80 (2014)
3. Gosso, A.: elmNN: Implementation of ELM (Extreme Learning Machine) algorithm for
SLFN (Single Hidden Layer Feedforward Neural Networks). R package version 1.3 (2012),
http://CRAN.R-project.org/package=elmNN
4. Hashem, I.A.T., Yaqoob, I., Anuar, N.B., Mokhtar, S., Gani, A., Ullah Khan, S.: The rise
of "big data" on cloud computing: Review and open research issues. Information Systems
47(0), 98–115 (2015)
5. van Heeswijk, M., Miche, Y., Oja, E., Lendasse, A.: GPU-accelerated and parallelized ELM
ensembles for large-scale regression. Neurocomputing 74(16), 2430 – 2437 (2011)
6. Huang, G.B.: Extreme learning machine for regression and multiclass classification. IEEE
Transactions on Systems, Man and Cybernetics-Part B: Cybernetics 42(2), 513–529 (2012)
7. Huang, G.B., Zhu, Q.Y., Siew, C.K.: Extreme learning machine: theory and applications.
Neurocomputing 70, 489–501 (2006)
8. Peddie, J.: The new visualization engine - the heterogeneous processor unit. In: Dill, J., Earn-
shaw, R., Kasik, D., Vince, J., Wong, P.C. (eds.) Expanding the frontiers of visual analytics
and visualization, pp. 377–396. Springer International Publishing (2012)
9. Salcedo-Sanz, S., Casanova-Mateo, C., Pastor-Sanchez, A., Giron, M.S.: Daily global solar
radiation prediction based on a hybrid coral reefs optimization - extreme learning machine
approach. Solar Energy 105, 91–98 (2014)
10. Seni, G., Elder, J.: Ensemble methods in data mining: Improving accuracy through combining
predictions. Morgan & Claypool (2010)
11. R Core Team: R: A language and environment for statistical computing. R Foundation for
Statistical Computing, Vienna, Austria (2014), http://www.R-project.org/
12. Urraca, R., Antonanzas, J., Martinez-de Pison, F.J., Antonanzas-Torres, F.: Estimation of
solar global irradiation in remote areas. Journal of Renewable and Sustainable Energy (In
Press)
13. Urraca-Valle, R., Sodupe-Ortega, E., Antoñanzas Torres, J., Antoñanzas-Torres, F.,
Martinez-de Pison, F.J.: An overall performance comparative of GA-PARSIMONY method-
ology with regression algorithms. In: de la Puerta, J.G., Ferreira, I.G., Bringas, P.G., Klett,
F., Abraham, A., de Carvalho, A.C., Herrero, A., Baruque, B., Quintian, H., Corchado, E.
(eds.) Advances in Intelligent Systems and Computing, vol. 299, pp. 53–62. Springer Inter-
national Publishing (2014)
14. Ye, J.: On measuring and correcting the effects of data mining and model selection. Journal
of the American Statistical Association 93(441), 120 – 131 (1998)