A straightforward implementation of a GPU-accelerated ELM in R with NVIDIA graphic cards
M. Alia-Martinez, J. Antonanzas, F. Antonanzas-Torres, A. Pernía-Espinoza, and R.
EDMANS Group, University of La Rioja, Logroño, Spain,
Abstract. General purpose computing on graphics processing units (GPGPU) is a
promising technique to cope with today's computational challenges due to the
suitability of GPUs for parallel processing. Several libraries and functions are being
released to boost the use of GPUs in real world problems. However, many of these
packages require deep knowledge of GPU architecture and of low-level programming. As a
result, end users struggle to exploit the advantages of GPGPU. In this paper, we focus
on the GPU acceleration of a prediction technique specially designed to deal with big
datasets: the extreme learning machine (ELM). The intent of this study is to develop a
user-friendly library in the open source R language and subsequently release the code
at https://github.com/maaliam/EDMANS-elmNN-GPU.git. R users can therefore freely use
it, with the only requirement of having an NVIDIA graphics card. The most
computationally demanding operations were identified by performing a sensitivity
analysis. As a result, only matrix multiplications were executed on the GPU, as they
take around 99 % of the total execution time. A speedup of up to 15 times was obtained
with this GPU-accelerated ELM in the most computationally expensive scenarios.
Moreover, the applicability of the GPU-accelerated ELM was also tested with a typical
case of model selection, in which genetic algorithms were used to fine-tune an ELM and
training thousands of models is required. In this case, a speedup of 6 times was still
obtained.
Keywords: ELM; GPU; gputools; CUDA; R Software; big data; optimization
1 Introduction
Data are being created at rates nobody could have envisioned a few years ago. Sources
of data in society are diverse, from social media to machines, sensing or transactions
[4]. This phenomenon is commonly referred to as big data and has emerged as one of the
most challenging topics of our time. A vast amount of data has to be processed at the
same rate at which it is recorded to obtain valuable information, despite the existing
computing limitations. What is more, high computational demands can still arise with
relatively small databases when thousands of operations have to be executed. A typical
case is the automation of the model selection process. In order to obtain a fine-tuned
predictive model, both model parameter optimization (MPO) and feature selection (FS)
are usually executed simultaneously. The problem basically consists of minimizing a
loss function by applying iterative meta-heuristics, which are optimization techniques
that imply the computation of hundreds or even thousands of models. For instance, [2]
tuned an extreme learning machine (ELM) for classification purposes by using genetic
algorithms (GAs), setting both the population size and the number of generations to
100 and implementing 10-fold cross-validation (10-CV) to obtain a more robust error
measurement. As a consequence, 100,000 ELMs had to be computed.
Either of the above-mentioned situations entails execution times that are too long for
practical applications. A huge effort is being made to develop new processing
techniques as well as more powerful computational tools to cope with these challenges.
For instance, one of the most basic and well-known solutions is parallel computing
with multi-core processors, which is commonly available in commercial software. More
advanced solutions exist in fields such as high performance computing (HPC) and
supercomputing, including computer clusters, computer grids and cloud computing. In
this study, we focus on one of these emerging solutions: using the graphics processing
unit (GPU) instead of the traditional central processing unit (CPU) to run
computationally demanding operations.
GPUs are evolving at faster rates than traditional CPUs [8]. Though they were
initially built to manage the graphic display in computers, their multi-core
structure, usually with more than 200 units, makes them well suited to parallel
processing. The release of the compute unified device architecture (CUDA) software
from NVIDIA first enabled their use for running non-graphical computations, also known
as general purpose computing on graphics processing units (GPGPU). Since then, several
libraries for CUDA users have been released, as well as software for other GPU
vendors, such as FireStream for AMD or the more general OpenCL. Nevertheless, most of
these applications require extensive knowledge of GPU structure and of parallel
processing in specific programming languages such as C or Python. In contrast, typical
end users are researchers in fields such as statistics, chemistry, biology or finance,
who possess limited programming skills and work in friendlier environments such as
Matlab, Mathematica or R. Consequently, there is still a need to develop user-friendly
tools and functions so that these end users can fully exploit the advantages of GPGPU.
Along these lines, this study focuses on accelerating ELMs, an appropriate technique
for dealing with big datasets [12]. The major operations of an ELM are simple
matrix-matrix multiplications, inversions and transposes, so they have an adequate
structure for parallel processing. Some previous works exist on the GPU acceleration
of ELMs [5], but they require advanced programming skills, as most of them are based
on programming in C. Our proposal is therefore implemented in the open source R
language [11], which is widely used in the data analysis field. A new library is
freely released at https://github.com/maaliam/EDMANS-elmNN-GPU.git for future
implementations. Computations were run on NVIDIA GPUs in order to take advantage of
the numerous CUDA-related libraries available, such as the R package gputools [1].
2 Methodology
2.1 ELM
ELM [7] implements a simplification of the traditional learning algorithms used in a
Single-Hidden Layer Feedforward Network (SLFN) [9]. Instead of using an iterative
procedure, ELM proposes a straightforward solution to the optimization problem behind
an SLFN, while keeping its generalization capacity.
Mathematically, the algorithm is described as follows. An ELM predicts an output
variable $t_i$ given a training set $(x_i, t_i)$ with $x_i \in \mathbb{R}^d$,
$t_i \in \mathbb{R}$, $i = 1, \dots, N$, where $x_i$ is the set of inputs, $d$ is the
dimension of the input feature space, and $N$ the number of instances. Random values
are assigned to the weight vector $w_i$ between the inputs and the hidden layer, and
to the bias $b_i$. As a consequence, the hidden-layer output matrix $H$ is directly
obtained:
$$H = \begin{bmatrix} g(w_1 \cdot x_1 + b_1) & \cdots & g(w_L \cdot x_1 + b_L) \\ \vdots & \ddots & \vdots \\ g(w_1 \cdot x_N + b_1) & \cdots & g(w_L \cdot x_N + b_L) \end{bmatrix}_{N \times L} \qquad (1)$$
where $L$ is the number of neurons in the hidden layer and $g$ is the activation
function. Initially, the algorithm essentially maps the set of points from the
$d$-dimensional feature space to an $L$-dimensional neuron space. Once matrix $H$ is
obtained, the problem is reduced to calculating the weight vector $\beta$ between the
hidden layer and the outputs ($H\beta = T$). This typical optimization problem is
solved by using the minimal norm Least-Square method, in which the Moore-Penrose
generalized inverse ($H^{\dagger}$) of matrix $H$ has to be computed:
$$\beta = H^{\dagger} T \qquad (2)$$
In this study, a variation of the traditional methods to compute this inverse is
utilized. Following the theory explained in [6], a regularization parameter is
included. This approach is strongly recommended for dealing with databases with a high
number of samples:
$$H^{\dagger} = \left( \frac{I}{C} + H^{T} H \right)^{-1} H^{T} \qquad (3)$$
where $C$ stands for the cost parameter. Based on ridge regression theory, the
addition of a positive value to the diagonal of matrix $H^{T}H$ improves the
robustness of the solution.
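The formulation above can be sketched in a few lines of R. This is only an illustrative sketch in the theoretical (non-transposed) orientation of this subsection, not the code of package elmNN: the function names elm_fit() and elm_predict(), the uniform range of the random weights and the toy data are assumptions made for the example.

```r
# Minimal ELM sketch following Eqs. (1)-(3); all names are illustrative.
elm_fit <- function(X, target, L, C = 1) {
  d <- ncol(X)
  W <- matrix(runif(d * L, -1, 1), nrow = d, ncol = L)  # random input weights (assumed range)
  b <- runif(L, -1, 1)                                   # random biases (assumed range)
  # Hidden-layer output matrix H (N x L) with a sigmoid activation, Eq. (1)
  H <- 1 / (1 + exp(-sweep(X %*% W, 2, b, "+")))
  # Regularized least squares: beta = (I/C + H'H)^{-1} H' T, Eqs. (2)-(3)
  beta <- solve(diag(L) / C + crossprod(H), crossprod(H, target))
  list(W = W, b = b, beta = beta)
}

elm_predict <- function(model, X) {
  H <- 1 / (1 + exp(-sweep(X %*% model$W, 2, model$b, "+")))
  drop(H %*% model$beta)
}

# Toy regression problem with inputs in [0, 1]
set.seed(1)
X <- matrix(runif(200 * 3), ncol = 3)
target <- X[, 1] + 2 * X[, 2] - X[, 3]
model <- elm_fit(X, target, L = 20, C = 1000)
pred <- elm_predict(model, X)
```

Solving the linear system with solve(A, B) avoids forming the inverse explicitly; the package code discussed in Subsect. 2.2 computes the inverse first because the subsequent product is one of the operations moved to the GPU.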
2.2 GPU-acceleration
The original ELM functions of package elmNN [3] were used as baseline to develop
the GPU-accelerated ELM. Two functions from this package were modified:
elmtrain.default(), a method of the general function elmtrain().
predict.elmNN(), a method of the general function predict().
In both functions, an additional argument named GPU has been included. With the
default value, GPU=FALSE, the computations run on the CPU; setting GPU=TRUE activates
the GPU-accelerated version.
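One way such a switch could be wired is sketched below. The helper mat_mult() is hypothetical and is not taken from the released code; it only assumes that gputools exports gpuMatMult(), and it falls back to the CPU product when the package is not installed so that the example runs anywhere.

```r
# Hypothetical helper: multiply on the GPU when requested and available,
# otherwise fall back to base R's %*% on the CPU.
mat_mult <- function(A, B, GPU = FALSE) {
  if (GPU && requireNamespace("gputools", quietly = TRUE)) {
    gputools::gpuMatMult(A, B)
  } else {
    A %*% B
  }
}

A <- matrix(as.numeric(1:6), nrow = 2)  # 2 x 3
B <- matrix(as.numeric(1:6), nrow = 3)  # 3 x 2
cpu <- mat_mult(A, B)                   # GPU = FALSE by default
gpu <- mat_mult(A, B, GPU = TRUE)       # uses the CPU if gputools is missing
```

Keeping the switch inside a single helper means the rest of the training and prediction code stays identical for both versions, which makes the timing comparison easier to interpret.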
First, a sensitivity study was performed (Subsect. 3.1) in order to detect the most
demanding operations in each function. The nomenclature used is based on the package
elmNN. In this package, all matrices and vectors are transposed compared to the
theoretical explanation of Subsect. 2.1.
The training function based on elmtrain.default() computes the weight vector $\beta$
given $(x_i, t_i)$, following the equations described in Subsect. 2.1. This function
comprises ten R instructions:
#1  tempH <- inpweight %*% P
#2  biasMatrix <- matrix(rep(biashid, ncol(P)), nrow = nhid,
                         ncol = ncol(P), byrow = F)
#3  tempH <- tempH + biasMatrix
#4  H <- 1 / (1 + exp(-1 * tempH))
#5  H.Prod <- H %*% t(H)
#6  C.Mat <- diag(ncol(H.Prod)) / C + H.Prod
#7  inverse <- solve(C.Mat)
#8  mult <- inverse %*% H
#9  outweight <- mult %*% t(T)
#10 Y <- t(t(H) %*% outweight)
The predict function predict.elmNN() computes the output vector $t_i$ based on the set
of input samples $x_i$ and on the previously calculated weight vector $\beta$. The
function comprises six instructions:
#11 TV.P <- t(x)
#12 tmpHTest <- inpweight %*% TV.P
#13 biasMatrixTE <- matrix(rep(biashid, ncol(TV.P)), nrow = nhid,
                           ncol = ncol(TV.P), byrow = F)
#14 tmpHTest <- tmpHTest + biasMatrixTE
#15 HTest <- 1 / (1 + exp(-1 * tmpHTest))
#16 TY <- t(t(HTest) %*% outweight)
The time spent by each instruction was recorded to establish the influence of each
operation on the overall execution time. Preliminary trials (Subsect. 3.1) indicated
that matrix-matrix multiplications were the most costly operations, so they were the
only ones executed on the GPU (instructions #1, #5, #8 and #12). These multiplications
were implemented using the gpuMatMult() function of package gputools. Basically, this
function is a wrapper for the cublasDgemm function of the NVIDIA CUDA Basic Linear
Algebra Subroutines (cuBLAS) library. The remaining operations were still executed on
the CPU.
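The per-instruction timing described above can be reproduced on the CPU with base R alone. The sketch below is a hedged illustration, not the authors' benchmarking script: the helper elapsed(), the problem sizes and the name Tgt (used here in place of T to avoid masking R's shorthand for TRUE) are assumptions.

```r
# Sketch: time each of the ten training instructions on random data.
set.seed(42)
d <- 10; N <- 2000; nhid <- 100; C <- 1          # arbitrary small sizes
P <- matrix(runif(d * N), nrow = d)              # inputs, d x N (elmNN orientation)
Tgt <- matrix(runif(N), nrow = 1)                # targets, 1 x N
inpweight <- matrix(runif(nhid * d, -1, 1), nrow = nhid)
biashid <- runif(nhid, -1, 1)

# Elapsed wall-clock seconds of one expression; assignments made inside the
# expression remain visible in the calling environment.
elapsed <- function(expr) system.time(expr)[["elapsed"]]

timings <- c(
  "#1"  = elapsed(tempH <- inpweight %*% P),
  "#2"  = elapsed(biasMatrix <- matrix(rep(biashid, ncol(P)), nrow = nhid,
                                       ncol = ncol(P), byrow = FALSE)),
  "#3"  = elapsed(tempH <- tempH + biasMatrix),
  "#4"  = elapsed(H <- 1 / (1 + exp(-1 * tempH))),
  "#5"  = elapsed(H.Prod <- H %*% t(H)),
  "#6"  = elapsed(C.Mat <- diag(ncol(H.Prod)) / C + H.Prod),
  "#7"  = elapsed(inverse <- solve(C.Mat)),
  "#8"  = elapsed(mult <- inverse %*% H),
  "#9"  = elapsed(outweight <- mult %*% t(Tgt)),
  "#10" = elapsed(Y <- t(t(H) %*% outweight))
)
print(round(timings, 3))
```

With sizes this small most entries will be close to zero; the differences between instructions only become visible at the database sizes used in Subsect. 3.1.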
3 Experiments
3.1 Sensitivity Analysis
Initially, the computational cost of each instruction using the CPU was compared with
that of the proposed GPU-accelerated ELM. A two-dimensional grid of databases was
designed to perform the comparisons by varying the number of samples $N$ and features
$d$. In addition, the main parameter of an ELM, the number of neurons in the hidden
layer $L$, was also varied. Consequently, a three-dimensional grid of models was
created with the following ranges and intervals of parameters:
Number of samples $N$: from 5,000 to 50,000 in steps of 5,000 samples.
Number of neurons in the hidden layer $L$: from 100 to 1,000 in steps of 100 neurons.
Number of features $d$: 10, 100 and 1,000.
All databases were randomly generated using numbers from 0 to 1, as the only goal was
to evaluate the execution times. For the same reason, the cost parameter $C$ was held
constant and equal to 1. To ensure a fair comparison, exactly the same randomly
generated bias $b_i$ and input weight vector $w_i$ were used in both types of ELMs.
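The experimental grid and the random databases can be sketched as follows. The helper make_database() and the fixed seed are assumptions for illustration; fixing the seed is one simple way to give both ELM versions the same random numbers.

```r
# Three-dimensional grid of models: samples N, neurons L and features d
grid <- expand.grid(
  N = seq(5000, 50000, by = 5000),
  L = seq(100, 1000, by = 100),
  d = c(10, 100, 1000)
)
n_models <- nrow(grid)  # 10 * 10 * 3 combinations

# Hypothetical generator: inputs and targets drawn uniformly from [0, 1]
make_database <- function(N, d) {
  list(x = matrix(runif(N * d), nrow = N, ncol = d), y = runif(N))
}

set.seed(123)                        # same random numbers for CPU and GPU runs
db <- make_database(N = 50, d = 10)  # tiny size, for illustration only
```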
Computations were performed on a workstation with the following hardware
specifications: an NVIDIA GeForce GTX 650 with 384 cores and 2 GB of GDDR5 memory, a
dual-core processor (AMD Athlon™ 64 X2 @ 1.8 GHz) and 4 GB of RAM.
3.2 Case study: Estimation of Daily Global Solar Irradiation
The efficiency of the GPU-accelerated ELM was further analyzed with a real case.
Global solar irradiation was predicted from thirteen meteorological input variables
related to rainfall, temperature, extraterrestrial irradiation, wind speed and
humidity. Daily measurements recorded at 4 different locations in southern Spain
(Cordoba, Jaen, Nijar and Puebla Rio) from 2009 to 2013 were used. After removing
spurious records, a database composed of 6,605 samples was obtained. The period
2009-2012 was used for training (5,249 samples) whilst the 2013 samples were used for
testing (1,356 samples).
This case study is a practical example where high computational resources are required
with a relatively small database, as a vast number of models must be run to perform
model selection. Genetic algorithms were used to execute MPO and FS simultaneously. To
this end, a hybrid chromosome was utilized, in which a binary part models the set of
input variables chosen and a real-coded part stands for the parameters of the model
[13]. The population size and the number of generations were set to 64 and 20
respectively. Individuals were first ranked according to a fitness function $J$ that
accounts for the prediction error ($J = \mathrm{MAE}_{val}$) and were subsequently
re-ranked according to model complexity. The re-ranking works as follows. First, the
complexity of each individual was evaluated based on the generalized degrees of
freedom (GDF) theory [14]. Using GDF, complexity is defined as the sensitivity of each
fitted value $\hat{y}_i$ to a randomly perturbed target value $y_i$. In this case, the
"horizontal" estimate with 10 repetitions and Gaussian noise $\mathcal{N}(0, 0.05)$
was utilized as described in [10]. Then, models inside the same error interval
(relative difference between errors of less than 0.1%) were re-ranked according to
this complexity value. The 16 best individuals (25% elitism) were selected as parents
for the next generation. A mutation rate of 10% was chosen, but the best individual
was not mutated.
GA computations were run on the workstation described in Subsect. 3.1. When working
only with the CPU, only one core was used in order to carry out a fair comparison and,
consequently, only one model was computed at a time.
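The two-stage ranking can be illustrated with a short sketch. The text does not spell out the exact re-ranking procedure, so the version below is an assumption: individuals are sorted by error, and neighbours whose errors differ by less than the tolerance are swapped whenever the lower-ranked one is less complex. The function name rerank() and the example values are hypothetical.

```r
# Hypothetical re-ranking: sort by error, then promote the less complex model
# among neighbours whose relative error difference is below tol.
rerank <- function(error, complexity, tol = 0.001) {
  ord <- order(error)
  repeat {
    swapped <- FALSE
    for (k in seq_len(length(ord) - 1)) {
      a <- ord[k]
      b <- ord[k + 1]
      denom <- max(abs(error[a]), abs(error[b]), .Machine$double.eps)
      rel_diff <- abs(error[a] - error[b]) / denom
      if (rel_diff < tol && complexity[b] < complexity[a]) {
        ord[k] <- b
        ord[k + 1] <- a
        swapped <- TRUE
      }
    }
    if (!swapped) break
  }
  ord
}

error      <- c(0.06000, 0.06002, 0.07000, 0.05900)  # validation MAE per individual
complexity <- c(500, 300, 100, 400)                  # GDF-style complexity values
ranking <- rerank(error, complexity)                 # indices, best individual first
n_parents <- 2                                       # the paper keeps 16 of 64
parents <- head(ranking, n_parents)
```

In this example individuals 1 and 2 have nearly identical errors, so the simpler individual 2 moves ahead of individual 1, while individual 4 keeps the first place because its error is clearly lower.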
4 Results and Discussion
Table 1 shows the execution time in seconds taken by each instruction in the train and
predict functions when only the CPU was used to run the computations. The results show
that the most time-consuming operations were the matrix-matrix multiplications:
operations #1, #5 and #8 in elmtrain.default() and instruction #12 in predict.elmNN().
Operations #1 and #12 are essentially the same, as both compute matrix $H$ based on
the randomly generated input weight vector $w_i$ and the set of input variables $x_i$:
$$\mathrm{tempH}_{[L \times N]} = \mathrm{inpweight}_{[L \times d]} \, P_{[d \times N]} \qquad (4)$$
Instructions #5 and #8 are the two matrix-matrix multiplications required to implement
minimal norm Least-Squares with the regularization parameter. Operation #5 computes
the matrix multiplication between $H$ and its transpose:
$$\mathrm{H.Prod}_{[L \times L]} = H_{[L \times N]} \, t(H)_{[N \times L]} \qquad (5)$$
while operation #8 multiplies the inverted matrix obtained after adding the
regularization term and $H$:
$$\mathrm{mult}_{[L \times N]} = \mathrm{inverse}_{[L \times L]} \, H_{[L \times N]} \qquad (6)$$
Looking at the dimensions of these multiplications, the number of features $d$ only
influences operations #1 and #12. Accordingly, the results of Table 1 show that the
execution time of operation #1 increased 115 times, from 0.44 to 50.82 seconds, when
the database of 1,000 features was used. The other operations remained virtually
constant when the number of features was varied.
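The dimensional argument can be checked with a rough operation count. The sketch below assumes the standard cost of a dense product, one multiply-add per element of an [a x b] by [b x c] multiplication, and assumes as many prediction samples as training samples for #12; the function matmul_cost() and the chosen sizes are illustrative.

```r
# Multiply-add count of the four matrix-matrix products (dense algorithm assumed)
matmul_cost <- function(N, L, d) {
  c("#1"  = L * d * N,   # inpweight [L x d] %*% P [d x N]
    "#5"  = L * N * L,   # H [L x N] %*% t(H) [N x L]
    "#8"  = L * L * N,   # inverse [L x L] %*% H [L x N]
    "#12" = L * d * N)   # same shapes as #1 on the prediction inputs
}

# Effect of going from 10 to 1,000 features with N and L held fixed
ratio <- matmul_cost(N = 27500, L = 550, d = 1000) /
  matmul_cost(N = 27500, L = 550, d = 10)
print(ratio)
```

Only the entries for #1 and #12 change with $d$, and their predicted growth of 100 times is of the same order as the measured increase of 115 times.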
Modifying the number of samples $N$ and the number of neurons in the hidden layer $L$
had a similar impact on the execution time. The execution time of multiplications #1,
#5, #8 and #12 grew considerably. In contrast, the execution time of some of the
remaining operations, such as the activation function (#4), the calculation of the
regularization term (#6), the matrix-vector multiplications (#10 and #16) or the
computation of the inverse (#7), experienced a virtually negligible increase. This was
the reason why only matrix-matrix multiplications were implemented on the GPU.
Table 2 shows the execution times in seconds using the same databases but with the
GPU-accelerated ELM. Remarkable time reductions were obtained in the most demanding
computations. For instance, with the database of 50,000 samples, each matrix-matrix
multiplication required around 2 seconds or less, which is now in the same range as
other computations such as the arithmetic operations of the sigmoid activation
function (#4). What is more, using 1,000 neurons in the hidden layer, the time
required by each of the largest matrix multiplications (#5 and #8) was roughly half of
the time spent obtaining the inverse (#7).
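As a worked example, the figures reported in Tables 1 and 2 for the block with 1,000 neurons can be combined into an overall training speedup. The numbers below are copied from the elmtrain.default() columns (#1 to #10) of those two rows; the variable names are illustrative.

```r
# Execution times (s) of instructions #1 to #10 with 1,000 neurons
cpu_1000 <- c(34.16, 1.33, 0.29, 2.75, 92.67, 0.02, 4.17, 92.14, 0.25, 0.88)
gpu_1000 <- c(0.89, 1.33, 0.29, 2.71, 2.58, 0.02, 4.15, 2.05, 0.15, 0.73)
names(cpu_1000) <- names(gpu_1000) <- paste0("#", 1:10)

matmul <- c("#1", "#5", "#8")                         # products moved to the GPU
train_speedup <- sum(cpu_1000) / sum(gpu_1000)        # whole training function
matmul_share <- sum(cpu_1000[matmul]) / sum(cpu_1000) # CPU time spent in products
op_speedup <- cpu_1000[matmul] / gpu_1000[matmul]     # speedup per product
```

For these averaged rows the training speedup comes out at roughly 15, and the three products account for about 96 % of the CPU training time.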
Table 1. Execution time in seconds taken by each instruction using the CPU. Results of
the three-dimensional (neurons, samples, features) grid of models are summarized in
three blocks. In each block, the variation of one parameter was individually studied
while averaging the other two.
Columns #1 to #10: elmtrain.default(). Columns #11 to #16: predict.elmNN().
              #1    #2    #3    #4    #5    #6    #7    #8    #9   #10   #11   #12   #13   #14   #15   #16
Number of features
10          0.44  0.74  0.16  1.49 35.78  0.01  1.26 35.43  0.14  0.44  0.00  0.43  0.74  0.16  1.48  0.43
100         4.43  0.73  0.16  1.49 35.81  0.01  1.28 35.37  0.14  0.45  0.02  4.43  0.74  0.16  1.48  0.44
1,000      50.82  0.74  0.16  1.49 35.79  0.01  1.28 35.39  0.14  0.44  0.35 50.84  0.73  0.15  1.64  0.50
Number of neurons in the hidden layer
100         3.46  0.13  0.02  0.27  1.02  0.00  0.01  0.68  0.03  0.07  0.12  3.46  0.13  0.02  0.26  0.07
200         6.75  0.26  0.05  0.54  3.87  0.00  0.03  2.70  0.05  0.15  0.12  6.75  0.26  0.05  0.54  0.15
300        10.04  0.41  0.09  0.81  8.56  0.00  0.11  8.37  0.08  0.23  0.12 10.04  0.41  0.09  0.81  0.23
400        13.34  0.53  0.11  1.08 15.08  0.00  0.28 14.84  0.10  0.31  0.12 13.36  0.53  0.11  1.08  0.31
500        16.65  0.66  0.14  1.35 23.36  0.00  0.54 23.10  0.12  0.39  0.12 16.63  0.66  0.14  1.35  0.39
600        19.99  0.80  0.18  1.62 33.52  0.01  0.93 33.24  0.15  0.47  0.12 19.95  0.80  0.17  1.62  0.47
700        23.30  0.96  0.21  1.89 45.52  0.01  1.46 45.22  0.17  0.56  0.12 23.41  0.95  0.21  1.89  0.56
800        27.29  1.07  0.23  2.16 59.31  0.01  2.14 59.04  0.20  0.64  0.12 27.31  1.06  0.22  2.21  0.66
900        30.67  1.20  0.26  2.45 75.02  0.01  3.06 74.63  0.22  0.73  0.12 30.68  1.20  0.26  2.61  0.83
1,000      34.16  1.33  0.29  2.75 92.67  0.02  4.17 92.14  0.25  0.88  0.12 34.12  1.33  0.28  2.99  0.93
Number of samples
5,000       3.38  0.13  0.03  0.27  6.49  0.01  1.27  6.43  0.03  0.06  0.02  3.37  0.13  0.02  0.27  0.06
10,000      6.75  0.27  0.05  0.54 13.00  0.01  1.27 12.93  0.05  0.15  0.05  6.76  0.27  0.05  0.54  0.15
15,000     10.16  0.41  0.09  0.81 19.50  0.01  1.27 19.36  0.08  0.23  0.06 10.13  0.41  0.09  0.82  0.23
20,000     13.47  0.53  0.11  1.08 26.05  0.01  1.27 25.70  0.10  0.31  0.09 13.48  0.53  0.11  1.09  0.32
25,000     16.85  0.66  0.14  1.37 32.53  0.01  1.27 32.20  0.12  0.39  0.10 16.87  0.66  0.14  1.36  0.40
30,000     20.22  0.80  0.17  1.63 39.07  0.01  1.27 38.65  0.15  0.48  0.14 20.26  0.80  0.17  1.64  0.49
35,000     23.63  0.95  0.21  1.90 45.56  0.01  1.27 45.05  0.17  0.56  0.14 23.60  0.95  0.20  1.92  0.57
40,000     26.99  1.06  0.23  2.16 52.09  0.01  1.28 51.48  0.20  0.67  0.20 27.02  1.07  0.23  2.27  0.69
45,000     30.38  1.20  0.26  2.44 58.57  0.01  1.28 57.87  0.22  0.74  0.19 30.40  1.20  0.26  2.55  0.77
50,000     33.81  1.33  0.29  2.72 65.08  0.01  1.28 64.29  0.25  0.84  0.25 33.80  1.33  0.28  2.91  0.91
Table 2. Execution time in seconds taken by each instruction using the GPU. Results of
the three-dimensional (neurons, samples, features) grid of models are summarized in
three blocks. In each block, the variation of one parameter was individually studied
while averaging the other two.
Columns #1 to #10: elmtrain.default(). Columns #11 to #16: predict.elmNN().
              #1    #2    #3    #4    #5    #6    #7    #8    #9   #10   #11   #12   #13   #14   #15   #16
Number of features
10          0.13  0.74  0.16  1.48  1.10  0.01  1.26  0.84  0.08  0.37  0.00  0.12  0.74  0.16  1.49  0.37
100         0.23  0.73  0.16  1.48  1.10  0.01  1.28  0.84  0.08  0.37  0.02  0.22  0.74  0.16  1.48  0.37
1,000       1.20  0.74  0.16  1.49  1.10  0.01  1.27  0.84  0.08  0.37  0.63  1.25  0.73  0.15  1.52  0.39
Number of neurons in the hidden layer
100         0.24  0.13  0.02  0.26  0.09  0.00  0.00  0.05  0.02  0.06  0.12  0.14  0.13  0.02  0.26  0.06
200         0.22  0.26  0.05  0.54  0.23  0.00  0.03  0.13  0.03  0.13  0.12  0.22  0.26  0.05  0.54  0.12
300         0.31  0.41  0.09  0.81  0.40  0.00  0.11  0.26  0.04  0.20  0.12  0.31  0.41  0.09  0.81  0.20
400         0.39  0.53  0.11  1.08  0.59  0.00  0.28  0.41  0.06  0.26  0.12  0.39  0.53  0.11  1.08  0.26
500         0.47  0.66  0.14  1.35  0.82  0.00  0.54  0.59  0.07  0.33  0.12  0.47  0.66  0.14  1.35  0.32
600         0.55  0.80  0.17  1.62  1.08  0.01  0.93  0.81  0.09  0.40  0.12  0.55  0.80  0.17  1.62  0.39
700         0.63  0.96  0.20  1.89  1.37  0.01  1.45  1.06  0.10  0.47  0.12  0.62  0.95  0.21  1.89  0.47
800         0.71  1.07  0.23  2.15  1.70  0.01  2.14  1.33  0.11  0.54  0.12  0.72  1.06  0.22  2.16  0.55
900         0.81  1.20  0.26  2.43  2.15  0.01  3.04  1.70  0.13  0.61  0.34  0.96  1.20  0.26  2.50  0.64
1,000       0.89  1.33  0.29  2.71  2.58  0.02  4.15  2.05  0.15  0.73  0.85  0.95  1.33  0.28  2.75  0.72
Number of samples
5,000       0.19  0.13  0.02  0.26  0.18  0.01  1.27  0.15  0.02  0.05  0.02  0.09  0.13  0.02  0.27  0.05
10,000      0.19  0.27  0.05  0.54  0.40  0.01  1.27  0.31  0.03  0.13  0.05  0.19  0.27  0.05  0.54  0.13
15,000      0.28  0.41  0.09  0.81  0.60  0.01  1.27  0.46  0.04  0.20  0.06  0.28  0.41  0.09  0.81  0.20
20,000      0.37  0.53  0.11  1.08  0.81  0.01  1.27  0.61  0.06  0.28  0.09  0.37  0.53  0.11  1.08  0.27
25,000      0.46  0.66  0.14  1.35  1.00  0.01  1.27  0.76  0.07  0.34  0.10  0.46  0.66  0.14  1.35  0.34
30,000      0.55  0.80  0.17  1.63  1.21  0.01  1.27  0.91  0.09  0.41  0.14  0.57  0.80  0.17  1.64  0.41
35,000      0.65  0.95  0.21  1.89  1.40  0.01  1.27  1.07  0.10  0.47  0.14  0.65  0.95  0.20  1.90  0.48
40,000      0.74  1.06  0.23  2.15  1.60  0.01  1.27  1.22  0.12  0.54  0.20  0.86  1.07  0.23  2.17  0.55
45,000      0.83  1.20  0.26  2.43  1.80  0.01  1.27  1.37  0.13  0.61  0.92  0.86  1.20  0.26  2.44  0.61
50,000      0.93  1.33  0.29  2.70  2.01  0.01  1.27  1.52  0.15  0.69  0.47  0.97  1.33  0.28  2.74  0.69
[Figure 1: ten panels, one per number of samples (5,000 to 50,000 in steps of 5,000); horizontal axis: neurons in the hidden layer (200 to 1,000); legend: 1,000 features, 100 features, 10 features.]
Fig. 1. Execution time speedup (vertical axis) in the elmtrain.default() function for the
different combinations of samples, features and neurons in the hidden layer
Along these lines, Fig. 1 depicts the overall time reduction between the two ELM
versions. A striking speedup of up to 15 times was obtained when using the database
with 1,000 features. The speedup was up to 10 times when smaller databases with 10 or
100 features were used. The slight fall in the speedup curve for databases with 1,000
features and 40,000, 45,000 or 50,000 samples was caused by RAM limitations in the
workstation.
The case of 100 features is analyzed in more detail in Fig. 2 with a contour plot,
where samples and neurons were selected as the x and y axes respectively while speedup
was the third dimension. It can be seen that the highest speedups were obtained when
the number of samples and the number of neurons increased proportionally. In contrast,
the speedup increased only slightly, or even decreased, when only neurons or only
samples were raised. This explains why in Fig. 1, when a low number of samples was
used, the speedup curve stagnated or even started to decrease when the number of
neurons was raised. Similar patterns were observed with 10 and 1,000 features (data
not shown).
Fig. 2. Contour plot of execution time speedup in the elmtrain.default() function for the
different number of samples and neurons in the hidden layer in the case of 100 features
Finally, the applicability of the GPU-accelerated ELM was evaluated with a real case
study of model selection using GA, where 12,800 ELMs were computed. Although the GA
procedure introduces some sources of uncertainty in different steps, such as the
creation of the initial generation or mutations, preliminary results obtained by
repeating the GA several times have proved the robustness of this methodology. This
robustness was again verified in this case study, as the evolution followed in the GA
by both ELM versions presented a similar trend (data not shown). Besides, the final
solutions reached (best individual of the last generation) by both ELMs were roughly
equivalent. With the CPU version, an ELM with 8 selected features, 472 neurons in the
hidden layer and a cost of $2^{9}$ was obtained with a normalized testing MAE of
0.059, while with the GPU version, an ELM with the same 8 selected features, 524
neurons in the hidden layer and a cost of $2^{11}$ was obtained with a normalized
testing MAE of 0.060.
As a result, the computational cost in each generation was comparable in terms of
complexity. However, when looking at execution times, Fig. 3 shows how the
GPU-accelerated ELM significantly outperforms the CPU version, achieving an overall
speedup of roughly 6 times. The CPU version took around 123 hours to run all
generations whilst the proposed GPU-accelerated ELM spent only 20 hours.
[Figure 3: horizontal axis: number of generation (0 to 20); vertical axes: generation execution time (min) and total execution time (min).]
Fig. 3. Total execution time (lines) and execution time per generation (bars) in minutes spent by
CPU and GPU versions of ELM through the GA optimization process
5 Conclusions and Future Work
General purpose computing on graphics processing units (GPGPU) is emerging as one of
the most appealing technologies to deal with today's computational challenges in data
analysis. In this study, a GPU-accelerated version of ELM has been proposed by
modifying the R package elmNN. Preliminary trials showed that, by running only
matrix-matrix multiplications on the GPU, speedups of up to 15 times were obtained in
the most demanding situations, with a high number of features, samples and neurons in
the hidden layer. The methodology also proved useful in a real case where a relatively
small database was used but a vast number of models had to be computed in the context
of model selection meta-heuristics. In this case, an overall reduction of around 6
times was obtained using GA to fine-tune an ELM.
The GPU-accelerated version of package elmNN has been freely released at
https://github.com/maaliam/EDMANS-elmNN-GPU.git for future applications by R users,
with the only requirement being a workstation with an NVIDIA graphics card.
Finally, several aspects could be explored in the future, such as running the whole
ELM algorithm on the GPU, using a GPU cluster in order to compute several ELMs at the
same time, and implementing the same methodology in other well-known prediction
techniques such as support vector machines or the multilayer perceptron.
Acknowledgments R. Urraca and J. Antonanzas would like to acknowledge the fellowship
FPI-UR-2014 granted by the University of La Rioja. F. Antonanzas-Torres would like to
express his gratitude for the FPI-UR-2012 and ATUR grant No. 03061402 at the
University of La Rioja. We are also greatly indebted to Banco Santander for the
PROFAI-13/06 fellowship, to the Agencia de Desarrollo Económico de La Rioja for the
ADER-2012-I-IDD-00126 (CONOBUILD) fellowship and to the Instituto de Estudios Riojanos
(IER) for funding parts of this research.
References
1. Buckner, J., Wilson, J., Seligman, M., Athey, B., Watson, S., Meng, F.: The gputools package
enables GPU computing in R. Bioinformatics 26(1), 134–135 (2010)
2. Chyzhyk, D., Savio, A., Graña, M.: Evolutionary ELM wrapper feature selection for
Alzheimer’s disease CAD on anatomical brain MRI. Neurocomputing 128, 73–80 (2014)
3. Gosso, A.: elmNN: Implementation of ELM (Extreme Learning Machine) algorithm for
SLFN (Single Hidden Layer Feedforward Neural Networks). R package version 1.3 (2012),
4. Hashem, I.A.T., Yaqoob, I., Anuar, N.B., Mokhtar, S., Gani, A., Ullah Khan, S.: The rise
of "big data" on cloud computing: Review and open research issues. Information Systems
47(0), 98–115 (2015)
5. van Heeswijk, M., Miche, Y., Oja, E., Lendasse, A.: GPU-accelerated and parallelized ELM
ensembles for large-scale regression. Neurocomputing 74(16), 2430–2437 (2011)
6. Huang, G.B.: Extreme learning machine for regression and multiclass classification. IEEE
Transactions on Systems, Man and Cybernetics-Part B: Cybernetics 42(2), 513–529 (2012)
7. Huang, G.B., Zhu, Q.Y., Siew, C.K.: Extreme learning machine: theory and applications.
Neurocomputing 70, 489–501 (2006)
8. Peddie, J.: The new visualization engine - the heterogeneous processor unit. In: Dill, J., Earn-
shaw, R., Kasik, D., Vince, J., Wong, P.C. (eds.) Expanding the frontiers of visual analytics
and visualization, pp. 377–396. Springer International Publishing (2012)
9. Salcedo-Sanz, S., Casanova-Mateo, C., Pastor-Sanchez, A., Giron, M.S.: Daily global solar
radiation prediction based on a hybrid coral reefs optimization - extreme learning machine
approach. Solar Energy 105, 91–98 (2014)
10. Seni, G., Elder, J.: Ensemble methods in data mining. Improving accuracy through combining
predictions. Morgan & Claypool (2010)
11. Team, R.C.: R: A language and environment for statistical computing. R Foundation for
Statistical Computing, Vienna, Austria (2014),
12. Urraca, R., Antonanzas, J., Martinez-de Pison, F.J., Antonanzas-Torres, F.: Estimation of
solar global irradiation in remote areas. Journal of Renewable and Sustainable Energy (In
13. Urraca-Valle, R., Sodupe-Ortega, E., Antoñanzas Torres, J., Antoñanzas-Torres, F.,
Martinez-de Pison, F.J.: An overall performance comparative of GA-PARSIMONY
methodology with regression algorithms. In: de la Puerta, J.G., Ferreira, I.G., Bringas, P.G.,
Klett, F., Abraham, A., de Carvalho, A.C., Herrero, A., Baruque, B., Quintian, H., Corchado, E.
(eds.) Advances in Intelligent Systems and Computing, vol. 299, pp. 53–62. Springer
International Publishing (2014)
14. Ye, J.: On measuring and correcting the effects of data mining and model selection. Journal
of the American Statistical Association 93(441), 120–131 (1998)
... For CoverType dataset, matrix multiplication accounts for roughly 97% of the training time (see Sect. 2). Hence, we were motivated to carry out substantial research on adapting the standard ELM algorithm to several types of acceleration hardware that support parallelization, such as Field-Programmable Gate Arrays (FPGAs) [5,25,40] and Graphic Processing Units (GPUs) [1,17,36], and specialized multi-core ...
... A parallel incremental extreme SVM classifier was proposed in [7]. The ELM algorithm for large-scale regression on GPU was proposed in the paper [1,17,36]. Besides, paper [8] described an algorithm that was designed and implemented on MapReduce framework. Research has shown that Field-Programmable Gate Array (FPGA) performs better than General-Purpose Processors in machine learning algorithms [39]. ...
As an important learning algorithm, extreme learning machine (ELM) is known for its excellent learning speed. With the expansion of ELM’s applications in the field of classification and regression, the need for its real-time performance is increasing. Although the use of hardware acceleration is an obvious solution, how to select the appropriate acceleration hardware for ELM-based applications is a topic worthy of further discussion. For this purpose, we designed and evaluated the optimized ELM algorithms on three kinds of state-of-the-art acceleration hardware, i.e., multi-core CPU, Graphics Processing Unit (GPU), and Field-Programmable Gate Array (FPGA) which are all suitable for matrix multiplication optimization. The experimental results showed that the speedup ratio of these optimized algorithms on acceleration hardware achieved 10–800. Therefore, we suggest that (1) use GPU to accelerate ELM algorithms for large dataset, and (2) use FPGA for small dataset because of its lower power, especially for some embedded applications. We also opened our source code.
... Alia-Martinez et al. [155] developed a library in R language for GPU acceleration of ELM for big datasets. The authors performed a sensitivity analysis, which identified matrix multiplication as the most computationally demanding operations that consume 99% of execution time in ELM. ...
In spite of the prominence of the extreme learning machine model and its excellent features, such as insignificant intervention for learning and model tuning, simplicity of implementation, and high learning speed, which make it a fascinating alternative method for Artificial Intelligence, including Big Data Analytics, it is still limited in certain aspects. These aspects must be treated towards achieving an effective and cost-sensitive model. This review discussed the major drawbacks of ELM, which include difficulty in determining the hidden layer structure, prediction instability and imbalanced data distributions, the poor capability of sample structure preserving (SSP), and difficulty in accommodating lateral inhibition by direct random feature mapping. Other drawbacks include multi-graph complexity, one-by-one or chunk-by-chunk (a block of data), global memory size limitation, and challenges with big data. The recent trends proposed by experts for each drawback are discussed in detail towards achieving an effective and cost-sensitive model.
... Therefore, we decided to focus our attention on GPU-based implementation of extreme learning classifiers. There exists an efficient R package allowing for transferring the training procedure onto GPU using NVIDIA CUDA Basic Linear Algebra Subroutines (cuBLAS) library [1]. This allows for significant speed up of the most computationally costly operations such as calculating the matrix storing outputs of hidden layer H and its Moore-Penrose pseudoinverse (steps 3 and 4 of Algorithm 1). ...
Mining data streams is one of the most vital fields in the current era of big data. Continuously arriving data may pose various problems, connected to their volume, variety or velocity. In this paper we focus on two important difficulties embedded in the nature of data streams: non-stationary nature and skewed class distributions. Such a scenario requires a classifier that is able to rapidly adapt itself to concept drift and displays robustness to class imbalance problem. We propose to use online version of Extreme Learning Machine that is enhanced by an efficient drift detector and method to alleviate the bias towards the majority class. We investigate three approaches based on undersampling, oversampling and cost-sensitive adaptation. Additionally, to allow for a rapid updating of the proposed classifier we show how to implement online Extreme Learning Machines with the usage of GPU. The proposed approach allows for a highly efficient mining of high-speed, drifting and imbalanced data streams with significant acceleration offered by GPU processing.
As a popular classification algorithm for machine learning, Extreme Learning Machine (ELM) has been widely used. However, its performance on various hardware devices is unclear. Using a baseline single-core implementation of ELM, we find that the main time cost of ELM is matrix multiplication. This paper then designs optimized hardware algorithms for several computing devices (Multi-Core, GPU, and FPGA). The experiments on each platform show speedup ratios from 4 to more than 100. We open our source code and strongly recommend that later researchers design ELM applications on an appropriate hardware platform.
Cloud computing is a powerful technology to perform massive-scale and complex computing. It eliminates the need to maintain expensive computing hardware, dedicated space, and software. Massive growth in the scale of data or big data generated through cloud computing has been observed. Addressing big data is a challenging and time-demanding task that requires a large computational infrastructure to ensure successful data processing and analysis. The rise of big data in cloud computing is reviewed in this study. The definition, characteristics, and classification of big data along with some discussions on cloud computing are introduced. The relationship between big data and cloud computing, big data storage systems, and Hadoop technology are also discussed. Furthermore, research challenges are investigated, with focus on scalability, availability, data integrity, data transformation, data quality, data heterogeneity, privacy, legal and regulatory issues, and governance. Lastly, open research issues that require substantial research efforts are summarized.
It is clear that the learning speed of feedforward neural networks is in general far slower than required and it has been a major bottleneck in their applications for past decades. Two key reasons behind may be: (1) the slow gradient-based learning algorithms are extensively used to train neural networks, and (2) all the parameters of the networks are tuned iteratively by using such learning algorithms. Unlike these conventional implementations, this paper proposes a new learning algorithm called extreme learning machine (ELM) for single-hidden layer feedforward neural networks (SLFNs) which randomly chooses hidden nodes and analytically determines the output weights of SLFNs. In theory, this algorithm tends to provide good generalization performance at extremely fast learning speed. The experimental results based on a few artificial and real benchmark function approximation and classification problems including very large complex applications show that the new algorithm can produce good generalization performance in most cases and can learn thousands of times faster than conventional popular learning algorithms for feedforward neural networks.
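The two steps named in this abstract, random hidden nodes followed by an analytic solution for the output weights, can be written compactly. The notation below (X, T, W, b, g, H, β and the sizes N, d, L, m) is assumed for illustration and does not come from the text above; it is a sketch of the usual single-hidden-layer formulation, not a description of any particular implementation.

```latex
% Assumed notation: N samples, d inputs, L hidden nodes, m outputs,
% g an element-wise activation function (e.g. a sigmoid).
\begin{aligned}
&X \in \mathbb{R}^{N \times d},\quad T \in \mathbb{R}^{N \times m}
  && \text{(training inputs and targets)}\\
&W \in \mathbb{R}^{d \times L},\quad b \in \mathbb{R}^{L}
  && \text{(drawn at random, never tuned)}\\
&H = g\!\left(XW + \mathbf{1}_N b^{\top}\right) \in \mathbb{R}^{N \times L}
  && \text{(hidden-layer output matrix)}\\
&\hat{\beta} = H^{\dagger} T \in \mathbb{R}^{L \times m}
  && \text{(least-squares output weights)}\\
&\hat{Y} = g\!\left(X_{\mathrm{new}} W + \mathbf{1} b^{\top}\right)\hat{\beta}
  && \text{(prediction on new inputs)}
\end{aligned}
```

When H has full column rank, the pseudoinverse reduces to H† = (HᵀH)⁻¹Hᵀ, so training consists of a few matrix products and one L × L solve. This is consistent with matrix multiplication dominating the execution time.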
This chapter presents a brief and partial historical overview of the combination of technological events leading to a new paradigm in visualization—the development and embracing of Heterogeneous Processor Units (HPUs) along with supporting operating systems and development tools across multiple platforms from handheld mobile devices to supercomputers. HPUs are the result of the evolution of integration of more functions and functionality in semiconductors due to the regular cadence of manufacturing processes shrinking—often referred to as Moore’s Law. The HPU is the integration of powerful serial processors like the x86 architecture or RISC processors like ARM and MIPS, and highly parallel processors known as GPUs—graphics processor units. These HPUs bring new opportunities to the creation of powerful yet low cost visualization systems.
Solar global irradiation is barely recorded in remote areas around the world. The lack of access to an electricity grid in these areas presents an enormous opportunity for electrification through renewable energy sources and, specifically, with photovoltaic energy where great solar resources are available. Traditionally, solar resource estimation was performed using parametric-empirical models based on the relationship between solar irradiation and other atmospheric and commonly measured variables, such as temperatures, rainfall, sunshine duration, etc., achieving a relatively high level of certainty. The significant improvement in soft-computing techniques, applied extensively in many research fields, has led to improvements in solar global irradiation modeling. This study conducts a comparative assessment of four different soft-computing techniques (artificial neural networks, support vector regression, M5P regression trees, and extreme learning machines). The results were also compared with two well-known parametric models [Liu and Scot, Agric. For. Meteorol. 106(1), 41–59 (2001) and Antonanzas-Torres et al., Renewable Energy 60, 604–614 (2013b)]. A striking mean absolute error of 1.74 MJ/m² day was achieved with support vector regression (around 10% lower than with classic parametric models). Furthermore, the annual sums of estimated solar irradiation with this technique were within the intrinsic tolerance of pyranometers (5%). This methodology is performed in free environment R software and released at for future replications of the study in different areas.
This paper presents a performance comparative of GA-PARSIMONY methodology with five well-known regression algorithms and with different genetic algorithm (GA) configurations. This approach is mainly based on combining GA and feature selection (FS) during the model tuning process to achieve better overall parsimonious models that assure good generalization capacities. For this purpose, individuals, already sorted by their fitness function, are rearranged in each iteration depending on the model complexity. The main objective is to analyze the overall model performance achieved with this methodology for each regression algorithm against different real databases and varying the GA setting parameters. Our preliminary results show that two algorithms, multilayer perceptron (MLP) with the Broyden-Fletcher-Goldfarb-Shanno training method and support vector machines for regression (SVR) with radial basis function kernel, perform better with similar feature reduction when the database has a low number of input attributes (≲32) and low GA population sizes have been used.