JAISCR, 2021, Vol. 11, No. 1, pp. 33-50, 10.2478/jaiscr-2021-0003
AN OPTIMIZED PARALLEL IMPLEMENTATION OF NON-ITERATIVELY TRAINED RECURRENT NEURAL NETWORKS
Julia El Zini, Yara Rizk and Mariette Awad
Department of Electrical and Computer Engineering
American University of Beirut
E-mail: {jwe04,yar01,mariette.awad}@aub.edu.lb
Submitted: 7th May 2020; Accepted: 14th September 2020
Abstract
Recurrent neural networks (RNN) have been successfully applied to various sequential decision-making tasks, natural language processing applications, and time-series predictions. Such networks are usually trained through back-propagation through time (BPTT), which is prohibitively expensive, especially when the length of the time dependencies and the number of hidden neurons increase. To reduce the training time, extreme learning machines (ELMs) have recently been applied to RNN training, reaching a 99% speedup on some applications. Due to its non-iterative nature, ELM training, when parallelized, has the potential to reach higher speedups than BPTT.

In this work, we present Opt-PR-ELM, an optimized parallel RNN training algorithm based on ELM that takes advantage of the GPU shared memory and of parallel QR factorization algorithms to efficiently reach optimal solutions. The theoretical analysis of the proposed algorithm is presented on six RNN architectures, including LSTM and GRU, and its performance is empirically tested on ten time-series prediction applications. Opt-PR-ELM is shown to reach a speedup of up to 461 times over its sequential counterpart and to require up to 20x less time to train than parallel BPTT. Such high speedups over new generation CPUs are extremely crucial in real-time applications and IoT environments.

Keywords: GPU implementation, parallelization, Recurrent Neural Network (RNN), Long Short-Term Memory (LSTM), Gated Recurrent Unit (GRU), Extreme Learning Machines (ELM), non-iterative training
1 Introduction

Recurrent neural networks (RNN) are a type of neural network that has been successfully applied to many problems in machine learning [22]. They have proven their ability to exceed human performance in time series prediction and sequential decision-making [31]. RNN training is usually based on gradient descent methods, specifically back-propagation through time (BPTT) [40] and real-time recurrent learning [41], which require a substantial number of iterations before converging. Moreover, when unfolded through time, RNNs become even deeper [1] and their training becomes even more expensive, since the number of learned weights grows exponentially with the number of hidden neurons and the length of the time dependency.

Non-iterative training algorithms have been investigated in the literature [32, 1, 35] to reduce the training cost of neural networks. Recently, Ertugrul et al. [10] proposed a non-iterative training algorithm for Jordan RNNs [19].
Then, Rizk et al. [30] extended it to different RNN architectures, including Elman, fully connected RNN, and Long Short-Term Memory (LSTM). Their algorithm was tested on time-series and sequential decision-making problems and achieved a speedup of up to 99% over iterative training.

Although they only need one iteration to obtain near-optimal solutions, non-iterative training algorithms minimize their cost function by computing a Moore-Penrose pseudo-inverse, which requires ample computational resources, especially for large matrices. To the best of our knowledge, no attempts have been made in the literature to parallelize non-iterative training algorithms for RNNs. Fortunately, such algorithms hold great potential for parallelization due to their non-sequential nature.

In this work, we propose Basic-PR-ELM, a basic parallel version of ELM training applied to six RNN architectures: Elman, Jordan, NARMAX, fully connected, LSTM, and GRU. Basic-PR-ELM relies on parallel QR factorization to solve the pseudo-inverse required in ELM training algorithms. Then, the memory access patterns were studied and led to Opt-PR-ELM, an optimized version of parallel ELM training that utilizes the GPU shared memory to speed up the training process further.

The proposed algorithms, Basic-PR-ELM and Opt-PR-ELM, are tested on 10 publicly available time-series prediction applications and on different GPU architectures to empirically show their scalability, robustness, portability, speedup potential, and energy efficiency. Compared to the sequential version proposed by Rizk et al. in [30], Basic-PR-ELM and Opt-PR-ELM achieve speedups of up to 311 and 461, respectively, on the LSTM architecture. Notably, Opt-PR-ELM is shown to train LSTM networks 20 times faster than the parallel iterative training algorithm (BPTT).

The rest of the paper is organized as follows: Section 2 presents the background on ELM training and the RNN architectures. Section 3 summarizes the related work on RNN training and parallel training algorithms. Section 4 presents the proposed algorithms Basic-PR-ELM and Opt-PR-ELM, and Section 5 theoretically analyzes their memory and floating-point operations. Then, Section 6 discusses the experimental setup and Section 7 reports the empirical results. Finally, Section 8 concludes with final remarks.
2 Background

2.1 Extreme Learning Machine

Extreme Learning Machine (ELM) is a non-iterative training algorithm introduced by Huang et al. [16] for single hidden layer feedforward neural networks (SLFNs). Given $n$ arbitrary distinct training samples $(x_j, y_j)$, where $x_j \in \mathbb{R}^m$ and $y_j \in \mathbb{R}$, $M$ hidden nodes and an activation function $g$, the predicted output $O_j$ can be written as $O_j = \sum_{i=1}^{M} \beta_i\, g(w_i^T x_j + b_i)$, where $w_i \in \mathbb{R}^m$ is the weight vector connecting the $i$th hidden node and the input nodes, $\beta \in \mathbb{R}^M$ is the weight vector connecting all the hidden nodes and the output node, and $b_i$ is the bias of the $i$th hidden node. Throughout the training, the input weights $w_{ij}$ are randomly generated and fixed, and the output weights $\beta_1, \ldots, \beta_M$ are analytically computed. The goal is to minimize the error between the predicted and the true output as

$$\min_{\beta} \sum_{j=1}^{n} \left\| O_j - t_j \right\|^2 = \sum_{j=1}^{n} \left\| \sum_{i=1}^{M} \beta_i\, g(w_i^T x_j + b_i) - t_j \right\|^2. \quad (1)$$

Defining $H$ and $T$ as

$$H_{(n \times M)} = \begin{bmatrix} g(w_1^T x_1 + b_1) & \ldots & g(w_M^T x_1 + b_M) \\ \vdots & \ddots & \vdots \\ g(w_1^T x_n + b_1) & \ldots & g(w_M^T x_n + b_M) \end{bmatrix}, \quad (2)$$

$$T_{(n \times 1)} = [t_1, t_2, \ldots, t_n]^T, \quad (3)$$

one can compactly write the problem in Equation 1 as minimizing $\| H\beta - T \|^2$. The solution of this problem is given as $\beta = H^{\dagger} T$, where $H^{\dagger} = (H^T H)^{-1} H^T$ is the Moore-Penrose generalized inverse of the matrix $H$.
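To make the training procedure concrete, the following is a minimal NumPy sketch of ELM for an SLFN under our own illustrative assumptions (a sigmoid activation, scalar targets, and a least-squares solver standing in for the explicit pseudo-inverse); it is not the paper's implementation.

```python
import numpy as np

def elm_train(X, T, M, seed=0):
    """Minimal ELM sketch: X is (n, m) inputs, T is (n,) targets, M hidden nodes."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    W = rng.standard_normal((m, M))               # random, fixed input weights w_i
    b = rng.standard_normal(M)                    # random, fixed biases b_i
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))        # hidden layer output matrix (Eq. 2)
    beta, *_ = np.linalg.lstsq(H, T, rcond=None)  # least-squares solution, beta = H^+ T
    return W, b, beta

def elm_predict(X, W, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return H @ beta                               # predicted outputs O_j
```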
2.2 RNN architectures

RNNs are among the most powerful neural networks and are well suited to model long-term dependencies in time-series applications [31]. RNN architectures differ in the way cycles are introduced in the network. In this work, we consider six RNN architectures, illustrated in Figure 1: Elman [9], Jordan [19], NARMAX [8], fully connected RNN, LSTM [15] and GRU [5].
Figure 1. RNN architectures adapted from prior work in [30]

In Figure 1 and throughout this work, $x \in \mathbb{R}^{S \times Q}$ is the input to the network, $M$ is the number of hidden neurons, $w_i \in \mathbb{R}^S$ is the vector of weights connecting the input to the $i$th neuron, $\alpha_{ik} \in \mathbb{R}$ is the weight from neuron $i$ to itself from the $k$th previous time step, and $b_i$ is the $i$th bias.

2.2.1 Elman

Elman RNNs are single hidden layer networks where context neurons introduce recurrence by feeding back signals as the internal state of the network. At time step $t$, the output is

$$\hat{y} = \sum_{i=1}^{M} \beta_i f_i(t), \quad (4)$$

where $f_i(t) = g\left(w_i^T x(t) + \sum_{k=1}^{Q} \alpha_{ik} f_i(t-k) + b_i\right)$ is the output of neuron $i$ at time $t$.

2.2.2 Jordan

Jordan networks are similar to Elman's except for the way recurrence is introduced. In the Jordan architecture, signals are fed back from the predicted output of the previous time steps. Consequently, such networks are more suitable for time series prediction where dependencies are on the current input and previous outputs. Specifically, the output at time step $t$ is described by Equation 4 with $f_i(t) = g\left(w_i^T x(t) + \sum_{k=1}^{Q} \alpha_{ik}\, \hat{y}(t-k) + b_i\right)$.

2.2.3 NARMAX

The Nonlinear AutoRegressive Moving Average model with eXogenous inputs (NARMAX) represents a wide class of nonlinear systems [2]. NARMAX networks have been proposed for non-linear time series prediction using artificial neural networks and are described by $\hat{y}(t) = \sum_{i=1}^{M} \beta_i\, g\left(w_i^T x(t) + \sum_{l=1}^{F} w'_{il}\, y(t-l) + \sum_{l=1}^{R} w''_{il}\, e(t-l) + b_i\right)$, where $F$ and $R$ are the lengths of the time dependency of the output and the error feedbacks respectively, $e(t) = y(t) - \hat{y}(t)$, and $w'_{il} \in \mathbb{R}$ ($w''_{il} \in \mathbb{R}$ resp.) is the weight from the output (error resp.) at the $l$th time step to the $i$th hidden neuron.

2.2.4 Fully Connected RNN

A fully connected RNN is the most general RNN architecture, in which signals are fed back from all hidden neurons at previous time steps. Specifically, the output at time step $t$ is described by Equation 4 with $f_i(t) = g\left(w_i^T x(t) + \sum_{k=1}^{Q} \sum_{l=1}^{M} \alpha_{ilk} f_l(t-k) + b_i\right)$. In this case, $\alpha_{ilk} \in \mathbb{R}$ is the weight connecting neuron $i$ to neuron $l$ from the $k$th previous time step.

2.2.5 LSTM

LSTMs were introduced by [15] to solve the vanishing gradient problem in BPTT. LSTMs have been successfully applied to a wide variety of applications including speech recognition [12, 13], machine translation [4, 42] and human action recognition [23, 24]. An LSTM unit is composed of the main cell and an input, an output and a forget gate, which regulate the flow of information into and out of the cell through forgetting factors and weights. This formulation gives the network the ability to decide which information to remember. The output of LSTM is described by Equation 4 with $f(t) = o(t) \odot g_f(c(t))$, where $\odot$ is the Hadamard product of two matrices and $o(t)$, $c(t)$, $\lambda(t)$ and $in(t)$ are given by

$$o(t) = g_o\big(W_o x(t) + U_o f(t-1) + b_o\big)$$
$$c(t) = \lambda(t) \odot c(t-1) + in(t) \odot g_c\big(W_c x(t) + U_c f(t-1) + b_c\big)$$
$$\lambda(t) = g_\lambda\big(W_\lambda x(t) + U_\lambda f(t-1) + b_\lambda\big)$$
$$in(t) = g_{in}\big(W_{in} x(t) + U_{in} f(t-1) + b_{in}\big).$$

2.2.6 GRU

GRUs are introduced in [5] as a gating mechanism for RNNs. They resemble LSTMs but have only two gates and fewer parameters. GRUs expose their state at each time step and do not have any mechanism to control the degree to which their state is exposed [7]. They exhibit good performance on small datasets [7] and are widely used in speech recognition [34, 6] and sequence modeling [7]. GRUs' output is described by Equation 4, while $f(t)$ is given by

$$f(t) = \big(1 - z(t)\big) \odot f(t-1) + z(t) \odot g_f\big(W_f x(t) + U_f (r(t) \odot f(t-1)) + b_f\big), \quad (5)$$

where $z(t) = g_z\big(W_z x(t) + U_z f(t-1) + b_z\big)$ and $r(t) = g_r\big(W_r x(t) + U_r f(t-1) + b_r\big)$.
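As an illustration of Equation 5, here is a minimal NumPy sketch of a single GRU update; the choice of sigmoid gates and a tanh candidate activation, as well as all shapes and variable names, are our own assumptions rather than prescriptions from the paper.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, f_prev, Wz, Uz, bz, Wr, Ur, br, Wf, Uf, bf):
    """One GRU update per Equation 5.
    x_t: (S,) input at time t, f_prev: (M,) previous state f(t-1),
    W*: (M, S) input weights, U*: (M, M) recurrent weights, b*: (M,) biases."""
    z = sigmoid(Wz @ x_t + Uz @ f_prev + bz)           # update gate z(t)
    r = sigmoid(Wr @ x_t + Ur @ f_prev + br)           # reset gate r(t)
    cand = np.tanh(Wf @ x_t + Uf @ (r * f_prev) + bf)  # candidate state g_f(...)
    return (1.0 - z) * f_prev + z * cand               # new state f(t)
```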
3 Related Work

This work focuses on the parallelization of a non-iterative training algorithm for RNNs. In what follows, we first discuss the basic training methods of RNNs while focusing on the non-iterative ones. Then, we report the parallelization attempts for training algorithms.

3.1 RNN Training

3.1.1 Iterative RNN Training

Training RNNs has been mainly done iteratively through BPTT [40], which unfolds the recurrence through time to transform the RNN into a feedforward network trained using gradient descent. BPTT is susceptible to local minima and suffers from the vanishing and exploding gradient problems with long time dependencies. BPTT can also be slow, given that it is applied iteratively in batch mode. Other iterative algorithms include, but are not limited to, Hessian-free optimization [25], extended Kalman filters [39] and genetic algorithms (GA) [3]. Although successful, these algorithms are computationally expensive and require manual tuning of many hyper-parameters.

3.1.2 Non-Iterative RNN Training

Different non-iterative training algorithms have been proposed to reduce the computational cost of training neural networks in general. For instance, the authors in [32, 35, 28, 16] proposed ELM, a non-iterative method to train single hidden layer feedforward networks by randomly assigning input weights and computing output weights using the least-squares method. These methods were later extended to RNN architectures when Ertugrul implemented a non-iterative training for the Jordan RNN architecture in electricity load forecasting applications [10]. Later, Park et al. extended it to online RNNs [29] and Rizk et al. generalized the approach to more powerful RNN architectures [30].

Although these methods achieved high speedups (up to 99% in [30]), they heavily rely on stencil operations and on the computation of the generalized inverse of matrices, which are CPU intensive operations and could be further optimized using parallel algorithms.

3.2 Parallelizing Training Algorithms

Several frameworks have been developed to solve the challenges of high performance computing in the big data area [43], including parallelizing training algorithms. This is the first attempt to parallelize non-iterative training of RNNs; thus we describe previous work on the parallelization of RNN iterative training algorithms and on parallel non-iterative training for neural networks, not exclusively RNNs.

3.2.1 Parallelizing Iterative Training Algorithms For RNN

Parallelizing RNN training is mostly based on parallelizing the back-propagation algorithm (BP). For instance, Sierra et al. parallelized BP on CUBLAS and achieved a speedup of 63. In [44], data is distributed on multiple GPUs, achieving a speedup of up to 51 [33]. In [38], a parallel scan algorithm improves the step complexity of BP from O(n) to O(log n). Khomenko et al. parallelized their data on multiple GPUs and relied on batch bucketing by input sequence length to accelerate RNN training, achieving a speedup of up to 4 [20]. In [27], a semantic correlation-based data pre-fetch framework is implemented to break the dependency in the input to parallelize the training of cognitive applications.
Their work is tested on LSTMs using image captioning, speech recognition, and language processing applications, showing speedups of 5.1, 44.9 and 1.53, respectively. Recently, GA was introduced into the Elman architecture to accelerate the training and prevent the local minima problem [18]. GA-Elman outperforms traditional training algorithms in terms of convergence speed and accuracy.

3.2.2 Parallelizing Non-Iterative Training Algorithms

Non-iterative training algorithms for RNNs are shown to require less training time than iterative methods [30, 10, 29]. However, even with non-iterative training, large datasets require costly computations, especially when increasing the number of neurons or when model selection is performed to avoid over-fitting [36]. Parallelizing non-iterative training has been explored in single layer feedforward networks by [14]. Their approach is based on Map-Reduce and achieves a speedup of up to 5.6 when tested on 32 cores. Following a similar approach, Wang et al. [37] developed a parallel implementation of online ELM and achieved a speedup of 3.5 when trained on 120K instances with 120 attributes. Huang et al. extended their approach to the ensemble online sequential ELM, which was tested on real and synthetic data with 5120K training instances and 512 attributes and achieved a speedup of 40 on a cluster with 80 cores [17]. In [36], Van et al. attempted to parallelize ELM on Flink with a multi hidden layer feedforward network and achieved a speedup of 17.

To the best of our knowledge, our work is the first attempt to parallelize non-iterative training for different RNN architectures.

4 Methodology

Before proposing our methods, we present the nomenclature that will be used throughout this paper in Table 1.
Table 1. Nomenclature

Symbol | Definition
n | Number of training samples
M | Number of hidden neurons
Q | Max number of time dependencies
S | Dimension of input
x_j ∈ R^(S×Q) | jth input instance
y_j ∈ R | jth output instance
X ∈ R^(n×S×Q) | Input matrix
Y ∈ R^n | Output matrix
W ∈ R^(S×L) | Weight matrix connecting the input to the hidden neurons
α ∈ R^(L×Q) | Weight matrix connecting the hidden neuron to itself for previous time steps
b ∈ R^L | Bias vector for the hidden neurons
β ∈ R^L | Weight vector connecting hidden neurons to the output layer
S-R-ELM | Sequential ELM for RNN training
Basic-PR-ELM | Basic parallel ELM RNN training
Opt-PR-ELM | Optimized parallel ELM RNN training
BPTT | Back-propagation through time
P-BPTT | Parallel back-propagation through time
BS | Block size
TW | Tile width
In this work, a parallel version of ELM-trained RNNs will be formalized and implemented. The sequential version of our approach, denoted by S-R-ELM, is summarized in Algorithm 1 and is adopted from our previous work in [30].
Algorithm 1 S-R-ELM algorithm
1: Randomly assign W, α, b
2: Compute H(t), t = 1 ... Q according to the corresponding RNN architecture
3: Compute β = H(Q)† Y using the generalized Moore-Penrose pseudoinverse

H(t) at row i and column j is referred to as h_ij[t] in this paper and is computed as in Equations 6, 7, 8, 9, 10 and 11 for the Elman, Jordan, NARMAX, fully connected, LSTM and GRU architectures, respectively.
$$h_{ij}[t] = g\Big(W[:,j] \cdot X[i,:,t] + b_i + \sum_{k=1}^{Q} \alpha[j,k]\, h_{ij}[t-k]\Big) \quad (6)$$

$$h_{ij}[t] = g\Big(W[:,j] \cdot X[i,:,t] + b_i + \sum_{k=1}^{Q} \alpha[j,k]\, \hat{y}(t-k)\Big) \quad (7)$$

$$h_{ij}[t] = g\Big(W[:,j] \cdot X[i,:,t] + b_i + \sum_{l=1}^{F} W'[i,l]\, y(t-l) + \sum_{l=1}^{R} W''[i,l]\, e(t-l)\Big) \quad (8)$$

$$h_{ij}[t] = g\Big(W[:,j] \cdot X[i,:,t] + b_i + \sum_{k=1}^{Q} \sum_{l=1}^{M} \alpha[j,l,k]\, h_{il}[t-k]\Big) \quad (9)$$

$$h_{ij}[t] = o[i,j,t]\; g_f\big(c[i,j,t]\big) \quad (10)$$

$$h_{ij}[t] = \big(1 - z[i,j,t]\big)\, h_{ij}[t-1] + z[i,j,t]\; g_f\Big(W_f[:,j] \cdot X[i,:,t] + U_f\big(r[i,j,t]\, h_{ij}[t-1]\big) + b_i\Big) \quad (11)$$
Considering Algorithm 1, one can see that the running time of the ELM training mainly consists of two CPU intensive operations: computing H, and computing β by solving the linear system using the Moore-Penrose pseudo-inverse. Thus, those two operations are the main target when optimizing the performance of non-iterative training.
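For reference, a straightforward sequential (CPU) sketch of the Elman H computation of Equation 6 is shown below; the sigmoid activation and the array shapes are illustrative assumptions on our part, and this is the part of step 2 of Algorithm 1 that the parallel kernels below accelerate.

```python
import numpy as np

def compute_H_elman(X, W, alpha, b):
    """Sequential H computation for the Elman architecture (Equation 6).
    X: (n, S, Q) inputs, W: (S, M) input weights,
    alpha: (M, Q) recurrent weights, b: (M,) biases."""
    n, S, Q = X.shape
    M = W.shape[1]
    H = np.zeros((n, M, Q + 1))               # H[:, :, t] for t = 1..Q (index 0 unused)
    for t in range(1, Q + 1):
        pre = X[:, :, t - 1] @ W + b          # W[:, j] . X[i, :, t] + b_j for all i, j
        for k in range(1, t):                 # feedback from available previous steps
            pre += alpha[:, k - 1] * H[:, :, t - k]
        H[:, :, t] = 1.0 / (1.0 + np.exp(-pre))   # activation g (sigmoid assumed)
    return H[:, :, 1:]
```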
4.1 H Computation

4.1.1 Basic Parallel Implementation (Basic-PR-ELM)

For all RNN architectures, the computation of H(t) at row i and column j is independent of the computation of H(t) at row i2 and column j2, with i2 ≠ i and j2 ≠ j; it only depends on H(t2) at row i and column j for t2 < t. Given only this dependency, a parallel H computation can be done as follows: each thread (i, j) independently computes H(t) at row i and column j for t = 1, ..., Q. We describe the basic implementation of the computation of H for the Elman architecture in Algorithm 2.
Algorithm 2 Basic-PR-ELM by thread (i, j)
1: tx ← threadIdx.x
2: ty ← threadIdx.y
3: Row ← tx + blockIdx.x × blockDim.x
4: Col ← ty + blockIdx.y × blockDim.y
5: for t = 1 → Q do
6:     hij ← W[:, Col] · X[Row, :, t]
7:     hij ← hij + b[Col]
8:     for tprev = 1 → t do
9:         hij ← hij + α[Col, tprev] × H[Row, Col, tprev]
10:    end for
11:    H[Row, Col, t] ← hij
12: end for
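A possible realization of Algorithm 2 as a Numba CUDA kernel is sketched below. The paper's implementation relies on Numba, but this particular kernel, the added bounds checks, and the sigmoid activation (which Algorithm 2 leaves implicit) are our own assumptions.

```python
from numba import cuda
import math

@cuda.jit
def basic_pr_elm_elman(X, W, alpha, b, H):
    """Basic-PR-ELM H computation for Elman (Algorithm 2).
    X: (n, S, Q), W: (S, M), alpha: (M, Q), b: (M,), H: (n, M, Q)."""
    row, col = cuda.grid(2)                  # thread (i, j)
    n, S, Q = X.shape[0], X.shape[1], X.shape[2]
    M = W.shape[1]
    if row >= n or col >= M:
        return
    for t in range(Q):
        hij = 0.0
        for s in range(S):                   # dot product W[:, col] . X[row, :, t]
            hij += W[s, col] * X[row, s, t]
        hij += b[col]
        for tprev in range(t):               # feedback from previous time steps
            hij += alpha[col, tprev] * H[row, col, tprev]
        H[row, col, t] = 1.0 / (1.0 + math.exp(-hij))
```

A launch such as basic_pr_elm_elman[blocks, (16, 16)](X, W, alpha, b, H), with the grid sized to cover (n, M), mirrors the block sizes BS = 16 and BS = 32 used in the experiments.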
4.1.2 Optimized Parallel Implementation (Opt-PR-ELM)

Figure 2. Basic-PR-ELM memory access patterns on Elman

Figure 2 illustrates the memory access patterns of Basic-PR-ELM on the Elman architecture. One can clearly see that threads in the same row access the same elements of X and threads in the same column access the same elements of W and α.
Thus, the tiling concept can be applied to utilize the shared memory to speed up the computation of H. Moreover, we notice that b[Col] can be preloaded and used efficiently by other threads.

Algorithm 3 describes how these optimizations can be applied for the Elman architecture.
Algorithm 3 Opt-PR-ELM by thread (i, j)
1: tx ← threadIdx.x
2: ty ← threadIdx.y
3: Row ← tx + blockIdx.x × blockDim.x
4: Col ← ty + blockIdx.y × blockDim.y
5: Hloc ← t-dimensional array in the local memory of thread (i, j)
6: for t = 1 → Q do
7:     hij ← 0
8:     for tile = 1 → num_tiles do
9:         Wshared ← W[tx + tile × TW, Col]
10:        Xshared ← X[Row, ty + tile × TW, t]
11:        synch()
12:        hij ← hij + Wshared · Xshared
13:    end for
14:    synch()
15:    if tx = 0 and ty = 0 then
16:        bshared ← b[Col]
17:    end if
18:    synch()
19:    hij ← hij + bshared
20:    for tile = 1 → ⌈t / TW⌉ do
21:        αshared ← α[Col, tx + tile × TW]
22:        synch()
23:        hij ← hij + αshared · Hloc[tprev]
24:    end for
25:    synch()
26:    Hloc[t] ← hij
27:    H[Row, Col, t] ← hij
28: end for
First, in the dot product W[:, Col] · X[Row, :, t], each thread can load only one element of W and one element of X into the shared memory. Once the threads synchronize, all needed elements of W and X are loaded, and the dot product can be computed efficiently. Second, only one thread loads b[j], which is needed by all the threads in the same column of the block. The same tiling concept used to compute W[:, Col] · X[Row, :, t] can be used to speed up the computation of α[j, tprev] × H(tprev)[Row, Col]. Lastly, each thread can save the values of H(t)[Row, Col] in its register file to reduce the time taken to read from the global memory in line 8 of Algorithm 2. If these values do not fit in the registers, they are read from the global memory.
Algorithms 2 and 3 could be easily extended to other architectures when Equation 6 is replaced by Equations 7, 8, 9, 10 or 11.
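To ground the tiling idea, the sketch below shows one way to express the shared-memory version for Elman with Numba CUDA. It follows the spirit of Algorithm 3 but is our own simplification: the dot product is tiled through shared memory and the thread's own history is kept in a local array, while the α tiling and the single-thread preload of b are omitted; TW, MAX_Q and the sigmoid activation are illustrative assumptions.

```python
from numba import cuda, float32
import math

TW = 16       # tile width, equal to the block size as in the paper
MAX_Q = 64    # illustrative upper bound on Q for the per-thread local history

@cuda.jit
def opt_pr_elm_elman(X, W, alpha, b, H):
    """Shared-memory (tiled) H computation for Elman, in the spirit of Algorithm 3.
    Assumes a (TW, TW) thread block and Q <= MAX_Q.
    X: (n, S, Q), W: (S, M), alpha: (M, Q), b: (M,), H: (n, M, Q)."""
    Wsh = cuda.shared.array((TW, TW), float32)
    Xsh = cuda.shared.array((TW, TW), float32)

    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    row = tx + cuda.blockIdx.x * cuda.blockDim.x      # i
    col = ty + cuda.blockIdx.y * cuda.blockDim.y      # j
    n, S, Q = X.shape[0], X.shape[1], X.shape[2]
    M = W.shape[1]

    h_loc = cuda.local.array(MAX_Q, float32)          # per-thread history h_ij[.]
    num_tiles = (S + TW - 1) // TW

    for t in range(Q):
        acc = 0.0
        for ph in range(num_tiles):                   # tiled dot product W[:,col].X[row,:,t]
            s_w = ph * TW + tx                        # W element loaded by this thread
            s_x = ph * TW + ty                        # X element loaded by this thread
            if s_w < S and col < M:
                Wsh[tx, ty] = W[s_w, col]
            else:
                Wsh[tx, ty] = 0.0
            if row < n and s_x < S:
                Xsh[tx, ty] = X[row, s_x, t]
            else:
                Xsh[tx, ty] = 0.0
            cuda.syncthreads()
            for k in range(TW):
                acc += Wsh[k, ty] * Xsh[tx, k]
            cuda.syncthreads()
        if row < n and col < M:
            acc += b[col]
            for tprev in range(t):                    # recurrent feedback from history
                acc += alpha[col, tprev] * h_loc[tprev]
            val = 1.0 / (1.0 + math.exp(-acc))
            h_loc[t] = val
            H[row, col, t] = val
```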
4.2 Computing β

β is the solution of the following system: Hβ = Y. Instead of computing the pseudo-inverse H† and then multiplying it by Y, one can perform a QR factorization of H as H = QR, then compute z = Q^T Y. Having that, β is the solution of Rβ = z by back substitution, since R is an upper triangular matrix. In this work, we make use of the Numba [21] and NumPy [26] libraries, which provide an efficient implementation of this method in Python.
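A minimal NumPy sketch of this QR-based solve is shown below, with the back substitution written out explicitly for clarity; in practice, the reduced QR factorization and a library triangular solver would typically be used.

```python
import numpy as np

def solve_beta_qr(H, Y):
    """Solve H beta = Y via QR factorization (Section 4.2).
    H: (n, M) hidden output matrix, Y: (n,) targets."""
    Q_mat, R_mat = np.linalg.qr(H)     # reduced QR: R_mat is (M, M) upper triangular
    z = Q_mat.T @ Y                    # z = Q^T Y
    M = R_mat.shape[0]
    beta = np.zeros(M)
    for i in range(M - 1, -1, -1):     # back substitution on R beta = z
        beta[i] = (z[i] - R_mat[i, i + 1:] @ beta[i + 1:]) / R_mat[i, i]
    return beta
```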
5 Theoretical Analysis

We analyze the memory read and write operations and the floating point operations (FLOPS) for the proposed algorithms Basic-PR-ELM and Opt-PR-ELM. For the Elman architecture, Basic-PR-ELM performs $Q(2S+Q+2)$ read operations, divided as follows:

- $2SQ$ to read the values needed in line 6 of Algorithm 2
- $Q$ reads for b[Col] in line 7
- $2\left(\frac{Q(Q+1)}{2}\right)$ reads in the loop at line 8

Moreover, only $Q$ write operations are needed (in line 11), and $Q(2S+Q+2)$ FLOPS are performed, as follows:

- $2SQ$ to perform the dot product at line 6
- $Q$ FLOPS for the addition in line 7
- $2\left(\frac{Q(Q+1)}{2}\right)$ to perform the loop at line 8

The memory operations to FLOPS ratio is $\frac{2S+Q+3}{2S+Q+2} > 1$, which might limit the performance of Basic-PR-ELM. This ratio improves with Opt-PR-ELM, as it minimizes the memory operations while keeping the same number of FLOPS. Specifically, Opt-PR-ELM decreases the number of reads to $\frac{1}{TW^2}\left(2SQ + \frac{Q(Q+1)}{2}\right) + 1$, divided as follows:

- $\frac{2}{TW^2}\, SQ$ to read the values needed in line 12 of Algorithm 3
- at most 1 read for b[Col] in line 16
- $\frac{1}{TW^2}\left(\frac{Q(Q+1)}{2}\right)$ reads in the loop at line 20

where TW is the tile width, which is set to the block size in this work. The new memory operations to FLOPS ratio is $\frac{\frac{1}{TW^2}\left(2SQ + \frac{Q(Q+1)}{2}\right) + 1 + Q}{Q(2S+Q+2)}$, which is less than the ratio of Basic-PR-ELM by a factor of $TW^2$. Specifically, Opt-PR-ELM reduces the number of read operations by a factor of 256 (1024 resp.) when the tile width is set to 16 (32 resp.). Table 2 reports the number of memory operations and FLOPS needed by Basic-PR-ELM for each RNN architecture. The values for Opt-PR-ELM are omitted, as it requires the same number of write operations and FLOPS and fewer read operations by a factor of $TW^2$.
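The following small Python sketch evaluates these Elman counts to illustrate how the tile width TW shrinks the read traffic; the sample values of S, Q and TW are illustrative only and are not taken from the experiments.

```python
def elman_basic_counts(S, Q):
    """Elman read/write/FLOP counts for Basic-PR-ELM, as derived above."""
    reads = Q * (2 * S + Q + 2)
    writes = Q
    flops = Q * (2 * S + Q + 2)
    return reads, writes, flops

def elman_opt_reads(S, Q, TW):
    """Elman read count for Opt-PR-ELM with tile width TW."""
    return (2 * S * Q + Q * (Q + 1) / 2) / TW**2 + 1

if __name__ == "__main__":
    S, Q, TW = 10, 50, 32                    # illustrative sizes only
    reads, writes, flops = elman_basic_counts(S, Q)
    print("Basic mem-ops/FLOPS ratio:", (reads + writes) / flops)
    print("Opt reads:", elman_opt_reads(S, Q, TW), "vs Basic reads:", reads)
```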
Table 2. Number of memory operations and FLOPS for each RNN architecture for Basic-PR-ELM

Architecture | # Read Operations | # Write Operations | FLOPS
Elman | Q(2S+Q+2) | Q | Q(2S+Q+2)
Jordan | Q(2S+1+(Q+1)(1/2+M)) | Q | Q(2S+1+((Q+1)/2)(2SM+M))
NARMAX | Q(2S+1)+2(2F+M+R) | Q | Q(2S+1+2F+R(2+2SM+M))
Fully Connected | Q(2S+1+2MQ) | Q | Q(2S+Q+2QM)
LSTM | Q(5S+13) | 5Q | Q(8S+18)
GRU | Q(4S+8) | 3Q | Q(3S+17)

Figure 3. Speedup of Basic-PR-ELM and Opt-PR-ELM for the different architectures when M = 50: (a) Jordan, (b) Elman, (c) NARMAX, (d) Fully Connected, (e) LSTM, (f) GRU
Figure 4. Speedup of Opt-PR-ELM for the different architectures when the number of hidden neurons increases from 5 to 100: (a) Jordan, (b) Elman, (c) NARMAX, (d) Fully Connected, (e) LSTM, (f) GRU
Table 3. Benchmarks Description

Category | Name | # of instances | Q | % Train | Output Mean | Output Std Dev | Output Min | Output Max
Small | Japan population | 2,540 | 10 | 80 | 1.40E+06 | 1.40E+06 | 1.00E+05 | 1.03E+08
Small | Quebec Births | 5,113 | 10 | 80 | 2.51E+02 | 4.19E+01 | -2.31E+01 | 3.66E+02
Small | Exoplanet | 5,657 | 3197 | 80 | -3.01E+02 | 1.45E+04 | -6.43E+05 | 2.11E+05
Medium | SP500 | 17,218 | 10 | 80 | 8.99E+08 | 1.53E+09 | 1.00E+06 | 1.15E+10
Medium | AEMO | 17,520 | 10 | 80 | 7.98E+03 | 1.19E+03 | 5.11E+03 | 1.38E+04
Medium | Hourly weather | 45,300 | 50 | 80 | 2.79E+02 | 3.78E+01 | 0.00E+00 | 3.07E+02
Large | Energy Consumption | 119,000 | 10 | 70 | 1.66E+03 | 3.02E+02 | 0.00E+00 | 3.05E+03
Large | Electricity load | 280,514 | 10 | 70 | 2.70E+14 | 2.60E+14 | 0.00E+00 | 9.90E+14
Large | Stock prices | 619,000 | 50 | 70 | 4.48E+06 | 1.08E+07 | 0.00E+00 | 2.06E+09
Large | Temperature | 998,000 | 50 | 70 | 5.07E+01 | 2.21E+01 | 4.00E+00 | 8.10E+01
6 Experimental Setup

6.1 Setup

Serial algorithms were run on an Intel 64-bit core-i7 machine with 16 GB of memory. Parallel algorithms were run on an NVIDIA Tesla K20m GPU with 2688 CUDA cores and a 723 MHz GPU core clock speed. The GPU main memory is 6 GB, with a bandwidth of 250 GB/s between the host and the device. All experiments are repeated 5 times, and the average value is reported.

6.2 Time Series Prediction Benchmarks

Basic-PR-ELM and Opt-PR-ELM were validated on time series prediction problems. Table 3 presents the characteristics of the datasets, ordered according to the number of instances. According to their size, we split the databases into three categories: small datasets containing less than 10K instances, medium datasets with multiples of 10K instances, and large datasets consisting of multiples of 100K instances. Japan population^1 tracks the population of various Japanese regions, while Quebec Births^2 tracks the number of births in Quebec and Exoplanet^3 describes the change in the light intensity of several thousand stars. Additionally, SP500^4 records the stock prices since 1950, while AEMO^5 reports the electricity load demand in Australia and hourly weather^6 contains 5 years of temperature measures. The energy consumption dataset^7 reports the hourly power consumption data in megawatts, the electricity load dataset^8 reports the electricity demand at the MT166 and MT257 substations, and the stock prices dataset^9 consists of historical stock prices for all companies currently on the S&P 500 index. Finally, the temperature dataset^10 reports sensor data collected from a permanent magnet synchronous motor (PMSM) deployed on a test bench, where the PMSM represents a German OEM's prototype model.

^1 kaggle.com/jd1325/japan-population-data
^2 datamarket.com/data/list/?q=provider%3Atsdl
^3 kaggle.com/keplersmachines/kepler-labelled-time-series-data
^4 kaggle.com/benjibb/sp500-since-1950
^5 aemo.com.au/
^6 kaggle.com/selfishgene/historical-hourly-weather-data
^7 kaggle.com/selfishgene/historical-hourly-weather-data
^8 archive.ics.uci.edu/ml/index.php
^9 kaggle.com/camnugent/sandp500
^10 kaggle.com/wkirgsn/electric-motor-temperature
Table 4. Average RMSE (± standard deviation) of S-R-ELM and Opt-PR-ELM (BS=32), showing that both algorithms achieve accuracies within the same range for different RNN architectures on all the datasets.

Dataset | Algorithm | Elman | Jordan | NARMAX | Fully Connected | LSTM | GRU
Japan pop. | S-R-ELM | 3.97E-2 ±4.67E-2 | 1.12E-1 ±3.75E-1 | 6.54E-1 ±3.32E-2 | 5.43E-3 ±3.89E-5 | 2.45E-1 ±2.36E-1 | 4.46E-1 ±3.35E-4
Japan pop. | Opt-PR-ELM | 3.74E-2 ±7.17E-8 | 1.23E-1 ±2.89E-2 | 6.23E-1 ±2.31E-2 | 6.23E-3 ±2.65E-4 | 2.46E-1 ±4.56E-2 | 4.75E-2 ±5.81E-5
Quebec Births | S-R-ELM | 4.06E-3 ±7.68E-5 | 1.01E-1 ±5.00E-3 | 3.42E-1 ±5.05E-3 | 2.02E-2 ±4.99E-7 | 1.01E-1 ±5.76E-1 | 1.01E+0 ±5.16E-4
Quebec Births | Opt-PR-ELM | 2.02E-3 ±4.89E-5 | 4.35E-1 ±5.32E-4 | 3.46E-1 ±3.79E-3 | 2.42E-2 ±7.07E-1 | 1.49E-2 ±1.46E-4 | 1.16E+0 ±3.56E-3
Exoplanet | S-R-ELM | 5.40E+0 ±3.03E-1 | 2.87E+0 ±7.91E-3 | 2.01E-1 ±2.98E-3 | 3.46E-1 ±1.01E-2 | 5.45E-1 ±2.31E-1 | 4.32E+0 ±4.56E-1
Exoplanet | Opt-PR-ELM | 5.42E+0 ±3.05E-1 | 2.34E+0 ±7.34E-2 | 2.53E-1 ±1.98E-3 | 3.42E-1 ±1.51E-2 | 3.65E-1 ±2.31E-5 | 5.21E+0 ±3.76E-2
SP500 | S-R-ELM | 1.69E-1 ±7.78E-3 | 1.32E-1 ±3.75E-4 | 9.01E-1 ±8.70E-4 | 1.96E+0 ±4.32E-1 | 1.01E-1 ±5.16E-2 | 7.84E+0 ±5.55E-2
SP500 | Opt-PR-ELM | 2.34E-1 ±7.98E-4 | 4.01E-1 ±6.36E-5 | 9.11E-1 ±8.32E-5 | 1.36E+0 ±1.90E-2 | 1.24E-1 ±3.14E-2 | 7.83E+0 ±5.53E-1
AEMO | S-R-ELM | 1.26E-1 ±1.45E-3 | 3.30E-2 ±7.16E-3 | 9.61E-2 ±8.79E-3 | 5.00E-2 ±1.32E-5 | 1.36E-2 ±5.33E-4 | 2.33E-1 ±2.23E-5
AEMO | Opt-PR-ELM | 1.34E-1 ±1.25E-4 | 1.12E-2 ±5.16E-2 | 3.23E-3 ±1.01E-2 | 5.36E-2 ±1.12E-4 | 1.22E-2 ±5.67E-3 | 2.01E-1 ±2.13E-6
Hourly Weather | S-R-ELM | 1.98E-1 ±5.17E+0 | 3.14E-1 ±2.07E-3 | 8.06E-1 ±7.63E-5 | 7.39E-2 ±6.03E-2 | 2.10E-2 ±2.24E-5 | 3.21E-1 ±9.61E-3
Hourly Weather | Opt-PR-ELM | 1.52E-1 ±3.34E+0 | 3.98E-1 ±5.67E-4 | 2.00E-1 ±7.03E-4 | 3.79E-2 ±5.03E-3 | 1.02E-2 ±2.14E-5 | 4.32E-1 ±9.16E-3
Energy Cons. | S-R-ELM | 1.83E-4 ±1.98E-3 | 2.21E-3 ±3.43E-1 | 2.22E-4 ±5.26E-3 | 3.56E-3 ±5.56E-4 | 1.56E-3 ±9.96E-4 | 2.34E-2 ±2.22E-5
Energy Cons. | Opt-PR-ELM | 1.38E-4 ±2.45E-3 | 3.48E-3 ±3.03E-2 | 6.44E-5 ±5.16E-4 | 2.65E-3 ±5.16E-5 | 2.56E-3 ±5.326E-5 | 3.24E-3 ±2.12E-5
Elec. Load | S-R-ELM | 2.56E+0 ±7.93E+0 | 2.40E+0 ±3.90E-1 | 8.64E+0 ±9.81E+0 | 4.16E-1 ±3.45E-1 | 8.32E+0 ±8.05E+0 | 1.12E+0 ±5.16E-1
Elec. Load | Opt-PR-ELM | 2.34E+0 ±7.03E-1 | 4.76E+0 ±2.20E-2 | 4.86E+0 ±8.91E-1 | 4.64E-1 ±3.97E-2 | 2.84E+0 ±8.13E-1 | 2.98E+0 ±5.06E+0
Stock Prices | S-R-ELM | 6.41E-1 ±7.93E-1 | 1.10E-1 ±9.09E-5 | 4.80E+0 ±3.87E-1 | 2.13E-2 ±3.89E-1 | 4.00E-1 ±1.09E-3 | 2.62E-1 ±3.82E-4
Stock Prices | Opt-PR-ELM | 3.41E-1 ±3.35E-2 | 1.56E-1 ±9.23E-5 | 4.81E+0 ±3.32E-2 | 2.03E-3 ±1.92E-4 | 4.94E-1 ±5.69E-4 | 6.28E-1 ±3.28E-3
Temp. | S-R-ELM | 4.32E-4 ±9.85E-5 | 5.65E-3 ±6.79E-9 | 3.56E-4 ±7.10E-6 | 2.91E-5 ±3.72E-9 | 4.92E-4 ±6.02E-5 | 3.54E-4 ±2.95E-6
Temp. | Opt-PR-ELM | 4.12E-4 ±9.67E-4 | 5.03E-3 ±6.19E-2 | 3.15E-4 ±9.25E-6 | 9.21E-5 ±3.02E-5 | 8.17E-4 ±6.92E-4 | 3.19E-3 ±5.29E-5
Table 5. Speedup of Opt-PR-ELM (BS=32) when tested on the Tesla K20m and Quadro K2000 GPUs for different RNN architectures on various datasets when the number of hidden neurons M is 20.

Architecture | GPU | Japan pop. | Quebec Births | Exop. | SP500 | AEMO | Hourly weather | Energy cons. | Elec. Load | Stock Prices | Temp.
Elman | Tesla K20m | 12 | 12 | 18 | 26 | 42 | 64 | 163 | 164 | 251 | 261
Elman | Quadro K2000 | 12 | 10 | 16 | 23 | 40 | 61 | 60 | 163 | 239 | 251
Jordan | Tesla K20m | 12 | 13 | 42 | 26 | 42 | 64 | 163 | 165 | 244 | 300
Jordan | Quadro K2000 | 11 | 11 | 39 | 23 | 40 | 60 | 163 | 163 | 189 | 295
NARMAX | Tesla K20m | 13 | 12 | 29 | 29 | 45 | 72 | 167 | 168 | 263 | 281
NARMAX | Quadro K2000 | 11 | 11 | 28 | 26 | 42 | 71 | 162 | 162 | 257 | 273
Fully Connec. | Tesla K20m | 17 | 18 | 35 | 36 | 50 | 73 | 198 | 226 | 281 | 326
Fully Connec. | Quadro K2000 | 14 | 16 | 33 | 34 | 48 | 71 | 196 | 225 | 279 | 324
LSTM | Tesla K20m | 21 | 21 | 43 | 39 | 50 | 74 | 219 | 201 | 310 | 327
LSTM | Quadro K2000 | 19 | 20 | 41 | 36 | 45 | 70 | 215 | 196 | 307 | 323
GRU | Tesla K20m | 20 | 18 | 46 | 40 | 50 | 67 | 197 | 200 | 309 | 326
GRU | Quadro K2000 | 15 | 14 | 42 | 35 | 47 | 58 | 192 | 187 | 300 | 320
Table 6. Runtime (seconds) of Opt-PR-ELM (BS=32) and the parallel iterative training algorithm (P-BPTT), and the ratio = P-BPTT / Opt-PR-ELM.

Dataset          Fully Connected              LSTM                         GRU
                 Opt-PR-ELM  P-BPTT  Ratio    Opt-PR-ELM  P-BPTT  Ratio    Opt-PR-ELM  P-BPTT  Ratio
Japan pop.           0.23      3.52    15         0.38      7.41    20         0.38      6.59    17
Quebec Births        0.56      6.75    12         0.85     13.56    16         0.81     12.94    16
Exoplanet           10.03     24.98     2        15.23     54.32     4        13.14     43.12     3
SP500                3.56     20.66     6         7.77     37.55     5         5.61     35.65     6
AEMO                 3.01     21.34     7         7.29     38.32     5         5.62     35.71     6
Hourly Weather      30.46    156.76     5        50.49    243.99     5        30.04    201.12     7
Energy Cons.        32.14    203.45     6        51.90    525.87    10        45.67    435.89    10
Elec. Load          36.70    256.89     7        53.60    572.74    11        51.7     532.31    10
Stock Prices        41.30    301.23     7        56.78    639.04    11        52.34    621.18    12
Temperature         45.45    354.99     8        62.00    678.11    11        59.32    641.09    11
and Exoplanet3 describes the change in the light intensity of several thousand stars. Additionally, SP 5004 records stock prices since 1950, while AEMO5 reports the electricity load demand in Australia and the hourly weather dataset6 contains 5 years of temperature measurements. The energy consumption dataset7 reports hourly power consumption in megawatts, the electricity load dataset8 reports the electricity demand at the MT166 and MT257 substations, and the stock prices dataset9 consists of historical stock prices for all companies currently on the S&P 500 index. Finally, the temperature dataset10 reports sensor data collected from a permanent magnet synchronous motor (PMSM) deployed on a test bench, where the PMSM is a German OEM's prototype model.
7 Experimental Results

7.1 Speedup

Figure 3 illustrates the speedups of Basic-PR-ELM and Opt-PR-ELM for the six architectures, measured against the serial version, when the number of hidden neurons M is 50. Opt-PR-ELM was tested with two different configurations: with the number of threads per block (block size BS) set to 16 and to 32.

Clearly, Basic-PR-ELM and Opt-PR-ELM achieve high speedups, especially when the size of the dataset increases. For instance, for the Elman architecture, Basic-PR-ELM achieves a speedup of 19 on the small Exoplanet dataset, 72 on the medium-sized hourly energy consumption dataset, and up to 207 on the largest dataset (Temperature). Opt-PR-ELM achieves higher speedups that reach up to 311 with LSTM on the temperature dataset when BS = 16. The speedup increases to 461 when BS increases to 32.
7 kaggle.com/selfishgene/historical-hourly-weather-data
8 archive.ics.uci.edu/ml/index.php
9 kaggle.com/camnugent/sandp500
10 kaggle.com/wkirgsn/electric-motor-temperature
However, Opt-PR-ELM does not always achieve higher speedups. Specifically, Basic-PR-ELM and Opt-PR-ELM achieve similar speedups for the Japan population, Quebec births, SP500, AEMO, energy consumption, and electricity load datasets. To investigate these results, we take a closer look at the characteristics of the datasets. When Q = 10, a thread computing the dot product between a row of X and a column of W performs only 2 × 10 memory read operations. Consequently, num_tiles is only 1, and the loop at line 8 of Alg. 3 is executed only once. In this case, the performance does not improve and might even slightly decrease due to the thread synchronization in Opt-PR-ELM. However, Opt-PR-ELM achieves higher speedups when Q > BS and when BS increases to 32. We also notice that the speedup increases with more complex architectures, LSTM for example, since these architectures require more computations that can be better accelerated on a GPU.
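To make the tiling argument concrete, the sketch below shows the standard shared-memory tiled product H = XW written with Numba CUDA. It is our illustration rather than the authors' released code; the kernel name tiled_matmul, the constant BS, and the generic 2-D shapes are assumptions. Each thread accumulates one entry of H by looping over num_tiles tiles of width BS staged in shared memory, so when the inner dimension is at most BS, the tile loop and its synchronization run exactly once.

from numba import cuda, float32

BS = 32  # threads per block along each dimension (the block size BS above)

@cuda.jit
def tiled_matmul(X, W, H):
    # Shared-memory tiles holding a BS x BS sub-block of X and of W.
    sX = cuda.shared.array(shape=(BS, BS), dtype=float32)
    sW = cuda.shared.array(shape=(BS, BS), dtype=float32)

    row, col = cuda.grid(2)                  # global output entry owned by this thread
    tx, ty = cuda.threadIdx.x, cuda.threadIdx.y

    Q = X.shape[1]                           # inner dimension of the product
    num_tiles = (Q + BS - 1) // BS           # equals 1 whenever Q <= BS

    acc = 0.0
    for t in range(num_tiles):
        # Cooperatively stage one tile of X and one tile of W in shared memory.
        if row < X.shape[0] and t * BS + ty < Q:
            sX[tx, ty] = X[row, t * BS + ty]
        else:
            sX[tx, ty] = 0.0
        if col < W.shape[1] and t * BS + tx < Q:
            sW[tx, ty] = W[t * BS + tx, col]
        else:
            sW[tx, ty] = 0.0
        cuda.syncthreads()                   # the synchronization discussed above

        for k in range(BS):
            acc += sX[tx, k] * sW[k, ty]
        cuda.syncthreads()

    if row < H.shape[0] and col < H.shape[1]:
        H[row, col] = acc

# Launch configuration: (BS, BS) threads per block, enough blocks to cover H.
# blocks = ((H.shape[0] + BS - 1) // BS, (H.shape[1] + BS - 1) // BS)
# tiled_matmul[blocks, (BS, BS)](X_d, W_d, H_d)

Under this scheme, the BS = 16 and BS = 32 configurations reported in Figure 3 differ only in the tile width and hence in the number of threads launched per block.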
7.2 Scalability

To test the scalability of our approach, we change the number of hidden neurons M and report the speedup of Opt-PR-ELM (BS=32) for the different architectures on the various datasets. Figure 4 illustrates that the speedup increases as M increases from 5 to 10, 20, 50, and 100. Specifically, the speedup increases by a factor of 20 when M increases from 5 to 100 with a GRU on the energy consumption dataset. Thus, Opt-PR-ELM scales up well with more computationally expensive operations.
7.3 Robustness

Robustness, i.e. repeatability, is a key property for Opt-PR-ELM since the random initialization might affect the solution. Moreover, floating-point computations might differ between the GPU and the CPU, which might affect the output. To ensure that such perturbations do not affect the performance of our parallel algorithm, we run S-R-ELM and Opt-PR-ELM (BS=32) five times, and we measure their root mean squared error (RMSE). Table 4 reports the average RMSE and its standard deviation when S-R-ELM and Opt-PR-ELM are tested on different datasets with different RNN architectures. We select M according to the size of the problem; i.e. we used M = 100 for Exoplanet where Q = 5657, M = 20 for hourly weather, stock prices, and temperature where Q = 50, and M = 10 for the rest of the datasets, which have Q = 10. Tables 3 and 4 show that the cases where the RMSE is high correspond to datasets with large outputs. For instance, with outputs ranging from 0 to 2.06 × 10^9, the electricity load dataset has a higher RMSE than the other datasets. However, S-R-ELM and Opt-PR-ELM achieve accuracies in the same range for the different RNN architectures on all the datasets, which means that GPU floating-point operations do not have a clear effect on the performance of our algorithm.
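A minimal sketch of this repeatability protocol is given below. The harness is our illustration rather than the paper's code, and train_fn and predict_fn are hypothetical placeholders for whichever R-ELM variant (sequential or parallel) is being evaluated.

import numpy as np

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

def repeatability(train_fn, predict_fn, X_train, y_train, X_test, y_test, runs=5):
    # train_fn(X, y, seed) -> model and predict_fn(model, X) -> predictions are
    # assumed callables; each run draws new random hidden weights from its seed.
    scores = [rmse(y_test, predict_fn(train_fn(X_train, y_train, seed), X_test))
              for seed in range(runs)]
    return np.mean(scores), np.std(scores)   # the mean and std reported in Table 4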
7.4 Portability

To verify that our algorithm is portable, we ran Opt-PR-ELM (BS=32) on an NVIDIA Quadro K2000 GPU while fixing the number of hidden nodes M at 20. It is important to check for portability to understand how architecture-dependent the proposed algorithm is. Table 5 shows that Opt-PR-ELM also achieves high speedups on the Quadro K2000 GPU for the different RNN architectures and datasets, but the speedups on the Tesla K20m GPU are consistently higher because of the greater computational capability of the latter. The speedups in Table 5 are reported with respect to a Core i7 CPU with 16 GB of RAM. Speedups with respect to sequential code on an older-generation CPU (Core i5 with 8 GB of RAM) are up to 5 times higher. One can draw the following conclusion: increasing the number of cores of a CPU yields a speedup of only up to 5 times, whereas parallelizing the code can yield a speedup of up to 326 with respect to sequential code on Core i7 CPUs and 651 on Core i5 CPUs. A rough estimate of current pricing based on a Google search shows that GPU architectures cost between $500 and $7,000 for an NVIDIA GTX 1080 and a Tesla GPU11, respectively, while CPU architectures such as the Intel Core i7-9700K with 8 cores cost $40012. Considering the aforementioned speedups, one can conclude that investing in parallel architectures can be more profitable than upgrading the existing CPU architecture, especially in applications where real-time performance and cost efficiency are essential, such as general IoT applications.
11 https://www.amazon.com/PNY-TCSV100MPCIE-PB-Nvidia-Tesla-v100/dp/B076P84525
12 https://www.amazon.com/CPU-Processors-Memory-Computer-Add-Ons/b?ie=UTF8&node=229189
7.5 Comparison with Parallel Iterative RNN Training

Although Opt-PR-ELM achieves high speedups compared to its sequential counterpart S-R-ELM, we also need to show that its absolute training time is lower than that of the parallel version of BPTT (P-BPTT) as implemented in [11]. We choose the architectures that [11] implements, i.e. fully connected, LSTM and GRU, and we report the training time of Opt-PR-ELM (BS=32) and P-BPTT when M = 10. P-BPTT is trained for 10 epochs with a batch size of 64, mean squared error (MSE) as the loss function, and ADAM as the optimizer. We are interested in the absolute training times of the two parallel algorithms rather than their speedups over their sequential versions. Thus, we report the runtimes of Opt-PR-ELM and P-BPTT when tested on the same Tesla K20m GPU, and the ratio between both training times. As Table 6 shows, Opt-PR-ELM runs up to 10x faster than P-BPTT when tested with LSTM on the energy consumption dataset. Figure 5 illustrates the MSE versus time for P-BPTT when tested with LSTM on the energy consumption dataset with M = 50. For the same dataset and RNN architecture, Opt-PR-ELM reaches an MSE of 2.56 × 10^3, whereas P-BPTT reaches a lower MSE of 1.4 × 10^3. However, Opt-PR-ELM took only 57 sec to reach its optimal MSE, whereas P-BPTT took 525 sec to reach its optimal MSE and 340 sec to reach the same MSE (1.1 × 10^3).
Figure 5. MSE versus time (sec) for P-BPTT
algorithms when tested on the energy consumption
dataset with M=50 and LSTM as architecture
Thus, Opt-PR-ELM could reach the same per-
formance as P-BPTT 6 times faster. The sequential
nature of iterative training explains the results: al-
though one can attempt to parallelize each epoch,
the training needs to be done in a sequence of con-
secutive dependent epochs.
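For reference, the P-BPTT baseline configuration described above can be reproduced in a few lines of the TensorFlow/Keras API cited as [11]. The sketch below is our illustration only; the helper name, the single-output head, and the exact layer sizes are assumptions rather than the precise model used in the experiments.

import tensorflow as tf

def build_p_bptt_baseline(seq_len, n_features, M=10):
    # An LSTM regressor trained iteratively with BPTT, matching the reported
    # setup: M hidden neurons, MSE loss, and the ADAM optimizer.
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(seq_len, n_features)),
        tf.keras.layers.LSTM(M),
        tf.keras.layers.Dense(1),          # one-step-ahead prediction
    ])
    model.compile(optimizer="adam", loss="mse")
    return model

# model = build_p_bptt_baseline(seq_len=Q, n_features=1)
# model.fit(X_train, y_train, epochs=10, batch_size=64)   # 10 epochs, batch size 64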
7.6 Opt-PR-ELM Runtime

One can argue that using memory streams or initializing the random weights on the GPU could lead to higher speedups. To investigate this, we study how the runtime of Opt-PR-ELM is decomposed among the parameter initialization, the data transfers to and from the GPU, and the actual computations for the six architectures. Figure 6 shows what portion each step takes of the runtime of Opt-PR-ELM when tested on the energy consumption dataset with M = 50. The initialization does not appear on the bar because it accounts for less than 0.01% of the total runtime. Moreover, transferring data to the GPU consistently takes more time than the transfer back, because the former deals with the matrices X ∈ R^(n×S×Q), Y ∈ R^n, W ∈ R^(S×L), α ∈ R^(L×Q), and b ∈ R^L, while the latter only transfers β ∈ R^L. The steps that take the largest portion of the time are the computations of H and β. One can conclude that data streams or GPU-side random initialization would not affect the speedup, since initialization and data transfer are not a bottleneck in Opt-PR-ELM.
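The sketch below shows one way such a decomposition can be measured with Numba; it is our assumed instrumentation, not the paper's code, and compute_H_kernel and solve_beta are hypothetical placeholders for the corresponding Opt-PR-ELM steps. Each stage is bracketed by device synchronizations so that transfer and kernel times are reported separately.

import time
from numba import cuda

def timed(label, fn, *args, **kwargs):
    cuda.synchronize()                       # finish any pending GPU work first
    t0 = time.perf_counter()
    out = fn(*args, **kwargs)
    cuda.synchronize()                       # wait until the measured step completes
    print(f"{label}: {time.perf_counter() - t0:.4f} s")
    return out

# Host-to-device transfers: X, Y, W, alpha and b are all copied to the GPU.
# X_d = timed("copy X to GPU", cuda.to_device, X)
# W_d = timed("copy W to GPU", cuda.to_device, W)
# Computations: building H and solving for beta dominate the runtime.
# timed("compute H", lambda: compute_H_kernel[grid, block](X_d, W_d, H_d))
# timed("solve for beta", lambda: solve_beta(H_d, Y_d, beta_d))
# Device-to-host transfer: only beta (length L) is copied back, hence it is cheap.
# beta = timed("copy beta to host", beta_d.copy_to_host)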
Figure 6. Time decomposition of Opt-PR-ELM on
the energy consumption dataset with M=50
8 Conclusion
In this work, we proposed Opt-PR-ELM, a par-
allel version of non-iteratively trained RNNs for
time series prediction. Focusing on six RNN ar-
chitectures: Elman, Jordan, NARMAX, fully con-
nected RNN, LSTM and GRU, we first developed
a basic version of the parallel algorithm. Then,
we studied its memory access patterns to propose
an optimized version that takes advantage of the
shared memory of the GPU. In addition to perform-
ing a theoretical, computational analysis of Opt-PR-
ELM on the various architectures, empirical valida-
tion was performed on 10 publicly available time
series prediction datasets.
Opt-PR-ELM was shown to achieve a speedup of up to 461 over its sequential version and to require up to 20 times less training time than parallel BPTT. Even higher speedups are obtained with respect to older-generation CPUs, which highlights the importance of investing in high-end parallel architectures, especially in IoT and machine learning applications that require accurate, cost-sensitive, yet efficient solutions.
We further studied the portability and scala-
bility of our proposed algorithm by changing the
GPU architecture and the number of hidden neurons
and reporting the speedup. Opt-PR-ELM showed
higher speedups when the number of computations
increases or the number of launched threads per
block increases. Finally, Opt-PR-ELM was shown
to reach similar accuracies as its sequential version.
Future work includes extending Opt-PR-
ELM to RNNs with multiple layers and investi-
gating its performance on applications that have
multi-dimensional outputs such as machine transla-
tion and speech recognition.
Acknowledgment
This work was supported by the University Re-
search Board at the American University of Beirut.
References
[1] Yoshua Bengio, Patrice Simard, Paolo Frasconi,
et al. Learning long-term dependencies with gradi-
ent descent is difficult. IEEE transactions on neural
networks, 5(2):157–166, 1994.
[2] Stephen A Billings. Nonlinear system identifica-
tion: NARMAX methods in the time, frequency,
and spatio-temporal domains. John Wiley & Sons,
2013.
[3] Armando Blanco, Miguel Delgado, and Maria C
Pegalajar. A real-coded genetic algorithm for train-
ing recurrent neural networks. Neural networks,
14(1):93–105, 2001.
[4] Kyunghyun Cho, Bart Van Merriënboer, Dzmitry
Bahdanau, and Yoshua Bengio. On the properties
of neural machine translation: Encoder-decoder
approaches. arXiv preprint arXiv:1409.1259, 2014.
[5] Kyunghyun Cho, Bart Van Merriënboer, Caglar
Gulcehre, Dzmitry Bahdanau, Fethi Bougares,
Holger Schwenk, and Yoshua Bengio. Learning
phrase representations using rnn encoder-decoder
for statistical machine translation. arXiv preprint
arXiv:1406.1078, 2014.
[6] Jan K Chorowski, Dzmitry Bahdanau, Dmitriy
Serdyuk, Kyunghyun Cho, and Yoshua Bengio.
Attention-based models for speech recognition. In
Advances in neural information processing sys-
tems, pages 577–585, 2015.
[7] Junyoung Chung, Caglar Gulcehre, KyungHyun
Cho, and Yoshua Bengio. Empirical evaluation of
gated recurrent neural networks on sequence mod-
eling. arXiv preprint arXiv:1412.3555, 2014.
[8] Jerome T Connor, R Douglas Martin, and Les E
Atlas. Recurrent neural networks and robust time
series prediction. IEEE transactions on neural net-
works, 5(2):240–254, 1994.
[9] Jeffrey L Elman. Finding structure in time. Cogni-
tive science, 14(2):179–211, 1990.
[10] Ömer Faruk Ertugrul. Forecasting electricity load
by a novel recurrent extreme learning machines ap-
proach. International Journal of Electrical Power &
Energy Systems, 78:429–435, 2016.
[11] Martín Abadi et al. TensorFlow: Large-scale ma-
chine learning on heterogeneous systems, 2015.
Software available from tensorflow.org.
[12] Alex Graves, Navdeep Jaitly, and Abdel-rahman
Mohamed. Hybrid speech recognition with deep
bidirectional lstm. In 2013 IEEE workshop on
automatic speech recognition and understanding,
pages 273–278. IEEE, 2013.
[13] Alex Graves, Abdel-rahman Mohamed, and Geof-
frey Hinton. Speech recognition with deep recur-
rent neural networks. In 2013 IEEE international
conference on acoustics, speech and signal pro-
cessing, pages 6645–6649. IEEE, 2013.
[14] Qing He, Tianfeng Shang, Fuzhen Zhuang, and
Zhongzhi Shi. Parallel extreme learning machine
for regression based on mapreduce. Neurocomput-
ing, 102:52–58, 2013.
[15] Sepp Hochreiter and Jürgen Schmidhuber.
Long short-term memory. Neural computation,
9(8):1735–1780, 1997.
[16] Guang-Bin Huang, Qin-Yu Zhu, Chee-Kheong
Siew, et al. Extreme learning machine: a new learn-
ing scheme of feedforward neural networks. Neural
networks, 2:985–990, 2004.
[17] Shan Huang, Botao Wang, Junhao Qiu, Jitao Yao,
Guoren Wang, and Ge Yu. Parallel ensemble of on-
line sequential extreme learning machine based on
mapreduce. Neurocomputing, 174:352–367, 2016.
[18] Weikuan Jia, Dean Zhao, Yuanjie Zheng, and Su-
juan Hou. A novel optimized ga–elman neural net-
work algorithm. Neural Computing and Applica-
tions, 31(2):449–459, 2019.
[19] Michael I Jordan. Serial order: A parallel dis-
tributed processing approach. In Advances in psy-
chology, volume 121, pages 471–495. Elsevier,
1997.
[20] Viacheslav Khomenko, Oleg Shyshkov, Olga
Radyvonenko, and Kostiantyn Bokhan. Acceler-
ating recurrent neural network training using se-
quence bucketing and multi-gpu data paralleliza-
tion. In IEEE First International Conference on
Data Stream Mining & Processing, pages 100–103.
IEEE, 2016.
[21] Siu Kwan Lam, Antoine Pitrou, and Stanley Seib-
ert. Numba: A llvm-based python jit compiler. In
Proceedings of the second Workshop on the LLVM
Compiler Infrastructure in HPC, pages 1–6. ACM,
2015.
[22] Yann LeCun, Yoshua Bengio, and Geoffrey Hin-
ton. Deep learning. Nature, 521(7553):436, 2015.
[23] Jun Liu, Amir Shahroudy, Dong Xu, and Gang
Wang. Spatio-temporal lstm with trust gates for 3d
human action recognition. In European Conference
on Computer Vision, pages 816–833. Springer,
2016.
[24] Jun Liu, Gang Wang, Ling-Yu Duan, Kamila Ab-
diyeva, and Alex C Kot. Skeleton-based human ac-
tion recognition with global context-aware atten-
tion lstm networks. IEEE Transactions on Image
Processing, 27(4):1586–1599, 2017.
[25] James Martens and Ilya Sutskever. Learning recur-
rent neural networks with hessian-free optimiza-
tion. In Proceedings of the 28th International Con-
ference on Machine Learning (ICML-11), pages
1033–1040. Citeseer, 2011.
[26] Travis Oliphant. Guide to NumPy. 01 2006.
[27] Peng Ouyang, Shouyi Yin, and Shaojun Wei. A fast
and power efficient architecture to parallelize lstm
based rnn for cognitive intelligence applications. In
Proceedings of the 54th Annual Design Automa-
tion Conference 2017, pages 1–6. ACM, 2017.
[28] Yoh-Han Pao, Gwang-Hoon Park, and Dejan J
Sobajic. Learning and generalization characteris-
tics of the random vector functional-link net. Neu-
rocomputing, 6(2):163–180, 1994.
[29] Jin-Man Park and Jong-Hwan Kim. Online re-
current extreme learning machine and its applica-
tion to time-series prediction. In 2017 International
Joint Conference on Neural Networks (IJCNN),
pages 1983–1990. IEEE, 2017.
[30] Yara Rizk and Mariette Awad. On extreme learn-
ing machines in sequential and time series predic-
tion: A non-iterative and approximate training al-
gorithm for recurrent neural networks. Neurocom-
puting, 325:1–19, 2019.
[31] Jürgen Schmidhuber. Deep learning in neural net-
works: An overview. Neural networks, 61:85–117,
2015.
[32] Wouter F Schmidt, Martin A Kraaijveld, and
Robert PW Duin. Feedforward neural networks
with random weights. In 11th IAPR International
Conference on Pattern Recognition. Vol. II. Con-
ference B: Pattern Recognition Methodology and
Systems, pages 1–4. IEEE, 1992.
[33] Xavier Sierra-Canto, Francisco Madera-Ramirez,
and Victor Uc-Cetina. Parallel training of a back-
propagation neural network using cuda. In 2010
Ninth International Conference on Machine Learn-
ing and Applications, pages 307–312. IEEE, 2010.
[34] Zhiyuan Tang, Ying Shi, Dong Wang, Yang Feng,
and Shiyue Zhang. Memory visualization for gated
recurrent neural networks in speech recognition. In
2017 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP), pages
2736–2740. IEEE, 2017.
[35] Hubert AB Te Braake and Gerrit Van Straten. Ran-
dom activation weight neural net (rawn) for fast
non-iterative training. Engineering Applications of
Artificial Intelligence, 8(1):71–80, 1995.
[36] Mark Van Heeswijk, Yoan Miche, Erkki Oja,
and Amaury Lendasse. Gpu-accelerated and par-
allelized elm ensembles for large-scale regression.
Neurocomputing, 74(16):2430–2437, 2011.
[37] Botao Wang, Shan Huang, Junhao Qiu, Yu Liu, and
Guoren Wang. Parallel online sequential extreme
learning machine based on mapreduce. Neurocom-
puting, 149:224–232, 2015.
[38] Shang Wang, Yifan Bai, and Gennady Pekhi-
menko. Scaling back-propagation by parallel scan
algorithm. arXiv preprint arXiv:1907.10134, 2019.
[39] Xiaoyu Wang and Yong Huang. Convergence study
in extended kalman filter-based training of recur-
rent neural networks. IEEE Transactions on Neural
Networks, 22(4):588–600, 2011.
[40] Paul J Werbos et al. Backpropagation through time:
what it does and how to do it. Proceedings of the
IEEE, 78(10):1550–1560, 1990.
[41] Ronald J Williams and David Zipser. Gradient-based learning algorithms for recurrent networks and their computational complexity. Backpropagation: Theory, architectures, and applications, 433, 1995.
[42] Yonghui Wu, Mike Schuster, Zhifeng Chen,
Quoc V Le, Mohammad Norouzi, Wolfgang
Macherey, Maxim Krikun, Yuan Cao, Qin Gao,
Klaus Macherey, et al. Google’s neural machine
translation system: Bridging the gap between
human and machine translation. arXiv preprint
arXiv:1609.08144, 2016.
[43] Feng Zhang, Jidong Zhai, Marc Snir, Hai Jin, Hi-
ronori Kasahara, and Mateo Valero. Guest edito-
rial: Special issue on network and parallel com-
puting for emerging architectures and applications,
2019.
[44] Shunlu Zhang, Pavan Gunupudi, and Qi-Jun
Zhang. Parallel back-propagation neural network
training technique using cuda on multiple gpus. In
IEEE MTT-S International Conference on Numer-
ical Electromagnetic and Multiphysics Modeling
and Optimization, pages 1–3. IEEE, 2015.
Julia El Zini is a Ph.D. student enrolled in the Electrical and Computer Engineering Department at the American University of Beirut (AUB). She received her B.S. and M.S. in computer science from AUB, Lebanon, in 2015 and 2017, respectively. Her research interests include distributed optimization, parallel computing, reinforcement learning, multi-task and transfer learning, and scalable machine learning applications.

Yara Rizk obtained her Ph.D. in Electrical and Computer Engineering from the American University of Beirut (AUB) in 2018. Prior to that, she received her BE in Computer and Communication Engineering from AUB, Lebanon, in 2012. Her research interests span robotics, multi-agent systems, machine learning, classification, clustering, and artificial intelligence. Rizk attended a technical internship (2013-2014) at Intel in Hillsboro, Oregon, USA and is an active researcher with multiple peer-reviewed publications.

Mariette Awad obtained her Ph.D. in Electrical Engineering from the University of Vermont (2007). Her current research focuses on HMI, efficient artificial intelligence, applied machine learning, and the Internet of Things. Dr. Awad has received more than 25 grants to support her research, including 2 multidisciplinary multi-million dollar grants from the Qatar National Research Fund (QNRF) and Intel. Her work culminated in a book, Efficient Machine Learning, in 2015, as well as more than 100 conference, book chapter, and journal publications. Prior to her academic position, she was with the IBM Systems and Technology Group in Vermont for six years, where her technical leadership and innovative spirit earned her management recognition twice, two business awards, and 10 patents.