JAISCR, 2021, Vol. 11, No. 1, pp. 33 – 50
10.2478/jaiscr-2021-0003
AN OPTIMIZED PARALLEL IMPLEMENTATION OF
NON-ITERATIVELY TRAINED RECURRENT NEURAL
NETWORKS
Julia El Zini, Yara Rizk and Mariette Awad
Department of Electrical and Computer Engineering
American University of Beirut
E-mail: {jwe04,yar01,mariette.awad}@aub.edu.lb
Submitted: 7th May 2020; Accepted: 14th September 2020
Abstract
Recurrent neural networks (RNN) have been successfully applied to various sequential
decision-making tasks, natural language processing applications, and time-series predic-
tions. Such networks are usually trained through back-propagation through time (BPTT)
which is prohibitively expensive, especially when the length of the time dependencies
and the number of hidden neurons increase. To reduce the training time, extreme learning
machines (ELMs) have been recently applied to RNN training, reaching a 99% speedup
on some applications. Due to its non-iterative nature, ELM training, when parallelized,
has the potential to reach higher speedups than BPTT.
In this work, we present Opt-PR-ELM, an optimized parallel RNN training algorithm
based on ELM that takes advantage of the GPU shared memory and of parallel QR fac-
torization algorithms to efficiently reach optimal solutions. The theoretical analysis of the
proposed algorithm is presented on six RNN architectures, including LSTM and GRU,
and its performance is empirically tested on ten time-series prediction applications. Opt-
PR-ELM is shown to reach up to 461 times speedup over its sequential counterpart and
to require up to 20 times less time to train than parallel BPTT. Such high speedups over
new-generation CPUs are crucial in real-time applications and IoT environments.
Keywords: GPU implementation, parallelization, Recurrent Neural Network (RNN),
Long Short-Term Memory (LSTM), Gated Recurrent Unit (GRU), Extreme Learning Ma-
chines (ELM), non-iterative training
1 Introduction
Recurrent neural networks (RNN) are a type
of neural networks that have been successfully ap-
plied to many problems in machine learning [22].
They have proven their ability to exceed human
performance in time series prediction and sequen-
tial decision-making [31]. RNNs’ training is usu-
ally based on gradient descent methods, specifically
back-propagation through time (BPTT) [40], and
real-time recurrent learning [41] which require a
substantial amount of iterations before converging.
Moreover, when unfolded through time, RNNs be-
come even deeper [1] and their training becomes
even more expensive since the number of learned
weights grows exponentially with the number of
hidden neurons and the length of time dependency.
Non-iterative training algorithms have been in-
vestigated in the literature [32, 1, 35] to reduce
the training cost of neural networks. Recently, Er-
tugrul et al. [10] proposed a non-iterative train-
ing algorithm for Jordan RNNs [19]. Then, Rizk
et al. [30] extended it to different RNN archi-
tectures, including Elman, fully connected RNN,
and Long Short-Term Memory (LSTM). Their al-
gorithm was tested on time-series and sequential
decision-making problems and achieved a speedup
of up to 99% over iterative training.
Although they only need one iteration to obtain
near-optimal solutions, non-iterative training algo-
rithms minimize their cost function by computing a
Moore-Penrose pseudo-inverse which requires am-
ple computational resources, especially for large
matrices. To the best of our knowledge, no attempts
have been made in the literature to parallelize non-
iterative training algorithms for RNNs. Fortunately,
such algorithms hold great potential for paralleliza-
tion due to their non-sequential nature.
In this work, we propose Basic-PR-ELM, a
basic parallel version of ELM training applied
on six RNN architectures: Elman, Jordan, NAR-
MAX, fully connected, LSTM, and GRU. Basic-
PR-ELM relies on parallel QR factorization to solve
the pseudo-inverse required in ELM training algo-
rithms. Then, the memory access patterns were
studied and led to Opt-PR-ELM, an optimizedver-
sion of parallel ELM training that utilizes the GPU
shared memory to speedup the training process fur-
ther.
The proposed algorithms, Basic-PR-ELM and
Opt-PR-ELM, are tested on 10 publicly available
time-series prediction applications and on different
GPU architectures to empirically show their scal-
ability, robustness, portability, speedup potentials,
and energy efficiency. Compared to the sequential
version proposed by Rizk et al. in [30], Basic-PR-
ELM and Opt-PR-ELM achieve speedups of up to 311 and 461, respectively, on the LSTM architecture. Notably, Opt-PR-ELM is shown to train
LSTM networks 20 times faster than the parallel it-
erative training algorithms (BPTT).
The rest of the paper is organized as follows:
Section 2 presents the background on ELM-training
and the RNN architectures. Section 3 summa-
rizes the related work on RNN training and the
parallel training algorithms. Section 4 presents
the proposed algorithms Basic-PR-ELM and Opt-
PR-ELM and Section 5 theoretically analyzes their
memory and floating-point operations. Then, Section 6 discusses the experimental setup and Sec-
tion 7 reports the empirical results. Finally, Sec-
tion 8 concludes with final remarks.
2 Background
2.1 Extreme Learning Machine
Extreme Learning Machine (ELM) is a non-iterative training algorithm introduced by Huang et al. [16] for single hidden layer feedforward neural networks (SLFNs). Given $n$ arbitrary distinct training samples $(x_j, y_j)$ where $x_j \in \mathbb{R}^m$ and $y_j \in \mathbb{R}$, $M$ hidden nodes and $g$ as activation function, the predicted output $O_j$ can be written as $O_j = \sum_{i=1}^{M} \beta_i \, g(w_i^T x_j + b_i)$, where $w_i \in \mathbb{R}^m$ is the weight vector connecting the $i$th hidden node and the input nodes, $\beta \in \mathbb{R}^M$ is the weight vector connecting all the hidden nodes and the output node, and $b_i$ is the bias of the $i$th hidden node. Throughout the training, the input weights $w_i$ are randomly generated and fixed, and the output weights $\beta_1, \ldots, \beta_M$ are analytically computed. The goal is to minimize the error between the predicted and the true output as

$$\min_{\beta} \sum_{j=1}^{n} \left\| O_j - t_j \right\|^2 = \min_{\beta} \sum_{j=1}^{n} \left\| \sum_{i=1}^{M} \beta_i \, g(w_i^T x_j + b_i) - t_j \right\|^2. \quad (1)$$

Defining $H$ and $T$ as

$$H_{(n \times M)} = \begin{bmatrix} g(w_1^T x_1 + b_1) & \cdots & g(w_M^T x_1 + b_M) \\ \vdots & \ddots & \vdots \\ g(w_1^T x_n + b_1) & \cdots & g(w_M^T x_n + b_M) \end{bmatrix} \quad (2)$$

$$T_{(n \times 1)} = [t_1, t_2, \ldots, t_n]^T, \quad (3)$$

one can compactly write the problem in Equation 1 as minimizing $\|H\beta - T\|^2$. The solution of this problem is given as $\beta = H^{\dagger} T$, where $H^{\dagger} = (H^T H)^{-1} H^T$ is the Moore–Penrose generalized inverse of the matrix $H$.
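The procedure above fits in a few lines of NumPy. This is an illustrative sketch rather than the authors' implementation; the function names are our own, and the least-squares solve stands in for forming the pseudo-inverse explicitly:

```python
import numpy as np

def elm_train(X, T, M, g=np.tanh, seed=0):
    """Train a single-hidden-layer feedforward network with ELM.
    X: (n, m) inputs, T: (n,) targets, M: number of hidden nodes."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    W = rng.standard_normal((m, M))   # random, fixed input weights w_i
    b = rng.standard_normal(M)        # random, fixed hidden biases b_i
    H = g(X @ W + b)                  # (n, M) hidden-layer output matrix
    # beta = H† T, computed via least squares instead of forming (HᵀH)⁻¹Hᵀ
    beta, *_ = np.linalg.lstsq(H, T, rcond=None)
    return W, b, beta

def elm_predict(X, W, b, beta, g=np.tanh):
    return g(X @ W + b) @ beta

# Fit a noiseless 1-D function with ample hidden nodes
X = np.linspace(-1, 1, 200).reshape(-1, 1)
T = np.sin(3 * X[:, 0])
W, b, beta = elm_train(X, T, M=50)
err = np.max(np.abs(elm_predict(X, W, b, beta) - T))
```

Since the only learned parameters are the output weights, training reduces to a single linear solve, which is the step the rest of this paper parallelizes.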
2.2 RNN architectures
RNNs are among the most powerful neural networks and are well suited to model long-term dependencies in time-series applications [31].
RNN architectures differ in the way cycles are in-
troduced in the network. In this work, we con-
sider six RNN architectures, illustrated in Fig-
ure 1: Elman [9], Jordan [19], NARMAX [8],
fully connected RNN, LSTM [15] and GRU [5].
Figure 1. RNN architectures adapted from prior
work in [30]
In Figure 1 and throughout this work, $x \in \mathbb{R}^{S \times Q}$ is the input to the network, $M$ is the number of hidden neurons, $w_i \in \mathbb{R}^S$ is the vector of weights connecting the input to the $i$th neuron, $\alpha_{ik} \in \mathbb{R}$ is the weight from neuron $i$ to itself from the $k$th previous time step, and $b_i$ is the $i$th bias.
2.2.1 Elman
Elman RNNs are single hidden layer networks where context neurons introduce recurrence by feeding back signals as the internal state of the network. At time step $t$, the output is

$$\hat{y} = \sum_{i=1}^{M} \beta_i f_i(t), \quad (4)$$

where $f_i(t) = g\left(w_i^T x(t) + \sum_{k=1}^{Q} \alpha_{ik} f_i(t-k) + b_i\right)$ is the output of neuron $i$ at time $t$.
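This recurrence can be transcribed directly in NumPy (an illustrative sketch with our own variable names, not the paper's code); it makes explicit that neuron $i$ at time $t$ depends only on the current input and on that same neuron's $Q$ previous outputs:

```python
import numpy as np

def elman_hidden(x, W, alpha, b, g=np.tanh):
    """x: (S, T) input sequence, W: (S, M), alpha: (M, Q), b: (M,).
    Returns f: (M, T) where f[:, t] holds the neuron outputs at step t."""
    S, T = x.shape
    M, Q = alpha.shape
    f = np.zeros((M, T))
    for t in range(T):
        acc = W.T @ x[:, t] + b            # w_i^T x(t) + b_i for all neurons
        for k in range(1, Q + 1):          # feedback from the Q previous steps
            if t - k >= 0:
                acc += alpha[:, k - 1] * f[:, t - k]
        f[:, t] = g(acc)
    return f

rng = np.random.default_rng(1)
f = elman_hidden(rng.standard_normal((3, 5)), rng.standard_normal((3, 4)),
                 0.1 * rng.standard_normal((4, 2)), np.zeros(4))
```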
2.2.2 Jordan
Jordan networks are similar to Elman's except for the way recurrence is introduced. In the Jordan architecture, signals are fed back from the predicted output of the previous time steps. Consequently, such networks are more suitable for time series prediction where dependencies are on current input and previous outputs. Specifically, the output at time step $t$ is described by Equation 4 with $f_i(t) = g\left(w_i^T x(t) + \sum_{k=1}^{Q} \alpha_{ik} \hat{y}(t-k) + b_i\right)$.
2.2.3 NARMAX
The Nonlinear AutoRegressive Moving Average model with eXogenous inputs (NARMAX) represents a wide class of nonlinear systems [2]. NARMAX networks have been proposed for non-linear time series prediction using artificial neural networks and are described by

$$\hat{y}(t) = \sum_{i=1}^{M} \beta_i \, g\left(w_i^T x(t) + \sum_{l=1}^{F} w'_{il}\, y(t-l) + \sum_{l=1}^{R} w''_{il}\, e(t-l) + b_i\right),$$

where $F$ and $R$ are the lengths of the time dependency of the output and the error feedbacks respectively, $e(t) = y(t) - \hat{y}(t)$, and $w'_{il} \in \mathbb{R}$ ($w''_{il} \in \mathbb{R}$ resp.) is the weight from the output (error resp.) at the $l$th time step to the $i$th hidden neuron.
2.2.4 Fully Connected RNN
A fully connected RNN is the most general RNN architecture, in which signals are fed back from all hidden neurons at previous time steps. Specifically, the output at time step $t$ is described by Equation 4 with $f_i(t) = g\left(w_i^T x(t) + \sum_{k=1}^{Q} \sum_{l=1}^{M} \alpha_{ilk} f_l(t-k) + b_i\right)$. In this case, $\alpha_{ilk} \in \mathbb{R}$ is the weight connecting neuron $i$ to neuron $l$ from the $k$th previous time step.
2.2.5 LSTM
LSTMs were introduced by [15] to solve the vanishing gradient problem in BPTT. LSTMs have been successfully applied to a wide variety of applications including speech recognition [12, 13], machine translation [4, 42] and human action recognition [23, 24]. An LSTM unit is composed of the main cell and input, output and forget gates which regulate the flow of information into and out of the cell through forgetting factors and weights. This formulation gives the network the ability to decide which information to remember. The output of LSTM is described by Equation 4 with $f(t) = o(t) \circ g_f(c(t))$, where $\circ$ is the Hadamard product of two matrices and $o(t)$, $c(t)$, $\lambda(t)$ and $in(t)$ are given by

$$o(t) = g_o\left(W_o x(t) + U_o f(t-1) + b_o\right)$$
$$c(t) = \lambda(t) \circ c(t-1) + in(t) \circ g_c\left(W_c x(t) + U_c f(t-1) + b_c\right)$$
$$\lambda(t) = g_\lambda\left(W_\lambda x(t) + U_\lambda f(t-1) + b_\lambda\right)$$
$$in(t) = g_{in}\left(W_{in} x(t) + U_{in} f(t-1) + b_{in}\right).$$
2.2.6 GRU
GRUs were introduced in [5] as a gating mechanism for RNNs. They resemble LSTMs but have only two gates and fewer parameters. GRUs expose their state at each time step and do not have any mechanism to control the degree to which their state is exposed [7]. They exhibit good performance on small datasets [7] and are widely used in speech recognition [34, 6] and sequence modeling [7]. GRUs' output is described by Equation 4 while $f(t)$ is given by

$$f(t) = (1 - z(t)) \circ f(t-1) + z(t) \circ g_f\left(W_f x(t) + U_f\left(r(t) \circ f(t-1)\right) + b_f\right), \quad (5)$$

where $z(t) = g_z\left(W_z x(t) + U_z f(t-1) + b_z\right)$ and $r(t) = g_r\left(W_r x(t) + U_r f(t-1) + b_r\right)$.
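As a sanity check on Equation 5, a single GRU update can be sketched in NumPy. This is our own illustrative code; the parameter-dictionary layout and the sigmoid/tanh choices for the gate activations are assumptions, since the text leaves $g_z$, $g_r$ and $g_f$ generic:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, f_prev, P):
    """One GRU update following Equation 5. P holds (W, U, b) per gate."""
    z = sigmoid(P['Wz'] @ x_t + P['Uz'] @ f_prev + P['bz'])   # update gate z(t)
    r = sigmoid(P['Wr'] @ x_t + P['Ur'] @ f_prev + P['br'])   # reset gate r(t)
    cand = np.tanh(P['Wf'] @ x_t + P['Uf'] @ (r * f_prev) + P['bf'])
    return (1 - z) * f_prev + z * cand                        # convex blend

rng = np.random.default_rng(2)
S, M = 3, 4
P = {k: rng.standard_normal((M, S)) for k in ('Wz', 'Wr', 'Wf')}
P.update({k: rng.standard_normal((M, M)) for k in ('Uz', 'Ur', 'Uf')})
P.update({k: np.zeros(M) for k in ('bz', 'br', 'bf')})
f = np.zeros(M)
for t in range(5):
    f = gru_step(rng.standard_normal(S), f, P)
```

Because each step is a convex combination of the previous state and a tanh candidate, the state stays bounded in [-1, 1], which reflects GRUs exposing their state directly at every step.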
3 Related Work
This work focuses on the parallelization of a
non-iterative training algorithm for RNNs. In what
follows, we first discuss the basic training meth-
ods of RNNs while focusing on the non-iterative
ones. Then, we report the parallelization attempts
for training algorithms.
3.1 RNN Training
3.1.1 Iterative RNN Training
Training RNNs has been mainly done itera-
tively through BPTT [40] which unfolds the re-
currence through time to transform the RNN into
a feedforward network trained using gradient de-
scent. BPTT is susceptible to local minima and
suffers from the vanishing and exploding gradient
problems with long time dependencies. BPTT can
also be slow, given that it is applied iteratively in
batch mode. Other iterative algorithms include, but
are not limited to, Hessian free optimization [25],
extended Kalman filters [39] and genetic algorithms
(GA) [3]. Although successful, these algorithms
are computationally expensive and require manual tuning of many hyper-parameters.
3.1.2 Non-Iterative RNN Training
Different non-iterative training algorithms have
been proposed to reduce the computational cost of
training neural networks in general. For instance,
the authors in [32, 35, 28, 16] proposed ELM, a
non-iterative method to train single hidden layer
feedforward networks by randomly assigning input
weights and computing output weights using the
least-squares method. These methods were later ex-
tended to RNN architectures when Ertugrul imple-
mented a non-iterative training for the Jordan RNN
architecture in electricity load forecasting applica-
tions [10]. Later, Park et al. extended it to online
RNNs [29] and Rizk et al. generalized the approach
to more powerful RNN architectures [30].
Although these methods achieved high
speedups (up to 99% in [30]), they heavily rely
on stencil operations and on the computation of the
generalized inverse of matrices which are CPU in-
tensive operations and could be further optimized
using parallel algorithms.
3.2 Parallelizing Training Algorithms
Several frameworks have been developed to
solve challenges of high performance computing in
the big data area [43], including parallelizing train-
ing algorithms. This is the first attempt to paral-
lelize non-iterative training of RNNs; thus we de-
scribe previous work on the parallelization of RNN
iterative training algorithms and on the parallel
non-iterative training for neural networks, not exclusively RNNs.
3.2.1 Parallelizing Iterative Training Algo-
rithms For RNN
Parallelizing RNN training is mostly based on
parallelizing the back-propagation algorithm (BP).
For instance, Sierra et al. parallelized BP on
CUBLAS and achieved a speedup of 63. In [44],
data is distributed on multiple GPUs achieving a
speedup of up to 51 [33]. In [38], a parallel scan algorithm improves the step complexity of BP from O(n) to O(log n). Khomenko et al. parallelized
their data on multiple GPUs and relied on batch
bucketing by input sequence length to accelerate
RNN training achieving a speedup of up to 4 [20].
In [27], a semantic correlation-based data pre-fetch
framework is implemented to break the dependency
in the input to parallelize the training of cognitive applications. Their work is tested on LSTMs using image captioning, speech recognition, and language processing applications, showing speedups of 5.1, 44.9 and 1.53, respectively. Recently, GA was introduced into the Elman architecture to accelerate the training and prevent the local minima problem [18]. GA-Elman outperforms traditional training algorithms in terms of convergence speed and accuracy.
3.2.2 Parallelizing Non-Iterative Training Al-
gorithms
Non-iterative training algorithms for RNNs are
shown to require less training time than iterative
methods [30, 10, 29]. However, even with non-
iterative training, large datasets require costly com-
putations, especially when increasing the number of
neurons or when model selection is performed to
avoid over-fitting [36]. Parallelizing non-iterative
training has been explored in single layer feedfor-
ward networks by [14]. Their approach is based on Map-Reduce and achieves a speedup of up to 5.6
when tested on 32 cores. Following a similar ap-
proach, Wang et al. [37] developed a parallel imple-
mentation of online ELM and achieved a speedup
of 3.5 when trained on 120K instances with 120 at-
tributes. Huang et al. extended their approach to the
ensemble online sequential ELM which was tested
on real and synthetic data with 5120K training in-
stances and 512 attributes and achieved a speedup
of 40 on a cluster with 80 cores [17]. In [36],
Van et al. attempted to parallelize ELM on Flink
with multi-hidden-layer feedforward networks and achieved a speedup of 17.
To the best of our knowledge, our work is the
first attempt to parallelize non-iterative training for
different RNN architectures.
4 Methodology
Before proposing our methods, we present the
nomenclature that will be used throughout this pa-
per in Table 1.
Table 1. Nomenclature

$n$: number of training samples
$M$: number of hidden neurons
$Q$: max number of time dependencies
$S$: dimension of input
$x_j \in \mathbb{R}^{S \times Q}$: $j$th input instance
$y_j \in \mathbb{R}$: $j$th output instance
$X \in \mathbb{R}^{n \times S \times Q}$: input matrix
$Y \in \mathbb{R}^{n}$: output matrix
$W \in \mathbb{R}^{S \times L}$: weight matrix connecting the input to the hidden neurons
$\alpha \in \mathbb{R}^{L \times Q}$: weight matrix connecting each hidden neuron to itself for previous time steps
$b \in \mathbb{R}^{L}$: bias vector for the hidden neurons
$\beta \in \mathbb{R}^{L}$: weight vector connecting hidden neurons to the output layer
S-R-ELM: sequential ELM for RNN training
Basic-PR-ELM: basic parallel ELM RNN training
Opt-PR-ELM: optimized parallel ELM RNN training
BPTT: back-propagation through time
P-BPTT: parallel back-propagation through time
BS: block size
TW: tile width
In this work, a parallel version of ELM-trained
RNNs will be formalized and implemented. The
sequential version of our approach, denoted by S-R-ELM, is summarized in Algorithm 1 and is adopted from our previous work in [30].
Algorithm 1 S-R-ELM algorithm
1: Randomly assign $W$, $\alpha$, $b$
2: Compute $H(t)$, $t = 1 \ldots Q$ according to the corresponding RNN architecture
3: Compute $\beta = H(Q)^{\dagger} Y$ using the generalized Moore–Penrose pseudoinverse

$H(t)$ at row $i$ and column $j$ is referred to as $h_{ij}[t]$ in this paper and is computed as in Equations 6, 7, 8, 9, 10 and 11 for the Elman, Jordan, NARMAX, fully connected, LSTM and GRU architectures respectively.

$$h_{ij}[t] = g\left(W[:,j] \cdot X[i,:,t] + b_j + \sum_{k=1}^{Q} \alpha[j,k]\, h_{ij}[t-k]\right) \quad (6)$$

$$h_{ij}[t] = g\left(W[:,j] \cdot X[i,:,t] + b_j + \sum_{k=1}^{Q} \alpha[j,k]\, \hat{y}(t-k)\right) \quad (7)$$

$$h_{ij}[t] = g\left(W[:,j] \cdot X[i,:,t] + b_j + \sum_{l=1}^{F} W'[i,l]\, y(t-l) + \sum_{l=1}^{R} W''[i,l]\, e(t-l)\right) \quad (8)$$

$$h_{ij}[t] = g\left(W[:,j] \cdot X[i,:,t] + b_j + \sum_{k=1}^{Q} \sum_{l=1}^{M} \alpha[j,l,k]\, h_{il}[t-k]\right) \quad (9)$$

$$h_{ij}[t] = o[i,j,t] \circ g_f\left(c[i,j,t]\right) \quad (10)$$

$$h_{ij}[t] = \left(1 - z[i,j,t]\right) \circ h_{ij}[t-1] + z[i,j,t] \circ g_f\left(W_f[:,j] \cdot X[i,:,t] + U_f\left(r[i,j,t] \circ h_{ij}[t-1]\right) + b_j\right) \quad (11)$$
Considering Algorithm 1, one can see that the running time of the ELM training mainly consists of two CPU-intensive operations: computing $H$ and computing $\beta$ by solving the linear system using the Moore–Penrose pseudo-inverse. Thus, those two operations are the main target when optimizing the performance of non-iterative training.
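The three steps of Algorithm 1, specialized to the Elman architecture of Equation 6, can be sketched sequentially in NumPy. This is our own illustrative transcription, not the paper's code; Basic-PR-ELM and Opt-PR-ELM replace both expensive steps with GPU kernels. Following the paper's approach, the pseudo-inverse is applied through a QR factorization rather than by explicitly inverting $H^T H$:

```python
import numpy as np

def s_r_elm_elman(X, Y, M, Q, g=np.tanh, seed=0):
    """Sequential ELM training of an Elman RNN (Algorithm 1).
    X: (n, S, Q) inputs, Y: (n,) targets, M hidden neurons."""
    rng = np.random.default_rng(seed)
    n, S, _ = X.shape
    W = rng.standard_normal((S, M))           # step 1: random W, alpha, b
    alpha = 0.1 * rng.standard_normal((M, Q))
    b = rng.standard_normal(M)
    H = np.zeros((n, M, Q + 1))               # step 2: H(t) per Equation 6
    for t in range(1, Q + 1):
        acc = X[:, :, t - 1] @ W + b
        for k in range(1, t):                 # feedback from computed steps
            acc += alpha[:, k - 1] * H[:, :, t - k]
        H[:, :, t] = g(acc)
    # step 3: beta = H(Q)† Y via QR: H(Q) = Q_f R  =>  beta = R⁻¹ Q_fᵀ Y
    Qf, R = np.linalg.qr(H[:, :, Q])
    beta = np.linalg.solve(R, Qf.T @ Y)
    return W, alpha, b, beta

rng = np.random.default_rng(3)
X = rng.standard_normal((50, 3, 4))
Y = rng.standard_normal(50)
W, alpha, b, beta = s_r_elm_elman(X, Y, M=10, Q=4)
```

The QR route solves the same least-squares problem as the pseudo-inverse while avoiding the ill-conditioned product $H^T H$, which is why the parallel versions adopt it.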
4.1 H Computation
4.1.1 Basic Parallel Implementation (Basic-
PR-ELM)
For all RNN architectures, the computation of $H(t)$ at row $i$ and column $j$ is independent of the computation of $H(t)$ at row $i_2$ and column $j_2$, $\forall i_2 \neq i$, $j_2 \neq j$; it only depends on $H(t_2)$ at row $i$ and column $j$ for $t_2 < t$. Given only this dependency, a parallel $H$ computation can be done as follows: each thread $(i, j)$ independently computes $H(t)$ at row $i$ and column $j$ for $t = 1, \ldots, Q$. We describe the basic implementation of the computation of $H$ for the Elman architecture in Algorithm 2.
Algorithm 2 Basic-PR-ELM by thread (i, j)
1: tx ← threadIdx.x
2: ty ← threadIdx.y
3: Row ← tx + blockIdx.x × blockDim.x
4: Col ← ty + blockIdx.y × blockDim.y
5: for t = 1 → Q do
6:     hij ← W[:, Col] · X[Row, :, t]
7:     hij ← hij + b[Col]
8:     for tprev = 1 → t do
9:         hij ← hij + α[Col, tprev] × H[Row, Col, tprev]
10:    end for
11:    H[Row, Col, t] ← hij
12: end for
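Because each (Row, Col) pair is independent, the thread body of Algorithm 2 can be simulated sequentially on the CPU. The sketch below is our own transcription, not the actual CUDA kernel; the "threads" may run in any order and still produce the same $H$:

```python
import numpy as np

def basic_pr_elm_thread(Row, Col, X, W, alpha, b, H, Q):
    """Body of one Basic-PR-ELM thread (Algorithm 2), simulated on CPU."""
    for t in range(1, Q + 1):
        hij = W[:, Col] @ X[Row, :, t - 1]         # line 6: dot product
        hij += b[Col]                               # line 7: bias
        for tprev in range(1, t):                   # lines 8-10: feedback
            hij += alpha[Col, tprev - 1] * H[Row, Col, tprev]
        H[Row, Col, t] = hij                        # line 11: store

rng = np.random.default_rng(4)
n, S, M, Q = 6, 3, 4, 3
X = rng.standard_normal((n, S, Q))
W, alpha, b = (rng.standard_normal((S, M)), rng.standard_normal((M, Q)),
               rng.standard_normal(M))
H = np.zeros((n, M, Q + 1))
for i in range(n):             # every (i, j) "thread" may run in any order
    for j in range(M):
        basic_pr_elm_thread(i, j, X, W, alpha, b, H, Q)
```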
4.1.2 Optimized Parallel Implementation
(Opt-PR-ELM)
Figure 2. Basic-PR-ELM memory access patterns
on Elman
Figure 2 illustrates the memory access patterns of Basic-PR-ELM on the Elman architecture. One can clearly see that threads in the same row access the same elements of $X$ and threads in the
39
Julia El Zini, Yara Rizk and Mariette Awad
Algorithm 1 S-R-ELM algorithm
1: Randomly assign W,α,b
2: Compute H(t),t=1...Qaccording to the cor-
responding RNN architecture
3: Compute β=H(Q)†Yusing the generalized
Moore–Penrose pseudoinverse
H(t) at row iand column jis referred to as hij[t]
in this paper and is computed as in Equations 6, 7,
8, 9, 10 and 11 for the Elman, Jordan, NARMAX,
fully connected, LSTM and GRU architectures re-
spectively.
hij[t]=g(W[:,j].X[i,:,t]+bi+
Q
∑
k=1
α[j,k]hij[t−k]
(6)
hij[t]=g(W[:,j].X[i,:,t]+bi+
Q
∑
k=1
α[j,k]ˆy(t−k)
(7)
hij[t]=g(W[:,j].X[i,:,t]+bi+
F
∑
l=1
W[i,l]y(t−l)+
R
∑
l=1
W[i,l]e(t−l)(8)
hij[t]=gW[:,j].X[i,:,t]+bi+
Q
∑
k=1
M
∑
l=1
α[j,l,k]hij[t−k](9)
hij[t]=o[i,j,t]◦gfc[i,j,t](10)
hij[t]=1−z[i,j,t]◦hij[t−1]+z[i,j,t]◦
gfWf[:,j].X[i,:,t]+Uf(r[i,j,t]◦hij[t−1]+bi)
(11)
Considering Algorithm 1, one can see that the
running time of the ELM training mainly consists
of two CPU intensive operations: computing Hand
computing βby solving the linear system using the
Moore-Penrose pseudo-inverse. Thus, those two
operations are the main target when optimizing the
performance of non-iterative training.
4.1 HComputation
4.1.1 Basic Parallel Implementation (Basic-PR-ELM)
For all RNN architectures, the computation of H(t) at row i and column j is independent of the computation of H(t) at row i2 and column j2, ∀ i2 ≠ i, j2 ≠ j; it depends only on H(t2) at row i and column j for t2 < t. Given this single dependency, H can be computed in parallel as follows: each thread (i, j) independently computes H(t) at row i and column j for t = 1, ..., Q. We describe the basic implementation of the computation of H for the Elman architecture in Algorithm 2.
Algorithm 2 Basic-PR-ELM by thread (i, j)
1: tx ← threadIdx.x
2: ty ← threadIdx.y
3: Row ← tx + blockIdx.x × blockDim.x
4: Col ← ty + blockIdx.y × blockDim.y
5: for t = 1 → Q do
6:   hij ← W[:, Col] . X[Row, :, t]
7:   hij ← hij + bCol
8:   for tprev = 1 → t do
9:     hij ← hij + α[j, tprev] × H[Row, Col, tprev]
10:  end for
11:  H[Row, Col, t] ← hij
12: end for
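Since entries of H at different (i, j) are independent, the per-thread loop of Algorithm 2 can be mimicked on the CPU by vectorizing over the whole (Row, Col) grid while keeping only the t loop sequential. The sketch below follows Algorithm 2 (which omits the activation g); the array shapes are assumptions for illustration.

```python
import numpy as np

def basic_pr_elm_H(X, W, alpha, b):
    """Compute H as Algorithm 2 does, with all (Row, Col) "threads"
    replaced by vectorized NumPy operations: entries at different
    (i, j) are independent, and only the t loop is sequential.
    X: (N, S, Q), W: (S, M), alpha: (M, Q), b: (M,) -> H: (N, M, Q)."""
    N, S, Q = X.shape
    M = W.shape[1]
    H = np.zeros((N, M, Q))
    for t in range(Q):
        h = X[:, :, t] @ W + b                     # lines 6-7
        for tprev in range(t):                     # lines 8-10
            h += alpha[:, tprev] * H[:, :, tprev]
        H[:, :, t] = h                             # line 11
    return H
```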
4.1.2 Optimized Parallel Implementation
(Opt-PR-ELM)
Figure 2. Basic-PR-ELM memory access patterns
on Elman
Figure 2 illustrates the memory access patterns of Basic-PR-ELM on the Elman architecture. One can clearly see that threads in the same row access the same elements of X and threads in the same column access the same elements of W and α. Thus, the tiling concept can be applied to exploit the shared memory and speed up the computation of H. Moreover, we notice that bCol can be preloaded once and then used efficiently by the other threads.
Algorithm 3 describes how these optimizations
can be applied for the Elman architecture.
Algorithm 3 Opt-PR-ELM by thread (i, j)
1: tx ← threadIdx.x
2: ty ← threadIdx.y
3: Row ← tx + blockIdx.x × blockDim.x
4: Col ← ty + blockIdx.y × blockDim.y
5: Hloc ← Q-dimensional array in the local memory of thread (i, j)
6: for t = 1 → Q do
7:   hij ← 0
8:   for tile = 1 → num_tiles do
9:     Wshared ← W[tx + tile × TW, Col]
10:    Xshared ← X[Row, ty + tile × TW, t]
11:    synch()
12:    hij ← hij + Wshared . Xshared
13:  end for
14:  synch()
15:  if tx = 0 and ty = 0 then
16:    bshared ← b[Col]
17:  end if
18:  synch()
19:  hij ← hij + bshared
20:  for tile = 1 → t/TW do
21:    αshared ← α[Col, tx + tile × TW]
22:    synch()
23:    hij ← hij + αshared . Hloc[tprev]
24:  end for
25:  synch()
26:  Hloc[t] ← hij
27:  H[Row, Col, t] ← hij
28: end for
First, in the dot product W[:,Col] . X[Row,:,t], each thread loads only one element of W and one element of X into the shared memory. Once the threads synchronize, all needed elements of W and X are loaded, and the dot product can be computed efficiently. Second, only one thread loads b[j], which is needed by all the threads in the same column of the block. The same tiling concept used to compute W[:,Col] . X[Row,:,t] can be used to speed up the computation of α[j,tprev] × H(tprev)[Row,Col]. Lastly, each thread can save the values of H(t)[Row,Col] in its register file to reduce the time taken to read from the global memory in line 8 of Algorithm 2. If these values do not fit in the registers, they are read from the global memory.
Algorithms 2 and 3 can easily be extended to other architectures by replacing Equation 6 with Equations 7, 8, 9, 10 or 11.
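The tiling idea of Algorithm 3 can be illustrated in isolation: a long dot product is consumed in TW-wide chunks, where each chunk corresponds to one cooperative shared-memory load by a thread block. The function below is a CPU sketch of that access pattern, not GPU code; the tile width and array names are assumptions.

```python
import numpy as np

def tiled_dot(w, x, TW=16):
    """Dot product computed tile by tile, mirroring the shared-memory
    loop (lines 8-13 of Algorithm 3): each iteration would correspond
    to one cooperative load of a TW-wide tile followed by a partial sum."""
    S = len(w)
    num_tiles = (S + TW - 1) // TW          # ceil(S / TW)
    acc = 0.0
    for tile in range(num_tiles):
        lo, hi = tile * TW, min((tile + 1) * TW, S)
        acc += w[lo:hi] @ x[lo:hi]          # partial product of one tile
    return acc
```

On a GPU, the benefit is that each element of the tile is fetched from global memory once per block instead of once per thread.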
4.2 Computing β
β is the solution of the system Hβ = Y. Instead of computing the pseudoinverse H† and then multiplying it by Y, one can perform a QR factorization of H as H = QR, then compute z = QᵀY. β is then the solution of Rβ = z, obtained by back substitution since R is an upper triangular matrix. In this work, we make use of the Numba [21] and NumPy [26] libraries, which provide an efficient implementation of this method in Python.
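This β step can be sketched directly with NumPy: factor H = QR, form z = QᵀY, and back-substitute in the upper triangular system. The explicit back-substitution loop below stands in for the library routine and assumes H has full column rank.

```python
import numpy as np

def solve_beta_qr(H, Y):
    """Solve H beta = Y in the least-squares sense via QR factorization
    and back substitution, avoiding the explicit pseudoinverse."""
    Qm, R = np.linalg.qr(H)      # H = Q R, R upper triangular (reduced QR)
    z = Qm.T @ Y                 # z = Q^T Y
    M = R.shape[0]
    beta = np.zeros((M, Y.shape[1]))
    for i in range(M - 1, -1, -1):                       # back substitution
        beta[i] = (z[i] - R[i, i + 1:] @ beta[i + 1:]) / R[i, i]
    return beta
```

The result matches the pseudoinverse solution while avoiding the cost and conditioning issues of forming H† explicitly.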
5 Theoretical Analysis
We analyze the memory read and write operations and the floating point operations (FLOPS) of the proposed algorithms, Basic-PR-ELM and Opt-PR-ELM. For the Elman architecture, Basic-PR-ELM performs Q(2S+Q+2) read operations divided as follows:
– 2 × SQ to read the values needed in line 6
– Q reads for bCol in line 7
– 2 × (Q(Q+1)/2) reads in the loop at line 8
Moreover, only Q write operations are needed (in line 11) and Q(2S+Q+2) FLOPS are performed as follows:
– 2 × SQ to perform the dot product at line 6
– Q FLOPS for the addition in line 7
– 2 × (Q(Q+1)/2) to perform the loop at line 8
The memory operations to FLOPS ratio is (2S+Q+3)/(2S+Q+2) > 1, which might limit the performance of Basic-PR-ELM. This ratio improves with Opt-PR-ELM as it minimizes the memory operations while keeping the same number of FLOPS. Specifically, Opt-PR-ELM decreases the number of reads to (1/TW²)(2 × SQ + Q(Q+1)/2) + 1, divided as follows:
Table 2. Number of memory operations and FLOPS for each RNN architecture for Basic-PR-ELM

Architecture | # Read Operations | # Write Operations | FLOPS
Elman | Q(2S+Q+2) | Q | Q(2S+Q+2)
Jordan | Q(2S+1+(Q+1)(1/2+M)) | Q | Q(2S+1+((Q+1)/2)(2SM+M))
NARMAX | Q(2S+1)+2(2F+M+R) | Q | Q(2S+1+2F+R(2+2SM+M))
Fully Connected | Q(2S+1+2MQ) | Q | Q(2S+Q+2QM)
LSTM | Q(5S+13) | 5Q | Q(8S+18)
GRU | Q(4S+8) | 3Q | Q(3S+17)
Figure 3. Speedup of Basic-PR-ELM and Opt-PR-ELM for the different architectures when M=50
Figure 4. Speedup of Opt-PR-ELM for the different architectures when the number of hidden neurons
increases from 5 to 100
Table 3. Benchmarks Description

Category | Name | # of instances | Q | % Train | Mean | Std Dev | Min | Max
Small | Japan population | 2,540 | 10 | 80 | 1.40E+06 | 1.40E+06 | 1.00E+05 | 1.03E+08
Small | Quebec Births | 5,113 | 10 | 80 | 2.51E+02 | 4.19E+01 | -2.31E+01 | 3.66E+02
Small | Exoplanet | 5,657 | 3197 | 80 | -3.01E+02 | 1.45E+04 | -6.43E+05 | 2.11E+05
Medium | SP500 | 17,218 | 10 | 80 | 8.99E+08 | 1.53E+09 | 1.00E+06 | 1.15E+10
Medium | AEMO | 17,520 | 10 | 80 | 7.98E+03 | 1.19E+03 | 5.11E+03 | 1.38E+04
Medium | Hourly weather | 45,300 | 50 | 80 | 2.79E+02 | 3.78E+01 | 0.00E+00 | 3.07E+02
Large | Energy Consumption | 119,000 | 10 | 70 | 1.66E+03 | 3.02E+02 | 0.00E+00 | 3.05E+03
Large | Electricity load | 280,514 | 10 | 70 | 2.70E+14 | 2.60E+14 | 0.00E+00 | 9.90E+14
Large | Stock prices | 619,000 | 50 | 70 | 4.48E+06 | 1.08E+07 | 0.00E+00 | 2.06E+09
Large | Temperature | 998,000 | 50 | 70 | 5.07E+01 | 2.21E+01 | 4.00E+00 | 8.10E+01
– (2/TW²) × SQ to read the values needed in line 12
– at most 1 read for bCol in line 16
– (1/TW²)(Q(Q+1)/2) reads in the loop at line 20
where TW is the tile width, which is set to the block size in this work. The new memory operations to FLOPS ratio is [(1/TW²)(2 × SQ + Q(Q+1)/2) + 1 + Q] / [Q(2S+Q+2)], which is less than the ratio of Basic-PR-ELM by a factor of ≈ TW². Specifically, Opt-PR-ELM reduces the number of read operations by a factor of 256 (resp. 1024) when the tile width is set to 16 (resp. 32).
Table 2 reports the number of memory operations and FLOPS needed by Basic-PR-ELM for each RNN architecture. The values for Opt-PR-ELM are omitted since it requires the same number of write operations and FLOPS, and fewer read operations by a factor of ≈ TW².
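The claimed ≈ TW² reduction can be sanity-checked numerically by plugging representative values of S and Q into the two read counts derived above (the values S = 500, Q = 50 below are arbitrary choices for the check, not from the paper's benchmarks).

```python
# Read counts from the analysis above, for the Elman architecture.
def basic_reads(S, Q):
    # Basic-PR-ELM: Q(2S + Q + 2) global-memory reads.
    return Q * (2 * S + Q + 2)

def opt_reads(S, Q, TW):
    # Opt-PR-ELM: (1/TW^2)(2SQ + Q(Q+1)/2) + 1 reads.
    return (2 * S * Q + Q * (Q + 1) / 2) / TW**2 + 1

S, Q = 500, 50
for TW in (16, 32):
    factor = basic_reads(S, Q) / opt_reads(S, Q, TW)
    print(f"TW={TW}: read reduction factor ~ {factor:.0f} (TW^2 = {TW**2})")
```

The measured factor tracks TW² closely, confirming that the dominant 2SQ term is divided by the squared tile width.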
6 Experimental Setup
6.1 Setup
Serial algorithms were run on an Intel 64-bit Core i7 machine with 16 GB of memory. Parallel algorithms were run on an NVIDIA Tesla K20m GPU with 2688 CUDA cores and a 723 MHz GPU core clock speed. The GPU main memory is 6 GB, with a bandwidth of 250 GB/s between the host and the device. All experiments are repeated 5 times, and the average value is reported.
6.2 Time Series Prediction Benchmarks
Basic-PR-ELM and Opt-PR-ELM were validated on time series prediction problems. Table 3 presents the characteristics of the datasets, ordered by the number of instances. According to their size, we split the databases into three categories: small datasets containing fewer than 10K instances, medium datasets with tens of thousands of instances, and large datasets consisting of hundreds of thousands of instances. Japan population¹ tracks the population of various Japanese regions, while Quebec Births² tracks the number of births in Quebec
¹ kaggle.com/jd1325/japan-population-data
² datamarket.com/data/list/?q=provider%3Atsdl
³ kaggle.com/keplersmachines/kepler-labelled-time-series-data
⁴ kaggle.com/benjibb/sp500-since-1950
⁵ aemo.com.au/
⁶ kaggle.com/selfishgene/historical-hourly-weather-data
Table 4. Average RMSE (± standard deviation) of S-R-ELM and Opt-PR-ELM (BS=32), showing that both algorithms achieve accuracies within the same range for different RNN architectures on all the datasets.

Dataset | Algorithm | Elman | Jordan | NARMAX | Fully Connected | LSTM | GRU
Japan pop. | S-R-ELM | 3.97E-2 ±4.67E-2 | 1.12E-1 ±3.75E-1 | 6.54E-1 ±3.32E-2 | 5.43E-3 ±3.89E-5 | 2.45E-1 ±2.36E-1 | 4.46E-1 ±3.35E-4
Japan pop. | Opt-PR-ELM | 3.74E-2 ±7.17E-8 | 1.23E-1 ±2.89E-2 | 6.23E-1 ±2.31E-2 | 6.23E-3 ±2.65E-4 | 2.46E-1 ±4.56E-2 | 4.75E-2 ±5.81E-5
Quebec Births | S-R-ELM | 4.06E-3 ±7.68E-5 | 1.01E-1 ±5.00E-3 | 3.42E-1 ±5.05E-3 | 2.02E-2 ±4.99E-7 | 1.01E-1 ±5.76E-1 | 1.01E+0 ±5.16E-4
Quebec Births | Opt-PR-ELM | 2.02E-3 ±4.89E-5 | 4.35E-1 ±5.32E-4 | 3.46E-1 ±3.79E-3 | 2.42E-2 ±7.07E-1 | 1.49E-2 ±1.46E-4 | 1.16E+0 ±3.56E-3
Exoplanet | S-R-ELM | 5.40E+0 ±3.03E-1 | 2.87E+0 ±7.91E-3 | 2.01E-1 ±2.98E-3 | 3.46E-1 ±1.01E-2 | 5.45E-1 ±2.31E-1 | 4.32E+0 ±4.56E-1
Exoplanet | Opt-PR-ELM | 5.42E+0 ±3.05E-1 | 2.34E+0 ±7.34E-2 | 2.53E-1 ±1.98E-3 | 3.42E-1 ±1.51E-2 | 3.65E-1 ±2.31E-5 | 5.21E+0 ±3.76E-2
SP500 | S-R-ELM | 1.69E-1 ±7.78E-3 | 1.32E-1 ±3.75E-4 | 9.01E-1 ±8.70E-4 | 1.96E+0 ±4.32E-1 | 1.01E-1 ±5.16E-2 | 7.84E+0 ±5.55E-2
SP500 | Opt-PR-ELM | 2.34E-1 ±7.98E-4 | 4.01E-1 ±6.36E-5 | 9.11E-1 ±8.32E-5 | 1.36E+0 ±1.90E-2 | 1.24E-1 ±3.14E-2 | 7.83E+0 ±5.53E-1
AEMO | S-R-ELM | 1.26E-1 ±1.45E-3 | 3.30E-2 ±7.16E-3 | 9.61E-2 ±8.79E-3 | 5.00E-2 ±1.32E-5 | 1.36E-2 ±5.33E-4 | 2.33E-1 ±2.23E-5
AEMO | Opt-PR-ELM | 1.34E-1 ±1.25E-4 | 1.12E-2 ±5.16E-2 | 3.23E-3 ±1.01E-2 | 5.36E-2 ±1.12E-4 | 1.22E-2 ±5.67E-3 | 2.01E-1 ±2.13E-6
Hourly Weather | S-R-ELM | 1.98E-1 ±5.17E+0 | 3.14E-1 ±2.07E-3 | 8.06E-1 ±7.63E-5 | 7.39E-2 ±6.03E-2 | 2.10E-2 ±2.24E-5 | 3.21E-1 ±9.61E-3
Hourly Weather | Opt-PR-ELM | 1.52E-1 ±3.34E+0 | 3.98E-1 ±5.67E-4 | 2.00E-1 ±7.03E-4 | 3.79E-2 ±5.03E-3 | 1.02E-2 ±2.14E-5 | 4.32E-1 ±9.16E-3
Energy Cons. | S-R-ELM | 1.83E-4 ±1.98E-3 | 2.21E-3 ±3.43E-1 | 2.22E-4 ±5.26E-3 | 3.56E-3 ±5.56E-4 | 1.56E-3 ±9.96E-4 | 2.34E-2 ±2.22E-5
Energy Cons. | Opt-PR-ELM | 1.38E-4 ±2.45E-3 | 3.48E-3 ±3.03E-2 | 6.44E-5 ±5.16E-4 | 2.65E-3 ±5.16E-5 | 2.56E-3 ±5.326E-5 | 3.24E-3 ±2.12E-5
Elec. Load | S-R-ELM | 2.56E+0 ±7.93E+0 | 2.40E+0 ±3.90E-1 | 8.64E+0 ±9.81E+0 | 4.16E-1 ±3.45E-1 | 8.32E+0 ±8.05E+0 | 1.12E+0 ±5.16E-1
Elec. Load | Opt-PR-ELM | 2.34E+0 ±7.03E-1 | 4.76E+0 ±2.20E-2 | 4.86E+0 ±8.91E-1 | 4.64E-1 ±3.97E-2 | 2.84E+0 ±8.13E-1 | 2.98E+0 ±5.06E+0
Stock Prices | S-R-ELM | 6.41E-1 ±7.93E-1 | 1.10E-1 ±9.09E-5 | 4.80E+0 ±3.87E-1 | 2.13E-2 ±3.89E-1 | 4.00E-1 ±1.09E-3 | 2.62E-1 ±3.82E-4
Stock Prices | Opt-PR-ELM | 3.41E-1 ±3.35E-2 | 1.56E-1 ±9.23E-5 | 4.81E+0 ±3.32E-2 | 2.03E-3 ±1.92E-4 | 4.94E-1 ±5.69E-4 | 6.28E-1 ±3.28E-3
Temp. | S-R-ELM | 4.32E-4 ±9.85E-5 | 5.65E-3 ±6.79E-9 | 3.56E-4 ±7.10E-6 | 2.91E-5 ±3.72E-9 | 4.92E-4 ±6.02E-5 | 3.54E-4 ±2.95E-6
Temp. | Opt-PR-ELM | 4.12E-4 ±9.67E-4 | 5.03E-3 ±6.19E-2 | 3.15E-4 ±9.25E-6 | 9.21E-5 ±3.02E-5 | 8.17E-4 ±6.92E-4 | 3.19E-3 ±5.29E-5
Table 5. Speedup of Opt-PR-ELM (BS=32) when tested on the Tesla K20m and Quadro K2000 GPUs for different RNN architectures on various datasets, when the number of hidden neurons M is 20.

Architecture | GPU | Japan pop. | Quebec Births | Exop. | SP500 | AEMO | Hourly weather | Energy cons. | Elec. Load | Stock Prices | Temp.
Elman | Tesla K20m | 12 | 12 | 18 | 26 | 42 | 64 | 163 | 164 | 251 | 261
Elman | Quadro K2000 | 12 | 10 | 16 | 23 | 40 | 61 | 60 | 163 | 239 | 251
Jordan | Tesla K20m | 12 | 13 | 42 | 26 | 42 | 64 | 163 | 165 | 244 | 300
Jordan | Quadro K2000 | 11 | 11 | 39 | 23 | 40 | 60 | 163 | 163 | 189 | 295
NARMAX | Tesla K20m | 13 | 12 | 29 | 29 | 45 | 72 | 167 | 168 | 263 | 281
NARMAX | Quadro K2000 | 11 | 11 | 28 | 26 | 42 | 71 | 162 | 162 | 257 | 273
Fully Connected | Tesla K20m | 17 | 18 | 35 | 36 | 50 | 73 | 198 | 226 | 281 | 326
Fully Connected | Quadro K2000 | 14 | 16 | 33 | 34 | 48 | 71 | 196 | 225 | 279 | 324
LSTM | Tesla K20m | 21 | 21 | 43 | 39 | 50 | 74 | 219 | 201 | 310 | 327
LSTM | Quadro K2000 | 19 | 20 | 41 | 36 | 45 | 70 | 215 | 196 | 307 | 323
GRU | Tesla K20m | 20 | 18 | 46 | 40 | 50 | 67 | 197 | 200 | 309 | 326
GRU | Quadro K2000 | 15 | 14 | 42 | 35 | 47 | 58 | 192 | 187 | 300 | 320
Table 6. Runtime (seconds) of Opt-PR-ELM (BS=32) and the parallel iterative training algorithm (P-BPTT), with ratio = P-BPTT / Opt-PR-ELM.

Dataset | FC Opt-PR-ELM | FC P-BPTT | FC Ratio | LSTM Opt-PR-ELM | LSTM P-BPTT | LSTM Ratio | GRU Opt-PR-ELM | GRU P-BPTT | GRU Ratio
Japan pop. | 0.23 | 3.52 | 15 | 0.38 | 7.41 | 20 | 0.38 | 6.59 | 17
Quebec Births | 0.56 | 6.75 | 12 | 0.85 | 13.56 | 16 | 0.81 | 12.94 | 16
Exoplanet | 10.03 | 24.98 | 2 | 15.23 | 54.32 | 4 | 13.14 | 43.12 | 3
SP500 | 3.56 | 20.66 | 6 | 7.77 | 37.55 | 5 | 5.61 | 35.65 | 6
AEMO | 3.01 | 21.34 | 7 | 7.29 | 38.32 | 5 | 5.62 | 35.71 | 6
Hourly Weather | 30.46 | 156.76 | 5 | 50.49 | 243.99 | 5 | 30.04 | 201.12 | 7
Energy Cons. | 32.14 | 203.45 | 6 | 51.90 | 525.87 | 10 | 45.67 | 435.89 | 10
Elec. Load | 36.70 | 256.89 | 7 | 53.60 | 572.74 | 11 | 51.7 | 532.31 | 10
Stock Prices | 41.30 | 301.23 | 7 | 56.78 | 639.04 | 11 | 52.34 | 621.18 | 12
Temperature | 45.45 | 354.99 | 8 | 62.00 | 678.11 | 11 | 59.32 | 641.09 | 11
and Exoplanet³ describes the change in the light intensity of several thousand stars. Additionally, SP500⁴ records stock prices since 1950, while AEMO⁵ reports the electricity load demand in Australia and hourly weather⁶ contains ≈5 years of temperature measurements. The energy consumption dataset⁷ reports the hourly power consumption data in megawatts, the electricity load dataset⁸ reports the electricity demand at the MT166 and MT257 substations, and the stock prices dataset⁹ consists of historical stock prices for all companies currently on the S&P 500 index. Finally, the temperature dataset¹⁰ reports sensor data collected from a permanent magnet synchronous motor (PMSM) deployed on a test bench, where the PMSM represents a German OEM's prototype model.
7 Experimental Results
7.1 Speedup
Figure 3 illustrates the speedups of Basic-PR-ELM and Opt-PR-ELM for the six architectures, tested against the serial version when the number of hidden neurons M is 50. Opt-PR-ELM was tested with two different configurations: with the number of threads per block (the block size BS) set to 16 and to 32.
Clearly, Basic-PR-ELM and Opt-PR-ELM achieve high speedups, especially as the size of the dataset increases. For instance, for the Elman architecture, Basic-PR-ELM achieves a speedup of 19 on the small Exoplanet dataset, 72 on the medium hourly energy consumption dataset, and up to 207 on the largest dataset (Temperature). Opt-PR-ELM achieves higher speedups, reaching up to 311 with LSTM on the temperature dataset when BS = 16. The speedup increases to 461 when BS increases to 32.
⁷ kaggle.com/selfishgene/historical-hourly-weather-data
⁸ archive.ics.uci.edu/ml/index.php
⁹ kaggle.com/camnugent/sandp500
¹⁰ kaggle.com/wkirgsn/electric-motor-temperature
AN OPTIMIZED PARALLEL IMPLEMENTATION OF .. .
Table 6. Runtime (seconds) of Opt-PR-ELM (BS=32) and the iterative training algorithm and the
ratio= BP
Opt-PR-ELM
Fully Connected LSTM GRU
Opt-
PR-
ELM
P-
BPTT
Ratio Opt-
PR-
ELM
P-
BPTT
Ratio Opt-
PR-
ELM
P-
BPTT
Ratio
Japan pop. 0.23 3.52 15 0.38 7.41 20 0.38 6.59 17
Quebec
Births
0.56 6.75 12 0.85 13.56 16 0.81 12.94 16
Exoplanet 10.03 24.98 2 15.23 54.32 4 13.14 43.12 3
SP500 3.56 20.66 6 7.77 37.55 5 5.61 35.65 6
AEMO 3.01 21.34 7 7.29 38.32 5 5.62 35.71 6
Hourly
Weather
30.46 156.76 5 50.49 243.99 5 30.04 201.12 7
Energy
Cons.
32.14 203.45 6 51.90 525.87 10 45.67 435.89 10
Elec. Load 36.70 256.89 7 53.60 572.74 11 51.7 532.31 10
Stock Prices 41.30 301.23 7 56.78 639.04 11 52.34 621.18 12
Temperature 45.45 354.99 8 62.00 678.11 11 59.32 641.09 11
and Exoplanet3describes the change in the light
intensity of several thousand stars. Additionally,
SP 5004records the stock prices since 1950 while
AEMO5reports the electricity load demand in Aus-
tralia and hourly weather6contains ≈5 years of
temperature measures. The energy consumption
dataset7reports the hourly power consumption data
in megawatts, the electricity load dataset8reports
the electricity demand at the MT166 and MT257
substations and the stock prices dataset9consists of
historical stock prices for all companies currently
on the S&P 500 index. Finally, the temperature
dataset10 reports sensor data collected from a per-
manent magnet synchronous motor (PMSM) de-
ployed on a testbench where PMSM represents a
german OEM’s prototype model.
7 Experimental Results
7.1 Speedup
Figure 3 illustrates the speedups of Basic-PR-
ELM and Opt-PR-ELM for the six architectures
tested against the serial version when the number of
hidden neurons Mis 50. Opt-PR-ELM was tested
with two different configurations: when the number
of threads per block, block size BS, is 16 and 32,
respectively.
Clearly, Basic-PR-ELM and Opt-PR-
ELM achieve high speedups, especially when the
size of the dataset increases. For instance, for
the Elman architecture, Basic-PR-ELM achieves
a speedup of 19 on the small Exoplanet dataset, 72
on the hourly energy consumption medium dataset,
and up to 207 on the largest dataset (Temperature).
Opt-PR-ELM achieves higher speedups that reach
up to 311 with LSTM on the temperature dataset
when BS =16. The speedup increases to 461 when
BS increases to 32.
7kaggle.com/selfishgene/historical-hourly-weather-data
8archive.ics.uci.edu/ ml/index.php
9kaggle.com/camnugent/sandp500
10kaggle.com/wkirgsn/electric-motor-temperature
AN OPTIMIZED PARALLEL IMPLEMENTATION OF .. .
Table 6. Runtime (seconds) of Opt-PR-ELM (BS=32) and the iterative training algorithm and the
ratio= BP
Opt-PR-ELM
Fully Connected LSTM GRU
Opt-
PR-
ELM
P-
BPTT
Ratio Opt-
PR-
ELM
P-
BPTT
Ratio Opt-
PR-
ELM
P-
BPTT
Ratio
Japan pop. 0.23 3.52 15 0.38 7.41 20 0.38 6.59 17
Quebec
Births
0.56 6.75 12 0.85 13.56 16 0.81 12.94 16
Exoplanet 10.03 24.98 2 15.23 54.32 4 13.14 43.12 3
SP500 3.56 20.66 6 7.77 37.55 5 5.61 35.65 6
AEMO 3.01 21.34 7 7.29 38.32 5 5.62 35.71 6
Hourly
Weather
30.46 156.76 5 50.49 243.99 5 30.04 201.12 7
Energy
Cons.
32.14 203.45 6 51.90 525.87 10 45.67 435.89 10
Elec. Load 36.70 256.89 7 53.60 572.74 11 51.7 532.31 10
Stock Prices 41.30 301.23 7 56.78 639.04 11 52.34 621.18 12
Temperature 45.45 354.99 8 62.00 678.11 11 59.32 641.09 11
and Exoplanet3describes the change in the light
intensity of several thousand stars. Additionally,
SP 5004records the stock prices since 1950 while
AEMO5reports the electricity load demand in Aus-
tralia and hourly weather6contains ≈5 years of
temperature measures. The energy consumption
dataset7reports the hourly power consumption data
in megawatts, the electricity load dataset8reports
the electricity demand at the MT166 and MT257
substations and the stock prices dataset9consists of
historical stock prices for all companies currently
on the S&P 500 index. Finally, the temperature
dataset10 reports sensor data collected from a per-
manent magnet synchronous motor (PMSM) de-
ployed on a testbench where PMSM represents a
german OEM’s prototype model.
7 Experimental Results
7.1 Speedup
Figure 3 illustrates the speedups of Basic-PR-ELM and Opt-PR-ELM for the six architectures tested against the serial version when the number of hidden neurons M is 50. Opt-PR-ELM was tested with two different configurations: a number of threads per block (block size BS) of 16 and 32.
Clearly, Basic-PR-ELM and Opt-PR-
ELM achieve high speedups, especially when the
size of the dataset increases. For instance, for
the Elman architecture, Basic-PR-ELM achieves
a speedup of 19 on the small Exoplanet dataset, 72
on the hourly energy consumption medium dataset,
and up to 207 on the largest dataset (Temperature).
Opt-PR-ELM achieves higher speedups that reach up to 311 with LSTM on the temperature dataset when BS = 16. The speedup increases to 461 when BS increases to 32.
^7 kaggle.com/selfishgene/historical-hourly-weather-data
^8 archive.ics.uci.edu/ml/index.php
^9 kaggle.com/camnugent/sandp500
^10 kaggle.com/wkirgsn/electric-motor-temperature
46 Julia El Zini, Yara Rizk and Mariette Awad
However, Opt-PR-ELM does not always
achieve higher speedups. Specifically, Basic-PR-
ELM and Opt-PR-ELM achieve similar speedups
for the Japan population, Quebec births, SP500,
AEMO, energy consumption, and the electricity
load datasets. To investigate these results, we take
a closer look at the characteristics of the datasets.
When Q = 10, a thread computes the dot product between a row of X and a column of W, performing 2 × 10 memory read operations. Consequently, num_tiles will be only 1, and the loop at line 8 of Alg. 3 will be executed only once. In this case, the performance does not improve and might even slightly decrease due to the thread synchronization in Opt-PR-ELM. However, Opt-PR-ELM achieves higher speedups when Q > BS and when BS increases to
32. We notice that the speedup increases with more
complex architectures, LSTM for example, since
these architectures require more computations that
can be better accelerated on a GPU.
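This tile arithmetic can be made explicit: assuming, as the discussion suggests, that each shared-memory tile covers BS elements of the inner dimension Q, the tiled loop runs num_tiles = ceil(Q / BS) times. A small illustrative sketch of this count (not the paper's Alg. 3):

```python
import math

def num_tiles(Q, BS):
    """Number of shared-memory tiles needed to cover the inner dimension Q,
    assuming each tile spans BS elements (tile width = block size)."""
    return math.ceil(Q / BS)

# Q = 10 fits inside a single 16- or 32-wide tile: the tiled loop body runs
# once, so shared-memory reuse buys nothing and the synchronization between
# tile loads is pure overhead.
assert num_tiles(10, 16) == 1 and num_tiles(10, 32) == 1
# Only when Q > BS does the loop iterate and data reuse start to pay off,
# e.g. for Exoplanet, where Q = 5657.
assert num_tiles(5657, 32) == 177
```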
7.2 Scalability
To test the scalability of our approach, we
change the number of hidden neurons M, and we
report the speedup of Opt-PR-ELM (BS=32) for the
different architectures on the various datasets. Figure 4 illustrates that the speedup increases as M increases from 5 to 10, 20, 50, and 100. Specifically, the speedup increases by a factor of 20 when M increases from 5 to 100 with a GRU on the energy consumption dataset. Thus, Opt-PR-ELM scales up
well with more computationally expensive opera-
tions.
7.3 Robustness
Robustness, i.e. repeatability, is a key prop-
erty for Opt-PR-ELM where random initialization
might affect the solution. Moreover, floating-point
computations might differ between the GPU and the
CPU, which might affect the output. To ensure that
such perturbations do not affect the performance of
our parallel algorithm, we run S-R-ELM and Opt-
PR-ELM (BS=32) five times, and we measure their
root mean squared error (RMSE). Table 4 reports
the average RMSE and its standard deviation when
S-R-ELM and Opt-PR-ELM are tested on different
datasets with different RNN architectures. We select M according to the size of the problem; i.e., we used M = 100 for Exoplanet, where Q = 5657; M = 20 for hourly weather, stock prices, and temperature, where Q = 50; and M = 10 for the rest of the datasets, which have Q = 10. Tables 3 and 4 show
that the cases where the RMSE is high correspond
to datasets with large outputs. For instance, having outputs ranging from 0 to 2.06 × 10^9, the electricity load dataset has a higher RMSE than other datasets.
However, S-R-ELM and Opt-PR-ELM achieve ac-
curacies in the same range for different RNN archi-
tectures on all the datasets, which means that GPU
floating-point operations do not have a clear effect
on the performance of our algorithm.
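This repeatability protocol, re-seeding the random weights and summarizing the RMSE over five runs, can be sketched as follows; `train_and_predict` is a hypothetical stand-in for S-R-ELM or Opt-PR-ELM, not the paper's code:

```python
import numpy as np

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def repeatability(train_and_predict, y_true, runs=5):
    """Train with `runs` different random seeds and summarize the RMSE."""
    errors = [rmse(y_true, train_and_predict(seed)) for seed in range(runs)]
    return float(np.mean(errors)), float(np.std(errors))

# toy stand-in for the model: predictions = truth + seeded random perturbation
y = np.linspace(0.0, 1.0, 200)
fake_model = lambda seed: y + np.random.default_rng(seed).normal(0.0, 0.01, y.shape)
mean_err, std_err = repeatability(fake_model, y)
# a small standard deviation across runs indicates the result is repeatable
assert 0.0 < mean_err < 0.05 and std_err < 0.01
```

A low standard deviation across seeds, as in Table 4, is what justifies calling the algorithm robust.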
7.4 Portability
To verify that our algorithm is portable, we
ran Opt-PR-ELM (BS=32) on an NVIDIA Quadro
K2000 GPU while fixing the number of hidden
nodes M at 20. It is important to check for portability to understand how much the proposed algorithm is architecture-dependent. Table 5 shows
that Opt-PR-ELM also achieves high speedups on
the Quadro K2000 GPUs for different RNN archi-
tectures on different datasets, but the speedups on
the Tesla K20m GPU are consistently higher because of the greater computational capability of the latter. The
speedups in Table 5 are reported with respect to the Core i7 CPU with 16 GB of RAM. Speedups with respect to sequential code on an older-generation CPU (Core i5 with 8 GB of RAM) are up to 5 times higher. One can draw the following conclusion: increasing the number of cores of a CPU yields a speedup of at most 5 times, whereas parallelizing the code can yield a speedup of up to 326 with respect to sequential code on Core i7 CPUs and 651 on Core i5 CPUs. A rough estimate of current pricing based on a Google search shows that GPU architectures cost between $500 and $7,000 for the NVIDIA GTX 1080 and Tesla GPUs^11, respectively, while CPU architectures such as the Intel Core i7-9700K with 8 cores cost $400^12. Considering
the aforementioned speedups, one can conclude that
investing in parallel architectures can be more prof-
itable than upgrading the existing CPU architecture,
especially in applications where real-time perfor-
^11 https://www.amazon.com/PNY-TCSV100MPCIE-PB-Nvidia-Tesla-v100/dp/B076P84525
^12 https://www.amazon.com/CPU-Processors-Memory-Computer-Add-Ons/b?ie=UTF8&node=229189
mance and cost efficiency are essential, such as general IoT applications.
7.5 Comparison with Parallel Iterative
RNN Training
Although Opt-PR-ELM achieves high speedups compared to its sequential counterpart S-R-ELM, we need to show that its absolute training time is lower than that of the parallel version of BPTT (P-BPTT) as implemented in [11].
We choose the architectures that [11] implements,
i.e. fully connected, LSTM and GRU, and we re-
port the training time of Opt-PR-ELM (BS=32) and
P-BPTT when M = 10. P-BPTT is trained for 10 epochs with a batch size of 64, mean squared error (MSE) as the loss function, and Adam as the optimizer.
We are interested in the absolute training times of
the two parallel algorithms rather than their speedup
over their sequential versions. Thus, we report the
runtimes of Opt-PR-ELM and P-BPTT algorithms
when tested on the same Tesla K20m GPU and
the ratio between both training times. As Table 6
shows, Opt-PR-ELM runs up to 10x faster than P-
BPTT when tested with LSTM on the energy con-
sumption dataset. Figure 5 illustrates the MSE ver-
sus time for P-BPTT algorithms when tested with
LSTM on the energy consumption dataset with M = 50. For the same dataset and RNN architecture, Opt-PR-ELM reaches an MSE of 2.56 × 10^-3, whereas P-BPTT reaches a lower MSE of 1.4 × 10^-3. However, Opt-PR-ELM took only 57 sec to reach its optimal MSE, whereas P-BPTT took 525 sec to reach its optimal MSE and 340 sec to reach the same MSE (1.1 × 10^-3).
Figure 5. MSE versus time (sec) for P-BPTT
algorithms when tested on the energy consumption
dataset with M=50 and LSTM as architecture
Thus, Opt-PR-ELM could reach the same per-
formance as P-BPTT 6 times faster. The sequential
nature of iterative training explains the results: al-
though one can attempt to parallelize each epoch,
the training needs to be done in a sequence of con-
secutive dependent epochs.
7.6 Opt-PR-ELM Runtime
One can argue that using memory streams or
initializing the random weights on the GPU can
lead to higher speedups. To investigate this, we
study how the runtime of Opt-PR-ELM is decom-
posed between the parameters initialization, data
transfer to and from the GPU and the actual com-
putations for the six architectures. Figure 6 shows what portion of the Opt-PR-ELM runtime each step takes when tested on the energy consumption dataset with M = 50. The initialization
does not appear on the bar because it is less than
0.01% of the total runtime. Moreover, transferring data to the GPU consistently takes more time than the transfer back because the former deals with the following matrices: X ∈ R^(n×S×Q), Y ∈ R^n, W ∈ R^(S×L), α ∈ R^(L×Q), and b ∈ R^L, while the latter only transfers β ∈ R^L. The steps that take the major portion of the time are the computations of H and β. One can conclude that data streams or GPU random initialization will not affect the speedup, since initialization and data transfer are not a bottleneck in Opt-PR-ELM.
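The two dominant steps, computing H and solving for β, amount to a single forward pass through randomly initialized weights followed by a QR-based least-squares solve, with no back-propagation. The CPU sketch below uses an Elman-style recurrence for illustration only; the six architectures each define H differently, and the names and scalings here are assumptions, not the paper's implementation:

```python
import numpy as np

def elm_rnn_train(X, Y, L, seed=0):
    """Non-iterative RNN training sketch: random forward pass to build H,
    then beta from a QR least-squares solve (no back-propagation)."""
    n, S, Q = X.shape
    rng = np.random.default_rng(seed)
    alpha = rng.standard_normal((L, Q)) / np.sqrt(Q)  # random input weights, never trained
    W_rec = rng.standard_normal((L, L)) / np.sqrt(L)  # random recurrent weights, never trained
    b = rng.standard_normal(L)

    H = np.zeros((n, L))                  # hidden representations: the costly "H" step
    for i in range(n):
        h = np.zeros(L)
        for t in range(S):
            h = np.tanh(alpha @ X[i, t] + W_rec @ h + b)
        H[i] = h

    Qf, Rf = np.linalg.qr(H)              # the "beta" step: least squares via QR
    beta = np.linalg.solve(Rf, Qf.T @ Y)
    return beta

# toy run: n=64 sequences of length S=8 with Q=10 features, L=20 hidden neurons
rng = np.random.default_rng(1)
X = rng.standard_normal((64, 8, 10))
Y = X[:, -1, 0]                           # predict the last value of feature 0
beta = elm_rnn_train(X, Y, L=20)
assert beta.shape == (20,)
```

With β in hand, prediction repeats the same forward pass and applies H_new @ β, which is why only H and β dominate the runtime decomposition above.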
Figure 6. Time decomposition of Opt-PR-ELM on
the energy consumption dataset with M=50
8 Conclusion
In this work, we proposed Opt-PR-ELM, a par-
allel version of non-iteratively trained RNNs for
time series prediction. Focusing on six RNN ar-
chitectures: Elman, Jordan, NARMAX, fully con-
nected RNN, LSTM and GRU, we first developed
a basic version of the parallel algorithm. Then,
we studied its memory access patterns to propose
an optimized version that takes advantage of the
shared memory of the GPU. In addition to perform-
ing a theoretical, computational analysis of Opt-PR-
ELM on the various architectures, empirical valida-
tion was performed on 10 publicly available time
series prediction datasets.
Opt-PR-ELM was shown to achieve a speedup
of up to 461 over its sequential version and requires
less time to train than the parallel BPTT by a fac-
tor of 20. Higher speedups are achieved relative to older-generation CPUs, which highlights the importance of investing in high-end parallel architectures, especially in IoT and machine learning applications that require accurate, cost-sensitive, yet efficient solutions.
We further studied the portability and scala-
bility of our proposed algorithm by changing the
GPU architecture and the number of hidden neurons
and reporting the speedup. Opt-PR-ELM showed
higher speedups when the number of computations
increases or the number of launched threads per
block increases. Finally, Opt-PR-ELM was shown
to reach similar accuracies as its sequential version.
Future work includes extending Opt-PR-
ELM to RNNs with multiple layers and investi-
gating its performance on applications that have
multi-dimensional outputs such as machine transla-
tion and speech recognition.
Acknowledgment
This work was supported by the University Re-
search Board at the American University of Beirut.
References
[1] Yoshua Bengio, Patrice Simard, Paolo Frasconi,
et al. Learning long-term dependencies with gradi-
ent descent is difficult. IEEE transactions on neural
networks, 5(2):157–166, 1994.
[2] Stephen A Billings. Nonlinear system identifica-
tion: NARMAX methods in the time, frequency,
and spatio-temporal domains. John Wiley & Sons,
2013.
[3] Armando Blanco, Miguel Delgado, and Maria C
Pegalajar. A real-coded genetic algorithm for train-
ing recurrent neural networks. Neural networks,
14(1):93–105, 2001.
[4] Kyunghyun Cho, Bart Van Merriënboer, Dzmitry
Bahdanau, and Yoshua Bengio. On the properties
of neural machine translation: Encoder-decoder
approaches. arXiv preprint arXiv:1409.1259, 2014.
[5] Kyunghyun Cho, Bart Van Merriënboer, Caglar
Gulcehre, Dzmitry Bahdanau, Fethi Bougares,
Holger Schwenk, and Yoshua Bengio. Learning
phrase representations using RNN encoder-decoder
for statistical machine translation. arXiv preprint
arXiv:1406.1078, 2014.
[6] Jan K Chorowski, Dzmitry Bahdanau, Dmitriy
Serdyuk, Kyunghyun Cho, and Yoshua Bengio.
Attention-based models for speech recognition. In
Advances in neural information processing sys-
tems, pages 577–585, 2015.
[7] Junyoung Chung, Caglar Gulcehre, KyungHyun
Cho, and Yoshua Bengio. Empirical evaluation of
gated recurrent neural networks on sequence mod-
eling. arXiv preprint arXiv:1412.3555, 2014.
[8] Jerome T Connor, R Douglas Martin, and Les E
Atlas. Recurrent neural networks and robust time
series prediction. IEEE transactions on neural net-
works, 5(2):240–254, 1994.
[9] Jeffrey L Elman. Finding structure in time. Cogni-
tive science, 14(2):179–211, 1990.
[10] Ömer Faruk Ertugrul. Forecasting electricity load
by a novel recurrent extreme learning machines ap-
proach. International Journal of Electrical Power &
Energy Systems, 78:429–435, 2016.
[11] Martín Abadi et al. TensorFlow: Large-scale ma-
chine learning on heterogeneous systems, 2015.
Software available from tensorflow.org.
[12] Alex Graves, Navdeep Jaitly, and Abdel-rahman
Mohamed. Hybrid speech recognition with deep bidirectional LSTM. In 2013 IEEE workshop on
automatic speech recognition and understanding,
pages 273–278. IEEE, 2013.
[13] Alex Graves, Abdel-rahman Mohamed, and Geof-
frey Hinton. Speech recognition with deep recur-
rent neural networks. In 2013 IEEE international
conference on acoustics, speech and signal pro-
cessing, pages 6645–6649. IEEE, 2013.
[14] Qing He, Tianfeng Shang, Fuzhen Zhuang, and
Zhongzhi Shi. Parallel extreme learning machine
for regression based on mapreduce. Neurocomput-
ing, 102:52–58, 2013.
[15] Sepp Hochreiter and Jürgen Schmidhuber.
Long short-term memory. Neural computation,
9(8):1735–1780, 1997.
[16] Guang-Bin Huang, Qin-Yu Zhu, Chee-Kheong
Siew, et al. Extreme learning machine: a new learn-
ing scheme of feedforward neural networks. Neural
networks, 2:985–990, 2004.
[17] Shan Huang, Botao Wang, Junhao Qiu, Jitao Yao,
Guoren Wang, and Ge Yu. Parallel ensemble of on-
line sequential extreme learning machine based on
mapreduce. Neurocomputing, 174:352–367, 2016.
[18] Weikuan Jia, Dean Zhao, Yuanjie Zheng, and Su-
juan Hou. A novel optimized ga–elman neural net-
work algorithm. Neural Computing and Applica-
tions, 31(2):449–459, 2019.
[19] Michael I Jordan. Serial order: A parallel dis-
tributed processing approach. In Advances in psy-
chology, volume 121, pages 471–495. Elsevier,
1997.
[20] Viacheslav Khomenko, Oleg Shyshkov, Olga
Radyvonenko, and Kostiantyn Bokhan. Acceler-
ating recurrent neural network training using se-
quence bucketing and multi-gpu data paralleliza-
tion. In IEEE First International Conference on
Data Stream Mining & Processing, pages 100–103.
IEEE, 2016.
[21] Siu Kwan Lam, Antoine Pitrou, and Stanley Seib-
ert. Numba: an LLVM-based Python JIT compiler. In
Proceedings of the second Workshop on the LLVM
Compiler Infrastructure in HPC, pages 1–6. ACM,
2015.
[22] Yann LeCun, Yoshua Bengio, and Geoffrey Hin-
ton. Deep learning. Nature, 521(7553):436, 2015.
[23] Jun Liu, Amir Shahroudy, Dong Xu, and Gang
Wang. Spatio-temporal LSTM with trust gates for 3D
human action recognition. In European Conference
on Computer Vision, pages 816–833. Springer,
2016.
[24] Jun Liu, Gang Wang, Ling-Yu Duan, Kamila Ab-
diyeva, and Alex C Kot. Skeleton-based human ac-
tion recognition with global context-aware atten-
tion LSTM networks. IEEE Transactions on Image
Processing, 27(4):1586–1599, 2017.
[25] James Martens and Ilya Sutskever. Learning recur-
rent neural networks with hessian-free optimiza-
tion. In Proceedings of the 28th International Con-
ference on Machine Learning (ICML-11), pages
1033–1040. Citeseer, 2011.
[26] Travis Oliphant. Guide to NumPy. 01 2006.
[27] Peng Ouyang, Shouyi Yin, and Shaojun Wei. A fast
and power-efficient architecture to parallelize LSTM-based RNN for cognitive intelligence applications. In
Proceedings of the 54th Annual Design Automa-
tion Conference 2017, pages 1–6. ACM, 2017.
[28] Yoh-Han Pao, Gwang-Hoon Park, and Dejan J
Sobajic. Learning and generalization characteris-
tics of the random vector functional-link net. Neu-
rocomputing, 6(2):163–180, 1994.
[29] Jin-Man Park and Jong-Hwan Kim. Online re-
current extreme learning machine and its applica-
tion to time-series prediction. In 2017 International
Joint Conference on Neural Networks (IJCNN),
pages 1983–1990. IEEE, 2017.
[30] Yara Rizk and Mariette Awad. On extreme learn-
ing machines in sequential and time series predic-
tion: A non-iterative and approximate training al-
gorithm for recurrent neural networks. Neurocom-
puting, 325:1–19, 2019.
[31] Jürgen Schmidhuber. Deep learning in neural net-
works: An overview. Neural networks, 61:85–117,
2015.
[32] Wouter F Schmidt, Martin A Kraaijveld, and
Robert PW Duin. Feedforward neural networks
with random weights. In 11th IAPR International
Conference on Pattern Recognition. Vol. II. Con-
ference B: Pattern Recognition Methodology and
Systems, pages 1–4. IEEE, 1992.
[33] Xavier Sierra-Canto, Francisco Madera-Ramirez,
and Victor Uc-Cetina. Parallel training of a back-
propagation neural network using CUDA. In 2010
Ninth International Conference on Machine Learn-
ing and Applications, pages 307–312. IEEE, 2010.
[34] Zhiyuan Tang, Ying Shi, Dong Wang, Yang Feng,
and Shiyue Zhang. Memory visualization for gated
recurrent neural networks in speech recognition. In
2017 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP), pages
2736–2740. IEEE, 2017.
[35] Hubert AB Te Braake and Gerrit Van Straten. Ran-
dom activation weight neural net (rawn) for fast
non-iterative training. Engineering Applications of
Artificial Intelligence, 8(1):71–80, 1995.
[36] Mark Van Heeswijk, Yoan Miche, Erkki Oja,
and Amaury Lendasse. Gpu-accelerated and par-
allelized elm ensembles for large-scale regression.
Neurocomputing, 74(16):2430–2437, 2011.
[37] Botao Wang, Shan Huang, Junhao Qiu, Yu Liu, and
Guoren Wang. Parallel online sequential extreme
learning machine based on mapreduce. Neurocom-
puting, 149:224–232, 2015.
[38] Shang Wang, Yifan Bai, and Gennady Pekhi-
menko. Scaling back-propagation by parallel scan
algorithm. arXiv preprint arXiv:1907.10134, 2019.
[39] Xiaoyu Wang and Yong Huang. Convergence study
in extended kalman filter-based training of recur-
rent neural networks. IEEE Transactions on Neural
Networks, 22(4):588–600, 2011.
[40] Paul J Werbos et al. Backpropagation through time:
what it does and how to do it. Proceedings of the
IEEE, 78(10):1550–1560, 1990.
[41] Ronald J Williams and David Zipser. Gradient-based learning algorithms for recurrent networks and their computational complexity. Backpropagation: Theory, architectures, and applications, 433, 1995.
[42] Yonghui Wu, Mike Schuster, Zhifeng Chen,
Quoc V Le, Mohammad Norouzi, Wolfgang
Macherey, Maxim Krikun, Yuan Cao, Qin Gao,
Klaus Macherey, et al. Google’s neural machine
translation system: Bridging the gap between
human and machine translation. arXiv preprint
arXiv:1609.08144, 2016.
[43] Feng Zhang, Jidong Zhai, Marc Snir, Hai Jin, Hi-
ronori Kasahara, and Mateo Valero. Guest edito-
rial: Special issue on network and parallel com-
puting for emerging architectures and applications,
2019.
[44] Shunlu Zhang, Pavan Gunupudi, and Qi-Jun
Zhang. Parallel back-propagation neural network
training technique using cuda on multiple gpus. In
IEEE MTT-S International Conference on Numer-
ical Electromagnetic and Multiphysics Modeling
and Optimization, pages 1–3. IEEE, 2015.
Julia El Zini is a Ph.D. student en-
rolled in the electrical and computer
engineering department at the Ameri-
can University of Beirut (AUB). She
has received her B.S. and M.S. in com-
puter science from AUB, Lebanon,
in 2015 and 2017, respectively. Her
research interests include distributed
optimization, parallel computing, reinforcement learning, multi-task and transfer learning, and scal-
able machine learning applications.
Yara Rizk obtained her Ph.D. in Elec-
trical and Computer Engineering from
the American University of Beirut
(AUB) in 2018. Prior, she received her
BE in Computer and Communication
Engineering from AUB, Lebanon, in
2012. Her research interests span robotics, multi-agent systems, machine learning, classification, clustering, and artificial intelligence. Rizk attended a technical internship (2013-2014) at Intel in Hillsboro, Oregon, USA, and is an active researcher with multiple peer-reviewed publications.
Mariette Awad obtained her Ph.D. in
Electrical Engineering from the Uni-
versity of Vermont (2007). Her cur-
rent research focuses on HMI, efficient artificial intelligence, applied machine learning, and the Internet of Things. Dr. Awad has received more than 25 grants to support her research, including 2 multidisciplinary multi-million dollar grants from the Qatar National Research Fund (QNRF) and Intel. Her work culminated in a book, Efficient Machine Learning, in 2015, as well as more than 100 conference, book chapter, and journal publications. Prior to her academic position, she was with IBM System and Technology group in Vermont for six years, where her technical leadership and innovative spirit earned her management recognition twice, two business awards, and 10 patents.