JAISCR, 2021, Vol. 11, No. 1, pp. 33 – 50
10.2478/jaiscr-2021-0003
AN OPTIMIZED PARALLEL IMPLEMENTATION OF
NON-ITERATIVELY TRAINED RECURRENT NEURAL
NETWORKS
Julia El Zini, Yara Rizk and Mariette Awad
Department of Electrical and Computer Engineering
American University of Beirut
E-mail: {jwe04,yar01,mariette.awad}@aub.edu.lb
Submitted: 7th May 2020; Accepted: 14th September 2020
Abstract
Recurrent neural networks (RNN) have been successfully applied to various sequential
decision-making tasks, natural language processing applications, and time-series predic-
tions. Such networks are usually trained through back-propagation through time (BPTT)
which is prohibitively expensive, especially when the length of the time dependencies
and the number of hidden neurons increase. To reduce the training time, extreme learning
machines (ELMs) have been recently applied to RNN training, reaching a 99% speedup
on some applications. Due to its non-iterative nature, ELM training, when parallelized,
has the potential to reach higher speedups than BPTT.
In this work, we present Opt-PR-ELM, an optimized parallel RNN training algorithm
based on ELM that takes advantage of the GPU shared memory and of parallel QR fac-
torization algorithms to efficiently reach optimal solutions. The theoretical analysis of the
proposed algorithm is presented on six RNN architectures, including LSTM and GRU,
and its performance is empirically tested on ten time-series prediction applications. Opt-
PR-ELM is shown to reach up to 461 times speedup over its sequential counterpart and
to require up to 20 times less time to train than parallel BPTT. Such high speedups over
new-generation CPUs are crucial in real-time applications and IoT environments.
Keywords: GPU implementation, parallelization, Recurrent Neural Network (RNN),
Long Short-Term Memory (LSTM), Gated Recurrent Unit (GRU), Extreme Learning Ma-
chines (ELM), non-iterative training
1 Introduction
Recurrent neural networks (RNN) are a type
of neural networks that have been successfully ap-
plied to many problems in machine learning [22].
They have proven their ability to exceed human
performance in time series prediction and sequen-
tial decision-making [31]. RNNs’ training is usu-
ally based on gradient descent methods, specifically
back-propagation through time (BPTT) [40], and
real-time recurrent learning [41] which require a
substantial amount of iterations before converging.
Moreover, when unfolded through time, RNNs be-
come even deeper [1] and their training becomes
even more expensive since the number of learned
weights grows exponentially with the number of
hidden neurons and the length of time dependency.
Non-iterative training algorithms have been in-
vestigated in the literature [32, 1, 35] to reduce
the training cost of neural networks. Recently, Er-
tugrul et al. [10] proposed a non-iterative train-
ing algorithm for Jordan RNNs [19]. Then, Rizk
et al. [30] extended it to different RNN archi-
tectures, including Elman, fully connected RNN,
and Long Short-Term Memory (LSTM). Their al-
gorithm was tested on time-series and sequential
decision-making problems and achieved a speedup
of up to 99% over iterative training.
Although they only need one iteration to obtain
near-optimal solutions, non-iterative training algo-
rithms minimize their cost function by computing a
Moore-Penrose pseudo-inverse which requires am-
ple computational resources, especially for large
matrices. To the best of our knowledge, no attempts
have been made in the literature to parallelize non-
iterative training algorithms for RNNs. Fortunately,
such algorithms hold great potential for paralleliza-
tion due to their non-sequential nature.
In this work, we propose Basic-PR-ELM, a
basic parallel version of ELM training applied
on six RNN architectures: Elman, Jordan, NAR-
MAX, fully connected, LSTM, and GRU. Basic-
PR-ELM relies on parallel QR factorization to solve
the pseudo-inverse required in ELM training algo-
rithms. Then, the memory access patterns were
studied and led to Opt-PR-ELM, an optimizedver-
sion of parallel ELM training that utilizes the GPU
shared memory to speedup the training process fur-
ther.
The proposed algorithms, Basic-PR-ELM and
Opt-PR-ELM, are tested on 10 publicly available
time-series prediction applications and on different
GPU architectures to empirically show their scal-
ability, robustness, portability, speedup potentials,
and energy efficiency. Compared to the sequential
version proposed by Rizk et al. in [30], Basic-PR-
ELM and Opt-PR-ELM achieve speedups of up to 311 and 461, respectively, on the LSTM architecture. Notably, Opt-PR-ELM is shown to train
LSTM networks 20 times faster than the parallel it-
erative training algorithms (BPTT).
The rest of the paper is organized as follows:
Section 2 presents the background on ELM-training
and the RNN architectures. Section 3 summa-
rizes the related work on RNN training and the
parallel training algorithms. Section 4 presents
the proposed algorithms Basic-PR-ELM and Opt-
PR-ELM and Section 5 theoretically analyzes their
memory and floating-point operations. Then, Section 6 discusses the experimental setup and Sec-
tion 7 reports the empirical results. Finally, Sec-
tion 8 concludes with final remarks.
2 Background
2.1 Extreme Learning Machine
Extreme Learning Machine (ELM) is a non-iterative training algorithm introduced by Huang et al. [16] for single hidden layer feedforward neural networks (SLFNs). Given $n$ arbitrary distinct training samples $(x_j, y_j)$ where $x_j \in \mathbb{R}^m$ and $y_j \in \mathbb{R}$, $M$ hidden nodes and $g$ as activation function, the predicted output $O_j$ can be written as $O_j = \sum_{i=1}^{M} \beta_i \, g(w_i^T x_j + b_i)$, where $w_i \in \mathbb{R}^m$ is the weight vector connecting the $i$th hidden node and the input nodes, $\beta \in \mathbb{R}^M$ is the weight vector connecting all the hidden nodes and the output node, and $b_i$ is the bias of the $i$th hidden node. Throughout the training, the input weights $w_i$ are randomly generated and fixed, and the output weights $\beta_1, \ldots, \beta_M$ are analytically computed. The goal is to minimize the error between the predicted and the true output as

$$\min_{\beta} \sum_{j=1}^{n} \left\| O_j - t_j \right\|^2 = \min_{\beta} \sum_{j=1}^{n} \left\| \sum_{i=1}^{M} \beta_i \, g(w_i^T x_j + b_i) - t_j \right\|^2. \quad (1)$$

Defining $H$ and $T$ as

$$H_{(n \times M)} = \begin{bmatrix} g(w_1^T x_1 + b_1) & \cdots & g(w_M^T x_1 + b_M) \\ \vdots & \ddots & \vdots \\ g(w_1^T x_n + b_1) & \cdots & g(w_M^T x_n + b_M) \end{bmatrix} \quad (2)$$

$$T_{(n \times 1)} = [t_1, t_2, \ldots, t_n]^T, \quad (3)$$

one can compactly write the problem in Equation 1 as minimizing $\|H\beta - T\|^2$. The solution of this problem is given as $\beta = H^{\dagger} T$, where $H^{\dagger} = (H^T H)^{-1} H^T$ is the Moore–Penrose generalized inverse of the matrix $H$.
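The procedure above fits in a few lines of NumPy. This is an illustrative sketch rather than the authors' implementation; the function names are our own, and the least-squares solve stands in for forming the pseudo-inverse explicitly:

```python
import numpy as np

def elm_train(X, T, M, g=np.tanh, seed=0):
    """Train a single-hidden-layer feedforward network with ELM.
    X: (n, m) inputs, T: (n,) targets, M: number of hidden nodes."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    W = rng.standard_normal((m, M))   # random, fixed input weights w_i
    b = rng.standard_normal(M)        # random, fixed hidden biases b_i
    H = g(X @ W + b)                  # (n, M) hidden-layer output matrix
    # beta = H† T, computed via least squares instead of forming (HᵀH)⁻¹Hᵀ
    beta, *_ = np.linalg.lstsq(H, T, rcond=None)
    return W, b, beta

def elm_predict(X, W, b, beta, g=np.tanh):
    return g(X @ W + b) @ beta

# Fit a noiseless 1-D function with ample hidden nodes
X = np.linspace(-1, 1, 200).reshape(-1, 1)
T = np.sin(3 * X[:, 0])
W, b, beta = elm_train(X, T, M=50)
err = np.max(np.abs(elm_predict(X, W, b, beta) - T))
```

Since the only learned parameters are the output weights, training reduces to a single linear solve, which is the step the rest of this paper parallelizes.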
2.2 RNN architectures
RNNs are among the most powerful neural networks and are well suited to model long-term dependencies in time-series applications [31].
RNN architectures differ in the way cycles are in-
troduced in the network. In this work, we con-
sider six RNN architectures, illustrated in Fig-
ure 1: Elman [9], Jordan [19], NARMAX [8],
fully connected RNN, LSTM [15] and GRU [5].
Figure 1. RNN architectures adapted from prior
work in [30]
In Figure 1 and throughout this work, $x \in \mathbb{R}^{S \times Q}$ is the input to the network, $M$ is the number of hidden neurons, $w_i \in \mathbb{R}^S$ is the vector of weights connecting the input to the $i$th neuron, $\alpha_{ik} \in \mathbb{R}$ is the weight from neuron $i$ to itself from the $k$th previous time step, and $b_i$ is the $i$th bias.
2.2.1 Elman
Elman RNNs are single hidden layer networks where context neurons introduce recurrence by feeding back signals as the internal state of the network. At time step $t$, the output is

$$\hat{y} = \sum_{i=1}^{M} \beta_i f_i(t), \quad (4)$$

where $f_i(t) = g\left(w_i^T x(t) + \sum_{k=1}^{Q} \alpha_{ik} f_i(t-k) + b_i\right)$ is the output of neuron $i$ at time $t$.
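This recurrence can be transcribed directly in NumPy (an illustrative sketch with our own variable names, not the paper's code); it makes explicit that neuron $i$ at time $t$ depends only on the current input and on that same neuron's $Q$ previous outputs:

```python
import numpy as np

def elman_hidden(x, W, alpha, b, g=np.tanh):
    """x: (S, T) input sequence, W: (S, M), alpha: (M, Q), b: (M,).
    Returns f: (M, T) where f[:, t] holds the neuron outputs at step t."""
    S, T = x.shape
    M, Q = alpha.shape
    f = np.zeros((M, T))
    for t in range(T):
        acc = W.T @ x[:, t] + b            # w_i^T x(t) + b_i for all neurons
        for k in range(1, Q + 1):          # feedback from the Q previous steps
            if t - k >= 0:
                acc += alpha[:, k - 1] * f[:, t - k]
        f[:, t] = g(acc)
    return f

rng = np.random.default_rng(1)
f = elman_hidden(rng.standard_normal((3, 5)), rng.standard_normal((3, 4)),
                 0.1 * rng.standard_normal((4, 2)), np.zeros(4))
```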
2.2.2 Jordan
Jordan networks are similar to Elman's except for the way recurrence is introduced. In the Jordan architecture, signals are fed back from the predicted output of the previous time steps. Consequently, such networks are more suitable for time series prediction where dependencies are on current input and previous outputs. Specifically, the output at time step $t$ is described by Equation 4 with $f_i(t) = g\left(w_i^T x(t) + \sum_{k=1}^{Q} \alpha_{ik} \hat{y}(t-k) + b_i\right)$.
2.2.3 NARMAX
The Nonlinear AutoRegressive Moving Average model with eXogenous inputs (NARMAX) represents a wide class of nonlinear systems [2]. NARMAX networks have been proposed for non-linear time series prediction using artificial neural networks and are described by

$$\hat{y}(t) = \sum_{i=1}^{M} \beta_i \, g\left(w_i^T x(t) + \sum_{l=1}^{F} w'_{il}\, y(t-l) + \sum_{l=1}^{R} w''_{il}\, e(t-l) + b_i\right),$$

where $F$ and $R$ are the lengths of the time dependency of the output and the error feedbacks respectively, $e(t) = y(t) - \hat{y}(t)$, and $w'_{il} \in \mathbb{R}$ ($w''_{il} \in \mathbb{R}$ resp.) is the weight from the output (error resp.) at the $l$th time step to the $i$th hidden neuron.
2.2.4 Fully Connected RNN
A fully connected RNN is the most general RNN architecture, in which signals are fed back from all hidden neurons at previous time steps. Specifically, the output at time step $t$ is described by Equation 4 with $f_i(t) = g\left(w_i^T x(t) + \sum_{k=1}^{Q} \sum_{l=1}^{M} \alpha_{ilk} f_l(t-k) + b_i\right)$. In this case, $\alpha_{ilk} \in \mathbb{R}$ is the weight connecting neuron $i$ to neuron $l$ from the $k$th previous time step.
2.2.5 LSTM
LSTMs were introduced by [15] to solve the vanishing gradient problem in BPTT. LSTMs have been successfully applied to a wide variety of applications including speech recognition [12, 13], machine translation [4, 42] and human action recognition [23, 24]. An LSTM unit is composed of the main cell and input, output and forget gates which regulate the flow of information into and out of the cell through forgetting factors and weights. This formulation gives the network the ability to decide which information to remember. The output of LSTM is described by Equation 4 with $f(t) = o(t) \circ g_f(c(t))$, where $\circ$ is the Hadamard product of two matrices and $o(t)$, $c(t)$, $\lambda(t)$ and $in(t)$ are given by

$$o(t) = g_o\left(W_o x(t) + U_o f(t-1) + b_o\right)$$
$$c(t) = \lambda(t) \circ c(t-1) + in(t) \circ g_c\left(W_c x(t) + U_c f(t-1) + b_c\right)$$
$$\lambda(t) = g_\lambda\left(W_\lambda x(t) + U_\lambda f(t-1) + b_\lambda\right)$$
$$in(t) = g_{in}\left(W_{in} x(t) + U_{in} f(t-1) + b_{in}\right).$$
2.2.6 GRU
GRUs were introduced in [5] as a gating mechanism for RNNs. They resemble LSTMs but have only two gates and fewer parameters. GRUs expose their state at each time step and do not have any mechanism to control the degree to which their state is exposed [7]. They exhibit good performance on small datasets [7] and are widely used in speech recognition [34, 6] and sequence modeling [7]. GRUs' output is described by Equation 4 while $f(t)$ is given by

$$f(t) = (1 - z(t)) \circ f(t-1) + z(t) \circ g_f\left(W_f x(t) + U_f\left(r(t) \circ f(t-1)\right) + b_f\right), \quad (5)$$

where $z(t) = g_z\left(W_z x(t) + U_z f(t-1) + b_z\right)$ and $r(t) = g_r\left(W_r x(t) + U_r f(t-1) + b_r\right)$.
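As a sanity check on Equation 5, a single GRU update can be sketched in NumPy. This is our own illustrative code; the parameter-dictionary layout and the sigmoid/tanh choices for the gate activations are assumptions, since the text leaves $g_z$, $g_r$ and $g_f$ generic:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, f_prev, P):
    """One GRU update following Equation 5. P holds (W, U, b) per gate."""
    z = sigmoid(P['Wz'] @ x_t + P['Uz'] @ f_prev + P['bz'])   # update gate z(t)
    r = sigmoid(P['Wr'] @ x_t + P['Ur'] @ f_prev + P['br'])   # reset gate r(t)
    cand = np.tanh(P['Wf'] @ x_t + P['Uf'] @ (r * f_prev) + P['bf'])
    return (1 - z) * f_prev + z * cand                        # convex blend

rng = np.random.default_rng(2)
S, M = 3, 4
P = {k: rng.standard_normal((M, S)) for k in ('Wz', 'Wr', 'Wf')}
P.update({k: rng.standard_normal((M, M)) for k in ('Uz', 'Ur', 'Uf')})
P.update({k: np.zeros(M) for k in ('bz', 'br', 'bf')})
f = np.zeros(M)
for t in range(5):
    f = gru_step(rng.standard_normal(S), f, P)
```

Because each step is a convex combination of the previous state and a tanh candidate, the state stays bounded in [-1, 1], which reflects GRUs exposing their state directly at every step.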
3 Related Work
This work focuses on the parallelization of a
non-iterative training algorithm for RNNs. In what
follows, we first discuss the basic training meth-
ods of RNNs while focusing on the non-iterative
ones. Then, we report the parallelization attempts
for training algorithms.
3.1 RNN Training
3.1.1 Iterative RNN Training
Training RNNs has been mainly done itera-
tively through BPTT [40] which unfolds the re-
currence through time to transform the RNN into
a feedforward network trained using gradient de-
scent. BPTT is susceptible to local minima and
suffers from the vanishing and exploding gradient
problems with long time dependencies. BPTT can
also be slow, given that it is applied iteratively in
batch mode. Other iterative algorithms include, but
are not limited to, Hessian free optimization [25],
extended Kalman filters [39] and genetic algorithms
(GA) [3]. Although successful, these algorithms
are computationally expensive and require manual tuning of many hyper-parameters.
3.1.2 Non-Iterative RNN Training
Different non-iterative training algorithms have
been proposed to reduce the computational cost of
training neural networks in general. For instance,
the authors in [32, 35, 28, 16] proposed ELM, a
non-iterative method to train single hidden layer
feedforward networks by randomly assigning input
weights and computing output weights using the
least-squares method. These methods were later ex-
tended to RNN architectures when Ertugrul imple-
mented a non-iterative training for the Jordan RNN
architecture in electricity load forecasting applica-
tions [10]. Later, Park et al. extended it to online
RNNs [29] and Rizk et al. generalized the approach
to more powerful RNN architectures [30].
Although these methods achieved high
speedups (up to 99% in [30]), they heavily rely
on stencil operations and on the computation of the
generalized inverse of matrices which are CPU in-
tensive operations and could be further optimized
using parallel algorithms.
3.2 Parallelizing Training Algorithms
Several frameworks have been developed to
solve challenges of high performance computing in
the big data area [43], including parallelizing train-
ing algorithms. This is the first attempt to paral-
lelize non-iterative training of RNNs; thus we de-
scribe previous work on the parallelization of RNN
iterative training algorithms and on the parallel
non-iterative training for neural networks, not exclusively RNNs.
3.2.1 Parallelizing Iterative Training Algo-
rithms For RNN
Parallelizing RNN training is mostly based on
parallelizing the back-propagation algorithm (BP).
For instance, Sierra et al. parallelized BP on
CUBLAS and achieved a speedup of 63. In [44],
data is distributed on multiple GPUs achieving a
speedup of up to 51 [33]. In [38], a parallel scan algorithm improves the step complexity of BP from O(n) to O(log n). Khomenko et al. parallelized
their data on multiple GPUs and relied on batch
bucketing by input sequence length to accelerate
RNN training achieving a speedup of up to 4 [20].
In [27], a semantic correlation-based data pre-fetch
framework is implemented to break the dependency
in the input to parallelize the training of cognitive applications. Their work is tested on LSTMs using image captioning, speech recognition, and language processing applications, showing speedups of 5.1, 44.9 and 1.53, respectively. Recently, GA was introduced into the Elman architecture to accelerate the training and prevent the local minima problem [18]. GA-Elman outperforms traditional training algorithms in terms of convergence speed and accuracy.
3.2.2 Parallelizing Non-Iterative Training Al-
gorithms
Non-iterative training algorithms for RNNs are
shown to require less training time than iterative
methods [30, 10, 29]. However, even with non-
iterative training, large datasets require costly com-
putations, especially when increasing the number of
neurons or when model selection is performed to
avoid over-fitting [36]. Parallelizing non-iterative
training has been explored in single layer feedfor-
ward networks by [14]. Their approach is based on Map-Reduce and achieves a speedup of up to 5.6
when tested on 32 cores. Following a similar ap-
proach, Wang et al. [37] developed a parallel imple-
mentation of online ELM and achieved a speedup
of 3.5 when trained on 120K instances with 120 at-
tributes. Huang et al. extended their approach to the
ensemble online sequential ELM which was tested
on real and synthetic data with 5120K training in-
stances and 512 attributes and achieved a speedup
of 40 on a cluster with 80 cores [17]. In [36],
Van et al. attempted to parallelize ELM on Flink
with multi-hidden-layer feedforward networks and achieved a speedup of 17.
To the best of our knowledge, our work is the
first attempt to parallelize non-iterative training for
different RNN architectures.
4 Methodology
Before proposing our methods, we present the
nomenclature that will be used throughout this pa-
per in Table 1.
Table 1. Nomenclature

$n$: number of training samples
$M$: number of hidden neurons
$Q$: max number of time dependencies
$S$: dimension of input
$x_j \in \mathbb{R}^{S \times Q}$: $j$th input instance
$y_j \in \mathbb{R}$: $j$th output instance
$X \in \mathbb{R}^{n \times S \times Q}$: input matrix
$Y \in \mathbb{R}^{n}$: output matrix
$W \in \mathbb{R}^{S \times L}$: weight matrix connecting the input to the hidden neurons
$\alpha \in \mathbb{R}^{L \times Q}$: weight matrix connecting each hidden neuron to itself for previous time steps
$b \in \mathbb{R}^{L}$: bias vector for the hidden neurons
$\beta \in \mathbb{R}^{L}$: weight vector connecting hidden neurons to the output layer
S-R-ELM: sequential ELM for RNN training
Basic-PR-ELM: basic parallel ELM RNN training
Opt-PR-ELM: optimized parallel ELM RNN training
BPTT: back-propagation through time
P-BPTT: parallel back-propagation through time
BS: block size
TW: tile width
In this work, a parallel version of ELM-trained
RNNs will be formalized and implemented. The
sequential version of our approach, denoted by S-R-ELM, is summarized in Algorithm 1 and is adopted from our previous work in [30].
Algorithm 1 S-R-ELM algorithm
1: Randomly assign $W$, $\alpha$, $b$
2: Compute $H(t)$, $t = 1 \ldots Q$ according to the corresponding RNN architecture
3: Compute $\beta = H(Q)^{\dagger} Y$ using the generalized Moore–Penrose pseudoinverse

$H(t)$ at row $i$ and column $j$ is referred to as $h_{ij}[t]$ in this paper and is computed as in Equations 6, 7, 8, 9, 10 and 11 for the Elman, Jordan, NARMAX, fully connected, LSTM and GRU architectures respectively.

$$h_{ij}[t] = g\left(W[:,j] \cdot X[i,:,t] + b_j + \sum_{k=1}^{Q} \alpha[j,k]\, h_{ij}[t-k]\right) \quad (6)$$

$$h_{ij}[t] = g\left(W[:,j] \cdot X[i,:,t] + b_j + \sum_{k=1}^{Q} \alpha[j,k]\, \hat{y}(t-k)\right) \quad (7)$$

$$h_{ij}[t] = g\left(W[:,j] \cdot X[i,:,t] + b_j + \sum_{l=1}^{F} W'[i,l]\, y(t-l) + \sum_{l=1}^{R} W''[i,l]\, e(t-l)\right) \quad (8)$$

$$h_{ij}[t] = g\left(W[:,j] \cdot X[i,:,t] + b_j + \sum_{k=1}^{Q} \sum_{l=1}^{M} \alpha[j,l,k]\, h_{il}[t-k]\right) \quad (9)$$

$$h_{ij}[t] = o[i,j,t] \circ g_f\left(c[i,j,t]\right) \quad (10)$$

$$h_{ij}[t] = \left(1 - z[i,j,t]\right) \circ h_{ij}[t-1] + z[i,j,t] \circ g_f\left(W_f[:,j] \cdot X[i,:,t] + U_f\left(r[i,j,t] \circ h_{ij}[t-1]\right) + b_j\right) \quad (11)$$
Considering Algorithm 1, one can see that the running time of the ELM training mainly consists of two CPU-intensive operations: computing $H$ and computing $\beta$ by solving the linear system using the Moore–Penrose pseudo-inverse. Thus, those two operations are the main target when optimizing the performance of non-iterative training.
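The three steps of Algorithm 1, specialized to the Elman architecture of Equation 6, can be sketched sequentially in NumPy. This is our own illustrative transcription, not the paper's code; Basic-PR-ELM and Opt-PR-ELM replace both expensive steps with GPU kernels. Following the paper's approach, the pseudo-inverse is applied through a QR factorization rather than by explicitly inverting $H^T H$:

```python
import numpy as np

def s_r_elm_elman(X, Y, M, Q, g=np.tanh, seed=0):
    """Sequential ELM training of an Elman RNN (Algorithm 1).
    X: (n, S, Q) inputs, Y: (n,) targets, M hidden neurons."""
    rng = np.random.default_rng(seed)
    n, S, _ = X.shape
    W = rng.standard_normal((S, M))           # step 1: random W, alpha, b
    alpha = 0.1 * rng.standard_normal((M, Q))
    b = rng.standard_normal(M)
    H = np.zeros((n, M, Q + 1))               # step 2: H(t) per Equation 6
    for t in range(1, Q + 1):
        acc = X[:, :, t - 1] @ W + b
        for k in range(1, t):                 # feedback from computed steps
            acc += alpha[:, k - 1] * H[:, :, t - k]
        H[:, :, t] = g(acc)
    # step 3: beta = H(Q)† Y via QR: H(Q) = Q_f R  =>  beta = R⁻¹ Q_fᵀ Y
    Qf, R = np.linalg.qr(H[:, :, Q])
    beta = np.linalg.solve(R, Qf.T @ Y)
    return W, alpha, b, beta

rng = np.random.default_rng(3)
X = rng.standard_normal((50, 3, 4))
Y = rng.standard_normal(50)
W, alpha, b, beta = s_r_elm_elman(X, Y, M=10, Q=4)
```

The QR route solves the same least-squares problem as the pseudo-inverse while avoiding the ill-conditioned product $H^T H$, which is why the parallel versions adopt it.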
4.1 H Computation
4.1.1 Basic Parallel Implementation (Basic-
PR-ELM)
For all RNN architectures, the computation of $H(t)$ at row $i$ and column $j$ is independent of the computation of $H(t)$ at row $i_2$ and column $j_2$, $\forall i_2 \neq i$, $j_2 \neq j$; it only depends on $H(t_2)$ at row $i$ and column $j$ for $t_2 < t$. Given only this dependency, a parallel $H$ computation can be done as follows: each thread $(i, j)$ independently computes $H(t)$ at row $i$ and column $j$ for $t = 1, \ldots, Q$. We describe the basic implementation of the computation of $H$ for the Elman architecture in Algorithm 2.
Algorithm 2 Basic-PR-ELM by thread (i, j)
1: tx ← threadIdx.x
2: ty ← threadIdx.y
3: Row ← tx + blockIdx.x × blockDim.x
4: Col ← ty + blockIdx.y × blockDim.y
5: for t = 1 → Q do
6:     hij ← W[:, Col] · X[Row, :, t]
7:     hij ← hij + b[Col]
8:     for tprev = 1 → t do
9:         hij ← hij + α[Col, tprev] × H[Row, Col, tprev]
10:    end for
11:    H[Row, Col, t] ← hij
12: end for
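Because each (Row, Col) pair is independent, the thread body of Algorithm 2 can be simulated sequentially on the CPU. The sketch below is our own transcription, not the actual CUDA kernel; the "threads" may run in any order and still produce the same $H$:

```python
import numpy as np

def basic_pr_elm_thread(Row, Col, X, W, alpha, b, H, Q):
    """Body of one Basic-PR-ELM thread (Algorithm 2), simulated on CPU."""
    for t in range(1, Q + 1):
        hij = W[:, Col] @ X[Row, :, t - 1]         # line 6: dot product
        hij += b[Col]                               # line 7: bias
        for tprev in range(1, t):                   # lines 8-10: feedback
            hij += alpha[Col, tprev - 1] * H[Row, Col, tprev]
        H[Row, Col, t] = hij                        # line 11: store

rng = np.random.default_rng(4)
n, S, M, Q = 6, 3, 4, 3
X = rng.standard_normal((n, S, Q))
W, alpha, b = (rng.standard_normal((S, M)), rng.standard_normal((M, Q)),
               rng.standard_normal(M))
H = np.zeros((n, M, Q + 1))
for i in range(n):             # every (i, j) "thread" may run in any order
    for j in range(M):
        basic_pr_elm_thread(i, j, X, W, alpha, b, H, Q)
```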
4.1.2 Optimized Parallel Implementation
(Opt-PR-ELM)
Figure 2. Basic-PR-ELM memory access patterns
on Elman
Figure 2 illustrates the memory access patterns of Basic-PR-ELM on the Elman architecture. One can clearly see that threads in the same row access the same elements of $X$ and threads in the
39
Julia El Zini, Yara Rizk and Mariette Awad
Algorithm 1 S-R-ELM algorithm
1: Randomly assign W,α,b
2: Compute H(t),t=1...Qaccording to the cor-
responding RNN architecture
3: Compute β=H(Q)†Yusing the generalized
Moore–Penrose pseudoinverse
H(t) at row iand column jis referred to as hij[t]
in this paper and is computed as in Equations 6, 7,
8, 9, 10 and 11 for the Elman, Jordan, NARMAX,
fully connected, LSTM and GRU architectures re-
spectively.
hij[t]=g(W[:,j].X[i,:,t]+bi+
Q
∑
k=1
α[j,k]hij[t−k]
(6)
hij[t]=g(W[:,j].X[i,:,t]+bi+
Q
∑
k=1
α[j,k]ˆy(t−k)
(7)
hij[t]=g(W[:,j].X[i,:,t]+bi+
F
∑
l=1
W[i,l]y(t−l)+
R
∑
l=1
W[i,l]e(t−l)(8)
hij[t]=gW[:,j].X[i,:,t]+bi+
Q
∑
k=1
M
∑
l=1
α[j,l,k]hij[t−k](9)
hij[t]=o[i,j,t]◦gfc[i,j,t](10)
hij[t]=1−z[i,j,t]◦hij[t−1]+z[i,j,t]◦
gfWf[:,j].X[i,:,t]+Uf(r[i,j,t]◦hij[t−1]+bi)
(11)
Considering Algorithm 1, one can see that the
running time of the ELM training mainly consists
of two CPU intensive operations: computing Hand
computing βby solving the linear system using the
Moore-Penrose pseudo-inverse. Thus, those two
operations are the main target when optimizing the
performance of non-iterative training.
4.1 HComputation
4.1.1 Basic Parallel Implementation (Basic-PR-ELM)
For all RNN architectures, the computation of H(t) at row i and column j is independent of the computation of H(t) at row i2 and column j2, ∀ i2 ≠ i, j2 ≠ j; it depends only on H(t2) at row i and column j for t2 < t. Given this single dependency, H can be computed in parallel as follows: each thread (i, j) independently computes H(t) at row i and column j for t = 1, ..., Q. We describe the basic implementation of the computation of H for the Elman architecture in Algorithm 2.
Algorithm 2 Basic-PR-ELM by thread (i, j)
1: tx ← threadIdx.x
2: ty ← threadIdx.y
3: Row ← tx + blockIdx.x × blockDim.x
4: Col ← ty + blockIdx.y × blockDim.y
5: for t = 1 → Q do
6:   hij ← W[:, Col] . X[Row, :, t]
7:   hij ← hij + bCol
8:   for tprev = 1 → t do
9:     hij ← hij + α[j, tprev] × H[Row, Col, tprev]
10:  end for
11:  H[Row, Col, t] ← hij
12: end for
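Since entries of H at different (i, j) are independent, the per-thread loop of Algorithm 2 can be mimicked on the CPU by vectorizing over the whole (Row, Col) grid while keeping only the t loop sequential. The sketch below follows Algorithm 2 (which omits the activation g); the array shapes are assumptions for illustration.

```python
import numpy as np

def basic_pr_elm_H(X, W, alpha, b):
    """Compute H as Algorithm 2 does, with all (Row, Col) "threads"
    replaced by vectorized NumPy operations: entries at different
    (i, j) are independent, and only the t loop is sequential.
    X: (N, S, Q), W: (S, M), alpha: (M, Q), b: (M,) -> H: (N, M, Q)."""
    N, S, Q = X.shape
    M = W.shape[1]
    H = np.zeros((N, M, Q))
    for t in range(Q):
        h = X[:, :, t] @ W + b                     # lines 6-7
        for tprev in range(t):                     # lines 8-10
            h += alpha[:, tprev] * H[:, :, tprev]
        H[:, :, t] = h                             # line 11
    return H
```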
4.1.2 Optimized Parallel Implementation
(Opt-PR-ELM)
Figure 2. Basic-PR-ELM memory access patterns
on Elman
Figure 2 illustrates the memory access patterns of Basic-PR-ELM on the Elman architecture. One can clearly see that threads in the same row access the same elements of X and threads in the same column access the same elements of W and α. Thus, the tiling concept can be applied to exploit the shared memory and speed up the computation of H. Moreover, we notice that bCol can be preloaded once and then used efficiently by the other threads.
Algorithm 3 describes how these optimizations
can be applied for the Elman architecture.
Algorithm 3 Opt-PR-ELM by thread (i, j)
1: tx ← threadIdx.x
2: ty ← threadIdx.y
3: Row ← tx + blockIdx.x × blockDim.x
4: Col ← ty + blockIdx.y × blockDim.y
5: Hloc ← Q-dimensional array in the local memory of thread (i, j)
6: for t = 1 → Q do
7:   hij ← 0
8:   for tile = 1 → num_tiles do
9:     Wshared ← W[tx + tile × TW, Col]
10:    Xshared ← X[Row, ty + tile × TW, t]
11:    synch()
12:    hij ← hij + Wshared . Xshared
13:  end for
14:  synch()
15:  if tx = 0 and ty = 0 then
16:    bshared ← b[Col]
17:  end if
18:  synch()
19:  hij ← hij + bshared
20:  for tile = 1 → t/TW do
21:    αshared ← α[Col, tx + tile × TW]
22:    synch()
23:    hij ← hij + αshared . Hloc[tprev]
24:  end for
25:  synch()
26:  Hloc[t] ← hij
27:  H[Row, Col, t] ← hij
28: end for
First, in the dot product W[:,Col] . X[Row,:,t], each thread loads only one element of W and one element of X into the shared memory. Once the threads synchronize, all needed elements of W and X are loaded, and the dot product can be computed efficiently. Second, only one thread loads b[j], which is needed by all the threads in the same column of the block. The same tiling concept used to compute W[:,Col] . X[Row,:,t] can be used to speed up the computation of α[j,tprev] × H(tprev)[Row,Col]. Lastly, each thread can save the values of H(t)[Row,Col] in its register file to reduce the time taken to read from the global memory in line 8 of Algorithm 2. If these values do not fit in the registers, they are read from the global memory.
Algorithms 2 and 3 can easily be extended to other architectures by replacing Equation 6 with Equations 7, 8, 9, 10 or 11.
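The tiling idea of Algorithm 3 can be illustrated in isolation: a long dot product is consumed in TW-wide chunks, where each chunk corresponds to one cooperative shared-memory load by a thread block. The function below is a CPU sketch of that access pattern, not GPU code; the tile width and array names are assumptions.

```python
import numpy as np

def tiled_dot(w, x, TW=16):
    """Dot product computed tile by tile, mirroring the shared-memory
    loop (lines 8-13 of Algorithm 3): each iteration would correspond
    to one cooperative load of a TW-wide tile followed by a partial sum."""
    S = len(w)
    num_tiles = (S + TW - 1) // TW          # ceil(S / TW)
    acc = 0.0
    for tile in range(num_tiles):
        lo, hi = tile * TW, min((tile + 1) * TW, S)
        acc += w[lo:hi] @ x[lo:hi]          # partial product of one tile
    return acc
```

On a GPU, the benefit is that each element of the tile is fetched from global memory once per block instead of once per thread.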
4.2 Computing β
β is the solution of the system Hβ = Y. Instead of computing the pseudoinverse H† and then multiplying it by Y, one can perform a QR factorization of H as H = QR, then compute z = QᵀY. β is then the solution of Rβ = z, obtained by back substitution since R is an upper triangular matrix. In this work, we make use of the Numba [21] and NumPy [26] libraries, which provide an efficient implementation of this method in Python.
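This β step can be sketched directly with NumPy: factor H = QR, form z = QᵀY, and back-substitute in the upper triangular system. The explicit back-substitution loop below stands in for the library routine and assumes H has full column rank.

```python
import numpy as np

def solve_beta_qr(H, Y):
    """Solve H beta = Y in the least-squares sense via QR factorization
    and back substitution, avoiding the explicit pseudoinverse."""
    Qm, R = np.linalg.qr(H)      # H = Q R, R upper triangular (reduced QR)
    z = Qm.T @ Y                 # z = Q^T Y
    M = R.shape[0]
    beta = np.zeros((M, Y.shape[1]))
    for i in range(M - 1, -1, -1):                       # back substitution
        beta[i] = (z[i] - R[i, i + 1:] @ beta[i + 1:]) / R[i, i]
    return beta
```

The result matches the pseudoinverse solution while avoiding the cost and conditioning issues of forming H† explicitly.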
5 Theoretical Analysis
We analyze the memory read and write operations and the floating point operations (FLOPS) of the proposed algorithms, Basic-PR-ELM and Opt-PR-ELM. For the Elman architecture, Basic-PR-ELM performs Q(2S+Q+2) read operations divided as follows:
– 2 × SQ to read the values needed in line 6
– Q reads for bCol in line 7
– 2 × (Q(Q+1)/2) reads in the loop at line 8
Moreover, only Q write operations are needed (in line 11) and Q(2S+Q+2) FLOPS are performed as follows:
– 2 × SQ to perform the dot product at line 6
– Q FLOPS for the addition in line 7
– 2 × (Q(Q+1)/2) to perform the loop at line 8
The memory operations to FLOPS ratio is (2S+Q+3)/(2S+Q+2) > 1, which might limit the performance of Basic-PR-ELM. This ratio improves with Opt-PR-ELM as it minimizes the memory operations while keeping the same number of FLOPS. Specifically, Opt-PR-ELM decreases the number of reads to (1/TW²)(2 × SQ + Q(Q+1)/2) + 1, divided as follows:
Table 2. Number of memory operations and FLOPS for each RNN architecture for Basic-PR-ELM

Architecture | # Read Operations | # Write Operations | FLOPS
Elman | Q(2S+Q+2) | Q | Q(2S+Q+2)
Jordan | Q(2S+1+(Q+1)(1/2+M)) | Q | Q(2S+1+((Q+1)/2)(2SM+M))
NARMAX | Q(2S+1)+2(2F+M+R) | Q | Q(2S+1+2F+R(2+2SM+M))
Fully Connected | Q(2S+1+2MQ) | Q | Q(2S+Q+2QM)
LSTM | Q(5S+13) | 5Q | Q(8S+18)
GRU | Q(4S+8) | 3Q | Q(3S+17)
Figure 3. Speedup of Basic-PR-ELM and Opt-PR-ELM for the different architectures when M=50
Figure 4. Speedup of Opt-PR-ELM for the different architectures when the number of hidden neurons
increases from 5 to 100
Table 3. Benchmarks Description

Category | Name | # of instances | Q | % Train | Mean | Std Dev | Min | Max
Small | Japan population | 2,540 | 10 | 80 | 1.40E+06 | 1.40E+06 | 1.00E+05 | 1.03E+08
Small | Quebec Births | 5,113 | 10 | 80 | 2.51E+02 | 4.19E+01 | -2.31E+01 | 3.66E+02
Small | Exoplanet | 5,657 | 3197 | 80 | -3.01E+02 | 1.45E+04 | -6.43E+05 | 2.11E+05
Medium | SP500 | 17,218 | 10 | 80 | 8.99E+08 | 1.53E+09 | 1.00E+06 | 1.15E+10
Medium | AEMO | 17,520 | 10 | 80 | 7.98E+03 | 1.19E+03 | 5.11E+03 | 1.38E+04
Medium | Hourly weather | 45,300 | 50 | 80 | 2.79E+02 | 3.78E+01 | 0.00E+00 | 3.07E+02
Large | Energy Consumption | 119,000 | 10 | 70 | 1.66E+03 | 3.02E+02 | 0.00E+00 | 3.05E+03
Large | Electricity load | 280,514 | 10 | 70 | 2.70E+14 | 2.60E+14 | 0.00E+00 | 9.90E+14
Large | Stock prices | 619,000 | 50 | 70 | 4.48E+06 | 1.08E+07 | 0.00E+00 | 2.06E+09
Large | Temperature | 998,000 | 50 | 70 | 5.07E+01 | 2.21E+01 | 4.00E+00 | 8.10E+01
– (2/TW²) × SQ to read the values needed in line 12
– at most 1 read for bCol in line 16
– (1/TW²)(Q(Q+1)/2) reads in the loop at line 20
where TW is the tile width, which is set to the block size in this work. The new memory operations to FLOPS ratio is [(1/TW²)(2 × SQ + Q(Q+1)/2) + 1 + Q] / [Q(2S+Q+2)], which is less than the ratio of Basic-PR-ELM by a factor of ≈ TW². Specifically, Opt-PR-ELM reduces the number of read operations by a factor of 256 (resp. 1024) when the tile width is set to 16 (resp. 32).
Table 2 reports the number of memory operations and FLOPS needed by Basic-PR-ELM for each RNN architecture. The values for Opt-PR-ELM are omitted since it requires the same number of write operations and FLOPS, and fewer read operations by a factor of ≈ TW².
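The claimed ≈ TW² reduction can be sanity-checked numerically by plugging representative values of S and Q into the two read counts derived above (the values S = 500, Q = 50 below are arbitrary choices for the check, not from the paper's benchmarks).

```python
# Read counts from the analysis above, for the Elman architecture.
def basic_reads(S, Q):
    # Basic-PR-ELM: Q(2S + Q + 2) global-memory reads.
    return Q * (2 * S + Q + 2)

def opt_reads(S, Q, TW):
    # Opt-PR-ELM: (1/TW^2)(2SQ + Q(Q+1)/2) + 1 reads.
    return (2 * S * Q + Q * (Q + 1) / 2) / TW**2 + 1

S, Q = 500, 50
for TW in (16, 32):
    factor = basic_reads(S, Q) / opt_reads(S, Q, TW)
    print(f"TW={TW}: read reduction factor ~ {factor:.0f} (TW^2 = {TW**2})")
```

The measured factor tracks TW² closely, confirming that the dominant 2SQ term is divided by the squared tile width.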
6 Experimental Setup
6.1 Setup
Serial algorithms were run on an Intel 64-bit Core i7 machine with 16 GB of memory. Parallel algorithms were run on an NVIDIA Tesla K20m GPU with 2688 CUDA cores and a 723 MHz GPU core clock speed. The GPU main memory is 6 GB, with a bandwidth of 250 GB/s between the host and the device. All experiments are repeated 5 times, and the average value is reported.
6.2 Time Series Prediction Benchmarks
Basic-PR-ELM and Opt-PR-ELM were validated on time series prediction problems. Table 3 presents the characteristics of the datasets, ordered by the number of instances. According to their size, we split the databases into three categories: small datasets containing fewer than 10K instances, medium datasets with tens of thousands of instances, and large datasets consisting of hundreds of thousands of instances. Japan population¹ tracks the population of various Japanese regions, while Quebec Births² tracks the number of births in Quebec
¹ kaggle.com/jd1325/japan-population-data
² datamarket.com/data/list/?q=provider%3Atsdl
³ kaggle.com/keplersmachines/kepler-labelled-time-series-data
⁴ kaggle.com/benjibb/sp500-since-1950
⁵ aemo.com.au/
⁶ kaggle.com/selfishgene/historical-hourly-weather-data
Table 4. Average RMSE (± standard deviation) of S-R-ELM and Opt-PR-ELM (BS=32), showing that both algorithms achieve accuracies within the same range for different RNN architectures on all the datasets.

Dataset | Algorithm | Elman | Jordan | NARMAX | Fully Connected | LSTM | GRU
Japan pop. | S-R-ELM | 3.97E-2 ±4.67E-2 | 1.12E-1 ±3.75E-1 | 6.54E-1 ±3.32E-2 | 5.43E-3 ±3.89E-5 | 2.45E-1 ±2.36E-1 | 4.46E-1 ±3.35E-4
Japan pop. | Opt-PR-ELM | 3.74E-2 ±7.17E-8 | 1.23E-1 ±2.89E-2 | 6.23E-1 ±2.31E-2 | 6.23E-3 ±2.65E-4 | 2.46E-1 ±4.56E-2 | 4.75E-2 ±5.81E-5
Quebec Births | S-R-ELM | 4.06E-3 ±7.68E-5 | 1.01E-1 ±5.00E-3 | 3.42E-1 ±5.05E-3 | 2.02E-2 ±4.99E-7 | 1.01E-1 ±5.76E-1 | 1.01E+0 ±5.16E-4
Quebec Births | Opt-PR-ELM | 2.02E-3 ±4.89E-5 | 4.35E-1 ±5.32E-4 | 3.46E-1 ±3.79E-3 | 2.42E-2 ±7.07E-1 | 1.49E-2 ±1.46E-4 | 1.16E+0 ±3.56E-3
Exoplanet | S-R-ELM | 5.40E+0 ±3.03E-1 | 2.87E+0 ±7.91E-3 | 2.01E-1 ±2.98E-3 | 3.46E-1 ±1.01E-2 | 5.45E-1 ±2.31E-1 | 4.32E+0 ±4.56E-1
Exoplanet | Opt-PR-ELM | 5.42E+0 ±3.05E-1 | 2.34E+0 ±7.34E-2 | 2.53E-1 ±1.98E-3 | 3.42E-1 ±1.51E-2 | 3.65E-1 ±2.31E-5 | 5.21E+0 ±3.76E-2
SP500 | S-R-ELM | 1.69E-1 ±7.78E-3 | 1.32E-1 ±3.75E-4 | 9.01E-1 ±8.70E-4 | 1.96E+0 ±4.32E-1 | 1.01E-1 ±5.16E-2 | 7.84E+0 ±5.55E-2
SP500 | Opt-PR-ELM | 2.34E-1 ±7.98E-4 | 4.01E-1 ±6.36E-5 | 9.11E-1 ±8.32E-5 | 1.36E+0 ±1.90E-2 | 1.24E-1 ±3.14E-2 | 7.83E+0 ±5.53E-1
AEMO | S-R-ELM | 1.26E-1 ±1.45E-3 | 3.30E-2 ±7.16E-3 | 9.61E-2 ±8.79E-3 | 5.00E-2 ±1.32E-5 | 1.36E-2 ±5.33E-4 | 2.33E-1 ±2.23E-5
AEMO | Opt-PR-ELM | 1.34E-1 ±1.25E-4 | 1.12E-2 ±5.16E-2 | 3.23E-3 ±1.01E-2 | 5.36E-2 ±1.12E-4 | 1.22E-2 ±5.67E-3 | 2.01E-1 ±2.13E-6
Hourly Weather | S-R-ELM | 1.98E-1 ±5.17E+0 | 3.14E-1 ±2.07E-3 | 8.06E-1 ±7.63E-5 | 7.39E-2 ±6.03E-2 | 2.10E-2 ±2.24E-5 | 3.21E-1 ±9.61E-3
Hourly Weather | Opt-PR-ELM | 1.52E-1 ±3.34E+0 | 3.98E-1 ±5.67E-4 | 2.00E-1 ±7.03E-4 | 3.79E-2 ±5.03E-3 | 1.02E-2 ±2.14E-5 | 4.32E-1 ±9.16E-3
Energy Cons. | S-R-ELM | 1.83E-4 ±1.98E-3 | 2.21E-3 ±3.43E-1 | 2.22E-4 ±5.26E-3 | 3.56E-3 ±5.56E-4 | 1.56E-3 ±9.96E-4 | 2.34E-2 ±2.22E-5
Energy Cons. | Opt-PR-ELM | 1.38E-4 ±2.45E-3 | 3.48E-3 ±3.03E-2 | 6.44E-5 ±5.16E-4 | 2.65E-3 ±5.16E-5 | 2.56E-3 ±5.326E-5 | 3.24E-3 ±2.12E-5
Elec. Load | S-R-ELM | 2.56E+0 ±7.93E+0 | 2.40E+0 ±3.90E-1 | 8.64E+0 ±9.81E+0 | 4.16E-1 ±3.45E-1 | 8.32E+0 ±8.05E+0 | 1.12E+0 ±5.16E-1
Elec. Load | Opt-PR-ELM | 2.34E+0 ±7.03E-1 | 4.76E+0 ±2.20E-2 | 4.86E+0 ±8.91E-1 | 4.64E-1 ±3.97E-2 | 2.84E+0 ±8.13E-1 | 2.98E+0 ±5.06E+0
Stock Prices | S-R-ELM | 6.41E-1 ±7.93E-1 | 1.10E-1 ±9.09E-5 | 4.80E+0 ±3.87E-1 | 2.13E-2 ±3.89E-1 | 4.00E-1 ±1.09E-3 | 2.62E-1 ±3.82E-4
Stock Prices | Opt-PR-ELM | 3.41E-1 ±3.35E-2 | 1.56E-1 ±9.23E-5 | 4.81E+0 ±3.32E-2 | 2.03E-3 ±1.92E-4 | 4.94E-1 ±5.69E-4 | 6.28E-1 ±3.28E-3
Temp. | S-R-ELM | 4.32E-4 ±9.85E-5 | 5.65E-3 ±6.79E-9 | 3.56E-4 ±7.10E-6 | 2.91E-5 ±3.72E-9 | 4.92E-4 ±6.02E-5 | 3.54E-4 ±2.95E-6
Temp. | Opt-PR-ELM | 4.12E-4 ±9.67E-4 | 5.03E-3 ±6.19E-2 | 3.15E-4 ±9.25E-6 | 9.21E-5 ±3.02E-5 | 8.17E-4 ±6.92E-4 | 3.19E-3 ±5.29E-5
Table 5. Speedup of Opt-PR-ELM (BS=32) when tested on the Tesla K20m and Quadro K2000 GPUs for different RNN architectures on various datasets, when the number of hidden neurons M is 20.

Architecture | GPU | Japan pop. | Quebec Births | Exop. | SP500 | AEMO | Hourly weather | Energy cons. | Elec. Load | Stock Prices | Temp.
Elman | Tesla K20m | 12 | 12 | 18 | 26 | 42 | 64 | 163 | 164 | 251 | 261
Elman | Quadro K2000 | 12 | 10 | 16 | 23 | 40 | 61 | 60 | 163 | 239 | 251
Jordan | Tesla K20m | 12 | 13 | 42 | 26 | 42 | 64 | 163 | 165 | 244 | 300
Jordan | Quadro K2000 | 11 | 11 | 39 | 23 | 40 | 60 | 163 | 163 | 189 | 295
NARMAX | Tesla K20m | 13 | 12 | 29 | 29 | 45 | 72 | 167 | 168 | 263 | 281
NARMAX | Quadro K2000 | 11 | 11 | 28 | 26 | 42 | 71 | 162 | 162 | 257 | 273
Fully Connected | Tesla K20m | 17 | 18 | 35 | 36 | 50 | 73 | 198 | 226 | 281 | 326
Fully Connected | Quadro K2000 | 14 | 16 | 33 | 34 | 48 | 71 | 196 | 225 | 279 | 324
LSTM | Tesla K20m | 21 | 21 | 43 | 39 | 50 | 74 | 219 | 201 | 310 | 327
LSTM | Quadro K2000 | 19 | 20 | 41 | 36 | 45 | 70 | 215 | 196 | 307 | 323
GRU | Tesla K20m | 20 | 18 | 46 | 40 | 50 | 67 | 197 | 200 | 309 | 326
GRU | Quadro K2000 | 15 | 14 | 42 | 35 | 47 | 58 | 192 | 187 | 300 | 320
Table 6. Runtime (seconds) of Opt-PR-ELM (BS=32) and the parallel iterative training algorithm (P-BPTT), with ratio = P-BPTT / Opt-PR-ELM.

Dataset | FC Opt-PR-ELM | FC P-BPTT | FC Ratio | LSTM Opt-PR-ELM | LSTM P-BPTT | LSTM Ratio | GRU Opt-PR-ELM | GRU P-BPTT | GRU Ratio
Japan pop. | 0.23 | 3.52 | 15 | 0.38 | 7.41 | 20 | 0.38 | 6.59 | 17
Quebec Births | 0.56 | 6.75 | 12 | 0.85 | 13.56 | 16 | 0.81 | 12.94 | 16
Exoplanet | 10.03 | 24.98 | 2 | 15.23 | 54.32 | 4 | 13.14 | 43.12 | 3
SP500 | 3.56 | 20.66 | 6 | 7.77 | 37.55 | 5 | 5.61 | 35.65 | 6
AEMO | 3.01 | 21.34 | 7 | 7.29 | 38.32 | 5 | 5.62 | 35.71 | 6
Hourly Weather | 30.46 | 156.76 | 5 | 50.49 | 243.99 | 5 | 30.04 | 201.12 | 7
Energy Cons. | 32.14 | 203.45 | 6 | 51.90 | 525.87 | 10 | 45.67 | 435.89 | 10
Elec. Load | 36.70 | 256.89 | 7 | 53.60 | 572.74 | 11 | 51.7 | 532.31 | 10
Stock Prices | 41.30 | 301.23 | 7 | 56.78 | 639.04 | 11 | 52.34 | 621.18 | 12
Temperature | 45.45 | 354.99 | 8 | 62.00 | 678.11 | 11 | 59.32 | 641.09 | 11
and Exoplanet³ describes the change in the light intensity of several thousand stars. Additionally, SP500⁴ records stock prices since 1950, while AEMO⁵ reports the electricity load demand in Australia and hourly weather⁶ contains ≈5 years of temperature measurements. The energy consumption dataset⁷ reports the hourly power consumption data in megawatts, the electricity load dataset⁸ reports the electricity demand at the MT166 and MT257 substations, and the stock prices dataset⁹ consists of historical stock prices for all companies currently on the S&P 500 index. Finally, the temperature dataset¹⁰ reports sensor data collected from a permanent magnet synchronous motor (PMSM) deployed on a test bench, where the PMSM represents a German OEM's prototype model.
7 Experimental Results
7.1 Speedup
Figure 3 illustrates the speedups of Basic-PR-ELM and Opt-PR-ELM for the six architectures, tested against the serial version when the number of hidden neurons M is 50. Opt-PR-ELM was tested with two different configurations: with the number of threads per block (the block size BS) set to 16 and to 32.
Clearly, Basic-PR-ELM and Opt-PR-ELM achieve high speedups, especially as the size of the dataset increases. For instance, for the Elman architecture, Basic-PR-ELM achieves a speedup of 19 on the small Exoplanet dataset, 72 on the medium hourly energy consumption dataset, and up to 207 on the largest dataset (Temperature). Opt-PR-ELM achieves higher speedups, reaching up to 311 with LSTM on the temperature dataset when BS = 16. The speedup increases to 461 when BS increases to 32.
⁷ kaggle.com/selfishgene/historical-hourly-weather-data
⁸ archive.ics.uci.edu/ml/index.php
⁹ kaggle.com/camnugent/sandp500
¹⁰ kaggle.com/wkirgsn/electric-motor-temperature
AN OPTIMIZED PARALLEL IMPLEMENTATION OF .. .
Table 6. Runtime (seconds) of Opt-PR-ELM (BS=32) and the iterative training algorithm and the
ratio= BP
Opt-PR-ELM
Fully Connected LSTM GRU
Opt-
PR-
ELM
P-
BPTT
Ratio Opt-
PR-
ELM
P-
BPTT
Ratio Opt-
PR-
ELM
P-
BPTT
Ratio
Japan pop. 0.23 3.52 15 0.38 7.41 20 0.38 6.59 17
Quebec
Births
0.56 6.75 12 0.85 13.56 16 0.81 12.94 16
Exoplanet 10.03 24.98 2 15.23 54.32 4 13.14 43.12 3
SP500 3.56 20.66 6 7.77 37.55 5 5.61 35.65 6
AEMO 3.01 21.34 7 7.29 38.32 5 5.62 35.71 6
Hourly
Weather
30.46 156.76 5 50.49 243.99 5 30.04 201.12 7
Energy
Cons.
32.14 203.45 6 51.90 525.87 10 45.67 435.89 10
Elec. Load 36.70 256.89 7 53.60 572.74 11 51.7 532.31 10
Stock Prices 41.30 301.23 7 56.78 639.04 11 52.34 621.18 12
Temperature 45.45 354.99 8 62.00 678.11 11 59.32 641.09 11
and Exoplanet3describes the change in the light
intensity of several thousand stars. Additionally,
SP 5004records the stock prices since 1950 while
AEMO5reports the electricity load demand in Aus-
tralia and hourly weather6contains ≈5 years of
temperature measures. The energy consumption
dataset7reports the hourly power consumption data
in megawatts, the electricity load dataset8reports
the electricity demand at the MT166 and MT257
substations and the stock prices dataset9consists of
historical stock prices for all companies currently
on the S&P 500 index. Finally, the temperature
dataset10 reports sensor data collected from a per-
manent magnet synchronous motor (PMSM) de-
ployed on a testbench where PMSM represents a
german OEM’s prototype model.
7 Experimental Results
7.1 Speedup
Figure 3 illustrates the speedups of Basic-PR-
ELM and Opt-PR-ELM for the six architectures
tested against the serial version when the number of
hidden neurons Mis 50. Opt-PR-ELM was tested
with two different configurations: when the number
of threads per block, block size BS, is 16 and 32,
respectively.
Clearly, Basic-PR-ELM and Opt-PR-
ELM achieve high speedups, especially when the
size of the dataset increases. For instance, for
the Elman architecture, Basic-PR-ELM achieves
a speedup of 19 on the small Exoplanet dataset, 72
on the hourly energy consumption medium dataset,
and up to 207 on the largest dataset (Temperature).
Opt-PR-ELM achieves higher speedups that reach
up to 311 with LSTM on the temperature dataset
when BS =16. The speedup increases to 461 when
BS increases to 32.
7kaggle.com/selfishgene/historical-hourly-weather-data
8archive.ics.uci.edu/ ml/index.php
9kaggle.com/camnugent/sandp500
10kaggle.com/wkirgsn/electric-motor-temperature
AN OPTIMIZED PARALLEL IMPLEMENTATION OF .. .
Table 6. Runtime (seconds) of Opt-PR-ELM (BS=32) and the iterative training algorithm and the
ratio= BP
Opt-PR-ELM
Fully Connected LSTM GRU
Opt-
PR-
ELM
P-
BPTT
Ratio Opt-
PR-
ELM
P-
BPTT
Ratio Opt-
PR-
ELM
P-
BPTT
Ratio
Japan pop. 0.23 3.52 15 0.38 7.41 20 0.38 6.59 17
Quebec
Births
0.56 6.75 12 0.85 13.56 16 0.81 12.94 16
Exoplanet 10.03 24.98 2 15.23 54.32 4 13.14 43.12 3
SP500 3.56 20.66 6 7.77 37.55 5 5.61 35.65 6
AEMO 3.01 21.34 7 7.29 38.32 5 5.62 35.71 6
Hourly
Weather
30.46 156.76 5 50.49 243.99 5 30.04 201.12 7
Energy
Cons.
32.14 203.45 6 51.90 525.87 10 45.67 435.89 10
Elec. Load 36.70 256.89 7 53.60 572.74 11 51.7 532.31 10
Stock Prices 41.30 301.23 7 56.78 639.04 11 52.34 621.18 12
Temperature 45.45 354.99 8 62.00 678.11 11 59.32 641.09 11
and Exoplanet3describes the change in the light
intensity of several thousand stars. Additionally,
SP 5004records the stock prices since 1950 while
AEMO5reports the electricity load demand in Aus-
tralia and hourly weather6contains ≈5 years of
temperature measures. The energy consumption
dataset7reports the hourly power consumption data
in megawatts, the electricity load dataset8reports
the electricity demand at the MT166 and MT257
substations and the stock prices dataset9consists of
historical stock prices for all companies currently
on the S&P 500 index. Finally, the temperature
dataset10 reports sensor data collected from a per-
manent magnet synchronous motor (PMSM) de-
ployed on a testbench where PMSM represents a
german OEM’s prototype model.
7 Experimental Results
7.1 Speedup
Figure 3 illustrates the speedups of Basic-PR-ELM and Opt-PR-ELM for the six architectures tested against the serial version when the number of hidden neurons M is 50. Opt-PR-ELM was tested with two different configurations: a number of threads per block (block size BS) of 16 and 32.
Clearly, Basic-PR-ELM and Opt-PR-
ELM achieve high speedups, especially when the
size of the dataset increases. For instance, for
the Elman architecture, Basic-PR-ELM achieves
a speedup of 19 on the small Exoplanet dataset, 72
on the hourly energy consumption medium dataset,
and up to 207 on the largest dataset (Temperature).
Opt-PR-ELM achieves higher speedups that reach up to 311 with LSTM on the temperature dataset when BS = 16. The speedup increases to 461 when BS increases to 32.
^7 kaggle.com/selfishgene/historical-hourly-weather-data
^8 archive.ics.uci.edu/ml/index.php
^9 kaggle.com/camnugent/sandp500
^10 kaggle.com/wkirgsn/electric-motor-temperature
46 Julia El Zini, Yara Rizk and Mariette Awad
However, Opt-PR-ELM does not always
achieve higher speedups. Specifically, Basic-PR-
ELM and Opt-PR-ELM achieve similar speedups
for the Japan population, Quebec births, SP500,
AEMO, energy consumption, and the electricity
load datasets. To investigate these results, we take
a closer look at the characteristics of the datasets.
When Q = 10, a thread computes the dot product between a row of X and a column of W, performing 2 × 10 memory read operations. Consequently, num_tiles will be only 1, and the loop at line 8 of Alg. 3 will be executed only once. In this case, the performance does not improve and might even slightly decrease due to the thread synchronization in Opt-PR-ELM. However, Opt-PR-ELM achieves higher speedups when Q > BS and when BS increases to
32. We notice that the speedup increases with more
complex architectures, LSTM for example, since
these architectures require more computations that
can be better accelerated on a GPU.
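This tile arithmetic can be made explicit: assuming, as the discussion suggests, that each shared-memory tile covers BS elements of the inner dimension Q, the tiled loop runs num_tiles = ceil(Q / BS) times. A small illustrative sketch of this count (not the paper's Alg. 3):

```python
import math

def num_tiles(Q, BS):
    """Number of shared-memory tiles needed to cover the inner dimension Q,
    assuming each tile spans BS elements (tile width = block size)."""
    return math.ceil(Q / BS)

# Q = 10 fits inside a single 16- or 32-wide tile: the tiled loop body runs
# once, so shared-memory reuse buys nothing and the synchronization between
# tile loads is pure overhead.
assert num_tiles(10, 16) == 1 and num_tiles(10, 32) == 1
# Only when Q > BS does the loop iterate and data reuse start to pay off,
# e.g. for Exoplanet, where Q = 5657.
assert num_tiles(5657, 32) == 177
```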
7.2 Scalability
To test the scalability of our approach, we
change the number of hidden neurons M, and we
report the speedup of Opt-PR-ELM (BS=32) for the
different architectures on the various datasets. Figure 4 illustrates that the speedup increases as M increases from 5 to 10, 20, 50, and 100. Specifically, the speedup increases by a factor of 20 when M increases from 5 to 100 with a GRU on the energy consumption dataset. Thus, Opt-PR-ELM scales up
well with more computationally expensive opera-
tions.
7.3 Robustness
Robustness, i.e. repeatability, is a key prop-
erty for Opt-PR-ELM where random initialization
might affect the solution. Moreover, floating-point
computations might differ between the GPU and the
CPU, which might affect the output. To ensure that
such perturbations do not affect the performance of
our parallel algorithm, we run S-R-ELM and Opt-
PR-ELM (BS=32) five times, and we measure their
root mean squared error (RMSE). Table 4 reports
the average RMSE and its standard deviation when
S-R-ELM and Opt-PR-ELM are tested on different
datasets with different RNN architectures. We select M according to the size of the problem; i.e., we used M = 100 for Exoplanet, where Q = 5657; M = 20 for hourly weather, stock prices, and temperature, where Q = 50; and M = 10 for the rest of the datasets, which have Q = 10. Tables 3 and 4 show
that the cases where the RMSE is high correspond
to datasets with large outputs. For instance, having outputs ranging from 0 to 2.06 × 10^9, the electricity load dataset has a higher RMSE than other datasets.
However, S-R-ELM and Opt-PR-ELM achieve ac-
curacies in the same range for different RNN archi-
tectures on all the datasets, which means that GPU
floating-point operations do not have a clear effect
on the performance of our algorithm.
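This repeatability protocol, re-seeding the random weights and summarizing the RMSE over five runs, can be sketched as follows; `train_and_predict` is a hypothetical stand-in for S-R-ELM or Opt-PR-ELM, not the paper's code:

```python
import numpy as np

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def repeatability(train_and_predict, y_true, runs=5):
    """Train with `runs` different random seeds and summarize the RMSE."""
    errors = [rmse(y_true, train_and_predict(seed)) for seed in range(runs)]
    return float(np.mean(errors)), float(np.std(errors))

# toy stand-in for the model: predictions = truth + seeded random perturbation
y = np.linspace(0.0, 1.0, 200)
fake_model = lambda seed: y + np.random.default_rng(seed).normal(0.0, 0.01, y.shape)
mean_err, std_err = repeatability(fake_model, y)
# a small standard deviation across runs indicates the result is repeatable
assert 0.0 < mean_err < 0.05 and std_err < 0.01
```

A low standard deviation across seeds, as in Table 4, is what justifies calling the algorithm robust.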
7.4 Portability
To verify that our algorithm is portable, we
ran Opt-PR-ELM (BS=32) on an NVIDIA Quadro
K2000 GPU while fixing the number of hidden
nodes M at 20. It is important to check for portability to understand how much the proposed algorithm is architecture-dependent. Table 5 shows
that Opt-PR-ELM also achieves high speedups on
the Quadro K2000 GPUs for different RNN archi-
tectures on different datasets, but the speedups on
the Tesla K20m GPU are consistently higher because of the greater computational capability of the latter. The
speedups in Table 5 are reported with respect to the Core i7 CPU with 16 GB of RAM. Speedups with respect to sequential code on an older-generation CPU (Core i5 with 8 GB of RAM) are up to 5 times higher. One can draw the following conclusion: increasing the number of cores of a CPU yields a speedup of at most 5 times, whereas parallelizing the code can yield a speedup of up to 326 with respect to sequential code on Core i7 CPUs and 651 on Core i5 CPUs. A rough estimate of current pricing based on a Google search shows that GPU architectures cost between $500 and $7,000 for the NVIDIA GTX 1080 and Tesla GPUs^11, respectively, while CPU architectures such as the Intel Core i7-9700K with 8 cores cost $400^12. Considering
the aforementioned speedups, one can conclude that
investing in parallel architectures can be more prof-
itable than upgrading the existing CPU architecture,
especially in applications where real-time perfor-
^11 https://www.amazon.com/PNY-TCSV100MPCIE-PB-Nvidia-Tesla-v100/dp/B076P84525
^12 https://www.amazon.com/CPU-Processors-Memory-Computer-Add-Ons/b?ie=UTF8&node=229189
mance and cost efficiency are essential, such as general IoT applications.
7.5 Comparison with Parallel Iterative
RNN Training
Although Opt-PR-ELM achieves high speedups compared to its sequential counterpart S-R-ELM, we need to show that its absolute training time is lower than that of the parallel version of BPTT (P-BPTT) as implemented in [11].
We choose the architectures that [11] implements,
i.e. fully connected, LSTM and GRU, and we re-
port the training time of Opt-PR-ELM (BS=32) and
P-BPTT when M = 10. P-BPTT is trained for 10 epochs with a batch size of 64, mean squared error (MSE) as the loss function, and Adam as the optimizer.
We are interested in the absolute training times of
the two parallel algorithms rather than their speedup
over their sequential versions. Thus, we report the
runtimes of Opt-PR-ELM and P-BPTT algorithms
when tested on the same Tesla K20m GPU and
the ratio between both training times. As Table 6
shows, Opt-PR-ELM runs up to 10x faster than P-
BPTT when tested with LSTM on the energy con-
sumption dataset. Figure 5 illustrates the MSE ver-
sus time for P-BPTT algorithms when tested with
LSTM on the energy consumption dataset with M = 50. For the same dataset and RNN architecture, Opt-PR-ELM reaches an MSE of 2.56 × 10^-3, whereas P-BPTT reaches a lower MSE of 1.4 × 10^-3. However, Opt-PR-ELM took only 57 sec to reach its optimal MSE, whereas P-BPTT took 525 sec to reach its optimal MSE and 340 sec to reach the same MSE (1.1 × 10^-3).
Figure 5. MSE versus time (sec) for P-BPTT
algorithms when tested on the energy consumption
dataset with M=50 and LSTM as architecture
Thus, Opt-PR-ELM could reach the same per-
formance as P-BPTT 6 times faster. The sequential
nature of iterative training explains the results: al-
though one can attempt to parallelize each epoch,
the training needs to be done in a sequence of con-
secutive dependent epochs.
7.6 Opt-PR-ELM Runtime
One can argue that using memory streams or
initializing the random weights on the GPU can
lead to higher speedups. To investigate this, we
study how the runtime of Opt-PR-ELM is decom-
posed between the parameters initialization, data
transfer to and from the GPU and the actual com-
putations for the six architectures. Figure 6 shows what portion of the Opt-PR-ELM runtime each step takes when tested on the energy consumption dataset with M = 50. The initialization
does not appear on the bar because it is less than
0.01% of the total runtime. Moreover, transferring data to the GPU consistently takes more time than the transfer back because the former deals with the following matrices: X ∈ R^(n×S×Q), Y ∈ R^n, W ∈ R^(S×L), α ∈ R^(L×Q), and b ∈ R^L, while the latter only transfers β ∈ R^L. The steps that take the major portion of the time are the computations of H and β. One can conclude that data streams or GPU random initialization will not affect the speedup, since initialization and data transfer are not a bottleneck in Opt-PR-ELM.
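The two dominant steps, computing H and solving for β, amount to a single forward pass through randomly initialized weights followed by a QR-based least-squares solve, with no back-propagation. The CPU sketch below uses an Elman-style recurrence for illustration only; the six architectures each define H differently, and the names and scalings here are assumptions, not the paper's implementation:

```python
import numpy as np

def elm_rnn_train(X, Y, L, seed=0):
    """Non-iterative RNN training sketch: random forward pass to build H,
    then beta from a QR least-squares solve (no back-propagation)."""
    n, S, Q = X.shape
    rng = np.random.default_rng(seed)
    alpha = rng.standard_normal((L, Q)) / np.sqrt(Q)  # random input weights, never trained
    W_rec = rng.standard_normal((L, L)) / np.sqrt(L)  # random recurrent weights, never trained
    b = rng.standard_normal(L)

    H = np.zeros((n, L))                  # hidden representations: the costly "H" step
    for i in range(n):
        h = np.zeros(L)
        for t in range(S):
            h = np.tanh(alpha @ X[i, t] + W_rec @ h + b)
        H[i] = h

    Qf, Rf = np.linalg.qr(H)              # the "beta" step: least squares via QR
    beta = np.linalg.solve(Rf, Qf.T @ Y)
    return beta

# toy run: n=64 sequences of length S=8 with Q=10 features, L=20 hidden neurons
rng = np.random.default_rng(1)
X = rng.standard_normal((64, 8, 10))
Y = X[:, -1, 0]                           # predict the last value of feature 0
beta = elm_rnn_train(X, Y, L=20)
assert beta.shape == (20,)
```

With β in hand, prediction repeats the same forward pass and applies H_new @ β, which is why only H and β dominate the runtime decomposition above.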
Figure 6. Time decomposition of Opt-PR-ELM on
the energy consumption dataset with M=50
8 Conclusion
In this work, we proposed Opt-PR-ELM, a par-
allel version of non-iteratively trained RNNs for
time series prediction. Focusing on six RNN ar-
chitectures: Elman, Jordan, NARMAX, fully con-
nected RNN, LSTM and GRU, we first developed
a basic version of the parallel algorithm. Then,
we studied its memory access patterns to propose
an optimized version that takes advantage of the
shared memory of the GPU. In addition to perform-
ing a theoretical, computational analysis of Opt-PR-
ELM on the various architectures, empirical valida-
tion was performed on 10 publicly available time
series prediction datasets.
Opt-PR-ELM was shown to achieve a speedup
of up to 461 over its sequential version and requires
less time to train than the parallel BPTT by a fac-
tor of 20. Higher speedups are achieved relative to older-generation CPUs, which highlights the importance of investing in high-end parallel architectures, especially in IoT and machine learning applications that require accurate, cost-sensitive, yet efficient solutions.
We further studied the portability and scala-
bility of our proposed algorithm by changing the
GPU architecture and the number of hidden neurons
and reporting the speedup. Opt-PR-ELM showed
higher speedups when the number of computations
increases or the number of launched threads per
block increases. Finally, Opt-PR-ELM was shown
to reach similar accuracies as its sequential version.
Future work includes extending Opt-PR-
ELM to RNNs with multiple layers and investi-
gating its performance on applications that have
multi-dimensional outputs such as machine transla-
tion and speech recognition.
Acknowledgment
This work was supported by the University Re-
search Board at the American University of Beirut.
References
[1] Yoshua Bengio, Patrice Simard, Paolo Frasconi,
et al. Learning long-term dependencies with gradi-
ent descent is difficult. IEEE transactions on neural
networks, 5(2):157–166, 1994.
[2] Stephen A Billings. Nonlinear system identifica-
tion: NARMAX methods in the time, frequency,
and spatio-temporal domains. John Wiley & Sons,
2013.
[3] Armando Blanco, Miguel Delgado, and Maria C
Pegalajar. A real-coded genetic algorithm for train-
ing recurrent neural networks. Neural networks,
14(1):93–105, 2001.
[4] Kyunghyun Cho, Bart Van Merriënboer, Dzmitry
Bahdanau, and Yoshua Bengio. On the properties
of neural machine translation: Encoder-decoder
approaches. arXiv preprint arXiv:1409.1259, 2014.
[5] Kyunghyun Cho, Bart Van Merriënboer, Caglar
Gulcehre, Dzmitry Bahdanau, Fethi Bougares,
Holger Schwenk, and Yoshua Bengio. Learning
phrase representations using RNN encoder-decoder
for statistical machine translation. arXiv preprint
arXiv:1406.1078, 2014.
[6] Jan K Chorowski, Dzmitry Bahdanau, Dmitriy
Serdyuk, Kyunghyun Cho, and Yoshua Bengio.
Attention-based models for speech recognition. In
Advances in neural information processing sys-
tems, pages 577–585, 2015.
[7] Junyoung Chung, Caglar Gulcehre, KyungHyun
Cho, and Yoshua Bengio. Empirical evaluation of
gated recurrent neural networks on sequence mod-
eling. arXiv preprint arXiv:1412.3555, 2014.
[8] Jerome T Connor, R Douglas Martin, and Les E
Atlas. Recurrent neural networks and robust time
series prediction. IEEE transactions on neural net-
works, 5(2):240–254, 1994.
[9] Jeffrey L Elman. Finding structure in time. Cogni-
tive science, 14(2):179–211, 1990.
[10] Ömer Faruk Ertugrul. Forecasting electricity load
by a novel recurrent extreme learning machines ap-
proach. International Journal of Electrical Power &
Energy Systems, 78:429–435, 2016.
[11] Martín Abadi et al. TensorFlow: Large-scale ma-
chine learning on heterogeneous systems, 2015.
Software available from tensorflow.org.
[12] Alex Graves, Navdeep Jaitly, and Abdel-rahman
Mohamed. Hybrid speech recognition with deep bidirectional LSTM. In 2013 IEEE workshop on
automatic speech recognition and understanding,
pages 273–278. IEEE, 2013.
[13] Alex Graves, Abdel-rahman Mohamed, and Geof-
frey Hinton. Speech recognition with deep recur-
rent neural networks. In 2013 IEEE international
conference on acoustics, speech and signal pro-
cessing, pages 6645–6649. IEEE, 2013.
[14] Qing He, Tianfeng Shang, Fuzhen Zhuang, and
Zhongzhi Shi. Parallel extreme learning machine
for regression based on mapreduce. Neurocomput-
ing, 102:52–58, 2013.
[15] Sepp Hochreiter and Jürgen Schmidhuber.
Long short-term memory. Neural computation,
9(8):1735–1780, 1997.
[16] Guang-Bin Huang, Qin-Yu Zhu, Chee-Kheong
Siew, et al. Extreme learning machine: a new learn-
ing scheme of feedforward neural networks. Neural
networks, 2:985–990, 2004.
[17] Shan Huang, Botao Wang, Junhao Qiu, Jitao Yao,
Guoren Wang, and Ge Yu. Parallel ensemble of on-
line sequential extreme learning machine based on
mapreduce. Neurocomputing, 174:352–367, 2016.
[18] Weikuan Jia, Dean Zhao, Yuanjie Zheng, and Su-
juan Hou. A novel optimized ga–elman neural net-
work algorithm. Neural Computing and Applica-
tions, 31(2):449–459, 2019.
[19] Michael I Jordan. Serial order: A parallel dis-
tributed processing approach. In Advances in psy-
chology, volume 121, pages 471–495. Elsevier,
1997.
[20] Viacheslav Khomenko, Oleg Shyshkov, Olga
Radyvonenko, and Kostiantyn Bokhan. Acceler-
ating recurrent neural network training using se-
quence bucketing and multi-gpu data paralleliza-
tion. In IEEE First International Conference on
Data Stream Mining & Processing, pages 100–103.
IEEE, 2016.
[21] Siu Kwan Lam, Antoine Pitrou, and Stanley Seib-
ert. Numba: an LLVM-based Python JIT compiler. In
Proceedings of the second Workshop on the LLVM
Compiler Infrastructure in HPC, pages 1–6. ACM,
2015.
[22] Yann LeCun, Yoshua Bengio, and Geoffrey Hin-
ton. Deep learning. Nature, 521(7553):436, 2015.
[23] Jun Liu, Amir Shahroudy, Dong Xu, and Gang
Wang. Spatio-temporal LSTM with trust gates for 3D
human action recognition. In European Conference
on Computer Vision, pages 816–833. Springer,
2016.
[24] Jun Liu, Gang Wang, Ling-Yu Duan, Kamila Ab-
diyeva, and Alex C Kot. Skeleton-based human ac-
tion recognition with global context-aware atten-
tion LSTM networks. IEEE Transactions on Image
Processing, 27(4):1586–1599, 2017.
[25] James Martens and Ilya Sutskever. Learning recur-
rent neural networks with hessian-free optimiza-
tion. In Proceedings of the 28th International Con-
ference on Machine Learning (ICML-11), pages
1033–1040. Citeseer, 2011.
[26] Travis Oliphant. Guide to NumPy. 01 2006.
[27] Peng Ouyang, Shouyi Yin, and Shaojun Wei. A fast
and power-efficient architecture to parallelize LSTM-based RNN for cognitive intelligence applications. In
Proceedings of the 54th Annual Design Automa-
tion Conference 2017, pages 1–6. ACM, 2017.
[28] Yoh-Han Pao, Gwang-Hoon Park, and Dejan J
Sobajic. Learning and generalization characteris-
tics of the random vector functional-link net. Neu-
rocomputing, 6(2):163–180, 1994.
[29] Jin-Man Park and Jong-Hwan Kim. Online re-
current extreme learning machine and its applica-
tion to time-series prediction. In 2017 International
Joint Conference on Neural Networks (IJCNN),
pages 1983–1990. IEEE, 2017.
[30] Yara Rizk and Mariette Awad. On extreme learn-
ing machines in sequential and time series predic-
tion: A non-iterative and approximate training al-
gorithm for recurrent neural networks. Neurocom-
puting, 325:1–19, 2019.
[31] Jürgen Schmidhuber. Deep learning in neural net-
works: An overview. Neural networks, 61:85–117,
2015.
[32] Wouter F Schmidt, Martin A Kraaijveld, and
Robert PW Duin. Feedforward neural networks
with random weights. In 11th IAPR International
Conference on Pattern Recognition. Vol. II. Con-
ference B: Pattern Recognition Methodology and
Systems, pages 1–4. IEEE, 1992.
[33] Xavier Sierra-Canto, Francisco Madera-Ramirez,
and Victor Uc-Cetina. Parallel training of a back-
propagation neural network using CUDA. In 2010
Ninth International Conference on Machine Learn-
ing and Applications, pages 307–312. IEEE, 2010.
[34] Zhiyuan Tang, Ying Shi, Dong Wang, Yang Feng,
and Shiyue Zhang. Memory visualization for gated
recurrent neural networks in speech recognition. In
2017 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP), pages
2736–2740. IEEE, 2017.
[35] Hubert AB Te Braake and Gerrit Van Straten. Ran-
dom activation weight neural net (rawn) for fast
non-iterative training. Engineering Applications of
Artificial Intelligence, 8(1):71–80, 1995.
[36] Mark Van Heeswijk, Yoan Miche, Erkki Oja,
and Amaury Lendasse. Gpu-accelerated and par-
allelized elm ensembles for large-scale regression.
Neurocomputing, 74(16):2430–2437, 2011.
[37] Botao Wang, Shan Huang, Junhao Qiu, Yu Liu, and
Guoren Wang. Parallel online sequential extreme
learning machine based on mapreduce. Neurocom-
puting, 149:224–232, 2015.
[38] Shang Wang, Yifan Bai, and Gennady Pekhi-
menko. Scaling back-propagation by parallel scan
algorithm. arXiv preprint arXiv:1907.10134, 2019.
[39] Xiaoyu Wang and Yong Huang. Convergence study
in extended kalman filter-based training of recur-
rent neural networks. IEEE Transactions on Neural
Networks, 22(4):588–600, 2011.
[40] Paul J Werbos et al. Backpropagation through time:
what it does and how to do it. Proceedings of the
IEEE, 78(10):1550–1560, 1990.
[41] Ronald J Williams and David Zipser. Gradient-based learning algorithms for recurrent networks and their computational complexity. Backpropagation: Theory, architectures, and applications, 433, 1995.
[42] Yonghui Wu, Mike Schuster, Zhifeng Chen,
Quoc V Le, Mohammad Norouzi, Wolfgang
Macherey, Maxim Krikun, Yuan Cao, Qin Gao,
Klaus Macherey, et al. Google’s neural machine
translation system: Bridging the gap between
human and machine translation. arXiv preprint
arXiv:1609.08144, 2016.
[43] Feng Zhang, Jidong Zhai, Marc Snir, Hai Jin, Hi-
ronori Kasahara, and Mateo Valero. Guest edito-
rial: Special issue on network and parallel com-
puting for emerging architectures and applications,
2019.
[44] Shunlu Zhang, Pavan Gunupudi, and Qi-Jun
Zhang. Parallel back-propagation neural network
training technique using cuda on multiple gpus. In
IEEE MTT-S International Conference on Numer-
ical Electromagnetic and Multiphysics Modeling
and Optimization, pages 1–3. IEEE, 2015.
Julia El Zini is a Ph.D. student en-
rolled in the electrical and computer
engineering department at the Ameri-
can University of Beirut (AUB). She
has received her B.S. and M.S. in com-
puter science from AUB, Lebanon,
in 2015 and 2017, respectively. Her
research interests include distributed
optimization, parallel computing, reinforcement learning, multi-task and transfer learning, and scal-
able machine learning applications.
Yara Rizk obtained her Ph.D. in Elec-
trical and Computer Engineering from
the American University of Beirut
(AUB) in 2018. Prior, she received her
BE in Computer and Communication
Engineering from AUB, Lebanon, in
2012. Her research interests span robotics, multi-agent systems, machine learning, classification, clustering, and artificial intelligence. Rizk attended a technical internship (2013-2014) at Intel in Hillsboro, Oregon, USA, and is an active researcher with multiple peer-reviewed publications.
Mariette Awad obtained her Ph.D. in
Electrical Engineering from the Uni-
versity of Vermont (2007). Her cur-
rent research focuses on HMI, efficient artificial intelligence, applied machine learning, and the Internet of Things. Dr. Awad has received more than 25 grants to support her research, including 2 multidisciplinary multi-million dollar grants from the Qatar National Research Fund (QNRF) and Intel. Her work culminated in a book, Efficient Machine Learning, in 2015, as well as more than 100 conference, book chapter, and journal publications. Prior to her academic position, she was with IBM System and Technology group in Vermont for six years, where her technical leadership and innovative spirit earned her management recognition twice, two business awards, and 10 patents.