Content uploaded by Baptiste Wicht

Author content

All content in this area was uploaded by Baptiste Wicht on Dec 22, 2017

Content may be subject to copyright.

On CPU performance optimization of Restricted

Boltzmann Machine and Convolutional RBM

Baptiste Wicht, Andreas Fischer, and Jean Hennebert

University of Applied Science of Western Switzerland

University of Fribourg, Switzerland

baptiste.wicht@hefr.ch,andreas.fischer@unifr.ch,jean.hennebert@hefr.ch

Abstract. Although Graphics Processing Units (GPUs) seem to cur-

rently be the best platform to train machine learning models, most re-

search laboratories are still only equipped with standard CPU systems.

In this paper, we investigate multiple techniques to speedup the train-

ing of Restricted Boltzmann Machine (RBM) models and Convolutional

RBM (CRBM) models on CPU with the Contrastive Divergence (CD)

algorithm. Experimentally, we show that the proposed techniques can

reduce the training time by up to 30 times for RBM and up to 12 times

for CRBM, on a data set of handwritten digits.

1 Introduction

Although most of the recent research has shown that learning on Graphics Pro-

cessing Units (GPUs) is generally more eﬃcient than training on Central Pro-

cessing Units (CPUs) [13, 14, 20], especially for Convolutional Neural Networks

(CNNs) [7, 9, 16], GPUs are not accessible everywhere. Some researchers may

not have access to them and some laboratories may not want to upgrade their

CPU clusters to GPU clusters. Therefore, it remains important to be able to

train neural networks in reasonable time on machines equipped only with CPUs.

Restricted Boltzmann Machines (RBMs) are old models [19], that resurged

recently to initialize the weights of an Artiﬁcial Neural Network (ANN) [4] or

to extract features from samples [2]. Later on, the model was extended with the

Convolutional RBM (CRBM) [11]. Performance optimization of these models

was investigated on GPUs only [8, 15].

In the present paper, we present several techniques to reduce the training

time of RBM and CRBM models. Techniques such as CPU vectorization, usage

of BLAS kernels and reduction of convolutions to other functions are explored.

To evaluate the performance, several networks are trained on 60’000 images of

handwritten digits from the MNIST data set.

The rest of this paper is organized as follows. The system setup for the

experiments is presented in Section 2. Section 3 presents techniques to speed

up an RBM while optimizations for CRBM are detailed in Section 4. Section 5

covers the training of a DBN. Finally, conclusions are drawn in Section 6.

2 Baptiste Wicht, Andreas Fischer, and Jean Hennebert

2 System Setup

The experiments have been computed on a Gentoo Linux machine with 12Go of

RAM running an Intel R

CoreTM i7-2600 with a frequency of 3.40GHz. The tests

were written in C++ using our own Deep Learning Library (DLL)1and Ex-

pression Templates Library (ETL)2libraries. The programs were compiled with

GNU Compiler Collection (GCC) 4.9. Vector operations are vectorized using

AVX. Intel R

Math Kernel Library (MKL) is used as the BLAS implementation.

The experiments are conducted on the MNIST data set [10]. It contains

grayscale images of handwritten digits, normalized to a size of 28x28 pixels.

Each experiment is done on the 60’000 training images for 5 epochs and the

average time per epoch is used as the ﬁnal result.

3 Restricted Boltzmann Machine

Fig. 1: Graphical representation of the Contrastive Divergence Algorithm. The

algorithm CD-k stops at t=k. Each iteration performs a full Gibbs step.

A Restricted Boltzmann Machine (RBM) [19] is a generative stochastic Ar-

tiﬁcial Neural Network (ANN), developed to learn the probability distribution

of some input. Training an RBM using the algorithm for general Boltzmann

Machines [3] is very slow. Hinton et al. proposed a new technique, Contrastive

Divergence (CD) [4], depicted in Figure 1. It is quite similar to the Stochastic

Gradient Descent method, used to train regular ANNs. It approximates the Log-

Likelihood gradients by minimizing the reconstruction error, thus training the

model into an autoencoder. The algorithm performs a certain number of steps

of Gibbs sampling (CD-n). When the RBM is used as a feature extractor or as

a way of pretraining a Deep Belief Network [6], CD-1 is generally suﬃcient [5].

The original RBM model was designed with binary visible and binary hidden

units (also called a Bernoulli RBM). Several diﬀerent types of units were since

developed (for instance Gaussian, ReLu or Softmax) [5]. This research focuses

1https://github.com/wichtounet/dll/

2https://github.com/wichtounet/etl/

On CPU performance optimization of RBM and CRBM 3

on binary units, but the conclusions stand for all general types of units. Indeed,

only the activation functions would change. The probability activations of visible

and hidden units can be computed as follows:

p(hj= 1|v) = σ(cj+

m

X

i

viWi,j ) (1)

p(vi= 1|h) = σ(bi+

n

X

j

hjWi,j ) (2)

The states of the units are obtained by sampling the activation probabilities.

For binary units, Bernoulli sampling is performed to obtain the states:

sj=(1 if pj>Unif(0,1)

0 otherwise (3)

si=(1 if pi>Unif(0,1)

0 otherwise (4)

From an implementation point of view, an RBM is made of a vector vof m

visible units, a vector hof nhidden units, a matrix Wof weights connecting the

visible and the hidden units, a vector bof mvisible biases and a vector cof n

hidden biases. In practice, the weights are represented as single-precision ﬂoating

point numbers rather than double-precision. Indeed, some single-precision com-

putations can be as much as twice faster than their double-precision counter-

parts. Moreover, the precision is generally more than suﬃcient for CD training.

Algorithm 1 Standard CD-1 algorithm (one sample)

v0= training sample

h0= sample hidden activations from v0

v1= sample visible activations from h0

h1= sample hidden activations from v1

Wpos =v0⊗h0

Wneg =v1⊗h1

∇W=(Wpos −Wneg)

∇b=(v0−v1)

∇c=(h0−h1)

Algorithm 1 describes the CD-1 algorithm for one sample. The same proce-

dure is done for each sample of the data set and is repeated for as many epochs as

necessary. In practice, it is important to note that the hidden activations should

be computed directly from the visible activation probabilities rather than from

the states [5]. Therefore, it is never necessary to compute the states of the visible

4 Baptiste Wicht, Andreas Fischer, and Jean Hennebert

Table 1: Training time for an epoch of RBM training, in seconds. The speedup

is the improvement gained by using BLAS kernels for linear algebra operations.

A B C D

Base 161.99 47.00 167.20 70.78

Base + BLAS 141.91 43.03 114.18 36.12

Speedup 1.14 1.09 1.46 1.95

units during training. Moreover, the last update of the hidden unit is only used

to compute the positive gradients, in which case the probabilities are used rather

than the states. Therefore, it is not necessary to sample the states of the hidden

units for the last update.

In the algorithm and activation formulas, several computation routines are

well-known and can be optimized. The Basic Linear Algebra Subprogams (BLAS)

are a collection of small and highly optimized linear algebra routines. In the ac-

tivation formulas, the sums are simply vector-matrix multiplication and matrix-

vector multiplication. They can be implemented using the SGEMV operation

from BLAS. The outer products to compute the positive and negative gradients

can be implemented using the SGER routine. Finally, the computation of the

visible and hidden biases gradients can be done with the SAXPY operation. For

evaluation, the following networks are trained:

–A: 784 visible units, 500 hidden units

–B: 500 visible units, 500 hidden units

–C: 500 visible units, 2000 hidden units

–D: 2000 visible units, 10 hidden units

Table 1 shows the time, in seconds, necessary to train an epoch of the net-

works. Even if the computations are simple, BLAS operations can bring an

important speedup to CD training, compared to standard implementations of

these operations. For the tested networks, the speedup ranges from 1.09 to 1.95.

The MKL BLAS implementation is highly tuned for Intel processor and each

routine is especially optimized for cache and maximum throughput.

Experimentally, we ﬁnd that more than 75% of the training time is spent

inside the BLAS library, 8% in the sigmoid function and around 7% in random

number generation. The sigmoid time could be optimized further by using an

approximation of the sigmoid function or a vectorized version of the exponential

function. Since this represents only a fraction of the total time, it would only

slightly improve the general training time.

3.1 Mini-Batch training

In practice, CD is rarely performed one element at a time, but rather on a

mini-batch. The data set is split into several mini-batches of the same size. The

gradients are computed for a complete batch before the weights are updated.

Algorithm 2 shows the updated version of CD-1 for mini-batch training.

On CPU performance optimization of RBM and CRBM 5

Algorithm 2 Mini-batch CD-1 algorithm (one mini-batch)

for all v0∈mini-batch do

h0= sample hidden activations from v0

v1= sample visible activations from h0

h1= sample hidden activations from v1

Wpos

+

=v0⊗h0

Wneg

+

=v1⊗h1

end for

∇W=

B(Wpos −Wneg)

∇b=

B(v0−v1)

∇c=

B(h0−h1)

In practice, this could be implemented by accumulating the gradients element

after element. However, it is better to compute the gradients independently for

each element of the mini-batch. This needs more memory to store the intermedi-

ary results for the complete mini-batch. However, this is only for a small portion

of the data set and has the advantage of allowing higher level optimizations of the

loop body. Each iteration being completely independent, this could seem like an

excellent candidate for parallelization. However, this is not the case. Depending

on the dimensions of the matrices, a small speedup can be obtained by computing

each iteration in parallel before aggregating the results sequentially. However,

since most of the time will be spent in memory-bound operations (matrix-vector

multiplication and outer product), there won’t be enough bandwidth for many

cores to process the data in parallel. A better optimization is to compute the

activations and states of the units for a complete mini-batch at once instead of

one sample at time. If we consider has a [B, n] matrix and vas a [B, m] matrix,

they can be computed directly as follows3:

h=σ(repmat(c, B) + v∗W) (5)

v=σ(repmat(b, B)+(W∗hT)T) (6)

This has the great advantage of performing a single large matrix-matrix

multiplication instead of multiple small vector-matrix multiplication. In prac-

tice, this is much more eﬃcient. In that case, the SGEMM operation of the BLAS

library is used to compute the activation probabilities. Moreover, if the matrices

are big enough, it is also possible to use a parallel version of the matrix-matrix

multiplication algorithm. Figure 2 shows the time necessary to train each net-

work with diﬀerent batch sizes. It compares the base version with a hand-crafted

matrix multiplication and the version using BLAS. The parallel BLAS version is

also included in the results. On average the BLAS version is twice faster than the

standard version and the parallel version of BLAS reduces the time by another

factor of two.

3repmat vertically stacks the array Btimes

6 Baptiste Wicht, Andreas Fischer, and Jean Hennebert

Base BLAS BLAS+Threads

8

16

24

32

64

128

256

512

2

4

·104

Time [ms]

A

8

16

24

32

64

128

256

512

1

2

3

·104

B

8

16

24

32

64

128

256

512

0.5

1

1.5·105

Mini-Batch size

Time [ms]

C

8

16

24

32

64

128

256

512

0.5

1

1.5

2

·104

Mini-Batch size

D

Fig. 2: Mini-Batch performance

Generally, increasing the mini-batch size reduces the training time. However,

due to the small output dimension of the network D, the possible speedup is

greatly reduced and larger mini-batch do not provide any substantial improve-

ments. However, a too large mini-batch size may have negative impact on the

classiﬁcation performance of the network since many gradients will be averaged.

On the other hand, a small batch size is generally leading to a more stable con-

vergence. The batch size must be chosen as a trade-oﬀ between training time and

classiﬁcation performance. Moreover, a large mini-batch also increases the need

for the inputs to be shuﬄed prior to each epoch. For MNIST, mini-batch size of

up to 128 samples are still reasonable, but higher mini-batch are increasing the

overall training time, by decreasing the learning each epoch does. To conclude

this section, Table 2 compares the basic implementation and the ﬁnal optimized

version with mini-batch (128 samples) and a threaded BLAS library. Depending

on the network, the optimized version is between 11 and 30 times faster.

On CPU performance optimization of RBM and CRBM 7

Table 2: Final results for standard RBM training, in seconds.

A B C D

Base 161.99 47.00 167.20 70.78

Mini-Batch + BLAS + Threads 5.35 3.94 14.57 4.39

Speedup 30.27 11.92 11.47 16.12

4 Convolutional Restricted Boltzmann Machine

The original RBM model can be extended to the Convolutional Restricted Boltz-

mann Machine (CRBM) [11]. The visible and hidden layers are connected to-

gether by convolution, allowing the model to learn features shared among all lo-

cations of the input, thus improving translation invariance of the model. While

this research focuses on two-dimensional CRBM, one-dimensional CRBM are

also possible, for instance for audio [12] and the model can be adapted for three-

dimensional inputs. Only square inputs and ﬁlters are described here for the sake

of simplicity, but the model is able to handle rectangular inputs and ﬁlters.

A CRBM model has a matrix Vof C×NV×NVvisible units. It has Kgroups

of NH×NHhidden units. There are C×Kconvolutional ﬁlters of dimension

NW×NW(by convolutional properties NW,NV−NH+ 1). There is a single

visible bias cand a vector bof Khidden biases. The notation •vis used to denote

avalid convolution and •fis for a full convolution. A tilde over a matrix ( ˜

A)

is used to indicate that the matrix is ﬂipped horizontally and vertically. For a

network with binary units, the probability activation are computed as follows:

P(hk

ij = 1|vc) = σ((X

c

˜

Wk

c•vvc)ij +bk) (7)

P(vk

cij = 1|h) = σ((X

k

Wk•fhk)ij +c) (8)

A CRBM is trained similarly to an RBM, with an adapted version of formulas

to compute the positive and negative gradients:

Wpos

ck =v0

c•v˜

h0

k(9)

Wneg

ck =v1

c•v˜

h1

k(10)

Training a CRBM requires a large number of convolutions for each epoch.

Indeed, for each sample, there would be 2KC valid convolutions for the gra-

dients, 2KC valid convolutions to compute the hidden activation probabilities

(done twice in CD) and KC full convolutions for the visible units. Contrary

to matrix multiplication, there is no general convolution reference implementa-

tion. The ﬁrst optimization that can be applied is to vectorize the convolution

implementations. Modern processors are able to process several ﬂoating point

operations in one instruction. For instance, AVX instructions process 8 ﬂoats

8 Baptiste Wicht, Andreas Fischer, and Jean Hennebert

at once, while SSE instructions process 4 ﬂoats once. While modern compilers

are able to vectorize simple code, vectorizing complex program must be done

by hand. We vectorized the inner loop of the convolutions, with AVX for large

kernels and SSE for small kernels (smaller than 8 pixels). For evaluation, the

following networks are trained:

–A: 1x28x28 visible units, 40 9x9 ﬁlters

–B: 40x20x20 visible units, 40 5x5 ﬁlters

–C: 40x16x16 visible units, 96 5x5 ﬁlters

–D: 96x12x12 visible units, 8 5x5 ﬁlters

Table 3 shows the time necessary to train the diﬀerent networks and the ob-

tained speedup. Due to the small images and ﬁlters in the measured networks,

the speedups are only very interesting for the ﬁrst layer of the network, with

larger kernel and images. Moreover, the two-dimensional property of the algo-

rithms adds overhead to the vectorized version, reducing the possible speedups.

Table 3: Results for Convolutional RBM training, in seconds.

A B C D

Base 380.37 3013.82 3947.46 338.16

Base + Vectorization 198.21 2174.66 3358.76 295.83

Speedup 1.91 1.38 1.17 1.14

When training the model using mini-batch, it becomes interesting to compute

the gradients of each sample concurrently. For convolutions, there is no simple

technique to compute the gradients of a complete batch, therefore parallelization

inside batches is the best option. Figure 3 shows the results with diﬀerent number

of threads, with a mini-batch of 64. The performance increases almost linearly

with the number of threads until four threads are used and then only slightly

improves with more threads, exhibiting memory-bound computation behaviour.

Since threads on the same core share the same cache, having more threads than

cores does not improve the performance substantially in this case.

4.1 Valid convolution

As seen previously, training a CRBM requires four times more valid convolutions

than full convolutions. Thus, it is extremely important to make it as fast as

possible. By rearranging the image to be convolved, it is possible to reduce a

valid convolution to a vector-matrix multiplication [18]. The general algorithm

is presented in Algorithm 3.

However, because of the memory-ineﬃcient im2col operation, this is experi-

mentally slower than the vectorized version. Nevertheless, since the same image

is convolved with Kﬁlters, the overhead of im2col can be greatly mitigated,

by doing it only once for Kconvolutions. Moreover, the multiple vector-matrix

On CPU performance optimization of RBM and CRBM 9

1

2

3

4

5

6

7

8

100

200

300

Time [s]

A D

1

2

3

4

5

6

7

8

1,000

2,000

3,000 B C

Fig. 3: Parallel performance

Algorithm 3 Convolution C=I•vKwith Matrix Multiplication

W0= reshape( ˜

W , [1, k1k2])

I0= matrix(k1k2, c1c2)

I0= im2col(K, [k1k2])

C=W0∗I0

operations become a single matrix-matrix multiplication. Finally, since the com-

putation of the activation probabilities and the gradients operates on ﬂipped

weights and that ﬂipping is an involution, the computation can be done directly

on the original weights, saving several ﬂipping operations. Table 4 presents the

results obtained when using this optimization for all the valid convolutions on

the parallel version. On average, the training time is divided by two.

Experimentally, the diﬀerence in precision is found to be very small between

the diﬀerent versions and the reference. On average, the average diﬀerence be-

tween the vectorized version and the reference is in the order of 1e−5% and in

the order of 5e−5% for the reduction with matrix multiplication. No diﬀerence

has been observed when training a CRBM with diﬀerent versions of the valid

convolution. This diﬀerence may vary between diﬀerent BLAS implementations.

Table 4: Results for Convolutional RBM training, in seconds.

A B C D

Parallel 46.69 494.52 756.70 68.47

Parallel + Reduction 28.45 241.79 336.56 40.12

Speedup 1.64 2.04 2.24 1.70

4.2 Full convolution

While there are no standard implementation of the full convolution, it is possible

to reduce it to a another algorithm for which there exists eﬃcient implementa-

10 Baptiste Wicht, Andreas Fischer, and Jean Hennebert

tions. Using the convolution theorem, a full convolution can be reduced to a

Fourier transform. Indeed, the convolution in the time domain is equal to the

pointwise multiplication in the frequency domain [1]. Since the image and the

kernel may not be of the same size, it is necessary to pad them with zeroes before

computing their transforms. Algorithm 4 shows the steps used to compute a full

convolution using a Fourier transform.

Algorithm 4 Convolution C=I∗fKwith Fourier Transform

I0= pad(I)

K0= pad(K)

C0=F(I0)· F (I0)

C=F−1(C0)

In practice, this can be implemented using the Fast Fourier Transform (FFT).

There exists some standard and very eﬃcient implementations of the FFT. While

it is not a part of BLAS, the MKL library provides an FFT implementation.

Unfortunately, this is not always faster than a properly vectorized convolu-

tion. Table 5 shows the performance for diﬀerent image and kernel sizes. The

FFT convolution is around 3 times slower for an 16x16 image and a kernel of

5x5, while it is almost 12 times faster for an image of 256x256 and a kernel of

31x31. This shows that using an FFT algorithm to perform the full convolution

can brings very large speedup to the training of a CRBM. However, it is only

really interesting for large models. Another optimization that can be done when

computing the full convolution by FFT is to precompute the Fourier transforms

of the images. Indeed, each image will be convolved several times with diﬀerent

kernels, therefore only one transform per image is necessary. On the evaluated

networks, this does not bring any substantial performance improvements. Only

the network Ahas big enough images and kernels to proﬁt from this, but this

only result in a speedup of an epoch by less than 1%.

Again, the diﬀerence in precision is found to be very small. On average,

the average diﬀerence for the vectorized version is found to be in the order of

1e−4% and in the order of 3e−5% for the FFT reduction. No diﬀerence has been

observed when training a CRBM with diﬀerent versions of the full convolution.

The diﬀerence may vary between diﬀerent FFT implementations.

Table 5: Performance of full convolution by FFT, in milliseconds

Image 12x12 16x16 16x16 28x28 50x50 128x128 128x128 256x256

Kernel 5x5 5x5 9x9 9x9 17x17 17x17 31x31 31x31

Vectorized 4.98 8.16 20.89 49.72 367.78 2010 7139 30787

FFT 11.59 24.49 25.8 46.38 122.43 368.83 1700 2598

Speedup 0.42 0.33 0.83 1.07 3.00 5.45 4.19 11.85

On CPU performance optimization of RBM and CRBM 11

5 Deep Belief Network

A Deep Belief Network (DBN) is a network formed by stacking RBMs on top of

each other. It is pretrained by training each layer with Contrastive Divergence.

Once the ﬁrst layer has been trained, its activation probabilities are computed

for each input sample and these values are taken as the input of the next layer,

and so on until the last layer of the network. A Convolutional DBN (CDBN) is

similar, except that it stacks CRBMs.

Since pretraining a DBN consists in training RBMs with CD, the same op-

timizations discussed in previous sections apply. If there is enough memory, it

is important to keep the entire data set in memory as well as the intermedi-

ate results (the activation probabilities of the previous layer) during training to

maximize the performance. When this is not possible, the best course of action

is to keep a multiple of the mini-batch size of samples in memory (and their in-

termediate output) for training. Ideally, computing the activation probabilities

of the previous layer should be done in a separate thread so that CD always has

data ready for training.

If the network is to be used for classiﬁcation, it can then be ﬁne-tuned using

standard algorithms like for instance Stochastic Gradient Descent, Conjugate

Gradient or Limited-Memory BFGS. Performance of these algorithms is out of

the scope of this paper and has already been studied [17].

6 Conclusion and Future Work

Several techniques were presented to speedup training of RBM and CRBM mod-

els, on a single-CPU system. By using these techniques, RBM’s training time

has been reduced by up to 30 times and CRBM’s training time has been reduced

by up to 12 times. This demonstrates that even on CPU, many techniques can

be used to substantially speedup the training of RBM models and train large

models within reasonable time.

Future work could go in several directions. Combining several full convolu-

tions together and using the FFT reduction could reduce its overhead and allow

better performance even for small kernels. The performance of the vectorized

convolution versions can also be improved further by vectorizing the convolu-

tion at the image level rather than just at the kernel level. Finally, once the large

operations are fully optimized, operations such as sigmoid or Bernoulli sampling

could also be considered for optimization.

References

1. Bracewell, R.: The Fourier transform and its applications. New York 5 (1965)

2. Coates, A., Ng, A.Y., Lee, H.: An analysis of single-layer networks in unsupervised

feature learning. In: Proceedings of the Int. Conf. on Artiﬁcial Intelligence and

Statistics. pp. 215–223 (2011)

12 Baptiste Wicht, Andreas Fischer, and Jean Hennebert

3. Hinton, G.E., Sejnowski, T.J.: Parallel distributed processing: Explorations in the

microstructure of cognition, vol. 1. chap. Learning and Relearning in Boltzmann

Machines, pp. 282–317. MIT Press, Cambridge, MA, USA (1986), http://dl.acm.

org/citation.cfm?id=104279.104291

4. Hinton, G.E.: Training Products of Experts by minimizing Contrastive Divergence.

Neural Computation 14, 1771–1800 (2002)

5. Hinton, G.E.: A practical guide to training Restricted Boltzmann Machines. In:

Neural Networks: Tricks of the Trade, pp. 599–619. Springer (2012)

6. Hinton, G.E., Salakhutdinov, R.R.: Reducing the dimensionality of data with neu-

ral networks. Science 313(5786), 504–507 (2006)

7. Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadar-

rama, S., Darrell, T.: Caﬀe: Convolutional architecture for fast feature embedding.

In: Proceedings of the ACM International Conference on Multimedia. pp. 675–678.

ACM (2014)

8. Krizhevsky, A., Hinton, G.: Convolutional Deep belief Networks on CIFAR-10.

Unpublished manuscript 40 (2010)

9. Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classiﬁcation with deep con-

volutional neural networks. In: Advances in neural information processing systems.

pp. 1097–1105 (2012)

10. LeCun, Y., Bottou, L., Bengio, Y., Haﬀner, P.: Gradient-based learning applied to

document recognition. Proceedings of the IEEE 86(11), 2278–2324 (1998)

11. Lee, H., Grosse, R., Ranganath, R., Ng, A.Y.: Convolutional Deep Belief Networks

for scalable unsupervised learning of hierarchical representations. In: Proceedings

of the Int. Conf. on Machine Learning. pp. 609–616. ACM (2009)

12. Lee, H., Pham, P., Largman, Y., Ng, A.Y.: Unsupervised feature learning for audio

classiﬁcation using CDBNs. In: Proceedings of the Advances in Neural Information

Processing Systems. pp. 1096–1104 (2009)

13. Lee, V.W., Kim, C., Chhugani, J., Deisher, M., Kim, D., Nguyen, A.D., Satish,

N., Smelyanskiy, M., Chennupaty, S., Hammarlund, P., et al.: Debunking the 100x

GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU. In:

ACM SIGARCH Computer Architecture News. vol. 38, pp. 451–460. ACM (2010)

14. Lopes, N., Ribeiro, B.: Towards adaptive learning with improved convergence of

Deep Belief Networks on Graphics Processing Units. Pattern Recognition 47(1),

114–127 (2014)

15. Ly, D.L., Paprotski, V., Yen, D.: Neural networks on GPUs: Restricted Boltz-

mann Machines. see http://www.eecg.toronto.edu/˜moshovos/CUDA08/doku.php

(2008)

16. Mathieu, M., Henaﬀ, M., LeCun, Y.: Fast training of convolutional networks

through ﬀts. arXiv preprint arXiv:1312.5851 (2013)

17. Ngiam, J., Coates, A., Lahiri, A., Prochnow, B., Le, Q.V., Ng, A.Y.: On optimiza-

tion methods for deep learning. In: Proceedings of the 28th International Confer-

ence on Machine Learning (ICML-11). pp. 265–272 (2011)

18. Ren, J.S., Xu, L.: On vectorization of deep convolutional neural networks for vision

tasks. arXiv preprint arXiv:1501.07338 (2015)

19. Smolensky, P.: Information processing in dynamical systems: Foundations of har-

mony theory. Parallel distributed Processing 1, 194–281 (1986)

20. Upadhyaya, S.R.: Parallel approaches to machine learning: A comprehensive sur-

vey. Journal of Parallel and Distributed Computing 73(3), 284–292 (2013)