On CPU performance optimization of Restricted
Boltzmann Machine and Convolutional RBM
Baptiste Wicht, Andreas Fischer, and Jean Hennebert
University of Applied Science of Western Switzerland
University of Fribourg, Switzerland
baptiste.wicht@hefr.ch, andreas.fischer@unifr.ch, jean.hennebert@hefr.ch
Abstract. Although Graphics Processing Units (GPUs) seem to currently be the best platform to train machine learning models, most research laboratories are still only equipped with standard CPU systems. In this paper, we investigate multiple techniques to speed up the training of Restricted Boltzmann Machine (RBM) models and Convolutional RBM (CRBM) models on CPU with the Contrastive Divergence (CD) algorithm. Experimentally, we show that the proposed techniques can reduce the training time by up to 30 times for RBM and up to 12 times for CRBM, on a data set of handwritten digits.
1 Introduction
Although most of the recent research has shown that learning on Graphics Processing Units (GPUs) is generally more efficient than training on Central Processing Units (CPUs) [13, 14, 20], especially for Convolutional Neural Networks (CNNs) [7, 9, 16], GPUs are not accessible everywhere. Some researchers may not have access to them and some laboratories may not want to upgrade their CPU clusters to GPU clusters. Therefore, it remains important to be able to train neural networks in reasonable time on machines equipped only with CPUs.
Restricted Boltzmann Machines (RBMs) are old models [19] that have recently resurged, being used to initialize the weights of an Artificial Neural Network (ANN) [4] or to extract features from samples [2]. Later on, the model was extended into the Convolutional RBM (CRBM) [11]. Performance optimization of these models was previously investigated on GPUs only [8, 15].
In the present paper, we describe several techniques to reduce the training time of RBM and CRBM models. Techniques such as CPU vectorization, usage of BLAS kernels, and the reduction of convolutions to other operations are explored.
To evaluate the performance, several networks are trained on 60’000 images of
handwritten digits from the MNIST data set.
The rest of this paper is organized as follows. The system setup for the
experiments is presented in Section 2. Section 3 presents techniques to speed
up an RBM while optimizations for CRBM are detailed in Section 4. Section 5
covers the training of a Deep Belief Network (DBN). Finally, conclusions are drawn in Section 6.
2 System Setup
The experiments have been run on a Gentoo Linux machine with 12 GB of RAM, running an Intel® Core™ i7-2600 with a frequency of 3.40GHz. The tests were written in C++ using our own Deep Learning Library (DLL)¹ and Expression Templates Library (ETL)² libraries. The programs were compiled with GNU Compiler Collection (GCC) 4.9. Vector operations are vectorized using AVX. The Intel® Math Kernel Library (MKL) is used as the BLAS implementation.
The experiments are conducted on the MNIST data set [10]. It contains
grayscale images of handwritten digits, normalized to a size of 28x28 pixels.
Each experiment is done on the 60’000 training images for 5 epochs and the
average time per epoch is used as the final result.
3 Restricted Boltzmann Machine
Fig. 1: Graphical representation of the Contrastive Divergence Algorithm. The
algorithm CD-k stops at t=k. Each iteration performs a full Gibbs step.
A Restricted Boltzmann Machine (RBM) [19] is a generative stochastic Artificial Neural Network (ANN), developed to learn the probability distribution of some input. Training an RBM using the algorithm for general Boltzmann Machines [3] is very slow. Hinton proposed a faster technique, Contrastive Divergence (CD) [4], depicted in Figure 1. It is quite similar to the Stochastic Gradient Descent method used to train regular ANNs. It approximates the Log-Likelihood gradients by minimizing the reconstruction error, thus training the model as an autoencoder. The algorithm performs a certain number of steps of Gibbs sampling (CD-n). When the RBM is used as a feature extractor or as a way of pretraining a Deep Belief Network [6], CD-1 is generally sufficient [5].
The original RBM model was designed with binary visible and binary hidden units (also called a Bernoulli RBM). Several other types of units have since been developed (for instance Gaussian, ReLU or Softmax) [5].
1https://github.com/wichtounet/dll/
2https://github.com/wichtounet/etl/
This research focuses on binary units, but the conclusions hold for all common types of units, since only the activation functions would change. The activation probabilities of the visible and hidden units are computed as follows:
p(h_j = 1 | v) = σ(c_j + Σ_{i=1}^{m} v_i W_{i,j})    (1)

p(v_i = 1 | h) = σ(b_i + Σ_{j=1}^{n} h_j W_{i,j})    (2)
The states of the units are obtained by sampling the activation probabilities.
For binary units, Bernoulli sampling is performed to obtain the states:
s_j = 1 if p_j > Unif(0, 1), 0 otherwise    (3)

s_i = 1 if p_i > Unif(0, 1), 0 otherwise    (4)
From an implementation point of view, an RBM is made of a vector v of m visible units, a vector h of n hidden units, a matrix W of weights connecting the visible and the hidden units, a vector b of m visible biases and a vector c of n hidden biases. In practice, the weights are represented as single-precision floating point numbers rather than double-precision. Indeed, some single-precision computations can be up to twice as fast as their double-precision counterparts. Moreover, the precision is generally more than sufficient for CD training.
Algorithm 1 Standard CD-1 algorithm (one sample)
  v0 = training sample
  h0 = sample hidden activations from v0
  v1 = sample visible activations from h0
  h1 = sample hidden activations from v1
  W^pos = v0 ⊗ h0
  W^neg = v1 ⊗ h1
  ∆W = ε (W^pos − W^neg)
  ∆b = ε (v0 − v1)
  ∆c = ε (h0 − h1)
Algorithm 1 describes the CD-1 algorithm for one sample (ε denotes the learning rate). The same procedure is done for each sample of the data set and is repeated for as many epochs as necessary. In practice, it is important to note that the hidden activations should be computed directly from the visible activation probabilities rather than from the states [5].
Table 1: Training time for an epoch of RBM training, in seconds. The speedup
is the improvement gained by using BLAS kernels for linear algebra operations.
A B C D
Base 161.99 47.00 167.20 70.78
Base + BLAS 141.91 43.03 114.18 36.12
Speedup 1.14 1.09 1.46 1.95
Therefore, it is never necessary to compute the states of the visible units during training. Moreover, the last update of the hidden units is only used to compute the positive gradients, in which case the probabilities are used rather than the states. Therefore, it is not necessary to sample the states of the hidden units for the last update.
In the algorithm and activation formulas, several computation routines are well-known and can be optimized. The Basic Linear Algebra Subprograms (BLAS) are a collection of small and highly optimized linear algebra routines. In the activation formulas, the sums are simply vector-matrix and matrix-vector multiplications. They can be implemented using the SGEMV operation from BLAS. The outer products to compute the positive and negative gradients can be implemented using the SGER routine. Finally, the computation of the visible and hidden bias gradients can be done with the SAXPY operation; a sketch using these routines is given after the list below. For evaluation, the following networks are trained:
A: 784 visible units, 500 hidden units
B: 500 visible units, 500 hidden units
C: 500 visible units, 2000 hidden units
D: 2000 visible units, 10 hidden units
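To make this mapping concrete, the following C++ sketch implements one CD-1 update for a single sample through the cblas interface: SGEMV for the activations, SGER for the weight gradients and SAXPY for the bias gradients. It is only a sketch under simple assumptions (row-major storage, binary units, illustrative function and variable names); it is not the actual DLL/ETL code.

// Sketch of one CD-1 update with BLAS calls (cblas interface, row-major layout).
// W is an m x n weight matrix, v0 has m elements, the hidden vectors have n elements.
#include <cblas.h>
#include <cmath>
#include <random>
#include <vector>

static float sigmoid(float x) { return 1.0f / (1.0f + std::exp(-x)); }

// h = sigmoid(c + W^T v): one SGEMV call on the transposed weight matrix
static void hidden_activations(const std::vector<float>& W, const std::vector<float>& v,
                               const std::vector<float>& c, std::vector<float>& h,
                               int m, int n) {
    h = c; // start from the hidden biases, SGEMV accumulates into h (beta = 1)
    cblas_sgemv(CblasRowMajor, CblasTrans, m, n, 1.0f, W.data(), n, v.data(), 1, 1.0f, h.data(), 1);
    for (auto& x : h) x = sigmoid(x);
}

// One CD-1 update for a single sample, with learning rate eps
static void cd1_update(std::vector<float>& W, std::vector<float>& b, std::vector<float>& c,
                       const std::vector<float>& v0, int m, int n, float eps, std::mt19937& rng) {
    std::uniform_real_distribution<float> U(0.0f, 1.0f);
    std::vector<float> h0(n), h1(n), s0(n), v1(b);

    hidden_activations(W, v0, c, h0, m, n);
    for (int j = 0; j < n; ++j) s0[j] = h0[j] > U(rng) ? 1.0f : 0.0f; // Bernoulli states

    // v1 = sigmoid(b + W s0): SGEMV without transposition, v1 starts from the visible biases
    cblas_sgemv(CblasRowMajor, CblasNoTrans, m, n, 1.0f, W.data(), n, s0.data(), 1, 1.0f, v1.data(), 1);
    for (auto& x : v1) x = sigmoid(x);

    // last hidden update: probabilities only, no sampling needed
    hidden_activations(W, v1, c, h1, m, n);

    // W += eps * (v0 h0^T - v1 h1^T): two rank-1 updates with SGER
    cblas_sger(CblasRowMajor, m, n,  eps, v0.data(), 1, h0.data(), 1, W.data(), n);
    cblas_sger(CblasRowMajor, m, n, -eps, v1.data(), 1, h1.data(), 1, W.data(), n);

    // bias gradients with SAXPY: b += eps * (v0 - v1), c += eps * (h0 - h1)
    cblas_saxpy(m,  eps, v0.data(), 1, b.data(), 1);
    cblas_saxpy(m, -eps, v1.data(), 1, b.data(), 1);
    cblas_saxpy(n,  eps, h0.data(), 1, c.data(), 1);
    cblas_saxpy(n, -eps, h1.data(), 1, c.data(), 1);
}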
Table 1 shows the time, in seconds, necessary to train one epoch of each network. Even if the computations are simple, BLAS operations can bring an important speedup to CD training compared to straightforward implementations of these operations. For the tested networks, the speedup ranges from 1.09 to 1.95. The MKL BLAS implementation is highly tuned for Intel processors and each routine is especially optimized for cache efficiency and maximum throughput.
Experimentally, we find that more than 75% of the training time is spent inside the BLAS library, 8% in the sigmoid function and around 7% in random number generation. The sigmoid time could be reduced further by using an approximation of the sigmoid function or a vectorized version of the exponential function. Since this represents only a fraction of the total time, it would only slightly improve the overall training time.
3.1 Mini-Batch training
In practice, CD is rarely performed one element at a time, but rather on a
mini-batch. The data set is split into several mini-batches of the same size. The
gradients are computed for a complete batch before the weights are updated.
Algorithm 2 shows the updated version of CD-1 for mini-batch training.
Algorithm 2 Mini-batch CD-1 algorithm (one mini-batch)
  for all v0 in the mini-batch do
    h0 = sample hidden activations from v0
    v1 = sample visible activations from h0
    h1 = sample hidden activations from v1
    W^pos += v0 ⊗ h0
    W^neg += v1 ⊗ h1
  end for
  ∆W = ε/B (W^pos − W^neg)
  ∆b = ε/B Σ (v0 − v1)
  ∆c = ε/B Σ (h0 − h1)
In practice, this could be implemented by accumulating the gradients element after element. However, it is better to compute the gradients independently for each element of the mini-batch. This needs more memory to store the intermediate results for the complete mini-batch. However, this concerns only a small portion of the data set and has the advantage of allowing higher-level optimizations of the loop body. Since each iteration is completely independent, this could seem like an excellent candidate for parallelization. However, this is not the case. Depending on the dimensions of the matrices, a small speedup can be obtained by computing each iteration in parallel before aggregating the results sequentially. However, since most of the time is spent in memory-bound operations (matrix-vector multiplication and outer product), there is not enough memory bandwidth for many cores to process the data in parallel. A better optimization is to compute the activations and states of the units for a complete mini-batch at once instead of one sample at a time. If we consider h as a [B, n] matrix and v as a [B, m] matrix, they can be computed directly as follows³:

h = σ(repmat(c, B) + v·W)    (5)

v = σ(repmat(b, B) + (W·hᵀ)ᵀ)    (6)
This has the great advantage of performing a single large matrix-matrix multiplication instead of multiple small vector-matrix multiplications. In practice, this is much more efficient. In that case, the SGEMM operation of the BLAS library is used to compute the activation probabilities; a sketch of this batched computation is shown below. Moreover, if the matrices are big enough, it is also possible to use a parallel version of the matrix-matrix multiplication algorithm. Figure 2 shows the time necessary to train each network with different batch sizes. It compares the base version with a hand-crafted matrix multiplication and the version using BLAS. The parallel BLAS version is also included in the results. On average, the BLAS version is twice as fast as the standard version and the parallel version of BLAS reduces the time by another factor of two.
³ repmat vertically stacks the array B times
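The following C++ sketch shows this batched computation for the hidden activations of Equation (5): the bias row is replicated B times and a single SGEMM call processes the whole mini-batch. Names and data layout are illustrative assumptions, not the actual DLL/ETL implementation.

// Sketch of the batched hidden activations of Eq. (5) with one SGEMM call (row-major).
// V is B x m (one sample per row), W is m x n, c holds the n hidden biases, H is B x n.
#include <algorithm>
#include <cblas.h>
#include <cmath>
#include <cstddef>
#include <vector>

void batch_hidden_activations(const std::vector<float>& V, const std::vector<float>& W,
                              const std::vector<float>& c, std::vector<float>& H,
                              int B, int m, int n) {
    // H = repmat(c, B): copy the hidden biases into every row of H
    H.resize(static_cast<std::size_t>(B) * n);
    for (int i = 0; i < B; ++i)
        std::copy(c.begin(), c.end(), H.begin() + static_cast<std::size_t>(i) * n);

    // H = 1 * V * W + 1 * H: one large matrix-matrix product for the whole mini-batch
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                B, n, m, 1.0f, V.data(), m, W.data(), n, 1.0f, H.data(), n);

    // element-wise sigmoid
    for (auto& x : H) x = 1.0f / (1.0f + std::exp(-x));
}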
Fig. 2: Mini-Batch performance. Training time per epoch [ms] for networks A, B, C and D as a function of the mini-batch size (8 to 512), comparing the Base, BLAS and BLAS+Threads versions.
Generally, increasing the mini-batch size reduces the training time. However, due to the small output dimension of network D, the possible speedup is greatly reduced and larger mini-batches do not provide any substantial improvements. On the other hand, a mini-batch size that is too large may have a negative impact on the classification performance of the network, since many gradients are averaged, while a small batch size generally leads to a more stable convergence. The batch size must therefore be chosen as a trade-off between training time and classification performance. Moreover, a large mini-batch also increases the need for the inputs to be shuffled prior to each epoch. For MNIST, mini-batch sizes of up to 128 samples are still reasonable, but larger mini-batches increase the overall training time by decreasing the amount of learning performed in each epoch. To conclude this section, Table 2 compares the basic implementation and the final optimized version with mini-batches (128 samples) and a threaded BLAS library. Depending on the network, the optimized version is between 11 and 30 times faster.
Table 2: Final results for standard RBM training, in seconds.
A B C D
Base 161.99 47.00 167.20 70.78
Mini-Batch + BLAS + Threads 5.35 3.94 14.57 4.39
Speedup 30.27 11.92 11.47 16.12
4 Convolutional Restricted Boltzmann Machine
The original RBM model can be extended to the Convolutional Restricted Boltzmann Machine (CRBM) [11]. The visible and hidden layers are connected together by convolution, allowing the model to learn features shared among all locations of the input, thus improving the translation invariance of the model. While this research focuses on two-dimensional CRBMs, one-dimensional CRBMs are also possible, for instance for audio [12], and the model can be adapted to three-dimensional inputs. Only square inputs and filters are described here for the sake of simplicity, but the model is able to handle rectangular inputs and filters.
A CRBM model has a matrix V of C × N_V × N_V visible units. It has K groups of N_H × N_H hidden units. There are C × K convolutional filters of dimension N_W × N_W (by convolutional properties, N_W = N_V − N_H + 1). There is a single visible bias c and a vector b of K hidden biases. The notation ∗v is used to denote a valid convolution and ∗f a full convolution. A tilde over a matrix (Ã) indicates that the matrix is flipped horizontally and vertically. For a network with binary units, the activation probabilities are computed as follows:
P(h^k_{ij} = 1 | v) = σ((Σ_c W̃^k_c ∗v v_c)_{ij} + b_k)    (7)

P(v_{c,ij} = 1 | h) = σ((Σ_k W^k_c ∗f h^k)_{ij} + c)    (8)
A CRBM is trained similarly to an RBM, with an adapted version of the formulas to compute the positive and negative gradients:
W^pos_{c,k} = v^0_c ∗v h̃^0_k    (9)

W^neg_{c,k} = v^1_c ∗v h̃^1_k    (10)
Training a CRBM requires a large number of convolutions for each epoch. Indeed, for each sample, there are 2KC valid convolutions for the gradients, 2KC valid convolutions to compute the hidden activation probabilities (done twice in CD) and KC full convolutions for the visible units. Contrary to matrix multiplication, there is no general reference implementation for convolution. The first optimization that can be applied is to vectorize the convolution implementations. Modern processors are able to process several floating point operations in one instruction. For instance, AVX instructions process 8 floats
at once, while SSE instructions process 4 floats at once. While modern compilers are able to vectorize simple code, vectorizing complex programs must be done by hand. We vectorized the inner loop of the convolutions, with AVX for large kernels and SSE for small kernels (smaller than 8 pixels); a simplified sketch of such a vectorized loop is given after the list below. For evaluation, the following networks are trained:
A: 1x28x28 visible units, 40 9x9 filters
B: 40x20x20 visible units, 40 5x5 filters
C: 40x16x16 visible units, 96 5x5 filters
D: 96x12x12 visible units, 8 5x5 filters
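As an illustration, the following C++ sketch vectorizes the inner loop of a valid convolution with AVX intrinsics. It is a simplified sketch: the kernel is assumed to be pre-flipped by the caller (so the loop itself computes a cross-correlation), kernels smaller than 8 pixels fall entirely into the scalar remainder (whereas our implementation switches to SSE for them), and all names are illustrative.

// Sketch of a valid convolution with the inner loop vectorized using AVX.
// I: n x n input image, K: k x k (pre-flipped) kernel, C: (n-k+1) x (n-k+1) output.
#include <immintrin.h>

void conv2_valid_avx(const float* I, int n, const float* K, int k, float* C) {
    const int c = n - k + 1;
    for (int i = 0; i < c; ++i) {
        for (int j = 0; j < c; ++j) {
            __m256 acc = _mm256_setzero_ps();
            float tail = 0.0f;
            for (int a = 0; a < k; ++a) {
                const float* row_i = I + (i + a) * n + j;
                const float* row_k = K + a * k;
                int b = 0;
                // vectorized part: 8 multiplications and 8 additions per iteration
                for (; b + 8 <= k; b += 8) {
                    __m256 x = _mm256_loadu_ps(row_i + b);
                    __m256 w = _mm256_loadu_ps(row_k + b);
                    acc = _mm256_add_ps(acc, _mm256_mul_ps(x, w));
                }
                // scalar remainder (small kernels fall entirely here in this sketch)
                for (; b < k; ++b) tail += row_i[b] * row_k[b];
            }
            // horizontal sum of the 8 accumulated lanes
            float lanes[8];
            _mm256_storeu_ps(lanes, acc);
            float sum = tail;
            for (float l : lanes) sum += l;
            C[i * c + j] = sum;
        }
    }
}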
Table 3 shows the time necessary to train the different networks and the obtained speedup. Due to the small images and filters in the measured networks, the speedups are substantial only for the first layer of the network, which has larger kernels and images. Moreover, the two-dimensional nature of the algorithms adds overhead to the vectorized version, reducing the possible speedups.
Table 3: Results for Convolutional RBM training, in seconds.
A B C D
Base 380.37 3013.82 3947.46 338.16
Base + Vectorization 198.21 2174.66 3358.76 295.83
Speedup 1.91 1.38 1.17 1.14
When training the model using mini-batches, it becomes interesting to compute the gradients of each sample concurrently. For convolutions, there is no simple technique to compute the gradients of a complete batch at once, therefore parallelization inside batches is the best option. Figure 3 shows the results with different numbers of threads, with a mini-batch size of 64. The performance increases almost linearly with the number of threads until four threads are used, and then only slightly improves with more threads, exhibiting memory-bound behaviour. Since threads on the same core share the same cache, having more threads than cores does not improve the performance substantially in this case.
4.1 Valid convolution
As seen previously, training a CRBM requires four times more valid convolutions than full convolutions. Thus, it is extremely important to make the valid convolution as fast as possible. By rearranging the image to be convolved, it is possible to reduce a valid convolution to a vector-matrix multiplication [18]. The general algorithm is presented in Algorithm 3. However, because of the memory-inefficient im2col operation, this is experimentally slower than the vectorized version. Nevertheless, since the same image is convolved with K filters, the overhead of im2col can be greatly mitigated by performing it only once for the K convolutions.
Fig. 3: Parallel performance. Training time per epoch [s] for networks A and D (left) and B and C (right) as a function of the number of threads (1 to 8).
Algorithm 3 Convolution C = I ∗v K with Matrix Multiplication
  K′ = reshape(K̃, [1, k1·k2])
  I′ = matrix(k1·k2, c1·c2)
  I′ = im2col(I, [k1, k2])
  C = K′ · I′
Moreover, the multiple vector-matrix operations become a single matrix-matrix multiplication. Finally, since the computation of the activation probabilities and the gradients operates on flipped weights and flipping is an involution, the computation can be done directly on the original weights, saving several flipping operations. Table 4 presents the results obtained when using this optimization for all the valid convolutions on the parallel version. On average, the training time is divided by two.
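A possible C++ realization of this reduction is sketched below: the image is rearranged once with im2col and the K valid convolutions are then computed with a single SGEMM call. Layout and names are illustrative assumptions (row-major storage, pre-flipped kernels), not the actual DLL implementation.

// Sketch of the im2col reduction: rearrange the image once, then compute the K valid
// convolutions with one matrix-matrix multiplication.
#include <cblas.h>
#include <cstddef>
#include <vector>

// Rearrange an n x n image so that each column holds one k x k patch.
// Result: (k*k) x (c*c) matrix with c = n - k + 1, stored row-major.
std::vector<float> im2col(const float* I, int n, int k) {
    const int c = n - k + 1;
    std::vector<float> cols(static_cast<std::size_t>(k) * k * c * c);
    for (int a = 0; a < k; ++a)
        for (int b = 0; b < k; ++b)
            for (int i = 0; i < c; ++i)
                for (int j = 0; j < c; ++j)
                    cols[(static_cast<std::size_t>(a) * k + b) * (c * c) + i * c + j] = I[(i + a) * n + (j + b)];
    return cols;
}

// Kf: nk (pre-flipped) kernels of size k x k, flattened row by row into an nk x (k*k) matrix.
// Out: nk x (c*c) matrix, one valid convolution result per row.
void conv2_valid_many(const float* I, int n, const float* Kf, int nk, int k, float* Out) {
    const int c = n - k + 1;
    auto cols = im2col(I, n, k);
    // Out = Kf (nk x k*k) * cols (k*k x c*c): a single matrix-matrix multiplication
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                nk, c * c, k * k, 1.0f, Kf, k * k, cols.data(), c * c, 0.0f, Out, c * c);
}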
Experimentally, the difference in precision between the different versions and the reference is found to be very small. On average, the difference is in the order of 1e−5 % for the vectorized version and in the order of 5e−5 % for the reduction with matrix multiplication. No difference has been observed when training a CRBM with the different versions of the valid convolution. This difference may vary between BLAS implementations.
Table 4: Results for Convolutional RBM training, in seconds.
A B C D
Parallel 46.69 494.52 756.70 68.47
Parallel + Reduction 28.45 241.79 336.56 40.12
Speedup 1.64 2.04 2.24 1.70
4.2 Full convolution
While there is no standard implementation of the full convolution, it is possible to reduce it to another algorithm for which efficient implementations exist.
Using the convolution theorem, a full convolution can be reduced to a
Fourier transform. Indeed, the convolution in the time domain is equal to the
pointwise multiplication in the frequency domain [1]. Since the image and the
kernel may not be of the same size, it is necessary to pad them with zeroes before
computing their transforms. Algorithm 4 shows the steps used to compute a full
convolution using a Fourier transform.
Algorithm 4 Convolution C = I ∗f K with Fourier Transform
  I′ = pad(I)
  K′ = pad(K)
  C′ = F(I′) · F(K′)
  C = F⁻¹(C′)
In practice, this can be implemented using the Fast Fourier Transform (FFT), for which standard and very efficient implementations exist. While the FFT is not part of BLAS, the MKL library provides an implementation.
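For illustration, the following C++ sketch computes a full convolution via the convolution theorem using the FFTW library; our implementation relies on the MKL FFT, and FFTW is only used here because its API is widely known. All names are assumptions of this sketch and the plans are created with FFTW_ESTIMATE for simplicity.

// Sketch of a full convolution via the convolution theorem (FFTW, single precision).
// I is an n x n image, K a k x k kernel, C must hold (n+k-1) x (n+k-1) floats.
#include <cstddef>
#include <fftw3.h>
#include <vector>

void conv2_full_fft(const float* I, int n, const float* K, int k, float* C) {
    const int s = n + k - 1;          // padded size of the full convolution
    const int sc = s * (s / 2 + 1);   // number of complex values of the r2c transform

    // zero-pad both inputs to s x s
    std::vector<float> Ip(static_cast<std::size_t>(s) * s, 0.0f);
    std::vector<float> Kp(static_cast<std::size_t>(s) * s, 0.0f);
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) Ip[i * s + j] = I[i * n + j];
    for (int i = 0; i < k; ++i)
        for (int j = 0; j < k; ++j) Kp[i * s + j] = K[i * k + j];

    fftwf_complex* FI = fftwf_alloc_complex(sc);
    fftwf_complex* FK = fftwf_alloc_complex(sc);

    fftwf_plan pi = fftwf_plan_dft_r2c_2d(s, s, Ip.data(), FI, FFTW_ESTIMATE);
    fftwf_plan pk = fftwf_plan_dft_r2c_2d(s, s, Kp.data(), FK, FFTW_ESTIMATE);
    fftwf_execute(pi);
    fftwf_execute(pk);

    // point-wise complex multiplication in the frequency domain
    for (int i = 0; i < sc; ++i) {
        const float re = FI[i][0] * FK[i][0] - FI[i][1] * FK[i][1];
        const float im = FI[i][0] * FK[i][1] + FI[i][1] * FK[i][0];
        FI[i][0] = re;
        FI[i][1] = im;
    }

    fftwf_plan pc = fftwf_plan_dft_c2r_2d(s, s, FI, C, FFTW_ESTIMATE);
    fftwf_execute(pc);
    for (int i = 0; i < s * s; ++i) C[i] /= static_cast<float>(s) * s; // FFTW is unnormalized

    fftwf_destroy_plan(pi);
    fftwf_destroy_plan(pk);
    fftwf_destroy_plan(pc);
    fftwf_free(FI);
    fftwf_free(FK);
}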
Unfortunately, this is not always faster than a properly vectorized convolution. Table 5 shows the performance for different image and kernel sizes. The FFT convolution is around 3 times slower for a 16x16 image and a 5x5 kernel, while it is almost 12 times faster for a 256x256 image and a 31x31 kernel. This shows that using an FFT algorithm to perform the full convolution can bring a very large speedup to the training of a CRBM, but that it is only really interesting for large models. Another optimization that can be applied when computing the full convolution by FFT is to precompute the Fourier transforms of the images. Indeed, each image is convolved several times with different kernels, therefore only one transform per image is necessary. On the evaluated networks, this does not bring any substantial performance improvement. Only network A has images and kernels big enough to profit from it, and even then the epoch time is reduced by less than 1%.
Again, the difference in precision is found to be very small. On average, the difference is in the order of 1e−4 % for the vectorized version and in the order of 3e−5 % for the FFT reduction. No difference has been observed when training a CRBM with the different versions of the full convolution. The difference may vary between FFT implementations.
Table 5: Performance of full convolution by FFT, in milliseconds
Image 12x12 16x16 16x16 28x28 50x50 128x128 128x128 256x256
Kernel 5x5 5x5 9x9 9x9 17x17 17x17 31x31 31x31
Vectorized 4.98 8.16 20.89 49.72 367.78 2010 7139 30787
FFT 11.59 24.49 25.8 46.38 122.43 368.83 1700 2598
Speedup 0.42 0.33 0.83 1.07 3.00 5.45 4.19 11.85
5 Deep Belief Network
A Deep Belief Network (DBN) is a network formed by stacking RBMs on top of
each other. It is pretrained by training each layer with Contrastive Divergence.
Once the first layer has been trained, its activation probabilities are computed
for each input sample and these values are taken as the input of the next layer,
and so on until the last layer of the network. A Convolutional DBN (CDBN) is
similar, except that it stacks CRBMs.
Since pretraining a DBN consists of training RBMs with CD, the same optimizations discussed in the previous sections apply. If there is enough memory, it is important to keep the entire data set in memory, as well as the intermediate results (the activation probabilities of the previous layer), during training to maximize the performance. When this is not possible, the best course of action is to keep a multiple of the mini-batch size worth of samples (and their intermediate outputs) in memory for training. Ideally, computing the activation probabilities of the previous layer should be done in a separate thread so that CD always has data ready for training; a sketch of such a pipeline is shown below.
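As an illustration of such a pipeline, the following C++ sketch uses a bounded queue and a producer thread: the producer computes the previous-layer activation probabilities for upcoming mini-batches while the main thread runs CD on the current one. The queue, the function pointers and the capacity are illustrative assumptions, not the actual DLL implementation.

// Sketch of a prefetching pipeline: one thread prepares the input of the next layer
// while the main thread performs CD on the current mini-batch.
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <optional>
#include <queue>
#include <thread>
#include <vector>

using Batch = std::vector<float>;

class BatchQueue {
    std::queue<Batch> q;
    std::mutex m;
    std::condition_variable cv;
    bool done = false;
    static constexpr std::size_t capacity = 4; // keep only a few batches in memory
public:
    void push(Batch b) {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [&] { return q.size() < capacity; });
        q.push(std::move(b));
        cv.notify_all();
    }
    std::optional<Batch> pop() {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [&] { return !q.empty() || done; });
        if (q.empty()) return std::nullopt;
        Batch b = std::move(q.front());
        q.pop();
        cv.notify_all();
        return b;
    }
    void finish() {
        std::lock_guard<std::mutex> lock(m);
        done = true;
        cv.notify_all();
    }
};

// previous_layer_forward and cd_step are placeholders for the real computations
void pretrain_layer(const std::vector<Batch>& dataset,
                    Batch (*previous_layer_forward)(const Batch&),
                    void (*cd_step)(const Batch&)) {
    BatchQueue queue;
    std::thread producer([&] {
        for (const auto& raw : dataset) queue.push(previous_layer_forward(raw));
        queue.finish();
    });
    while (auto batch = queue.pop()) cd_step(*batch); // CD always has data ready
    producer.join();
}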
If the network is to be used for classification, it can then be fine-tuned using standard algorithms such as Stochastic Gradient Descent, Conjugate Gradient or Limited-Memory BFGS. The performance of these algorithms is out of the scope of this paper and has already been studied [17].
6 Conclusion and Future Work
Several techniques were presented to speed up the training of RBM and CRBM models on a single-CPU system. By using these techniques, the RBM training time has been reduced by up to 30 times and the CRBM training time has been reduced by up to 12 times. This demonstrates that, even on CPU, many techniques can be used to substantially speed up the training of RBM models and to train large models within reasonable time.
Future work could go in several directions. Combining several full convolutions together and using the FFT reduction could amortize its overhead and allow better performance even for small kernels. The performance of the vectorized convolution versions could also be improved further by vectorizing the convolution at the image level rather than just at the kernel level. Finally, once the large operations are fully optimized, operations such as the sigmoid or Bernoulli sampling could also be considered for optimization.
References
1. Bracewell, R.: The Fourier Transform and its Applications. New York 5 (1965)
2. Coates, A., Ng, A.Y., Lee, H.: An analysis of single-layer networks in unsupervised feature learning. In: Proceedings of the Int. Conf. on Artificial Intelligence and Statistics. pp. 215–223 (2011)
3. Hinton, G.E., Sejnowski, T.J.: Parallel distributed processing: Explorations in the microstructure of cognition, vol. 1, chap. Learning and Relearning in Boltzmann Machines, pp. 282–317. MIT Press, Cambridge, MA, USA (1986), http://dl.acm.org/citation.cfm?id=104279.104291
4. Hinton, G.E.: Training Products of Experts by minimizing Contrastive Divergence. Neural Computation 14, 1771–1800 (2002)
5. Hinton, G.E.: A practical guide to training Restricted Boltzmann Machines. In: Neural Networks: Tricks of the Trade, pp. 599–619. Springer (2012)
6. Hinton, G.E., Salakhutdinov, R.R.: Reducing the dimensionality of data with neural networks. Science 313(5786), 504–507 (2006)
7. Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., Darrell, T.: Caffe: Convolutional architecture for fast feature embedding. In: Proceedings of the ACM International Conference on Multimedia. pp. 675–678. ACM (2014)
8. Krizhevsky, A., Hinton, G.: Convolutional Deep Belief Networks on CIFAR-10. Unpublished manuscript 40 (2010)
9. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems. pp. 1097–1105 (2012)
10. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11), 2278–2324 (1998)
11. Lee, H., Grosse, R., Ranganath, R., Ng, A.Y.: Convolutional Deep Belief Networks for scalable unsupervised learning of hierarchical representations. In: Proceedings of the Int. Conf. on Machine Learning. pp. 609–616. ACM (2009)
12. Lee, H., Pham, P., Largman, Y., Ng, A.Y.: Unsupervised feature learning for audio classification using CDBNs. In: Proceedings of the Advances in Neural Information Processing Systems. pp. 1096–1104 (2009)
13. Lee, V.W., Kim, C., Chhugani, J., Deisher, M., Kim, D., Nguyen, A.D., Satish, N., Smelyanskiy, M., Chennupaty, S., Hammarlund, P., et al.: Debunking the 100x GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU. In: ACM SIGARCH Computer Architecture News. vol. 38, pp. 451–460. ACM (2010)
14. Lopes, N., Ribeiro, B.: Towards adaptive learning with improved convergence of Deep Belief Networks on Graphics Processing Units. Pattern Recognition 47(1), 114–127 (2014)
15. Ly, D.L., Paprotski, V., Yen, D.: Neural networks on GPUs: Restricted Boltzmann Machines. see http://www.eecg.toronto.edu/~moshovos/CUDA08/doku.php (2008)
16. Mathieu, M., Henaff, M., LeCun, Y.: Fast training of convolutional networks through FFTs. arXiv preprint arXiv:1312.5851 (2013)
17. Ngiam, J., Coates, A., Lahiri, A., Prochnow, B., Le, Q.V., Ng, A.Y.: On optimization methods for deep learning. In: Proceedings of the 28th International Conference on Machine Learning (ICML-11). pp. 265–272 (2011)
18. Ren, J.S., Xu, L.: On vectorization of deep convolutional neural networks for vision tasks. arXiv preprint arXiv:1501.07338 (2015)
19. Smolensky, P.: Information processing in dynamical systems: Foundations of harmony theory. Parallel Distributed Processing 1, 194–281 (1986)
20. Upadhyaya, S.R.: Parallel approaches to machine learning: A comprehensive survey. Journal of Parallel and Distributed Computing 73(3), 284–292 (2013)