
Theano: A CPU and GPU Math Compiler in Python

James Bergstra, Olivier Breuleux, Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, Guillaume Desjardins, Joseph Turian, David Warde-Farley, Yoshua Bengio

The corresponding author is with Université de Montréal, e-mail: james.bergstra@umontreal.ca.

Abstract—Theano is a compiler for mathematical expressions in Python that combines the convenience of NumPy's syntax with the speed of optimized native machine language. The user composes mathematical expressions in a high-level description that mimics NumPy's syntax and semantics, while being statically typed and functional (as opposed to imperative). These expressions allow Theano to provide symbolic differentiation. Before performing computation, Theano optimizes the choice of expressions, translates them into C++ (or CUDA for GPU), and compiles them into dynamically loaded Python modules, all automatically. Common machine learning algorithms implemented with Theano are from 1.6× to 7.5× faster than competitive alternatives (including those implemented with C/C++, NumPy/SciPy and MATLAB) when compiled for the CPU, and between 6.5× and 44× faster when compiled for the GPU. This paper illustrates how to use Theano, outlines the scope of the compiler, provides benchmarks on both CPU and GPU processors, and explains its overall design.

Introduction

Python is a powerful and flexible language for describing large-scale mathematical calculations, but the Python interpreter is in many cases a poor engine for executing them. One reason is that Python uses full-fledged Python objects on the heap to represent simple numeric scalars. To reduce the overhead in numeric calculations, it is important to use array types such as NumPy's ndarray so that single Python objects on the heap can stand for multidimensional arrays of numeric scalars, each stored efficiently in the host processor's native format.

[NumPy] provides an N-dimensional array data type, and many functions for indexing, reshaping, and performing elementary computations (exp, log, sin, etc.) on entire arrays at once. These functions are implemented in C for use within Python programs. However, the composition of many such NumPy functions can be unnecessarily slow when each call is dominated by the cost of transferring memory rather than the cost of performing calculations [Alted]. [numexpr] goes one step further by providing a loop fusion optimization that can glue several element-wise computations together. Unfortunately, numexpr requires an unusual syntax (the expression must be encoded as a string within the code), and at the time of this writing, numexpr is limited to optimizing element-wise computations. [Cython] and [scipy.weave] address Python's performance issue by offering a simple way to hand-write crucial segments of code in C (or a dialect of Python which can be easily compiled to C, in Cython's case). While this approach can yield significant speed gains, it is labor-intensive: if the bottleneck of a program is a large mathematical expression comprising hundreds of elementary operations, manual program optimization can be time-consuming and error-prone, making an automated approach to performance optimization highly desirable.
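
To make the contrast concrete, the following short sketch (array names and sizes are illustrative, not taken from the paper) shows numexpr's string-based syntax next to the equivalent NumPy expression:

import numpy as np
import numexpr as ne

a = np.random.rand(1000000)
b = np.random.rand(1000000)

# Plain NumPy: each operator allocates a temporary array and makes a
# separate pass over memory.
r_numpy = 2 * a + 3 * b

# numexpr: the whole formula is passed as a string and evaluated in a
# single fused loop over the operands, avoiding the temporaries.
r_ne = ne.evaluate("2 * a + 3 * b")

assert np.allclose(r_numpy, r_ne)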

Theano, on the other hand, works on a symbolic representation of mathematical expressions, provided by the user in a NumPy-like syntax. Access to the full computational graph of an expression opens the door to advanced features such as symbolic differentiation of complex expressions, but more importantly allows Theano to perform local graph transformations that can correct many unnecessary, slow, or numerically unstable expression patterns. Once optimized, the same graph can be used to generate CPU as well as GPU implementations (the latter using CUDA) without requiring changes to user code.

Theano is similar to [SymPy], in that both libraries manipulate symbolic mathematical graphs, but the two projects have a distinctly different focus. While SymPy implements a richer set of mathematical operations of the kind expected in a modern computer algebra system, Theano focuses on fast, efficient evaluation of primarily array-valued expressions.

Theano is free open source software, licensed under the New (3-clause) BSD license. It depends upon NumPy, and can optionally use SciPy. Theano includes many custom C and CUDA code generators which are able to specialize for particular types, sizes, and shapes of inputs; leveraging these code generators requires gcc (CPU) and nvcc (GPU) compilers, respectively. Theano can be extended with custom graph expressions, which can leverage scipy.weave, PyCUDA, Cython, and other numerical libraries and compilation technologies at the user's discretion. Theano has been actively and continuously developed and used since January 2008. It has been used in the preparation of numerous scientific papers and as a teaching platform for machine learning in graduate courses at l'Université de Montréal. Documentation and installation instructions can be found on Theano's website [theano]. All Theano users should subscribe to the announce mailing list [1] (low traffic). There are medium-traffic mailing lists for developer discussion [2] and user support [3].

This paper is divided as follows: Case Study: Logistic Regression shows how Theano can be used to solve a simple problem in statistical prediction. Benchmarking Results presents some results of performance benchmarking on problems related to machine learning and expression evaluation. What kinds of work does Theano support? and Compilation by theano.function give an overview of what Theano provides and how it is designed. Limitations and Future Work outlines current limitations of our implementation and currently planned additions to Theano.

[1] http://groups.google.com/group/theano-announce
[2] http://groups.google.com/group/theano-dev
[3] http://groups.google.com/group/theano-users


Case Study: Logistic Regression

To get a sense of how Theano feels from a user's perspective, we will look at how to solve a binary logistic regression problem. Binary logistic regression is a classification model parameterized by a weight matrix W and bias vector b. The model estimates the probability P(Y=1|x) (which we will denote with shorthand p) that the input x belongs to class y=1 as:

P(Y=1 \mid x^{(i)}) = p^{(i)} = \frac{e^{W x^{(i)} + b}}{1 + e^{W x^{(i)} + b}}    (1)

The goal is to optimize the log probability of N training examples, D = {(x^{(i)}, y^{(i)}), 0 < i ≤ N}, with respect to W and b. To maximize the log likelihood we will instead minimize the (average) negative log likelihood [4]:

\ell(W,b) = -\frac{1}{N} \sum_i y^{(i)} \log p^{(i)} + (1 - y^{(i)}) \log(1 - p^{(i)})    (2)

To make it a bit more interesting, we can also include an ℓ2 penalty on W, giving a cost function E(W,b) defined as:

E(W,b) = \ell(W,b) + 0.01 \sum_i \sum_j w_{ij}^2    (3)
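
As a point of reference, the quantities in Eqs. (1)-(3) can be written directly in NumPy; the sketch below (array shapes and names are illustrative, not taken from the paper) computes p, the negative log likelihood, and the regularized cost for given parameters:

import numpy as np

def cost(W, b, X, y, penalty=0.01):
    # Eq. (1): predicted probability for each example (each row of X)
    p = 1.0 / (1.0 + np.exp(-(np.dot(X, W) + b)))
    # Eq. (2): average negative log likelihood over the examples
    nll = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
    # Eq. (3): add the l2 penalty on W
    return nll + penalty * np.sum(W ** 2)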

In this example, tuning parameters W and b will be done through stochastic gradient descent (SGD) on E(W,b). Stochastic gradient descent is a method for minimizing a differentiable loss function which is the expectation of some per-example loss over a set of training examples. SGD estimates this expectation with an average over one or several examples and performs a step in the approximate direction of steepest descent. Though more sophisticated algorithms for numerical optimization exist, in particular for smooth convex functions such as E(W,b), stochastic gradient descent remains the method of choice when the number of training examples is too large to fit in memory, or in the setting where training examples arrive in a continuous stream. Even with relatively manageable dataset sizes, SGD can be particularly advantageous for non-convex loss functions (such as those explored in Benchmarking Results), where the stochasticity can allow the optimizer to escape shallow local minima [Bottou].

According to the SGD algorithm, the update on W is

W \leftarrow W - \mu \frac{1}{N'} \sum_i \left. \frac{\partial E(W,b,x,y)}{\partial W} \right|_{x=x^{(i)},\, y=y^{(i)}},    (4)

where µ = 0.1 is the step size and N' is the number of examples with which we will approximate the gradient (i.e. the number of rows of x). The update on b is likewise

b \leftarrow b - \mu \frac{1}{N'} \sum_i \left. \frac{\partial E(W,b,x,y)}{\partial b} \right|_{x=x^{(i)},\, y=y^{(i)}}.    (5)
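
For concreteness, a plain-NumPy sketch of one update of Eqs. (4)-(5) might look like the following (the closed-form gradients are derived by hand here; with Theano, as the listings below show, this differentiation is done symbolically):

import numpy as np

def sgd_step(W, b, X, y, mu=0.1, penalty=0.01):
    N = X.shape[0]
    p = 1.0 / (1.0 + np.exp(-(np.dot(X, W) + b)))
    # Hand-derived gradients of Eq. (3) with respect to W and b
    gW = np.dot(X.T, p - y) / N + 2 * penalty * W
    gb = np.mean(p - y)
    # Eqs. (4) and (5): one step of stochastic gradient descent
    return W - mu * gW, b - mu * gb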

Implementing this minimization procedure in Theano involves the following four conceptual steps: (1) declaring symbolic variables, (2) using these variables to build a symbolic expression graph, (3) compiling Theano functions, and (4) calling said functions to perform numerical computations. The code listings in Figures 1-4 illustrate these steps with a working program that fits a logistic regression model to random data.

[4] Taking the mean in this fashion decouples the choice of the regularization coefficient and the stochastic gradient step size from the number of training examples.

 1: import numpy
 2: import theano.tensor as T
 3: from theano import shared, function
 4:
 5: x = T.matrix()
 6: y = T.lvector()
 7: w = shared(numpy.random.randn(100))
 8: b = shared(numpy.zeros(()))
 9: print "Initial model:"
10: print w.get_value(), b.get_value()

Figure 1: Logistic regression, part 1: declaring variables.

The code in Figure 1 declares four symbolic variables x, y, w, and b to represent the data and parameters of the model. Each tensor variable is strictly typed to include its data type, its number of dimensions, and the dimensions along which it may broadcast (like NumPy's broadcasting) in element-wise expressions. The variable x is a matrix of the default data type (float64), and y is a vector of type long (or int64). Each row of x will store an example x^{(i)}, and each element of y will store the corresponding label y^{(i)}. The number of examples to use at once represents a tradeoff between computational and statistical efficiency.

The shared() function creates shared variables for W and b and assigns them initial values. Shared variables behave much like other Theano variables, with the exception that they also have a persistent value. A shared variable's value is maintained throughout the execution of the program and can be accessed with .get_value() and .set_value(), as shown in line 10.
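
For instance, the parameters can be re-initialized at any point between function calls without rebuilding or recompiling anything; the snippet below is a minimal sketch using only the get_value/set_value accessors mentioned above:

# Reset the persistent values of the shared variables declared in
# Figure 1; the symbolic graph and any compiled functions are unchanged.
w.set_value(numpy.zeros(100))
b.set_value(0.0)
print w.get_value().sum(), b.get_value()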

11: p_1 = 1 / (1 + T.exp(-T.dot(x, w) - b))
12: xent = -y * T.log(p_1) - (1-y) * T.log(1-p_1)
13: cost = xent.mean() + 0.01 * (w ** 2).sum()
14: gw, gb = T.grad(cost, [w, b])
15: prediction = p_1 > 0.5

Figure 2: Logistic regression, part 2: the computation graph.

The code in Figure 2 specifies the computational graph required to perform stochastic gradient descent on the parameters of our cost function. Since Theano's interface shares much in common with that of NumPy, lines 11-15 should be self-explanatory for anyone familiar with that module. On line 11, we start by defining the symbolic variable p_1 as P(Y=1|x^{(i)}) of Eq. (1). Notice that the matrix multiplication and element-wise exponential functions are simply called via the T.dot and T.exp functions, analogous to numpy.dot and numpy.exp. xent defines the cross-entropy loss function, which is then combined with the ℓ2 penalty on line 13 to form the cost function of Eq. (3), denoted by cost.

Line 14 is crucial to our implementation of SGD, as it performs symbolic differentiation of the scalar-valued cost variable with respect to variables w and b. T.grad operates by iterating backwards over the expression graph, applying the chain rule of differentiation and building symbolic expressions for the gradients on w and b. As such, gw and gb are also symbolic Theano variables, representing ∂E/∂W and ∂E/∂b respectively. Finally, line 15 defines the actual prediction (prediction) of the logistic regression by thresholding P(Y=1|x^{(i)}).

16: predict = function(inputs=[x],
17:                    outputs=prediction)
18: train = function(
19:     inputs=[x, y],
20:     outputs=[prediction, xent],
21:     updates={w: w - 0.1*gw, b: b - 0.1*gb})

Figure 3: Logistic regression, part 3: compilation.

The code of Figure 3 creates the two functions required to train and test our logistic regression model. Theano functions are callable objects that compute zero or more outputs from values given for one or more symbolic inputs. For example, the predict function computes and returns the value of prediction for a given value of x. Parameters w and b are passed implicitly: all shared variables are available as inputs to all functions as a convenience to the user.

Line 18 (Figure 3), which creates the train function, highlights two other important features of Theano functions: the potential for multiple outputs and updates. In our example, train computes both the prediction (prediction) of the classifier as well as the cross-entropy error function (xent). Computing both outputs together is computationally efficient since it allows for the reuse of intermediate computations, such as dot(x,w). The optional updates parameter enables functions to have side-effects on shared variables. The updates argument is a dictionary which specifies how shared variables should be updated after all other computation for the function takes place, just before the function returns. In our example, calling the train function will update the parameters w and b with new values as per the SGD algorithm.
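
Because compilation is decoupled from graph construction, the same symbolic graph can be reused to compile additional functions. For example, a purely illustrative monitoring function for the regularized cost of Eq. (3) could be compiled alongside train and predict:

# A side-effect-free function that only evaluates the regularized cost;
# it shares the graph (and the shared variables w and b) with train
# and predict from Figure 3.
monitor_cost = function(inputs=[x, y], outputs=cost)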

22: N = 4
23: feats = 100
24: D = (numpy.random.randn(N, feats),
25:      numpy.random.randint(size=N, low=0, high=2))
26: training_steps = 10
27: for i in range(training_steps):
28:     pred, err = train(D[0], D[1])
29: print "Final model:",
30: print w.get_value(), b.get_value()
31: print "target values for D", D[1]
32: print "prediction on D", predict(D[0])

Figure 4: Logistic regression, part 4: computation.

Our example concludes (Figure 4) by using the functions train and predict to fit the logistic regression model. Our data D in this example is just four random vectors and labels. Repeatedly calling the train function (lines 27-28) fits our parameters to the data. Note that calling a Theano function is no different than calling a standard Python function: the graph transformations, optimizations, compilation and calling of efficient C functions (whether targeted for the CPU or GPU) have all been done under the hood. The arguments and return values of these functions are NumPy ndarray objects that interoperate normally with other scientific Python libraries and tools.
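
Since predict returns an ordinary NumPy array, post-processing needs no special support; for instance, a quick (illustrative) accuracy check against the random labels is just NumPy code:

# Fraction of the four training examples classified correctly; ordinary
# NumPy operations apply directly to the function's return value.
accuracy = (predict(D[0]) == D[1]).mean()
print "training accuracy:", accuracy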

Figure 5: Fitting a multi-layer perceptron to simulated data with various implementations of stochastic gradient descent. These models have 784 inputs, 500 hidden units, a 10-way classification, and are trained 60 examples at a time.

Benchmarking Results

Theano was developed to simplify the implementation of complex high-performance machine learning algorithms. This section presents performance in two processor-intensive tasks from that domain: training a multi-layer perceptron (MLP) and training a convolutional network. We chose these architectures because of their popularity in the machine learning community and their different computational demands. Large matrix-matrix multiplications dominate in the MLP example, and two-dimensional image convolutions with small kernels are the major bottleneck in a convolutional network. More information about these models and their associated learning algorithms is available from the Deep Learning Tutorials [DLT]. The implementations used in these benchmarks are available online [dlb].

CPU timing was carried out on an Intel(R) Core(TM)2 Duo CPU E8500 @ 3.16GHz with 2 GB of RAM. All implementations were linked against the BLAS implemented in the Intel Math Kernel Library, version 10.2.4.032, and allowed to use only one thread. GPU timing was done on a GeForce GTX 285. CPU computations were done at double precision, whereas GPU computations were done at single precision.

Our first benchmark involves training a single-layer MLP by stochastic gradient descent. Each implementation repeatedly carried out the following steps: (1) multiply 60 784-element input vectors by a 784×500 weight matrix, (2) apply an element-wise hyperbolic tangent operator (tanh) to the result, (3) multiply the result of the tanh operation by a 500×10 matrix, (4) classify the result using a multi-class generalization of logistic regression, (5) compute the gradient by performing similar calculations but in reverse, and finally (6) add the gradients to the parameters. This program stresses element-wise computations and the use of BLAS routines.

Figure 5 compares the number of examples processed per second across different implementations. We compared Theano (revision #ec057beb6c) against NumPy 1.4.1, MATLAB 7.9.0.529, and Torch 5 (a machine learning library written in C/C++) [torch5] on the CPU, and GPUmat 0.25 for MATLAB [gpumat] on the GPU.

Figure 6: Fitting a convolutional network using different software. The benchmark stresses convolutions of medium-sized (256 by 256) images with small (7 by 7) filters.

When running on the CPU, Theano is 1.8× faster than NumPy, 1.6× faster than MATLAB, and 7.5× faster than Torch 5 [5]. Theano's speed increases 5.8× on the GPU from the CPU, a total increase of 11× over NumPy (CPU) and 44× over Torch 5 (CPU). GPUmat brings about a speed increase of only 1.4× when switching to the GPU for the MATLAB implementation, far less than the 5.8× increase Theano achieves through CUDA specializations.

Because of the difficulty in implementing efficient convolutional networks, we only benchmark against known libraries that offer a pre-existing implementation. We compare against EBLearn [EBL] and Torch, two libraries written in C++. EBLearn was implemented by Yann LeCun's lab at NYU, which has done extensive research in convolutional networks. To put these results into perspective, we implemented approximately half (no gradient calculation) of the algorithm using SciPy's signal.convolve2d function. This benchmark uses convolutions of medium-sized images (256×256) with small filters (7×7). Figure 6 compares the performance of Theano (both CPU and GPU) with that of competing implementations.

On the CPU, Theano is 2.2× faster than EBLearn, its best competitor. This advantage is owed to the fact that Theano compiles more specialized convolution routines. Theano's speed increases 4.9× on the GPU from the CPU, a total of 10.7× over EBLearn (CPU). On the CPU, Theano is 5.8× faster than SciPy even though SciPy is doing only half the computations. This is because SciPy's convolution routine has not been optimized for this application.

We also compared Theano with numexpr and NumPy for evaluating element-wise expressions on the CPU (Figure 7).

[5] Torch was designed and implemented with flexibility in mind, not speed (Ronan Collobert, p.c.).

Figure 7: Speed comparison between NumPy, numexpr, and Theano for different sizes of input on four element-wise formulae. In each subplot, the solid blue line represents Theano, the dashed red line represents numexpr, and performance is plotted with respect to NumPy.

For small amounts of data, the extra function-call overhead of numexpr and Theano makes them slower. For larger amounts of data, and for more complicated expressions, Theano is fastest because it uses an implementation specialized for each expression.
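
A minimal sketch of such a comparison (the formula and array size here are illustrative, not the exact benchmark formulae behind Figure 7) could look like this:

import timeit
import numpy
import numexpr
import theano.tensor as T
from theano import function

a = numpy.random.rand(1000000)
b = numpy.random.rand(1000000)

# Theano: build the expression once, compile a specialized function.
x, y = T.dvector(), T.dvector()
f = function([x, y], 2*x + 3*y - x*y)

results = {
    "numpy":   timeit.timeit(lambda: 2*a + 3*b - a*b, number=100),
    "numexpr": timeit.timeit(
        lambda: numexpr.evaluate("2*a + 3*b - a*b",
                                 local_dict={'a': a, 'b': b}),
        number=100),
    "theano":  timeit.timeit(lambda: f(a, b), number=100),
}
print results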

What kinds of work does Theano support?

Theano's expression types cover much of the same functionality as NumPy, and include some of what can be found in SciPy. Table I lists some of the most-used expressions in Theano. More extensive reference documentation is available online [theano].

Theano's strong suit is its support for strided N-dimensional arrays of integers and floating point values. Signed and unsigned integers of all native bit widths are supported, as are both single-precision and double-precision floats. Single-precision and double-precision complex numbers are also supported, but less so; for example, gradients through several mathematical functions are not implemented. Roughly 90% of expressions for single-precision N-dimensional arrays have GPU implementations. Our goal is to provide GPU implementations for all expressions supported by Theano.

Operators: +, -, /, *, **, //, eq, neq, <, <=, >, >=, &, |, ^
Allocation: alloc, eye, [ones,zeros]_like, identity{_like}
Indexing*: basic slicing (see set_subtensor and inc_subtensor for slicing lvalues); limited support for advanced indexing
Mathematical Functions: exp, log, tan[h], cos[h], sin[h], real, imag, sqrt, floor, ceil, round, abs
Tensor Operations: all, any, mean, sum, min, max, var, prod, argmin, argmax, reshape, flatten, dimshuffle
Conditional: cond, switch
Looping: Scan
Linear Algebra: dot, outer, tensordot, diag, cholesky, inv, solve
Calculus*: grad
Signal Processing: conv2d, FFT, max_pool_2d
Random: RandomStreams, MRG_RandomStreams
Printing: Print
Sparse: compressed row/col storage, limited operator support, dot, transpose, conversion to/from dense
Machine Learning: sigmoid, softmax, multi-class hinge loss

TABLE I: Overview of Theano's core functionality. This list is not exhaustive, and is superseded by the online documentation. More details are given in text for items marked with an asterisk. dimshuffle is like numpy.swapaxes.

Random numbers are provided in two ways: via NumPy's random module, and via an internal generator from the MRG family [Ecu]. Theano's RandomStreams replicates the numpy.random.RandomState interface, and acts as a proxy to NumPy's random number generator and the various random distributions that use it. The MRG_RandomStreams class implements a different random number generation algorithm (called MRG31k3p) that maps naturally to GPU architectures. It is implemented for both the CPU and GPU so that programs can produce the same results on either architecture without sacrificing speed. The MRG_RandomStreams class offers a more limited selection of random number distributions than NumPy though: uniform, normal, and multinomial.
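
As an illustration (the import path follows Theano's documentation at the time of writing and should be treated as an assumption), drawing random numbers inside a compiled function looks roughly like this:

import numpy
from theano import function
from theano.tensor.shared_randomstreams import RandomStreams

srng = RandomStreams(seed=234)
u = srng.uniform(size=(2, 2))      # symbolic 2x2 draw from U(0, 1)
sample = function([], u)

print sample()   # a new 2x2 NumPy array of uniform samples on each call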

Sparse vectors and matrices are supported via SciPy's sparse module. Only compressed-row and compressed-column formats are supported by most expressions. There are expressions for packing and unpacking these sparse types, some operator support (e.g. scaling, negation), matrix transposition, and matrix multiplication with both sparse and dense matrices. Sparse expressions currently have no GPU equivalents.

There is also support in Theano for arbitrary Python objects. However, there are very few expressions that make use of that support, because the compilation pipeline works on the basis of inferring properties of intermediate results. If an intermediate result can be an arbitrary Python object, very little can be inferred. Still, it is occasionally useful to have such objects in Theano graphs.

Theano has been developed to support machine learning research, and that has motivated the inclusion of more specialized expression types such as the logistic sigmoid, the softmax function, and multi-class hinge loss.

Figure 8: The compilation pipeline for functions compiled for GPU: canonicalization, stabilization, specialization, GPU transfer, and code generation. Functions compiled for the CPU omit the GPU transfer step.

Compilation by theano.function

What happens under the hood when creating a function? This section outlines, in broad strokes, the stages of the compilation pipeline. Prior to these stages, the expression graph is copied so that the compilation process does not change anything in the graph built by the user. As illustrated in Figure 8, the expression graph is subjected to several transformations: (1) canonicalization, (2) stabilization, (3) specialization, (4) optional GPU transfer, and (5) code generation. There is some overlap between these transformations, but at a high level they have different objectives. (The interested reader should note that these transformations correspond roughly, but not exactly, to the optimization objects that are implemented in the project source code.)

Canonicalization: The canonicalization transformation puts the user's expression graph into a standard form. For example, duplicate expressions are merged into a single expression. Two expressions are considered duplicates if they carry out the same operation and have the same inputs. Since Theano expressions are purely functional (i.e., cannot have side effects), these expressions must return the same value and thus it is safe to perform the operation once and reuse the result. The symbolic gradient mechanism often introduces redundancy, so this step is quite important. For another example, sub-expressions involving only multiplication and division are put into a standard fraction form, e.g. a / (((a * b) / c) / d) -> (a * c * d) / (a * b) -> (c * d) / b. Some useless calculations are eliminated in this phase, for instance cancelling out uses of the a term in the previous example, but also reducing exp(log(x)) to x, and computing outright the values of any expression whose inputs are fully known at compile time. Canonicalization simplifies and optimizes the graph to some extent, but its primary function is to collapse many different expressions into a single normal form so that it is easier to recognize expression patterns in subsequent compilation stages.

Stabilization: The stabilization transformation improves the numerical stability of the computations implied by the expression graph. For instance, consider the function log(1 + exp(x)), which tends toward zero as x → −∞, and toward x as x → +∞. Due to limitations in the representation of double-precision numbers, the computation as written yields infinity for x > 709. The stabilization phase replaces patterns like this one with an expression that simply returns x when x is sufficiently large (using doubles, this is accurate beyond the least significant digit). It should be noted that this phase cannot guarantee the stability of computations. It helps in some cases, but the user is still advised to be wary of numerically problematic computations.
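
A quick NumPy illustration of this failure mode (not Theano code, just a sketch of the underlying floating-point issue):

import numpy as np

x = 1000.0
naive = np.log(1.0 + np.exp(x))   # exp(1000.) overflows to inf, so log gives inf
stable = np.logaddexp(0.0, x)     # log(exp(0) + exp(x)), computed without overflow
print naive, stable               # inf 1000.0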

Specialization: The specialization transformation replaces expressions with faster ones. Expressions like pow(x, 2) become sqr(x). Theano also performs more elaborate specializations: for example, expressions involving scalar-multiplied matrix additions and multiplications may become BLAS General Matrix Multiply (GEMM) nodes, and reshape, transpose, and subtensor expressions (which create copies by default) are replaced by constant-time versions that work by aliasing memory. Expression subgraphs involving element-wise operations are fused together (as in numexpr) in order to avoid the creation and use of unnecessary temporary variables. For instance, if we denote the a + b operation on tensors as map(+, a, b), then an expression such as map(+, map(*, a, b), c) would become map(lambda ai, bi, ci: ai*bi + ci, a, b, c). If the user desires to use the GPU, expressions with corresponding GPU implementations are substituted in, and transfer expressions are introduced where needed. Specialization also introduces expressions that treat inputs as workspace buffers. Such expressions use less memory and make better use of hierarchical memory, but they must be used with care because they effectively destroy intermediate results. Many expressions (e.g. GEMM and all element-wise ones) have such equivalents. Reusing memory this way allows more computation to take place on GPUs, where memory is at a premium.
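
To make the effect of element-wise fusion concrete, here is a small illustrative sketch in plain Python/NumPy (not Theano internals); the fused loop is what a generated C kernel effectively does, written out in Python only for clarity:

import numpy as np

a = np.random.rand(1000000)
b = np.random.rand(1000000)
c = np.random.rand(1000000)

# Unfused evaluation of a*b + c: NumPy allocates a full-size temporary
# for a*b and makes two passes over memory.
tmp = a * b
out_unfused = tmp + c

# Fused evaluation: one loop reads each operand once and writes the
# result directly, with no temporary array. (Theano generates and
# compiles the equivalent loop as C, so it is fast; in pure Python it
# is only a conceptual illustration.)
out_fused = np.empty_like(a)
for i in range(a.shape[0]):
    out_fused[i] = a[i] * b[i] + c[i]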

Moving Computation to the GPU: Each expression in Theano is associated with an implementation that runs on either the host (a host expression) or a GPU device (a GPU expression). The GPU-transfer transformation replaces host expressions with GPU expressions. The majority of host expression types have GPU equivalents and the proportion is always growing.

The heuristic that guides GPU allocation is simple: if any input or output of an expression resides on the GPU and the expression has a GPU equivalent, then the GPU equivalent is substituted in. Shared variables storing float32 tensors default to GPU storage, and the expressions derived from them consequently default to using GPU implementations. It is possible to explicitly force any float32 variable to reside on the GPU, so one can start the chain reaction of optimizations and use the GPU even in graphs with no shared variables. It is possible (though awkward, and discouraged) to specify exactly which computations to perform on the GPU by disabling the default GPU optimizations.
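
As a sketch (the THEANO_FLAGS settings follow Theano's documentation of the time and should be treated as assumptions rather than the authoritative configuration), enabling the GPU path typically amounts to selecting the device and using float32 data:

# Typically set in the environment before Python starts, e.g.
#   THEANO_FLAGS='device=gpu,floatX=float32' python train.py
import numpy
from theano import shared

# A float32 shared variable; with a GPU device selected it defaults to
# GPU storage, pulling expressions derived from it onto the GPU as well.
w = shared(numpy.zeros(100, dtype='float32'))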

Tensors stored on the GPU use a special internal data type with an interface similar to the ndarray. This datatype fully supports strided tensors and arbitrary numbers of dimensions. The support for strides means that several operations, such as the transpose and simple slice indexing, can be performed in constant time.

Code Generation: The code generation phase of the compilation process produces and loads dynamically-compiled Python modules with specialized implementations for the expressions in the computation graph. Not all expressions have C (technically C++) implementations, but many (roughly 80%) of Theano's expressions generate and compile C or CUDA code during theano.function. The majority of expressions that generate C code specialize the code based on the dtype, broadcasting pattern, and number of dimensions of their arguments. A few expressions, such as the small-filter convolution (conv2d), further specialize code based on the size that the arguments will have.

Why is it so important to specialize C code in this way? Modern x86 architectures are relatively forgiving of code that does not make good use of techniques such as loop unrolling and prefetching contiguous blocks of memory, and only the conv2d expression goes to any great length to generate many special-case implementations for the CPU. By comparison, GPU architectures are much less forgiving of code that is not carefully specialized for the size and physical layout of function arguments. Consequently, the code generators for GPU expressions like GpuSum, GpuElementwise, and GpuConv2d generate a wider variety of implementations than their respective host expressions. With the current generation of graphics cards, the difference in speed between a naïve implementation and an optimal implementation of an expression as simple as matrix row summation can be an order of magnitude or more. The fact that Theano's GPU ndarray-like type supports strided tensors makes it even more important for the GPU code generators to support a variety of memory layouts. These compile-time specialized CUDA kernels are integral to Theano's GPU performance.

Limitations and Future Work

While most of the development effort has been directed at making Theano produce fast code, not as much attention has been paid to the optimization of the compilation process itself. At present, the compilation time tends to grow super-linearly with the size of the expression graph. Theano can deal with graphs up to a few thousand nodes, with compilation times typically on the order of seconds. Beyond that, it can be impractically slow, unless some of the more expensive optimizations are disabled, or pieces of the graph are compiled separately.

A Theano function call also requires more overhead (on the order of microseconds) than a native Python function call. For this reason, Theano is suited to applications where functions correspond to expressions that are not too small (see Figure 5).

The set of types and operations that Theano provides continues to grow, but it does not cover all the functionality of NumPy, and covers only a few features of SciPy. Wrapping functions from these and other libraries is often straightforward, but implementing their gradients or related graph transformations can be more difficult. Theano does not yet have expressions for sparse or dense matrix inversion, nor linear algebra decompositions, although work on these is underway outside of the Theano trunk. Support for complex numbers is also not as widely implemented or as well-tested as for integers and floating point numbers. NumPy arrays with non-numeric dtypes (strings, Unicode, Python objects) are not supported at present.

We expect to improve support for advanced indexing and linear algebra in the coming months. Documentation online describes how to add new operations and new graph transformations. There is currently an experimental GPU version of the scan operation, used for looping, and an experimental lazy-evaluation scheme for branching conditionals.

The library has been tuned towards expressions related to machine learning with neural networks, and it is not as well tested outside of this domain. Theano is not a powerful computer algebra system, and it is an important area of future work to improve its ability to recognize numerical instability in complicated element-wise expression graphs.

Debugging Theano functions can require non-standard techniques and Theano-specific tools. The reason is two-fold: 1) the definition of Theano expressions is separate from their execution, and 2) optimizations can introduce many changes to the computation graph. Theano thus provides separate execution modes for Theano functions, which allows for automated debugging and profiling. Debugging entails automated sanity checks, which ensure that all optimizations and graph transformations are safe (Theano compares the results before and after their application), as well as comparing the outputs of both C and Python implementations.
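
For example (the mode name follows Theano's documentation and should be treated as an assumption), the execution mode is chosen when the function is compiled:

# Compile the same graph as Figure 3 in the slow, heavily-checked
# debugging mode; each optimization and each C implementation is
# verified against a reference Python implementation at run time.
train_debug = function(inputs=[x, y],
                       outputs=[prediction, xent],
                       updates={w: w - 0.1*gw, b: b - 0.1*gb},
                       mode='DebugMode')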

We plan to extend GPU support to the full range of C data types, but only float32 tensors are supported as of this writing. There is also no support for sparse vectors or matrices on the GPU, although algorithms from the CUSPARSE package should make it easy to add at least basic support for sparse GPU objects.

Conclusion

Theano is a mathematical expression compiler for Python that translates high-level NumPy-like code into machine language for efficient CPU and GPU computation. Theano achieves good performance by minimizing the use of temporary variables, minimizing pressure on fast memory caches, making full use of gemm and gemv BLAS subroutines, and generating fast C code that is specialized to sizes and constants in the expression graph. Theano implementations of machine learning algorithms related to neural networks on one core of an E8500 CPU are up to 1.8 times faster than implementations in NumPy, 1.6 times faster than MATLAB, and 7.6 times faster than a related C++ library. Using an NVIDIA GeForce GTX 285 GPU, Theano is an additional 5.8 times faster. One of Theano's greatest strengths is its ability to generate custom-made CUDA kernels, which can not only significantly outperform CPU implementations but alternative GPU implementations as well.

Acknowledgements

Theano has benefited from the contributions of many members of Yoshua Bengio's machine learning group in the computer science department (Département d'Informatique et de Recherche Opérationnelle) at l'Université de Montréal, especially Arnaud Bergeron, Thierry Bertin-Mahieux, Olivier Delalleau, Douglas Eck, Dumitru Erhan, Philippe Hamel, Simon Lemieux, Pierre-Antoine Manzagol, and François Savard. The authors acknowledge the support of the following agencies for research funding and computing support: NSERC, RQCHP, CIFAR, SHARCNET and CLUMEQ.

REFERENCES

[theano] Theano. http://www.deeplearning.net/software/theano
[NumPy] T. E. Oliphant. "Python for Scientific Computing". Computing in Science & Engineering 9, 10 (2007).
[Bottou] L. Bottou. "Online Algorithms and Stochastic Approximations". In D. Saad, ed. Online Learning and Neural Networks (1998). Cambridge University Press, Cambridge, UK. Online: http://leon.bottou.org/papers/bottou-98x
[numexpr] D. Cooke et al. "numexpr". http://code.google.com/p/numexpr/
[Cython] S. Behnel, R. Bradshaw, and D. S. Seljebotn. "Cython: C-Extensions for Python". http://www.cython.org/
[scipy.weave] SciPy Weave module. http://docs.scipy.org/doc/scipy/reference/tutorial/weave.html
[Alted] F. Alted. "Why Modern CPUs Are Starving And What Can Be Done About It". Computing in Science and Engineering 12(2):68-71, 2010.
[SymPy] SymPy Development Team. "SymPy: Python Library for Symbolic Mathematics". http://www.sympy.org/
[BLAS] J. J. Dongarra, J. Du Croz, I. S. Duff, and S. Hammarling. "Algorithm 679: A set of Level 3 Basic Linear Algebra Subprograms". ACM Trans. Math. Soft., 16:18-28, 1990. http://www.netlib.org/blas
[LAPACK] E. Anderson et al. "LAPACK Users' Guide, Third Edition". http://www.netlib.org/lapack/lug/index.html
[DLT] Deep Learning Tutorials. http://deeplearning.net/tutorial/
[dlb] Benchmarking code: http://github.com/pascanur/DeepLearningBenchmarks
[torch5] R. Collobert. "Torch 5". http://torch5.sourceforge.net
[EBL] "EBLearn: Energy Based Learning, a C++ Machine Learning Library". http://eblearn.sourceforge.net/
[gpumat] "GPUmat: GPU toolbox for MATLAB". http://gp-you.org
[Ecu] P. L'Ecuyer, F. Blouin, and R. Couture. "A Search for Good Multiple Recursive Generators". ACM Transactions on Modeling and Computer Simulation, 3:87-98, 1993.