As is typical in evolutionary algorithms, fitness evaluation in GP takes the majority of the computational effort. In this paper we demonstrate the use of the Graphics Processing Unit (GPU) to accelerate the evaluation of individuals. We show that for both binary and floating point based data types, it is possible to get speed increases of several hundred times over a typical CPU implementation. This allows for evaluation of many thousands of fitness cases, and hence should enable more ambitious solutions to be evolved using GP.
Fast Genetic Programming on GPUs
Simon Harding and Wolfgang Banzhaf
Computer Science Department, Memorial University, Newfoundland
1 Introduction
It is well known that fitness evaluation is the most time-consuming part of the
genetic programming (GP) system. This limits the types of problems that may
be addressed by GP, as large numbers of fitness cases make GP runs impractical.
In some systems it is possible to accelerate the evaluation process using
a variety of techniques. In this paper we present a method using the graphics
processing unit on the video adapter. We study the evaluation of evolved
mathematical expressions and digital circuits, as they are typically used to evaluate
the performance of a genetic programming algorithm.
Different approaches have been used in the past for accelerating evaluation.
For example, it is possible to co-evolve fitness cases in order to reduce the number
of evaluations [1]. This, however, adds significant complexity to the algorithm,
and does not guarantee an increase in performance under all circumstances. In
other applications, one could select the number of fitness cases, e.g. by stochastic
sampling or other methods [2]. Should the system need to be tested against
a complete input set, however, this approach would not be suitable. Another
method involves compiling the evolved expression to executable code or even
using binary code directly [3]. Writing expressions as native code or in a similar
vein has many advantages [4]. The compiler or a hand-written algorithm can
perform optimisations, e.g. by removing redundant code, which in addition to
directly running the expression gives a significant increase in performance. The
use of reflection in modern languages such as Java and C# makes it
possible to compile and link code into the currently executing application.
Under some circumstances it is possible to offload the evaluation to more suitable
hardware. When evaluating digital circuits, they can be loaded into a field
programmable gate array (FPGA) and then executed on dedicated hardware [5].
This approach can provide large speed increases. However, the downloading of
configurations into an FPGA can be a costly overhead. The biggest drawback to
this approach is that it requires the use of external hardware, which may have
to be specifically developed.
Recently it has become possible to access the processing power of the graphics
processing unit (GPU). Modern GPUs are extremely good at performing parallel
mathematical operations [6]. However, until recently it was cumbersome to use
this resource for general purpose computing. For a general survey on algorithms
implemented on GPUs the reader is referred to [7]. For example, discrete wavelet
transformations [8], the solution of dense linear systems [9], physics simulations
for games, fluid simulators [10], etc., have been shown to execute faster on
GPUs.
In this paper we demonstrate a method for using the GPU as an evaluator
for genetic programming expressions, and show that there are considerable speed
increases to be gained. Using recent libraries we also show that putting the
functions on the GPU to work is relatively painless. As many of these technologies
are new, we include web links to sites containing the most recent information on
the projects discussed.
Because capable hardware and software are new, there is relatively little
previous work on using GPUs for evolutionary computation. For example, [11]
implements an evolutionary programming algorithm on a GPU, and finds that
there is a 5-fold speed increase. Work by [12] expands on this, and evaluates
expressions on the GPU. There all the operations are treated as graphics
operations, which makes implementation difficult and limits the flexibility of the
evaluations. Yu et al. [13], on the other hand, implement a Genetic Algorithm
on GPUs. Depending on population size, they find a speed-up factor of up to
20. Here both the genetic operators and fitness evaluation are performed on
the GPU. Ebner et al. use human interaction to evolve aesthetically pleasing
shader programs [14]. Here, linear genetic programming structures are compiled
into shader programs. The shader programs were then used to render textures on
images, which were selected by a user. However, the technique was not extended
into more general purpose computation.
To our knowledge, this contribution is the first study of general purpose
Genetic Programming executed on a graphics hardware platform. It makes use of
the fact that GP fitness cases are numerous and can be executed in parallel.
Provided there is a sufficient number of fitness cases (large datasets), a substantial
speedup can be reached.
2 The Architecture of Graphics Processing Units
Graphics processors are specialized stream processors used to render graphics.
Typically, the GPU is able to perform graphics manipulations much faster than
a general purpose CPU, as the processor is specifically designed to handle certain
primitive operations. Internally, the GPU contains a number of small processors
that are used to perform calculations on 3D vertex information and on textures.
These processors operate in parallel with each other, and work on different parts
of the problem. First the vertex processors calculate the 3D view, then the shader
processors paint this model before it is displayed. Programming the GPU is
typically done through a virtual machine interface such as OpenGL or DirectX,
which provides a common interface to the diverse GPUs available, thus making
development easy. However, DirectX and OpenGL are optimized for graphics
processing, hence other APIs are required to use the GPU as a general purpose
device. There are many such APIs, and section 3 describes several of the more
common ones.
For general purpose computing, we here wish to make use of the parallelism
provided by the shader processors, see Figure 1. Each processor can perform
multiple floating point operations per clock cycle, meaning that performance is
determined by the clock speed, the number of pixel shaders and the width
of the pixel shaders. Pixel shaders are programmed to perform a given set of
instructions on each pixel in a texture. Depending on the GPU, the number of
instructions may be limited. In order to use more than this number of operations,
a program needs to be broken down into suitably sized units, which may impact
performance. Newer GPUs support unlimited instructions, but some older cards
support as few as 64 instructions. GPUs typically use floating point arithmetic,
the precision of which is often controllable as less precise representations are
faster to compute with. Again, the maximum precision is manufacturer specific,
but recent cards provide up to 128-bit precision.
The graphics card used in these experiments is an NVidia GeForce 7300 Go,
which is a GPU optimized for laptop use. It is underpowered compared to cards
available for desktop PCs. Because GPUs are parallel and have very strict
processing models, the computational ability of the GPU scales well with the
number of pixel shaders. We would therefore expect to see major improvements to
the performance of the benchmarks given here if we were to run them on a
desktop GPU. According to [15], “an NVIDIA 7800 GTX 512 is capable of around 200
GFLOPS. ATI’s latest X1900 architecture has a claimed performance of 554
GFLOPS”. Since it is now possible to place multiple GPUs inside a single PC
chassis, this should result in TFLOP performance for numerical processing at
low cost.
A further advantage of the GPU is that it uses less power than a typical
CPU. Power consumption has become an important consideration in building
clusters, since it causes heat generation.
3 Programming a GPU
Fig. 1. Arrays, representing the test cases, are converted to textures. These textures
are then manipulated (in parallel) by small programs inside each of the pixel shaders.
The result is another texture, which can be converted back to a normal array for CPU
based processing.

In this section we provide a brief overview of some of the general purpose
computation toolkits for GPUs that are available. This is not an exhaustive list, but
is intended to act as a guide to others. More information on these systems can
be found at
Sh: Sh is an open source project for accessing the GPU under C++ [16,
17]. Many graphics cards are supported, and the system is platform independent.
Many low level features can be accessed using Sh, however these require
knowledge of the mechanisms used by the shaders. The Sh libraries provide
typical matrix and vector manipulations, such as dot products and
addition-multiplication operators. In addition to providing general purpose computing,
Sh also provides many routines for use in graphics programming. This feature
is unique amongst the tools described here, and would be useful in visualisation
of results.
Brook: Brook is another way to access the features on the GPU [18]. Brook
takes the form of extensions to the C programming language, adding support for
GPU specific data types. Applications developed with Brook are compiled using
a special C compiler, which generates C++ and Cg code. Cg is a programming
language for graphics that is similar to C. One major advantage of Brook is
that it can target either OpenGL or DirectX, and is therefore more platform
independent than other tools. However, code must be compiled separately for
each target platform. Brook appears to be a very popular choice, and is used for
large applications, such as folding@home.
PyGPU: Another recent library allows access to GPU functionality from
the Python language [19]. PyGPU runs as an embedded language inside Python.
The work is in its early stages, but results are promising. However, it currently
lacks the optimization required to make full use of the GPU. It requires a variety
of extra packages to be installed into Python, such as NumPy and PyGame (which
does not yet support the most recent Python release). Given the rise in popularity
of Python for scientific computing, this implementation should prove useful in
the future. Python itself, however, appears to have significant performance issues
compared to C++ and JIT languages such as Java or C#.
Accelerator: Recently a .Net assembly called Accelerator was released that
provides access to the GPU via the DirectX interface [20]. The system is
completely abstracted from the GPU, and presents the end user with only arrays
that can be operated on in parallel. Unfortunately, the system is only available
for the Windows platform due to its reliance on DirectX. However, the assembly
can be used from any .Net programming language.
This tool differs from the previous interfaces in that it uses lazy evaluation.
Operations are not performed on the data until the evaluated result is requested.
This enables a certain degree of real time optimization, and reduces the
computational load on the GPU. In particular, common subexpressions are optimised,
which reduces the creation of temporary shaders and textures. The movement
of data to and from the GPU can also be efficiently optimized, which reduces the
impact of the relatively slow transfer of data out of the GPU. The compilation
to the shader model occurs at run time, and hence can automatically make use
of the different features available on the supported graphics cards.
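The idea of lazy evaluation can be illustrated with a toy example. The sketch below is plain Python and is not how Accelerator is implemented; the class name `Lazy` and its methods are hypothetical and exist only for this illustration. Arithmetic on `Lazy` objects merely records a graph of operations, and the work is done when `evaluate` is called. A node that is shared between two parts of the graph is computed only once, which loosely mirrors the reuse of common subexpressions.

```python
class Lazy:
    """A node in an expression graph; nothing is computed until evaluate()."""

    def __init__(self, op, args=(), data=None):
        self.op = op        # "data", "add" or "mul"
        self.args = args    # child nodes for "add" and "mul"
        self.data = data    # stored values for "data" nodes

    @staticmethod
    def array(values):
        return Lazy("data", data=list(values))

    def __add__(self, other):
        return Lazy("add", (self, other))

    def __mul__(self, other):
        return Lazy("mul", (self, other))

    def evaluate(self, cache=None):
        # The cache maps node identity to its result, so a node that is
        # referenced several times in the graph is evaluated only once.
        cache = {} if cache is None else cache
        key = id(self)
        if key not in cache:
            if self.op == "data":
                cache[key] = self.data
            else:
                left = self.args[0].evaluate(cache)
                right = self.args[1].evaluate(cache)
                if self.op == "add":
                    cache[key] = [a + b for a, b in zip(left, right)]
                else:
                    cache[key] = [a * b for a, b in zip(left, right)]
        return cache[key]


x = Lazy.array([1.0, 2.0, 3.0])
shared = x + x            # records a graph node; no arithmetic yet
expr = shared * shared    # the same node appears twice in the graph
result = expr.evaluate()  # the work happens here
```

In a real toolkit the deferred graph would be compiled to a shader program rather than interpreted, but the deferral itself is what leaves room for such optimisations.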
In this paper we use the Accelerator package. The total time required to
reimplement an existing tree-based GP parser was less than two hours,
which we would expect to be considerably less than using any of the other
solutions presented here. As with other implementations, Accelerator is based
on arrays implemented as textures. The API then allows one to perform parallel
operations on the arrays. Conversion to textures and transfer to the GPU are
handled transparently by the API, allowing the developer to concentrate on the
implementation of the algorithm. The available function set for operating on
parallel arrays is similar to the other APIs. It includes element-wise arithmetic
operations, square root, multiply-add, and trigonometric operations. There are
also conditional operations and functions for comparing two arrays. The API also
provides reduction operators, such as the sum, product, minimum or maximum
value in the array. Further functions perform transformations, such as shift and
rotate, on the elements of the array.
The other systems described here present different variations on these
functions, and generally offer functionality that allows different operations to be
applied to different parts of the arrays.
As usual, available benchmarks may not give a fair reflection of real world performance.
4 Parsing a GP Expression
Typically, parsing a GP expression involves traversing the expression tree in a
bottom-up, breadth first manner. At each node visited the interpreter performs
the specified function on the inputs to the node, and outputs the result as the
node output. The tree is re-evaluated for every input set. Hence, for 100 test
cases the tree would be executed 100 times.
Using the GPU we are able to parallelize this process, which means that in
effect the tree only has to be parsed once, with the function evaluation performed
in parallel. Even without the arithmetic acceleration provided by the GPU, this
results in a considerable reduction in computation. Our GP interpreter uses a
case statement at the evaluation of each node to determine what function to
apply to the input values. If run on the GPU, the tree needs only to be executed
once, removing the need for repeatedly accessing the case statement. The use
of the GPU is illustrated in Figure 1. The population and genetic algorithm run
on the CPU, with evaluations run on the GPU. The CPU converts arrays of
test cases to textures on the GPU and loads a shader program into the shader
processors. The Accelerator toolkit compiles each individual's GP expression
into a shader program. The program is then executed, and the resulting texture
is converted back into an array. The fitness is determined from this output array.
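The difference between the two evaluation strategies can be sketched in a few lines. The following is a minimal illustration in plain Python, not the C#/Accelerator interpreter used in this paper; the tuple-based tree encoding and the names `eval_case` and `eval_all` are assumptions made for the example. Ordinary lists stand in for the parallel arrays, so no GPU is involved. The point is only that `eval_all` walks the tree once and applies each function to all test cases, whereas `eval_case` has to be called once per test case.

```python
import operator

# Hypothetical encoding: a node is ("x",), ("const", value) or (op, left, right).
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def eval_case(node, x):
    """Interpret the tree for a single test case (one traversal per case)."""
    kind = node[0]
    if kind == "x":
        return x
    if kind == "const":
        return node[1]
    return OPS[kind](eval_case(node[1], x), eval_case(node[2], x))


def eval_all(node, xs):
    """Interpret the tree once; every node operates on the whole list of cases."""
    kind = node[0]
    if kind == "x":
        return list(xs)
    if kind == "const":
        return [node[1]] * len(xs)
    left = eval_all(node[1], xs)
    right = eval_all(node[2], xs)
    return [OPS[kind](a, b) for a, b in zip(left, right)]


# The expression (x * x) + 1 evaluated on four test cases.
tree = ("+", ("*", ("x",), ("x",)), ("const", 1.0))
cases = [0.0, 1.0, 2.0, 3.0]
per_case = [eval_case(tree, x) for x in cases]  # four tree traversals
batched = eval_all(tree, cases)                 # a single tree traversal
```

Both strategies produce the same outputs; the batched form simply moves the loop over test cases inside each node, which is the part a GPU can execute in parallel.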
5 Benchmarks
5.1 Configuration
The GP parser used here is written in C#, and compiled using Visual Studio
2005. All benchmarks were done using the Release build configuration, and were
executed on CLR 2.0 on Windows XP. The GPU is an NVidia GeForce 7300 GO
with 512 MB video memory. The CPU used is an Intel Centrino T2400 (running
at 1.83 GHz), with 1.5 GB of system memory.
In these experiments, GP trees were randomly generated with a given number
of nodes. The expressions were evaluated on the CPU and then on the GPU,
and each evaluation was timed. Timing was performed
using calls to the Win32 API QueryPerformanceCounter, which returns high
precision timings. For each input size/expression length pair, 100 different randomly
generated expressions were used, and results were averaged to calculate
acceleration factors. Therefore our results show the average number of times the GPU is
faster at evaluating a given tree size for a given number of fitness cases. Results
less than 1 mean that the CPU was faster at evaluating the expression; values
above 1 indicate the GPU performed better.
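As a rough illustration of how such an acceleration factor can be computed, the sketch below uses Python's `time.perf_counter` as a stand-in for QueryPerformanceCounter. The function names are hypothetical, the timings are made-up numbers, and averaging the per-expression ratios is an assumption: the text above does not state whether ratios or raw times were averaged.

```python
import time


def timed(fn, *args):
    """Return the wall-clock seconds taken by one call to fn(*args)."""
    start = time.perf_counter()
    fn(*args)
    return time.perf_counter() - start


def acceleration_factor(cpu_times, gpu_times):
    """Mean of the per-expression ratios cpu/gpu (assumed averaging scheme).

    A value above 1 means the GPU was faster on average.
    """
    ratios = [c / g for c, g in zip(cpu_times, gpu_times)]
    return sum(ratios) / len(ratios)


# Made-up timings in seconds for three expressions, for illustration only.
cpu = [1.0, 3.0, 2.0]
gpu = [0.5, 0.5, 1.0]
factor = acceleration_factor(cpu, gpu)  # ratios are 2, 6 and 2
```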
5.2 Floating point
In the first experiment, we evaluated random GP trees containing varying
numbers of nodes, and exposed them to varying test case sizes. Mathematical
functions +, -, * and / were used. The same expression was tested on the CPU and
the GPU, and the speed difference was recorded. Results are shown in Table
1. For small node counts and fitness cases, the CPU performance is superior
because of the overhead of mapping the computation to the GPU. For larger
problems, however, there is a massive speed increase for GPU execution.
5.3 Binary
The second experiment compares the performance of the GPU at handling
boolean expressions. In the CPU version, we use the C# boolean type, which is
convenient, but not necessarily the most efficient representation. For the GPU,
we tested two different approaches, one using the boolean parallel array provided
by Accelerator, the other using float. The performance of these two
representations is shown in Table 2. It is interesting to note that improvements are not
guaranteed. As can be seen in the table, the speed-up can decrease as expression
size increases. We assume this is due to the way in which large shader programs
are handled by either the Accelerator or the GPU. For example, the length of
the shader program on the NVIDIA GPU may be limited, and going beyond this
length would require repeated passes of the data. This type of behaviour can be
seen in many of the results presented here.
We limit the functions in the expressions to AND, OR and NOT, which are
supported by the boolean array type. Following some sample code provided with
Accelerator, we mimicked boolean behavior using 0.0f as false and 1.0f as true.
For two floats, AND can be viewed as the minimum of the two values. Similarly,
OR can be viewed as the maximum of the two values. NOT can be performed
as a multiply-add, where the first stage is to multiply by -1 and the second is to add 1.
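The floating point encoding of boolean logic described above can be checked with a short sketch. It is written in plain Python over lists rather than against the Accelerator API, and the function names `f_and`, `f_or` and `f_not` are hypothetical.

```python
def f_and(a, b):
    """AND as the element-wise minimum of two 0.0/1.0 arrays."""
    return [min(x, y) for x, y in zip(a, b)]


def f_or(a, b):
    """OR as the element-wise maximum of two 0.0/1.0 arrays."""
    return [max(x, y) for x, y in zip(a, b)]


def f_not(a):
    """NOT as a multiply-add: multiply by -1, then add 1."""
    return [x * -1.0 + 1.0 for x in a]


# All four input combinations of a two-input truth table.
a = [0.0, 0.0, 1.0, 1.0]
b = [0.0, 1.0, 0.0, 1.0]

# XOR built only from the three primitives, as an evolved circuit might be.
xor = f_or(f_and(a, f_not(b)), f_and(f_not(a), b))
```

The encoding stays closed over the values 0.0 and 1.0, so arbitrarily deep expressions built from these three functions remain valid truth values.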
5.4 Real world tests
In this experiment, we investigate the speed-up on both toy and real world
problems, rather than on arbitrary expressions. The GP representation we chose
to use here is CGP, but similar results should be obtained from other
representations. CGP is fully described in [21]. In the benchmark experiments, the
expression lengths were uniform throughout the tests. However, in real GP the
length of the expressions varies throughout the run. As the GPU sometimes results
in slower performance, we need to verify that on average, there is an advantage.
Regression We evolved functions that regressed over $x^6 - 2x^4 + x^2$ [22]. We
tested the evaluation difference using a number of test cases. In each instance,
the test cases were uniformly distributed between -1 and +1. We also changed
the maximum length of the CGP graph. Hence, expression lengths could range
anywhere from 1 node to the maximum size of the CGP graph. GP was run for
200 generations to allow for convergence. The function set comprised +, -, *
and /. In C#, division by zero on a float returns “Infinity”, which is consistent
with the result from the Accelerator library.
                                         Test Cases
Expression Length        64       256      1024      4096     16384     65536
10                     0.04      0.16       0.6      2.39      8.94     28.34
100                     0.4      1.38      5.55     23.03     84.23    271.69
500                    1.82      7.04     27.84    101.13    407.34   1349.52
1000                   3.47     13.78     52.55    204.35    803.28   2694.97
5000                  10.02     26.35     87.46    349.73    1736.3    4642.4
10000                 13.01      36.5    157.03    442.23   1678.45   7351.06
Table 1. Results showing the number of times faster evaluating floating point based
expressions is on the GPU, compared to CPU implementation. An increase of less than
1 shows that the CPU is more efficient.
Boolean implementation
                                              Test Cases
Expression Length        4       16       64      256     1024     4096    16384    65536
10                    0.22     1.04     1.05     2.77     7.79    36.53    84.08   556.40
50                    0.44     0.57     1.43     3.02    14.75    58.17   228.13   896.33
100                   0.39     0.62     1.17     4.36    14.49    51.51   189.57   969.33
500                   0.35     0.43     0.75     2.64    14.11    48.01   256.07  1048.16
1000                  0.23     0.39     0.86     3.01    10.79    50.39   162.54   408.73
1500                  0.40     0.55     1.15     4.19    13.69    53.49   113.43   848.29
Boolean implementation, using floating point
                                              Test Cases
Expression Length        4       16       64      256     1024     4096    16384    65536
10                   0.024    0.028    0.028    0.072    0.282     0.99     3.92    14.66
50                   0.035    0.049    0.115    0.311    1.174     4.56    17.72    70.48
100                  0.061    0.088    0.201    0.616    2.020     8.78    34.69   132.84
500                  0.002    0.003    0.005    0.017    0.064     0.25     0.99     3.50
1000                 0.001    0.001    0.003    0.008    0.030     0.12     0.48     1.49
1500                 0.000    0.001    0.002    0.005    0.019     0.07     0.29     1.00
Table 2. Results showing the number of times faster evaluating boolean expressions is
on the GPU, compared to CPU implementation. An increase of less than 1 shows that
the CPU is more efficient. Booleans were implemented as floating point numbers and
as booleans. Although the floating point representation can be faster than the CPU
for large input sizes, in general it appears preferable to use the boolean
representation. Using the floating point representation can provide speed increases,
but the results are varied.
                                 Test Cases
Max Expression Length      10     100    1000    2000
10                       0.02    0.08     0.7    1.22
100                      0.07    0.33    2.79    5.16
1000                     0.42    1.71   15.29   87.02
10000                     0.4    1.79   16.25   95.37
Table 3. Results for the regression experiment. The results show the number of times
faster evaluating evolved GP expressions is on the GPU, compared to CPU
implementation. The maximum expression length is the number of nodes in the CGP graph.
                                 Test Cases
Max Expression Length     194     388     970    1940
10                       0.15    0.23    0.51    1.01
100                      0.38    0.67    1.63    3.01
1000                     1.77    3.19    9.21    22.7
10000                    1.69    3.21    8.94   22.38
Table 4. Results for the two spirals classification experiment. The results show the
number of times faster evaluating evolved GP expressions is on the GPU, compared to
CPU implementation. The maximum expression length is the number of nodes in the
CGP graph.
Fitness was defined as the sum of the absolute errors between the expected value
of each test case and the output of the expression. This can also be calculated
using the GPU. Each individual was evaluated with the CPU, then the GPU, and the
speed difference was recorded. The outputs from both the GPU and CPU were also
compared to ensure that they were evaluating the expression in the same manner.
We did not find any instances where the two differed.
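The fitness measure and the CPU/GPU consistency check can be written down compactly. The following is a minimal sketch in plain Python; the function names and the tolerance used when comparing outputs are assumptions made for the example, since the text does not say how equality of the two outputs was tested.

```python
def fitness(outputs, targets):
    """Sum of absolute errors over all test cases; 0 is a perfect score."""
    return sum(abs(o - t) for o, t in zip(outputs, targets))


def outputs_agree(cpu_out, gpu_out, tol=1e-6):
    """Check that two evaluators produced the same outputs.

    The tolerance is an assumed value, chosen only for illustration.
    """
    return len(cpu_out) == len(gpu_out) and all(
        abs(c - g) <= tol for c, g in zip(cpu_out, gpu_out)
    )


# Made-up outputs and targets for three test cases.
outputs = [1.0, 2.5, 4.0]
targets = [1.0, 2.0, 5.0]
score = fitness(outputs, targets)  # 0.0 + 0.5 + 1.0
```

On a parallel array toolkit the same measure would be an element-wise subtraction and absolute value followed by a sum reduction, which is why it can also be computed on the GPU.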
Table 3 shows results that are consistent with the tests described in previous
sections. For smaller input sets and small expressions, it was more efficient to
evaluate them on the CPU. However, for the larger test and expression sizes the
performance increase was dramatic.
Classification In this experiment we attempted the classification problem of
distinguishing between two spirals, as described in [22]. This problem has two
input values (x and y coordinates of a point on a spiral) and has a single output
indicating on which spiral the point is found. In [22], 194 test cases are used. In
these experiments, we extend the number of test cases to 388, 970 and 1940. We
also extended the function set to include sin, cos, $\sqrt{x}$, $x^y$ and a comparator. The
comparator looks at the first input value to the node, and if it is less than or
equal to zero returns the second input, 0 otherwise. The relative speed increases
can be seen in Table 4. Again we see that the GPU is superior for larger numbers
of test cases, with larger expression sizes.
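The comparator node described above is an element-wise conditional. A minimal sketch in plain Python follows; the function name `comparator` is hypothetical and lists stand in for the parallel arrays.

```python
def comparator(first, second):
    """Return the second input where the first is <= 0, and 0 otherwise."""
    return [s if f <= 0 else 0.0 for f, s in zip(first, second)]


# Made-up node inputs for three test cases.
first = [-1.0, 0.0, 0.5]
second = [3.0, 4.0, 5.0]
selected = comparator(first, second)
```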
Classification in Bioinformatics In this experiment we investigate the
behaviour on another classification problem, this time a protein classifier as
described in [23]. Here the task is to predict the location of a protein in a cell, from
the amino acids in the particular protein. We used the entire dataset as the
training set. The set consisted of 2427 entries, with 19 variables each and 1 output.
We investigated the performance gain using several expression lengths, and the
results can be seen in Table 5. Here, the large number of test cases used results
in considerable improvements in evaluation time, even for small expressions.
                     Test Cases
Expression Length          2427
10                         3.44
100                        6.67
1000                      11.84
10000                     14.21
Table 5. Results for the protein classification experiment. The results show the number
of times faster evaluating evolved GP expressions is on the GPU, compared to CPU
implementation. The maximum expression length is the number of nodes in the CGP
graph.
6 Conclusions
This paper demonstrates that evaluation of genetic programming expressions can
strongly benefit from using the graphics processor to parallelise the evaluations.
With new development tools, it is now very easy to leverage the GPU for general
purpose computation. However, there are a few caveats. Here we have tested the
system using Cartesian GP; however, we expect similar advantages with other
representations, such as tree and linear GP.
Few clusters are constructed with high performance graphics cards, which will
limit the immediate use of these systems. Further benchmarking is required to
determine whether the low-end GPUs found in most PCs today provide a speed advantage.
Given the computational benefits and the relatively low costs of fast graphics
cards, it is likely that GPU acceleration for numerical applications will become
widespread amongst lower priced installations.
Many typical GP problems do not have large sets of fitness cases for two
reasons: First, evaluation has always been considered computationally expensive.
Second, we currently find it very difficult to evolve solutions to harder problems.
With the ability to tackle larger problems in reasonable time we have to also find
innovative approaches that let us solve these problems. Traditional GP has
difficulty with scaling. For example, the largest evolved multiplier has 1024 fitness
cases [24]. In the same time it would take a CPU implementation to evaluate an
individual with that many fitness cases, we could test more than 65536 fitness
cases on a GPU. This leads to a gap between what we can realistically evaluate,
and what we can evolve. The authors of this paper advocate developmental
encodings, and using the evaluation approach introduced here we will be able to
test this position.
For small sets of fitness cases, the overhead of transferring data to the GPU
and of constructing shaders results in a performance decrease. In practical
applications one would want to determine the point at which the advantage
of GPU computing kicks in, and switch execution to the appropriate type of
hardware. In this contribution, we have just looked at the most trivial way of
parallelizing a GP system on GPU hardware. More sophisticated approaches to
parallelisation will have to be examined in the future.
Appendix: Code Examples
To demonstrate the ease of development, we include a small code sample showing
the use of MS Accelerator from C#. The first stage is to make arrays of the data
to operate on. In a GP system these may be the fitness cases.
float[,] DataA = new float[4096, 4096];
float[,] DataB = new float[4096, 4096];
Next, the GPU has to be initialized, and the floating point arrays converted
to parallel arrays:
FloatParallelArray ParallelDataA =
new DisposableFloatParallelArray(DataA);
FloatParallelArray ParallelDataB =
new DisposableFloatParallelArray(DataB);
The parallel arrays are textures inside the GPU memory. Next, the shader
program is specified by performing operations on the parallel arrays. However,
the computation is not done until requested, as the shader program needs to be
compiled, uploaded to the GPU shader processors and executed.
FloatParallelArray ParallelResult =
ParallelArrays.Add(ParallelDataA, ParallelDataB);
Finally, we request that the expression is evaluated, and get the result from
the GPU. The result is stored as a texture in the GPU, which needs to be
converted back into a floating point array that can be used by the CPU.
float[,] Result = new float[4096, 4096];
ParallelArrays.ToArray(ParallelResult, out Result);
