Fast Genetic Programming on GPUs
Simon Harding and Wolfgang Banzhaf
Computer Science Department, Memorial University, Newfoundland
{simonh,banzhaf}@cs.mun.ca
http://www.cs.mun.ca
Abstract. As is typical in evolutionary algorithms, fitness evaluation in GP takes the majority of the computational effort. In this paper we demonstrate the use of the Graphics Processing Unit (GPU) to accelerate the evaluation of individuals. We show that for both binary and floating point based data types, it is possible to get speed increases of several hundred times over a typical CPU implementation. This allows for evaluation of many thousands of fitness cases, and hence should enable more ambitious solutions to be evolved using GP.
Key words: Genetic programming, Graphics Card Acceleration, Parallel Evaluation
1 Introduction
It is well known that fitness evaluation is the most time-consuming part of the genetic programming (GP) system. This limits the types of problems that may be addressed by GP, as large numbers of fitness cases make GP runs impractical. In some systems it is possible to accelerate the evaluation process using a variety of techniques. In this paper we present a method using the graphics processing unit on the video adapter. We study the evaluation of evolved mathematical expressions and digital circuits, as they are typically used to evaluate the performance of a genetic programming algorithm.
Different approaches have been used in the past for accelerating evaluation. For example, it is possible to co-evolve fitness cases in order to reduce the number of evaluations [1]. This, however, adds significant complexity to the algorithm, and does not guarantee an increase in performance under all circumstances. In other applications, one could select the number of fitness cases, e.g. by stochastic sampling or other methods [2]. Should the system need to be tested against a complete input set, however, this approach would not be suitable. Another method involves compiling the evolved expression to executable code, or even using binary code directly [3]. Writing expressions as native code, or in a similar vein, has many advantages [4]. The compiler or a hand-written algorithm can perform optimisations, e.g. by removing redundant code, which in addition to directly running the expression gives a significant increase in performance. The use of reflection in modern languages such as Java and C# provides for the possibility to compile and link code to the currently executing application.
Under some circumstances it is possible to offload the evaluation to more suitable hardware. When evaluating digital circuits, they can be loaded into a field programmable gate array (FPGA) and then executed on dedicated hardware [5]. This approach can provide large speed increases. However, downloading configurations into an FPGA can be a costly overhead. The biggest drawback of this approach is that it requires the use of external hardware, which may have to be specifically developed.
Recently it has become possible to access the processing power of the graphics processing unit (GPU). Modern GPUs are extremely good at performing parallel mathematical operations [6]. However, until recently it was cumbersome to use this resource for general purpose computing. For a general survey of algorithms implemented on GPUs, the reader is referred to [7]. For example, discrete wavelet transformations [8], the solution of dense linear systems [9], physics simulations for games, fluid simulators [10], etc., have been shown to execute faster on GPUs.
In this paper we demonstrate a method for using the GPU as an evaluator for genetic programming expressions, and show that there are considerable speed increases to be gained. Using recent libraries, we also show that putting functions to work on the GPU is relatively painless. As many of these technologies are new, we include web links to sites containing the most recent information on the projects discussed.
Because capable hardware and software are new, there is relatively little previous work on using GPUs for evolutionary computation. For example, [11] implements an evolutionary programming algorithm on a GPU, and finds a 5-fold speed increase. Work by [12] expands on this, and evaluates expressions on the GPU. There, all the operations are treated as graphics operations, which makes implementation difficult and limits the flexibility of the evaluations. Yu et al. [13], on the other hand, implement a Genetic Algorithm on GPUs. Depending on population size, they find a speed-up factor of up to 20. Here both the genetic operators and fitness evaluation are performed on the GPU. Ebner et al. use human interaction to evolve aesthetically pleasing shader programs [14]. Here, linear genetic programming structures are compiled into shader programs. The shader programs were then used to render textures on images, which were selected by a user. However, the technique was not extended into more general purpose computation.
To our knowledge, this contribution is the first study of general purpose Genetic Programming executed on a graphics hardware platform. It makes use of the fact that GP fitness cases are numerous and can be executed in parallel. Provided there is a sufficient number of fitness cases (large datasets), a substantial speedup can be reached.
2 The Architecture of Graphics Processing Units
Graphics processors are specialized stream processors used to render graphics. Typically, the GPU is able to perform graphics manipulations much faster than a general purpose CPU, as the processor is specifically designed to handle certain primitive operations. Internally, the GPU contains a number of small processors that are used to perform calculations on 3D vertex information and on textures. These processors operate in parallel with each other, and work on different parts of the problem. First the vertex processors calculate the 3D view, then the shader processors paint this model before it is displayed. Programming the GPU is typically done through a virtual machine interface such as OpenGL or DirectX, which provides a common interface to the diverse GPUs available, thus making development easy. However, DirectX and OpenGL are optimized for graphics processing, hence other APIs are required to use the GPU as a general purpose device. There are many such APIs, and Section 3 describes several of the more common ones.
For general purpose computing, we here wish to make use of the parallelism provided by the shader processors, see Figure 1. Each processor can perform multiple floating point operations per clock cycle, meaning that performance is determined by the clock speed, the number of pixel shaders and the width of the pixel shaders. Pixel shaders are programmed to perform a given set of instructions on each pixel in a texture. Depending on the GPU, the number of instructions may be limited. In order to use more than this number of operations, a program needs to be broken down into suitably sized units, which may impact performance. Newer GPUs support unlimited instructions, but some older cards support as few as 64 instructions. GPUs typically use floating point arithmetic, the precision of which is often controllable, as less precise representations are faster to compute with. Again, the maximum precision is manufacturer specific, but recent cards provide up to 128-bit precision.
The graphics card used in these experiments is an NVidia GeForce 7300 Go, which is a GPU optimized for laptop use. It is underpowered compared to cards available for desktop PCs. Because GPUs are parallel and have very strict processing models, the computational ability of the GPU scales well with the number of pixel shaders. We would therefore expect to see major improvements to the performance of the benchmarks given here if we were to run them on such a GPU. According to [15], "an NVIDIA 7800 GTX 512 is capable of around 200 GFLOPS. ATI's latest X1900 architecture has a claimed performance of 554 GFLOPS". Since it is now possible to place multiple GPUs inside a single PC chassis, this should result in TFLOP performance for numerical processing at low cost.
A further advantage of the GPU is that it uses less power than a typical CPU. Power consumption has become an important consideration in building clusters, since it causes heat generation.
3 Programming a GPU
In this section we provide a brief overview of some of the general purpose computation toolkits for GPUs that are available. This is not an exhaustive list, but is intended to act as a guide to others. More information on these systems can be found at www.gpgpu.org.
Fig. 1. Arrays, representing the test cases, are converted to textures. These textures are then manipulated (in parallel) by small programs inside each of the pixel shaders. The result is another texture, which can be converted back to a normal array for CPU based processing.
Sh: Sh is an open source project for accessing the GPU under C++ [16, 17]. Many graphics cards are supported, and the system is platform independent. Many low level features can be accessed using Sh; however, these require knowledge of the mechanisms used by the shaders. The Sh libraries provide typical matrix and vector manipulations, such as dot products and addition-multiplication operators. In addition to providing general purpose computing, Sh also provides many routines for use in graphics programming. This feature is unique amongst the tools described here, and would be useful in visualisation of results.
Brook: Brook is another way to access the features on the GPU [18]. Brook takes the form of extensions to the C programming language, adding support for GPU specific data types. Applications developed with Brook are compiled using a special C compiler, which generates C++ and Cg code. Cg is a programming language for graphics that is similar to C. One major advantage of Brook is that it can target either OpenGL or DirectX, and is therefore more platform independent than other tools. However, code must be compiled separately for each target platform. Brook appears to be a very popular choice, and is used for large applications, such as folding@home.
Fast Genetic Programming on GPUs 5
PyGPU: Another recent library allows access to GPU functionality from the Python language [19]. PyGPU runs as an embedded language inside Python. The work is in its early stages, but results are promising. However, it currently lacks the optimization required to make full use of the GPU. It requires a variety of extra packages to be installed into Python, such as NumPy and PyGame (which does not yet support the most recent Python release). Given the rise in popularity of Python for scientific computing, this implementation should prove useful in the future. Python itself, however, appears to have significant performance issues compared to C++ and JIT languages such as Java or C#.¹
Accelerator: Recently a .Net assembly called Accelerator was released that provides access to the GPU via the DirectX interface [20]. The system is completely abstracted from the GPU, and presents the end user with only arrays that can be operated on in parallel. Unfortunately, the system is only available for the Windows platform due to its reliance on DirectX. However, the assembly can be used from any .Net programming language.
This tool differs from the previous interfaces in that it uses lazy evaluation. Operations are not performed on the data until the evaluated result is requested. This enables a certain degree of real time optimization, and reduces the computational load on the GPU. In particular, common subexpressions are optimised, which reduces the creation of temporary shaders and textures. The movement of data to and from the GPU can also be efficiently optimized, which reduces the impact of the relatively slow transfer of data out of the GPU. The compilation to the shader model occurs at run time, and hence can automatically make use of the different features available on the supported graphics cards.
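The lazy-evaluation idea can be illustrated with a minimal sketch. This is not the Accelerator API (the paper's own code is C#); it is an illustrative Python model with hypothetical names (LazyArray, evaluate) showing how operations build an expression graph that is only computed, and could first be optimised, when the result is requested.

```python
# Illustrative sketch of lazy evaluation (hypothetical names, not the
# Accelerator API): arithmetic on LazyArray builds an expression graph
# instead of computing immediately.
class LazyArray:
    def __init__(self, op, args, data=None):
        self.op, self.args, self.data = op, args, data

    def __add__(self, other):
        return LazyArray("add", [self, other])

    def __mul__(self, other):
        return LazyArray("mul", [self, other])

    def evaluate(self):
        # Computation happens only here; a real system could first
        # optimise the graph (e.g. merge common subexpressions) and
        # batch the transfer of data to the GPU.
        if self.op == "leaf":
            return self.data
        a, b = (x.evaluate() for x in self.args)
        if self.op == "add":
            return [p + q for p, q in zip(a, b)]
        return [p * q for p, q in zip(a, b)]

a = LazyArray("leaf", [], [1.0, 2.0, 3.0])
b = LazyArray("leaf", [], [4.0, 5.0, 6.0])
expr = (a + b) * a          # no arithmetic performed yet
result = expr.evaluate()    # forces the deferred computation
```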
In this paper we use the Accelerator package. The total time required to reimplement an existing tree-based GP parser was less than two hours, which we would expect to be considerably less than using any of the other solutions presented here. As with other implementations, Accelerator is based on arrays implemented as textures. The API then allows one to perform parallel operations on the arrays. Conversion to textures, and transfer to the GPU, is handled transparently by the API, allowing the developer to concentrate on the implementation of the algorithm. The available function set for operating on parallel arrays is similar to the other APIs. It includes element-wise arithmetic operations, square root, multiply-add, and trigonometric operations. There are also conditional operations and functions for comparing two arrays. The API also provides reduction operators, such as the sum, product, minimum or maximum value in the array. Further functions perform transformations, such as shift and rotate, on the elements of the array.
The other systems described here present different variations on these functions, and generally offer functionality that allows different operations to be applied to different parts of the arrays.
¹ As usual, available benchmarks may not give a fair reflection of real-world performance.
4 Parsing a GP Expression
Typically, parsing a GP expression involves traversing the expression tree in a bottom-up, breadth-first manner. At each node visited, the interpreter performs the specified function on the inputs to the node, and outputs the result as the node output. The tree is re-evaluated for every input set. Hence, for 100 test cases the tree would be executed 100 times.
Using the GPU we are able to parallelize this process, which means that in effect the tree only has to be parsed once, with the function evaluation performed in parallel. Even without the arithmetic acceleration provided by the GPU, this results in a considerable reduction in computation. Our GP interpreter uses a case statement at the evaluation of each node to determine what function to apply to the input values. If run on the GPU, the tree needs only to be executed once, removing the need for repeatedly accessing the case statement. The use of the GPU is illustrated in Figure 1. The population and genetic algorithm run on the CPU, with evaluations run on the GPU. The CPU converts arrays of test cases to textures on the GPU and loads a shader program into the shader processors. The Accelerator toolkit compiles each individual's GP expression into a shader program. The program is then executed, and the resulting texture is converted back into an array. The fitness is determined from this output array.
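The contrast between per-case interpretation and parse-once parallel evaluation can be sketched as follows. This is an illustrative Python model (the paper's system is C# with Accelerator); the function and variable names are hypothetical, and plain lists stand in for GPU textures.

```python
# Sketch: evaluating the tree (x * x) + y two ways. Per-case
# interpretation re-walks the tree (hitting the case statement) once
# per fitness case; parallel evaluation walks the tree once and
# applies each operator to whole arrays of cases.

def interpret(node, case):
    # node is ("+", left, right), ("*", left, right) or ("var", name)
    if node[0] == "var":
        return case[node[1]]
    l, r = interpret(node[1], case), interpret(node[2], case)
    if node[0] == "+":
        return l + r
    if node[0] == "*":
        return l * r
    raise ValueError(node[0])

def eval_parallel(node, arrays):
    # Single traversal; each operator maps over all fitness cases at
    # once (on a GPU these would be texture-wide shader operations).
    if node[0] == "var":
        return arrays[node[1]]
    l = eval_parallel(node[1], arrays)
    r = eval_parallel(node[2], arrays)
    op = {"+": lambda a, b: a + b, "*": lambda a, b: a * b}[node[0]]
    return [op(a, b) for a, b in zip(l, r)]

tree = ("+", ("*", ("var", "x"), ("var", "x")), ("var", "y"))
cases = [{"x": 1.0, "y": 2.0}, {"x": 3.0, "y": 4.0}]
per_case = [interpret(tree, c) for c in cases]
arrays = {"x": [1.0, 3.0], "y": [2.0, 4.0]}
parallel = eval_parallel(tree, arrays)   # same results, one traversal
```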
5 Benchmarks
5.1 Configuration
The GP parser used here is written in C#, and compiled using Visual Studio 2005. All benchmarks were done using the Release build configuration, and were executed on CLR 2.0 on Windows XP. The GPU is an NVidia GeForce 7300 Go with 512MB of video memory. The CPU used is an Intel Centrino T2400 (running at 1.83GHz), with 1.5GB of system memory.
In these experiments, GP trees were randomly generated with a given number of nodes. The expressions were evaluated on the CPU and then on the GPU, and each evaluation was timed. Timing was performed using calls to the Win32 API QueryPerformanceCounter, which returns high precision timings. For each input size/expression length pair, 100 different randomly generated expressions were used, and results were averaged to calculate acceleration factors. Therefore our results show the average number of times the GPU is faster at evaluating a given tree size for a given number of fitness cases. Results less than 1 mean that the CPU was faster at evaluating the expression; values above 1 indicate the GPU performed better.
5.2 Floating point
In the first experiment, we evaluated random GP trees containing varying numbers of nodes, and exposed them to varying test case sizes. The mathematical functions +, −, × and / were used. The same expression was tested on the CPU and the GPU, and the speed difference was recorded. Results are shown in Table 1. For small node counts and fitness cases, the CPU performance is superior because of the overhead of mapping the computation to the GPU. For larger problems, however, there is a massive speed increase for GPU execution.
5.3 Binary
The second experiment compares the performance of the GPU at handling boolean expressions. In the CPU version, we use the C# boolean type, which is convenient, but not necessarily the most efficient representation. For the GPU, we tested two different approaches: one using the boolean parallel array provided by Accelerator, the other using float. The performance of these two representations is shown in Table 2. It is interesting to note that improvements are not guaranteed. As can be seen in the table, the speed-up can decrease as expression size increases. We assume this is due to the way in which large shader programs are handled by either Accelerator or the GPU. For example, the length of the shader program on the NVIDIA GPU may be limited, and going beyond this length would require repeated passes of the data. This type of behaviour can be seen in many of the results presented here.
We limit the functions in the expressions to AND, OR and NOT, which are supported by the boolean array type. Following some sample code provided with Accelerator, we mimicked boolean behaviour using 0.0f as false and 1.0f as true. For two floats, AND can be viewed as the minimum of the two values. Similarly, OR can be viewed as the maximum of the two values. NOT can be performed as a multiply-add, where the first stage is to multiply by −1, then add 1.
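The float encoding of the three boolean operators can be written out directly. A minimal Python sketch (the paper's implementation is C# over Accelerator parallel arrays; these helper names are hypothetical), with 0.0 as false and 1.0 as true, applied element-wise across all fitness cases:

```python
# Boolean logic over floats, 0.0 = false and 1.0 = true, mapped over
# arrays of fitness cases as described in the text.
def float_and(a, b):
    return [min(x, y) for x, y in zip(a, b)]   # AND = minimum

def float_or(a, b):
    return [max(x, y) for x, y in zip(a, b)]   # OR = maximum

def float_not(a):
    return [x * -1.0 + 1.0 for x in a]         # NOT = multiply-add

t, f = 1.0, 0.0
# Truth tables over all four input combinations:
assert float_and([t, t, f, f], [t, f, t, f]) == [t, f, f, f]
assert float_or([t, t, f, f], [t, f, t, f]) == [t, t, t, f]
assert float_not([t, f]) == [f, t]
```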
5.4 Real world tests
In this experiment, we investigate the speed-up on both toy and real world problems, rather than on arbitrary expressions. The GP representation we chose to use here is CGP, but similar results should be obtained from other representations. CGP is fully described in [21]. In the benchmark experiments, the expression lengths were uniform throughout the tests. However, in real GP the lengths of the expressions vary throughout the run. As the GPU sometimes results in slower performance, we need to verify that, on average, there is an advantage.
Regression We evolved functions that regressed over x^6 − 2x^4 + x^2 [22]. We tested the evaluation difference using a number of test cases. In each instance, the test cases were uniformly distributed between −1 and +1. We also changed the maximum length of the CGP graph. Hence, expression lengths could range anywhere from 1 node to the maximum size of the CGP graph. GP was run for 200 generations to allow for convergence. The function set comprised +, −, × and /. In C#, division by zero on a float returns "Infinity", which is consistent with the result from the Accelerator library.
Test Cases
Expression Length 64 256 1024 4096 16384 65536
10 0.04 0.16 0.6 2.39 8.94 28.34
100 0.4 1.38 5.55 23.03 84.23 271.69
500 1.82 7.04 27.84 101.13 407.34 1349.52
1000 3.47 13.78 52.55 204.35 803.28 2694.97
5000 10.02 26.35 87.46 349.73 1736.3 4642.4
10000 13.01 36.5 157.03 442.23 1678.45 7351.06
Table 1. Results showing the number of times faster evaluating floating point based expressions is on the GPU, compared to the CPU implementation. A value of less than 1 shows that the CPU is more efficient.
Boolean implementation
Test Cases
Expression Length 4 16 64 256 1024 4096 16384 65536
10 0.22 1.04 1.05 2.77 7.79 36.53 84.08 556.40
50 0.44 0.57 1.43 3.02 14.75 58.17 228.13 896.33
100 0.39 0.62 1.17 4.36 14.49 51.51 189.57 969.33
500 0.35 0.43 0.75 2.64 14.11 48.01 256.07 1048.16
1000 0.23 0.39 0.86 3.01 10.79 50.39 162.54 408.73
1500 0.40 0.55 1.15 4.19 13.69 53.49 113.43 848.29
Boolean implementation, using floating point
Test Cases
Expression Length 4 16 64 256 1024 4096 16384 65536
10 0.024 0.028 0.028 0.072 0.282 0.99 3.92 14.66
50 0.035 0.049 0.115 0.311 1.174 4.56 17.72 70.48
100 0.061 0.088 0.201 0.616 2.020 8.78 34.69 132.84
500 0.002 0.003 0.005 0.017 0.064 0.25 0.99 3.50
1000 0.001 0.001 0.003 0.008 0.030 0.12 0.48 1.49
1500 0.000 0.001 0.002 0.005 0.019 0.07 0.29 1.00
Table 2. Results showing the number of times faster evaluating boolean expressions is on the GPU, compared to the CPU implementation. A value of less than 1 shows that the CPU is more efficient. Booleans were implemented as floating point numbers and as booleans. Although faster than the CPU for large input sizes, in general it appears preferential to use the boolean representation. Using the floating point representation can provide speed increases, but the results are varied.
Test Cases
Max Expression Length 10 100 1000 2000
10 0.02 0.08 0.7 1.22
100 0.07 0.33 2.79 5.16
1000 0.42 1.71 15.29 87.02
10000 0.4 1.79 16.25 95.37
Table 3. Results for the regression experiment. The results show the number of times faster evaluating evolved GP expressions is on the GPU, compared to the CPU implementation. The maximum expression length is the number of nodes in the CGP graph.
Fast Genetic Programming on GPUs 9
Test Cases
Max Expression Length 194 388 970 1940
10 0.15 0.23 0.51 1.01
100 0.38 0.67 1.63 3.01
1000 1.77 3.19 9.21 22.7
10000 1.69 3.21 8.94 22.38
Table 4. Results for the two spirals classification experiment. The results show the number of times faster evaluating evolved GP expressions is on the GPU, compared to the CPU implementation. The maximum expression length is the number of nodes in the CGP graph.
Fitness was defined as the sum of the absolute errors between the target value of each test case and the output of the expression. This can also be calculated using the GPU. Each individual was evaluated with the CPU, then the GPU, and the speed difference recorded. Also, the outputs from both the GPU and CPU were compared to ensure that they were evaluating the expression in the same manner. We did not find any instances where the two differed.
Table 3 shows results that are consistent with the tests described in previous sections. For smaller input sets and small expressions, it was more efficient to evaluate them on the CPU. However, for the larger test and expression sizes the performance increase was dramatic.
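The regression fitness computation can be sketched as follows. This is an illustrative Python version (the paper's system is C#; the function names and the choice of 21 test cases are hypothetical), using the target x^6 − 2x^4 + x^2 and uniformly spaced cases in [−1, +1]:

```python
# Sketch of the fitness computation: sum of absolute errors between a
# candidate expression's output and the regression target, over
# uniformly distributed test cases in [-1, +1].
def target(x):
    return x**6 - 2 * x**4 + x**2

def fitness(candidate, n_cases=21):
    xs = [-1.0 + 2.0 * i / (n_cases - 1) for i in range(n_cases)]
    return sum(abs(candidate(x) - target(x)) for x in xs)

# A perfect candidate scores 0. On a GPU the per-case errors and the
# final sum (a reduction operator) can both be computed in parallel.
perfect = fitness(target)
imperfect = fitness(lambda x: 0.0)
```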
Classification In this experiment we attempted the classification problem of distinguishing between two spirals, as described in [22]. This problem has two input values (the x and y coordinates of a point on a spiral) and a single output indicating on which spiral the point is found. In [22], 194 test cases are used. In these experiments, we extend the number of test cases to 388, 970 and 1940. We also extended the function set to include sin, cos, √x, x^y and a comparator. The comparator looks at the first input value to the node; if it is less than or equal to zero, it returns the second input, 0 otherwise. The relative speed increases can be seen in Table 4. Again we see that the GPU is superior for larger numbers of test cases, with larger expression sizes.
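The comparator node described above is simple to state in code. A minimal Python sketch (hypothetical helper name; the paper's node is implemented over Accelerator arrays), mapped element-wise over all test cases:

```python
# Comparator node: if the first input is <= 0, return the second
# input; otherwise return 0. Applied element-wise across test cases.
def comparator(first, second):
    return [b if a <= 0.0 else 0.0 for a, b in zip(first, second)]

out = comparator([-1.0, 0.0, 2.0], [5.0, 6.0, 7.0])
```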
Classification in Bioinformatics In this experiment we investigate the behaviour on another classification problem, this time a protein classifier as described in [23]. Here the task is to predict the location of a protein in a cell from the amino acids in the particular protein. We used the entire dataset as the training set. The set consisted of 2427 entries, with 19 variables each and 1 output. We investigated the performance gain using several expression lengths, and the results can be seen in Table 5. Here, the large number of test cases results in considerable improvements in evaluation time, even for small expressions.
Test Cases
Expression Length 2427
10 3.44
100 6.67
1000 11.84
10000 14.21
Table 5. Results for the protein classification experiment. The results show the number of times faster evaluating evolved GP expressions is on the GPU, compared to the CPU implementation. The maximum expression length is the number of nodes in the CGP graph.
6 Conclusions
This paper demonstrates that evaluation of genetic programming expressions can strongly benefit from using the graphics processor to parallelise the evaluations. With new development tools, it is now very easy to leverage the GPU for general purpose computation. However, there are a few caveats. Here we have tested the system using Cartesian GP; however, we expect similar advantages with other representations, such as tree and linear GP.
Few clusters are constructed with high performance graphics cards, which will limit the immediate use of these systems. Further benchmarking will be required to determine whether the low-end GPUs found in most PCs today provide a speed advantage. Given the computational benefits and the relatively low costs of fast graphics cards, it is likely that GPU acceleration for numerical applications will become widespread amongst lower priced installations.
Many typical GP problems do not have large sets of fitness cases, for two reasons: First, evaluation has always been considered computationally expensive. Second, we currently find it very difficult to evolve solutions to harder problems. With the ability to tackle larger problems in reasonable time, we also have to find innovative approaches that let us solve these problems. Traditional GP has difficulty with scaling. For example, the largest evolved multiplier has 1024 fitness cases [24]. In the time it would take a CPU implementation to evaluate an individual with that many fitness cases, we could test more than 65536 fitness cases on a GPU. This leads to a gap between what we can realistically evaluate and what we can evolve. The authors of this paper advocate developmental encodings, and using the evaluation approach introduced here we will be able to test this position.
For small sets of fitness cases, the overhead of transferring data to the GPU and of constructing shaders results in a performance decrease. It can be imagined that, in practical applications, one would want to determine when the advantage of GPU computing kicks in, and switch execution to the appropriate type of hardware. In this contribution, we have just looked at the most trivial way of parallelizing a GP system on GPU hardware. More sophisticated approaches to parallelisation will have to be examined in the future.
Appendix: Code Examples
To demonstrate the ease of development, we include a small code sample showing the use of MS Accelerator from C#. The first stage is to make arrays of the data to operate on. In a GP system these may be the fitness cases.
float[,] DataA = new float[4096, 4096];
float[,] DataB = new float[4096, 4096];
Next, the GPU has to be initialized, and the floating point arrays converted
to parallel arrays:
ParallelArrays.InitGPU();
FloatParallelArray ParallelDataA =
new DisposableFloatParallelArray(DataA);
FloatParallelArray ParallelDataB =
new DisposableFloatParallelArray(DataB);
The parallel arrays are textures inside the GPU memory. Next, the shader program is specified by performing operations on the parallel arrays. However, the computation is not done until requested, as the shader program needs to be compiled, uploaded to the GPU shader processors and executed.
FloatParallelArray ParallelResult =
ParallelArrays.Add(ParallelDataA, ParallelDataB);
Finally, we request that the expression is evaluated, and get the result from the GPU. The result is stored as a texture in the GPU, which needs to be converted back into a floating point array that can be used by the CPU.
float[,] Result = new float[4096, 4096];
ParallelArrays.ToArray(ParallelResult, out Result);
References
1. Lasarczyk, C., Dittrich, P., Banzhaf, W.: Dynamic subset selection based on a
fitness case topology. Evolutionary Computation 12 (2004) 223–242
2. Banzhaf, W., Nordin, P., Keller, R., Francone, F.: Genetic Programming - An
Introduction. Morgan Kaufmann, San Francisco, CA, USA (1998)
3. Nordin, P., Banzhaf, W., Francone, F.: Efficient Evolution of Machine Code for CISC Architectures using Blocks and Homologous Crossover. In Spector, L., Langdon, W., O'Reilly, U.M., Angeline, P., eds.: Advances in Genetic Programming III, MIT Press, Cambridge, MA, USA (1999) 275–299
4. Brameier, M., Banzhaf, W.: Linear Genetic Programming. Springer, New York,
USA (2006)
5. Lau, W.S., Li, G., Lee, K.H., Leung, K.S., Cheang, S.M.: Multi-logic-unit processor: A combinational logic circuit evaluation engine for genetic parallel programming. In: EuroGP. (2005) 167–177
6. Thompson, C., Hahn, S., Oskin, M.: Using Modern Graphics Architectures for General-Purpose Computing: A Framework and Analysis. In: Proceedings of the 35th International Symposium on Microarchitecture, Istanbul, IEEE Computer Society Press (2002) 306–317
7. Owens, J., Luebke, D., Govindaraju, N., Harris, M., Kruger, J., Lefohn, A., Purcell,
T.: A survey of general-purpose computation on graphics hardware. Eurographics
2005, State of the Art Reports (2005) 21–51
8. Wang, J., Wong, T.T., Heng, P.A., Leung, C.S.: Discrete wavelet transform on GPU. In: Proceedings of ACM Workshop on General Purpose Computing on Graphics Processors. (2004) C–41
9. Galoppo, N., Govindaraju, N., Henson, M., Manocha, D.: LU-GPU: Efficient algorithms for solving dense linear systems on graphics hardware. In: Supercomputing, 2005: Proceedings of the ACM/IEEE SC 2005 Conference (2005) 3
10. Hagen, T.R., Hjelmervik, J.M., Lie, K.A., Natvig, J.R., Henriksen, M.O.: Visual simulation of shallow-water waves. Simulation Modelling Practice and Theory 13 (2005) 716–726
11. Wong, M.L., Wong, T.T., Fok, K.L.: Parallel evolutionary algorithms on graphics processing unit. In: Proceedings of IEEE Congress on Evolutionary Computation 2005 (CEC 2005). Volume 3. (2005) 2286–2293
12. Fok, K.L., Wong, T.T., Wong, M.L.: Evolutionary computing on consumer-level graphics hardware. IEEE Intelligent Systems, to appear (2005)
13. Yu, Q., Chen, C., Pan, Z.: Parallel Genetic Algorithms on Programmable Graphics
Hardware. Lecture Notes in Computer Science 3612 (2005) 1051
14. Ebner, M., Reinhardt, M., Albert, J.: Evolution of vertex and pixel shaders. In Keijzer, M., Tettamanzi, A., Collet, P., van Hemert, J., Tomassini, M., eds.: Proceedings of the Eighth European Conference on Genetic Programming (EuroGP 2005), Lausanne, Switzerland, Springer-Verlag (2005) 261–270
15. Wikipedia: FLOPS. Wikipedia, the free encyclopedia. http://en.wikipedia.org/w/index.php?title=FLOPS&oldid=84987291 (2006) [Online; accessed 1-November-2006].
16. RapidMind Inc: Libsh. (http://libsh.org/)
17. LibSh Wiki: Libsh sample code. (http://www.libsh.org/wiki/index.php/Sample_Code)
18. Stanford University Graphics Lab: Brook. (http://graphics.stanford.edu/
projects/brookgpu/)
19. Lejdfors, C., Ohlsson, L.: Implementing an embedded GPU language by combining
translation and generation. In: SAC '06: Proceedings of the 2006 ACM Symposium
on Applied Computing, New York, NY, USA, ACM Press (2006) 1610–1614
20. Tarditi, D., Puri, S., Oglesby, J.: Accelerator: Using data parallelism to
program GPUs for general-purpose uses. Technical Report MSR-TR-2005-184,
Microsoft Research (2006)
21. Miller, J.F., Thomson, P.: Cartesian genetic programming. In Poli, R., et al., eds.:
Proc. of EuroGP 2000. Volume 1802 of LNCS, Springer-Verlag (2000) 121–132
22. Koza, J.: Genetic Programming: On the Programming of Computers by Means of
Natural Selection. MIT Press, Cambridge, Massachusetts, USA (1992)
23. Langdon, W.B., Banzhaf, W.: Repeated sequences in linear genetic programming
genomes. Complex Systems 15(4) (2005) 285–306
24. Torresen, J.: Evolving multiplier circuits by training set and training vector
partitioning. In: ICES'03: From Biology to Hardware. Volume 2606. (2003) 228–237