Chapter 8
Hardware Acceleration for CGP: Graphics
Processing Units
Simon L. Harding and Wolfgang Banzhaf
8.1 Graphics Processing Units
Graphic Processing Units (GPUs) are fast, highly parallel units. In addition to pro-
cessing 3D graphics, modern GPUs can be programmed for more general-purpose
computation. A GPU consists of a large number of ‘shader processors’, and concep-
tually operates as a single instruction multiple data (SIMD) or multiple instruction
multiple data (MIMD) stream processor. A modern GPU can have several hundred
of these stream processors, which, combined with their relatively low cost, makes
them an attractive platform for scientific computing. In the last two years, the genetic
programming community has begun to exploit GPUs to accelerate the evaluation of
individuals in a population [1, 4].
CGP was the first GP technique implemented in a general-purpose fashion on
GPUs. By ‘general purpose’, we mean a technique that can be applied to a number
of GP applications and not just a single, specialized task. Implementing CGP on
GPUs has resulted in very significant performance increases.
In this chapter, we discuss several of our implementations of CGP on GPUs. We
begin with an overview of the hardware and software used, before discussing
applications and the speed-ups obtained.
8.2 The Architecture of Graphics Processing Units
Graphics processors are specialized stream processors used to render graphics. Typ-
ically, a GPU is able to perform graphics manipulations much faster than a general-
purpose CPU, as graphics processors are specifically designed to handle certain
primitive operations. Internally, a GPU contains a number of small processors that
are used to perform calculations on 3D vertex information and on textures. These
processors operate in parallel with each other, and work on different parts of the
J.F. Miller (ed.), Cartesian Genetic Programming, Natural Computing Series, DOI 10.1007/978-3-642-17310-3_8, © Springer-Verlag Berlin Heidelberg 2011
problem. First, the vertex processors calculate a 3D view, and then the shader pro-
cessors paint this model before it is displayed. Programming a GPU is typically
done through a virtual-machine interface such as OpenGL or DirectX, which pro-
vides a common interface to the diverse GPUs available, thus making development
easy. However, DirectX and OpenGL are optimized for graphics processing, and
hence other APIs are required to use a GPU as a general-purpose device. There are
many such APIs, and Sect. 8.3 describes several of the ones that have been used to
implement CGP.
For general-purpose computing, we wish here to make use of the parallelism pro-
vided by the shader processors (see Fig. 8.1). Each processor can perform multiple
floating-point operations per clock cycle, meaning that the performance is deter-
mined by the clock speed, the number of pixel shaders and the width of the pixel
shaders. Pixel shaders are programmed to perform a given set of instructions on
each pixel in a texture. Depending on the GPU, the number of instructions may be
limited. In order to use more than this number of operations, a program needs to be
broken down into suitably sized units, which may impact performance. Newer GPUs
support unlimited instructions, but some older cards support as few as 64 instruc-
tions. GPUs typically use floating-point arithmetic, the precision of which is often
controllable, as less precise representations are faster to compute with. Again, the
maximum precision is manufacturer-specific, but recent cards provide up to 128-bit
precision.
Fig. 8.1 Arrays, representing test cases, are converted to textures. These textures are then manipu-
lated (in parallel) by small programs inside each of the pixel shaders. The result is another texture,
which can be converted back to a normal array for CPU-based processing.
A variety of different graphics cards were used in our experiments. In particular,
we used three cards from Nvidia: GeForce 7300 Go, GeForce 8600 and GeForce
9800.
8.3 Programming a GPU
We experimented with two different software APIs for programming GPUs: Mi-
crosoft’s Accelerator and the API from Nvidia.
Accelerator is a .Net assembly from Microsoft Research that provides access to
the GPU via the DirectX interface [15]. The system is completely abstracted from
the GPU, and presents the end user with only arrays that can be operated on in par-
allel. Unfortunately, the system is available only for the Windows platform, owing
to its reliance on DirectX. However, the assembly can be used from any .Net pro-
gramming language. Conversion to textures and transfer to the GPU are handled
transparently by the API, allowing the developer to concentrate on the implementa-
tion of the algorithm. The function set available for operating on parallel arrays is
similar to those for other APIs. It includes elementwise arithmetic operations, and
square root, multiply–add and trigonometric operations. There are also conditional
operations and functions for comparing two arrays. The API also provides reduc-
tion operators, such as the sum, product, minimum value and maximum value of
an array. Further functions perform transformations such as shift and rotate on the
elements of an array.
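The flavour of this parallel-array function set can be sketched in a few lines of Python; the names here (pa_add, pa_sum, pa_max, pa_shift) are illustrative stand-ins, not Accelerator's actual API:

```python
# Sketch of the parallel-array operations described above, modelled in
# plain Python lists. Function names are illustrative, not Accelerator's.

def pa_add(a, b):
    """Elementwise addition of two equal-length arrays."""
    return [x + y for x, y in zip(a, b)]

def pa_sum(a):
    """Reduction operator: sum of all elements."""
    return sum(a)

def pa_max(a):
    """Reduction operator: maximum element."""
    return max(a)

def pa_shift(a, n, fill=0.0):
    """Shift elements right by n positions (left if n is negative),
    padding with fill."""
    if n >= 0:
        return [fill] * n + a[:len(a) - n]
    return a[-n:] + [fill] * (-n)

a = [1.0, 2.0, 3.0]
b = [4.0, 5.0, 6.0]
assert pa_add(a, b) == [5.0, 7.0, 9.0]
assert pa_shift(a, 1) == [0.0, 1.0, 2.0]
```

On the GPU, each elementwise operation would run across all array positions in parallel; the sequential list comprehensions above only model the semantics.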
One benefit of this tool is that it uses lazy evaluation. Operations are not per-
formed on the data until the evaluated result is requested. This enables a certain
degree of real-time optimization, and reduces the computational load on the GPU.
In particular, it enables optimization of common subexpressions, which reduces the
creation of temporary shaders and textures. The movement of data to and from the
GPU can also be optimized efficiently, which reduces the impact of the relatively
slow transfer of data out of the GPU. The compilation to the shader model occurs at
run time, and hence can automatically make use of the different features available
on the various graphics cards supported.
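A minimal sketch of this lazy-evaluation idea, using a toy expression class (not Accelerator's actual mechanism): operations only build a graph, nothing is computed until the result is requested, and a shared subexpression is computed exactly once.

```python
# Minimal sketch of lazy evaluation with common-subexpression reuse,
# in the spirit described above; this is not Accelerator's implementation.

class Expr:
    def __init__(self, op, *args):
        self.op, self.args = op, args
        self._value = None              # cached result, filled on demand

    def __add__(self, other):
        return Expr('add', self, other)  # builds the graph, computes nothing

    def __mul__(self, other):
        return Expr('mul', self, other)

    def evaluate(self):
        # Deferred until requested; each node is computed at most once,
        # so a shared subexpression is evaluated a single time.
        if self._value is None:
            if self.op == 'leaf':
                self._value = self.args[0]
            else:
                vals = [a.evaluate() for a in self.args]
                self._value = vals[0] + vals[1] if self.op == 'add' else vals[0] * vals[1]
        return self._value

def leaf(v):
    return Expr('leaf', v)

shared = leaf(2.0) * leaf(3.0)   # built once, referenced twice below
result = shared + shared          # still nothing computed
assert result.evaluate() == 12.0  # 'shared' evaluated only once
```

In Accelerator, the same idea lets the runtime avoid creating temporary shaders and textures for repeated subexpressions.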
CUDA is an extension of the C programming language developed by Nvidia.
It allows low-level access to the GPU and therefore can make optimal use of the
hardware. CUDA can be viewed as a set of language extensions that provide support
for parallel programming using massive numbers of threads. The API provides all
the tools needed to marshal data to and from the GPU and for the actual algorithm
implementation. The programs look very much like C programs.
CUDA programs are typically written as pure C programs. However, many li-
braries exist that expose the programming interface of the CUDA driver to other
programming environments. In the work described here, the CUDA.net library from
GASS [3] was used. This library allows any .Net language to communicate with the
CUDA driver.
Accessing the CUDA driver allows applications to access the GPU memory, load
code libraries (modules) and execute functions. The libraries themselves still need
to be implemented in C and compiled using the CUDA compiler. During this com-
pilation process, the CUDA compiler can optimize the generated code.
In addition to the overhead of compiling shader programs, GPU programming
also has a number of other bottlenecks. In particular, there is a small cost of trans-
ferring data to or from the GPU memory.
8.4 Parallel Implementation of GP
Implementations of GP on GPUs have largely fallen into two distinct evaluation
methodologies: population-parallel and fitness-case-parallel. Both methods exploit
the highly parallel architecture of the GPU. In the fitness-case-parallel approach, all
the fitness cases are executed in parallel, with only one individual being evaluated
at a time. This can be considered as a SIMD approach. In the population-parallel
approach, multiple individuals are evaluated simultaneously. Both approaches have
problems that impact performance.
The fitness-case-parallel approach requires that a different program is uploaded
to the GPU for each evaluation. This introduces a large overhead. The programs
need to be compiled (on the CPU side) and then loaded into the graphics card before
any evaluation can occur. Previous work has shown that for smaller numbers of
fitness cases (typically less than 1000), this overhead is larger than the increase in
computational speed [2, 7, 6]. In these cases, there is no benefit from executing
the program on the GPU. For some problems, there are fewer fitness cases than
shader processors. In this scenario, some of the processors are idle. This problem
will become worse as the number of shader processors increases with new GPU
designs. Unfortunately, a large number of classic benchmark GP problems fit into
this category. However, there are problems such as image processing that naturally
provide large numbers of fitness cases (as each pixel can be considered as a different
fitness case) [5, 8].
GPUs can be considered as SIMD (or, in some cases, MIMD) processors in which
each processor handles multiple copies of the same program, using different program counters [14]. When a
population of programs is to be evaluated in parallel, the common approach so far
has been to implement some form of interpreter on the GPU [14, 13, 16]. The in-
terpreter is able to execute the evolved programs (typically machine-code-like pro-
grams represented as textures) in a pseudo-parallel manner. To see why this execution is pseudo-parallel, the implementation of the interpreter needs to be considered.
A population can be represented as a 2D array, with individuals represented in
the columns and their instructions in the rows. The interpreter consists of a set of IF
statements, where each condition corresponds to a given operator in the GP func-
tion set. The interpreter looks at the current instruction of each individual, and sees
if it matches the current IF statement condition. If it does, that instruction is exe-
cuted. This happens for each individual in the population in parallel, and for some
individuals the current instruction may not match the current IF statement. These in-
dividuals must wait until the interpreter reaches their instruction type. Hence, with
this approach, some of the GPU processors are not doing useful computation. If
there are four functions in the function set, we can expect that on average at least
3/4 of the shader processors will be ‘idle’. If the function set is larger, then more
shaders will be ‘idle’. Despite this idling, population-parallel approaches can still
be considerably faster than CPU implementations.
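The interpreter's ladder of IF statements, and the idling it causes, can be sketched as follows (a simplified stack-based model, not any published implementation):

```python
# Sketch of the population-parallel interpreter described above: one IF
# branch per function is tried in turn, and only individuals whose current
# instruction matches that branch do useful work.

FUNCS = {
    'add': lambda a, b: a + b,
    'sub': lambda a, b: a - b,
    'mul': lambda a, b: a * b,
    'div': lambda a, b: a / b if b != 0 else 0.0,  # protected division
}

def interpret_step(instructions, stacks):
    """One interpreter pass over the function set. Returns the fraction
    of individuals that were active during each IF branch."""
    active_fraction = []
    for op, fn in FUNCS.items():                 # the 'IF ladder'
        active = [i for i, ins in enumerate(instructions) if ins == op]
        for i in active:                         # only these lanes work
            b, a = stacks[i].pop(), stacks[i].pop()
            stacks[i].append(fn(a, b))
        active_fraction.append(len(active) / len(instructions))
    return active_fraction

# Four individuals, each about to execute a different opcode: during any
# single IF branch, 3/4 of the lanes are idle, as described in the text.
instructions = ['add', 'sub', 'mul', 'div']
stacks = [[2.0, 3.0] for _ in instructions]
fractions = interpret_step(instructions, stacks)
assert all(f == 0.25 for f in fractions)
assert stacks[0] == [5.0]                        # 2.0 + 3.0
```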
The performance of population-parallel implementations is therefore impacted
by the size of the function set used. Conversely, the fitness-case-parallel approaches
are impacted by the number of fitness cases. Population-parallel approaches are also
largely unsuited for GPU implementation when small population sizes are required.
In our work, we used a fitness-case-parallel version of the algorithm. Both Ac-
celerator and CUDA compile the code before execution. Although this introduces
a significant overhead, it does have some advantages, namely that the compiler can
optimize the executed code.
With CGP, it is very common to observe that large numbers of nodes in the
genotype are reused, and hence there are many intermediate results that can be
reused during execution. With an interpreter-based implementation, this reuse would be
very hard to exploit. However, with compilation, we can rely on the compiler
(and our own code generator) to spot where such work is redundant and remove it.
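A sketch of this reuse, using a toy CGP-style genotype (the encoding here is illustrative, not the book's exact format): each node's output is stored once and referenced by any number of later nodes, which is exactly the sharing a compiler or code generator can exploit.

```python
# Toy CGP-style evaluation: node outputs are computed once and reused,
# even when several later nodes reference the same node.

def evaluate_cgp(genotype, inputs):
    """genotype: list of (op, in1, in2) triples; node indices count the
    inputs first, then the nodes in order. The last node is the output."""
    values = list(inputs)                  # each node's output, computed once
    for op, a, b in genotype:
        x, y = values[a], values[b]
        values.append(x + y if op == 'add' else x * y)
    return values[-1]

# Node 2 (= x + y) feeds both node 3 and node 4, but is evaluated once.
genotype = [('add', 0, 1),   # node 2 = in0 + in1
            ('mul', 2, 2),   # node 3 = node2 * node2
            ('add', 3, 2)]   # node 4 = node3 + node2
assert evaluate_cgp(genotype, [2.0, 3.0]) == 30.0   # (5*5) + 5
```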
When running in a fitness-parallel fashion, the processing can be considered as
vector mathematics. Each column in the data set is represented as a vector of val-
ues. The programs then operate on these vectors, in an elementwise fashion. Each
shader program runs on one index of the vector, and hence each shader program is
responsible for one row in the data set. Figure 8.2 illustrates this approach.
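The row-per-shader scheme can be modelled in Python as follows; eval_elementwise and the example expression are hypothetical, but the structure mirrors Fig. 8.2:

```python
# The fitness-case-parallel scheme sketched sequentially: each data-set
# column becomes a vector, and the evolved expression is applied
# elementwise, so one 'program' logically handles every row at once.

def eval_elementwise(program, columns):
    """program: a function of one row's worth of named inputs.
    columns: dict mapping column name -> list of values (one per case)."""
    n = len(next(iter(columns.values())))
    return [program({k: v[i] for k, v in columns.items()}) for i in range(n)]

# Hypothetical evolved expression: x * x + y, applied to three fitness cases.
columns = {'x': [1.0, 2.0, 3.0], 'y': [10.0, 10.0, 10.0]}
outputs = eval_elementwise(lambda row: row['x'] * row['x'] + row['y'], columns)
assert outputs == [11.0, 14.0, 19.0]
```

On the GPU, the loop over `i` disappears: each shader instance handles one index of the vectors in parallel.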
In the work demonstrated here, only the fitness evaluation was implemented on
the GPU. The evolutionary algorithm (and other parts of the software) were all im-
plemented on the CPU side, as shown in Fig. 8.1.
8.5 Initial GPU Results
In our initial papers [7, 6], we used an Nvidia 7300 Go GPU. The programming
used the Accelerator package. At the time of writing, this GPU is still in use, but is
a very dated and underpowered platform. The GP parser used here was written in
C#, and compiled using Visual Studio 2005. All benchmark tests were done using
the Release build configuration, and were executed on CLR 2.0 under Windows XP.
The GPU was an Nvidia GeForce 7300 Go with 512 MB of video memory. The CPU
used was an Intel Centrino T2400 (running at 1.83 GHz), with 1.5 GB of system
memory.
In these experiments, GP trees were randomly generated with a given number
of nodes. The expressions were evaluated on the CPU and then on the GPU, and
each evaluation was timed. For each input-size/expression-length pair, 100 different
randomly generated expressions were used, and the results were averaged to calculate acceleration factors. Our results therefore showed the average number of times
by which the GPU was faster at evaluating a given tree size for a given number of
fitness cases. Values less than 1 mean that the CPU was faster at evaluating the
expression, and values above 1 indicate that the GPU performed better.

Fig. 8.2 (a) The data set is split into vectors. Each vector contains one column. (b) These vectors
are passed to the evolved shader program. Each shader program runs on a single index, so it is able
to see only one row of the data set. (c) The output of the evolved program is compared with the
expected output, and a fitness value obtained. All these steps exploit the parallelism of the shader
processor.
8.5.1 Floating-Point-Based Expressions
In the first experiment, we evaluated random GP trees containing varying numbers
of nodes, and exposed them to test cases of varying sizes. The mathematical functions +, −, × and / were used. The same expression was tested on the CPU and
the GPU, and the speed difference was recorded. The results are shown in Table
8.1. For small node counts and fitness cases, the CPU performance is superior be-
cause of the overhead of mapping the computation to the GPU. For larger problems,
however, there is a massive speed increase for GPU execution.
Table 8.1 Results showing the number of times faster evaluating floating-point-based expressions
is on the GPU compared with a CPU implementation. A value of less than 1 shows that the CPU
is more efficient
Test cases
Expression length 64 256 1024 4096 16384 65536
10 0.04 0.16 0.6 2.39 8.94 28.34
100 0.4 1.38 5.55 23.03 84.23 271.69
500 1.82 7.04 27.84 101.13 407.34 1349.52
1000 3.47 13.78 52.55 204.35 803.28 2694.97
5000 10.02 26.35 87.46 349.73 1736.3 4642.4
10000 13.01 36.5 157.03 442.23 1678.45 7351.06
8.5.2 Binary
The second experiment compared the performance of the GPU at handling Boolean
expressions. In the CPU version, we used the C# Boolean type, which is convenient, but not necessarily the most efficient representation. For the GPU, we tested
two different approaches, one using the Boolean parallel array provided by Accel-
erator, and the other using floating-point numbers. The performance of these two
representations is shown in Table 8.2. It is interesting to note that improvements are
not guaranteed. As can be seen in the table, the speed-up can decrease as the ex-
pression size increases. We assume that this is due to the way in which large shader
programs are handled by either Accelerator or the GPU. For example, the length
of the shader program on the Nvidia GPU may be limited, and going beyond this
length would require repeated passes of the data. This type of behaviour can be seen
in many of the results presented here.
We limited the functions in the expressions to AND, OR and NOT, which are
supported by the Boolean array type. Following some sample code provided with
Accelerator, we mimicked Boolean behaviour using 0.0f as false and 1.0f as true.
For two floats, AND can be viewed as the minimum of the two values. Similarly,
OR can be viewed as the maximum of the two values. NOT can be performed as a
multiply–add, where the value is first multiplied by −1 and then 1 is added.
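These three encodings translate directly into code, with 0.0 as false and 1.0 as true:

```python
# The floating-point encoding of Boolean logic described above:
# 0.0 represents false and 1.0 represents true.

def f_and(a, b):
    return min(a, b)          # AND: minimum of the two values

def f_or(a, b):
    return max(a, b)          # OR: maximum of the two values

def f_not(a):
    return a * -1.0 + 1.0     # NOT as a single multiply-add

T, F = 1.0, 0.0
assert f_and(T, F) == F
assert f_or(T, F) == T
assert f_not(T) == F
assert f_not(F) == T
```

Each of these maps onto a single elementwise GPU operation over whole arrays of values at once.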
8.5.3 Regression and Classification
Next we investigated the speed-up on both toy and real-world problems, rather than
on arbitrary expressions. In the benchmark experiments, the expression lengths were
uniform throughout the tests. However, in real GP, the lengths of the expressions
vary throughout the run. As the GPU sometimes results in slower performance, we
need to verify that on average there is an advantage.
Table 8.2 Results showing the number of times faster evaluating Boolean expressions is on the
GPU compared with a CPU implementation. A value of less than 1 shows that the CPU is
more efficient. Booleans were implemented both as Booleans and as floating-point numbers. Although
the floating-point representation is faster than the CPU for large input sizes, in general it appears
preferable to use the Boolean representation; the floating-point representation can provide speed
increases, but the results are varied
Boolean implementation
Test cases
Expression length 4 16 64 256 1024 4096 16384 65536
10 0.22 1.04 1.05 2.77 7.79 36.53 84.08 556.40
50 0.44 0.57 1.43 3.02 14.75 58.17 228.13 896.33
100 0.39 0.62 1.17 4.36 14.49 51.51 189.57 969.33
500 0.35 0.43 0.75 2.64 14.11 48.01 256.07 1048.16
1000 0.23 0.39 0.86 3.01 10.79 50.39 162.54 408.73
1500 0.40 0.55 1.15 4.19 13.69 53.49 113.43 848.29
Boolean implementation, using floating-point numbers
Test cases
Expression length 4 16 64 256 1024 4096 16384 65536
10 0.024 0.028 0.028 0.072 0.282 0.99 3.92 14.66
50 0.035 0.049 0.115 0.311 1.174 4.56 17.72 70.48
100 0.061 0.088 0.201 0.616 2.020 8.78 34.69 132.84
500 0.002 0.003 0.005 0.017 0.064 0.25 0.99 3.50
1000 0.001 0.001 0.003 0.008 0.030 0.12 0.48 1.49
1500 0.000 0.001 0.002 0.005 0.019 0.07 0.29 1.00
8.5.3.1 Regression
We evolved functions that performed regression with respect to x^6 − 2x^4 + x^2 [11].
We measured the difference in the speed of evaluation using a number of test cases.
In each instance, the test cases were uniformly distributed between −1 and +1. We
also changed the maximum length of the CGP graph. Hence, the expression lengths
could range anywhere from one node to the maximum size of the CGP graph. GP
was run for 200 generations to allow convergence. The function set comprised +,
−, × and /. In C#, division by zero of a float returns ‘Infinity’, which is consistent
with the result from the Accelerator library.
The fitness was defined as the sum of the absolute errors for each test case and
the output of the expression. This could also be calculated using the GPU. Each
individual was evaluated with the CPU and then the GPU and the speed difference
was recorded. Also, the outputs from the GPU and CPU were compared to ensure
that they were evaluating the expression in the same manner. We did not find any
instances where the two differed.
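The regression set-up above can be sketched as follows (the sampling of test cases is illustrative):

```python
# Sketch of the fitness computation described above: the target function
# is x^6 - 2x^4 + x^2, and fitness is the sum of the absolute errors over
# all test cases (lower is better, 0 is a perfect individual).

def target(x):
    return x ** 6 - 2 * x ** 4 + x ** 2

def fitness(candidate, xs):
    """Sum of absolute errors between the candidate and the target."""
    return sum(abs(candidate(x) - target(x)) for x in xs)

xs = [i / 50.0 - 1.0 for i in range(101)]      # uniform in [-1, +1]
assert fitness(target, xs) == 0.0              # perfect individual
assert fitness(lambda x: 0.0, xs) > 0.0        # constant 0 is penalized
```

As noted in the text, this error sum can itself be computed on the GPU as an elementwise subtraction followed by a reduction.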
Table 8.3 shows results that are consistent with the results for the other tests
described above. For smaller input sets and small expressions, it was more efficient
to evaluate them on the CPU. However, for the larger test and expression sizes, the
performance increase was dramatic.
Table 8.3 Results for the regression experiment. The results show the number of times faster eval-
uating evolved GP expressions is on the GPU, compared with CPU implementation. The maximum
expression length is the number of nodes in the CGP graph
Test cases
Max. expression length 10 100 1000 2000
10 0.02 0.08 0.7 1.22
100 0.07 0.33 2.79 5.16
1000 0.42 1.71 15.29 87.02
10000 0.4 1.79 16.25 95.37
8.5.3.2 Classification
In this experiment we attempted to solve the classification problem of distinguishing
between two spirals, as described in [11]. This problem has two input values (the
x and y coordinates of a point on a spiral) and has a single output indicating which
spiral the point is found to be on. In [11], 194 test cases were used. In the present
experiments, we extended the number of test cases to 388, 970 and 1940. We also
extended the function set to include sine, cosine, √x, x^y and a comparator. The
comparator looked at the first input value of a node, and if it was less than or equal
to zero returned the second input, and 0 otherwise. The relative speed increases can
be seen in Table 8.4. Again we see that the GPU is superior for larger numbers of
test cases, with larger expression sizes.
Table 8.4 Results for the two-spirals classification experiment. The results show the number of
times faster evaluating evolved GP expressions is on the GPU compared with CPU implementation.
The maximum expression length is the number of nodes in the CGP graph
Test cases
Max. expression length 194 388 970 1940
10 0.15 0.23 0.51 1.01
100 0.38 0.67 1.63 3.01
1000 1.77 3.19 9.21 22.7
10000 1.69 3.21 8.94 22.38
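For reference, the comparator node described above amounts to:

```python
# The comparator described in the text: if the first input is less than
# or equal to zero, return the second input; otherwise return 0.

def comparator(a, b):
    return b if a <= 0.0 else 0.0

assert comparator(-1.0, 5.0) == 5.0
assert comparator(0.0, 5.0) == 5.0
assert comparator(2.0, 5.0) == 0.0
```

On the GPU, this maps onto the conditional operations that APIs such as Accelerator provide for parallel arrays.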
8.5.3.3 Classification in Bioinformatics
In this experiment we investigated the behaviour on another classification problem,
this time a protein classifier, as described in [12]. Here the task was to predict the
location of a protein in a cell from the amino acids in that particular protein. We
used the entire data set as the training set. The set consisted of 2427 entries, with 19
variables each and one output. We investigated the performance gain using several
expression lengths, and the results can be seen in Table 8.5. Here, the large number
of test cases used results in considerable improvements in evaluation time, even for
small expressions.
Table 8.5 Results for the protein classification experiment. The results show the number of times
faster evaluating evolved GP expressions is on the GPU compared with CPU implementation. The
maximum expression length is the number of nodes in the CGP graph
Test cases
Expression length 2427
10 3.44
100 6.67
1000 11.84
10000 14.21
8.6 Image Processing on the GPU
In a later set of experiments, we demonstrated the use of the GPU as a platform for
the evolution of image filters, such as filters for noise reduction. An image filter is
defined as a convolution kernel (or program) that takes a set of inputs (pixels in an
image) and maps them to a new pixel value. This program is applied to each pixel in
an image to produce another image. Such filters are common in image processing.
For example, one commonly used filter to reduce noise in an image is a median
filter, where the new state of a pixel is the median value of it and its neighbourhood.
Using CGP it is possible to evolve such filters. Further examples of this and other
approaches can be seen in Chap. 6.
Running the filters on a GPU allows us to apply the kernel to every pixel (log-
ically, but not physically) simultaneously. The parallel nature of the GPU allows a
number of kernels to be calculated at the same time. This number is dependent on the
number of shader processors available. The image filter consists of an evolved pro-
gram that takes a pixel and its neighbourhood (a total of nine pixels) and computes
the new value of the centre pixel. On a traditional processor, you would iterate over
each pixel in turn and execute the evolved program each time. Using the parallelism
of the GPU, many pixels (in effect all of them) can be operated on simultaneously.
Hence, the evolved program is only evaluated once. Although the evolved program
actually evaluates the entire image at once, we can break down the problem and con-
sider what is required for each pixel. For each pixel, we need a program that takes it
and its neighbourhood, and calculates a new pixel value. For this, the evolved pro-
gram requires as many inputs as there are pixels in the neighbourhood, and a single
output. In the evolved program, each function has two inputs and one output. These
inputs are floating-point numbers that correspond to the grey-level values of the pix-
els. Figure 8.3 illustrates a program that takes a nine-pixel subimage, and computes
a new pixel value.
Fig. 8.3 In this example, the evolved program has nine inputs which correspond to a section of
an image. The output of the program determines the new colour of the centre pixel. Note that one
node has no connections to its output. This means that the node is redundant, and will not be used
during the computation.
Mapping the image-filtering problem to the parallel architecture of the GPU is
relatively straightforward.
It is important to appreciate that a GPU typically takes two arrays and produces
a third by performing a parallel operation on them. The operation is elementwise,
in the same way as for matrix operations. To clarify this, consider two arrays, a =
[1, 2, 3] and b = [4, 5, 6]. If we perform addition, we get c = [5, 7, 9]. With the SIMD
architecture of the GPU, it is difficult to do an operation such as adding the first
element of one array to the second of another. To do such an operation, the second
array would need to be shifted to move the element in the second position to the first
position.
For image filtering, we need to take a subimage from the main image, and use
these pixels as inputs for a program (our convolution kernel), keeping in mind the
vectorized operations of the GPU. To do this we take an image (e.g. the top left
array in Fig. 8.4) and shift the array by one pixel in all eight possible directions.
This produces a total of nine arrays (labelled (a) to (i) in Fig. 8.4).
Taking the same indexed element from each array will return the neighbourhood
of a pixel. In Fig. 8.4, the neighbourhood is shaded grey and the dotted line indicates
how these elements are aligned. The GPU runs many copies of the evolved program
in parallel and, essentially, each program can act on only one array index. By
shifting the arrays in this way, we have lined up the data so that although each
program can see only a given array position, by looking at the set of arrays (or,
more specifically, a single index in each of the arrays in the set) it can have access
to the given pixel and its neighbourhood. For example, if we add array (e) to array
(i) the new value of the centre pixel will be 6, as the centre pixel in (e) has a value
of 5 and the centre pixel in (i) has a value of 1.
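The shifting scheme can be sketched for a small image; the shift helper and offset labelling are illustrative rather than taken from the original implementation:

```python
# Sketch of the image-shifting scheme described above: a 2D image is
# shifted by one pixel in each of the eight directions, so the same index
# into the nine arrays yields a pixel together with its full neighbourhood.

def shift(img, dr, dc, fill=0.0):
    """Return a copy of img shifted by (dr, dc), padded with fill."""
    rows, cols = len(img), len(img[0])
    out = [[fill] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            sr, sc = r - dr, c - dc
            if 0 <= sr < rows and 0 <= sc < cols:
                out[r][c] = img[sr][sc]
    return out

img = [[1.0, 2.0, 3.0],
       [4.0, 5.0, 6.0],
       [7.0, 8.0, 9.0]]

# All nine offsets, (0, 0) included: one array per neighbourhood position.
shifted = {(dr, dc): shift(img, dr, dc)
           for dr in (-1, 0, 1) for dc in (-1, 0, 1)}

# At index (1, 1), the (0, 0) array holds the centre pixel (5.0) and the
# (1, 1) array holds its top-left neighbour (1.0); their sum is 6.0,
# matching the worked example in the text.
assert shifted[(0, 0)][1][1] == 5.0
assert shifted[(1, 1)][1][1] == 1.0
assert shifted[(0, 0)][1][1] + shifted[(1, 1)][1][1] == 6.0
```

Once the nine shifted arrays exist, the evolved program only ever needs elementwise operations, which is exactly what the GPU provides.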
Fig. 8.4 Converting the input image to a set of shifted images allows the elementwise operations of the GPU to access a pixel and its neighbourhood. The evolved
program treats each of these images as inputs. For example, should the evolved program want to sum the centre pixel and its top-left neighbour, it would add (e)
to (i).
8.6.1 Evolving Image Filters Using Accelerator
In [5, 9], CGP was used with MS Accelerator to evolve a large number of different
image filters. Chapter 6 contains more detail about the CGP component of this work,
whereas here the focus is on the GPU side of the algorithm. In summary, the original
input images (Fig. 8.5) were combined together to form a larger image. A filter was
applied, using GIMP. We then used an evolutionary algorithm to find a mapping
between the input and output images. The fitness function attempted to minimize
the error between the desired output (the GIMP-processed image) and the output
from the evolved filters.
GPU acceleration allows us to process images much more rapidly than on a CPU,
and enables improvements to be made in the quality of the results collected. In [5]
four different images were used for fitness evaluation, and none for validation. In
[9], 16 different images (Fig. 8.5) were used; these were largely taken from the USC-
SIPI image repository, with 12 used for fitness evaluation and four for validation.
This allowed us to be confident that evolved filters would generalize well. As we
employed a GPU for acceleration, we were able to test all the images at the same
time and obtain both the fitness score and the validation score at the same time.
The fitness score of an individual was the average of the absolute difference (per
pixel) between the target image and the image produced using CGP. The lower this
error, the closer our evolved solution was to the desired output.
The fitness function was relatively complicated. The first step was to take the
output from the evolved program, subtract it from our desired output and then find
the absolute value of the result. The whole input image was processed at once (this
input image contained 16 subimages). This provided an array containing the differ-
ence between the two images. Next, a mask was applied to remove the edges where
the subimages meet. This was done because when the images were shifted, there
would be overlapping data from different images that could introduce undesirable
artefacts into the fitness computation. Edges were removed by multiplying the dif-
ference array by an array containing 0s and 1s. The 1s labelled where the fitness
function would measure the difference.
To calculate the training error, the difference array was multiplied by another
mask. This mask contained 1s for each pixel in the training images, and 0 for pixels
that we did not wish to consider. The fitness value was the sum of the contents of the
array divided by the number of pixels.
The fitness function is illustrated in Fig. 8.6.
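The masked fitness computation can be sketched as follows (a simplified, single-image version of the steps just described):

```python
# Sketch of the masked fitness calculation: per-pixel absolute difference,
# multiplied by a 0/1 mask to ignore subimage edges, then averaged over
# the pixels that the mask keeps.

def masked_fitness(output, target, mask):
    """output, target, mask: equal-length flat pixel arrays; mask is 0/1."""
    diff = [abs(o - t) * m for o, t, m in zip(output, target, mask)]
    counted = sum(mask)
    return sum(diff) / counted if counted else 0.0

output = [0.5, 0.2, 0.9, 0.4]
target = [0.5, 0.6, 0.1, 0.4]
mask   = [1,   1,   0,   1]    # third pixel lies on a subimage edge
assert abs(masked_fitness(output, target, mask) - (0.4 / 3)) < 1e-12
```

In the actual system, the difference, the mask multiplications and the final reduction are all array operations that run on the GPU.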
8.6.2 Results
Results for the quality of the image filters can be found in Chap. 6. Here, only the
GPU acceleration is considered. Table 8.6 shows the performance of this implemen-
tation.
Fig. 8.5 The training and validation image set. All images were presented simultaneously to the
GPU. The first column of images was used to compute the validation fitness, and the remaining 12
for the training fitness. Each image is 256 × 256 pixels; together, the set forms a
single 1024 × 1024 pixel composite.
It was found that using the GPU greatly decreased the evaluation time per in-
dividual. On our test system (Nvidia 8800 GTX, AMD Athlon 3500+, Microsoft
Accelerator API), the average speed was approximately 145 million genetic
programming operations per second (GPOps), and a peak performance of 324 million
GPOps was obtained. The processing rate depended on the length of the
evolved programs. Some filters benefited more from the GPU implementation than
others.
When the evolved programs were executed on the CPU, we obtained only
1.2 million GPOps, roughly 100 times slower than the GPU. However, executing
through the reference driver incurs significant overhead and may not be an accurate
reflection of the true speed of the CPU.
Fig. 8.6 Fitness calculation for images (simplified for a single image). The evolved program is run
on the input image, and its per-pixel difference from the target image is calculated. This is then
multiplied by an edge mask to remove artefacts created by the multiple subimages touching each
other. Finally, the final error is calculated (the average pixel error).
Filter     Peak MGPOps   Avg MGPOps
Dilate         116           62
Dilate2        281          133
Emboss         254          129
Erode          230           79
Erode2         307          195
Motion         280          177
Neon           266          185
Sobel          292          194
Sobel2         280          166
Unsharp        324          139

Table 8.6 Maximum and average mega genetic programming operations per second (MGPOps)
observed for each filter type
8.7 CUDA Implementation
The Accelerator implementations discussed so far suffer from several problems.
Accelerator is very limiting (Windows-only, buggy and too high-level), and does not
appear to be under active development, although it is likely to resurface in future
versions of DirectX.
The next implementation of CGP for GPUs that we describe is based on Nvidia’s
CUDA. However, doing this presented many challenges. To use a test-case-parallel
approach, individual programs have to be compiled and this introduces a signifi-
cant time overhead. In [10], a new approach was introduced which aims to remove
this overhead. This section discusses the most recent version of this approach.
Fig. 8.7 Algorithm overview. The root computer administers the client computers. The root com-
puter is responsible for maintaining the population and distributing sections of the data set to the
GPU-equipped client computers. The client computers can be capable of executing programs (i.e.
contain a compatible GPU) and/or be capable of compiling CUDA programs using NvCC. Differ-
ent combinations of compiling and executing computers can be used, depending on the resources
available.
8.7.1 Algorithm
The algorithm devised here is an implementation of a parallel evolutionary algo-
rithm, where there is a master controller (called the ‘root’) and a number of slave
computers (or ‘clients’), as shown in Fig. 8.7. In summary, the algorithm works in
the following manner:
1. A process is loaded onto all computers in the cluster.
2. The root process is started on the master computer.
3. Each computer announces its presence, and the root listens to these broadcasts
and compiles a list of available machines.
4. After announcing their presence, each of the client machines starts a server
process to listen for commands from the root.
5. After 2 minutes, the root stops listening for broadcasts and connects to each
client machine.
6. Upon connecting to each machine, the root asks each client what its capabilities
are (in terms of CUDA compiler and device capability).
7. The root distributes sections of the fitness set to each CUDA-capable computer.
8. The root initializes the first population.
9. The root begins the evolutionary loop.
a. Convert population to CUDA C code.
b. Parallel-compile population.
c. Collate compiled code and resend to all compiling clients.
d. Perform execution of individuals on all GPU-equipped clients.
e. Collect (partial) fitness values for individuals from GPU-equipped clients.
f. Generate next population, return to step 9a.
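The evolutionary loop in step 9 can be expressed as the following skeleton. This is a hypothetical Python sketch of our own in which every cluster operation is a stubbed-out callback; none of the function names come from the original system:

```python
def evolutionary_loop(population, generations, to_cuda, compile_sections,
                      merge_modules, execute_on_clients, next_generation):
    """Skeleton of step 9: each callback stands in for a cluster operation."""
    for _ in range(generations):
        sources = to_cuda(population)            # 9a: genotypes -> CUDA C
        sections = compile_sections(sources)     # 9b: parallel compile
        module = merge_modules(sections)         # 9c: collate .cubin files
        partial = execute_on_clients(module)     # 9d/9e: run, collect fitness
        population = next_generation(population, partial)  # 9f
    return population

# Stub callbacks so the skeleton can be exercised without a cluster.
final = evolutionary_loop(
    population=[0, 1, 2],
    generations=3,
    to_cuda=lambda pop: [f"// individual {i}" for i in pop],
    compile_sections=lambda srcs: srcs,
    merge_modules=lambda secs: "\n".join(secs),
    execute_on_clients=lambda mod: [0.5] * 3,
    next_generation=lambda pop, fit: [i + 1 for i in pop],
)
print(final)  # -> [3, 4, 5]
```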
The algorithm here offloads both the compilation and the execution to a cluster
of GPU-equipped machines. This significantly reduces the overhead and execution
costs.
The cluster consisted of 28 computers, each with an Nvidia GeForce 8200 graphics
card. This GPU is a very low-end device: each 8200 has eight stream processors
and access to 128 MB of RAM. As the GPU was shared with the operating
system, the free GPU memory on each machine was approximately 80 MB. The
root node was equipped with an Intel Q6700 quad-core processor, 4 GB
of RAM and Windows XP. The client computers ran Gentoo Linux, with AMD 4850e
processors and 2 GB of RAM. The software developed here uses a CUDA wrapper
called CUDA.Net [3] and should work on any combination of Linux and Windows
platforms.
8.7.2 Compilation and Code Generation
To mitigate the slow compilation step in the algorithm, a parallelized method was
employed. Figure 8.8 shows how the compilation speed varies with the number
of individuals in a population. The compilation time here includes rendering the
genotypes to CUDA C code, saving the code and running the compiler. For a single
machine, the compilation time increases quickly with the size of the population. For
large populations, the compilation phase quickly becomes a significant bottleneck.
Using distributed compilation it is possible to maintain a rapid compilation phase
for typical population sizes.
CUDA allows library files (extension ‘.cubin’) called ‘modules’ to be loaded
into the GPU. Modules are libraries of functions that can be loaded, executed on
the GPU and then unloaded. Module load times increase linearly with the number
of functions (or individuals) they contain. However, there is also a constant
per-module overhead for loading, unloading, transferring the module across the
network and other file-handling operations. It is therefore preferable to minimize
the number of module files that are created.
To compile a module, the user first creates a source file containing only CUDA
functions. These should be declared as in the following example:
extern "C" __global__
void Individual0(float *ffOutput,
                 float *ffInput0,
                 int width, int height)
When compiled, each of these functions is exposed as a function in the compiled
library. Internally, the .cubin files are text files with a short header, followed by a set
of code blocks representing the functions. The code blocks contain the binary code
for the compiled function, in addition to information about memory and register
usage.
Fig. 8.8 How compilation speed varies with the number of individuals in a population, comparing
single-machine compilation with compilation using 14 machines in the cluster.
It is possible to merge the contents of module files together by simply appending
.cubin files together. This allows one to compile sections of a library independently
of each other and then reduce these sections to a single module. However, only one
header is permitted per .cubin file.
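Since the .cubin files are text, the merge amounts to simple string handling: keep the first file intact and append the code blocks of the rest with their headers stripped. A sketch of this idea, noting that the header format below is invented for illustration and real .cubin headers differ:

```python
HEADER_LINES = 2  # assumed length of the .cubin text header (illustrative)

def merge_cubins(cubin_texts):
    """Append compiled sections, keeping only the first file's header."""
    first = cubin_texts[0]
    rest = [t.split("\n", HEADER_LINES)[-1] for t in cubin_texts[1:]]
    return "\n".join([first] + rest)

a = "architecture sm_10\nversion 1\ncode { Individual0 ... }"
b = "architecture sm_10\nversion 1\ncode { Individual1 ... }"
merged = merge_cubins([a, b])
print(merged.count("architecture"))  # -> 1: only one header survives
print("Individual1" in merged)       # -> True: both code blocks present
```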
In this implementation, the following process was used to convert the genotypes
into executable code.
1. Convert the population to CUDA C. This process is performed on the root
machine. At this stage, the population is divided into sections, one for each
machine available for compilation. Each section is then converted to CUDA C
using the code generator described later in this section. This stage is performed
in parallel across the cores of the root computer, with a separate thread launched
to generate the code for each section of the population.
2. Send the uncompiled code to the distributed machines. In parallel, each section
of uncompiled code is transmitted to a computer for compilation. The remote
computer saves the code to the local disk, before compilation. The remote com-
puter then loads the .cubin file from disk and returns it to the root computer.
3. The host computer merges the returned .cubin files together (in memory) and
then sends the merged version back to all the computers in the grid for execution.
Again, this happens in parallel, with the remote computers sending back
their compiled files asynchronously.
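Steps 1 and 2 above amount to splitting the population into one section per compiling machine and dispatching the sections in parallel. A minimal sketch, using a local thread pool in place of the network layer; all of the names here are ours and the compile step is a stand-in:

```python
from concurrent.futures import ThreadPoolExecutor

def split_sections(population, n_machines):
    """Divide the population into roughly equal sections, one per machine."""
    k, r = divmod(len(population), n_machines)
    sections, start = [], 0
    for i in range(n_machines):
        size = k + (1 if i < r else 0)
        sections.append(population[start:start + size])
        start += size
    return [s for s in sections if s]

def compile_section(section):
    # Stand-in for: render to CUDA C, ship to a client, run nvcc, return .cubin
    return [f"compiled({ind})" for ind in section]

population = list(range(10))
sections = split_sections(population, 4)
with ThreadPoolExecutor(max_workers=4) as pool:
    compiled = [f for part in pool.map(compile_section, sections) for f in part]
print(len(compiled))  # -> 10: one compiled function per individual
```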
Figure 8.8 shows the performance benefit obtained with this parallel compilation
method. For typical population sizes, the parallel compilation means that each pro-
cessor compiles only a small number of functions – and for such sizes the compiler
is operating at its most efficient ratio of compilation time to function count. How-
ever, it should be noted that the compilation time does not appear to scale linearly
with the number of compiling clients.
Tables 8.7 and 8.8 show the average times taken to compile populations of var-
ious sizes, with various program lengths. The results show a clear performance in-
crease using the parallel compilation method.
Table 8.7 Time (in seconds) to perform the entire compilation process using 28 computers

Graph length   Population size
                 256     512    1024    2048
32              3.08    2.88    2.78    2.82
64              2.43    2.32    2.48    2.65
128             2.84    2.78    3.01    3.01
256             3.29    3.28    3.24    3.39
512             3.73    3.63    4.26    3.97
1024            4.96    5.89    5.41    6.35
2048            8.37    7.98    8.38    9.65
Table 8.8 Time (in seconds) to perform the entire compilation process using a single-threaded
compilation method. Owing to memory allocation issues, some results are incomplete

Graph length   Population size
                 256     512    1024    2048
32              1.22    1.22    1.22    1.22
64              1.55    1.55    1.54    1.54
128             2.20    2.20    2.21    2.18
256             3.54    3.73    3.60    3.55
512             6.43    6.44    6.43    6.45
1024           13.33   13.56   13.43   13.45
2048           36.16   36.26   36.06   36.01
Code generation is the process of converting the genotypes in the population to
CUDA-compatible code. During development, a number of issues were found that
made this process more complicated than expected. One issue was that the CUDA
compiler copes poorly with long expressions. In initial tests, each evolved program
was written as a single expression; as the length of the expression increased,
the compilation time increased dramatically, presumably because such
programs are difficult to optimize.
Another issue was that functions with many input variables can cause the compi-
lation to fail, with the compiler complaining that it was unable to allocate sufficient
registers. In the initial development, we passed all the inputs in the training set to
each individual, regardless of whether the expression used them. This worked well
for small numbers of inputs; however the training set that was used to test the system
contained 41 columns. The solution to this problem was to pass the function only
the inputs that it used. However, this requires each function to be executed with a
different parameter configuration. Conveniently, the CUDA.Net interface does al-
low this, as function calls can be generated dynamically at run time. The other issue
here is that all or many inputs may be needed to solve a problem. It is hoped that
this is a compiler bug and that it will be resolved in future updates to CUDA.
Fig. 8.9 Example of the CUDA code generated for an individual.
The code in Fig. 8.9 illustrates a typical generated individual. As noted before,
long expressions failed to compile within a reasonable time. The workaround used
here was to limit the number of nodes that made up a single statement, and then
use temporary variables to combine these smaller statements together. It is unclear
what the best length is for statements, but it is likely to be a function of how many
temporary variables (and hence registers) are used. The names of the temporary
variables were related to the node index in the graph at which the statement length
limit was introduced.
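The statement-splitting workaround can be sketched as follows: a chain of node expressions is emitted as C statements, with a fresh temporary introduced every few nodes and named after the node index at which the limit was reached. The statement limit, function set and naming scheme below are illustrative, not the original generator:

```python
def emit_statements(ops, limit=3):
    """Render a chain of unary ops into C-style statements, breaking the
    expression into temporaries every `limit` nodes."""
    lines, expr = [], "x"
    for i, op in enumerate(ops, start=1):
        expr = f"{op}({expr})"
        if i % limit == 0 and i < len(ops):
            lines.append(f"float t{i} = {expr};")  # temp named after node index
            expr = f"t{i}"
    lines.append(f"float r = {expr};")  # final output held in r
    return "\n".join(lines)

code = emit_statements(["sin", "sqrt", "fabs", "cos", "exp"], limit=3)
print(code)
# float t3 = fabs(sqrt(sin(x)));
# float r = exp(cos(t3));
```

With a larger limit the generator would produce one long expression, which is exactly the case that caused the compile times described above.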
The final output of the individual was stored in a temporary variable (r). As this
is a binary classification problem, it was convenient to add code here to threshold
the actual output value to one of the two classes.
During code generation, it was also possible to remove redundant code. When a
CGP graph is parsed recursively, unconnected nodes are automatically ignored and
so do not need to be processed. It is also possible to detect code which serves no
purpose, for example division where the numerator and denominator are the same,
or multiplication by 0. This can be seen as another advantage of precompiling pro-
grams over simple interpreter-based solutions.
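The recursive parse that skips unconnected nodes can be sketched as follows, using a simplified CGP-like genotype encoding of our own (a map from node index to function and input indices, with negative indices denoting program inputs):

```python
def active_nodes(nodes, output):
    """Mark the nodes actually reachable from the output of a CGP graph.

    `nodes` maps index -> (function_name, (input_index, ...)); indices below
    zero refer to program inputs and terminate the recursion.
    """
    active = set()
    def visit(i):
        if i < 0 or i in active:   # program input, or already visited
            return
        active.add(i)
        for j in nodes[i][1]:
            visit(j)
    visit(output)
    return active

# Node 1 is never referenced on the path to the output, so it is skipped
# and no code is generated for it.
graph = {0: ("add", (-1, -2)), 1: ("mul", (0, -1)), 2: ("sub", (0, -2))}
print(sorted(active_nodes(graph, output=2)))  # -> [0, 2]
```

Constant-folding checks (division of a value by itself, multiplication by 0) can be applied per node during the same traversal, before the statement text is emitted.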
8.7.3 Fitness Function
The task here was to evolve an ‘emboss’ filter on a single image. A large image
(3872 × 2592 pixels) was used; by comparison, Accelerator could only cope with
images of up to 2048 × 2048 pixels.
The image was converted to grey scale, and then converted in a way similar
to that described earlier, where each neighbourhood becomes a row in a data set.
The desired output image was an ‘embossed’ version of the input. The data set
contained 10,023,300 rows and 10 columns, approximately 400 MB of
data in memory. As each row was independent of every other row, the image could
be split across the client computers in the same way as before.
The fitness was calculated as the sum of the differences between the image out-
putted by the evolved program and the target (embossed) image.
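The image-to-dataset conversion can be sketched as follows: each interior 3×3 neighbourhood of the grey-scale input becomes one row of nine inputs, paired with the corresponding target pixel. This is a NumPy sketch of our own; note that a 3×3 window is consistent with the figures above, since 10,023,300 = (3872 − 2) × (2592 − 2) rows and 9 + 1 = 10 columns:

```python
import numpy as np

def image_to_rows(src, target):
    """Turn every interior 3x3 neighbourhood into a row: 9 inputs + 1 output."""
    h, w = src.shape
    rows = []
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            patch = src[y - 1:y + 2, x - 1:x + 2].ravel()   # 9 neighbour pixels
            rows.append(np.append(patch, target[y, x]))     # plus target pixel
    return np.array(rows)

src = np.arange(20, dtype=float).reshape(5, 4)   # toy 5x4 'image'
tgt = src * 2                                    # toy 'embossed' target
data = image_to_rows(src, tgt)
print(data.shape)  # -> (6, 10): (5-2)*(4-2) rows, 9 inputs + 1 target
```

Because each row depends only on its own neighbourhood, the rows can be partitioned arbitrarily across the client machines.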
8.7.4 Results
Table 8.9 Average giga GP operations per second (GGPOps) when the ‘emboss’ image filter was
evolved using 28 computers

Graph length   Population size
                256    512   1024   2048
32              3.3    3.6    4.5    5.4
64              5.2    6.8    7.3    8.4
128             7.2    8.4    9.9   13.8
256            11.4   13.2   15.3   18.6
512            14.2   16.6   18.3   21.8
1024           16.4   17.3   21.0   24.1
2048           16.4   18.3   21.6   26.2
Table 8.10 Peak giga GP operations per second (GGPOps) when the ‘emboss’ image filter was
evolved using 28 computers

Graph length   Population size
                256    512   1024   2048
32              7.2    7.5    8.9   10.0
64              8.0    9.9   13.8   16.2
128            12.3   15.7   17.9   22.5
256            19.5   21.7   26.2   29.7
512            22.1   24.5   28.5   32.6
1024           25.0   28.9   30.0   34.7
2048           24.1   27.3   32.2   34.2
Table 8.9 shows the average numbers of giga GP operations per second (GG-
POps) and Table 8.10 shows the peak numbers of GGPOps for each of the popula-
tion/graph size combinations used.
The results are considerably faster than those obtained using Accelerator on a
single GPU, which had a peak of 0.25 GGPOps. The peak results here are over 100
times faster. The speed improvement is likely to be due to an increase in the number
of shader processors available, although it is difficult to compare the results, given
the differences in image size and fitness function.
8.8 Conclusions
GPUs are becoming increasingly common in all forms of scientific computing. For
GP practitioners, we find that the nature of the problem of evaluating either indi-
viduals in parallel or test cases in parallel maps very well to the GPU architecture.
As GPU technology improves, and particularly as the development tools improve, it
is likely that more and more GP implementations will exploit this form of parallel
processing.
More information about GPUs and genetic programming can be found at
www.gpgpgpu.com.
8.9 Acknowledgements
WB and SH gratefully acknowledge funding from Atlantic Canada’s HPC network
ACENET, from the Canadian Foundation of Innovation, New Opportunities Grant
No. 204503, and from NSERC under the Discovery Grant Program RGPIN 283304-
07.
References
1. Banzhaf, W., Harding, S.L., Langdon, W.B., Wilson, G.: Accelerating Genetic Programming through Graphics Processing Units. In: R.L. Riolo, T. Soule, B. Worzel (eds.) Genetic Programming Theory and Practice VI, chap. 1, pp. 229–249. Springer (2008)
2. Chitty, D.M.: A data parallel approach to genetic programming using programmable graphics hardware. In: D. Thierens, H.G. Beyer, et al. (eds.) Proc. Genetic and Evolutionary Computation Conference, vol. 2, pp. 1566–1573. ACM Press (2007)
3. GASS Ltd.: CUDA.NET. http://www.gass-ltd.co.il/en/products/cuda.net/
4. Harding, S.L.: Genetic Programming on GPU Bibliography. http://www.gpgpgpu.com/
5. Harding, S.L.: Evolution of Image Filters on Graphics Processor Units Using Cartesian Genetic Programming. In: J. Wang (ed.) IEEE World Congress on Computational Intelligence, pp. 1921–1928. IEEE Press (2008)
6. Harding, S.L., Banzhaf, W.: Fast Genetic Programming and Artificial Developmental Systems on GPUs. In: International Symposium on High Performance Computing Systems and Applications, p. 2. IEEE Computer Society (2007)
7. Harding, S.L., Banzhaf, W.: Fast genetic programming on GPUs. In: Proc. European Conference on Genetic Programming, LNCS, vol. 4445, pp. 90–101. Springer (2007)
8. Harding, S.L., Banzhaf, W.: Genetic programming on GPUs for image processing. International Journal of High Performance Systems Architecture 1(4), 231–240 (2008)
9. Harding, S.L., Banzhaf, W.: Genetic Programming on GPUs for Image Processing. In: J. Lanchares, F. Fernandez, J. Risco-Martin (eds.) Proc. International Workshop on Parallel and Bioinspired Algorithms, pp. 65–72. Complutense University of Madrid Press (2008)
10. Harding, S.L., Banzhaf, W.: Distributed Genetic Programming on GPUs using CUDA. In: I. Hidalgo, F. Fernandez, J. Lanchares (eds.) Proc. International Workshop on Parallel Architectures and Bioinspired Algorithms, pp. 1–10 (2009)
11. Koza, J.: Genetic Programming: On the Programming of Computers by Means of Natural Selection. MIT Press (1992)
12. Langdon, W.B., Banzhaf, W.: Repeated Sequences in Linear Genetic Programming Genomes. Complex Systems 15(4), 285–306 (2005)
13. Langdon, W.B., Banzhaf, W.: A SIMD Interpreter for Genetic Programming on GPU Graphics Cards. In: Proc. European Conference on Genetic Programming, LNCS, vol. 4971, pp. 73–85. Springer (2008)
14. Robilliard, D., Marion-Poty, V., Fonlupt, C.: Population Parallel GP on the G80 GPU. In: Proc. European Conference on Genetic Programming, LNCS, vol. 4971, pp. 98–109. Springer (2008)
15. Tarditi, D., Puri, S., Oglesby, J.: Accelerator: Using Data Parallelism to Program GPUs for General-Purpose Uses. Tech. Rep. MSR-TR-2005-184, Microsoft Research (2006)
16. Wilson, G., Banzhaf, W.: Linear Genetic Programming GPGPU on Microsoft's Xbox 360. In: J. Wang (ed.) IEEE World Congress on Computational Intelligence. IEEE Press (2008)