Leveraging Heterogeneous Parallel Platform
in Solving Hard Discrete Optimization Problems
with Metaheuristics
M. Pietroń^a, A. Byrski^a,∗, M. Kisiel-Dorohinicki^a
^a AGH University of Science and Technology,
Faculty of Computer Science, Electronics and Telecommunications, Al. Mickiewicza 30,
30-059 Krakow, Poland
Abstract
The research reported in the paper deals with difficult black-box problems solved
by means of popular metaheuristic algorithms implemented on up-to-date par-
allel, multi-core, and many-core platforms. In consecutive publications we are
trying to show how particular population-based techniques may further benefit
from employing dedicated hardware like GPGPU or FPGA for delegating dif-
ferent parts of the computing in order to speed it up. The main contribution of
this paper is an experimental study focused on profiling of different possibilities
of implementation of Scatter Search algorithm, especially delegating some of its
selected components to GPGPU. As a result, a concise know-how related to the
implementation of a population-based metaheuristic similar to Scatter Search
is presented using a difficult discrete optimization problem; namely, Golomb
Ruler, as a benchmark.
1. Introduction
Solving hard discrete problems usually requires a metaheuristic-based ap-
proach as well as immense computing power. Metaheuristics such as Evolution-
ary Algorithms deal with a population of individuals (representing potential
solutions of the problem) and methods of transforming this population, follow-
ing various biological or social phenomena [23]. Also, the hybrids of evolution-
ary and local search algorithms are very often used as effective solvers for such
problems [11].
Parallel hardware platforms (e.g., graphic cards, multi-core processors) en-
able us to make use of the data flow structure of such population-based algo-
rithms and can help improve efficiency when solving difficult problems [17]. In
turn, this makes it possible to develop dedicated software environments, taking
∗Corresponding author
Email addresses: pietron@agh.edu.pl (M. Pietroń), olekb@agh.edu.pl (A. Byrski),
doroh@agh.edu.pl (M. Kisiel-Dorohinicki)
Preprint submitted to Elsevier October 25, 2022
advantage of their specific features. Interesting possibilities arise when the use
of dedicated hardware is considered, such as general-purpose graphics processing
units (GPGPUs), which enable us to run thousands of threads in parallel. The
peak memory bandwidth of the most efficient high-performance graphics cards
exceeds 1 TB/s. However, the programmer or designer must be aware of the
memory hierarchy, which has a significant influence on the implementation. A
significant number of numerical and data-mining algorithms have already been
efficiently implemented using GPGPUs [27, 26, 36, 34, 31].
For a long time, we have been studying metaheuristic computing, both from
an algorithmic point of view (delivering e.g., Evolutionary Multi-Agent Sys-
tems and many of its flavors [6, 8, 7] and formally modeling this computation
model [38, 5]) and from an engineering point of view (developing software frame-
works dedicated to metaheuristic computing using such technologies as Python,
Java, Scala, or Erlang [25, 24, 33, 9]). Recently, encouraged by the quite good
scaling results of our Erlang-based platforms, we have decided to try to further
increase the efficiency of our software tools by leveraging parallel hardware
architectures. In consecutive publications we are trying to show how particu-
lar population-based techniques may further benefit from employing dedicated
hardware like GPGPU or FPGA for delegating different parts of the computing
in order to speed it up (e.g. [35]).
To show the performance of particular algorithms, we need a benchmark.
Low Autocorrelation Binary Sequences, Job Shop, and Golomb Ruler are well-
known combinatorial problems [4] that are quite difficult to solve using tradi-
tional approaches. They may be considered to be good examples of the so-called
black-box scenarios [16], demanding the application of a metaheuristic approach
in order to locate even sub-optimal solutions. The most popular metaheuris-
tics used for such problems are Simulated Annealing, Tabu Search as well as
Evolutionary and Memetic Algorithms (see e.g., [20, 40, 12, 14, 15]).
A very efficient metaheuristic algorithm for solving such optimization prob-
lems is Scatter Search [30]. As a complex population-based algorithm consisting
of many subsequent yet repeatable stages, it may become a good candidate for
implementing in a hybrid way; in particular, leveraging a heterogeneous infras-
tructure. According to the best knowledge of the authors, detailed studies of
Scatter Search related to efficient hardware use and profiling of its component
metaheuristics (except evolutionary and genetic algorithms, cf. [43, 42]), con-
sidering the implementation on heterogeneous architectures are hard to find in
the state-of-the-art. Thus it became the subject of our research.
It should be stressed that we already have some substantial GPGPU-related
results; the most relevant concerns the realization of a Memetic Algorithm solving
the Optimal Golomb Ruler (OGR) problem [35]. It was shown that GPGPUs can
be incorporated into the process of solving difficult black-box problems, although
only some parts of the considered algorithms can be implemented efficiently on
a GPGPU. On the other hand, algorithms with strong dependencies in their
data flows and without data re-use or data parallelism cannot be efficiently
accelerated on a GPGPU (this is a difficulty, e.g., with the Tabu Search
algorithm).
The main result of this work is an experimental study focused on both effi-
ciency and effectiveness, devoted to finding the implementation possibilities of
Scatter Search. The structure of Scatter Search (and similar population-based
metaheuristics) allows for different approaches of heterogeneous implementa-
tion; i.e., delegating its different parts to devices such as GPGPU and FPGA.
Thus, starting from GPGPU-related tests, we try to find out how to speed up
the computing without rushing headlong, leveraging available hardware devices
only when necessary. In this way, delegating such components as the initialization
or local search operators (e.g., Simulated Annealing, GRASP, Tabu Search) is
considered and reliably tested against highly-optimized CPU versions
of these algorithms. The studies are based on an Optimal Golomb Ruler prob-
lem as a benchmark. In the end, a hybrid multi-GPGPU framework is shown,
configured according to the obtained know-how.
The structure of the article is as follows. After this introduction, we provide
the most important information from the basics of GPGPU computing; then,
the exemplary problem (OGR) is described, followed by the outline of the meta-
heuristics used (SS). The main part is the experimental study that comes from
the profiling of the considered metaheuristic. At the very end, the concept of a
heterogeneous multi-GPGPU computing framework is presented (along with
the results obtained).
2. GPGPU and multiprocessor computing
The architecture of a GPGPU is presented in Fig. 1. A GPGPU is constructed
as a set of N multiprocessors with M cores each. The cores share an Instruction
Unit. Each multiprocessor has dedicated memory that is much faster than the
global memory (which is shared by all multiprocessors). GPGPUs are
constructed as massively parallel devices, enabling thousands of parallel threads
to run in groups called blocks with shared memory.
Creating the optimized code is not trivial, and thorough knowledge of the
GPGPU architecture is necessary to do it effectively. The main aspects to con-
sider are the usage of the memory, efficient division of code into parallel threads,
and inter-thread communication. As mentioned earlier, different kinds of mem-
ory are specially optimized regarding the access time; therefore, programmers
should optimally use them to speed up access to the data on which the algorithm
operates. Another important aspect is optimizing the synchronization and
communication of the threads: synchronization of threads across blocks is much
slower than within a block, so it should be avoided unless necessary.
In this work, a dedicated software architecture of CUDA is used, which
enables the programming of GPGPU using high-level languages such as C
and C++ [1]. CUDA requires an NVIDIA GPGPU like Fermi or GeForce
8XXX/Tesla/Quadro, etc. This technology provides three key mechanisms to
parallelize programs: thread group hierarchy, shared memory, and barrier syn-
chronization. These mechanisms provide fine-grained parallelism nested within
coarse-grained task parallelism.
Figure 1: GPGPU architecture
Multi-core processors are used for comparison of the developed GPGPU so-
lutions discussed in this paper—experiments that are run on single-core and
multi-core processors will be used to show the scalability of the considered
algorithms on the CPU. Note that, in both cases (GPGPU and CPU computing),
the code is optimized and appropriately modified in order to exploit all of the
possible features of both solutions.
Most modern processors consist of two or more independent processing units.
Such architecture enables multiple instructions to be run at the same time.
The cores are integrated into a single integrated circuit (which may or may
not share cache). They may implement message passing or shared memory
for inter-core communication. The single cores in multi-core systems may also
implement architectures such as vector processing, SIMD, or multi-threading.
These techniques offer another aspect of parallelization (implicit to the high level
languages used by compilers). Obviously, the performance gained by the use of a
multi-core processor depends on the algorithms used and their implementation.
There are many programming models and libraries for multi-core programming;
the most popular are pthreads, OpenMP, Cilk++, and TBB. In our
work, OpenMP was used [3]: a software platform supporting multi-threaded,
shared-memory parallel processing on multi-core architectures for the
C, C++, and Fortran languages. Using OpenMP, the programmer does not
need to create threads nor assign tasks to each thread. The programmer rather
inserts directives to assist the compiler in generating threads for the appropriate
parallel processor platform.
3. Optimal Golomb Ruler problem
The Golomb Ruler is a set of marks at integer positions along an imaginary
ruler such that no two pairs of marks are the same distance apart. The number
of marks on the ruler is its order, and the largest distance between two of its
marks is its length (distance between the first and last elements). A Golomb
Ruler is optimal if no shorter GR of the same order exists. Examples of using
Golomb Rulers can be found in information theory, where they are used for
generating error-correcting codes [37], in selecting radio frequencies to reduce
the effects of inter-modulation interference (with both terrestrial and
extraterrestrial applications), in designing phased arrays of radio antennas [41],
and, in astronomy, in one-dimensional synthesis arrays.
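The defining property can be checked directly: a set of marks is a Golomb Ruler if and only if all pairwise distances are distinct. The following minimal Python sketch is our own illustration, not part of the implementations discussed later:

```python
from itertools import combinations

def is_golomb_ruler(marks):
    """Check that all pairwise distances between marks are distinct."""
    dists = [b - a for a, b in combinations(sorted(marks), 2)]
    return len(dists) == len(set(dists))

def ruler_length(marks):
    """Length of the ruler: distance between the first and last marks."""
    return max(marks) - min(marks)

# [0, 1, 4, 6] is an optimal ruler of order 4
print(is_golomb_ruler([0, 1, 4, 6]), ruler_length([0, 1, 4, 6]))  # True 6
print(is_golomb_ruler([0, 1, 2, 5]))  # False: the distance 1 appears twice
```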
Figure 2: Structure of a Golomb Ruler
Fig. 2 shows an optimal Golomb Ruler with 4 marks. The process of creating
a sample GR is easy, but finding the optimal GR for a specified order is
computationally very challenging [32]. For instance, the search for an optimal
GR with 19 marks took approximately 36,200 CPU hours on a Sun Sparc
workstation using a brute-force parallel search implementation. The largest
order for which an optimal GR has been found up until now is 27.
As mentioned above, the problem of finding an optimal Golomb Ruler with
no duplicated distances is an NP-hard problem. Due to this fact, several heuris-
tics have been developed. A well-known sequential solution of finding an optimal
GR was proposed by Shearer [39]. This solution is based on a branch-and-bound
algorithm with backtracking, and it generates optimal GRs up to order 16.
Soliday [40] proposed an evolutionary algorithm in which a chromosome is
represented by n−1 integer segment lengths (an indirect representation). The
mutation operator is of two types: a change in segment length or a permutation
of the segment order. Two evaluation criteria are used: the overall length of the
ruler and the number of repeated measurements.
The first hybrid approaches towards the implementation of evolutionary
algorithms were realized by Feeney [18]. His approach combined genetic
algorithms with a local search technique and Baldwinian or Lamarckian learn-
ing [13]. The proposed (direct) representation consists of an array of integers
corresponding to the marks. The crossover operator is a random swap between
two positions, and a sort procedure was added at the end. For this algorithm,
the distance achieved from optimal rulers is between 6.8 and 20.3. Van
Hentenryck and Dotu [15] created an evolutionary algorithm with Tabu Search
as the mutation operator and single-point crossover. The algorithm uses a
random strategy in the selection process to choose parents for breeding. In this
case, the distances from optimal rulers for 12 to 16 marks were between 7.1 and 10.2.
Cotta, Dotu, and Van Hentenryck used the GRASP (greedy randomized adaptive
search procedure) method, Scatter Search and Tabu Search, clustering
techniques, and constraint programming. Combining these techniques, a memetic
algorithm was proposed that was reported to achieve distances to the optimal
rulers of lengths from 10 to 16 between 1.6 and 6.2 [15]. It took parallel
solutions [21, 2] several months using thousands of computers to find optimal
rulers up to 26 marks.
There were also trials of accelerating NP-hard problems in GPGPU [29, 28].
Most of them concentrated on crossover operators and hill climbing. It is worth
mentioning the Paradiseo system as being one of the first scalable systems with
a genetic algorithm for solving optimization problems [10]. Yet, according to
the authors' knowledge, there are no reported attempts at solving such difficult
black-box problems in heterogeneous environments.
4. Scatter search algorithm
Scatter Search (SS) [30] is a population-based metaheuristic. As can be seen
in the pseudocode of Algorithm 1, it is based on the iterative application of a
collection of search procedures to a pool of solutions (called the reference set),
in a similar fashion as in evolutionary algorithms. This is why SS might be
considered a particular case of a memetic algorithm. More precisely, the
following components can be identified in the algorithm:
• Initialization(): A diversification-generation method for generating a
collection of solutions, possibly using some initial solution as a seed, the
GRASP algorithm, or some clustering methods (e.g., k-Means, DBSCAN).
• Improvement(): A method for enhancing the quality of solutions (e.g.,
Simulated Annealing, Tabu Search).
• UpdateReferenceSet(): A reference-set-update method for building a
reference set from the initial set of solutions generated and for maintaining
it by incorporating some solutions produced in subsequent steps.
• GenerateSubset(): A method for selecting solutions from the reference set
and arranging them in small groups (pairs, triplets, or larger groups) for
undergoing combination.
• SolutionCombination(): A method for creating new solutions by combining
the information contained in a certain group of solutions.
Algorithm 1 Scatter Search algorithm
pop, rset, sset : Population
pop ← Initialization()
while not StoppingCondition (actual_cycle ≤ nr_cycles) do
    pop ← Improvement(pop)
    if NewOptimalGRFound(pop) then
        GRASP(pop, tmp_ogr)
    end if
    if not Stagnated(pop) then
        rset ← UpdateReferenceSet(pop)
    else
        rset ← RestartReferenceSet(pop)    ▷ uses GRASP
    end if
    sset ← GenerateSubset(rset)
    pop ← SolutionCombination(sset)
end while
return BestSolutionIn(pop)
• RestartReferenceSet(): A restart method for refreshing the reference set
once it has been found to be stagnated. This can typically be done by
using the diversification-generation method together with the improvement
method mentioned above; however, other strategies might be considered as well.
As one may see, the diversification-generation method serves two purposes in
the SS algorithm: it is used for generating the initial population from which
the reference set will be initially extracted, and it is utilized for refreshing the
reference set whenever a restart is needed. The generation of new solutions
is performed by using a randomized procedure that tries to generate diverse
solutions.
The application of the SS algorithm to the OGR problem can make use
of both an indirect and a direct representation in different stages of the search.
More specifically, the indirect approach is used in the phases of initialization
and restarting of the population and takes ideas borrowed from GRASP. The
direct approach is used in the stages of recombination and local improvement,
particularly when the local-improvement method is based on the Tabu Search
or Simulated Annealing algorithm.
The specification of a particular SS algorithm is completed once the items
above are detailed. Therefore, SS is an excellent candidate for a template hybrid
algorithm: each single method can be replaced by a different algorithm run on
a chosen hardware platform.
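To illustrate this template character, the loop of Algorithm 1 can be sketched in Python with every component passed in as a callable. The toy instantiation below (minimizing x² over integers) is purely our own illustration; all component implementations are placeholders:

```python
import random

random.seed(1)  # deterministic toy run

def scatter_search(initialize, improve, update_refset, generate_subsets,
                   combine, restart, stagnated, quality, n_cycles=100):
    """Template Scatter Search loop mirroring Algorithm 1; every stage is a
    plug-in callable, so each one could be delegated to different hardware."""
    pop = initialize()
    rset = update_refset(pop, [])
    for _ in range(n_cycles):
        pop = improve(pop)
        rset = update_refset(pop, rset) if not stagnated(pop) else restart()
        pop = [combine(group) for group in generate_subsets(rset)]
    return min(pop + rset, key=quality)

# Toy instantiation: minimize x^2 over integers.
best = scatter_search(
    initialize=lambda: [random.randint(-50, 50) for _ in range(20)],
    improve=lambda pop: [x - (x > 0) + (x < 0) for x in pop],  # step toward 0
    update_refset=lambda pop, rset: sorted(set(pop + rset), key=abs)[:5],
    generate_subsets=lambda rset: [(a, b) for a in rset for b in rset if a < b],
    combine=lambda group: (group[0] + group[1]) // 2,
    restart=lambda: [random.randint(-50, 50) for _ in range(5)],
    stagnated=lambda pop: False,
    quality=lambda x: x * x,
    n_cycles=30)
print(best)
```

Swapping any of the callables for a GPU-backed implementation leaves the loop itself unchanged, which is exactly the hybridization opportunity discussed here.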
5. GRASP algorithm
One of the first approaches to solving the Golomb Ruler problem incorporated
ideas from Greedy Randomized Adaptive Search Procedures (GRASP)
[19] into an evolutionary algorithm. GRASP is an example of a repetitive
sampling technique. In each iteration, the algorithm produces a solution for the
problem at hand by a greedy randomized construction algorithm (see Alg. 2).
In GreedyRandomizedConstruction(), the solutions are represented by a list of
attributes, and the algorithm builds these solutions incrementally (specifying
the values of each attribute one by one). Later, these values are ranked accord-
ing to some local quality measure and selected using a quality-based mechanism.
Algorithm 2 GRASP algorithm
s, s1 : Solution
s ← ConstructRandomSolution()
while not StoppingCondition do
    s1 ← GreedyRandomizedConstruction()
    s1 ← LocalSearch(s1)
    if Quality(s1) > Quality(s) then
        s ← s1
    end if
end while
return s
In each step of the new solution-creation process, a ranked list of potential
attribute values is created. The value of the attribute is chosen by using a
qualitative or quantitative criterion (based on looking for the best solution in
the so-called Restricted Candidate List, or a solution belonging to a certain
range computed over this list). When applying GRASP to the Golomb Ruler
problem, the attributes of solutions are the positions of the marks. The RCL
is generated with the use of infeasible solutions; later, it is trimmed down to
form a feasible set of Golomb Rulers. The best ruler is, of course, chosen
based on its actual length; however, some randomness may also be taken into
consideration in order to avoid getting stuck in a local extremum.
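The construction scheme described above can be sketched as follows. This is a simplified, CPU-only illustration under our own parameter choices (here the RCL simply holds the first rcl_size feasible positions in greedy order), not the implementation evaluated later:

```python
import random

def grasp_golomb(order, rcl_size=3, max_pos=200, iters=200, seed=1):
    """GRASP-style construction for Golomb rulers: marks are added one at a
    time; the RCL holds the first rcl_size feasible positions, and the next
    mark is drawn from it at random."""
    rng = random.Random(seed)
    best = None
    for _ in range(iters):
        marks, dists = [0], set()
        while len(marks) < order:
            cands = []
            for p in range(marks[-1] + 1, max_pos):
                new = {p - m for m in marks}       # distances the mark adds
                if not (new & dists):              # feasible: no repetition
                    cands.append((p, new))
                    if len(cands) >= rcl_size:
                        break
            if not cands:
                break                              # restart this construction
            p, new = rng.choice(cands)             # randomized greedy choice
            marks.append(p)
            dists |= new
        if len(marks) == order and (best is None or marks[-1] < best[-1]):
            best = marks
    return best

print(grasp_golomb(5))  # a valid (not necessarily optimal) 5-mark ruler
```

Each restart yields a feasible ruler by construction; the shortest one found across iterations is kept.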
6. Tabu Search
Tabu Search is a general global optimization and search metaheuristic that uses
a FIFO list of previously visited solutions in order to force the search into
unexplored areas (see Alg. 3) [22]. The algorithm can work with tentative
solutions that may be infeasible (i.e., in a Golomb Ruler, there may exist some
repeated distances between marks). The algorithm uses the notion of constraint
violations on the distances (the number of repetitions of the distances in a certain
ruler), and exceeding the currently assumed violation constraint is not allowed.
The moves in the local search consist of changing the value of a single mark
(considering only moves that do not make the ruler infeasible).
In an attempt to make Tabu Search more efficient, we tried to implement a
parallel version of this algorithm based on shared memory. The implementation
in a multi-core CPU is based on using locks on memory by parallel threads.
Algorithm 3 Tabu Search algorithm
t : TabuList
s ← RandomSolution()
(mark, new_value, cycles_in_tabu) ← triple(s)
while not StoppingCondition do
    s' ← FindNextSolution(s)
    if Quality(s') > Quality(s) and (not Contains(TabuList, s') or not IsValid(TabuList, s')) then
        s ← s'
        RemoveFirst(TabuList)
        Push(TabuList, triple(s))
    end if
end while
return s
The tabu list consists of size_of_list · 3 entries. Each mutated solution checks
(in a synchronized manner) whether the current move is allowed and writes it to
the tabu list. In the case of the GPU, the list size is limited by the shared
memory (48 kB) and can only be used locally by the block's threads. Additionally,
a triple (index, move, time interval) must be compressed to a 32-bit format for
atomic operations to be performed.
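One possible way to compress such a triple into a single 32-bit word is plain bit packing. The field widths below are our own assumption (8 bits for the mark index, 12 bits each for the new value and the tabu tenure), not the exact layout used in the implementation:

```python
# Assumed field widths: 8-bit mark index, 12-bit new value, 12-bit tenure.
def pack(index, value, tenure):
    """Pack the (index, move, time interval) triple into one 32-bit word."""
    assert index < (1 << 8) and value < (1 << 12) and tenure < (1 << 12)
    return (index << 24) | (value << 12) | tenure

def unpack(word):
    """Recover the three fields from a packed 32-bit word."""
    return (word >> 24) & 0xFF, (word >> 12) & 0xFFF, word & 0xFFF

print(unpack(pack(5, 300, 17)))  # (5, 300, 17)
```

Packing the triple into one machine word is what makes a single atomic compare-and-swap on the tabu entry possible.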
A modified version of Tabu Search can be run on the GPU. This variant of the
tabu algorithm is based on non-deterministic accesses to memory (see Section 4).
Each thread responsible for one solution writes three 32-bit values without
synchronization in each cycle. Each thread has its own locations allocated in
the tabu list. The write operations are executed without bank conflicts (Fig. 3).
The tabu list is organized as a cyclic-array list. The read operations of all
threads are performed on the whole tabu list stored in shared memory.
The efficiency of the non-deterministic version is shown in Fig. 4.
Figure 3: Non-deterministic tabu.
Figure 4: Viewing tabu list in non-deterministic tabu.
7. Crossover operators
The first combination method begins by building a list of all marks present in
either of the parents, trying to minimize the number of constraints violated
when placing each new mark.
The second crossover method, based on the indirect representation, randomly
chooses unique difference values from the crossed parents. The indirect
representation is a type of representation in which only the differences between
marks are stored (for the ruler from Fig. 2, the indirect representation is
[1, 3, 2], and its length is n−1, where n is the length of the direct representation).
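The two representations convert into each other directly: the indirect form is the sequence of consecutive differences, and the direct form is its prefix sums. A small sketch of our own:

```python
def to_indirect(marks):
    """Direct [0, 1, 4, 6] -> indirect [1, 3, 2] (consecutive differences)."""
    return [b - a for a, b in zip(marks, marks[1:])]

def to_direct(segments, start=0):
    """Indirect [1, 3, 2] -> direct [0, 1, 4, 6] (prefix sums)."""
    marks = [start]
    for s in segments:
        marks.append(marks[-1] + s)
    return marks

print(to_indirect([0, 1, 4, 6]))  # [1, 3, 2]
print(to_direct([1, 3, 2]))       # [0, 1, 4, 6]
```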
The third available combination method is based on a single-point crossover.
In this version of the crossover, the curand library is used for choosing
random parents and a random point in a single representation and then copying
genomes between the two solutions. The last step is to choose the best solutions
among the children and parents. The results of running the one-point crossover
operator are described in Table 4, and the crossover on the indirect
representation is presented in Table 5. The reason that the CPU outperforms
the GPU is the irregular accesses to shared memory and the loading of data into
local memory (not cached) in each cycle of crossing.
8. Experimental verification of metaheuristics leveraging GPGPU
The main result of the research presented in [35] was the efficient implemen-
tation of a memetic algorithm leveraging GPGPU for a local search. We have
shown experimentally that our implementation is about ten times faster than
a highly optimized multicore CPU one. After completing this research, we de-
cided to go further into the hybrid implementations of metaheuristics, focusing
Figure 5: The histogram update computation.
Figure 6: Reduction process of finding the best solution and its position
on Scatter Search as a very efficient algorithm for solving difficult optimization
problems (cf. Section 4).
As this algorithm can also be decomposed into components, we focused on
determining which components are particularly suitable for a hybrid
implementation leveraging the GPGPU (as was the case with the local search in
the evolutionary algorithm in our previous paper [35]). The improvement
method seems to be the best candidate for delegating to the GPGPU;
however, as most Tabu Search implementations extensively use shared
data structures (i.e., the tabu list), it cannot be efficiently implemented on the
GPU. Therefore, instead of focusing only on Tabu Search, we have also tried
to delegate the Scatter Search improvement method implemented as Simulated
Annealing and Random Mutation Hill Climbing.
The research was done on the following platforms: NVIDIA Tesla M2090 and
Intel Xeon E5645 2.40 GHz. Note that all of the results presented in
this section and the next are averaged over 10 runs. In those cases where the
observed standard deviation is not given in the table, it did not exceed 10%; in
particular, the time-related results gathered from the GPGPU devices were
highly repeatable (the deviation was close to 0).
8.1. Verification of improvement methods
Random Mutation Hill Climbing (RMHC) and Simulated Annealing (SA)
exploit the GPGPU's computational resources as much as possible. The kernel is
run on a number of blocks equal to the size of the population. Each block
copies a separate solution from the global memory, together with its histogram
of all of the possible ruler differences. Next, all available threads in the block run
the mutation operator on the solution stored in their shared memory. Each thread
generates two random numbers in the mutation process. The first one is a point
in the Golomb Ruler to be mutated, and the second is the new value at that
position.
The algorithm needs number_of_threads_in_block · 4 · (size_of_ruler − 1) · 4
bytes of shared memory for the data storing the new and old differences between
the chosen position and the other points. From this data, a fitness value is
computed for each mutated solution (see Fig. 5). Elements with even indexes in
the table are the values added to or removed from the histogram after mutation,
and elements with odd indexes store the corresponding modification counts of
these values in the histogram. The values needed for updating the histogram
and computing the fitness are stored in a manner that minimizes bank conflicts
(see Fig. 5). The fitness values and the index of the thread are written to the
shared memory for a reduction process (1024 · 2 · 4 bytes), which will be run for
choosing the best mutated solution (see Fig. 6). Two parallel reductions are
run: the first for finding the best fitness value, and the second for finding which
solution represents the best temporal ruler. Between iterations, the threads in
the blocks can exchange information about the best solution found via a
temporary minimal ruler stored per block. This is done by an atomic operation
on a global memory variable (the atomicMin CUDA function). This process can
help narrow the domain space of possible optimal solutions (greedy algorithm)
in the next iterations. In the case of the parallel implementation in a multi-core
environment, OpenMP directives were used. The outer loop of the algorithm
(responsible for iteration over solutions) was parallelized. The OpenMP lock
mechanism was used to update the global temporal minimal ruler found.
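The incremental fitness computation behind this scheme can be illustrated sequentially: when one mark changes, only the distances to the mutated position leave or enter the difference histogram. A CPU-side sketch of our own, with fitness counted as the number of repeated distances:

```python
from collections import Counter

def violations(hist):
    """Fitness term: the number of repeated distances in the histogram."""
    return sum(c - 1 for c in hist.values() if c > 1)

def mutate_and_update(marks, hist, pos, new_value):
    """Replace marks[pos], updating only the distances that the mutation
    removes or adds (the added/removed bookkeeping of Fig. 5, sequentially)."""
    old = marks[pos]
    for i, m in enumerate(marks):
        if i != pos:
            hist[abs(old - m)] -= 1        # distance removed by the mutation
            hist[abs(new_value - m)] += 1  # distance added by the mutation
    marks[pos] = new_value
    return violations(hist)

marks = [0, 1, 4, 6]
hist = Counter(abs(b - a) for i, a in enumerate(marks) for b in marks[i + 1:])
print(violations(hist))                      # 0: a valid Golomb ruler
print(mutate_and_update(marks, hist, 3, 5))  # 2: [0, 1, 4, 5] repeats 1 and 4
```

Updating O(n) affected distances instead of recomputing all O(n²) of them is what makes the per-thread mutation step cheap.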
The last approach is the implementation of Simulated Annealing on the GPU.
Each thread stores a single solution and all distances between its marks in the
registers and local memory. The parameters of the method are the starting
temperature and the procedure for changing the temperature during the
execution of the whole process. The threads synchronize after a given number of
algorithm cycles to check whether their temporary solution is worse than the
current best-found solution. If so, they try to generate a new one.
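The acceptance rule and cooling schedule used by each SA thread can be sketched as follows (the Metropolis criterion with a geometric schedule is our assumption of a typical setup; the paper does not fix these details):

```python
import math
import random

def sa_accept(delta, temp, rng=random):
    """Metropolis rule: always accept improvements (delta <= 0); accept a
    worsening move with probability exp(-delta / temp)."""
    return delta <= 0 or rng.random() < math.exp(-delta / temp)

def cooled_temp(t0, alpha, step):
    """Geometric cooling schedule; t0 and alpha are assumed parameters."""
    return t0 * alpha ** step

# Empirical acceptance rate of a worsening move of size 1.0 at temp 2.0:
rng = random.Random(0)
frac = sum(sa_accept(1.0, 2.0, rng) for _ in range(10000)) / 10000
print(frac)  # close to exp(-0.5) ~ 0.61
```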
The experimental results obtained for both of these improvement methods
are shown in Table 1. The results show that RMHC and SA have similar quality
in finding optimal Golomb rulers.
Comparing the results obtained for RMHC and SA with those obtained
when using the tabu list (the parameters of the simulation remaining the
same), it is easy to see that the tabu list prevails (cf. Tables 1 and 2). However,
Table 1: Comparison between Simulated Annealing and RMHC (length of minimal rulers
found) based on single-bit mutations (population size: 4096; number of steps: 10,000).
length of the ruler SA RMHC
12 157 143
13 161 180
14 295 283
15 391 399
one must remember the nature of the operators used: as the tabu list requires the
use of shared memory, the straightforward implementation of this algorithm
on the GPU (see Table 3) is visibly slower than its CPU version. Therefore,
the implementation of the tabu list on the GPU must be carefully planned.
Table 2: The quality of tabu-based search on cyclic list in CPU (10,000 cycles, best results in
5 probes, population size 4096).
length of the ruler CPU
12 127
13 154
14 206
15 248
Table 3: The efficiency of tabu based on a cyclic list in CPU (single core, vectorized) and GPU
(13-mark ruler, tabu memory size: 4096).
population size CPU [ms] GPU [ms]
64 8459 ±101.4 25400 ±0.1
128 16938 ±33.5 25500 ±0.1
256 33689 ±37.9 25700 ±0.1
512 67381 ±223.1 26000 ±0.1
8.2. Verification of crossover
Now, let us focus on the solution recombination algorithm in SS. In our case,
these are a one-point crossover and an indirect-representation crossover.
The first crossover operator is implemented as a single-point crossover on the
direct representation. The implementation uses the curand library for
choosing a random point in a single representation and then copies genomes
(from the shared memory) between the two solutions (the number of bank
conflicts depends on the size of the rulers and the crossover points); see Fig. 7.
Then, the fitness functions of the new solutions stored in the local thread
memories are computed. Finally, they are compared with the parents, and the
best solutions are written to the shared or device memory.
In Table 4, the efficiency of the CPU and GPU implementations of a one-
point crossover is shown. It can be noticed that the GPU is much slower than
Figure 7: Single point crossover in GPU.
the CPU version. The reason is that this crossover produces two children and
then processes them (e.g., removing duplicate marks, computing fitness). It is
worth noting that vectorization gives a quite linear speed-up. Table 5 presents
the crossover on the indirect representation. This operator is faster on the
non-optimized CPU than the previous one, but its parallel (vectorized) speed-up
is lower. The GPU version outperforms the vectorized CPU version as the size
of the population grows.
Table 4: Crossover GPU vs CPU (13-mark ruler, 400-time run).
population CPU [ms] CPU (vectorized) [ms] GPU [ms]
512 645 245 4806
1024 1400 541 6920
2048 2784 970 10980
4096 5550 1950 17762
8.3. Verification of clustering algorithm
The clustering step is used during the reference update stage of SS, and
we also considered delegating this step to the GPU. In this case, the k-means
algorithm was used.
Table 5: Crossover on indirect representation GPU vs CPU (13-mark ruler, 400-time run).
population CPU [ms] CPU (vectorized) [ms] GPU [ms]
512 460 304 907
1024 945 670 1114
2048 1827 1350 1545
4096 3510 2910 2400
The k-means algorithm is based on the solution-histogram representation. The
metric is computed as the cosine of the angle between the clustered solutions
and the centroids. The GPU implementation is divided into two kernels. The
first one is responsible for assigning solutions to the centroids, with each thread
responsible for one solution. The second is responsible for the centroid update;
a single thread computes the new centroid coordinates.
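The two-kernel structure corresponds to the two alternating phases of Lloyd-style k-means. A sequential sketch of our own, using the cosine similarity on histogram vectors:

```python
import math

def cosine(u, v):
    """Cosine similarity between two histogram vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def kmeans_cosine(points, centroids, cycles=10):
    """Two alternating phases mirroring the two GPU kernels: assign every
    solution histogram to its most similar centroid, then update centroids."""
    clusters = [[] for _ in centroids]
    for _ in range(cycles):
        clusters = [[] for _ in centroids]
        for p in points:  # kernel 1: one thread per solution
            k = max(range(len(centroids)),
                    key=lambda j: cosine(p, centroids[j]))
            clusters[k].append(p)
        for k, cl in enumerate(clusters):  # kernel 2: centroid update
            if cl:
                centroids[k] = [sum(col) / len(cl) for col in zip(*cl)]
    return centroids, clusters
```

On the GPU, the assignment phase is trivially data-parallel (one thread per solution), which is why this step accelerates so well in Table 6.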
In Table 6, the efficiency of the GPU implementation of k-means as compared
to the CPU is given. It is quite easy to see that the GPU implementation of this
Table 6: GPU k-means implementation of GR solutions clustering (time in ms, 100 cycles,
32 clusters).
population  CPU [ms]      GPU [ms]  CPU (3 cores) [ms]  CPU (vectorized) [ms]
2048        3745.4 ±22.4  482 ±0    2070.7 ±49.3        1005.5 ±31.6
4096        7419.9 ±27.8  964 ±0    4463.8 ±85.2        1988 ±37.3
clustering method is superior to that of the CPU, making this a good candidate
for running on a GPU.
Now, let us focus on the initialization and re-initialization steps of SS, im-
plemented as a GRASP algorithm in our case.
8.4. Verification of initialization and re-initialization algorithm
The GRASP algorithm is implemented in such a way that each thread is
responsible for one solution. Each thread stores all of the data needed for
GRASP execution in local memory. These structures are: the RCL list, the
histogram, and a temporary solution. With static memory allocation, the memory
needed by a single thread is equal to max_size_of_rcl + tmp_max + size_of_ruler,
where max_size_of_rcl is the number of elements from which the new mark is
randomly chosen. The second approach uses dynamic memory allocation for the
histogram; the maximum size of this memory is equal to n·(n−1)/2. The histogram
values are then sorted. Threads communicate with each other through an atomic
operation (atomicMin) on global memory, which stores the minimal ruler found
thus far.
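The per-thread construction can be sketched sequentially in Python. This is a greedy randomized (RCL-based) construction of a Golomb ruler; building the RCL from the smallest feasible candidate marks is our assumption, as the paper does not spell out the greedy criterion.

```python
import random

def grasp_golomb(n_marks, rcl_size=3, rng=None):
    # Greedy randomized construction of an n-mark Golomb ruler:
    # every pairwise difference between marks must be distinct.
    rng = rng or random.Random()
    marks = [0]
    diffs = set()                       # differences used so far
    while len(marks) < n_marks:
        rcl = []                        # restricted candidate list
        cand = marks[-1] + 1
        while len(rcl) < rcl_size:      # smallest feasible candidates
            new = {cand - m for m in marks}
            if not (new & diffs):       # all new differences must be unused
                rcl.append((cand, new))
            cand += 1
        pick, new = rng.choice(rcl)     # randomized step of GRASP
        marks.append(pick)
        diffs |= new
    return marks
```

On the GPU, each thread runs this construction independently on its own local structures (RCL, histogram, temporary solution), matching the memory layout described above.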
In Tables 7 and 8, one can see that the GPU implementation of GRASP is
clearly superior to that of the CPU (even the highly optimized one), especially
when a higher number of individuals is used.
Table 7: GRASP GPU vs CPU (7-mark ruler, 1600-time run).
population CPU [ms] CPU (vectorized) [ms] GPU [ms]
256 1008 ±41.7 397.3 ±27.3 385 ±0.1
512 1873.5 ±28.5 727.1 ±45.5 386 ±0.1
1024 3651.2 ±35.9 1403.5 ±28.07 387 ±0.1
2048 7313.3 ±43.8 2737.1 ±43.2 438 ±0.1
Table 8: GRASP GPU vs CPU (13-mark ruler, 1600-time run).
population CPU [ms] CPU (vectorized) [ms] CPU (3 cores) [ms] GPU [ms]
512 17536.7 ±40.9 4757.5 ±92.2 7376.2 ±84.9 3486.2 ±0.4
1024 33994 ±938.9 14223.7 ±35.1 13901.2 ±618.9 3980 ±0.1
2048 69825.2 ±192.7 28855.7 ±399.9 28635 ±224 4880 ±0.1
9. Heterogeneous multi-platform implementation of Scatter Search
Summing up the previously collected experience regarding particular meta-
heuristic building blocks, let us check how the complete algorithm, namely
Scatter Search, behaves with selected components delegated to the GPU. As
profiling shows, the improvement method takes most of the total execution time
of the algorithm. The results presented in the previous section showed that all
of the improvement methods (apart from deterministic tabu) can be accelerated
on a GPU, although the best solution quality was obtained by deterministic
tabu. The remaining parts of the Scatter Search algorithm (reference set
update, combination method, and clustering) can also be accelerated, but their
impact on the total time is negligible. The next section describes the possible
configurations of heterogeneous solutions.
The pseudocode of the implemented version of Scatter Search is given as
Algorithm 4. The algorithm can be fully parameterized. GRASP is configured
by a parameter defining the size of the set from which a new mark is chosen.
The combination method is defined by the number of crossover operations run
in a single cycle, the stagnation parameter (defining the number of cycles after
which the reference update method is invoked), and the size of the tabu memory.
Algorithm 4 Implemented Scatter Search algorithm
  pop : Population
  pop ← GRASPInitialization()
  while not StoppingCondition do
    pop ← Crossover(pop)
    pop ← GRASPGeneration(pop)
    pop ← TabuSearch(pop)
  end while
  return BestSolutionIn(pop)
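The loop above can be sketched as a minimal Python skeleton, with the operators passed in as placeholders and the stopping condition simplified to a fixed cycle count (an assumption; the actual condition is parameterized).

```python
def scatter_search(init, crossover, grasp, tabu, fitness, cycles):
    # Skeleton of Algorithm 4: diversify with crossover and GRASP,
    # then intensify with tabu search, for a fixed number of cycles.
    pop = init()                        # GRASPInitialization()
    for _ in range(cycles):             # StoppingCondition stand-in
        pop = crossover(pop)
        pop = grasp(pop)
        pop = tabu(pop)
    return min(pop, key=fitness)        # BestSolutionIn(pop)
```

The skeleton makes the component structure explicit: each operator can be swapped for a CPU, vectorized, or GPU implementation without touching the loop itself, which is the basis of the heterogeneous configurations discussed below.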
The quality of Scatter Search is presented in Table 9. The table describes
the percentage deviation from the optimal ruler.
Now, let us try to leverage the available distributed infrastructure and GPU
hardware in order to implement Scatter Search from the above-mentioned
metaheuristic components, so that some of them run separately on different
hardware platforms.
The whole algorithm can be run in two main configurations:
Table 9: The quality of Scatter Search on a single-core CPU (population size × cycles ×
tabu memory): percentage deviation from the known OGR.
Scatter Search       10  11  12  13  14  15   16
100 × 5000 × 5000     0   0   0   6   7   8    9
100 × 15000 × 5000    0   0   0   4   6   7.5  8.5
100 × 35000 × 5000    0   0   0   0   4   5    6
100 × 85000 × 5000    0   0   0   0   2   4.5  5
• cluster nodes communicate by sending the minimal rulers found so far (at
intervals set by a parameter),
• scatter and gather operations are performed to send the best solutions
found thus far to the root node; then the root node runs the clustering
algorithm and sends back single clusters to each node.
Figure 8: The heterogeneous cluster architecture.
In the first approach, the MPI_AllReduce function is used to compute the
minimal ruler found within a given period of running the algorithm. In the
second solution, the MPI_Scatter and MPI_Gather functions are used to send
all solutions to the root node; the root then runs the k-means algorithm to
divide the solutions among clusters, and finally the clusters are sent back to
the nodes (each node receives one cluster of solutions).
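The data flow of the second configuration can be illustrated with a single-process simulation; the real implementation uses MPI_Gather/MPI_Scatter across nodes, and the length-based grouping below is a simple stand-in for the k-means step.

```python
def redistribute(node_pops, key=len):
    # Root gathers all solutions from the nodes (MPI_Gather in the real
    # implementation), groups similar ones, and returns one group per node
    # (sent back with MPI_Scatter in the real implementation).
    pool = sorted((s for pop in node_pops for s in pop), key=key)  # gather
    k = len(node_pops)
    size = (len(pool) + k - 1) // k
    # Contiguous chunks of similar-length rulers; a stand-in for k-means.
    return [pool[i * size:(i + 1) * size] for i in range(k)]
```

After the exchange, each node continues the search on one cluster of mutually similar solutions, which is the intent of the clustering-based redistribution described above.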
Table 10 presents the convergence to the optimal ruler for the sequential
version of the algorithm and the parallel (OpenMP) one. Table 11 describes the
speed-up of running Scatter Search in the heterogeneous version on a single
cluster node.
The GRASP algorithm can be run as a single algorithm (as described earlier)
or on the heterogeneous platform, using all available hardware accelerators.
Each accelerator runs GRASP independently, and at set intervals they
communicate the best solution found so far. The CPU and GPU communicate
internally within each node, while the CPUs of different nodes communicate
through MPI functions (see Section 9). The GRASP algorithm is also useful in
the case of
hybrid algorithms like Scatter Search, as it enables us to produce new solutions
during the initialization process as well as during the reference-restart method.
The efficiency of the algorithm is shown in Table 8. Observations of the quality
of the found solutions showed that we were able to find optimal Golomb Rulers
with lengths up to 12 after 10 minutes of computing, with a zero or negligible
deviation from the known best values.
Table 10: Quality comparison between single- and multi-core CPU implementations of Scatter
Search, 100 × 10000 × 2000 (the lengths of the best rulers found after a 15-minute search).
Golomb Ruler length  CPU (vectorized)  CPU (6 cores)  GPU (initial GRASP and tabu)
12                   111               96             90
13                   145               121            115
14                   175               147            140
15                   214               175            165
We observed that, when running Scatter Search on two cluster nodes using
their CPU and GPU resources (with 100 individuals run for 50,000 cycles and a
tabu list of length 2000), we were able to find optimal Golomb Rulers up to a
length of 16 with zero or negligible deviation.
The cluster can also be configured to run the GRASP algorithm using all
available resources. The accelerated algorithm is run on the GPUs and the
multi-core CPUs, and communication between nodes is implemented with the
MPI_AllReduce operation.
The cluster has a peak performance of about 136.7 TFlops. Each node
consists of processors from the Intel Xeon 5645 family and NVIDIA Tesla M2050
and M2090 graphic cards. The bus between the CPU and GPU provides 4 GB/s of
bandwidth. The interconnect between the cluster nodes is based on InfiniBand
QDR 4x (peaking at about 40 Gb/s).
10. Conclusions
This article presents the know-how gathered from consecutive studies of
adapting selected metaheuristics (particularly those utilizing evolutionary and
Table 11: Speed-up of Scatter Search in the heterogeneous version on a single cluster node
(population size × cycles × tabu memory size).
Scatter Search       CPU (vectorized) [ms]  GPU (initial GRASP and tabu) [ms]
64 × 10000 × 4096    8800                   25900
128 × 10000 × 4096   17600                  26100
256 × 10000 × 4096   35000                  27000
512 × 10000 × 4096   69300                  28100
local search techniques) to processing on parallel hardware platforms. Based
on the experience gathered while speeding up classical memetic algorithms with
GPGPU [35], we experimented here with a very effective metaheuristic, namely
Scatter Search, used to solve a difficult discrete optimization problem (OGR).
In the case of the Golomb Ruler problem (as in other NP-hard problems), the
crucial aspect of a hardware implementation is the representation. The amount
of processed data increases dramatically as the problem grows; block registers
and shared memory are not enough to store the required data. Therefore, local
thread memory is often needed, so L1 and L2 cache optimization should be
performed. As demonstrated, GPU-like hardware accelerators can be useful in
accelerating black-box problem solving, if only selected parts of a given
heuristic are delegated to them.
The lessons learned confirm that the use of GPGPU gives the expected
speed-up only for selected components of the researched metaheuristic. Thus,
the implementation of complex techniques on parallel hardware platforms is
often difficult and occasionally may prove inefficient. However, since some
parts of the algorithms fit the specific characteristics of the GPGPU
architecture better than others, one can easily imagine a heterogeneous
realization of such algorithms. Good candidates for such a design are
techniques built from components (like Scatter Search), which use various
sub-algorithms. Having a 'library' of different implementations, the
construction of such complex techniques may be perceived as building from
components rather than from scratch.
We have evaluated the components of Scatter Search and found that both
local search and GRASP turned out to be the most suitable for GPGPU-based
implementation. Other components often gained better performance when realized
on a CPU with the appropriate use of multi-core or vectorized processing. This
can be treated as the germ of a 'library' and a base for the heterogeneous
realization of the SS algorithm, which was evaluated as a whole to illustrate
the above-described concept.
Thus, one of the most important results of our research is the experimental
study on a multi-GPGPU, heterogeneous distributed platform. Our cluster
consists of nodes with both multi-core processors (up to 12 cores each) and up
to two GPUs per node. The presented results show the very good performance of
such a solution. Another important result is the outcome of profiling the
different parts of Scatter Search, which shows clearly that metaheuristics such
as Simulated Annealing or GRASP are very efficient when run on a GPGPU, while
Tabu Search implemented in the classic manner is completely infeasible in this
setting.
The gathered know-how extends the state of the art and makes possible the
design of dedicated implementations of population-based metaheuristics that
efficiently leverage the available hardware computing devices, GPGPUs in the
case of this paper. Moreover, keeping in mind that similar results related to
profiling metaheuristics for efficient use of GPGPUs are hard to come by, with
the notable exception of evolutionary algorithms and similar techniques, the
added value of this paper should be of interest to practitioners; in
particular, those focused on the OGR problem will find exclusive content here.
As the reported results seem promising, we plan to generalize our platform
as a next step, trying both to solve different black-box problems and to
implement new promising metaheuristics. We are also working to include FPGAs
in our toolbox as a means of delegating certain parts of the computing.
Moreover, a first approach to constructing a custom co-processor has already
been successfully realized (applied to solving the Low Autocorrelation Binary
Sequences problem), and we are striving to incorporate the resulting hardware
into our heterogeneous computing architecture. The first results of this
research are under review.
Acknowledgments
The research presented in the paper received partial support from AGH
University of Science and Technology statutory project (no. 11.11.230.124).
The research was partially supported by PL-Grid infrastructure.
References
[1] Cuda framework. https://developer.nvidia.com/cuda-gpus.
[2] OGR project. http://www.distributed.net/org.
[3] Openmp library. http://www.openmp.org.
[4] G. Bloom and S. Golomb. Applications of numbered undirected graphs. In
Proceedings of the IEEE, 65(4), pages 562–570, April 1977.
[5] A. Byrski and R. Schaefer. Formal model for agent-based asynchronous
evolutionary computation. In 2009 IEEE Congress on Evolutionary Com-
putation, pages 78–85, May 2009.
[6] Aleksander Byrski, Rafal Drezewski, Leszek Siwik, and Marek Kisiel-
Dorohinicki. Evolutionary multi-agent systems. The Knowledge Engineer-
ing Review, 30:171–186, 3 2015.
[7] Aleksander Byrski and Marek Kisiel-Dorohinicki. Immune-Based Optimiza-
tion of Predicting Neural Networks, pages 703–710. Springer Berlin Heidel-
berg, Berlin, Heidelberg, 2005.
[8] Aleksander Byrski and Marek Kisiel-Dorohinicki. Agent-Based Evolution-
ary and Immunological Optimization, pages 928–935. Springer Berlin Hei-
delberg, Berlin, Heidelberg, 2007.
[9] Aleksander Byrski and Marek Kisiel-Dorohinicki. Agent-Based Model and
Computing Environment Facilitating the Development of Distributed Com-
putational Intelligence Systems, pages 865–874. Springer Berlin Heidelberg,
Berlin, Heidelberg, 2009.
[10] S. Cahon, N. Melab, and E.-G. Talbi. Paradiseo: A framework for the
reusable design of parallel and distributed metaheuristics. IEEE Transac-
tions on Computers, 10:357–380, 2004.
[11] Carlos Cotta, Iván Dotú, Antonio J. Fernández, and Pascal Van Henten-
ryck. A Memetic Approach to Golomb Rulers, pages 252–261. Springer
Berlin Heidelberg, Berlin, Heidelberg, 2006.
[12] Carlos Cotta, Iván Dotú, Antonio J. Fernández, and Pascal Van Henten-
ryck. Local search-based hybrid algorithms for finding Golomb rulers. Con-
straints, 12(3):263–291, 2007.
[13] J. de Monet de Lamarck. Zoological Philosophy: An Exposition with regard
to the natural history of animals. Macmillan, 1914.
[14] A. Dollas, W.T. Rankin, and D. McCracken. New algorithms for Golomb
ruler derivation and proof of the 19 mark ruler. IEEE Transactions on
Information Theory, 44:379–382, 1998.
[15] I. Dotu and P. Van Hentenryck. A simple hybrid evolutionary algorithm
for finding golomb rulers. In 2005 IEEE Congress on Evolutionary Com-
putation, volume 3, pages 2018–2023 Vol. 3, Sept 2005.
[16] S. Droste, T. Jansen, and I. Wegener. Upper and lower bounds for ran-
domized search heuristics in black-box optimization. Theory of Computing
Systems, 39:525–544, 2006.
[17] L.J. Eshelman. Genetic algorithms. Handbook of Evolutionary Computa-
tion. IOP Publishing Ltd and Oxford University Press, 1997.
[18] B. Feeney. Determining optimum and near-optimum golomb rulers using
genetic algorithms. Master’s thesis, University College Cork, 2003.
[19] Thomas A Feo and Mauricio G.C Resende. A probabilistic heuristic for a
computationally difficult set covering problem. Operations Research Let-
ters, 8(2):67 – 71, 1989.
[20] B. Galinier and et al. A constrained-based approach to the golomb ruler
problem. In 3rd International workshop on integration of AI and OR tech-
niques, pages 562–570, 2001.
[21] Vanderschel Garey. In search of optimal 20, 21 and 22
mark golomb rulers. Technical report, GVANT project, 1999.
http://members.aol.com/golomb20/index.html.
[22] Fred Glover. Applications of integer programming future paths for integer
programming and links to artificial intelligence. Computers & Operations
Research, 13(5):533 – 549, 1986.
[23] D.E. Goldberg. Genetic algorithms in search, optimization and machine
learning. Addison-Wesley, 1989.
[24] D. Krzywicki, L Faber, K. Pietak, A. Byrski, and M. Kisiel-Dorohinicki.
Lightweight Distributed Component-Oriented Multi-Agent Simulation
Platform. In Webjørn Rekdalsbakken, Robin T. Bye, and Houxiang Zhang,
editors, ECMS, pages 469–476. European Council for Modelling and Sim-
ulation, 2013.
[25] Daniel Krzywicki, Lukasz Faber, Aleksander Byrski, and Marek Kisiel-
Dorohinicki. Computing agents for decision support systems. Future Gen-
eration Comp. Syst., 37:390–400, 2014.
[26] Y. Li, K. Zhao, X. Chu, and J. Liu. Speeding up k-means algorithm by
GPUs. IEEE Computer and Information Technology, pages 115–122, June
2010.
[27] C.H. Lin, C.H. Liu, L.S. Chien, and S.C. Chang. Accelerating pattern
matching using a novel parallel algorithm on GPUs. IEEE Transactions
on Computers, 62:1906–1916, oct 2013.
[28] T. Van Luong, N. Melab, and E. G. Talbi. Parallel hybrid evolutionary
algorithms on GPU. In IEEE Congress on Evolutionary Computation,
pages 1–8, July 2010.
[29] V. Luong, N. Melab, and E.-G. Talbi. GPU computing for parallel lo-
cal search metaheuristic algorithms. IEEE Transactions on Computers,
62:173–185, 2011.
[30] Rafael Martí, Manuel Laguna, and Fred Glover. Principles of scatter search.
European Journal of Operational Research, 169(2):359–372, 2006. Feature
Cluster on Scatter Search Methods for Optimization.
[31] D. Merrill and A. Grimshaw. High performance and scalable radix sort-
ing: A case study of implementing dynamic parallelism for gpu computing.
Parallel Processing Letters, 21:245–272, 2011.
[32] Christophe Meyer and Periklis A. Papakonstantinou. On the complexity
of constructing golomb rulers. Discrete Applied Mathematics, 157(4):738 –
748, 2009.
[33] Kamil Pietak and Marek Kisiel-Dorohinicki. Transactions on Computa-
tional Collective Intelligence X, chapter Agent-Based Framework Facilitat-
ing Component-Based Implementation of Distributed Computational In-
telligence Systems, pages 31–44. Springer Berlin Heidelberg, Berlin, Hei-
delberg, 2013.
[34] M. Pietron, P. Russek, and K. Wiatr. Accelerating SELECT WHERE
and SELECT JOIN queries on GPU. Journal of Computer Science AGH,
14:243–252, 2013.
[35] Marcin Pietron, Aleksander Byrski, and Marek Kisiel-Dorohinicki. GPGPU
for difficult black-box problems. Procedia Computer Science, 51:1023 –
1032, 2015. International Conference On Computational Science, ICCS
2015 Computational Science at the Gates of Nature.
[36] Marcin Pietron, Maciej Wielgosz, Dominik Zurek, Ernest Jamro, and Kaz-
imierz Wiatr. Comparison of GPU and FPGA Implementation of SVM
Algorithm for Fast Image Segmentation, pages 292–302. Springer Berlin
Heidelberg, Berlin, Heidelberg, 2013.
[37] J. Robinson and A. Bernstein. A class of binary recurrent codes with limited
error propagation. IEEE Transactions on Information Theory, 13(1):106–
113, January 1967.
[38] R. Schaefer, A. Byrski, and M. Smolka. The island model as a markov
dynamic system. International Journal of Applied Mathematics and Com-
puter Science, 22(4), 2012.
[39] J. B. Shearer. Some new optimum golomb rulers. IEEE Transactions on
Information Theory, 36(1):183–184, Jan 1990.
[40] Stephen W. Soliday, Abdollah Homaifar, and Gary L. Lebby. Genetic
algorithm approach to the search for golomb rulers. In Proceedings of the
6th International Conference on Genetic Algorithms, pages 528–535, San
Francisco, CA, USA, 1995. Morgan Kaufmann Publishers Inc.
[41] A.R Thompson, J.M. Moran, and G.W. Swenson. Interferometry and Syn-
thesis in Radio Astronomy 2nd. ed. Wiley, 2004.
[42] Shigeyoshi Tsutsui and Pierre Collet. Massively Parallel Evolutionary Com-
putation on GPGPUs. Springer Berlin Heidelberg, 2013.
[43] Sifa Zhang and Zhenming He. Implementation of parallel genetic algo-
rithm based on CUDA. In Zhihua Cai, Zhenhua Li, Zhuo Kang, and Yong
Liu, editors, Advances in Computation and Intelligence: 4th International
Symposium, ISICA 2009, Huangshi, China, October 23-25, 2009, Proceed-
ings, pages 24–30. Springer Berlin Heidelberg, Berlin, Heidelberg, 2009.