doi: 10.1016/j.procs.2015.05.249
GPGPU for Difficult Black-box Problems
Marcin Pietroń, Aleksander Byrski, Marek Kisiel-Dorohinicki
AGH University of Science and Technology
Faculty of Computer Science, Electronics and Telecommunications
Al. Mickiewicza 30, 30-962, Krakow, Poland
{pietron,olekb,doroh}@agh.edu.pl
Abstract
Difficult black-box problems arise in many scientific and industrial areas. In this paper, efficient
use of a hardware accelerator to implement dedicated solvers for such problems is discussed and
studied based on an example of Golomb Ruler problem. The actual solution of the problem is
shown based on evolutionary and memetic algorithms accelerated on GPGPU. The presented
results prove that GPGPU outperforms CPU in some memetic algorithms which can be used
as a part of hybrid algorithm of finding near optimal solutions of Golomb Ruler problem. The
presented research is a part of building heterogenous parallel algorithm for difficult black-box
Golomb Ruler problem.
Keywords: Evolutionary Computing, GPGPU computing, memetic computing
1 Introduction
Low Autocorrelation Binary Sequences, Job Shop, and the Golomb Ruler are well-known difficult combinatorial problems [4]. They are good examples of so-called black-box scenarios [10], which pose serious problems for solving software and demand the application of metaheuristic approaches in order to locate even sub-optimal solutions.
The most popular techniques used for such problems are dynamic programming, constraint programming, simulated annealing, tabu search, and evolutionary and memetic algorithms (see, e.g., [13], [22], [5], [8], [9]).
Hybrids of evolutionary and local search algorithms are very often used in solving NP-hard problems [6]. Parallel hardware platforms (e.g. graphics cards, multi-core processors) make it possible to exploit the data-flow structure of evolutionary and memetic algorithms (in contrast to other heuristics such as constraint or dynamic programming) and can help to improve the efficiency of solving difficult problems [11], enabling the development of dedicated software environments that take advantage of their specific features. Interesting possibilities arise when the use of dedicated hardware is considered, such as general-purpose graphics processing units (GPGPUs), which make it possible
to run thousands of threads in parallel; the peak performance of the most efficient high-performance graphics cards is over 1 TFLOPS. However, the programmer or designer must be aware of the memory hierarchy, which has a significant influence on algorithm design (see Section 2). A significant number of numerical and data mining algorithms have been efficiently implemented using GPGPUs [16], [15], [20], [19], [17].
Procedia Computer Science, Volume 51, 2015, Pages 1023–1032
ICCS 2015 International Conference On Computational Science
Selection and peer-review under responsibility of the Scientific Programme Committee of ICCS 2015
© The Authors. Published by Elsevier B.V.
In this work we study the possibility of parallelizing metaheuristics on the GPGPU platform, choosing a very hard combinatorial problem, namely the Optimal Golomb Ruler, as a case study. Examples of the use of Golomb Rulers can be found in information theory (error-correcting codes), in the selection of radio frequencies to reduce the effects of intermodulation interference in both terrestrial and extraterrestrial applications, in the design of phased arrays of radio antennas, and in one-dimensional synthesis arrays in astronomy.
In this paper we focus first on the GPGPU architecture and its parallel characteristics. Then we deal with the details of the GPGPU implementation of evolutionary solving of the Golomb ruler problem. Finally, the experimental results are shown and the paper is concluded.
2 GPGPU and multiprocessor computing
The architecture of a GPGPU card is depicted in Fig. 1. A GPGPU is built as a set of N multiprocessors with M cores each. Within a multiprocessor, the cores share a common instruction unit. Each multiprocessor also has dedicated on-chip memories, much faster than the global memory shared by all multiprocessors: read-only constant/texture memory and shared memory. GPGPU cards are massively parallel devices that run thousands of threads grouped into blocks with shared memory. A dedicated software architecture, CUDA, makes it possible to program GPGPUs using high-level languages such as C and C++ [1]. CUDA requires an NVIDIA GPGPU, e.g. a Fermi-class, GeForce 8XXX, Tesla or Quadro device. This technology provides three key mechanisms to parallelize
programs: thread group hierarchy, shared memories, and barrier synchronization. These mechanisms provide fine-grained parallelism nested within coarse-grained task parallelism. Creating optimized code is not trivial, and a thorough knowledge of the GPGPU architecture is necessary to do it effectively. The main aspects to consider are the usage of the memories, the efficient division of code into parallel threads, and thread communication. As mentioned earlier, the constant/texture, shared and local memories are specially optimized with regard to access time, so programmers should use them to speed up access to the data on which an algorithm operates. Another important concern is optimizing the synchronization and communication of threads. Synchronization of threads between blocks is much slower than within a block; it should be avoided where possible and, where necessary, implemented by running multiple kernels sequentially. Another important aspect is that recursive function calls are not allowed in CUDA kernels, since providing stack space for all active threads would require a substantial amount of memory.
Modern processors consist of two or more independent cores. This architecture enables multiple CPU instructions (add, move data, branch, etc.) to run at the same time. The cores are integrated onto a single integrated circuit. Manufacturers such as AMD and Intel have developed several multi-core processors (dual-core, quad-core, hexa-core, octa-core, etc.). The cores may or may not share caches, and they may implement message passing or shared-memory inter-core communication. The individual cores in multi-core systems may implement architectures such as vector processing, SIMD, or multi-threading. These techniques offer another level of parallelization (implicit in high-level languages, used by compilers). The performance gained by the use of a multi-core processor depends on the algorithms used and their
implementation. Multi-core processors are used for comparison with the GPGPU computing solution discussed in this paper, so experiments run on single-core and multi-core processors will be used to show the scalability of the evolutionary and memetic algorithms on the CPU. Note that in both cases (GPGPU and CPU computing) the code is optimized and appropriately modified in order to exploit all available features of each platform.

Figure 1: GPGPU architecture
There are many programming models and libraries for multi-core programming; the most popular are pthreads, OpenMP, Cilk++, TBB, etc. In our work OpenMP was used [3]: a software platform supporting multi-threaded, shared-memory parallel processing on multi-core architectures for the C, C++ and Fortran languages. With OpenMP, the programmer does not need to create threads or assign tasks to each thread; instead, the programmer inserts directives that assist the compiler in generating threads for the parallel processor platform.
3 Golomb Ruler black-box problem
The Golomb Ruler (GR) is a set of marks at integer positions along an imaginary ruler such that no two pairs of marks are the same distance apart. The number of marks on the ruler is its order, and the largest distance between two of its marks (the distance between the last and the first element) is its length. A Golomb Ruler is optimal if no shorter GR of the same order exists. Fig. 2 shows an optimal Golomb ruler with four marks. The process of creating a GR is easy, but finding an optimal GR (OGR, or rulers) of a specified order is computationally very challenging. For instance, the search for an optimal GR with 19 marks took approximately 36,200 CPU hours on a Sun Sparc workstation using a brute-force parallel search implementation. The largest order for which an optimal GR has been found to date is 27.
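The defining property can be stated compactly in code. The following sketch is our illustration (the function name and the length bound are assumptions, not the paper's code): it checks a sorted list of marks by histogramming all pairwise distances and rejecting any repeat.

```c
#include <stdbool.h>

#define RULER_MAX_LEN 1024  /* assumed upper bound on the ruler length */

/* Returns true iff all pairwise distances between the (sorted, distinct)
 * marks are distinct -- the Golomb property. */
bool is_golomb_ruler(const int *marks, int order) {
    int seen[RULER_MAX_LEN + 1] = {0};  /* histogram of distances */
    for (int i = 0; i < order; i++)
        for (int j = i + 1; j < order; j++) {
            int d = marks[j] - marks[i];
            if (d <= 0 || d > RULER_MAX_LEN || seen[d]++)
                return false;  /* repeated or out-of-range distance */
        }
    return true;
}
```

For example, the marks {0, 1, 4, 6} of Fig. 2 pass the check, while {0, 1, 2, 4} fail, since the distance 1 appears twice.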
Figure 2: Structure of a Golomb Ruler (marks at positions 0, 1, 4, 6; the pairwise distances 1, 2, 3, 4, 5, 6 are all distinct)
A well-known sequential solution of the OGR problem was proposed by Shearer [21]. It is based on a branch-and-bound algorithm with backtracking and generates optimal GRs up to order 16. Soliday [22] proposed an algorithm in which chromosomes are represented by integers encoding the lengths of n − 1 segments; the mutation operator is of two types: a change in a segment length or a permutation of the segment order. Two evaluation criteria are used: the overall length of the ruler and the number of repeated measurements.
The first hybrid evolutionary approaches were introduced by Feeney [12], combining genetic algorithms with a local search technique and Baldwinian or Lamarckian learning [7]. The proposed representation consists of an array of integers corresponding to the marks. The crossover operator is a random swap between two positions, with a sort procedure added at the end. The distance achieved from optimal rulers is between 6.8 and 20.3. Dotu and Van Hentenryck [9] created an evolutionary algorithm with Tabu Search in the mutation operator and single-point crossover. The algorithm uses a random strategy in the selection process to choose parents for breeding (distances from optimal for rulers with 12 to 16 marks are between 7.1 and 10.2). Cotta, Dotu and Van Hentenryck used the GRASP (greedy randomized adaptive search) method, scatter search, tabu search, clustering techniques and constraint programming; combining these techniques, a memetic algorithm was proposed. The distances to the optimum for OGRs of order 10 to 16 were between 1.6 and 6.2 [9]. Parallel solutions [14] and [2] were able to find optimal rulers of up to 26 marks in several months on thousands of computers.
4 Implementation of GPGPU for evolutionary solving of
Golomb Ruler
General-purpose graphics cards are commonly used as computing accelerators in many scientific problems. Image processing, data mining and numerical algorithms are the most popular domains in which GPGPUs have been used with success. Recently, computational intelligence and multi-objective optimization have also begun to make use of the graphics card architecture and its computing power. The adaptation of each algorithm must be preceded by a careful analysis concerning the data flow in the particular algorithm (data dependencies), the extraction of hidden parallelism and the appropriate mapping of data onto the device memory hierarchy. Such studies show that evolutionary algorithms are easier to adapt to parallel hardware accelerators than other metaheuristics such as dynamic programming or constraint programming: almost every step of an evolutionary algorithm can be parallelized (the crossover and mutation operators run on the whole population, etc.). The challenging problem in adapting evolutionary algorithms to the GPGPU is data mapping. Each problem can have its own representation and, moreover, the fitness function can be calculated in several ways; these choices have a strong influence on the data size and data usage of the algorithm.
The evolutionary algorithm can be realized on the host utilizing the GPGPU in a hybrid way: in the first step the initial population is created on the host, and then the population is sent to the GPGPU, which runs the crossover operator and the memetic algorithm (see Fig. 3).
Figure 3: The hybrid algorithm of solving the Golomb ruler (host-side initialization and selection; in each GPGPU block, a mutation operator followed by a reduction selecting the best solution; every N cycles the best solution is written to device memory)
Memetic algorithms can also be designed to allow parallelization of their parts. Our main goal was to find heuristics for solving the Golomb Ruler problem that can be accelerated on GPGPUs and to compare them with highly optimized counterparts implemented on the CPU. The CPU implementations are C-based, fully vectorized and adapted to multi-core execution. Approaches similar to Random Mutation Hill Climbing (RMHC) and Simulated Annealing were proposed with some specific modifications. In our implementation, the memetic algorithm for solving the Golomb Ruler problem was constructed to exploit the computational capabilities and memory bandwidth of the GPGPU as much as possible. The shared and local memory data optimization was done by minimizing the following parameters: the data space for storing genotypes and the number of shared memory bank conflicts (see Fig. 5 and Fig. 6).
Our implementation runs the mutation operator, the crossover operators and the memetic part of the evolutionary algorithm on the GPGPU. The crossover operator is implemented in two variants: single-point and double-point. In both, the curand library is used to choose random points in a representation, which are then used to copy genomes between two solutions in the case of crossover, or to generate random changes in the selected genome in the case of mutation (see Fig. 4). The first kernel is responsible for generating random numbers; each thread generates one random number from the interval [1, size of representation]. In the second kernel, each thread mutates or copies genomes between crossed-over representations. The last step computes the fitness based on the changes introduced by the genetic operators.
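The fitness used throughout this section can be sketched on the CPU side as follows. This is our reconstruction (the function name and the length bound are assumptions), assuming sorted marks: the fitness is the number of repeated measurements, so a value of zero means a valid Golomb ruler.

```c
#define FIT_MAX_LEN 1024  /* assumed bound on the largest distance */

/* Histogram all pairwise distances of a sorted ruler and count the excess
 * (repeated) measurements; 0 means the Golomb property holds. */
int ruler_fitness(const int *marks, int order) {
    int hist[FIT_MAX_LEN + 1] = {0};
    for (int i = 0; i < order; i++)
        for (int j = i + 1; j < order; j++)
            hist[marks[j] - marks[i]]++;   /* one entry per pair of marks */
    int repeated = 0;
    for (int d = 1; d <= FIT_MAX_LEN; d++)
        if (hist[d] > 1)
            repeated += hist[d] - 1;       /* each duplicate counts once */
    return repeated;
}
```

The GPU kernels avoid re-running this full double loop after every mutation by updating the histogram incrementally, as described below.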
Figure 4: The architecture of the crossover operator
The whole algorithm implemented on the GPGPU is described in Fig. 3. It exploits the GPGPU's computational resources as much as possible. The kernel is run with a number of blocks equal to the population size. Each block copies a separate solution from global memory, together with its histogram of all possible ruler differences. Next, all available threads in the block run the mutation operator on the solution stored in their shared memory. Each thread generates two random numbers in the mutation process: the first is the point in the Golomb ruler to be mutated, and the second is the new value at that position.
The algorithm needs (number of threads in block) · 4 · (size of ruler − 1) · 4 bytes of shared memory for storing the new and old differences between the chosen position and the other points. From these data the fitness value is computed for each mutated solution (see Fig. 5): the elements with even indexes in the table are the values added to or removed from the histogram after mutation, and the elements with odd indexes are the counts by which these histogram values are modified. The values needed for updating the histogram and computing the fitness are stored in a manner that minimizes bank conflicts (see Fig. 5). The fitness value and the index of the thread are written to shared memory for a reduction process (1024 · 2 · 4 bytes), which is then run to choose the best mutated solution (see Fig. 6). Two parallel reductions are run: the first finds the best fitness value and the second finds which solution represents the best temporary
ruler. Between iterations, the blocks can exchange information about the optimal solution via the temporary minimal ruler found in each block. This can be done by an atomic operation on a global memory variable (the atomicMin CUDA function). This process can help to narrow the domain of possible optimal solutions (a greedy strategy) in subsequent iterations. In the parallel multi-core implementation, OpenMP directives were used: the outer loop of the algorithm (responsible for iterating over solutions) was parallelized, and the OpenMP lock mechanism was used for updating the global temporary minimal ruler.
Figure 5: The histogram update computation.
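The histogram update of Fig. 5 can be sketched on the CPU as follows (our reconstruction; the names and the length bound are assumptions). When a single mark moves, only the distances involving that mark change, so the histogram is patched in place rather than rebuilt, and the fitness (the number of repeated measurements) is recomputed from it:

```c
#include <stdlib.h>  /* abs */

#define UPD_MAX_LEN 1024  /* assumed bound on the largest distance */

/* Patch the distance histogram after mark k moves to newpos (assumed to
 * differ from all other marks) and return the resulting fitness; 0 means
 * a valid Golomb ruler. marks[] itself is not modified here. */
int fitness_after_move(const int *marks, int order, int k, int newpos,
                       int *hist /* distance histogram, updated in place */) {
    for (int j = 0; j < order; j++) {
        if (j == k) continue;
        hist[abs(marks[k] - marks[j])]--;  /* retire old distances */
        hist[abs(newpos - marks[j])]++;    /* insert new distances */
    }
    int repeated = 0;                      /* count repeated measurements */
    for (int d = 1; d <= UPD_MAX_LEN; d++)
        if (hist[d] > 1)
            repeated += hist[d] - 1;
    return repeated;
}
```

Only 2·(order − 1) histogram cells are touched per mutation, instead of the order·(order − 1)/2 pair enumerations a full rebuild would need.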
Figure 6: Reduction process of finding the best solution and its position
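The halving scheme of Fig. 6 can be illustrated with a sequential CPU sketch (our illustration; on the GPU each index i is handled by one thread, with a barrier between strides). Fitness values and thread indexes travel together so that both reductions finish with the winner in slot 0; n is assumed to be a power of two:

```c
/* Tree reduction: at each stride, slot i keeps the better of slots i and
 * i+stride; after log2(n) passes the minimum fitness and the index of the
 * thread that produced it are in fitness[0] and index[0]. */
void reduce_best(int *fitness, int *index, int n) {
    for (int stride = n / 2; stride > 0; stride /= 2)
        for (int i = 0; i < stride; i++)   /* one thread per i on the GPU */
            if (fitness[i + stride] < fitness[i]) {
                fitness[i] = fitness[i + stride];
                index[i]   = index[i + stride];
            }
}
```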
Table 1: Execution times (in ms) of the algorithm for Golomb ruler 6 (80 iterations of the algorithm).
Population size (GR 6) GPGPU CPU (1 core) CPU (-O3) (1 core) 4-cores (-O3)
64 18 677 260 83
128 38 1363 556 170
256 94 2823 1103 380
512 255 5685 2202 730
Table 2: Execution times (in ms) of the algorithm for Golomb ruler 7 (80 iterations of the algorithm).
Population size (GR 7) GPGPU CPU (1 core) CPU (-O3) (1 core) 4-cores (-O3)
64 30 987 380 140
128 51 1850 870 280
5 Experimental results
The results presented in Table 1, Table 2 and Table 3 show the execution times of the presented algorithm. The algorithm run was a memetic one, hybridizing an evolutionary algorithm with a single-point random mutation hill climbing [18] local-search operator. Each iteration consisted of generating 512 new mutated solutions from a single solution, and every 20 cycles a one-point crossover was run. The algorithm was executed on the GPGPU, on a plain single-core CPU, and in vectorized, highly optimized single-core and multi-core CPU versions (compiled with the -O3 option). The experiments consisted in running a constant number of iterations of the main loop of the algorithm. The tables show that the GPGPU implementation of the presented algorithm significantly outperforms its CPU counterparts: the GPGPU version is more than ten times faster than the fully vectorized single-core version and also significantly faster than the optimized multi-core version. Table 4 describes the efficiency of finding Golomb Rulers of different lengths and their percentage deviation from the optimal ones achieved in the stated time. It shows that the presented implementation can be used for fast preliminary discovery of better Golomb Rulers, narrowing the domain for the further search for optimal GRs. The simulations were run on an NVIDIA Tesla M2090 and a QEMU Intel 64-rhel6 (2.4 GHz) machine. The above results were gathered once stable execution times were achieved, in approximately five to ten executions (the observed standard deviation of the average results was negligible).
Table 3: Execution times (in ms) of the algorithm for Golomb ruler 13 (80 iterations of the algorithm).
Population size (GR 13) GPGPU CPU (1 core) CPU (-O3) (1 core) 4-cores (-O3)
64 160 3001 1315 350
128 270 4508 2604 877
6 Conclusion
The described implementation of an algorithm for finding GRs shows that GPGPUs can be incorporated into the process of solving some difficult black-box problems, although it must be noted that GPGPUs can efficiently implement only some types of algorithms, namely those with an implicitly parallel structure. On the other hand, algorithms with strong dependencies in their data flows and without data re-use or data parallelism cannot be efficiently accelerated on a GPGPU (e.g. this is the case of the Tabu Search algorithm).

Table 4: Time of finding Golomb rulers by the hybrid GPGPU algorithm.
length of the ruler time deviation from optimal solution
4 1.5s optimal
5 3s optimal
6 13min optimal
7 40min 15%
13 1h 30-40%
14 1h 30-40%
15 1h 30-45%
16 1h 30-45%
As the main result of the presented research, an efficient implementation of a hybrid system solving the Golomb Ruler problem was presented, along with experimental results showing that our GPGPU implementation is roughly an order of magnitude faster than a highly optimized single-core CPU implementation and several times faster than the multi-core one. It must be added that the proposed algorithm cannot be used for finding optimal rulers (due to locally optimal solutions) but rather serves as an efficient way of narrowing the solution space. Further research will focus on incorporating a simulated annealing heuristic into our GPGPU solver and on building parallel hybrid GPGPU/CPU (OpenMP- and MPI-based) algorithms for other difficult problems, both benchmark and real-life ones. We will also aim at utilizing different technologies to construct dedicated parallel frameworks for the efficient utilization of clusters of GPGPUs (e.g. following the approaches presented in [23, 24]).
Acknowledgments
The research presented in the paper received partial support from AGH University of Science
and Technology statutory project (no. 11.11.230.124).
References
[1] CUDA framework. https://developer.nvidia.com/cuda-gpus.
[2] Ogr project. http://www.distributed.net/org.
[3] Openmp library. http://www.openmp.org.
[4] G. Bloom and S. Golomb. Applications of numbered undirected graphs. In Proceedings of the
IEEE, 65(4), pages 562–570, April 1977.
[5] C. Cotta. Local search based hybrid algorithms for finding golomb rulers. pages 263–291, 2007.
[6] C. Cotta, I. Dotu, A.J. Fernandez, and P. Van Hentenryck. A memetic approach to golomb rulers. Parallel Problem Solving From Nature IX, pages 255–261, 2006.
[7] J. de Monet de Lamarck. Zoological Philosophy: An Exposition with regard to the natural history
of animals. Macmillan, 1914.
[8] A. Dollas, W.T. Rankin, and D. McCracken. New algorithms for golomb ruler derivation and proof of the 19 mark ruler. IEEE Transactions on Information Theory, 44:379–382, 1998.
[9] I. Dotu and P. Van Hentenryck. A simple hybrid evolutionary algorithm for finding golomb rulers.
IEEE Congress on Evolutionary Computation, 3:2018–2023, 2005.
[10] S. Droste, T. Jansen, and I. Wegener. Upper and lower bounds for randomized search heuristics
in black-box optimization. Theory of Computing Systems, 39:525–544, 2006.
[11] L.J. Eshelman. Genetic algorithms. Handbook of Evolutionary Computation. IOP Publishing Ltd
and Oxford University Press, 1997.
[12] B. Feeney. Determining optimum and near-optimum golomb rulers using genetic algorithms.
Master’s thesis, University College Cork, 2003.
[13] B. Galinier and et al. A constrained-based approach to the golomb ruler problem. In 3rd Inter-
national workshop on integration of AI and OR techniques, pages 562–570, 2001.
[14] Vanderschel Garey. In search of optimal 20, 21 and 22 mark golomb rulers. Technical report,
GVANT project, 1999. http://members.aol.com/golomb20/index.html.
[15] Y. Li, K. Zhao, X. Chu, and J.Liu. Speeding up k-means algorithm by gpus. IEEE Computer and
Information Technology, pages 115–122, jun 2010.
[16] C.H. Lin, C.H. Liu, L.S. Chien, and S.C. Chang. Accelerating pattern matching using a novel
parallel algorithm on gpus. IEEE Transactions on Computers, 62:1906–1916, oct 2013.
[17] D. Merrill and A. Grimshaw. High performance and scalable radix sorting: A case study of
implementing dynamic parallelism for gpu computing. Parallel Processing Letters, 21:245–272,
2011.
[18] H. Mühlenbein. How genetic algorithms really work: I. mutation and hillclimbing. In Proc. of 2nd
Parallel Problem Solving from Nature Conference. 1992.
[19] M. Pietron, P. Russek, and K. Wiatr. Accelerating select where and select join queries on gpu.
Journal of Computer Science AGH, 14:243–252, 2013.
[20] M. Pietron, M. Wielgosz, D. Zurek, E. Jamro, and K. Wiatr. Comparison of gpu and fpga
implementation of svm algorithm for fast image segmentation. LNCS Springer-Verlag Heidelberg,
pages 292–302, 2013.
[21] J.B. Shearer. Some new optimum golomb rulers. IEEE Transactions on Information Theory, page 183, 1990.
[22] S. W. Soliday, A. Homaifar, and G.L. Lebby. Genetic algorithm approach to the search for golomb
rulers. IGGA, Ed. Morgan Kaufmann, pages 528–535, 1995.
[23] Wojciech Turek. Erlang as a high performance software agent platform. In Dariusz Barbucha,
Manh Thanh Le, Robert J. Howlett, and Lakhmi C. Jain, editors, Advanced Methods and Tech-
nologies for Agent and Multi-Agent Systems, Proceedings of the 7th KES Conference on Agent and
Multi-Agent Systems - Technologies and Applications (KES-AMSTA 2013), May 27-29, 2013, Hue
City, Vietnam, volume 252 of Frontiers in Artificial Intelligence and Applications, pages 21–29.
IOS Press, 2013.
[24] Wojciech Turek, Robert Marcjan, and Krzysztof Cetnarowicz. Agent-based mobile robots navi-
gation framework. In Vassil N. Alexandrov, G. Dick van Albada, Peter M. A. Sloot, and Jack
Dongarra, editors, Computational Science - ICCS 2006, 6th International Conference, Reading,
UK, May 28-31, 2006, Proceedings, Part III, volume 3993 of Lecture Notes in Computer Science,
pages 775–782. Springer, 2006.
GPGPU for Difficult Black-box Problems M. Pietro´n, A. Byrski, M. Kisiel-Dorohinicki
1032
... Mokhtar Essaid et al length (distance between the last and the first element). The Golomb ruler is optimal if no shorter ruler of the same order exists [62]. An evolutionary algorithm (EA) has been employed in [62] in order to solve the problem. ...
... The Golomb ruler is optimal if no shorter ruler of the same order exists [62]. An evolutionary algorithm (EA) has been employed in [62] in order to solve the problem. The GPU based implementation runs the crossover, the mutation, and the memetic part (random modified hill climbing (RMHC) and simulated annealing are used) in parallel. ...
... Each subswarm runs PSO separately. To further accelerate the execution time, the solution level has been also implemented to generate, evaluate solutions, or both in parallel [21,22,28,29,32,35,54,56,62,64,88,89]. As an example, in papers like [21] [22] [32], ACO algorithm have been implemented, where tour construction phase can be performed in parallel. ...
Article
Metaheuristics have been showing interesting results in solving hard optimization problems. However, they become limited in terms of effectiveness and runtime for high dimensional problems. Thanks to the independency of metaheuristics components, parallel computing appears as an attractive choice to reduce the execution time and to improve solution quality. By exploiting the increasing performance and programability of graphics processing units (GPUs) to this aim, GPU-based parallel metaheuristics have been implemented using different designs. Recent results in this area show that GPUs tend to be effective co-processors for leveraging complex optimization problems. In this survey, mechanisms involved in GPU programming for implementing parallel metaheuristics are presented and discussed through a study of relevant research papers. Metaheuristics can obtain satisfying results when solving optimization problems in a reasonable time. However, they suffer from the lack of scalability. Metaheuristics become limited ahead complex high-dimensional optimization problems. To overcome this limitation, GPU based parallel computing appears as a strong alternative. Thanks to GPUs, parallel metaheuristics achieved better results in terms of computation, and even solution quality.
... In consecutive publications we are trying to show how particular population-based techniques may further benefit from employing dedicated hardware like GPGPU or FPGA for delegating different parts of the computing in order to speed it up (e.g. [35]). ...
... It is to stress that we already have some substantial GPGPU-related results; the most relevant related to the realization of a Memetic Algorithm solving Optimal Golomb Ruler (OGR) problem [35]. It was shown that GPGPUs can be incorporated into the process of solving difficult black-box problems, although it is to note that GPGPU can be efficient in implementing only some parts of the considered algorithms. ...
... The main result of the research presented in [35] was the efficient implementation of a memetic algorithm leveraging GPGPU for a local search. We have shown experimentally that our implementation is about ten times faster than a highly optimized multicore CPU one. ...
Article
Full-text available
The research reported in the paper deals with difficult black-box problems solved by means of popular metaheuristic algorithms implemented on up-to-date parallel, multi-core, and many-core platforms. In consecutive publications we are trying to show how particular population-based techniques may further benefit from employing dedicated hardware like GPGPU or FPGA for delegating different parts of the computing in order to speed it up. The main contribution of this paper is an experimental study focused on profiling of different possibilities of implementation of Scatter Search algorithm, especially delegating some of its selected components to GPGPU. As a result, a concise know-how related to the implementation of a population-based metaheuristic similar to Scatter Search is presented using a difficult discrete optimization problem; namely, Golomb Ruler, as a benchmark.
... Important direction of the research is to adjust both algorithms to lower communication overhead between CPU and GPU which includes required data conversions. Also, the plan assumes the implementation of the hybrid concept for other difficult problems that can be solved using algorithms with parallel structure -some research for optimal Golomb ruler search has been already published in [17,16]. ...
Article
Full-text available
Memetic agent-based paradigm, which combines evolutionary computation and local search techniques in one of promising meta-heuristics for solving large and hard discrete problem such as Low Autocorrellation Binary Sequence (LABS) or optimal Golomb-ruler (OGR). In the paper as a follow-up of the previous research, a short concept of hybrid agent-based evolutionary systems platform, which spreads computations among CPU and GPU, is shortly introduced. The main part of the paper presents an efficient parallel GPU implementation of LABS local optimization strategy. As a means for comparison, speed-up between GPU implementation and CPU sequential and parallel versions are shown. This constitutes a promising step toward building hybrid platform that combines evolutionary meta-heuristics with highly efficient local optimization of chosen discrete problems.
... The work reported in this paper concentrates on the realization of genetic and evolutionary algorithms on the Parallella board. It is related to and extends our previous publications regarding the implementation of effective tools for running population-based computational intelligence systems [4], especially using the agent paradigm [5], [6] in both parallel and distributed [7], as well as heterogeneous environments [8]. ...
Conference Paper
Full-text available
Recent years have seen a growing trend towards the introduction of more advanced manycore processors. On the other hand, there is also a growing popularity for cheap, credit-card-sized, devices offering more and more advanced features and computational power. In this paper we evaluate Parallella -- a small board with the Epiphany manycore coprocessor consisting of sixteen MIMD cores connected by a mesh network-on-a-chip. Our tests are based on classical genetic algorithms. We discuss some possible optimizations and issues that arise from the architecture of the board. Although we achieve significant speed improvements, there are issues, such us the limited local memory size and slow memory access, that make the implementation of efficient code for Parallella difficult.
Article
This paper presents implementations of a few selected SQL operations using the CUDA programming framework on the GPU platform. Nowadays, the GPU's parallel architectures give a high speed-up on certain problems. Therefore, the number of non-graphical problems that can be run and sped-up on the GPU still increases. Especially, there has been a lot of research in data mining on GPUs. In many cases it proves the advantage of offloading processing from the CPU to the GPU. At the beginning of our project we chose the set of SELECT WHERE and SELECT JOIN instructions as the most common operations used in databases. We parallelized these SQL operations using three main mechanisms in CUDA: thread group hierarchy, shared memories, and barrier synchronization. Our results show that the implemented highly parallel SELECT WHERE and SELECT JOIN operations on the GPU platform can be significantly faster than the sequential one in a database system run on the CPU.
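A data-parallel SELECT WHERE is commonly realized on GPUs as stream compaction: map a predicate over all rows, prefix-sum the match flags to assign output slots, then scatter the matches. The sequential sketch below shows these three phases; it is an interpretation of the general pattern, not the paper's CUDA code:

```python
def select_where(rows, predicate):
    """SELECT ... WHERE via the GPU-style map/scan/scatter pattern:
    (1) every 'thread' evaluates the predicate on its row,
    (2) an exclusive prefix sum of the flags gives each match its slot,
    (3) matches are scattered into a compact output array."""
    flags = [1 if predicate(r) else 0 for r in rows]   # map
    offsets, total = [], 0
    for f in flags:                                    # exclusive scan
        offsets.append(total)
        total += f
    out = [None] * total
    for r, f, o in zip(rows, flags, offsets):          # scatter
        if f:
            out[o] = r
    return out
```

On a GPU, the map and scatter phases are embarrassingly parallel, and the scan is the classic work-efficient parallel prefix sum.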
Conference Paper
This paper presents preliminary implementation results of the SVM (Support Vector Machine) algorithm. SVM is a dedicated mathematical formula which allows us to extract selective objects from a picture and assign them to an appropriate class. Consequently, a black-and-white image reflecting occurrences of the desired feature is derived from the original picture fed into the classifier. This work is primarily focused on the FPGA and GPU implementation aspects of the algorithm as well as on comparison of the hardware and software performance. A human skin classifier was used as an example and implemented on an Intel Xeon E5645 (2.40 GHz), a Xilinx Virtex-5 LX220, and an Nvidia Tesla M2090. It is worth emphasizing that in the case of the FPGA implementation the critical hardware components were designed using HDL (Hardware Description Language), whereas the less demanding or standard ones, such as communication interfaces, FIFOs, and FSMs, were implemented in Impulse C. Such an approach allowed us both to cut design time and preserve high performance of the hardware classification module. In the case of the GPU implementation, the whole algorithm is implemented in CUDA.
Article
The Golomb ruler problem is a very hard combinatorial optimization problem that has been tackled with many different approaches, such as constraint programming (CP), local search (LS), and evolutionary algorithms (EAs), among other techniques. This paper describes several local search-based hybrid algorithms to find optimal or near-optimal Golomb rulers. These algorithms are based on both stochastic methods and systematic techniques. More specifically, the algorithms combine ideas from greedy randomized adaptive search procedures (GRASP), scatter search (SS), tabu search (TS), clustering techniques, and constraint programming (CP). Each new algorithm is, in essence, born from the conclusions extracted after the observation of the previous one. With these algorithms we are capable of solving large rulers with reasonable efficiency. In particular, we can now find optimal Golomb rulers for up to 16 marks. In addition, the paper also provides an empirical study of the fitness landscape of the problem with the aim of shedding some light on the question of what makes the Golomb ruler problem hard for certain classes of algorithms.
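The feasibility check at the core of the problem is simple to state: a set of marks forms a Golomb ruler exactly when all pairwise differences between marks are distinct, and an optimal ruler minimizes its length for a given number of marks. A minimal sketch (helper names are illustrative):

```python
from itertools import combinations

def is_golomb(marks):
    """A ruler is Golomb iff all pairwise mark differences are distinct."""
    diffs = [b - a for a, b in combinations(sorted(marks), 2)]
    return len(diffs) == len(set(diffs))

def ruler_length(marks):
    """Length of the ruler: distance between its extreme marks."""
    return max(marks) - min(marks)
```

Solvers such as those described above spend most of their effort navigating the space of mark placements while maintaining (or repairing) this distinct-differences property.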
Article
Graphics processing units (GPUs) have attracted a lot of attention due to their cost-effective and enormous power for massive data parallel computing. In this paper, we propose a novel parallel algorithm for exact pattern matching on GPUs. A traditional exact pattern matching algorithm matches multiple patterns simultaneously by traversing a special state machine called an Aho-Corasick machine. Considering the particular parallel architecture of GPUs, in this paper, we first propose an efficient state machine on which we perform very efficient parallel algorithms. Also, several techniques are introduced to do optimization on GPUs, including reducing global memory transactions of input buffer, reducing latency of transition table lookup, eliminating output table accesses, avoiding bank-conflict of shared memory, coalescing writes to global memory, and enhancing data transmission via peripheral component interconnect express. We evaluate the performance of the proposed algorithm using attack patterns from Snort V2.8 and input streams from DEFCON. The experimental results show that the proposed algorithm performed on NVIDIA GPUs achieves up to 143.16-Gbps throughput, 14.74 times faster than the Aho-Corasick algorithm implemented on a 3.06-GHz quad-core CPU with OpenMP. The library of the proposed algorithm is publicly accessible through Google Code.
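The Aho-Corasick machine the GPU algorithm traverses consists of a goto trie, failure links computed by BFS, and per-state output sets. The plain-Python sketch below illustrates that state machine and a single-pass search; it is not the paper's optimized GPU variant:

```python
from collections import deque

def build_ac(patterns):
    """Build an Aho-Corasick automaton as (goto, fail, out), where each
    state is an index: goto is a list of char->state dicts, fail the
    failure links, out the set of patterns recognized at each state."""
    goto, fail, out = [{}], [0], [set()]
    for p in patterns:                      # insert each pattern into the trie
        s = 0
        for c in p:
            if c not in goto[s]:
                goto.append({}); fail.append(0); out.append(set())
                goto[s][c] = len(goto) - 1
            s = goto[s][c]
        out[s].add(p)
    q = deque(goto[0].values())             # BFS to compute failure links
    while q:
        s = q.popleft()
        for c, t in goto[s].items():
            q.append(t)
            f = fail[s]
            while f and c not in goto[f]:   # follow fails until c matches
                f = fail[f]
            fail[t] = goto[f].get(c, 0)
            out[t] |= out[fail[t]]          # inherit matches from fail state
    return goto, fail, out

def search(text, automaton):
    """Yield (end_index, pattern) for every match in one pass over text."""
    goto, fail, out = automaton
    s = 0
    for i, c in enumerate(text):
        while s and c not in goto[s]:
            s = fail[s]
        s = goto[s].get(c, 0)
        for p in out[s]:
            yield (i, p)
```

The GPU optimizations listed in the abstract (coalesced reads, shared-memory transition tables, etc.) all target the memory traffic of exactly this transition-table walk.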
Article
Randomized search heuristics like local search, tabu search, simulated annealing, or all kinds of evolutionary algorithms have many applications. However, for most problems the best worst-case expected run times are achieved by more problem-specific algorithms. This raises the question about the limits of general randomized search heuristics. Here a framework called black-box optimization is developed. The essential issue is that the problem but not the problem instance is known to the algorithm which can collect information about the instance only by asking for the value of points in the search space. All known randomized search heuristics fit into this scenario. Lower bounds on the black-box complexity of problems are derived without complexity theoretical assumptions and are compared with upper bounds in this scenario.
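The black-box scenario can be made concrete with a heuristic that interacts with the problem instance only through fitness queries. Below, an illustrative (1+1) EA maximizes an arbitrary f over bit strings while counting every evaluation, which is precisely the cost measure that black-box complexity bounds:

```python
import random

def one_plus_one_ea(f, n, budget=10_000, seed=0):
    """Black-box optimization sketch: the algorithm knows nothing about f
    except the values it asks for, so every query is counted. Standard
    (1+1) EA with per-bit flip probability 1/n; maximizes f over {0,1}^n."""
    rng = random.Random(seed)
    x = [rng.randint(0, 1) for _ in range(n)]
    fx, queries = f(x), 1
    while queries < budget:
        y = [b ^ (rng.random() < 1.0 / n) for b in x]  # flip each bit w.p. 1/n
        fy, queries = f(y), queries + 1
        if fy >= fx:                                    # accept if not worse
            x, fx = y, fy
    return x, fx, queries

# OneMax played as a black box: the EA never sees the formula, only values
best, value, used = one_plus_one_ea(sum, n=30)
```

Lower bounds in the black-box framework apply to any such algorithm, regardless of its internal bookkeeping, because only the query sequence matters.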
Conference Paper
Cluster analysis plays a critical role in a wide variety of applications, but it now faces a computational challenge due to the continuously increasing data volume. Parallel computing is one of the most promising solutions to this challenge. In this paper, we target parallelizing k-Means, which is one of the most popular clustering algorithms, by using the widely available Graphics Processing Units (GPUs). Different from existing GPU-based k-Means algorithms, we observe that data dimensionality is an important factor that should be taken into consideration when parallelizing k-Means on GPUs. In particular, we use two different strategies for low-dimensional data sets and high-dimensional data sets respectively, in order to make the best use of the power of GPUs. For low-dimensional data sets, we exploit GPU on-chip registers to significantly decrease data access latency. For high-dimensional data sets, we design a novel algorithm which simulates matrix multiplication and exploits GPU on-chip registers and on-chip shared memory to achieve a high compute-to-memory-access ratio. As a result, our GPU-based k-Means algorithm is three to eight times faster than the best reported GPU-based algorithm.
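The "simulates matrix multiplication" idea for high-dimensional k-Means rests on the identity ||x - c||^2 = ||x||^2 + ||c||^2 - 2 x.c, which turns the distance computation into one dense matrix product. A NumPy sketch of this formulation (an interpretation of the abstract, not the paper's CUDA kernels):

```python
import numpy as np

def kmeans_assign(X, C):
    """Assign each point in X (n, d) to its nearest centroid in C (k, d).
    The dominant cost is the dense product X @ C.T, which is exactly the
    high-arithmetic-intensity operation GPUs execute efficiently."""
    x2 = (X ** 2).sum(axis=1)[:, None]   # ||x||^2, shape (n, 1)
    c2 = (C ** 2).sum(axis=1)[None, :]   # ||c||^2, shape (1, k)
    d2 = x2 + c2 - 2.0 * (X @ C.T)       # squared distances, shape (n, k)
    return d2.argmin(axis=1)

def kmeans_step(X, C):
    """One Lloyd iteration: assignment, then centroid update.
    Assumes no cluster ends up empty (a real solver must handle that)."""
    labels = kmeans_assign(X, C)
    new_C = np.array([X[labels == j].mean(axis=0) for j in range(len(C))])
    return new_C, labels
```

For low-dimensional data the per-point distance loop is cheap enough that, per the abstract, keeping coordinates in registers beats the GEMM formulation.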
Conference Paper
The problem of mobile robot navigation has received noticeable attention over the last few years. Several different approaches have been presented, each having major limitations. In this paper a new, agent-based solution to the problem of mobile robot navigation is proposed. It is based on a novel representation of the environment that divides it into a number of distinct regions and assigns autonomous software Space Agents to supervise them. Space Agents create a graph that represents a high-level structure of the entire environment. The graph is used as a virtual space that robot-controlling agents work in. The most important features of the approach are: path planning for multiple robots based on the most recent data available in the system, automated collision avoidance, simple localization of a "lost robot", and unrestricted scalability.