Figure 6 - uploaded by Aleksander Byrski
Source publication
Difficult black-box problems arise in many scientific and industrial areas. In this paper, the efficient use of a hardware accelerator to implement dedicated solvers for such problems is discussed and studied using the Golomb Ruler problem as an example. The actual solution of the problem is shown based on evolutionary and memetic algorithms accelerated...
Contexts in source publication
Context 1
... with odd indexes are the numbers of modifications of these values in the histogram. The values needed for updating the histogram and computing the fitness are stored so as to minimize bank conflicts (see Fig. 5). The fitness values and thread indexes are written to shared memory for a reduction process (1024 · 2 · 4 bytes) that selects the best mutated solution (see Fig. 6). Two parallel reductions are run: the first finds the best fitness value, and the second finds which solution represents the best temporary ruler. Between iterations, threads in different blocks can exchange information about the optimal solution through the temporary minimal ruler found in each block. This can be done with an atomic operation on a global memory variable (the CUDA atomicMin function). This process can help narrow the domain of possible optimal solutions (greedy algorithm) in subsequent iterations. In the parallel implementation for the multi-core environment, OpenMP directives were used. The outer loop of the algorithm (responsible for iterating over solutions) was parallelized. The OpenMP lock mechanism was used for updating the global temporary minimal ruler ...
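The reduction step described above can be illustrated with a small CPU-side simulation. This is only a sketch under stated assumptions: the function name `block_argmin` is hypothetical, lower fitness is assumed to be better, and the two reductions mentioned in the text (best value, then owning thread) are fused into a single pass over paired arrays for brevity. It is not the authors' kernel.

```python
# Size of the reduction buffer quoted in the text:
# 1024 threads x 2 values (fitness, thread index) x 4 bytes each.
REDUCTION_BUFFER_BYTES = 1024 * 2 * 4  # 8192 bytes


def block_argmin(fitness):
    """Simulate a shared-memory tree reduction over one block.

    Returns (best_fitness, thread_index). Lower fitness is assumed better.
    """
    n = len(fitness)
    # Pad to a power of two so every stride halves cleanly.
    size = 1
    while size < n:
        size *= 2
    values = list(fitness) + [float("inf")] * (size - n)
    indexes = list(range(n)) + [-1] * (size - n)

    stride = size // 2
    while stride > 0:
        # On a GPU, all iterations of this loop run as parallel threads,
        # followed by a barrier before the stride is halved.
        for tid in range(stride):
            other = tid + stride
            if values[other] < values[tid]:
                values[tid] = values[other]
                indexes[tid] = indexes[other]
        stride //= 2
    return values[0], indexes[0]
```

With `block_argmin([7, 3, 9, 2, 8])` the sketch returns the fitness 2 together with thread index 3. Combining the per-block results with a plain `min` would play the role that the text assigns to atomicMin on a global variable.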
Context 2
... algorithms can also be designed so that parts of them can be parallelized. Our main goal was to find heuristics for solving the Golomb Ruler problem that can be accelerated on GPGPUs and to compare them with highly optimized counterparts implemented on the CPU. The CPU implementations are C-based, fully vectorized, and multi-core adapted. Approaches similar to Random Hill Climbing (RMHC) and Simulated Annealing were proposed with some specific modifications. In our implementation, the memetic algorithm for solving the Golomb Ruler problem was constructed to exploit the computational capabilities and memory bandwidth of the GPGPU as much as possible. Shared and local memory usage was optimized by minimizing the following parameters: the data space for storing genotypes and the number of shared memory bank conflicts (see Fig. 5 and Fig. 6). Our implementation enables running the mutation and crossover operators and the memetic part of the evolutionary algorithm on the GPGPU. The crossover operators are implemented in two ways, as single-point and double-point versions. In both operators, the curand library is used to choose random points in a single representation; genomes are then copied between two solutions in the case of crossover, and random changes are generated in the selected genome in the case of mutation (see Fig. 4). The first kernel is responsible for generating random numbers; each thread generates one random number in the interval [1, size of representation]. In the second kernel, each thread mutates or copies genomes between crossed-over representations. The last step computes the fitness based on the changes made by the genetic operators. The whole algorithm implemented on the GPGPU is described in Fig. 3. It exploits the GPGPU's computational resources as much as possible. The kernel is run on a number of blocks equal to the population size. Each block copies a separate solution from global memory together with its histogram of all possible ruler differences.
Next, all available threads in the block run the mutation operator on the solution stored in their shared memory. Each thread generates two random numbers in the mutation process: the first is the point in the Golomb ruler to be mutated, and the second is the new value at that position. The algorithm needs (number of threads in block) · 4 · (size of ruler − 1) · 4 bytes of shared memory to store the new and old differences between the chosen position and the other points. From this data, the fitness value is computed for each mutated solution (see Fig. 5); the elements with even indexes in the table are the values added to or removed from the histogram after mutation ...
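To make the histogram-based mutation step more concrete, the following CPU-side sketch shows how mutating one mark only touches the old and new differences between that mark and the others. All function names are hypothetical, and the fitness used here (the number of repeated differences, zero for a valid Golomb ruler) is an assumption, since the excerpt does not define the fitness function. The last helper simply evaluates the shared-memory formula quoted in the text.

```python
from collections import Counter


def difference_histogram(ruler):
    """Count how often each pairwise distance occurs among the marks."""
    hist = Counter()
    for i in range(len(ruler)):
        for j in range(i + 1, len(ruler)):
            hist[abs(ruler[j] - ruler[i])] += 1
    return hist


def violations(hist):
    """Assumed fitness: number of repeated distances (0 = valid Golomb ruler)."""
    return sum(count - 1 for count in hist.values() if count > 1)


def mutate_with_update(ruler, hist, position, new_value):
    """Mutate one mark and update the histogram incrementally.

    Only (size - 1) old and (size - 1) new differences are touched,
    instead of rebuilding the histogram from all pairs.
    """
    new_hist = Counter(hist)
    for k, mark in enumerate(ruler):
        if k == position:
            continue
        new_hist[abs(ruler[position] - mark)] -= 1  # old difference removed
        new_hist[abs(new_value - mark)] += 1        # new difference added
    new_ruler = list(ruler)
    new_ruler[position] = new_value
    return new_ruler, +new_hist  # unary plus drops zero counts


def mutation_buffer_bytes(threads_per_block, ruler_size):
    """Shared-memory size from the text: threads * 4 * (size - 1) * 4 bytes."""
    return threads_per_block * 4 * (ruler_size - 1) * 4
```

For example, the ruler `[0, 1, 4, 6]` has six distinct differences and therefore no violations; moving the mark at position 2 to the value 2 yields `[0, 1, 2, 6]`, where the difference 1 occurs twice. With 256 threads per block and a ruler of 10 marks, the formula gives 36864 bytes.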
Similar publications
Adjuvant properties of bacterial cell wall components like MPLA (monophosphoryl lipid A) are well described and have gained FDA approval for use in vaccines such as Cervarix. MPLA is the product of chemically modified lipooligosaccharide (LOS), altered to diminish toxic proinflammatory effects while retaining adequate immunogenicity. Despite the vi...
Ant colony optimization (ACO) algorithms have proved to be able to adapt for solving dynamic optimization problems (DOPs). The integration of local search algorithms has also proved to significantly improve the output of ACO algorithms. However, almost all previous works consider stationary environments. In this paper, the MAX-MIN Ant System, one o...
Citations
... length (distance between the last and the first element). The Golomb ruler is optimal if no shorter ruler of the same order exists [62]. An evolutionary algorithm (EA) has been employed in [62] in order to solve the problem. ...
... The Golomb ruler is optimal if no shorter ruler of the same order exists [62]. An evolutionary algorithm (EA) has been employed in [62] in order to solve the problem. The GPU based implementation runs the crossover, the mutation, and the memetic part (random modified hill climbing (RMHC) and simulated annealing are used) in parallel. ...
... Each subswarm runs PSO separately. To further reduce the execution time, the solution level has also been implemented to generate solutions, evaluate them, or both in parallel [21,22,28,29,32,35,54,56,62,64,88,89]. As an example, in papers such as [21], [22], and [32], ACO algorithms have been implemented in which the tour construction phase can be performed in parallel. ...
Metaheuristics have shown interesting results in solving hard optimization problems. However, they become limited in terms of effectiveness and runtime for high-dimensional problems. Thanks to the independence of metaheuristic components, parallel computing appears to be an attractive choice for reducing the execution time and improving solution quality. By exploiting the increasing performance and programmability of graphics processing units (GPUs) to this end, GPU-based parallel metaheuristics have been implemented using different designs. Recent results in this area show that GPUs tend to be effective co-processors for tackling complex optimization problems. In this survey, mechanisms involved in GPU programming for implementing parallel metaheuristics are presented and discussed through a study of relevant research papers.
Metaheuristics can obtain satisfying results in a reasonable time when solving optimization problems. However, they suffer from a lack of scalability and become limited when facing complex high-dimensional optimization problems. To overcome this limitation, GPU-based parallel computing appears to be a strong alternative. Thanks to GPUs, parallel metaheuristics have achieved better results in terms of computation time and even solution quality.
... An important direction of the research is to adjust both algorithms to lower the communication overhead between CPU and GPU, which includes the required data conversions. The plan also assumes the implementation of the hybrid concept for other difficult problems that can be solved using algorithms with a parallel structure; some research on optimal Golomb ruler search has already been published in [17,16]. ...
The memetic agent-based paradigm, which combines evolutionary computation and local search techniques, is one of the promising meta-heuristics for solving large and hard discrete problems such as Low Autocorrelation Binary Sequence (LABS) or optimal Golomb ruler (OGR). In this paper, as a follow-up to previous research, the concept of a hybrid agent-based evolutionary systems platform, which spreads computations among CPU and GPU, is briefly introduced. The main part of the paper presents an efficient parallel GPU implementation of a LABS local optimization strategy. For comparison, the speed-ups of the GPU implementation over the sequential and parallel CPU versions are shown. This constitutes a promising step toward building a hybrid platform that combines evolutionary meta-heuristics with highly efficient local optimization of chosen discrete problems.
... The work reported in this paper concentrates on the realization of genetic and evolutionary algorithms on the Parallella board. It is related to and extends our previous publications regarding the implementation of effective tools for running population-based computational intelligence systems [4], especially using the agent paradigm [5], [6] in both parallel and distributed [7], as well as heterogeneous environments [8]. ...
Recent years have seen a growing trend towards the introduction of more advanced manycore processors. At the same time, cheap, credit-card-sized devices offering increasingly advanced features and computational power are growing in popularity. In this paper we evaluate Parallella, a small board with the Epiphany manycore coprocessor consisting of sixteen MIMD cores connected by a mesh network-on-a-chip. Our tests are based on classical genetic algorithms. We discuss some possible optimizations and issues that arise from the architecture of the board. Although we achieve significant speed improvements, there are issues, such as the limited local memory size and slow memory access, that make the implementation of efficient code for Parallella difficult.
... In consecutive publications we are trying to show how particular population-based techniques may further benefit from employing dedicated hardware like GPGPU or FPGA for delegating different parts of the computing in order to speed it up (e.g. [35]). ...
... It should be stressed that we already have some substantial GPGPU-related results, the most relevant being the realization of a Memetic Algorithm solving the Optimal Golomb Ruler (OGR) problem [35]. It was shown that GPGPUs can be incorporated into the process of solving difficult black-box problems, although it should be noted that a GPGPU can be efficient in implementing only some parts of the considered algorithms. ...
... The main result of the research presented in [35] was the efficient implementation of a memetic algorithm leveraging GPGPU for a local search. We have shown experimentally that our implementation is about ten times faster than a highly optimized multicore CPU one. ...
The research reported in the paper deals with difficult black-box problems solved by means of popular metaheuristic algorithms implemented on up-to-date parallel, multi-core, and many-core platforms. In consecutive publications we are trying to show how particular population-based techniques may further benefit from employing dedicated hardware like GPGPU or FPGA for delegating different parts of the computing in order to speed it up. The main contribution of this paper is an experimental study focused on profiling of different possibilities of implementation of Scatter Search algorithm, especially delegating some of its selected components to GPGPU. As a result, a concise know-how related to the implementation of a population-based metaheuristic similar to Scatter Search is presented using a difficult discrete optimization problem; namely, Golomb Ruler, as a benchmark.