Figure 1. The architecture of a GPGPU card (available under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International license).
Source publication
Difficult black-box problems arise in many scientific and industrial areas. In this paper, the efficient use of a hardware accelerator to implement dedicated solvers for such problems is discussed and studied based on the example of the Golomb Ruler problem. The actual solution of the problem is shown based on evolutionary and memetic algorithms accelerated...
Context in source publication
Context 1
... memory hierarchy, which has a significant influence on the algorithm (see Section 2). A significant number of numerical and data-mining algorithms have been efficiently implemented using GPGPUs [16], [15], [20], [19], [17]. In this work we study the ability to parallelize metaheuristics on the GPGPU platform, choosing a very hard combinatorial problem, the Optimal Golomb Ruler, as a case study. Examples of the use of Golomb Rulers can be found in Information Theory in relation to error-correcting codes, in the selection of radio frequencies to reduce the effects of intermodulation interference in both terrestrial and extraterrestrial applications, in the design of phased arrays of radio antennas, and in one-dimensional synthesis arrays in astronomy. In this paper we focus first on the GPGPU architecture and its parallel characteristics. We then deal with the details of the GPGPU implementation of evolutionary solving of the Golomb ruler problem. Finally, the experimental results are shown and the paper is concluded.

The architecture of a GPGPU card is depicted in Fig. 1. A GPGPU is constructed as a structure of N multiprocessors with M cores each. Within a multiprocessor, the cores share an Instruction Unit. Multiprocessors have dedicated memory chips which are much faster than the global memory shared by all multiprocessors: the read-only constant/texture memory and the shared memory. GPGPU cards are built as massively parallel devices, enabling thousands of parallel threads to run, grouped into blocks with shared memory. A dedicated software architecture, CUDA, makes it possible to program GPGPUs using high-level languages such as C and C++ [1]. CUDA requires an NVIDIA GPGPU such as Fermi, GeForce 8XXX, Tesla, or Quadro. This technology provides three key mechanisms for parallelizing programs: the thread group hierarchy, shared memories, and barrier synchronization. These mechanisms provide fine-grained parallelism nested within coarse-grained task parallelism.

Creating optimized code is not trivial, and thorough knowledge of the GPGPU architecture is necessary to do it effectively. The main aspects to consider are the usage of the memories, the efficient division of code into parallel threads, and thread communication. As mentioned earlier, the constant/texture, shared, and local memories are specially optimized with regard to access time, so programmers should use them optimally to speed up access to the data on which an algorithm operates. Another important task is optimizing the synchronization and communication of the threads. Synchronization of threads between blocks is much slower than within a block; if it is not necessary, it should be avoided, and if it is necessary, it should be realized by running multiple kernels sequentially. Another important aspect is the fact that recursive function calls are not allowed in CUDA kernels, since providing stack space for all the active threads would require substantial amounts of memory.
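To make these mechanisms concrete, below is a minimal CUDA sketch (our illustration, not code from the paper) applied to the Golomb ruler case study: each block evaluates one candidate ruler, the candidate is staged in fast shared memory, and __syncthreads() provides the barrier. All names and sizes (ORDER, NUM_CANDIDATES, evaluate_kernel) are hypothetical.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define ORDER 8            // marks per ruler (assumed size)
#define NUM_CANDIDATES 128 // candidate rulers evaluated in parallel

// Counts clashing pairs of differences; a valid Golomb ruler scores 0.
// Each clash is counted from both sides, which is fine for ranking.
__global__ void evaluate_kernel(const int *rulers, int *conflicts) {
    __shared__ int marks[ORDER];                 // fast on-chip copy
    int tid = threadIdx.x;
    marks[tid] = rulers[blockIdx.x * ORDER + tid];
    __syncthreads();                             // barrier: copy complete

    int local = 0;
    for (int i = tid + 1; i < ORDER; ++i) {      // pairs starting at tid
        int d = marks[i] - marks[tid];
        for (int a = 0; a < ORDER; ++a)
            for (int b = a + 1; b < ORDER; ++b)
                if (!(a == tid && b == i) && marks[b] - marks[a] == d)
                    ++local;
    }
    atomicAdd(&conflicts[blockIdx.x], local);
}

int main() {
    int h_rulers[NUM_CANDIDATES * ORDER];
    for (int i = 0; i < NUM_CANDIDATES * ORDER; ++i)   // placeholder data
        h_rulers[i] = (i % ORDER) * (i % ORDER + 1) / 2;

    int *d_rulers, *d_conflicts;
    cudaMalloc(&d_rulers, sizeof(h_rulers));
    cudaMalloc(&d_conflicts, NUM_CANDIDATES * sizeof(int));
    cudaMemcpy(d_rulers, h_rulers, sizeof(h_rulers), cudaMemcpyHostToDevice);
    cudaMemset(d_conflicts, 0, NUM_CANDIDATES * sizeof(int));

    evaluate_kernel<<<NUM_CANDIDATES, ORDER>>>(d_rulers, d_conflicts);

    int h_conflicts[NUM_CANDIDATES];
    cudaMemcpy(h_conflicts, d_conflicts, sizeof(h_conflicts),
               cudaMemcpyDeviceToHost);
    printf("conflicts of candidate 0: %d\n", h_conflicts[0]);
    cudaFree(d_rulers);
    cudaFree(d_conflicts);
    return 0;
}
```

Note that all synchronization here is within a block, which matches the guidance above: the only inter-block interaction is the independent atomicAdd per candidate. A real implementation would also use block sizes that fill whole warps rather than the illustrative 8 threads.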
Modern processors consist of two or more independent central processing units. This architecture enables multiple CPU instructions (add, move data, branch, etc.) to run at the same time. The cores are integrated into a single integrated circuit. The manufacturers AMD and Intel have developed several multi-core processors (dual-core, quad-core, hexa-core, octa-core, etc.). The cores may or may not share caches, and they may implement message passing or shared-memory inter-core communication.

The single cores in multi-core systems may implement architectures such as vector processing, SIMD, or multi-threading. These techniques offer another aspect of parallelization (implicit in high-level languages and exploited by compilers). The performance gained by the use of a multi-core processor depends on the algorithms used and their implementation. Multi-core processors are used for comparison with the GPGPU computing solution discussed in this paper, so experiments were run on single-core and ...
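For comparison, a multi-threaded CPU baseline of the same evaluation might look like the plain C++ sketch below. This is our own illustration of the kind of reference implementation such experiments use, not the authors' code; the conflicts function, the static work split, and all constants are assumptions.

```cuda
#include <algorithm>
#include <cstdio>
#include <thread>
#include <vector>

// Counts ordered pairs of measuring pairs with equal differences;
// a valid Golomb ruler scores 0 (mirrors the GPU sketch above).
static int conflicts(const int *m, int n) {
    int c = 0;
    for (int a = 0; a < n; ++a)
        for (int b = a + 1; b < n; ++b)
            for (int p = 0; p < n; ++p)
                for (int q = p + 1; q < n; ++q)
                    if ((p != a || q != b) && m[q] - m[p] == m[b] - m[a])
                        ++c;
    return c;
}

int main() {
    const int order = 8, candidates = 128;
    std::vector<int> rulers(candidates * order), result(candidates);
    for (int i = 0; i < candidates * order; ++i)       // placeholder data
        rulers[i] = (i % order) * (i % order + 1) / 2;

    int cores = (int)std::max(1u, std::thread::hardware_concurrency());
    std::vector<std::thread> pool;
    for (int t = 0; t < cores; ++t)
        pool.emplace_back([&, t] {       // static interleaved work split
            for (int i = t; i < candidates; i += cores)
                result[i] = conflicts(&rulers[i * order], order);
        });
    for (auto &th : pool) th.join();
    std::printf("conflicts of candidate 0: %d\n", result[0]);
    return 0;
}
```

Each thread writes to disjoint indices of the result vector, so no locking is needed; this keeps the baseline as close as possible to the embarrassingly parallel GPU evaluation.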
Similar publications
Adjuvant properties of bacterial cell wall components like MPLA (monophosphoryl lipid A) are well described and have gained FDA approval for use in vaccines such as Cervarix. MPLA is the product of chemically modified lipooligosaccharide (LOS), altered to diminish toxic proinflammatory effects while retaining adequate immunogenicity. Despite the vi...
Ant colony optimization (ACO) algorithms have proved to be able to adapt for solving dynamic optimization problems (DOPs). The integration of local search algorithms has also proved to significantly improve the output of ACO algorithms. However, almost all previous works consider stationary environments. In this paper, the MAX-MIN Ant System, one o...
Citations
... length (the distance between the last and the first element). The Golomb ruler is optimal if no shorter ruler of the same order exists [62]. An evolutionary algorithm (EA) has been employed in [62] in order to solve the problem. ...
... The Golomb ruler is optimal if no shorter ruler of the same order exists [62]. An evolutionary algorithm (EA) has been employed in [62] in order to solve the problem. The GPU-based implementation runs the crossover, the mutation, and the memetic part (random mutation hill climbing (RMHC) and simulated annealing are used) in parallel. ...
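As an illustration of running an EA variation operator for the whole population in parallel, here is a hypothetical CUDA mutation kernel with one thread per individual. The representation (strictly increasing integer marks with the first mark fixed at 0) and all names are our assumptions, not the implementation cited above.

```cuda
#include <curand_kernel.h>

#define POP 256   // population size (assumed)
#define ORDER 8   // marks per ruler (assumed)

// One thread per individual: perturb one randomly chosen mark, the
// elementary move an RMHC-style local search would also use.
__global__ void mutate(int *population, unsigned long long seed) {
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id >= POP) return;

    curandState st;
    curand_init(seed, id, 0, &st);            // per-thread RNG stream

    int *ruler = &population[id * ORDER];
    int gene = 1 + curand(&st) % (ORDER - 1); // never the fixed mark 0
    int delta = 1 + curand(&st) % 4;          // small random shift
    ruler[gene] += (curand(&st) & 1) ? delta : -delta;
    if (ruler[gene] <= ruler[gene - 1])       // keep marks increasing
        ruler[gene] = ruler[gene - 1] + 1;    // (successor check omitted)
}
```

A launch such as mutate<<<(POP + 127) / 128, 128>>>(d_pop, seed) mutates the entire population at once; a crossover kernel would follow the same one-thread-per-offspring pattern.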
... Each subswarm runs PSO separately. To further accelerate execution, solution-level parallelism has also been implemented, to generate solutions, evaluate them, or both in parallel [21,22,28,29,32,35,54,56,62,64,88,89]. For example, in papers such as [21], [22], [32], ACO algorithms have been implemented in which the tour-construction phase can be performed in parallel. ...
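The solution-level design mentioned above can be sketched as follows. This is again an assumption-laden illustration rather than code from the cited papers: each thread independently generates one random ruler and evaluates it, so construction and evaluation both run in parallel. The caller is assumed to initialize *best_len to a large value (e.g. INT_MAX) before a launch such as generate_and_evaluate<<<1024, 256>>>(d_best, seed).

```cuda
#include <curand_kernel.h>

#define ORDER 8   // marks per ruler (assumed, as in the earlier sketches)

// Each thread generates one random increasing ruler and evaluates it;
// generation and evaluation both happen in parallel on the device.
__global__ void generate_and_evaluate(int *best_len, unsigned long long seed) {
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    curandState st;
    curand_init(seed, id, 0, &st);       // independent per-thread stream

    int marks[ORDER];
    marks[0] = 0;
    for (int i = 1; i < ORDER; ++i)      // random strictly increasing marks
        marks[i] = marks[i - 1] + 1 + curand(&st) % 8;

    bool golomb = true;                  // are all differences distinct?
    for (int a = 0; a < ORDER && golomb; ++a)
        for (int b = a + 1; b < ORDER && golomb; ++b)
            for (int p = a; p < ORDER && golomb; ++p)
                for (int q = p + 1; q < ORDER; ++q)
                    if ((p != a || q != b) &&
                        marks[q] - marks[p] == marks[b] - marks[a]) {
                        golomb = false;
                        break;
                    }

    if (golomb)                          // keep the shortest valid ruler
        atomicMin(best_len, marks[ORDER - 1]);
}
```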
Metaheuristics have been showing interesting results in solving hard optimization problems. However, they become limited in terms of effectiveness and runtime for high-dimensional problems. Thanks to the independence of metaheuristic components, parallel computing appears as an attractive choice for reducing execution time and improving solution quality. By exploiting the increasing performance and programmability of graphics processing units (GPUs) to this end, GPU-based parallel metaheuristics have been implemented using different designs. Recent results in this area show that GPUs tend to be effective co-processors for tackling complex optimization problems. In this survey, the mechanisms involved in GPU programming for implementing parallel metaheuristics are presented and discussed through a study of relevant research papers.
Metaheuristics can obtain satisfying results when solving optimization problems in a reasonable time. However, they suffer from a lack of scalability and become limited when faced with complex, high-dimensional optimization problems. To overcome this limitation, GPU-based parallel computing appears as a strong alternative. Thanks to GPUs, parallel metaheuristics have achieved better results in terms of computation time and even solution quality.
... An important direction of the research is to adjust both algorithms to lower the communication overhead between CPU and GPU, which includes the required data conversions. The plan also assumes implementing the hybrid concept for other difficult problems that can be solved using algorithms with a parallel structure; some research on optimal Golomb ruler search has already been published in [17,16]. ...
The memetic agent-based paradigm, which combines evolutionary computation and local search techniques, is one of the promising meta-heuristics for solving large and hard discrete problems such as Low Autocorrelation Binary Sequence (LABS) or the optimal Golomb ruler (OGR). In this paper, as a follow-up to previous research, the concept of a hybrid agent-based evolutionary systems platform, which spreads computations among the CPU and GPU, is briefly introduced. The main part of the paper presents an efficient parallel GPU implementation of the LABS local optimization strategy. As a means of comparison, speed-ups of the GPU implementation over sequential and parallel CPU versions are shown. This constitutes a promising step toward building a hybrid platform that combines evolutionary meta-heuristics with highly efficient local optimization of chosen discrete problems.
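A common way to parallelize a LABS local-search step on a GPU, sketched below under our own assumptions (not necessarily the strategy of the paper above), is to evaluate the whole single-flip neighborhood at once: thread k computes the LABS energy of the sequence with bit k flipped, and the host then picks the best move.

```cuda
// s holds +1/-1 values; the LABS energy is E = sum_k C_k^2 with
// C_k = sum_i s_i * s_{i+k}. Thread `flip` evaluates the sequence
// with position `flip` negated, without modifying s itself.
__global__ void labs_flip_energies(const int *s, long long *energy, int n) {
    int flip = blockIdx.x * blockDim.x + threadIdx.x;
    if (flip >= n) return;

    long long e = 0;
    for (int k = 1; k < n; ++k) {        // autocorrelation at every lag
        long long c = 0;
        for (int i = 0; i + k < n; ++i) {
            int si = (i == flip) ? -s[i] : s[i];
            int sk = (i + k == flip) ? -s[i + k] : s[i + k];
            c += si * sk;
        }
        e += c * c;
    }
    energy[flip] = e;                    // energy after flipping `flip`
}
```

Launched as labs_flip_energies<<<(n + 255) / 256, 256>>>(d_s, d_energy, n), this evaluates all n neighbors in one pass; the host copies the energies back, takes the argmin, applies that flip, and repeats, giving one steepest-descent step of the local search.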
... The work reported in this paper concentrates on the realization of genetic and evolutionary algorithms on the Parallella board. It is related to and extends our previous publications regarding the implementation of effective tools for running population-based computational intelligence systems [4], especially ones using the agent paradigm [5], [6], in parallel and distributed [7] as well as heterogeneous [8] environments. ...
Recent years have seen a growing trend towards the introduction of more advanced manycore processors. At the same time, there is a growing popularity of cheap, credit-card-sized devices offering more and more advanced features and computational power. In this paper we evaluate Parallella, a small board with the Epiphany manycore coprocessor consisting of sixteen MIMD cores connected by a mesh network-on-chip. Our tests are based on classical genetic algorithms. We discuss some possible optimizations and issues that arise from the architecture of the board. Although we achieve significant speed improvements, there are issues, such as the limited local memory size and slow memory access, that make the implementation of efficient code for Parallella difficult.
... In consecutive publications we are trying to show how particular population-based techniques may further benefit from employing dedicated hardware such as GPGPUs or FPGAs to delegate different parts of the computation in order to speed it up (e.g. [35]). ...
... It should be stressed that we already have some substantial GPGPU-related results, the most relevant being the realization of a Memetic Algorithm solving the Optimal Golomb Ruler (OGR) problem [35]. It was shown that GPGPUs can be incorporated into the process of solving difficult black-box problems, although it should be noted that a GPGPU can efficiently implement only some parts of the considered algorithms. ...
... The main result of the research presented in [35] was the efficient implementation of a memetic algorithm leveraging a GPGPU for local search. We showed experimentally that our implementation is about ten times faster than a highly optimized multi-core CPU one. ...
The research reported in the paper deals with difficult black-box problems solved by means of popular metaheuristic algorithms implemented on up-to-date parallel, multi-core, and many-core platforms. In consecutive publications we are trying to show how particular population-based techniques may further benefit from employing dedicated hardware like GPGPU or FPGA for delegating different parts of the computing in order to speed it up. The main contribution of this paper is an experimental study focused on profiling different possibilities of implementing the Scatter Search algorithm, especially delegating some of its selected components to a GPGPU. As a result, concise know-how related to the implementation of a population-based metaheuristic similar to Scatter Search is presented, using a difficult discrete optimization problem, namely the Golomb Ruler, as a benchmark.