
The UTFLA: uniformization of non-uniform iteration spaces in two-level perfect nested loops using SFLA

J Supercomput (2016) 72:2221–2234
DOI 10.1007/s11227-016-1725-8
The UTFLA: uniformization of non-uniform iteration
spaces in two-level perfect nested loops using SFLA
Shabnam Mahjoub¹ · Hakimeh Vojoudi¹
Published online: 11 May 2016
© Springer Science+Business Media New York 2016
Abstract Loops are one of the main factors that increase the execution time of computational programs, and loop parallelization is used to decrease this time. One of the steps performed by parallelizing compilers is the uniformization of non-uniform loops for the wavefront method, which is an NP-hard problem. In this paper, a new method called UTFLA is presented that uniformizes non-uniform two-level perfect nested loops using the shuffled frog-leaping algorithm. UTFLA combines deterministic and stochastic methods, because the main challenge faced by most loop-parallelization methods, whether classic, dynamic, or new, is high algorithm execution time. UTFLA is designed to find the best results, with the smallest basic dependency cone size, in the minimum possible time, and it gives more appropriate results in a more reasonable time than other methods.
Keywords Parallelizing compilers · Uniformization · Loops · Frog-leaping algorithm
1 Introduction
To improve the performance of applications, multi-processor and multi-core systems can be used, which reduce the overhead costs of serial execution. There are generally two approaches to parallelization [1]: automatic parallelization and parallel programming. In automatic parallelization, a parallelizing compiler turns the serial program into a parallel one automatically; in parallel programming, the whole program is divided into smaller units of work, and these are assigned to different
✉ Shabnam Mahjoub
1 Department of Computer Engineering, Langaroud Branch, Islamic Azad University, Langaroud,
... There are two different approaches to the parallelisation of nested loops. In the first approach, the non-uniform iteration space is transformed into a uniform one [18,20,43–45], after which the parallelisation method is used. In fact, in the data dependency analysis step, if the loops do not have a uniform structure, it is not possible to use common and simple methods such as Wavefront to run them in parallel. ...
... Study [44] was the first to come up with the idea of uniformisation. Since then, however, only a few studies have been conducted on the idea [18,20,43,45], whose main problem was the large DCS and the presence of at least one main vector. Our previous study proposed the first approach based on a genetic algorithm to solve this problem [18] on three-level perfect nested loops. ...
... Transforming the non-uniform pattern of the dependence vectors of a loop to a uniform one is an NP-Hard problem [43]. For this reason, this paper uses the Frog Leaping Algorithm (FLA) which has a very high ability to converge rapidly. ...
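The two-dimensional notion of dependency cone size (DCS) referenced in these excerpts can be made concrete: in 2D the DCS is proportional to the angle spanned by the dependence vectors. The following is a minimal sketch of that measure; the function name `dcs_2d` and the sample vectors are illustrative, not taken from the paper.

```python
import math

def dcs_2d(vectors):
    """Angle (radians) spanned by a set of 2D dependence vectors;
    in 2D the dependency cone size is proportional to this angle."""
    angles = [math.atan2(y, x) for x, y in vectors]
    return max(angles) - min(angles)

# Any set of basic dependence vectors must span at least the cone of
# the original non-uniform vectors, so uniformization searches for
# basic vectors whose cone is as tight as possible around them.
original = [(2, 1), (1, 2), (3, 1)]
basic = [(3, 1), (1, 2)]          # extreme rays of the original set
assert dcs_2d(basic) >= dcs_2d(original) - 1e-9
```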
Due to the design of computer systems in the multi-core and/or multi-processor form, it is possible to use the maximum capacity of processors to run an application in the least time through parallelisation. This is the responsibility of parallel compilers, which perform parallelisation in several steps by distributing iterations between different processors and executing them simultaneously to achieve lower runtime. The present paper focuses on the uniformisation of three-level perfect nested loops as an important step in parallelisation and proposes a method called Towards Three-Level Loop Parallelisation (TLP), which combines a Frog Leaping Algorithm with a fuzzy system to achieve optimal results; three-level loops are of particular interest because, in recent years, many algorithms have worked on volumetric data, that is, three-dimensional spaces. Results of the implementation of the TLP algorithm, in comparison with existing methods, show a wide variety of optimal results at desired times, with minimum cone size resulting from the vectors. Besides, the maximum number of input dependence vectors is decomposed by this algorithm. These results can accelerate the process of generating parallel codes and facilitate their development for High-Performance Computing purposes.
... Working with this type of data requires a lot of memory and high computing power [37]. So far, little work has been done on the uniformization of two- and three-dimensional iteration spaces in a separate manner [24], [38]. But the problem is that for a 2D space, the size of the dependence cone is proportional to the size of the angle between the BDVS, while in a 3D space, it is proportional to the volume enclosed between the BDVS, and the same mechanism cannot therefore be used for uniformization of both types of loops. ...
... However, it still suffered from the problem of high runtime and lack of practicality. In [38], which uses the FLA-based approach, three effective factors in the problem were considered with fixed coefficients, greatly limiting the final results. In addition, experiments and evaluations have been performed on limited data sets that are only parallel to a maximum of two known vectors. ...
... In fact, the volume of the area enclosed between these three vectors and the sphere with a radius equal to 1 remains constant under a change of coordinate system and a rotation of the vectors such that one of the vectors coincides with the z-axis. After the rotation of the three original vectors, new vectors are obtained whose spherical coordinates, rather than their Cartesian coordinates, can be used to calculate the DCS using Equations (7) to (9) [38]. The vector that coincides with the z-axis after rotation (v2) has ϕ = 0. Also, because the volume of the area enclosed in the sphere is calculated with a radius equal to one, ρ can be considered equal to one for each of the vectors. ...
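The rotate-then-convert step described in this excerpt can be sketched as follows. This is an illustrative reconstruction, not the cited paper's code: the helper names (`align_with_z`, `spherical`) and the sample vectors are ours, and Rodrigues' rotation formula is one standard way to realize the described rotation.

```python
import math

def cross(a, b):
    return (a[1]*b[2] - a[2]*b[1],
            a[2]*b[0] - a[0]*b[2],
            a[0]*b[1] - a[1]*b[0])

def norm(a):
    return math.sqrt(sum(x * x for x in a))

def align_with_z(v):
    """3x3 rotation matrix (Rodrigues' formula) mapping v's direction
    onto the z-axis. Assumes v is not anti-parallel to z."""
    n = norm(v)
    u = tuple(x / n for x in v)
    axis = cross(u, (0.0, 0.0, 1.0))
    s, c = norm(axis), u[2]            # sin and cos of the rotation angle
    if s < 1e-12:                      # already aligned with z
        return [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
    k = tuple(x / s for x in axis)
    K = [[0, -k[2], k[1]], [k[2], 0, -k[0]], [-k[1], k[0], 0]]
    KK = [[sum(K[i][m] * K[m][j] for m in range(3)) for j in range(3)]
          for i in range(3)]
    return [[(1 if i == j else 0) + s * K[i][j] + (1 - c) * KK[i][j]
             for j in range(3)] for i in range(3)]

def apply(R, v):
    return tuple(sum(R[i][j] * v[j] for j in range(3)) for i in range(3))

def spherical(v):
    """(rho, theta, phi), with phi measured from the z-axis."""
    rho = norm(v)
    return rho, math.atan2(v[1], v[0]), math.acos(v[2] / rho)

# Rotate so that v2 lands on the z-axis: its phi becomes 0, and with
# rho fixed at 1 only the (theta, phi) pairs of the rotated vectors
# enter the DCS volume computation.
v1, v2, v3 = (1.0, 2.0, 2.0), (0.0, 3.0, 4.0), (2.0, 0.0, 1.0)
R = align_with_z(v2)
```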
The growth of software techniques for implementing applications must go hand in hand with the growth of computer system hardware in the design of multi-core and multi-processor systems; otherwise, we cannot expect to be able to use maximum hardware capacities. One of the most important and challenging techniques for running applications is to run them in parallel with a focus on loop parallelism to reduce execution time. On the other hand, in recent years, many algorithms have been working on volumetric data, i.e., three-dimensional spaces; therefore, parallelization must be possible for all types of two-dimensional and three-dimensional loops. Uniformization is an important part of loop parallelism, and also the present paper’s focus. The proposed algorithm in the present paper performed uniformization with a combination of the frog leaping algorithm and the fuzzy system for two- and three-dimensional loops on a wide range of input dependence vectors and achieved a considerable variety of results in the desired time. The results of this study can be used to facilitate the development of parallel codes.
Most important scientific and engineering applications have complex computations or large data. In all these applications, a huge amount of time is consumed by nested loops. Therefore, loops are the main source of parallelization in scientific and engineering programs. Many parallelizing compilers focus on the parallelization of nested loops with uniform dependences, while the parallelization of nested loops with non-uniform dependences has not been extensively investigated. This paper addresses the problem of parallelizing two-level nested loops with non-uniform dependences. The aim is to minimize the execution time by improving load balancing and minimizing inter-processor communication. We propose a new tiling algorithm, k-StepIntraTiling, which uses the bin packing problem to minimize the execution time. We demonstrate the effectiveness of the proposed method in several experiments. Simulation and experimental results show that the algorithm effectively reduces the total execution time of several benchmarks compared to other tiling methods.
In this paper we review the main ideas presented in several other papers on optimization techniques used by compilers. Here we focus on the loop unrolling technique and its effect on power consumption and energy usage, as well as its impact on program speedup through ILP (instruction-level parallelism). Concentrating on superscalar processors, we discuss the idea of generalized loop unrolling presented by J.C. Huang and T. Leng and then present a new method to traverse a linked list to obtain a better result from loop unrolling in that case. After that we report the results of some experiments carried out on a Pentium 4 processor (as an instance of a superscalar architecture). Furthermore, the results of some other experiments on a supercomputer (the Alliant FX/2800 system) containing superscalar node processors are mentioned. These experiments show that loop unrolling has a slight measurable effect on energy usage as well as power consumption, but it can be an effective way to speed up programs.
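As a minimal illustration of the unrolling transform this abstract discusses (our own sketch, in Python for readability; real compilers apply it to machine-level loops, where the saved branch and loop-control overhead is what matters):

```python
def sum_rolled(a):
    """Straightforward loop: one loop-control check per element."""
    total = 0
    for x in a:
        total += x
    return total

def sum_unrolled4(a):
    """Same computation with the body replicated four times per
    iteration, so loop-control overhead is paid once per four
    elements; an epilogue handles the leftover iterations."""
    total, i, n = 0, 0, len(a)
    while i + 4 <= n:
        total += a[i] + a[i + 1] + a[i + 2] + a[i + 3]
        i += 4
    while i < n:                  # epilogue for the remaining < 4 items
        total += a[i]
        i += 1
    return total

assert sum_unrolled4(list(range(10))) == sum_rolled(list(range(10)))
```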
When the inter-iteration dependency pattern of the iterations of a loop cannot be determined statically, compile-time parallelization of the loop is not possible. In these cases, runtime parallelization [8] is the only alternative. The idea is to transform the loop into two code fragments: the inspector and the executor. When the program is run, the inspector examines the iteration dependencies and constructs a parallel schedule. The executor subsequently uses that schedule to carry out the actual computation in parallel. In this paper, we show how to reduce the overhead of running the inspector through its parallel execution. We describe two related approaches. The first, which emphasizes inspector efficiency, achieves nearly linear speedup relative to a sequential execution of the inspector, but produces a schedule that may be less efficient for the executor. The second technique, which emphasizes executor efficiency, does not in general achieve linear speedup of the inspector, but is guaranteed to produce the best achievable schedule. We present these techniques, show that they are correct, and compare their performance to existing techniques using a set of experiments. Because in this paper we are optimizing inspector time, but leaving the executor unchanged, the techniques we present have the most dramatic effect when the inspector must be run for each invocation of the source loop. In a companion paper [3], we explore techniques that build upon those developed here to also improve executor performance.
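The inspector/executor split described above can be sketched sequentially. This is an illustrative sketch, not the paper's algorithm: the wavefront-style level assignment and the sample dependence pattern are our own assumptions.

```python
def inspector(n, deps):
    """Assign each iteration a wavefront level: one more than the
    highest level among the iterations it depends on. deps[i] lists
    earlier iterations that must finish before iteration i."""
    level = [0] * n
    for i in range(n):                    # iterations in program order
        for j in deps.get(i, []):
            level[i] = max(level[i], level[j] + 1)
    schedule = {}
    for i, l in enumerate(level):
        schedule.setdefault(l, []).append(i)
    return schedule

def executor(schedule, body):
    """Run wavefronts in order; iterations within one wavefront are
    mutually independent and could execute in parallel."""
    for l in sorted(schedule):
        for i in schedule[l]:
            body(i)

# Hypothetical dependence pattern discovered at run time:
schedule = inspector(5, {2: [0], 3: [1, 2], 4: [3]})
```

Iterations 0 and 1 share wavefront 0 and could run concurrently; the executor then releases each later wavefront once its predecessors finish.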
Applicable to arbitrary sequences and nests of loops, affine partitioning is a program transformation framework that unifies many previously proposed loop transformations, including unimodular transforms, fusion, fission, reindexing, scaling and statement reordering. Algorithms based on affine partitioning have been shown to be effective for parallelization and communication minimization. This paper presents algorithms that improve data locality using affine partitioning. Blocking and array contraction are two important optimizations that have been shown to be useful for data locality. Blocking creates a set of inner loops so that data brought into the faster levels of the memory hierarchy can be reused. Array contraction reduces an array to a scalar variable and thereby reduces the number of memory operations executed and the memory footprint. Loop transforms are often necessary to make blocking and array contraction possible. By bringing the full generality of affine partitioning to bear on the problem, our locality algorithm can find more contractable arrays than previously possible. This paper also generalizes the concept of blocking and shows that affine partitioning allows the benefits of blocking to be realized in arbitrarily nested loops. Experimental results on a number of benchmarks and a complete multigrid application in aeronautics indicate that affine partitioning is effective in practice.
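Blocking, as described above, can be illustrated with a tiled matrix multiply: a generic sketch of the transform, not the paper's affine-partitioning algorithm, and in pure Python the cache benefit is only conceptual.

```python
def matmul_blocked(A, B, bs=2):
    """Blocked (tiled) n x n matrix multiply: each of the i/j/k loops
    is split into a block loop plus an intra-block loop, so a bs x bs
    tile of B is reused across the inner iterations while it would
    still be resident in cache."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, bs):
        for jj in range(0, n, bs):
            for kk in range(0, n, bs):
                for i in range(ii, min(ii + bs, n)):
                    for j in range(jj, min(jj + bs, n)):
                        s = C[i][j]
                        for k in range(kk, min(kk + bs, n)):
                            s += A[i][k] * B[k][j]
                        C[i][j] = s
    return C
```

The `min(... , n)` bounds handle matrices whose size is not a multiple of the block size, mirroring the loop-bound cleanup that blocking transforms must generate.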
High Performance Computing (HPC) is the use of multiple computer resources to solve large critical problems. Multiprocessor and multicore are two broad classes of parallel computers which support parallelism. Clustered Symmetric Multiprocessors (SMP) are the most fruitful way out for large-scale applications. Enhancing the performance of computer applications is the main role of parallel processing. Single-processor performance on high-end systems often enjoys a noteworthy outlay advantage when implemented in parallel on systems utilizing multiple, lower-cost, commodity microprocessors. Parallel computers are going mainstream because clusters of SMP (Symmetric Multiprocessor) nodes provide support for an ample collection of parallel programming paradigms. MPI and OpenMP are the trendy flavors in parallel programming. In this paper we review the parallel paradigms available in multiprocessor and multicore systems.
The lexical analyzer is the first phase of the compiler and commonly the most time consuming. The compilation of large programs is still far from optimized in today's compilers. With modern processors moving more towards improving parallelization and multithreading, it has become impossible for older compilers to gain performance as technology advances. Any multicore architecture relies on improving parallelism rather than on improving single-core performance. A compiler that is completely parallel and optimized is yet to be developed and would require significant effort to create. On careful analysis we find that the performance of a compiler is majorly affected by the lexical analyzer's scanning and tokenizing phases. This effort is directed towards the creation of a completely parallelized lexical analyzer designed to run on the Cell/B.E. processor that utilizes its multicore functionalities to achieve high performance gains in a compiler. Each SPE reads a block of data from the input and tokenizes it independently. To prevent dependences between SPEs, a scheme for dynamically extending static block limits is incorporated. Each SPE is given a range which it initially scans and then finalizes its input buffer to a set of complete tokens from the range dynamically. This ensures parallelization of the SPEs independently and dynamically, with the PPE scheduling load for each SPE. The initially static assignment of the code blocks is made dynamic as soon as one SPE commits. This aids SPE load distribution and balancing. The PPE maintains the output buffer until all SPEs of a single stage commit and move to the next stage before it is written out to the file, to maintain order of execution. The approach can be extended easily to other multicore architectures as well. Tokenization is performed by high-speed string searching, with the keyword dictionary of the language, using the Aho-Corasick algorithm.
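The scheme of extending static block limits so that each SPE tokenizes a self-contained range can be sketched as follows. This is a simplified, single-threaded illustration of our own using whitespace tokens; the system described above uses Aho-Corasick matching on the Cell/B.E.

```python
def block_limits(src, nblocks):
    """Split src into roughly equal blocks, but extend each static
    block limit forward to the next whitespace so that no token
    straddles two blocks (simplified version of the dynamic
    block-extension scheme described above)."""
    step = max(1, len(src) // nblocks)
    cuts, pos = [0], 0
    while pos + step < len(src):
        end = pos + step
        while end < len(src) and not src[end].isspace():
            end += 1                  # extend to a token boundary
        cuts.append(end)
        pos = end
    cuts.append(len(src))
    return [(cuts[i], cuts[i + 1]) for i in range(len(cuts) - 1)]

src = "int main ( ) { return 0 ; }"
blocks = [src[a:b] for a, b in block_limits(src, 3)]
# Each block can now be tokenized independently (in parallel on SPEs);
# concatenating the per-block results reproduces the serial scan.
tokens = [t for b in blocks for t in b.split()]
assert tokens == src.split()
```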
Current parallelizing compilers cannot identify a significant fraction of parallelizable loops because they have complex or statically insufficiently defined access patterns. As parallelizable loops arise frequently in practice, we advocate a novel framework for their identification: speculatively execute the loop as a doall, and apply a fully parallel data dependence test to determine if it had any cross-iteration dependences; if the test fails, then the loop is re-executed serially. Since, from our experience, a significant amount of the available parallelism in Fortran programs can be exploited by loops transformed through privatization and reduction parallelization, our methods can speculatively apply these transformations and then check their validity at run-time. Another important contribution of this paper is a novel method for reduction recognition which goes beyond syntactic pattern matching; it detects at run-time if the values stored in an array participate in a reduction operation, even if they are transferred through private variables and/or are affected by statically unpredictable control flow. We present experimental results on loops from the PERFECT Benchmarks which substantiate our claim that these techniques can yield significant speedups which are often superior to those obtainable by inspector/executor methods.
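The speculative doall with a fully parallel dependence test described above can be sketched as follows. This is a simplified, sequentially simulated, LRPD-flavoured test of our own: privatization and reduction recognition are omitted, and all names are illustrative.

```python
import copy

def speculative_doall(a, body):
    """Execute the loop as a doall on a shadow copy of a while logging
    which iterations read and write each element, then check for
    cross-iteration dependences: an element written by one iteration
    and touched by any other. On success the speculative state is
    committed; on failure the loop is re-executed serially. Returns
    True iff the speculation succeeded."""
    shadow = copy.deepcopy(a)
    written, touched = {}, {}
    for i in range(len(a)):               # conceptually a parallel doall
        def read(idx, i=i):
            touched.setdefault(idx, set()).add(i)
            return shadow[idx]
        def write(idx, val, i=i):
            written.setdefault(idx, set()).add(i)
            touched.setdefault(idx, set()).add(i)
            shadow[idx] = val
        body(i, read, write)
    ok = all(len(written[idx]) == 1 and touched[idx] == written[idx]
             for idx in written)
    if ok:
        a[:] = shadow                     # commit speculative results
    else:
        for i in range(len(a)):           # serial re-execution fallback
            body(i, lambda idx: a[idx],
                 lambda idx, val: a.__setitem__(idx, val))
    return ok
```

An iteration body that only touches its own element passes the test; one that reads a neighbouring element written by another iteration fails it and falls back to serial execution.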
Automatic parallelization is a promising approach to producing scalable multi-threaded programs for multicore architectures. Many existing automatic techniques only parallelize iterations within a loop invocation and synchronize threads at the end of each loop invocation. When parallel code contains many loop invocations, synchronization can easily become a performance bottleneck. Some automatic techniques address this problem by exploiting cross-invocation parallelism. These techniques use static analysis to partition iterations among threads to avoid cross-thread dependences. However, this partitioning is not always achievable at compile time, because program input determines dependence patterns at run time. By contrast, this paper proposes DOMORE, the first automatic parallelization technique that uses runtime information to exploit additional cross-invocation parallelism. Instead of partitioning iterations statically, DOMORE dynamically detects cross-thread dependences and synchronizes only when necessary. DOMORE consists of a compiler and a runtime library. At compile time, DOMORE automatically parallelizes loops and inserts a custom runtime engine into programs. At run time, the engine observes dependences and synchronizes iterations only when necessary. For six programs, DOMORE achieves a geomean loop speedup of 2.1× over parallel execution without cross-invocation parallelization and of 3.2× over sequential execution on eight cores.
A memetic meta-heuristic called the shuffled frog-leaping algorithm (SFLA) has been developed for solving combinatorial optimization problems. The SFLA is a population-based cooperative search metaphor inspired by natural memetics. The algorithm contains elements of local search and global information exchange. The SFLA consists of a set of interacting virtual frogs partitioned into different memeplexes. The virtual frogs act as hosts or carriers of memes, where a meme is a unit of cultural evolution. The algorithm simultaneously performs an independent local search in each memeplex. The local search is completed using a particle swarm optimization-like method adapted for discrete problems but emphasizing a local search. To ensure global exploration, the virtual frogs are periodically shuffled and reorganized into new memeplexes in a technique similar to that used in the shuffled complex evolution algorithm. In addition, to provide the opportunity for random generation of improved information, random virtual frogs are generated and substituted into the population. The algorithm has been tested on several test functions that present difficulties common to many global optimization problems. The effectiveness and suitability of this algorithm have also been demonstrated by applying it to a groundwater model calibration problem and a water distribution system design problem. Compared with a genetic algorithm, the experimental results in terms of the likelihood of convergence to a global optimal solution and the solution speed suggest that the SFLA can be an effective tool for solving combinatorial optimization problems.
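The SFLA loop described above (sort, deal into memeplexes, leap the worst frog toward the memeplex best, then the global best, then a random position, and reshuffle) can be sketched as a toy continuous minimizer. All parameter values and the sphere test function are illustrative; this is a sketch of the general scheme, not the authors' implementation.

```python
import random

def sfla(f, dim, frogs=30, memeplexes=3, shuffles=20, local_steps=5,
         lo=-5.0, hi=5.0, seed=1):
    """Minimal shuffled frog-leaping sketch minimizing f over [lo, hi]^dim."""
    rng = random.Random(seed)
    pop = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(frogs)]
    for _ in range(shuffles):
        pop.sort(key=f)                      # best frogs first
        best = pop[0]
        # Deal frogs round-robin into memeplexes.
        plexes = [pop[i::memeplexes] for i in range(memeplexes)]
        for plex in plexes:
            for _ in range(local_steps):
                plex.sort(key=f)
                pb, pw = plex[0], plex[-1]   # memeplex best and worst
                # Leap the worst frog toward the memeplex best, then the
                # global best; if neither improves it, replace it randomly.
                for leader in (pb, best):
                    cand = [w + rng.random() * (l - w)
                            for w, l in zip(pw, leader)]
                    cand = [min(hi, max(lo, x)) for x in cand]
                    if f(cand) < f(pw):
                        plex[-1] = cand
                        break
                else:
                    plex[-1] = [rng.uniform(lo, hi) for _ in range(dim)]
        pop = [frog for plex in plexes for frog in plex]   # reshuffle
    return min(pop, key=f)

best = sfla(lambda x: sum(v * v for v in x), dim=2)
```

Only the worst frog of each memeplex moves per step, which is what gives SFLA its mix of cheap local search and periodic global information exchange.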