Article

Generalized Loop-Unrolling: a Method for Program Speed-Up

Authors: J. C. Huang and T. Leng

Abstract

It is well-known that, to optimize a program for speed-up, efforts should be focused on the regions where the payoff will be greatest. Loop constructs in a program represent such regions. In the literature, it has been shown that a certain degree of speed-up can be achieved by loop unrolling. The technique published so far, however, appears to be applicable to FOR-loops only. This paper presents a generalized loop-unrolling method that can be applied to any type of loop construct. Possible complications in its applications, together with some experimental results, are discussed in detail.

Introduction

There has been considerable effort to develop source-to-source transformation methods that restructure loop constructs to expose possibilities for parallelism. Most published loop restructuring methods are designed for countable loops, where the iteration count can be determined without executing the loop. One such method, called loop unrolling [2], is designed to unroll FOR loops for p...
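As a brief, hedged sketch of the distinction drawn above (the functions and the unroll factor of 2 are my own illustrative choices, not examples from the paper): classical unrolling handles countable FOR loops with a remainder loop, while a generalized scheme must also cover loops whose trip count is unknown in advance; the simplest safe unrolling of such a loop keeps a termination test between the copies of the body.

    #include <stddef.h>

    /* Countable FOR loop unrolled by 2, with a remainder loop for odd n. */
    void add_one(int *a, size_t n)
    {
        size_t i = 0;
        for (; i + 2 <= n; i += 2) {
            a[i]     += 1;
            a[i + 1] += 1;
        }
        for (; i < n; i++)
            a[i] += 1;
    }

    /* A WHILE loop with an unknown trip count, unrolled by 2; the inner
     * test keeps behaviour identical when the original loop would have
     * executed an odd number of iterations. */
    int halvings(unsigned x)   /* how many times x can be halved before reaching 1 or 0 */
    {
        int k = 0;
        while (x > 1) {
            x >>= 1; k++;
            if (x <= 1)
                break;
            x >>= 1; k++;
        }
        return k;
    }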


... where ≡ stands for the equivalence relation, and wp(S, B) for the weakest precondition of S with respect to the postcondition B [3]. ...
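As a small illustration of this notation (the concrete statement is my own example, not taken from [3]): by the assignment rule wp(x := e, B) ≡ B[x := e], one has

    wp(x := x + 1, x > 0)  ≡  (x + 1 > 0)  ≡  (x > -1),

i.e. the weakest requirement on the initial state under which x := x + 1 is guaranteed to terminate in a state satisfying x > 0.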
... 13. } As mentioned in [3], "The experimental results show that this unrolled loop is able to achieve a speed up factor very close to 2, and if we unroll the loop k times, we can achieve a speed up factor of k." Instructions 6, 7, 8 and 9 form a basic block, but because of data dependencies, superscalar processors cannot execute these instructions in parallel. ...
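The numbered instructions refer to a listing in the cited paper that is not reproduced in the snippet. As a generic, hedged sketch of the dependence issue in C (the function names and the unroll factor of 4 are my own illustrative choices, not taken from [3]): an unrolled body that funnels everything through one accumulator still forms a single serial dependence chain, whereas separate accumulators expose additions a superscalar processor can issue in parallel (the two versions may differ slightly in floating-point rounding).

    #include <stddef.h>

    /* Unrolled by 4 but chained through one accumulator: every addition
     * depends on the previous one, so the unrolled basic block still
     * executes largely serially on a superscalar processor. */
    double sum_chained(const double *a, size_t n)    /* n: multiple of 4 */
    {
        double s = 0.0;
        for (size_t i = 0; i < n; i += 4)
            s = (((s + a[i]) + a[i + 1]) + a[i + 2]) + a[i + 3];
        return s;
    }

    /* Same unrolling with four independent accumulators: the additions
     * within one iteration have no data dependence on each other. */
    double sum_independent(const double *a, size_t n)  /* n: multiple of 4 */
    {
        double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
        for (size_t i = 0; i < n; i += 4) {
            s0 += a[i];
            s1 += a[i + 1];
            s2 += a[i + 2];
            s3 += a[i + 3];
        }
        return (s0 + s1) + (s2 + s3);
    }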
... Example 2: A loop for traversing a linked list and counting the nodes traversed. The best solution presented by Huang and Leng [3] is to attach a special node named NULL_NODE at the end of the list. The link field of this node points to the node itself. With this idea, after unrolling the loop twice, it becomes: ...
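The unrolled listing is cut off in the snippet above. The following C sketch is only a reconstruction of the idea described (the node type, the identifier NULL_NODE and the trailing fix-up for odd-length lists are my assumptions, not the paper's exact code): because the sentinel's link field points to itself, it is always safe to look one node ahead, so the test between the two copies of the body can be folded into the loop condition.

    #include <stddef.h>

    struct node {
        struct node *link;
        /* payload fields omitted */
    };

    /* Sentinel terminating every list; its link points to itself, so
     * p->link can be dereferenced even when p is the sentinel. */
    static struct node NULL_NODE = { &NULL_NODE };

    /* Count the nodes of a list terminated by &NULL_NODE (not by NULL). */
    int count_nodes(struct node *p)
    {
        int count = 0;

        /* Unrolled twice: p->link != &NULL_NODE guarantees that both
         * advances in the body land on real nodes. */
        while (p->link != &NULL_NODE) {
            count++;
            p = p->link;
            count++;
            p = p->link;
        }
        /* At most one node (the last of an odd-length list) remains. */
        if (p != &NULL_NODE) {
            count++;
            p = p->link;
        }
        return count;
    }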
Article
Full-text available
In this paper we review the main ideas of several other papers on optimization techniques used by compilers. We focus on the loop unrolling technique and its effect on power consumption and energy usage, as well as its impact on program speed-up through ILP (instruction-level parallelism). Concentrating on superscalar processors, we discuss the idea of generalized loop unrolling presented by J.C. Huang and T. Leng, and then we present a new method of traversing a linked list to get a better result from loop unrolling in that case. After that we report the results of some experiments carried out on a Pentium 4 processor (as an instance of a superscalar architecture). Furthermore, the results of some other experiments on a supercomputer (the Alliant FX/2800 system) containing superscalar node processors are reported. These experiments show that loop unrolling has only a slight measurable effect on energy usage and power consumption, but it can be an effective way to speed up programs.
... Loop Unrolling (also known as Loop Unwinding and Loop Unfolding) is an optimization technique, performed by the compiler or manually by the programmer, applicable to certain kinds of loops in order to reduce (or even prevent) the occurrence of execution branches and minimize the cost of instructions for controlling the loop [1,8,16,25]. Its goal is to optimize the program's execution speed at the expense of increasing the size of the generated code (space-time tradeoff). ...
Article
Reduction operations are extensively employed in many computational problems. A reduction consists of combining all elements of a finite set of numeric elements into a single value, using a combiner function. A parallel reduction, in turn, is the reduction operation performed concurrently when multiple execution units are available. The current work reports an investigation of this subject and presents a GPU-based parallel approach for it. Employing techniques like Loop Unrolling, Persistent Threads and Algebraic Expressions to avoid thread divergence, the presented approach was able to achieve a 2.8x speedup compared to the work of Catanzaro, using generic, simple and easily portable code. Experiments conducted to evaluate the approach show that the strategy performs efficiently on AMD and NVidia hardware, as well as in OpenCL and CUDA.
... 9. LLVM unrolls the loop by a factor of 8, precisely the same as the constant COLS. This loop unrolling [9] removes the innermost loop: 8 comparison instructions, 8 jump instructions, and the arithmetic instructions for the loop induction variable. As a result, LLVM can reduce the dynamic instruction count significantly, but it increases the static code size. ...
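A hedged sketch of the situation the snippet describes (the array names, the outer loop and COLS = 8 are illustrative assumptions, not taken from the benchmark): when the inner trip count is a compile-time constant, the compiler can unroll the loop completely, removing its comparisons, jumps and induction-variable arithmetic at the cost of larger code.

    #define COLS 8

    /* Before: the innermost loop executes COLS compare/jump/increment
     * operations per row in addition to the useful additions. */
    void row_sums(int rows, const int m[][COLS], int *out)
    {
        for (int r = 0; r < rows; r++) {
            int s = 0;
            for (int c = 0; c < COLS; c++)
                s += m[r][c];
            out[r] = s;
        }
    }

    /* After full unrolling by the constant factor 8: the inner loop, its
     * comparisons, jumps and induction variable disappear entirely. */
    void row_sums_unrolled(int rows, const int m[][COLS], int *out)
    {
        for (int r = 0; r < rows; r++) {
            out[r] = m[r][0] + m[r][1] + m[r][2] + m[r][3]
                   + m[r][4] + m[r][5] + m[r][6] + m[r][7];
        }
    }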
Article
Full-text available
The embedded processor market has grown rapidly and consistently with the appearance of mobile devices. In an embedded system, power consumption and execution time are important factors affecting performance. The system performance is determined by both hardware and software. Even when the hardware architecture is high-end, the software may run slowly because of low-quality code. This study compared the performance of two major compilers, LLVM and GCC, on a 32-bit EISC embedded processor. The dynamic instructions and static code sizes produced by these compilers were evaluated with the EEMBC benchmarks. LLVM generally performed better in the ALU-intensive benchmarks, whereas GCC produced better register allocation and jump optimization. The dynamic instruction count and static code size of GCC were on average 8% and 7% lower than those of LLVM, respectively.
... The drawback of such a system is the limited ILP available in the application programs. In this paper we present a technique to overcome such limitations with the help of loop unrolling [5]. ...
Conference Paper
Full-text available
Application Specific Instruction-set Processor (ASIP) is one of the popular processor design techniques for embedded systems which allow customizability in processor design without overly hindering design flexibility. Multi-pipeline ASIPs were proposed to improve the performance of such systems by compromising between speed and processor area. One of the problems in the multi-pipeline design is the limited inherent instruction level parallelism (ILP) available in applications. The ILP of application programs can be improved via a compiler optimization technique known as loop unrolling. In this paper, we present the impact of loop unrolling on the performance (speed) of multi-pipeline ASIPs. The improvement in speed averages around 15% for a number of benchmark applications with the maximum improvement of around 30%. In addition, we report the variation of performance against the loop unrolling factor - the amount of unrolling performed on an application.
... The vision of [2] remained unverified and unimplemented, but it marks the goal of extending loop transformations to work with LPs. [7] showed how certain special forms of while-loops (not necessarily LPs) can be unrolled. They presented several ideas for unrolling while-loops, including loops that traverse linked lists. ...
Article
Full-text available
Induction pointers (IPs) are the analogue of induction variables (IVs), namely, pointers that are advanced by a fixed amount every iteration of a loop (e.g., p=p→next→next). Although IPs have been considered in previous works, there is no algorithm to properly compute the correct amount of pointer jumping (AOPJ) by which IPs should be advanced if loop unrolling is to be applied to loops of the form while(p){…p=p→next→next;}. The main difficulty in computing the correct AOPJ of IPs is that pointers can be used to modify the data structure that is traversed by the loop (e.g., adding/removing/by-passing elements). Consequently, a simple advancement p=p→next in a loop does not necessarily mean that p is advanced by one element every iteration. This situation contrasts with the use of IVs, which cannot change the structure of arrays that are traversed by loops. Hence, if i is an IV, A[i+1] will always mean the next element of A[], while if p=p→next; is preceded by p→next=q; it may be advanced by k>1 elements at every iteration. The proposed method for computing the correct AOPJ of IPs and an accompanying loop unrolling technique were implemented in the SUIF compiler for C programs. Our experiments with automatic unrolling of loops with pointers yielded an improvement of 3–5% for a set of SPEC2000 programs. Experiments with a VLIW IA-64 machine also verified the usefulness of this approach for embedded systems.
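A hedged C sketch of the two situations the abstract contrasts (the list type, function name and unroll factor of 2 are my own illustrative choices): when p is advanced by one element per iteration and the body does not rewrite links, the AOPJ is 1 and the loop can be unrolled by pre-testing one element ahead; when the body also assigns to link fields, the same statement p = p->next may skip several original elements, which is exactly what an AOPJ analysis must detect.

    #include <stddef.h>

    struct node { struct node *next; int val; };

    /* Original loop:  while (p) { s += p->val; p = p->next; }
     * With AOPJ = 1 per iteration, unrolling by 2 is safe as long as
     * both elements of each round are known to exist: */
    long sum_vals_unrolled(struct node *p)
    {
        long s = 0;
        while (p != NULL && p->next != NULL) {
            s += p->val;
            s += p->next->val;
            p = p->next->next;      /* advance by 2 elements per iteration */
        }
        if (p != NULL)              /* odd-length remainder */
            s += p->val;
        return s;
    }

    /* By contrast, a body such as
     *     p->next = q;   p = p->next;
     * rewrites the structure before advancing, so p may move past more
     * than one original element; the plain "advance by 1" reading of
     * p = p->next is then no longer valid for unrolling. */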
... These operations are typically expressed as sums over all matrix or vector indices, that is, (multiple) loops in a program. In an effort to speed up and optimize the CPU time on a computer, one of the techniques used is loop unrolling [19]. The following two programs compute dy = da × dx + dy with dx ∈ R^n, dy ∈ R^n and da ∈ R. ...
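The two program listings themselves are elided in the snippet above; the following is only a hedged sketch of what such a rolled/unrolled pair typically looks like (the function names and the unroll factor of 4 are my assumptions, not the original listings):

    #include <stddef.h>

    /* Rolled version of dy = da * dx + dy (a daxpy-style operation). */
    void daxpy(size_t n, double da, const double *dx, double *dy)
    {
        for (size_t i = 0; i < n; i++)
            dy[i] = da * dx[i] + dy[i];
    }

    /* The same operation unrolled by 4: fewer loop tests per element and
     * several independent multiply-adds per iteration; the second loop
     * handles n not divisible by 4. */
    void daxpy_unrolled(size_t n, double da, const double *dx, double *dy)
    {
        size_t i = 0;
        for (; i + 4 <= n; i += 4) {
            dy[i]     = da * dx[i]     + dy[i];
            dy[i + 1] = da * dx[i + 1] + dy[i + 1];
            dy[i + 2] = da * dx[i + 2] + dy[i + 2];
            dy[i + 3] = da * dx[i + 3] + dy[i + 3];
        }
        for (; i < n; i++)
            dy[i] = da * dx[i] + dy[i];
    }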
Article
In many industrial applications, models constructed from real problems using empirical or physical laws are used for control, prediction, error detection, design, or simulation. Models often involve some unknown coefficients (parameters). The unknown parameters of a model have to be estimated prior to solving the model. Parameters can be estimated using the model and the observed (measured) data from the real problem. Parameter estimation problems can be reduced to various forms, e.g. "nonlinear optimization", "nonlinear equations" or "nonlinear least squares", which can be solved by numerical methods. There are many numerical methods, each of which is appropriate for a specific variety of problem. In this dissertation, Newton's method and its variants (Newton's method with line search/trust-region) for solving parameter estimation problems of small to medium dimension are addressed. Implementation techniques of the algorithms are outlined in detail. Test results are given. Solution procedures for the following parameter identification problems from the company IAV GmbH, Gifhorn, are discussed: 1. Identification of geometric tolerances in a sensor wheel; 2. Identification of combustion parameters in a diesel engine; 3. Identification of reaction parameters in a "Selective Catalytic Reduction" (SCR) catalyst. Mathematical models are given and the parameter estimation problems are discussed. In particular, the computational time and the storage requirements, as well as the stability and the convergence behaviour, are considered.
... Finally, loop unrolling mechanisms have been extensively studied in the literature on multi-processor systems for many years. All the proposed techniques perform better for in-order architectures and were originally designed for mono-processor systems (Huang and Leng, 1999). Therefore, current widely available compilers are not able to exploit the dynamic scheduling facilities found in out-of-order processors, and the instruction-level parallelism achieved is not so spectacular. ...
Article
Tomorrow's embedded devices need to run multimedia applications demanding high computational power under low energy consumption constraints. In this context, the register file is a key source of power consumption, and its inappropriate design and management severely affect system power. In this paper, we present a new approach that reduces the energy of shared register files in forthcoming embedded VLIW processors running real-life applications by up to 60% without performance penalty. This approach relies on limited hardware extensions and a compiler-based, energy-aware register assignment algorithm to deactivate parts of the register file (i.e., sub-banks) at run-time in an independent way.
Chapter
Reduction operations aggregate a finite set of numeric elements into a single value. They are extensively employed in many computational tasks and can be performed in parallel when multiple processing units are available. This work presents a GPU-based approach for parallel reduction, which employs techniques like loop unrolling, persistent threads and algebraic expressions. It avoids thread divergence and it is able to surpass the methods currently in use. Experiments conducted to evaluate the approach show that the strategy performs efficiently on both AMD and NVidia’s hardware platforms, as well as using OpenCL and CUDA, making it portable.
Article
Full-text available
Loops are one of the main factors increasing the execution time of computational programs, and loop parallelization is used to reduce this time. One step in parallelizing compilers is the uniformization of non-uniform loops for the wavefront method, which is an NP-hard problem. This paper presents a new method, called UTFLA, for uniformizing non-uniform two-level perfect nested loops using the frog-leaping algorithm, a combination of deterministic and stochastic methods, since the main challenge faced by most loop parallelization methods, old and new, is high algorithm execution time. UTFLA has been designed to find the best results, with the smallest basic dependency cone size, in the minimum possible time, and it gives more appropriate results in a more reasonable time than other methods.
Article
Full-text available
For high-resolution, iterative 3D PET image reconstruction the efficient implementation of forward-backward projectors is essential to minimise the calculation time. Mathematically, the projectors are summarised as a system response matrix (SRM) whose elements define the contribution of image voxels to lines-of-response (LORs). In fact, the SRM easily comprises billions of non-zero matrix elements to evaluate the tremendous number of LORs as provided by state-of-the-art PET scanners. Hence, the performance of iterative algorithms, e.g. maximum-likelihood-expectation-maximisation (MLEM), suffers from severe computational problems due to the intensive memory access and huge number of floating point operations. Here, symmetries occupy a key role in terms of efficient implementation. They reduce the amount of independent SRM elements, thus allowing for a significant matrix compression according to the number of exploitable symmetries. With our previous work, the PET REconstruction Software TOolkit (PRESTO), very high compression factors (>300) are demonstrated by using specific non-Cartesian voxel patterns involving discrete polar symmetries. In this way, a pre-calculated memory-resident SRM using complex volume-of-intersection calculations can be achieved. However, our original ray-driven implementation suffers from addressing voxels, projection data and SRM elements in disfavoured memory access patterns. As a consequence, a rather limited numerical throughput is observed due to the massive waste of memory bandwidth and inefficient usage of cache respectively. In this work, an advantageous symmetry-driven evaluation of the forward-backward projectors is proposed to overcome these inefficiencies. The polar symmetries applied in PRESTO suggest a novel organisation of image data and LOR projection data in memory to enable an efficient single instruction multiple data vectorisation, i.e. simultaneous use of any SRM element for symmetric LORs. In addition, the calculation time is further reduced by using simultaneous multi-threading (SMT). A global speedup factor of 11 without SMT and above 100 with SMT has been achieved for the improved CPU-based implementation while obtaining equivalent numerical results.
Article
Full-text available
This work presents a technique to optimize popular image processing algorithms on mobile platforms such as cell phones, netbooks and personal digital assistants (PDAs). The increasing demand for video applications like context-aware computing on mobile embedded systems requires the use of computationally intensive image processing algorithms. The system engineer has a mandate to optimize them so as to meet real-time deadlines. A methodology to take advantage of the asymmetric dual-core processor, which includes an ARM and a DSP core supported by shared memory, is presented with implementation details. The target platform chosen is the popular OMAP 3530 processor for embedded media systems. It has an asymmetric dual-core architecture with an ARM Cortex-A8 and a TMS320C64x Digital Signal Processor (DSP). The development platform was the BeagleBoard with 256 MB of NAND RAM and 256 MB of SDRAM. The basic image correlation algorithm is chosen for benchmarking as it finds widespread application in various template matching tasks such as face recognition. The basic algorithm prototypes conform to OpenCV, a popular computer vision library. OpenCV algorithms can be easily ported to the ARM core, which runs a popular operating system such as Linux or Windows CE. However, the DSP is architecturally more efficient at handling DFT algorithms. The algorithms are tested on a variety of images, and performance results are presented measuring the speedup obtained due to the dual-core implementation. A major advantage of this approach is that it allows the ARM processor to perform important real-time tasks, while the DSP addresses performance-hungry algorithms.
Article
Virtual machines provide platform independence by using intermediate code. Program source code is compiled into intermediate code, which can be executed on many different platforms. Even though there are many virtual machines, few of them can support C/C++, especially with high performance. Recently, the Low Level Virtual Machine (LLVM) has provided a limited degree of architecture independence for C programs. Though it does not support multi-platform execution of complex programs, it is able to run simple benchmarks, enabling us to look into important aspects. In this paper, we analyze the performance impact of using intermediate code for C programs in an embedded system. Compared to traditional native execution, LLVM is slightly weaker at handling memory accesses, but it generally shows competitive performance in other optimizations. In particular, in a benchmark with frequent function calls, LLVM exhibits even better performance than GCC.
Article
Test cases for an original program can be reused to reduce the cost of testing the modified program. The reusability of test cases is affected by the rewriting patterns and test coverage criteria adopted when rewriting and testing the modified program. This paper summarizes frequent rewriting patterns and analyzes their effect on testing the modified program. Substantial experimental work has been carried out on three applications, and the results show that the reusability of test cases is related to the rewriting patterns and the test coverage criteria.
Conference Paper
Real-time computer vision applications like video streaming on cell phones, remote surveillance and virtual reality have stringent performance requirements but can be severely restrained by limited resources. The use of optimized algorithms is vital to meet real-time requirements, especially on popular mobile platforms. This paper presents work on performance optimization of common computer vision algorithms, such as correlation, on such embedded systems. The correlation algorithm, which is popular for face recognition, can be implemented using convolution or the Discrete Fourier Transform (DFT). The algorithms are benchmarked on the Intel Pentium processor and the BeagleBoard, which is a new low-cost, low-power platform based on the Texas Instruments (TI) OMAP 3530 processor architecture. The OMAP processor consists of an asymmetric dual-core architecture, including an ARM and a DSP supported by shared memory. OpenCV, a computer vision library developed by Intel Corporation, was utilized for some of the algorithms. Comparative results for the various approaches are presented and discussed with an emphasis on real-time implementation.
Conference Paper
Modulo scheduling is a major optimization of high-performance compilers wherein the body of a loop is replaced by an overlapping of instructions from different iterations, so the compiler can schedule more instructions in parallel than in the original loop. Being a scheduling optimization, modulo scheduling is a typical backend optimization relying on a detailed description of the underlying CPU and its instructions to produce a good schedule. This work considers the problem of applying modulo scheduling at source level as a loop transformation, using only general information about the underlying CPU architecture. By doing so it is possible to: a) create a more retargetable compiler, as modulo scheduling is now applied at source level; b) study possible interactions between modulo scheduling and common loop transformations; and c) obtain a source-level optimizer whose output is readable to the programmer, yet whose final output can be efficiently compiled by a relatively "simple" compiler. Experimental results show that source-level modulo scheduling can improve performance even when low-level modulo scheduling is applied by the final compiler, indicating that high-level and low-level modulo scheduling can co-exist to improve performance. An algorithm for source-level modulo scheduling that modifies the abstract syntax tree of a program is presented. This algorithm has been implemented in an automatic parallelizer (Tiny). Preliminary experiments also yield runtime and power improvements on the ARM CPU for embedded systems.
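As a hedged illustration of what a source-level overlap of iterations can look like in C (a generic software-pipelining sketch of my own, not the algorithm or code of the paper): the multiply of the next iteration is started while the add and store of the current one complete, with a prologue and an epilogue around the steady-state kernel.

    #include <stddef.h>

    /* Original loop: each iteration multiplies, then adds and stores. */
    void fused(const float *a, const float *b, const float *d,
               float *c, size_t n)
    {
        for (size_t i = 0; i < n; i++) {
            float t = a[i] * b[i];
            c[i] = t + d[i];
        }
    }

    /* Source-level overlap: the multiply of iteration i+1 is issued in
     * the same kernel iteration as the add/store of iteration i. */
    void pipelined(const float *a, const float *b, const float *d,
                   float *c, size_t n)
    {
        if (n == 0)
            return;
        float t = a[0] * b[0];                    /* prologue */
        for (size_t i = 0; i + 1 < n; i++) {      /* steady-state kernel */
            float t_next = a[i + 1] * b[i + 1];
            c[i] = t + d[i];
            t = t_next;
        }
        c[n - 1] = t + d[n - 1];                  /* epilogue */
    }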
Conference Paper
Full-text available
A programming language mechanism and associated compiler techniques which significantly enhance the analyzability of pointer-based data structures frequently used in nonscientific programs are proposed. The approach is based on exploiting two important properties of pointer data structures: structural inductivity and speculative traversability. Structural inductivity facilitates the application of a static interference analysis method for such pointer data structures based on path matrices, and speculative traversability is exploited by a novel loop unrolling technique for while loops that extracts fine-grain parallelism by speculatively traversing such data structures. The effectiveness of this approach is demonstrated by applying it to a collection of loops found in typical nonscientific C programs.
Article
A state constraint is a programming construct designed to restrict a program's domain of definition. It can be used to decompose a program pathwise, i.e., to divide the program into subprograms along the control flow, as opposed to dividing the program across the control flow, as is done when the program is decomposed into functions and procedures. As a result, a program consisting of one or more execution paths of another program can be constructed and manipulated. The author describes the idea involved, examines the properties of state constraints, establishes a formal basis for pathwise decomposition, and discusses their uses in program simplification, testing and verification.