Conference Paper

Uncovering hidden loop level parallelism in sequential applications

Adv. Comput. Archit. Lab., Univ. of Michigan, Ann Arbor, MI
DOI: 10.1109/HPCA.2008.4658647 Conference: High Performance Computer Architecture, 2008. HPCA 2008. IEEE 14th International Symposium on
Source: IEEE Xplore

ABSTRACT As multicore systems become the dominant mainstream computing technology, one of the most difficult challenges the industry faces is the software. Applications with large amounts of explicit thread-level parallelism naturally scale performance with the number of cores, but single-threaded applications realize little to no gains from additional cores. One solution to this problem is automatic parallelization, which frees the programmer from the difficult task of parallel programming and offers hope for handling the vast amount of legacy single-threaded software. There is a long history of automatic parallelization for scientific applications, but the techniques have generally failed in the context of general-purpose software. Thread-level speculation overcomes the problem of memory dependence analysis by speculating on unlikely dependences that serialize execution. However, this approach has led to only modest performance gains. In this paper, we take another look at exploiting loop-level parallelism in single-threaded applications. We show that substantial amounts of loop-level parallelism are available in general-purpose applications, but they lurk beneath the surface, often obfuscated by a small number of data and control dependences. We adapt and extend several code transformations from the instruction-level and scientific parallelization communities to uncover the hidden parallelism. Our results show that 61% of the dynamic execution of studied benchmarks can be parallelized with our techniques compared to 27% using traditional thread-level speculation techniques, resulting in a speedup of 1.84 on a four-core system compared to 1.41 without transformations.
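To make the idea of a single dependence obscuring loop-level parallelism concrete, here is an illustrative sketch (our own, not from the paper): an accumulator creates a cross-iteration dependence that serializes a loop, and a standard transformation from the scientific-parallelization literature, reduction expansion, privatizes the accumulator so that the chunks become independent and could run on separate cores, with only a cheap sequential merge at the end.

```python
def sum_serial(a):
    """Original loop: every iteration reads and writes `total`,
    a cross-iteration dependence that serializes execution."""
    total = 0
    for x in a:
        total += x
    return total

def sum_expanded(a, num_chunks=4):
    """Transformed loop: each would-be thread gets a private partial
    sum, so the outer chunk loop carries no cross-iteration dependence
    and is parallelizable; the final merge is a short sequential step."""
    n = len(a)
    size = -(-n // num_chunks)          # ceiling division
    partials = [0] * num_chunks
    for c in range(num_chunks):         # independent: parallelizable
        for x in a[c * size:(c + 1) * size]:
            partials[c] += x
    return sum(partials)                # sequential merge
```

Both versions compute the same result; the point is that after the transformation the bulk of the work no longer serializes on a single variable.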

  • Source
    The 27th International Workshop on Languages and Compilers for Parallel Computing, Hillsboro, OR; 01/2014
  • Source
    ABSTRACT: The focus of this dissertation is on kernel loops (K-loops), which are loop nests that contain hardware-mapped kernels in the loop body. In this thesis, we propose methods for improving the performance of such K-loops by using standard loop transformations to expose and exploit coarse-grain loop-level parallelism. We target a reconfigurable architecture, a heterogeneous system consisting of a general-purpose processor and a field-programmable gate array (FPGA). Research projects targeting reconfigurable architectures try to answer several questions: how to partition the application (deciding which parts to accelerate on the FPGA), how to optimize these parts (the kernels), and what performance gain to expect. However, only a few try to exploit coarse-grain loop-level parallelism. This work goes towards automatically deciding the number of kernel instances to place into the reconfigurable hardware, in a flexible way that can balance area against performance. In this dissertation, we propose a general framework that helps determine the optimal degree of parallelism for each hardware-mapped kernel within a K-loop, taking into account area, memory size and bandwidth, and performance considerations. In the future, it could also take power into account. Furthermore, we present algorithms and mathematical models for several loop transformations in the context of K-loops. The algorithms are used to determine the best degree of parallelism for a given K-loop, while the mathematical models are used to determine the corresponding performance improvement. The algorithms are validated with experimental results. The loop transformations that we analyze in this thesis are loop unrolling, loop shifting, K-pipelining, loop distribution, and loop skewing. An algorithm that decides which transformations to use for a given K-loop is also provided.
    Finally, we present an analysis of possible situations, with justifications of when and why the loop transformations do or do not have a significant impact on K-loop performance.
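Of the transformations the abstract lists, loop distribution is perhaps the simplest to picture. As a small illustrative sketch (ours, not from the dissertation): when a loop body contains statement groups that never read each other's data, distribution splits it into separate loops, each of which could then become its own hardware-mapped kernel or run concurrently.

```python
def fused(a, b):
    """One loop with two independent statement groups: the updates to
    `a` and the updates to `b` never touch each other's elements."""
    for i in range(len(a)):
        a[i] *= 2
        b[i] += 1

def distributed(a, b):
    """After loop distribution: two separate loops with identical
    overall effect, each a candidate for independent acceleration."""
    for i in range(len(a)):
        a[i] *= 2
    for i in range(len(b)):
        b[i] += 1
```

The transformation is legal here precisely because there is no dependence between the two statement groups; a compiler must prove that before distributing.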
  • Source
    ABSTRACT: The individual processors of a chip-multiprocessor traditionally have rigid boundaries. Inter-core communication is only possible via memory, and control over a core's resources is localised. The specialisation necessary to meet today's challenging energy targets is typically provided through a range of processor types and accelerators. An alternative approach is to permit specialisation by tailoring the way a large number of homogeneous cores are used. The approach here is to relax processor boundaries, create a richer mix of inter-core communication mechanisms, and provide finer-grain control over, and access to, the resources of each core. We evaluate one such design, called Loki, that aims to support specialisation in software on a homogeneous many-core architecture. We focus on the design of a single 8-core tile, conceived as the building block for a larger many-core system. We explore the tile's ability to support a range of parallelisation opportunities and detail the control and communication mechanisms needed to exploit each core's resources in a flexible manner. Performance figures and a detailed breakdown of energy usage are provided for a range of benchmarks and configurations.
    Proc. Intl. Conf. on Embedded Computer Systems: Architectures, Modeling and Simulation (SAMOS XIII), Samos; 07/2013
