Conference Paper

Uncovering hidden loop level parallelism in sequential applications

Advanced Computer Architecture Laboratory, University of Michigan, Ann Arbor, MI
DOI: 10.1109/HPCA.2008.4658647
Conference: IEEE 14th International Symposium on High Performance Computer Architecture (HPCA 2008)
Source: IEEE Xplore


As multicore systems become the dominant mainstream computing technology, one of the most difficult challenges the industry faces is software. Applications with large amounts of explicit thread-level parallelism naturally scale performance with the number of cores, but single-threaded applications realize little to no gain from additional cores. One solution to this problem is automatic parallelization, which frees the programmer from the difficult task of parallel programming and offers hope for handling the vast amount of legacy single-threaded software. There is a long history of automatic parallelization for scientific applications, but the techniques have generally failed in the context of general-purpose software. Thread-level speculation overcomes the problem of memory dependence analysis by speculating on unlikely dependences that would otherwise serialize execution. However, this approach has led to only modest performance gains. In this paper, we take another look at exploiting loop-level parallelism in single-threaded applications. We show that a substantial amount of loop-level parallelism is available in general-purpose applications, but it lurks beneath the surface and is often obfuscated by a small number of data and control dependences. We adapt and extend several code transformations from the instruction-level and scientific parallelization communities to uncover this hidden parallelism. Our results show that 61% of the dynamic execution of the studied benchmarks can be parallelized with our techniques, compared to 27% using traditional thread-level speculation, resulting in a speedup of 1.84 on a four-core system compared to 1.41 without the transformations.
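
To make the kind of obfuscation described above concrete, the hypothetical C loop below (illustrative only, not taken from the paper) is serialized by a single loop-carried dependence on an accumulator. Expanding the reduction into per-chunk partial sums, a classic transformation borrowed from the scientific parallelization community, leaves the remaining iterations independent and therefore parallelizable across cores.

#include <stdio.h>

#define N 1000
#define NUM_CHUNKS 4   /* stands in for a four-core partitioning */

/* Original loop: every iteration is independent except for the single
 * accumulator 'sum', whose loop-carried dependence serializes execution. */
static long sum_squares_serial(const int *data)
{
    long sum = 0;
    for (int i = 0; i < N; i++)
        sum += (long)data[i] * data[i];   /* the only cross-iteration dependence */
    return sum;
}

/* After reduction expansion: each chunk accumulates into its own partial
 * sum, so the chunks could run on separate cores; the partial sums are
 * combined in a short sequential epilogue. */
static long sum_squares_expanded(const int *data)
{
    long partial[NUM_CHUNKS] = {0};
    for (int c = 0; c < NUM_CHUNKS; c++)          /* chunks are now independent */
        for (int i = c; i < N; i += NUM_CHUNKS)
            partial[c] += (long)data[i] * data[i];
    long sum = 0;
    for (int c = 0; c < NUM_CHUNKS; c++)          /* cheap sequential combine */
        sum += partial[c];
    return sum;
}

int main(void)
{
    int data[N];
    for (int i = 0; i < N; i++)
        data[i] = i % 7;
    printf("serial=%ld expanded=%ld\n",
           sum_squares_serial(data), sum_squares_expanded(data));
    return 0;
}

Because the example uses integer arithmetic, the reassociation is exact; for floating-point reductions the same rewrite changes rounding order, which is one reason such transformations have to be applied selectively.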

Cited in:
    • "The identification of potentially vectorizable operations requires the discovery of fine-grained concurrency among operations that access contiguously located data elements . Although there has been considerable prior work (e.g., [2] [3] [8] [11] [12] [14] [16] [17] [19] [21] [23] [25] [28] [29] [33] [35] [39]) on using dynamic analysis for characterizing parallelism in applications, previously developed approaches have fundamental limitations for discovering potentially vectorizable operations. Existing work on using dynamic analysis to characterize potential parallelism in sequential programs falls broadly under two general categories: (1) generation of a parallelism profile and critical-path analysis of the directed acyclic graph (explicitly constructed or implicitly modeled ) representing the run-time dependences of the computation, and (2) loop-level or region-level characterization of parallelism, where computations within the loop/region are constrained to execute in the original sequential order. "
    ABSTRACT: Recent hardware trends with GPUs and the increasing vector lengths of SSE-like ISA extensions for multicore CPUs imply that effective exploitation of SIMD parallelism is critical for achieving high performance on emerging and future architectures. A vast majority of existing applications were developed without any attention by their developers towards effective vectorizability of the codes. While developers of production compilers such as GNU gcc, Intel icc, PGI pgcc, and IBM xlc have invested considerable effort and made significant advances in enhancing automatic vectorization capabilities, these compilers still cannot effectively vectorize many existing scientific and engineering codes. It is therefore of considerable interest to analyze existing applications to assess the inherent latent potential for SIMD parallelism, exploitable through further compiler advances and/or via manual code changes. In this paper we develop an approach to infer a program's SIMD parallelization potential by analyzing the dynamic data-dependence graph derived from a sequential execution trace. By considering only the observed run-time data dependences for the trace, and by relaxing the execution order of operations to allow any dependence-preserving reordering, we can detect potential SIMD parallelism that may otherwise be missed by more conservative compile-time analyses. We show that for several benchmarks our tool discovers regions of code within computationally-intensive loops that exhibit high potential for SIMD parallelism but are not vectorized by state-of-the-art compilers. We present several case studies of the use of the tool, both in identifying opportunities to enhance the transformation capabilities of vectorizing compilers, as well as in pointing to code regions to manually modify in order to enable auto-vectorization and performance improvement by existing compilers.
    ACM SIGPLAN Notices 06/2012; 47(6). DOI:10.1145/2254064.2254108 · 0.66 Impact Factor
    • "This significantly complicates performance modeling , often resulting in a large space of possibly effective parallelism configurations. Existing compiler-based parallelization algorithms for such general-purpose programs select a single configuration , typically one deemed most suitable for an unloaded platform [34] [43] [47]. This paper presents Parcae, a compiler and run-time software system that delivers performance portability for both array-based programs and general-purpose programs, extending the applicability of prior work. "
    ABSTRACT: Workload, platform, and available resources constitute a parallel program's execution environment. Most parallelization efforts statically target an anticipated range of environments, but performance generally degrades outside that range. Existing approaches address this problem with dynamic tuning but do not optimize a multiprogrammed system holistically. Further, they either require manual programming effort or are limited to array-based data-parallel programs. This paper presents Parcae, a generally applicable automatic system for platform-wide dynamic tuning. Parcae includes (i) the Nona compiler, which creates flexible parallel programs whose tasks can be efficiently reconfigured during execution; (ii) the Decima monitor, which measures resource availability and system performance to detect change in the environment; and (iii) the Morta executor, which cuts short the life of executing tasks, replacing them with other functionally equivalent tasks better suited to the current environment. Parallel programs made flexible by Parcae outperform original parallel implementations in many interesting scenarios.
    ACM SIGPLAN Notices 06/2012; 47(6). DOI:10.1145/2254064.2254082 · 0.66 Impact Factor
    • "There are research compilers which parallelize applications using speculation [11] [13] [16] [25]. However, these compilers assume the availability of specialized hardware or cache-coherent shared memory, and their performance is evaluated using a small number of cores (typically fewer than 8). "
    ABSTRACT: Automatic parallelization for clusters is a promising alternative to time-consuming, error-prone manual parallelization. However, automatic parallelization is frequently limited by the imprecision of static analysis. Moreover, due to the inherent fragility of static analysis, small changes to the source code can significantly undermine performance. By replacing static analysis with speculation and profiling, automatic parallelization becomes more robust and applicable. A naïve automatic speculative parallelization does not scale for distributed memory clusters, due to the high bandwidth required to validate speculation. This work is the first automatic speculative DOALL (Spec-DOALL) parallelization system for clusters. We have implemented a prototype automatic parallelization system, called Cluster Spec-DOALL, which consists of a Spec-DOALL parallelizing compiler and a speculative runtime for clusters. Since the compiler optimizes communication patterns, and the runtime is optimized for the cases in which speculation succeeds, Cluster Spec-DOALL minimizes the communication and validation overheads of the speculative runtime. Across 8 benchmarks, Cluster Spec-DOALL achieves a geomean speedup of 43.8x on a 120-core cluster, whereas DOALL without speculation achieves only 4.5x speedup. This demonstrates that speculation makes scalable fully-automatic parallelization for clusters possible.
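
A common thread in the citing work quoted above is parallelism that static analysis cannot prove safe but that a dynamic trace reveals. The hypothetical C loop below (a sketch, not drawn from any of these papers) illustrates the situation: the compiler must assume that dst and src may alias through the pointer parameters and therefore keeps the loop sequential and unvectorized, yet in a typical run the buffers are disjoint, every iteration is independent, and a profile-guided or speculative system can parallelize or vectorize it.

#include <stdio.h>
#include <stdlib.h>

/* Without further information the compiler must assume dst and src may
 * alias, so the loop carries a potential dependence and is left sequential
 * and unvectorized.  A dynamic trace of a run with disjoint buffers shows
 * no cross-iteration dependence, which is the opportunity that
 * profile-guided and speculative parallelizers exploit. */
static void scale(float *dst, const float *src, float k, int n)
{
    for (int i = 0; i < n; i++)
        dst[i] = k * src[i];   /* independent iterations when dst and src do not overlap */
}

int main(void)
{
    enum { N = 1024 };
    float *a = malloc(N * sizeof *a);
    float *b = malloc(N * sizeof *b);
    if (a == NULL || b == NULL)
        return 1;
    for (int i = 0; i < N; i++)
        b[i] = (float)i;
    scale(a, b, 2.0f, N);      /* disjoint buffers in this run */
    printf("a[10] = %f\n", a[10]);
    free(a);
    free(b);
    return 0;
}

Declaring the pointers restrict, or guarding the loop with a run-time overlap check, is the kind of change a programmer could make by hand; the speculative systems described above instead validate the no-dependence assumption at run time and recover if it fails.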