Conference Proceeding

Uncovering hidden loop level parallelism in sequential applications

Adv. Comput. Archit. Lab., Univ. of Michigan, Ann Arbor, MI
03/2008; DOI:10.1109/HPCA.2008.4658647 pp. 290-301. In proceedings of: High Performance Computer Architecture, 2008. HPCA 2008. IEEE 14th International Symposium on
Source: IEEE Xplore

ABSTRACT As multicore systems become the dominant mainstream computing technology, one of the most difficult challenges the industry faces is the software. Applications with large amounts of explicit thread-level parallelism naturally scale performance with the number of cores, but single-threaded applications realize little to no gain from additional cores. One solution to this problem is automatic parallelization, which frees the programmer from the difficult task of parallel programming and offers hope for handling the vast amount of legacy single-threaded software. There is a long history of automatic parallelization for scientific applications, but the techniques have generally failed in the context of general-purpose software. Thread-level speculation overcomes the problem of memory dependence analysis by speculating on unlikely dependences that serialize execution. However, this approach has led to only modest performance gains. In this paper, we take another look at exploiting loop-level parallelism in single-threaded applications. We show that substantial amounts of loop-level parallelism are available in general-purpose applications, but it lurks beneath the surface and is often obfuscated by a small number of data and control dependences. We adapt and extend several code transformations from the instruction-level and scientific parallelization communities to uncover the hidden parallelism. Our results show that 61% of the dynamic execution of the studied benchmarks can be parallelized with our techniques, compared to 27% using traditional thread-level speculation techniques, resulting in a speedup of 1.84 on a four-core system compared to 1.41 without transformations.
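The abstract does not name the specific transformations, so the following is only an illustrative sketch of the general idea, not the paper's method. It assumes one common case of a "hidden" parallel loop: a scalar accumulator that every iteration reads and writes, which makes the loop look serial even though the iterations are otherwise independent. Giving each worker a private accumulator and merging the partial results afterwards removes that cross-iteration dependence. All function names here are hypothetical, and the example uses integers so that reordering the additions cannot change the result.

```python
from concurrent.futures import ThreadPoolExecutor


def sequential_sum_of_squares(values):
    """Original loop: `total` is carried from one iteration to the next."""
    total = 0
    for v in values:
        total += v * v  # cross-iteration dependence on `total`
    return total


def chunk_sum_of_squares(chunk):
    """One worker's share: accumulates into its own private partial sum."""
    partial = 0
    for v in chunk:
        partial += v * v
    return partial


def parallel_sum_of_squares(values, workers=4):
    """Transformed loop: one private accumulator per chunk, merged at the end."""
    values = list(values)
    size = max(1, -(-len(values) // workers))  # ceiling division, at least 1
    chunks = [values[i:i + size] for i in range(0, len(values), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(chunk_sum_of_squares, chunks))
    return sum(partials)  # the only remaining serial step


if __name__ == "__main__":
    data = range(1000)
    assert parallel_sum_of_squares(data) == sequential_sum_of_squares(data)
```

The sketch shows the dependence structure rather than a performance result: in CPython, threads running pure-Python arithmetic are serialized by the global interpreter lock, so no speedup should be expected from this code as written.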

  • Source
    Conference Proceeding: Dynamic parallelization of JavaScript applications using an ultra-lightweight speculation mechanism
    ABSTRACT: As the web becomes the platform of choice for execution of more complex applications, a growing portion of computation is handed off by developers to the client side to reduce network traffic and improve application responsiveness. Therefore, the client-side component, often written in JavaScript, is becoming larger and more compute-intensive, increasing the demand for high performance JavaScript execution. This has led to many recent efforts to improve the performance of JavaScript engines in web browsers. Furthermore, considering the widespread deployment of multi-cores in today's computing systems, exploiting parallelism in these applications is a promising approach to meet their performance requirements. However, JavaScript has traditionally been treated as a sequential language with no support for multithreading, limiting its potential to make use of the extra computing power in multicore systems. In this work, to exploit hardware concurrency while retaining the traditional sequential programming model, we develop ParaScript, an automatic runtime parallelization system for JavaScript applications on the client's browser. First, we propose an optimistic runtime scheme for identifying parallelizable regions, generating the parallel code on-the-fly, and speculatively executing it. Second, we introduce an ultra-lightweight software speculation mechanism to manage parallel execution. This speculation engine consists of a selective checkpointing scheme and a novel runtime dependence detection mechanism based on reference counting and range-based array conflict detection. Our system is able to achieve an average of 2.18× speedup over the Firefox browser using 8 threads on commodity multi-core systems, while performing all required analyses and conflict detection dynamically at runtime.
    High Performance Computer Architecture (HPCA), 2011 IEEE 17th International Symposium on; 03/2011
  • Source
    Article: Compiler Assisted Out-Of-Order Instruction Commit
    ABSTRACT: This paper proposes an out-of-order instruction commit mechanism using a novel compiler/architecture interface. The compiler provides information about instruction "blocks" and the processor uses the block information to decide which instructions can be committed out of order and when. Some blocks are guaranteed to be data-independent, which allows instructions from different such blocks to be committed simultaneously and out of order. Other blocks have data or control dependencies and require in-order execution and in-order commit. The micro-architectural support required for the new commit mode is built on top of the standard, ROB-based commit and includes out-of-order instruction commit, early register release, support for committing loads and stores out of order, and exception handling. All of these are driven by the block information, which simplifies the hardware. Results for a 4-wide processor model based on the Alpha 21264 and a set of 6 SPEC2000 and 2006 benchmarks show that, on average, 52% of instructions are committed out of order, resulting in 10% to 26% speedups over in-order commit with minimal hardware overhead.
    12/2010;
  • Source
    Conference Proceeding: Runtime parallelization of legacy code on a transactional memory system.
    High Performance Embedded Architectures and Compilers, 6th International Conference, HiPEAC 2011, Heraklion, Crete, Greece, January 24-26, 2011. Proceedings; 01/2011

Keywords

additional cores
automatic parallelization
control dependences
difficult task
explicit thread-level parallelism
exploiting loop-level parallelism
four core system
general-purpose applications
general-purpose software
hidden parallelism
legacy single-threaded software
loop-level parallelism
memory dependence analysis
modest performance gains
scientific applications
scientific parallelization communities
single-threaded applications
traditional thread-level speculation techniques
unlikely dependences
vast amount