Conference Paper

Runahead Threads to Improve SMT Performance

Universitat Politècnica de Catalunya, Barcelona
DOI: 10.1109/HPCA.2008.4658635 Conference: 14th International Conference on High-Performance Computer Architecture (HPCA-14 2008), 16-20 February 2008, Salt Lake City, UT, USA
Source: IEEE Xplore

ABSTRACT: In this paper, we propose runahead threads (RaT) as a valuable solution for both reducing resource contention and exploiting memory-level parallelism in simultaneous multithreaded (SMT) processors. Our technique converts a resource-intensive memory-bound thread into a speculative light thread when it blocks on a long-latency memory operation. These speculative threads prefetch data and instructions with minimal resources, reducing critical resource conflicts between threads. We compare an SMT architecture using RaT to both state-of-the-art static fetch policies and dynamic resource control policies. In terms of throughput and fairness, our results show that RaT performs better than any other policy. The proposed mechanism improves average throughput by 37% relative to previous static fetch policies and by 28% compared to previous dynamic resource scheduling mechanisms. RaT also improves fairness by 36% and 30%, respectively. In addition, the proposed mechanism permits a register file size reduction of up to 60% in an SMT processor without performance degradation.
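
To make the mechanism concrete, here is a minimal, simulator-style sketch of the RaT mode switch the abstract describes. It is an illustration under assumed names only (the Thread class, the latency threshold, and the checkpoint helpers are invented for this example), not the paper's hardware implementation.

```python
# Illustrative sketch of the Runahead Threads (RaT) trigger policy.
# All names here (Thread, LONG_LATENCY_CYCLES, checkpoint helpers) are
# hypothetical; the paper describes a hardware mechanism, not this code.

LONG_LATENCY_CYCLES = 100  # assumed threshold for a "long-latency" miss

class Thread:
    def __init__(self, tid):
        self.tid = tid
        self.runahead = False
        self.checkpoint = None

    def on_load_miss(self, expected_latency):
        # Instead of stalling and holding shared SMT resources, the thread
        # checkpoints its state and becomes a speculative "light" thread.
        if not self.runahead and expected_latency >= LONG_LATENCY_CYCLES:
            self.checkpoint = self.save_arch_state()
            self.runahead = True  # subsequent instructions execute only to
                                  # prefetch data/instructions, releasing
                                  # shared resources eagerly

    def on_miss_return(self):
        # When the blocking miss completes, speculative results are thrown
        # away and normal execution resumes from the checkpoint.
        if self.runahead:
            self.restore_arch_state(self.checkpoint)
            self.runahead = False

    def save_arch_state(self):
        ...  # placeholder: copy the architectural register state

    def restore_arch_state(self, state):
        ...  # placeholder: restore the architectural register state
```

A runahead thread never commits results, so it can release physical registers early; this is consistent with the register file reduction reported above.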

  • ABSTRACT: Threads experiencing long-latency loads on a simultaneous multithreading (SMT) processor may clog shared processor resources without making forward progress, thereby starving other threads and reducing overall system throughput. An elegant solution to the long-latency load problem in SMT processors is to employ runahead execution. Runahead threads do not block commit on a long-latency load but instead execute subsequent instructions in a speculative execution mode to expose memory-level parallelism (MLP) through prefetching. The key benefit of runahead SMT threads is twofold: (i) runahead threads do not clog resources on a long-latency load, and (ii) runahead threads exploit far-distance MLP. This paper proposes MLP-aware runahead threads: runahead execution is only initiated when there is far-distance MLP to be exploited. By doing so, useless runahead executions are eliminated, thereby reducing the number of speculatively executed instructions (and thus energy consumption) while preserving the performance of the runahead thread and potentially improving the performance of the co-executing thread(s). Our experimental results show that MLP-aware runahead threads reduce the number of speculatively executed instructions by 13.9% and 10.1% for two-program and four-program workloads, respectively, compared to MLP-agnostic runahead threads while achieving comparable system throughput and job turnaround time. (A toy sketch of this gating decision appears after this list.)
    High Performance Embedded Architectures and Compilers, Fourth International Conference, HiPEAC 2009, Paphos, Cyprus, January 25-28, 2009. Proceedings; 01/2009
  • ABSTRACT: Memory-intensive threads can hoard shared resources without making progress on a simultaneous multithreading (SMT) processor, thereby hindering overall system performance. A recent promising solution to this important problem in SMT processors is Runahead Threads (RaT). RaT employs runahead execution to allow a thread to speculatively execute instructions and prefetch data instead of stalling on a long-latency load. The main advantage of this mechanism is that it exploits memory-level parallelism under long-latency loads without clogging up shared resources. As a result, RaT improves overall processor performance by reducing resource contention among threads. In this paper, we propose simple code-semantics-based techniques to increase RaT efficiency. Our proposals are based on analyzing the prefetch opportunities (usefulness) of loops and subroutines during runahead thread execution. We dynamically analyze these program structures to detect when it is useful to continue the runahead thread's execution. Based on this dynamic information, the proposed techniques make a control decision either to avoid or to stall loop or subroutine execution in runahead threads. Our experimental results show that our best proposal significantly reduces speculative instruction execution (33% on average) while maintaining, and in some cases even improving (by up to 3%), the performance of RaT.
    ICPP 2009, International Conference on Parallel Processing, Vienna, Austria, 22-25 September 2009; 01/2009
  • ABSTRACT: Workload construction methodologies for multi-program experiments are more complicated than those for single-program experiments. Fixed-workload methodologies pre-select samples from each program and use these in every experiment. They enable direct comparisons between experiments, but may also yield runs in which significant portions are spent executing only the slowest program(s). Variable-workload methodologies eliminate this load imbalance by using the multi-program run to define the workload, normalizing performance to the performance of the resulting individual program regions. However, they make direct comparisons difficult and tend to produce workloads that over-estimate throughput and speedup. We propose a multi-program workload methodology called FIESTA which is based on the observation that there are two kinds of load imbalance. Sample imbalance is due to differences in standalone program running times. Schedule imbalance is due to asymmetric contention during multi-program execution. Sample imbalance is harmful because it dilutes multi-program behaviors. Schedule imbalance is a characteristic of concurrent execution that should be preserved and measured. Traditional fixed-workload methodologies admit both kinds of imbalance. Variable-workload methodologies eliminate both kinds of imbalance. FIESTA is a fixed-workload methodology that eliminates only sample imbalance. It does so by pre-selecting program regions for equal standalone running times rather than for equal instruction counts. (A toy sketch of this region selection appears after this list.)
    07/2009;
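
As referenced from the first entry above, the MLP-aware gating decision can be sketched as a simple PC-indexed predictor. This is a toy illustration under invented names (MLPPredictor, should_enter_runahead); the paper's actual predictor design may differ.

```python
# Toy sketch of MLP-aware runahead gating (first entry above): enter
# runahead only when far-distance memory-level parallelism is expected,
# so useless runahead episodes are never started. The predictor below is
# a hypothetical PC-indexed table, not the paper's design.

class MLPPredictor:
    def __init__(self):
        self.table = {}  # load PC -> did the last episode expose extra misses?

    def predicts_far_distance_mlp(self, load_pc):
        return self.table.get(load_pc, True)  # unseen loads: be optimistic

    def update(self, load_pc, exposed_mlp):
        # Record whether the finished runahead episode actually prefetched
        # additional independent long-latency misses.
        self.table[load_pc] = exposed_mlp

def should_enter_runahead(predictor, load_pc):
    # If no independent long-latency misses are expected within runahead
    # reach, the speculative instructions would be wasted energy; stall
    # instead of entering runahead mode.
    return predictor.predicts_far_distance_mlp(load_pc)
```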
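
Similarly, for the FIESTA entry: its core step, selecting per-program regions of equal standalone running time rather than equal instruction count, can be sketched as below. The inputs and function names are invented for illustration.

```python
# Toy sketch of FIESTA-style sample selection (last entry above): choose a
# region from each program so that every region's *standalone* running time
# meets a common cycle budget, eliminating sample imbalance while leaving
# schedule imbalance (contention effects) intact. Names are hypothetical.

def select_regions(per_instruction_cycles, budget_cycles):
    """per_instruction_cycles: one list per program, giving the standalone
    cycle cost of each dynamic instruction. Returns region lengths
    (instruction counts) whose standalone times all reach the budget."""
    regions = []
    for trace in per_instruction_cycles:
        elapsed = 0
        length = 0
        for cycles in trace:
            if elapsed >= budget_cycles:
                break
            elapsed += cycles
            length += 1
        regions.append(length)
    return regions

# Example: two programs with different per-instruction costs get different
# instruction counts but the same standalone running time.
print(select_regions([[1, 1, 1, 1], [2, 2, 2, 2]], budget_cycles=4))  # [4, 2]
```

Because every region takes the same time when run alone, any imbalance observed in the multi-program run is attributable to contention, which is exactly what FIESTA aims to preserve and measure.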
