Conference Paper

Runahead Threads to improve SMT performance

Univ. Politècnica de Catalunya, Barcelona
DOI: 10.1109/HPCA.2008.4658635 Conference: 14th International Conference on High-Performance Computer Architecture (HPCA-14 2008), 16-20 February 2008, Salt Lake City, UT, USA
Source: IEEE Xplore


In this paper, we propose runahead threads (RaT) as a valuable solution for both reducing resource contention and exploiting memory-level parallelism in simultaneous multithreaded (SMT) processors. Our technique converts a resource-intensive, memory-bound thread into a lightweight speculative thread while it is blocked on a long-latency memory operation. These speculative threads prefetch data and instructions with minimal resources, reducing critical resource conflicts between threads. We compare an SMT architecture using RaT to both state-of-the-art static fetch policies and dynamic resource control policies. In terms of throughput and fairness, our results show that RaT performs better than any other policy. The proposed mechanism improves average throughput by 37% over previous static fetch policies and by 28% over previous dynamic resource scheduling mechanisms. RaT also improves fairness by 36% and 30%, respectively. In addition, the proposed mechanism permits a register file size reduction of up to 60% in an SMT processor without performance degradation.
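
The memory-level parallelism argument in the abstract can be illustrated with a toy model. The sketch below is not the paper's simulator or methodology. It assumes one instruction per cycle, independent L2 misses, no bandwidth limits, and that any miss falling within a given window of instructions after a blocking miss overlaps with it. The names `count_stall_episodes`, `total_cycles`, `ROB_SIZE`, and `MISS_LATENCY`, as well as all numeric values, are hypothetical and chosen only for illustration.

```python
def count_stall_episodes(miss_positions, window):
    """Count full-latency stalls, assuming every miss that lies within
    `window` instructions of a blocking miss overlaps with that miss."""
    episodes = 0
    covered_until = -1  # last instruction index already overlapped
    for pos in sorted(miss_positions):
        if pos > covered_until:
            # This miss is not hidden by an earlier one: the thread stalls.
            episodes += 1
            covered_until = pos + window - 1
    return episodes


def total_cycles(num_instructions, miss_positions, window, miss_latency):
    """One cycle per instruction plus one full miss latency per stall."""
    stalls = count_stall_episodes(miss_positions, window)
    return num_instructions + miss_latency * stalls


# Hypothetical parameters (assumptions, not values from the paper).
ROB_SIZE = 64          # instructions the thread can hold in flight
MISS_LATENCY = 300     # cycles for a long-latency (L2-miss) load
NUM_INSTRUCTIONS = 1000
misses = list(range(0, NUM_INSTRUCTIONS, 100))  # one miss every 100 instructions

# Stalling thread: only misses inside the reorder buffer window overlap.
baseline = total_cycles(NUM_INSTRUCTIONS, misses, ROB_SIZE, MISS_LATENCY)

# Runahead thread: it keeps executing speculatively for the whole miss
# latency, so it reaches (and prefetches) misses up to MISS_LATENCY
# instructions ahead under the one-instruction-per-cycle assumption.
runahead = total_cycles(NUM_INSTRUCTIONS, misses, MISS_LATENCY, MISS_LATENCY)

print("baseline cycles:", baseline)
print("runahead cycles:", runahead)
```

Under these assumed numbers, the stalling thread pays the full latency for all 10 misses, while the runahead thread pays it for only 4 of them because the remaining misses were prefetched during runahead. The model ignores the resource-contention side of RaT entirely; it only shows why running ahead past the reorder buffer limit can expose more overlapping misses.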

Available from:
  • Source
    • "Thus, a thread that misses in the L2 cache cannot execute more instructions than the reorder buffer size permits, which cannot be scaled without introducing large additional complexity. The Runahead Threads (RaT) approach [16] exploits MLP by applying runahead execution [14] to any running thread when a long-latency load is pending. RaT allows memory-intensive threads to advance speculatively in a multithreaded environment instead of stalling the thread, doing beneficial work (prefetching) to improve the performance."
    ABSTRACT: Runahead Threads (RaT) is a promising solution that enables a thread to speculatively run ahead and prefetch data instead of stalling for a long-latency load in a simultaneous multithreading processor. With this capability, RaT reduces resource monopolization by memory-intensive threads and exploits memory-level parallelism, improving both system performance and single-thread performance. Unfortunately, the benefits of RaT come at the expense of increasing the number of executed instructions, which adversely affects its energy efficiency. In this paper, we propose Runahead Distance Prediction (RDP), a simple technique to improve the efficiency of Runahead Threads. The main idea of the RDP mechanism is to predict how far a thread should run ahead speculatively such that speculative execution is useful. By limiting the runahead distance of a thread, we generate efficient runahead threads that avoid unnecessary speculative execution and enhance RaT energy efficiency. By reducing runahead-based speculation when it is predicted to be not useful, RDP also allows shared resources to be efficiently used by non-speculative threads. Our results show that RDP significantly reduces power consumption while maintaining the performance of RaT, providing a better balance of performance and energy than previous proposals in the field.
    19th International Conference on Parallel Architecture and Compilation Techniques (PACT 2010), Vienna, Austria, September 11-15, 2010
  • Source
    • "FIESTA is simple and intuitive. Nevertheless, it is not used in any of the works we have examined, most of which use variable-workload schemes [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [12] [13] [14] [15] [16] [18] [20] [22] [23]. We believe that this is because the distinction between sample imbalance and schedule imbalance has not been clearly articulated."
    ABSTRACT: Workload construction methodologies for multi-program experiments are more complicated than those for single-program experiments. Fixed-workload methodologies pre-select samples from each program and use these in every experiment. They enable direct comparisons between experiments, but may also yield runs of which significant portions are spent executing only the slowest program(s). Variable-workload methodologies eliminate this load imbalance by using the multi-program run to define the workload, normalizing performance to the performance of the resulting individual program regions. However, they make direct comparisons difficult and tend to produce workloads that over-estimate throughput and speedup. We propose a multi-program workload methodology called FIESTA which is based on the observation that there are two kinds of load imbalance. Sample imbalance is due to differences in standalone program running times. Schedule imbalance is due to asymmetric contention during multi-program execution. Sample imbalance is harmful because it dilutes multi-program behaviors. Schedule imbalance is a characteristic of concurrent execution that should be preserved and measured. Traditional fixed-workload methodologies admit both kinds of imbalance. Variable-workload methodologies eliminate both kinds of imbalance. FIESTA is a fixed-workload methodology that eliminates only sample imbalance. It does so by pre-selecting program regions for equal standalone running times rather than for equal instruction counts.
  • Source
    ABSTRACT: Memory-intensive threads can hoard shared resources without making progress on a simultaneous multithreading (SMT) processor, thereby hindering overall system performance. A recent promising solution to this problem in SMT processors is Runahead Threads (RaT). RaT employs runahead execution to allow a thread to speculatively execute instructions and prefetch data instead of stalling for a long-latency load. The main advantage of this mechanism is that it exploits memory-level parallelism under long-latency loads without clogging up shared resources. As a result, RaT improves overall processor performance by reducing resource contention among threads. In this paper, we propose simple techniques based on code semantics to increase RaT efficiency. Our proposals are based on analyzing the prefetch opportunities (usefulness) of loops and subroutines during runahead thread executions. We dynamically analyze these particular program structures to detect when it is useful or not to control the runahead thread execution. By means of this dynamic information, the proposed techniques make a control decision either to avoid or to stall the loop or subroutine execution in runahead threads. Our experimental results show that our best proposal significantly reduces speculative instruction execution (by 33% on average) while maintaining, and in some cases even improving (by up to 3%), the performance of RaT.
    ICPP 2009, International Conference on Parallel Processing, Vienna, Austria, 22-25 September 2009