Runahead Threads to improve SMT performance

Conference Paper · February 2008
DOI: 10.1109/HPCA.2008.4658635 · Source: DBLP
Conference: 14th International Conference on High-Performance Computer Architecture (HPCA-14 2008), 16-20 February 2008, Salt Lake City, UT, USA
Abstract
In this paper, we propose runahead threads (RaT) as a valuable solution for both reducing resource contention and exploiting memory-level parallelism in simultaneous multithreaded (SMT) processors. Our technique converts a resource-intensive, memory-bound thread into a speculative light thread while it is blocked on a long-latency memory operation. These speculative threads prefetch data and instructions with minimal resources, reducing critical resource conflicts between threads. We compare an SMT architecture using RaT to both state-of-the-art static fetch policies and dynamic resource control policies. In terms of throughput and fairness, our results show that RaT performs better than any other policy. The proposed mechanism improves average throughput by 37% relative to previous static fetch policies and by 28% compared to previous dynamic resource scheduling mechanisms. RaT also improves fairness by 36% and 30%, respectively. In addition, the proposed mechanism permits register file size reduction of up to 60% in an SMT processor without performance degradation.
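The runahead mechanism summarized in the abstract can be sketched as a simple state machine: on a long-latency load miss, the thread checkpoints its architectural state and switches to a speculative runahead mode (fetching only to prefetch); when the miss data returns, the speculative work is discarded and the checkpoint restored. This is a minimal illustrative model, not the paper's implementation; the class, names, and the latency threshold are assumptions.

```python
# Illustrative sketch of the RaT mode switch on one SMT thread context.
# The trigger threshold and state representation are assumed for clarity.

RUNAHEAD_TRIGGER_CYCLES = 30  # assumed "long-latency" (L2-miss) threshold

class Thread:
    def __init__(self, tid):
        self.tid = tid
        self.mode = "normal"        # "normal" or "runahead"
        self.checkpoint = None      # snapshot of architectural state

    def on_load_miss(self, latency, arch_state):
        # Convert to a runahead thread only for long-latency misses;
        # short misses stall normally.
        if self.mode == "normal" and latency >= RUNAHEAD_TRIGGER_CYCLES:
            self.checkpoint = dict(arch_state)  # checkpoint registers/PC
            self.mode = "runahead"

    def on_miss_data_returned(self):
        # Speculative (runahead) results are discarded; execution resumes
        # from the checkpoint, now with data prefetched into the caches.
        if self.mode == "runahead":
            restored = self.checkpoint
            self.mode = "normal"
            self.checkpoint = None
            return restored
        return None

t = Thread(0)
t.on_load_miss(latency=200, arch_state={"pc": 0x400123, "r1": 7})
state = t.on_miss_data_returned()
```

The key property this sketch captures is that runahead execution is side-effect free from the architectural point of view: everything between the checkpoint and the data return exists only to warm up the caches.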
    • "Tullsen et al. [31] realized the importance of resource partitioning and fetch policies on SMT performance, and proposed the ICOUNT mechanism as an effective solution. Follow-on research has proposed further refinements and improvements, such as flush [30], MLP-aware flush [10], DCRA [4], hill-climbing [7], runahead threads [21], etc. Chandra et al. [6] propose an analytical model that predicts the number of additional misses for each thread due to cache sharing. The input to the model is the per-thread L2 stack distance distribution."
    ABSTRACT: Symbiotic job scheduling boosts simultaneous multithreading (SMT) processor performance by co-scheduling jobs that have 'compatible' demands on the processor's shared resources. Existing approaches however require a sampling phase, evaluate a limited number of possible co-schedules, use heuristics to gauge symbiosis, are rigid in their optimization target, and do not preserve system-level priorities/shares. This paper proposes probabilistic job symbiosis modeling, which predicts whether jobs will create positive or negative symbiosis when co-scheduled without requiring the co-schedule to be evaluated. The model, which uses per-thread cycle stacks computed through a previously proposed cycle accounting architecture, is simple enough to be used in system software. Probabilistic job symbiosis modeling provides six key innovations over prior work in symbiotic job scheduling: (i) it does not require a sampling phase, (ii) it readjusts the job co-schedule continuously, (iii) it evaluates a large number of possible co-schedules at very low overhead, (iv) it is not driven by heuristics, (v) it can optimize a performance target of interest (e.g., system throughput or job turnaround time), and (vi) it preserves system-level priorities/shares. These innovations make symbiotic job scheduling both practical and effective. Our experimental evaluation, which assumes a realistic scenario in which jobs come and go, reports an average 16% (and up to 35%) reduction in job turnaround time compared to the previously proposed SOS (sample, optimize, symbios) approach for a two-thread SMT processor, and an average 19% (and up to 45%) reduction in job turnaround time for a four-thread SMT processor.
    Conference Paper · Mar 2010
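The central idea of the abstract above is that per-thread cycle stacks alone let a scheduler score every candidate co-schedule without running it. The sketch below is a deliberately simplified stand-in, not the paper's probabilistic model: it splits each job's standalone cycles into a compute and a memory component and scores a two-job pairing by the product of the memory fractions (memory-bound jobs co-scheduled together are assumed to conflict most). The scoring formula and all names are assumptions.

```python
# Toy co-schedule ranking from per-thread cycle stacks (illustrative only).
from itertools import combinations

def interference(stack_a, stack_b):
    # Each stack: {"base": compute cycles, "memory": miss cycles},
    # measured standalone. Higher score = worse predicted symbiosis.
    mem_frac_a = stack_a["memory"] / (stack_a["base"] + stack_a["memory"])
    mem_frac_b = stack_b["memory"] / (stack_b["base"] + stack_b["memory"])
    return mem_frac_a * mem_frac_b

def best_pairing(stacks):
    # Score every 2-job co-schedule at negligible cost -- no sampling
    # phase is needed, since the score comes from the cycle stacks alone.
    return min(combinations(stacks, 2),
               key=lambda pair: interference(stacks[pair[0]],
                                             stacks[pair[1]]))

stacks = {
    "mcf": {"base": 30, "memory": 70},   # memory-bound
    "gcc": {"base": 80, "memory": 20},   # compute-bound
    "art": {"base": 40, "memory": 60},
}
pair = best_pairing(stacks)
```

Because the score is a closed-form function of the stacks, it can be re-evaluated continuously as jobs come and go, which is exactly the practicality argument the abstract makes against sampling-based approaches.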
    • "Previous studies [16][7] have shown that RaT provides better SMT performance and fairness than prior SMT resource management policies. Nevertheless, its advantages come at the cost of executing a large amount of useless speculative work."
    ABSTRACT: Runahead Threads (RaT) is a promising solution that enables a thread to speculatively run ahead and prefetch data instead of stalling for a long-latency load in a simultaneous multithreading processor. With this capability, RaT reduces resource monopolization due to memory-intensive threads and exploits memory-level parallelism, improving both system performance and single-thread performance. Unfortunately, the benefits of RaT come at the expense of increasing the number of executed instructions, which adversely affects its energy efficiency. In this paper, we propose Runahead Distance Prediction (RDP), a simple technique to improve the efficiency of Runahead Threads. The main idea of the RDP mechanism is to predict how far a thread should run ahead speculatively such that speculative execution is useful. By limiting the runahead distance of a thread, we generate efficient runahead threads that avoid unnecessary speculative execution and enhance RaT energy efficiency. By reducing runahead-based speculation when it is predicted to be not useful, RDP also allows shared resources to be efficiently used by non-speculative threads. Our results show that RDP significantly reduces power consumption while maintaining the performance of RaT, providing better performance and energy balance than previous proposals in the field.
    Conference Paper · Jan 2010
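The RDP idea described above — predict how far runahead is still useful and cap or skip the episode accordingly — can be sketched as a small prediction table indexed by the PC of the missing load. This is a hypothetical organization for illustration; the table structure, training rule, and thresholds are assumptions, not the paper's design.

```python
# Hypothetical Runahead Distance Prediction (RDP) table sketch.
class RunaheadDistancePredictor:
    def __init__(self, default_distance=256):
        self.table = {}              # load PC -> predicted useful distance
        self.default = default_distance

    def predicted_distance(self, load_pc):
        # Unseen loads start optimistic (full default runahead window).
        return self.table.get(load_pc, self.default)

    def should_enter_runahead(self, load_pc, min_useful=16):
        # If speculation past this load is predicted useless, skip
        # runahead entirely so shared resources stay available to
        # non-speculative threads.
        return self.predicted_distance(load_pc) >= min_useful

    def update(self, load_pc, last_useful_distance):
        # Train toward the farthest instruction that still produced a
        # useful prefetch in the episode that just ended.
        self.table[load_pc] = last_useful_distance

rdp = RunaheadDistancePredictor()
rdp.update(0x400, 8)   # last episode stopped being useful after 8 insts
```

Capping the episode at `predicted_distance` (or skipping it when the prediction falls below `min_useful`) is what trades a small amount of prefetch coverage for the large reduction in useless speculative instructions the abstract reports.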
    • "…formance, the FIESTA-2wide workload yields more balanced runs and more accurate SMT-speedups than a Fixed workload. SMT vs. RaT. Figure 7 compares multi-program architectures with different single-program baselines: SMT's baseline is a "vanilla" ROB processor, the single-program baseline for RaT (Runahead Threads) [13] is a Runahead processor. We compare average SMT-speedups over vanilla ROB using different workload methodologies."
    ABSTRACT: Workload construction methodologies for multi-program experiments are more complicated than those for single-program experiments. Fixed-workload methodologies pre-select samples from each program and use these in every experiment. They enable direct comparisons between experiments, but may also yield runs of which significant portions are spent executing only the slowest program(s). Variable-workload methodologies eliminate this load imbalance by using the multi-program run to define the workload, normalizing performance to the performance of the resulting individual program regions. However, they make direct comparisons difficult and tend to produce workloads that over-estimate throughput and speedup. We propose a multi-program workload methodology called FIESTA which is based on the observation that there are two kinds of load imbalance. Sample imbalance is due to differences in standalone program running times. Schedule imbalance is due to asymmetric contention during multi-program execution. Sample imbalance is harmful because it dilutes multi-program behaviors. Schedule imbalance is a characteristic of concurrent execution that should be preserved and measured. Traditional fixed-workload methodologies admit both kinds of imbalance. Variable-workload methodologies eliminate both kinds of imbalance. FIESTA is a fixed-workload methodology that eliminates only sample imbalance. It does so by pre-selecting program regions for equal standalone running times rather than for equal instruction counts.
    Article · Jul 2009
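FIESTA's selection rule — pick per-program regions of equal standalone running time rather than equal instruction count — can be illustrated with a short sketch. Assuming each program's standalone execution has been profiled as per-interval cycle counts (the function name and the prefix-based region choice are this sketch's assumptions, not necessarily FIESTA's exact procedure):

```python
# Illustrative equal-standalone-time region selection, FIESTA-style.
def equal_time_regions(programs, target_cycles):
    """programs: name -> list of per-interval standalone cycle counts.
    Returns name -> number of leading intervals whose summed standalone
    time first reaches target_cycles. A fast program contributes many
    intervals and a slow one few, so every region costs the same
    standalone time -- removing sample imbalance while leaving schedule
    imbalance (contention during the actual co-run) to be measured."""
    regions = {}
    for name, cycles in programs.items():
        total, n = 0, 0
        while n < len(cycles) and total < target_cycles:
            total += cycles[n]
            n += 1
        regions[name] = n
    return regions

# A fast program (10 cycles/interval) and a slow one (50 cycles/interval)
# both get regions worth ~200 standalone cycles.
regions = equal_time_regions({"fast": [10] * 100, "slow": [50] * 100},
                             target_cycles=200)
```

Contrast this with a fixed-instruction-count workload, where the slow program's region would dominate the run and dilute the multi-program behavior the experiment is meant to measure.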