The Impact of Speculative Execution on SMT Processors
ABSTRACT: By executing two or more threads concurrently, Simultaneous MultiThreading (SMT) architectures exploit both Instruction-Level Parallelism (ILP) and Thread-Level Parallelism (TLP) from the increased number of in-flight instructions fetched from multiple threads. However, as the pipelines of SMT processors grow wider and deeper, incorrect control speculation causes a significant number of these in-flight instructions to be discarded. Although more accurate branch predictors can reduce the number of discarded instructions, prediction accuracy cannot easily be scaled up, since aggressive branch prediction schemes depend strongly on the predictability inherent in the application programs. In this paper, we present an efficient thread scheduling mechanism for SMT processors, called SAFE-T (Speculation-Aware Front-End Throttling). It is easy to implement and allows an SMT processor to perform speculative execution of threads selectively, according to the confidence level of branch predictions, thereby preventing wrong-path instructions from being fetched. SAFE-T reduces the number of discarded instructions by 57.9% on average and improves instructions-per-cycle (IPC) performance by 14.7% on average over the ICOUNT policy across the multi-programmed workloads we simulate.
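The confidence-gated fetch policy described in the abstract can be sketched as a toy scheduler. This is a minimal illustration, not the paper's exact mechanism: the `Thread` record, the single low-confidence-branch threshold, and the two-slot fetch width are all assumptions made for the example; only the ICOUNT-style "fewest in-flight instructions first" ordering and the idea of throttling low-confidence threads come from the text.

```python
# Toy sketch of confidence-aware front-end throttling in the spirit of SAFE-T.
# The names and the single-threshold policy are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Thread:
    tid: int
    in_flight: int           # instructions currently in the pipeline (ICOUNT input)
    low_conf_branches: int   # unresolved branches flagged low-confidence

def fetch_schedule(threads, max_low_conf=1, slots=2):
    """Pick up to `slots` threads to fetch from this cycle.

    Threads whose unresolved low-confidence branch count exceeds the
    threshold are throttled (skipped), so wrong-path instructions are
    less likely to enter the pipeline; the remaining threads are
    prioritised by fewest in-flight instructions, as in ICOUNT.
    """
    eligible = [t for t in threads if t.low_conf_branches <= max_low_conf]
    eligible.sort(key=lambda t: t.in_flight)
    return [t.tid for t in eligible[:slots]]

# Thread 1 has three unresolved low-confidence branches, so it is throttled;
# threads 2 and 0 are fetched in ICOUNT order.
threads = [Thread(0, 10, 0), Thread(1, 4, 3), Thread(2, 7, 1)]
```
With `max_low_conf` raised high enough, the policy degenerates to plain ICOUNT, which is what the paper's 14.7% IPC improvement is measured against.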
- Source available from: cseweb.ucsd.edu
Conference Proceeding: Handling long-latency loads in a simultaneous multithreading processor
ABSTRACT: Simultaneous multithreading architectures have been defined previously with fully shared execution resources. When one thread in such an architecture experiences a very long-latency operation, such as a load miss, the thread will eventually stall, potentially holding resources which other threads could be using to make forward progress. This paper shows that in many cases it is better to free the resources associated with a stalled thread rather than keep that thread ready to immediately begin execution upon return of the loaded data. Several possible architectures are examined, and some simple solutions are shown to be very effective, achieving speedups close to 6.0 in some cases, and averaging 15% speedup with four threads and over 100% speedup with two threads running. Response times are cut in half for several workloads in open system experiments. Proceedings of the 34th ACM/IEEE International Symposium on Microarchitecture (MICRO-34); 01/2002
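The resource-holding problem this abstract describes can be illustrated with a back-of-envelope model. The function name, queue size, and the assumption that a thread's throughput is proportional to the queue entries available to it are all deliberate simplifications invented for this sketch; only the contrast between holding and freeing a stalled thread's entries comes from the paper.

```python
# Back-of-envelope model of why flushing a load-missed thread helps: while a
# miss is outstanding, the stalled thread's shared-queue entries either stay
# resident ("hold") or are freed for the running thread ("flush").
# All numbers below are illustrative, not measured.

def running_thread_ipc(queue_size, held_by_stalled, per_entry_ipc=0.1):
    """IPC the still-running thread achieves, assumed proportional to the
    queue entries left available to it (a deliberate simplification)."""
    free_entries = queue_size - held_by_stalled
    return free_entries * per_entry_ipc

hold_ipc  = running_thread_ipc(32, held_by_stalled=24)  # stalled thread keeps 24 of 32 entries
flush_ipc = running_thread_ipc(32, held_by_stalled=0)   # entries flushed and reclaimed
```
In this toy model, flushing quadruples the running thread's throughput during the miss; the paper's measured gains (15% with four threads, over 100% with two) are smaller because misses are only outstanding part of the time.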
- ABSTRACT: In this paper we discuss methods and metrics for comparing the performance of two simultaneous multithreading microarchitectures. We identify conditions under which the instructions-per-cycle metric may be misleading for comparing two simultaneous multithreading microarchitectures on the same amount of work. Part of the problem lies in the definition of what constitutes the same work. When simulating a mix of independent programs under the same initial conditions on two different simultaneous multithreading microarchitectures, there are two approaches to ensuring that the work of the two runs is the same: constant-work-per-thread or variable-work-per-thread. For both approaches the total number of instructions in the run is constant; however, for the first, the number of instructions from each thread is also constant, whereas for the second it is not. We claim that: (1) when simulating two microarchitectures with the constant-work-per-thread approach, instructions-per-cycle is sufficient to establish the microarchitecture with the best performance; (2) when the variable-work-per-thread approach is used, instructions-per-cycle may be inadequate for comparing performance. We attribute this to the inability of the instructions-per-cycle metric to account for differences in the load balance of the two runs. A version of this work appears in ISPASS-2001, November 4-6, 2001 (www.ispass.org). 01/2002
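The load-balance blind spot this abstract identifies is easy to demonstrate numerically. The function below and the instruction counts are invented for illustration; the point — that runs with identical total work and identical IPC can hide very different per-thread progress — is the paper's.

```python
# Per-run IPC is total instructions / total cycles. Under the
# variable-work-per-thread approach, two runs can match on total work yet
# differ in how that work is split across threads. Numbers are illustrative.

def ipc(per_thread_insts, cycles):
    """Aggregate instructions-per-cycle over all threads in a run."""
    return sum(per_thread_insts) / cycles

# Two runs, same total work (200 instructions), same cycle count:
run_a = ipc([100, 100], cycles=50)  # balanced: both threads progress equally
run_b = ipc([180, 20],  cycles=50)  # skewed: one thread dominated the run
```
Both runs report an IPC of 4.0, yet in the second run one thread made almost no forward progress — exactly the kind of load-balance difference the metric cannot express.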
- ABSTRACT: Growing interest in ambitious multiple-issue machines and heavily pipelined machines requires a careful examination of how much instruction-level parallelism exists in typical programs. Such an examination is complicated by the wide variety of hardware and software techniques for increasing the parallelism that can be exploited, including branch prediction, register renaming, and alias analysis. By performing simulations based on instruction traces, we can model techniques at the limits of feasibility and even beyond. Our study shows a striking difference between assuming that the techniques we use are perfect and merely assuming that they are impossibly good. Even with impossibly good techniques, average parallelism rarely exceeds 7, with 5 more common. 1. Introduction: There is growing interest in machines that exploit, usually with compiler assistance, the parallelism that programs have at the instruction level. Figure 1 shows an example of this parallelism. The code fra... 02/1999
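The trace-based limit methodology this abstract describes amounts to scheduling every instruction as early as its data dependences allow and reading the ILP off the critical path. The sketch below is a minimal version of that idea under the study's most generous assumptions (perfect branch prediction, renaming, and alias analysis are modeled by simply ignoring those constraints); the trace encoding and function name are inventions of this example.

```python
# Minimal dataflow-limit calculator: each trace entry is (dests, srcs), the
# registers an instruction writes and reads. An instruction is scheduled one
# cycle after its latest producer; the critical-path length then bounds
# achievable ILP when control and naming constraints are assumed away.

def ilp_limit(trace):
    ready = {}   # register -> cycle its value becomes available
    depth = 0    # critical-path length in cycles
    for dests, srcs in trace:
        cycle = 1 + max((ready.get(r, 0) for r in srcs), default=0)
        for r in dests:
            ready[r] = cycle
        depth = max(depth, cycle)
    return len(trace) / depth   # instructions completed per critical-path cycle

# Four instructions forming two independent two-instruction chains:
# critical path is 2 cycles, so the dataflow limit is 2.0 IPC.
trace = [(("r1",), ()), (("r2",), ("r1",)),
         (("r3",), ()), (("r4",), ("r3",))]
```
Running such a calculator over real instruction traces, with the various constraints progressively re-imposed, is how limit studies arrive at figures like the "average parallelism rarely exceeds 7" quoted above.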