Conference Paper

MFLUSH: Handling Long-Latency Loads in SMT On-Chip Multiprocessors

DOI: 10.1109/ICPP.2008.48 Conference: Parallel Processing, 2008. ICPP '08. 37th International Conference on
Source: IEEE Xplore

ABSTRACT Nowadays, there is a clear trend in industry towards employing the growing amount of transistors on chip in replicating execution cores (CMP), where each core is Simultaneous Multithreading (SMT). State-of-the-art high-performance processors like the IBM POWER5 and POWER6 corroborate this CMP+SMT trend. Within each SMT core any of the well-known SMT mechanisms may be applied to face SMT related challenges. Among them, probably the most important issue in an SMT execution pipeline concerns the Instruction Fetch (IFetch) Policy. The FLUSH IFetch Policy represents a choice for throughput-oriented scenarios. It handles L2 cache misses in order to avoid hardware resource monopolization by any given execution thread; involving an additional energy cost via instruction refetching. However, the new constraints imposed by the CMP+SMT scenario may affect well-known SMT mechanisms, like the FLUSH mechanism. In this paper we revisit the FLUSH mechanism and analyze its application in the emerging CMP+SMT scenario. The included analysis points out the new difficulties to be faced by the FLUSH mechanism in the emerging CMP+SMT scenario. Then we propose a novel IFetch Policy designed to cope with the CMP+SMT scenario: the MFLUSH. We also include a complete evaluation of the MFLUSH policy, both in terms of throughput and energy consumption. Our results indicate that the MFLUSH, specifically designed for the emerging CMP+SMT scenario, succeeds not only in overcoming the specific CMP+SMT constraints but also allowing a 20% energy consumption reduction without a significant system throughput loss.

  • Source
    [Show abstract] [Hide abstract]
    ABSTRACT: Simultaneous multithreading architectures have been defined previously with fully shared execution resources. When one thread in such an architecture experiences a very long-latency operation, such as a load miss, the thread will eventually stall, potentially holding resources which other threads could be using to make forward progress. This paper shows that in many cases it is better to free the resources associated with a stalled thread rather than keep that thread ready to immediately begin execution upon return of the loaded data. Several possible architectures are examined, and some simple solutions are shown to be very effective, achieving speedups close to 6.0 in some cases, and averaging 15% speedup with four threads and over 100% speedup with two threads running. Response times are cut in half for several workloads in open system experiments.
    Microarchitecture, 2001. MICRO-34. Proceedings. 34th ACM/IEEE International Symposium on; 01/2002
  • Source
    [Show abstract] [Hide abstract]
    ABSTRACT: SMT processors increase performance by executing instructions from several threads simultaneously. These threads use the resources of the processor better by sharing them but, at the same time, threads are competing for these resources. The way critical resources are distributed among threads determines the final performance. Currently, processor resources are distributed among threads as determined by the fetch policy that decides which threads enter the processor to compete for resources. However, current fetch policies only use indirect indicators of resource usage in their decision, which can lead to resource monopolization by a single thread or to resource waste when no thread can use them. Both situations can harm performance and happen, for example, after an L2 cache miss. In this paper, we introduce the concept of dynamic resource control in SMT processors. Using this concept, we propose a novel resource allocation policy for SMT processors. This policy directly monitors the usage of resources by each thread and guarantees that all threads get their fair share of the critical shared resources, avoiding monopolization. We also define a mechanism to allow a thread to borrow resources from another thread if that thread does not require them, thereby reducing resource under-use. Simulation results show that our dynamic resource allocation policy outperforms a static resource allocation policy by 8%, on average. It also improves the best dynamic resource-conscious fetch policies like FLUSH++ by 4%, on average, using the harmonic mean as a metric. This indicates that our policy does not obtain the ILP boost by unfairly running high ILP threads over slow memory-bounded threads. Instead, it achieves a better throughput-fairness balance.
    37th Annual International Symposium on Microarchitecture (MICRO-37 2004), 4-8 December 2004, Portland, OR, USA; 01/2004
  • Source
    [Show abstract] [Hide abstract]
    ABSTRACT: Simultaneous multithreading is a technique that permits multiple independent threads to issue multiple instructions each cycle. In previous work we demonstrated the performance potential of simultaneous multithreading, based on a somewhat idealized model. In this paper we show that the throughput gains from simultaneous multithreading can be achieved without extensive changes to a conventional wide-issue superscalar, either in hardware structures or sizes. We present an architecture for simultaneous multithreading that achieves three goals: (1) it minimizes the architectural impact on the conventional superscalar design, (2) it has minimal performance impact on a single thread executing alone, and (3) it achieves significant throughput gains when running multiple threads. Our simultaneous multithreading architecture achieves a throughput of 5.4 instructions per cycle, a 2.5-fold improvement over an unmodified superscalar with similar hardware resources. This speedup is enhanced by an advantage of multithreading previously unexploited in other architectures: the ability to favor for fetch and issue those threads most efficiently using the processor each cycle, thereby providing the "best" instructions to the processor.
    Computer Architecture, 1996 23rd Annual International Symposium on; 06/1996


Available from