Conference Paper

Cache streamization for high performance stream processor

Comput. Sch., Nat. Univ. of Defense Technol., Changsha, China
DOI: 10.1109/HIPC.2009.5433214 Conference: High Performance Computing (HiPC), 2009 International Conference on
Source: IEEE Xplore

ABSTRACT Due to high bandwidth demand on memory system of stream applications, most of stream processors use software-managed streaming memory. However, this memory disadvantages ease of programming, compatibility, and supporting irregular stream access, which hinder the usage of stream processor in broader application domains. Meanwhile, hardware-managed coherent caches overcome these shortcomings of software-managed streaming memory with side-effect due to lack of supporting stream. For this problem, this paper developed a streamization cache whose performance is comparable to streaming memory but is more easy to use. The paper presents the motivation and details of our proposed design, including three stream-specific techniques for cache on data fetch policy, replacement policy and multi-client access. Moreover, a streamization cache instance is implemented in FT64, a 64-bit high performance stream processor. Based on a set of streaming application benchmark, the paper estimates the performance, power consumption and the area cost of the proposed architecture. Results show that these streamization techniques for cache are worthwhile.

  • Source
    [Show abstract] [Hide abstract]
    ABSTRACT: Processor cycle time continues to decrease faster than main memory access times, placing higher demands on cache memory hierarchy performance. To meet these demands, conventional cache management techniques that rely solely on naive hardware must be augmented with more sophisticated techniques. This paper investigates Informed Caching Environments (ICE) where software can assist in cache management. By exposing some cache management mechanisms and providing an efficient interface for their use, software can complement existing hardware techniques by providing hints on how to manage the cache. In this paper, we focus primarily on a mechanism for software to convey information to the memory hierarchy. We introduce a single instruction---called TAG---that can annotate subsequent memory references with a number of bits, thus avoiding major modifications to the instruction set. Simulation results show that annotating all memory reference instructions in the SPEC95 benchmarks increases exec...
  • Source
    [Show abstract] [Hide abstract]
    ABSTRACT: There are two competing models for the on-chip memory in Chip Multiprocessor (CMP) systems: hardware-managed coherent caches and software-managed streaming memory. This paper performs a direct comparison of the two models under the same set of assumptions about technology, area, and computational capabilities. The goal is to quantify how and when they differ in terms of perfor- mance, energy consumption, bandwidth requirements, and latency tolerance for general-purpose CMPs. We demonstrate that for data-parallel applications on systems with up to 16 cores, the cache-based and streaming models perform and scale equally well. For certain applications with little data reuse, streaming scales better due to better bandwidth use and macroscopic software prefetching. However, the introduction of techniques such as hardware prefetching and nonallo- cating stores to the cache-based model eliminates the streaming advantage. Overall, our results indicate that there is not sufficient advantage in building streaming memory systems where all on-chip memory structures are explicitly managed. On the other hand, we show that streaming at the programming model level is particularly beneficial, even with the cache-based model, as it enhances locality and creates opportunities for bandwidth optimizations. Moreover, we observe that stream programming is actually easier with the cache-based model because the hardware guarantees correct, best-effort execution even when the programmer cannot fully regularize an application's code.
    TACO. 01/2008; 5.
  • Source
    [Show abstract] [Hide abstract]
    ABSTRACT: Memory performance is increasingly determining microprocessor performance and technology trends are exacerbating this problem. Most architectures use set-associative caches with LRU replacement policies to combine fast access with relatively low miss rates. To improve replacement decisions in set-associative caches, we develop a new set of compiler algorithms that predict which data will and will not be reused and provide these hints to the architecture. We prove that the hints either match or improve hit rates over LRU. We describe a practical one-bit cache-line tag implementation of our algorithm, called evict-me. On a cache replacement, the architecture will replace a line for which the evict-me bit is set, or if none is set, it will use the LRU bits. We implement our compiler analysis and its output in the Scale compiler. On a variety of scientific programs, using the evict-me algorithm in both the level 1 and 2 caches improves simulated cycle times by up to 34% over the LRU policy by increasing hit rates. In addition, a combination of simple hardware prefetching and evict-me works together to further improve performance.
    Parallel Architectures and Compilation Techniques, 2002. Proceedings. 2002 International Conference on; 02/2002