Cache streamization for high performance stream processor
ABSTRACT Because stream applications place high bandwidth demands on the memory system, most stream processors use software-managed streaming memory. However, this memory is harder to program, less compatible, and poorly suited to irregular stream access, all of which hinder the use of stream processors in broader application domains. Hardware-managed coherent caches overcome these shortcomings, but at the cost of side effects caused by their lack of stream support. To address this problem, this paper develops a streamization cache whose performance is comparable to that of streaming memory but which is easier to use. The paper presents the motivation for and details of the proposed design, including three stream-specific cache techniques covering data fetch policy, replacement policy, and multi-client access. Moreover, an instance of the streamization cache is implemented in FT64, a 64-bit high-performance stream processor. Using a set of streaming application benchmarks, the paper evaluates the performance, power consumption, and area cost of the proposed architecture. Results show that these streamization techniques for cache are worthwhile.
ABSTRACT: State-of-the-art fabrication technology for integrating numerous hardware resources, such as processors/DSPs and memory arrays, into a single chip has enabled the emergence of the Multiprocessor System-on-Chip (MPSoC). The stream programming paradigm on MPSoCs is highly efficient for single-functionality scenarios because of its dedicated and predictable data supply system. However, when memory traffic is heavily shared among parallel tasks in applications with multiple interrelated functionalities, performance suffers from task interference and shared-memory congestion, which lead to poor parallel speedup and memory bandwidth utilization. This paper proposes a framework for a stream-processing-based on-chip data supply system for task-parallel MPSoCs. In this framework, stream address generation and data computation are decoupled and parallelized to allow full utilization of on-chip resources. Task granularities are dynamically tuned to jointly optimize overall application performance. Experiments show that the proposed framework and the tuning scheme are effective for joint optimization in task-parallel MPSoCs.
IEEE Computer Architecture Letters 01/2012; DOI:10.1109/L-CA.2011.21 · 1.00 Impact Factor