Conference Paper

Memory Interface for Multi-Purpose Multi-Stream Accelerators

DOI: 10.1145/1878921.1878939 Conference: Proceedings of the 2010 International Conference on Compilers, Architecture, and Synthesis for Embedded Systems, CASES 2010, Scottsdale, AZ, USA, October 24-29, 2010
Source: DBLP


Power and programming challenges make heterogeneous multi-cores composed of cores and ASICs an attractive alternative to homogeneous multi-cores. Recently, multi-purpose loop-based generated accelerators have emerged as an especially attractive accelerator option. They have several assets: short design time (automatic generation), flexibility (multi-purpose) but low configuration and routing overhead (unlike FPGAs), computational performance (operations are directly mapped to hardware), and a focus on memory throughput by leveraging loop constructs. However, with multiple streams, the memory behavior of such accelerators can become at least as complex as that of superscalar processors, while they still need to retain the memory ordering predictability and throughput efficiency of DMAs. In this article, we show how to design a memory interface for multi-purpose accelerators which combines the ordering predictability of DMAs, retains key efficiency features of memory systems for complex processors, and requires only a fraction of their cost by leveraging the properties of streams references. We evaluate the approach with a synthesizable version of the memory interface for an example 9-task generated loop-based accelerator

Download full-text


Available from: Hugues Berry
  • Source
    [Show abstract] [Hide abstract]
    ABSTRACT: Hardware specialization is often the key to efficiency for programmable embedded systems, but comes at the expense of flexibility. This paper combines flexibility and efficiency in the design and synthesis of domain-specific datapaths. We merge all individual paths from the Data Flow Graphs (DFGs) of the target applications, leading to a minimal set of required resources; this set is organized into a column of physical operators and cloned, thus generating a domain-specific rectangular lattice. A bus-based FPGA-style interconnection network is then generated and dimensioned to meet the needs of the applications. Our results demonstrate that the lattice has good flexibility: DFGs that were not used as part of the datapath creation phase can be mapped onto it with high probability. Compared to an ASIC design of a single DFG, the speed of our domain-specific coarse-grained reconfigurable datapath is degraded by a factor up to 2×, compared to 3–4× for an FPGA; similarly, our lattice is up to 10× larger than an ASIC, compared to 20–40× for an FPGA. We estimate that our array is up to 6× larger than an ASIC accelerator, which is synthesized using datapath merging and has limited or null generality.
    Full-text · Conference Paper · Mar 2012
  • [Show abstract] [Hide abstract]
    ABSTRACT: Historically, hardware acceleration technologies have either been application-specific, therefore lacking in flexibility, or fully programmable, thereby suffering from notable inefficiencies on an application-by-application basis. To address the growing need for domain-specific acceleration technologies, this paper describes a design methodology (i) to automatically generate a domain-specific coarse-grained array from a set of representative applications and (ii) to introduce limited forms of architectural generality to increase the likelihood that additional applications can be successfully mapped onto it. In particular, coarse-grained arrays generated using our approach are intended to be integrated into customizable processors that use application-specific instruction set extensions to accelerate performance and reduce energy; rather than implementing these extensions using application-specific integrated circuit (ASIC) logic, which lacks flexibility, they can be synthesized onto our reconfigurable array instead, allowing the processor to be used for a variety of applications in related domains. Results show that our array is around 2× slower and 15× larger than an ultimately efficient ASIC implementation, and thus far more efficient than fieldprogrammable gate arrays (FPGAs), which are known to be 3-4× slower and 20-40× larger. Additionally, we estimate that our array is usually around 2× larger and 2× slower than an accelerator synthesized using traditional datapath merging, which has, if any, very limited flexibility beyond the design set of DFGs.
    No preview · Article · May 2013 · IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
  • [Show abstract] [Hide abstract]
    ABSTRACT: The many-accelerator architecture, mostly composed of general-purpose cores and accelerator-like function units (FUs), becomes a great alternative to homogeneous chip multiprocessors (CMPs) for its superior power-efficiency. However, the emerging many-accelerator processor shows a much more complicated memory accessing pattern than general purpose processors (GPPs) because the abundant on-chip FUs tend to generate highly-concurrent memory streams with distinct locality and bandwidth demand. The disordered memory streams issued by diverse accelerators exhibit a mutual interference behavior and cannot be efficiently handled by the orthodox main memory interface that provides an inflexible data fetching mode. Unlike the traditional DRAM memory, our proposed Aggregation Memory System (AMS) can function adaptively to the characterized memory streams from different FUs, because it provides the FUs with different data fetching sizes and protects their locality in memory access by intelligently interleaving their data to memory devices through sub-rank binding. Moreover, AMS can batch the requests without sub-rank conflict into a read burst with our optimized memory scheduling policy. Experimental results from trace-based simulation show both conspicuous performance boost and energy saving brought by AMS.
    No preview · Article · Mar 2014 · Journal of Computer Science and Technology