Conference Paper

Virtual Ways: Efficient Coherence for Architecturally Visible Storage in Automatic Instruction Set Extensions

DOI: 10.1007/978-3-642-11515-8_11 Conference: High Performance Embedded Architectures and Compilers, 5th International Conference, HiPEAC 2010, Pisa, Italy, January 25-27, 2010. Proceedings
Source: DBLP


Customizable processors augmented with application-specific Instruction Set Extensions (ISEs) have gained traction in recent years. The most effective ISEs include Architecturally Visible Storage (AVS): compiler-controlled memories accessible exclusively to the ISEs. Unfortunately, the use of AVS memories creates a coherence
problem with the data cache. A multiprocessor coherence protocol can solve the problem, but it is an expensive solution
when applied in a uniprocessor context. Instead, we solve the problem by modifying the cache controller so that the AVS
memories function as extra ways of the cache with respect to coherence, while remaining inaccessible as extra ways during normal software execution. This solution, which we call Virtual Ways, is less costly than a hardware coherence protocol and eliminates coherence messages from the system bus, which reduces energy
consumption. Moreover, eliminating these messages makes Virtual Ways significantly more robust to performance degradation
when there is a large disparity in clock frequency between the processor and main memory.
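The abstract's core idea, an AVS scratchpad that participates in cache lookups as if it were an extra way but is never allocated by ordinary loads and stores, can be illustrated with a small behavioral model. This is a minimal sketch, not the paper's hardware design: the class name, the dictionary-based "virtual way", and the simplistic fill policy are all assumptions made for illustration.

```python
# Behavioral sketch of the Virtual Ways idea (illustrative only, not the
# paper's controller design). An AVS scratchpad is consulted on lookups as
# if it were one extra way of the cache, so processor reads and writes stay
# coherent with it without bus-level coherence messages; normal accesses
# never allocate into it -- only an explicit compiler-controlled fill does.

class VirtualWaysCache:
    def __init__(self, num_sets=4, num_ways=2, line_size=16):
        self.num_sets = num_sets
        self.line_size = line_size
        # Regular ways: per-set list of (tag, data) entries.
        self.ways = [[None] * num_ways for _ in range(num_sets)]
        # AVS region modeled as one "virtual way": line number -> data.
        self.avs = {}

    def _split(self, addr):
        line = addr // self.line_size
        return line % self.num_sets, line // self.num_sets  # (set index, tag)

    def avs_load(self, addr, data):
        """Compiler-controlled fill of the AVS; an ISE would read this copy."""
        self.avs[addr // self.line_size] = data

    def lookup(self, addr):
        """Normal load: may hit in a regular way OR in the virtual way."""
        idx, tag = self._split(addr)
        for entry in self.ways[idx]:
            if entry is not None and entry[0] == tag:
                return entry[1]
        line = addr // self.line_size
        if line in self.avs:              # AVS acts as an extra way here
            return self.avs[line]
        return None                       # miss: would go to memory

    def write(self, addr, data):
        """Processor store: also updates a matching AVS line, keeping the
        ISE's copy coherent with no coherence traffic on the system bus."""
        idx, tag = self._split(addr)
        self.ways[idx][0] = (tag, data)   # simplistic fill, no eviction policy
        line = addr // self.line_size
        if line in self.avs:
            self.avs[line] = data
```

The point of the sketch is the lookup and write paths: coherence is maintained because the controller treats the AVS as just another way to check, while `avs_load` remains the only path that allocates into it.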



Available from: Paolo Ienne
  • ABSTRACT: Application-specific instruction-set processors (ASIPs) are specialized to meet the performance and energy needs of individual or small sets of applications. Of particular importance is the inclusion of custom instruction set extensions (ISEs) that accelerate the performance of the applications from which they are derived. This manuscript reviews the architectural features of ASIPs that facilitate high performance and low energy consumption, and provides an overview of compiler algorithms to automatically identify and synthesize ISEs from an application specified using a high-level language.
    No preview · Article · Jan 2011
  • ABSTRACT: Hardware specialization is often the key to efficiency for programmable embedded systems, but comes at the expense of flexibility. This paper combines flexibility and efficiency in the design and synthesis of domain-specific datapaths. We merge all individual paths from the Data Flow Graphs (DFGs) of the target applications, leading to a minimal set of required resources; this set is organized into a column of physical operators and cloned, thus generating a domain-specific rectangular lattice. A bus-based FPGA-style interconnection network is then generated and dimensioned to meet the needs of the applications. Our results demonstrate that the lattice has good flexibility: DFGs that were not used as part of the datapath creation phase can be mapped onto it with high probability. Compared to an ASIC design of a single DFG, the speed of our domain-specific coarse-grained reconfigurable datapath is degraded by a factor up to 2×, compared to 3–4× for an FPGA; similarly, our lattice is up to 10× larger than an ASIC, compared to 20–40× for an FPGA. We estimate that our array is up to 6× larger than an ASIC accelerator, which is synthesized using datapath merging and has limited or null generality.
    Full-text · Conference Paper · Mar 2012
  • ABSTRACT: Local memories increase the efficiency of hardware accelerators by enabling fast accesses to frequently used data. In addition, the access latencies of local memories are deterministic, which allows for more accurate evaluation of system performance during design exploration. We have previously proposed local memories with an un-cached memory slave interface that permits the program running on the processor to access the locally stored variables in the hardware accelerator. While this has relaxed the memory constraints for porting code sections to hardware accelerators, there is now a need to consider the read/write access penalties of local memories from the processor during design exploration. In order to facilitate the selection of profitable hardware accelerators, we need an accurate performance model that takes into account these read/write access penalties. In this paper, we propose a novel model to estimate the penalty incurred due to memory dependencies between the program running on the processor and the local memories in the FPGA hardware accelerator. This model can be used in an automated design exploration framework for heterogeneous FPGA platforms to select profitable hardware accelerators with local memories.
    No preview · Conference Paper · Jan 2013
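The last abstract above does not spell out its penalty model, but the kind of estimate it describes can be illustrated with a first-order sketch. The function and its parameters are entirely hypothetical, introduced here only to show the shape of such an estimate: each processor access to accelerator-local memory is charged the extra latency it pays over a cached access.

```python
# Hypothetical first-order penalty estimate (not the paper's model):
# extra cycles the processor spends on reads/writes to accelerator-local
# memory, relative to the same accesses hitting in the cache.
def local_access_penalty(n_reads, n_writes,
                         local_read_lat, local_write_lat, cached_lat):
    read_extra = max(local_read_lat - cached_lat, 0)    # per-read overhead
    write_extra = max(local_write_lat - cached_lat, 0)  # per-write overhead
    return n_reads * read_extra + n_writes * write_extra
```

For example, 100 reads and 50 writes over an un-cached slave interface with 10- and 12-cycle latencies, against a 2-cycle cache hit, would be charged 100·8 + 50·10 = 1300 extra cycles under this toy model; a real model would also account for memory dependencies and overlap with computation.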