Conference Paper

Virtual Ways: Efficient Coherence for Architecturally Visible Storage in Automatic Instruction Set Extensions

DOI: 10.1007/978-3-642-11515-8_11 Conference: High Performance Embedded Architectures and Compilers, 5th International Conference, HiPEAC 2010, Pisa, Italy, January 25-27, 2010. Proceedings
Source: DBLP

ABSTRACT Customizable processors augmented with application-specific Instruction Set Extensions (ISEs) have begun to gain traction in recent years. The most effective ISEs include Architecturally Visible Storage (AVS): compiler-controlled memories accessible exclusively to the ISEs. Unfortunately, the use of AVS memories creates a coherence problem with the data cache. A multiprocessor coherence protocol can solve the problem, but it is an expensive solution when applied in a uniprocessor context. Instead, we solve the problem by modifying the cache controller so that the AVS memories function as extra ways of the cache with respect to coherence, while remaining inaccessible as extra ways during normal software execution. This solution, which we call Virtual Ways, is less costly than a hardware coherence protocol and eliminates coherence messages from the system bus, which reduces energy consumption. Moreover, eliminating these messages makes Virtual Ways significantly more robust to performance degradation when there is a large disparity in clock frequency between the processor and main memory.
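The core idea in the abstract can be illustrated with a toy software model: the AVS memory is consulted as an extra "virtual" way during coherence checks on stores, but it is never allocated into by normal cache accesses. This is a minimal sketch under stated assumptions — the class and method names are hypothetical and not taken from the paper's hardware design.

```python
# Toy model of the Virtual Ways concept: the AVS behaves as an extra cache
# way for coherence purposes only. All names here are illustrative.

class VirtualWayCache:
    def __init__(self, num_ways=2):
        # Normal ways, modeled as addr -> value maps.
        self.ways = [dict() for _ in range(num_ways)]
        # The AVS memory acts as a "virtual way": visible to coherence
        # checks, but never filled by ordinary loads/stores.
        self.avs = {}

    def load_avs(self, addr, memory):
        """Compiler-scheduled fill of the AVS from main memory for ISE use."""
        self.avs[addr] = memory[addr]

    def write(self, addr, value, memory):
        """Processor store: update a normal way, then keep the virtual way
        coherent locally -- no coherence message crosses the system bus."""
        self.ways[addr % len(self.ways)][addr] = value
        memory[addr] = value
        if addr in self.avs:        # coherence check against the virtual way
            self.avs[addr] = value  # update (or invalidate) the AVS copy

    def ise_read(self, addr):
        """ISE access: always observes the coherent AVS copy."""
        return self.avs[addr]
```

In this sketch, a store to an address resident in the AVS updates the AVS copy inside the cache controller itself, which is why no bus-level coherence traffic is needed — the property the abstract credits for the energy savings and frequency-disparity robustness.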

  • ABSTRACT: Customized instructions (CIs) implemented using custom functional units (CFUs) have been proposed as a way of improving the performance and energy efficiency of software while minimizing the cost of designing and verifying accelerators from scratch. However, previous work allows CIs to communicate with the processor only through registers or with limited memory operations. In this work we propose an architecture that allows CIs to seamlessly execute memory operations without any special synchronization operations to guarantee program order of instructions. Our results show that our architecture can provide 24% energy savings with a 14% performance improvement for 2-issue and 4-issue superscalar processor cores.
    Proceedings of the ACM/SIGDA international symposium on Field programmable gate arrays; 02/2013
  • ABSTRACT: Way Stealing is a simple architectural modification to a cache-based processor that increases the data bandwidth to and from application-specific instruction set extensions (ISEs), which increases performance and reduces energy consumption. Way Stealing offers higher bandwidth than interfacing the ISEs to the processor's register file, and it eliminates the need to allocate separate memories, called architecturally visible storage (AVS), dedicated to the ISEs, and to ensure coherence between the AVS memories and the processor's data cache. Our results show that Way Stealing is competitive in terms of performance and energy consumption with other techniques that use AVS memories in conjunction with a data cache.
    IEEE Transactions on Very Large Scale Integration (VLSI) Systems 01/2014; 22(1):62-75.
  • ABSTRACT: Hardware specialization is often the key to efficiency for programmable embedded systems, but comes at the expense of flexibility. This paper combines flexibility and efficiency in the design and synthesis of domain-specific datapaths. We merge all individual paths from the Data Flow Graphs (DFGs) of the target applications, leading to a minimal set of required resources; this set is organized into a column of physical operators and cloned, thus generating a domain-specific rectangular lattice. A bus-based FPGA-style interconnection network is then generated and dimensioned to meet the needs of the applications. Our results demonstrate that the lattice has good flexibility: DFGs that were not used as part of the datapath creation phase can be mapped onto it with high probability. Compared to an ASIC design of a single DFG, the speed of our domain-specific coarse-grained reconfigurable datapath is degraded by a factor up to 2×, compared to 3–4× for an FPGA; similarly, our lattice is up to 10× larger than an ASIC, compared to 20–40× for an FPGA. We estimate that our array is up to 6× larger than an ASIC accelerator, which is synthesized using datapath merging and has limited or null generality.
    Design, Automation & Test in Europe Conference & Exhibition (DATE), 2012; 03/2012
