Article
PDF available

A cache design for high performance embedded systems.

Authors:

Abstract and Figures

Future embedded applications will require high-performance processors integrating fast and low-power caches. Dynamic Non-Uniform Cache Architectures (D-NUCA) have been proposed to overcome the performance limit introduced by wire delays when designing large caches. In this paper, we propose an alternative design of the D-NUCA cache, namely the Triangular D-NUCA cache, to reduce the power consumption and silicon area occupancy of D-NUCA caches. We compare the performance of the Triangular D-NUCA cache with that achieved by the conventional rectangular organization. Results show that our approach is particularly useful in the embedded application domain, as it permits the use of a half-sized NUCA cache while improving performance.
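As a rough, back-of-the-envelope illustration (not a model from the paper), the following Python sketch weights each way's access latency by an assumed hit distribution skewed towards the cache controller, the situation that D-NUCA promotion creates, and then compares bank counts for a rectangular layout and one possible triangular layout. The cycle counts, bank counts, and hit distribution are all illustrative assumptions.

```python
# Back-of-the-envelope model of average D-NUCA hit latency (illustrative only,
# not the paper's model). Way i is assumed to sit i network hops from the
# cache controller; BANK_ACCESS and HOP_LATENCY are arbitrary cycle counts.

BANK_ACCESS = 3   # cycles to access a bank once it is reached (assumed)
HOP_LATENCY = 2   # extra cycles per way of distance from the controller (assumed)

def avg_hit_latency(hit_share_per_way):
    """Average hit latency for a given distribution of hits over the ways."""
    return sum(share * (BANK_ACCESS + HOP_LATENCY * way)
               for way, share in enumerate(hit_share_per_way))

# Promotion/demotion concentrates hits in the ways closest to the controller,
# so the distant ways contribute little to the average latency ...
hit_share = [0.45, 0.20, 0.12, 0.08, 0.06, 0.04, 0.03, 0.02]
print("average hit latency:", avg_hit_latency(hit_share), "cycles")

# ... which is why thinning out the rarely hit, distant ways (a triangular
# layout) can roughly halve the number of banks while the latency picture
# seen by most accesses barely changes.
rect_banks = [16] * 8                      # rectangular: 16 banks per way
tri_banks  = [16, 12, 10, 8, 6, 6, 4, 2]   # one illustrative triangular layout
print("banks: rectangular =", sum(rect_banks), ", triangular =", sum(tri_banks))
```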
... Different implementations of routers have been proposed for high-performance on-chip networks, but it is not clear which router architecture is the most suitable to meet the design constraints posed by a NUCA cache scenario. In order to characterise such design constraints, in this paper we analyse how the performance of a reference NUCA L2 cache [1,7,8] is influenced by different values of the cut-through latency and buffering capacity of the routers. These parameters, in fact, guide the selection of an adequate router architecture. ...
... The analysis described in this paper assumes a reference NUCA structure which has been derived from previous works [1,7,8]. The topology of the on-chip network is derived from a 2D mesh by employing only a subset of the links of a full 2D mesh in order to reduce the area overhead, resulting in a tree topology. ...
Article
Full-text available
Non-uniform cache architectures (NUCAs) are a novel design paradigm for large last-level on-chip caches, which have been introduced to deliver low access latencies in wire-delay-dominated environments. Their structure is partitioned into sub-banks and the resulting access latency is a function of the physical position of the requested data. Typically, NUCA caches employ a switched network, made up of links and routers with buffered queues, to connect the different sub-banks and the cache controller, and the characteristics of the network elements may affect the performance of the entire system. This work analyses how different parameters for the network routers, namely cut-through latency and buffering capacity, affect the overall performance of NUCA-based systems for the single-processor case, assuming a reference NUCA organisation proposed in the literature. The analysis is performed using a cycle-accurate, execution-driven simulator of the entire system and real workloads. The results indicate that the sensitivity of the system to the cut-through latency is very high, thus limiting the effectiveness of the NUCA solution, and that modest buffering capacity is sufficient to achieve a good performance level. As a consequence, in this work we propose an alternative clustered NUCA organisation that limits the average number of hops experienced by cache accesses. This organisation performs better and scales better as the cut-through latency increases, thus simplifying the implementation of routers, and it is also more effective than another latency-reduction solution proposed in the literature (the hybrid network).
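For intuition about the cut-through sensitivity reported above, here is a tiny Python model (not taken from the article) of how the average NUCA access latency grows with the router cut-through latency, and why an organisation with a lower average hop count, such as the clustered one, is less sensitive to it. The bank latency and hop counts are assumptions chosen only for illustration.

```python
# First-order model: an access pays the bank latency plus a per-hop cut-through
# cost on both the request and the reply path. All numbers are assumptions.

BANK_LATENCY = 4   # cycles spent inside the selected bank (assumed)

def avg_access_latency(avg_hops, cut_through):
    # request and reply each traverse the on-chip network
    return BANK_LATENCY + 2 * avg_hops * cut_through

for cut_through in (1, 2, 3, 4):
    flat      = avg_access_latency(avg_hops=6.0, cut_through=cut_through)  # partial 2D mesh (assumed hops)
    clustered = avg_access_latency(avg_hops=3.5, cut_through=cut_through)  # clustered NUCA (assumed hops)
    print(f"cut-through={cut_through}: flat={flat:.1f} cycles, clustered={clustered:.1f} cycles")
```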
... The analysis described in this paper assumes a reference NUCA structure which has been derived from previous work [18, 13, 4, 3]. The topology of the on-chip network is derived from a 2D mesh, which will be called partial 2D mesh in the following, since only a subset of the links of a full 2D mesh are employed in order to reduce the area overhead. ...
Conference Paper
Full-text available
Non-uniform cache architectures (NUCA) are a novel design paradigm for large last-level on-chip caches that has been introduced to deliver low access latencies in wire-delay-dominated environments. Their structure is partitioned into sub-banks and the resulting access latency is a function of the physical position of the requested data. Typically, to connect the different sub-banks and the cache controller, NUCA caches employ a switched network, made up of links and routers with buffered queues; the characteristics of such a switched network may affect the performance of the entire system. This work analyzes how different parameters for the routers, namely cut-through latency and buffering capacity, affect the overall performance of NUCA-based systems for the single-processor case, assuming a reference organization proposed in the literature. The results indicate that the sensitivity of the system to the cut-through latency is very high and that limited buffering capacity is sufficient to achieve a good performance level. As a consequence, we propose an alternative NUCA organization that limits the average number of hops experienced by cache accesses. This organization performs better in most of the cases and scales better as the cut-through latency increases, thus simplifying the implementation of routers.
... The D-NUCA cache architecture was proposed in [2], [3], showing that a dynamic NUCA structure achieves an IPC 1.5 times higher than a traditional Uniform Cache Architecture (UCA) when maintaining the same size and manufacturing technology. Extensions to the original idea are NuRapid [10] and the Triangular D-NUCA cache [11]. They improve performance by decoupling tags from data or by changing the mapping and size of each way. ...
Conference Paper
Full-text available
D-NUCA caches are cache memories that, thanks to their banked organization, broadcast search and promotion/demotion mechanism, are able to tolerate the increasing wire-delay effects introduced by technology scaling. As a consequence, they will outperform conventional caches (UCA, Uniform Cache Architectures) in future-generation cores. Due to the promotion/demotion mechanism, we observed that the distribution of hits across the ways of a D-NUCA cache varies across applications as well as across different execution phases within a single application. In this work, we show how such a behavior can be leveraged to improve the power efficiency of a D-NUCA cache as well as to decrease its access latency. In particular, we propose: 1) A new micro-architectural technique to reduce the static power consumption of a D-NUCA cache by dynamically adapting the number of active (i.e. powered-on) ways to the needs of the running application; our evaluation shows that a strong reduction of the average number of active ways (37.1%) is achievable without significantly affecting the IPC (-2.25%), leading to a 30.9% reduction of the Energy-Delay Product (EDP). 2) A strategy to estimate the characteristic parameters of the proposed technique. 3) An evaluation of the effectiveness of the proposed technique in the multicore environment.
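A minimal Python sketch of the kind of way-adaptation policy described above: at the end of each sampling interval, the controller looks at the share of hits captured by the farthest powered-on way and powers one way off or on accordingly. The thresholds, the sampling scheme, and the class interface are illustrative assumptions, not the paper's exact mechanism.

```python
# Sketch of a way-adaptable D-NUCA controller (simplified, assumed policy).

TOTAL_WAYS    = 8
OFF_THRESHOLD = 0.02   # farthest active way earns <2% of hits -> power it off (assumed)
ON_THRESHOLD  = 0.10   # farthest active way earns >10% of hits -> power one more on (assumed)

class WayAdaptableDNUCA:
    def __init__(self):
        self.active_ways = TOTAL_WAYS
        self.hits_per_way = [0] * TOTAL_WAYS

    def record_hit(self, way):
        self.hits_per_way[way] += 1

    def adapt(self):
        """Called at the end of each sampling interval."""
        total = sum(self.hits_per_way[:self.active_ways]) or 1
        tail_share = self.hits_per_way[self.active_ways - 1] / total
        if tail_share < OFF_THRESHOLD and self.active_ways > 1:
            self.active_ways -= 1          # power off the farthest way
        elif tail_share > ON_THRESHOLD and self.active_ways < TOTAL_WAYS:
            self.active_ways += 1          # power an additional way back on
        self.hits_per_way = [0] * TOTAL_WAYS
```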
... The D-NUCA cache architecture was proposed in [2], [3], showing that a dynamic NUCA structure achieves 1.5 times higher IPC than a traditional Uniform Cache Architecture (UCA) when maintaining the same size and manufacturing technology. Extensions of the original idea are NuRapid [10] and the Triangular D-NUCA cache [11]. They improve performance by decoupling tags from data or by changing the mapping and size of each way. ...
Article
Full-text available
D-NUCA caches are cache memories that, thanks to their banked organization, broadcast search and promotion/demotion mechanism, are able to tolerate the increasing wire-delay effects introduced by technology scaling. As a consequence, they will outperform conventional caches (UCA, Uniform Cache Architectures) in future-generation cores. Due to the promotion/demotion mechanism, we have found that, in a D-NUCA cache, the distribution of hits across the ways varies across applications as well as across different execution phases within a single application. In this paper, we show how such a behavior can be exploited to improve D-NUCA power efficiency as well as to decrease its access latency. In particular, we propose a new D-NUCA structure, called the Way Adaptable D-NUCA cache, in which the number of active (i.e. powered-on) ways is dynamically adapted to the needs of the running application. Our initial evaluation shows that a consistent reduction of both the average number of active ways (42% on average) and the number of bank access requests (29% on average) is achieved, without significantly affecting the IPC.
Article
Full-text available
Non-uniform cache architecture (NUCA) aims to limit the wire-delay problem typical of large on-chip last-level caches: by partitioning a large cache into several banks, with the latency of each one depending on its physical location, and by employing a scalable on-chip network to interconnect the banks with the cache controller, the average access latency can be reduced with respect to a traditional cache. The addition of a migration mechanism to move the most frequently accessed data towards the cache controller (D-NUCA) further improves the average access latency. In this work we propose a last-level cache design, based on the D-NUCA scheme, which is able to significantly limit its static power consumption by dynamically adapting to the needs of the running application: the Way Adaptable D-NUCA cache. This design leads to a fast and power-efficient memory hierarchy with an average reduction of 31.2% in the energy-delay product (EDP) with respect to a traditional D-NUCA. We propose and discuss a methodology for tuning the intrinsic parameters of our design and investigate the adoption of the Way Adaptable D-NUCA scheme as a shared L2 cache in a chip multiprocessor (CMP) system (24% reduction of EDP).
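As a reminder of how an EDP figure folds an energy saving and a small slowdown into one number, a one-line calculation; the inputs below are only an arithmetic example, not the paper's measurements.

```python
# EDP = energy * delay; for a fixed instruction count, delay scales as 1 / IPC.

def relative_edp(energy_ratio, ipc_ratio):
    return energy_ratio / ipc_ratio

# e.g. 30% less energy at the cost of a 2% IPC loss (illustrative numbers)
print(f"relative EDP: {relative_edp(0.70, 0.98):.3f}")   # ~0.714, i.e. ~28.6% EDP reduction
```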
Article
An embedded processor with a SIMD extension is presented. The processor contains a 32-bit dual-issue RISC core with out-of-order execution capability, extended with a dedicated short-vector SIMD unit to speed up multimedia processing. An effective co-verification method is also designed to verify the processor. The paper focuses on the architecture design and the verification of the embedded processor; the performance evaluation and the physical design are also discussed. The processor is fabricated in a 180 nm CMOS technology and achieves 1.42 MIPS/MHz at 7.9 mW/MHz.
Article
Full-text available
Power consumption is becoming one of the most important constraints for microprocessor design in nanometer-scale technologies. Especially as the transistor supply voltage and threshold voltage are scaled down, leakage energy consumption increases even when the transistor is not switching. This paper proposes a simple technique to reduce static energy. The key idea of our approach is to allow the ways within a cache to be accessed at different speeds and to place infrequently accessed data into the slow ways. We use a dual-Vt technique to realize the non-uniform set-associative cache, and propose a simple replacement policy to reduce the average access latency. Experimental results on 32-way set-associative caches demonstrate that no severe increase in clock cycles to execute application programs is observed and that significant static energy reduction can be achieved, resulting in an improvement of the energy-delay product.
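A toy Python model of one set of such a cache, assuming a 32-way set split into low-Vt (fast but leaky) and high-Vt (slow but low-leakage) ways: misses are filled into a slow way, and a block is promoted into a fast way only when it is reused, so frequently accessed data migrates to the fast ways. The 8/24 split and the promotion rule are illustrative assumptions rather than the paper's exact replacement policy.

```python
import random

FAST_WAYS = 8    # low-Vt ways: fast but leaky (assumed split of a 32-way set)
SLOW_WAYS = 24   # high-Vt ways: slower but low leakage

class NonUniformSet:
    def __init__(self):
        self.fast = [None] * FAST_WAYS   # tags held in the fast ways
        self.slow = [None] * SLOW_WAYS   # tags held in the slow ways

    def access(self, tag):
        if tag in self.fast:
            return "fast hit"
        if tag in self.slow:
            # reuse detected: swap the block into a (randomly chosen) fast way
            i, j = self.slow.index(tag), random.randrange(FAST_WAYS)
            self.slow[i], self.fast[j] = self.fast[j], tag
            return "slow hit, promoted"
        # miss: fill into a random slow way so cold data never displaces hot data
        self.slow[random.randrange(SLOW_WAYS)] = tag
        return "miss"

s = NonUniformSet()
print(s.access(0x40), s.access(0x40), s.access(0x40))   # miss, then promoted, then fast hit
```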
Conference Paper
Full-text available
Reducing the supply voltage to reduce dynamic power consumption in CMOS devices will inadvertently lead to an exponential increase in leakage power dissipation. In this work we explore an architectural idea to reduce leakage power in data caches. Previous work has shown that cache frames are “dead” for a significant fraction of time [14]. We exploit this observation to turn off cache lines that are not likely to be accessed any more. Our method is simple: if a cache line is not accessed within a fixed interval (called the decay interval), we turn off its supply voltage using the gated-Vdd technique introduced previously [12]. We study the effect of cache-line decay on both power consumption and performance. We find that, with cache-line decay, it is possible to build larger caches that dissipate less leakage power than smaller caches while yielding equal or better performance (fewer misses). In addition, because our method can dynamically trade performance for leakage power, it can be adjusted according to the requirements of the application and/or the environment.
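A minimal sketch of the decay mechanism as the abstract describes it: each line keeps a small counter that is cleared on access and advanced by a coarse global tick, and once the counter reaches the decay interval the line's supply is gated off, so its contents are lost and a later access to it misses. The interval value and the per-line bookkeeping are illustrative assumptions.

```python
DECAY_INTERVAL = 4   # coarse ticks of inactivity before a line is switched off (assumed)

class DecayLine:
    def __init__(self):
        self.tag = None
        self.counter = 0
        self.powered = False

    def access(self, tag):
        """Return True on a hit; any access repowers the line and resets its counter."""
        hit = self.powered and self.tag == tag
        self.tag, self.counter, self.powered = tag, 0, True
        return hit

    def tick(self):
        """Advance the coarse global decay clock for this line."""
        if self.powered:
            self.counter += 1
            if self.counter >= DECAY_INTERVAL:
                self.powered = False     # gated Vdd: leakage stops, contents are lost

line = DecayLine()
line.access(0x1000)
for _ in range(DECAY_INTERVAL):
    line.tick()                          # line decays after four idle ticks
print(line.access(0x1000))               # False: the decayed line now misses
```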
Article
Memory accesses often account for about half of a microprocessor system's power consumption. Customizing a microprocessor cache's total size, line size, and associativity to a particular program is well known to have tremendous benefits for performance and power. Customizing caches has until recently been restricted to core-based flows, in which a new chip will be fabricated. However, several configurable cache architectures have been proposed recently for use in prefabricated microprocessor platforms. Tuning those caches to a program is still, however, a cumbersome task left to designers, assisted in part by recent computer-aided design (CAD) tuning aids. We propose to move that CAD on-chip, which can greatly increase the acceptance of tunable caches. We introduce on-chip hardware implementing an efficient cache tuning heuristic that can automatically, transparently, and dynamically tune the cache to an executing program. Our heuristic seeks not only to reduce the number of configurations that must be examined, but also to traverse the search space in a way that minimizes costly cache flushes. By simulating numerous Powerstone and MediaBench benchmarks, we show that such a dynamic self-tuning cache saves on average 40% of total memory access energy over a standard non-tuned reference cache.
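One plausible shape for such a tuning heuristic, sketched in Python: rather than trying every (size, line size, associativity) combination, tune one parameter at a time and keep the best value before moving to the next, which bounds the number of configurations examined. The parameter ordering, the candidate values, and estimate_energy() are placeholders invented for this sketch; in the real mechanism the energy figure would come from running the program on the candidate configuration.

```python
# Greedy one-parameter-at-a-time cache tuning (illustrative sketch).

SIZES  = [2048, 4096, 8192]   # total size in bytes (assumed candidates)
LINES  = [16, 32, 64]         # line size in bytes (assumed candidates)
ASSOCS = [1, 2, 4]            # associativity (assumed candidates)

def estimate_energy(size, line, assoc):
    # placeholder cost model; the hardware would measure this per interval
    return size * 0.001 + 100.0 / line + 5.0 * assoc

def tune():
    best = {"size": SIZES[0], "line": LINES[0], "assoc": ASSOCS[0]}
    for param, values in (("size", SIZES), ("line", LINES), ("assoc", ASSOCS)):
        best_energy = None
        for value in values:
            cand = dict(best, **{param: value})
            energy = estimate_energy(cand["size"], cand["line"], cand["assoc"])
            if best_energy is None or energy < best_energy:
                best_energy, best[param] = energy, value
    return best

print(tune())   # configuration chosen after examining at most 3 + 3 + 3 candidates
```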
Article
As DRAM access latencies approach a thousand instruction-execution times and on-chip caches grow to multiple megabytes, it is not clear that conventional cache structures continue to be appropriate. Two key features—full associativity and software management—have been used successfully in the virtual-memory domain to cope with disk access latencies. Future systems will need to employ similar techniques to deal with DRAM latencies. This paper presents a practical, fully associative, software-managed secondary cache system that provides performance competitive with or superior to traditional caches without OS or application involvement. We see this structure as the first step toward OS- and application-aware management of large on-chip caches. This paper has two primary contributions: a practical design for a fully associative memory structure, the indirect index cache (IIC), and a novel replacement algorithm, generational replacement, that is specifically designed to work with the IIC. We analyze the behavior of an IIC with generational replacement as a drop-in, transparent substitute for a conventional secondary cache. We achieve miss rate reductions from 8% to 85% relative to a 4-way associative LRU organization, matching or beating a (practically infeasible) fully associative true LRU cache. Incorporating these miss rates into a rudimentary timing model indicates that the IIC/generational replacement cache could be competitive with a conventional cache at today's DRAM latencies, and will outperform a conventional cache as these CPU-relative latencies grow.
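A rough Python sketch of generational replacement as the abstract presents it: cached blocks live in a small number of prioritized pools, blocks that were referenced recently are promoted toward higher pools, unreferenced ones are demoted, and victims are taken from the lowest non-empty pool. The pool count, the fill pool, and the aging rule are simplified assumptions, not the exact algorithm of the paper.

```python
from collections import deque

NUM_POOLS = 4   # number of priority pools (assumed)

class GenerationalReplacement:
    def __init__(self, capacity):
        self.capacity = capacity
        self.pools = [deque() for _ in range(NUM_POOLS)]   # pools[0] = lowest priority
        self.referenced = set()                            # tags touched this interval

    def on_hit(self, tag):
        self.referenced.add(tag)

    def on_fill(self, tag):
        """Insert a new block, evicting from the lowest pool if the cache is full."""
        victim = self._evict() if sum(map(len, self.pools)) >= self.capacity else None
        self.pools[1].append(tag)          # new blocks start in a middle pool (assumed)
        return victim

    def _evict(self):
        for pool in self.pools:            # victim comes from the lowest non-empty pool
            if pool:
                return pool.popleft()

    def age(self):
        """Periodically promote referenced blocks and demote the rest by one pool."""
        new_pools = [deque() for _ in range(NUM_POOLS)]
        for level, pool in enumerate(self.pools):
            for tag in pool:
                if tag in self.referenced:
                    dest = min(level + 1, NUM_POOLS - 1)
                else:
                    dest = max(level - 1, 0)
                new_pools[dest].append(tag)
        self.pools = new_pools
        self.referenced.clear()
```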
Conference Paper
In a modern microprocessor chip, a considerable portion of the die is dedicated to cache memory. However, some applications may not actively need all the cache storage, especially computing-bandwidth-limited applications. Instead, such applications may be able to use some additional computing resources. If the unused portion of the cache could serve these computation needs, the on-chip resources would be utilized more efficiently. This presents an opportunity to explore the reconfiguration of a part of the cache memory for computing. In this paper, we present a cache architecture to convert a cache into a computing unit for either of the following two structured computations: FIR and DCT/IDCT. In order to convert a cache memory into a function unit, we include additional logic to embed multi-bit output LUTs into the cache structure. Therefore, the cache can perform computations when it is reconfigured as a function unit. The experimental results show that the reconfigurable module improves the execution time of applications with a large number of data elements by a large factor (as high as 50 and 60). In addition, the area overhead of the reconfigurable cache module for FIR and DCT/IDCT is less than the core area of those functions. Our simulations indicate that a reconfigurable cache does not incur a significant delay penalty compared with a dedicated cache memory. The concept of reconfigurable cache modules can be applied at Level-2 caches instead of Level-1 caches to provide an active Level-2 cache similar to active memories.
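To make the LUT-for-computation idea concrete, here is a small Python sketch of a fixed-coefficient FIR filter evaluated purely through table lookups: one table per tap maps every possible 8-bit sample to its product with that tap's coefficient, so the filter body needs only reads and additions. The coefficients and sample width are arbitrary assumptions, and the paper's actual mapping of multi-bit output LUTs onto the cache arrays is not modelled here.

```python
COEFFS = [1, -2, 4, -2, 1]   # fixed FIR taps (assumed)
SAMPLE_BITS = 8              # unsigned 8-bit input samples (assumed)

# One "cache-resident" LUT per tap: index = sample value, entry = coefficient * sample.
LUTS = [[c * s for s in range(1 << SAMPLE_BITS)] for c in COEFFS]

def fir_via_luts(window):
    """window: the most recent len(COEFFS) samples, newest first."""
    return sum(LUTS[i][window[i]] for i in range(len(COEFFS)))

def fir_direct(window):
    return sum(c * s for c, s in zip(COEFFS, window))

samples = [10, 20, 30, 40, 50]
assert fir_via_luts(samples) == fir_direct(samples)
print(fir_via_luts(samples))   # 60 for these inputs
```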