Article
PDF available

A cache design for high performance embedded systems.

Authors:

Abstract and Figures

Future embedded applications will require high-performance processors integrating fast and low-power caches. Dynamic Non-Uniform Cache Architectures (D-NUCA) have been proposed to overcome the performance limit introduced by wire delays when designing large caches. In this paper, we propose an alternative design of the D-NUCA cache, namely the Triangular D-NUCA cache, to reduce the power consumption and silicon area occupancy of D-NUCA caches. We compare the performance of the Triangular D-NUCA cache with that achieved by the conventional rectangular organization. Results show that our approach is particularly useful in the embedded application domain, as it permits the use of a half-sized NUCA cache while improving performance.
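As a rough, back-of-the-envelope illustration (not a model from the paper), the following Python sketch weights each way's access latency by an assumed hit distribution skewed towards the cache controller, the situation that D-NUCA promotion creates, and then compares bank counts for a rectangular layout and one possible triangular layout. The cycle counts, bank counts, and hit distribution are all illustrative assumptions.

```python
# Back-of-the-envelope model of average D-NUCA hit latency (illustrative only,
# not the paper's model). Way i is assumed to sit i network hops from the
# cache controller; BANK_ACCESS and HOP_LATENCY are arbitrary cycle counts.

BANK_ACCESS = 3   # cycles to access a bank once it is reached (assumed)
HOP_LATENCY = 2   # extra cycles per way of distance from the controller (assumed)

def avg_hit_latency(hit_share_per_way):
    """Average hit latency for a given distribution of hits over the ways."""
    return sum(share * (BANK_ACCESS + HOP_LATENCY * way)
               for way, share in enumerate(hit_share_per_way))

# Promotion/demotion concentrates hits in the ways closest to the controller,
# so the distant ways contribute little to the average latency ...
hit_share = [0.45, 0.20, 0.12, 0.08, 0.06, 0.04, 0.03, 0.02]
print("average hit latency:", avg_hit_latency(hit_share), "cycles")

# ... which is why thinning out the rarely hit, distant ways (a triangular
# layout) can roughly halve the number of banks while the latency picture
# seen by most accesses barely changes.
rect_banks = [16] * 8                      # rectangular: 16 banks per way
tri_banks  = [16, 12, 10, 8, 6, 6, 4, 2]   # one illustrative triangular layout
print("banks: rectangular =", sum(rect_banks), ", triangular =", sum(tri_banks))
```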
... Different implementations of routers have been proposed for high-performance on-chip networks, but it is not clear which router architecture is the most suitable to meet the design constraints posed by a NUCA cache scenario. In order to characterise such design constraints, in this paper we analyse how the performance of a reference NUCA L2 cache [1,7,8] is influenced by different values of the cut-through latency and buffering capacity of the routers. These parameters, in fact, guide the selection of an adequate router architecture. ...
... The analysis described in this paper assumes a reference NUCA structure which has been derived from previous works [1,7,8]. The topology of the on-chip network is derived from a 2D mesh by employing only a subset of the links of a full 2D mesh in order to reduce the area overhead, resulting in a tree topology. ...
Article
Full-text available
Non-uniform cache architectures (NUCAs) are a novel design paradigm for large last-level on-chip caches, which have been introduced to deliver low access latencies in wire-delay-dominated environments. Their structure is partitioned into sub-banks and the resulting access latency is a function of the physical position of the requested data. Typically, NUCA caches employ a switched network, made up of links and routers with buffered queues, to connect the different sub-banks and the cache controller, and the characteristics of the network elements may affect the performance of the entire system. This work analyses how different parameters for the network routers, namely cut-through latency and buffering capacity, affect the overall performance of NUCA-based systems for the single-processor case, assuming a reference NUCA organisation proposed in the literature. The analysis is performed using a cycle-accurate, execution-driven simulator of the entire system and real workloads. The results indicate that the sensitivity of the system to the cut-through latency is very high, thus limiting the effectiveness of the NUCA solution, and that modest buffering capacity is sufficient to achieve a good performance level. As a consequence, in this work we propose an alternative clustered NUCA organisation that limits the average number of hops experienced by cache accesses. This organisation performs better and scales better as the cut-through latency increases, thus simplifying the implementation of routers, and it is also more effective than another latency-reduction solution proposed in the literature (the hybrid network).
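For intuition about the cut-through sensitivity reported above, here is a tiny Python model (not taken from the article) of how the average NUCA access latency grows with the router cut-through latency, and why an organisation with a lower average hop count, such as the clustered one, is less sensitive to it. The bank latency and hop counts are assumptions chosen only for illustration.

```python
# First-order model: an access pays the bank latency plus a per-hop cut-through
# cost on both the request and the reply path. All numbers are assumptions.

BANK_LATENCY = 4   # cycles spent inside the selected bank (assumed)

def avg_access_latency(avg_hops, cut_through):
    # request and reply each traverse the on-chip network
    return BANK_LATENCY + 2 * avg_hops * cut_through

for cut_through in (1, 2, 3, 4):
    flat      = avg_access_latency(avg_hops=6.0, cut_through=cut_through)  # partial 2D mesh (assumed hops)
    clustered = avg_access_latency(avg_hops=3.5, cut_through=cut_through)  # clustered NUCA (assumed hops)
    print(f"cut-through={cut_through}: flat={flat:.1f} cycles, clustered={clustered:.1f} cycles")
```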
... The analysis described in this paper assumes a reference NUCA structure which has been derived from previous work [18, 13, 4, 3]. The topology of the on-chip network is derived from a 2D mesh, which will be called partial 2D mesh in the following, since only a subset of the links of a full 2D mesh are employed in order to reduce the area overhead. ...
Conference Paper
Full-text available
Non-uniform cache architectures (NUCA) are a novel design paradigm for large last-level on-chip caches that has been introduced to deliver low access latencies in wire-delay-dominated environments. Their structure is partitioned into sub-banks and the resulting access latency is a function of the physical position of the requested data. Typically, to connect the different sub-banks and the cache controller, NUCA caches employ a switched network, made up of links and routers with buffered queues; the characteristics of such a switched network may affect the performance of the entire system. This work analyzes how different parameters for the routers, namely cut-through latency and buffering capacity, affect the overall performance of NUCA-based systems for the single-processor case, assuming a reference organization proposed in the literature. The results indicate that the sensitivity of the system to the cut-through latency is very high and that limited buffering capacity is sufficient to achieve a good performance level. As a consequence, we propose an alternative NUCA organization that limits the average number of hops experienced by cache accesses. This organization performs better in most of the cases and scales better as the cut-through latency increases, thus simplifying the implementation of routers.
... The D-NUCA cache architecture was proposed in [2], [3], showing that a dynamic NUCA structure achieves an IPC 1.5 times higher than a traditional Uniform Cache Architecture (UCA) when maintaining the same size and manufacturing technology. Extensions to the original idea are NuRapid [10] and the Triangular D-NUCA cache [11]. They improve performance by decoupling tags from data or by changing the mapping and size of each way. ...
Conference Paper
Full-text available
D-NUCA caches are cache memories that, thanks to their banked organization, broadcast search and promotion/demotion mechanism, are able to tolerate the increasing wire-delay effects introduced by technology scaling. As a consequence, they will outperform conventional caches (UCA, Uniform Cache Architectures) in future-generation cores. Due to the promotion/demotion mechanism, we observed that the distribution of hits across the ways of a D-NUCA cache varies across applications as well as across different execution phases within a single application. In this work, we show how such a behavior can be leveraged to improve the power efficiency of a D-NUCA cache as well as to decrease its access latency. In particular, we propose: 1) A new micro-architectural technique to reduce the static power consumption of a D-NUCA cache by dynamically adapting the number of active (i.e. powered-on) ways to the needs of the running application; our evaluation shows that a strong reduction of the average number of active ways (37.1%) is achievable without significantly affecting the IPC (-2.25%), leading to a 30.9% reduction of the Energy-Delay Product (EDP). 2) A strategy to estimate the characteristic parameters of the proposed technique. 3) An evaluation of the effectiveness of the proposed technique in the multicore environment.
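A minimal Python sketch of the kind of way-adaptation policy described above: at the end of each sampling interval, the controller looks at the share of hits captured by the farthest powered-on way and powers one way off or on accordingly. The thresholds, the sampling scheme, and the class interface are illustrative assumptions, not the paper's exact mechanism.

```python
# Sketch of a way-adaptable D-NUCA controller (simplified, assumed policy).

TOTAL_WAYS    = 8
OFF_THRESHOLD = 0.02   # farthest active way earns <2% of hits -> power it off (assumed)
ON_THRESHOLD  = 0.10   # farthest active way earns >10% of hits -> power one more on (assumed)

class WayAdaptableDNUCA:
    def __init__(self):
        self.active_ways = TOTAL_WAYS
        self.hits_per_way = [0] * TOTAL_WAYS

    def record_hit(self, way):
        self.hits_per_way[way] += 1

    def adapt(self):
        """Called at the end of each sampling interval."""
        total = sum(self.hits_per_way[:self.active_ways]) or 1
        tail_share = self.hits_per_way[self.active_ways - 1] / total
        if tail_share < OFF_THRESHOLD and self.active_ways > 1:
            self.active_ways -= 1          # power off the farthest way
        elif tail_share > ON_THRESHOLD and self.active_ways < TOTAL_WAYS:
            self.active_ways += 1          # power an additional way back on
        self.hits_per_way = [0] * TOTAL_WAYS
```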
... The D-NUCA cache architecture was proposed in [2], [3], showing that a dynamic NUCA structure achieves 1.5 times higher IPC than a traditional Uniform Cache Architecture (UCA) when maintaining the same size and manufacturing technology. Extensions of the original idea are NuRapid [10] and the Triangular D-NUCA cache [11]. They improve performance by decoupling tags from data or by changing the mapping and size of each way. ...
Article
Full-text available
D-NUCA caches are cache memories that, thanks to their banked organization, broadcast search and promotion/demotion mechanism, are able to tolerate the increasing wire-delay effects introduced by technology scaling. As a consequence, they will outperform conventional caches (UCA, Uniform Cache Architectures) in future-generation cores. Due to the promotion/demotion mechanism, we have found that, in a D-NUCA cache, the distribution of hits across the ways varies across applications as well as across different execution phases within a single application. In this paper, we show how such a behavior can be exploited to improve D-NUCA power efficiency as well as to decrease its access latency. In particular, we propose a new D-NUCA structure, called the Way Adaptable D-NUCA cache, in which the number of active (i.e. powered-on) ways is dynamically adapted to the needs of the running application. Our initial evaluation shows that a consistent reduction of both the average number of active ways (42% on average) and the number of bank access requests (29% on average) is achieved, without significantly affecting the IPC.
Article
Full-text available
Non-uniform cache architecture (NUCA) aims to limit the wire-delay problem typical of large on-chip last-level caches: by partitioning a large cache into several banks, with the latency of each one depending on its physical location, and by employing a scalable on-chip network to interconnect the banks with the cache controller, the average access latency can be reduced with respect to a traditional cache. The addition of a migration mechanism to move the most frequently accessed data towards the cache controller (D-NUCA) further improves the average access latency. In this work we propose a last-level cache design, based on the D-NUCA scheme, which is able to significantly limit its static power consumption by dynamically adapting to the needs of the running application: the Way Adaptable D-NUCA cache. This design leads to a fast and power-efficient memory hierarchy with an average reduction of 31.2% in the energy-delay product (EDP) with respect to a traditional D-NUCA. We propose and discuss a methodology for tuning the intrinsic parameters of our design and investigate the adoption of the Way Adaptable D-NUCA scheme as a shared L2 cache in a chip multiprocessor (CMP) system (24% reduction of EDP).
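As a reminder of how an EDP figure folds an energy saving and a small slowdown into one number, a one-line calculation; the inputs below are only an arithmetic example, not the paper's measurements.

```python
# EDP = energy * delay; for a fixed instruction count, delay scales as 1 / IPC.

def relative_edp(energy_ratio, ipc_ratio):
    return energy_ratio / ipc_ratio

# e.g. 30% less energy at the cost of a 2% IPC loss (illustrative numbers)
print(f"relative EDP: {relative_edp(0.70, 0.98):.3f}")   # ~0.714, i.e. ~28.6% EDP reduction
```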
Article
An embedded processor with a SIMD extension is presented. The processor contains a 32-bit dual-issue RISC core with out-of-order execution capability, extended with a dedicated short-vector SIMD unit to speed up multimedia processing. An effective co-verification method is also designed to verify the processor. The paper focuses on the architecture design and the verification of the embedded processor; the performance evaluation and the physical design are also discussed. The processor is fabricated in a 180 nm CMOS technology and achieves 1.42 MIPS/MHz at 7.9 mW/MHz.
Article
Full-text available
Power consumption is becoming one of the most important constraints for microprocessor design in nanometer-scale technologies. Especially as the transistor supply voltage and threshold voltage are scaled down, leakage energy consumption increases even when the transistor is not switching. This paper proposes a simple technique to reduce static energy. The key idea of our approach is to allow the ways within a cache to be accessed at different speeds and to place infrequently accessed data into the slow ways. We use a dual-Vt technique to realize the non-uniform set-associative cache, and propose a simple replacement policy to reduce the average access latency. Experimental results on 32-way set-associative caches demonstrate that no severe increase in clock cycles to execute application programs is observed and that significant static energy reduction can be achieved, resulting in an improvement of the energy-delay product.
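A toy Python model of one set of such a cache, assuming a 32-way set split into low-Vt (fast but leaky) and high-Vt (slow but low-leakage) ways: misses are filled into a slow way, and a block is promoted into a fast way only when it is reused, so frequently accessed data migrates to the fast ways. The 8/24 split and the promotion rule are illustrative assumptions rather than the paper's exact replacement policy.

```python
import random

FAST_WAYS = 8    # low-Vt ways: fast but leaky (assumed split of a 32-way set)
SLOW_WAYS = 24   # high-Vt ways: slower but low leakage

class NonUniformSet:
    def __init__(self):
        self.fast = [None] * FAST_WAYS   # tags held in the fast ways
        self.slow = [None] * SLOW_WAYS   # tags held in the slow ways

    def access(self, tag):
        if tag in self.fast:
            return "fast hit"
        if tag in self.slow:
            # reuse detected: swap the block into a (randomly chosen) fast way
            i, j = self.slow.index(tag), random.randrange(FAST_WAYS)
            self.slow[i], self.fast[j] = self.fast[j], tag
            return "slow hit, promoted"
        # miss: fill into a random slow way so cold data never displaces hot data
        self.slow[random.randrange(SLOW_WAYS)] = tag
        return "miss"

s = NonUniformSet()
print(s.access(0x40), s.access(0x40), s.access(0x40))   # miss, then promoted, then fast hit
```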
Conference Paper
Full-text available
Reducing the supply voltage to reduce dynamic power consumption in CMOS devices will inadvertently lead to an exponential increase in leakage power dissipation. In this work we explore an architectural idea to reduce leakage power in data caches. Previous work has shown that cache frames are “dead” for a significant fraction of time [14]. We exploit this observation to turn off cache lines that are not likely to be accessed any more. Our method is simple: if a cache line is not accessed within a fixed interval (called the decay interval), we turn off its supply voltage using the gated-Vdd technique introduced previously [12]. We study the effect of cache-line decay on both power consumption and performance. We find that, with cache-line decay, it is possible to build larger caches that dissipate less leakage power than smaller caches while yielding equal or better performance (fewer misses). In addition, because our method can dynamically trade performance for leakage power, it can be adjusted according to the requirements of the application and/or the environment.
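A minimal sketch of the decay mechanism as the abstract describes it: each line keeps a small counter that is cleared on access and advanced by a coarse global tick, and once the counter reaches the decay interval the line's supply is gated off, so its contents are lost and a later access to it misses. The interval value and the per-line bookkeeping are illustrative assumptions.

```python
DECAY_INTERVAL = 4   # coarse ticks of inactivity before a line is switched off (assumed)

class DecayLine:
    def __init__(self):
        self.tag = None
        self.counter = 0
        self.powered = False

    def access(self, tag):
        """Return True on a hit; any access repowers the line and resets its counter."""
        hit = self.powered and self.tag == tag
        self.tag, self.counter, self.powered = tag, 0, True
        return hit

    def tick(self):
        """Advance the coarse global decay clock for this line."""
        if self.powered:
            self.counter += 1
            if self.counter >= DECAY_INTERVAL:
                self.powered = False     # gated Vdd: leakage stops, contents are lost

line = DecayLine()
line.access(0x1000)
for _ in range(DECAY_INTERVAL):
    line.tick()                          # line decays after four idle ticks
print(line.access(0x1000))               # False: the decayed line now misses
```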
Article
Memory accesses often account for about half of a microprocessor system's power consumption. Customizing a microprocessor cache's total size, line size, and associativity to a particular program is well known to have tremendous benefits for performance and power. Customizing caches has until recently been restricted to core-based flows, in which a new chip will be fabricated. However, several configurable cache architectures have been proposed recently for use in prefabricated microprocessor platforms. Tuning those caches to a program is still, however, a cumbersome task left to designers, assisted in part by recent computer-aided design (CAD) tuning aids. We propose to move that CAD on-chip, which can greatly increase the acceptance of tunable caches. We introduce on-chip hardware implementing an efficient cache tuning heuristic that can automatically, transparently, and dynamically tune the cache to an executing program. Our heuristic seeks not only to reduce the number of configurations that must be examined, but also to traverse the search space in a way that minimizes costly cache flushes. By simulating numerous Powerstone and MediaBench benchmarks, we show that such a dynamic self-tuning cache saves on average 40% of total memory access energy over a standard non-tuned reference cache.
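One plausible shape for such a tuning heuristic, sketched in Python: rather than trying every (size, line size, associativity) combination, tune one parameter at a time and keep the best value before moving to the next, which bounds the number of configurations examined. The parameter ordering, the candidate values, and estimate_energy() are placeholders invented for this sketch; in the real mechanism the energy figure would come from running the program on the candidate configuration.

```python
# Greedy one-parameter-at-a-time cache tuning (illustrative sketch).

SIZES  = [2048, 4096, 8192]   # total size in bytes (assumed candidates)
LINES  = [16, 32, 64]         # line size in bytes (assumed candidates)
ASSOCS = [1, 2, 4]            # associativity (assumed candidates)

def estimate_energy(size, line, assoc):
    # placeholder cost model; the hardware would measure this per interval
    return size * 0.001 + 100.0 / line + 5.0 * assoc

def tune():
    best = {"size": SIZES[0], "line": LINES[0], "assoc": ASSOCS[0]}
    for param, values in (("size", SIZES), ("line", LINES), ("assoc", ASSOCS)):
        best_energy = None
        for value in values:
            cand = dict(best, **{param: value})
            energy = estimate_energy(cand["size"], cand["line"], cand["assoc"])
            if best_energy is None or energy < best_energy:
                best_energy, best[param] = energy, value
    return best

print(tune())   # configuration chosen after examining at most 3 + 3 + 3 candidates
```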
Article
As DRAM access latencies approach a thousand instruction-execution times and on-chip caches grow to multiple megabytes, it is not clear that conventional cache structures continue to be appropriate. Two key features—full associativity and software management—have been used successfully in the virtual-memory domain to cope with disk access latencies. Future systems will need to employ similar techniques to deal with DRAM latencies. This paper presents a practical, fully associative, software-managed secondary cache system that provides performance competitive with or superior to traditional caches without OS or application involvement. We see this structure as the first step toward OS- and application-aware management of large on-chip caches. This paper has two primary contributions: a practical design for a fully associative memory structure, the indirect index cache (IIC), and a novel replacement algorithm, generational replacement, that is specifically designed to work with the IIC. We analyze the behavior of an IIC with generational replacement as a drop-in, transparent substitute for a conventional secondary cache. We achieve miss rate reductions from 8% to 85% relative to a 4-way associative LRU organization, matching or beating a (practically infeasible) fully associative true LRU cache. Incorporating these miss rates into a rudimentary timing model indicates that the IIC/generational replacement cache could be competitive with a conventional cache at today's DRAM latencies, and will outperform a conventional cache as these CPU-relative latencies grow.
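A rough Python sketch of generational replacement as the abstract presents it: cached blocks live in a small number of prioritized pools, blocks that were referenced recently are promoted toward higher pools, unreferenced ones are demoted, and victims are taken from the lowest non-empty pool. The pool count, the fill pool, and the aging rule are simplified assumptions, not the exact algorithm of the paper.

```python
from collections import deque

NUM_POOLS = 4   # number of priority pools (assumed)

class GenerationalReplacement:
    def __init__(self, capacity):
        self.capacity = capacity
        self.pools = [deque() for _ in range(NUM_POOLS)]   # pools[0] = lowest priority
        self.referenced = set()                            # tags touched this interval

    def on_hit(self, tag):
        self.referenced.add(tag)

    def on_fill(self, tag):
        """Insert a new block, evicting from the lowest pool if the cache is full."""
        victim = self._evict() if sum(map(len, self.pools)) >= self.capacity else None
        self.pools[1].append(tag)          # new blocks start in a middle pool (assumed)
        return victim

    def _evict(self):
        for pool in self.pools:            # victim comes from the lowest non-empty pool
            if pool:
                return pool.popleft()

    def age(self):
        """Periodically promote referenced blocks and demote the rest by one pool."""
        new_pools = [deque() for _ in range(NUM_POOLS)]
        for level, pool in enumerate(self.pools):
            for tag in pool:
                if tag in self.referenced:
                    dest = min(level + 1, NUM_POOLS - 1)
                else:
                    dest = max(level - 1, 0)
                new_pools[dest].append(tag)
        self.pools = new_pools
        self.referenced.clear()
```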
Conference Paper
In a modern microprocessor chip, a considerable portion of the die is dedicated to cache memory. However, some applications may not actively need all the cache storage, especially computing-bandwidth-limited applications. Instead, such applications may be able to use some additional computing resources. If the unused portion of the cache could serve these computation needs, the on-chip resources would be utilized more efficiently. This presents an opportunity to explore the reconfiguration of a part of the cache memory for computing. In this paper, we present a cache architecture to convert a cache into a computing unit for either of the following two structured computations: FIR and DCT/IDCT. In order to convert a cache memory into a function unit, we include additional logic to embed multi-bit output LUTs into the cache structure. Therefore, the cache can perform computations when it is reconfigured as a function unit. The experimental results show that the reconfigurable module improves the execution time of applications with a large number of data elements by a large factor (as high as 50 and 60). In addition, the area overhead of the reconfigurable cache module for FIR and DCT/IDCT is less than the core area of those functions. Our simulations indicate that a reconfigurable cache does not incur a significant delay penalty compared with a dedicated cache memory. The concept of reconfigurable cache modules can be applied at Level-2 caches instead of Level-1 caches to provide an active Level-2 cache similar to active memories.
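To make the LUT-for-computation idea concrete, here is a small Python sketch of a fixed-coefficient FIR filter evaluated purely through table lookups: one table per tap maps every possible 8-bit sample to its product with that tap's coefficient, so the filter body needs only reads and additions. The coefficients and sample width are arbitrary assumptions, and the paper's actual mapping of multi-bit output LUTs onto the cache arrays is not modelled here.

```python
COEFFS = [1, -2, 4, -2, 1]   # fixed FIR taps (assumed)
SAMPLE_BITS = 8              # unsigned 8-bit input samples (assumed)

# One "cache-resident" LUT per tap: index = sample value, entry = coefficient * sample.
LUTS = [[c * s for s in range(1 << SAMPLE_BITS)] for c in COEFFS]

def fir_via_luts(window):
    """window: the most recent len(COEFFS) samples, newest first."""
    return sum(LUTS[i][window[i]] for i in range(len(COEFFS)))

def fir_direct(window):
    return sum(c * s for c, s in zip(COEFFS, window))

samples = [10, 20, 30, 40, 50]
assert fir_via_luts(samples) == fir_direct(samples)
print(fir_via_luts(samples))   # 60 for these inputs
```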