Conference Paper

A table-based method for single-pass cache optimization.

DOI: 10.1145/1366110.1366129
Conference: Proceedings of the 18th ACM Great Lakes Symposium on VLSI (GLSVLSI 2008), Orlando, Florida, USA, May 4-6, 2008
Source: DBLP

ABSTRACT: Because the memory subsystem contributes a large share of total system power, it is highly amenable to customization for reduced power/energy and/or improved performance. Cache parameters such as total size, line size, and associativity can be specialized to the needs of an application for system optimization. To determine the best values for these parameters, most methodologies execute the application repeatedly, analyzing each explored configuration individually. In this paper we propose a simple yet efficient technique that accurately estimates the miss rates of many different cache configurations in a single pass of execution. The approach uses simple data structures, in the form of a multi-layered table, and elementary bitwise operations to capture the locality characteristics of an application's addressing behavior, easing miss-rate estimation and reducing cache exploration time.
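To make the single-pass idea concrete, here is a minimal sketch (an illustration of the general technique, not the paper's multi-layered table, and covering only direct-mapped caches): one traversal of an address trace updates tag arrays for every (total size, line size) pair at once, deriving set indices and tags with shifts and masks. The sizes, names, and hex-trace input format are all assumptions.

/* Hypothetical single-pass miss-rate estimator: one pass over an
 * address trace updates every (cache size, line size) configuration.
 * Direct-mapped only; the paper's multi-layered table also handles
 * associativity and is more compact than the per-configuration tag
 * arrays used here. */
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define NUM_SIZES 3
#define NUM_LINES 2
#define MAX_SETS  256                      /* largest: 4096 B / 16 B lines */

static const uint32_t size_bytes[NUM_SIZES] = {1024, 2048, 4096};
static const uint32_t line_bytes[NUM_LINES] = {16, 32};

static uint32_t tag_of[NUM_SIZES][NUM_LINES][MAX_SETS];
static uint8_t  valid[NUM_SIZES][NUM_LINES][MAX_SETS];
static uint64_t misses[NUM_SIZES][NUM_LINES];
static uint64_t accesses;

static int lg2(uint32_t x)                 /* x must be a power of two */
{
    int n = 0;
    while (x >>= 1) n++;
    return n;
}

static void access_addr(uint32_t addr)
{
    accesses++;
    for (int s = 0; s < NUM_SIZES; s++) {
        for (int l = 0; l < NUM_LINES; l++) {
            uint32_t sets  = size_bytes[s] / line_bytes[l];
            uint32_t block = addr >> lg2(line_bytes[l]); /* drop offset bits */
            uint32_t idx   = block & (sets - 1);         /* bitwise set index */
            uint32_t tag   = block >> lg2(sets);
            if (!valid[s][l][idx] || tag_of[s][l][idx] != tag) {
                misses[s][l]++;                          /* cold or conflict miss */
                tag_of[s][l][idx] = tag;
                valid[s][l][idx]  = 1;
            }
        }
    }
}

int main(void)
{
    uint32_t addr;
    while (scanf("%" SCNx32, &addr) == 1)  /* one hex address per line */
        access_addr(addr);
    for (int s = 0; s < NUM_SIZES; s++)
        for (int l = 0; l < NUM_LINES; l++)
            printf("%u B cache, %u B lines: miss rate %.4f\n",
                   size_bytes[s], line_bytes[l],
                   accesses ? (double)misses[s][l] / accesses : 0.0);
    return 0;
}

Because every configuration shares the one trace traversal, enlarging the design space adds inner-loop work rather than additional execution passes, which is where the single-pass speedup comes from.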

  • ABSTRACT: Chip multicore processors (CMPs) have emerged as the dominant architecture choice for modern computing platforms and will most likely continue to be dominant well into the foreseeable future. As with any system, CMPs offer a unique set of challenges. Chief among them is the shared resource contention that results because CMP cores are not independent processors but rather share common resources, such as the last-level cache (LLC). Shared resource contention can lead to severe and unpredictable performance impact on the threads running on the CMP. Conversely, CMPs offer tremendous opportunities for multithreaded applications, which can take advantage of simultaneous thread execution as well as fast inter-thread data sharing. Many solutions have been proposed to deal with the negative aspects of CMPs and take advantage of the positive. This survey focuses on the subset of these solutions that exclusively make use of OS thread-level scheduling to achieve their goals. These solutions are particularly attractive as they require no changes to hardware and minimal or no changes to the OS. The OS scheduler has expanded well beyond its original role of time-multiplexing threads on a single core into a complex and effective resource manager. This article surveys a multitude of new and exciting work that explores the diverse new roles the OS scheduler can successfully take on.
    ACM Computing Surveys 45(1), November 2012.
  • ABSTRACT: Multi-level caches are widely used to improve the memory access speed of multiprocessor systems. Choosing a suitable set of cache memories for an application-specific embedded system's memory hierarchy is a tedious problem, particularly for MPSoCs. To accurately determine the number of hits and misses for all configurations in the design space of an MPSoC, researchers first extract a trace using instruction-set simulators and then simulate it in software; such simulations can take hours to months. We propose a novel method based on specialized hardware that can quickly simulate the design space of cache configurations for a shared-memory multiprocessor system on an FPGA, analyzing the memory traces and calculating cache hits and misses simultaneously. We demonstrate that our simulator can explore the cache design space of a quad-core system with private L1 caches and a shared L2 cache, over a range of standard benchmarks, taking as little as 0.106 seconds per million memory accesses, up to 456 times faster than the fastest known software-based simulator. Since we emulate the program and analyze memory traces simultaneously, we eliminate the need to extract multiple memory access traces prior to simulation, saving significant time during the design stage. (A simplified software rendition of this one-pass trace analysis is sketched after this list.)
    Design, Automation and Test in Europe (DATE), January 2014.
  • ABSTRACT: Low power and/or energy consumption is a requirement not only in embedded systems that run on batteries or have limited cooling capabilities, but also in desktops and mainframes, where chips require costly cooling techniques. Since the cache subsystem is typically the most power/energy-consuming subsystem, caches are good candidates for power/energy optimization, and cache tuning techniques are therefore widely researched. This survey focuses on state-of-the-art offline static and online dynamic cache tuning techniques and summarizes their attributes, major challenges, and potential research trends, to inspire novel ideas and future research avenues.
    ACM Computing Surveys 45(3), June 2013.
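The one-pass trace analysis mentioned in the FPGA item above can be sketched in software as follows (an assumption-laden simplification: the cited work does this bookkeeping in FPGA hardware and over a much larger design space). With the private L1 configuration fixed, the stream of L1 misses reaching the shared L2 is identical for every candidate L2, so a single traversal of an interleaved (core, address) trace can score several L2 sizes at once. All sizes and the trace format are illustrative.

/* Hypothetical one-pass analysis for a quad-core hierarchy: fixed
 * private direct-mapped L1s filter the trace, and every L1 miss
 * updates several candidate shared-L2 sizes at once. */
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define CORES      4
#define LINE_SHIFT 6                  /* 64-byte lines everywhere */
#define L1_SETS    64                 /* 4 KB private L1, direct-mapped */
#define NUM_L2     3
#define MAX_L2SETS 2048               /* largest candidate: 128 KB */

static const uint32_t l2_sets[NUM_L2] = {512, 1024, 2048};

static uint32_t l1_tag[CORES][L1_SETS];
static uint8_t  l1_valid[CORES][L1_SETS];
static uint32_t l2_tag[NUM_L2][MAX_L2SETS];
static uint8_t  l2_valid[NUM_L2][MAX_L2SETS];
static uint64_t l1_miss[CORES], l2_ref, l2_miss[NUM_L2];

static void access_mem(uint32_t core, uint32_t addr)
{
    uint32_t block = addr >> LINE_SHIFT;
    uint32_t i1 = block & (L1_SETS - 1);
    uint32_t t1 = block >> 6;         /* log2(L1_SETS) = 6 */
    if (l1_valid[core][i1] && l1_tag[core][i1] == t1)
        return;                        /* private L1 hit */
    l1_miss[core]++;
    l1_tag[core][i1] = t1;
    l1_valid[core][i1] = 1;

    l2_ref++;                          /* the filtered stream is the same
                                          for every candidate L2, so all
                                          candidates update in this pass */
    for (int c = 0; c < NUM_L2; c++) {
        uint32_t i2 = block & (l2_sets[c] - 1);
        uint32_t t2 = block / l2_sets[c];
        if (!(l2_valid[c][i2] && l2_tag[c][i2] == t2)) {
            l2_miss[c]++;
            l2_tag[c][i2] = t2;
            l2_valid[c][i2] = 1;
        }
    }
}

int main(void)
{
    uint32_t core, addr;
    while (scanf("%" SCNu32 " %" SCNx32, &core, &addr) == 2)
        access_mem(core & (CORES - 1), addr);
    for (int c = 0; c < NUM_L2; c++)
        printf("shared L2 with %" PRIu32 " sets: %" PRIu64 "/%" PRIu64 " misses\n",
               l2_sets[c], l2_miss[c], l2_ref);
    return 0;
}

Each candidate's counters update independently from the one shared miss stream, which is the property that lets many configurations be evaluated in a single pass, in software here and, in the cited work, concurrently in hardware.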
