Conference Paper

A table-based method for single-pass cache optimization.

DOI: 10.1145/1366110.1366129 Conference: Proceedings of the 18th ACM Great Lakes Symposium on VLSI 2008, Orlando, Florida, USA, May 4-6, 2008
Source: DBLP

ABSTRACT Due to the large contribution of the memory subsystem to total system power, the memory subsystem is highly amenable to cus- tomization for reduced power/energy and/or improved performance. Cache parameters such as total size, line size, and associat ivity can be specialized to the needs of an application for system optimiza- tion. In order to determine the best values for cache parameters, most methodologies utilize repetitious application execution to in- dividually analyze each configuration explored. In this pap er we propose a simplified yet efficient technique to accurately es timate the miss rate of many different cache configurations in just o ne single-pass of execution. The approach utilizes simple data struc- tures in the form of a multi-layered table and elementary bitwise operations to capture the locality characteristics of an ap plication's addressing behavior. The proposed technique intends to ease miss rate estimation and reduce cache exploration time. Categories and Subject Descriptors

  • [Show abstract] [Hide abstract]
    ABSTRACT: Chip multicore processors (CMPs) have emerged as the dominant architecture choice for modern computing platforms and will most likely continue to be dominant well into the foreseeable future. As with any system, CMPs offer a unique set of challenges. Chief among them is the shared resource contention that results because CMP cores are not independent processors but rather share common resources among cores such as the last level cache (LLC). Shared resource contention can lead to severe and unpredictable performance impact on the threads running on the CMP. Conversely, CMPs offer tremendous opportunities for mulithreaded applications, which can take advantage of simultaneous thread execution as well as fast inter thread data sharing. Many solutions have been proposed to deal with the negative aspects of CMPs and take advantage of the positive. This survey focuses on the subset of these solutions that exclusively make use of OS thread-level scheduling to achieve their goals. These solutions are particularly attractive as they require no changes to hardware and minimal or no changes to the OS. The OS scheduler has expanded well beyond its original role of time-multiplexing threads on a single core into a complex and effective resource manager. This article surveys a multitude of new and exciting work that explores the diverse new roles the OS scheduler can successfully take on.
    ACM Computing Surveys 11/2012; 45(1). · 4.04 Impact Factor
  • [Show abstract] [Hide abstract]
    ABSTRACT: Low power and/or energy consumption is a requirement not only in embedded systems that run on batteries or have limited cooling capabilities, but also in desktop and mainframes where chips require costly cooling techniques. Since the cache subsystem is typically the most power/energy-consuming subsystem, caches are good candidates for power/energy optimizations, and therefore, cache tuning techniques are widely researched. This survey focuses on state-of-the-art offline static and online dynamic cache tuning techniques and summarizes the techniques' attributes, major challenges, and potential research trends to inspire novel ideas and future research avenues.
    ACM Computing Surveys (CSUR). 06/2013; 45(3).
  • [Show abstract] [Hide abstract]
    ABSTRACT: The literature on scratchpad memories (SPMs) seems to indicate that the use of dynamic overlaying supersedes static, non-overlay-based (NOB) allocation. Although overlay-based (OVB) techniques operating at source-level code might benefit from multiple hot spots for higher energy savings, they cannot exploit libraries. When operating on binaries, OVB approaches lead to smaller savings, often require dedicated hardware, and sometimes prevent data allocation. Besides, all saving reports published so far ignore that, in cache-based systems, caches are likely to be optimized prior to SPM allocation. We show experimental evidence that, when handling binaries, NOB memory savings (15% to 33% on average) are as good as or better than OVB's. Since our savings (as opposed to related work) were measured after cache tuning -- when there is less room for optimization, our results encourage the use of simpler NOB methods to build library aware allocators that cannot depend on dedicated hardware. We also show that, given the capacity Ct of the equivalent pretuned cache, the optimal SPM size lies in [Ct/2, Ct] for 85% of the evaluated programs. Finally, we show counter-intuitive evidence that, even for cache-based architectures containing small SPMs, procedures should be preferred for allocation instead of basic blocks.


Available from