Conference Paper

Decomposing memory performance: data structures and phases.

DOI: 10.1145/1133956.1133970 Conference: Proceedings of the 5th International Symposium on Memory Management, ISMM 2006, Ottawa, Ontario, Canada, June 10-11, 2006
Source: DBLP

ABSTRACT The memory hierarchy continues to have a substantial effect on application performance. This paper explores the potential of high-level application understanding in improving the performance of modern memory hierarchies, decomposing the often-chaotic address stream of an application into multiple more regular streams. We present two orthogonal methodologies. The first is a system called DTrack that decomposes the dynamic reference stream of a C program by tagging each reference with its global variable or heap call-site name. The second is a technique to determine the correct granularity at which to study the global phase behavior of applications. Applying these twin analysis methods to twelve C SPEC2000 benchmarks, we demonstrate that they reveal data structure interactions that remain obscured with traditional aggregation-based analysis methods. Such a characterization creates a rich profile of an application's memory behavior that highlights the most memory-intensive data structures and program phases, and we illustrate how this profile can lead system and application designers to a deeper understanding of the applications they study.

… that the diverse patterns of behavior in realistic applications are represented. In this study we explore the benefits of program understanding along two orthogonal dimensions, data structure and program phase, and show how such insights can be combined to yield a rich picture of application behavior.

Analyzing memory behavior is a well-trodden field; this paper makes three novel contributions to it. First, we develop a tool called DTrack to decompose the performance of the memory hierarchy by high-level data structure for C programs. DTrack uses a C-to-C compiler to instrument variable allocations, thereby allowing each memory reference to be mapped to a specific global variable or heap call-site. Second, we fill a gap in recent studies on phase behavior: selecting the correct profiling interval for an application, the granularity at which behavior statistics are aggregated. While studies on phase behavior so far use a single, arbitrarily chosen profiling interval for all applications, we show that the profiling interval is best selected on an application-specific basis. Finally, we apply our methodologies to twelve of the fifteen C benchmarks in the SPEC2000 benchmark suite, and present a detailed characterization of their memory behavior. Our results highlight the wide variety of behaviors exhibited by applications in the distribution of misses by data structure, as well as in the number and interleaving of different phase regimes. Several case studies demonstrate the usefulness of these results in helping the computer architect make sophisticated design decisions.

The remainder of this paper is organized as follows. Section 2 distinguishes our work from prior memory system analysis studies and tools. Section 3 describes DTrack and demonstrates its ability to decompose the aggregate behavior of applications by data structure. This section also describes two case studies that illustrate the uses of DTrack in framing and rapidly answering sophisticated questions in designing new systems. Section 4 takes our analysis further, decomposing the address streams of applications by both data structure and time. A major contribution here is a new way to determine the right granularity at which to sample phase data, and we show that this granularity varies from application to application.
Finally, Section 5 provides conclusions and thoughts on future work.
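To make the DTrack idea concrete, here is a minimal sketch of call-site tagging for heap data. The paper's tool performs this instrumentation with a C-to-C compiler pass; the sketch below only approximates the idea with a malloc wrapper macro, and every name in it (tagged_malloc, site_of, the record table) is illustrative rather than part of DTrack.

```c
/* Minimal sketch of call-site tagging in the spirit of DTrack.
 * The real tool uses a C-to-C compiler; this approximation uses a
 * macro wrapper. All names here are illustrative assumptions. */
#include <stdio.h>
#include <stdlib.h>

typedef struct {
    void       *base;   /* start of the allocated block   */
    size_t      size;   /* block length in bytes          */
    const char *site;   /* "file:line" of the malloc call */
} alloc_record;

#define MAX_RECORDS 4096
static alloc_record records[MAX_RECORDS];
static int nrecords = 0;

static void *tagged_malloc(size_t size, const char *site)
{
    void *p = malloc(size);
    if (p && nrecords < MAX_RECORDS) {
        records[nrecords].base = p;
        records[nrecords].size = size;
        records[nrecords].site = site;
        nrecords++;
    }
    return p;
}

/* Stringify __LINE__ so every call site gets a unique tag. */
#define STR2(x) #x
#define STR(x)  STR2(x)
#define MALLOC(n) tagged_malloc((n), __FILE__ ":" STR(__LINE__))

/* Map an address from the reference stream back to its call site. */
static const char *site_of(const void *addr)
{
    for (int i = 0; i < nrecords; i++) {
        const char *b = records[i].base;
        if ((const char *)addr >= b && (const char *)addr < b + records[i].size)
            return records[i].site;
    }
    return "unknown";  /* stack, globals, or untracked allocations */
}

int main(void)
{
    int *v = MALLOC(64 * sizeof *v);
    printf("%p allocated at %s\n", (void *)&v[10], site_of(&v[10]));
    free(v);
    return 0;
}
```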
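The second methodology concerns choosing the profiling interval per application. The paper's actual selection technique is not described in this abstract, so the sketch below (with a synthetic trace and invented interval sizes) only illustrates why the granularity matters: it aggregates a miss trace at several candidate intervals and reports the coefficient of variation of the per-interval miss rate; at too coarse an interval the variation, and with it the apparent phase structure, disappears.

```c
/* Illustrative sketch (not the paper's algorithm): aggregate a miss
 * trace at several candidate profiling intervals and report the
 * coefficient of variation (CoV) of the per-interval miss rate.
 * A near-zero CoV suggests the interval is too coarse to expose
 * phases. The trace here is synthetic. Link with -lm. */
#include <math.h>
#include <stdio.h>

#define TRACE_LEN 100000

/* miss[i] = 1 if reference i missed in the cache, else 0 */
static unsigned char miss[TRACE_LEN];

static double cov_at_interval(long interval)
{
    long   nwin = TRACE_LEN / interval;
    double sum = 0.0, sumsq = 0.0;

    for (long w = 0; w < nwin; w++) {
        long misses = 0;
        for (long i = w * interval; i < (w + 1) * interval; i++)
            misses += miss[i];
        double rate = (double)misses / interval;
        sum   += rate;
        sumsq += rate * rate;
    }
    double mean = sum / nwin;
    double var  = sumsq / nwin - mean * mean;
    return mean > 0.0 ? sqrt(var > 0.0 ? var : 0.0) / mean : 0.0;
}

int main(void)
{
    /* Synthetic trace: two alternating phases with different miss rates. */
    for (long i = 0; i < TRACE_LEN; i++)
        miss[i] = ((i / 10000) % 2 == 0) ? (i % 4 == 0) : (i % 32 == 0);

    long intervals[] = { 100, 1000, 10000, 100000 };
    for (int k = 0; k < 4; k++)
        printf("interval %6ld: CoV of miss rate = %.3f\n",
               intervals[k], cov_at_interval(intervals[k]));
    return 0;
}
```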

  • ABSTRACT: Knowledge about program worst-case execution time (WCET) is essential in validating real-time systems and helps in effective scheduling. One popular approach used in industry is to measure the execution time of program components on the target architecture and combine them using static analysis of the program. Measurements need to be taken in the least intrusive way in order to avoid affecting the accuracy of the estimated WCET. Several programs exhibit phase behavior, wherein a program's dynamic execution is observed to be composed of phases. Each phase, being distinct from the others, exhibits homogeneous behavior with respect to cycles per instruction (CPI), data cache misses, etc. In this paper, we show that phase behavior has important implications for timing analysis. We make use of the homogeneity of a phase to reduce instrumentation overhead while ensuring that the accuracy of the WCET is not largely affected. We propose a model for estimating WCET using static worst-case instruction counts of individual phases and a function of measured average CPI. We describe a WCET analyzer built on this model which targets two different architectures. The WCET analyzer is observed to give safe estimates for most benchmarks considered in this paper. The tightness of the WCET estimates is observed to improve for most benchmarks compared to Chronos, a well-known static WCET analyzer. (A back-of-envelope sketch of this phase-based model appears after this list.)
    15th Workshop on Interaction between Compilers and Computer Architectures (INTERACT); 03/2011
  • ABSTRACT: Statistical cache models are powerful tools for understanding application behavior as a function of cache allocation. However, previous techniques have modeled only the average application behavior, which hides the effect of program variations over time. Without detailed time-based information, transient behavior, such as exceeding bandwidth or cache capacity, may be missed. Yet these events, while short, often play a disproportionate role and are critical to understanding program behavior. In this work we extend earlier techniques to incorporate program phase information when collecting runtime profiling data. This allows us to model an application's cache miss ratio as a function of its cache allocation over time. To reduce overhead and improve accuracy we use online phase detection and phase-guided profiling. The phase-guided profiling reduces overhead by more intelligently selecting portions of the application to sample, while accuracy is improved by combining samples from different instances of the same phase. The result is a new technique that accurately models the time-varying behavior of an application's miss ratio as a function of its cache allocation on modern hardware. By leveraging phase-guided profiling, this work both improves on the accuracy of previous techniques and reduces the overhead. (A minimal sketch of pooling samples by phase appears after this list.)
    01/2012;
  • ABSTRACT: Modern system architectures sometimes include scratch-pad memories (SPM) in their memory hierarchy to take advantage of their simpler design, in an attempt to meet the system's area, performance, and power budget. Systems employing SPM can be broadly categorized as: (a) cacheless systems with only SPM, (b) hybrid systems with both cache and SPM, and (c) reconfigurable systems with the provision to reconfigure local memory as either cache, SPM, or a combination of the two. However, SPM-based systems have required greater programming effort, mainly because data allocation and data transfers must be orchestrated explicitly in software. Tight product development cycles require faster development and porting of diverse applications to multiple SPM-based architectures. In this paper we present SPM-Sieve, a profile-based tool and framework targeted at SPM-based architectures that generates partitioning decisions for the first level of the memory hierarchy and suggests object mappings among the memory partitions without resorting to detailed simulation of all configurations. This is done by natively executing an application with a minimal target-architecture specification, which not only provides early information to guide data organization in the application, but also provides a foundation for more sophisticated algorithms to produce optimized allocations. We demonstrate the utility and generality of SPM-Sieve by evaluating it on a large number of SPEC2000 benchmarks targeted at a 128KB first-level memory. We evaluate its effectiveness through simulation studies comparing the partition suggested by the tool against varying partition sizes, and observe that its suggestions are very competitive for SPM-based architectures with and without caches. (An illustrative greedy mapping heuristic appears after this list.)
    International Conference on Compilers, Architecture and Synthesis for Embedded Systems (CASES); 01/2013
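The phase-based WCET model in the first abstract above combines a static worst-case instruction count per phase with a measured average CPI per phase. A back-of-envelope sketch of that combination, assuming a fixed clock frequency and an illustrative safety margin (neither value, nor any of the numbers, comes from the paper):

```c
/* Sketch of a phase-based WCET estimate: sum over phases of
 * (static worst-case instruction count) x (measured average CPI),
 * converted to time. All figures below are invented assumptions. */
#include <stdio.h>

struct phase {
    const char *name;
    long long   wc_insts;  /* static worst-case instruction count */
    double      avg_cpi;   /* measured average cycles/instruction  */
};

int main(void)
{
    struct phase phases[] = {
        { "init",     2000000LL, 1.4 },
        { "compute", 90000000LL, 0.9 },
        { "output",    500000LL, 2.1 },
    };
    double clock_hz = 1.0e9;   /* assumed 1 GHz target        */
    double margin   = 1.10;    /* illustrative safety factor  */

    double cycles = 0.0;
    for (int i = 0; i < 3; i++) {
        double c = (double)phases[i].wc_insts * phases[i].avg_cpi;
        printf("%-8s %14.0f worst-case cycles\n", phases[i].name, c);
        cycles += c;
    }
    printf("estimated WCET: %.3f ms\n", margin * cycles / clock_hz * 1e3);
    return 0;
}
```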
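The second abstract's phase-guided profiling pools samples taken in different instances of the same phase. A minimal sketch of that pooling, assuming phases are already identified and cache allocations are indexed discretely; the data layout and numbers are invented for illustration:

```c
/* Illustrative pooling of profile samples by phase: each phase
 * accumulates its own miss-ratio curve over cache allocations, and
 * samples from repeated instances of a phase are averaged together. */
#include <stdio.h>

#define NPHASES 4
#define NALLOCS 8   /* candidate cache allocations, e.g. 1..8 ways */

static double sum[NPHASES][NALLOCS];
static long   cnt[NPHASES][NALLOCS];

/* Record one profiled sample: the miss ratio seen in `phase` when the
 * application ran with allocation index `alloc`. */
static void record(int phase, int alloc, double miss_ratio)
{
    sum[phase][alloc] += miss_ratio;
    cnt[phase][alloc]++;
}

/* Predicted miss ratio for a phase at a given allocation: the mean
 * over all samples pooled from that phase's instances. */
static double predict(int phase, int alloc)
{
    return cnt[phase][alloc] ? sum[phase][alloc] / cnt[phase][alloc] : -1.0;
}

int main(void)
{
    /* Two instances of phase 2 profiled at a 4-way allocation. */
    record(2, 3, 0.12);
    record(2, 3, 0.10);
    printf("phase 2 @ alloc 3: predicted miss ratio %.3f\n", predict(2, 3));
    return 0;
}
```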
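SPM-Sieve's output, as described in the third abstract, is a partition of first-level memory plus an object-to-partition mapping derived from a native-execution profile. The tool's actual algorithm is not given there; the sketch below shows one plausible greedy heuristic that fills a scratch-pad budget in order of access density (accesses per byte). All object names, sizes, and counts are invented:

```c
/* Illustrative greedy object-to-SPM mapper (not SPM-Sieve's actual
 * algorithm): place the densest objects into a fixed scratch-pad
 * budget; everything else falls to the cache side of the partition. */
#include <stdio.h>
#include <stdlib.h>

struct object {
    const char *name;
    size_t      size;      /* bytes                  */
    long        accesses;  /* profiled dynamic count */
};

/* Sort by access density (accesses per byte), descending. */
static int by_density_desc(const void *a, const void *b)
{
    const struct object *x = a, *y = b;
    double dx = (double)x->accesses / x->size;
    double dy = (double)y->accesses / y->size;
    return (dx < dy) - (dx > dy);
}

int main(void)
{
    struct object objs[] = {
        { "lookup_table", 16 * 1024, 900000 },
        { "frame_buf",    96 * 1024, 400000 },
        { "scratch",       8 * 1024, 700000 },
    };
    size_t n = sizeof objs / sizeof objs[0];
    size_t spm_budget = 64 * 1024;   /* assumed 64 KB of a 128 KB level-1 */

    qsort(objs, n, sizeof objs[0], by_density_desc);

    size_t used = 0;
    for (size_t i = 0; i < n; i++) {
        const char *where = "cache";
        if (used + objs[i].size <= spm_budget) {
            used += objs[i].size;
            where = "SPM";
        }
        printf("%-12s -> %s\n", objs[i].name, where);
    }
    printf("SPM used: %zu of %zu bytes\n", used, spm_budget);
    return 0;
}
```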
