Conference Paper · PDF available

Execution characteristics of object oriented programs on the UltraSPARC-II

Abstract and Figures

It is widely accepted that object-oriented design improves code reusability, facilitates code maintainability, and enables higher levels of abstraction. Although software developers and the software engineering community have embraced object-oriented programming for these benefits, there have been wide concerns about the performance overhead this programming paradigm incurs on modern processors. We characterize the performance of several C and C++ benchmarks on an UltraSPARC-II processor. Various architectural data related to the execution behavior of the benchmarks are collected using on-chip performance monitoring counters. Metrics including CPI, instruction and data cache misses, and processor stalls due to instruction cache misses and branch mispredictions are measured from real executions of several programs and presented. While previous research evaluated the behavioral differences between C and C++ programs based on profiling and simulation, we measure actual execution behavior. Results show that the programs in the C++ suite incur a higher CPI, more i-cache misses, and more branch mispredictions than the programs in the C suite. A strong correlation was observed between CPI and branch mispredictions for the C++ application programs.
[Figure: UltraSPARC-II block diagram — Prefetch and Dispatch Unit; Instruction Cache & Buffer; Memory Management Unit; Grouping Logic; Integer Execution Unit (IEU) with integer registers & annex; Load/Store Unit (LSU) with data cache, load queue, and store queue; External Cache Unit; Memory Interface Unit (MIU); Floating Point Unit (FPU) with FP registers and FP multiply, add, and divide units; Graphics Unit; UltraSPARC-II bus to external cache RAM.]
[Figure: Experimental framework — the C++ and C programs are compiled with the GCC compiler (ver 2.7.2) into benchmark executables; spixtools collects instruction-set data and Perf-monitor collects execution-related data.]
[Figures: Bar charts comparing the C++ and C suites on instruction-mix percentages, dynamic instruction counts, and CPI; cache miss rates (IC_miss, DC_miss, EC_miss) per suite; stall cycles (i-cache miss, branch mispredict, store buffer, data wait) per suite; and a scatter plot of CPI versus branch-mispredict stall cycles for the C++ programs eqn, deltablue, richards, groff, ixx, and idl.]
... Although software engineers and developers embrace object-oriented programming for its benefits, the performance of object-oriented programs running on non-object-oriented processors is generally lower than that of procedure-oriented programs. Earlier studies of program behavior show that such programs tend to generate dozens of allocation requests in order to run a single line of code [1] [3]. Currently most systems either emulate an object runtime environment in software on top of a non-object-oriented system (Java) or translate objects into target hardware machine code (C++). ...
Article
Modern multi-core processor architectures, supported by system-on-chip and on-chip network technology, strive for the highest possible performance across various applications. This paper discusses a triple-based multi-core architecture which supports object-oriented methodology and applications at the hardware level. However, the Memory Wall is still the bottleneck of whole-system performance. We present a hierarchical shared memory system architecture (HSM), a hierarchically constructed memory shared by multiple cores. Moreover, we propose a new approach to mapping data among different levels of cache and memory, called the part-inclusion cache mapping policy, which facilitates the coherence of shared memory. A new object organization module is also discussed. This paper focuses on object-oriented systems combining the HSM and the part-inclusion policy and presents a new object management model. Our analysis, based on comparisons between our object management and other common link-structured object organization methods, shows that our method is superior in memory parallel-access efficiency and costs less storage space to organize objects.
... Five of the SPEC95 integer programs were used for our simulation: gcc, go, m88ksim, compress, and li. These are the same programs used by Radhakrishnan and John [6]. The next suite of programs is written in C++ and has been used for investigating the behavioral differences between C and C++ [2] [5]. ...
Conference Paper
Full-text available
This paper presents a new instruction cache scheme: the TAC (Thrashing-Avoidance Cache). A 2-way TAC scheme employs 2-way banks and XOR mapping functions. The main function of the TAC is to place a group of instructions separated by a call instruction into a bank according to the Bank Selection Logic (BSL) and Bank-originated Pseudo-LRU replacement policies (BoPLRU). After the BSL initially selects a bank on an instruction cache miss, the BoPLRU will determine the final bank for updating a cache line as a correction mechanism. These two mechanisms can guarantee that recent groups of instructions exist in each bank safely. We have developed a simulation program, TACSim, by using Shade and Spixtools, provided by SUN Microsystems, on an Ultra SPARC/10 processor. Our experimental results show that 2-way TAC schemes reduce conflict misses more effectively than 2-way skewed-associative caches in both C (17% improvement) and C++ (30% improvement) programs on L1 caches.
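XOR mapping functions of this general kind fit in a few lines. The sketch below is illustrative only: the field widths, line size, and the choice of which bank gets the skewed index are assumptions, not the TAC paper's exact Bank Selection Logic.

```c
#include <stdint.h>

/* Illustrative XOR set-index mapping for a 2-way banked i-cache.
   Bank 0 uses the plain index; bank 1 XORs in higher address bits,
   so two addresses that conflict in bank 0 tend to land in
   different sets of bank 1. Field widths are assumed values. */
#define LINE_SHIFT 5                        /* 32-byte lines (assumed)   */
#define SET_BITS   7                        /* 128 sets per bank (assumed) */
#define SET_MASK   ((1u << SET_BITS) - 1)

static uint32_t xor_set_index(uint32_t addr, int bank) {
    uint32_t lo = (addr >> LINE_SHIFT) & SET_MASK;
    uint32_t hi = (addr >> (LINE_SHIFT + SET_BITS)) & SET_MASK;
    return bank == 0 ? lo : (lo ^ hi);      /* bank 1 is XOR-skewed */
}
```

For example, addresses 0x10A0 and 0x20A0 share set 5 in bank 0 but map to sets 4 and 7 in bank 1, which is exactly the thrashing-avoidance effect skewed mappings aim for.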
... Five of the SPECint95 programs were used for our simulation –xlisp, perl, gcc, m88ksim, and go. These are the same programs used in [3][7]. The next suite of programs is written in C++ and has been used for investigating the behavior between C and C++ [8][9]. ...
Conference Paper
Full-text available
In this paper, we present a new hybrid branch predictor called the GoStay2, which can effectively reduce indirect misprediction rates. The GoStay2 has two different mechanisms compared to other 2-stage hybrid predictors that use a Branch Target Buffer (BTB) as the first stage predictor: Firstly, to reduce conflict misses in the first stage, a new effective 2-way cache scheme is used instead of a 4-way set-associative. Secondly, to reduce mispredictions caused by an inefficient predict and update rule, a new selection mechanism and update rule are proposed. We have developed a simulation program by using Shade and Spixtools, provided by SUN Microsystems, on an Ultra SPARC/10 processor. Our results show that the GoStay2 improves indirect misprediction rates of a 64-entry to 4K-entry BTB (with a 512- or 1K-entry PHT) by 14.9% to 21.53% compared to the leaky filter.
... It is widely accepted that object-oriented paradigm can improve code reusability and facilitate code maintenance. Although software engineering and software developers embrace object-oriented programming for benefits, earlier studies of program behavior show that such programs tend to generate dozens of allocation requests in order to run a single line of code [1] [2]. Currently most systems either emulate an object runtime environment in software on top of the non-object-oriented system (Java), or translate objects into target hardware machine code (C++). ...
Conference Paper
A novel object-oriented processor is proposed in this paper, which provides support for object addressing, message passing, and dynamic memory management. Each object running on this processor has its own control thread and communicates with others via messages. A virtually addressed object cache that reduces the indirection overhead while maintaining the efficiency of object relocation is presented. An object table that maintains the handles is used to obtain the actual object location on an object cache miss. Hardware support for explicit dynamic memory management is provided. Object allocation and deletion are strictly bounded in time. Moreover, a new concurrent dynamic memory management algorithm is proposed, which enables the processor to freely access the heap during memory compaction, so applications are not suspended while memory compaction completes.
Chapter
In this paper, we present two mechanisms that reduce indirect mispredictions of two-stage branch predictors: First, to reduce conflict misses in the first-stage predictor, a new cache scheme is proposed instead of a branch target buffer (BTB). Second, to reduce mispredictions caused by the second-stage predictor, efficient predict and update rules are proposed. We have developed a simulation program by using Shade and Spixtools, provided by SUN Microsystems, on an Ultra SPARC/10 processor. Our results show good improvement with these mechanisms compared to other indirect two-stage predictors.
Article
Today, more than 99% of web browsers are enabled with Javascript capabilities, and Javascript's popularity is only going to increase in the future. However, due to bytecode interpretation, Javascript code suffers a severe performance penalty (up to 50x slower) compared to the corresponding native C/C++ code. We recognize that the first step to bridging this performance gap is to understand the architectural execution characteristics of Javascript benchmarks. Therefore, this paper presents an in-depth architectural characterization of the widely used V8 and Sunspider Javascript benchmarks using Google's V8 Javascript engine. Using statistical data analysis techniques, our characterization study discovers and explains correlations among different execution characteristics in both microarchitecture-dependent and microarchitecture-independent fashion. Furthermore, our study measures (dis)similarity among 33 different Javascript benchmarks and discusses its implications. Given the widespread use of Javascript, we believe our findings are useful for both the performance analysis and benchmarking communities.
Conference Paper
Modern multi-core processor architectures strive for the highest possible performance across various applications. This paper discusses a triple-based multi-core architecture which supports object-oriented methodology and applications at the hardware level. However, the Memory Wall is still the bottleneck that must be resolved to decrease the disparity between how fast a CPU can operate on data and how fast it can get data. We present a hierarchical shared memory system architecture (HSM), a hierarchically constructed memory shared by multiple cores. Moreover, we propose a new approach to mapping data among different levels of cache and memory, called the partially-inclusive cache mapping policy, which facilitates the coherence of shared memory. This paper focuses on object-oriented systems combining the HSM and the partially-inclusive policy and presents a new object management model. The analysis, based on comparisons between our object management and link-structured object organization methods, shows that our method is superior in the spatial and temporal aspects of memory parallel-access efficiency and costs less storage space to organize objects.
Conference Paper
Full-text available
We present an efficient cache scheme, which can considerably reduce instruction cache misses caused by procedure calls/returns. This scheme employs N-way banks and XOR mapping functions. The main function of this scheme is to place a group of instructions separated by a call instruction into a bank according to the initial and final bank selection mechanisms. After the initial bank selection mechanism selects a bank on an instruction cache miss, the final bank selection mechanism will determine the final bank for updating a cache line as a correction mechanism. These two mechanisms can guarantee that recent groups of instructions exist in each bank safely. We have developed a simulation program by using Shade and Spixtools, provided by SUN Microsystems, on an Ultra SPARC/10 processor. Our experimental results show that these schemes reduce conflict misses more effectively than skewed-associative caches in both C (up to 9.29% improvement) and C++ (up to 30.71% improvement) programs on L1 caches. In addition, they also allow for a significant miss reduction on Branch Target Buffers (BTBs).
Conference Paper
Understanding the characteristics of workloads is extremely important in the design of efficient computer architectures. Accurate characterization of workload behavior leads to the design of improved architectures. The characterization of applications allows one to tune the processor microarchitecture, memory hierarchy, and system architecture to suit particular features in programs. Workload characterization also has a significant impact on performance evaluation. Understanding the nature of the workload and its intrinsic features can help to interpret performance measurements and simulation results. Identifying and characterizing the intrinsic properties of an application in terms of its memory access behavior, locality, control flow behavior, instruction-level parallelism, etc. can eventually lead to a program behavior model, which can be used in conjunction with a processor model to do analytical performance modeling of computer systems. In this paper, we describe the objectives of workload characterization and emphasize the importance of obtaining architecture-independent metrics for workloads. A study of memory reference locality using some generic metrics is presented as an example.
Article
Full-text available
Superscalar machines can issue several instructions per cycle. Superpipelined machines can issue only one instruction per cycle, but they have cycle times shorter than the latency of any functional unit. In this paper these two techniques are shown to be roughly equivalent ways of exploiting instruction-level parallelism. A parameterizable code reorganization and simulation system was developed and used to measure instruction-level parallelism for a series of benchmarks. Results of these simulations in the presence of various compiler optimizations are presented. The average degree of superpipelining metric is introduced. Our simulations suggest that this metric is already high for many machines. These machines already exploit all of the instruction-level parallelism available in many non-numeric applications, even without parallel instruction issue or higher degrees of pipelining.
Article
Accurate branch prediction is critical to performance; mispredicted branches mean that tens of cycles may be wasted in superscalar architectures. Architectures combining very effective branch prediction mechanisms with modified branch target buffers (BTBs) have been proposed for wide-issue processors. These mechanisms require considerable processor resources. Concurrently, the larger address space of 64-bit architectures introduces new obstacles and opportunities. A larger address space means branch target buffers become more expensive. In this paper, we show how a combination of less expensive mechanisms can achieve better performance than BTBs. This combination relies on a number of design choices described in the paper. We used trace-driven simulation to show that our proposed design, which uses fewer resources, offers better performance than previously proposed alternatives for most programs, and we indicate how to further improve this design.
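The baseline that these BTB-centric designs refine is the classic 2-bit saturating-counter predictor. A minimal sketch follows; the table size and the PC hash are illustrative assumptions, not taken from any of the cited papers.

```c
#include <stdint.h>

/* Classic 2-bit saturating-counter branch predictor.
   Counter states 0-1 predict not-taken, 2-3 predict taken.
   PHT size and the pc>>2 index hash are illustrative choices. */
#define PHT_SIZE 1024
static uint8_t pht[PHT_SIZE];   /* all counters start at 0 (strongly not-taken) */

static int predict(uint32_t pc) {
    return pht[(pc >> 2) % PHT_SIZE] >= 2;
}

static void update(uint32_t pc, int taken) {
    uint8_t *c = &pht[(pc >> 2) % PHT_SIZE];
    if (taken && *c < 3)
        (*c)++;                 /* saturate upward at 3 */
    else if (!taken && *c > 0)
        (*c)--;                 /* saturate downward at 0 */
}
```

The hysteresis is the point of the scheme: a counter saturated at 3 needs two consecutive not-taken outcomes to flip its prediction, so a single loop-exit branch does not spoil the prediction for the next entry into the loop.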
Article
Several researchers have proposed algorithms for basic block reordering; we call these branch alignment algorithms. The primary emphasis of these algorithms has been on improving instruction cache locality, and the few studies concerned with branch prediction reported small or minimal improvements. As wide-issue architectures become increasingly popular, the importance of reducing branch costs will increase, and branch alignment is one mechanism which can effectively reduce these costs. In this paper, we propose an improved branch alignment algorithm that takes into consideration the architectural cost model and the branch prediction architecture when performing the basic block reordering. We show that branch alignment algorithms can improve a broad range of static and dynamic branch prediction architectures. We also show that program performance can be improved by approximately 5% even when using recently proposed, highly accurate branch prediction architectures. The programs are compiled by any existing compiler and then transformed via binary transformations. When implementing these algorithms on an Alpha AXP 21604, up to a 16% reduction in total execution time is achieved.
Article
ATOM (Analysis Tools with OM) is a single framework for building a wide range of customized program analysis tools. It provides the common infrastructure present in all code-instrumenting tools; this is the difficult and time-consuming part. The user simply defines the tool-specific details in instrumentation and analysis routines. Building a basic block counting tool like Pixie with ATOM requires only a page of code. ATOM, using OM link-time technology, organizes the final executable such that the application program and user's analysis routines run in the same address space. Information is directly passed from the application program to the analysis routines through simple procedure calls instead of inter-process communication or files on disk. ATOM takes care that analysis routines do not interfere with the program's execution, and precise information about the program is presented to the analysis routines at all times. ATOM uses no simulation or interpretation. ATOM has been implemented on the Alpha AXP under OSF/1. It is efficient and has been used to build a diverse set of tools for basic block counting, profiling, dynamic memory recording, instruction and data cache simulation, pipeline simulation, evaluating branch prediction, and instruction scheduling.
Article
C++ class libraries have been developed and used for accelerator modeling and machine control at the Advanced Light Source. A class library for accelerator modeling is portable and supports multiple model instances dynamically at run time. A class library for machine control covers fields from virtual devices to simulation studies.
Conference Paper
Simulation and tracing tools help in the analysis, design, and tuning of both hardware and software systems. Simulators can execute code for hardware that does not yet exist, can provide access to internal state that may be invisible on real hardware, can give deterministic execution in the face of races, and can produce "stress test" situations that are hard to produce on the real hardware [4]. Tracing tools can provide detailed information about the behavior of a program; that information is used to drive an analyzer that analyzes or predicts the behavior of a particular system component. That, in turn, provides feedback that is used to improve the design and implementation of everything from architectures to compilers to applications. Analyzers consume many kinds of trace information; for example, address traces are used for studies of memory hierarchies, opcode and operand usage for superscalar and pipelined processor design, instruction counts for optimization studies, operand values for memoizing studies, and branch behavior for branch prediction.