
Abstract

Future high-performance virtual machines will improve performance through sophisticated online feedback-directed optimizations. This paper presents the architecture of the Jalapeño Adaptive Optimization System, a system to support leading-edge virtual machine technology and enable ongoing research on online feedback-directed optimizations. We describe the extensible system architecture, based on a federation of threads with asynchronous communication. We present an implementation of the general architecture that supports adaptive multi-level optimization based purely on statistical sampling. We empirically demonstrate that this profiling technique has low overhead and can improve startup and steady-state performance, even in the absence of online feedback-directed optimizations. The paper also describes and evaluates an online feedback-directed inlining optimization based on statistical edge sampling. The system is written completely in Java, applying the described techniques …
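The statistical sampling the abstract describes can be illustrated with a minimal sketch: a listener records which method is executing at each timer tick, and a method is considered hot once its share of all samples crosses a threshold. This is an invented illustration, not the actual Jalapeño code; the class and method names are assumptions.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of sample-based hot-method detection (illustrative only).
public class MethodSampler {
    private final Map<String, Integer> samples = new HashMap<>();
    private int total = 0;

    // Called from a timer-tick listener with the currently executing method.
    public void record(String method) {
        samples.merge(method, 1, Integer::sum);
        total++;
    }

    // A method is "hot" if it accounts for more than the given fraction of samples.
    public boolean isHot(String method, double threshold) {
        if (total == 0) return false;
        return samples.getOrDefault(method, 0) / (double) total > threshold;
    }
}
```

Because samples are taken only at timer ticks, overhead stays low regardless of how often methods are invoked, which is the property the paper's evaluation relies on.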
Adaptive Optimization in the Jalapeño JVM
Matthew Arnold, Stephen Fink, David Grove, Michael Hind, Peter F. Sweeney
IBM T.J. Watson Research Center / Rutgers University
ABSTRACT
1. INTRODUCTION
© 2000 ACM. This copy is posted by permission of ACM and may not be redistributed. The official citation of this work is: Arnold, M., Fink, S., Grove, D., Hind, M., and Sweeney, P.F. 2000. Adaptive Optimization in the Jalapeño JVM. In Proceedings of the ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA '00), Minneapolis, Minnesota, October 15–19, 2000.
2. BACKGROUND
[Figure: Overview of compilation scenarios in Jalapeño. Executing code interacts with the ClassLoader (class load request, class initialization), the Dynamic Linker (unresolved reference, resolution), lazy compilation (stub invoked → compile), and the compilers (Base, Opt, ...), which produce machine code; profiling data flows to the Adaptive Optimization System, which issues (re)compilation plans.]
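The lazy compilation scenario above — a stub is invoked, triggering compilation of the real method body — can be sketched at a high level in plain Java. This is an invented illustration (Jalapeño performs this at the machine-code level); the class name and use of a functional interface are assumptions.

```java
import java.util.function.IntUnaryOperator;

// Sketch of lazy compilation: a method slot initially holds a stub that
// "compiles" the real body on first invocation, then replaces itself.
public class LazySlot {
    private IntUnaryOperator code;
    private int compileCount = 0;

    public LazySlot(IntUnaryOperator realBody) {
        // The stub: compile on first call, install the code, then run it.
        this.code = x -> {
            compileCount++;          // stands in for invoking the compiler
            this.code = realBody;    // install "machine code" over the stub
            return realBody.applyAsInt(x);
        };
    }

    public int invoke(int x) { return code.applyAsInt(x); }
    public int compileCount() { return compileCount; }
}
```

The payoff is that methods that are loaded but never executed are never compiled; the stub is replaced exactly once, on first use.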
3. SYSTEM ARCHITECTURE
3.1 Runtime Measurements Subsystem
[Figure: Architecture of the Adaptive Optimization System. Instrumented/optimized executing code and hardware/VM performance monitors produce raw data; organizer threads in the Runtime Measurements Subsystem turn raw data into formatted profile information and post events on the Organizer Event Queue. The Controller consults the AOS Database, forms (instrumentation/compilation) plans, and places them on the Compilation Queue; the compilers (Base, Opt, ...) carry out the plans and install new code.]
3.2 Controller
3.3 Recompilation Subsystem
3.4 AOS Database
4. MULTI-LEVEL RECOMPILATION
4.1 Overview
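Controllers for multi-level recompilation typically rest on a cost/benefit test: recompile a method at a higher level only if the compilation cost plus the estimated future execution time at the new level is less than the estimated future time at the current level. A hedged sketch of such a test (the names and the form of the estimate are invented, not Jalapeño's exact model):

```java
// Sketch of a cost/benefit recompilation decision.
// futureTimeCur: estimated future execution time at the current opt level;
// speedup:       expected speedup factor of the higher opt level;
// compileCost:   estimated time to recompile at that level.
public class RecompileDecision {
    static boolean shouldRecompile(double futureTimeCur, double speedup, double compileCost) {
        double futureTimeNew = futureTimeCur / speedup;
        return compileCost + futureTimeNew < futureTimeCur;
    }
}
```

A common way to estimate future time is to assume a method will run roughly as long as it already has, which naturally filters out short-lived methods that could never amortize the compile cost.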
4.2 Sampling
[Figure: Runtime measurements for multi-level recompilation. Method samples from executing code feed the Hot Methods Organizer and the Decay Organizer in the Runtime Measurements Subsystem; the Hot Methods Organizer posts events on the Organizer Event Queue, and the Controller, consulting the AOS Database, places compilation plans on the Compilation Queue, which the compilation thread executes with the compilers (Base, Opt, ...) to install new optimized/instrumented code.]
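The Decay Organizer shown in the figure exists to keep hotness data fresh: periodically scaling down accumulated sample counts makes activity from an earlier program phase stop looking hot. A minimal invented sketch of that idea:

```java
// Sketch of a decay organizer: periodically scales down accumulated
// sample counts so that stale activity stops looking hot (illustrative).
public class DecayOrganizer {
    private final java.util.Map<String, Double> samples = new java.util.HashMap<>();

    void record(String method) { samples.merge(method, 1.0, Double::sum); }

    // Run periodically: multiply every count by a factor in (0, 1).
    void decay(double factor) { samples.replaceAll((m, c) -> c * factor); }

    double count(String method) { return samples.getOrDefault(method, 0.0); }
}
```

Exponential decay like this weights recent samples more heavily without storing any per-sample timestamps.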
4.3 Recompilation
5. FEEDBACK-DIRECTED INLINING
6. PERFORMANCE EVALUATION
[Figure: Runtime measurements for feedback-directed inlining. Method samples feed the Hot Methods Organizer and Decay Organizer, while edge samples feed the Dynamic Call Graph Organizer, which maintains a dynamic call graph consumed by the Adaptive Inlining Organizer to produce inlining rules; the Controller combines these with AOS Database information into compilation plans on the Compilation Queue, which the compilation thread executes with the compilers (Base, Opt, ...) to install new code.]
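The edge-sampling pipeline above accumulates sampled caller→callee pairs into a weighted dynamic call graph and then turns the hottest edges into inlining rules. A rough invented illustration (the threshold scheme and names are assumptions, not Jalapeño's exact policy):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: build a weighted dynamic call graph from sampled call edges,
// then emit inlining rules for edges above a hotness threshold.
public class AdaptiveInliner {
    private final Map<String, Integer> edgeSamples = new HashMap<>();
    private int total = 0;

    void sampleEdge(String caller, String callee) {
        edgeSamples.merge(caller + "->" + callee, 1, Integer::sum);
        total++;
    }

    // Recommend inlining every edge whose share of samples exceeds the threshold.
    List<String> inliningRules(double threshold) {
        List<String> rules = new ArrayList<>();
        for (Map.Entry<String, Integer> e : edgeSamples.entrySet())
            if (e.getValue() / (double) total > threshold)
                rules.add("inline " + e.getKey());
        return rules;
    }
}
```

Sampling edges rather than instrumenting every call keeps the overhead low enough to leave the profiling enabled throughout execution.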
6.1 Experimental Methodology
6.2 Multi-Level Recompilation
[Figure: Speedup over the JIT Baseline configuration for compress, jess, db, javac, mpegaudio, mtrt, jack, opt-compiler, volano, and the geometric mean, comparing JIT OptLevel 0/1/2, Adaptive OptLevel 0/1/2, and Adaptive Multi-Level; two panels with y-axes spanning 0–5 and 0–8.]
6.3 Feedback-Directed Inlining
6.4 Adaptive System Overhead
6.5 Recompilation Decisions
[Figure: Speedup over Adaptive Multi-Level for compress, jess, db, javac, mpegaudio, mtrt, jack, opt-compiler, volano, and the geometric mean, measured at program startup and in steady state; bar labels range from 0.91 to 1.74.]
[Figure: Breakdown of Jalapeño Execution Time among the application thread(s), garbage collection, the baseline compiler, and all AOS threads, together with a Breakdown of Time in AOS Threads among the Controller, Method Organizer, Inlining Organizer, Decay Organizer, and Optimizing Compiler; each breakdown is shown for program startup and long-running execution.]
[Figure: Recompilation decisions. Top: percentage of all methods compiled per benchmark (compress, jess, db, javac, mpegaudio, mtrt, jack, opt-compiler, volano), broken down by compilation path (Baseline only; Baseline → Level 0; Baseline → Level 1; Baseline → Level 0 → Level 1; Baseline → Level 2; Baseline → Level 1 → Level 2; Baseline → Level 0 → Level 2). Bottom: number of methods recompiled at Levels 0, 1, and 2 for compress through jack.]
7. DISCUSSION
8. RELATED WORK
9. CONCLUSIONS
Acknowledgments
10. REFERENCES
APPENDIX
... Jai et al. [21] optimized Spark performance by tuning the Memory Manager configurations and were able to obtain performance improvements up to 25%. There have been attempts even to tune JIT flags manually [19], [8]. ...
Preprint
Java is the backbone of widely used big data frameworks, such as Apache Spark, owing to its productivity, portability from JVM-based execution, and support for a rich set of libraries. However, the performance of these applications can vary widely depending on the runtime flags chosen from all existing JVM flags. Manually tuning these flags is both cumbersome and error-prone. Automated tuning approaches can ease the task, but current solutions either require considerable processing time or target a subset of flags to avoid the time and space requirements. In this paper, we present OneStopTuner, a novel machine-learning-based framework for autotuning JVM flags. OneStopTuner controls the amount of data generation by leveraging batch-mode active learning to characterize the user application. Based on the user-selected optimization metric, OneStopTuner then discards irrelevant JVM flags by applying feature-selection algorithms to the generated data. Finally, it employs sample-efficient methods such as Bayesian optimization and regression-guided Bayesian optimization on the shortlisted JVM flags to find optimal values for the chosen set of flags. We evaluated OneStopTuner on widely used Spark benchmarks and compared its performance with the traditional simulated-annealing-based autotuning approach. We demonstrate that, for optimizing execution time, the flags chosen by OneStopTuner provide a speedup of up to 1.35x over default Spark execution, compared to the 1.15x speedup obtained with the flag configurations proposed by simulated annealing. OneStopTuner reduced the number of executions for data generation by 70% and suggested the optimal flag configuration 2.4x faster than the standard simulated-annealing-based approach, excluding the time for data generation.
... In applications where compilation time is a concern, such as dynamic compilation systems [4,42,70], researchers try to balance compilation time and code quality. In this context, they avoid register allocation algorithms based on graph coloring, because graph coloring is complex and yields a time-consuming register allocator. ...
Article
Ant Colony Optimization is a metaheuristic used to create heuristic algorithms that find good solutions to combinatorial optimization problems. The metaheuristic is inspired by the behavior some ant species exhibit when exploring the environment to find and transport food to the nest. Several works have proposed Ant Colony Optimization algorithms for problems such as vehicle routing, frequency assignment, scheduling, and graph coloring. The graph coloring problem essentially consists in finding a number k of colors to assign to the vertices of a graph so that no two adjacent vertices share the same color. This paper presents the hybrid ColorAnt-RT algorithms, a class of algorithms for graph coloring problems based on the Ant Colony Optimization metaheuristic that uses Tabu Search as its local search. The experiments with ColorAnt-RT indicate that changing the way the pheromone trail is reinforced yields better results; indeed, the results show ColorAnt-RT to be a promising option for finding good approximations of k. The good results obtained by ColorAnt-RT motivated its use in a register allocator based on Ant Colony Optimization, called CARTRA. Accordingly, this paper also presents CARTRA, an algorithm that extends a classic graph coloring register allocator to use the ColorAnt-RT graph coloring algorithm. CARTRA minimizes the amount of spills, thereby improving the quality of the generated code.
... Additionally, offline profiling also suffers from the practical restriction presented by the difficulty or inability, in certain cases, to collect a profile trace of the application prior to execution. It is believed that an ability to perform the profiling at runtime using an online strategy may help overcome some of these drawbacks [Arnold et al. 2000a, 2002]. ...
Article
Many performance optimizations rely on or are enhanced by runtime profile information. However, both offline and online profiling techniques suffer from intrinsic and practical limitations that affect the quality of delivered profile data. The quality of profile data is its ability to accurately predict (relevant aspects of) future program behavior. While these limitations are known, their impact on the effectiveness of profile-guided optimizations, compared to the ideal performance, is not as well understood. We define ideal performance for adaptive optimizations as that achieved with a precise profile of future program behavior. In this work, we study and quantify the performance impact of fundamental profiling limitations by comparing the effectiveness of typical adaptive optimizations when using the best profiles generated by offline and online schemes against a baseline where the adaptive optimization is given access to profile information about the future execution of the program. We model and compare the behavior of three adaptive JVM optimizations—heap memory management using object usage profiles, code cache management using method usage profiles, and selective just-in-time compilation using method hotness profiles—for the Java DaCapo benchmarks. Our results provide insight into the advantages and drawbacks of current profiling strategies and shed light on directions for future profiling research.
Article
The study of programming languages is a rich field within computer science, incorporating both the abstract theoretical portions of computer science and platform-specific details. Topics studied in programming languages, chiefly compilers and interpreters, are permanent fixtures in programming that students will interact with throughout their careers. These systems are, however, considerably complicated, as they must cover a wide range of functionality in order to enable languages to be created and run. The process of educating students thus requires that the demanding workload of creating one of these systems be balanced against the time and resources present in a university classroom setting. Systems building upon these fundamental systems can become out of reach when the number of preceding concepts, and thus classes, is taken into account. Among these is the study of just-in-time (JIT) compilers, which marry the processes of interpreters and compilers for the purposes of a flexible and fast runtime. The purpose of this thesis is to present JITed, a framework within which JIT compilers can be developed with a time commitment and workload befitting a classroom setting, specifically one as short as ten weeks. A JIT compiler requires the development of both an interpreter and a compiler. This poses a problem, as classes teaching compilers and interpreters typically feature the construction of one of those systems as their term project, making the construction of both within the time span usually allotted for a single system infeasible. To remedy this, JITed features a prebuilt interpreter that provides the runtime environment necessary for the compiler portion of a JIT compiler to be built. JITed includes an interface for students to provide both their own compiler and the functionality to determine which portions of code should be compiled. The framework allows important concepts of both compilers in general and JIT compilers to be taught in a reasonable timeframe.
Chapter
Software protectors are products for shielding a binary executable with transformations that obfuscate and compress its original bytes in order to reveal them only during execution. In addition, they implement early-stage evasion techniques that actively look for the presence of someone who is trying to study or break such protections, since the analysis environments used to this end introduce typical artifacts into the execution. In this paper we analyze a plethora of evasions used by protectors through dynamic binary instrumentation (DBI), a technique that augments the execution of a program with capabilities of monitoring and altering it up to instruction-level granularity. As a result of the analysis, we survey what artifacts are searched for by the most important software protectors on the market, dividing them according to the type of artifacts they target: environmental, analysis tools, and DBI itself.
Article
Managed language virtual machines (VM) rely on dynamic or just-in-time (JIT) compilation to generate optimized native code at run-time to deliver high execution performance. Many VMs and JIT compilers collect profile data at run-time to enable profile-guided optimizations (PGO) that customize the generated native code to different program inputs. PGOs are generally considered integral for VMs to produce high-quality and performant native code. In this work, we study and quantify the performance benefits of PGOs, understand the importance of profiling data quantity and quality/accuracy to effectively guide PGOs, and assess the impact of individual PGOs on VM performance. The insights obtained from this work can be used to understand the current state of PGOs, develop strategies to more efficiently balance the cost and exploit the potential of PGOs, and explore the implications of and challenges for the alternative ahead-of-time (AOT) compilation model used by VMs.
Conference Paper
In C, memory errors, such as buffer overflows, are among the most dangerous software errors; as we show, they are still on the rise. Current dynamic bug-finding tools that try to detect such errors are based on the low-level execution model of the underlying machine. They insert additional checks in an ad-hoc fashion, which makes them prone to omitting checks for corner cases. To address this, we devised a novel approach to finding bugs during the execution of a program. At the core of this approach is an interpreter written in a high-level language that performs automatic checks (such as bounds, NULL, and type checks). By mapping data structures in C to those of the high-level language, accesses are automatically checked and bugs discovered. We have implemented this approach and show that our tool (called Safe Sulong) can find bugs that state-of-the-art tools overlook, such as out-of-bounds accesses to the main function arguments.
Article
Just-in-time (JIT) compilation during program execution and ahead-of-time (AOT) compilation during software installation are alternate techniques used by managed language virtual machines (VM) to generate optimized native code while simultaneously achieving binary code portability and high execution performance. Profile data collected by JIT compilers at run-time can enable profile-guided optimizations (PGO) to customize the generated native code to different program inputs. AOT compilation removes the speed and energy overhead of online profile collection and dynamic compilation, but may not be able to achieve the quality and performance of customized native code. The goal of this work is to investigate and quantify the implications of the AOT compilation model on the quality of the generated native code for current VMs. First, we quantify the quality of native code generated by the two compilation models for a state-of-the-art (HotSpot) Java VM. Second, we determine how the amount of profile data collected affects the quality of generated code. Third, we develop a mechanism to determine the accuracy or similarity for different profile data for a given program run, and investigate how the accuracy of profile data affects its ability to effectively guide PGOs. Finally, we categorize the profile data types in our VM and explore the contribution of each such category to performance.
Article
tcc is a compiler that provides efficient and high-level access to dynamic code generation. It implements the 'C ("Tick-C") programming language, an extension of ANSI C that supports dynamic code generation [15]. 'C gives power and flexibility in specifying dynamically generated code: whereas most other systems use annotations to denote run-time invariants, 'C allows the programmer to specify and compose arbitrary expressions and statements at run time. This degree of control is needed to efficiently implement some of the most important applications of dynamic code generation, such as "just in time" compilers [17] and efficient simulators [10, 48, 46]. The paper focuses on the techniques that allow tcc to provide 'C's flexibility and expressiveness without sacrificing run-time code generation efficiency. These techniques include fast register allocation, efficient creation and composition of dynamic code specifications, and link-time analysis to reduce the size of dynamic code generators. tcc also implements two different dynamic code generation strategies, designed to address the tradeoff of dynamic compilation speed versus generated code quality. To characterize the effects of dynamic compilation, we present performance measurements for eleven programs compiled using tcc. On these applications, we measured performance improvements of up to one order of magnitude. To encourage further experimentation and use of dynamic code generation, we are making the tcc compiler available in the public domain. This is, to our knowledge, the first high-level dynamic compilation system to be made available.
Article
To guarantee typesafe execution, Java and other strongly typed languages require bounds checking of array accesses. Because array-bounds checks may raise exceptions, they block code motion of instructions with side effects, thus preventing many useful code optimizations, such as partial redundancy elimination or instruction scheduling of memory operations. Furthermore, because it is not expressible at the bytecode level, the elimination of bounds checks can only be performed at run time, after the bytecode program is loaded. Using existing powerful bounds-check optimizers at run time is not feasible, however, because they are too heavyweight for the dynamic compilation setting. ABCD is a light-weight algorithm for elimination of Array Bounds Checks on Demand. Its design emphasizes simplicity and efficiency. In essence, ABCD works by adding a few edges to the SSA value graph and performing a simple traversal of the graph. Despite its simplicity, ABCD is surprisingly powerful. On our benchmarks, ABCD removes on average 45% of dynamic bounds-check instructions, sometimes achieving near-ideal optimization. The efficiency of ABCD stems from two factors. First, ABCD works on a sparse representation. As a result, it requires on average fewer than 10 simple analysis steps per bounds check. Second, ABCD is demand-driven. It can be applied to a set of frequently executed (hot) bounds checks, which makes it suitable for the dynamic-compilation setting, in which compile-time cost is constrained but hot statements are known.
Conference Paper
A high-performance implementation of a Java Virtual Machine (JVM) consists of efficient implementation of Just-In-Time (JIT) compilation, exception handling, synchronization mechanisms, and garbage collection (GC). These components are tightly coupled to achieve high performance. In this paper, we present some static and dynamic techniques implemented in the JIT compilation and exception handling of the Microprocessor Research Lab Virtual Machine (MRL VM), i.e., lazy exceptions, lazy GC mapping, dynamic patching, and bounds checking elimination. Our experiments used IA-32 as the hardware platform, but the optimizations can be generalized to other architectures.
Conference Paper
Previous selective dynamic compilation systems have demonstrated that dynamic compilation can achieve performance improvements at low cost on small kernels, but they have had difficulty scaling to larger programs. To overcome this limitation, we developed DyC, a selective dynamic compilation system that includes more sophisticated and flexible analyses and transformations. DyC is able to achieve good performance improvements on programs that are much larger and more complex than the kernels. We analyze the individual optimizations of DyC and assess their impact on performance collectively and individually.
Article
Acknowledgments You are worthy, our Lord and God, to receive glory and honor and power, for you created all things, and by your will they were created and have their being. Revelation 4:11 First and foremost I thank Jesus, my King and Redeemer, who graciously enabled me to complete this dissertation by providing the support acknowledged below. I am very grateful to Kent Dybvig, my outstanding research adviser, for his good ideas, advice, and encouragement. I also appreciate the excellent teaching and support that Dan Friedman, Frank Prosser, and Larry Moss have given me.
Article
The Morph system provides a framework for automatic collection and management of profile information and application of profile-driven optimizations. In this paper, we focus on the operating system support that is required to collect and manage profile information on an end-user's workstation in an automatic, continuous, and transparent manner. Our implementation for a Digital Alpha machine running Digital UNIX 4.0 achieves run-time overheads of less than 0.3% during profile collection. Through the application of three code layout optimizations, we further show that Morph can use statistical profiles to improve application performance. With appropriate system support, automatic profiling and optimization is both possible and effective.