About
103 Publications · 22,466 Reads · 5,581 Citations

Publications (103)
Cloud computing has made the resources needed to execute large-scale in-memory distributed computations widely available. Specialized programming models, e.g., MapReduce, have emerged to offer transparent fault tolerance and fault recovery for specific computational patterns, but they sacrifice generality. In contrast, the Resilient X10 programming...
Many PGAS languages and libraries rely on high performance transport layers such as GASNet and MPI to achieve low communication latency, portability and scalability. As systems increase in scale, failures are expected to become normal events rather than exceptions. Unfortunately, GASNet and standard MPI do not provide fault tolerance capabilities....
Event-processing systems can support high-quality reactions to events by providing context to the event agents. When this context consists of a large amount of data, it helps to train an analytic model for it. In a continuously running solution, this model must be kept up-to-date, otherwise quality degrades. Unfortunately, ripple-through effects ma...
The Asynchronous Partitioned Global Address Space (APGAS) programming model enables programmers to express the parallelism and locality necessary for high performance scientific applications on extreme-scale systems. We used the well-known LULESH hydrodynamics proxy application to explore the performance and programmability of the APGAS model as ex...
X10 is a Java-like programming language that introduces new constructs to significantly simplify scale-out programming based on the Asynchronous Partitioned Global Address Space (APGAS) programming model. The fundamental goal of X10 is to enable scalable, high-performance, high-productivity programming of large scale computer systems for both conve...
X10 is a high-performance, high-productivity programming language aimed at large-scale distributed and shared-memory parallel applications. It is based on the Asynchronous Partitioned Global Address Space (APGAS) programming model, supporting the same fine-grained concurrency mechanisms within and across shared-memory nodes.
We demonstrate that X10...
This paper addresses the problem of efficiently supporting parallelism within a managed runtime. A popular approach for exploiting software parallelism on parallel hardware is task parallelism, where the programmer explicitly identifies potential parallelism and the runtime then schedules the work. Work-stealing is a promising scheduling strategy t...
Effective support for array-based programming has long been one of the central design concerns of the X10 programming language. After significant research and exploration, X10 has adopted an approach based on providing arrays via user definable and extensible class libraries. This paper surveys the range of array abstractions available to the progr...
Scale-out programs run on multiple processes in a cluster. In scale-out systems, processes can fail. Computations using traditional libraries such as MPI fail when any component process fails. The advent of Map Reduce, Resilient Data Sets and MillWheel has shown dramatic improvements in productivity are possible when a high-level programming framew...
We present GLB, a programming model and an associated implementation that can handle a wide range of irregular parallel programming problems running over large-scale distributed systems. GLB is applicable both to problems that are easily load-balanced via static scheduling and to problems that are hard to statically load balance. GLB hides the intr...
The ability to smoothly interoperate with other programming languages is an essential feature to reduce the barriers to adoption for new languages such as X10. Compiler-supported interoperability between Managed X10 and Java was initially previewed in X10 version 2.2.2 and is now fully supported in X10 version 2.3. In this paper we describe and mot...
This talk will present a high-level introduction to the X10 language and its implementation with the goal of ensuring a common base knowledge of X10 by all workshop attendees. The X10 language will be introduced primarily via code examples with a focus on X10's support for concurrency and distribution via the APGAS (Asynchronous Partitioned Global...
Techniques are disclosed for schedule management. By way of example, a method for managing performance of tasks of a thread associated with a processor comprises the following steps. A request to execute a task of a first task type within the thread is received. A determination is made whether the processor is currently executing a critical section...
Programming for large-scale, multicore-based architectures requires adequate tools that offer ease of programming and do not hinder application performance. StarSs is a family of parallel programming models based on automatic function-level parallelism that targets productivity. StarSs deploys a data-flow model: it analyzes dependencies between tas...
This paper proposes two novel techniques for partial inlining. Context-driven partial inlining uses information available to the compiler at a call site to prune the callee prior to assessing whether the (pruned) body of the callee should be inlined. Guarded partial inlining seeks to inline the frequently taken fast path through the callee along wi...
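The guarded partial inlining idea above can be sketched by hand. This is an illustrative before/after transformation under assumed method names, not code from the paper: the callee's frequently taken fast path is copied to the call site behind its guard, and the rarely taken slow path falls back to a full call.

```java
public class PartialInlineDemo {
    static int slowPath(int x) { return -x; }     // cold path, rarely taken

    static int callee(int x) {
        if (x >= 0) return x + 1;                 // frequently taken fast path
        return slowPath(x);                       // infrequent slow path
    }

    // What the call site looks like after guarded partial inlining:
    static int callerInlined(int x) {
        if (x >= 0) return x + 1;                 // inlined fast path
        return callee(x);                         // guard failed: full call
    }

    public static void main(String[] args) {
        System.out.println(callerInlined(4));     // 5
        System.out.println(callerInlined(-4));    // 4
    }
}
```

The transformed caller computes exactly what a direct call to `callee` would, but the common case avoids call overhead and exposes the fast path to further optimization.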
Work-stealing is a promising approach for effectively exploiting software parallelism on parallel hardware. A programmer who uses work-stealing explicitly identifies potential parallelism and the runtime then schedules work, keeping otherwise idle hardware busy while relieving overloaded hardware of its burden. Prior work has demonstrated that work...
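The programming model described above — the programmer identifies potential parallelism, the runtime schedules it by stealing — is visible in standard Java through the fork/join framework, whose pool is a work-stealing scheduler. A minimal sketch (the naive Fibonacci task is illustrative only):

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

public class WorkStealingDemo {
    static class Fib extends RecursiveTask<Long> {
        final int n;
        Fib(int n) { this.n = n; }
        @Override protected Long compute() {
            if (n < 2) return (long) n;
            Fib left = new Fib(n - 1);
            left.fork();                        // subtask queued, stealable by idle workers
            long right = new Fib(n - 2).compute();  // work on the other half locally
            return right + left.join();
        }
    }

    public static void main(String[] args) {
        System.out.println(new ForkJoinPool().invoke(new Fib(20))); // 6765
    }
}
```

Each `fork()` makes a task available on the worker's deque; otherwise idle workers steal from the tail, keeping the hardware busy without explicit scheduling by the programmer.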
We propose a framework for SAT researchers to conveniently try out new ideas in the context of parallel SAT solving without the burden of dealing with all the underlying system issues that arise when implementing a massively parallel algorithm. The framework is based on the parallel execution language X10, and allows the parallel solver to easily r...
On shared-memory systems, Cilk-style work-stealing has been used to effectively parallelize irregular task-graph based applications such as Unbalanced Tree Search (UTS). There are two main difficulties in extending this approach to distributed memory. In the shared memory approach, thieves (nodes without work) constantly attempt to asynchronously s...
X10 is an emerging Partitioned Global Address Space (PGAS) language intended to increase significantly the productivity of developing scalable HPC applications. The language has now matured to a point where it is meaningful to consider writing large scale scientific application codes in X10. This paper reports our experiences writing three codes fr...
X10 is a new object-oriented PGAS (Partitioned Global Address Space) programming language with support for distributed asynchronous dynamic parallelism that goes beyond past SPMD message-passing models such as MPI and SPMD PGAS models such as UPC and Co-Array Fortran. The concurrency constructs in X10 make it possible to express complex computation...
To reliably write high performance code in any programming language, an application programmer must have some understanding of the performance characteristics of the language's core constructs. We call this understanding a performance model for the language. Some aspects of a performance model are fundamental to the programming language and are exp...
The MapReduce framework has become a popular and powerful tool to process large datasets in parallel over a cluster of computing nodes [1]. Currently, there are many flavors of implementations of MapReduce, among which the most popular is the Hadoop implementation in Java [5]. However, these implementations either rely on third-party file systems f...
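The map and reduce phases that such frameworks distribute across a cluster can be sketched in a single process with Java streams. The classic word-count example below is an illustration of the programming model only, not any particular implementation's API:

```java
import java.util.Arrays;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

public class WordCount {
    // Map: split the input into words; Reduce: count occurrences per key.
    static Map<String, Long> count(String text) {
        return Arrays.stream(text.toLowerCase().split("\\W+"))
                     .filter(w -> !w.isEmpty())
                     .collect(Collectors.groupingBy(Function.identity(),
                                                    Collectors.counting()));
    }

    public static void main(String[] args) {
        System.out.println(count("the quick fox and the lazy dog and the fox"));
    }
}
```

A MapReduce framework runs the same two phases, but partitions the map input across nodes and shuffles intermediate key/value pairs to the reducers.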
The power of high-level languages lies in their abstraction over hardware and software complexity, leading to greater security, better reliability, and lower development costs. However, opaque abstractions are often show-stoppers for systems programmers, forcing them to either break the abstraction, or more often, simply give up and use a dif...
As computer hardware continues to dramatically improve in transistor density and raw capability, the importance of compilers to bridge the gap between high-level programming languages and these abundant hardware resources has never been greater. The workshop on Compiler-Driven Performance provided an important opportunity for academic faculty, stud...
Programs encounter increasingly complex and fragile mappings to computing platforms, resulting in performance characteristics that are often mysterious to students, practitioners, and even researchers. We discuss some steps toward an experimental methodology that demands and provides a deep understanding of complete systems, the necessary instrumen...
Real-time Garbage Collection (RTGC) has recently advanced to the point where it is being used in production for financial trading, military command-and-control, and telecommunications. However, among potential users of RTGC, there is enormous diversity in both application requirements and deployment environments. Previously described RTGCs tend t...
While real-time garbage collection is now available in production virtual machines, the lack of generational capability means applications with high allocation rates are subject to reduced throughput and high space overheads. Since frequent allocation is often correlated with a high-level, object-oriented style of programming, this can force builder...
If the operating system could be specialized for every application, many applications would run faster. For example, Java virtual machines (JVMs) provide their own threading model and memory protection, so general-purpose operating system implementations of these abstractions are redundant. However, traditional means of transforming existing syst...
The emergence of standards for programming real-time systems in Java has encouraged many developers to consider its use for systems previously only built using C, Ada, or assembly language. However, the RTSJ standard in isolation leaves many important problems unaddressed, and suffers from some serious problems in usability and safety. As a resul...
Debugging the timing behavior of real-time systems is notoriously difficult, and with a new generation of complex systems consisting of tens of millions of lines of code, the difficulty is increasing enormously. We have developed TuningFork, a tool especially designed for visualization and analysis of large-scale real-time systems. TuningFork is ca...
While real-time garbage collection has achieved worst-case latencies on the order of a millisecond, this technology is approaching its practical limits. For tasks requiring extremely low latency, and especially periodic tasks with frequencies above 1 KHz, Java programmers must currently resort to the NoHeapRealtimeThread construct of the Real-Time...
TuningFork is an online, scriptable data visualization and analysis tool that supports the development and continuous monitoring of real-time systems. While TuningFork was originally designed and tested for use with a particular real-time Java Virtual Machine, the architecture has been designed from the ground up for extensibility by leveraging t...
Poor instruction cache locality can degrade performance on modern architectures. For example, our simulation results show that eliminating all instruction cache misses improves performance by as much as 16% for a modestly sized instruction cache. In this paper, we show how to take advantage of dynamic code generation in a Java Virtual Machine (...
There are many algorithms for concurrent garbage collection, but they are complex to describe, verify, and implement. This has resulted in a poor understanding of the relationships between the algorithms, and has precluded systematic study and comparative evaluation. We present a single high-level, abstract concurrent garbage collection algorit...
Real-time garbage collection has been shown to be feasible, but for programs with high allocation rates, the utilization achievable is not sufficient for some systems. Since a high allocation rate is often correlated with a more high-level, abstract programming style, the ability to provide good real-time performance for such programs will help cont...
Due to the high dynamic frequency of virtual method calls in typical object-oriented programs, feedback-directed devirtualization and inlining is one of the most important optimizations performed by high-performance virtual machines. A critical input to effective feedback-directed inlining is an accurate dynamic call graph. In a virtual machine, th...
Virtual machines face significant performance challenges beyond those confronted by traditional static optimizers. First, portable program representations and dynamic language features, such as dynamic class loading, force the deferral of most optimizations until runtime, inducing runtime optimization overhead. Second, modular program representatio...
This paper describes the evolution of the Jikes™ Research Virtual Machine project from an IBM internal research project, called Jalapeño, into an open-source project. After summarizing the original goals of the project, we discuss the motivation for releasing it as an open-source project and the activities performed to ensure the success of the pro...
Real-time systems have reached a level of complexity beyond the scaling capability of the low-level or restricted languages traditionally used for real-time programming. While Metronome garbage collection has made it practical to use Java to implement real-time systems, many challenges remain for the construction of complex real-time systems, some s...
Modern Java programs, such as middleware and application servers, include many complex software components. Improving the performance of these Java applications requires a better understanding of the interactions between the application, virtual machine, operating system, and architecture. Hardware performance monitors, which are available on most...
Security concerns on embedded devices like cellular phones make Java an extremely attractive technology for providing third-party and user-downloadable functionality. However, garbage collectors have typically required several times the maximum live data set size (which is the minimum possible heap size) in order to run well. In addition, the size...
While Java provides many software engineering benefits, it lacks a coherent module system and instead provides only packages (which are primarily a name space mechanism) and classloaders (which are very low-level). As a result, large Java applications suffer from unexpected interactions between independent components, require complex CLASSPATH defi...
As current trends in software development move toward more complex object-oriented programming, inlining has become a vital optimization that provides substantial performance improvements to C and Java programs. Yet, the aggressiveness of the inlining algorithm must be carefully monitored to effectively balance performance and code size. The state-...
While many object-oriented languages impose space overhead of only one word per object to support features like virtual method dispatch, Java's richer functionality has led to implementations that require two or three header words per object. This space overhead increases memory usage and attendant garbage collection costs, reduces cache locality,...
This paper describes porting the Jikes Research Virtual Machine from its first platform, AIX/PowerPC, to its second, Linux/IA32. We discuss the main issues in realizing both an initial functional port, and then tuning efforts to achieve competitive performance. The paper presents software engineering issues in building a portable runtime system and compilers, as well...
Dataflow analyses can have mutually beneficial interactions. Previous efforts to exploit these interactions have either (1) iteratively performed each individual analysis until no further improvements are discovered or (2) developed "superanalyses" that manually combine conceptually separate analyses. We have devised a new approach that allows anal...
A Java virtual machine (JVM) must sometimes check whether a value of one type can be treated as a value of another type. The overhead for such dynamic type checking can be a significant factor in the running time of some Java programs. This paper presents a variety of techniques for performing these checks, each tailored to a particular rest...
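The dynamic type checks in question are visible from ordinary Java source: `instanceof` tests, checked casts, and stores into covariant arrays all compile to runtime subtype tests that the JVM must make fast. A small illustration:

```java
public class TypeCheckDemo {
    public static void main(String[] args) {
        Object o = "hello";
        System.out.println(o instanceof CharSequence);  // subtype test at runtime
        CharSequence c = (CharSequence) o;              // checkcast: the same test

        Object[] arr = new String[1];                   // arrays are covariant
        try {
            arr[0] = Integer.valueOf(42);               // aastore must type-check
        } catch (ArrayStoreException e) {
            System.out.println("ArrayStoreException");  // the check failed
        }
        System.out.println(c.length());                 // 5
    }
}
```

Because such tests can sit on hot paths (every covariant array store, every downcast), their constant-factor cost is what implementation techniques like the ones surveyed here aim to reduce.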
A large number of call graph construction algorithms for object-oriented and functional languages have been proposed, each embodying different tradeoffs between analysis cost and call graph precision. In this article we present a unifying framework for understanding call graph construction algorithms and an empirical comparison of a representativ...
Single superclass inheritance enables simple and efficient table-driven virtual method dispatch. However, virtual method table dispatch does not handle multiple inheritance and interfaces. This complication has led to a widespread misimpression that interface method dispatch is inherently inefficient. This paper argues that with proper implementati...
The Jalapeño Dynamic Optimizing Compiler is a key component of the Jalapeño Virtual Machine, a new Java Virtual Machine (JVM) designed to support efficient and scalable execution of Java applications on SMP server machines. This paper describes the design of the Jalapeño Optimizing Compiler, and the implementation results that we have obtained...
This paper provides details of the component of the Jalapeño adaptive optimization system that determines what methods to optimize. This component, called the controller, can choose from one of several optimization levels. In the current implementation, the controller uses a simple cost/benefit analysis to drive adaptive compilation decisions. It h...
The execution model for mobile, dynamically‐linked, object‐oriented programs has evolved from fast interpretation to a mix of interpreted and dynamically compiled execution. The primary motivation for dynamic compilation is that compiled code executes significantly faster than interpreted code. However, dynamic compilation, which is performed while...
Virtual methods can be dispatched efficiently because the code for corresponding methods resides at the same entries in their respective virtual method tables (VMTs). To achieve efficient interface method dispatch, a fixed-sized interface method table (IMT) is associated with each class. Different implementations of the same interface method signatu...
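A toy model of the fixed-size IMT scheme can make the idea concrete. This sketch is hand-written for illustration, not the VM's actual data structures; the table size, integer signature ids, and map-based conflict stub are all assumptions:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.IntUnaryOperator;

public class ImtDemo {
    static final int IMT_SIZE = 4;  // fixed table size: an assumption for this toy

    // A slot holds either one method body (the common, conflict-free case) or,
    // after a hash collision, a conflict stub keyed by signature id.
    static class Slot {
        int directSig;
        IntUnaryOperator direct;
        Map<Integer, IntUnaryOperator> conflict;

        int dispatch(int sigId, int arg) {
            IntUnaryOperator m = (conflict == null) ? direct : conflict.get(sigId);
            return m.applyAsInt(arg);
        }
    }

    static class ImTable {
        final Slot[] slots = new Slot[IMT_SIZE];

        void add(int sigId, IntUnaryOperator impl) {
            int i = sigId % IMT_SIZE;        // hash the signature to a slot
            Slot s = slots[i];
            if (s == null) {                 // empty slot: direct entry
                s = new Slot();
                s.directSig = sigId;
                s.direct = impl;
                slots[i] = s;
            } else {                         // collision: upgrade to a stub
                if (s.conflict == null) {
                    s.conflict = new HashMap<>();
                    s.conflict.put(s.directSig, s.direct);
                    s.direct = null;
                }
                s.conflict.put(sigId, impl);
            }
        }

        int dispatch(int sigId, int arg) {
            return slots[sigId % IMT_SIZE].dispatch(sigId, arg);
        }
    }

    public static void main(String[] args) {
        ImTable t = new ImTable();
        t.add(1, x -> x + 1);  // hypothetical signature ids
        t.add(2, x -> x * 2);
        t.add(5, x -> -x);     // 5 % 4 == 1: collides, slot 1 gets a stub
        System.out.println(t.dispatch(2, 10));  // 20: direct, one indirection
        System.out.println(t.dispatch(5, 10));  // -10: resolved via the stub
    }
}
```

The conflict-free case costs one table index and one indirect call, matching VMT dispatch; only colliding signatures pay for the extra disambiguation step.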
In this paper, we report on our experiences with guaranteeing GC-pointer safety when using unsafe low-level language extensions to implement a JVM in Java. We give an overview of the original unsafe language extensions that were defined for use by Jalapeño implementers, and introduce sanitized replacements that capture common idioms while also guar...
Future high-performance virtual machines will improve performance through sophisticated online feedback-directed optimizations. This paper presents the architecture of the Jalapeño Adaptive Optimization System, a system to support leading-edge virtual machine technology and enable ongoing research on online feedback-directed optimizations. We descr...
Jalapeño is a virtual machine for Java™ servers written in the Java language. To be able to address the requirements of servers (performance and scalability in particular), Jalapeño was designed “from scratch“ to be as self-sufficient as possible. Jalapeño's unique object model and memory layout allows a hardware null-pointer check as well as fast...
The Educators' Symposium is a unique forum for educators from both academia and industry who have a vested interest in OO education and training. The Educators' Symposium facilitates the exchange of ideas in a number of ways, including featured talks ...
The runtime performance of object-oriented languages often suffers due to the overhead of dynamic dispatching. In order to make these languages competitive with traditional languages, optimizing compilers attempt to eliminate as many of the dynamic dispatches as possible. A variety of local and intraprocedural techniques have been developed to do t...
The Factored Control Flow Graph, FCFG, is a novel representation of a program's intraprocedural control flow, which is designed to efficiently support the analysis of programs written in languages, such as Java, that have frequently occurring operations whose execution may result in exceptional control flow. The FCFG is more compact than traditiona...
Optimizing compilers for object-oriented languages apply static class analysis and other techniques to try to deduce precise information about the possible classes of the receivers of messages; if successful, dynamically-dispatched messages can be replaced with direct procedure calls and potentially further optimized through inline-expansion. By e...
Previously, techniques such as class hierarchy analysis and profile-guided receiver class prediction have been demonstrated to greatly improve the performance of applications written in pure object-oriented languages, but the degree to which these results are transferable to applications written in hybrid languages has been unclear. In part to answ...
Previous algorithms for interprocedural control flow analysis of higher-order and/or object-oriented languages have been described that perform propagation or constraint satisfaction and take O(N³) time (such as Shivers's 0-CFA and Heintze's set-based analysis), or unification and take O(N·α(N,N)) time (such as Steensgaard's pointer analysis), or...
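The near-linear O(N·α(N,N)) bound for unification-based analyses like Steensgaard's comes from union-find with union by rank and path compression, where α is the inverse Ackermann function. A standalone sketch of that machinery (an illustration, not the analysis itself):

```java
public class UnionFind {
    private final int[] parent, rank;

    public UnionFind(int n) {
        parent = new int[n];
        rank = new int[n];
        for (int i = 0; i < n; i++) parent[i] = i;  // each element is its own set
    }

    public int find(int x) {                 // path compression: flatten on lookup
        if (parent[x] != x) parent[x] = find(parent[x]);
        return parent[x];
    }

    public void union(int a, int b) {        // union by rank: keep trees shallow
        int ra = find(a), rb = find(b);
        if (ra == rb) return;
        if (rank[ra] < rank[rb]) { int t = ra; ra = rb; rb = t; }
        parent[rb] = ra;
        if (rank[ra] == rank[rb]) rank[ra]++;
    }

    public static void main(String[] args) {
        UnionFind uf = new UnionFind(5);
        uf.union(0, 1);
        uf.union(3, 4);
        System.out.println(uf.find(0) == uf.find(1));  // true: same equivalence class
        System.out.println(uf.find(1) == uf.find(3));  // false: still separate
    }
}
```

In a Steensgaard-style analysis, each assignment merges the points-to equivalence classes of its two sides with a `union`, so the whole program is processed in a single pass over the statements.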
Interprocedural analyses enable optimizing compilers to more precisely model the effects of non-inlined procedure calls, potentially resulting in substantial increases in application performance. Applying interprocedural analysis to programs written in object-oriented or functional languages is complicated by the difficulty of constructing an accur...
Because dataflow analyses are difficult to implement from scratch, reusable dataflow analysis frameworks have been developed which provide generic support facilities for managing propagation of dataflow information and iteration in loops. We have designed a framework that improves on previous work by making it easy to perform graph transformations...
We describe Vortex, an optimizing compiler intended to produce high-quality code for programs written in a heavily object-oriented style. To achieve this end, Vortex includes a number of intra- and interprocedural static analyses that can exploit knowledge about the whole program being compiled, including intraprocedural class analysis, class hiera...