Conference Paper

Stopless: a real-time garbage collector for multiprocessors


Abstract

We present STOPLESS: a concurrent real-time garbage collector suitable for modern multiprocessors running parallel multithreaded applications. Creating a garbage-collected environment that supports real-time on modern platforms is notoriously hard, especially if real-time implies lock-freedom. Known real-time collectors either restrict the real-time guarantees to uniprocessors only, rely on special hardware, or just give up supporting atomic operations (which are crucial for lock-free software). STOPLESS is the first collector that provides real-time responsiveness while preserving lock-freedom, supporting atomic operations, controlling fragmentation by compaction, and supporting modern parallel platforms. STOPLESS is adequate for modern languages such as C# or Java. It was implemented on top of the Bartok compiler and runtime for C#, and measurements demonstrate high responsiveness (a factor of 100 better than previously published systems), virtually no pause times, good mutator utilization, and acceptable overheads.


... However, there exists no formal and mechanical way to investigate the correctness of concurrent copying protocols under various memory models. For example, although various concurrent copying protocols have been proposed [McCloskey et al. 2008; Pizlo et al. 2007a, 2008], none of their authors have formally discussed their correctness against memory models, except for Sapphire [Hudson and Moss 2003], for which the correctness of its concurrent copying protocol was discussed in natural language. Such an investigative approach is also desired when porting a concurrent copying GC algorithm to other memory models from the model that the algorithm assumes. ...
... Stopless [Pizlo et al. 2007a] is a concurrent compacting garbage collector that offers lock-freedom of mutators. Although the memory models this collector can run under were not specified, the experimental environment indicated that it supports the x86 memory model. ...
... Because pseudocode was not provided in Pizlo et al. [2007a], we wrote pseudocode to clarify the algorithm to be verified. Our pseudocode for Stopless is shown in Algorithm 1, where copy is the procedure used by the collector to copy object o, and o[f] is the field at offset f in object o. ...
Article
Modern concurrent copying garbage collection (GC), in particular, real-time GC, uses fine-grained synchronizations with a mutator, which is the application program that mutates memory, when it moves objects in its copy phase. It resolves a data race using a concurrent copying protocol, which is implemented as interactions between the collector threads and the read and write barriers that the mutator threads execute. The behavioral effects of the concurrent copying protocol rely on the memory model of the CPUs and the programming languages in which the GC is implemented. It is difficult, however, to formally investigate the behavioral properties of concurrent copying protocols against various memory models. To address this problem, we studied the feasibility of the bounded model checking of concurrent copying protocols with memory models. We investigated a correctness-related behavioral property of copying protocols of various concurrent copying GC algorithms, including real-time GC Stopless, Clover, Chicken, Staccato, and Schism against six memory models, total store ordering (TSO), partial store ordering (PSO), relaxed memory ordering (RMO), and their variants, in addition to sequential consistency (SC) using bounded model checking. For each combination of a protocol and memory model, we conducted model checking with a model of a mutator. In this wide range of case studies, we found faults in two GC algorithms, one of which is relevant to the memory model. We fixed these faults with the great help of counterexamples. We also modified some protocols so that they work under some memory models weaker than those for which the original protocols were designed, and checked them using model checking. We believe that bounded model checking is a feasible approach to investigate behavioral properties of concurrent copying protocols under weak memory models.
... From these JVMs, JamaicaVM, FijiVM and the Pico and Raven variants of aonixPerc also target resource-constrained embedded systems. Implementations of precise garbage collection suitable for real-time systems have been the subject of many research activities [8,158,132,129,38,130,17,19,36,95,141]. In the following, I will wrap up a subset of important projects related to soft and hard real-time garbage collection. ...
... The IRR does not deal with heap fragmentation and is therefore not suitable for the deployment in hard real-time systems without further assumptions. [129] GC that is a concurrent soft real-time collector viable for the deployment in a multiprocessor environment running multithreaded applications. Stopless has been implemented in the Bartok compiler and runtime system for the Common Intermediate Language (CIL). ...
Thesis
Memory management is a challenging task and its handling becomes more difficult with the rising complexity of software systems. Embedded systems are predominated by the use of manual storage control and as these systems become more and more extensive, means to assist the programmer in performing correct memory management are required. By using a type-safe language, the use of automated storage control is promoted: Combining the knowledge about a type-safe application, system software and the hardware properties assists the creation of a memory handling that is tailored to a particular system setup: Software can be deployed on several kinds of microcontrollers that may vary in their hardware-specific properties, such as the availability of memories, the address-space layout or their proneness to transient hardware faults. In addition, a program’s non-functional properties, for instance, its predictability, have to be respected during the design of memory management. In this thesis I describe how system- and application-specific memory management for a particular hardware device can be created. I call my approach cooperative memory management (CMM) and it pursues a co-design of memory management by respecting the software quality attributes efficiency and predictability and the hardware property reliability. CMM is a compound of static analyses on type-safe applications and runtime support. For the implementation of my framework, I use the KESO JVM [169] in which I integrated an amended version of an escape analysis to encourage compiler-assisted storage control. Besides safe, automatic stack allocation, I implemented new backends for my escape analysis through which, among other things, I provide an automated solution to regional memory specified in the Real-Time Specification for Java [33]. The use of escape-analysis results ameliorates a program’s reliability by reducing the effects of transient hardware faults. 
To further improve reliability, I evolved special runtime checks that are able to preserve memory safety in systems prone to transient faults. Besides compiler support, I developed RT-LAGC, a real-time garbage collector to maintain heap objects not managed by regional memory. RT-LAGC is optimized through the compilation process and tailored to the application whose heap memory it manages. I designed CMM’s storage control techniques to be easily exchangeable and extensible by the developer so that static and dynamic analyses can find out which combination of techniques works reasonably well to achieve efficiency, predictability and reliability. By evaluating a comprehensive real-time benchmark and various memory-management configurations, I showed that the developer-assisted co-design approach is able to address the properties reliability, efficiency and predictability.
... Staccato and Chicken (Pizlo et al. 2008b) impose a lazy tospace invariant under which the latest value is the tospace one if the object has been copied. Stopless (Pizlo et al. 2007a) imposes a more complicated invariant, but still only one location holds the valid latest value of a field. As concurrent collectors move objects while the mutator is running, mutators and collectors must maintain a coherent view of the heap at each read or write operation, requiring frequent synchronisation. ...
... Stopless (Pizlo et al. 2007a) was the first collector to claim lock-free relocation of objects. It ensures that all updates are made to the most up-to-date location of a field, thereby supporting mutators' use of atomic operations without locking. ...
Article
Constructing a high-performance garbage collector is hard. Constructing a fully concurrent ‘on-the-fly’ compacting collector is much more so. We describe our experience of implementing the Sapphire algorithm as the first on-the-fly, parallel, replication copying, garbage collector for the Jikes RVM Java virtual machine (JVM). In part, we explain our innovations such as copying with hardware and software transactions, on-the-fly management of Java’s reference types, and simple, yet correct, lock-free management of volatile fields in a replicating collector. We fully evaluate, for the first time, and using realistic benchmarks, Sapphire’s performance and suitability as a low latency collector. An important contribution of this work is a detailed description of our experience of building an on-the-fly copying collector for a complete JVM with some assurance that it is correct. A key aspect of this is model checking of critical components of this complicated and highly concurrent system.
... The garbage collectors can guarantee the security and reliability of the runtime systems, but they also introduce additional performance overhead. Traditional garbage collectors perform the entire memory reclamation by suspending the running programs [1]. It is obvious that such a strategy will seriously affect performance and responsiveness of the systems, and this is especially unacceptable for a real-time system. ...
... That is the thread must guarantee that the replica contains the same information with the origin when the modification completes. In [1] Pizlo et al. presented an example of inconsistent situation when an object was modified by two threads; here we simply review the process. Figure 2 describes a simple inconsistency scenario that two threads T1, T2 are going to modify the object field . ...
Article
Concurrent garbage collectors (CGCs) have recently attracted extensive attention on multicore platforms. A well-designed CGC can improve the efficiency of runtime systems by exploiting the full processing resources of multicore computers. Two major performance-critical components in the design of a CGC are studied in this paper: stack scanning and heap compaction. Since lock-based algorithms do not scale well, we present a lock-free solution for constructing a highly concurrent garbage collector. We adopt CAS/MCAS synchronization primitives to guarantee that programs will never be blocked by the collector thread while the garbage collection process is ongoing. The evaluation results of this study demonstrate that our approach achieves competitive performance.
... However, with the maturity of language runtime technologies for real-time Java, the availability of a plethora of real-time JVMs [2,4,12,13,16,18,44,77,79], which can serve as an execution mechanism for functional languages that compile to bytecode, and the growing interest in leveraging functional reactive programming in real-time systems, researchers are re-examining functional languages and their applicability for real-time systems. Indeed, it is likely the rapid advancement of real-time garbage collection [5,7,8,9,19,20,46,50,51,52,56,65,67,73,74,75,76,78,86], scoped memory [10,28,41,69,70,96], as well as alternative high-level memory management techniques [1,6] over the last five years that has spurred interest in functional languages adapted for real-time. This is due to the fact that most of these results, though developed for real-time Java, are not inherently real-time Java specific. ...
... However, this GC is not suitable for real-time software since stop-the-world GCs are well understood to have negative impacts on meeting real-time deadlines. For example, [73] shows that a STW GC under high load can miss as much as 95% of deadlines. ...
Conference Paper
Functional programming languages play an important role in the development of provably correct software systems. As embedded devices become pervasive and perform critical tasks in our lives, their reliability becomes paramount. This presents a natural opportunity to explore the application of functional programming languages to systems that demand highly predictable behavior. In this paper we explore existing functional programming language compilers and their applicability to realtime, embedded systems.
... The clever usage of atomic two-field compare-and-swap (CAS) operations for an incremental object copy is proposed in [13]. During the copy process, an object is expanded to an intermediate wide version and an uninitialized narrow version in tospace. ...
Conference Paper
A real-time garbage collector has to fulfill two conflicting properties: avoid heap fragmentation and provide short blocking time. The heap needs to be compacted to avoid probably unbounded fragmentation. During compaction all objects are copied; copying is usually performed atomically to avoid interference with mutator threads. Copying of large objects and especially large arrays introduces long blocking times that are unacceptable for real-time systems. In this paper an interruptible copy unit is presented that implements non-blocking object copy. The unit intercepts object and array field access and redirects the access either to the source or destination part of the moving object. The unit can be interrupted after a single word move. The resulting maximum blocking time is the time for a memory word read and write. We have implemented the proposed non-blocking copy unit in the Java processor JOP and are able to run high priority real-time tasks at 10 kHz parallel to the garbage collection task on a 100 MHz system.
... • STOPLESS, CHICKEN, and CLOVER: CHICKEN and CLOVER [37] are based on STOPLESS [35] and share the same infrastructure. Collector threads run on their own processors, with mutator threads executing on any unused processors. ...
Article
This report documents the development of a prototype incremental garbage collector in the Jikes RVM 3.1 using a novel alternative to standard read barriers known as specialised self-scavenging. Self-scavenging collectors encode scavenge state on a per-object level, which allows individual objects to conditionally activate scavenge code during collection, eliminating the always-on nature of traditional barriers. We develop and evaluate an incremental collector framework coupled with the first traditional incremental Baker-style garbage collector for Jikes. We then modify the Baker collector to use self-scavenging, where objects encode their scavenge state in a header word. We finally eliminate the cost of explicitly encoding state in the object header by using specialised method variants, introduced through an updated method specialisation framework from the Jikes RVM 3.0.1. This results in a collector that has no conditional state checks when the collector is off or when an object has been scavenged.
... As a consequence, several studies and proposals have been conducted in this direction. Overall, there has been much research on real-time garbage collectors over the last few years [12], [13], [14], [15], [16], [17], [18], [19], [20], [21], [22]. An early effort in implementing garbage collectors with transactional memory was made by McGachey et al. [23]. ...
Conference Paper
While garbage collectors (GCs) significantly simplify programmers' tasks by transparently handling memory management, they also introduce various overheads and sources of unpredictability. Most importantly, GCs typically block the application while reclaiming free memory, which makes them unfit for environments where responsiveness is crucial, such as real-time systems. There have been several approaches for developing concurrent GCs that can exploit the processing capabilities of multi-core architectures, but at the expense of a synchronization overhead between the application and the collector. In this paper, we investigate a novel approach to implementing pauseless moving garbage collection using hardware transactional memory (HTM). We describe the design of a moving GC algorithm that can operate concurrently with the application threads. We study the overheads resulting from using transactional barriers in the Java virtual machine (JVM) and discuss various optimizations. Our findings show that, while the cost of these barriers can be minimized by carefully restricting them to volatile accesses when executing within the interpreter, the actual performance degradation becomes unacceptably high with the just-in-time compiler. The results tend to indicate that current HTM mechanisms cannot be readily used to implement a pauseless GC in Java that can compete with state-of-the-art concurrent GCs.
... A critical part of any language runtime is the Garbage Collector (GC). A significant effort has been put into optimizing garbage collectors for multicores [15,18,11,16], in order to reduce the application stalls. However, recent studies show that current GCs do not scale well with the number of cores [12] and that they can generally degrade the application responsiveness or throughput. ...
Conference Paper
In the last few years, managed runtime environments such as the Java Virtual Machine (JVM) are increasingly used on large-scale multicore servers. The garbage collector (GC) represents a critical component of the JVM and has a significant influence on the overall performance and efficiency of the running application. We perform a study on all available Java GCs, both in an academic environment (set of benchmarks), as well as in a simulated real-life situation (client-server application). We mainly focus on the three most widely used collectors: ParallelOld, ConcurrentMarkSweep and G1. We find that they exhibit different behaviours in the two tested environments. In particular, the default Java GC, ParallelOld, proves to be stable and adequate in the first situation, while in the real-life scenario its use results in unacceptable pauses for the application threads. We believe that this is partly due to the memory requirements of the multicore server. G1 GC performs notably badly on the benchmarks when forced to have a full collection between the iterations of the application. Moreover, even though G1 and ConcurrentMarkSweep GCs introduce significantly lower pauses than ParallelOld in the client-server environment, they can still seriously impact the response time on the client. Pauses of around 3 seconds can make a real-time system unusable and may disrupt the communication between nodes in the case of large-scale distributed systems.
... Sapphire [59] implemented a copying collector for Java with low overhead and short pause time. Pizlo et al. [60] compared three lock-free concurrent GC algorithms: STOPLESS [61], CHICKEN, and CLOVER, which have pause times at the microsecond level. Although these algorithms are designed for and implemented in C#, it is easy to adapt them to Java since both are similar VM-based languages. ...
Article
This paper presents a complete survey of recent techniques that are applied in the field of real-time Java computing. It focuses on the issues that are especially important for hard real-time applications, which include time predictable garbage collection, worst-case execution time analysis of Java programs, real-time Java threads scheduling and compiler techniques designed for real-time purpose. It also evaluates experimental frameworks that can be used for researching real-time Java. This overview is expected to help researchers understand the state-of-the-art and advance the research in real-time Java computing.
... Another concurrent compaction algorithm, called STOPLESS, targets real-time applications and uses a concept of wide objects while replicating objects [36]. A wide object represents an intermediate representation of an object being copied. ...
Thesis
Large-scale multicore architectures create new challenges for garbage collectors (GCs). On contemporary cache-coherent Non-Uniform Memory Access (ccNUMA) architectures, applications with a large memory footprint suffer from the cost of the garbage collector (GC), because, as the GC scans the reference graph, it makes many remote memory accesses, saturating the interconnect between memory nodes. In this thesis, we address this problem with NumaGiC, a GC with a mostly-distributed design. In order to maximise memory access locality during collection, a GC thread avoids accessing a different memory node, instead notifying a remote GC thread with a message; nonetheless, NumaGiC avoids the drawbacks of a pure distributed design, which tends to decrease parallelism and increase memory access imbalance, by allowing threads to steal from other nodes when they are idle. NumaGiC strives to find a perfect balance between local access, memory access balance, and parallelism. In this work, we compare NumaGiC with Parallel Scavenge and some of its incrementally improved variants on two different ccNUMA architectures running on the Hotspot Java Virtual Machine of OpenJDK 7. On Spark and Neo4j, two industry-strength analytics applications, with heap sizes ranging from 160 GB to 350 GB, and on SPECjbb2013 and SPECjbb2005, NumaGiC improves overall performance by up to 94% over Parallel Scavenge, and increases the performance of the collector itself by up to 5.4× over Parallel Scavenge. In terms of scalability of GC throughput with increasing number of NUMA nodes, NumaGiC scales substantially better than Parallel Scavenge for all the applications. In fact, in the case of SPECjbb2005, where inter-node object references are the least among all, NumaGiC scales almost linearly.
... Stopless [30] uses double-wide CAS to copy values first to an intermediate location and then to their destination. Stopless also requires atomic instructions in the write barriers. ...
Article
Compaction of memory in long running systems has always been important. The latency of compaction increases in today’s systems with high memory demands and large heaps. To deal with this problem, we present a lock-free protocol allowing for copying concurrent with the application running, which reduces the latencies of compaction radically. It provides theoretical progress guarantees for copying and application threads without making it practically infeasible, with performance overheads of 15% on average. The algorithm paves the way for a future lock-free Garbage Collector.
... Sometimes with strict requirements for responsiveness, it is necessary to allow mutator to progress during a collection cycle. Concurrent garbage collection [Steele, 1975; Dijkstra et al., 1978; Pizlo et al., 2007, 2008; McCloskey et al., 2008] is used to implement this system where both collector and mutator threads are running and progressing simultaneously. introduced concurrent cycle collection and Petrank [2001, 2006] introduced on-the-fly reference counting collection that uses update coalescing to reduce the concurrency overheads. ...
Thesis
Garbage collection is an integral part of modern programming languages. It automatically reclaims memory occupied by objects that are no longer in use. Garbage collection began in 1960 with two algorithmic branches — tracing and reference counting. Tracing identifies live objects by performing a transitive closure over the object graph starting with the stacks, registers, and global variables as roots. Objects not reached by the trace are implicitly dead, so the collector reclaims them. In contrast, reference counting explicitly identifies dead objects by counting the number of incoming references to each object. When an object’s count goes to zero, it is unreachable and the collector may reclaim it. Garbage collectors require knowledge of every reference to each object, whether the reference is from another object or from within the runtime. The runtime provides this knowledge either by continuously keeping track of every change to each reference or by periodically enumerating all references. The collector implementation faces two broad choices — exact and conservative. In exact garbage collection, the compiler and runtime system precisely identify all references held within the runtime including those held within stacks, registers, and objects. To exactly identify references, the runtime must introspect these references during execution, which requires support from the compiler and significant engineering effort. On the contrary, conservative garbage collection does not require introspection of these references, but instead treats each value ambiguously as a potential reference. Highly engineered, high performance systems conventionally use tracing and exact garbage collection. However, other well-established but less performant systems use either reference counting or conservative garbage collection. Reference counting has some advantages over tracing such as: a) it is easier to implement, b) it reclaims memory immediately, and c) it has a local scope of operation. 
Conservative garbage collection is easier to implement compared to exact garbage collection because it does not require compiler cooperation. Because of these advantages, both reference counting and conservative garbage collection are widely used in practice. Because both suffer significant performance overheads, they are generally not used in performance critical settings. This dissertation carefully examines reference counting and conservative garbage collection to understand their behavior and improve their performance. My thesis is that reference counting and conservative garbage collection can perform as well or better than the best performing garbage collectors. The key contributions of my thesis are: 1) An in-depth analysis of the key design choices for reference counting. 2) Novel optimizations guided by that analysis that significantly improve reference counting performance and make it competitive with a well tuned tracing garbage collector. 3) A new collector, RCImmix, that replaces the traditional free-list heap organization of reference counting with a line and block heap structure, which improves locality, and adds copying to mitigate fragmentation. The result is a collector that outperforms a highly tuned production generational collector. 4) A conservative garbage collector based on RCImmix that matches the performance of a highly tuned production generational collector. Reference counting and conservative garbage collection have lived under the shadow of tracing and exact garbage collection for a long time. My thesis focuses on bringing these somewhat neglected branches of garbage collection back to life in a high performance setting and leads to two very surprising results: 1) a new garbage collector based on reference counting that outperforms a highly tuned production generational tracing collector, and 2) a variant that delivers high performance conservative garbage collection.
... MC² [34] is an incremental soft real-time garbage collector designed for memory constrained devices, which cannot provide hard guarantees on maximum pause time and CPU utilization, but comes with low space overhead and tight space bounds. Stopless [31] is another garbage collector with soft guarantees on response times. It provides low latency while preserving lock-freedom, supporting atomic operations, controlling fragmentation by compaction, and supporting multiprocessor platforms. ...
Article
We study, formally and experimentally, the trade-off in temporal and spatial overhead when managing contiguous blocks of memory using the explicit, dynamic and real-time heap management system Compact-fit (CF). The key property of CF is that temporal and spatial overhead can be bounded, related, and predicted in constant time through the notion of partial and incremental compaction. Partial compaction determines the maximally tolerated degree of memory fragmentation. Incremental compaction of objects, introduced here, determines the maximal amount of memory involved in any, logically atomic, portion of a compaction operation. We explore CF's potential application space on (1) multiprocessor and multicore systems as well as on (2) memory-constrained uniprocessor systems. For (1), we argue that little or no compaction is likely to avoid the worst case in temporal as well as spatial overhead but also observe that scalability only improves by a constant factor. Scalability can be further improved significantly by reducing overall data sharing through separate instances of Compact-fit. For (2), we observe that incremental compaction can effectively trade-off throughput and memory fragmentation for lower latency.
... The algorithm does not guarantee that copying finishes. Stopless (Pizlo et al. 2007) uses double-wide CAS to copy objects in a lock-free fashion for the mutator. What these algorithms have in common is their reliance on blocking handshakes to finish before copying. ...
Conference Paper
On-the-fly Garbage Collectors (GCs) are the state-of-the-art concurrent GC algorithms today. Everything is done concurrently, but phases are separated by blocking handshakes. Hence, progress relies on the scheduler to let application threads (mutators) run into GC checkpoints to reply to the handshakes. For a non-blocking GC, these blocking handshakes need to be addressed. Therefore, we propose a new non-blocking handshake to replace previous blocking handshakes. It guarantees scheduling-independent operation level progress without blocking. It is scheduling independent but requires some other OS support. It allows bounded waiting for threads that are currently running on a processor, regardless of threads that are not running on a processor. We discuss this non-blocking handshake in two GC algorithms for stack scanning and copying objects. They pave way for a future completely non-blocking GC by solving hard open theory problems when OS support is permitted. The GC algorithms were integrated to the G1 GC of OpenJDK for Java. GC pause times were reduced to 12.5% compared to the original G1 on average in DaCapo. For a memory intense benchmark, latencies were reduced from 174 ms to 0.67 ms for the 99.99% percentile. The improved latency comes at a cost of 15% lower throughput.
... In this paper, we checked three GC algorithms: Chicken [34], Staccato [31] and Stopless [33]. The details of these algorithms can be found in their papers. ...
Article
Software model checking suffers from the so-called state explosion problem, and relaxed memory consistency models even worsen this situation. What is worse, parameterizing model checking by memory consistency models, that is, to make the model checker as flexible as we can supply definitions of memory consistency models as an input, intensifies state explosion. This paper explores specific reasons for state explosion in model checking with multiple memory consistency models, provides some optimizations intended to mitigate the problem, and applies them to McSPIN, a model checker for memory consistency models that we are developing. The effects of the optimizations and the usefulness of McSPIN are demonstrated experimentally by verifying copying protocols of concurrent copying garbage collection algorithms. To the best of our knowledge, this is the first model checking of the concurrent copying protocols under relaxed memory consistency models.
... Melt improves the performance of a few programs relative to barrier overhead. This improvement comes from better program locality (jython and lusearch) and lower GC overhead (xalan). Melt's 6% read barrier overhead is comparable to read barrier overheads for concurrent, incremental, and real-time collectors [BCR03, DLM+78, PFPS07]. With the increasing importance of concurrent software and the potential introduction of transactional memory hardware, future general-purpose hardware is likely to provide read barriers with no overhead. ...
... The Stopless garbage collector [20] ensures consistency during copying by clever use of compare-and-swap (CAS) operations. Unfortunately, copying may not terminate in adverse situations, which makes the approach unsuitable for hard real-time systems. ...
Conference Paper
Garbage collection is a well known technique to increase program safety and developer productivity. Within the past few years, it has also become feasible for uniprocessor hard real-time systems. However, garbage collection for multi-processors does not yet meet the requirements of hard real-time systems. In this paper, we present a hard real-time garbage collector for a Java chip multi-processor that provides non-disruptive and analyzable behavior. For retrieving the references in local variables of threads, we propose a protocol that minimizes disruptions for high-priority tasks while still providing good bounds on the time until stack scanning finishes. Also, we developed a hardware unit that enables transparent, preemptible copying of objects, which eliminates the need to block tasks while copying objects. Evaluation of the hardware shows that the copy unit introduces only little overhead and does not limit the critical path. Measurements resulted in release jitter for high-priority tasks of 224 μs or less on an embedded multi-processor with 8 cores clocked at 100 MHz. This indicates that with the proposed garbage collector, high scheduling quality and garbage collection do not contradict each other on chip multi-processors.
... MC² [34] is an incremental soft real-time garbage collector designed for memory constrained devices, which cannot provide hard guarantees on maximum pause time and CPU utilization, but comes with low space overhead and tight space bounds. Stopless [31] is another garbage collector with soft guarantees on response times. It provides low latency while preserving lock-freedom, supporting atomic operations, controlling fragmentation by compaction, and supporting multiprocessor platforms. ...
Technical Report
Full-text available
We study, formally and experimentally, the trade-off in temporal and spatial performance when managing contiguous pieces of memory using the explicit, dynamic memory management system Compact-fit (CF). The key property of CF is that temporal and spatial performance can be bounded, related, and predicted in constant time through the notion of partial and incremental compaction. Partial compaction determines the maximally tolerated degree of memory fragmentation. Incremental compaction, introduced here, determines the maximal amount of memory involved in any, logically atomic portion of a compaction operation. We explore CF's potential application space on (1) multiprocessor and multicore systems as well as on (2) memory-constrained uniprocessor systems. For (1), we argue that little or no compaction is likely to avoid the worst case in temporal as well as spatial performance but also observe that scalability only improves by a constant factor. Scalability can be further improved significantly by reducing overall data sharing through separate instances of Compact-fit. For (2), we observe that incremental compaction can effectively trade off throughput and memory fragmentation for lower latency.
... There has been much research on concurrent garbage collection, mainly targeting real-time applications [6,17,22,23]. These collectors have been evaluated on eight cores and two nodes at most. ...
Conference Paper
Large-scale multicore architectures create new challenges for garbage collectors (GCs). In particular, throughput-oriented stop-the-world algorithms demonstrate good performance with a small number of cores, but have been shown to degrade badly beyond approximately 8 cores on a 48-core machine with OpenJDK 7. This negative result raises the question whether the stop-the-world design has intrinsic limitations that would require a radically different approach. Our study suggests that the answer is no, and that there is no compelling scalability reason to discard the existing highly-optimised throughput-oriented GC code on contemporary hardware. This paper studies the default throughput-oriented garbage collector of OpenJDK 7, called Parallel Scavenge. We identify its bottlenecks, and show how to eliminate them using well-established parallel programming techniques. On the SPECjbb2005, SPECjvm2008 and DaCapo 9.12 benchmarks, the improved GC matches the performance of Parallel Scavenge at low core count, but scales well, up to 48 cores.
... Another, more severe issue is that most garbage collection techniques are not suitable for hard real-time systems: Either the GC does not actively fight external fragmentation, which imposes difficulties in giving allocation guarantees, or, particularly for moving GCs, it needs to stop the application until all references within the application to a moved object have been updated, which may cause unacceptable pause times. Real-time GC is a research area of its own, but there exist GCs that are suited to deployment under real-time constraints [12][13][14][15][16]. On the other hand, many real-time applications get along with purely static memory allocation, which also eases the verification required for this kind of application. ...
Article
Java is still a rather exotic language in the field of real-time and particularly embedded systems, even though it could provide productivity and especially safety and dependability benefits over the dominating language C. The reasons for the lack of acceptance of Java in the embedded world are the high resource consumption caused by the Java runtime environment and the lack of language features for low-level programming. KESO is a Java Virtual Machine (JVM) that was specifically designed for statically configured resource-constrained embedded systems. Rather than providing a fixed subset of the Java standard functionality, KESO uses the available ahead-of-time knowledge to generate a Java runtime that is specifically tailored towards the particular application. A key feature of KESO is its Multi-JVM architecture, which allows the isolated cohabitation of different applications on one hardware platform. Our evaluation uses two non-trivial real-time applications, a control application for a quadrotor helicopter and a collision detector, to compare the cost of an application using KESO to its C counterpart. Our results show that the resource consumption of applications developed on the basis of KESO is comparable to C applications, and its mechanisms for communicating among isolated components are efficient and encourage the actual utilization of spatial isolation. Copyright © 2011 John Wiley & Sons, Ltd.
... The algorithm does not guarantee that copying finishes. Stopless (Pizlo et al. 2007) uses double-wide CAS to copy objects in a lock-free fashion for the mutator. What these algorithms have in common is their reliance on blocking handshakes to finish before copying. ...
Article
Many software engineering applications require points-to analysis. These client applications range from optimizing compilers to integrated program development environments (IDEs) and from testing environments to reverse-engineering tools. Moreover, software engineering applications used in an edit-compile cycle need points-to analysis to be fast and precise. In this article, we present a new context- and flow-sensitive approach to points-to analysis where calling contexts are distinguished by the points-to sets analyzed for their call target expressions. Compared to other well-known context-sensitive techniques it is faster in practice: on average, twice as fast as the call string approach and an order of magnitude faster than the object-sensitive technique. In fact, it proves to be only marginally slower than a context-insensitive baseline analysis. At the same time, it provides higher precision than the call string technique and is similar in precision to the object-sensitive technique. We confirm these statements with experiments using a number of abstract precision metrics and a concrete client application: escape analysis.
... However, this GC is not suitable for real-time software since stop-the-world GCs are well understood to have negative impacts on meeting real-time deadlines. For example, Pizlo et al. [20] showed that an STW GC under high load can miss as much as 95% of deadlines. Two notable efforts to implement a real-time suitable GC are Microsoft's exploration into multi-core GC with local heaps [66] and a Google Summer of Code (GSoC) project [67] to implement IMMIX for Haskell. ...
Article
Functional programming languages play an important role in the development of correct software systems. As embedded devices become pervasive and perform critical tasks in our lives, their reliability becomes paramount. This presents a natural opportunity to explore the application of functional programming languages to systems that demand highly predictable behavior. In this paper we explore existing functional programming language compilers and their applicability to real-time, embedded systems. We do this by defining important characteristics needed by a real-time programming language and survey how well existing languages meet these characteristics. We conduct empirical analysis of language runtimes in order to assess the impact of dynamic memory management on predictability and performance. Lastly, we review different programming models for expressing real-time considerations in applications.
... Pizlo et al. [21] present an algorithm that makes use of existing hardware features to achieve lockless real-time behavior. Our algorithm differs, leveraging the TM infrastructure to do object copying, and potentially ameliorating mutator performance issues. ...
Conference Paper
Full-text available
We predict that the ever-growing number of cores on our desktops will require a re-examination of concurrent programming. Two technologies are likely to become mainstream in response: Transactional memory provides a superior programming model to traditional lock-based concurrency, while Concurrent GC can take advantage of multiple cores to eliminate perceptible pauses in desktop applications such as games or Internet telephony. This paper proposes a combination of the two technologies, producing a synergy that improves scalability while eliminating the annoyance of user-perceivable pauses. Specifically, we show how concurrent GC can share some of the mechanisms required for transactional memory. Thus as transactional memory becomes more efficient, so too will concurrent GC. We demonstrate how, using a state of the art software transactional memory system, we can build a state of the art concurrent collector. Our goal was to reduce 90% of pause times to under one millisecond. Of the remainder, we aim for 90% to be under 10 ms, and 90% of those left to be under 100 ms. Our performance results show that we were able to achieve these targets, with pause times between one and two orders of magnitude lower than mainstream technologies.
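The nested pause-time goals in the abstract above can be restated as cumulative percentiles. The following is a minimal sketch of that arithmetic only; the function name `cumulative_targets` and its parameters are illustrative assumptions, not anything taken from the paper.

```python
def cumulative_targets(fraction=0.9, bounds_ms=(1, 10, 100)):
    """Turn nested goals ("90% under 1 ms, 90% of the rest under 10 ms, ...")
    into cumulative fractions of all pauses under each bound.

    `fraction` and `bounds_ms` mirror the numbers quoted in the abstract;
    the helper itself is hypothetical.
    """
    remaining = 1.0  # share of pauses not yet covered by a tighter bound
    result = []
    for bound in bounds_ms:
        remaining *= (1.0 - fraction)           # what is still left over
        result.append((bound, 1.0 - remaining))  # cumulative share covered
    return result


targets = cumulative_targets()
for bound, share in targets:
    print(f"under {bound} ms: {share:.1%}")
# prints:
# under 1 ms: 90.0%
# under 10 ms: 99.0%
# under 100 ms: 99.9%
```

Read this way, the stated goal allows at most one pause in a thousand to reach 100 ms or more.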
... The Fiji VM compiler has extensive support for garbage collection and scope check barriers, including those barriers necessary to implement common anti-fragmentation techniques such as [24] or [20]. Additionally, we use a combination of techniques from [6] and [11] to enable accurate stack scanning of generated C functions. ...
Conference Paper
Full-text available
Real-time Java is quickly emerging as a platform for building safety-critical embedded systems. The real-time variants of Java, including (8, 15), are attractive alternatives to Ada and C since they provide a cleaner, simpler, and safer programming model. Unfortunately, current real-time Java implementations have trouble scaling down to very hard real-time embedded settings, where memory is scarce and processing power is limited. In this paper, we describe the architecture of the Fiji VM, which enables vanilla Java applications to run in very hard environments, including booting on bare hardware with only very rudimentary operating system support. We also show that our minimalistic approach delivers comparable performance to that of server-class production Java Virtual Machine implementations. 3. A survey of limitations Java virtual machines face when targeting hard real-time systems. We discuss our previous experience embedding Java virtual machines and identify six key goals state-of-the-art virtual machines should address to be better suited for hard real-time and mission critical application domains. 4. A concise description of the Fiji VM compiler and runtime system. We describe the unique features of the Fiji VM compiler and runtime system and their implementation. We address each of the six goals in the implementation and design of the Fiji VM. 5. A performance comparison between the Fiji VM and state-of-the-art, server-class Java VMs. We also include performance numbers of the SPECjvm98 benchmark suite running on bare hardware. 1.1 Background Over the past seven years, our group at Purdue has been develop-
... Garbage Collection (GC) barriers have been studied extensively [11,13,25,28,46,50,51,60,61,65]. Barriers are needed for concurrent, semi-concurrent, and generational garbage collectors. ...
Conference Paper
Byte addressable, Non-Volatile Memory (NVM) is emerging as a revolutionary technology that provides near-DRAM performance and scalable memory capacity. To facilitate the usability of NVM, new programming frameworks have been proposed to automatically or semi-automatically maintain crash-consistent data structures, relieving much of the burden of developing persistent applications from programmers. While these new frameworks greatly improve programmer productivity, they also require many runtime checks for correct execution on persistent objects, which significantly affect the application performance. With a characterization study of various workloads, we find that the overhead of these persistence checks in these programmer-friendly NVM frameworks can be substantial and reach up to 214%. Furthermore, we find that programs nearly always access exclusively either a persistent or a non-persistent object at a given site, making the behavior of these checks highly predictable. In this paper, we propose QuickCheck, a technique that biases persistence checks based on their expected behavior, and exploits speculative optimizations to further reduce the overheads of these persistence checks. We evaluate QuickCheck with a variety of data intensive applications such as a key-value store. Our experiments show that QuickCheck improves the performance of a persistent Java framework on average by 48.2% for applications that do not require data persistence, and by 8.0% for a persistent memcached implementation running YCSB.
... Sapphire [24] implemented a copying collector for Java with low overheads and a short pause time. Pizlo et al. [25] proposed two lock-free concurrent GC algorithms, CHICKEN and CLOVER, that have pause times on the order of microseconds, and compared them with another algorithm, STOPLESS [26]. Although these algorithms are designed for and implemented in C#, it is easy to adapt them to Java as the two languages are similar in many aspects. ...
Article
Java has been increasingly used in programming for real-time systems. However, some of Java's features such as automatic memory management and dynamic compilation are harmful to time predictability. If these problems are not solved properly then it can fundamentally limit the usage of Java for real-time systems, especially for hard real-time systems that require very high time predictability. In this paper, we propose to exploit multicore computing in order to reduce the timing unpredictability that is caused by dynamic compilation and adaptive optimization. Our goal is to retain high performance comparable to that of traditional dynamic compilation, while at the same time, obtain better time predictability for Java virtual machine (JVM). We have studied pre-compilation techniques to utilize another core more efficiently, pre-optimization on another core (PoAC) scheme to replace the adaptive optimization system (AOS) in Jikes JVM and the counter based optimization (CBO). Our evaluation reveals that the proposed approaches are able to attain high performance while greatly reducing the variation of the execution time for Java applications.
Article
Programmers are turning to radical architectures such as reconfigurable hardware (FPGAs) to achieve performance. But such systems, programmed at a very low level in languages with impoverished abstractions, are orders of magnitude more complex to use than conventional CPUs. The continued exponential increase in transistors, combined with the desire to implement ever more sophisticated algorithms, makes it imperative that such systems be programmed at much higher levels of abstraction. One of the fundamental high-level language features is automatic memory management in the form of garbage collection. We present the first implementation of a complete garbage collector in hardware (as opposed to previous "hardware-assist" techniques), using an FPGA and its on-chip memory. Using a completely concurrent snapshot algorithm, it provides single-cycle access to the heap, and never stalls the mutator for even a single cycle, achieving a deterministic mutator utilization (MMU) of 100%. We have synthesized the collector to hardware and show that it never consumes more than 1% of the logic resources of a high-end FPGA. For comparison we also implemented explicit (malloc/free) memory management, and show that real-time collection is about 4% to 17% slower than malloc, with comparable energy consumption. Surprisingly, in hardware real-time collection is superior to stop-the-world collection on every performance axis, and even for stressful micro-benchmarks can achieve 100% MMU with heaps as small as 1.01 to 1.4 times the absolute minimum.
Article
Proposals to treat data races as exceptions provide simplified semantics for shared-memory multithreaded programming languages and memory models by guaranteeing that execution remains data-race-free and sequentially consistent or an exception is raised. However, the high cost of precise race detection has kept the cost-to-benefit ratio of data-race exceptions too high for widespread adoption. Most research to improve this ratio focuses on lowering performance cost. In this position paper, we argue that with small changes in how we view data races, data-race exceptions enable a broad class of benefits beyond the memory model, including performance and simplicity in applications at the runtime system level. When attempted (but exception-raising) racy accesses are treated as legal (but exceptional) behavior, applications can exploit the guarantees of the data-race exception mechanism by performing potentially racy accesses and guiding execution based on whether these potential races manifest as exceptions. We apply these insights to concurrent garbage collection, optimistic synchronization elision, and best-effort automatic recovery from exceptions due to sequential-consistency-violating races.
Article
Garbage collection is a well‐known technique to increase program safety and developer productivity. Within the past few years, it has also become feasible for uniprocessor hard real‐time systems. However, garbage collection for multi‐processors does not yet meet the requirements of hard real‐time systems. In this paper, we present a hard real‐time garbage collector for a Java chip multi‐processor that provides non‐disruptive and analyzable behavior. For retrieving the references in local variables of threads, we propose a protocol that minimizes disruptions for high‐priority tasks while still providing good bounds on the time until stack scanning finishes. Also, we developed a hardware unit that enables transparent, preemptible copying of objects, which eliminates the need to block tasks while copying objects. Evaluation of the hardware shows that the copy unit introduces only little overhead and does not limit the critical path. Analyses for different aspects of the system are presented, which indicate that comprehensive analysis of the presented system is indeed possible. Measurements resulted in release jitter for high‐priority tasks of 362 μs or less on an embedded multi‐processor with eight cores clocked at 100 MHz. This indicates that with the proposed garbage collector, high scheduling quality and garbage collection do not contradict each other on chip multi‐processors. Copyright © 2012 John Wiley & Sons, Ltd.
Article
Full-text available
Most proposals for on-the-fly garbage collection ignore the question of Java's weak and other reference types. However, we show that reference types are heavily used in DaCapo benchmarks. Of the few collectors that do address this issue, most block mutators, either globally or individually, while processing reference types. We introduce a new framework for processing reference types on-the-fly in Jikes RVM. Our framework supports both insertion and deletion write barriers. We have model checked our algorithm and incorporated it in our new implementation of the Sapphire on-the-fly collector. Using a deletion barrier, we process references while mutators are running in less than three times the time that previous approaches take while mutators are halted; our overall execution times are no worse, and often better.
Article
Managed Runtime Environments (MRE) are increasingly used for application servers that use large multi-core hardware. We find that the garbage collector is critical for overall performance in this setting. We explore the costs and scalability of the garbage collectors on a contemporary 48-core multiprocessor machine. We present experimental evaluation of the parallel and concurrent garbage collectors present in OpenJDK, a widely-used Java virtual machine. We show that garbage collection represents a substantial amount of an application's execution time, and does not scale well as the number of cores increases. We attempt to identify some critical scalability bottlenecks for garbage collectors.
Article
Functional programming languages play an important role in the development of correct software systems. As embedded devices become pervasive and perform critical tasks in our lives, their reliability becomes paramount. This presents a natural opportunity to explore the application of functional programming languages to systems that demand highly predictable behavior. In this paper, we explore existing functional programming language compilers and their applicability to real‐time embedded systems. We do this by defining important characteristics needed by a real‐time programming language and survey how well existing languages meet these characteristics. We conduct empirical analysis of language runtimes in order to assess the impact of dynamic memory management on predictability and performance. Lastly, we review different programming models for expressing real‐time considerations in applications.
Article
Full-text available
Memory-management support for lock-free data structures is well known to be a tough problem. Recent work has successfully reduced the overhead of such schemes. However, applying memory-management support to a data structure remains complex and, in many cases, requires redesigning the data structure. In this paper, we present the first lock-free memory-management scheme that is applicable to general (arbitrary) lock-free data structures and that can be applied automatically via a compiler plug-in. In addition to being simple to incorporate into data structures, this scheme provides low overhead and does not rely on the lock freedom of any OS services.
Chapter
The relaxedness of memory consistency models, which allows the reordering of instructions and their effects, intensifies the state explosion problem of software model checking. In this paper, we propose three approaches that can reduce the number of states to be visited in software model checking with memory consistency models. The proposed methods control the reordering of instructions. The first approach controls the number of reordered instructions. The second approach specifies the instructions that are reordered in advance, and prevents the other instructions from being reordered. The third approach specifies the instructions that are reordered, and preferentially explores execution traces with the reorderings. We applied these approaches to the McSPIN model checker that we have been developing, and reported the effectiveness of the approaches by examining various concurrent programs.
Conference Paper
For non-blocking data structures, only memory reclamation with pointer-based techniques can maintain non-blocking progress, but there can be high overhead associated with these techniques, with the most notable example being Hazard Pointers. We present a new algorithm we named Hazard Eras, which allows for efficient lock-free or wait-free memory reclamation in concurrent data structures and can be used as a drop-in replacement for Hazard Pointers. Results from our microbenchmark show that when applied to a lock-free linked list, Hazard Eras will match the throughput of Hazard Pointers in the worst case, and can outperform Hazard Pointers by a factor of 5x. Hazard Eras provides the same progress conditions as Hazard Pointers and can equally be implemented with the C11/C++11 memory model and atomics, making it portable across multiple systems.
Conference Paper
Software model checking suffers from the so-called state explosion problem, and relaxed memory consistency models make this situation even worse. What is worse, parameterizing model checking by memory consistency models, that is, making the model checker flexible enough to accept definitions of memory consistency models as input, intensifies state explosion. This paper explores specific reasons for state explosion in model checking with multiple memory consistency models, provides some optimizations intended to mitigate the problem, and applies them to McSPIN, a model checker for memory consistency models that we are developing. The effects of the optimizations and the usefulness of McSPIN are demonstrated experimentally by verifying copying protocols of concurrent copying garbage collection algorithms. To the best of our knowledge, this is the first model checking of the concurrent copying protocols under relaxed memory consistency models.
Article
This paper presents a concurrent garbage collection method for functional programs running on a multicore processor. It is a concurrent extension of our bitmap-marking non-moving collector with Yuasa's snapshot-at-the-beginning strategy. Our collector is unobtrusive in the sense of the Doligez-Leroy-Gonthier collector; the collector does not stop any mutator thread nor does it force them to synchronize globally. The only critical sections between a mutator and the collector are the code to enqueue/dequeue a 32 kB allocation segment to/from a global segment list and the write barrier code to push an object pointer onto the collector's stack. Most of these data structures can be implemented in standard lock-free data structures. This achieves both efficient allocation and unobtrusive collection in a multicore system. The proposed method has been implemented in SML#, a full-scale Standard ML compiler supporting multiple native threads on multicore CPUs. Our benchmark tests show a drastically short pause time with reasonably low overhead compared to the sequential bitmap-marking collector.
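The write barrier mentioned in the abstract above follows the snapshot-at-the-beginning idea: while marking is in progress, a pointer that is about to be overwritten is handed to the collector so that everything reachable when marking started still gets marked. The sketch below is a single-threaded simulation of that idea only. The class names `Obj` and `Collector`, the dictionary-based fields, and the step-wise marking loop are all assumptions made for illustration; they are not the SML# implementation, which runs the collector concurrently and protects the push with a critical section.

```python
class Obj:
    """A heap object with named pointer fields (hypothetical model)."""
    def __init__(self, name):
        self.name = name
        self.fields = {}
        self.marked = False


class Collector:
    def __init__(self):
        self.marking = False
        self.stack = []  # the collector's mark stack

    def start(self, roots):
        """Begin a marking cycle from a snapshot of the roots."""
        self.marking = True
        self.stack = list(roots)

    def write(self, obj, field, new):
        """Mutator pointer store with a deletion (snapshot) barrier."""
        old = obj.fields.get(field)
        if self.marking and old is not None and not old.marked:
            self.stack.append(old)  # keep the overwritten target alive
        obj.fields[field] = new

    def step(self):
        """Mark one object; return False once the stack is drained."""
        while self.stack:
            o = self.stack.pop()
            if not o.marked:
                o.marked = True
                self.stack.extend(c for c in o.fields.values() if c is not None)
                return True
        self.marking = False
        return False


# Heap at snapshot time: root -> a -> b
root, a, b = Obj("root"), Obj("a"), Obj("b")
root.fields["x"] = a
a.fields["next"] = b

gc = Collector()
gc.start([root])
gc.step()                   # marks root, pushes a
gc.write(a, "next", None)   # mutator unlinks b before the collector reaches a
while gc.step():
    pass
```

Without the barrier, `b` would never be visited because the only heap path to it was removed before `a` was scanned; with it, `b` is pushed at the moment of the overwrite and ends up marked.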
Article
Large-scale multicore architectures create new challenges for garbage collectors (GCs). In particular, throughput-oriented stop-the-world algorithms demonstrate good performance with a small number of cores, but have been shown to degrade badly beyond approximately 8 cores on a 48-core machine with OpenJDK 7. This negative result raises the question whether the stop-the-world design has intrinsic limitations that would require a radically different approach. Our study suggests that the answer is no, and that there is no compelling scalability reason to discard the existing highly-optimised throughput-oriented GC code on contemporary hardware. This paper studies the default throughput-oriented garbage collector of OpenJDK 7, called Parallel Scavenge. We identify its bottlenecks, and show how to eliminate them using well-established parallel programming techniques. On the SPECjbb2005, SPECjvm2008 and DaCapo 9.12 benchmarks, the improved GC matches the performance of Parallel Scavenge at low core count, but scales well, up to 48 cores.
Conference Paper
Motivated by developing a memory management system that allows functional languages to seamlessly inter-operate with C, we propose an efficient non-moving garbage collection algorithm based on bitmap marking and report its implementation and performance evaluation. In our method, the heap consists of sub-heaps {H_i | c ≤ i ≤ B} of exponentially increasing allocation sizes (H_i for 2^i bytes) and a special sub-heap for exceptionally large objects. Actual space for each sub-heap is dynamically allocated and reclaimed from a pool of fixed size allocation segments. In each allocation segment, the algorithm maintains a bitmap representing the set of live objects. Allocation is done by searching for the next free bit in the bitmap. By adding meta-level bitmaps that summarize the contents of bitmaps hierarchically and maintaining the current bit position in the bitmap hierarchy, the next free bit can be found in a small constant time for most cases, and in log_32(segmentSize) time in the worst case on a 32-bit architecture. The collection is done by clearing the bitmaps and tracing live objects. The algorithm can be extended to generational GC by maintaining multiple bitmaps for the same heap space. The proposed method does not require compaction and objects are not moved at all. This property is significant for a functional language to inter-operate with C, and it should also be beneficial in supporting multiple native threads. The proposed method has been implemented in a full-scale Standard ML compiler. Our benchmark tests show that our non-moving collector performs as efficiently as a generational copying collector designed for functional languages.
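The hierarchical bitmap search described in the abstract above can be illustrated with a two-level version. This is a simplified sketch under stated assumptions: 32-bit words, a single summary word over 32 leaf words (1024 slots), and a search that always restarts from the lowest free bit rather than tracking a current position as the paper describes. The names `TwoLevelBitmap` and `lowest_zero_bit` are invented for the example.

```python
WORD = 32
FULL = (1 << WORD) - 1


def lowest_zero_bit(word):
    """Index of the lowest clear bit in a WORD-bit word, or -1 if full."""
    inverted = ~word & FULL
    if inverted == 0:
        return -1
    return (inverted & -inverted).bit_length() - 1


class TwoLevelBitmap:
    """Allocation bitmap with one summary word over WORD leaf words."""

    def __init__(self):
        self.leaves = [0] * WORD  # bit i of leaves[j] set <=> slot j*WORD+i in use
        self.summary = 0          # bit j set <=> leaves[j] is completely full

    def allocate(self):
        """Return the lowest free slot index, or None if every slot is used."""
        j = lowest_zero_bit(self.summary)  # first leaf word with a free bit
        if j < 0:
            return None
        i = lowest_zero_bit(self.leaves[j])
        self.leaves[j] |= 1 << i
        if self.leaves[j] == FULL:
            self.summary |= 1 << j         # propagate "full" upward
        return j * WORD + i

    def free(self, slot):
        """Release a slot and clear the summary bit of its leaf word."""
        j, i = divmod(slot, WORD)
        self.leaves[j] &= ~(1 << i)
        self.summary &= ~(1 << j)


bitmap = TwoLevelBitmap()
first_forty = [bitmap.allocate() for _ in range(40)]  # crosses a word boundary
bitmap.free(5)
reused = bitmap.allocate()
```

Each extra summary level multiplies the covered slots by 32 while adding one word inspection to the search, which is where a log_32(segmentSize) worst-case bound comes from.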
Conference Paper
This paper presents a concurrent garbage collection method for functional programs running on a multicore processor. It is a concurrent extension of our bitmap-marking non-moving collector with Yuasa's snapshot-at-the-beginning strategy. Our collector is unobtrusive in the sense of the Doligez-Leroy-Gonthier collector; the collector does not stop any mutator thread nor does it force them to synchronize globally. The only critical sections between a mutator and the collector are the code to enqueue/dequeue a 32 kB allocation segment to/from a global segment list and the write barrier code to push an object pointer onto the collector's stack. Most of these data structures can be implemented in standard lock-free data structures. This achieves both efficient allocation and unobtrusive collection in a multicore system. The proposed method has been implemented in SML#, a full-scale Standard ML compiler supporting multiple native threads on multicore CPUs. Our benchmark tests show a drastically short pause time with reasonably low overhead compared to the sequential bitmap-marking collector.
Conference Paper
Java served well as a general-purpose language. However, during its two decades of constant change it has gotten some weight and legacy in the language syntax and the libraries. Furthermore, Java's success for real-time systems is mediocre. Scala is a modern object-oriented and functional language with interesting new features. Although a new language, it executes on a Java virtual machine, reusing that technology. This paper explores Scala as a language for future real-time systems.
Conference Paper
Lock-free data structures achieve high responsiveness, aid scalability, and avoid deadlocks and livelocks. But providing memory management support for such data structures without foiling their progress guarantees is difficult. Often, designers employ the hazard pointers technique, which may impose a high performance overhead. In this work we propose a novel memory management scheme for lock-free data structures called optimistic access. This scheme provides efficient support for lock-free data structures that can be presented in a normalized form. Our novel memory manager breaks the traditional memory management invariant which never lets a program touch reclaimed memory. In other words, it allows the memory manager to reclaim objects that may still be accessed later by concurrently running threads. This broken invariant provides an opportunity to obtain high parallelism with excellent performance, but it also requires a careful design. The optimistic access memory management scheme is easy to employ and we implemented it for a linked list, a hash table, and a skip list. Measurements show that it dramatically outperforms known memory reclamation methods.
Article
Managed languages such as Java and C# are being considered for use in hard real-time systems. A hurdle to their widespread adoption is the lack of garbage collection algorithms that offer predictable space-and-time performance in the face of fragmentation. We introduce SCHISM/CMR, a new concurrent and real-time garbage collector that is fragmentation tolerant and guarantees time-and-space worst-case bounds while providing good throughput. SCHISM/CMR combines mark-region collection of fragmented objects and arrays (arraylets) with separate replication-copying collection of immutable arraylet spines, so as to cope with external fragmentation when running in small heaps. We present an implementation of SCHISM/CMR in the Fiji VM, a high-performance Java virtual machine for mission-critical systems, along with a thorough experimental evaluation on a wide variety of architectures, including server-class and embedded systems. The results show that SCHISM/CMR tolerates fragmentation better than previous schemes, with a much more acceptable throughput penalty.
Conference Paper
Byte-addressable, non-volatile memory (NVM) is emerging as a revolutionary memory technology that provides persistency, near-DRAM performance, and scalable capacity. To facilitate its use, many NVM programming models have been proposed. However, most models require programmers to explicitly specify the data structures or objects that should reside in NVM. Such requirement increases the burden on programmers, complicates software development, and introduces opportunities for correctness and performance bugs.
Article
Clustered Collection reduces garbage collection pauses in programs with large amounts of live data. A full collection of millions of live objects can pause the program for multiple seconds. Much of this work, however, is repeated from one collection to the next, particularly for programs that modify only a small fraction of their object graphs between collections. Clustered Collection reduces redundant work by identifying regions of the object graph which, once traced, need not be traced by subsequent collections. Each of these regions, or "clusters," consists of objects reachable from a single head object. If the collector can reach a cluster's head object, it skips over the cluster, and resumes tracing at the pointers that leave the cluster. If a cluster's head object is not reachable, or an object within a cluster has been written, the cluster collector may have to trace within the cluster. Clustered Collection is complete despite not tracing within clusters: it frees all unreachable objects. Clustered Collection is implemented as modifications to the Racket collector. Measurements of the code and data from the Hacker News web site show that Clustered Collection decreases full collection pause times by a factor of three. Hacker News works well with Clustered Collection because it keeps gigabytes of data in memory but modifies only a small fraction of that data. Other experiments demonstrate the ability of Clustered Collection to tolerate certain kinds of writes, and quantify the cost of finding clusters.
Thesis
Full-text available
Automatic memory management, also known as garbage collection, is a suitable means to increase productivity in the development of computer programs. As it helps to avoid common errors in memory management, such as memory leaks or dangling pointers, it also helps to make programs safer. Conventional garbage collection techniques are however not suited for use in hard real-time systems. As a failure in these systems can have catastrophic consequences, such as the loss of human life, it is of utmost importance to meet the timing requirements of the respective system. In the past few years, methods for garbage collection that are suitable for use in hard real-time systems have been developed. However, these techniques are suited only for uniprocessors. On multi-processors, garbage collection does not yet meet the requirements for hard real-time systems. Garbage collection reclaims memory objects that are provably inaccessible to the application. Objects that are referenced by global or local variables can be accessed at any time. Starting at these objects, the object graph is traversed by the garbage collector to determine which objects are reachable. As there is the possibility that reachable objects are accessed by the application, they must not be reclaimed. Determining the object references in local variables is challenging in hard real-time systems, because the timing requirements usually do not permit stopping the application to do so. The development of appropriate techniques is one of the core areas of this dissertation. Memory fragmentation cannot be tolerated in hard real-time systems, because otherwise, it would be impossible to determine reasonable bounds on the maximum memory consumption. In order to defragment memory, it is necessary to relocate objects. To achieve this without interrupting the application, a hardware unit has been developed, which enables transparent object relocation on chip multi-processors.
Multi-processors often use memory models that provide only few guarantees on the ordering of memory accesses. The effects of such memory models on garbage collection are investigated in this thesis. Also, an analysis to determine the maximum size of memory allocations of tasks is presented. Such an analysis is necessary to compute whether the allocation requests of an application can be met even in the worst case. This thesis points out solutions which enable the reconciliation of garbage collection with the requirements of hard real-time systems on chip multi-processors. The theoretic results are supported by measurement results, which indicate that low release jitter can be achieved in multi-processor systems in the presence of garbage collection.
Article
Lock-free data-structures are widely employed in practice, yet designing lock-free memory reclamation for them is notoriously difficult. In particular, all known lock-free reclamation schemes are "manual" in the sense that the developer has to specify when nodes have retired and may be reclaimed. Retiring nodes adequately is non-trivial and often requires the modification of the original lock-free algorithm. In this paper we present an automatic lock-free reclamation scheme for lock-free data-structures in the spirit of a mark-sweep garbage collection. The proposed algorithm works with any normalized lock-free algorithm and with no need for the programmer to retire nodes or make changes to the algorithm. Evaluation of the proposed scheme on a linked-list and a hash table shows that it performs similarly to the best manual (lock-free) memory reclamation scheme.
Conference Paper
Full-text available
Garbage collection algorithms for shared-memory multiprocessors typically rely on some form of global synchronization to preserve consistency. Such global synchronization may lead to problems on asynchronous architectures: if one process is halted or delayed, other, nonfaulty processes will be unable to progress. By contrast, a storage management algorithm is lock-free if (in the absence of resource exhaustion) a process that is allocating or collecting memory can be delayed at any point without forcing other processes to block. The authors present the first algorithm for lock-free garbage collection in a realistic model. The algorithm assumes that processes synchronize by applying read, write, and compare&swap operations to shared memory. This algorithm uses no locks, busy-waiting, or barrier synchronization, it does not assume that processes can observe or modify one another's local variables or registers, and it does not use inter-process interrupts.
Conference Paper
Full-text available
Modern transactional response-time sensitive applications have run into practical limits on the size of garbage collected heaps. The heap can only grow until GC pauses exceed the response-time limits. Sustainable, scalable concurrent collection has become a feature worth paying for. Azul Systems has built a custom system (CPU, chip, board, and OS) specifically to run garbage collected virtual machines. The custom CPU includes a read barrier instruction. The read barrier enables a highly concurrent (no stop-the-world phases), parallel and compacting GC algorithm. The Pauseless algorithm is designed for uninterrupted application execution and consistent mutator throughput in every GC phase. Beyond the basic requirement of collecting faster than the allocation rate, the Pauseless collector is never in a "rush" to complete any GC phase. No phase places an undue burden on the mutators nor do phases race to complete before the mutators produce more work. Portions of the Pauseless algorithm also feature a "self-healing" behavior which limits mutator overhead and reduces mutator sensitivity to the current GC state. We present the Pauseless GC algorithm, the supporting hardware features that enable it, and data on the overhead, efficiency, and pause times when running a sustained workload.
Conference Paper
Full-text available
Modern garbage collectors rely on read and write barriers imposed on heap accesses by the mutator, to keep track of references between different regions of the garbage collected heap, and to synchronize actions of the mutator with those of the collector. It has been a long-standing untested assumption that barriers impose significant overhead on garbage-collected applications. As a result, researchers have devoted effort to development of optimization approaches for elimination of unnecessary barriers, or proposed new algorithms for garbage collection that avoid the need for barriers while retaining the capability for independent collection of heap partitions. On the basis of the results presented here, we dispel the assumption that barrier overhead should be a primary motivator for such efforts. We present a methodology for precise measurement of mutator overheads for barriers associated with mutator heap accesses. We provide a taxonomy of different styles of barrier and measure the cost of a range of popular barriers used for different garbage collectors within Jikes RVM. Our results demonstrate that barriers impose surprisingly low cost on the mutator, though results vary by architecture. We found that the overhead for a reasonable generational write barrier was less than 2% on average, and less than 6% in the worst case. Furthermore, we found that the average overhead of a read barrier consisting of just an unconditional mask of the low order bits read on the PowerPC was only 0.85%, while on the AMD it was 8.05%. With both read and write barriers, we found that second order locality effects were sometimes more important than the overhead of the barriers themselves, leading to counter-intuitive speedups in a number of situations.
Conference Paper
Full-text available
We describe and prove the correctness of a new concurrent mark-and-sweep garbage collection algorithm. This algorithm derives from the classical on-the-fly algorithm from Dijkstra et al. [9]. A distinguishing feature of our algorithm is that it supports multiprocessor environments where the registers of running processes are not readily accessible, without imposing any overhead on the elementary operations of loading a register or reading or initializing a field. Furthermore our collector never blocks running mutator processes except possibly on requests for free memory; in particular, updating a field or creating or marking or sweeping a heap object does not involve system-dependent synchronization primitives such as locks. We also provide support for process creation and deletion, and for managing an extensible heap of variable-sized objects.
Conference Paper
Full-text available
While real-time garbage collection has achieved worst-case latencies on the order of a millisecond, this technology is approaching its practical limits. For tasks requiring extremely low latency, and especially periodic tasks with frequencies above 1 KHz, Java programmers must currently resort to the NoHeapRealtimeThread construct of the Real-Time Specification for Java. This technique requires expensive run-time checks, can result in unpredictable low-level exceptions, and inhibits communication with the rest of the garbage-collected application. We present Eventrons, a programming construct that can arbitrarily preempt the garbage collector, yet guarantees safety and allows its data to be visible to the garbage-collected heap. Eventrons are a strict subset of Java, and require no run-time memory access checks. Safety is enforced using a data-sensitive analysis and simple run-time support with extremely low overhead. We have implemented Eventrons in IBM's J9 Java virtual machine, and present experimental results in which we ran Eventrons at frequencies up to 22 KHz (a 45 µs period). Across 10 million periods, 99.997% of the executions ran within 10 µs of their deadline, compared to 99.999% of the executions of the equivalent program written in C.
Book
Full-text available
Modern software places increasing reliance on dynamic memory allocation, but its direct management is notoriously error-prone. Garbage collection eliminates many of these bugs. This reference presents each of the most important algorithms in detail, often with illustrations of its characteristic features and animations of its use.
Article
Full-text available
Garbage collection algorithms for shared-memory multiprocessors typically rely on some form of global synchronization to preserve consistency. Such global synchronization may lead to problems on asynchronous architectures: if one process is halted or delayed, other, nonfaulty processes will be unable to progress. By contrast, a storage management algorithm is lock-free if (in the absence of resource exhaustion) a process that is allocating or collecting memory can be delayed at any point without forcing other processes to block. The authors present the first algorithm for lock-free garbage collection in a realistic model. The algorithm assumes that processes synchronize by applying read, write, and compare&swap operations to shared memory. This algorithm uses no locks, busy-waiting, or barrier synchronization, it does not assume that processes can observe or modify one another's local variables or registers, and it does not use inter-process interrupts.
Article
Full-text available
This paper presents the design and implementation of a "quasi real-time" garbage collector for Concurrent Caml Light, an implementation of ML with threads. This two-generation system combines a fast, asynchronous copying collector on the young generation with a nondisruptive concurrent marking collector on the old generation. This design crucially relies on the ML compile-time distinction between mutable and immutable objects. 1 Introduction This paper presents the design and implementation of a garbage collector for Concurrent Caml Light, an implementation of the ML language that provides multiple threads of control executing concurrently in a shared address space. Garbage collection, the automatic reclamation of unused memory space, is one of the most problematic components of run-time systems for multi-threaded languages. The naive "stop-the-world" approach, where all threads synchronously stop executing the user's program to perform garbage collection, is clearly inadequate,...
Article
Full-text available
Now that the use of garbage collection in languages like Java is becoming widely accepted due to the safety and software engineering benefits it provides, there is significant interest in applying garbage collection to hard real-time systems. Past approaches have generally suffered from one of two major flaws: either they were not provably real-time, or they imposed large space overheads to meet the real-time bounds. We present a mostly non-moving, dynamically defragmenting collector that overcomes both of these limitations: by avoiding copying in most cases, space requirements are kept low; and by fully incrementalizing the collector we are able to meet real-time bounds. We implemented our algorithm in the Jikes RVM and show that at real-time resolution we are able to obtain mutator utilization rates of 45% with only 1.6 to 2.5 times the actual space required by the application, a factor of 4 improvement in utilization over the best previously published results. Defragmentation causes no more than 4% of the traced data to be copied.
Article
Full-text available
Many concurrent garbage collection (GC) algorithms have been devised, but few have been implemented and evaluated, particularly for the Java programming language. Sapphire is an algorithm we have devised for concurrent copying GC. Sapphire stresses minimizing the amount of time any given application thread may need to block to support the collector. In particular, Sapphire is intended to work well in the presence of a large number of application threads, on small- to medium-scale shared memory multiprocessors. A specific problem that Sapphire addresses is not stopping all threads while thread stacks are adjusted to account for copied objects (in GC parlance, the "flip" to the new copies).
Multithreaded applications with multi-gigabyte heaps running on modern servers provide new challenges for garbage collection (GC). The challenges for "server-oriented" GC include: ensuring short pause times on a multi-gigabyte heap, while minimizing throughput penalty, good scaling on multiprocessor hardware, and keeping the number of expensive multi-cycle fence instructions required by weak ordering to a minimum. We designed and implemented a fully parallel, incremental, mostly concurrent collector, which employs several novel techniques to meet these challenges. First, it combines incremental GC to ensure short pause times with concurrent low-priority background GC threads to take advantage of processor idle time. Second, it employs a low-overhead work packet mechanism to enable full parallelism among the incremental and concurrent collecting threads and ensure load balancing. Third, it reduces memory fence instructions by using batching techniques: one fence for each block of small objects allocated, one fence for each group of objects marked, and no fence at all in the write barrier. When compared to the mature well-optimized parallel stop-the-world mark-sweep collector already in the IBM JVM, our collector prototype reduces the maximum pause time from 284 ms to 101 ms, and the average pause time from 266 ms to 66 ms while only losing 10% throughput when running the SPECjbb2000 benchmark on a 256 MB heap on a 4-way 550 MHz Pentium multiprocessor.
Article
As an example of cooperation between sequential processes with very little mutual interference despite frequent manipulations of a large shared data space, a technique is developed which allows nearly all of the activity needed for garbage detection and collection to be performed by an additional processor operating concurrently with the processor devoted to the computation proper. Exclusion and synchronization constraints have been kept as weak as could be achieved; the severe complexities engendered by doing so are illustrated.
Conference Paper
The advent of Java and similar languages on the real-time system scene necessitates the development of efficient strategies for scheduling the work of a garbage collector in a non-intrusive way. We propose a scheduling strategy, time-triggered garbage collection, based on assigning the collector a deadline for when it must complete its current cycle. We show that a time-triggered GC with fixed deadline can have equal or better real-time performance than an allocation-triggered GC, which is the standard approach to real-time GC. Also, by using a deadline-based approach, the GC scheduling and, consequently, real-time performance, is independent of a complex and error-prone GC work metric. Time-triggered GC allows a more high-level view on GC scheduling; we look at the GC cycle level rather than at the individual work increments. This makes it possible to schedule GC as any other thread. It is also suitable for making the GC auto-tuning by dynamically adjusting its deadline as necessary. We have implemented our approach in a run-time system for Java and present experimental data to support the practical feasibility of the approach.
Conference Paper
This paper presents a new storage representation for cons cells (and all other LISP heap data structures) which allows more time efficient LISP with real-time garbage collection on stock hardware. “Stock hardware” refers to common modern architectures for Von Neumann uni-processors (e.g. MC68000, IBM370, VAX, NS32032, etc.). Previous real-time garbage collection schemes have either explicitly required specially tailored hardware in order to avoid multiple order of magnitude slowdowns, or have been merely extremely unattractive for implementation on stock hardware due to the high overhead assumed by all basic LISP primitives in checking the garbage collection status of pointers handed to them as arguments. We first show that copying compacting real-time garbage collection algorithms do not always need to protect user programs from seeing uncopied data, so long as a slightly more complicated collection termination condition is used. This opens the way to adding an indirection pointer to all LISP heap objects, making it unnecessary to check the garbage collection status of pointers, and so the overhead costs for most primitives reduce to the addition of a single instruction. Impure primitives, i.e. those which write into existing structures (e.g. RPLACD), have their overhead reduced by a factor of almost 2. Code density is correspondingly decreased. The cost of this scheme is increased storage size for all LISP heap objects: one extra pointer per object, although for some objects (e.g. floating point numbers) this expense is already necessary in many other non real-time garbage collection algorithms.
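The indirection pointer mentioned in this abstract can be sketched as follows. This is only an illustration under assumptions: the names `Cell`, `forward`, `car`, `rplacd`, and `copy_cell` are hypothetical, and the collector's scheduling and termination condition are left out entirely.

```python
class Cell:
    """Hypothetical cons cell with one extra indirection (forwarding) field."""

    def __init__(self, car, cdr):
        self.car = car
        self.cdr = cdr
        self.forward = self  # points to itself until the collector copies the cell


def car(cell):
    # Every access follows the indirection unconditionally: one extra load,
    # and no test of whether the cell has already been copied.
    return cell.forward.car


def cdr(cell):
    return cell.forward.cdr


def rplacd(cell, value):
    # Writes also go through the indirection, so they land on the current copy.
    cell.forward.cdr = value


def copy_cell(cell):
    """Collector step (sketch): copy the cell and redirect the old one."""
    new = Cell(cell.car, cell.cdr)
    cell.forward = new
    return new
```

Because the old cell keeps forwarding to its copy, a mutator holding a stale reference still reads and writes the up-to-date data; the price is the extra pointer per object that the abstract mentions.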
Conference Paper
This paper reports our experiences with a mostly-concurrent incremental garbage collector, implemented in the context of a high performance virtual machine for the Java™ programming language. The garbage collector is based on the “mostly parallel” collection algorithm of Boehm et al. and can be used as the old generation of a generational memory system. It overloads efficient write-barrier code already generated to support generational garbage collection to also identify objects that were modified during concurrent marking. These objects must be rescanned to ensure that the concurrent marking phase marks all live objects. This algorithm minimises maximum garbage collection pause times, while having only a small impact on the average garbage collection pause time and overall execution time. We support our claims with experimental results, for both a synthetic benchmark and real programs.
Conference Paper
Read barriers synchronize compacting garbage collection and application processing in a simple yet elegant way. Unfortunately, read barrier checks are expensive to implement in software, and even with hardware support, the clustering of read barrier faults irregularly impairs application progress to an unacceptable extent. For this reason, read barriers are often considered unsuitable for hard real-time systems. In this paper, we introduce a novel hardware read barrier design for an object-based RISC architecture. The design integrates read barrier checking and, for the first time, read barrier fault handling directly into a processor pipeline. Our system handles read barrier faults within 20 clock cycles on average. Despite fault clustering, all application programs we have run on our prototype show minimum mutator utilizations of more than 55% within arbitrary time intervals of only 1 ms. Thanks to this property, our system facilitates worst case estimates for tasks with very short response times, thereby paving the way for garbage collection in embedded systems with extremely fine-grained real-time requirements.
Conference Paper
The widely used Mark-and-Sweep garbage collector has a drawback in that it does not move objects during collection. As a result, large long-running realistic applications, such as Web application servers, frequently face the fragmentation problem. To eliminate fragmentation, a heap compaction is run periodically. However, compaction typically imposes very long undesirable pauses in the application. While efficient concurrent collectors are ubiquitous in production runtime systems (such as JVMs), an efficient non-intrusive compactor is still missing. In this paper we present the Compressor, a novel compaction algorithm that is concurrent, parallel, and incremental. The Compressor compacts the entire heap to a single condensed area, while preserving the objects' order, but reduces pause times significantly, thereby allowing acceptable runs on large heaps. Furthermore, the Compressor is the first compactor that requires only a single heap pass. As such, to the best of our knowledge, it is the most efficient compactor known today, even when run in a parallel Stop-the-World manner (i.e., when the program threads are halted). The Compressor was implemented on a Jikes Research RVM and we provide measurements demonstrating its qualities.
Conference Paper
We have implemented the first copying garbage collector that permits continuous unimpeded mutator access to the original objects during copying. The garbage collector incrementally replicates all accessible objects and uses a mutation log to bring the replicas up-to-date with changes made by the mutator. An experimental implementation demonstrates that the costs of using our algorithm are small and that bounded pause times of 50 milliseconds can be readily achieved.
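The replication scheme in this abstract, where the mutator keeps using the originals while a mutation log brings the replicas up to date, can be sketched roughly as below. The names `Obj`, `mutation_log`, `mutator_write`, `replicate`, and `apply_log` are hypothetical, and the sketch ignores pointer fields (which a real collector would translate to point at replicas) as well as any incremental scheduling.

```python
class Obj:
    """Hypothetical heap object: a list of fields plus a link to its replica."""

    def __init__(self, fields):
        self.fields = list(fields)
        self.replica = None  # set once the collector has copied this object


mutation_log = []  # entries are (object, field index) pairs


def mutator_write(obj, index, value):
    # The mutator always writes to the original and records which field changed.
    obj.fields[index] = value
    mutation_log.append((obj, index))


def replicate(obj):
    """Collector step: copy the object once, leaving the original in use."""
    if obj.replica is None:
        obj.replica = Obj(obj.fields)
    return obj.replica


def apply_log():
    """Collector step: propagate logged writes to replicas that already exist."""
    while mutation_log:
        obj, index = mutation_log.pop()
        if obj.replica is not None:
            # Re-read the current value from the original, so the order in
            # which log entries are processed does not matter.
            obj.replica.fields[index] = obj.fields[index]
```

In this sketch the replicas become consistent only after `apply_log` has drained the log, which corresponds to the point at which a collector of this kind could switch the mutator over to the replicas.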
Article
A real-time list processing system is one in which the time required by each elementary list operation (CONS, CAR, CDR, RPLACA, RPLACD, EQ, and ATOM in LISP) is bounded by a (small) constant. Classical list processing systems such as LISP do not have this property because a call to CONS may invoke the garbage collector which requires time proportional to the number of accessible cells to finish. The space requirement of a classical LISP system with N accessible cells under equilibrium conditions is (1.5+μ)N or (1+μ)N, depending upon whether a stack is required for the garbage collector, where μ>0 is typically less than 2. A list processing system is presented which: 1) is real-time, i.e. T(CONS) is bounded by a constant independent of the number of cells in use; 2) requires space (2+2μ)N, i.e. not more than twice that of a classical system; 3) runs on a serial computer without a time-sharing clock; 4) handles directed cycles in the data structures; 5) is fast: the average time for each operation is about the same as with normal garbage collection; 6) compacts: minimizes the working set; 7) keeps the free pool in one contiguous block: objects of nonuniform size pose no problem; 8) uses one phase incremental collection: no separate mark, sweep, relocate phases; 9) requires no garbage collector stack; 10) requires no "mark bits", per se; 11) is simple: suitable for microcoded implementation. Extensions of the system to handle a user program stack, compact list representation ("CDR-coding"), arrays of non-uniform size, and hash linking are discussed. CDR-coding is shown to reduce memory requirements for N LISP cells to ≈(1+μ)N. Our system is also compared with another approach to the real-time storage management problem, reference counting, and reference counting is shown to be neither competitive with our system when speed of allocation is critical, nor compatible, in the sense that a system with both forms of garbage collection is worse than our pure one.
Article
Multithreaded applications with multigigabyte heaps running on modern servers provide new challenges for garbage collection (GC). The challenges for “server-oriented” GC include: ensuring short pause times on a multigigabyte heap while minimizing throughput penalty, good scaling on multiprocessor hardware, and keeping the number of expensive multicycle fence instructions required by weak ordering to a minimum. We designed and implemented a collector facing these demands building on the mostly concurrent garbage collector proposed by Boehm et al. [1991]. Our collector incorporates new ideas into the original collector. We make it parallel and incremental; we employ concurrent low-priority background GC threads to take advantage of processor idle time; we propose novel algorithmic improvements to the basic mostly concurrent algorithm improving its efficiency and shortening its pause times; and finally, we use advanced techniques, such as a low-overhead work packet mechanism to enable full parallelism among the incremental and concurrent collecting threads and ensure load balancing. We compared the new collector to the mature, well-optimized, parallel, stop-the-world mark-sweep collector already in the IBM JVM. When allowed to run aggressively, using 72% of the CPU utilization during a short concurrent phase, our collector prototype reduces the maximum pause time from 161 ms to 46 ms while only losing 11.5% throughput when running the SPECjbb2000 benchmark on a 600-MB heap on an 8-way 1.1-GHz PowerPC multiprocessor. When the collector is limited to a nonintrusive operation using only 29% of the CPU utilization, the maximum pause time obtained is 79 ms and the loss in throughput is 15.4%.
Article
We propose a heap compaction algorithm appropriate for modern computing environments. Our algorithm is targeted at SMP platforms. It demonstrates high scalability when running in parallel but is also extremely efficient when running single-threaded on a uniprocessor. Instead of using the standard forwarding pointer mechanism for updating pointers to moved objects, the algorithm saves information for a pack of objects. It then does a small computation to process this information and determine each object's new location. In addition, using a smart parallel moving strategy, the algorithm achieves (almost) perfect compaction in the lower addresses of the heap, whereas previous algorithms achieved parallelism by compacting within several predetermined segments. Next, we investigate a method that trades compaction quality for a further reduction in time and space overhead. Finally, we propose a modern version of the two-finger compaction algorithm. This algorithm fails, thus, re-validating traditional wisdom asserting that retaining the order of live objects significantly improves the quality of the compaction. The parallel compaction algorithm was implemented on the IBM production Java Virtual Machine. We provide measurements demonstrating high efficiency and scalability. Subsequently, this algorithm has been incorporated into the IBM production JVM.
Article
This paper presents the first multiprocessor garbage collection algorithm with provable bounds on time and space. The algorithm is a real-time shared-memory copying collector. We prove that the algorithm requires at most 2(R(1 + 2/k) + N + 5PD) memory locations, where P is the number of processors, R is the maximum reachable space during a computation (number of locations accessible from the root set), N is the maximum number of reachable objects, D is the maximum depth of any data object, and k is a parameter specifying how many locations are copied each time a location is allocated. Furthermore we show that client threads are never stopped for more than time proportional to k nonblocking machine instructions. The bounds are guaranteed even with arbitrary length arrays. The collector only requires write-barriers (reads are unaffected by the collector), makes few assumptions about the threads that are generating the garbage, and allows them to run mostly asynchronously.
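As a rough illustration of the space bound 2(R(1 + 2/k) + N + 5PD) quoted in this abstract, the Python sketch below simply evaluates the expression. The function name `space_bound` and the sample workload values are assumptions made for illustration only; they do not come from the paper.

```python
def space_bound(R, N, P, D, k):
    """Evaluate 2 * (R * (1 + 2/k) + N + 5 * P * D), in memory locations.

    R: maximum reachable space, N: maximum number of reachable objects,
    P: number of processors, D: maximum object depth,
    k: locations copied per location allocated.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    return 2 * (R * (1 + 2 / k) + N + 5 * P * D)


# Hypothetical workload, chosen only to show how k changes the bound.
small_k = space_bound(R=1_000_000, N=100_000, P=8, D=10, k=2)
large_k = space_bound(R=1_000_000, N=100_000, P=8, D=10, k=8)
print(small_k, large_k)
```

Under these assumed values, raising k shrinks the space bound, while the abstract notes that pauses grow in proportion to k, so k acts as a space/latency trade-off knob.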
Conference Paper
Work on non-blocking data structures has proposed extending processor designs with a compare-and-swap primitive, CAS2, which acts on two arbitrary memory locations. Experience suggested that current operations, typically single-word compare-and-swap (CAS1), are not expressive enough to be used alone in an efficient manner. In this paper we build CAS2 from CAS1 and, in fact, build an arbitrary multi-word compare-and-swap (CASN). Our design requires only the primitives available on contemporary systems, reserves a small and constant amount of space in each word updated (either 0 or 2 bits) and permits nonoverlapping updates to occur concurrently. This provides compelling evidence that current primitives are not only universal in the theoretical sense introduced by Herlihy, but are also universal in their use as foundations for practical algorithms. This provides a straightforward mechanism for deploying many of the interesting non-blocking data structures presented in the literature that have previously required CAS2.
Article
Instrumenting code to collect profiling information can cause substantial execution overhead. This overhead makes instrumentation difficult to perform at runtime, often preventing many known offline feedback-directed optimizations from being used in online systems. This paper presents a general framework for performing instrumentation sampling to reduce the overhead of previously expensive instrumentation. The framework is simple and effective, using code-duplication and counter-based sampling to allow switching between instrumented and non-instrumented code. Our framework does not rely on any hardware or operating system support, yet provides a high frequency sample rate that is tunable, allowing the tradeoff between overhead and accuracy to be adjusted easily at runtime. Experimental results are presented to validate that our technique can collect accurate profiles (93-98% overlap with a perfect profile) with low overhead (averaging 6% total overhead with a naive implementation). A Jalapeño-specific optimization is also presented that reduces overhead further, resulting in an average total overhead of 3%.
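The idea of code duplication with counter-based sampling can be sketched in a few lines of Python. This is a simplified illustration, not the framework described in the abstract: it switches versions at whole-call granularity, and the class name `SampledFunction` and the example functions are hypothetical.

```python
class SampledFunction:
    """Dispatch between a fast version and an instrumented duplicate.

    A countdown is decremented on every call; when it reaches zero the
    instrumented copy runs once and the countdown is reset.
    """

    def __init__(self, fast, instrumented, interval):
        if interval < 1:
            raise ValueError("interval must be at least 1")
        self.fast = fast
        self.instrumented = instrumented
        self.interval = interval  # tunable at runtime: overhead vs. accuracy
        self.countdown = interval

    def __call__(self, *args):
        self.countdown -= 1
        if self.countdown == 0:
            self.countdown = self.interval
            return self.instrumented(*args)
        return self.fast(*args)


samples = []  # profile data gathered by the instrumented copy


def square(x):
    return x * x


def square_instrumented(x):
    samples.append(x)  # the "expensive" profiling work
    return x * x


sampled_square = SampledFunction(square, square_instrumented, interval=5)
results = [sampled_square(x) for x in range(10)]
```

With an interval of 5, only every fifth call pays the instrumentation cost, while all calls still return the same results as the uninstrumented code.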
Article
We describe a parallel, real-time garbage collector and present experimental results that demonstrate good scalability and good real-time bounds. The collector is designed for shared-memory multiprocessors and is based on an earlier collector algorithm [2], which provided fixed bounds on the time any thread must pause for collection. However, since our earlier algorithm was designed for simple analysis, it had some impractical features. This paper presents the extensions necessary for a practical implementation: reducing excessive interleaving, handling stacks and global variables, reducing double allocation, and special treatment of large and small objects. An implementation based on the modified algorithm is evaluated on a set of 15 SML benchmarks on a Sun Enterprise 10000, a 64-way UltraSparc-II multiprocessor. To the best of our knowledge, this is the first implementation of a parallel, real-time garbage collector. The average collector speedup is 7.5 at 8 processors and 17.7 at 32 processors. Maximum pause times range from 3 ms to 5 ms. In contrast, a non-incremental collector (whether generational or not) has maximum pause times from 10 ms to 650 ms. Compared to a non-parallel, stop-copy collector, parallelism has a 39% overhead, while real-time behavior adds an additional 12% overhead. Since the collector takes about 15% of total execution time, these features have an overall time costs of 6% and 2%.
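The overall cost figures at the end of this abstract appear to follow from scaling each collector overhead by the fraction of execution time spent in collection. The short Python check below works through that arithmetic; reading the 6% and 2% as a simple product is an assumption about how the abstract derives them, and the function name is illustrative.

```python
def overall_cost(gc_fraction, collector_overhead):
    """Overhead relative to total run time, assuming it scales only GC time."""
    return gc_fraction * collector_overhead


gc_fraction = 0.15  # collector takes about 15% of total execution time

parallel_cost = overall_cost(gc_fraction, 0.39)  # parallelism: 39% overhead
realtime_cost = overall_cost(gc_fraction, 0.12)  # real-time: 12% overhead

print(f"parallelism: {parallel_cost:.1%}, real-time: {realtime_cost:.1%}")
```

Rounded to whole percentages, the two products match the 6% and 2% quoted in the abstract.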
Article
Typical shared-memory multiprocessor OS kernels use interlocking, implemented as spinlocks or waiting semaphores. We have implemented a complete multiprocessor OS kernel (including threads, virtual memory, and I/O including a window system and a file system) using only lock-free synchronization methods based on Compare-and-Swap. Lock-free synchronization avoids many serious problems caused by locks: considerable overhead, concurrency bottlenecks, deadlocks, and priority inversion in real-time scheduling. Measured numbers show the low overhead of our implementation, competitive with user-level thread management systems.
Article
Reference-counting is traditionally considered unsuitable for multiprocessor systems. According to conventional wisdom, the update of reference slots and reference-counts requires atomic or synchronized operations. In this work we demonstrate this is not the case by presenting a novel reference-counting algorithm suitable for a multiprocessor system that does not require any synchronized operation in its write barrier (not even a compare-and-swap type of synchronization). A second novelty of this algorithm is that it allows eliminating a large fraction of the reference-count updates, thus, drastically reducing the reference-counting traditional overhead. This article includes a full proof of the algorithm showing that it is safe (does not reclaim live objects) and live (eventually reclaims all unreachable objects). We have implemented our algorithm on Sun Microsystems' Java Virtual Machine (JVM) 1.2.2 and ran it on a four-way IBM Netfinity 8500R server with 550-MHz Intel Pentium III Xeon and 2 GB of physical memory. Our results show that the algorithm has an extremely low latency and throughput that is comparable to the stop-the-world mark and sweep algorithm used in the original JVM.
Article
The power of dynamic memory management can be used to produce more flexible control applications without compromising the robustness of the applications. It is demonstrated how automatic memory management, or garbage collection (GC), can be used in a system that has to comply with hard real-time demands while still preserving the predictability of the system. A suitable garbage collection algorithm is described together with a strategy for scheduling the work of the algorithm. A method for proving that a given set of processes will always meet their deadlines without interference from the garbage collector is given. 1 Introduction Traditionally, embedded real-time systems have been implemented using static techniques to ensure predictability. Static task scheduling has been used together with static memory management. The demand for more flexible applications together with improved implementation and analysis techniques have brought an increased use of dynamic process scheduling. Howe...
Article
An on-the-fly garbage collector does not stop the program threads to perform the collection. Instead, the collector executes in a separate thread (or process) in parallel to the program. On-the-fly collectors are useful for multi-threaded applications running on multiprocessor servers, where it is important to fully utilize all processors and provide even response time, especially for systems for which stopping the threads is a costly operation. In this work, we report on the incorporation of generations into an on-the-fly garbage collector. The incorporation is non-trivial since an on-the-fly collector avoids explicit synchronization with the program threads. To the best of our knowledge, such an incorporation has not been tried before. We have implemented the collector for a prototype Java Virtual Machine on AIX, and measured its performance on a 4-way multiprocessor. As for other generational collectors, an on-the-fly generational collector has the potential for reducing the overall running time and working set of an application by concentrating collection efforts on the young objects. However, in contrast to other generational collectors, on-the-fly collectors do not move the objects; thus, there is no segregation between the old and the young objects. Furthermore, on-the-fly collectors do not stop the threads, so there is no extra benefit for the short pauses obtained by generational collection. Nevertheless, comparing our on-the-fly collector with and without generations, it turns out that the generational collector performs better for most applications. The best reduction in overall running time for the benchmarks we measured was 25%. However, there were some benchmarks for which it had no effect and one for which the overall running time increased by 4%.
Article
This paper presents a solution to the third problem of classical list processing techniques and removes that roadblock to their more general use. Using the method given here, a computer could have list processing primitives built in as machine instructions and the programmer would still be assured that each instruction would finish in a reasonable amount of time. For example, the interrupt handler for a keyboard could store its characters on the same kinds of lists---and in the same storage area---as the lists of the main program. Since there would be no long wait for a garbage collection, response time could be guaranteed to be small. Even an operating system could use these primitives to manipulate its burgeoning databases. Business database designers no longer need shy away from pointer-based systems for fear that their systems will be impacted by a week-long garbage collection! As memory is becoming cheaper,
Katherine Barabash, Ori Ben-Yitzhak, Irit Goft, Elliot K. Kolodner, Victor Leikehman, Yoav Ossia, Avi Owshanko, and Erez Petrank. A parallel, incremental, mostly concurrent garbage collector for servers. ACM TOPLAS, 27(6):1097–1146, November 2005.
Tamar Domani, Elliot K. Kolodner, Ethan Lewis, Elliot E. Salant, Katherine Barabash, Itai Lahan, Erez Petrank, Igor Yanover, and Yossi Levanoni. Implementing an on-the-fly garbage collector for Java. In ISMM 2000.