Article

Garbage collection can be faster than stack allocation

Authors: Andrew W. Appel

Abstract

An old and simple algorithm for garbage collection gives very good results when the physical memory is much larger than the number of reachable cells. In fact, the overhead associated with allocating and collecting cells from the heap can be reduced to less than one instruction per cell by increasing the size of physical memory. Special hardware, intricate garbage-collection algorithms, and fancy compiler analysis become unnecessary.
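The argument can be made concrete with a textbook cost model for a two-semispace copying collector; the symbols H (heap size in words), R (reachable words), and c (per-word copy cost) are ours, not the paper's:

```latex
\text{amortized GC cost per allocated word} \;=\; \frac{cR}{H/2 - R} \;\longrightarrow\; 0 \quad \text{as } H \to \infty
```

Each collection costs cR and buys H/2 − R words of fresh allocation space, so growing H drives the per-word overhead below the one instruction a stack pop must spend; one excerpt below quotes a heap roughly seven times the live data as the break-even point.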


... Tofte had for many years been fascinated by the beauty of the Algol stack discipline and somewhat unhappy about the explanations of why, in principle, something similar could not be done for the call-by-value lambda calculus (these explanations typically had to do with "escaping functions".) Although, in a theoretical sense, heap allocation could be more efficient than stack allocation (see, e.g., [3]), in practice, the stack discipline could provide better cache locality and less fragmentation than heap allocation (see, e.g., [42]). ...
... The report was released in April 1997 as part of the ML Kit version 2. We also held a summer school on programming with regions. The summer school covered lectures on the theory behind region-based memory management and practical programming exercises. Concerning the latter, it was interesting to see how students became very excited about getting their programs to run in as little memory as they could possibly manage; this showed that the technology really did give the programmer a handle on understanding memory usage. ...
... An example runtime stack containing three infinite regions (r1, r2, r4) and a finite region (r3). An infinite region was represented by a region descriptor on the stack and a linked list of fixed-size region pages. ...
Article
Full-text available
We report on our experience with designing, implementing, proving correct, and evaluating a region-based memory management system.
... Furthermore, estimates are made of the most promising methods. Appel [7] is most concerned with the von Neumann computer architecture as regards pushing and popping stack frames (cf. Sec. IV). ...
... • maximisation of R, if A_H does not hold; • assigning R := H, if A_H holds. Apart from the restrictions in [7], there is one more related to addressing: "XOR"-linked data structures (see [11], [12]) do not allow efficiently locating garbage as described in [6]. It is mainly because heap regions (e.g. ...
... The method "transforming into stack" is a prominent technique [45], [9], [46], [6], [7]. The motivation behind this is that pointers are too often too hard to analyse, and garbage collection requires resources that otherwise would not be needed (cf. ...
Preprint
Full-text available
The article provides an overview of the existing methods of dynamic memory verification; a comparative analysis is carried out; the applicability for solving problems of control, monitoring, and verification of dynamic memory is evaluated. The article is divided into eight sections. The first section introduces formal verification, followed by a section that discusses dynamic memory management problems. The third section discusses Hoare's calculus; the fourth covers transformations of the heap into the stack. The fifth and sixth sections introduce the concepts of dynamic memory shape analysis and pointer rotation. The seventh is on separation logic. The last section discusses possible areas of further research, particularly recognition, at the record level, of various instances of objects; automation of proofs; "hot" code, that is, software code that updates itself while the program runs; and expanding intuitiveness, for instance, in proof explanations.
... Programs can often adapt behavior in response to available memory. For instance, the time that a garbage collector spends in the collector can be reduced by using a larger heap [1], and computed results can be cached and reused, reducing latency and increasing throughput [8]. Adaptations are not required for the correct execution of a program; they improve the program's utility, assuming, that is, that the extra memory could not have been better employed elsewhere. ...
... Many garbage collectors can reduce the time they spend collecting by using a larger heap. This is because the time it takes to perform a collection is proportional to the size of the live objects, not the heap size [1]. ...
... This strategy does not take advantage of additional memory to delay collections. Because collection time is proportional to the number of live objects and not heap size [1], this can be significant [14]. Our benchmark allocates some data structures, links them, overwrites the root pointer (thereby making the data structures garbage), and then loops. ...
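Both excerpts follow from the shape of a copying collector's allocation path; a minimal sketch, with illustrative names, is:

```c
#include <stddef.h>

/* Bump-pointer allocation over a contiguous nursery: the common case
   is a compare and an add, independent of heap size.                 */
static char *alloc_ptr, *alloc_limit;

void collect(void);              /* copies live objects only: O(live) */

void *gc_alloc(size_t bytes) {
    if (alloc_ptr + bytes > alloc_limit)
        collect();      /* rare; resets alloc_ptr and alloc_limit     */
    void *obj = alloc_ptr;
    alloc_ptr += bytes;
    return obj;
}

/* Doubling the nursery halves how often collect() runs, while each
   run still costs O(live), not O(heap): hence the inverse relation
   between heap size and collection time noted in the excerpts.       */
```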
Article
Full-text available
Many programs could improve their performance by adapting their memory use. To ensure that an adaptation increases utility, a program needs to know not only how much memory is available but how much it should use at any given time. This is difficult on commodity operating systems, which provide little information to inform these decisions. As evidence of the importance of adaptations to program performance, many programs currently adapt using ad hoc heuristics to control their behavior. Supporting adaptive applications has become pressing: the range of hardware that applications are expected to run on—from smart phones and netbooks to high-end desktops and servers—is increasing, as is the dynamic nature of workloads stemming from server consolidation. The practical result is that the ad hoc heuristics are less effective as assumptions about the environment are less reliable, and as such memory is more frequently under- or over-utilized. Failing to adapt limits the degree of possible consolidation. We contend that in order for programs to make the best of available resources, research needs to be conducted into how the operating system can better support aggressive adaptations.
... and the following time constants are required: The values estimated for the time constants were obtained on an HP9000/375 (details are given in [5]). They agree in their magnitudes with those given in [1], which reports a similar, but simpler, analysis comparing copying collection with stack allocation. (These values must be accepted in a qualified way, since there may be cache effects involved.) ...
... We can compare the results above with the conclusion of [1]. Appel analyses the cost per collected cell of gc for a copying collector and compares this with the cost of popping a cell from a stack. ...
Conference Paper
Full-text available
Optimization by compile time garbage collection is one possible weapon in the functional language implementer's armoury for combatting the excessive memory allocation usually exhibited by functional programs. It is an interesting idea, but the practical question of whether it yields benefits in practice has still not been answered convincingly one way or the other. In this short paper we present a mathematical model of the performance of straightforward versions of mark-scan and copying garbage collectors with programs optimized for explicit deallocation. A mark-scan heap manager has a free list, whereas a copying heap manager does not — herein lies the dilemma, since a copying garbage collector is usually considered to be faster than a mark-scan, but it cannot take advantage of this important optimization. For tractability we consider only heaps with fixed cells. The results reported show that the garbage collection scheme of choice depends quite strongly on the heap occupancy ratio: the proportion of the total heap occupied by accessible data structures averaged over the execution of the program. We do not know what typical heap occupancy ratios are, and so are unable to make specific recommendations, but the results may be of use in tailoring applications and heap management schemes, or in controlling schemes where the heap size varies dynamically. An important result arising from the work reported here is that when optimizing for explicit deallocation, a very large proportion of cell releases must be optimized before very much performance benefit is obtained.
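A hedged reconstruction of the kind of model the abstract describes, in our own symbols (ρ for the heap occupancy ratio, m for the per-cell mark/copy cost, s for the per-cell sweep cost; the authors' model is more detailed): a semispace copying collector touches only live cells and recovers H/2 − ρH of a heap of H cells per cycle, while mark-scan marks the live cells, sweeps the whole heap, and recovers H − ρH cells:

```latex
\text{cost}_{\text{copy}} \approx \frac{m\rho}{1/2 - \rho}, \qquad
\text{cost}_{\text{mark-scan}} \approx \frac{m\rho + s}{1 - \rho}
\quad \text{per reclaimed cell}
```

Copying wins at low occupancy; as ρ grows, mark-scan (whose free list is exactly what explicit-deallocation optimizations can exploit) catches up, which is why the scheme of choice depends so strongly on ρ.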
... The primary benefit of explicit memory management is performance. While Appel [1987] argues that given enough memory, managed memory can be faster than explicit memory management, Wilson et al. [1992] argue this conclusion is unlikely to hold for modern languages with deep, complex memory structures. ...
... Garbage collection performance can improve up to ten-fold with increases in the heap size [Yang et al., 2006]. The frequency of collection is inversely related to the size of the heap [Appel, 1987]. ...
Article
Main memory remains a scarce computing resource. Even though main memory is becoming more abundant, software applications are inexorably engineered to consume as much memory as is available. For example, expert systems, scientific computing, data mining, and embedded systems commonly suffer from the lack of main memory availability. This thesis introduces JDiet, an innovative memory management system for Java applications. The goal of JDiet is to provide the developer with a highly configurable framework to reduce the memory footprint of a memory-constrained system, enabling it to operate on much larger working sets. Inspired by buffer management techniques common in modern database management systems, JDiet frees main memory by evicting non-essential data to a disk-based store. A buffer retains a fixed amount of managed objects in main memory. As non-resident objects are accessed, they are swapped from the store to the buffer using an extensible replacement policy. While the Java virtual machine naïvely delegates virtual memory management to the operating system, JDiet empowers the system designer to select both the managed data and replacement policy. Guided by compile-time configuration, JDiet performs aspect-oriented bytecode engineering, requiring no explicit coupling to the source or compiled code. The results of an experimental evaluation of the effectiveness of JDiet are reported. A JDiet-enabled XML DOM parser is capable of parsing and processing over 200% larger input documents by sacrificing less than an order of magnitude in performance.
... Runtime systems and, especially, garbage collection (GC) have also been an active area of research in the SML community [Appel 1987, 1989b, 1990; Appel and Gonçalves 1993; Reppy 1993]. A particularly interesting line of research explored using static type information to eliminate runtime tags [Appel 1989a; Goldberg 1991; Goldberg and Gloger 1992; Tolmach 1994]. ...
Conference Paper
Full-text available
The ML family of strict functional languages, which includes F#, OCaml, and Standard ML, evolved from the Meta Language of the LCF theorem proving system developed by Robin Milner and his research group at the University of Edinburgh in the 1970s. This paper focuses on the history of Standard ML, which plays a central rôle in this family of languages, as it was the first to include the complete set of features that we now associate with the name “ML” (i.e., polymorphic type inference, datatypes with pattern matching, modules, exceptions, and mutable state). Standard ML, and the ML family of languages, have had enormous influence on the world of programming language design and theory. ML is the foremost exemplar of a functional programming language with strict evaluation (call-by-value) and static typing. The use of parametric polymorphism in its type system, together with the automatic inference of such types, has influenced a wide variety of modern languages (where polymorphism is often referred to as generics). It has popularized the idea of datatypes with associated case analysis by pattern matching. The module system of Standard ML extends the notion of type-level parameterization to large-scale programming with the notion of parametric modules, or functors. Standard ML also set a precedent by being a language whose design included a formal definition with an associated metatheory of mathematical proofs (such as soundness of the type system). A formal definition was one of the explicit goals from the beginning of the project. While some previous languages had rigorous definitions, these definitions were not integral to the design process, and the formal part was limited to the language syntax and possibly dynamic semantics or static semantics, but not both. The paper covers the early history of ML, the subsequent efforts to define a standard ML language, and the development of its major features and its formal definition. We also review the impact that the language had on programming-language research.
... Before going into more detail, let us first ask whether we cannot simply turn all dynamically allocated variables into stack-allocated ones, as was proposed, for instance, by [10]. Often this will indeed work; however, sometimes it is not a good idea after all, because of performance issues [11], for instance. More often, the nature of the problem forbids general static assumptions on stack bounds. ...
Preprint
Full-text available
In this paper, we review existing points-to Separation Logics for dynamic memory reasoning and we find that different usages of heap separation tend to be an obstacle. Hence, two total and strict spatial heap operations are proposed upon heap graphs, for conjunction and disjunction -- similar to logical conjuncts. Heap conjunction implies that there exists a free heap vertex to connect to or an explicit destination vertex is provided. Essentially, Burstall's properties do not change. By heap we refer to an arbitrary simple directed graph, which is finite and may contain composite vertices representing class objects. Arbitrary heap memory access is restricted, as well as type punning, late class binding and further restrictions. Properties of the new logic are investigated, and as a result group properties are shown. Both expecting and superficial heaps are specifiable. Equivalence transformations may make denotated heaps inconsistent, although those may be detected and patched by the two generic linear canonization steps presented. The properties help to motivate a later full introduction of a set of equivalences over heap for future work. Partial heaps are considered as a useful specification technique that help to reduce incompleteness issues with specifications. Finally, the logic proposed may be considered for extension for the Object Constraint Language.
... The time to complete this process is short when many objects die and few objects remain alive. Generational GC expects that the reference-following process in the young generation finishes quickly [11]. The GC can be tuned by adjusting the frequencies of checks for the young and old generations. ...
Article
Android Runtime (ART), which is the standard application runtime environment, has a garbage collection (GC) function. ART has an implementation of generational GC. The GC clusters objects into two groups, the young and old generations. An object in the young generation is promoted into the old generation after surviving several GC executions. In this paper, we propose to adjust the promotion condition based on an object feature, namely the size of each object. We then evaluate our proposed method and demonstrate that the feature-based method can reduce the memory consumption of applications with a smaller performance decline than the method without feature consideration.
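A toy rendering of the proposal (entirely our sketch; ART's real promotion logic is more involved): make the number of minor collections an object must survive before promotion a function of its size.

```c
#include <stdbool.h>
#include <stddef.h>

struct obj { size_t size; unsigned survived; };

/* Hypothetical size-dependent promotion threshold: small objects are
   expected to die young, so they must survive more minor collections
   before being promoted to the old generation.                       */
static unsigned promote_after(size_t size) {
    return size < 256 ? 4 : 2;           /* cut-offs are illustrative */
}

bool should_promote(const struct obj *o) {
    return o->survived >= promote_after(o->size);
}
```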
... The time to complete such processes is short in an environment where many objects die and only a small number of objects remain alive. Generational GC expects that the link-following process in the young generation can finish quickly [8]. It tries to execute GC effectively by increasing and decreasing the frequencies of checks for the young and old generations. ...
Conference Paper
Android operating system has become one of the most popular smartphone platforms. A large number of applications are developed for Android Runtime (ART). Garbage Collection (GC) is an essential function of ART for Java-based applications. GC suspends all the application threads, an event called Stop The World (STW), and sacrifices the application's performance. To reduce pauses for checking the heap, generational GC was proposed. It separates objects into two groups: the objects which will probably die by the next GC, and the others. It thereby tries to reduce the search space and the time to check. With this GC, accurate estimation of object lifetime is important. In this paper, we explore the trend of object lifetimes in modern Android applications in order to improve the lifetime estimation of the ART GC. For this purpose, we have constructed an application behavior monitoring system, called ART monitor. It is implemented by modifying Android and enables observation of the creation and destruction of objects in ART. We have investigated the relation among object lifetimes, object sizes, and object names in recent Android applications which are popular in the Google Play Store. We then found that small objects probably die with short lifetimes and that objects of some classes always survive a long time. Based on these findings, we discuss a method for improving ART GC performance.
... The latter does not cause copying. Copying GC [6] is a moving GC. Mark-sweep (MS) and reference-counting GC [7] are non-moving; CMS in ART, described later, is also a non-moving GC. ...
Conference Paper
Android operating system has become one of the most popular smartphone platforms. A large number of applications are developed for Android Runtime (ART). Garbage Collection (GC) is an essential function of ART for Java-based applications. GC suspends all the application threads, an event called Stop The World (STW), and sacrifices the application's performance. To reduce pauses for checking the heap, generational GC was proposed. It separates objects into two groups: the objects which will probably die by the next GC, and the others. It thereby tries to reduce the search space and the time to check. With this GC, accurate estimation of object lifetime is important. In this paper, we investigate the object-lifetime trend in modern Java-based Android applications, and discuss a method for improving GC based on this trend. First, we introduce our monitoring system for object creations and collections in ART. Second, we present monitoring results for recent popular Android applications and reveal the relations between lifetimes and object sizes.
... As long as there is sufficient space, allocation takes constant time. When a garbage collection is required, the cost may be attributed to the allocations of the live data in the nursery, so that in an amortized sense garbage collection is cost-free [3]. It is easy to see that no object is evicted to main memory using this implementation that would not have been evicted in the abstract sense. ...
Article
The widely studied I/O and ideal-cache models were developed to account for the large difference in costs to access memory at different levels of the memory hierarchy. Both models are based on a two-level memory hierarchy with a fixed-size fast memory (cache) of size M, and an unbounded slow memory organized in blocks of size B. The cost measure is based purely on the number of block transfers between the primary and secondary memory. All other operations are free. Many algorithms have been analyzed in these models, and indeed these models predict the relative performance of algorithms much more accurately than the standard Random Access Machine (RAM) model. The models, however, require specifying algorithms at a very low level, requiring the user to carefully lay out their data in arrays in memory and manage their own memory allocation. We present a cost model for analyzing the memory efficiency of algorithms expressed in a simple functional language. We show how some algorithms written in standard forms using just lists and trees (no arrays) and requiring no explicit memory layout or memory management are efficient in the model. We then describe an implementation of the language and show provable bounds for mapping the cost in our model to the cost in the ideal-cache model. These bounds imply that purely functional programs based on lists and trees, with no special attention to any details of memory layout, can be asymptotically as efficient as the carefully designed imperative I/O-efficient algorithms. For example, we describe an O((n/B) log_{M/B} (n/B)) cost sorting algorithm, which is optimal in the ideal-cache and I/O models.
... When the young generation shrinks below a threshold value, we do a copying collection of all reachable objects in the old generation, creating a new, smaller old generation. Again, a key property is that the allocator repeatedly allocates all available free space, then does some copying of reachable objects, compacting them into contiguous space. (The particular algorithm we use is based on that of Appel [App87].) ...
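A compressed sketch of the scheme the excerpt attributes to Appel [App87]; all names are ours, and the bookkeeping (root scanning, forwarding, the reserve area) is elided:

```c
/* Sketch of an Appel-style two-generation heap (names ours).         */
struct heap {
    char *old_lo, *old_hi;   /* compacted old generation              */
    char *young_lo;          /* start of the current young space      */
    char *free, *limit;      /* bump pointer and end of memory        */
};

/* Cheney-style evacuation of reachable objects into *to (not shown). */
void copy_live(char *from_lo, char *from_hi, char **to);

void minor_gc(struct heap *h) {
    /* Evacuate reachable young objects to the end of the old
       generation (the real scheme keeps a reserve area between the
       generations so this copy cannot overlap its source).            */
    copy_live(h->young_lo, h->free, &h->old_hi);
    h->young_lo = h->free = h->old_hi;  /* new, smaller young space    */
}

void major_gc(struct heap *h, char *new_space) {
    /* Triggered when limit - young_lo falls below a threshold: copy
       the reachable old generation into a fresh region, compacting
       it into contiguous space.                                       */
    char *to = new_space;
    copy_live(h->old_lo, h->old_hi, &to);
    h->old_lo = new_space;
    h->old_hi = to;
}
```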
Article
Full-text available
Project Description. Programming languages that rely on garbage collection are becoming ubiquitous. Java and C# are especially popular. Both languages provide numerous software engineering advantages over languages like C and C++. The advantages of garbage collection include safety from accidental memory overwrites, protection from security violations, and the automatic prevention of space leaks. An important disadvantage of garbage collection is its negative impact on memory usage and execution time. Garbage-collected programs tend to consume more memory and exhibit worse locality than similar programs written in C and C++. They also suffer from reduced application throughput and increased response time due to garbage collection pauses. The focus of almost all recent research on garbage collection has been on improving the performance of individual applications executing on a dedicated machine. This model is obsolete. On servers, there is increasing use of Java and C# for applications and servlets as well as for stored procedures in database management systems. On desktops, components of office suites like Microsoft Office are being written in C#, and StarOffice already includes Java-based extensions. In the near future, we expect that both servers and desktop machines will spend most of their time executing a multitude of garbage-collected applications. We believe that this trend is leading us towards a performance disaster. Garbage-collected applications already require more memory than their counterparts written in C and C++, and interact poorly with virtual memory managers. A recent memo from Sun Microsystems cites the increased footprint of Java applications as one of the key problems with Java and "a barrier to the delivery of reliable ..." [Figure 1: Memory pressure on garbage-collected applications leads to significant performance degradation. (a) Effects of paging on SPECjbb, an enterprise workload running Java Beans: the "score" is the number of transactions processed (bigger is better), and throughput drops by nearly 2/3 as heap sizes exceed available memory. (b) Execution time (smaller is better, log scale) for a synthetic benchmark with memory pressure simulated by increasing heap size: as heap size grows, execution time increases by orders of magnitude. For each graph, the error bars correspond to the maximum and minimum values over three runs.]
... This can be extremely time-consuming in large systems. Andrew Appel (1987) suggests that garbage collection has little or no run-time costs in systems with very large amounts of memory available, since garbage collection passes are rarely required. However, practical evidence is that software, and users' expectations, expand to fill the space available. ...
... Copying algorithms require time proportional to the number of accessible objects whereas mark-and-sweep algorithms require time proportional to the number of accessible and inaccessible objects. With the increasing prevalence of large memories, the additional space required by copying algorithms is of decreasing importance [11]. ...
Article
Garbage collection algorithms rely on invariants to permit the identification of pointers and to correctly locate accessible objects. These invariants translate into constraints on object layouts and programming conventions governing pointer use. There are recent variations of collectors in which the invariants are relaxed. Typically, rules governing pointer use are relaxed, and a 'conservative' collection algorithm that treats all potential pointers as valid is used. Such pointers are 'ambiguous' because integers and other data can masquerade as pointers. Ambiguous pointers cannot be modified and hence the object they reference cannot be moved. Consequently, conservative collectors are based on mark-and-sweep algorithms. Copying algorithms, while more efficient, have not been used because they move objects and adjust pointers. This paper describes a variation of a copying garbage collector that can be used in the presence of ambiguous references. The algorithm constrains the layout and placement of objects, but not the location of referencing pointers. It simply avoids copying objects that are referenced directly by ambiguous pointers, reclaiming their storage on a subsequent collection when they are no longer ambiguously referenced. An implementation written in the ANSI C programming language is given.
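The abstract's algorithm reduces to one decision per reachable object; the sketch below uses our own names and ignores Bartlett's page-wise bookkeeping:

```c
#include <stdbool.h>

struct object;

bool referenced_by_ambiguous_pointer(struct object *o); /* stack scan  */
struct object *cheney_copy(struct object *o);  /* exact copy + forward */
void pin(struct object *o);      /* leave in place, mark as reachable  */

/* Forward one reference found during the copying phase. */
struct object *forward(struct object *o) {
    if (referenced_by_ambiguous_pointer(o)) {
        /* The "pointer" may really be an integer, so neither it nor
           the object may be changed: keep the object where it is and
           reclaim it on a later collection, once no ambiguous
           reference remains.                                          */
        pin(o);
        return o;
    }
    return cheney_copy(o);  /* safe to move and to adjust the pointer  */
}
```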
... It is thus important that we take account of any effects that Optimistic Evaluation might have on the cost of garbage collection. If the heap residency increases moderately, then the effects are not particularly dramatic; most modern garbage collectors reduce their collection frequency as the heap residency increases, ensuring that garbage collection takes up a fairly constant proportion of runtime [App87,Wil92]. However this approach breaks down if the heap size increases beyond the available physical memory. ...
Article
This dissertation is not substantially the same as any I have submitted for a degree or diploma or any other qualification at any other university. Further, no part of the dissertation has already been or is being currently submitted for any such degree, diploma or other qualification. This dissertation does not exceed 60,000 words. This total includes tables and footnotes, but excludes appendices, bibliography, and diagrams. Material from Chapters 2, 3 and 4 has been presented in a paper I co-authored with my supervisor [EP03b]. Chapter 5 is largely the same as a paper that I co-authored with my supervisor and with Alan Mycroft [EPM03]. Chapter 11 is largely identical to another paper that I co-authored with my supervisor [EP03a]. This dissertation is my own work and contains nothing which is the outcome of work done in collaboration with others, except as specified in the text.
... Thus, a large heap size does not affect the running time of the collector, and the collector can be run less frequently. Appel [App87] has demonstrated that using a "standard" computer with a heap size seven times larger than the amount of live memory, a copying garbage collector is faster than stack allocation! However, Miller and Rozas [MR94] have demonstrated that stack allocation can be performed even faster than allocation using copying garbage collection on some platforms. ...
... It would be quite difficult to implement in lazy languages because values which do not appear in the result of a function call may not be evaluated until a considerable time after the evaluation of the function call. Also, it is argued in (Appel, 1987) that garbage collection can be faster than stack allocation when reasonably large stores are used. ...
Article
The widely studied I/O and ideal-cache models were developed to account for the large difference in costs to access memory at different levels of the memory hierarchy. Both models are based on a two-level memory hierarchy with a fixed-size primary memory (cache) of size M and an unbounded secondary memory organized in blocks of size B. The cost measure is based purely on the number of block transfers between the primary and secondary memory. All other operations are free. Many algorithms have been analyzed in these models, and indeed these models predict the relative performance of algorithms much more accurately than the standard RAM model. The models, however, require specifying algorithms at a very low level, requiring the user to carefully lay out their data in arrays in memory and manage their own memory allocation. In this paper we present a cost model for analyzing the memory efficiency of algorithms expressed in a simple functional language. We show how some algorithms written in standard forms using just lists and trees (no arrays) and requiring no explicit memory layout or memory management are efficient in the model. We then describe an implementation of the language and show provable bounds for mapping the cost in our model to the cost in the ideal-cache model. These bounds imply that purely functional programs based on lists and trees, with no special attention to any details of memory layout, can be asymptotically as efficient as the carefully designed imperative I/O-efficient algorithms. For example, we describe an O((n/B) log_{M/B} (n/B)) cost sorting algorithm, which is optimal in the ideal-cache and I/O models.
Chapter
The explosion of applications in data-parallel systems and ever-growing efficiency needs for data analysis have put parallel systems under enormous memory pressure when dealing with large datasets. Out-of-memory errors and excessive garbage collection can seriously affect system performance. Generally, for data-flow tasks with intensive in-memory computing requirements, efficient memory caching algorithms are the primary means of trading off performance against memory overhead. Taking advantage of the latest research findings on DAG-based task scheduling, we design a new caching algorithm for in-memory computing that exploits the critical-path information of the DAG, called Non-critical path least reference count (NLC). The strategy is distinct from existing ones in that it applies the global information of the critical path to caching replacements rather than to task scheduling, as most existing works do. Through empirical studies, we demonstrate that NLC can not only effectively enhance parallel execution efficiency, but also reduce the number of evictions and improve the hit ratio and memory utilization rate as well. Our comprehensive evaluations based on the selected benchmark graphs indicate that our strategy can not only fulfill the parallel system requirements but also reduce costs by as much as , compared with the most advanced LRC algorithm.
Article
This article surveys existing methods for verifying dynamic memory, provides a comparative analysis of them, and assesses their applicability to problems of managing, monitoring, and verifying dynamic memory. The article consists of eight sections. The first section is an introduction. The second discusses the problems of dynamic memory management. The third considers Hoare's calculus. The fourth deals with transformations of the heap into the stack. The fifth introduces the concept of dynamic memory shape analysis. The sixth is devoted to pointer rotation, the seventh to separation logic. The last section considers possible directions for further research in this area, in particular: recognition, at the record level, of different instances of objects; automation of proofs; the use of "hot" code, that is, program code that updates itself when the program runs; making proof explanations more intuitive; and others.
Book
Is your memory hierarchy stopping your microprocessor from performing at the high level it should be? Memory Systems: Cache, DRAM, Disk shows you how to resolve this problem. The book tells you everything you need to know about the logical design and operation, physical design and operation, performance characteristics and resulting design trade-offs, and the energy consumption of modern memory hierarchies. You learn how to tackle the challenging optimization problems that result from the side-effects that can appear at any point in the entire hierarchy. As a result you will be able to design and emulate the entire memory hierarchy. Understand all levels of the system hierarchy: cache, DRAM, and disk. Evaluate the system-level effects of all design choices. Model performance and energy consumption for each component in the memory hierarchy.
Article
Identifying some pointers as invisible threads, for the purposes of storage management, is a generalization from several widely used programming conventions, like threaded trees. The necessary invariant is that nodes that are accessible (without threads) emit threads only to other accessible nodes. Dynamic tagging or static typing of threads ameliorates storage recycling both in functional and imperative languages. We have seen the distinction between threads and links sharpen both hardware- and software-supported storage management in SCHEME, and also in C. Certainly, therefore, implementations of languages that already have abstract management and concrete typing should detect and use this as a new static type.
Article
This paper describes the distributed memory implementation of a shared memory parallel functional language. The language is Id, an implicitly parallel, mostly functional language that is currently evolving into a dialect of Haskell. The target is a distributed memory machine, because we expect these to be the most widely available parallel platforms in the future. The difficult problem is to bridge the gap between the shared memory language model and the distributed memory machine model. The language model assumes that all data is uniformly accessible, whereas the machine has a severe memory hierarchy: a processor's access to remote memory (using explicit communication) is orders of magnitude slower than its access to local memory. Thus, avoiding communication is crucial for good performance. The Id language, and its general dataflow-inspired compilation to multithreaded code, are described elsewhere. In this paper, we focus on our new parallel runtime system and its features for avoiding communication and for tolerating its latency when necessary: multithreading, scheduling and load balancing; the distributed heap model and distributed coherent caching; and parallel garbage collection. We have completed the first implementation, and we present some preliminary performance measurements.
Article
The traditional model of virtual memory working sets does not account for programs that can adjust their working sets on demand. Examples of such programs are garbage-collected systems and databases with block cache buffers. We present a memory-use model of such systems, and propose a method that may be used by virtual memory managers to advise programs on how to adjust their working sets. Our method tries to minimize memory contention and ensure better overall system response time. We have implemented a memory “advice server” that runs as a non-privileged process under Berkeley Unix. User processes may ask this server for advice about working set sizes, so as to take maximum advantage of memory resources. Our implementation is quite simple, and has negligible overhead, and experimental results show that it results in sizable performance improvements.
Article
We've designed and implemented a copying garbage-collection algorithm that is efficient, real-time, concurrent, runs on commercial uniprocessors and shared-memory multiprocessors, and requires no change to compilers. The algorithm uses standard virtual-memory hardware to detect references to “from space” objects and to synchronize the collector and mutator threads. We've implemented and measured a prototype running on SRC's 5-processor Firefly. It will be straightforward to merge our techniques with generational collection. An incremental, non-concurrent version could be implemented easily on many versions of Unix.
Article
Hardware-assisted real-time garbage collection offers high throughput and small worst-case bounds on the times required to allocate dynamic objects and to access the memory contained within previously allocated objects. Whether the proposed technology is cost effective depends on various choices between configuration alternatives. This paper reports the performance of several different configurations of the hardware-assisted real-time garbage collection system subjected to several different workloads. Reported measurements demonstrate that hardware-assisted real-time garbage collection is a viable alternative to traditional explicit memory management techniques, even for low-level languages like C++.
Article
Memory Management Units (MMUs) are traditionally used by operating systems to implement disk-paged virtual memory. Some operating systems allow user programs to specify the protection level (inaccessible, read-only, read-write) of pages, and allow user programs to handle protection violations, but these mechanisms are not always robust, efficient, or well-matched to the needs of applications. We survey several user-level algorithms that make use of page-protection techniques, and analyze their common characteristics, in an attempt to answer the question, "What virtual-memory primitives should the operating system provide to user processes, and how well do today's operating systems provide them?"
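The user-level technique this survey examines is conventionally built from mprotect(2) and a SIGSEGV handler; a minimal POSIX skeleton follows (error handling, page-size discovery, and the collector's actual per-page work omitted):

```c
#include <signal.h>
#include <stdint.h>
#include <sys/mman.h>

#define PAGE 4096UL                    /* assume a 4 KiB page size    */

/* Close a page so that any mutator access traps into the handler.    */
static void protect_page(void *page) {
    mprotect(page, PAGE, PROT_NONE);
}

/* Service the trap: do the collector's work for that page (e.g.,
   forward the pointers it contains), then reopen it and retry.       */
static void on_fault(int sig, siginfo_t *info, void *ctx) {
    (void)sig; (void)ctx;
    void *page = (void *)((uintptr_t)info->si_addr & ~(PAGE - 1));
    /* ... scan/fix objects on this page ...                           */
    mprotect(page, PAGE, PROT_READ | PROT_WRITE);
}

static void install_barrier(void) {
    struct sigaction sa = {0};
    sa.sa_sigaction = on_fault;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGSEGV, &sa, NULL);
}
```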
Article
This paper describes the design and implementation of a concurrent compacting garbage collector for languages that distinguish mutable data from immutable data (e.g., ML) as well as for languages that manipulate only immutable data (e.g., pure functional languages such as Haskell). The collector runs on shared-memory parallel computers and requires minimal mutator/collector synchronization. No special hardware or operating system support is required. Our concurrent collector derives from sequential semi-space copying collectors. The collector requires that a heap object include a forwarding pointer in addition to its data fields. An access to an immutable object can be satisfied either by the original or the forwarded copy of the object. A mutable object is always accessed through the forwarded copy, if one exists. Measurements of this collector in a Standard ML compiler on a shared-memory computer indicate that it eliminates perceptible garbage-collection pauses by reclaiming storage in parallel with the computation proper. All observed pause times are less than 20 milliseconds. We describe extensions for the concurrent collection of multiple mutator threads and refinements to our design that can improve its efficiency.
Thesis
Full-text available
Most computer programs are concurrent ones: they need to perform several tasks at the same time. Threads and events are two common techniques to implement concurrency. Events are generally more lightweight and efficient than threads, but also more difficult to use. Additionally, they are often not powerful enough; it is then necessary to write hybrid code, which uses both preemptively-scheduled threads and cooperatively-scheduled event handlers, and is even more complex. In this dissertation, we show that concurrent programs written in threaded style can be translated automatically into efficient, equivalent event-driven programs through a series of proven source-to-source transformations. We first propose Continuation-Passing C, an extension of the C programming language for writing concurrent systems that provides very lightweight, unified (cooperative and preemptive) threads. CPC programs are processed by the CPC translator to produce efficient sequentialized event-loop code, using native threads for the preemptive parts. We then define and prove the correctness of these transformations, in particular lambda lifting and CPS conversion, for an imperative language. Finally, we validate the design and implementation of CPC by comparing it to other thread libraries, and by exhibiting our Hekate BitTorrent seeder. We also justify the choice of lambda lifting by implementing eCPC, a variant of CPC using environments, and comparing its performance to CPC.
Article
We have built a portable platform for running Standard ML of New Jersey programs on multiprocessors. It can be used to implement user-level thread packages for multiprocessors within the ML language with first-class continuations. The platform supports experimentation with different thread scheduling policies and synchronization constructs. It has been used to construct a Modula-3 style thread package and a version of Concurrent ML, and has been ported to three different multiprocessors running variants of Unix. This paper describes the platform's design, implementation, and performance.
Conference Paper
Research on memory management focuses mainly on techniques such as garbage collection, scheduling, real-time systems, user-oriented design, and many more. A major problem in memory management is processor and operating-system delays. This paper discusses a new processor design that aims to improve overall system performance.
Article
A software system is described for simulating the execution of concurrent programs. Although the system has been implemented on a single processor, it enables the user to simulate the behaviour of shared-memory multiprocessors. The tool uses semaphores and coroutine-like concepts as the basis for an extension of Pascal and incorporates several syntactic and implementation constructs that are of interest by themselves. Though simple, it provides facilities for monitoring and run-time checking that make it extremely flexible in the modelling and prototyping of concurrent systems. The main features of the tool are illustrated through several programming examples.
Article
The technique of stack allocation improves the efficiency of Java programs, but the trade-off between the ratio of stack-allocated objects and the size of the stack is difficult. In this paper, a flow-insensitive, inter-procedural, and context-sensitive escape analysis is implemented. An allocation policy with the loop as the basic unit is presented, and the concepts of object stack and stack region are proposed. The authors implement stack allocation based on escape analysis through loop analysis. The result on SPECjvm98 shows that 8.3% to 25% (with an average of 15.18%) of all objects can be stack allocated by the new algorithm with a controlled stack size.
Article
Since the C language is a machine independent low-level language, it is well-suited as a portable target language for the implementation of higher-order programming languages. This paper presents an efficient C code generator for Caml-Light, a variant of CAML designed at INRIA. Fundamentally, the compilation technique consists of translating ML code via an intermediate language named Sqil and the runtime system relies on a new conservative garbage collector. This scheme produces at the same...
Article
Full-text available
Automatic storage management, or garbage collection, is a feature usually associated with languages oriented toward "symbolic processing," such as Lisp or Prolog; it is seldom associated with "systems" languages, such as C and C++. This report surveys techniques for performing garbage collection for languages such as C and C++, and presents an implementation of a concurrent copying collector for C++. The report includes performance measurements on both a uniprocessor and a multiprocessor.
Article
Previous schemes for implementing full tail-recursion when compiling into C have required some form of "trampoline" to pop the stack. We propose solving the tail-recursion problem in the same manner as Standard ML of New Jersey, by allocating all frames in the (garbage-collected) heap. The Scheme program is translated into continuation-passing style, so the target C functions never return. The C stack pointer then becomes the allocation pointer for a Cheney-style copying garbage collection scheme. Our Scheme can use C function calls, C arguments, C variable-arity functions, and separate compilation without requiring complex block-compilation of entire programs. Our C version of the "Boyer" benchmark is available at ftp://ftp.netcom.com/pub/hb/hbaker/cboyer13.c.
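Baker's trick can be skeletonized as follows (names ours): compiled functions take explicit continuations and never return, so frames pile up on the C stack, which doubles as the allocation nursery; when it fills, live data is evacuated to the heap and the stack is discarded with longjmp.

```c
#include <setjmp.h>

typedef void (*cont)(void *arg);     /* continuation type             */

static jmp_buf toplevel;             /* set in the driver loop        */
static char   *stack_limit;          /* low-water mark for the stack  */
static cont    resume_k;             /* survive the longjmp           */
static void   *resume_arg;

static int stack_exhausted(void) {
    char probe;                      /* stack grows downward on most  */
    return &probe < stack_limit;     /* platforms (non-portable but   */
}                                    /* conventional)                 */

void gc_flip(cont *k, void **arg);   /* evacuate live frames/objects
                                        from the stack to the heap,
                                        forwarding k and arg (omitted)*/

/* Every compiled call goes through here and never returns. */
void call_cont(cont k, void *arg) {
    if (stack_exhausted()) {
        gc_flip(&k, &arg);
        resume_k = k; resume_arg = arg;
        longjmp(toplevel, 1);        /* discard the whole C stack     */
    }
    k(arg);                          /* ordinary tail call            */
}

void driver(cont k, void *arg) {
    resume_k = k; resume_arg = arg;
    setjmp(toplevel);                /* re-entered after each GC flip */
    resume_k(resume_arg);
}
```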
Article
Submitted to EuroSys 2009. General-purpose operating systems not only fail to provide adaptive applications the information they need to intelligently adapt, but also schedule resources in such a way that were applications to aggressively adapt, resources would be inappropriately scheduled. The problem is that these systems use demand as the primary indicator of utility, which is a poor indicator of utility for adaptive applications. We present a resource management framework appropriate for traditional as well as adaptive applications. The primary difference from current schedulers is the use of stakeholder preferences in addition to demand. We also show how to revoke memory, compute the amount of memory available to each principal, and account shared memory. Finally, we introduce a prototype system, Viengoos, and present some benchmarks that demonstrate that it can efficiently support multiple aggressively adaptive applications simultaneously.
Article
A dynamic programming language is a language that provides the programmer with many possibilities for dynamic computation at run-time. When referring to dynamic programming languages, programmers often think of languages like Lisp, Python or Smalltalk. These are all dynamically typed languages, which offer a lot of flexibility with respect to the use of types at run-time. In a statically typed language, one form of dynamism would be to allow type specifications and composition of new types to be computed at run-time, but with the guarantees that come with static analysis. A programming language which fits this profile is the gbeta programming language. Virtual machines provide an abstraction over the operating system and the hardware it is running on. This gives a well-defined interface between the operating system and the program that is executed inside the virtual machine. This report presents the initial design and structure of a virtual machine for efficiently executing gbeta programs, with details about some of the very dynamic aspects of gbeta.
Article
We have implemented a parallel graph reducer on a commercially available shared-memory multiprocessor (a Sequent Symmetry™) that achieves real speedup compared to a fast compiled implementation of the conventional G-machine. Using 15 processors, this speedup ranges between 5 and 11, depending on the program. Underlying the implementation is an abstract machine called the ⟨ν,G⟩-machine. We describe the sequential and the parallel ⟨ν,G⟩-machine, and our implementation of them. We provide...
Article
Even though modern programming languages are becoming more important than ever before, programmers have traditionally faced a dilemma: programs written in these languages traditionally have had lower performance than programs written in more conventional, but error-prone, languages. In this thesis, I study this problem in the context of one particular modern programming language, Standard ML. Standard ML contains all the language features mentioned previously and more. I use an empirical approach to understand how Standard ML programs can be made faster through better optimization.
Article
An abstract is not available.
Conference Paper
This paper discusses garbage collection techniques used in a high-performance Lisp implementation with a large virtual memory, the Symbolics 3600. Particular attention is paid to practical issues and experience. In a large system problems of scale appear and the most straightforward garbage-collection techniques do not work well. Many of these problems involve the interaction of the garbage collector with demand-paged virtual memory. Some of the solutions adopted in the 3600 are presented, including incremental copying garbage collection, approximately depth-first copying, ephemeral objects, tagged architecture, and hardware assists. We discuss techniques for improving the efficiency of garbage collection by recognizing that objects in the Lisp world have a variety of lifetimes. The importance of designing the architecture and the hardware to facilitate garbage collection is stressed.
Conference Paper
The proposed ML is not intended to be the functional language. There are too many degrees of freedom for such a thing to exist: lazy or eager evaluation, presence or absence of references and assignment, whether and how to handle exceptions, types-as-parameters or polymorphic type-checking, and so on. Nor is the language or its implementation meant to be a commercial product. It aims to be a means for propagating the craft of functional programming and a vehicle for further research into the design of functional languages.
Article
A real-time list processing system is one in which the time required by each elementary list operation (CONS, CAR, CDR, RPLACA, RPLACD, EQ, and ATOM in LISP) is bounded by a (small) constant. Classical list processing systems such as LISP do not have this property because a call to CONS may invoke the garbage collector which requires time proportional to the number of accessible cells to finish. The space requirement of a classical LISP system with N accessible cells under equilibrium conditions is (1.5+μ)N or (1+μ)N, depending upon whether a stack is required for the garbage collector, where μ>0 is typically less than 2. A list processing system is presented which: 1) is real-time--i.e. T(CONS) is bounded by a constant independent of the number of cells in use; 2) requires space (2+2μ)N, i.e. not more than twice that of a classical system; 3) runs on a serial computer without a time-sharing clock; 4) handles directed cycles in the data structures; 5) is fast--the average time for each operation is about the same as with normal garbage collection; 6) compacts--minimizes the working set; 7) keeps the free pool in one contiguous block--objects of nonuniform size pose no problem; 8) uses one phase incremental collection--no separate mark, sweep, relocate phases; 9) requires no garbage collector stack; 10) requires no "mark bits", per se; 11) is simple--suitable for microcoded implementation. Extensions of the system to handle a user program stack, compact list representation ("CDR-coding"), arrays of non-uniform size, and hash linking are discussed. CDR-coding is shown to reduce memory requirements for N LISP cells to ≈(I+μ)N. Our system is also compared with another approach to the real-time storage management problem, reference counting, and reference counting is shown to be neither competitive with our system when speed of allocation is critical, nor compatible, in the sense that a system with both forms of garbage collection is worse than our pure one.
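The real-time bound comes from coupling collector work to allocation; in our notation (paraphrasing the paper's equilibrium analysis), if every CONS also copies at least k cells from the old semispace, then with N accessible cells and semispaces of S cells each, tracing must complete before the S − N free cells are exhausted:

```latex
k \;\ge\; \frac{N}{S - N}
```

With S = (1 + μ)N this becomes k ≥ 1/μ, tying the μ in the space bounds above to the collector's incremental rate and bounding the work per CONS by a constant.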
Article
An important property of the Newell Shaw-Simon scheme for computer storage of lists is that data having multiple occurrences need not be stored at more than one place in the computer. That is, lists may be “overlapped.” Unfortunately, overlapping poses a problem for subsequent erasure. Given a list that is no longer needed, it is desired to erase just those parts that do not overlap other lists. In LISP, McCarthy employs an elegant but inefficient solution to the problem. The present paper describes a general method which enables efficient erasure. The method employs interspersed reference counts to describe the extent of the overlapping.
Article
We have developed a compiler for the lexically-scoped dialect of LISP known as SCHEME. The compiler knows relatively little about specific data manipulation primitives such as arithmetic operators, but concentrates on general issues of environment and control. Rather than having specialized knowledge about a large variety of control and environment constructs, the compiler handles only a small basis set which reflects the semantics of lambda-calculus. All of the traditional imperative constructs, such as sequencing, assignment, looping, GOTO, as well as many standard LISP constructs such as AND, OR, and COND, are expressed in macros in terms of the applicative basis set. A small number of optimization techniques, coupled with the treatment of function calls as GOTO statements, serve to produce code as good as that produced by more traditional compilers. The macro approach enables speedy implementation of new constructs as desired without sacrificing efficiency in the generated code. A fair amount of analysis is devoted to determining whether environments may be stack-allocated or must be heap-allocated. Heap-allocated environments are necessary in general because SCHEME (unlike Algol 60 and Algol 68, for example) allows procedures with free lexically scoped variables to be returned as the values of other procedures; the Algol stack-allocation environment strategy does not suffice. The methods used here indicate that a heap-allocating generalization of the "display" technique leads to an efficient implementation of such "upward funargs". Moreover, compile-time optimization and analysis can eliminate many "funargs" entirely, and so far fewer environment structures need be allocated at run time than might be expected. A subset of SCHEME (rather than triples, for example) serves as the representation intermediate between the optimized SCHEME code and the final output code; code is expressed in this subset in the so-called continuation-passing style. As a subset of SCHEME, it enjoys the same theoretical properties; one could even apply the same optimizer used on the input code to the intermediate code. However, the subset is so chosen that all temporary quantities are made manifest as variables, and no control stack is needed to evaluate it. As a result, this apparently applicative representation admits an imperative interpretation which permits easy transcription to final imperative machine code. These qualities suggest that an applicative language like SCHEME is a better candidate for an UNCOL than the more imperative candidates proposed to date.
Article
The attached paper is a description of the LISP system starting with the machine-independent system of recursive functions of symbolic expressions. This seems to be a better point of view for looking at the system than the original programming approach. After revision, the paper will be submitted for publication in a logic or computing journal. This memorandum contains only the machine independent parts of the system. The representation of S-expressions in the computer and the system for representing S-functions by computer subroutines will be added.
Article
In previous heap storage systems, the cost of creating objects and garbage collection is independent of the lifetime of the object. Since objects with short lifetimes account for a large portion of storage use, it is worth optimizing a garbage collector to reclaim storage for these objects more quickly. The garbage collector should spend proportionately less effort reclaiming objects with longer lifetimes. We present a garbage collection algorithm that (1) makes storage for short-lived objects cheaper than storage for long-lived objects, (2) operates in real-time--object creation and access times are bounded, (3) increases locality of reference, for better virtual memory performance, and (4) works well with multiple processors and a large address space.
Article
This paper presents a solution to the third problem of classical list processing techniques and removes that roadblock to their more general use. Using the method given here, a computer could have list processing primitives built in as machine instructions and the programmer would still be assured that each instruction would finish in a reasonable amount of time. For example, the interrupt handler for a keyboard could store its characters on the same kinds of lists---and in the same storage area---as the lists of the main program. Since there would be no long wait for a garbage collection, response time could be guaranteed to be small. Even an operating system could use these primitives to manipulate its burgeoning databases. Business database designers no longer need shy away from pointer-based systems for fear that their systems will be impacted by a week-long garbage collection! As memory is becoming cheaper, ...