Static Analysis for Identifying and Allocating Clusters
of Immortal Objects
Archana Ravindar
Department of Computer Science
Indian Institute of Science
Bangalore-12
archana@csa.iisc.ernet.in
Y.N.Srikant
Department of Computer Science
Indian Institute of Science
Bangalore-12
srikant@csa.iisc.ernet.in
ABSTRACT
Long living objects lengthen the trace time which is a critical phase of the garbage collection process. However, it
is possible to recognize object clusters i.e. groups of long living objects having approximately the same lifetime
and treat them separately to reduce the load on the garbage collector and hence improve overall performance.
Segregating objects this way leaves the heap for objects with shorter lifetimes and now a typical collection can find
more garbage than before.
In this paper, we describe a compile time analysis strategy to identify object clusters in programs. The result of
the compile time analysis is the set of allocation sites that contribute towards allocating objects belonging to such
clusters. All such allocation sites are replaced by a new allocation method that allocates objects into the cluster
area rather than the heap. This study was carried out for a concurrent collector which we developed for Rotor,
Microsoft’s Shared Source Implementation of .NET. We analyze the performance of the program with combina-
tions of the cluster and stack allocation optimizations. Our results show that the clustering optimization reduces
the number of collections by 66.5% on average, even eliminating the need for collection in some programs. As a
result, the total pause time reduces by 62.8% on average. Using both stack allocation and the cluster optimizations
brings down the number of collections by 91.5% thereby improving the total pause time by 79.33%.
Keywords
Static analysis, compiler-assisted memory management, effective garbage collection, object clustering
1 INTRODUCTION
Garbage collection has come a long way since the time
it was introduced for collecting lists in LISP. Now it
has become a necessity in modern object oriented lan-
guages, since it successfully abstracts the problem of
memory management from the user. Advances like
generational collection and concurrent collection were
successful in bringing down the collection overhead
and thereby making garbage collection practically us-
able in runtime systems.
All said and done, a program incurs a performance penalty if it is garbage collected. So it becomes essen-
tial to keep the overhead at a minimum. This is pos-
sible if we reclaim the maximum amount of garbage
with the least number of effective collections. Several previous works have tried to achieve this goal in their own way by looking at different object properties such as connectivity [Hay91, Hir02, Hir03, Sam04], object types [Shu02], and age [McK99], rather than object traceability alone.
Our goal is to make each collection effective and
thereby reduce the total number of collections required
to reclaim garbage in the program. We achieve this by
identifying long living clusters of objects and allocat-
ing them in a separate mature object space that is not
subject to garbage collection. The idea is to avoid trac-
ing objects that are going to live till the end. Segregat-
ing objects this way leaves the heap for objects with
shorter lifetimes and now a typical collection can find
more garbage than before, making collections more ef-
fective. Although we have studied clustering in a gen-
erational setting, this elementary concept is applicable
to incremental collectors too.
This paper describes a compile time clustering analy-
sis algorithm based on the compositional pointer and
escape analysis framework proposed in [Wha99]. The
clustering algorithm makes use of the lifetime in-
formation of objects computed by the points-to-escape analysis algorithm, which is performed for every method. The objects that do not escape the longest
living methods are designated as the root of the clus-
ter. The objects that are reachable from the root are
treated as cluster objects. All such cluster objects are
statically allocated in a separate mature object space.
When the stack frame of the method binding the life-
time of the root object is popped, the entire cluster is
garbage and hence the mature object space can be re-
claimed in its entirety.
The clustering scheme is evaluated using a base-
line collector that can run in both stop-the-world and
concurrent modes that we developed for Rotor, Mi-
crosoft’s shared source implementation of .NET. The
baseline collector has two generations and uses the
copying scheme to collect both. We analyze the per-
formance of the collector and the program with the
cluster and the stack allocation optimizations. Our re-
sults show a marked decrease in the total number of
collections and considerable improvement in the indi-
vidual collection performance. It is observed that a
combination of the clustering and the stack allocation
optimization improves the performance even further.
The remainder of the paper is organized as follows.
We begin by reviewing related work in Section 2. Sec-
tion 3 describes the concept of clustering and how we
extend the compositional pointer and escape analysis
to identify clusters. In Section 4 we describe the base-
line collector and the experimental platform. In Sec-
tion 5 we present and evaluate the results. Finally we
conclude in section 6 with possible avenues of future
work.
2 RELATED WORK
Hayes introduced the term object clustering [Hay91].
The main observation was that large clusters of ob-
jects, pointed to by key objects, were allocated at
roughly the same time and lived for approximately the
same amount of time. When the key objects became
unreachable it indicated a good opportunity to collect.
Hayes identified the cluster as the program executed
and incrementally placed it in the mature object space.
Our work tries to identify the cluster at compile time
and statically allocates the cluster into the mature ob-
ject space. The compile time clustering algorithm is
used to find key objects. Unlike Hayes’s scheme where
the mature object space is collected, we do not sub-
ject the mature object space to garbage collection. We
combine the concepts of escape analysis and cluster-
ing to reclaim the cluster.
Pretenuring tries to solve the problem of repeated col-
lections of long living objects by directly allocating
such objects into the old generation by using static and
dynamic profiles [Bla98, Che98, Har00]. But the old
generation is still subject to collection, so in spite of
applying the pretenuring optimization, major collec-
tions might still occur. Our scheme tries to completely
eliminate major collections by allocating these long
living or immortal objects in a separate mature object
space that is not subject to collection.
Dynamic object colocation [Sam04] allocates objects
directly into the same area as an object that will ref-
erence it, by using a mix of compile-time and runtime
optimizations. Static compiler analysis is used to com-
pute connectivity information and the runtime compo-
nent involves an allocation routine which takes a colo-
cator object as an additional parameter and is respon-
sible for dynamic colocation. The dynamic colocator
can start placing objects into the mature object space
only when an initial set of colocators is present. Hence it requires a warm-up young generation collec-
tion to produce these initial colocators, whereas the
intention of our scheme is to reduce the number of
collections, even eliminating the need for collection
if possible. [Sam04] reports a considerable increase in
the number of intergenerational pointers for some of
its programs. Our results indicate that clustering only
reduces the number of intergenerational pointers but
never increases it.
Connectivity based garbage collection makes use of
the observation that connected objects die together.
Based on this hypothesis it allocates objects that are
connected together into a statically determined parti-
tion so that collecting a partition would be much faster
than collecting the heap. [Hir03] works by building
a hierarchy of partition dags and collects these parti-
tions such that an ancestor is collected together with its
descendants thereby eliminating the need for a write-
barrier.
3 CLUSTERING
The concept of data is fundamental to every program.
Programs feed on data; they build several data struc-
tures that assist them in performing their functional-
ity. In the object oriented paradigm, objects store data.
These data objects are seldom isolated, rather they are
related to one another in some way and hence linked
together to form clusters.
Most often a program is associated with a set of crit-
ical objects that are bound to stay till almost the end
of the program. Such objects are said to be immor-
tal. If these objects are treated in the same way as the
default heap objects, they would unnecessarily be pro-
cessed by the garbage collector, resulting in increased
collection times. Figure 1 illustrates the impact of long
living objects on the total collection time, measured as
the fraction of time spent in scavenging live objects.
We observe that the scavenge time accounts for a sig-
nificant fraction of the total collection time (up to 83% in 211 anagram). Further investigation reveals that
up to 88% of the objects were found to be live during
the collection. Hence tracing immortal or long living
clusters plays a major role in lengthening the total col-
lection time.
Figure 1: Proportion of Collection Time spent on
Scavenge.
If we can recognize the allocation sites in the program
responsible for creating long living clusters (high-
lighted in Figure 2) at compile time, we can statically
allocate them in a region that is not processed by the
garbage collector. The region can then be reclaimed
in its entirety at program termination. Such a strat-
egy allows the garbage collector to focus on objects
that are volatile and objects whose lifetimes cannot be
statically determined. We describe the clustering algo-
rithm which identifies long living clusters in the next
section.
Figure 2: Set of Allocation Sites that contribute towards Cluster Objects in 211 anagram
3.1 Extending Compositional Pointer Analysis To Identify Clusters
The algorithm to identify clusters in a program is
based on the compositional pointer and escape analysis proposed for Java programs by Whaley and Rinard [Wha99]. The referencing behavior among objects and fields is abstracted in the form of a points-to-escape (PTE) graph. Nodes in the PTE graph
represent objects allocated by the program and edges
represent references between them. Objects that are
created within the currently analyzed region are rep-
resented by inside nodes in the PTE graph, whereas
those created outside the currently analyzed region or
accessed via outside edges are represented by outside
nodes in the PTE graph. Similarly inside edges repre-
sent references created within the currently analyzed
region. References created outside the currently ana-
lyzed region are represented by outside edges in the
PTE graph. We restrict our analysis to programs that
are single-threaded.
The algorithm is compositional in nature, i.e., meth-
ods can be analyzed independently of their callers and
callees. [Wha99] describes an intra-procedural algo-
rithm that computes individual PTE graphs for each
method and an inter-procedural algorithm that com-
putes precise points-to-escape information for each
method. The inter-procedural algorithm combines the
PTE graph for each method with the PTE graphs cre-
ated for all its callees.
The ultimate objective of the algorithm is to determine
for every allocation site A, the method M whose stack
frame will outlive the object created at A. In such a
situation, the object created at A is said to be captured by
M. If enough information is not available to ascertain
whether an object escapes or not, it is allocated in the
heap.
An object is said to escape a method M if it is a formal parameter of M, if a reference to the object is written into a static class variable, if a reference to the object is passed to a callee N of M and no information is available about what N does with the object, or if M returns the object. If the object satisfies none of these conditions, it is said to be captured within M.
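To make these conditions concrete, consider the following illustrative C# fragment (constructed for exposition; it is not taken from any of the benchmarks):

    class EscapeExample
    {
        static object cache;                  // static class variable

        static object MakeEscaping1()
        {
            object a = new object();          // escapes: returned to the caller
            return a;
        }

        static void MakeEscaping2()
        {
            object b = new object();
            cache = b;                        // escapes: written into a static variable
        }

        static void MakeCaptured()
        {
            int[] c = new int[16];            // captured: never returned, never stored globally,
            c[0] = 42;                        // never passed to an unanalyzed callee, so its
        }                                     // lifetime is bounded by this stack frame
    }

Here c can be allocated on the stack frame of MakeCaptured, whereas a and b escape and must be treated conservatively.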
In essence, when a complete points-to escape analysis
graph is constructed for a method M it consists of the
nodes that were either created within the method M or
nodes created outside M but are reachable from within
M. The clustering algorithm makes use of this fact to
recognize a cluster.
3.1.1 Design
In this section we describe the clustering algorithm
in the form of pseudocode as shown in Figures 3
and 4. To begin with, we need to preprocess the
statements to include only those that will affect the
PTE graph [Wha99]. The C# compiler invokes CompileMethod for every method; CompileMethod creates basic blocks while translating the source code into opcodes. We intercept at points where code is generated
for statements that we are interested in and save the
details of the statement in a separate data structure.
Once the code for the method is generated, we iter-
ate through the statements we saved to compute
the PTE graph. The graph is implemented as an adja-
cency list. Each node is a structure that stores the set
of incoming and outgoing edges, node kind and infor-
mation whether it was visited or not. Each edge is a
structure that stores the head and the tail node, edge
kind and the variable it represents.
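A minimal C# sketch of one possible representation is shown below; the type and member names are illustrative and need not match the actual implementation:

    using System.Collections.Generic;

    enum NodeKind { Inside, Outside }          // created inside or outside the analyzed region
    enum EdgeKind { Inside, Outside }          // reference created inside or outside the region

    class PteEdge
    {
        public PteNode Tail;                   // node the reference originates from
        public PteNode Head;                   // node the reference points to
        public EdgeKind Kind;
        public string Variable;                // variable or field the edge represents
    }

    class PteNode
    {
        public NodeKind Kind;
        public bool Visited;                   // used during reachability traversals
        public List<PteEdge> Incoming = new List<PteEdge>();
        public List<PteEdge> Outgoing = new List<PteEdge>();
    }

    class PteGraph
    {
        public List<PteNode> Nodes = new List<PteNode>();   // adjacency list, one node per allocation site
    }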
During the intra-procedural analysis, when we en-
counter a call statement it is possible that the PTE
graph for that call is not yet computed. The status of
all such statements that have incomplete information
is marked as pending. During the inter-procedural
analysis we process only pending statements to com-
pute the complete PTE graph. Finally, we process the
PTE graphs of only those methods M that lie close
to main in the call graph, to compute cluster informa-
tion. This list of methods can be obtained by profiling. The
PTE graph for all such M would consist of only those
nodes that have escaped up to M, since they are reach-
able from within M. All other nodes that have been
captured within methods lying below M would not be
visible in the PTE graph for M. Hence the cluster al-
gorithm correctly identifies only those objects that are
going to live until the stack frame of M has been popped off, and is therefore bound to benefit the collection process.
The marked nodes in M which are not pointed to by
any other node in the PTE graph of M are said to be
the roots of the cluster. They serve the same function
as the key objects because they are the only way to
reach a cluster. When the key object is garbage, all
the objects connected to it are dead. Hence when the
stack frame for method M is popped, the root object
and hence the entire cluster associated with it is dead
and can therefore be reclaimed.
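Assuming the PTE graph representation sketched above, the essence of this step can be written as follows (an illustrative, simplified rendering of the pseudocode in Figure 4):

    using System.Collections.Generic;

    static class ClusterAnalysis
    {
        // Nodes with no incoming references in the complete PTE graph of a long
        // living method M are cluster roots; every node reachable from a root is a
        // cluster object, and its allocation site is rewritten to use newclus.
        public static List<PteNode> IdentifyCluster(PteGraph graphOfM)
        {
            List<PteNode> cluster = new List<PteNode>();
            foreach (PteNode node in graphOfM.Nodes)
            {
                if (node.Incoming.Count == 0)      // not pointed to by any other node: a root
                    MarkReachable(node, cluster);
            }
            return cluster;
        }

        static void MarkReachable(PteNode node, List<PteNode> cluster)
        {
            if (node.Visited) return;
            node.Visited = true;
            cluster.Add(node);
            foreach (PteEdge edge in node.Outgoing)
                MarkReachable(edge.Head, cluster);
        }
    }

The allocation sites associated with the returned nodes are the ones rewritten to perform cluster allocation.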
The clustering analysis algorithm is conservative in
the sense that some of the objects belonging to the
cluster might die before the stack frame containing the
root of the cluster is popped. This is especially true
in cases where a dynamically growing structure like
a stack or a list is part of the cluster. However, we
shall shortly see that even this naive approach of identifying clusters performs reasonably well for most
programs.
Figure 3: Pseudocode for inter- and intra-procedural analysis
3.1.2 Example
Figure 5 shows the local PTE graphs for two of the methods in 211 anagram. In the PTE graph for read_file, sif is captured. Despite being a local object, istr is linked to the dict variable by the library call dict.Add and hence becomes a part of the cluster. Since the reference is added outside the method, it is indicated as an outside edge. The intra-procedural analysis for read_file deems all nodes except sif as escaped. The dotted line in Figure 5 indicates how the nodes in run will be mapped onto the nodes of the callee read_file during inter-procedural analysis.
Figure 4: Pseudocode for identifying Clusters
Figure 5: Identifying clusters using PTE graphs.
Interprocedural analysis is followed by the application
of the clustering algorithm described earlier, which marks all the nodes in the graph that correspond to the cluster allocation sites. In this particular example, the
clustering algorithm accesses the complete PTE graph
of run and marks all nodes reachable from the node
representing agm as cluster nodes. agm is designated
as the root of the cluster. Since by definition each node
is associated with an object and hence with an alloca-
tion site producing that object, one can output the set
of allocation sites responsible for cluster allocation.
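To picture the code shape being analyzed, the following hypothetical C# fragment mimics the roles played by run, read_file, dict, sif, istr and agm; apart from these identifiers, everything in it is invented for illustration:

    using System.Collections;
    using System.IO;

    class AnagramExample
    {
        // run binds the lifetime of the cluster root agm: when run's stack frame
        // is popped, agm and every object reachable from it become garbage.
        static void run()
        {
            ArrayList agm = new ArrayList();             // hypothetical cluster root
            read_file("words.txt", agm);
            // ... the rest of the program works on agm ...
        }

        static void read_file(string path, ArrayList dict)
        {
            StreamReader sif = new StreamReader(path);   // captured: never escapes read_file
            string istr;
            while ((istr = sif.ReadLine()) != null)
                dict.Add(istr);                          // istr escapes via dict.Add and joins the cluster
            sif.Close();
        }
    }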
The fact that the analysis is compositional makes it
possible to analyze libraries independently of the ap-
plication. When analyzing an application, we use pre-
computed results for any library calls that it may make.
Since the clustering algorithm can access the precom-
puted results for the library calls, it is possible for
the algorithm to come up with cluster allocation sites
within the library code, as we saw with dict.Add in Figure
5. To support clustering completely, we create a new
library that consists of additional functions to support
cluster allocation.
Other changes to Rotor for implementing the cluster-
ing scheme include the introduction of two new op-
codes newclus and newst that are wired to perform al-
location in the cluster and in the stack respectively. In
this implementation, we have simulated the allocation
on the stack using a separate area apart from the heap
and the cluster area. To measure the impact of sim-
ulating the stack allocation we ran the programs with
a maximum heap size (so that there was no garbage
collection) and compared the elapsed times with the
baseline which has no stack allocation implemented.
On average, the overhead of the stack allocation simulation was found to be -2.1%, i.e., a marginal speedup.
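Conceptually, the cluster area behaves like a region with bump-pointer allocation and wholesale reclamation. A toy model of this behavior (illustrative only, and much simpler than the actual allocator inside the execution engine) is:

    using System;

    // Toy model of the mature object space: allocation is a pointer bump with no
    // per-object GC header bookkeeping, and reclamation releases the whole area
    // in one step once the stack frame binding the cluster root is popped.
    class ClusterArea
    {
        private readonly byte[] space;
        private int next;                          // bump pointer

        public ClusterArea(int sizeInBytes) { space = new byte[sizeInBytes]; }

        public int Allocate(int size)              // what newclus conceptually does
        {
            if (next + size > space.Length)
                throw new OutOfMemoryException("cluster area exhausted");
            int offset = next;
            next += size;
            return offset;
        }

        public void Reclaim() { next = 0; }        // reclaim the entire cluster at once
    }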
3.1.3 Issue with Boxing
In any implementation of CLI, when an instance of a
value type is passed as a parameter to a method that
expects a reference parameter, boxing is performed
[Ecm03]. Boxed objects are implicit and are not ev-
ident in C# source code. Since the clustering algo-
rithm works on the source code, it does not have a han-
dle to the boxed objects. Our implementation tackles
this problem by converting implicit boxing to explicit
boxing. We overload the existing methods that take
a reference as a parameter, to take value types also.
These additional methods now include code that per-
forms explicit boxing. So now the clustering algorithm
can access the boxed objects and include them in the
analysis.
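For example, an overloaded library method of this kind might look as follows (an illustrative sketch rather than the actual modified library code):

    using System.Collections;

    class WordTable
    {
        private readonly ArrayList items = new ArrayList();

        // Original method: passing a value type here causes implicit boxing,
        // and the boxed object is invisible to the source-level analysis.
        public void Add(object item) { items.Add(item); }

        // Added overload for value types: the boxing now appears as an explicit
        // allocation site that the clustering algorithm can see and, if needed,
        // redirect to the cluster area.
        public void Add(int value)
        {
            object boxed = (object)value;          // explicit boxing
            items.Add(boxed);
        }
    }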
4 METHODOLOGY
4.1 Baseline Collector
The baseline collector is designed to work on the prin-
ciples of concurrent replication collection [Too93].
It consists of two generations. The young genera-
tion is also known as newspace. This is where all
the new objects are allocated. The old generation comprises two semispaces: fromspace and tospace. Copying collection is used to collect both
generations. When allocation in the newspace crosses
a particular threshold, a minor collection is invoked
that scavenges the live objects into fromspace. Even-
tually the fromspace gets filled up to its threshold
value, which invokes a major collection that collects the entire heap.
Figure 6: Baseline Collector Organization with Clustering Incorporated.
Scavenging is a concurrent operation, hence the pro-
gram and the collector thread need to be synchronized
to ensure correctness. Our approach for synchronizing the program and the collector is an extension of Dijkstra's tricolor scheme. We associate
each object with a color that is used to indicate ob-
ject state information to both the collector and the pro-
gram. The details of the synchronization scheme can
be found in [Rav05].
All generational collectors are associated with a write
barrier [Hos92], a piece of code executed on every pointer write. We add the synchronization code
to the write barrier to support concurrency in the col-
lector. The baseline collector supports finalization,
weak pointers and interior pointers. However, unlike
the Rotor garbage collector, it does not support large object allocation and pinning. Incorporating cluster-
ing into the garbage collector adds a new mature object
space to the existing heap. The baseline collector can
also run in the stop-the-world mode. The final memory
model of the collector is as shown in Figure 6.
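The overall shape of such a barrier, including the cluster-to-heap bookkeeping discussed later in Section 5.2.4, can be sketched as follows (a simplified managed-code model; the actual barrier operates on raw references inside the execution engine and is detailed in [Rav05]):

    using System.Collections.Generic;

    enum Color { White, Grey, Black }              // Dijkstra-style tricolor states
    enum Region { NewSpace, OldSpace, Cluster }

    class ObjectHeader
    {
        public Color Color;
        public Region Region;
    }

    static class WriteBarrier
    {
        // remembered sets consulted at collection time
        public static readonly List<ObjectHeader> OldToNew = new List<ObjectHeader>();
        public static readonly List<ObjectHeader> ClusterToHeap = new List<ObjectHeader>();

        // Executed on every pointer store: source.field = target
        public static void RecordWrite(ObjectHeader source, ObjectHeader target)
        {
            // generational bookkeeping: old-to-young references become extra roots for a minor collection
            if (source.Region == Region.OldSpace && target.Region == Region.NewSpace)
                OldToNew.Add(source);

            // clustering bookkeeping: the cluster is never traced, so any reference it
            // holds into the heap must be remembered as a root
            if (source.Region == Region.Cluster && target.Region != Region.Cluster)
                ClusterToHeap.Add(source);

            // concurrency bookkeeping: preserve the tricolor invariant when the program
            // stores a white object into an already scanned (black) object
            if (source.Color == Color.Black && target.Color == Color.White)
                target.Color = Color.Grey;
        }
    }

Because code of this shape runs on every pointer store, its cost shows up directly in the elapsed times discussed in Section 5.2.6.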
4.2 Experimental Platform
This study was conducted on Rotor version 1.0 [Rot01].
We ran the programs on an Intel Pentium III 450 MHz processor with 128 MB of main memory and a 512 KB cache, running FreeBSD 4.5.
5 RESULTS
In this section, we evaluate the baseline collector by
comparing its performance with Rotor’s garbage col-
lector. We evaluate the clustering optimization w.r.t.
the collector and program performance. We also study
the impact of the stack allocation optimization along
with the clustering optimization. To carry out this
study, we used the C# versions of the Java programs
from Spec JVM98 [Spc98], Java Olden [Jolden], Java Grande [Jgrande] and the gc test suite provided with
Rotor [Rot01]. The benchmarks and their runtime pa-
rameters are summarized in Table 1.
5.1 Performance of the Concurrent Collector
In this section we describe the performance of the
baseline collector w.r.t. pause times and elapsed times.
The results for both the stop-the-world and concurrent
modes are presented. The heap sizes are chosen such
that both the Rotor garbage collector and the baseline
collector have the same number of collections.
5.1.1 Pause Time
The main objective in choosing a concurrent GC algorithm for the baseline collector was to reduce the pause
times. Almost all the programs report significant re-
ductions in pause times for the concurrent mode, ex-
cept for raytrace which shows an increase of 4.62%.
The average reduction in pause times for the concur-
rent mode is 36.24%. However pause times increase
by 2.14% on average when the collector is run in the
stop-the-world mode.
5.1.2 Elapsed Time
The baseline collector introduces a very small over-
head of 1.11% when run in the stop-the-world mode.
However, the overhead is slightly worse in the concur-
rent mode. That is because of the additional synchro-
nization code that needs to be executed. The average
overhead on the elapsed time is 1.75% for the concur-
rent mode. It can be observed that in spite of a substan-
tial improvement in the pause time, the elapsed times
do not change by much. That is because the collection
time constitutes a very small portion of the elapsed
time.
5.2 Performance of Clustering
In this section we describe the performance of the pro-
grams when the clustering optimization and the stack
allocation optimizations are performed. The programs
are run with the heap sizes as shown in Table 2. Clus-
tering reduces the total heap requirement by 12.6% on
average.
5.2.1 Reduction in the Number of Collections
Both clustering and the stack allocation optimizations
are geared towards reducing the load on the garbage
collector. For certain programs where the total popula-
tion of objects is dominated by clusters, clustering op-
timization yields a lot of benefit. For programs where
volatile objects dominate, stack allocation yields sim-
ilar benefit. The average reduction in the number of
collections is 75% with only the stack allocation optimization and 66.5% with only the clustering optimization. A combination
of the stack and cluster allocation yields the highest
reduction of collections at 91.56%. The results are the
same for the collector when operated in the concurrent
mode.
Source               Program        Runtime parameters
Rotor gc test suite  directedgraph  No. of vertices=100
Spec JVM98           208 cst        No. of iterations=1, speed=1
                     209 db         No. of iterations=1, speed=10
                     211 anagram    speed=1
                     210 si         speed=10
Java Olden           bisort         No. of nodes=4, size=2500
                     jhealth        MaxLevel=5, MaxTime=100, seed=23
                     power          No. of feeders=5, No. of laterals=10, No. of branches=3, No. of leaves=5
                     tsp            size=600
                     treeadd        No. of levels=16
Java Grande          raytrace       width=25, height=25
Table 1: Set of Benchmarks used and their Configuration
Program        Young gen  Old gen    Young gen with   Old gen with     Max cluster
               size (MB)  size (MB)  clustering (MB)  clustering (MB)  size (MB)
211 anagram    2          8          0.7              1.4              3.8
209 db         1          10         1                2                2.5
210 si         1          2          1                2                0.9
bisort         1          2          1                2                0.05
jhealth        1          2          0.3              0.6              2.6
208 cst        1          40         0.7              1.4              12.7
power          1          2          0.3              0.6              0.07
tsp            1          2          0.7              1.4              0.05
raytrace       0.8        1.6        0.8              1.6              3.6
directedgraph  1          2          1                2                0.15
treeadd        4          8          0.19             0.38             1.4
Table 2: Heap and Mature Object Space sizes
5.2.2 Reduction in Collection and Pause Times
One of the direct consequences of the reduction in the
number of collections is the reduction in the total col-
lection time and the total pause time. Reduction in the
number of objects scavenged also contributes to reduc-
tion in the collection time. The average reduction in
the total collection time using only the stack allocation
optimization is 60.9%; with only the cluster optimiza-
tion it is about 60.6%; with both optimizations on, the
reduction is about 79.27%. The corresponding average
reductions in the pause times are 63.55% with only the
stack allocation optimization, 62.82% with only the
cluster optimization and 79.33% with both optimiza-
tions applied.
When the collector operates in the concurrent mode,
the average reductions in pause times are 60.09%,
60.9% and 79.27% with only the stack allocation, only
the clustering optimization and both optimizations ap-
plied respectively.
5.2.3 Reduction in Copycounts
Once the clustering optimization is done, there is
a greater chance for a collection to find more garbage
than earlier. Since the long living clusters are ex-
empted from collection, only those objects that are rel-
atively volatile remain in the heap. This causes a re-
duction in the number of objects copied. Copy counts
can also reduce due to the reduction in the number of
collections as we saw in the previous section. Copy
counts reduce by almost 60.11% with only the stack al-
location optimization applied and by 91.37% with the
cluster optimization applied. A combination of both
reduces the copy counts further by 94.02%. The re-
sults are almost the same for the collector when oper-
ated in the concurrent mode.
5.2.4 Impact on Inter-region References
A profile of the inter-region references indicates very
minimal interaction between the cluster objects and
the heap objects (Table 3). The number of such clus-
ter to heap pointers is critical to the success of cluster-
ing.
Program        Total no. of cluster  Total interregion pointers  Total interregion pointers  % Reduction  % Garbage
               to heap references    without clustering          with clustering             in barriers  in cluster
211 anagram    5                     -                           -                           -            25
209 db         1                     6916                        14                          99.79        11.2
210 si         2                     44731                       39324                       12.08        19.38
bisort         0                     7                           7                           0            0
jhealth        0                     -                           -                           -            77.5
208 cst        6                     403912                      169319                      58.08        33.3
power          0                     1                           1                           0            0
tsp            0                     8                           8                           0            0
raytrace       3                     163798                      293                         99.82        99.5
directedgraph  0                     -                           -                           -            0.02
treeadd        0                     -                           -                           -            0
Table 3: Interregion References and Effectiveness of the Clustering Scheme
Figure 7: Impact on the Number of Collections
The cluster is reclaimed in its entirety and not collected as in the case of the heap, which is collected from
time to time. Just as we track inter-generational point-
ers to ensure complete collection, we need to track
cluster to heap pointers. Hence, if the number of such
cluster to heap pointers is large, the collection time
is bound to increase. The average number of cluster to
heap references that the clustering algorithm achieves
is 1.54. Clustering is also found to reduce the total
number of inter-region pointers as shown in Table 3.
The impact on the number of inter-region pointers is
studied only for those programs in which the number
of collections is reduced to a non-zero value with the
application of clustering. The average reduction in the
number of interregion pointers is found to be 33.72%.
5.2.5 Reduction in Allocation Times
The cluster allocation routine is straightforward and
need not populate objects with extra header informa-
tion, which would otherwise be required for heap objects. So the time required to allocate a cluster object is less than the time required to allocate a heap object. Clustering improves the total allocation time by 14.99% on average. Stack allocation improves the total allocation time by 12.61% on average. A combination of both optimizations results in an improvement of 20.63%.
Figure 8: Total Collection Times with Clustering and Stack Allocation Optimizations
5.2.6 Impact on Elapsed Time
Clustering has little effect on the total elapsed time; on average it increases the elapsed time by 0.44%. Us-
ing only the stack allocation optimization improves the
elapsed times by 1.75% on average. A combination of
both the optimizations improves the elapsed time by
1.018%. The main reason for this is that the collection
time is only a small portion of the overall elapsed time.
Only if there is a drastic improvement in the collection time do elapsed times improve visibly. For example, the number of collections in 208 cst with the clustering optimization decreases from 23 to 11; hence, in this case the elapsed time reduces by 12.19%.
Figure 9: Total Copy Counts with Clustering and Stack Allocation Optimizations
The other
reason is that the addition of a separate cluster area in-
troduces overheads w.r.t. the elapsed time. Since an
object can now reside in the cluster area apart from the
heap, the garbage collector code needs to recognize
objects in the cluster area and also in the stack, if the
stack allocation optimization is applied.
Additional barrier code to keep track of cluster to heap
pointers also contributes to increased elapsed times.
The effect on elapsed time is more or less the same for
the concurrent mode. Using just the escape analysis
optimization, the average elapsed times decrease by
only 0.377%; clustering increases the elapsed times by
1.19%. Using both optimizations the average elapsed
time decreases by 0.59%.
Figure 10: Impact of the Clustering and Stack Al-
location Optimizations on the Elapsed Time
5.2.7 Effectiveness of the Clustering Algorithm
To evaluate the effectiveness of the clustering algo-
rithm and to verify its claim of retaining genuinely
long living objects right up to the end, additional in-
strumentation is added to the code. At the time of
reclamation of the cluster area, instead of freeing it
up, the cluster area is collected to find the amount
of garbage generated within itself. The amount of
garbage generated in the cluster using our algorithm
is found to be 24.17% on average. Ideally it should be
0%.
The clustering algorithm presented here does not cap-
ture dynamic growth of clusters. More complex
pointer analysis is required to come up with an ideal
cluster. Since the clustering algorithm is by nature
static, allocation site homogeneity is an issue. For example, raytrace includes an allocation site that is
called in two different contexts. In one it creates a cap-
tured object, in the other it creates a cluster object. If
we decide to allocate the object in the heap, the num-
ber of cluster to heap references shoots up, thereby de-
grading the performance of the collector. On the other
hand, if we decide to cluster allocate the object, a huge
amount of garbage would be generated within the clus-
ter area due to the volatile nature of the object. In such
cases dynamic object colocation [Sam04] might per-
form better since it makes colocation decisions on the
fly at runtime.
6 CONCLUSIONS
For a garbage collector to work effectively, it has to be
aware of object properties and not just object traceabil-
ity. The compiler plays an important role in provid-
ing valuable information about object properties to the
garbage collector. This paper describes and evaluates
a compile time technique that recognizes clusters in a
program and statically allocates cluster objects sepa-
rately. Our results demonstrate that the clustering op-
timization reduces the number of collections consider-
ably and also improves the individual collection times
by a fairly large amount. When applied along with the
stack allocation optimization it produces even better
results. Clustering also reduces the total number of
interregion pointers. However, elapsed times do not
improve in the same vein as the collection times. Only
those programs in which there is a drastic reduction
in the number of collections show a considerable im-
provement in the elapsed time.
Future Work
Our work can be extended in several directions. The
current clustering algorithm identifies clusters that are
created only in those methods that have the longest
lifetimes. One can extend the clustering concept to
all other methods to discover scoped memory regions.
The current compiler analysis itself can be made more
sophisticated so that it not only outputs the allocation
sites but also provides information to the programmer
whether the cluster optimization would prove benefi-
cial for that program or not. Several parameters are
indicative of whether a cluster would prove to be an advantage or a penalty. Some of them are the num-
ber of cluster to heap references, allocation site homo-
geneity, the fraction of the objects that are allocated
in the cluster, and dynamic growth of clusters that might contribute garbage within the cluster area. However,
the compiler would require complex pointer analysis
to infer some of this information. Allocating cluster
objects in a separate area brings in the need for addi-
tional barrier code to track cluster to heap references.
Static analysis can be used to eliminate the write barri-
ers wherever unnecessary and hence improve elapsed
times.
7 ACKNOWLEDGEMENTS
This work was funded by Microsoft Research. We
thank the anonymous reviewers for their comments
and suggestions.
References
[App89] A.Appel, Simple generational garbage collection
and fast allocation. Software: Practice and Experience,
19(2):171-183, Feb 1989.
[Bla98] S. M. Blackburn, S. Singhai, M. Hertz, K. S.
McKinley, and J. E. B. Moss. Pretenuring for Java.
In ACM Conference on Object-Oriented Programming
Systems, Languages, and Applications, pages 342-352,
Tampa, FL, Oct. 2001. ACM.
[Che98] P.Cheng, R. Harper and P. Lee, Generational stack
collection and profile-driven pretenuring. In ACM Con-
ference on Programming Languages Design and Im-
plementation, pages 162-173, Montreal, Canada, May
1998.
[Det02] Morgan Deters and Ron K. Cytron, Automated
Discovery of Scoped Memory Regions for Real-Time
Java. In ACM International Symposium on Memory
Management, pages 25-35, Berlin, Germany, June 2002.
[Ecm03] ECMA C# and Common Language Infrastructure
Standards. http://msdn.microsoft.com/net/ecma/
[Gay01] D. Gay and A. Aiken. Language Support for Re-
gions. In Proceedings of the ACM SIGPLAN ’01 Con-
ference on Programming Language Design and Imple-
mentation, pages 70-80, 2001.
[Har00] T. L. Harris. Dynamic adaptive pre-tenuring. In
ACM International Symposium on Memory Manage-
ment, pages 127-136, Minneapolis, MN, Oct. 2000.
[Hay91] Barry Hayes. Using key object opportunism to col-
lect old objects. In Object-Oriented Programming, Sys-
tems, Languages, and Applications (OOPSLA), 1991.
[Hir02] M.Hirzel, J.Hinkel, A.Diwan and M.Hind, Under-
standing the connectivity of heap objects. In ACM In-
ternational Symposium on Memory Management, pages
36-49, Berlin, Germany, June 2002.
[Hir03] M.Hirzel, A.Diwan and M.Hertz. Connectivity
based garbage collection. In ACM Conference on Ob-
ject Oriented Programming Systems , Languages and
Applications, pages 359-373,Anaheim, CA, Oct 2003.
[Hos92] Antony L. Hosking, J. Eliot B. Moss and Darko
Stefanovic, A comparative performance evaluation of
write barrier implementations. In ACM Conference
on Object-Oriented Programming Systems, Languages,
and Applications, pages 92-109, 1992
[Jgrande] http://www.epcc.ed.ac.uk/javagrande
[Jolden] Brenden Cahoon, Java Olden benchmarks,
http://www.cs.utexas.edu/users/cahoon/
[Lie83] H. Lieberman and C. E. Hewitt. A real time garbage
collector based on the lifetimes of objects. Communica-
tions of the ACM, 26(6):419-429, 1983.
[McK99] D. Stefanovic, K. McKinley, and J. Moss. Age-
based garbage collection. In ACM Conference on
Object-Oriented Programming Systems, Languages,
and Applications, pages 370-381, Denver, CO, Nov.
1999.
[Rav05] Archana Ravindar and Y.N.Srikant. Design and
Implementation of a Concurrent Garbage Collector for
Rotor. Technical Report IISc-CSA-TR-2005-2, Dept of
Computer Science and Automation, IISc.
[Rot01] http://www.sscli.net
[Rug87] Cristina Ruggieri and Thomas P. Murtagh. Life-
time Analysis of Dynamically Allocated Objects, pages
285-293, ACM SIGPLAN’88
[Sam04] Samuel Z. Guyer and Kathryn S. McKinley. Find-
ing Your Cronies: Static Analysis for Dynamic Coloca-
tion. In ACM Conference on Object-Oriented Program-
ming Systems, Languages, and Applications, pages
237-250, Vancouver, British Columbia, Canada, Octo-
ber 2004.
[Shu02] Y. Shuf, M. Gupta, R. Bordawekar, and J. P. Singh.
Exploiting prolific types for memory management and
optimizations. In ACM Symposium on the Principles of
Programming Languages, pages 295-306, Portland, OR,
Jan. 2002.
[Spc98] http://www.spec.org/osg/jvm98
[Stu03] David Stutz, Ted Neward and Geoff Shilling.
Shared Source CLI Essentials.
[Too93] James O'Toole and Scott Nettles. Concurrent
Replicating Garbage Collection. In ACM Symposium
on LISP and Functional Programming
[Wha99] John Whaley and Martin Rinard. Compositional
Pointer and Escape Analysis for Java Programs, In ACM
Conference on Object-Oriented Programming Systems,
Languages, and Applications, pages 187-206, Denver,
CO, Nov. 1999.