
Lock reservation: Java locks can mostly do without atomic operations

Kiyokuni Kawachiya Akira Koseki Tamiya Onodera
IBM Research, Tokyo Research Laboratory
1623-14, Shimotsuruma, Yamato, Kanagawa 242-8502, Japan
{kawatiya,akoseki,tonodera}@jp.ibm.com
ABSTRACT
Because of the built-in support for multi-threaded program-
ming, Java programs perform many lock operations. Al-
though the overhead has been significantly reduced in the
recent virtual machines, one or more atomic operations are
required for acquiring and releasing an object’s lock even in
the fastest cases.
This paper presents a novel algorithm called lock reserva-
tion. It exploits thread locality of Java locks, which claims
that the locking sequence of a Java lock contains a very long
repetition of a specific thread. The algorithm allows locks to
be reserved for threads. When a thread attempts to acquire
a lock, it can do without any atomic operation if the lock is
reserved for the thread. Otherwise, it cancels the reservation
and falls back to a conventional locking algorithm.
We have evaluated an implementation of lock reservation
in IBM’s production virtual machine and compiler. The
results show that it achieved performance improvements up
to 53% in real Java programs.
Categories and Subject Descriptors
D.3.4 [Programming Languages]: Processors—optimiza-
tion
General Terms
Languages, Algorithms, Performance, Measurement, Exper-
imentation
Keywords
Java, synchronization, monitor, lock, reservation, thread lo-
cality, atomic operation
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
OOPSLA’02, November 4-8, 2002, Seattle, Washington, USA.
Copyright 2002 ACM 1-58113-417-1/02/0011...$5.00.
1. INTRODUCTION
One important characteristic of the Java programming
language [17] is the built-in support for multi-threaded programming.
For synchronization between independently executing
threads, Java adopts semantics based on monitors [11,
18], and has monitors associated with objects.
The language constructs for synchronization are synchro-
nized methods and blocks. When a thread executes a syn-
chronized method against an object or a synchronized block
with an object, the thread acquires the object’s lock be-
fore the execution and releases the lock after the execu-
tion. Thus, at most one thread can execute the synchronized
method or the synchronized block.
Because of the built-in support for multi-threaded pro-
gramming, libraries in Java tend to be designed to be thread-
safe, containing many methods declared as synchronized.
As a result, Java applications perform a significant number
of lock operations. It was reported that 19% of the total
execution time was wasted by thread synchronization in an
early version of the Java virtual machine [4].
Many techniques have since been proposed for optimiz-
ing locks in Java, which can be divided into two categories,
runtime techniques and compile-time techniques. The for-
mer attempts to make lock operations cheaper [2, 6, 13, 34],
while the latter attempts to eliminate lock operations [3, 9,
10, 12, 38, 44].
Almost all the runtime techniques follow the principle of
optimizing common cases. They exploit the observation that
Java locks are normally not contended, and optimize the
uncontended cases. These techniques allow a lock to be ac-
quired and released with only a few machine instructions
in the absence of contention. However, the instruction se-
quence inevitably contains one or more compound atomic
operations such as compare_and_swap. Considering that
atomic operations are especially expensive in modern archi-
tectures, the synchronization has not yet become sufficiently
light, though the overhead has significantly been reduced.
This paper proposes a new runtime technique called lock
reservation. It also follows the principle of optimizing com-
mon cases. The observation exploited is the biased distri-
bution of lockers called thread locality. That is, for a given
object, the lock tends to be dominantly acquired and re-
leased by a specific thread, which is obviously the case in
single-threaded applications¹.
The key idea is to allow a lock to be reserved for a thread.
When a thread attempts to acquire an object’s lock, the
acquisition is ultra-fast if the lock is reserved for the thread.
In particular, it does not require any atomic operation. On
the other hand, if the lock is reserved for another thread,
the reservation must first be canceled, and the acquisition
falls back to an existing algorithm.

¹ Java virtual machines may create internal helper threads,
so Java programs can never be single-threaded in the
strict sense.

Table 1: Benchmark programs
Program name      Multi-threaded?  Description
SPECjvm98                          Run each program three times in the application mode.
  _202_jess       No               Expert shell system solving a set of puzzles
  _201_compress   No               LZW compression and decompression
  _209_db         No               Perform database functions on memory resident database
  _222_mpegaudio  No               Decompress MP3 audio files
  _228_jack       No               Parser generator generating itself
  _213_javac      No               Java source-to-bytecode compiler from the JDK 1.0.2
  _227_mtrt       Yes              Two-threaded ray tracer
SPECjbb2000       Yes              Simulate the operations of a TPC-C like business logic, run for 8 warehouses.
Volano Server     Yes              Chat room simulator
Volano Client     Yes              Chat client, creating 200 connections and sending 100 messages per connection.

[Figure 1 summary. Locking sequences of two objects, Object 1 and Object 2, each running from creation to garbage collection; a letter X denotes that thread X acquires the lock. One sequence is labeled "Exploitable locality" and the other "Difficult-to-exploit locality".]
Figure 1: General thread locality and exploitable thread locality

As we see later, lock reservation can be built on any ex-
isting locking algorithm, as long as it uses a word or field in
the object header and has one available bit. This bit is used
for representing the reservation status. When the status bit
is set, the meaning of the rest of the bits is defined by our
lock reservation algorithm, while when the bit is not set, the
meaning is defined by the underlying algorithm.
The rest of the paper is organized as follows. Section 2
shows the thread locality of locks in real Java programs. Sec-
tion 3 describes the algorithm of lock reservation. Section 4
presents performance results, while Section 5 discusses the
related work. Finally, Section 6 offers conclusions.
2. THREAD LOCALITY OF JAVA LOCKS
This section studies the thread locality of Java locks, which
we exploit for reducing the synchronization overhead of Java
programs. Thread locality of a lock is defined in terms of
the locking sequence, the sequence of threads (in temporal
order) that acquire the lock. The general form of thread
locality is stated as follows. For a given lock, if its locking
sequence contains a very long repetition of a specific thread,
the lock is said to exhibit thread locality, while the specific
thread is said to be the dominant locker.
However, the general form of thread locality is not easy
to exploit, since we consider adaptive optimization of locks
rather than static optimization using off-line profiles. When
the locking sequence of a lock is currently being constructed,
it is very hard for the runtime system to cheaply determine
whether the lock exhibits thread locality or whether the cur-
rent locker is the dominant locker.
Table 2: Exploitable thread locality of Java locks
Program name      Number of sync’d   Number of lock   Ratio of lock ops.
                  objects            operations       in 1st repetitions
SPECjvm98
  _202_jess               21278          14646978     99.993%
  _201_compress            2135             28895     97.211%
  _209_db                 66592         162117521     99.9998%
  _222_mpegaudio           1620             27168     98.108%
  _228_jack             1635497          38570415     99.998%
  _213_javac            1192734          47062772     99.974%
  _227_mtrt                3020           3522926     99.557%
SPECjbb2000²            2077210         102282147     79.392%
Volano Server              7279           7244208     75.983%
Volano Client              4102          10419671     84.270%

Thus, a stronger form of thread locality is considered for
exploitability, which is described as follows. For a given lock,
if the locking sequence starts with a very long repetition of
a specific thread, the lock is said to show exploitable thread
locality. When the lock exhibits exploitable thread locality,
the initial locker is the dominant locker. Figure 1 shows two
objects, one with general but not exploitable locality, and
the other with exploitable locality.
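
To make the definition concrete, the following C sketch computes the length of the first repetition of a locking sequence and the fraction of lock operations it covers. It is an illustration only and not the instrumentation used for the measurements; the function names and the encoding of a locking sequence as an array of integer thread IDs are assumptions made here.

```c
#include <stddef.h>

/* Length of the first repetition: the leading run of lock acquisitions
 * performed by the initial locker. Returns 0 for an empty sequence.
 * (Hypothetical helper; seq[i] is the ID of the i-th thread to lock.) */
size_t first_repetition_length(const int *seq, size_t n)
{
    if (n == 0)
        return 0;
    size_t len = 1;
    while (len < n && seq[len] == seq[0])
        len++;
    return len;
}

/* Fraction of the lock operations on one lock that fall in the first
 * repetition. A value close to 1.0 indicates exploitable thread locality. */
double first_repetition_ratio(const int *seq, size_t n)
{
    if (n == 0)
        return 0.0;
    return (double)first_repetition_length(seq, n) / (double)n;
}
```

For the sequence 3, 3, 3, 3, 2, 3 the first repetition has length 4, although thread 3 appears again later. A per-program figure would sum the first-repetition lengths and the sequence lengths over all locks before dividing.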
To investigate how many objects show exploitable thread
locality in real programs, we gathered locking statistics us-
ing an instrumented version of the IBM Development Kit
for Windows, Java Technology Edition, Version 1.3.1 [20].
We measured the Java programs listed in Table 1 — the
seven programs of the SPECjvm98 [40], the SPECjbb2000
[39] for eight warehouses, and the server and client programs
of the Volano Mark [43]. Among these programs, _227_mtrt,
SPECjbb2000, and the Volano Mark are multi-threaded pro-
grams. We ran these programs with the JIT compiler dis-
abled, since some locks would otherwise be optimized away
by compiler optimizations.
The focus in our measurements is the first repetition in
the locking sequence of each lock. This is the beginning
subsequence consisting only of the initial locker³. If the first
repetition of a lock is very long, the lock shows exploitable
thread locality. Table 2 presents the results⁴, including the
total number of synchronized objects, the total number of
lock operations, and the ratios of lock operations in the first
repetitions. As shown in the table, the vast majority of lock
operations are performed by the initial lockers. Even for
multi-threaded programs, more than 75% of the lock operations
were performed by the initial lockers in the first repetitions.
Thus, we can draw the conclusion that a significant
number of objects exhibit exploitable thread locality.
Notice that the ratios in the last column are not 100% even
for single-threaded programs, since the virtual machine creates
system threads for internal tasks such as finalization.
We also note that the initial locker of an object is not necessarily
the creator of the object. This happens in the Volano
Mark programs, where a single thread is dedicated to creating
objects and passing them to worker threads that actually
use the objects.

² The total number of locks for SPECjbb2000 varies depending
on the execution speed.
³ The length of the first repetition may be one. Also, the
initial locker may appear again after the first repetition.
⁴ The results shown here are for the complete execution of
each program, including lock operations during the program
startup and shutdown.

[Figure 2 summary. Lockword structure: a thread ID (tid) field, a recursion count (rcnt) field, and the LRV bit. Reserve mode (LRV bit = 1): (a) reserved for Thread A, but not held: tid = A, rcnt = 0; (b) reserved for and held by Thread A: tid = A, rcnt > 0; (c) reserved anonymously (will be reserved by the initial locker): tid = 0, rcnt = 0. Base mode (LRV bit = 0): the remaining bits are defined by the base lock.]
Figure 2: Lockword structure and semantics

3. LOCK RESERVATION
This section presents a new locking algorithm called lock
reservation. It exploits the observation that Java locks show
thread locality, as discussed in the previous section. The key
idea is to reserve locks for threads. When a thread attempts
to acquire an object’s lock, one of the following actions is
taken:
1. If the object’s lock is reserved for the thread, the runtime
system allows the thread to acquire the lock with a few
instructions involving no atomic operation.
2. If the object’s lock is reserved for another thread, the
runtime system cancels the reservation, and falls back to
a conventional algorithm for further processing.
3. If the object’s lock is not reserved, the runtime system
uses a conventional algorithm.
Our algorithm can be built on any existing locking algorithm,
as long as it uses a lockword⁵, a word in the object
header for locking, and allows one bit to be available in the
lockword. The bit is used for representing the lock reservation
status, and hence named the LRV bit. When the LRV
bit is set, the lockword is in the reserve mode, and the structure
is defined by our algorithm. When the bit is not set,
the lockword is in the base mode, and the structure is defined
by the underlying algorithm that the runtime system
falls back to after canceling the reservation.

⁵ Actually, we don’t need the whole 32 bits of the word,
and could put in the word other information unrelated to
locking. However, for the sake of explanation, we assume
that the whole word is used for locking.

[Figure 3 summary. State transition diagram of the lockword. At object creation the lockword is anonymously reserved (tid = 0, rcnt = 0, LRV bit = 1). The first acquire (initial synchronization) reserves the lock for Thread A. Within the reserve mode, acquire and release move between the states "reserved for Thread A" (rcnt = 0), "acquired" (rcnt = 1), and "recursively acquired" (rcnt = 2, ...). Unreserve transitions lead from reserve-mode states to the corresponding base-mode states (LRV bit = 0), which are handled by the base locking algorithm.]
Figure 3: Lock state transitions

3.1 Lockword Structure
Figure 2 shows the structure of the lockword. When the
LRV bit is set, the lockword is in the reserve mode, and is
further divided into the thread identifier (tid) field and the
recursion count (rcnt) field. The former field contains an
identifier of the owner thread, for which the lock is reserved,
while the latter field keeps the lock recursion level. When
the rcnt field is zero, the lock is reserved but not held by
any thread (Figure 2(a)). When the field is non-zero, the
lock is held by the owner thread (Figure 2(b)). As we will
see later, the owner thread can acquire the lock by simply
incrementing the rcnt field, with no atomic operation.
The rcnt field is also intended for recursive locking, which
is fairly common in Java. The owner thread acquires the lock
recursively by simply incrementing the rcnt field, in just the
same manner as it initially acquires the lock. We must main-
tain the recursion count of a lock since Java does not allow
a thread to release a lock more times than it acquires the
lock. The virtual machine must detect such an illegal state
and raise an instance of IllegalMonitorStateException.
When an object is created, the lock is anonymously reserved.
That is, the lockword is in the reserve mode, but not
reserved for or held by any particular thread (Figure 2(c)).
This is because the thread for which the lock should be re-
served is normally not known at the time of creation.
In general, a reservation policy determines when and for
which thread a lock is reserved. Since we base our algorithm
on exploitable thread locality from the previous section, we
use the initial-locker policy in our algorithm. That is, when
an object is locked for the first time by a thread, we reserve
the object’s lock for that thread.
When the reservation is canceled, the LRV bit is reset,
and the lockword is put in the base mode. The structure
is completely defined by the base algorithm. As we will see
later, canceling a reservation is the most challenging part of
our algorithm, requiring the owner thread to be suspended.
The cancellation replaces the lockword in the reserve mode
with the corresponding state in the base algorithm.
Figure 3 depicts the state transitions of the lockword in
our algorithm.
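
As an illustration of the lockword semantics in Figure 2, the following C sketch packs the three fields into a 32-bit word and classifies a word into one of the four states. The field widths, the bit positions, and all names are assumptions made for this sketch; the paper leaves the widths N and M unspecified, and the actual layout depends on the base algorithm.

```c
#include <stdint.h>

/* Assumed layout: [ tid : 24 | rcnt : 7 | LRV : 1 ] */
#define LW_LRV_BIT    0x1u
#define LW_RCNT_SHIFT 1
#define LW_RCNT_BITS  7
#define LW_RCNT_MASK  ((1u << LW_RCNT_BITS) - 1u)
#define LW_TID_SHIFT  (LW_RCNT_SHIFT + LW_RCNT_BITS)

typedef uint32_t lockword_t;

/* Build a reserve-mode lockword; tid must fit in 24 bits. */
static inline lockword_t lw_make(uint32_t tid, uint32_t rcnt)
{
    return (tid << LW_TID_SHIFT) | (rcnt << LW_RCNT_SHIFT) | LW_LRV_BIT;
}

static inline int      lw_is_reserve_mode(lockword_t w) { return (w & LW_LRV_BIT) != 0; }
static inline uint32_t lw_tid(lockword_t w)  { return w >> LW_TID_SHIFT; }
static inline uint32_t lw_rcnt(lockword_t w) { return (w >> LW_RCNT_SHIFT) & LW_RCNT_MASK; }

enum lw_state {
    LW_BASE,           /* LRV bit clear: defined by the base algorithm */
    LW_ANONYMOUS,      /* Figure 2(c): reserved, no owner yet          */
    LW_RESERVED_FREE,  /* Figure 2(a): reserved for tid, not held      */
    LW_RESERVED_HELD   /* Figure 2(b): reserved for and held by tid    */
};

enum lw_state lw_classify(lockword_t w)
{
    if (!lw_is_reserve_mode(w)) return LW_BASE;
    if (lw_tid(w) == 0)         return LW_ANONYMOUS;
    if (lw_rcnt(w) == 0)        return LW_RESERVED_FREE;
    return LW_RESERVED_HELD;
}
```

With this layout a freshly created object would carry the word `lw_make(0, 0)`, which has only the LRV bit set, and thread ID 0 is never assigned to a real thread.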
3.2 Algorithm
Figure 4 shows the algorithm of lock reservation⁶. A
thread attempting to acquire an object’s lock calls the
acquire() function, where it reads the lockword, and performs
four checks to see if it is not in a special state (lines 21–24).
If it passes all the checks, the lock is in the most common
state where the thread owns the lock’s reservation. It com-
pletes the lock acquisition by simply incrementing the rcnt
field (line 28).
Similarly, a thread attempting to release an object’s lock
calls the release() function, where it first reads the lock-
word, and performs three checks to see if it is not in a spe-
cial state (lines 52–54). When it passes all the checks, the
function finishes the lock release by simply decrementing the
rcnt field (line 58). Thus, it only takes a few non-atomic in-
structions to acquire and release a lock in the most common
case when the thread owns the reservation.
There are three special cases in the acquire() function.
First, when the lock is anonymously reserved (line 22), the
function attempts to make it specifically reserved by using
compare_and_swap (line 33). Second, when the lock is re-
served for another thread (line 23), the thread calls the
unreserve() function to cancel the reservation (line 37),
and falls back to the base algorithm. This second spe-
cial case also results when the thread owns the reservation
but the recursion count has reached the maximum value
(line 24). Third, when the lockword is not in the reserve
mode (line 21), the thread executes the corresponding func-
tion of the base algorithm (line 40).
There is only one legal special case in the release() func-
tion. That is, when the lockword is not in the reserve mode
(line 52), the function invokes the corresponding function in
the base algorithm (line 65). The Java specification [17]
requires that, when a thread attempts to release a lock,
the thread actually holds the lock. Otherwise the runtime
system must raise an instance of IllegalMonitorStateException.
The checks in lines 53 and 54 detect the illegal
state in the reserve mode.
We now explain cancellation of a reservation, the most
complicated part of our algorithm, which the unreserve()
function is responsible for. Basically, a thread calls the func-
tion when the thread attempts to acquire a lock which is
reserved for another thread⁷. The calling thread atomically
replaces the lockword in the reserve mode with the
equivalent state in the base algorithm. In doing so, it first
suspends the owner thread (line 74), modifies the lockword
using the atomic operation (line 80), and resumes the sus-
pended thread (line 90).
Special care must be taken when the owner thread is in
the middle of the acquire() or release() functions, more
specifically, when it is in one of the unsafe regions which are
between the read and write of the lockword in the acquire()
(lines 18–28) and release() functions (lines 49–58). To
avoid a data race condition, the unreserve() function obtains
the execution context of the suspended thread (line 83)
to see whether the thread is in one of the unsafe regions. If
it is in an unsafe region, the function modifies the program
counter with the address of the corresponding retry point
(line 17 or 48). Notice that each unsafe region was carefully
made restartable by preventing any side effects from
occurring.

⁶ For readability, the code shown here is slightly different
from the actual code. For instance, the condition checks
in the beginning of the acquire() and release() functions
are merged into two checks in the actual code. Also, the
base_acquire() and base_release() functions are tightly
coupled with the acquire() and release() functions, respectively.
⁷ The unreserve() function is also called when the rcnt is
about to overflow or when the wait() method is called.

Finally, after a lock’s reservation is canceled, our algo-
rithm does not return the lock back to the reserve mode.
The algorithm supporting repeated reservation would be-
come too complicated, while it might result in more cancel-
lations and degrade performance. In addition, the investiga-
tions in the previous section show that most lock operations
can be performed in the reserve mode even without repeated
reservation.
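
The transitions described in this section can be traced with a small sequential model. The C sketch below is not the algorithm of Figure 4: it runs in a single thread, so it models neither thread suspension nor atomic operations, and it replaces the base algorithm with a toy owner-and-count lock that reports M_BUSY instead of blocking. All names and the value of M_RCNT_MAX are assumptions made for illustration.

```c
enum { M_SUCCESS = 0, M_ILLEGAL_STATE = -1, M_BUSY = -2 };

#define M_RCNT_MAX 255u            /* assumed maximum recursion count */

struct model_lock {
    unsigned tid;                  /* owner thread ID, 0 = none        */
    unsigned rcnt;                 /* recursion count                  */
    unsigned reserve;              /* 1 = reserve mode, 0 = base mode  */
};

/* Stand-in for unreserve(): clear the LRV bit and keep the equivalent
 * held or free state for the toy base lock. Never sets it again. */
static void model_unreserve(struct model_lock *l)
{
    l->reserve = 0;
    if (l->rcnt == 0)
        l->tid = 0;
}

int model_acquire(struct model_lock *l, unsigned me)
{
    if (l->reserve) {
        if (l->tid == 0) {                         /* anonymous: make specific */
            l->tid = me; l->rcnt = 1;
            return M_SUCCESS;
        }
        if (l->tid == me && l->rcnt < M_RCNT_MAX) { /* fast path */
            l->rcnt++;
            return M_SUCCESS;
        }
        model_unreserve(l);                        /* other owner, or rcnt at maximum */
    }
    /* toy base lock */
    if (l->rcnt == 0) { l->tid = me; l->rcnt = 1; return M_SUCCESS; }
    if (l->tid == me) { l->rcnt++; return M_SUCCESS; }
    return M_BUSY;                                 /* a real base lock would block */
}

int model_release(struct model_lock *l, unsigned me)
{
    if (l->tid != me || l->rcnt == 0)
        return M_ILLEGAL_STATE;                    /* caller does not hold the lock */
    l->rcnt--;
    if (!l->reserve && l->rcnt == 0)
        l->tid = 0;                                /* base mode: no owner when free */
    return M_SUCCESS;
}
```

Starting from an anonymously reserved lock `{0, 0, 1}`, an acquire by thread 7 reserves the lock for it, and a later acquire by thread 9 clears the reserve flag for good, so every subsequent operation takes the base path.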
3.3 Correctness
We now discuss the correctness of our algorithm. As
we have shown, a thread does not have to execute any
atomic operation in acquiring and releasing a lock when it
owns the reservation. In other words, the owner thread can
read-modify-write the lockword without atomic operations.
Thus, when a different thread attempts to change the lock-
word between the read and the write, special care must be
taken to prevent the modification from being lost. The lock
state would otherwise become inconsistent.
When a thread does not own a lock’s reservation, our algo-
rithm requires the thread to call the unreserve() function,
where the thread without the reservation modifies the lock-
word after suspending the owner thread. When the latter
thread is suspended in the middle of an unsafe region, it
is forced to restart the unsafe region, detecting that it no
longer has the reservation. This prevents the thread from
continuing the execution based on the no-longer-valid as-
sumption that the thread still owns the reservation.
The owner thread may have already completed the com-
putation and ceased to exist when another thread attempts
to cancel a reservation. Although the unreserve() must
also handle this case properly, there is no risk of a data race
condition involving the owner thread.
More than one thread may simultaneously try to make
an anonymous reservation specific (line 33) or try to con-
vert the lockword in the reserve mode to the base mode
(line 80). However, it is guaranteed that only one thread
succeeds since atomic operations are used in both cases.
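
The exclusivity argument for making an anonymous reservation specific can be sketched with C11 atomics, which stand in here for the compare_and_swap primitive of Figure 4. The word layout (thread ID above bit 8, recursion count in bits 1 to 7, LRV bit in bit 0) and the function names are assumptions made for this sketch.

```c
#include <stdatomic.h>
#include <stdint.h>

#define CAS_LRV_BIT   0x1u
#define CAS_ANONYMOUS CAS_LRV_BIT            /* [0:0:1] */

/* Reserve-mode word under the assumed layout. */
static inline uint32_t cas_reserved_word(uint32_t tid, uint32_t rcnt)
{
    return (tid << 8) | (rcnt << 1) | CAS_LRV_BIT;
}

/* Try to turn an anonymous reservation into one held by my_tid,
 * i.e. [0:0:1] -> [my_tid:1:1]. Returns 1 on success and 0 if the
 * word was no longer anonymous, in which case the caller would
 * re-read the lockword and retry the acquisition. */
int try_make_specific(_Atomic uint32_t *lockword, uint32_t my_tid)
{
    uint32_t expected = CAS_ANONYMOUS;
    return atomic_compare_exchange_strong(lockword, &expected,
                                          cas_reserved_word(my_tid, 1));
}
```

Because the exchange succeeds only while the word still holds the anonymous value, a second caller observes the first caller's reservation and fails, whichever thread arrives first.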
Once the reservation is canceled, the lockword will be
never reserved again. Thus, after the cancellation, our al-
gorithm behaves in exactly the same manner as the base
algorithm, and the correctness is ensured by the correctness
of the base algorithm.
  1 : // Lockword structure in each object header
  2 : struct Object {
  3 :   :
  4 :   struct lockword {              // [tid:rcnt:R]
  5 :     unsigned int tid     : N;    // Thread ID of the owner thread.
  6 :     unsigned int rcnt    : M;    // Recursion count. Non-zero denotes that the lock is acquired.
  7 :     unsigned int reserve : 1;    // LRV bit. One denotes that the lock is reserved.
  8 :   } lockword;
  9 :   :
 10 : };
 11 :
 12 : int acquire(struct Object *obj)
 13 : {
 14 :   struct lockword l1, l2;
 15 :   int myTID = thread_id();
 16 :
 17 : retry_acquire:
 18 :   l1 = obj->lockword;                                       // read the lockword ---(1), unsafe region begins
 19 :
 20 :   // check special cases
 21 :   if (l1.reserve == 0)     goto base_acquire;               // [xxxxxx:0] not reserved
 22 :   if (l1.tid == 0)         goto make_specific;              // [0:0:1] anonymously reserved
 23 :   if (l1.tid != myTID)     goto unreserve_and_base_acquire; // [other:xxx:1] reserved for another thread
 24 :   if (l1.rcnt == RCNT_MAX) goto unreserve_and_base_acquire; // [myTID:max:1] rcnt reached the maximum
 25 :
 26 :   // reserved for me, and rcnt does not reach the maximum
 27 :   l2 = l1; l2.rcnt++;                                       // [myTID:rcnt:1] -> [myTID:rcnt+1:1]
 28 :   obj->lockword = l2;                                       // write the lockword ---(2), unsafe region ends
 29 :   return SUCCESS;
 30 :
 31 : make_specific:
 32 :   l2 = l1; l2.tid = myTID; l2.rcnt = 1;
 33 :   if (compare_and_swap(&obj->lockword, l1, l2) != SUCCESS) goto retry_acquire; // [0:0:1] -> [myTID:1:1]
 34 :   return SUCCESS;
 35 :
 36 : unreserve_and_base_acquire:
 37 :   unreserve(obj, l1.tid, myTID);                            // [xxx:xxx:1] -> [xxxxxx:0]
 38 :
 39 : base_acquire:
 40 :   return base_acquire(obj);                                 // if not reserved, call the function for the base mode
 41 : }
 42 :
 43 : int release(struct Object *obj)
 44 : {
 45 :   struct lockword l1, l2;
 46 :   int myTID = thread_id();
 47 :
 48 : retry_release:
 49 :   l1 = obj->lockword;                                       // read the lockword ---(1), unsafe region begins
 50 :
 51 :   // check special cases
 52 :   if (l1.reserve == 0) goto base_release;                   // [xxxxxx:0] not reserved
 53 :   if (l1.tid != myTID) goto illegal_state;                  // [other:xxx:1] reserved for another thread
 54 :   if (l1.rcnt == 0)    goto illegal_state;                  // [myTID:0:1] rcnt is zero
 55 :
 56 :   // reserved for and held by me
 57 :   l2 = l1; l2.rcnt--;                                       // [myTID:rcnt:1] -> [myTID:rcnt-1:1]
 58 :   obj->lockword = l2;                                       // write the lockword ---(2), unsafe region ends
 59 :   return SUCCESS;
 60 :
 61 : illegal_state:
 62 :   return IllegalMonitorStateException;
 63 :
 64 : base_release:
 65 :   return base_release(obj);                                 // if not reserved, call the function for the base mode
 66 : }
 67 :
 68 : void unreserve(struct Object *obj, int ownerTID, int myTID)
 69 : {
 70 :   struct lockword l1, l2;
 71 :   struct Context context;
 72 :
 73 :   if (ownerTID == myTID) ownerTID = 0;                      // don’t suspend myself
 74 :   thread_suspend(ownerTID);                                 // no-op when the target thread does not exist
 75 :
 76 : retry_unreserve:
 77 :   l1 = obj->lockword;
 78 :   if (l1.reserve == 0) goto already_unreserved;             // already unreserved by someone
 79 :   l2 = base_equivalent_lockword(l1);                        // create the equivalent lock state in the base mode
 80 :   if (compare_and_swap(&obj->lockword, l1, l2) != SUCCESS) goto retry_unreserve; // [xxx:xxx:1] -> [xxxxxx:0]
 81 :
 82 :   // modify the owner thread’s context if it is in an unsafe region
 83 :   if (thread_get_context(ownerTID, &context) == SUCCESS) {
 84 :     if (in_unsafe_region(context.pc)) {                     // check if (1) < next PC <= (2)
 85 :       context.pc = retry_point(context.pc);                 // move the PC to the corresponding retry point
 86 :       thread_set_context(ownerTID, &context);
 87 :   } }
 88 :
 89 : already_unreserved:
 90 :   thread_resume(ownerTID);
 91 : }
Note. Each of the thread-manipulating functions (thread_suspend(), thread_resume(), thread_get_context(), and
thread_set_context()) does nothing and returns FAIL if the target thread does not exist. The thread_suspend() function can be called multiple times,
where the target thread will be resumed after thread_resume() is called the same number of times. Note that the thread_suspend()
and thread_resume() functions are unrelated to the deprecated Java methods suspend() and resume() in the java.lang.Thread class.
Figure 4: Algorithm of lock reservation

3.4 Discussion
This section considers the performance characteristics of
lock reservation, discusses in detail how to determine whether
a thread has been suspended in the middle of an unsafe region
and how to cancel reservations, and explains multiprocessor
issues.
Performance Characteristics
Our algorithm is strongly expected to reduce the synchronization
overhead when the reservation succeeds, since the
owner thread can acquire and release the lock by simply
reading and writing the lockword without any atomic operations.
When a lock is not reserved, our algorithm falls back to
the base algorithm with almost no additional overhead. It
simply requires two additional checks, one in the acquire()
function (line 21) and the other in the release() function
(line 52). However, depending on the details of the base
algorithm, we can completely eliminate the additional over-
head. That is, if the base algorithm starts the lock acquisi-
tion and the lock release by testing one or more bits in the
lockword, we may be able to merge the additional checks of
our algorithm into the testing. Actually, this is the case in
our implementation which we present in Section 4.
The greatest concern in terms of performance is reserva-
tion cancellation in the unreserve() function, which relies
on expensive system calls such as thread_suspend() and
thread_get_context(). However, since we do not reserve
locks repeatedly, the cancellation occurs at most once dur-
ing the lifetime of an object. As we will show in the next
section, the ratios of cancellations to lock operations are less
than 0.05% in actual lock-intensive programs. Thus, we be-
lieve that performance loss from cancellation does not offset
the performance gain by reservation success except for arti-
ficially created pathological benchmarks.
Unsafe Regions
If a thread always acquires and releases an object’s lock
by calling the runtime functions acquire() and release(),
respectively, we have only two unsafe regions in the virtual
machine. The in_unsafe_region() (line 84) has only to
perform two range checks, which is easy to implement.
However, the JIT compiler may inline the synchronization
operations into the generated code. This results in many
unsafe regions in the virtual machine, which we must register
in a data structure with the corresponding retry addresses.
Given a program counter, the in_unsafe_region() function
searches the data structure to see if the program counter
points to any unsafe region.
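
One possible shape for such a data structure is a table of regions sorted by address and searched by bisection. The C sketch below is an assumption about how the lookup could be organized, not a description of the implementation; the structure and function names are hypothetical, and the regions are assumed to be non-overlapping. It follows the condition in line 84 of Figure 4, (1) < PC <= (2).

```c
#include <stddef.h>
#include <stdint.h>

struct unsafe_region {
    uintptr_t start;   /* address of the lockword read, point (1)  */
    uintptr_t end;     /* address of the lockword write, point (2) */
    uintptr_t retry;   /* address of the matching retry point      */
};

/* Look up pc in a table sorted by start address. Returns the retry
 * address if start < pc <= end for some region, or 0 if the thread
 * was suspended outside every unsafe region. */
uintptr_t find_retry_point(const struct unsafe_region *tab, size_t n,
                           uintptr_t pc)
{
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (pc <= tab[mid].start)
            hi = mid;                 /* at or before (1): look left   */
        else if (pc > tab[mid].end)
            lo = mid + 1;             /* past (2): look right          */
        else
            return tab[mid].retry;    /* (1) < pc <= (2)               */
    }
    return 0;
}
```

Returning the retry address directly lets a single search serve both in_unsafe_region() and retry_point(), at the cost of reserving 0 as a "not found" value.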
Alternatively, we could use the designated sequences ap-
proach by Bershad et al. [8]. That is, the JIT compiler em-
beds some landmark no-op around each unsafe region, while
the in_unsafe_region() function compares the instruction
stream of a suspended thread against the landmark no-op
to determine if the thread is in an unsafe region.
Whatever techniques are used, we need to obtain the pro-
gram counter of a suspended thread by invoking an appro-
priate system call, which is expensive in most operating sys-
tems. Thus, it is desirable to reduce the number of calls to
the thread_get_context() function. If the virtual machine
provides a fast way to see if the thread is in the module of
compiled code, we can reduce the number by creating unsafe
regions only in the compiled code.
Some virtual machines allow us to cheaply determine,
without the program counter, whether a thread has been
suspended within the module of compiled code. For in-
stance, the virtual machine we use in Section 4 maintains a
thread local variable for the thread’s execution mode. The
variable takes values such as EXECUTING_COMPILED_CODE, COMPILING,
and INTERPRETING. We can thus know if the thread
is in the module of compiled code by simply checking the
current value of the thread local variable.
On the other hand, we can confine unsafe regions to the
module of compiled code as follows. In general, we can
convert an unsafe region into a safe region by modifying
the lockword with a compare_and_swap even in the reserve
mode (lines 28 and 58). Although we should not make such
conversions for frequently executed unsafe regions, it is rea-
sonable to convert the unsafe regions in the Java bytecode
interpreter and other performance insensitive components.
Putting these two things together, we can, in our vir-
tual machine, use the following sequence in the unreserve()
function.
  82 : // modify the owner thread’s context if necessary
| 82a: if (get_exec_mode(ownerTID) == EXECUTING_COMPILED_CODE) {
  83 :   if (thread_get_context(ownerTID, &context) == SUCCESS) {
  84 :     if (in_unsafe_region(context.pc)) {
  85 :       context.pc = retry_point(context.pc);
  86 :       thread_set_context(ownerTID, &context);
  87 :   } }
| 87a: }
  88 :
The quick check in line 82a is expected to filter out many
uninteresting cases, resulting in many fewer calls to
thread_get_context().
Reservation Cancellation
The essential property in the unreserve() function is to
prevent the owner thread from changing the lockword while
another thread is canceling the reservation. As long as this
property is satisfied, we could implement the unreserve()
function in different ways. We show two variations of the
function.
First, we could use signals as provided in Unix operating
systems. In this variation, the thread without the reserva-
tion requests the cancellation, while the owner thread ac-
tually does the cancellation. More concretely, the thread
without the reservation sends a signal to the owner thread,
and waits until the latter has completed the processing. In
the signal handler, the owner thread cancels the reservation,
checks with the saved program counter to see if it has been
interrupted in an unsafe region, and, if so, modifies the pro-
gram counter to the corresponding retry address.
Second, we could exploit predicated stores (footnote 8), which
are available, for instance, on Intel’s IA-64 processors [22]. We
dedicate one predicate register to lock reservation. We ini-
tialize it to TRUE before reading the lockword in acquire()
and release(), while we write into the lockword in the re-
serve mode with a predicated store qualified by the predicate
register. In the unreserve() function, we set the value of
the predicate register of the owner thread to FALSE. This
prevents the owner thread from changing the lockword
inconsistently (footnote 9).
Footnote 8: In the IA-64, most operations can be qualified by a
one-bit predicate register to indicate whether they are actually
executed or not. The execution of a predicated store consists of
checking the predicate register and conditionally performing the
store, and cannot be interrupted in the middle.
Footnote 9: Hudson et al. [19] proposed a similar technique for
object allocation which utilizes dedicated predicate registers
set and reset by the context switcher.

Figure 5: Complete lock state transitions when lock reservation is coupled with tasuki lock
(The diagram shows the lockword fields tid (thread ID), rcnt
(recursion count), LRV bit, and shape bit, together with the
states of the reserve mode (anonymously reserved at object
creation, reserved for thread A) and of the base mode (tasuki
lock: flat mode, not acquired or acquired, and inflated mode
with a heavyweight monitor), connected by acquire, release,
unreserve, inflate, and deflate transitions.)

Multiprocessor Considerations
The Java language specification [17] describes the Java memory
model in Chapter 17. According to rules about the
interaction of locks and variables, we cannot move before
a lock acquisition the load operations that follow the
acquisition or move after a lock release the store operations
that precede the release. Therefore, when we implement
lock reservation on a multiprocessor system with a relaxed
memory model [1], we need to issue appropriate types of
memory barriers in the functions of lock acquisition and re-
lease. More concretely, the lfence (load fence) and sfence
(store fence) instructions must be inserted at lock acquisi-
tion and release points, respectively, on the Pentium 4 [21],
while the ld.acq (load acquire) and st.rel (store release)
instructions must be used at lock acquisition and release
points, respectively, on the IA-64 [22]. Memory barriers
are normally much cheaper than atomic operations such as
compare_and_swap. However, an older processor may not
support memory barriers at all, so that an expensive in-
struction must be used to meet the requirements.
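The two ordering constraints can be expressed portably with C11 atomics, leaving the choice of the concrete barrier instruction to the compiler. The following is a minimal sketch under stated assumptions, not the code of our implementation: the lockword is simplified to a single word that is zero when the lock is free, and the names lockword_t, try_acquire(), and release_lock() are hypothetical. The acquisition shown is a base-mode style compare-and-swap, used here only to carry the ordering annotations.

```c
#include <stdatomic.h>

/* Hypothetical simplified lockword: 0 means free, otherwise the ID of
 * the thread holding the lock. */
typedef atomic_uint lockword_t;

/* Try to take the lock. Acquire ordering keeps the memory operations of
 * the critical region from moving above a successful acquisition.
 * Returns nonzero on success. */
int try_acquire(lockword_t *lw, unsigned tid) {
    unsigned expected = 0;
    return atomic_compare_exchange_strong_explicit(
        lw, &expected, tid, memory_order_acquire, memory_order_relaxed);
}

/* Free the lock with a store. Release ordering keeps the memory
 * operations of the critical region from moving below this store. */
void release_lock(lockword_t *lw) {
    atomic_store_explicit(lw, 0, memory_order_release);
}
```

On a given target, the compiler maps memory_order_acquire and memory_order_release to whatever instructions that architecture requires, which may be none at all on a strongly ordered machine.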
Practically speaking, we believe that these memory bar-
riers are unnecessary in the reserve mode, since no other
thread can be trying to execute the critical region. We can
take care of the necessary synchronizations when the reser-
vation is canceled and while the owner thread is suspended.
Finally, we note that Pugh [36] pointed out flaws in the Java
memory model, and that the revision is being discussed un-
der Java Specification Request 133 [24].
4. PERFORMANCE MEASUREMENTS
This section evaluates the effectiveness of lock reservation
with the IBM Development Kit for Windows, Java Tech-
nology Edition, Version 1.3.1 [20] and its JIT compiler [23,
41]. We ran all of the benchmark programs under Windows
2000 SP2 on an unloaded IBM IntelliStation M Pro con-
taining two 1.7-GHz Pentium 4 Xeon processors with 1024
megabytes of main memory.
We implemented lock reservation on top of the existing
algorithm in the development kit, which is called tasuki lock
[34], one of the fastest locking algorithms for Java. Tasuki
lock is an improved version of thin lock [6], and both use one
word (footnote 10) in an object header for representing the lock state.
The lockword contains a mode flag called the shape bit,
which distinguishes between the two modes of tasuki lock.
When the shape bit is zero, the lock is in the flat mode. Otherwise,
it is in the inflated mode.
Footnote 10: Actually, the lowest eight bits in the lockword are
used for other states unrelated to locks.
As long as contention does not occur, the lock is in the
flat mode. The lockword in the flat mode is further divided
into the tid field and the rcnt field, as in our algorithm. In
this mode, the lock can be acquired by a compare_and_swap,
and released by a simple store (footnote 11).
When contention happens, the lockword is converted to
the inflated mode, where a heavyweight monitor is created
and the reference to the monitor is stored in the lockword.
The lock remains in the inflated mode until contention
ceases.
Although our lock reservation can be built upon any al-
gorithm, tasuki lock is a very natural fit since the lockword
structure in the flat mode is almost the same as the struc-
ture in the reserve mode. This allows lock operations to be
highly efficient in terms of both space and time. Figure 5
shows all of the state transitions when lock reservation is
coupled with tasuki lock.
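The reserve-mode and flat-mode transitions can be pictured with a small single-threaded model. The sketch below is an illustration only, under several assumptions: the lockword is shown as an unpacked struct rather than one machine word, the names rlock_t, lock_create(), lock_acquire(), lock_release(), and lock_unreserve() are hypothetical, and inflation, deflation, recursion-count overflow, atomic operations, and the handling of unsafe regions are all left out.

```c
#include <stdbool.h>

/* Hypothetical unpacked lockword. The real lockword packs these fields
 * into a single word of the object header. */
typedef struct {
    unsigned tid;   /* reservation owner or lock holder; 0 = none */
    unsigned rcnt;  /* recursion count */
    bool lrv;       /* true: reserve mode, false: base mode */
    bool shape;     /* true: inflated mode (not modeled here) */
} rlock_t;

/* Object creation: the lock starts anonymously reserved. */
rlock_t lock_create(void) {
    rlock_t l = { 0, 0, true, false };
    return l;
}

/* Cancel the reservation: switch to base mode, keeping any hold that
 * the former owner still has. */
void lock_unreserve(rlock_t *l) {
    l->lrv = false;
    if (l->rcnt == 0)
        l->tid = 0;  /* flat mode, not acquired */
}

/* Returns true if thread tid now holds the lock. A false result stands
 * for contention, which would trigger inflation in the real algorithm. */
bool lock_acquire(rlock_t *l, unsigned tid) {
    if (l->lrv) {
        if (l->tid == 0)
            l->tid = tid;        /* initial synchronization */
        if (l->tid == tid) {
            l->rcnt++;           /* reserved for this thread */
            return true;
        }
        lock_unreserve(l);       /* reserved for another thread */
    }
    if (l->tid == 0) {           /* base mode, lock is free */
        l->tid = tid;
        l->rcnt = 1;
        return true;
    }
    if (l->tid == tid) {         /* base mode, recursive acquisition */
        l->rcnt++;
        return true;
    }
    return false;
}

/* Release one level of a lock held by thread tid. In reserve mode the
 * thread ID stays in place so that the reservation is kept. */
void lock_release(rlock_t *l, unsigned tid) {
    if (l->tid != tid || l->rcnt == 0)
        return;                  /* caller does not hold the lock */
    l->rcnt--;
    if (!l->lrv && l->rcnt == 0)
        l->tid = 0;              /* base mode: the lock becomes free */
}
```

In this model, a lock acquired and released by thread 1 stays reserved for thread 1; the first acquisition by thread 2 cancels the reservation and leaves the lock in the flat mode of the base algorithm from then on.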
We took a simple approach to implementing checks for
unsafe regions. Our virtual machine includes two sets of im-
plementations of acquire() and release() functions, one
pair in the module of the JIT runtime code and the other
pair in the module of the interpreter. The former is called
from but not inlined into the JIT generated code. The lat-
ter, written in C, is called from the interpreter, and imple-
mented without unsafe regions as described in Section 3.4.
This means we only have two unsafe regions in our virtual
machine. To make the comparison exact, we disabled inlin-
ing of the lock acquisition and release code in the original
virtual machine.
Finally, in order to comply with the current Java memory
model, we inserted the lfence and sfence instructions into
the functions for lock acquisition and release, respectively.
Footnote 11: The simple store must be followed by a memory
barrier in a multiprocessor system.
4.1 Micro-Benchmarks
We show the results of two micro-benchmarks.
PrimitiveTest
The PrimitiveTest is intended for measuring the cost of syn-
chronization, that is, of acquiring and releasing a lock, in dif-
ferent lock states. We measured the following two cases in
three lock states: reserved, not reserved (flat), and inflated.
Outermost: Acquire and release a lock using a synchronized
block n times, and measure the elapsed time.
Recursive: Perform the same measurement inside another
synchronized block.
To calculate the cost of acquiring and releasing a lock in
each state, we created a special virtual machine that per-
forms nothing on lock acquisition or release, and calculated
the differences between the times of the normal and special
virtual machines.
TransitionTest
The TransitionTest measures the cost of transitions of lock
states unique to lock reservation. We created a total of n
objects, and forced them to make the following two transi-
tions.
Anonymous-to-specific: Acquire and release the lock of
each object, making the anonymously reserved lock specif-
ically reserved.
Reserved-to-base: Cancel the lock reservation for each
object by creating another thread and having this second
thread acquire and release the lock.
To calculate the cost of each transition, we took the differ-
ences from the times of lock acquisition and release without
the transition.
We confirmed in both tests that the relevant methods were
compiled to native code, and that the synchronization oper-
ations within the methods were not optimized away by the
JIT compiler. We also verified that garbage collection did
not occur during the measurements.
Table 3 shows the results for the PrimitiveTest. For com-
parison, the table also contains the numbers for the original
tasuki algorithm. When the reservation succeeded, lock
reservation reduced the cost of the outermost synchronization
by more than 70%. On the other hand, after the reservation
is canceled, the cost of the synchronization is almost
the same as in the original algorithm.
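As a check on these two statements, the figures can be recomputed from the outermost column of Table 3. The first line compares the reserved state with the flat mode of the original algorithm, and the second compares the not-reserved state with the same baseline. The rounding is ours.

```latex
\[
  1 - \frac{61.4\ \text{nsec}}{228.9\ \text{nsec}} \approx 0.732
  \qquad \text{(a reduction of about 73\%)}
\]
\[
  \frac{229.5\ \text{nsec} - 228.9\ \text{nsec}}{228.9\ \text{nsec}} \approx 0.003
  \qquad \text{(an overhead of about 0.3\% after cancellation)}
\]
```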
Table 4 presents the results for the TransitionTest. Since
we found that the cost of cancellation heavily depends on
whether or not thread_get_context() (line 83 in Figure 4)
is actually executed, we show two cases for cancellation in
the table, the faster case in which the function is not ex-
ecuted and the slower case in which the function is exe-
cuted. As the table shows, the cost of making an anonymous
reservation specific is very small and negligible, while the
cost of reservation cancellation is very large, as expected.
The cost of cancellation is 20 to 60 times larger than
the cost of synchronization in the inflated mode. One
reason for this is that getting a thread context (by calling
GetThreadContext()) is very slow in Windows. Although
we do not believe that this badly influences the performance
of real programs, it is important to reduce the number of
expensive system calls, for instance by using the quick check
as described in Section 3.4.

Table 3: Synchronization costs in lock reservation
The cost of acquiring and releasing a lock is
shown for three lock states of our algorithm and
two lock states of the original algorithm.
Lockword state          Outermost     Recursive
Reserved                 61.4 nsec     61.4 nsec
Not reserved            229.5 nsec     61.4 nsec
Inflated                335.5 nsec    155.8 nsec
Flat in original        228.9 nsec     62.2 nsec
Inflated in original    330.3 nsec    150.0 nsec

Table 4: Costs of lock state transitions
The times spent on lock acquisition and release
are not included.
State transition                    Time
Anonymous-to-specific               89.0 nsec
Reserved-to-base (faster case)      6741 nsec
Reserved-to-base (slower case)     18986 nsec
4.2 Macro-Benchmarks
We now show the performance improvement in real pro-
grams. We measured the performance of the same set of
programs as in our investigation in Section 2 (listed in Ta-
ble 1). We ran each program several times with two virtual
machines, one with the original algorithm and the other with
lock reservation, and compared the best scores. We took the
measurements with the JIT compiler enabled.
Figure 6 shows the results. Lock reservation improved
the performance of all programs except _201_compress and
_222_mpegaudio, both of which perform very few lock op-
erations. We observed especially significant improvements
of more than 30% in _209_db, _228_jack, and _213_javac.
As a result, lock reservation improved the geometric mean
of the SPECjvm98 programs by 18.13%. Furthermore, we
observed improvements of 5% to 10% even in the multi-
threaded programs, SPECjbb2000 and the Volano Mark.
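The geometric mean can be reproduced from the seven SPECjvm98 improvements reported in Figure 6, assuming that it is taken over the speedup ratios 1 + r_i, where r_i is the improvement of program i. The symbol r_i and the rounding are ours.

```latex
\[
  \Bigl(\prod_{i=1}^{7} (1 + r_i)\Bigr)^{1/7} - 1
  = \bigl(1.1159 \times 1.0025 \times 1.5276 \times 1.0083
      \times 1.3314 \times 1.3781 \times 1.0155\bigr)^{1/7} - 1
  \approx 3.211^{1/7} - 1
  \approx 0.1813
\]
```

Under that assumption, the result agrees with the reported 18.13%.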
Table 5 shows lock statistics in the actual environment,
which we measured separately. As the table shows, even
when the JIT compiler is enabled, many lock operations are
performed. The table also shows the ratios of lock opera-
tions accelerated by our implementation of lock reservation.
Note that these numbers do not include synchronizations
performed inside the interpreter or performed recursively in
the compiled code, even if the reservations were successful.
Because of this, most of the lock operations were not accel-
erated in _201_compress and _222_mpegaudio, since they
were not in hot methods and were executed by the inter-
preter rather than compiled by the JIT. For other, lock-
intensive programs, more than 58% of the lock operations
were accelerated by lock reservation.
Figure 6: Performance improvements
Program            Improvement
_202_jess          11.59%
_201_compress       0.25%
_209_db            52.76%
_222_mpegaudio      0.83%
_228_jack          33.14%
_213_javac         37.81%
_227_mtrt           1.55%
SPECjbb2000         9.79%
Volano Mark         5.45%
Table 5: Lock statistics
                             Number of         Ratio of           Ratio of
                             lock              accelerated        reservation
Program name                 operations        lock operations    cancellations
SPECjvm98
  _202_jess                   14585409         99.289%            0.00125%
  _201_compress                  29150         31.547%            0.419%
  _209_db                    162079177         99.963%            0.0000296%
  _222_mpegaudio                 27480         35.837%            0.313%
  _228_jack                   35207339         91.947%            0.000395%
  _213_javac                  43510883         99.402%            0.00403%
  _227_mtrt                    3523262         99.035%            0.00284%
SPECjbb2000 (footnote 12)    335718621         58.544%            0.0535%
Volano Server                  6862014         79.755%            0.0248%
Volano Client                 10381000         84.333%            0.0138%
4.3 Possible Extensions
As the results of the micro- and macro-benchmarks show,
the implementation of lock reservation significantly improves
performance if the reservation succeeds, while it maintains
comparable performance if the reservation fails. The only
problem is the relatively high cost of canceling a reservation,
which occurs when a thread acquires an object’s lock
reserved for another thread. However, as Table 5 shows,
canceling a reservation rarely happens in real programs.
Although the locks are initially put in the reserve mode
in our implementation, roughly 0.05% or less of the lock
operations caused reservations to be canceled in the
lock-intensive benchmarks.
There might be pathological programs in which reserva-
tions are canceled more frequently. It may be important
to lower the cost of a cancellation, and to reduce the num-
ber of cancellations by refining the reservation policy. For
instance, if dynamic profiles of cancellations uncover that
reservations are frequently canceled for objects of specific
classes or created at specific execution points, we should
initially put them into the base mode. Also, we may be able
to predict which thread is likely to initially acquire an ob-
ject’s lock, using dynamic profiles or static analysis. Finally,
if we can reduce the cost of a cancellation, it could become
worthwhile to pursue an algorithm allowing repeated reser-
vations.
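One way to picture the class-based refinement of the reservation policy is a per-class profile that counts how often reservations are canceled. The following is a minimal sketch of that idea only, not part of our implementation: the names class_profile_t, note_reserved_creation(), note_cancellation(), and should_start_reserved(), the table size, and both thresholds are hypothetical, and the counters are not thread-safe.

```c
#include <stdbool.h>

#define MAX_CLASSES 1024        /* hypothetical size of the profile table */
#define MIN_SAMPLES 100         /* hypothetical: objects seen before deciding */
#define MAX_CANCEL_PERCENT 10   /* hypothetical cancellation threshold */

/* Per-class counters collected at run time. */
typedef struct {
    unsigned long created;   /* objects created in the reserve mode */
    unsigned long canceled;  /* reservations of such objects canceled */
} class_profile_t;

static class_profile_t profiles[MAX_CLASSES];

/* Record that an object of the given class was created reserved. */
void note_reserved_creation(unsigned class_id) {
    profiles[class_id % MAX_CLASSES].created++;
}

/* Record that a reservation on an object of the given class was canceled. */
void note_cancellation(unsigned class_id) {
    profiles[class_id % MAX_CLASSES].canceled++;
}

/* Decide the initial mode for a new object: true means reserve mode,
 * false means base mode. Classes with too few samples stay reserved. */
bool should_start_reserved(unsigned class_id) {
    const class_profile_t *p = &profiles[class_id % MAX_CLASSES];
    if (p->created < MIN_SAMPLES)
        return true;
    return p->canceled * 100 <= p->created * MAX_CANCEL_PERCENT;
}
```

A profile keyed by allocation site instead of class would follow the same pattern, with the site identifier in place of class_id.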
Footnote 12: Again, the total number of locks for SPECjbb2000 is
not very meaningful because it varies with the execution speed.
5. RELATED WORK
There is a significant body of literature on locks. Here
we mainly focus on Java locks and locks without atomic
operations.
5.1 Improvements of Java Locks
As we mentioned in Section 1, synchronization operations
tend to be very frequent in Java, so many techniques have
been proposed to optimize them.
The early versions of virtual machines from Sun allocate
monitors separately from objects, and maintain the mapping
from objects to monitors in a data structure called the
monitor cache [47]. While this does not require any bit in an
object header for synchronization, it suffers from slow per-
formance and bad scalability, since the monitor cache must
be synchronized. A similar technique is also used in Kaffe
[26].
Onodera [33] proposed a simple and space-efficient way
of implementing Java locks. The method directly stores a
reference to a monitor in a rarely used field in an object
header, displacing the original value of the field into the
monitor. Furthermore, it does so for only heavily synchro-
nized objects in order to reduce the space overhead. While
it mostly eliminated the need to synchronize the monitor
cache, it could not drastically reduce the synchronization
overhead, since it still used heavyweight monitors.
Bacon et al. [6] proposed a locking algorithm for Java,
called thin lock, exploiting the observation that most locks
are not contended in Java. An object’s lock operates in one
of two modes, the flat mode and the inflated mode. They
reserve a 24-bit field in an object header, which has one of
two structures depending on the current operating mode of
the lock; the two structures are distinguished by one bit,
called the shape bit. Initially, each lock is in the flat mode, and remains in
this mode as long as contention does not occur. Acquiring
and releasing a lock in the flat mode is highly efficient, ex-
ecuting only a few machine instructions. In particular, the
instruction sequence for acquisition includes only one atomic
operation, while the instruction sequence for release contains
no atomic operations. When contention is detected, the lock
changes to the inflated mode, and falls back to the heavy-
weight monitor. Once a lock is put in the inflated mode, thin
lock keeps the lock in this mode for the rest of its life, result-
ing in all the subsequent synchronizations being performed
through the heavyweight monitor.
Onodera and Kawachiya [34] discovered that most con-
tentions are temporary in Java, and proposed an enhanced
algorithm, named tasuki lock, which supports deflation to
recover the higher performance of the flat mode. SableVM
[16] employs a variation of the tasuki lock.
Agesen et al. [2] proposed another locking algorithm,
called meta lock. While it needs only two bits in a header
for synchronization, it requires two atomic operations in ac-
quiring and releasing a lock. Thus, it is not as time-efficient
as thin lock and tasuki lock. Recently, Dice [13] proposed a
modified version of meta lock named relaxed lock.
Although the details are significantly different, these fast
algorithms can acquire and release a lock with a small num-
ber of machine instructions containing one or two atomic
operations. Lock reservation attempts to further reduce the
overhead by completely eliminating the atomic operations
that are now becoming more and more expensive in modern
architectures. It exploits the observation that most locks
are not only uncontended in Java, but also dominantly ac-
quired by a specific thread. As we already described, if
implemented on top of tasuki lock, it only requires one ad-
ditional bit in the header to represent the reservation status,
while it attains an unprecedented level of performance for
synchronization when reservation succeeds. We note that
Bacon and Fink [7] independently proposed a similar idea
of eliminating atomic operations in Java locks.
5.2 Elimination of Java Locks
Another approach to improve the synchronization perfor-
mance is to eliminate locks altogether rather than to reduce
the cost of the locks.
Using escape analysis [35], we can find objects accessible
only by their creator threads, and eliminate all the syn-
chronization operations for the locks of such non-escaping
objects [3, 9, 10, 12, 38, 44]. However, these techniques
are the most effective in a static compiler that can perform
whole program analysis, while they provide only limited ben-
efits for a dynamic language such as Java. When applying
escape analysis to Java, many more objects must conserva-
tively be judged as escaping, and their locks cannot be op-
timized away. Whaley [45] recently proposed partial method
compilation for improving effectiveness of escape analysis for
a dynamic language.
The IBM JIT compiler eliminates some of the recursive
locks [27]. This can happen when the compiler inlines one
synchronized method into another synchronized method. If
the compiler determines that the receiver objects of the two
methods are identical, it then eliminates the lock operations
for the inlinee.
Since lock reservation is a runtime technique, it is basi-
cally complementary to compiler optimizations such as es-
cape analysis and recursive lock elimination. It can speed
up the locks of escaping objects and outermost locks as long
as they show thread locality.
Bacon [5] attempted to eliminate all of the synchroniza-
tion overhead from single-threaded executions. As long as
the system creates and runs only one thread, nothing is done
for lock acquisition and release. When the running program
attempts to create a second thread, the system scans the
stack frames and properly recovers the lock states. Muller
[32] also briefly mentioned a similar idea. Ruf [38] proposed
whole-program analysis to determine if the program does
not create a second thread. Unfortunately, these ideas can-
not be used in most of the commercial virtual machines,
since they always create a couple of helper threads, besides
the main thread, at start-up time.
5.3 Other Lock Optimizations
Some of the locking algorithms provide mutual exclusion
without atomic operations such as compare_and_swap and
test_and_set.
Bershad et al. [8] proposed a unique locking algorithm
that closely cooperates with the operating system’s sched-
uler. When a thread is preempted in one of the critical
sections, it is forcibly restarted from the entry point of the
section. To determine whether a thread is suspended in
such a restartable atomic sequence, they mark each atomic
sequence with a designated sequence. As we mentioned in
Section 3.4, we can apply the technique to mark the un-
safe regions in our algorithm. By extending Bershad’s idea,
Johnson et al. [25] proposed interruptible critical sections,
which support the modification of multiple data objects.
When a virtual machine is built on a user-level thread
package, we can use scheduler-based techniques to imple-
ment locks. Actually, both CACAO [28] and LaTTe [46]
implement locks by inhibiting thread switches inside the
critical sections. However, scheduler-based locks are only ef-
fective on a uniprocessor system. Moreover, they may cause
starvation when a foreign function is called through the Java
Native Interface, and the foreign function attempts to acquire
a system-level lock. In contrast, lock reservation
works properly on a multiprocessor system and under the
system-level, preemptive scheduler.
Dijkstra [14] and Lamport [30] presented complex algorithms
for mutual exclusion which do not rely on compound
atomic operations. However, to the best of our knowledge,
they have never been used in practical systems, because of
their subtlety and lack of generality.
The communities of database systems [42] and distributed
file systems invented many optimization techniques based on
access locality, which is similar to our thread locality. Kung
and Robinson [29] proposed an optimistic concurrency con-
trol for database systems, which speculatively executes criti-
cal regions without acquiring locks and commits the changes
if there is no contention. Rajwar and Goodman [37] re-
cently proposed a technique to implement a similar idea at
the micro-architectural level. Microsoft’s CIFS distributed
file system includes a file-locking mechanism called oppor-
tunistic locks or oplocks [15, 31]. When a client is granted an
exclusive oplock for a file, it can cache the file data for better
performance. If another client attempts to open the file, the
server sends the client holding the oplock an oplock break re-
quest to return the cached data. This resembles reservation
cancellation in our algorithm.
6. CONCLUDING REMARKS
We have presented a new locking algorithm, lock reserva-
tion, which optimizes Java locks by exploiting thread local-
ity. The algorithm allows locks to be reserved for threads,
and runs in either reserve mode or base mode. When a
thread attempts to acquire a lock in the reserve mode, it
can do so extremely quickly without any atomic operation
if the lock is reserved for the thread. If the lock is not re-
served for the thread, it cancels the reservation and falls
back to the base mode.
We have defined thread locality of locks, which claims that
the locking sequence of a lock contains a very long repetition
of a specific thread, and confirmed that the vast majority of
Java locks exhibit thread locality.
We have evaluated an implementation of lock reservation
in IBM’s production virtual machine and compiler. The
results of micro-benchmarks show that we could reduce the
locking overhead by more than 70% when the reservation
succeeded. The results of macro-benchmarks show that lock
reservation sped up more than 58% of the lock operations,
and achieved up to 53% performance improvements in real
Java applications.
ACKNOWLEDGMENTS
We thank the members of the Network Computing Plat-
form group in IBM Tokyo Research Laboratory, who gave
us valuable suggestions.
REFERENCES
[1] S. V. Adve and K. Gharachorloo. Shared Memory
Consistency Models: A Tutorial. IEEE Computer,
29(12), 66–76, 1996.
[2] O. Agesen, D. Detlefs, A. Garthwaite, R. Knippel,
Y. S. Ramakrishna, and D. White. An Efficient
Meta-lock for Implementing Ubiquitous
Synchronization. Proceedings of ACM OOPSLA ’99,
207–222, 1999.
[3] J. Aldrich, C. Chambers, E. G. Sirer, and S. Eggers.
Static Analyses for Eliminating Unnecessary
Synchronization from Java Programs. Proceedings of
the 6th Int’l Static Analysis Symposium (SAS ’99),
19–38, 1999.
[4] E. Armstrong. HotSpot: A New Breed of Virtual
Machine. http://www.javaworld.com/jw-03-1998/
jw-03-hotspot.html, 1998.
[5] D. F. Bacon. Fast and Effective Optimization of
Statically Typed Object-Oriented Languages. Ph.D.
Thesis UCB/CSD-98-1017, University of California,
1997.
[6] D. F. Bacon, R. Konuru, C. Murthy, and M. Serrano.
Thin Locks: Featherweight Synchronization for Java.
Proceedings of ACM PLDI ’98, 258–268, 1998.
[7] D. F. Bacon and S. Fink. Personal Communication.
[8] B. N. Bershad, D. D. Redell, and J. R. Ellis. Fast
Mutual Exclusion for Uniprocessors. Proceedings of
ACM ASPLOS V, 223–233, 1992.
[9] B. Blanchet. Escape Analysis for Object-Oriented
Languages: Application to Java. Proceedings of ACM
OOPSLA ’99, 20–34, 1999.
[10] J. Bogda and U. Hölzle. Removing Unnecessary
Synchronization in Java. Proceedings of ACM
OOPSLA ’99, 35–46, 1999.
[11] P. A. Buhr, M. Fortier, and M. H. Coffin. Monitor
Classification. ACM Computing Surveys, 27(1),
63–107, 1995.
[12] J.-D. Choi, M. Gupta, M. Serrano, V. C. Sreedhar,
and S. Midkiff. Escape Analysis for Java. Proceedings of
ACM OOPSLA ’99, 1–19, 1999.
[13] D. Dice. Implementing Fast Java Monitors with
Relaxed-Locks. Proceedings of USENIX JVM ’01,
79–90, 2001.
[14] E. W. Dijkstra. Solution of a Problem in Concurrent
Programming and Control. Communications of the
ACM, 8(9), 569, 1965.
[15] R. Eckstein, D. Collier-Brown, and P. Kelly. Using
Samba. O’Reilly, 1999.
http://www.oreilly.com/catalog/samba/chapter/
book/ch05_05.html.
[16] E. M. Gagnon and L. J. Hendren. SableVM: A
Research Framework for the Efficient Execution of
Java Bytecode. Proceedings of USENIX JVM ’01,
27–39, 2001.
[17] J. Gosling, B. Joy, and G. Steele. The Java Language
Specification. Addison Wesley, 1996.
[18] C. A. R. Hoare. Monitors: An Operating System
Structuring Concept. Communications of the ACM,
17(10), 549–557, 1974.
[19] R. L. Hudson, J. E. B. Moss, S. Subramoney, and
W. Washburn. Cycles to Recycle: Garbage Collection
on the IA-64. Proceedings of the 2nd ACM Int’l
Symposium on Memory Management (ISMM ’00),
101–110, 2000.
[20] IBM developerWorks Java Technology Zone.
http://www.ibm.com/developerworks/java/.
[21] Intel Corporation. IA-32 Intel Architecture Software
Developer’s Manual Vol. 1–3.
http://developer.intel.com/design/Pentium4/
manuals/.
[22] Intel Corporation. Intel Itanium Architecture Software
Developer’s Manual Vol. 1–3.
http://developer.intel.com/design/itanium/
manuals/.
[23] K. Ishizaki, M. Kawahito, T. Yasue, M. Takeuchi,
T. Ogasawara, T. Suganuma, T. Onodera,
H. Komatsu, and T. Nakatani. Design,
Implementation, and Evaluation of Optimizations in a
Just-In-Time Compiler. Proceedings of ACM Java
Grande ’99, 119–128, 1999.
[24] Java Community Process. JSR 133: Java Memory
Model and Thread Specification Revision.
http://jcp.org/jsr/detail/133.jsp.
[25] T. Johnson and K. Harathi. Interruptible Critical
Sections. Technical Report TR94007, University of
Florida, 1994.
[26] Kaffe.org. Developing Kaffe.
http://www.kaffe.org/develop.html.
[27] M. Kawahito. Personal Communication.
[28] A. Krall and M. Probst. Monitors and Exceptions:
How to Implement Java Efficiently. Proceedings of
ACM Workshop on Java for High-Performance
Network Computing, 15–24, 1998.
[29] H. T. Kung and J. T. Robinson. On Optimistic
Methods for Concurrency Control. ACM Transactions
on Database Systems, 6(2), 213–226, 1981.
[30] L. Lamport. A Fast Mutual Exclusion Algorithm.
ACM Transactions on Computer Systems, 5(1), 1–11,
1987.
[31] P. Leach and D. Perry. CIFS: A Common Internet
File System.
http://www.microsoft.com/mind/1196/cifs.asp,
1996.
[32] G. Muller, B. Moura, F. Bellard, and C. Consel.
Harissa: A Flexible and Efficient Java Environment
Mixing Bytecode and Compiled Code. Proceedings of
the 3rd USENIX Conference on Object Oriented
Technologies and Systems (COOTS ’97), 1–20, 1997.
[33] T. Onodera. A Simple and Space-Efficient Monitor
Optimization for Java. IBM Research Report RT0259,
IBM, 1998.
[34] T. Onodera and K. Kawachiya. A Study of Locking
Objects with Bimodal Fields. Proceedings of ACM
OOPSLA ’99, 223–237, 1999.
[35] Y. G. Park and B. Goldberg. Escape Analysis on
Lists. Proceedings of ACM PLDI ’92, 116–127, 1992.
[36] W. Pugh. Fixing the Java Memory Model. Proceedings
of ACM Java Grande ’99, 89–98, 1999.
[37] R. Rajwar and J. R. Goodman. Speculative Lock
Elision: Enabling Highly Concurrent Multithreaded
Execution. Proceedings of the 34th ACM/IEEE
MICRO 34, 294–305, 2001.
[38] E. Ruf. Effective Synchronization Removal for Java.
Proceedings of ACM PLDI ’00, 208–218, 2000.
[39] Standard Performance Evaluation Corporation.
SPEC JBB2000.
http://www.spec.org/osg/jbb2000/.
[40] Standard Performance Evaluation Corporation.
SPEC JVM98 Benchmarks.
http://www.spec.org/osg/jvm98/.
[41] T. Suganuma, T. Ogasawara, M. Takeuchi, T. Yasue,
M. Kawahito, K. Ishizaki, H. Komatsu, and
T. Nakatani. Overview of the IBM Java Just-in-Time
Compiler. IBM Systems Journal, 39(1), 175–193,
2000.
[42] A. Thomasian. Concurrency Control: Methods,
Performance, and Analysis. ACM Computing Surveys,
30(1), 70–119, 1998.
[43] Volano LLC. Volano Benchmarks.
http://www.volano.com/benchmarks.html.
[44] J. Whaley and M. Rinard. Compositional Pointer and
Escape Analysis for Java Programs. Proceedings of
ACM OOPSLA ’99, 187–206, 1999.
[45] J. Whaley. Partial Method Compilation using
Dynamic Profile Information. Proceedings of ACM
OOPSLA ’01, 166–179, 2001.
[46] B.-S. Yang, J. Lee, J. Park, S.-M. Moon, K. Ebcioglu,
and E. Altman. Lightweight Monitor for Java VM.
ACM SIGARCH Computer Architecture News, 27(1),
35–38, 1999.
[47] F. Yellin and T. Lindholm. Java Runtime Internals.
Presentation in JavaOne ’96,
http://java.sun.com/javaone/javaone96/pres/
Runtime.pdf, 1996.
141
... However, its frequent synchronization typically slows executions by several times or more. Alternatively, optimistic tracking avoids synchronization for accesses not involved in cross-thread dependences, but requires coordination between threads when accesses are involved in dependences [11,13,22,33,35,39]. We emphasize that although optimistic tracking performs well for the many programs that perform relatively few conflicting accesses, its very high cost for some programs is a severe impediment to its widespread use in high-performance systems. ...
... In contrast, optimistic tracking avoids synchronization at most accesses. Prior work uses optimistic tracking either to implement program locks (Section 8) [13,22,33] or to track cross-thread dependences [11,35,39]. This paper focuses on the latter context. ...
... Program locks face similar tradeoffs as pessimistic versus optimistic tracking. Notably, biased locking avoids atomic operations for repeated lock acquisitions by the same thread, requiring coordination when another thread acquires the lock [13,22,33]. A biased lock typically falls back to an unbiased lock after triggering coordination once. ...
Conference Paper
Full-text available
It is notoriously challenging to develop parallel software systems that are both scalable and correct. Runtime support for parallelism---such as multithreaded record & replay, data race detectors, transactional memory, and enforcement of stronger memory models---helps achieve these goals, but existing commodity solutions slow programs substantially in order to track (i.e., detect or control) an execution's cross-thread dependences accurately. Prior work tracks cross-thread dependences either "pessimistically," slowing every program access, or "optimistically," allowing for lightweight instrumentation of most accesses but dramatically slowing accesses involved in cross-thread dependences. This paper seeks to hybridize pessimistic and optimistic tracking, which is challenging because there exists a fundamental mismatch between pessimistic and optimistic tracking. We address this challenge based on insights about how dependence tracking and program synchronization interact, and introduce a novel approach called hybrid tracking. Hybrid tracking is suitable for building efficient runtime support, which we demonstrate by building hybrid-tracking-based versions of a dependence recorder and a region serializability enforcer. An adaptive, profile-based policy makes runtime decisions about switching between pessimistic and optimistic tracking. Our evaluation shows that hybrid tracking enables runtime support to overcome the performance limitations of both pessimistic and optimistic tracking alone.
... To accomplish this goal, we propose to replace the atomic RC operations with biased RC operations. Similar to biased locks [25], our biased operations leverage uneven sharing to create asymmetrical execution times for RC operations based on the thread performing the operation. Each object is biased toward, or favors, a specific thread. ...
... There have been many works [19,25,29,34,35,36,42] which try to limit the amount of overhead to acquire uncontested locks. While BRC is inspired by biased locking [25], it is not a straightforward re-application of biased locking. BRC proposes an efficient biasing technique tailored to RC by exploiting the fact that RC does not require strong exclusivity like locking. ...
Conference Paper
Reference counting (RC) is one of the two fundamental approaches to garbage collection. It has the desirable characteristics of low memory overhead and short pause times, which are key in today's interactive mobile platforms. However, RC has a higher execution time overhead than its counterpart, tracing garbage collection. The reason is that RC implementations maintain per-object counters, which must be continually updated. In particular, the execution time overhead is high in environments where low memory overhead is critical and, therefore, non-deferred RC is used. This is because the counter updates need to be performed atomically. To address this problem, this paper proposes a novel algorithm called Biased Reference Counting (BRC), which significantly improves the performance of non-deferred RC. BRC is based on the observation that most objects are only accessed by a single thread, which allows most RC operations to be performed non-atomically. BRC leverages this by biasing each object towards a specific thread, and keeping two counters for each object --- one updated by the owner thread and another updated by the other threads. This allows the owner thread to perform RC operations non-atomically, while the other threads update the second counter atomically. We implement BRC in the Swift programming language runtime, and evaluate it with client and server programs. We find that BRC makes each RC operation more than twice as fast in the common case. As a result, BRC reduces the average execution time of client programs by 22.5%, and boosts the average throughput of server programs by 7.3%.
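The two-counter idea in this abstract can be illustrated with a minimal Java sketch. It is not the BRC implementation (which lives in the Swift runtime); the class name `BiasedCount`, its methods, and the choice of the creating thread as owner are all hypothetical. The sketch shows only the split between a plain owner-side counter and an atomic shared counter, and it deliberately omits the protocol by which the two counters are merged and the object is freed.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch: a reference count split into an owner-only part and a shared part. */
final class BiasedCount {
    // Assumption: the thread that creates the object is the one it is biased toward.
    private final Thread owner = Thread.currentThread();

    // Written only by the owner thread, so a plain int with no atomic operations suffices.
    private int biased = 1;

    // Written by every other thread, so updates must be atomic.
    private final AtomicInteger shared = new AtomicInteger(0);

    void retain() {
        if (Thread.currentThread() == owner) {
            biased++;                 // fast path: non-atomic increment
        } else {
            shared.incrementAndGet(); // slow path: atomic increment
        }
    }

    void release() {
        if (Thread.currentThread() == owner) {
            biased--;                 // fast path: non-atomic decrement
        } else {
            shared.decrementAndGet(); // slow path: atomic decrement
        }
    }

    /** Value of the shared (non-owner) counter. */
    int sharedCount() {
        return shared.get();
    }

    /**
     * Sum of both counters. Only meaningful when called by the owner
     * while no other thread is concurrently updating the shared counter.
     */
    int total() {
        return biased + shared.get();
    }
}
```

In this sketch an object used only by its creating thread never executes an atomic instruction, which is the common case the abstract describes. Deciding when the object can be reclaimed requires combining the two counters safely, and that step is left out here.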
... We introduce relaxed dependence tracking (RT), which enables a thread to continue executing past a memory access involved in a cross-thread dependence, without accurately tracking the dependence. Our design of RT targets dependence tracking based on so-called biased reader-writer locking [11,14,31,32,46,48,49,57] (Section 2.2), which avoids the costs of reacquiring a lock for non-conflicting accesses, but incurs latency at conflicting accesses in order to perform coordination among conflicting threads (Section 2.3). The high cost of coordination provides both a challenge and an opportunity for RT to hide this latency, by relaxing the tracking of dependences at accesses involved in dependences. ...
... Prior work introduces so-called biased locking, in which each object's lock is "biased" toward one thread (or multiple threads in the case of reader locks) [11,14,31,32,46,48,49,57]. These "owner" thread(s) can reacquire the lock without performing an atomic operation or even a store. ...
... This paper focuses on one context for biased locking: providing instrumentation-access atomicity for capturing cross-thread dependences. Other work has used biased locking for program locks [32,46,55]. Biased program locks typically mitigate coordination costs by switching a conflicting lock to an unbiased state after the first conflict. ...
Conference Paper
It is notoriously difficult to achieve both correctness and scalability for many shared-memory parallel programs. To improve correctness and scalability, researchers have developed various kinds of parallel runtime support such as multithreaded record & replay and software transactional memory. Existing forms of runtime support slow programs significantly in order to track an execution's cross-thread dependences accurately. This paper investigates the potential for runtime support to hide latency introduced by dependence tracking, by tracking dependences in a relaxed way---meaning that not all dependences are tracked accurately. The key challenge in relaxing dependence tracking is preserving both the program's semantics and the runtime support's guarantees. We present an approach called relaxed dependence tracking (RT) and demonstrate its potential by building two types of RT-based runtime support. Our evaluation shows that RT hides much of the latency incurred by dependence tracking, although RT-based runtime support incurs costs and complexity in order to handle relaxed dependence information. By demonstrating how to relax dependence tracking to hide latency while preserving correctness, this work shows the potential for addressing a key cost of dependence tracking, thus advancing knowledge in the design of parallel runtime support.
... Emmi et al. [17] proposed an automatic lock allocation technique that infers where locks belong in a program, ensuring that the locking is correct and avoids deadlocks. Kawachiya et al. [18] proposed a lock reservation algorithm that allowed a lock to be reserved for a thread. When a thread tried to acquire a lock that was reserved for it, it did not have to perform an atomic operation; otherwise, the thread used the traditional method to obtain the lock. ...
Article
Internet of Things (IoT) software should provide good support for IoT devices as IoT devices are growing in quantity and complexity. Communication between IoT devices is largely realized in a concurrent way. How to ensure the correctness of concurrent access becomes a big challenge to IoT software development. This paper proposes a general refactoring framework for fine-grained read–write locking and implements an automatic refactoring tool to help developers convert built-in monitors into fine-grained ReentrantReadWriteLocks. Several program analysis techniques, such as visitor pattern analysis, alias analysis, and side-effect analysis, are used to assist with refactoring. Our tool is tested by several real-world applications including HSQLDB, Cassandra, JGroups, Freedomotic, and MINA. A total of 1072 built-in monitors are refactored into ReentrantReadWriteLocks. The experiments revealed that our tool can help developers with refactoring for ReentrantReadWriteLocks and save their time and energy.
... The lock-word field in the Java object header plays an important part in Java synchronization mechanisms. Much research has been done on improving Java synchronization mechanisms based on the lock-word field [11,12,13,14]. However, the lock-word field is removed from the object header in the packed object data model: because multiple packedObject headers can refer to the same underlying data, lock-words in different headers referring to the same data would require extra synchronization. ...
Article
In this paper, we develop a multi‐granularity locking scheme for Java PackedObjects, an experimental enhancement introduced in IBM's J9 Java Virtual Machine. The packed object model organizes data in a multi‐tier manner in which object data can be nested in the container object instead of being pointed to by an object reference, as in the traditional Java object model. This new object data model creates new challenges for multi‐tier data synchronization, requiring concurrent locks on the multi‐tier data of different granularities for maintaining consistency. This is different from the traditional Java synchronization model. In this paper we make use of a concurrent multiway tree to represent the containing and ordering relationship between PackedObjects at different tiers and develop an efficient multi‐granularity locking scheme allowing multiple threads to concurrently manipulate the concurrent multiway tree for synchronization operations. In the evaluation, we compare our new tree‐based multitierSync with the previous multitierSync approaches based on linked‐lists (optimized‐list‐based and lazy‐list‐based). The experimental results show that the tree‐based MultitierPackedSync outperforms the list‐based approaches considerably in different workloads, and the higher the workload, the better the performance gains achieved by the tree‐based MultitierPackedSync.
... Biased locking [Kawachiya et al. 2002; Pizlo et al. 2011; Russell and Detlefs 2006; Vasudevan et al. 2010] is a well-known implementation technique suited to locks that are acquired by only one thread, at least within some large consecutive series of acquires. Initially the lock is unowned and must be acquired in the typical fashion. ...
Article
This paper presents Fast Instrumentation Bias (FIB), a sound and complete dynamic data race detection algorithm that improves performance by reducing or eliminating the costs of analysis atomicity. In addition to checking for errors in target programs, dynamic data race detectors must introduce synchronization to guard against metadata races that may corrupt analysis state and compromise soundness or completeness. Pessimistic analysis synchronization can account for nontrivial performance overhead in a data race detector. The core contribution of FIB is a novel cooperative ownership-based synchronization protocol whose states and transitions are derived purely from preexisting analysis metadata and logic in a standard data race detection algorithm. By exploiting work already done by the analysis, FIB ensures atomicity of dynamic analysis actions with zero additional time or space cost in the common case. Analysis of temporally thread-local or read-shared accesses completes safely with no synchronization. Uncommon write-sharing transitions require synchronous cross-thread coordination to ensure common cases may proceed synchronization-free. We implemented FIB in the Jikes RVM Java virtual machine. Experimental evaluation shows that FIB eliminates nearly all instrumentation atomicity costs on programs where data often experience windows of thread-local access. Adaptive extensions to the ownership policy effectively eliminate high coordination costs of the core ownership protocol on programs with high rates of serialized sharing. FIB outperforms a naive pessimistic synchronization scheme by 50% on average. Compared to a tuned optimistic metadata synchronization scheme based on conventional fine-grained atomic compare-and-swap operations, FIB is competitive overall, and up to 17% faster on some programs. 
Overall, FIB effectively exploits latent analysis and program invariants to bring strong integrity guarantees to an otherwise unsynchronized data race detection algorithm at minimal cost.
Article
Recent advances in verification have made it possible to envision trusted implementations of real-world languages. Java with its type-safety and fully specified semantics would appear to be an ideal candidate; yet, the complexity of the translation steps used in production virtual machines has made it a challenging target for verifying compiler technology. One of Java's key innovations, its memory model, poses significant obstacles to such an endeavor. The Java Memory Model is an ambitious attempt at specifying the behavior of multithreaded programs in a portable, hardware-agnostic way. While experts have an intuitive grasp of the properties that the model should enjoy, the specification is complex and not well-suited for integration within a verifying compiler infrastructure. Moreover, the specification is given in an axiomatic style that is distant from the intuitive reordering-based reasoning traditionally used to justify or rule out behaviors, and ill-suited to the kind of operational reasoning one would expect to employ in a compiler. This paper takes a step back, and introduces a Buffered Memory Model (BMM) for Java. We choose a pragmatic point in the design space sacrificing generality in favor of a model that is fully characterized in terms of the reorderings it allows, amenable to formal reasoning, and which can be efficiently applied to a specific hardware family, namely x86 multiprocessors. Although the BMM restricts the reorderings compilers are allowed to perform, it serves as the key enabling device to achieving a verification pathway from bytecode to machine instructions. Despite its restrictions, we show that it is backwards compatible with the Java Memory Model and that it does not cripple performance on TSO architectures.
Article
Current memory reclamation mechanisms for highly-concurrent data structures present an awkward trade-off. Techniques such as epoch-based reclamation perform well when all threads are running on dedicated processors, but the delay or failure of a single thread will prevent any other thread from reclaiming memory. Alternatives such as hazard pointers are highly robust, but they are expensive because they require a large number of memory barriers. This paper proposes three novel ways to alleviate the costs of the memory barriers associated with hazard pointers and related techniques. These new proposals are backward-compatible with existing code that uses hazard pointers. They move the cost of memory management from the principal code path to the infrequent memory reclamation procedure, significantly reducing or eliminating memory barriers executed on the principal code path. These proposals include (1) exploiting the operating system's memory protection ability, (2) exploiting certain x86 hardware features to trigger memory barriers only when needed, and (3) a novel hardware-assisted mechanism, called a hazard lookaside buffer (HLB) that allows a reclaiming thread to query whether there are hazardous pointers that need to be flushed to memory. We evaluate our proposals using a few fundamental data structures (linked lists and skiplists) and libcuckoo, a recent high-throughput hash-table library, and show significant improvements over the hazard pointer technique.
Article
It is notoriously challenging to develop parallel software systems that are both scalable and correct. Runtime support for parallelism—such as multithreaded record and replay, data race detectors, transactional memory, and enforcement of stronger memory models—helps achieve these goals, but existing commodity solutions slow programs substantially to track (i.e., detect or control) an execution’s cross-thread dependencies accurately. Prior work tracks cross-thread dependencies either “pessimistically,” slowing every program access, or “optimistically,” allowing for lightweight instrumentation of most accesses but dramatically slowing accesses that are conflicting (i.e., involved in cross-thread dependencies). This article presents two novel approaches that seek to improve the performance of dependence tracking. Hybrid tracking (HT) hybridizes pessimistic and optimistic tracking by overcoming a fundamental mismatch between these two kinds of tracking. HT uses an adaptive, profile-based policy to make runtime decisions about switching between pessimistic and optimistic tracking. Relaxed tracking (RT) attempts to reduce optimistic tracking’s overhead on conflicting accesses by tracking dependencies in a “relaxed” way—meaning that not all dependencies are tracked accurately—while still preserving both program semantics and runtime support’s correctness. To demonstrate the usefulness and potential of HT and RT, we build runtime support based on the two approaches. Our evaluation shows that both approaches offer performance advantages over existing approaches, but there exist challenges and opportunities for further improvement. HT and RT are distinct solutions to the same problem. It is easier to build runtime support based on HT than on RT, although RT does not incur the overhead of online profiling. This article presents the two approaches together to inform and inspire future designs for efficient parallel runtime support.
Chapter
Since the appearance of the first papers in the mid-1970s formalizing two-phase locking as a means of Concurrency Control (CC) [23], there have been numerous proposals based on locking, time-stamp ordering, and optimistic CC [6], [52], [77]. CC is required to ensure correctness and database integrity when a database is updated by several transactions concurrently [23].
Conference Paper
The Java language provides a promising solution to the design of safe programs, with an application spectrum ranging from Web services to operating system components. The well-known tradeoff of Java's portability is the inefficiency of its basic execution model, which relies on the interpretation of an object-based virtual machine. Many solutions have been proposed to overcome this problem, such as just-in-time (JIT) and off-line bytecode compilers. However, most compilers trade efficiency for either portability or the ability to dynamically load bytecode. In this paper, we present an approach which reconciles portability and efficiency, and preserves the ability to dynamically load bytecode. We have designed and implemented an efficient environment for the execution of Java programs, named Harissa. Harissa permits the mixing of compiled and interpreted methods. Harissa's compiler translates Java bytecode to C, incorporating aggressive optimizations such as virtual-method call optimization based on the Class Hierarchy Analysis. To evaluate the performance of Harissa, we have conducted an extensive experimental study aimed at comparing the various existing alternatives to execute Java programs. The C code produced by Harissa's compiler is more efficient than all other alternative ways of executing Java programs (that were available to us): it is up to 140 times faster than the JDK interpreter, up to 13 times faster than the Softway Guava JIT, and 30% faster than the Toba bytecode to C compiler.
Article
Object locking can be efficiently implemented by bimodal use of a field reserved in an object. The field is used as a lightweight lock in one mode, while it holds a reference to a heavyweight lock in the other mode. A bimodal locking algorithm recently proposed for Java achieves the highest performance in the absence of contention, and is still fast enough when contention occurs. However, mode transitions inherent in bimodal locking have not yet been fully considered. The algorithm requires busy-wait in the transition from the light mode (inflation), and does not make the reverse transition (deflation) at all. We propose a new algorithm that allows both inflation without busy-wait and deflation, but still maintains an almost maximum level of performance in the absence of contention. We also present statistics on the synchronization behavior of real multithreaded Java programs, which indicate that busy-wait in inflation and absence of deflation can be problematic in terms of robustness and performance. Actually, an implementation of our algorithm shows increased robustness, and achieves performance improvements of up to 13.1% in server-oriented benchmarks.
Article
A number of mainly independent sequential-cyclic processes with restricted means of communication with each other can be made in such a way that at any moment one and only one of them is engaged in the “critical section” of its cycle. © 1983, ACM. All rights reserved.