Automatic debugging of concurrent programs through active sampling of low dimensional random projections
Elad Yom-Tov, Rachel Tzoref, Shmuel Ur, Shlomo Hoory
IBM, Haifa Research Lab
Haifa University Campus
Haifa, 31905, Israel
{yomtov,rachelt,ur,shlomoh}@il.ibm.com
Abstract
Concurrent computer programs are fast becoming prevalent in many critical applications. Unfortunately, these programs are especially difficult to test and debug. Recently, it has been suggested that injecting random timing noise into many points within a program can assist in eliciting bugs within the program. Upon eliciting the bug, it is necessary to identify a minimal set of points that indicate the source of the bug to the programmer. In this paper, we pose this problem as an active feature selection problem. We propose an algorithm called the iterative group sampling algorithm that iteratively samples a lower dimensional projection of the program space and identifies candidate relevant points. We analyze the convergence properties of this algorithm. We test the proposed algorithm on several real-world programs and show its superior performance. Finally, we show the algorithm's performance on a large concurrent program.
1. Introduction
Concurrent computer programs are simultaneously executed programs or program threads. Concurrent programs can be run on a single multithreaded processor, on several processors in a single computer, or using several computers distributed across a network. The increasing popularity of concurrent programs has brought the issue of concurrent defect analysis to the forefront. This is true for both servers and client machines, since almost every CPU available these days is multicore.
Concurrent program debugging usually refers to specific examples of bugs that are found only in concurrent programs. One example is a race condition, where the output of a process is unintentionally dependent on the sequence or the timing of other events. Another example unique to concurrent programs is a deadlock, where two threads each wait for the other to produce some action, and thus can never progress. The technique described in this paper is relevant to all types of concurrent bugs.
Much research has been devoted to testing multithreaded programs. This research has examined aspects such as detecting data races [19, 20, 13], replaying in several distributed and concurrent contexts [3], static analysis [23, 12, 5], model checking [22], coverage analysis [17, 2], and cloning [11]. Additionally, generation of different interleavings as a way of improving the efficiency of testing by revealing concurrent faults was demonstrated in [6, 24].
There are a number of distinguishing features between concurrent and sequential testing. These differences make it especially challenging to find concurrent bugs when the set of possible interleavings is huge and it is not practical to try all of them. First, only a few of the interleavings actually produce concurrent faults; thus, the probability of producing a concurrency fault can be very low. Second, under the simple conditions of unit testing, executing the same tests repeatedly to search for concurrent bugs does not help. As a result, concurrent bugs are often found in stress tests or by the customer. The problem of testing multithreaded programs is even more costly because tests that reveal a concurrent fault in the field or in a stress test are usually long and run under different environmental conditions. As a result, such tests are not necessarily repeatable, and when a fault is detected, much effort must be invested in recreating the conditions under which it occurred. When the conditions of the bug are finally recreated, the debugging itself may mask the bug (the observer effect).
Recently, it has been proposed [6] to elicit bugs by injecting timing noise at randomly selected points within a program, known as instrumentations. Instrumentations are created at any point in the program whose relative execution order can impact the result of the program. In general, instrumentations are done on accesses to shared variables and on concurrent instructions such as yield and sleep. Edelstein et al. [6] contains an extensive description of the locations that need to be instrumented and the amount of noise that needs to be introduced in order to manifest the bug. Copty and Ur [4] suggested using search methods such as delta debugging (DD) to identify a minimal set of the most indicative points where injecting noise causes the bug to appear with the highest probability. These points are assumed to be indicative of the location of the bug. Previously [25], we demonstrated how these instrumentation points can be used to pinpoint the location of the bug.
DD is a well-known algorithm for searching for sets of changes. The DD algorithm suggested by Zeller [28] works as follows: start with two sets of instrumentation points $c \subset c'$, such that the program works with $c$ and does not work with $c'$. Start with $c$ as the empty set and $c'$ as the full set of changes that finds the bug. Then, roughly divide the changes in $c'$ in two. If testing with the first part yields the bug, continue recursively with that part. If not, try the second part. If that part yields the bug, continue recursively. Otherwise a subset of the solution must be in the first part and another subset in the second part. Continue recursively searching the first part, while implementing all the changes from the second. At the same time, search the second part while implementing all the changes from the first. The minimal solution is the union of the two searches.
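The recursive search described above can be sketched in a few lines of Python. This is a simplified illustration, not Zeller's full algorithm: it assumes a deterministic and monotonic oracle, and the names `dd`, `test`, and `applied` are hypothetical, introduced only for this example.

```python
def dd(changes, test, applied=frozenset()):
    """Simplified delta debugging over a list of instrumentation points.

    `test(points)` is an assumed oracle returning True when the bug appears
    with exactly the given set of points instrumented. `applied` holds the
    points that are kept instrumented while this part is being searched.
    """
    changes = list(changes)
    if len(changes) <= 1:
        return set(changes)
    mid = len(changes) // 2
    first, second = changes[:mid], changes[mid:]
    if test(applied | set(first)):       # the bug is reproduced by the first half
        return dd(first, test, applied)
    if test(applied | set(second)):      # the bug is reproduced by the second half
        return dd(second, test, applied)
    # Otherwise the solution is split across both halves: search each half
    # while keeping all changes from the other half applied.
    return (dd(first, test, applied | set(second))
            | dd(second, test, applied | set(first)))


# Hypothetical oracle: the bug appears only when points 3 and 7 are both instrumented.
minimal = dd(range(10), lambda points: {3, 7} <= points)
```

The sketch relies on monotonicity: once a blocking point can suppress the bug, the first two tests may fail for reasons unrelated to the location of the relevant points.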
The DD search algorithm assumes the problem to be monotonic, that is, if a set of instrumentations reveals the bug, then any superset of this set also reveals the bug. Ben-Asher et al. [1] and Copty and Ur [4] showed that concurrent programs are not necessarily monotonic, and [25] later showed that real concurrent programs are not monotonic. The reason that concurrent programs are non-monotonic is that there are three kinds of instrumentation points: relevant points, which are points that, if instrumented, increase the probability of eliciting a bug; irrelevant points, which are instrumentation points that have no effect on the probability of eliciting the bug; and blocking instrumentations, which cause the bug not to appear if they are instrumented.
Studies of bugs in multithreaded programs [9, 16] reveal that most bug patterns can be exposed using very few instrumentation points, and sometimes only one. However, the appearance of a bug is non-deterministic in the sense that "correct" instrumentation only increases the likelihood of the bug appearing, but does not guarantee it.
Thus, the problem that needs to be solved to apply instrumentation to debugging of concurrent programs is to find a minimal subset of relevant instrumentation points, discarding both irrelevant and blocking instrumentations. This process is akin to feature selection, albeit in a significantly different setting from that assumed by most feature selection algorithms. In our case, the features, i.e., the instrumentation points, are binary variables, because noise can either be injected into an instrumentation point or not. Furthermore, it is possible to actively probe the tested program with a specific instrumentation pattern and obtain, most probably, the appearance of a bug. Finally, the existence of blocking points means that even if a correct set of relevant points is identified, they might not elicit a bug if a blocking point is instrumented with them.
Feature selection is a basic problem that has received much attention over the years. It is usually defined as the process of finding a minimal subset of indicators that best represents the data [27]. A brute force approach to feature selection, i.e., trying all possible feature combinations, is possible only for very small feature sets. For example, testing all possible combinations for one hundred features requires testing over $10^{30}$ configurations. However, a program containing one hundred instrumentation points is just a little bigger than a toy program. In practice, simple heuristics such as sequential forward search and sequential backward search (also known as sequential growing and sequential pruning, respectively) usually work well. In the first case, the algorithm starts with an empty set and sequentially adds the feature that, together with the current set of selected features, best improves prediction. In the latter case, the algorithm begins with all the features and sequentially removes the least indicative feature upon each iteration. A review of recent, more sophisticated methods is given in Guyon et al. [10]. Most of these feature selection methods are of limited use in the present setting because the probability of eliciting a bug given a random subset of instrumentation points is very small. Therefore, a large number of samples are required before these algorithms can run successfully.
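As a small illustration of the two ideas in this paragraph, the following Python sketch checks the size of the brute-force search space and implements a generic sequential forward search. The function name `sequential_forward_search`, the `score` callable, and the toy scoring rule are assumptions made for this example only.

```python
# Brute force over 100 binary features means 2**100 configurations,
# which indeed exceeds 10**30.
assert 2 ** 100 > 10 ** 30


def sequential_forward_search(features, score, k):
    """Greedily grow a feature set of size at most k.

    `score(subset)` is an assumed callable that rates a list of features;
    higher is better. Ties are broken by the order of `features`.
    """
    selected = []
    remaining = list(features)
    while remaining and len(selected) < k:
        # Add the feature that, together with the current set, scores best.
        best = max(remaining, key=lambda f: score(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy scoring rule (hypothetical): count how many of the features 2 and 5 are present.
chosen = sequential_forward_search(range(6), lambda s: len(set(s) & {2, 5}), k=2)
```

Each step of the greedy search needs a reliable estimate of `score`, which is exactly what becomes expensive when the bug appears only rarely under random instrumentation.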
In the past, there were several attempts at using learning to debug programs. Zheng et al. [29] developed a method for sampling programs and identifying probable bug locations in single-threaded programs. Their instrumentation is performed using assertions placed in the code that are randomly sampled at runtime. This implies that they require a diverse sample of runs to execute all the assertions. The authors then use a utility function to build a classifier whose goal is to correctly predict the outcome of runs (success or failure) based on the outcomes of the assertions. The weights of the utility function then serve as indicators for the location of the bug. Their debugging process requires tuning the parameters of the classifier using a training set and then finding the weights of the classifier using an optimization algorithm. This method, while effective for small programs, seems to incur a high computational cost, and requires setting the assertions manually in the code.
GeneticFinder [7] is a noisemaker that uses a genetic algorithm as a search method, with the goal of increasing the probability of the bug manifestation, and minimizing the set of variables and program locations on which noise is made. The search is performed at runtime, i.e., all program locations are instrumented, and at each point it is decided during runtime whether to apply noise. From our experience, instrumentation alone can change the scheduling of the program. Thus, due to the phenomenon of blocking instrumentations, the approach of partial instrumentation is much more accurate.
Tarantula [14] uses a probabilistic metric for ranking suspicious lines of code in programs. Based on their appearance in failed and successful runs of a program, Tarantula assigns a value called hue to each line. The authors suggest examining lines according to their hue, starting from the lowest values of the hue. The authors do not address the issue of how to sample the program, assuming instead that users perform sampling in an independent manner.
Finally, in a recent article [25] we proposed a technique based on Design of Experiments (DoE) methods. These methods find the best combination of sampling points given a budget (in the form of the possible number of program executions). Once the program is sampled, each instrumentation point is scored for its likelihood of inducing the bug and the highest scored points are reported to the user. This method is lacking in two aspects: first, the sampling points are generated before program execution, so there is no feedback between the results of one program execution and the next. Second, it uses D-Optimal, so-called because it is optimal where there is a linear relationship between the input and output of a system. However, in the case of program debugging, it is not obvious that such a relationship exists. This article aims to solve these two problems.
2. The active sampling algorithm
In the current work, we attempt to find the relevant points in a program by actively searching the space induced by the instrumentation points. However, because this space is very large, we propose to project it onto a lower dimension using a random projection, which is then sampled exhaustively. Irrelevant or blocking points are identified at the lower dimension and discarded, and the process continues until a minimal set is found.
Following is a description of the proposed algorithm, which we call the Iterative Group Sampling (IGS) algorithm. Line numbers refer to Algorithm 1.
The algorithm starts with a set $S$ of all $D$ instrumentation points, which are all initially considered as candidate relevant points, and a low dimension parameter $L$. We discuss the selection of $L$ below. The algorithm generates a full factorial sampling matrix at the lower dimension $L$. A full factorial matrix is a sampling matrix which tests all possible combinations of $L$ (binary) inputs. This matrix, denoted by $X_L$, has $2^L - 2$ rows and $L$ columns (the cases where no instrumentation point is activated and where all instrumentation points are activated are superfluous). At each iteration, a random matrix $H$ of size $L \times |S|$, whose entries are in $\{0,1\}$, is generated. Entries in the matrix are generated such that each column of the matrix contains exactly one non-zero entry. This induces a partition of $S$ into $L$ disjoint groups.
The actual sampling matrix according to which the program is sampled is generated by projecting $X_L$ onto the original instrumentation space using the random matrix $H$ (Line 5). Each of the $2^L - 2$ instrumentation configurations is then tested, and if the bug is elicited it is recorded.
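The construction of the three matrices can be sketched as follows. This is a minimal pure-Python illustration under our own naming (`full_factorial`, `random_partition_matrix`, and `project` are hypothetical helpers); matrices are plain lists of rows.

```python
import itertools
import random


def full_factorial(L):
    """All binary rows of length L except all-zeros and all-ones (2**L - 2 rows)."""
    rows = itertools.product((0, 1), repeat=L)
    return [list(r) for r in rows if 0 < sum(r) < L]


def random_partition_matrix(L, n_points, rng):
    """L x n_points 0/1 matrix H with exactly one non-zero entry per column."""
    H = [[0] * n_points for _ in range(L)]
    for j in range(n_points):
        H[rng.randrange(L)][j] = 1   # point j is assigned to one random group
    return H


def project(X_L, H):
    """Compute X_H = X_L * H; entries stay 0/1 because each column of H has a single 1."""
    n_groups, n_points = len(H), len(H[0])
    return [[sum(row[g] * H[g][j] for g in range(n_groups)) for j in range(n_points)]
            for row in X_L]


rng = random.Random(1)
X_L = full_factorial(3)                   # 6 rows, 3 columns
H = random_partition_matrix(3, 10, rng)   # partition of 10 points into 3 groups
X_H = project(X_L, H)                     # 6 configurations over 10 points
```

Each row of `X_H` is one instrumentation configuration: a point is instrumented in that row exactly when its group is switched on in the corresponding row of `X_L`.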
After running the tests, each instrumentation point is ranked according to its ability to elicit bugs. We use the likelihood ratio test [18] (which was also used in Tzoref et al. [25]) as the scoring function to score the points. Let $P(\mathrm{Success} \mid X_i)$ denote the sample probability that tests which instrumented the $i$-th instrumentation point actually result in a bug, and let $P(\neg\mathrm{Success} \mid X_i)$ denote the same for the case where the bug is not found. We assign each point this score:
$$\mathrm{Score}(i) = \frac{P(\mathrm{Success} \mid X_i)}{P(\neg\mathrm{Success} \mid X_i)} \tag{1}$$
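The score can be computed directly from the sampling matrix and the recorded outcomes. The sketch below is one possible reading of the formula; the function name and the handling of the two edge cases (a point that was never instrumented, and a point whose runs all elicited the bug) are our assumptions, not part of the original definition.

```python
def likelihood_ratio_scores(X, T):
    """Score each instrumentation point by P(Success | X_i) / P(not Success | X_i).

    X is a list of 0/1 rows (one row per test), T[k] is 1 if test k elicited the bug.
    Both probabilities are sample frequencies over the tests that instrumented point i,
    so their ratio reduces to (#bug runs) / (#non-bug runs).
    """
    scores = []
    for i in range(len(X[0])):
        outcomes = [t for row, t in zip(X, T) if row[i] == 1]
        hits = sum(outcomes)
        misses = len(outcomes) - hits
        if not outcomes:
            scores.append(0.0)            # assumption: never instrumented, no evidence
        elif misses == 0:
            scores.append(float("inf"))   # assumption: the bug appeared in every such run
        else:
            scores.append(hits / misses)
    return scores


# Tiny hypothetical example: 4 tests over 2 points.
example_scores = likelihood_ratio_scores([[1, 0], [1, 1], [0, 1], [1, 0]], [1, 1, 0, 0])
```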
The points with the highest score are assumed to be most indicative of the location of the bug. Such a scoring function assumes no correlation between instrumentation points; that is, if a bug requires that two points be instrumented in a correlated fashion, it may not be found using this scoring function. Therefore, instead of keeping points with a high score, we conservatively discard from $S$ those points that obtained the lowest score. Note that these points are members of one or more groups. The process is then repeated with a smaller set $S$ until the size of the candidate group is small enough to be tested exhaustively using a full factorial design. We note that we tested other scoring functions such as the conditional entropy of the output given the sampling matrix, but our experience is that these functions are less sensitive than the likelihood ratio.
A program may contain more than one bug. Frequently, it is relatively easy to attribute runs which ended prematurely to a specific bug, according to the error messages they generated. Thus, even if there is more than one bug in the program, it is possible to focus on a specific bug by observing the error produced by each run, and identifying its associated bug. In fact, during early stages of this work, we discovered an additional bug in one of the tested programs, which was previously unknown to the programmers. Both this and the other, known bug were correctly pinpointed by the proposed algorithm.
The main parameter to set in Algorithm 1 is the low dimension at which points are sampled, $L$. This parameter determines the number of iterations required for identifying the set of relevant features and discarding all irrelevant and blocking points. Intuitively, a low $L$ is useful when there are few blocking points. However, if $L$ is chosen at a low value when there are many blocking points, it is likely that the partition of points results in blocking and relevant points in the same group, which prevents their identification until a valid repartition of the points is found. Alternatively, a high $L$ requires running many tests to fully sample the low dimensional search space.

Algorithm 1 The iterative group sampling (IGS) algorithm
 1: Given $D$ sampling points and $L$, a parameter specifying a low sampling dimension
 2: Initialize $S = \{1, 2, \ldots, D\}$ (a set of possible relevant points) and let $X_L$ be a full factorial sampling matrix in dimension $L$ ($[X_L] = (2^L - 2) \times L$).
 3: while $|S| \ge L$ do
 4:    Create a random matrix $H$ such that $[H] = L \times |S|$ and $\sum_{i=1}^{L} H_{i,j} = 1 \;\; \forall j = 1, 2, \ldots, |S|$.
 5:    Compute the high dimensional sampling matrix $X_H = X_L \cdot H$.
 6:    for $i = 1$ to $2^L - 2$ do
 7:       Sample the program using $X_H$ such that if $(X_H)_{ij} = 1$ then point $S_j$ is instrumented.
 8:       If a bug is discovered, assign $T(i) = 1$, otherwise $T(i) = 0$.
 9:    end for
10:    If the bug was discovered at least once during this iteration, identify groups that caused the smallest number of bug appearances using $T$ and discard all points within these groups from $S$.
11: end while
12: Return a set of relevant points $S$
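The main loop of Algorithm 1 can be sketched compactly in Python. This is an illustrative sketch under several assumptions of ours: the partition is stored as a point-to-group mapping rather than an explicit matrix $H$ (the two are equivalent), `run_test` is a hypothetical callable standing in for one execution of the instrumented program, a `max_iters` cap is added to guarantee termination, and an iteration in which all groups tie is simply repartitioned.

```python
import itertools
import random


def igs(n_points, L, run_test, rng, max_iters=200):
    """Iterative group sampling: returns the surviving candidate points."""
    S = list(range(n_points))
    # Full factorial design without the all-zeros and all-ones rows.
    X_L = [r for r in itertools.product((0, 1), repeat=L) if 0 < sum(r) < L]
    for _ in range(max_iters):
        if len(S) < L:
            break
        group_of = {p: rng.randrange(L) for p in S}   # random partition (matrix H)
        # One test per row: instrument every point whose group is switched on.
        T = [run_test({p for p in S if row[group_of[p]]}) for row in X_L]
        if not any(T):
            continue                                  # bug never seen: repartition
        used = set(group_of.values())
        count = {g: sum(t for row, t in zip(X_L, T) if row[g]) for g in used}
        worst = min(count.values())
        if worst == max(count.values()):
            continue                                  # assumption: no group stands out
        # Discard all points in the groups with the fewest bug appearances.
        S = [p for p in S if count[group_of[p]] > worst]
    return set(S)


# Hypothetical simulated program: the bug appears when both relevant points are
# instrumented and no blocking point is instrumented.
RELEVANT, BLOCKING = {3, 17}, {5, 11}


def simulated_run(instrumented):
    return RELEVANT <= instrumented and not (BLOCKING & instrumented)


candidates = igs(40, 4, simulated_run, random.Random(0))
```

In this deterministic toy model a relevant point is never discarded, because its group is switched on in every run that shows the bug and therefore always attains the maximal count. With a real, non-deterministic program this guarantee does not hold.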
The probability that a random partition of the points into $L$ groups will result in a good configuration, that is, one where relevant and blocking points are in different groups, can be computed as follows:
$$\mathrm{PrGood}(K_r, K_b, L) = \sum_{i=1}^{\min(K_r, L)} P_r(K_r, L, i) \cdot P_b(K_b, L, i) \tag{2}$$
where $P_r$ is the probability that all relevant points are distributed between exactly $i$ groups and $P_b$ is the probability that all blocking points are distributed into the remaining groups. The number of relevant points is denoted by $K_r$ and that of blocking points by $K_b$. Using this notation,
$$P_b(K_b, L, i) = \left(\frac{L - i}{L}\right)^{K_b} \tag{3}$$
and by applying the Inclusion-Exclusion principle,
$$P_r(K_r, L, i) = \frac{\binom{L}{i}}{L^{K_r}} \sum_{j=0}^{i-1} \binom{i}{j} (i - j)^{K_r} (-1)^j \tag{4}$$
The total number of tests that are run by the algorithm until a good partition is found is given by $T_{Total} = (2^L - 2) \cdot N_{Iter}$, where $N_{Iter}$ is the number of iterations. To ensure that within this number of iterations a good partition is found with probability at least $1 - \theta$, we require
$$(1 - \mathrm{PrGood})^{N_{Iter}} \le \theta \tag{5}$$
Alternatively, the total number of tests until a good partition is found is given by this equation:
$$T_{Total}(K_r, K_b, L, \theta) = \frac{\log(\theta)}{\log(1 - \mathrm{PrGood})} \cdot (2^L - 2) \tag{6}$$
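Equations 2, 3, 4, and 6 translate directly into a few short functions. The sketch below is our own illustration (the function names are hypothetical) and assumes $L \ge 2$ and $K_b \ge 1$, so that PrGood lies strictly between 0 and 1 and the logarithms are defined.

```python
import math


def p_relevant(Kr, L, i):
    """Eq. 4: probability that Kr relevant points occupy exactly i of the L groups."""
    onto = sum((-1) ** j * math.comb(i, j) * (i - j) ** Kr for j in range(i))
    return math.comb(L, i) * onto / L ** Kr


def p_blocking(Kb, L, i):
    """Eq. 3: probability that all Kb blocking points fall in the remaining L - i groups."""
    return ((L - i) / L) ** Kb


def pr_good(Kr, Kb, L):
    """Eq. 2: probability that a random partition separates relevant and blocking points."""
    return sum(p_relevant(Kr, L, i) * p_blocking(Kb, L, i)
               for i in range(1, min(Kr, L) + 1))


def t_total(Kr, Kb, L, theta):
    """Eq. 6: number of tests until a good partition is found, for failure bound theta."""
    return math.log(theta) / math.log(1 - pr_good(Kr, Kb, L)) * (2 ** L - 2)


# Hypothetical usage: pick the L in 2..10 that minimizes the test budget
# for assumed values Kr = 2, Kb = 5 and theta = 0.05.
best_L = min(range(2, 11), key=lambda L: t_total(2, 5, L, 0.05))
```

As a sanity check, with two relevant points, one blocking point, and two groups, a good partition requires both relevant points in one group (probability 1/2) and the blocking point in the other (probability 1/2), and `pr_good(2, 1, 2)` indeed evaluates to 0.25.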
Thus, if $K_r$ and $K_b$ are known, it is possible to find an $L$ which minimizes $T_{Total}$. However, in practice, it may be difficult to obtain a value for these parameters. Therefore, we investigated how the number of groups should be selected, based on worst-case assumptions.
The largest probability for a bad partition is when $K_r = 1$. In that case, Equation 6 is reduced to this:
$$T_{Total,\,\mathrm{worst\ case}} \ge \frac{\log(\theta)}{\log\left(1 - \left(\frac{L-1}{L}\right)^{K_b}\right)} \cdot (2^L - 2) \tag{7}$$
It is possible to approximate the total number of program executions using the following derivation:
$$\min_L T_{Total,\,\mathrm{worst\ case}} \approx \min_L \, -\frac{2^L - 2}{\log\left(1 - e^{-K_b/L}\right)} \approx \min_L \frac{2^L}{e^{-K_b/L}} \approx \min_L e^{\log(2) L + K_b/L} \tag{8}$$
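The minimizer of the last expression in Equation 8 follows from elementary calculus. The short derivation below treats $L$ as a continuous variable, which is an assumption made only for this illustration, since $L$ is an integer in the algorithm.

```latex
\begin{align*}
f(L)   &= \log(2)\,L + \frac{K_b}{L}, \\
f'(L)  &= \log(2) - \frac{K_b}{L^{2}} = 0
          \;\Longrightarrow\; L^{*} = \sqrt{\frac{K_b}{\log(2)}}, \\
f''(L) &= \frac{2K_b}{L^{3}} > 0 \quad \text{for } L > 0,
\end{align*}
```

so $L^{*}$ is a minimum of the exponent and hence of $e^{f(L)}$. For a hypothetical $K_b = 10$, this gives $L^{*} = \sqrt{10/0.693} \approx 3.8$, suggesting roughly four groups.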
Therefore, in the worst case, the minimal value of $T_{Total}$ is obtained when $L \approx \sqrt{K_b/\log(2)}$. It is also possible to compute a worst-case estimate of $K_b$ in the following manner: start the algorithm with a reasonably small $L$ and run it for a single iteration. Based on the empirical probability of finding a bug in this iteration, use Equation 2 to estimate $K_b$, and later use this estimate to compute the optimal number of groups, $L$, in Equation 7.
Theoretically, once a single good partition is found, all groups containing relevant points are identified, and a simple implementation of the delta debugging technique suffices to identify these points. In practice, however, it is expected that not all blocking points will be identified in a single iteration and thus the number of tests computed in Equation 6 is only an approximation of the total number of required tests.
The IGS algorithm has three possible sources of error. The first is when a bad random partition is chosen. Such a partition is one where both relevant and blocking points are in the same group. This prevents the bug from appearing during all $2^L - 2$ program executions. In such a case, the simple solution is to repartition the points using a different random partition, and resample the program. The cost of such an error is $2^L - 2$ wasted program executions. A related problem is when a group which contains good points that were masked by bad points is removed. We have observed such behavior in practice. However, this is a rare occurrence and, because it is random in nature, rerunning the algorithm will usually find the correct solution. Therefore, occasional mistakes that IGS makes are more than offset by the gain in performance, as we describe below.
The second source of error is when there are redundant instrumentation points; that is, several points may independently induce the bug with similar probability. In this case, IGS may identify only some of these points. This, however, is considered to be a better choice than allowing the redundant points as well as irrelevant points to be displayed to the programmer.
In software testing, a tool that reports many false alarms will not be used. It is obvious that a tool that generates a flood of false alarms may be of little use. However, it is a common mistake for tool makers, especially makers of static analyzers, to err on the side of caution. The tool maker may reason, for example, that it is acceptable if one of every two alarms is false, claiming that if it takes half an hour to invalidate or validate an alarm, then on average, a bug is found every hour, which is very efficient compared to testing. The problem is that while the economic reasoning is sound, in practice developers will not use such a tool as they stop trusting its output.
Therefore, the design decision taken by most successful tools is to have fewer warnings with less likelihood of false alarms. In our case, when IGS automatically finds places in the code that we think may be close to the source of the bug, it should be very careful not to choose too many irrelevant places. Thus, the fact that IGS may under-report relevant points is not a serious drawback of the algorithm. In hardware verification, where typically the cost of bugs is much higher, the design decision taken by the tools is sometimes different. They tend to err more on the side of caution as it is very important not to miss bugs. In this case, the reports may contain a higher proportion of false alarms.
Finally, the third source of error for the algorithm can occur when the bug appears by chance, regardless of the specific instrumentation. This happens (as detailed below) with a relatively low probability. In such cases, the wrong group may be discarded by the algorithm and cause it to diverge. Backtracking may be required; that is, if the algorithm fails to elicit the bug after too many random partitions, the last discarded points are added back to the set of feasible instrumentation points.
We note that IGS is easily amenable to parallelization because each of the $2^L - 2$ tests run at a given iteration can be executed in parallel. This provides additional speedup to the debugging process.
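Since the tests within one iteration are independent, they can be dispatched to a worker pool. The sketch below uses Python's standard `concurrent.futures` module; `run_configurations` and `run_test` are hypothetical names, and in a real setting `run_test` would presumably launch the instrumented program as a separate process and report whether the bug appeared.

```python
from concurrent.futures import ThreadPoolExecutor


def run_configurations(configurations, run_test, max_workers=4):
    """Run every instrumentation configuration concurrently.

    Returns the outcome vector T in the same order as `configurations`
    (Executor.map preserves input order), with 1 where the bug appeared.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return [1 if bug else 0 for bug in pool.map(run_test, configurations)]


# Hypothetical stand-in for a program run: the bug needs points 1 and 2 together.
outcomes = run_configurations([{1, 2}, {2}, {1}], lambda c: 1 in c and 2 in c)
```

Threads suffice here because each worker would mostly wait on an external program run; the pool size is an arbitrary choice for the example.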
3. Benchmarking the proposed algorithm
3.1. Tested programs
We compared the performance of the IGS algorithm to
that of random sampling and the RELIEF algorithm, on
three concurrent Java programs. The programs and the code
required for sampling them can be seen in the benchmark
described by Eytani et al. [8].
The first of these programs is a web crawler embedded in an IBM product, which collects web pages by following the links from one page to the other. Delta debugging does not work for this example. The skeleton of the algorithm (that is, only the code that involves concurrent programming, without the operational code) has 1200 lines of code and 314 instrumentation points. The program contains a race condition that is very rarely manifested, since it only occurs in a very small percentage of all possible interleavings. Without instrumentation the bug did not occur in 2000 program executions. When instrumenting all possible locations, it appears on average only in one out of 750 program executions. We did not observe the bug in any of 2000 executions when randomly instrumenting two of the points, which is the minimal number of points required to elicit the specific bug (according to an analysis of the bug itself).
The second program we tested was a server loop program that continuously performs database transactions according to client requests from another IBM product. This program contains a deadlock. This program has a skeleton comprising 152 lines of code and 72 instrumentation points. The deadlock is easily induced. Randomly sampling two points induces the bug in 25% of the samples. Thus, many small subsets of instrumentations reveal the bug. Despite this fact, DD was unable to find a minimal instrumentation [25], which is the goal of the debugging algorithm.
The third program we used is an example of a buffer overflow and underflow in a standard producer-consumer program. The goal is to overflow the central buffer that connects the consumers to the producers. The test program consists of four threads (two producers and two consumers) running concurrently. This program has a skeleton of 131 lines of code and 160 instrumentation points (some of the instrumentation points are in wrapper code and are not essentially a part of the actual program). Inducing the bug in this program using a random instrumentation is very difficult: only 10 of 5000 (0.2%) random instrumentations with equal probabilities caused the bug to appear. This is because, in order to generate the bug, it is necessary both to instrument the program at the correct locations and, at these locations, to cause delays only at the correct times. In our scenario we do not have control over the latter.
3.2. Testing algorithms
RELIEF [15] is a highly successful feature weighting algorithm. It works by randomly selecting a point from the data set, finding the closest example from the data set with the same label and the closest sample with a different label, and modifying the weights so that the point from the same class becomes closer and the point from the other class more distant. In our case, the label would be whether or not the bug was observed for a given instrumentation pattern.
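For binary instrumentation patterns, the weight update described above can be sketched as follows. This follows the textbook form of the basic Relief update with Hamming distance; it is an illustration under our own naming and is not necessarily the exact variant or implementation used in the experiments.

```python
import random


def relief(X, y, n_samples, rng):
    """Basic Relief feature weighting for 0/1 feature vectors.

    X is a list of binary rows (instrumentation patterns), y the labels
    (1 if the bug was observed). Returns one weight per feature.
    """
    n_features = len(X[0])
    w = [0.0] * n_features

    def dist(a, b):                      # Hamming distance between two patterns
        return sum(u != v for u, v in zip(a, b))

    for _ in range(n_samples):
        i = rng.randrange(len(X))
        same = [k for k in range(len(X)) if k != i and y[k] == y[i]]
        other = [k for k in range(len(X)) if y[k] != y[i]]
        if not same or not other:
            continue                     # no hit or no miss available for this sample
        hit = min(same, key=lambda k: dist(X[i], X[k]))    # nearest same-label example
        miss = min(other, key=lambda k: dist(X[i], X[k]))  # nearest other-label example
        for f in range(n_features):
            # Reward features that differ from the miss, penalize those that differ from the hit.
            w[f] += (abs(X[i][f] - X[miss][f]) - abs(X[i][f] - X[hit][f])) / n_samples
    return w


# Hypothetical data: all 3-bit patterns, where the label equals the first feature.
patterns = [[a, b, c] for a in (0, 1) for b in (0, 1) for c in (0, 1)]
weights = relief(patterns, [p[0] for p in patterns], 50, random.Random(0))
```

In this toy data set the first feature alone determines the label, so it receives the largest weight.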
The algorithms are measured in terms of speed and accuracy. Speed is the number of times that a program must be run to collect enough information for identifying the relevant instrumentation points. Obviously, running the tested programs a very large number of times is impractical. Accuracy (or specificity) is the fraction of features identified that are in the minimal set that can be used by the programmer to understand the source of the bug.
IGS was run using between two and eight groups for each of the three test programs. For each number of groups we ran IGS twice. To maintain a valid comparison between the IGS algorithm and RELIEF, we trained the latter algorithm using the same number of runs used by IGS and selected the same number of instrumentation points as that chosen by IGS. The training points were generated so that each point was instrumented with a probability of 0.5.
3.3. Results
Figure 1 shows the number of times that each program
was run as a function of the number of groups when using
the IGS algorithm. As this figure demonstrates, the number
of groups that minimizes the number of tests is relatively
small.
A second-order polynomial was fit to the experimental data. The reason for using such a polynomial is explained by our assumption that the values of $K_b$ and $K_r$ in Equation 2 are small. Under this assumption, an analysis similar to that in Equation 8 results in a second-order polynomial relationship between the number of groups and the total number of tests. For example, if $K_r = 2$, Equation 8 scales as eL0.5/L, for which a good approximation is a second-order polynomial. There is a strong correlation ($R^2 > 0.75$) of a quadratic fit between the number of groups and the number of tests for the first two programs. This fit is worse for the third program we tested, which we attribute to the fact that correct instrumentation of this program requires correct instrumentation of both location and time, but our instrumentation has no control over the latter.
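The fitting step can be reproduced with a small least-squares routine. The sketch below is an illustration only: it assumes an ordinary least-squares fit solved through the normal equations, the function names are hypothetical, and any data passed to it would have to come from the actual experiments.

```python
def fit_quadratic(xs, ys):
    """Least-squares fit of y ~ c0 + c1*x + c2*x^2; returns [c0, c1, c2]."""
    # Power sums that make up the 3x3 normal equations.
    s = [sum(x ** k for x in xs) for k in range(5)]
    t = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(3)]
    a = [[s[i + j] for j in range(3)] + [t[i]] for i in range(3)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(3):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [v - factor * w for v, w in zip(a[r], a[col])]
    return [a[i][3] / a[i][i] for i in range(3)]


def r_squared(xs, ys, coeffs):
    """Coefficient of determination of the quadratic fit."""
    mean_y = sum(ys) / len(ys)
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    ss_res = sum((y - (coeffs[0] + coeffs[1] * x + coeffs[2] * x * x)) ** 2
                 for x, y in zip(xs, ys))
    return 1.0 - ss_res / ss_tot
```

With `xs` holding the number of groups and `ys` the number of program executions until convergence, `r_squared` gives the quantity that is compared against 0.75 in the text.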
A human expert labeled the points in each of the programs that would assist a programmer in understanding the source of the bug. Table 1 shows the average precision obtained by the IGS algorithm compared to that of RELIEF, that is, the fraction of relevant points from all the points labeled by the algorithm as relevant. The average was computed using the results of runs with all the different group numbers. This table shows that the IGS algorithm has a relatively high precision, which, as noted above, is an important design consideration for real debugging systems. Table 1 also shows the average detection rate, which is the fraction of runs that resulted in any relevant points in the points labeled by the algorithm as relevant (i.e., a recall of at least one point). This is a different measure from the average precision in that it gives an indication of the usefulness of the search algorithm to a user. A high detection rate means that at least some of the points labeled as relevant are indicative of the bug. IGS is, again, superior in its performance compared to RELIEF, scoring a perfect detection rate for the first two programs.
Program number    Average precision      Average detection
                  IGS       RELIEF       IGS       RELIEF
1                 0.54      0.04         1.00      0.16
2                 0.46      0.21         1.00      0.46
3                 0.18      0.12         0.77      0.62
Table 1. Average precision and average detection rate obtained by IGS and by RELIEF.
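The two measures reported in Table 1 follow directly from their definitions. The sketch below is a minimal illustration; the function names and the example point identifiers are hypothetical, and `relevant` stands for the set of points labeled by the human expert.

```python
def precision(selected, relevant):
    """Fraction of the points selected by the algorithm that are relevant."""
    selected = set(selected)
    if not selected:
        return 0.0
    return len(selected & set(relevant)) / len(selected)


def average_precision(runs, relevant):
    """Mean precision over several runs (e.g. over all group numbers)."""
    return sum(precision(s, relevant) for s in runs) / len(runs)


def detection_rate(runs, relevant):
    """Fraction of runs whose selection contains at least one relevant point."""
    relevant = set(relevant)
    return sum(1 for s in runs if relevant & set(s)) / len(runs)
```

For example, with expert-labeled points `{7, 8}` and three runs selecting `{3, 7}`, `{7, 9, 11, 12}` and `{1, 2}`, the average precision is 0.25 while the detection rate is 2/3, which shows how the two measures can diverge.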
We note that these three programs are the same as those used in Tzoref et al. [25]. The IGS algorithm identified points similar to those found by the batch method in that paper. However, in contrast with that method, where 5000 points were required for identifying the bugs (1-3 orders of magnitude more than the number of instrumentation points), IGS always required fewer program executions, sometimes up to two orders of magnitude fewer (see Figure 1). This reduced execution time is further exemplified in the example below.
4. Bug location enhancement
As noted in Tzoref et al. [25], it is frequently not enough to examine the score of single points in order to pinpoint the location of a bug. This is because some bugs are induced by many subsets of instrumentations, and the absolute score of each point is not indicative enough of the bug’s location. Therefore, it was suggested to look not at the absolute values of the points’ scores, but at the derivative of the scores along the program control-flow graph.
Figure 1. Number of program executions until convergence of IGS as a function of the number of groups, for the three programs tested. The solid line denotes a second-order polynomial fit to the experimental data.
However, when using the IGS algorithm, not all points are run for a comparable number of runs, and thus computing their score according to the runs in which they took part may skew their scores. Therefore, we propose the following modification to the derivative score method described in Tzoref et al. [25]: Once the instrumentation points which elicit the bugs are identified, new subsets of the instrumentation points are run by traversing the program execution paths forward and backward from the instrumentation points identified by IGS. Each run contains all but one of the points identified by IGS and additionally one point which is before or after the omitted point, along the program execution path. This is repeated until a drop in the number of times the bug appears is identified.
Such a process results in a local derivative score around
the instrumentation points identified by IGS. Below we
show results of this method.
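The traversal can be sketched as a walk along a single execution path. Several simplifications here are assumptions rather than details from the text: the path is taken to be a flat ordered list of instrumentation points instead of a full control-flow graph, `bug_rate` is a hypothetical callback that runs the program several times with the given points instrumented and returns the fraction of runs showing the bug, and `drop_ratio` is an invented stopping criterion.

```python
def scan_direction(path, igs_points, target, direction, bug_rate, drop_ratio=0.5):
    """Replace `target` by successive neighbours along `path` and record bug rates.

    direction is -1 (backward) or +1 (forward). The walk stops at the first
    neighbour whose bug rate falls below drop_ratio times the baseline rate.
    """
    kept = frozenset(p for p in igs_points if p != target)
    baseline = bug_rate(frozenset(igs_points))
    scores = []
    i = path.index(target) + direction
    while 0 <= i < len(path):
        candidate = path[i]
        if candidate not in kept:  # skip the other points found by IGS
            rate = bug_rate(kept | {candidate})
            scores.append((candidate, rate))
            if rate < drop_ratio * baseline:
                break  # a drop in bug manifestation was observed
        i += direction
    return scores
```

Differences between consecutive rates in the returned list then give a local derivative score around `target`; the scan would be repeated for each point found by IGS and in both directions.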
5. Speeding up the IGS algorithm
The basic design of the IGS algorithm as described in Algorithm 1 calls for the removal of all groups which caused the bug to appear in the fewest program executions during an iteration (see line 10). However, since a bug may appear randomly using any (or none) of the instrumentation points, it is beneficial to relax this step by allowing the removal of additional groups. This can be performed when some of the groups elicited the bug at a frequency that is significantly higher compared to other groups. For example, suppose that IGS is run with three groups, and at one of the iterations the first group did not elicit the bug, the second caused it to appear in 2% of the runs and the third in 90% of the runs. Intuitively, it seems useful to discard the first two groups and keep only the last.
We propose the following procedure for deciding if some groups elicited the bug significantly more than others. After running the program, instrumented as detailed in lines 6-9 of Algorithm 1, the groups are sorted in increasing order of the number of times that the bug appeared. For each pair of consecutive groups, a contingency table is constructed for these groups, as shown in Table 2.
Number of times    Found bug    Did not find bug
Group 1            a            b
Group 2            c            d
Table 2. Contingency table for comparing two IGS groups
From this table, we can compute the probability that the difference in bug elicitation is significant, using the Normal approximation to the Binomial distribution [26], that is, assuming $a \geq c$ and labeling the total number of times that the program was executed by $n = a + b$, a test score is computed:
$$S = \frac{a - c}{\sqrt{(a + c)(2n - a - c)/(2n)}} \qquad (9)$$
If the difference is statistically significant as determined by a test score $S$ greater than a predetermined threshold, this difference is marked as significant. We used a conservative threshold of $p < 10^{-6}$, so as to avoid possible errors due to repeated hypothesis testing. Only groups with the largest number of times that the bug was elicited and which are above a statistical significance threshold are kept. Other groups are discarded. If no statistically significant difference is found, the basic IGS procedure is followed, as shown in Algorithm 1.
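The pruning rule can be sketched as below. This is one possible reading, stated as an assumption: the groups above the topmost statistically significant gap are kept, and the p-value threshold is converted to a one-sided critical value of the standard Normal distribution. The function names are hypothetical, and every group is assumed to have been run the same number of times `n`.

```python
from math import sqrt
from statistics import NormalDist


def z_score(a, c, n):
    """Test score S of Equation 9 for two groups, each run n times."""
    if a < c:
        a, c = c, a  # the formula assumes a >= c
    pooled = (a + c) * (2 * n - a - c) / (2 * n)
    if pooled == 0:
        return 0.0  # both groups always or never elicited the bug
    return (a - c) / sqrt(pooled)


def groups_to_keep(bug_counts, n, p_threshold=1e-6):
    """Return the indices of groups to keep, or None if no gap is significant."""
    z_crit = NormalDist().inv_cdf(1.0 - p_threshold)
    order = sorted(range(len(bug_counts)), key=lambda g: bug_counts[g])
    # Scan consecutive pairs from the most bug-eliciting group downward.
    for k in range(len(order) - 1, 0, -1):
        lower, higher = order[k - 1], order[k]
        if z_score(bug_counts[higher], bug_counts[lower], n) > z_crit:
            return sorted(order[k:])
    return None  # caller falls back to the basic IGS removal step
```

Applied to the three-group example given earlier in this section, and assuming 100 runs per group, the bug counts `[0, 2, 90]` lead to only the third group being kept.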
6. Scaling up to larger programs
In this section we describe our experiments on a much larger program compared to the ones described in Section 3. Because of its size, the batch algorithm proposed by Tzoref et al. [25] could not have analyzed this program within reasonable time. We did not make any changes or parameter adjustments when applying IGS to this example. This demonstrates the strength of IGS in that it requires no knowledge of the specific properties of the bug (other than a way to identify its occurrence) or the program under test in order to converge to the correct solution.
We tested the enhancement of the IGS algorithm on the Java 1.4 collection library that was used as a case study in Sen and Agha [21]. This is a thread-safe Collection framework implemented as part of the java.util package of the standard Java library provided by Sun Microsystems. It consists of 141 classes and has 8067 instrumentation points. Sen and Agha [21] describe several concurrent bugs found in this library. One of these bugs is a race condition that may occur on the field size of the LinkedList class when the methods l1.clear() and l2.containsAll(l1) are concurrently executed. The value of size of l1 is set to zero while l2 is iterating on the elements of l1. If the write access to size occurs after l2 consumes an element and increments the elements counter, the race leads to an infinite loop. However, if size is set to zero a short while before, immediately after l2 consumes the element but before it increments the counter, this results in an uncaught exception due to a failure of a sanity check on the values of size and the counter, which is performed by l2.
We used two threads for each test. Two LinkedList objects were instantiated, and the library methods clear() and containsAll() were called on them concurrently. Two points were labeled for each bug, indicating the accesses that cause the race.
When randomly instrumenting two points out of the
8067 instrumentation points, which is the minimal number
of points required to manifest the race, we have never seen
either of the errors occur.
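A back-of-the-envelope calculation suggests why random pairs are so unlikely to manifest the race. The sketch assumes that pairs are drawn uniformly at random and that one specific pair triggers a given manifestation; both are simplifications, and the trial count passed in the usage note is only illustrative.

```python
from math import comb


def chance_of_hitting(num_points, subset_size, num_trials):
    """Probability that at least one of num_trials uniformly random subsets
    equals one specific subset of the instrumentation points."""
    num_subsets = comb(num_points, subset_size)
    return 1.0 - (1.0 - 1.0 / num_subsets) ** num_trials


# Number of distinct pairs among the 8067 instrumentation points.
NUM_PAIRS = comb(8067, 2)
```

There are 32,534,211 distinct pairs, so under these assumptions even `chance_of_hitting(8067, 2, 5000)` stays below one in a thousand.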
IGS was run twice on this example, each time focusing on a different manifestation of the race. The number of groups used was five. Thirty-four iterations were required for convergence when focusing on the infinite loop, and 27 iterations when focusing on the uncaught exception.
For both cases, IGS identified two points as eliciting the
bug. The point just before setting size to zero appeared
in both pairs. For each pair of points (p1,p2), we measured
the local derivative score as described in Section 4. In each
iteration, we instrumented two points and ran the program
multiple times, where at each run one point is an original
point p1, and the other is a point in the surroundings of p2
on the control flow graph. We continued until observing a
drop in the number of times the bug was manifested. We
then repeated the process while switching the roles of p1
and p2. The results for both bugs are depicted in Figure 2.
A partial control flow graph of the program is shown where nodes denote instrumentation points and edges possible execution flows. The two best points (nodes) found by IGS are highlighted, as are the maximal and minimal derivative scores in the surrounding of the best two points found by IGS. The specific details of the instrumentation points in the figure are of lesser importance. We use it to demonstrate the localization of the bugs achieved using IGS.
For both cases, the derivative score reaches its peak at
the points of the race that lead to the bug, i.e., for the infinite
loop the points are right before size is set to zero and right
after the sanity check is performed, and for the uncaught
exception the points are right before size is set to zero
and right before the sanity check is performed. Thus, for
both cases, we were able to pinpoint the race and the exact
scenario that leads to the error.
Finally, when using the speedup procedure described in Section 5, we found that the runtime of the IGS algorithm was reduced by approximately 50% when locating the element exception bug, but without significant reduction for the infinite loop bug. In both cases, however, similar results were obtained with regard to the instrumentation points which caused the bugs.
7. Discussion
In this paper we present an efficient automatic method for debugging concurrent programs. Using instrumentation of points within a program and noise injection, it is possible to elicit bugs and pinpoint locations in the program code that are strongly suggestive of the source of the bugs. We demonstrated that finding an indicative minimal set can be posed as a problem of active feature selection and suggested the use of the IGS algorithm for solving this problem. Our analysis and results suggest that the IGS algorithm is much more accurate than the RELIEF algorithm, obtaining good results with far fewer runs compared to previously-used batch algorithms.
We also demonstrated two improvements of IGS which assist in bug localization and improve runtime. Finally, we demonstrated that IGS can scale to large concurrent programs which could not be debugged in reasonable time using the previously suggested batch algorithm.
One major drawback of IGS is that it does not take into account the sampling history; that is, the information collected in previous iterations about each instrumentation point. In future work we intend to find a principled way in which to incorporate such information, which, it is hoped, will speed up the convergence of IGS even further.
Figure 2. Difference score graph for the Java 1.4 collection library. The top graph shows the scores for the element exception bug and the bottom graph shows the scores for the infinite loop bug. The original points found by IGS as well as the largest and smallest derivative scores in the surrounding of the best points found by IGS are highlighted.
Acknowledgment
This work is partially supported by the European Community under the Information Society Technologies (IST) programme of the 6th FP for RTD - project SHADOWS contract IST-035157. The authors are solely responsible for the content of this paper. It does not represent the opinion of the European Community, and the European Community is not responsible for any use that might be made of data appearing therein.
References
[1] Y. Ben-Asher, Y. Eytani, E. Farchi, and S. Ur. Noise makers need to know where to be silent - producing schedules that find bugs. In International Symposium on Leveraging Applications of Formal Methods, Verification and Validation (ISOLA), 2006.
[2] A. Bron, E. Farchi, Y. Magid, Y. Nir, and S. Ur. Applications of synchronization coverage. In PPoPP ’05: Proceedings of the tenth ACM SIGPLAN symposium on Principles and practice of parallel programming, pages 206–212, New York, NY, USA, 2005. ACM Press.
[3] J.-D. Choi and H. Srinivasan. Deterministic replay of Java multithreaded applications. In Proceedings of the SIGMETRICS Symposium on Parallel and Distributed Tools, August 1998.
[4] S. Copty and S. Ur. Toward automatic concurrent debugging via minimal program mutant generation with AspectJ. In TV ’06: Proceedings of Multithreading in Hardware and Software: Formal Approaches to Design and Verification, pages 125–132, 2006.
[5] J. C. Corbett, M. Dwyer, J. Hatcliff, C. Pasareanu, Robby, S. Laubach, and H. Zheng. Bandera: Extracting finite-state models from Java source code. In Proc. 22nd International Conference on Software Engineering (ICSE). ACM Press, June 2000.
[6] O. Edelstein, E. Farchi, Y. Nir, G. Ratsaby, and S. Ur. Multithreaded Java program test generation. IBM Systems Journal, 41(1):111–125, 2002.
[7] Y. Eytani. Concurrent Java test generation as a search problem. In Proceedings of the Fifth Workshop on Runtime Verification (RV), volume 144(4) of Electronic Notes in Theoretical Computer Science, 2005.
[8] Y. Eytani, K. Havelund, S. D. Stoller, and S. Ur. Towards a framework and a benchmark for testing tools for multi-threaded programs: Research articles. Concurr. Comput.: Pract. Exper., 19(3):267–279, 2007.
[9] Y. Eytani and S. Ur. Compiling a benchmark of documented multithreaded bugs. In IPDPS, 2004.
[10] I. Guyon, S. Gunn, M. Nikravesh, and L. Zadeh. Feature Extraction, Foundations and Applications. Series Studies in Fuzziness and Soft Computing. Springer, 2006.
[11] A. Hartman, A. Kirshin, and K. Nagin. A test execution environment running abstract tests for distributed software. In Proceedings of Software Engineering and Applications, SEA 2002, 2002.
[12] K. Havelund and T. Pressburger. Model checking Java programs using Java PathFinder. International Journal on Software Tools for Technology Transfer, STTT, 2(4), April 2000.
[13] E. Itzkovitz, A. Schuster, and O. Zeev-Ben-Mordehai. Towards integration of data-race detection in DSM systems. Journal of Parallel and Distributed Computing. Special Issue on Software Support for Distributed Computing, 59(2):180–203, Nov 1999.
[14] J. A. Jones and M. J. Harrold. Empirical evaluation of the Tarantula automatic fault-localization technique. In ASE ’05: Proceedings of the 20th IEEE/ACM international Conference on Automated software engineering, pages 273–282. ACM Press, 2005.
[15] K. Kira and L. A. Rendell. A practical approach to feature selection. In ML92: Proceedings of the ninth international workshop on Machine learning, pages 249–256, San Francisco, CA, USA, 1992. Morgan Kaufmann Publishers Inc.
[16] B. Long and P. A. Strooper. A classification of concurrency failures in Java components. In IPDPS, page 287, 2003.
[17] Y. Malaiya, N. Li, J. Bieman, R. Karcich, and B. Skibbe. Software test coverage and reliability. Technical report, Colorado State University, 1996.
[18] A. Papoulis. Probability, random variables, and stochastic processes. McGraw-Hill Books, 1991.
[19] B. Richards and J. R. Larus. Protocol-based data-race detection. In Proceedings of the 2nd SIGMETRICS Symposium on Parallel and Distributed Tools, August 1998.
[20] S. Savage. Eraser: A dynamic race detector for multithreaded programs. ACM Transactions on Computer Systems, 15(4):391–411, November 1997.
[21] K. Sen and G. Agha. CUTE and jCUTE: Concolic unit testing and explicit path model-checking tools. In CAV. Tool Paper, 2006.
[22] S. D. Stoller. Model-checking multi-threaded distributed Java programs. In Proceedings of the 7th International SPIN Workshop on Model Checking, 2000.
[23] S. D. Stoller. Model-checking multi-threaded distributed Java programs. International Journal on Software Tools for Technology Transfer, 4(1):71–91, Oct. 2002.
[24] S. D. Stoller. Testing concurrent Java programs using randomized scheduling. In Proceedings of the Second Workshop on Runtime Verification (RV), volume 70(4) of Electronic Notes in Theoretical Computer Science. Elsevier, 2002.
[25] R. Tzoref, S. Ur, and Y. Yom-Tov. Instrumenting where it hurts - an automatic concurrent debugging technique. In ISSTA ’07: Proceedings of the 2007 ACM SIGSOFT International Symposium on Software Testing and Analysis, New York, NY, USA, 2007. ACM Press.
[26] G. Upton and I. Cook. Oxford Dictionary of Statistics. Oxford University Press, Oxford, UK, 2002.
[27] E. Yom-Tov. An introduction to pattern classification. In O. Bousquet, U. von Luxburg, and G. Ratsch, editors, Advanced Lectures on Machine Learning, LNAI 3176. Springer, 2004.
[28] A. Zeller. Yesterday, my program worked. Today, it does not. Why? In ESEC/FSE-7: Proceedings of the 7th European software engineering conference held jointly with the 7th ACM SIGSOFT international symposium on foundations of software engineering, pages 253–267, London, UK, 1999. Springer-Verlag.
[29] A. Zheng, M. Jordan, B. Liblit, and A. Aiken. Statistical debugging of sampled programs. In Advances in Neural Information Processing Systems, 2003.