Empirical studies of test case prioritization in a JUnit testing environment
ABSTRACT Test case prioritization provides a way to run test cases with the highest priority earliest. Numerous empirical studies have shown that prioritization can improve a test suite's rate of fault detection, but the extent to which these results generalize is an open question because the studies have all focused on a single procedural language, C, and a few specific types of test suites. In particular, Java and the JUnit testing framework are being used extensively in practice, and the effectiveness of prioritization techniques on Java systems tested under JUnit has not been investigated. We have therefore designed and performed a controlled experiment examining whether test case prioritization can be effective on Java programs tested under JUnit, and comparing the results to those achieved in earlier studies. Our analyses show that test case prioritization can significantly improve the rate of fault detection of JUnit test suites, but also reveal differences with respect to previous studies that can be related to the language and testing paradigm.
ABSTRACT: A way to reduce the cost of regression testing consists of selecting or prioritizing subsets of test cases from a test suite according to some criteria. Besides greedy algorithms, cost cognizant additional greedy algorithms, multi-objective optimization algorithms, and Multi-Objective Genetic Algorithms (MOGAs), have also been proposed to tackle this problem. However, previous studies have shown that there is no clear winner between greedy and MOGAs, and that their combination does not necessarily produce better results. In this paper we show that the optimality of MOGAs can be significantly improved by diversifying the solutions (sub-sets of the test suite) generated during the search process. Specifically, we introduce a new MOGA, coined as DIV-GA (DIversity based Genetic Algorithm), based on the mechanisms of orthogonal design and orthogonal evolution that increase diversity by injecting new orthogonal individuals during the search process. Results of an empirical study conducted on eleven programs show that DIV-GA outperforms both greedy algorithms and the traditional MOGAs from the optimality point of view. Moreover, the solutions (sub-sets of the test suite) provided by DIV-GA are able to detect more faults than the other algorithms, while keeping the same test execution cost.
IEEE Transactions on Software Engineering 10/2014; 2.59 Impact Factor
Conference Paper: The Strength of Random Search on Automated Program Repair
ABSTRACT: Automated program repair has recently received considerable attention, and many techniques in this research area have been proposed. Among them, two genetic-programming-based techniques, GenProg and Par, have shown promising results. In particular, GenProg has been used as the baseline technique to check the repair effectiveness of new techniques in much of the literature. Although GenProg and Par have shown their strong ability of fixing real-life bugs in nontrivial programs, to what extent GenProg and Par can benefit from genetic programming, used by them to guide the patch search process, is still unknown. To address the question, we present a new automated repair technique using random search, which is commonly considered much simpler than genetic programming, and implement a prototype tool called RSRepair. Experiments on 7 programs with 24 versions shipping with real-life bugs suggest that RSRepair, in most cases (23/24), outperforms GenProg in terms of both repair effectiveness (requiring fewer patch trials) and efficiency (requiring fewer test case executions), justifying the greater strength of random search over genetic programming. According to experimental results, we suggest that every proposed technique using an optimization algorithm should check its effectiveness by comparing it with random search.
The 36th International Conference on Software Engineering (ICSE 2014), Hyderabad, India; 05/2014
ABSTRACT: In recent years, researchers have intensively investigated various topics in test-case prioritization, which aims to re-order test cases to increase the rate of fault detection during regression testing. The total and additional prioritization strategies, which prioritize based on total numbers of elements covered per test, and numbers of additional (not-yet-covered) elements covered per test, are two widely-adopted generic strategies used for such prioritization. This paper proposes a basic model and an extended model that unify the total strategy and the additional strategy. Our models yield a spectrum of generic strategies ranging between the total and additional strategies, depending on a parameter referred to as the p value. We also propose four heuristics to obtain differentiated p values for different methods under test. We performed an empirical study on 19 versions of four Java programs to explore our results. Our results demonstrate that wide ranges of strategies in our basic and extended models with uniform p values can significantly outperform both the total and additional strategies. In addition, our results also demonstrate that using differentiated p values for both the basic and extended models with method coverage can even outperform the additional strategy using statement coverage.
Software Engineering (ICSE), 2013 35th International Conference on; 01/2013
Empirical Studies of Test Case Prioritization in a JUnit Testing Environment
Hyunsook Do, Gregg Rothermel, Alex Kinneer
Computer Science and Engineering Department
University of Nebraska - Lincoln
As a software system evolves, software engineers re-
gression test it to detect whether new faults have been in-
troduced into previously tested code. The simplest regres-
sion testing technique is to re-run all existing test cases, but
this can require a lot of effort, depending on the size and
complexity of the system under test. For this reason,
researchers have studied various techniques for improving the
cost-effectiveness of regression testing, such as regression
test selection [10, 27], test suite minimization [9, 18, 23],
and test case prioritization [15, 29, 33].
Test case prioritization provides a way to run test cases
that have the highest priority — according to some crite-
rion — earliest, and can yield meaningful benefits, such as
providing earlier feedback to testers and earlier detection of
faults. Numerous prioritization techniques have been de-
scribed in the research literature and they have been evalu-
ated through various empirical studies [11, 14, 15, 28, 29,
31, 33]. These studies have shown that several prioritization
techniques can improve a test suite’s rate of fault detection.
Most of these studies, however, have focused on a single
procedural language, C, and on a few specific types of test
suites, so whether their results generalize to other program-
ming and testing paradigms is an open question. Replica-
tion of these studies with populations other than those pre-
viously examined is needed, to provide a more complete
understanding of test case prioritization.
In this work, we set out to perform a replicated study,
focusing on an object-oriented language, Java, that is rapidly
gaining usage in the software industry, and thus is practi-
cally important in its own right. We focus further on a new
testing paradigm, the JUnit testing framework, which is
increasingly being used by developers to implement test cases
for Java programs. In fact, with the introduction of the
JUnit testing framework, many software development orga-
nizations are building JUnit test cases into their code bases,
as is evident through the examination of Open Source Soft-
ware hosts such as SourceForge and Apache Jakarta [2, 3].
The JUnit framework encourages developers to write test
cases, and then to rerun all of these test cases whenever
they modify their code. As mentioned previously, however,
as the size of a system grows, retest-all regression testing
strategies can be excessively expensive. JUnit users will
need methodologies with which to remedy this problem.¹
We have therefore designed and performed a controlled
experiment examining whether test case prioritization can
be effective on object-oriented systems, specifically those
written in Java and tested with JUnit test cases. We ex-
amine prioritization effectiveness in terms of rate of fault
detection, and we also consider whether empirical results
show similarity (or dissimilarity) with respect to the results
of previous studies. As objects of study we consider four
open source Java programs that have JUnit test suites, and
we examine the ability of several test case prioritization
techniques to improve the rate of fault detection of these
test suites, while also varying other factors that affect
prioritization effectiveness. Our results indicate that test case
prioritization can significantly improve the rate of fault
detection of JUnit test suites, but also reveal differences with
respect to previous studies that can be related to the Java
and JUnit paradigms.
¹ This may not be the case with respect to extreme programming,
another development methodology making use of JUnit test cases, because
extreme programming is intended for development of modest size projects
using a small number of programmers. However, JUnit is also being
used extensively in the testing of Java systems constructed by more
traditional methodologies, resulting in large banks of integration and system
tests. It is in this context that prioritization can potentially be useful.
In the next section of this paper, we describe the test
case prioritization problem and related work. Section 3 de-
scribes the JUnit testing framework, and our extensions to
that framework that allow it to support prioritization. Sec-
tion 4 presents our experiment design, results, and analysis,
describing what we have done in terms of experiment setup
to manipulate JUnit test cases. Section 5 discusses our re-
sults, and Section 6 presents conclusions and future work.
2 Background and Related Work
2.1 Test Case Prioritization
Test case prioritization techniques [15, 29, 33] schedule
test cases in an execution order according to some criterion.
The purpose of this prioritization is to increase the likeli-
hood that if the test cases are used for regression testing in
the given order, they will more closely meet some objective
than they would if they were executed in some other order.
For example, testers might schedule test cases in an order
that achieves code coverage at the fastest rate possible, ex-
ercises features in order of expected frequency of use, or
increases the likelihood of detecting faults early in testing.
Depending on the types of information available for
programs and test cases, and the way in which those types of
information are used, various test case prioritization tech-
niques can be employed. One way in which techniques can
be distinguished involves the type of code coverage infor-
mation they use. Test cases can be prioritized in terms of
the number of statements, basic blocks, or methods they
executed on a previous version of the software. For example,
a total block coverage prioritization technique simply sorts
test cases in the order of the number of basic blocks
(single-entry, single-exit sequences of statements) they covered,
resolving ties randomly.
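
The total block coverage technique just described can be sketched in a few lines of Java. This is only an illustration under stated assumptions, not the implementation used in the study: the class name `TotalCoveragePrioritizer`, the map from test name to covered block IDs, and the seeded `Random` are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.Set;

// Hypothetical sketch: orders tests by how many basic blocks each covered.
class TotalCoveragePrioritizer {

    /**
     * Returns test names sorted by number of covered blocks, largest first.
     * Shuffling before a stable sort resolves ties randomly.
     */
    static List<String> prioritize(Map<String, Set<Integer>> coverage, Random rng) {
        List<String> order = new ArrayList<>(coverage.keySet());
        Collections.shuffle(order, rng);
        order.sort(Comparator.comparingInt((String t) -> coverage.get(t).size()).reversed());
        return order;
    }

    public static void main(String[] args) {
        // Assumed coverage data from a previous version of the program.
        Map<String, Set<Integer>> coverage = new LinkedHashMap<>();
        coverage.put("testA", Set.of(1, 2));
        coverage.put("testB", Set.of(1, 2, 3, 4));
        coverage.put("testC", Set.of(5));
        System.out.println(prioritize(coverage, new Random(0))); // [testB, testA, testC]
    }
}
```

Because the sort looks only at each test's own block count, two tests that cover the same blocks can end up next to each other; the feedback-based techniques discussed next address that.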
A second way in which prioritization techniques can be
distinguished involves the use of “feedback”. When prior-
itizing test cases, if a particular test case has been selected
as “next best”, information about that test case can be used
to re-evaluate the value of test cases not yet chosen prior to
picking the next test case. For example, additional block
coverage prioritization iteratively selects a test case that
yields the greatest block coverage, then adjusts the coverage
information for the remaining test cases to indicate cover-
age of blocks not yet covered, and repeats this process until
all blocks coverable by at least one test case have been cov-
ered. This process is then repeated on remaining test cases.
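
The feedback loop described above can be sketched as a greedy selection. The following Java is a minimal illustration, assuming coverage is available as a map from test name to covered block IDs; the class name `AdditionalCoveragePrioritizer` and the example data are hypothetical, and the study's own tooling may differ.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.Set;

// Hypothetical sketch of "additional" coverage prioritization with feedback.
class AdditionalCoveragePrioritizer {

    static List<String> prioritize(Map<String, Set<Integer>> coverage, Random rng) {
        List<String> remaining = new ArrayList<>(coverage.keySet());
        Collections.shuffle(remaining, rng); // first maximum in shuffled order wins ties
        List<String> order = new ArrayList<>();
        Set<Integer> covered = new HashSet<>();
        while (!remaining.isEmpty()) {
            String best = null;
            int bestGain = 0;
            for (String test : remaining) {
                int gain = 0;
                for (int block : coverage.get(test)) {
                    if (!covered.contains(block)) gain++;
                }
                if (gain > bestGain) { best = test; bestGain = gain; }
            }
            if (best == null) {
                if (covered.isEmpty()) {      // remaining tests cover nothing at all
                    order.addAll(remaining);
                    break;
                }
                covered.clear();              // all coverable blocks covered: start over
                continue;
            }
            order.add(best);
            remaining.remove(best);
            covered.addAll(coverage.get(best));
        }
        return order;
    }

    public static void main(String[] args) {
        Map<String, Set<Integer>> coverage = new LinkedHashMap<>();
        coverage.put("t1", Set.of(1, 2, 3, 4));
        coverage.put("t2", Set.of(1, 2, 3));
        coverage.put("t3", Set.of(5));
        System.out.println(prioritize(coverage, new Random(0))); // [t1, t3, t2]
    }
}
```

In this example a total-coverage sort would place `t2` second, whereas the feedback step promotes `t3` because it is the only remaining test that reaches a block not yet covered.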
A third way in which prioritization techniques can be
distinguished involves their use of information about code
modifications. For example, the amount of change in a code
element can be factored into prioritization by weighting the
elements covered using a measure of change.
Other dimensions along which prioritization techniques
can be distinguished that have been suggested in the
literature [14, 15, 22] include test cost estimates, fault
severity estimates, estimates of fault propagation probability, test
history information, and usage statistics obtained through
2.2 Previous Empirical Work
Early studies of test case prioritization focused on the
cost-effectiveness of individual techniques, the estimation
of a technique’s performance, or comparisons of techniques
[15, 28, 29, 31, 33]. These studies showed that various
techniques could be cost-effective, and suggested tradeoffs
among them. However, the studies also revealed wide
variances in performance, and attributed these to factors
involving the programs under test, test suites used to test them,
and types of modifications made to the programs.
Recent studies of prioritization have begun to examine
the factors affecting prioritization effectiveness [11, 22, 25].
Rothermel et al.  studied the effects of test suite design
test suites and examining the effects on cost-effectiveness
of test selection and prioritization. While this study did
not consider correlating attributes of change with technique
performance, Elbaum et al.  performed experiments
exploring characteristics of program structure, test suite
composition, and changes on prioritization, and identified
several metrics characterizing these attributes that correlate
with prioritization effectiveness.
More recent studies have examined how some of these
factors affect the effectiveness and efficiency of prioriti-
zation, and have considered the generalization of findings
through controlled experiments [12, 16, 26]. These studies
expose tradeoffs and constraints that affect the success of
techniques, and provide guidelines for designing and man-
aging prioritization and testing processes.
Most recently, Saff and Ernst  considered test case
prioritization for Java in the context of continuous testing,
which uses spare CPU resources to continuously run re-
gression tests in the background as a programmer codes.
They propose combining the concepts of test frequency and
test case prioritization, and report the results of a study that
show that prioritized continuous testing reduced wasted de-
velopment time. However, their prioritization techniques
are based on different sources of information than ours,
such as history of recent or frequent errors and test cost,
rather than code coverage information. The measure of ef-
fectiveness they use also differs from ours: theirs involves
reduction of wasted time in development, whereas ours in-
volves the weighted average of the percentage of faults de-
tected over the life of a test suite.
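
The weighted-average measure mentioned above appears to correspond to what the prioritization literature calls APFD. As a hedged illustration, the sketch below assumes the commonly cited formula APFD = 1 - (TF_1 + ... + TF_m)/(n·m) + 1/(2n), where n is the number of test cases, m the number of faults, and TF_i the position of the first test case that exposes fault i. The class name `ApfdCalculator` and the data layout are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: computes APFD under the assumed standard formula.
class ApfdCalculator {

    /**
     * @param order        test names in execution order
     * @param faultToTests for each fault, the tests that expose it
     */
    static double apfd(List<String> order, Map<String, Set<String>> faultToTests) {
        int n = order.size();
        int m = faultToTests.size();
        double sumOfFirstPositions = 0;
        for (Set<String> exposing : faultToTests.values()) {
            int first = -1;
            for (int i = 0; i < n; i++) {
                if (exposing.contains(order.get(i))) {
                    first = i + 1; // positions are 1-based
                    break;
                }
            }
            if (first < 0) {
                throw new IllegalArgumentException("a fault is not exposed by any test");
            }
            sumOfFirstPositions += first;
        }
        return 1.0 - sumOfFirstPositions / (n * m) + 1.0 / (2 * n);
    }

    public static void main(String[] args) {
        Map<String, Set<String>> faults = new LinkedHashMap<>();
        faults.put("f1", Set.of("t1"));
        faults.put("f2", Set.of("t3", "t4"));
        System.out.println(apfd(List.of("t1", "t2", "t3", "t4"), faults)); // 0.625
        System.out.println(apfd(List.of("t4", "t3", "t2", "t1"), faults)); // 0.5
    }
}
```

Higher values indicate that faults are exposed earlier in the ordering, which is why the first ordering scores better than its reverse.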
With the exception of the work reported in , all of
this previous empirical work has concerned C programs and
system-level test suites constructed for code coverage, or
for partition-based coverage of requirements. In contrast,
the study we describe here examines whether prior results
generalize, by replicating previous experiments on a new
population of programs and test suites (Java and JUnit), and
examining whether the results are consistent with those of
the previous studies.
3 JUnit Testing and Prioritization
JUnit test cases are Java classes that contain one or more
test methods and that are grouped into test suites, as shown
in Figure 1. The figure presents a simple hierarchy having
only a single test-class level, but the tree can extend deeper
using additional nesting of Test Suites. The leaf nodes in
such a hierarchy, however, always consist of test-methods,
where a test-method is a minimal unit of test code.
Figure 1. JUnit test suite structure
JUnit test classes that contain one or more test methods
can be run individually, or a collection of JUnit test cases (a
test suite) can be run as a unit. Running individual test cases
is reasonable for small programs, but for large numbers of
test cases can be expensive, because each independent exe-
cution of a test case incurs startup costs. Thus in practice,
developers design JUnit test cases to run as sequences of
tests invoked through test suites that invoke test classes.
Clearly, choices in test suite granularity (the number and
size of the test cases making up a test suite) can affect the
cost of running JUnit test cases, and we want to investi-
gate the relationship between this factor and prioritization
results. To do this, we focus on two levels of test suite
granularity: test-class level, a collection of test-classes that
represents a coarse test suite granularity, and test-method level,
a collection of test-methods that represents a fine test suite
granularity. To support this focus we needed to ensure that
the JUnit framework allowed us to achieve the following
objectives:
1. treat each TestCase class as a single test case for pur-
poses of prioritization (test-class level);
2. reorder TestCase classes to produce a prioritized order;
3. treat individual test methods within TestCase classes
as test cases for prioritization (test-method level);
4. reorder test methods to produce a prioritized order.
Objectives 1 and 2 were trivially achieved as a conse-
quence of the fact that the default unit of test code that can
be specified for execution in the JUnit framework is a
TestCase class. Thus it was necessary only to extract the names
of all TestCase classes invoked by the top level TestSuite for
the object program² (a simple task) and then execute them
individually with the JUnit test runner in a desired order.
Objectives 3 and 4 were more difficult to achieve, due
to the fact that a TestCase class is also the minimal unit of
test code that can be specified for execution in the normal
JUnit framework. Since a TestCase class can define mul-
tiple test methods, all of which will be executed when the
TestCase is specified for execution, providing the ability to
treat individual methods as test cases required us to extend
the JUnit framework to support this finer granularity. Thus
the principal challenge we faced was to design and imple-
ment JUnit extensions that provide a means for specifying
individual test methods for execution, as found in the total
set of methods distributed across multiple TestCase classes.
Since the fundamental purpose of the JUnit framework
is to discover and execute test methods defined in
TestCase classes, the problem of providing test-method level
testing reduces to the problem of uniquely identifying each
test method discovered by the framework and making them
available for individual execution by the tester. We accom-
plished this task by extending (subclassing) various com-
ponents of the framework and inserting mechanisms for as-
signing numeric test IDs to each test method discovered.
We then created a SelectiveTestRunner that uses the new
extension components. The relationship between our
extensions and the existing JUnit framework is shown in Figure
2, which also shows how the JUnit framework is related to
the Galileo system for analyzing Java bytecode (which we
used to obtain coverage information for use in prioritiza-
tion). Our new SelectiveTestRunner is able to access test
cases individually using numeric test IDs.
To implement prioritization at the test-method level we
also needed to provide a way for the test methods to be ex-
ecuted in a tester-specified order. Because the JUnit frame-
work must discover the test methods, and our extensions
assign numeric IDs to tests in the order of discovery, to ex-
ecute the test cases in an order other than the one in which
they are provided requires that all test cases be discovered
prior to execution. We accomplished this by using a simple
two-pass technique. In the first pass, all the test methods
relevant to the system are discovered and assigned numbers.
The second pass then uses a specified ordering to retrieve
and execute each method by its assigned ID. The tester (or
testing tool) is provided a means, via the SelectiveTestRunner,
of retrieving the IDs assigned to test methods by the
framework for use with prioritization.
Figure 2. JUnit framework and Galileo
² This process may need to be repeated iteratively if Test Suites are
nested in other Test Suites.
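
The two-pass idea can be illustrated without JUnit itself. The sketch below is not the SelectiveTestRunner described in this paper; it is a hypothetical stand-in that uses plain Java reflection, assumes test methods are public, take no arguments, and have names starting with "test", and sorts methods by name only so that the assigned IDs are reproducible.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of two-pass, ID-based test-method execution.
class TwoPassRunnerSketch {

    /** Stand-in for a JUnit TestCase class; each test method records its own name. */
    static class SampleTests {
        static final List<String> LOG = new ArrayList<>();
        public void testAlpha() { LOG.add("testAlpha"); }
        public void testBeta()  { LOG.add("testBeta"); }
        public void testGamma() { LOG.add("testGamma"); }
        public void helper()    { LOG.add("helper"); } // ignored: no "test" prefix
    }

    private final List<Method> methodsById = new ArrayList<>();

    /** Pass 1: discover test methods; a method's ID is its index in discovery order. */
    void discover(Class<?>... testClasses) {
        for (Class<?> testClass : testClasses) {
            Method[] methods = testClass.getDeclaredMethods();
            // Reflection returns methods in no particular order, so sort for stable IDs.
            Arrays.sort(methods, Comparator.comparing(Method::getName));
            for (Method m : methods) {
                boolean looksLikeTest = m.getName().startsWith("test")
                        && m.getParameterCount() == 0
                        && Modifier.isPublic(m.getModifiers());
                if (looksLikeTest) methodsById.add(m);
            }
        }
    }

    /** Lets a tester or tool look up the method name behind an assigned ID. */
    String nameOf(int id) {
        return methodsById.get(id).getName();
    }

    /** Pass 2: execute methods in the tester-specified order of IDs. */
    void runInOrder(int... ids) {
        for (int id : ids) {
            Method m = methodsById.get(id);
            try {
                Object instance = m.getDeclaringClass().getDeclaredConstructor().newInstance();
                m.invoke(instance);
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException("could not run test " + id, e);
            }
        }
    }

    public static void main(String[] args) {
        TwoPassRunnerSketch runner = new TwoPassRunnerSketch();
        runner.discover(SampleTests.class);
        runner.runInOrder(2, 0, 1);
        System.out.println(SampleTests.LOG); // [testGamma, testAlpha, testBeta]
    }
}
```

Discovery must finish before any test runs, since an ID only becomes meaningful once every method has been numbered; that is the reason for the two passes.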
We wish to address the following research questions:
RQ1: Can test case prioritization improve the rate of fault
detection of JUnit test suites?
RQ2: How do the three types of information and informa-
tion use that distinguish prioritization techniques (type
of coverage information, use of feedback, and use of
modification information) impact the effectiveness of
prioritization techniques?
RQ3: Can test suite granularity (the choice of running test-
class level versus test-method level JUnit test cases)
impact the effectiveness of prioritization techniques?
In addition to these research questions, we examine
whether test case prioritization results obtained from sys-
JUnit test cases show different trends than those obtained
ditional coverage- or requirements-based system test cases.
To address our questions we designed several controlled
experiments. The following subsections present, for these
experiments, our objects of analysis, independent variables,
dependent variables and measures, experiment setup and
design, threats to validity, and data and analysis.
4.1 Objects of Analysis
We used four Java programs as objects of analysis: Ant,
XML-security, JMeter, and JTopas. Ant is a Java-based
build tool; it is similar to make, but instead of being
extended with shell-based commands it is extended using
Java classes. JMeter is a Java desktop application designed
to load test functional behavior and measure performance.
XML-security implements security standards for XML.
JTopas is a Java library used for parsing text data.
Several sequential versions of each of these systems were
available and were selected for these experiments.
Table 1 lists, for each of our objects, “Versions” (the
number of versions), “Size” (the number of classes), “Test
Classes” (the number of JUnit test-class level test cases),
“Test Methods” (the number of JUnit test-method level test
cases), and “Faults” (the number of faults). The number of
classes corresponds to the total number of class files in the
final version. The numbers for test-class (test-method) list
the number of test classes (test methods) in the most recent
version. The number of faults indicates the total number of
faults available for each of the objects (see Section 4.3.3).
Table 1. Experiment objects
4.2 Variables and Measures
Our experiments manipulated two independent variables:
prioritization technique and test suite granularity.
Variable 1: Prioritization Technique
We consider nine different test case prioritization tech-
niques, which we classify into three groups to match an
earlier study on prioritization for C programs . Table 2
summarizes these groups and techniques. The first group is
the control group, containing three “techniques” that serve
as experimental controls. (We use the term “technique” here
as a convenience; in actuality, the control group does not
involve any practical prioritization heuristics; rather, it in-
volves various orderings against which practical heuristics
should be compared.) The second group is the block level
group, containing two fine granularity prioritization tech-
niques. The third group is the method level group, contain-
ing four coarse granularity prioritization techniques.
• No prioritization (untreated): One control that we
consider is simply the application of no technique; this lets
us consider “untreated” JUnit test suites.
• Random prioritization (random): As a second control
we use random prioritization, in which we randomly
order the test cases in a JUnit test suite.
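Table 2. Test case prioritization techniques.
Group         Mnemonic            Description
Control       untreated           original ordering (no prioritization)
Control       random              random ordering
Control       optimal             ordered to optimize rate of fault detection
Block level   block-total         prioritize on coverage of blocks
Block level   block-addtl         prioritize on coverage of blocks not yet covered
Method level  method-total        prioritize on coverage of methods
Method level  method-addtl        prioritize on coverage of methods not yet covered
Method level  method-diff-total   prioritize on coverage of methods and change information
Method level  method-diff-addtl   prioritize on coverage of methods/change information, and adjusted on previous coverage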
• Optimal prioritization (optimal): To measure the ef-
fects of prioritization techniques on rate of fault detec-
tion, our empirical study uses programs that contain
known faults. For the purposes of experimentation we
can determine, for any test suite, which test cases ex-
pose which faults, and thus we can determine an opti-
mal ordering of test cases in a JUnit test suite for max-
imizing that suite’s rate of fault detection. This is not
a viable practical technique, but it provides an upper
bound on the effectiveness of our heuristics.
Block level techniques
• Total block coverage prioritization (block-total): By
case, the number of basic blocks in that program that
are exercised by that test case. We can prioritize these
test cases according to the total number of blocks they
cover simply by sorting them in terms of that number.
• Additional block coverage prioritization (block-addtl):
Additional block coverage prioritization combines
feedback with coverage information. It iteratively se-
lects a test case that yields the greatest block cover-
age, adjusts the coverage information on subsequent
test cases to indicate their coverage of blocks not yet
covered, and repeats this process until all blocks cov-
ered by at least one test case have been covered. If
multiple test cases cover the same number of blocks
not yet covered, they are ordered randomly. When all
blocks have been covered, this process is repeated on
the remaining test cases until all have been ordered.
Method level techniques
• Total method coverage prioritization (method-total):
Total method coverage prioritization is the same as
total block coverage prioritization, except that it relies on
coverage measured in terms of methods.
• Additional method coverage prioritization (method-
addtl): Additional method coverage prioritization is
the same as additional block coverage prioritization,
except that it relies on coverage in terms of methods.
• Total diff method prioritization (method-diff-total):
Total diff method coverage prioritization uses
modification information; it sorts test cases in the order of
their coverage of methods that differ textually (as measured
by a Java parser that parses pairs of individual
Java methods through the Unix “diff” function). If
multiple test cases cover the same number of differing
methods, they are ordered randomly.
• Additional diff method prioritization (method-diff-
addtl): Additional diff method prioritization uses both
feedback and modification information. It iteratively
selects a test case that yields the greatest coverage of
methods that differ, adjusts the information on subse-
quent test cases to indicate their coverage of methods
not yet covered, and then repeats this process until all
methods that differ and have been covered by at least
one test case have been covered. If multiple test cases
cover the same number of differing methods not yet
covered, they are ordered randomly. This process is
repeated until all test cases that execute methods that
differ have been used; additional method coverage
prioritization is applied to remaining test cases.
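
The method-diff-addtl technique combines the two ideas above, and a compact Java sketch may help make the two phases concrete. It is an illustration only: the class name `DiffAdditionalPrioritizer`, the map from test name to covered method names, and the set of changed methods are hypothetical inputs, and starting the second phase with fresh coverage state is an assumption rather than something the text specifies.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.Set;

// Hypothetical sketch of additional diff method prioritization.
class DiffAdditionalPrioritizer {

    /** Greedy "additional" ordering with feedback; resets once nothing new can be covered. */
    static List<String> additional(Map<String, Set<String>> coverage, Random rng) {
        List<String> remaining = new ArrayList<>(coverage.keySet());
        Collections.shuffle(remaining, rng); // random tie-breaking
        List<String> order = new ArrayList<>();
        Set<String> covered = new HashSet<>();
        while (!remaining.isEmpty()) {
            String best = null;
            int bestGain = 0;
            for (String test : remaining) {
                Set<String> gain = new HashSet<>(coverage.get(test));
                gain.removeAll(covered);
                if (gain.size() > bestGain) { best = test; bestGain = gain.size(); }
            }
            if (best == null) {
                if (covered.isEmpty()) { order.addAll(remaining); break; }
                covered.clear();
                continue;
            }
            order.add(best);
            remaining.remove(best);
            covered.addAll(coverage.get(best));
        }
        return order;
    }

    /** Phase 1 orders tests that reach changed methods; phase 2 orders the rest. */
    static List<String> prioritize(Map<String, Set<String>> coverage,
                                   Set<String> changed, Random rng) {
        Map<String, Set<String>> changedCoverage = new LinkedHashMap<>();
        Map<String, Set<String>> rest = new LinkedHashMap<>();
        for (Map.Entry<String, Set<String>> e : coverage.entrySet()) {
            Set<String> hit = new HashSet<>(e.getValue());
            hit.retainAll(changed);
            if (hit.isEmpty()) rest.put(e.getKey(), e.getValue());
            else changedCoverage.put(e.getKey(), hit);
        }
        List<String> order = additional(changedCoverage, rng);
        order.addAll(additional(rest, rng)); // plain additional method coverage
        return order;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> coverage = new LinkedHashMap<>();
        coverage.put("tA", Set.of("m1", "m2", "m5"));
        coverage.put("tB", Set.of("m1", "m3", "m4", "m6"));
        coverage.put("tC", Set.of("m2"));
        coverage.put("tD", Set.of("m3"));
        Set<String> changed = Set.of("m2", "m5");
        System.out.println(prioritize(coverage, changed, new Random(0))); // [tA, tC, tB, tD]
    }
}
```

In this example plain additional method coverage would run `tB` first because it covers the most methods, while the diff-based variant moves `tA` and `tC` ahead because only they execute changed methods.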
The foregoing set of techniques matches the set exam-
ined in  in all but two respects. First, we use three
control techniques, considering an “untreated” technique in
which test cases are run in the order in which they are given
in the original JUnit test cases. This is a sensible control
technique for our study since in practice developers would
run JUnit test cases in their original ordering.
Second, the studies with C programs used statement and
function level prioritization techniques, where coverage is
based on source code, whereas our study uses coverage
based on Java bytecode. Analysis at the bytecode level is
appropriate for Java environments. Since Java is a
platform independent language, vendors or programmers might
choose to provide just class files for system components. In
such cases we want to be able to analyze even those class
files, and bytecode analysis allows this.
The use of bytecode level analysis does affect our choice
of prioritization techniques. As an equivalent to C “function
level” coverage, a method level granularity was an obvious
choice. As a statement level equivalent, we could use either
individual bytecode instructions, or basic blocks of
instructions, but we cannot infer a one-to-one correspondence
between Java source statements and either bytecode
instructions or blocks.³ We chose the basic block because the basic
³ A Java source statement typically compiles to several bytecode
instructions, and a basic block from bytecode often corresponds to more than
one Java source code statement.