An Empirical Study on the Suitability of Test-based Patch Acceptance
Criteria
LUCIANO ZEMIN, Instituto Tecnológico de Buenos Aires (ITBA), Argentina
SIMÓN GUTIÉRREZ BRIDA, Dept. of Computer Science, National University of Río Cuarto, Argentina
ARIEL GODIO, Instituto Tecnológico de Buenos Aires (ITBA), Argentina
CÉSAR CORNEJO, National Council for Scientific and Technical Research (CONICET) and Dept. of Computer
Science, National University of Río Cuarto, Argentina
RENZO DEGIOVANNI, Luxembourg Institute of Science and Technology, Luxembourg
GERMÁN REGIS, Dept. of Computer Science, National University of Río Cuarto, Argentina
NAZARENO AGUIRRE, National Council for Scientific and Technical Research (CONICET) and Dept. of
Computer Science, National University of Río Cuarto, Argentina
MARCELO FRIAS, The University of Texas at El Paso, United States
In this article, we empirically study the suitability of tests as acceptance criteria for automated program fixes, by checking patches produced by automated repair tools using a bug-finding tool, as opposed to previous works that used tests or manual inspections. We develop a number of experiments in which faulty programs from IntroClass, a known benchmark for program repair techniques, are fed to the program repair tools GenProg, Angelix, AutoFix and Nopol, using test suites of varying quality, including those accompanying the benchmark. We then check the produced patches against formal specifications using a bug-finding tool. Our results show that, in the studied scenarios, automated program repair tools are significantly more likely to accept a spurious program fix than to produce an actual one. Using bounded-exhaustive suites larger than the originally given ones (with about 100 and 1,000 tests) we verify that overfitting is reduced, but a) few new correct repairs are generated, and b) some tools see their performance reduced by the larger suites and fewer correct repairs are produced. Finally, by comparing with previous work, we show that overfitting is underestimated in semantics-based tools and that patches not discarded using held-out tests may be discarded using a bug-finding tool.
CCS Concepts: • Software and its engineering → Software testing and debugging; Formal software verification; Empirical software validation.
Additional Key Words and Phrases: automatic program repair, formal specifications, testing, oracle
Authors’ addresses: Luciano Zemin, Instituto Tecnológico de Buenos Aires (ITBA), Buenos Aires, Argentina; Simón Gutiérrez Brida, Dept. of Computer Science, National University of Río Cuarto, Río Cuarto, Argentina; Ariel Godio, Instituto Tecnológico de Buenos Aires (ITBA), Buenos Aires, Argentina; César Cornejo, National Council for Scientific and Technical Research (CONICET) and Dept. of Computer Science, National University of Río Cuarto, Río Cuarto, Argentina; Renzo Degiovanni, Luxembourg Institute of Science and Technology, Esch-sur-Alzette, Luxembourg; Germán Regis, Dept. of Computer Science, National University of Río Cuarto, Río Cuarto, Argentina; Nazareno Aguirre, National Council for Scientific and Technical Research (CONICET) and Dept. of Computer Science, National University of Río Cuarto, Río Cuarto, Argentina; Marcelo Frias, The University of Texas at El Paso, El Paso, United States.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
© 2024 Copyright held by the owner/author(s).
ACM 1557-7392/2024/11-ART
https://doi.org/10.1145/3702971
ACM Trans. Softw. Eng. Methodol.
1 INTRODUCTION
Software has become ubiquitous, and many of our activities depend directly or indirectly on it. Having adequate software development techniques and methodologies that contribute to producing quality software systems has become essential for many human activities. The significant advances in automated analysis techniques have led, in the last few decades, to the development of powerful tools able to assist software engineers in software development, which have proved to greatly contribute to software quality. Indeed, tools based on model checking [8], constraint solving [42], evolutionary computation [11] and other automated approaches are being successfully applied to various aspects of software development, from requirements specification [3, 15] to verification [23] and bug finding [16, 21]. Despite the great effort that is put into software development to detect software problems (wrong requirements, deficient specifications, design flaws, implementation errors, etc.), e.g., through the use of the above mentioned techniques, many bugs make it through to the deployment phases. This makes effective software maintenance greatly relevant to the quality of the software that is produced, and since software maintenance takes a significant part of the resources of most software projects, also economically relevant to the software development industry. Thus, the traditional emphasis of software analysis techniques, which concentrated on detecting the existence of defects in software and specifications, has recently started to broaden towards automatically repairing software [1, 10, 29, 32, 53].
While the idea of automatic program repair (APR) is certainly appealing, automatically fixing arbitrary program defects is known to be infeasible. In the worst case it reduces to program synthesis, which is known to be undecidable for Turing-complete programming languages [7]. Thus, the various techniques that have been proposed to automatically repair programs are intrinsically incomplete, in various respects. Firstly, many techniques for automatically repairing programs need to produce repair candidates, often consisting of syntactic modifications of the original (known to be faulty) program. Clearly, not all (even bounded) program repair candidates can be exhaustively considered, and thus the space of repairs to consider needs to be somehow limited. Secondly, for every repair candidate, checking whether the produced candidate indeed constitutes a repair is an undecidable problem on its own, and solving it fully automatically is then, also, necessarily incomplete. Moreover, this latter problem requires a description of the expected behavior of the program to be fixed, i.e., a specification, amenable to automated analysis, if one wants the whole repair process to remain automatic. Producing such specifications is costly, and therefore requiring them is believed to undermine the applicability of automatic repair approaches. Most automated repair techniques then use partial specifications, given in terms of a validation test suite. Moreover, most techniques heavily rely on these tests as part of the technique itself, e.g., for fault localization [53].
There is a risk in using tests as specifications since, as is well known, their incompleteness makes it possible to obtain spurious repairs, i.e., programs that seem to solve the problems of the faulty code, but are incorrect despite the fact that the validation suite is not able to expose such incorrectness. Nevertheless, various tools report significant success in repairing code using tests as criteria for accepting patches [18]. More recently, various researchers have observed that automatically produced patches are likely to overfit the test suites used for their validation, leading tools to produce invalid program fixes [34, 47]. Then, further checks have been performed to analyze more precisely the quality of automatically produced patches, and consequently the ability of automated program repair techniques to produce actual fixes. However, these further checks have usually been performed through manual inspection, or using extended alternative test suites, leaving room for still undetected flaws.
In this paper, we empirically study the suitability of tests as acceptance criteria for automated program fixes, by checking patches produced by automated repair tools using a bug-finding tool, as opposed to previous works that used tests and/or manual inspections. We develop a number of experiments using IntroClass, a known benchmark for program repair techniques, consisting of small programs solving simple assignments. Faulty programs from this benchmark are used to feed four state-of-the-art program repair tools, using test suites of varying quality
and extension, including those accompanying the benchmark. Produced patches are then complemented with corresponding formal specifications, given as pre- and post-conditions, and checked using Pex [49], an automated test generation tool based on concrete/symbolic execution and constraint solving, which attempts to exhaustively cover bounded symbolic paths of the patches. Our results show that:
• In general, automated program repair tools are significantly more likely to accept a spurious program fix than to produce an actual one, in the studied scenarios.
• By improving the quality of the test suite, extending it to a sort of bounded-exhaustive suite (whose size is bounded to approximately 100 or 1,000 tests), we show that a few more correct fixes are obtained in those cases where the tool under analysis is able to cope with the suite size.
• Finally, we show that overfitting is more likely to occur in semantics-based tools than previously reported in [56]. The use of the bug-finding tool allows us to detect overfitting patches that remain undetected using held-out tests, i.e., tests that are not used during the patch generation process but are instead kept aside for the verification of the produced patches.
Notice that using IntroClass, a benchmark built from simple small programs (usually not exceeding 30 lines of code), is not a limitation of our analysis. If state-of-the-art tools fail to distinguish correct fixes from spurious ones on this benchmark, we should not expect them to perform better on larger or more complex benchmarks. If anything, using IntroClass makes the analysis more conclusive.
This paper reports research that extends previous work reported in [59]. It poses and answers research questions that were not part of [59]. A deeper discussion of the relationship between these two works is presented in Section 5.
The paper is organized as follows. After this Introduction, in Section 2 we introduce automated program repair and the overfitting problem. In Section 3 we evaluate 4 tools for automated program repair, namely Angelix [37], AutoFix [52], GenProg [18] and Nopol [55], and we present and answer the research questions. In Section 4 we discuss the threats to the validity of the results presented in the paper. In Section 5 we discuss related work. In Section 6 we discuss the results obtained in the paper, and we finish in Section 7 with some conclusions and proposals for further work.
2 AUTOMATED PROGRAM REPAIR
Automated program repair techniques aim at fixing faulty programs through the application of transformations that modify the program's code. Generation and Validation (G&V) techniques for automated program repair receive a faulty program to repair and a specification of the program's expected behavior, and attempt to generate a patch, through the application of syntactic transformations on the original program, that satisfies the provided specification [1]. Different techniques and tools have been devised for automated program repair, which can be distinguished on various aspects, such as the programming language or kind of system they apply to, the syntactic modifications that can be applied to programs (or, similarly, the fault model a tool aims to repair), the process used to produce the fix candidates or program patches, how program specifications are captured (and how these are contrasted against fix candidates), and how the explosion of fix candidates is tamed.
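To make the generate-and-validate scheme concrete, the following C# sketch outlines the basic loop when tests are used as the acceptance criterion. This is our own schematic rendering, not the implementation of any particular tool; Program, TestCase, CandidatePatches and Passes are hypothetical placeholders standing for a tool's internal representations.

// Schematic G&V loop: the first candidate that passes every test in the
// validation suite is accepted as a patch (and may still overfit the suite).
static Program Repair(Program faulty, IEnumerable<TestCase> suite)
{
    foreach (Program candidate in CandidatePatches(faulty)) // syntactic variants of the faulty program
    {
        bool passesAll = true;
        foreach (TestCase t in suite)
            if (!Passes(candidate, t)) { passesAll = false; break; }
        if (passesAll)
            return candidate;  // accepted patch
    }
    return null;               // search space (or budget) exhausted without a patch
}

The sketch makes the source of overfitting visible: acceptance is decided solely by the finite suite, so any behavior outside the suite is unconstrained.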
From the point of view of the technology underlying the repair generation techniques, a wide spectrum of approaches exist, including search-based approaches such as those based on evolutionary computation [1, 52, 53], constraint-based automated analyses such as those based on SMT and SAT solving [17, 37], and model checking and automated synthesis [48, 51]. A crucial aspect of program repair that concerns this paper is how program specifications are captured and provided to the tools. Some approaches, notably some of the initial ones (e.g., [1, 48]), require formal specifications in the form of pre- and post-conditions, or logical descriptions provided in some suitable logical formalism. These approaches then vary in the way they use these specifications to assess repair candidates; some check repair candidates against specifications using some automated verification
technique [48]; some use the specifications to produce tests, which are then used to drive the patch generation and patch assessment activities [52]. Moreover, some of these tools and techniques require strong specifications, capturing the, say, “full” expected behavior of the program [1, 48], while others use contracts present in the code, which developers have previously written mainly for runtime checking [52].
Many of the latest mainstream approaches, however, use tests as specifications. These approaches relieve techniques from the requirement of providing a formal specification accompanying the faulty program, arguing that such specifications are costly to produce and are seldom found in software projects. Tests, on the other hand, are significantly more commonly used as part of development processes, and thus requiring tests is clearly less demanding. Weimer et al. mention in [54], for instance, that by requiring tests instead of formal specifications one greatly improves the practical applicability of their proposed technique. Kaleeswaran et al. [25] also mention that approaches requiring specifications suppose the existence of such specifications, a situation that is rarely true in practice; they also acknowledge the limitations of tests as specifications for repairs, and aim at a less ambitious goal than fully automatically repairing code, namely, generating repair hints.
The partial nature of tests as specifications immediately leads to validity issues regarding the fixes provided by automated program repair tools, since a program patch may be accepted because it passes all tests in the validation suite but still not be a true program fix (there might still be other test cases, not present in the validation suite, for which the program patch fails). This problem, known as overfitting [47], has been previously identified by various researchers [34, 47, 56], and several tools are known to produce spurious patches as a result of their program repair techniques. This problem is handled differently by different techniques. Some resign the challenge of producing fixes and aim at producing hints (e.g., the already mentioned [25]). Others take into account a notion of quality, and manually compare the produced patches with fixes provided by human developers or by other tools [34, 36]. Notice that, even after manual inspection, subtle defects may still be present in the repairs, thus leading to accepting a fix that is invalid. We partly study this issue in this paper.
3 EVALUATION
In this section we evaluate Angelix, AutoFix, GenProg and Nopol, 4 well-regarded tools that use tests as their patch acceptance criterion. The evaluation is performed on the IntroClass dataset, which is described in detail in Section 3.1. The dataset contains student-developed solutions for 6 simple problems. The correctness of the students' solutions (which usually take under 30 LOC) can be evaluated using instructor-prepared test suites. Each of the provided solutions is faulty: at least one test in the corresponding suite fails.
The tools were selected because they use different underlying techniques for patch generation, and also due to the existence of mature-enough implementations that would allow us to carry out the experiments. Angelix [37] collects semantic information from controlled symbolic executions of the program and uses MaxSMT [43] to synthesize a patch. AutoFix [52] is intended for the repair of Eiffel [39] programs; it is contract-based, and relies on AutoTest [40] for automated test generation. GenProg [18] uses genetic algorithms to search for a program variant that retains correct functionality yet passes the failing tests. Finally, Nopol [55] collects program execution information and uses satisfiability modulo theories (SMT) [2] to generate patches. Unlike Angelix, Nopol focuses on the repair of buggy conditional statements.
Since our aim is to evaluate the suitability of test-based patch acceptance criteria, we will introduce some terminology that will help us better understand the following sections. Given a faulty routine 𝑚 and a test suite 𝑇 employed as an acceptance criterion for automated program repair, a tool-synthesized version 𝑚′ of 𝑚 that passes all tests in 𝑇 is called a patch. A patch may overfit and be correct with respect to the provided suite, yet be faulty with respect to a different suite, or, more precisely, with respect to its actual expected program behavior. We may then have correct and incorrect patches; a correct patch, i.e., one that meets the program's expected behavior, will be called a fix.
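In symbols (our own compact rendering of this distinction, using the notation just introduced; pre and post stand for the routine's intended pre- and post-condition):
    𝑚′ is a patch of 𝑚 w.r.t. 𝑇   ⇔   ∀ 𝑡 ∈ 𝑇 : 𝑚′ passes 𝑡
    𝑚′ is a fix of 𝑚              ⇔   ∀ input 𝑖 : pre(𝑖) ⇒ post(𝑖, 𝑚′(𝑖))
Provided the tests in 𝑇 are consistent with the intended behavior, every fix is a patch, but not conversely; a patch that is not a fix is an overfitting patch.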
This gives rise to our first research question.
RQ1: When applying a given program repair tool/technique on a faulty program, how likely is the tool/technique to provide a patch, and if a patch is found, how likely is it to be a fix?
Patch correctness is typically determined by manual inspection. Since manual inspections are error-prone (in fact, the faulty routines that constitute the IntroClass dataset were all manually inspected by their corresponding developers, yet they are faulty), we will resort to automated verification of patches in order to determine if they are indeed fixes. We will use concrete/symbolic execution combined with constraint solving to automatically verify the produced patches against their corresponding specifications, captured as contracts [38]. More precisely, we will translate patches into C# and equip these with pre- and post-conditions captured using Code Contracts [14]; we will then search for inputs that violate these assertions via concrete/symbolic execution and SMT solving, using the Pex tool [49]. Finally, to prevent any error introduced by the above described process, we run the corresponding test generated by Pex on the original patched method to check that it actually fails.
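To illustrate this pipeline, the sketch below shows the kind of contract-annotated C# translation that Pex can then explore, here for the median problem. This is a minimal example written by us; the actual contracts used in the experiments are those included in the reproducibility package.

using System;
using System.Diagnostics.Contracts;

public static class MedianSpec
{
    // A (patched) median routine translated into C# and annotated with Code Contracts.
    // Pex symbolically explores the method and searches for inputs violating an Ensures
    // clause; any such input is then re-run on the original patched method.
    public static int Median(int a, int b, int c)
    {
        Contract.Ensures(Contract.Result<int>() == a ||
                         Contract.Result<int>() == b ||
                         Contract.Result<int>() == c);
        // median(a, b, c) = max(min(a, b), min(max(a, b), c))
        Contract.Ensures(Contract.Result<int>() ==
                         Math.Max(Math.Min(a, b), Math.Min(Math.Max(a, b), c)));

        // Body of the patch under verification (here, a correct version).
        if ((a <= b && b <= c) || (c <= b && b <= a)) return b;
        if ((b <= a && a <= c) || (c <= a && a <= b)) return a;
        return c;
    }
}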
To assess the above research question, we need to run automatic repair tools on faulty programs. As we mentioned, we consider the IntroClass dataset, so whatever conclusion we obtain will, in principle, be tied to this specific dataset and its characteristics (we further discuss this issue in the threats to validity, Section 4). By focusing on this dataset, we will definitely get more certainty regarding the following issues:
• Overfitting produced by repair tools on the IntroClass dataset, and
• Experimental data on the limitations of manual inspections in the context of automated program repair (especially because this benchmark has been used previously to evaluate various program repair tools).
We will show, with the aid of the bug-finding tool, that patches that were hinted as correct in [56] are not.
Notice that when a patch is produced, but this patch is not a fix, one may rightfully consider the problem to lie in the quality of the test suite used for patch generation, not necessarily in a limitation of test-based acceptance criteria as a whole, or of the program fixing technique in particular: by providing more/better tests one may prevent the acceptance of incorrect patches. That is, overfitting may be considered a limitation of the particular test suites rather than a limitation of test-based acceptance criteria. To take this issue into account, for instance, [47, 56] enrich the test suites provided with the benchmark with white-box tests that guarantee branch coverage of a correct variant of the buggy programs. Then, as shown in [47, Fig. 3], between 40% and 50% of the patches that are produced with the original suite are discarded when the additional white-box suite ensuring branch coverage is considered. Yet the analysis does not address the following two issues:
• Are the patches passing the additional white-box tests indeed fixes? And, equally important,
• Would the tool reject more patches by choosing larger suites?
This leads to our second research question:
RQ2: How does overfitting relate to the thoroughness of the validation test suites, in program repair techniques?
Thoroughness can be defined in many ways, typically via testing criteria. Given the vast number of testing criteria, an exhaustive analysis, with quality suites according to many of these different criteria, is infeasible. Our approach will be to enrich the validation test suites, those provided with the dataset, by adding bounded-exhaustive suites [4] for different bounds. The rationale here is to attempt to be as thorough as possible, to avoid overfitting. For each case study, we obtain suites with approximately 100 tests, and with approximately 1,000 tests (with two different bounds), for each routine. These suites can then be assessed according to measures for different testing criteria. Notice that, as the size of the test suites is increased, some tools and techniques may see their performance affected. This leads to our third research question:
RQ3: How does test suite size affect the performance of test-based automated repair tools?
As we mentioned earlier in this section, patches are classified as correct (i.e., fixes) or not using either manual inspections or, as in [56], held-out tests. We consider these to be error-prone procedures for assessing the correctness of a patch. In [56], overfitting of the Angelix APR tool is analyzed over the IntroClass dataset. Held-out tests are used to determine non-overfitting patches. Our fourth research question is then stated as:
RQ4: Can we produce substantial evidence that held-out tests underestimate overfitting patches in semantics-based APR tools such as Angelix?
The remaining part of this section is organized as follows. Section 3.1 describes the IntroClass dataset. Section 3.2 describes the experimental setup we used. Section 3.3 describes the reproducibility package we are providing in order to guarantee reproducibility of the performed experiments. Section 3.4 motivates our use of bounded-exhaustive suites. Finally, Section 3.5 presents the evaluations performed, and discusses research questions RQ1–RQ4.
3.1 The IntroClass Dataset
The IntroClass benchmark is thoroughly discussed in [31]. It contains student-developed C programs solving 6 simple problems (which we describe below), as well as instructor-provided test suites to assess their correctness. IntroClass has been used to evaluate a number of automated repair tools [27, 46, 47, 56], and its simplicity reduces the requirements on tool scalability. The benchmark comprises methods to solve the following problems (a reference sketch for the first of them is shown right after the list):
Checksum: Given an input string 𝑆 = 𝑐0 . . . 𝑐𝑘, this method computes a checksum character 𝑐 following the formula 𝑐 = ( ∑_{0 ≤ 𝑖 < 𝑆.length()} 𝑆.charAt(𝑖) ) % 64 + ' '.
Digits: Convert an input integer number into a string holding the number's digits in reverse order.
Grade: Receives 5 floats 𝑓1, 𝑓2, 𝑓3, 𝑓4 and score as inputs. The first four are given in decreasing order (𝑓1 > 𝑓2 > 𝑓3 > 𝑓4). These 4 values induce 5 intervals: (∞, 𝑓1], (𝑓1, 𝑓2], (𝑓2, 𝑓3], (𝑓3, 𝑓4], and (𝑓4, −∞]. A grade 𝐴, 𝐵, 𝐶, 𝐷 or 𝐹 is returned according to the interval score belongs to.
Median: Compute the median among 3 integer input values.
Smallest: Compute the smallest value among 4 integer input values.
Syllables: Compute the number of syllables into which an input string can be split according to English grammar (the vowels 'a', 'e', 'i', 'o' and 'u', as well as the character 'y', are considered as syllable dividers).
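For instance, a direct reference implementation of the checksum computation above can be written as follows (our own sketch, transcribing the formula literally):

// Reference implementation of Checksum: add up the character codes of the input
// string, reduce the sum modulo 64, and offset the result by the space character.
public static char Checksum(string s)
{
    int sum = 0;
    foreach (char ch in s)
        sum += ch;                       // accumulate character codes
    return (char)(' ' + (sum % 64));     // c = (sum of chars) % 64 + ' '
}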
There are two versions of the dataset: the original one described in [31], whose methods are given in the C language, and a Java translation of the original dataset described in [12]. Some of the programs that result from the translation from C to Java were not syntactically correct and consequently did not compile. Other programs saw significant changes in their behavior. Interestingly, for some programs, the transformation itself repaired the bug (the C program fails on some inputs, but the Java version is correct). The latter situation is mostly due to the different behavior of non-initialized variables in C versus Java [12]. These abnormal cases were removed from the resulting Java dataset, which thus has fewer methods than the C one.
Because of the automated program repair tools that we evaluate, which include AutoFix and Angelix, we need
to consider yet other versions of the IntroClass dataset.
• IntroClass Eiffel: This new version is the result of translating the original C dataset into Eiffel. For the translation, we employed the C2Eiffel tool [50]; moreover, since AutoFix requires contracts for program fixing, we replaced the input/output statements in the original IntroClass, which read inputs from and wrote outputs to standard input/output, so that programs receive inputs as parameters and produce outputs as return values. We equipped the resulting programs with the correct contracts for pre- and post-conditions of each case study. As in the translation from C to Java, several faulty programs became “correct” as a result of the translation. These cases have to do with default values for variables, as for Java, and with how input is required and output is produced; for instance, faulty cases that reported output values with accompanying messages in lowercase, when they were expected to be in upper case, are disregarded, since in the Eiffel-translated programs outputs are produced as return values.
• IntroClass Angelix: In order to run experiments with Angelix, the source code of each variation has to be instrumented and adapted to include calls to some macro functions, and to return the output in a single integer or char. Since in several variants the errors consist of modifications of the input/output strings, which are stripped out by the instrumentation, those faulty versions became “correct”.
Since IntroClass consists not only of different students' implementations but also different commits/versions of the implementation of each student, in several cases the instrumentation resulted in duplicate files, which we removed using the diff tool to reduce bias (notice that the IntroClass version used in [26] and reported in Fig. 1 does not eliminate duplicates and contains 208 extra datapoints). Table 1 describes the four datasets and, for each dataset, the number of faulty versions for each method. The sizes of their corresponding test suites are only relevant for C and Java since, in the case of AutoFix, tests are automatically produced using AutoTest [33] (the tool does not receive user-provided test suites).
checksum digits grade median smallest syllables Total
IntroClass C (GenProg) 46 143 136 98 84 63 570
IntroClass C (Angelix) 31 149 36 58 41 44 359
IntroClass Java (Nopol) 11 75 89 57 52 13 297
IntroClass Eiffel (AutoFix) 45 141 72 86 56 77 477
Test suite size 16 16 18 13 16 16 95
Table 1. Description of the IntroClass C, Java and Eiffel datasets.
3.2 Experimental Setup
In this section we describe the software and hardware infrastructure we employed to run the experiments whose results we report in Section 3.5. We also describe the criteria used to generate the bounded-exhaustive test suites, as well as the automated repair tools we evaluate and their configurations.
In order to evaluate the subjects from the IntroClass dataset we consider, besides the instructor-provided suite delivered within the dataset, two new suites. For those programs for which it is feasible, we consider bounded-exhaustive suites. Bounded-exhaustive suites contain all the inputs that can be generated within user-provided bounds. We use such suites for programs digits, median, smallest and syllables. Program grade uses floats and, therefore, even small bounds would produce suites that are too big; therefore, for program grade we use tests that are not bounded-exhaustive, but that are part of a bounded-exhaustive suite. We chose bounds so that the resulting suites have approximately 100 tests and 1,000 tests for each method under analysis. This gives rise to two new suites that we will call S100 and S1,000, whose test inputs for each problem are characterized below in Tables 2 and 3. Notice that from these inputs, actual tests are built using reference implementations of the methods under repair as an oracle. Notice also that all the tests in S100 also belong to S1,000, i.e., S1,000 extends S100 in all cases.
Throughout the experiments we report in this section, we used a workstation with an Intel Core i7 2600 (3.40 GHz) and 8 GB of RAM, running Ubuntu 16.04 LTS x86_64. The experiments involving Pex were performed on a virtual machine (VirtualBox) running a fresh install of Windows 7 SP1. The specific versions of the software used in the experiments, including those of the APR tools, can be found in the reproducibility package.
Test inputs specification                                                                        Total
S100_checksum  = { c0 . . . ck | 0 ≤ k ≤ 4 ∧ ∀ 0 ≤ i ≤ k : ci ∈ {'a','b','c'} }                   120
S100_digits    = { k | −64 ≤ k ≤ 63 }                                                             128
S100_grade     = { (f1, . . . , f4, score) | (∀ 1 ≤ i ≤ 4 : fi ∈ {30,40,50,60,70,80})
                   ∧ (f1 > f2 > f3 > f4) ∧ score ∈ {5,10,15,20, . . . , 90} }                     285
S100_median    = { (k1, k2, k3) | ∀ 1 ≤ i ≤ 3 : −2 ≤ ki ≤ 2 }                                     125
S100_smallest  = { (k1, k2, k3, k4) | ∀ 1 ≤ i ≤ 4 : −2 ≤ ki ≤ 1 }                                 256
S100_syllables = { c0 . . . ck | 0 ≤ k < 4 ∧ ∀ 0 ≤ i ≤ k : ci ∈ {'a','b','c'} }                   120
Table 2. Specification of test suites S100
Test inputs specification                                                                        Total
S1,000_checksum  = { c0 . . . ck | 0 ≤ k ≤ 5 ∧ ∀ 0 ≤ i ≤ k : ci ∈ {'a','b','c','e'} }            1,364
S1,000_digits    = { k | −512 ≤ k ≤ 511 }                                                        1,024
S1,000_grade     = { (f1, . . . , f4, score) | (∀ 1 ≤ i ≤ 4 : fi ∈ {10,20,30,40,50,60,70,80,90})
                     ∧ (f1 > f2 > f3 > f4) ∧ score ∈ {0,5,10,15,20, . . . , 100} }               2,646
S1,000_median    = { (k1, k2, k3) | ∀ 1 ≤ i ≤ 3 : −5 ≤ ki ≤ 4 }                                  1,000
S1,000_smallest  = { (k1, k2, k3, k4) | ∀ 1 ≤ i ≤ 4 : −3 ≤ ki ≤ 2 }                              1,296
S1,000_syllables = { c0 . . . ck | 0 ≤ k < 5 ∧ ∀ 0 ≤ i ≤ k : ci ∈ {'a','b','c','e'} }            1,364
Table 3. Specification of test suites S1,000
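As an illustration of how such suites can be obtained, the following sketch (ours, assuming a correct reference implementation is available as oracle) enumerates the S100 inputs for median, i.e., all triples with components in [−2, 2] as specified in Table 2, and labels each input with the oracle's expected output:

using System;
using System.Collections.Generic;

public static class BoundedExhaustiveSuites
{
    // All (k1, k2, k3) with -2 <= ki <= 2: 5^3 = 125 inputs, matching S100_median in Table 2.
    public static IEnumerable<(int k1, int k2, int k3, int expected)> S100Median()
    {
        for (int k1 = -2; k1 <= 2; k1++)
            for (int k2 = -2; k2 <= 2; k2++)
                for (int k3 = -2; k3 <= 2; k3++)
                    yield return (k1, k2, k3, ReferenceMedian(k1, k2, k3));
    }

    // Reference implementation used as the oracle when building the tests.
    static int ReferenceMedian(int a, int b, int c) =>
        Math.Max(Math.Min(a, b), Math.Min(Math.Max(a, b), c));
}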
In general, finding an appropriate timeout depends on the context in which the tool is being used. Perhaps for mission-critical applications a large timeout is more appropriate, while for other domains it may be too expensive to devote one hour to (most probably failed) repair attempts. We set a two-hour timeout, enough for the tools to run and, at the same time, reasonable with respect to the time required to run all the experiments.
3.3 Reproducibility
The empirical study we present in this paper involves a large set of different experiments. These involve 4 different datasets (versions of IntroClass, as described in the previous section), configurations for 4 different repair tools across 3 different languages, and 3 different sets of tests for the tools that receive test suites. Also, all case studies have been equipped with contracts, translated into C# and verified using Pex. We make all these elements available for the interested reader to reproduce our experiments at
https://sites.google.com/a/dc.exa.unrc.edu.ar/test-specs-program-repair/
Instructions to reproduce each experiment are provided therein.
3.4 Why Use Bounded-Exhaustive Suites
Research questions 2 and 3 discuss the impact of using alternative test suites (with respect to the test suites used as specifications of the programs being developed) on overfitting and tool scalability. It is expected that these held-out tests will allow one to expose those patches generated by an APR tool that overfit the specification suite. The first problem, namely, the impact on overfitting of considering alternative suites, has already been discussed in the literature by using, besides the original suite, new white-box suites automatically generated in order to satisfy a coverage criterion (usually, branch coverage) [26, 47, 56]. Using white-box suites as held-out tests is rarely
useful in software development. They are obtained from a (usually non-existent) correct implementation whose complexity usually differs greatly from the student-developed versions, losing both the white-box nature of the suite and the satisfaction of the coverage criterion. Also, they may not be comprehensive enough. For example, for the IntroClass dataset, the white-box suites used in [56] have between 8 and 10 tests, and keeping portions of the suite to check the impact of larger/smaller suites when assessing scalability seems at least risky. As an alternative to using small white-box suites, which suffer from the mentioned methodological limitations, we propose the use of bounded-exhaustive suites whenever possible, or similar ones when bounded-exhaustiveness is not an option. They can easily be made to grow in size as much as necessary, and they capture a behavior similar to formal specifications on a fragment of the program domain. We will be exploring alternatives to bounded-exhaustive suites with similar characteristics as further work.
3.5 Experimental Results
In this section we present the evaluation of each of the repair tools on the generated suites; from the collected data we discuss research questions RQ1–RQ4 in Sections 3.6–3.9. Tables 4–7 summarize the experimental data. In these tables, we report a patch/fix if any repair mode/configuration or execution produced a patch/fix. For example, due to its random behavior, we considered ten executions of GenProg; a single execution finding a patch suffices for the patch to be reported.
Method #Versions Suite #Patches #Fixes %Patches %Fixes
checksum 11 O 0 0 0% 0%
O∪S100 0 0 0% 0%
O∪S1,000 0 0 0% 0%
digits 75 O 7 1 9.3% 1.3%
O∪S100 2 1 2.7% 1.3%
O∪S1,000 2 1 2.7% 1.3%
grade 89 O 2 1 2.2% 1.1%
O∪S100 2 1 2.2% 1.1%
O∪S1,000 2 1 2.2% 1.1%
median 57 O 11 4 19.3% 7%
O∪S100 4 4 7% 7%
O∪S1,000 4 4 7% 7%
smallest 52 O 12 0 23.1% 0%
O∪S100 0 0 0% 0%
O∪S1,000 0 0 0% 0%
syllables 13 O 0 0 0% 0%
O∪S100 0 0 0% 0%
O∪S1,000 0 0 0% 0%
Table 4. Repair statistics for the Nopol automated repair tool.
3.6 Research Question 1
This research question addresses overfitting, a well-known limitation of Generation and Validation automatic program repair approaches that use test suites as the validation mechanism. The use of patch validation techniques based on human inspections or comparisons with developer patches (or even accepting patches as fixes without further discussion) has not allowed the community to identify the whole extent of this problem.
Method #Versions Suite #Patches #Fixes %Patches %Fixes
checksum 46 O 3 2 6.5% 4.3%
O∪S100 23 0 50% 0%
O∪S1,000 22 1 47.8% 2.2%
digits 143 O 29 12 20.3% 8.4%
O∪S100 17 10 11.9% 7%
O∪S1,000 13 7 9.1% 4.9%
grade 136 O 2 0 1.5% 0%
O∪S100 50 0 36.8% 0%
O∪S1,000 23 0 16.9% 0%
median 98 O 37 13 37.8% 13.3%
O∪S100 67 0 68.4% 0%
O∪S1,000 51 3 52% 3.1%
smallest 84 O 37 3 44% 3.6%
O∪S100 61 0 72.6% 0%
O∪S1,000 47 0 56% 0%
syllables 63 O 19 0 30.2% 0%
O∪S100 38 0 60.3% 0%
O∪S1,000 37 0 58.7% 0%
Table 5. Repair statistics for the GenProg automatic repair tool.
Method #Versions Suite #Patches #Fixes %Patches %Fixes
checksum 45 O 1 0 2.2% 0%
O∪S100 1 0 2.2% 0%
O∪S1,000 1 0 2.2% 0%
digits 141 O 0 0 0% 0%
O∪S100 0 0 0% 0%
O∪S1,000 0 0 0% 0%
grade 72 O 0 0 0% 0%
O∪S100 0 0 0% 0%
O∪S1,000 0 0 0% 0%
median 86 O 0 0 0% 0%
O∪S100 0 0 0% 0%
O∪S1,000 0 0 0% 0%
smallest 56 O 0 0 0% 0%
O∪S100 0 0 0% 0%
O∪S1,000 0 0 0% 0%
syllables 77 O 0 0 0% 0%
O∪S100 0 0 0% 0%
O∪S1,000 0 0 0% 0%
Table 6. Repair statistics for the AutoFix automatic repair tool.
Method #Versions Suite #Patches #Fixes %Patches %Fixes
checksum 31 O 0 0 0% 0%
O∪S100 0 0 0% 0%
O∪S1,000 0 0 0% 0%
digits 149 O 20 5 13.4% 3.4%
O∪S100 13 10 8.7% 6.7%
O∪S1,000 8 4 5.4% 2.7%
grade 36 O 7 7 19.4% 19.4%
O∪S100 7 7 19.4% 19.4%
O∪S1,000 5 5 13.9% 13.9%
median 58 O 29 10 50.0% 17.2%
O∪S100 15 15 25.9% 25.9%
O∪S1,000 6 6 10.3% 10.3%
smallest 41 O 33 1 80.5% 2.4%
O∪S100 1 1 2.4% 2.4%
O∪S1,000 1 1 2.4% 2.4%
syllables 44 O 0 0 0% 0%
O∪S100 0 0 0% 0%
O∪S1,000 0 0 0% 0%
Table 7. Repair statistics for the Angelix automatic repair tool.
For example, paper [26] includes the table we reproduce in Fig. 1.
Fig. 1. Performance of GenProg (and other tools) on the IntroClass dataset, as reported in [26].
The table gives the erroneous impression that 287 out of 778 bugs were fixed (36.8%). The paper actually analyzes this in more detail and, by using independent test suites to validate the generated patches, it claims GenProg's patches pass 68.7% of independent tests, giving the non-expert reader the impression that the produced patches were of good quality while, in fact, it might be the case that none of the patches is actually a fix. Indeed, as our experiments reported in Table 5 show, only 30 out of 570 faults were correctly fixed (which gives a fixing ratio of 5.3%, well below the 36.8% presented in [26]).
We have obtained similar results for the other tools under analysis. Angelix patches 89 faults out of 360 program variants (a ratio of 24.7%), yet only 23 patches are fixes (the ratio drops to 6.4%). The remaining patches were discarded with the aid of Pex. Nopol patched 32 out of 297 versions (10.7%) using the evaluation test suite. Upon verification with Pex, the number of fixes is 6 (2%). AutoFix uses contracts (which we provided) in order to automatically (and randomly) generate the evaluation suite. When a patch is produced, AutoFix validates the adequacy of the patch with a randomly generated suite. AutoFix produced patches for the great majority of faulty routines, but itself showed that most of these were inadequate, and overall reported only 1 patch (which was an invalid fix).
As previously discussed at the beginning of Section 3, these unsatisfactory results might be due to the low quality of the validation test suite. Yet it is worth emphasizing that the IntroClass dataset was developed to be used in program repair, and the community has vouched for its quality by publishing the benchmark and using it in their research.
3.7 Research Question 2
This research question relates to the impact of more thorough validation suites on overfitting, as well as on the quality of the produced patches. Table 4 shows that Nopol profits from larger suites to reduce overfitting significantly. It suffices to consider suite O ∪ S100 to notice a reduction in overfitting: the number of patches is reduced from 32 to 8. Unfortunately, the number of fixes remains low. This shows that Nopol, when fed with a better quality evaluation suite, is able to produce (a few) good quality fixes. GenProg, on the other hand, shows an interesting behaviour (see Table 5): it doubles the number of patches with suite O ∪ S100, yet the number of fixes is reduced from 30 to 10. With suite O ∪ S1,000 it produces around 50% more patches, but the number of fixes is reduced from 30 to 11. We believe this is due to its random nature. Angelix (see Table 7) sees its overfitting reduced. Interestingly, suite O ∪ S100 allows Angelix to obtain 5 more fixes for method digits and 5 more for method median. Since AutoFix generates the evaluation suites, rather than providing larger suites we extend the test generation time. AutoFix does not have a good performance on this dataset.
3.8 Research Question 3
This research question relates to the impact that larger validation suites may have on repair performance. For most of the evaluated tools this impact can be illustrated by showing the number of timeouts reached during the repair process. Table 8 reports, for each tool able to use suites S100 and S1,000, and for each validation suite, the number of timeouts reached during repair. We report a timeout only when a timeout occurs for all repair modes/configurations and executions. For example, we considered two repair modes (condition and precondition) for Nopol; depending on the suite, some of them reached a timeout, but there is no faulty version for which all of them do. Recall that AutoFix generates its validation suites from user-provided contracts, and is therefore left out of this analysis. While Nopol's mechanism for data collection discards tests that are considered redundant (which, as shown in Table 8, makes Nopol resilient to suite size increments), GenProg and Angelix are both sensitive to the size of the evaluation suite.
3.9 Research Question 4
This research question addresses our intuition that using held-out tests, as a means to determine whether a patch found by an APR tool is indeed correct, is an error-prone procedure. This procedure is widely used and accepted by the community [26, 30, 47, 56, 58]. In [56, Table 2], reproduced in Fig. 2, we see the overfitting reported for Angelix on the IntroClass benchmark using the originally provided black-box test suites. The table omits methods syllables and checksum, for which no patches were generated.
Suite Nopol GenProg Angelix
O 0 2 5
O∪S100 0 135 47
O∪S1,000 0 237 69
Table 8. Number of timeouts reached by tool and validation suite.
[Fig. 2 reproduces [56, Table 2]: baseline overfitting rates (overfitting patches out of total patches generated) for Angelix, CVC4, Enumerative, and SemFix, on IntroClass with black-box and white-box training tests (subjects smallest, median and digits) and on Codeflaws.]
Fig. 2. Overfitting patches produced by Angelix as reported in [56, Table 2].
Table 9 compares the number of fixes obtained in [56] (reproduced in Fig. 2) and in this article in Table 7, for those methods that are in the intersection of both studies, namely median, smallest and digits.
Method Fig. 2 Table 7
median 9 5
smallest 10 10
digits 1 1
Table 9. Comparing the number of reported fixes with [56].
While there is a discrepancy in the results reported in Table 9, this does not imply per se that an error has been made. Methods have been instrumented, parameters may include subtle differences that lead to different results, etc. These discrepancies, nevertheless, caught our attention and led to research question 4. Paper [56] reports a reproducibility package [57] that, in particular, includes all the patches generated by Angelix. It is not clear how the files in [57] match the experiments reported in [56] (the number of files does not match the results reported in the paper). Still, many files are available that allow us to study this research question in depth.
Disregarding their provenance, there are 45,131 patches in the reproducibility package [57]. Since there may be repeated patches, we removed repetitions and 2,120 patches remained. For each of these patches we checked overfitting using the test suite provided in IntroClass (suite O in our tables) as held-out tests. Also, we checked overfitting using Pex. Finally, as we did with our own experiments, we ran the corresponding test generated by Pex (if any) on the original patched method to check that it actually fails. Table 10 reports the total number of unique patches for each method. Since there might be multiple patches for a specific student version, we also present the number of versions involved in brackets. For example, there are 78 unique patches for the method digits in the available reproducibility package, yet they correspond to 22 different student versions. Table 10 provides strong evidence that using held-out tests to assess the correctness of a patch is unacceptable, since more than 80% of the patches deemed correct by the held-out tests are in fact incorrect (the 80% value is obtained from Table 10 by calculating (#pass O − #pass Pex)/#pass O). Interestingly, the numbers of student versions fixed reported in Table 10 (after an in-depth analysis of the reproducibility package obtained from [56]) and in Table 7 are very similar. This observation suggests that, despite the differences in the instrumentation of the dataset and the experimental setup, running the same tool on the same dataset produced the same results, providing evidence against any bias being introduced during our experiments.
Method #Patches #Pass O #Pass Pex
median 887 [58] 130 [28] 66 [12]
smallest 1155 [51] 283 [26] 11 [3]
digits 78 [22] 9 [7] 3 [1]
Table 10. Quality of patches that pass the held-out tests.
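As a sanity check of the 80% figure, the calculation aggregates the three rows of Table 10 (this aggregation is our reading of the formula above):
    #pass O = 130 + 283 + 9 = 422,    #pass Pex = 66 + 11 + 3 = 80,
    (#pass O − #pass Pex) / #pass O = (422 − 80) / 422 = 342 / 422 ≈ 0.81,
i.e., roughly 81% of the patches that pass the held-out tests are rejected by the bug-finding analysis.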
4 THREATS TO VALIDITY
In this paper we focused on the IntroClass dataset. Therefore, the conclusions we draw only apply to this dataset and, more precisely, to the way in which the selected automated repair tools are able to handle IntroClass. Nevertheless, we believe this dataset is particularly adequate to stress some of the points we make in the paper. In particular, considering small methods that can be easily specified in formal behavioral specification languages such as Code Contracts [13] or JML [5] allows us to determine whether patches are indeed fixes or are spurious fix candidates. This is a problem that is usually overlooked in the literature: either patches are accepted as fixes (no further study on the quality of patches is made) [47], or they are subject to human inspection (which we consider severely error-prone), or they are compared against developer fixes retrieved from the project repository [37] (which, as pointed out in [47], may show that automated repair tools and developers overfit in a similar way). Since these tools work on different languages (GenProg and Angelix work on C code, while Nopol works on Java code and AutoFix repairs Eiffel code), the corresponding datasets are different across languages, as explained in Section 3.2. Therefore, the reader is advised not to draw conclusions by comparing tools working on different datasets.
Also, we used the repair tools to the best of our abilities. This is complex in itself because research tools usually have usability limitations and are not robust enough. In all cases we consulted the corresponding tool developers in order to make sure we were using the tools with the right parameters, and we reported a number of bugs that in some cases were fixed in time for us to run experiments with the fixed versions. The reproducibility package includes all the settings we used.
Notice that we are using Pex as our gold standard, i.e., if a patch is deemed correct by Pex, it is accepted as a fix. This may not always be the right verdict; when it is not, it counts against our hypothesis. For the experimental evaluation we used a 2-hour timeout. We consider this an appropriate timeout, but increasing it might yield additional results.
The results reported only apply to the studied tools. Other tools might behave in a substantially different way. We attempted to conduct this study on a wider class of tools, yet some tools were not available even for academic use (for instance PAR [29]), while other tools had usability limitations that prevented us from running them even on this simple dataset (this was the case for instance with SPR [34]).
5 RELATED WORK
This paper extends previous results published by the authors in [59]. That paper addressed the use of specifications and automated bug-finding to assess overfitting in automated program repair tools. Similar ideas were later also published in [44], where OpenJML [9] is used for verification purposes rather than Pex. Only the overfitting
problem is analyzed in [44], but on a different benchmark of programs. This new benchmark is a valuable contribution of [44]. Instead, we stick to the IntroClass dataset, which has simpler programs, has been used as a benchmark in other papers, and serves as a lower bound for the evaluation of APR tools, i.e., if tools fail to properly fix these simple programs, they can hardly be expected to be successful on more complex ones.
Automatic program fixing has become over the last few years a very active research topic, and various tools for program repair are now available, many of which we have already referred to earlier in this paper. These generally differ in their approaches for producing program patches, using several different underlying approaches, including search-based techniques, evolutionary computation, pattern-based program fixing, program mutation, synthesis, and others. Since this paper is mainly concerned with how these program fixing approaches evaluate the produced fix candidates, we will concentrate on that aspect of automated program repair techniques. A very small set of program repair techniques use formal specifications as acceptance criteria for program fixes. Gopinath et al. [17] propose a technique to repair programs automatically by employing SAT solving for various tasks, including the construction of repair values from faulty programs where suspicious statements are parameterized, and checking whether the repair candidates are indeed fixes; they use contracts specified in Alloy [20], and SAT-based bounded verification for checking candidate programs against specifications. Staber et al. [48] apply automated repairs on programs captured as finite state machines, whose intended behavior is captured through linear time temporal logic [8]; repair actions essentially manipulate the state transition relation, using a game-theoretic approach. von Essen and Jobstmann [51] propose a technique to repair reactive systems, specified by formal specifications in linear time temporal logic, resorting to automated synthesis. Arcuri and Yao's approach [1] applies to sequential programs accompanied by formal specifications in the form of first-order logic pre- and post-conditions, and uses genetic programming to evolve a buggy program in the search for a fix, driven by a set of tests automatically computed from the formal specification and the program; their use of formal specifications is then weaker than in the previously mentioned cases. Wei et al. [52] propose a technique that combines tests for fault localization with specifications in the form of contracts, for automatically repairing Eiffel programs. Their technique may correct both programs and contracts; it uses automatically generated tests to localize faults and to instantiate fix schemas to produce fix candidates; fix candidates are then assessed indirectly against contracts, since they are evaluated on a collected set of failing and passing tests, automatically built using random test generation based on the contracts.
All other automated repair tools we are aware of use tests as specifications, mainly as a way of making the corresponding techniques more widely applicable, since tests can be more commonly found in software projects and their use scales more reasonably than other verification approaches. We summarize here a set of known tools and techniques that use tests as specifications. The BugFix tool by Jeffrey et al. [22] applies to C programs and uses tests as specifications; the tool employs machine learning techniques to produce bug-fixing suggestions from rules learned from previous bug fixes. Weimer et al. [54] use genetic algorithms for automatically producing program fixes for C programs, using tests as specifications too; moreover, they emphasize the fact that tests, as opposed to formal specifications, lead to wider applicability of their technique. Kern and Esparza [28] repair Java programs by systematically exploring alternatives to hotspots (error-prone parts of the code), provided that the developers characterize hotspot constructs and provide suitable syntactic changes for these; they also use tests as specifications, but their experiments tend to use larger test sets compared to the approaches based on evolutionary computation. Debroy and Wong [10] propose a technique that combines fault localization with mutation for program repair; fault localization is a crucial part of their technique, in which a test suite is involved, the same one used as acceptance criterion for the produced program patches. The tool SemFix by Nguyen et al. [41] combines symbolic execution with constraint solving and program synthesis to automatically repair programs; this tool uses the provided tests both for fault localization and for producing constraints that would lead to program patches that pass all tests. Kaleeswaran et al. [25] propose a technique for identifying, from a faulty program, parts of it that are likely to be part of the repaired code, and suggest expressions on how to change these. Tests
are used in their approach both for localizing faults and for capturing the expected behavior of a program in order to synthesize hints for fixes. Ke [27] proposes an approach to program repair that identifies faulty code fragments and looks for alternative, human-written pieces of code that would constitute patches of the faulty program; while this approach uses constraints to capture the expected behavior of fragments and constraint solving to find patches, this behavior is taken from tests, and the produced patches are in the end evaluated against a set of test cases for acceptance. Long and Rinard [34] propose SPR, a technique based on the use of transformation schemas that target a wide variety of program defects and are instantiated using a novel condition synthesis algorithm. SPR also uses tests as specifications, not only as acceptance criterion but also as part of its condition synthesis mechanism. Mechtaev et al. [37] propose Angelix, a tool for program repair based on symbolic execution and constraint solving, supported by a novel notion of angelic forest that captures information regarding (symbolic) executions of the program being repaired; while Angelix uses symbolic execution and constraint solving, the intended behavior of the program to be repaired is in this case also captured through test cases. Finally, Xuan et al. [55] propose Nopol, which also resorts to constraint solving to produce patches from information originating in test executions, encoded as constraints. Again, Nopol uses tests both in the patch generation process and as the acceptance criterion for its produced fixes.
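To make the shared pattern concrete, the following is a minimal sketch, not tied to any particular tool's implementation, of the generate-and-validate policy that underlies the approaches above: a candidate patch is accepted as soon as the patched program passes every test in the acceptance suite. The TestCase record and the representation of candidate programs as integer functions are simplifying assumptions made only for illustration.

    import java.util.List;
    import java.util.Optional;
    import java.util.function.Function;

    final class TestBasedAcceptance {
        // Hypothetical test representation: an input and its expected output.
        record TestCase(int input, int expectedOutput) {}

        // The acceptance criterion: the patched program must agree with every test.
        static boolean passesAll(Function<Integer, Integer> patched, List<TestCase> suite) {
            return suite.stream().allMatch(t -> patched.apply(t.input()) == t.expectedOutput());
        }

        // Generate-and-validate: return the first candidate that passes the suite.
        // Nothing here guarantees correctness outside the suite, which is precisely
        // the overfitting risk studied in this paper.
        static Optional<Function<Integer, Integer>> firstAccepted(
                List<Function<Integer, Integer>> candidates, List<TestCase> suite) {
            return candidates.stream().filter(c -> passesAll(c, suite)).findFirst();
        }
    }

The differences among the tools lie in how the candidates are produced (genetic search, constraint solving, schema instantiation, and so on); the acceptance step is, in essence, the check sketched in passesAll.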
Various tools for program repair that employ tests as acceptance criteria for program fixes have been shown to produce spurious (incorrect) repairs. Qi et al. [46] show that GenProg and other tools overfit patches to the provided acceptance suites. They do so by showing that third-party generated suites reject the produced patches. Since several tools (particularly GenProg) use suites to guide the patch generation process, [46] actually shows that the original suites are not good enough. We go one step further and show that, even when considering more comprehensive suites, the performance of the repair tools is only partially improved: fewer overfitting patches are produced, but no new fixes. This supports the experience of the authors of [46], and generalizes it to other tools as well:
“Our analysis substantially changed our understanding of the capabilities of the analyzed
automatic patch generation systems. It is now clear that GenProg, RSRepair, and AE are
overwhelmingly less capable of generating meaningful patches than we initially understood
from reading the relevant papers.”
The overfitting problem is also addressed in [47, 56], where the original test suite is extended with a white-box one, automatically generated using the symbolic execution engine KLEE [6]. Research question 2 in [47] analyzes the relationship between test suite coverage and overfitting, a problem we also study in this paper. Their analysis proceeds by considering subsets of the given suite, and showing that this leads to even more overfitted patches. Rather than taking subsets of the original suite, we go the other way around and extend the original suite with a substantial number of new tests. This allows us to reach conclusions that go beyond those of [47], for instance that, while overfitting decreases, the fixing ratio remains very low. Also, we analyze the impact of larger suites on tool performance, which cannot be properly addressed by using small suites.
Long and Rinard [35] also study the overfitting problem, but from the perspective of the tools' search spaces. They conclude that many tools show poor performance because their search space contains significantly fewer fixes than patches, and that in some cases the patch generation process employed produces a search space that does not contain any fixes at all.
Kali [46] was developed with the purpose of generating patches that simply delete functionality. RSRepair [45] is an adaptation of GenProg that replaces genetic programming with random search.
6 DISCUSSION
The significant advances in automated program analysis have enabled the development of powerful tools for assisting developers in various tasks, such as test case generation, program verification, and fault localization.
The great amount of effort that software maintenance demands is turning the focus of automated analysis towards automatically fixing programs, and a wide variety of tools for automated program repair have been developed in the last few years. The mainstream of these tools, as we have analyzed in this paper, concentrates on using tests as specifications, since tests are more often found in software projects than more sophisticated formal specifications, and their evaluation scales better than the analysis of formal specifications using more thorough techniques. While several researchers have acknowledged the problem of using inherently partial, test-based specifications to capture expected program behavior, the more detailed analyses that have been proposed consist in using larger test suites or performing manual inspections, in order to assess more precisely the effectiveness of automated program repair techniques and the severity of so-called test-suite-overfitting patches [47].
Our approach in this paper has been to empirically study the suitability of tests as fix acceptance criteria in the context of automated program repair, by checking the produced patches with an automatic bug-finding tool, as opposed to previous works that used tests or manual inspections. We believe that previous approaches to analyzing overfitting have failed to demonstrate how critical the problem of invalid patches overfitting test suites is. Our results show that the percentage of valid fixes that state-of-the-art program repair tools using tests as acceptance criteria are able to provide is significantly lower than the estimations of previous assessments, e.g., [47], even in simple examples such as the ones analyzed in this paper. Moreover, increasing the number of tests reduces the number of spurious fixes but does not contribute to generating more fixes, i.e., it does not improve these tools' effectiveness; instead, such increases most often make the tools exhaust resources without producing patches.
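As a toy illustration (not taken from the benchmark) of why passing the acceptance suite is insufficient, consider the following hypothetical candidate patch for a routine that should compute the maximum of two integers: it passes a three-test suite, yet it violates the specification on inputs outside the suite, which a bug-finding check against the formal specification exposes immediately.

    import java.util.List;

    final class OverfittingExample {
        record TestCase(int a, int b, int expected) {}

        // Hypothetical candidate synthesized to satisfy the suite below: it branches
        // on the exact suite inputs instead of computing the maximum.
        static int candidateMax(int a, int b) {
            if (a == 3 && b == 1) return 3;
            if (a == 0 && b == 7) return 7;
            return a; // wrong whenever b > a for inputs not covered by the suite
        }

        public static void main(String[] args) {
            List<TestCase> suite = List.of(
                    new TestCase(3, 1, 3), new TestCase(0, 7, 7), new TestCase(5, 2, 5));
            boolean acceptedBySuite = suite.stream()
                    .allMatch(t -> candidateMax(t.a(), t.b()) == t.expected());
            // Accepted by the tests, yet candidateMax(1, 2) == 1 violates the
            // postcondition "the result is the larger of a and b".
            System.out.println("accepted by suite: " + acceptedBySuite
                    + ", spec violated on (1, 2): " + (candidateMax(1, 2) != 2));
        }
    }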
7 CONCLUSIONS AND FURTHER WORK
Some conclusions can be drawn from these results. While weaker or lighter-weight specifications, e.g., those based on tests, have been successful in improving the applicability of automated analyses, as has been shown in the contexts of test generation, bug finding, fault localization and other techniques, this does not seem to be the case in the context of automated program repair. Indeed, as our results show, using tests as specifications makes it significantly more likely to obtain invalid patches (that pass all tests) than actual fixes. We foresee three lines of research to overcome this fundamental limitation:
(1) Use strong formal specifications describing the problem to be solved by the program under analysis. In some domains (for instance, when automatically repairing formal models [19, 60, 61]) this is the natural way to go. For programs, more work is necessary in order to assess whether partial formal specifications offer improvements over test-based specifications.
(2) Use more comprehensive test suites, for instance bounded-exhaustive suites. These capture a portion of the semantics of a strong formal specification (see the sketch after this list). Since such suites are likely to be large, new tools must be prepared to deal with large suites.
(3) Include a human in the loop who assesses whether a repair candidate is indeed a fix. If she determines it is not, she may expand the suite with new tests. This iterative process has limitations (the human may make wrong decisions), but it has good chances of being more effective than test-based specifications.
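As a minimal sketch of direction (2), and under the simplifying assumption that inputs are pairs of small integers, the following shows how a bounded-exhaustive suite can be derived from a formal specification: every input within the bound is enumerated, and the expected output is computed from the specification of a hypothetical max routine. A bound of 5 already yields 121 tests, in the order of the roughly 100-test suites used in our experiments.

    import java.util.ArrayList;
    import java.util.List;

    final class BoundedExhaustiveSuite {
        record TestCase(int a, int b, int expected) {}

        // Oracle derived from the specification: the result must be greater than or
        // equal to both arguments and equal to one of them, i.e., the maximum.
        static int specOracle(int a, int b) {
            return a >= b ? a : b;
        }

        // Enumerates every input with |a| <= bound and |b| <= bound.
        static List<TestCase> generate(int bound) {
            List<TestCase> suite = new ArrayList<>();
            for (int a = -bound; a <= bound; a++) {
                for (int b = -bound; b <= bound; b++) {
                    suite.add(new TestCase(a, b, specOracle(a, b)));
                }
            }
            return suite;
        }
    }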
This work opens further lines of research. An obvious one consists of auditing patches reported in the literature, by performing an automated evaluation like the one carried out in this paper. This is not a simple task in many cases, since it demands understanding the contexts of the repairs and formally capturing the expected behavior of the repaired programs. Also, in this paper we used bounded-exhaustive test suites containing approximately 100 or 1,000 tests. For some tools we saw an improvement in the number of fixes, and for others we saw that large suites render the tools useless. We will study how tools behave when finer granularity is applied in the construction of bounded-exhaustive suites, hoping to find sweet spots that favor the quality of the produced patches.
In this article we are not proposing the use of specifications and verification tools alongside automated program repair in industrial settings. Specifications are scarce, and producing good-enough specifications is an expensive task. Yet it is essential that APR tool users be aware of the actual limitations APR tools have. Still, in an academic setting, we believe that checking tools against the IntroClass dataset and assessing the quality of patches using formal specifications should be standard practice. This paper provides all the infrastructure necessary to make this task a simple one.
REFERENCES
[1] A. Arcuri and X. Yao, A Novel Co-evolutionary Approach to Automatic Software Bug Fixing, in CEC 2008.
[2] C. Barrett, R. Sebastiani, S. Seshia, C. Tinelli, Satisfiability Modulo Theories, in A. Biere, M.J.H. Heule, H. van Maaren, T. Walsh (eds.), Handbook of Satisfiability, Frontiers in Artificial Intelligence and Applications, Vol. 185, IOS Press, pp. 825–885, 2009.
[3] R. Bharadwaj and C. Heitmeyer, Model Checking Complete Requirements Specifications Using Abstraction, Automated Software Engineering 6(1), Kluwer Academic Publishers, 1999.
[4] C. Boyapati, S. Khurshid, D. Marinov, Korat: Automated Testing Based on Java Predicates, ISSTA 2002, pp. 123–133.
[5] L. Burdy, Y. Cheon, D.R. Cok, M.D. Ernst, J.R. Kiniry, G.T. Leavens, K. Rustan M. Leino and E. Poll, An Overview of JML Tools and Applications, STTT 7(3), Springer, 2005.
[6] C. Cadar, D. Dunbar, and D. Engler, KLEE: Unassisted and Automatic Generation of High-Coverage Tests for Complex Systems Programs, in USENIX Conference on Operating Systems Design and Implementation (OSDI), pp. 209–224, San Diego, CA, USA, 2008.
[7] A. Church, Application of Recursive Arithmetic to the Problem of Circuit Synthesis, in Summaries of Talks Presented at the Summer Institute for Symbolic Logic, Cornell University, 1957, 2nd edn., Communications Research Division, Institute for Defense Analyses, Princeton, NJ, 1960, pp. 3–50, 3a–45a.
[8] E. Clarke, O. Grumberg and D. Peled, Model Checking, MIT Press, 2000.
[9] D. Cok, OpenJML: JML for Java 7 by Extending OpenJDK, in NASA Formal Methods Symposium, Springer, 2011, pp. 472–479.
[10] V. Debroy and W.E. Wong, Using Mutation to Automatically Suggest Fixes to Faulty Programs, ICST 2010, pp. 65–74.
[11] K.A. De Jong, Evolutionary Computation: A Unified Approach, MIT Press, 2006.
[12] T. Durieux and M. Monperrus, IntroClassJava: A Benchmark of 297 Small and Buggy Java Programs, Research Report, Université Lille 1, 2016.
[13] M. Fähndrich, Static Verification for Code Contracts, SAS 2010, pp. 2–5.
[14] M. Fähndrich, M. Barnett, D. Leijen, F. Logozzo, Integrating a Set of Contract Checking Tools into Visual Studio, in Proceedings of the Second International Workshop on Developing Tools as Plug-Ins (TOPI 2012), IEEE, 2012.
[15] A. Fuxman, M. Pistore, J. Mylopoulos and P. Traverso, Model Checking Early Requirements Specifications in Tropos, in Proceedings of the 5th IEEE International Symposium on Requirements Engineering, Toronto, Canada, 2001.
[16] J.P. Galeotti, N. Rosner, C. López Pombo, M.F. Frias, TACO: Efficient SAT-Based Bounded Verification Using Symmetry Breaking and Tight Bounds, IEEE Transactions on Software Engineering 39(9): 1283–1307, 2013.
[17] D. Gopinath, M.Z. Malik and S. Khurshid, Specification-Based Program Repair Using SAT, TACAS 2011, pp. 173–188.
[18] C. Le Goues, T. Nguyen, S. Forrest, W. Weimer, GenProg: A Generic Method for Automatic Software Repair, IEEE Transactions on Software Engineering 38, IEEE, 2012.
[19] S. Gutiérrez Brida, G. Regis, G. Zheng, H. Bagheri, T. Nguyen, N. Aguirre, M.F. Frias, Bounded Exhaustive Search of Alloy Specification Repairs, ICSE 2021, pp. 1135–1147.
[20] D. Jackson, Software Abstractions: Logic, Language and Analysis, The MIT Press, 2006.
[21] D. Jackson and M. Vaziri, Finding Bugs with a Constraint Solver, in Proceedings of the 2000 ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA 2000), ACM, 2000.
[22] D. Jeffrey, M. Feng, N. Gupta, R. Gupta, BugFix: A Learning-Based Tool to Assist Developers in Fixing Bugs, in Proceedings of the International Conference on Program Comprehension (ICPC 2009), 2009.
[23] R. Jhala and R. Majumdar, Software Model Checking, ACM Computing Surveys 41(4), ACM, 2009.
[24] B. Jobstmann, A. Griesmayer, R. Bloem, Program Repair as a Game, in Proceedings of the Computer Aided Verification Conference (CAV 2005), 2005.
[25] S. Kaleeswaran, V. Tulsian, A. Kanade, A. Orso, MintHint: Automated Synthesis of Repair Hints, International Conference on Software Engineering (ICSE), 2014.
[26] Y. Ke, K.T. Stolee, C. Le Goues, and Y. Brun, Repairing Programs with Semantic Code Search, International Conference on Automated Software Engineering (ASE), 2013.
[27] Y. Ke, An Automated Approach to Program Repair with Semantic Code Search, Graduate Theses and Dissertations, Iowa State University, 2015.
[28] C. Kern and J. Esparza, Automatic Error Correction of Java Programs, Formal Methods for Industrial Critical Systems (FMICS), 2010.
[29] D. Kim, J. Nam, J. Song, and S. Kim, Automatic Patch Generation Learned from Human-Written Patches, in Proceedings of the 35th International Conference on Software Engineering (ICSE 2013), San Francisco, May 18–26, 2013, pp. 802–811.
[30] X. Kong, L. Zhang, W.E. Wong, and B. Li, Experience Report: How Do Techniques, Programs, and Tests Impact Automated Program Repair?, in 2015 IEEE 26th International Symposium on Software Reliability Engineering (ISSRE), IEEE, 2015, pp. 194–204.
[31] C. Le Goues, N. Holtschulte, E.K. Smith, Y. Brun, P. Devanbu, S. Forrest, and W. Weimer, The ManyBugs and IntroClass Benchmarks for Automated Repair of C Programs, IEEE Transactions on Software Engineering (TSE), 2013.
[32] C. Le Goues, M. Pradel, A. Roychoudhury, Automated Program Repair, Communications of the ACM 62(12): 56–65, 2019.
[33] A. Leitner, I. Ciupa, B. Meyer, M. Howard, Reconciling Manual and Automated Testing: The AutoTest Experience, in Proceedings of the 40th Hawaii International Conference on System Sciences, 2007.
[34] F. Long and M.C. Rinard, Staged Program Repair with Condition Synthesis, in Symposium on the Foundations of Software Engineering (FSE), 2015.
[35] F. Long and M.C. Rinard, Analysis of the Search Spaces for Generate and Validate Patch Generation Systems, International Conference on Software Engineering (ICSE), 2016.
[36] S. Mechtaev, J. Yi, and A. Roychoudhury, DirectFix: Looking for Simple Program Repairs, ICSE 2015.
[37] S. Mechtaev, J. Yi, and A. Roychoudhury, Angelix: Scalable Multiline Program Patch Synthesis via Symbolic Analysis, International Conference on Software Engineering (ICSE), 2016.
[38] B. Meyer, Applying "Design by Contract", IEEE Computer, IEEE, 1992.
[39] B. Meyer, A Touch of Class, 2nd corrected ed., Springer, 2013.
[40] B. Meyer, A. Fiva, I. Ciupa, A. Leitner, Y. Wei, and E. Stapf, Programs That Test Themselves, IEEE Software, pp. 22–24, 2009.
[41] H.D.T. Nguyen, D. Qi, A. Roychoudhury, S. Chandra, SemFix: Program Repair via Semantic Analysis, International Conference on Software Engineering (ICSE), 2013.
[42] R. Nieuwenhuis, A. Oliveras and C. Tinelli, Solving SAT and SAT Modulo Theories: From an Abstract Davis–Putnam–Logemann–Loveland Procedure to DPLL(T), Journal of the ACM 53(6), pp. 937–977, ACM, 2006.
[43] R. Nieuwenhuis and A. Oliveras, On SAT Modulo Theories and Optimization Problems, in Theory and Applications of Satisfiability Testing – SAT 2006, 9th International Conference, Seattle, WA, USA, August 12–15, 2006, Proceedings (Lecture Notes in Computer Science, Vol. 4121), A. Biere and C.P. Gomes (Eds.), Springer, pp. 156–169, 2006.
[44] A. Nilizadeh, G.T. Leavens, X.-B.D. Le, C.S. Păsăreanu and D.R. Cok, Exploring True Test Overfitting in Dynamic Automated Program Repair Using Formal Methods, in 14th IEEE Conference on Software Testing, Verification and Validation (ICST), 2021, pp. 229–240.
[45] Y. Qi, X. Mao, Y. Lei, Z. Dai and C. Wang, The Strength of Random Search on Automated Program Repair, International Conference on Software Engineering (ICSE), 2014.
[46] Z. Qi, F. Long, S. Achour, and M.C. Rinard, An Analysis of Patch Plausibility and Correctness for Generate-and-Validate Patch Generation Systems, in Proceedings of the 2015 International Symposium on Software Testing and Analysis (ISSTA 2015), Baltimore, MD, USA, July 12–17, 2015, pp. 24–36.
[47] E.K. Smith, E. Barr, C. Le Goues, and Y. Brun, Is the Cure Worse Than the Disease? Overfitting in Automated Program Repair, Symposium on the Foundations of Software Engineering (FSE), 2015.
[48] S. Staber, B. Jobstmann, R. Bloem, Finding and Fixing Faults, in Proceedings of the Conference on Correct Hardware Design and Verification Methods, 2005.
[49] N. Tillmann and J. de Halleux, Pex: White Box Test Generation for .NET, in Proceedings of the Second International Conference on Tests and Proofs (TAP 2008), LNCS, Springer, 2008.
[50] M. Trudel, C. Furia, M. Nordio, Automatic C to O-O Translation with C2Eiffel, in Proceedings of the 19th Working Conference on Reverse Engineering (WCRE 2012), IEEE, 2012.
[51] C. v. Essen and B. Jobstmann, Program Repair without Regret, in Proceedings of Computer Aided Verification (CAV 2013), 2013.
[52] Y. Wei, Y. Pei, C.A. Furia, L.S. Silva, S. Buchholz, B. Meyer, and A. Zeller, Automated Fixing of Programs with Contracts, International Symposium on Software Testing and Analysis (ISSTA), 2010.
[53] W. Weimer, T. Nguyen, C. Le Goues and S. Forrest, Automatically Finding Patches Using Genetic Programming, ICSE 2009, pp. 364–374.
[54] W. Weimer, S. Forrest, C. Le Goues, T. Nguyen, Automatic Program Repair with Evolutionary Computation, Communications of the ACM 53(5), ACM, 2010.
[55] J. Xuan, M. Martinez, F. Demarco, M. Clément, S. Lamelas, et al., Nopol: Automatic Repair of Conditional Statement Bugs in Java Programs, IEEE Transactions on Software Engineering.
[56] X.-B.D. Le, F. Thung, D. Lo, C. Le Goues, Overfitting in Semantics-Based Automated Program Repair, Empirical Software Engineering 23(5): 3007–3033, 2018.
[57] X.-B.D. Le, F. Thung, D. Lo, C. Le Goues, Reproducibility Package for Paper [56], https://doi.org/10.5281/zenodo.1012686, last accessed on September 2, 2021.
[58] H. Ye, M. Martinez, and M. Monperrus, Automated Patch Assessment for Program Repair at Scale, Empirical Software Engineering, vol. 26, no. 2, pp. 1–38, 2021.
[59] L. Zemín, S. Gutiérrez Brida, A. Godio, C. Cornejo, R. Degiovanni, G. Regis, N. Aguirre, M.F. Frias, An Analysis of the Suitability of Test-Based Patch Acceptance Criteria, SBST@ICSE 2017, pp. 14–20.
[60] G. Zheng, T. Nguyen, S. Gutiérrez Brida, G. Regis, N. Aguirre, M.F. Frias, H. Bagheri, ATR: Template-Based Repair for Alloy Specifications, ISSTA 2022, pp. 666–677.
[61] G. Zheng, T. Nguyen, S. Gutiérrez Brida, G. Regis, M.F. Frias, N. Aguirre, H. Bagheri, FLACK: Counterexample-Guided Fault Localization for Alloy Models, ICSE 2021, pp. 637–648.