An Empirical Study on the Suitability of Test-based Patch Acceptance
Criteria
LUCIANO ZEMIN, Instituto Tecnológico de Buenos Aires (ITBA), Argentina
SIMÓN GUTIÉRREZ BRIDA, Dept. of Computer Science, National University of Río Cuarto, Argentina
ARIEL GODIO, Instituto Tecnológico de Buenos Aires (ITBA), Argentina
CÉSAR CORNEJO, National Council for Scientific and Technical Research (CONICET) and Dept. of Computer
Science, National University of Río Cuarto, Argentina
RENZO DEGIOVANNI, Luxembourg Institute of Science and Technology, Luxembourg
GERMÁN REGIS, Dept. of Computer Science, National University of Río Cuarto, Argentina
NAZARENO AGUIRRE, National Council for Scientific and Technical Research (CONICET) and Dept. of
Computer Science, National University of Río Cuarto, Argentina
MARCELO FRIAS, The University of Texas at El Paso, United States
In this article, we empirically study the suitability of tests as acceptance criteria for automated program fixes, by checking
patches produced by automated repair tools using a bug-finding tool, as opposed to previous works that used tests or manual
inspections. We develop a number of experiments in which faulty programs from IntroClass, a known benchmark for program
repair techniques, are fed to the program repair tools GenProg, Angelix, AutoFix and Nopol, using test suites of varying
quality, including those accompanying the benchmark. We then check the produced patches against formal specifications
using a bug-finding tool. Our results show that, in the studied scenarios, automated program repair tools are significantly
more likely to accept a spurious program fix than to produce an actual one. Using bounded-exhaustive suites larger than the
originally given ones (with about 100 and 1,000 tests) we verify that overfitting is reduced but a) few new correct repairs
are generated and b) some tools see their performance reduced by the larger suites and fewer correct repairs are produced.
Finally, by comparing with previous work, we show that overfitting is underestimated in semantics-based tools and that
patches not discarded using held-out tests may be discarded using a bug-finding tool.
CCS Concepts: • Software and its engineering → Software testing and debugging; Formal software verification;
Empirical software validation.
Additional Key Words and Phrases: automatic program repair, formal specifications, testing, oracle
Authors’ addresses: Luciano Zemin, Instituto Tecnológico de Buenos Aires (ITBA), Buenos Aires, Argentina; Simón Gutiérrez Brida, Dept. of
Computer Science, National University of Río Cuarto, Río Cuarto, Argentina; Ariel Godio, Instituto Tecnológico de Buenos Aires (ITBA),
Buenos Aires, Argentina; César Cornejo, National Council for Scientific and Technical Research (CONICET) and Dept. of Computer Science,
National University of Río Cuarto, Río Cuarto, Argentina; Renzo Degiovanni, Luxembourg Institute of Science and Technology, Esch-sur-
Alzette, Luxembourg; Germán Regis, Dept. of Computer Science, National University of Río Cuarto, Río Cuarto, Argentina; Nazareno Aguirre,
National Council for Scientific and Technical Research (CONICET) and Dept. of Computer Science, National University of Río Cuarto, Río
Cuarto, Argentina; Marcelo Frias, The University of Texas at El Paso, El Paso, United States.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that
copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page.
Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy
otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from
permissions@acm.org.
© 2024 Copyright held by the owner/author(s).
ACM 1557-7392/2024/11-ART
https://doi.org/10.1145/3702971
1 INTRODUCTION
Soware has become ubiquitous, and many of our activities depend directly or indirectly on it. Having adequate
soware development techniques and methodologies that contribute to producing quality soware systems has
become essential for many human activities. e signicant advances in automated analysis techniques have
led, in the last few decades, to the development of powerful tools able to assist soware engineers in soware
development, that have proved to greatly contribute to soware quality. Indeed, tools based on model checking [
8
],
constraint solving [
42
], evolutionary computation [
11
] and other automated approaches, are being successfully
applied to various aspects of soware development, from requirements specication [
3
,
15
] to verication [
23
]
and bug nding [
16
,
21
]. Despite the great eort that is put in soware development to detect soware problems
(wrong requirements, decient specications, design aws, implementation errors, etc.), e.g., through the use
of the above mentioned techniques, many bugs reach and make it through the deployment phases. is makes
eective soware maintenance greatly relevant to the quality of the soware that is produced, and since soware
maintenance takes a signicant part of the resources of most soware projects, also economically relevant to the
soware development industry. us, the traditional emphasis of soware analysis techniques, that concentrated
in detecting the existence of defects in soware and specications, has recently started to broaden up to be
applied to automatically repair soware [1, 10, 29, 32, 53].
While the idea of automatic program repair (APR) is certainly appealing, automatically fixing arbitrary program defects is known to be infeasible. In the worst case it reduces to program synthesis, which is known to be undecidable for Turing-complete programming languages [7]. Thus, the various techniques that have been proposed to automatically repair programs are intrinsically incomplete, in various respects. Firstly, many techniques for automatically repairing programs need to produce repair candidates, often consisting of syntactic modifications of the original (known to be faulty) program. Clearly, not even a bounded space of program repair candidates can be exhaustively considered, and thus the space of repairs to consider needs to be somehow limited. Secondly, for every repair candidate, checking whether the produced candidate indeed constitutes a repair is an undecidable problem on its own, and solving it fully automatically is then, also, necessarily incomplete. Moreover, this latter problem requires a description of the expected behavior of the program to be fixed, i.e., a specification, subject to automated analysis, if one wants the whole repair process to remain automatic. Producing such specifications is costly, and therefore requiring these specifications is believed to undermine the applicability of automatic repair approaches. Most automated repair techniques then use partial specifications, given in terms of a validation test suite. Moreover, most techniques heavily rely on these tests as part of the technique itself, e.g., for fault localization [53].
There is a risk in using tests as specifications since, as is well known, their incompleteness makes it possible to obtain spurious repairs, i.e., programs that seem to solve the problems of the faulty code but are incorrect, even though the validation suite is not able to expose such incorrectness. Nevertheless, various tools report significant success in repairing code using tests as criteria for accepting patches [18]. More recently, various researchers have observed that automatically produced patches are likely to overfit the test suites used for their validation, leading tools to produce invalid program fixes [34, 47]. Then, further checks have been performed to analyze more precisely the quality of automatically produced patches, and consequently the ability of automated program repair techniques to produce actual fixes. However, these further checks have usually been performed through manual inspection, or using extended alternative test suites, leaving room for still undetected flaws.
In this paper, we empirically study the suitability of tests as acceptance criteria for automated program fixes, by checking patches produced by automated repair tools using a bug-finding tool, as opposed to previous works that used tests and/or manual inspections. We develop a number of experiments using IntroClass, a known benchmark for program repair techniques, consisting of small programs solving simple assignments. Faulty programs from this benchmark are used to feed four state-of-the-art program repair tools, using test suites of varying quality
and extension, including those accompanying the benchmark. Produced patches are then complemented with corresponding formal specifications, given as pre- and post-conditions, and checked using Pex [49], an automated test generation tool based on concrete/symbolic execution and constraint solving, that attempts to exhaustively cover bounded symbolic paths of the patches. Our results show that:
• In general, automated program repair tools are significantly more likely to accept a spurious program fix than to produce an actual one, in the studied scenarios.
• By improving the quality of the test suite, extending it to a sort of bounded-exhaustive suite (whose size is bounded to approximately 100 or 1,000 tests), we show that a few more correct fixes are obtained in those cases where the tool under analysis is able to cope with the suite size.
• Finally, we show that overfitting is more likely to occur in semantics-based tools than previously reported in [56]. The use of the bug-finding tool allows us to detect overfitting patches that remain undetected using held-out tests, i.e., tests that are not used during the patch generation process but are instead kept aside for the verification of the produced patches.
Notice that using IntroClass, a benchmark built from simple small programs (usually not exceeding 30 lines of code), is not a limitation of our analysis. If state-of-the-art tools fail to distinguish correct fixes from spurious ones on this benchmark, we should not expect them to perform better on larger, or more complex, benchmarks. If anything, using IntroClass makes the analysis more conclusive.
This paper reports research that extends previous work reported in [59]. It poses and answers research questions that were not part of [59]. A deeper discussion of the relationship between these two works is presented in Section 5.
The paper is organized as follows. After this Introduction, in Section 2 we introduce automated program repair and the overfitting problem. In Section 3 we evaluate 4 tools for automated program repair, namely, Angelix [37], AutoFix [52], GenProg [18] and Nopol [55]. We present and answer the research questions. In Section 4 we discuss the threats to the validity of the results presented in the paper. In Section 5 we discuss related work. In Section 6 we discuss the results we obtain in the paper, to finish in Section 7 with some conclusions and proposals for further work.
2 AUTOMATED PROGRAM REPAIR
Automated program repair techniques aim at fixing faulty programs through the application of transformations that modify the program's code. Generation and Validation (G&V) techniques for automated program repair receive a faulty program to repair and a specification of the program's expected behavior, and attempt to generate a patch, through the application of syntactic transformations on the original program, that satisfies the provided specification [1]. Different techniques and tools have been devised for automated program repair, which can be distinguished on various aspects such as the programming language or kind of system they apply to, the syntactic modifications that can be applied to programs (or, similarly, the fault model a tool aims to repair), the process to produce the fix candidates or program patches, how program specifications are captured (and how these are contrasted against fix candidates), and how the explosion of fix candidates is tamed.
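As a rough illustration of this generate-and-validate scheme, the following Java-style sketch shows the basic loop; the types and helper names (Program, Test, mutate, passesAll) are hypothetical placeholders and do not correspond to the API of any of the tools evaluated later.

```java
import java.util.List;
import java.util.Optional;

// Minimal sketch of a Generation & Validation repair loop.
// Program, Test and the helper methods are hypothetical placeholders.
public class GenerateAndValidate {

    public static Optional<Program> repair(Program faulty, List<Test> suite, int budget) {
        for (int i = 0; i < budget; i++) {
            // 1. Generation: apply some syntactic transformation to the faulty program
            //    (statement replacement, condition mutation, etc.).
            Program candidate = mutate(faulty);

            // 2. Validation: the candidate is accepted as soon as it passes the whole
            //    validation suite -- exactly the point where overfitting can occur.
            if (passesAll(candidate, suite)) {
                return Optional.of(candidate);
            }
        }
        return Optional.empty(); // no patch found within the budget
    }

    // Placeholders standing in for tool-specific machinery.
    static Program mutate(Program p) { throw new UnsupportedOperationException(); }
    static boolean passesAll(Program p, List<Test> suite) { throw new UnsupportedOperationException(); }

    interface Program {}
    interface Test {}
}
```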
From the point of view of the technology underlying the repair generation techniques, a wide spectrum of approaches exist, including search-based approaches such as those based on evolutionary computation [1, 52, 53], constraint-based automated analyses such as those based on SMT and SAT solving [17, 37], and model checking and automated synthesis [48, 51]. A crucial aspect of program repair that concerns this paper is how program specifications are captured and provided to the tools. Some approaches, notably some of the initial ones (e.g., [1, 48]), require formal specifications in the form of pre- and post-conditions, or logical descriptions provided in some suitable logical formalism. These approaches then vary in the way they use these specifications to assess repair candidates; some check repair candidates against specifications using some automated verification
technique [48]; some use the specifications to produce tests, which are then used to drive the patch generation and patch assessment activities [52]. Moreover, some of these tools and techniques require strong specifications, capturing the, say, "full" expected behavior of the program [1, 48], while others use contracts present in code, that developers have previously written mainly for runtime checking [52].
Many of the latest mainstream approaches, however, use tests as specifications. These approaches relieve techniques from the requirement of providing a formal specification accompanying the faulty program, arguing that such specifications are costly to produce and are seldom found in software projects. Tests, on the other hand, are significantly more commonly used as part of development processes, and thus requiring tests is clearly less demanding. Weimer et al. mention in [54], for instance, that by requiring tests instead of formal specifications one greatly improves the practical applicability of their proposed technique. Kaleeswaran et al. [25] also mention that approaches requiring specifications suppose the existence of such specifications, a situation that is rarely true in practice; they also acknowledge the limitations of tests as specifications for repairs, and aim at a less ambitious goal than fully automatically repairing code, namely, to generate repair hints.
The partial nature of tests as specifications immediately leads to validity issues regarding the fixes provided by automated program repair tools, since a program patch may be accepted because it passes all tests in the validation suite, yet still not be a true program fix (there might still be other test cases, not present in the validation suite, for which the program patch fails). This problem, known as overfitting [47], has been previously identified by various researchers [34, 47, 56], and several tools are known to produce spurious patches as a result of their program repair techniques. This problem is handled differently by different techniques. Some give up the challenge of producing fixes and aim at producing hints (e.g., the already mentioned [25]). Others take into account a notion of quality, and manually compare the produced patches with fixes provided by human developers or by other tools [34, 36]. Notice that, even after manual inspection, subtle defects may still be present in the repairs, thus leading to accepting a fix that is invalid. We partly study this issue in this paper.
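To make the overfitting problem concrete, consider the following hypothetical Java example in the spirit of the IntroClass median problem (the code and tests are illustrative and are not taken from the benchmark): the routine passes every test in a small validation suite, so it would be accepted as a patch, yet it is not a fix.

```java
// Hypothetical illustration of an overfitting patch (not taken from the IntroClass dataset).
public class MedianOverfitting {

    // Tool-produced "patch": it passes the three validation tests below,
    // yet it is not a fix, since it is wrong e.g. on the input (1, 3, 2).
    static int medianPatched(int a, int b, int c) {
        if (a <= b && b <= c) return b;
        return a; // spurious fallback that the validation suite never exercises
    }

    public static void main(String[] args) {
        // Validation suite T: every test passes, so a test-based criterion accepts the patch.
        System.out.println(medianPatched(1, 2, 3) == 2); // true
        System.out.println(medianPatched(0, 5, 9) == 5); // true
        System.out.println(medianPatched(3, 3, 3) == 3); // true

        // A held-out input exposing the overfitting: the median of (1, 3, 2) is 2, not 1.
        System.out.println(medianPatched(1, 3, 2));      // prints 1
    }
}
```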
3 EVALUATION
In this section we evaluate Angelix, AutoFix, GenProg and Nopol, 4 well-regarded tools that use tests as their patch acceptance criterion. The evaluation is performed on the IntroClass dataset, which is described in detail in Section 3.1. The dataset contains student-developed solutions for 6 simple problems. The correctness of the students' solutions (which usually take under 30 LOC) can be evaluated using instructor-prepared test suites. Each of the provided solutions is faulty: at least one test in the corresponding suite fails.
The tools were selected because they use different underlying techniques for patch generation, and also due to the existence of mature-enough implementations that would allow us to carry out the experiments. Angelix [37] collects semantic information from controlled symbolic executions of the program and uses MaxSMT [43] to synthesize a patch. AutoFix [52] is intended for the repair of Eiffel [39] programs. It is contract-based, and relies on AutoTest [40] for automated test generation. GenProg [18] uses genetic algorithms to search for a program variant that retains correct functionality while also passing the previously failing tests. Finally, Nopol [55] collects program execution information and uses satisfiability modulo theories (SMT) [2] to generate patches. Unlike Angelix, Nopol focuses on the repair of buggy conditional statements.
Since our aim is to evaluate the suitability of test-based patch acceptance criteria, we will introduce some terminology that will help us better understand the following sections. Given a faulty routine m, and a test suite T employed as an acceptance criterion for automated program repair, a tool-synthesized version m′ of m that passes all tests in T is called a patch. A patch may overfit and be correct with respect to the provided suite, yet be faulty with respect to a different suite, or more precisely, with respect to its actual expected program behavior. We may then have correct and incorrect patches; a correct patch, i.e., one that meets the program's expected behavior, will be called a fix. This gives rise to our first research question.
RQ1: When applying a given program repair tool/technique to a faulty program, how likely is the tool/technique to provide a patch, and if a patch is found, how likely is it to be a fix?
Patch correctness is typically determined by manual inspection. Since manual inspections are error-prone (in fact, the faulty routines that constitute the IntroClass dataset were all manually inspected by their corresponding developers, yet they are faulty), we will resort to automated verification of patches, in order to determine if they are indeed fixes. We will use concrete/symbolic execution combined with constraint solving to automatically verify produced patches against their corresponding specifications, captured as contracts [38]. More precisely, we will translate patches into C#, and equip these with pre- and post-conditions captured using Code Contracts [14]; we will then search for inputs that violate these assertions via concrete/symbolic execution and SMT solving, using the Pex tool [49]. Finally, to prevent any error introduced by the above described process, we run the corresponding test generated by Pex on the original patched method, to check that it actually fails.
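The following Java sketch illustrates the shape of this check on the median example; it is only a simplified stand-in, since the actual experiments use C# Code Contracts and Pex's dynamic symbolic execution rather than the plain predicate and brute-force input sweep shown here, and the class and method names are hypothetical.

```java
// Simplified stand-in for the verification step: the actual pipeline translates
// patches to C#, adds Code Contracts pre/post-conditions, and runs Pex
// (dynamic symbolic execution + SMT). Here the contract is a plain predicate
// and the counterexample search is a brute-force sweep of a small domain.
public class MedianContractCheck {

    // The overfitting patch from the previous sketch, taken as the patch under check.
    static int patchUnderCheck(int a, int b, int c) {
        if (a <= b && b <= c) return b;
        return a;
    }

    // Post-condition for median: the result is one of the inputs, with at least
    // two inputs <= result and at least two inputs >= result. No pre-condition is needed.
    static boolean postMedian(int a, int b, int c, int result) {
        boolean isInput = result == a || result == b || result == c;
        int le = 0, ge = 0;
        for (int x : new int[] { a, b, c }) {
            if (x <= result) le++;
            if (x >= result) ge++;
        }
        return isInput && le >= 2 && ge >= 2;
    }

    public static void main(String[] args) {
        for (int a = -3; a <= 3; a++)
            for (int b = -3; b <= 3; b++)
                for (int c = -3; c <= 3; c++) {
                    int r = patchUnderCheck(a, b, c);
                    if (!postMedian(a, b, c, r)) {
                        System.out.printf("Counterexample: median(%d,%d,%d) = %d%n", a, b, c, r);
                        return; // the patch is discarded: it is not a fix
                    }
                }
        System.out.println("No contract violation found in the bounded domain.");
    }
}
```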
To assess the above research question, we need to run automatic repair tools on faulty programs. As we mentioned, we consider the IntroClass dataset, so whatever conclusion we obtain will, in principle, be tied to this specific dataset and its characteristics (we further discuss this issue in the threats to validity Section 4). By focusing on this dataset, we will definitely get more certainty regarding the following issues:
• Overfitting produced by repair tools on the IntroClass dataset, and
• Experimental data on the limitations of manual inspections in the context of automated program repair (especially because this benchmark has been used previously to evaluate various program repair tools).
We will show, with the aid of the bug-finding tool, that patches that were hinted as correct in [56] are not.
Notice that when a patch is produced but is not a fix, one may rightfully consider the problem to lie in the quality of the test suite used for patch generation, and not necessarily in a limitation of test-based acceptance criteria as a whole, or of the program fixing technique in particular: by providing more/better tests one may prevent the acceptance of incorrect patches. That is, overfitting may be considered a limitation of the particular test suites rather than a limitation of test-based acceptance criteria. To take this issue into account, for instance, [47, 56] enrich the test suites provided with the benchmark with white-box tests that guarantee branch coverage of a correct variant of the buggy programs. Then, as shown in [47, Fig. 3], between 40% and 50% of the patches that are produced with the original suite are discarded when the additional white-box suite ensuring branch coverage is considered. Yet the analysis does not address the following two issues:
• Are the patches passing the additional white-box tests indeed fixes? And, equally important,
• Would the tool reject more patches by choosing larger suites?
This leads to our second research question:
RQ2: How does overfitting relate to the thoroughness of the validation test suites, in program repair techniques?
Thoroughness can be defined in many ways, typically via testing criteria. Given the vast amount of testing criteria, an exhaustive analysis, with quality suites according to many of these criteria, is infeasible. Our approach is to enrich the validation test suites, those provided with the dataset, by adding bounded-exhaustive suites [4] for different bounds. The rationale here is to attempt to be as thorough as possible, to avoid overfitting. For each case study, we obtain suites with approximately 100 tests, and with approximately 1,000 tests (with two different bounds), for each routine. These suites can then be assessed according to measures for different testing criteria. Notice that, as the size of test suites is increased, some tools and techniques may see their performance affected. This leads to our third research question:
RQ3: How does test suite size affect the performance of test-based automated repair tools?
As we mentioned earlier in this section, patches are classified as correct (i.e., fixes) or not using either manual inspections or, as in [56], held-out tests. We consider these to be error-prone procedures to assess the correctness of a patch. In [56], overfitting of the Angelix APR tool is analyzed over the IntroClass dataset. Held-out tests are used to determine non-overfitting patches. Our fourth research question is then stated as:
RQ4: Can we produce substantial evidence of the fact that held-out tests underestimate the number of overfitting patches in semantics-based APR tools such as Angelix?
The remaining part of this section is organized as follows. Section 3.1 describes the IntroClass dataset. Section 3.2 describes the experimental setup we used. Section 3.3 describes the reproducibility package we are providing in order to guarantee reproducibility of the performed experiments. Section 3.4 motivates our use of bounded-exhaustive suites. Finally, Section 3.5 presents the evaluations performed, and discusses research questions RQ1–RQ4.
3.1 The IntroClass Dataset
The IntroClass benchmark is thoroughly discussed in [31]. It contains student-developed C programs for solving 6 simple problems (that we describe below) as well as instructor-provided test suites to assess their correctness. IntroClass has been used to evaluate a number of automated repair tools [27, 46, 47, 56], and its simplicity reduces the requirements on tool scalability. The benchmark comprises methods to solve the following problems:
• Checksum: Given an input string S = c0 . . . ck, this method computes a checksum character c following the formula c = (Σ_{0 ≤ i < S.length()} S.charAt(i)) % 64 + ' '.
• Digits: Convert an input integer number into a string holding the number's digits in reverse order.
• Grade: Receives 5 floats f1, f2, f3, f4 and score as inputs. The first four are given in decreasing order (f1 > f2 > f3 > f4). These 4 values induce 5 intervals (read in descending order): (+∞, f1], (f1, f2], (f2, f3], (f3, f4], and (f4, −∞). A grade A, B, C, D or F is returned according to the interval score belongs to (a reference rendering of this specification is sketched right after this list).
• Median: Compute the median among 3 integer input values.
• Smallest: Compute the smallest value among 4 integer input values.
• Syllables: Compute the number of syllables into which an input string can be split according to English grammar (vowels 'a', 'e', 'i', 'o' and 'u', as well as the character 'y', are considered as syllable dividers).
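As a point of reference for the grade specification above, and assuming that the topmost interval (scores of at least f1) maps to grade A and the last one to F, a straightforward rendering of the intended behavior in Java could look as follows; this is only an illustrative oracle, not code taken from the benchmark.

```java
// Illustrative reference oracle for the grade problem (not benchmark code).
// Assumes f1 > f2 > f3 > f4 and the interval-to-grade mapping described above.
public class GradeOracle {

    static char grade(double f1, double f2, double f3, double f4, double score) {
        if (score >= f1) return 'A';   // [f1, +inf)
        if (score >= f2) return 'B';   // [f2, f1)
        if (score >= f3) return 'C';   // [f3, f2)
        if (score >= f4) return 'D';   // [f4, f3)
        return 'F';                    // (-inf, f4)
    }

    public static void main(String[] args) {
        System.out.println(grade(90, 80, 70, 60, 85)); // B
        System.out.println(grade(90, 80, 70, 60, 55)); // F
    }
}
```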
There are two versions of the dataset, the original one described in [31], whose methods are given in the C language, and a Java translation of the original dataset described in [12]. Some of the programs that result from the translation from C to Java were not syntactically correct and consequently did not compile. Other programs saw significant changes in their behavior. Interestingly, for some programs, the transformation itself repaired the bug (the C program fails on some inputs, but the Java version is correct). The latter situation is mostly due to the different behavior of non-initialized variables in C versus Java [12]. These abnormal cases were removed from the resulting Java dataset, which thus has fewer methods than the C one.
Because of the automated program repair tools that we evaluate, which include AutoFix and Angelix, we need to consider yet other versions of the IntroClass dataset.
• IntroClass Eiffel: This new version is the result of translating the original C dataset into Eiffel. For the translation, we employed the C2Eiffel tool [50]; moreover, since AutoFix requires contracts for program fixing, we turned the original IntroClass programs, which received inputs and produced outputs from/to standard input/output, into programs that receive inputs as parameters and produce outputs as return values. We equipped the resulting programs with the correct contracts for
pre- and post-conditions of each case study. As in the translation from C to Java, several faulty programs became "correct" as a result of the translation. These cases have to do with default values for variables, as for Java, and with how input is required and output is produced; for instance, faulty cases that reported output values with accompanying messages in lowercase, when they were expected to be uppercase, are disregarded, since in the Eiffel-translated programs outputs are produced as return values.
• IntroClass Angelix: In order to run experiments with Angelix, the source code of each variant has to be instrumented and adapted to include calls to some macro functions, and to return the output in a single integer or char. Since in several variants the errors consist of modifications of the input/output strings, which are stripped out by the instrumentation, the faulty versions became "correct".
Since IntroClass consists not only of different students' implementations but also of different commits/versions of the implementation of each student, in several cases the instrumentation resulted in duplicate files, which we removed using the diff tool to reduce bias (notice that the IntroClass version used in [26] and reported in Fig. 1 does not eliminate duplicates and contains 208 extra datapoints). Table 1 describes the four datasets and, for each dataset, the number of faulty versions for each method. The sizes of their corresponding test suites are only relevant for C and Java since, in the case of AutoFix, tests are automatically produced using AutoTest [33] (the tool does not receive user-provided test suites).
                             checksum   digits   grade   median   smallest   syllables   Total
IntroClass C (GenProg)       46         143      136     98       84         63          570
IntroClass C (Angelix)       31         149      36      58       41         44          359
IntroClass Java (Nopol)      11         75       89      57       52         13          297
IntroClass Eiffel (AutoFix)  45         141      72      86       56         77          477
Test suite size              16         16       18      13       16         16          95
Table 1. Description of the IntroClass C, Java and Eiffel datasets.
3.2 Experimental Setup
In this section we describe the software and hardware infrastructure we employed to run the experiments whose results we report in Section 3.5. We also describe the criteria used to generate the bounded-exhaustive test suites, as well as the automated repair tools we evaluate and their configurations.
In order to evaluate the subjects from the IntroClass dataset we consider, besides the instructor-provided suite delivered within the dataset, two new suites. For those programs for which it is feasible, we consider bounded-exhaustive suites. Bounded-exhaustive suites contain all the inputs that can be generated within user-provided bounds. We use such suites for programs digits, median, smallest and syllables. Program grade uses floats and, therefore, even small bounds would produce suites that are too big; hence, for program grade we use tests that are not bounded-exhaustive, but that are part of a bounded-exhaustive suite. We chose bounds so that the resulting suites have approximately 100 tests and 1,000 tests for each method under analysis. This gives rise to two new suites that we call S100 and S1,000, whose test inputs for each problem are characterized below in Tables 2 and 3. Notice that from these inputs, actual tests are built using reference implementations of the methods under repair as an oracle. Notice also that all the tests in S100 also belong to S1,000, i.e., S1,000 extends S100 in all cases.
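To illustrate how such suites can be obtained (the concrete generation scripts in the reproducibility package may differ in form, and the class and method names below are illustrative), the following Java sketch enumerates all median inputs within the S100 bounds of Table 2, every value in [−2, 2], and derives expected outputs from a reference implementation used as the oracle.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of bounded-exhaustive test generation for median with the S100 bounds
// of Table 2 (every input in [-2, 2]): 5^3 = 125 inputs, with a reference
// implementation acting as the oracle.
public class BoundedExhaustiveMedian {

    // Reference (assumed-correct) implementation used as the oracle.
    static int medianRef(int a, int b, int c) {
        return Math.max(Math.min(a, b), Math.min(Math.max(a, b), c));
    }

    // Each test is encoded as {a, b, c, expected}.
    static List<int[]> generate(int lo, int hi) {
        List<int[]> suite = new ArrayList<>();
        for (int a = lo; a <= hi; a++)
            for (int b = lo; b <= hi; b++)
                for (int c = lo; c <= hi; c++)
                    suite.add(new int[] { a, b, c, medianRef(a, b, c) });
        return suite;
    }

    public static void main(String[] args) {
        System.out.println(generate(-2, 2).size()); // 125 tests, matching Table 2
    }
}
```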
For the experiments we report in this section, we used a workstation with an Intel Core i7 2600, 3.40 GHz, and 8 GB of RAM running Ubuntu 16.04 LTS x86_64. The experiments involving Pex were performed on a virtual machine (VirtualBox) running a fresh install of Windows 7 SP1. The specific versions of the software used in the experiments, including those of the APR tools, can be found in the reproducibility package.
Test inputs specification                                                                         Total
S100_checksum  = { c0 . . . ck | 0 ≤ k ≤ 4 ∧ ∀ 0 ≤ i ≤ k : ci ∈ {'a','b','c'} }                   120
S100_digits    = { k | −64 ≤ k ≤ 63 }                                                             128
S100_grade     = { (f1, . . . , f4, score) | (∀ 1 ≤ i ≤ 4 : fi ∈ {30,40,50,60,70,80})
                   ∧ (f1 > f2 > f3 > f4) ∧ score ∈ {5,10,15,20, . . . , 90} }                     285
S100_median    = { (k1, k2, k3) | ∀ 1 ≤ i ≤ 3 : −2 ≤ ki ≤ 2 }                                     125
S100_smallest  = { (k1, k2, k3, k4) | ∀ 1 ≤ i ≤ 4 : −2 ≤ ki ≤ 1 }                                 256
S100_syllables = { c0 . . . ck | 0 ≤ k < 4 ∧ ∀ 0 ≤ i ≤ k : ci ∈ {'a','b','c'} }                   120
Table 2. Specification of test suites S100
Test inputs specification                                                                         Total
S1,000_checksum  = { c0 . . . ck | 0 ≤ k ≤ 5 ∧ ∀ 0 ≤ i ≤ k : ci ∈ {'a','b','c','e'} }             1,364
S1,000_digits    = { k | −512 ≤ k ≤ 511 }                                                         1,024
S1,000_grade     = { (f1, . . . , f4, score) | (∀ 1 ≤ i ≤ 4 : fi ∈ {10,20,30,40,50,60,70,80,90})
                     ∧ (f1 > f2 > f3 > f4) ∧ score ∈ {0,5,10,15,20, . . . , 100} }                2,646
S1,000_median    = { (k1, k2, k3) | ∀ 1 ≤ i ≤ 3 : −5 ≤ ki ≤ 4 }                                   1,000
S1,000_smallest  = { (k1, k2, k3, k4) | ∀ 1 ≤ i ≤ 4 : −3 ≤ ki ≤ 2 }                               1,296
S1,000_syllables = { c0 . . . ck | 0 ≤ k < 5 ∧ ∀ 0 ≤ i ≤ k : ci ∈ {'a','b','c','e'} }             1,364
Table 3. Specification of test suites S1,000
In general, finding an appropriate timeout depends on the context in which the tool is being used. Perhaps for mission-critical applications a large timeout is more appropriate, while for other domains it may be too expensive to devote even one hour to (most probably failed) repair attempts. We set a two hour timeout: enough for the tools to run, and at the same time reasonable with respect to the time required to run all the experiments.
3.3 Reproducibility
The empirical study we present in this paper involves a large set of different experiments. These involve 4 different datasets (versions of IntroClass, as described in the previous section), configurations for 4 different repair tools across 3 different languages, and 3 different sets of tests for the tools that receive test suites. Also, all case studies have been equipped with contracts, translated into C# and verified using Pex. We make all these elements available for the interested reader to reproduce our experiments at
https://sites.google.com/a/dc.exa.unrc.edu.ar/test-specs-program-repair/
Instructions to reproduce each experiment are provided therein.
3.4 Why Use Bounded-Exhaustive Suites
Research questions 2 and 3 discuss the impact of using alternative test suites (alternative w.r.t. the test suites used as specifications of the programs being developed) on overfitting and tool scalability. It is expected that these held-out tests will allow one to expose those patches generated by an APR tool that overfit to the specification suite. The first problem, namely, the impact on overfitting of considering alternative suites, has already been discussed in the literature by using, besides the original suite, new white-box suites automatically generated in order to satisfy a coverage criterion (usually, branch coverage) [26, 47, 56]. Using white-box suites as held-out tests is rarely
useful in software development. They are obtained from a (usually non-existent) correct implementation whose complexity usually differs greatly from the student-developed version, losing both the property of the suite being white-box and the satisfaction of the coverage criterion. Also, they may not be comprehensive enough. For example, for the IntroClass dataset, the white-box suites used in [56] have between 8 and 10 tests, and keeping portions of such a suite to check the impact of larger/smaller suites on assessing scalability seems at least risky. As an alternative to using small white-box suites, which suffer from the mentioned methodological limitations, we propose the use of bounded-exhaustive suites whenever possible, or similar ones when bounded-exhaustiveness is not an option. They can easily be made to grow in size as much as necessary, and they capture a behavior similar to formal specifications on a fragment of the program domain. We will explore alternatives to bounded-exhaustive suites with similar characteristics in future work.
3.5 Experimental Results
In this section we present the evaluation of each of the repair tools on the generated suites, and from the collected data we discuss research questions RQ1–RQ4 in Sections 3.6–3.9. Tables 4–7 summarize the experimental data. In these tables, we report a patch/fix if any repair mode/configuration or execution produced a patch/fix. For example, we considered ten executions of GenProg due to its random behavior; a single execution finding a patch suffices for the patch to be reported.
Method      #Versions   Suite        #Patches   #Fixes   %Patches   %Fixes
checksum    11          O            0          0        0%         0%
                        O ∪ S100     0          0        0%         0%
                        O ∪ S1,000   0          0        0%         0%
digits      75          O            7          1        9.3%       1.3%
                        O ∪ S100     2          1        2.7%       1.3%
                        O ∪ S1,000   2          1        2.7%       1.3%
grade       89          O            2          1        2.2%       1.1%
                        O ∪ S100     2          1        2.2%       1.1%
                        O ∪ S1,000   2          1        2.2%       1.1%
median      57          O            11         4        19.3%      7%
                        O ∪ S100     4          4        7%         7%
                        O ∪ S1,000   4          4        7%         7%
smallest    52          O            12         0        23.1%      0%
                        O ∪ S100     0          0        0%         0%
                        O ∪ S1,000   0          0        0%         0%
syllables   13          O            0          0        0%         0%
                        O ∪ S100     0          0        0%         0%
                        O ∪ S1,000   0          0        0%         0%
Table 4. Repair statistics for the Nopol automated repair tool.
3.6 Research Question 1
This research question addresses overfitting, a well-known limitation of Generation and Validation automatic program repair approaches that use test suites as the validation mechanism. The use of patch validation techniques based on human inspections or comparisons with developer patches (or even accepting patches as fixes without
Method      #Versions   Suite        #Patches   #Fixes   %Patches   %Fixes
checksum    46          O            3          2        6.5%       4.3%
                        O ∪ S100     23         0        50%        0%
                        O ∪ S1,000   22         1        47.8%      2.2%
digits      143         O            29         12       20.3%      8.4%
                        O ∪ S100     17         10       11.9%      7%
                        O ∪ S1,000   13         7        9.1%       4.9%
grade       136         O            2          0        1.5%       0%
                        O ∪ S100     50         0        36.8%      0%
                        O ∪ S1,000   23         0        16.9%      0%
median      98          O            37         13       37.8%      13.3%
                        O ∪ S100     67         0        68.4%      0%
                        O ∪ S1,000   51         3        52%        3.1%
smallest    84          O            37         3        44%        3.6%
                        O ∪ S100     61         0        72.6%      0%
                        O ∪ S1,000   47         0        56%        0%
syllables   63          O            19         0        30.2%      0%
                        O ∪ S100     38         0        60.3%      0%
                        O ∪ S1,000   37         0        58.7%      0%
Table 5. Repair statistics for the GenProg automatic repair tool.
Method      #Versions   Suite        #Patches   #Fixes   %Patches   %Fixes
checksum    45          O            1          0        2.2%       0%
                        O ∪ S100     1          0        2.2%       0%
                        O ∪ S1,000   1          0        2.2%       0%
digits      141         O            0          0        0%         0%
                        O ∪ S100     0          0        0%         0%
                        O ∪ S1,000   0          0        0%         0%
grade       72          O            0          0        0%         0%
                        O ∪ S100     0          0        0%         0%
                        O ∪ S1,000   0          0        0%         0%
median      86          O            0          0        0%         0%
                        O ∪ S100     0          0        0%         0%
                        O ∪ S1,000   0          0        0%         0%
smallest    56          O            0          0        0%         0%
                        O ∪ S100     0          0        0%         0%
                        O ∪ S1,000   0          0        0%         0%
syllables   77          O            0          0        0%         0%
                        O ∪ S100     0          0        0%         0%
                        O ∪ S1,000   0          0        0%         0%
Table 6. Repair statistics for the AutoFix automatic repair tool.
Method      #Versions   Suite        #Patches   #Fixes   %Patches   %Fixes
checksum    31          O            0          0        0%         0%
                        O ∪ S100     0          0        0%         0%
                        O ∪ S1,000   0          0        0%         0%
digits      149         O            20         5        13.4%      3.4%
                        O ∪ S100     13         10       8.7%       6.7%
                        O ∪ S1,000   8          4        5.4%       2.7%
grade       36          O            7          7        19.4%      19.4%
                        O ∪ S100     7          7        19.4%      19.4%
                        O ∪ S1,000   5          5        13.9%      13.9%
median      58          O            29         10       50.0%      17.2%
                        O ∪ S100     15         15       25.9%      25.9%
                        O ∪ S1,000   6          6        10.3%      10.3%
smallest    41          O            33         1        80.5%      2.4%
                        O ∪ S100     1          1        2.4%       2.4%
                        O ∪ S1,000   1          1        2.4%       2.4%
syllables   44          O            0          0        0%         0%
                        O ∪ S100     0          0        0%         0%
                        O ∪ S1,000   0          0        0%         0%
Table 7. Repair statistics for the Angelix automatic repair tool.
further discussion), has not allowed the community to identify the whole extent of this problem. For example, paper [26] includes the table we reproduce in Fig. 1. The table gives the erroneous impression that 287 out of 778 bugs were fixed (36.8%).
Fig. 1. Performance of GenProg (and other tools) on the IntroClass dataset, as reported in [26].
The paper actually analyzes this in more detail and, by using independent test suites to validate the generated patches, claims that GenProg's patches pass 68.7% of independent tests, giving the non-expert reader the impression that the produced patches were of good quality while, in fact, it might be the case that none of
the patches is actually a fix. Actually, as our experiments reported in Table 5 show, only 30 out of 570 faults were correctly fixed (which gives a fixing ratio of 5.3%, well below the 36.8% presented in [26]).
We have obtained similar results for the other tools under analysis. Angelix patches 89 faults out of 360 program variants (a ratio of 24.7%), yet only 23 patches are fixes (the ratio reduces to 6.4%). The remaining patches were discarded with the aid of Pex. Nopol patched 32 out of 297 versions (10.7%), using the evaluation test suite. Upon verification with Pex, the number of fixes is 6 (2%). AutoFix uses contracts (which we provided) in order to automatically (and randomly) generate the evaluation suite. When a patch is produced, AutoFix validates the adequacy of the patch with a randomly generated suite. AutoFix then produced patches for the great majority of faulty routines, but itself showed that most of these were inadequate, and overall reported only 1 patch (which was an invalid fix).
As previously discussed at the beginning of Section 3, these unsatisfactory results might be due to the low quality of the validation test suite. Yet it is worth emphasizing that the IntroClass dataset was developed to be used in program repair, and the community has vouched for its quality by publishing the benchmark and using it in their research.
3.7 Research Question 2
This research question relates to the impact of more thorough validation suites on overfitting, as well as on the quality of the produced patches. Table 4 shows that Nopol profits from larger suites to reduce overfitting significantly. It suffices to consider suite O ∪ S100 to notice a reduction in overfitting: the number of patches is reduced from 32 to 8. Unfortunately, the number of fixes remains low. This shows that Nopol, when fed with a better quality evaluation suite, is able to produce (a few) good quality fixes. GenProg, on the other hand, shows an interesting behaviour (see Table 5): it doubles the number of patches with suite O ∪ S100, yet the number of fixes is reduced from 30 to 10. With suite O ∪ S1,000 it produces around 50% more patches, but the number of fixes is reduced from 30 to 11. We believe this is due to its random nature. Angelix (see Table 7) sees its overfitting reduced. Interestingly, suite O ∪ S100 allows Angelix to obtain 5 more fixes for method digits and 5 more for method median. Since AutoFix generates the evaluation suites, rather than providing larger suites we extend the test generation time. AutoFix does not have a good performance on this dataset.
3.8 Research Question 3
This research question relates to the impact larger validation suites may have on repair performance. For most of the evaluated tools this impact can be illustrated by the number of timeouts reached during the repair process. Table 8 reports, for each tool able to use suites S100 and S1,000, and for each validation suite, the number of timeouts reached during repair. We report a timeout only when a timeout occurs for all repair modes/configurations and executions. For example, we considered two repair modes (condition and precondition) for Nopol; depending on the suite, some of them reached a timeout, however, there is no faulty version for which all of them do. Recall that AutoFix generates its validation suites from user-provided contracts, and is therefore left out of this analysis. While Nopol's mechanism for data collection discards tests that are considered redundant (which, as shown in Table 8, makes Nopol resilient to suite size increments), GenProg and Angelix are both sensitive to the size of the evaluation suite.
3.9 Research Question 4
This research question addresses our intuition that using held-out tests, as a means to determine if a patch found by an APR tool is indeed correct, is an error-prone procedure. This procedure is widely used and accepted by the community [26, 30, 47, 56, 58]. In [56, Table 2], reproduced in Fig. 2, we see the overfitting reported for Angelix on the IntroClass benchmark using the originally provided black-box test suites. The table omits methods
             Nopol   GenProg   Angelix
O            0       2         5
O ∪ S100     0       135       47
O ∪ S1,000   0       237       69
Table 8. Number of timeouts reached by tool and validation suite.
[Reproduction of [56, Table 2]: baseline overfitting results, reported as overfitting patches out of total patches generated. On IntroClass, using black-box training tests, the rates per approach (Angelix, CVC4, Enum, SemFix) are: smallest 27/37, 33/39, 24/29, 36/45; median 29/38, 21/28, 21/27, 40/44; digits 5/6, 3/4, 3/3, 10/10. On Codeflaws, Angelix produces 44/81 overfitting patches.]
Fig. 2. Overfitting patches produced by Angelix as reported in [56, Table 2].
syllables and checksum, for which no patches were generated. Table 9 compares the number of fixes obtained in [56] (reproduced in Fig. 2) and in this article in Table 7, for those methods that are in the intersection of both studies, namely, median, smallest and digits.
           Fig. 2   Table 7
median     9        5
smallest   10       10
digits     1        1
Table 9. Comparing the number of reported fixes with [56].
While there is a discrepancy in the results reported in Table 9, this does not per se imply that an error has been made: methods have been instrumented, parameters may include subtle differences that lead to different results, etc. These discrepancies, nevertheless, called our attention and led to research question 4. Paper [56] reports a reproducibility package [57] that, in particular, includes all the patches generated by Angelix. It is not clear how the files in [57] match the experiments reported in [56] (the number of files does not match the results reported in the paper). Still, many files are available that allow us to study this research question in depth. Disregarding their provenance, there are 45,131 patches in the reproducibility package [57]. Since there may be repeated patches, we removed repetitions and 2,120 patches remained. For each of these patches we checked overfitting using the test suite provided in IntroClass (suite O in our tables) as held-out tests. Also, we checked overfitting using Pex. Finally, as we did with our experiments, we ran the corresponding test generated by Pex (if any) on the original patched method to check that it actually fails. Table 10 reports the total number of unique patches for each method. Since there might be multiple patches for a specific student version, we also present the number of versions involved in brackets. For example, there are 78 unique patches for the method digits in the available reproducibility package, yet they correspond to 22 different student versions. Table 10 provides strong evidence that using held-out tests to assess the correctness of a patch is unacceptable, since more than 80% of the patches that pass the held-out tests are nevertheless incorrect (the 80% value is obtained from Table 10 by calculating (#Pass O − #Pass Pex) / #Pass O). Interestingly, the number of student versions fixed
reported in Table 10 (after an in-depth analysis of the reproducibility package obtained from [56]) and in Table 7 are very similar. This observation suggests that, despite the differences in the instrumentation of the dataset and the experimental setup, running the same tool on the same dataset produced the same results, providing evidence against any bias introduced during our experiments.
           #Patches    #Pass O    #Pass Pex
median     887 [58]    130 [28]   66 [12]
smallest   1155 [51]   283 [26]   11 [3]
digits     78 [22]     9 [7]      3 [1]
Table 10. Quality of patches that pass the held-out tests.
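For reference, instantiating that expression with the totals of Table 10 (aggregating the three methods) gives

    (#Pass O − #Pass Pex) / #Pass O = ((130 + 283 + 9) − (66 + 11 + 3)) / (130 + 283 + 9) = 342 / 422 ≈ 0.81,

which is consistent with the "more than 80%" figure quoted above.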
4 THREATS TO VALIDITY
In this paper we focused on the IntroClass dataset. Therefore, the conclusions we draw only apply to this dataset and, more precisely, to the way in which the selected automated repair tools are able to handle IntroClass. Nevertheless, we believe this dataset is particularly adequate to stress some of the points we make in the paper. In particular, considering small methods that can be easily specified in formal behavioral specification languages such as Code Contracts [13] or JML [5] allows us to determine whether patches are indeed fixes or are spurious fix candidates. This is a problem that is usually overlooked in the literature: either patches are accepted as fixes (no further study on the quality of patches is made) [47], or they are subject to human inspection (which we consider severely error-prone), or they are compared against developer fixes retrieved from the project repository [37] (which, as pointed out in [47], may show that automated repair tools and developers overfit in a similar way). Since these tools work on different languages (GenProg and Angelix work on C code, while Nopol works on Java code and AutoFix repairs Eiffel code), the corresponding datasets are different across languages, as explained in Section 3.2. Therefore, the reader is advised not to draw conclusions by comparing tools that work on different datasets.
Also, we used the repair tools to the best of our abilities. This is complex in itself because research tools usually have usability limitations and are not robust enough. In all cases we consulted the corresponding tool developers in order to make sure we were using the tools with the right parameters, and reported a number of bugs, some of which were fixed in time for us to run experiments with the fixed versions. The reproducibility package includes all the settings we used.
Notice that we are using Pex as our gold standard, i.e., if a patch is deemed correct by Pex, it is accepted as a fix. This may not always be the case; if it is not, this counts against our hypothesis. For the experimental evaluation we used a 2 hour timeout. We consider this an appropriate timeout, but increasing it may yield additional results.
The results reported only apply to the studied tools. Other tools might behave in a substantially different way. We attempted to conduct this study on a wider class of tools, yet some tools were not available even for academic use (for instance PAR [29]), while other tools had usability limitations that prevented us from running them even on this simple dataset (this was the case, for instance, with SPR [34]).
5 RELATED WORK
This paper extends previous results published by the authors in [59]. That paper addressed the use of specifications and automated bug-finding to assess overfitting in automated program repair tools. Similar ideas were later also published in [44], where OpenJML [9] is used for verification purposes rather than Pex. Only the overfitting
problem is analyzed in [44], but on a different benchmark of programs. This new benchmark is a valuable contribution of [44]. Instead, we stick to the IntroClass dataset, which has simpler programs, has been used as a benchmark in other papers, and serves as a lower bound for the evaluation of APR tools, i.e., if tools fail to properly fix these simple programs, it is hardly the case that they will be successful on more complex ones.
Automatic program fixing has become, over the last few years, a very active research topic, and various tools for program repair are now available, many of which we have already referred to earlier in this paper. These generally differ in their approaches for producing program patches, using several different underlying approaches, including search-based techniques, evolutionary computation, pattern-based program fixing, program mutation, synthesis, and others. Since this paper is mainly concerned with how these program fixing approaches evaluate the produced fix candidates, we will concentrate on that aspect of automated program repair techniques. A very small set of program repair techniques use formal specifications as acceptance criteria for program fixes. Gopinath et al. [17] propose a technique to repair programs automatically by employing SAT solving for various tasks, including the construction of repair values from faulty programs where suspicious statements are parameterized, and checking whether the repair candidates are indeed fixes; they use contracts specified in Alloy [20], and SAT-based bounded verification for checking candidate programs against specifications. Staber et al. [48] apply automated repairs on programs captured as finite state machines, whose intended behavior is captured through linear time temporal logic [8]; repair actions essentially manipulate the state transition relation, using a game-theoretic approach. von Essen and Jobstmann [51] propose a technique to repair reactive systems, specified by formal specifications in linear time temporal logic, resorting to automated synthesis. Arcuri and Yao's approach [1] applies to sequential programs accompanied by formal specifications in the form of first-order logic pre- and post-conditions, and uses genetic programming to evolve a buggy program in the search for a fix, driven by a set of tests automatically computed from the formal specification and the program. Their use of formal specifications is then weaker than in the previously mentioned cases. Wei et al. [52] propose a technique that combines tests for fault localization with specifications in the form of contracts, for automatically repairing Eiffel programs. Their technique may correct both programs and contracts; it uses automatically generated tests to localize faults and to instantiate fix schemas to produce fix candidates; fix candidates are then assessed indirectly against contracts, since they are evaluated on a collected set of failing and passing tests, automatically built using random test generation from the contracts.
All other automated repair tools we are aware of use tests as specifications, mainly as a way of making the corresponding techniques more widely applicable, since tests can be more commonly found in software projects and their use scales more reasonably than other verification approaches. We summarize here a set of known tools and techniques that use tests as specifications. The BugFix tool by Jeffrey et al. [22] applies to C programs, and uses tests as specifications; the tool employs machine learning techniques to produce bug-fixing suggestions from rules learned from previous bug fixes. Weimer et al. [54] use genetic algorithms for automatically producing program fixes for C programs, using tests as specifications too; moreover, they emphasize the fact that tests, as opposed to formal specifications, lead to wider applicability of their technique. Kern and Esparza [28] repair Java programs by systematically exploring alternatives to hotspots (error prone parts of the code), provided that the developers characterize hotspot constructs and provide suitable syntactic changes for these; they also use tests as specifications, but their experiments tend to use larger test sets compared to the approaches based on evolutionary computation. Debroy and Wong [10] propose a technique that combines fault localization with mutation for program repair; fault localization is a crucial part of their technique, in which a test suite is involved, the same one used as acceptance criterion for produced program patches. The tool SemFix by Nguyen et al. [41] combines symbolic execution with constraint solving and program synthesis to automatically repair programs; this tool uses provided tests both for fault localization and for producing constraints that would lead to program patches that pass all tests. Kaleeswaran et al. [25] propose a technique for identifying, from a faulty program, parts of it that are likely to be part of the repaired code, and suggest expressions on how to change these. Tests
ACM Trans. Sow. Eng. Methodol.
16 Zemín et al.
are used in their approach both for localizing faults, and for capturing the expected behavior of a program to
synthesize hints for fixes. Ke [27] proposes an approach to program repair that identifies faulty code fragments
and looks for alternative, human-written, pieces of code that would constitute patches of the faulty program;
while this approach uses constraints to capture the expected behavior of fragments and constraint solving to find
patches, this behavior is taken from tests, and the produced patches are in the end evaluated against a set of test
cases, for acceptance. Long and Rinard [34] propose SPR, a technique based on the use of transformation schemas
that target a wide variety of program defects, and are instantiated using a novel condition synthesis algorithm.
SPR also uses tests as specifications, not only as acceptance criterion, but also as part of its condition synthesis
mechanism. Mechtaev et al. [37] propose Angelix, a tool for program repair based on symbolic execution and
constraint solving, supported by a novel notion of angelic forest to capture information regarding (symbolic)
executions of the program being repaired; while Angelix uses symbolic execution and constraint solving, the
intended behavior of the program to be repaired is in this case also captured through test cases. Finally, Xuan et
al. [55] propose Nopol, which also resorts to constraint solving to produce patches from information originating
in test executions, encoded as constraints. Again, Nopol uses tests both in the patch generation process and as
acceptance criterion for its produced fixes.
Various tools for program repair that employ testing as acceptance criteria for program fixes have been shown
to produce spurious (incorrect) repairs. Qi et al. [46] show that GenProg and other tools overfit patches to the
provided acceptance suites. They do so by showing that third-party generated suites reject the produced patches.
Since several tools (particularly GenProg) use suites to guide the patch generation process, [46] actually shows
that the original suites are not good enough. We go one step further and show that even when considering more
comprehensive suites the performance of the repair tools is only partially improved: fewer overfitted patches are
produced, but no new fixes. This supports the experience reported by the authors of [46], and generalizes it to other
tools as well:
“Our analysis substantially changed our understanding of the capabilities of the analyzed
automatic patch generation systems. It is now clear that GenProg, RSRepair, and AE are
overwhelmingly less capable of generating meaningful patches than we initially understood
from reading the relevant papers.”
The overfitting problem is also addressed in [47, 56], where the original test suite is extended with a white-box
one, automatically generated using the symbolic execution engine KLEE [6]. Research question 2 in [47] analyzes
the relationship between test suite coverage and overfitting, a problem we also study in this paper. Their analysis
proceeds by considering subsets of the given suite, and showing that this leads to even more overfitted patches.
Rather than taking subsets of the original suite, we go the other way around and extend the original suite with a
substantial amount of new tests. This allows us to reach conclusions that exceed those of [47], as for instance the
fact that, while overfitting decreases, the fixing ratio remains very low. Also, we analyze the impact of larger suites
on tool performance, which cannot be correctly addressed by using small suites.
Long and Rinard [35] also study the overfitting problem, but from the perspective of the tools' search spaces. They
conclude that many tools show poor performance because their search space contains significantly fewer fixes
than patches, and in some cases, the patch generation process employed produces a search space that does not
contain any fixes.
Kali [46] was developed with the purpose of generating patches that delete functionality. RSRepair [45] is an
adaptation of GenProg that replaces genetic programming with random search.
6 DISCUSSION
e signicant advances in automated program analysis have enabled the development of powerful tools for
assisting developers in various tasks, such as test case generation, program verication, and fault localization.
ACM Trans. Sow. Eng. Methodol.
An Empirical Study on the Suitability of Test-based Patch Acceptance Criteria 17
e great amount of eort that soware maintenance demands is turning the focus of automated analysis into
automatically xing programs, and a wide variety of tools for automated program repair have been developed
in the last few years. e mainstream of these tools, as we have analyzed in this paper, concentrate in using
tests as specications, since tests are more oen found in soware projects, compared to more sophisticated
formal specications, and their evaluation scales beer than the analysis of formal specications using more
thorough techniques. While several researchers have acknowledged the problem of using inherently partial
specications based on tests to capture expected program behavior, the more detailed analyses that have been
proposed consisted in using larger test suites, or perform manual inspections, in order to assess more precisely
the eectiveness of automated program repair techniques, and the severity of the so called test suite overing
patches [47].
Our approach in this paper has been to empirically study the suitability of tests as fix acceptance criteria in
the context of automated program repair, by checking produced patches using an automatic bug-finding tool,
as opposed to previous works that used tests or manual inspections. We believe that previous approaches to
analyzing overfitting have failed to demonstrate how critical the problem of invalid patches overfitting test suites
is. Our results show that the percentage of valid fixes that state-of-the-art program repair tools that use tests as
acceptance criteria are able to provide is significantly lower than the estimations of previous assessments, e.g., [47],
even in simple examples such as the ones analyzed in this paper. Moreover, increasing the number of tests reduces
the number of spurious fixes but does not contribute to generating more fixes, i.e., it does not improve these tools'
effectiveness; instead, such increases most often make the tools exhaust their resources without producing patches.
7 CONCLUSIONS AND FURTHER WORK
Some conclusions can be drawn from these results. While weaker or lighter-weight specifications, e.g., based on
tests, have been successful in improving the applicability of automated analyses, as has been shown in the
contexts of test generation, bug finding, fault localization and other techniques, this does not seem to be the case
in the context of automated program repair. Indeed, as our results show, using tests as specifications makes it
significantly more likely to obtain invalid patches (that pass all tests) than actual fixes. We foresee three lines of
research in order to overcome this fundamental limitation:
(1) Use of strong formal specifications describing the problem to be solved by the program under analysis. In
some domains (for instance, when automatically repairing formal models [19, 60, 61]) this is the natural
way to go. For programs, more work is necessary in order to assess whether partial formal specifications
present improvements over test-based specifications.
(2) Use of more comprehensive test suites, for instance bounded-exhaustive suites (a sketch is given after this
list). These capture a portion of the semantics of the strong formal specification. Since these suites are
likely to be large in size, new tools must be prepared to deal with large suites.
(3) Inclusion of a human in the loop who assesses whether a repair candidate is indeed a fix. If she determines
it is not, she may expand the suite with new tests. This iterative process has limitations (the human may
make wrong decisions), but has good chances of being more effective than test-based specifications.
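As a minimal sketch of what we mean by bounded-exhaustive suites in item (2), the following code enumerates every input of a routine over three integer parameters within a small scope. The class name, the bound values, and the triple representation are illustrative assumptions; the oracle (given by a formal specification or a reference implementation) is left out.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of bounded-exhaustive input enumeration for a routine with
// three integer parameters; the suite size is (2*bound + 1)^3.
public class BoundedExhaustiveSuite {

    static List<int[]> generate(int bound) {
        List<int[]> suite = new ArrayList<>();
        for (int a = -bound; a <= bound; a++)
            for (int b = -bound; b <= bound; b++)
                for (int c = -bound; c <= bound; c++)
                    suite.add(new int[] { a, b, c });
        return suite;
    }

    public static void main(String[] args) {
        // bound = 2 yields 125 inputs; bound = 4 yields 729.
        System.out.println("suite size (bound 2): " + generate(2).size());
        System.out.println("suite size (bound 4): " + generate(4).size());
    }
}
```

Within the chosen scope no input is omitted, which is what makes such suites hard to overfit; the price is that their size grows quickly with the bound, so repair tools must be prepared to handle them.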
This work opens more lines for further work. An obvious one consists of auditing patches reported in the
literature, by performing an automated evaluation like the one performed in this paper. This is not a simple
task in many cases, since it demands understanding the contexts of the repairs, and formally capturing the
expected behavior of repaired programs. Also, in this paper we used bounded-exhaustive test suites that contain
approximately 100 or 1,000 tests. For some tools we saw improvement in the number of fixes, and for others we
saw that large suites render the tools useless. We will study how tools behave when finer granularity is applied in
the construction of bounded-exhaustive suites, hoping to find sweet spots that favor the quality of the produced
patches.
ACM Trans. Sow. Eng. Methodol.
18 Zemín et al.
In this article we are not proposing the use of specifications and verification tools alongside automated program
repair in industrial settings. Specifications are scarce, and producing good-enough specifications is an expensive
task. Yet it is essential that APR tool users be aware of the actual limitations APR tools have. Still, in an academic
setting, we believe that checking tools against the IntroClass dataset and assessing the quality of patches using
formal specifications should be a standard. This paper provides all the infrastructure necessary to make this task
a simple one.
REFERENCES
[1] A. Arcuri and X. Yao, A Novel Co-evolutionary Approach to Automatic Software Bug Fixing, in CEC 2008.
[2] C. Barrett, R. Sebastiani, S. Seshia, C. Tinelli, Satisfiability Modulo Theories. In Biere, A.; Heule, M.J.H.; van Maaren, H.; Walsh, T. (eds.), Handbook of Satisfiability. Frontiers in Artificial Intelligence and Applications, Vol. 185, IOS Press, pp. 825–885, 2009.
[3] R. Bharadwaj and C. Heitmeyer, Model Checking Complete Requirements Specifications Using Abstraction, Automated Software Engineering 6(1), Kluwer Academic Publishers, 1999.
[4] C. Boyapati, S. Khurshid, D. Marinov, Korat: Automated Testing Based on Java Predicates. ISSTA 2002: 123–133.
[5] L. Burdy, Y. Cheon, D.R. Cok, M.D. Ernst, J.R. Kiniry, G.T. Leavens, K. Rustan M. Leino and E. Poll, An Overview of JML Tools and Applications, in STTT 7(3), Springer, 2005.
[6] C. Cadar, D. Dunbar, and D. Engler, KLEE: Unassisted and Automatic Generation of High-Coverage Tests for Complex Systems Programs. In USENIX Conference on Operating Systems Design and Implementation (OSDI), pages 209–224, San Diego, CA, USA, 2008.
[7] A. Church, Application of Recursive Arithmetic to the Problem of Circuit Synthesis, in Summaries of talks presented at the Summer Institute for Symbolic Logic, Cornell University, 1957, 2nd edn., Communications Research Division, Institute for Defense Analyses, Princeton, N.J., 1960, pp. 3–50, 3a–45a.
[8] E. Clarke, O. Grumberg and D. Peled, Model Checking, MIT Press, 2000.
[9] R. Cok, OpenJML: JML for Java 7 by Extending OpenJDK, in NASA Formal Methods Symposium, Springer, 2011, pp. 472–479.
[10] V. Debroy and W.E. Wong, Using Mutation to Automatically Suggest Fixes to Faulty Programs. ICST 2010, pp. 65–74.
[11] K.A. De Jong, Evolutionary Computation: A Unified Approach, MIT Press, 2006.
[12] T. Durieux, M. Monperrus, IntroClassJava: A Benchmark of 297 Small and Buggy Java Programs, Research Report, Université Lille 1, 2016.
[13] M. Fähndrich, Static Verification for Code Contracts. SAS 2010: 2–5.
[14] M. Fähndrich, M. Barnett, D. Leijen, F. Logozzo, Integrating a Set of Contract Checking Tools into Visual Studio, in Proceedings of the Second International Workshop on Developing Tools as Plug-Ins, TOPI 2012, IEEE, 2012.
[15] A. Fuxman, M. Pistore, J. Mylopoulos and P. Traverso, Model Checking Early Requirements Specifications in Tropos, in Proceedings of the 5th IEEE International Symposium on Requirements Engineering, Toronto, Canada, 2001.
[16] J.P. Galeotti, N. Rosner, C. López Pombo, M.F. Frias, TACO: Efficient SAT-Based Bounded Verification Using Symmetry Breaking and Tight Bounds. IEEE TSE 39(9): 1283–1307 (2013).
[17] D. Gopinath, M.Z. Malik and S. Khurshid, Specification-Based Program Repair Using SAT. TACAS 2011: pp. 173–188.
[18] C. Le Goues, T. Nguyen, S. Forrest, W. Weimer, GenProg: A Generic Method for Automatic Software Repair, IEEE Transactions on Software Engineering 38, IEEE, 2012.
[19] S. Gutiérrez Brida, G. Regis, G. Zheng, H. Bagheri, T. Nguyen, N. Aguirre, M.F. Frias, Bounded Exhaustive Search of Alloy Specification Repairs. ICSE 2021: 1135–1147.
[20] D. Jackson, Software Abstractions: Logic, Language and Analysis, The MIT Press, 2006.
[21] D. Jackson and M. Vaziri, Finding Bugs with a Constraint Solver, in Proceedings of the 2000 ACM SIGSOFT International Symposium on Software Testing and Analysis, ISSTA 2000, ACM, 2000.
[22] D. Jeffrey, M. Feng, N. Gupta, R. Gupta, BugFix: A Learning-based Tool to Assist Developers in Fixing Bugs, in Proceedings of the International Conference on Program Comprehension, ICPC 2009, 2009.
[23] R. Jhala and R. Majumdar, Software Model Checking, ACM Comput. Surv. 41(4), ACM, 2009.
[24] B. Jobstmann, A. Griesmayer, R. Bloem, Program Repair as a Game, in Proceedings of the Computer Aided Verification Conference, CAV 2005, 2005.
[25] S. Kaleeswaran, V. Tulsian, A. Kanade, A. Orso, MintHint: Automated Synthesis of Repair Hints, International Conference on Software Engineering (ICSE), 2014.
[26] Y. Ke, K.T. Stolee, C. Le Goues, and Y. Brun, Repairing Programs with Semantic Code Search, International Conference on Automated Software Engineering (ASE), 2013.
[27] Y. Ke, An Automated Approach to Program Repair with Semantic Code Search, Graduate Theses and Dissertations, Iowa State University, 2015.
[28] C. Kern and J. Esparza, Automatic Error Correction of Java Programs, Formal Methods for Industrial Critical Systems (FMICS), 2010.
ACM Trans. Sow. Eng. Methodol.
An Empirical Study on the Suitability of Test-based Patch Acceptance Criteria 19
[29] D. Kim, J. Nam, J. Song, and S. Kim, Automatic Patch Generation Learned from Human-written Patches, in Proceedings of the 35th International Conference on Software Engineering (ICSE 2013), San Francisco, May 18–26, 2013, pp. 802–811.
[30] X. Kong, L. Zhang, W.E. Wong, and B. Li, Experience Report: How Do Techniques, Programs, and Tests Impact Automated Program Repair?, in 2015 IEEE 26th International Symposium on Software Reliability Engineering (ISSRE), IEEE, 2015, pp. 194–204.
[31] C. Le Goues, N. Holtschulte, E.K. Smith, Y. Brun, P. Devanbu, S. Forrest, and W. Weimer, The ManyBugs and IntroClass Benchmarks for Automated Repair of C Programs, IEEE Transactions on Software Engineering (TSE), 2013.
[32] C. Le Goues, M. Pradel, A. Roychoudhury, Automated Program Repair. Communications of the ACM 62(12): 56–65 (2019).
[33] A. Leitner, I. Ciupa, B. Meyer, M. Howard, Reconciling Manual and Automated Testing: The AutoTest Experience, in Proceedings of the 40th Hawaii International Conference on System Sciences, 2007.
[34] F. Long and M.C. Rinard, Staged Program Repair with Condition Synthesis, in Symposium on the Foundations of Software Engineering (FSE), 2015.
[35] F. Long and M.C. Rinard, Analysis of the Search Spaces for Generate and Validate Patch Generation Systems, International Conference on Software Engineering (ICSE), 2016.
[36] S. Mechtaev, J. Yi, and A. Roychoudhury, DirectFix: Looking for Simple Program Repairs. In ICSE, 2015.
[37] S. Mechtaev, J. Yi, and A. Roychoudhury, Angelix: Scalable Multiline Program Patch Synthesis via Symbolic Analysis, International Conference on Software Engineering (ICSE), 2016.
[38] B. Meyer, Applying "Design by Contract", IEEE Computer, IEEE, 1992.
[39] B. Meyer, A Touch of Class, 2nd corrected ed., Springer, 2013.
[40] B. Meyer, A. Fiva, I. Ciupa, A. Leitner, Y. Wei, and E. Stapf, Programs That Test Themselves. IEEE Software, pages 22–24, 2009.
[41] H.D.T. Nguyen, D. Qi, A. Roychoudhury, S. Chandra, SemFix: Program Repair via Semantic Analysis, International Conference on Software Engineering (ICSE), 2013.
[42] R. Nieuwenhuis, A. Oliveras and C. Tinelli, Solving SAT and SAT Modulo Theories: From an Abstract Davis-Putnam-Logemann-Loveland Procedure to DPLL(T), Journal of the ACM 53(6), pp. 937–977, ACM, 2006.
[43] R. Nieuwenhuis and A. Oliveras, On SAT Modulo Theories and Optimization Problems. In Theory and Applications of Satisfiability Testing – SAT 2006, 9th International Conference, Seattle, WA, USA, August 12–15, 2006, Proceedings (Lecture Notes in Computer Science, Vol. 4121), A. Biere and C.P. Gomes (Eds.), Springer, 156–169, 2006.
[44] A. Nilizadeh, G.T. Leavens, X.-B.D. Le, C.S. Pǎsǎreanu and D.R. Cok, Exploring True Test Overfitting in Dynamic Automated Program Repair Using Formal Methods, 14th IEEE Conference on Software Testing, Verification and Validation (ICST), 2021, pp. 229–240.
[45] Y. Qi, X. Mao, Y. Lei, Z. Dai and C. Wang, The Strength of Random Search on Automated Program Repair, International Conference on Software Engineering (ICSE), 2014.
[46] Z. Qi, F. Long, S. Achour, and M.C. Rinard, An Analysis of Patch Plausibility and Correctness for Generate-and-Validate Patch Generation Systems. In Proceedings of the 2015 International Symposium on Software Testing and Analysis, ISSTA 2015, Baltimore, MD, USA, July 12–17, 2015, pages 24–36, 2015.
[47] E.K. Smith, E. Barr, C. Le Goues, and Y. Brun, Is the Cure Worse Than the Disease? Overfitting in Automated Program Repair, Symposium on the Foundations of Software Engineering (FSE), 2015.
[48] S. Staber, B. Jobstmann, R. Bloem, Finding and Fixing Faults, in Proceedings of the Conference on Correct Hardware Design and Verification Methods, 2005.
[49] N. Tillmann and J. de Halleux, Pex: White Box Test Generation for .NET, in Proceedings of the Second International Conference on Tests and Proofs, TAP 2008, LNCS, Springer, 2008.
[50] M. Trudel, C. Furia, M. Nordio, Automatic C to O-O Translation with C2Eiffel, in Proceedings of the 2012 19th Working Conference on Reverse Engineering, WCRE 2012, IEEE, 2012.
[51] C. von Essen, B. Jobstmann, Program Repair without Regret, in Proceedings of Computer Aided Verification, CAV 2013, 2013.
[52] Y. Wei, Y. Pei, C.A. Furia, L.S. Silva, S. Buchholz, B. Meyer, and A. Zeller, Automated Fixing of Programs with Contracts, International Symposium on Software Testing and Analysis (ISSTA), 2010.
[53] W. Weimer, T. Nguyen, C. Le Goues and S. Forrest, Automatically Finding Patches Using Genetic Programming. ICSE 2009: pp. 364–374.
[54] W. Weimer, S. Forrest, C. Le Goues, T. Nguyen, Automatic Program Repair with Evolutionary Computation, Communications of the ACM 53:5, ACM, 2010.
[55] J. Xuan, M. Martinez, F. Demarco, M. Clément, S. Lamelas, et al., Nopol: Automatic Repair of Conditional Statement Bugs in Java Programs. IEEE Transactions on Software Engineering.
[56] X.-B.D. Le, F. Thung, D. Lo, C. Le Goues, Overfitting in Semantics-based Automated Program Repair. Empirical Software Engineering 23(5): 3007–3033 (2018).
[57] X.-B.D. Le, F. Thung, D. Lo, C. Le Goues, reproducibility package for paper [56], https://doi.org/10.5281/zenodo.1012686, last accessed on September 2, 2021.
[58] H. Ye, M. Martinez, and M. Monperrus, Automated Patch Assessment for Program Repair at Scale. Empirical Software Engineering, vol. 26, no. 2, pp. 1–38, 2021.
ACM Trans. Sow. Eng. Methodol.
20 Zemín et al.
[59] L. Zemín, S. Gutiérrez Brida, A. Godio, C. Cornejo, R. Degiovanni, G. Regis, N. Aguirre, M.F. Frias, An Analysis of the Suitability of Test-Based Patch Acceptance Criteria. SBST@ICSE 2017: 14–20.
[60] G. Zheng, T. Nguyen, S. Gutiérrez Brida, G. Regis, N. Aguirre, M.F. Frias, H. Bagheri, ATR: Template-based Repair for Alloy Specifications. ISSTA 2022: 666–677.
[61] G. Zheng, T. Nguyen, S. Gutiérrez Brida, G. Regis, M.F. Frias, N. Aguirre, H. Bagheri, FLACK: Counterexample-Guided Fault Localization for Alloy Models. ICSE 2021: 637–648.
ACM Trans. Sow. Eng. Methodol.