Do LLMs generate test oracles that capture the
actual or the expected program behaviour?
Michael Konstantinou
SnT, University of Luxembourg
Luxembourg
michael.konstantinou@uni.lu
Renzo Degiovanni
Luxembourg Institute of
Science and Technology
Luxembourg
renzo.degiovanni@list.lu
Mike Papadakis
SnT, University of Luxembourg
Luxembourg
michail.papadakis@uni.lu
Abstract—Software testing is an essential part of the software
development cycle to improve the code quality. Typically, a
unit test consists of a test prefix and a test oracle which
captures the developer’s intended behaviour. A known limitation
of traditional test generation techniques (e.g. Randoop and
Evosuite) is that they produce test oracles that capture the
actual program behaviour rather than the expected one. Recent
approaches leverage Large Language Models (LLMs), trained
on an enormous amount of data, to generate developer-like code
and test cases. We investigate whether the LLM-generated test
oracles capture the actual or the expected software behaviour. We thus conduct a controlled experiment to answer this question, by studying LLMs' performance on two tasks, namely, test oracle
classification and generation. The study includes developer-
written and automatically generated test cases and oracles for 24
open-source Java repositories, and different well-tested prompts.
Our findings show that LLM-based test generation approaches are also prone to generating oracles that capture the actual program behaviour rather than the expected one. Moreover, LLMs are better at generating test oracles than at classifying the correct ones, and can generate better test oracles when the
code contains meaningful test or variable names. Finally, LLM-
generated test oracles have higher fault detection potential than
the Evosuite ones.
Index Terms—test oracle generation, large language models,
neural networks, empirical evaluation
I. INTRODUCTION
Software testing is an essential part of the software develop-
ment, maintenance and evolution cycle and aims at improving
code quality [1]. Designing test cases is also a time-consuming and labour-intensive activity that is typically performed manually [2].
To deal with this issue, many automatic test generation
approaches utilising various types of technologies, such as
evolutionary algorithms [3], symbolic execution [4], feedback-
directed random search [5] and specification mining [6], have
been proposed. These techniques aim at generating test cases
that exercise the software under test guided by one or more
coverage criteria, i.e., generating test suites that maximize
coverage such as branch coverage or mutation testing, with
many studies reporting that they achieve comparable or even
higher coverage scores than the developer test suites [7], [8].
While effective at covering code (or killing mutants), auto-
matic test generation falls short in finding faults, particularly
business-logic related faults. This is because of the inherent
inability of these techniques to compose test oracles (test
assertions) that capture the expected program behaviour. This
means that the fault detection ability of these techniques is limited to zero-level oracles, such as program crashes or memory violations (when applied at system level).
To reveal business-logic software faults with test generation techniques, one needs to manually validate and correct, when needed, the generated tests and their respective oracles [7]. In
other words, to reveal logic faults, users need to define the ex-
pected program behaviour that is tested by the generated tests,
which is to be contrasted with the actual program behaviour.
This is why test generation approaches, such as Evosuite [9] or
KLEE [10], often generate test oracles that capture the actual
program behaviour assuming that it is correct, i.e., making the
assumption that the current implementation under analysis is
correct, which makes them incapable of finding logic faults.
With the advent of language models, many test generation techniques that integrate ML techniques as well as power-
ful Large Language Models (LLMs) have been proposed [11]–
[13]. LLMs have been trained on a huge amount of code and
data and as a result they can generate developer-like code and
test cases. The advantage of LLMs, compared to traditional techniques, is that they make use of the natural channel of code [14], i.e., they exploit the naming conventions used during coding and the similarities with existing code they have been trained on, and thus effectively generate tests.
Some approaches employ LLMs to generate only the test
oracle, instead of generating the entire test suite. Given a test
prefix, these techniques generate a test assertion (test oracle)
that hopefully captures the expected program behaviour. The
idea is that existing test generation techniques can be com-
bined with the oracle generation ones, to generate tests that
reveal software faults. For instance, TOGA [15] and TOGLL
[16] use Evosuite to generate test prefixes, and leverage LLMs
to generate the test oracles. These approaches assume that LLMs can guess the expected program behaviour and thus use the LLMs to generate correct test oracles.
In view of this, we investigate the extent to which LLMs are capable of guessing the expected program behaviour in contrast to the actual one, i.e., the extent to which LLMs
generate test oracles capturing the expected over the actual
program behaviour. An answer to this question in favour of the
former case signifies that LLMs can be used to reveal logic-related faults. An answer in favour of the latter case signifies that LLMs are more suited for regression testing.
We study LLMs’ performance on two test oracle generation
tasks, test oracle classification (judging the correctness of
externally given test oracles) and test oracle generation (letting
the LLM generate the oracles it considers relevant), when using developer-written and automatically generated test cases and oracles, as well as buggy and correct implementations from 24 open-source Java repositories.
Interestingly, our results show that LLMs are more likely to
generate test oracles that capture the actual program behaviour
(what is actually implemented) rather than the expected one,
i.e., the intended behaviour. Additionally, we find that the
overall performance of the LLMs is relatively low (less than
50% accuracy) meaning that LLMs do not provide a strong
oracle correctness signal. Therefore, all LLM suggestions will need human inspection.
We also find that LLMs are better at generating test oracles than at judging the correctness of externally generated oracles (up to 18,18% better accuracy on average). LLM-generated test oracles are heavily impacted by the naming conventions used by the tests: they have up to 16,10% higher performance when using developer-written naming conventions than when using Evosuite's ones. Interestingly, we find that LLM-generated assertions led to an up to 2,96% higher mutation score than Evosuite's test oracles, indicating that they can be used for test augmentation.
Taken together, our results corroborate the conclusion that, unless meaningful test or variable names are available, LLMs can mainly be used to capture the actual program behaviour (and thus be used for regression testing). Additionally, we find that LLMs could be a good addition to existing test generation tools, or to the test writing task, by using them to perform test augmentation. Overall, this work raises awareness of the practical issues involved in, and the advantages and disadvantages of, LLM-based test oracle generation.
We provide a replication package that includes the two large
datasets, based on developer-written and Evosuite-generated
test suites and oracles, used in the empirical study at: https://doi.org/10.5281/zenodo.13867480.
The paper proceeds by covering the relevant background,
followed by a detailed description of the experimental setup.
Next, we present the results, accompanied by a discussion
based on our findings. We then address the threats to validity
of our study and review the related work. Finally, we present
the conclusions of our study.
II. BACKGROUND
A. Traditional Test Oracle Generation
Evosuite [3] is a state-of-the-art tool for automated test
generation, based on a search-based method. It generates test
cases by analyzing the program’s implementation and current
execution. It suffers from two main limitations though. Its first
limitation is that Evosuite’s test cases are hard to read and
interpret. The second limitation is that Evosuite generates its
test cases assuming that the current implementation is correct.
Therefore, if the code is buggy, it is very likely that some of
the generated tests will be incorrect.
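To make this limitation concrete, the sketch below shows, on a hypothetical class that is not taken from any of the studied subjects, how a regression-style oracle asserts the value the current (buggy) implementation actually returns instead of the intended one:

// Hypothetical example: the intended behaviour caps discounts at 50%,
// but the implementation mistakenly caps them at 40%.
class PriceCalculator {
    double apply(double price, double discount) {
        double capped = Math.min(discount, 0.40); // bug: the cap should be 0.50
        return price * (1.0 - capped);
    }
}

// Evosuite-style regression test: the oracle asserts the value the buggy
// implementation returns (60.0), not the intended result (50.0).
class PriceCalculatorRegressionTest {
    @org.junit.Test
    public void test0() {
        PriceCalculator priceCalculator0 = new PriceCalculator();
        double result = priceCalculator0.apply(100.0, 0.50);
        org.junit.Assert.assertEquals(60.0, result, 0.01);
    }
}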
Randoop [5] is a feedback-directed random test generation
tool. Just like Evosuite, it also produces regression oracles
but it also aims to generate tests that violate the relevant API contract(s). Similarly, the limitation of generating tests out of buggy code is present here as well.
B. Neural networks and LLMs for test oracle generation
Neural networks are meant to detect patterns in the documentation string (docstring) or the code in order to generate an answer. With proper training, one can generate code using LLMs without manually writing the rules or patterns to look for, as is needed in other approaches such as specification mining or rule-based systems. Therefore, a number of approaches have been tried using LLMs for test oracle generation.
ATLAS [17] is a recurrent neural network based approach,
which uses a test prefix and a unit under test in order to gen-
erate an assertion. It does not use any other information (e.g. a docstring) and it focuses only on the inference of assertion test oracles. However, when transformer-based models have been used for this purpose, their results outperformed ATLAS [18]–[20]. Those tools, however, only focus on the production
of assertion oracles and do not attempt to infer or evaluate
possible exception oracles.
TOGA [15] is an approach that uses two neural network
models based on CodeBERT [21], and an internal rule-based
system in order to generate both exception and assertion test oracles. The internal mechanism of TOGA initially uses an exception classifier, which classifies whether the given test prefix will throw an exception or not. If the classifier outputs that the test will throw an exception, then the tool generates an exception oracle as typically done using JUnit4. Otherwise, TOGA scans the test and extracts a few possible assertions using its own rule-based system. Afterwards, a second neural network model is used to evaluate the correctness of the assertions in order to pick the assertion that is most likely to be correct. Despite its novel
approach and its improvement on the exception oracle finding,
its internal rule-based system relies a lot on the Evosuite
test style convention. For instance, it generates a candidate
assertion assuming that the test prefix contains a final line
similar to the one used by Evosuite, the one that executes
the method under test. Moreover, other studies that evaluated
TOGA showed a high number of false positives [22] [23].
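For reference, the two oracle kinds targeted by TOGA can be sketched as follows; the unit under test is hypothetical, and the try/fail/catch pattern is the JUnit4 idiom typically used for Evosuite-style exception oracles:

import static org.junit.Assert.assertEquals;
import static org.junit.Assert.fail;
import java.util.ArrayDeque;
import java.util.NoSuchElementException;
import org.junit.Test;

public class OracleKindsExample {

    // Assertion oracle: the test prefix is completed with an assertion on a value.
    @Test
    public void assertionOracle() {
        ArrayDeque<Integer> stack = new ArrayDeque<>();
        stack.push(42);
        assertEquals(Integer.valueOf(42), stack.peek());
    }

    // Exception oracle: the last call of the prefix is expected to throw.
    @Test
    public void exceptionOracle() {
        ArrayDeque<Integer> stack = new ArrayDeque<>();
        try {
            stack.pop(); // popping an empty deque throws NoSuchElementException
            fail("Expecting exception: NoSuchElementException");
        } catch (NoSuchElementException e) {
            // expected: this is the exception oracle
        }
    }
}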
Following TOGA, TOGLL [16] is a study that fine-tuned and assessed 7 LLMs for test oracle generation and conducted its evaluation on six different prompt variations. After identifying the best prompt variations and the best performing LLM, the authors introduced TOGLL, a tool based on their best fine-tuned LLM. It can generate about 3,8 times more correct assertions and 4,9 times more exception oracles than TOGA. Although it significantly improves the results compared to TOGA, it also suffers from false positives (about 25% for assertion oracles) and compilation errors, since about 5% of the assertions produced do not compile. In addition, it has a maximum limit of 600 tokens, which meant that in about 3% of their cases the sixth prompt variation could not be used.
C. Mutation Testing
Mutation testing [24] is a test adequacy criterion where
test requirements are represented by mutants, i.e., artificially
seeded faults obtained from slight syntactic modifications to an
original program (e.g., size > 0 is mutated to size > 1).
Mutants are used to assess the effectiveness and thoroughness
of a test suite, by measuring how many of these artificial faults
the suite is able to detect. If there exists a test case that is capable of producing observable outputs that distinguish between the mutant and the original program, we say the mutant is killed; otherwise, the mutant survived. Equivalent mutants are those that cannot be killed as they behave as
the original program. The mutation score (MS) is computed
as the ratio between killed mutants over the total number
of generated mutants, and gives an estimation of the fault
detection capabilities of the test suite.
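As a minimal illustration (hypothetical code, mirroring the size > 0 example above), the following sketch shows an original predicate, a mutant, and a test that kills the mutant because the two versions disagree on an observable output:

import static org.junit.Assert.assertTrue;
import org.junit.Test;

public class MutationExample {

    // Original program: a collection is non-empty when its size is greater than 0.
    static boolean isNonEmptyOriginal(int size) {
        return size > 0;
    }

    // Mutant: a slight syntactic modification (size > 0 is mutated to size > 1).
    static boolean isNonEmptyMutant(int size) {
        return size > 1;
    }

    // This test kills the mutant: for size == 1 the original returns true while
    // the mutant returns false, so the assertion fails when run on the mutant.
    @Test
    public void killsTheSizeMutant() {
        assertTrue(isNonEmptyOriginal(1));
    }
}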
In this study we use two mutant generation tools, namely,
µBERT [25] and PiTest [26], each one for a different research
question. µBERT [25] is a mutation testing tool that uses a pre-
trained language model (CodeBERT) [27] to generate mutants
by masking and replacing tokens. Since it works at source code level, we use µBERT to generate buggy code versions that are used in our experiments to answer RQ1. PiTest [26] is one
of the state-of-the-art mutation testing tools that seeds faults
using syntactic transformation rules (aka mutant operators)
at the bytecode level. We particularly use PiTest in RQ4, to
compute and compare the mutation scores between the LLM-
generated test oracles and EvoSuite’s generated ones.
III. EXPERIMENTAL SETUP
A. Research Questions
Our study is designed to assess the performance of LLMs in
two different, but related, tasks: test oracle classification [15],
and test oracle generation [16]. The study also aims to evaluate
the fault detection potential of the generated test oracles
and identify the conditions under which LLM performance
improves. We chose to use GPT 3.5-turbo [28] for this
experiment because it is a state-of-the-art tool and one of the
most widely used LLMs. Additionally, it has been previously employed in test generation studies [11], [8], making it a relevant choice for our analysis.
We start by checking whether LLMs can capture the program's intended behaviour or not. Thus, we ask:
RQ1: (How well) Can LLMs capture the actual or the
expected program behaviour when classifying externally
defined test oracles?
To answer this question, we design a controlled experiment in which we assess if LLMs can correctly classify a given test oracle assertion for a given code under test (i.e., a test oracle classification task) under four possible scenarios:
• Correct code and test with correct assertion (CC+CA)
• Correct code and test with wrong assertion (CC+WA)
• Wrong code and test with correct assertion (WC+CA)
• Wrong code and test with wrong assertion (WC+WA)
Each scenario is composed of two parts: the source code
of the program under test, and a test case with an oracle
assertion. We measure if LLMs can identify the intended
behaviour of the code and classify correct assertions as pos-
itives, when confronted with clean and buggy code. This is
important to avoid the so-called clean program assumption
[29] that is known to impact the performance of test criteria
and simulate a realistic scenario, i.e., the case that a buggy
code is tested. Assertions not capturing the intended program
behaviour should be classified as negatives, even under the
presence of correct code. These cases are also important since, if misclassified, they provide a wrong signal to developers, which in practice would mean that developers need to investigate false alerts.
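As a concrete (hypothetical) instance of the CC + WA scenario, the assertion below contradicts the intended behaviour of a correct method and should therefore be classified as incorrect:

import static org.junit.Assert.assertEquals;
import org.junit.Test;

public class ClassificationScenarioExample {

    // Correct code (CC): returns the absolute value of its argument.
    static int abs(int x) {
        return x < 0 ? -x : x;
    }

    // Wrong assertion (WA): the oracle contradicts the intended behaviour,
    // so a classifier aware of that behaviour should label it as incorrect.
    @Test
    public void testAbsOfNegativeValue() {
        int result = abs(-3);
        assertEquals(-3, result); // the expected value should be 3
    }
}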
Following previous studies [30], [31], we explore three
different prompt variations that include/exclude class method
implementations and documentation. We use the dataset from
the evaluation of TOGA [22] and TOGLL [16] to gather Java
subjects, and their corresponding test prefixes with correct
test oracle assertions, generated by Evosuite. To cover the
four aforementioned scenarios, we augment this dataset by
running a mutation testing tool, µBERT [25], to generate
invalid code versions, and GPT [28] to generate invalid test
oracle assertions (i.e., those failing when the test is run).
Intuitively, a high accuracy would reflect a good LLM
capability for identifying the intended program behaviour,
while on the contrary, a low accuracy would suggest that the
LLM is affected by the actual program behaviour.
While RQ1 studies the LLM's sensitivity from a more semantic perspective of the code and test oracles (i.e., fed with correct and buggy versions), we now aim to focus on a more syntactic perspective. Recent studies [32] have shown that LLMs' performance is affected when they are executed on out-of-distribution inputs. Hence, we wonder:
RQ2: What can influence the finding of an expected
oracle instead of an actual oracle?
We precisely study whether the LLMs are influenced by meaningful test names and descriptive variable names in combination with good quality comments. Although previous studies leverage Evosuite's test prefixes to produce test oracles, we believe that such (automatically generated) test prefixes decrease the performance of the LLMs. Therefore, in this research question, we follow the same process we applied in RQ1, but we use developer-written tests to evaluate the performance of the LLM. Afterwards, we slightly modify the tests to simulate the naming convention of Evosuite test cases, as illustrated by the sketch below, in order to find out what influences the performance of the LLMs. The findings of this experiment will give a new perspective on the usage of LLMs for the generation of test oracles that capture the expected behaviour.
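The modification step can be pictured with the following hypothetical example, in which a developer-written test is progressively stripped of its descriptive names until it resembles Evosuite's naming convention:

import static org.junit.Assert.assertEquals;
import org.junit.Test;

public class NamingModificationExample {

    // Minimal class so the sketch compiles; hypothetical, not part of the dataset.
    static class Account {
        int getBalance() { return 0; }
    }

    // Developer-written version: the test and variable names describe the intent.
    @Test
    public void shouldReturnZeroBalanceForNewAccount() {
        Account freshAccount = new Account();
        assertEquals(0, freshAccount.getBalance());
    }

    // Evosuite-like version used in RQ2: same prefix and oracle, but the names
    // no longer carry any information about the expected behaviour.
    @Test
    public void test0() {
        Account account0 = new Account();
        assertEquals(0, account0.getBalance());
    }
}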
Having studied the LLM’s performance on classifying test
oracles, we investigate their generative performance and ask:
RQ3: (How well) Can LLMs generate test oracles
capturing the expected program behaviour?
Using the same three prompt variations as before, we ask the LLM to generate five different test oracle assertions. Then, we evaluate their correctness by incorporating the test oracle assertions into the given test prefixes and running them on the program under test. If an assertion does not compile or its execution results in a failure/error, it is considered invalid. On the other hand, if its execution succeeds, then the test assertion is considered correct. These findings provide further information regarding the capabilities of LLMs and how they can achieve their highest potential for test oracle generation.
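The labelling step can be sketched as follows: each candidate assertion is appended to the (fixed) test prefix and the resulting test is executed, so the first candidate below would be labelled correct and the second invalid (a hypothetical example, not taken from the dataset):

import static org.junit.Assert.assertEquals;
import org.junit.Test;

public class GeneratedAssertionLabellingExample {

    // Candidate 1: appended to the prefix, it compiles and passes, so it is labelled correct.
    @Test
    public void prefixWithCorrectCandidate() {
        // --- fixed test prefix ---
        StringBuilder builder = new StringBuilder();
        builder.append("oracle");
        String result = builder.toString();
        // --- LLM-generated assertion (candidate 1) ---
        assertEquals("oracle", result);
    }

    // Candidate 2: appended to the same prefix, it compiles but fails at run
    // time, so it is labelled invalid.
    @Test
    public void prefixWithInvalidCandidate() {
        // --- fixed test prefix ---
        StringBuilder builder = new StringBuilder();
        builder.append("oracle");
        String result = builder.toString();
        // --- LLM-generated assertion (candidate 2) ---
        assertEquals("oracles", result);
    }
}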
Finally, we aim to compare the strength of the test oracle
assertions generated by the LLM and a traditional test gener-
ation technique, Evosuite. Then, we ask:
RQ4: How strong are the oracles generated by the LLM?
To answer this question, we compute the mutation score of the test suites that integrate Evosuite's and the LLM-generated test oracles, respectively. We use PiTest [26] for this task, since it is very efficient and its mutants are generated at bytecode level, and are thus independent of the ones used in the previous RQs, mitigating any possible bias. The answer to this question will provide further insights regarding any potential advantage(s) of using LLMs for the test oracle generation problem compared to traditional methods.
B. Prompt Design
Several studies evaluating LLMs on test generation con-
cluded that the prompt design affects the performance of
the LLM [30], [31]. The evaluation of TOGLL included 6 prompt variations, where each prompt included progressively more information regarding the method under test, the test prefix and the documentation string. It has been shown that the prompts that included the method under test and the test prefix performed best, and in some cases the addition of the documentation string (docstring) improved the performance. Therefore, for the purposes of our study, we used 3 prompt variations, similar to TOGLL's evaluation, with an extra addition:
• method under test (mut) + test prefix
• docstring + mut + test prefix
• entire class under test (cut) + docstring + mut + test prefix
To our knowledge, when evaluating test oracle generation techniques, no prior study has used a prompt that includes the whole class under test. However, when generating entire test suites, several studies provided the whole class in the instruction prompt. In this study, it seemed relevant to include the entire class under test, in order to find out whether the LLMs are influenced by the implemented code.
Finally, it is important to use the proper instruction to
ensure good results. We designed two prompt instructions, one
for each task, following Chattester’s [11] approach. Similarly,
both instructions begin with a role-playing text ("You are a professional who writes Java methods...") which serves as an optimization technique [33]. Secondly, we include in the prompt the task, which specifies that the LLM needs to answer based on the given code data. Finally, the last sentence specifies the desired format of the answer. For the classification task the answer is always binary (e.g. True/False, Correct/Incorrect), while for the generation task the answer is typically 4-5 assertion oracles. The two prompt instructions are given below.

Classification instruction: You are a professional who writes Java test methods. Given the previous data, is the following assertion correct? Answer with only one word.

Generation instruction: You are a professional who writes Java test methods in JUnit4 and Java 8. Given the previous data, generate 5 possible assertions. Answer with only 5 assertions.

Figures 1 and 2 illustrate the structure of the prompts given to the LLM. Depending on the prompt variation, the applicable parts are filled in.

Class Under Test
The complete class under test is placed here. This part is added only in the third prompt variation.

Method Under Test
The complete method under test is placed here. If the prompt variation includes a docstring, the docstring is added on top of the method.

Test prefix
The test prefix is added here. It does not contain the test oracle.

Prompt instruction for classification
You are a professional who writes Java test methods. Given the previous data, is the following assertion correct? Answer with only one word.

Assertion
The assertion to classify is placed here.

Fig. 1. Prompt structure for classification tasks.

Class Under Test
The complete class under test is placed here. This part is added only in the third prompt variation.

Method Under Test
The complete method under test is placed here. If the prompt variation includes a docstring, the docstring is added on top of the method.

Test prefix
The test prefix is added here. It does not contain the test oracle.

Prompt instruction for generation
You are a professional who writes Java test methods in JUnit4 and Java 8. Given the previous data, generate 5 possible assertions. Answer with only 5 assertions.

Fig. 2. Prompt structure for generation tasks.
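As a rough sketch of how such prompts can be assembled (our own illustration; the field names and string layout are not prescribed by the study), the three variations only differ in which optional parts are filled in:

public class PromptBuilder {

    // Assembles a prompt for the given variation (1, 2 or 3); the concrete string
    // layout here is illustrative, not the exact one used in the experiments.
    static String buildPrompt(int variation, String classUnderTest, String docstring,
                              String methodUnderTest, String testPrefix, String instruction) {
        StringBuilder prompt = new StringBuilder();
        if (variation == 3) {                      // variation 3 adds the entire class under test
            prompt.append(classUnderTest).append("\n\n");
        }
        if (variation >= 2 && docstring != null) { // variations 2 and 3 add the docstring on top of the mut
            prompt.append(docstring).append("\n");
        }
        prompt.append(methodUnderTest).append("\n\n"); // all variations include the method under test
        prompt.append(testPrefix).append("\n\n");      // and the test prefix without the oracle
        prompt.append(instruction);                    // classification or generation instruction
        return prompt.toString();
    }

    public static void main(String[] args) {
        String generation = "You are a professional who writes Java test methods in JUnit4 and Java 8. "
                + "Given the previous data, generate 5 possible assertions. Answer with only 5 assertions.";
        System.out.println(buildPrompt(1, null, null,
                "public int add(int a, int b) { return a + b; }",
                "@Test public void testAdd() { int result = new Calculator().add(2, 3); }",
                generation));
    }
}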
C. Datasets
To conduct this experiment, we used two existing artifacts that contain both automated test suites and developer-written tests. However, to cover all the aforementioned cases, the different prompt variations and the different tasks (classification or generation), we had to create a new dataset for each research question based on the two artifacts. Below follows a description of the two artifacts and the final datasets created for the purposes of this experiment.
1) Open source Java repositories: Firstly introduced for the (re-)evaluation of TOGA [22], the artifact contains 8 repositories that were used in Evosuite's benchmark and
17 repositories from the Apache Commons packages. In the
same study, they generated test suites using Evosuite which
were included in the artifact, and used for the purposes of
our study as well. The same dataset was used to evaluate the performance of TOGA and TOGLL. Furthermore, since both aforementioned tools used Evosuite test prefixes to generate automated test suites, it seemed relevant to include the same Evosuite-generated test prefixes as the ones used during their evaluation.
2) Gitbug Java: Gitbug-Java [34] is a dataset introduced in 2024, which includes 199 bugs found in 55 open source repositories. The data consist of the commits that contain each bug and the commits that fixed it. All the data are collected from commits that occurred in 2023, and as indicated, there are no records in this dataset that have been seen by the GPT model used in this experiment (GPT 3.5 Turbo).
3) Construction of a new dataset(s): Although the first
artifact was used to evaluate the two SOTA tools, its data
structure lacked sufficient information to address all of our
research questions. Since TOGA only requires the method
under test, the docstring and the test prefix, the data found in
the first artifact could not cover all the experiment cases we
aimed for. Therefore, we followed their original procedure of
dataset extraction, incorporating a few modifications to include
additional information.
The artifact contained scripts that parse the test cases generated by Evosuite, and extract the method under test, test prefix and assertion of each record. In our modification, we added the
class under test, the package name and the test class name. The
same procedure was done for the data found in Gitbug-Java.
Furthermore, regarding the first artifact, we used µBERT to
generate mutants that allowed us to generate the cases where
the code is wrong. Finally, we used the assertions that were
wrong in RQ3 to construct the cases where the assertion is
wrong. It is worth mentioning that, in our attempt to create an equally distributed dataset, we found out that the repository Async-http-client did not contain many samples that include the documentation string. When generating the different cases and mutants, it was therefore extremely difficult to find enough data, as only a handful of records contained information in all columns. Hence, the repository was dropped, leaving us with only 24 open source repositories from the first artifact.
To summarize, in the end, we created a new dataset for each
artifact found, and for each one of the four different cases.
During the execution of the experiment, we set a maximum
limit of using only 1000 records from each case in order to
ensure that the results are based on an equally distributed
dataset of 1000 records on each experiment case and each
prompt variation. Furthermore, to ensure a fair comparison between the three prompt variations, any record that did not contain all three fields (cut, docstring and mut) was excluded regardless of the prompt variation. This ensured that all prompt variations and experiment cases used the exact same records.
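A record in the resulting datasets can be pictured as the following plain data holder; the field names are ours, the artifacts store the equivalent information in their own format:

public class OracleDatasetRecord {
    String packageName;         // added in our modification of the extraction scripts
    String testClassName;       // added in our modification
    String classUnderTest;      // added in our modification (needed for prompt variation 3)
    String docstring;           // documentation string of the method under test, if any
    String methodUnderTest;     // source code of the method under test
    String testPrefix;          // Evosuite- or developer-written prefix, without the oracle
    String assertion;           // the test oracle assertion under study
    boolean codeIsCorrect;      // false when the code version was mutated with µBERT
    boolean assertionIsCorrect; // false when the assertion fails on the correct code
}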
D. Metrics
In each one of the four analysed scenarios (correct/wrong
code + correct/wrong assertion), we measure the LLM’s per-
formance in terms of its accuracy in predicting the expected
label for the given assertion. In the scenarios where the given assertion is the correct one (no matter if the code is correct or not), we expect the model to classify the assertion as correct. In the scenarios where the assertion is incorrect, we expect the model to predict it as incorrect. Typically the accuracy in a classification problem is calculated using the true positives, true negatives, false positives and false negatives obtained. However, because we break the data into the four analysed scenarios, in practice the problem contains only two outcomes: correct classification (hit) or misclassification (miss).
The accuracy is calculated as the number of hits divided
by the total number of records. In other words, the following
formula indicates the accuracy of the model for this research
question.
Accuracy = Hits / (Hits + Misses)
We measure the strength of the test oracle assertions in terms of the mutation scores they can achieve, i.e., the ratio
of killed mutants among all the generated mutants (cf. Sec-
tion II-C).
IV. RESULTS
A. RQ1: (How well) Can LLMs find the actual or the expected
oracle in a classification problem?
Table I presents the average accuracy across all repositories
for each prompt variation, under the four studied scenarios.
For instance, “CC + CA” refers to the scenario in which both
the code and the test oracle assertion are correct. Moreover,
Figure 3 illustrates the difference of the accuracy between the
first two scenarios, where in both cases the given test assertion is correct, but in the first one the code is correct and in the second one it is incorrect (buggy). Overall, for the same given assertion, the LLM achieves a higher classification accuracy when fed with the correct code than when the given code is buggy. This indicates that correct assertions are more likely to be misclassified by the LLM when the given code is slightly mutated (buggy).

Fig. 3. RQ1: Difference in the accuracy between experiment cases 1 and 2 for each repository
Precisely, when observing the results for the first scenario
(CC + CA), in 10/24 repositories, the LLM performed worse
than random in all prompt variations. Specifically, the average
accuracy for the first prompt variation was 40,77%, for the
second prompt variation 46,26% and for the third prompt
variation 45,39%. On the second scenario (WC + CA), the
LLM’s accuracy dropped. For the first prompt variation the
accuracy dropped by 8,82%, for the second prompt variation
the accuracy dropped by 9,46%, and for the third prompt
variation the accuracy dropped by 8,38%.
Interestingly, when the wrong assertion is given (scenarios
CC + WA and WC + WA in Table I), the accuracy of the
LLM increases. Specifically, the best results occurred in the
case where both the assertion and the code provided were
incorrect.
Conclusion RQ1: The LLM's test oracle classification accuracy considerably drops in the presence of buggy code, suggesting that its predictions are driven towards the actual implementation rather than the desired one.
TABLE I
RQ1: AVERAGE ACCURACY FOR EACH EXPERIMENT CASE AND EACH PROMPT VARIATION
(CC = CORRECT CODE, WC = WRONG CODE, CA = CORRECT ASSERTION, WA = WRONG ASSERTION)

Experiment case   Prompt Variation   Average Accuracy
CC + CA           1                  40,77%
CC + CA           2                  46,26%
CC + CA           3                  45,39%
WC + CA           1                  31,94%
WC + CA           2                  36,80%
WC + CA           3                  37,01%
CC + WA           1                  55,70%
CC + WA           2                  51,06%
CC + WA           3                  61,84%
WC + WA           1                  84,16%
WC + WA           2                  79,61%
WC + WA           3                  83,80%
B. RQ2: What can influence the finding of an expected oracle
instead of an actual oracle?
To answer this question, we analysed 1000 test prefixes taken from the Gitbug-Java dataset. Table II reflects how the LLM's accuracy is affected as more noise (out-of-distribution test and variable names) is injected into the test prefixes. Notice that the model obtains the highest accuracy when the original test code is used. However, the more modifications are included, the more the accuracy of the model drops.
In particular, the second prompt variation was the best performing in all cases. Notably, between the original code and the noisiest one (similar to Evosuite, where tests and variables have meaningless names), the results dropped by 16,10% for the first prompt variation, 9,80% for the second prompt variation, and 2,70% for the third prompt variation.
When only the test name was modified to resemble Evo-
suite’s naming convention, it can already be seen that the
results dropped except when the third prompt variation was
used. For the first prompt variation the accuracy dropped by
6,20%, for the second prompt variation the accuracy dropped
by 5,90% and for the third prompt variation the accuracy
increased by just 0,60%.
TABLE II
RQ2: RESULTS GROUPED BY MODIFICATIONS AND PROMPT VARIATION

Modifications             Prompt Variation   Hits     Misses
Original                  1                  44,00%   56,00%
Original                  2                  53,00%   47,00%
Original                  3                  43,20%   56,80%
Test names                1                  37,80%   62,20%
Test names                2                  47,10%   52,90%
Test names                3                  43,80%   56,20%
Test and variable names   1                  27,90%   72,10%
Test and variable names   2                  43,20%   56,80%
Test and variable names   3                  40,50%   59,50%
Let us consider the following listing, in which the LLM was able to classify the correct test assertion. We can see that the test written by the developers contains relevant information that indicates the expected behaviour. For instance, the test name states that the expected behaviour is to throw an exception for an existing path. Combined with relevant variable names such as path, the LLM is given more information regarding the expected behaviour of the code.
@Test
void shouldThrowForExistingPath() {
    // given
    final String path = "config.me";
    CommentsConfiguration conf = new CommentsConfiguration();
    conf.setComment(path, "Old", "Comments", "1", "2", "3");

    // when
    IllegalStateException ex = assertThrows(IllegalStateException.class,
            () -> conf.setComment(path, "New Comment"));

    // then
    assertThat(ex.getMessage(), equalTo("Comment lines already exists for the path 'config.me'"));
    assertThat(conf.getAllComments().keySet(), contains(path));
}

Listing 1. Developer written test example
To summarize, it is clear that the LLM's performance at finding the expected test oracle improves when more information regarding the expected behaviour is present. Meaningful test names and variable names clearly improve the accuracy of the LLM on such tasks.
Conclusion RQ2: Descriptive test and variable names improve LLMs' predictions: anonymizing the test names alone (Evosuite-like) reduces the test oracle classification accuracy by up to 6,20%, and anonymizing both test and variable names reduces it by up to 16,10%.
C. RQ3: (How well) Can LLMs generate a test oracle in a
text generation problem?
To answer this question, we take 1000 test prefixes for each project in the TOGLL [16] dataset, and instruct the LLM to generate five test oracle assertions to complete each test prefix. In a few cases the model generated fewer assertions than five (typically four), and on very few occasions (12 instances across all experiments) it did not produce any assertion; these cases were discarded.
Test prefixes were equipped with the generated assertions and run on the program under test, to determine whether each generated assertion is correct or not, and to collect the results. Table III summarises the total number of LLM-generated test assertions and test cases for each prompt variation. Notice that the first prompt variation generates the least amount of assertions, resulting in fewer generated test cases in total, while the second prompt variation generated the highest number of assertions, and the third variation is somewhere in between the two.
TABLE III
RQ3: NUMBER OF GENERATED ASSERTIONS AND TEST CASES

                     Nr. Generated assertions   Nr. Tests generated
Prompt variation 1   101,345                    20,316
Prompt variation 2   102,236                    20,482
Prompt variation 3   101,575                    20,335
Table IV shows the results for each prompt variation in
terms of accuracy. The accuracy in this case is measured as the ratio of the number of assertions that make the test pass over all generated assertions.
TABLE IV
RQ3: RESULTS
(P* = PROMPT VARIATION)

                         Avg. Accuracy   Max      Min      Median
Prompt variation 1       58,95%          74,77%   35,69%   59,32%
Prompt variation 2       57,47%          76,83%   27,26%   57,97%
Prompt variation 3       60,01%          74,85%   25,75%   61,81%
P1: ≥ 1 true assertion   93,76%          100,0%   78,78%   94,47%
P2: ≥ 1 true assertion   91,31%          99,59%   71,27%   92,47%
P3: ≥ 1 true assertion   89,34%          98,50%   64,66%   91,44%
Table IV shows the minimum, maximum, average and
median value for each prompt variation. Since we instruct the
LLM to generate five different assertions, it is expected that
some of them are not correct. For instance, if the test oracle
needs to assert the value of an integer number, the model may
generate five assertions and use five different numbers in each
assertion. Thus, we also measure the At least 1 true assertion
metric to count the percentage of test prefixes for which the
LLM generated at least 1 test oracle that passes the test.
On average, the three prompt variations produce reasonably good results, where nearly 60% of the LLM-generated test assertions pass the tests. In particular, the third prompt variation obtains a better performance on average, but a larger deviation across the different projects (25,75% the lowest accuracy, and 74,85% the highest).
When analysing the At least 1 true assertion metric, we can observe that in around 90% of the cases the LLM managed to produce a test assertion that passes the test. In particular, the first prompt variation is the most effective, producing a valid test assertion for 93,76% of the cases. In the worst-case scenario (the minimum value), the third prompt variation produced at least one valid test assertion for 64,66% of the test prefixes of a particular project.
Conclusion RQ3: LLMs are more effective at generating test oracles than at classifying them. LLMs generated at least 1 valid assertion in between 89,34% and 93,76% of the cases.
D. RQ4: How strong are the oracles generated by the LLM?
We compute and compare the mutation scores of the test suites that include the LLM-generated and the Evosuite-generated test oracle assertions, respectively. Notice that, for each prompt variation and project, we only consider the test cases for which the LLM generated at least one assertion that passes on the original code. Afterwards, we run PiTest to calculate the mutation scores and summarise them.
Table V indicates that, in the three prompt variations, the
LLM-generated test assertions lead to a higher mutation score
than the ones generated by Evosuite. Figure 4 shows the best
mutation score obtained for each repository regardless of the
prompt variation, for LLM-generated assertions and Evosuite-
generated assertions. On average, the best mutation score for
the GPT model across all repositories was 19.10%, compared
to Evosuite’s average best mutation score of 17.32%. For
reference, we also include the maximum mutation score that
Evosuite can obtain by using all the test prefixes Evosuite can
generate (i.e. not only the ones for which the LLM generated
an assertion).
TABLE V
RQ4: MUTATION SCORE OBTAINED IN EACH PROMPT VARIATION

                         Avg. Score   Max      Min     Median
LLM prompt variation 1   19,92%       70,21%   0,63%   14,26%
Evosuite                 17,56%       50,58%   0,57%   13,03%
LLM prompt variation 2   18,01%       43,13%   0,64%   14,05%
Evosuite                 17,29%       50,50%   0,65%   13,39%
LLM prompt variation 3   19,37%       42,40%   0,57%   15,92%
Evosuite                 17,11%       50,66%   0,47%   13,55%
Conclusion RQ4: LLMs can generate stronger oracles than traditional approaches, leading to higher fault detection capabilities. This suggests that LLMs can be used to complement and improve test case generation.
V. DISCUSSION
A. High accuracy to filter invalid assertions
As seen in RQ1, the LLM has a high accuracy in classifying (externally generated) incorrect assertions as negative. In a context in which assertions are first generated and then filtered, such as in TOGA and TOGLL, LLMs are good candidates to discard the invalid ones.
We manually investigated some assertions that the LLM misclassified as incorrect (i.e., they were correct but were predicted as invalid). We asked the model to justify why it answered that the assertion was incorrect. Below we report the justification provided by the LLM in two examples:
Example 1: The assertion is incorrect because it does not validate any specific behaviour of the averageOptionVolumeAbove() method. It simply checks if the result is 1, which doesn't provide any meaningful validation of the method's functionality.
Example 2: The assertion is incorrect because it's comparing the expected value Integer.MAX_VALUE with the actual value returned by scannerSubscription0.averageOptionVolumeAbove(), which is also Integer.MAX_VALUE. This assertion will pass, but it doesn't effectively verify the behaviour of the method under test. The method averageOptionVolumeAbove() should be returning the default value Integer.MAX_VALUE, which is what the assertion is comparing against. However, to effectively test the method, you should set a different value for averageOptionVolumeAbove() and then verify if the method returns that value correctly.
The model often replied that the assertion was correct but
too vague. In other words, the LLM rejected some correct
assertions because they did not seem relevant to the test’s
behaviour. We believe that this limitation can be mitigated
by incorporating a second LLM-as-a-Judge, to increase the
confidence that the assertion should be discarded.
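One way to realise this idea is sketched below; OracleJudge is a hypothetical interface standing in for an LLM client, not an API used in this study, and an assertion is discarded only when both judges reject it:

import java.util.ArrayList;
import java.util.List;

public class AssertionFilter {

    // Hypothetical judge abstraction; in practice this would wrap an LLM call.
    interface OracleJudge {
        boolean isCorrect(String methodUnderTest, String testPrefix, String assertion);
    }

    // Keeps an assertion unless both judges consider it incorrect, reducing the
    // risk of discarding correct-but-"vague" oracles on a single wrong verdict.
    static List<String> filter(List<String> candidates, String mut, String prefix,
                               OracleJudge firstJudge, OracleJudge secondJudge) {
        List<String> kept = new ArrayList<>();
        for (String assertion : candidates) {
            boolean rejectedByBoth = !firstJudge.isCorrect(mut, prefix, assertion)
                    && !secondJudge.isCorrect(mut, prefix, assertion);
            if (!rejectedByBoth) {
                kept.add(assertion);
            }
        }
        return kept;
    }
}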
B. Prone to capture the actual behaviour, not the expected one
Our experiments showed that the LLM's accuracy in correctly classifying a correct assertion as positive drops when the given code is buggy. This suggests that the LLM is prone to follow the actual implementation, rather than the expected behaviour, when classifying the test oracle. To gain further insights, we asked the model to justify its decision. Below we provide one example:
Example 3: The assertion is incorrect because the method averageOptionVolumeAbove() in the ScannerSubscription class is expected to return -2, not Integer.MAX_VALUE.
Precisely, the LLM's justification followed the information found in the actual code implementation and based its verdict on conclusions that can be derived from that implementation. Hence, although multiple studies use LLMs as if they would extract the expected oracle, our empirical results suggest that the LLM follows the actual code execution to find the test oracle.
C. Meaningful names impact the performance
Recent studies suggest that standard test generation approaches, such as Evosuite, can be used to generate effective test prefixes (e.g. to cover all branches) and then an AI-based solution (e.g. an LLM) can be used to generate the test oracle. While this is a promising line of research, RQ2 provided concrete evidence that automatically generated test/variable names can considerably drop the LLM's effectiveness. In particular, the model is much more precise at correctly classifying correct assertions as positive when the test uses developer-like naming conventions. This finding is in line with recent studies [32] that show that LLMs' performance is affected when evaluated on out-of-distribution inputs.

Fig. 4. RQ4: Mutation score comparison between Evosuite and GPT, when taking the best results in each repository
In order to better understand why the model performs better on developer-written tests, we asked the LLM to provide the justification for such predictions. While the code was still taken into consideration, the information provided by the test and variable names, such as shouldFail alongside relevant exceptions in the test, gave the model valuable insights to capture the test's expected behaviour. Therefore, the LLM could better guess what the expected test assertion should be, regardless of the actual code implementation.
This suggests that, if LLMs are meant to be used to classify/generate the test oracle, it will be decisive to generate tests that follow developer-like naming conventions.
D. Great potential on generating test oracles
Results from RQ3-4 indicate that we are more likely to obtain a correct assertion if we let the LLM generate the oracles it considers relevant, instead of asking the LLM to judge externally given test oracles. The model generated at least one valid assertion for more than 90% of the test prefixes used in the experiments, in contrast to the roughly 50% accuracy in correctly classifying the valid assertions. This suggests that the idea of using LLMs to automatically generate test assertions has great potential and is a promising line of research worth exploring in the future.
VI. THREATS TO VALIDITY
A. External Validity
One potential threat relates to the subjects we used, which we mitigate by selecting datasets that have already been used to evaluate test oracle classification/generation approaches. Although our evaluation extends to many test-assertion pairs and Java projects of different sizes, the results may not generalize to other projects or programming languages.
Another external threat lies in the specificity of the tools and LLMs and the running configurations we consider. To reduce this threat, we employ one of the state-of-the-art commercial LLMs currently available (GPT 3.5 Turbo), as well as fundamentally different modern mutation tools, and run them using their corresponding default configurations, generating the same number of mutants for each subject.
B. Internal Validity
Threats to internal validity may arise from the prompts we used. To mitigate this issue, we rely on prompts already used and evaluated by TOGLL [16] and Chattester [11]. Other threats may relate to how we label the code and assertions as correct or incorrect. To counter this threat, we use a mutation testing tool to inject faults in the correct program version and generate invalid program versions, and we use GPT to generate candidate assertions, labelling those that do not pass the test suites as invalid. In addition, potential threats may arise when decoding GPT's answers in RQ1 and RQ2. To eliminate this threat, we manually checked that all 135 unique answers given by GPT were decoded correctly, and fixed any mistakes found.
C. Construct Validity
We mitigate the data-leakage issue by using the Gitbug-Java dataset as part of our experiments, whose code and tests have not been seen during the LLM's training, nor during the generation of mutants and assertions.
Other threats may relate to our assessment metrics. However, we employ standard metrics, such as accuracy, to assess AI-model prediction performance. Instead of reporting general prediction indicators, e.g. precision/recall, we split our evaluation and focus on the four scenarios, which allowed us to gain specific insights depending on whether the LLM is fed with correct/incorrect code and assertions. To assess the test oracle fault detection capabilities, we rely on standard mutation testing techniques and metrics, and we employ one of the state-of-the-art tools to efficiently compute the mutation score.
VII. RELATED WORK
Several empirical studies have previously evaluated the
performance of LLMs in generating complete test suites or
test oracles. However, none of these studies have examined
whether LLMs mirror the actual implementation or the devel-
oper’s intent.
Siddiq et al. [35] conducted a study to evaluate LLMs on test
generation with focus on compilation rates, correctness, cover-
age, and test smells. Their evaluation consists of three models:
Codex, GPT, and StarCoder. Ouédraogo et al. [36] conducted a similar study but used a broader set of LLMs and incorporated different prompt techniques and strategies into their methodology. While these studies share some common
findings regarding the capabilities of LLMs, neither provides
a metric to determine whether the generated tests reflect the
developer’s intent or simply assert the current implementation.
Tang et al. [37] conducted a study that compares ChatGPT with a non-neural-network-based method, Evosuite, in terms of correctness, readability, and code coverage. However, they do not compare the two tools on oracle generation or their
mutation score capabilities. Our study aims at complementing
these studies by providing concrete evidence on the LLMs
capabilities on test oracle classification and generation, their
limitations and challenges for the future research.
When it comes to the evaluation of LLMs in terms of test oracle generation, there are not many empirical studies. The first large-scale evaluation of LLMs on test oracle generation comes from TOGLL [16]. Initially, the study fine-
tunes 7 LLMs and evaluates their performance on test oracle
generation. In this study they evaluate the accuracy and the
strength of the generated oracles before contributing a tool
for generating test oracles. When it comes to the strength of
the oracles produced, they evaluate the mutants killed in a
complete test suite generated by Evosuite against a complete
test suite generated by the pretrained LLMs. In our study,
we evaluate the test oracles generated by LLMs against the
corresponding Evosuite-generated oracles, focusing on the
strength of the test oracles when applied to the same code
and, consequently, the same set of mutants. Additionally, our
findings suggest that the training data used in TOGLL may
limit the potential of LLMs in generating the expected oracles.
More importantly, while their study aims to identify an LLM
with superior oracle generation capabilities, it does not assess
the LLMs’ alignment with the developer’s intent or compare
the classification and generation performance of LLMs.
Apart from the evaluation of LLMs, it is worth mentioning that there have been studies that evaluate test oracle generation tools. TOGA evaluates the performance of neural-network-based techniques, evolutionary algorithms and specification mining methods on the generated oracles [15]. However, later studies that evaluated TOGA with different metrics and different datasets [16] have shown that the metrics were unrealistic and that a straightforward baseline was not available [23].
Nonetheless, such studies propose and evaluate test oracle
generation tools without thoroughly analyzing the underlying
LLMs. In this paper, we take a deeper approach, aiming
to understand and analyze the behaviour of these LLMs to
uncover why these tools perform the way they do. Although
both studies exposed limitations of the current LLM-based
approaches, none of those studies explains the behaviour of
LLMs and how these tools can be improved.
VIII. CONCLUSION
In this study we empirically investigated whether LLMs can identify the expected program behaviour and thus be used
to classify/generate adequate test oracles. Our findings indicate
that LLMs are prone to generate test oracles that capture the
actual program implementation rather than the expected one.
We also observed that developer-like test and variable naming
can help the LLM to produce test oracles that capture the
expected behaviour, and that the LLM was very effective in
discarding invalid assertions.
Results provided evidence that LLMs are more effective
in generating valid test assertions rather than in selecting
externally generated assertions. The LLM managed to produce
at least one valid assertion for up to 93,76% of the test prefixes analysed, which led to a higher mutation score than the one obtained with Evosuite's oracles. This suggests that incorporating
LLMs to generate the test oracles is a promising line of future
research.
REFERENCES
[1] P. Ammann and J. Offutt, Introduction to software testing. Cambridge
University Press, 2016.
[2] E. T. Barr, M. Harman, P. McMinn, M. Shahbaz, and S. Yoo,
“The oracle problem in software testing: A survey,” IEEE Trans.
Software Eng., vol. 41, no. 5, pp. 507–525, 2015. [Online]. Available:
https://doi.org/10.1109/TSE.2014.2372785
[3] G. Fraser and A. Arcuri, “Evolutionary generation of whole test suites,”
in 2011 11th International Conference on Quality Software, 2011, pp.
31–40.
[4] N. Tillmann and J. de Halleux, “Pex–white box test generation for .net,”
in Tests and Proofs, B. Beckert and R. Hähnle, Eds. Berlin, Heidelberg:
Springer Berlin Heidelberg, 2008, pp. 134–153.
[5] C. Pacheco and M. D. Ernst, “Randoop: feedback-directed random
testing for java,” in Companion to the 22nd ACM SIGPLAN conference
on Object-oriented programming systems and applications companion,
2007, pp. 815–816.
[6] V. Dallmeier, N. Knopp, C. Mallon, S. Hack, and A. Zeller,
“Generating test cases for specification mining,” in Proceedings
of the 19th International Symposium on Software Testing and
Analysis, ser. ISSTA ’10. New York, NY, USA: Association
for Computing Machinery, 2010, p. 85–96. [Online]. Available:
https://doi.org/10.1145/1831708.1831719
[7] J. M. Rojas, G. Fraser, and A. Arcuri, “Automated unit test generation
during software development: a controlled experiment and think-aloud
observations,” in Proceedings of the 2015 International Symposium on
Software Testing and Analysis, ISSTA 2015, Baltimore, MD, USA, July
12-17, 2015, M. Young and T. Xie, Eds. ACM, 2015, pp. 338–349.
[Online]. Available: https://doi.org/10.1145/2771783.2771801
[8] Z. Yuan, M. Liu, S. Ding, K. Wang, Y. Chen, X. Peng, and Y. Lou,
“Evaluating and improving chatgpt for unit test generation,” Proc. ACM
Softw. Eng., vol. 1, no. FSE, pp. 1703–1726, 2024. [Online]. Available:
https://doi.org/10.1145/3660783
[9] G. Fraser and A. Arcuri, “Evosuite: automatic test suite generation for
object-oriented software,” in SIGSOFT/FSE’11 19th ACM SIGSOFT
Symposium on the Foundations of Software Engineering (FSE-19)
and ESEC’11: 13th European Software Engineering Conference
(ESEC-13), Szeged, Hungary, September 5-9, 2011, T. Gyimóthy
and A. Zeller, Eds. ACM, 2011, pp. 416–419. [Online]. Available:
https://doi.org/10.1145/2025113.2025179
[10] C. Cadar, D. Dunbar, and D. R. Engler, “KLEE: unassisted and
automatic generation of high-coverage tests for complex systems
programs,” in 8th USENIX Symposium on Operating Systems Design
and Implementation, OSDI 2008, December 8-10, 2008, San Diego,
California, USA, Proceedings, R. Draves and R. van Renesse,
Eds. USENIX Association, 2008, pp. 209–224. [Online]. Available:
http://www.usenix.org/events/osdi08/tech/full_papers/cadar/cadar.pdf
[11] Z. Yuan, Y. Lou, M. Liu, S. Ding, K. Wang, Y. Chen, and X. Peng,
“No more manual tests? evaluating and improving chatgpt for unit test
generation,” arXiv preprint arXiv:2305.04207, 2023.
[12] M. Schäfer, S. Nadi, A. Eghbali, and F. Tip, “An empirical evaluation of
using large language models for automated unit test generation,” IEEE
Transactions on Software Engineering, 2023.
[13] J. Wang, Y. Huang, C. Chen, Z. Liu, S. Wang, and Q. Wang, “Software
testing with large language models: Survey, landscape, and vision,”
IEEE Trans. Softw. Eng., vol. 50, no. 4, p. 911–936, Feb. 2024.
[Online]. Available: https://doi.org/10.1109/TSE.2024.3368208
[14] M. Allamanis, E. T. Barr, P. T. Devanbu, and C. Sutton, “A survey
of machine learning for big code and naturalness,” ACM Comput.
Surv., vol. 51, no. 4, pp. 81:1–81:37, 2018. [Online]. Available:
https://doi.org/10.1145/3212695
[15] E. Dinella, G. Ryan, T. Mytkowicz, and S. K. Lahiri, “Toga: a
neural method for test oracle generation,” in Proceedings of the 44th
International Conference on Software Engineering, ser. ICSE ’22.
New York, NY, USA: Association for Computing Machinery, 2022,
p. 2130–2141. [Online]. Available: https://doi.org/10.1145/3510003.
3510141
[16] S. B. Hossain and M. Dwyer, “Togll: Correct and strong test
oracle generation with llms,” 2024. [Online]. Available: https:
//arxiv.org/abs/2405.03786
[17] C. Watson, M. Tufano, K. Moran, G. Bavota, and D. Poshyvanyk,
“On learning meaningful assert statements for unit test cases,”
in Proceedings of the ACM/IEEE 42nd International Conference
on Software Engineering, ser. ICSE ’20. New York, NY, USA:
Association for Computing Machinery, 2020, p. 1398–1409. [Online].
Available: https://doi.org/10.1145/3377811.3380429
[18] R. White and J. Krinke, “Reassert: Deep learning for assert generation,”
2020. [Online]. Available: https://arxiv.org/abs/2011.09784
[19] M. Tufano, D. Drain, A. Svyatkovskiy, and N. Sundaresan, “Generating
accurate assert statements for unit test cases using pretrained
transformers,” in Proceedings of the 3rd ACM/IEEE International
Conference on Automation of Software Test, ser. AST ’22. ACM, May
2022. [Online]. Available: http://dx.doi.org/10.1145/3524481.3527220
[20] A. Mastropaolo, S. Scalabrino, N. Cooper, D. N. Palacio,
D. Poshyvanyk, R. Oliveto, and G. Bavota, “Studying the usage
of text-to-text transfer transformer to support code-related tasks,” 2021.
[Online]. Available: https://arxiv.org/abs/2102.02017
[21] Z. Feng, D. Guo, D. Tang, N. Duan, X. Feng, M. Gong, L. Shou,
B. Qin, T. Liu, D. Jiang, and M. Zhou, “Codebert: A pre-trained model
for programming and natural languages,” 2020. [Online]. Available:
https://arxiv.org/abs/2002.08155
[22] S. B. Hossain, A. Filieri, M. B. Dwyer, S. Elbaum, and W. Visser,
“Neural-based test oracle generation: A large-scale evaluation and
lessons learned,” in Proceedings of the 31st ACM Joint European
Software Engineering Conference and Symposium on the Foundations
of Software Engineering, ser. ESEC/FSE 2023. New York, NY, USA:
Association for Computing Machinery, 2023, p. 120–132. [Online].
Available: https://doi.org/10.1145/3611643.3616265
[23] Z. Liu, K. Liu, X. Xia, and X. Yang, “Towards more realistic
evaluation for neural test oracle generation,” in Proceedings of the
32nd ACM SIGSOFT International Symposium on Software Testing
and Analysis, ser. ISSTA 2023. New York, NY, USA: Association
for Computing Machinery, 2023, p. 589–600. [Online]. Available:
https://doi.org/10.1145/3597926.3598080
[24] M. Papadakis, M. Kintis, J. Zhang, Y. Jia, Y. L. Traon, and M. Harman,
“Chapter six - mutation testing advances: An analysis and survey,”
Advances in Computers, vol. 112, pp. 275–378, 2019. [Online].
Available: https://doi.org/10.1016/bs.adcom.2018.03.015
[25] R. Degiovanni and M. Papadakis, “µbert: Mutation testing using pre-
trained language models,” in 2022 IEEE International Conference
on Software Testing, Verification and Validation Workshops (ICSTW).
IEEE, 2022, pp. 160–169.
[26] H. Coles, T. Laurent, C. Henard, M. Papadakis, and A. Ventresque,
“PIT: a practical mutation testing tool for java (demo),” in Proceedings
of the 25th International Symposium on Software Testing and Analysis,
ISSTA 2016, Saarbrücken, Germany, July 18-20, 2016, A. Zeller and
A. Roychoudhury, Eds. ACM, 2016, pp. 449–452. [Online]. Available:
https://doi.org/10.1145/2931037.2948707
[27] Z. Feng, D. Guo, D. Tang, N. Duan, X. Feng, M. Gong,
L. Shou, B. Qin, T. Liu, D. Jiang, and M. Zhou, “Codebert:
A pre-trained model for programming and natural languages,”
in Proceedings of the 2020 Conference on Empirical Methods
in Natural Language Processing: Findings, EMNLP 2020, Online
Event, 16-20 November 2020, ser. Findings of ACL, T. Cohn,
Y. He, and Y. Liu, Eds., vol. EMNLP 2020. Association for
Computational Linguistics, 2020, pp. 1536–1547. [Online]. Available:
https://doi.org/10.18653/v1/2020.findings-emnlp.139
[28] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal,
A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal,
A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M.
Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin,
S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford,
I. Sutskever, and D. Amodei, “Language models are few-shot learners,”
2020. [Online]. Available: https://arxiv.org/abs/2005.14165
[29] T. T. Chekam, M. Papadakis, Y. L. Traon, and M. Harman, “An
empirical study on mutation, statement and branch coverage fault
revelation that avoids the unreliable clean program assumption,”
in Proceedings of the 39th International Conference on Software
Engineering, ICSE 2017, Buenos Aires, Argentina, May 20-28, 2017,
S. Uchitel, A. Orso, and M. P. Robillard, Eds. IEEE / ACM, 2017,
pp. 597–608. [Online]. Available: https://doi.org/10.1109/ICSE.2017.61
[30] L. Yang, C. Yang, S. Gao, W. Wang, B. Wang, Q. Zhu, X. Chu, J. Zhou,
G. Liang, Q. Wang et al., “An empirical study of unit test generation
with large language models,” arXiv preprint arXiv:2406.18181, 2024.
[31] W. C. Ouédraogo, K. Kaboré, H. Tian, Y. Song, A. Koyuncu, J. Klein, D. Lo, and T. F. Bissyandé, “Large-scale, independent and comprehen-
sive study of the power of llms for test case generation,” arXiv preprint
arXiv:2407.00225, 2024.
[32] J. Sallou, T. Durieux, and A. Panichella, “Breaking the silence:
the threats of using llms in software engineering,” in Proceedings
of the 2024 ACM/IEEE 44th International Conference on Software
Engineering: New Ideas and Emerging Results, ser. ICSE-NIER’24.
New York, NY, USA: Association for Computing Machinery, 2024, p.
102–106. [Online]. Available: https://doi.org/10.1145/3639476.3639764
[33] Y. Dong, X. Jiang, Z. Jin, and G. Li, “Self-collaboration code generation
via chatgpt,” arXiv preprint arXiv:2304.07590, 2023.
[34] A. Silva, N. Saavedra, and M. Monperrus, “Gitbug-java: A reproducible
benchmark of recent java bugs,” in Proceedings of the 21st International
Conference on Mining Software Repositories.
[35] M. L. Siddiq, J. Santos, R. H. Tanvir, N. Ulfat, F. Rifat, and V. C. Lopes,
“Exploring the effectiveness of large language models in generating unit
tests,” arXiv preprint arXiv:2305.00418, 2023.
[36] W. C. Ouédraogo, K. Kaboré, H. Tian, Y. Song, A. Koyuncu, J. Klein, D. Lo, and T. F. Bissyandé, “Large-scale, independent and
comprehensive study of the power of llms for test case generation,”
2024. [Online]. Available: https://arxiv.org/abs/2407.00225
[37] Y. Tang, Z. Liu, Z. Zhou, and X. Luo, “Chatgpt vs sbst: A comparative
assessment of unit test suite generation,” IEEE Transactions on Software
Engineering, vol. 50, no. 6, pp. 1340–1359, 2024.