Defect Detection Efficiency: Test Case Based vs. Exploratory Testing
Juha Itkonen, Mika V. Mäntylä and Casper Lassenius
Helsinki University of Technology, Software Business and Engineering Institute
P.O. BOX 9210, FIN-02015 TKK, Finland
This paper presents a controlled experiment com-
paring the defect detection efficiency of exploratory
testing (ET) and test case based testing (TCT). While
traditional testing literature emphasizes test cases, ET
stresses the individual tester’s skills during test execu-
tion and does not rely upon predesigned test cases. In
the experiment, 79 advanced software engineering
students performed manual functional testing on an
open-source application with actual and seeded de-
fects. Each student participated in two 90-minute con-
trolled sessions, using ET in one and TCT in the other.
We found no significant differences in defect detection
efficiency between TCT and ET. The distributions of
detected defects did not differ significantly regarding
technical type, detection difficulty, or severity. How-
ever, TCT produced significantly more false defect
reports than ET. Surprisingly, our results show no
benefit of using predesigned test cases in terms of de-
fect detection efficiency, emphasizing the need for fur-
ther studies of manual testing.
1. Introduction
Many different techniques, tools and automation
strategies have been developed to make testing more
efficient. Despite the wide variety of proposed solu-
tions, the fundamental challenge of software testing—
revealing new defects in freshly developed software or
after major modifications—is in practice still largely
dependent on the performance of human testers doing
manual testing.
While test automation is becoming increasingly
popular due to, e.g., approaches like Test-Driven De-
velopment and eXtreme Programming [1, 8, 9], em-
pirical research shows that companies typically per-
form very little automated testing [3] and most new
defects are found by manual testing. The role of auto-
mation is emphasized in regression testing and it is
best viewed as a way of removing the enactment of
simple and repetitive tasks from human testers in order
to free up time for creative manual testing [3, 11, 14].
Interestingly, manual testing, and especially test execution practices, have been studied fairly little in the software engineering community. Testing research has
focused on techniques for test case design, selection
and prioritization, as well as on optimizing automated
testing. However, we do not know, for example, what factors affect the efficiency of manual testing and how, or what practices industrial testers find useful. Previous
research shows that aspects such as testers’ skills and
the type of the software have as strong an effect on test
execution results as the test case design techniques.
We think that test execution is not a simple mechanical task of executing completely specified test cases, which can be easily carried out by a novice employee, outsourced, or even completely automated.
Instead, testers’ skills and knowledge are likely to be
important also during test execution. Indeed, often
testers use test cases primarily as a means of structur-
ing and guiding their work.
Recently, practitioner literature has discussed the
idea of testing without using predesigned test cases, so
called exploratory testing (ET) [16]. Reports on ex-
ploratory testing have proposed that ET, in some situa-
tions, could be even orders of magnitude more effi-
cient than test case based testing [5]. Other claimed
benefits of ET include the ability to better utilize test-
ers’ creativity, experience and skills, lower documenta-
tion overhead and lower reliance on comprehensive
documentation [5, 6, 16, 20, 22, 26].
Considering the claims stated in the practitioner lit-
erature, we decided to carry out an experiment to test a
simple question: Do testers performing manual func-
tional testing with predesigned test cases find more or
different defects compared to testers working without
predesigned test cases?
First International Symposium on Empirical Software Engineering and Measurement
0-7695-2886-4/07 $20.00 © 2007 IEEE
DOI 10.1109/ESEM.2007.56
The rest of this paper is structured as follows. Section 2 reviews existing research on test case based and
exploratory testing. Section 3 presents the experimen-
tal design and data analysis methods. Section 4 pre-
sents the experimental results, which are discussed in
Section 5, along with a discussion of the limitations of
this study. Finally, in Section 6 we present the conclu-
sions and outline future research directions.
2. Background
Testing in the software engineering literature is
considered a process based upon the design, genera-
tion, selection, and optimization of a set of test cases
for testing a certain system or functionality. Various
methods and techniques have then been developed that
help determine what test cases to execute [10, 13, 23].
The underlying assumption is that given the right set of
documented test cases prior to testing, testing goals can
be achieved by more or less mechanically executing
the test cases. However, this view is problematic for at
least three reasons. First, empirical studies of testing
techniques show that there are many other factors than
the technique used to design or select test cases that
explain the effectiveness and efficiency of testing.
These include, e.g., properties of the actual software
being tested, the types of the actual defects in the
tested software and the experience, skills, and motiva-
tion of testers [18]. Second, the actual importance of
documenting test cases before executing the tests is
unknown. Third, practitioner reports of testing ap-
proaches not based on a predesigned set of test cases
claim results that are clearly comparable to those ob-
tained using more formal techniques for test case de-
sign [5, 22, 26].
2.1. Experiments on testing techniques
Several experimental studies have been conducted
in order to compare test techniques to each other, es-
sentially looking at how to most efficiently build and
execute an "optimal" set of test cases. These studies are
reviewed in [18]. Juristo concludes that existing knowledge is limited, somewhat conflicting, and lacks a formal foundation [18].
Kamsties and Lott found that time taken to find a
defect was dependent on the subject [19]. Basili and
Selby, instead, found that the fault rate depended on
the software under study, and that the defect detection
rate was unrelated to tester experience [7]. Wood et al.
found defect detection rate to depend on the type of
faults in the program [27]. These studies show that
factors other than the test case design technique can
have significant effects on the testing results.
One conclusion that can be drawn from the existing studies is that more faults are detected by combining individual testers than by combining techniques [18]. This is important because it shows that the results of test execution vary significantly regardless of the test case design strategy
used. Wood et al. found that combined pairs and trip-
lets of individual testers using the same technique
found more defects than individuals [27]. The testers
seem to find different defects even when using the same technique. Similar results have been reported for
code reading and structural testing techniques.
Possible reasons for the variation in the results are
many. Individual testers might execute the documented
tests differently; the testers’ ability to recognize fail-
ures might be different; or individual testers might end
up with different test cases even though using the same
test case design technique.
However, designing the test cases beforehand and
writing them down in a test case specification docu-
ment is only one way of applying defect detection
strategies. A strategy can be applied with or without detailed test cases, and it is hard to separate the effects of the detailed documentation from the effects of the applied strategy.
2.2. Industrial experiences
While only a few studies have looked at industrial
practice, they show that test cases are seldom rigor-
ously used and documented in industrial settings. In-
stead, practitioners report that they find test cases diffi-
cult to design and often quite useless [2, 3, 16].
In practice, it seems that test case selection and design is largely left to the individual testers: "The use of
structured approaches to V&V is sparse. Instead, the
selection of test cases is very much based on the ex-
perience of the staff." [3]. Even more interesting is the
finding that "On the other hand, no one reported par-
ticular problems that can be traced back to the lack of
structured methods specifically” [3].
It seems that a large amount of testing in industry is performed without applying actual testing techniques or, e.g., any formal test adequacy criteria. The reasons for this can be many, but it shows the importance of studying and improving also these less formal testing approaches.
2.3. Exploratory testing
Exploratory testing is an approach that does not rely
on the documentation of test cases prior to test execution. This approach has been acknowledged in software testing books since the 1970s [23]. However, authors have usually not presented actual techniques or methods for performing exploratory testing, instead treating it as an ‘ad hoc’ or error guessing method.
Furthermore, exploratory testing lacks scientific research [16]. While test case design techniques provide the theoretical principles for testing, it is overly simplistic to ignore all the factors that can affect testing activities during test execution.
In the context of verifying executable specifications, Houdek et al. [15] performed a student experiment comparing reviews, systematic testing techniques
and the exploratory (ad-hoc) testing approach. The
results showed that the exploratory approach required
less effort, and there was no difference between the
techniques with respect to defect detection effective-
ness. None of the studied techniques alone revealed a
majority of the defects and only 44% of the defects
were such that the same defect was found by more than
one technique.
Some research on exploratory testing can be found in the end-user programming context. Rothermel et al. [25] reported benefits of supporting exploratory testing tasks with a tool based on formal test adequacy criteria. Phalgune et al. found that oracle mistakes
are common and should be taken into account in tools
supporting end-user programmer testing [24]. Oracle
mistakes, meaning that a tester judges incorrect behav-
iour correct or vice versa, could be an important factor
affecting the effectiveness of exploratory testing and
should be studied also in the professional software
development context.
Even though the efficiency and applicability of exploratory testing lack reliable research, there are anecdotal reports listing many benefits of this type of
testing. The claimed benefits, summarized in [16] in-
clude, e.g., effectiveness, the ability to utilize tester’s
creativity and non-reliance on documentation [5, 6, 20,
22, 26]. Considering the claimed benefits of explora-
tory testing and its popularity in industry, the approach
seems to deserve more research. The exploratory ap-
proach lets the tester freely explore without being re-
stricted by pre-designed test cases. The aspects that are
proposed to make exploratory testing so effective are
the experience, creativity, and personal skills of the
tester. These aspects affect the results, and some amount of exploratory searching and learning exists in all manual testing, perhaps excluding the most rigorous and controlled laboratory settings. Since the effects of
exploratory approach and the strength of those effects
have not been studied and are not known, it is hard to
draw strong conclusions on the performance of manual
testing techniques.
We recognize that planning and designing test cases
can provide many other benefits besides defect detec-
tion effectiveness. These include, e.g., benefits for test
planning, test coverage, repeatability, and tracking. In
this paper, however, we focus only on the viewpoint of
defect detection effectiveness.
3. Methodology
In this section, we describe the research problem
and the experimental design.
3.1. Research problem and questions
We study the effects of using predesigned test cases
in manual functional testing at the system level. Due to
the scarce existing knowledge, we focus on one re-
search problem: What is the effect of using predesigned
and documented test cases in manual functional testing
with respect to defect detection performance?
Based on existing knowledge we can pose two al-
ternative hypotheses. First, because almost all research
is focused on test case design issues we could hypothe-
sise that the results are better when using predesigned
test cases. Second, from the practitioner reports and
case studies on exploratory testing we could draw a
hypothesis that results are better when testing without
predesigned test cases. The research questions and the
hypotheses of this study are presented below.
Research question 1: How does using predesigned
test cases affect the number of detected defects?
Hypothesis H1₀: There is no difference in the number of detected defects between testing with and without predesigned test cases.
Hypothesis H1₁: More defects are detected with predesigned test cases than without predesigned test cases.
Hypothesis H1₂: More defects are detected without predesigned test cases than with predesigned test cases.
Research question 2: How does using predesigned
test cases affect the type of defects found?
Hypothesis H2₀: There is no difference in the type of the detected defects between testing with and without predesigned test cases.
Research question 3: How does using predesigned
test cases affect the number of false defect reports?
Hypothesis H3₀: There is no difference in the number of produced false defect reports between testing with and without predesigned test cases.
False defect reports refer to reported defects that
cannot be understood, are duplicates, or report non-
existing defects. This metric is used to analyze how
using test cases affects the quality of test results.
3.2. Experimental design
We used a one-factor block design with a single
blocking variable [17]. We also used the empirical
research guidelines presented by Kitchenham et al.
[21], as applicable to our context.
The study was performed as a student experiment
on the software testing and quality assurance course at
Helsinki University of Technology in November 2005.
Participation in the experiment was a compulsory part
of the course. The subjects were randomly divided into
two groups, both of which performed similar test ses-
sions with and without test cases. The experiment con-
sisted of three separate phases: preparation, session 1,
and session 2. In the preparation phase, each subject
designed and documented test cases for the feature set
that was allocated for test case based testing for the
group. All subjects, regardless of which testing ap-
proach they first utilized, designed and submitted their
test cases according to the same schedule. The subjects
designed the test cases without supervision and got to
use as much effort as they required for the preparation
phase. Note that each student designed test cases only for test case based testing; they did not prepare test cases for the other feature set, which was tested using the exploratory approach. An overview of the experimental arrangements is shown in Table 1.

Table 1. Experiment arrangements
Phase        Group 1                              Group 2
Preparation  Test cases for feature set A         Test cases for feature set B
Session 1    Test case based (Feature set A)      Exploratory testing (Feature set A)
Session 2    Exploratory testing (Feature set B)  Test case based (Feature set B)
The subjects were instructed to use the techniques
they had learned in the course, i.e. equivalence class
partitioning, boundary value analysis and combination
testing. The source documentation for the test case
design was the User’s Guide for the tested software.
The subjects’ task was to cover all functionality that
was documented in the User’s Guide concerning their
allocated feature set.
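To make the preparation task concrete, here is a toy sketch of boundary value analysis for a single numeric field (the 1..100 range and the helper function are illustrative assumptions, not part of the course material or the jEdit assignment):

```python
def boundary_values(low, high):
    """Classic boundary value set for an inclusive numeric range:
    each bound, its neighbours just inside and just outside the range,
    plus one nominal mid-range value."""
    return [low - 1, low, low + 1, (low + high) // 2,
            high - 1, high, high + 1]

# For a field that accepts integers 1..100:
print(boundary_values(1, 100))  # [0, 1, 2, 50, 99, 100, 101]
```

Equivalence class partitioning would similarly split the input space into the valid range and the two invalid ranges, with one representative test input per class.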
The subjects’ performance in the experiment af-
fected their course grade: the quality of the test cases
and their performance in test execution were evaluated
by the course assistants. The grading was based on the
subjectively evaluated quality of their predesigned test
cases and defect reports and the number of defects they
found during the controlled test sessions.
Testing session 1 took place one week after the
submission deadline for the test case designs, and test-
ing session 2 one week after the first session. All sub-
jects participated in both sessions, but the ordering of
the test approaches was different for the two groups.
The structure and length of both controlled testing sessions were exactly the same, as shown in Table 2. The
subjects of Group 1 and Group 2 performed the ses-
sions at the same time in different computer class-
rooms. In both sessions, the same application, the open
source text editor jEdit, was tested, but the tested feature set was different in session 1 and session 2. Note that in this design the two testing approaches were compared, not the results of the two groups or the two sessions.
Table 2. Testing session phases
Phase                                       Length  Description
Session setup                               15 min  Introduction and guidelines. Downloading and starting the correct variant of the jEdit application.
Testing                                     90 min  Focused testing following the exploratory or test case based approach. Writing the test log and reporting all found defects.
Survey and submitting the reports and logs  10 min  Short survey form is filled in; defect reports and test logs collected.
3.2.1. Experimental units. The experimental units
were two variants of version 4.2 of the jEdit text edi-
tor. Both variants were created from the same applica-
tion release by artificially seeding defects into the ap-
plication at the source code level and then recompiling.
We had three major reasons for selecting jEdit.
First, we wanted the tested software to be as realistic as
possible, not an unrealistically small and simple appli-
cation. Second, it had to be possible to seed defects
into the application. Third, the application domain had
to be familiar to the students without special training.
jEdit, while a fairly simple text editor, has far too wide and complicated functionality to be tested as a whole, even superficially, in the 90-minute scope of the test sessions of this experiment. Therefore, we chose two distinct and restricted feature sets for testing: Feature set A for Session 1 and Feature set B for
Session 2. We created two different variants of the
tested software in which we artificially seeded defects:
In variant A we seeded 25 defects in Feature set A, and
in variant B we seeded 24 defects in Feature set B.
Naturally, the number of seeded defects was not the
total number of defects in the software as any real
software is usually far from defect free. This was also
the case with JEdit. The variants with seeded defects
were not available to the subjects before the test ses-
sions. The normal open source version of the software
was of course available to the subjects beforehand, and
they could familiarize themselves with the features and
utilize the software when designing their test cases.
3.2.2. Factors and blocking variables. The factor in
this experiment is the applied testing approach. The
factor has two alternatives: test case based testing
(TCT) and exploratory testing (ET).
Blocking variables represent undesired variations in the experimental design that cannot be eliminated or made constant. In this experiment the only significant blocking variable was the tested feature set, including the actual and seeded defects, which could not be kept the same for all elementary experiments. The reason is that we wanted to run the experiment twice with each subject, once with each of the factor alternatives, in order to reduce the possible effects of sampling error and increase the sample size.
This design meant that there must be two separate test-
ing sessions for each subject. After the first testing
session, the subjects are naturally much more familiar
with the tested functionality and the behaviour of the
application. From the experimental point of view, the defects in the tested variant of the software must also be considered public after the first testing session. This
forced us to use different feature sets and different sets
of seeded defects in the two testing sessions. In addi-
tion, the actual defects that exist in the tested software
variant affect the test results: the total number and
types of defects differs in the feature sets as does the
difficulty of detecting them.
3.2.3. Response variables. This study looked at the
defect detection efficiency measured by the number of
defects found during a fixed length testing session.
Additionally, more insight into the efficiency is gained
by considering the proportions of different defect types
and severities as well as the number of false defect
reports produced during a testing session.
3.2.4. Subjects. The final number of subjects who performed both phases of the experiment and thus were included in the experimental data was 79. The subjects
were randomly assigned into two groups; Group 1 (39
students) and Group 2 (40 students). We collected
demographic data on the subjects to characterize them
in terms of experience in software development and
testing, phase of M.Sc. studies, etc. 27 subjects had no
previous experience in software engineering and 63
had no previous experience in testing. 8 subjects had one year and 4 subjects had two years of testing experience. Only four subjects reported having some sort of
training in software testing prior to taking the course.
The demographic data is summarized in Table 3. The credits in Table 3 refer to Finnish study credits; the M.Sc. degree requires 160 credits.
Table 3. Characteristics of the subjects
Characteristic               x̄      x̃      σ
Study year                   4.8    4.0    1.8
Study credits                107.9  110.0  41.6
Sw dev experience (years)    2.0    1.0    2.7
Testing experience (years)   0.5    0.0    1.1
x̄ = mean, x̃ = median, and σ = standard deviation
3.2.5. Parameters. The most important parameters in
this experiment are the individual properties of the
student subjects, the type of the software under test,
the time available for test execution, the tools used, the
testing environment, and the training given.
The major undesired variation originates from the
individual properties of the student subjects, e.g., ex-
perience in software engineering, amount of studies,
prior training in software testing, and individual skills.
These variations were handled in two ways. First, all subjects performed the experiment twice, once
using each of the testing approaches. Second, the sub-
jects were randomly assigned into two groups that ap-
plied the two approaches in opposite orders. The two
groups were used for the sole purpose of randomizing
the application order of the two approaches, and the
testing assignments in this experiment were individual
tasks for each of the subjects.
The tested software was the same throughout the
experiment. The available time for test execution was fixed to 90 minutes. The testing tools and environment were identical for each elementary experiment: a PC workstation with Windows XP in a university computer classroom.
3.2.6. Internal replication. In this experiment, the
elementary experiment corresponds to one subject ap-
plying one of the two factor alternatives to test one of
the two variants of the tested software. We had a
paired design where 79 subjects all replicated the ele-
mentary experiment two times, once using each of the
two testing approaches (factor alternatives). This adds
up to a total of 158 internal replications and 79 paired
replications with both alternatives and a single subject.
3.2.7. Training and instructions. The subjects were
trained to use the test case design techniques before the
experiment. The training was given in lecture format
and the training material consisted of lecture slides,
chapters in the course text book [12], and excerpts
from another software test design book [13]. The train-
ing was supported by multiple choice questionnaires.
Instructions for the assignments were given on the
course web site. In the testing sessions, the subjects got
printed instructions on the session arrangements, but
no instructions on testing techniques or strategies.
Testers using test cases got only brief instructions to
follow their predesigned test cases. Exploratory testers
got a brief charter that listed the tested functionality
and instructed them to focus testing from an average
user’s viewpoint and additionally to pay attention to
issues that may be problematic for an advanced user.
3.3. Data collection and analysis
We collected data in three ways. First, subjects sub-
mitted the predesigned test cases in electronic format.
Second, in the testing sessions the subjects filled in test
logs and defect report forms. Third, after each session
the subjects filled in a survey questionnaire.
The numbers of defects detected with the ET and TCT approaches were compared using the t-test. In addition, we
used multi-factorial analysis of variance (ANOVA) to
control for and understand the effect of the different
feature sets, and the possible interactions between the
feature set and the testing approach. Interaction would
mean that the effect of a testing approach is not similar
in the case of the two different feature sets. The t-test
and ANOVA are both parametric methods and thus
assume that the analyzed data is normally distributed
and at least on interval scale. We can assume that the
defect count data is roughly normally distributed and it
is measured on a ratio scale.
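As a sketch of how this comparison works (using SciPy rather than SPSS; the per-subject counts below are invented for illustration, not the experiment's data), a paired two-tailed t-test on defect counts looks like this:

```python
from scipy import stats

# Invented defect counts for eight subjects; each subject tested once
# with each approach, so the two samples are paired.
et_counts  = [7, 9, 6, 8, 5, 10, 7, 6]   # exploratory testing sessions
tct_counts = [6, 8, 6, 7, 5, 9, 8, 5]    # test case based sessions

# ttest_rel is the paired (related-samples) two-tailed t-test.
t_stat, p_value = stats.ttest_rel(et_counts, tct_counts)
print(f"t = {t_stat:.2f}, two-tailed p = {p_value:.3f}")
```

With a within-subjects design like this one, the paired ttest_rel is the appropriate choice; the independent-samples t-test would ignore the per-subject pairing.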
To analyze the differences in the defect types we present the defect distributions of the ET and TCT groups and perform significance analysis using the Mann-Whitney test, a non-parametric alternative to the t-test. Finally, to analyze the number of false
reports we used the Mann-Whitney test to analyze the
significance of the difference between the two ap-
proaches. The t-test could not be used for analyzing
defect distributions or number of false reports as the
data did not have a normal distribution.
The data analysis was performed using the SPSS
software package.
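A corresponding sketch of the non-parametric comparison (again in SciPy rather than SPSS, with invented per-subject counts, not the experiment's data):

```python
from scipy import stats

# Invented numbers of false defect reports per subject for the two
# approaches (illustrative only).
false_et  = [0, 1, 0, 0, 2, 0, 1, 0, 0, 1]
false_tct = [1, 2, 0, 3, 1, 2, 0, 1, 2, 1]

# The Mann-Whitney U test makes no normality assumption, which suits
# skewed count data such as numbers of false reports.
u_stat, p_value = stats.mannwhitneyu(false_et, false_tct,
                                     alternative="two-sided")
print(f"U = {u_stat}, two-tailed p = {p_value:.3f}")
```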
4. Results
In this section we present the collected data and the
results of the experiment based on the statistical analy-
sis of the data.
4.1. Defect counts
The main response variable in this experiment was the number of defects a subject detected during a 90-minute fixed-length testing session. The defect count data is summarized in Table 4.
Table 4. Summary of defect count data
Approach  Feature set  Number of  Found defects per subject
                       defects    x̄      σ
ET        A            44         6.275  2.172
ET        B            41         7.821  2.522
ET        Total        85         7.038  2.462
TCT       A            43         5.359  2.288
TCT       B            39         7.350  2.225
TCT       Total        82         6.367  2.456
x̄ = mean and σ = standard deviation
The number of defects refers to how many different
individual defects all subjects together found. Since the
feature sets were different, the number of individual
defects found in each is different. The total numbers of
individual detected defects in feature sets A and B
were 53 and 48, respectively.
Figure 1 contains the box-plots of the data, where the boxes contain 50% of the data points. There are two extreme values in the data of Feature set B. The absolute mean defect counts for the ET and TCT approaches were 7.038 and 6.367, respectively, a difference of 0.671 defects in favour of ET, which is not statistically significant according to the two-tailed t-test (p = 0.088). For feature sets A and B, the differences between the ET and TCT defect counts were 0.916 and 0.471, respectively. There
was no difference in the number of detected seeded
defects between the approaches. The ET approach de-
tected more real (non-seeded) defects.
In this experiment, the effects of learning between
the two testing session rounds cannot be separated
from the effects of the two feature sets because feature
set A was used solely in the first session and feature
set B in the second one. This means that we cannot say
if the higher reported defect counts in the second test-
ing session are caused by learning or by the type of the
features and defects under test. In the subsequent discussion, when we talk about the effect of the feature sets, we mean the combined effect of the feature set and the subjects' learning.
Figure 1. Defect counts (box-plots of detected defects per subject)
4.2. Effects of testing approach and feature set
We used two different feature sets in this experi-
ment. Although we tried to select similar feature sets to
have comparable results, it is clear that the differences
in the feature sets could have an effect on the number
of defects found. The mean defect count for feature set A was 5.817 and for feature set B 7.585. If we
had used completely custom-made “laboratory” soft-
ware, it would have been possible to better control for
the number of defects. However, as we used real world
software, we face the problem of having two feature
sets where unequal numbers of defects were detected,
and where the total number of defects is unknown.
Thus, we needed to control for the interaction effect of
the feature set.
Table 5. Effect of approach and feature set
Source                          F      Sig.
Testing approach                3.57   0.061
Feature set                     23.25  0.000
Testing approach * Feature set  0.37   0.544
We used multi-factorial ANOVA to control for the
effect of the feature set and to get a better picture of
how the feature set in combination with the testing
approach factor affects the results. This leads to a 2x2
factorial design, two factors with two levels (alterna-
tives) each. The summary of the results of the ANOVA
analysis is presented in Table 5, which shows the significance values for both the feature set and the testing approach. The effect of the feature sets is statistically significant (p = 0.000). The effect of the testing approach has a significance value of p = 0.061. Thus, the effect of the testing approach is stronger when the feature set effect is controlled for, but it is still not statistically significant.
Based on the ANOVA analysis it is possible to analyze possible interactions between the two factors. In
Table 5 we can see that the interaction effect of the
testing approach and the feature set has a significance value of 0.544, which means that no considerable interaction effect was present. In Figure 2 the mean defect counts are plotted for the four combinations of the two factors. This analysis indicates that
we have an effect for both testing approach and feature
set, but no interaction between the factors.
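The 2x2 factorial analysis described above can be sketched in code. The following is a minimal illustration, not the authors' actual analysis: a hand-rolled balanced two-way ANOVA over made-up per-session defect counts, with factors standing in for testing approach and feature set.

```python
import numpy as np
from scipy import stats

def two_way_anova(y, A, B):
    """Balanced two-way ANOVA with interaction.

    y    -- observations (one defect count per testing session)
    A, B -- factor labels for each observation (e.g. approach, feature set)
    Returns a dict mapping effect name to (F statistic, p-value).
    """
    y, A, B = np.asarray(y, float), np.asarray(A), np.asarray(B)
    a_lv, b_lv = np.unique(A), np.unique(B)
    a, b = len(a_lv), len(b_lv)
    n = y.size // (a * b)                             # replicates per cell
    g = y.mean()                                      # grand mean
    am = np.array([y[A == i].mean() for i in a_lv])   # factor A level means
    bm = np.array([y[B == j].mean() for j in b_lv])   # factor B level means
    cm = np.array([[y[(A == i) & (B == j)].mean() for j in b_lv]
                   for i in a_lv])                    # cell means

    # Sums of squares for the two main effects, the interaction, and error
    ss_a = b * n * ((am - g) ** 2).sum()
    ss_b = a * n * ((bm - g) ** 2).sum()
    ss_ab = n * ((cm - am[:, None] - bm[None, :] + g) ** 2).sum()
    ss_e = sum(((y[(A == a_lv[i]) & (B == b_lv[j])] - cm[i, j]) ** 2).sum()
               for i in range(a) for j in range(b))

    df_err = a * b * (n - 1)
    ms_e = ss_e / df_err
    out = {}
    for name, ss, df in [("A", ss_a, a - 1), ("B", ss_b, b - 1),
                         ("A*B", ss_ab, (a - 1) * (b - 1))]:
        f = (ss / df) / ms_e
        out[name] = (f, stats.f.sf(f, df, df_err))
    return out

# Tiny balanced example: 2 approaches x 2 feature sets, 2 testers per cell
# (the numbers are invented for illustration)
counts   = [1, 3, 2, 4, 5, 7, 6, 8]
approach = ["ET"] * 4 + ["TCT"] * 4
feature  = ["A", "A", "B", "B"] * 2
for effect, (f, p) in two_way_anova(counts, approach, feature).items():
    print(f"{effect}: F = {f:.2f}, p = {p:.3f}")
```

With a balanced design like this, the total sum of squares decomposes exactly into the two main effects, the interaction, and the error term, which mirrors the F and Sig. columns of Table 5.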
Figure 2. Defect count interaction effect (mean detected defects for feature sets A and B)
4.3. Detection difficulty, types, and severities
The distributions of defect type and severity can be
used to understand the differences between the two
testing approaches. The primary author classified all
defect reports according to three dimensions: type,
severity, and detection difficulty. Type indicates the
technical details of each defect, e.g., usability, per-
formance, documentation. Severity means the defect’s
impact on the end user. The distribution of the defects
according to this classification is presented in Tables
6, 7, and 8. Table 6 characterizes the defects based on the detec-
tion difficulty. A mode 0 defect means that the defect
is immediately obvious to the tester, e.g., a missing
button. A mode 1 defect (single-mode defect) requires
one action of the tester in order to cause a failure and
reveal the defect, e.g., save a file to find out that some
part of the file is not saved. The double-mode and tri-
ple-mode defects require a combination of 2 and 3
actions or inputs in order to cause failure and get the
defect detected. With respect to the difficulty of detec-
tion, there is no clear difference between the approaches. From Table 6 we can see that ET found more defects in
all classes of detection difficulty. The most notable
differences were for mode 0 and mode 3 defects, for
which ET found 29% and 33% more defects than TCT.
However, the Mann-Whitney U test shows the differ-
ences to be statistically insignificant for all classes.
Table 6. Detection difficulty distribution
Mode          ET    TCT   ET/TCT   Total
0 = easiest   120   93    129 %    213
1             327   320   102 %    647
2             89    75    119 %    164
3 = hardest   20    15    133 %    35
Total         556   503   111 %    1059
Table 7 shows the defects categorized based on
their technical type. From the table we can see that
there are no radical differences in the number of de-
fects with different technical types. ET found 10%
more wrong function defects, 43% more GUI defects,
and 280% more usability problems than TCT.
Table 7. Technical type distribution
Type               ET    TCT   ET/TCT   Total
Documentation      8     4     200 %    12
GUI                70    49    143 %    119
Inconsistency      5     3     167 %    8
Missing function   98    96    102 %    194
Performance        39    41    95 %     80
Technical defect   54    66    82 %     120
Usability          19    5     380 %    24
Wrong function     263   239   110 %    502
Total              556   503   111 %    1059
However, for the usability defects, we must note
that the absolute numbers are very small. On the other
hand, TCT found 22% more technical defects. The
Mann-Whitney U test shows that the only significant
difference is for the usability defects (p = 0.006).
Table 8 shows the defects categorized based on their severities. From the table we can see that ET
found 64% more negligible defects, 32% more minor
defects, and 14% more normal defects. TCT found 5%
more severe and 2% more critical defects. The only
significant difference according to the Mann-Whitney
U test was for minor defects (p = 0.038).
Table 8. Severity distribution
Severity     ET    TCT   ET/TCT   Total
Negligible   23    14    164 %    37
Minor        98    74    132 %    172
Normal       231   203   114 %    434
Severe       153   160   96 %     313
Critical     51    52    98 %     103
Total        556   503   111 %    1059
We must emphasise that by using repeated Mann-
Whitney tests we are likely to come up with statisti-
cally significant values by chance. Thus, the reader
should be cautious with inferences based on the statis-
tically significant values for the defect type and sever-
ity classes presented in this section.
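The repeated-test caveat above can be made concrete with a simple Bonferroni adjustment on top of SciPy's Mann-Whitney U test. The defect counts below are made-up illustrative values, not the experiment's raw data:

```python
from scipy import stats

# Hypothetical per-tester defect counts in one defect class (invented data)
et_counts  = [3, 1, 4, 2, 5, 0, 3, 2, 4, 1]
tct_counts = [2, 1, 3, 0, 2, 1, 2, 3, 1, 0]

u, p = stats.mannwhitneyu(et_counts, tct_counts, alternative="two-sided")

# With k parallel comparisons (e.g., the eight technical-type classes of
# Table 7), a Bonferroni correction multiplies each raw p-value by k
# before comparing it against the 0.05 threshold.
k = 8
p_adj = min(1.0, p * k)
print(f"U = {u}, raw p = {p:.3f}, Bonferroni-adjusted p = {p_adj:.3f}")
```

Applied to the paper's own numbers, the usability result (p = 0.006) would survive a correction across the eight type classes (0.006 × 8 = 0.048), while the minor-defect result (p = 0.038) would not survive one across the five severity classes, which is exactly why caution is urged here.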
4.4. False defect reports
The data on false defect reports, meaning defect reports that were incomprehensible, duplicates, or reported a non-existent defect, are summarized in Table 9. TCT produced on average 1.05 more false reports than ET.
Due to a non-normal distribution, we used the Mann-Whitney U test, which showed that the effect of the testing approach is highly significant, with a two-tailed significance of 0.000.
Figure 3. False defect interaction effect (mean false defect reports for feature sets A and B)
Figure 3 illustrates the interaction between the effects of the testing approach and the feature set with respect to the false defect report count. From the figure we can see the main effect between ET and TCT, with ET producing fewer false reports. There is also an interaction effect, as TCT testers reported more false reports with feature set B than with feature set A.
Table 9. False defect counts per subject
Testing approach   Feature set   x̄      σ
ET                 A             1.00   1.396
ET                 B             1.05   1.191
ET                 Total         1.03   1.291
TCT                A             1.64   1.564
TCT                B             2.50   1.867
TCT                Total         2.08   1.767
x̄ = mean and σ = standard deviation
5. Discussion
This section summarizes the results and reflects the
findings in the light of existing research and knowl-
edge. Additionally, we outline the limitations of this
research as well as discuss future research.
5.1. Answering the research questions
5.1.1. Research question 1. How does using predes-
igned test cases affect the number of detected defects?
In this experiment, the subjects found fewer defects when using predesigned test cases. A statistical test showed that there is an 8.8% probability that this result was obtained by chance. Thus, the difference between
the two approaches was not statistically significant,
and does not allow rejecting the null hypothesis that
assumes there is no difference in the number of de-
tected defects when testing with or without test cases.
Although we cannot reject the null hypothesis, the
results strengthen the hypotheses of the possible bene-
fits of exploratory testing. Based on the results of this
study, we can conclude that an exploratory approach
could be efficient, especially considering the average 7
hours of effort the subjects used for test case design
activities. This means that testing with predesigned test cases in this study took on average 8.5 hours, whereas testing without test cases took on average 1.5 hours.
Still, the defect detection rates of the two approaches
were not different. The benefits of exploratory testing
have been proposed to be based on the experience and
skills of the testers. In this experiment, the subjects had
received some training regarding test case design tech-
niques, but did not have any specific techniques or
methods for exploratory testing. Thus, at least in the context of this experiment, the exploratory approach is more efficient, as no time is spent on creating test cases.
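The efficiency argument can be spelled out as a rough defects-per-effort-hour comparison, derived from the paper's aggregate numbers (556 ET vs. 503 TCT defects over 79 subjects; 1.5-hour execution sessions; roughly 7 hours of test case design for TCT only). This is an illustrative back-of-the-envelope calculation, not a figure reported by the study:

```python
# Rough defects-per-effort-hour comparison from the paper's aggregate numbers.
subjects = 79
et_rate  = (556 / subjects) / 1.5          # ET: execution effort only
tct_rate = (503 / subjects) / (7.0 + 1.5)  # TCT: design + execution effort
print(f"ET:  {et_rate:.2f} defects per effort hour")
print(f"TCT: {tct_rate:.2f} defects per effort hour")
```

Under these assumptions, ET yields several times more defects per effort hour, simply because the total effort differs while detection counts do not.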
5.1.2. Research question 2. How does using predes-
igned test cases affect the type of found defects? We
analyzed the differences in the types of the detected
defects from three viewpoints: severity, type, and de-
tection difficulty. Based on the data, we can conclude
that testers seem to find more of both the most obvious
defects, as well as the ones most difficult to detect
when testing without test cases. In terms of defect
type, the testers found more user interface defects and
usability problems without test cases. More technical
defects were found using test cases. When considering
defect severity, the data shows that more low severity
defects were found without test cases. The statistical
significance of the differences in all these defect char-
acterizations is low. We must be cautious of drawing
strong conclusions based on the defect classification
data even though the results show a significant differ-
ence in the numbers of usability and minor defects
detected between the two approaches.
The differences in the defect types and severities
suggest that testing without test cases tends to produce larger numbers of defects that are obvious to detect and
related to user interface and usability issues. These
differences could be explained by the fact that test cases are typically not written to test obvious features, and that writing good test cases for the many details of a graphical user interface is laborious and challenging. On the other hand, subjects testing without
test cases found more defects that were difficult to
detect, which supports the claims that exploratory test-
ing makes better use of testers' creativity and skills
during test execution. The higher number of low severity defects detected without test cases suggests that predesigned test cases guide the tester's attention to more focused areas and thus lead to ignoring some of the minor issues.
5.1.3. Research question 3. How does using predes-
igned test cases affect the number of false defect re-
ports? The purpose of this research question was to provide an understanding of the effects of the two approaches from the test reporting quality viewpoint. The data in Section 4.4 shows that testers reported around twice as many false defect reports (2.08 vs. 1.03) when testing with test cases as when testing without test cases. This difference is statistically significant.
This issue raises the more general question of the
consequences of following predesigned test cases in
manual test execution. Test cases are used to guide the
work of the tester and more studies are needed to better
understand how different ways of documenting tests
and guiding testers’ work affect their behaviour in per-
forming the tests and the results of testing efforts.
5.2. Limitations
The main threats to external validity of this study
are using students as subjects, the time-boxed testing
sessions, and variations in the applied testing tech-
niques. It is not obvious how the results of a student
experiment can be generalized to the industrial context,
but we have presented the data on the professional and
academic experience of our subjects in Section 3. The
subjects’ lack of testing experience might have af-
fected the quality of the test cases as well as the per-
formance in exploratory testing tasks.
In this experiment we had strictly time-boxed and
controlled testing sessions, which is good for internal
validity, but raises some questions about how typical
this kind of setting would be in industry. A restriction as strict as the 90-minute time-box might not be typical in industry, but a short calendar time for testing in general is a very typical restriction. Testing ap-
proaches that can adapt to testing time restrictions will
be highly relevant for the industry.
The subjects of the experiment were instructed to
use the trained black-box testing techniques for the test
case design, but we could not verify that the subjects actually applied those techniques properly. For the exploratory testing sessions, we cannot determine whether the subjects used the same testing principles that they used for designing the documented test cases or whether they explored the functionality in a purely ad hoc manner. For this reason, it is safer to assume the latter.
The threats to internal validity of this study include
the combined learning effect and the effect of the
tested feature set. We could not analyze how skilled our subjects were as test case designers, nor how much of the results was explained by the quality of the test cases as opposed to the actual test execution approach. In addition, it seems that not all subjects could execute all the test cases they had designed during the time-boxed session.
6. Conclusions and future work
This paper makes four contributions. First, we iden-
tify a lack of research on manual test execution from
other than the test case design point of view. It is obvi-
ous that focusing only on test case design techniques
does not cover many important aspects that affect man-
ual testing. Second, our data showed no benefit in
terms of defect detection efficiency of using predes-
igned test cases in comparison to an exploratory testing
approach. Third, there appear to be no large differences in the detected defect types, severities, or detection
difficulty. Fourth, our data indicates that test case
based testing produces more false defect reports.
Studying factors that affect defect detection effec-
tiveness and efficiency is an important direction for
future research. Most of the reported test case design techniques are based on theories for effectively revealing defects in software, but these have been studied only using predesigned and documented test cases.
More research is required to study the effect of predes-
igned test cases in comparison to other approaches to
manual testing.
Planning and designing test cases can provide many
other benefits besides defect detection efficiency, e.g.
benefits in test planning, traceability, test coverage,
repeatability and regression testing, tracking and con-
trolling the progress of testing efforts, and test report-
ing. Using an exploratory approach to testing instead
of predocumented test cases requires some other ap-
proach for planning, structuring, guiding and tracking
the testing efforts, e.g., session-based test management
[6, 22]. Approaches for managing exploratory testing
are a natural candidate for further research on this area.
In the inspection and review literature, a lot of re-
search focuses on review execution. Ways of perform-
ing inspection meetings and approaches to document
reading have been widely studied [4]. Similar ap-
proaches for manual testing have not been presented.
However, both reviewing and manual testing are hu-
man activities with the intent of revealing defects and
quality issues in the target artifact or software system.
These issues should be studied in the area of manual testing as well.
7. References
[1] Abrahamsson, P., J. Warsta, M. T. Siponen, and J. Ron-
kainen, "New Directions on Agile Methods: A Comparative
Analysis", in Proceedings of ICSE, 2003, pp. 244-254.
[2] Ahonen, J. J., T. Junttila, and M. Sakkinen, "Impacts of
the Organizational Model on Testing: Three Industrial
Cases", Empirical Software Engineering, vol. 9(4), 2004, pp.
[3] Andersson, C. and P. Runeson, "Verification and Valida-
tion in Industry - A Qualitative Survey on the State of Prac-
tice", in Proceedings of ISESE, 2002, pp. 37-47.
[4] Aurum, A., H. Petersson, and C. Wohlin, "State-of-the-
art: software inspections after 25 years", STVR, vol. 12(3),
2002, pp. 133-154.
[5] Bach, J., "Exploratory Testing", in The Testing Practitio-
ner, Second ed., E. van Veenendaal Ed., Den Bosch: UTN
Publishers, 2004, pp. 253-265.
[6] Bach, J., "Session-Based Test Management", STQE, vol.
2(6), 2000.
[7] Basili, V. R. and R. W. Selby, "Comparing the Effec-
tiveness of Software Testing Strategies", IEEE TSE, vol.
13(12), 1987, pp. 1278-1296.
[8] Beck, K., Test Driven Development by Example, Addi-
son-Wesley, Boston, 2003.
[9] Beck, K., "Embracing Change With Extreme Program-
ming", Computer, vol. 32(10), 1999, pp. 70-77.
[10] Beizer, B., Software Testing Techniques, Van Nostrand
Reinhold, New York, 1990.
[11] Berner, S., R. Weber, and R. K. Keller, "Observations
and Lessons Learned form Automated Testing", in Proceed-
ings of ICSE, 2005, pp. 571-579.
[12] Burnstein, I., Practical Software Testing, Springer-
Verlag, New York, 2003.
[13] Copeland, L., A Practitioner's Guide to Software Test
Design, Artech House Publishers, Boston, 2004.
[14] Fewster, M. and D. Graham, Software Test Automa-
tion, Addison-Wesley, Harlow, England, 1999.
[15] Houdek, F., T. Schwinn, and D. Ernst, "Defect Detec-
tion for Executable Specifications - An Experiment",
IJSEKE, vol. 12(6), 2002, pp. 637-655.
[16] Itkonen, J. and K. Rautiainen, "Exploratory Testing: A
Multiple Case Study", in Proceedings of ISESE, 2005, pp.
[17] Juristo, N. and A. M. Moreno, Basics of Software En-
gineering Experimentation, Kluwer Academic Publishers,
Boston, 2001.
[18] Juristo, N., A. M. Moreno, and S. Vegas, "Reviewing
25 years of Testing Technique Experiments", Empirical
Software Engineering, vol. 9(1-2), 2004, pp. 7-44.
[19] Kamsties, E. and C. Lott, "An Empirical Evaluation of
Three Defect-Detection Techniques", in Proceedings of the
5th ESEC, 1995.
[20] Kaner, C., J. Bach and B. Pettichord, Lessons Learned
in Software Testing, John Wiley & Sons, Inc., New York,
[21] Kitchenham, B. A., et al., "Preliminary guidelines for
empirical research in software engineering", IEEE TSE, vol.
28(8), 2002, pp. 721-734.
[22] Lyndsay, J. and N. van Eeden, "Adventures in Session-
Based Testing", 2003, Accessed 2007 06/26,
[23] Myers, G. J., The Art of Software Testing, John Wiley
& Sons, New York, 1979.
[24] Phalgune, A., C. Kissinger, M. M. Burnett, C. R. Cook,
L. Beckwith, and J. R. Ruthruff, "Garbage In, Garbage Out?
An Empirical Look at Oracle Mistakes by End-User Pro-
grammers", in IEEE Symposium on Visual Languages and
Human-Centric Computing, 2005, pp. 45-52.
[25] Rothermel, K., C. R. Cook, M. M. Burnett, J. Schon-
feld, Green, Thomas R. G., and G. Rothermel, "WYSIWYT
Testing in the Spreadsheet Paradigm: An Empirical Evalua-
tion", in Proceedings of ICSE, 2000, pp. 230-239.
[26] Våga, J. and S. Amland, "Managing High-Speed Web
Testing", in Software Quality and Software Testing in Inter-
net Times, D. Meyerhoff, B. Laibarra, van der Pouw
Kraan,Rob and A. Wallet Eds., Berlin: Springer-Verlag,
2002, pp. 23-30.
[27] Wood, M., et al., "Comparing and Combining Software
Defect Detection Techniques: A Replicated Empirical
Study", ACM SIGSOFT Software Engineering Notes, vol.
22(6), 1997, pp. 262-277.