The Use of Automatic Test Data Generation for
Genetic Improvement in a Live System
Saemundur O. Haraldsson, John R. Woodward and Alexander I.E. Brownlee
Department of Computing Science and Mathematics
University of Stirling
Stirling, Scotland
Email: soh@cs.stir.ac.uk
Abstract—In this paper we present a bespoke live system in
commercial use that has been implemented with self-improving
properties. During business hours it provides overview and
control for many specialists to simultaneously schedule and
observe the rehabilitation process for multiple clients. However,
in the evening, after the last user logs out, it starts a self-analysis
based on the day’s recorded interactions and the self-improving
process. It uses Search Based Software Testing (SBST) techniques
to generate test data for Genetic Improvement (GI) to fix any bugs
if exceptions have been recorded. The system has already been
under testing for 4 months and demonstrates the effectiveness of
simple test data generation and the power of GI for improving
live code.
Keywords—Search Based Software Engineering; Test data generation; Bug fixing; Real world application
I. INTRODUCTION
Genetic Improvement (GI) is a growing area within Search
Based Software Engineering (SBSE) [1] which uses compu-
tational search methods to improve existing software. When
improving programs, whether their functional or non-functional
properties, it is necessary to ensure that the enhanced version
of the program behaves correctly. Traditionally GI has used
testing rather than other formal verification methods for that
purpose [2]. Moreover, test cases have been used to evaluate the
improvements themselves [3]–[5]. GI and SBST are therefore natural
to use in conjunction, especially when the existing software has
limited test data and more test cases must be generated. The earliest
papers in the SBST literature mostly sought to generate such test
data automatically with search methods [6], which fits well with
GI's philosophy.
It is not uncommon to launch programs before they can be
completely tested, often because the number of conceivable use
case scenarios is huge and therefore impossible to cover in
practice. Instead, the application is put into use after a
reasonable amount of testing and the developer collects data
from the users, recording both performance and reliability.
The developer then puts effort and resources into maintaining
the software, regularly providing updates and patches throughout
its lifetime.
In this paper we present a live system, Janus Manager (JM),
that collects data only when user input produces errors and
uses it to generate test data for fixing itself. This significantly
decreases the cost of maintenance after the initial delivery of
the product. It is a bespoke program for a vocational rehabilitation
centre, developed and maintained by Janus Rehabilitation
Centre (JR) in Reykjavik, Iceland.
The remainder of the paper is structured as follows. Sec-
tion II lists some related work and inspirations. Section III
details what the system does during business hours and how it
keeps records for later generating test data. Section IV explains
how the daily data is used to generate and utilise test data
and Section V summarises the current data gathered since the
launch of JM. Section VI gives an overview of what future
directions we are currently contemplating.
II. RELATED WORK
The SBSE literature has expanded considerably [7] since
Harman and Jones coined the term [1] and with it the SBST
literature [6]. Moreover the research challenges for software
engineering for self-adaptive systems are regularly being rec-
ognized [8], [9].
Much of the SBST research has concerned the generation of test
data, such as the work of Beyene et al. [10], who generated string
test data with the objective of maximising code coverage.
The essential objective of the test data generation process
is maximum code coverage: every new instance of test data has
never been seen before and therefore might cover code that
previous test cases have not. This is still simple random
sampling of test data, not as sophisticated as uniform sampling
with Boltzmann samplers [11] or the searches of Feldt and
Poulding for data with specific properties [12], [13]. The
random search can also be replaced with an alternative such as
hill climbing [14], an Evolutionary Algorithm [15] or less
commonly used optimisation algorithms [16].
The majority of the generated test data in JM comes from
emulating actual inputs from a graphical user interface (GUI).
There are, however, examples of work that generate test data
for GUI testing by representing input fields with symbolic
alternatives [17]. That approach demands that the developer
know in greater detail how the software will be used, which in
our case is nearly impossible since every client's route through
the rehabilitation is unique.
Our setup is in practice a slower-working example of test
data and program co-evolution for bug fixing [18], [19], with
the addition that the usage evolves as well.
Fig. 1. JM functionality divided into daytime processes and night-time
processes.
III. JM DAILY ACTIVITY
JM is a software system that is developed by JR as a tool
in their vocational rehabilitation service. The motivation for
its development is to provide the best possible service to their
clients by giving the specialists a user-friendly management
tool. Moreover it is a tool for the directors to be able to
continuously improve the rehabilitation process with statistical
analysis of client data and performance of methods and
approaches. It has to manage multiple connections between
users, specialists and clients.
A. Usage
The left side of Figure 1 displays the daily routine of JM
and Figure 2 is a simplified map of currently possible usage
and features. The users are all employees of JR, over 40
in total, including both specialists and administrators. They
interact with JM by either requesting or providing data, which
is then processed and saved. The requests include, for example,
internal communications between the interdisciplinary team of
specialists about clients, a journal record from a meeting or an
update to some information regarding the client. The system
can also produce reports and bills in PDF format or as rich
text files.
The clients have access to specialised and standardised
questionnaires that measure various aspects of the clients'
welfare and progress. The specialists then use those
questionnaires to plan a treatment or therapy.
While all of this is happening, every time input data
causes an exception to be thrown, JM logs the trace, the input
data and the type of exception in a daily log file, shown in the
middle of Figure 1.
B. Structure
JR provides individualised vocational rehabilitation and
as such users of JM regularly encounter unique use cases.
Fig. 2. A simplified map of JM current features.
Therefore JM is in active development while in use.
Features are continuously added based on user experience,
feedback and convenience. Currently the system comprises over
25K lines of Python (300 classes and more than 600 functions).
JM runs as a web service on an Apache server on a 64-bit
Ubuntu machine with 48 GiB of RAM and two 6-core Intel
processors. The GUI is a web page that JM serves from
pre-defined templates.
IV. JM NIGHTLY ACTIVITY
After the last user logs off in the evening the nightly routine
in Figure 1 initiates. The process runs until the next morning
or until all bugs are fixed. During the night JM analyses the
logs, generates new test data and uses GI to fix bugs that have
been encountered during the day.
A. Log analysis
Going through the daily logs involves filtering the excep-
tions to obtain a set of unique errors in terms of input, type
and location. The input is defined as the argument list at every
function call on the trace route from the users’ request to the
location of the exception. The type of the exception can be
Procedure 1 Test data search
1: Θ ← [θ]    {Start with the original input}
2: n ← 0
3: Θnew ← [θ]
4: while (n < 1000) AND (|Θnew| != 0) do
5:     extend Θ with Θlatest
6:     Θlatest ← Θnew
7:     Θnew ← [ ]
8:     for i = 1 until i == 100 do
9:         θr ← random choice from Θlatest
10:        θmutated ← mutate θr
11:        if θmutated causes exception then
12:            append θmutated to Θnew
13:        end if
14:        n += 1
15:    end for
16: end while
any subclass of Exception in Python, both built-in and locally
defined.
The errors are sorted in decreasing order of importance,
giving higher significance to errors that occurred more often
and breaking ties arbitrarily. This measure of importance
assumes that frequent errors correspond to use case scenarios
that happen often and are experienced by multiple users, rather
than a single user who repeatedly submits the same request.
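As a minimal illustration, the de-duplication and ranking described above might look as follows in Python. The log-entry field names (`input`, `type`, `location`) are assumptions, since the paper does not specify JM's log format:

```python
from collections import Counter

def unique_errors(log_entries):
    """Filter the day's exceptions to unique errors and rank them.

    Each entry is assumed to be a dict with 'input', 'type' and
    'location' keys (hypothetical names). Uniqueness is in terms of
    input, type and location; more frequent errors rank higher and
    ties are broken arbitrarily.
    """
    counts = Counter(
        (repr(entry["input"]), entry["type"], entry["location"])
        for entry in log_entries
    )
    # most_common() yields (error, count) pairs, most frequent first
    return [error for error, _ in counts.most_common()]
```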
B. Generate test data
The test data generation is done with a simple random
search of the neighbourhood of the users' input data. The input
is represented by a Python dictionary object, where elements
are key-value pairs and the values can be of any type or
class. However, most values are strings, dates, times, integers
or floating point numbers. The objective of the search is to
find as many versions of the input data as possible that trigger
the same exception. Procedure 1 details the search for new
test data.
Starting with the original input θ we make 100 instances
of θmutated where a single value has been randomly changed.
For each instance the value to be mutated is randomly
selected while all other values are kept fixed. Every θmutated
that causes the same exception as the original is kept in
Θ, essentially given fitness 1; others are discarded. This is
then repeated by randomly sampling from the latest batch of
θmutated, Θlatest (see line 9), until either no new instances are
kept or the maximum of 1000 instances have been evaluated
(line 4).
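Procedure 1 can be sketched in Python roughly as below. This is an illustrative reconstruction, not JM's actual code: `mutate` and `triggers_same_exception` stand in for helpers that perturb an input and replay it against the program, which the paper does not show.

```python
import random

def search_test_data(theta, mutate, triggers_same_exception,
                     max_evals=1000, batch_size=100):
    """Random neighbourhood search for inputs triggering the same exception.

    Mirrors Procedure 1: survivors of each batch (fitness 1) seed the
    next batch until no new instances are kept or max_evals is reached.
    """
    kept = [theta]       # Θ: all inputs known to trigger the exception
    latest = [theta]     # Θlatest: the most recent batch of survivors
    n = 0
    while n < max_evals and latest:
        new = []         # Θnew
        for _ in range(batch_size):
            child = mutate(random.choice(latest))
            if triggers_same_exception(child):
                new.append(child)    # keep, fitness 1
            n += 1
        kept.extend(new)
        latest = new
    return kept
```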
The mutation mechanism in line 10 first chooses randomly
between key-value pairs in θr, only considering pairs whose
values are of type string, date, time, integer or float. Then,
depending on the type, the possible mutations are the following:
String  mutations randomly add strings from a predefined
        dictionary with white space and special characters,
        keeping the original as a sub-string.
Date    mutations can change the format (e.g. 2017-01-27
        becomes 27-01-17), change the separator or randomly
        pick a date within a year of the original.
Time    mutations can change the format (e.g. 7:00 PM
        becomes 19:00), change the separator or randomly pick
        a time within 24 hours of the original.
Int.    mutations add or subtract 1, 2 or 3 from the original.
Float   mutations change the original with a random sample
        from the standardised normal distribution N(0,1).
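The type-dependent mutations can be sketched as follows. This is a reconstruction under stated assumptions: the `SPECIALS` list is invented, and the format/separator mutations for dates and times are omitted, so only the value perturbations are shown.

```python
import datetime
import random

# Assumed dictionary of white space and special characters (illustrative).
SPECIALS = [" ", "\t", "%", "ö", "'", '"']

def mutate_value(value):
    """Perturb a single value according to its type (value mutations only)."""
    if isinstance(value, str):
        # keep the original as a sub-string, padded with special characters
        return random.choice(SPECIALS) + value + random.choice(SPECIALS)
    if isinstance(value, datetime.datetime):
        return value  # full datetimes are not handled in this sketch
    if isinstance(value, datetime.date):
        # pick a date within a year of the original
        return value + datetime.timedelta(days=random.randint(-365, 365))
    if isinstance(value, datetime.time):
        # pick a time within 24 hours of the original
        minutes = random.randrange(24 * 60)
        return datetime.time(minutes // 60, minutes % 60)
    if isinstance(value, bool):
        return value  # bools are ints in Python; left alone here
    if isinstance(value, int):
        return value + random.choice([-3, -2, -1, 1, 2, 3])
    if isinstance(value, float):
        # perturb with a sample from the standardised normal N(0, 1)
        return value + random.gauss(0.0, 1.0)
    return value

def mutate(theta):
    """Return a copy of the input dict with one mutable value changed."""
    mutable = [k for k, v in theta.items()
               if isinstance(v, (str, int, float, datetime.date, datetime.time))]
    child = dict(theta)
    key = random.choice(mutable)
    child[key] = mutate_value(child[key])
    return child
```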
All of the instances in Θ along with the original θ are then
the inputs of the new unit tests. The assertion for each of
them checks that the response is of the specific exception
type, and the test fails if the input triggers that exception.
The new unit tests are then added to the existing test suite,
automatically expanding the library of test cases.
The problem with this approach is that it does not check
whether new test cases are complementary, i.e. whether two or
more test cases are validating the same part of the code.
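A generated unit test could be assembled along the following lines; `make_regression_test` and its arguments are hypothetical, but they illustrate tests that fail while the exception is still raised and pass once a fix is in place.

```python
import unittest

def make_regression_test(handler, inputs, exc_type):
    """Build a TestCase with one test per stored input.

    `handler` is assumed to be the request-handling function that raised
    the exception, `inputs` the original θ plus the instances in Θ, and
    `exc_type` the recorded exception class.
    """
    class RegressionTest(unittest.TestCase):
        pass

    def add_test(index, theta):
        def test(self):
            try:
                handler(theta)
            except exc_type:
                self.fail(f"input {theta!r} still raises {exc_type.__name__}")
        setattr(RegressionTest, f"test_input_{index}", test)

    for index, theta in enumerate(inputs):
        add_test(index, theta)
    return RegressionTest
```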
C. Genetic Improvement
The GI part of the nightly process relies on the new test
cases in conjunction with a previously available test suite.
The assumption is that, given the test suites, the program is
functioning correctly if it passes all test cases, and it is
then awarded the highest fitness. Otherwise fitness is
proportional to the number of test cases the program passes
out of the whole suite.
The process is inspired by the work of Langdon et al. [3],
evolving edit lists that operate on the source code. The edit
lists define replace, delete and copy operations for code
snippets, lines and statements.
The evolution is population based with 50 edit lists in
each generation. Each generation is evaluated in parallel to
minimise GI’s execution time and to utilise the full power of
the server. Edit lists are selected in proportion to their fitness.
Only half of the population is selected; the selected lists
undergo mutation to start the next generation, and crossover is
not used in the current implementation. The other half of the subsequent
generation are randomly generated new edit lists.
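Under stated assumptions (the helper names are invented), one generation of this scheme might look like:

```python
import random

def next_generation(population, fitness, mutate_edits, new_edit_list,
                    pop_size=50):
    """One generation of the edit-list evolution (a sketch).

    Half the next generation comes from fitness-proportionate selection
    followed by mutation; crossover is not used. The other half are
    randomly generated new edit lists.
    """
    scores = [fitness(edits) for edits in population]
    total = sum(scores) or 1.0          # guard against all-zero fitness
    weights = [s / total for s in scores]
    half = pop_size // 2
    parents = random.choices(population, weights=weights, k=half)
    offspring = [mutate_edits(p) for p in parents]
    fresh = [new_edit_list() for _ in range(pop_size - half)]
    return offspring + fresh
```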
The GI stops either when it has found a program variant that
passes all tests or just before the users are expected to arrive
at work. It then produces an HTML report detailing the night's
process for the developers. The report lists all exceptions
encountered, new test cases and possible fixes, recommending
the fittest. If more than a single fix is found, the report
recommends the shortest in terms of number of edits. However,
it is always the developers' choice to implement the changes
as suggested, build on them or discard them.
V. SUMMARY
Development on JM started in March 2016 and quite
early on it was launched for general use in JR. Since late
September/early October 2016 the self-healing processes have
been running as a permanent service in JM. During that time
22 unique exceptions have been reported, always a single
error at a time. Table I lists exception types that have been
TABLE I
SUMMARY OF ENCOUNTERED EXCEPTION TYPES, THE NUMBER OF OCCURRENCES AND HOW MANY TEST CASES THEY PRODUCED

Exception type     | Number of occurrences | Mean total number of input test data produced | Mean number of complementary test cases
IndexError         | 4                     | 15.25                                         | 1.25
TypeError          | 6                     | 11.17                                         | 1.33
UnicodeDecodeError | 3                     | 44.33                                         | 1.67
ValueError         | 9                     | 16.33                                         | 1.11
encountered, the number of times for each type, how many test
cases it produced and how many of them were complementary.
Every single one of the exceptions revealed a bug in the
program that was subsequently fixed by the GI process. In total
408 test cases have been produced; however, a manual
post-processing step revealed that only 6% of those were testing
unique parts of the code. The most obvious example is a
UnicodeDecodeError caused by incorrect handling of a special
character in a string input. The generative method for strings,
described in Section IV, made multiple versions of a string
containing the same special character as a sub-string, and
therefore all of them invoked the same error.
VI. FUTURE WORK
The system introduced in this paper is fully implemented
and live; however, only 22 exceptions have been recorded
during the first few months. The total number of new test cases
is 408, of which 28 are unique in terms of code coverage,
which is not enough to make statistical inferences. Our next
steps are to monitor the system while it is developed
further and to gather data on the bugs that are caught and fixed.
Ideally we want to use the data to make predictions about
expected inputs to the system and thus make it possible to
generate test data that imitates unseen future inputs.
While a random search has been effective so far, we
would like to improve the process to find more unique test
cases per exception encountered. That involves implementing
a fitness function that is not binary, a better sampling
method, and constraints on the search to maximise code
coverage while minimising the number of test cases.
ACKNOWLEDGMENT
The work presented in this paper is part of the DAASE
project, which is funded by the EPSRC. The authors would like
to thank JR for the collaboration and for providing the platform
which made the development possible.
REFERENCES
[1] M. Harman and B. F. Jones, “Search-based software engineering,”
Information and Software Technology, vol. 43, no. 14, pp. 833–839,
dec 2001.
[2] C. Le Goues, T. Nguyen, S. Forrest, and W. Weimer, “GenProg: A
Generic Method for Automatic Software Repair,” IEEE Transactions
on Software Engineering, vol. 38, no. 1, pp. 54–72, 2012.
[3] W. B. Langdon and M. Harman, “Optimising Existing Software
with Genetic Programming,” IEEE Transactions on Evolutionary
Computation, vol. 19, no. 1, pp. 118–135, feb 2015.
[4] J. Petke, W. B. Langdon, and M. Harman, “Applying Genetic
Improvement to MiniSAT,” in 5th International Symposium, SSBSE
2013, ser. Lecture Notes in Computer Science, St. Petersburg, Russia:
Springer Berlin Heidelberg, aug 2013, pp. 257–262.
[5] J. Petke, M. Harman, W. B. Langdon, and W. Weimer, “Using Genetic
Improvement & Code Transplants to Specialise a C++ Program to a
Problem Class,” in 17th European Conference on Genetic Programming,
EuroGP 2014, ser. Lecture Notes in Computer Science, Granada,
Spain: Springer Berlin Heidelberg, 2014, pp. 137–149.
[6] P. McMinn, “Search-based software testing: Past, present and future,”
in 2011 IEEE Fourth International Conference on Software Testing,
Verification and Validation Workshops. IEEE, 2011, pp. 153–163.
[7] M. Harman, P. McMinn, J. T. de Souza, and S. Yoo, “Search Based
Software Engineering: Techniques, Taxonomy, Tutorial,” in Empirical
Software Engineering and Verification, ser. Lecture Notes in Computer
Science, Berlin, Heidelberg: Springer Berlin Heidelberg, 2012, vol.
7007, pp. 1–59.
[8] B. H. C. Cheng et al., “Software Engineering for Self-Adaptive
Systems: A Research Roadmap,” in Software Engineering for
SelfAdaptive Systems, ser. Lecture Notes in Computer Science, Berlin,
Heidelberg: Springer Berlin Heidelberg, 2009, vol. 5525, no. January,
pp. 1–26.
[9] R. de Lemos et al., “Software Engineering for Self-Adaptive Systems: A
Second Research Roadmap,” in Software Engineering for Self-Adaptive
Systems II: International Seminar, Dagstuhl Castle, Germany, October
24-29, 2010 Revised Selected and Invited Papers, ser. Lecture Notes
in Computer Science, Berlin, Heidelberg: Springer Berlin Heidelberg,
2013, vol. 7475, pp. 1–32.
[10] M. Beyene and J. H. Andrews, “Generating String Test Data for Code
Coverage,” in 2012 IEEE Fifth International Conference on Software
Testing, Verification and Validation. IEEE, 2012, pp. 270–279.
[11] P. Duchon and G. Louchard, “Boltzmann Samplers For The Random
Generation Of Combinatorial Structures,” Combinatorics Probability
and Computing, vol. 13, no. 4-5, pp. 577–625, 2004.
[12] R. Feldt and S. Poulding, “Finding Test Data with Specific Properties
via Metaheuristic Search,” in 2013 IEEE 24th International Symposium
on Software Reliability Engineering (ISSRE). IEEE, 2013, pp. 350–359.
[13] S. Poulding and R. Feldt, “Generating structured test data with specific
properties using Nested Monte-Carlo Search,” in Proceedings of the
2014 Annual Conference on Genetic and Evolutionary Computation.
Vancouver: ACM, 2014, pp. 1279–1286.
[14] F. C. M. Souza, M. Papadakis, Y. Le Traon, and M. E. Delamaro,
“Strong mutation-based test data generation using hill climbing,”
in Proceedings of the 9th International Workshop on Search-Based
Software Testing - SBST ’16. Austin, Texas: ACM Press, 2016, pp.
45–54.
[15] K. Lakhotia, M. Harman, and P. Mcminn, “A Multi-objective Approach
to Search-based Test Data Generation,” in Proceedings of the 9th Annual
Conference on Genetic and Evolutionary Computation, ser. GECCO ’07.
London, England: ACM, jul 2007, pp. 1098–1105.
[16] R. Feldt and S. Poulding, “Broadening the Search in Search-Based
Software Testing: It Need Not Be Evolutionary,” Proceedings - 8th
International Workshop on Search-Based Software Testing, SBST 2015,
pp. 1–7, 2015.
[17] K. Salvesen, J. P. Galeotti, F. Gross, G. Fraser, and A. Zeller, “Using
Dynamic Symbolic Execution to Generate Inputs in Search-Based GUI
Testing,” Proceedings - 8th International Workshop on Search-Based
Software Testing, SBST 2015, pp. 32–35, 2015.
[18] A. Arcuri, “On the Automation of Fixing Software Bugs,” in ICSE
Companion ’08 Companion of the 30th international conference on
Software engineering. Leipzig, Germany: ACM, 2008, pp. 1003–1006.
[19] A. Arcuri, D. R. White, J. Clark, and X. Yao, “Multi-Objective Im-
provement of Software using Co-evolution and Smart Seeding,” in
Proceedings of the 7th International Conference on Simulated Evolution
and Learning (SEAL’08), 2008, pp. 1–10.
... In our previous work we presented a GI framework that targeted Python programs. We successfully integrated it into a live system [7,9] and gave examples of the tness landscape for three Python programs [8]. The landscape with "number of test cases passed" as tness was shown to be largely at and often dropping from passing all tests to zero with a single edit. ...
... Although GI has been used to improve execution time there is limited number of publication on the analysis of the search landscape. Our previous work examined the landscape for three small Python programs [9], while other recent papers look at the robustness of software [15] and bug xing landscape of the triangle program [16]. ...
... The setup for the second part of the experiment is similar to that in our previous work [9]. However, as the evaluation of ProbAbel's execution time is computationally more expensive than any tness evaluation of a simple calculator or K-means initiation it is not feasible to do as thorough an analysis here. ...
Conference Paper
Full-text available
We present a Genetic Improvement (GI) experiment on ProbAbel, a piece of bioinformatics software for Genome Wide Association (GWA) studies. The GI framework used here has previously been successfully used on Python programs and can, with minimal adaptation , be used on source code written in other languages. We achieve improvements in execution time without the loss of accuracy in output while also exploring the vast tness landscape that the GI framework has to search. The runtime improvements achieved on smaller data set scale up for larger data sets. Our nd-ings are that for ProbAbel, the GI's execution time landscape is noisy but at. We also connrm that human written code is robust with respect to small edits to the source code. CCS CONCEPTS • Software and its engineering → Genetic programming; Software performance; Search-based software engineering;
... Our approach signi cantly decreases the cost of maintenance after the initial release but allows developers to ultimately have the control and provide a sanity check before patches are issued to the live software. JM is a bespoke program for a vocational rehabilitation centre, developed and maintained by Janus Rehabilitation (JR) in Reykjavik, Iceland [43,44] and has been brie y described in previously published work [18,19]. The remainder of the paper is structured as follows. ...
... Although non-functional properties have been a prominent target for GI to tackle, bug xing is by far the largest single problem in the GI literature [47] to be addressed. The work on bug xing has led to the development of the well known tool, GenProg [35] and the discrete tness function of passed test cases has motivated more fundamental work on GI search landscapes [19,32]. Functionality improvements also include growing and grafting [21,29,36], repairing and optimising the distribution of hashcode implementations [25] and prediction model improvements [14]. ...
... Generated test data in JM is an emulation of actual inputs from a graphical user interface (GUI) [19]. There are however examples of work that generate test data for GUI testing by representing input elds with symbolic alternatives [42]. ...
Conference Paper
Full-text available
We present a bespoke live system in commercial use with self-improving capability. During daytime business hours it provides an overview and control for many specialists to simultaneously schedule and observe the rehabilitation process for multiple clients. However in the evening, after the last user logs out, it starts a self-analysis based on the day's recorded interactions. It generates test data from the recorded interactions for Genetic Improvement to x any recorded bugs that have raised exceptions. The system has already been under test for over 6 months and has in that time identiied, located, and xed 22 bugs. No other bugs have been identiied by other methods during that time. It demonstrates the eeectiveness of simple test data generation and the ability of GI for improving live code. CCS CONCEPTS • Software and its engineering → Error handling and recovery ; Automatic programming; Maintaining software; Search-based software engineering; Empirical software validation;
... GI is an emerging field from Search Based Software Engineering [30] which uses computational search to improve existing software. Such as fixing bugs [16], [17], [31], [32], and reducing execution time [33]. Typically, GI uses Genetic Programming [34] as the search method but other search methods can be used. ...
... The bottom layer of Figure 1 is the GI procedure which updates the predictor with the new data. Its implementation has been described previously [16], [17], [32], [33]. In short, it evolves a population of edits that represent small changes to a program. ...
Conference Paper
Full-text available
With the expanding load on healthcare and consequent strain on budget, the demand for tools to increase efficiency in treatments is rising. The use of prediction models throughout the treatment to identify risk factors might be a solution. In this paper we present a novel implementation of a prediction tool and the first use of a dynamic predictor in vocational rehabilitation practice. The tool is periodically updated and improved with Genetic Improvement of software. The predictor has been in use for 10 months and is evaluated on predictions made during that time by comparing them with actual treatment outcome. The results show that the predictions have been consistently accurate throughout the patients' treatment. After approximately 3 week learning phase, the predictor classified patients with 100% accuracy and precision on previously unseen data. The predictor is currently being successfully used in a complex live system where specialists have used it to make informed decisions.
... The results of Langdon's work on improving software's runtime have been adopted by open source projects [19]. More recently, GI has seen commercial deployment in companies, including live bug fixing [8,10,45]. ...
Conference Paper
Genetic Improvement (GI) uses automated search to improve existing software. Most GI work has focused on empirical studies that successfully apply GI to improve software's running time, fix bugs, add new features, etc. There has been little research into why GI has been so successful. For example, genetic programming has been the most commonly applied search algorithm in GI. Is genetic programming the best choice for GI? Initial attempts to answer this question have explored GI's mutation search space. This paper summarises the work published on this question to date.
... The distinction from other related fields [7] is that GI starts with an existing and functional software [22] and improves some property of the program. GI is traditionally implemented with a Genetic Programming [23] search technique and has been used to fix bugs [24], [25], shorten execution time [26], [27], and other non-functional properties [28]. ...
Conference Paper
Full-text available
Adaptive systems will become increasingly important for health care in coming years as costs and workload grow. The need for efficient rehabilitation will expand which will be fulfilled by information technologies. This paper presents a novel implementation and application of a dynamic prediction software in vocational rehabilitation. The software is made adaptable with a Genetic Improvement of software methodology and utilised to predict fluctuations in patient's perceived quality of life. Results of accuracy, recall and precision were better than 90% for the classification of the shifts and the mean absolute error in predictions of the quantity of the shifts was low. The findings of the present study support that it is possible to predict fluctuations in quality of life on average based on the status six months prior. Professionals could therefore intervene accordingly and increase the possibility of successful rehabilitation. The significant long term effect on health care from applying the prediction tool might be reduced cost and overall improved quality of life.
Conference Paper
Search-based testing has been successfully applied to generate complex sequences of events for graphical user interfaces (GUIs), but typically relies on simple heuristics or random values for data widgets like text boxes. This may greatly reduce the effectiveness of test generation for applications which expect specific input values to be entered in their GUI by users. Generating such specific input values is one of the virtues of dynamic symbolic execution (DSE), but DSE is less suitable to generate sequences of events. Therefore, this paper describes a hybrid approach that uses search-based testing to generate sequences of events, and DSE to build input data for text boxes. This is achieved by replacing standard widgets in a system under test with symbolic ones, allowing us to execute GUIs symbolically. In this paper, we demonstrate an extension of the search-based GUI testing tool EXSYST, which uses DSE to successfully increase the obtained code coverage on two case study applications.
Conference Paper
Search-based software testing (SBST) can potentially help software practitioners create better test suites using less time and resources by employing powerful methods for search and optimization. However, research on SBST has typically focused on only a few search approaches and basic techniques. A majority of publications in recent years use some form of evolutionary search, typically a genetic algorithm, or, alternatively, some other optimization algorithm inspired from nature. This paper argues that SBST researchers and practitioners should not restrict themselves to a limited choice of search algorithms or approaches to optimization. To support our argument we empirically investigate three alternatives and compare them to the de facto SBST standards in regards to performance, resource efficiency and robustness on different test data generation problems: classic algorithms from the optimization literature, bayesian optimization with gaussian processes from machine learning, and nested monte carlo search from game playing / reinforcement learning. In all cases we show comparable and sometimes better performance than the current state-of-the-SBST-art. We conclude that SBST researchers should consider a more general set of solution approaches, more consider combinations and hybrid solutions and look to other areas for how to develop the field.
Conference Paper
Mutation Testing is an effective test criterion for finding faults and assessing the quality of a test suite. Every test criterion requires the generation of test cases, which turns to be a manual and difficult task. In literature, search-based techniques are effective in generating structural-based test data. This fact motivates their use for mutation testing. Thus, if automatic test data generation can achieve an acceptable level of mutation score, it has the potential to greatly reduce the involved manual effort. This paper proposes an automated test generation approach, using hill climbing, for strong mutation. It incremental aims at strongly killing mutants, by focusing on mutants' propagation, i.e., how to kill mutants that are weakly killed but not strongly. Furthermore, the paper reports empirical results regarding the cost and effectiveness of the proposed approach on a set of 18 C programs. Overall, for the majority of the studied programs, the proposed approach achieved a higher strong mutation score than random-testing, by 19,02% on average, and the previously proposed test generation techniques that ignore mutants' propagation, by 7,2% on average. Our results also demonstrate the improved efficiency of the proposed scheme over the previous methods.
Conference Paper
Genetic Programming (GP) has long been applied to several SBSE problems. Recently there has been much interest in using GP and its variants to solve demanding problems in which the code evolved by GP is intended for deployment. This paper investigates the application of genetic improvement to the challenging problem of improving a well-studied system: a Boolean satisfiability (SAT) solver called MiniSAT. Many programmers have tried to make this very popular solver even faster, and a separate SAT competition track has been created to facilitate this goal. Thus genetically improving MiniSAT poses a great challenge. Moreover, due to the wide range of applications of SAT solving technology, any improvement could have a great impact. Our initial results show that there is some room for improvement; however, a significantly more efficient version of MiniSAT is yet to be discovered.
Conference Paper
Genetic Improvement (GI) is a form of Genetic Programming that improves an existing program. We use GI to evolve a faster version of a C++ program, a Boolean satisfiability (SAT) solver called MiniSAT, specialising it for a particular problem class, namely Combinatorial Interaction Testing (CIT), using automated code transplantation. Our GI-evolved solver achieves an overall 17% improvement, making it comparable with average expert human performance. Additionally, this automatically evolved solver is faster than any of the human-improved solvers for the CIT problem.
Article
We show that the genetic improvement of programs (GIP) can scale by evolving increased performance in a widely-used and highly complex 50,000-line system. Genetic improvement of software for multiple objective exploration (GISMOE) found code that is 70 times faster (on average) and yet is at least as good functionally. Indeed, it even gives a small semantic gain.
Article
Software acting on complex data structures can be challenging to test: it is difficult to generate diverse test data that satisfies structural constraints while simultaneously exhibiting properties, such as a particular size, that the test engineer believes will be effective in detecting faults. In our previous work we introduced GödelTest, a framework for generating such data structures using non-deterministic programs, and combined it with Differential Evolution to optimize the generation process. Monte-Carlo Tree Search (MCTS) is a search technique that has shown great success in playing games that can be represented as a sequence of decisions. In this paper we apply Nested Monte-Carlo Search, a single-player variant of MCTS, to the sequence of decisions made by the generating programs used by GödelTest, and show that this combination can efficiently generate random data structures which exhibit the specific properties that the test engineer requires. We compare the results to Boltzmann sampling, an analytical approach to generating random combinatorial data structures.
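A minimal level-1 Nested Monte Carlo Search over a generator's decision sequence might look as follows. The alphabet, sequence length, and scored property are invented for illustration; they stand in for GödelTest's choice points and the property the test engineer requires:

```python
import random

CHOICES = "ab"   # hypothetical decisions available at each choice point
LENGTH = 12      # hypothetical length of the decision sequence

def score(seq):
    # Hypothetical property the test engineer wants to maximise.
    return seq.count("ab")

def playout(seq):
    """Level-0 search: finish the sequence with random decisions."""
    while len(seq) < LENGTH:
        seq += random.choice(CHOICES)
    return seq

def nmcs(level, seq=""):
    if level == 0:
        return playout(seq)
    while len(seq) < LENGTH:
        # Evaluate each available decision with a lower-level search,
        # then commit only the next decision of the best completion.
        completions = [nmcs(level - 1, seq + c) for c in CHOICES]
        best = max(completions, key=score)
        seq += best[len(seq)]
    return seq

data = nmcs(level=1)
```

Higher levels nest the same loop, trading more playouts for better-directed sequences — the single-player analogue of MCTS in games.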
Article
This paper claims that a new field of software engineering research and practice is emerging: search-based software engineering. The paper argues that software engineering is ideal for the application of metaheuristic search techniques, such as genetic algorithms, simulated annealing and tabu search. Such search-based techniques could provide solutions to the difficult problems of balancing competing (and sometimes inconsistent) constraints and may suggest ways of finding acceptable solutions in situations where perfect solutions are either theoretically impossible or practically infeasible. In order to develop the field of search-based software engineering, a reformulation of classic software engineering problems as search problems is required. The paper briefly sets out key ingredients for successful reformulation and evaluation criteria for search-based software engineering.
Conference Paper
For software testing to be effective, the test data should cover a large and diverse range of the possible input domain. Boltzmann samplers were recently introduced as a systematic method to randomly generate data with a range of sizes from combinatorial classes, and there are a number of automated testing frameworks that serve a similar purpose. However, size is only one of many possible properties that data generated for software testing should exhibit. For the testing of realistic software systems we also need to trade off between multiple different properties or search for specific instances of data that combine several properties. In this paper we propose a general search-based framework for finding test data with specific properties. In particular, we use a metaheuristic, differential evolution, to search for stochastic models for the data generator. Evaluation of the framework demonstrates that it is more general and flexible than existing solutions based on random sampling.
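The framework's core loop — a metaheuristic tuning the stochastic model that drives a generator — can be sketched with differential evolution over a single parameter. The generator, the target property, and all constants below are illustrative assumptions, not the paper's actual models:

```python
import random

TARGET_MEAN_SIZE = 20   # hypothetical property required of the test data

def generate(p, cap=50):
    """Hypothetical stochastic generator: p is the probability of
    growing the structure by one more element at each step."""
    size = 1
    while size < cap and random.random() < p:
        size += 1
    return size

def fitness(p):
    # How far the generator's average size is from the target property.
    samples = [generate(p) for _ in range(200)]
    return abs(sum(samples) / len(samples) - TARGET_MEAN_SIZE)

def differential_evolution(pop_size=10, generations=30, F=0.5, CR=0.9):
    pop = [random.random() for _ in range(pop_size)]
    for _ in range(generations):
        for i in range(pop_size):
            a, b, c = random.sample(
                [x for j, x in enumerate(pop) if j != i], 3)
            mutant = min(max(a + F * (b - c), 0.0), 1.0)
            trial = mutant if random.random() < CR else pop[i]
            if fitness(trial) <= fitness(pop[i]):
                pop[i] = trial       # greedy one-to-one replacement
    return min(pop, key=fitness)

best_p = differential_evolution()
```

Note that the search optimizes the model parameter `p`, not individual data items, so one evolved model can then emit unlimited fresh test data with the desired property.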