Conference Paper

EvoSpex: An Evolutionary Algorithm for Learning Postconditions


... To achieve this goal, we develop an approach for inferring inductive SL predicates by synthesising them from memory graphs, i.e., concrete examples of data structure memory layouts, as produced by programs that generate them. Our work is closely related to two research themes: (1) synthesising formal representations of data structures [8,28,44,73], and (2) using machine learning to infer data structure invariants [10,43,67]. Existing approaches either impose specific restrictions on the inputs of the synthesiser by, e.g., requiring functions constructing the data structure [44,73] or a large number of both positive and negative examples [43,67]; or produce weaker specifications, e.g., only inferring the structure shape, but not its properties [8,28]. ...
... Similar to ShaPE, that work only considers the shape relation without the data properties. DOrder [73] and EvoSpex [44] are two later works on learning data invariants from the constructors of data structures (in OCaml or Java). Locust [10] infers shape predicates from pre-defined definitions with statistical machine learning, with no completeness guarantees. ...
Preprint
We present an approach to automatically synthesise recursive predicates in Separation Logic (SL) from concrete data structure instances using Inductive Logic Programming (ILP) techniques. The main challenges to make such synthesis effective are (1) making it work without negative examples that are required in ILP but are difficult to construct for heap-based structures in an automated fashion, and (2) to be capable of summarising not just the shape of a heap (e.g., it is a linked list), but also the properties of the data it stores (e.g., it is a sorted linked list). We tackle these challenges with a new predicate learning algorithm. The key contributions of our work are (a) the formulation of ILP-based learning only using positive examples and (b) an algorithm that synthesises property-rich SL predicates from concrete memory graphs based on the positive-only learning. We show that our framework can efficiently and correctly synthesise SL predicates for structures that were beyond the reach of the state-of-the-art tools, including those featuring non-trivial payload constraints (e.g., binary search trees) and nested recursion (e.g., n-ary trees). We further extend the usability of our approach by a memory graph generator that produces positive heap examples from programs. Finally, we show how our approach facilitates deductive verification and synthesis of correct-by-construction code.
... Automated generation of test oracles is an active research area aimed at improving software quality assurance. Various methods, including machine learning techniques [17,4,14], evolutionary algorithms [13], and statistical data analysis [5,1], have been applied to this domain. A common approach for test oracle generation involves the detection of likely invariants, i.e., properties of a program that should always hold true during execution [5,1]. ...
... Program invariants play a crucial role in formal verification, debugging, and software testing by providing additional specifications that can be automatically checked. However, many software systems lack these specifications, and those supplied by programmers are often outdated, ineffective, or incorrect [5,13]. To address this gap, various techniques have been developed to automatically detect invariants. ...
... Recent work by Alonso et al. extends Daikon to detect invariants in REST APIs [1]. Another approach by Molina et al. generates program executions for Java methods and applies a genetic algorithm to identify post-conditions that hold at the end of method executions [13]. While dynamic techniques can discover more nuanced, data-driven invariants, they are prone to generating false positives, especially when the execution data lacks sufficient variety [5,1]. ...
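To make the above concrete, here is a minimal sketch (not any of the cited tools) of how likely invariants are detected from execution data: candidate properties are evaluated over recorded input/result pairs, and only the never-falsified ones survive. The method under test (Math.abs), the candidate set, and the inputs are illustrative assumptions; a spurious candidate such as `result == input` would survive if the executions lacked negative inputs, which is exactly the false-positive issue mentioned above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiPredicate;

// Minimal sketch of dynamic likely-invariant detection: keep the candidate
// postconditions that hold on every observed (input, result) pair.
public class LikelyInvariantSketch {
    public static void main(String[] args) {
        // Candidate postconditions over (input, result); names are illustrative only.
        Map<String, BiPredicate<Integer, Integer>> candidates = new LinkedHashMap<>();
        candidates.put("result >= 0", (in, out) -> out >= 0);
        candidates.put("result >= input", (in, out) -> out >= in);
        candidates.put("result == input", (in, out) -> out.equals(in)); // spurious if inputs lack variety

        // Recorded executions of the method under test (here: Math.abs).
        int[] inputs = {0, 3, 7, -2, -9};
        List<int[]> executions = new ArrayList<>();
        for (int in : inputs) {
            executions.add(new int[]{in, Math.abs(in)});
        }

        // A candidate becomes a "likely invariant" if no execution falsifies it.
        for (Map.Entry<String, BiPredicate<Integer, Integer>> c : candidates.entrySet()) {
            boolean holds = executions.stream().allMatch(e -> c.getValue().test(e[0], e[1]));
            System.out.println(c.getKey() + " : " + (holds ? "likely invariant" : "falsified"));
        }
    }
}
```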
Preprint
Full-text available
Automated generation of test oracles is a critical area of research in software quality assurance. One effective technique is the detection of invariants by analyzing dynamic execution data. A common challenge of these approaches, however, is the detection of false-positive invariants. This paper investigates the potential of Large Language Models (LLMs) to assist in filtering these dynamically detected invariants, aiming to reduce the manual effort involved in discarding incorrect invariants. We conducted experiments using various GPT models from OpenAI, leveraging a dataset of invariants detected from the dynamic execution of REST APIs. By employing a Zero-shot Chain-of-Thought Prompting methodology, we guided the LLMs to articulate the reasoning behind their decisions. Our findings indicate that classification performance improves with detailed instructions and strategic prompt design (the best model achieving on average 80.7% accuracy), with some performance differences between different invariant types.
... There are two important considerations to make: First, differently from program assertions [25] and pre/post conditions [26], a single MR usually does not predicate on all possible program inputs and thus can hardly achieve zero FNs. An MR only predicates on those inputs that satisfy the input relation. ...
... For this reason, outputting multiple MRs with different input relations is useful as they might complement each other on the types of faults they can detect. Second, as in Terragni et al. [25] and Molina et al. [26], we consider the implemented program behavior for false positives, which might differ from the intended one. As such, GENMORPH might need a manual validation of the generated MRs to ensure that they capture the intended program behavior. ...
... GASSERT [25] and EVOSPEX [26] generate oracles with evolutionary algorithms driven by both FPs and FNs. However, they target program assertions and invariants, respectively, and they cannot be easily adapted to target MRs. ...
Article
Full-text available
Metamorphic testing is a popular approach that aims to alleviate the oracle problem in software testing. At the core of this approach are Metamorphic Relations (MRs), specifying properties that hold among multiple test inputs and corresponding outputs. Deriving MRs is mostly a manual activity, since their automated generation is a challenging and largely unexplored problem. This paper presents GenMorph, a technique to automatically generate MRs for Java methods that involve inputs and outputs that are boolean, numerical, or ordered sequences. GenMorph uses an evolutionary algorithm to search for effective test oracles, i.e., oracles that trigger no false alarms and expose software faults in the method under test. The proposed search algorithm is guided by two fitness functions that measure the number of false alarms and the number of missed faults for the generated MRs. Our results show that GenMorph generates effective MRs for 18 out of 23 methods (mutation score >20%). Furthermore, it can increase Randoop's fault detection capability in 7 out of 23 methods, and Evosuite's in 14 out of 23 methods. When compared with AUTOMR, a state-of-the-art MR generator, GenMorph also outperformed its fault detection capability in 9 out of 10 methods.
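The two fitness functions mentioned in the abstract can be pictured as follows. This is a minimal sketch under assumed names and a toy metamorphic relation for Math.abs, not GenMorph's implementation: a candidate MR is penalized once for each false alarm on the original program and once for each mutant it never exposes.

```java
import java.util.List;
import java.util.function.IntUnaryOperator;

// Minimal sketch of the two fitness signals used to guide the search for a
// metamorphic relation (MR): false alarms on the original program and missed
// faults on mutants. The names and the toy MR are illustrative assumptions.
public class MrFitnessSketch {

    // Candidate MR for a function f: for input relation x' = -x,
    // the output relation f(x') == f(x) should hold (true for abs).
    static boolean mrHolds(IntUnaryOperator f, int x) {
        return f.applyAsInt(-x) == f.applyAsInt(x);
    }

    // Fitness 1: number of false alarms, i.e., violations on the original program.
    static long falseAlarms(IntUnaryOperator original, List<Integer> inputs) {
        return inputs.stream().filter(x -> !mrHolds(original, x)).count();
    }

    // Fitness 2: number of missed faults, i.e., mutants on which the MR never fails.
    static long missedFaults(List<IntUnaryOperator> mutants, List<Integer> inputs) {
        return mutants.stream()
                .filter(m -> inputs.stream().allMatch(x -> mrHolds(m, x)))
                .count();
    }

    public static void main(String[] args) {
        IntUnaryOperator original = Math::abs;
        List<IntUnaryOperator> mutants = List.of(
                x -> x,            // drops the negation branch: caught by the MR
                x -> -Math.abs(x)  // flips the sign: not caught, since |-x| == |x| still holds
        );
        List<Integer> inputs = List.of(-3, -1, 0, 2, 5);
        System.out.println("false alarms : " + falseAlarms(original, inputs)); // 0
        System.out.println("missed faults: " + missedFaults(mutants, inputs)); // 1
    }
}
```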
... Various approaches have been proposed to address the oracle problem by automatically deriving different kinds of oracles [1,3,5,9,10,22,23,35,40], including test assertions, contracts (such as pre/postconditions and invariants) or metamorphic relations. Generally, these approaches observe some artifact related to the SUT (documentation, comments, source code, executions) and then derive oracles that are consistent with the observations. ...
... Generally, these approaches observe some artifact related to the SUT (documentation, comments, source code, executions) and then derive oracles that are consistent with the observations. For example, TOGA [9] observes the source code of a target test and a focal method (method under test) and infers a test assertion for the given test; MeMo [5] extracts metamorphic relations by observing natural language comments in the source code; Daikon [10] and related tools [22,23,35] observe the behavior of the SUT (from a set of tests) in order to infer class invariants and pre/postconditions. ...
... Daikon [10], a well-known tool for dynamic invariant detection, produces pre/postconditions and class invariants using its own language (a mix of Java and mathematical logic). Other contract inference tools also use their own languages [22,23,35]. ...
Preprint
Full-text available
The effectiveness of a test suite in detecting faults highly depends on the correctness and completeness of its test oracles. Large Language Models (LLMs) have already demonstrated remarkable proficiency in tackling diverse software testing tasks, such as automated test generation and program repair. This paper aims to enable discussions on the potential of using LLMs for test oracle automation, along with the challenges that may emerge during the generation of various types of oracles. Additionally, our aim is to initiate discussions on the primary threats that SE researchers must consider when employing LLMs for oracle automation, encompassing concerns regarding oracle deficiencies and data leakages.
... Approaches to dynamic assertion learning generalize from observations, e.g., object states, to synthesize assertions such as preconditions and class invariants. Related tools include Daikon [8], Proviso [2], Hanoi [22], and EvoSpex [25]. Daikon observes program states during execution and uses templates to obtain a set of candidate assertions, including class invariants, that hold at certain program locations. ...
... We derive assertions to distinguish valid from invalid objects using a grammar. In contrast to related approaches [25,30], our objects are guaranteed to be (in)valid. Table 1 shows the execution state of our approach in each iteration when learning class invariant w = h ∧ w > 0 for our running example SimpleSquare. ...
... To enforce termination, we bound the random walk with respect to the number of walks and the number of method calls per walk. To ensure deterministic behavior, one may either randomly select a builder/action using a fixed seed (like Randoop [26]) or exhaustively explore all builder/action combinations up to a given depth (like EvoSpex [25]). Thus, not finding a valid object that is misclassified as invalid by the candidate class invariant does not guarantee the absence of one. ...
Chapter
Full-text available
Maintaining software is cumbersome when method argument constraints are undocumented. To reveal them, previous work learned preconditions from exemplary valid and invalid method arguments. In practice, it would be highly beneficial to know class invariants, too, because functionality added during software maintenance must not break them. Even more so than method preconditions, class invariants are rarely documented and often cannot completely be inferred automatically, especially for objects exhibiting complex state such as dynamic data structures. This paper presents a novel dynamic approach to learning class invariants, thereby complementing related work on learning method preconditions. We automatically synthesize assertions from an adjustable assertion grammar to distinguish valid and invalid objects. While random walks generate valid objects, a combination of bounded-exhaustive testing techniques and behavioral oracles yield invalid objects. The utility of our approach for code comprehension and software maintenance is demonstrated by comparing our learned invariants to documented invariant validation methods found in real-world Java classes and to the invariants detected by the Daikon tool.
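A minimal sketch of the classification step this approach relies on, using the SimpleSquare running example mentioned in the excerpt above (class invariant w == h && w > 0); the class, the valid/invalid objects, and the candidate assertions are reconstructed here purely for illustration.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Minimal sketch of classifying candidate class invariants against valid and
// invalid objects. SimpleSquare and the candidate set are illustrative only.
public class InvariantClassificationSketch {

    record SimpleSquare(int w, int h) {}

    public static void main(String[] args) {
        // Valid objects, e.g., as produced by random walks over the class API.
        List<SimpleSquare> valid = List.of(new SimpleSquare(1, 1), new SimpleSquare(4, 4));
        // Invalid objects, e.g., rejected by a behavioural oracle.
        List<SimpleSquare> invalid = List.of(new SimpleSquare(2, 3), new SimpleSquare(0, 0));

        // Candidate assertions drawn from an assertion grammar (illustrative).
        Map<String, Predicate<SimpleSquare>> candidates = Map.of(
                "w == h",          s -> s.w() == s.h(),
                "w > 0",           s -> s.w() > 0,
                "w == h && w > 0", s -> s.w() == s.h() && s.w() > 0);

        // Keep candidates that accept every valid object and reject every invalid one.
        candidates.forEach((name, p) -> {
            boolean separates = valid.stream().allMatch(p) && invalid.stream().noneMatch(p);
            System.out.println(name + " : " + (separates ? "distinguishes valid from invalid" : "rejected"));
        });
    }
}
```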
... The inferred specifications can be used as an oracle [4] that distinguishes correct/intended program behaviour from incorrect/unintended program behaviour. In recent years, the specification inference problem has gained increasing attention from the software engineering community, leading to the proposal of various tools and techniques for automated specification inference (some examples are Daikon [12], Jdoctor [6], GAssert [38], EvoSpex [26], and SpecFuzzer [25]). ...
... This language is similar in expressive power to the Java Modeling Language JML [8], and includes the usual relational, arithmetic and logical operators, as well as first-order quantification through the universal and existential quantifiers. The language's expressiveness is influenced by the expressive powers of the languages in other specification inference techniques, and in contract languages [8], [12], [26], [38]. ...
... Specification inference is an active area of research. Besides the techniques that infer contract assertions, such as SpecFuzzer [25], Daikon [12], Jdoctor [6], GAssert [38] and EvoSpex [26], there are also other approaches focusing on generating other kinds of specifications, or from other sources. For instance, some approaches monitor software behaviour and attempt to infer test oracles [13], [40] (that is, assertions that are only valid in specific unit tests), while others rely on modern machine learning models to statically generate context-dependent test oracles [11]. ...
Preprint
Full-text available
In this paper, we propose an assertion-based approach to capture software evolution, through the notion of commit-relevant specification. A commit-relevant specification summarises the program properties that have changed as a consequence of a commit (understood as a specific software modification), via two sets of assertions: the delta-added assertions, properties that did not hold in the pre-commit version but hold in the post-commit version, and the delta-removed assertions, those that were valid in the pre-commit version but no longer hold after the code change. We also present DeltaSpec, an approach that combines test generation and dynamic specification inference to automatically compute commit-relevant specifications from given commits. We evaluate DeltaSpec on two datasets that include a total of 57 commits (63 classes and 797 methods). We show that commit-relevant assertions can precisely describe the semantic deltas of code changes, providing a useful mechanism for validating the behavioural evolution of software. We also show that DeltaSpec can infer 88% of the manually written commit-relevant assertions expressible in the language supported by the tool. Moreover, our experiments demonstrate that DeltaSpec's inferred assertions are effective to detect regression faults. More precisely, we show that commit-relevant assertions can detect, on average, 78.3% of the artificially seeded faults that interact with the code changes. We also show that assertions in the delta are 58.3% more effective in detecting commit-relevant mutants than assertions outside the delta, and that it takes on average 169% fewer assertions when these are commit-relevant, compared to using general valid assertions, to achieve the same commit-relevant mutation score.
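The delta itself reduces to two set differences over the assertions inferred for the two program versions. A minimal sketch with placeholder assertion strings rather than DeltaSpec's inferred specifications:

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Minimal sketch of a commit-relevant specification as two set differences:
// delta-added assertions hold only after the commit, delta-removed assertions
// held only before it. The assertion strings are illustrative placeholders.
public class CommitRelevantSpecSketch {

    static Set<String> difference(Set<String> a, Set<String> b) {
        Set<String> result = new LinkedHashSet<>(a);
        result.removeAll(b);
        return result;
    }

    public static void main(String[] args) {
        // Assertions inferred to hold on the pre-commit and post-commit versions.
        Set<String> preCommit  = Set.of("result >= 0", "result <= max(a, b)");
        Set<String> postCommit = Set.of("result >= 0", "result == max(a, b)");

        System.out.println("delta-added  : " + difference(postCommit, preCommit));
        System.out.println("delta-removed: " + difference(preCommit, postCommit));
    }
}
```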
... To address this issue, specification inference techniques aim at automatically inferring assertions for specific program points that capture the exhibited software behaviour [34], [35], [44]. These techniques evolve candidate assertions and use dynamic test executions to determine which of those assertions are consistent with the behaviours exhibited by a provided test suite, and mutation testing to discard ineffective/weak assertions that are unable to detect any artificially seeded fault (mutant), i.e., assertions never falsified during mutant's execution. ...
... We implement AIMS and evaluate its ability to predict Assertion Inferring Mutants on a large set of 46 programs, composed of 40 taken from previous studies [34], [35], [44] and 6 large Maven projects taken from GitHub to evaluate scalability. Our results demonstrate that AIMS can statically select Assertion Inferring Mutants with 0.79 Precision and 0.49 Recall, overall yielding 0.58 MCC. ...
... In this paper, we focus on postcondition based assertions that define the expected properties that must hold at the end of a given function's execution. Figure 1 depicts the process that existing assertion inference techniques ( [34], [35], [44]) follow to infer assertions. First, based on an assertion generation approach utilized (e.g. ...
Preprint
Full-text available
Specification inference techniques aim at (automatically) inferring a set of assertions that capture the exhibited software behaviour by generating and filtering assertions through dynamic test executions and mutation testing. Although powerful, such techniques are computationally expensive due to a large number of assertions, test cases and mutated versions that need to be executed. To overcome this issue, we demonstrate that a small subset, i.e., 12.95% of the mutants used by mutation testing tools, is sufficient for assertion inference; this subset is significantly different, i.e., 71.59% different, from the subsuming mutant set that is frequently cited by mutation testing literature, and can be statically approximated through a learning based method. In particular, we propose AIMS, an approach that selects Assertion Inferring Mutants, i.e., a set of mutants that are well-suited for assertion inference, with 0.58 MCC, 0.79 Precision, and 0.49 Recall. We evaluate AIMS on 46 programs and demonstrate that it has comparable inference capabilities with full mutation analysis (misses 12.49% of assertions) while significantly limiting execution cost (runs 46.29 times faster). A comparison with randomly selected sets of mutants shows the superiority of AIMS by inferring 36% more assertions while requiring an approximately equal amount of execution time. We also show that AIMS's inferring capabilities are almost complete, as it infers 96.15% of ground truth assertions (i.e., a complete set of assertions that were manually constructed) while Random Mutant Selection infers 19.23% of them. More importantly, AIMS enables assertion inference techniques to scale on subjects where full mutation testing is prohibitively expensive and Random Mutant Selection does not lead to any assertion.
... Techniques for inferring class specifications exist [7,14,31,38], but their expressiveness is limited. Daikon [14], the baseline that other techniques use, supports a restricted set of templates, from which assertions are generated. ...
... It is then limited to simple assertions (e.g., no direct support for quantification), or requires the developer to manually extend the assertion language. GAssert [38] and EvoSpex [31], two recently proposed techniques for contract inference, try to address this limitation of Daikon by supporting more expressive assertion languages, but their extensions focus on specific kinds of constraints: GAssert focuses on logical/arithmetic constraints (no quantified expressions) and EvoSpex focuses on object navigation constraints (only very simple logical and arithmetic operators are supported). Moreover, as both techniques are based on evolutionary search, they are difficult to extend or adapt to support further expressions, as the evolutionary algorithms are targeted for the specific languages supported by the corresponding tools. ...
... We compared SpecFuzzer with GAssert [38] and EvoSpex [31], which are the state of the art tool-supported techniques in specification inference today. To evaluate SpecFuzzer, we used the same benchmarks from the evaluation of GAssert and EvoSpex, carefully studied the subjects, and manually produced corresponding "ground truth" assertions capturing the intended behavior of the subjects. ...
Preprint
Full-text available
Expressing class specifications via executable constraints is important for various software engineering tasks such as test generation, bug finding and automated debugging, but developers rarely write them. Techniques that infer specifications from code exist to fill this gap, but they are designed to support specific kinds of assertions and are difficult to adapt to support different assertion languages, e.g., to add support for quantification, or additional comparison operators, such as membership or containment. To address the above issue, we present SpecFuzzer, a novel technique that combines grammar-based fuzzing, dynamic invariant detection, and mutation analysis, to automatically produce class specifications. SpecFuzzer uses: (i) a fuzzer as a generator of candidate assertions derived from a grammar that is automatically obtained from the class definition; (ii) a dynamic invariant detector -- Daikon -- to filter out assertions invalidated by a test suite; and (iii) a mutation-based mechanism to cluster and rank assertions, so that similar constraints are grouped and the stronger ones are then prioritized. Grammar-based fuzzing enables SpecFuzzer to be straightforwardly adapted to support different specification languages, by manipulating the fuzzing grammar, e.g., to include additional operators. We evaluate our technique on a benchmark of 43 Java methods employed in the evaluation of the state-of-the-art techniques GAssert and EvoSpex. Our results show that SpecFuzzer can easily support a more expressive assertion language, over which it is more effective than GAssert and EvoSpex in inferring specifications, according to standard performance metrics.
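The first two ingredients (grammar-based generation of candidate assertions and filtering against a test suite) can be sketched as follows; the toy grammar, the method under test, and the execution data are assumptions for illustration, and the mutation-based clustering/ranking step is omitted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiPredicate;

// Minimal sketch of the pipeline described above: enumerate candidate
// assertions from a tiny grammar ("<var> <op> <var>" over a method's input
// and result), then discard the ones invalidated by observed executions.
// The grammar and the method under test (max(x, 0)) are illustrative only.
public class GrammarCandidateSketch {

    record Candidate(String text, BiPredicate<Integer, Integer> check) {}

    public static void main(String[] args) {
        String[] vars = {"input", "result"};
        String[] ops = {"<=", "==", ">="};

        // "Fuzzing" step: enumerate every sentence of the toy grammar.
        List<Candidate> candidates = new ArrayList<>();
        for (String left : vars) {
            for (String op : ops) {
                for (String right : vars) {
                    candidates.add(new Candidate(left + " " + op + " " + right,
                            (in, out) -> compare(value(left, in, out), op, value(right, in, out))));
                }
            }
        }

        // Filtering step: keep candidates consistent with executions of max(x, 0).
        int[] inputs = {-4, -1, 0, 2, 7};
        List<Candidate> surviving = candidates.stream()
                .filter(c -> {
                    for (int in : inputs) {
                        if (!c.check().test(in, Math.max(in, 0))) return false;
                    }
                    return true;
                })
                .toList();

        surviving.forEach(c -> System.out.println("likely assertion: " + c.text()));
    }

    static int value(String var, int input, int result) {
        return var.equals("input") ? input : result;
    }

    static boolean compare(int a, String op, int b) {
        return switch (op) {
            case "<=" -> a <= b;
            case "==" -> a == b;
            default -> a >= b;
        };
    }
}
```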
... Evolutionary algorithms are one of the categories that mimic the notion of natural evolution. They include evolution strategies (ES) [8], genetic algorithms (GA) [9], [10], genetic programming (GP) [11], DE [12], neuroevolution [13], biogeography-based optimization (BBO) [14], the learner performance-based (LPB) optimizer [15], Evospex [16], the artificial spider algorithm [17], and the group search optimizer [18]. ...
Article
Many challenges are involved in solving mechanical design optimization problems related to the real-world, such as conflicting objectives, assorted design variables, discrete search space, intuitive flaws, and many locally optimal solutions. A comparison of algorithms on a given set of problems can provide us with insights into their performance, finding the best one to use, and potential improvements needed in their mechanisms to ensure maximum performance. This motivated our attempts to comprehensively compare eight recent meta-heuristics on 15 mechanical engineering design problems. Algorithms considered are water wave optimizer (WWO), butterfly optimization algorithm (BOA), Henry gas solubility optimizer (HGSO), Harris Hawks optimizer (HHO), ant lion optimizer (ALO), whale optimization algorithm (WOA), sine–cosine algorithm (SCA) and dragonfly algorithm (DA). Comparative performance analysis is based on the solution trait obtained from statistical tests and convergence plots. The results demonstrate the wide range of adaptability of considered algorithms for future applications.
... The automated generation of test oracles is an active research topic. Existing approaches mostly differ on the inputs from which test oracles are generated, including source code [28,77], program specifications [34,47], documentation [20,40], and previous executions [61,63], among others. A common approach for test oracle generation is the detection of likely invariants, properties of the program that should always hold. ...
... Automated test case generation techniques can be classified based on their inputs, and their application domains. Regarding their inputs, test oracles have been derived from source code [28,60,77,80], formal specifications [34,47], semi-structured documentation [20,21,40,81], previous program executions [23,24,46,49,61-63,73], or a combination of them. Application domains include Java projects [20,28,61], machine learning programs [22] and cyber-physical systems [17], among others. ...
... The accuracy of the invariants inferred with these methods depends on both the quality and completeness of the test cases and the collection of potential invariants provided. EvoSpex [37] overcomes the limitations of the previous techniques by exercising the unit under test through its APIs to generate valid pre- and post-states, without requiring any specification or test. The valid pre- and post-states undergo mutations that lead to corresponding invalid pre- and post-states, and a genetic algorithm infers valid postconditions guided by the valid/invalid states. ...
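The excerpt above summarizes the search objective; a minimal sketch of that kind of fitness signal is shown below. It only illustrates the idea of rewarding candidates that accept valid post-states and reject mutated, invalid ones; it is not EvoSpex's actual state representation, assertion language, or algorithm.

```java
import java.util.List;
import java.util.function.IntPredicate;

// Minimal sketch of a fitness function guided by valid and invalid states: a
// candidate postcondition over a method's result is rewarded for accepting
// valid post-states and for rejecting the mutated, invalid ones.
public class PostconditionFitnessSketch {

    // Fitness = (# valid states accepted) + (# invalid states rejected).
    static int fitness(IntPredicate candidate, List<Integer> validResults, List<Integer> invalidResults) {
        long accepted = validResults.stream().filter(candidate::test).count();
        long rejected = invalidResults.stream().filter(r -> !candidate.test(r)).count();
        return (int) (accepted + rejected);
    }

    public static void main(String[] args) {
        // Valid post-states: results of abs() on inputs generated through the API.
        List<Integer> valid = List.of(Math.abs(-5), Math.abs(0), Math.abs(7));   // 5, 0, 7
        // Invalid post-states: mutations of the valid ones (e.g., sign flips).
        List<Integer> invalid = List.of(-5, -7, -1);

        IntPredicate weak = r -> r > -10;   // accepts almost everything
        IntPredicate strong = r -> r >= 0;  // the intended postcondition of abs()

        System.out.println("fitness(result > -10): " + fitness(weak, valid, invalid));   // 3 + 0 = 3
        System.out.println("fitness(result >= 0) : " + fitness(strong, valid, invalid)); // 3 + 3 = 6
    }
}
```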
Preprint
This paper presents Tratto, a neuro-symbolic approach that generates assertions (boolean expressions) that can serve as axiomatic oracles, from source code and documentation. The symbolic module of Tratto takes advantage of the grammar of the programming language, the unit under test, and the context of the unit (its class and available APIs) to restrict the search space of the tokens that can be successfully used to generate valid oracles. The neural module of Tratto uses transformers fine-tuned for both deciding whether to output an oracle or not and selecting the next lexical token to incrementally build the oracle from the set of tokens returned by the symbolic module. Our experiments show that Tratto outperforms the state-of-the-art axiomatic oracle generation approaches, with 73% accuracy, 72% precision, and 61% F1-score, largely higher than the best results of the symbolic and neural approaches considered in our study (61%, 62%, and 37%, respectively). Tratto can generate three times more axiomatic oracles than current symbolic approaches, while generating 10 times fewer false positives than GPT4 complemented with few-shot learning and Chain-of-Thought prompting.
... Only recently has the research community begun addressing metamorphic relation generation from different angles [5,6,12,104,105,114,115]. More research is needed on MR generation [5,6,105] and oracle/generation improvement [39,40,67,90,91] to facilitate effective testing of AI-generated code. ...
Article
Full-text available
A paradigm shift is underway in Software Engineering, with AI systems such as LLMs playing an increasingly important role in boosting software development productivity. This trend is anticipated to persist. In the next years, we expect a growing symbiotic partnership between human software developers and AI. The Software Engineering research community cannot afford to overlook this trend; we must address the key research challenges posed by the integration of AI into the software development process. In this paper, we present our vision of the future of software development in an AI-driven world and explore the key challenges that our research community should address to realize this vision.
... The oracle problem [5] has received significant attention from the software engineering research community. Great progress on automated inference of software specifications has been made recently, including techniques based on dynamic analysis [17,45], evolutionary computation [37,47], fuzzing [35], natural language processing [7], and machine learning [36]. These approaches typically execute a test suite of the SUT, observe executions, and infer specifications that are consistent with the observations. ...
Article
Full-text available
Metamorphic testing is a valuable technique that helps in dealing with the oracle problem. It involves testing software against specifications of its intended behavior given in terms of so-called metamorphic relations, statements that express properties relating different software elements (e.g., different inputs, methods, etc.). The effective application of metamorphic testing strongly depends on identifying suitable domain-specific metamorphic relations, a challenging task that is typically performed manually. This paper introduces MemoRIA, a novel approach that aims at automatically identifying metamorphic relations. The technique focuses on a particular kind of metamorphic relation, which asserts equivalences between methods and method sequences. MemoRIA works by first generating an object-protocol abstraction of the software being tested, then using fuzzing to produce candidate relations from the abstraction, and finally validating the candidate relations through run-time analysis. A SAT-based analysis is used to eliminate redundant relations, resulting in a concise set of metamorphic relations for the software under test. We evaluate our technique on a benchmark consisting of 22 Java subjects taken from the literature, and compare MemoRIA with the metamorphic relation inference technique SBES. Our results show that by incorporating the object protocol abstraction information, MemoRIA is able to more effectively infer meaningful metamorphic relations, that are also more precise, compared to SBES, measured in terms of mutation analysis. Also, the SAT-based reduction allows us to significantly reduce the number of reported metamorphic relations, while in general having a small impact on the bug-finding ability of the corresponding obtained relations.
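A minimal sketch of what validating one candidate equivalence relation through run-time analysis can look like, using java.util.ArrayDeque as an assumed subject class; the candidate relations and the checks are illustrative, not MemoRIA's actual abstraction or fuzzing.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Minimal sketch of validating candidate equivalence metamorphic relations at
// run time: "d.addLast(x); d.removeLast()" should be equivalent to leaving
// the deque untouched, while "d.addLast(x); d.removeFirst()" should not.
public class EquivalenceMrSketch {

    static Deque<Integer> copyOf(List<Integer> elems) {
        return new ArrayDeque<>(elems);
    }

    static boolean sameContent(Deque<Integer> a, Deque<Integer> b) {
        return new ArrayList<>(a).equals(new ArrayList<>(b));
    }

    public static void main(String[] args) {
        List<List<Integer>> states = List.of(List.of(1, 2, 3), List.of(7), List.of(4, 4));
        int x = 99;

        boolean candidate1 = true; // addLast(x); removeLast()  ==  no-op
        boolean candidate2 = true; // addLast(x); removeFirst() ==  no-op
        for (List<Integer> s : states) {
            Deque<Integer> d1 = copyOf(s);
            d1.addLast(x);
            d1.removeLast();
            candidate1 &= sameContent(d1, copyOf(s));

            Deque<Integer> d2 = copyOf(s);
            d2.addLast(x);
            d2.removeFirst();
            candidate2 &= sameContent(d2, copyOf(s));
        }
        System.out.println("addLast;removeLast  == identity : " + candidate1); // true
        System.out.println("addLast;removeFirst == identity : " + candidate2); // false
    }
}
```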
... Only recently has the research community begun addressing metamorphic relation generation from different angles [2,3,5,56,60,61]. More research is needed on MR generation [2,3,56] and oracle/generation improvement [21,22,36,49,50] to facilitate effective testing of AI-generated code. ...
Preprint
Full-text available
A paradigm shift is underway in Software Engineering, with AI systems such as LLMs gaining increasing importance for improving software development productivity. This trend is anticipated to persist. In the next five years, we will likely see an increasing symbiotic partnership between human developers and AI. The Software Engineering research community cannot afford to overlook this trend; we must address the key research challenges posed by the integration of AI into the software development process. In this paper, we present our vision of the future of software development in an AI-Driven world and explore the key challenges that our research community should address to realize this vision. In Proceedings of the Workshop 2030 Software Engineering co-located with FSE 2024
... Using this approach, we found a number of subtle errors in repOK specifications taken from the literature. Thus, techniques that require repOK specifications (e.g., [30]), as well as techniques that require bounded-exhaustive suites (e.g., [21]), can benefit from our presented generation technique. ...
Chapter
Full-text available
Bounded exhaustive input generation (BEG) is an effective approach to reveal software faults. However, existing BEG approaches require a precise specification of the valid inputs, i.e., a repOK, that must be provided by the user. Writing repOKs for BEG is challenging and time consuming, and they are seldom available in software. In this paper, we introduce an efficient approach that employs routines from the API of the software under test to perform BEG. Like API-based test generation approaches, it creates sequences of calls to methods from the API, and executes them to generate inputs. As opposed to existing BEG approaches, it does not require a repOK to be provided by the user. To make BEG from the API feasible, the approach implements three key pruning techniques: (i) discarding test sequences whose execution produces exceptions violating API usage rules, (ii) state matching to discard test sequences that produce inputs already created by previously explored test sequences, and (iii) the automated identification and use of a subset of methods from the API, called builders, that is sufficient to perform BEG. Our experimental assessment shows that the approach's efficiency and scalability are competitive with existing BEG approaches, without the need for repOKs. We also show that it can assist the user in finding flaws in repOKs, by (automatically) comparing inputs generated from the API with those generated from a repOK. Using this approach, we revealed several errors in repOKs taken from the assessment of related tools, demonstrating the difficulties of writing precise repOKs for BEG.
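A minimal sketch of bounded-exhaustive generation from an API with state matching, the second of the pruning techniques listed above; the subject class (TreeSet), the value domain, and the bound are illustrative assumptions.

```java
import java.util.ArrayDeque;
import java.util.LinkedHashSet;
import java.util.Queue;
import java.util.Set;
import java.util.TreeSet;

// Minimal sketch of bounded-exhaustive input generation from an API with
// state matching: starting from the empty structure, apply API calls up to a
// bound and discard sequences that reproduce an already-seen state.
public class BoundedExhaustiveSketch {

    public static void main(String[] args) {
        int maxCalls = 3;
        int[] values = {0, 1};

        Set<String> visited = new LinkedHashSet<>();        // state matching
        Queue<TreeSet<Integer>> frontier = new ArrayDeque<>();
        frontier.add(new TreeSet<>());
        visited.add(new TreeSet<Integer>().toString());

        for (int depth = 0; depth < maxCalls; depth++) {
            Queue<TreeSet<Integer>> next = new ArrayDeque<>();
            for (TreeSet<Integer> state : frontier) {
                for (int v : values) {
                    TreeSet<Integer> successor = new TreeSet<>(state);
                    successor.add(v);                        // API call under exploration
                    if (visited.add(successor.toString())) { // prune previously seen states
                        next.add(successor);
                    }
                }
            }
            frontier = next;
        }
        System.out.println("distinct inputs generated: " + visited); // [], [0], [1], [0, 1]
    }
}
```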
... Research and practice with mutation testing have shown that it is one of the most powerful testing techniques [3], [22], [27], [55]. Apart from testing the software in general, mutation testing has proven useful in supporting many software engineering activities, including improving test suite strength [2], [14] and selecting quality software specifications [41], [42], [61], among others. However, its use in tackling software security issues has received little attention. ...
Preprint
Full-text available
With the increasing release of powerful language models trained on large code corpus (e.g. CodeBERT was trained on 6.4 million programs), a new family of mutation testing tools has arisen with the promise to generate more "natural" mutants in the sense that the mutated code aims at following the implicit rules and coding conventions typically produced by programmers. In this paper, we study to what extent the mutants produced by language models can semantically mimic the observable behavior of security-related vulnerabilities (a.k.a. Vulnerability-mimicking Mutants), so that designing test cases that are failed by these mutants will help in tackling mimicked vulnerabilities. Since analyzing and running mutants is computationally expensive, it is important to prioritize those mutants that are more likely to be vulnerability mimicking prior to any analysis or test execution. Taking this into account, we introduce VMMS, a machine learning based approach that automatically extracts the features from mutants and predicts the ones that mimic vulnerabilities. We conducted our experiments on a dataset of 45 vulnerabilities and found that 16.6% of the mutants fail one or more tests that are failed by 88.9% of the respective vulnerabilities. More precisely, 3.9% of the mutants from the entire mutant set are vulnerability-mimicking mutants that mimic 55.6% of the vulnerabilities. Despite the scarcity, VMMS predicts vulnerability-mimicking mutants with 0.63 MCC, 0.80 Precision, and 0.51 Recall, demonstrating that the features of vulnerability-mimicking mutants can be automatically learned by machine learning models to statically predict these without the need of investing effort in defining such features.
... An invariant is a property that is always satisfied at one or more points in the execution of a program, such as its inputs and outputs in the context of black-box testing, whereas a likely invariant is a property that is satisfied by a set of program executions but that could not be satisfied by a different execution with, for example, different input values. Automated detection of likely invariants has shown promising results in different contexts such as learning postconditions in Java programs [14], relational databases [6], web applications [13], automated program repair [21], cyber-physical systems [3,20] or robotics [11], among others. Nonetheless, this technique has not been applied yet in the context of RESTful APIs, despite its potential. ...
Article
Static program analysis of real-world software that integrates numerous library Application Programming Interfaces (APIs) faces significant challenges due to inaccessible or highly complex source code. A common workaround is to use specifications that summarize the key behaviors of these APIs for analysis. However, manually writing specifications is labor-intensive and requires a deep understanding of API semantics, while existing automated specification generation techniques struggle when source code is inaccessible or partially available. This paper introduces Spectre, an automated framework that leverages fuzzing techniques to generate aliasing specifications for library APIs. Spectre operates efficiently and precisely both with and without source code access. When source code is unavailable, Spectre integrates alias-check observers into the driver program after the API call site and performs black-box fuzzing to explore different API behaviors. If a check is satisfied, the corresponding aliasing specification is generated. When source code is available, Spectre incorporates new grey-box fuzzing features specifically tailored for aliasing specification inference, further enhancing its ability to generate aliasing specifications. We conducted extensive experiments to evaluate the performance of Spectre. Without source code access, Spectre demonstrated its specification generation capability across both Musl, a lightweight C standard library, and eight C third-party libraries. For Musl, Spectre recovered 96.7% of correct manually written specifications and identified 40.0% more aliasing specifications than those written by external experts. For C third-party libraries, all Spectre-generated aliasing specifications were validated as correct through static analysis of the API source code. Spectre is also more complete than other specification inference tools, generating 16.7% more correct specifications for third-party libraries. The practicality of the generated specifications was confirmed, as they improved aliasing analysis in static pointer analysis of client code while maintaining a balance between accuracy and efficiency. The effectiveness of the tailored grey-box fuzzing features was demonstrated by Spectre identifying 20% more specifications compared to when these features were disabled. These results show that Spectre is an effective tool for inferring aliasing specifications and facilitating static analysis.
Article
The number and complexity of test case generation tools for REST APIs have significantly increased in recent years. These tools excel in automating input generation but are limited by their test oracles, which can only detect crashes, regressions, and violations of API specifications or design best practices. This article introduces AGORA+, an approach for generating test oracles for REST APIs through the detection of invariants—output properties that should always hold. AGORA+ learns the expected behavior of an API by analyzing API requests and their corresponding responses. We enhanced the Daikon tool for dynamic detection of likely invariants, adding new invariant types and creating a front-end called Beet. Beet translates any OpenAPI specification and a set of API requests and responses into Daikon inputs. AGORA+ can detect 106 different types of invariants in REST APIs. We also developed PostmanAssertify, which converts the invariants identified by AGORA+ into executable JavaScript assertions. AGORA+ achieved a precision of 80% on 25 operations from 20 industrial APIs. It also identified 48% of errors systematically seeded in the outputs of the APIs under test. AGORA+ uncovered 32 bugs in popular APIs, including Amadeus, Deutschebahn, GitHub, Marvel, NYTimesBooks, and YouTube, leading to fixes and documentation updates.
Article
The effectiveness of a test suite in detecting faults highly depends on the quality of its test oracles. Large Language Models (LLMs) have demonstrated remarkable proficiency in tackling diverse software testing tasks. This paper aims to present a roadmap for future research on the use of LLMs for test oracle automation. We discuss the progress made in the field of test oracle automation before the introduction of LLMs, identifying the main limitations and weaknesses of existing techniques. Additionally, we discuss recent studies on the use of LLMs for this task, highlighting the main challenges that arise from their use, e.g., how to assess quality and usefulness of the generated oracles. We conclude with a discussion about the directions and opportunities for future research on LLM-based oracle automation.
Conference Paper
In the realm of software engineering, the automation of test case generation represents a significant advancement towards improving efficiency and reliability. This paper introduces a novel approach to automate the generation of test cases from class diagrams using generative artificial intelligence (AI). By extracting class and attribute information from the XML representation of class diagrams, we can formulate structured prompts that are then fed into a generative AI model. The model is designed to interpret these prompts and produce comprehensive test cases corresponding to each class. Our methodology not only streamlines the test case creation process but also leverages the advanced capabilities of AI to ensure thorough coverage and accuracy. The implications of this approach extend to enhancing the quality assurance phase of software development, thereby contributing to the development of robust and error-resistant software systems.
Article
Informal natural language that describes code functionality, such as code comments or function documentation, may contain substantial information about a program's intent. However, there is typically no guarantee that a program's implementation and natural language documentation are aligned. In the case of a conflict, leveraging information in code-adjacent natural language has the potential to enhance fault localization, debugging, and code trustworthiness. In practice, however, this information is often underutilized due to the inherent ambiguity of natural language, which makes natural language intent challenging to check programmatically. The "emergent abilities" of Large Language Models (LLMs) have the potential to facilitate the translation of natural language intent to programmatically checkable assertions. However, it is unclear if LLMs can correctly translate informal natural language specifications into formal specifications that match programmer intent. Additionally, it is unclear if such translation could be useful in practice. In this paper, we describe nl2postcondition, the problem of leveraging LLMs for transforming informal natural language to formal method postconditions, expressed as program assertions. We introduce and validate metrics to measure and compare different nl2postcondition approaches, using the correctness and discriminative power of generated postconditions. We then use qualitative and quantitative methods to assess the quality of nl2postcondition postconditions, finding that they are generally correct and able to discriminate incorrect code. Finally, we find that nl2postcondition via LLMs has the potential to be helpful in practice; generated postconditions were able to catch 64 real-world historical bugs from Defects4J.
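The two metrics can be illustrated on a single hypothetical generated postcondition: it should hold on a reference implementation (correctness) and be violated by buggy code (discriminative power). The method, the postcondition, and the bug below are assumptions for illustration, not the paper's benchmark.

```java
import java.util.List;
import java.util.function.IntBinaryOperator;

// Minimal sketch of the two quality measures for one hypothetical generated
// postcondition of max(a, b): correct if it always holds on the reference
// implementation, discriminating if some buggy implementation violates it.
public class PostconditionQualitySketch {

    // Hypothetical postcondition produced from "returns the larger of a and b".
    static boolean post(int a, int b, int result) {
        return result >= a && result >= b && (result == a || result == b);
    }

    static boolean alwaysHolds(IntBinaryOperator impl, List<int[]> inputs) {
        return inputs.stream().allMatch(p -> post(p[0], p[1], impl.applyAsInt(p[0], p[1])));
    }

    public static void main(String[] args) {
        List<int[]> inputs = List.of(new int[]{1, 2}, new int[]{5, 3}, new int[]{-4, -4});
        IntBinaryOperator reference = Math::max;
        IntBinaryOperator buggy = (a, b) -> a > b ? b : a;   // returns the minimum instead

        System.out.println("correct on reference : " + alwaysHolds(reference, inputs)); // true
        System.out.println("caught the buggy code: " + !alwaysHolds(buggy, inputs));    // true
    }
}
```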
Chapter
Contracts capture assumptions (preconditions) and guarantees (postconditions) of functions in a software program, and are an important paradigm for documenting program code, for program understanding, and to enable modular program verification. In this paper, we focus on contracts for stateful software modules, for instance modules implementing data-structures like queues. Such modules offer different kinds of functions to their environment: observers, which are pure functions used to query the state of the module; and mutators, which can change the module state. We present a novel technique to synthesize contracts for the mutators of a module, in which pre- and postconditions are expressed as Boolean combinations of the observers. Our method builds on existing algorithms for active learning of register automata to model the possible behaviours of the stateful module. We then present techniques for synthesizing contracts from a learned register automaton. The entire method is fully black-box and automated. Based on our proposed approach, we develop a tool called CoGent that generates a set of contracts for a mutator from a given register automaton of a module. Finally, we evaluate our tool using the APIs for various data structures.
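A minimal sketch of what a mutator contract phrased purely over observers looks like when checked at run time; the queue class, the chosen observers, and the contract itself are illustrative assumptions, not CoGent's synthesized output. Run with `java -ea` so the assertion is actually checked.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Minimal sketch of a mutator contract expressed as a Boolean combination of
// observers: for Queue.add (a mutator), the postcondition below only uses the
// observers isEmpty(), size() and peek().
public class ObserverContractSketch {

    static void checkedAdd(Queue<Integer> q, int x) {
        int oldSize = q.size();   // observer snapshot of the pre-state
        q.add(x);                 // mutator under contract
        // Postcondition as a Boolean combination of observers:
        assert !q.isEmpty() && q.size() == oldSize + 1 && q.peek() != null
                : "contract violated for add(" + x + ")";
    }

    public static void main(String[] args) {
        Queue<Integer> q = new ArrayDeque<>();
        checkedAdd(q, 3);
        checkedAdd(q, 8);
        System.out.println("queue after two checked adds: " + q);   // [3, 8]
    }
}
```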
Article
We present an approach to learn contracts for object-oriented programs where guarantees of correctness of the contracts are made with respect to a test generator. Our contract synthesis approach is based on a novel notion of tight contracts and an online learning algorithm that works in tandem with a test generator to synthesize tight contracts. We implement our approach in a tool called Precis and evaluate it on a suite of programs written in C#, studying the safety and strength of the synthesized contracts, and comparing them to those synthesized by Daikon.
Article
Full-text available
Auto-active verifiers provide a level of automation intermediate between fully automatic and interactive: users supply code with annotations as input while benefiting from a high level of automation in the back-end. This paper presents AutoProof, a state-of-the-art auto-active verifier for object-oriented sequential programs with complex functional specifications. AutoProof fully supports advanced object-oriented features and a powerful methodology for framing and class invariants, which make it applicable in practice to idiomatic object-oriented patterns. The paper focuses on describing AutoProof's interface, design, and implementation features, and demonstrates AutoProof's performance on a rich collection of benchmark problems. The results attest to AutoProof's competitiveness among tools in its league on cutting-edge functional verification of object-oriented programs.
Article
Full-text available
We describe a general framework c2i for generating an invariant inference procedure from an invariant checking procedure. Given a checker and a language of possible invariants, c2i generates an inference procedure that iteratively invokes two phases. The search phase uses randomized search to discover candidate invariants and the validate phase uses the checker to either prove or refute that the candidate is an actual invariant. To demonstrate the applicability of c2i, we use it to generate inference procedures that prove safety properties of numerical programs, prove non-termination of numerical programs, prove functional specifications of array manipulating programs, prove safety properties of string manipulating programs, and prove functional specifications of heap manipulating programs that use linked list data structures.
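A minimal guess-and-check sketch in that spirit: the search phase proposes random template coefficients and the validate phase accepts or refutes them. For illustration the checker is replaced by bounded checking of the loop-head states of a toy summation loop, and the template 2*sum == a*i*i + b*i is an assumption, not part of the framework above.

```java
import java.util.Random;

// Minimal guess-and-check sketch: guess candidate invariants from a template,
// then validate them against the loop-head states of
//   sum = 0; for (i = 0; i < n; i++) sum += i;
public class GuessAndCheckSketch {

    // Validate phase: does the candidate hold at every loop-head state up to the bound?
    static boolean holdsOnLoopHeads(int a, int b, int bound) {
        int sum = 0;
        for (int i = 0; i <= bound; i++) {
            if (2 * sum != a * i * i + b * i) {
                return false;          // counterexample found: refute the candidate
            }
            sum += i;                  // loop body
        }
        return true;
    }

    public static void main(String[] args) {
        Random rnd = new Random();
        for (int attempt = 0; attempt < 10_000; attempt++) {
            int a = rnd.nextInt(5) - 2;    // search phase: guess coefficients in [-2, 2]
            int b = rnd.nextInt(5) - 2;
            if (holdsOnLoopHeads(a, b, 50)) {
                // e.g., a = 1, b = -1, i.e., sum == i*(i-1)/2
                System.out.println("candidate invariant: 2*sum == " + a + "*i*i + " + b + "*i");
                return;
            }
        }
        System.out.println("no candidate found within the attempt budget");
    }
}
```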
Article
Full-text available
Testing involves examining the behaviour of a system in order to discover potential faults. Given an input for a system, the challenge of distinguishing the corresponding desired, correct behaviour from potentially incorrect behavior is called the “test oracle problem”. Test oracle automation is important to remove a current bottleneck that inhibits greater overall test automation. Without test oracle automation, the human has to determine whether observed behaviour is correct. The literature on test oracles has introduced techniques for oracle automation, including modelling, specifications, contract-driven development and metamorphic testing. When none of these is completely adequate, the final source of test oracle information remains the human, who may be aware of informal specifications, expectations, norms and domain specific information that provide informal oracle guidance. All forms of test oracles, even the humble human, involve challenges of reducing cost and increasing benefit. This paper provides a comprehensive survey of current approaches to the test oracle problem and an analysis of trends in this important area of software testing research and practice.
Article
Full-text available
Experience with lightweight formal methods suggests that programmers are willing to write specifications if doing so brings tangible benefits to their usual development activities. This paper considers stronger specifications and studies whether they can be deployed as an incremental practice that brings additional benefits without being unacceptably expensive. We introduce a methodology that extends Design by Contract to write strong specifications of functional properties in the form of preconditions, postconditions, and invariants. The methodology aims at being palatable to developers who are not fluent in formal techniques but are comfortable with writing simple specifications. We evaluate the cost and the benefits of using strong specifications by applying the methodology to testing data structure implementations written in Eiffel and C#. In our extensive experiments, testing against strong specifications detects twice as many bugs as standard contracts, with a reasonable overhead in terms of annotation burden and run-time performance while testing. In the wide spectrum of formal techniques for software quality, testing against strong specifications lies in a "sweet spot" with a favorable benefit to effort ratio.
Conference Paper
Full-text available
To find defects in software, one needs test cases that execute the software systematically, and oracles that assess the correctness of the observed behavior when running these test cases. This paper presents EvoSuite, a tool that automatically generates test cases with assertions for classes written in Java code. To achieve this, EvoSuite applies a novel hybrid approach that generates and optimizes whole test suites towards satisfying a coverage criterion. For the produced test suites, EvoSuite suggests possible oracles by adding small and effective sets of assertions that concisely summarize the current behavior; these assertions allow the developer to detect deviations from expected behavior, and to capture the current behavior in order to protect against future defects breaking this behavior.
Conference Paper
Full-text available
Although unit tests are recognized as an important tool in software development, programmers prefer to write code, rather than unit tests. Despite the emergence of tools like JUnit which automate part of the process, unit testing remains a time-consuming, resource-intensive, and not particularly appealing activity. This paper introduces a new development method, called Contract Driven Development. This development method is based on a novel mechanism that extracts test cases from failure-producing runs that the programmers trigger. It exploits actions that developers perform anyway as part of their normal process of writing code. Thus, it takes the task of writing unit tests off the developers' shoulders, while still taking advantage of their knowledge of the intended semantics and structure of the code. The approach is based on the presence of contracts in code, which act as the oracle of the test cases. The test cases are extracted completely automatically, are run in the background, and can easily be maintained over versions. The tool implementing this methodology is called Cdd and is available both in binary and in source form.
Conference Paper
Full-text available
The Code Contracts project [3] at Microsoft Research enables programmers on the .NET platform to author specifications in existing languages such as C# and VisualBasic. To take advantage of these specifications, we provide tools for documentation generation, runtime contract checking, and static contract verification. This talk details the overall approach of the static contract checker and examines where and how we trade off soundness in order to obtain a practical tool that works on a full-fledged object-oriented intermediate language such as the .NET Common Intermediate Language.
Conference Paper
Full-text available
Assertions had their origin in program verification. For the systems developed in industry, construction of assertions and their use in showing program correctness is a near-impossible task. However, they can be used to show that some key properties are satisfied during program execution. We first present a survey of the special roles that assertions can play in object oriented software construction. We then analyse such assertions by relating them to the case study of an automatic surveillance system. In particular, we address the following two issues: What types of assertions can be used most effectively in the context of object oriented software? How can you discover them and where should they be placed? During maintenance, both the design and the software are continuously changed. These changes can mean that the original assertions, if present, are no longer valid for the new software. Can we automatically derive assertions for the changed software?
Conference Paper
Full-text available
Many state-based specification languages, including the Java Modeling Language (JML), contain at their core specification constructs familiar to most computer science and software engineering undergraduates: e.g., assertions, pre- and postconditions, and invariants. Unfortunately, these constructs are not sufficiently expressive to permit formal modular verification of programs written in modern object-oriented languages like Java. The necessary extra constructs for specifying an object-oriented module include the less familiar frame properties, datagroups, and ghost and model fields. These constructs help specifiers deal with potential problems related to, e.g., unexpected side effects, aliasing, class invariants, inheritance, and lack of information hiding. This tutorial focuses on these constructs, explaining their meaning while illustrating how they can be used to address the stated problems.
Conference Paper
Full-text available
This tool paper presents an embodiment of TestEra - a framework developed in previous work for specification-based testing of Java programs. To test a Java method, TestEra uses the method's pre-condition specification to generate test inputs and the post-condition to check correctness of outputs. TestEra supports specifications written in Alloy - a first-order, declarative language based on relations - and uses the SAT-based back-end of the Alloy tool-set for systematic generation of test suites. Each test case is a JUnit test method, which performs three key steps: (1) initialization of pre-state, i.e., creation of inputs to the method under test; (2) invocation of the method; and (3) checking the correctness of post-state, i.e., checking the method output. The tool supports visualization of inputs and outputs as object graphs for graphical illustration of method behavior. TestEra is available for download to be used as a library or as an Eclipse plug-in.
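A generated JUnit test of the three-step shape described above might look roughly like the sketch below; to keep the example self-contained, the method under test (insertSorted) is a hand-written stand-in rather than anything produced by TestEra.

    import java.util.ArrayList;
    import java.util.List;
    import org.junit.Test;
    import static org.junit.Assert.assertTrue;

    public class InsertSortedTest {

        // Stand-in method under test: inserts x while keeping the list sorted.
        static void insertSorted(List<Integer> list, int x) {
            int i = 0;
            while (i < list.size() && list.get(i) < x) i++;
            list.add(i, x);
        }

        @Test
        public void test1() {
            // (1) pre-state: an input satisfying the precondition (a sorted list)
            List<Integer> list = new ArrayList<>(List.of(1, 5));

            // (2) invocation of the method under test
            insertSorted(list, 3);

            // (3) post-state check: the postcondition acts as the oracle
            for (int i = 0; i + 1 < list.size(); i++)
                assertTrue(list.get(i) <= list.get(i + 1));   // still sorted
            assertTrue(list.contains(3));                     // the element was added
        }
    }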
Conference Paper
Full-text available
Reusable software components need expressive specifications. This paper outlines a rigorous foundation to model-based contracts, a method to equip classes with strong contracts that support accurate design, implementation, and formal verification of reusable components. Model-based contracts conservatively extend the classic Design by Contract with a notion of model, which underpins the precise definitions of such concepts as abstract equivalence and specification completeness. Experiments applying model-based contracts to libraries of data structures suggest that the method enables accurate specification of practical software.
Article
Full-text available
Developers often change software in ways that cause tests to fail. When this occurs, developers must determine whether failures are caused by errors in the code under test or in the test code itself. In the latter case, developers must repair failing tests or remove them from the test suite. Repairing tests is time consuming but beneficial, since removing tests reduces a test suite's ability to detect regressions. Fortunately, simple program transformations can repair many failing tests automatically. We present ReAssert, a novel technique and tool that suggests repairs to failing tests' code that cause the tests to pass. Examples include replacing literal values in tests, changing assertion methods, or replacing one assertion with several. If the developer chooses to apply the repairs, ReAssert modifies the code automatically. Our experiments show that ReAssert can repair many common test failures and that its suggested repairs correspond to developers' expectations.
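The kinds of repairs mentioned (replacing literal values, splitting one assertion into several) can be pictured with the hedged sketch below; the Order class and its methods are a tiny stand-in written only so the example compiles, not code from the ReAssert evaluation.

    import org.junit.Test;
    import static org.junit.Assert.assertEquals;

    public class RepairStyleExample {
        // Tiny stand-in for the code under test, so the example is self-contained.
        static class Order {
            final int subtotal;
            Order(int subtotal) { this.subtotal = subtotal; }
            int getSubtotal() { return subtotal; }
            int getTax()      { return subtotal * 21 / 100; }
            int getTotal()    { return subtotal + getTax(); }
        }

        // Before repair the test asserted assertEquals(100, order.getTotal()) and failed,
        // because the code now adds tax. A literal-replacement repair updates the expected value:
        @Test
        public void totalAfterLiteralRepair() {
            assertEquals(121, new Order(100).getTotal());
        }

        // Another typical repair: one coarse assertion replaced with several finer ones.
        @Test
        public void totalAfterSplitRepair() {
            Order order = new Order(100);
            assertEquals(100, order.getSubtotal());
            assertEquals(21, order.getTax());
            assertEquals(121, order.getTotal());
        }
    }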
Article
Various tools for program analysis, including run-time assertion checkers and static analyzers such as verification and test generation tools, require formal specifications of the programs being analyzed. Moreover, many of these tools and techniques require such specifications to be written in a particular style, or follow certain patterns, in order to obtain an acceptable performance from the corresponding analyses. Thus, having a formal specification sometimes is not enough for using a particular technique, since such specification may not be provided in the right formalism. In this paper, we deal with this problem in the increasingly common case of having an operational specification, while for analysis reasons requiring a declarative specification. We propose an evolutionary approach to translate an operational specification written in a sequential programming language, into a declarative specification, in relational logic. We perform experiments on a benchmark of data structure implementations, for which operational invariants are available, and show that our evolutionary computation based approach to translating specifications achieves very good precision in this context, and produces declarative specifications that are more amenable to analyses that demand specifications in this style. This is assessed in two contexts: bounded verification of data structure invariant preservation, and instance enumeration using symbolic execution aided by tight bounds.
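The gap between the two specification styles can be illustrated with a hedged sketch: an operational repOK-style invariant written in Java, followed (in a comment) by an Alloy-like relational formula expressing roughly the same property. The singly-linked-list class below is invented for the example and is not one of the paper's benchmark structures.

    import java.util.HashSet;
    import java.util.Set;

    public class SinglyLinkedList {
        static class Node { int value; Node next; }
        Node header;
        int size;

        // Operational invariant: an imperative traversal that checks the structure.
        public boolean repOK() {
            Set<Node> visited = new HashSet<>();
            Node cur = header;
            while (cur != null) {
                if (!visited.add(cur)) return false;   // a revisited node means a cycle
                cur = cur.next;
            }
            return visited.size() == size;             // the size field is consistent
        }
    }

    // A declarative counterpart in Alloy-like relational notation (shown only for contrast):
    //   all n : header.*next | n !in n.^next    -- no reachable node reaches itself (acyclic)
    //   #(header.*next) = size                  -- size counts the reachable nodes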
Conference Paper
In this paper, we analyze the effect of reducing object redundancy in random testing, by comparing the Randoop random testing tool with a version of the tool that disregards tests that only produce objects that have been previously generated by other tests. As a side effect, this variant also identifies methods in the software under test that never participate in state changes, and uses these more heavily when building assertions. Our evaluation of this strategy concentrates on collection classes, since in this context of object-oriented implementations that describe stateful objects obeying complex invariants, object variability is highly relevant. Our experimental comparison takes the main data structures in java.util, and shows that our object redundancy reduction strategy has an important impact on testing collections, measured in terms of code coverage and mutation killing.
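The filtering idea can be sketched in a few lines of Java; the state abstraction used here (the string form of the object a test produces) is a deliberate simplification for illustration and is not the canonical state representation used in the paper.

    import java.util.ArrayList;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    // Sketch: keep a generated test only if it produces an object state not seen before.
    public class ObjectRedundancyFilter {
        private final Set<String> seenStates = new HashSet<>();
        private final List<Runnable> keptTests = new ArrayList<>();

        /** Returns true if the test is kept because it produced a new object state. */
        public boolean offer(Runnable test, Object producedObject) {
            String state = String.valueOf(producedObject);   // simplified state abstraction
            if (seenStates.add(state)) {
                keptTests.add(test);
                return true;    // new state: the test contributes object variability
            }
            return false;       // redundant: it only re-creates an already-seen object
        }

        public List<Runnable> keptTests() { return keptTests; }
    }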
Conference Paper
Data structure synthesis is the task of generating data structure implementations from high-level specifications. Recent work in this area has shown potential to save programmer time and reduce the risk of defects. Existing techniques focus on data structures for manipulating subsets of a single collection, but real-world programs often track multiple related collections and aggregate properties such as sums, counts, minimums, and maximums. This paper shows how to synthesize data structures that track subsets and aggregations of multiple related collections. Our technique decomposes the synthesis task into alternating steps of query synthesis and incrementalization. The query synthesis step implements pure operations over the data structure state by leveraging existing enumerative synthesis techniques, specialized to the data structures domain. The incrementalization step implements imperative state modifications by re-framing them as fresh queries that determine what to change, coupled with a small amount of code to apply the change. As an added benefit of this approach over previous work, the synthesized data structure is optimized for not only the queries in the specification but also the required update operations. We have evaluated our approach in four large case studies, demonstrating that these extensions are broadly applicable.
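The effect of the incrementalization step can be shown by hand on a tiny case: rather than rescanning the collection on every query, the update operations maintain the aggregate. The class below is a hand-written illustration of the shape of such code, not output of the synthesizer described in the paper.

    import java.util.ArrayList;
    import java.util.List;

    // Tracks a collection together with an aggregate (the sum), maintained incrementally.
    public class OrdersWithTotal {
        private final List<Integer> amounts = new ArrayList<>();
        private long total = 0;                    // maintained aggregate

        public void add(int amount) {
            amounts.add(amount);
            total += amount;                       // incremental update instead of a rescan
        }

        public void removeAt(int index) {
            total -= amounts.remove(index);        // keep the aggregate consistent
        }

        public long total() {                      // aggregate query answered in O(1)
            return total;
        }
    }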
Book
An Integrated Approach to Software Engineering introduces software engineering to advanced-level undergraduate and graduate students of computer science. It emphasizes a case-study approach whereby a project is developed through the course of the book, illustrating the different activities of software development. The sequence of chapters is essentially the same as the sequence of activities performed during a typical software project. All activities, including quality assurance and control activities, are described in each chapter as integral activities for that phase of development. Similarly, the author carefully introduces appropriate metrics for controlling and assessing the software process. Chapters in this revised edition, updated for today’s standards, include these new features:
  • Software Process: a discussion on the timeboxing model for iterative development and on inspection process
  • Requirements Analysis and Specification: a description of Use Cases
  • Software Architecture: an additional chapter for this edition
  • Project Planning: some practical techniques for estimation, scheduling, tracking, and risk management
  • Object Oriented Design: a discussion on UML and on concepts such as cohesion, coupling and open-closed principle
  • Coding: sections on refactoring, test driven development, pair programming, common coding defects, coding standards, and some useful coding practices
  • Testing: a presentation on pair-wise testing as an approach for functional testing, defect tracking, and defect analysis and prevention
The text, bolstered by numerous examples and chapter summaries, imparts to the reader the knowledge, skills, practices and techniques needed to successfully execute a software project.
Article
In this paper an attempt is made to explore the logical foundations of computer programming by use of techniques which were first applied in the study of geometry and have later been extended to other branches of mathematics. This involves the elucidation of sets of axioms and rules of inference which can be used in proofs of the properties of computer programs. Examples are given of such axioms and rules, and a formal proof of a simple theorem is displayed. Finally, it is argued that important advantages, both theoretical and practical, may follow from a pursuance of these topics.
Conference Paper
Various tools for program analysis, including run-time assertion checkers and static analyzers such as verification and test generation tools, require formal specifications of the programs being analyzed. Moreover, many of these tools and techniques require such specifications to be written in a particular style, or follow certain patterns, in order to obtain an acceptable performance from the corresponding analyses. Thus, having a formal specification sometimes is not enough for using a particular technique, since such specification may not be provided in the right formalism. In this paper, we deal with this problem in the increasingly common case of having an operational specification, while for analysis reasons requiring a declarative specification. We propose an evolutionary approach to translate an operational specification written in a sequential programming language, into a declarative specification, in relational logic. We perform experiments on a benchmark of data structure implementations, that show that translating representation invariants using our approach and verifying invariant preservation using the resulting specifications outperforms verification with specifications obtained using an existing semantics-preserving translation. Also, our evolutionary computation translation achieves very good precision in this context.
Conference Paper
We introduce a technique for assessing and improving test oracles by reducing the incidence of both false positives and false negatives. We prove that our approach can always result in an increase in the mutual information between the actual and perfect oracles. Our technique combines test case generation to reveal false positives and mutation testing to reveal false negatives. We applied the decision support tool that implements our oracle improvement technique to five real-world subjects. The experimental results show that the fault detection rate of the oracles after improvement increases, on average, by 48.6% (86% over the implicit oracle). Three actual, exposed faults in the studied systems were subsequently confirmed and fixed by the developers.
Conference Paper
Feedback-directed random test generation is a widely used technique to generate random method sequences. It leverages feedback to guide generation. However, the validity of feedback guidance has not been challenged yet. In this paper, we investigate the characteristics of feedback-directed random test generation and propose a method that exploits the obtained knowledge that excessive feedback limits the diversity of tests. First, we show that the feedback loop of the feedback-directed generation algorithm is a positive feedback loop and amplifies the bias that emerges in the candidate value pool. This over-directs the generation and limits the diversity of generated tests. Thus, limiting the amount of feedback can improve the diversity and effectiveness of generated tests. Second, we propose a method named feedback-controlled random test generation, which aggressively controls the feedback in order to promote diversity of generated tests. Experiments on eight different, real-world application libraries indicate that our method increases branch coverage by 78% to 204% over the original feedback-directed algorithm on large-scale utility libraries.
Article
Research on software testing produces many innovative automated techniques, but because software testing is by necessity incomplete and approximate, any new technique faces the challenge of an empirical assessment. In the past, we have demonstrated scientific advance in automated unit test generation with the EVOSUITE tool by evaluating it on manually selected open-source projects or examples that represent a particular problem addressed by the underlying technique. However, demonstrating scientific advance is not necessarily the same as demonstrating practical value; even if EVOSUITE worked well on the software projects we selected for evaluation, it might not scale up to the complexity of real systems. Ideally, one would use large “real-world” software systems to minimize the threats to external validity when evaluating research tools. However, neither choosing such software systems nor applying research prototypes to them are trivial tasks. In this article we present the results of a large experiment in unit test generation using the EVOSUITE tool on 100 randomly chosen open-source projects, the 10 most popular open-source projects according to the SourceForge Web site, seven industrial projects, and 11 automatically generated software projects. The study confirms that EVOSUITE can achieve good levels of branch coverage (on average, 71% per class) in practice. However, the study also exemplifies how the choice of software systems for an empirical study can influence the results of the experiments, which can serve to inform researchers to make more conscious choices in the selection of software system subjects. Furthermore, our experiments demonstrate how practical limitations interfere with scientific advances: branch coverage on an unbiased sample is affected by predominant environmental dependencies. The surprisingly large effect of such practical engineering problems in unit testing will hopefully lead to a larger appreciation of work in this area, thus supporting transfer of knowledge from software testing research to practice.
Conference Paper
Auto-active verifiers provide a level of automation intermediate between fully automatic and interactive: users supply code with annotations as input while benefiting from a high level of automation in the back-end. This paper presents AutoProof, a state-of-the-art auto-active verifier for object-oriented sequential programs with complex functional specifications. AutoProof fully supports advanced object-oriented features and a powerful methodology for framing and class invariants, which make it applicable in practice to idiomatic object-oriented patterns. The paper focuses on describing AutoProof's interface, design, and implementation features, and demonstrates AutoProof's performance on a rich collection of benchmark problems. The results attest AutoProof's competitiveness among tools in its league on cutting-edge functional verification of object-oriented programs.
Conference Paper
Formally verifying a program requires significant skill not only because of complex interactions between program subcomponents, but also because of deficiencies in current verification interfaces. These skill barriers make verification economically unattractive by preventing the use of less-skilled (less-expensive) workers and distributed workflows (i.e., crowdsourcing). This paper presents VeriWeb, a web-based IDE for verification that decomposes the task of writing verifiable specifications into manageable subproblems. To overcome the information loss caused by task decomposition, and to reduce the skill required to verify a program, VeriWeb incorporates several innovative user interface features: drag and drop condition construction, concrete counterexamples, and specification inlining. To evaluate VeriWeb, we performed three experiments. First, we show that VeriWeb lowers the time and monetary cost of verification by performing a comparative study of VeriWeb and a traditional tool using 14 paid subjects contracted hourly from Exhedra Solution's vWorker online marketplace. Second, we demonstrate the dearth and insufficiency of current ad-hoc labor marketplaces for verification by recruiting workers from Amazon's Mechanical Turk to perform verification with VeriWeb. Finally, we characterize the minimal communication overhead incurred when VeriWeb is used collaboratively by observing two pairs of developers each use the tool simultaneously to verify a single program.
Article
This paper attempts to provide an adequate basis for formal definitions of the meanings of programs in appropriately defined programming languages, in such a way that a rigorous standard is established for proofs about computer programs, including proofs of correctness, equivalence, and termination. The basis of our approach is the notion of an interpretation of a program: that is, an association of a proposition with each connection in the flow of control through a program, where the proposition is asserted to hold whenever that connection is taken. To prevent an interpretation from being chosen arbitrarily, a condition is imposed on each command of the program. This condition guarantees that whenever a command is reached by way of a connection whose associated proposition is then true, it will be left (if at all) by a connection whose associated proposition will be true at that time. Then by induction on the number of commands executed, one sees that if a program is entered by a connection whose associated proposition is then true, it will be left (if at all) by a connection whose associated proposition will be true at that time. By this means, we may prove certain properties of programs, particularly properties of the form: ‘If the initial values of the program variables satisfy the relation R1, the final values on completion will satisfy the relation R2’.
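In later Hoare-style notation, the "initial relation R1, final relation R2" reading corresponds to a triple {R1} S {R2}; a standard textbook instance (not an example drawn from the paper) is:

    \{\, x + 1 = a + 1 \,\}\ \ x := x + 1\ \ \{\, x = a + 1 \,\}
    \quad\text{an instance of the assignment axiom}\quad
    \{\, P[e/x] \,\}\ \ x := e\ \ \{\, P \,\}

Since x + 1 = a + 1 is equivalent to x = a, this yields \{\, x = a \,\}\ x := x + 1\ \{\, x = a + 1 \,\}.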
Article
Popular software testing tools, such as JUnit, allow frequent retesting of modified code; yet the manually created test scripts are often seriously incomplete. A unit-testing tool called JWalk has therefore been developed to address the need for systematic unit testing within the context of agile methods. The tool operates directly on the compiled code for Java classes and uses a new lazy method for inducing the changing design of a class on the fly. This is achieved partly through introspection, using Java’s reflection capability, and partly through interaction with the user, constructing and saving test oracles on the fly. Predictive rules reduce the number of oracle values that must be confirmed by the tester. Without human intervention, JWalk performs bounded exhaustive exploration of the class’s method protocols and may be directed to explore the space of algebraic constructions, or the intended design state-space of the tested class. With some human interaction, JWalk performs up to the equivalent of fully automated state-based testing, from a specification that was acquired incrementally.
Conference Paper
Developers have used data structure repair over the last few decades as an effective means to recover on-the-fly from errors in program state. Traditional repair techniques were based on dedicated repair routines, whereas more recent techniques have used invariants that describe desired structural properties as the basis for repair. All repair techniques are designed with one primary goal: run-time error recovery. However, the actions that any such technique performs to repair an erroneous program state are meant to produce the effect of the actions of a (hypothetical) correct program. The key insight in this paper is that repair actions on the program state can guide debugging of code (when the erroneous program execution is due to a fault in the program and not an external event). This paper presents an approach that abstracts concrete repair actions that a routine performs to repair an erroneous state into a sequence of program statements that perform the same actions using variables visible in the scope of the faulty code. Thus, appending the generated statements to the original code is akin to performing the repair from within the program. Our implementation uses the Juzi data structure repair tool as an enabling technology. Experimental results using a library data structure as well as two applications demonstrate the effectiveness of our approach in enabling repair of faulty code.
Article
Daikon is an implementation of dynamic detection of likely invariants; that is, the Daikon invariant detector reports likely program invariants. An invariant is a property that holds at a certain point or points in a program; these are often used in assert statements, documentation, and formal specifications. Examples include being constant (x=a), non-zero (x≠0), being in a range (a≤x≤b), linear relationships (y=ax+b), ordering (x≤y), functions from a library (x=fn(y)), containment (x∈y), sortedness (x is sorted), and many more. Users can extend Daikon to check for additional invariants. Dynamic invariant detection runs a program, observes the values that the program computes, and then reports properties that were true over the observed executions. Dynamic invariant detection is a machine learning technique that can be applied to arbitrary data. Daikon can detect invariants in C, C++, Java, and Perl programs, and in record-structured data sources; it is easy to extend Daikon to other applications. Invariants can be useful in program understanding and a host of other applications. Daikon’s output has been used for generating test cases, predicting incompatibilities in component integration, automating theorem proving, repairing inconsistent data structures, and checking the validity of data streams, among other tasks. Daikon is freely available in source and binary form, along with extensive documentation, at http://pag.csail.mit.edu/daikon/.
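For a concrete feel of this kind of output, the fragment below shows a small Java method together with hand-written properties of the sort a dynamic invariant detector might report at its exit point; the method and the listed properties are illustrative, not actual Daikon output.

    public class AbsExample {
        // Method observed over many executions by a dynamic invariant detector.
        public static int abs(int x) {
            return x < 0 ? -x : x;
        }
        // Likely invariants a detector might report for the return value (illustration only):
        //   return >= 0
        //   return == x  ||  return == -x
        //   return >= x
    }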
Article
Context: One of the important issues of software testing is to provide an automated test oracle. Test oracles are reliable sources of how the software under test must operate. In particular, they are used to evaluate the actual results produced by the software. However, in order to generate an automated test oracle, oracle challenges need to be addressed. These challenges are output-domain generation, input domain to output domain mapping, and a comparator to decide on the accuracy of the actual outputs. Objective: This paper proposes an automated test oracle framework to address all of these challenges. Method: I/O Relationship Analysis is used to generate the output domain automatically, and Multi-Networks Oracles based on artificial neural networks are introduced to handle the second challenge. The last challenge is addressed using an automated comparator that adjusts the oracle precision by defining the comparison tolerance. The proposed approach was evaluated using an industry strength case study, which was injected with some faults. The quality of the proposed oracle was measured by assessing its accuracy, precision, misclassification error and practicality. Mutation testing was considered to provide the evaluation framework by implementing two different versions of the case study: a Golden Version and a Mutated Version. Furthermore, a comparative study between the existing automated oracles and the proposed one is provided based on which challenges they can automate. Results: Results indicate that the proposed approach automated the oracle generation process 97% in this experiment. Accuracy of the proposed oracle was up to 98.26%, and the oracle detected up to 97.7% of the injected faults. Conclusion: Consequently, the results of the study highlight the practicality of the proposed oracle in addition to the automation it offers.
Conference Paper
Pex automatically produces a small test suite with high code coverage for a .NET program. To this end, Pex performs a systematic program analysis (using dynamic symbolic execution, similar to path-bounded model-checking) to determine test inputs for Parameterized Unit Tests. Pex learns the program behavior by monitoring execution traces. Pex uses a constraint solver to produce new test inputs which exercise different program behavior. The result is an automatically generated small test suite which often achieves high code coverage. In one case study, we applied Pex to a core component of the .NET runtime which had already been extensively tested over several years. Pex found errors, including a serious issue.
Conference Paper
Reusability is an important software engineering concept actively advocated for the last forty years. While reusability has been addressed for systems implemented using the same programming language, it does not usually handle interoperability with different programming languages. This paper presents a solution for the reuse of Java code within Eiffel programs based on a source-to-source translation from Java to Eiffel. The paper focuses on the critical aspects of the translation and illustrates them by formal means. The translation is implemented in the freely available tool J2Eif; it provides Eiffel replacements for the components of the Java runtime environment, including Java Native Interface services and reflection mechanisms. Our experiments demonstrate the practical usability of the translation scheme and its implementation, and record the performance slow-down compared to custom-made Eiffel applications: automatic translations of java.util data structures, java.io services, and SWT applications can be re-used as Eiffel programs, with the same functionalities as their original Java implementations.
Conference Paper
Organizations building highly complex business and technical systems need to architect families of systems and implement these with large-scale component reuse. Without carefully architecting the systems, components, organizations and processes for reuse, ...
Conference Paper
This paper presents Korat, a novel framework for automated testing of Java programs. Given a formal specification for a method, Korat uses the method precondition to automatically generate all (nonisomorphic) test cases up to a given small size. Korat then executes the method on each test case, and uses the method postcondition as a test oracle to check the correctness of each output. To generate test cases for a method, Korat constructs a Java predicate (i.e., a method that returns a boolean) from the method's pre-condition. The heart of Korat is a technique for automatic test case generation: given a predicate and a bound on the size of its inputs, Korat generates all (nonisomorphic) inputs for which the predicate returns true. Korat exhaustively explores the bounded input space of the predicate but does so efficiently by monitoring the predicate's executions and pruning large portions of the search space. This paper illustrates the use of Korat for testing several data structures, including some from the Java Collections Framework. The experimental results show that it is feasible to generate test cases from Java predicates, even when the search space for inputs is very large. This paper also compares Korat with a testing framework based on declarative specifications. Contrary to our initial expectation, the experiments show that Korat generates test cases much faster than the declarative framework.
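The Java predicate at the heart of the approach is an imperative repOK-style method that returns true exactly on well-formed inputs; the binary-tree sketch below is written for illustration and is not one of the paper's benchmark classes.

    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.HashSet;
    import java.util.Set;

    // A Korat-style Java predicate over candidate structures.
    public class BinaryTree {
        static class Node { Node left, right; }
        Node root;
        int size;

        public boolean repOK() {
            if (root == null) return size == 0;
            Set<Node> visited = new HashSet<>();
            Deque<Node> work = new ArrayDeque<>();
            visited.add(root);
            work.add(root);
            while (!work.isEmpty()) {
                Node n = work.remove();
                // Reject sharing and cycles: every node must be reached exactly once.
                if (n.left != null)  { if (!visited.add(n.left))  return false; work.add(n.left); }
                if (n.right != null) { if (!visited.add(n.right)) return false; work.add(n.right); }
            }
            return visited.size() == size;   // the size field matches the number of nodes
        }
    }

Given such a predicate and a bound on the number of nodes, a bounded-exhaustive generator enumerates exactly the structures for which repOK returns true and uses them as test inputs.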
Conference Paper
The project Code Contracts for .NET [1] comes from the Research in Software Engineering (RiSE) group [5] at Microsoft Research. We took the lessons we learned from the Spec# project [3,4] and have applied them in a setting available to all .NET programmers without the need for them to adopt an experimental programming language or the Spec# programming methodology. It has been available since early 2009 with a commercial use license on the DevLabs [7] web site. Since then there have been about 20,000 downloads, with an active forum of users.
Article
TestEra is a framework for automated specification-based testing of Java programs. TestEra requires as input a Java method (in source code or bytecode), a formal specification of the pre- and post-conditions of that method, and a bound that limits the size of the test cases to be generated. Using the method's pre-condition, TestEra automatically generates all nonisomorphic test inputs up to the given bound. It executes the method on each test input, and uses the method postcondition as an oracle to check the correctness of each output. Specifications are first-order logic formulae. As an enabling technology, TestEra uses the Alloy toolset, which provides an automatic SAT-based tool for analyzing first-order logic formulae. We have used TestEra to check several Java programs including an architecture for dynamic networks, the Alloy-alpha analyzer, a fault-tree analyzer, and methods from the Java Collection Framework.
Article
In this paper an attempt is made to explore the logical foundations of computer programming by use of techniques which were first applied in the study of geometry and have later been extended to other branches of mathematics. This involves the elucidation of sets of axioms and rules of inference which can be used in proofs of the properties of computer programs. Examples are given of such axioms and rules, and a formal proof of a simple theorem is displayed. Finally, it is argued that important advantages, both theoretical and practical, may follow from a pursuance of these topics.
Article
Object-Oriented Software Construction, second edition is the comprehensive reference on all aspects of object technology, from design principles to O-O techniques, Design by Contract, O-O analysis, concurrency, persistence, abstract data types and many more. Written by a pioneer in the field, contains an in-depth analysis of both methodological and technical issues.
Conference Paper
We present a technique that improves random test generation by incorporating feedback obtained from executing test inputs as they are created. Our technique builds inputs incrementally by randomly selecting a method call to apply and finding arguments from among previously-constructed inputs. As soon as an input is built, it is executed and checked against a set of contracts and filters. The result of the execution determines whether the input is redundant, illegal, contract-violating, or useful for generating more inputs. The technique outputs a test suite consisting of unit tests for the classes under test. Passing tests can be used to ensure that code contracts are preserved across program changes; failing tests (that violate one or more contracts) point to potential errors that should be corrected. Our experimental results indicate that feedback-directed random test generation can outperform systematic and undirected random test generation, in terms of coverage and error detection. On four small but nontrivial data structures (used previously in the literature), our technique achieves higher or equal block and predicate coverage than model checking (with and without abstraction) and undirected random generation. On 14 large, widely-used libraries (comprising 780KLOC), feedback-directed random test generation finds many previously-unknown errors, not found by either model checking or undirected random generation.
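A deliberately simplified sketch of such a generation loop is shown below: method calls over java.util.ArrayList are picked at random, executed immediately, and the execution outcome decides whether the produced value is fed back into the pool of inputs. The pool contents, the restriction to one-Object-parameter methods, and the exception handling are simplifications for illustration, not the tool's actual classification rules.

    import java.lang.reflect.Method;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Random;

    public class FeedbackDirectedSketch {
        public static void main(String[] args) throws Exception {
            Random rnd = new Random(42);
            List<Object> pool = new ArrayList<>();          // previously constructed values
            pool.add(new ArrayList<Integer>());
            pool.add(7);

            Method[] methods = ArrayList.class.getMethods();
            for (int i = 0; i < 200; i++) {
                Object receiver = pool.get(rnd.nextInt(pool.size()));
                if (!(receiver instanceof ArrayList)) continue;
                Method m = methods[rnd.nextInt(methods.length)];
                if (m.getParameterCount() != 1 || m.getParameterTypes()[0] != Object.class) continue;
                Object arg = pool.get(rnd.nextInt(pool.size()));
                try {
                    Object result = m.invoke(receiver, arg);   // execute the freshly built call
                    // Feedback: a normally terminating call may extend the pool,
                    // so later sequences can build on its result.
                    if (result != null) pool.add(result);
                } catch (Exception e) {
                    // Feedback: exception-throwing sequences are set aside here
                    // (candidates for illegal inputs or contract violations).
                }
            }
            System.out.println("values in the pool after generation: " + pool.size());
        }
    }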
Article
Methodological guidelines for object-oriented software construction that improve the reliability of the resulting software systems are presented. It is shown that the object-oriented techniques rely on the theory of design by contract, which underlies the design of the Eiffel analysis, design, and programming language and of the supporting libraries, from which a number of examples are drawn. The theory of contract design and the role of assertions in that theory are discussed.
Article
This paper presents Korat, a novel framework for automated testing of Java programs. Given a formal specification for a method, Korat uses the method precondition to automatically generate all (nonisomorphic) test cases up to a given small size. Korat then executes the method on each test case, and uses the method postcondition as a test oracle to check the correctness of each output. To generate test cases...
Constraint-based program debugging using data structure repair
  • Muhammad Zubair Malik
  • Haroon Siddiqui
  • Sarfraz Khurshid
An automated framework for software test oracle
  • Seyed Reza Shahamiri
  • Wan Mohd Nasir Wan-Kadir
  • Suhaimi Ibrahim
  • Siti Zaiton Mohd Hashim
Information & Software Technology, 53(7):774-788, 2011.
Reducing the barriers to writing verified specifications
  • Todd W. Schiller
  • Michael D. Ernst