Conference Paper

START: A Framework for Trusted and Resilient Autonomous Vehicles (Practical Experience Report)


... In comparison, our work focuses on the online testing of UAS using OCL-based test oracles. Recently, Leach et al. (2022) presented a framework for assessing and handling the security situation during UAV operation. The evaluation using ArduPilot and Red Team attacks shows that the proposed framework can increase the dependability of autonomous systems. ...
Article
Full-text available
Unmanned aerial systems (UAS) rely on various avionics systems that are safety-critical and mission-critical. A major requirement of international safety standards is to perform rigorous system-level testing of avionics software systems. The current industrial practice is to manually create test scenarios, manually/automatically execute these scenarios using simulators, and manually evaluate outcomes. The test scenarios typically consist of setting certain flight or environment conditions and testing the system under test in these settings. The state-of-the-art approaches for this purpose also require manual test scenario development and evaluation. In this paper, we propose a novel approach to automate the system-level testing of the UAS. The proposed approach (namely AITester) utilizes model-based testing and artificial intelligence (AI) techniques to automatically generate, execute, and evaluate various test scenarios. The test scenarios are generated on the fly, i.e., during test execution based on the environmental context at runtime. The approach is supported by a toolset. We empirically evaluated the proposed approach on two core components of UAS, an autopilot system of an unmanned aerial vehicle (UAV) and cockpit display systems (CDS) of the ground control station (GCS). The results show that the AITester effectively generates test scenarios causing deviations from the expected behavior of the UAV autopilot and reveals potential flaws in the GCS-CDS.
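A minimal sketch of the on-the-fly scenario generation and oracle evaluation described above, assuming a toy simulator stub, a wind-escalation scenario strategy, and an altitude-hold constraint standing in for an OCL-style oracle; none of the names below come from the AITester toolset.

```python
# Hypothetical sketch: generate test scenarios during execution and evaluate
# an oracle at every step. The simulator and oracle are illustrative stubs.
import random

class SimStub:
    """Toy stand-in for a flight simulator with an altitude-hold controller."""
    def __init__(self):
        self.altitude = 100.0
        self.wind = 0.0

    def step(self):
        disturbance = random.gauss(0.0, self.wind * 0.1)
        # Crude dynamics: the controller pulls altitude back towards 100 m.
        self.altitude += disturbance - 0.05 * (self.altitude - 100.0)
        return self.altitude

def oracle_altitude_hold(altitude, target=100.0, tolerance=5.0):
    """OCL-style constraint: |altitude - target| <= tolerance."""
    return abs(altitude - target) <= tolerance

def run_online_test(steps=200, seed=1):
    random.seed(seed)
    sim, violations = SimStub(), []
    for t in range(steps):
        # Scenario generation uses the runtime context: escalate the wind
        # only while the system still satisfies the oracle.
        if oracle_altitude_hold(sim.altitude):
            sim.wind += 0.5
        if not oracle_altitude_hold(sim.step()):
            violations.append((t, sim.wind, sim.altitude))
    return violations

if __name__ == "__main__":
    for t, wind, alt in run_online_test()[:3]:
        print(f"step={t}: wind={wind:.1f} m/s, altitude={alt:.1f} m violates the oracle")
```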
... Because Zipr's technology is agnostic to the source language, it was the only solution in the program that was able to handle the Ada Web Server application (obviously written in Ada). In addition to these projects, Zipr has been used to do antifragility work, and is the basis for effective binary-only fuzzing with tools like Zafl and the binary-only version of Untracer, called HeXcite. (26)(27)(28)(29)(30) Hardened Registries in UVA ACCORD. As a practical example of how to use hardened registries in a real-world environment, we have partnered with the ACCORD team at the University of Virginia (UVA). ...
Preprint
Full-text available
The open-source Helix++ project improves the security posture of computing platforms by applying cutting-edge cybersecurity techniques to diversify and harden software automatically. A distinguishing feature of Helix++ is that it does not require source code or build artifacts; it operates directly on software in binary form--even stripped executables and libraries. This feature is key as rebuilding applications from source is a time-consuming and often frustrating process. Diversification breaks the software monoculture and makes attacks harder to execute as information needed for a successful attack will have changed unpredictably. Diversification also forces attackers to customize an attack for each target instead of attackers crafting an exploit that works reliably on all similarly configured targets. Hardening directly targets key attack classes. The combination of diversity and hardening provides defense-in-depth, as well as a moving target defense, to secure the Nation's cyber infrastructure.
Article
Full-text available
This paper presents a novel end-to-end approach to program repair based on sequence-to-sequence learning. We devise, implement, and evaluate a technique, called SEQUENCER, for fixing bugs based on sequence-to-sequence learning on source code. This approach uses the copy mechanism to overcome the unlimited vocabulary problem that occurs with big code. Our system is data-driven; we train it on 35,578 samples, carefully curated from commits to open-source repositories. We evaluate SEQUENCER on 4,711 independent real bug fixes, as well as on the Defects4J benchmark used in program repair research. SEQUENCER is able to perfectly predict the fixed line for 950/4,711 testing samples, and find correct patches for 14 bugs in the Defects4J benchmark. SEQUENCER captures a wide range of repair operators without any domain-specific top-down design.
Conference Paper
Full-text available
Security vulnerabilities are among the most critical software defects in existence. When identified, programmers aim to produce patches that prevent the vulnerability as quickly as possible, motivating the need for automatic program repair (APR) methods to generate patches automatically. Unfortunately, most current APR methods fall short because they approximate the properties necessary to prevent the vulnerability using examples. Approximations result in patches that either do not fix the vulnerability comprehensively, or may even introduce new bugs. Instead, we propose property-based APR, which uses human-specified, program-independent and vulnerability-specific safety properties to derive source code patches for security vulnerabilities. Unlike properties that are approximated by observing the execution of test cases, such safety properties are precise and complete. The primary challenge lies in mapping such safety properties into source code patches that can be instantiated into an existing program. To address these challenges, we propose Senx, which, given a set of safety properties and a single input that triggers the vulnerability, detects the safety property violated by the vulnerability input and generates a corresponding patch that enforces the safety property and thus, removes the vulnerability. Senx solves several challenges with property-based APR: it identifies the program expressions and variables that must be evaluated to check safety properties and identifies the program scopes where they can be evaluated, it generates new code to selectively compute the values it needs if calling existing program code would cause unwanted side effects, and it uses a novel access range analysis technique to avoid placing patches inside loops where it could incur performance overhead. Our evaluation shows that the patches generated by Senx successfully fix 32 of 42 real-world vulnerabilities from 11 applications including various tools or libraries for manipulating graphics/media files, a programming language interpreter, a relational database engine, a collection of programming tools for creating and managing binary programs, and a collection of basic file, shell, and text manipulation tools.
Conference Paper
Full-text available
Robotic vehicles (RVs), such as drones and ground rovers, are a type of cyber-physical systems that operate in the physical world under the control of computing components in the cyber world. Despite RVs' robustness against natural disturbances, cyber or physical attacks against RVs may lead to physical malfunction and subsequently disruption or failure of the vehicles' missions. To avoid or mitigate such consequences, it is essential to develop attack detection techniques for RVs. In this paper, we present a novel attack detection framework to identify external, physical attacks against RVs on the fly by deriving and monitoring Control Invariants (CI). More specifically, we propose a method to extract such invariants by jointly modeling a vehicle's physical properties, its control algorithm and the laws of physics. These invariants are represented in a state-space form, which can then be implemented and inserted into the vehicle's control program binary for runtime invariant check. We apply our CI framework to eleven RVs, including quadrotor, hexarotor, and ground rover, and show that the invariant check can detect three common types of physical attacks -- including sensor attack, actuation signal attack, and parameter attack -- with very low runtime overhead.
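A minimal sketch of the control-invariant check, assuming a toy two-state linear model x[k+1] = A x[k] + B u[k]; the matrices, residual threshold, alarm window, and simulated actuation attack are illustrative assumptions rather than a real vehicle's identified model.

```python
# Hypothetical control-invariant (CI) monitor: predict the next state from the
# state-space model and alarm when the residual stays large for WINDOW steps.
import numpy as np

A = np.array([[1.0, 0.1],
              [0.0, 0.9]])      # toy transition matrix (e.g., position, velocity)
B = np.array([[0.0],
              [0.1]])           # toy actuation matrix
THRESHOLD = 0.5                 # residual bound assumed derived offline
WINDOW = 5                      # consecutive violations before raising an alarm

def ci_monitor(states, inputs):
    consecutive = 0
    for k in range(len(states) - 1):
        predicted = A @ states[k] + B @ inputs[k]
        residual = np.linalg.norm(states[k + 1] - predicted)
        consecutive = consecutive + 1 if residual > THRESHOLD else 0
        if consecutive >= WINDOW:
            return k + 1        # step at which the alarm fires
    return None

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    u = [np.array([1.0]) for _ in range(60)]
    x = [np.array([0.0, 0.0])]
    for k in range(60):
        # Actuation attack: after step 40 the applied input differs from the command.
        applied = u[k] + (np.array([10.0]) if k >= 40 else 0.0)
        x.append(A @ x[k] + B @ applied + rng.normal(0.0, 0.01, size=2))
    print("alarm at step:", ci_monitor(x, u))   # alarms a few steps after the attack
```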
Conference Paper
Full-text available
Automated program repair (APR) has great potential to reduce bug fixing effort and many approaches have been proposed in recent years. APRs are often treated as a search problem where the search space consists of all the possible patches and the goal is to identify the correct patch in the space. Many techniques take a data-driven approach and analyze data sources such as existing patches and similar source code to help identify the correct patch. However, while existing patches and similar code provide complementary information, existing techniques analyze only a single source and cannot be easily extended to analyze both. In this paper, we propose a novel automatic program repair approach that utilizes both existing patches and similar code. Our approach mines an abstract search space from existing patches and obtains a concrete search space by differencing with similar code snippets. Then we search within the intersection of the two search spaces. We have implemented our approach as a tool called SimFix, and evaluated it on the Defects4J benchmark. Our tool successfully fixed 34 bugs. To the best of our knowledge, this is the largest number of bugs fixed by a single technology on the Defects4J benchmark. Furthermore, as far as we know, 13 bugs fixed by our approach have never been fixed by the current approaches.
Article
Full-text available
Unmanned Aircraft Systems (UAS) with autonomous decision-making capabilities are of increasing interest for a wide area of applications such as logistics and disaster recovery. In order to ensure the correct behavior of the system and to recognize hazardous situations or system faults, we applied stream runtime monitoring techniques within the DLR ARTIS (Autonomous Research Testbed for Intelligent System) family of unmanned aircraft. We present our experience from specification elicitation, instrumentation, offline log-file analysis, and online monitoring on the flight computer on a test rig. The debugging and health management support through stream runtime monitoring techniques have proven highly beneficial for system design and development. At the same time, the project has identified usability improvements to the specification language, and has influenced the design of the language.
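A minimal sketch of the stream runtime monitoring idea, assuming a single altitude input stream, a derived vertical-speed stream, and a sink-rate trigger; the specification and thresholds are illustrative, not the DLR ARTIS specifications.

```python
# Hypothetical stream monitor: derived streams are computed sample-by-sample
# from input streams, and a trigger fires when the specification is violated.
def monitor(samples, dt=0.1, max_sink_rate=5.0):
    """samples: iterable of (timestamp, altitude_m). Yields trigger messages."""
    prev_alt = None
    for t, alt in samples:
        if prev_alt is not None:
            vertical_speed = (alt - prev_alt) / dt          # derived stream
            if -vertical_speed > max_sink_rate:             # trigger condition
                yield f"t={t:.1f}s: sink rate {-vertical_speed:.1f} m/s exceeds {max_sink_rate} m/s"
        prev_alt = alt

if __name__ == "__main__":
    # Level flight for 3 s, then an abrupt descent that violates the limit.
    log = [(i * 0.1, 100.0 if i < 30 else 100.0 - (i - 30) * 0.8) for i in range(40)]
    for msg in monitor(log):
        print(msg)
```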
Article
Full-text available
Detection of cyber attacks against vehicles is of growing interest. As vehicles typically afford limited processing resources, proposed solutions are rule-based or lightweight machine learning techniques. We argue that this limitation can be lifted with computational offloading commonly used for resource-constrained mobile devices. The increased processing resources available in this manner allow access to more advanced techniques. Using as case study a small four-wheel robotic land vehicle, we demonstrate the practicality and benefits of offloading the continuous task of intrusion detection that is based on deep learning. This approach achieves high accuracy much more consistently than with standard machine learning techniques and is not limited to a single type of attack or the in-vehicle CAN bus as previous work. As input, it uses data captured in real-time that relate to both cyber and physical processes, which it feeds as time series data to a neural network architecture. We use both a deep multilayer perceptron and a recurrent neural network architecture, with the latter benefitting from a long short-term memory hidden layer, which proves very useful for learning the temporal context of different attacks. We employ denial of service, command injection and malware as examples of cyber attacks that are meaningful for a robotic vehicle. The practicality of the latter depends on the resources afforded onboard and remotely, as well as the reliability of the communication means between them. Using detection latency as the criterion, we have developed a mathematical model to determine when computation offloading is beneficial given parameters related to the operation of the network and the processing demands of the deep learning model. The more reliable the network and the greater the processing demands, the greater the reduction in detection latency achieved through offloading.
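A minimal sketch of the offloading decision based on detection latency, assuming a simple model in which offloading pays the transmission cost (discounted by link delivery probability) plus remote inference time, versus running the detector onboard; the exact latency model in the paper may differ.

```python
# Hypothetical latency model for "detect locally or offload?".
def offload_latency(payload_bits, bandwidth_bps, rtt_s, t_infer_remote, delivery_prob):
    # Expected transfer time scales with 1/p for an unreliable link (assumption).
    transfer = (payload_bits / bandwidth_bps + rtt_s) / delivery_prob
    return transfer + t_infer_remote

def should_offload(payload_bits, bandwidth_bps, rtt_s,
                   t_infer_local, t_infer_remote, delivery_prob):
    remote = offload_latency(payload_bits, bandwidth_bps, rtt_s,
                             t_infer_remote, delivery_prob)
    return remote < t_infer_local

if __name__ == "__main__":
    # Heavy RNN on a weak onboard CPU, reasonably reliable Wi-Fi link.
    print(should_offload(payload_bits=64_000, bandwidth_bps=10e6, rtt_s=0.02,
                         t_infer_local=0.9, t_infer_remote=0.05,
                         delivery_prob=0.95))   # True: offloading cuts detection latency
```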
Conference Paper
Full-text available
Current software development methodologies and practices, while enabling the production of large complex software systems, can have a serious negative impact on software quality. These negative impacts include excessive and unnecessary software complexity, higher probability of software vulnerabilities, diminished execution performance in both time and space, and the inability to easily and rapidly deploy even minor updates to deployed software, to name a few. Consequently, there has been growing interest in the capability to do late-stage software (i.e., at the binary level) manipulation to address these negative impacts. Unfortunately, highly robust, late-stage manipulation of arbitrary binaries is difficult due to complex implementation techniques and the corresponding software structures. Indeed, many binary rewriters have limitations that constrain their use. For example, to the best of our knowledge, no binary rewriters handle applications that include and use exception handlers, a feature used in programming languages such as C++, Ada, Common Lisp, ML, to name a few. This paper describes how Zipr, an efficient binary rewriter, manipulates applications with exception handlers and tables which are required for unwinding the stack. While the technique should be applicable to other binary rewriters, it is particularly useful for Zipr because the recovery of the IR exposed in exception handling tables significantly improves the runtime performance of Zipr'ed binaries: average performance overhead on the full SPEC CPU2006 benchmark is reduced from 15% to 3%.
Conference Paper
Full-text available
Cyber-Physical Systems (CPSes) are being widely deployed in security-critical scenarios such as smart homes and medical devices. Unfortunately, the connectedness of these systems and their relative lack of security measures makes them ripe targets for attacks. Specification-based Intrusion Detection Systems (IDS) have been shown to be effective for securing CPSs. Unfortunately, deriving invariants for capturing the specifications of CPS systems is a tedious and error-prone process. Therefore, it is important to dynamically monitor the CPS system to learn its common behaviors and formulate invariants for detecting security attacks. Existing techniques for invariant mining only incorporate data and events, but not time. However, time is central to most CPS systems, and hence incorporating time in addition to data and events, is essential for achieving low false positives and false negatives. This paper proposes ARTINALI, which mines dynamic system properties by incorporating time as a first-class property of the system. We build ARTINALI-based Intrusion Detection Systems (IDSes) for two CPSes, namely smart meters and smart medical devices, and measure their efficacy. We find that the ARTINALI-based IDSes significantly reduce the ratio of false positives and false negatives by 16 to 48% and 89 to 95% respectively, over other dynamic invariant detection tools.
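A minimal sketch of mining a time-aware invariant of the form "event B follows event A within at most T seconds" from training traces and checking it at runtime; the event names, traces, and single invariant template are illustrative assumptions, far simpler than ARTINALI's data/event/time invariant classes.

```python
# Hypothetical time-aware invariant mining and checking over event traces.
def mine_max_gap(traces, ev_a, ev_b):
    """Largest observed A->B gap across training traces (the mined bound)."""
    max_gap = None
    for trace in traces:                       # trace: list of (timestamp, event)
        pending = None
        for t, ev in trace:
            if ev == ev_a:
                pending = t
            elif ev == ev_b and pending is not None:
                gap = t - pending
                max_gap = gap if max_gap is None else max(max_gap, gap)
                pending = None
    return max_gap

def check(trace, ev_a, ev_b, max_gap):
    """Yield (request_time, response_time) pairs that violate the mined bound."""
    pending = None
    for t, ev in trace:
        if ev == ev_a:
            pending = t
        elif ev == ev_b and pending is not None:
            if t - pending > max_gap:
                yield (pending, t)
            pending = None

if __name__ == "__main__":
    training = [[(0.0, "read_sensor"), (0.2, "send_report")],
                [(1.0, "read_sensor"), (1.25, "send_report")]]
    bound = mine_max_gap(training, "read_sensor", "send_report")     # 0.25 s
    suspect = [(5.0, "read_sensor"), (7.5, "send_report")]           # delayed report
    print(bound, list(check(suspect, "read_sensor", "send_report", bound)))
```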
Conference Paper
Full-text available
Search-based program repair automatically searches for a program fix within a given repair space. This may be accomplished by retrofitting a generic search algorithm for program repair as evidenced by the GenProg tool, or by building a customized search algorithm for program repair as in SPR. Unfortunately, automated program repair approaches may produce patches that may be rejected by programmers, because of which past works have suggested using human-written patches to produce templates to guide program repair. In this work, we take the position that we will not provide templates to guide the repair search because that may unduly restrict the repair space and attempt to overfit the repairs into one of the provided templates. Instead, we suggest the use of a set of anti-patterns — a set of generic forbidden transformations that can be enforced on top of any search-based repair tool. We show that by enforcing our anti-patterns, we obtain repairs that localize the correct lines or functions, involve less deletion of program functionality, and are mostly obtained more efficiently. Since our set of anti-patterns is generic, we have integrated them into existing search based repair tools, including GenProg and SPR, thereby allowing us to obtain higher quality program patches with minimal effort.
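A minimal sketch of enforcing anti-patterns on top of a search-based repair tool: candidate edits that match a forbidden transformation are discarded before validation. The two rules and the edit representation are illustrative assumptions; the paper defines a broader set at the program-transformation level.

```python
# Hypothetical anti-pattern filter over candidate repair edits.
import re

ANTI_PATTERNS = [
    # Forbid edits that simply delete a statement outright.
    lambda edit: edit["op"] == "delete",
    # Forbid inserting a trivially-false guard that just disables the code.
    lambda edit: edit["op"] == "insert_if"
                 and re.fullmatch(r"\s*(false|0)\s*", edit["condition"] or ""),
]

def filter_candidates(edits):
    """Keep only candidate edits that violate no anti-pattern."""
    return [e for e in edits if not any(rule(e) for rule in ANTI_PATTERNS)]

if __name__ == "__main__":
    candidates = [
        {"op": "delete",           "line": 42, "condition": None},
        {"op": "insert_if",        "line": 42, "condition": "0"},
        {"op": "insert_if",        "line": 42, "condition": "ptr != NULL"},
        {"op": "replace_operator", "line": 17, "condition": None},
    ]
    print(filter_candidates(candidates))      # only the last two edits survive
```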
Article
Full-text available
System monitoring can help to detect anomalies, but crafting monitors for robot systems is difficult due to the inherent complexity, changing, and uncertain operating environment. We address this challenge by automatically inferring system invariants and synthesizing those invariants into monitors to detect faults with an approach inspired by state of the art software engineering methods. Our approach is novel in that: (1) It automatically derives invariants from messages; (2) The invariant types are tailored to match the spatial, temporal, and architectural attributes of robotic systems; and (3) It automatically classifies and synthesizes invariants into an online invariants monitor node. We have assessed the approach in the context of two unmanned aerial vehicle systems running the Robot Operating System. We found that monitoring the inferred invariants can reduce system failure rates when facing unexpected contexts from 76 to 11%, and can detect differences between the lab environment and the field deployments.
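A minimal sketch of inferring range invariants from recorded messages and using them as an online monitor; the field names, margin, and the single "range" invariant type are illustrative assumptions, whereas the paper also infers temporal and architectural invariant types.

```python
# Hypothetical range-invariant inference from logged messages, plus a monitor.
def infer_ranges(messages, margin=0.05):
    """messages: list of dicts of numeric fields. Returns field -> (lo, hi)."""
    ranges = {}
    for msg in messages:
        for field, value in msg.items():
            lo, hi = ranges.get(field, (value, value))
            ranges[field] = (min(lo, value), max(hi, value))
    # Widen slightly so bounds observed in the lab tolerate small variation.
    return {f: (lo - margin * abs(lo), hi + margin * abs(hi))
            for f, (lo, hi) in ranges.items()}

def monitor(msg, ranges):
    """Return the fields of msg that violate the inferred invariants."""
    return [f for f, v in msg.items()
            if f in ranges and not (ranges[f][0] <= v <= ranges[f][1])]

if __name__ == "__main__":
    lab_log = [{"battery_v": 12.4, "altitude": 2.0},
               {"battery_v": 12.1, "altitude": 2.5},
               {"battery_v": 11.9, "altitude": 3.0}]
    invariants = infer_ranges(lab_log)
    field_sample = {"battery_v": 9.8, "altitude": 2.2}   # brownout in the field
    print(monitor(field_sample, invariants))             # ['battery_v']
```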
Article
Full-text available
We propose NOPOL, an approach to automatic repair of buggy conditional statements (i.e., if-then-else statements). This approach takes a buggy program as well as a test suite as input and generates a patch with a conditional expression as output. The test suite is required to contain passing test cases to model the expected behavior of the program and at least one failing test case that reveals the bug to be repaired. The process of NOPOL consists of three major phases. First, NOPOL employs angelic fix localization to identify expected values of a condition during the test execution. Second, runtime trace collection is used to collect variables and their actual values, including primitive data types and object-oriented features (e.g., nullness checks), to serve as building blocks for patch generation. Third, NOPOL encodes these collected data into an instance of a Satisfiability Modulo Theory (SMT) problem; then a feasible solution to the SMT instance is translated back into a code patch. We evaluate NOPOL on 22 real-world bugs (16 bugs with buggy IF conditions and 6 bugs with missing preconditions) on two large open-source projects, namely Apache Commons Math and Apache Commons Lang. Empirical analysis on these bugs shows that our approach can effectively fix bugs with buggy IF conditions and missing preconditions. We illustrate the capabilities and limitations of NOPOL using case studies of real bug fixes.
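The two central ideas, angelic fix localization and condition synthesis, can be illustrated in miniature: force each outcome of the suspect condition to find the "angelic" outcome that makes every test pass, then search for a concrete condition over the observed variables that reproduces those outcomes. The toy function, tests, and brute-force enumeration below are illustrative assumptions standing in for NOPOL's runtime trace collection and SMT encoding.

```python
# Hypothetical miniature of angelic fix localization + condition synthesis.
from itertools import product

def buggy_abs(x, cond):
    # The original buggy condition is replaced by `cond`, an oracle hook that
    # lets us force either outcome (the "angelic value").
    return -x if cond(x) else x

TESTS = [(-3, 3), (-1, 1), (0, 0), (2, 2)]          # (input, expected)

def angelic_values():
    """For each test input, find the condition outcome that makes it pass."""
    needed = {}
    for x, expected in TESTS:
        for outcome in (False, True):
            if buggy_abs(x, lambda _x, o=outcome: o) == expected:
                needed[x] = outcome
                break
    return needed

def synthesize(needed):
    """Enumerate simple predicates `x OP const` matching all angelic outcomes."""
    ops = {"<": lambda a, b: a < b, "<=": lambda a, b: a <= b,
           ">": lambda a, b: a > b, ">=": lambda a, b: a >= b}
    for (name, op), const in product(ops.items(), range(-2, 3)):
        if all(op(x, const) == want for x, want in needed.items()):
            return f"x {name} {const}"
    return None

if __name__ == "__main__":
    print(synthesize(angelic_values()))             # "x < 0"
```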
Article
Testing plays an essential role in ensuring the safety and quality of cyberphysical systems (CPSs). One of the main challenges in automated and software-in-the-loop simulation testing of CPSs is defining effective oracles that can check that a given system conforms to expectations of desired behavior. Manually specifying such oracles can be tedious, complex, and error-prone, and so techniques for automatically learning oracles are attractive. Characteristics of CPSs, such as limited or no access to source code, behavior that is non-deterministic and sensitive to noise, and that the system may respond differently to input based on its context introduce considerable challenges for automated oracle learning. We present Mithra, a novel, unsupervised oracle learning technique for CPSs that operates on existing telemetry data. It uses a three-step multivariate time series clustering to discover the set of unique, correct behaviors for a CPS, which it uses to construct robust oracles. We instantiate our proposed technique for ArduPilot, a popular, open-source autopilot software. On a set of 24 bugs, we show that Mithra effectively identifies buggy executions with few false positives and outperforms AR-SI, a state-of-the-art CPS oracle learning technique. We demonstrate Mithra's wider applicability by applying it to an autonomous racer built for the Robot Operating System.
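A minimal sketch of learning an oracle from telemetry by clustering: summarize each run as a feature vector, cluster the (assumed correct) training runs, and flag a new run whose distance to every cluster centre exceeds the spread seen during training. This is a drastic simplification of the three-step multivariate time-series clustering described above; the features, plain k-means, and thresholds are illustrative assumptions.

```python
# Hypothetical telemetry-clustering oracle.
import numpy as np

def features(trace):
    """trace: (T, d) array of telemetry; summarize with per-channel mean and std."""
    return np.concatenate([trace.mean(axis=0), trace.std(axis=0)])

def fit(traces, k=2, iters=50, seed=0):
    X = np.stack([features(t) for t in traces])
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):                                     # plain k-means
        labels = np.argmin(np.linalg.norm(X[:, None] - centres[None], axis=2), axis=1)
        centres = np.stack([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else centres[j] for j in range(k)])
    radius = max(np.linalg.norm(x - centres[l]) for x, l in zip(X, labels))
    return centres, radius

def looks_buggy(trace, centres, radius, slack=1.5):
    d = min(np.linalg.norm(features(trace) - c) for c in centres)
    return d > slack * radius

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    nominal = [rng.normal([0.0, 10.0], [0.1, 0.5], size=(100, 2)) for _ in range(20)]
    centres, radius = fit(nominal)
    oscillating = rng.normal([0.0, 10.0], [3.0, 6.0], size=(100, 2))   # unstable run
    print(looks_buggy(oscillating, centres, radius))                   # True
```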
Conference Paper
Debugging is a time-consuming task for software engineers. Automated Program Repair (APR) has proved successful in automatically fixing bugs for many real-world applications. Search-based APR generates program variants that are then evaluated on the test suite of the original program, using a fitness function. In the vast majority of search-based APR work only the Boolean test case result is taken into account when evaluating the fitness of a program variant. We posit that more fine-grained fitness functions could lead to a more diverse fitness landscape, and thus provide better guidance for the APR search algorithms. We thus present 2Phase, a fitness function that also incorporates the output of test case failures, and compare it with ARJAe, which shares the same principles, and with the standard fitness, which only takes the Boolean test case result into consideration. We conduct the comparison on 16 buggy programs from the QuixBugs benchmark using the Gin genetic improvement framework. The results show no significant difference in the performance of all three fitness functions considered. However, Gin was able to find 8 correct fixes, more than any of the APR tools in the recent QuixBugs study.
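A minimal sketch of a fitness function that uses more than the Boolean test verdict: failing tests contribute partial credit proportional to how close their actual output is to the expected output. The weights and the normalized string distance are illustrative choices in the spirit of the comparison above, not the exact 2Phase or ARJAe definitions.

```python
# Hypothetical fine-grained fitness for ranking program variants during repair.
import difflib

def output_distance(expected, actual):
    """0.0 when the outputs match, approaching 1.0 as they diverge."""
    return 1.0 - difflib.SequenceMatcher(None, str(expected), str(actual)).ratio()

def fitness(test_results, w_pass=10.0):
    """test_results: list of (passed, expected_output, actual_output)."""
    score = 0.0
    for passed, expected, actual in test_results:
        if passed:
            score += w_pass
        else:
            # Partial credit: a near-miss output ranks above a wildly wrong one.
            score += w_pass * (1.0 - output_distance(expected, actual))
    return score

if __name__ == "__main__":
    variant_a = [(True, "42", "42"), (False, "[1, 2, 3]", "[1, 2]")]
    variant_b = [(True, "42", "42"), (False, "[1, 2, 3]", "error")]
    print(fitness(variant_a) > fitness(variant_b))   # True: variant A is closer
```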
Article
Automated program repair is an emerging technology that seeks to automatically rectify program errors and vulnerabilities. Repair techniques are driven by a correctness criterion that is often in the form of a test suite. Such test-based repair may produce overfitting patches, where the patches produced fail on tests outside the test suite driving the repair. In this work, we present a repair method that fixes program vulnerabilities without the need for a voluminous test suite. Given a vulnerability as evidenced by an exploit, the technique extracts a constraint representing the vulnerability with the help of sanitizers. The extracted constraint serves as a proof obligation that our synthesized patch should satisfy. The proof obligation is met by propagating the extracted constraint to locations that are deemed to be “suitable” fix locations. An implementation of our approach (ExtractFix) on top of the KLEE symbolic execution engine shows its efficacy in fixing a wide range of vulnerabilities taken from the ManyBugs benchmark, real-world CVEs and Google’s OSS-Fuzz framework. We believe that our work presents a way forward for the overfitting problem in program repair by generalizing observable hazards/vulnerabilities (as constraint) from a single failing test or exploit.
Conference Paper
We revisit the performance of template-based APR to build comprehensive knowledge about the effectiveness of fix patterns, and to highlight the importance of complementary steps such as fault localization or donor code retrieval. To that end, we first investigate the literature to collect, summarize and label recurrently-used fix patterns. Based on the investigation, we build TBar, a straightforward APR tool that systematically attempts to apply these fix patterns to program bugs. We thoroughly evaluate TBar on the Defects4J benchmark. In particular, we assess the actual qualitative and quantitative diversity of fix patterns, as well as their effectiveness in yielding plausible or correct patches. Eventually, we find that, assuming a perfect fault localization, TBar correctly/plausibly fixes 74/101 bugs. Replicating a standard and practical pipeline of APR assessment, we demonstrate that TBar correctly fixes 43 bugs from Defects4J, an unprecedented performance in the literature (including all approaches, i.e., template-based, stochastic mutation-based or synthesis-based APR).
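A minimal sketch of the template-based repair loop: fix patterns are applied at the suspicious statement and every resulting candidate program is validated against the tests. The toy Python subject, the two operator-mutation patterns, and the exec-based harness are illustrative assumptions; TBar itself rewrites Java ASTs with a much richer pattern catalogue.

```python
# Hypothetical template-based repair: apply fix patterns, validate against tests.
FIX_PATTERNS = {
    "mutate relational operator": [(" > ", " >= "), (" < ", " <= ")],
    "mutate arithmetic operator": [(" + ", " - "), (" - ", " + ")],
}

def candidates(lines, suspicious_line):
    for name, substitutions in FIX_PATTERNS.items():
        for old, new in substitutions:
            if old in lines[suspicious_line]:
                patched = list(lines)
                patched[suspicious_line] = patched[suspicious_line].replace(old, new, 1)
                yield name, "\n".join(patched)

def repair(source, suspicious_line, tests):
    for name, patched_source in candidates(source.splitlines(), suspicious_line):
        env = {}
        exec(patched_source, env)                    # toy harness knows the entry point
        if all(env["is_adult"](arg) == want for arg, want in tests):
            return name, patched_source
    return None

if __name__ == "__main__":
    buggy = "def is_adult(age):\n    return age > 18\n"
    tests = [(17, False), (18, True), (30, True)]    # spec: age 18 counts as adult
    print(repair(buggy, suspicious_line=1, tests=tests))
```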
Conference Paper
Automated Program Repair (APR) is one of the most recent advances in automated debugging, and can directly fix buggy programs with minimal human intervention. Although various advanced APR techniques (including search-based or semantic-based ones) have been proposed, they mainly work at the source-code level and it is not clear how bytecode-level APR performs in practice. Also, empirical studies of the existing techniques on bugs beyond what has been reported in the original papers are rather limited. In this paper, we implement the first practical bytecode-level APR technique, PraPR, and present the first extensive study on fixing real-world bugs (e.g., Defects4J bugs) using JVM bytecode mutation. The experimental results show that surprisingly even PraPR with only the basic traditional mutators can produce genuine fixes for 17 bugs; with simple additional commonly used APR mutators, PraPR is able to produce genuine fixes for 43 bugs, significantly outperforming state-of-the-art APR, while being over 10X faster. Furthermore, we performed an extensive study of PraPR and other recent APR tools on a large number of additional real-world bugs, and demonstrated the overfitting problem of recent advanced APR tools for the first time. Lastly, PraPR has also successfully fixed bugs for other JVM languages (e.g., for the popular Kotlin language), indicating PraPR can greatly complement existing source-code-level APR.
Conference Paper
Existing program repair systems modify a buggy program so that the modified program passes given tests. The repaired program may not satisfy even the most basic notion of correctness, namely crash-freedom. In other words, repair tools might generate patches which over-fit the test data driving the repair, and the automatically repaired programs may even introduce crashes or vulnerabilities. We propose an integrated approach for detecting and discarding crashing patches. Our approach fuses test and patch generation into a single process, in which patches are generated with the objective of passing existing tests, and new tests are generated with the objective of filtering out over-fitted patches by distinguishing candidate patches in terms of behavior. We use crash-freedom as the oracle to discard patch candidates which crash on the new tests. At its core, our approach defines a grey-box fuzzing strategy that gives higher priority to new tests that separate patches behaving equivalently on existing tests. This test generation strategy identifies semantic differences between patch candidates, and reduces over-fitting in program repair. We evaluated our approach on real-world vulnerabilities and open-source subjects from the Google OSS-Fuzz infrastructure. We found that our tool Fix2Fit (implementing patch space directed test generation) produces crash-avoiding patches. While we do not give formal guarantees about crash-freedom, cross-validation with fuzzing tools and their sanitizers provides greater confidence about the crash-freedom of our suggested patches.
Conference Paper
Effective program repair techniques, which modify faulty programs to fix them with respect to given test suites, can substantially reduce the cost of manual debugging. A common repair approach is to iteratively first generate candidate programs with possible bug fixes and then validate them against the given tests until a candidate that passes all the tests is found. While this approach is conceptually simple, due to the potentially high number of candidates that need to first be generated and then be compiled and tested, existing repair techniques that embody this approach have relatively low effectiveness, especially for faults at a fine granularity. To tackle this limitation, we introduce a novel repair technique, SketchFix, which generates candidate fixes on demand (as needed) during the test execution. Instead of iteratively re-compiling and re-executing each actual candidate program, SketchFix translates faulty programs to sketches, i.e., partial programs with "holes", and compiles each sketch once which may represent thousands of concrete candidates. With the insight that the space of candidates can be reduced substantially by utilizing the runtime behaviors of the tests, SketchFix lazily initializes the candidates of the sketches while validating them against the test execution. We experimentally evaluate SketchFix on the Defects4J benchmark and the experimental results show that SketchFix works particularly well in repairing bugs with expression manipulation at the AST node-level granularity compared to other program repair techniques. Specifically, SketchFix correctly fixes 19 out of 357 defects in 23 minutes on average using the default setting. In addition, SketchFix finds the first repair with 1.6% of re-compilations (#compiled sketches/#candidates) and 3.0% of re-executions out of all repair candidates.
Conference Paper
Software maintenance, especially bug fixing, is one of the most expensive problems in software practice. Bugs have global impact in terms of cost and time, and they also reflect negatively on a company's brand. GenProg is a method for Automated Program Repair based on an evolutionary approach. It aims to generate bug repairs without human intervention or a need for special instrumentation or source code annotations. Its canonical fitness function evaluates each variant as the weighted sum of the test cases that a modified program passes. However, it evaluates distinct individuals with the same fitness score (plateaus). We propose a fitness function that minimizes these plateaus using dynamic analysis to increase the granularity of the fitness information that can be gleaned from test case execution, increasing the diversity of the population, the number of repairs found (expressiveness), and the efficiency of the search. We evaluate the proposed fitness functions on two standard benchmarks for Automated Program Repair: IntroClass and ManyBugs. We find that our proposed fitness function minimizes plateaus, increases expressiveness, and the efficiency of the search.
Conference Paper
Test-based automatic program repair has attracted a lot of attention in recent years. However, the test suites in practice are often too weak to guarantee correctness and existing approaches often generate a large number of incorrect patches. To reduce the number of incorrect patches generated, we propose a novel approach that heuristically determines the correctness of the generated patches. The core idea is to exploit the behavior similarity of test case executions. The passing tests on original and patched programs are likely to behave similarly while the failing tests on original and patched programs are likely to behave differently. Also, if two tests exhibit similar runtime behavior, the two tests are likely to have the same test results. Based on these observations, we generate new test inputs to enhance the test suites and use their behavior similarity to determine patch correctness. Our approach is evaluated on a dataset consisting of 139 patches generated from existing program repair systems including jGenProg, Nopol, jKali, ACS and HDRepair. Our approach successfully prevented 56.3% of the incorrect patches from being generated, without blocking any correct patches.
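A minimal sketch of the behaviour-similarity heuristic: a patch is suspicious if a previously passing test changes behaviour noticeably, or if a previously failing test barely changes at all. Here "behaviour" is reduced to a statement-coverage set and similarity to a Jaccard score, which are illustrative simplifications of the runtime features used above.

```python
# Hypothetical patch-correctness heuristic based on execution-behaviour similarity.
def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

def likely_correct(runs, threshold=0.8):
    """runs: list of (was_passing, coverage_original, coverage_patched)."""
    for was_passing, cov_orig, cov_patch in runs:
        similar = jaccard(cov_orig, cov_patch) >= threshold
        if was_passing and not similar:
            return False     # a passing test changed behaviour: suspicious patch
        if not was_passing and similar:
            return False     # the failing test barely changed: bug likely untouched
    return True

if __name__ == "__main__":
    plausible = [(True, {1, 2, 3, 4}, {1, 2, 3, 4}),       # passing test unchanged
                 (False, {1, 2, 5, 6}, {1, 2, 7, 8})]      # failing test redirected
    overfitted = [(True, {1, 2, 3, 4}, {9, 10}),           # passing test disrupted
                  (False, {1, 2, 5, 6}, {1, 2, 5, 6})]
    print(likely_correct(plausible), likely_correct(overfitted))   # True False
```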
Article
On 4 August 2016, DARPA conducted the final event of the Cyber Grand Challenge (CGC). The challenge in CGC was to build an autonomous system capable of playing in a capture-the-flag hacking competition. The final event pitted the systems from seven finalists against each other, with each system attempting to defend its own network services while proving vulnerabilities in other systems’ defended services. Xandra, our automated cyber reasoning system, took second place overall in the final event. Xandra placed first in security (preventing exploits), second in availability (keeping services operational and efficient), and fourth in evaluation (proving vulnerabilities in competitor services). Xandra also drew the least power of any of the competitor systems. In this article, we describe the high-level strategies applied by Xandra, their realization in Xandra’s architecture, the synergistic interplay between offense and defense, and finally, lessons learned via post-mortem analysis of the final event.
Article
As self-driving vehicles become increasingly popular, new generations of attackers will seek to exploit vulnerabilities introduced by the technologies that underpin such vehicles for a range of motivations (e.g. curiosity, criminally-motivated, financially-motivated and state-sponsored). For example, vulnerabilities in self-driving vehicles may be exploited to be used in terrorist attacks such as driving into places of mass gatherings (i.e. using driverless vehicles as weapons to cause death or serious bodily injury). This survey presents a categorized summary of security methodologies developed to secure sensing, positioning, vision, and network technologies that can be equipped in driverless vehicles. These technologies have the potential to benefit their security from tailored machine learning models. Future research opportunities are also identified.
Article
The Zipr tool reverse-engineers binary code and applies specific and generic code transformations, with the goal of securing binary code.
Conference Paper
Research in Search-Based Automated Program Repair has demonstrated promising results, but has nevertheless been largely confined to small, single-edit patches using a limited set of mutation operators. Tackling a broader spectrum of bugs will require multiple edits and a larger set of operators, leading to a combinatorial explosion of the search space. This motivates the need for more efficient search techniques. We propose to use the test case results of candidate patches to localise suitable fix locations. We analysed the test suite results of single-edit patches, generated from a random walk across 28 bugs in 6 programs. Based on the findings of this analysis, we propose a number of mutation-based fault localisation techniques, which we subsequently evaluate by measuring how accurately they locate the statements at which the search was able to generate a solution. After demonstrating that these techniques fail to result in a significant improvement, we discuss why this may be the case, despite the successes of mutation-based fault localisation in previous studies.
Conference Paper
A major challenge towards large scale deployment of autonomous mobile robots is to program them with formal guarantees and high assurance of correct operation. To this end, we present a framework for building safe robots. Our approach for validating the end-to-end correctness of a robotics system consists of two parts: (1) a high-level programming language for implementing and systematically testing the reactive robotics software via model checking; (2) a signal temporal logic (STL) based online monitoring system to ensure that the assumptions about the low-level controllers (discrete models) used during model checking hold at runtime. Combining model checking with runtime verification helps us bridge the gap between software verification (discrete) that makes assumptions about the low-level controllers and the physical world, and the actual execution of the software on a real robotic platform in the physical world. To demonstrate the efficacy of our approach, we build a safe adaptive surveillance system and present software-in-the-loop simulations of the application.
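A minimal sketch of the kind of runtime contract the STL monitor checks between the high-level model and the low-level controllers: a bounded-response property G (stop_requested -> F[0,H] speed <= v_eps), i.e. every stop request must be honoured within H samples. The signal names, horizon, and threshold are illustrative assumptions.

```python
# Hypothetical online monitor for a bounded-response (STL-style) property.
class BoundedResponseMonitor:
    def __init__(self, horizon, v_eps):
        self.horizon = horizon
        self.v_eps = v_eps
        self.pending = []            # sample indices of unanswered stop requests

    def update(self, t, stop_requested, speed):
        """Feed one sample; return the timestamps of requests now known violated."""
        if stop_requested:
            self.pending.append(t)
        if speed <= self.v_eps:
            self.pending.clear()     # every outstanding request is satisfied
        violated = [t0 for t0 in self.pending if t - t0 > self.horizon]
        self.pending = [t0 for t0 in self.pending if t - t0 <= self.horizon]
        return violated

if __name__ == "__main__":
    mon = BoundedResponseMonitor(horizon=3, v_eps=0.05)
    samples = [(0, False, 1.0), (1, True, 1.0), (2, False, 0.8),
               (3, False, 0.6), (4, False, 0.5), (5, False, 0.4)]
    for t, stop, v in samples:
        for t0 in mon.update(t, stop, v):
            print(f"stop requested at t={t0} not honoured within 3 samples (seen at t={t})")
```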
Conference Paper
Small unmanned vehicles support many mission critical tasks. However, the provenance of these systems is usually not known, devices may be deployed in contested environments, and operators are often not computer system experts. Yet, the benefits of these systems outweigh the risks, and critical tasks and data are delegated to these systems without a sound basis for assessing trust. This paper describes an approach that can determine an operator’s trust in a mission system and applies continuous monitoring to indicate if the performance is within a trusted operating region. In an early prototype we (a) define a multi-dimensional trusted operating region for a given mission, (b) monitor the system in-mission, and (c) detect when anomalous effects put the mission at risk.
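A minimal sketch of a multi-dimensional trusted operating region and its continuous check against telemetry; the dimensions and bounds are illustrative assumptions for a small UAV mission, not the prototype's actual parameters.

```python
# Hypothetical trusted-operating-region check over mission telemetry.
TRUSTED_REGION = {
    "altitude_m":      (20.0, 120.0),
    "groundspeed_mps": (0.0, 18.0),
    "battery_v":       (10.5, 12.8),
    "gps_hdop":        (0.0, 2.0),
}

def check_sample(sample):
    """Return the dimensions on which the sample falls outside the region."""
    return {dim: value for dim, value in sample.items()
            if dim in TRUSTED_REGION
            and not (TRUSTED_REGION[dim][0] <= value <= TRUSTED_REGION[dim][1])}

if __name__ == "__main__":
    telemetry = [
        {"altitude_m": 60.0, "groundspeed_mps": 12.0, "battery_v": 12.1, "gps_hdop": 0.9},
        {"altitude_m": 60.0, "groundspeed_mps": 25.0, "battery_v": 12.0, "gps_hdop": 6.4},
    ]
    for i, sample in enumerate(telemetry):
        outside = check_sample(sample)
        if outside:
            print(f"sample {i}: outside trusted region on {sorted(outside)}")
```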
Conference Paper
A notable class of techniques for automatic program repair is known as semantics-based. Such techniques, e.g., Angelix, infer semantic specifications via symbolic execution, and then use program synthesis to construct new code that satisfies those inferred specifications. However, the obtained specifications are naturally incomplete, leaving the synthesis engine with a difficult task of synthesizing a general solution from a sparse space of many possible solutions that are consistent with the provided specifications but that do not necessarily generalize. We present S3, a new repair synthesis engine that leverages programming-by-examples methodology to synthesize high-quality bug repairs. The novelty in S3 that allows it to tackle the sparse search space to create more general repairs is threefold: (1) A systematic way to customize and constrain the syntactic search space via a domain-specific language, (2) An efficient enumeration-based search strategy over the constrained search space, and (3) A number of ranking features based on measures of the syntactic and semantic distances between candidate solutions and the original buggy program. We compare S3's repair effectiveness with state-of-the-art synthesis engines Angelix, Enumerative, and CVC4. S3 can successfully and correctly fix at least three times more bugs than the best baseline on datasets of 52 bugs in small programs, and 100 bugs in real-world large programs.
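A minimal sketch of programming-by-examples repair synthesis: enumerate expressions from a tiny domain-specific language, keep those consistent with the input/output examples, and rank the survivors by syntactic closeness to the original buggy expression. The DSL, examples, and single distance feature are illustrative assumptions; S3 uses a richer DSL and several syntactic and semantic ranking features.

```python
# Hypothetical enumerative PBE synthesis with distance-based ranking.
from itertools import product
import difflib

VARS, CONSTS, OPS = ["x", "y"], ["0", "1", "2"], ["+", "-", "*"]

def enumerate_exprs():
    for a, op, b in product(VARS, OPS, VARS + CONSTS):
        yield f"{a} {op} {b}"

def consistent(expr, examples):
    return all(eval(expr, {}, env) == out for env, out in examples)

def repair(buggy_expr, examples):
    candidates = [e for e in enumerate_exprs() if consistent(e, examples)]
    # Prefer the candidate syntactically closest to the original expression.
    return max(candidates, default=None,
               key=lambda e: difflib.SequenceMatcher(None, buggy_expr, e).ratio())

if __name__ == "__main__":
    # Buggy expression `x - y`; the examples pin down the intended `x + y`.
    examples = [({"x": 1, "y": 2}, 3), ({"x": 5, "y": 5}, 10), ({"x": 2, "y": 0}, 2)]
    print(repair("x - y", examples))     # "x + y" outranks the equivalent "y + x"
```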
Conference Paper
Since debugging is a time-consuming activity, automated program repair tools such as GenProg have garnered interest. A recent study revealed that the majority of GenProg repairs avoid bugs simply by deleting functionality. We found that SPR, a state-of-the-art repair tool proposed in 2015, still deletes functionality in their many "plausible" repairs. Unlike generate-and-validate systems such as GenProg and SPR, semantic analysis based repair techniques synthesize a repair based on semantic information of the program. While such semantics-based repair methods show promise in terms of quality of generated repairs, their scalability has been a concern so far. In this paper, we present Angelix, a novel semantics-based repair method that scales up to programs of similar size as are handled by search-based repair tools such as GenProg and SPR. This shows that Angelix is more scalable than previously proposed semantics based repair methods such as SemFix and DirectFix. Furthermore, our repair method can repair multiple buggy locations that are dependent on each other. Such repairs are hard to achieve using SPR and GenProg. In our experiments, Angelix generated repairs from large-scale real-world software such as wireshark and php, and these generated repairs include multi-location repairs. We also report our experience in automatically repairing the well-known Heartbleed vulnerability.
Conference Paper
The speed with which newly discovered software vulnerabilities are patched is a critical factor in mitigating the harm caused by subsequent exploits. Unfortunately, software vendors are often slow or unwilling to patch vulnerabilities, especially in embedded systems which frequently have no mechanism for updating factory-installed firmware. The situation is particularly dire for commercial off the shelf (COTS) software users, who lack source code and are wholly dependent on patches released by the vendor. We propose a solution in which the vulnerabilities drive an automated evolutionary computation repair process capable of directly patching embedded systems firmware. Our approach does not require access to source code, regression tests, or any participation from the software vendor. Instead, we present an interactive evolutionary algorithm that searches for patches that resolve target vulnerabilities while relying heavily on post-evolution difference minimization to remove most regressions. Extensions to prior work in evolutionary program repair include: repairing vulnerabilities in COTS router firmware; handling stripped MIPS executables; operating without fault localization information; operating without a regression test suite; and incorporating user interaction into the evolutionary repair process. We demonstrate this method by repairing two well-known vulnerabilities in version 4 of NETGEAR's WNDR3700 wireless router before NETGEAR released patches publicly for the vulnerabilities. Without fault localization we are able to find repair edits that are not located on execution traces. Without the advantage of regression tests to guide the search, we find that 80% of repairs of the example vulnerabilities retain program functionality after minimization. With minimal user interaction to demonstrate required functionality, 100% of the proposed repairs were able to address the vulnerabilities while retaining required functionality.
Conference Paper
We present SPR, a new program repair system that combines staged program repair and condition synthesis. These techniques enable SPR to work productively with a set of parameterized transformation schemas to generate and efficiently search a rich space of program repairs. Together these techniques enable SPR to generate correct repairs for over five times as many defects as previous systems evaluated on the same benchmark set.
Conference Paper
Automated program repair has shown promise for reducing the significant manual effort debugging requires. This paper addresses a deficit of earlier evaluations of automated repair techniques caused by repairing programs and evaluating generated patches' correctness using the same set of tests. Since tests are an imperfect metric of program correctness, evaluations of this type do not discriminate between correct patches and patches that overfit the available tests and break untested but desired functionality. This paper evaluates two well-studied repair tools, GenProg and TrpAutoRepair, on a publicly available benchmark of bugs, each with a human-written patch. By evaluating patches using tests independent from those used during repair, we find that the tools are unlikely to improve the proportion of independent tests passed, and that the quality of the patches is proportional to the coverage of the test suite used during repair. For programs that pass most tests, the tools are as likely to break tests as to fix them. However, novice developers also overfit, and automated repair performs no worse than these developers. In addition to overfitting, we measure the effects of test suite coverage, test suite provenance, and starting program quality, as well as the difference in quality between novice-developer-written and tool-generated patches when quality is assessed with a test suite independent from the one used for patch generation.
Article
This article describes and evaluates DIG, a dynamic invariant generator that infers invariants from observed program traces, focusing on numerical and array variables. For numerical invariants, DIG supports both nonlinear equalities and inequalities of arbitrary degree defined over numerical program variables. For array invariants, DIG generates nested relations among multidimensional array variables. These properties are nontrivial and challenging for current static and dynamic invariant analysis methods. The key difference between DIG and existing dynamic methods is its generative technique, which infers invariants directly from traces, instead of using traces to filter out predefined templates. To generate accurate invariants, DIG employs ideas and tools from the mathematical and formal methods domains, including equation solving, polyhedra construction, and theorem proving; for example, DIG represents and reasons about polynomial invariants using geometric shapes. Experimental results on 27 mathematical algorithms and an implementation of AES encryption provide evidence that DIG is effective at generating invariants for these programs.
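A minimal sketch of the generative idea behind DIG's equality invariants: evaluate a small set of candidate terms at every trace point and look for a nullspace vector of the resulting matrix, which gives a polynomial relation that holds on all observed points. The repeated-addition multiplication loop and the tiny term set are illustrative assumptions; DIG handles higher-degree terms, inequalities, arrays, and proof support.

```python
# Hypothetical trace-based inference of a polynomial equality invariant.
import numpy as np

def trace_points(a=7, b=9):
    """Record (x, y, r, a, b) at the head of a repeated-addition multiply loop,
    whose loop invariant is x*y + r == a*b."""
    x, y, r = a, b, 0
    points = []
    while y > 0:
        points.append({"x": x, "y": y, "r": r, "a": a, "b": b})
        r += x
        y -= 1
    return points

def infer_equality(points, terms):
    # Row i evaluates every candidate term at trace point i.
    M = np.array([[eval(t, {}, p) for t in terms] for p in points], dtype=float)
    _, s, vt = np.linalg.svd(M)
    if s[-1] < 1e-8 * s[0]:                  # near-zero singular value => nullspace
        c = vt[-1]
        c = c / np.min(np.abs(c[np.abs(c) > 1e-8]))      # scale smallest coeff to 1
        return " + ".join(f"{ci:.0f}*{t}" for ci, t in zip(c, terms)
                          if abs(ci) > 1e-8) + " == 0"
    return None

if __name__ == "__main__":
    terms = ["x*y", "r", "a*b"]              # deliberately tiny candidate term set
    print(infer_equality(trace_points(), terms))   # 1*x*y + 1*r + -1*a*b == 0 (up to sign)
```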
Article
Robust and powerful software instrumentation tools are essential for program analysis tasks such as profiling, performance evaluation, and bug detection. To meet this need, we have developed a new instrumentation system called Pin. Our goals are to provide easy-to-use, portable, transparent, and efficient instrumentation. Instrumentation tools (called Pintools) are written in C/C++ using Pin's rich API. Pin follows the model of ATOM, allowing the tool writer to analyze an application at the instruction level without the need for detailed knowledge of the underlying instruction set. The API is designed to be architecture independent whenever possible, making Pintools source compatible across different architectures. However, a Pintool can access architecture-specific details when necessary. Instrumentation with Pin is mostly transparent as the application and Pintool observe the application's original, uninstrumented behavior. Pin uses dynamic compilation to instrument executables while they are running. For efficiency, Pin uses several techniques, including inlining, register re-allocation, liveness analysis, and instruction scheduling to optimize instrumentation. This fully automated approach delivers significantly better instrumentation performance than similar tools. For example, Pin is 3.3x faster than Valgrind and 2x faster than DynamoRIO for basic-block counting. To illustrate Pin's versatility, we describe two Pintools in daily use to analyze production software. Pin is publicly available for Linux platforms on four architectures: IA32 (32-bit x86), EM64T (64-bit x86), Itanium®, and ARM. In the ten months since Pin 2 was released in July 2004, there have been over 3000 downloads from its website.
Article
Program invariants are important for defect detection, program verification, and program repair. However, existing techniques have limited support for important classes of invariants such as disjunctions, which express the semantics of conditional statements. We propose a method for generating disjunctive invariants over numerical domains, which are inexpressible using classical convex polyhedra. Using dynamic analysis and reformulating the problem in non-standard "max-plus" and "min-plus" algebras, our method constructs hulls over program trace points. Critically, we introduce and infer a weak class of such invariants that balances expressive power against the computational cost of generating nonconvex shapes in high dimensions. Existing dynamic inference techniques often generate spurious invariants that fit some program traces but do not generalize. With the insight that generating dynamic invariants is easy, we propose to verify these invariants statically using k-inductive SMT theorem proving which allows us to validate invariants that are not classically inductive. Results on difficult kernels involving nonlinear arithmetic and abstract arrays suggest that this hybrid approach efficiently generates and proves correct program invariants.