
Paolo TonellaFondazione Bruno Kessler | FBK · Software Engineering (SE)
Paolo Tonella
PhD
About
356
Publications
88,427
Reads
How we measure 'reads'
A 'read' is counted each time someone views a publication summary (such as the title, abstract, and list of authors), clicks on a figure, or views or downloads the full-text. Learn more
10,730
Citations
Citations since 2017
Introduction
Paolo Tonella is Full Professor at the Faculty of Informatics and at the Software Institute of Università della Svizzera Italiana (USI) in Lugano, Switzerland. He is also Honorary Professor at University College London, UK. Until mid 2018 he has been Head of Software Engineering at Fondazione Bruno Kessler, Trento, Italy. Paolo Tonella holds an ERC Advanced grant as Principal Investigator of the project PRECRIME.
Publications
Publications (356)
Recent decades have seen the rise of large-scale Deep Neural Networks (DNNs) to achieve human-competitive performance in a variety of artificial intelligence tasks. Often consisting of hundreds of millions, if not hundreds of billion parameters, these DNNs are too large to be deployed to, or efficiently run on resource-constrained devices such as m...
This is the Replicated Computational Results (RCR) Report for our TOSEM paper ”Adopting Two Supervisors for Efficient Use of Large-Scale Remote Deep Neural Networks”, where we propose a novel client-server architecture allowing to leverage the high accuracy of huge neural networks running on remote servers while reducing the economical and latency...
Recent advances in Deep Neural Networks (DNNs) and sensor technologies are enabling autonomous driving systems (ADSs) with an ever-increasing level of autonomy. However, assessing their dependability remains a critical concern. State-of-the-art ADS testing approaches modify the controllable attributes of a simulated driving environment until the AD...
When deployed in the operation environment, Deep Learning (DL)
systems often experience the so-called development to operation
(dev2op) data shift, which causes a lower prediction accuracy on
field data as compared to the one measured on the test set during
development. To address the dev2op shift, developers must obtain
new data with the newly obs...
Modern software systems accommodate complex configurations and execution conditions that depend on the environment where the software is run. While in house testing can exercise only a fraction of such execution contexts, in vivo testing can take advantage of the execution state observed in the field to conduct further testing activities. In this p...
Web test automation techniques employ web crawlers to automatically produce a web app model that is used for test generation. Existing crawlers rely on app-specific, threshold-based, algorithms to assess state equivalence. Such algorithms are hard to tune in the general case and cannot accurately identify and remove near-duplicate web pages from cr...
Deep Reinforcement Learning (DRL) has received a lot of attention from the research community in recent years. As the technology moves away from game playing to practical contexts, such as autonomous vehicles and robotics, it is crucial to evaluate the quality of DRL agents. In this paper, we propose a search-based approach to test such agents. Our...
Simulation-based testing represents an important step to ensure the reliability of autonomous driving software. In practice, when companies rely on third-party general-purpose simulators, either for in-house or outsourced testing, the generalizability of testing results to real autonomous vehicles is at stake. In this paper, we strengthen simulatio...
Offline model-level testing of autonomous driving software is much cheaper, faster, and diversified than in-field, online system-level testing. Hence, researchers have compared empirically model-level vs system-level testing using driving simulators. They reported the general usefulness of simulators at reproducing the same conditions experienced i...
Recent decades have seen the rise of large-scale Deep Neural Networks (DNNs) to achieve human-competitive performance in a variety of artificial intelligence tasks. Often consisting of hundreds of millions, if not hundreds of billion parameters, these DNNs are too large to be deployed to, or efficiently run on resource-constrained devices such as m...
As Deep Neural Networks (DNNs) are rapidly being adopted within large software systems, software developers are increasingly required to design, train, and deploy such models into the systems they develop. Consequently, testing and improving the robustness of these models have received a lot of attention lately. However, relatively little effort ha...
Deep neural networks (DNN) are increasingly used as components of larger software systems that need to process complex data, such as images, written texts, audio/video signals. DNN predictions cannot be assumed to be always correct for several reasons, amongst which the huge input space that is dealt with, the ambiguity of some inputs data, as well...
Deep Neural Networks (DNN) are increasingly used as components of larger software systems that need to process complex data, such as images, written texts, audio/video signals. DNN predictions cannot be assumed to be always correct for several reasons, among which the huge input space that is dealt with, the ambiguity of some inputs data, as well a...
A central aspect of the Android platform is Inter-Component Communication (ICC), which enables the reuse of functionality across apps and components via message passing. While a powerful feature, ICC still constitutes a serious attack surface. This paper addresses the issue of generating exploits for a subset of Android ICC vulnerabilities (i.e., I...
p>A central aspect of the Android platform is Inter-Component Communication (ICC), which enables the reuse of functionality across apps and components via message passing. While a powerful feature, ICC still constitutes a serious attack surface. This paper addresses the issue of generating exploits for a subset of Android ICC vulnerabilities (i.e.,...
Automated online recognition of unexpected conditions is an indispensable component of autonomous vehicles to ensure safety even in unknown and uncertain situations. In this paper we propose a runtime monitoring technique rooted in the attention maps computed by explainable artificial intelligence techniques. Our approach , implemented in a tool ca...
Safe deployment of self-driving cars (SDC) necessitates thorough simulated and in-field testing. Most testing techniques consider virtualized SDCs within a simulation environment, whereas less effort has been directed towards assessing whether such techniques transfer to and are effective with a physical real-world vehicle.
In this paper, we shed l...
Deep Neural Networks (DNNs) are becoming a crucial component of modern software systems, but they are prone to fail under conditions that are different from the ones observed during training (out-of-distribution inputs) or on inputs that are truly ambiguous, i.e., inputs that admit multiple classes with nonzero probability in their ground truth lab...
Assessing the quality of Deep Learning (DL) systems is crucial, as they are increasingly adopted in safety-critical domains. Researchers have proposed several input generation techniques for DL systems. While such techniques can expose failures, they do not explain which features of the test inputs influenced the system’s (mis-) behaviour. DeepHype...
Test Input Prioritizers (TIP) for Deep Neural Networks (DNN) are an important technique to handle the typically very large test datasets efficiently, saving computation and labeling costs. This is particularly true for large-scale, deployed systems, where inputs observed in production are recorded to serve as potential test or training data for the...
Software is constantly changing as developers add new features or make changes. This directly impacts the effectiveness of the test suite associated with that software, especially when the new modifications are in an area where no test case exists. This article addresses the issue of developing a high-quality test suite to repeatedly cover a given...
The data set available for pre-release training of a machine learning based system is often not representative of all possible execution contexts that the system will encounter in the field. Reinforcement Learning (RL) is a prominent approach among those that support continual learning, i.e., learning continually in the field, in the post-release p...
Business needs often demand business analysts to inspect large and intricate business process models for analysis and maintenance purposes. Nevertheless, manually retrieving information in these complex models is not a rewarding error-free activity. The automatic querying of process models and visualization of the retrieved results represents a use...
Deep Learning (DL) components are routinely integrated into software systems that need to perform complex tasks such as image or natural language processing. The adequacy of the test data used to test such systems can be assessed by their ability to expose artificially injected faults (mutations) that simulate real DL faults. In this paper, we desc...
Safe handling of hazardous driving situations is a task of high practical relevance for building reliable and trustworthy cyber-physical systems such as autonomous driving systems. This task necessitates an accurate prediction system of the vehicle's confidence to prevent potentially harmful system failures on the occurrence of unpredictable condit...
One of the major challenges in the verification of complex industrial Cyber-Physical Systems is the difficulty of determining whether a particular system output or behaviour is correct or not, the so- called test oracle problem. Metamorphic testing alleviates the oracle problem by reasoning on the relations that are expected to hold among multiple...
Deep Learning (DL) has been successfully applied to a wide range of application domains, including safety-critical ones. Several DL testing approaches have been recently proposed in the literature but none of them aims to assess how different interpretable features of the generated inputs affect the system's behaviour.
In this paper, we resort to...
Assertion oracles are executable boolean expressions placed inside a software program that verify the correctness of test executions. A perfect assertion oracle passes (returns true) for all correct executions and fails (returns false) for all incorrect executions. Because designing perfect assertion oracles is difficult, assertions often fail to d...
Deep Learning (DL) has been successfully applied to a wide range of application domains, including safety-critical ones. Several DL testing approaches have been recently proposed in the literature but none of them aims to assess how different interpretable features of the generated inputs affect the system's behaviour. In this paper, we resort to I...
Field testing refers to testing techniques that operate in the field to reveal those faults that escape in-house testing. Field testing techniques are becoming increasingly popular with the growing complexity of contemporary software systems. In this article, we present the first systematic survey of field testing approaches over a body of 80 colle...
By ensuring adequate functional coverage, End‐to‐End (E2E) testing is a key enabling factor of continuous integration. This is even more true for web applications, where automated E2E testing is the only way to exercise the full stack used to create a modern application. The test code used for web testing usually relies on DOM locators, often expre...
This demo presents the implementation and usage details of GASSERT, the first tool to automatically improve assertion oracles. Assertion oracles are executable boolean expressions placed inside the program that should pass (return true) for all correct executions and fail (return false) for all incorrect executions. Because designing perfect assert...
The race for deploying AI-enabled autonomous vehicles (AVs) on public roads is based on the promise that such self-driving cars will be as safe as or safer than human drivers. Numerous techniques have been proposed to test AVs, which however lack oracle definitions that account for the quality of driving, due to the lack of a commonly used set of m...
The degree of code coverage reached by a test suite is an important indicator of the thoroughness of testing. Most coverage tools for Android apps work at the bytecode level and provide no information to developers about which source code lines have not yet been exercised by any test case. In this paper, we present COSMO, the first fully automated...
Software changes constantly, because developers add new features or modifications. This directly affects the effectiveness of the test suite associated with that software, especially when these new modifications are in a specific area that no test case covers. This article tackles the problem of generating a high-quality test suite to cover repeate...
Surprise Adequacy (SA) is one of the emerging and most promising adequacy criteria for Deep Learning (DL) testing. As an adequacy criterion, it has been used to assess the strength of DL test suites. In addition, it has also been used to find inputs to a Deep Neural Network (DNN) which were not sufficiently represented in the training data, or to s...
This demo presents the implementation and usage details of GASSERT, the first tool to automatically improve assertion oracles. Assertion oracles are executable boolean expressions placed inside the program that should pass (return true) for all correct executions and fail (return false) for all incorrect executions. Because designing perfect assert...
Modern software systems rely on Deep Neural Networks (DNN) when processing complex, unstructured inputs, such as images, videos, natural language texts or audio signals. Provided the intractably large size of such input spaces, the intrinsic limitations of learning algorithms, and the ambiguity about the expected predictions for some of the inputs,...
The state space of Android apps is huge and its thorough exploration during testing remains a major challenge. In fact, the best exploration strategy is highly dependent on the features of the app under test. Reinforcement Learning (RL) is a machine learning technique that learns the optimal strategy to solve a task by trial and error, guided by po...
Uncertainty and confidence have been shown to be useful metrics in a wide variety of techniques proposed for deep learning testing, including test data selection and system supervision. We present uncertainty-wizard, a tool that allows to quantify such uncertainty and confidence in artificial neural networks. It is built on top of the industry-lead...
During testing, developers can place oracles externally or internally with respect to a method. Given a faulty execution state, i.e., one that differs from the expected one, an oracle might be unable to expose the fault if it is placed at a program point with no access to the incorrect program state or where the program state is no longer corrupted...
Assertion oracles are executable boolean expressions placed inside the program that should pass (return true) for all correct executions and fail (return false) for all incorrect executions. Because designing perfect assertion oracles is difficult, assertions often fail to distinguish between correct and incorrect executions. In other words, they a...
Context: A Machine Learning based System (MLS) is a software system including one or more components that learn how to perform a task from a given data set. The increasing adoption of MLSs in safety critical domains such as autonomous driving, healthcare, and finance has fostered much attention towards the quality assurance of such systems. Despite...
In this paper, we first discuss the challenges of adapting an already trained DNN-based anomaly detector with knowledge mined during the execution of the main system. Then, we present a framework for the continual learning of anomaly detectors, which records in-field behavioural data to determine what data are appropriate for adaptation. We evaluat...
With the increasing adoption of Deep Learning (DL) for critical tasks, such as autonomous driving, the evaluation of the quality of systems that rely on DL has become crucial. Once trained, DL systems produce an output for any arbitrary numeric vector provided as input, regardless of whether it is within or outside the validity domain of the system...
With the increasing adoption of Deep Learning (DL) for critical tasks, such as autonomous driving, the evaluation of the quality of systems that rely on DL has become crucial. Once trained, DL systems produce an output for any arbitrary numeric vector provided as input, regardless of whether it is within or outside the validity domain of the system...
Smart contracts are a new type of software that allows its users to perform irreversible transactions on a distributed persistent data storage called the blockchain. The nature of such contracts and the technical details of the blockchain architecture give raise to new kinds of faults, which require specific test behaviours to be exposed. In this p...
The ecosystem in which mobile applications run is highly heterogeneous and configurable. All layers upon which mobile apps are built offer wide possibilities of variations, from the device and the hardware, to the operating system and middleware, up to the user preferences and settings. Testing all possible configurations exhaustively, before relea...
Context
Code hardening is meant to fight malicious tampering with sensitive code executed on client hosts. Code splitting is a hardening technique that moves selected chunks of code from client to server. Although widely adopted, the effective benefits of code splitting are not fully understood and thoroughly assessed.
Objective
The objective of t...
Mobile apps can be executed with a huge set of partially unpredictable configurations. Indeed, they can be executed on an unbounded combination of devices, operating systems, settings, and user preferences. Apps may also interact with other apps that were not even available when they were released. This results in a virtually infinite set of config...
The growing application of deep neural networks in safety-critical domains makes the analysis of faults that occur in such systems of enormous importance. In this paper we introduce a large taxonomy of faults in deep learning (DL) systems. We have manually analysed 1059 artefacts gathered from GitHub commits and issues of projects that use the most...
Deep Neural Networks (DNNs) are the core component of modern autonomous driving systems. To date, it is still unrealistic that a DNN will generalize correctly in all driving conditions. Current testing techniques consist of offline solutions that identify adversarial or corner cases for improving the training phase, and little has been done for ena...
We propose a human-in-the-loop approach for oracle improvement and analyse whether the proposed oracle improvement process is helping developers to create better oracles. For this, we conducted two human studies with 68 participants overall: an oracle assessment study and an oracle improvement study. Our results show that developers exhibit poor pe...
E2E web test suites are prone to test dependencies due to the heterogeneous multi-tiered nature of modern web apps, which makes it difficult for developers to create isolated program states for each test case. In this paper, we present the first approach for detecting and validating test dependencies present in E2E web test suites. Our approach emp...
Existing web test generators derive test paths from a navigational model of the web application, completed with either manually or randomly generated input values. However, manual test data selection is costly, while random generation often results in infeasible input sequences, which are rejected by the application under test. Random and search-ba...
E2E web test suites are prone to test dependencies due to the heterogeneous multi-tiered nature of modern web apps, which makes it difficult for developers to create isolated program states for each test case. In this paper, we present the first approach for detecting and validating test dependencies present in E2E web test suites. Our approach emp...
When critical assets or functionalities are included in a piece of software accessible to the end users, code protections are used to hinder or delay the extraction or manipulation of such critical assets. The process and strategy followed by hackers to understand and tamper with protected software might differ from program understanding for benign...
Several criteria have been proposed over the years for measuring test suite adequacy. Each criterion can be converted into a specific objective function to optimize with search-based techniques in an attempt to generate test suites achieving the highest possible coverage for that criterion. Recent work has tried to optimize for multiple-criteria at...
Context: Replication studies and experiments form an important foundation in advancing scientific research. While their prevalence in Software Engineering is increasing, there is still more to be done.
Objective: This article aims to extend our previous replication study on search-based test generation techniques by performing a large-scale empiri...
The oracle problem remains one of the key challenges in software testing, for which little automated support has been developed so far. We introduce OASIs, a search-based tool for Java that assists testers in oracle assessment and improvement. It does so by combining test case generation to reveal false positives and mutation testing to reveal fals...
Several criteria have been proposed over the years for measuring test suite adequacy. Each criterion can be converted into a specific objective function to optimize with search-based techniques in an attempt to generate test suites achieving the highest possible coverage for that criterion. Recent work has tried to optimize for multiple-criteria at...