Matteo Sonza Reorda

  • PhD in Computer Engineering
  • Professor (Full) at Polytechnic University of Turin

About

  • Publications: 750
  • Reads: 58,512
  • Citations: 10,147
Current institution
Polytechnic University of Turin
Current position
  • Professor (Full)
Additional affiliations
November 1990 - present
Polytechnic University of Turin
Position
  • Professor (Full)
Education
January 1987 - December 1989
Politecnico di Torino
Field of study
  • Computer Engineering

Publications

Article
Full-text available
Neural networks (NNs) are essential in advancing modern safety-critical systems. Lightweight NN architectures are deployed on resource-constrained devices using hardware accelerators like Graphics Processing Units (GPUs) for fast responses. However, the latest semiconductor technologies may be affected by physical faults that can jeopardize the NN...
Article
Deep Neural Networks (DNNs) have permeated multiple applications, including cutting-edge safety-critical domains, which require substantial computational power, often provided by Graphics Processing Units (GPUs). GPUs are manufactured with advanced semiconductor technologies that can be affected by faults during the operational phase (e.g., due to wear...
Article
Full-text available
Arithmetic circuits are fundamental building blocks in modern digital computers, allowing for precise mathematical operations and driving the digital age. They are essential components in almost every digital device, from basic CPUs to advanced accelerators in AI applications. In particular, in safety-critical fields like automotive, avionics, and...
Preprint
The reliability of Neural Networks has gained significant attention, prompting efforts to develop SW-based hardening techniques for safety-critical scenarios. However, evaluating hardening techniques using application-level fault injection (FI) strategies, which are commonly hardware-agnostic, may yield misleading results. This study for the first...
Article
Full-text available
Graphics Processing Units (GPUs) are becoming widespread, even in safety-critical applications. In that case, it is imperative to guarantee that the probability of producing critical failures due to hardware faults is lower than a given threshold. To detect possible permanent hardware faults as soon as they appear during the operational phase (e.g....
Conference Paper
Full-text available
A widely adopted practice for in-field testing of electronic devices uses Software-Based Self-Test (SBST) in the form of Software Test Libraries (STLs). Typically, STLs target the stuck-at and Transition Delay Fault (TDF) models. However, to face the new defects introduced by the most recent semiconductor technologies, new fault models must be adop...
Preprint
Full-text available
Reliability assessment is mandatory to guarantee the correct behavior of Deep Neural Network (DNN) hardware accelerators in safety-critical applications. While fault injection stands out as a well-established, practical and robust method for reliability assessment, it is still a very time-consuming process. This paper contributes with three recipes...
Presentation
Slides of the second talk of the special session: "Reliability Assessment Recipes for DNN Accelerators" at the VTS 2024 conference.
Presentation
This presentation summarizes the analyses of the reliability impact of scheduling policies on GPUs when permanent faults affect TCUs, during the execution of CNN operations. We developed a configurable architectural GPU model (in terms of clusters and parallel cores) that implements five selectable scheduling policies and supports the instruction-...
Article
Full-text available
Ensuring the reliability of GPUs and their internal components is paramount, especially in safety-critical domains like autonomous machines and self-driving cars. These cutting-edge applications heavily rely on GPUs to implement complex algorithms due to their implicit programming flexibility and parallelism, which is crucial for efficient operatio...
Conference Paper
Full-text available
Arithmetic circuits form the foundation of modern digital computation, enabling us to conduct precise mathematical operations and drive the digital age. They are integral components in nearly every digital circuit, such as processors' arithmetic and logic units. Especially in safety-critical domains like automotive and aviation, the flawless operat...
Article
Full-text available
The most recent generations of graphics processing units (GPUs) boost the execution of convolutional operations required by machine learning applications by resorting to specialized and efficient in-chip accelerators (Tensor Core Units or TCUs) that operate on matrix multiplication tiles. Unfortunately, modern cutting-edge semiconductor technologie...
Presentation
Slides of the paper contribution: Understanding the Effects of Permanent Faults in GPU's Parallelism Management and Control Units
Chapter
High-Performance Computing (HPC) have evolved to be used to perform simulations of systems where physical experimentation is prohibitively impractical, expensive, or dangerous. This paper provides a general overview and showcases the analysis of non-functional properties in RISC-V-based platforms for HPCs. In particular, our analyses target the eva...
Article
Full-text available
Throughout device testing, one key parameter to be considered is the switching activity (SWA) of the circuit under test (CUT). To avoid unwanted scenarios due to excessive power consumption during test, in most cases the SWA of the CUTs must be retained to a minimal value when the test stimulus is applied. However, there are specific cases where th...
Article
New semiconductor technologies for advanced applications are more prone to defects and imperfections related, among several different causes, to the manufacturing process, aging and cross-talks. These phenomena negatively affect the circuit’s timing and can be effectively modeled by means of the path delay fault (PDF) model. While path delay testin...
Conference Paper
Full-text available
In-field test of microprocessors is a major topic for the industry, especially in the safety-critical domain, where the respective standards mandate high test coverage thresholds. The dominant fault models used are the transition delay and the stuck-at fault model. However, the adoption of very advanced semiconductor technologies to manufacture dev...
Chapter
Full-text available
The reliability of High-Performance Computing (HPC) systems is an essential concern due to their massive size and the complexity of their operation. Thus, functional tests have been extensively used to monitor HPC systems and use software routines to verify the software stack’s operation, mainly focusing on high-level abstraction features. However,...
Article
Self-Test Libraries (STLs) are widely used for in-field fault detection in processor-based systems. Currently, their adoption is being extended to Graphics Processing Units (GPUs), due to their increasing usage in the safety-critical domain, and the demand for effective in-field functional safety mechanisms mandated by the functional safety standar...
Conference Paper
Full-text available
High-Performance Computing (HPC) have evolved to be used to perform simulations of systems where physical experimentation is prohibitively impractical, expensive, or dangerous. This paper provides a general overview and showcases the analysis of non-functional properties in RISC-V-based platforms for HPCs. In particular, our analyses target the evaluati...
Preprint
Full-text available
Graphics Processing Units (GPUs) are over-stressed to accelerate High-Performance Computing applications and are used to accelerate Deep Neural Networks in several domains where they have a life expectancy of many years. These conditions expose the GPU hardware to (premature) aging, causing permanent faults to arise after the usual end-of-manufact...
Conference Paper
Full-text available
With the continued success of the open RISC-V architecture, practical deployment of RISC-V processors necessitates an in-depth consideration of their testability, safety and security aspects. This survey provides an overview of recent developments in this quickly-evolving field. We start with discussing the application of state-of-the-art functiona...
Article
Full-text available
During device testing, an important parameter to be considered by the test engineers is the switching activity (SWA) of the circuit under test (CUT). It is well known that the SWA must be kept to a minimum in order to avoid catastrophic scenarios on the CUTs, e.g., unacceptable peak power consumption or over-stressing that can lead to an artificial...
Article
Full-text available
Complexity and performance of Automotive System-on-Chips have exponentially grown in the last decade, also according to technology advancements. Unfortunately, this trend directly and profoundly impacts modern Electronic Design Automation tools, which must handle very large amounts of logic gates. The consequence is an exponential increase in compu...
Preprint
Full-text available
The reliability evaluation of Deep Neural Networks (DNNs) executed on Graphics Processing Units (GPUs) is a challenging problem since the hardware architecture is highly complex and the software frameworks are composed of many layers of abstraction. While software-level fault injection is a common and fast way to evaluate the reliability of complex...
Article
The reliability evaluation of Deep Neural Networks (DNNs) executed on Graphics Processing Units (GPUs) is a challenging problem since the hardware architecture is highly complex and the software frameworks are composed of many layers of abstraction. While software-level fault injection is a common and fast way to evaluate the reliability of complex...
Conference Paper
Full-text available
Graphics Processing Units (GPUs) boost the development of high-performance safety-critical applications. The reliability of such systems is of utmost importance since faults affecting the hardware may occur at any time during the systems' operational life. Thus, methods to effectively test these devices during their in-field operation are necessary...
Article
Full-text available
In order to match the strict reliability requirements mandated by regulations and standards adopted in the automotive sector, as well as other domains where safety is a major concern, the in-field testing of the most critical devices, including microcontrollers and systems on chip, is a crucial task. Since the controller area network (CAN) bus is w...
Conference Paper
Full-text available
Numerous electronic systems store valuable intellectual property (IP) information inside non-volatile memories. In order to protect the integrity of such sensitive information from an unauthorized access or modification, encryption mechanisms are employed. From a reliability standpoint, such information can be vital to the system's functionality an...
Preprint
Full-text available
Currently, Deep learning and especially Convolutional Neural Networks (CNNs) have become a fundamental computational approach applied in a wide range of domains, including some safety-critical applications (e.g., automotive, robotics, and healthcare equipment). Therefore, the reliability evaluation of those computational systems is mandatory. The r...
Article
Full-text available
The high processing power of GPUs makes them attractive for safety-critical applications, where transient effects are a major concern, and resilience must be enforced without compromising performance. Configurable softcore GPUs are a recent technology that allows detailed reliability assessment capable of bringing directions to the design of reliab...
Article
Full-text available
Graphics Processing Units (GPUs) are increasingly adopted in several domains where reliability is fundamental, such as self-driving cars and autonomous systems. Unfortunately, GPU devices have been shown to have a high error rate, while the constraints imposed by real-time safety-critical applications make traditional (and costly) replication-based...
Article
Full-text available
ISO 26262 requires classifying random hardware faults based on their effects (safe, detected, or undetected) within integrated circuits used in automobiles. In general, this classification is addressed using expert judgment and a combination of tools. However, the growth of integrated circuit complexity creates a huge fault space; hence, this form...
Article
Full-text available
This paper compares different types of resistive defects that may occur inside low-power SRAM cells, focusing on their impact on device operation. Notwithstanding the continuous evolution of SRAM device integration, manufacturing processes continue to be very sensitive to production faults, giving rise to defects that can be modeled as resistances,...
Article
Temperature management is a non-secondary aspect in the design of power circuits and systems. As a matter of fact, changes in the junction temperature have significant effects on the semiconductor device behavior; furthermore, a high junction temperature accelerates the failure mechanisms of power devices used in the power module and reduces their...
Article
Full-text available
Burn-In test equipment usually has extensive memory capacity to store pre-computed patterns to be applied to the circuit inputs, as well as ad-hoc circuitry to drive and read the DUT pins during the BI phase. The solution proposed in this paper dramatically reduces the memory size requirement and just demands a generic microcontroller unit (M...
Preprint
Full-text available
Low-power SRAM architectures are especially sensitive to many types of defects that may occur during manufacturing. Among these, resistive defects can appear. This paper analyzes some types of such defects that may impair the device functionalities in subtle ways, depending on the defect characteristics, and that may not be directly or easily detec...
Article
Full-text available
Nowadays, many electronic systems store valuable Intellectual Property (IP) information inside Non-Volatile Memories (NVMs). Designers widely use encryption mechanisms to enhance the integrity of such IPs and protect them from any unauthorized access or modification. At the same time, often such IPs are critical from a reliability standpoint. Thus,...
Conference Paper
Full-text available
During device testing, one of the aspects to be considered is the minimization of the switching activity of the circuit under test in order to steer clear of introducing problems due to device overheating. Nevertheless, there are also certain scenarios during which the maximization of switching activity of the circuit under test (CUT) or of certain...
Article
Full-text available
General-purpose graphics processing units (GPGPUs) are extensively used in high-performance computing. However, it is well known that these devices’ reliability may be limited by faults arising at the hardware level. This work introduces a flexible solution to detect and mitigate permanent faults affecting the execution units in these paralle...
Preprint
Full-text available
In-field test of processor-based devices is a must when considering safety-critical systems (e.g., in robotics, aerospace, and automotive applications). During in-field testing, different solutions can be adopted, depending on the specific constraints of each scenario. In recent years, Self-Test Libraries (STLs) developed by IP or semiconductor c...
Conference Paper
Full-text available
One key aspect to be considered during device testing is the minimization of the switching activity of the circuit under test (CUT), thus avoiding possible problems stemming from overheating it. However, there are also scenarios where maximizing the switching activity of certain circuit modules could prove useful (e.g., during Burn-In) in order...
