Fabrizio Ferrandi

Fabrizio Ferrandi
Politecnico di Milano | Polimi · Department of Electronics, Information, and Bioengineering

PhD

About

181
Publications
20,915
Reads
How we measure 'reads'
A 'read' is counted each time someone views a publication summary (such as the title, abstract, and list of authors), clicks on a figure, or views or downloads the full-text. Learn more
2,661
Citations

Publications

Publications (181)
Chapter
Hardware designers working on FPGA accelerators are free to explore ad-hoc value representations that differ from the IEEE 754 floating-point standard, significantly reducing resource utilization and latency. In fact, while some applications are amenable to fixed-point quantization, others may require a wider dynamic range of values, better represe...
Preprint
Full-text available
Recent trends in deep learning (DL) imposed hardware accelerators as the most viable solution for several classes of high-performance computing (HPC) applications such as image classification, computer vision, and speech recognition. This survey summarizes and classifies the most recent advances in designing DL accelerators suitable to reach the pe...
Preprint
European efforts to boost competitiveness in the sector of space services promote the research and development of advanced software and hardware solutions. The EU-funded HERMES project contributes to the effort by qualifying radiation-hardened, high-performance programmable microprocessors, and by developing a software ecosystem that facilitates th...
Article
Full-text available
Edge systems are required to autonomously make real-time decisions based on large quantities of input data under strict power, performance, area, and other constraints. Meeting these constraints is only possible by specializing systems through hardware accelerators purposefully built for machine learning and data analysis algorithms. However, data...
Preprint
Custom hardware accelerators for Deep Neural Networks are increasingly popular: in fact, the flexibility and performance offered by FPGAs are well-suited to the computational effort and low latency constraints required by many image recognition and natural language processing tasks. The gap between high-level Machine Learning frameworks (e.g., Tens...
Preprint
Full-text available
High-Performance Big Data Analytics (HPDA) applications are characterized by huge volumes of distributed and heterogeneous data that require efficient computation for knowledge extraction and decision making. Designers are moving towards a tight integration of computing systems combining HPC, Cloud, and IoT solutions with artificial intelligence (A...
Article
Full-text available
Graph analytics are an emerging class of irregular applications. Operating on very large datasets, they present unique behaviors, such as fine-grained, unpredictable memory accesses, and highly unbalanced task-level parallelism, that make existing general-purpose processors or accelerators (e.g., GPUs) suboptimal or difficult to program. To address...
Article
Full-text available
Improving data locality of tensor data structures is a crucial optimization for maximizing the performance of machine learning and intensive linear algebra applications. While CPUs and GPUs improve data locality by means of automated caching mechanisms, FPGAs let the developer specify data structure allocation. Although this feature enables a high...
Article
Full-text available
Field Programmable Gate Arrays (FPGAs) are becoming an appealing technology in datacenters and High Performance Computing. High-Level Synthesis (HLS) of multi-threaded parallel programs is increasingly used to extract parallelism. Despite great leaps forward in HLS and related debugging methodologies, there is a lack of contributions in automated b...
Article
Heterogeneous Multiprocessor System-on-Chip (MPSoC) architectures allow designers to achieve energy efficiency and high performance, especially when tailored for the target applications. This specialization requires to select the MPSoC components and define their interconnections (on the architecture side), and also to determine the mapping and sch...
Article
Full-text available
High Level Synthesis is a set of methodologies aimed at generating hardware descriptions starting from specifications written in high-level languages. While these methodologies share different elements with traditional compilation flows, there are characteristics of the addressed problem which require ad hoc management. In particular, differently f...
Conference Paper
Full-text available
Data analytics applications increasingly are complex workflows composed of phases with very different program behaviors (e.g., graph algorithms and machine learning, algorithms operating on sparse and dense data structures, etc). To reach the levels of efficiency required to process these workflows in real time, upcoming architectures will need to...
Article
This article presents an automated approach for detecting system- level bugs in SoC designs that are composed of many IP blocks, without exposing sensitive information. The approach leverages high-level synthesis techniques. —Binoy Ravindran, Virginia Tech</italic
Chapter
This chapter introduces the characterizing aspects of embedded systems and discusses the specific features that a designer should address in an embedded system “rugged,” i.e., able to operate reliably in harsh environments. The chapter addresses both the hardware and the less obvious software aspects. After presenting a current list of certificatio...
Article
Full-text available
High-Level Synthesis (HLS) for FPGAs is attracting popularity and is increasingly used to handle complex systems with multiple integrated components. To increase performance and efficiency, HLS flows now adopt several advanced optimization techniques. Aggressive optimizations and system level integration can cause the introduction of bugs that are...
Article
Full-text available
The integration of Field Programmable Gate Arrays (FPGAs) in an aerospace system improves its efficiency and its flexibility thanks to their programmability, but increases the design complexity. The design flows indeed have to be composed of several steps to fill the gap between the starting solution, which is usually a reference sequential impleme...
Article
Full-text available
Synthesis of DoAll loops is a key aspect of High Level Synthesis since they allow to easily exploit the potential parallelism provided by programmable devices. This type of parallelism can be implemented in several ways: by duplicating the implementation of body loop, by exploiting loop pipelining or by applying vectorization. In this paper a metho...
Chapter
Accelerators, including graphic processing units (GPUs) for general-purpose computation, manycore designs with wide vector units (e.g., Intel Phi), have become a common component of many high-performance clusters. The appearance of more stable and reliable tools that can automatically convert code written in high-level specifications with annotatio...
Conference Paper
Full-text available
RDF databases naturally map to a graph representation and employ languages, such as SPARQL, that implements queries as graph pattern matching routines. Graph methods exhibit an irregular behavior: they present unpredictable, fine-grained data accesses, and are synchronization intensive. Graph data structures expose large amounts of dynamic parallel...
Conference Paper
Full-text available
Conventional High Level Synthesis (HLS) tools mainly target compute intensive kernels typical of digital signal processing applications. We are developing techniques and architectural templates to enable HLS of data analytics applications. These applications are memory intensive, present fine-grained, unpredictable data accesses, and irregular, dyn...
Conference Paper
Full-text available
The integration of Field Programmable Gate Arrays (FPGAs) in an aerospace system allows to improve its efficiency and its flexibility thanks to their programmability. Generating the required hardware descriptions for a software developer could be a very difficult task because of the different programming paradigms of software programs and hardware...
Article
Full-text available
High-level synthesis (HLS) is increasingly popular for the design of high-performance and energy-efficient heterogeneous systems, shortening time-to-market and addressing today’s system complexity. HLS allows designers to work at a higher-level of abstraction by using a software program to specify the hardware functionality. Additionally, HLS is pa...
Conference Paper
Full-text available
Code motion and speculations are usually exploited in the High Level Synthesis of control dominated applications to improve the performances of the synthesized designs. Selecting the transformations to be applied is not a trivial task: their effects can indeed indirectly spread across the whole design, potentially worsening the quality of the resul...
Conference Paper
Full-text available
In this paper we present a set of techniques that enable the synthesis of efficient custom accelerators for memory intensive, irregular applications. To address the challenges of irregular applications (large memory footprint, unpredictable fine-grained data accesses, and high synchronization intensity), and exploit their opportunities (thread leve...
Article
Full-text available
Correctly estimating the speed-up of a parallel embedded application is crucial to efficiently compare different parallelization techniques, task graph transformations or mapping and scheduling solutions. Unfortunately, especially in case of control-dominated applications, task correlations may heavily affect the execution time of the solutions and...
Article
The current generation of High Level Synthesis (HLS) tools usually generates hierarchical and modular designs, mimicking the structure of the call graph of the original high-level input specification. The standard approach is to progressively synthesize functions into modules by navigating the application call graph from the leaves up to the top fu...
Conference Paper
Full-text available
Synthesis of DoAll loops is a key aspect of High Level Synthesis since they allow to easily exploit the potential parallelism provided by programmable devices. This type of parallelism can be implemented in several ways: by duplicating the implementation of body loop, by exploiting loop pipelining or by applying vectorization. In this paper a meth...
Article
Full-text available
Synchronous Data Flow graphs are widely adopted in the designing of streaming applications, but were originally formulated to describe only how an application is partitioned and which data are exchanged among different tasks. Since Synchronous Data Flow graphs are often used to describe and evaluate complete design solutions, missing information (e...
Conference Paper
Data mining, bioinformatics, knowledge discovery, social network analysis, are emerging irregular applications that exploits data structures based on pointers or linked lists, such as graphs, unbalanced trees or unstructured grids. These applications are characterized by unpredictable memory accesses and generally are memory bandwidth bound, but al...
Conference Paper
High Level Synthesis (HLS) provides a way to significantly enhance the productivity of embedded system designers, by enabling the automatic or semiautomatic generation of hardware accelerators starting from high level descriptions with (usually software) programming languages. Typical HLS approaches build a centralized Finite State Machine (FSM) to...
Conference Paper
Modern hardware cores necessarily have to deal with many sources of unknown or uncertain information. Components with variable latency and unpredictable behavior are becoming predominant in hardware designs. Conventional hardware cores underperform when dealing with unknown or uncertain information. Common High-Level Synthesis (HLS) approaches, whi...
Conference Paper
This paper presents bambu, a modular framework for research on high-level synthesis currently under development at Politecnico di Milano. It can accept most of C constructs without requiring any three-state for their implementations by exploiting a novel and efficient memory architecture. It also allows the integration of floating-point units and t...
Conference Paper
Full-text available
Streaming applications can efficiently exploit multiprocessors architectures by means of pipelined parallelism, but designing this type of applications can be an hard task. Different subproblems have indeed to be solved: partitioning, mapping, scheduling and pipeline stage assignment. For this reason, high level abstraction models are adopted durin...
Conference Paper
Full-text available
Modern heterogeneous embedded platforms, composed of several digital signal, application specific and general purpose processors, also include reconfigurable devices supporting partial dynamic reconfiguration. These devices can change the behavior of some of their parts during execution, allowing hardware acceleration of more sections of the applic...
Conference Paper
High Level Synthesis (HLS) provides automatic flows for the generation of hardware accelerators starting from their behavioral description. HLS guarantees results comparable to hand-written design for some applications domains such as Digital Signal Processing. However, it is not yet able to cope with performance requirements when scaling the appli...
Conference Paper
In the past decades, design methodologies of Embedded Systems (ES) and High Performance Computing (HPC) systems have evolved following different trends. However, they are lately experiencing issues that affect both the domains, whose solutions converge to similar approaches. Examples of issues affecting both the domains are: large parallelism degre...
Conference Paper
Classical techniques for register allocation and binding require the definition of the program execution order, since a partial ordering relation between operations must be induced to perform liveness analysis through data-flow equations. In High Level Synthesis (HLS) flows this is commonly obtained through the scheduling task. However for some HLS...
Conference Paper
Heterogeneous architectures are becoming an increasingly relevant component for High-Performance Computing: they combine the computational power of multi-core processors with the flexibility of reconfigurable co-processor boards. Such boards are often composed of a set of standard Field Programmable Gate Arrays (FPGAs), coupled with a distributed m...
Article
Nowadays, implementing hardware accelerators by hand-writing the RTL still leads to better quality of the results with respect to those obtained by automating the design process. Manually developing and maintaining hardware designs, however, is a complex, time-consuming and error prone task, making improvements in the automatic design flow definiti...
Conference Paper
Full-text available
In this paper we present a novel implementation of a Wave Field Synthesis application based on emerging NVIDIA Compute Unified Device Architecture (CUDA) technology using NU-Tech Framework. CUDA technology unlocks the processing power of the Graphics Processing Units (GPUs) which are characterized by a highly parallel architecture. A wide range of...
Conference Paper
Full-text available
Since time constraints are a very critical aspect of an embedded system, performance evaluation can not be postponed to the end of the design flow, but it has to be introduced since its early stages. Estimation techniques based on mathematical models are usually preferred during this phase since they provide quite accurate estimation of the applica...
Chapter
Full-text available
This chapter describes the different design steps needed to go from legacy code to a transformed application that can be efficiently mapped on the hArtes platform.
Chapter
In the last decade automotive audio has been gaining great attention by the scientific and industrial communities. In this context, a new approach to test and develop advanced audio algorithms for an heterogeneous embedded platform has been proposed within the European hArtes project. A real audio laboratory installed in a real car (hArtes CarLab)...
Chapter
In this chapter, we describe functionality which has also been developed in the context of the hArtes project but that were not included in the final release or that are separately released. The development of the tools described here was often initiated after certain limitations of the current toolset were identified. This was the case of the memo...
Conference Paper
Nowadays, the memory synthesis is becoming the main bottleneck for the generation of efficient hardware accelerators. This paper presents a design methodology to efficiently and automatically implement memory accesses in High-Level Synthesis. In particular, the approach starts from a behavioral specification (in pure C language) and a set of design...
Article
In the last decade automotive audio has been gaining great attention by the scientific and industrial communities. In this context, a new approach to test and develop advanced audio algorithms for an heterogeneous embedded platform has been proposed within the European hArtes project. A real audio laboratory installed in a real car (hArtes CarLab)...
Article
Full-text available
The four letters in this special section cover a spectrum of topics ranging from the optimization of adaptive multiprocessor embedded systems, methodologies for improving the P&R run-time, approaches to improving the hardware allocation, and scheduling on a reconfigurable system and techniques for optimizing system dependability.
Article
Nowadays, the design of hardware cores has to necessarily deal with unpredictable components, due to process variation or to the interaction with external modules (e.g., memories, sensors, IP cores). Adaptive systems are, thus, one of the most important solutions to substitute traditional approaches, based on analysis at design time, especially in...
Conference Paper
Full-text available
Current EDA tools are often based on standard-cell libraries for the design of modern complex systems-on-chip. In general, there are opposite trends to compact and extend the standard cell libraries, and to move towards custom libraries, highly optimized for specific goals (e.g., area, timing or power consumption) or designs. We thus propose a desi...
Chapter
When targeting heterogeneous, multi-core platforms, system and application developers are not only confronted with the challenge of choosing the best hardware configuration for the application they need to map, but also the application has to be modified such that certain parts are executed on the most appropriate hardware component. The hArtes too...
Article
Full-text available
Developing heterogeneous multicore platforms requires choosing the best hardware configuration for mapping the application, and modifying that application so that different parts execute on the most appropriate hardware component. The hArtes toolchain provides the option of automatic or semi-automatic support for this mapping. During test and valid...
Conference Paper
Full-text available
Performance estimation is a key step in the development of an embedded system. Normally, the performance evaluation is performed using a simulator or a performance mathematical model of the target architecture. However, both these approaches are usually based on the knowledge of the architectural details of the target. In this paper we present a me...
Conference Paper
In this paper, we apply multi-objective evolutionary computation to the synthesis of real-time, embedded, heterogeneous, multiprocessor systems (briefly, Multiprocessor Systems-on-Chip or MP-SoCs). Our approach simultaneously explores the architecture, the mapping and the scheduling of the system, by using multi-objective evolution. In particular,...
Conference Paper
Nowadays, design issues related to physical design and scalability are becoming the main bottlenecks of modern tools for technology mapping, limiting the usage of large cells. On the other hand, the generation of regular macro cells, such as compound gates, are becoming interesting from the manufacturing point of view, but they need to be properly...
Conference Paper
Full-text available
In embedded system design, the tuning and validation of a cycle accurate simulator is a difficult task. The designer has to assure that the estimation error of the simulator meets the design constraints on every application. If an application is not correctly estimated, the designer has to identify on which parts of the application the simulator in...
Conference Paper
Full-text available
Fast and accurate performance estimation is a key aspect of heterogeneous embedded systems design flow, since cycle-accurate simulators, when exist, are usually too slow to be used during design space exploration. Performance estimation techniques are usually based on combination of estimation of the single processing elements which compose the sys...
Conference Paper
Full-text available
Face Recognition techniques are solutions used to quickly screen a huge number of persons without being intrusive in open environments or to substitute id cards in companies or research institutes. There are several reasons that require to systems implementing these techniques to be reliable. This paper presents the design of a reliable face recogn...
Conference Paper
Full-text available
Efficient mapping and scheduling of partitioned applications are crucial to improve the performance on today's reconfigurable multiprocessor systems-on-chip (MPSoCs) platforms. Most of existing heuristics adopt the directed acyclic (task) Graph as representation, that unfortunately, is not able to represent typical embedded applications (e.g., real...
Chapter
Full-text available
High-Level Synthesis (HLS) is the process of developing digital circuits from behavioral specifications. It involves three interdependent and NP-complete optimization problems: (i) the operation scheduling, (ii) the resource allocation, and (iii) the controller synthesis. Evolutionary Algorithms have been already effectively applied to HLS to find...
Conference Paper
Full-text available
In this paper we present a new technique for automatically measuring the performance of tasks, functions or arbitrary parts of a program on a multiprocessor embedded system. The technique instruments the tasks described by OpenMP, used to represent the task parallelism, while ad hoc pragmas in the source indicate other pieces of code to profile. Th...
Conference Paper
In this paper we propose a flow based on the Bayesian Optimization Algorithm (BOA) for mapping pipelined applications on a heterogeneous multiprocessor platform on Field Programmable Gate Array (FPGA) with customizable processors. BOA is a Probabilistic Model Building Genetic Algorithm (PMBGA) that, substituting the classical mutation and crossover...
Conference Paper
In this paper, we compare four algorithms for the mapping of pipelined applications on a heterogeneous multiprocessor platform implemented using Field Programmable Gate Arrays (FPGAs) with customizable processors. Initially, we describe the framework and the model of pipelined application we adopted. Then, we focus on the problem of mapping a set o...
Conference Paper
Full-text available
The speed-up estimation of parallelized code is crucial to efficiently compare different parallelization techniques or task graph transformations. Unfortunately, most of the time, during the parallelization of a specification, the information that can be extracted by profiling the corresponding sequential code (e.g. the most executed paths) are not...
Conference Paper
Full-text available
Abstract—This paper,presents a multiprocessor,archi- tecture prototype,on a Field Programmable,Gate Arrays (FPGA) with,support,for hardware,and,software,mul- tithreading. Thanks to partial dynamic reconfiguration, this system can, at run time, spawn both software and hardware threads, sharing not only the general purpose soft-cores present in the a...