About
181
Publications
20,915
Reads
How we measure 'reads'
A 'read' is counted each time someone views a publication summary (such as the title, abstract, and list of authors), clicks on a figure, or views or downloads the full-text. Learn more
2,661
Citations
Publications
Publications (181)
Hardware designers working on FPGA accelerators are free to explore ad-hoc value representations that differ from the IEEE 754 floating-point standard, significantly reducing resource utilization and latency. In fact, while some applications are amenable to fixed-point quantization, others may require a wider dynamic range of values, better represe...
Recent trends in deep learning (DL) imposed hardware accelerators as the most viable solution for several classes of high-performance computing (HPC) applications such as image classification, computer vision, and speech recognition. This survey summarizes and classifies the most recent advances in designing DL accelerators suitable to reach the pe...
European efforts to boost competitiveness in the sector of space services promote the research and development of advanced software and hardware solutions. The EU-funded HERMES project contributes to the effort by qualifying radiation-hardened, high-performance programmable microprocessors, and by developing a software ecosystem that facilitates th...
Edge systems are required to autonomously make real-time decisions based on large quantities of input data under strict power, performance, area, and other constraints. Meeting these constraints is only possible by specializing systems through hardware accelerators purposefully built for machine learning and data analysis algorithms. However, data...
Custom hardware accelerators for Deep Neural Networks are increasingly popular: in fact, the flexibility and performance offered by FPGAs are well-suited to the computational effort and low latency constraints required by many image recognition and natural language processing tasks. The gap between high-level Machine Learning frameworks (e.g., Tens...
High-Performance Big Data Analytics (HPDA) applications are characterized by huge volumes of distributed and heterogeneous data that require efficient computation for knowledge extraction and decision making. Designers are moving towards a tight integration of computing systems combining HPC, Cloud, and IoT solutions with artificial intelligence (A...
Graph analytics are an emerging class of irregular applications. Operating on very large datasets, they present unique behaviors, such as fine-grained, unpredictable memory accesses, and highly unbalanced task-level parallelism, that make existing general-purpose processors or accelerators (e.g., GPUs) suboptimal or difficult to program. To address...
Improving data locality of tensor data structures is a crucial optimization for maximizing the performance of machine learning and intensive linear algebra applications. While CPUs and GPUs improve data locality by means of automated caching mechanisms, FPGAs let the developer specify data structure allocation. Although this feature enables a high...
Field Programmable Gate Arrays (FPGAs) are becoming an appealing technology in datacenters and High Performance Computing. High-Level Synthesis (HLS) of multi-threaded parallel programs is increasingly used to extract parallelism. Despite great leaps forward in HLS and related debugging methodologies, there is a lack of contributions in automated b...
Heterogeneous Multiprocessor System-on-Chip (MPSoC) architectures allow designers to achieve energy efficiency and high performance, especially when tailored for the target applications. This specialization requires to select the MPSoC components and define their interconnections (on the architecture side), and also to determine the mapping and sch...
High Level Synthesis is a set of methodologies aimed at generating hardware descriptions starting from specifications written in high-level languages. While these methodologies share different elements with traditional compilation flows, there are characteristics of the addressed problem which require ad hoc management. In particular, differently f...
Data analytics applications increasingly are complex workflows composed of phases with very different program behaviors (e.g., graph algorithms and machine learning, algorithms operating on sparse and dense data structures, etc). To reach the levels of efficiency required to process these workflows in real time, upcoming architectures will need to...
This article presents an automated approach for detecting system- level bugs in SoC designs that are composed of many IP blocks, without exposing sensitive information. The approach leverages high-level synthesis techniques.
—Binoy Ravindran, Virginia Tech</italic
This chapter introduces the characterizing aspects of embedded systems and discusses the specific features that a designer should address in an embedded system “rugged,” i.e., able to operate reliably in harsh environments. The chapter addresses both the hardware and the less obvious software aspects. After presenting a current list of certificatio...
High-Level Synthesis (HLS) for FPGAs is attracting popularity and is increasingly used to handle complex systems with multiple integrated components. To increase performance and efficiency, HLS flows now adopt several advanced optimization techniques. Aggressive optimizations and system level integration can cause the introduction of bugs that are...
The integration of Field Programmable Gate Arrays (FPGAs) in an aerospace system improves its efficiency and its flexibility thanks to their programmability, but increases the design complexity. The design flows indeed have to be composed of several steps to fill the gap between the starting solution, which is usually a reference sequential impleme...
Synthesis of DoAll loops is a key aspect of High Level Synthesis since they allow
to easily exploit the potential parallelism provided by programmable devices.
This type of parallelism can be implemented in several ways: by duplicating
the implementation of body loop, by exploiting loop pipelining or by applying
vectorization.
In this paper a metho...
Accelerators, including graphic processing units (GPUs) for general-purpose computation, manycore designs with wide vector units (e.g., Intel Phi), have become a common component of many high-performance clusters. The appearance of more stable and reliable tools that can automatically convert code written in high-level specifications with annotatio...
RDF databases naturally map to a graph representation and employ languages, such as SPARQL, that implements queries as graph pattern matching routines. Graph methods exhibit an irregular behavior: they present unpredictable, fine-grained data accesses, and are synchronization intensive. Graph data structures expose large amounts of dynamic parallel...
Conventional High Level Synthesis (HLS) tools mainly target compute intensive kernels typical of digital signal processing applications. We are developing techniques and architectural templates to enable HLS of data analytics applications. These applications are memory intensive, present fine-grained, unpredictable data accesses, and irregular, dyn...
The integration of Field Programmable Gate Arrays (FPGAs) in an aerospace system allows to improve its efficiency and its flexibility thanks to their programmability. Generating the required hardware descriptions for a software developer could be a very difficult task because of the different programming paradigms of software programs and hardware...
High-level synthesis (HLS) is increasingly popular for the design of high-performance and energy-efficient heterogeneous systems, shortening time-to-market and addressing today’s system complexity. HLS allows designers to work at a higher-level of abstraction by using a software program to specify the hardware functionality. Additionally, HLS is pa...
Code motion and speculations are usually exploited in the High Level Synthesis of control dominated applications to improve the performances of the synthesized designs. Selecting the transformations to be applied is not a trivial task: their effects can indeed indirectly spread across the whole design, potentially worsening the quality of the resul...
In this paper we present a set of techniques that enable the synthesis of efficient custom accelerators for memory intensive, irregular applications. To address the challenges of irregular applications (large memory footprint, unpredictable fine-grained data accesses, and high synchronization intensity), and exploit their opportunities (thread leve...
Correctly estimating the speed-up of a parallel embedded application is crucial to efficiently compare different parallelization techniques, task graph transformations or mapping and scheduling solutions. Unfortunately, especially in case of control-dominated applications, task correlations may heavily affect the execution time of the solutions and...
The current generation of High Level Synthesis (HLS) tools usually generates hierarchical and modular designs, mimicking the structure of the call graph of the original high-level input specification. The standard approach is to progressively synthesize functions into modules by navigating the application call graph from the leaves up to the top fu...
Synthesis of DoAll loops is a key aspect of High Level Synthesis since they allow to easily exploit the potential parallelism provided by programmable devices. This type of parallelism can be implemented in several ways: by duplicating the implementation of body loop, by exploiting loop pipelining or by applying vectorization.
In this paper a meth...
Synchronous Data Flow graphs are widely adopted in the designing of streaming applications, but were originally formulated to describe only how an application is partitioned and which data are exchanged among different tasks. Since Synchronous Data Flow graphs are often used to describe and evaluate complete design solutions, missing information (e...
Data mining, bioinformatics, knowledge discovery, social network analysis, are emerging irregular applications that exploits data structures based on pointers or linked lists, such as graphs, unbalanced trees or unstructured grids. These applications are characterized by unpredictable memory accesses and generally are memory bandwidth bound, but al...
High Level Synthesis (HLS) provides a way to significantly enhance the productivity of embedded system designers, by enabling the automatic or semiautomatic generation of hardware accelerators starting from high level descriptions with (usually software) programming languages. Typical HLS approaches build a centralized Finite State Machine (FSM) to...
Modern hardware cores necessarily have to deal with many sources of unknown or uncertain information. Components with variable latency and unpredictable behavior are becoming predominant in hardware designs. Conventional hardware cores underperform when dealing with unknown or uncertain information. Common High-Level Synthesis (HLS) approaches, whi...
This paper presents bambu, a modular framework for research on high-level synthesis currently under development at Politecnico di Milano. It can accept most of C constructs without requiring any three-state for their implementations by exploiting a novel and efficient memory architecture. It also allows the integration of floating-point units and t...
Streaming applications can efficiently exploit multiprocessors architectures by means of pipelined parallelism, but designing this type of applications can be an hard task. Different subproblems have indeed to be solved: partitioning, mapping, scheduling and pipeline stage assignment. For this reason, high level abstraction models are adopted durin...
Modern heterogeneous embedded platforms, composed of several digital signal, application specific and general purpose processors, also include reconfigurable devices supporting partial dynamic reconfiguration. These devices can change the behavior of some of their parts during execution, allowing hardware acceleration of more sections of the applic...
High Level Synthesis (HLS) provides automatic flows for the generation of hardware accelerators starting from their behavioral description. HLS guarantees results comparable to hand-written design for some applications domains such as Digital Signal Processing. However, it is not yet able to cope with performance requirements when scaling the appli...
In the past decades, design methodologies of Embedded Systems (ES) and High Performance Computing (HPC) systems have evolved following different trends. However, they are lately experiencing issues that affect both the domains, whose solutions converge to similar approaches. Examples of issues affecting both the domains are: large parallelism degre...
Classical techniques for register allocation and binding require the definition of the program execution order, since a partial ordering relation between operations must be induced to perform liveness analysis through data-flow equations. In High Level Synthesis (HLS) flows this is commonly obtained through the scheduling task. However for some HLS...
Heterogeneous architectures are becoming an increasingly relevant component for High-Performance Computing: they combine the computational power of multi-core processors with the flexibility of reconfigurable co-processor boards. Such boards are often composed of a set of standard Field Programmable Gate Arrays (FPGAs), coupled with a distributed m...
Nowadays, implementing hardware accelerators by hand-writing the RTL still leads to better quality of the results with respect to those obtained by automating the design process. Manually developing and maintaining hardware designs, however, is a complex, time-consuming and error prone task, making improvements in the automatic design flow definiti...
In this paper we present a novel implementation of a Wave Field Synthesis application based on emerging NVIDIA Compute Unified Device Architecture (CUDA) technology using NU-Tech Framework. CUDA technology unlocks the processing power of the Graphics Processing Units (GPUs) which are characterized by a highly parallel architecture. A wide range of...
Since time constraints are a very critical aspect of an embedded system, performance evaluation can not be postponed to the end of the design flow, but it has to be introduced since its early stages. Estimation techniques based on mathematical models are usually preferred during this phase since they provide quite accurate estimation of the applica...
This chapter describes the different design steps needed to go from legacy code to a transformed application that can be efficiently mapped on the hArtes platform.
In the last decade automotive audio has been gaining great attention by the scientific and industrial communities. In this context, a new approach to test and develop advanced audio algorithms for an heterogeneous embedded platform has been proposed within the European hArtes project. A real audio laboratory installed in a real car (hArtes CarLab)...
In this chapter, we describe functionality which has also been developed in the context of the hArtes project but that were not included in the final release or that are separately released. The development of the tools described here was often initiated after certain limitations of the current toolset were identified. This was the case of the memo...
Nowadays, the memory synthesis is becoming the main bottleneck for the generation of efficient hardware accelerators. This paper presents a design methodology to efficiently and automatically implement memory accesses in High-Level Synthesis. In particular, the approach starts from a behavioral specification (in pure C language) and a set of design...
In the last decade automotive audio has been gaining great attention by the scientific and industrial communities. In this context, a new approach to test and develop advanced audio algorithms for an heterogeneous embedded platform has been proposed within the European hArtes project. A real audio laboratory installed in a real car (hArtes CarLab)...
The four letters in this special section cover a spectrum of topics ranging from the optimization of adaptive multiprocessor embedded systems, methodologies for improving the P&R run-time, approaches to improving the hardware allocation, and scheduling on a reconfigurable system and techniques for optimizing system dependability.
Nowadays, the design of hardware cores has to necessarily deal with unpredictable components, due to process variation or to the interaction with external modules (e.g., memories, sensors, IP cores). Adaptive systems are, thus, one of the most important solutions to substitute traditional approaches, based on analysis at design time, especially in...
Current EDA tools are often based on standard-cell libraries for the design of modern complex systems-on-chip. In general, there are opposite trends to compact and extend the standard cell libraries, and to move towards custom libraries, highly optimized for specific goals (e.g., area, timing or power consumption) or designs. We thus propose a desi...
When targeting heterogeneous, multi-core platforms, system and application developers are not only confronted with the challenge of choosing the best hardware configuration for the application they need to map, but also the application has to be modified such that certain parts are executed on the most appropriate hardware component. The hArtes too...
Developing heterogeneous multicore platforms requires choosing the best hardware configuration for mapping the application, and modifying that application so that different parts execute on the most appropriate hardware component. The hArtes toolchain provides the option of automatic or semi-automatic support for this mapping. During test and valid...
Performance estimation is a key step in the development of an embedded system. Normally, the performance evaluation is performed using a simulator or a performance mathematical model of the target architecture. However, both these approaches are usually based on the knowledge of the architectural details of the target. In this paper we present a me...
In this paper, we apply multi-objective evolutionary computation to the synthesis of real-time, embedded, heterogeneous, multiprocessor systems (briefly, Multiprocessor Systems-on-Chip or MP-SoCs). Our approach simultaneously explores the architecture, the mapping and the scheduling of the system, by using multi-objective evolution. In particular,...
Nowadays, design issues related to physical design and scalability are becoming the main bottlenecks of modern tools for technology mapping, limiting the usage of large cells. On the other hand, the generation of regular macro cells, such as compound gates, are becoming interesting from the manufacturing point of view, but they need to be properly...
In embedded system design, the tuning and validation of a cycle accurate simulator is a difficult task. The designer has to assure that the estimation error of the simulator meets the design constraints on every application. If an application is not correctly estimated, the designer has to identify on which parts of the application the simulator in...
Fast and accurate performance estimation is a key aspect of heterogeneous embedded systems design flow, since cycle-accurate simulators, when exist, are usually too slow to be used during design space exploration. Performance estimation techniques are usually based on combination of estimation of the single processing elements which compose the sys...
Face Recognition techniques are solutions used to quickly screen a huge number of persons without being intrusive in open environments or to substitute id cards in companies or research institutes. There are several reasons that require to systems implementing these techniques to be reliable. This paper presents the design of a reliable face recogn...
Efficient mapping and scheduling of partitioned applications are crucial to improve the performance on today's reconfigurable multiprocessor systems-on-chip (MPSoCs) platforms. Most of existing heuristics adopt the directed acyclic (task) Graph as representation, that unfortunately, is not able to represent typical embedded applications (e.g., real...
High-Level Synthesis (HLS) is the process of developing digital circuits from behavioral specifications. It involves three
interdependent and NP-complete optimization problems: (i) the operation scheduling, (ii) the resource allocation, and (iii) the controller synthesis. Evolutionary Algorithms have been already effectively applied to HLS to find...
In this paper we present a new technique for automatically measuring the performance of tasks, functions or arbitrary parts of a program on a multiprocessor embedded system. The technique instruments the tasks described by OpenMP, used to represent the task parallelism, while ad hoc pragmas in the source indicate other pieces of code to profile. Th...
In this paper we propose a flow based on the Bayesian Optimization Algorithm (BOA) for mapping pipelined applications on a heterogeneous multiprocessor platform on Field Programmable Gate Array (FPGA) with customizable processors. BOA is a Probabilistic Model Building Genetic Algorithm (PMBGA) that, substituting the classical mutation and crossover...
In this paper, we compare four algorithms for the mapping of pipelined applications on a heterogeneous multiprocessor platform implemented using Field Programmable Gate Arrays (FPGAs) with customizable processors. Initially, we describe the framework and the model of pipelined application we adopted. Then, we focus on the problem of mapping a set o...
The speed-up estimation of parallelized code is crucial to efficiently compare different parallelization techniques or task graph transformations. Unfortunately, most of the time, during the parallelization of a specification, the information that can be extracted by profiling the corresponding sequential code (e.g. the most executed paths) are not...
Abstract—This paper,presents a multiprocessor,archi- tecture prototype,on a Field Programmable,Gate Arrays (FPGA) with,support,for hardware,and,software,mul- tithreading. Thanks to partial dynamic reconfiguration, this system can, at run time, spawn both software and hardware threads, sharing not only the general purpose soft-cores present in the a...