Nuno Paulino

Institute for Systems and Computer Engineering, Technology and Science (INESC TEC) | INESC TEC · CTM – Centre for Telecommunications and Multimedia

Doctor of Engineering

About

48
Publications
4,434
Reads
113
Citations
Introduction
Nuno M. C. Paulino received an M.Sc. degree in Electrical and Computer Engineering from the Faculty of Engineering of the University of Porto in 2011. He received his Ph.D. in the same field from the same institution in 2015. He is a researcher at INESC TEC, where his research interests include run-time reconfigurable systems, embedded systems in FPGAs, co-processor hardware acceleration, and tools for hardware/software co-design automation.

Publications

Article
Full-text available
Deep learning methods have been shown to be competitive solutions for modulation classification tasks, but suffer from being computationally expensive, limiting their use on embedded devices. We propose a new deep neural network architecture which employs known structures, depth-wise separable convolution and residual connections, as well as a comp...
Article
In order to achieve the full potential of the Internet-of-Things, connectivity between devices should be ubiquitous and efficient. Wireless mesh networks are a critical component to achieve this ubiquitous connectivity for a wide range of services, and are composed of terminal devices (i.e., nodes), such as sensors of various types, and wall powere...
Preprint
Full-text available
One of the applications in the realm of the Internet-of-Things (IoT) is real-time localization of assets in specific application environments where satellite-based global positioning is unviable. Numerous approaches for localization relying on wireless sensor mesh systems have been evaluated, but the recent Bluetooth Low Energy (BLE) 5.1 directi...
Presentation
Full-text available
Presentation
Full-text available
Article
Full-text available
Future telecommunications aim to be ubiquitous and efficient, as widely deployed connectivity will allow for a variety of edge/fog based services. Challenges are numerous, e.g., spectrum overuse, energy efficiency, latency and bandwidth, battery life and computing power of edge devices. Addressing these challenges is key to compose the backbone for...
Preprint
Full-text available
Decision trees are often preferred when implementing Machine Learning in embedded systems for their simplicity and scalability. Hoeffding Trees are a type of Decision Trees that take advantage of the Hoeffding Bound to allow them to learn patterns in data without having to continuously store the data samples for future reprocessing. This makes them...
Preprint
Full-text available
In order to achieve the full potential of the Internet-of-Things, connectivity between devices should be ubiquitous and efficient. Wireless mesh networks are a critical component to achieve this ubiquitous connectivity for a wide range of services, and are composed of terminal devices (i.e., nodes), such as sensors of various types, and wall powere...
Preprint
Full-text available
This paper presents and discusses an implementation of a multiple target tracking method, which is able to deal with target interactions and prevent tracker failures due to hijacking. The referenced approach uses a Markov Chain Monte Carlo (MCMC) sampling step to evaluate the filter and constructs an efficient proposal density to generate new sampl...
Preprint
Full-text available
This paper presents results of a study of the performance of several base classifiers for recognition of handwritten characters of the modern Latin alphabet. Base classification performance is further enhanced through Viterbi error correction, which determines the Viterbi sequence. Hidden Markov Models (HMMs) exploit relationships between...
Preprint
Full-text available
This paper presents the design and post-layout characteristics of a differential-capacitance-based inertial accelerometer. This includes a MEMS-based mechanical sensing element and a CMOS charge amplifier, which is the first stage of a readout circuit. The mechanical sensor is designed according to the SOIMUMPs fabrication process technology, and th...
Presentation
Full-text available
A brief history and explanation of FPGAs for software engineers, covering the basics of FPGA history, application, design flow, challenges, and future perspectives.
Article
As applications move to the edge, efficiency in computing power and power/energy consumption is required. Heterogeneous computing promises to meet these requirements through application-specific hardware accelerators. Runtime adaptivity might be of paramount importance to realize the potential of hardware specialization, but further study is requir...
Presentation
Full-text available
The recent Bluetooth 5.1 specification introduced the use of Angle-of-Arrival (AoA) information which enables the design of novel low-cost indoor positioning systems. Existing approaches rely on multiple fixed gateways equipped with antenna arrays, in order to determine the location of an arbitrary number of simple mobile omni-directional emitters....
Conference Paper
Full-text available
With the ever more pressing issue arising from the phenomenon known as the death or slowdown of Moore’s Law and Dennard scaling, compute performance has not been increasing at the rate the industry had been accustomed to over the decades [7]. This has prompted a shift from mostly homogeneous compute architectures to increasingly heterogeneous o...
Preprint
Full-text available
High-Level Synthesis has introduced reconfigurable logic to a new world -- that of software development. The newest wave of HLS tools has been successful, and the future looks bright. But is HLS the end-all-be-all to FPGA acceleration? Is it enough to allow non-experts to program FPGAs successfully, even when dealing with troublesome data structure...
Conference Paper
Executing ARMv8 Loop Traces on Reconfigurable Accelerator via Binary Translation Framework
Article
Full-text available
High Level Synthesis (HLS) tools targeting Field Programmable Gate Arrays (FPGAs) aim to provide a method for programming these devices via high-level abstractions. Initially, HLS support for FPGAs focused on compiling C/C++ to hardware circuits. This raised the issue of determining the programming practices which resulted in the best performing ci...
Code
A Generator of Randomly Correlated N-Dimensional Clusters: This Matlab/Octave function is capable of outputting "txt" files containing randomly generated data points, clustered around a specified number of centroids. The user may specify the total number of points "N", the number of clusters "K", and the number of attributes "D" per data point. The...
Data
This is a simple batch of data sets of points containing only integer attributes. The data sets were generated with a randomly correlated data set generator (DOI:10.13140/RG.2.2.34866.43200). This batch includes a total of 12 data sets which can be used to validate implementations of clustering algorithms such as k-nearest neighbours, or k-means.
Conference Paper
Full-text available
Hardware specialization is an efficient solution for maximization of performance and minimization of energy consumption. This work is based on automated detection of workload by analysis of a compiled application, and on the automated generation of specialized hardware modules. We will present the current version of the binary analysis and translat...
Data
This dataset gathers product information for desktop processor devices (and a small subset of mobile devices), ranging from the years of 1970 to 2019. The data were gathered from several sources, which can be found under folder "sources". Sources include: CPUDB from Stanford University (http://cpudb.stanford.edu/), data gathered from AMD and Inte...
Article
Full-text available
The breakdown of Dennard scaling has resulted in a decade-long stall of the maximum operating clock frequencies of processors. To mitigate this issue, computing shifted to multi-core devices. This introduced the need for programming flows and tools that facilitate the expression of workload parallelism at high abstraction levels. However, not all w...
Article
The use of specialized accelerator circuits is a feasible solution to address performance and energy issues in embedded systems. This paper extends a previous field-programmable gate array-based approach that automatically generates pipelined customized loop accelerators (CLAs) from runtime instruction traces. Despite efficient acceleration, the ap...
Conference Paper
Software developers have always found it difficult to adopt Field-Programmable Gate Arrays (FPGAs) as computing platforms. Recent advances in HLS tools aim to ease the mapping of computations to FPGAs by abstracting the hardware design effort via a standard OpenCL interface and execution model. However, OpenCL is a low-level programming language an...
Article
Many embedded applications process large amounts of data using regular computational kernels, amenable to acceleration by specialized hardware coprocessors. To reduce the significant design effort, the dedicated hardware may be automatically generated, usually starting from the application's source or binary code. This paper presents a moduloschedu...
Thesis
Full-text available
With the increase of application complexity and amount of data, the required computational power increases in tandem. Technology improvements have allowed for the increase in clock frequencies of all kinds of processing architectures. But exploration of new architectures and computing paradigms over the simple single-issue in-order processor are equ...
Article
The acceleration of applications running on a general purpose processor (GPP), by mapping parts of their execution to reconfigurable hardware, is an approach which does not involve the program's source code and still ensures program portability over different target reconfigurable fabrics. However, the problem is very challenging, as suitable sequences...
Article
This paper presents a binary acceleration approach based on extending a General Purpose Processor (GPP) with a Reconfigurable Processing Unit (RPU), both sharing an external data memory. In this approach, repeating sequences of GPP instructions are migrated to the RPU. The RPU resources are selected and organized off-line using execution trace infor...
Article
This article presents a reconfigurable hardware/software architecture for binary acceleration of embedded applications. A Reconfigurable Processing Unit (RPU) is used as a coprocessor of the General Purpose Processor (GPP) to accelerate the execution of repetitive instruction sequences called Megablocks. A toolchain detects Megablocks from instruct...
Article
This paper presents a novel approach to accelerate program execution by mapping repetitive traces of executed instructions, called Megablocks, to a runtime reconfigurable array of functional units. An offline tool suite extracts Megablocks from microprocessor instruction traces and generates a Reconfigurable Processing Unit (RPU) tailored for the e...
Conference Paper
This paper presents an extension to a hardware/software system architecture in which repetitive instruction traces, called Megablocks, are executed on a Reconfigurable Processing Unit (RPU). This scheme is supported by a custom toolchain able to automatically generate an RPU tailored for the execution of one or more Megablocks detected offline. Switching between hard...
Article
Full-text available
The ability to map instructions running in a microprocessor to a reconfigurable processing unit (RPU), acting as a coprocessor, enables the runtime acceleration of applications and ensures code and possibly performance portability. In this work, we focus on the mapping of loop-based instruction traces (called Megablocks) to RPUs. The proposed appro...
Conference Paper
This paper presents an offline tool-chain which automatically extracts loops (Megablocks) from MicroBlaze instruction traces and creates a tailored Reconfigurable Processing Unit (RPU) for those loops. The system moves loops from the CPU to the RPU transparently, at runtime, and without changing the executable binaries. The system was implemented...
