About
135 Publications
49,528 Reads
1,398 Citations
Introduction
In the multicore era, the potential to increase the processing speed of compute-intensive applications is high, and this is precisely where my research lies. More specifically, I work on the development of portable parallel kernels that run across different multicore platforms. At the same time, I work to reduce the programming effort and improve the energy efficiency of parallel computing architectures, namely through the exploitation of GPU- and FPGA-based computational resources.
Additional affiliations
- January 2011 - present
- January 2007 - present: INESC-ID/Instituto Superior Técnico, UTL (Position: Parallel computing architectures)
- January 2004 - present: Instituto de Telecomunicações, Coimbra, Portugal (Position: Efficient parallel architectures for LDPC decoding)
Publications (135)
More than 900,000 deaths were caused by Colorectal Cancer (CRC) in 2020. Colonoscopy is the gold standard for colorectal cancer screening, with studies concluding that colonoscopies significantly reduce mortality from CRC. It has been shown in the literature that computer-aided detection (CAD) systems can improve adenoma detection. In particular, de...
The case for in-memory inferencing of quantized CNNs at the edge.
Recent advances in artificial intelligence algorithms are leveraging massive amounts of data to optimize, refine, and improve existing solutions in critical areas such as healthcare, autonomous vehicles, robotics, social media, or human resources. The significant increase in the quantity of data generated each year makes it urgent to ensure the pro...
As artificial intelligence becomes a pervasive tool for the billions of IoT (Internet of things) devices at the edge, the data movement bottleneck imposes severe limitations on the performance and autonomy of these systems. PiM (processing-in-memory) is emerging as a way of mitigating the data movement bottleneck while satisfying the stringent perf...
In recent years, Convolutional Neural Networks (CNNs) have become the standard class of deep neural network for image processing, classification and segmentation tasks. However, the large strides in accuracy obtained by CNNs have been derived from increasing the complexity of network topologies, which incurs sizeable performance and energy penaltie...
Attaining the performance and efficiency levels required by modern applications often requires the use of application-specific accelerators. However, writing synthesizable Register-Transfer Level (RTL) code for such accelerators is a complex, expensive, and time-consuming process, which is cumbersome for early architecture development phases. To ta...
The papers in this special section explore cutting edge research on topics that combine artificial intelligence with edge computing, relating to the design, performance, or application of microprocessors and microcomputers.
To overcome the current performance wall, data streaming and data-flow computing paradigms have been gradually making their way into the general-purpose domain. However, the proliferation of such paradigms is often hindered by the lack of compilation support, as their execution model is usually incompatible with the internal static single-assignmen...
Quantum Computing (QC) is regarded with a mix of amazement, excitement, and skepticism. While quantum computers have been shown to outperform classical ones in particular computational tasks, their effective applicability to general-purpose problems remains under-studied. We shed light on the practical use of QC to tackle a combinatorial optimizati...
As artificial intelligence becomes a pervasive tool for the billions of IoT devices at the edge, the data movement bottleneck imposes severe limitations on these systems’ performance and autonomy. Processing-in-Memory emerges as a way to mitigate the data movement bottleneck while satisfying the stringent performance, energy efficiency, and accuracy...
Hardware designers of LDPC decoders used in modern low-power communications are confronted with the need to perform design space exploration for targeting high throughput and low-power systems. These constraints pose tremendous pressure on the on-chip design of irregular data structures and micro-circuit implementation for supporting the complex Ga...
The current design paradigm of car cabin components assumes seats aligned with the driving direction. All passengers are aligned with the driver who, until recently, was the only element in charge of controlling the vehicle. The new paradigm of self-driving cars eliminates several of those requirements, releasing the driver from control duties and...
Data movement between main memory and the processor is a significant contributor to the execution time and energy consumption of memory-intensive applications. This data movement bottleneck can be alleviated using Processing-in-Memory (PiM), which enables computation inside the memory chip. However, existing PiM architectures often lack support for...
Recently, there has been much interest in the use of convolutional neural networks (CNN) for mobile user localization in massive multiple-input multiple-output (MIMO) systems operating at millimeter wave (mmWave) frequencies. However, current CNN-based approaches cannot predict the confidence interval bounds for the localization accuracy. While the...
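A generic way to obtain the confidence-interval bounds mentioned above is quantile regression: training one output head with the pinball loss at a low quantile and another at a high quantile yields lower and upper bounds on the prediction. This is a standard recipe sketched here as an assumption, not necessarily the paper's own method:

```python
import numpy as np

def pinball_loss(y_true, y_pred, tau):
    """Quantile (pinball) loss. Training one output head with tau = 0.05
    and another with tau = 0.95 makes a regressor predict the bounds of
    a 90% interval around its localization estimate."""
    diff = y_true - y_pred
    return np.mean(np.maximum(tau * diff, (tau - 1.0) * diff))

# Hypothetical usage: the lower-bound head of a localization network.
y_true = np.array([2.0, 3.5, 1.0])
lower = np.array([1.5, 3.0, 0.8])   # predictions of a head trained with tau = 0.05
print(pinball_loss(y_true, lower, 0.05))
```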
Non-binary low-density parity-check (NB-LDPC) codes show higher error-correcting performance than binary codes when the codeword length is moderate and/or the channel has bursts of errors. The need for high-speed decoders for future digital communications led to the investigation of optimized NB-LDPC decoding algorithms and efficient implementations...
Nowadays, processing systems are constrained by the low efficiency of their memory subsystems. Although memories evolved into faster and more efficient devices through the years, they were still unable to keep up with the computational power offered by processors, i.e., to feed the processors with the data they require at the rate it is consumed. Co...
Automatic classification of musical instruments from audio relies heavily on datasets of acoustic recordings to train models of those instruments. This requires precise labels of the instruments' events, which are very difficult to obtain, especially in polyphonic performances. OpenMIC-2018 is a polyphon...
Planar 3D reconstruction presents advantages over point cloud representations. This work focuses on the acceleration of piecewise-planar-based 3D reconstruction, a StereoScan method. We identify the SymStereo (log N) and uncapacitated facility location (UFL) algorithms as the most computationally expensive tasks, consuming nearly 80% of total runti...
Environmental concern regularly leads to the study and improvement of manufacturing processes and the development of new industrial products. The purpose of this work is to optimize the amount of injected plastic and reduce the number of parts used in the production of entrance panels to control features inside the car cabin. It focuses on a partic...
Edge applications have evolved into a variety of scenarios that include the acquisition and compression of immense amounts of images in remote environments such as satellites and drones, where characteristics such as power have to be properly balanced against constrained memory and parallel computational resources. The CCSDS-123 is a standar...
The polyphonic OpenMIC-2018 dataset is based on weak and incomplete labels. Automatic classification of sound events based on the VGGish bottleneck layer, as proposed before by AudioSet, implies classifying only one second at a time, making it hard to find the label of that exact moment. To address this, this paper propo...
Over the last few years, positioning systems have become increasingly pervasive, covering most of the planet's surface. Although they are accurate enough for a large number of uses, their precision, power consumption, and hardware requirements establish the limits for their adoption in mobile devices. In this paper, the energy consumption of a proposed...
The introduction of 5G’s millimeter wave transmissions brings a new paradigm to wireless communications. Whereas physical obstacles were mostly associated with signal attenuation, their presence now adds complex, non-linear phenomena, including reflections and scattering. The result is a multipath propagation environment, shaped by the obstacles en...
The CCSDS-123 is a standard for lossless compression of multispectral and hyperspectral images with applications in on-board power constrained systems such as satellites and military drones. This work explores the low-power heterogeneous architecture of the Nvidia Jetson TX2 by proposing a parallel solution to the CCSDS-123 compressor on embedded s...
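As background on the structure such compressors share: CCSDS-123 follows a predict-then-entropy-code pipeline over the spatial and spectral dimensions. The toy 1D sketch below (my simplification, not the standard's actual predictor or residual mapping) shows only the generic predict-and-map idea:

```python
def predict_and_map(samples):
    """Toy 1D previous-sample predictor with zig-zag residual mapping.
    Real CCSDS-123 prediction is a 3D adaptive scheme over spatial and
    spectral neighbors; this sketch only illustrates the generic
    predict-then-map structure that precedes entropy coding."""
    prev = 0
    mapped = []
    for s in samples:
        delta = s - prev                       # prediction residual
        mapped.append(2 * delta if delta >= 0 else -2 * delta - 1)
        prev = s                               # next prediction = current sample
    return mapped

print(predict_and_map([10, 12, 11, 11, 15]))   # -> [20, 4, 1, 0, 8]
```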
This paper presents a new heterogeneous CPU+GPU attack against lattice-based (post-quantum) cryptosystems based on the Shortest Vector Problem (SVP), a central problem in lattice-based cryptanalysis. To the best of our knowledge, this is the first SVP attack against lattice-based cryptosystems using CPUs and GPUs simultaneously. We show that Voro...
Traditional dense stereo estimation algorithms measure photo-similarity to calculate the disparity between image pairs. SymStereo is a new framework of matching cost functions that measure symmetry to evaluate the possibility of two pixels being a match. This article proposes a fully functional real-time parallel 3D reconstruction pipeline that use...
With the recent surge in popularity of Convolutional Neural Networks (CNNs), motivated by their significant performance in many classification and related tasks, a new challenge now needs to be addressed: how to accommodate CNNs in mobile devices, such as drones, smartphones, and similar low-power devices? In order to tackle this challenge we explo...
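The truncated abstract does not state which technique the paper explores; one common way to fit CNNs into low-power devices is post-training quantization of weights to 8-bit integers. A minimal sketch of that generic technique, with illustrative names:

```python
import numpy as np

def quantize_int8(w):
    """Symmetric uniform post-training quantization of a weight tensor
    to int8. Returns quantized integers plus the scale needed to map
    them back to floats (a generic mobile-CNN technique; not taken
    from the paper above)."""
    max_abs = np.abs(w).max()
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.array([-0.52, 0.13, 0.98], dtype=np.float32)
q, s = quantize_int8(w)
print(q, dequantize(q, s))   # int8 weights and their float reconstruction
```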
With millimeter wave wireless communications, the resulting radiation reflects on most visible objects, creating rich multipath environments, namely in urban scenarios. The radiation captured by a listening device is thus shaped by the obstacles encountered, which carry latent information regarding their relative positions. In this paper, a system...
Convolutional Neural Networks (CNNs) have shown to be powerful classification tools in tasks that range from check reading to medical diagnosis, reaching close to human perception, and in some cases surpassing it. However, the problems to solve are becoming larger and more complex, which translates to larger CNNs, leading to longer training times t...
With the recent surge in popularity of Convolutional Neural Networks (CNNs), motivated by their significant performance in many classification and related tasks, a new challenge now needs to be addressed: how to accommodate CNNs in mobile devices, such as drones, smartphones, and similar low-power devices? In order to tackle this challenge we expl...
Applying advanced signal processing and artificial intelligence algorithms is often constrained by power and energy consumption limitations in high-performance and embedded, cyber-physical, and super-computing devices and systems. Although Graphics Processing Units (GPUs) helped to mitigate the throughput-per-Watt performance problem in many comput...
This survey describes the lattice problems that are key in the study of lattice-based cryptography, identifies and categorizes methods for solving these problems, analyzes existing implementations of these algorithms, and extrapolates on the future of lattice-based cryptanalysis, based on the foreseeable advances in computer architecture. Some futu...
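For readers new to the topic, the Shortest Vector Problem at the heart of the survey is solvable exactly only in very low dimensions; the classic Lagrange-Gauss reduction solves it in dimension 2. A toy Python sketch (names mine, and far removed from the high-dimensional enumeration and sieving methods such surveys cover):

```python
def lagrange_gauss(u, v):
    """Solve SVP exactly in dimension 2: reduce the integer basis (u, v)
    until u is a shortest nonzero vector of the lattice they span.
    Assumes u and v are linearly independent nonzero vectors."""
    def dot(a, b):
        return a[0] * b[0] + a[1] * b[1]

    while True:
        if dot(u, u) > dot(v, v):
            u, v = v, u                      # keep u as the shorter vector
        m = round(dot(u, v) / dot(u, u))     # nearest-integer Gram coefficient
        if m == 0:
            return u                         # v cannot be shortened against u
        v = (v[0] - m * u[0], v[1] - m * u[1])

print(lagrange_gauss((5, 8), (6, 10)))       # -> (1, 0)
```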
This work introduces a parallel computing framework to characterize the propagation of electron waves in graphene-based nanostructures. The electron wave dynamics is modeled using both “microscopic” and effective medium formalisms and the numerical solution of the two-dimensional massless Dirac equation is determined using a Finite-Difference Time-...
Today, high-level synthesis (HLS) tools are being touted as a means to perform rapid prototyping and to shorten the long development cycles needed to produce hardware designs at the register transfer level (RTL). In this paper, we attempt to verify this claim by testing the productivity benefits offered by current HLS tools, using them to develop one...
Although OpenCL aims to achieve portability at the code level, different hardware platforms require different approaches in order to extract the best performance from OpenCL-based code. In this work, we use an image encoder originally tuned for OpenCL on GPUs (OpenCL-GPU) and optimize it for multi-CPU based platforms. We produce two OpenCL-based ve...
With the growth of packet-switch networks in the mobile consumer electronics market, fountain codes are coming to play an increasingly important role as they allow high packet loss rates while still supporting high QoS. This is particularly critical for video streaming in multimedia delivery on moving handheld mobile consumer devices. A powerful fo...
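As background on how fountain codes tolerate packet loss: each encoded symbol is the XOR of a randomly chosen subset of source blocks, so a receiver can decode from any sufficiently large set of received symbols. A minimal LT-style encoding sketch (uniform degree choice instead of the robust soliton distribution, purely for illustration):

```python
import random

def lt_encode_symbol(source_blocks, rng=random):
    """Produce one LT-style encoded symbol: pick a degree d, choose d
    distinct source blocks at random, and XOR them together. A real
    fountain encoder draws d from the robust soliton distribution;
    the uniform choice here is purely illustrative."""
    d = rng.randint(1, len(source_blocks))           # degree of the symbol
    neighbors = rng.sample(range(len(source_blocks)), d)
    encoded = 0
    for i in neighbors:
        encoded ^= source_blocks[i]                  # XOR of the chosen blocks
    return neighbors, encoded                        # receiver needs both

blocks = [0x3A, 0x91, 0x5C, 0xE7]                    # toy 8-bit source blocks
print(lt_encode_symbol(blocks))
```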
This paper studies the performance and energy consumption of several multi-core, multi-CPU, and many-core hardware platforms and software stacks for parallel programming. It uses the Multimedia Multiscale Parser (MMP), a computationally demanding image encoder application, which was ported to several hardware and software parallel environments as a...
Low-density parity-check (LDPC) block codes are popular forward error correction schemes due to their capacity-approaching characteristics. However, the realization of LDPC decoders that meet both low latency and high throughput is not a trivial challenge. Usually, this has been solved with ASIC and FPGA technology that enables meeting the decoder...
In this work we develop a Finite Difference in the Time Domain (FDTD) algorithm to model the time evolution of electron waves in graphene superlattices, using both microscopic and effective medium formalisms. It is proven that the dynamics of an electronic state may be accurately predicted with the effective medium approach, provided the initial st...
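For reference, the equation such FDTD schemes discretize is the two-dimensional massless Dirac equation; its standard form (stated here from the general literature, not copied from the paper) is:

```latex
% Two-dimensional massless Dirac equation for the spinor
% \psi = (\psi_A, \psi_B)^T, with Fermi velocity v_F and
% Pauli matrices \sigma_x, \sigma_y:
i\hbar \frac{\partial \psi}{\partial t}
  = v_F \left( \sigma_x p_x + \sigma_y p_y \right) \psi
  = -i\hbar v_F \left( \sigma_x \frac{\partial}{\partial x}
                     + \sigma_y \frac{\partial}{\partial y} \right) \psi
```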
Low-Density Parity-Check (LDPC) decoders typically implement a single decoding algorithm or update rule, which narrows down the design space of the decoder and maintains its overall simplicity. However, gear-shift techniques combine multiple decoding algorithms, update rules and quantization of the log-likelihood ratios (LLRs), allowing wider desig...
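Since the gear-shift designs above switch among update rules, the min-sum rule is a useful concrete example: each outgoing check-node message is the product of the signs of the other incoming LLRs times the minimum of their magnitudes. A minimal, uncorrected NumPy sketch (function name mine, not from the paper):

```python
import numpy as np

def min_sum_check_node(llrs):
    """Min-sum check-node update (no offset/scaling correction): each
    outgoing message is the product of the signs of the *other* incoming
    LLRs times the minimum of their magnitudes. Assumes >= 2 nonzero LLRs."""
    llrs = np.asarray(llrs, dtype=float)
    signs = np.sign(llrs)
    mags = np.abs(llrs)
    order = np.argsort(mags)
    min1, min2 = mags[order[0]], mags[order[1]]
    out_mags = np.full_like(mags, min1)
    out_mags[order[0]] = min2                # the minimum edge sees the 2nd minimum
    out_signs = np.prod(signs) * signs       # divide out own sign (signs are +/-1)
    return out_signs * out_mags

print(min_sum_check_node([-1.2, 0.4, 3.0])) # -> [ 0.4 -1.2 -0.4]
```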
Computing on field-programmable gate arrays (FPGAs) has been receiving continued interest as it provides high performance at relatively low power budgets, while avoiding the high non-recurring engineering (NRE) costs associated with ASIC designs. However, FPGA development is typically performed using register transfer level (RTL) languages which ma...