Eric Flamand’s research while affiliated with Grenoble Institute of Technology and other places


Publications (23)


Self-Learning for Personalized Keyword Spotting on Ultra-Low-Power Audio Sensors
  • Preprint

August 2024 · 9 Reads

Manuele Rusci · Marco Fariselli · [...] · Tinne Tuytelaars
This paper proposes a self-learning framework to incrementally train (fine-tune) a personalized Keyword Spotting (KWS) model after the deployment on ultra-low power smart audio sensors. We address the fundamental problem of the absence of labeled training data by assigning pseudo-labels to the new recorded audio frames based on a similarity score with respect to few user recordings. By experimenting with multiple KWS models with a number of parameters up to 0.5M on two public datasets, we show an accuracy improvement of up to +19.2% and +16.0% vs. the initial models pretrained on a large set of generic keywords. The labeling task is demonstrated on a sensor system composed of a low-power microphone and an energy-efficient Microcontroller (MCU). By efficiently exploiting the heterogeneous processing engines of the MCU, the always-on labeling task runs in real-time with an average power cost of up to 8.2 mW. On the same platform, we estimate an energy cost for on-device training 10x lower than the labeling energy if sampling a new utterance every 5 s or 16.4 s with a DS-CNN-S or a DS-CNN-M model. Our empirical result paves the way to self-adaptive personalized KWS sensors at the extreme edge.
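The pseudo-labeling step described in the abstract can be pictured as a nearest-enrollment matcher over embeddings. The snippet below is only a minimal illustrative sketch under that reading, not the authors' implementation; the cosine metric, the threshold value, and all names are assumptions.

    # Illustrative sketch (not the paper's code): assign a pseudo-label to a new
    # audio frame by comparing its embedding with a few user enrollment recordings.
    # The similarity metric (cosine) and the threshold value are assumptions.
    import numpy as np

    SIM_THRESHOLD = 0.7  # hypothetical acceptance threshold

    def cosine_similarity(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

    def pseudo_label(frame_embedding, enroll_embeddings):
        """enroll_embeddings maps keyword -> list of embeddings of the few user
        recordings; return the best-matching keyword, or None to discard the frame."""
        best_keyword, best_score = None, -1.0
        for keyword, embeddings in enroll_embeddings.items():
            score = max(cosine_similarity(frame_embedding, e) for e in embeddings)
            if score > best_score:
                best_keyword, best_score = keyword, score
        return best_keyword if best_score >= SIM_THRESHOLD else None

Frames that receive a pseudo-label would then feed the on-device fine-tuning step described in the abstract.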


Fig. 1: Baseline SoC architecture, featuring an 8-core cluster with a private instruction cache (shown in green).
Fig. 2: Cache bank subsystem
Fig. 3: Shared instruction cache architectures, with their critical paths highlighted by red arrows.
Fig. 5: Two-level cache with L1 next-line prefetching.

The two-level organization reduces the complexity of the interconnect, removing the critical path present in the SP. Moreover, it does not suffer from the congestion issues related to the multiple ports of the MP. The width of the fetch interface and of the cache line is 128 bits. Between the L1 and the L1.5, the proposed hierarchical cache features an optional request buffer and a response buffer (the request buffer can be enabled with a SystemVerilog parameter) to cut potentially critical paths from L1 to L1.5. Since the response buffer is sufficient in the presented cluster implementation to avoid critical paths from L1 to L1.5, the request buffer has been disabled in the experiments performed in this work. On the other hand, this buffer is a powerful knob to improve scalability towards high-end clusters optimized for frequency or featuring a larger number of cores, taking advantage of the hierarchical structure of the cache. In the proposed architecture, the access time of the L1 and the L1.5 is one cycle and two cycles, respectively. In case of banking conflicts in the L1.5 cache, the access time can be larger than two cycles, depending on the number of parallel requests. However, contention on the L1.5 banks decreases significantly with respect to the SP thanks to the L1 banks filtering many requests to the L1.5. Compared to a private cache, the two-level cache can improve performance substantially for high-footprint SPMD applications, as it avoids replication in the L1.5 cache and thereby increases the effective cache capacity. Still, the presence of relatively small single-cycle L1 caches may increase the cycle count required to execute a program with respect to the shared cache architectures [32]. To overcome this issue, in this work we propose cache prefetching between the L1 and the L1.5.
Fig. 6: Timing diagram of L1 to L1.5 prefetch. The upper timing diagram describes the core fetching from L1 with one cycle latency when hit (first fetch) and three cycles latency when miss and hit in the L1.5 (second fetch). The bottom timing diagram describes L1 refill and prefetch to the L1.5 with two cycles latency when hit. Once there is a core fetch, prefetch starts in the next cycle.
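Taken together, the Fig. 5 and Fig. 6 descriptions fix the nominal latencies (one-cycle L1, two-cycle L1.5, three cycles for an L1 miss that hits in the L1.5) and the next-line prefetch behavior. The toy model below only illustrates that arithmetic; it ignores banking conflicts, refill bandwidth, and the request/response buffers, and the L1 size and eviction policy are placeholders, not the paper's.

    # Toy cycle-count model of the two-level cache described above (assumptions noted).
    L1_LINES, LINE_WORDS = 32, 4       # placeholder L1 size; 128-bit lines = 4 x 32-bit words

    def fetch_cycles(fetch_addrs):
        """Cycles to serve word-aligned fetches, assuming every L1 miss hits in the L1.5."""
        l1 = set()                     # line indices currently held in the private L1
        cycles = 0
        for addr in fetch_addrs:
            line = addr // LINE_WORDS
            if line in l1:
                cycles += 1            # L1 hit: single-cycle fetch
            else:
                cycles += 3            # L1 miss, L1.5 hit: 1 + 2 cycles (cf. Fig. 6)
                l1.add(line)
            l1.add(line + 1)           # next-line prefetch issued after the fetch
            while len(l1) > L1_LINES:  # crude capacity eviction, not the real policy
                l1.pop()
        return cycles

    # Sequential 64-word stream: only the very first fetch misses -> 3 + 63 = 66 cycles.
    print(fetch_cycles(range(64)))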


Scalable Hierarchical Instruction Cache for Ultra-Low-Power Processors Clusters
  • Preprint
  • File available

September 2023 · 166 Reads

High Performance and Energy Efficiency are critical requirements for Internet of Things (IoT) end-nodes. Exploiting tightly-coupled clusters of programmable processors (CMPs) has recently emerged as a suitable solution to address this challenge. One of the main bottlenecks limiting the performance and energy efficiency of these systems is the instruction cache architecture due to its criticality in terms of timing (i.e., maximum operating frequency), bandwidth, and power. We propose a hierarchical instruction cache tailored to ultra-low-power tightly-coupled processor clusters where a relatively large cache (L1.5) is shared by L1 private caches through a two-cycle latency interconnect. To address the performance loss caused by the L1 capacity misses, we introduce a next-line prefetcher with cache probe filtering (CPF) from L1 to L1.5. We optimize the core instruction fetch (IF) stage by removing the critical core-to-L1 combinational path. We present a detailed comparison of instruction cache architectures' performance and energy efficiency for parallel ultra-low-power (ULP) clusters. Focusing on the implementation, our two-level instruction cache provides better scalability than existing shared caches, delivering up to 20% higher operating frequency. On average, the proposed two-level cache improves maximum performance by up to 17% compared to the state-of-the-art while delivering similar energy efficiency for most relevant applications.
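One plausible reading of the cache probe filtering (CPF) mentioned in the abstract is that a next-line prefetch is forwarded to the shared L1.5 only if a probe of the private L1 (and of the outstanding refills) does not already cover the target line. The snippet below is a hedged sketch of that filter only; the data structures and names are illustrative, not the actual hardware.

    # Hedged sketch of prefetch filtering: drop next-line prefetches whose target
    # line is already cached in L1 or already being refilled (names are illustrative).
    def should_prefetch(next_line, l1_lines, pending_refills):
        if next_line in l1_lines:         # probe hit: line already in the private L1
            return False
        if next_line in pending_refills:  # a refill for this line is already in flight
            return False
        return True                       # genuinely missing: forward the request to L1.5

    # Example: line 7 is cached and line 9 is refilling, so only line 8 is prefetched.
    print([n for n in (7, 8, 9) if should_prefetch(n, {5, 6, 7}, {9})])  # -> [8]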


Scalable Hierarchical Instruction Cache for Ultralow-Power Processors Clusters

April 2023 · 12 Reads · 4 Citations

IEEE Transactions on Very Large Scale Integration (VLSI) Systems

High performance and energy efficiency are critical requirements for Internet of Things (IoT) end-nodes. Exploiting tightly coupled clusters of programmable processors (CMPs) has recently emerged as a suitable solution to address this challenge. One of the main bottlenecks limiting the performance and energy efficiency of these systems is the instruction cache architecture due to its criticality in terms of timing (i.e., maximum operating frequency), bandwidth, and power. We propose a hierarchical instruction cache tailored to ultralow-power (ULP) tightly coupled processor clusters where a relatively large cache (L1.5) is shared by L1 private (PR) caches through a two-cycle latency interconnect. To address the performance loss caused by the L1 capacity misses, we introduce a next-line prefetcher with cache probe filtering (CPF) from L1 to L1.5. We optimize the core instruction fetch (IF) stage by removing the critical core-to-L1 combinational path. We present a detailed comparison of instruction cache architectures’ performance and energy efficiency for parallel ULP (PULP) clusters. Focusing on the implementation, our two-level instruction cache provides better scalability than existing shared caches, delivering up to 20% higher operating frequency. On average, the proposed two-level cache improves maximum performance by up to 17% compared to the state-of-the-art while delivering similar energy efficiency for most relevant applications.


Accelerating RNN-Based Speech Enhancement on a Multi-core MCU with Mixed FP16-INT8 Post-training Quantization

January 2023 · 15 Reads · 6 Citations

Communications in Computer and Information Science

This paper presents an optimized methodology to design and deploy Speech Enhancement (SE) algorithms based on Recurrent Neural Networks (RNNs) on a state-of-the-art MicroController Unit (MCU), with 1+8 general-purpose RISC-V cores. To achieve low-latency execution, we propose an optimized software pipeline interleaving parallel computation of LSTM or GRU recurrent blocks, featuring vectorized 8-bit integer (INT8) and 16-bit floating-point (FP16) compute units, with manually-managed memory transfers of model parameters. To ensure minimal accuracy degradation with respect to the full-precision models, we propose a novel FP16-INT8 Mixed-Precision Post-Training Quantization (PTQ) scheme that compresses the recurrent layers to 8-bit while the bit precision of the remaining layers is kept to FP16. Experiments are conducted on multiple LSTM and GRU based SE models trained on the Valentini dataset, featuring up to 1.24M parameters. Thanks to the proposed approaches, we speed up the computation by up to 4× with respect to the lossless FP16 baselines. Differently from a uniform 8-bit quantization that degrades the PESQ score by 0.3 on average, the Mixed-Precision PTQ scheme leads to a degradation of only 0.06, while achieving a 1.4–1.7× memory saving. Thanks to this compression, we cut the power cost of the external memory by fitting the large models on the limited on-chip non-volatile memory, and we gain an MCU power saving of up to 2.5× by reducing the supply voltage from 0.8 V to 0.65 V while still matching the real-time constraints. Our design is more than 10× more energy efficient than state-of-the-art SE solutions deployed on single-core MCUs that make use of smaller models and quantization-aware training.

Keywords: MCU · Speech Enhancement · RNNs · Mixed-Precision
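As a rough picture of the mixed-precision idea (quantize the recurrent weight tensors to INT8, keep everything else in FP16), here is a minimal NumPy sketch. The per-tensor symmetric scaling and the layer-name heuristic are simplifying assumptions, not the paper's exact PTQ scheme, which also has to handle activations and calibration.

    # Simplified mixed FP16-INT8 PTQ sketch (assumptions noted in the text above).
    import numpy as np

    def quantize_int8(w):
        """Per-tensor symmetric INT8 quantization; returns (int8 weights, scale)."""
        amax = float(np.max(np.abs(w)))
        scale = amax / 127.0 if amax > 0 else 1.0
        return np.clip(np.round(w / scale), -128, 127).astype(np.int8), scale

    def mixed_precision_ptq(weights, recurrent_keys=("lstm", "gru")):
        """weights: dict name -> float array. Recurrent tensors go to INT8, the rest to FP16."""
        out = {}
        for name, w in weights.items():
            w = np.asarray(w, dtype=np.float32)
            if any(k in name.lower() for k in recurrent_keys):
                q, s = quantize_int8(w)
                out[name] = {"int8": q, "scale": s}   # 8-bit recurrent layer weights
            else:
                out[name] = w.astype(np.float16)      # remaining layers stay FP16
        return out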


Accelerating RNN-based Speech Enhancement on a Multi-Core MCU with Mixed FP16-INT8 Post-Training Quantization

October 2022 · 254 Reads

This paper presents an optimized methodology to design and deploy Speech Enhancement (SE) algorithms based on Recurrent Neural Networks (RNNs) on a state-of-the-art MicroController Unit (MCU), with 1+8 general-purpose RISC-V cores. To achieve low-latency execution, we propose an optimized software pipeline interleaving parallel computation of LSTM or GRU recurrent blocks, featuring vectorized 8-bit integer (INT8) and 16-bit floating-point (FP16) compute units, with manually-managed memory transfers of model parameters. To ensure minimal accuracy degradation with respect to the full-precision models, we propose a novel FP16-INT8 Mixed-Precision Post-Training Quantization (PTQ) scheme that compresses the recurrent layers to 8-bit while the bit precision of the remaining layers is kept to FP16. Experiments are conducted on multiple LSTM and GRU based SE models trained on the Valentini dataset, featuring up to 1.24M parameters. Thanks to the proposed approaches, we speed up the computation by up to 4x with respect to the lossless FP16 baselines. Differently from a uniform 8-bit quantization that degrades the PESQ score by 0.3 on average, the Mixed-Precision PTQ scheme leads to a degradation of only 0.06, while achieving a 1.4-1.7x memory saving. Thanks to this compression, we cut the power cost of the external memory by fitting the large models on the limited on-chip non-volatile memory, and we gain an MCU power saving of up to 2.5x by reducing the supply voltage from 0.8 V to 0.65 V while still matching the real-time constraints. Our design is 10x more energy efficient than state-of-the-art SE solutions deployed on single-core MCUs that make use of smaller models and quantization-aware training.


Vega: A Ten-Core SoC for IoT Endnodes with DNN Acceleration and Cognitive Wake-Up from MRAM-Based State-Retentive Sleep Mode

January 2022 · 55 Reads · 94 Citations

IEEE Journal of Solid-State Circuits

The Internet-of-Things (IoT) requires endnodes with ultra-low-power always-on capability for a long battery lifetime, as well as high performance, energy efficiency, and extreme flexibility to deal with complex and fast-evolving near-sensor analytics algorithms (NSAAs). We present Vega, an IoT endnode system on chip (SoC) capable of scaling from a 1.7-μW fully retentive cognitive sleep mode up to 32.2-GOPS (at 49.4 mW) peak performance on NSAAs, including mobile deep neural network (DNN) inference, exploiting 1.6 MB of state-retentive SRAM, and 4 MB of non-volatile magnetoresistive random access memory (MRAM). To meet the performance and flexibility requirements of NSAAs, the SoC features ten RISC-V cores: one core for SoC and IO management and a nine-core cluster supporting multi-precision single instruction multiple data (SIMD) integer and floating-point (FP) computation. Vega achieves the state-of-the-art (SoA)-leading efficiency of 615 GOPS/W on 8-bit INT computation (boosted to 1.3 TOPS/W for 8-bit DNN inference with hardware acceleration). On FP computation, it achieves the SoA-leading efficiency of 79 and 129 GFLOPS/W on 32- and 16-bit FP, respectively. Two programmable machine learning (ML) accelerators boost energy efficiency in cognitive sleep and active states.


Vega: A 10-Core SoC for IoT End-Nodes with DNN Acceleration and Cognitive Wake-Up From MRAM-Based State-Retentive Sleep Mode

October 2021 · 49 Reads

The Internet-of-Things requires end-nodes with ultra-low-power always-on capability for a long battery lifetime, as well as high performance, energy efficiency, and extreme flexibility to deal with complex and fast-evolving near-sensor analytics algorithms (NSAAs). We present Vega, an IoT end-node SoC capable of scaling from a 1.7 μW fully retentive cognitive sleep mode up to 32.2 GOPS (@ 49.4 mW) peak performance on NSAAs, including mobile DNN inference, exploiting 1.6 MB of state-retentive SRAM, and 4 MB of non-volatile MRAM. To meet the performance and flexibility requirements of NSAAs, the SoC features 10 RISC-V cores: one core for SoC and IO management and a 9-core cluster supporting multi-precision SIMD integer and floating-point computation. Vega achieves SoA-leading efficiency of 615 GOPS/W on 8-bit INT computation (boosted to 1.3 TOPS/W for 8-bit DNN inference with hardware acceleration). On floating-point (FP) computation, it achieves SoA-leading efficiency of 79 and 129 GFLOPS/W on 32- and 16-bit FP, respectively. Two programmable machine-learning (ML) accelerators boost energy efficiency in cognitive sleep and active states, respectively.




A 64mW DNN-based Visual Navigation Engine for Autonomous Nano-Drones

May 2019 · 280 Reads · 182 Citations

IEEE Internet of Things Journal

Fully miniaturized robots (e.g., drones), with artificial intelligence (AI)-based visual navigation capabilities, are extremely challenging drivers of Internet-of-Things edge intelligence capabilities. Visual navigation based on AI approaches, such as deep neural networks (DNNs), is becoming pervasive for standard-size drones, but is considered out of reach for nano-drones with a size of a few cm². In this paper, we present the first (to the best of our knowledge) demonstration of a navigation engine for autonomous nano-drones capable of closed-loop end-to-end DNN-based visual navigation. To achieve this goal we developed a complete methodology for parallel execution of complex DNNs directly on board resource-constrained milliwatt-scale nodes. Our system is based on GAP8, a novel parallel ultralow-power computing platform, and a 27-g commercial, open-source Crazyflie 2.0 nano-quadrotor. As part of our general methodology, we discuss the software mapping techniques that enable the DroNet state-of-the-art deep convolutional neural network to be fully executed aboard within a strict 6 frame-per-second real-time constraint with no compromise in terms of flight results, while all processing is done with only 64 mW on average. Our navigation engine is flexible and can be used to span a wide performance range: at its peak performance corner, it achieves 18 frames/s while still consuming on average just 3.5% of the power envelope of the deployed nano-aircraft. To share our key findings with the embedded and robotics communities and foster further developments in autonomous nano-unmanned aerial vehicles (UAVs), we publicly release all our code, datasets, and trained networks.
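As a back-of-envelope check of the numbers quoted in the abstract (my arithmetic, not a figure from the paper): at 64 mW average power and the 6 frame/s real-time constraint, the visual pipeline costs roughly 10.7 mJ of compute energy per processed frame.

    # Back-of-envelope energy per frame from the quoted figures (not from the paper).
    avg_power_w = 64e-3          # 64 mW average compute power
    frame_rate_hz = 6            # real-time constraint of 6 frames per second
    energy_per_frame_mj = avg_power_w / frame_rate_hz * 1e3
    print(f"{energy_per_frame_mj:.1f} mJ per frame")   # ~10.7 mJ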


Citations (16)


... The CV32E40P architecture is a simple in-order four-stage pipeline based on the Harvard template, extending the baseline RV32IMC ISA with Xpulp extensions for efficient digital signal processing. In the PULP cluster, the cores' instruction interface connects them to a hierarchical instruction cache [20] made of private (512 B per core) and shared (4 kiB) banks for improved application performance on parallel workloads. On the other hand, the cores connect to the rest of the system through a data interface. ...

Reference:

PULP Fiction No More—Dependable PULP Systems for Space
Scalable Hierarchical Instruction Cache for Ultralow-Power Processors Clusters
  • Citing Article
  • April 2023

IEEE Transactions on Very Large Scale Integration (VLSI) Systems

... As an example, the work by Chen et al. [8] trained a 2-layer LSTM and proposed to use a similarity score to compare the last embedding state with a reference vector of the target keyword. RNN methods are however expensive at inference time because of the memory-bound workload, which is up to 28x less efficient than running convolutional models on our target HW [10]. A design of this class has been prototyped instead on a Hi3516EV200 development board with an ARM Cortex-A7 processor running at 900 MHz and 32 MB of memory, under a power envelope of 100 mW, which is much higher than the capacity of MCU devices [11]. ...

Accelerating RNN-Based Speech Enhancement on a Multi-core MCU with Mixed FP16-INT8 Post-training Quantization
  • Citing Chapter
  • January 2023

Communications in Computer and Information Science

... Kraken programmable RISC-V cluster achieves comparable efficiency with respect to SoA PULP systems [17] while boosting performance by a factor of 2.7× on 8-bit workloads. Moreover, Kraken improves the PULP architectures reported in [17] by supporting low-precision SIMD operations. ...

Vega: A Ten-Core SoC for IoT Endnodes with DNN Acceleration and Cognitive Wake-Up from MRAM-Based State-Retentive Sleep Mode
  • Citing Article
  • January 2022

IEEE Journal of Solid-State Circuits

... As a future line of work, we propose to design and implement a deep-learning algorithm on the microcontroller for real-time pattern recognition on the spectrograms, enabling keyword detection in continuous audio signals. We will also consider evaluating the impact on the algorithm's accuracy of quantization techniques such as those proposed by Fariselli et al. in [24] for computing the MFCCs, or of Wavelet Transforms for denoising the speech signals. ...

Integer-Only Approximated MFCC for Ultra-Low Power Audio NN Processing on Multi-Core MCUs
  • Citing Conference Paper
  • June 2021

... Video compression for IoT: We can categorize IoT-based video encoders into two parts: hardware-based and software-based. Hardware approaches primarily focus on designing more power-efficient camera sensors [39][40][41] and more efficient MCU circuits and processors [42][43][44]. Due to its simplicity, scalability, low latency, and very low energy consumption, the most common software-based video encoder on IoT devices is M-JPEG [24]. Nevertheless, there have been a few works exploring alternative software-based models: Veluri et al. [16] employ M-JPEG on the encoder to capture black-and-white and colorized frames at two different resolutions and use super-resolution methods to interpolate and colorize frames on the decoder. ...

4.4 A 1.3TOPS/W @ 32GOPS Fully Integrated 10-Core SoC for IoT End-Nodes with 1.7μW Cognitive Wake-Up From MRAM-Based State-Retentive Sleep Mode
  • Citing Conference Paper
  • February 2021

... By using autonomous systems, like small drones that have advanced visual navigation, project managers can make workflows easier and reduce risks. Recent examples show that these tiny robots use AI methods to carry out complex tasks on their own, which lowers the need for human involvement and helps keep people safe in dangerous situations (Benini et al., 2018). Moreover, using a Service Oriented Architecture (SOA) in robotic systems makes it easier to reuse software parts, allowing better teamwork between different systems and supporting flexible project execution (Kloss et al., 2007). ...

A 64mW DNN-based Visual Navigation Engine for Autonomous Nano-Drones
  • Citing Article
  • May 2019

IEEE Internet of Things Journal

... This is especially true for the new generation of MCU-class devices focusing on AI applications. In contrast to their predecessors, these MCUs feature multi-core compute clusters, DNN accelerators, and on-chip memory of up to 10 MiB, split into multiple software-managed memory hierarchy levels [13], [14], [27]. ...

GAP-8: A RISC-V SoC for AI at the Edge of the IoT
  • Citing Conference Paper
  • July 2018

... On the other side, with the advent of deep learning techniques, the size of machine learning algorithms grows exponentially, thanks to improvements in processor speeds and the availability of large training data. However, embedded systems cannot sustain the resource requirements of standard deep learning techniques designed for GP-GPUs [4,11,25]. How, then, can the need to bring intelligence to the edge be reconciled with the complexity of deep learning? In this context, the junction point is TinyML [30,32], a cutting-edge field that brings the transformative power of machine learning (ML) to the performance- and power-constrained domain of tiny devices and embedded systems. ...

Always-ON visual node with a hardware-software event-based binarized neural network inference engine
  • Citing Conference Paper
  • May 2018

... Similarly, redundancy and spare mechanisms for parallel functional units were proposed for GPUs [41]. In addition, several works proposed and analyzed the impact of trans-/mixed-precision solutions to increase the reliability of ML applications with promising results, during the training and inference stages [42]. Other mitigation analyses and strategies resort to approximate computing [43], quantization strategies on ML workloads [44], and the use of emerging number formats and hardware [33], [45], [46]. ...

The transprecision computing paradigm: Concept, design, and applications
  • Citing Conference Paper
  • March 2018

... By adding programmability to devices, their flexibility can be improved, with the drawback of increased area and energy consumption caused by the instruction stream from memories to compute elements. As this stream does not directly contribute to the processing of data and can consume up to 70% of the total system energy budget [2][3][4], minimizing it is appealing. As an example, even a highly optimized binary-precision deep neural network accelerator [5] spends up to 11% of its energy budget on the instruction stream. ...

Slow and steady wins the race? A comparison of ultra-low-power RISC-V cores for Internet-of-Things applications