Project

WiPLASH: Architecting More Than Moore – Wireless Plasticity for Heterogeneous Massive Computer Architectures

Goal: The main design principles in computer architecture have shifted from a monolithic, scaling-driven approach towards heterogeneous architectures that tightly co-integrate multiple specialized computing and memory units. This shift is motivated by the urgent need for very high parallelism and by energy constraints. Such heterogeneous hardware specialization requires interconnection mechanisms that integrate the architecture. State-of-the-art approaches are 3D stacking and 2D architectures complemented with a Network-on-Chip (NoC) to interconnect the components. However, such interconnects are fundamentally monolithic and rigid, and are unable to provide the efficiency and architectural flexibility required by current and future key ICT applications. The main challenge is to introduce diversification and specialization in heterogeneous processor architectures while ensuring their generality and scalability.

In order to achieve this, the WiPLASH project aims to pioneer an on-chip wireless communication plane able to provide architectural plasticity, reconfigurability and adaptation to the application requirements with near-ASIC efficiency but without any loss of generality. For this, the WiPLASH consortium will provide solid experimental foundations of the key enablers of on-chip wireless communication at the functional unit level as well as their technological and architectural integration. The main goals are: (i) prototype a miniaturized and tunable graphene antenna in the terahertz band, (ii) co-integrate graphene RF components with submillimeter-wave transceivers and (iii) demonstrate low-power reconfigurable wireless chip-scale networks. The culminating goal is to demonstrate that the wireless plane offers the plasticity required by future computing platforms by improving at least one key application (mainly biologically-plausible deep learning architectures) by 10X in terms of execution speed and energy-delay product over a state-of-the-art baseline.

Project log

Sergi Abadal
added 3 research items
Hyperdimensional computing (HDC) is an emerging computing paradigm that represents, manipulates, and communicates data using very long random vectors (aka hypervectors). Among the different hardware platforms capable of executing HDC algorithms, in-memory computing (IMC) systems have recently been shown to be one of the most energy-efficient options, because manipulating hypervectors in the memory itself reduces data movement. Although HDC has been implemented on single IMC cores, its parallelization is still unresolved due to the communication challenges that these novel architectures impose and that traditional Networks-on-Chip and Networks-in-Package were not designed for. To cope with this difficulty, we propose the use of wireless on-chip communication technology in unique ways. We are particularly interested in physically distributing a large number of IMC cores performing similarity search across a chip, and in maintaining the classification accuracy when each core is queried with a slightly different version of a bundled hypervector. To achieve this, we introduce a novel over-the-air computing scheme that consists of defining different binary decision regions in the receivers so as to compute the logical majority operation (i.e., bundling, or superposition) required in HDC. It introduces a moderate overhead of a single antenna and receiver per IMC core. By doing so, we achieve joint broadcast distribution and computation with a performance and efficiency unattainable with wired interconnects, which in turn enables massive parallelization of the architecture. We demonstrate that the proposed approach makes it possible both to bundle at least three hypervectors and to scale similarity search to 64 IMC cores seamlessly, while incurring an average bit error ratio of 0.01 without any impact on the accuracy of a generic HDC-based classifier working with 512-bit vectors.
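The bundling and similarity search described in the abstract above can be illustrated in software. The following is a minimal sketch only, not the authors' over-the-air implementation: it computes the bitwise majority in plain Python, models the wireless channel as independent bit flips at the reported bit error ratio, and checks whether each of 64 simulated cores still finds a bundled item closer than an unrelated hypervector. The function names, the random seed, and the independent-error channel model are all assumptions made for illustration.

```python
import random

DIM = 512        # hypervector length mentioned in the abstract
NUM_CORES = 64   # number of IMC cores mentioned in the abstract
BER = 0.01       # average bit error ratio mentioned in the abstract


def random_hypervector(rng, dim=DIM):
    """Draw a random binary hypervector (hypothetical helper)."""
    return [rng.randint(0, 1) for _ in range(dim)]


def bundle(hypervectors):
    """Bitwise majority vote over an odd number of binary hypervectors."""
    n = len(hypervectors)
    if n % 2 == 0:
        raise ValueError("use an odd number of hypervectors to avoid ties")
    return [1 if sum(bits) * 2 > n else 0 for bits in zip(*hypervectors)]


def flip_bits(vector, ber, rng):
    """Assumed channel model: flip each bit independently with probability ber."""
    return [bit ^ (rng.random() < ber) for bit in vector]


def hamming(a, b):
    """Number of positions in which two equal-length vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def simulate(seed=0):
    """Return (distance to a bundled item, distance to an unrelated vector) per core."""
    rng = random.Random(seed)
    items = [random_hypervector(rng) for _ in range(3)]
    unrelated = random_hypervector(rng)
    bundled = bundle(items)
    results = []
    for _ in range(NUM_CORES):
        # Each core receives its own slightly corrupted copy of the bundle.
        received = flip_bits(bundled, BER, rng)
        results.append((hamming(received, items[0]), hamming(received, unrelated)))
    return results


if __name__ == "__main__":
    results = simulate()
    ok = sum(d_item < d_other for d_item, d_other in results)
    print(f"{ok}/{NUM_CORES} cores rank the bundled item closer than an unrelated vector")
```

In this toy model a bundled item is expected to agree with the majority vector on roughly three quarters of its bits, while an unrelated vector agrees on about half, so a small fraction of flipped bits leaves a wide margin between the two distances.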
Analog In-Memory Computing (AIMC) is emerging as a disruptive paradigm for heterogeneous computing, potentially delivering orders of magnitude better peak performance and efficiency over traditional digital signal processing architectures on Matrix-Vector multiplication. However, to sustain this throughput in real-world applications, AIMC tiles must be supplied with data at very high bandwidth and low latency; this poses an unprecedented pressure on the on-chip communication infrastructure, which becomes the system's performance and efficiency bottleneck. In this context, the performance and plasticity of emerging on-chip wireless communication paradigms provide the required breakthrough to up-scale on-chip communication in large AIMC devices. This work presents a many-tile AIMC architecture with inter-tile wireless communication that integrates multiple heterogeneous computing clusters, embedding a mix of parallel RISC-V cores and AIMC tiles. We perform an extensive design space exploration of the proposed architecture and discuss the benefits of exploiting emerging on-chip communication technologies such as wireless transceivers in the millimeter-wave and terahertz bands.
This paper introduces the concept of smart radio environments, currently intensely studied for wireless communication in metasurface‐programmable meter‐scaled environments (e.g., inside rooms), on the chip scale. Wireless networks‐on‐chips (WNoCs) are a candidate technology to improve inter‐core communication on chips but current proposals are plagued by a dilemma: either the received signal is weak, or it is significantly reverberated such that the on–off‐keying modulation speed must be throttled. Here, this vexing problem is overcome by endowing the wireless on‐chip environment with in situ programmability which enables the shaping of the channel impulse response (CIR); thereby, a pulse‐like CIR shape can be imposed despite strong multipath propagation and without entailing a reduced received signal strength. First, a programmable metasurface suitable for integration in the on‐chip environment (“on‐chip reconfigurable intelligent surface”) is designed and characterized. Second, its configuration is optimized to equalize selected wireless on‐chip channels “over the air.” Third, by conducting a rigorous communication analysis, the feasibility of significantly higher modulation speeds with shaped CIRs is evidenced. The results introduce a programmability paradigm to WNoCs which boosts their competitiveness as complementary on‐chip interconnect solution. A programmable metasurface is included inside a chip package, and suitable metasurface configurations are identified that equalize wireless channels on the chip over‐the‐air to mitigate inter‐symbol interference. The largely improved data transfer rates boost the competitiveness of wireless networks‐on‐chips (WNoCs) as complementary interconnect technology. WNoCs aim to avert the risk of communication‐limited performance of multicore chips.
Mohamed Saeed Elsayed
added a research item
Diodes made of heterostructures of the 2D material graphene and conventional 3D materials are reviewed in this manuscript. Several applications in high frequency electronics and optoelectronics are highlighted. In particular, advantages of metal–insulator–graphene (MIG) diodes over conventional metal–insulator–metal diodes are discussed with respect to relevant figures‐of‐merit. The MIG concept is extended to 1D diodes. Several experimentally implemented radio frequency circuit applications with MIG diodes as active elements are presented. Furthermore, graphene‐silicon Schottky diodes as well as MIG diodes are reviewed in terms of their potential for photodetection. Here, graphene‐based diodes have the potential to outperform conventional photodetectors in several key figures‐of‐merit, such as overall responsivity or dark current levels. Obviously, advantages in some areas may come at the cost of disadvantages in others, so that 2D/3D diodes need to be tailored in application‐specific ways. Diodes made of heterostructures of the 2D material graphene and conventional 3D materials are reviewed in this article. In particular, metal–insulator–graphene diodes and graphene‐silicon Schottky diodes are discussed with relevant figures‐of‐merit. Several applications in high frequency electronics and optoelectronics are highlighted, such as power detectors, mixers, frequency doublers, receivers, and photodetectors.
Mohamed Saeed Elsayed
added a research item
This work presents the design, implementation, and characterization of the first thin-film integrated tunable microwave harmonic generator. The design is realized by exploiting the nonlinearity of four chemical vapor deposition (CVD) graphene-based diodes arranged in a nonlinear transmission-line (NLTL) approach. The used thin-film monolithic microwave integrated circuit (MMIC) technology is substrate independent. The fabricated prototype is realized on a 500-μm transparent quartz substrate and occupies less than 1.2 mm² of chip area including pads. Measurement results show a wide input frequency range from 0 to 2.8 GHz with measured S11 better than −10 dB. The measured second-harmonic conversion gain (CG) for an output frequency of 3.4 GHz is −21.6 dB. The measured third-harmonic CG for an output frequency of 3.15 GHz is −31 dB. To the best knowledge of the authors, the proposed circuit is the first tunable microwave harmonic generator combining graphene diodes in a NLTL topology on thin-film technology.
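To put the conversion-gain figures in the abstract above into perspective, the short sketch below converts them from decibels to linear power ratios and derives the input frequency implied by each harmonic order. The input frequencies are not stated in the abstract; they are inferred here under the assumption of ideal frequency multiplication (output frequency equals harmonic order times input frequency). The function names are hypothetical.

```python
def db_to_power_ratio(gain_db):
    """Convert a power gain in dB to a linear output/input power ratio."""
    return 10 ** (gain_db / 10)


def implied_input_ghz(output_ghz, harmonic_order):
    """Input frequency implied by an ideal n-th harmonic (assumption, not from the source)."""
    return output_ghz / harmonic_order


if __name__ == "__main__":
    # (harmonic order, output frequency in GHz, conversion gain in dB) from the abstract
    for order, f_out, cg_db in [(2, 3.4, -21.6), (3, 3.15, -31.0)]:
        f_in = implied_input_ghz(f_out, order)
        ratio = db_to_power_ratio(cg_db)
        print(f"harmonic {order}: implied input {f_in:.2f} GHz, "
              f"output power is {ratio:.2e} of input power")
```

Under that assumption, both implied input frequencies (1.7 GHz and 1.05 GHz) fall inside the 0 to 2.8 GHz input range reported in the abstract.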
Sergi Abadal
added a research item
Deep neural network (DNN) models continue to grow in size and complexity, demanding higher computational power to enable real-time inference. To efficiently deliver such computational demands, hardware accelerators are being developed and deployed across scales. This naturally requires an efficient scale-out mechanism for increasing compute density as required by the application. 2.5D integration over interposer has emerged as a promising solution, but as we show in this work, the limited interposer bandwidth and multiple hops in the Network-on-Package (NoP) can diminish the benefits of the approach. To cope with this challenge, we propose WIENNA, a wireless NoP-based 2.5D DNN accelerator. In WIENNA, the wireless NoP connects an array of DNN accelerator chiplets to the global buffer chiplet, providing high-bandwidth multicasting capabilities. Here, we also identify the dataflow style that most efficiently exploits the wireless NoP's high-bandwidth multicasting capability on each layer. With modest area and power overheads, WIENNA achieves 2.2X to 5.1X higher throughput and 38.2% lower energy than an interposer-based NoP design.
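The multi-hop penalty mentioned in the abstract above can be made concrete with a toy model. The sketch below is not the WIENNA evaluation methodology: it simply assumes a wired 2D mesh with the global buffer at one corner and shortest-path (Manhattan) routing, and compares the resulting hop counts against an idealized wireless broadcast that reaches every chiplet in a single hop. The 4x4 array size and all function names are illustrative assumptions.

```python
def mesh_hops(rows, cols, source=(0, 0)):
    """Manhattan hop count from the source to every node of a rows x cols mesh."""
    return {
        (r, c): abs(r - source[0]) + abs(c - source[1])
        for r in range(rows)
        for c in range(cols)
    }


def average_hops(rows, cols, source=(0, 0)):
    """Mean hop count from the source to all other nodes."""
    hops = mesh_hops(rows, cols, source)
    others = [h for node, h in hops.items() if node != source]
    return sum(others) / len(others)


def max_hops(rows, cols, source=(0, 0)):
    """Worst-case hop count from the source, which bounds wired multicast latency."""
    return max(mesh_hops(rows, cols, source).values())


if __name__ == "__main__":
    rows, cols = 4, 4  # hypothetical chiplet array size
    print(f"wired mesh: average {average_hops(rows, cols):.2f} hops, "
          f"worst case {max_hops(rows, cols)} hops")
    print("idealized wireless broadcast: 1 hop to every chiplet")
```

The gap between the two grows with the array size, which is one intuition for why single-hop multicast becomes more attractive as accelerators scale out.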
Sergi Abadal
added a research item
The main design principles in computer architecture have recently shifted from a monolithic scaling-driven approach to the development of heterogeneous architectures that tightly co-integrate multiple specialized processor and memory chiplets. In such data-hungry multi-chip architectures, current Networks-in-Package (NiPs) may not be enough to cater to their heterogeneous and fast-changing communication demands. This position paper makes the case for wireless in-package nanonetworking as the enabler of efficient and versatile wired-wireless interconnect fabrics for massive heterogeneous processors. To that end, the use of graphene-based antennas and transceivers with unique frequency-beam reconfigurability in the terahertz band is proposed. The feasibility of such a nanonetworking vision and the main research challenges towards its realization are analyzed from the technological, communications, and computer architecture perspectives.
Alexandre Levisse
added 2 research items
Deeply-scaled three-dimensional (3D) Multi-Processor Systems-on-Chip (MPSoCs) enable high performance and massive communication bandwidth for next-generation computing. However, as process nodes shrink, temperature-dependent leakage dramatically increases, and thermal and power management becomes problematic. In this context, Integrated Flow Cell Array (FCA) technology, which consists of inter-tier microfluidic channels, combines on-chip electrochemical power generation and liquid cooling of 3D MPSoCs. When connected to the power delivery networks (PDNs) of dies, FCAs provide additional current that compensates the voltage drop (IR-drop) across the PDNs. In this paper, we evaluate for the first time how the IR-drop reduction and liquid cooling capabilities of FCAs scale with advanced CMOS processes. We develop a framework to quantify the system-level impact of FCAs at different technology nodes, from 22nm to 3nm. Our results show that, across all considered nodes, FCAs reduce the peak temperature of a multi-core processor (MCP) and a Machine Learning (ML) accelerator by over 22°C and 35°C, respectively, compared to off-chip direct liquid cooling. Moreover, the low operation voltages and high temperatures at advanced nodes improve FCA power generation by up to 2×. Hence, FCAs allow us to keep the IR-drop below 5% for both the MCP and ML accelerator, saving over 10% TSV-reserved chip area, as opposed to using a High-Performance Computing (HPC) MPSoC liquid cooling solution.
Hybrid caches consisting of both SRAM and emerging Non-Volatile Random Access Memory (eNVRAM) bitcells increase cache capacity and reduce power consumption by taking advantage of eNVRAM's small area footprint and low leakage energy. However, they also inherit eNVRAM's drawbacks, including long write latency and limited endurance. To mitigate these drawbacks, many works propose heuristic strategies to allocate memory blocks into SRAM or eNVRAM arrays at runtime based on block content or access pattern. In contrast, this work presents a HW/SW Stack for Hybrid Caches (SHyCache), consisting of a hybrid cache architecture and supporting programming model, reminiscent of those that enable GP-GPU acceleration, in which application variables can be allocated explicitly to the eNVRAM cache, eliminating the need for heuristics and reducing cache access time, power consumption, and area overhead while maintaining maximal cache utilization efficiency and ease of programming. SHyCache improves performance for applications such as neural networks, which contain large numbers of invariant weight values with high read/write access ratios that can be explicitly allocated to the eNVRAM array. We simulate SHyCache on the gem5-X architectural simulator and demonstrate its utility by benchmarking a range of cache hierarchy variations using three neural networks, namely, Inception v4, ResNet-50, and SqueezeNet 1.0. We demonstrate a design space that can be exploited to optimize performance, power consumption, or endurance, depending on the expected use case of the architecture, while demonstrating maximum performance gains of 1.7/1.4/1.3x and power consumption reductions of 5.1/5.2/5.4x, for Inception/ResNet/SqueezeNet, respectively.
Sergi Abadal
added 5 project references
Alexandre Levisse
added a research item
Area- and power-constrained edge devices are increasingly utilized to perform compute-intensive workloads, necessitating increasingly area- and power-efficient accelerators. In this context, in-SRAM computing performs hundreds of parallel operations on spatially local data common in many emerging workloads, while reducing power consumption due to data movement. However, in-SRAM computing faces many challenges, including integration into the existing architecture, arithmetic operation support, data corruption at high operating frequencies, inability to run at low voltages, and low area density. To meet these challenges, this work introduces BLADE, a BitLine Accelerator for Devices on the Edge. BLADE is an in-SRAM computing architecture that utilizes local wordline groups to perform computations at a frequency 2.8× higher than state-of-the-art in-SRAM computing architectures. BLADE is integrated into the cache hierarchy of low-voltage edge devices, and simulated and benchmarked at the transistor, architecture, and software abstraction levels. Experimental results demonstrate performance/energy gains over an equivalent NEON accelerated processor for a variety of edge device workloads, namely, cryptography (4× performance gain/6× energy reduction), video encoding (6×/2×), and convolutional neural networks (3×/1.5×), while maintaining the highest frequency/energy ratio (up to 2.2 GHz at 1 V) of any conventional in-SRAM computing architecture, and a low area overhead of less than 8%.
Sergi Abadal
added a research item
The miniaturization of transceivers and antennas is enabling the development of Wireless Networks-on-Chip (WNoC), in which chip-scale communication is utilized to increase the computing performance of multi-core/multi-chip architectures. Although the potential benefits of the WNoC paradigm have been studied in depth, its practicality remains unclear due to the lack of a proper characterization of the wireless channel at the chip scale and across the spectrum, among others. In this paper, the state of the art in wave propagation and channel modeling for chip-scale communication is surveyed. First, the peculiarities of WNoC, including the design drivers, architecture, environment, and on-chip electromagnetics are reviewed. After a brief description of the different methods to characterize wave propagation at chip-scales, a comprehensive discussion covering the different works at millimeter-wave (mmWave), Terahertz (THz) and optical frequencies is provided. Finally, the major challenges in the characterization of the WNoC channel and potential solutions to address them are discussed, providing a roadmap for the foundations of practical WNoCs.
Sergi Abadal
added a research item
Deep Neural Networks have flourished at an unprecedented pace in recent years. They have achieved outstanding accuracy in fields such as computer vision, natural language processing, medicine or economics. Specifically, Convolutional Neural Networks (CNN) are particularly suited to object recognition or identification tasks. This, however, comes at a high computational cost, prompting the use of specialized GPU architectures or even ASICs to achieve high speeds and energy efficiency. ASIC accelerators streamline the execution of certain dataflows amenable to CNN computation that imply the constant movement of large amounts of data, thereby turning on-chip communication into a critical function within the accelerator. This paper studies the communication flows within CNN inference accelerators of edge devices, with the aim to justify current and future decisions in the design of the on-chip networks that interconnect their processing elements. Leveraging this analysis, we then qualitatively discuss the potential impact of introducing the novel paradigm of wireless on-chip network in this context.
Sergi Abadal
added a project goal