Conference Paper

MATRIX: a reconfigurable computing architecture with configurable instruction distribution and deployable resources


Abstract

MATRIX is a novel, coarse-grain, reconfigurable computing architecture which supports configurable instruction distribution. Device resources are allocated to controlling and describing the computation on a per-task basis. Application-specific regularity allows us to compress the resources allocated to instruction control and distribution, in many situations yielding more resources for datapaths and computations. The adaptability is made possible by a multi-level configuration scheme, a unified configurable network supporting both datapaths and instruction distribution, and a coarse-grained building block which can serve as an instruction store, a memory element, or a computational element. In a 0.5 μm CMOS process, the 8-bit functional unit at the heart of the MATRIX architecture has a footprint of roughly 1.5 mm×1.2 mm, making single dies with over a hundred function units practical today. At this process point, 100 MHz operation is easily achievable, allowing MATRIX components to deliver on the order of 10 Gop/s (8-bit ops).
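The abstract's throughput and die-size claims follow directly from the stated numbers; a quick sanity check (illustrative arithmetic only, not from the paper):

```python
# Sanity check of the abstract's figures (numbers taken from the text above).
fu_area_mm2 = 1.5 * 1.2          # footprint of one 8-bit functional unit
num_fus = 100                    # "over a hundred function units" per die
clock_hz = 100e6                 # 100 MHz operation

die_area_mm2 = fu_area_mm2 * num_fus   # array area, excluding I/O and padding
peak_ops = num_fus * clock_hz          # one 8-bit op per FU per cycle

print(f"array area ~ {die_area_mm2:.0f} mm^2, peak ~ {peak_ops / 1e9:.0f} Gop/s")
# → array area ~ 180 mm^2, peak ~ 10 Gop/s
```

An array of 100 such units thus occupies roughly 180 mm² and sustains the quoted 10 Gop/s when every unit completes one 8-bit operation per cycle.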


... After a two-decade gap, from around the 1980s, research on modern reconfigurable computing systems started to revive [77]. Several research groups (both from industry and academia) proposed various reconfigurable architectures [165], such as: MATRIX [116], Garp [27], MorphoSys [144], RaPiD [44], PipeRench [70], PACT XPP [18], REMARC [118], ADRES [114], etc. These designs were feasible only because of the advancement of silicon technology, which led to the implementation of complex designs on a single chip [165]. ...
... Hardware Some reconfigurable processing units are composed of standard FPGAs [4], [41], [78], [131], [169], while others are composed of custom-designed configurable hardware [18], [27], [44], [70], [94], [95], [101], [114], [116], [118], [144]. ...
... Some reconfigurable systems are based on custom-reconfigurable silicon devices [18], [27], [44], [70], [94], [95], [101], [114], [116], [118], [144]. The reconfigurable processing units of these systems comprise an array of processing elements (PEs), called a reconfigurable cell (RC) array, and an interconnection network. ...
Thesis
Since the mid-2000s, the realm of portable and embedded computing has expanded to include a wide variety of applications. Data mining is one of the many applications that are becoming common on these devices. Many of today’s data mining applications are compute and/or data intensive, requiring more processing power than ever before, thus speed performance is a major issue. In addition, embedded devices have stringent area and power requirements. At the same time manufacturing cost and time-to-market are decreasing rapidly. To satisfy the constraints associated with these devices, and also to improve the speed performance, it is imperative to incorporate some special-purpose hardware into embedded system design. In some cases, reconfigurable hardware support is desirable to provide the flexibility required in the ever-changing application environment. Our main objective is to provide chip-level and reconfigurable hardware support for data mining applications in portable, handheld, and embedded devices. We focus on the most widely used data mining tasks, clustering and classification. Our investigation on the hardware design and implementation of similarity computation (an important step in clustering/classification) illustrates that chip-level hardware support for data mining operations is indeed a feasible and worthwhile endeavour. Further performance gain is achieved with hardware optimizations such as parallel processing. To address the issue of limited hardware footprint on portable and embedded devices, we investigate reconfigurable computing systems. We propose dynamic reconfigurable hardware solutions for similarity computation using a multiplexer-based approach, and for principal component analysis (another important step in clustering/classification) using a partial-reconfiguration method. Experimental results are encouraging and show great potential in implementing data mining applications on a reconfigurable platform.
Finally, we propose a design methodology for FPGA-based dynamic reconfigurable hardware, in order to select the most efficient FPGA-based reconfiguration method(s) for specific applications on portable and embedded devices. This design methodology can be generalized to other embedded applications and gives guidelines to the designer based on the computation model and characteristics of the application.
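As an illustration of the similarity-computation step this thesis accelerates, here is a minimal software sketch; the distance-based metric and the sample data are hypothetical choices for illustration, not the thesis's hardware design:

```python
import math

def euclidean_similarity(a, b):
    """Distance-based similarity between two feature vectors, one common
    choice for the clustering/classification step; illustrative only."""
    d = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return 1.0 / (1.0 + d)   # map distance to (0, 1]: identical vectors give 1.0

# Assign a sample to the nearest of two cluster centroids (hypothetical data).
sample = [2.0, 3.0]
centroids = [[0.0, 0.0], [2.0, 4.0]]
best = max(centroids, key=lambda c: euclidean_similarity(sample, c))
print(best)  # → [2.0, 4.0]
```

Each similarity evaluation is an independent sum of per-dimension terms, which is what makes the computation amenable to the parallel hardware implementations the thesis explores.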
... The MATRIX (multiple ALU architecture with reconfigurable interconnect experiment) architecture is built according to an application-specific methodology, aiming to be suitable for general purpose applications [14]. The architecture is composed of an array of identical 8-bit basic functional units and a configuration network. ...
... The architecture's authors claimed that no specific applications are targeted and that the architecture can be a general purpose one [14]. ...
... Ring style for Dnode interconnections [13]. Fig. 8: MATRIX processing element (functional unit) [14]. ...
Article
Full-text available
The development of mobile devices has challenged hardware designers to come up with suitable architectures. Challenges such as power consumption, flexibility, processing power and area are likely to lead to the need for a reconfigurable architecture to cater for the growing demands made of mobile devices, and to suit the needs of the next generation of devices. Parallelism and multifunction in real-time will be the minimum required characteristics of the architectures of such devices. This chapter reviews the currently available reconfigurable architectures. The focus here is on coarse-grain reconfigurable architectures, with particular attention to those which support dynamic reconfiguration with low-power consumption. The capacity for dynamic reconfiguration will be a key factor in defining the most suitable architecture for future generations of mobile devices. This paper describes existing reconfigurable platforms. Their principles of operation, architectures and structures are discussed highlighting their advantages and disadvantages. Various coarse-grain reconfigurable architectures are discussed along with their improvement with time. Finally, the key characteristics which are required for a reconfigurable architecture to be suitable for telecommunication systems are identified. A comparison is given for the various architectures discussed in terms of suitability for telecommunications applications.
... A common approach for developing reconfigurable platforms is to have a special configuration layer, similar to that in MATRIX as proposed by Mirsky et al. [108], which is a coarse-grained platform that enables applications to control resources through a multi-level configuration scheme with configurable instruction distribution. MATRIX building blocks ...
... Moreover, the evolved circuit was temperature dependent and did not perform well at temperatures different from the one used in the evolution process. Searching the literature from the past 30 years, many successful EHW implementations can be found, owing to the advances made in computing systems and FPGAs. Some of these implementations were novel reconfigurable architectures fabricated on custom hardware with traditional EAs [11,107,110,12,109,108]. Other implementations were EHWs with novel FPGA-based architectures with traditional EAs [160,161,162,152,163,164]. ...
Thesis
Full-text available
Evolvable hardware (EHW) is a powerful autonomous system for adapting and finding solutions within a changing environment. EHW consists of two main components: a reconfigurable hardware core and an evolutionary algorithm. The majority of prior research focuses on improving either the reconfigurable hardware or the evolutionary algorithm in place, but not both. Thus, current implementations suffer from being application oriented and having slow reconfiguration times, low efficiencies, and less routing flexibility. In this work, a novel evolvable hardware platform is proposed that combines a novel reconfigurable hardware core and a novel evolutionary algorithm. The proposed reconfigurable hardware core is a systolic array, which is called HexArray. HexArray was constructed using processing elements with a redesigned architecture, called HexCells, which provide routing flexibility and support for hybrid reconfiguration schemes. The improved evolutionary algorithm is a genome-aware genetic algorithm (GAGA) that accelerates evolution. Guided by a fitness function the GAGA utilizes context-aware genetic operators to evolve solutions. The operators are genome-aware constrained (GAC) selection, genome-aware mutation (GAM), and genome-aware crossover (GAX). The GAC selection operator improves parallelism and reduces the redundant evaluations. The GAM operator restricts the mutation to the part of the genome that affects the selected output. The GAX operator cascades, interleaves, or parallel-recombines genomes at the cell level to generate better genomes. These operators improve evolution while not limiting the algorithm from exploring all areas of a solution space. The system was implemented on a SoC that includes a programmable logic (i.e., field-programmable gate array) to realize the HexArray and a processing system to execute the GAGA. 
A computationally intensive application that evolves adaptive filters for image processing was chosen as a case study and used to conduct a set of experiments to prove the developed system's robustness. Through an iterative process using the genetic operators and a fitness function, the EHW system configures and adapts itself to evolve fitter solutions. In a relatively short time (e.g., seconds), HexArray is able to evolve autonomously to the desired filter. By exploiting the routing flexibility in the HexArray architecture, the EHW has a simple yet effective mechanism to detect and tolerate faulty cells, which improves system reliability. Finally, a mechanism that accelerates the evolution process by hiding the reconfiguration time in an “evolve-while-reconfigure” process is presented. In this process, the GAGA utilizes the array routing flexibility to bypass cells that are being configured and evaluates several genomes in parallel.
... Early CGRAs were classified based on their integration into the processor. Tightly coupled CGRAs integrate into a processor data path and are executed as a custom instruction, e.g., Chess [17], MATRIX [18], and DySer [19]. Loosely coupled CGRAs act more as an accelerator that executes alongside the processor, executing in tandem with the processor and communicating via on-chip interconnect. ...
Preprint
Full-text available
Scientific edge computing increasingly relies on hardware-accelerated neural networks to implement complex, near-sensor processing at extremely high throughputs and low latencies. Existing frameworks like HLS4ML are effective for smaller models, but struggle with larger, modern neural networks due to their requirement of spatially implementing the neural network layers and storing all weights in on-chip memory. CGRA4ML is an open-source, modular framework designed to bridge the gap between neural network model complexity and extreme performance requirements. CGRA4ML extends the capabilities of HLS4ML by allowing off-chip data storage and supporting a broader range of neural network architectures, including models like ResNet, PointNet, and transformers. Unlike HLS4ML, CGRA4ML generates SystemVerilog RTL, making it more suitable for targeting ASIC and FPGA design flows. We demonstrate the effectiveness of our framework by implementing and scaling larger models that were previously unattainable with HLS4ML, showcasing its adaptability and efficiency in handling complex computations. CGRA4ML also introduces an extensive verification framework, with a generated runtime firmware that enables its integration into different SoC platforms. CGRA4ML's minimal and modular infrastructure of Python API, SystemVerilog hardware, Tcl toolflows, and C runtime, facilitates easy integration and experimentation, allowing scientists to focus on innovation rather than the intricacies of hardware design and optimization.
... Compilation times for coarse-grained architectures, such as coarse-grained reconfigurable arrays (CGRAs), are significantly lower on account of the higher granularity used. However, CGRAs implemented as ASIC devices [29], [30], [31], [32], [33] have not achieved widespread adoption because functional units (FUs) are often too application specific to be efficient and useful for a wide enough range of applications, while a very general CGRA tends to entail significant area and performance overheads. ...
Article
Full-text available
Coarse-grained FPGA overlays built around the runtime-programmable DSP blocks in modern FPGAs can achieve high throughput and improved scalability compared to traditional overlays built without detailed consideration of FPGA architecture. These overlays can be targeted by higher-level compilers, achieving fast compilation, software-like programmability and run-time management, and high-level design abstraction. OpenCL allows programs running on a host computer to launch accelerator kernels which can be compiled at run-time for a specific architecture, thus enabling portability. However, prohibitive hardware compilation times in traditional design flows mean that the tools cannot effectively use just-in-time (JIT) compilation or runtime performance scaling on FPGAs. We present an architecture-optimised FPGA overlay that exploits the capabilities of DSP blocks to maximise throughput and an associated design methodology for runtime compilation of dataflow graphs expressed as OpenCL kernels onto the overlays. The methodology benefits from the high level of abstraction afforded by using the OpenCL programming model, while the mapping to the overlay significantly reduces compilation and load times. Key characteristics of this work include highly performant DSP-optimized functional units that scale to large overlays on modern devices and the ability to perform automatic resource-aware kernel replication up to the size of the overlay for performance scaling. We demonstrate place and route times orders of magnitude better than traditional HLS flows, even when running on an embedded processor in the Xilinx Zynq.
... The PipeRench [7] system is designed as an accelerator for data-streaming applications and contains several reconfigurable parallel structures. MATRIX [8] is an architecture that contains a number of 8-bit processing elements, called Basic Functional Units (BFUs), in a 2D mesh structure. It has a routing fabric that provides 8-bit bus connections at three levels. ...
... Soon, CGRAs with larger data widths and closer to current CGRA configurations started to appear. MATRIX [17] considered an 8-bit datapath and a mesh of ALUs. The ALUs are still fine-grained programmable, similar to an FPGA, but the interconnection topology has been improved, allowing connections to non-adjacent neighbouring FUs and networking operations, like data merge. ...
Article
Full-text available
Reconfigurable computing architectures allow the adaptation of the underlying datapath to the algorithm. The granularity of the datapath elements and data width determines the granularity of the architecture and its programming flexibility. Coarse-grained architectures have shown the right balance between programmability and performance. This paper provides an overview of coarse-grained reconfigurable architectures and describes Versat, a Coarse-Grained Reconfigurable Array (CGRA) with self-generated partial reconfiguration, presented as a case study for better understanding these architectures. Unlike most of the existing approaches, which mainly use pre-compiled configurations, a Versat program can generate and apply myriads of on-the-fly configurations. Partial reconfiguration plays a central role in this approach, as it speeds up the generation of incrementally different configurations. The reconfigurable array has a complete graph topology, which yields unprecedented programmability, including assembly programming. Besides being useful for optimising programs, assembly programming is invaluable for working around post-silicon hardware, software, or compiler issues. Results on core area, frequency, power, and performance running different codes are presented and compared to other implementations.
... Another early but influential CGRA was the MATRIX [37] architecture, which (similar to REMARC) revolved around ALUs as the main reconfigurable compute resource, but was slightly more fine-grained than REMARC due to choosing an 8-bit (versus REMARC's 16-bit) data-path. Despite their name, the functionality of the ALUs was actually more similar to that of an FPGA, where a NOR-plane could be programmed to the desired functionality (similar to a programmable logic array, PLA), but also included native support for pattern matching. ...
Article
Full-text available
With the end of both Dennard’s scaling and Moore’s law, computer users and researchers are aggressively exploring alternative forms of computing in order to continue the performance scaling that we have come to enjoy. Among the more salient and practical of the post-Moore alternatives are reconfigurable systems, with Coarse-Grained Reconfigurable Architectures (CGRAs) seemingly capable of striking a balance between performance and programmability. In this paper, we survey the landscape of CGRAs. We summarize nearly three decades of literature on the subject, with a particular focus on the premise behind the different CGRAs and how they have evolved. Next, we compile metrics of available CGRAs and analyze their performance properties in order to understand and discover knowledge gaps and opportunities for future CGRA research specialized towards High-Performance Computing (HPC). We find that there are ample opportunities for future research on CGRAs, in particular with respect to size, functionality, support for parallel programming models, and to evaluate more complex applications.
... Published work in coarse-grained reconfigurable architectures and FPGA overlays such as [14], [15], [8] are essentially dataflow machines, usually consisting of small arithmetic and logic units and registers, all of which are immersed in a switch-based interconnect structure. The processors are homogeneous and programmable, and not tailored for specific applications. ...
Article
Full-text available
Overlay architectures implemented on FPGA devices have been proposed as a means to increase FPGA adoption in general-purpose computing. They provide the benefits of software such as flexibility and programmability, thus making it easier to build dedicated compilers. However, existing overlays are generic, resource and power hungry with performance usually an order of magnitude lower than bare metal implementations. As a result, FPGA overlays have been confined to research and some niche applications. In this paper, we introduce Application-Specific FPGA Overlays (AS-Overlays), which can provide bare-metal performance to FPGA overlays, thus opening doors for broader adoption. Our approach is based on the automatic extraction of hardware kernels from data flow applications. Extracted kernels are then leveraged for application-specific generation of hardware accelerators. Reconfiguration of the overlay is done with RapidWright which allows to bypass the HDL design flow. Through prototyping, we demonstrated the viability and relevance of our approach. Experiments show a productivity improvement up to 20× compared to the state of the art FPGA overlays, while achieving over 1.33× higher Fmax than direct FPGA implementation and the possibility of lower resource and power consumption compared to bare metal.
... gendat is a variable that can be created by the user, as in the data matrix created above. Command: apply(gendat, 1, mean) [6] ...
Article
Full-text available
Background: R is a renowned programming language and open-source software developed by the scientific community to compute, analyze and visualize big data of any field, including biomedical research for bioinformatics applications. Methods: Here, we outlined R allied packages and affiliated bioinformatics infrastructures, e.g. Bioconductor and CRAN. Moreover, basic concepts of factor, vector and data matrix were discussed, and whole-transcriptome RNA-Seq data was analyzed. Particularly, a differential expression workflow on simulated prostate cancer RNA-Seq data was performed through experimental design, data normalization, hypothesis testing and downstream investigations using the EdgeR package. A few genes with ectopic expression were retrieved, and the know-how for gene enrichment pathway analysis is highlighted using available online tools. Results: A data matrix of (4×3) was constructed, and the complex data matrix of Golub et al. was analyzed through χ² statistics by generating a frequency table of 15 true positives, 4 false positives, 15 true negatives and 4 false negatives on gene expression cut-off values, and a test statistic of 10.52 with 1 df and p = 0.001 was obtained, which rejected the null hypothesis and supported the alternative hypothesis of “predicted state of a person by gene expression cut-off values is dependent on the disease state of patient” in our data. Similarly, sequence data of the human Zyxin gene was selected and a null hypothesis of equal frequencies was rejected. Conclusion: Machine-learning approaches using the R statistical package are a supportive tool which can provide systematic prediction of putative causes, present state, future consequences and possible remedies of any problem of modern biology. Keywords: NGS data; R language; Zyxin gene
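The reported test statistic can be reconstructed from the stated counts if a Yates continuity correction is assumed (our assumption; the abstract does not name the correction). A short Python check:

```python
# Reproduce the chi-squared statistic from the stated 2x2 frequency table
# (rows: predicted positive/negative; columns: diseased/healthy).
# Assumption: a Yates continuity correction was applied.
observed = [[15, 4],   # 15 true positives, 4 false positives
            [4, 15]]   # 4 false negatives, 15 true negatives

total = sum(map(sum, observed))             # 38
rows = [sum(r) for r in observed]           # [19, 19]
cols = [sum(c) for c in zip(*observed)]     # [19, 19]

# Yates-corrected chi-squared: sum over cells of (|O - E| - 0.5)^2 / E,
# where E = row_total * col_total / grand_total (here E = 9.5 in every cell).
chi2 = sum((abs(observed[i][j] - rows[i] * cols[j] / total) - 0.5) ** 2
           / (rows[i] * cols[j] / total)
           for i in range(2) for j in range(2))

print(round(chi2, 2))  # → 10.53, matching the abstract's 10.52 up to rounding
```

With 1 degree of freedom, a statistic above 10.8 corresponds to p < 0.001, consistent with the reported p = 0.001 and the rejection of the null hypothesis.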
... A processing element in a datapath-oriented architecture executes only one type of operation once it is configured, and the required data flow is constructed by routing the mesh-structured processing elements. To implement LLP on a datapath-oriented architecture, the body of the loop is replicated on the mesh and multiple iterations are executed concurrently in a pipelined manner [4]. Moreover, this does not lead to high resource utilization when I/Os from/to processing elements are limited. ...
... Research prototypes with fine-grain granularity include Splash [3], DECPeRLe-1 [4], DPGA [5] and Garp [6]. Array processors with coarse-grain granularity, such as rDPA [10], MATRIX [11], and REMARC [12] form another class of reconfigurable systems. Other systems with coarse-grain granularity include MorphoSys [25][26][27], RaPiD [7], and RAW [9]. ...
Preprint
Full-text available
The rapid progress and advancement in electronic chips technology provide a variety of new implementation options for system engineers. The choice varies between the flexible programs running on a general-purpose processor (GPP) and the fixed hardware implementation using an application-specific integrated circuit (ASIC). Many other implementation options exist, for instance, a system with a RISC processor and a DSP core. Other options include graphics processors and microcontrollers. Specialist processors certainly improve performance over general-purpose ones, but this comes at the expense of flexibility. Combining the flexibility of GPPs and the high performance of ASICs leads to the introduction of reconfigurable computing (RC) as a new implementation option with a balance between versatility and speed. The focus of this chapter is on introducing reconfigurable computers as modern supercomputing architectures. The chapter also investigates the main reasons behind the current advancement in the development of RC-systems. Furthermore, a technical survey of various RC-systems is included, laying common grounds for comparisons. In addition, this chapter mainly presents case studies implemented under the MorphoSys RC-system. The selected case studies belong to different areas of application, such as computer graphics and information coding. Parallel versions of the studied algorithms are developed to match the topologies supported by the MorphoSys. Performance evaluation and results analyses are included for implementations with different characteristics.
... Early papers make the case for combining processors and FPGAs [19], [20] and show how to integrate reconfigurable functional units [21]. Architectures explored the use of coarser-grained building blocks that natively support wordwide computations and sequencing of operations [22]- [24]. Other works explore the benefits of packet-switched networks in the FPGA [25], [26]. ...
Article
The TCFPGA Hall of Fame for FPGAs (field-programmable gate arrays) and Reconfigurable Computing recognizes the most significant peer-reviewed publications in the field, highlights key contributions, and represents the body of knowledge that has accumulated over the past 30 years. The ACM SIGDA Technical Committee on FPGAs and Reconfigurable Computing is a technical committee of the Design Automation Special Interest Group, which was formed to promote the FPGA and reconfigurable computing community.
... Research prototypes with fine-grain granularity include Splash [3], DECPeRLe-1 [4], DPGA [5] and Garp [6]. Array processors with coarse-grain granularity, such as rDPA [10], MATRIX [11], and REMARC [12] form another class of reconfigurable systems. Other systems with coarse-grain granularity include MorphoSys [25][26][27], RaPiD [7], and RAW [9]. ...
... The multiple ALU architecture with reconfigurable interconnect experiment system (MATRIX) is another modern architecture that benefits from the PLA architecture (5). The MATRIX architecture is unique because it aims to unify resources for instruction storage and computation. ...
Chapter
Full-text available
Programmable logic arrays (PLAs) are traditional digital electronic devices. A PLA is a simple programmable logic device (SPLD) used to implement combinational logic circuits. A PLA has a set of programmable AND gates, which link to a set of programmable OR gates to produce an output. The AND–OR layout of a PLA allows for implementing logic functions that are in a sum-of-products form. PLAs are available in the market in different types. PLAs can be stand-alone chips, or parts of bigger processing systems. Stand-alone PLAs are available as mask-programmable (MPLA) and field-programmable (FPLA) devices. The attractions of PLAs that brought them to mainstream engineers include their simplicity, relatively small circuit area, predictable propagation delay, and ease of development. The powerful-but-simple property brought PLAs to rapid prototyping, synthesis, design optimization techniques, embedded systems, traditional computer systems, hybrid high-performance computing systems, etc. Indeed, there has been renewed interest in working with the simple AND-to-OR PLAs.
Article
Stream processing, which involves real-time computation of data as it is created or received, is vital for various applications, specifically wireless communication. The evolving protocols, the requirement for high throughput, and the challenges of handling diverse processing patterns make it demanding. Traditional platforms grapple with meeting real-time throughput and latency requirements due to large data volume, sequential and indeterministic data arrival, and variable data rates, leading to inefficiencies in memory access and parallel processing. We present Canalis, a throughput-optimized framework designed to address these challenges, ensuring high performance while achieving low energy consumption. Canalis is a hardware-software co-designed system. It includes a programmable spatial architecture, FluxSPU (Flux Stream Processing Unit), proposed by this work to enhance data throughput and energy efficiency. FluxSPU is accompanied by a software stack that eases the programming process. We evaluated Canalis with eight distinct benchmarks. When compared to the CPU and GPU in a mobile SoC to demonstrate the effectiveness of domain specialization, Canalis achieves an average speedup of 13.4× and 6.6×, and energy savings of 189.8× and 283.9×, respectively. In contrast to equivalent ASICs of the benchmarks, the average energy overhead of Canalis is within 2.4×, successfully maintaining generalization without incurring significant overhead.
Article
We present a domain adaptive processor, a programmable systolic-array processor designed for wireless communication and linear algebra workloads. The processor uses a globally homogeneous but locally heterogeneous architecture, uses decode-less reconfiguration instructions for data streaming, enables single-cycle data communication between functional units (FUs), and features lightweight nested-loop control for periodic execution. Our design demonstrates how configuration flexibility and rapid program loading enable a wide range of communication workloads to be mapped and swapped in less than a microsecond, supporting continually evolving communication standards such as 5G. A prototype chip with 256 cores was fabricated in a 12-nm FinFET process and has been verified. The measurement results show the processor achieves 507 GMACs/J and a peak performance of 264 GMACs.
Article
Image processing and machine learning applications benefit tremendously from hardware acceleration. Existing compilers target either FPGAs, which sacrifice power and performance for programmability, or ASICs, which become obsolete as applications change. Programmable domain-specific accelerators, such as coarse-grained reconfigurable arrays (CGRAs), have emerged as a promising middle-ground, but they have traditionally been difficult compiler targets since they use a different memory abstraction. In contrast to CPUs and GPUs, the memory hierarchies of domain-specific accelerators use push memories : memories that send input data streams to computation kernels or to higher or lower levels in the memory hierarchy, and store the resulting output data streams. To address the compilation challenge caused by push memories, we propose that the representation of these memories in the compiler be altered to directly represent them by combining storage with address generation and control logic in a single structure—a unified buffer. The unified buffer abstraction enables the compiler to separate generic push memory optimizations from the mapping to specific memory implementations in the backend. This separation allows our compiler to map high-level Halide applications to different CGRA memory designs, including some with a ready-valid interface. The separation also opens the opportunity for optimizing push memory elements on reconfigurable arrays. Our optimized memory implementation, the Physical Unified Buffer (PUB), uses a wide-fetch, single-port SRAM macro with built-in address generation logic to implement a buffer with two read and two write ports. It is 18% smaller and consumes 31% less energy than a physical buffer implementation using a dual-port memory that only supports two ports. Finally, our system evaluation shows that enabling a compiler to support CGRAs leads to performance and energy benefits. 
Over a wide range of image processing and machine learning applications, our CGRA achieves 4.7 × better runtime and 3.5 × better energy-efficiency compared to an FPGA.
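The push-memory idea summarized above — storage combined with address generation and control into a single "unified buffer" — can be sketched as a toy model. The `UnifiedBuffer` class and its method names are assumptions for illustration, not the paper's actual abstraction or API.

```python
# Sketch of a "push memory" / unified buffer: storage plus an address-
# generation pattern that pushes an output stream to the next kernel.

class UnifiedBuffer:
    def __init__(self, capacity, read_order):
        self.storage = [None] * capacity
        self.read_order = read_order  # address-generation pattern
        self.written = 0

    def push(self, value):
        """Input stream writes sequentially, as data arrives."""
        self.storage[self.written] = value
        self.written += 1

    def stream_out(self):
        """Output stream follows the programmed address pattern."""
        for addr in self.read_order:
            yield self.storage[addr]

# Buffer a 2x3 tile written row-major, then stream it out column-major:
# a reordering a compute kernel downstream might require.
buf = UnifiedBuffer(6, read_order=[0, 3, 1, 4, 2, 5])
for v in [10, 11, 12, 20, 21, 22]:
    buf.push(v)
print(list(buf.stream_out()))  # [10, 20, 11, 21, 12, 22]
```

The point of the abstraction is that the compiler can optimize over this kind of declarative write/read pattern first, and only later lower it to a concrete SRAM implementation such as the paper's Physical Unified Buffer.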
Article
Modern applications require hardware accelerators to maintain energy efficiency while satisfying increasing computation requirements. However, with evolving standards and rapidly changing algorithmic complexity, as well as rising design costs at advanced technology nodes, the iterative development of inflexible accelerators for such applications becomes ineffective. Reconfigurable architectures can provide higher throughput and the required flexibility, but with substantial energy- and area-efficiency overhead relative to accelerators. We develop a domain-specific, energy- and area-efficient (within 2×–10× of accelerators) multiprogram runtime reconfigurable accelerator called the universal digital signal processor (UDSP). The design maximizes generality and resource utilization for signal processing and linear algebra with minimal area and energy penalty. The statistics-driven multilayer network minimizes network delay and consists of an optimized switchbox design that maximizes connectivity per hardware cost. The multilayered interconnect network scales linearly with the number of processing elements and allows for intradielet and multidielet scaling. The network features deterministic routing and timing for fast program compilation, and its translation and rotation symmetries allow for hardware resource reallocation. Multidielet scaling is enabled by energy-efficient, high-bandwidth, and high-density interdielet communication channels that seamlessly extend the intradielet routing network across dielet boundaries using a 10-μm fine-pitch silicon interconnect fabric (Si-IF) interposer. A 2×2 multidielet UDSP on Si-IF can achieve a peak energy efficiency of 785 GMACs/J at 0.42 V and 315 MHz. The interdielet communication channel, SNR-10, provides a shoreline bandwidth density of 297 Gb/s/mm at 1.1 Gb/s/pin at 0.8 V and a nominal energy efficiency of 0.38 pJ/bit.
Chapter
The hardware design fundamentally determines the essential properties of the chip, such as performance, energy efficiency, parallelism, and flexibility. Both compilation and programming methods are essentially designed to make it more efficient and convenient for the user to exploit the potential of the hardware.
Chapter
Modern embedded applications require high computational performance under severe energy constraints. For applications in the digital signal processing domain, field-programmable gate arrays (FPGAs) provide a very flexible processing platform. However, due to bit-level reconfigurability, the overhead of such architectures is high in energy, area, and performance. Coarse-grained reconfigurable architectures remove much of this overhead, at the expense of some flexibility. Although many publications on CGRAs have been presented in the past, what exactly constitutes a CGRA is not really clear. For this reason, this chapter defines what a CGRA is and evaluates this definition on a large set of previously presented architectures. The definition depends on the reconfiguration granularity of an architecture in both the temporal and spatial domains. Furthermore, the chapter provides the reader with an overview of the investigated CGRAs and suggests some research topics that could improve CGRAs in the future.
Article
With the demand for high-performance, logic-dense portable devices, the need for silicon area is increasing. A potential solution for the electronics industry to develop such logic-demanding applications is the ability to reconfigure part of a system without altering the overall system operation. For more than two decades, reconfigurable computing has aided various applications and has seen tremendous technology transformation. This paper presents a survey of reconfigurable computing, its present state, and a detailed report on state-of-the-art Partial Dynamic Reconfiguration Frameworks (PDRFs) for reconfiguring FPGA designs partially and dynamically. A detailed analysis of the features, limitations, and performance of a wide range of PDRFs available in the literature is reported.
Chapter
High-performance reconfigurable computing systems integrate reconfigurable technology into the computing architecture to improve performance. Besides performance, reconfigurable hardware devices also achieve lower power consumption compared to general-purpose processors. Better performance and lower power consumption could be achieved using application-specific integrated circuit (ASIC) technology; however, ASICs are not reconfigurable, making them application-specific. Reconfigurable logic becomes a major advantage when hardware flexibility permits speeding up any application with the same hardware module. The first and most common devices utilized for reconfigurable computing are fine-grained FPGAs with large hardware flexibility. To reduce the performance and area overhead associated with reconfigurability, coarse-grained reconfigurable solutions have been proposed as a way to achieve better performance and lower power consumption. In this chapter, the authors provide a description of reconfigurable hardware for high-performance computing.
Article
Full-text available
CGRAs are emerging accelerators that promise low-power acceleration of compute-intensive loops in applications. The acceleration achieved by a CGRA relies on the CGRA compiler efficiently mapping the compute-intensive loops onto the CGRA architecture. The CGRA mapping problem, being NP-complete, is performed in a two-step process: scheduling and mapping. The scheduling algorithm allocates timeslots to the nodes of the data-flow graph (DFG), and the mapping algorithm maps the scheduled nodes onto the processing elements (PEs) of the CGRA. On a mapping failure, the initiation interval (II) is increased and a new schedule is obtained for the increased II. Most previous mapping techniques use the Iterative Modulo Scheduling algorithm (IMS) to find a schedule for a given II. Since IMS generates a resource-constrained ASAP (as-soon-as-possible) schedule, even with increased II it tends to generate a similar schedule that is not mappable; therefore, IMS does not explore the schedule space effectively. To address these issues, this paper proposes CRIMSON, a Compute-intensive loop acceleration technique using Randomized Iterative Modulo Scheduling and Optimized Mapping that generates random modulo schedules by exploring the schedule space, thereby creating different modulo schedules both at a given II and at increased IIs. CRIMSON also employs a novel conservative test after scheduling to prune valid schedules that are not mappable. From our study conducted on the top 24 performance-critical loops (run for more than 7% of application time) from MiBench, Rodinia, and Parboil, we found that previous state-of-the-art approaches that use IMS, such as RAMP and GraphMinor, could not map five and seven loops respectively on a 4×4 CGRA, whereas CRIMSON was able to map them all. For loops mapped by the previous approaches, CRIMSON achieved a comparable II.
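The schedule-then-map loop the abstract describes — try a randomized modulo schedule, attempt a mapping, and relax the initiation interval (II) on failure — can be sketched as follows. All names (`try_map`, `randomized_modulo_schedule`, `map_loop`) are hypothetical stand-ins for illustration; a real CGRA mapper also respects DFG dependences and routing, which this toy omits.

```python
import random

def try_map(schedule, n_pes):
    """Placeholder mapper: succeeds when no timeslot needs more PEs
    than the CGRA provides (real mappers also check routability)."""
    slots = {}
    for node, t in schedule.items():
        slots[t] = slots.get(t, 0) + 1
    return all(count <= n_pes for count in slots.values())

def randomized_modulo_schedule(nodes, ii, rng):
    """Toy stand-in for randomized modulo scheduling: assign each DFG
    node a random timeslot modulo the initiation interval."""
    return {node: rng.randrange(ii) for node in nodes}

def map_loop(nodes, n_pes, max_ii=16, tries_per_ii=8, seed=0):
    rng = random.Random(seed)
    ii = max(1, (len(nodes) + n_pes - 1) // n_pes)  # resource-bound lower limit on II
    while ii <= max_ii:
        for _ in range(tries_per_ii):  # explore the schedule space at this II
            schedule = randomized_modulo_schedule(nodes, ii, rng)
            if try_map(schedule, n_pes):
                return ii, schedule
        ii += 1  # mapping failed at this II: relax it and reschedule
    return None

result = map_loop([f"op{i}" for i in range(10)], n_pes=4)
```

The contrast with IMS is in `randomized_modulo_schedule`: a deterministic ASAP scheduler would return nearly the same schedule at each retry, while randomization gives the mapper genuinely different candidates at the same II.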
Preprint
Full-text available
With the end of both Dennard's scaling and Moore's law, computer users and researchers are aggressively exploring alternative forms of compute in order to continue the performance scaling that we have come to enjoy. Among the more salient and practical of the post-Moore alternatives are reconfigurable systems, with Coarse-Grained Reconfigurable Architectures (CGRAs) seemingly capable of striking a balance between performance and programmability. In this paper, we survey the landscape of CGRAs. We summarize nearly three decades of literature on the subject, with particular focus on premises behind the different CGRA architectures and how they have evolved. Next, we compile metrics of available CGRAs and analyze their performance properties in order to understand and discover existing knowledge gaps and opportunities for future CGRA research specialized towards High-Performance Computing (HPC). We find that there are ample opportunities for future research on CGRAs, in particular with respect to size, functionality, support for parallel programming models, and to evaluate more complex applications.
Conference Paper
Hardware accelerators are one promising solution to contend with the end of Dennard scaling and the slowdown of Moore's law. For mature workloads that are regular and have high compute per byte, hardening an application into one or more hardware modules is a standard approach. However, for some applications, we find that a programmable homogeneous architecture is preferable. This paper compares a previously proposed heterogeneous hardware accelerator for analytical query processing to a homogeneous systolic array alternative. We find that the heterogeneous and homogeneous accelerators are equivalent for large designs, while for small designs the homogeneous is better. Our analysis explains this counter-intuitive result, finding that the homogeneous architecture has higher average resource utilization and lower relative costs for the communication infrastructure.
Article
Energy efficiency has become one of the most important factors in today's processing platforms. Coarse-grained reconfigurable architectures (CGRAs) have turned out to perform well in this area. In this paper, we propose two methods to enhance their energy efficiency. The first method decreases the power consumption of a CGRA by reducing the number of transitions occurring during the context-switching process. The second decreases the energy consumption of a CGRA by shrinking the volume of its context memory. Our results show that up to 83% of the context memory energy can be reduced using our proposed method. In addition, a high-level instruction-set simulator for CGRAs is presented in this paper.
Article
Full-text available
In this paper we give a fresh look to Coarse Grained Reconfigurable Arrays (CGRAs) as ultra-low power accelerators for near-sensor processing. We present a general-purpose Integrated Programmable-Array accelerator (IPA) exploiting a novel architecture, execution model, and compilation flow for application mapping that can handle kernels containing complex control flow, without the significant energy overhead incurred by state of the art predication approaches. To optimize the performance and energy efficiency, we explore the IPA architecture with special focus on shared memory access, with the help of the flexible compilation flow presented in this paper. We achieve a maximum energy gain of 2×, and performance gain of 1.33× and 1.8× compared with state of the art partial and full predication techniques, respectively. The proposed accelerator achieves an average energy efficiency of 1617 MOPS/mW operating at 100 MHz, 0.6 V in 28-nm UTBB FD-SOI technology, over a wide range of near-sensor processing kernels, leading to an improvement up to 18×, with an average of 9.23× (as well as a speed-up up to 20.3×, with an average of 9.7×) compared to a core specialized for ultra-low power near-sensor processing.
Thesis
Full-text available
An innovative methodology to model and simulate partial and dynamic reconfiguration is presented in this work. Since dynamic reconfiguration can be seen as the removal and reinsertion of modules into the system, the presented methodology is based on blocking the execution of unconfigured modules during simulation, without interfering with normal system activity. Once the simulator provides the ability to remove, insert, and exchange modules during simulation, all systems modeled on it can benefit from dynamic reconfiguration. To prove the concept, modifications to the SystemC kernel were developed, adding new instructions to remove and reconfigure modules at simulation time, enabling the simulator to be used either at transaction level (TLM) or at register transfer level (RTL). At TLM it allows the modeling and simulation of higher-level hardware and embedded software, while at RTL the dynamic system behavior can be observed at the signal level. Since all abstraction levels can be modeled and simulated, all system granularities can also be considered. In the end, every system that can be simulated using SystemC can also have its behavior changed at run-time. The provided set of instructions shortens the design cycle: compared with traditional strategies, information about dynamic and adaptive behavior becomes available at earlier stages. Three different applications were developed using the methodology at different abstraction levels and granularities. Considerations about how best to apply dynamic reconfiguration are also made. The results obtained assist designers in choosing the best cost/benefit tradeoff in terms of chip area and reconfiguration delay.
Chapter
Full-text available
Electronic circuits can be separated into two groups, digital and analog circuits. Analog circuits operate on analog quantities that are continuous in value, whereas digital circuits operate on digital quantities that are discrete in value and limited in precision. In practice, most digital systems contain combinational circuits along with memory; these systems are known as sequential circuits. Sequential circuits are of two types: synchronous and asynchronous. In a synchronous sequential circuit, a clock signal is used at discrete instants of time to synchronize desired operations. Asynchronous sequential circuits do not require synchronizing clock pulses; however, the completion of an operation signals the start of the next operation in sequence. The basic logic design steps are generally identical for sequential and combinational circuits; these are specification, formulation, optimization, and the implementation of the optimized equations using a suitable hardware technology. The differences between sequential and combinational design steps appear in the details of each step. The minimization (optimization) techniques used in logic design range from simple (manual) to complex (automated). An example of a manual optimization method is the Karnaugh map (K-map). Indeed, hardware implementation technology has been growing faster than the ability of designers to produce hardware designs. Hence, there has been a growing interest in developing techniques and tools that facilitate the process of logic design.
Conference Paper
Full-text available
Dynamically Programmable Gate Arrays (DPGAs) are programmable arrays which allow the strategic reuse of limited resources. In so doing, DPGAs promise greater capacity, and in some cases higher performance, than conventional programmable device architectures where all array resources are dedicated to a single function for an entire operational epoch. This paper examines several usage patterns for DPGAs including temporal pipelining, utility functions, multiple function accommodation, and state-dependent logic. In the process, it offers insight into the application and technology space where DPGA-style reuse techniques are most beneficial.
Conference Paper
This paper presents an architecture for an FPGA oriented towards logic emulation, designed to achieve maximum usable logic density per unit silicon area and fast mapping. Logic circuits are translated into a program that is executed sequentially by a network of processor elements. Overall, a sevenfold increase in raw logic blocks and a 25-fold increase in usable logic blocks are expected compared to an FPGA-based logic emulator of the same silicon area.
Conference Paper
Existing hardware prototyping platforms, often based on commercial processors or FPGAs, cannot cope with the high computation requirements of complex DSP algorithms, especially those with high sampling rates and heterogeneous data-flow patterns. The multiprocessor IC presented here is designed to handle these types of algorithms. The chip contains 48 16-bit PEs interconnected by a two-level, high-bandwidth communication network.
Conference Paper
The GPA machine, a massively parallel, multiple single-instruction-stream-multiple-data-stream (MSIMD) system, is described. Its distinguishing characteristic is the generality of its partitioning capabilities. Like the PASM system, it can be dynamically reconfigured to operate as one or more independent SIMD machines. However, unlike PASM, the only constraint placed on partitioning is that an individual processing element is a member of at most one partition. This capability allows for reconfiguration based on the run-time status of dynamic data structures and for partitioning of disconnected and overlapping data structures. Significant speedups are expected from operating on data structures in place; copying of data to a newly configured partition is unnecessary. The GPA system consists of N processing-element/RAM pairs and an interconnection network providing access to and from P control processors or microcontrollers. With current technologies, values for N and P of 64K and 16, respectively, are feasible.
Conference Paper
CREATE-LIFE (Compiler for REgular ArchiTEcture, Long Instruction Format Engine) is an architecture-specific design approach for CMOS VLSI based on block modularity and architecture regularity. It takes as input a description of the function to be performed by a VLSI circuit, written in a general-purpose, Pascal-like programming language, and a description of the particular LIFE architecture instance. The output is the CMOS layout of a very high-performance (micro)programmed architecture tailored for that specific function. The LIFE architectures generated in the design environment are single-chip implementations of an improved form of a very long instruction word (VLIW) architecture that uses both parallelism and pipelining. With this approach, an experimental general-purpose processor that performs in the 50 to 100 VAX-equivalent range has been designed.
Article
Three different operating system strategies for a parallel processor computer system are compared, and the most effective strategy for given job loads is determined. The three strategies compare uniprogramming versus multiprogramming and distributed operating systems versus dedicated processor operating systems. The level of evaluation includes I/O operations, resource allocation, and interprocess communication. The results apply to architectures where jobs may be scheduled to processors on the basis of processor availability, memory availability, and the availability of one other resource used by all jobs.
Article
A field-programmable multiprocessor integrated circuit, PADDI (programmable arithmetic devices for high-speed digital signal processing), has been designed for the rapid prototyping of high-speed data paths typical to real-time digital signal processing applications. The processor architecture addresses the key requirements of these data paths: (a) fast, concurrently operating, multiple arithmetic units, (b) conflict-free data routing, (c) moderate hardware multiplexing (of the arithmetic units), (d) minimal branch penalty between loop iterations, (e) wide instruction bandwidth, and (f) wide I/O bandwidth. The initial version contains eight processors connected via a dynamically controlled crossbar switch, and has a die size of 8.9×9.5 mm in a 1.2-μm CMOS technology. With a maximum clock rate of 25 MHz, it can support a computation rate of 200 MIPS and can sustain a data I/O bandwidth of 400 Mbytes/s with a typical power consumption of 0.45 W. An assembler and simulator have been developed to facilitate programming and testing of the chip
Article
This paper described the Abacus machine at a number of levels. We presented the microarchitecture of the PE comprising the reconfigurable bit-parallel array, a set of arithmetic and communication primitives, details of the VLSI implementation, and system-level design issues of a high-speed SIMD array. The most concrete goal of the Abacus project was to design and build a machine that could be used by members of the MIT Artificial Intelligence Laboratory for real-time early vision processing. Along the way, we explored several architectural ideas. First, we tested the limits of the premise that simple one-bit PEs allow a faster overall clock rate. Although this is a common argument in research papers, existing MPP SIMD systems use a clock rate substantially slower than that of commercial bit-parallel microprocessors. The 125 MHz clock of the Abacus chip is the highest of any massively parallel SIMD systems that we're aware of. Further increases in clock speed are limited by instruction bandwidth rather than PE complexity, and therefore transfer the difficulty to off-chip interfaces and printed circuit board design. Retaining the clock speed while increasing the work done per cycle holds more promise as the approach for incorporating smaller and faster VLSI technology. Second, we determined how effectively a collection of simple one-bit PEs could emulate bit-parallel hardware. As shown in Section 4, a variety of arithmetic circuits can be emulated at a cost of two to five cycles. However, while reconfiguration reduces the silicon cost, it increases the time requirements in two ways. Bit-level reconfiguration precludes hardware pipelining by introducing dependencies on network switch settings. Conventional techniques for dealing with data dependencies (such as forwarding...
  • M. Slater, "MicroUnity Lifts Veil on MediaProcessor"
  • P. Clarke, "Pilkington Preps Reconfigurable Video DSP"
  • D. Epstein, "Chromatic Raises the Multimedia Bar"