Conference Paper

A MIPS Processor with a Reconfigurable Coprocessor

Authors: John R. Hauser and John Wawrzynek

Abstract

Typical reconfigurable machines exhibit shortcomings that make them less than ideal for general-purpose computing. The Garp Architecture combines reconfigurable hardware with a standard MIPS processor on the same die to retain the better features of both. Novel aspects of the architecture are presented, as well as a prototype software environment and preliminary performance results. Compared to an UltraSPARC, a Garp of similar technology could achieve speedups ranging from a factor of 2 to as high as a factor of 24 for some useful applications.


... The Garp architecture is based on a combination of reconfigurable hardware with a standard MIPS processor [17]. This means that Garp uses the reconfigurable architecture as a co-processor for executing those parts of the code that run more slowly on the MIPS. ...
... The DART architecture is intended to be a reconfigurable architecture for telecommunications applications. The authors claimed that the architecture can handle complex processing tasks of third generation telecommunications systems in an efficient and low-power manner [17]. The architecture can be broken down into independent processing units named clusters. ...
... Garp program flowchart [17]; Fig. 23: SRGA architecture interconnect mesh [11] ...
Article
Full-text available
The development of mobile devices has challenged hardware designers to come up with suitable architectures. Challenges such as power consumption, flexibility, processing power and area are likely to lead to the need for a reconfigurable architecture to cater for the growing demands made of mobile devices, and to suit the needs of the next generation of devices. Parallelism and multifunction in real-time will be the minimum required characteristics of the architectures of such devices. This chapter reviews the currently available reconfigurable architectures. The focus here is on coarse-grain reconfigurable architectures, with particular attention to those which support dynamic reconfiguration with low-power consumption. The capacity for dynamic reconfiguration will be a key factor in defining the most suitable architecture for future generations of mobile devices. This paper describes existing reconfigurable platforms. Their principles of operation, architectures and structures are discussed highlighting their advantages and disadvantages. Various coarse-grain reconfigurable architectures are discussed along with their improvement with time. Finally, the key characteristics which are required for a reconfigurable architecture to be suitable for telecommunication systems are identified. A comparison is given for the various architectures discussed in terms of suitability for telecommunications applications.
... Garp is another reconfigurable platform proposed by Hauser et al. [12], which is a fine-grained architecture capable of complex bit-oriented computations for image processing applications. The programmable grid works as a co-processor for an in-chip MIPS II processor. ...
... Result: GA returns the fittest genome
    g_r = g_m = g_c = null
    generation = 0
    while not termination condition do
        if generation = 0 then
            for i ← 1 to population do
                g_ri = generate_random()
            end
        else
            for i ← 1 to (population × mutation_rate) do
                g_p = random_select(G_parents)
                g_mi = generate_mutation(g_p)
            end
            for i ← 1 to (population × crossover_rate) do
                g_p1 = random_select(G_parents)
                g_p2 = random_select(G_parents)
                g_ci = generate_crossover(g_p1, g_p2)
            ...
        for i ← 1 to parents do
            g_s = get_fittest(G_children)
            G_parents.add(g_s)
            G_children.remove(g ...
... thus, we constructed a simple selection function that scales with the available number of parents and is biased toward the better ones. The function is defined as follows: [code listing omitted] In the following subsections, the GAGA operators will be presented, which are genome-aware constrained selection, genome-aware mutation, and genome-aware crossover, in addition to the new GA operator, genome-aware pruner. The target of these improvements is to accelerate evolution and increase parallelism without limiting the GA from exploring the entirety of the search space. ...
Thesis
Full-text available
Evolvable hardware (EHW) is a powerful autonomous system for adapting and finding solutions within a changing environment. EHW consists of two main components: a reconfigurable hardware core and an evolutionary algorithm. The majority of prior research focuses on improving either the reconfigurable hardware or the evolutionary algorithm in place, but not both. Thus, current implementations suffer from being application oriented and having slow reconfiguration times, low efficiencies, and less routing flexibility. In this work, a novel evolvable hardware platform is proposed that combines a novel reconfigurable hardware core and a novel evolutionary algorithm. The proposed reconfigurable hardware core is a systolic array, which is called HexArray. HexArray was constructed using processing elements with a redesigned architecture, called HexCells, which provide routing flexibility and support for hybrid reconfiguration schemes. The improved evolutionary algorithm is a genome-aware genetic algorithm (GAGA) that accelerates evolution. Guided by a fitness function, the GAGA utilizes context-aware genetic operators to evolve solutions. The operators are genome-aware constrained (GAC) selection, genome-aware mutation (GAM), and genome-aware crossover (GAX). The GAC selection operator improves parallelism and reduces redundant evaluations. The GAM operator restricts mutation to the part of the genome that affects the selected output. The GAX operator cascades, interleaves, or parallel-recombines genomes at the cell level to generate better genomes. These operators improve evolution while not limiting the algorithm from exploring all areas of a solution space. The system was implemented on a SoC that includes programmable logic (i.e., a field-programmable gate array) to realize the HexArray and a processing system to execute the GAGA.
A computationally intensive application that evolves adaptive filters for image processing was chosen as a case study and used to conduct a set of experiments to prove the developed system robustness. Through an iterative process using the genetic operators and a fitness function, the EHW system configures and adapts itself to evolve fitter solutions. In a relatively short time (e.g., seconds), HexArray is able to evolve autonomously to the desired filter. By exploiting the routing flexibility in the HexArray architecture, the EHW has a simple yet effective mechanism to detect and tolerate faulty cells, which improves system reliability. Finally, a mechanism that accelerates the evolution process by hiding the reconfiguration time in an “evolve-while-reconfigure” process is presented. In this process, the GAGA utilizes the array routing flexibility to bypass cells that are being configured and evaluates several genomes in parallel.
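The generation loop excerpted above can be sketched in Python. This is a minimal sketch, not the GAGA implementation itself: the 16-bit genome representation, rates, and helper names are assumptions for illustration.

```python
import random

def run_ga(fitness, population=20, mutation_rate=0.4,
           crossover_rate=0.4, parents=4, generations=50):
    # Sketch of a GA generation loop: random initial pool, then
    # mutation and crossover of selected parents, keeping the
    # fittest genomes as the next parent pool (elitism).
    def rand_genome():
        return [random.randint(0, 1) for _ in range(16)]

    def mutate(g):
        g = g[:]
        g[random.randrange(len(g))] ^= 1  # flip one random bit
        return g

    def crossover(a, b):
        cut = random.randrange(1, len(a))  # single-point crossover
        return a[:cut] + b[cut:]

    pool = [rand_genome() for _ in range(population)]
    elite = sorted(pool, key=fitness, reverse=True)[:parents]
    for _ in range(generations):
        children = [mutate(random.choice(elite))
                    for _ in range(int(population * mutation_rate))]
        children += [crossover(random.choice(elite), random.choice(elite))
                     for _ in range(int(population * crossover_rate))]
        # keep the fittest genomes as the next parent pool
        elite = sorted(children + elite, key=fitness, reverse=True)[:parents]
    return elite[0]
```

With a simple fitness such as `sum` (count of 1-bits), the loop converges toward all-ones genomes; the genome-aware operators described in the thesis constrain these steps further.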
... The pioneering FPGAs were somehow limited in the data width, and the architecture of the functional units [12,13]. While the GARP architecture in [12] still designs the functional units as look-up tables, like in an FPGA, the reconfigurable units in CHESS [13] use a structure similar to an ALU. ...
Article
Full-text available
Reconfigurable computing architectures allow the adaptation of the underlying datapath to the algorithm. The granularity of the datapath elements and data width determines the granularity of the architecture and its programming flexibility. Coarse-grained architectures have shown the right balance between programmability and performance. This paper provides an overview of coarse-grained reconfigurable architectures and describes Versat, a Coarse-Grained Reconfigurable Array (CGRA) with self-generated partial reconfiguration, presented as a case study for better understanding these architectures. Unlike most of the existing approaches, which mainly use pre-compiled configurations, a Versat program can generate and apply myriads of on-the-fly configurations. Partial reconfiguration plays a central role in this approach, as it speeds up the generation of incrementally different configurations. The reconfigurable array has a complete graph topology, which yields unprecedented programmability, including assembly programming. Besides being useful for optimising programs, assembly programming is invaluable for working around post-silicon hardware, software, or compiler issues. Results on core area, frequency, power, and performance running different codes are presented and compared to other implementations.
... At the same time, many of the limitations that FPGAs have, such as slow configuration times, long compilation times, and (comparably) low clock frequencies, remain unsolved. These limitations have been recognized for decades (e.g., [15]-[17]), and have driven forth a different branch of reconfigurable architecture: the Coarse-Grained Reconfigurable Architecture (CGRA). ...
... Some early CGRAs were not much coarser than their respective FPGAs. For example, the Garp [17] (shown in Figure 2:a) infrastructure was reconfigurable at a 2-bit (rather than FPGA 1-bit) granularity. Here, each reconfigurable unit could connect to neighbors in both the horizontal (used for carry-outs) and vertical direction, as well as to dedicated bus lines for interfacing memory. ...
Article
Full-text available
With the end of both Dennard’s scaling and Moore’s law, computer users and researchers are aggressively exploring alternative forms of computing in order to continue the performance scaling that we have come to enjoy. Among the more salient and practical of the post-Moore alternatives are reconfigurable systems, with Coarse-Grained Reconfigurable Architectures (CGRAs) seemingly capable of striking a balance between performance and programmability. In this paper, we survey the landscape of CGRAs. We summarize nearly three decades of literature on the subject, with a particular focus on the premise behind the different CGRAs and how they have evolved. Next, we compile metrics of available CGRAs and analyze their performance properties in order to understand and discover knowledge gaps and opportunities for future CGRA research specialized towards High-Performance Computing (HPC). We find that there are ample opportunities for future research on CGRAs, in particular with respect to size, functionality, support for parallel programming models, and to evaluate more complex applications.
... Moreover, coarse-grained reconfigurable architectures employ, besides the context memory, a cache memory hierarchy to store regular instructions, which introduces needless redundancy [8][9][10][14][15][16]. Prior work proposes a demand-based allocation cache memory, which joins regular instructions and contexts in a single memory structure, dynamically sharing the number of blocks available for each type of information [11]. ...
... However, since not every configuration uses all available rows, a per-row-based partial reconfiguration mechanism is proposed. Thus, supposing a 128-bit bus, 384 external memory accesses are necessary to fetch the configuration bits of 32 rows [9]. Using the same approach as Garp, Chimaera couples an FPGA to a MIPS processor and employs partial reconfiguration to decrease reconfiguration time. ...
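The access count in the excerpt above is simple arithmetic; the per-row configuration size of 1,536 bits is an assumption consistent with the quoted totals, not a figure stated in the excerpt itself.

```python
# Back-of-the-envelope check of the quoted access count:
# 32 rows * 1536 configuration bits per row, fetched over a
# 128-bit external bus.
bus_bits = 128
bits_per_row = 1536   # assumed, consistent with the quoted totals
rows = 32
accesses = rows * bits_per_row // bus_bits
```

Each row then costs 12 bus transfers, which is why skipping unused rows via per-row partial reconfiguration pays off.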
... The concept of RC was initiated in the early 1960s when Gerald Estrin proposed the concept of a computer made of a standard processor and an array of reconfigurable hardware [4,5]. Programmable active memories (PAM) [6], Garp [7] (which integrates RC with a standard MIPS processor), NGEN [8], POLYP [9], and MereGen [10] (massively parallel reconfigurable computers based on hundreds of FPGAs coupled with SRAMs) are a few examples of early RC projects. ...
Article
Full-text available
Reconfigurable computing (RC) theory aims to take advantage of the flexibility of general-purpose processors (GPPs) alongside the performance of application-specific integrated circuits (ASICs). Numerous RC architectures have been proposed since the 1960s, but all have struggled to become mainstream. The main factor that prevents RC from being used in general-purpose CPUs, GPUs, and mobile devices is that it requires extensive knowledge of digital circuit design, which most software programmers lack. In an RC development, a processor cooperates with a reconfigurable hardware accelerator (HA) which is usually implemented on a field-programmable gate array (FPGA) chip and can be reconfigured dynamically. It implements crucial portions of software (kernels) in hardware to increase overall performance, and its design requires substantial knowledge of digital circuit design. In this paper, a novel RC architecture is proposed that provides the exact same instruction set as a standard general-purpose RISC microprocessor (e.g., ARM Cortex-M0) while automating the generation of a tightly coupled RC component to improve system performance. This approach keeps the decades-old assemblers, compilers, debuggers, library components, and programming practices intact while utilizing the advantages of RC. The proposed architecture employs the LLVM compiler infrastructure to translate an algorithm written in a high-level language (e.g., C/C++) to machine code. It then finds the most frequent instruction pairs and generates an equivalent RC circuit that is called a miniature accelerator (MA). Execution of the instruction pairs is performed by the MA in parallel with consecutive instructions. Several kernel algorithms alongside EEMBC CoreMark are used to assess the performance of the proposed architecture. Performance improvement from 4.09% to 14.17% is recorded when the HA is turned on.
There is a trade-off between core performance and a combination of compilation time, die area, and program startup load time, which includes the time required to partially reconfigure an FPGA chip.
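The pair-frequency analysis described above can be sketched in a few lines of Python. This is an illustrative sketch, not the paper's LLVM pass; the mnemonic names are hypothetical.

```python
from collections import Counter

def most_frequent_pairs(instrs, k=3):
    # Count adjacent instruction pairs in a linear instruction
    # stream; the top-k pairs are candidates for fusion into a
    # miniature-accelerator circuit, per the approach above.
    pairs = Counter(zip(instrs, instrs[1:]))
    return pairs.most_common(k)

stream = ["lsl", "add", "lsl", "add", "str", "lsl", "add"]
top = most_frequent_pairs(stream, 1)
```

Here the shift-then-add pair dominates the stream, so it would be the first candidate for hardware fusion.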
... The Garp compiler and architecture [30]- [32] are specifically designed to facilitate the pipelined execution of loops on a co-processor. This system comprises a single-issue microprocessor equipped with a rapidly reconfigurable array acting as the co-processor for loop acceleration. ...
Preprint
Coarse-grain reconfigurable architectures (CGRAs) are gaining traction thanks to their performance and power efficiency. Utilizing CGRAs to accelerate the execution of tight loops holds great potential for achieving significant overall performance gains, as a substantial portion of program execution time is dedicated to tight loops. But loop parallelization using CGRAs is challenging because of loop-carried data dependencies. Traditionally, loop-carried dependencies are handled by spilling dependent values out of the reconfigurable array to a memory medium and then feeding them back to the grid. Spilling the values and feeding them back into the grid imposes additional latencies and logic that impede performance and limit parallelism. In this paper, we present the Dependency Resolved CGRA (DR-CGRA) architecture that is designed to accelerate the execution of tight loops. DR-CGRA, which is based on a massively-multithreaded CGRA, runs each iteration as a separate CGRA thread and maps loop-carried data dependencies to inter-thread communication inside the grid. This design ensures the passage of data-dependent values across loop iterations without spilling them out of the grid. The proposed DR-CGRA architecture was evaluated on various SPEC CPU 2017 benchmarks. The results demonstrated significant performance improvements, with speedups ranging from 2.1 to 4.5 and an overall average of 3.1 when compared to a state-of-the-art CGRA architecture.
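A loop-carried data dependency of the kind DR-CGRA targets is easy to illustrate; the example below is mine, not from the paper.

```python
def prefix_sums(a):
    # Loop-carried dependency: each iteration consumes the value
    # produced by the previous one, so iterations cannot simply
    # run in parallel without forwarding s between them -- the
    # inter-iteration value DR-CGRA would pass between threads
    # inside the grid rather than spill to memory.
    out = []
    s = 0
    for x in a:
        s += x          # s depends on the previous iteration's s
        out.append(s)
    return out
```

Naively mapping each iteration to a separate processing element stalls on `s`; DR-CGRA's approach is to turn that dependence edge into in-grid inter-thread communication.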
... Earlier research also focused on using FPGAs as a functional unit. Garp [14] targets embedded processors without multi-processing support, but it introduces the idea of combining a bitstream alongside the process binary. It does not make FPGAs transparent, as it requires configuration instructions. ...
Preprint
This paper explores a computer architecture where part of the instruction set architecture (ISA) is implemented on small highly-integrated field-programmable gate arrays (FPGAs). It has already been demonstrated that small FPGAs inside a general-purpose processor (CPU) can be used effectively to implement custom instructions and, in some cases, approach accelerator-level performance. Our proposed architecture goes one step further to directly address some related challenges for high-end CPUs, where such highly-integrated FPGAs would have the highest impact, including access to the memory hierarchy with the highest bandwidth available. The main contribution is the introduction of the "FPGA-extended modified Harvard architecture" model to enable context-switching between processes with different distributions of instructions without modifying the applications. The cycle-approximate evaluation of a dynamically reconfigurable core shows promising results for multi-processing, approaching the performance of an equivalent core with all instructions enabled, and better performance than a core featuring a fixed subset of the supported instructions.
... Reconfigurable architectures can be constructed with custom-designed chips different from FPGAs. The Garp reconfigurable processor [67,68] is a fine-grained architecture allowing bit manipulation resembling FPGAs. PipeRench [69,70] is an arithmetic logic unit (ALU) based coarse-grained architecture working as a coprocessor. ...
Thesis
Full-text available
Parallelism, non-determinism, and large scale are three characteristics of biological and ecological systems that are transmitted to the bio-inspired computing models these systems inspire. Imitating bio-inspired computing models on general-purpose computers, by writing high-level-language code, is the common approach to simulating these unconventional computing models because of its accessibility. However, this approach is ill-suited to the three characteristics mentioned above, especially parallelism on a large scale. From the hardware perspective, the CPU executes software code that simulates bio-inspired computing models; more precisely, the integrated circuits inside CPUs or other processing devices perform the operations defined by the software that mimics these models. Software emulating bio-inspired computing models can obtain the expected results, but it is not guaranteed that the CPU executes operations in line with what the models do. If the parallel performance of the target CPU is not sufficient to support the required parallel processing, some parallel procedures will be serialized; and because the CPU's internal process is not transparent, we cannot know whether the CPU carries out operations in accordance with those of the models. To state the main contributions more clearly, the notions of implementation and simulation should be distinguished. If hardware emulates a bio-inspired computing model in strict accordance with the procedures defined by the model, such emulation is termed "implementation". If the processing procedures of the target hardware are not consistent with the model, even though the expected outcomes are obtained, such emulation is termed "simulation".
... The eFPGA was integrated directly into the RISC-V core such that reconfigurable custom instructions are mapped into the CPU instruction space. In this system, developers can create hardware designs of custom instructions (in Verilog) targeting the eFPGA that will be used as an assembly instruction (e.g., as used in [6,24,35,60]). ...
Conference Paper
At the end of CMOS-scaling, the role of architecture design is increasingly gaining importance. Supporting this trend, customizable embedded FPGAs are an ingredient in ASIC architectures to provide the advantages of reconfigurable hardware exactly where and how it is most beneficial. To enable this, we are introducing the FABulous embedded open-source FPGA framework. FABulous is designed to fulfill the objectives of ease of use, maximum portability to different process nodes, good control for customization, and delivering good area, power, and performance characteristics of the generated FPGA fabrics. The framework provides templates for logic, arithmetic, memory, and I/O blocks that can be easily stitched together, whilst enabling users to add their own fully customized blocks and primitives. The FABulous ecosystem generates the embedded FPGA fabric for chip fabrication, integrates Yosys, ABC, VPR and nextpnr as FPGA CAD tools, and deals with bitstream generation and after-fabrication tests. Additionally, we provide an emulation path for system development. FABulous was demonstrated for an ASIC integrating a RISC-V core with an embedded FPGA fabric for custom instruction set extensions using a TSMC 180 nm process and an open-source 45 nm process node.
... The Dynamic Instruction Set Computer (DISC) [4] used a CLAy31 FPGA due to its ability of partial reconfiguration in order to swap instructions in a computing system rapidly. GARP [5] integrated an FPGA tightly into a MIPS CPU with the FPGA given direct access to a cache which allows larger acceleration jobs and therefore overall better performance. CHIMAERA [6] is another architecture that couples a reconfigurable unit with a CPU core. ...
Conference Paper
This paper presents an all open-source framework for adding embedded FPGAs into RISC-V CPUs. In our approach, an eFPGA is directly coupled with the CPU, and through support for partial reconfiguration, instructions can be swapped at runtime. The eFPGA fabric is tiled into multiple slots in order to host different instructions in parallel, and multiple slots can be combined to host more complex instructions. Instructions can be swapped without interrupting the CPU, and instructions can have different numbers of execution cycles to provide more flexibility for instruction implementations. Our case study integrates an Ibex RISC-V core from lowRISC together with our custom embedded FPGA supporting multiple regions, with logic, DSP, and register-file slices. This system has been taped out in a 180 nm TSMC process.
... The eFPGA was integrated directly into the RISC-V core such that reconfigurable custom instructions are mapped into the CPU instruction space. In this system, developers can create hardware designs of custom instructions (in Verilog) targeting the eFPGA that will be used as an assembly instruction (e.g., as used in [58]- [61]). ...
Preprint
Full-text available
At the end of CMOS-scaling, the role of architecture design is increasingly gaining importance. Supporting this trend, customizable embedded FPGAs are an ingredient in ASIC architectures to provide the advantages of reconfigurable hardware exactly where and how it is most beneficial. To enable this, we are introducing the FABulous embedded open-source FPGA framework. FABulous is designed to fulfill the objectives of ease of use, maximum portability to different process nodes, good control for customization, and delivering good area, power, and performance characteristics of the generated FPGA fabrics. The framework provides templates for logic, arithmetic, memory, and I/O blocks that can be easily stitched together, whilst enabling users to add their own fully customized blocks and primitives. The FABulous ecosystem generates the embedded FPGA fabric for chip fabrication, integrates Yosys, ABC, VPR and nextpnr as FPGA CAD tools, deals with the bitstream generation and after-fabrication tests. Additionally, we provide an emulation path for system development. FABulous was demonstrated for an ASIC integrating a RISC-V core with an embedded FPGA fabric for custom instruction set extensions using a TSMC 180 nm process and an open-source 45 nm process node.
... Based on our knowledge, at least 40 CGRAs have been developed to adapt to diverse applications in the past decades. These CGRAs are either positioned as accelerators or standalone processing units, and target improving the efficiency of running applications that cover mobile computing [1], [2], [3], media processing [4], [5], [6], [7], [8], image processing [9], [10], digital signal processing (DSP) [11], [12], [13], [14], [15], [16], [17], [18], ultra-low power processing [19], [20], [21], machine learning [22], [23], [24], data- or computation-intensive domains [25], [26], [27], [28], [29], [30], [31], [32], and even general purpose computing [33], [34], [35], [36], [37], [38]. However, building the software ecosystem around CGRAs is challenging due to the diverse CGRA hardware design flavors and application purposes. ...
Article
Coarse-Grained Reconfigurable Architectures (CGRAs) are a promising solution for accelerating computation-intensive tasks due to their good trade-off between energy efficiency and flexibility. One of the challenging research topics is how to effectively deploy loops onto CGRAs within acceptable compilation time. Modulo scheduling (MS) has been shown to be efficient for deploying loops onto CGRAs. Existing CGRA MS algorithms still struggle to map loops with high performance under acceptable compilation time, especially when mapping large and irregular loops onto CGRAs with limited computational and routing resources. This is mainly due to under-utilization of the available buffer resources on the CGRA, unawareness of critical mapping constraints, and time-consuming methods of solving temporal and spatial mapping. This paper focuses on improving the performance and compilation robustness of the modulo-scheduling mapping algorithm for CGRAs. We decompose the CGRA MS problem into the temporal and spatial mapping problems and reorganize the processes inside these two problems. For the temporal mapping problem, we provide a comprehensive and systematic mapping flow that includes a powerful buffer-allocation algorithm and efficient algorithms for solving interconnection and computational constraints. For the spatial mapping problem, we develop a fast and stable spatial mapping algorithm with backtracking and reordering mechanisms. Our MS mapping algorithm is able to map loops onto CGRAs with higher performance and faster compilation time. Experimental results show that, given the same compilation-time budget, our mapping algorithm achieves a higher compilation success rate. Among the successfully compiled loops, our approach improves performance by 5.4% to 14.2% and takes 24x to 1099x less compilation time on average compared with state-of-the-art CGRA mapping algorithms.
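Modulo scheduling pipelines a loop at a fixed initiation interval (II). A minimal sketch of the classic lower bound on II (the textbook resource and recurrence bounds, not this paper's algorithm):

```python
import math

def min_initiation_interval(n_ops, n_pes, cycle_latency, cycle_distance):
    # The II can be no smaller than:
    #  - the resource bound: operations divided by processing
    #    elements available per cycle, and
    #  - the recurrence bound: latency of a dependence cycle
    #    divided by its iteration distance.
    res_mii = math.ceil(n_ops / n_pes)
    rec_mii = math.ceil(cycle_latency / cycle_distance)
    return max(res_mii, rec_mii)
```

A scheduler then searches for a valid mapping starting at this bound and increases the II only when placement or routing fails, which is where the compilation-time pressure discussed above comes from.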
... These limitations have been recognized for decades (e.g., [15]-[17]), and have been used to drive forth a different branch of reconfigurable architecture: the Coarse-Grained Reconfigurable Architecture (CGRA). ...
Preprint
Full-text available
With the end of both Dennard's scaling and Moore's law, computer users and researchers are aggressively exploring alternative forms of compute in order to continue the performance scaling that we have come to enjoy. Among the more salient and practical of the post-Moore alternatives are reconfigurable systems, with Coarse-Grained Reconfigurable Architectures (CGRAs) seemingly capable of striking a balance between performance and programmability. In this paper, we survey the landscape of CGRAs. We summarize nearly three decades of literature on the subject, with particular focus on premises behind the different CGRA architectures and how they have evolved. Next, we compile metrics of available CGRAs and analyze their performance properties in order to understand and discover existing knowledge gaps and opportunities for future CGRA research specialized towards High-Performance Computing (HPC). We find that there are ample opportunities for future research on CGRAs, in particular with respect to size, functionality, support for parallel programming models, and to evaluate more complex applications.
... The integration of reconfigurable fabric with the main processor has been the subject of researchers' attention to accelerate the main computation [8,17]. There are several works that propose using the reconfigurable fabric for implementing specific monitoring tasks, and book-keeping functions [5]. ...
Conference Paper
Runtime verification employs dedicated hardware or software monitors to check whether program properties hold at runtime. However, these monitors often incur high area and performance overheads depending on whether they are implemented in hardware or software. In this work, we propose DHOOM, an architectural framework for runtime monitoring of program assertions, which exploits the combination of a reconfigurable fabric present alongside a processor core with the vestigial on-chip Design-for-Debug hardware. This combination of hardware features allows DHOOM to minimize the overall performance overhead of runtime verification, even when subject to a given area constraint. We present an algorithm for dynamically selecting an effective subset of assertion monitors that can be accommodated in the available programmable fabric, while instrumenting the remaining assertions in software. We show that our proposed strategy, while respecting area constraints, reduces the performance overhead of runtime verification by up to 32% when compared with a baseline of software-only monitors.
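The monitor-selection problem described above can be sketched as a greedy knapsack heuristic. This is an illustrative sketch of the problem, not DHOOM's algorithm; the monitor names and cost model are hypothetical.

```python
def select_monitors(monitors, area_budget):
    # Greedy sketch: place the most profitable assertion monitors
    # in the reconfigurable fabric, subject to the area budget, and
    # instrument the rest in software.
    # 'monitors' maps name -> (fabric_area, overhead_saved).
    ranked = sorted(monitors.items(),
                    key=lambda kv: kv[1][1] / kv[1][0], reverse=True)
    hw, sw, used = [], [], 0
    for name, (area, _) in ranked:
        if used + area <= area_budget:
            hw.append(name)
            used += area
        else:
            sw.append(name)
    return hw, sw
```

Ranking by saved-overhead per unit area captures the trade-off: monitors that are cheap in fabric but expensive in software go to hardware first.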
... GARP was a dynamically reconfigurable architecture that combined reconfigurable hardware with a standard MIPS processor [Hauser and Wawrzynek 1997]. The reconfigurable fabric was a slave compute unit located on the same die as the processor, as shown in Fig. 2(a). ...
Article
Full-text available
Dynamic and partial reconfiguration are key differentiating capabilities of field programmable gate arrays (FPGAs). While they have been studied extensively in academic literature, they find limited use in deployed systems. We review FPGA reconfiguration, looking at architectures built for the purpose, and the properties of modern commercial architectures. We then investigate design flows and identify the key challenges in making reconfigurable FPGA systems easier to design. Finally, we look at applications where reconfiguration has found use, as well as proposing new areas where this capability places FPGAs in a unique position for adoption.
... Today's FPGAs have coarse-grained logic blocks and embedded processors, and the hall of fame papers track this evolution. Early papers make the case for combining processors and FPGAs [19], [20] and show how to integrate reconfigurable functional units [21]. Architectures explored the use of coarser-grained building blocks that natively support word-wide computations and sequencing of operations [22]-[24]. ...
Article
The TCFPGA Hall of Fame for FPGAs (field-programmable gate arrays) and Reconfigurable Computing recognizes the most significant peer-reviewed publications in the field, highlights key contributions, and represents the body of knowledge that has accumulated over the past 30 years. The ACM SIGDA Technical Committee on FPGAs and Reconfigurable Computing is a technical committee of the Design Automation Special Interest Group, which was formed to promote the FPGA and reconfigurable computing community.
... They can be found in the RaPiD architecture, for instance [10], as a small programmed controller with a short instruction set. Furthermore, the GARP architecture uses a processor only to load and execute array configurations [19]. A microsequencer is an optimal solution in terms of area and speed. ...
... Reconfigurable architectures have been studied in various domains such as multimedia, signal processing, and pattern matching to exploit the spatial parallelism of large loop workloads. This research has focused on two categories: CGRAs, which compute at word granularity (Goldstein et al. 1999; Govindaraju et al. 2012; Huang et al. 2013), and FPGAs, which compute at bit granularity (Hauser et al. 1997; Hartenstein et al. 2001; Mishra et al. 2006). Reconfigurable architectures have shown outstanding benefits in performance and energy efficiency compared to fully programmable architectures for many data-parallel applications. ...
Article
The advent of 3D memory stacking technology, which integrates a logic layer and stacked memories, is expected to be one of the most promising memory technologies to mitigate the memory wall problem by leveraging the concept of near-memory processing (NMP). With the ability to process data locally within the logic layer of stacked memory, a variety of emerging big data applications can achieve significant performance and energy-efficiency benefits. Various approaches to the NMP logic layer architecture have been studied to utilize the advantage of stacked memory. While significant acceleration of specific kernel operations has been derived from previous NMP studies, an NMP-based system using an NMP logic architecture capable of handling only some specific kernel operations can suffer from performance and energy-efficiency degradation caused by significant communication overhead between the host processor and the NMP stack. In this article, we first analyze the kernel operations that can greatly improve the performance of NMP-based systems in diverse emerging applications, and then we analyze the architecture needed to efficiently process the extracted kernel operations. This analysis confirms that three categories of processing engines for NMP logic are required for efficient processing of a variety of emerging applications, and thus we propose a Triple Engine Processor (TEP), a heterogeneous near-memory processor with three types of computing engines: an in-order core, a coarse-grained reconfigurable array (CGRA), and dedicated hardware. The proposed TEP provides about 3.4 times higher performance and 33% greater energy savings than the baseline 3D memory system.
... On-chip Hardware Accelerators (HWAccs) can be coarsely classified as tightly coupled or loosely coupled. In the first case, the accelerator is directly attached to a specific core [7] [8] [9], while in the latter, it is attached to a global bus and shared among multiple cores [10] [11] [12]. This work targets the second type: loosely coupled accelerators given as BIPs mapped onto a CSoC. ...
... There are many approaches to accelerating CPUs with very tightly coupled FPGAs. Famous examples are the Dynamic Instruction Set Computer (DISC) [6] and Garp [7], and most of these approaches could benefit from the cryptographic library presented in this paper. ...
Conference Paper
Full-text available
Many CPU design houses have added dedicated support for cryptography in recent processor generations, including Intel, IBM, and ARM. While adding accelerators and/or dedicated instructions boosts performance on cryptography, we investigate a different approach that does not add extra silicon area: we study replacing the hardened NEON SIMD unit of an ARM Cortex-A9 with an identically sized FPGA fabric, called an interlay. This is used for implementing cryptographic instructions in soft logic. We show that this approach can outperform the hardened NEON by up to 7.7× on AES and provide functionality that is not available in the hardened ARM.
... Due to the wide range of applications of MPSoC systems, parallelism transfers from the instruction level to the task level. One way is to apply a reconfigurable FPGA system and assimilate acceleration engines like Chimaera [5], Garp [6], and OneChip [7]. In recent times, MPSoC scheduling has received more consideration. In paper [8], Carvalho and Moraes studied the function of mapping processes in Network-on-Chip-based Ht-MPSoCs. ...
Article
Full-text available
In this paper, the influence of the dynamic task scheduling process is examined. Out-of-Order (OoO) implementation processes show remarkable promise for task-level parallelism in multiprocessor system-on-chip (MPSoC) designs. Superior performance can be attained with the help of a precise mapping of tasks onto the right processors. Hence, to obtain this performance, a Particle Swarm Optimization (PSO) based task parallelism approach is presented in this paper. The software-related dynamic operations are illustrated on a heterogeneous MPSoC (Ht-MPSoC). PSO follows a cooperative population-based search modeled on the social behavior of bird flocking, merging local search approaches with global search approaches. This implementation reduces the burden of task management. The performance of the proposed design is compared with existing work in terms of area, power, and speed. © 2017, Institute of Advanced Scientific Research, Inc. All rights reserved.
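The PSO search that the abstract names is defined by its velocity and position update equations. A minimal sketch of those updates on a toy 2-D cost function — the coefficients are typical textbook values, and the cost function is a stand-in for the paper's task-mapping objective, not its actual formulation:

```python
# Minimal particle swarm optimization (PSO): each particle is pulled
# toward its personal best and the global best, with inertia w.
# Illustrative sketch on the sphere function (minimum at the origin).
import random

def pso(cost, dim=2, n=20, iters=200, w=0.7, c1=1.5, c2=1.5, seed=1):
    rng = random.Random(seed)
    pos = [[rng.uniform(-5, 5) for _ in range(dim)] for _ in range(n)]
    vel = [[0.0] * dim for _ in range(n)]
    pbest = [p[:] for p in pos]                  # personal bests
    gbest = min(pbest, key=cost)[:]              # global best
    for _ in range(iters):
        for i in range(n):
            for d in range(dim):
                vel[i][d] = (w * vel[i][d]
                             + c1 * rng.random() * (pbest[i][d] - pos[i][d])
                             + c2 * rng.random() * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            if cost(pos[i]) < cost(pbest[i]):
                pbest[i] = pos[i][:]
                if cost(pbest[i]) < cost(gbest):
                    gbest = pbest[i][:]
    return gbest

best = pso(lambda x: sum(v * v for v in x))
print(best)  # converges close to [0, 0]
```

For discrete task-to-processor mapping, the continuous positions would additionally need an encoding/rounding step, which the sketch omits.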
... Reconfigurable architectures have evolved greatly in recent years and much research has been carried out to address various design parameters of generic reconfigurable architectures. Some approaches use standard fine-grained reconfigurable architectures like commercial FPGAs, while others contain hard-core processors coupled with soft-core reconfigurable coprocessors (e.g., GARP [8]). Similarly, coarse-grained reconfigurable architectures (CGRAs) have attracted a lot of attention from the research community as well, and there has been impressive work in the domain of optimizing application-to-CGRA mapping (e.g. ...
Conference Paper
Full-text available
High-level simulation tools are used for optimization and design space exploration of digital circuits for a target Field Programmable Gate Array (FPGA) or Application Specific Integrated Circuit (ASIC) implementation. Compared to ASICs, FPGAs are slower and less power-efficient, but they are programmable, flexible, and offer faster prototyping. One reason for the slow performance of FPGAs is their finer granularity, as they operate at the bit level. To overcome this problem, the concept of Coarse Grained Reconfigurable Architectures (CGRAs), whose granularity is at the word level, was introduced. A whole variety of CGRAs already exists, distinguished by their architectural parameters. However, CGRA research lags in the design automation arena, since high-level simulation and optimization tools targeted at CGRAs are nearly non-existent. In this paper, we address this shortcoming and present a high-level simulation and optimization framework for mesh-based homogeneous CGRAs. As expected, the results show that auto-generated homogeneous CGRAs consume 54% more resources than custom academic FPGAs while providing around 63.3% faster application-to-architecture mapping time.
... Scalera and Vazquez [29] developed the first practical multi-context FPGA on a 0.35 µm technology, called the Context Switching Reconfigurable Computer (CSRC), which can store up to four configurations concurrently. GARP was another dynamically reconfigurable architecture that combined reconfigurable hardware with a standard MIPS processor [14]. Tabula also supported dynamic reconfiguration in their FPGA platform, based on the concept of Spacetime technology, which enables rapid reconfiguration of hardware by fast context switching of locally stored configuration bits [31]. ...
Article
Full-text available
Dynamically reconfigurable architectures, such as NATURE, achieve high logic density and low reconfiguration latency compared to traditional field-programmable gate arrays. Unlike fine-grained NATURE, the NATURE architecture incorporating a reconfigurable DSP block achieves significant performance improvement when mapping compute-intensive arithmetic operations. However, the DSP block fails to fully exploit the potential provided by run-time reconfiguration. This paper presents a pipeline-reconfigurable DSP architecture targeting the NATURE platform, which supports temporal logic folding. The proposed approach allows the DSP pipeline stages to be reconfigured independently, such that different functions can be performed distinctly and individually at every clock interval during runtime. In addition, a multistage clock gating technique is used in the design to minimize power consumption. We also extend the NanoMap tool for mapping circuits on the NATURE platform to exploit the pipeline-level reconfigurability of our proposed DSP block, enabling efficient resource sharing and area/power reduction. Simulation results on 13 benchmarks show that the proposed approach enables an area-delay improvement of up to 3.6× compared to the fine-grained NATURE architecture. The proposed architecture also delivers a 31.42% reduction in area and a maximum of 4.18× improvement in power-delay compared to the existing NATURE architecture. We also observe average improvements of 29% and 54.13% in performance and area when compared to the commercial Xilinx Spartan-3A DSP platform, thereby allowing designers to tune circuit implementations for area, power, or performance benefits.
Article
Full-text available
Reconfigurable systems represent a middle trade-off between speed and flexibility in the processor design world. They provide performance close to custom hardware and yet preserve some of the general-purpose processor's flexibility. Recently, the area of reconfigurable computing has received considerable interest in both its forms: FPGAs and coarse-grain hardware. Since the field is still in its developing stage, it is important to perform hardware analysis and evaluation of certain key applications on target reconfigurable architectures to identify potential limitations and improvements. This paper presents the mapping and performance analysis of two encryption algorithms, namely Rijndael and Twofish, on a coarse-grain reconfigurable platform, namely MorphoSys. MorphoSys is a reconfigurable architecture targeted at multimedia applications. Since many cryptographic algorithms involve bitwise operations, a bitwise instruction set extension is proposed to enhance performance. We present the details of the mapping of the bitwise operations involved in the algorithms with thorough analysis. The methodology we used can be utilized in other systems.
Article
Modern applications require hardware accelerators to maintain energy efficiency while satisfying increasing computation requirements. However, with evolving standards and rapidly changing algorithmic complexity, as well as rising design costs at advanced technology nodes, the iterative development of inflexible accelerators for such applications becomes ineffective. Reconfigurable architectures can provide higher throughput and the required flexibility, but with substantial energy- and area-efficiency overhead relative to dedicated accelerators. We develop a domain-specific, energy- and area-efficient (within 2×-10× of dedicated accelerators) multiprogram runtime-reconfigurable accelerator called the universal digital signal processor (UDSP). The design maximizes generality and resource utilization for signal processing and linear algebra with minimal area and energy penalty. The statistics-driven multilayer network minimizes network delay and consists of an optimized switchbox design that maximizes connectivity per hardware cost. The multilayered interconnect network is linearly scalable with the number of processing elements and allows for intra-dielet and multi-dielet scaling. The network features deterministic routing and timing for fast program compilation, and its translation and rotation symmetries allow for hardware resource reallocation. Multi-dielet scaling is enabled by energy-efficient, high-bandwidth, and high-density inter-dielet communication channels that seamlessly extend the intra-dielet routing network across dielet boundaries using a 10-µm fine-pitch silicon interconnect fabric (Si-IF) interposer. A 2×2 multi-dielet UDSP on Si-IF can achieve a peak energy efficiency of 785 GMACs/J at 0.42 V and 315 MHz. The inter-dielet communication channel, SNR-10, provides a shoreline bandwidth density of 297 Gb/s/mm at 1.1 Gb/s/pin at 0.8 V and a nominal energy efficiency of 0.38 pJ/bit.
Chapter
This paper introduces a computer architecture where part of the instruction set architecture (ISA) is implemented on small highly-integrated field-programmable gate arrays (FPGAs). Small FPGAs inside a general-purpose processor (CPU) can be used effectively to implement custom or standardised instructions. Our proposed architecture directly addresses related challenges for high-end CPUs, where such highly-integrated FPGAs would have the highest impact, such as main memory bandwidth. It also enables software-transparent context switching. The simulation-based evaluation of a dynamically reconfigurable core shows promising results, approaching the performance of an equivalent core with all instructions enabled. Finally, the feasibility of adopting the proposed architecture in today's CPUs is studied through the prototyping of fast-reconfigurable FPGAs and profiling of the miss behaviour of opcodes. Keywords: Computer architecture, Memory hierarchy, Reconfigurable extensions
Article
Full-text available
RISC-based embedded processors, aimed at low cost and low power, are becoming an increasingly popular ecosystem for both hardware and software development. High-performance yet low-power embedded processors may be attained via the use of hardware acceleration and Instruction Set Architecture (ISA) extension. Efficient mapping of the computational load onto hardware and software resources is a key challenge for performance improvement while still keeping power and area low. Furthermore, exploring performance at an early stage of the design makes this challenge more difficult. Potential hardware accelerators can be identified and extracted from the high-level source code by graph analysis to enumerate common patterns. A scheduling algorithm is used to select an optimized subset of accelerators to meet real-time constraints. This paper proposes an efficient hardware/software codesign partitioning methodology applied to a high-level programming language at an early stage of the design. The proposed methodology is based on graph analysis; the applied algorithms are represented by a synchronous directed acyclic graph. A constraint-driven method and a unique scheduling algorithm are used for graph partitioning to obtain overall speedup and area requirements. The proposed hardware/software partitioning methodology has been evaluated on the MLPerf Tiny benchmark. Experimental results demonstrate a speedup of up to three orders of magnitude compared to a software-only implementation. For example, the runtime of the KWS (Keyword Spotting) software implementation is reduced from 206 s to only 181 ms using the proposed hardware-acceleration approach.
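The evaluation step such a partitioner needs — given a DAG of kernels with software and hardware latencies, what is the end-to-end latency for a chosen accelerator set — can be sketched as a longest-path computation. All kernel names, timings, and dependencies below are hypothetical; this is not the paper's algorithm or its benchmark data:

```python
# Sketch: critical-path latency of a synchronous DAG of kernels, where
# each kernel runs either in software or on a (faster) hardware
# accelerator. Illustrative numbers only.
from functools import lru_cache

sw = {"fft": 50, "mfcc": 30, "dnn": 120, "argmax": 5}   # software cycles
hw = {"fft": 8,  "mfcc": 6,  "dnn": 15,  "argmax": 5}   # accelerated cycles
deps = {"fft": [], "mfcc": ["fft"], "dnn": ["mfcc"], "argmax": ["dnn"]}

def latency(accelerated):
    # Finish time of a node = its own latency + latest predecessor finish.
    @lru_cache(maxsize=None)
    def finish(node):
        t = hw[node] if node in accelerated else sw[node]
        return t + max((finish(d) for d in deps[node]), default=0)
    return max(finish(n) for n in deps)

print(latency(frozenset()))                # software-only baseline: 205
print(latency(frozenset({"fft", "dnn"})))  # accelerate two hot kernels: 58
```

A partitioner would call such an evaluator inside a search loop, choosing the accelerator subset that meets the real-time constraint at the lowest area cost.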
Chapter
Modern embedded applications require high computational performance under severe energy constraints. For applications in the digital signal processing domain, field programmable gate arrays (FPGAs) provide a very flexible processing platform. However, due to bit-level reconfigurability, the overhead of such architectures is high, in energy, area, and performance alike. Coarse-grained reconfigurable architectures remove much of this overhead, at the expense of some flexibility. Although many publications on CGRAs have been presented in the past, what exactly constitutes a CGRA is not really clear. For this reason, this chapter defines what a CGRA is and evaluates this definition against a large set of previously presented architectures. The definition depends on the reconfiguration granularity of an architecture in both the temporal and spatial domains. Furthermore, the chapter provides the reader with an overview of the investigated CGRAs and suggests some research topics that could improve CGRAs in the future.
Conference Paper
Full-text available
This study presents an analysis of the influence and development of Supply Chain Management (SCM) at the Misurata Textile Factory (MTF) in Libya. Parameters such as leadership, supplier management, vision and plan statement, evaluation, process control and improvement, and customer focus are analysed to establish whether the current implementation of SCM systems at MTF is well organized. Based on interviews with top management and middle managers of the firm, the findings indicate that communication and knowledge transfer between workers are limited, top management empowerment has not yet been implemented, and a computerized information system does not exist due to the firm's pursuit of immediate profits and short-term benefits. In conclusion, current SCM practices at the factory do not follow the full SCM implementation model; the firm could, however, use this model to identify which areas urgently need improvement.
Article
This article deals with reconfigurable uniprocessor systems powered by a renewable energy source under real-time and resource-sharing constraints. A reconfigurable system is defined as a set of implementations, each of which is encoded by real-time periodic software tasks. Reconfiguration is a flexible runtime scenario that adapts the current system's implementation to any related environment evolution under well-defined conditions. A task is characterized by an effective calculated deadline that should be less than a maximum deadline defined in the user requirements. The main problem is how to calculate the effective deadlines of the different periodic tasks in the different implementations under the (possibly predicted) renewable energy source and the resource-sharing constraints. We propose an offline method based on three solutions to calculate task deadlines. The first computes the deadlines ensuring real-time feasibility and also minimizes the number of context switches by assigning the highest priority to the task with the smallest maximum deadline. The second computes the deadlines ensuring that energy constraints are respected, and the third computes the deadlines ensuring that resource-sharing constraints are respected. These three solutions calculate the possible deadlines of each task within the hyperperiod of the corresponding implementations. We develop a new simulator called DEAD-CALC, which integrates a new tool called RANDOM-TASK for applying and evaluating the proposed solutions. The conducted experiments prove that this methodology provides deadlines while affecting neither the load nor the processor speed, and while reducing calculation time.
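The deadline computations described above operate over the hyperperiod, i.e. the least common multiple of the task periods. A minimal sketch of that quantity together with the classical utilization-based feasibility check for periodic tasks — illustrative textbook formulas with made-up task parameters, not the paper's DEAD-CALC method:

```python
# Hyperperiod (LCM of periods) and processor utilization for a periodic
# task set. Utilization <= 1 is the classical EDF feasibility condition
# on one processor (implicit deadlines, no resource sharing).
from math import gcd
from functools import reduce

def hyperperiod(periods):
    return reduce(lambda a, b: a * b // gcd(a, b), periods)

def utilization(tasks):
    # tasks: list of (wcet, period)
    return sum(c / t for c, t in tasks)

tasks = [(1, 4), (2, 6), (3, 12)]          # hypothetical (WCET, period) pairs
print(hyperperiod([t for _, t in tasks]))  # 12
print(utilization(tasks) <= 1.0)           # True: schedulable under EDF
```

Energy and resource-sharing constraints, as in the article, shrink the feasible deadline space further; the hyperperiod simply bounds the window over which those per-job deadlines must be enumerated.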
Conference Paper
During the last decade, high-performance grid networks have been successfully used to aggregate heterogeneous resources to run large-scale, complex scientific and engineering applications from different domains. Simultaneously, reconfigurable computing is becoming more flexible in terms of programmability, increasing the performance of many applications. We propose a Collaborative Reconfigurable Computing (CRC) framework that enables the use of reconfigurable processors in current and future grid networks, opening a new dimension for high-performance computing. The platform can be used to target a wide range of compute-intensive applications, generally targeted by costly supercomputers. In this paper, we investigate the current grid middleware solutions for incorporating a CRC node into a grid network. In addition, we introduce the advantages of the CRC platform in high-performance computing. We also survey the current and future grid computing applications that can benefit from such a platform. Finally, we propose some future research trends in this field.
Article
Full-text available
In recent years, with the development of microelectronics technology, microsystems based on SIP/SoC have been applied to drones and other avionics systems. It is therefore practical to study reconfigurable computing for avionics microsystems that must adapt flexibly to their usage scenarios. FPGA-based encryption/decryption of data can adapt to different application environments and functional requirements, which is especially important for product protection. However, implementing multiple algorithms on the same chip leads to increased logic resource consumption, low resource utilization, and poor system flexibility. In view of the above problems, it is necessary to design a dynamic reconfigurable computing platform based on an aviation SIP micro-system chip with dynamic reconfigurable technology as its core (using a Zynq SoC). The platform uses the on-chip ARM processor to control reconfiguration. The different encryption and decryption algorithm logics are configured into different logical partitions on the chip according to requirements; the logic circuit is updated and the algorithm is reconstructed. The different encryption and decryption algorithms are implemented using the HLS method. The verification results show that the design can complete algorithm switching at a high configuration speed while the other functions on the chip continue to work normally. Under the premise of ensuring system stability, on-chip logic resource consumption is reduced, and resource utilization and system flexibility are improved.
Article
Microprocessors are designed to provide good general performance across a range of benchmarks. As such, microarchitectural techniques which provide good speedup for only a small subset of applications are not attractive when designing a general-purpose core. We propose coupling a reconfigurable fabric with the CPU, on the same chip, via a simple and flexible interface to allow post-silicon development of application-specific microarchitectures. The interface supports observation and intervention at key pipeline stages of the CPU, so that exotic microarchitecture designs (with potentially narrow applicability) can be synthesized in the reconfigurable fabric and seem like components that were hardened into the core.
Article
This work proposes three different methods to automatically characterize heterogeneous MPSoCs composed of a variable number of masters (in the form of processors) and hardware accelerators (HWaccs). These hardware accelerators are given as Behavioral IPs (BIPs) mapped as loosely coupled accelerators on a shared bus system (i.e. AHB, AXI). BIPs have a distinct advantage over traditional RT-level IPs given in VHDL or Verilog: the ability to generate micro-architectures with different area vs. performance trade-offs from the same description. This is usually done by specifying different synthesis directives in the form of pragmas, which in turn implies that using different mixes of the accelerators' micro-architectures leads to SoCs with unique area vs. performance trade-offs. Two of the three methods proposed are based on cycle-accurate simulations of the complete MPSoC, while the third method accelerates this exploration by performing it on a Configurable SoC FPGA. Extensive experimental results compare these three methods and highlight their strengths and weaknesses.
Conference Paper
This paper discusses the key technologies in the design of a heterogeneous multi-core architecture for dynamic reconfiguration from the perspective of architecture. The design ideas are introduced in terms of the interconnection structure of the processors and the coordination and dynamic scheduling of soft and hard cores. Further, a design scheme for a heterogeneous mobile terminal architecture is presented. As a result, the flexibility and usability of mobile terminals are enhanced by the heterogeneous multi-core and dynamic reconfiguration design.
Thesis
Full-text available
With the increase in application complexity and amount of data, the required computational power increases in tandem. Technology improvements have allowed for increased clock frequencies across all kinds of processing architectures. But exploring new architectures and computing paradigms beyond the simple single-issue in-order processor is equally important for increasing performance, by properly exploiting the data parallelism of demanding tasks. For instance: superscalar processors, which discover instruction parallelism at runtime; Very Long Instruction Word processors, which rely on compile-time parallelism discovery and multiple issue units; and multi-core approaches, which are based on thread-level parallelism exploited at the software level. For embedded applications, depending on the performance requirements or resource constraints, and if the application is composed of well-defined tasks, a more application-specific system may be appropriate, i.e., an Application-Specific Integrated Circuit (ASIC). Custom logic delivers the best power/performance ratio, but this solution requires advanced hardware expertise, implies long development times, and above all very high manufacturing costs. This work designed and evaluated a transparent binary acceleration approach targeting Field Programmable Gate Array (FPGA) devices, which relies on instruction traces to automatically generate specialized accelerator instances. A custom accelerator, capable of executing a set of previously detected loop traces, is coupled to a host MicroBlaze processor. The traces are detected via simulation of the target binary. The approach does not require the application source code to be modified, which ensures transparency for the application developer. No custom compilers are necessary, and the binary code does not need to be altered either offline or at runtime.
The accelerators contain per-instance reconfiguration capabilities, which allow for the reuse of computing units between accelerated loops without sacrificing the benefits of circuit specialization. To increase the achievable performance, the accelerator is capable of performing two concurrent memory accesses to the MicroBlaze's data memory. The repetitive nature of the loop traces is exploited via loop pipelining, which maximizes the achievable acceleration. By supporting single-precision floating-point operations via fully-pipelined units, the accelerator is capable of executing realistic data-oriented loops. Finally, the use of Dynamic Partial Reconfiguration (DPR) allows for significant area savings when instantiating accelerators with numerous configurations, and also ensures circuit specialization per configuration. Several fully functional systems were implemented, using commercial FPGAs, to validate the design iterations of the accelerator. An initial design relied on translating Control and Dataflow Graph representations of the traces into a multi-row array of Functional Units. For 15 benchmarks, the geometric mean speedup was 2.08×. A second implementation augmented the accelerator with shared memory access to the entire local data memory of the MicroBlaze. Arbitrary addresses can be accessed without the need for address generation hardware. Exploiting data parallelism allows for targeting larger, more realistic traces. The geometric mean speedup for 37 benchmarks was 2.35×. The most efficient implementation supports floating-point operations and relies on loop pipelining. The developed tools generate an accelerator instance by modulo-scheduling each trace at the minimum possible Initiation Interval. The geometric mean speedup for a set of 24 benchmarks is 5.61×, and the accelerator requires only 1.12× the FPGA slices required by the MicroBlaze.
Finally, resorting to DPR, an accelerator with 10 configurations requires only a third of the Lookup Tables relative to an equivalent accelerator without this capability. To summarize, the approach is capable of expediently generating accelerator-augmented embedded systems which achieve considerable performance increases whilst incurring a low resource cost, and without requiring manual hardware design.
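Modulo scheduling, as used by the loop accelerator above, starts from a minimum initiation interval (II): the larger of a resource-constrained bound (ResMII) and a recurrence-constrained bound (RecMII). A small illustrative calculation — operation counts, unit counts, and dependence cycles are made-up numbers, not taken from the thesis:

```python
# Minimum initiation interval for modulo scheduling:
#   ResMII = max over resource types of ceil(ops_needed / units_available)
#   RecMII = max over dependence cycles of ceil(latency / distance)
from math import ceil

def res_mii(op_counts, unit_counts):
    return max(ceil(op_counts[k] / unit_counts[k]) for k in op_counts)

def rec_mii(recurrences):
    # recurrences: list of (total latency around cycle, dependence distance)
    return max(ceil(lat / dist) for lat, dist in recurrences)

ops    = {"alu": 6, "mul": 2, "mem": 4}   # ops per loop iteration (hypothetical)
units  = {"alu": 2, "mul": 1, "mem": 2}   # functional units in the accelerator
cycles = [(4, 1), (6, 2)]                 # loop-carried dependence cycles

ii = max(res_mii(ops, units), rec_mii(cycles))
print(ii)  # 4: here the recurrence bound dominates the resource bound
```

The scheduler then tries to place all operations at this II, incrementing it only when placement fails, which is why scheduling "at the minimum possible Initiation Interval" maximizes the loop throughput.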
Thesis
Full-text available
An innovative methodology to model and simulate partial and dynamic reconfiguration is presented in this work. As dynamic reconfiguration can be seen as the removal and reinsertion of modules into the system, the presented methodology is based on blocking the execution of unconfigured modules during simulation, without interfering with normal system activity. Since the simulator provides the possibility to remove, insert, and exchange modules during simulation, all systems modeled on this simulator can benefit from dynamic reconfiguration. As a proof of concept, modifications to the SystemC kernel were developed, adding new instructions to remove and reconfigure modules at simulation time and enabling the simulator to be used either at transaction level (TLM) or at register transfer level (RTL). At TLM it allows the modeling and simulation of higher-level hardware and embedded software, while at RTL the dynamic system behavior can be observed at the signal level. Since all abstraction levels can be modeled and simulated, every system granularity can also be considered. In the end, every system that can be simulated using SystemC can also have its behavior changed at run-time. The provided set of instructions decreases the design cycle time: compared with traditional strategies, information about dynamic and adaptive behavior becomes available at earlier stages. Three different applications were developed using the methodology, at different abstraction levels and granularities. Considerations about how best to apply dynamic reconfiguration are also made. The acquired results assist designers in choosing the best cost/benefit trade-off in terms of chip area and reconfiguration delay.
Chapter
Reconfigurable architecture is a computer architecture combining some of the flexibility of software with the high performance of hardware. It has a configurable fabric that performs a specific data-dominated task, such as image processing or pattern matching, as quickly as a dedicated piece of hardware. Once the task has been executed, the hardware can be adjusted to perform some other task. This allows the reconfigurable architecture to provide the flexibility of software with the speed of hardware. This chapter discusses the two major streams of reconfigurable architecture: Field-Programmable Gate Arrays (FPGAs) and Coarse-Grained Reconfigurable Architectures (CGRAs). It gives a brief explanation of the merits and usage of reconfigurable architecture, explains basic FPGA and CGRA architectures, and also explains techniques for mapping applications onto FPGAs and CGRAs.
Conference Paper
Full-text available
This paper explores a novel way to incorporate hardware-programmable resources into a processor microarchitecture to improve the performance of general-purpose applications. Through a coupling of compile-time analysis routines and hardware synthesis tools, we automatically configure a given set of the hardware-programmable functional units (PFUs) and thus augment the base instruction set architecture so that it better meets the instruction set needs of each application. We refer to this new class of general-purpose computers as programmable instruction set computers (PRISC). Although similar in concept, the PRISC approach differs from dynamically programmable microcode because in PRISC we define entirely new primitive datapath operations. We concentrate on the microarchitectural design of the simplest form of PRISC: a RISC microprocessor with a single PFU that only evaluates combinational functions. We briefly discuss the operating system and programming language compilation techniques that are needed to successfully build PRISC, and we present performance results from a proof-of-concept study. With the inclusion of a single 32-bit-wide PFU whose hardware cost is less than that of a 1-kilobyte SRAM, our study shows a 22% improvement in processor performance on the SPECint92 benchmarks.
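A PFU of the kind described evaluates a purely combinational function of register operands in a single instruction. A software model of one plausible candidate function — 32-bit population count, which costs a multi-instruction loop on a base RISC ISA but is a single LUT network in programmable logic (the function choice is our illustration, not an example from the paper):

```python
# Software model contrasting a base-ISA loop with a single combinational
# PFU evaluation. Population count is used as a stand-in candidate
# function; the real PFU would be a synthesized LUT tree.
def popcount_sw(x):
    # Base-ISA version: loop, many instructions per call.
    n = 0
    while x:
        x &= x - 1      # clear the lowest set bit
        n += 1
    return n

def popcount_pfu(x):
    # What the PFU computes in one combinational evaluation
    # (modeled here at word level with a 32-bit mask).
    return bin(x & 0xFFFFFFFF).count("1")

# Both views of the "custom instruction" must agree:
assert popcount_sw(0xDEADBEEF) == popcount_pfu(0xDEADBEEF)
print(popcount_pfu(0xDEADBEEF))  # 24
```

The compiler's job in such a scheme is to spot expression trees like `popcount_sw` that are combinational, synthesize them into the fabric, and rewrite the call site into the single PFU-evaluating instruction.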
Conference Paper
Full-text available
During the past decade, the microprocessor has become a key commodity component for building all kinds of computational systems. During this time frame, large reconfigurable logic arrays have exploited the same advances in IC fabrication technology to emerge as viable system building blocks. Looking at both the technology prospects and application requirements, there is compelling evidence that microprocessors with integrated reconfigurable logic arrays will be a primary building block for future computing systems. In this paper, we look at the role such components can play in building high-performance and economical systems, as well as the ripe technological outlook. We note how the tight integration of reconfigurable logic into the processor can overcome some of the major limitations of contemporary attached reconfigurable computer engines. We specifically consider the use of integrated dynamically programmable gate array (DPGA) structures for the configurable logic, and examine the advantages that rapid reconfiguration provides in this application
Article
Full-text available
A two-slot addition called Splash, which enables a Sun workstation to outperform a Cray-2 on certain applications, is discussed. Following an overview of the Splash design and programming, hardware development is described. The development of the logic description generator is examined in detail. Splash's runtime environment is described, and an example application, that of sequence comparison, is given.
Article
Many computationally-intensive tasks spend nearly all of their execution time within a small fraction of the executable code. A new hardware/software system, called PRISM, is presented which improves the performance of many of these computationally intensive tasks by utilizing information extracted at compile-time to synthesize new operations which augment the functionality of a core processor. By integrating adaptation into a general-purpose computer, one can not only reap the performance ...
Conference Paper
This paper describes a processor architecture called OneChip, which combines a fixed-logic processor core with reconfigurable logic resources. Using the programmable components of this architecture, the performance of speed-critical applications can be improved by customizing OneChip's execution units, or flexibility can be added to the glue logic interfaces of embedded controller applications. OneChip eliminates the shortcomings of other custom compute machines by tightly integrating its reconfigurable resources into a MIPS-like processor. Speedups of close to 50 over strict software implementations on a MIPS R4400 are achievable for computing the DCT.
Conference Paper
Free text database searching is a natural candidate for acceleration by run time reconfigurable custom computing machines. We describe a fully pipelined search machine architecture for scoring the relevance of textual documents against approximately 100 relevant target words, with provision for limited regular expression matching and error tolerance. An implementation on the SPACE custom computing platform indicates that throughput in the order of 20 megabytes per second is achievable on Algotronix FPGAs if a locally synchronous design style is adopted and global communications minimized. Partial reconfiguration of the datapath at run time, in around 3 seconds, serves to maximize the density of data storage on the machine and correspondingly avoid costly input from the environment.
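A sequential software analogue of the relevance-scoring pipeline described above: a document streams past a fixed set of target words and a per-document score accumulates. The target words and weights here are invented for the example; in the FPGA design the comparisons against all ~100 targets happen in parallel on every cycle.

```python
# Software sketch of a streaming document-relevance scorer.
# TARGETS stands in for the ~100 configured target words; the weights
# are made-up values for illustration.

TARGETS = {"reconfigurable": 3, "fpga": 2, "pipeline": 1}

def score_document(text):
    """Stream words past the target set, accumulating a score."""
    score = 0
    for word in text.lower().split():
        score += TARGETS.get(word.strip(".,;:"), 0)
    return score

print(score_document("A reconfigurable FPGA pipeline, on an FPGA."))  # 8
```

Run-time reconfiguration in the hardware version corresponds to rebuilding the matcher when the target word list changes, rather than updating a dictionary.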
Conference Paper
A dynamic instruction set computer (DISC) has been developed that supports demand-driven modification of its instruction set. Implemented with partially reconfigurable FPGAs, DISC treats instructions as removable modules paged in and out through partial reconfiguration as demanded by the executing program. Instructions occupy FPGA resources only when needed and FPGA resources can be reused to implement an arbitrary number of performance-enhancing application-specific instructions. DISC further enhances the functional density of FPGAs by physically relocating instruction modules to available FPGA space.
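The demand-driven paging of instruction modules can be mimicked with a small cache model; the slot capacity, module names, and FIFO eviction policy below are illustrative assumptions, not DISC's actual mechanism:

```python
# Sketch of DISC-style demand paging: custom instruction "modules"
# occupy limited FPGA area only while resident, and are evicted to
# make room for others via (simulated) partial reconfiguration.

from collections import OrderedDict

class InstructionCache:
    def __init__(self, capacity_slots):
        self.capacity = capacity_slots
        self.resident = OrderedDict()   # module name -> implementation
        self.reconfigurations = 0

    def execute(self, name, impl, *args):
        if name not in self.resident:
            if len(self.resident) >= self.capacity:
                self.resident.popitem(last=False)   # FIFO: evict oldest module
            self.resident[name] = impl              # "partial reconfiguration"
            self.reconfigurations += 1
        return self.resident[name](*args)

fpga = InstructionCache(capacity_slots=2)
fpga.execute("popcount", lambda x: bin(x).count("1"), 0xFF)
fpga.execute("parity", lambda x: bin(x).count("1") & 1, 0xFE)
fpga.execute("popcount", lambda x: bin(x).count("1"), 7)         # still resident: no reload
fpga.execute("bitrev8", lambda x: int(f"{x:08b}"[::-1], 2), 1)   # evicts popcount
print(fpga.reconfigurations)  # 3
```

The payoff the abstract describes is exactly this reuse: a fixed FPGA area serves an arbitrary number of application-specific instructions over time.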
Conference Paper
This paper evaluates the feasibility of reconfiguring an FPGA at run time, and tests its performance using a “Grand Challenge Problem”, the high speed scanning of genomic sequence databases. Algorithm implementation into a XC3090 FPGA is described, and methods proposed for generating a placed Xilinx Netlist File that can be efficiently routed at run time by the Automated Placing and Routing Xilinx tools, in order to increase the speed and the density of the design. The same algorithm carefully optimised on a RISC processor has been compared with the run-time reconfigured FPGA, and shows the latter to have an improvement in speed of two to three orders of magnitude.
Article
Programmable active memories (PAMs) are a novel form of universal reconfigurable hardware coprocessor. Based on field-programmable gate array (FPGA) technology, a PAM is a virtual machine, controlled by a standard microprocessor, which can be dynamically and indefinitely reconfigured into a large number of application-specific circuits. PAMs offer a new mixture of hardware performance and software versatility. We review the important architectural features of PAMs, through the example of DECPeRLe-1, an experimental device built in 1992. PAM programming is presented, in contrast to classical gate-array and full custom circuit design. Our emphasis is on large, code-generated synchronous systems descriptions; no compromise is made with regard to the performance of the target circuits. We exhibit a dozen applications where PAM technology proves superior, both in performance and cost, to every other existing technology, including supercomputers, massively parallel machines, and conventional custom hardware. The fields covered include computer arithmetic, cryptography, error correction, image analysis, stereo vision, video compression, sound synthesis, neural networks, high-energy physics, thermodynamics, biology and astronomy. At comparable cost, the computing power virtually available in a PAM exceeds that of conventional processors by a factor of 10 to 1000, depending on the specific application, in 1992. A technology shrink increases the performance gap between conventional processors and PAMs. By Noyce's law, we predict by how much the performance gap will widen with time.
Article
The processor reconfiguration through instruction-set metamorphosis (PRISM) general-purpose architecture, which speeds up computationally intensive tasks by augmenting the core processor's functionality with new operations, is described. The PRISM approach adapts the configuration and fundamental operations of a core processing system to the computationally intensive portions of a targeted application. PRISM-1, an initial prototype system, is described, and experimental results that demonstrate the benefits of the PRISM concept are presented
Peter M. Athanas and Harvey F. Silverman, "Processor reconfiguration through instruction-set metamorphosis," IEEE Computer, vol. 26, no. 3, pp. 11-18, Mar. 1993.

Michael J. Wirthlin and Brad L. Hutchings, "A dynamic instruction set computer," in Proceedings of the IEEE Symposium on FPGAs for Custom Computing Machines, Apr. 1995, pp. 99-107.

Bruce Schneier, Applied Cryptography: Protocols, Algorithms, and Source Code in C, 2nd edition, John Wiley and Sons, 1996.