Conference Paper

ADRES: An Architecture with Tightly Coupled VLIW Processor and Coarse-Grained Reconfigurable Matrix


Abstract

Coarse-grained reconfigurable architectures have advantages over traditional FPGAs in terms of delay, area, and configuration time. To execute entire applications, most of them combine an instruction-set processor (ISP) and a reconfigurable matrix. However, little attention has been paid to the integration of these two parts, which results in high communication overhead and programming difficulty. To address this problem, we propose a novel architecture with a tightly coupled very long instruction word (VLIW) processor and coarse-grained reconfigurable matrix. Its advantages include a simplified programming model, shared resource costs, and reduced communication overhead. To exploit this architecture, our previously developed compiler framework is adapted to the new architecture. The results show that the new architecture delivers good performance and is very compiler-friendly.
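The coupling argument can be made concrete with a toy model. The sketch below is illustrative only (class names and values are invented, not taken from the paper): in a tightly coupled design, the VLIW processor and the reconfigurable matrix execute in mutually exclusive modes over one shared register file, so handing a value to the matrix costs no explicit transfer, whereas a loosely coupled coprocessor pays a copy per handoff.

```python
# Illustrative sketch, not the actual ADRES implementation: contrast the
# number of explicit data transfers in tightly vs. loosely coupled designs.

class TightlyCoupled:
    def __init__(self):
        self.rf = {}          # register file shared by VLIW and matrix
        self.copies = 0       # explicit transfers performed

    def vliw_phase(self):
        self.rf["r1"] = 42    # VLIW writes a live value

    def matrix_phase(self):
        return self.rf["r1"]  # matrix reads it directly: no transfer

class LooselyCoupled:
    def __init__(self):
        self.cpu_rf, self.acc_mem = {}, {}
        self.copies = 0

    def vliw_phase(self):
        self.cpu_rf["r1"] = 42

    def transfer(self):       # explicit DMA/bus copy before offload
        self.acc_mem["r1"] = self.cpu_rf["r1"]
        self.copies += 1

    def matrix_phase(self):
        return self.acc_mem["r1"]
```

In this model the tightly coupled variant performs zero copies per mode switch; the communication overhead the abstract mentions is exactly the per-handoff `transfer` step that disappears.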


... Other examples are Tartan [9], BilRC [5], HyCUBE [6,16], Plasticine [17], RipTide [10] and Amber [7]. In contrast to this, shared-PE architectures [8,[18][19][20] facilitate greater utilization of logic resources. Statically scheduled shared-PE architectures share many properties of VLIW processors [21]. ...
... To reduce the delay of a linear network, Ultrascalar proposes a logarithmic topology [22]. Using ADRES [18] as a template, Lambrechts et al. [28] evaluated the energy and performance impact of various mesh and bus interconnects. In contrast to the interconnect exploration in this work, ADRES machine code must be recompiled for each interconnect network modification. ...
... Most CGRAs use custom instruction sets (bitstream formats) and require machine code to be generated ahead of time. While some support the C programming model [10,18], others require the use of custom domain-specific programming languages [5,7]. Spatial and temporal aspects impose unique challenges and opportunities on CGRA compiler design, including placement/binding, routing and scheduling problems. ...
Article
Full-text available
This work presents Pasithea, a coarse-grained reconfigurable array (CGRA) that combines energy efficiency with CPU-like programmability. Its extensible instruction set uses sequential control flow in code fragments of up to 64 RISC-like instructions, which encode control and dataflow graphs in adjacency lists. Combined with dedicated, uniform processing elements, this enables fast compilation from C source code (1.4 mean compile time). Demonstrator measurements reveal energy efficiency of up to 601 int32 MIPS/mW at 0.59 and performance of up to 148 MIPS at 0.90. Compared to a RISC reference system, mean energy efficiency is improved by 2.24×, with 1.71× higher execution times, across 12 of 14 benchmarks. Program-dependent factors underlying variations in energy efficiency are identified using dynamic program analysis. To reduce operand-transfer energy, seven interconnect topologies are evaluated: a flat bus, five crossbar variants, and a logarithmic network. The best results are obtained for a crossbar topology, reducing mean dynamic tile energy by 19%. Furthermore, floating-point (FP) support is added to the instruction set and evaluated using three binary-compatible microarchitectures presenting distinct area-performance-energy tradeoffs. The interconnect and FP microarchitecture explorations demonstrate that, unlike CGRAs utilizing low-level bitstreams, Pasithea's instruction set hides microarchitectural details, which makes it possible to optimize hardware without severing binary compatibility.
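The adjacency-list encoding mentioned above can be sketched in a few lines. This is a hypothetical illustration of the general idea, not Pasithea's actual binary format (the opcodes and tuple layout are invented): each instruction lists the indices of the instructions that consume its result, so values flow along explicit dataflow edges rather than through named registers.

```python
# Hypothetical adjacency-list dataflow fragment: (opcode, immediate,
# successor indices). Each instruction forwards its result to the
# instructions listed in its adjacency list.

fragment = [
    ("li",  5,    [2]),   # 0: load immediate 5 -> feeds instr 2
    ("li",  7,    [2]),   # 1: load immediate 7 -> feeds instr 2
    ("add", None, [3]),   # 2: add its two inputs -> feeds instr 3
    ("ret", None, []),    # 3: return the delivered value
]

def evaluate(fragment):
    """Interpret the fragment by pushing results along adjacency lists."""
    inbox = [[] for _ in fragment]      # operands delivered to each instr
    for idx, (op, imm, succs) in enumerate(fragment):
        if op == "li":
            val = imm
        elif op == "add":
            val = sum(inbox[idx])
        elif op == "ret":
            return inbox[idx][0]
        for s in succs:                 # forward along dataflow edges
            inbox[s].append(val)
```

Here `evaluate(fragment)` yields 12; the point is that the control-flow order and the dataflow edges are both explicit in the encoding.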
... Reducing these costs is highly desirable. For this reason, some existing work proposes templated RTL that permits customization before synthesis [10]- [12], while others rely on in-house, hand-written (System)Verilog generators [13]- [15]. However, such implementations are error-prone and may severely limit the possible extent of one's exploration. ...
... For the same reason, and as many compute-heavy kernels can be mapped to few PEs with relatively few contexts, the PE array's dimensions are often kept small [7], [24]. 1) Architectures: There exist many CGRA architectures that differ mainly in their PE designs. However, as we focus on ones used as co-processors to regular CPUs, commonalities between these designs include a context register in the PE [9], [10], [14], [25] or in the centralized control logic [8], an Arithmetic Logic Unit (ALU) with or without support for predication [10], and either multiplexer- [10], [14] or crossbar-based routing hardware [8], [9], [25]. Most architectures also - [9]. ...
Conference Paper
Full-text available
The popularity of the Internet of Things and next-generation wireless networks calls for a greater distribution of small but high-performance and energy-efficient compute devices at the networks’ Edge. These devices must integrate hardware acceleration to meet the latency requirements of relevant use cases. Existing work has highlighted Coarse-Grained Reconfigurable Arrays (CGRAs) as suitable compute architectures for this purpose. However, like other modern hardware design, research and design space exploration into CGRAs is hindered by long development times needed for Register Transfer Level implementation. In this paper, we propose mitigating these by extending the open-source CGRAgen tool with a Chisel-based hardware backend capable of transforming abstract Processing Element (PE) descriptions into synthesizable Verilog code. We present how CGRAgen’s internal module representation is transformed to Chisel modules and demonstrate this on a selection of PE architectures from the literature. Finally, we outline future work on extending this flow to generate entire CGRAs.
... [Figure 1: Spatial vs. spatio-temporal.] These architectures provide word-level reconfigurability, balancing adaptability and efficiency, with significant research targeting both edge [10,11,20,26] and server domains [4,30,36,38]. A Coarse-Grained Reconfigurable Architecture (CGRA) uses simple PEs connected by a reconfigurable on-chip network to form high-throughput datapaths. ...
... Spatial architectures [10][11][12]31] maintain a fixed mapping of compute and communication, ensuring low configuration energy but potentially lower performance. Spatio-temporal architectures [20,26,39,42], on the other hand, allow each PE to reconfigure to a new instruction every cycle, enhancing performance but potentially increasing energy consumption [1]. These architectures excel in accelerating regular workloads with predictable access patterns and minimal control flow, such as DSP workloads including dense linear algebra. ...
Preprint
Full-text available
Modern reconfigurable architectures are increasingly favored for resource-constrained edge devices as they balance high performance, energy efficiency, and programmability well. However, their proficiency in handling regular compute patterns constrains their effectiveness in executing irregular workloads, such as sparse linear algebra and graph analytics with unpredictable access patterns and control flow. To address this limitation, we introduce the Nexus Machine, a novel reconfigurable architecture consisting of a PE array designed to efficiently handle irregularity by distributing sparse tensors across the fabric and employing active messages that morph instructions based on dynamic control flow. As the inherent irregularity in workloads can lead to high load imbalance among different Processing Elements (PEs), Nexus Machine deploys and executes instructions en-route on idle PEs at run-time. Thus, unlike traditional reconfigurable architectures with only static instructions within each PE, Nexus Machine brings dynamic control to the idle compute units, mitigating load imbalance and enhancing overall performance. Our experiments demonstrate that Nexus Machine achieves 1.5x performance gain compared to state-of-the-art (SOTA) reconfigurable architectures, within the same power budget and area. Nexus Machine also achieves 1.6x higher fabric utilization, in contrast to SOTA architectures.
... However, application demands are rapidly evolving, and a more flexible approach is needed. Configurable hardware, using either coarse-grained reconfigurable arrays (CGRAs) [15,9] or FPGAs [12], is one approach to providing a flexible machine that can be configured to match the data flow of different algorithms. Although these architectures promise much better energy efficiency compared to CPUs or GPUs, programming and integrating them into complete real-world systems remains a formidable task for application developers. ...
... y, xi, yi, 256, 256).unroll(c).accelerate({in}, xi, x).parallel(y).parallel(x); ...
Preprint
Specialized image processing accelerators are necessary to deliver the performance and energy efficiency required by important applications in computer vision, computational photography, and augmented reality. But creating, "programming," and integrating this hardware into a hardware/software system is difficult. We address this problem by extending the image processing language, Halide, so users can specify which portions of their applications should become hardware accelerators, and then we provide a compiler that uses this code to automatically create the accelerator along with the "glue" code needed for the user's application to access this hardware. Starting with Halide not only provides a very high-level functional description of the hardware, but also allows our compiler to generate the complete software program including the sequential part of the workload, which accesses the hardware for acceleration. Our system also provides high-level semantics to explore different mappings of applications to a heterogeneous system, with the added flexibility of being able to map at various throughput rates. We demonstrate our approach by mapping applications to a Xilinx Zynq system. Using its FPGA with two low-power ARM cores, our design achieves up to 6x higher performance and 8x lower energy compared to the quad-core ARM CPU on an NVIDIA Tegra K1, and 3.5x higher performance with 12x lower energy compared to the K1's 192-core GPU.
... [20]. The parallelism feature of most coarse-grain platforms adds a distinctive yet essential advantage to such hardware [18]. Recent work in mesh-based coarse-grain reconfigurable architectures includes GARP (UC Berkeley) [21], MATRIX (Caltech) [22], REMARC (Stanford) [23], ADRES (IMEC) [24], and MorphoSys (UC Irvine) [3]. In view of all that, performance and hardware analysis should be investigated to identify all the bottlenecks and provide realistic feedback in order to propose future improvements. Targeted applications, such as image manipulations, cryptography, and communication algorithms, should be mapped to determine the hardware behavior. ...
Article
Full-text available
Reconfigurable systems represent a middle trade-off between speed and flexibility in the processor design world. They provide performance close to custom hardware yet preserve some of the general-purpose processor's flexibility. Recently, the area of reconfigurable computing has received considerable interest in both its forms: FPGAs and coarse-grain hardware. Since the field is still in its developing stage, it is important to perform hardware analysis and evaluation of certain key applications on target reconfigurable architectures to identify potential limitations and improvements. This paper presents the mapping and performance analysis of two encryption algorithms, Rijndael and Twofish, on a coarse-grain reconfigurable platform, MorphoSys. MorphoSys is a reconfigurable architecture targeted at multimedia applications. Since many cryptographic algorithms involve bitwise operations, a bitwise instruction-set extension was proposed to enhance performance. We present the details of the mapping of the bitwise operations involved in the algorithms with thorough analysis. The methodology we used can be utilized in other systems.
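To see why such kernels are dominated by bitwise work, consider the standard AES (Rijndael) field primitive `xtime` from FIPS-197 — shown here as a general illustration, not the paper's MorphoSys mapping: multiplication in GF(2^8) reduces entirely to shifts, masks, and XORs, exactly the operations a bitwise instruction-set extension accelerates.

```python
# Standard AES GF(2^8) arithmetic (FIPS-197), built from bitwise ops only.

def xtime(b):
    """Multiply byte b by x (i.e., by 2) in the AES field GF(2^8)."""
    b <<= 1
    if b & 0x100:        # reduce when the shift overflows 8 bits
        b ^= 0x11B       # Rijndael polynomial x^8 + x^4 + x^3 + x + 1
    return b & 0xFF

def gf_mul(a, b):
    """Full GF(2^8) multiply via shift-and-add over xtime and XOR."""
    acc = 0
    while b:
        if b & 1:
            acc ^= a
        a, b = xtime(a), b >> 1
    return acc
```

For example, `gf_mul(0x57, 0x13)` gives `0xFE`, matching the worked example in FIPS-197; every step is a shift, mask, or XOR with no multiplier hardware needed.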
... We first consider CGRAs with PEs extracted from ADRES [11], CMA [12], and RF-CGRA [8]. The PEs differ in interconnect and pipelining and are adapted with additional registers to avoid combinational loops. ...
Article
Full-text available
Modern Edge applications are diverse and compute-intensive but can often afford constrained computational inaccuracy, a paradigm commonly denoted approximate computing. This paper explores the integration of inexact arithmetic units into Coarse-Grained Reconfigurable Array (CGRA) architectures, together with an associated, comprehensive framework for architecture modeling, hardware generation, and automated approximation, CGRAgen. A use-case kernel from signal processing illustrates the functionality and potential of CGRAgen in design space explorations into CGRAs with integrated approximate computing techniques, highlighting the utility of making approximations reconfigurable.
... Given the significant redesign of the CGRA architecture, we benchmark our results against several baselines: the spatio-temporal CGRA, like typical CGRAs [14,39,58,67] as shown in Figure 3, has a 4×4 PE array with a mesh network and the same SPM configuration as 2×2 Plaid. Each PE has a 16-entry configuration memory. ...
Preprint
Full-text available
Coarse-grained Reconfigurable Arrays (CGRAs) are domain-agnostic accelerators that enhance the energy efficiency of resource-constrained edge devices. The CGRA landscape is diverse, exhibiting trade-offs between performance, efficiency, and architectural specialization. However, CGRAs often overprovision communication resources relative to their modest computing capabilities. This occurs because the theoretically provisioned programmability for CGRAs often proves superfluous in practical implementations. In this paper, we propose Plaid, a novel CGRA architecture and compiler that aligns compute and communication capabilities, thereby significantly improving energy and area efficiency while preserving its generality and performance. We demonstrate that the dataflow graph, representing the target application, can be decomposed into smaller, recurring communication patterns called motifs. The primary contribution is the identification of these structural motifs within the dataflow graphs and the development of an efficient collective execution and routing strategy tailored to these motifs. The Plaid architecture employs a novel collective processing unit that can execute multiple operations of a motif and route related data dependencies together. The Plaid compiler can hierarchically map the dataflow graph and judiciously schedule the motifs. Our design achieves a 43% reduction in power consumption and 46% area savings compared to the baseline high-performance spatio-temporal CGRA, all while preserving its generality and performance levels. In comparison to the baseline energy-efficient spatial CGRA, Plaid offers a 1.4x performance improvement and a 48% area savings, with almost the same power.
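The motif idea can be illustrated with a simplified stand-in (this is not Plaid's actual algorithm; the opcodes and graph are invented): count recurring producer-to-consumer operation pairs in a dataflow graph, since frequently repeated pairs are natural candidates for collective execution and routing on one unit.

```python
# Simplified motif histogram over a dataflow graph: tally the opcode
# pair on every dependence edge; frequent pairs are candidate motifs.

from collections import Counter

def motif_histogram(ops, edges):
    """ops: node id -> opcode; edges: (src, dst) dataflow dependencies."""
    return Counter((ops[s], ops[d]) for s, d in edges)

# an invented multiply-accumulate-style graph with two repeated patterns
ops = {0: "mul", 1: "add", 2: "mul", 3: "add", 4: "store"}
edges = [(0, 1), (2, 3), (1, 4), (3, 4)]
hist = motif_histogram(ops, edges)
```

Here both the `mul -> add` and `add -> store` patterns occur twice, so a motif-aware mapper could fuse each pattern's operations and their connecting edge into one collective unit.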
... CGRA, with a coarser word-level reconfigurability, offers an efficient alternative to FPGAs, bringing its area and power consumption closer to that of ASICs. Numerous CGRAs are proposed in industry [10,18,20,32] and academia [22,31,36,39]. SoftBrain [39] CGRA shows that its area and energy are within 8× and 2× of the ASIC values, respectively. ...
Preprint
Full-text available
Hardware specialization is commonly viewed as a way to scale performance in the dark silicon era, with modern-day SoCs featuring multiple tens of dedicated accelerators. By only powering on hardware circuitry when needed, accelerators fundamentally trade off chip area for power efficiency. Dark silicon, however, comes with a severe downside, namely its environmental footprint. While hardware specialization typically reduces the operational footprint through high energy efficiency, the embodied footprint incurred by integrating additional accelerators on chip leads to a net overall increase in environmental footprint, which has led prior work to conclude that dark silicon is not a sustainable design paradigm. We explore sustainable hardware specialization through reconfigurable logic that has the potential to drastically reduce the environmental footprint compared to a sea of accelerators by amortizing its embodied footprint across multiple applications. We present an abstract analytical model that evaluates the sustainability implications of replacing dedicated accelerators with a reconfigurable accelerator. We derive hardware synthesis results on ASIC and CGRA (a representative reconfigurable fabric) for chip area and energy numbers for a wide variety of kernels. We input these results to the analytical model and conclude that reconfigurable fabric is more sustainable. We find that as few as a handful to a dozen accelerators can be replaced by a CGRA. Moreover, replacing a sea of accelerators with a CGRA leads to a drastically reduced environmental footprint (by a factor of 2.5× to 7.6×).
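The amortization argument reduces to a simple break-even model. The sketch below is an abstract illustration in the spirit of such an analysis — all coefficients are invented assumptions, not the paper's data: dedicated accelerators pay an embodied cost per kernel, while one reconfigurable fabric pays its (larger) embodied cost once and reuses it.

```python
# Toy sustainability model: total footprint (arbitrary units) of serving
# n kernels with dedicated accelerators vs. one reconfigurable fabric.
# All coefficients are illustrative assumptions.

def dedicated_footprint(n_kernels, embodied_per_acc=1.0, op_per_acc=0.10):
    # each kernel gets its own accelerator: embodied cost scales with n
    return n_kernels * (embodied_per_acc + op_per_acc)

def cgra_footprint(n_kernels, embodied_cgra=3.0, op_per_kernel=0.25):
    # one fabric's embodied cost, amortized over every kernel it replaces
    return embodied_cgra + n_kernels * op_per_kernel

def break_even():
    """Smallest kernel count at which the fabric's footprint wins."""
    n = 1
    while cgra_footprint(n) >= dedicated_footprint(n):
        n += 1
    return n
```

With these assumed coefficients the fabric wins from 4 kernels onward, mirroring the paper's qualitative finding that a handful to a dozen accelerators suffice to justify a CGRA.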
... Static Placement, Static Issue. Statically building the placement and the issuing of instructions trades flexibility for power efficiency [19], [31], [32]. To enable static issue, the routing must guarantee that the operands are ready by the issue time. ...
Article
Full-text available
Specialized accelerators are becoming a standard way to achieve both high-performance and efficient computation. We see this trend extending to all areas of computing, from low-power edge-computing systems to high-performance processors in datacenters. Reconfigurable architectures, such as Coarse-Grained Reconfigurable Arrays (CGRAs), attempt to find a balance between performance and energy efficiency by trading off dynamism, flexibility, and programmability. Our goal in this work is to find a new solution that provides the flexibility of traditional CPUs, with the parallelism of a CGRA, to improve overall performance and energy efficiency. Our design, the Dynamic Data-Driven Reconfigurable Architecture (3DRA), is unique, in that it targets both low-latency and high-throughput workloads. This architecture implements a dynamic dataflow execution model that resolves data dependencies at run-time and utilizes non-blocking broadcast communication that reduces transmission latency to a single cycle to achieve high performance and energy efficiency. By employing a dynamic model, 3DRA eliminates costly mapping algorithms during compilation and improves the flexibility and compilation time of traditional CGRAs. The 3DRA architecture achieves up to 731 MIPS/mW, and it improves performance by up to 4.43x compared to the current state-of-the-art CGRA-based accelerators.
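The dynamic dataflow execution model described above can be sketched minimally (this is an illustrative interpreter, not the 3DRA design; node names and the worklist scheme are invented): a node fires as soon as all of its operands have arrived, so data dependencies are resolved at run time rather than by a compile-time mapping.

```python
# Minimal dynamic-dataflow interpreter: fire any node whose operands
# are all available; no static placement or schedule is computed.

def run_dataflow(nodes, inputs):
    """nodes: name -> (fn, operand names, successor names)."""
    values, ready = dict(inputs), list(inputs)
    fired = []                           # firing order, for inspection
    while ready:
        ready.pop(0)                     # a new value became available
        for name, (fn, deps, _) in nodes.items():
            if name not in values and all(d in values for d in deps):
                values[name] = fn(*[values[d] for d in deps])
                fired.append(name)       # dependency resolved at run time
                ready.append(name)
    return values, fired

# (x + y) * z as a two-node dataflow graph
nodes = {
    "t": (lambda a, b: a + b, ["x", "y"], ["r"]),
    "r": (lambda t, z: t * z, ["t", "z"], []),
}
values, fired = run_dataflow(nodes, {"x": 2, "y": 3, "z": 4})
```

Because readiness alone triggers execution, there is no mapping step to run at compile time — which is the flexibility/compile-time advantage the abstract claims for the dynamic model.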
Article
CGRA, as a coprocessor in SoCs, has been widely studied. However, there is limited research on how to efficiently debug and verify SoCs composed of CGRAs and processors during the design process. To address this gap, we introduce DVHetero. DVHetero incorporates a simulation and validation framework, SoCDiff, which enables comprehensive SoC simulation, debugging, and rapid error localization. Using this verification framework, we successfully implemented and validated the entire SoC. The SoC includes a Chisel-based CGRA generator and provides a pipelined CGRA architecture template. The CGRA is tightly integrated with the RISC-V processor, allowing for efficient DMA-based data transfer and MMIO support within the SoC. The pipelined CGRA architecture generated by DVHetero shows a 1.27x improvement in area efficiency and a 10.54x increase in mapping speed compared to the state-of-the-art CGRA framework, HierCGRA. Additionally, compared to the state-of-the-art CGRA-SoC system FDRA, DVHetero demonstrates a 1.67x increase in execution speed and a 4.34x improvement in area efficiency.
Article
Domain-specific languages for hardware can significantly enhance designer productivity, but sometimes at the cost of ease of verification. On the other hand, ISA specification languages are too static to be used during early stage design space exploration. We present PEak, an open-source hardware design and specification language, which aims to improve both design productivity and verification capability. PEak does this by providing a single source of truth for functional models, formal specifications, and RTL. PEak has been used in several academic projects, and PEak-generated RTL has been included in three fabricated hardware accelerators. In these projects, the formal capabilities of PEak were crucial for enabling both novel design space exploration techniques and automated compiler synthesis.
Article
Stream processing, which involves real-time computation of data as it is created or received, is vital for various applications, specifically wireless communication. The evolving protocols, the requirement for high throughput, and the challenges of handling diverse processing patterns make it demanding. Traditional platforms grapple with meeting real-time throughput and latency requirements due to large data volume, sequential and indeterministic data arrival, and variable data rates, leading to inefficiencies in memory access and parallel processing. We present Canalis, a throughput-optimized framework designed to address these challenges, ensuring high performance while achieving low energy consumption. Canalis is a hardware-software co-designed system. It includes a programmable spatial architecture, FluxSPU (Flux Stream Processing Unit), proposed by this work to enhance data throughput and energy efficiency. FluxSPU is accompanied by a software stack that eases the programming process. We evaluated Canalis with eight distinct benchmarks. When compared to the CPU and GPU in a mobile SoC to demonstrate the effectiveness of domain specialization, Canalis achieves an average speedup of 13.4× and 6.6×, and energy savings of 189.8× and 283.9×, respectively. In contrast to equivalent ASICs of the benchmarks, the average energy overhead of Canalis is within 2.4×, successfully maintaining generality without incurring significant overhead.
Article
We present a domain-adaptive processor, a programmable systolic-array processor designed for wireless communication and linear algebra workloads. It uses a globally homogeneous but locally heterogeneous architecture, uses decode-less reconfiguration instructions for data streaming, enables single-cycle data communication between functional units (FUs), and features lightweight nested-loop control for periodic execution. Our design demonstrates how configuration flexibility and rapid program loading enable a wide range of communication workloads to be mapped and swapped in less than a microsecond, supporting continually evolving communication standards such as 5G. A prototype chip with 256 cores was fabricated in a 12-nm FinFET process and has been verified. The measurement results show that it achieves 507 GMACs/J and a peak performance of 264 GMACs.
Article
Systolic arrays and shared-L1-memory manycore clusters are commonly used architectural paradigms that offer different trade-offs to accelerate parallel workloads. While the first excel with regular dataflow at the cost of rigid architectures and complex programming models, the second are versatile and easy to program but require explicit dataflow management and synchronization. This work aims at enabling efficient systolic execution on shared-L1-memory manycore clusters. We devise a flexible architecture where small and energy-efficient cores act as the systolic array's processing elements (PEs) and can form diverse, reconfigurable systolic topologies through queues mapped in the cluster's shared memory. We introduce two low-overhead instruction set architecture (ISA) extensions for efficient systolic execution, namely Xqueue and queue-linked registers (QLRs), which support queue management in hardware. The Xqueue extension enables single-instruction access to shared-memory-mapped queues, while QLRs allow implicit and autonomous access to them, relieving the cores of explicit communication instructions. We demonstrate Xqueue and QLRs in an open-source shared-memory cluster with 256 PEs and analyze the hybrid systolic-shared-memory architecture's trade-offs on several digital signal processing (DSP) kernels with diverse arithmetic intensity. For an area increase of just 6%, our hybrid architecture can double the cluster's compute-unit utilization, reaching up to 73%. In typical conditions (TT/0.80 V/25 °C), in a 22-nm FDX technology, our hybrid architecture runs at 600 MHz with no frequency degradation and is up to 65% more energy efficient than the shared-memory baseline, achieving up to 208 GOPS/W, with up to 63% of power spent in the PEs.
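Queue-mediated systolic execution can be sketched as follows. This is an illustrative model only (the three-stage pipeline and stage functions are invented, and this models neither the Xqueue nor the QLR hardware): each PE pops from its input queue, computes, and pushes to the next PE's queue, so the queue wiring alone defines the systolic topology.

```python
# Toy systolic chain over software queues: the list of stage functions
# plays the role of PEs, and the queues between them define the topology.

from collections import deque

def run_systolic(stages, feed):
    """stages: per-PE functions applied in order; feed: input stream."""
    queues = [deque(feed)] + [deque() for _ in stages]
    for i, fn in enumerate(stages):       # drain each stage downstream
        while queues[i]:
            queues[i + 1].append(fn(queues[i].popleft()))
    return list(queues[-1])

# an invented 3-PE chain: scale, offset, clamp
out = run_systolic([lambda v: v * 2, lambda v: v + 1, lambda v: min(v, 10)],
                   [1, 4, 9])
```

Reconfiguring the topology here means only rewiring which queue feeds which stage — the analogue of remapping queues in shared memory rather than redesigning the array.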
Article
Although spatial programmable architectures have demonstrated high-performance and programmability for a variety of applications, they suffer from the pipeline unbalancing issue which restricts resource utilization and degrades the performance. In this article, we identify that spatial initiation interval (SpII) can quantitatively describe the impact of pipeline unbalancing on performance, so we formulate SpII for the first time in spatial architectures. To achieve an optimal SpII, we propose dataflow decomposing and integrated mapping to enable high-performance dataflow-mapping on spatial architectures. Dataflow decomposing decomposes the application graph into subgraphs and runs them serially, so that it adapts the regular spatial architecture to various application dataflows, particularly for extremely unbalanced datapaths without incurring large buffering overhead. Based on the quantitative SpII, we propose integrated mapping to consider operator placing, operand routing and pipeline balancing at the same time that can find a better SpII for fully pipelined execution on spatial architectures. The experiment results show that our proposal can gain an average of 2.1× performance speedup on a variety of application kernels over the state-of-the-art approaches.
Article
While coarse-grained reconfigurable arrays (CGRAs) have emerged as promising programmable accelerator architectures, they require automatic pipelining of applications during their compilation flow to achieve high performance. Current CGRA compilers either lack pipelining altogether resulting in low application performance, or perform exhaustive pipelining resulting in high power and resource consumption. We address these challenges by proposing Cascade, an end-to-end open-source application compiler for CGRAs that achieves both state-of-the-art performance and fast compilation times. The contributions of this work are: 1) a novel post place-and-route (PnR) application pipelining technique for CGRAs that accounts for interconnect hop delays during pipelining but in a unique way that avoids cyclic scheduling and PnR, 2) a register resource usage optimization technique that leverages the scheduling logic in CGRA memory tiles to minimize the number of register resources used during pipelining, and 3) an automated CGRA timing model generator, an application timing analysis tool, and a large set of existing and novel application pipelining techniques integrated into an end-to-end compilation flow. Cascade achieves 8-34× lower critical path delay and 7-190× lower energy-delay product (EDP) across a variety of dense image processing and machine learning workloads, and 3-5.2× lower critical path delay and 2.5-5.2× lower EDP on sparse workloads, compared to a compiler without pipelining. Cascade mitigates the performance and energy-efficiency drawbacks of existing CGRA compilers, and enables further research into CGRAs as flexible, yet competitive accelerator architectures.
Article
Coarse-grained reconfigurable arrays (CGRAs) are promising design choices in computation-intensive domains since they can strike a balance between energy efficiency and flexibility. A typical CGRA comprises processing elements (PEs) that can execute operations in applications and interconnections between them. Nevertheless, most CGRAs are ineffective at supporting flexible architecture design and at solving large-scale mapping problems. To address these challenges, we introduce HierCGRA, a novel framework that integrates hierarchical CGRA modeling, Chisel-based Verilog generation, LLVM-based data flow graph (DFG) generation, DFG mapping, and design space exploration (DSE). With the graph homomorphism (GH) mapping algorithm, HierCGRA achieves a faster mapping speed and higher PE utilization rate compared with the existing state-of-the-art CGRA frameworks. The proposed hierarchical mapping strategy achieves 41× speedup on average compared with the ILP mapping algorithm in CGRA-ME. Furthermore, the automated DSE based on Bayesian optimization achieves a significant performance improvement through the heterogeneity of PEs and interconnections. With these features, HierCGRA enables agile development of large-scale CGRAs and accelerates the process of finding better CGRA architectures.
Chapter
Coarse-Grained Reconfigurable Arrays (CGRAs) have become a popular technology for realizing compute accelerators. CGRAs can be found in high-performance systems and also in embedded systems. In order to provide the highest speedup, they need to support conditional statements and nested loops. This requires management of conditions within the CGRA, which can be done in different ways. In this contribution, we compare two such concepts and evaluate the impact that these concepts have on the achievable clock frequency, the required resources, and the change of schedules. It turns out that, with our new condition management and the accompanying advanced schedule, we can save more than 20% of runtime. Keywords: CGRA, Scheduling, Compute Accelerator, Nested Loops
Article
Full-text available
Modulo scheduling is a framework within which algorithms for software pipelining innermost loops may be defined. The framework specifies a set of constraints that must be met in order to achieve a legal modulo schedule. A wide variety of algorithms and heuristics can be defined within this framework. Little work has been done to evaluate and compare alternative algorithms and heuristics for modulo scheduling from the viewpoints of schedule quality as well as computational complexity. This, along with a vague and unfounded perception that modulo scheduling is computationally expensive as well as difficult to implement, has inhibited its incorporation into product compilers. This paper presents iterative modulo scheduling, a practical algorithm that is capable of dealing with realistic machine models. The paper also characterizes the algorithm in terms of the quality of the generated schedules as well as the computational expense incurred.
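The constraint set that makes a modulo schedule legal starts from a standard, textbook lower bound on the initiation interval II (the example numbers below are invented): II cannot beat the resource bound ResMII (operations divided among functional units) or the recurrence bound RecMII (loop-carried dependence cycles), so scheduling begins at their maximum.

```python
# Standard minimum-initiation-interval (MII) bound used in modulo
# scheduling; the concrete loop parameters below are an invented example.

from math import ceil

def res_mii(n_ops, n_fus):
    """Resource bound: each FU issues one operation per II cycles."""
    return ceil(n_ops / n_fus)

def rec_mii(cycles):
    """Recurrence bound: cycles is a list of (total latency along the
    dependence cycle, iteration distance of the cycle)."""
    return max(ceil(lat / dist) for lat, dist in cycles)

def mii(n_ops, n_fus, cycles):
    """Scheduling starts at MII and increases II until a legal schedule
    fits the modulo reservation table."""
    return max(res_mii(n_ops, n_fus), rec_mii(cycles))
```

For a loop of 10 operations on 4 FUs with a recurrence of latency 6 spanning 2 iterations, both bounds give 3, so an iterative modulo scheduler would first attempt II = 3 and only relax upward on failure.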
Conference Paper
Full-text available
Significant advances have been made in compilation technology for capitalizing on instruction-level parallelism (ILP). The vast majority of ILP compilation research has been conducted in the context of general-purpose computing, and more specifically the SPEC benchmark suite. At the same time, a number of microprocessor architectures have emerged which have VLIW and SIMD structures that are well matched to the needs of the ILP compilers. Most of these processors are targeted at embedded applications such as multimedia and communications, rather than general-purpose systems. Conventional wisdom, and a history of hand optimization of inner-loops, suggests that ILP compilation techniques are well suited to these applications. Unfortunately, there currently exists a gap between the compiler community and embedded applications developers. This paper presents MediaBench, a benchmark suite that has been designed to fill this gap. This suite has been constructed through a three-step process: intuition and market driven initial selection, experimental measurement to establish uniqueness, and integration with system synthesis algorithms to establish usefulness.
Conference Paper
Full-text available
Coarse-grained reconfigurable architectures have become increasingly important in recent years. Automatic design or compilation tools are essential to their success. In this paper, we present a modulo scheduling algorithm to exploit loop-level parallelism for coarse-grained reconfigurable architectures. This algorithm is a key part of our dynamically reconfigurable embedded systems compiler (DRESC). It is capable of solving placement, scheduling and routing of operations simultaneously in a modulo-constrained 3D space and uses an abstract architecture representation to model a wide class of coarse-grained architectures. The experimental results show high performance and efficient resource utilization on tested kernels.
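The "modulo-constrained 3D space" can be pictured as placing each operation on a functional unit (x, y) at a time step t, where the same unit is reused every II cycles. A hedged sketch of the resulting conflict rule (a simplified model, not the DRESC implementation):

```python
# Sketch of placement in a modulo-constrained 3D space: an operation
# occupies ((x, y), t), and two placements collide when they claim
# the same functional unit at the same time modulo II.
# Illustrative model only; not DRESC's actual data structures.

def conflicts(placement_a, placement_b, ii):
    """placement = ((x, y), t).  Under modulo scheduling with
    initiation interval II, a unit is reused every II cycles, so
    occupation times are compared modulo II."""
    (fu_a, t_a), (fu_b, t_b) = placement_a, placement_b
    return fu_a == fu_b and t_a % ii == t_b % ii

# Same FU at t=1 and t=5 with II=4: 1 % 4 == 5 % 4 -> conflict.
print(conflicts(((0, 0), 1), ((0, 0), 5), ii=4))   # True
# A different FU never conflicts, regardless of time.
print(conflicts(((0, 0), 1), ((1, 0), 5), ii=4))   # False
```

Solving placement, scheduling, and routing simultaneously then amounts to searching this folded space for an assignment with no such collisions.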
Conference Paper
Full-text available
Programmable reduced instruction set computers (PRISC) are a new class of computers which can offer a programmable functional unit (PFU) in the context of a RISC datapath. PRISCs create application-specific instructions to accelerate the performance of a particular application. Our previous work has demonstrated that peephole optimizations in a compiler can utilize PFU resources to accelerate the performance of general purpose programs. However, these compiler optimizations are limited by the structure of the input source code. This work generalizes our previous work and demonstrates that the performance of general abstract data types such as short-set vectors, hash tables, and finite state machines is significantly accelerated (250%-500%) by using PFU resources. Thus, a wide variety of end-user applications can be specifically designed to use PFU resources to accelerate performance. Results from applications in the domain of computer-aided design (CAD) are presented to demonstrate the usefulness of our techniques.
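The short-set vectors mentioned above illustrate what makes a good PFU candidate: a set encoded as a bitmask, so that a multi-instruction software sequence collapses into one combinational function the PFU evaluates in a single cycle. A software-only sketch under that assumption (the encoding and names are illustrative, not the paper's):

```python
# Sketch of a PRISC-style application-specific instruction on
# short-set vectors (sets encoded as bitmasks).  Pure-software
# model; the hardware mapping is invented for illustration.

def set_union(a, b):
    return a | b

def set_member(a, elem):
    return (a >> elem) & 1 == 1

def pfu_intersect_nonempty(a, b):
    """Candidate PFU instruction: 'do the two sets share an
    element?' -- one combinational function replacing a load,
    an AND, and a compare in plain RISC code."""
    return (a & b) != 0

s1 = (1 << 2) | (1 << 5)      # the set {2, 5}
s2 = (1 << 5) | (1 << 7)      # the set {5, 7}
print(pfu_intersect_nonempty(s1, s2))  # True: both contain 5
```

The acceleration comes from fusing such fixed operation patterns into one instruction, which is exactly what the hardware-synthesized PFU configuration provides.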
Article
Full-text available
Coarse-grained reconfigurable architectures have become increasingly important in recent years. Automatic design or compilation tools are essential to their success. A modulo scheduling algorithm to exploit loop-level parallelism for coarse-grained reconfigurable architectures is presented. This algorithm is a key part of a dynamically reconfigurable embedded systems compiler (DRESC). It is capable of solving placement, scheduling and routing of operations simultaneously in a modulo-constrained 3D space and uses an abstract architecture representation to model a wide class of coarse-grained architectures. The experimental results show high performance and efficient resource utilisation on tested kernels.
Article
Full-text available
This paper introduces MorphoSys, a reconfigurable computing system developed to investigate the effectiveness of combining reconfigurable hardware with general-purpose processors for word-level, computation-intensive applications. MorphoSys is a coarse-grain, integrated, and reconfigurable system-on-chip, targeted at high-throughput and data-parallel applications. It is comprised of a reconfigurable array of processing cells, a modified RISC processor core, and an efficient memory interface unit. This paper describes the MorphoSys architecture, including the reconfigurable processor array, the control processor, and data and configuration memories. The suitability of MorphoSys for the target application domain is then illustrated with examples such as video compression, data encryption and target recognition. Performance evaluation of these applications indicates improvements of up to an order of magnitude (or more) on MorphoSys, in comparison with other systems.
Article
Full-text available
The performance of multiple-instruction-issue processors can be severely limited by the compiler's ability to generate efficient code for concurrent hardware. In the IMPACT project, we have developed IMPACT-I, a highly optimizing C compiler to exploit instruction level concurrency. The optimization capabilities of the IMPACT-I C compiler are summarized in this paper. Using the IMPACT-I C compiler, we ran experiments to analyze the performance of multiple-instruction-issue processors executing some important non-numerical programs. The multiple-instruction-issue processors achieve solid speedup over high-performance single-instruction-issue processors. We ran experiments to characterize the following architectural design issues: code scheduling model, instruction issue rate, memory load latency, and function unit resource limitations. Based on the experimental results, we propose the IMPACT Architectural Framework, a set of architectural features that best support the IMPACT-I C compiler.
Conference Paper
This paper describes a new reconfigurable processor architecture called REMARC (Reconfigurable Multimedia Array Coprocessor). REMARC is a small array processor that is tightly coupled to a main RISC processor. It consists of a global control unit and 64 16-bit processors called nano processors. REMARC is designed to accelerate multimedia applications, such as video compression, decompression, and image processing. These applications typically use 8-bit or 16-bit data; therefore, each nano processor has a 16-bit datapath that is much wider than those of other reconfigurable coprocessors. We have developed a programming environment for REMARC and several realistic application programs: DES encryption, MPEG-2 decoding, and MPEG-2 encoding. REMARC can implement various parallel algorithms which appear in these multimedia applications. For instance, REMARC can implement SIMD-type instructions, similar to multimedia instruction extensions, for motion compensation in MPEG-2 decoding. Furthermore, highly pipelined algorithms, such as the systolic algorithms that appear in motion estimation for MPEG-2 encoding, can also be implemented efficiently. REMARC achieves speedups ranging from a factor of 2.3 to 21.2 over the base processor, which is a single-issue processor or 2-issue superscalar processor. We also compare its performance with multimedia instruction extensions. Using more processing resources, REMARC can achieve higher performance than multimedia instruction extensions.
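The SIMD-style motion-compensation operation mentioned above can be sketched in software: every lane (nano processor) applies the same 16-bit rounding average to its own pixel pair in lockstep. A simplified model, not REMARC's actual instruction set:

```python
# Sketch of a SIMD-style 16-bit operation for MPEG-2 motion
# compensation: all lanes execute the same rounding average on
# their own data element.  Software model only; the lane count
# and encoding are not REMARC's.

def simd_avg_round(xs, ys):
    """Element-wise (a + b + 1) >> 1 -- the rounding average used
    in motion compensation -- applied across all lanes, masked to
    a 16-bit datapath."""
    return [((a + b + 1) >> 1) & 0xFFFF for a, b in zip(xs, ys)]

ref_block  = [100, 102, 104, 106]   # pixels from the reference frame
pred_block = [101, 101, 105, 109]   # pixels from the prediction
print(simd_avg_round(ref_block, pred_block))  # [101, 102, 105, 108]
```

On an array like this, one such instruction updates 64 pixels per step, which is where the reported speedups over a scalar base processor come from.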
Conference Paper
Coarse-grained reconfigurable architectures have become increasingly important in recent years. Automatic design or compiling tools are essential to their success. In this paper, we present a retargetable compiler for a family of coarse-grained reconfigurable architectures. Several key issues are addressed. Program analysis and transformation prepare dataflow for scheduling. Architecture abstraction generates an internal graph representation from a concrete architecture description. A modulo scheduling algorithm is key to exploit parallelism and achieve high performance. The experimental results show up to 28.7 instructions per cycle (IPC) over tested kernels.
Article
By strictly separating reconfigurable logic from the host processor, current custom computing systems suffer from a significant communication bottleneck. In this paper, we describe Chimaera, a system that overcomes the communication bottleneck by integrating reconfigurable logic into the host processor itself. With direct access to the host processor's register file, the system enables the creation of multi-operand instructions and a speculative execution model key to high-performance, general-purpose reconfigurable computing. Chimaera also supports multi-output functions and utilizes partial run-time reconfiguration to reduce reconfiguration time. Combined, the system can provide speedups of a factor of two or more for general-purpose computing, and speedups of 160 or more are possible for hand-mapped applications.
Article
Configurable computing has captured the imagination of many architects who want the performance of application-specific hardware combined with the reprogrammability of general-purpose computers. Unfortunately, configurable computing has had rather limited success largely because the FPGAs on which they are built are more suited to implementing random logic than computing tasks. This paper presents RaPiD, a new coarse-grained FPGA architecture that is optimized for highly repetitive, computation-intensive tasks. Very deep application-specific computation pipelines can be configured in RaPiD. These pipelines make much more efficient use of silicon than traditional FPGAs and also yield much higher performance for a wide range of applications.
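The deep configured pipelines RaPiD describes can be modeled as a chain of single-cycle stages with registers between them: after an initial fill latency equal to the pipeline depth, one result emerges per cycle. A software-only sketch (the stage model is an assumption for illustration, not RaPiD's configuration format):

```python
# Sketch of a deep application-specific pipeline: each cycle, every
# stage consumes the value its predecessor produced the previous
# cycle.  Illustrative simulation only.

def run_pipeline(stages, inputs):
    """Simulate a linear pipeline of single-cycle stages.  After
    len(stages) cycles of fill latency, one result leaves the last
    stage per cycle."""
    regs = [None] * len(stages)          # pipeline registers
    outputs = []
    for x in list(inputs) + [None] * len(stages):  # extra cycles drain it
        out = regs[-1]                   # value leaving the final stage
        for i in range(len(stages) - 1, 0, -1):    # shift back-to-front
            regs[i] = stages[i](regs[i - 1]) if regs[i - 1] is not None else None
        regs[0] = stages[0](x) if x is not None else None
        if out is not None:
            outputs.append(out)
    return outputs

# Two-stage pipeline: add one, then double.
print(run_pipeline([lambda v: v + 1, lambda v: v * 2], [1, 2, 3]))  # [4, 6, 8]
```

The efficiency argument in the abstract follows from this shape: once configured, every stage does useful work every cycle, with no instruction fetch or generic routing overhead in the steady state.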
Automatic Synthesis of Reconfigurable Instruction Set Accelerations
  • B Kastrup