Conference Paper

Dynamically Specialized Datapaths for Energy Efficient Computing

Authors: Venkatraman Govindaraju, Chen-Han Ho, Karthikeyan Sankaralingam

Abstract

Due to limits in technology scaling, energy efficiency of logic devices is decreasing in successive generations. To provide continued performance improvements without increasing power, regardless of the sequential or parallel nature of the application, microarchitectural energy efficiency must improve. We propose Dynamically Specialized Datapaths to improve the energy efficiency of general purpose programmable processors. The key insights of this work are the following. First, applications execute in phases, and these phases can be determined by creating a path-tree of basic blocks rooted at the inner-most loop. Second, specialized datapaths corresponding to these path-trees, which we refer to as DySER blocks, can be constructed by interconnecting a set of heterogeneous computation units with a circuit-switched network. These blocks can be easily integrated with a processor pipeline. A synthesized RTL implementation using an industry 55nm technology library shows that a 64-functional-unit DySER block occupies approximately the same area as a 64 KB single-ported SRAM and can execute at 2 GHz. We extend the GCC compiler to identify path-trees and map code to DySER, and we evaluate the PARSEC, SPEC, and Parboil benchmark suites. Our results show that in most cases two DySER blocks can achieve the same performance (within 5%) as having a specialized hardware module for each path-tree. A 64-FU DySER block can cover 12% to 100% of the dynamically executed instruction stream. When integrated with a dual-issue out-of-order processor, two DySER blocks provide a geometric mean speedup of 2.1X (1.15X to 10X) and a geometric mean energy reduction of 40% (up to 70%), or a 60% energy reduction if no performance improvement is required.
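To make the mapping idea concrete, the sketch below is a deliberately simplified, hypothetical illustration (it is not the paper's compiler algorithm): it takes the operations of a small path-tree of basic blocks rooted at an inner-most loop and greedily assigns them to a grid of heterogeneous functional units, roughly the way a DySER-like block would be configured; all names and data structures are invented for illustration.

# Hypothetical sketch: greedily place the operations of a hot path-tree onto a
# small grid of heterogeneous functional units (FUs). Illustration only; it
# does not reproduce the DySER compiler's actual mapping pass.

path_tree = {                      # basic block -> operations it contains
    "loop_header": ["add", "mul"],
    "then_block":  ["shift"],
    "else_block":  ["add"],
}

fu_grid = {                        # FU position -> operation types it supports
    (0, 0): {"add", "shift"},
    (0, 1): {"mul"},
    (1, 0): {"add", "mul"},
    (1, 1): {"shift"},
}

def map_path_tree(tree, grid):
    """First-fit assignment of each operation to a compatible, unused FU."""
    placement, used = {}, set()
    for block, ops in tree.items():
        for i, op in enumerate(ops):
            slot = next((pos for pos in grid
                         if pos not in used and op in grid[pos]), None)
            if slot is None:
                return None        # path-tree does not fit in this block
            placement[(block, i, op)] = slot
            used.add(slot)
    return placement

print(map_path_tree(path_tree, fu_grid))

A real mapper would additionally route operands through the circuit-switched switches and fall back to a second DySER block when a path-tree does not fit.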


... Offline detection benefits from few constraints regarding memory, processing power, or overhead and can extract lengthier binary segments. To implement offline detection, some approaches integrate the detection into the compiler [17,32,40,44] (1), while others analyze the application post-compilation either statically [70] (2) or by simulation [67,73] (3). Additional outputs, such as profiling information may also be produced [73]. ...
... This method often occurs in approaches that rely on small accelerators integrated into the processor. The binary is modified by replacing the segments with one or more custom instructions of lower latency that activate the accelerator [17,40,68] and control any of its internal logic [39]. ...
... In summary, we identify three major methods for accelerator generation or configuration: (1) targeting a static pre-designed accelerator architecture [40,70], (2) specializing an architecture template followed by generation of configuration words, instructions, or schedules [17,32,73], and (3) generation of a full custom hardware description [56,58]. Table 1 summarizes the taxonomy on the binary translation aspect of each approach, presenting the possible values we have outlined for each feature (e.g. ...
Article
Full-text available
The breakdown of Dennard scaling has resulted in a decade-long stall of the maximum operating clock frequencies of processors. To mitigate this issue, computing shifted to multi-core devices. This introduced the need for programming flows and tools that facilitate the expression of workload parallelism at high abstraction levels. However, not all workloads are easily parallelizable, and the minor improvements to processor cores have not significantly increased single-threaded performance. Simultaneously, Instruction Level Parallelism in applications is considerably underexplored. This article reviews notable approaches that focus on exploiting this potential parallelism via automatic generation of specialized hardware from binary code. Although research on this topic spans over more than 20 years, automatic acceleration of software via translation to hardware has gained new importance with the recent trend toward reconfigurable heterogeneous platforms. We characterize this kind of binary acceleration approach and the accelerator architectures on which it relies. We summarize notable state-of-the-art approaches individually and present a taxonomy and comparison. Performance gains from 2.6× to 5.6× are reported, mostly considering bare-metal embedded applications, along with power consumption reductions between 1.3× and 3.9×. We believe the methodologies and results achievable by automatic hardware generation approaches are promising in the context of emergent reconfigurable devices.
... However, application demands are rapidly evolving, and a more flexible approach is needed. Configurable hardware, using either coarse grain reconfigurable arrays (CGRAs) [15,9] or FPGAs [12], is one approach to providing a flexible machine that can be configured to match the data flow of different algorithms. Although these architectures promise much better energy efficiency compared to CPUs or GPUs, programming and integrating them into complete real world systems remains a formidable task for application developers. ...
... x, y) + .1*in(2, x, y);
blury(x, y) = (gray(x, y-1) + gray(x, y) + gray(x, y+1)) / 3;
blurx(x, y) = (blury(x-1, y) + blury(x, y) + blury(x+1, y)) / 3;
sharpen(x, y) = 2 * gray(x, y) - blurx(x, y);
ratio(x, y) = sharpen(x, y) / gray(x, y);
unsharp(c, x, y) = ratio(x, y) * input(c, x, y);

// The schedule
unsharp.tile(x, y, xi, yi, 256, 256).unroll(c)
    .accelerate({in}, ...
Preprint
Specialized image processing accelerators are necessary to deliver the performance and energy efficiency required by important applications in computer vision, computational photography, and augmented reality. But creating, "programming," and integrating this hardware into a hardware/software system is difficult. We address this problem by extending the image processing language, Halide, so users can specify which portions of their applications should become hardware accelerators, and then we provide a compiler that uses this code to automatically create the accelerator along with the "glue" code needed for the user's application to access this hardware. Starting with Halide not only provides a very high-level functional description of the hardware, but also allows our compiler to generate the complete software program including the sequential part of the workload, which accesses the hardware for acceleration. Our system also provides high-level semantics to explore different mappings of applications to a heterogeneous system, with the added flexibility of being able to map at various throughput rates. We demonstrate our approach by mapping applications to a Xilinx Zynq system. Using its FPGA with two low-power ARM cores, our design achieves up to 6x higher performance and 8x lower energy compared to the quad-core ARM CPU on an NVIDIA Tegra K1, and 3.5x higher performance with 12x lower energy compared to the K1's 192-core GPU.
... Sequential programs spend a considerable portion of their run time executing tight loops with loop-carried data dependencies, and accelerating these loops becomes a critical factor in enhancing overall performance [13]- [18]. The challenge in accelerating the execution of loops with loop-carried dependencies in CGRAs arises from the need to spill data-dependent values out of the array and then feed them back in [19]- [22]. Consequently, the execution of the next iteration is stalled until all its input operands become ready. ...
... In its operation, BERET utilizes a single thread and incorporates an internal register file shared among the subgraphs. DySER (Dynamically Specializing Execution Resources) [19], [20] dynamically generates specialized data paths solely for frequently executed regions. This technique incorporates a heterogeneous array of functional units interconnected with simple switches, enabling efficient functionality and parallelism mechanisms. ...
Preprint
Coarse-grain reconfigurable architectures (CGRAs) are gaining traction thanks to their performance and power efficiency. Utilizing CGRAs to accelerate the execution of tight loops holds great potential for achieving significant overall performance gains, as a substantial portion of program execution time is dedicated to tight loops. But loop parallelization using CGRAs is challenging because of loop-carried data dependencies. Traditionally, loop-carried dependencies are handled by spilling dependent values out of the reconfigurable array to a memory medium and then feeding them back to the grid. Spilling the values and feeding them back into the grid imposes additional latencies and logic that impede performance and limit parallelism. In this paper, we present the Dependency Resolved CGRA (DR-CGRA) architecture that is designed to accelerate the execution of tight loops. DR-CGRA, which is based on a massively-multithreaded CGRA, runs each iteration as a separate CGRA thread and maps loop-carried data dependencies to inter-thread communication inside the grid. This design ensures the passage of data-dependent values across loop iterations without spilling them out of the grid. The proposed DR-CGRA architecture was evaluated on various SPEC CPU 2017 benchmarks. The results demonstrated significant performance improvements, with speedups ranging from 2.1 to 4.5 and an overall average of 3.1 when compared to a state-of-the-art CGRA architecture.
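As a purely illustrative aside (not DR-CGRA's mechanism or code), the toy Python below shows the kind of loop-carried dependency the abstract describes and models each iteration as a "thread" that receives the dependent value from its predecessor through an in-grid channel instead of spilling it to memory:

# Hypothetical sketch of the loop-carried-dependency problem DR-CGRA targets.
# Each iteration of the loop below needs `acc` from the previous iteration:
#
#     for i in range(n):
#         acc = acc * a[i] + b[i]
#
# Here each iteration acts as a separate "thread" and the dependent value is
# forwarded through an in-grid channel (a queue) rather than spilled to memory.

from queue import Queue

def run_iterations(a, b, acc0=0):
    chan = Queue()
    chan.put(acc0)                       # seed the first iteration
    for ai, bi in zip(a, b):             # one "thread" per iteration
        acc_in = chan.get()              # value forwarded inside the grid
        chan.put(acc_in * ai + bi)       # forward to the next iteration
    return chan.get()

print(run_iterations([1, 2, 3], [4, 5, 6]))   # ((0*1+4)*2+5)*3+6 = 45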
... The most significant factor which affects the area overhead of an overlay is the virtual routing resources [18,19], or interconnects between the different FUs. A number of interconnect strategies exist, with the most common being: island style [7,15,18,20,21], nearest neighbor (NN) [6,22], network-on-chip (NoC) [23,24] and to a lesser extent linear interconnect [16,25]. Most (island style, NN, and NoC) are 2-D mesh structures which are quite similar to the architecture of FPGAs. ...
... Irrespective of the computational element (be it a processor or a dedicated processing element), CGRA-like overlays are characterized by an array structure of computational elements connected using programmable interconnect. A number of interconnect strategies exist, with the most common being: island style [7,15,18,20,21], nearest neighbor (NN) [6,22], network-on-chip (NoC) [23,24] and to a lesser extent linear interconnect [16,25], as shown in Figure 2.3. ...
Thesis
Full-text available
The benefits of FPGAs over processor-based systems have been well established; however, apart from specialist application domains, such as digital signal processing and communications, these platforms have not seen wide usage. Poor design productivity has been a key limiting factor, preventing the mainstream adoption of FPGAs and restricting their effective use to experts in hardware design. Coarse-grained overlay architectures have been proposed as a possible solution for improving design productivity by offering fast compilation and software-like programmability. These overlays can either be spatially configured (SC), with one complete functional unit (FU) allocated to each compute kernel operation and a routing network which is essentially static during computation, or multiplexed, with the FUs and interconnect being shared between kernel operations. This thesis examines an overlay architecture based on a simple linear interconnected array of time-multiplexed (TM) functional units. Sharing the FUs among kernel operations should significantly reduce the FPGA resource overhead compared to an SC overlay, which requires one FU for each operation along with a fully functional routing network to support connections to neighboring FUs. The linear interconnected array of TM FUs should also result in reduced instruction storage and interconnect resource requirements compared to other TM overlays, again resulting in a more area efficient overlay. In order to minimize the use of the fine-grained FPGA resource, we make use of the DSP block to design a fast, fully-pipelined, architecture-aware FU implementation, better targeting the capabilities of the FPGA. The results presented show a significant reduction of up to 85% in FPGA resource requirements compared to existing throughput oriented overlay architectures, with an operating frequency which approaches the theoretical limit for the FPGA device. A number of architectural enhancements are then proposed to improve the performance of the DSP block based FU. The overlay subsystem is then integrated into complete hardware accelerator systems, along with memory interfaces, to an ARM processor or a host CPU. To achieve this, we investigate two different memory solutions based on AXI and PCIe interfaces, namely Xillybus and RIFFA. The performance of these hardware accelerators for a range of benchmarks is investigated and performance results are presented. The proposed AXI-Xillybus-V3 overlay system is also compared to a state-of-the-art TM overlay, namely VectorBlox MXP. The comparison results show that the AXI-Xillybus-V3 achieves a very area efficient implementation at the expense of around half of the throughput (limited by AXI-Xillybus using a 32-bit bus compared to the 64-bit bus used by VectorBlox MXP). The proposed RIFFA-V3 overlay system shows a 3.6× better performance compared to the PCIe-Xillybus-V3, and a 5.7× better performance than AXI-Xillybus-V3, but at the cost of a larger BRAM consumption.
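The spatially-configured versus time-multiplexed trade-off described above can be illustrated with a toy schedule (hypothetical, not the thesis's DSP-block overlay): a handful of kernel operations are shared across two TM functional units cycle by cycle, whereas an SC overlay would dedicate one FU to each operation.

# Toy sketch of time-multiplexing functional units (FUs): the kernel's
# operations share a few FUs over time instead of each getting its own
# spatially configured FU.  Illustration of the sharing idea only.

kernel_ops = ["add", "mul", "sub", "add", "mul", "shift", "add"]  # 7 operations
num_fus = 2                                                       # shared TM FUs

schedule = {}                                  # (cycle, fu) -> operation
for slot, op in enumerate(kernel_ops):
    cycle, fu = divmod(slot, num_fus)
    schedule[(cycle, fu)] = op

for (cycle, fu), op in sorted(schedule.items()):
    print(f"cycle {cycle}: FU{fu} executes {op}")

# A spatially configured overlay would instead need len(kernel_ops) FUs,
# each executing a single fixed operation.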
... From Table 1, it can be seen that overlays with SC FUs and SC interconnect networks [6,10,11,15,25,30,33,36,75] comprise a significant group. In an SC overlay, a single operation node is mapped to an individual FU and data is shifted between FUs over a programmable, but temporally dedicated, point-to-point link. ...
... Irrespective of the computational element (be it a processor or a dedicated processing element), CGRA-like overlays are characterized by an array structure of computational elements connected using programmable interconnect. A number of interconnect strategies exist, with the most common being: island style [6,15,25,33,36], nearest neighbor (NN) [11,59], network-on-chip (NoC) [26,41,42] and to a lesser extent linear interconnect [10,16], as shown in Figure 2. Other interconnect strategies are possible, including circuit switched [31] networks, but these typically consume significant hardware resource and are less suited for FPGA-based overlays. There are also variations in the more common interconnect strategies. ...
Article
Full-text available
This article presents a comprehensive survey of time-multiplexed (TM) FPGA overlays from the research literature. These overlays are categorized based on their implementation into two groups: processor-based overlays, as their implementation follows that of conventional silicon-based microprocessors, and CGRA-like overlays, with either an array of interconnected processor-based functional units or medium-grained arithmetic functional units. Time-multiplexing the overlay allows it to change its behavior with a cycle-by-cycle execution of the application kernel, thus allowing better sharing of the limited FPGA hardware resource. However, most TM overlays suffer from large resource overheads, due to either the underlying processor-like architecture (for processor-based overlays) or the routing array and instruction storage requirements (for CGRA-like overlays). Reducing the area overhead for CGRA-like overlays, specifically that required for the routing network, and better utilizing the hard macros in the target FPGA are active areas of research.
... Therefore, recent spatial architectures use increasingly coarse-grained building blocks, such as ALUs, register files, and memory controllers, distributed in a programmable, word-level static interconnect. Several of these Coarse-Grained Reconfigurable Arrays (CGRAs) have recently been proposed [14,17,18,27,32,34,35,43,46]. ...
... Many previously proposed CGRAs use a word-level static interconnect, which has better compute density than bit-based routing [60]. CGRAs such as HRL [14], DySER [18], and Elastic CGRAs [23] commonly employ two static interconnects: a word-level interconnect to route data and a bit-level interconnect to route control signals. Several works have also proposed a statically scheduled interconnect [12,38,56] using a modulo schedule. ...
Conference Paper
Recent years have seen the increased adoption of Coarse-Grained Reconfigurable Architectures (CGRAs) as flexible, energy-efficient compute accelerators. Obtaining performance using spatial architectures while supporting diverse applications requires a flexible, high-bandwidth interconnect. Because modern CGRAs support vector units with wide datapaths, designing an interconnect that balances dynamism, communication granularity, and programmability is a challenging task. In this work, we explore the space of spatial architecture interconnect dynamism, granularity, and programmability. We start by characterizing several benchmarks' communication patterns and showing links' imbalanced bandwidth requirements, fanout, and data width. We then describe a compiler stack that maps applications to both static and dynamic networks and performs virtual channel allocation to guarantee deadlock freedom. Finally, using a cycle-accurate simulator and 28 nm ASIC synthesis, we perform a detailed performance, area, and power evaluation across the identified design space for a variety of benchmarks. We show that the best network design depends on both applications and the underlying accelerator architecture. Network performance correlates strongly with bandwidth for streaming accelerators, and scaling raw bandwidth is more area- and energy-efficient with a static network. We show that the application mapping can be optimized to move less data by using a dynamic network as a fallback from a high-bandwidth static network. This static-dynamic hybrid network provides a 1.8x energy-efficiency and 2.8x performance advantage over the purely static and purely dynamic networks, respectively.
... Tightly coupled accelerators: ADRES [33] and DySER [55] are CGRA-like accelerators tightly integrated into the processor pipeline and meant for accelerating the inner most loop bodies. Similarly, GARP [34] and Tartan [35] are tightly coupled within the processor and utilize FPGAlike and coarse-grained reconfigurable fabric, respectively, to provide the reconfigurability. ...
... Table V summarizes and classifies the related works in terms of their level of integration with the processor, targeted code granularity, heterogeneity of accelerator units, and whether they share accelerators across cores. Loose vs. tight integration with the processor pipeline naturally leads to different acceleration granularity: loosely coupled architectures, by taking advantage of the local data memory, enable acceleration of whole kernels [23,24,25,26], while the tightly coupled architectures support fine-grain acceleration (e.g., the innermost loop [55], operation-chains [9]). Stitch tightly integrates the patches with the processor pipeline to exploit the acceleration of operation-chains. ...
... The ESP architecture is a heterogeneous tile grid, connected by a 2D mesh NoC. Figure 2 shows one of the key types of tiles, the accelerator tile, with an instance of an example programmable accelerator. Programmable accelerators need to feature dedicated structures for instruction dispatch, scheduling, and retirement [11]. To avoid defining an ISA and developing control structures from scratch, open-source ISAs like RISC-V and accompanying core implementations, such as the Rocket Core [12], can be attractive for those developing programmable accelerators [13]. ...
Preprint
Full-text available
We present several enhancements to the open-source ESP platform to support flexible and efficient on-chip communication for programmable accelerators in heterogeneous SoCs. These enhancements include 1) a flexible point-to-point communication mechanism between accelerators, 2) a multicast NoC that supports data forwarding to multiple accelerators simultaneously, 3) accelerator synchronization leveraging the SoC's coherence protocol, 4) an accelerator interface that offers fine-grained control over the communication mode used, and 5) an example ISA extension to support our enhancements. Our solution adds negligible area to the SoC architecture and requires minimal changes to the accelerators themselves. We have validated most of these features in complex FPGA prototypes and plan to include them in the open-source release of ESP in the coming months.
... A Von Neumann PE only passively executes the configuration according to the instruction sequence. Some SAs [2,12,16,18,19,21,22,24,26,28,29,30,31,33,36,39,40,41,46,48,50,62] are controlled by a unified controller (main processor or co-processor), and some use counters [52,69] or finite state machines [14,59] to control the order of instructions. They both have controllers that construct sequential instruction flows. ...
Preprint
Spatial architecture is a high-performance architecture that uses control flow graphs and data flow graphs as the computational model and producer/consumer models as the execution models. However, existing spatial architectures suffer from control flow handling challenges. Upon categorizing their PE execution models, we find that they lack autonomous, peer-to-peer, and temporally loosely-coupled control flow handling capability. This leads to limited performance in control-intensive programs. A spatial architecture, Marionette, is proposed, with an explicitly designed control flow plane. The Control Flow Plane enables autonomous, peer-to-peer and temporally loosely-coupled control flow handling. The Proactive PE Configuration ensures timely and computation-overlapped configuration to improve the handling of Branch Divergence. The Agile PE Assignment enhances the pipeline performance of Imperfect Loops. We develop the full stack of Marionette (ISA, compiler, simulator, RTL) and demonstrate that, in a variety of challenging control-intensive programs, Marionette outperforms the state-of-the-art spatial architectures Softbrain, TIA, REVEL, and RipTide by geomean 2.88x, 3.38x, 1.55x, and 2.66x.
... Core Fusion allows cores in a CMP machine to be "fused" into a large powerful CPU or "split" into many small independent cores [29]. Govindaraju et al. [24] integrate a reconfigurable fabric into a programmable core as a set of functional units, mapping performance-critical operations onto the fabric. While this architecture allows the frequent computations to be accelerated, it does not focus on accelerating memory-bound workloads. ...
Thesis
The rise of cloud computing and deep machine learning in recent years has led to a tremendous growth of workloads that are not only large, but also have highly sparse representations. A large fraction of machine learning problems are formulated as sparse linear algebra problems in which the entries in the matrices are mostly zeros. Not surprisingly, optimizing linear algebra algorithms to take advantage of this sparseness is critical for efficient computation on these large datasets. This thesis presents a detailed analysis of several approaches to sparse matrix-matrix multiplication, a core computation of linear algebra kernels. While the arithmetic count of operations for the nonzero elements of the matrices is the same regardless of the algorithm used to perform matrix-matrix multiplication, there is significant variation in the overhead of navigating the sparse data structures to match the nonzero elements with the correct indices. This work explores approaches to minimize these overheads as well as the number of memory accesses for sparse matrices. To provide concrete numbers, the thesis examines inner product, outer product and row-wise algorithms on Transmuter, a flexible accelerator that can reconfigure its cache and crossbars at runtime to meet the demands of the program. This thesis shows how the reconfigurability of Transmuter can improve complex workloads that contain multiple kernels with varying compute and memory requirements, such as the Graphsage deep neural network and the Sinkhorn algorithm for optimal transport distance. Finally, we examine a novel Transmuter feature: register-to-register queues for efficient communication between its processing elements, enabling systolic-array-style computation for signal processing algorithms.
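As a rough, self-contained illustration of the index-matching overhead the thesis analyzes (this is generic Gustavson-style row-wise SpGEMM, not the Transmuter implementation), consider the following Python sketch on dictionary-of-rows sparse matrices:

# Illustrative row-wise (Gustavson-style) sparse matrix-matrix product on
# dict-of-rows storage.  The arithmetic on nonzeros is identical to any other
# SpGEMM formulation -- the cost differences come from how nonzero indices are
# matched and accumulated.

def spgemm_rowwise(A, B):
    """A, B: {row: {col: value}} sparse matrices; returns A @ B in same form."""
    C = {}
    for i, a_row in A.items():
        acc = {}                                  # accumulator for output row i
        for k, a_ik in a_row.items():             # each nonzero A[i,k] ...
            for j, b_kj in B.get(k, {}).items():  # ... scales row k of B
                acc[j] = acc.get(j, 0) + a_ik * b_kj
        if acc:
            C[i] = acc
    return C

A = {0: {0: 1, 2: 2}, 1: {1: 3}}
B = {0: {1: 4}, 1: {0: 5}, 2: {2: 6}}
print(spgemm_rowwise(A, B))   # {0: {1: 4, 2: 12}, 1: {0: 15}}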
... A few prior work reconfigure at the sub-core level [134,87,105,43,171] and the network-level [75,106,198,146]. In contrast, Transmuter uses native in-order cores and the reconfigurability lies in the memory and interconnect. ...
Thesis
The past decade has seen the breakdown of two important trends in the computing industry: Moore's law, an observation that the number of transistors in a chip roughly doubles every eighteen months, and Dennard scaling, which enabled the use of these transistors within a constant power budget. This has caused a surge in domain-specific accelerators, i.e. specialized hardware that delivers significantly better energy efficiency than general-purpose processors, such as CPUs. While the performance and efficiency of such accelerators are highly desirable, the fast pace of algorithmic innovation and non-recurring engineering costs have deterred their widespread use, since they are only programmable across a narrow set of applications. This has engendered a programmability-efficiency gap across contemporary platforms. A practical solution that can close this gap is thus lucrative and is likely to engender broad impact in both academic research and the industry. This dissertation proposes such a solution with a reconfigurable Software-Defined Hardware (SDH) system that morphs parts of the hardware on-the-fly to tailor to the requirements of each application phase. This system is designed to deliver near-accelerator-level efficiency across a broad set of applications, while retaining CPU-like programmability. The dissertation first presents a fixed-function solution to accelerate sparse matrix multiplication, which forms the basis of many applications in graph analytics and scientific computing. The solution consists of a tiled hardware architecture, co-designed with the outer product algorithm for Sparse Matrix-Matrix multiplication (SpMM), that uses on-chip memory reconfiguration to accelerate each phase of the algorithm. A proof-of-concept is then presented in the form of a prototyped 40 nm Complementary Metal-Oxide Semiconductor (CMOS) chip that demonstrates energy efficiency and performance per die area improvements of 12.6x and 17.1x over a high-end CPU, and serves as a stepping stone towards a full SDH system. The next piece of the dissertation enhances the proposed hardware with reconfigurability of the dataflow and resource sharing modes, in order to extend acceleration support to a set of common parallelizable workloads. This reconfigurability lends the system the ability to cater to discrete data access and compute patterns, such as workloads with extensive data sharing and reuse, workloads with limited reuse and streaming access patterns, among others. Moreover, this system incorporates commercial cores and a prototyped software stack for CPU-level programmability. The proposed system is evaluated on a diverse set of compute-bound and memory-bound kernels that compose applications in the domains of graph analytics, machine learning, image and language processing. The evaluation shows average performance and energy-efficiency gains of 5.0x and 18.4x over the CPU. The final part of the dissertation proposes a runtime control framework that uses low-cost monitoring of hardware performance counters to predict the next best configuration and reconfigure the hardware, upon detecting a change in phase or nature of data within the application. In comparison to prior work, this contribution targets multicore CGRAs, uses low-overhead decision tree based predictive models, and incorporates reconfiguration cost-awareness into its policies.
Compared to the best-average static (non-reconfiguring) configuration, the dynamically reconfigurable system achieves a 1.6x improvement in performance-per-Watt in the Energy-Efficient mode of operation, or the same performance with 23% lower energy in the Power-Performance mode, for SpMM across a suite of real-world inputs. The proposed reconfiguration mechanism itself outperforms the state-of-the-art approach for dynamic runtime control by up to 2.9x in terms of energy-efficiency.
... Such models can exploit more ILP than the strict von-Neumann architectures, but the block of executable instructions is limited by the technology, and the amount of parallelism they can expose is typically limited. TRIPS [93], TARTAN [94], and DySER [95] execute the instructions within a block in a Data-Flow manner, whereas the blocks are [96], TAM [97], ADARC [98], EARTH [99], MT. Monsoon [100], Plebbes [101], SDF [102], DDM [103], Carbon [104], and Task Superscalar [105]. ...
Thesis
Full-text available
The end of Dennard scaling [1], Moore’s law [2], and the resulting difficulty of increasing clock frequency forced the engineering community to shift to multi/many-core processors and multi-node systems as an alternative way to improve performance. An increased core number benefits many workloads, but programming limitations still reduce the performance due to not fully exploited parallelism. From this perspective, new execution models are arising to overcome such limitations to scale up performance. Execution models like Data-Flow can take advantage of the full parallelism, thanks to the possibility of creating many asynchronous threads that can run in parallel. These threads may encapsulate the data to be processed, their dependencies, and, once completed, write their output for other threads. Data-Flow Threads (DF-Threads) is a novel Data-Flow execution model for mapping threads on local or distributed cores transparently to the programmer [3]. This model is capable of being parallelized massively among different cores, and it handles even hundreds of thousands or more Data-Flow threads with their associated data regions. Further implementation and evaluation of the DF-Threads model (previously proposed by R. Giorgi [3]) are presented in this thesis. The proposed model is able to exploit the full parallelism of modern heterogeneous embedded architectures (e.g., the AXIOM-Board [4]). The work relies on introducing the ”Data-Flow engine” (DF-Engine), which is able to accelerate the function execution and spawn many asynchronous, data-driven threads across several general-purpose cores of a multi-core/node system. The DF-Engine can be placed either in software or directly implemented at the hardware level by using a heterogeneous architecture (e.g., the AXIOM-Board). The DF-Engine can handle the creation, the thread-dependency, and the locality of many fine-grain threads, leaving the general-purpose core to focus only on the execution of the threads. This implementation is a hybrid Data-Flow/von-Neumann model, which harnesses the parallelism and data synchronization inherent to Data-Flow, and yet maintains the programmability of the von-Neumann model. Starting from the DF-Threads execution model, we analyzed the tradeoffs of a minimalistic API to enable an efficient implementation, which can distribute the DF-Threads either locally across the cores of a single multi-core system or across the remote cores of a cluster. Implementing and evaluating the proposed model directly on a real architecture requires time, resources, and effort. Therefore, the design has been preliminarily evaluated in a simulation framework, and then the validated model has been gradually migrated into a real board in collaboration with my research group. The simulation framework presented in this thesis is based on the COTSon simulator [5] and on a set of tools named ”MY Design Space Exploration” (MYDSE) [6], which has been implemented and adopted by our research group. Then, we explain how the validation phase of the simulation framework has been performed against real architectures like x86_64 and AArch64. Moreover, we analyzed the impact of different Linux distributions on the execution. Afterward, we explain the workflow adopted to migrate the design of the DF-Threads execution model from the COTSon simulator to a High-Level Synthesis framework (e.g., Xilinx HLS), targeting a heterogeneous architecture such as the AXIOM-Board [4].
We used a driving example that models a two-way associative cache to demonstrate the simplicity and rapidness of our framework in migrating the design from the COTSon simulator to the HLS framework. The methodology has been adopted in the context of the AXIOM project [7], which helped our research team in reducing the development time from days/weeks to minutes/hours. In the end, we present the evaluation of the proposed DF-Threads execution model. We are interested in stressing and analyzing the efficiency of the DF-Engine with thousands or more Data-Flow threads. For this goal, we decided to use the Recursive Fibonacci algorithm, which gives us the possibility to generate such a high number of threads easily. Moreover, we want to study the behavior of the execution model with data-intensive applications for evaluating the performance with memory operations and data movements. For this reason, we adopted the Matrix Multiplication benchmark, which is the main computational kernel of widely used applications (e.g., Smart Home Living, Smart Video Surveillance, Artificial Intelligence). The proposed design has been evaluated against OpenMPI, which is typically adopted for cluster programming, and against CUDA, a parallel programming language for GPUs. DF-Threads achieve better performance-per-core compared to both OpenMPI and CUDA. In particular, OpenMPI exhibits much more Operating System (OS) kernel activity with respect to DF-Threads. This OS activity slows down the OpenMPI performance. If we consider the delivered GFLOPS per core, DF-Threads is also competitive with respect to CUDA.
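The recursive Fibonacci stress test mentioned above is convenient precisely because every fib(n) can be treated as a data-driven thread that fires once its two inputs arrive. The toy Python scheduler below captures that firing rule in software; it is only an illustration of the execution model, not the DF-Threads API or the DF-Engine.

# Toy data-driven scheduler in the spirit of a Data-Flow threading model
# (not the DF-Threads API): each fib(n) "thread" becomes ready only when the
# results of fib(n-1) and fib(n-2) have been written to its input frame.

from collections import deque

def fib_dataflow(n):
    waiting = {}                    # k -> number of missing inputs
    inputs  = {}                    # k -> input values received so far
    results = {}
    ready   = deque()

    def spawn(k):
        if k in waiting or k in results:
            return
        if k < 2:
            results[k] = k
            ready.append(k)         # base cases fire immediately
        else:
            waiting[k], inputs[k] = 2, []
            spawn(k - 1)
            spawn(k - 2)

    spawn(n)
    while ready:
        k = ready.popleft()
        for consumer in (k + 1, k + 2):          # threads that consume fib(k)
            if consumer in waiting:
                inputs[consumer].append(results[k])
                waiting[consumer] -= 1
                if waiting[consumer] == 0:       # all inputs arrived: fire
                    results[consumer] = sum(inputs[consumer])
                    ready.append(consumer)
                    del waiting[consumer]
    return results[n]

print(fib_dataflow(10))   # 55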
... While DNNs are highly advantageous, they require a notable amount of power for computation. Therefore, researchers have focused on specialized accelerators for DNNs [7], [8], [9], [10], and other workloads [11], [12], [13], [14], [15], [16], [17]. Application-specific integrated circuits (ASICs) provide significant performance and efficiency gains for DNNs [18], [19], [20], [21], but they may not cope with the ever-evolving DNN models. ...
Preprint
Nowadays, shallow and deep Neural Networks (NNs) have vast applications including biomedical engineering, image processing, computer vision, and speech recognition. Many researchers have developed hardware accelerators including field-programmable gate arrays (FPGAs) for implementing high-performance and energy efficient NNs. Apparently, the hardware architecture design process is specific and time-consuming for each NN. Therefore, a systematic way to design, implement and optimize NNs is highly demanded. The paper presents a systematic approach to implement state-space models in register transfer level (RTL), with special interest for NN implementation. The proposed design flow is based on the iterative nature of state-space models and the analogy between state-space formulations and finite-state machines. The method can be used in linear/nonlinear and time-varying/time-invariant systems. It can also be used to implement either intrinsically iterative systems (widely used in various domains such as signal processing, numerical analysis, computer arithmetic, and control engineering), or systems that could be rewritten in equivalent iterative forms. The implementation of recurrent NNs such as long short-term memory (LSTM) NNs, which have intrinsic state-space forms, is another major application for this framework. As a case study, it is shown that state-space systems can be used for the systematic implementation and optimization of NNs (as nonlinear and time-varying dynamic systems). RTL code-generating software is also provided online, which simplifies the automatic generation of NNs of arbitrary size.
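The analogy the abstract relies on, between a state-space recurrence and a cycle-by-cycle finite-state machine, can be seen in a few lines. The Python sketch below iterates x[k+1] = A x[k] + B u[k], y[k] = C x[k] + D u[k] for a made-up 2-state system; in the described flow, each iteration of this loop corresponds to one clock cycle of the generated RTL (the matrices here are illustrative only).

# Discrete-time state-space recurrence of the kind the described flow maps to
# RTL: the state register holds x[k], and one "clock cycle" computes
#     x[k+1] = A x[k] + B u[k]
#     y[k]   = C x[k] + D u[k]
# Plain-Python version (no numpy) to mirror a cycle-by-cycle FSM view.

def step(A, B, C, D, x, u):
    """One iteration of the state-space model for vectors x and u."""
    matvec = lambda M, v: [sum(m * vi for m, vi in zip(row, v)) for row in M]
    x_next = [a + b for a, b in zip(matvec(A, x), matvec(B, u))]
    y      = [c + d for c, d in zip(matvec(C, x), matvec(D, u))]
    return x_next, y

# A toy 2-state low-pass-like system driven by a unit step input.
A = [[0.9, 0.1], [0.0, 0.8]]
B = [[0.1], [0.2]]
C = [[1.0, 0.0]]
D = [[0.0]]

x = [0.0, 0.0]
for k in range(5):
    x, y = step(A, B, C, D, x, [1.0])
    print(k, [round(v, 4) for v in x], round(y[0], 4))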
... While simplified and partitioned hardware improves efficiency for these cases, further reduction of energy per computation is possible by diversifying the computation paths and tailoring each path to support specific functionality. Common forms of heterogeneity are fusing of functional units (e.g., clustering of instructions [65] or dataflow graph phases [66]), algorithmspecific functional units (e.g., tangent activation units for neural networks), and asymmetric memory banks and network topologies to accommodate irregular data patterns. ...
Conference Paper
Specializing chips using hardware accelerators has become the prime means to alleviate the gap between the growing computational demands and the stagnating transistor budgets caused by the slowdown of CMOS scaling. Much of the benefit of chip specialization stems from optimizing a computational problem within a given chip's transistor budget. Unfortunately, the stagnation of the number of transistors available on a chip will limit the accelerator design optimization space, leading to diminishing specialization returns, ultimately hitting an accelerator wall. In this work, we tackle the question: what are the limits of future accelerators and chip specialization? We do this by characterizing how current accelerators depend on CMOS scaling, based on a physical modeling tool that we constructed using datasheets of thousands of chips. We identify key concepts used in chip specialization, and explore case studies to understand how specialization has progressed over time in different applications and chip platforms (e.g., GPUs, FPGAs, ASICs). Utilizing these insights, we build a model which projects forward to see what future gains can and cannot be enabled from chip specialization. A quantitative analysis of specialization returns and technological boundaries is critical to help researchers understand the limits of accelerators and develop methods to surmount them.
... The DySER system integrates dataflow graph processing into the pipeline of a processor, essentially transforming the dataflow graph into a processor instruction [13,14]. The CPU instruction fetch and single memory access per instruction greatly limit the performance of DySER. ...
Conference Paper
Full-text available
In order to meet the ever-increasing speed differences between processor clocks and memory access times, there has been an interest in moving computation closer to memory. The near data processing or processing-in-memory is particularly suited for very high bandwidth memories such as the 3D-DRAMs. There are different ideas proposed for PIMs, including simple in-order processors, GPUs, specialized ASICs and reconfigurable designs. In our case, we use Coarse-Grained Reconfigurable Logic to build dataflow graphs for computational kernels as the PIM. We show that our approach can achieve significant speedups and save energy consumed by computations. We evaluated our designs using several processing technologies for building the coarse-grained logic units. The DFPIM concept showed good performance improvement and excellent energy efficiency for the streaming benchmarks that were analyzed. The DFPIM in a 28 nm process with an implementation in each of 16 vaults of a 3D-DRAM logic layer showed an average speed-up of 7.2 over that using 32 cores of an Intel Xeon server system. The server processor required 368 times more energy to execute the benchmarks than the DFPIM implementation.
... If we directly use it as a condition of the terminal mapping state, it may never be satisfied. Therefore, we add a relaxation factor (β) to (6) and it becomes ...
Article
Coarse-grained reconfigurable architectures (CGRAs) have drawn increasing attention due to their flexibility and energy efficiency. Data flow graphs (DFGs) are often mapped onto CGRAs for acceleration. The problem of DFG mapping is challenging due to the diverse structures of DFGs and the constrained hardware of CGRAs. Consequently, it is difficult to find a valid and high quality solution simultaneously. Inspired by the great progress in deep reinforcement learning for AI problems, we consider building methods that learn to map DFGs onto spatially-programmed CGRAs directly from experience. We propose RLMap, a solution that formulates DFG mapping on a CGRA as an agent in reinforcement learning (RL), which unifies placement, routing and PE insertion by interchange actions of the agent. Experimental results show that RLMap performs comparably to state-of-the-art heuristics in mapping quality, adapts to different architectures and converges quickly.
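To give a feel for the "interchange action" formulation (greatly simplified: plain greedy search stands in for RLMap's reinforcement-learning agent, and routing cost is just Manhattan distance on a toy grid), a minimal Python sketch:

# Toy placement-by-interchange on a CGRA-like grid, a stand-in for RLMap's RL
# agent: the "action" swaps the PEs assigned to two DFG nodes and is kept only
# if it does not increase the total Manhattan routing length of the DFG edges.

import random
random.seed(0)

grid_w, grid_h = 3, 3
pes = [(x, y) for x in range(grid_w) for y in range(grid_h)]

# A small dataflow graph: nodes 0..5, edges as (producer, consumer).
nodes = list(range(6))
edges = [(0, 2), (1, 2), (2, 3), (2, 4), (3, 5), (4, 5)]

def cost(placement):
    return sum(abs(placement[a][0] - placement[b][0]) +
               abs(placement[a][1] - placement[b][1]) for a, b in edges)

placement = dict(zip(nodes, random.sample(pes, len(nodes))))
best = cost(placement)
for _ in range(2000):                       # interchange "episodes"
    a, b = random.sample(nodes, 2)
    placement[a], placement[b] = placement[b], placement[a]
    c = cost(placement)
    if c <= best:
        best = c                            # keep the non-worsening swap
    else:
        placement[a], placement[b] = placement[b], placement[a]  # undo

print(best, placement)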
... To address this, considerable research exists on automated generation of specialized hardware targeting computationally intensive portions of applications, avoiding manual design and/or system integration [1], [2], [3], [4], [5], [6]. However, given that embedded systems are required to perform not one but a multitude of tasks, this hardware needs to either have some degree of programmability, or several different tailored circuits are required. ...
Article
The use of specialized accelerator circuits is a feasible solution to address performance and energy issues in embedded systems. This paper extends a previous field-programmable gate array-based approach that automatically generates pipelined customized loop accelerators (CLAs) from runtime instruction traces. Despite efficient acceleration, the approach suffered from high area and resource requirements when offloading a large number of kernels from the target application. This paper addresses this by enhancing the CLA with dynamic partial reconfiguration (DPR) support. Each kernel to accelerate is implemented as a variant of a reconfigurable area of the CLA which hosts all functional units and configuration memory. Evaluation of the proposed system is performed on a Virtex-7 device. We show, for a set of 21 kernels, that when comparing two CLAs capable of accelerating the same subset of kernels, the one which benefits from DPR can be up to 4.3× smaller. Resorting to DPR allows for the implementation of CLAs which support numerous kernels without a significant decrease in operating frequency and does not affect the initiation intervals at which kernels are scheduled. Finally, the area required by a CLA instance can be further reduced by increasing the IIs of the scheduled kernels.
... The shrinking horseman: Shrink down chips in order to reduce total static power and maintain performance efficiency. 2. The dim horseman: Use dim and sprinting operational modes such as temporal and spatial (but brief) boosts of chip resources, Coarse-Grained Reconfigurable Arrays (CGRAs) to reduce the multiplexing of processor datapaths, near-threshold voltage processing to diminish dynamic power, and exploitation of dark fractions of larger caches [20][21][22]. 3. The specialized horseman: Exploit architectural heterogeneity such as application-specific accelerators and specialized coprocessors with different power and performance characteristics [18][23][24][25][26]. 4. The deus ex machina horseman: Create novel physical materials such as low-leakage post-MOSFET transistors, e.g., the Tunnel Field Effect Transistor (TFET), which is based on tunneling effects, and Nano-Electro-Mechanical (NEM) switches, high-k metal gates, nonvolatile memory cells, and strained silicon. Unfortunately, these lack the easy scaling inherent to MOSFETs [27][28][29]. ...
Article
Stream processing, which involves real-time computation of data as it is created or received, is vital for various applications, specifically wireless communication. The evolving protocols, the requirement for high throughput, and the challenges of handling diverse processing patterns make it demanding. Traditional platforms grapple with meeting real-time throughput and latency requirements due to large data volume, sequential and indeterministic data arrival, and variable data rates, leading to inefficiencies in memory access and parallel processing. We present Canalis, a throughput-optimized framework designed to address these challenges, ensuring high performance while achieving low energy consumption. Canalis is a hardware-software co-designed system. It includes a programmable spatial architecture, FluxSPU (Flux Stream Processing Unit), proposed by this work to enhance data throughput and energy efficiency. FluxSPU is accompanied by a software stack that eases the programming process. We evaluated Canalis with eight distinct benchmarks. When compared to the CPU and GPU in a mobile SoC to demonstrate the effectiveness of domain specialization, Canalis achieves an average speedup of 13.4× and 6.6×, and energy savings of 189.8× and 283.9×, respectively. In contrast to equivalent ASICs of the benchmarks, the average energy overhead of Canalis is within 2.4×, successfully maintaining generalization without incurring significant overhead.
Article
We present a domain-adaptive processor, a programmable systolic-array processor designed for wireless communication and linear algebra workloads. It uses a globally homogeneous but locally heterogeneous architecture, uses decode-less reconfiguration instructions for data streaming, enables single-cycle data communication between functional units (FUs), and features lightweight nested-loop control for periodic execution. Our design demonstrates how configuration flexibility and rapid program loading enable a wide range of communication workloads to be mapped and swapped in less than a microsecond, supporting continually evolving communication standards such as 5G. A prototype chip with 256 cores is fabricated in a 12-nm FinFET process and has been verified. The measurement results show that it achieves 507 GMACs/J and a peak performance of 264 GMACs.
Chapter
An AI accelerator is a specialized hardware processing unit that provides higher throughput, lower latency, and higher energy efficiency compared to existing server-based processors available in the market. Examples of AI accelerators include NPUs, GPUs, FPGAs, and ASICs. Compared to other accelerators, ASICs are a much more efficient technology, as they consume very little power and can be readily customized for specific activities. AI accelerators can be used in cloud servers as well as in edge devices. Nowadays, the cloud provides an ideal environment for Machine Learning, as it gathers a massive amount of data from various sources. At the same time, edge computing or in-device computing is the ideal option for inference that requires quick output. AI accelerator architecture is necessary for advanced data centers to address the ever-increasing demands of processing and handling massive dataset workloads such as machine vision, deep learning, AI, etc. Moreover, it is necessary to consider the power consumed by the servers and the data center's power budget while designing AI accelerators. This chapter discusses various AI accelerators in the cloud, data centers, servers, and edge computing.
Chapter
Compute-intensive applications such as AI, bioinformatics, data centers and IoT have become the hot topic of our time, and these emerging applications are becoming increasingly demanding in terms of the computing power of SDCs. To meet the demanding computing power requirements of these applications, the scale of programmable computing resources in SDCs has increased rapidly. Therefore, how to use these computing resources conveniently and efficiently has gradually become a key issue affecting the application of SDCs. With the advent of large-scale programmable computing resources and emerging applications, the cost of manual mapping has seriously affected the development efficiency of users. Therefore, it is urgent to develop an automated compilation system for SDCs.
Chapter
Executing complex scientific applications on Coarse Grain Reconfigurable Arrays (CGRAs) promises execution time and/or energy consumption reduction compared to software execution or even customized hardware solutions. The compute core of CGRA architectures is a cell that typically consists of simple and generic hardware units, such as ALUs, simple processors, or even custom logic tailored to an application's specific characteristics. However, generality in the cell contents, while convenient for serving multiple applications, comes at a cost in execution acceleration and energy consumption.
Article
Dataflow architecture has proved promising in high-performance computing. Traditional dataflow architectures are not efficient enough for typical scientific applications such as stencil and FFT due to low utilization of function units. Based on the blocking and parallelism features of scientific applications, we design SPU, an efficient dataflow architecture for scientific applications. In SPU, dataflow graphs translated from the loop body of scientific applications are mapped to the Processing Element (PE) array. Iterations enter the dataflow graph in a pipelined fashion during execution, while three levels of parallelism are exploited to improve the utilization of function units in dataflow architectures: inner-graph parallelism, pipelining parallelism, and inter-graph parallelism. The experimental results show that the average energy efficiency of SPU achieves 25.97 GFlops/W in 40 nm technology and the utilization of floating point function units in SPU is 2.82x that of a typical dataflow architecture on average for typical scientific applications.
Article
Microprocessors are designed to provide good general performance across a range of benchmarks. As such, microarchitectural techniques which provide good speedup for only a small subset of applications are not attractive when designing a general-purpose core. We propose coupling a reconfigurable fabric with the CPU, on the same chip, via a simple and flexible interface to allow post-silicon development of application-specific microarchitectures. The interface supports observation and intervention at key pipeline stages of the CPU, so that exotic microarchitecture designs (with potentially narrow applicability) can be synthesized in the reconfigurable fabric and seem like components that were hardened into the core.
Chapter
In order to enable control units to run future algorithms, such as advanced control theory, advanced signal processing, data-based modeling, and physical modeling, the control units require a substantial step-up in computational power. In the case of an automotive Engine Control Unit (ECU), safety requirements and cost constraints are just as important. Existing solutions to increase the performance of a microcontroller are either not suitable for a subset of the expected algorithms, or too expensive in terms of area. Hence, we introduce the novel Data Flow Architecture (DFA) for embedded hardware accelerators. The DFA is flexible from the concept level down to the individual functional units to achieve a high performance per size ratio for a wide variety of data intensive algorithms. Compared to hardwired implementations, the area can be as little as 1.4 times higher at the same performance.
Conference Paper
Traditional process scheduling in the operating system focuses on high CPU utilization while achieving fairness among the processes. However, this can lead to an inefficient usage of other hardware resources, e.g., the caches, which have limited capacity and are a scarce resource on most systems. This paper extends a traditional operating system scheduler to schedule processes more efficiently against hardware resources. Through the introduction of a new concept, a progress period, which models the variation of resource access characteristics during application execution, our scheduling extension dynamically monitors the changes in resource access behavior of each process being scheduled, tracks their collective usage of hardware resources, and schedules the processes to decrease overall system power consumption without compromising performance. Testing this scheduling system on an Intel(R) Xeon(R) E5-2420 CPU with twelve kernels from the BLAS suite and five applications from the SPLASH-2 benchmark suite yielded a 48% maximum decrease in system energy consumption (average 12%), and a 1.88x maximum increase in application performance (average 1.16x).
Article
Non-traditional processing schemes continue to grow in popularity as a means to achieve high performance with greater energy-efficiency. Data-centric processing is one such scheme that targets functional specialization and memory bandwidth limitations, opening up small processors to wide memory IO. These functional-specific accelerators prove to be an essential component to achieve energy-efficiency and performance, but purely application-specific integrated circuit accelerators have expensive design overheads with limited reusability. We propose an architecture that combines existing processing schemes utilizing CGRAs for dynamic data path configuration as a means to add flexibility and reusability to data-centric acceleration. While flexibility adds a large energy overhead, performance can be regained through intelligent mappings to the CGRA for the functions of interest, while reusability can be gained through incrementally adding general purpose functionality to the processing elements. Building upon previous work accelerating sparse encoded neural networks, we present a CGRA architecture for mapping functional accelerators operating at 500 MHz in 32 nm. This architecture achieves a latency-per-function within 2× of its function-specific counterparts, with energy-per-operation increases between 21× and 188×, and energy-per-area increases between 1.8× and 3.6×.
Article
Distributed controlled coarse-grained reconfigurable arrays (CGRAs) enable efficient execution of irregular control flows by reconciling divergence in the processing elements (PEs). To further improve performance by better exploiting spatial parallelism, the triggered instruction architecture (TIA) eliminates the program counter and branch instructions by converting control flows into predicate dependencies as triggers. However, pipeline stalls, which occur in pipelines composed of both intra- and inter-PEs, remain a major obstacle to the overall performance. In fact, the stalls in distributed controlled CGRAs pose a unique problem that is difficult to resolve by previous techniques. This work presents a triggered-issuance and triggered-execution (TITE) paradigm in which the issuance and execution of instructions are separately triggered to further relax the predicate dependencies in TIA. In this paradigm, instructions are paired as dual instructions to eliminate stalls caused by control divergence. Tags that identify the data transmitted between PEs are forwarded for acceleration. As a result, pipeline stalls of both intra- and inter-PEs can be significantly minimized. Experiments show that TITE improves performance by 21%, energy efficiency by 17%, and area efficiency by 12% compared with a baseline TIA.
Article
Full-text available
As the number of transistors doubles, it becomes difficult to power all of them within a strict power budget and still achieve the performance gains that the industry has achieved historically. This work presents Navigo, a modeling framework for architecture exploration across future process technology generations. The model includes support for voltage and frequency scaling based on ITRS and PTM models. This work is designed to aid architects in the planning stages of next-generation microprocessors by addressing the space between early-stage back-of-the-envelope calculations and later-stage cycle-accurate simulators. Using parameters from existing commercial processor cores, we show how power consumption limits the theoretical throughput of future processors. Navigo shows that specialization is the answer to circumvent the power density limit that curbs performance gains and to resume traditional 1.58x performance growth trends. We present analysis, using the next generations of process technologies, showing that the fraction of area that must be allocated for specialization to maintain performance growth must increase with each new generation of process technology.
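The core argument, that power rather than area increasingly caps usable throughput, can be reproduced with back-of-the-envelope arithmetic. The numbers in the Python sketch below are invented for illustration and are not taken from Navigo or ITRS/PTM data:

# Back-of-the-envelope version of the "power limits throughput" argument
# (all numbers are invented; they are not Navigo's data).  With each process
# generation, core area halves but per-core power drops far more slowly, so a
# fixed power and area budget can light up a shrinking fraction of the cores
# that physically fit on the die.

power_budget_w = 100.0
die_area_mm2   = 300.0

core_area_mm2  = 10.0     # area of one core at the starting node
core_power_w   = 10.0     # power of one core at full frequency

for gen in range(5):
    cores_that_fit  = int(die_area_mm2 // core_area_mm2)
    cores_powerable = int(power_budget_w // core_power_w)
    active_fraction = min(1.0, cores_powerable / cores_that_fit)
    print(f"gen {gen}: fit={cores_that_fit:4d}  powered={cores_powerable:3d}  "
          f"active fraction={active_fraction:.2f}")
    core_area_mm2 *= 0.5      # ideal area scaling
    core_power_w  *= 0.75     # much weaker power scaling (post-Dennard)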
Conference Paper
Full-text available
PARSEC is a reference application suite used in industry and academia to assess new Chip Multiprocessor (CMP) designs. No investigation to date has profiled PARSEC on real hardware to better understand scaling properties and bottlenecks. This understanding is crucial in guiding future CMP designs for these kinds of emerging workloads. We use hardware performance counters, taking a systems-level approach and varying common architectural parameters: number of out-of-order cores, memory hierarchy configurations, number of multiple simultaneous threads, number of memory channels, and processor frequencies. We find these programs to be largely compute-bound, and thus limited by number of cores, micro-architectural resources, and cache-to-cache transfers, rather than by off-chip memory or system bus bandwidth. Half the suite fails to scale linearly with increasing number of threads, and some applications saturate performance at few threads on all platforms tested. Exploiting thread level parallelism delivers greater payoffs than exploiting instruction level parallelism. To reduce power and improve performance, we recommend increasing the number of arithmetic units per core, increasing support for TLP, and reducing support for ILP.
Conference Paper
Full-text available
Significant improvement to visual quality for real-time 3D graphics requires modeling of complex illumination effects like soft shadows, reflections, and diffuse lighting interactions. The conventional Z-buffer-algorithm-driven GPU model does not provide sufficient support for this improvement. This paper targets the entire graphics system stack and demonstrates algorithms, a software architecture, and a hardware architecture for real-time rendering with a paradigm shift to ray tracing. The three unique features of our system, called Copernicus, are support for dynamic scenes, high image quality, and execution on programmable multicore architectures. The focus of this paper is the synergy and interaction between applications, architecture, and evaluation. First, we describe the ray-tracing algorithms, which are designed to use redundancy and partitioning to achieve locality. Second, we describe the architecture, which uses ISA specialization and multithreading to hide memory delays and supports only local coherence. Finally, we develop an analytical performance model for our 128-core system, using measurements from simulation and a scaled-down prototype system. More generally, this paper addresses an important issue of mechanisms and evaluation for challenging workloads for future processors. Our results show that a single 8-core tile (each core 4-way multithreaded) can be almost 100% utilized and sustain 10 million rays/second. Sixteen such tiles, which can fit on a 240mm2 chip in 22nm technology, make up the system and, with our anticipated improvements in algorithms, can sustain real-time rendering. The mechanisms and the architecture can potentially support other domains like irregular scientific computations and physics computations.
Conference Paper
Full-text available
This paper presents and characterizes the Princeton Application Repository for Shared-Memory Computers (PARSEC), a benchmark suite for studies of Chip-Multiprocessors (CMPs). Previously available benchmarks for multiprocessors have focused on high-performance computing applications and used a limited number of synchronization methods. PARSEC includes emerging applications in recognition, mining and synthesis (RMS) as well as systems applications which mimic large-scale multi-threaded commercial programs. Our characterization shows that the benchmark suite is diverse in working set, locality, data sharing, synchronization, and off-chip traffic. The benchmark suite has been made available to the public.
Conference Paper
Full-text available
There are two basic models for the on-chip memory in CMP systems: hardware-managed coherent caches and software-managed streaming memory. This paper performs a direct comparison of the two models under the same set of assumptions about technology, area, and computational capabilities. The goal is to quantify how and when they differ in terms of performance, energy consumption, bandwidth requirements, and latency tolerance for general-purpose CMPs. We demonstrate that for data-parallel applications, the cache-based and streaming models perform and scale equally well. For certain applications with little data reuse, streaming scales better due to better bandwidth use and macroscopic software prefetching. However, the introduction of techniques such as hardware prefetching and non-allocating stores to the cache-based model eliminates the streaming advantage. Overall, our results indicate that there is not sufficient advantage in building streaming memory systems where all on-chip memory structures are explicitly managed. On the other hand, we show that streaming at the programming model level is particularly beneficial, even with the cache-based model, as it enhances locality and creates opportunities for bandwidth optimizations. Moreover, we observe that stream programming is actually easier with the cache-based model because the hardware guarantees correct, best-effort execution even when the programmer cannot fully regularize an application's code.
Conference Paper
Full-text available
Due to their high volume, general-purpose processors, and now chip multiprocessors (CMPs), are much more cost effective than ASICs, but lag significantly in terms of performance and energy efficiency. This paper explores the sources of these performance and energy overheads in general-purpose processing systems by quantifying the overheads of a 720p HD H.264 encoder running on a general-purpose CMP system. It then explores methods to eliminate these overheads by transforming the CPU into a specialized system for H.264 encoding. We evaluate the gains from customizations useful to broad classes of algorithms, such as SIMD units, as well as those specific to particular computation, such as customized storage and functional units. The ASIC is 500x more energy efficient than our original four-processor CMP. Broadly applicable optimizations improve performance by 10x and energy by 7x. However, the very low energy costs of actual core ops (100s fJ in 90nm) mean that over 90% of the energy used in these solutions is still "overhead". Achieving ASIC-like performance and efficiency requires algorithm-specific optimizations. For each sub-algorithm of H.264, we create a large, specialized functional unit that is capable of executing 100s of operations per instruction. This improves performance and energy by an additional 25x and the final customized CMP matches an ASIC solution's performance within 3x of its energy and within comparable area.
Article
Full-text available
The Wisconsin Multifacet Project has created a simulation toolset to characterize and evaluate the performance of multiprocessor hardware systems commonly used as database and web servers. We leverage an existing full-system functional simulation infrastructure (Simics [14]) as the basis around which to build a set of timing simulator modules for modeling the timing of the memory system and microprocessors. This simulator infrastructure enables us to run architectural experiments using a suite of scaled-down commercial workloads [3]. To enable other researchers to more easily perform such research, we have released these timing simulator modules as the Multifacet General Execution-driven Multiprocessor Simulator (GEMS) Toolset, release 1.0, under GNU GPL [9].
Article
Full-text available
In optimizing compilers, data structure choices directly influence the power and efficiency of practical program optimization. A poor choice of data structure can inhibit optimization or slow compilation to the point that advanced optimization features become undesirable. Recently, static single assignment form and the control dependence graph have been proposed to represent data flow and control flow properties of programs. Each of these previously unrelated techniques lends efficiency and power to a useful class of program optimization. Although both of these structures are attractive, the difficulty of their construction and their potential size have discouraged their use. We present new algorithms that efficiently compute these data structures for arbitrary control flow graphs. The algorithms use dominance frontiers, a new concept that may have other applications. We also give analytical and experimental evidence that all of these data structures are usually linear in the size of the original program. This paper thus presents strong evidence that these structures can be of practical use in optimization.
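Because the abstract hinges on dominance frontiers, a compact sketch may help. This is the widely used Cooper-Harvey-Kennedy formulation over precomputed immediate dominators, not necessarily the exact algorithm of the paper, and the CFG representation is an assumption for illustration.

    #define MAX_NODES 64

    typedef struct {
        int num_nodes;
        int idom[MAX_NODES];              /* immediate dominator of each node */
        int num_preds[MAX_NODES];
        int preds[MAX_NODES][MAX_NODES];  /* predecessor lists                */
        int df[MAX_NODES][MAX_NODES];     /* df[n][m] = 1 if m is in DF(n)    */
    } cfg_t;

    /* For every join point (>= 2 predecessors), walk up from each predecessor
     * toward the join's immediate dominator, adding the join to DF of every
     * node visited on the way. */
    static void compute_dominance_frontiers(cfg_t *g)
    {
        for (int n = 0; n < g->num_nodes; n++)
            for (int m = 0; m < g->num_nodes; m++)
                g->df[n][m] = 0;

        for (int n = 0; n < g->num_nodes; n++) {
            if (g->num_preds[n] < 2)
                continue;
            for (int p = 0; p < g->num_preds[n]; p++) {
                int runner = g->preds[n][p];
                while (runner != g->idom[n]) {
                    g->df[runner][n] = 1;
                    runner = g->idom[runner];
                }
            }
        }
    }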
Conference Paper
Full-text available
We describe LLVM (low level virtual machine), a compiler framework designed to support transparent, lifelong program analysis and transformation for arbitrary programs, by providing high-level information to compiler transformations at compile-time, link-time, run-time, and in idle time between runs. LLVM defines a common, low-level code representation in static single assignment (SSA) form, with several novel features: a simple, language-independent type-system that exposes the primitives commonly used to implement high-level language features; an instruction for typed address arithmetic; and a simple mechanism that can be used to implement the exception handling features of high-level languages (and setjmp/longjmp in C) uniformly and efficiently. The LLVM compiler framework and code representation together provide a combination of key capabilities that are important for practical, lifelong analysis and transformation of programs. To our knowledge, no existing compilation approach provides all these capabilities. We describe the design of the LLVM representation and compiler framework, and evaluate the design in three ways: (a) the size and effectiveness of the representation, including the type information it provides; (b) compiler performance for several interprocedural problems; and (c) illustrative examples of the benefits LLVM provides for several challenging compiler problems.
Article
Full-text available
Wire delay is emerging as the natural limiter to microprocessor scalability. A new architectural approach could solve this problem, as well as deliver unprecedented performance, energy efficiency and cost effectiveness. The Raw microprocessor research prototype uses a scalable instruction set architecture to attack the emerging wire-delay problem by providing a parallel, software interface to the gate, wire and pin resources of the chip. An architecture that has direct, first-class analogs to all of these physical resources will ultimately let programmers achieve the maximum amount of performance and energy efficiency in the face of wire delay.
Article
Spatial Computing (SC) has been shown to be an energy-efficient model for implementing program kernels. In this paper we explore the feasibility of using SC for more than small kernels. To this end, we evaluate the performance and energy efficiency of entire applications on Tartan, a general-purpose architecture which integrates a reconfigurable fabric (RF) with a superscalar core. Our compiler automatically partitions and compiles an application into an instruction stream for the core and a configuration for the RF. We use a detailed simulator to capture both timing and energy numbers for all parts of the system. Our results indicate that a hierarchical RF architecture, designed around a scalable interconnect, is instrumental in harnessing the benefits of spatial computation. The interconnect uses static configuration and routing at the lower levels and a packet-switched, dynamically-routed network at the top level. Tartan is most energy-efficient when almost all of the application is mapped to the RF, indicating the need for the RF to support most general-purpose programming constructs. Our initial investigation reveals that such a system can provide, on average, an order of magnitude improvement in energy-delay compared to an aggressive superscalar core on single-threaded workloads.
Article
Pre-execution is a promising latency tolerance technique that uses one or more helper threads running in spare hardware contexts ahead of the main computation to trigger long-latency memory operations early, hence absorbing their latency on behalf of the main computation. This paper investigates a source-to-source C compiler for extracting pre-execution thread code automatically, thus relieving the programmer or hardware from this onerous task. At the heart of our compiler are three algorithms. First, program slicing removes non-critical code for computing cache-missing memory references, reducing pre-execution overhead. Second, prefetch conversion replaces blocking memory references with non-blocking prefetch instructions to minimize pre-execution thread stalls. Finally, threading scheme selection chooses the best scheme for initiating pre-execution threads, speculatively parallelizing loops to generate thread-level parallelism when necessary for latency tolerance. We prototyped our algorithms using the Stanford University Intermediate Format (SUIF) framework and a publicly available program slicer, called Unravel [13], and we evaluated our compiler on a detailed architectural simulator of an SMT processor. Our results show compiler-based pre-execution improves the performance of 9 out of 13 applications, reducing execution time by 22.7%. Across all 13 applications, our technique delivers an average speedup of 17.0%. These performance gains are achieved fully automatically on conventional SMT hardware, with only minimal modifications to support pre-execution threads.
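To make the slicing and prefetch-conversion steps described above concrete, here is a hand-written C sketch of what an extracted helper thread might look like for a pointer-chasing loop. The example is illustrative only (the compiler in the paper generates such threads automatically), and it assumes a GCC-compatible compiler so that the real __builtin_prefetch builtin is available.

    typedef struct node {
        struct node *next;
        double payload[16];
    } node_t;

    /* Main computation: misses on n->payload as it chases the list. */
    double main_loop(node_t *head)
    {
        double sum = 0.0;
        for (node_t *n = head; n != NULL; n = n->next)
            sum += n->payload[0] * n->payload[1];
        return sum;
    }

    /* Pre-execution slice: only the address-generating code survives slicing,
     * and the actual data accesses are converted to non-blocking prefetches. */
    void prefetch_helper(node_t *head)
    {
        for (node_t *n = head; n != NULL; n = n->next)
            __builtin_prefetch(n->payload, /*rw=*/0, /*locality=*/1);
    }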
Conference Paper
SPEC CPU benchmarks are commonly used by compiler writers and architects of general purpose processors for performance evaluation. Since the release of the CPU89 suite, the SPEC CPU benchmark suites have evolved, with applications either removed, added, or upgraded. This influences the design decisions for the next generation of compilers and microarchitectures. In view of the above, it is critical to characterize the applications in the new suite - SPEC CPU2006 - to guide the decision making process. Although similar studies using the retired SPEC CPU benchmark suites have been done in the past, to the best of our knowledge, a thorough performance characterization of CPU2006 and its comparison with CPU2000 has not been done so far. In this paper, we present the above. For this, we compiled the applications in CPU2000 and CPU2006 using the Intel Fortran/C++ optimizing compiler and executed them, using the reference data sets, on the state-of-the-art Intel Core 2 Duo processor. The performance information was collected by using the Intel VTune performance analyzer that takes advantage of the built-in hardware performance counters to obtain accurate information on program behavior and its use of processor resources. The focus of this paper is on branch and memory access behavior, the well-known reasons for program performance problems. By analyzing and comparing the L1 data and L2 cache miss rates, branch prediction accuracy, and resource stalls, the performance impact in each suite is indirectly determined and described. Not surprisingly, the CPU2006 codes are larger, more complex, and have larger data sets. This leads to higher average L2 cache miss rates and a slight reduction in average IPC compared to the CPU2000 suite. Similarly, the average branch behavior is slightly worse in the CPU2006 suite. However, based on processor stall counts, branches are much less of a problem. The results presented here are a step towards understanding the SPEC CPU2006 benchmarks and will aid compiler writers in understanding the impact of currently implemented optimizations and in the design of new ones to address the new challenges presented by SPEC CPU2006. Similar opportunities exist for architecture optimization.
Conference Paper
Modern processors use CAM-based load and store queues (LQ/SQ) to support out-of-order memory scheduling and store-to-load forwarding. However, the LQ and SQ scale poorly for the sizes required for large-window, high-ILP processors. Past research has proposed ways to make the SQ more scalable by reorganizing the CAMs or using non-associative structures. In particular, the Store Queue Index Prediction (SQIP) approach allows load instructions to predict the exact SQ index of a sourcing store and access the SQ in a much simpler and more scalable RAM-based fashion. The reason why SQIP works is that loads that receive data directly from stores will usually receive the data from the same store each time. In our work, we take a slightly different view on the underlying observation used by SQIP: a store that forwards data to a load usually forwards to the same load each time. This subtle change in perspective leads to our "Fire-and-Forget" (FnF) scheme for load/store scheduling and forwarding that results in the complete elimination of the store queue. The idea is that stores issue out of the reservation stations like regular instructions, and any store that forwards data to a load will use a predicted LQ index to directly write the value to the LQ entry without any associative logic. Any mispredictions/misforwardings are detected by a low-overhead pre-commit re-execution mechanism. Our original goal for FnF was to design a more scalable memory scheduling microarchitecture than the previously proposed approaches without degrading performance. The relative infrequency of store-to-load forwarding, accurate LQ index prediction, and speculative cloaking actually combine to enable FnF to slightly out-perform the competition. Specifically, our simulation results show that our SQ-less Fire-and-Forget provides a 3.3% speedup over a processor using a conventional fully-associative SQ.
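A minimal sketch of the load-queue index prediction that Fire-and-Forget relies on might look like the following. Table size, hashing, and field names are assumptions chosen for illustration rather than the paper's actual structures.

    #include <stdint.h>

    #define PRED_ENTRIES 1024

    /* Indexed by hashed store PC: the LQ slot this store forwarded to last time. */
    typedef struct {
        uint16_t lq_index;
        uint8_t  valid;
    } lq_index_pred_t;

    static lq_index_pred_t pred_table[PRED_ENTRIES];

    static inline unsigned hash_pc(uint64_t pc) { return (unsigned)((pc >> 2) % PRED_ENTRIES); }

    /* On store issue: if a forwarding target is predicted, the value is written
     * straight into that load-queue entry, with no associative search.
     * Misforwardings are caught later by pre-commit re-execution. */
    static int predict_forward_target(uint64_t store_pc, uint16_t *lq_index)
    {
        lq_index_pred_t *e = &pred_table[hash_pc(store_pc)];
        if (!e->valid)
            return 0;               /* no prediction: the store only writes the cache */
        *lq_index = e->lq_index;
        return 1;
    }

    /* Training: when a store actually forwards to a load, remember the LQ slot. */
    static void train_forward_target(uint64_t store_pc, uint16_t lq_index)
    {
        lq_index_pred_t *e = &pred_table[hash_pc(store_pc)];
        e->lq_index = lq_index;
        e->valid = 1;
    }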
Conference Paper
We present Adaptive Stream Detection, a simple technique for modulating the aggressiveness of a stream prefetcher to match a workload's observed spatial locality. We use this concept to design a prefetcher that resides on an on-chip memory controller. The result is a prefetcher with small hardware costs that can exploit workloads with low amounts of spatial locality. Using highly accurate simulators for the IBM Power5+, we show that this prefetcher improves performance of the SPEC2006fp benchmarks by an average of 32.7% when compared against a Power5+ that performs no prefetching. On a set of 5 commercial benchmarks that have low spatial locality, this prefetcher improves performance by an average of 15.1%. When compared against a typical Power5+ that does perform processor-side prefetching, the average performance improvement of these benchmark suites is 10.2% and 8.4%. We also evaluate the power and energy impact of our technique. For the same benchmark suites, DRAM power consumption increases by less than 3%, while energy usage decreases by 9.8% and 8.2%, respectively. Moreover, the power consumption of the prefetcher itself is low; it is estimated to increase the power consumption of the Power5+ chip by 0.06%.
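The idea of modulating prefetch aggressiveness from observed spatial locality can be sketched as follows. The counters, thresholds, and depths here are illustrative assumptions, not the actual Power5+ memory-controller design.

    #include <stdint.h>

    typedef struct {
        uint64_t last_line;  /* last cache-line address seen in this stream */
        unsigned hits;       /* sequential accesses observed                */
        unsigned misses;     /* non-sequential accesses observed            */
        unsigned depth;      /* how many lines ahead to prefetch            */
    } stream_t;

    /* Called by the memory controller on each demand access (line address).
     * Returns the number of lines the controller should prefetch ahead. */
    static unsigned update_stream(stream_t *s, uint64_t line)
    {
        if (line == s->last_line + 1)
            s->hits++;
        else
            s->misses++;
        s->last_line = line;

        /* Periodically re-evaluate aggressiveness from the observed locality. */
        if (s->hits + s->misses >= 64) {
            if (s->hits * 4 > (s->hits + s->misses) * 3)       /* >75% sequential */
                s->depth = (s->depth < 8) ? s->depth + 1 : 8;  /* more aggressive */
            else if (s->hits * 4 < (s->hits + s->misses))      /* <25% sequential */
                s->depth = (s->depth > 0) ? s->depth - 1 : 0;  /* back off        */
            s->hits = s->misses = 0;
        }
        return s->depth;
    }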
Conference Paper
An architecture for high-performance scalar computation is proposed and discussed. The main feature of the architecture is a high degree of decoupling between operand access and execution. This results in an implementation that has two separate instruction streams that communicate via architectural queues. Performance comparisons with a conventional scalar architecture are given, and these show that significant performance gains can be realized. Single-instruction-stream versions, both physical and conceptual, are discussed, with the primary goal of minimizing the differences with conventional architectures. This allows known compilation and programming techniques to be used. Finally, the problem of deadlock in a decoupled system is discussed, and a deadlock prevention method is given.
Article
An architecture for improving computer performance is presented and discussed. The main feature of the architecture is a high degree of decoupling between operand access and execution. This results in an implementation which has two separate instruction streams that communicate via queues. A similar architecture has been previously proposed for array processors, but in that context the software is called on to do most of the coordination and synchronization between the instruction streams. This paper emphasizes implementation features that remove this burden from the programmer. Performance comparisons with a conventional scalar architecture are given, and these show that considerable performance gains are possible. Single instruction stream versions, both physical and conceptual, are discussed with the primary goal of minimizing the differences with conventional architectures. This would allow known compilation and programming techniques to be used. Finally, the problem of deadlock in such a system is discussed, and one possible solution is given.
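A software analogue of the decoupled access/execute organization described in the two abstracts above is two loops communicating through a queue: one performs all address arithmetic and loads, the other only consumes operands and computes. The sketch below is a sequential, single-threaded approximation using a bounded FIFO, written purely to illustrate the split; it is not the decoupled hardware itself, and the queue capacity is an arbitrary assumption.

    #include <stddef.h>

    #define QCAP 64

    typedef struct {
        double buf[QCAP];
        size_t head, tail, count;
    } fifo_t;

    static void fifo_push(fifo_t *q, double v)
    {
        q->buf[q->tail] = v;
        q->tail = (q->tail + 1) % QCAP;
        q->count++;
    }

    static double fifo_pop(fifo_t *q)
    {
        double v = q->buf[q->head];
        q->head = (q->head + 1) % QCAP;
        q->count--;
        return v;
    }

    /* The "access stream" runs ahead, filling the architectural queue; the
     * "execute stream" drains it. The alternation mimics the slip between
     * the two instruction streams of a decoupled machine. */
    static double decoupled_sum(const double *a, size_t n, size_t stride)
    {
        fifo_t q = {0};
        double sum = 0.0;
        size_t produced = 0, consumed = 0;

        while (consumed < n) {
            while (produced < n && q.count < QCAP)          /* access stream  */
                fifo_push(&q, a[produced++ * stride]);
            while (q.count > 0) {                           /* execute stream */
                sum += fifo_pop(&q);
                consumed++;
            }
        }
        return sum;
    }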
Article
Scaling the performance of a power limited processor requires decreasing the energy expended per instruction executed, since energy/op * op/second is power. To better understand what improvement in processor efficiency is possible, and what must be done to capture it, we quantify the sources of the performance and energy overheads of a 720p HD H.264 encoder running on a general-purpose four-processor CMP system. The initial overheads are large: the CMP was 500x less energy efficient than an Application Specific Integrated Circuit (ASIC) doing the same job. We explore methods to eliminate these overheads by transforming the CPU into a specialized system for H.264 encoding. Broadly applicable optimizations like single instruction, multiple data (SIMD) units improve CMP performance by 14x and energy by 10x, which is still 50x worse than an ASIC. The problem is that the basic operation costs in H.264 are so small that even with a SIMD unit doing over 10 ops per cycle, 90% of the energy is still overhead. Achieving ASIC-like performance and efficiency requires algorithm-specific optimizations. For each subalgorithm of H.264, we create a large, specialized functional/storage unit capable of executing hundreds of operations per instruction. This improves energy efficiency by 160x (instead of 10x), and the final customized CMP reaches the same performance and within 3x of an ASIC solution's energy in comparable area.
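The identity stated in the first sentence of this abstract (energy/op * op/second is power) can be written out with a small worked example; the numbers below are illustrative assumptions, not figures from the paper:

    P \;=\; E_{\mathrm{op}} \times R_{\mathrm{op}}
    \quad\Longrightarrow\quad
    R_{\mathrm{op}}^{\max} \;=\; \frac{P_{\mathrm{budget}}}{E_{\mathrm{op}}},
    \qquad\text{e.g.}\quad
    \frac{1\ \mathrm{W}}{100\ \mathrm{pJ/op}} \;=\; 10^{10}\ \mathrm{ops/s}.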
Article
For many applications, branch mispredictions and cache misses limit a processor's performance to a level well below its peak instruction throughput. A small fraction of static instructions, whose behavior cannot be anticipated using current branch predictors and caches, contribute a large fraction of such performance degrading events. This paper analyzes the dynamic instruction stream leading up to these performance degrading instructions to identify the operations necessary to execute them early. The backward slice (the subset of the program that relates to the instruction) of these performance degrading instructions, if small compared to the whole dynamic instruction stream, can be pre-executed to hide the instruction's latency. To overcome conservative dependence assumptions that result in large slices, speculation can be used, resulting in speculative slices. This paper provides an initial characterization of the backward slices of L2 data cache misses and branch mispredictions, and shows the effectiveness of techniques, including memory dependence prediction and control independence, for reducing the size of these slices. Through the use of these techniques, many slices can be reduced to less than one tenth of the full dynamic instruction stream when considering the 512 instructions before the performance degrading instruction.
Conference Paper
Performance improvement solely through transistor scaling is becoming more and more difficult; thus it is increasingly common to see domain specific accelerators used in conjunction with general purpose processors to achieve future performance goals. There is a serious drawback to accelerators, though: binary compatibility. An application compiled to utilize an accelerator cannot run on a processor without that accelerator, and applications that do not utilize an accelerator will never use it. To overcome this problem, we propose decoupling the instruction set architecture from the underlying accelerators. Computation to be accelerated is expressed using a processor's baseline instruction set, and light-weight dynamic translation maps the representation to whatever accelerators are available in the system. In this paper, we describe the changes to a compilation framework and processor system needed to support this abstraction for an important set of accelerator designs that support innermost loops. In this analysis, we investigate the dynamic overheads associated with abstraction as well as the static/dynamic tradeoffs to improve the dynamic mapping of loop-nests. As part of the exploration, we also provide a quantitative analysis of the hardware characteristics of effective loop accelerators. We conclude that using a hybrid static-dynamic compilation approach to map computation onto loop-level accelerators is a practical way to increase computation efficiency, without the overheads associated with instruction set modification.
Conference Paper
Application-specific instruction set extensions are an effective way of improving the performance of processors. Critical computation subgraphs can be accelerated by collapsing them into new instructions that are executed on specialized function units. Collapsing the subgraphs simultaneously reduces the length of computation as well as the number of intermediate results stored in the register file. The main problem with this approach is that a new processor must be generated for each application domain. While new instructions can be designed automatically, there is a substantial amount of engineering cost incurred to verify and to implement the final custom processor. In this work, we propose a strategy for transparent customization of the core computation capabilities of the processor without changing its instruction set. A configurable array of function units is added to the baseline processor that enables the acceleration of a wide range of data flow subgraphs. To exploit the array, the microarchitecture performs subgraph identification at run-time, replacing identified subgraphs with new microcode instructions to configure and utilize the array. We compare the effectiveness of replacing subgraphs in the fill unit of a trace cache versus using a translation table during decode, and evaluate the tradeoffs between static and dynamic identification of subgraphs for instruction set customization.
Conference Paper
Instruction set customization is an effective way to improve processor performance. Critical portions of application data-flow graphs are collapsed for accelerated execution on specialized hardware. Collapsing dataflow subgraphs compresses the latency along critical paths and reduces the number of intermediate results stored in the register file. While custom instructions can be effective, the time and cost of designing a new processor for each application is immense. To overcome this roadblock, this paper proposes a flexible architectural framework to transparently integrate custom instructions into a general-purpose processor. Hardware accelerators are added to the processor to execute the collapsed subgraphs. A simple microarchitectural interface is provided to support a plug-and-play model for integrating a wide range of accelerators into a pre-designed and verified processor core. The accelerators are exploited using an approach of static identification and dynamic realization. The compiler is responsible for identifying profitable subgraphs, while the hardware handles discovery, mapping, and execution of compatible subgraphs. This paper presents the design of a plug-and-play transparent accelerator system and evaluates the cost/performance implications of the design.
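To illustrate the kind of dataflow-subgraph collapsing discussed in the two abstracts above, the following C sketch shows a small computation and a hypothetical collapsed form. The single-function "custom instruction" is a modeling convenience; in hardware it would be one accelerator operation selected at run time rather than a C function.

    #include <stdint.h>

    /* Original code: a small dataflow subgraph (add, shift, XOR, add)
     * occupying several instructions and several intermediate registers. */
    static uint32_t kernel_scalar(uint32_t a, uint32_t b, uint32_t c)
    {
        uint32_t t0 = a + b;
        uint32_t t1 = t0 << 2;
        uint32_t t2 = t1 ^ c;
        return t2 + a;
    }

    /* Collapsed form: the whole subgraph becomes one operation mapped onto
     * the configurable function-unit array, shortening the critical path
     * and eliminating the intermediate register-file writes. */
    static inline uint32_t kernel_collapsed(uint32_t a, uint32_t b, uint32_t c)
    {
        return (((a + b) << 2) ^ c) + a;
    }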
Conference Paper
The high transistor density afforded by modern VLSI processes has enabled the design of embedded processors that use clustered execution units to deliver high levels of performance. However, delivering data to the execution resources in a timely manner remains a major problem that limits ILP. It is particularly significant for embedded systems where memory and power budgets are limited. A distributed address generation and loop acceleration architecture for VLIW processors is presented. This decentralized on-chip memory architecture uses multiple SRAMs to provide high intra-processor bandwidth. Each SRAM has an associated stream address generator capable of implementing a variety of addressing modes in conjunction with a shared loop accelerator. The architecture is extremely useful for generating application specific embedded processors, particularly for processing input data which is organized as a stream. The idea is evaluated in the context of a fine grain VLIW architecture executing complex perception algorithms such as speech and visual feature recognition. Transistor level Spice simulations are used to demonstrate a 159x improvement in the energy delay product when compared to conventional architectures executing the same applications.
Conference Paper
The need to process multimedia data places large computational demands on portable/embedded devices. These multimedia functions share common characteristics: they are computationally intensive and data-streaming, performing the same operation(s) on many data elements. The reconfigurable streaming vector processor (RSVP) is a vector coprocessor architecture that accelerates streaming data operations. Programming the RSVP architecture involves describing the shape and location of vector streams in memory and describing computations as data-flow graphs. These descriptions are intuitive and independent of each other, making the RSVP architecture easy to program. They are also machine independent, allowing binary-compatible implementations with varying cost-performance tradeoffs. This paper presents the RSVP architecture and programming model, a programming case study, and our first implementation. Our results show significant speedups on streaming data functions. Speedups for kernels and applications range from 2 to over 20 times that of an ARM9 host processor alone.
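A stream in this kind of programming model is described by its shape and location in memory. A minimal sketch of such a descriptor, and the address sequence it implies, is given below; the fields and their names are assumptions chosen for illustration and are not taken from the RSVP ISA.

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical vector-stream descriptor: a 2-D access pattern over memory. */
    typedef struct {
        uintptr_t base;       /* starting address                         */
        size_t    elem_size;  /* bytes per element                        */
        ptrdiff_t stride;     /* elements between consecutive accesses    */
        size_t    span;       /* accesses before skipping to the next row */
        ptrdiff_t skip;       /* elements skipped between rows            */
        size_t    count;      /* total number of elements in the stream   */
    } stream_desc_t;

    /* Generate the address of the i-th element implied by the descriptor. */
    static uintptr_t stream_addr(const stream_desc_t *s, size_t i)
    {
        size_t row = i / s->span;
        size_t col = i % s->span;
        ptrdiff_t elems = (ptrdiff_t)row * ((ptrdiff_t)s->span * s->stride + s->skip)
                        + (ptrdiff_t)col * s->stride;
        return s->base + (uintptr_t)(elems * (ptrdiff_t)s->elem_size);
    }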
Conference Paper
Reconfigurable hardware has the potential for significant performance improvements by providing support for application-specific operations. We report our experience with Chimaera, a prototype system that integrates a small and fast reconfigurable functional unit (RFU) into the pipeline of an aggressive, dynamically-scheduled superscalar processor. Chimaera is capable of performing 9-input/1-output operations on integer data. We discuss the Chimaera C compiler that automatically maps computations for execution in the RFU. Chimaera is capable of: (1) collapsing a set of instructions into RFU operations, (2) converting control-flow into RFU operations, and (3) supporting a more powerful fine-grain data-parallel model than that supported by current multimedia extension instruction sets (for integer operations). Using a set of multimedia and communication applications, we show that, even with simple optimizations, the Chimaera C compiler is able to map 22% of all instructions to the RFU on average. A variety of computations are mapped into RFU operations, ranging from simple add/sub-shift pairs to operations of more than 10 instructions including several branches. Timing experiments demonstrate that for a 4-way out-of-order superscalar processor Chimaera results in average performance improvements of 21%, assuming a very aggressive core processor design (most pessimistic RFU latency model) and communication overheads from and to the RFU.
Conference Paper
As various portable systems get popular, the reduction of the power dissipation in LSIs is becoming more essential. The scaling down of both the supply voltage and threshold voltage is effective in reducing the power without a serious degradation of operating speed. The static leakage current, however, enlarges the power in the sleep period when the LSI is not operating. To avoid such undesirable leakage, two methods have been reported recently. One is the multi-threshold (MT) CMOS which utilizes dual threshold voltages for both p- and n-channel transistors. This method, however, requires some means of holding the latched data in the sleep period, which increases the design complexity and the chip area. The other is the variable-threshold (VT) CMOS which controls the backgate bias to increase the threshold voltage of transistors during the sleep period. Although it holds the latched data, it requires a triple-well structure and an additional circuit to control the substrate bias. We propose an auto-backgate-controlled MTCMOS (ABC-MT-CMOS) circuit that holds the latched data in the sleep period with a simple circuit. In this paper, we present the circuit, the layout method and the application of the circuit to a 32-bit RISC microprocessor
Conference Paper
Typical reconfigurable machines exhibit shortcomings that make them less than ideal for general-purpose computing. The Garp Architecture combines reconfigurable hardware with a standard MIPS processor on the same die to retain the better features of both. Novel aspects of the architecture are presented, as well as a prototype software environment and preliminary performance results. Compared to an UltraSPARC, a Garp of similar technology could achieve speedups ranging from a factor of 2 to as high as a factor of 24 for some useful applications
Article
The Burroughs Scientific Processor (BSP), a high-performance computer system, performed the Department of Energy LLL loops at roughly the speed of the CRAY-1. The BSP combined parallelism and pipelining, performing memory-to-memory operations. Seventeen memory units and two crossbar switch data alignment networks provided conflict-free access to most indexed arrays. Fast linear recurrence algorithms provided good performance on constructs that some machines execute serially. A system manager computer ran the operating system and a vectorizing Fortran compiler. An MOS file memory system served as a high bandwidth secondary memory.
Article
Microprocessor designs are on the verge of a post-RISC era in which companies must introduce new ISAs to address the challenges that modern CMOS technologies pose while also exploiting the massive levels of integration now possible. To meet these challenges, we have developed a new class of ISAs, called explicit data graph execution (EDGE), that will match the characteristics of semiconductor technology over the next decade. The TRIPS architecture is the first instantiation of an EDGE instruction set, a new, post-RISC class of instruction set architectures intended to match semiconductor technology evolution over the next decade, scaling to new levels of power efficiency and high performance.
T. R. Halfhill. Ambric's New Parallel Processor - Globally Asynchronous Architecture Eases Parallel Programming. Microprocessor Report, October 2006.
M. Baron. The single-chip cloud computer. Microprocessor Report, April 2010.
V. Govindaraju, C. Ho, and K. Sankaralingam. Design and evaluation of dynamically specialized datapaths with the DySER architecture. Technical report, The University of Wisconsin-Madison, Department of Computer Sciences.
T. R. Halfhill. MathStar Challenges FPGAs. Microprocessor Report.