[Show abstract][Hide abstract] ABSTRACT: Hybrid nodes with hardware accelerators are becoming very common in systems today. Users often find it di to characterize and understand the performance advantage of such accelerators for their applications. The SPEC High Performance Group (HPG) has developed a set of performance metrics to evaluate the performance and power consumption of accelerators for various science applications. The new benchmark comprises two suites of applications written in OpenCL and OpenACC and measures the performance of accelerators with respect to a reference platform. The first set of published results demonstrate the viability and relevance of the new metrics in comparing accelerator performance. This paper discusses the benchmark suites and selected published results in great detail.
[Show abstract][Hide abstract] ABSTRACT: Reliability for general purpose processing on the GPU (GPGPU) is becoming a weak link in the construction of reliable supercomputer systems. Because hardware protection is expensive to develop, requires dedicated on-chip resources, and is not portable across different architectures, the efficiency of software solutions such as redundant multithreading (RMT) must be explored. This paper presents a real-world design and evaluation of automatic software RMT on GPU hardware. We first describe a compiler pass that automatically converts GPGPU kernels into redundantly threaded versions. We then perform detailed power and performance evaluations of three RMT algorithms, each of which provides fault coverage to a set of structures in the GPU. Using real hardware, we show that compiler-managed software RMT has highly variable costs. We further analyze the individual costs of redundant work scheduling, redundant computation, and inter-thread communication, showing that no single component in general is responsible for high overheads across all applications; instead, certain workload properties tend to cause RMT to perform well or poorly. Finally, we demonstrate the benefit of architectural support for RMT with a specific example of fast, register-level thread communication.
2014 ACM/IEEE 41st International Symposium on Computer Architecture (ISCA); 06/2014
[Show abstract][Hide abstract] ABSTRACT: Due to non-ideal technology scaling, delivering a stable supply voltage is increasingly challenging. Furthermore, competition for limited chip interface resources (i.e., C4 pads) between power supply and I/O, and the loss of such resources to electromigration, means that constructing a power delivery network (PDN) that satisfies noise margins without compromising performance is and will remain a critical problem for architects and circuit designers alike. Simple guardbanding will no longer work, as the consequent performance penalty will grow with technology scaling. In this paper, we develop a pre-RTL PDN model, VoltSpot, for the purpose of studying the performance and noise tradeoffs among power supply and I/O pad allocation, the effectiveness of noise mitigation techniques, and the consequent implications of electromigration-induced PDN pad failure. Our simulations demonstrate that, despite their integral role in the PDN, power/ground pads can be aggressively reduced (by conversion into I/O pads) to their electromigration limit with minimal performance impact from extra voltage noise - provided the system implements a suitable noise-mitigation strategy. The key observation is that even though reducing power/ground pads significantly increases the number of voltage emergencies, the average noise amplitude increase is small. Overall, we can triple I/O bandwidth while maintaining target lifetimes and incurring only 1.5% slowdown.
2014 ACM/IEEE 41st International Symposium on Computer Architecture (ISCA); 06/2014
[Show abstract][Hide abstract] ABSTRACT: Transient voltage noise, including resistive and reactive noise, causes timing errors at runtime. We introduce a heuristic framework---Walking Pads---to minimize transient voltage violations by optimizing power supply pad placement. We show that the steady-state optimal design point differs from the transient optimum, and further noise reduction can be achieved with transient optimization. Our methodology significantly reduces voltage violations by balancing the average transient voltage noise of the four branches at each pad site. When we optimize pad placement using a representative stressmark, voltage violations are reduced 46-80% across 11 Parsec benchmarks with respect to the results from IR-drop-optimized pad placement. We also show that the allocation of on-chip decoupling capacitance significantly influences the optimal locations of pads.
[Show abstract][Hide abstract] ABSTRACT: To meet performance goals at the lowest possible cost, reconfigurable SIMD/MIMD architectures have emerged to exploit application parallelism. In this paper, we investigate the energy and flexibility tradeoffs of such architectures by designing our own reconfigurable SIMD/MIMD system, ParaFlex, using simple in-order processor components and evaluating the associated design decisions. We observe that, unlike traditional SIMD designs, ParaFlex is most energy efficient when only the instruction cache is shared by units performing data-parallel execution.
2014 IEEE 22nd Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM); 05/2014
[Show abstract][Hide abstract] ABSTRACT: Resilience to hardware failures is a key challenge for a large class of future computing systems that are constrained by the so-called power wall: from embedded systems to supercomputers. Today's mainstream computing systems typically assume that transistors and interconnects operate correctly during useful system lifetime. With enormous complexity and significantly increased vulnerability to failures compared to the past, future system designs cannot rely on such assumptions. At the same time, there is explosive growth in our dependency on such systems. To overcome this outstanding challenge, this paper advocates and examines a cross-layer resilience approach. Two major components of this approach are: 1. System and software-level effects of circuit-level faults are considered from early stages of system design; and, 2. resilience techniques are implemented across multiple layers of the system stack - from circuit and architecture levels to runtime and applications - such that they work together to achieve required degrees of resilience in a highly energy-efficient manner. Illustrative examples to demonstrate key aspects of cross-layer resilience are discussed.
2014 International Symposium on VLSI Technology, Systems and Application (VLSI-TSA); 04/2014
[Show abstract][Hide abstract] ABSTRACT: We propose a novel C4 pad placement optimization framework for 2D power delivery grids: Walking Pads (WP). WP optimizes pad locations by moving pads according to the “virtual forces” exerted on them by other pads and current sources in the system. WP algorithms achieve the same IR drop as state-of-the-art techniques, but are up to 634X faster. We further propose an analytical model relating pad count and IR drop for determining the optimal pad count for a given IR drop budget.
2014 19th Asia and South Pacific Design Automation Conference (ASP-DAC); 01/2014
[Show abstract][Hide abstract] ABSTRACT: The increasing computational needs of parallel applications inevitably require portability across parallel architectures, which now include heterogeneous processing resources, such as CPUs and GPUs, and multiple SIMD/SIMT widths. However, the lack of a common parallel programming paradigm that provides predictable, near-optimal performance on each resource leads to the use of low-level frameworks with architecture-specific optimizations, which in turn cause the code base to diverge and makes porting difficult. Our experiences with parallel applications and frameworks lead us to the conclusion that achieving performance portability requires a common set of high-level directives and efficient mapping onto each architecture.
Journal of Parallel and Distributed Computing 10/2013; 73(10):1400-1413. DOI:10.1016/j.jpdc.2013.07.001 · 1.18 Impact Factor
[Show abstract][Hide abstract] ABSTRACT: Near-threshold operation can increase the number of simultaneously active cores at the expense of much lower operating frequency ("dim silicon"), but dim cores suffer from diminishing returns as the number of cores increases. At this point, hardware accelerators become more efficient alternatives. To explore such a broad design space, the authors present an analytical model to quantify the performance limits of many-core, heterogeneous systems operating at near-threshold voltage. The model augments Amdahl's law with detailed scaling of frequency and power, calibrated by circuit-level simulations using a modified Predictive Technology Model (PTM), and factors in the effects of process variations. Results show that dim cores do indeed boost throughput, even in the presence of process variations, but significant benefits are achieved only in applications with high parallelism or novel architectures to mitigate variation. Reconfigurable logic that supports a variety of accelerators is more beneficial than "dim cores" or dedicated, fixed-logic accelerators, unless the kernel targeted by fixed logic has overwhelming coverage across applications, or the speedup of the dedicated accelerator over the reconfigurable equivalent is significant.
[Show abstract][Hide abstract] ABSTRACT: As circuit feature sizes shrink, multibit errors become more significant, while previously unprotected combinational logic becomes more vulnerable, requiring a reevaluation of the resiliency design space within a processor core. The authors present Svalinn, a framework that provides comprehensive analysis of multibit error protection overheads to facilitate better architecture-level design choices. Supported protection techniques include hardening, parity, error-correcting code, parity prediction, residue codes, and spatial and temporal redundancy. The overheads of these are characterized via synthesis and, as a case study, presented here in the context of a simple OpenRISC core. The analysis provided by Svalinn shows the difference in protection overheads per component and circuit category in terms of area, delay, and energy. The authors show that the contribution of logic components to the area of a simple core increases from 35 percent to as much as 54 percent with comprehensive multibit error protection. They also observe that the overhead of protection could increase from 29 percent to as much as 97 percent when transitioning from single-bit to multibit protection. Analysis of Svalinn also suggests that storage components will continue to benefit from the use of error-correcting code, whereas products requiring comprehensive coverage of logic components might use redundancy and residue codes. Optimal core-level protection will require novel combinations of these.
[Show abstract][Hide abstract] ABSTRACT: Fully utilizing the power of modern heterogeneous systems requires judiciously dividing work across all of the available computational devices. Existing approaches for partitioning work require offline training and generate fixed partitions that fail to respond to fluctuations in device performance that occur at run time. We present a novel dynamic approach to work partitioning that requires no offline training and responds automatically to performance variability to provide consistently good performance. Using six diverse OpenCL™ applications, we demonstrate the effectiveness of our approach in scenarios both with and without run-time performance variability, as well as in more extreme scenarios in which one device is non-functional.
Proceedings of the ACM International Conference on Computing Frontiers; 05/2013
[Show abstract][Hide abstract] ABSTRACT: Graphics processing units (GPUs) have become an important platform for general-purpose computing, thanks to their high parallel throughput and high memory bandwidth. GPUs present significantly different architectures from CPUs and require specific mappings and optimizations to achieve high performance. This makes GPU workloads demonstrate application characteristics different from those of CPU workloads. It is critical for researchers to understand the first-order metrics that most influence GPU performance and scalability. Furthermore, methodologies and associated tools are needed to analyze and predict the performance of GPU applications and help guide users' purchasing decisions. In this work, we study the approach of predicting the performance of GPU applications by correlating them to existing workloads. One tenet of benchmark design, also a motivation of this paper, is that users should be given the capability to leverage standard workloads to infer the performance of applications of their interest. We first identify a set of important GPU application characteristics and then use them to predict performance of an arbitrary application by determining its most similar proxy benchmarks. We demonstrate the prediction methodology and conduct predictions with benchmarks from different suites to achieve better workload coverage. The experimental results show that we are able to achieve satisfactory performance predictions, although errors are higher for outlier applications. Finally, we discuss several considerations for systematically constructing future benchmark suites.
International Journal of High Performance Computing Applications 05/2013; 28(2):238-250. DOI:10.1177/1094342013507960 · 1.48 Impact Factor
[Show abstract][Hide abstract] ABSTRACT: Process technology scaling, lagging supply voltage scaling, and the resulting exponential increase in power density, have made temperature a first-class design constraint in today's microprocessors. Prior work has shown that the silicon substrate acts as a spatial low-pass filter for temperature. This phenomenon, spatial thermal filtering, has clear implications for thermal management: depending on the size of dissipators, either design-time strategies, such as dividing and distributing functionality spatially, or runtime strategies, such as isolating functionality temporally (duty cycling), may be the most effective way to control peak temperature. To assist designers with such trade-offs, we have performed extensive analysis and simulation to evaluate the extent and effect of spatial filtering on thermal management in a number of microarchitecture design scenarios.We begin our exploration of spatial filtering with an analytical study of the heat conduction problem, followed by a series of studies to validate the effect and extent of spatial filtering under realistic system assumptions. In particular, we investigate the effect of power dissipator size, location, and aspect ratio in the context of high-performance computing. We then extend these experiments with two microarchitectural studies. First, we perform a study of spatial filtering in many-core architectures. Our results show that as cores shrink, the granularity of effective thermal management increases to the point that even turning cores on and off has a limited effect on peak temperature. Second, we investigate spatial filtering in caches. We discover that despite the size and aspect ratio of cache lines, pathological code behavior can heat caches to undesirable levels, accelerating wear-out.
[Show abstract][Hide abstract] ABSTRACT: GPUs have become popular recently to accelerate general-purpose data-parallel applications. However, most existing work has focused on GPU-friendly applications with regular data structures and access patterns. While a few prior studies have shown that some irregular workloads can also achieve speedups on GPUs, this domain has not been investigated thoroughly.
Workload Characterization (IISWC), 2013 IEEE International Symposium on; 01/2013
[Show abstract][Hide abstract] ABSTRACT: Motivation:
The comparison of diverse genomic datasets is fundamental to understand genome biology. Researchers must explore many large datasets of genome intervals (e.g. genes, sequence alignments) to place their experimental results in a broader context and to make new discoveries. Relationships between genomic datasets are typically measured by identifying intervals that intersect, that is, they overlap and thus share a common genome interval. Given the continued advances in DNA sequencing technologies, efficient methods for measuring statistically significant relationships between many sets of genomic features are crucial for future discovery.
We introduce the Binary Interval Search (BITS) algorithm, a novel and scalable approach to interval set intersection. We demonstrate that BITS outperforms existing methods at counting interval intersections. Moreover, we show that BITS is intrinsically suited to parallel computing architectures, such as graphics processing units by illustrating its utility for efficient Monte Carlo simulations measuring the significance of relationships between sets of genomic intervals.
[Show abstract][Hide abstract] ABSTRACT: General purpose GPU (GPGPU) programming frameworks such as OpenCL and CUDA allow running individual computation kernels sequentially on a device. However, in some cases it is possible to utilize device resources more efficiently by running kernels concurrently. This raises questions about load balancing and resource allocation that have not previously warranted investigation. For example, what kernel characteristics impact the optimal partitioning of resources among concurrently executing kernels? Current frameworks do not provide the ability to easily run kernels concurrently withne-grained and dynamic control over resource partitioning. We present KernelMerge, a kernel scheduler that runs two OpenCL kernels concurrently on one device. KernelMerge furnishes a number of settings that can be used to survey concurrent or single kernel configurations, and to investigate how kernels interact and influence each other, or themselves. KernelMerge provides a concurrent kernel scheduler compatible with the OpenCL API. We present an argument on the benefits of running kernels concurrently. We demonstrate how to use KernelMerge to increase throughput for two kernels that efficiently use device resources when run concurrently, and we establish that some kernels show worse performance when running concurrently. We also outline a method for using KernelMerge to investigate how concurrent kernels influence each other, with the goal of predicting runtimes for concurrent execution from individual kernel runtimes. Finally, we suggest GPU architectural changes that would improve such concurrent schedulers in the future.
Proceedings of the 4th USENIX conference on Hot Topics in Parallelism; 06/2012
[Show abstract][Hide abstract] ABSTRACT: Microprocessor design has recently encountered many constraints such as power, energy, reliability, and temperature. Among these challenging issues, temperature-related issues have become especially important within the past several years. We summarize recent thermal management techniques for microprocessors, focusing on those that affect or rely on the microarchitecture. We categorize thermal management techniques into six main categories: temperature monitoring, microarchitectural techniques, floorplanning, OS/compiler techniques, liquid cooling techniques, and thermal reliability/security. Temperature monitoring, a requirement for Dynamic Thermal Management (DTM), includes temperature estimation and sensor placement techniques for accurate temperature measurement or estimation. Microarchitectural techniques include both static and dynamic thermal management techniques that control hardware structures. Floorplanning covers a range of thermal-aware floorplanning techniques for 2D and 3D microprocessors. OS/compiler techniques include thermal-aware task scheduling and instruction scheduling techniques. Liquid cooling techniques are higher-capacity alternatives to conventional air cooling techniques. Thermal reliability/security issues cover temperature-dependent reliability modeling, Dynamic Reliability Management (DRM), and malicious codes that specifically cause overheating. Temperature-related issues will only become more challenging as process technology continues to evolve and transistor densities scale up faster than power per transistor scales down. The overall objective of this survey is to give microprocessor designers a broad perspective on various aspects of designing thermal-aware microprocessors and to guide future thermal management studies.
[Show abstract][Hide abstract] ABSTRACT: Architectures that aggressively exploit SIMD often have many data paths execute in lockstep and use multi-threading to hide latency. They can yield high through-put in terms of area- and energy-efficiency for many data-parallel applications. To balance productivity and performance, many recent SIMD organizations incorporate implicit cache hierarchies. Examples of such architectures include Intel's MIC, AMD's Fusion, and NVIDIA's Fermi. However, unlike software-managed streaming memories used in conventional graphics processors (GPUs), hardware-managed caches are more disruptive to SIMD execution, therefore the interaction between implicit caching and aggressive SIMD execution may no longer follow the conventional wisdom gained from streaming memories. We show that due to more frequent memory latency divergence, lower latency in non-L1 data accesses, and relatively unpredictable L1 contention, cache hierarchies favor different SIMD widths and multi-threading depths than streaming memories. In fact, because the above effects are subject to runtime dynamics, a fixed combination of SIMD width and multi-threading depth no longer works ubiquitously across diverse applications or when cache capacities are reduced due to pollution or power saving. To address the above issues and reduce design risks, this paper proposes Robust SIMD, which provides wide SIMD and then dynamically adjusts SIMD width and multi-threading depth according to performance feedback. Robust SIMD can trade wider SIMD for deeper multi-threading by splitting a wider SIMD group into multiple narrower SIMD groups. Compared to the performance generated by running every benchmark on its individually preferred SIMD organization, the same Robust SIMD organization performs similarly -- sometimes even better due to phase adaptation -- and out per-forms the best fixed SIMD organization by 17%. When D-cache capacity is reduced due to runtime disruptiveness, Robust SIMD offers graceful performance degradation, w- th 25% polluted cache lines in a 32 KB D-cache, Robust SIMD performs 1.4× better compared to a conventional SIMD architecture.
[Show abstract][Hide abstract] ABSTRACT: The physical channel is the element that consumes the largest amount of power in a traditional memory controller (MC). Wired-RF can potentially decrease the amount of power dissipated by replacing the physical memory channel by an RF-channel, just as optical memory systems do by replacing the physical memory channel by an optical-channel. Considering that RF transmission can potentially consume less power than a traditional bus for on-chip distances, we propose to replace the traditional digital MC physical channel by coupling RF transmitters (TX), receivers (RX), an RF quilt-packaging coplanar waveguide (CPW), and a quilt-to-to interconnect MCs and memory ranks on the same package in a multicore. We evaluate the proposed solution in terms of power and area employing ITRS  and RF predictions. Preliminary estimation shows that the proposed RF interface is able to save up to 57.3% in terms of area and up to 78.2% in terms of power consumption for next processor generations. Furthermore, considering a fixed area budget of one MC as a reference, the proposed interface can improve bandwidth up to 2.2x for an 8-core multiprocessor with 3 MCs and, assuming a fixed power budget of one MC, the proposed interface can improve bandwidth of up to 2.4x.
[Show abstract][Hide abstract] ABSTRACT: Previous work has shown that using the GPU as a brute force method for SELECT statements on a SQLite database table yields significant speedups. However, this requires that the entire table be selected and transformed from the B-Tree to row-column format. This paper investigates possible speedups by traversing B+ Trees in parallel on the GPU, avoiding the overhead of selecting the entire table to transform it into row-column format and leveraging the logarithmic nature of tree searches. We experiment with different input sizes, different orders of the B+ Tree, and batch multiple queries together to find optimal speedups for SELECT statements with single search parameters as well as range searches. We additionally make a comparison to a simple GPU brute force algorithm on a row-column version of the B+ Tree.