ABSTRACT: Hybrid nodes with hardware accelerators are becoming very common in systems today. Users often find it difficult to characterize and understand the performance advantage of such accelerators for their applications. The SPEC High Performance Group (HPG) has developed a set of performance metrics to evaluate the performance and power consumption of accelerators for various science applications. The new benchmark comprises two suites of applications written in OpenCL and OpenACC and measures the performance of accelerators with respect to a reference platform. The first set of published results demonstrates the viability and relevance of the new metrics in comparing accelerator performance. This paper discusses the benchmark suites and selected published results in great detail.
ABSTRACT: Transient voltage noise, including resistive and reactive noise, causes timing errors at runtime. We introduce a heuristic framework---Walking Pads---to minimize transient voltage violations by optimizing power supply pad placement. We show that the steady-state optimal design point differs from the transient optimum, and further noise reduction can be achieved with transient optimization. Our methodology significantly reduces voltage violations by balancing the average transient voltage noise of the four branches at each pad site. When we optimize pad placement using a representative stressmark, voltage violations are reduced 46-80% across 11 Parsec benchmarks with respect to the results from IR-drop-optimized pad placement. We also show that the allocation of on-chip decoupling capacitance significantly influences the optimal locations of pads.
ABSTRACT: Due to non-ideal technology scaling, delivering a stable supply voltage is increasingly challenging. Furthermore, competition for limited chip interface resources (i.e., C4 pads) between power supply and I/O, and the loss of such resources to electromigration, means that constructing a power delivery network (PDN) that satisfies noise margins without compromising performance is and will remain a critical problem for architects and circuit designers alike. Simple guardbanding will no longer work, as the consequent performance penalty will grow with technology scaling. In this paper, we develop a pre-RTL PDN model, VoltSpot, for the purpose of studying the performance and noise tradeoffs among power supply and I/O pad allocation, the effectiveness of noise mitigation techniques, and the consequent implications of electromigration-induced PDN pad failure. Our simulations demonstrate that, despite their integral role in the PDN, power/ground pads can be aggressively reduced (by conversion into I/O pads) to their electromigration limit with minimal performance impact from extra voltage noise - provided the system implements a suitable noise-mitigation strategy. The key observation is that even though reducing power/ground pads significantly increases the number of voltage emergencies, the average noise amplitude increase is small. Overall, we can triple I/O bandwidth while maintaining target lifetimes and incurring only 1.5% slowdown.
2014 ACM/IEEE 41st International Symposium on Computer Architecture (ISCA); 06/2014
ABSTRACT: To meet performance goals at the lowest possible cost, reconfigurable SIMD/MIMD architectures have emerged to exploit application parallelism. In this paper, we investigate the energy and flexibility tradeoffs of such architectures by designing our own reconfigurable SIMD/MIMD system, ParaFlex, using simple in-order processor components and evaluating the associated design decisions. We observe that, unlike traditional SIMD designs, ParaFlex is most energy efficient when only the instruction cache is shared by units performing data-parallel execution.
2014 IEEE 22nd Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM); 05/2014
ABSTRACT: Resilience to hardware failures is a key challenge for a large class of future computing systems that are constrained by the so-called power wall: from embedded systems to supercomputers. Today's mainstream computing systems typically assume that transistors and interconnects operate correctly during useful system lifetime. With enormous complexity and significantly increased vulnerability to failures compared to the past, future system designs cannot rely on such assumptions. At the same time, there is explosive growth in our dependency on such systems. To overcome this outstanding challenge, this paper advocates and examines a cross-layer resilience approach. Two major components of this approach are: 1. System and software-level effects of circuit-level faults are considered from early stages of system design; and, 2. resilience techniques are implemented across multiple layers of the system stack - from circuit and architecture levels to runtime and applications - such that they work together to achieve required degrees of resilience in a highly energy-efficient manner. Illustrative examples to demonstrate key aspects of cross-layer resilience are discussed.
2014 International Symposium on VLSI Technology, Systems and Application (VLSI-TSA); 04/2014
ABSTRACT: Near-threshold operation can increase the number of simultaneously active cores at the expense of much lower operating frequency ("dim silicon"), but dim cores suffer from diminishing returns as the number of cores increases. At this point, hardware accelerators become more efficient alternatives. To explore such a broad design space, the authors present an analytical model to quantify the performance limits of many-core, heterogeneous systems operating at near-threshold voltage. The model augments Amdahl's law with detailed scaling of frequency and power, calibrated by circuit-level simulations using a modified Predictive Technology Model (PTM), and factors in the effects of process variations. Results show that dim cores do indeed boost throughput, even in the presence of process variations, but significant benefits are achieved only in applications with high parallelism or novel architectures to mitigate variation. Reconfigurable logic that supports a variety of accelerators is more beneficial than "dim cores" or dedicated, fixed-logic accelerators, unless the kernel targeted by fixed logic has overwhelming coverage across applications, or the speedup of the dedicated accelerator over the reconfigurable equivalent is significant.
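To make the "Amdahl's law with frequency scaling" idea in the abstract above concrete, here is a minimal sketch. It is not the authors' model, which also covers power scaling, PTM calibration, and process variation. The function name `amdahl_speedup` and the parameter `rel_frequency` are hypothetical, and the sketch assumes that the serial fraction runs on one full-speed core while the parallel fraction is spread over `n_cores` cores that each run at `rel_frequency` times the nominal clock.

```python
def amdahl_speedup(parallel_fraction, n_cores, rel_frequency=1.0):
    """Speedup over one full-speed core (illustrative assumption only).

    parallel_fraction: share of the work that can run in parallel (0..1).
    n_cores:           number of cores that execute the parallel share.
    rel_frequency:     clock of those cores relative to nominal (1.0 = full
                       speed, below 1.0 = "dim" near-threshold cores).
    """
    serial_fraction = 1.0 - parallel_fraction
    # Classic Amdahl's law, with the parallel term slowed by the dim clock.
    parallel_time = parallel_fraction / (n_cores * rel_frequency)
    return 1.0 / (serial_fraction + parallel_time)


if __name__ == "__main__":
    # Same workload on a few full-speed cores vs. many dim cores.
    print(amdahl_speedup(0.9, 10))        # 10 cores at nominal frequency
    print(amdahl_speedup(0.9, 40, 0.5))   # 40 cores at half frequency
```

Under these assumptions, a workload that is 90% parallel reaches about 5.3x on 10 full-speed cores and about 6.9x on 40 half-speed cores. With a lower parallel fraction the gap narrows, which is in line with the abstract's remark that dim cores pay off mainly for highly parallel applications.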
ABSTRACT: The Svalinn framework provides comprehensive analysis of multibit error protection overheads to facilitate better architecture-level design choices. Supported protection techniques include hardening, parity, error-correcting code, parity prediction, residue codes, and spatial and temporal redundancy. The overheads of these are characterized via synthesis and, as a case study, presented here in the context of a simple OpenRISC core.
ABSTRACT: Fully utilizing the power of modern heterogeneous systems requires judiciously dividing work across all of the available computational devices. Existing approaches for partitioning work require offline training and generate fixed partitions that fail to respond to fluctuations in device performance that occur at run time. We present a novel dynamic approach to work partitioning that requires no offline training and responds automatically to performance variability to provide consistently good performance. Using six diverse OpenCL™ applications, we demonstrate the effectiveness of our approach in scenarios both with and without run-time performance variability, as well as in more extreme scenarios in which one device is non-functional.
Proceedings of the ACM International Conference on Computing Frontiers; 05/2013
ABSTRACT: GPUs have become popular recently to accelerate general-purpose data-parallel applications. However, most existing work has focused on GPU-friendly applications with regular data structures and access patterns. While a few prior studies have shown that some irregular workloads can also achieve speedups on GPUs, this domain has not been investigated thoroughly.
Workload Characterization (IISWC), 2013 IEEE International Symposium on; 01/2013
ABSTRACT: Process technology scaling, lagging supply voltage scaling, and the resulting exponential increase in power density, have made temperature a first-class design constraint in today's microprocessors. Prior work has shown that the silicon substrate acts as a spatial low-pass filter for temperature. This phenomenon, spatial thermal filtering, has clear implications for thermal management: depending on the size of dissipators, either design-time strategies, such as dividing and distributing functionality spatially, or runtime strategies, such as isolating functionality temporally (duty cycling), may be the most effective way to control peak temperature. To assist designers with such trade-offs, we have performed extensive analysis and simulation to evaluate the extent and effect of spatial filtering on thermal management in a number of microarchitecture design scenarios. We begin our exploration of spatial filtering with an analytical study of the heat conduction problem, followed by a series of studies to validate the effect and extent of spatial filtering under realistic system assumptions. In particular, we investigate the effect of power dissipator size, location, and aspect ratio in the context of high-performance computing. We then extend these experiments with two microarchitectural studies. First, we perform a study of spatial filtering in many-core architectures. Our results show that as cores shrink, the granularity of effective thermal management increases to the point that even turning cores on and off has a limited effect on peak temperature. Second, we investigate spatial filtering in caches. We discover that despite the size and aspect ratio of cache lines, pathological code behavior can heat caches to undesirable levels, accelerating wear-out.
Integration the VLSI Journal 01/2013; 46(1):44–56. · 0.41 Impact Factor
ABSTRACT: MOTIVATION: The comparison of diverse genomic datasets is fundamental to understanding genome biology. Researchers must explore many large datasets of genome intervals (e.g., genes, sequence alignments) to place their experimental results in a broader context and to make new discoveries. Relationships between genomic datasets are typically measured by identifying intervals that intersect: that is, they overlap and thus share a common genome interval. Given the continued advances in DNA sequencing technologies, efficient methods for measuring statistically significant relationships between many sets of genomic features are crucial for future discovery. RESULTS: We introduce the Binary Interval Search (BITS) algorithm, a novel and scalable approach to interval set intersection. We demonstrate that BITS outperforms existing methods at counting interval intersections. Moreover, we show that BITS is intrinsically suited to parallel computing architectures such as Graphics Processing Units (GPUs) by illustrating its utility for efficient Monte Carlo (MC) simulations measuring the significance of relationships between sets of genomic intervals. AVAILABILITY: https://github.com/arq5x/bits CONTACT: email@example.com.
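The following is a minimal sketch of how interval intersections can be counted with binary searches alone, in the spirit of the approach named in the abstract above. It is not taken from the BITS code base: the function name `count_intersections` is hypothetical, intervals are assumed to be closed `(start, end)` pairs on a single chromosome, and the sketch counts intersections without enumerating them.

```python
from bisect import bisect_left, bisect_right


def count_intersections(queries, database):
    """Count (query, database) interval pairs that overlap.

    Both arguments are lists of closed intervals (start, end) with
    start <= end. A database interval misses a query only if it ends
    before the query starts or starts after the query ends, so two
    binary searches per query are enough.
    """
    starts = sorted(start for start, _ in database)
    ends = sorted(end for _, end in database)
    total = 0
    for q_start, q_end in queries:
        # Database intervals whose end lies strictly before the query start.
        ending_before = bisect_left(ends, q_start)
        # Database intervals whose start lies strictly after the query end.
        starting_after = len(starts) - bisect_right(starts, q_end)
        total += len(database) - ending_before - starting_after
    return total


if __name__ == "__main__":
    db = [(1, 5), (4, 8), (10, 12)]
    print(count_intersections([(5, 9)], db))
```

Because every query is handled independently by the same two searches, the per-query work maps naturally onto one thread per query, which is one plausible reading of why the abstract describes the method as well suited to GPUs.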
ABSTRACT: General purpose GPU (GPGPU) programming frameworks such as OpenCL and CUDA allow running individual computation kernels sequentially on a device. However, in some cases it is possible to utilize device resources more efficiently by running kernels concurrently. This raises questions about load balancing and resource allocation that have not previously warranted investigation. For example, what kernel characteristics impact the optimal partitioning of resources among concurrently executing kernels? Current frameworks do not provide the ability to easily run kernels concurrently with fine-grained and dynamic control over resource partitioning. We present KernelMerge, a kernel scheduler that runs two OpenCL kernels concurrently on one device. KernelMerge furnishes a number of settings that can be used to survey concurrent or single kernel configurations, and to investigate how kernels interact and influence each other, or themselves. KernelMerge provides a concurrent kernel scheduler compatible with the OpenCL API. We present an argument on the benefits of running kernels concurrently. We demonstrate how to use KernelMerge to increase throughput for two kernels that efficiently use device resources when run concurrently, and we establish that some kernels show worse performance when running concurrently. We also outline a method for using KernelMerge to investigate how concurrent kernels influence each other, with the goal of predicting runtimes for concurrent execution from individual kernel runtimes. Finally, we suggest GPU architectural changes that would improve such concurrent schedulers in the future.
Proceedings of the 4th USENIX conference on Hot Topics in Parallelism; 06/2012
ABSTRACT: Microprocessor design has recently encountered many constraints such as power, energy, reliability, and temperature. Among these challenging issues, temperature-related issues have become especially important within the past several years. We summarize recent thermal management techniques for microprocessors, focusing on those that affect or rely on the microarchitecture. We categorize thermal management techniques into six main categories: temperature monitoring, microarchitectural techniques, floorplanning, OS/compiler techniques, liquid cooling techniques, and thermal reliability/security. Temperature monitoring, a requirement for Dynamic Thermal Management (DTM), includes temperature estimation and sensor placement techniques for accurate temperature measurement or estimation. Microarchitectural techniques include both static and dynamic thermal management techniques that control hardware structures. Floorplanning covers a range of thermal-aware floorplanning techniques for 2D and 3D microprocessors. OS/compiler techniques include thermal-aware task scheduling and instruction scheduling techniques. Liquid cooling techniques are higher-capacity alternatives to conventional air cooling techniques. Thermal reliability/security issues cover temperature-dependent reliability modeling, Dynamic Reliability Management (DRM), and malicious codes that specifically cause overheating. Temperature-related issues will only become more challenging as process technology continues to evolve and transistor densities scale up faster than power per transistor scales down. The overall objective of this survey is to give microprocessor designers a broad perspective on various aspects of designing thermal-aware microprocessors and to guide future thermal management studies.
ABSTRACT: Precisely predicting performance degradation due to colocating multiple executing applications on a single machine is critical for improving utilization in modern warehouse-scale computers (WSCs). Bubble-Up is the first mechanism for such precise prediction. As opposed to over-provisioning machines, Bubble-Up enables the safe colocation of multiple workloads on a single machine for Web service applications that have quality of service constraints, thus greatly improving machine utilization in modern WSCs.
ABSTRACT: Modern graphics processing units (GPUs) employ a large number of hardware threads to hide both function unit and memory access latency. Extreme multithreading requires a complex thread scheduler as well as a large register file, which is expensive to access both in terms of energy and latency. We present two complementary techniques for reducing energy on massively-threaded processors such as GPUs. First, we investigate a two-level thread scheduler that maintains a small set of active threads to hide ALU and local memory access latency and a larger set of pending threads to hide main memory latency. Reducing the number of threads that the scheduler must consider each cycle improves the scheduler’s energy efficiency. Second, we propose replacing the monolithic register file found on modern designs with a hierarchical register file. We explore various trade-offs for the hierarchy including the number of levels in the hierarchy and the number of entries at each level. We consider both a hardware-managed caching scheme and a software-managed scheme, where the compiler is responsible for orchestrating all data movement within the register file hierarchy. Combined with a hierarchical register file, our two-level thread scheduler provides a further reduction in energy by only allocating entries in the upper levels of the register file hierarchy for active threads. Averaging across a variety of real world graphics and compute workloads, the active thread count can be reduced by a factor of 4 with minimal impact on performance and our most efficient three-level software-managed register file hierarchy reduces register file energy by 54%.
ACM Transactions on Computer Systems - TOCS. 01/2012;
ABSTRACT: There has been a fundamental shift from ever more complex single cores to single chip multi-core (CMP) designs. Along with this opportunity come major challenges; notably the sheer size of the CMP design space. An integrated suite of tools is needed that provides life-cycle support from early prototyping to final design. Here we present ArchFP, a floorplanning tool targeted towards prototyping of pre-RTL CMP design concepts. As such, it is complementary to traditional floorplanners that are more appropriate later in the design cycle. An ArchFP floorplan is specified using a model similar to that supported by GUI toolkits such as Java Swing and Windows Presentation Foundation. The floorplan design is comprised of a hierarchy of components placed within containers, called Layout Managers (LMs), that provide a variety of layout algorithms. Current LMs include a generalized grid LM, one that supports geographic hints for component placement, and a fixed layout of imported subcomponents. ArchFP is easy to extend with additional LMs that leverage these initial layout algorithms to support more specific design elements such as NoC configurations, cache partitioning strategies, SIMD design, etc. To the best of our knowledge, no one has previously used this approach for floorplanning. ArchFP is written in C++ for UNIX systems, and is free for download from http://lava.cs.virginia.edu/archfp. We demonstrate the utility of ArchFP in the study of power delivery and temperature constraints in likely CMP configurations of Penryn-like cores over four technology scales.
VLSI and System-on-Chip (VLSI-SoC), 2012 IEEE/IFIP 20th International Conference on; 01/2012
ABSTRACT: Architectures that aggressively exploit SIMD often have many data paths execute in lockstep and use multi-threading to hide latency. They can yield high throughput in terms of area- and energy-efficiency for many data-parallel applications. To balance productivity and performance, many recent SIMD organizations incorporate implicit cache hierarchies. Examples of such architectures include Intel's MIC, AMD's Fusion, and NVIDIA's Fermi. However, unlike software-managed streaming memories used in conventional graphics processors (GPUs), hardware-managed caches are more disruptive to SIMD execution; therefore, the interaction between implicit caching and aggressive SIMD execution may no longer follow the conventional wisdom gained from streaming memories. We show that due to more frequent memory latency divergence, lower latency in non-L1 data accesses, and relatively unpredictable L1 contention, cache hierarchies favor different SIMD widths and multi-threading depths than streaming memories. In fact, because the above effects are subject to runtime dynamics, a fixed combination of SIMD width and multi-threading depth no longer works ubiquitously across diverse applications or when cache capacities are reduced due to pollution or power saving. To address the above issues and reduce design risks, this paper proposes Robust SIMD, which provides wide SIMD and then dynamically adjusts SIMD width and multi-threading depth according to performance feedback. Robust SIMD can trade wider SIMD for deeper multi-threading by splitting a wider SIMD group into multiple narrower SIMD groups. Compared to the performance generated by running every benchmark on its individually preferred SIMD organization, the same Robust SIMD organization performs similarly -- sometimes even better due to phase adaptation -- and outperforms the best fixed SIMD organization by 17%. When D-cache capacity is reduced due to runtime disruptiveness, Robust SIMD offers graceful performance degradation; with 25% polluted cache lines in a 32 KB D-cache, Robust SIMD performs 1.4× better compared to a conventional SIMD architecture.
ABSTRACT: As much of the world's computing continues to move into the cloud, the overprovisioning of computing resources to ensure the performance isolation of latency-sensitive tasks, such as web search, in modern datacenters is a major contributor to low machine utilization. Being unable to accurately predict performance degradation due to contention for shared resources on multicore systems has led to the heavy-handed approach of simply disallowing the co-location of high-priority, latency-sensitive tasks with other tasks. Performing this precise prediction has been a challenging and unsolved problem. In this paper, we present Bubble-Up, a characterization methodology that enables the accurate prediction of the performance degradation that results from contention for shared resources in the memory subsystem. By using a bubble to apply a tunable amount of "pressure" to the memory subsystem on processors in production datacenters, our methodology can predict the performance interference between co-located applications with an accuracy within 1% to 2% of the actual performance degradation. Using this methodology to arrive at "sensible" co-locations in Google's production datacenters with real-world large-scale applications, we can improve the utilization of a 500-machine cluster by 50% to 90% while guaranteeing a high quality of service of latency-sensitive applications.
ABSTRACT: The past few years have witnessed high-end processors with increasing numbers of cores and larger dies. With limited instruction-level parallelism, chip power constraints, and technology-scaling limitations, designers have embraced multiple cores rather than single-core performance scaling to improve chip throughput. This article examines whether this approach is sustainable by scaling from a state-of-the-art big-chip design point using analytical models.
ABSTRACT: Future general purpose architectures will scale to hundreds of cores. In order to accommodate both latency-oriented and throughput-oriented workloads, the system is likely to present a heterogeneous mix of cores. In particular, sequential code can achieve peak performance with an out-of-order core while parallel code achieves peak throughput over a set of simple, in-order (IO) or single-instruction, multiple-data (SIMD) cores. These large-scale, heterogeneous architectures form a prohibitively large design space, including not just the mix of cores, but also the memory hierarchy, coherence protocol, and on-chip network (OCN). Because of the abundance of potential architectures, an easily reconfigurable multicore simulator is needed to explore the large design space. We build a reconfigurable multicore simulator based on M5, an event-driven simulator originally targeting a network of processors.
Performance Analysis of Systems and Software (ISPASS), 2011 IEEE International Symposium on; 05/2011