Allan Snavely

University of California, San Diego, San Diego, California, United States

Publications (101) · 11.21 total impact

  •
    ABSTRACT: The current state of practice in supercomputer resource allocation places jobs from different users on disjoint nodes both in terms of time and space. While this approach largely guarantees that jobs from different users do not degrade one another's performance, it does so at high cost to system throughput and energy efficiency. This focused study presents job striping, a technique that significantly increases performance over the current allocation mechanism by colocating pairs of jobs from different users on a shared set of nodes. To evaluate the potential of job striping in large-scale environments, the experiments are run at the scale of 128 nodes on the state-of-the-art Gordon supercomputer. Across all pairings of 1024-process NAS Parallel Benchmarks, job striping increases mean throughput by 26% and mean energy efficiency by 22%. On pairings of the real applications Gyrokinetic Toroidal Code (GTC), Large-scale Atomic/Molecular Massively Parallel Simulator (LAMMPS), and MIMD Lattice Computation (MILC) at equal scale, job striping improves average throughput by 12% and mean energy efficiency by 11%. In addition, the study provides a simple set of heuristics for avoiding low performing application pairs. Copyright © 2013 John Wiley & Sons, Ltd.
    Concurrency and Computation: Practice and Experience 12/2013; DOI:10.1002/cpe.3187 · 0.78 Impact Factor
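    The paper's pairing heuristics are not reproduced here; the hypothetical sketch below only illustrates the general kind of rule involved, rejecting co-location candidates whose combined memory-bandwidth demand exceeds a budget. The job profiles, numbers, and threshold are all invented for illustration.

    ```python
    # Hypothetical sketch (not the paper's heuristics): avoid co-locating two
    # memory-bandwidth-bound jobs, on the assumption that such pairs are the
    # most likely to degrade each other under job striping.
    from itertools import combinations

    # Illustrative per-job profiles; "mem_bw" is the fraction of peak memory
    # bandwidth the job uses when running alone (assumed inputs).
    jobs = {
        "GTC":    {"mem_bw": 0.35},   # hypothetical numbers
        "LAMMPS": {"mem_bw": 0.35},
        "MILC":   {"mem_bw": 0.80},
        "NPB.CG": {"mem_bw": 0.75},
    }

    BW_BUDGET = 1.2  # combined-bandwidth threshold above which a pair is rejected

    def striping_candidates(jobs, budget=BW_BUDGET):
        """Return job pairs whose combined bandwidth demand stays under budget."""
        ok = []
        for a, b in combinations(jobs, 2):
            if jobs[a]["mem_bw"] + jobs[b]["mem_bw"] <= budget:
                ok.append((a, b))
        return ok

    print(striping_candidates(jobs))
    ```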
  •
    ABSTRACT: This article presents Green Queue, a production quality tracing and analysis framework for implementing application aware dynamic voltage and frequency scaling (DVFS) for message passing interface applications in high performance computing. Green Queue makes use of both intertask and intratask DVFS techniques. The intertask technique targets applications with imbalanced workloads, reducing CPU clock frequency and therefore power draw for ranks with lighter workloads. The intratask technique targets balanced workloads where all tasks are synchronously running the same code. The strategy identifies program phases and selects the energy-optimal frequency for each by predicting power and measuring the performance responses of each phase to frequency changes. The success of these techniques is evaluated on 1024 cores of Gordon, a supercomputer at the San Diego Supercomputer Center built using Intel Xeon E5-2670 (Sandy Bridge) processors. Green Queue achieves up to 21% and 32% energy savings for the intratask and intertask DVFS strategies, respectively. Copyright © 2013 John Wiley & Sons, Ltd.
    Concurrency and Computation: Practice and Experience 12/2013; DOI:10.1002/cpe.3184 · 0.78 Impact Factor
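    A minimal sketch of the intratask idea described above, under assumptions rather than Green Queue's actual implementation: given measured or predicted (runtime, power) pairs for each program phase at each available CPU frequency, pick the frequency that minimizes per-phase energy. The phase names, frequencies, and measurements below are illustrative.

    ```python
    # phase -> {frequency_GHz: (runtime_s, power_W)}; numbers are illustrative.
    phase_profiles = {
        "phase0": {2.6: (10.0, 300.0), 2.0: (10.5, 240.0), 1.6: (12.5, 200.0)},
        "phase1": {2.6: (8.0, 320.0),  2.0: (10.2, 255.0), 1.6: (12.8, 215.0)},
    }

    def energy_optimal_frequencies(profiles):
        """Return {phase: frequency} minimizing energy = power * runtime."""
        schedule = {}
        for phase, by_freq in profiles.items():
            schedule[phase] = min(by_freq, key=lambda f: by_freq[f][0] * by_freq[f][1])
        return schedule

    print(energy_optimal_frequencies(phase_profiles))
    # phase0 is memory-bound-ish, so a lower frequency wins; phase1 is
    # compute-bound, so the highest frequency minimizes energy.
    ```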
  •
    ABSTRACT: The Gordon data intensive supercomputer entered service in 2012 as an allocable computing system in the NSF Extreme Science and Engineering Discovery Environment (XSEDE) program. Gordon has several innovative features that make it ideal for data intensive computing, including: 1,024 compute nodes based on Intel's Sandy Bridge (Xeon E5) processor; 64 I/O nodes with an aggregate of 300 TB of high performance flash (SSD); large, virtual SMP "supernodes" of up to 2 TB DRAM; a dual-rail, QDR InfiniBand, 3D torus network based on commodity hardware and open source software; and a 100 GB/s Lustre-based parallel file system with over 4 PB of disk space. In this paper we present the motivation, design, and performance of Gordon. We provide: low level micro-benchmark results to demonstrate processor, memory, I/O, and network performance; standard HPC benchmarks; and performance on data intensive applications to demonstrate Gordon's performance on typical workloads. We highlight the inherent risks in, and describe mitigation strategies for, deploying a data intensive supercomputer like Gordon which embodies significant innovative technologies. Finally, we present our experiences thus far in supporting users and managing Gordon.
  •
    ABSTRACT: Compute-intensive kernels make up the majority of execution time in HPC applications. Therefore, many of the power draw and energy consumption traits of HPC applications can be characterized in terms of the power draw and energy consumption of these constituent kernels. Given that power and energy-related constraints have emerged as major design impediments for exascale systems, it is crucial to develop a greater understanding of how kernels behave in terms of power/energy when subjected to different compiler-based optimizations and different hardware settings. In this work, we develop CPU and DIMM power and energy models for three extensively utilized HPC kernels by training artificial neural networks. These networks are trained using empirical data gathered on the target architecture. The models utilize kernel-specific compiler-based optimization parameters and hardware tunables as inputs and make predictions for the power draw rate and energy consumption of system components. The resulting power draw and energy usage predictions have an absolute error rate that averages less than 5.5% for three important kernels: matrix multiplication (MM), stencil computation, and LU factorization.
    Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW), 2012 IEEE 26th International; 01/2012
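    For illustration only, the sketch below trains a small feed-forward network that maps compiler and hardware tunables to CPU power draw, in the spirit of the paper's ANN models. The feature set (tile size, unroll factor, frequency), the synthetic training data, and the network shape are assumptions, not the authors' setup.

    ```python
    import numpy as np
    from sklearn.neural_network import MLPRegressor

    rng = np.random.default_rng(0)
    # Features: [tile_size, unroll_factor, cpu_freq_GHz]; target: CPU power (W).
    X = rng.uniform([16, 1, 1.2], [256, 8, 2.6], size=(200, 3))
    # Synthetic training target standing in for empirical wattmeter data.
    y = 40 + 60 * X[:, 2] + 0.05 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(0, 2, 200)

    model = MLPRegressor(hidden_layer_sizes=(16, 16), max_iter=5000, random_state=0)
    model.fit(X, y)

    # Predict power draw for a candidate configuration of the kernel.
    print(model.predict([[128, 4, 2.0]]))
    ```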
  •
    ABSTRACT: We examine the scalability of a set of techniques related to Dynamic Voltage-Frequency Scaling (DVFS) on HPC systems to reduce the energy consumption of scientific applications through an application-aware analysis and runtime framework, Green Queue. Green Queue supports making CPU clock frequency changes in response to intra-node and inter-node observations about application behavior. Our intra-node approach reduces CPU clock frequencies, and therefore power consumption, when CPUs lack computational work due to inefficient data movement. Our inter-node approach reduces clock frequencies for MPI ranks that lack computational work. We investigate these techniques on a set of large scientific applications on 1024 cores of Gordon, an Intel Sandy Bridge-based supercomputer at the San Diego Supercomputer Center. Our optimal intra-node technique showed an average measured energy savings of 10.6% and a maximum of 21.0% over regular application runs. Our optimal inter-node technique showed an average of 17.4% and a maximum of 31.7% energy savings.
    Cloud and Green Computing (CGC), 2012 Second International Conference on; 01/2012
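    A minimal sketch of the inter-node idea, under assumptions rather than the framework's actual code: ranks whose compute time is shorter than the slowest rank have slack, so they can run at a lower clock frequency and still reach the next synchronization on time. The available frequencies and per-rank times are invented, and compute time is assumed to scale roughly inversely with frequency.

    ```python
    FREQS_GHZ = [1.2, 1.6, 2.0, 2.6]          # available P-states (example values)
    compute_time = [4.1, 6.0, 2.9, 5.8]        # seconds per iteration at 2.6 GHz

    def per_rank_frequencies(times, freqs, f_max=2.6):
        critical = max(times)                  # slowest rank sets the deadline
        chosen = []
        for t in times:
            # lowest frequency whose scaled time still fits within the deadline
            feasible = [f for f in freqs if t * (f_max / f) <= critical]
            chosen.append(min(feasible))
        return chosen

    print(per_rank_frequencies(compute_time, FREQS_GHZ))
    ```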
  • Andrew A. Chien, Allan Snavely, Mark Gahagan
    ABSTRACT: Two decades of microprocessor architecture driven by quantitative 90/10 optimization has delivered an extraordinary 1000-fold improvement in microprocessor performance, enabled by transistor scaling which improved density, speed, and energy. Recent generations of technology have produced limited benefits in transistor speed and power, so the industry has turned to multicore parallelism for performance scaling. Long-range studies [1,2] indicate that radical approaches are needed in the coming decade: extreme parallelism, near-threshold voltage scaling (with its resulting poor single-thread performance), and tolerance of extreme variability are required to maximize energy efficiency and compute density. These changes create major new challenges in architecture and software. As a result, the performance and energy-efficiency advantages of heterogeneous architectures are increasingly attractive. However, computing has lacked an optimization paradigm in which to systematically analyze, assess, and implement heterogeneous computing. We propose a new paradigm, “10x10”, which clusters applications into a set of less frequent cases (i.e., 10% cases) and creates customized architecture, implementation, and software solutions for each of these clusters, achieving significantly better energy efficiency and performance. We call this approach “10x10” because it is exemplified by optimizing ten different 10% cases, reflecting a shift away from the 90/10 optimization paradigm framed by Amdahl's law [3]. We describe the 10x10 approach, explain how it solves the major obstacles to widespread adoption of heterogeneous architectures, and present a preliminary 10x10 clustering, strawman architecture, and software tool chain approach.
    Procedia Computer Science 12/2011; 4:1987-1996. DOI:10.1016/j.procs.2011.04.217
  •
    ABSTRACT: Application address streams contain a wealth of information that can be used to characterize the behavior of applications. However, the collection and handling of address streams is complicated by their size and the cost of collecting them. We present PSnAP, a compression scheme specifically designed for capturing the fine-grained patterns that occur in well-structured, memory-intensive, high performance computing applications. PSnAP profiles are human readable and reveal a great deal of information about the application's memory behavior. In addition to providing insight into application behavior, the profiles can be used to replay a proxy synthetic address stream for analysis. We demonstrate that the synthetic address streams closely mimic the behavior of the originals.
    Workload Characterization (IISWC), 2011 IEEE International Symposium on; 11/2011
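    A toy illustration of the general idea behind pattern-based address-stream compression and synthetic replay; this is not PSnAP's actual profile format. Constant-stride runs are stored as (start, stride, count) triples, and a synthetic stream is regenerated from those triples.

    ```python
    def compress(addresses):
        """Collapse an address stream into (start, stride, count) runs."""
        runs = []
        start, stride, count = addresses[0], None, 1
        for prev, cur in zip(addresses, addresses[1:]):
            d = cur - prev
            if stride is None or d == stride:
                stride, count = d, count + 1
            else:
                runs.append((start, stride, count))
                start, stride, count = cur, None, 1
        runs.append((start, stride if stride is not None else 0, count))
        return runs

    def replay(runs):
        """Regenerate a synthetic address stream from the compressed runs."""
        out = []
        for start, stride, count in runs:
            out.extend(start + i * stride for i in range(count))
        return out

    stream = [0x1000 + 8 * i for i in range(6)] + [0x8000, 0x8040, 0x8080]
    assert replay(compress(stream)) == stream
    print(compress(stream))
    ```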
  •
    ABSTRACT: To meet the growing demand for high performance computing systems that are capable of processing large datasets, the San Diego Supercomputer Center is deploying Gordon. This system was specifically designed for data intensive workloads and uses flash memory to fill the large latency gap in the memory hierarchy between DRAM and hard disk. In preparation for the deployment of Gordon, we evaluated the performance of multiple remote storage technologies and file systems for use with the flash memory. We find that OCFS and XFS are both superior to PVFS at delivering fast random access to flash. In addition, our tests indicate that the Linux SCSI target framework (TGT) can export flash storage devices with minimal overhead and achieve a large fraction of the theoretical peak I/O performance. Despite the difficulties in fairly comparing I/O solutions due to the many differences in file systems and service implementations, we conclude that OCFS on TGT is a viable option for our system as it provides both excellent performance and a user-friendly shared file system interface. In those instances where a parallel file system is not required, XFS on TGT is a better alternative.
    Proceedings of the 1st Workshop on Architectures and Systems for Big Data; 10/2011
  •
    ABSTRACT: The power wall has become a dominant impeding factor in the realm of exascale system design. It is therefore important to understand how to most effectively create software that minimizes power usage while maintaining satisfactory levels of performance. This work uses existing software and hardware facilities to tune applications for several combined objectives of power and performance. The tuning is done with respect to software-level performance-related tunables and processor clock frequency. These tunable parameters are explored via an offline search to find the parameter combinations that are optimal with respect to performance (or delay, D), energy (E), energy×delay (E×D) and energy×delay×delay (E×D²). These searches are employed on a parallel application that solves Poisson's equation using stencils. We show that the parameter configuration that minimizes energy consumption can save, on average, 5.4% energy with a performance loss of 4% when compared to the configuration that minimizes runtime.
    Proceedings of the 2011 international conference on Parallel Processing - Volume 2; 08/2011
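    A hedged sketch of the offline search described above: enumerate configurations and, for each objective, keep the configuration minimizing it, where D is runtime and E = P × D. The tunables and the stand-in measurement function are made up; in practice each configuration would be timed and metered on the real machine.

    ```python
    from itertools import product

    tile_sizes = [32, 64, 128]
    freqs_ghz = [1.6, 2.0, 2.6]

    def measure(tile, freq):
        """Stand-in for an actual timed and power-metered run of the stencil code."""
        runtime = 100.0 / freq + (0.2 if tile == 32 else 0.0)   # seconds (fake)
        power = 80.0 + 55.0 * freq                               # watts (fake)
        return runtime, power

    objectives = {
        "D":    lambda d, e: d,
        "E":    lambda d, e: e,
        "ExD":  lambda d, e: e * d,
        "ExD2": lambda d, e: e * d * d,
    }

    results = {}
    for tile, freq in product(tile_sizes, freqs_ghz):
        d, p = measure(tile, freq)
        e = p * d
        for name, metric in objectives.items():
            score = metric(d, e)
            if name not in results or score < results[name][0]:
                results[name] = (score, (tile, freq))

    for name, (score, cfg) in results.items():
        print(f"{name}: best config {cfg}, score {score:.1f}")
    ```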
  •
    ABSTRACT: As the gap between the speed of computing elements and the disk subsystem widens, it becomes increasingly important to understand and model disk I/O. While the speed of computational resources continues to grow, potentially scaling to multiple petaflops and millions of cores, the growth in the performance of I/O systems lags well behind. In this context, data-intensive applications that run on current and future systems depend on the ability of the I/O system to move data to the distributed memories. As a result, the I/O system becomes a bottleneck for application performance. Additionally, due to the higher risk of component failure that results from larger scales, the frequency of application checkpointing is expected to grow and put an additional burden on the disk I/O system [1]. The emergence of new technologies such as flash-based Solid State Drives (SSDs) presents an opportunity to narrow the gap between the speed of computing and I/O systems. With this in mind, SDSC's PMAC lab is investigating the use of flash drives in a new prototype system called DASH [8, 9, 13]. In this paper we apply and extend a modeling methodology developed for spinning disk and use it to model disk I/O time on DASH. We studied two data-intensive applications, MADbench2 [6] and an application for geological imaging [5]. Our results show that the prediction error for total I/O time is 14.79% for MADbench2, and our efforts for geological imaging yield an error of 9% for one category of read calls; this application performs three categories of read/write calls in total. We are still investigating the geological application, and in this paper we present our results thus far for both applications.
    GLOBECOM Workshops (GC Wkshps), 2010 IEEE; 01/2011
  •
    ABSTRACT: Suppose one is considering purchase of a computer equipped with accelerators. Or suppose one has access to such a computer and is considering porting code to take advantage of the accelerators. Is there a reason to suppose the purchase cost or programmer effort will be worth it? It would be nice to be able to estimate the expected improvements in advance of paying money or time. We exhibit an analytical framework and tool-set for providing such estimates: the tools first look for user-defined idioms, patterns of computation and data access identified in advance as possibly being able to benefit from accelerator hardware. A performance model is then applied to estimate how much faster these idioms would be if they were ported and run on the accelerators, a recommendation is made as to whether or not each idiom is worth the effort of porting it to the accelerator, and an estimate is provided of the overall application speedup if this were done. As a proof-of-concept we focus our investigations on Gather/Scatter (G/S) operations and the means to accelerate these available on the Convey HC-1, which has a special-purpose "personality" for accelerating G/S. We test the methodology on two large-scale HPC applications. The idiom recognizer tool saves weeks of programmer effort compared to having the programmer examine the code visually looking for idioms; performance models save yet more time by rank-ordering the best candidates for porting; and the performance models are accurate, predicting G/S runtime speedup resulting from porting to within 10% of the speedup actually achieved. The G/S hardware on the Convey sped up these operations 20x, and the overall impact was to improve total application runtime by as much as 21%.
    Proceedings of the 25th International Conference on Supercomputing, 2011, Tucson, AZ, USA, May 31 - June 04, 2011; 01/2011
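    A back-of-the-envelope model, not the paper's full framework, of how an idiom-level speedup translates into whole-application improvement in the Amdahl's-law style. The gather/scatter time fraction below is an assumed value chosen for illustration; only the 20x G/S speedup comes from the abstract.

    ```python
    def overall_speedup(idiom_fraction, idiom_speedup):
        """Amdahl-style speedup when only a fraction of runtime is accelerated."""
        return 1.0 / ((1.0 - idiom_fraction) + idiom_fraction / idiom_speedup)

    f = 0.22        # assumed fraction of runtime spent in gather/scatter
    s = 20.0        # G/S speedup reported for the Convey HC-1 personality
    sp = overall_speedup(f, s)
    print(f"overall speedup: {sp:.3f}x, runtime reduced by {100 * (1 - 1 / sp):.1f}%")
    # With these assumed inputs the runtime improvement is roughly 21%, in line
    # with the best case reported in the abstract.
    ```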
  •
    ABSTRACT: Systems with hardware accelerators speed up applications by offloading certain compute operations that can run faster on accelerators. Thus, it is not surprising that many of the Top500 supercomputers use accelerators. However, in addition to procurement cost, significant programming and porting effort is required to realize the potential benefit of such accelerators. Hence, before building such a system it is prudent to answer the question ‘what is the projected performance benefit from accelerators for workloads of interest?’ We address this question by way of a performance-modeling framework, which predicts realizable application performance on accelerators speedily and accurately without going to the considerable effort of porting and tuning.
    International Journal of High Performance Computing Applications 01/2011; 27(2). DOI:10.1109/IISWC.2011.6114198 · 1.63 Impact Factor
  •
    ABSTRACT: Over the life of a modern supercomputer, the energy cost of running the system can exceed the cost of the original hardware purchase. This has driven the community to attempt to understand and minimize energy costs wherever possible. Towards these ends, we present an automated, fine-grained approach to selecting per-loop processor clock frequencies. The clock frequency selection criterion is established through a combination of lightweight static analysis and runtime tracing that automatically acquires application signatures: characterizations of the patterns of execution of each loop in an application. This application characterization is matched with one of a series of benchmark loops, which have been run on the target system and probe it in various ways. These benchmarks form a covering set, a machine characterization of the expected power consumption and performance traits of the machine over the space of execution patterns and clock frequencies. The frequency that confers the optimal behavior in terms of power-delay product for the benchmark that most closely resembles each application loop is the one chosen for that loop. The set of tools that implement this scheme is fully automated, built on top of freely available open source software, and uses an inexpensive power measurement apparatus. We use these tools to show a measured, system-wide energy savings of up to 7.6% on an 8-core Intel Xeon E5530 and 10.6% on a 32-core AMD Opteron 8380 (a Sun X4600 node) across a range of workloads.
    Euro-Par 2011 Parallel Processing - 17th International Conference, Euro-Par 2011, Bordeaux, France, August 29 - September 2, 2011, Proceedings, Part I; 01/2011
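    A simplified sketch of the matching step described above, with an assumed signature structure rather than the tool's actual format: each application loop's signature is compared against a set of pre-characterized benchmark loops, and the loop inherits the clock frequency that minimized power×delay for its nearest benchmark. The two-component signature, benchmark names, and frequencies are invented.

    ```python
    import math

    # benchmark name -> (signature vector, best frequency in GHz by power-delay)
    bench_db = {
        "stream_like": ((0.9, 0.1), 1.86),   # memory-bound probe (example values)
        "dgemm_like":  ((0.1, 0.9), 2.40),   # compute-bound probe
        "mixed":       ((0.5, 0.5), 2.13),
    }

    def pick_frequency(loop_signature):
        """Choose the frequency of the benchmark loop closest to this signature."""
        nearest = min(bench_db, key=lambda name: math.dist(bench_db[name][0], loop_signature))
        return nearest, bench_db[nearest][1]

    # Signature: (memory-access intensity, floating-point intensity), assumed form.
    print(pick_frequency((0.8, 0.2)))   # matches the memory-bound probe
    ```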
  •
    ABSTRACT: Data intensive computing can be defined as computation involving large datasets and complicated I/O patterns. Data intensive computing is challenging because there is a five-orders-of-magnitude latency gap between main memory DRAM and spinning hard disks; the result is that an inordinate amount of time in data intensive computing is spent accessing data on disk. To address this problem we designed and built a prototype data intensive supercomputer named DASH that exploits flash-based Solid State Drive (SSD) technology and also virtually aggregated DRAM to fill the latency gap. DASH uses commodity parts including Intel® X25-E flash drives and distributed shared memory (DSM) software from ScaleMP®. The system is highly competitive with several commercial offerings by several metrics including achieved IOPS (input/output operations per second), IOPS per dollar of system acquisition cost, IOPS per watt during operation, and IOPS per gigabyte (GB) of available storage. We present here an overview of the design of DASH and an analysis of its cost efficiency, provide a detailed recipe for how we designed and tuned it for high data performance, and lastly show that, running data-intensive scientific applications from graph theory, biology, and astronomy, we achieved as much as two orders-of-magnitude speedup compared to the same applications run on traditional architectures.
    Conference on High Performance Computing Networking, Storage and Analysis, SC 2010, New Orleans, LA, USA, November 13-19, 2010; 11/2010
  •
    ABSTRACT: SPECFEM3D_GLOBE is a spectral-element application enabling the simulation of global seismic wave propagation in 3D anelastic, anisotropic, rotating and self-gravitating Earth models at unprecedented resolution. A fundamental challenge in global seismology is to model the propagation of waves with periods between 1 and 2 seconds, the highest frequency signals that can propagate clear across the Earth. These waves help reveal the 3D structure of the Earth's deep interior and can be compared to seismographic recordings. We broke the 2 second barrier using the 62K-processor Ranger system at TACC. Indeed we broke the barrier using just half of Ranger, by reaching a period of 1.84 seconds with sustained 28.7 Tflops on 32K processors. We obtained similar results on the XT4 Franklin system at NERSC and the XT4 Kraken system at the University of Tennessee Knoxville, while a similar run on the 28K-processor Jaguar system at ORNL, which has more memory per processor, sustained 35.7 Tflops (a higher flops rate) with a shortest period of 1.94 seconds. For the final run we obtained access to the ORNL Petaflop System, a new very large XT5 just coming online, and achieved a shortest period of 1.72 seconds and 161 Tflops using 149,784 cores. With this landmark calculation we have enabled a powerful new tool for seismic wave simulation, one that operates in the same frequency regimes as nature; in seismology there is no need to pursue much smaller periods because higher frequency signals do not propagate across the entire globe. We employed performance modeling methods to identify performance bottlenecks and worked through issues of parallel I/O and scalability. Improved mesh design and numbering results in excellent load balancing and few cache misses. The primary achievements are not just the scalability and high teraflops number, but a historic step towards understanding the physics and chemistry of the Earth's interior at unprecedented resolution.
  •
    ABSTRACT: Machine affinity is the observed phenomenon that some applications benefit more than others from features of high performance computing (HPC) architectures. When considering a diverse portfolio of HPC machines manufactured by different vendors and of different ages, such as the set of all supercomputers currently operated by the Department of Defense High Performance Computing Modernization Program, it should be obvious that some run a given application faster than others do. Therefore, almost every user would request to run on the fastest machines. But an important insight is that some applications benefit more from the features of the faster machines than others do. If allocations are done in such a way that the applications that benefit the most from the features of the fastest machines are assigned to those machines, then overall throughput across all machines is boosted by more than 10%. We exhibit exemplary empirical analysis and provide a simple algorithm for doing allocations based on machine affinity. The net effect is like adding a new $10M supercomputer to the portfolio without paying for it.
    High Performance Computing Modernization Program Users Group Conference (HPCMP-UGC), 2010 DoD; 07/2010
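    One hedged way to sketch affinity-based allocation (the paper's actual algorithm may differ): rank application-machine pairs by each application's relative benefit on each machine and assign greedily, strongest affinity first. All runtimes below are invented.

    ```python
    # runtime_hours[app][machine]: measured or modeled time, illustrative numbers.
    runtime_hours = {
        "appA": {"M1": 10.0, "M2": 14.0, "M3": 20.0},
        "appB": {"M1": 11.0, "M2": 12.0, "M3": 13.0},
        "appC": {"M1":  9.0, "M2": 15.0, "M3": 22.0},
    }

    def affinity_allocation(runtimes):
        """Assign each app to one machine, highest relative benefit first."""
        # Relative benefit of an app on a machine: average runtime / runtime there.
        def benefit(app, m):
            avg = sum(runtimes[app].values()) / len(runtimes[app])
            return avg / runtimes[app][m]

        pairs = sorted(((benefit(a, m), a, m) for a in runtimes for m in runtimes[a]),
                       reverse=True)
        assignment, taken = {}, set()
        for _, app, m in pairs:
            if app not in assignment and m not in taken:
                assignment[app] = m
                taken.add(m)
        return assignment

    print(affinity_allocation(runtime_hours))
    # appC gains the most from M1, so it gets it; appA falls back to M2.
    ```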
  •
    ABSTRACT: Understanding input/output (I/O) performance in high performance computing (HPC) is becoming increasingly important as the gap between the performance of computation and I/O widens. In this paper we propose a methodology to predict an application's disk I/O time while running on High Performance Computing Modernization Program (HPCMP) systems. Our methodology consists of the following steps: 1) Characterize the I/O operations of an application running on a reference system. 2) Using a configurable I/O benchmark, collect statistics on the reference and target systems about the I/O operations that are relevant to the application. 3) Calculate a ratio of the measured I/O performance between the reference and target systems to predict the application's I/O time on the target systems. Our results show that this methodology can accurately predict the I/O time of relevant HPC applications on HPCMP systems whose I/O performance is reasonably stable from run to run, while systems with wide variability in I/O performance are more difficult to predict accurately.
    High Performance Computing Modernization Program Users Group Conference (HPCMP-UGC), 2010 DoD; 07/2010
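    A small sketch of how the final prediction step might be applied, under one plausible reading of the methodology: the application's measured I/O time on the reference system is scaled, per operation category, by the benchmark-measured target/reference ratio. The operation categories and timings below are invented.

    ```python
    # Per-category application I/O time measured on the reference system (s).
    app_ref = {"seq_write_1MB": 120.0, "rand_read_64KB": 300.0}

    # Benchmark time for the same categories on reference and target systems (s).
    bench_ref = {"seq_write_1MB": 10.0, "rand_read_64KB": 40.0}
    bench_tgt = {"seq_write_1MB":  7.5, "rand_read_64KB": 55.0}

    def predict_target_io_time(app_ref, bench_ref, bench_tgt):
        """Scale each category by the target/reference benchmark ratio and sum."""
        return sum(t * bench_tgt[cat] / bench_ref[cat] for cat, t in app_ref.items())

    total = predict_target_io_time(app_ref, bench_ref, bench_tgt)
    print(f"predicted target I/O time: {total:.1f} s")
    ```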
  •
    ABSTRACT: Energy costs comprise a significant fraction of the total cost of ownership of a large supercomputer. As with performance, energy-efficiency is not an attribute of a compute resource alone; it is a function of a resource-workload combination. The operation mix and locality characteristics of the applications in the workload affect the energy consumption of the resource. Our experiments confirm that data locality is the primary source of variation in energy requirements. The major contributions of this work include a method for performing fine-grained power measurements on high performance computing (HPC) resources, a benchmark infrastructure that exercises specific portions of the node in order to characterize operation energy costs, and a method of combining application information with independent energy measurements in order to estimate the energy requirements for specific application-resource pairings. A verification study using the NAS parallel benchmarks and S3D shows that our model has an average prediction error of 7.4%.
    High Performance Computing Modernization Program Users Group Conference (HPCMP-UGC), 2010 DoD; 07/2010
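    An illustrative sketch of combining application characterization with independent energy measurements, with assumed numbers rather than the paper's measured values: dynamic energy is estimated as the sum, over operation classes, of operation counts times per-operation energy cost, so the memory-level mix of loads captures the locality effect the abstract highlights.

    ```python
    # Per-operation energy costs on the target node, in nanojoules (hypothetical).
    energy_nj = {"flop": 0.5, "l1_load": 0.8, "l2_load": 2.5, "dram_load": 20.0}

    # Application profile: dynamic operation counts, e.g. from tracing one rank.
    profile = {"flop": 4.0e11, "l1_load": 3.0e11, "l2_load": 4.0e10, "dram_load": 6.0e9}

    def estimate_energy_joules(profile, energy_nj):
        """Sum count * per-op energy over all operation classes."""
        return sum(profile[op] * energy_nj[op] * 1e-9 for op in profile)

    print(f"estimated dynamic energy: {estimate_energy_joules(profile, energy_nj):.1f} J")
    # Data locality dominates: shifting loads from DRAM to cache sharply cuts the
    # estimate, consistent with the observation that locality drives energy variation.
    ```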
  • Report of the Exascale Study Group (only the report's front-matter notice and the participants' disclaimer were captured here; no abstract is available).
  • The DOD High Performance Computing Modernization Office User's Conference; 06/2010