ABSTRACT: The Gordon data intensive supercomputer entered service in 2012 as an allocable computing system in the NSF Extreme Science and Engineering Discovery Environment (XSEDE) program. Gordon has several innovative features that make it ideal for data intensive computing including: 1,024 compute nodes based on Intel's Sandy Bridge (Xeon E5) processor; 64 I/O nodes with an aggregate of 300 TB of high performance flash (SSD); large, virtual SMP "supernodes" of up to 2 TB DRAM; a dual-rail, QDR InfiniBand, 3D torus network based on commodity hardware and open source software; and a 100 GB/s Lustre-based parallel file system, with over 4 PB of disk space. In this paper we present the motivation, design, and performance of Gordon. We provide: low level micro-benchmark results to demonstrate processor, memory, I/O, and network performance; standard HPC benchmarks; and performance on data intensive applications to demonstrate Gordon's performance on typical workloads. We highlight the inherent risks in, and describe mitigation strategies for, deploying a data intensive supercomputer like Gordon which embodies significant innovative technologies. Finally, we present our experiences thus far in supporting users and managing Gordon.
ABSTRACT: We examine the scalability of a set of techniques related to Dynamic Voltage-Frequency Scaling (DVFS) on HPC systems to reduce the energy consumption of scientific applications through an application-aware analysis and runtime framework, Green Queue. Green Queue supports making CPU clock frequency changes in response to intra-node and inter-node observations about application behavior. Our intra-node approach reduces CPU clock frequencies, and therefore power consumption, while CPUs lack computational work due to inefficient data movement. Our inter-node approach reduces clock frequencies for MPI ranks that lack computational work. We investigate these techniques on a set of large scientific applications on 1024 cores of Gordon, an Intel Sandy Bridge-based supercomputer at the San Diego Supercomputer Center. Our optimal intra-node technique showed an average measured energy savings of 10.6% and a maximum of 21.0% over regular application runs. Our optimal inter-node technique showed an average 17.4% and a maximum of 31.7% energy savings.
Cloud and Green Computing (CGC), 2012 Second International Conference on; 01/2012
ABSTRACT: Compute intensive kernels make up the majority of execution time in HPC applications. Therefore, many of the power draw and energy consumption traits of HPC applications can be characterized in terms of the power draw and energy consumption of these constituent kernels. Given that power and energy-related constraints have emerged as major design impediments for exascale systems, it is crucial to develop a greater understanding of how kernels behave in terms of power/energy when subjected to different compiler-based optimizations and different hardware settings. In this work, we develop CPU and DIMM power and energy models for three extensively utilized HPC kernels by training artificial neural networks. These networks are trained using empirical data gathered on the target architecture. The models utilize kernel-specific compiler-based optimization parameters and hardware tunables as inputs and make predictions for the power draw rate and energy consumption of system components. The resulting power draw and energy usage predictions have an absolute error rate that averages less than 5.5% for three important kernels: matrix multiplication (MM), stencil computation, and LU factorization.
Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW), 2012 IEEE 26th International; 01/2012
ABSTRACT: Application address streams contain a wealth of information that can be used to characterize the behavior of applications. However, the collection and handling of address streams is complicated by their size and the cost of collecting them. We present PSnAP, a compression scheme specifically designed for capturing the fine-grained patterns that occur in well structured, memory intensive, high performance computing applications. PSnAP profiles are human readable and reveal a great deal of information about the application memory behavior. In addition to providing insight into application behavior, the profiles can be used to replay a proxy synthetic address stream for analysis. We demonstrate that the synthetic address streams mimic the behavior of the originals very closely.
Workload Characterization (IISWC), 2011 IEEE International Symposium on; 11/2011
ABSTRACT: To meet the growing demand for high performance computing systems that are capable of processing large datasets, the San Diego Supercomputer Center is deploying Gordon. This system was specifically designed for data intensive workloads and uses flash memory to fill the large latency gap in the memory hierarchy between DRAM and hard disk. In preparation for the deployment of Gordon, we evaluated the performance of multiple remote storage technologies and file systems for use with the flash memory. We find that OCFS and XFS are both superior to PVFS at delivering fast random access to flash. In addition, our tests indicate that the Linux SCSI target framework (TGT) can export flash storage devices with minimal overhead and achieve a large fraction of the theoretical peak I/O performance. Despite the difficulties in fairly comparing I/O solutions due to the many differences in file systems and service implementations, we conclude that OCFS on TGT is a viable option for our system as it provides both excellent performance and a user-friendly shared file system interface. In those instances where a parallel file system is not required, XFS on TGT is a better alternative.
Proceedings of the 1st Workshop on Architectures and Systems for Big Data; 10/2011
ABSTRACT: The power wall has become a dominant impeding factor in the realm of exascale system design. It is therefore important to understand how to most effectively create software to minimize its power usage while maintaining satisfactory levels of performance. This work uses existing software and hardware facilities to tune applications to optimize for several combinations of power and performance. The tuning is done with respect to software level performance-related tunables and for processor clock frequency. These tunable parameters are explored via an offline search to find the parameter combinations that are optimal with respect to performance (or delay, D), energy (E), energy×delay (E×D) and energy×delay×delay (E×D²). These searches are employed on a parallel application that solves Poisson's equation using stencils. We show that the parameter configuration that minimizes energy consumption can save, on average, 5.4% energy with a performance loss of 4% when compared to the configuration that minimizes runtime.
Proceedings of the 2011 international conference on Parallel Processing - Volume 2; 08/2011
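As a rough illustration of the four objectives named in the abstract above (D, E, E×D, E×D²), the sketch below picks the best configuration under each one from a table of measurements. It is not the paper's search tool: the `Config` type, the configuration labels, and all energy and runtime values are made up for the example.

```python
from typing import NamedTuple


class Config(NamedTuple):
    name: str        # hypothetical label for one tunable combination
    energy_j: float  # measured energy in joules (made-up value)
    delay_s: float   # measured runtime in seconds (made-up value)


# Each objective maps a configuration to the quantity to be minimized.
METRICS = {
    "D": lambda c: c.delay_s,
    "E": lambda c: c.energy_j,
    "ExD": lambda c: c.energy_j * c.delay_s,
    "ExD^2": lambda c: c.energy_j * c.delay_s ** 2,
}


def best_configs(configs):
    """Return the name of the minimizing configuration for each objective."""
    return {metric: min(configs, key=score).name
            for metric, score in METRICS.items()}


configs = [
    Config("2.4GHz/block32", 1000.0, 10.0),
    Config("2.0GHz/block32", 940.0, 10.4),
    Config("1.6GHz/block64", 930.0, 12.5),
]
result = best_configs(configs)
for metric, name in result.items():
    print(f"{metric:6s} -> {name}")
```

With these invented numbers the objectives disagree: the lowest-energy configuration is also the slowest, which is why the weighting of delay in E×D and E×D² changes the choice.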
ABSTRACT: As the gap between the speed of computing elements and the disk subsystem widens, it becomes increasingly important to understand and model disk I/O. While the speed of computational resources continues to grow, potentially scaling to multiple petaflops and millions of cores, the growth in the performance of I/O systems lags well behind. In this context, data-intensive applications that run on current and future systems depend on the ability of the I/O system to move data to the distributed memories. As a result, the I/O system becomes a bottleneck for application performance. Additionally, due to the higher risk of component failure that results from larger scales, the frequency of application checkpointing is expected to grow and put an additional burden on the disk I/O system. The emergence of new technologies such as flash-based Solid State Drives (SSDs) presents an opportunity to narrow the gap between the speed of computing and I/O systems. With this in mind, SDSC's PMAC lab is investigating the use of flash drives in a new prototype system called DASH [8, 9, 13]. In this paper we apply and extend a modeling methodology developed for spinning disk and use it to model disk I/O time on DASH. We studied two data-intensive applications, MADbench2 and an application for geological imaging. Our results show that the prediction error for total I/O time is 14.79% for MADbench2, and our efforts for geological imaging yield an error of 9% for one category of read calls; this application made a total of 3 categories of read/write calls. We are still investigating the geological application, and in this paper we present our results thus far for both applications.
ABSTRACT: Systems with hardware accelerators speed up applications by offloading certain compute operations that can run faster on accelerators. Thus, it is not surprising that many of the Top500 supercomputers use accelerators. However, in addition to procurement cost, significant programming and porting effort is required to realize the potential benefit of such accelerators. Hence, before building such a system, it is prudent to answer the question ‘what is the projected performance benefit from accelerators for workloads of interest?’ We address this question by way of a performance-modeling framework, which predicts realizable application performance on accelerators speedily and accurately without going to the considerable effort of porting and tuning.
International Journal of High Performance Computing Applications 01/2011; 27(2). · 1.30 Impact Factor
ABSTRACT: Two decades of microprocessor architecture driven by quantitative 90/10 optimization has delivered an extraordinary 1000-fold improvement in microprocessor performance, enabled by transistor scaling which improved density, speed, and energy. Recent generations of technology have produced limited benefits in transistor speed and power, so as a result the industry has turned to multicore parallelism for performance scaling. Long-range studies [1,2] indicate that radical approaches, namely extreme parallelism, near-threshold voltage scaling (and resulting poor single-thread performance), and tolerance of extreme variability, are required in the coming decade to maximize energy efficiency and compute density. These changes create major new challenges in architecture and software. As a result, the performance and energy-efficiency advantages of heterogeneous architectures are increasingly attractive. However, computing has lacked an optimization paradigm in which to systematically analyze, assess, and implement heterogeneous computing. We propose a new paradigm, “10x10”, which clusters applications into a set of less frequent cases (i.e., 10% cases), and creates customized architecture, implementation, and software solutions for each of these clusters, achieving significantly better energy efficiency and performance. We call this approach “10x10” because the approach is exemplified by optimizing ten different 10% cases, reflecting a shift away from the 90/10 optimization paradigm framed by Amdahl's law. We describe the 10x10 approach, explain how it solves the major obstacles to widespread adoption of heterogeneous architectures, and present a preliminary 10x10 clustering, strawman architecture, and software tool chain approach.
ABSTRACT: Over the life of a modern supercomputer, the energy cost of running the system can exceed the cost of the original hardware purchase. This has driven the community to attempt to understand and minimize energy costs wherever possible. Towards these ends, we present an automated, fine-grained approach to selecting per-loop processor clock frequencies. The clock frequency selection criterion is established through a combination of lightweight static analysis and runtime tracing that automatically acquires application signatures - characterizations of the patterns of execution of each loop in an application. This application characterization is matched with one of a series of benchmark loops, which have been run on the target system and probe it in various ways. These benchmarks form a covering set, a machine characterization of the expected power consumption and performance traits of the machine over the space of execution patterns and clock frequencies. The frequency that confers the optimal behavior in terms of power-delay product for the benchmark that most closely resembles each application loop is the one chosen for that loop. The set of tools that implement this scheme is fully automated, built on top of freely available open source software, and uses an inexpensive power measurement apparatus. We use these tools to show a measured, system-wide energy savings of up to 7.6% on an 8-core Intel Xeon E5530 and 10.6% on a 32-core AMD Opteron 8380 (a Sun X4600 Node) across a range of workloads.
Euro-Par 2011 Parallel Processing - 17th International Conference, Euro-Par 2011, Bordeaux, France, August 29 - September 2, 2011, Proceedings, Part I; 01/2011
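The matching step described in the abstract above (find the benchmark loop that most resembles an application loop, then take the frequency with the best power-delay product for that benchmark) might be sketched as follows. This is only an illustration under stated assumptions: the two-element signature vectors, the Euclidean distance used for "most closely resembles", the benchmark names, and every power and runtime figure are hypothetical and do not come from the paper.

```python
import math

# Hypothetical covering set: benchmark name -> (signature vector,
# {clock frequency in GHz: (average power in W, runtime in s)}).
# The signature here is an invented (compute share, memory share) pair.
BENCHMARKS = {
    "compute_bound": ((0.9, 0.1),
                      {2.4: (95.0, 1.00), 2.0: (80.0, 1.20), 1.6: (68.0, 1.50)}),
    "memory_bound": ((0.2, 0.8),
                     {2.4: (90.0, 1.00), 2.0: (76.0, 1.02), 1.6: (64.0, 1.05)}),
}


def nearest_benchmark(signature):
    """Name of the benchmark whose signature is closest (Euclidean distance)."""
    return min(BENCHMARKS, key=lambda b: math.dist(signature, BENCHMARKS[b][0]))


def pick_frequency(signature):
    """Frequency minimizing power x delay for the closest benchmark."""
    table = BENCHMARKS[nearest_benchmark(signature)][1]
    return min(table, key=lambda f: table[f][0] * table[f][1])


# Hypothetical application loops and their signatures.
loops = {"loop_a": (0.85, 0.15), "loop_b": (0.3, 0.7)}
choices = {name: pick_frequency(sig) for name, sig in loops.items()}
print(choices)
```

In this toy table the compute-bound benchmark slows down almost in proportion to the clock, so the highest frequency wins, while the memory-bound benchmark barely slows down and the lowest frequency gives the smallest power-delay product.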
ABSTRACT: Suppose one is considering purchase of a computer equipped with accelerators. Or suppose one has access to such a computer and is considering porting code to take advantage of the accelerators. Is there a reason to suppose the purchase cost or programmer effort will be worth it? It would be nice to be able to estimate the expected improvements in advance of paying money or time. We exhibit an analytical framework and tool-set for providing such estimates: the tools first look for user-defined idioms that are patterns of computation and data access identified in advance as possibly being able to benefit from accelerator hardware. A performance model is then applied to estimate how much faster these idioms would be if they were ported and run on the accelerators, a recommendation is made as to whether or not each idiom is worth the effort of porting it to the accelerator, and an estimate is provided of what the overall application speedup would be if this were done. As a proof-of-concept we focus our investigations on Gather/Scatter (G/S) operations and means to accelerate these available on the Convey HC-1, which has a special-purpose "personality" for accelerating G/S. We test the methodology on two large-scale HPC applications. The idiom recognizer tool saves weeks of programmer effort compared to having the programmer examine the code visually looking for idioms; performance models save yet more time by rank-ordering the best candidates for porting; and the performance models are accurate, predicting G/S runtime speedup resulting from porting to within 10% of speedup actually achieved. The G/S hardware on the Convey sped up these operations 20x, and the overall impact on total application runtime was to improve it by as much as 21%.
Proceedings of the 25th International Conference on Supercomputing, 2011, Tucson, AZ, USA, May 31 - June 04, 2011; 01/2011
ABSTRACT: Machine affinity is the observed phenomenon that some applications benefit more than others from features of high performance computing (HPC) architectures. When considering a diverse portfolio of HPC machines manufactured by different vendors and of different ages, such as the set of all supercomputers currently operated by the Department of Defense High Performance Computing Modernization Program, it should be obvious that some run a given application faster than others do. Therefore, almost every user would request to run on the fastest machines. But an important insight is that some applications benefit more from the features of the faster machines than others do. If allocations are done in such a way that applications that benefit the most from the features of the fastest machines are assigned to those machines, then overall throughput across all machines is boosted by more than 10%. We exhibit exemplary empirical analysis and provide a simple algorithm for doing allocations based on machine affinity. The net effect is like adding a new $10M supercomputer to the portfolio without paying for it.
High Performance Computing Modernization Program Users Group Conference (HPCMP-UGC), 2010 DoD; 07/2010
ABSTRACT: Understanding input/output (I/O) performance in high performance computing (HPC) is becoming increasingly important as the gap between the performance of computation and I/O widens. In this paper we propose a methodology to predict an application's disk I/O time while running on High Performance Computing Modernization Program (HPCMP) systems. Our methodology consists of the following steps: 1) Characterize the I/O operations of an application running on a reference system. 2) Using a configurable I/O benchmark, collect statistics on the reference and target systems about the I/O operations that are relevant to the application. 3) Calculate a ratio between the measured I/O performance on the reference system and on the target systems, and use it to predict the application's I/O time on the target systems. Our results show that this methodology can accurately predict the I/O time of relevant HPC applications on HPCMP systems that have reasonably stable I/O performance run to run, while systems that have wide variability in I/O performance are more difficult to predict accurately.
High Performance Computing Modernization Program Users Group Conference (HPCMP-UGC), 2010 DoD; 07/2010
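One plausible reading of the three-step methodology in the abstract above is a per-operation scaling: the application's I/O time on the reference system is multiplied by the ratio of benchmark times on the target and reference systems. The sketch below follows that reading only; the function name, the operation classes, and all timings are assumptions made for illustration, not values or code from the paper.

```python
def predict_io_time(app_ref_times, bench_ref, bench_target):
    """Scale each operation class's reference time by the benchmark ratio.

    app_ref_times: seconds the application spent per I/O operation class
                   on the reference system (step 1).
    bench_ref, bench_target: benchmark seconds for the same operation
                   classes on the reference and target systems (step 2).
    """
    total = 0.0
    for op, t_ref in app_ref_times.items():
        ratio = bench_target[op] / bench_ref[op]  # step 3: target vs. reference
        total += t_ref * ratio
    return total


# Made-up measurements for two operation classes.
app_ref_times = {"read": 40.0, "write": 20.0}
bench_ref = {"read": 2.0, "write": 4.0}
bench_target = {"read": 1.0, "write": 5.0}

predicted = predict_io_time(app_ref_times, bench_ref, bench_target)
print(predicted)  # 40 * 0.5 + 20 * 1.25 = 45.0
```

As the abstract notes, a ratio of this kind is only as reliable as the benchmark timings behind it, so systems with large run-to-run I/O variability would make the ratio, and therefore the prediction, unstable.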
ABSTRACT: Energy costs comprise a significant fraction of the total cost of ownership of a large supercomputer. As with performance, energy-efficiency is not an attribute of a compute resource alone; it is a function of a resource-workload combination. The operation mix and locality characteristics of the applications in the workload affect the energy consumption of the resource. Our experiments confirm that data locality is the primary source of variation in energy requirements. The major contributions of this work include a method for performing fine-grained power measurements on high performance computing (HPC) resources, a benchmark infrastructure that exercises specific portions of the node in order to characterize operation energy costs, and a method of combining application information with independent energy measurements in order to estimate the energy requirements for specific application-resource pairings. A verification study using the NAS parallel benchmarks and S3D shows that our model has an average prediction error of 7.4%.
High Performance Computing Modernization Program Users Group Conference (HPCMP-UGC), 2010 DoD; 07/2010
ABSTRACT: Binary instrumentation facilitates the insertion of additional code into an executable in order to observe or modify the executable's behavior. There are two main approaches to binary instrumentation: static and dynamic binary instrumentation. In this paper we present a static binary instrumentation toolkit for Linux on the x86/x86_64 platforms, PEBIL (PMaC's Efficient Binary Instrumentation Toolkit for Linux). PEBIL is similar to other toolkits in terms of how additional code is inserted into the executable. However, it is designed with the primary goal of producing efficient-running instrumented code. To this end, PEBIL uses function level code relocation in order to insert large but fast control structures. Furthermore, the PEBIL API provides tool developers with the means to insert lightweight hand-coded assembly rather than relying solely on the insertion of instrumentation functions. These features enable the implementation of efficient instrumentation tools with PEBIL. The overhead introduced for basic block counting by PEBIL is an average of 65% of the overhead of Dyninst, 41% of the overhead of Pin, 15% of the overhead of DynamoRIO, and 8% of the overhead of Valgrind.
Performance Analysis of Systems & Software (ISPASS), 2010 IEEE International Symposium on; 04/2010
ABSTRACT: In 2011 SDSC will deploy Gordon, an HPC architecture specifically designed for data-intensive applications. We describe the Gordon architecture and the thinking behind the design choices by considering the needs of two targeted application classes: massive database/data mining and data-intensive predictive science simulations. Gordon employs two technologies that have not been incorporated into HPC systems heretofore: flash SSD memory, and virtual shared memory software. We report on application speedups obtained with a working prototype of Gordon in production at SDSC called Dash, currently available as a TeraGrid resource.
ABSTRACT: The speed of the memory subsystem often constrains the performance of large-scale parallel applications. Experts tune such applications to use hierarchical memory subsystems efficiently. Hardware accelerators, such as GPUs, can potentially improve memory performance beyond the capabilities of traditional hierarchical systems. However, the addition of such specialized hardware complicates code porting and tuning. During porting and tuning, expert application engineers manually browse source code and identify memory access patterns that are candidates for optimization and tuning. HPC applications typically contain thousands to hundreds of thousands of lines of code, creating a labor-intensive challenge for the expert. PIR, PMaC's Static Idiom Recognizer, automates the pattern recognition process. Using static analysis, PIR recognizes specified patterns and tags the source code where they appear. This paper describes the PIR implementation and defines a subset of idioms commonly found in HPC applications. We examine the effectiveness of the tool, demonstrating 95% identification accuracy, and present the results of using PIR on two HPC applications.
39th International Conference on Parallel Processing, ICPP Workshops 2010, San Diego, California, USA, 13-16 September 2010; 01/2010