ABSTRACT: Suppose one is considering purchase of a computer equipped with accelerators. Or suppose one has access to such a computer and is considering porting code to take advantage of the accelerators. Is there reason to believe the purchase cost or programmer effort will be worth it? It would be nice to be able to estimate the expected improvements in advance of paying money or time. We exhibit an analytical framework and tool-set for providing such estimates: the tools first look for user-defined idioms, patterns of computation and data access identified in advance as candidates to benefit from accelerator hardware. A performance model is then applied to estimate how much faster these idioms would run if ported to the accelerators; a recommendation is made as to whether each idiom is worth the porting effort; and an estimate is provided of the overall application speedup if this were done. As a proof of concept we focus our investigation on Gather/Scatter (G/S) operations and the means of accelerating them available on the Convey HC-1, which has a special-purpose "personality" for accelerating G/S. We test the methodology on two large-scale HPC applications. The idiom recognizer tool saves weeks of programmer effort compared to having the programmer examine the code visually looking for idioms; the performance models save yet more time by rank-ordering the best candidates for porting; and the models are accurate, predicting the G/S runtime speedup from porting to within 10% of the speedup actually achieved. The G/S hardware on the Convey sped up these operations 20x, and the overall impact was to improve total application runtime by as much as 21%.
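The overall-speedup estimate at the end of this abstract follows Amdahl's law. As a back-of-the-envelope check (not the paper's actual performance model), the reported 20x G/S speedup is consistent with a roughly 21% runtime improvement if the G/S idioms accounted for about 22% of baseline runtime, an assumed figure; the Python sketch below makes the arithmetic explicit.

```python
def predicted_app_speedup(gs_fraction, gs_speedup):
    # Amdahl's-law estimate: only the G/S idioms, a fraction of total
    # runtime, are accelerated; the rest of the application is unchanged.
    return 1.0 / ((1.0 - gs_fraction) + gs_fraction / gs_speedup)

# With the abstract's 20x G/S hardware speedup, a ~21% reduction in total
# runtime corresponds to G/S taking roughly 22% of baseline runtime
# (an assumed figure; the abstract does not state the fraction).
s = predicted_app_speedup(0.22, 20.0)
print(s, 1.0 - 1.0 / s)  # ~1.26x speedup, ~21% less runtime
```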
Proceedings of the 25th International Conference on Supercomputing, 2011, Tucson, AZ, USA, May 31 - June 04, 2011; 01/2011
ABSTRACT: Over the life of a modern supercomputer, the energy cost of running the system can exceed the cost of the original hardware purchase. This has driven the community to attempt to understand and minimize energy costs wherever possible. Towards these ends, we present an automated, fine-grained approach to selecting per-loop processor clock frequencies. The clock frequency selection criterion is established through a combination of lightweight static analysis and runtime tracing that automatically acquires application signatures: characterizations of the patterns of execution of each loop in an application. This application characterization is matched with one of a series of benchmark loops, which have been run on the target system and probe it in various ways. These benchmarks form a covering set, a machine characterization of the expected power consumption and performance traits of the machine over the space of execution patterns and clock frequencies. The frequency that confers the optimal behavior in terms of power-delay product for the benchmark that most closely resembles each application loop is the one chosen for that loop. The set of tools that implement this scheme is fully automated, built on top of freely available open source software, and uses an inexpensive power measurement apparatus. We use these tools to show a measured, system-wide energy savings of up to 7.6% on an 8-core Intel Xeon E5530 and 10.6% on a 32-core AMD Opteron 8380 (a Sun X4600 node) across a range of workloads.
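A minimal sketch of the selection step described above, with hypothetical names and made-up measurements: match a loop's execution signature to the nearest benchmark probe, then pick the frequency with the smallest power-delay product from that benchmark's measured table. This illustrates the scheme, not the released tool-set.

```python
def best_frequency(loop_signature, benchmarks):
    # `benchmarks` maps each probe's signature to a measured table of
    # frequency -> (watts, seconds); all names here are illustrative.
    nearest = min(benchmarks,
                  key=lambda sig: sum((a - b) ** 2
                                      for a, b in zip(loop_signature, sig)))
    table = benchmarks[nearest]
    # Power-delay product (energy ~ watts * seconds), minimized over
    # the frequencies measured for the matching benchmark loop.
    return min(table, key=lambda f: table[f][0] * table[f][1])

# Hypothetical data: signature -> {frequency (GHz): (watts, seconds)}
benchmarks = {
    (0.9, 0.1): {2.4: (80.0, 1.00), 1.6: (60.0, 1.05)},  # memory-bound probe
    (0.1, 0.9): {2.4: (85.0, 1.00), 1.6: (62.0, 1.45)},  # compute-bound probe
}
print(best_frequency((0.8, 0.2), benchmarks))  # memory-bound loop -> 1.6 GHz
```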
Euro-Par 2011 Parallel Processing - 17th International Conference, Euro-Par 2011, Bordeaux, France, August 29 - September 2, 2011, Proceedings, Part I; 01/2011
ABSTRACT: SPECFEM3D_GLOBE is a spectral element application enabling the simulation of global seismic wave propagation in 3D anelastic, anisotropic, rotating and self-gravitating Earth models at unprecedented resolution. A fundamental challenge in global seismology is to model the propagation of waves with periods between 1 and 2 seconds, the highest-frequency signals that can propagate clear across the Earth. These waves help reveal the 3D structure of the Earth's deep interior and can be compared to seismographic recordings. We broke the 2-second barrier using the 62K-processor Ranger system at TACC. Indeed, we broke the barrier using just half of Ranger, reaching a period of 1.84 seconds with a sustained 28.7 Tflops on 32K processors. We obtained similar results on the XT4 Franklin system at NERSC and the XT4 Kraken system at the University of Tennessee Knoxville, while a similar run on the 28K-processor Jaguar system at ORNL, which has better memory bandwidth per processor, sustained 35.7 Tflops (a higher flop rate) with a shortest period of 1.94 seconds. Thus we have enabled a powerful new tool for seismic wave simulation, one that operates in the same frequency regimes as nature; in seismology there is no need to pursue much smaller periods because higher-frequency signals do not propagate across the entire globe. We employed performance modeling methods to identify performance bottlenecks and worked through issues of parallel I/O and scalability. Improved mesh design and numbering result in excellent load balancing and few cache misses. The primary achievements are not just the scalability and the high teraflops number, but a historic step towards understanding the physics and chemistry of the Earth's interior at unprecedented resolution.
Proceedings of the ACM/IEEE Conference on High Performance Computing, SC 2008, November 15-21, 2008, Austin, Texas, USA; 01/2008
ABSTRACT: Trace-driven memory simulation tools such as MetaSim Tracer (1) capture the address stream of an application during an instrumented program run. Various statistics can be measured using the in-flight address stream, including anticipated cache hit rates. This paper reports on performance improvements of MetaSim Tracer gained by using techniques developed in SimPoint (2). Concurrent research (3) addresses techniques that can be used to reduce the instrumentation overhead involved in memory tracing, and this work addresses a technique that can be used to decrease the amount of cache simulation required on top of this. The result is a tool for trace-driven cache simulation that is practical to use for memory performance studies of full-sized scientific applications.
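The SimPoint idea referenced here is to exploit program phase behavior so that only representative slices of execution need full cache simulation. A rough Python sketch under that assumption, where a single assignment pass around randomly chosen centers stands in for SimPoint's full k-means and all names are illustrative:

```python
import random

def sampled_hit_rate(interval_vectors, simulate_interval, k=2, seed=0):
    # Cluster per-interval execution fingerprints, fully simulate one
    # representative interval per cluster, and weight each result by
    # its cluster's population.
    random.seed(seed)
    centers = random.sample(interval_vectors, k)

    def nearest(v):
        return min(range(k),
                   key=lambda i: sum((a - b) ** 2
                                     for a, b in zip(v, centers[i])))

    clusters = {}
    for v in interval_vectors:
        clusters.setdefault(nearest(v), []).append(v)

    total = len(interval_vectors)
    return sum(len(members) / total * simulate_interval(members[0])
               for members in clusters.values())

# Fake "cache simulation": pretend the hit rate tracks the first
# fingerprint component; only k intervals are simulated, not all six.
vectors = [(1.0, 0.0), (1.0, 0.0), (0.9, 0.1),
           (0.0, 1.0), (0.0, 1.0), (0.1, 0.9)]
print(sampled_hit_rate(vectors, lambda v: v[0]))
```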
ABSTRACT: Binary instrumentation tools are very useful for collecting traces of program events. Common uses for such traces include trace-driven simulation and performance modeling. However, commonly available general-purpose instrumentation tools are inefficient for capturing fine-grained events such as, for example, a sequence of dynamic memory addresses. We introduce ALITER, an asynchronous lightweight instrumentation tool for event recording which is extremely light in terms of tracing overhead compared to commonly available binary instrumentation tools. The tool creates a buffer in the instrumented code space and inlines buffer maintenance functions into the instrumented code. User-supplied analysis routines are only invoked when the buffer is fairly full. This approach, i.e. having a user code space buffer managed under ALITER's control, ensures that most control transfers between user code and instrumentation code are eliminated. In addition, storing events to the buffer and checking buffer status are implemented very cheaply. Thus traditional sources of tracing overhead are greatly reduced. Overall we report less than a 2-fold slowdown to collect memory traces of the selected benchmarks; this contrasts with tens and even hundreds of fold slowdown using generally available instrumentation tools.
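A Python sketch of the buffering scheme the abstract describes: events go into an in-process buffer via a cheap append-and-check path, and the user-supplied analysis routine is invoked only when the buffer fills. The class and names are illustrative, not ALITER's actual binary-level implementation.

```python
class TraceBuffer:
    def __init__(self, analyze, capacity=1 << 16):
        self.analyze = analyze      # user-supplied analysis routine
        self.capacity = capacity
        self.events = []

    def record(self, address):
        # Common path: one append and one length check, with no
        # control transfer into analysis code.
        self.events.append(address)
        if len(self.events) >= self.capacity:
            self.flush()

    def flush(self):
        # Rare path: hand the whole batch to the analysis routine.
        self.analyze(self.events)
        self.events = []

# Usage: count distinct addresses seen by the instrumented code.
seen = set()
buf = TraceBuffer(seen.update, capacity=4)
for addr in [0x10, 0x18, 0x10, 0x20, 0x28]:
    buf.record(addr)
buf.flush()  # drain the partial buffer at program exit
print(len(seen))  # 4
```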
ABSTRACT: Memory traces record the addresses touched by a program during its execution, enabling many useful investigations for understanding and predicting program performance. But complete address traces are time-consuming to acquire and too large to practically store except in the case of short-running programs. Also, memory traces have to be re-acquired each time the input data (and thus the dynamic behavior of the program) changes. We observe that individual load and store instructions typically have stable memory access patterns. Changes in the dynamic control flow of programs, rather than variation in the memory access patterns of individual instructions, appear to be the primary cause of overall memory behavior varying both during one execution of a program and during re-execution of the same program on different input data. We are leveraging this observation to enable approximate memory traces that are smaller than full traces, faster to acquire via sampling, much faster to re-acquire for new input data, and have a high degree of verisimilitude relative to full traces. This paper presents an update on our progress.
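A toy Python illustration of the observation, with invented names: if each static load or store mostly strides by a constant, its address stream can be summarized as (base, stride, count) plus a short list of exceptions, which is what makes small approximate traces plausible.

```python
def compress_trace(trace):
    # Summarize a full (pc, address) trace per static instruction as
    # (base, stride, strided_steps, exceptions); names are illustrative.
    last, summary = {}, {}
    for pc, addr in trace:
        base, stride, count, extras = summary.get(pc, (addr, None, 0, []))
        if pc in last:
            step = addr - last[pc]
            if stride is None or step == stride:
                stride, count = step, count + 1
            else:
                extras.append(addr)  # pattern break: store verbatim
        last[pc] = addr
        summary[pc] = (base, stride, count, extras)
    return summary

# A strided load compresses to a single record: base 0x1000,
# stride 8, five strided steps, no exceptions.
trace = [(0x400a, 0x1000 + 8 * i) for i in range(6)]
print(compress_trace(trace))
```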
Computational Science - ICCS 2003, International Conference, Melbourne, Australia and St. Petersburg, Russia, June 2-4, 2003. Proceedings, Part III; 01/2003
ABSTRACT: We introduce a metric for evaluating the quality of any predictive ranking and use this metric to investigate methods for answering the question: How can we best rank a set of supercomputers based on their expected performance on a set of applications? On modern supercomputers, with their deep memory hierarchies, we find that rankings based on benchmarks measuring the latency of accesses to L1 cache and the bandwidth of accesses to main memory are significantly better than rankings based on peak flops. We show how to use a combination of application characteristics and machine attributes to compute improved workload-independent rankings.
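The abstract does not spell out its ranking-quality metric, so the sketch below uses a generic stand-in: score a predicted machine ranking by the fraction of machine pairs it orders the same way as measured runtimes. Names and numbers are hypothetical.

```python
def ranking_quality(predicted_rank, measured_time):
    # Fraction of machine pairs whose predicted order matches the
    # order of their measured runtimes (a Kendall-style agreement
    # score, used here only as an illustrative stand-in).
    machines = list(predicted_rank)
    agree = pairs = 0
    for i, a in enumerate(machines):
        for b in machines[i + 1:]:
            pairs += 1
            agree += ((predicted_rank[a] < predicted_rank[b])
                      == (measured_time[a] < measured_time[b]))
    return agree / pairs

# Hypothetical machines ranked by a memory-bandwidth benchmark,
# versus measured runtimes on one application (seconds):
predicted = {"A": 1, "B": 2, "C": 3}   # lower = expected faster
measured = {"A": 10.0, "B": 14.0, "C": 12.0}
print(ranking_quality(predicted, measured))  # 2 of 3 pairs ordered correctly
```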