Conference Paper

Performance Profiling and Analysis of DoD Applications Using PAPI and TAU

University of Tennessee-Knoxville, Knoxville, TN
DOI: 10.1109/DODUGC.2005.50 Conference: Users Group Conference, 2005
Source: IEEE Xplore

ABSTRACT Large scientific applications developed as recently as five to ten years ago are often at a disadvantage in current computing environments. Because acquisition decisions are frequently driven by factors such as price-performance, continuing production runs often requires porting large scientific applications to architectures completely different from the ones on which they were developed. Since the porting step does not include the optimizations necessary for the new architecture, performance often suffers due to various architectural features. The Programming Environment and Training (PET) Computational Environments (CE) team has developed and deployed procedures and mechanisms for collecting performance data and for profiling and optimizing these applications based on that data. The paper illustrates some of these procedures and mechanisms.

  • Conference Paper: Program Interferometry.
    ABSTRACT: Modern microprocessors have many microarchitectural features. Quantifying the performance impact of one feature such as dynamic branch prediction can be difficult. On one hand, a timing simulator can predict the difference in performance given two different implementations of the technique, but simulators can be quite inaccurate. On the other hand, real systems are very accurate representations of themselves, but often cannot be modified to study the impact of a new technique. We demonstrate how to develop a performance model for branch prediction using real systems. The technique perturbs benchmark executables to yield a wide variety of performance points without changing program semantics or other important execution characteristics such as the number of retired instructions. By observing the behavior of the benchmarks over a range of branch prediction accuracies, we can estimate the impact of a new branch predictor by simulating only the predictor and not the rest of the microarchitecture. We call this technique Program Interferometry.
    2011 International Conference on Parallel Architectures and Compilation Techniques, PACT 2011, Galveston, TX, USA, October 10-14, 2011; 01/2011
  • ABSTRACT: We present a new cache-efficient parallel multilayer Gauss-Seidel algorithm to solve 2D diffusion equations on distributed memory machines, by focusing on improving its cache behaviour and parallelism simultaneously. The novelty of our parallel multi-layer algorithm lies in performing Gauss-Seidel in two alternating sweeping directions (with multiple layers, i.e., iterations per direction) and applying alternating tiling strategies in two opposite sweeping directions to the subdomain allocated to every processor. As a result, its efficiency comes from a significant reduction in two sources of overhead: data cache misses and communication costs. In comparison with two commonly used parallel Gauss-Seidel algorithms, our algorithm has good performance and scalability in a cluster computing environment.
    IEEE 15th International Conference on Parallel and Distributed Systems, ICPADS 2009, 8-11 December 2009, Shenzhen, China; 01/2009
  • ABSTRACT: Characterizing a memory reference stream using its reuse distance distribution can enable predicting the performance on a given architecture. Benchmarks can subject an architecture to a limited set of reuse distance distributions, but they cannot exhaustively test it. In contrast, Apex-Map, a synthetic memory probe with parameterized locality, can provide better coverage of the machine use scenarios. Unfortunately, it requires a lot of expertise to relate an application's memory behavior to an Apex-Map parameter set. In this work we present a mathematical formulation that describes the relation between Apex-Map and reuse distance distributions. We also introduce a process through which we can automate the estimation of Apex-Map locality parameters for a given application. This process finds the best parameters for Apex-Map probes that generate a reuse distance distribution similar to that of the original application. We tested this scheme on benchmarks from Scalable Synthetic Compact Applications and Unbalanced Tree Search, and we show that this scheme provides an accurate Apex-Map parameterization with a small percentage of mismatch in reuse distance distributions, about 3% on average and less than 8% in the worst case, on the tested applications.
    39th International Conference on Parallel Processing, ICPP 2010, San Diego, California, USA, 13-16 September 2010; 01/2010
