Conference Paper

Performance Profiling and Analysis of DoD Applications Using PAPI and TAU

University of Tennessee-Knoxville, Knoxville, TN
DOI: 10.1109/DODUGC.2005.50 Conference: Users Group Conference, 2005
Source: IEEE Xplore

ABSTRACT Large scientific applications developed as recently as five to ten years ago are often at a disadvantage in current computing environments. Because acquisition decisions are frequently made for reasons such as price-performance, continuing production runs often requires porting large scientific applications to architectures completely different from the ones on which they were developed. Since the porting step does not include the optimizations necessary for the new architecture, performance often suffers due to various architectural features. The Programming Environment and Training (PET) Computational Environments (CE) team has developed and deployed procedures and mechanisms for collecting performance data and for profiling and optimizing these applications based on that data. The paper illustrates some of these procedures and mechanisms.

ABSTRACT: Modern microprocessors have many microarchitectural features. Quantifying the performance impact of one feature, such as dynamic branch prediction, can be difficult. On one hand, a timing simulator can predict the difference in performance given two different implementations of a technique, but simulators can be quite inaccurate. On the other hand, real systems are very accurate representations of themselves, but often cannot be modified to study the impact of a new technique. We demonstrate how to develop a performance model for branch prediction using real systems based on object code reordering. By observing the behavior of the benchmarks over a range of branch prediction accuracies, we can estimate the impact of a new branch predictor by simulating only the predictor and not the rest of the microarchitecture. We also use the reordered object code to validate a reverse-engineered model for the Intel Core 2 branch predictor. We simulate several branch predictors using Pin and measure which hypothetical branch predictor has the highest correlation with the real one. This study in object code reordering points the way to future work on estimating the impact of other structures such as the instruction cache, the second-level cache, instruction decoders, indirect branch prediction, etc.
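The predictor-only simulation this abstract describes can be illustrated with a minimal sketch. The trace format, table size, and initial counter state below are my own illustrative assumptions, not the paper's actual setup: a bimodal (2-bit saturating counter) predictor is driven by a recorded stream of (branch address, outcome) pairs, and only the predictor is modeled.

```python
# Hypothetical sketch of predictor-only simulation: a bimodal (2-bit
# saturating counter) branch predictor driven by a recorded branch trace.
# Trace format and table size are illustrative assumptions.

def simulate_bimodal(trace, table_bits=10):
    """trace: iterable of (pc, taken) pairs; returns prediction accuracy."""
    size = 1 << table_bits
    counters = [1] * size              # 2-bit counters, start weakly not-taken
    correct = total = 0
    for pc, taken in trace:
        idx = (pc >> 2) & (size - 1)   # index by low bits of the branch PC
        predict_taken = counters[idx] >= 2
        if predict_taken == taken:
            correct += 1
        total += 1
        # Move the saturating counter toward the actual outcome.
        if taken:
            counters[idx] = min(3, counters[idx] + 1)
        else:
            counters[idx] = max(0, counters[idx] - 1)
    return correct / total if total else 0.0

# Example: a loop branch taken 9 times then not taken, repeated 100 times.
trace = [(0x400100, t) for t in ([True] * 9 + [False]) * 100]
print(simulate_bimodal(trace))  # 0.899
```

Running several such predictor models over the same trace and correlating their accuracies with measured hardware behavior is, in spirit, how the paper identifies which hypothetical predictor best matches the real one.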
ABSTRACT: Characterizing a memory reference stream using a reuse distance distribution can enable predicting the performance on a given architecture. Benchmarks can subject an architecture to a limited set of reuse distance distributions, but they cannot exhaustively test it. In contrast, Apex-Map, a synthetic memory probe with parameterized locality, can provide better coverage of machine use scenarios. Unfortunately, relating an application's memory behavior to an Apex-Map parameter set requires considerable expertise. In this work we present a mathematical formulation that describes the relation between Apex-Map and reuse distance distributions. We also introduce a process through which we can automate the estimation of Apex-Map locality parameters for a given application. This process finds the best parameters for Apex-Map probes that generate a reuse distance distribution similar to that of the original application. We tested this scheme on benchmarks from Scalable Synthetic Compact Applications and Unbalanced Tree Search, and we show that it provides an accurate Apex-Map parameterization with a small percentage of mismatch in reuse distance distributions, about 3% on average and less than 8% in the worst case, on the tested applications.
39th International Conference on Parallel Processing, ICPP 2010, San Diego, California, USA, 13-16 September 2010
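The reuse-distance characterization underlying this abstract can be sketched briefly. The function name and the toy address stream below are my own illustration, not the paper's code: for each memory reference, the reuse distance is the number of distinct addresses touched since the previous reference to the same address, and the histogram of these distances summarizes the stream's locality.

```python
from collections import Counter

def reuse_distances(stream):
    """Return the reuse distance of each reference in an address stream.

    The reuse distance of a reference is the number of *distinct*
    addresses accessed since the previous reference to the same address;
    first-time references get None (often treated as infinite distance).
    This is an O(N*M) sketch for clarity; production tools use
    tree-based O(N log M) algorithms.
    """
    last_pos = {}       # address -> index of its most recent reference
    history = []        # the reference stream seen so far
    distances = []
    for i, addr in enumerate(stream):
        if addr in last_pos:
            # Count distinct addresses between the two references.
            between = set(history[last_pos[addr] + 1 : i])
            distances.append(len(between))
        else:
            distances.append(None)     # cold reference
        last_pos[addr] = i
        history.append(addr)
    return distances

# A toy stream; the Counter over finite distances is the distribution
# that a synthetic probe such as Apex-Map would be tuned to reproduce.
stream = ['a', 'b', 'c', 'a', 'b', 'b']
dists = reuse_distances(stream)
print(dists)                           # [None, None, None, 2, 2, 0]
print(Counter(d for d in dists if d is not None))
```

Matching a parameterized probe's reuse-distance distribution to the one measured from an application is the fitting problem the paper automates.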
