Conference Paper

Performance Profiling and Analysis of DoD Applications Using PAPI and TAU

University of Tennessee-Knoxville, Knoxville, TN
DOI: 10.1109/DODUGC.2005.50 Conference: Users Group Conference, 2005
Source: IEEE Xplore

ABSTRACT Large scientific applications developed as recently as five to ten years ago are often at a disadvantage in current computing environments. Due to frequent acquisition decisions made for reasons such as priceperformance, in order to continue production runs it is often necessary to port large scientific applications to completely different architectures than the ones on which they were developed. Since the porting step does not include optimizations necessary for the new architecture, performance often suffers due to various architectural features. The Programming Environment and Training (PET) Computational Environments (CE) team has developed and deployed different procedures and mechanisms for collection of performance data and for profiling and optimizations of these applications based on that data. The paper illustrates some of these procedures and mechanisms.

  • Source
    [Show abstract] [Hide abstract]
    ABSTRACT: Multicore microarchitecture designs have become ubiquitous in today's computing environment enabling multiple processes to execute simultaneously on a single chip. With these new parallel processing capabilities comes a need to better understand how co-running applications impact and interfere with each other. The ability to characterize and better understand cross-core performance interference can prove critical for a number of application domains, such as performance debugging, compiler optimization, and application co-scheduling to name a few. We proposed a novel methodology for the characterization and profiling of cross-core interference on current multicore systems, which we call contention synthesis. Our profiling approach characterizes an applications cross-core interference sensitivity by manufacturing contention with the application and observing the impact of this synthesized contention on the application. Understanding how to synthesize contention on current chip microarchitectures is unclear as there are a number of potentially contentious data access behaviors. This is further complicated by the fact that current chip microprocessors are engineered and tuned to circumvent the contentious nature of certain data access behaviors. In this work we explore and evaluate five designs for a contention synthesis mechanism. We also investigate how these five contention synthesis engines impact the performance of 19 of the SPEC2006 benchmarks on two state of the art chip multiprocessors, namely Intel's Core i7 and AMD's Phenom X4 architectures. Finally we demonstrate how contention synthesis can be used to accurately characterize an application's cross-core interference sensitivity.
  • Source
    [Show abstract] [Hide abstract]
    ABSTRACT: Modern microprocessors have many microarchitectural features. Quantifying the performance impact of one feature such as dynamic branch prediction can be difficult. On one hand, a timing simulator can predict the difference in performance given two different implementations of the technique, but simulators can be quite inaccurate. On the other hand, real systems are very accurate representations of themselves, but often cannot be modified to study the impact of a new technique. We demonstrate how to develop a performance model for branch prediction using real systems based on object code reordering. By observing the behavior of the benchmarks over a range of branch prediction accuracies, we can estimate the impact of a new branch predictor by simulating only the predictor and not the rest of the microarchitecture. We also use the reordered object code to validate a reverse-engineered model for the Intel Core 2 branch predictor. We simulate several branch predictors using Pin and measure which hypothetical branch predictor has the highest correlation with the real one. This study in object code reorder points to way to future work on estimating the impact of other structures such as the instruction cache, the second-level cache, instruction decoders, indirect branch prediction, etc.
  • Source

Full-text (3 Sources)

Available from
May 31, 2014