Performance Optimization of Tensor Contraction Expressions for Many-Body Methods in Quantum Chemistry

The Ohio State University, Columbus, Ohio, USA.
The Journal of Physical Chemistry A (Impact Factor: 2.78). 11/2009; 113(45):12715-23. DOI: 10.1021/jp9051215
Source: PubMed

ABSTRACT: Complex tensor contraction expressions arise in accurate electronic structure models in quantum chemistry, such as the coupled cluster method. This paper addresses two complementary aspects of performance optimization of such tensor contraction expressions. Transformations using algebraic properties of commutativity and associativity can be used to significantly decrease the number of arithmetic operations required for evaluation of these expressions. The identification of common subexpressions among a set of tensor contraction expressions can result in a reduction of the total number of operations required to evaluate the tensor contractions. The first part of the paper describes an effective algorithm for operation minimization with common subexpression identification and demonstrates its effectiveness on tensor contraction expressions for coupled cluster equations. The second part of the paper highlights the importance of data layout transformation in the optimization of tensor contraction computations on modern processors. A number of considerations, such as minimization of cache misses and utilization of multimedia vector instructions, are discussed. A library for efficient index permutation of multidimensional tensors is described, and experimental performance data is provided that demonstrates its effectiveness.
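
To make the operation-minimization idea in the abstract concrete, the following is a minimal sketch of how the order of pairwise contractions changes the arithmetic cost. It is not the paper's algorithm: the four-tensor expression S[a,b,i,j] = sum over c,d,e,f,k,l of A[a,c,i,k] B[b,e,f,l] C[d,f,j,k] D[c,d,e,l] is an illustrative example chosen here, every index is assumed to have the same extent, and the function names (`naive_cost`, `pairwise_cost`, `chain_cost`) are hypothetical. The cost model simply counts one multiply and one add per point of each loop nest.

```python
from math import prod


def naive_cost(tensors, extent):
    """Ops for one loop nest over every index: k-1 multiplies and 1 add per point."""
    all_indices = set().union(*tensors)
    return len(tensors) * prod(extent[i] for i in all_indices)


def pairwise_cost(left, right, extent):
    """Ops for contracting two tensors: 1 multiply and 1 add per loop-nest point."""
    return 2 * prod(extent[i] for i in set(left) | set(right))


def chain_cost(tensors, output, extent):
    """Ops for contracting `tensors` left to right (associativity fixes the grouping,
    commutativity lets the caller choose the order). An index is summed away as soon
    as neither the output nor any remaining tensor refers to it."""
    total = 0
    current = tensors[0]
    for pos, nxt in enumerate(tensors[1:], start=1):
        keep = set(output).union(*tensors[pos + 1:])
        total += pairwise_cost(current, nxt, extent)
        current = "".join(sorted(i for i in set(current) | set(nxt) if i in keep))
    return total


# Assumed uniform extent N = 10 for all ten indices.
extent = {i: 10 for i in "abcdefijkl"}
A, B, C, D = "acik", "befl", "dfjk", "cdel"

print(naive_cost([A, B, C, D], extent))          # 40000000000  (4 * N**10)
print(chain_cost([A, B, C, D], "abij", extent))  # 20400000000  (poor order)
print(chain_cost([B, D, C, A], "abij", extent))  # 6000000      (6 * N**6)
```

Under these assumptions, contracting B with D first, then C, then A keeps every intermediate at four indices and every loop nest at six, which is where the drop from order N^10 to order N^6 comes from. A real optimizer would also search over groupings and share common subexpressions across several expressions, which this sketch does not attempt.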

Available from: Gerald Baumgartner, Jul 08, 2015
  • Source
    • "We also notice that the use of different compilers leads to differences in performance. On architectures such as recent Intel x86 processors, where SIMD instructions are available, we generate index permutation routines following an automatic approach described in [51] [26] [50]. The basic idea is to apply loop tiling at different cache/TLB levels and then to search automatically for the optimal loop order, tile sizes, and SIMD code sequence for index permutation."
    ABSTRACT: Empirical optimizers like ATLAS have been very effective in optimizing computational kernels in libraries. The best choice of parameters such as tile size and degree of loop unrolling is determined by executing different versions of the computation. In contrast, optimizing compilers use a model-driven approach to program transformation. While the model-driven approach of optimizing compilers is generally orders of magnitude faster than ATLAS-like library generators, its effectiveness can be limited by the accuracy of the performance models used. In this paper, we describe an approach where a class of computations is modeled in terms of constituent operations that are empirically measured, thereby allowing modeling of the overall execution time. The performance model with empirically determined cost components is used to perform data layout optimization together with the selection of library calls and layout transformations in the context of the Tensor Contraction Engine, a compiler for a high-level domain-specific language for expressing computational models in quantum chemistry. The effectiveness of the approach is demonstrated through experimental measurements on representative computations from quantum chemistry.
    Journal of Parallel and Distributed Computing 03/2012; 72:338-352. DOI:10.1016/j.jpdc.2011.09.006 · 1.01 Impact Factor
  • Source
    ABSTRACT: In this article we report on the coupled-cluster factorization problem. We describe the first implementation that optimizes (i) the contraction order for each term, (ii) the identification of reusable intermediates, (iii) the selection and factoring out of common factors simultaneously, considering all projection levels in a single step. The optimization is achieved by means of a genetic algorithm. Taking a one-term-at-a-time strategy as reference our factorization yields speedups of up to 4 (for intermediate excitation levels, smaller basis sets). We derive a theoretical lower bound for the highest order scaling cost and show that it is met by our implementation. Additionally, we report on the performance of the resulting highly excited coupled-cluster algorithms and find significant improvements with respect to the implementation of Kállay and Surján [J. Chem. Phys. 115, 2945 (2001)] and comparable performance with respect to MOLPRO's handwritten and dedicated open shell coupled cluster with singles and doubles substitutions implementation [P. J. Knowles, C. Hampel, and H.-J. Werner, J. Chem. Phys. 99, 5219 (1993)].
    The Journal of Chemical Physics 03/2011; 134(12):124106. DOI:10.1063/1.3561739 · 3.12 Impact Factor
  • ABSTRACT: We describe an extension of our graphics processing unit (GPU) electronic structure program TeraChem to include atom-centered Gaussian basis sets with d angular momentum functions. This was made possible by a “meta-programming” strategy that leverages computer algebra systems for the derivation of equations and their transformation to correct code. We generate a multitude of code fragments that are formally mathematically equivalent, but differ in their memory and floating-point operation footprints. We then select between different code fragments using empirical testing to find the highest performing code variant. This leads to an optimal balance of floating-point operations and memory bandwidth for a given target architecture without laborious manual tuning. We show that this approach is capable of similar performance compared to our hand-tuned GPU kernels for basis sets with s and p angular momenta. We also demonstrate that mixed precision schemes (using both single and double precision) remain stable and accurate for molecules with d functions. We provide benchmarks of the execution time of entire self-consistent field (SCF) calculations using our GPU code and compare to mature CPU based codes, showing the benefits of the GPU architecture for electronic structure theory with appropriately redesigned algorithms. We suggest that the meta-programming and empirical performance optimization approach may be important in future computational chemistry applications, especially in the face of quickly evolving computer architectures.
    Journal of Chemical Theory and Computation 11/2012; 9(1):213–221. DOI:10.1021/ct300321a · 5.31 Impact Factor