Performance Optimization of Tensor Contraction

Expressions for Many Body Methods in Quantum

Chemistry

Albert Hartono,† Qingda Lu,† Thomas Henretty,† Sriram Krishnamoorthy,†

Huaijian Zhang,† Gerald Baumgartner,‡ David E. Bernholdt,¶ Marcel Nooijen,§

Russell Pitzer,† J. Ramanujam,‡ and P. Sadayappan∗,†

†The Ohio State University, ‡Louisiana State University, ¶Oak Ridge National Laboratory, and

§University of Waterloo

E-mail: saday@cse.ohio-state.edu

Abstract

Complex tensor contraction expressions arise in accurate electronic structure models in

quantum chemistry, such as the coupled cluster method. This paper addresses two complemen-

tary aspects of performance optimization of such tensor contraction expressions. Transforma-

tions using algebraic properties of commutativity and associativity can be used to significantly

decrease the number of arithmetic operations required for evaluation of these expressions. The

identification of common subexpressions among a set of tensor contraction expressions can

result in a reduction of the total number of operations required to evaluate the tensor contrac-

tions. The first part of the paper describes an effective algorithm for operation minimization


with common subexpression identification and demonstrates its effectiveness on tensor con-

traction expressions for coupled cluster equations. The second part of the paper highlights

the importance of data layout transformation in the optimization of tensor contraction com-

putations on modern processors. A number of considerations such as minimization of cache

misses and utilization of multimedia vector instructions are discussed. A library for efficient

index permutation of multi-dimensional tensors is described and experimental performance

data is provided that demonstrates its effectiveness.

Introduction

Users of current and emerging high-performance parallel computers face major challenges to both

performance and productivity in the development of their scientific applications. Manually developing an accurate quantum chemistry model typically takes an expert months to years of tedious effort to produce and debug a high-performance implementation. One approach to alleviate the burden on application developers is the use of automatic code generation techniques to

synthesize efficient parallel programs from a high-level specification of computations expressed in a

domain-specific language. The Tensor Contraction Engine (TCE)1–3 effort resulted from a collaboration between computer scientists and quantum chemists to develop a framework for automated

optimization of tensor contraction expressions, which form the basis of many-body and coupled

cluster methods.4–7 In this paper, we describe two complementary optimization approaches that

were developed for the TCE, but are available as independent software components for use by

developers of other computational chemistry suites.

The first step in the TCE’s code synthesis process is the transformation of input tensor con-

traction expressions into an equivalent form with minimal operation count. Input equations rep-

resenting a collection of tensor contraction expressions typically involve the summation of tens to

hundreds of terms, each involving the contraction of two or more tensors. Given a single-term ex-

pression with several tensors to be contracted, instead of a single nested loop structure to compute

the result, it is often much more efficient to use a sequence of pairwise contractions of tensors,

with explicit creation of temporary intermediate tensors. This optimization problem can be viewed

as a generalization of the matrix chain multiplication problem. However, while the matrix-chain

optimization problem has a polynomial time solution, the multi-tensor contraction problem has

been shown to be NP-hard8 — a combinatorial number of possibilities for pairwise two-tensor

contractions must be considered. With tensor contraction expressions involving the summation

of tens to hundreds of terms, there are opportunities for further reduction in computational cost

by recognizing common subexpressions in the sequence of pairwise two-tensor contractions for

computing the multi-tensor contraction terms. Quantum chemists have addressed the operation

optimization problem for specific models,7,9 but to the best of our knowledge a general approach

to optimization of arbitrary tensor contraction expressions was not addressed prior to the TCE ef-

fort. In the first part of the paper, we discuss a generalized treatment of the operation minimization

problem for tensor contraction expressions.
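To make the contrast concrete, the matrix-chain special case that the tensor problem generalizes does admit the classic O(n³) dynamic-programming solution. A minimal sketch (the dimensions are made up for illustration, not taken from the paper):

```python
def matrix_chain_cost(dims):
    """Minimum scalar multiplications to evaluate a chain of matrices.

    dims[i], dims[i+1] give the shape of matrix i, so there are len(dims)-1
    matrices. cost[i][j] holds the minimal cost of multiplying matrices
    i..j inclusive.
    """
    n = len(dims) - 1
    cost = [[0] * n for _ in range(n)]
    for span in range(1, n):                 # chain length minus one
        for i in range(n - span):
            j = i + span
            cost[i][j] = min(
                cost[i][k] + cost[k + 1][j] + dims[i] * dims[k + 1] * dims[j + 1]
                for k in range(i, j)         # split point of the final multiply
            )
    return cost[0][n - 1]

# (A·B)·C beats A·(B·C) for these shapes: 7500 vs 75000 multiplications.
print(matrix_chain_cost([10, 100, 5, 50]))   # 7500
```

No such polynomial-time recurrence is known for multi-term tensor contraction expressions with shared subexpressions, which is what motivates the heuristic approach the paper develops.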

The second part of the paper addresses an important issue pertaining to achieving a high frac-

tion of processor peak performance when computing operation-minimized tensor contraction ex-

pressions. Achieving high performance on current and emerging processors requires the genera-

tion of highly optimized code that exploits the vector instruction set of the machine (e.g., SSE,

AVX, etc.), minimizes data movement costs between memory and cache, and minimizes the num-

ber of register loads/stores in loops. The current state-of-the-art in compiler technology is unable

to achieve anywhere close to machine peak in compiling loop-level code representing a multi-

dimensional tensor contraction. Hence the approach taken in quantum chemistry codes is to morph

a tensor contraction problem into a matrix multiplication problem and then use highly tuned matrix

multiplication libraries available for nearly all systems. In general, this requires a layout transfor-

mation of the tensors into a form where all contracted indices of the tensors are grouped together in

the transformed view. Theoretically, the computational complexity of the data layout transformation step is linear in the number of elements in the tensor, while the subsequent matrix multiplication has a higher computational complexity. However, in practice

the use of a straightforward loop code to perform the layout transformation results in significant

overhead. In the second part of the paper we discuss the development of an efficient tensor layout

transformation library.

The rest of the paper is organized as follows. The next section elaborates on the operation

minimization problem, followed by a section that describes the algorithmic approach to operation

minimization. Experimental results that demonstrate its effectiveness are presented in the section

after that. The following section describes the layout transformation problem, summarizing an

approach (described in detail elsewhere10) to efficient transposition of 2D arrays, and the gener-

alization of the 2D transposition routines to multi-dimensional tensor layout transformation along

with experimental results from incorporation of the layout transformation routines into NWChem.

We then discuss related work in the section following that, leading to the conclusion section.

Operation Minimization of Tensor Contraction Expressions

A tensor contraction expression comprises a sum of a number of terms, where each term represents

the contraction of two or more tensors. We first illustrate the issue of operation minimization

for a single term, before addressing the issue of optimizing across multiple terms. Consider the

following tensor contraction expression involving three tensors t, f and s, with indices x and z

that have range V, and indices i and k that have range O. Distinct ranges for different indices are

a characteristic of the quantum chemical methods of interest, where O and V correspond to the

number of occupied and virtual orbitals in the representation of the molecule (typically V ≫ O).

Computed as a single nested loop computation, the number of arithmetic operations needed would be 2O²V²:

r^x_i = Σ_{z,k} t^z_i f^k_z s^x_k    (cost = 2O²V²)

However, by performing a two-step computation with an intermediate I, it is possible to compute the result using 4OV² operations:

I^x_z = Σ_k f^k_z s^x_k    (cost = 2OV²);    r^x_i = Σ_z t^z_i I^x_z    (cost = 2OV²)

Another possibility, using 4O²V operations, which is more efficient when V > O (as is usually the case in quantum chemistry calculations), is shown below:

I^k_i = Σ_z t^z_i f^k_z    (cost = 2O²V);    r^x_i = Σ_k I^k_i s^x_k    (cost = 2O²V)

The above example illustrates the problem of single-term optimization, also called strength reduction: find an operation-minimal sequence of two-tensor contractions to achieve a multi-tensor contraction. Different orders of contraction can result in very different operation costs; for the above example, if the ratio V/O were 10, there would be an order of magnitude difference in the number of arithmetic operations between the two choices.
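The two factorizations above are easy to check numerically. A sketch using numpy's einsum with toy values of O and V (the tensor names follow the example; the random data is purely illustrative):

```python
import numpy as np

O, V = 4, 10                      # toy occupied / virtual ranges
rng = np.random.default_rng(0)
t = rng.random((V, O))            # t[z, i]
f = rng.random((O, V))            # f[k, z]
s = rng.random((V, O))            # s[x, k]

# Single-term form: r[x,i] = sum_{z,k} t[z,i] f[k,z] s[x,k], cost ~ 2*O^2*V^2
r_direct = np.einsum('zi,kz,xk->xi', t, f, s)

# First factorization: I[x,z] = sum_k f[k,z] s[x,k], then contract with t
# (each step costs ~ 2*O*V^2)
I_xz = np.einsum('kz,xk->xz', f, s)
r_1 = np.einsum('zi,xz->xi', t, I_xz)

# Second factorization: I[k,i] = sum_z t[z,i] f[k,z], then contract with s
# (each step costs ~ 2*O^2*V, cheaper when V > O)
I_ki = np.einsum('zi,kz->ki', t, f)
r_2 = np.einsum('ki,xk->xi', I_ki, s)

assert np.allclose(r_direct, r_1) and np.allclose(r_direct, r_2)
```

All three evaluation orders produce the same tensor; only the operation counts differ.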

With complex tensor contraction expressions involving a large number of terms, if multiple

occurrences of the same subexpression can be identified, it need only be computed once, stored

in an intermediate tensor and used multiple times. Thus, common subexpressions can be stored

as intermediate results that are used more than once in the overall computation. Manual formu-

lations of computational chemistry models often involve the use of such intermediates. The class

of quantum chemical methods of interest, which include the coupled cluster singles and doubles

(CCSD) method,7,9 are most commonly formulated using the molecular orbital basis (MO) integral tensors. However, the MO integrals are intermediates, derived from the more fundamental

atomic orbital basis (AO) integral tensors. Alternate “AO-based” formulations of CCSD have been

developed in which the more fundamental AO integrals are used directly, without fully forming the

MO integrals.11 However, it is very difficult to manually explore all possible formulations of this

type to find the one with minimal operation count, especially since it can depend strongly on the

characteristics of the particular molecule being studied.

The challenge in identifying cost-effective common subexpressions (also referred to as com-

mon subexpression elimination, or CSE) is the combinatorial explosion of the search space, since

single-term optimization of different product terms must be treated in a coupled manner. The

following simple example illustrates the problem.

Suppose we have two MO-basis tensors, v and w, which can be expressed as a transformation

of the AO-basis tensor, a, in two steps. Using single-term optimization to form tensor v, we

consider two possible sequences of binary contractions as shown below, which both have the same

(minimal) operation cost. Extending the notation above, indices p and q represent AO indices,

which have range M = O+V.

Seq. 1:    f^i_q = Σ_p a^p_q c^i_p    (cost = 2OM²);    v^i_j = Σ_q f^i_q d^q_j    (cost = 2O²M)

Seq. 2:    g^p_i = Σ_q a^p_q d^q_i    (cost = 2OM²);    v^i_j = Σ_p g^p_i c^j_p    (cost = 2O²M)

To generate tensor w, suppose that there is only one cost-optimal sequence:

f^i_q = Σ_p a^p_q c^i_p    (cost = 2OM²);    w^i_x = Σ_q f^i_q e^q_x    (cost = 2OVM)

Note that the first step in the formation of w uses the same intermediate tensor f that appears in

sequence 1 for v. Considering just the formation of v, either of the two sequences is equivalent in

cost. But one form uses a common subexpression that is useful in computing the second MO-basis

tensor, while the other form does not. If sequence 1 is chosen for v, the total cost of computing

both v and w is 2OM² + 2O²M + 2OVM. On the other hand, the total cost is higher if sequence 2 is chosen (4OM² + 2O²M + 2OVM). The 2OM² cost difference is significant when M is large.
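The arithmetic behind the 2OM² difference can be spelled out in a few lines (toy values of O and V; the cost formulas are the ones from the example above):

```python
# Cost of computing v and w, depending on which optimal sequence is chosen for v.
O, V = 10, 100
M = O + V   # AO index range

shared   = 2*O*M**2 + 2*O**2*M + 2*O*V*M   # Seq. 1 for v: intermediate f reused by w
unshared = 4*O*M**2 + 2*O**2*M + 2*O*V*M   # Seq. 2 for v: f must be built separately for w

assert unshared - shared == 2 * O * M**2   # exactly the 2OM^2 difference
```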

When a large number of terms exist in a tensor contraction expression, there is a combinatorial

explosion in the search space if all possible equivalent-cost forms for each product term must be

compared with each other.

In the first part of the paper, we address the following question: by developing an automatic operation minimization procedure that is effective in identifying suitable common subexpressions in

tensor contraction expressions, can we automatically find more efficient computational forms? For

example, with the coupled cluster equations, can we automatically find AO-based forms by sim-

ply executing the operation minimization procedure on the standard MO-based CCSD equations,

where occurrences of the MO integral terms are explicitly expanded out in terms of AO integrals

and integral transformations?

Algorithms for Operation Minimization with Common Subex-

pression Elimination

In this section, we describe the algorithm used to perform operation minimization, by employing

single-term optimization together with CSE. The exponentially large space of possible single-term

optimizations, together with CSE, makes an exhaustive search approach prohibitively expensive.

So we use a two-step approach to apply single-term optimization and CSE in tandem.

The algorithm is shown in Figure 2. It uses the single-term optimization algorithm, which is

broadly illustrated in Figure 1 and described in greater detail in our earlier work.12 It takes as

input a sequence of tensor contraction statements. Each statement defines a tensor in terms of a

sum of tensor contraction expressions. The output is an optimized sequence of tensor contrac-

tion statements involving only binary tensor contractions. All intermediate tensors are explicitly

defined.

The key idea is to determine the “binarization” (determination of optimal sequence of two-

tensor contractions) of more expensive terms before the less expensive terms. The most expensive

terms contribute heavily to the overall operation cost, and potentially contain expensive subex-

pressions. Early identification of these expensive subexpressions can facilitate their reuse in the

computation of other expressions, reducing the overall operation count.

The algorithm begins with the term set to be optimized as the set of all the terms of the tensor

contraction expressions on the right hand side of each statement. The set of intermediates is ini-

tially empty. In each step of the iterative procedure, the binarization for one term is determined.

Single-term optimization is applied to each term in the term set using the current set of interme-

diates and the most expensive term is chosen to be “binarized” first. Among the set of optimal

binarizations for the chosen term, the one that maximally reduces the cost of the remaining terms

is chosen. Once the term and its binarizations are decided upon, the set of intermediates is up-

dated and the corresponding statements for the new intermediates are generated. The procedure

continues until the term set is empty.
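As a rough sketch, the greedy procedure just described can be written in Python. Here `single_term_opt_cse` is a schematic stand-in that returns candidate binarizations as (binarization, new-intermediates, cost) triples; the actual algorithm is the one in Figure 2:

```python
def optimize(terms, single_term_opt_cse):
    """Greedy global operation minimization (schematic sketch).

    terms: iterable of multi-tensor product terms still to be binarized.
    single_term_opt_cse(term, intermediates): list of cost-optimal candidates,
        each a (binarization, new_intermediates, cost) triple.
    """
    def best_cost(term, ints):
        return min(cost for _, _, cost in single_term_opt_cse(term, ints))

    terms = set(terms)
    intermediates, schedule = set(), []
    while terms:
        # Binarize the currently most expensive term first.
        heaviest = max(terms, key=lambda m: best_cost(m, intermediates))
        best_pair, best_profit = None, None
        for binz, new_ints, _ in single_term_opt_cse(heaviest, intermediates):
            # Profit: how much these new intermediates cheapen the other terms.
            profit = sum(
                best_cost(m, intermediates) - best_cost(m, intermediates | set(new_ints))
                for m in terms - {heaviest}
            )
            if best_profit is None or profit > best_profit:
                best_pair, best_profit = (binz, set(new_ints)), profit
        schedule.append(best_pair[0])
        intermediates |= best_pair[1]
        terms.remove(heaviest)
    return schedule

# Toy input: term A has two equal-cost binarizations; only the one producing
# intermediate 'f' also cheapens term B (all names and costs hypothetical).
def toy_stoc(term, ints):
    if term == 'A':
        return [('A-via-f', {'f'}, 10), ('A-via-g', {'g'}, 10)]
    return [('B-final', set(), 2 if 'f' in ints else 5)]

print(optimize({'A', 'B'}, toy_stoc))   # ['A-via-f', 'B-final']
```

On the toy input, the procedure binarizes the expensive term A first and, of its two equal-cost options, picks the one whose intermediate later cheapens B.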

SINGLE-TERM-OPT-CSE(E, is)
1   if E is a single-tensor expression
2     then return {⟨E, ∅⟩}
3     else /* E is a multiple-tensor contraction expression (i.e., E1 ∗ ... ∗ En) */
4       {⟨p1, is1⟩, ⟨p2, is2⟩, ...} ←
5         set of pairs of optimal binarizations of E and their corresponding intermediate sets
6         (the given intermediate set is is used to find effective common subexpressions)
7       return {⟨p1, is1⟩, ⟨p2, is2⟩, ...}

Figure 1: Single-term optimization algorithm with common subexpression elimination

Evaluation of Operation Minimization

In order to illustrate the use of the automatic operation minimization algorithm, we consider the

tensor expressions for a closed shell CCSD T2 computation. Figure 3 shows the CCSD T2 equa-

tion, including the computation of the MO integrals (denoted v) and the expression for the double-

excitation residual. We compare the optimized forms generated in two different ways: 1) with the

conventional “separated” approach of first explicitly forming the MO integrals from AO integrals

and then using the MO integrals for the CCSD T2 term, and 2) using an “integrated” form where

significant MO integrals in the CCSD T2 equation are replaced by the expressions that produce

them. Although some MO integrals may appear more than once in the T2 expression, the multiple

expansion of such terms does not result in any unnecessary duplication of computation because of

common subexpression elimination with the operation minimization algorithm.

We study two scenarios for evaluation of the CCSD T2 expression: 1) the typical mode, where

iterations of the residual calculation are performed with the t-amplitudes changing every itera-

tion, but without change to the MO integrals (because the transformation matrices to convert AO

integrals to MO integrals do not change), and 2) an orbital optimization (Brueckner basis) sce-

nario where the AO-to-MO transformation matrices change from iteration to iteration, i.e., the MO

integrals (if explicitly formed) must be recalculated for every iteration.

Since the operation minimization algorithm uses specific values for the number of occupied

orbitals O and the number of virtual orbitals V, the optimized expressions that are generated could

be different for different O and V values. The values for O and V depend on the molecule and

OPTIMIZE(stmts)
1    MSET ← set of all terms obtained from RHS expressions of stmts
2    is ← ∅   /* the set of intermediates */
3    while MSET ≠ ∅
4      do Mheaviest ← the heaviest term in MSET
5           (searched by applying SINGLE-TERM-OPT-CSE(Mi, is) on each term Mi ∈ MSET)
6         PSET ← SINGLE-TERM-OPT-CSE(Mheaviest, is)
7         ⟨pbest, isbest⟩ ← NIL
8         profit ← 0
9         for each ⟨pi, isi⟩ ∈ PSET
10          do cur_profit ← 0
11             for each Mi ∈ (MSET − {Mheaviest})
12               do base_cost ← op-cost of optimal binarization in SINGLE-TERM-OPT-CSE(Mi, is)
13                  opt_cost ← op-cost of optimal binarization in SINGLE-TERM-OPT-CSE(Mi, is ∪ isi)
14                  cur_profit ← cur_profit + (base_cost − opt_cost)
15             if (⟨pbest, isbest⟩ = NIL) ∨ (cur_profit > profit)
16               then ⟨pbest, isbest⟩ ← ⟨pi, isi⟩
17                    profit ← cur_profit
18        stmts ← replace the term Mheaviest in stmts with pbest
19        MSET ← MSET − {Mheaviest}
20        is ← is ∪ isbest
21   return stmts

Figure 2: Global operation minimization algorithm

quality of the simulation, but a typical range is 1 ≤ V/O ≤ 100. To provide concrete comparisons,

O was set to 10 and V values of 100, 500 and 1000 were used. Additional runs for O set to 100

and V values of 1000, 5000 and 10000 were also evaluated, but the overall trends were similar, so that data is not presented here.

The standard CCSD computation proceeds through a number of iterations in which the MO

integrals remain unchanged. At convergence, the amplitudes attain values such that the residual

is equal to zero and this typically takes 10–50 iterations. In some variants of CCSD, such as

Brueckner basis orbital optimization, the MO integrals also change at each iteration, requiring

the AO-to-MO transformation to be repeated. The optimized tensor expressions for these two

scenarios can be very different. With the operation minimization system, all input terms can be

tagged as either stable (i.e., unchanging from iteration to iteration) or volatile (i.e., changing every

iteration). In addition, an expected number of iterations can be provided to the optimizer. The

[Figure 3: Unoptimized input expressions for CCSD T2 and AO-to-MO transform]

operation minimization algorithm seeks to find a transformed form of the input tensor expression

that minimizes the total arithmetic cost for the expected number of iterations.
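The stable/volatile tagging reduces to a simple amortized cost comparison. A toy sketch (the two cost expressions are illustrative stand-ins for the kind of one-time versus per-iteration terms analyzed below, not the actual CCSD cost function):

```python
# Amortized cost model behind the stable/volatile tagging: a "stable" term is
# computed once, while a "volatile" term is recomputed in each of T iterations.
O, V, T = 10, 500, 10   # toy sizes; T = expected number of CCSD iterations
M = O + V

one_time = 2 * V * M**4   # e.g., forming MO integrals once (separated form)
per_iter = 2 * O * M**4   # e.g., the traded O(OM^4) term (integrated form)

# Recomputing the per-iteration term every iteration is still cheaper than the
# one-time term whenever O*T < V.
assert O * T < V
assert T * per_iter < one_time
```

The optimizer effectively performs this comparison term by term when choosing between formulations.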

[Figure 4: Integrated optimization of CCSD T2 with AO-to-MO transforms]

Figure 4 shows the output generated by the integrated optimization of the AO-to-MO transform and the CCSD T2 expression (for an expected number of iterations T of 10). Seventeen new intermediates are generated, labeled with capital letters A through Q. Only seven of the original twelve v integrals are explicitly computed in the optimized form, while the expressions using the

other v integrals have been transformed to use other intermediates to reduce total operation cost.

Table 1 provides detailed information about the computational complexity of the optimized

expressions for the different cases considered, showing the coefficients for the various higher order

polynomial terms for the arithmetic cost (counting each floating point addition or multiplication as

one operation; we note that this is different from the convention used in previous publications such

as9 where a multiply-add pair is counted as one operation rather than two).

The first six columns in Table 1 correspond to the standard CCSD model while the last six

columns correspond to optimization for the Brueckner CCSD model. Alternate columns, labeled

“sep” and “int”, provide the coefficients of cost terms for the resulting expressions using separated

and integrated optimization, respectively. Considering the first two columns (for V = 500), it

is clear that the optimized expressions are very different. Some table entries have constant values

while others are scaled by T: a constant value implies that the corresponding term is only evaluated

once (for example, the MO integrals in the expressions derived by separated optimization), while

the entries scaled by T are executed repeatedly during every CCSD iteration. Since a single table is

used for displaying the polynomial complexity terms for different expressions, we also have some

zero entries when terms do not apply to a particular optimized expression.

With separated optimization, the optimized form has several contractions with computational

complexity in the fifth power of V/M (for V ≫ O, M is very close to V), arising from the explicit

computation of the MO integrals. In contrast, integrated optimization produces optimized expressions without any terms involving the fifth power of V/M, instead trading them for an O(OM⁴)

term that is computed T times (once every CCSD iteration). When O×T is less than V, such a term, despite being recomputed every iteration, has lower cost than the one-time explicit computation of the MO integrals.

count using the separated versus integrated optimization. For V = 100, both optimized expressions essentially have the same cost. But for the higher values of V, it can be seen that the integrated optimization produces a much more efficient form than separated optimization, with the benefit

increasing as V increases.

Table 1: Coefficients of leading terms of the symbolic cost function; O = 10, M = V + O. ("sep" denotes separated optimization of the CCSD T2 expression and AO-to-MO transform; "int" denotes integrated optimization of CCSD T2 and AO-to-MO transform; T denotes the number of CCSD iterations.)

Leading terms  |           Standard iteration            |            Brueckner basis
of symbolic    |  V = 100    |  V = 500    |  V = 1000   |  V = 100    |  V = 500    |  V = 1000
cost function  | sep   int   | sep   int   | sep   int   | sep   int   | sep   int   | sep   int
VM⁴            | 2     0     | 2     0     | 2     0     | 2T    0     | 2T    0     | 2T    0
V²M³           | 2     0     | 2     0     | 2     0     | 2T    0     | 2T    0     | 2T    0
V³M²           | 2     0     | 2     0     | 2     0     | 2T    0     | 2T    0     | 2T    0
V⁴M            | 2     0     | 2     0     | 2     0     | 2T    0     | 2T    0     | 2T    0
O²M⁴           | 0     2T    | 0     2T    | 0     2T    | 0     2T    | 0     2T    | 0     2T
O²V⁴           | 2T    0     | 2T    0     | 2T    0     | 2T    0     | 2T    0     | 2T    0
OM⁴            | 2     2T+4  | 2     2T+4  | 2     2T+4  | 2T    6T    | 2T    6T    | 2T    6T
OVM³           | 2     2     | 2     0     | 2     0     | 2T    0     | 2T    0     | 2T    0
OV²M²          | 4     0     | 2     0     | 2     0     | 2T    0     | 2T    0     | 2T    0
OV³M           | 4     0     | 4     0     | 4     0     | 4T    0     | 4T    0     | 4T    0
O³V³           | 20T   16T   | 20T   16T   | 20T   16T   | 22T   18T   | 22T   18T   | 22T   18T
O³V²M          | 0     0     | 0     0     | 0     0     | 0     0     | 0     2T    | 0     2T
OV⁴            | 2T    0     | 2T    0     | 2T    0     | 2T    0     | 2T    0     | 2T    0
O²M³           | 4     6T+6  | 4     8T+8  | 4     8T+8  | 4T    14T   | 4T    14T   | 4T    14T
O²VM²          | 6     12T+8 | 6     12T+8 | 6     12T+8 | 6T    18T   | 6T    18T   | 6T    18T
O²V²M          | 8     8T+8  | 8     8T+8  | 8     8T+8  | 8T    16T   | 8T    16T   | 8T    16T
O²V³           | 10T   4T    | 10T   4T    | 10T   4T    | 14T   4T    | 14T   4T    | 14T   4T
Reduction      |      1      |    2.46     |    4.24     |    2.51     |   13.75     |   28.86
factor         |             |             |             |             |             |

The right half of Table 1 shows the computational complexity terms for the optimized expres-

sions for the Brueckner CCSD model, where the AO integral transformation must be performed

for every CCSD iteration. For both the separated approach and the integrated approach, each term

is therefore scaled by T. Again, the optimized forms are clearly very different for separated versus

integrated optimization. Relative to the standard CCSD scenario, for the Brueckner CCSD model

the benefit of integrated optimization over separated optimization is significantly higher.

So far, the comparisons of different optimized forms have all been generated by the automated operation minimization algorithm. But how effective is the automatic optimization when compared with manually optimized formulations? In order to answer this question, we generated an optimized version of just the CCSD T2 equations and compared the complexity of the generated terms with a highly optimized closed-shell CCSD T2 form developed by Scuseria et al.9 The optimized

form produced by the automatic minimizer is shown in Figure 5. The computational complexity of the most significant terms is 2O²V⁴ + 20O³V³ + 10O⁴V² operations (counting each floating-point addition or multiplication as a separate operation). The manually optimized implementation from Scuseria et al.9 is ½O²V⁴ + 8O³V³ + 2O⁴V² operations. A close examination of the optimized forms shows that the difference is mainly due to two reasons. First, our compiler-generated expressions exploit antisymmetry but not another form of symmetry (“vertex” symmetry, A^ab_ij = A^ba_ji) that is used in the optimized form from Scuseria et al. The most significant contraction, causing the O(O²V⁴) complexity, is essentially the same contraction in both optimized forms, but is implemented by Scuseria et al. with one fourth the operation count due to maximal exploitation of such symmetry. Secondly, a close examination of the form of the optimized equations in Scuseria et al.9 demonstrates the need for more sophisticated intermediate steps (e.g., one that involves adding and subtracting a term to an intermediate, which significantly enhances the overall possibility of reduction in operation count). We are in the process of incorporating vertex symmetry and enhancing the operation count minimization capability of our compiler using more sophisticated steps.

[Figure 5: CCSD T2 expression optimized separately from AO-to-MO transform]

Implementing Tensor Contractions using Tuned Matrix Multi-

plication

Consider the following tensor contraction expression.

E[i,j,k] = Σ_{a,b,c} A[a,b,c] B[a,i] C[b,j] D[c,k]

where all indices range over N and a, b, and c are contraction indices. The direct way to compute this would require O(N⁶) arithmetic operations. However, as discussed in the first part of the paper, algebraic transformations can be used to reduce the number of operations to O(N⁴):

T1[a,b,k] = Σ_c A[a,b,c] D[c,k]
T2[a,j,k] = Σ_b T1[a,b,k] C[b,j]
E[i,j,k]  = Σ_a T2[a,j,k] B[a,i]

Each of the three contractions for the operation-optimized form is essentially a generalized matrix

multiplication. Since highly tuned library Generalized Matrix Multiplication (GEMM) routines

exist, it is attractive to translate the computation for each 2-tensor contraction node into a call to

GEMM if possible. For the above 3-contraction example, the first contraction can be implemented

directly as a call to GEMM with A viewed as an N²×N rectangular matrix and D as an N×N matrix. The second contraction, however, cannot be directly implemented as a GEMM call because the contraction index b is the middle index of T1. GEMM can only be used directly when the summation indices and non-summation indices in the contraction can be collected into two separate contiguous groups. However, T1 can first be “reshaped” via explicit layout transformation, e.g., T1[a,b,k] → T1r[a,k,b]. GEMM can then be invoked with the first operand T1r viewed as an N²×N array and the second input operand C as an N×N array. The result, which has the index

order [a,k, j], would also have to be reshaped to form T2[a, j,k]. Considering the last contraction,

it might seem that some reshaping would be necessary in order to use GEMM. However, GEMM

allows one or both of its input operands to be transposed. Thus, the contraction can be achieved

by invoking GEMM with B as the first operand in transposed form, and T2[a, j,k] as the second

operand, with shape N×N².
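The three GEMM mappings described above can be sketched with numpy, using `reshape` and `transpose` in place of an index permutation library and `@` in place of a tuned GEMM call (toy size N; all data is random and illustrative):

```python
import numpy as np

N = 8
rng = np.random.default_rng(1)
A = rng.random((N, N, N))
B = rng.random((N, N))
C = rng.random((N, N))
D = rng.random((N, N))

# T1[a,b,k] = sum_c A[a,b,c] D[c,k]: view A as an (N^2 x N) matrix, GEMM with D.
T1 = (A.reshape(N * N, N) @ D).reshape(N, N, N)

# T2[a,j,k] = sum_b T1[a,b,k] C[b,j]: the contracted index b is in the middle,
# so first permute T1[a,b,k] -> T1r[a,k,b], GEMM, then permute the result back.
T1r = np.ascontiguousarray(T1.transpose(0, 2, 1))    # layout transform: [a,k,b]
T2 = (T1r.reshape(N * N, N) @ C).reshape(N, N, N)    # index order [a,k,j]
T2 = T2.transpose(0, 2, 1)                           # reshaped back to [a,j,k]

# E[i,j,k] = sum_a T2[a,j,k] B[a,i]: GEMM with B transposed, T2 viewed as N x N^2.
E = (B.T @ T2.reshape(N, N * N)).reshape(N, N, N)

# Check against a direct einsum evaluation of the original 4-tensor contraction.
E_ref = np.einsum('abc,ai,bj,ck->ijk', A, B, C, D)
assert np.allclose(E, E_ref)
```

Only the middle contraction needs explicit layout transformations; the first and last map onto GEMM through reshaping alone or the transpose flag.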

In general, a sequence of multi-dimensional tensor contractions can be implemented using a

sequence of GEMM calls, possibly with some additional array reordering operations interspersed.

Since the multiplication of two N×N matrices requires O(N³) operations and reordering of a

P×Q matrix only requires O(PQ) data moves, it might seem that the overhead of the layout

transformation steps would be negligible relative to the time for matrix multiplication. However,

as shown in the next section, a simple nested loop structure to perform the layout transposition

can result in significant overhead. The remaining sections of this paper address the development

of an efficient index permutation library for tensors. The problem of efficient transposition of 2D

matrices is first addressed, and is then used as the core function in implementing generalized tensor

layout transformation.

Index Permutation Library for Tensors

In this section, we first present an overview of the problem of efficient 2D matrix transposition (discussed in detail elsewhere10) and then discuss its use in optimizing arbitrary index permutations

of multi-dimensional arrays. Consider the simple doubly nested loop in Figure 6. While transposition might seem a straightforward operation, existing compilers are unable to generate

efficient code. For example, the program in Figure 6 was compiled using the Intel C compiler with

the “-O3” option. On an Intel Pentium 4 with a 533MHz front side bus, it achieved an average data

transfer bandwidth of 90.3MB/s, for single-precision arrays, with each dimension ranging from

3800 to 4200. This is only 4.4% of the sustained copy bandwidth achieved on the machine by the

STREAM memory benchmark.13
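The poor bandwidth comes largely from cache behavior: the naive loop streams one array with unit stride but the other with stride N, so one side misses in cache on nearly every access. The standard remedy is blocked (tiled) transposition; a Python sketch of the loop structure (a real implementation would live in C with SIMD intrinsics; this only shows the tiling idea):

```python
def transpose_blocked(src, dst, n, tile=64):
    """Transpose the n x n list-of-lists src into dst, tile by tile.

    Processing tile x tile sub-blocks keeps both the reads and the writes
    within a cache-resident footprint, instead of striding through one of
    the two arrays a full row length at a time.
    """
    for ii in range(0, n, tile):
        for jj in range(0, n, tile):
            for i in range(ii, min(ii + tile, n)):
                for j in range(jj, min(jj + tile, n)):
                    dst[j][i] = src[i][j]

n = 100
src = [[i * n + j for j in range(n)] for i in range(n)]
dst = [[0] * n for _ in range(n)]
transpose_blocked(src, dst, n, tile=16)
assert all(dst[j][i] == src[i][j] for i in range(n) for j in range(n))
```

The tile size is a tuning parameter chosen so that a source tile and a destination tile together fit in cache.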

On modern architectures, the cache hierarchy, the memory subsystem, and SIMD vector in-
