A Performance Optimization Framework for Compilation of Tensor Contraction
Expressions into Parallel Programs


Gerald Baumgartner
Dept. of Computer and Information Science
Ohio State University
gb@cis.ohio-state.edu
David E. Bernholdt
Oak Ridge National Laboratory
bernholdte@ornl.gov
Daniel Cociorva
Dept. of Computer and Information Science
Ohio State University
cociorva@cis.ohio-state.edu
Robert Harrison
Pacific Northwest National Laboratory
Robert.Harrison@pnl.gov
Chi-Chung Lam
Dept. of Computer and Information Science
Ohio State University
clam@cis.ohio-state.edu
Marcel Nooijen
Department of Chemistry
Princeton University
Nooijen@Princeton.edu
J. Ramanujam
Dept. of Electrical and Computer Engineering
Louisiana State University
jxr@ee.lsu.edu
P. Sadayappan
Dept. of Computer and Information Science
Ohio State University
saday@cis.ohio-state.edu
Abstract

This paper discusses a program synthesis system to facilitate
the generation of high-performance parallel programs
for a class of computations encountered in quantum chemistry
and physics. These computations are expressible as a
set of tensor contractions and arise in electronic structure
modeling. An overview is provided of the synthesis system
under development, that will take as input a high-level specification
of the computation and generate high-performance
parallel code for a number of target architectures. Several
components of the synthesis system are described, focusing
on compile-time optimization issues that they address.
1. Introduction

The development of high-performance parallel programs
for scientific applications is often very complicated. The effect
of the algorithm choice on memory access costs and


Supported in part by the National Science Foundation through awards
CHE-0121676 and EIA-9986052, and the U. S. Department of Energy
through award DE-AC05-00OR22725.

communication overhead are often very complex. Currently,
available tools for software development and performance
optimization do not provide adequate support to the
developers of high-performance scientific applications. Often,
the time to develop an efficient parallel program for
a computational model is the limiting factor in the rate of
progress of the science. Therefore, approaches to automated
synthesis of high-performance parallel programs from highlevel
specifications are very attractive. In general, automatic
program synthesis is an intractable problem. However, for
specific domains, this is feasible, as is being demonstrated
by the SPIRAL project [20] for the domain of signal processing.

In this paper, we provide an overview of a project that
is developing program synthesis tools to facilitate the development
of high-performance parallel programs for a
class of scientific computations encountered in chemistry
and physics. We focus on electronic structure calculations
where many computationally intensive components are expressible
as a set of tensor contractions.
These computational structures arise in some computational
physics codes modeling electronic properties of
semiconductors and metals, and in computational chemistry
codes such as ACES II, GAMESS, Gaussian, NWChem[7],

PSI, and MOLPRO. In particular, they comprise the bulk
of the computation with the coupled cluster approach to the
accurate description of the electronic structure of atoms and
molecules [15, 16]. Computational approaches to modeling
the structure and interactions of molecules, the electronic
and optical properties of molecules, the heats and rates of
chemical reactions, etc., are crucial to the understanding
of chemical processes in real-world systems. Examples
of applications include combustion and atmospheric chemistry,
chemical vapor deposition, protein structure and enzymatic
chemistry, and industrial chemical processing. Computational
chemistry and materials science account for significant
fractions of supercomputer usage at national centers
(for example, approximately 85% of total usage at Pacific
Northwest National Laboratories, 30% at NERSC, and
about 50% of SDSC's IBM SP/128 system).

2. The computational context

In the class of computations considered, the final result
to be computed can be expressed in terms of tensor contractions,
essentially a collection of multi-dimensional summations
of the product of several input arrays. Due to commutativity,
associativity, and distributivity, there are many
different ways to compute the final result, and they could
differ widely in the number of floating point operations required.
Consider the following expression:

S abij =

X

cdefkl

A acik B befl  Cd fjk D cdel
If this expression is directly translated to code (with ten
nested loops, for indices a l), the total number of arithmetic
operations required will be 4  N

10

if the range of
each index a l is N . Instead, the same expression can be
rewritten by use of associative and distributive laws:
S abij =

X

ck

 
X

df

 
X

el

B befl D cdel

!

Cd fjk

!

A acik
This corresponds to the formula sequence shown in
Fig. 1(a) and can be directly translated into code as shown
in Fig. 1(b). This form only requires 6  N

6

operations.
However, additional space is required to store temporary
arrays T1 and T2. Often, the space requirements for the
temporary arrays poses a serious problem. For this example,
abstracted from a quantum chemistry model, the array
extents along indices a d are the largest, while the extents
along indices i l are the smallest. Therefore, the size of
temporary array T1 would dominate the total memory requirement.

The operation minimization problem encountered here
is a generalization of the well known matrix-chain multiplication
problem, where a linear chain of matrices to be
multiplied is given, e.g., ABCD, and the optimal order of
pair-wise multiplications is sought, i.e., ((AB)C)D versus
(AB)(CD), etc. In contrast, for computations expressed as
sets of matrix contractions, there is additional freedom in
choosing the pair-wise products. For the above example,
instead of forcing a single chain order, e.g., ABCD, other
orders are possible, such as the BDCA order shown for the
operation-reduced form above.
We have shown that the problem of determining the operator
tree with minimal operation count is NP-complete,
and have developed a pruning search procedure [13, 14]
that is very efficient in practice. For the above example, although
the latter form is far more economical in terms of the
number of operations, its implementation will require the
use of temporary intermediate arrays to hold the partial results
of the parenthesized array subexpressions. Sometimes,
the sizes of intermediate arrays needed for the "operationminimal
" form are too large to even fit on disk.
A systematic way to explore ways of reducing the memory
requirement for the computation is to view it in terms of
potential loop fusions. Loop fusion merges loop nests with
common outer loops into larger imperfectly nested loops.
When one loop nest produces an intermediate array which
is consumed by another loop nest, fusing the two loop nests
allows the dimension corresponding to the fused loop to be
eliminated in the array. This results in a smaller intermediate
array and thus reduces the memory requirements. For
the example considered, the application of fusion is illustrated
in Fig. 1(c). This way, T1 can be reduced to a scalar
and T2 to a 2-dimensional array, without changing the number
of operations.
For a computation comprising a number of nested loops,
there are a number of fusion choices, that are not all mutually
compatible. This is because different fusion choices
could require different loops to be made the outermost. In
prior work, we addressed the problem of finding the choice
of fusions for a given operator tree that minimized the total
space required for all arrays [10, 11, 12].

3. An example

One of the most computationally intensive components
of many quantum chemistry packages is the CCSD(T)
scheme. It is a coupled cluster approximation that includes
single and double excitations from the Hartree-Fock wavefunction
plus a perturbative estimate for the connected triple
excitations. For molecules well described by a HartreeFock
wave function, this method predicts bond energies,
ionization potentials, and electron affinities to an accuracy
of 0:5 kcal/mol, bond lengths accurate to 0:0005  A, and
vibrational frequencies accurate to 5cm

1

: This level of
accuracy is adequate to answer many of the questions that
arise in studies of chemical systems.
As a motivating example for the problem addressed, we
discuss a component of the CCSD(T) calculation. The following
representative equation arises in the Laplace factor-

T1 bcdf =

X

el

B befl D cdel
T2 bcjk =

X

df

T1 bcdf C dfjk
S abij =

X

ck

T2 bcjk  A acik
(a) Formula sequence

T1=0; T2=0; S=0;
for b, c, d, e, f, l



T1 bcdf += B befl D cdel

for b, c, d, f, j, k



T2 bcjk += T1 bcdf C dfjk

for a, b, c, i, j, k



S abij += T2 bcjk A acik

(b) Direct implementation (unfused code)

S = 0;
for b, c

2
6
6
6
6
6
6
6
6
4

T1f = 0; T2f = 0;
for d, f

2
6
4

for e, l



T1f += B befl D cdel

for j, k



T2f jk += T1f C dfjk

for a, i, j, k



S abij += T2f jk A acik

(c) Memory-reduced implementation (fused)

Figure 1. Example illustrating use of loop fusion for memory reduction.
ized expression for linear triples perturbation correction:

A3A = X ce;af Y ae;cf +X ae;c  f Y ce;a  f +X ae;cf Y  ce;af

+ X  ae;c



f Y ce;a



f +X  ae;cf Y  ce;af +X  ae;c



f Y  ce;a



f ;
where X and Y are of the form X ae;cf = t

ae
ij t

cf
ij and

Y ce;af = hcb k ekihab k fki, respectively.
Integrals with two vertical bars have been antisymmetrized
and may be expressed as: (hpq k rsi = hpq j

rsi hpq j sri), where integrals with one vertical bar
are of the form h j !i =

R R

dr

3

ds

3

  (r)  (s)jr

sj

1

! (r)  (s) and are quite expensive to compute (requiring
on the order of 1000 arithmetic operations). Electrons
may have either up or down (or alpha/beta) spin. Down spin
is denoted here with an over-bar. The indices i; j; k; l; m;n

refer to occupied orbitals, of number O between 30 and 100.
The indices a; b; c; d; e; f refer to unoccupied orbitals of
number V between 1000 and 3000. The integrals are written
in the molecular orbital basis, but must be computed in the
underlying atom-centered Gaussian basis, and transformed
to the molecular orbital basis. We omit these details in our
discussion here.
A3A is one of many contributions to the energy, and
among the most expensive, scaling as O(OV

5

): Here, we
assume that we have already computed the amplitudes t

ae
ij ,
and they must be read as necessary, and contracted to form
a block of X: The integrals hcb k eki must be recomputed
as necessary, contracted to form a block of Y corresponding
to X; and the two contracted to form the scalar contribution
to the energy.
Fig. 2 shows pseudo-code for the computation of one of
the energy components E for A3A. Temporary arrays T1

and T2 are used to store the integrals of form hab k eki,

where the functions f1 and f2 represent the integral calculations.
The intermediate quantities X aecf are computed by
contracting over (i.e., summing over products of) input array
 T , while the intermediate quantities Y ceaf are obtained
by contracting over T1 and T2. The final result is a single
scalar quantity E, that is obtained by adding together the

O(OV

3

) pair-wise products X aecf Y ceaf .
The cost of computing each integral f1, f2 is represented
by C i , and in practice is of the order of hundreds
or a few thousand arithmetic operations. The pseudocode
form shown in Fig. 2 is computationally very efficient
in minimizing the number of expensive integral function
evaluations f1 and f2, and maximizing the reuse of
the stored integrals in T1 and T2 (each element of T1 and

T2 is used O(V

2

) times). However, it is impractical due
to the huge memory requirement. With O = 100 and

V = 5000, the size of T1, T2 is O(10

14

) bytes and the
size of X , Y is O(10

15

) bytes. By fusing together pairs
of producer-consumer loops in the computation, reductions
in the needed array sizes may be sought, since the fusion
of a loop with common index in the pair of loops allows
the elimination of that dimension of the intermediate array.
It can be seen that the loop that produces X (with indices

a; e; c; f ), the loop that produces Y (with indices c; e; a; f )
and the loop that consumes X and Y to produce E (with
indices c; e; a; f ) can all be fully fused together, permitting
the elimination of all explicit indices in X and Y to reduce
them to scalars. However, the loops producing T1 (with indices
 c; e; b; k) and T2 (with indices a; f; b; k) cannot also
be directly fused with the other three loops because their
indices do not match.
Fig. 3 shows how a reduction of space for T1 and T2 can
be achieved by introducing redundant loops around their
producer loops --- add loops with the missing indices a; f

for T1 and c; e for T2. Now all five loops have common
indices a; e; c; f that can be fused, permitting elimination
of those indices from all temporaries. Further, by fusing the
producer loops for T1 and T2 with their consumer loop,
which produces Y , the b; k indices can also be eliminated
from T1 and T2. A dramatic reduction of memory space
is achieved, reducing all temporaries T1; T2; X and Y to
scalars, but the space savings come at the price of a significant
increase in computation. No reuse is achieved of the
quantities derived from the expensive integral calculations

f1 and f2. Since C i is of the order of 1000 in practice, the
integral calculations now dominate the total compute time,
increasing the operation count by three orders of magnitude
over the unfused form in Fig. 2.
A desirable solution would be somewhere in between

for a, e, c, f



for i, j



X aecf += T ijae T ijcf

for c, e, b, k



T1 cebk = f 1 (c, e, b, k)
for a, f, b, k



T2 afbk = f 2 (a, f, b, k)
for c, e, a, f



for b, k



Y ceaf += T1 cebk T2 afbk

for c, e, a, f



E += X aecf Y ceaf

array space time

X V

4

V

4

O

2

T1 V

3

O C i V

3

O

T2 V

3

O C i V

3

O

Y V

4

V

5

O

E 1 V

4
Figure 2. Unfused operation-minimal form.
for a, e, c, f



for i, j



X aecf += T ijae T ijcf

for a, f



for c, e, b, k



T1 cebk = f1(c, e, b, k)
for c, e



for a, f, b, k



T2 afbk = f2(a, f, b, k)
for c, e, a, f



for b, k



Y ceaf += T1 cebk T2 afbk

for c, e, a, f



E += X aecf Y ceaf

)

for a, e, c, f

2
6
6
6
6
6
4

for i, j



X += T ijae T ijcf

for b, k

"

T1 = f1(c, e, b, k)
T2 = f2(a, f, b, k)
Y += T1 T2
E += X Y

array space time

X 1 V

4

O

2

T1 1 C i V

5

O

T2 1 C i V

5

O

Y 1 V

5

O

E 1 V

4
Figure 3. Use of redundant computation to allow full fusion.
for a

t

, e

t

, c

t

, f

t

2
6
6
6
6
6
6
6
6
6
6
6
6
6
6
6
4

for a, e, c, f



for i, j



X aecf += T ijae T ijcf

for b, k

2
6
6
6
6
4

for c, e



T1ce = f 1 (c, e, b, k)
for a, f



T2 af = f 2 (a, f, b, k)
for c, e, a, f



Y ceaf += T1ce T2 af

for c, e, a, f



E += X aecf Y ceaf

array space time

X B

4

V

4

O

2

T1 B

2

C i (V=B)

2

V

3

O

T2 B

2

C i (V=B)

2

V

3

O

Y B

4

V

5

O

E 1 V

4
Figure 4. Use of tiling and partial fusion to reduce recomputation cost.

the unfused structure of Fig. 2 (with maximal memory requirement
and maximal reuse) and the fully fused structure
of Fig. 3 (with minimal memory requirement and minimal
reuse). This is shown in Fig. 4, where tiling and partial
fusion of the loops is employed. The loops with indices

a; e; c; f are tiled by splitting each of those indices into a
pair of indices. The indices with a superscript t represent
the tiling loops and the unsuperscripted indices now stand
for intra-tile loops with a range of B, the block size used
for tiling. For each tile (a

t

; e

t

; c

t

; f

t

), blocks of T1 and T2

of size B

2

are computed and used to form B

4

product contributions
to the components of Y , which are stored in an
array of size B

4

.
As the tile size B is increased, the cost of function computation
for f1; f2 decreases by a factor of B

2

, due to the
reuse enabled. However, the size of the needed temporary
array for Y increases as B

4

(the space needed for X can
be reduced back to a scalar by fusing its producer loop with
the loop producing E, but Y 's space requirement cannot be
decreased). When B

4

becomes larger than the size of physical
memory, expensive paging in and out of disk will be required
for Y . Further, there is diminishing returns on reuse
of T1 and T2 after B

2

becomes comparable to C i , since the
loop producing Y now becomes the dominant one. So we
can expect that as B is increased, performance will improve
and then level off and then deteriorate. The optimum value
of B will clearly depend on the cost of access at the various
levels of the memory hierarchy.
The computation considered here is just one component
of the A3A term, which in turn is only one of very
many terms that must be computed. Although developers
of quantum chemistry codes naturally recognize and perform
some of these optimizations, a collective analysis of
all these computations to determine their optimal implementation
is beyond the scope of manual effort. Further,
the time required to develop codes to implement such computational
models is quite large. In the next section, we
provide an overview of a transformation system that we are
developing to aid quantum chemists in rapidly developing
high-performance parallel codes for computations that they
specify as a set of high-level tensor contractions.

4. Overview of the synthesis system

Fig. 5 shows the components of the system being developed.
A brief description of the components follows:

High-level language: The input to the synthesis system
is a sequence of tensor contraction expressions (essentially
sum-of-products array expressions) together with declarations
of index ranges and symmetry and sparsity of matrices.
This high-level notation provides essential information
to the optimization components that would be difficult or
impossible to extract out of low-level code.

Algebraic transformations: It takes input from the user
as tensor expressions and synthesizes an output computa-

Performance
Model
Performance
Model
Parallel Fortran/C
(OpenMP/MPI/Global Arrays)

Performance
Model
Trade-Offs
Space-Time
Transformations
Algebraic
No solution that
fits in available
disk space
No solution that
fits in available
disk space System
Memory
Specification
- Operation-Minimal Sequence of Loops
Solution fits in
available disk space
Data Distribution
and Partitioning

Performance
Model

Solution fits in
avail. disk space.

Tensor Expressions
Memory minimization
Data Locality Optimization
Figure 5. The Synthesis System
tion sequence. The Algebraic Transformations module uses
the properties of commutativity and associativity of addition
and multiplication and the distributivity of multiplication
over addition. It searches for all possible ways of applying
these properties to an input sum-of-products expression,
and determines a combination that results in an equivalent
form of the computation with minimal operation cost.

Memory minimization: The operation-minimal computation
sequence synthesized by the Algebraic Transformation
module might require an excessive amount of memory due
to the large temporary intermediate arrays involved. The
Memory Minimization module attempts to perform loop fusion
transformations to reduce the memory requirements.
This is done without any change to the number of arithmetic
operations.

Space-time transformation: If the Memory Minimization
module is unable to reduce memory requirements of the
computation sequence below the available disk capacity on
the system, the computation is infeasible unless a spacetime
trade-off is performed. If no satisfactory transformation
is found, feedback is provided to the Memory Minimization
module, causing it to seek a different solution.
If the Space-Time Transformation module is successful in
bringing down the memory requirement below the disk capacity,
the Data Locality Optimization module is invoked.

Data locality optimization: If the space requirement ex-

ceeds physical memory capacity, portions of the arrays must
be moved between disk and main memory as needed, in a
way that maximizes reuse of elements in memory. The same
considerations are involved in minimizing cache misses ---
blocks of data are moved between physical memory and the
space available in the cache.

Data distribution and partitioning: The final step is to
determine how best to partition the arrays among the processors
of a parallel system. We assume a data-parallel
model, where each operation in the operation sequence is
distributed across the entire parallel machine. The arrays
are to be disjointly partitioned between the physical memories
of the processors. This model allows us to decouple (or
loosely couple) the parallelization considerations from the
operation minimization and memory considerations.
In the next sections, we provide some details about the
optimizations implemented in some of these modules. For
details of the operation minimization algorithm see [13, 14].

5. Memory minimization and space-time
trade-offs

As discussed in Section 2, the operation minimization
procedure results in the creation of intermediate temporary
arrays. For typical computations in computational chemistry,
the space required for storing these temporary arrays
can be several tera bytes, which makes the computation impractical.
As shown in in Fig. 1(c), the problem with memory
requirements of large intermediate arrays can be mitigated
through loop fusion. Loop fusion merges loop nests
with common outer loops into larger imperfectly nested
loops. When one loop nest produces an intermediate array
that is consumed by another loop nest, fusing the two loop
nests allows the dimension corresponding to the fused loop
to be eliminated in the array. This results in a smaller intermediate
array and thus lowers the memory requirement.
The use of loop fusion can be seen to result in significant
potential reduction to the total memory requirement. For a
computation comprising of a number of nested loops, there
will generally be a number of fusion choices, that are not
all mutually compatible. This is because different fusion
choices could require different loops to be made the outermost.
A data structure that we call a fusion graph can be
used to facilitate enumeration of all possible compatible fusion
configurations for a given computation tree.
Fig. 6 shows the fusion graph for the unfused form of
the computation from Fig. 2. Corresponding to each node
in a computation tree, the fusion graph has a set of vertices
corresponding to the loop indices of the node of the computation
tree. In Fig. 6, we do not show the operator tree
corresponding to the computation, but directly illustrate the
fusion graph. The potential for fusion of a common loop
among a producer-consumer pair of loop nests is indicated
in the fusion graph through a dashed potential fusion edge
connecting the corresponding vertices. Leaf nodes in the

T2 T1 T
c f i j a e i j
T
k
f1
c e b
f2
a f b k
X +ij
b k
Y +bk
f a e c
E +ceaf
Figure 6. Fusion graph for unfused operationminimal
form of loop in Figure 2
fusion graph correspond to input arrays or primitive function
evaluations and do not represent a loop nest. The edges
from the leaves to their parents are shown as dotted edges
and do not affect the fusion possibilities. If a pair of loop
nests is fused using one or more common loops, it is captured
in the fusion graph by changing the dashed potentialfusion
edges to continuous fusion edges. If more than two
loop nests are fused together, a chain of fusion edges results,
called a fusion chain. The scope of a fusion chain is the set
of nodes it spans. The fusion graph allows us to characterize
the condition for feasibility of a particular combination
of fusions: the scope of any two fusion chains in a fusion
graph must either be disjoint or a subset/superset of each
other. Scopes of fusion chains do not partially overlap because
loops do not (i.e., loops must be either separate or
nested).
The fusion graph in Fig. 6 can be used to determine
the fusion possibilities. On the left side of the graph, the
edges corresponding to (a; e; c; f) can all be made fusion
edges, suggesting that complete fusion is possible for the
loop nests producing and consuming X , reducing it to a
scalar. Similarly, on the right side of the graph, the edges
corresponding to (c; e; a; f) can also be made fusion edges,
reducing Y to a scalar. Further, by creating fusion edges for
indices (c; e), the producer loop for T1 can be fully fused
with the Y loop that consumes it. However, now the producer
loop for T2 cannot be fused since the addition of any
fusion edge (say for index a) will result in partially overlapping
fusion chains for a and (c; e). The fully fused version
from Fig. 3 can be represented graphically as shown in
Fig. 7(a). Additional vertices have been added for indices

(c; e) and (a; f), respectively, at the nodes corresponding
to the producer loops for T1 and T2. Now, complete fusion
chains can be created without any partial overlap in the
scopes of the fusion chains. From the figure, it can be seen
that in fact the redundant computation need only be added
to one of T1 or T2 to achieve complete fusion. For example,
removing the additional vertices for (a; f) at T2 does
not violate the non-partial-overlap condition for fusion.

T
c i j a e i j
T T1 T2
c e b k
f1
a f b k
f2
X +ij
b k
Y +bk
f a e c
E +ceaf
T1 T2 T
a e

t

j i e a j i f c f
T
c

t
e c
f1
k b c e
f2
k b f a f a
X +ij
k b
Y +bk
f a e c f a e c
E +ceaf
(a) Fully fused computation from Fig. 3. (b) Partially fused computation from Fig. 4.

Figure 7. Fusion graphs showing redundant compution and tiling.
The fusion graph was used to develop an algorithm
[12, 10] to determine the combination of fusions that minimizes
the total storage required for all the temporary intermediate
arrays. A bottom-up dynamic programming approach
was used that maintains a set of pareto-optimal fusion
configurations at each node, merging solutions from
children nodes to generate the optimal configurations at a
parent. The two metrics used are the total memory required
under the subtree rooted at the node, and the constraints imposed
by a configuration on fusion further up the tree. A
configuration is inferior to another if it is more or equally
constraining with respect to further fusions than the other,
and uses no less memory. At the root of the tree, the configuration
with the lowest memory requirement is chosen.
Although the complexity of the algorithm is exponential
in the number of index variables and the number of solutions
could in theory grow exponentially with the size of
the expression tree, the number of index variables in practical
applications is small enough and there is indication that
the pruning is effective in keeping the size of the solution
set at each node small.
If after memory minimization the storage requirements
still exceed the disk capacity, we need to recompute some
(parts of) temporary arrays in order to further reduce the
space requirements. The space-time trade-off algorithm we
have developed [4] employs a combination of fusion and
tiling to achieve a good balance between recomputation and
memory usage. The first step of the space-time trade-off
algorithm uses a dynamic programming approach similar
to the memory minimization algorithm that maintains a set
of pareto-optimal fusion/recomputation configurations, in
which the recomputation cost is used as a third metric. Solutions
exceeding the memory limit are pruned out. The
result of the search is a set of loop structures with different
combinations of space requirements and recomputation
cost.
In the second step of the algorithm, recomputation indices
are split into tiling and intra-tile loop pairs. By making
intra-tile loops the inner-most loops, any recomputation
only needs to be performed once per iteration of the tiling
loop in exchange for increasing the storage requirements for
temporaries in which the dimension corresponding to the
tiled loop had been eliminated. For each solution from the
first step of the algorithm, we then search for tile sizes that
minimize the recomputation cost, and take the solution that
results in the lowest recomputation cost.

6. Data locality optimization

Once a solution is found that fits onto disk, we optimize
the data locality to reduce memory and disk access
times. We developed algorithms [2, 3] that, given a
memory-reduced (fused) version of the code, find the appropriate
blocking of the loops in order to maximize data
reuse. These algorithms can be applied at different levels
of the memory hierarchy, for example, to minimize data
transfer between main memory and disk (disk access minimization)
, or to minimize data transfer between main memory
and the cache (cache optimization). In this section, we
briefly describe the main points of our algorithm [3], focusing
mostly on the cache management problem. For the disk
access minimization problem, the same approach is used,
replacing the cache size by the physical memory size.
We introduce a memory access cost model (Cost), an estimate
on the number of cache misses, as a function of tile
sizes and loop bounds. In a bottom-up traversal of the abstract
syntax tree, we count for each loop the number (Ac-

cesses) of distinct array elements accessed in its scope. If
this number is smaller than the number of elements that fit
into the cache, then Cost = Accesses. Otherwise, it means
that the elements in the cache are not reused from one loop
iteration to the next, and the cost is obtained by multiplying
the loop range by the cost of its inner loop(s).
Using this cost model, we can compute the total memory
access cost for given tile sizes. The procedure is repeated
for different sets of tile sizes, and new costs are computed.
In the end the lowest possible cost is chosen, thus determining
the optimal tile sizes. We define our tile size search
space in the following way: if N i is a loop range, we use
a tile size starting from T i = 1 (no tiling), and succesively
increasing T i by doubling it until it reaches N i . This ap-

proach ensures a slow (logarithmic) growth of the search
space with increasing array dimension for large N i . If N i is
small enough, an exhaustive search is performed instead.

7. Data partitioning and communication minimization


Given an operation-optimal operator tree, there are many
choices for the parallel implementation of the operations.
Since each node in the tree typically involves operations on
large arrays, there is no need to operate on several tree nodes
at the same time, i.e. intra-node parallelism is sufficient.
We use a logical view of the processors as a multidimensional
grid, where each array can be distributed or
replicated along one or more of the processor dimensions.
Let p d be the number of processors on the d-th dimension of
an n-dimensional processor array, so that the total number
of processors is p 1 p 2  : : : pn . We use an n-tuple to denote
the partitioning or distribution of the elements of a data
array on an n-dimensional processor array. The d-th position
in an n-tuple , denoted [d], corresponds to the d-th

processor dimension. Each position may be one of the following:
an index variable distributed along that processor
dimension, a `*' denoting replication of data along that processor
dimension, or a `1' denoting that only the first processor
along that processor dimension is assigned any data.
If an index variable appears as an array subscript but not in
the n-tuple, then the corresponding dimension of the array
is not distributed. Conversely, if an index variable appears
in the n-tuple but not in the array, then the data is replicated
along the corresponding processor dimension, which is the
same as replacing that index variable with a `*'.
As an example, consider the 3-dimensional array

B[j; k; t], and suppose 64 processors form a 2  4  8 array.
The 3-tuple hk; ; 1i specifies that the second dimension
of B is distributed along the first processor dimensions,
the first and third dimensions of B are not distributed, and
that data are replicated along the second processor dimension
and are assigned only to processors whose third processor
dimension equals 1. Thus, a processor whose id
is P z1 ;z 2 ;z 3 will be assigned a portion of B specified by

B[1 : N j ; myrange(z 1 ; N k ; p 1 ); 1 : N t )] if z 3 = 1 and no
part of B otherwise, where myrange(z; N; p) is the range

(z 1) N=p+ 1 to z N=p.

A child array is redistributed before the evaluation of
its parent if their distributions do not match. For instance,
suppose the arrays T 1 [j; t] and T 2 [j; t] in the computation

T 3 [j; t] = T 1 [j; t]  T 2 [j; t] have distributions h1; t; ji and

hj; ; 1i, respectively. If we want T 3 to have distribution

hj; t; 1i, T 1 would have to be redistributed from h1; t; ji to

hj; t; 1i because the two distributions do not match. But for

T 2 to go from hj; ; 1i to hj; t; 1i, each processor just needs
to give up part of the t-dimension of the array and no interprocessor
data movement is required.
The goal is to determine the combination of the distribution
 n-tuples for the data arrays and the loop nests that
minimizes the computational and communication costs. A
dynamic programming algorithm that determines the distribution
of dense arrays to minimize the computational and
communication costs is given below.
1. Let T be an expression tree, and Cost(v; ) the minimal
cost for the subtree rooted at v with distribution
 . Initialize Cost(v; ) for each leaf node v in

T and each distribution  as follows: Cost(v; ) =
0 if NoReplicate(), where NoReplicate() is a
predicate meaning  involves no replication, and

Cost(v; ) = min NoReplicate() fMoveCost(v; ; )g

otherwise.
2. Perform a bottom-up traversal of T . For each internal
node u and each distribution , calculate Cost(u; ) as
follows:

Case (a): u is a multiplication node with two children

v and v

0

. We need both v and v

0

to have the same distribution,
say 
, before u can be formed. After the multiplication,
the product could be redistributed if necessary.
Thus,

Cost(u; ) = min



fCost(v; 
) + Cost(v

0

; 
)

+ CalcCost(u; 
) + MoveCost(u; 
; )g

Case (b): u is a summation node over index i and
with a child v. v may have any distribution 
. If i 2 
,

each processor first forms partial sums of u and then
we either combine the partial sums on one processor
along the i dimension or replicate them on all processors
along that processor dimension. Afterwards, the
sum could be redistributed if necessary. Thus,

Cost(u; ) = min



fCost(v; 
) + min

i=1;2

(CalcCost i (u; 
)

+ MoveCost i (u; 
; ))g

In either case, save into Dist(u; ) the distribution 

that minimizes Cost(u; ).
3. When step 2 finishes for all nodes and all indices,
the minimal total cost for the entire tree is

min fCost(T :root; )g. The distribution  that minimizes
the total cost is the optimal distribution for

T :root. The optimal distributions for other nodes can
be obtained by tracing back Dist(u; ) in a top-down
manner, starting from Dist(T :root; ).

The time complexity of this algorithm is O(q

2

jT j), where

jT j is the number of internal nodes in the expression tree
and q = O(m

n

) is the number of different possible distribution
 n-tuples, and m is the number of index variables.
The storage requirement for Cost(u; ) and Dist(u; ) is

O(qjT j).

8. Related work and discussion

The approach undertaken in this project bears similarities
to some projects in other domains, such as the SPIRAL
project which is aimed at the design of a system to generate
efficient libraries for digital signal processing algorithms
[20, 8, 23]. SPIRAL generates efficient implementations of
algorithms expressed in a domain-specific language called
SPL by a systematic search through the space of possible
implementations.
Other efforts in automatically generating efficient implementations
of programs include FFTW [6], the telescoping
languages project [9], ATLAS [22] for deriving efficient
implementation of BLAS routines, the PHIPAC [1]
project, and the TUNE project [21]. In addition, motivated
by the difficulty of detecting and optimizing matrix operations
hidden in array subscript expressions within loop
nests, several projects have worked on efficient code generation
from high-level languages such as MATLAB and
Maple [5, 17, 18, 19].
While our effort shares some common goals with several
of the projects mentioned above, there are also significant
differences. Some of the optimizations we consider, such
as the algebraic optimizations, memory minimization, and
space-time trade-offs, do not appear to have been previously
explored, to the best of our knowledge. While optimization
of performance is a significant goal, more important in our
context is the potential for dramatically reducing the developmental
effort required of a quantum chemist to develop
a new ab initio computational model. Currently, the manual
development and testing of a reasonably efficient parallel
code for a computational model such as the coupled
cluster model typically takes several weeks to many months
for a computational chemist. We aim to reduce the time to
prototype a new model to under a day, through use of the
synthesis system being developed.
References

[1] J. Bilmes, K. Asanovic, C. Chin, and J. Demmel. Optimizing matrix
multiply using PHiPAC. In Proc. ACM Intl. Conf. on Supercomputing,
 pages 340--347, 1997.
[2] D. Cociorva, J. Wilkins, C. Lam, G. Baumgartner, P. Sadayappan,
J. Ramanujam. Loop Optimizations for a Class of MemoryConstrained
Computations. In Proc. 15th ACM Intl. Conf. on Supercomputing,
2001.
[3] D. Cociorva, J. Wilkins, G. Baumgartner, P. Sadayappan, J. Ramanujam,
M. Nooijen, D. Bernholdt, and R. Harrison. Towards Automatic
Synthesis of High-Performance Codes for Electronic Structure
Calculations: Data Locality Optimization. Proc. of the Intl.
Conf. on High Performance Computing, 2001.
[4] D. Cociorva, J. Wilkins, G. Baumgartner, P. Sadayappan, J. Ramanujam,
M. Nooijen, D. Bernholdt, and R. Harrison. Space-Time
Trade-Off Optimization for a Class of Electronic Structure Calculations.
Technical Report, Dept. of Comp. and Info. Sci., The Ohio
State Univ, 2001.
[5] L. De Rose and D. Padua. A MATLAB to Fortran 90 translator and
its effectiveness. In Proc. 10th ACM Intl. Conf. on Supercomputing,

1996.
[6] M. Frigo and S. Johnson. FFTW: An adaptive software architecture
for the FFT. In Proc. ICASSP 98, Volume 3, pages 1381--1384,
1998, http://www.fftw.org.

[7] High Performance Computational Chemistry Group. NWChem, A
computational chemistry package for parallel computers, Version
3.3, 1999. Pacific Northwest National Laboratory, Richland, WA
99352.
[8] J. Johnson, R. Johnson, D. Padua, and J. Xiong. Searching for the
best FFT formulas with the SPL compiler. In Languages and Compilers
for High-Performance Computing, Springer-Verlag, 2001.
[9] K. Kennedy, B. Broo, K. Cooper, J. Dongarra, R. Fowler, D. Gannon,
L. Johnsson, J. Mellor-Crummey, and L. Torczon. Telescoping
Languages: A Strategy for Automatic Generation of Scientific
Problem-Solving Systems from Annotated Libraries. J. Parallel and
Distributed Computing, 2001.
[10] C.-C Lam. Performance Optimization of a Class of Loops Implementing
Multi-Dimensional Integrals. PhD thesis, The Ohio State
University, 1999.
[11] C.-C Lam, D. Cociorva, G. Baumgartner, and P. Sadayappan.
Memory-optimal evaluation of expression trees involving large objects.
In Intl. Conf. on High Performance Computing, 1999.
[12] C.-C Lam, D. Cociorva, G. Baumgartner, and P. Sadayappan. Optimization
of memory usage for a class of loops implementing multidimensional
integrals. In Languages and Compilers for Parallel
Computing, 1999.
[13] C.-C Lam, P. Sadayappan, and R. Wenger. On optimizing a class
of multi-dimensional loops with reductions for parallel execution.

Parallel Processing Letters, 7(2):157--168, 1997.
[14] C.-C Lam, P. Sadayappan, and R. Wenger. Optimization of a class
of multi-dimensional integrals on parallel machines. In 8th SIAM
Conf. on Parallel Processing for Scientific Computing,, 1997.
[15] T. J. Lee and G. E. Scuseria. Achieving chemical accuracy with coupled
cluster theory. In S. R. Langhoff (Ed.), Quantum Mechanical
Electronic Structure Calculations with Chemical Accuracy, pages
47--109, Kluwer Academic, 1997.
[16] J. M. L. Martin. In P. v. R. Schleyer, P. R. Schreiner, N. L. Allinger,
T. Clark, J. Gasteiger, P. Kollman, H. F. Schaefer III (Eds.), Encyclopedia
of Computational Chemistry. Wiley & Sons, Berne
(Switzerland). Vol. 1, pp. 115--128, 1998.
[17] The Match Project. A MATLAB compilation environment
for distributed heterogeneous adaptive computing systems.

www.ece.nwu.edu/cpdc/Match/Match.html.

[18] Vijay Menon, Keshav Pingali. High-Level Semantic Optimization
of Numerical Codes. In Proc. ACM Intl. Conf. on Supercomputing.

1999.
[19] Vijay Menon, Keshav Pingali. A Case for Source-Level Transformations
in MATLAB. In Proc. 2nd Conf. on Domain-Specific Languages.
 1999.
[20] J. Moura, J. Johnson, R. Johnson, D. Padua, V. Prasanna,
M. Puschel, and M. Veloso. SPIRAL: Portable library
of optimized signal processing algorithms, 1998.

http://www.ece.cmu.edu/spiral.

[21] M. Thottethodi, S. Chatterjee, and A. Lebeck. Tuning Strassen's
matrix multiplication for memory hierarchies. In Proc. SuperComputing
 '98.

[22] R. Whaley and J. Dongarra. Automatically Tuned Linear Algebra
Software (ATLAS). In Proc. SC '98 (Electronic Publication), IEEE
Publication, 1998.
[23] J. Xiong, D. Padua, and J. Johnson. SPL: A language and compiler
for DSP algorithms. In Proc. ACM SIGPLAN Conf. on Programming
Language Design and Implementation, 2001.

