Our third case study, like the first, is from computational science. It is an example of an application that accesses a distributed data structure in an asynchronous fashion and that is amenable to a functional decomposition.
Computational techniques are being used increasingly as an alternative to experiment in chemistry. In what is called ab initio quantum chemistry, computer programs are used to compute fundamental properties of atoms and molecules, such as bond strengths and reaction energies, from first principles, by solving various approximations to the Schrödinger equation that describes their basic structures. This approach allows the chemist to explore reaction pathways that would be hazardous or expensive to explore experimentally. One application for these techniques is in the investigation of biological processes. For example, Plate 6
shows a molecular model for the active site region in the enzyme malate dehydrogenase, a key enzyme in the conversion of glucose to the high-energy molecule ATP. This image is taken from a simulation of the transfer of a hydride anion from the substrate, malate, to a cofactor, nicotinamide adenine dinucleotide. The two isosurfaces colored blue and brown represent lower and higher electron densities, respectively, calculated by using a combined quantum and classical mechanics methodology. The green, red, blue, and white balls are carbon, oxygen, nitrogen, and hydrogen atoms, respectively.
Fundamental to several methods used in quantum chemistry is the need to compute what is called the Fock matrix, a two-dimensional array representing the electronic structure of an atom or molecule. This matrix, which is represented here as F, has size N × N and is formed by evaluating the following summation for each element:

    F_ij = Σ_kl D_kl ( 2 I_ijkl - I_ikjl )                        (2.3)

where D is a two-dimensional array of size N × N that is only read, not written, by this computation, and the I represent integrals that are computed using elements i, j, k, and l of a read-only, one-dimensional array A with O(N) elements. An integral can be thought of as an approximation to the repulsive force between two electrons.
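To make the structure of this computation concrete, the following is a minimal sequential sketch in C. The routine integral() is a hypothetical placeholder for the (expensive) integral evaluation, which the text does not specify; F and D are assumed to be stored in row-major order.

    /* Naive evaluation of Equation 2.3: for every element F_ij, sum over all
       (k,l) pairs, computing two integrals per term.  integral() stands in
       for the expensive integral evaluation based on the array A. */

    double integral(int i, int j, int k, int l, const double *A);  /* hypothetical */

    void fock_build_naive(int N, double *F, const double *D, const double *A)
    {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                double sum = 0.0;
                for (int k = 0; k < N; k++)
                    for (int l = 0; l < N; l++)
                        sum += D[k*N + l] * (2.0 * integral(i, j, k, l, A)
                                                 - integral(i, k, j, l, A));
                F[i*N + j] = sum;
            }
    }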
Because Equation 2.3 includes a double summation, apparently 2N^2 integrals must be computed for each element of F, for a total of 2N^4 integrals. However, in practice it is possible to exploit redundancy in the integrals and symmetry in F and D and reduce this number to a total of about N^4/8. When this is done, the
algorithm can be reduced to the rather strange logic given as
Algorithm 2.3. In principle, the calculation of each
element of F requires access to all elements of D and
A; furthermore, access patterns appear highly irregular. In this
respect, the Fock matrix construction problem is representative of
many numeric problems with irregular and nonlocal communication
patterns.
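The reduction to roughly N^4/8 integrals comes from the eightfold permutational symmetry of the integrals (under the index convention of Equation 2.3, exchanging i with j, k with l, or the pair (i,j) with the pair (k,l) leaves the integral unchanged). The following sketch simply counts the integrals enumerated by a standard canonical index ordering; it illustrates where the factor of eight comes from and is not a reproduction of Algorithm 2.3 itself.

    /* Count the integrals visited by a canonical ordering that keeps
       i >= j, k >= l, and (i,j) >= (k,l); every symmetry-distinct integral
       is enumerated exactly once.  The count is approximately N^4/8. */

    long count_unique_integrals(int N)
    {
        long count = 0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j <= i; j++)
                for (int k = 0; k <= i; k++)
                    for (int l = 0; l <= (k == i ? j : k); l++)
                        count++;              /* one unique integral I_ijkl */
        return count;
    }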
For the molecular systems of interest to chemists, the problem size N may be in the range 10^2 to 10^4. Because the evaluation of an integral is a fairly expensive operation, involving on the order of 10^2 floating-point operations, the construction of the Fock matrix may require O(N^4) operations in total. In addition, most methods require that a series of Fock matrices be constructed, each representing a more accurate approximation to a molecule's electronic structure. These considerations have motivated a considerable amount of work on both efficient parallel algorithms for Fock matrix construction and improved methods that require the computation of fewer than O(N^4) integrals.
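As a rough illustration of these magnitudes (using the approximate figures above):

    N = 10^3   =>   N^4/8 ≈ 1.2 x 10^11 unique integrals
               =>   roughly 10^2 x 1.2 x 10^11 ≈ 10^13 operations per Fock matrix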
Because the Fock matrix problem is concerned primarily with the symmetric two-dimensional matrices F and D, an obvious partitioning strategy is to apply domain decomposition techniques to these matrices to create N(N+1)/2 tasks, each containing a single element from each matrix (F_ij, D_ij) and responsible for the operations required to compute its F_ij. This yields N(N+1)/2 tasks, each with O(1) data and each responsible for computing 2N^2 integrals, as specified in Equation 2.3.
This domain decomposition strategy is simple but suffers from a significant disadvantage: it cannot easily exploit redundancy and symmetry and, hence, performs eight times too many integral computations. Because an alternative algorithm based on functional decomposition techniques is significantly more efficient (it does not perform redundant computation and does not incur high communication costs), the domain decomposition algorithm is not considered further.
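The factor of eight can be seen directly from the counts already given:

    N(N+1)/2 ≈ N^2/2 tasks  x  2N^2 integrals per task  ≈  N^4 integrals,
    compared with the roughly N^4/8 symmetry-distinct integrals actually required.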
Figure 2.31: Functional decomposition of Fock matrix problem. This yields approximately N^2 data tasks, shown in the upper part of the figure, and N^4/8 computation tasks, shown in the lower part of the figure. Computation tasks send read and write requests to data tasks.
Quite a different parallel algorithm can be developed by focusing on
the computation to be performed rather than on the data structures
manipulated, in other words, by using a functional decomposition.
When redundancy is considered, one naturally thinks of a computation as comprising a set of N^4/8 integrals (the integral procedure of Algorithm 2.3), each requiring six D elements and contributing to six F elements. Focusing on these computations, we define N^4/8 ``computation'' tasks, each responsible for one integral.
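The access pattern of a single computation task can be sketched as follows, in the same C notation as before. The particular six D reads and six F updates shown are the pattern commonly obtained by folding the integral symmetry into Equation 2.3; the numerical weights applied to each contribution, and the corrections needed when indices coincide, are deliberately omitted.

    /* One "computation task": evaluate the single integral I_ijkl and
       accumulate its contributions.  Six D elements are read and six F
       elements are updated; weights are omitted in this sketch. */

    double integral(int i, int j, int k, int l, const double *A);  /* as before */

    void integral_task(int i, int j, int k, int l,
                       int N, double *F, const double *D, const double *A)
    {
        double I = integral(i, j, k, l, A);

        F[i*N + j] += D[k*N + l] * I;     /* Coulomb-like contributions  */
        F[k*N + l] += D[i*N + j] * I;
        F[i*N + k] -= D[j*N + l] * I;     /* exchange-like contributions */
        F[i*N + l] -= D[j*N + k] * I;
        F[j*N + k] -= D[i*N + l] * I;
        F[j*N + l] -= D[i*N + k] * I;
    }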
Having defined a functional decomposition, we next need to distribute
data structures over tasks. However, we see no obvious criteria by
which data elements might be associated with one computation task
rather than another: each data element is accessed by many tasks. In
effect, the F, D, and A arrays constitute large data
structures that the computation tasks need to access in a distributed
and asynchronous fashion. This situation suggests that the techniques
described in Section 2.3.4 for asynchronous communication
may be useful. Hence, for now we simply define two sets of ``data''
tasks that are responsible only for responding to requests to read and write
data values. These tasks encapsulate elements of the two-dimensional
arrays D and F (D_ij, F_ij) and of the one-dimensional array A (A_i), respectively. In all, our partition yields a total of approximately N^4/8 computation tasks and N^2 data tasks (Figure 2.31).
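A data task is essentially a small server loop. The sketch below expresses one in MPI terms purely for concreteness (the design itself is independent of any particular message-passing library); the message tags and the two-message protocol for accumulations are assumptions of this sketch.

    #include <mpi.h>

    enum { TAG_READ_D = 1, TAG_ACCUM_F = 2, TAG_DONE = 3 };

    /* Serve read requests for locally held D elements and accumulate
       contributions into locally held F elements until told to stop. */
    void data_task(double *D_local, double *F_local)
    {
        int index;
        MPI_Status status;

        for (;;) {
            MPI_Recv(&index, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &status);
            if (status.MPI_TAG == TAG_DONE) {
                break;                                  /* all requests served */
            } else if (status.MPI_TAG == TAG_READ_D) {  /* reply with D value  */
                MPI_Send(&D_local[index], 1, MPI_DOUBLE,
                         status.MPI_SOURCE, TAG_READ_D, MPI_COMM_WORLD);
            } else if (status.MPI_TAG == TAG_ACCUM_F) { /* receive and add     */
                double contrib;
                MPI_Recv(&contrib, 1, MPI_DOUBLE, status.MPI_SOURCE,
                         TAG_ACCUM_F, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                F_local[index] += contrib;
            }
        }
    }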
We have now defined N^4/8 computation tasks and N^2 data
tasks. Each computation task must perform sixteen communications: six
to obtain D matrix elements, four to obtain A matrix
elements, and six to store F matrix elements. As the
computational costs of different integrals can vary significantly,
there does not appear to be any opportunity for organizing these
communication operations into a regular structure, as is advocated in
Section 2.3.2.
On many parallel computers, the cost of an integral will be comparable to the cost of a communication. Hence, communication requirements must be reduced by agglomeration. We describe two alternative strategies that can be used to achieve this goal. Their data requirements are illustrated in Figure 2.32.
Figure 2.32: Agglomeration strategies for Fock matrix construction with
N=P=5
, for (a) the total replication algorithm and (b) the partial
replication algorithm. In each case, the five tasks are shown with
shading used to represent the portion of the symmetric D and
F matrices allocated to each task. In (a), each matrix is
replicated in each task. In (b), each task is given a single row and
column; this corresponds to a factor of two
replication.
1. Total replication. Communication costs can be cut dramatically by replicating the F and D matrices in each of P tasks, one per processor of a parallel computer. Each task is given responsibility for 1/P of the integrals. Computation can then proceed in each task without any communication. The only coordination required is a final summation to accumulate partial F matrices. This can be achieved using a parallel vector reduction algorithm described in Section 11.2.
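In message-passing terms, this final accumulation is a single global reduction over the P partial matrices. A minimal sketch using MPI (one possible realization; Section 11.2 describes the underlying reduction algorithm) might look like this:

    #include <mpi.h>

    /* Sum the P partial Fock matrices, element by element, leaving the
       complete N x N result in F on every process. */
    void sum_partial_fock(const double *F_partial, double *F, int N)
    {
        MPI_Allreduce(F_partial, F, N * N, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    }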
The technique of replicating data structures on each processor of a parallel computer is commonly used in parallel computing to reduce software engineering costs. It allows an existing sequential code to be adapted quickly for parallel execution, since there is no need to modify data structures. The principal disadvantage of the technique is that it is nonscalable. Because total memory requirements scale with the number of tasks created, the largest problem that can be solved is determined not by the total amount of memory in a parallel computer, but by the amount available in a single processor. For example, on a 512-processor computer with 16 MB of memory per processor, an implementation of the quantum chemistry code DISCO that uses this strategy cannot solve problems with N > 400. In principle, it would be interesting to solve problems where N is 10 times larger.
Figure 2.33: Data requirements for integral clusters. Each task accesses
three rows (and sometimes columns) of the D and F
matrices.
2. Partial replication. An alternative approach is as follows.
First, we agglomerate computation in what seems an obvious way, namely, by making the inner loop of the procedure fock_build in Algorithm 2.3 into a task. This yields O(N^3) computation tasks, each responsible for O(N) integrals. Next, we examine the communication requirements of each such task. We find that there is considerable locality in the data required by these clusters of integrals: each cluster accesses the ith, jth, and kth rows (and sometimes columns) of D and F (Figure 2.33). To exploit this locality, we agglomerate data to create N data tasks, each containing a row/column pair of the two-dimensional arrays D and F and all of the one-dimensional array A. In this scheme, each element of D and F is replicated once, and A is replicated N times, so total storage requirements are increased from an average of N to 3N per task. Because of this replication, each computation task now requires data from just three data tasks. Hence, the number of messages is reduced from O(N^4) to O(N^3). The total volume communicated remains O(N^4). Because the cost of communicating a word is typically much less than the cost of computing an integral, this is an efficient parallel algorithm.
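A computation task in this scheme might be sketched as follows. The helpers get_row() and accumulate_row() are hypothetical stand-ins for the request/reply exchanges with data tasks i, j, and k; as in the earlier sketches, the integral weights and the handling of coincident indices are omitted, and only rows (not the occasional columns) are shown.

    double integral(int i, int j, int k, int l, const double *A);       /* as before    */
    void get_row(char which, int row, double *buf, int N);              /* hypothetical */
    void accumulate_row(char which, int row, const double *buf, int N); /* hypothetical */

    /* Cluster task (i j k): fetch three rows of D, compute the cluster of
       integrals I_ijkl over l, accumulate into three rows of F. */
    void cluster_task(int i, int j, int k, int N, const double *A)
    {
        double Di[N], Dj[N], Dk[N];      /* cached rows of D                 */
        double Fi[N], Fj[N], Fk[N];      /* local contributions to rows of F */

        get_row('D', i, Di, N);          /* three read requests ...          */
        get_row('D', j, Dj, N);
        get_row('D', k, Dk, N);
        for (int n = 0; n < N; n++)
            Fi[n] = Fj[n] = Fk[n] = 0.0;

        for (int l = 0; l <= k; l++) {   /* one cluster of O(N) integrals    */
            double I = integral(i, j, k, l, A);
            Fi[j] += Dk[l] * I;          /* contributions to rows i, j, k of */
            Fk[l] += Di[j] * I;          /* F from rows i, j, k of D;        */
            Fi[k] -= Dj[l] * I;          /* remaining terms and weights      */
            Fj[l] -= Di[k] * I;          /* omitted in this sketch           */
        }

        accumulate_row('F', i, Fi, N);   /* ... and three write requests     */
        accumulate_row('F', j, Fj, N);
        accumulate_row('F', k, Fk, N);
    }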
The ``partial replication'' Fock matrix construction algorithm creates N data tasks and O(N^3) computation tasks. We use the notation (i j k) to identify the computation task responsible for computing the cluster of integrals I_ijkl; this task requires data from data tasks i, j, and k. To complete the
parallel algorithm, we must define a mapping of data and computation
tasks to processors.
We assume N processors. Since each data task will receive
roughly the same number of requests, we allocate one data task to each
processor. This leaves the problem of mapping computation tasks. We
can imagine a variety of approaches:
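One simple possibility, shown purely as an illustration (and not necessarily the strategy a designer would ultimately prefer), is a static mapping that places data task i on processor i and scatters computation tasks cyclically:

    /* Illustrative static mapping for N processors: data task i lives on
       processor i; computation task (i j k) is assigned cyclically.  This
       balances the number of tasks per processor, but not necessarily their
       (highly variable) computation costs. */

    int map_data_task(int i)        { return i; }

    int map_computation_task(int i, int j, int k, int P)
    {
        return (i + j + k) % P;     /* simple cyclic assignment */
    }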
We have developed two alternative parallel algorithms for the Fock matrix construction problem.
This case study illustrates some of the tradeoffs that can arise in the design process. The first algorithm has low communication and software engineering costs, but it is not scalable. In contrast, the second algorithm has higher communication costs but is highly scalable: its memory requirements increase only with problem size, not with the number of processors. To choose between the two algorithms, we need to quantify their parallel performance and determine the importance of scalability by assessing application requirements and the characteristics of the target parallel computer.