The tasks generated by a partition are intended to execute concurrently but cannot, in general, execute independently. The computation to be performed in one task will typically require data associated with another task. Data must then be transferred between tasks so as to allow computation to proceed. This information flow is specified in the communication phase of a design.
Recall from Chapter 1 that in our programming model, we conceptualize a need for communication between two tasks as a channel linking the tasks, on which one task can send messages and from which the other can receive. Hence, the communication associated with an algorithm can be specified in two phases. First, we define a channel structure that links, either directly or indirectly, tasks that require data (consumers) with tasks that possess those data (producers). Second, we specify the messages that are to be sent and received on these channels. Depending on our eventual implementation technology, we may not actually create these channels when coding the algorithm. For example, in a data-parallel language, we simply specify data-parallel operations and data distributions. Nevertheless, thinking in terms of tasks and channels helps us to think quantitatively about locality issues and communication costs.
The definition of a channel involves an intellectual cost and the sending of a message involves a physical cost. Hence, we avoid introducing unnecessary channels and communication operations. In addition, we seek to optimize performance by distributing communication operations over many tasks and by organizing communication operations in a way that permits concurrent execution.
In domain decomposition problems, communication requirements can be difficult to determine. Recall that this strategy produces tasks by first partitioning data structures into disjoint subsets and then associating with each datum those operations that operate solely on that datum. This part of the design is usually simple. However, some operations that require data from several tasks usually remain. Communication is then required to manage the data transfer necessary for these tasks to proceed. Organizing this communication in an efficient manner can be challenging. Even simple decompositions can have complex communication structures.
In contrast, communication requirements in parallel algorithms obtained by functional decomposition are often straightforward: they correspond to the data flow between tasks. For example, in a climate model broken down by functional decomposition into atmosphere model, ocean model, and so on, the communication requirements will correspond to the interfaces between the component submodels: the atmosphere model will produce values that are used by the ocean model, and so on (Figure 2.3).
In the following discussion, we use a variety of examples to show how communication requirements are identified and how channel structures and communication operations are introduced to satisfy these requirements. For clarity in exposition, we categorize communication patterns along four loosely orthogonal axes: local/global, structured/unstructured, static/dynamic, and synchronous/asynchronous.
A local communication structure is obtained when an operation requires data from a small number of other tasks. It is then straightforward to define channels that link the task responsible for performing the operation (the consumer) with the tasks holding the required data (the producers) and to introduce appropriate send and receive operations in the producer and consumer tasks, respectively.
For illustrative purposes, we consider the communication requirements
associated with a simple numerical computation, namely a Jacobi finite
difference method. In this class of numerical method, a
multidimensional grid is repeatedly updated by replacing the value at
each point with some function of the values at a small, fixed number
of neighboring points. The set of values required to update a single
grid point is called that grid point's stencil. For example,
the following expression uses a five-point stencil to update each
element of a two-dimensional grid $X$:

$$X^{(t+1)}_{i,j} = \frac{4X^{(t)}_{i,j} + X^{(t)}_{i-1,j} + X^{(t)}_{i+1,j} + X^{(t)}_{i,j-1} + X^{(t)}_{i,j+1}}{8} \qquad (2.1)$$
This update is applied repeatedly to compute a sequence of values $X^{(1)}$, $X^{(2)}$, and so on. The notation $X^{(t)}_{i,j}$ denotes the value of grid point $(i,j)$ at step $t$.
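To make the update concrete, the following C fragment sketches one Jacobi sweep over the interior of a grid stored with an extra boundary layer; the array sizes and the name jacobi_sweep are illustrative and do not come from the text.

    /* Sketch: one Jacobi sweep applying the five-point update of Equation 2.1
     * to the interior of an (NX+2) x (NY+2) array whose outermost layer holds
     * fixed boundary values.  Every new value is computed from the old array,
     * so all interior points could in principle be updated concurrently. */
    #define NX 64
    #define NY 64

    void jacobi_sweep(double xold[NX + 2][NY + 2], double xnew[NX + 2][NY + 2])
    {
        for (int i = 1; i <= NX; i++)
            for (int j = 1; j <= NY; j++)
                xnew[i][j] = (4.0 * xold[i][j]
                              + xold[i - 1][j] + xold[i + 1][j]
                              + xold[i][j - 1] + xold[i][j + 1]) / 8.0;
    }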
Figure 2.4: Task and channel structure for a two-dimensional finite
difference computation with five-point update stencil. In this simple
fine-grained formulation, each task encapsulates a single element of a
two-dimensional grid and must both send its value to four neighbors
and receive values from four neighbors. Only the channels used by the
shaded task are shown.
Let us assume that a partition has used domain decomposition techniques to create a distinct task for each point in the two-dimensional grid. Hence, the task allocated the grid point $(i,j)$ must compute the sequence
$$X^{(1)}_{i,j},\; X^{(2)}_{i,j},\; \ldots,\; X^{(T)}_{i,j}.$$
This computation requires in turn the four corresponding sequences
$$X^{(1)}_{i-1,j}, X^{(2)}_{i-1,j}, \ldots; \quad X^{(1)}_{i+1,j}, X^{(2)}_{i+1,j}, \ldots; \quad X^{(1)}_{i,j-1}, X^{(2)}_{i,j-1}, \ldots; \quad X^{(1)}_{i,j+1}, X^{(2)}_{i,j+1}, \ldots$$
which are produced by the four tasks handling grid points $(i-1,j)$, $(i+1,j)$, $(i,j-1)$, and $(i,j+1)$. We therefore define channels linking each task with these four neighbors, as illustrated in Figure 2.4, and require each task to execute the following logic:
    for t = 0 to T-1
        send $X^{(t)}_{i,j}$ to each neighbor
        receive $X^{(t)}_{i-1,j}$, $X^{(t)}_{i+1,j}$, $X^{(t)}_{i,j-1}$, $X^{(t)}_{i,j+1}$ from neighbors
        compute $X^{(t+1)}_{i,j}$ using Equation 2.1
    endfor
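As a concrete (if deliberately fine-grained) illustration, the following sketch expresses this per-task logic in C with MPI. MPI is one possible implementation technology, not the one assumed in the text; the communicator comm2d is assumed to have been created with MPI_Cart_create as a nonperiodic two-dimensional grid with one process per grid point, and the function and variable names are invented for illustration.

    /* Sketch (MPI assumed): one process plays the role of one task of
     * Figure 2.4, holding a single grid value x and exchanging it with its
     * four neighbors in the Cartesian communicator comm2d at every step. */
    #include <mpi.h>

    void jacobi_point(MPI_Comm comm2d, double x, int T)
    {
        int left, right, down, up;
        MPI_Cart_shift(comm2d, 0, 1, &left, &right);   /* neighbors in dimension 0 */
        MPI_Cart_shift(comm2d, 1, 1, &down, &up);      /* neighbors in dimension 1 */

        for (int t = 0; t < T; t++) {
            double nbr[4] = { x, x, x, x };   /* default used where a neighbor is
                                                 MPI_PROC_NULL (grid boundary);
                                                 a deliberate simplification     */
            int dst[4] = { left, right, down, up };
            int src[4] = { right, left, up, down };

            /* send my value to each neighbor and receive theirs in return */
            for (int k = 0; k < 4; k++)
                MPI_Sendrecv(&x, 1, MPI_DOUBLE, dst[k], 0,
                             &nbr[k], 1, MPI_DOUBLE, src[k], 0,
                             comm2d, MPI_STATUS_IGNORE);

            /* five-point update of Equation 2.1 */
            x = (4.0 * x + nbr[0] + nbr[1] + nbr[2] + nbr[3]) / 8.0;
        }
    }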
We observed earlier that the best sequential and parallel solutions to
a problem may use different techniques. This situation arises in
finite difference problems. In sequential computing,
Gauss-Seidel update strategies are often preferred over Jacobi
strategies because they allow solutions of comparable accuracy to be
obtained using fewer iterations. In a Gauss-Seidel scheme, elements
are updated in a particular order so that the computation of each
element can use the most up-to-date value of other elements. For
example, the Jacobi update of Equation 2.1 may be reformulated
as follows (notice the use of values at both step $t$ and step $t+1$):

$$X^{(t+1)}_{i,j} = \frac{4X^{(t)}_{i,j} + X^{(t+1)}_{i-1,j} + X^{(t)}_{i+1,j} + X^{(t+1)}_{i,j-1} + X^{(t)}_{i,j+1}}{8} \qquad (2.2)$$
While Jacobi schemes are trivial to parallelize (all grid points can
be updated concurrently), this is not the case for all Gauss-Seidel
schemes. For example, the update scheme of Equation 2.2 allows only an average of around N/2 points within an N × N grid to be updated concurrently. Fortunately,
many different Gauss-Seidel orderings are possible for most problems,
and we are usually free to choose the ordering that maximizes
available parallelism. In particular, we can choose to update first
the odd-numbered elements and then the even-numbered
elements of an array. Each update uses the most recent information,
yet the updates to the odd-numbered points are independent and can
proceed concurrently, as can the updates to the even-numbered points.
This update strategy yields what is referred to as a red-black
algorithm, since the points can be thought of as being colored as on a
chess board: either red (odd) or black (even); points of the same
color can be updated concurrently. Figure 2.5 illustrates
both the Gauss-Seidel scheme of Equation 2.2 and a red-black
scheme, and shows how the latter scheme increases opportunities for
parallel execution.
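The following C sketch shows one such red-black sweep; it assumes the conventional chessboard coloring in which a point (i,j) is "red" when i+j is odd, and the array size and function name are illustrative only.

    /* Sketch: one red-black sweep over the interior of an (N+2) x (N+2) grid x,
     * updated in place with the five-point formula.  Points with i+j odd ("red")
     * are updated first, then points with i+j even ("black"); within each
     * half-sweep the updates are independent and could proceed concurrently. */
    #define N 64

    void red_black_sweep(double x[N + 2][N + 2])
    {
        for (int parity = 1; parity >= 0; parity--)   /* 1 = red (odd), 0 = black (even) */
            for (int i = 1; i <= N; i++)
                for (int j = 2 - (i + parity) % 2; j <= N; j += 2)  /* (i+j)%2 == parity */
                    x[i][j] = (4.0 * x[i][j]
                               + x[i - 1][j] + x[i + 1][j]
                               + x[i][j - 1] + x[i][j + 1]) / 8.0;
    }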
Figure 2.5: Two finite difference update strategies, here applied on a
two-dimensional grid with a five-point stencil. In both figures,
shaded grid points have already been updated to step t+1; unshaded grid points are still at step t. The arrows show data
dependencies for one of the latter points. The figure on the left
illustrates a simple Gauss-Seidel scheme and highlights the five grid
points that can be updated at a particular point in time. In this
scheme, the update proceeds in a wavefront from the top left corner to
the bottom right. On the right, we show a red-black update scheme.
Here, all the grid points at step t
can be updated
concurrently.
This example indicates the important role that choice of solution strategy can play in determining the performance of a parallel program. While the simple Gauss-Seidel update strategy of Equation 2.2 may be appropriate in a sequential program, it is not ideal on a parallel computer. The Jacobi update strategy is efficient on a parallel computer but is inferior numerically. The red-black scheme combines the advantages of both approaches.
Figure 2.6: A centralized summation algorithm that uses a central
manager task (S) to sum N
numbers distributed among
N
tasks. Here, N=8, and each of the 8 channels is labeled with the number of the step in which it is used.
A global communication operation is one in which many tasks must participate. When such operations are implemented, it may not be sufficient simply to identify individual producer/consumer pairs. Such an approach may result in too many communications or may restrict opportunities for concurrent execution. For example, consider the problem of performing a parallel reduction operation, that is, an operation that reduces N values distributed over N tasks using a commutative associative operator such as addition:
$$S = \sum_{i=0}^{N-1} X_i .$$
Let us assume that a single ``manager'' task requires the result S of this operation. Taking a purely local view of communication, we recognize that the manager requires values $X_0$, $X_1$, etc., from tasks 0, 1, etc. Hence, we could define a communication structure that allows each task to communicate its value to the manager independently. The manager would then receive the values and add them into an accumulator (Figure 2.6). However, because the manager can receive and sum only one number at a time, this approach takes $\Theta(N)$ time to sum N numbers---not a very good parallel algorithm!
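A minimal sketch of this centralized algorithm, assuming MPI as the implementation technology (the text itself is technology-neutral, and the function name is invented), makes the serialization explicit:

    /* Sketch (MPI assumed): centralized summation as in Figure 2.6.
     * Rank 0 acts as the manager; every other rank sends it one value.
     * The manager receives and adds one number at a time, so the N-1
     * communications and additions are serialized at the manager. */
    #include <mpi.h>

    double central_sum(double x, MPI_Comm comm)
    {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        if (rank != 0) {
            MPI_Send(&x, 1, MPI_DOUBLE, 0, 0, comm);
            return 0.0;                       /* only the manager holds the result */
        }
        double s = x, v;
        for (int i = 1; i < size; i++) {      /* accumulate the other N-1 values */
            MPI_Recv(&v, 1, MPI_DOUBLE, MPI_ANY_SOURCE, 0, comm, MPI_STATUS_IGNORE);
            s += v;
        }
        return s;
    }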
This example illustrates two general problems that can hinder efficient parallel execution in algorithms based on a purely local view of communication:
1. The algorithm is centralized: computation and communication are not distributed, since the single manager task must participate in every operation.
2. The algorithm is sequential: it does not allow more than one communication or computation operation to proceed at a time.
We first consider the problem of distributing the computation and communication associated with the summation. We can distribute the summation of the N numbers by making each task i, 0<i<N-1, compute the partial sum
$$S_i = X_i + S_{i+1},$$
where $S_{N-1} = X_{N-1}$; task 0 then obtains the complete sum as $S = X_0 + S_1$.
Figure 2.7: A summation algorithm that connects N
tasks in an
array in order to sum N
numbers distributed among these tasks.
Each channel is labeled with the number of the step in which it is
used and the value that is communicated on it.
The communication requirements associated with this algorithm can be satisfied by connecting the N tasks in a one-dimensional array (Figure 2.7). Task N-1 sends its value to its neighbor in this array. Tasks 1 through N-2 each wait to receive a partial sum from their right-hand neighbor, add this to their local value, and send the result to their left-hand neighbor. Task 0 receives a partial sum and adds this to its local value to obtain the complete sum. This algorithm distributes the N-1 communications and additions, but permits concurrent execution only if multiple summation operations are to be performed. (The array of tasks can then be used as a pipeline, through which flow partial sums.) A single summation still takes N-1 steps.
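Under the same assumptions as before (MPI, one process per task, names invented for illustration), the array algorithm of Figure 2.7 can be sketched as follows:

    /* Sketch (MPI assumed): the one-dimensional array summation of Figure 2.7.
     * Partial sums flow from rank N-1 down to rank 0; each rank adds its own
     * value x before passing the result to its left-hand neighbor. */
    #include <mpi.h>

    double array_sum(double x, MPI_Comm comm)
    {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        double s = x;
        if (rank < size - 1) {                /* wait for the partial sum from the right */
            double partial;
            MPI_Recv(&partial, 1, MPI_DOUBLE, rank + 1, 0, comm, MPI_STATUS_IGNORE);
            s += partial;
        }
        if (rank > 0)                         /* pass the result to the left */
            MPI_Send(&s, 1, MPI_DOUBLE, rank - 1, 0, comm);
        return s;                             /* rank 0 returns the complete sum S */
    }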
Opportunities for concurrent computation and communication can often
be uncovered by applying a problem-solving strategy called divide
and conquer. To solve a complex problem (such as summing
N
numbers), we seek to partition it into two or more simpler problems of
roughly equivalent size (e.g., summing N/2
numbers). This
process is applied recursively to produce a set of subproblems that
cannot be subdivided further (e.g., summing two numbers). The
strategy is summarized in Algorithm 2.1. The
divide-and-conquer technique is effective in parallel computing when
the subproblems generated by problem partitioning can be solved
concurrently. For example, in the summation problem, we can take advantage of the following identity ($N = 2^n$, n an integer):

$$\sum_{i=0}^{N-1} X_i \;=\; \sum_{i=0}^{N/2-1} X_i \;+\; \sum_{i=N/2}^{N-1} X_i .$$
The two summations on the right hand side can be performed
concurrently. They can also be further decomposed if n>1, to
give the tree structure illustrated in Figure 2.8.
Summations at the same level in this tree of height log N can be performed concurrently, so the complete summation can be achieved in log N rather than N steps.
Figure 2.8: Tree structure for divide-and-conquer summation algorithm
with N=8. The N numbers located in the tasks at the bottom of the diagram are communicated to the tasks in the row immediately above; these each perform an addition and then forward the result to the next level. The complete sum is available at the root of the tree after log N steps.
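Under the same assumptions (MPI, with the number of processes a power of two, and an invented function name), the tree summation can be sketched as follows:

    /* Sketch (MPI assumed; size must be a power of two): divide-and-conquer
     * summation matching the tree of Figure 2.8.  At step 2^k, ranks that are
     * odd multiples of 2^k send their partial sum to the rank 2^k below them;
     * after log2(N) steps rank 0 holds the complete sum. */
    #include <mpi.h>

    double tree_sum(double x, MPI_Comm comm)
    {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        double s = x;
        for (int step = 1; step < size; step *= 2) {
            if (rank % (2 * step) == step) {              /* sender at this level */
                MPI_Send(&s, 1, MPI_DOUBLE, rank - step, 0, comm);
                break;                                    /* this task is now done */
            } else if (rank % (2 * step) == 0) {          /* receiver at this level */
                double partial;
                MPI_Recv(&partial, 1, MPI_DOUBLE, rank + step, 0, comm, MPI_STATUS_IGNORE);
                s += partial;
            }
        }
        return s;                                         /* complete sum at rank 0 */
    }

In a real program, the single call MPI_Reduce(&x, &s, 1, MPI_DOUBLE, MPI_SUM, 0, comm) provides the same result and is typically implemented internally with a comparable tree.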
In summary, we observe that in developing an efficient parallel summation algorithm, we have distributed the N-1 communication and computation operations required to perform the summation and have modified the order in which these operations are performed so that they can proceed concurrently. The result is a regular communication structure in which each task communicates with a small set of neighbors.
Figure 2.9: Example of a problem requiring unstructured communication.
In this finite element mesh generated for an assembly part, each
vertex is a grid point. An edge connecting two vertices represents a
data dependency that will require communication if the vertices are
located in different tasks. Notice that different vertices have
varying numbers of neighbors. (Image courtesy of M.
S. Shephard.)
The examples considered previously are all of static, structured communication, in which a task's communication partners form a regular pattern such as a tree or a grid and do not change over time. In other cases, communication patterns may be considerably more complex. For example, in finite element methods used in engineering calculations, the computational grid may be shaped to follow an irregular object or to provide high resolution in critical regions (Figure 2.9). Here, the channel structure representing the communication partners of each grid point is quite irregular and data-dependent and, furthermore, may change over time if the grid is refined as a simulation evolves.
Unstructured communication patterns do not generally cause conceptual difficulties in the early stages of a design. For example, it is straightforward to define a single task for each vertex in a finite element graph and to require communication for each edge. However, unstructured communication complicates the tasks of agglomeration and mapping. In particular, sophisticated algorithms can be required to determine an agglomeration strategy that both creates tasks of approximately equal size and minimizes communication requirements by creating the least number of intertask edges. Algorithms of this sort are discussed in Section 2.5.1. If communication requirements are dynamic, these algorithms must be applied frequently during program execution, and the cost of these algorithms must be weighed against their benefits.
The examples considered in the preceding section have all featured synchronous communication, in which both producers and consumers are aware when communication operations are required, and producers explicitly send data to consumers. In asynchronous communication, tasks that possess data (producers) are not able to determine when other tasks (consumers) may require data; hence, consumers must explicitly request data from producers.
Figure 2.10: Using separate ``data tasks'' to service read and write
requests on a distributed data structure. In this figure, four
computation tasks (C) generate read and write requests to eight data
items distributed among four data tasks (D). Solid lines represent
requests; dashed lines represent replies. One compute task and one
data task could be placed on each of four processors so as to
distribute computation and data equitably.
This situation commonly occurs when a computation is structured as a set of tasks that must periodically read and/or write elements of a shared data structure. Let us assume that this data structure is too large or too frequently accessed to be encapsulated in a single task. Hence, a mechanism is needed that allows this data structure to be distributed while supporting asynchronous read and write operations on its components. Possible mechanisms include the following.
1. The data structure is distributed among the computational tasks. Each task both performs computation and, by polling periodically for pending requests, services read and write operations on the data that it holds.
2. The data structure is distributed among a separate set of ``data tasks'' whose only responsibility is to respond to read and write requests from the computation tasks (Figure 2.10).
Each strategy has advantages and disadvantages; in addition, the performance characteristics of each approach vary from machine to machine. The first strategy can result in convoluted, nonmodular programs because of the need to intersperse polling operations throughout application code. In addition, polling can be an expensive operation on some computers, in which case we must trade off the cost of frequent polling against the benefit of rapid response to remote requests. The second strategy is more modular: responsibility for the shared data structure is encapsulated in a separate set of tasks. However, this strategy makes it hard to exploit locality because, strictly speaking, there are no local data: all read and write operations require communication. Also, switching between the computation and data tasks can be expensive on some machines.
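As an illustration of the second strategy, the following sketch (again assuming MPI; the tags, message format, and names are invented for this example) shows the event loop of a single data task from Figure 2.10:

    /* Sketch (MPI assumed; protocol invented for illustration): the event loop
     * of one "data task" of Figure 2.10.  The task owns a block of the shared
     * array and serves read and write requests from any computation task
     * until it is told to shut down. */
    #include <mpi.h>

    #define TAG_READ   1   /* request carries an index; reply carries the value */
    #define TAG_WRITE  2   /* request carries {index, value}; no reply           */
    #define TAG_STOP   3   /* terminate the data task                            */
    #define BLOCK_SIZE 1024

    void data_task(MPI_Comm comm)
    {
        double block[BLOCK_SIZE] = { 0.0 };   /* this task's part of the structure */
        MPI_Status status;

        for (;;) {
            double msg[2];                    /* msg[0] = local index, msg[1] = value */
            MPI_Recv(msg, 2, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG, comm, &status);

            if (status.MPI_TAG == TAG_READ) {         /* reply with the requested element */
                MPI_Send(&block[(int)msg[0]], 1, MPI_DOUBLE,
                         status.MPI_SOURCE, TAG_READ, comm);
            } else if (status.MPI_TAG == TAG_WRITE) { /* update the requested element */
                block[(int)msg[0]] = msg[1];
            } else if (status.MPI_TAG == TAG_STOP) {
                break;
            }
        }
    }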
Having devised a partition and a communication structure for our parallel algorithm, we now evaluate our design using the following design checklist. As in Section 2.2.3, these are guidelines intended to identify nonscalable features, rather than hard and fast rules. However, we should be aware of when a design violates them and why.