Next: Organization of this Book Up: Overview of Beowulf Design Previous: Building Price and Performance Contents

Protocol Summary

Profile and Analyze Task(s) for Parallelizability:
1. Profile your task(s). Determine how much time is being spent doing work that could be done in parallel and how much time is being spend doing work that must be done serially. Task profiling is covered in a chapter below.
2. Use Amdahl's Law (covered in a chapter of its own) to determine whether or not there is any point in proceeding further. If not, quit. Your task(s) runs optimally on a single processor, and all you get to choose is which of the various single processors to buy. This book can still help (although it isn't its primary purpose) - check out the chapter on ``node hardware'' to learn about benchmarking numerical performance.
Parallelize your code:
1. Just splitting up the serially written routines in your original task (as you probably did to get your first estimate of potential parallel speedup) isn't particularly optimal - one has to adopt ``real'' parallel algorithms to gain the greatest benefit. Consult one of the resources in the bibliography to learn about parallel algorithms.
2. Figure out how much information will have to be communicated between the various subtasks being done in parallel, and estimate how long it take for those communications to be completed with various kinds of networks.
3. Parallelizing added some time to the serial and parallel time estimates again, so recalculate them.
4. Some of the tasks finish at the same time, some don't. Sometimes we have to wait for everything to be done and resynchronize (at ``barriers''), sometimes we don't. Add in and estimate for all the serial time wasted waiting for things to finish up and so forth.
5. Use the more detailed forms of Amdahl's Law given in the chapter to determine whether or not there is any point in proceeding further. If the answer is unequivocally no, then quit. Give it up; use a single processor that is as fast as you can afford or need.
Note that these first two steps can be greatly shortened (to approximately zero) if there already exists a decently written parallel version of your code. Quite a lot of parallel code already has been written by various folks around the world. Look around in the chapter on parallel applications and ask on the beowulf list before tackling a new parallelization project - you might find that your particular wheel has already been invented or that some other wheel can, with a little regrinding, be made to serve as a template for your own.
Build Tables:
1. Build a single node performance table for as many candidate nodes as you wish or can afford, using measurements where you can and calculated estimates where you must.
2. Build an estimated multinode performance table for as many networks as you wish or can afford, using measurements where you can and calculated estimates where you must.
3. All your answers in the previous step depended on a certain program size. Sometimes making a program ``bigger'' makes it parallelize more efficiently (explained below in the Amdahl's Law chapter). Determine the approximate effect of scaling up the task size and varying the number of nodes in your multinode performance table.
4. Attach costs and determine benefits as weights and transform the multinode performance table into a cost-benefit table (covered in a chapter of its own).
Design your Beowulf:
1. Determine your performance threshold.
2. Determine your price threshold.
3. Select the beowulf (or cluster, or even dedicated parallel supercomputer - if necessary) design that optimizes price performance.

Simple, really. You figure out if your computational task can be sped up ``at all'' by parallelizing it under ideal circumstances. You figure out pretty much how to parallelize it, up to and including writing or finding or adapting a parallel version of the computational code. You then do a cost-benefit analysis of the alternative ways to get the work done^2.9.

This is a bit oversimplified, of course. For example, the best parallel algorithm to choose for certain numerical tasks depends on things like the size of the task and the relative speeds of numerical operations and IPC's, so using one particular one for all the different combinations of node and network hardware isn't likely to give you a global optimum in price performance. The price performance is a function of many variables - the problem itself, the node design (and all the variables that describe a node), network design (and all the variables that describe the network, including e.g. the topology), and the algorithms and methodology used to parallelize the problem. Furthermore, it is a nonlinear function of all of those variables - sometimes changing something a tiny bit at very low cost or no cost at all can double overall performance. Or halve it.

Still, the general approach above should suffice to get you into the vicinity of an optimum solution, which is generally good enough. In particular, it will give you a very reliable answer to the question of whether or not you can afford to build a beowulf that will run your problem satisfactorily faster. If the answer is yes (and often it will be) then some strategic prototyping and parametric tweaking will often let you do even better.

Next: Organization of this Book Up: Overview of Beowulf Design Previous: Building Price and Performance Contents

Robert G. Brown 2004-05-24