 
 
 
 
 
 
 
  
 Next: Organization of this Book
 Up: Overview of Beowulf Design
 Previous: Building Price and Performance
     Contents 
- Profile and Analyze Task(s) for Parallelizability:
 
- Profile your task(s).  Determine how much time is being spent
   doing work that could be done in parallel and how much time is being
   spend doing work that must be done serially.  Task profiling is
   covered in a chapter below.
 
- Use Amdahl's Law (covered in a chapter of its own) to determine
   whether or not there is any point in proceeding further.  If not,
   quit.  Your task(s) runs optimally on a single processor, and all you
   get to choose is which of the various single processors to buy.  This
   book can still help (although it isn't its primary purpose) - check
   out the chapter on ``node hardware'' to learn about benchmarking
   numerical performance.
 
 
 
- Parallelize your code:
 
- Just splitting up the serially written routines in your
   original task (as you probably did to get your first estimate of
   potential parallel speedup) isn't particularly optimal - one has to
   adopt ``real'' parallel algorithms to gain the greatest benefit.
   Consult one of the resources in the bibliography to learn about
   parallel algorithms.
 
- Figure out how much information will have to be communicated
   between the various subtasks being done in parallel, and estimate how
   long it take for those communications to be completed with various
   kinds of networks.
 
- Parallelizing added some time to the serial and parallel
   time estimates again, so recalculate them.
 
- Some of the tasks finish at the same time, some don't.
   Sometimes we have to wait for everything to be done and resynchronize
   (at ``barriers''), sometimes we don't.  Add in and estimate for all
   the serial time wasted waiting for things to finish up and so forth.
 
- Use the more detailed forms of Amdahl's Law given in the
   chapter to determine whether or not there is any point in proceeding
   further.  If the answer is unequivocally no, then quit.  Give it up;
   use a single processor that is as fast as you can afford or need.
 
 
Note that these first two steps can be greatly shortened (to
approximately zero) if there already exists a decently written parallel
version of your code.  Quite a lot of parallel code already has been
written by various folks around the world.  Look around in the chapter
on parallel applications and ask on the beowulf list before tackling a
new parallelization project - you might find that your particular wheel
has already been invented or that some other wheel can, with a little
regrinding, be made to serve as a template for your own.
 
 
- Build Tables:
 
- Build a single node performance table for as many candidate
   nodes as you wish or can afford, using measurements where you can and
   calculated estimates where you must.
 
- Build an estimated multinode performance table for as many
   networks as you wish or can afford, using measurements where you can
   and calculated estimates where you must.
 
- All your answers in the previous step depended on a certain
   program size.  Sometimes making a program ``bigger'' makes it
   parallelize more efficiently (explained below in the Amdahl's Law
   chapter).  Determine the approximate effect of scaling up the task
   size and varying the number of nodes in your multinode
   performance table.
 
- Attach costs and determine benefits as weights and transform
   the multinode performance table into a cost-benefit table (covered
   in a chapter of its own).
 
 
 
- Design your Beowulf:
 
- Determine your performance threshold.
 
- Determine your price threshold.
 
- Select the beowulf (or cluster, or even dedicated parallel
   supercomputer - if necessary) design that optimizes price
   performance.
 
 
 
Simple, really.  You figure out if your computational task can be sped
up ``at all'' by parallelizing it under ideal circumstances.  You figure
out pretty much how to parallelize it, up to and including writing or
finding or adapting a parallel version of the computational code.  You
then do a cost-benefit analysis of the alternative ways to get the
work done2.9.
This is a bit oversimplified, of course.  For example, the best
parallel algorithm to choose for certain numerical tasks depends on
things like the size of the task and the relative speeds of numerical
operations and IPC's, so using one particular one for all the different
combinations of node and network hardware isn't likely to give you a
global optimum in price performance.  The price performance is a
function of many variables - the problem itself, the node design (and
all the variables that describe a node), network design (and all the
variables that describe the network, including e.g. the topology), and
the algorithms and methodology used to parallelize the problem.
Furthermore, it is a nonlinear function of all of those variables
- sometimes changing something a tiny bit at very low cost or no cost
at all can double overall performance.  Or halve it.
Still, the general approach above should suffice to get you into the
vicinity of an optimum solution, which is generally good enough.
In particular, it will give you a very reliable answer to the question
of whether or not you can afford to build a beowulf that will run your
problem satisfactorily faster.  If the answer is yes (and often it will
be) then some strategic prototyping and parametric tweaking will often
let you do even better.
 
 
 
 
 
 
 
  
 Next: Organization of this Book
 Up: Overview of Beowulf Design
 Previous: Building Price and Performance
     Contents 
Robert G. Brown
2004-05-24