In order to achieve the best scaling behavior, we can see from the previous chapter that we want to maximize the parallel fraction of a program (the part that can be split up) and minimize the serial fraction (which cannot). We also want to maximize the time spent doing work in parallel on each node and minimize the time required to communicate between nodes, and we want to avoid wasting time by having some nodes sit idle waiting for other nodes to finish.
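To keep that goal quantitative, it is worth restating Amdahl's law here (the notation below, with $f$ the parallel fraction and $P$ the number of nodes, is one common choice and not necessarily the previous chapter's):
\begin{displaymath}
S(P) \;=\; \frac{1}{(1 - f) + f/P} \;\le\; \frac{1}{1 - f}
\end{displaymath}
The serial fraction $1 - f$ alone caps the achievable speedup $S(P)$ no matter how many nodes we buy, and communication and idle time only subtract further from this ideal.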
However, we must be cautious and clearly define our real goals, which in general are not to ``achieve the best scaling behavior'' (unless one is a computer scientist studying the abstract problem, of course). More commonly, the goal in practice is to ``get the most work done in the least amount of time for a fixed budget''. When economic constraints enter the picture, one has to carefully weigh trade-offs between the computational speed of the nodes, the speed and latency of the network, the size, speed and latency of memory, and the size, speed and latency of any hard storage subsystem that might be required. Virtually any conceivable combination of system and network speed can turn out to be cost-benefit optimal and get the most work done for a given budget and parallel task.
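A toy cost model shows how this can happen. The sketch below (every price, rate, and the parallel fraction are invented for illustration, not measurements of any real hardware) compares two hypothetical node types bought from the same fixed budget, using Amdahl's law for the aggregate rate:
\begin{verbatim}
/* Toy cost-benefit model: which node type gets the most work done
 * for a fixed budget?  Every number below is hypothetical. */
#include <stdio.h>

/* Amdahl speedup on p nodes for parallel fraction f */
static double speedup(double f, int p)
{
    return 1.0 / ((1.0 - f) + f / p);
}

int main(void)
{
    double budget = 50000.0; /* dollars (made up)           */
    double f      = 0.95;    /* parallel fraction (made up) */

    /* Two invented node types: fast/expensive vs. slow/cheap. */
    struct { const char *name; double cost, rate; } node[2] = {
        { "fast nodes", 5000.0, 2.0 }, /* 2.0 work units/sec each */
        { "slow nodes", 1000.0, 0.7 }  /* 0.7 work units/sec each */
    };

    for (int i = 0; i < 2; i++) {
        int p = (int)(budget / node[i].cost); /* nodes we can buy */
        /* Aggregate rate: one node's rate times Amdahl speedup.  */
        printf("%s: %2d of them, effective rate %5.2f units/sec\n",
               node[i].name, p, node[i].rate * speedup(f, p));
    }
    return 0;
}
\end{verbatim}
With these particular numbers the expensive nodes win (roughly 13.8 versus 10.1 work units per second); change $f$ from 0.95 to 0.99 and the cheap nodes win instead. Which configuration is optimal depends entirely on the task, which is precisely the point.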
As the discussion proceeds, it will become clear why successful beowulf design focuses on the problem as much as it does on the hardware. One perfectly reasonable conclusion a reader might draw from this chapter is that understanding the nuances of computer hardware and their effect on program speed is ludicrously difficult, and that only individuals with some sort of obsessive-compulsive personality disorder would ever try it. It is so much simpler to just measure the performance of any given hardware platform on your program.
When you achieve this satori, bravo! However, be warned that the wisest course is to both measure performance and understand at least a bit about how that measured performance is likely to vary when you vary things like the clock speed of your CPU, the CPU's manufacturer, the kind and speed of the memory subsystem, and the network.
The variations can be profound. As we'll see, when we double the size of a data set (say, the vectors being multiplied within a program), the program can take twelve or more times as long to complete for certain ranges of sizes. You could naively make your measurement where performance is great, expecting it to remain great in production for much larger vectors or matrices. You could then expend large sums of money buying nodes, fail miserably to accomplish the work you expected to accomplish for that sum, and (fill in your own favorite personal disaster) get fired, not get tenure, lose your grant, go broke, lose the respect of your children and pets. Alternatively, you could make your measurement for large matrices, assume that fast memory systems are essential, and spend a great deal on relatively few of them, only to find that by the time the problem is split up it would run just as fast on nodes with slower memory that cost 1/10 as much (so you could have bought 10x as many).
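The usual culprit is the memory hierarchy: once the working set no longer fits in a given level of cache, every access falls through to slower memory and the time per element jumps. Here is a minimal probe of that effect (my own sketch, not taken from any benchmark suite); timed on real hardware, the nanoseconds per element typically climb in steps as the three vectors together outgrow each cache level:
\begin{verbatim}
/* Time an element-wise vector multiply at doubling sizes.  On most
 * machines the ns/element figure jumps when the working set
 * (three vectors of doubles) outgrows a cache level. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void)
{
    for (size_t n = 1024; n <= (1 << 23); n *= 2) {
        double *a = malloc(n * sizeof *a);
        double *b = malloc(n * sizeof *b);
        double *c = malloc(n * sizeof *c);
        if (!a || !b || !c) return 1;
        for (size_t i = 0; i < n; i++) { a[i] = 1.0; b[i] = 2.0; }

        /* Repeat so short runs aren't lost in timer granularity. */
        size_t reps = (1 << 26) / n;
        clock_t t0 = clock();
        for (size_t r = 0; r < reps; r++)
            for (size_t i = 0; i < n; i++)
                c[i] = a[i] * b[i];
        double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;

        if (c[n / 2] != 2.0) return 1; /* keep the optimizer honest */
        printf("n = %8zu: %6.2f ns/element\n",
               n, 1e9 * secs / ((double)n * (double)reps));
        free(a); free(b); free(c);
    }
    return 0;
}
\end{verbatim}
It is the whole curve of ns/element versus $n$, not any single point on it, that you want in hand before sizing a purchase.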
This isn't as hard to understand as it may now seem. I'll try to explain how and why this can occur and illustrate it with enough pictures, graphs, and examples that it becomes clear. Once you understand what this chapter has to offer, you'll understand how to study your problem in a way that is unlikely to produce embarrassing and expensive mistakes in your final beowulf design.