Statistics
Next: History of PVM
Up: Debugging the System
Previous: Sane Heap
The pvmd includes several registers and counters to sample certain
events,
such as the number of calls made to select() or
the number of packets refragmented
by the network code.
These values can be computed from a debug log,
but the counters have less adverse impact on
the performance of the pvmd than would generating a huge log file.
The counters can be dumped or reset using the pvm_tickle()
function or the console tickle command.
The code to gather statistics
is normally switched out at compile time.
To enable it,
edit the makefile and add -DSTATISTICS to the compile options.
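A hypothetical sketch of the change follows; the variable names are assumptions for illustration, not PVM's actual makefile layout, and only the -DSTATISTICS flag comes from the text above.

```makefile
# Hypothetical makefile fragment: add -DSTATISTICS to the compile
# options so the statistics-gathering code is switched in.
CFLAGS  =  -O -DSTATISTICS $(ARCHCFLAGS)
```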
Glossary
- asynchronous
-
Not guaranteed to enforce coincidence in clock
time. In an asynchronous communication operation, the sender and
receiver may or may not both be engaged in the operation at the same
instant in clock time.
- atomic
-
Not interruptible. An atomic operation is one that always
appears to have been executed as a unit.
- bandwidth
-
A measure of the speed of information transfer
typically used to quantify the communication capability of
multicomputer and multiprocessor systems. Bandwidth can express point-to-point
or collective (bus) communications rates. Bandwidths are
usually expressed in megabytes per second.
- barrier synchronization
-
An event in which two or more processes
belonging to some implicit or explicit group block until all members
of the group have blocked. They may then all proceed. No member of
the group may pass a barrier until all processes in the group have
reached it.
- big-endian
-
A binary data format in which the most significant byte or bit comes
first. See also little-endian.
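The two byte orders can be detected and converted at run time. A minimal sketch in C (the function names are chosen for this example):

```c
#include <stdint.h>

/* Return 1 if this host stores the most significant byte first
   (big-endian), 0 if it stores the least significant byte first. */
static int is_big_endian(void)
{
    const uint32_t probe = 0x01020304;
    return *(const uint8_t *)&probe == 0x01;
}

/* Reverse the byte order of a 32-bit word, converting between
   big-endian and little-endian representations. */
static uint32_t swap32(uint32_t x)
{
    return (x >> 24) | ((x >> 8) & 0x0000FF00u)
         | ((x << 8) & 0x00FF0000u) | (x << 24);
}
```
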
- bisection bandwidth
-
The rate at which communication can take place
between one half of a computer and the other. A low bisection
bandwidth or a large disparity between the maximum and minimum
bisection bandwidths achieved by cutting the computer's elements in
different ways is a warning that communications bottlenecks may arise
in some calculations.
- broadcast
-
To send a message to all possible recipients. Broadcast
can be implemented as a repeated send
or by a more efficient method, for example,
over a spanning tree where each node propagates the message to its descendants.
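The advantage of the tree method can be seen in a small simulation. In the sketch below (a binomial-tree-style schedule, written for this example), every node that already holds the message forwards it to one new node per round, so the number of rounds grows as ceil(log2(n)) rather than the n - 1 sequential sends of a repeated send:

```c
#define MAXN 64

/* Simulate a broadcast from node 0 over a spanning tree: in each
   round, every node that has the message forwards it to one node
   that does not.  Returns the number of rounds until all n nodes
   (n <= MAXN) have received the message. */
static int broadcast_rounds(int n)
{
    int has[MAXN] = { 0 };
    int rounds = 0, covered = 1;
    has[0] = 1;                      /* the root starts with the message */
    while (covered < n) {
        int senders = covered;       /* nodes holding the message now */
        int next = covered;
        for (int s = 0; s < senders && next < n; s++)
            has[next++] = 1;         /* each sender reaches one new node */
        covered = next;
        rounds++;
    }
    return rounds;
}
```
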
- buffer
-
A temporary storage area in memory. Many methods for
routing messages between processors use buffers at the source and
destination or at intermediate processors.
- bus
-
A single physical communications medium shared by two or more
devices. The network shared by processors in many distributed
computers is a bus, as is the shared data path in many
multiprocessors.
- cache consistency
-
The problem of ensuring that the values
associated with a particular variable in the caches of several
processors are never visibly different.
- channel
-
A point-to-point connection through
which messages can be sent. Programming systems that rely on channels
are sometimes called connection oriented, to distinguish them from
connectionless systems in which messages are sent to
named destinations rather than through named channels.
- circuit
-
A network where connections are established between senders and
receivers,
reserving network resources.
Compare with packet switching.
- combining
-
Joining messages together as they traverse a network.
Combining may be done to reduce the total traffic in the network, to
reduce the number of times the start-up penalty of messaging is
incurred, or to reduce the number of messages reaching a particular
destination.
- communication overhead
-
A measure of the additional workload
incurred in a parallel algorithm as a result of communication between the
nodes of the parallel system.
- computation-to-communication ratio
-
The ratio of the number of
calculations a process does to the total size of the messages it
sends;
alternatively, the ratio
of time spent calculating to time spent communicating,
which depends on the relative speeds of the
processor and communications medium, and on the startup cost and
latency of communication.
- contention
-
Conflict that arises when two or more requests are
made concurrently for a resource that cannot be shared. Processes
running on a single processor may contend for CPU time, or a network
may suffer from contention if several messages attempt to traverse the
same link at the same time.
- context switching
-
Saving the state of one process and replacing
it with that of another. If
little time is required to switch contexts, processor overloading can
be an effective way to hide latency in a message-passing system.
- daemon
-
A special-purpose process that runs on behalf of the system,
for example, the pvmd process or group server task.
- data encoding
-
A binary representation for data objects (e.g., integers, floating-point
numbers) such as XDR or the native format of a microprocessor.
PVM messages can contain data in XDR or native format.
- data parallelism
-
A model of parallel computing in which a single
operation can be applied to all elements of a data structure
simultaneously. Typically, these data structures are arrays, and the
operations act independently on every array element,
or are reduction operations.
- deadlock
-
A situation in which each possible activity is blocked,
waiting on some other activity that is also blocked.
- distributed computer
-
A computer made up of smaller and
potentially independent computers, such as a network of workstations.
This architecture is increasingly studied because of its cost
effectiveness and flexibility. Distributed computers are often
heterogeneous.
- distributed memory
-
Memory that is physically distributed among
several modules. A distributed-memory architecture may appear to users
to have a single address space and a single shared memory or may
appear as disjoint memory made up of many separate address spaces.
- DMA
-
Direct memory access, allowing devices on a bus to access
memory without interfering with the CPU.
- efficiency
-
A measure of hardware utilization, equal to the ratio
of speedup achieved on P processors to P itself.
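In symbols, with T_1 and T_P the execution times on one and P processors:

```latex
E(P) \;=\; \frac{S(P)}{P}, \qquad S(P) \;=\; \frac{T_1}{T_P}
```

Linear speedup corresponds to E(P) = 1.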
- Ethernet
-
A popular LAN technology invented by Xerox.
Ethernet is a 10-Mbit/s CSMA/CD (Carrier Sense Multiple Access with
Collision Detection) bus.
Computers on an Ethernet send data packets directly to one another.
They listen for the network to become idle before
transmitting,
and retransmit in the event that multiple stations simultaneously
attempt to send.
- FDDI
-
Fiber Distributed Data Interface,
a standard for local area networks
using optical fiber and a 100-Mbit/s data rate.
A token is passed among the stations to control access to send
on the network.
Networks can be arranged in topologies such as stars, trees, and
rings.
Independent counter-rotating rings allow the network to continue
to function in the event that a station or link fails.
- FLOPS
-
Floating-Point Operations per Second, a measure of floating-point
computation performance, equal to the rate at which a machine can perform
single-precision floating-point calculations.
- fork
-
To create another copy of a running process; fork returns twice.
Compare with spawn.
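The "returns twice" behavior can be shown in a few lines of POSIX C (the helper name is chosen for this example):

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* fork() returns twice: 0 in the child, the child's PID in the
   parent.  Returns the child's exit status as seen by the parent. */
static int fork_and_reap(void)
{
    pid_t pid = fork();
    if (pid == 0)
        _exit(42);                 /* child: execution resumes here too */
    int status = 0;
    waitpid(pid, &status, 0);      /* parent: reap the copy */
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```
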
- fragment
-
A contiguous part of a message.
Messages are fragmented so they can be sent over a network having
finite maximum packet length.
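The number of fragments needed is a ceiling division, sketched here (the function name is chosen for this example):

```c
/* Number of fragments needed to send len bytes when each fragment
   may carry at most max bytes (ceiling division; len >= 0, max > 0). */
static long fragment_count(long len, long max)
{
    return (len + max - 1) / max;
}
```
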
- group
-
A set of tasks assigned a common symbolic name
for addressing purposes.
- granularity
-
The size of operations done by a process between
communications events. A fine-grained process may perform only a few
arithmetic operations between processing one message and the next,
whereas a coarse-grained process may perform millions.
- heterogeneous
-
Containing components of more than one kind. A
heterogeneous architecture may be one in which some components are
processors, and others memories, or it may be one that uses different
types of processor together.
- hierarchical routing
-
Messages are routed in PVM based on a hierarchical address (a TID).
TIDs are divided into host and local parts
to allow efficient local and global routing.
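The idea of a two-level address can be sketched with bit fields. Note that the 18-bit split and the function names below are assumptions made for this illustration, not PVM's actual TID layout:

```c
#include <stdint.h>

/* Illustrative only: a hierarchical address split into a host part
   and a local part, in the spirit of a PVM TID.  The 18-bit local
   field is an assumption for this example, not PVM's real format. */
#define LOCAL_BITS 18
#define LOCAL_MASK ((1u << LOCAL_BITS) - 1)

static uint32_t make_tid(uint32_t host, uint32_t local)
{
    return (host << LOCAL_BITS) | (local & LOCAL_MASK);
}

static uint32_t tid_host(uint32_t tid)  { return tid >> LOCAL_BITS; }
static uint32_t tid_local(uint32_t tid) { return tid & LOCAL_MASK; }
```

A pvmd can route on the host part alone and leave the local part to the destination host, which is what makes the hierarchy efficient.
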
- HiPPI
-
High Performance Parallel Interface, a point-to-point 100-MByte/s
interface standard used to connect components of high-performance multicomputers.
- host
-
A computer, especially a self-complete one on a network with others.
Also,
the front-end support machine for, for example, a multiprocessor.
- hoster
-
A special PVM task that performs slave pvmd startup for the master
pvmd.
- interval routing
-
A routing algorithm that assigns an integer
identifier to each possible destination and then labels the outgoing
links of each node with a single contiguous interval or window so that
a message can be routed simply by sending it out the link in whose
interval its destination identifier falls.
- interrupt-driven system
-
A type of message-passing system.
When a message is delivered to its
destination process, it interrupts execution of that process and
initiates execution of an interrupt handler,
which may either process the message or store it
for subsequent retrieval. On completion of the
interrupt handler (which may set some flag or send some signal
to denote an available message), the original process resumes
execution.
- IP
-
Internet Protocol,
the Internet standard protocol that enables sending datagrams (blocks
of data) between hosts on interconnected networks.
It provides a connectionless, best-effort delivery service.
IP and the ICMP control protocol are the building blocks for
other protocols such as TCP and UDP.
- kernel
-
A program providing basic services on a computer,
such as managing memory, devices, and file systems.
A kernel may provide minimal service
(as on a multiprocessor node)
or many features (as on a Unix machine).
Alternatively,
a kernel may be
a basic computational building-block (such as a fast Fourier transform)
used iteratively or in parallel to perform a larger computation.
- latency
-
The time taken to service a request or deliver a message
that is independent of the size or nature of the operation. The
latency of a message-passing system is the minimum time to deliver any
message.
- Libpvm
-
The core PVM programming library,
allowing a task to interface with the pvmd and other tasks.
- linear speedup
-
The case when a program runs faster in direct proportion to the
number of processors used.
- little-endian
-
A binary data format in which the least significant byte or bit comes
first. See also big-endian.
- load balance
-
The degree to which work is evenly distributed among
available processors. A program executes most quickly when it is
perfectly load balanced, that is, when every processor has a share of
the total amount of work to perform so that all processors complete
their assigned tasks at the same time. One measure of load imbalance
is the ratio of the difference between the finishing times of the
first and last processors to complete their portion of the calculation
to the time taken by the last processor.
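The imbalance measure described above can be written as:

```latex
\text{imbalance} \;=\; \frac{t_{\mathrm{last}} - t_{\mathrm{first}}}{t_{\mathrm{last}}}
```

where t_first and t_last are the finishing times of the first and last processors to complete; a perfectly balanced run gives 0.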
- locality
-
The degree to which computations done by a processor
depend only on data held in memory that is close to that
processor.
Also,
the degree to which computations done on part of a
data structure depend only on neighboring values.
Locality can be
measured by the ratio of local to nonlocal data accesses, or by the
distribution of distances of, or times taken by, nonlocal accesses.
- lock
-
A device or algorithm the use of which guarantees some type of exclusive access
to a shared resource.
- loose synchronization
-
The situation when
the nodes on a computer
are constrained to intermittently synchronize with each other via some
communication. Frequently, some global computational parameter such as
a time or iteration count provides a natural synchronization
reference. This parameter divides the running program into compute and
communication cycles.
- mapping
-
An allocation of processes to processors; allocating work
to processes is usually called scheduling.
- memory protection
-
Any system that prevents one process from
accessing a region of memory being used by another. Memory protection
is supported in most serial computers by the hardware and the operating system
and in most parallel computers
by the hardware kernel and service kernel of the
processors.
- mesh
-
A topology in which nodes form a regular acyclic
d-dimensional grid, and each edge is parallel to a grid axis and joins
two nodes that are adjacent along that axis. The architecture of many
multicomputers is a two- or three-dimensional mesh; meshes are also the
basis of many scientific calculations, in which each node represents a
point in space, and the edges define the neighbors of a node.
- message ID
-
An integer handle used to reference a message buffer in libpvm.
- message passing
-
A style of interprocess communication in which
processes send discrete messages to one another. Some computer
architectures are called message-passing architectures because they
support this model in hardware, although message passing has often
been used to construct operating systems and network software for
uniprocessors and distributed computers.
- message tag
-
An integer code (chosen by the programmer) bound to a message as it is sent.
Messages can be accepted by tag value and/or source address
at the destination.
- message typing
-
The association of information with a message that
identifies the nature of its contents. Most message-passing systems
automatically transfer information about a message's sender to its
receiver. Many also require the sender to specify a type for the
message, and let the receiver select which types of messages it is
willing to receive.
See message tag.
- MIMD
-
Multiple-Instruction Multiple-Data,
a category of Flynn's
taxonomy in which many instruction streams are concurrently applied to
multiple data sets. A MIMD architecture is one in which heterogeneous
processes may execute at different rates.
- multicast
-
To send a message to many, but not necessarily all
possible recipient processes.
- multicomputer
-
A computer in which processors can execute separate
instruction streams, can have their own private memories, and cannot
directly access one another's memories. Most multicomputers are
disjoint memory machines, constructed by joining nodes (each
containing a microprocessor and some memory) via links.
- multiprocessor
-
A computer in which processors can execute separate
instruction streams, but have access to a single address space. Most
multiprocessors are shared-memory machines, constructed by connecting
several processors to one or more memory banks through a bus or
switch.
- multiprocessor host
-
The front-end support machine of, for example, a multicomputer.
It may serve to boot the multicomputer,
provide network access,
file service, etc.
Utilities such as compilers may run only on the front-end machine.
- multitasking
-
Executing many processes on a single processor. This
is usually done by time-slicing the execution of individual processes
and performing a context switch each time a process is swapped in or
out, but is supported by special-purpose hardware in some computers.
Most operating systems support multitasking, but it can be costly if
the need to switch large caches or execution pipelines makes context
switching expensive in time.
- mutual exclusion
-
A situation in which at most one process can be
engaged in a specified activity at any time. Semaphores are often used
to implement this.
- network
-
A physical communication medium. A network may consist of
one or more buses, a switch, or the links joining processors in a
multicomputer.
- network byte order
-
The Internet standard byte order (big-endian).
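Unix systems convert between host and network order with the standard htonl/ntohl family; a minimal sketch (the wrapper names are chosen for this example):

```c
#include <arpa/inet.h>   /* htonl, ntohl: host <-> network byte order */
#include <stdint.h>

/* Pack a 32-bit value for the wire in network (big-endian) order. */
static uint32_t to_wire(uint32_t host_value)  { return htonl(host_value); }

/* Unpack a 32-bit value received from the wire into host order. */
static uint32_t from_wire(uint32_t net_value) { return ntohl(net_value); }
```

On a big-endian host these calls are identity functions; on a little-endian host they swap bytes.
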
- node
-
Basic compute building block of a multicomputer. Typically a node refers to a
processor with a memory system and a mechanism for communicating with
other processors in the system.
- non-blocking
-
An operation that does not block the execution of
the process using it. The term is usually applied to communications operations,
where it implies that the communicating process may perform other
operations before the communication has completed.
- notify
-
A message generated by PVM on a specified event.
A task may request to be notified when another task exits
or the virtual machine configuration changes.
- NUMA
-
Non-Uniform Memory Access,
an architecture that does
not support constant-time
read and write operations. In most NUMA systems, memory is
organized hierarchically, so that some portions can be read and
written more quickly than others by a given processor.
- packet
-
A quantity of data sent over the network.
- packet switching
-
A network in which limited-length packets are routed independently
from source to destination.
Network resources are not reserved.
Compare with circuit.
- parallel computer
-
A computer system made up of many identifiable
processing units working together in parallel. The term is often used
synonymously with concurrent computer to include both multiprocessor
and multicomputer. The term concurrent is more commonly used in
the United States, whereas the term parallel is more common in Europe.
- parallel slackness
-
Hiding the latency of communication by giving
each processor many different tasks, and having the processors work on the tasks
that are ready while other tasks are blocked (waiting on communication
or other operations).
- PID
-
Process Identifier (in UNIX) that is native to a machine or
operating system.
- polling
-
An alternative to interrupting in a communication system.
A node inspects its communication hardware (typically a flag bit) to see
whether information has arrived or departed.
- private memory
-
Memory that appears to the user to be divided
between many address spaces, each of which can be accessed by only one
process. Most operating systems rely on some memory protection
mechanism to prevent one process from accessing the private memory of
another; in disjoint-memory machines, the problem is usually finding a
way to emulate shared memory using a set of private memories.
- process
-
An address space, I/O state, and one or more threads of program control.
- process creation
-
The act of forking or spawning a new process. If
a system permits only static process creation, then all processes are
created at the same logical time, and no process may interact with any
other until all have been created. If a system permits dynamic
process creation, then one process can create another at any time.
Most first and second generation multicomputers only supported static
process creation, while most multiprocessors, and most operating
systems on uniprocessors, support dynamic process creation.
- process group
-
A set of processes that can be treated as a single
entity for some purposes, such as synchronization and broadcast or
multicast operations. In some parallel programming systems there is
only one process group, which implicitly contains all processes; in
others, programmers can assign processes to groups statically when
configuring their program, or dynamically by having processes create,
join and leave groups during execution.
- process migration
-
Changing the processor responsible for
executing a process during the lifetime of that process. Process
migration is sometimes used to dynamically load balance a program or
system.
- pvmd
-
PVM daemon,
a process that serves as a message router and virtual machine coordinator.
One pvmd runs on each host of a virtual machine.
- race condition
-
A situation in which the result of operations
being executed by two or more processes depends on the order in which
those processes execute, for example, if two processes A and B are to
write different values a and b to the same variable.
- randomized routing
-
A routing technique in which each message is
sent to a randomly chosen node, which then forwards it to its
final destination. Theory and practice show that this can greatly
reduce the amount of contention for access to links in a
multicomputer.
- resource manager
-
A special task that manages other tasks and the virtual machine configuration.
It intercepts requests to create/destroy tasks and add/delete hosts.
- route
-
The act of moving a message from its source to its
destination. A routing algorithm is a rule for deciding, at any
intermediate node, where to send a message next; a routing technique
is a way of handling the message as it passes through individual
nodes.
- RTFM
-
Read The Fine Manual.
- scalable
-
Capable of being increased in size;
more important, capable of delivering an increase in performance
proportional to an increase in size.
- scheduling
-
Deciding the order in which the calculations in
a program are to be executed and by which processes. Allocating
processes to processors is usually called mapping.
- self-scheduling
-
Automatically allocating work to processes. If
T tasks are to be done by P processors, and P < T, then they may be
self-scheduled by keeping them in a central pool from which each
processor claims a new job when it finishes executing its old one.
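The central-pool scheme can be sketched with threads standing in for processors; an atomic counter serves as the pool, and each worker claims the next job as soon as it finishes its previous one (all names below are chosen for this example):

```c
#include <pthread.h>
#include <stdatomic.h>

#define NTASKS   100
#define NWORKERS 4

static atomic_int next_task;       /* central pool: index of next job */
static long partial[NWORKERS];     /* per-worker results */

/* Each worker claims the next unclaimed task until the pool is empty;
   a fast worker simply claims more tasks, balancing the load. */
static void *worker(void *arg)
{
    long *sum = arg;
    for (;;) {
        int t = atomic_fetch_add(&next_task, 1);
        if (t >= NTASKS)
            break;
        *sum += t;                 /* stand-in for real work on task t */
    }
    return 0;
}

static long run_pool(void)
{
    pthread_t tid[NWORKERS];
    long total = 0;
    atomic_store(&next_task, 0);
    for (int i = 0; i < NWORKERS; i++) {
        partial[i] = 0;
        pthread_create(&tid[i], 0, worker, &partial[i]);
    }
    for (int i = 0; i < NWORKERS; i++) {
        pthread_join(tid[i], 0);
        total += partial[i];
    }
    return total;
}
```
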
- semaphore
-
A data type for controlling concurrency.
A semaphore is initialized to an integer value.
Two operations may be applied to it: signal increments the
semaphore's value by one, and wait blocks its caller until the
semaphore's value is greater than zero, then decrements the semaphore.
A binary semaphore is one that can only take on the values 0 and 1.
Any other synchronization primitive can be built in terms of semaphores.
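The signal/wait semantics can be shown with POSIX semaphores (Linux sem_init; the helper name and counts are chosen for this example):

```c
#include <semaphore.h>

/* Initialize a counting semaphore to `value`, then try to acquire it
   `attempts` times without blocking.  Returns the number of
   successful waits, which cannot exceed the initial value. */
static int try_acquire(unsigned value, int attempts)
{
    sem_t s;
    int got = 0;
    sem_init(&s, 0, value);        /* value = units currently available */
    for (int i = 0; i < attempts; i++)
        if (sem_trywait(&s) == 0)  /* wait: decrement if value > 0 */
            got++;
    sem_post(&s);                  /* signal: increment the value */
    sem_destroy(&s);
    return got;
}
```
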
- sequential bottleneck
-
A part of a computation for which there is
little or no parallelism.
- sequential computer
-
Synonymous with a Von Neumann computer, that is,
a "conventional" computer in which only one processing element works
on a problem at a given time.
- shared memory
-
Memory that appears to the user to be contained in
a single address space and that can be accessed by any process. In a
uniprocessor or multiprocessor there is typically a single memory
unit, or several memory units interleaved to give the appearance of a
single memory unit.
- shared variables
-
Variables to which two or more processes have
access, or a model of parallel computing in which interprocess
communication and synchronization are managed through such variables.
- signal
-
An asynchronous notification delivered to a process by the operating
system, for example on an interrupt or the termination of a child
process.
- SIMD
-
Single-Instruction Multiple-Data,
a category of Flynn's
taxonomy in which a single instruction stream is concurrently applied
to multiple data sets. A SIMD architecture is one in which homogeneous
processes synchronously execute the same instructions on their own
data, or one in which an operation can be executed on vectors of fixed
or varying size.
- socket
-
An endpoint for network communication.
For example, on a Unix machine, a TCP/IP connection may terminate in
a socket, which can be read or written through a file descriptor.
- space sharing
-
Dividing the resources of a parallel computer among
many programs so they can run simultaneously without affecting one
another's performance.
- spanning tree
-
A tree containing a subset of the edges in a graph
and including every node in that graph.
A spanning tree can always be
constructed so that its depth (the greatest distance between its root
and any leaf) is no greater than the diameter of the graph. Spanning
trees are frequently used to implement broadcast operations.
- spawn
-
To create a new process or PVM task, possibly different from the parent.
Compare with fork.
- speedup
-
The ratio of two program execution times, particularly
when times are from execution on 1 and P nodes of the same computer.
Speedup is usually discussed as a function of the number of
processors, but is also a function (implicitly) of the problem size.
- SPMD
-
Single-Program Multiple-Data,
a category sometimes added
to Flynn's taxonomy to describe programs made up of many instances of
a single type of process, each executing the same code independently.
SPMD can be viewed either as an extension of SIMD or as a restriction
of MIMD.
- startup cost
-
The time taken to initiate any transaction with some
entity. The startup cost of a message-passing system, for example, is
the time needed to send a message of zero length to nowhere.
- supercomputer
-
A time-dependent term that refers to the class of
most powerful computer systems worldwide at the time of reference.
- switch
-
A physical communication medium containing nodes that
perform only communications functions. Examples include crossbar
switches, in which N + M buses cross orthogonally at N x M switching points
to connect N objects of one type to M objects of another, and
multistage switches in which several layers of switching nodes connect
objects of one type to objects of another type.
- synchronization
-
The act of bringing two or more processes to
known points in their execution at the same clock time. Explicit
synchronization is not needed in SIMD programs (in which every
processor either executes the same operation as every other or does
nothing) but is often necessary in SPMD and MIMD programs. The time
wasted by processes waiting for other processes to synchronize with
them can be a major source of inefficiency in parallel programs.
- synchronous
-
Occurring at the same clock time. For example, if a
communication event is synchronous, then there is some moment at which
both the sender and the receiver are engaged in the operation.
- task
-
The smallest component of a program addressable in PVM.
A task is generally a native "process" to the machine on which it
runs.
- tasker
-
A special task that manages other tasks on the same host.
It is the parent of the target tasks,
allowing it to manipulate them (e.g., for debugging or other instrumentation).
- TCP
-
Transmission Control Protocol,
a reliable host-host stream protocol for packet-switched interconnected
networks such as IP.
- thread
-
A thread of program control sharing resources (memory, I/O state)
with other threads.
A lightweight process.
- TID
-
Task Identifier,
an address used in PVM for tasks, pvmds, and multicast groups.
- time sharing
-
Sharing a processor among multiple programs.
Time sharing attempts to better utilize a CPU by overlapping I/O in
one program with computation in another.
- trace scheduling
-
A compiler optimization technique that vectorizes
the most likely path through a program as if it were a single basic
block, includes extra instructions at each branch to undo any ill
effects of having made a wrong guess, vectorizes the next most likely
branches, and so on.
- topology
-
The configuration of
the processors in a multicomputer
and the circuits in a switch.
Among the most common topologies are
the mesh, the hypercube, the butterfly,
the torus, and the shuffle exchange network.
- tuple
-
An ordered sequence of fixed length of values of arbitrary
types. Tuples are used for both data storage and interprocess
communication in the generative communication paradigm.
- tuple space
-
A repository for tuples in a generative communication
system. Tuple space is an associative memory.
- UDP
-
User Datagram Protocol,
a simple protocol allowing datagrams (blocks of data) to be sent
between hosts interconnected by networks such as IP.
UDP can duplicate or lose messages,
and imposes a length limit of 64 kbytes.
- uniprocessor
-
A computer containing a single processor. The term
is generally synonymous with scalar processor.
- virtual channel
-
A logical point-to-point connection between two
processes. Many virtual channels may time share a single link to hide
latency and to avoid deadlock.
- virtual concurrent computer
-
A computer system that is programmed
as a concurrent computer of some number of nodes P but that is
implemented either on a real concurrent computer of some number of
nodes less than P or on a uniprocessor running software to emulate the
environment of a concurrent machine. Such an emulation system is said
to provide virtual nodes to the user.
- virtual cut-through
-
A technique for routing messages in which the
head and tail of the message both proceed as rapidly as they can. If
the head is blocked because a link it wants to cross is being used by
some other message, the tail continues to advance, and the message's
contents are put into buffers on intermediate nodes.
- virtual machine
-
A multicomputer composed of separate (possibly self-complete) machines
and a software backplane to coordinate operation.
- virtual memory
-
Configuration in which
portions of the address space are kept on a secondary medium,
such as a disk or auxiliary memory.
When a reference is made to a location not resident in main memory, the
virtual memory manager
loads the location from secondary storage before the access completes.
If no space is available in main memory,
data is written to secondary storage to make some available.
Virtual memory is used by almost all uniprocessors and
multiprocessors to increase apparent memory size,
but is not available on some array processors and
multicomputers.
- virtual shared memory
-
Memory that appears to users to constitute
a single address space, but that is actually physically disjoint.
Virtual shared memory is often implemented using some combination of
hashing and local caching.
- Von Neumann architecture
-
Any computer that does
not employ concurrency or parallelism. Named after John Von Neumann
(1903-1957), who is credited with the invention of the basic
architecture of current sequential computers.
- wait context
-
A data structure used in the pvmd to hold state when a thread
of operation must be suspended,
for example, when calling a pvmd on another host.
- working set
-
Those values from shared memory that a process has
copied into its private memory, or those pages of virtual memory being
used by a process. Changes a process makes to the values in its
working set are not automatically seen by other processes.
- XDR
-
eXternal Data Representation,
an Internet standard data encoding (essentially big-endian
integers and IEEE-format floating-point numbers).
PVM converts data to XDR format to allow communication between hosts
with different native data formats.