PVM daemons communicate with one another through UDP sockets. UDP is an unreliable delivery service which can lose, duplicate or reorder packets, so an acknowledgment and retry mechanism is used. UDP also limits packet length, so PVM fragments long messages.
We considered TCP, but three factors make it inappropriate. First is scalability. In a virtual machine of N hosts, each pvmd must have connections to the other N - 1. Each open TCP connection consumes a file descriptor in the pvmd, and some operating systems limit the number of open files to as few as 32, whereas a single UDP socket can communicate with any number of remote UDP sockets. Second is overhead. N pvmds need N(N - 1)/2 TCP connections, which would be expensive to set up. The PVM/UDP protocol is initialized with no communication. Third is fault tolerance. The communication system detects when foreign pvmds have crashed or the network has gone down, so we need to set timeouts in the protocol layer. The TCP keepalive option might work, but it's not always possible to get adequate control over the parameters.
The packet header is shown in the figure below. Multibyte values are sent in (Internet) network byte order (most significant byte first).
Figure: Pvmd-pvmd packet header
The source and destination fields hold the TIDs of the true source and final destination of the packet, regardless of the route it takes. Sequence and acknowledgment numbers start at 1 and increment to 65535, then wrap to zero.
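As a rough illustration of network byte order and sequence-number wraparound, the following C sketch packs a header into a byte buffer. The struct name `pkt_hdr`, the field order, the field widths, and the 16-byte size are assumptions made for this example (the header figure is not reproduced here); only the big-endian encoding and the wrap from 65535 to zero come from the text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory header; layout is assumed, not taken from PVM. */
struct pkt_hdr {
    uint32_t dst;    /* TID of the final destination */
    uint32_t src;    /* TID of the true source */
    uint16_t seq;    /* sequence number */
    uint16_t ack;    /* acknowledgment number */
    uint8_t  flags;  /* SOM, EOM, DAT, ACK, FIN bits */
};

enum { PKT_HDR_LEN = 16 };  /* assumed wire size, including padding */

/* Store a 32-bit value most significant byte first. */
static void put32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v >> 24);
    p[1] = (uint8_t)(v >> 16);
    p[2] = (uint8_t)(v >> 8);
    p[3] = (uint8_t)v;
}

/* Store a 16-bit value most significant byte first. */
static void put16(uint8_t *p, uint16_t v)
{
    p[0] = (uint8_t)(v >> 8);
    p[1] = (uint8_t)v;
}

/* Pack h into buf; returns bytes written, or 0 if buf is too small. */
size_t pkt_hdr_pack(const struct pkt_hdr *h, uint8_t *buf, size_t len)
{
    assert(h != NULL && buf != NULL);
    if (len < PKT_HDR_LEN)
        return 0;
    put32(buf, h->dst);
    put32(buf + 4, h->src);
    put16(buf + 8, h->seq);
    put16(buf + 10, h->ack);
    buf[12] = h->flags;
    buf[13] = buf[14] = buf[15] = 0;  /* assumed padding bytes */
    return PKT_HDR_LEN;
}

/* Next sequence number: counts up to 65535, then wraps to zero. */
uint16_t seq_next(uint16_t seq)
{
    return (uint16_t)(seq + 1u);
}
```

Writing the bytes out explicitly keeps the encoding independent of the host's native byte order, which matters when the virtual machine mixes architectures.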
SOM (EOM) - Set for the first (last) fragment of a message. Intervening fragments have both bits cleared. They are used by tasks and pvmds to delimit message boundaries.
DAT - If set, data is contained in the packet, and the sequence number is valid. The packet, even if zero length, must be delivered.
ACK - If set, the acknowledgment number field is valid. This bit may be combined with the DAT bit to piggyback an acknowledgment on a data packet.
FIN - The pvmd is closing down the connection. A packet with FIN bit set (and DAT cleared) begins an orderly shutdown. When an acknowledgment arrives (ACK bit set and ack number matching the sequence number from the FIN packet), a final packet is sent with both FIN and ACK set. If the pvmd panics (for example, on a trapped segment violation), it tries to send a packet with FIN and ACK set to every peer before it exits.
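The flag rules above can be sketched in C as follows. The bit values, the `FF_` names, and both function names are hypothetical, chosen only for this example; the behavior follows the descriptions of SOM, EOM, DAT, ACK, and FIN given above.

```c
#include <assert.h>

/* Assumed bit assignments; the real values are not given in the text. */
enum {
    FF_SOM = 0x01,  /* first fragment of a message */
    FF_EOM = 0x02,  /* last fragment of a message */
    FF_DAT = 0x04,  /* packet carries data, sequence number valid */
    FF_FIN = 0x08,  /* connection is closing */
    FF_ACK = 0x10   /* acknowledgment number valid */
};

enum fin_kind { FIN_NONE, FIN_BEGIN, FIN_FINAL };

/* Flags for fragment i (0-based) of a message split into n fragments. */
unsigned frag_flags(unsigned i, unsigned n)
{
    assert(n > 0 && i < n);
    unsigned f = FF_DAT;
    if (i == 0)
        f |= FF_SOM;      /* first fragment */
    if (i == n - 1)
        f |= FF_EOM;      /* last fragment */
    return f;             /* intervening fragments carry neither bit */
}

/* Classify a packet's role in connection shutdown. */
enum fin_kind classify_fin(unsigned flags)
{
    if ((flags & FF_FIN) && (flags & FF_ACK))
        return FIN_FINAL; /* final packet, or the panic notification */
    if ((flags & FF_FIN) && !(flags & FF_DAT))
        return FIN_BEGIN; /* starts an orderly shutdown */
    return FIN_NONE;
}
```

Note that a message small enough to fit in one packet is both the first and the last fragment, so `frag_flags(0, 1)` sets SOM and EOM together.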
The state of a connection to another pvmd is kept in its host table entry. The protocol driver uses the following fields of struct hostd:
Field         Meaning
-----------------------------------------------------
hd_hostpart   TID of pvmd
hd_mtu        Max UDP packet length to host
hd_sad        IP address and UDP port number
hd_rxseq      Expected next packet number from host
hd_txseq      Next packet number to send to host
hd_txq        Queue of packets to send
hd_opq        Queue of packets sent, awaiting ack
hd_nop        Number of packets in hd_opq
hd_rxq        List of out-of-order received packets
hd_rxm        Buffer for message reassembly
hd_rtt        Estimated smoothed round-trip time
-----------------------------------------------------
The figure below shows the host send and outstanding-packet queues. Packets waiting to be sent to a host are queued in FIFO hd_txq. Packets are appended to this queue by the routing code, described in Section . No receive queues are used; incoming packets are passed immediately through to other send queues or reassembled into messages (or discarded). Incoming messages are delivered to a pvmd entry point as described in Section .
Figure: Host descriptors with send queues
The protocol allows multiple outstanding packets to improve performance over high-latency networks, so two more queues are required. hd_opq holds a per-host list of unacknowledged packets, and global opq lists all unacknowledged packets, ordered by time to retransmit. hd_rxq holds packets received out of sequence until they can be accepted.
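To illustrate how out-of-sequence packets might be held and later released, here is a minimal C sketch. It tracks only sequence numbers, not packet contents. The names `rx_state`, `rx_accept`, and `RXQ_MAX`, the fixed-size array, and the use of 16-bit modular comparison to tell "ahead" from "already seen" are all assumptions for this example, not details taken from the pvmd source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RXQ_MAX 16  /* hypothetical cap on held packets */

struct rx_state {
    uint16_t rxseq;          /* expected next sequence number (cf. hd_rxseq) */
    uint16_t held[RXQ_MAX];  /* sequence numbers held out of order (cf. hd_rxq) */
    int      nheld;
};

/* Remove seq from the held list; returns 1 if it was present. */
static int take_held(struct rx_state *rx, uint16_t seq)
{
    for (int i = 0; i < rx->nheld; i++) {
        if (rx->held[i] == seq) {
            rx->nheld--;
            rx->held[i] = rx->held[rx->nheld];  /* fill the gap */
            return 1;
        }
    }
    return 0;
}

/* Process an arriving sequence number.
 * Returns how many packets became deliverable in order. */
int rx_accept(struct rx_state *rx, uint16_t seq)
{
    assert(rx != NULL);
    uint16_t ahead = (uint16_t)(seq - rx->rxseq);  /* distance modulo 65536 */

    if (ahead >= 0x8000u)
        return 0;                       /* behind the window: old duplicate */
    if (ahead != 0) {
        for (int i = 0; i < rx->nheld; i++)
            if (rx->held[i] == seq)
                return 0;               /* already held */
        if (rx->nheld < RXQ_MAX)
            rx->held[rx->nheld++] = seq;
        return 0;                       /* wait for the missing packets */
    }

    int delivered = 0;
    do {                                /* deliver seq, then drain the list */
        delivered++;
        rx->rxseq = (uint16_t)(rx->rxseq + 1u);
    } while (take_held(rx, rx->rxseq));
    return delivered;
}
```

For example, if packets 3 and 2 arrive before packet 1, the first two calls return 0 and the third returns 3, leaving the expected sequence number at 4.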
The difference in time between sending a packet and getting the acknowledgment is used to estimate the round-trip time to the foreign host. Each update is filtered into the estimate according to the formula .
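One common way to filter samples into a smoothed estimate is an exponentially weighted moving average, sketched below in C. The function name `rtt_update` and the `gain` parameter are hypothetical, and the weight actually used by PVM is not stated here, so any particular gain value should be read as an illustration only.

```c
#include <assert.h>

/* Blend a new round-trip sample into the running estimate.
 * gain is the weight given to the new sample (0 = ignore it,
 * 1 = replace the estimate); its value is an assumption. */
double rtt_update(double estimate, double sample, double gain)
{
    assert(gain >= 0.0 && gain <= 1.0);
    return estimate + gain * (sample - estimate);
}
```

A small gain makes the estimate slow to react to a single delayed acknowledgment, while a large gain tracks recent network conditions more closely at the cost of more jitter in the retry timer.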
When the acknowledgment for a packet arrives, the packet is removed from hd_opq and opq and discarded. Each packet has a retry timer and count, and each is resent until acknowledged by the foreign pvmd. The timer starts at 3 * hd_rtt, and doubles for each retry up to 18 seconds. hd_rtt is limited to nine seconds, and backoff is bounded in order to allow at least 10 packets to be sent to a host before giving up. After three minutes of resending with no acknowledgment, a packet expires.
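The retry schedule described above can be worked through with a short C sketch. The function and macro names are hypothetical, times are in seconds, and applying the 18-second ceiling to the very first timer value is an assumption of this example; the constants themselves (3 * hd_rtt, doubling, 18 seconds, nine seconds, three minutes) are those given in the text.

```c
#include <assert.h>

#define RTT_MAX       9.0   /* ceiling on hd_rtt */
#define RETRY_MAX    18.0   /* ceiling on the retry timer */
#define EXPIRE_AFTER 180.0  /* packet expires after three minutes */

/* Initial retry timer for a given round-trip estimate. */
double retry_first(double rtt)
{
    assert(rtt >= 0.0);
    if (rtt > RTT_MAX)
        rtt = RTT_MAX;
    double t = 3.0 * rtt;
    return t > RETRY_MAX ? RETRY_MAX : t;
}

/* Timer for the following retry: doubled, bounded by RETRY_MAX. */
double retry_next(double t)
{
    t *= 2.0;
    return t > RETRY_MAX ? RETRY_MAX : t;
}

/* Count transmissions (original plus retries) made before expiry,
 * assuming no acknowledgment ever arrives. */
int sends_before_expiry(double rtt)
{
    double timer = retry_first(rtt);
    assert(timer > 0.0);          /* a zero timer would never advance */
    double elapsed = 0.0;
    int sends = 0;
    while (elapsed < EXPIRE_AFTER) {
        sends++;                  /* transmit at time `elapsed` */
        elapsed += timer;
        timer = retry_next(timer);
    }
    return sends;
}
```

With hd_rtt at its nine-second ceiling the timer sits at 18 seconds throughout, so the packet is sent at 0, 18, ..., 162 seconds: 10 transmissions, which matches the stated minimum. With a one-second estimate the timers run 3, 6, 12, 18, 18, ... and the same loop counts 12 transmissions.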
If a packet expires as a result of timeout, the foreign pvmd is assumed to be down or unreachable, and the local pvmd gives up on it, calling hostfailentry().