Internet Draft
Internet Engineering Task Force Indra Widjaja
Fujitsu Network Communications
INTERNET DRAFT Anwar Elwalid
Expired in six months Bell Labs, Lucent Technologies
August 1998
MATE: MPLS Adaptive Traffic Engineering
<draft-widjaja-mpls-mate-00.txt>
Status of this Memo
This document is an Internet-Draft. Internet-Drafts are working
documents of the Internet Engineering Task Force (IETF), its areas,
and its working groups. Note that other groups may also distribute
working documents as Internet-Drafts.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as ``work in progress.''
To learn the current status of any Internet-Draft, please check the
``1id-abstracts.txt'' listing contained in the Internet-Drafts Shadow
Directories on ftp.is.co.za (Africa), nic.nordu.net (Europe),
munnari.oz.au (Pacific Rim), ds.internic.net (US East Coast), or
ftp.isi.edu (US West Coast).
Abstract
This document describes an MPLS Adaptive Traffic Engineering scheme,
called MATE. The main goal of MATE is to avoid network congestion by
balancing the loads among the LSPs. MATE makes minimal assumptions
in that the intermediate LSRs are not required to perform traffic
engineering activities or measurements beside forwarding packets.
Moreover, MATE does not impose any particular scheduling, buffer
management, architecture, or a priori traffic characterization, on
the LSRs.
1.0 Introduction
One of the main advantages of MPLS is its efficient support of expli-
cit routing through the use of Label Switched Paths (LSPs) [1][2].
With destination-based forwarding as in the conventional datagram
networks, explicit routing is usually provided by attaching to each
Widjaja & Elwalid Expired in six months [Page 1]
Internet Draft MPLS Adaptive Traffic Engineering August 1998
packet the network-layer address of each node along the explicit
path. This approach makes the overhead in the packet prohibitively
expensive. Switched forwarding using label swapping in MPLS enables
a label to identify an LSP regardless of whether the LSP is esta-
blished through hop-by-hop or explicit routing. In other words, MPLS
imposes the same amount of packet header overhead irrespective of the
use of explicit routing.
Explicit routing has useful applications such as Virtual Private Net-
works (VPNs) and Traffic Engineering. This memo focuses on engineer-
ing the traffic across multiple explicit routes. The purpose of
traffic engineering is to manage network resources as efficiently as
possible so that congestion is minimized. This may be done by avoid-
ing links that are already heavily stressed. Traffic engineering typ-
ically becomes more effective as the network provides more alternate
paths. Traffic engineering also becomes more critical in large Auto-
nomous Systems where maximal operational efficiency should be
emphasized [3].
It is envisioned that traffic engineering is performed only for
traffic that does not require resource reservation, but may need pro-
visioning on an aggregated basis. Examples include best-effort and
differentiated services. This memo proposes that traffic engineering
be done by establishing multiple LSPs between a given ingress LSR and
a given egress LSR in an MPLS domain. The main objective is to have
the ingress LSR distribute traffic across the multiple LSPs effec-
tively so that network resources among the LSPs are equitably util-
ized. This memo describes a scheme called MPLS Adaptive Traffic
Engineering (MATE). It is to be noted that MATE is intended for
quasi-stationary situations where traffic statistics changes rela-
tively slowly (much longer than the round-trip delay between the
ingress and egress LSRs). Recent measurements in the Internet indi-
cate traffic stationarity in at least 5-min intervals [4].
Some of the features of MATE include:
o Traffic engineering on a per-class basis.
o Distributed load-balancing algorithm.
o End-to-end protocol between ingress and egress LSRs.
o No new hardware or protocol requirement on intermediate LSRs,
nor a priori traffic distributions.
o No assumption on the scheduling or buffer management
schemes at the LSR.
o Optimization decision based on LSP congestion measure.
o Minimal packet reordering due to traffic engineering.
o No clock synchronization between two LSRs.
Widjaja & Elwalid Expired in six months [Page 2]
Internet Draft MPLS Adaptive Traffic Engineering August 1998
2.0 Traffic Engineering Using MATE
This section provides the description of MATE. While the specifics
of the scheme are tailored for best-effort service, the basic princi-
ples are intended to be extensible to other services (e.g., differen-
tiated service) as well.
2.1 Overview
It is assumed that M explicit LSPs have been established between an
ingress LSR and an egress LSR when traffic engineering is to be
facilitated between these two LSRs. The value of M is typically
between 2 and 5 (the case with M=1 is not interesting). The M LSPs
may be chosen to be the "best" M paths calculated by a link-state
protocol or configured manually. Explicit LSPs may be established by
RSVP, LDP, or other means. The mechanism to select and establish the
LSPs is beyond the scope of this document.
Once the LSPs are setup, the ingress LSR tries to distribute the
traffic across the LSPs so that the traffic loads are balanced and
congestion is thus minimized. The traffic to be engineered at the
ingress LSR is the aggregated flow (called traffic trunk in [5]) that
shares the same Forwarding Equivalence Class. This document assumes
that the traffic to be engineered consists of the best-effort ser-
vice. Future work will consider differentiated service.
In order to perform effective traffic distribution, the characteris-
tics of the LSPs and the QoS requirements of the traffic must be
known. In general, the LSP characteristics may include the average
packet delay, packet delay variance, loading factor/utilization,
packet loss rate, bottleneck bandwidth, available bandwidth, etc.
For best-effort traffic, there is no explicit QoS requirement, except
that it is desirable to have minimum packet loss rate. Since MATE is
intended to be as flexible as possible, the pertinent LSP charac-
teristics are not assumed to be given quantities, but must be gath-
ered through some measurement. In MATE, the ingress LSR transmits
probe packets periodically to the egress LSR which will return the
probe packets back to the ingress LSR. Based on the information in
the returning probe packets, the ingress LSR is able to compute the
LSP characteristics. Intermediate LSRs are not required to modify
the contents of the probe packets, but such optional capabilities may
be used to refine the measurement process.
MATE employs a four-phase algorithm for load balancing. The first
phase initializes the congestion measure for each LSP. The conges-
tion measure may be a function of delay derivative and packet loss.
Widjaja & Elwalid Expired in six months [Page 3]
Internet Draft MPLS Adaptive Traffic Engineering August 1998
In the second phase, the algorithm tries to equalize the congestion
measure for each LSP. Once the measures are equalized, the algorithm
moves to the third phase. The algorithm monitors each LSP in the
third phase. If an appreciable change in the network state is
detected, the algorithm moves to the fourth phase where the conges-
tion measures are appropriate adjusted. Then, the algorithm goes to
the second phase and the whole process repeats.
2.2 Traffic Filtering and Distribution
MATE performs a two-stage traffic distribution. First, MATE distri-
butes the engineered traffic for a given ingress-egress pair equally
among N bins at the ingress LSR. If the total incoming traffic to be
engineered is of rate R bps, each bin receives an amount of r = R/N
bps. Then, the traffic from the N bins is mapped into M LSPs accord-
ing to the rule defined below.
The engineered traffic can be filtered and distributed into the N
bins in a number of ways. A simple method is to distribute the
traffic on a per-packet basis without filtering. For example, one
may distribute incoming packets at the ingress LSR to the bins in a
round-robin fashion. Although it does not have to maintain any per-
flow state, the method suffers from potentially having to reorder an
excessive amount of packets for a given flow which is undesirable for
TCP applications.
On the other extreme, one may filter the traffic on a per-flow basis
(e.g., based on tuple), and distribute the
flows to the bins such that the loads are similar. Although per-flow
traffic filtering and distribution preserves packet sequencing, it
has to maintain a large number of states to keep track of each active
flow.
Another method, which is described in [6], is to filter the incoming
packets by using a hash function on the IP field(s). The fields can
be based on the source and destination address pair, or other combi-
nations. A typical hash function is based on CRC. The purpose of
the hash function is to randomize the address space to prevent hot
spots. Traffic can be distributed into the N bins by applying a
modulo-N operation on the hash space. Note that packet sequence for
each flow is maintained with this method.
After the engineered traffic is distributed into the N bins, a second
function maps each bin to the corresponding LSP. The rule for the
second function is very simple. If LSP(i) is to receive twice as
much traffic as LSP(j), then LSP(i) should receive traffic from twice
Widjaja & Elwalid Expired in six months [Page 4]
Internet Draft MPLS Adaptive Traffic Engineering August 1998
as many bins as LSP(j). The value N should be chosen so that the
smallest traffic that can be shifted, which is equal to 1/N of the
total incoming traffic, has reasonable granularity.
2.3 Traffic Measurement
The efficacy of any traffic engineering scheme depends crucially on
the traffic measurement process. MATE does not require each LSR to
perform traffic measurement. Only the ingress and egress LSRs are
required to participate in the measurement process.
For the purpose of balancing the loads on each LSP, the available
bandwidth appears to be a desirable metric to measure. The methods
for measuring the available bandwidth of a given path have been
described in the past (e.g., see [7] and [8]). Based on our experi-
ence, this metric turns out to be difficult to measure accurately
using minimal requirements assumed in MATE.
To this end, we found that packet delays across the LSPs are the most
reliable known metric. The delay of a packet on an LSP can be
obtained by transmitting a probe packet from the ingress LSR to the
egress LSR. The probe packet is time-stamped at the ingress LSR at
time T1 and recorded at the egress LSR at time T2. If the ingress's
clock is faster than the egress's clock by Td, then the total packet
delay (i.e, queueing time, propagation time, and processing time) is
T2-T1+Td. A bunch of probe packets can easily yield an estimate on
the mean packet delay Tm+Td, where Tm is the long-term average of
T2-T1. The reliability of the estimator can be evaluated by
bootstrapping or jacknife to give the confidence interval for the
mean delay. One important point to note is that the value of Td is
not required when only the marginal delay is needed. MATE exploits
this property by relying only on marginal delays rather than absolute
delays. Therefore, clock synchronization is not necessary.
Packet loss probability is another metric that can be estimated by a
bunch of probe packets. In general, only reasonably high packet loss
rates can be reliably observed. Packet loss probability can be
estimated by encoding a sequence number in the probe packet to notify
the egress LSR how many probe packets have been transmitted by the
ingress LSR, and another field in the probe packet to indicate how
many probe packets have been received by the egress LSR. When a
probe packet returns, the ingress LSR is able to estimate the one-way
packet loss probability based on the number of probe packets that has
been transmitted and the number that has been received. The advan-
tage of this approach is that it is resilient to losses in the
reverse direction.
Widjaja & Elwalid Expired in six months [Page 5]
Internet Draft MPLS Adaptive Traffic Engineering August 1998
2.4 Objective Function
For best-effort traffic, our objective is to avoid congested LSPs.
The congestion measure can be characterized by the sensitivity of the
LSP. An LSP is qualitatively said to be sensitive if a perturbation
in the load changes the mean delay significantly. MATE computes the
derivative of the mean packet delay with respect to the offered load
to quantify the sensitivity of an LSP. For a given LSP, the deriva-
tive increases as the load increases. This is evident since the mean
delay is an increasing and convex function with respect to the load.
The derivative can be derived by observing the mean delays at two
different loads. One advantage for using the sensitivity measure is
that the fixed propagation delay has no effect on the derivative.
An LSP is said to be lossless if each LSR along the LSP never dis-
cards packets. Otherwise, the LSP is said to be lossy. Although the
lossless case may not be realistic, it is much simpler to understand.
Here, MATE performs load balancing by simply equalizing the deriva-
tives across the M LSPs. The implementation is much simplified since
only the relative derivatives should be known. MATE essentially
applies a Min-Max operation; that is minimizing the load on an LSP
that has maximum congestion. The objective is also equivalent to
minimizing the sum of all LSP delays. MATE can be easily made to
respond to changes in network state adaptively as long as the traffic
is quasi-stationary. Network changes can be tracked by periodically
measuring packet delays. If the changes in delays exceed a certain
threshold, the algorithm minimized the most congested LSP by re-
equalizing the derivatives.
For the lossy case, the packet loss information must be incorporated
into the load balancing process. The lossy case will be treated in
the later section.
2.5 State Table
MATE keeps a state table for each traffic class that will be
engineered between an ingress LSR and an egress LSR. The state table
needs to maintain the mean delay and derivative of each LSP. One con-
venient realization is depicted below:
Widjaja & Elwalid Expired in six months [Page 6]
Internet Draft MPLS Adaptive Traffic Engineering August 1998
+-----------------+--------------+-------------+---------+
| LSP | D_old | D_new | Inc |
+-----------------+--------------+-------------+---------+
| 1 | | | |
| 2 | | | |
| 3 | | | |
| ... | | | |
| M | | | |
+-----------------+--------------+-------------+---------+
D_old (D_new) denotes the mean packet delay before (after) a traffic
shift is made to the LSP. Inc (for Increments) denotes the number of
bins for which traffic was recently shifted to/from the LSP.
The derivative for LSP(i) can be simply computed by |D_new(i) -
D_old(i)|/Inc(i).
2.6 Algorithm: Lossless Case
We now present the algorithm in an ideal case with quasi-stationary
traffic, zero packet loss, and perfect measurements.
Initially, the traffic to be engineered at the ingress LSR is sent
through the shortest path, say through LSP(1). The algorithm can be
broken into four phases:
Initially the state table is empty, and phase 1 is run.
Phase 1: Initialize delays and derivatives
Inc(1) <- N
for i from 2 to M
Send probe packets on LSP(1) and LSP(i)
Compute mean packet delays D_old(1) and D_old(i)
Shift [N/M] bins of traffic from LSP(1) to LSP(i)
Inc(1) <- N - [N/M]
Inc(i) <- [N/M]
Send probe packets on LSP(1) and LSP(i)
Compute mean packet delays D_new(1) and D_new(i)
Go to Phase 2
Here, [N/M] indicates the largest integer less than or equal to N/M.
At the end of Phase 1, all columns of the state table are filled up.
The traffic at the ingress LSR is distributed almost equally across all M LSPs.
Phase 2: Equalize derivatives
Widjaja & Elwalid Expired in six months [Page 7]
Internet Draft MPLS Adaptive Traffic Engineering August 1998
do
Let LSP(i) be the LSP that has the highest derivative
Let LSP(j) be the LSP that has the lowest derivative
D_old(i) <- D_new(i)
D_old(j) <- D_new(j)
Shift n bins of traffic from LSP(i) to LSP(j)
Send probe packets on LSP(i) and LSP(j)
Compute mean packet delays D_new(i) and D_new(j)
Inc(i) <- n
Inc(j) <- n
while (|D_old(i) + D_old(j)| > |D_new(i) + D_new(j)|)
Go to Phase 3
The main stopping rule for the optimization process is
when the total LSP delay can no longer be decreased.
This is achieved when all derivatives are approximately equal.
Note that phase 2 actually stops when the total LSP delay overshoots.
A compensation could be made by shifting back
the traffic from LSP(j) to LSP(i) after the overshoot is detected.
The parameter n (n = 1, 2, ..., N)
determines the amount of traffic to be shifted.
A small value of n may cause the "noise" to mask the traffic
that is shifted. On the other hand, a large value of n may
overshoot and overload the LSP. Thus, this value should
be properly engineered, and can be made adaptive.
It is to be understood that MATE cannot shift more bins
than it currently has for a given LSP.
The algorithm terminates if no bin is available from
the LSP with the lowest derivative.
By the mean value theorem, the computed derivative is
equal to the actual derivative between the points before and after
the traffic is shifted.
Phase 3: Track network dynamics
Send probe packets intermittently on each LSP(i)
Compute running mean packet delay D(i)
If (|D(i) - D_new(i)| > alpha * |D_new(i) - D_old(i)|)
Go to Phase 4
The condition in Phase 3 basically detects a network change by moni-
toring an appreciable change in an LSP load.
The amount of probe packets sent at phase 3 should be negligible com-
pared to the engineered traffic that is sent to the same LSP (say,
less than 1 percent). The parameter alpha determines the threshold on
Widjaja & Elwalid Expired in six months [Page 8]
Internet Draft MPLS Adaptive Traffic Engineering August 1998
the change in load.
The mean packet delay on LSP(i) is updated by a low-pass filter:
D(i) <- beta * D(i) + (1 - beta) * Measured delay
where beta is weighting constant between 0 and 1, and D(i) is ini-
tialized to D_new(i).
In order to minimize synchronization due to multiple ingress LSRs
detecting the network change simultaneously, it may be useful to ran-
domize the detection time with a random backoff time. There are other
schemes to eliminate synchronization.
Phase 4: Refresh state table
LSP(i) detects a significant change in its marginal delay
set Flag(i)
if (D(i) > D_new(i))
D_old(i) <- D(i) - 2 * (D_new(i) - D_old(i))
D_new(i) <- D(i)
Inc(i) <- 1
else
D_old(i) <- D(i) - (D_new(i) - D_old(i)) / 2
D_new(i) <- D(i)
Inc(i) <- 1
For each LSP k such that (k != i)
D_old(k) <- D(k) - (D_new(k) - D_old(k))
D_new(k) <- D(k)
Go to Phase 2
When LSP(i) detects a significant change in its marginal delay, it
changes doubles or halfs its derivative depending on whether the
delay has increased or decreased, respectively. The associated flag
is set to ensure that LSP(i) will be considered in Phase 2. D_new's
and D_old's of the rest of the LSPs are updated with the latest
information. All computations should check for illegal values.
2.7 Time Scales
Since MATE relies on measurements to estimate LSP characteristics, it
is important to understand the time scale each phase typically
operates. After the initialization in Phase 1, MATE alternates in
Phase 2, Phase 3 and Phase 4. MATE is designed to execute in Phase 2
on the order of 1 minute. The effectiveness in this phase depends
critically on the stationarity of the traffic. Recent measurements
indicate that traffic in the wide area network is stationary in at
least a 5-minute interval [4]. Phase 3 is intended to detect a
Widjaja & Elwalid Expired in six months [Page 9]
Internet Draft MPLS Adaptive Traffic Engineering August 1998
persistent and appreciable change in the network after the algorithm
has stabilized in Phase 2. In general, Phase 3 operates at a much
longer time scale (on the order of 30 minutes or longer). Phase 4 is
executed to update the State Table immediately after Phase 3. Phase 4
typically operates on the order of a second.
2.8 Monotonous Property of Delay versus Load
To cope with "noise" in the cross/background traffic and measurement
errors, we describe some improvements to the above algorithm by using
the property of delay-load curve.
It is possible for the computed mean delay to violate the monotonous
property of delay versus load. A violation may occur in two cases.
In the first case, the mean delay decreases when traffic is shifted
into an LSP. In another case, the mean delay increases when traffic
is shifted out of an LSP. The violation can be noted by associating
an attribute flag to each delay variable in the state table.
In order to correct an anomaly due to violation, MATE always repeats
the measurement on the pair that is involved in violation until there
is no more violation detected. In the repetition, traffic is shifted
back and forth between the pair until no more violation is detected
in each LSP.
2.9 Inflection
Another observation is that the mean delay is typically a convex
function of the offered load, except beyond a certain point (called
the inflection point) where the LSP becomes heavily congested.
When MATE encounters an inflection point during the iteration pro-
cess, it should adjust D_old so that the higher derivative between
the previous and the current iterations is used. This decision makes
the derivative estimate more conservative when the inflection point
is due to heavy congestion.
2.10 Dynamic Shift
The number of bins to be shifted (Inc) can be dynamically adjusted as
follows. If one of the LSPs to be shifted consists of an LSP whose
new mean delay was invalidated previously due to a violation, then
the number of bins to be shifted will be doubled from the previous
value, subject to the maximum number of bins that can be shifted.
Otherwise, the number of bins to be shifted will be halfed from the
Widjaja & Elwalid Expired in six months [Page 10]
Internet Draft MPLS Adaptive Traffic Engineering August 1998
previous value, subject to the minimum number of bins that is
allowed.
2.11 Lossy Case
The base algorithm described in the previous section assumes an ideal
case where there is no packet loss. This section incorporates the
effect of packet loss on the algorithm by adjusting the derivative
with a loss factor.
MPLS traffic is likely to be dominated by TCP applications, in which
case the packet loss inside an MPLS domain should be tightly con-
trolled since TCP reacts to packet losses. It is well-known that TCP
achieves bandwidth that is inversely proportional to the square root
of the packet loss probability (p), under idealized condition. Note
that this relationship is typically reasonable only in a certain
regime of p (roughly, p is not too close to zero or one).
We use a typical delay-load function which is convex and increasing
in load. The delay-load function blows up when the arrival rate
approaches the service rate, which can be approximated by the simple
formula:
D = A / (mu - lambda), (1)
where mu is the offered traffic, lambda is the average service rate,
and A is some constant. Most queueing systems would have the same
factor in the denominator as the dominant one. The derivative of the
delay with respect to the offered traffic can then be computed by
D' = A / (mu - lambda)^2. (2)
Since the available bandwidth for the lossless case is
BW_lossless = mu - lambda, (3)
we can relate the lossless available bandwidth to the derivative by
BW_lossless^2 = A / D'. (4)
Because the bandwidth achieved by TCP is reduced by a factor of
square root of p when there are packet losses, the available
bandwidth for the lossy case can be correspondingly reduced by the
same factor. In other words, we can write
BW_lossy^2 = BW_lossless^2 * C / p, (5)
Widjaja & Elwalid Expired in six months [Page 11]
Internet Draft MPLS Adaptive Traffic Engineering August 1998
where C is set to the lowest observable loss probability through the
probe packets. If MATE cannot determine the packet loss probability
because of its rare occurrences, it sets p to C. Comparing (4) and
(5), the lossy available bandwidth can be related to the derivative
by
BW_lossy^2 = C / (p * D'). (6)
Therefore, equalizing (p * D') is equivalent to equalizing the avail-
able bandwidth for the lossy case.
It is interesting to see that one may define another metric called
Power which is defined as the ratio of throughput to delay. If the
objective is to equalize power, one will find that this is equivalent
to equalizing (p * D').
For the lossy case, Phase 2 of the algorithm stops when the weighted
delay increases. In Phase 3, it is not possible to measure the
packet loss rate since the probe traffic is low. Instead, a lost
probe packet is accounted for by assigning D(i) to the maximum
measurable delay.
3.0 Protocol Specification
To be added.
4.0 Security Considerations
Security considerations are not addressed in the present document.
5.0 Acknowledgement
The authors thank Cheng Jin for implementing MATE and suggesting
improvements, and Dimitri Stiliadis and Eric Gray for providing
insightful comments.
6.0 References
[1] R. Callon, P. Doolan, N. Feldman, A. Fredette,
G. Swallow and A. Viswanathan,
Widjaja & Elwalid Expired in six months [Page 12]
Internet Draft MPLS Adaptive Traffic Engineering August 1998
"A Framework for Multiprotocol Label Switching,"
Internet Draft <draft-ietf-mpls-framework-02.txt>,
Nov 1997.
[2] E.C. Rosen, A. Viswanathan, and R. Callon,
"Multiprotocol Label Switching Architecture,"
Internet Draft <draft-ietf-mpls-arch-01.txt>, Mar 1998.
[3] D.O. Awduche et. al., "Requirements for Traffic
Engineering over MPLS,"
Internet Draft <draft-awduche-mpls-traffic-eng-00.txt>,
April 1998.
[4] K. Thompson, G.J. Miller, and R. Wilder, "Wide-Area
Internet Traffic Patterns and Characteristics, "
IEEE Networks, vol. 6, no. 6, Dec. 1997.
[5] T. Li and Y. Rekhter, "Provider Architecture for
Differentiated Services and Traffic Engineering (PASTE),"
Internet Draft <draft-li-paste-00.txt>, Jan 1998.
[6] C. Villamizar, "OSPF Optimized Multipath (OSPF-OMP),"
Internet Draft , Mar 1998.
[7] S. Keshav, "A Control-Theoretic Approach to Flow
Control,"
Proc. SIGCOMM'91, pp.3-15, Sep. 1991.
[8] R.L. Carter and M.E. Crovella, "Measuring Bottleneck
Link Speed in Packet-Switched Networks," Technical Report,
BU-CS-96-006, Boston University, Mar. 1996.
Author Information:
Indra Widjaja
Fujitsu Network Communications
4403 Bland Road
Raleigh, NC 27609, USA
Phone: 919 790 2037
Email: indra.widjaja@fnc.fujitsu.com
Anwar Elwalid
Bell Labs Lucent Technologies
Murray Hill, NJ 07974, USA
Phone: 908 582-7589
Email: anwar@lucent.com
Widjaja & Elwalid Expired in six months [Page 13]