Internet Draft
Internet Draft S. Berson
Expiration: May 1998 ISI
File: draft-berson-classy-approach-01.txt S. Vincent
ISI
Aggregation of Internet Integrated Services State
November 21, 1997
Status of Memo
This document is an Internet-Draft. Internet-Drafts are working
documents of the Internet Engineering Task Force (IETF), its areas,
and its working groups. Note that other groups may also distribute
working documents as Internet-Drafts.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
To learn the current status of any Internet-Draft, please check the
linebreak "1id-abstracts.txt" listing contained in the Internet-
Drafts Shadow Directories on ds.internic.net (US East Coast),
nic.nordu.net (Europe), ftp.isi.edu (US West Coast), or munnari.oz.au
(Pacific Rim).
Abstract
The Internet Integrated Services (IIS) architecture[2] has a
fundamental scaling problem in that per flow state is maintained at
all routers and end-systems supporting a flow. This paper examines
the use of aggregation as a technique to reduce the amount of state
needed to provide IIS. In our approach, routers at the edge of a
region doing aggregation keep detailed IIS state, while in the
interior of this region, routers keep a greatly reduced amount of
state. Packets will be tagged at the edge with scheduling
information that will be used in place of the detailed IIS state.
The aggregation scheme described will allow large scale deployment of
IIS without overloading routers with state and associated processing.
Berson, Vincent Expiration: May 1998 [Page 1]
Internet Draft State Aggregation November 1997
1. Introduction
In order to deploy Internet Integrated Services (IIS), additional
resources are needed in the routers to keep state and to process
packets. As currently defined [4,2], this state is on a per session
basis, where a session is a unicast or multicast destination and
optionally a port. IIS state is stored at all routers between the
source(s) and destination(s) of a session. Thus, widespread use of
IIS would mean a large amount of state at routers in the core of the
Internet. This large amount of state and related processing could
overwhelm routers, so IIS state models that do not require per-
session state at all routers need to be explored.
Several types of per-session state are used with integrated services.
First, there is scheduling state that consists of the different
traffic service queues for packet forwarding. Second, there is
packet classifier state. Classifier state is used to assign each
arriving packet to a packet scheduling queue for forwarding. Third,
there is the state for the setup protocol, e.g. state carried in RSVP
PATH and RESV messages. Finally, current multicast routing
algorithms also store state, either per source and session (e.g.
DVMRP, PIM-DM) or per session shared-tree (e.g. CBT, PIM-SM). While
approaches to aggregation of routing state are similar to aggregation
of IIS state, they are beyond the scope of this paper.
Per session state in a router has two harmful effects. First,
additional state consumes memory, and second, additional state
requires additional CPU cycles to process the state. Depending on
the speed with which the memory needs to be accessed, different types
of memory can be used. State that is required in order to process
each packet (e.g. classifier state and scheduling state) needs to be
stored in fast (i.e. cache) memory, while other state (e.g. setup
protocol state) can be stored in slower memory. Since fast memory is
substantially more expensive than slow memory, and since accesses to
fast memory are on the router fast path, our primary goal is to
reduce the classifier and scheduler state. Reducing the amount of
setup protocol state is an important, but secondary goal.
Any aggregation scheme entails some tradeoffs. Keeping full
integrated services state at all routers allows the resources
dedicated to the flow of packets from each admitted session to be
isolated from the resources for packets from all other sessions.
This isolation means that resources reserved and needed for one
session will be used only on the reserving session. The disadvantage
of doing aggregation is the loss of some portion of this flow
isolation, since with aggregation many flows would share the same
traffic service class. Our goal is to provide Internet Integrated
Services to a network while using a greatly reduced amount of state.
Berson, Vincent Expiration: May 1998 [Page 2]
Internet Draft State Aggregation November 1997
We make several assumptions:
1. We assume that an explicitly definable region (e.g. one or more
contiguous autonomous systems), called an "aggregating region",
will aggregate and will implement supporting mechanisms at its
boundary points. We define an "ingress" of an aggregating
region as a router where a packet for a specific unicast or
multicast session enters the region, and an "egress" as a router
where a packet leaves the region. Also, we define an "interior
router" as any router in the aggregating region that is on the
data path for the session and is not the ingress or egress.
2. We assume that the choice to aggregate and the mechanism used to
do so is an intra-domain issue, not an inter-domain issue and
therefore there can be variability across regions so long as at
the boundaries routers continue to pass along adequate messages
to support the setup protocol upstream and downstream.
3. We assume current multicast and unicast routing within the
aggregating region; but with no other enhancements.
2. Framework
Figure 1 shows the architecture of an Internet integrated services
device. There are three parts of the device, the scheduler, the
setup protocol engine, and the classifier. The scheduler is
responsible for enforcing a quality of service for each of the flows
(i.e. one flow per scheduling queue). The scheduler achieves this
with a scheduling algorithm and a policing algorithm. The scheduling
algorithm decides which queue the next packet to be forwarded will
come from, while the policing algorithm decides which packets are
discarded or put into a different scheduler queue in case of
congestion. Typical types of scheduling algorithms are based on fair
queueing and/or priority. The classifier is responsible for
assigning packets to queues. The classifier accomplishes this by
looking at fields (e.g. destination address) in the packet headers.
A table in the classifier tells which packet header values assign the
packet to each scheduler queue. The setup protocol engine is
responsible for setting up the classifier state and scheduler queues.
The setup engine receives requests that include a quality of service
and a packet classifier entry. Upon receiving a message, the setup
engine does admission control to determine if sufficient resources
are available to make the reservation. If resources are available, a
new scheduling queue is set up in the scheduler for the requested
resources, and the classifier entry is installed in the classifier.
Berson, Vincent Expiration: May 1998 [Page 3]
Internet Draft State Aggregation November 1997
[FIGURE]
Figure 1: Internet Integrated Services device
2.1 Reducing the amount of state
Reducing the amount of state means that some of the basic
functions of integrated services must be limited. Since each
network service provider will want to make decisions on how to
provide integrated services based on their provisioning and
traffic patterns, a detailed review of how IIS state is used, is
offered.
Scheduler classes are the unit of assignment of network resources.
Scheduler state typically takes the form of different traffic
classes. These traffic classes are assigned different priorities,
link shares, and policing policies to provide a certain level of
service. In the basic IIS model, each traffic flow gets its own
traffic class, meaning that each flow has certain resources
dedicated to it. By reducing the number of traffic classes, a
coarser granularity of resource assignment is done in that several
flows will be assigned to the same scheduler queue. A consequence
of the coarser granularity is that the aggregated traffic gets a
certain level of resource usage, but it is impossible inside an
aggregating region to isolate flows in the same traffic class from
each other. The lack of isolation can be mitigated by adequate
provisioning of network resources and by policing at aggregating
region borders.
Classifier state is used to assign packets to traffic classes.
Each arriving packet is classified into a traffic class at each
node according to state established by the setup protocol. In the
complete absense of classifier state, no service differentiation
is possible. However a packet can be classified at the ingress,
and the results of that classification can be encoded in each
packet. This reduced classifier state might be as simple as one
or more bits segregating different classes of traffic, or
alternatively, an encapsulation header. By encoding a small
amount of classifier state in the packet, only a very simple
classification is needed at interior routers.
Finally setup protocol (e.g. RSVP) state is used to do admission
control, provide policing, allow reclassification, and allow
merging of reservation messages. The first two of these issues,
admission control and policing, are general issues involving both
unicast and multicast traffic, while the last two,
reclassification and reservation merging, are only issues with
multicast traffic.
Berson, Vincent Expiration: May 1998 [Page 4]
Internet Draft State Aggregation November 1997
Setup protocol state is used in admission control. Admission
control with full IIS state involves computing how much network
resources are already committed and whether there are sufficient
resources to admit a new flow with certain resource requirements.
Lacking setup protocol state, admission control in the aggregating
region cannot be based on reservation state per flow. Instead,
admission control must be based on aggregated state information at
each node. More specifically, state will typically consist only
of measurements, and so admission control must be measurement
based.
Setup protocol state is used for traffic policing. Policing is
done at two points in an IIS network, at a traffic merge point,
and at a traffic split point. The traffic merge point is where
the traffic from two or more sources for the same session merge.
For a shared style reservation, the traffic from all the sources
needs to be policed to the reservation parameters at traffic merge
points. The second place that policing is needed is at traffic
split points. A traffic split point is where multicast traffic
has multiple outgoing interfaces. Because of reservation merging
and because some outgoing interfaces may have no reservation,
traffic at a split point needs to be policed on each outgoing
interface to the proper reserved values for that interface.
Lacking setup protocol state, no per flow policing can take place
inside an aggregating region. This implies that all per flow
policing must take place at the edges of the aggregating region.
Setup protocol state is used to allow reclassification of packets
from reserved to best effort. This reclassification is necessary
at traffic flow split points where there are one (or more)
outgoing interfaces with reservations and one (or more) outgoing
interfaces without reservations. At these points packets
forwarded to the outgoing interfaces with no reservations need to
be reclassified as best effort.
Finally, setup protocol state is used to merge reservation
messages coming from the egresses to the ingress. With setup
protocol state for each session, and hop by hop processing of
reservation messages, each router can limit the number of
reservation refreshes it sends. Without setup protocol state,
each reservation message would be sent to the ingress, requiring a
large amount of processing for any large multicast groups.
2.2 Basic aggregation architecture
We propose a scheme with a constant amount of scheduler state at
all routers, a constant amount of classifier state at interior
routers, and no unicast setup protocol state at interior routers.
Berson, Vincent Expiration: May 1998 [Page 5]
Internet Draft State Aggregation November 1997
In our approach, a fixed number of traffic service classes are
defined and configured at all routers in an aggregating region.
Each traffic service class is subject to admission control at the
ingress to the region. Traffic policing is done at the ingress
for unicast traffic, and at the ingress plus at traffic split
points for multicast traffic. Since only a fixed number of
traffic classes are available, the scheduler state at all routers
is constant.
Data packets are classified and "tagged" on entry to the
aggregating region with some identifier indicating to which
aggregated traffic class the packet should belong. This
identifier can consist of some bits in the packet header (e.g.
type of service bits) or the packet could be encapsulated. The
tag (or encapsulation) encodes the traffic service class that this
packet will receive across the aggregating region. Tagging or
encapsulating each packet bounds the amount of classifier state in
the interior of the aggregating region.
Full setup protocol state is stored at the borders of the
aggregating region, but only multicast setup protocol state is
needed in the interior of the region. When a reservation message
arrives at an ingress to the region (from an egress), and that
reservation passes admission control, then packets from the flow
associated with the new reservation are assigned to one of the
configured traffic service classes.
Admission control will be measurement-based, and, since there is
no setup protocol state in the interior, admission control will be
done at the ingress for a session. But the decision will be based
on congestion measurements within the aggregating region. Thus
each node in the aggregating region, upon receiving an admission
control setup protocol message, will make a local congestion
decision. This local congestion decision will be made on a
threshold basis where new reservations are not admitted if the
existing load plus the expected additional load from the new flow
is greater than a (possibly dynamic) threshold.
To implement the local congestion decision, each node keeps an
estimate of its current load per traffic class. As an admission
control message traverses the path between ingress and egress,
each interior router must admit that reservation. Each router
that provisionally admits the reservation should factor the new
reservation into its currect traffic estimate. To admit the flow,
an estimate of the amount of the reserved traffic due to this
reservation must be made, e.g. based on peak rate specified in the
admission control message. If some router does not admit the
reservation, the admission control message is marked appropriately
Berson, Vincent Expiration: May 1998 [Page 6]
Internet Draft State Aggregation November 1997
by the rejecting router and the message is forwarded. When the
admission control message arrives at the egress, the egress node
checks the message to see if any router rejected the message. If
some router rejected the reservation, then the egress will forward
the message to the ingress which will reject the reservation and
send a reservation error message.
In summary, our approach provides for reduction of packet
scheduler state, packet classifier state, and setup protocol (e.g.
RSVP) state in the following ways. A constant amount of packet
scheduler state at all routers in an aggregating region is
achieved by using preconfigured traffic service classes. A
constant amount of packet classifier state for unicast traffic at
all interior routers in the aggregating region is accomplished by
tagging or encapsulating each packet with its traffic service
class. No setup protocol state at interior routers for unicast
traffic is achieved by doing admission control and policing only
at the edges of the aggregating region. For multicast traffic,
full RSVP state is stored at all routers, but classifier state is
stored only at traffic split points and only on those outgoing
interfaces with no reservation.
2.3 Multicast
We would like to apply the above approach to multicast traffic.
However, there are some fundamental differences between unicast
and multicast traffic that results in additional setup protocol
state being desirable. Since traffic with multiple destinations
is being classified and tagged at the ingress router, a tagged
multicast packet will receive reserved service everywhere in the
cloud, including branches for which there are no reservations.
This all-or-nothing reservation property, caused by tagging at the
ingress, complicates aggregating state for multicast sessions.
The complications derive from two features of multicast sessions,
"heterogeneity" and "dynamic group membership". Heterogeneity
refers to the situation where different branches of a multicast
distribution tree have different reservations, including the case
where some branches have no reservation. Unreserved packets
receiving reserved service can adversely affect packet flows that
properly have a reservation. Without additional state, there is
no way to determine which packets have a legitimate reservation.
The other multicast feature causing complications is dynamic group
membership. Dynamic group membership means that the members of a
multicast session can be changing over time by having new
receivers join, and old receivers leave. Due to the all-or-
nothing property, a new best effort receiver joining a multicast
session with a reservation in place will be receiving reserved
Berson, Vincent Expiration: May 1998 [Page 7]
Internet Draft State Aggregation November 1997
service. Admission control will not have been done for that new
receiver, and so those reserved packets to the new receiver can
cause congestion for other properly reserved packets. Thus a best
effort receiver can disrupt reserved traffic.
Our approach to aggregation and multicast is to keep setup
protocol state for multicast sessions. Since setup protocol state
is not on the router fast path, this option is reasonable since
scheduler state and classifier state are still minimized. Some
additional classifier state is needed, but only for sessions with
heterogeneous branches, and only at those nodes with heterogeneous
outgoing interfaces. This classifier state can be established
dynamically from setup protocol state and routing state. When a
node discovers that it has heterogeneous outgoing interfaces,
classifier entries are established on the best effort branches.
The packets that the classifier matches have the reserved markings
removed and new markings inserted indicating that the packets are
to be forwarded as best effort.
Having setup protocol state for multicast can also solve the
problems mentioned in section 2.1 with implosion of reservation
and ADREP messages, and with policing. Also note that an
additional optimization would be to store setup protocol state for
a session only on routers with multiple outgoing interfaces for
that session.
2.4 Policing in interior
There is still an issue with state aggregation and shared (e.g.
RSVP wildcard or shared explicit) style reservations at traffic
merge points. At a merge point, the traffic entering the node may
conform to the reservation, but the traffic exiting the node may
be non-conforming due to multiple senders. With setup protocol
state, the traffic would be policed at the merge node. Since
there is no state in the interior of the network, it is impossible
to police the traffic of an individual flow in the interior. The
traffic will eventually be policed at the next ingress, but may
interfere with actual reserved traffic between the merge point and
the egress. Since the traffic will eventually be policed, there
is no "free bandwidth" for a user trying to exploit this feature.
However there is a security problem where a malicious user could
tie up reserved bandwidth traffic in a transit network. To
protect a network from this sort of abuse, the reservation of an
abuser could be cancelled by the egress when excess traffic at an
egress is detected.
Berson, Vincent Expiration: May 1998 [Page 8]
Internet Draft State Aggregation November 1997
3. RSVP operations/extensions
In this section we describe how this aggregation scheme would be
applied to RSVP. We assume that a measurement based admission
control algorithm is defined, and we describe how to utilize this
system to implement aggregation. The basic issues are gathering the
measurement data, keeping track of which reservations are new, and
describing the appropriate data structures.
Our approach to collecting measurement information for admission
control is to originate the state collection from the ingress upon
receiving an RSVP reservation message (RESV). The basic exchange of
messages is shown in figure 2. Note that no processing is needed in
the interior of the aggregating region on the RESV or the preceding
PATH messages. But when the ingress receives a new RESV message
(i.e. a message for which it has no traffic control state), the
ingress initiates admission control. For admission control, the
ingress sends an ADmission REQuest (ADREQ) message hop-by-hop through
the interior of the aggregating region to the egress (known from the
previous RSVP hop field in the RESV message). At each hop through
the region, the node attempts to admit the reservation. If admission
control succeeds at a node, the ADREQ message is forwarded unchanged
to the next hop. If admission control fails at a node, a failure
object is appended to the ADREQ message and then the message is
forwarded to the next hop. When the egress receives the ADREQ
message, it checks if any interior node appended a failure object to
the message. If there is no rejection information in the ADREQ
message, an ADMISSION REPLY (ADREP) message with an ADMITTED status
flag is sent back to the ingress. If there is a failure object in
the message, an ADREP message with a REJECTED status is sent back to
the ingress. The ingress will treat the ADREP message with a
REJECTED status as an admission control failure and a RSVP
reservation error will be sent out.
[FIGURE]
Figure 2: Admission control exchange of messages
There are three other possible approaches to aggregated admission
control that may be useful in certain types of network environments,
but which have serious limitations in general. One approach is to
put the admission control request and responses in RSVP reservation
messages. This approach is conceptually simpler than the above
scheme, but assumes that the reverse route (i.e. egress to ingress)
through the aggregating region is known. There are cases where this
holds, most notably in link state protocols, but in general this is
not the case.
Berson, Vincent Expiration: May 1998 [Page 9]
Internet Draft State Aggregation November 1997
The second approach to collecting admisson control information is an
RSVP path message which travels on the forward route from data
ingress to data egress. When this message is processed at each
interior node, admission control information is collected. The
accumulated admission control information is then included in a
corresponding RSVP reservation message from egress to ingress. The
main disadvantage with this scheme is that the service class is
generally not known only from path information. If a region offers
only a small number of services, it would be possible to include
admission control information for all of the services in the path
message, but this could be a large amount of information in general.
The third approach to collecting admission control information is by
out-of-band communication, e.g. as part of routing protocols. In
this case, the ingress site can do the admission control based on the
out-of-band communication. Further work is needed in this area.
The remainder of this section shows the additions to RSVP to support
our approach; using separate ADREQ and ADREP messages in the
aggregating region.
3.1 RSVP messages
Two new RSVP message types would be needed for RSVP messages,
admission request (ADREQ) and admission reply (ADREP). The ADREQ
is used by an ingress to set up the reservation across the
aggregating region, while the ADREP is used by the egress to
report that the reservation has succeeded or not.
In addition to new message types, an admission control object
(ADMISSION), and an admission control status (STATUS) object types
are needed. The ADMISSION object and zero or more STATUS objects
are included as part of a ADREQ or ADREP message. The ADMISSION
object contains a common object header, the IP address of the
ingress, a handle from the ingress node, and some flags. The
handle is included by the ingress node on initiating aggregated
admission control and is used to uniquely identify reservation
state on the ingress. The STATUS object has a common header, an
IP address, and a load status. The IP address is the address of
the node supplying the load report, while the load status is the
measured load from a node.
The format of the ADREQ and ADREP messages are very simple and are
as follows:
::=
Berson, Vincent Expiration: May 1998 [Page 10]
Internet Draft State Aggregation November 1997
status_report ::= |
::=
When an ingress receives a new RSVP RESV message (and local
admission control is successful), the ingress initiates aggregated
admission control by sending an ADREQ message. The FLOWSPEC from
the RESV message is used in the ADREQ message, as well as an
ADMISSION object containing a handle chosen by the ingress.
On receiving an ADREQ message, an interior router will perform
admission control based on the measured load and the FLOWSPEC
object. If the router is too congested to accept a flow, the
rejecting router adds its address and measured load information to
a STATUS object which is appended to the ADREQ message. The ADREQ
message eventually arrives at the egress with a (possibly empty)
list of STATUS objects, containing the address and measured load
of nodes that are congested. The egress will set the flags in the
ADMISSION object and then return the ADMISSION object and the list
of STATUS objects in an ADREP message unicasted to the ingress.
The ingress will locate the reservation state block by using the
handle in the ADMISSION object. Then the ingress will check the
ADREP message for any status objects. If there are any STATUS
objects the ingress will reject the reservation and send a
reservation error message downstream. If there are no STATUS
objects in the ADREP message then the ingress will accept the
reservation and the RESV message will be forwarded upstream.
Note that if either an ADREQ or an ADREP message is lost, then the
ingress behaves as a router that lost an RSVP reservation message.
The ingress will wait until the reservation message is refreshed
and then send a new ADREQ. It would also be possible to resend an
ADREQ after a configurable amount of time.
struct {
Object_header header;
struct in_addr ingress;
Handle handle;
int flags;
} ADMISSION;
#define ADMITTED 1
#define REJECTED 0
struct {
Object_header header;
struct in_addr sess;
Load load;
Berson, Vincent Expiration: May 1998 [Page 11]
Internet Draft State Aggregation November 1997
} STATUS;
#define RSVP_ADREQ 11
#define RSVP_ADREP 12
4. Reserved traffic congestion
One last issue is how to deal with traffic congestion in reserved
traffic classes. Congestion is typically detected in an ADREQ/ADREP
message. Similar to the load threshold for admission, there can be a
load threshold for congestion which is higher.
There are two ways that admitted load can cause high levels of
congestion on a router, measurement based admission control failure
and link failure. Measurement based admission control failure occurs
due to past load not being representative of the future. For
example, if many idle reserved sources become very active at the same
time, some links may become overloaded. Link failure is caused by a
link being declared down and a new set of routes are generated. The
traffic that was traversing the failed link may now be added to an
already heavily loaded link. In both of these cases, links may
become overcommitted.
For small amounts of congestion, reserved traffic can preempt best
effort traffic, i.e. best effort traffic packets are dropped, but new
admission control requests are rejected. For larger amounts of
congestion, reserved packets need to be dropped of reservations need
to be preepted. For link failure in a per-flow IS state network, the
setup protocol would typically use reservation preempting (while
admission control mistakes are expected not to happen). With
aggregated IS state, the point at which the routing changes may not
have setup protocol state. Thus a node typically cannot distinguish
an overload caused by a link failure from one caused by transient
heavy congestion. The response is the same for either cause of
congestion; Ingress nodes in this case will need to preempt
reservations.
5. Acknowledgements
This Internet Draft is the result of discussions with many people
particularly Bob Braden, Bob Lindell, Deborah Estrin, and Daniel
Zappala.
REFERENCES
[1] Braden, R., Zhang, L., Berson, S., Herzog, S., and Jamin, S.,
"Resource ReSerVation Protocol (RSVP) -- Version 1 Functional
Specification," RFC 2205, September 1997.
Berson, Vincent Expiration: May 1998 [Page 12]
Internet Draft State Aggregation November 1997
[2] Braden, R., Clark, D., and Shenker, S., "Integrated Services in the
Internet Architecture: an Overview," RFC 1633, June 1994.
[3] Floyd, S., and Jacobson, V., "Link-sharing and Resource Management
Models for Packet Networks," IEEE/ACM Transactions on Networking,
Vol. 3 No. 4, pp. 365-386, August 1995.
[4] Jamin, S., Shenker, S., and Danzig, P., "Comparison of Measurement-
based Admission Control Algorithms for Controlled-Load Service,"
Infocomm '97.
[5] Rampal, S., "Flow Grouping for Reducing Reservation Requirements for
Guaranteed Delay Service," Internet Draft, December 1996.
Security Considerations
Security considerations have not been addressed in this draft.
Author's Address
Steven Berson
USC Information Sciences Institute
4676 Admiralty Way
Marina del Rey, CA 90292
Phone: +1 310 822 1511
EMail: berson@isi.edu
Subramaniam Vincent
USC Information Sciences Institute
4676 Admiralty Way
Marina del Rey, CA 90292
Phone: +1 310 822 1511
EMail: svincent@isi.edu
Berson, Vincent Expiration: May 1998 [Page 13]