Internet Draft
Here, roughly, is a possible break-up of the MPLS architecture:

1. Labels, label control and distribution, LDP transport...
2. Label stack, and label encodings, MPLS and various links....
3. Loop control/prevention...
4. Merging and non-merging LSRs....
5. Granularity, Tunnels and hierarchy...
6. Some applications of MPLS....
7. Label based forwarding; MPLS-specific hardware and/or software.

Another possible break-up is based on I-Ds, such as:

1. LDP - labels, control and distribution.
2. architecture.
3. framework
4. label encodings
5. MPLS and links.

But I prefer the first break-up, am particularly interested in items 2
and/or 3, and would welcome one person to work with me.

MPLS requirements - low-cost, high-performance forwarding, and loop
prevention or containment.  MPLS also needs to provide for aggregation
and should support the same level of other functionality as is provided
by routing.  Both multicast and unicast must be supported.  The protocol
must support stream merging.  Also:

   - The core set of MPLS standards, along with existing Internet
     standards, MUST be a self-contained solution. For example, the
     proposed solution MUST NOT require specific hardware features that
     do not commonly exist on network equipment at the time that the
     standard is complete. However, the solution MAY make use of
     additional optional hardware features (e.g., to optimize
     performance).

   - The MPLS protocol standards MUST support multipath routing and
     forwarding.

   - MPLS MUST be compatible with the IETF Integrated Services Model,
     including RSVP.

   - It MUST be possible for MPLS switches to coexist with non-MPLS
     switches in the same switched network. MPLS switches SHOULD NOT
     impose additional configuration on non-MPLS switches.

   - MPLS MUST allow "ships in the night" operation with existing layer
     2 switching protocols (e.g., ATM Forum Signaling) (i.e., MPLS must
     be capable of being used in the same network which is also
     simultaneously operating standard layer 2 protocols).

   - The MPLS protocol MUST support both topology-driven and
     traffic/request-driven label assignments.


MPLS issues - in a completely flat MPLS domain, MPLS can offer no more
scalability than IP, because the number of labels required is ultimately
determined by the egress (and in a flat MPLS domain, that will be the
last-hop router).  In fact, the opposite is likely: an IP router can
aggregate NLRI information, whereas in a ubiquitous MPLS domain an LSR
cannot.

=========================================================================
Internet Draft                                    Eric Gray
                                                  ...
                                                  Lucent Technologies, Inc.
                                                  Expires May 1998

               Generic Multi-Protocol Specification

Status of this Memo

   This document is an Internet-Draft.  Internet-Drafts are working
   documents of the Internet Engineering Task Force (IETF), its areas,
   and its working groups.  Note that other groups may also distribute
   working documents as Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   To learn the current status of any Internet-Draft, please check the
   "1id-abstracts.txt" listing contained in the Internet-Drafts Shadow
   Directories on ftp.is.co.za (Africa), nic.nordu.net (Europe),
   munnari.oz.au (Pacific Rim), ds.internic.net (US East Coast), or
   ftp.isi.edu (US West Coast).

Abstract

   This document describes the specification of generic Multi-Protocol
   Label Switching (MPLS).  Its purpose is to define those parts of
   MPLS that are media/technology independent.  Other documents, both
   existing works in progress and yet to come, will describe specifics
   of MPLS protocol behavior that apply to a particular media or tech-
   nology.

Contents

         Status of this Memo  ......................................   1
         Abstract  .................................................   1
         Table of Contents  ........................................  ##
    1    Protocol Overview  ........................................  ##
    2    LDP Messages  .............................................  ##
    2.1  Common Message Header  ....................................  ##
    2.2  LDP Message TLV-objects  ..................................  ##
    2.2.1  Common Object Header  ...................................  ##
    2.2.2  Label Object  ...........................................  ##
    2.2.3  Explicit Path Object  ...................................  ##
    2.3  LDP Neighbor Notification  ................................  ##
    2.4  LDP Bind Request  .........................................  ##
    2.4.1  Destination-Based  ......................................  ##
    2.4.2  Explicit-Route  .........................................  ##
    2.5  LDP Label Bind  ...........................................  ##
    2.5.1  Destination-Based  ......................................  ##
    2.5.2  Explicit-Route  .........................................  ##
    2.6  LDP Teardown  .............................................  ##
    3    LDP State Transitions  ....................................  ##
    4    LDP Interaction With Routing  .............................  ##
    5    LDP Multicast  ............................................  ##
    6    Acknowledgments  ..........................................  ##
    7    References  ...............................................  ##
    8    Author Information  .......................................  ##

7.  References

   [1] "Tag distribution Protocol", Doolan, Davie, Katz, Rekhter, Rosen,
       work in progress, internet draft <draft-doolan-tdp-spec-01.txt>

   [2] "Label Switching: Label Stack Encodings", Rosen, Rekhter, Tappan,
       Farinacci, Fedorkow, Li, work in progress, internet draft
       <draft-rosen-tag-stack-02.txt>

   [3] "A Framework for Multiprotocol Label Switching", 5/12/97, draft-
       ietf-mpls-framework-01.txt, Callon, Doolan, Feldman, Fredette,
       Swallow, Visanathawan

   [4] "A Proposed Architecture for MPLS", E. Rosen, A. Viswanathan, R.
       Callon, work in progress, draft-rosen-architecture-00.txt, July
       1997.

   [5] "Partitioning Tag Space among Multicast Routers on a Common
       Subnet", Farinacci, work in progress, internet draft 

   [6] "Multicast Tag Binding and Distribution using PIM", Farinacci,
       Rekhter, work in progress, internet draft 

   [7] "ARIS: Aggregate Route-Based IP Switching", A. Viswanathan, N.
       Feldman, R. Boivie, R. Woundy, work in progress, Internet Draft
       <draft-viswanathan-aris-overview-00.txt>, March 1997.

   [8] "ARIS Specification", N. Feldman, A. Viswanathan, work in
       progress, Internet Draft <draft-feldman-aris-spec-00.txt>, March
       1997.

   [9] "ARIS Support for LAN Media Switching", S. Blake, A. Ghanwani, W.
       Pace, V. Srinivasan, work in progress, Internet Draft , March 1997.

   [10] "OSPF version 2", J. Moy, RFC 1583, March 1994.

   [11] "A Border Gateway Protocol 4 (BGP-4)", Y. Rekhter and T. Li,
        RFC 1771, March 1995.

   [12] "ATM Forum Private Network-Network Interface Specification,
        Version 1.0", ATM Forum af-pnni-0055.000, March 1996.

   [13] "Internet Control Message Protocol", RFC 792, 9/81, Postel

   [14] "Path MTU Discovery", RFC 1191, 11/90, Mogul & Deering

   [15] Heinanen, J. "Multiprotocol Encapsulation over ATM Adaptation
        Layer 5" RFC 1483, July 1993

   [16] "IP Router Alert Option", RFC 2113, 2/97, Katz

   [17] "The Point-to-Point Protocol (PPP)", RFC 1661, 7/94, Simpson

   [A] "Tag Switching Architecture - Overview", Rekhter, Davie, Katz,
       Rosen, Swallow, Farinacci, work in progress, Internet Draft
       <draft-rekhter-tagswitch-arch-01.txt>

   [B] "Use of Tag Switching with ATM", Davie, Doolan, Lawrence,
       McGloghrie, Rekhter, Rosen, Swallow, work in progress, Internet
       Draft <draft-davie-tag-switching-atm-01.txt>

   [C] "Soft State Switching: A Proposal to Extend RSVP for Switching
       RSVP Flows", A. Viswanathan, V. Srinivasan, work in progress,
       Internet Draft <draft-viswanathan-aris-rsvp-00.txt>, March 1997.

   [D] "Loop-Free Routing Using Diffusing Computations", J.J. Garcia-
       Luna-Aceves, IEEE/ACM Transactions on Networking, Vol. 1, No. 1,
       February 1993.

   [E] "NBMA Next Hop Resolution Protocol (NHRP)", J. Luciani et al.,
       work in progress, draft-ietf-rolc-nhrp-11.txt, March 1997.

   [F] "Cisco System's Tag Switching Overview", IETF RFC 2105,
       Y.Rekhter, B.Davie, D.Katz, E.Rosen, G.Swallow,
       February, 1997.

   [G] ATM Forum, "LAN Emulation over ATM Specification Version 1.0",
       April 1995.

   [H] ATM Forum, "Multi Protocol Over ATM" (Work in Progress)

   [I] M. Laubach, "Classical IP and ARP over ATM", IETF RFC 1577,
       October 1993.

   [J] "Toshiba's Router Architecture Extensions for ATM: Overview",
        Katsube, Nagami, Esaki, RFC 2098.

   [K] "Ipsilon Flow Management Protocol Specification for IPv4 Version
        1.0", P. Newman et al., RFC 1953, May 1996.

   [L] Rekhter, Y., et al., <draft-rfced-tag-switching-overview-00.txt>.

   [M] Deering, S., et al., "An Architecture for Wide Area Multicast
       Routing", Proc. SIGCOMM '94, Computer Communication Review,
       Vol. 24, No. 4.

   [N] Reynolds, J., and Postel, J., "Assigned Numbers", RFC 1700,
       October 1994.

8.  Author Information

        Eric Gray
        Lucent Technologies, Inc.
        1600 Osgood Street
        North Andover, MA 01845
        ewgray@lucent.com

        Zheng Wang
        Lucent Technologies, Inc.
        101 Crawfords Corner Road
        Holmdel, NJ 07733
        zhwang@lucent.com

        Grenville Armitage
        Lucent Technologies, Inc.
        101 Crawfords Corner Road
        Holmdel, NJ 07733
        gja@lucent.com

        Ross Callon
        Ascend Communications, Inc.
        1 Robbins Road
        Westford, MA  01886
        rcallon@casc.com

        Bruce Davie
        Cisco Systems, Inc.
        250 Apollo Drive
        Chelmsford, MA, 01824
        bsd@cisco.com

        Jeremy Lawrence
        Cisco Systems, Inc.
        1400 Parkmoor Ave.
        San Jose, CA, 95126
        jlawrenc@cisco.com

        Keith McCloghrie
        Cisco Systems, Inc.
        170 Tasman Drive
        San Jose, CA, 95134
        kzm@cisco.com

        Eric Rosen
        Cisco Systems, Inc.
        250 Apollo Drive
        Chelmsford, MA, 01824
        erosen@cisco.com

        Paul Doolan
        Cisco Systems, Inc
        250 Apollo Drive
        Chelmsford, MA 01824
        pdoolan@cisco.com

        Nancy Feldman
        IBM Corp.
        17 Skyline Drive
        Hawthorne NY 10532
        nkf@vnet.ibm.com

        Andre Fredette
        Bay Networks Inc
        3 Federal Street
        Billerica, MA  01821
        fredette@baynetworks.com

        George Swallow
        Cisco Systems, Inc
        250 Apollo Drive
        Chelmsford, MA 01824
        swallow@cisco.com

        Arun Viswanathan
        IBM Corp.
        17 Skyline Drive
        Hawthorne NY 10532
        arunv@vnet.ibm.com

        Yakov Rekhter
        Cisco Systems, Inc.
        170 Tasman Drive
        San Jose, CA, 95134
        yakov@cisco.com

        Dan Tappan
        Cisco Systems, Inc.
        250 Apollo Drive
        Chelmsford, MA, 01824
        tappan@cisco.com

        Dino Farinacci
        Cisco Systems, Inc.
        170 Tasman Drive
        San Jose, CA, 95134
        dino@cisco.com

        Dave Katz
        Juniper Networks
        3260 Jay Street
        Santa Clara, CA 95051
        dkatz@jnx.com

        Guy Fedorkow
        Cisco Systems, Inc.
        250 Apollo Drive
        Chelmsford, MA, 01824
        fedorkow@cisco.com

        Tony Li
        Juniper Networks
        3260 Jay Street
        Santa Clara, CA 95051
        tli@jnx.com

        Hiroshi Esaki
        Computer and Network Division,
        Toshiba Corporation,
        1-1-1 Shibaura,
        Minato-ku, 105-01, Japan.
        hiroshi@isl.rdc.toshiba.co.jp

        Yasuhiro Katsube
        R&D Center, Toshiba Corporation,
        1 Komukai-Toshiba-cho, Saiwai-ku,
        Kawasaki, 210, Japan
        katsube@isl.rdc.toshiba.co.jp

        Ken-ichi Nagami
        R&D Center, Toshiba Corporation,
        1 Komukai-Toshiba-cho, Saiwai-ku,
        Kawasaki, 210, Japan
        nagami@isl.rdc.toshiba.co.jp

        James V. Luciani
        Bay Networks, Inc.
        3 Federal Street, BL3-04
        Billerica, MA  01821
        luciani@baynetworks.com

        Joel M. Halpern
        Newbridge Networks Corp.
        593 Herndon Parkway
        Herndon, VA 22070-5241
        jhalpern@Newbridge.COM

=========================================================================
        Possible LDP document layouts
=========================================================================
TDP Draft Structure - From draft-doolan-tdp-spec-01.txt
=========================================================================

1  Abstract
2  Protocol Overview
   2.1  LDP and Label switching over ATM
3  State machines
   3.1  LDP state transition table
   3.2  LDP state transition diagram
   3.3  Transport connections
   3.4  Timeout
4  Protocol Data Units (PDUs)
   4.1  LDP Fixed Header
   4.2  LDP TLVs
   4.3  Example LDP PDU
   4.4  PIEs defined in V1 of LDP
   4.5  LDP_PIE_OPEN
   4.6  LDP_PIE_BIND
   4.7  LDP_PIE_REQUEST_BIND
   4.8  LDP_PIE_WITHDRAW_BIND
   4.9  LDP_PIE_RELEASE_BIND
   4.10 LDP_PIE_KEEP_ALIVE
   4.11 LDP_PIE_NOTIFICATION
5  Intellectual Property Considerations
6  Acknowledgments
7  References
8  Author Information

=========================================================================
ARIS Specification Structure - From draft-feldman-aris-spec-00.txt
=========================================================================

1.  Introduction
2.  ARIS Messaging
    2.1. ARIS Objects
    2.2. Init
    2.3. Establish
         2.3.1. Destination-Based Routing
         2.3.2. Explicit Routes
    2.4. Trigger
    2.5. Teardown
    2.6. Acknowledge
    2.7. KeepAlive
3.  Neighbor Adjacency
    3.1. State Transition
4.  Egress Identifiers
    4.1. Egress ISR
    4.2. Selecting Egress Identifiers
5.  Destination-Based Routing
    5.1. Forwarding Information Bases
    5.2. TTL Decrement
    5.3. Loop Prevention
    5.4. BGP Interaction with ARIS
    5.5. OSPF Interaction with ARIS
6.  L2-Tunnels
7.  Label Management
    7.1. VCIB
    7.2. Label Swapping
8.  Multicast
    8.1. DVMRP and PIM-DM
    8.2. PIM-SM
9.  Multipath
10. Timers
11. Configuration
12. ARIS Signaling Pseudo Code
13. Object Definitions
    13.1. Common Header
    13.2. Common Object Header
    13.3. Label Object
    13.4. Egress Identifier Object
    13.5. Multipath Identifier Object
    13.6. Router Path Object
    13.7. Explicit Path Object
    13.8. Tunnel Object
    13.9. Timer Object
    13.10. Acknowledge Message Object
    13.11. Init Message Object

=========================================================================
MPLS Framework Document Structure - From draft-ietf-mpls-framework-01.txt
=========================================================================

1. Introduction and Requirements
   1.1  Overview of MPLS
   1.2  Requirements
   1.3  Terminology
   1.4  Acronyms and Abbreviations
2. Discussion of Core MPLS Components
   2.1  The Basic Routing Approach
   2.2  Labels
        2.2.1  Label Semantics
        2.2.2  Label Granularity
        2.2.3  Label Assignment
        2.2.4  Label Stack and Forwarding Operations
   2.3  Encapsulation
3. Observations, Issues and Assumptions
   3.1  Layer 2 versus Layer 3 Forwarding
   3.2  Scaling Issues
   3.3  Types of Streams
   3.4  Data Driven versus Control Traffic Driven Label Assignment
        3.4.1  Topology Driven Label Assignment
        3.4.2  Request Driven Label Assignment
        3.4.3  Traffic Driven Label Assignment
   3.5  The Need for Dealing with Looping
   3.6  Operations and Management
4. Technical Approaches
   4.1  Label Distribution
        4.1.1  Explicit Label Distribution
               4.1.1.1  Downstream Label Allocation
               4.1.1.2  Upstream Label Allocation
               4.1.1.3  Other Label Allocation Methods
        4.1.2  Piggybacking on Other Control Messages
        4.1.3  Acceptable Label Values
        4.1.4  LDP Reliability
        4.1.5  Label Purge Mechanisms
   4.2  Stream Merging
   4.3  Loop Handling
   4.4  Interoperation with NHRP
   4.5  Operation in a Hierarchy
   4.6  Stacked Labels in a Flat Routing Environment
   4.7  Multicast
   4.8  Multipath
   4.9  Host Interactions
   4.10 Explicit Routing
        4.10.1 Establishment of Point to Point Explicitly Routed LSPs
        4.10.2 Explicit and Hop by Hop routing: Avoiding Loops
        4.10.3 Merge and Explicit Routing
        4.10.4 Using Explicit Routing for Traffic Engineering
        4.10.5 Using Explicit Routing for Policy Routing
   4.11 Traceroute
   4.12 LSP Control: Egress versus Local
   4.13 Security

=========================================================================
MPLS Architecture Document Structure - From draft-rosen-mpls-arch-00.txt
=========================================================================

1 Introduction to MPLS
  1.1 Overview
  1.2 Terminology
  1.3 Acronyms and Abbreviations
  1.4 Acknowledgments
2 Outline of Approach
  2.1 Labels
  2.2 Upstream and Downstream LSRs
  2.3 Labeled Packet
  2.4 Label Assignment and Distribution; Attributes
  2.5 Label Distribution Protocol (LDP)
  2.6 The Label Stack
  2.7 The Next Hop Label Forwarding Entry (NHLFE)
  2.8 Incoming Label Map (ILM)
  2.9 Stream-to-NHLFE Map (STN)
  2.10 Label Swapping
  2.11 Label Switched Path (LSP), LSP Ingress, LSP Egress
  2.12 LSP Next Hop
  2.13 Route Selection
  2.14 Time-to-Live (TTL)
  2.15 Loop Control
    2.15.1 Loop Prevention
    2.15.2 Interworking of Loop Control Options
  2.16 Merging and Non-Merging LSRs
    2.16.1 Stream Merge
    2.16.2 Non-merging LSRs
    2.16.3 Labels for Merging and Non-Merging LSRs
    2.16.4 Merge over ATM
      2.16.4.1 Methods of Eliminating Cell Interleave
      2.16.4.2 Interoperation: VC Merge, VP Merge, and Non-Merge
  2.17 LSP Control: Egress versus Local
  2.18 Granularity
  2.19 Tunnels and Hierarchy
    2.19.1 Hop-by-Hop Routed Tunnel
    2.19.2 Explicitly Routed Tunnel
    2.19.3 LSP Tunnels
    2.19.4 Hierarchy: LSP Tunnels within LSPs
    2.19.5 LDP Peering and Hierarchy
  2.20 LDP Transport
  2.21 Label Encodings
    2.21.1 MPLS-specific Hardware and/or Software
    2.21.2 ATM Switches as LSRs
    2.21.3 Interoperability among Encoding Techniques
  2.22 Multicast
3 Some Applications of MPLS
  3.1 MPLS and Hop by Hop Routed Traffic
    3.1.1 Labels for Address Prefixes
    3.1.2 Distributing Labels for Address Prefixes
      3.1.2.1 LDP Peers for a Particular Address Prefix
      3.1.2.2 Distributing Labels
    3.1.3 Using the Hop by Hop path as the LSP
    3.1.4 LSP Egress and LSP Proxy Egress
    3.1.5 The POP Label
    3.1.6 Option: Egress-Targeted Label Assignment
  3.2 MPLS and Explicitly Routed LSPs
    3.2.1 Explicitly Routed LSP Tunnels: Traffic Engineering
  3.3 Label Stacks and Implicit Peering
  3.4 MPLS and Multi-Path Routing
  3.5 LSPs may be Multipoint-to-Point Entities
  3.6 LSP Tunneling between BGP Border Routers
  3.7 Other Uses of Hop-by-Hop Routed LSP Tunnels
  3.8 MPLS and Multicast
4 LDP Procedures
5 Security Considerations
6 Authors' Addresses
7 References
Appendix A Why Egress Control is Better
Appendix B Why Local Control is Better

=========================================================================
SCSP Document Structure - From draft-ietf-ion-scsp-01.txt
=========================================================================

1. Introduction
2. Overview
   2.1  Hello Protocol
   2.2  Cache Alignment Protocol
        2.2.1  Master Slave Negotiation State
        2.2.2  The Cache Summarize State
               2.2.2.1  CA Message Processing
        2.2.3  The Update Cache State
        2.2.4  The Aligned State
   2.3  Cache State Update Protocol
   2.4  The meaning of "More Up To Date"/"Newness"
Discussion and conclusions
Appendix A: Terminology and Definitions
Appendix B:  SCSP Message Formats
   B.1 Fixed Part
      B.2.0 Mandatory Part
         B.2.0.1 Mandatory Common Part
         B.2.0.2 Cache State Advertisement Summary Record (CSAS record)
      B.2.1 Cache Alignment (CA)
      B.2.2 Cache State Update Request (CSU Request)
         B.2.2.1 Cache State Advertisement Record (CSA record)
      B.2.3 Cache State Update Reply (CSU Reply)
      B.2.4 Cache State Update Solicit Message (CSUS message)
      B.2.5 Hello
   B.3  Extensions Part
      B.3.0  The End Of Extensions
      B.3.1  SCSP Authentication Extension
      B.3.2  SCSP Vendor-Private Extension
References

=========================================================================
NHRP Document Structure - From draft-ietf-rolc-nhrp-11.txt
=========================================================================

1. Introduction
2. Overview
   2.1 Terminology
   2.2 Protocol Overview
3. Deployment
4. Configuration
5. NHRP Packet Formats
   5.1  NHRP Fixed Header
   5.2  (Other Message Contents)
        5.2.0  Mandatory Part
               5.2.0.1  Mandatory Part Format
        5.2.1  NHRP Resolution Request
        5.2.2  NHRP Resolution Reply
        5.2.3  NHRP Registration Request
        5.2.4  NHRP Registration Reply
        5.2.5  NHRP Purge Request
        5.2.6  NHRP Purge Reply
        5.2.7  NHRP Error Indication
   5.3  Extensions Part
        5.3.0  The End Of Extensions
        5.3.1  Responder Address Extension
        5.3.2  NHRP Forward Transit NHS Record Extension
        5.3.3  NHRP Reverse Transit NHS Record Extension
        5.3.4  NHRP Authentication Extension
        5.3.5  NHRP Vendor-Private Extension
6. Protocol Operation
   6.1  Router-to-Router Operation
   6.2  Cache Management Issues
        6.2.1  Caching Requirements
               Source Stations
               Serving NHSs
               Transit NHSs
        6.2.2  Dynamics of Cached Information
               NBMA-Connected Destinations
               Destinations Off of the NBMA Subnetwork
   6.3  Use of the Prefix Length field of a CIE
   6.4  Domino Effect
7. NHRP over Legacy BMA Networks
8. Security Considerations
9. Discussion
References

=========================================================================
MPLS over ATM Document Structure - From draft-davie-tag-switching-atm-01.txt
=========================================================================

1. Introduction
2. Definitions
3. Special Characteristics of ATM Switches
4. Label Switching Control Component for ATM
5. Hybrid Switches (Ships in the Night)
6. Use of VPI/VCIs
7. Label Allocation and Maintenance Procedures
   7.1. Edge LSR Behavior
   7.2. Conventional ATM Switches (non-Stream-merge)
   7.3. Stream Merge
   7.4. Efficient use of label space
8. Generic Encapsulation

=========================================================================
	========== Some Message Formats ===========
=========================================================================

   CIEs have the following format:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |    Code       | Prefix Length |         unused                |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  Maximum Transmission Unit    |        Holding Time           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  Cli Addr T/L | Cli SAddr T/L | Cli Proto Len |  Preference   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |            Client NBMA Address (variable length)              |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |           Client NBMA Subaddress (variable length)            |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |          Client Protocol Address (variable length)            |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
                        .....................
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |    Code       | Prefix Length |         unused                |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  Maximum Transmission Unit    |        Holding Time           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  Cli Addr T/L | Cli SAddr T/L | Cli Proto Len |  Preference   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |            Client NBMA Address (variable length)              |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |           Client NBMA Subaddress (variable length)            |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |          Client Protocol Address (variable length)            |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
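
   As an illustration only, the fixed portion of a CIE as drawn above
   maps naturally onto a small C parser. The structure and function
   names below are invented for this sketch, and the T/L octets are
   treated as plain lengths (a real parser must mask out their type
   bits):

      #include <stddef.h>
      #include <stdint.h>

      /* Fixed (12-octet) portion of a CIE, per the diagram above. */
      struct cie_fixed {
          uint8_t  code;
          uint8_t  prefix_len;
          uint16_t mtu;            /* Maximum Transmission Unit */
          uint16_t holding_time;
          uint8_t  cli_addr_tl;    /* Cli Addr T/L  */
          uint8_t  cli_saddr_tl;   /* Cli SAddr T/L */
          uint8_t  cli_proto_len;  /* Cli Proto Len */
          uint8_t  preference;
      };

      /* Parse one CIE from buf; return octets consumed, or -1 if
         the buffer is too short. */
      int parse_cie(const uint8_t *buf, size_t len,
                    struct cie_fixed *out)
      {
          size_t var;

          if (len < 12)
              return -1;
          out->code          = buf[0];
          out->prefix_len    = buf[1];
          /* buf[2..3] are the unused octets */
          out->mtu           = (uint16_t)((buf[4] << 8) | buf[5]);
          out->holding_time  = (uint16_t)((buf[6] << 8) | buf[7]);
          out->cli_addr_tl   = buf[8];
          out->cli_saddr_tl  = buf[9];
          out->cli_proto_len = buf[10];
          out->preference    = buf[11];

          /* Variable-length NBMA address, NBMA subaddress, and
             protocol address fields follow the fixed portion. */
          var = (size_t)out->cli_addr_tl + out->cli_saddr_tl
              + out->cli_proto_len;
          if (len < 12 + var)
              return -1;
          return (int)(12 + var);
      }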

   Error Indication:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | Src Proto Len | Dst Proto Len |            unused             |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |           Error Code          |        Error Offset           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |            Source NBMA Address (variable length)              |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |          Source NBMA Subaddress (variable length)             |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |          Source Protocol Address (variable length)            |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |       Destination  Protocol Address (variable length)         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |       Contents of NHRP Packet in error (variable length)      |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+


   Extensions format:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |C|u|        Type               |        Length                 |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                         Value...                              |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+


1.3 Terminology 

   flow

     A single instance of an application to application flow of data
     (as in the RSVP and IFMP use of the term "flow")

   Forwarding Equivalence Class (FEC)

     A class of packets which are forwarded in the same manner.  A
     forwarding equivalence class is therefore the set of L3 packets
     which could safely be mapped to the same label. Note that there
     may be reasons that packets from a single forwarding equivalence
     class may be mapped to multiple labels (e.g., when stream merge
     is not used).

   frame merge

     Stream merge, when it is applied to operation over
     frame based media, so that the potential problem of cell
     interleave is not an issue.

   label

     A short, fixed-length, physically contiguous, locally
     significant identifier which is used to identify a stream

   label information base

     The database of information containing label bindings

   label swap

     The basic forwarding operation consisting of looking
     up an incoming label to determine the outgoing label,
     encapsulation, port, and other data handling information.

   label swapping

     A forwarding paradigm allowing streamlined forwarding of
     data by using labels to identify streams of data to be
     forwarded.

   label switched hop

     The hop between two MPLS nodes, on which forwarding is
     done using labels.

   label switched path

     The path created by the concatenation of one or more label
     switched hops, allowing a packet to be forwarded by swapping
     labels from an MPLS node to another MPLS node.

   layer 2

     The protocol layer under layer 3 (which therefore offers
     the services used by layer 3). Forwarding, when done by the
     swapping of short fixed length labels, occurs at layer 2
     regardless of whether the label being examined is an ATM
     VPI/VCI, a frame relay DLCI, or an MPLS label.

   layer 3

     The protocol layer at which IP and its associated routing
     protocols operate

   link layer

     Synonymous with layer 2

   loop detection

     A method of dealing with loops in which loops are allowed
     to be set up, and data may be transmitted over the loop,
     but the loop is later detected and closed

   loop prevention

     A method of dealing with loops in which data is never
     transmitted over a loop

   label stack

     An ordered set of labels

   loop survival

     A method of dealing with loops in which data may be
     transmitted over a loop, but means are employed to limit the
     amount of network resources which may be consumed by the
     looping data

   label switching router

     An MPLS node which is capable of forwarding native L3 packets

   merge point

     The node at which multiple streams and switched paths are
     combined into a single stream sent over a single path. If
     the multiple paths are not combined prior to the egress node,
     then the egress node becomes the merge point.

   Mlabel

     Abbreviation for MPLS label

   MPLS core standards

     The standards which describe the core MPLS technology

   MPLS domain

     A contiguous set of nodes which operate MPLS routing and
     forwarding and which are also in one Routing or Administrative
     Domain

   MPLS edge node

     An MPLS node that connects an MPLS domain with a node which
     is outside of the domain, either because it does not run
     MPLS, and/or because it is in a different domain. Note that
     if an LSR has a neighboring host which is not running MPLS,
     then that LSR is an MPLS edge node.

   MPLS egress node

     An MPLS edge node in its role in handling traffic as it
     leaves an MPLS domain

   MPLS ingress node

     An MPLS edge node in its role in handling traffic as it
     enters an MPLS domain

   MPLS label

     A label placed in a short MPLS shim header used to identify
     streams

   MPLS node

     A node which is running MPLS. An MPLS node will be aware of
     MPLS control protocols, will operate one or more L3 routing
     protocols, and will be capable of forwarding packets based on
     labels. An MPLS node may optionally be also capable of
     forwarding native L3 packets.

   MultiProtocol Label Switching

     An IETF working group and the effort associated with the
     working group

   network layer

     Synonymous with layer 3

   shortcut VC

     A VC set up as a result of an NHRP query and response

   stack

     Synonymous with label stack

   stream

     An aggregate of one or more flows, treated as one aggregate
     for the purpose of forwarding in L2 and/or L3 nodes (e.g.,
     may be described using a single label). In many cases a stream
     may be the aggregate of a very large number of flows.
     Synonymous with "aggregate stream".

   stream merge

     The merging of several smaller streams into a larger stream,
     such that for some or all of the path the larger stream can
     be referred to using a single label.

   switched path

     Synonymous with label switched path

   virtual circuit

     A circuit used by a connection-oriented layer 2 technology
     such as ATM or Frame Relay, requiring the maintenance of
     state information in layer 2 switches.

   VC merge

     Stream merge when it is specifically applied to VCs,
     specifically so as to allow multiple VCs to merge into one
     single VC

   VP merge

     Stream merge when it is applied to VPs, specifically so as
     to allow multiple VPs to merge into one single VP. In this
     case the VCIs need to be unique. This allows cells from
     different sources to be distinguished via the VCI.

   VPI/VCI

     A label used in ATM networks to identify circuits

1.4 Acronyms and Abbreviations

   DLCI            Data Link Circuit Identifier

   FEC             Forwarding Equivalence Class

   ISP             Internet Service Provider

   LIB             Label Information Base

   LDP             Label Distribution Protocol

   L2              Layer 2

   L3              Layer 3

   LSP             Label Switched Path

   LSR             Label Switching Router

   MPLS            MultiProtocol Label Switching

   MPT             Multipoint to Point Tree

   NHC             Next Hop (NHRP) Client

   NHS             Next Hop (NHRP) Server

   VC              Virtual Circuit

   VCI             Virtual Circuit Identifier

   VPI             Virtual Path Identifier

2. Discussion of Core MPLS Components

2.1 The Basic Routing Approach

   Routing is accomplished through the use of standard L3 routing
   protocols, such as OSPF and BGP.  The information maintained by the
   L3 routing protocols is then used to distribute labels to neighboring
   nodes that are used in the forwarding of packets as described below.
   In the case of ATM networks, the labels that are distributed are
   VPI/VCIs and a separate protocol (i.e., PNNI) is not necessary for
   the establishment of VCs for IP forwarding.

   The topological scope of a routing protocol (i.e., its routing
   domain) and the scope of MPLS-capable nodes may be different.
   For example, MPLS-knowledgeable and MPLS-ignorant nodes, all of
   which are OSPF routers, may be co-resident in an area. Where
   neighboring routers support MPLS, labels can be exchanged and used.

   Neighboring MPLS routers may use configured PVCs or PVPs to tunnel
   through non-participating ATM or FR switches.

2.2 Labels

   In addition to the single routing protocol approach discussed above,
   the other key concept in the basic MPLS approach is the use of short
   fixed-length labels to simplify user data forwarding.

2.2.1 Label Semantics

   It is important that the MPLS solutions are clear about what
   semantics (i.e., what knowledge of the state of the network) are
   implicit in the use of labels for forwarding user data packets or
   cells.

   At the simplest level, a label may be thought of as nothing more than
   a shorthand for the packet header, in order to index the forwarding
   decision that a router would make for the packet. In this context,
   the label is nothing more than a shorthand for an aggregate stream of
   user data.

   This observation leads to one possible very simple interpretation
   that the "meaning" of the label is a strictly local issue between two
   neighboring nodes. With this interpretation: (i) MPLS could be
   employed between any two neighboring nodes for forwarding of data
   between those nodes, even if no other nodes in the network
   participate in MPLS; (ii) When MPLS is used between more than two
   nodes, then the operation between any two neighboring nodes could be
   interpreted as independent of the operation between any other pair of
   nodes. This approach has the advantage of semantic simplicity, and of
   being the closest to pure datagram forwarding. However this approach
   (like pure datagram forwarding) has the disadvantage that when a
   packet is forwarded it is not known whether the packet is being
   forwarded into a loop, into a black hole, or towards links which have
   inadequate resources to handle the traffic flow. These disadvantages
   are inherent in pure datagram forwarding, but become optional design
   choices when label switching is being used.

   There are cases where it would be desirable to have additional
   knowledge implicit in the existence of the label. For example, one
   approach to avoiding loops (see section x.x below) involves signaling
   the label distribution along a path before packets are forwarded on
   that path. With this approach the fact that a node has a label to use
   for a particular IP packet would imply the knowledge that following
   the label (including label swapping at subsequent nodes) leads to a
   non-looping path which makes progress towards the destination
   (something which is usually, but not necessarily always true when
   using pure datagram routing). This would of course require some sort
   of label distribution/setup protocol which signals along the path
   being setup before the labels are available for packet forwarding.

   However, there are also other consequences to having additional
   semantics associated with the label: specifically, procedures are
   needed to ensure that the semantics are correct. For example, if the
   fact that you have a label for a particular destination implies that
   there is a loop-free path, then when the path changes some procedures
   are required to ensure that it is still loop free. Another example of
   semantics which could be implicit in a label is the identity of the
   higher level protocol type which is encoded using that label value.

   In either case, the specific value of a label to use for a stream is
   strictly a local issue; however the decision about whether to use the
   label may be based on some global (or at least wider scope) knowledge
   that, for example, the label-switched path is loop-free and/or has
   the appropriate resources.

   A similar example occurs in ATM networks: With standard ATM a
   signaling protocol is used which both reserves resources in switches
   along the path, and which ensures that the path is loop-free and
   terminates at the correct node. Thus implicit in the fact that an ATM
   node has a VPI/VCI for forwarding a particular piece of data is the
   knowledge that the path has been set up successfully.

   Another similar example occurs with multipoint to point trees over
   ATM (see section xx below), where the multipoint to point tree uses a
   VP, and cell interleave at merge points in the tree is handled by
   giving each source on the tree a distinct VCI within the VP. In this
   case, the fact that each source has a known VPI/VCI to use needs to
   (implicitly or explicitly) imply the knowledge that the VCI assigned
   to that source is unique within the context of the VP.

   In general labels are used to optimize how the system works, not to
   control how the system works. For example, the routing protocol
   determines the path that a packet follows. The presence or absence of
   a label assignment should not affect the path of an L3 packet. Note
   however that the use of labels may make capabilities such as explicit
   routes, loadsharing, and multipath more efficient.

2.2.2 Label Granularity

   Labels are used to create a simple forwarding paradigm.  The
   essential element in assigning a label is that the device which will
   be using the label to forward packets will be forwarding all packets
   with the same label in the same way.  If the packet is to be
   forwarded solely by looking at the label, then at a minimum, all
   packets with the same incoming label must be forwarded out the same
   port(s) with the same encapsulation(s), and with the same next hop
   label (if any).

   The term "forwarding equivalence class" is used to refer to a set of
   L3 packets which are all forwarded in the same manner by a particular
   LSR (for example, the IP packets in a forwarding equivalence class
   may be destined for the same egress from an MPLS network, and may be
   associated with the same QoS class). A forwarding equivalence class
   is therefore the set of L3 packets which could safely be mapped to
   the same label. Note that there may be reasons that packets from a
   single forwarding equivalence class may be mapped to multiple labels
   (e.g., when stream merge is not used).

   Note that the label could also mean "ignore this label and forward
   based on what is contained within," where within one might find a
   label (if a stack of labels is used) or a layer 3 packet.

   For IP unicast traffic, the granularity of a label allows various
   levels of aggregation in a Label Information Base (LIB).  At one end
   of the spectrum, a label could represent a host route (i.e. the full
   32 bits of IP address).  If a router forwards an entire CIDR prefix
   in the same way, it may choose to use one label to represent that
   prefix.  Similarly if the router is forwarding several (otherwise
   unrelated) CIDR prefixes in the same way it may choose to use the
   same label for this set of prefixes.  For instance all CIDR prefixes
   which share the same BGP Next Hop could be assigned the same label.
   Taking this to the limit, an egress router may choose to advertise
   all of its prefixes with the same label.
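
   As an illustrative sketch of the coarser of these choices, the
   following fragment binds one label to each BGP Next Hop, so that
   every CIDR prefix routed via the same next hop shares a label (all
   names, table sizes, and the reserved-value convention here are
   assumptions of this sketch, not part of any specification):

      #include <stdint.h>

      #define MAX_NEXTHOPS 1024

      struct nh_binding {
          uint32_t bgp_next_hop;  /* IPv4 address of the next hop */
          uint32_t label;         /* label shared by its prefixes */
          int      in_use;
      };

      static struct nh_binding nh_tab[MAX_NEXTHOPS];
      static uint32_t next_free_label = 16;  /* skip low values */

      /* Return the label bound to a BGP next hop, creating the
         binding on demand; 0 means "no label available". */
      uint32_t label_for_next_hop(uint32_t bgp_next_hop)
      {
          int i, free_slot = -1;

          for (i = 0; i < MAX_NEXTHOPS; i++) {
              if (nh_tab[i].in_use &&
                  nh_tab[i].bgp_next_hop == bgp_next_hop)
                  return nh_tab[i].label;
              if (!nh_tab[i].in_use && free_slot < 0)
                  free_slot = i;
          }
          if (free_slot < 0)
              return 0;                      /* table full */
          nh_tab[free_slot].bgp_next_hop = bgp_next_hop;
          nh_tab[free_slot].label        = next_free_label++;
          nh_tab[free_slot].in_use       = 1;
          return nh_tab[free_slot].label;
      }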

   By introducing the concept of an egress identifier, the distribution
   of labels associated with groups of CIDR prefixes can be simplified.
   For instance, an egress identifier might specify the BGP Next Hop,
   with all prefixes routed to that next hop receiving the label
   associated with that egress identifier.  Another natural place to
   aggregate would be the MPLS egress router.  This would work
   particularly well in conjunction with a link-state routing protocol,
   where the association between egress router and CIDR prefix is
   already distributed throughout an area.

   For IP multicast, the natural binding of a label would be to a
   multicast tree, or rather to the branch of a tree which extends from
   a particular port.  Thus for a shared tree, the label corresponds to
   the multicast group, (*,G).  For (S,G) state, the label would
   correspond to the source address and the multicast group.

   A label can also have a granularity finer than a host route.  That
   is, it could be associated with some combination of source and
   destination address or other information within the packet.  This
   might for example be done on an administrative basis to aid in
   effecting policy.  A label could also correspond to all packets which
   match a particular Integrated Services filter specification.

   Labels can also represent explicit routes.  This use is semantically
   equivalent to using an IP tunnel with a complete explicit route. This
   is discussed in more detail in section 4.10.

2.2.3 Label Assignment

   Essential to label switching is the notion of binding between a label
   and Network Layer routing (routes).  A control component is
   responsible for creating label bindings, and then distributing the
   label binding information among label switches. Label assignment
   involves allocating a label, and then binding a label to a route.

   Label assignment can be driven by control traffic or by data traffic.
   This is discussed in more detail in section 3.4.

   Control traffic driven label assignment has several advantages, as
   compared to data traffic driven label assignment. For one thing, it
   minimizes the amount of additional control traffic needed to
   distribute label binding information, as label binding information is
   distributed only in response to control traffic, independent of data
   traffic. It also makes the overall scheme independent of and
   insensitive to the data traffic profile/pattern. Control traffic
   driven creation of label binding improves forwarding latency, as
   labels are assigned before data traffic arrives, rather than being
   assigned as data traffic arrives. It also simplifies the overall
   system behavior, as the control plane is controlled solely by control
   traffic, rather than by a mix of control and data traffic.

   There are however situations where data traffic driven label
   assignment is necessary.  A particular case may occur with ATM
   without VP or VC merge. In this case, setting up a full mesh of VCs
   would require n-squared VCs, which may be infeasible in very large
   networks. Instead, VCs may be set up where required for forwarding
   data traffic. In this case it is generally not possible to know a
   priori how many such streams may occur.

   Label withdrawal is required with both control-driven and data-driven
   label assignment. Label withdrawal is primarily a matter of garbage
   collection, that is, collecting unused labels so that they may be
   reassigned.  Generally speaking, a label should be withdrawn when the
   conditions that allowed it to be assigned are no longer true. For
   example, if a label is imbued with extra semantics such as
   loop-freeness, then the label must be withdrawn when those extra
   semantics cease to hold.

   In certain cases, notably multicast, it may be necessary to share a
   label space between multiple entities.  If these sharing arrangements
   are altered by the coming and going of neighbors, then labels which
   are no longer controlled by an entity must be withdrawn and a new
   label assigned.
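
   Continuing the hypothetical next-hop table sketched in section
   2.2.2, withdrawal amounts to invalidating the binding (and, in a
   real system, signaling neighbors) before the label value is reused:

      /* Withdraw the label bound to a next hop once the conditions
         that justified the binding no longer hold. */
      void withdraw_label_for_next_hop(uint32_t bgp_next_hop)
      {
          int i;

          for (i = 0; i < MAX_NEXTHOPS; i++) {
              if (nh_tab[i].in_use &&
                  nh_tab[i].bgp_next_hop == bgp_next_hop) {
                  /* Neighbors must be told the binding is void
                     before the label is made available again. */
                  nh_tab[i].in_use = 0;
                  return;
              }
          }
      }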

2.2.4 Label Stack and Forwarding Operations

   The basic forwarding operation consists of looking up the incoming
   label to determine the outgoing label, encapsulation, port, and any
   additional information which may pertain to the stream such as a
   particular queue or other QoS related treatment.  We refer to this
   operation as a label swap.
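
   A minimal sketch of this operation, assuming a table indexed
   directly by the incoming label (all names and sizes here are
   illustrative):

      #include <stddef.h>
      #include <stdint.h>

      struct swap_entry {
          uint32_t out_label;  /* outgoing label                  */
          uint16_t out_port;   /* outgoing port                   */
          uint8_t  encap;      /* outgoing encapsulation selector */
          uint8_t  qos_queue;  /* queue / QoS treatment, if any   */
          uint8_t  valid;
      };

      #define LABEL_SPACE 4096
      static struct swap_entry lib[LABEL_SPACE];  /* the LIB */

      /* Label swap: one direct-index lookup on the incoming label
         yields everything needed to forward the packet. */
      const struct swap_entry *label_swap(uint32_t in_label)
      {
          if (in_label >= LABEL_SPACE || !lib[in_label].valid)
              return NULL;  /* no switched path: drop or go to L3 */
          return &lib[in_label];
      }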

   When a packet first enters an MPLS domain, the packet is forwarded by
   normal layer 3 forwarding operations with the exception that the
   outgoing encapsulation will now include a label.  We refer to this
   operation as a label push.  When a packet leaves an MPLS domain, the
   label is removed.  We refer to this as a label pop.

   In some situations, carrying a stack of labels is useful.  For
   instance both IGP and BGP label could be used to allow routers in the
   interior of an AS to be free of BGP information.  In this scenario,
   the "IGP" label is used to steer the packet through the AS and the
   "BGP" label is used to switch between ASes.

   With a label stack, the set of label operations remains the same,
   except that at some points one might push or pop multiple labels, or
   pop & swap, or swap & push.

2.3 Encapsulation

   Label-based forwarding makes use of various pieces of information,
   including a label or stack of labels, and possibly additional
   information such as a TTL field. In some cases this information may
   be encoded using an MPLS header, in other cases this information may
   be encoded in L2 headers. Note that there may be multiple types of
   MPLS headers. For example, the header used over one media type may be
   different from the header used over another media type. Similarly, in
   some cases the information that MPLS makes use of may be encoded in
   an ATM header. We will use the term "MPLS encapsulation" to refer to
   whatever form is used to encapsulate the label information and other
   information used for label based forwarding. The term "MPLS header"
   will be used where this information is carried in some sort of MPLS-
   specific header (i.e., when the MPLS information cannot all be
   carried in a L2 header). Whether there is one or multiple forms of
   possible MPLS headers is also outside of the scope of this document.

   The exact contents of the MPLS encapsulation is outside of the scope
   of this document. Some fields, such as the label, are obviously
   needed. Some others might or might not be standardized, based on
   further study. An encapsulation scheme may make use of the following
   fields:

     -  label
     -  TTL
     -  class of service
     -  stack indicator
     -  next header type indicator
     -  checksum

   It is desirable to have a very short encapsulation header.  For
   example, a four byte encapsulation header adds to the convenience of
   building a hardware implementation that forwards based on the
   encapsulation header. At the same time, however, it is tricky to fit
   the information listed above into such a limited number of bits.
   Hence careful consideration must be given to the information chosen
   for an MPLS header.
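
   Purely to show how tight the budget is, here is one hypothetical
   packing of the fields above into four octets; the actual encoding
   is, as stated, outside the scope of this document:

      #include <stdint.h>

      /* Hypothetical 32-bit header: label (20 bits) | COS (3) |
         stack indicator (1) | TTL (8).  Nothing is left for a next
         header type indicator or a checksum; those would have to be
         implicit in the label or left to L2, as discussed below. */
      static uint32_t mk_hdr(uint32_t label, uint8_t cos,
                             uint8_t bottom_of_stack, uint8_t ttl)
      {
          return (label & 0xFFFFFu) << 12 |
                 (uint32_t)(cos & 0x7) << 9 |
                 (uint32_t)(bottom_of_stack & 0x1) << 8 |
                 ttl;
      }

      static uint32_t hdr_label(uint32_t h) { return h >> 12; }
      static uint8_t  hdr_ttl(uint32_t h)   { return (uint8_t)h; }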

   A TTL value in the MPLS header may be useful in the same manner as it
   is in IP. Specifically, TTL may be used to terminate packets caught
   in a routing loop, and for other related uses such as traceroute. The
   TTL mechanism is a simple and proven method of handling such events.
   Another use of TTL is to expire packets in a network by limiting
   their "time to live" and eliminating stale packets that may cause
   problems for some of the higher layer protocols. When used over link
   layers which do not provide a TTL field, alternate mechanisms will be
   needed to replace the uses of the TTL field.
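
   Under the hypothetical four-octet header sketched above, TTL
   handling at a label switched hop parallels IP forwarding:

      /* Return the rewritten header, or 0 if the TTL has expired
         and the packet must be discarded (sketch; 0 serves as an
         in-band error value for brevity, and hdr_ttl() comes from
         the illustration above). */
      uint32_t decrement_ttl(uint32_t hdr)
      {
          uint8_t ttl = hdr_ttl(hdr);

          if (ttl <= 1)
              return 0;   /* expired: drop, as an IP router would */
          return (hdr & ~0xFFu) | (uint32_t)(ttl - 1);
      }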

   A provision for a class of service (COS) field in the MPLS header
   allows multiple service classes within the same label.  However, when
   more sophisticated QoS is associated with a label, the COS may not
   have any significance.  Alternatively, the COS (like QoS) can be left
   out of the header, and instead propagated with the label assignment,
   but this entails that a separate label be assigned to each required
   class of service.  Nevertheless, the COS mechanism provides a simple
   method of segregating flows within a label.

   As previously mentioned, the encapsulation header can be used to
   derive the benefits of tunneling (or stacking).

   The MPLS header must provide a way to indicate that multiple MPLS
   headers are stacked (i.e., the "stack indicator").  For this purpose
   a single bit in the MPLS header will suffice. In addition, there are
   also some benefits to indicating the type of the protocol header
   following the MPLS header (i.e., the "next header type indicator").
   One option would be to combine the stack indicator and next header
   type indicator into a single value (i.e., the next header type
   indicator could be allowed to take the value "MPLS header"). Another
   option is to have the next header type indicator be implicit in the
   label value (such that this information would be propagated along
   with the label).

   There is no compelling reason to support a checksum field in the MPLS
   header. A CRC mechanism at the L2 layer should be sufficient to
   ensure the integrity of the MPLS header.

3. Observations, Issues and Assumptions

3.1 Layer 2 versus Layer 3 Forwarding

   MPLS uses L2 forwarding as a way to provide simple and fast packet
   forwarding capability.  One primary reason for the simplicity of L2
   forwarding comes from its short, fixed-length labels.  A node
   forwarding at L3 must parse a (relatively) large header, and perform
   a longest-prefix match to determine a forwarding path.  However, when
   a node performs L2 label swapping, and labels are assigned properly,
   it can do a direct index lookup into its forwarding (or in this case,
   label-swapping) table with the short header. It is arguably simpler
   to build label swapping hardware than it is to build L3 forwarding
   hardware because the label swapping function is less complex.
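
   To make the contrast concrete, compare the single array index of
   the label_swap() sketch in section 2.2.4 with a deliberately naive
   longest-prefix match (real routers use a trie or similar, but even
   then the lookup touches several nodes):

      #include <stdint.h>

      struct route {
          uint32_t prefix;    /* prefix bits (host byte order) */
          uint8_t  plen;      /* prefix length, 0..32          */
          uint16_t out_port;
      };

      /* L3: scan for the longest matching prefix. */
      int lpm_lookup(const struct route *tab, int n, uint32_t dst)
      {
          int i, best = -1, best_len = -1;

          for (i = 0; i < n; i++) {
              uint32_t mask =
                  tab[i].plen ? ~0u << (32 - tab[i].plen) : 0;
              if ((dst & mask) == tab[i].prefix &&
                  tab[i].plen > best_len) {
                  best = i;
                  best_len = tab[i].plen;
              }
          }
          return best;  /* index of best route, or -1 */
      }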

   The relative performance of L2 and L3 forwarding may differ
   considerably between nodes. Some nodes may exhibit an order of
   magnitude difference. Other nodes (for example, nodes with more
   extensive L3 forwarding hardware) may have identical performance at
   L2 and L3. However, some nodes may not be capable of doing a L3
   forwarding at all (e.g. ATM), or have such limited capacity as to be
   unusable at L3.  In this situation, traffic must be blackholed if no
   switched path exists.

   On nodes in which L3 forwarding is slower than L2 forwarding, pushing
   traffic to L3 when no L2 path is available may cause congestion. In
   some cases this could cause data loss (since L3 may be unable to keep
   up with the increased traffic). However, if data is discarded, then
   in general this will cause TCP to backoff, which would allow control
   traffic, traceroute and other network management tools to continue to
   work.

   The MPLS protocol MUST NOT make assumptions about the forwarding
   capabilities of an MPLS node.  Thus, MPLS must propose solutions
   that can leverage the benefits of a node that is capable of L3
   forwarding, but must not mandate that the node be capable of such.

   Why We Will Still Need L3 Forwarding:

   MPLS will not, and is not intended to, replace L3 forwarding. There
   is absolutely a need for some systems to continue to forward IP
   packets using normal Layer 3 IP forwarding. L3 forwarding will be
   needed for a variety of reasons, including:

     -  For scaling; to forward on a finer granularity than the labels
        can provide
     -  For security; to allow packet filtering at firewalls.
     -  For forwarding at the initial router (when hosts don't do MPLS)

   Consider a campus network which is serving a small company. Suppose
   that this company makes use of the Internet, for example as a
   method of communicating with customers. A customer on the other side
   of the world has an IP packet to be forwarded to a particular system
   within the company. It is not reasonable to expect that the customer
   will have a label to use to forward the packet to that specific
   system. Rather, the label used for the "first hop" forwarding might
   be sufficient to get the packet considerably closer to the
   destination. However, the granularity of the labels cannot extend to
   every host worldwide. Similarly, routing used within one routing
   domain cannot know about every host worldwide. This implies that in
   many cases the labels assigned to a particular packet will be
   sufficient to get the packet close to the destination, but that at
   some points along the path of the packet the IP header will need to
   be examined to determine a finer granularity for forwarding that
   packet. This is particularly likely to occur at domain boundaries.

   A similar point occurs at the last router prior to the destination
   host. In general, the number of hosts attached to a network is likely
   to be great enough that it is not feasible to assign a separate label
   to every host. Rather, at least for routing within the destination
   routing domain (or the destination area if there is a hierarchical
   routing protocol in use) a label may be assigned which is sufficient
   to get the packet to the last hop router. However, the last hop
   router will need to examine the IP header (and particularly the
   destination IP address) in order to forward the packet to the correct
   destination host.

   Packet filtering at firewalls is an important part of the operation
   of the Internet. While the current state of Internet security may be
   considerably less advanced than may be desired, nonetheless some
   security (as is provided by firewalls) is much better than no
   security. We expect that packet filtering will continue to be
   important for the foreseeable future. Packet filtering requires
   examination of the contents of the packet, including the IP header.
   This implies that at firewalls the packet cannot be forwarded simply
   by considering the label associated with the packet. Note that this
   is also likely to occur at domain boundaries.

   Finally, it is very likely that many hosts will not implement MPLS.
   Rather, the host will simply forward an IP packet to its first hop
   router. This first hop router will need to examine the IP header
   prior to forwarding the packet (with or without a label).

3.2 Scaling Issues

   MPLS scalability is provided by two of the principles of routing.
   The first is that forwarding follows an inverted tree rooted at a
   destination.  The second is that the number of destinations is
   reduced by routing aggregation.

   The very nature of IP forwarding is a merged multipoint-to-point
   tree. Thus, since MPLS mirrors the IP network layer, an MPLS node
   that is capable of merging is capable of creating O(n) switched paths
   which provide network reachability to all "n" destinations.  The
   meaning of "n" depends on the granularity of the switched paths.  One
   obvious choice of "n" is the number of CIDR prefixes existing in the
   forwarding table (this scales the same as today's routing). However,
   the value of "n" may be reduced considerably by choosing more highly
   aggregated switched paths. For example, by creating switched paths
   to each possible egress node, "n" may represent the number of egress
   nodes in a network. This choice creates "n" switched paths, such that
   each path is shared by all CIDR prefixes that are routed through the
   same egress node.  This selection greatly improves scalability, since
   it minimizes "n", but at the same time maintains the same switching
   performance as CIDR aggregation. (See section 2.2.2 for a description
   of all of the levels of granularity provided by MPLS).
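
   As a rough illustration of this granularity tradeoff (the prefixes
   and egress names below are purely hypothetical), a minimal Python
   sketch can count the switched paths needed at per-prefix versus
   per-egress granularity:

      # Hypothetical forwarding table: CIDR prefix -> egress node.
      forwarding_table = {
          "10.1.0.0/16":    "egress-A",
          "10.2.0.0/16":    "egress-A",
          "192.168.1.0/24": "egress-B",
          "172.16.0.0/12":  "egress-B",
      }

      # Per-prefix granularity: one switched path per CIDR prefix.
      n_per_prefix = len(forwarding_table)

      # Per-egress granularity: one switched path per egress node; all
      # prefixes routed through the same egress share a single path.
      n_per_egress = len(set(forwarding_table.values()))

      print("per-prefix paths:", n_per_prefix)    # 4
      print("per-egress paths:", n_per_egress)    # 2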

   The MPLS technology must scale at least as well as existing
   technology. For example, if the MPLS technology were to support ONLY
   host-to-host switched path connectivity, then the number of
   switched-paths would be much higher than the number of routing table
   entries.

   There are several ways in which merging can be done in order to allow
   O(n) switched paths to connect n nodes. The merging approach used has
   an impact on the amount of state information, buffering, delay
   characteristics, and the means of control required to coordinate the
   trees. These issues are discussed in more detail in section 4.2.

   There are some cases in which O(n-squared) switched paths may be used
   (for example, by setting up a full mesh of point to point streams).
   As label space and the amount of state information that can be
   supported may be limited, it will not be possible to support
   O(n-squared) switched paths in very large networks. However, in some
   cases the use of n-squared paths may even be an advantage (for
   example, to allow load-splitting of individual streams).

   MPLS must be designed to scale as O(n). O(n) scaling allows MPLS
   domains to grow very large. In addition, if best effort
   service can be supported with O(n) scaling, this conserves resources
   (such as label space and state information) which can be used for
   supporting advanced services such as QoS. However, since some
   switches may not support merging, and some small networks may not
   require the scaling benefits of O(n), provisions must also be
   provided for a non-merging, O(n-squared) solution.

   Note: A precise and complete description of scaling would consider
   that there are multiple dimensions of scaling, and multiple resources
   whose usage may be considered. Possible dimensions of scaling
   include: (i) the total number of streams which exist in an MPLS
   domain (with associated labels assigned to them); (ii) the total
   number of "label swapping pairs" which may be stored in the nodes of
   the network (i.e., entries of the form "for incoming label 'x', use
   outgoing label 'y'"); (iii) the number of labels which need to be
   assigned for use over a particular link; (iv) The amount of state
   information which needs to be maintained by any one node. We do not
   intend to perform a complete analysis of all possible scaling issues,
   and understand that our use of the terms "O(n)" and "O(n-squared)" is
   approximate only.

3.3 Types of Streams

   Switched paths in the MPLS network can be of different types:

     -  point-to-point
     -  multipoint-to-point
     -  point-to-multipoint
     -  multipoint-to-multipoint

   Two of the factors that determine which type of switched path is used
   are (i) The capability of the switches employed in a network; (ii)
   The purpose of the creation of a switched path; that is, the types of
   flows to be carried in the switched path.  These two factors also
   determine the scalability of a network in terms of the number of
   switched paths in use for transporting data through a network.

   The point-to-point switched path can be used to connect all ingress
   nodes to all the egress nodes to carry unicast traffic.  In this
   case, since an ingress node has point-to-point connections to all the
   egress nodes, the number of connections in use for transporting
   traffic is of O(n-squared), where n is the number of edge MPLS
   devices.  For small networks the full mesh connection approach may
   suffice and not pose any scalability problems.  However, in large
   enterprise backbone or ISP networks, this will not scale well.

   Point-to-point switched paths may be used on a host-to-host or
   application to application basis (e.g., a switched path per RSVP
   flow). The dedicated point-to-point switched path transports the
   unicast data from the ingress to the egress node of the MPLS network.
   This approach may be used for providing QoS services or for best-
   effort traffic.

   A multipoint-to-point switched path connects all ingress nodes to a
   single egress node. At a given intermediate node in the
   multipoint-to-point switched path, L2 data units from several upstream links
   are "merged" into a single label on a downstream link.  Since each
   egress node is reachable via a single multipoint-to-point switched
   path, the number of switched paths required to transport best-effort
   traffic through a MPLS network is O(n), where n is the number of
   egress nodes.

   The point-to-multipoint switched path is used for distributing
   multicast traffic. This switched path tree mirrors the multicast
   distribution tree as determined by the multicast routing protocols.
   Typically a switch capable of point-to-multipoint connection
   replicates an L2 data unit from the incoming (parent) interface to
   all the outgoing (child) interfaces. Standard ATM switches support
   such functionality in the form of point-to-multipoint VCs or VPs.

   A multipoint-to-multipoint switched path may be used to combine
   multicast traffic from multiple sources into a single multicast
   distribution tree.  The advantage of this is that the multipoint-to-
   multipoint switched path is shared by multiple sources. Conceptually,
   a form of multipoint-to-multipoint can be thought of as follows:
   Suppose that you have a point to multipoint VC from each node to all
   other nodes. Suppose that at any point where two or more VCs happen to
   merge, you merge them into a single VC or VP. This would require
   either coordination of VCI spaces (so that each source has a unique
   VCI within a VP) or VC merge capabilities. The applicability of
   similar concepts to MPLS is FFS.

3.4 Data Driven versus Control Traffic Driven Label Assignment

   A fundamental concept in MPLS is the association of labels and
   network layer routing. Each LSR must assign labels, and distribute
   them to its forwarding peers, for traffic which it intends to forward
   by label swapping.  In the various contributions that have been made
   so far to the MPLS WG we identify three broad strategies for label
   assignment: (i) those driven by topology based control traffic
   [TAG][8][IP navigator]; (ii) those driven by request based control
   traffic [RSVP]; and (iii) those driven by data traffic
   [CSR][Ipsilon].

   We also note that in actual practice combinations of these methods
   may be employed. One example is the use of topology based methods for
   best effort traffic plus request based methods for support of RSVP.

3.4.1 Topology Driven Label Assignment

   In this scheme labels are assigned in response to normal processing
   of routing protocol control traffic. Examples of such control
   protocols are OSPF and BGP. As an LSR processes OSPF or BGP updates
   it can, as it makes or changes entries in its forwarding tables,
   assign labels to those entries.

   Among the properties of this scheme are:

   - The computational load of assignment and distribution and the
     bandwidth consumed by label distribution are bounded by the size
     of the network.

   - Labels are in the general case preassigned. If a route exists then
     a label has been assigned to it (and distributed). Traffic may be
     label swapped as soon as it arrives; there is no label setup
     latency at forwarding time.

   - Requires LSRs to process only the control traffic load.

   - Labels assigned in response to the operation of routing protocols
     can have a granularity equivalent to that of the routes advertised
     by the protocol. Labels can, by this means, cover (highly)
     aggregated routes.
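
   As a rough illustration only (this is not a proposal for the actual
   LDP machinery; the label pool and the peer advertisement stub below
   are hypothetical), the following Python sketch assigns and
   advertises a label as each forwarding entry is created:

      import itertools

      label_pool = itertools.count(100)  # hypothetical free label space
      fib = {}        # prefix -> next hop, built from OSPF/BGP updates
      bindings = {}   # prefix -> locally assigned label

      def distribute_binding(prefix, label):
          # Stand-in for advertising the binding to peer LSRs.
          print("advertise: prefix %s -> label %d" % (prefix, label))

      def process_route_update(prefix, next_hop):
          """Called as the LSR processes a routing protocol update."""
          fib[prefix] = next_hop
          if prefix not in bindings:
              # The label is assigned as the entry is made, so it is in
              # place before any data traffic arrives.
              bindings[prefix] = next(label_pool)
              distribute_binding(prefix, bindings[prefix])

      process_route_update("10.0.0.0/8", "lsr-b")
      process_route_update("10.0.0.0/8", "lsr-c")  # route change: label kept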

3.4.2 Request Driven Label Assignment

   In this scheme labels are assigned in response to normal processing
   of request based control traffic. An example of such a control
   protocol is RSVP. As an LSR processes RSVP messages it can, as it
   makes or changes entries in its forwarding tables, assign labels to
   those entries.

   Among the properties of this scheme are:

   - The computational load of assignment and distribution and the
     bandwidth consumed by label distribution are bounded by the
     amount of control traffic in the system.

   - Labels are in the general case preassigned. If a route exists
     then a label has been assigned to it (and distributed). Traffic
     may be label swapped as soon as it arrives; there is no label
     setup latency at forwarding time.

   - Requires LSRs to process only the control traffic load.

   - Depending upon the number of flows supported, this approach may
     require a larger number of labels to be assigned compared with
     topology driven assignment.

   - This approach requires applications to make use of a request
     paradigm in order to get a label assigned to their flow.

3.4.3 Traffic Driven Label Assignment

   In this scheme the arrival of data at an LSR "triggers" label
   assignment and distribution. The traffic driven approach has the
   following characteristics.

   - Label assignment and distribution costs are a function of
     traffic patterns. In an LSR with limited label space that is
     using a traffic driven approach to amortize its labels over a
     larger number of flows, the overhead due to label assignment
     and distribution grows as a function of the number of flows
     and as a function of their "persistence". Short lived but
     recurring flows may impose a heavy control burden.

   - There is a latency associated with the appearance of a "flow"
     and the assignment of a label to it. The documented approaches
      to this problem suggest L3 forwarding during this setup phase;
      this has the potential for packet reordering (note that packet
     reordering may occur with any scheme when the network topology
     changes, but traffic driven label assignment introduces another
     cause for reordering).

   - Flow driven label assignment requires high performance packet
     classification capabilities.

   - Traffic driven label assignment may be useful to reduce label
     consumption (assuming that flows are not close to full mesh).

   - If you want flows to hosts, then due to limits on label space
     traffic driven label assignment is probably necessary, given the
     large number of hosts which may occur in a network.

   - If you want to assign specific network resources to specific
     labels, to be used for support of application flows, then again
     the fine granularity associated with labels may require data
     driven label assignment.
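
   To make the trigger concrete, the following minimal Python sketch
   (the flow cache and the forwarding stubs are hypothetical) shows the
   first packet of a flow triggering label assignment while being
   forwarded at L3, with subsequent packets label swapped:

      import itertools

      label_pool = itertools.count(100)  # hypothetical label space
      flow_cache = {}                    # (src, dst) -> assigned label

      def forward_l3(src, dst):
          print("L3 forward %s -> %s (label being set up)" % (src, dst))

      def forward_l2(label):
          print("L2 label swap on label %d" % label)

      def on_packet(src, dst):
          """Called per arriving packet; assignment is data driven."""
          flow = (src, dst)
          if flow not in flow_cache:
              # The first packet of the flow "triggers" label
              # assignment; it must itself be forwarded at L3 during
              # the setup phase (the setup latency noted above).
              flow_cache[flow] = next(label_pool)
              forward_l3(src, dst)
          else:
              forward_l2(flow_cache[flow])

      on_packet("h1", "h2")   # triggers assignment, goes via L3
      on_packet("h1", "h2")   # subsequent packets are label swapped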

3.5 The Need for Dealing with Looping

   Routing protocols which are used in conjunction with MPLS will in
   many cases be based on distributed computation. As such, during
   routing transients, these protocols may compute forwarding paths
   which contain loops. For this reason MPLS will be designed with
   mechanisms to prevent the formation of loops and/or contain
   the amount of resources that can be consumed due to the presence of
   loops.

   Note that there are a number of different alternative mechanisms
   which have been proposed (see section 4.3). Some of these prevent the
   formation of layer 2 forwarding loops, others allow loops to form but
   minimize their impact in one way or another (e.g., by discarding
   packets which loop, or by detecting and closing the loop after a
   period of time). Generally speaking, there are tradeoffs to be made
   between the amount of looping which might occur, and other
   considerations such as the time to convergence after a change in the
   paths computed by the routing algorithm.

   We are not proposing any changes to normal layer 3 operation, and
   specifically are not trying to eliminate the possibility of looping
   at layer 3. Transient loops will continue to be possible in IP
   networks. Note that IP has a means to limit the damage done by
   looping packets, based on decrementing the IP TTL field as the packet
   is forwarded, and discarding packets whose TTL has expired. Dynamic
   routing protocols used with IP are also designed to minimize the
   amount of time during which loops exist.

   The question that MPLS has to deal with is what to do at L2. In some
   cases L2 may make use of the same method that is used at L3. However,
   other options are available at L2, and in some cases (specifically
   when operating over ATM or Frame Relay hardware) the method of
   decrementing a TTL field (or any similar field) is not available.

   There are basically two problems caused by packet looping: The most
   obvious problem is that packets are not delivered to the correct
   destination. The other result of looping is congestion. Even with TTL
   decrementing and packet discard, there may still be a significant
   amount of time that packets travel through a loop. This can adversely
   affect other packets which are not looping: Congestion due to the
   looping packets can cause non-looping packets to be delayed and/or
   discarded.

   Looping is particularly serious in (at least) three cases: One is
   when forwarding over ATM. Since ATM does not have a TTL field to
   decrement, there is no way to discard ATM cells which are looping
   over ATM subnetworks.  Standard ATM PNNI routing and signaling solves
   this problem by making use of call setup procedures which ensure that
   ATM VCs will never be set up in a loop [PNNI]. However, when MPLS is
   used over ATM subnets, the native ATM routing and signaling
   procedures may not be used for the full L2 path. This leads to the
   possibility that MPLS over ATM might in principle allow packets to
   loop indefinitely, or until L3 routing stabilizes. Methods are needed
   to prevent this problem.

   Another case in which looping can be particularly unpleasant is for
   multicast traffic. With multicast, it is possible that the packet may
   be delivered successfully to some destinations even though copies
   intended for other destinations are looping. This leads to the
   possibility that huge numbers of identical packets could be delivered
   to some destinations. Also, since multicast implies that packets are
   duplicated at some points in their path, the congestion resulting
   from looping packets may be particularly severe.

   Another unpleasant complication of looping occurs if the congestion
   caused by the loop interferes with the routing protocol. It is
   possible for the congestion caused by looping to cause routing
   protocol control packets to be discarded, with the result that the
   routing protocol becomes unstable. For example this could lengthen
   the duration of the loop.

   In normal operation of IP networks the impact of congestion is
   limited by the fact that TCP backs off (i.e., transmits substantially
   less traffic) in response to lost packets. Where the congestion is
   caused by looping, the combination of TTL and the resulting discard
   of looping packets, plus the reduction in offered traffic, can limit
   the resulting impact on the network. TCP backoff however does not
   solve the problem if the looping packets are not discarded (for
   example, if the loop is over an ATM subnetwork where TTL is not
   used).

   The severity of the problem caused by looping may depend upon
   implementation details. Suppose, for instance, that ATM switching
   hardware is being used to provide MPLS switching functions. If the
   ATM hardware has per-VC queuing, and if it is capable of providing
   fair access to the buffer pool for incoming cells based on the
   incoming VC (so that no one incoming VC is allowed to grab a
   disproportionate number of buffers), this looping might not have a
   significant effect on other traffic. If the ATM hardware cannot
   provide fair buffer access of this sort, however, then even transient
   loops may cause severe degradation of the node's total performance.

   Given that MPLS is a relatively new approach, it is possible that
   looping may have consequences which are not fully understood (such as
   looping of LDP control information in cases where stream merge is not
   used).

   Even if fair buffer access can be provided, it is still worthwhile to
   have some means of detecting loops that last "longer than possible".
   In addition, even where TTL and/or per-VC fair queuing provides a
   means for surviving loops, it still may be desirable where practical
   to avoid setting up LSPs which loop.

   Methods for dealing with loops are discussed in section 4.3.

3.6 Operations and Management

   Operations and management of networks is critically important. This
   implies that MPLS must support operations, administration,  and
   maintenance facilities at least as extensive as those supported in
   current IP networks.

   In most ways this is a relatively simple requirement to meet. Given
   that all MPLS nodes run normal IP routing protocols, it is
   straightforward to expect them to participate in normal IP network
   management protocols.

   One issue has been identified which needs to be addressed by the
   MPLS effort: the operation of Traceroute over MPLS networks. Note
   that other O&M
   issues may be identified in the future.

   Traceroute is a very commonly used network management tool.
   Traceroute is based on use of the TTL field: A station trying to
   determine the route from itself to a specified address transmits
   multiple IP packets, with the TTL field set to 1 in the first packet,
   2 in the second packet, etc. This causes each router along the path
   to send back an ICMP error report for TTL exceeded. This in turn
   allows the station to determine the set of routers along the route.
   For example, this can be used to determine where a problem exists (if
   no router responds past some point, the last router which responds
   can become the starting point for a search to determine the cause of
   the problem).
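
   The mechanism can be shown with a small self-contained simulation
   (real Traceroute of course uses raw IP packets and ICMP; the router
   names here are invented):

      # Simulated path from the probing station to the destination.
      path = ["r1", "r2", "r3", "dest"]

      def send_probe(ttl):
          """Each hop decrements the TTL; the hop at which it reaches
          zero sends back an ICMP 'TTL exceeded' report."""
          if ttl < len(path):
              return ("ttl-exceeded", path[ttl - 1])
          return ("reply", path[-1])

      ttl = 1
      while True:
          kind, node = send_probe(ttl)
          print("ttl=%d: %s from %s" % (ttl, kind, node))
          if kind == "reply":
              break
          ttl += 1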

   When MPLS is operating over ATM or Frame Relay networks there is no
   TTL field to decrement (and ATM and Frame Relay forwarding hardware
   does not decrement TTL). This implies that it is not straightforward
   to have Traceroute operate in this environment.

   There is the question of whether we *want* all routers along a path
   to be visible via traceroute. For example, an ISP probably doesn't
   want to expose the interior of their network to a customer. However,
   the issue of whether a network's policy will allow the interior of
   the network to be visible should be independent of whether it is
   possible for some users to see the interior of the network. Thus
   while there clearly should be the possibility of using policy
   mechanisms to block traceroute from being used to see the interior of
   the network, this does not imply that it is okay to develop protocol
   mechanisms which prevent traceroute from working.

   There is also the question of whether the interior of a MPLS network
   is analogous to a normal IP network, or whether it is closer to the
   interior of a layer 2 network (for example, an ATM subnet). Clearly
   IP traceroute cannot be used to expose the interior of an ATM subnet.
   When a packet is crossing an ATM subnetwork (for example, between an
   ingress and an egress router which are attached to the ATM subnet)
   traceroute can be used to determine the router to router path, but
   not the path through the ATM switches which comprise the ATM subnet.
   Note here that MPLS forms a sort of "in between" special case:
   Routing is based on normal IP routing protocols, the equivalent of
   call setup (label binding/exchange) is based on MPLS-specific
   protocols, but forwarding is based on normal L2 ATM forwarding. MPLS
   therefore supersedes the normal ATM-based methods that would be used
   to eliminate loops and/or trace paths through the ATM subnet.

   It is generally agreed that Traceroute is a relatively "ugly" tool,
   and that a better tool for tracing the route of a packet would be
   preferable. However, no better tool has yet been designed or even
   proposed. Also, however ugly Traceroute may be, it is nonetheless
   very useful, widely deployed, and widely used. In general, it is
   highly preferable to define, implement, and deploy a new tool, and to
   determine through experience that the new tool is sufficient, before
   breaking a tool which is as widely used as traceroute.

   Methods that may be used to either allow traceroute to be used in an
   MPLS network, or to replace traceroute, are discussed in section
   4.14.

4. Technical Approaches

   We believe that section 4 is probably less complete than other
   sections. Additional subsections are likely to be needed as a result
   of additional discussions in the MPLS working group.

4.1 Label Distribution

   A fundamental requirement in MPLS is that an LSR forwarding label
   switched traffic to another LSR apply a label to that traffic which
   is meaningful to the other (receiving the traffic) LSR. LSRs could
   learn about each other's labels in a variety of ways. We call the
   general topic "label distribution".

4.1.1 Explicit Label Distribution

   Explicit label distribution anticipates the specification by MPLS of
   a standard protocol for label distribution. Two of the possible
   approaches [1] [8] are oriented toward topology driven
   label distribution. One other approach [FANP], in contrast, makes use
   of traffic driven label distribution.

   We expect that the label distribution protocol (LDP) which emerges
   from the MPLS WG is likely to inherit elements from one or more of
   the possible approaches.

   Consider LSR A forwarding traffic to LSR B. We call A the upstream
   (with respect to data flow) LSR and B the downstream LSR. A must
   apply a label
   to the traffic that B "understands". Label distribution must ensure
   that the "meaning" of the label will be communicated between A and B.
   An important question is whether A or B (or some other entity)
   allocates the label.

   In this discussion we are talking about the allocation and
   distribution of labels between two peer LSRs that are on a single
   segment of what may be a longer path. A related but in fact entirely
   separate issue is the question of where control of the whole path
   resides. In essence there are two models; by analogy to upstream and
   downstream for a single segment we can talk about ingress and egress
   for an LSP (or to and from a label swapping "domain"). In one model a
   path is set up from ingress to egress; in the other, from egress to
   ingress.

4.1.1.1 Downstream Label Allocation

   "Downstream Label Allocation" refers to a method where the label
   allocation is done by the downstream LSR, i.e. the LSR that uses the
   label as an index into its switching tables.

   This is, arguably, the most natural label allocation/distribution
   mode for unicast traffic. As an LSR builds its routing tables (we
   consider here control driven allocation of labels) it is free, within
   some limits we will discuss, to allocate labels in any manner that
   may be convenient to the particular implementation. Since the labels
   that it allocates will be those upon which it subsequently makes
   forwarding decisions we assume implementations will perform the
   allocation in an optimal manner. Having allocated labels the default
   behavior is to distribute the labels (and bindings) to all peers.

   In some cases (particularly with ATM) there may be a limited number
   of labels which may be used across an interface, and/or a limited
   number of label assignments which may be supported by a single
   device. Operation in this case may make use of "on demand" label
   assignment. With this approach, an LSR may for example request a
   label for a route from a particular peer only when its routing
   calculations indicate that peer to be the new next hop for the route.
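
   A minimal Python sketch of the downstream allocation model (the
   class and the label range are hypothetical, not a proposed
   interface) may help; note that the allocating LSR itself indexes its
   switching table with the label it chose:

      import itertools

      class DownstreamLSR:
          """Sketch: this LSR chooses the label and later uses that
          same label as an index into its own switching tables."""

          def __init__(self):
              self.free_labels = itertools.count(16)  # hypothetical
              self.switching_table = {}  # incoming label -> action

          def allocate_for_route(self, prefix, next_hop):
              label = next(self.free_labels)
              # The label indexes this node's own switching table...
              self.switching_table[label] = (prefix, next_hop)
              # ...and the binding is advertised to upstream peers.
              return (prefix, label)

      lsr_b = DownstreamLSR()
      print("binding advertised upstream:",
            lsr_b.allocate_for_route("10.0.0.0/8", "lsr-c"))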

4.1.1.2 Upstream Label Allocation

   "Upstream Label Allocation" refers to a method where the label
   allocation is done by the upstream LSR. In this case the LSR choosing
   the label (the upstream LSR) and the LSR which needs to interpret
   packets using the label (the downstream LSR) are not the same node.
   We note here that in the upstream LSR the label at issue is not used
   as an index into the switching tables but rather is found as the
   result of a lookup on those tables.

   The motivation for upstream label allocation comes from the
   recognition that it might be possible to optimize multicast machinery
   in an LSR if it were possible to use the same label on all output
   ports for which a particular multicast packet/cell were destined.
   Upstream assignment makes this possible.

4.1.1.3 Other Label Allocation Methods

   Another option would be to make use of label values which are unique
   within the MPLS domain (implying that a domain-wide allocation would
   be needed). In this case, any stream to a particular MPLS egress node
   could make use of the label of that node (implying that label values
   do not need to be swapped at intermediate nodes).

   With this method of label allocation, there is a choice to be made
   regarding the scope over which a label is unique. One approach is to
   configure each node in an MPLS domain with a label which is unique in
   that domain. Another approach is to use a truly global identifier
   (for example the IEEE 48 bit identifier), where each MPLS-capable
   node would be stamped at birth with a truly globally unique
   identifier. The point of this global approach is to simplify
   configuration in each MPLS domain by eliminating the need to
   configure label IDs.

4.1.2 Piggybacking on Other Control Messages

   While we have discussed use of an explicit MPLS LDP we note that
   there are several existing protocols that can be easily modified to
   distribute both routing/control and label information. This could be
   done with any of OSPF, BGP, RSVP and/or PIM. A particular
   architectural elegance of these schemes is that label distribution
   uses the same mechanisms as are used in distribution of the
   underlying routing or control information.

   When explicit label distribution is used, the routing computation and
   label distribution are decoupled. This implies a possibility that at
   some point you may either have a route to a specific destination
   without an associated label, and/or a label for a specific
   destination which makes use of a path which you are no longer using.
   Piggybacking label distribution on the operation of the routing
   protocol is one way to eliminate this decoupling.

   Piggybacking label distribution on the routing protocol introduces an
   issue regarding how to negotiate acceptable label values and what to
   do if an invalid label is received. This is discussed in section
   4.1.3.

4.1.3 Acceptable Label Values

   There are some constraints on which label values may be used in
   either allocation mode. Clearly the label values must lie within the
   allowable range described in the encapsulation standards that the
   MPLS WG will produce. The label value used must also, however, lie
   within a range that the peer LSR is capable of supporting. We
   imagine that certain machines, for example ATM switches operating as
   LSRs may, due to operational or implementation restrictions, support
   a label space more limited than that bounded by the valid range found
   in the encapsulation standard. This implies that an advertisement or
   negotiation mechanism for useable label range may be a part of the
   MPLS LDP. When operating over ATM using ATM forwarding hardware, due
   to the need for compatibility with the existing use of the ATM
   VPI/VCI space, it is quite likely that an explicit mechanism will be
   needed for label range negotiation.

   In addition we note that LDP may be one of a number of mechanisms
   used to distribute labels between any given pair of LSRs. Clearly
   where such multiple mechanisms exist care must be taken to coordinate
   the allocation of label values. A single label value must have a
   unique meaning to the LSR that distributes it.

   There is an issue regarding how to allow negotiation of acceptable
   label values if label distribution is piggybacked with the routing
   protocol. In this case it may be necessary either to require
   equipment to accept any possible label value, or to configure devices
   to know which range of label values may be selected. It is not clear
   in this case what to do if an invalid label value is received as
   there may be no means of sending a NAK.

   A similar issue occurs with multicast traffic over broadcast media,
   where there may be multiple nodes which receive the same transmission
   (using a single label value). Here again it may be "non-trivial" to
   allow n-party negotiation of acceptable label values.

4.1.4 LDP Reliability

   The need for reliable label distribution depends upon the relative
   performance of L2 and L3 forwarding, as well as the relationship
   between label distribution and the routing protocol operation.

   If label distribution is tied to the operation of the routing
   protocol, then a reasonable protocol design would ensure that labels
   are distributed successfully as long as the associated route and/or
   reachability advertisement is distributed successfully. This implies
   that the reliability of label distribution will be the same as the
   reliability of route distribution.

   If there is a very large difference between L2 and L3 forwarding
   performance, then the cost of failing to deliver a label is
   significant. In this case it is important to ensure that labels are
   distributed reliably. Given that LDP needs to operate in a wide
   variety of environments with a wide variety of equipment, this
   implies that it is important for any LDP developed by the MPLS WG to
   ensure reliable delivery of label information.

   Reliable delivery of LDP packets may potentially be accomplished
   either by using an existing reliable transport protocol such as TCP,
   or by specifying reliability mechanisms as part of LDP (for example,
   the reliability mechanisms which are defined in IDRP could
   potentially be "borrowed" for use with LDP).

4.1.5 Label Purge Mechanisms

   Another issue to be considered is the "lifetime" of label data once
   it arrives at an LSR, and the method of purging label data. There are
   several methods that could be used either separately, or (more
   likely) in combination.

   One approach is for label information to be timed out. With this
   approach a lifetime is distributed along with the label value. The
   label value may be refreshed prior to timing out. If the label is not
   refreshed prior to timing out it is discarded. In this case each
   lifetime and timer may apply to a single label, or to a group of
   labels (e.g., all labels selected by the same node).
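
   A minimal sketch of this refresh/timeout behavior in Python (the
   lifetime value is an arbitrary assumption):

      import time

      LIFETIME = 30.0   # hypothetical label lifetime, in seconds
      labels = {}       # label value -> expiry timestamp

      def install(label):
          labels[label] = time.time() + LIFETIME

      def refresh(label):
          # A refresh received before expiry extends the lifetime.
          if label in labels:
              labels[label] = time.time() + LIFETIME

      def purge_expired():
          now = time.time()
          for label in [l for l, exp in labels.items() if exp <= now]:
              del labels[label]   # not refreshed in time: discarded

      install(100)
      refresh(100)
      purge_expired()   # label 100 survives; stale labels would not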

   Similarly, two peer nodes may make use of an MPLS peer keep-alive
   mechanism. This implies exchange of MPLS control packets between
   neighbors on a periodic basis. This in general is likely to use a
   smaller timeout value than label value timers (analogous to the fact
   that the OSPF HELLO interval is much shorter than the OSPF LSA
   lifetime). If the peer session between two MPLS nodes fails (due to
   expiration of the associated timer prior to reception of the refresh)
   then associated label information is discarded.

   If label information is piggybacked on the routing protocol then the
   timeout mechanisms would also be taken from the associated routing
   protocol (note that routing protocols in general have mechanisms to
   invalidate stale routing information).

   An alternative method for invalidating labels is to make use of an
   explicit label removal message.

4.2 Stream Merging

   In order to scale as O(n) (rather than O(n-squared)), MPLS makes use
   the concept of stream merge. This makes use of multipoint to point
   streams in order to allow multiple streams to be merged into one
   stream.

   Types of Stream Merge:

   There are several types of stream merge that can be used, depending
   upon the underlying media.

   When MPLS is used over frame based media, merging is straightforward.
   All that is required for stream merge to take place is for a node to
   allow multiple upstream labels to be forwarded the same way and
   mapped into a single downstream label. This is referred to as frame
   merge.
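
   As a minimal sketch of frame merge (the ports and label values are
   invented), several incoming labels simply map to one outgoing label
   in the cross-connect table:

      # Cross-connect: (in port, in label) -> (out port, out label).
      merge_table = {
          ("port1", 17): ("port9", 42),
          ("port2", 23): ("port9", 42),  # merged into the same label
          ("port3", 31): ("port9", 42),
      }

      def forward_frame(in_port, in_label, frame):
          out_port, out_label = merge_table[(in_port, in_label)]
          print("frame %r out %s with label %d"
                % (frame, out_port, out_label))

      forward_frame("port1", 17, b"pkt-a")
      forward_frame("port2", 23, b"pkt-b")  # same downstream label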

   Operation over ATM media is less straightforward. In ATM, the data
   packets are encapsulated into an ATM Adaptation Layer, say AAL5, and
   the AAL5 PDU is segmented into ATM cells with a VPI/VCI value and the
   cells are transmitted in sequence.  It is incumbent on ATM switches
   to keep the cells of a PDU (or with the same VPI/VCI value)
   contiguous and in sequence.  This is because the device that
   reassembles the cells to re-form the transmitted PDU expects the
   cells to be contiguous and in sequence, as there isn't sufficient
   information in the ATM cell header (unlike IP fragmentation) to
   reassemble the PDU with any cell order. Hence, if cells from several
   upstream links are transmitted onto the same downstream VPI/VCI, then
   cells from one PDU can get interleaved with cells from another PDU on
   the outgoing VPI/VCI, and result in corruption of the original PDUs
   by mis-sequencing the cells of each PDU.

   The most straightforward (but erroneous) method of merging in an ATM
   environment would be to take the cells from two incoming VCs and
   merge them into a single outgoing VCI. If this was done without any
   buffering of cells then cells from two or more packets could end up
   being interleaved into a single AAL5 frame. Therefore the problem
   when operating over ATM is how to avoid interleaving of cells from
   multiple sources.

   There are two ways to solve this interleaving problem, which are
   referred to as VC merge and VP merge.

   VC merge allows multiple VCs to be merged into a single outgoing VC.
   In order for this to work the node performing the merge needs to keep
   the cells from one AAL5 frame (e.g., corresponding to an IP packet)
   separate from the cells of other AAL5 frames. This may be done by
   performing the SAR function in order to reassemble each IP packet
   before forwarding that packet. In this case VC merge is essentially
   equivalent to frame merge. An alternative is to buffer the cells of
   one AAL5 frame together, without actually reassembling them. When the
   end of frame indicator is reached that frame can be forwarded. Note
   however that both forms of VC merge require that the entire AAL5
   frame be received before any cells corresponding to that frame are
   forwarded. VC merge therefore requires capabilities which are
   generally not available in most existing ATM forwarding hardware.
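
   A minimal sketch of the buffering form of VC merge (the VC numbers
   are invented, and AAL5 detail is reduced to an end-of-frame flag):

      from collections import defaultdict

      OUT_VC = 42                  # the single outgoing (merged) VC
      buffers = defaultdict(list)  # incoming VC -> buffered cells

      def transmit(vc, cell):
          print("cell %s out on VC %d" % (cell, vc))

      def on_cell(in_vc, cell, end_of_frame):
          """Buffer the cells of each AAL5 frame per incoming VC; only
          when the end-of-frame cell arrives is the whole frame sent,
          so cells of different frames are never interleaved."""
          buffers[in_vc].append(cell)
          if end_of_frame:
              for c in buffers.pop(in_vc):
                  transmit(OUT_VC, c)  # contiguous on the outgoing VC

      on_cell(5, "c1", False)
      on_cell(6, "d1", False)  # a cell from another VC arrives between
      on_cell(5, "c2", True)   # VC 5's frame complete: flushed intact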

   The alternative for use over ATM media is VP merge. Here multiple VPs
   can be merged into a single VP. Separate VCIs within the merged VP
   are used to distinguish frames (e.g., IP packets) from different
   sources. In some cases, one VP may be used for the tree from each
   ingress node to a single egress node.

   Interoperation of Merge Options:

   If some nodes support stream merge, and some nodes do not, then it is
   necessary to ensure that the two types of nodes can interoperate
   within a single network. This affects the number of labels that a
   node needs to send to a neighbor. An upstream LSR which supports
   Stream Merge needs to be sent only one label per forwarding
   equivalence class (FEC). An upstream neighbor which does not support
   Stream Merge needs to be sent multiple labels per FEC. However, there
   is no way of knowing a priori how many labels it needs. This will
   depend on how many LSRs are upstream of it with respect to the FEC in
   question.

   If a particular upstream neighbor does not support stream merge, it
   is not known a priori how many labels it will need. The upstream
   neighbor may need to explicitly ask for labels for each FEC. The
   upstream neighbor may make multiple such requests (for one or more
   labels per request). When a downstream neighbor receives such a
   request from upstream, and the downstream neighbor does not itself
   support stream merge, then it must in turn ask its downstream
   neighbor for more labels for the FEC in question.
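
   A rough sketch of this request propagation (the node model below is
   hypothetical and ignores the label values themselves):

      class LSR:
          def __init__(self, name, merging, downstream=None):
              self.name = name
              self.merging = merging
              self.downstream = downstream

          def request_labels(self, fec, count):
              print("%s asked for %d label(s) for %s"
                    % (self.name, count, fec))
              if self.downstream:
                  if self.merging:
                      # Merging: all upstream traffic can share one
                      # downstream label for the FEC.
                      self.downstream.request_labels(fec, 1)
                  else:
                      # Non-merging: each label handed out upstream
                      # needs its own label from downstream.
                      self.downstream.request_labels(fec, count)

      egress = LSR("egress", merging=True)
      middle = LSR("middle", merging=False, downstream=egress)
      middle.request_labels("10.0.0.0/8", 3)  # passed on downstream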

   It is possible that there may be some nodes which support merge, but
   have a limited number of upstream streams which may be merged into a
   single downstream stream. Suppose for example that due to some
   hardware limitation a node is capable of merging four upstream LSPs
   into a single downstream LSP. Suppose however, that this particular
   node has six upstream LSPs arriving at it for a particular Stream. In
   this case, this node may merge these into two downstream LSPs, and
   will need to obtain the corresponding two labels from its downstream
   neighbor.

   The interoperation of the various forms of merging over ATM is most
   easily described by first describing the interoperation of VC merge
   with non-merge.

   In the case where VC merge and non-merge nodes are interconnected the
   forwarding of cells is based in all cases on a VC (i.e., the
   concatenation of the VPI and VCI). For each node, if an upstream
   neighbor is doing VC merge then that upstream neighbor requires only
   a single outgoing VPI/VCI for a particular FEC (this is analogous to
   the requirement for a single label in the case of operation over
   frame media). If the upstream neighbor is not doing merge, then it
   will require a single outgoing VPI/VCI per FEC for itself (assuming
   that it can be an ingress node), plus enough outgoing VPI/VCIs to map
   to incoming VPI/VCIs to pass to its upstream neighbors. The number
   required will be determined by allowing the upstream nodes to request
   additional VPI/VCIs from their downstream neighbors.

   A similar method is possible to support nodes which perform VP merge.
   In this case the VP merge node, rather than requesting a single
   VPI/VCI or a number of VPI/VCIs from its downstream neighbor, instead
   may request a single VP (identified by a VPI). Furthermore, suppose
   that a non-merge node is downstream from two different VP merge
   nodes. This node may need to request one VPI/VCI (for traffic
   originating from itself) plus two VPs (one for each upstream node).

   Note that there are multiple options for coordinating VCIs within a
   VP. Description of the range of options is FFS.

   In order to support all of VP merge, VC merge, and non-merge, it is
   therefore necessary to allow upstream nodes to request a combination
   of zero or more VC identifiers (consisting of a VPI/VCI), plus zero
   or more VPs (identified by VPIs). VP merge nodes would therefore
   request one VP. VC merge nodes would request only a single VPI/VCI
   (since they can merge all upstream traffic into a single VC). Non-
   merge nodes would pass on any requests that they get from above, plus
   request a VPI/VCI for traffic that they originate (if they can be
   ingress nodes). However, non-merge nodes which can only do VC
   forwarding (and not VP forwarding) will need to know which VCIs are
   used within each VP in order to install the correct VCs in their
   forwarding tables. A detailed description of how this could work is
   FFS.

   Coordination of the VCI space with VP Merge:

   VP merge requires that the VCIs be coordinated to ensure uniqueness.
   There are a number of ways in which this may be accomplished:

   1. Each node may be pre-configured with a unique VCI value (or
      values).

   2. Some one node (most likely the root of the multipoint to point
      tree) may coordinate the VCI values used within the VP.  A
      protocol mechanism will be needed to allow this to occur. How
      hard this is to do depends somewhat upon whether the root is
      otherwise involved in coordinating the multipoint to point
      tree. For example, allowing one node (such as the root) to
      coordinate the tree may be useful for purposes of coordinating
      load sharing (see section 4.10). Thus whether or not the issue
      of coordinating the VCI space is significant or trivial may
      depend upon other design choices which at first glance may
      have appeared to be independent protocol design choices.

   3. Other unique information such as portions of a class B or class
      C address may be used to provide a unique VCI value.

   4. Another alternative is to implement a simple hardware extension
      in the ATM switches to keep the VCI values unique by dynamically
      altering them to avoid collision.

   VP merge makes less efficient use of the VPI/VCI space (relative to
   VC merge).  When VP merge is used, the LSPs may not be able to
   transit public ATM networks that don't support SVPs.

   Buffering Issues Related To Stream Merge:

   There is an issue regarding the amount of buffering required for
   frame merge, VC merge, and VP merge. Frame merge and VC merge
   require that intermediate points buffer incoming packets until the
   entire packet arrives. This is essentially the same as is required in
   traditional IP routers.

   VP merge allows cells to be transmitted by intermediate nodes as soon
   as they arrive, reducing the buffering and latency at intermediate
   nodes. However, the use of VP merge implies that cells from multiple
   packets will arrive at the egress node interleaved on separate VCIs.
   This in turn implies that the egress node may have somewhat increased
   buffering requirements. To a large extent egress nodes for some
   destinations will be intermediate nodes for other destinations,
   implying that increase in buffers required for some purpose (egress
   traffic) will be offset by a reduction in buffers required for other
   purposes (transit traffic). Also, routers today typically deal with
   high-fanout channelized interfaces and with multi-VC ATM interfaces,
   implying that the requirement of buffering simultaneously arriving
   cells from multiple packets and sources is something that routers
   typically do today. This is not meant to imply that the required
   buffer size and performance is inexpensive, but rather is meant to
   observe that it is a solvable issue.

4.3 Loop Handling

   Generally, methods for dealing with loops can be split into three
   categories: Loop Survival makes use of methods which minimize the
   impact of loops, for example by limiting the amount of network
   resources which can be consumed by a loop; Loop Detection allows
   loops to be set up, but later detects these loops and eliminates
   them; Loop Prevention provides methods for avoiding setting up L2
   forwarding in a way which results in a L2 loop.

   Note that we are concerned here only with loops that occur in L2
   forwarding. Transient loops at L3 will continue to be part of the
   normal IP operation, and will be handled the way that IP has been
   handling loops for years (see section 3.5).

   Loop Survival:

   Loop Survival refers to methods that are used to allow the network to
   operate well even though short term transient loops may be formed by
   the routing protocol. The basic approach to loop survival is to limit
   the amount of network resources which are consumed by looping
   packets, and to minimize the effect on other (non-looping) traffic.
   Note that loop survival is the method used by conventional IP
   forwarding, and is therefore based on long and relatively successful
   experience in the Internet.

   The most basic method for loop survival is based on the use of a TTL
   (Time To Live) field. The TTL field is decremented at each hop. If
   the TTL field reaches zero, then the packet is discarded. This method
   works well over those media which have a TTL field. This explicitly
   includes L3 IP forwarding. Also, assuming that the core MPLS
   specifications will include definition of a "shim" MPLS header, for
   use over those media which do not have their own labels, to carry
   labels for use in forwarding of user data, it is likely that the shim
   header will also include a TTL field.
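
   The TTL mechanism itself is trivial to state; a minimal sketch (the
   hop sequence is an invented loop) shows how it bounds the resources
   a looping packet can consume:

      import itertools

      def forward(ttl, hops):
          """Follow a (possibly looping) sequence of hops, decrementing
          the TTL at each; the packet is discarded when it reaches 0."""
          for hop in hops:
              ttl -= 1
              if ttl == 0:
                  print("TTL expired at %s: packet discarded" % hop)
                  return
          print("delivered with TTL %d remaining" % ttl)

      loop = itertools.cycle(["r1", "r2", "r3"])  # a forwarding loop
      forward(8, loop)  # discarded after 8 hops, not looping forever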

   However, there is considerable interest in using MPLS over L2
   protocols which provide their own labels, with the L2 label used for
   MPLS forwarding. Specific L2 protocols which offer a label for this
   purpose include ATM and Frame Relay. However, neither ATM nor Frame
   Relay have a TTL field. This implies that this method cannot be used
   when basic ATM or Frame Relay forwarding is being used.

   Another basic method for loop survival is the use of dynamic routing
   protocols which converge rapidly to non-looping paths. In some
   instances it is possible that congestion caused by looping data could
   affect the convergence of the routing protocol (see section 3.5).
   MPLS should be designed to prevent this problem from occurring. Given
   that MPLS uses the same routing protocols as are used for IP, this
   method does not need to be discussed further in this framework
   document.

   Another possible tool for loop survival is the use of fair queuing.
   This allows unrelated flows of user data to be placed in different
   queues. This helps to ensure that a node which is overloaded with
   looping user data can nonetheless forward unrelated non-looping data,
   thereby minimizing the effect that looping data has on other data. We
   cannot assume that fair queuing will always be available. In
   practice, many fair queuing implementations merge multiple streams
   into one queue (implying that the number of queues used is less than
   the number of user data flows which are present in the network).
   This implies that any data which happens to be in the same queue with
   looping data may be adversely affected.

   Loop Detection:

   Loop Detection refers to methods whereby a loop may be set up at L2,
   but the loop is subsequently detected. When the loop is detected, it
   may be broken at L2 by dropping the label relationship, implying that
   packets for a set of destinations must be forwarded at L3.

   A possible method for loop detection is based on transmitting a "loop
   detection" control packet (LDCP) along the path towards a specified
   destination whenever the route to the destination changes. This LDCP
   is forwarded in the direction that the label specifies, with the
   labels swapped to the correct next hop value. However, normal L2
   forwarding cannot be used because each hop needs to examine the
   packet to check for loops.  The LDCP is forwarded towards that
   destination until one of the following happens: (i) The LDCP reaches
   the last MPLS node along the path (i.e. the next hop is either a
   router which is not participating in MPLS, or is the final
   destination host); (ii) The TTL of the LDCP expires (assuming that
   the control packet uses a TTL, which is optional), or (iii) The LDCP
   returns to the node which originally
   transmitted it. If the latter occurs, then the packet has looped and
   the node which originally transmitted the LDCP stops using the
   associated label, and instead uses L3 forwarding for the associated
   destination addresses. One problem with this method is that once a
   loop is detected it is not known when the loop clears. One option
   would be to set a timer, and to transmit a new LDCP when the timer
   expires.
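
   A minimal sketch of the LDCP check (the next hop table and the node
   names are invented; real forwarding state would of course be labels,
   not names):

      def ldcp_detects_loop(origin, next_hop_of):
          """Follow the path hop by hop, as the LDCP is forwarded; if
          the probe returns to its originator, the path loops."""
          node, hops = origin, 0
          while node in next_hop_of:
              node = next_hop_of[node]
              hops += 1
              if node == origin:
                  return True   # LDCP came back: loop detected
              if hops > len(next_hop_of):
                  return True   # safety bound, acts like a TTL
          return False          # reached the last MPLS node

      print(ldcp_detects_loop("a", {"a": "b", "b": "c", "c": "a"}))
      print(ldcp_detects_loop("a", {"a": "b", "b": "c"}))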

   An alternate method counts the hops to each egress node, based on the
   routes currently available. Each node advertises its distance (in hop
   counts) to each destination. An egress node advertises the
   destinations that it can reach directly with an associated hop count
   of zero. For each destination, a node computes the hop count to that
   destination based on adding one to the hop count advertised by its
   actual next hop used for that destination. When the hop count for a
   particular destination changes, the hop count needs to be
   readvertised.

   In addition, the first of the loop prevention schemes discussed below
   may be modified to provide loop detection (the details are
   straightforward, but have not been written down in time to include in
   this rough draft).

   Loop Prevention:

   Loop prevention makes use of methods to ensure that loops are never
   set up at L2. This implies that the labels are not used until some
   method is used to ensure that following the label towards the
   destination, with associated label swaps at each switch, will not
   result in a loop. Until the L2 path (making use of assigned labels)
   is available, packets are forwarded at L3.

   Loop prevention requires explicit signaling of some sort to be used
   when setting up an L2 stream.

   One method of loop prevention requires that labels be propagated
   starting at the egress switch. The egress switch signals to
   neighboring switches the label to use for a particular destination.
   That switch then signals an associated label to its neighbors, etc.
   The control packets which propagate the labels also include the path
   to the egress (as a list of routerIDs). Any looping control packet
   can therefore be detected and the path not set up to or past the
   looping point.
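
   A rough sketch of this path-vector check (the topology and router
   IDs are invented), propagating label setup from the egress and
   refusing to extend a path past a looping point:

      def propagate_setup(node, path_so_far, upstream_neighbors):
          """The control packet carries the routerIDs traversed; a node
          already present in the list would form a loop, so the path
          is not set up to or past that point."""
          if node in path_so_far:
              print("loop would form at %s: path not set up" % node)
              return
          path = path_so_far + [node]
          print("label distributed at %s, path %s" % (node, path))
          for up in upstream_neighbors.get(node, []):
              propagate_setup(up, path, upstream_neighbors)

      # Hypothetical upstream adjacencies, egress node "e":
      propagate_setup("e", [], {"e": ["c"], "c": ["b"], "b": ["c"]})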

   Another option is to use explicit routing to set up label bindings
   from the egress switch to each ingress switch. This precludes the
   possibility of looping, since the entire path is computed by one
   node. This also allows non-looping paths to be set up provided that
   the egress switch has a view of the topology which is reasonably
   close to reality (if there are operational links which the egress
   switch doesn't know about, it will simply pick a path which doesn't
   use those links; if there are links which have failed but which the
   egress switch thinks are operational, then there is some chance
   that the setup attempt will fail but in this case the attempt can be
   retried on a separate path). Note therefore that non-looping paths
   can be set up with this method in many cases where distributed
   routing plus hop by hop forwarding would not actually result in non-
   looping paths. This method is similar to the method used by standard
   ATM routing to ensure that SVCs are non-looping [PNNI].

   Explicit routing is only applicable if the routing protocol gives the
   egress switch sufficient information to set up the explicit route,
   implying that the protocol must be either a link state protocol (such
   as OSPF) or a path vector protocol (such as BGP). Source routing
   therefore is not appropriate as a general approach for use in any
   network regardless of the routing protocol. This method also requires
   some overhead for the call setup before label-based forwarding can be
   used. If the network topology changes in a manner which breaks the
   existing path, then a new path will need to be explicitly routed from
   the egress switch.  Due to this overhead this method is probably only
   appropriate if other significant advantages are also going to be
   obtained from having a single node (the egress switch) coordinate the
   paths to be used. Examples of other reasons to have one node
   coordinate the paths to a single egress switch include: (i)
   Coordinating the VCI space where VP merge is used (see section 4.2);
   and (ii) Coordinating the routing of streams from multiple ingress
   switches to one egress switch so as to balance the load on multiple
   alternate paths through the network.

   In principle the explicit routing could also be done in the alternate
   direction (from ingress to egress). However, this would make it more
   difficult to merge streams if stream merge is to be used. This would
   also make it more difficult to coordinate (i) changes to the paths
   used, (ii) the VCI space assignments, and (iii) load sharing. This
   therefore makes explicit routing more difficult, and also reduces the
   other advantages that could be obtained from the approach.

   If label distribution is piggybacked on the routing protocol (see
   section 4.1.2), then loop prevention is only possible if the routing
   protocol itself does loop prevention.

   What To Do If A Loop Is Detected:

   With all of these schemes, if a loop is known to exist then the L2
   label-swapped path is not set up. This leads to the obvious question
   of what does an MPLS node do when it doesn't have a label for a
   particular destination, and a packet for that destination arrives to
   be forwarded? If possible, the packet is forwarded using normal L3
   (IP) forwarding. There are two issues that this raises: (i) What
   about nodes which are not capable of L3 forwarding; (ii) Given the
   relative speeds of L2 and L3 forwarding, does this work?

   Nodes which are not capable of L3 forwarding obviously can't forward
   a packet unless it arrives with a label, and the associated next hop
   label has been assigned. Such nodes, when they receive a packet for
   which the next hop label has not been assigned, must discard the
   packet. It is probably safe to assume that if a node cannot forward
   an L3 packet, then it is probably also incapable of forwarding an
   ICMP error report that it originates. This implies that the packet
   will need to be discarded in this case.

   In many cases L2 forwarding will be significantly faster than L3
   forwarding (allowing faster forwarding is a significant motivation
   behind the work on MPLS). This implies that if a node is forwarding a
   large volume of traffic at L2, and a change in the routing protocol
   causes the associated labels to be lost (necessitating L3
   forwarding), in some cases the node will not be capable of forwarding
   the same volume of traffic at L3. This will of course require that
   packets be discarded. However, in some cases only a relatively small
   volume of traffic will need to be forwarded at L3. Thus forwarding at
   L3 when L2 is not available is not necessarily always a problem.
   There may be some nodes which are capable of forwarding equally fast
   at L2 and L3 (for example, such nodes may contain IP forwarding
   hardware which is not available in all nodes). Finally, when packets
   are lost this will cause TCP to backoff, which will in turn reduce
   the load on the network and allow the network to stabilize even at
   reduced forwarding rates until such time as the label bindings can be
   reestablished.

   Note that in most cases loops will be caused either by configuration
   errors, or due to short term transient problems caused by the failure
   of a link. If only one link goes down, and if routing creates a
   normal "tree-shaped" set of paths to any one destination, then the
   failure of one link somewhere in the network will affect only one
   link's worth of data passing through any one node in the network.
   This implies that if a node is capable of forwarding one link's worth
   of data at L3, then in many or most cases it will have sufficient L3
   bandwidth to handle looping data.

4.4 Interoperation with NHRP

   

   When label switching is used over ATM, and there exists an LSR which
   is also operating as a Next Hop Client (NHC), the possibility of
   direct interaction arises.  That is, could one switch cells between
   the two technologies without reassembly?  To enable this, several
   important issues must be addressed.

   The encapsulation must be acceptable to both MPLS and NHRP.  If only
   a single label is used, then the null encapsulation could be used.
   Other solutions could be developed to handle label stacks.

   NHRP must understand and respect the granularity of a stream.

   Currently NHRP resolves an IP address to an ATM address. The response
   may include a mask indicating a range of addresses. However, any VC
   to the ATM address is considered to be a viable means of packet
   delivery. Suppose that an NHC issues an NHRP request for IP address A,
   gets back ATM address 1, and sets up a VC to address 1. Later the same
   NHC issues an NHRP request for a totally unrelated IP address B and
   gets back the same ATM address 1. In this case normal NHRP behavior
   allows the NHC to use
   the VC (that was set up for destination A) for traffic to B.

   Note: In this section we will refer to a VC set up as a result of an
   NHRP query/response as a shortcut VC.

   If one expects to be able to label switch the packets being received
   from a shortcut VC, then the label switch needs to be informed as to
   exactly what traffic will arrive on that VC and that mapping cannot
   change without notice. Currently no such mechanism exists in the
   defined signaling for a shortcut VC.  Several means are possible.  A
   binding, equivalent to the binding in LDP, could be sent in the setup
   message.  Alternatively, the binding of prefix to label could remain
   in an LDP session (or whatever means of label distribution as
   appropriate) and the setup could carry a binding of the label to the
   VC. This would leave the binding mechanism for shortcut VCs
   independent of the label distribution mechanism.
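
   As an illustration of the second alternative, the Python sketch below
   keeps the prefix-to-label binding in LDP and lets the shortcut VC
   setup carry only a label-to-VC binding. The message fields and values
   are invented for the sketch, not a defined format:

      # Hypothetical sketch: LDP owns the prefix-to-label binding, while
      # the shortcut VC setup carries only a label-to-VC binding.
      ldp_bindings = {"192.168.0.0/16": 42}   # prefix -> label, via LDP

      def shortcut_vc_setup(label, vc_id):
          """Build a setup message binding 'label' to the shortcut VC."""
          return {"vc": vc_id, "label": label}

      msg = shortcut_vc_setup(ldp_bindings["192.168.0.0/16"], vc_id=7)
      # The receiver now knows exactly which labeled traffic the VC will
      # carry, independently of how the label itself was distributed.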

   A further architectural challenge exists in that label switching is
   inherently unidirectional whereas ATM is bi-directional.  The above
   binding semantics are fairly straight-forward.  However, effectively
   using the reverse direction of a VC presents further challenges.

   Label switching must also respect the granularity of the shortcut VC.
   Without VC merge, this means a single label switched flow must map to
   a VC.  In the case of VC merge, multiple label switched streams could
   be merged onto a single shortcut VC.  But given the asymmetry
   involved, there is perhaps little practical use for this.

   Another issue is one of practicality and usefulness.  What is sent
   over the VC must be at a fine enough granularity to be label switched
   through the receiving domain.  One potential place where the two
   technologies might come into play is in moving data from one campus
   via the wide-area to another campus.  In such a scenario, the two
   technologies would border precisely at the point where summarization
   is likely to occur.  Each campus would have a detailed understanding
   of itself, but not of the other campus.  The wide-area is likely to
   have summarized knowledge only. But at such a point level 3
   processing becomes the likely solution.

4.5 Operation in a Hierarchy

   This section is FFS.

4.6 Stacked Labels in a Flat Routing Environment

   This section is FFS.

4.7 Multicast

   This section is FFS.

4.8 Multipath

   Many IP routing protocols support the notion of equal-cost multipath
   routes, in which a router maintains multiple next hops for one
   destination prefix when two or more equal-cost paths to the prefix
   exist. There are a few possible approaches for handling multipath
   with MPLS.

   In this discussion we will use the term "multipath node" to mean a
   node which is keeping track of multiple switched paths from itself
   for a single destination.

   The first approach maintains a separate switched path from each
   ingress node via one or more multipath nodes to a merge point. This
   requires MPLS to distinguish the separate switched paths, so that
   learning of a new switched path is not misinterpreted as a
   replacement of the same switched path. This also requires an ingress
   MPLS node be capable of distributing the traffic among the multiple
   switched paths. This approach preserves switching performance, but at
   a cost of proliferating the number of switched paths. For example,
   each switched path consumes a distinct label.

   The second approach establishes only one switched path from any one
   ingress node to a destination. However, when the paths from two
   different ingress nodes happen to arrive at the same node, that node
   may use different paths for each (implying that the node becomes a
   multipath node). Thus the switched path chosen by the multipath node
   may assign a different downstream path to each incoming stream. This
   conserves switched paths and maintains switching performance, but
   cannot balance loads across downstream links as well as the other
   approaches, even if switched paths are selectively assigned. A
   drawback of this approach is that the L2 path may be different from
   the normal L3 path, as traffic that otherwise would have taken
   multiple distinct paths is forced onto a single path.

   The third approach allows a single stream arriving at a multipath
   node to be split into multiple streams, by using L3 forwarding at the
   multipath node. For example, the multipath node might choose to use a
   hash function on the source and destination IP addresses, in order to
   avoid misordering packets between any one IP source and destination.
   This approach conserves switched paths at the cost of switching
   performance.
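
   A minimal Python sketch of this third approach follows, using a
   checksum of the source and destination addresses so that any one
   source/destination pair always takes the same path. The two-path
   setup is purely illustrative:

      import zlib

      def pick_path(src_ip, dst_ip, paths):
          """Hash (source, destination) onto one of several paths, so
          packets of any one IP pair are never reordered across paths."""
          key = ("%s|%s" % (src_ip, dst_ip)).encode()
          return paths[zlib.crc32(key) % len(paths)]

      # The same pair always maps to the same path.
      paths = ["path-A", "path-B"]
      assert pick_path("10.0.0.1", "10.9.9.9", paths) == \
             pick_path("10.0.0.1", "10.9.9.9", paths)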

4.9 Host Interactions

   There are a range of options for host interaction with MPLS:

   The most straightforward approach is no host involvement. Host
   operation may then be completely independent of MPLS, with hosts
   operating according to other IP standards. If there is no host
   involvement, the first hop requires an L3 lookup.

   If the host is ATM attached and doing NHRP, then this would allow the
   host to set up a Virtual Circuit to a router. However this brings up
   a range of issues as was discussed in section 4.4 ("interoperation
   with NHRP").

   On the ingress side, it is reasonable to consider having the first
   hop LSR provide labels to the hosts, and thus have hosts attach
   labels for packets that they transmit. This could allow the first hop
   LSR to avoid an L3 lookup. It is reasonable here to have the host
   request labels only when needed, rather than require the host to
   remember all labels assigned for use in the network.

   On the egress side, it is questionable whether hosts should be
   involved. For scaling reasons, it would be undesirable to use a
   different label for reaching each host.

4.10 Explicit Routing

   There are two options for Route Selection: (1) Hop by hop routing,
   and (2) Explicit routing.

   An explicitly routed LSP is an LSP where, at a given LSR, the LSP
   next hop is not chosen by each local node, but rather is chosen by a
   single node (usually the ingress or egress node of the LSP). The
   sequence of LSRs followed by an explicitly routed LSP may be chosen by
   configuration, or by an algorithm performed by a single node (for
   example, the egress node may make use of the topological information
   learned from a link state database in order to compute the entire
   path for the tree ending at that egress node).

   With MPLS the explicit route needs to be specified at the time that
   Labels are assigned, but the explicit route does not have to be
   specified with each L3 packet. This implies that explicit routing
   with MPLS is relatively efficient (when compared with the efficiency
   of explicit routing for pure datagrams).

   Explicit routing may be useful for a number of purposes such as
   allowing policy routing and/or facilitating traffic engineering.

4.10.1 Establishment of Point to Point Explicitly Routed LSPs

   In order to establish a point to point explicitly routed LSP, the LDP
   packets used to set up the LSP must contain the explicit route. This
   implies that the LSP is set up in order either from the ingress to
   the egress, or from the egress to the ingress.

   One node needs to pick the explicit route: This may be done in at
   least two possible ways: (i) by configuration (e.g., the explicit route
   may be chosen by an operator, or by a centralized server of some
   kind); (ii) By use of a routing protocol which allows the ingress
   and/or egress node to know the entire route to be followed. This
   would imply the use of a link state routing protocol (in which all
   nodes know the full topology) or of a path vector routing protocol
   (in which the ingress node is told the path as part of the normal
   operation of the routing protocol).

   Note: The normal operation of path vector routing protocols (such as
   BGP) does not provide the full set of routers along the path. This
   implies that either a partial source route only would be provided
   (implying that LSP setup would use a combination of hop by hop and
   explicit routing), or it would be necessary to augment the protocol
   in order to provide the complete explicit route. Detailed operation
   in this case is FFS.

   In the point to point case, it is relatively straightforward to
   specify the route to use: This is indicated by providing the
   addresses of each LSR on the LSP.
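
   For illustration, an explicit route might be represented in an LDP
   setup message simply as the ordered list of LSR addresses. The
   message layout below is an assumption made for the sketch, not a
   defined LDP format:

      # Hypothetical setup message carrying an explicit route.
      setup_msg = {
          "fec": "192.168.0.0/16",              # stream being set up
          "explicit_route": ["10.1.1.1", "10.1.2.2", "10.1.3.3"],
      }

      def next_explicit_hop(msg, my_address):
          """Each LSR forwards the setup to the hop after itself
          (assumes this LSR appears in the list)."""
          hops = msg["explicit_route"]
          i = hops.index(my_address)
          return hops[i + 1] if i + 1 < len(hops) else None  # None at egress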

4.10.2 Explicit and Hop by Hop routing: Avoiding Loops

   In general, an LSP will be explicitly routed specifically because there
   is a good reason to use an alternative to the hop by hop routed path.
   This implies that the explicit route is likely to follow a path which
   is inconsistent with the path followed by hop by hop routing. If some
   of the nodes along the path follow an explicit route but some of the
   nodes make use of hop by hop routing (and ignore the explicit route),
   then inconsistent routing may result and in some cases loops (or
   severely inefficient paths) may form. This implies that for any one
   LSP, there are two possible options: (i) The entire LSP may be hop by
   hop routed; or (ii) The entire LSP may be explicitly routed.

   For this reason, it is important that if an explicit route is
   specified for setting up an LSP, then that route must be followed in
   setting up the LSP.

   There is a related issue when a link or node in the middle of an
   explicitly routed LSP breaks: In this case, the last operating node
   on the upstream part of the LSP will continue receiving packets, but
   will not be able to forward them along the explicitly routed LSP
   (since its next hop is no longer functioning). In this case, it is
   not in general safe for this node to forward the packets using L3
   forwarding with hop by hop routing. Instead, the packets must be
   discarded, and the upstream portion of the explicitly routed LSP
   must be torn down.

   Where part of an Explicitly Routed LSP breaks, the node which
   originated the LSP needs to be told about this. For robustness
   reasons the MPLS protocol design should not assume that the routing
   protocol will tell the node which originated the LSP. For example, it
   is possible that a link may go down and come back up quickly enough
   that the routing protocol never declares the link down. Rather, an
   explicit MPLS mechanism is needed.

4.10.3 Merge and Explicit Routing

   Explicit Routing is slightly more complex with a multipoint to point
   LSP (i.e., in the case that stream merge is used).

   In this case, it is not possible to specify the route for the LSP as
   a simple list of LSRs (since the LSP does not consist of a simple
   sequence of LSRs). Rather the explicit route must specify a tree.
   There are several ways that this may be accomplished. Details are
   FFS.

4.10.4 Using Explicit Routing for Traffic Engineering

   In the Internet today it is relatively common for ISPs to make use of
   a Frame Relay or ATM core, which interconnects a number of IP
   routers. The primary reason for use of a switching (L2) core is to
   make use of low cost equipment which provides very high speed
   forwarding. However, there is another very important reason for the
   use of a L2 core: In order to allow for Traffic Engineering.

   Traffic Engineering (also known as bandwidth management) refers to
   the process of managing the routes followed by user data traffic in a
   network in order to provide relatively equal and efficient loading of
   the resources in the network (i.e., to ensure that the load on links
   and nodes remains within the capabilities of those links and nodes).

   Some rudimentary level of traffic engineering can be accomplished
   with pure datagram routing and forwarding by adjusting the metrics
   assigned to links. For example, suppose that there is a given link in
   a network which tends to be overloaded on a long term basis. One
   option would be to manually configure an increased metric value for
   this link, in the hopes of moving some traffic onto alternate routes.
   This provides a rather crude method of traffic engineering and
   provides only limited results.

   Another method of traffic engineering is to manually configure
   multiple PVCs across a L2 core, and to adjust the route followed by
   each PVC in an attempt to equalize the load on different parts of the
   network. Where necessary, multiple PVCs may be configured between the
   same two nodes, in order to allow traffic to be split between
   different paths. In some topologies it is much easier to achieve
   efficient non-overlapping or minimally-overlapping paths via this
   method (with manually configured paths) than it would be with pure
   datagram forwarding. A similar ability can be achieved with MPLS via
   the use of manual configuration of the paths taken by LSPs.

   A related issue is the decision on where merge is to occur. Note that
   once two streams merge into one stream (forwarded by a single label)
   then they cannot diverge again at that level of the MPLS hierarchy
   (i.e., they cannot be bifurcated without looking at a higher level
   label or the IP header). Thus there may be times when it is desirable
   to explicitly NOT merge two streams even though they are to the same
   egress node and FEC. Non-merge may be appropriate either because the
   streams will want to diverge later in the path (for example, to avoid
   overloading a particular downstream link), or because the streams may
   want to use different physical links in the case where multiple
   slower physical links are being aggregated into a single logical link
   for the purpose of IP routing.

   As a network grows to a very large size (on the order of hundreds of
   LSRs), it becomes increasingly difficult to handle the assignment of
   all routes via manual configuration. However, explicit routing allows
   several alternatives:

   1. Partial Configuration: One option is to use automatic/dynamic
   routing for most of the paths through the network, but then manually
   configure some routes. For example, suppose that full dynamic routing
   would result in a particular link being overloaded. One of the LSPs
   which uses that link could be selected and manually routed to use a
   different path.

   2. Central Computation: One option would be to provide long term
   network usage information to a single central management facility.
   That facility could then run a global optimization to compute a set
   of paths to use. Network management commands can be used to configure
   LSRs with the correct routes to use.

   3. Egress Computation: An egress node can run a computation which
   optimizes the path followed for traffic to itself. This cannot of
   course optimize the entire traffic load through the network, but can
   include optimization of traffic from multiple ingresses to one
   egress. The reason for optimizing traffic to a single egress, rather
   than from a single ingress, relates to the issue of when to merge: An
   ingress can never merge the traffic from itself to different
   egresses, but an egress can, if desired, choose to merge the traffic
   from multiple ingresses to itself.

4.10.5 Using Explicit Routing for Policy Routing

   This section is FFS.

4.11 Traceroute

   This section is FFS.

4.12 LSP Control: Egress versus Local

   There is a choice to be made regarding whether the initial setup of
   LSPs will be initiated by the egress node, or locally by each
   individual node.

   When LSP control is done locally, then each node may at any time pass
   label bindings to its neighbors for each FEC recognized by that node.
   In the normal case that the neighboring nodes recognize the same
   FECs, then nodes may map incoming labels to outgoing labels as part
   of the normal label swapping forwarding method.

   When LSP control is done by the egress, then initially (on startup)
   only the egress node passes label bindings to its neighbors
   corresponding to any FECs which leave the MPLS network at that egress
   node. When initializing, other nodes wait until they get a label from
   downstream for a particular FEC before passing a corresponding label
   for the same FEC to upstream nodes.

   With local control, since each LSR is (at least initially)
   independently assigning labels to FECs, it is possible that different
   LSRs may make inconsistent decisions. For example, an upstream LSR
   may make a coarse decision (map multiple IP address prefixes to a
   single label) while its downstream neighbor makes a finer grain
   decision (map each individual IP address prefix to a separate label).
   With downstream label assignment this can be corrected by having each
   LSR withdraw any labels it has assigned which are inconsistent with
   downstream labels, and replace them with new consistent label
   assignments.

   This may appear to be an advantage of egress LSP control (since with
   egress control the initial label assignments "bubble up" from the
   egress to upstream nodes, and consistency is therefore easy to
   ensure). However, even with egress control it is possible that the
   choice of egress node may change, or the egress may (based on a
   change in configuration) change its mind in terms of the granularity
   which is to be used. This implies the same mechanism will be
   necessary to allow changes in granularity to bubble up to upstream
   nodes. The choice of egress or local control may therefore affect the
   frequency with which this mechanism is used, but will not affect the
   need for a mechanism to achieve consistency of label granularity.

   Egress control and local control can interwork in a very
   straightforward manner: With either approach, (assuming downstream
   label assignment) the egress node will initially assign labels for
   particular FECs and will pass these labels to its neighbors. With
   either approach these label assignments will bubble upstream, with
   the upstream nodes choosing labels that are consistent with the
   labels that they receive from downstream.

   The difference between the two techniques therefore becomes a
   tradeoff between avoiding a short period of initial thrashing on
   startup (in the sense of avoiding the need to withdraw inconsistent
   labels which may have been assigned using local control) versus the
   imposition of a short delay on initial startup (while waiting for the
   initial label assignments to bubble up from downstream). The protocol
   mechanisms which need to be defined are the same in either case, and
   the steady state operation is the same in either case.
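
   The one rule that actually differs between the two techniques can be
   captured in a few lines. The following Python sketch is illustrative
   pseudologic with invented names, not a specified procedure:

      def may_advertise_label(is_egress_for_fec, have_downstream_label,
                              egress_control):
          """May this node pass a label binding for a FEC upstream?"""
          if is_egress_for_fec:
              return True                   # the egress may always advertise
          if egress_control:
              return have_downstream_label  # wait for bindings to bubble up
          return True                       # local control: any time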

4.13 Security

   Security in a network using MPLS should be relatively similar to
   security in a normal IP network.

   Routing in an MPLS network uses precisely the same IP routing
   protocols as are currently used with IP. This implies that route
   filtering is unchanged from current operation. Similarly, the
   security of the routing protocols is not affected by the use of MPLS.

   Packet filtering also may be done as in normal IP. This will require
   either (i) that label swapping be terminated prior to any firewalls
   performing packet filtering (in which case a separate instance of
   label swapping may optionally be started after the firewall); or (ii)
   that firewalls "look past the labels", in order to inspect the entire
   IP packet contents. In this latter case note that the label may imply
   semantics greater than that contained in the packet header: In
   particular, a particular label value may imply that the packet is to
   take a particular path after the firewall. In environments in which
   this is considered to be a security issue it may be desirable to
   terminate the label prior to the firewall.

   Note that in principle labels could be used to speed up the operation
   of firewalls: In particular, the label could be used as an index into
   a table which indicates the characteristics that the packet needs to
   have in order to pass through the firewall. Depending upon
   implementation considerations matching the contents of the packet to
   the contents of the table may be quicker than parsing the packet in
   the absence of the label.
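
   A sketch of that idea follows: the label indexes a table of
   characteristics the packet must have, and checking against the table
   may be cheaper than a full parse. The table contents and names are
   invented for illustration:

      import ipaddress

      # label -> characteristics the packet must have (illustrative).
      filter_table = {
          17: {"src": ipaddress.ip_network("10.1.0.0/16"),
               "dst": ipaddress.ip_network("10.2.0.0/16")},
      }

      def passes_firewall(label, src, dst):
          rule = filter_table.get(label)
          if rule is None:
              return False    # unknown label: fall back to full inspection
          return (ipaddress.ip_address(src) in rule["src"] and
                  ipaddress.ip_address(dst) in rule["dst"])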

                                                           Eric C. Rosen
                                                     Cisco Systems, Inc.
Expiration Date: January 1998
                                                        Arun Viswanathan
                                                               IBM Corp.

                                                             Ross Callon
                                             Ascend Communications, Inc.

                                                               July 1997

                    A Proposed Architecture for MPLS

                      draft-rosen-mpls-arch-00.txt

Abstract

   This internet draft contains a draft protocol architecture for
   multiprotocol label switching (MPLS). The proposed architecture is
   based on other label switching approaches [2-11] as well as on the
   MPLS Framework document [1].

1. Introduction to MPLS

1.1. Overview

   In connectionless network layer protocols, as a packet travels from
   one router hop to the next, an independent forwarding decision is
   made at each hop.  Each router analyzes the packet header, and runs a
   network layer routing algorithm. The next hop for a packet is chosen
   based on the header analysis and the result of running the routing
   algorithm.

   Packet headers contain considerably more information than is needed
   simply to choose the next hop. Choosing the next hop can therefore be
   thought of as the composition of two functions. The first function
   partitions the entire packet forwarding space into "forwarding
   equivalence classes (FECs)".  The second maps these FECs to a next
   hop.  Multiple network layer headers which get mapped into the same
   FEC are indistinguishable, as far as the forwarding decision is
   concerned. The set of packets belonging to the same FEC, traveling
   from a common node, will follow the same path and be forwarded in the
   same manner (for example, by being placed in a common queue) towards
   the destination.  This set of packets following the same path,
   belonging to the same FEC (and therefore being forwarded in a common
   manner) may be referred to as a "stream".

   In IP forwarding, multiple packets are typically assigned to the same
   Stream by a particular router if there is some address prefix X in
   that router's routing tables such that X is the "longest match" for
   each packet's destination address.
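
   The following minimal Python sketch shows this longest-match
   classification; the table contents and stream names are illustrative
   only:

      import ipaddress

      routing_table = {                     # prefix -> stream (illustrative)
          ipaddress.ip_network("10.0.0.0/8"):  "stream-coarse",
          ipaddress.ip_network("10.1.0.0/16"): "stream-fine",
      }

      def stream_for(dest):
          """Assign 'dest' to the Stream of its longest matching prefix."""
          addr = ipaddress.ip_address(dest)
          matches = [p for p in routing_table if addr in p]
          if not matches:
              return None
          return routing_table[max(matches, key=lambda p: p.prefixlen)]

      assert stream_for("10.1.2.3") == "stream-fine"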

   In MPLS, the mapping from packet headers to stream is performed just
   once, as the packet enters the network.  The stream to which the
   packet is assigned is encoded with a short fixed length value known
   as a "label". When a packet is forwarded to its next hop, the label
   is sent along with it; that is, the packets are "labeled".

   At subsequent hops, there is no further analysis of the network layer
   header. Rather, the label is used as an index into a table which
   specifies the next hop, and a new label.  The old label is replaced
   with the new label, and the packet is forwarded to its next hop. This
   eliminates the need to perform a longest match computation for each
   packet at each hop; the computation can be performed just once.

   Some routers analyze a packet's network layer header not merely to
   choose the packet's next hop, but also to determine a packet's
   "precedence" or "class of service", in order to apply different
   discard thresholds or scheduling disciplines to different packets. In
   MPLS, this can also be inferred from the label, so that no further
   header analysis is needed.

   The fact that a packet is assigned to a Stream just once, rather than
   at every hop, allows the use of sophisticated forwarding paradigms.
   A packet that enters the network at a particular router can be
   labeled differently than the same packet entering the network at a
   different router, and as a result forwarding decisions that depend on
   the ingress point ("policy routing") can be easily made.  In fact,
   the policy used to assign a packet to a Stream need not have only the
   network layer header as input; it may use arbitrary information about
   the packet, and/or arbitrary policy information as input.  Since this
   decouples forwarding from routing, it allows one to use MPLS to
   support a large variety of routing policies that are difficult or
   impossible to support with just conventional network layer
   forwarding.

   Similarly, MPLS facilitates the use of explicit routing, without
   requiring that each IP packet carry the explicit route. Explicit
   routes may be useful to support policy routing and traffic
   engineering.

   MPLS makes use of a routing approach whereby the normal mode of
   operation is that L3 routing (e.g., existing IP routing protocols
   and/or new IP routing protocols) is used by all nodes to determine
   the routed path.

   MPLS stands for "Multiprotocol" Label Switching, multiprotocol
   because its techniques are applicable to ANY network layer protocol.
   In this document, however, we focus on the use of IP as the network
   layer protocol.

   A router which supports MPLS is known as a "Label Switching Router",
   or LSR.

   A general discussion of issues related to MPLS is presented in "A
   Framework for Multiprotocol Label Switching" [1].

1.2. Terminology

   This section gives a general conceptual overview of the terms used in
   this document. Some of these terms are more precisely defined in
   later sections of the document.

     aggregate stream          synonym of "stream"

     DLCI                      a label used in Frame Relay networks to
                               identify frame relay circuits

     flow                      a single instance of an application to
                               application flow of data (as in the RSVP
                               and IFMP use of the term "flow")

     forwarding equivalence class   a group of IP packets which are
                                    forwarded in the same manner (e.g.,
                                    over the same path, with the same
                                    forwarding treatment)

     frame merge               stream merge, when it is applied to
                               operation over frame based media, so that
                               the potential problem of cell interleave
                               is not an issue.

     label                     a short fixed length physically
                               contiguous identifier which is used to
                               identify a stream, usually of local
                               significance.

     label information base    the database of information containing
                               label bindings

     label swap                the basic forwarding operation consisting
                               of looking up an incoming label to
                               determine the outgoing label,
                               encapsulation, port, and other data
                               handling information.

     label swapping            a forwarding paradigm allowing
                               streamlined forwarding of data by using
                               labels to identify streams of data to be
                               forwarded.

     label switched hop        the hop between two MPLS nodes, on which
                               forwarding is done using labels.

     label switched path       the path created by the concatenation of
                               one or more label switched hops, allowing
                               a packet to be forwarded by swapping
                               labels from an MPLS node to another MPLS
                               node.

     layer 2                   the protocol layer under layer 3 (which
                               therefore offers the services used by
                               layer 3).  Forwarding, when done by the
                               swapping of short fixed length labels,
                               occurs at layer 2 regardless of whether
                               the label being examined is an ATM
                               VPI/VCI, a frame relay DLCI, or an MPLS
                               label.

     layer 3                   the protocol layer at which IP and its
                               associated routing protocols operate

     link layer                synonymous with layer 2

     loop detection            a method of dealing with loops in which
                               loops are allowed to be set up, and data
                               may be transmitted over the loop, but the
                               loop is later detected and closed

     loop prevention           a method of dealing with loops in which
                               data is never transmitted over a loop

     label stack               an ordered set of labels

     loop survival             a method of dealing with loops in which
                               data may be transmitted over a loop, but
                               means are employed to limit the amount of
                               network resources which may be consumed
                               by the looping data

     label switched path       The path through one or more LSRs at one
                               level of the hierarchy followed by a
                               stream.

     label switching router    an MPLS node which is capable of
                               forwarding native L3 packets

     merge point               the node at which multiple streams and
                               switched paths are combined into a single
                               stream sent over a single path.

     Mlabel                    abbreviation for MPLS label

     MPLS core standards       the standards which describe the core
                               MPLS technology

     MPLS domain               a contiguous set of nodes which operate
                               MPLS routing and forwarding and which are
                               also in one Routing or Administrative
                               Domain

     MPLS edge node            an MPLS node that connects an MPLS domain
                               with a node which is outside of the
                               domain, either because it does not run
                               MPLS, and/or because it is in a different
                               domain. Note that if an LSR has a
                               neighboring host which is not running
                               MPLS, then that LSR is an MPLS edge node.

     MPLS egress node          an MPLS edge node in its role in handling
                               traffic as it leaves an MPLS domain

     MPLS ingress node         an MPLS edge node in its role in handling
                               traffic as it enters an MPLS domain

     MPLS label                a label placed in a short MPLS shim
                               header used to identify streams

     MPLS node                 a node which is running MPLS. An MPLS
                               node will be aware of MPLS control
                               protocols, will operate one or more L3
                               routing protocols, and will be capable of
                               forwarding packets based on labels.  An
                               MPLS node may optionally be also capable
                               of forwarding native L3 packets.

     MultiProtocol Label Switching  an IETF working group and the effort
                                    associated with the working group

     network layer             synonymous with layer 3

     stack                     synonymous with label stack

     stream                    an aggregate of one or more flows,
                               treated as one aggregate for the purpose
                               of forwarding in L2 and/or L3 nodes
                               (e.g., may be described using a single
                               label). In many cases a stream may be the
                               aggregate of a very large number of
                               flows.  Synonymous with "aggregate
                               stream".

     stream merge              the merging of several smaller streams
                               into a larger stream, such that for some
                               or all of the path the larger stream can
                               be referred to using a single label.

     switched path             synonymous with label switched path

     virtual circuit           a circuit used by a connection-oriented
                               layer 2 technology such as ATM or Frame
                               Relay, requiring the maintenance of state
                               information in layer 2 switches.

     VC merge                  stream merge when it is specifically
                               applied to VCs, specifically so as to
                               allow multiple VCs to merge into one
                               single VC

     VP merge                  stream merge when it is applied to VPs,
                               specifically so as to allow multiple VPs
                               to merge into one single VP. In this case
                               the VCIs need to be unique. This allows
                               cells from different sources to be
                               distinguished via the VCI.

     VPI/VCI                   a label used in ATM networks to identify
                               circuits

1.3. Acronyms and Abbreviations

   ATM                       Asynchronous Transfer Mode

   BGP                       Border Gateway Protocol

   DLCI                      Data Link Circuit Identifier

   FEC                       Forwarding Equivalence Class

   STN                       Stream to NHLFE Map

   IGP                       Interior Gateway Protocol

   ILM                       Incoming Label Map

   IP                        Internet Protocol

   LIB                       Label Information Base

   LDP                       Label Distribution Protocol

   L2                        Layer 2

   L3                        Layer 3

   LSP                       Label Switched Path

   LSR                       Label Switching Router

   MPLS                      MultiProtocol Label Switching

   MPT                       Multipoint to Point Tree

   NHLFE                     Next Hop Label Forwarding Entry

   SVC                       Switched Virtual Circuit

   SVP                       Switched Virtual Path

   TTL                       Time-To-Live

   VC                        Virtual Circuit

   VCI                       Virtual Circuit Identifier

   VP                        Virtual Path

   VPI                       Virtual Path Identifier

1.4. Acknowledgments

   The ideas and text in this document have been collected from a number
   of sources and comments received. We would like to thank Rick Boivie,
   Paul Doolan, Nancy Feldman, Yakov Rekhter, Vijay Srinivasan, and
   George Swallow for their inputs and ideas.

2. Outline of Approach

   In this section, we introduce some of the basic concepts of MPLS and
   describe the general approach to be used.

2.1. Labels

   A label is a short fixed length locally significant identifier which
   is used to identify a stream. The label is based on the stream or
   forwarding equivalence class that a packet is assigned to. The label
   does not directly encode the network layer address, and is based on
   the network layer address only to the extent that the forwarding
   equivalence class is based on the address.

   If Ru and Rd are neighboring LSRs, they may agree to use label L to
   represent Stream S for packets which are sent from Ru to Rd.  That
   is, they can agree to a "mapping" between label L and Stream S for
   packets moving from Ru to Rd.  As a result of such an agreement, L
   becomes Ru's "outgoing label" corresponding to Stream S for such
   packets; L becomes Rd's "incoming label" corresponding to Stream S
   for such packets.

   Note that L does not necessarily correspond to Stream S for any
   packets other than those which are being sent from Ru to Rd.  Also, L
   is not an inherently meaningful value and does not have any network-
   wide value; the particular value assigned to L gets its meaning
   solely from the agreement between Ru and Rd.

   Sometimes it may be difficult or even impossible for Rd to tell that
   an arriving packet carrying label L comes from Ru, rather than from
   some other LSR.  In such cases, Rd must make sure that the mapping
   from label to FEC is one-to-one.  That is, in such cases, Rd must not
   agree with Ru1 to use L for one purpose, while also agreeing with
   some other LSR Ru2 to use L for a different purpose.
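
   A minimal sketch of the consequence for Rd's label allocation: when
   upstream neighbors are indistinguishable, each label value must be
   bound to at most one Stream. The class and its starting value are
   invented for illustration:

      class LabelAllocator:
          """One shared label space: each value maps to exactly one
          Stream, so an arriving label identifies the Stream regardless
          of which upstream neighbor sent it."""
          def __init__(self, first_free=16):      # arbitrary starting value
              self._next = first_free
              self._by_stream = {}

          def bind(self, stream):
              if stream not in self._by_stream:   # never reuse a value
                  self._by_stream[stream] = self._next
                  self._next += 1
              return self._by_stream[stream]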

   The scope of labels could be unique per interface, or unique per MPLS
   node, or unique in a network. If labels are unique within a network,
   no label swapping needs to be performed in the MPLS nodes in that
   domain.  The packets are just label forwarded and not label swapped.
   The possible use of labels with network-wide scope is FFS.

2.2. Upstream and Downstream LSRs

   Suppose Ru and Rd have agreed to map label L to Stream S, for packets
   sent from Ru to Rd.  Then with respect to this mapping, Ru is the
   "upstream LSR", and Rd is the "downstream LSR".

   The notions of upstream and downstream relate to agreements between
   nodes on the label values to be assigned for packets belonging to a
   particular Stream that might be traveling from an upstream node to a
   downstream node.
   actually will cause any packets to be transmitted in that particular
   direction. Thus, Rd is the downstream LSR for a particular mapping
   for label L if it recognizes L-labeled packets from Ru as being in
   Stream S.  This may be true even if routing does not actually forward
   packets for Stream S between nodes Rd and Ru, or if routing has made
   Ru downstream of Rd along the path which is actually used for packets
   in Stream S.

2.3. Labeled Packet

   A "labeled packet" is a packet into which a label has been encoded.
   The encoding can be done by means of an encapsulation which exists
   specifically for this purpose, or by placing the label in an
   available location in either of the data link or network layer
   headers. Of course, the encoding technique must be agreed to by the
   entity which encodes the label and the entity which decodes the
   label.

2.4. Label Assignment and Distribution; Attributes

   For unicast traffic in the MPLS architecture, the decision to bind a
   particular label L to a particular Stream S is made by the LSR which
   is downstream with respect to that mapping.  The downstream LSR then
   informs the upstream LSR of the mapping.  Thus labels are
   "downstream-assigned", and are "distributed upstream".

   A particular mapping of label L to Stream S, distributed by Rd to Ru,
   may have associated "attributes".  If Ru, acting as a downstream LSR,
   also distributes a mapping of a label to Stream S, then under certain
   conditions, it may be required to also distribute the corresponding
   attribute that it received from Rd.

2.5. Label Distribution Protocol (LDP)

   A Label Distribution Protocol (LDP) is a set of procedures by which
   one LSR informs another of the label/Stream mappings it has made.
   Two LSRs which use an LDP to exchange label/Stream mapping
   information are known as "LDP Peers" with respect to the mapping
   information they exchange; we will speak of there being an "LDP
   Adjacency" between them.

   (N.B.: two LSRs may be LDP Peers with respect to some set of
   mappings, but not with respect to some other set of mappings.)

   The LDP also encompasses any negotiations in which two LDP Peers need
   to engage in order to learn of each other's MPLS capabilities.

2.6. The Label Stack

   So far, we have spoken as if a labeled packet carries only a single
   label. As we shall see, it is useful to have a more general model in
   which a labeled packet carries a number of labels, organized as a
   last-in, first-out stack.  We refer to this as a "label stack".

   At a particular LSR, the decision as to how to forward a labeled
   packet is always based exclusively on the label at the top of the
   stack.

   An unlabeled packet can be thought of as a packet whose label stack
   is empty (i.e., whose label stack has depth 0).

   If a packet's label stack is of depth m, we refer to the label at the
   bottom of the stack as the level 1 label, to the label above it (if
   such exists) as the level 2 label, and to the label at the top of the
   stack as the level m label.

   The utility of the label stack will become clear when we introduce
   the notion of LSP Tunnel and the MPLS Hierarchy (sections 2.19.3 and
   2.19.4).

2.7. The Next Hop Label Forwarding Entry (NHLFE)

   The "Next Hop Label Forwarding Entry" (NHLFE) is used when forwarding
   a labeled packet. It contains the following information:

      1. the packet's next hop

      2. the data link encapsulation to use when transmitting the packet

      3. the way to encode the label stack when transmitting the packet

      4. the operation to perform on the packet's label stack; this is
         one of the following operations:

            a) replace the label at the top of the label stack with a
               specified new label

            b) pop the label stack

            c) replace the label at the top of the label stack with a
               specified new label, and then push one or more specified
               new labels onto the label stack.

   Note that at a given LSR, the packet's "next hop" might be that LSR
   itself.  In this case, the LSR would need to pop the top level label
   and examine and operate on the encapsulated packet. This may be a
   lower level label, or may be the native IP packet. This implies that
   in some cases the LSR may need to operate on the IP header in order
   to forward the packet. If the packet's "next hop" is the current LSR,
   then the label stack operation MUST be to "pop the stack".
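
   The information above might be represented as follows; this is a
   Python sketch whose field names are ours, not drawn from the
   specification:

      from dataclasses import dataclass, field
      from typing import List, Optional

      @dataclass
      class NHLFE:
          next_hop: str                       # 1. the packet's next hop
          encapsulation: str                  # 2. data link encapsulation
          label_encoding: str                 # 3. label stack encoding
          op: str                             # 4. "replace", "pop", or
                                              #    "replace-push"
          new_label: Optional[int] = None     # for the replace operations
          push_labels: List[int] = field(default_factory=list)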

2.8. Incoming Label Map (ILM)

   The "Incoming Label Map" (ILM) is a mapping from incoming labels to
   NHLFEs. It is used when forwarding packets that arrive as labeled
   packets.

2.9. Stream-to-NHLFE Map (STN)

   The "Stream-to-NHLFE" (STN) is a mapping from stream to NHLFEs. It is
   used when forwarding packets that arrive unlabeled, but which are to
   be labeled before being forwarded.

2.10. Label Swapping

   Label swapping is the use of the following procedures to forward a
   packet.

   In order to forward a labeled packet, an LSR examines the label at the
   top of the label stack. It uses the ILM to map this label to an
   NHLFE.  Using the information in the NHLFE, it determines where to
   forward the packet, and performs an operation on the packet's label
   stack. It then encodes the new label stack into the packet, and
   forwards the result.

   In order to forward an unlabeled packet, an LSR analyzes the network
   layer header to determine the packet's Stream. It then uses the STN
   to map this to an NHLFE. Using the information in the NHLFE, it
   determines where to forward the packet, and performs an operation on
   the packet's label stack.  (Popping the label stack would, of course,
   be illegal in this case.)  It then encodes the new label stack into
   the packet, and forwards the result.

   It is important to note that when label swapping is in use, the next
   hop is always taken from the NHLFE; this may in some cases be
   different from what the next hop would be if MPLS were not in use.
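
   The two forwarding cases can be sketched as follows; dictionaries
   stand in for the ILM, the STN, and NHLFEs, and all names are
   illustrative:

      def apply_stack_op(packet, nhlfe):
          """Apply the NHLFE's label stack operation to the packet."""
          stack = packet["stack"]
          if nhlfe["op"] == "pop":
              stack.pop()
              return
          if stack:                            # replace the top label ...
              stack[-1] = nhlfe["new_label"]
          else:                                # ... or push onto an empty stack
              stack.append(nhlfe["new_label"])
          stack.extend(nhlfe.get("push", []))  # then push any further labels

      def forward_labeled(packet, ilm):
          nhlfe = ilm[packet["stack"][-1]]     # top of stack indexes the ILM
          apply_stack_op(packet, nhlfe)
          return nhlfe["next_hop"]

      def forward_unlabeled(packet, stn, classify):
          nhlfe = stn[classify(packet)]        # L3 analysis picks the Stream
          assert nhlfe["op"] != "pop"          # popping is illegal here
          apply_stack_op(packet, nhlfe)
          return nhlfe["next_hop"]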

2.11. Label Switched Path (LSP), LSP Ingress, LSP Egress

   A "Label Switched Path (LSP) of level m" for a particular packet P is
   a sequence of LSRs,

                             <R1, R2, ..., Rn>

   with the following properties:

      1. R1, the "LSP Ingress", pushes a label onto P's label stack,
         resulting in a label stack of depth m;

      2. For all i, 1<i<n: P has a label stack of depth m when received
         by LSR Ri;

      3. At no time during P's transit from R1 to R[n-1] does its label
         stack ever have a depth of less than m;

      4. For all i, 1<i<n: Ri transmits P to R[i+1] by means of MPLS,
         i.e., by using the label at the top of the label stack (the
         level m label) as an index into an ILM;

      5. For all i, 1<i<n: if a system S receives and forwards P after P
         is transmitted by Ri but before P is received by R[i+1], then
         S's forwarding decision is not based on the level m label or on
         the network layer header (for example, because the decision is
         based on a label stack onto which additional labels have been
         pushed, i.e., on a level m+k label, where k>0).

   In other words, we can speak of the level m LSP for Packet P as the
   sequence of LSRs:

      1. which begins with an LSR (an "LSP Ingress") that pushes on a
         level m label,

      2. all of whose intermediate LSRs make their forwarding decision
         by label switching on a level m label,

      3. which ends (at an "LSP Egress") when a forwarding decision is
         made by label switching on a level m-k label, where k>0, or
         when a forwarding decision is made by "ordinary", non-MPLS
         forwarding procedures.

   A consequence (or perhaps a presupposition) of this is that whenever
   an LSR pushes a label onto an already labeled packet, it needs to
   make sure that the new label corresponds to a FEC whose LSP Egress is
   the LSR that assigned the label which is now second in the stack.

   Note that according to these definitions, if <R1, ..., Rn> is a level
   m LSP for packet P, P may be transmitted from R[n-1] to Rn with a
   label stack of depth m-1. That is, the label stack may be popped at
   the penultimate LSR of the LSP, rather than at the LSP Egress. This
   is appropriate, since the level m label has served its function of
   getting the packet to Rn, and Rn's forwarding decision cannot be made
   until the level m label is popped.  If the label stack is not popped
   by R[n-1], then Rn must do two label lookups; this is an overhead
   which is best avoided.  However, some hardware switching engines may
   not be able to pop the label stack.

   The penultimate node pops the label stack only if this is
   specifically requested by the egress node. Having the penultimate
   node pop the label stack has an implication on the assignment of
   labels: For any one node Rn, operating at level m in the MPLS
   hierarchy, there may be some LSPs which terminate at that node (i.e.,
   for which Rn is the egress node) and some other LSPs which continue
   beyond that node (i.e., for which Rn is an intermediate node). If the
   penultimate node R[n-1] pops the stack for those LSPs which terminate
   at Rn, then node R[n] will receive some packets for which the top of
   the stack is a level m label (i.e., packets destined for other egress
   nodes), and some packets for which the top of the stack is a level
   m-1 label (i.e., packets for which Rn is the egress). This implies
   that in order for node R[n-1] to pop the stack, node Rn must assign
   labels such that level m and level m-1 labels are distinguishable
   (i.e., use unique values across multiple levels of the MPLS
   hierarchy).

   Note that if m = 1, the LSP Egress may receive an unlabeled packet,
   and in fact need not even be capable of supporting MPLS. In this
   case, assuming that we are using globally meaningful IP addresses,
   the confusion of labels at multiple levels is not possible. However,
   it is possible that the label may still be of value for the egress
   node. One example is that the label may be used to assign the packet
   to a particular Forwarding Equivalence Class (for example, to
   identify the packet as a high priority packet). Another example is
   that the label may assign the packet to a particular virtual private
   network (for example, the virtual private network may make use of
   local IP addresses, and the label may be necessary to disambiguate
   the addresses). Therefore even when there is only a single label
   value the stack is nonetheless popped only when requested by the
   egress node.

   We will call a sequence of LSRs the "LSP for a particular Stream S"
   if it is an LSP of level m for a particular packet P when P's level m
   label is a label corresponding to Stream S.

2.12. LSP Next Hop

   The LSP Next Hop for a particular labeled packet in a particular LSR
   is the LSR which is the next hop, as selected by the NHLFE entry used
   for forwarding that packet.

   The LSP Next Hop for a particular Stream is the next hop as selected
   by the NHLFE entry indexed by a label which corresponds to that
   Stream.

2.13. Route Selection

   Route selection refers to the method used for selecting the LSP for a
   particular stream. The proposed MPLS protocol architecture supports
   two options for Route Selection: (1) Hop by hop routing, and (2)
   Explicit routing.

   Hop by hop routing allows each node to independently choose the next
   hop for the path for a stream. This is the normal mode today with
   existing datagram IP networks. A hop by hop routed LSP refers to an
   LSP whose route is selected using hop by hop routing.

   An explicitly routed LSP is an LSP where, at a given LSR, the LSP
   next hop is not chosen by each local node, but rather is chosen by a
   single node (usually the ingress or egress node of the LSP). The
   sequence of LSRs followed by an explicitly routed LSP may be chosen by
   configuration, or by a protocol selected by a single node (for
   example, the egress node may make use of the topological information
   learned from a link state database in order to compute the entire
   path for the tree ending at that egress node). Explicit routing may
   be useful for a number of purposes such as allowing policy routing
   and/or facilitating traffic engineering.  With MPLS the explicit
   route needs to be specified at the time that Labels are assigned, but
   the explicit route does not have to be specified with each IP packet.
   This implies that explicit routing with MPLS is relatively efficient
   (when compared with the efficiency of explicit routing for pure
   datagrams).

   For any one LSP (at any one level of hierarchy), there are two
   possible options: (i) The entire LSP may be hop by hop routed from
   ingress to egress; (ii) The entire LSP may be explicitly routed from
   ingress to egress. Intermediate cases do not make sense: In general,
   an LSP will be explicitly routed specifically because there is a good
   reason to use an alternative to the hop by hop routed path. This
   implies that if some of the nodes along the path follow an explicit
   route but some of the nodes make use of hop by hop routing, then
   inconsistent routing will result and loops (or severely inefficient
   paths) may form.

   For this reason, it is important that if an explicit route is
   specified for an LSP, then that route must be followed. Note that it
   is relatively simple to *follow* an explicit route which is specified
   in a LDP setup.  We therefore propose that the LDP specification
   require that all MPLS nodes implement the ability to follow an
   explicit route if this is specified.

   It is not necessary for a node to be able to create an explicit
   route.  However, in order to ensure interoperability it is necessary
   to ensure that either (i) Every node knows how to use hop by hop
   routing; or (ii) Every node knows how to create and follow an
   explicit route. We propose that due to the common use of hop by hop
   routing in networks today, it is reasonable to make hop by hop
   routing the default that all nodes need to be able to use.

2.14. Time-to-Live (TTL)

   In conventional IP forwarding, each packet carries a "Time To Live"
   (TTL) value in its header.  Whenever a packet passes through a
   router, its TTL gets decremented by 1; if the TTL reaches 0 before
   the packet has reached its destination, the packet gets discarded.

   This provides some level of protection against forwarding loops that
   may exist due to misconfigurations, or due to failure or slow
   convergence of the routing algorithm. TTL is sometimes used for other
   functions as well, such as multicast scoping, and supporting the
   "traceroute" command. This implies that there are two TTL-related
   issues that MPLS needs to deal with: (i) TTL as a way to suppress
   loops; (ii) TTL as a way to accomplish other functions, such as
   limiting the scope of a packet.

   When a packet travels along an LSP, it should emerge with the same
   TTL value that it would have had if it had traversed the same
   sequence of routers without having been label switched.  If the
   packet travels along a hierarchy of LSPs, the total number of LSR-
   hops traversed should be reflected in its TTL value when it emerges
   from the hierarchy of LSPs.

   The way that TTL is handled may vary depending upon whether the MPLS
   label values are carried in an MPLS-specific "shim" header, or if the
   MPLS labels are carried in an L2 header such as an ATM header or a
   frame relay header.

   If the label values are encoded in a "shim" that sits between the
   data link and network layer headers, then this shim should have a TTL
   field that is initially loaded from the network layer header TTL
   field, is decremented at each LSR-hop, and is copied into the network
   layer header TTL field when the packet emerges from its LSP.

   If the label values are encoded in an L2 header (e.g., the VPI/VCI
   field in ATM's AAL5 header), and the labeled packets are forwarded by
   an L2 switch (e.g., an ATM switch), then unless the data link layer
   itself has a TTL field (unlike ATM), it will not be possible to
   decrement a packet's TTL at each LSR-hop. An LSP segment
   which consists of a sequence of LSRs that cannot decrement a packet's
   TTL will be called a "non-TTL LSP segment".

   When a packet emerges from a non-TTL LSP segment, it should however
   be given a TTL that reflects the number of LSR-hops it traversed. In
   the unicast case, this can be achieved by propagating a meaningful
   LSP length to ingress nodes, enabling the ingress to decrement the
   TTL value before forwarding packets into a non-TTL LSP segment.

   Sometimes it can be determined, upon ingress to a non-TTL LSP
   segment, that a particular packet's TTL will expire before the packet
   reaches the egress of that non-TTL LSP segment. In this case, the LSR
   at the ingress to the non-TTL LSP segment must not label switch the
   packet. This means that special procedures must be developed to
   support traceroute functionality; for example, traceroute packets may
   be forwarded using conventional hop by hop forwarding.
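
   The TTL rules of this section can be summarized in a short Python
   sketch. The field names are invented, and a real shim encoding is out
   of scope here:

      def enter_lsp(packet):
          packet["shim_ttl"] = packet["ip_ttl"]     # load shim from IP TTL

      def lsr_hop(packet):
          packet["shim_ttl"] -= 1                   # decrement per LSR-hop

      def exit_lsp(packet):
          packet["ip_ttl"] = packet["shim_ttl"]     # copy back on egress

      def enter_non_ttl_segment(packet, segment_hops):
          """Charge a non-TTL segment's hop count up front at ingress."""
          if packet["ip_ttl"] <= segment_hops:
              return False      # would expire inside: do not label switch
          packet["ip_ttl"] -= segment_hops
          return True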

2.15. Loop Control

   On a non-TTL LSP segment, by definition, TTL cannot be used to
   protect against forwarding loops.  The importance of loop control may
   depend on the particular hardware being used to provide the LSR
   functions along the non-TTL LSP segment.

   Suppose, for instance, that ATM switching hardware is being used to
   provide MPLS switching functions, with the label being carried in the
   VPI/VCI field. Since ATM switching hardware cannot decrement TTL,
   there is no protection against loops. If the ATM hardware is capable
   of providing fair access to the buffer pool for incoming cells
   carrying different VPI/VCI values, this looping may not have any
   deleterious effect on other traffic. If the ATM hardware cannot
   provide fair buffer access of this sort, however, then even transient
   loops may cause severe degradation of the LSR's total performance.

   Even if fair buffer access can be provided, it is still worthwhile to
   have some means of detecting loops that last "longer than possible".
   In addition, even where TTL and/or per-VC fair queuing provides a
   means for surviving loops, it still may be desirable where practical
   to avoid setting up LSPs which loop.

   The MPLS architecture will therefore provide a technique for ensuring
   that looping LSP segments can be detected, and a technique for
   ensuring that looping LSP segments are never created.

2.15.1. Loop Prevention

   LSRs maintain, for each of their LSPs, an LSR ID list. This list
   identifies all the LSRs downstream of this LSR on a given LSP. The
   LSR ID list is used to prevent the formation of switched path loops.
   The LSR ID list is propagated upstream from a node to its neighbor
   nodes, and is used to prevent loops as follows:

   When a node, R, detects a change in the next hop for a given stream,
   it asks its new next hop for a label and the associated LSR ID list
   for that stream.

   The new next hop responds with a label for the stream and an
   associated LSR id list.

   R looks in the LSR ID list. If R determines that it is in the list,
   then there is a route loop. In this case, R does nothing, and the
   old LSP continues to be used until the routing protocols break the
   loop. The means by which the old LSP is replaced by a new LSP after
   the routing protocols break the loop is described below.

   If R is not in the LSR ID list, R will start a "diffusion"
   computation [12].  The purpose of the diffusion computation is to
   prune the tree upstream of R, removing all LSRs that would be on a
   looping path if R were to switch over to the new LSP.  After those
   LSRs are removed from the tree, it is safe for R to replace the old
   LSP with the new LSP (and the old LSP can be released).

   The diffusion computation works as follows:

   R adds its LSR id to the list and sends a query message to each of
   its "upstream" neighbors (i.e. to each of its neighbors that is not
   the new "downstream" next hop).

   A node S that receives such a query will process the query as
   follows:

     - If node R is not node S's next hop for the given stream, node S
       will respond to node R with an "OK" message, meaning that as far
       as node S is concerned it is safe for node R to switch over to
       the new LSP.

     - If node R is node S's next hop for the stream, node S will check
       to see if it, node S, is in the LSR ID list that it received
       from node R.  If it is, we have a route loop and S will respond
       with a "LOOP" message.  R will then unsplice the connection to
       S, pruning S from the tree.  The mechanism by which S will get a
       new LSP for the stream after the routing protocols break the
       loop is described below.

     - If node S is not in the LSR ID list, S will add its LSR ID to
       the list and send a new query message further upstream.  The
       diffusion computation continues to propagate upstream along each
       of the paths in the tree upstream of S until either a loop is
       detected (in which case the node is pruned as described above),
       or a node receives a response ("OK" or "LOOP") from each of its
       neighbors, perhaps because none of those neighbors considers the
       node in question to be its downstream next hop.  Once a node has
       received a response from each of its upstream neighbors, it
       returns an "OK" message to its downstream neighbor.  When the
       original node, node R, gets a response from each of its
       neighbors, it is safe to replace the old LSP with the new one,
       because all the paths that would loop have been pruned from the
       tree.
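
   The handling of a query by node S can be summarized in the following
   Python sketch (illustrative only; the class is invented, the "OK"
   and "LOOP" return values mirror the messages above, and pruning,
   unsplicing, and timeouts are omitted):

      class Lsr:
          def __init__(self, lsr_id, next_hop=None):
              self.lsr_id = lsr_id
              self.next_hop = next_hop   # downstream next hop, per stream
              self.neighbors = []        # all neighbor LSRs

          def process_query(self, from_node, lsr_id_list):
              # If the querier is not our next hop for this stream, its
              # switch-over cannot put us on a looping path.
              if self.next_hop is not from_node:
                  return "OK"
              # We are already in the list: a route loop exists.
              if self.lsr_id in lsr_id_list:
                  return "LOOP"
              # Otherwise add ourselves and propagate the query further
              # upstream; we answer only once every upstream neighbor
              # has answered.
              new_list = lsr_id_list + [self.lsr_id]
              for n in self.neighbors:
                  if n is not self.next_hop:
                      n.process_query(self, new_list)
              return "OK"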

   There are a couple of details to discuss:

     - First, we need to do something about nodes that, for one reason
       or another, do not respond to a query message in a timely
       fashion.  If a node Y does not respond to a query from node X
       because of a failure of some kind, X will not be able to respond
       to its downstream neighbors (if any) or switch over to a new LSP
       if X is, like R above, the node that has detected the route
       change.  This problem is handled by timing out the query
       message.  If a node doesn't receive a response within a
       "reasonable" period of time, it "unsplices" its VC to the
       upstream neighbor that is not responding and proceeds as it
       would if it had received a "LOOP" message.

     - We also need to be concerned about multiple concurrent routing
       updates.  What happens, for example, when a node M receives a
       request for an LSP from an upstream neighbor, N, while M is in
       the middle of a diffusion computation (i.e., it has sent a query
       upstream but hasn't received all the responses)?  Since a
       downstream node, node R, is about to change from one LSP to
       another, M needs to pass to N an LSR ID list corresponding to
       the union of the old and new LSPs if it is to avoid loops both
       before and after the transition.  This is easily accomplished,
       since M already has the LSR ID list for the old LSP and it gets
       the LSR ID list for the new LSP in the query message.  After R
       makes the switch from the old LSP to the new one, R sends a new
       establish message upstream with the LSR ID list of (just) the
       new LSP.  At this point, the nodes upstream of R know that R has
       switched over to the new LSP and that they can return the ID
       list for (just) the new LSP in response to any new requests for
       LSPs.

       They can also grow the tree to include additional nodes that
       would not have been valid for the combined LSR id list.

     - We also need to discuss how a node that doesn't have an LSP for a
       given stream at the end of a diffusion computation (because it
       would have been on a looping LSP) gets one after the routing
       protocols break the loop.  If node L has been pruned from the
       tree and its local route protocol processing entity breaks the
       loop by changing L's next hop, L will request a new LSP from its
       new downstream neighbor which it will use once it executes the
       diffusion computation as described above.  If the loop is broken
       by a route change at another point in the loop, i.e. at a point
       "downstream" of L, L will get a new LSP as the new LSP tree grows
       upstream from the point of the route change as discussed in the
       previous paragraph.

     - Note that when a node is pruned from the tree, the switched path
       upstream of that node remains "connected".  This is important
       since it allows the switched path to get "reconnected" to a
       downstream switched path after a route change with a minimal
       amount of unsplicing and resplicing once the appropriate
       diffusion computation(s) have taken place.

   The LSR Id list can also be used to provide a "loop detection"
   capability.  To use it in this manner, an LSR which sees that it is
   already in the LSR Id list for a particular stream will immediately
   unsplice itself from the switched path for that stream, and will NOT
   pass the LSR Id list further upstream.  The LSR can rejoin a switched
   path for the stream when it changes its next hop for that stream, or
   when it receives a new LSR Id list from its current next hop, in
   which it is not contained.  The diffusion computation would be
   omitted.

2.15.2. Interworking of Loop Control Options

   The MPLS protocol architecture allows some nodes to be using loop
   prevention, while some other nodes are not (i.e., the choice of
   whether or not to use loop prevention may be a local decision). When
   this mix is used, it is not possible for a loop to form which
   includes only nodes which do loop prevention. However, it is possible
   for loops to form which contain a combination of some nodes which do
   loop prevention, and some nodes which do not.

   There are at least four identified cases in which it makes sense to
   combine nodes which do loop prevention with nodes which do not: (i)
   For transition, in intermediate states while transitioning from all
   non-loop-prevention to all loop prevention, or vice versa; (ii) For
   interoperability, where one vendor implements loop prevention but
   another vendor does not; (iii) Where there is a mixed ATM and
   datagram media network, and where loop prevention is desired over the
   ATM portions of the network but not over the datagram portions; (iv)
   where some of the ATM switches can do fair access to the buffer pool
   on a per-VC basis, and some cannot, and loop prevention is desired
   over the ATM portions of the network which cannot.

   Note that interworking is straightforward.  If an LSR is not doing
   loop prevention, and it receives from a downstream LSR a label
   mapping which contains loop prevention information, it (a) accepts
   the label mapping, (b) does NOT pass the loop prevention information
   upstream, and (c) informs the downstream neighbor that the path is
   loop-free.

   Similarly, if an LSR R which is doing loop prevention receives from a
   downstream LSR a label mapping which does not contain any loop
   prevention information, then R passes the label mapping upstream with
   loop prevention information included as if R were the egress for the
   specified stream.

   Optionally, a node may implement the ability either to do or not to
   do loop prevention, and may choose which to use for any particular
   LSP based on the information obtained from downstream nodes. When
   the label mapping arrives from downstream, the node may choose
   whether to use loop prevention so as to continue with the same
   approach as was used in the information passed to it. Note that
   regardless of whether loop prevention is used, the egress node (for
   any particular LSP) always initiates the exchange of label mapping
   information without waiting for other nodes to act.

2.16. Merging and Non-Merging LSRs

   Merge allows multiple upstream LSPs to be merged into a single
   downstream LSP. When implemented by multiple nodes, this results in
   the traffic destined for a particular egress node, for one
   particular Stream, following a multipoint-to-point tree (MPT),
   rooted at the egress node and associated with the Stream. This can
   significantly reduce the number of labels that need to be maintained
   by any one particular node.

   If merge were not used at all, it would be necessary for each node
   to provide its upstream neighbors with a label for each Stream for
   each upstream node which may be forwarding traffic over the link.
   This
   implies that the number of labels needed might not in general be
   known a priori. However, the use of merge allows a single label to be
   used per Stream, therefore allowing label assignment to be done in a
   common way without regard for the number of upstream nodes which will
   be using the downstream LSP.

   The proposed MPLS protocol architecture supports LSP merge, while
   also accommodating nodes which do not support it. This leads to the
   issue of ensuring correct interoperation between nodes which
   implement merge and those which do not. The issue is somewhat
   different in the case of datagram media versus the case of ATM. The
   different media types will therefore be discussed separately.

2.16.1. Stream Merge

   Let us say that an LSR is capable of Stream Merge if it can receive
   two packets from different incoming interfaces, and/or with different
   labels, and send both packets out the same outgoing interface with
   the same label. This in effect takes two incoming streams and merges
   them into one. Once the packets are transmitted, the information that
   they arrived from different interfaces and/or with different incoming
   labels is lost.

   Let us say that an LSR is not capable of Stream Merge if, for any two
   packets which arrive from different interfaces, or with different
   labels, the packets must either be transmitted out different
   interfaces, or must have different labels.

   An LSR which is capable of Stream Merge (a "Merging LSR") needs to
   maintain only one outgoing label for each FEC. An LSR which is not
   capable of Stream Merge (a "Non-merging LSR") may need to maintain as
   many as N outgoing labels per FEC, where N is the number of LSRs in
   the network. Hence by supporting Stream Merge, an LSR can reduce its
   number of outgoing labels by a factor of O(N). Since each label in
   use requires the dedication of some amount of resources, this can be
   a significant savings.

2.16.2. Non-merging LSRs

   The MPLS forwarding procedures are very similar to the forwarding
   procedures used by such technologies as ATM and Frame Relay. That is,
   a unit of data arrives, a label (VPI/VCI or DLCI) is looked up in a
   "cross-connect table", on the basis of that lookup an output port is
   chosen, and the label value is rewritten. In fact, it is possible to
   use such technologies for MPLS forwarding; LDP can be used as the
   "signalling protocol" for setting up the cross-connect tables.

   Unfortunately, these technologies do not necessarily support the
   Stream Merge capability. In ATM, if one attempts to perform Stream
   Merge, the result may be the interleaving of cells from various
   packets. If cells from different packets get interleaved, it is
   impossible to reassemble the packets. Some Frame Relay switches use
   cell switching on their backplanes. These switches may also be
   incapable of supporting Stream Merge, for the same reason -- cells of
   different packets may get interleaved, and there is then no way to
   reassemble the packets.

   We propose to support two solutions to this problem. First, MPLS will
   contain procedures which allow the use of non-merging LSRs. Second,
   MPLS will support procedures which allow certain ATM switches to
   function as merging LSRs.

   Since MPLS supports both merging and non-merging LSRs, MPLS also
   contains procedures to ensure correct interoperation between them.

2.16.3. Labels for Merging and Non-Merging LSRs

   An upstream LSR which supports Stream Merge needs to be sent only one
   label per FEC. An upstream neighbor which does not support Stream
   Merge needs to be sent multiple labels per FEC. However, there is no
   way of knowing a priori how many labels it needs. This will depend on
   how many LSRs are upstream of it with respect to the FEC in question.

   In the MPLS architecture, if a particular upstream neighbor does not
   support Stream Merge, it is not sent any labels for a particular FEC
   unless it explicitly asks for a label for that FEC. The upstream
   neighbor may make multiple such requests, and is given a new label
   each time. When a downstream neighbor receives such a request from
   upstream, and the downstream neighbor does not itself support Stream
   Merge, then it must in turn ask its downstream neighbor for another
   label for the FEC in question.

   It is possible that there may be some nodes which support merge, but
   which can merge only a limited number of upstream streams into a
   single downstream stream. Suppose for example that due to some
   hardware limitation a node is capable of merging four upstream LSPs
   into a single downstream LSP. Suppose however, that this particular
   node has six upstream LSPs arriving at it for a particular Stream. In
   this case, this node may merge these into two downstream LSPs
   (corresponding to two labels that need to be obtained from the
   downstream neighbor). In this case, the normal operation of the LDP
   implies that the downstream neighbor will supply this node with a
   single label for the Stream. This node can then ask its downstream
   neighbor for one additional label for the Stream, implying that the
   node will thereby obtain the required two labels.
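
   The label counts involved can be illustrated with the following
   sketch (not normative; the allocator structure is invented for
   exposition). A merging upstream neighbor is given a single label per
   Stream, a non-merging neighbor is given a fresh label on each
   request, and a node that can merge at most four upstream LSPs needs
   ceil(6/4) = 2 downstream labels for six upstream LSPs:

      import math

      class LabelAllocator:
          def __init__(self):
              self.next_label = 100
              self.bindings = {}   # (neighbor, stream) -> list of labels

          def request_label(self, neighbor, stream, neighbor_merges):
              labels = self.bindings.setdefault((neighbor, stream), [])
              if neighbor_merges and labels:
                  # A merging neighbor needs only one label per Stream.
                  return labels[0]
              # A non-merging neighbor gets a new label on each request.
              label = self.next_label
              self.next_label += 1
              labels.append(label)
              return label

      # A node able to merge at most 4 upstream LSPs into one
      # downstream LSP, with 6 upstream LSPs for a Stream:
      downstream_labels_needed = math.ceil(6 / 4)   # == 2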

   The interaction between explicit routing and merge is FFS.

2.16.4. Merge over ATM

2.16.4.1. Methods of Eliminating Cell Interleave

   There are several methods that can be used to eliminate the cell
   interleaving problem in ATM, thereby allowing ATM switches to
   support stream merge:

      1. VP merge

          When VP merge is used, multiple virtual paths are merged into
          a single virtual path, but packets from different sources are
          distinguished by using different VCs within the VP.

      2. VC merge

         When VC merge is used, switches are required to buffer cells
         from one packet until the entire packet is received (this may
         be determined by looking for the AAL5 end of frame indicator).

   VP merge has the advantage that it is compatible with a higher
   percentage of existing ATM switch implementations. This makes it more
   likely that VP merge can be used in existing networks. Unlike VC
   merge, VP merge does not incur any delays at the merge points and
   also does not impose any buffer requirements.  However, it has the
   disadvantage that it requires coordination of the VCI space within
   each VP. There are a number of ways that this can be accomplished.
   Selection of one or more methods is FFS.

   This tradeoff between compatibility with existing equipment versus
   protocol complexity and scalability implies that it is desirable for
   the MPLS protocol to support both VP merge and VC merge. In order to
   do so each ATM switch participating in MPLS needs to know whether its
   immediate ATM neighbors perform VP merge, VC merge, or no merge.

2.16.4.2. Interoperation: VC Merge, VP Merge, and Non-Merge

   The interoperation of the various forms of merging over ATM is most
   easily described by first describing the interoperation of VC merge
   with non-merge.

   In the case where VC merge and non-merge nodes are interconnected the
   forwarding of cells is based in all cases on a VC (i.e., the
   concatenation of the VPI and VCI). For each node, if an upstream
   neighbor is doing VC merge then that upstream neighbor requires only
   a single VPI/VCI for a particular Stream (this is analogous to the
   requirement for a single label in the case of operation over frame
   media). If the upstream neighbor is not doing merge, then the
   neighbor will require a single VPI/VCI per Stream for itself, plus
   enough VPI/VCIs to pass to its upstream neighbors. The number
   required will be determined by allowing the upstream nodes to request
   additional VPI/VCIs from their downstream neighbors (this is again
   analogous to the method used with frame merge).

   A similar method is possible to support nodes which perform VP merge.
   In this case the VP merge node, rather than requesting a single
   VPI/VCI or a number of VPI/VCIs from its downstream neighbor, instead
   may request a single VP (identified by a VPI) but several VCIs within
   the VP.  Furthermore, suppose that a non-merge node is downstream
   from two different VP merge nodes. This node may need to request one
   VPI/VCI (for traffic originating from itself) plus two VPs (one for
   each upstream node), each associated with a specified set of VCIs (as
   requested from the upstream node).

   In order to support all of VP merge, VC merge, and non-merge, it is
   therefore necessary to allow upstream nodes to request a combination
   of zero or more VC identifiers (consisting of a VPI/VCI), plus zero
   or more VPs (identified by VPIs) each containing a specified number
   of VCs (identified by a set of VCIs which are significant within a
   VP). VP merge nodes would therefore request one VP, with a contained
   VCI for traffic they originate (if appropriate), plus a VCI for each
   VC requested from above (regardless of whether or not the VC is part
   of a containing VP). VC merge nodes would request only a single
   VPI/VCI (since they can merge all upstream traffic into a single
   VC). Non-merge nodes would pass on any requests that they get from
   above, plus request a VPI/VCI for traffic that they originate (if
   appropriate).
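
   The resulting requests can be pictured with the following sketch
   (illustrative only; the request structure is invented for
   exposition):

      def downstream_request(node_kind, upstream_vc_counts):
          # upstream_vc_counts: the number of VCs each upstream
          # neighbor has asked this node to obtain on its behalf.
          if node_kind == "vc-merge":
              # All upstream traffic merges into a single VC.
              return {"vcs": 1, "vps": []}
          if node_kind == "vp-merge":
              # One VP, containing a VCI for locally originated
              # traffic plus one VCI per VC requested from above.
              return {"vcs": 0, "vps": [1 + sum(upstream_vc_counts)]}
          if node_kind == "non-merge":
              # Pass upstream requests through, plus one VC for
              # locally originated traffic.
              return {"vcs": 1 + sum(upstream_vc_counts), "vps": []}
          raise ValueError(node_kind)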

2.17. LSP Control: Egress versus Local

   There is a choice to be made regarding whether the initial setup of
   LSPs will be initiated by the egress node, or locally by each
   individual node.

   When LSP control is done locally, then each node may at any time pass
   label bindings to its neighbors for each FEC recognized by that node.
   In the normal case that the neighboring nodes recognize the same
   FECs, then nodes may map incoming labels to outgoing labels as part
   of the normal label swapping forwarding method.

   When LSP control is done by the egress, then initially only the
   egress node passes label bindings to its neighbors corresponding to
   any FECs which leave the MPLS network at that egress node. Other
   nodes wait until they get a label from downstream for a particular
   FEC before passing a corresponding label for the same FEC to upstream
   nodes.

   With local control, since each LSR is (at least initially)
   independently assigning labels to FECs, it is possible that different
   LSRs may make inconsistent decisions. For example, an upstream LSR
   may make a coarse decision (map multiple IP address prefixes to a
   single label) while its downstream neighbor makes a finer grain
   decision (map each individual IP address prefix to a separate label).
   With downstream label assignment this can be corrected by having an
   LSR withdraw labels that it has assigned which are inconsistent with
   downstream labels, and replace them with new consistent label
   assignments.

   Even with egress control it is possible that the choice of egress
   node may change, or the egress may (based on a change in
   configuration) change its mind in terms of the granularity which is
   to be used. This implies the same mechanism will be necessary to
   allow changes in granularity to bubble up to upstream nodes. The
   choice of egress or local control may therefore affect the frequency
   with which this mechanism is used, but will not affect the need for
   a mechanism to achieve consistency of label granularity. Generally
   speaking, the choice of local versus egress control does not appear
   to have any effect on the LDP mechanisms which need to be defined.

   Egress control and local control can interwork in a very
   straightforward manner (although some of the advantages ascribed to
   egress control may be lost, see appendices A and B).  With either
   approach, (assuming downstream label assignment) the egress node will
   initially assign labels for particular FECs and will pass these
   labels to its neighbors. With either approach these label assignments
   will bubble upstream, with the upstream nodes choosing labels that
   are consistent with the labels that they receive from downstream. The
   difference between the two approaches is therefore primarily an issue
   of what each node does prior to obtaining a label assignment for a
   particular FEC from downstream nodes: Does it wait, or does it assign
   a preliminary label under the expectation that it will (probably) be
   correct?

   Regardless of which method is used (local control or egress control)
   each node needs to know (possibly by configuration) what granularity
   to use for labels that it assigns. Where egress control is used, this
   requires each node to know the granularity only for streams which
   leave the MPLS network at that node. For local control, in order to
   avoid the need to withdraw inconsistent labels, each node in the
   network would need to be configured consistently to know the
   granularity for each stream. However, in many cases this may be done
   by using a single level of granularity which applies to all streams
   (such as "one label per IP prefix in the forwarding table").  The
   choice between local control versus egress control could similarly be
   left as a configuration option.
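
   The difference between the two control modes reduces to a single
   predicate, sketched below (illustrative only; the names are invented
   for exposition):

      def may_advertise_label(control, is_egress, have_downstream_label):
          # Local control: any node may bind a label to a FEC and pass
          # it to its neighbors at any time.
          if control == "local":
              return True
          # Egress control: only the egress starts; every other node
          # waits until it holds a label from downstream for the FEC.
          return is_egress or have_downstream_label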

   Future versions of the MPLS architecture will need to choose between
   three options: (i) Requiring local control; (ii) Requiring egress
   control; or (iii) Allowing a choice of local control or egress
   control. Arguments for local versus egress control are contained in
   appendices A and B.

2.18. Granularity

   When forwarding by label swapping, a stream of packets arriving from
   upstream may be mapped into an equal or coarser grain stream.
   However, a coarse grain stream (for example, containing
   packets destined for a short IP address prefix covering many subnets)
   cannot be mapped directly into a finer grain stream (for example,
   containing packets destined for a longer IP address prefix covering a
   single subnet). This implies that there needs to be some mechanism
   for ensuring consistency between the granularity of LSPs in an MPLS
   network.

   The method used for ensuring compatibility of granularity may depend
   upon the method used for LSP control.

   When LSP control is local, it is possible that a node may pass a
   coarse grain label to its upstream neighbor(s), and subsequently
   receive a finer grain label from its downstream neighbor. In this
   case the node has two options: (i) It may forward the corresponding
   packets using normal IP datagram forwarding (i.e., by examination of
   the IP header); (ii) It may withdraw the label mappings that it has
   passed to its upstream neighbors, and replace these with finer grain
   label mappings.

   When LSP control is egress based, the label setup originates from the
   egress node and passes upstream. It is therefore straightforward with
   this approach to maintain equally-grained mappings along the route.

2.19. Tunnels and Hierarchy

   Sometimes a router Ru takes explicit action to cause a particular
   packet to be delivered to another router Rd, even though Ru and Rd
   are not consecutive routers on the Hop-by-hop path for that packet,
   and Rd is not the packet's ultimate destination. For example, this
   may be done by encapsulating the packet inside a network layer packet
   whose destination address is the address of Rd itself. This creates a
   "tunnel" from Ru to Rd. We refer to any packet so handled as a
   "Tunneled Packet".

2.19.1. Hop-by-Hop Routed Tunnel

   If a Tunneled Packet follows the Hop-by-hop path from Ru to Rd, we
   say that it is in a "Hop-by-Hop Routed Tunnel" whose "transmit
   endpoint" is Ru and whose "receive endpoint" is Rd.

2.19.2. Explicitly Routed Tunnel

   If a Tunneled Packet travels from Ru to Rd over a path other than the
   Hop-by-hop path, we say that it is in an "Explicitly Routed Tunnel"
   whose "transmit endpoint" is Ru and whose "receive endpoint" is Rd.
   For example, we might send a packet through an Explicitly Routed
   Tunnel by encapsulating it in a packet which is source routed.

2.19.3. LSP Tunnels

   It is possible to implement a tunnel as an LSP, and use label
   switching rather than network layer encapsulation to cause the
   packet to travel through the tunnel. The tunnel would be an LSP
   <R1, ..., Rn>, where R1 is the transmit endpoint of the tunnel, and
   Rn is the receive endpoint of the tunnel. This is called an "LSP
   Tunnel".

   The set of packets which are to be sent through the LSP tunnel
   becomes a Stream, and each LSR in the tunnel must assign a label to
   that Stream (i.e., must assign a label to the tunnel).  The criteria
   for assigning a particular packet to an LSP tunnel are a local
   matter at the tunnel's transmit endpoint.  To put a packet into an
   LSP tunnel,
   the transmit endpoint pushes a label for the tunnel onto the label
   stack and sends the labeled packet to the next hop in the tunnel.

   If it is not necessary for the tunnel's receive endpoint to be able
   to determine which packets it receives through the tunnel, as
   discussed earlier, the label stack may be popped at the penultimate
   LSR in the tunnel.

   A "Hop-by-Hop Routed LSP Tunnel" is a Tunnel that is implemented as
   an hop-by-hop routed LSP between the transmit endpoint and the
   receive endpoint.

   An "Explicitly Routed LSP Tunnel" is a LSP Tunnel that is also an
   Explicitly Routed LSP.

2.19.4. Hierarchy: LSP Tunnels within LSPs

   Consider an LSP <R1, R2, R3, R4>. Let us suppose that R1 receives
   unlabeled packet P, and pushes on its label stack the label to cause
   it to follow this path, and that this is in fact the Hop-by-hop
   path. However, let us further suppose that R2 and R3 are not
   directly connected, but are "neighbors" by virtue of being the
   endpoints of an LSP tunnel. So the actual sequence of LSRs traversed
   by P is <R1, R2, R21, R22, R23, R3, R4>.

   When P travels from R1 to R2, it will have a label stack of depth 1.
   R2, switching on the label, determines that P must enter the tunnel.
   R2 first replaces the incoming label with a label that is meaningful
   to R3.  Then it pushes on a new label.  This level 2 label has a
   value which is meaningful to R21.  Switching is done on the level 2
   label by
   R21, R22, R23. R23, which is the penultimate hop in the R2-R3 tunnel,
   pops the label stack before forwarding the packet to R3. When R3 sees
   packet P, P has only a level 1 label, having now exited the tunnel.
   Since R3 is the penultimate hop in P's level 1 LSP, it pops the label
   stack, and R4 receives P unlabeled.

   The label stack mechanism allows LSP tunneling to nest to any depth.
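
   The sequence of label stack operations in the example above can be
   traced as follows (an illustrative sketch; the label values are
   invented):

      stack = []
      stack.append("L1")    # R1 pushes the level 1 label
      stack[-1] = "L1a"     # R2 swaps in a level 1 label meaningful
                            # to R3 ...
      stack.append("L2")    # ... and then pushes the level 2 label
      stack[-1] = "L2a"     # R21 switches on the level 2 label
      stack[-1] = "L2b"     # R22 switches on the level 2 label
      stack.pop()           # R23, penultimate hop of the tunnel, pops
      stack.pop()           # R3, penultimate hop at level 1, pops
      assert stack == []    # R4 receives P unlabeled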

2.19.5. LDP Peering and Hierarchy

   Suppose that packet P travels along a Level 1 LSP <R1, R2, R3, R4>,
   and when going from R2 to R3 travels along a Level 2 LSP <R2, R21,
   R22, R23, R3>.  From the perspective of the Level 2 LSP, R2's LDP
   peer is R21.  From the perspective of the Level 1 LSP, R2's LDP
   peers are R1 and R3.  One can have LDP peers at each layer of
   hierarchy.  We will
   see in sections 3.6 and 3.7 some ways to make use of this hierarchy.
   Note that in this example, R2 and R21 must be IGP neighbors, but R2
   and R3 need not be.

   When two LSRs are IGP neighbors, we will refer to them as "Local LDP
   Peers".  When two LSRs may be LDP peers, but are not IGP neighbors,
   we will refer to them as "Remote LDP Peers".  In the above example,
   R2 and R21 are local LDP peers, but R2 and R3 are remote LDP peers.

   The MPLS architecture supports two ways to distribute labels at
   different layers of the hierarchy: Explicit Peering and Implicit
   Peering.

   One performs label distribution with one's Local LDP Peers by
   opening LDP connections to them.  One can perform label distribution
   with one's Remote LDP Peers in one of two ways:

      1. Explicit Peering

         In explicit peering, one sets up LDP connections between Remote
         LDP Peers, exactly as one would do for Local LDP Peers.  This
         technique is most useful when the number of Remote LDP Peers is
         small, or the number of higher level label mappings is large,
         or the Remote LDP Peers are in distinct routing areas or
         domains.  Of course, one needs to know which labels to
         distribute to which peers; this is addressed in section 3.1.2.

          Examples of the use of explicit peering are found in sections
          3.2.1 and 3.6.

      2. Implicit Peering

          In Implicit Peering, one does not have LDP connections to
          one's remote LDP peers, but only to one's local LDP peers.
          To distribute higher level labels to one's remote LDP peers,
          one encodes the higher level labels as an attribute of the
          lower level labels, and distributes the lower level label,
          along with this attribute, to the local LDP peers.  The local
          LDP peers then propagate the information to their peers.
          This process continues until the information reaches the
          remote LDP peers.  Note that the intermediary nodes may also
          be remote LDP peers.

          This technique is most useful when the number of Remote LDP
          Peers is large.  Implicit peering does not require an
          n-squared peering mesh to distribute labels to the remote
          LDP peers, because the information is piggybacked through
          the local LDP
         peering.  However, implicit peering requires the intermediate
         nodes to store information that they might not be directly
         interested in.

         An example of the use of implicit peering is found in section
         3.3.

2.20. LDP Transport

   LDP is used between nodes in an MPLS network to establish and
   maintain the label mappings. In order for LDP to operate correctly,
   LDP information needs to be transmitted reliably, and the LDP
   messages pertaining to a particular FEC need to be transmitted in
   sequence. This may potentially be accomplished either by using an
   existing reliable transport protocol such as TCP, or by specifying
   reliability mechanisms as part of LDP (for example, the reliability
   mechanisms which are defined in IDRP could potentially be "borrowed"
   for use with LDP). The precise means for accomplishing transport
   reliability with LDP are for further study, but will be specified by
   the MPLS Protocol Architecture before the architecture may be
   considered complete.

2.21. Label Encodings

   In order to transmit a label stack along with the packet whose label
   stack it is, it is necessary to define a concrete encoding of the
   label stack.  The architecture supports several different encoding
   techniques; the choice of encoding technique depends on the
   particular kind of device being used to forward labeled packets.

2.21.1. MPLS-specific Hardware and/or Software

   If one is using MPLS-specific hardware and/or software to forward
   labeled packets, the most obvious way to encode the label stack is to
   define a new protocol to be used as a "shim" between the data link
   layer and network layer headers.  This shim would really be just an
   encapsulation of the network layer packet; it would be "protocol-
   independent" such that it could be used to encapsulate any network
   layer.  Hence we will refer to it as the "generic MPLS
   encapsulation".

   The generic MPLS encapsulation would in turn be encapsulated in a
   data link layer protocol.

   The generic MPLS encapsulation should contain the following fields:

      1. the label stack,

      2. a Time-to-Live (TTL) field

      3. a Class of Service (CoS) field

   The TTL field permits MPLS to provide a TTL function similar to what
   is provided by IP.

   The CoS field permits LSRs to apply various scheduling packet
   disciplines to labeled packets, without requiring separate labels for
   separate disciplines.

   This section is not intended to rule out the use of alternative
   mechanisms in network environments where such alternatives may be
   appropriate.

2.21.2. ATM Switches as LSRs

   It will be noted that MPLS forwarding procedures are similar to those
   of legacy "label swapping" switches such as ATM switches. ATM
   switches use the input port and the incoming VPI/VCI value as the
   index into a "cross-connect" table, from which they obtain an output
   port and an outgoing VPI/VCI value.  Therefore if one or more labels
   can be encoded directly into the fields which are accessed by these
   legacy switches, then the legacy switches can, with suitable software
   upgrades, be used as LSRs.  We will refer to such devices as "ATM-
   LSRs".

   There are three obvious ways to encode labels in the ATM cell header
   (presuming the use of AAL5):

      1. SVC Encoding

         Use the VPI/VCI field to encode the label which is at the top
         of the label stack.  This technique can be used in any network.
         With this encoding technique, each LSP is realized as an ATM
         SVC, and the LDP becomes the ATM "signaling" protocol.  With
         this encoding technique, the ATM-LSRs cannot perform "push" or
         "pop" operations on the label stack.

      2. SVP Encoding

          Use the VPI field to encode the label which is at the top of
          the label stack, and the VCI field to encode the second label
          on the stack, if one is present.  This technique has some
          advantages over the previous one, in that it permits the use
          of ATM "VP-switching".  That is, the LSPs are realized as ATM
          SVPs, with LDP serving as the ATM signaling protocol.

         However, this technique cannot always be used.  If the network
         includes an ATM Virtual Path through a non-MPLS ATM network,
         then the VPI field is not necessarily available for use by
         MPLS.

         When this encoding technique is used, the ATM-LSR at the egress
         of the VP effectively does a "pop" operation.

      3. SVP Multipoint Encoding

         Use the VPI field to encode the label which is at the top of
         the label stack, use part of the VCI field to encode the second
         label on the stack, if one is present, and use the remainder of
         the VCI field to identify the LSP ingress.  If this technique
         is used, conventional ATM VP-switching capabilities can be used
         to provide multipoint-to-point VPs.  Cells from different
         packets will then carry different VCI values, so multipoint-
         to-point VPs can be provided without any cell interleaving
         problems.

         This technique depends on the existence of a capability for
         assigning small unique values to each ATM switch.

   If there are more labels on the stack than can be encoded in the ATM
   header, the ATM encodings must be combined with the generic
   encapsulation.  This does presuppose that it be possible to tell,
   when reassembling the ATM cells into packets, whether the generic
   encapsulation is also present.
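
   A sketch of the three encodings follows (illustrative only; the
   field packing, field widths, and the split of the VCI between the
   second label and the ingress identifier are simplifying
   assumptions):

      def atm_encode(stack, technique, ingress_id=0):
          top = stack[0]
          second = stack[1] if len(stack) > 1 else 0
          if technique == "SVC":
              # Top label carried in the combined VPI/VCI field.
              return {"vpi_vci": top}
          if technique == "SVP":
              # Top label in the VPI, second label (if any) in the VCI.
              return {"vpi": top, "vci": second}
          if technique == "SVP-multipoint":
              # Top label in the VPI; the VCI carries the second label
              # in its high-order bits and the LSP ingress identifier
              # in its low-order bits, so cells of different packets
              # carry different VCI values.
              return {"vpi": top, "vci": (second << 8) | ingress_id}
          raise ValueError(technique)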

2.21.3. Interoperability among Encoding Techniques

   If <R1, R2, R3> is a segment of an LSP, it is possible that R1 will
   use one encoding of the label stack when transmitting packet P to
   R2, but R2 will use a different encoding when transmitting packet P
   to R3.  In general, the MPLS architecture supports LSPs with
   different
   label stack encodings used on different hops.  Therefore, when we
   discuss the procedures for processing a labeled packet, we speak in
   abstract terms of operating on the packet's label stack. When a
   labeled packet is received, the LSR must decode it to determine the
   current value of the label stack, then must operate on the label
   stack to determine the new value of the stack, and then encode the
   new value appropriately before transmitting the labeled packet to its
   next hop.

   Unfortunately, ATM switches have no capability for translating from
   one encoding technique to another.  The MPLS architecture therefore
   requires that whenever it is possible for two ATM switches to be
   successive LSRs along a level m LSP for some packet, that those two
   ATM switches use the same encoding technique.

   Naturally there will be MPLS networks which contain a combination of
   ATM switches operating as LSRs, and other LSRs which operate using an
   MPLS shim header. In such networks there may be some LSRs which have
   ATM interfaces as well as "MPLS Shim" interfaces. This is one example
   of an LSR with different label stack encodings on different hops.
   Such an LSR may swap off an ATM encoded label stack on an incoming
   interface and replace it with an MPLS shim header encoded label stack
   on the outgoing interface.

2.22. Multicast

   This section is for further study.

3. Some Applications of MPLS

3.1. MPLS and Hop by Hop Routed Traffic

   One use of MPLS is to simplify the process of forwarding packets
   using hop by hop routing.

3.1.1. Labels for Address Prefixes

   In general, router R determines the next hop for packet P by finding
   the address prefix X in its routing table which is the longest match
   for P's destination address.  That is, the packets in a given Stream
   are just those packets which match a given address prefix in R's
   routing table. In this case, a Stream can be identified with an
   address prefix.

   If packet P must traverse a sequence of routers, and at each router
   in the sequence P matches the same address prefix, MPLS simplifies
   the forwarding process by enabling all routers but the first to avoid
   executing the best match algorithm; they need only look up the label.
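
   The simplification can be seen by contrasting the two lookups in the
   following sketch (illustrative only; the tables and values are
   invented):

      import ipaddress

      routes = {"10.0.0.0/8": "hopA", "10.1.0.0/16": "hopB"}

      def conventional_forward(dst):
          # Longest (best) match over every prefix in the table.
          nets = [p for p in routes
                  if ipaddress.ip_address(dst) in ipaddress.ip_network(p)]
          best = max(nets,
                     key=lambda p: ipaddress.ip_network(p).prefixlen)
          return routes[best]

      ilm = {17: ("hopB", 42)}  # incoming label -> (next hop, out label)

      def label_forward(label):
          # A single exact-match lookup replaces the best-match search.
          return ilm[label]

      assert conventional_forward("10.1.2.3") == "hopB"
      assert label_forward(17) == ("hopB", 42)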

3.1.2. Distributing Labels for Address Prefixes

3.1.2.1. LDP Peers for a Particular Address Prefix

   LSRs R1 and R2 are considered to be LDP Peers for address prefix X if
   and only if one of the following conditions holds:

      1. R1's route to X is a route which it learned about via a
         particular instance of a particular IGP, and R2 is a neighbor
         of R1 in that instance of that IGP

      2. R1's route to X is a route which it learned about by some
         instance of routing algorithm A1, and that route is
         redistributed into an instance of routing algorithm A2, and R2
         is a neighbor of R1 in that instance of A2

      3. R1 is the receive endpoint of an LSP Tunnel that is within
         another LSP, and R2 is a transmit endpoint of that tunnel, and
         R1 and R2 are participants in a common instance of an IGP, and
         are in the same IGP area (if the IGP in question has areas),
         and R1's route to X was learned via that IGP instance, or is
         redistributed by R1 into that IGP instance

      4. R1's route to X is a route which it learned about via BGP, and
         R2 is a BGP peer of R1

   In general, these rules ensure that if the route to a particular
   address prefix is distributed via an IGP, the LDP peers for that
   address prefix are the IGP neighbors.  If the route to a particular
   address prefix is distributed via BGP, the LDP peers for that address
   prefix are the BGP peers.  In other cases of LSP tunneling, the
   tunnel endpoints are LDP peers.

3.1.2.2. Distributing Labels

   In order to use MPLS for the forwarding of normally routed traffic,
   each LSR MUST:

      1. bind one or more labels to each address prefix that appears in
         its routing table;

      2. for each such address prefix X, use an LDP to distribute the
         mapping of a label to X to each of its LDP Peers for X.

   There is also one circumstance in which an LSR must distribute a
   label mapping for an address prefix, even if it is not the LSR which
   bound that label to that address prefix:

      3. If R1 uses BGP to distribute a route to X, naming some other
         LSR R2 as the BGP Next Hop to X, and if R1 knows that R2 has
         assigned label L to X, then R1 must distribute the mapping
         between L and X to any BGP peer to which it distributes that
         route.

   These rules ensure that labels corresponding to address prefixes
   which correspond to BGP routes are distributed to IGP neighbors if
   and only if the BGP routes are distributed into the IGP.  Otherwise,
   the labels bound to BGP routes are distributed only to the other BGP
   speakers.

   These rules are intended to indicate which label mappings must be
   distributed by a given LSR to which other LSRs, NOT to indicate the
   conditions under which the distribution is to be made.  That is
   discussed in section 2.17.

3.1.3. Using the Hop by Hop path as the LSP

   If the hop-by-hop path that packet P needs to follow is <R1, ...,
   Rn>, then <R1, ..., Rn> can be an LSP as long as:

      1. there is a single address prefix X, such that, for all i,
         1<=i<=n, X is the longest match in Ri's routing table for P's
         destination address.

   Consider, for example, packets P1 and P2, each of whose destination
   addresses has X as its longest match.  Suppose that the Hop-by-hop
   path for P1 is <R1, R2, R3>, and the Hop-by-hop path for P2 is
   <R4, R2, R3>.  Let's suppose that R3 binds label L3 to X, and
   distributes this mapping to R2.  R2 binds label L2 to X, and
   distributes this mapping to both R1 and R4.  When R2 receives packet
   P1, its incoming label will be L2.  R2 will overwrite L2 with L3,
   and send P1 to R3.  When R2 receives packet P2, its incoming label
   will also be L2.  R2 again overwrites L2 with L3, and sends P2 on to
   R3.

   Note then that when P1 and P2 are traveling from R2 to R3, they
   carry the same label, and as far as MPLS is concerned, they cannot
   be distinguished.  Thus instead of talking about two distinct LSPs,
   <R1, R2, R3> and <R4, R2, R3>, we might talk of a single
   "Multipoint-to-Point LSP", which we might denote as
   <{R1, R4}, R2, R3>.

   This creates a difficulty when we attempt to use conventional ATM
   switches as LSRs.  Since conventional ATM switches do not support
   multipoint-to-point connections, there must be procedures to ensure
   that each LSP is realized as a point-to-point VC.  However, if ATM
   switches which do support multipoint-to-point VCs are in use, then
   the LSPs can be most efficiently realized as multipoint-to-point VCs.
   Alternatively, if the SVP Multipoint Encoding (section 2.21.2) can
   be used, the LSPs can be realized as multipoint-to-point SVPs.

3.6. LSP Tunneling between BGP Border Routers

   Consider the case of an Autonomous System, A, which carries transit
   traffic between other Autonomous Systems. Autonomous System A will
   have a number of BGP Border Routers, and a mesh of BGP connections
   among them, over which BGP routes are distributed. In many such
   cases, it is desirable to avoid distributing the BGP routes to
   routers which are not BGP Border Routers.  If this can be avoided,
   the "route distribution load" on those routers is significantly
   reduced. However, there must be some means of ensuring that the
   transit traffic will be delivered from Border Router to Border Router
   by the interior routers.

   This can easily be done by means of LSP Tunnels. Suppose that BGP
   routes are distributed only to BGP Border Routers, and not to the
   interior routers that lie along the Hop-by-hop path from Border
   Router to Border Router. LSP Tunnels can then be used as follows:

      1. Each BGP Border Router distributes, to every other BGP Border
         Router in the same Autonomous System, a label for each address
         prefix that it distributes to that router via BGP.

      2. The IGP for the Autonomous System maintains a host route for
         each BGP Border Router. Each interior router distributes its
         labels for these host routes to each of its IGP neighbors.

      3. Suppose that:

            a) BGP Border Router B1 receives an unlabeled packet P,

            b) address prefix X in B1's routing table is the longest
               match for the destination address of P,

            c) the route to X is a BGP route,

            d) the BGP Next Hop for X is B2,

            e) B2 has bound label L1 to X, and has distributed this
               mapping to B1,

            f) the IGP next hop for the address of B2 is I1,

            g) the address of B2 is in B1's and I1's IGP routing tables
               as a host route, and

            h) I1 has bound label L2 to the address of B2, and
               distributed this mapping to B1.

         Then before sending packet P to I1, B1 must create a label
         stack for P, then push on label L1, and then push on label L2.

      4. Suppose that BGP Border Router B1 receives a labeled Packet P,
         where the label on the top of the label stack corresponds to an
         address prefix, X, to which the route is a BGP route, and that
         conditions 3b, 3c, 3d, and 3e all hold. Then before sending
         packet P to I1, B1 must replace the label at the top of the
         label stack with L1, and then push on label L2.

   With these procedures, a given packet P follows a level 1 LSP all of
   whose members are BGP Border Routers, and between each pair of BGP
   Border Routers in the level 1 LSP, it follows a level 2 LSP.

   These procedures effectively create a Hop-by-Hop Routed LSP Tunnel
   between the BGP Border Routers.
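
   B1's label stack manipulation in steps 3 and 4 above amounts to the
   following sketch (illustrative only; L1 and L2 are the labels named
   in the steps):

      def b1_forward(label_stack, l1, l2):
          # Step 4: a labeled packet has the label at the top of its
          # stack replaced by L1 (B2's label for prefix X); step 3: an
          # unlabeled packet simply has L1 pushed on.
          if label_stack:
              label_stack[-1] = l1
          else:
              label_stack.append(l1)
          # In both cases L2 (I1's label for the host route to B2) is
          # then pushed on, and the packet is sent to I1.
          label_stack.append(l2)
          return label_stack

      assert b1_forward([], "L1", "L2") == ["L1", "L2"]
      assert b1_forward(["X"], "L1", "L2") == ["L1", "L2"]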

   Since the BGP border routers are exchanging label mappings for
   address prefixes that are not even known to the IGP routing, the BGP
   routers should become explicit LDP peers with each other.

3.7. Other Uses of Hop-by-Hop Routed LSP Tunnels

   The use of Hop-by-Hop Routed LSP Tunnels is not restricted to tunnels
   between BGP Next Hops. Any situation in which one might otherwise
   have used an encapsulation tunnel is one in which it is appropriate
   to use a Hop-by-Hop Routed LSP Tunnel. Instead of encapsulating the
   packet with a new header whose destination address is the address of
   the tunnel's receive endpoint, the label corresponding to the address
   prefix which is the longest match for the address of the tunnel's
   receive endpoint is pushed on the packet's label stack. The packet
   which is sent into the tunnel may or may not already be labeled.

   If the transmit endpoint of the tunnel wishes to put a labeled packet
   into the tunnel, it must first replace the label value at the top of
   the stack with a label value that was distributed to it by the
   tunnel's receive endpoint.  Then it must push on the label which
   corresponds to the tunnel itself, as distributed to it by the next
   hop along the tunnel.  To allow this, the tunnel endpoints should be
   explicit LDP peers. The label mappings they need to exchange are of
   no interest to the LSRs along the tunnel.

3.8. MPLS and Multicast

   Multicast routing proceeds by constructing multicast trees. The tree
   along which a particular multicast packet must get forwarded depends
   in general on the packet's source address and its destination
   address.  Whenever a particular LSR is a node in a particular
   multicast tree, it binds a label to that tree.  It then distributes
   that mapping to its parent on the multicast tree.  (If the node in
   question is on a LAN, and has siblings on that LAN, it must also
   distribute the mapping to its siblings.  This allows the parent to
   use a single label value when multicasting to all children on the
   LAN.)

   When a multicast labeled packet arrives, the NHLFE corresponding to
   the label indicates the set of output interfaces for that packet, as
   well as the outgoing label. If the same label encoding technique is
   used on all the outgoing interfaces, the very same packet can be sent
   to all the children.

4. LDP Procedures

   This section is FFS.

5. Security Considerations

   Security considerations are not discussed in this version of this
   draft.

                                                           Eric C. Rosen
                                                           Yakov Rekhter
Expiration Date: December 1997                             Daniel Tappan
                                                          Dino Farinacci
                                                            Guy Fedorkow
                                                     Cisco Systems, Inc.

                                                                 Tony Li
                                                  Juniper Networks, Inc.

                                                               June 1997

                 Label Switching: Label Stack Encodings

                      draft-rosen-tag-stack-02.txt

Abstract

   "Multi-Protocol Label Switching (MPLS)" [1,2] requires a set of
   procedures for augmenting network layer packets with "label stacks"
   (sometimes called "label stacks"), thereby turning them into "labeled
   packets".  Routers which support MPLS are known as "Label Switching
   Routers", or "LSRs".  In order to transmit a labeled packet on a
   particular data link, an LSR must support an encoding technique
   which, given a label stack and a network layer packet, produces a
   labeled packet.  This document specifies the encoding to be used by
   an LSR in order to transmit labeled packets on PPP data links and on
   LAN data links.  This document also specifies rules and procedures
   for processing the various fields of the label stack encoding.

Table of Contents

    1      Introduction
    1.1    Specification of Requirements
    2      The Label Stack
    2.1    Encoding the Label Stack
    2.2    Determining the Network Layer Protocol
    2.3    Processing the Time to Live Field
    2.3.1  Definitions
    2.3.2  Protocol-independent rules
    2.3.3  IP-dependent rules
    3      Fragmentation and Path MTU Discovery
    3.1    Terminology
    3.2    Maximum Initially Labeled IP Datagram Size
    3.3    When are Labeled IP Datagrams Too Big?
    3.4    Processing Labeled IP Datagrams which are Too Big
    3.5    Implications with respect to Path MTU Discovery
    3.5.1  Tunneling through a Transit Routing Domain
    3.5.2  Tunneling Private Addresses through a Public Backbone
    4      Transporting Labeled Packets over PPP
    4.1    Introduction
    4.2    A PPP Network Control Protocol for MPLS
    4.3    Sending Labeled Packets
    4.4    Label Switching Control Protocol Configuration Options
    5      Transporting Labeled Packets over LAN Media
    6      Security Considerations
    7      Authors' Addresses
    8      References

1. Introduction

   "Multi-Protocol Label Switching (MPLS)" [1,2] requires a set of
   procedures for augmenting network layer packets with "label stacks"
   (sometimes called "label stacks"), thereby turning them into "labeled
   packets".  Routers which support MPLS are known as "Label Switching
   Routers", or "LSRs".  In order to transmit a labeled packet on a
   particular data link, an LSR must support an encoding technique
   which, given a label stack and a network layer packet, produces a
   labeled packet.

   This document specifies the encoding to be used by an LSR in order to
   transmit labeled packets on PPP data links and on LAN data links.

   This document also specifies rules and procedures for processing the
   various fields of the label stack encoding.  Since MPLS is
   independent of any particular network layer protocol, the majority of
   such procedures are also protocol-independent.  A few, however, do
   differ for different protocols.  In this document, we specify the
   protocol-independent procedures, and we specify the protocol-
   dependent procedures for IPv4.

   LSRs that are implemented on certain switching devices (such as ATM
   switches) may use different encoding techniques for encoding the top
   one or two entries of the label stack.  When the label stack has
   additional entries, however, the encoding technique described in this
   document may be used for the additional label stack entries.

1.1. Specification of Requirements

   In this document, several words are used to signify the requirements
   of the specification.  These words are often capitalized.

        MUST

        This word, or the adjective "required", means that the
        definition is an absolute requirement of the specification.

        MUST NOT

        This phrase means that the definition is an absolute prohibition
        of the specification.

        SHOULD

        This word, or the adjective "recommended", means that there may
        exist valid reasons in particular circumstances to ignore this
        item, but the full implications must be understood and carefully
        weighed before choosing a different course.

        MAY

        This word, or the adjective "optional", means that this item is
        one of an allowed set of alternatives.  An implementation which
        does not include this option MUST be prepared to interoperate
        with another implementation which does include the option.

2. The Label Stack

2.1. Encoding the Label Stack

   On both PPP and LAN data links, the label stack is represented as a
   sequence of "label stack entries".  Each label stack entry is
   represented by 4 octets.  This is shown in Figure 1.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ Label
   |                Label                  | CoS |S|       TTL     | Stack
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ Entry

                       Label:  Label Value, 20 bits
                       CoS:    Class of Service, 3 bits
                       S:      Bottom of Stack, 1 bit
                       TTL:    Time to Live, 8 bits

                                 Figure 1
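
   As an illustrative sketch only (Python; the field layout follows
   Figure 1, but the function names are our own), a label stack entry
   might be packed into, and unpacked from, its 4-octet encoding as
   follows:

      import struct

      def pack_entry(label, cos, s, ttl):
          # Label: 20 bits, CoS: 3 bits, S: 1 bit, TTL: 8 bits
          assert 0 <= label < (1 << 20) and 0 <= cos < 8
          assert s in (0, 1) and 0 <= ttl < 256
          word = (label << 12) | (cos << 9) | (s << 8) | ttl
          return struct.pack("!I", word)        # network byte order

      def unpack_entry(octets):
          (word,) = struct.unpack("!I", octets)
          return ((word >> 12) & 0xFFFFF,       # Label
                  (word >> 9) & 0x7,            # CoS
                  (word >> 8) & 0x1,            # S (bottom of stack)
                  word & 0xFF)                  # TTL

   For example, unpack_entry(pack_entry(17, 0, 1, 64)) yields
   (17, 0, 1, 64); an entry whose S bit is 1 is the bottom of the stack
   and is immediately followed by the network layer packet.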

   The label stack entries appear AFTER the data link layer headers, but
   BEFORE any network layer headers.  The top of the label stack appears
   earliest in the packet, and the bottom appears latest.  The network
   layer packet immediately follows the label stack entry which has the
   S bit set.

   Each label stack entry is broken down into the following fields:

      1. Bottom of Stack (S)

         This bit is set to one for the last entry in the label stack
         (i.e., for the bottom of the stack), and zero for all other
         label stack entries.

      2. Time to Live (TTL)

         This eight-bit field is used to encode a time-to-live value.
         The processing of this field is described in section 2.3.

      3. Class of Service (CoS)

         This three-bit field is used to identify a "Class of Service".
         The setting of this field is intended to affect the scheduling
         and/or discard algorithms which are applied to the packet as it
         is transmitted through the network.

         When an unlabeled packet is initially labeled, the value
         assigned to the CoS field in the label stack entry is
         determined by policy.  Some possible policies are:

           - the CoS value is a function of the IP ToS value

           - the CoS value is a function of the packet's input interface

           - the CoS value is a function of the "flow type"

         Of course, many other policies are also possible.

         When an additional label is pushed onto the stack of a packet
         that is already labeled:

           - in general, the value of the CoS field in the new top stack
             entry should be equal to the value of the CoS field of the
             old top stack entry;

           - however, in some cases, most likely at boundaries between
             network service providers, the value of the CoS field in
             the new top stack entry may be determined by policy.

      4. Label Value

         This 20-bit field carries the actual value of the Label.

         When a labeled packet is received, the label value at the top
         of the stack is looked up.  As a result of this lookup one
         learns:

            (a) information needed to forward the packet, such as the
                next hop and the outgoing data link encapsulation;
                however, the precise queue to put the packet on, or
                information as to how to schedule the packet, may be a
                function of both the label value AND the CoS field
                value;

            (b) the operation to be performed on the label stack before
                forwarding; this operation may be to replace the top
                label stack entry with another, or to pop an entry off
                the label stack, or to replace the top label stack entry
                and then to push one or more additional entries on the
                label stack.

         There are several reserved label values:

              i. A value of 0 represents the "IPv4 Explicit NULL Label".
                 This label value is only legal when it is the sole
                 label stack entry.  It indicates that the label stack
                 must be popped, and the forwarding of the packet must
                 then be based on the IPv4 header.

             ii. A value of 1 represents the "Router Alert Label".  This
                 label value is legal anywhere in the label stack except
                 at the bottom.  When a received packet contains this
                 label value at the top of the label stack, it is
                 delivered to a local software module for processing.
                 The actual forwarding of the packet is determined by
                 the label beneath it in the stack.  However, if the
                 packet is forwarded further, the Router Alert Label
                 should be pushed back onto the label stack before
                 forwarding.  The use of this label is analogous to the
                 use of the "Router Alert Option" in IP packets [16].
                 Since this label cannot occur at the bottom of the
                 stack, it is not associated with a particular network
                 layer protocol.

            iii. A value of 2 represents the "IPv6 Explicit NULL Label".
                 This label value is only legal when it is the sole
                 label stack entry.  It indicates that the label stack
                 must be popped, and the forwarding of the packet must
                 then be based on the IPv6 header.

             iv. Values 3-16 are reserved.

         We must also discuss the "Implicit NULL Label".  This is a
         label that an LSR may assign and distribute, but which never
         actually appears in the encapsulation.  When an LSR would
         otherwise replace the label at the top of the stack with a new
         label, but the new label is "Implicit NULL", the LSR will pop
         the stack instead of doing the replacement.
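
   As a minimal sketch of these stack operations (Python, with the top
   of the stack at index 0 of a list; the IMPLICIT_NULL sentinel is our
   own device, since that label never appears in the encapsulation):

      IMPLICIT_NULL = object()      # distributed, but never encoded

      def pop(stack):
          return stack[1:]

      def swap(stack, new_label):
          # Replace the top entry; if the new label is Implicit NULL,
          # pop the stack instead of doing the replacement.
          if new_label is IMPLICIT_NULL:
              return stack[1:]
          return [new_label] + stack[1:]

      def swap_and_push(stack, new_label, pushed):
          # Replace the top entry, then push additional entries on top.
          return list(pushed) + swap(stack, new_label)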

2.2. Determining the Network Layer Protocol

   When the last label is popped from the label stack, it is necessary
   to determine the particular network layer protocol which is being
   carried.  Note that the label stack entries carry no explicit field
   to identify the network layer header.  Rather, this must be inferable
   from the value of the label which is popped from the bottom of the
   stack.  This means that when the first label is pushed onto a network
   layer packet, the label must be one which is used ONLY for packets of
   a particular network layer.  Furthermore, whenever that label is
   replaced by another label value during a packet's transit, the new
   value must also be one which is used only for packets of that network
   layer.

2.3. Processing the Time to Live Field

2.3.1. Definitions

   The "incoming TTL" of a labeled packet is defined to be the value of
   the TTL field of the top label stack entry when the packet is
   received.

   The "outgoing TTL" of a labeled packet is defined to be the larger
   of:

      (a) one less than the incoming TTL,
      (b) zero.

2.3.2. Protocol-independent rules

   If the outgoing TTL of a labeled packet is 0, then the labeled packet
   MUST NOT be further forwarded; the packet's lifetime in the network
   is considered to have expired.

   Depending on the label value in the label stack entry, the packet MAY
   be silently discarded, or the packet MAY have its label stack
   stripped off, and passed as an unlabeled packet to the ordinary
   processing for network layer packets which have exceeded their
   maximum lifetime in the network.  However, even if the label stack is
   stripped, the packet MUST NOT be further forwarded.

   When a labeled packet is forwarded, the TTL field of the label stack
   entry at the top of the label stack must be set to the outgoing TTL
   value.

   Note that the outgoing TTL value is a function solely of the incoming
   TTL value, and is independent of whether any labels are pushed or
   popped before forwarding.  There is no significance to the value of
   the TTL field in any label stack entry which is not at the top of the
   stack.

2.3.3. IP-dependent rules

   When an IP packet is first labeled, the TTL field of the label stack
   entry MUST BE set to the value of the IP TTL field.  (If the IP TTL
   field needs to be decremented, as part of the IP processing, it is
   assumed that this has already been done.)

   When a label is popped, and the resulting label stack is empty, then
   the value of the IP TTL field MUST BE replaced with the outgoing TTL
   value, as defined above.  Note that, in IPv4, this will also require
   modification of the IP header checksum.
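
   The rules of sections 2.3.2 and 2.3.3 can be summarized in a short
   sketch (Python; recompute_ipv4_checksum is a hypothetical helper
   standing in for the usual checksum procedure):

      def outgoing_ttl(incoming_ttl):
          # Protocol-independent rule: one less than the incoming TTL,
          # but never below zero.  A packet whose outgoing TTL is 0
          # MUST NOT be forwarded further.
          return max(incoming_ttl - 1, 0)

      def pop_last_label(ip_header, incoming_ttl):
          # IP-dependent rule: when the stack becomes empty, the IP TTL
          # field is replaced with the outgoing TTL value; for IPv4 the
          # header checksum must then be recomputed.
          ip_header["ttl"] = outgoing_ttl(incoming_ttl)
          ip_header["checksum"] = recompute_ipv4_checksum(ip_header)
          return ip_header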

3. Fragmentation and Path MTU Discovery

   Just as it is possible to receive an unlabeled IP datagram which is
   too large to be transmitted on its output link, it is possible to
   receive a labeled packet which is too large to be transmitted on its
   output link.

   It is also possible that a received packet (labeled or unlabeled)
   which was originally small enough to be transmitted on that link
   becomes too large by virtue of having one or more additional labels
   pushed onto its label stack.  Thus if one receives a labeled packet
   with a 1500-byte frame payload, and pushes on an additional label,
   one needs to forward it as a frame with a 1504-byte payload.

   This section specifies the rules for processing labeled packets which
   are "too large".  In particular, it provides rules which ensure that
   hosts implementing RFC 1191 Path MTU Discovery will be able to
   generate IP datagrams that do not need fragmentation, even if they
   get labeled as they traverse the network.

   In general, hosts which do not implement RFC 1191 Path MTU Discovery
   send IP datagrams which contain no more than 576 bytes.  Since the
   MTUs in use on most data links today are 1500 bytes or more, the
   probability that such datagrams will need to get fragmented, even if
   they get labeled, is very small.

   Some hosts that do not implement RFC 1191 Path MTU Discovery will
   generate IP datagrams containing 1500 bytes, as long as the IP Source
   and Destination addresses are on the same subnet.  These datagrams
   will not pass through routers, and hence will not get fragmented.

   Unfortunately, some hosts will generate IP datagrams containing 1500
   bytes, as long as the IP Source and Destination addresses do not have
   the same classful network number.  This is the one case in which
   there is significant risk of fragmentation when such datagrams get
   labeled.

   This document specifies procedures which allow one to configure the
   network so that large datagrams from hosts which do not implement
   Path MTU Discovery get fragmented just once, when they are first
   labeled.  These procedures make it possible (assuming suitable
   configuration) to avoid any need to fragment packets which have
   already been labeled.

3.1. Terminology

   With respect to a particular data link, we can use the following
   terms:

     - Frame Payload:

       The contents of a data link frame, excluding any data link layer
       headers or trailers (e.g., MAC headers, LLC headers, 802.1Q or
       802.1p headers, PPP header, frame check sequences, etc.).

        When a frame is carrying an unlabeled IP datagram, the Frame
       Payload is just the IP datagram itself.  When a frame is carrying
       a labeled IP datagram, the Frame Payload consists of the label
       stack entries and the IP datagram.

     - Conventional Maximum Frame Payload Size:

       The maximum Frame Payload size allowed by data link standards.
       For example, the Conventional Maximum Frame Payload Size for
       ethernet is 1500 bytes.

     - True Maximum Frame Payload Size:

       The maximum size frame payload which can be sent and received
       properly by the interface hardware attached to the data link.

       On ethernet and 802.3 networks, it is believed that the True
       Maximum Frame Payload Size is 4-8 bytes larger than the
        Conventional Maximum Frame Payload Size (as long as neither an
       802.1Q header nor an 802.1p header is present, and as long as
       neither can be added by a switch or bridge while a packet is in
       transit to its next hop).  For example, it is believed that most
       ethernet equipment could correctly send and receive packets
       carrying a payload of 1504 or perhaps even 1508 bytes, at least,
       as long as the ethernet header does not have an 802.1Q or 802.1p
       field.

       On PPP links, the True Maximum Frame Payload Size may be
       virtually unbounded.

     - Effective Maximum Frame Payload Size for Labeled Packets:

        This is either the Conventional Maximum Frame Payload Size or
       the True Maximum Frame Payload Size, depending on the
       capabilities of the equipment on the data link and the size of
       the ethernet header being used.

     - Initially Labeled IP Datagram

       Suppose that an unlabeled IP datagram is received at a particular
        LSR, and that the LSR pushes on a label before forwarding the
       datagram.  Such a datagram will be called an Initially Labeled IP
       Datagram at that LSR.

     - Previously Labeled IP Datagram

       An IP datagram which had already been labeled before it was
       received by a particular LSR.

3.2. Maximum Initially Labeled IP Datagram Size

   Every LSR which is capable of

      (a) receiving an unlabeled IP datagram,
      (b) adding a label stack to the datagram, and
      (c) forwarding the resulting labeled packet,

   MUST support a configuration parameter known as the "Maximum IP
   Datagram Size for Labeling", which can be set to a non-negative
   value.

   If this configuration parameter is set to zero, it has no effect.

   If it is set to a positive value, it is used in the following way.
   If:
      (a) an unlabeled IP datagram is received, and
      (b) that datagram does not have the DF bit set in its IP header,
          and
      (c) that datagram needs to be labeled before being forwarded, and
      (d) the size of the datagram (before labeling) exceeds the value
          of the parameter,
   then
      (a) the datagram must be broken into fragments, each of whose size
          is no greater than the value of the parameter, and
      (b) each fragment must be labeled and then forwarded.

   If this configuration parameter is set to a value of 1488, for
   example, then any unlabeled IP datagram containing more than 1488
   bytes will be fragmented before being labeled.  Each fragment will be
   capable of being carried on a 1500-byte data link, without further
   fragmentation, even if as many as three labels are pushed onto its
   label stack.

   In other words, setting this parameter to a non-zero value allows one
   to eliminate all fragmentation of Previously Labeled IP Datagrams,
   but it may cause some unnecessary fragmentation of Initially Labeled
   IP Datagrams.

   Note that the parameter has no effect on IP Datagrams that have the
   DF bit set, which means that it has no effect on Path MTU Discovery.
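
   As a sketch of how this parameter might be applied (Python;
   illustrative only, with the 1488-byte setting of the example above):

      # Room for three 4-byte labels on a 1500-byte data link:
      MAX_IP_DATAGRAM_SIZE_FOR_LABELING = 1500 - 3 * 4   # = 1488

      def must_fragment_before_labeling(datagram_len, df_bit,
              max_size=MAX_IP_DATAGRAM_SIZE_FOR_LABELING):
          # A zero setting disables the check, and datagrams with the
          # DF bit set are never affected (preserving Path MTU
          # Discovery).
          if max_size == 0 or df_bit:
              return False
          return datagram_len > max_size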

3.3. When are Labeled IP Datagrams Too Big?

   A labeled IP datagram whose size exceeds the Conventional Maximum
   Frame Payload Size of the data link over which it is to be forwarded
   MAY be considered to be "too big".

   A labeled IP datagram whose size exceeds the True Maximum Frame
   Payload Size of the data link over which it is to be forwarded MUST
   be considered to be "too big".

   A labeled IP datagram which is not "too big" MUST be transmitted
   without fragmentation.

3.4. Processing Labeled IP Datagrams which are Too Big

   If a labeled IP datagram is "too big", and the DF bit is not set in
   its IP header, then the LSR MAY discard the datagram.

   Note that discarding such datagrams is a sensible procedure only if
   the "Maximum IP Datagram Size for Labeling" parameter is set to a
   non-zero value in every LSR in the network which is capable of adding
   a label stack to an unlabeled IP datagram.

   If the LSR chooses not to discard a labeled IP datagram which is too
   big, or if the DF bit is set in that datagram, then it MUST execute
   the following algorithm:

      1. Strip off the label stack entries to obtain the IP datagram.

      2. Let N be the number of bytes in the label stack (i.e., 4 times
         the number of label stack entries).

      3. If the IP datagram does NOT have the "Don't Fragment" bit set
         in its IP header:

            a. convert it into fragments, each of which MUST be at least
               N bytes less than the Effective Maximum Frame Payload
               Size.

            b. Prepend each fragment with the same label header that
               would have been on the original datagram had
               fragmentation not been necessary.

            c. Forward the fragments.

      4. If the IP datagram has the "Don't Fragment" bit set in its IP
         header:

            a. the datagram MUST NOT be forwarded

            b. Create an ICMP Destination Unreachable Message:

                    i. set its Code field (RFC 792) to "Fragmentation
                       Required and DF Set",

                   ii. set its Next-Hop MTU field (RFC 1191) to the
                       difference between the Effective Maximum Frame
                       Payload Size and the value of N

            c. If possible, transmit the ICMP Destination Unreachable
                Message to the source of the discarded datagram.
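
   The algorithm above might be sketched as follows (Python;
   fragment() and send_icmp_unreachable() are hypothetical helpers
   standing in for the usual IP fragmentation and ICMP machinery):

      def handle_too_big(stack, datagram, effective_max_payload):
          n = 4 * len(stack)                # bytes of label header
          if not datagram.df:
              # Each fragment must leave room for the label header.
              frags = fragment(datagram, effective_max_payload - n)
              return [(list(stack), f) for f in frags]  # re-label each
          # DF set: discard, and report the usable MTU to the source.
          send_icmp_unreachable(datagram.src,
                                code="Fragmentation Required and DF Set",
                                next_hop_mtu=effective_max_payload - n)
          return []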

3.5. Implications with respect to Path MTU Discovery

   The procedures described above for handling datagrams which have the
   DF bit set, but which are "too large", have an impact on the Path MTU
   Discovery procedures of RFC 1191.  Hosts which implement these
   procedures will discover an MTU which is small enough to allow n
   labels to be pushed on the datagrams, without need for fragmentation,
   where n is the number of labels that actually get pushed on along the
   path currently in use.

   In other words, datagrams from hosts that use Path MTU Discovery will
   never need to be fragmented due to the need to put on a label header,
   or to add new labels to an existing label header.  (Also, datagrams
   from hosts that use Path MTU Discovery generally have the DF bit set,
   and so will never get fragmented anyway.)

   However, note that Path MTU Discovery will only work properly if, at
   the point where a labeled IP Datagram's fragmentation needs to occur,
   it is possible to route to the packet's source address.  If this is
   not possible, then the ICMP Destination Unreachable message cannot be
   sent to the source.

3.5.1. Tunneling through a Transit Routing Domain

   Suppose one is using MPLS to "tunnel" through a transit routing
   domain, where the external routes are not leaked into the domain's
   interior routers.  If a packet needs fragmentation at some router
   within the domain, and the packet's DF bit is set, it is necessary to
   be able to originate an ICMP message at that router and have it
   routed correctly to the source of the fragmented packet.  If the
   packet's source address is an external address, this poses a problem.

   Therefore, in order for Path MTU Discovery to work, any routing
   domain in which external routes are not leaked into the interior
   routers MUST have a default route which causes all packets carrying
   external destination addresses to be sent to a border router.  For
   example, one of the border routers may inject "default" into the IGP.

3.5.2. Tunneling Private Addresses through a Public Backbone

   In other cases where MPLS is used to tunnel through a routing domain,
   it may not be possible to route to the source address of a fragmented
   packet at all.  This would be the case, for example, if the IP
   addresses carried in the packet were private addresses, and MPLS were
   being used to tunnel those packets through a public backbone.

   In such cases, the LSR at the transmitting end of the tunnel MUST be
   able to determine the MTU of the tunnel as a whole.  It SHOULD do
   this by sending packets through the tunnel to the tunnel's receiving
   endpoint, and performing Path MTU Discovery with those packets.  Then
   any time the transmitting endpoint of the tunnel needs to send a
   packet into the tunnel, and that packet has the DF bit set, and it
   exceeds the tunnel MTU, the transmitting endpoint of the tunnel MUST
   send the ICMP Destination Unreachable message to the source, with
   code "Fragmentation Required and DF Set", and the Next-Hop MTU Field
   set as described above.

4. Transporting Labeled Packets over PPP

   The Point-to-Point Protocol (PPP) [PPP] provides a standard method
   for transporting multi-protocol datagrams over point-to-point links.
   PPP defines an extensible Link Control Protocol, and proposes a
   family of Network Control Protocols for establishing and configuring
   different network-layer protocols.

   This section defines the Network Control Protocol for establishing
   and configuring Label Switching over PPP.

4.1. Introduction

   PPP has three main components:

      1. A method for encapsulating multi-protocol datagrams.

      2. A Link Control Protocol (LCP) for establishing, configuring,
         and testing the data-link connection.

      3. A family of Network Control Protocols for establishing and
         configuring different network-layer protocols.

   In order to establish communications over a point-to-point link, each
   end of the PPP link must first send LCP packets to configure and test
   the data link.  After the link has been established and optional
   facilities have been negotiated as needed by the LCP, PPP must send
   "MPLS Control Protocol" packets to enable the transmission of labeled
   packets.  Once the "MPLS Control Protocol" has reached the Opened
   state, labeled packets can be sent over the link.

   The link will remain configured for communications until explicit LCP
   or MPLS Control Protocol packets close the link down, or until some
   external event occurs (an inactivity timer expires or network
   administrator intervention).

4.2. A PPP Network Control Protocol for MPLS

   The MPLS Control Protocol (MPLSCP) is responsible for enabling and
   disabling the use of label switching on a PPP link.  It uses the same
   packet exchange mechanism as the Link Control Protocol (LCP).  MPLSCP
   packets may not be exchanged until PPP has reached the Network-Layer
   Protocol phase.  MPLSCP packets received before this phase is reached
   should be silently discarded.

   The MPLS Control Protocol is exactly the same as the Link Control
   Protocol [17] with the following exceptions:

      1. Frame Modifications

         The packet may utilize any modifications to the basic frame
         format which have been negotiated during the Link Establishment
         phase.

      2. Data Link Layer Protocol Field

         Exactly one MPLSCP packet is encapsulated in the PPP
         Information field, where the PPP Protocol field indicates type
         hex 8081 (MPLS).

      3. Code field

         Only Codes 1 through 7 (Configure-Request, Configure-Ack,
         Configure-Nak, Configure-Reject, Terminate-Request, Terminate-
         Ack and Code-Reject) are used.  Other Codes should be treated
         as unrecognized and should result in Code-Rejects.

      4. Timeouts

         MPLSCP packets may not be exchanged until PPP has reached the
         Network-Layer Protocol phase.  An implementation should be
         prepared to wait for Authentication and Link Quality
         Determination to finish before timing out waiting for a
         Configure-Ack or other response.  It is suggested that an
         implementation give up only after user intervention or a
         configurable amount of time.

      5. Configuration Option Types

         None.

4.3. Sending Labeled Packets

   Before any labeled packets may be communicated, PPP must reach the
   Network-Layer Protocol phase, and the MPLS Control Protocol must
   reach the Opened state.

   Exactly one labeled packet is encapsulated in the PPP Information
   field, where the PPP Protocol field indicates either type hex 0081
   (MPLS Unicast) or type hex 0083 (MPLS Multicast).  The maximum length
   of a labeled packet transmitted over a PPP link is the same as the
   maximum length of the Information field of a PPP encapsulated packet.

   The format of the Information field itself is as defined in section
   2.

   Note that two codepoints are defined for labeled packets; one for
   multicast and one for unicast.  Once the MPLSCP has reached the
   Opened state, both label switched multicast packets and label
   switched unicast packets can be sent over the PPP link.
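
   A minimal sketch of the encapsulation (Python; the constants are the
   codepoints given above, while the function itself is our own):

      PPP_MPLSCP         = 0x8081   # MPLS Control Protocol
      PPP_MPLS_UNICAST   = 0x0081
      PPP_MPLS_MULTICAST = 0x0083

      def ppp_information(labeled_packet, multicast=False):
          # Exactly one labeled packet per PPP Information field.
          proto = PPP_MPLS_MULTICAST if multicast else PPP_MPLS_UNICAST
          return proto.to_bytes(2, "big") + labeled_packet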

4.4. Label Switching Control Protocol Configuration Options

   There are no configuration options.

5. Transporting Labeled Packets over LAN Media

   Exactly one labeled packet is carried in each frame.

   The label stack entries immediately precede the network layer header,
   and follow any data link layer headers, including any VLAN headers,
   802.1p headers, and/or 802.1Q headers that may exist.

   The ethertype value 8847 hex is used to indicate that a frame is
   carrying an MPLS unicast packet.

   The ethertype value 8848 hex is used to indicate that a frame is
   carrying an MPLS multicast packet.

   These ethertype values can be used with either the ethernet
   encapsulation or the 802.3 SNAP/SAP encapsulation to carry labeled
   packets.
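
   For example (Python; a sketch only, omitting the frame check
   sequence and any 802.1Q/802.1p headers), a frame carrying a labeled
   packet could be assembled as:

      ETHERTYPE_MPLS_UNICAST   = 0x8847
      ETHERTYPE_MPLS_MULTICAST = 0x8848

      def ethernet_frame(dst_mac, src_mac, labeled_packet,
                         multicast=False):
          # Exactly one labeled packet per frame; the label stack
          # entries follow the ethertype and precede the IP header.
          etype = (ETHERTYPE_MPLS_MULTICAST if multicast
                   else ETHERTYPE_MPLS_UNICAST)
          return (dst_mac + src_mac
                  + etype.to_bytes(2, "big") + labeled_packet)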

                                                      Yakov Rekhter
Expiration date: January 1998                         cisco Systems
                                                        Bruce Davie
                                                      cisco Systems
                                                          Dave Katz
                                              Juniper Networks Inc.
                                                         Eric Rosen
                                                      cisco Systems
                                                     George Swallow
                                                      cisco Systems
                                                     Dino Farinacci
                                                      cisco Systems
                                                          July 1997

                 Tag Switching Architecture - Overview

                  draft-rekhter-tagswitch-arch-01.txt

2. Abstract

   This document provides an overview of label switching. Label switching is
   a way to combine the label-swapping forwarding paradigm with network
   layer routing. This has several advantages. Labels can have a wide
   spectrum of forwarding granularities, so at one end of the spectrum a
   label could be associated with a group of destinations, while at the
   other a label could be associated with a single application flow. At
   the same time forwarding based on label switching, due to its
   simplicity, is well suited to high performance forwarding. These
   factors facilitate the development of a routing system which is both
   functionally rich and scalable. Finally, label switching simplifies
   integration of routers and ATM switches by employing common
   addressing, routing, and management procedures.

3. Introduction

   Continuous growth of the Internet demands higher bandwidth within
   Internet Service Provider (ISP) networks. However, growth of the
   Internet is
   not the only driving factor for higher bandwidth - demand for higher
   bandwidth also comes from emerging multimedia applications. Demand
   for higher bandwidth, in turn, requires higher forwarding performance
   for both multicast and unicast traffic.

   The growth of the Internet also demands improved scaling properties
   of the Internet routing system. The ability to contain the volume of
   routing information maintained by individual routers and the ability
   to build a hierarchy of routing knowledge are essential to support a
   high quality, scalable routing system.

   While the destination-based forwarding paradigm is adequate in many
   situations, we already see examples where it is no longer adequate.
   The ability to overcome the rigidity of destination-based forwarding
   and to have more flexible control over how traffic is routed is
   likely to become more and more important.

   We see the need to improve forwarding performance while at the same
   time adding routing functionality to support multicast, allowing more
   flexible control over how traffic is routed, and providing the
   ability to build a hierarchy of routing knowledge. Moreover, it
   becomes more and more crucial to have a routing system that can
   support graceful evolution to accommodate new and emerging
   requirements.

   Label switching is a technology that provides an efficient solution to
   these challenges. Label switching blends the flexibility and rich
   functionality provided by Network Layer routing with the simplicity
   provided by the label swapping forwarding paradigm. The simplicity of
   the label switching forwarding paradigm (label swapping) enables
   improved forwarding performance, while maintaining competitive
   price/performance. By associating a wide range of forwarding
   granularities with a label, the same forwarding paradigm can be used to
   support a wide variety of routing functions, such as destination-
   based routing, multicast, hierarchy of routing knowledge, and
   flexible routing control. Finally, a combination of simple
   forwarding, a wide range of forwarding granularities, and the ability
   to evolve routing functionality while preserving the same forwarding
   paradigm enables a routing system that can gracefully evolve to
   accommodate new and emerging requirements.

4. Label Switching components

   Label switching consists of two components: forwarding and control. The
   forwarding component uses the label information (labels) carried by
   packets and the label forwarding information maintained by a label switch
   to perform packet forwarding. The control component is responsible
   for maintaining correct label forwarding information among a group of
   interconnected label switches.

   Segregating control and forwarding into separate components promotes
   modularity, which in turn makes it possible to build a system that
   can gracefully evolve to accommodate new and emerging requirements.

5. Forwarding component

   The fundamental forwarding paradigm employed by label switching is
   based on the notion of label swapping.  When a packet with a label is
   received by a label switch, the switch uses the label as an index in
   its Label Information Base (TIB).  Each entry in the TIB consists of
   an incoming label, and one or more sub-entries of the form (outgoing
   label, outgoing interface, outgoing link level information).  If the
   switch finds an entry with the incoming label equal to the label
   carried in the packet, then for each sub-entry in the entry the
   switch replaces the label in the packet with the outgoing label,
   replaces the link level information (e.g., MAC address) in the packet
   with the outgoing link level information, and forwards the packet
   over the outgoing interface.
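
   A minimal sketch of this forwarding step (Python; packet.clone() and
   the attribute names are assumptions of this example, and a real TIB
   would typically be a hardware table):

      def tib_forward(tib, packet):
          # tib maps an incoming label to a list of (outgoing label,
          # outgoing interface, outgoing link level info) sub-entries.
          entry = tib.get(packet.label)
          if entry is None:
              return None   # no entry: see the discussion later in
                            # this section
          for out_label, out_if, out_link_info in entry:
              copy = packet.clone()           # one copy per sub-entry
              copy.label = out_label          # label swap
              copy.link_info = out_link_info  # e.g., next-hop MAC
              out_if.transmit(copy)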

   From the above description of the forwarding component we can make
   several observations. First, the forwarding decision is based on the
   exact match algorithm using a fixed length, fairly short label as an
   index. This enables a simplified forwarding procedure, relative to
   longest match forwarding traditionally used at the network layer.
   This in turn enables higher forwarding performance (higher packets
   per second). The forwarding procedure is simple enough to allow a
   straightforward hardware implementation.

   A second observation is that the forwarding decision is independent
   of the label's forwarding granularity. For example, the same forwarding
   algorithm applies to both unicast and multicast - a unicast entry
   would just have a single (outgoing label, outgoing interface, outgoing
   link level information) sub-entry, while a multicast entry may have
   one or more (outgoing label, outgoing interface, outgoing link level
   information) sub-entries. (For multi-access links, the outgoing link
   level information in this case would include a multicast MAC
   address.) This illustrates how, with label switching, the same
   forwarding paradigm can be used to support different routing
   functions (e.g., unicast and multicast).

   The simple forwarding procedure is thus essentially decoupled from
   the control component of label switching. New routing (control)
   functions can readily be deployed without disturbing the forwarding
   paradigm. This means that it is not necessary to re-optimize
   forwarding performance (by modifying either hardware or software) as
   new routing functionality is added.

   In the label switching architecture, various implementation options are
   acceptable. For example, support for network layer forwarding by a
   label switch (i.e., forwarding based on the network layer header as
   opposed to a label) is optional. Moreover, use of network layer
   forwarding may be constrained to handling network layer control
   traffic only.  (Note, however, that a label switch must be able to
   source and sink network layer packets, e.g., to participate in
   network layer routing protocols.)

   For the purpose of handling network layer hop count (time-to-live)
   the architecture allows two alternatives: network layer hops may
   correspond directly to hops formed by label switches, or one network
   layer hop may correspond to several label switched hops.

   When a switch receives a packet with a label, and the TIB maintained
   by the switch has no entry with the incoming label equal to the label
   carried by the packet, or such an entry exists but contains no
   outgoing label and does not indicate local delivery to the switch,
   the switch may either (a) discard the packet, or (b) strip the label
   information, and submit the packet for network layer processing.
   Support for the latter is optional (as support for network layer
   forwarding is optional). Note that it may not always be possible to
   successfully forward a packet after stripping a label even if a label
   switch supports network layer forwarding.

   The architecture allows a label switch to maintain either a single TIB
   per label switch, or a TIB per interface. Moreover, a label switch could
   mix both of these options - some labels could be maintained in a single
   TIB, while other labels could be maintained in a TIB associated with
   individual interfaces.

5.1. Label encapsulation

   Label switching clearly requires a label to be carried in each packet.
   The label information can be carried in a variety of ways:

      - as a small "shim" label header inserted between the layer 2 and
      the Network Layer headers;

      - as part of the layer 2 header, if the layer 2 header provides
      adequate semantics (e.g., Frame Relay, or ATM);

      - as part of the Network Layer header (e.g., using the Flow Label
      field in IPv6 with appropriately modified semantics).

   It is therefore possible to implement label switching over virtually
   any media type including point-to-point links, multi-access links,
   and ATM. At the same time the forwarding component allows specific
   optimizations for particular media (e.g., ATM).

   Observe also that the label forwarding component is Network Layer
   independent. Use of control component(s) specific to a particular
   Network Layer protocol enables the use of label switching with
   different Network Layer protocols.

6. Control component

   Essential to label switching is the notion of binding between a label and
   Network Layer routing (routes). The control component is responsible
   for creating label bindings, and then distributing the label binding
   information among label switches. Creating a label binding involves
   allocating a label, and then binding a label to a route. The distribution
   of label binding information among label switches could be accomplished
   via several options:

      - piggybacking on existing routing protocols

      - using a separate Label Distribution Protocol (LDP)

   While the architecture supports distribution of label binding
   information that is independent of the underlying routing protocols,
   the architecture acknowledges that considerable optimizations can be
   achieved in some cases by small enhancements of existing protocols to
   enable piggybacking label binding information on these protocols.

   One important characteristic of the label switching architecture is
   that creation of label bindings is driven primarily by control traffic
   rather than by data traffic. Control traffic driven creation of label
   bindings has several advantages, as compared to data traffic driven
   creation of label bindings. For one thing, it minimizes the amount of
   additional control traffic needed to distribute label binding
   information, as label binding information is distributed only in
   response to control traffic, independent of data traffic. It also
   makes the overall scheme independent of and insensitive to the data
   traffic profile/pattern. Control traffic driven creation of label
   binding improves forwarding performance, as labels are precomputed
   (prebound) before data traffic arrives, rather than being created as
   data traffic arrives. It also simplifies the overall system behavior,
   as the control plane is controlled solely by control traffic, rather
   than by a mix of control and data traffic.

   Another important characteristic of the label switching architecture is
   that distribution and maintenance of label binding information is
   consistent with distribution and maintenance of the associated
   routing information. For example, distribution of label binding
   information for labels associated with unicast routing is based on the
   technique of incremental updates with explicit acknowledgment. This
   is very similar to the way unicast routing information gets
   distributed by such protocols as OSPF and BGP. In contrast,
   distribution of label binding information for labels associated with
   multicast routing is based on periodic updates/refreshes, without any
   explicit acknowledgments. This is consistent with the way multicast
   routing information is distributed by such protocols as PIM.

   To provide good scaling characteristics, while also accommodating
   diverse routing functionality, label switching supports a wide range of
   forwarding granularities. At one extreme a label could be associated
   (bound) to a group of routes (more specifically to the Network Layer
   Reachability Information of the routes in the group). At the other
   extreme a label could be bound to an individual application flow (e.g.,
   an RSVP flow). A label could also be bound to a multicast tree. In
   addition, a label may be bound to a path that has been selected for a
   certain set of packets based on some policy (e.g. an explicit route).

   The control component is organized as a collection of modules, each
   designed to support a particular routing function. To support new
   routing functions, new modules can be added. The architecture does
   not mandate a prescribed set of modules that have to be supported by
   every label switch.

   The following describes some of the modules.

6.1. Destination-based routing

   In this section we describe how label switching can support
   destination-based routing. Recall that with destination-based routing
   a router makes a forwarding decision based on the destination address
   carried in a packet and the information stored in the Forwarding
   Information Base (FIB) maintained by the router. A router constructs
   its FIB by using the information it receives from routing protocols
   (e.g., OSPF, BGP).

   To support destination-based routing with label switching, a label
   switch, just like a router, participates in routing protocols (e.g.,
   OSPF, BGP), and constructs its FIB using the information it receives
   from these protocols.

   There are three permitted methods for label allocation and Label
   Information Base (TIB) management: (a) downstream label allocation, (b)
   downstream label allocation on demand, and (c) upstream label allocation.
   In all cases, a switch allocates labels and binds them to address
   prefixes in its FIB. In downstream allocation, the label that is
   carried in a packet is generated and bound to a prefix by the switch
   at the downstream end of the link (with respect to the direction of
   data flow). On demand allocation means that labels will only be
   allocated and distributed by the downstream switch when it is
   requested to do so by the upstream switch. Method (b) is most useful
   in ATM networks (see Section 8). In upstream allocation, labels are
   allocated and bound at the upstream end of the link. Note that in
   downstream allocation, a switch is responsible for creating label
   bindings that apply to incoming data packets, and receives label
   bindings for outgoing packets from its neighbors. In upstream
   allocation, a switch is responsible for creating label bindings for
   outgoing labels, i.e. labels that are applied to data packets leaving the
   switch, and receives bindings for incoming labels from its neighbors.

   The downstream label allocation scheme operates as follows: for each
   route in its FIB the switch allocates a label, creates an entry in its
   Label Information Base (TIB) with the incoming label set to the allocated
   label, and then advertises the binding between the (incoming) label and
   the route to other adjacent label switches. The advertisement could be
   accomplished by either piggybacking the binding on top of the
   existing routing protocols, or by using a separate Label Distribution
   Protocol (LDP). When a label switch receives label binding information
   for a route, and that information was originated by the next hop for
   that route, the switch places the label (carried as part of the binding
   information) into the outgoing label of the TIB entry associated with
   the route. This creates the binding between the outgoing label and the
   route.
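
   A sketch of this scheme (Python; the label allocator, the
   advertisement channel, and the FIB/TIB accessors are all assumptions
   of this example):

      def allocate_downstream(fib, tib, allocate_label, advertise):
          # Bind an incoming label to every route in the FIB and
          # advertise each binding to adjacent label switches.
          for route in fib.routes():
              label = allocate_label()
              tib[label] = {"route": route, "outgoing": None}
              advertise(route, label)       # via LDP or piggybacked

      def on_binding(fib, tib_entry_for, route, label, originator):
          # Install the advertised label as the outgoing label only if
          # the advertisement came from the route's next hop.
          if originator == fib.next_hop(route):
              tib_entry_for(route)["outgoing"] = label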

   With the downstream on demand label allocation scheme, operation is as
   follows. For each route in its FIB, the switch identifies the next
   hop for that route. It then issues a request (via LDP) to the next
   hop for a label binding for that route. When the next hop receives the
   request, it allocates a label, creates an entry in its TIB with the
   incoming label set to the allocated label, and then returns the binding
   between the (incoming) label and the route to the switch that sent the
   original request. When the switch receives the binding information,
   the switch creates an entry in its TIB, and sets the outgoing label in
   the entry to the value received from the next hop. Handling of data
   packets is as for downstream allocation. The main application for
   this mode of operation is with ATM switches, as described in Section
   8.

   The upstream label allocation scheme is used as follows. If a label
   switch has one or more point-to-point interfaces, then for each route
   in its FIB whose next hop is reachable via one of these interfaces,
   the switch allocates a label, creates an entry in its TIB with the
   outgoing label set to the allocated label, and then advertises to the
   next hop (via LDP) the binding between the (outgoing) label and the
   route. When a label switch that is the next hop receives the label
   binding information, the switch places the label (carried as part of
   the binding information) into the incoming label of the TIB entry
   associated with the route.

   Note that, while we have described upstream allocation for the sake
   of completeness, we have found the two downstream allocation methods
   adequate for all practical purposes so far.

   Independent of which label allocation method is used, once a TIB entry
   is populated with both incoming and outgoing labels, the label switch can
   forward packets for routes bound to the labels by using the label
   switching forwarding algorithm (as described in Section 5).

   When a label switch creates a binding between an outgoing label and a
   route, the switch, in addition to populating its TIB, also updates
   its FIB with the binding information. This enables the switch to add
   labels to previously unlabeled packets.

   So far we have described how a label could be bound to a single route,
   creating a one-to-one mapping between routes and labels. However, under
   certain conditions it is possible to bind a label not just to a single
   route, but to a group of routes, creating a many-to-one mapping
   between routes and labels. Consider a label switch that is connected to a
   router.  It is quite possible that the switch uses the router as the
   next hop not just for one route, but for a group of routes. Under
   these conditions the switch does not have to allocate distinct labels
   to each of these routes - one label would suffice. The distribution of
   label binding information is unaffected by whether there is a one-to-
   one or one-to-many mapping between labels and routes. Now consider a
   label switch that receives from one of its neighbors (label switching
   peers) label binding information for a set of routes, such that the set
   is bound to a single label. If the switch decides to use some or all of
   the routes in the set, then for these routes the switch does not need
   to allocate individual labels - one label would suffice. Such an approach
   may be valuable when labels are a precious resource. Note that the
   ability to support many-to-one mapping makes no assumptions about the
   routing protocols being used.

   When a label switch adds a label to a previously unlabeled packet,
   the label could be either associated with the route to the
   destination address
   carried in the packet, or with the route to some other label switch
   along the path to the destination (in some cases the address of that
   other label switch could be gleaned from network layer routing
   protocols). The latter option provides yet another way of mapping
   multiple routes into a single label. However, this option is either
   dependent on particular routing protocols, or would require a
   separate mechanism for discovering label switches along a path.

   To understand the scaling properties of label switching in conjunction
   with destination-based routing, observe that the total number of labels
   that a label switch has to maintain cannot be greater than the number
   of routes in the switch's FIB. Moreover, as we have just seen, the
   number of labels can be much less than the number of routes. Thus, much
   less state is required than would be the case if labels were allocated
   to individual flows.

   In general, a label switch will try to populate its TIB with incoming
   and outgoing labels for all routes to which it has reachability, so
   that all packets can be forwarded by simple label swapping. Label
   allocation is thus driven by topology (routing), not data traffic -
   it is the existence of a FIB entry that causes label allocations, not
   the arrival of data packets.

   Use of labels associated with routes, rather than flows, also means
   that there is no need to perform flow classification procedures for
   all the flows to determine whether to assign a label to a flow. That,
   in turn, simplifies the overall scheme, and makes it more robust and
   stable in the presence of changing traffic patterns.

   Note that when label switching is used to support destination-based
   routing, label switching does not completely eliminate the need to
   perform normal Network Layer forwarding at some network elements.
   First of all, to add a label to a previously unlabeled packet requires
   normal Network Layer forwarding. This function could be performed by
   the first hop router, or by the first router on the path that is able
   to participate in label switching. In addition, whenever a label switch
   aggregates a set of routes (e.g., by using the technique of
   hierarchical routing) into a single route, and the routes do not
   share a common next hop, the switch needs to perform Network Layer
   forwarding for packets carrying the label associated with the
   aggregated route. However, one could observe that the number of
   places where routes get aggregated is smaller than the total number
   of places where forwarding decisions have to be made. Moreover, quite
   often aggregation is applied to only a subset of the routes
   maintained by a label switch. As a result, on average a packet can be
   forwarded most of the time using the label switching algorithm. Note
   that many label switches may not need to perform any network layer
   forwarding.

6.2. Hierarchy of routing knowledge

   The IP routing architecture models a network as a collection of
   routing domains. Within a domain, routing is provided via interior
   routing (e.g., OSPF), while routing across domains is provided via
   exterior routing (e.g., BGP). However, all routers within domains
   that carry transit traffic (e.g., domains formed by Internet Service
   Providers) have to maintain information provided by not just interior
   routing, but exterior routing as well, even if only some of these
   routers participate in exterior routing. That creates certain
   problems. First of all, the amount of this information is not
   insignificant. Thus it places additional demand on the resources
   required by the routers.  Moreover, an increase in the volume of
   routing information quite often increases routing convergence time.
   This, in
   turn, degrades the overall performance of the system.

   Label switching allows complete decoupling of interior and exterior
   routing. With label switching only label switches at the border of a
   domain would be required to maintain routing information provided by
   exterior routing - all other switches within the domain would just
   maintain routing information provided by the domain's interior routing
   (which is usually significantly smaller than the exterior routing
   information), with no "leaking" of exterior routing information into
   interior routing. This, in turn, reduces the routing load on non-
   border switches, and shortens routing convergence time.

   To support this functionality, label switching allows a packet to carry
   not one but a set of labels, organized as a stack. A label switch could
   either swap the label at the top of the stack, or pop the stack, or
   swap the label and push one or more labels onto the stack.

   Consider a label switch that is at the border of a routing domain. This
   switch maintains both exterior and interior routes. The interior
   routes provide routing information and labels to all the other label
   switches within the domain. For each exterior route that the switch
   receives from some other border label switch that is in the same domain
   as the local switch, the switch maintains not just a label associated
   with the route, but also a label associated with the route to that
   other border label switch. Moreover, for inter-domain routing protocols
   that are capable of passing the "third-party" next hop information,
   the switch would maintain a label associated with the route to the next
   hop, rather than with the route to the border label switch from whom
   the local switch received the exterior route.

   When a packet is forwarded between two (border) label switches in
   different domains, the label stack in the packet contains just one label
   (associated with an exterior route). However, when a packet is
   forwarded within a domain, the label stack in the packet contains not
   one, but two labels (the second label is pushed by the domain's ingress
   border label switch). The label at the top of the stack provides packet
   forwarding to an appropriate egress border label switch (or the
   "third-party" next hop), while the next label in the stack provides
   correct packet forwarding at the egress switch (or at the "third-
   party" next hop). The stack is popped by either the egress switch (or
   the "third-party" next hop) or by the penultimate (with respect to
   the egress switch/"third-party" next hop) switch.
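
   In outline (Python, top of stack at index 0; a sketch under the
   two-label scenario just described):

      def ingress_border_push(stack, interior_label):
          # Entering the domain: push an interior label on top of the
          # exterior label already carried by the packet.
          return [interior_label] + stack

      def egress_or_penultimate_pop(stack):
          # Leaving the domain (or one hop earlier): pop the interior
          # label, exposing the exterior label for further forwarding.
          return stack[1:]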

   One could observe that, when label switching is confined to a single
   routing domain, this approach could still be used to decouple
   interior from exterior routing.  However, in this case a border label
   switch wouldn't maintain labels associated with each exterior route,
   and forwarding between domains would be performed at the network
   layer.

   The control component used in this scenario is fairly similar to the
   one used with destination-based routing. In fact, the only essential
   difference is that in this scenario the label binding information is
   distributed both among physically adjacent label switches, and among
   border label switches within a single domain. One could also observe
   that the latter (distribution among border switches) could be
   trivially accommodated by very minor extensions to BGP.

   The notion of supporting hierarchy of routing knowledge with label
   switching is not limited to the case of exterior/interior routing,
   but could be applicable to other cases where the hierarchy of routing
   knowledge is possible. Moreover, while the above describes only a
   two-level hierarchy of routing knowledge, the label switching
   architecture does not impose limits on the depth of the hierarchy.

   In the presence of a hierarchy of routing knowledge, a label switched
   path at level N in the hierarchy has to have its endpoints at label
   switches that are at the border between level N and level (N-1) in
   the hierarchy (level 0 in the hierarchy corresponds to an unlabeled
   path).

6.3. Multicast

   Essential to multicast routing is the notion of spanning trees.
   Multicast routing procedures (e.g., PIM) are responsible for
   constructing such trees (with receivers as leaves), while multicast
   forwarding is responsible for forwarding multicast packets along such
   trees. Thus, to support a multicast forwarding function with label
   switching we need to be able to associate a label with a multicast
   tree.  The following describes the procedures for allocation and
   distribution of labels for multicast.

   When label switching is used for multicast, it is important that label
   switching be able to utilize multicast capabilities provided by the
   Data Link layer (e.g., multicast capabilities provided by Ethernet).
   To be able to do this, an (upstream) label switch connected to a given
   Data Link subnetwork should use the same label when forwarding a
   multicast packet to all of the (downstream) switches on that
   subnetwork. This way the packet will be multicasted at the Data Link
   layer over the subnetwork. To support this, all label switches that are
   part of a given multicast tree and are on a common subnetwork must
   agree on a common label that would be used for forwarding multicast
   packets along the tree over the subnetwork. Moreover, since multicast
   forwarding is based on Reverse Path Forwarding (RPF), it is crucial
   that, when a label switch receives a multicast packet, the label
   carried in the packet enable the switch to identify both (a) a
   particular multicast group, and (b) the previous hop (upstream) label
   switch that sent the packet.

   To support the requirements outlined in the previous paragraph, the
   label switching architecture assumes that (a) multicast labels are
   associated with interfaces on a label switch (rather than with a label
   switch as a whole), (b) the label space that a label switch could use for
   allocating labels for multicast is partitioned into non-overlapping
   regions among all the label switches connected to a common Data Link
   subnetwork, and (c) there are procedures by which label switches that
   belong to a common multicast tree and are on a common Data Link
   subnetwork agree on the label switch that is responsible for allocating
   a label for the tree.

   One possible way of partitioning label space into non-overlapping
   regions among label switches connected to a common subnetwork is for
   each label switch to claim a region of the space and announce this
   region to its neighbors. Conflicts are resolved based on the IP
   address of the contending switches (the higher address wins, the
   lower retries). Once the label space is partitioned among label switches,
   the switches may create bindings between labels and multicast trees
   (routes).
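
   The conflict-resolution rule might be sketched as follows (Python;
   regions_overlap() is a hypothetical predicate over claimed label
   ranges):

      def resolve_claim(my_ip, my_region, peer_ip, peer_region):
          # Overlapping claims are resolved by IP address: the higher
          # address keeps its region, the lower one retries with a
          # different claim.
          if regions_overlap(my_region, peer_region):
              return "keep" if my_ip > peer_ip else "retry"
          return "keep"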

   At least in principle there are two possible ways to create bindings
   between labels and multicast trees (routes). With the first
   alternative, for a set of label switches that share a common Data
   Link subnetwork,
   the label switch that is upstream with respect to a particular
   multicast tree allocates a label (out of its own region that does not
   overlap with the regions of other switches on the subnetwork), binds
   the label to a multicast route, and then advertises the binding to all
   the (downstream) switches on the subnetwork. With the second
   alternative, one of the label switches that is downstream with respect
   to a particular multicast tree allocates a label (out of its own region
   that does not overlap with the regions of other switches on the
   subnetwork), binds the label to a multicast route, and then advertises
   the binding to all the switches (both downstream and upstream) on the
   subnetwork. Usually the first label switch to join the group is the one
   that performs the allocation.

   Each of the above alternatives has its own trade-offs. The first
   alternative is fairly simple - one upstream router does the label
   binding and multicasts the binding downstream. However, the first
   alternative may create uneven distribution of allocated labels, as some
   label switches on a common subnetwork may have more upstream multicast
   sources than the others. Also, changes in topology could result in
   upstream neighbor changes, which in turn would require label re-
   binding. Finally, one could observe that distributing label binding
   from upstream towards downstream is inconsistent with the direction
   of multicast routing information distribution (from downstream
   towards upstream).

   The second alternative, though more complex than the first one, has
   its own advantages. For one thing, it makes distribution of multicast
   label binding consistent with the distribution of unicast label binding.
   It also makes distribution of multicast label binding consistent with
   the distribution of multicast routing information. This, in turn,
   allows the piggybacking of label binding information on existing
   multicast routing protocols (PIM). This alternative also avoids the
   need for label re-binding when the upstream neighbor changes.
   Finally, it is likely to provide a more even distribution of
   allocated labels than the first alternative. Note that this
   approach does require a mechanism to choose the label allocator from
   among the downstream label switches on the subnetwork.

6.4. Quality of service

   Two mechanisms are needed for providing a range of qualities of
   service to packets passing through a router or a label switch. First,
   we need to classify packets into different classes. Second, we need
   to ensure that the handling of packets is such that the appropriate
   QOS characteristics (bandwidth, loss, etc.) are provided to each
   class.

   Label switching provides an easy way to mark packets as belonging to a
   particular class after they have been classified the first time.
   Initial classification could be done using configuration information
   (e.g., all traffic from a certain interface) or using information
   carried in the network layer or higher layer headers (e.g., all
   packets between a certain pair of hosts). A label corresponding to the
   resultant class would then be applied to the packet. Labeled packets
   can then be efficiently handled by the label switching routers in their
   path without needing to be reclassified. The actual scheduling and
   queueing of packets is largely orthogonal - the key point here is
   that label switching enables simple logic to be used to find the state
   that identifies how the packet should be scheduled.

   Label switching can, for example, be used to support a small number of
   classes of service in a service provider network (e.g. premium and
   standard). On frame-based media, the class can be encoded by a field
   in the label header. On ATM label switches, additional labels can be
   allocated to differentiate the different classes. For example, rather
   than having one label for each destination prefix in the FIB, an ATM
   label switch could have two labels per prefix, one to be used by premium
   traffic and one by standard. Thus a label binding in this case is a
   triple consisting of <address prefix, class of service, label>. Such
   a label would be
   used both to make a forwarding decision and to make a scheduling
   decision, e.g., by selecting the appropriate queue in a weighted fair
   queueing (WFQ) scheduler.
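
   A minimal sketch of such a table, assuming hypothetical prefixes and
   label values (the text does not prescribe a data structure), might
   look as follows in Python:

      # One label per <prefix, class of service> pair; the label drives
      # both the forwarding decision and the choice of WFQ queue.
      fib = {
          ("192.6.10.0/24", "premium"):  42,
          ("192.6.10.0/24", "standard"): 43,
      }

      def label_for(prefix, cos):
          return fib[(prefix, cos)]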

   To provide a finer granularity of QOS, label switching can be used with
   RSVP. We propose a simple extension to RSVP in which a label object is
   defined. Such an object can be carried in an RSVP reservation message
   and thus associated with a session. Each label capable router assigns a
   label to the session and passes it upstream with the reservation
   message. Thus the association of labels with RSVP sessions works very
   much like the binding of labels to routes with downstream allocation.
   Note, however, that binding is accomplished using RSVP rather than
   LDP. (It would be possible to use LDP, but it is simpler to extend
   RSVP to carry labels and this ensures that labels and reservation
   information are communicated in a similar manner.)

   When data packets are transmitted, the first router in the path that
   is label-capable applies the label that it received from its downstream
   neighbor. This label can be used at the next hop to find the
   corresponding reservation state, to forward and schedule the packet
   appropriately, and to find the suitable outgoing label value provided
   by the next hop.  Note that label imposition could also be performed at
   the sending host.

6.5. Flexible routing (explicit routes)

   One of the fundamental properties of destination-based routing is
   that the only information from a packet that is used to forward the
   packet is the destination address. While this property enables highly
   scalable routing, it also limits the ability to influence the actual
   paths taken by packets. This, in turn, limits the ability to evenly
   distribute traffic among multiple links, taking the load off highly
   utilized links, and shifting it towards less utilized links. For
   Internet Service Providers (ISPs) who support different classes of
   service, destination-based routing also limits their ability to
   segregate different classes with respect to the links used by these
   classes. Some of the ISPs today use Frame Relay or ATM to overcome
   the limitations imposed by destination-based routing. Label switching,
   because of the flexible granularity of labels, is able to overcome
   these limitations without using either Frame Relay or ATM.

   Another application where destination-based routing is no longer
   adequate is routing with resource reservations (QOS routing).
   Increasing the number of ways by which a particular reservation could
   traverse a network may improve the success of the reservation.
   Increasing the number of ways, in turn, requires the ability to
   explore paths that are not constrained to the ones constructed solely
   based on destination.

   To provide forwarding along paths that are different from the paths
   determined by destination-based routing, the control component of label
   switching allows installation of label bindings in label switches that do
   not correspond to the destination-based routing paths.

   One possible alternative for supporting explicit routes is to allow
   LDP to carry information about an explicit route, where such a route
   could be expressed as a sequence of label switches. Another alternative
   is to use label-capable RSVP (see Section 6.4) as a mechanism to
   distribute label bindings, and to augment RSVP with the ability to
   steer the PATH message along a particular (explicit) route. Finally,
   it is also possible in principle to use some form of source route
   (e.g., SDRP, GRE) to steer RSVP PATH messages carrying label bindings
   along a particular path. Note, however, that this would require a
   change to the way in which RSVP handles PATH messages, as it would be
   necessary to store the source route as part of the PATH state.

7. Label Forwarding Granularities and Forwarding Equivalence Classes

   A conventional router has some sort of structure or set of structures
   which may be called a "forwarding table", which has a finite number
   of entries. Whenever a packet is received, the router applies a
   classification algorithm which maps the packet to one of the
   forwarding table entries. This entry specifies how to forward the
   packet.

   We can think of this classification algorithm as a means of
   partitioning the universe of possible packets into a finite set of
   "Forwarding Equivalence Classes" (FECs).

   Each router along a path must have some way of determining the next
   hop for that FEC. For a given FEC, the corresponding entry in the
   forwarding table may be created dynamically, by operation of the
   routing protocols (unicast or multicast), or it might be created by
   configuration, or it might be created by some combination of
   configuration and protocol.

   In label switching, if a pair of label switches are adjacent along a label
   switched path, they must agree on an assignment of labels to FECs. Once
   this agreement is made, all label switches on the label switched path
   other than the first are spared the work of actually executing the
   classification algorithm. In fact, subsequent label switches need not
   even have the code which would be necessary to do this.

   There are a large number of different ways in which one may choose to
   partition a set of packets into FECs. Some examples:

      1. Consider two packets to be in the same FEC if there is a single
      address prefix in the routing table which is the longest match for
      the destination address of each packet;

      2. Consider two packets to be in the same FEC if these packets
      have to traverse a common router/label switch;

      3. Consider two packets to be in the same FEC if they have the
      same source address and the same destination address;

      4. Consider two packets to be in the same FEC if they have the
      same source address, the same destination address, the same
      transport protocol, the same source port, and the same destination
      port.

      5. Consider two packets to be in the same FEC if they are alike in
      some arbitrary manner determined by policy. Note that the
      assignment of a packet to a FEC by policy need not be done solely
      by examining the network layer header. One might want, for
      example, all packets arriving over a certain interface to be
      classified into a single FEC, so that those packets all get
      tunnelled through the network to a particular exit point.

   Other examples can easily be thought of.
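
   To make cases 1 and 4 concrete, here is an illustrative Python
   sketch (the routes and field names are hypothetical, and only the
   first label switch on a path needs to run such classification):

      import ipaddress

      ROUTES = [ipaddress.ip_network("10.0.0.0/8"),
                ipaddress.ip_network("10.1.0.0/16")]

      def fec_case_1(dst):
          # Case 1: the FEC is the longest matching prefix.
          matches = [n for n in ROUTES if ipaddress.ip_address(dst) in n]
          return max(matches, key=lambda n: n.prefixlen, default=None)

      def fec_case_4(src, dst, proto, sport, dport):
          # Case 4: the FEC is the transport-level flow itself.
          return (src, dst, proto, sport, dport)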

   In case 1, the FEC can be identified by an address prefix (as
   described in Section 6.1). In case 2, the FEC can be identified by
   the address of a label switch (as described in Section 6.1). Both 1 and
   2 are useful for binding labels to unicast routes - labels are bound to
   FECs, and an address prefix, or an address identifies a particular
   FEC. Case 3 is useful for binding labels to multicast trees that are
   constructed by protocols such as PIM (as described in Section 6.3).
   Case 4 is useful for binding labels to individual flows, using, say,
   RSVP (as described in Section 6.4). Case 5 is useful as a way of
   connecting two pieces of a private network across a public backbone
   (without even assuming that the private network is an IP network) (as
   described in Section 6.5).

   Any number of different kinds of FEC can co-exist in a single label
   switch, as long as the result is to partition the universe of packets
   seen by that label switch. Likewise, the procedures which different label
   switches use to classify (hitherto unlabeled) packets into FECs need
   not be identical.

   Networks could be organized around a hierarchy of FECs. For example,
   (non-adjacent) label switches TSa and TSb may classify packets into
   some set of FECs FEC1,...,FECn.  However, from the point of view of
   the intermediate label switches between TSa and TSb, all of these FECs
   may be treated indistinguishably. That is, as far as the intermediate
   label switches are concerned, the union of the FEC1,...,FECn is a
   single FEC.  Each intermediate label switch may then prefer to use a
   single label for this union (rather than maintaining individual labels
   for each member of this union). Label switching accommodates this by
   providing a hierarchy of labels, organized in a stack.
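
   The behavior at the region boundaries can be sketched as a simple
   push/pop on a list (the labels are shown as strings purely for
   illustration):

      def enter_region(stack, union_label):
          return [union_label] + stack   # push the coarser label on top

      def exit_region(stack):
          return stack[1:]               # pop it, exposing the finer label

      stack = ["FEC3-label"]
      stack = enter_region(stack, "union-label")  # interior switches see
                                                  # only "union-label"
      stack = exit_region(stack)                  # back to ["FEC3-label"]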

   Much of the power of label switching arises from the facts that:

      - there are so many different ways to partition the packets into
      FECs,

      - different label switches can partition the hitherto unlabeled
      packets in different ways,

      - the route to be used for a particular FEC can be chosen in
      different ways,

      - a hierarchy of labels, organized as a stack, can be used to
      represent the network's hierarchy of FECs.

   Note that label switching does not specify, as an element of any
   particular protocol, a general notion of "FEC identifier". Even if it
   were possible to have such a thing, there is no need for it, since
   there is no "one size fits all" setup protocol which works for any
   arbitrary combination of packet classifier and routing protocol.
   That's why label distribution is sometimes done with LDP, sometimes
   with BGP, sometimes with PIM, sometimes with RSVP.

8. Label switching with ATM

   Since the label switching forwarding paradigm is based on label
   swapping, and since ATM forwarding is also based on label swapping,
   label switching technology can readily be applied to ATM switches by
   implementing the control component of label switching.

   The label information needed for label switching can be carried in the
   VCI field. If two levels of labeling are needed, then the VPI field
   could be used as well, although the size of the VPI field limits the
   size of networks in which this would be practical. However, for most
   applications of one level of labeling the VCI field is adequate.

   To obtain the necessary control information, the switch should be
   able to support the label switching control component. Moreover, if the
   switch has to perform routing information aggregation, then to
   support destination-based unicast routing the switch should be able
   to perform Network Layer forwarding for some fraction of the traffic
   as well.

   Supporting the destination-based routing function with label switching
   on an ATM switch may require the switch to maintain not one, but
   several labels associated with a route (or a group of routes with the
   same next hop). This is necessary to avoid the interleaving of
   packets which arrive from different upstream label switches, but are
   sent concurrently to the same next hop.

   If an ATM switch has built-in mechanism(s) to suppress cell
   interleave, then the switch could implement the destination-based
   routing function precisely the way it was described in Section 6.1.
   This would eliminate the need to maintain several labels per route.
   Note, however, that suppressing cell interleave is not part of the
   ATM User Plane, as defined by the ATM Forum.

   Yet another alternative that eliminates the need to maintain several
   labels per route is to carry the label information in the VPI field, and
   use the VCI field for identifying cells that were sent by different
   label switches. Note, however, that the scalability of this alternative
   is constrained by the size of the VPI space (4096 labels total).
   Moreover, this alternative assumes that for a set of ATM label switches
   that form a contiguous segment of a network topology there exists a
   mechanism to assign to each ATM label switch around the edge of the
   segment a set of unique VCIs that would be used by this switch alone.

   The downstream label allocation on demand scheme is likely to be a
   preferred scheme for the label allocation and TIB maintenance
   procedures with ATM switches, as this scheme allows efficient use of
   entries in the cross-connect tables maintained by ATM switches.

   Implementing label switching on an ATM switch simplifies integration of
   ATM switches and routers. From a routing peering point of view an ATM
   switch capable of label switching would appear as a router to an
   adjacent router; this reduces the number of routing peers a router
   would have to maintain (relative to the common arrangement where a
   large number of routers are fully meshed over an ATM cloud). Label
   switching enables better routing, as it exposes the underlying
   physical topology to the Network Layer routing. Finally label switching
   simplifies overall operations by employing common addressing,
   routing, and management procedures among both routers and ATM
   switches. That could provide a viable, more scalable alternative to
   the overlay model. Because creation of label binding is driven by
   control traffic, rather than data traffic, application of this
   approach to ATM switches does not produce high call setup rates, nor
   does it depend on the longevity of flows.

   Implementing label switching on an ATM switch does not preclude the
   ability to support a traditional ATM control plane (e.g., PNNI) on
   the same switch. The two components, label switching and the ATM
   control plane, would operate in a Ships In the Night mode (with
   VPI/VCI space and other resources partitioned so that the components
   do not interact).

9. Label switching migration strategies

   Since label switching is performed between a pair of adjacent label
   switches, and since the label binding information can be distributed on
   a pairwise basis, label switching could be introduced in a fairly
   simple, incremental fashion. For example, once a pair of adjacent
   routers are converted into label switches, each of the switches would
   label packets destined to the other, thus enabling the other switch to
   use label switching. Since label switches use the same routing protocols
   as routers, the introduction of label switches has no impact on
   routers. In fact, a label switch connected to a router acts just as a
   router from the router's perspective.

   As more and more routers are upgraded to enable label switching, the
   scope of functionality provided by label switching widens. For example,
   once all the routers within a domain are upgraded to support label
   switching, it becomes possible to start using the hierarchy of
   routing knowledge function.

10. Summary

   In this paper we described the label switching technology. Label
   switching is not constrained to a particular Network Layer protocol -
   it is a multiprotocol solution. The forwarding component of label
   switching is simple enough to facilitate high performance forwarding,
   and may be implemented on high performance forwarding hardware such
   as ATM switches. The control component is flexible enough to support
   a wide variety of routing functions, such as destination-based
   routing, multicast routing, hierarchy of routing knowledge, and
   explicitly defined routes. By allowing a wide range of forwarding
   granularities that could be associated with a label, we provide both
   scalable and functionally rich routing. A combination of a wide range
   of forwarding granularities and the ability to evolve the control
   component fairly independently from the forwarding component results
   in a solution that enables graceful introduction of new routing
   functionality to meet the demands of a rapidly evolving computer
   networking environment.

11. Security Considerations

   Security considerations are not addressed in this document.

12. Intellectual Property Considerations

   Cisco Systems may seek patent or other intellectual property
   protection for some or all of the technologies disclosed in this
   document. If any standards arising from this document are or become
   protected by one or more patents assigned to Cisco Systems, Cisco
   intends to disclose those patents and license them under openly
   specified and non-discriminatory terms, for no fee.

                                                                P Doolan
                                                           cisco Systems
Expiration Date: November 1997
                                                                 B Davie
                                                           cisco Systems

                                                                  D Katz
                                                        Juniper Networks

                                                               Y Rekhter
                                                           cisco Systems

                                                                 E Rosen
                                                           cisco Systems

                                                                May 1997

                       Label Distribution Protocol

                      draft-doolan-tdp-spec-01.txt

1. Abstract

   An overview of a  label switching architecture is provided in
   [L].  This document defines the Label Distribution Protocol (LDP)
   referred to in [L].

   LDP is a two party protocol that runs over a connection oriented
   transport layer with guaranteed sequential delivery.  Label Switching
   Routers use LDP to communicate label binding information to their
   peers. LDP supports multiple network layer protocols including but
   not limited to IPv4, IPv6, IPX and AppleTalk.

   We define here the PDUs and operational procedures for this LDP and
   specify its transport requirements. We also define aspects of the
   protocol that are specific to the case where it is run over an ATM
   datalink.

2. Protocol Overview

   A label switching architecture is described in [L]. As explained
   in that document Label Switching Routers (LSRs) create label bindings,
   and then distribute the label binding information among other LSRs.

   LDP provides the means for LSRs to distribute, request, and release
   label binding information for multiple network layer protocols. LDP
   also provides means to open, monitor and close LDP sessions and to
   indicate errors that occur during those sessions.

   LDP is a two party protocol that requires a connection oriented
   transport layer with guaranteed sequential delivery. We use TCP as
   the transport for LDP.

   A LSR that wishes to exchange label bindings with another opens a TCP
   connection to the LDP port (TBD) on that other LSR. Once the TCP
   connection has been established then the LSRs exchange LDP PDUs that
   encode label binding information. LDP is symmetrical in that once the
   TCP connection has been opened the peer LSRs may each send and
   receive LDP PDUs at will.

   A single LSR may have LDP sessions with multiple other LSRs. Each of
   these sessions is completely independent of the others. Multiple LDP
   sessions may exist between any given pair of LSRs. Each of these
   sessions is completely independent of the others. LDP sessions are
   identified by the 'LDP Identifier' field in the LDP header (see
   below).

   LDP does not require any  keepalive notification from the transport,
   but implements its own keepalive timer. The usage is straightforward:
   peers must communicate within the period specified by the timer. Each
   time a LDP peer receives a LDP PDU it resets the timer. If the timer
   expires some number of times without reception of a LDP PDU from the
   remote system the LSR closes the session with its peer.

   When a LSR determines that it lost a LDP session with another LSR, if
   the LSR has any label bindings that were created as a result of
   receiving label binding requests from the peer, the LSR may destroy
   these bindings (and deallocate labels associated with these binding).

   When a LSR determines that it lost a LDP session with another LSR,
   the LSR shall no longer use the binding information it received from
   the other LSR.

   The procedures that govern when other components in a LSR invoke
   services from LDP and how a LSR maintains its TIBs are beyond the
   scope of this document.

   The use of LDP does not preclude the use of other mechanisms to
   distribute label binding information.

2.1. LDP and Label switching over ATM

   The label switching architecture [L] describes application of
   label switching to ATM, [B] provides more details and describes
   a number of features of LDP required specifically to support this ATM
   case. We describe control circuit usage and encapsulation here.  The
   sections on LDP_PIE_BIND and LDP_PIE_REQUEST_BIND describe how 'Hop
   Count' referred to in [B] is carried.

2.1.1. Default VPI/VCI

   By default the LDP connection between two ATM-LSRs uses VPI/VCI 0/32.
   The default LDP connection uses the LLC/SNAP encapsulation defined in
   RFC 1483 [15]. This LDP VC may be used to exchange other
   LLC/SNAP encapsulated traffic. In particular the LDP VC might be used
   to carry  Network Layer routing information. There are circumstances
   (see ATM_TAG_RANGE) when this VC is also used to carry data traffic.

   LDP provides means to advertise the range of, and negotiate the
   encapsulation used on, the data VCs. See the section on LDP_PIE_OPEN
   for further details.

   Cooperating LSRs may agree to use VPI/VCI other than 0/32 as the LDP
   VC, how they do this (management) is outside the scope of this
   document.

3. State machines

   We describe the LDP's  behavior in terms of a state machine. We
   define the LDP state machine to have four possible states and present
   the behavior as a state transition table and diagram.

3.1. LDP state transition table

           STATE           EVENT                           NEW STATE

                           Initialization                  INITIALIZED

           INITIALIZED     Sent     LDP_PIE_OPEN           OPENSENT
                           Received LDP_PIE_OPEN           OPENREC

           OPENREC         Received LDP_PIE_KEEP_ALIVE     OPERATIONAL
                           Received Any other LDP PDU      INITIALIZED

           OPENSENT        Received LDP_PIE_OPEN   &
                           Transmit LDP_PIE_KEEP_ALIVE     OPENREC
                           Received Any other LDP PDU      INITIALIZED
                           Sent     LDP_PIE_NOTIFICATION   INITIALIZED

           OPERATIONAL     Rx/Tx    LDP_PIE_NOTIFICATION
                                    with CLOSING parameter INITIALIZED
                           Other    LDP PDUs               OPERATIONAL
                           Timeout                         INITIALIZED

3.2. LDP state transition diagram

                                         ---
                                        |   | All LDP PIEs except PIE_OPEN
                                        V   |
                                 ---------------
                 Rx Ao PDU       |              |<--------------------------
                 Tx NOTIFICATION |              |                          |
                  -------------->| INITIALIZED  |                          |
                 |             --|              |---                       |
                 |            |   -------------    |                       |
                 |            |                    |                       |
                 |            |Rx PIE_OPEN &       |Tx PIE_OPEN            |
                 |            |(Tx OPEN            |                       |
                 |            |Tx KEEP_ALIVE)      |                       |
                 |            V                    V                       |
                 |      ---------             ----------                   |
                 |     |         |           |          |                  |
                  -----| OPENREC |           | OPENSENT | ---------------- |
                  -----|         |           |          | Rx Ao PDU        ^
                 |      ---------             ----------  Tx NOTIFICATION  |
                 |            ^                 |                          |
                 |            |Rx PIE_OPEN & Tx |                          |
                 |            |KEEP_ALIVE       |                          |
                 |              ----------------                           |
                 |Rx                                                       |
                 |PIE_KEEP_ALIVE                                           |
                 |         ------------                                    |
                  ------->|            |                                   |
                          | OPERATIONAL|                                   |
                          |            |-----------------------------------
                           ------------      R/Tx NOTIFICATION  with CLOSE
                       All other | ^         or TIMEOUT
                        LDP PDUs | |
                                 |_|

3.3. Transport connections

   A LSR that implements LDP opens a TCP connection to a peer LSR. Once
   open, and regardless of which LSR opened it, the TCP connection is
   used bidirectionally. That is, there is only one TCP 'connection' used
   for a LDP session between two LSRs. LDP uses TCP port (TBD).

3.4. Timeout

   Timeout in the state transition table and diagram indicates that the
   keep alive timer set to HOLD_TIME has expired. See LDP_PIE_OPEN for a
   discussion of this mechanism.
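
   The transition table of section 3.1 can be rendered directly as a
   lookup table; the following Python fragment is a reading aid only,
   with informal event names, not a normative definition:

      TRANSITIONS = {
          ("INITIALIZED", "sent OPEN"):            "OPENSENT",
          ("INITIALIZED", "received OPEN"):        "OPENREC",
          ("OPENREC",     "received KEEP_ALIVE"):  "OPERATIONAL",
          ("OPENREC",     "received other PDU"):   "INITIALIZED",
          ("OPENSENT",    "received OPEN"):        "OPENREC",
          ("OPENSENT",    "received other PDU"):   "INITIALIZED",
          ("OPENSENT",    "sent NOTIFICATION"):    "INITIALIZED",
          ("OPERATIONAL", "NOTIFICATION closing"): "INITIALIZED",
          ("OPERATIONAL", "timeout"):              "INITIALIZED",
          ("OPERATIONAL", "other PDU"):            "OPERATIONAL",
      }

      def next_state(state, event):
          return TRANSITIONS[(state, event)]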

4. Protocol Data Units (PDUs)

   LDP PDUs are variable length and consist of a fixed header and one or
   more Protocol Information Elements (PIE) each with a Type Length
   Value (TLV) structure. Within a single PIE TLVs may be nested to an
   arbitrary depth.

   A single LDP PDU may contain multiple PIEs.  The maximum LDP PDU size
   is 4096 octets.

4.1. LDP Fixed Header

   The fixed header of the LDP PDU is:

            0                   1                   2                   3
            0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
           +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
           |  Version                      |         LENGTH                |
           +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
           |                         LDP Identifier                        |
           +                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
           |                               |          Res                  |
           +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Version:

        This two octet unsigned integer contains the version number of
        the protocol.  A LDP version number must lie in the range  0x01
        <= Version <= 0xFF. This version of the LDP specification speci-
        fies protocol Version = 1.

   LENGTH:

     This two octet integer specifies the length in octets of the data
     portion of the PDU. LENGTH is set to the length of the PDU in
     octets minus four.

   LDP Identifier:

     Six octet unsigned integer containing a unique identifier for the
     LSR that generated the PDU. The value of this Identifier is deter-
     mined  on startup. The first four octets encode an IP address
     assigned to the LSR. The last two octets represent the 'instance'
     of LDP on the LSR. A LSR with only one active LDP session would
     supply the value zero in this field.

   Res:

     This field is reserved. It must be set to zero on transmission and
     must be ignored on receipt.
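
   The fixed header can be encoded in twelve octets exactly as drawn
   above; the following Python sketch assumes network byte order and an
   IPv4-derived LDP Identifier:

      import socket
      import struct

      def pack_fixed_header(version, pdu_length, lsr_ip, instance):
          # Version (2) | LENGTH (2) | LDP Identifier (4 + 2) | Res (2).
          # LENGTH is the PDU length in octets minus four; Res is zero.
          return struct.pack("!HH4sHH", version, pdu_length - 4,
                             socket.inet_aton(lsr_ip), instance, 0)

      header = pack_fixed_header(1, 29, "10.0.0.1", 0)  # 12 octets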

4.2. LDP TLVs

   The LDP fixed header frames Protocol  Information Elements (PIEs)
   that have a Type Length Value (TLV) structure.

   In this protocol TYPE  is a 16  bit integer value that encodes how
   the VALUE field is to be interpreted. Within a single PIE TLVs may be
   nested to an arbitrary depth. An LDP implementation must silently
   discard TLVs that
   it does not recognize.

   LENGTH is an unsigned  16 bit integer value that encodes the length
   of the VALUE field in octets. LENGTH is set to the length of the
   whole TLV in octets minus four. A LENGTH of zero indicates that there
   is no value field present.

   VALUE is an octet string of length LENGTH octets that encodes infor-
   mation the semantics of which are indicated by the TYPE field.

   A single TLV has the following format:

            0                   1                   2                   3
            0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
           +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
           |     TYPE                      |      LENGTH                   |
           +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
           |          Value   -- length as given by 'LENGTH' field
           +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
           |..............
           +-+-+-+-+-+-+-+-+
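
   The encoding rules above amount to the following sketch (Python,
   with network byte order assumed):

      import struct

      def pack_tlv(tlv_type, value):
          # LENGTH counts only the VALUE field: the whole TLV minus four.
          return struct.pack("!HH", tlv_type, len(value)) + value

      def unpack_tlv(buf):
          # Returns the type, the value, and any remaining octets.
          tlv_type, length = struct.unpack("!HH", buf[:4])
          return tlv_type, buf[4:4 + length], buf[4 + length:]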

4.3. Example LDP PDU

   A complete LDP PDU containing two PIEs having 4 and 5 octets of Value
   field respectively would  have the following structure:

            0                   1                   2                   3
            0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
           +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
           |  Version                      |         LENGTH = 25           |
           +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
           |                         LDP Identifier                        |
           +                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
           |                               |          Res                  |
           +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
           |     TYPE                      |      LENGTH = 4               |
           +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
           |                         Value                                 |
           +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
           |     TYPE                      |      LENGTH = 5               |
           +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
           |                         Value
           +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
                           |
           +-+-+-+-+-+-+-+-+
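
   The LENGTH = 25 value can be checked by arithmetic: a 12 octet fixed
   header plus TLV-framed PIEs of 4 + 4 and 4 + 5 octets gives a 29
   octet PDU, and LENGTH is the PDU length minus four:

      pdu_length = 12 + (4 + 4) + (4 + 5)   # header + two PIEs = 29
      assert pdu_length - 4 == 25           # the LENGTH field shown above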

4.4. PIEs defined in V1 of LDP

   The following PIEs are defined for this version of the protocol. They
   are described in the sections that follow.

                   Type 0x100 LDP_PIE_OPEN
                   Type 0x200 LDP_PIE_BIND
                   Type 0x300 LDP_PIE_REQUEST_BIND
                   Type 0x400 LDP_PIE_WITHDRAW_BIND
                   Type 0x500 LDP_PIE_KEEP_ALIVE
                   Type 0x600 LDP_PIE_NOTIFICATION
                   Type 0x700 LDP_PIE_RELEASE_BIND
                   Type 0x800 Unassigned
                   ..........
                   Type 0xFF00

   Each of these PIEs may have optional TLV encoded parameters.

4.5. LDP_PIE_OPEN

   LDP_PIE_OPEN is the first PIE sent by a LSR initiating a LDP session
   to its peer. It is sent immediately after the TCP connection has been
   opened. The LSR receiving a LDP_PIE_OPEN responds either with a
   LDP_PIE_KEEP_ALIVE or with a LDP_PIE_NOTIFICATION.

4.5.1. Initiating a LDP session

   A LSR initiating a LDP session sets the LDP_PIE_OPEN's fields as
   described below and issues a PDU containing it to the target peer;
   the LDP state machine then transitions to the OPENSENT state.

   While in the OPENSENT state a LSR takes the following actions:

      If it receives an 'acceptable' LDP_PIE_OPEN then the LSR sends a
      LDP_PIE_KEEP_ALIVE and the LDP state machine transitions to the
      OPENREC state.

     Receipt of any other PDU is an error and results in sending a
     LDP_PIE_NOTIFICATION indicating a bad open and transition to the
     INITIALIZED state.

4.5.2. Passive OPEN

   A LSR in the INITIALIZED state that receives a LDP_PIE_OPEN behaves
   as follows:

     If it can  support the version of the protocol proposed by the LSR
     that issued the LDP_PIE_OPEN then it sets Version in all its subse-
     quent communication with that LSR to the value proposed in Prop Ver
     and obeys the rules specified for that version of the protocol.

      The LSR sends a PDU containing a LDP_PIE_OPEN PIE to the LSR that
      initiated the LDP session.

      The LSR sends a PDU containing a LDP_PIE_KEEP_ALIVE PIE to the LSR
      that initiated the LDP session.

      The LDP state machine transitions to the OPENREC state.

     If the LSR  cannot support the version of the protocol proposed in
     the LDP_PIE_OPEN then it sends a LDP_PIE_NOTIFICATION PDU that
     informs the LSR which generated the PIE_OPEN of the version(s) it
     can support. The LDP state machine transitions to the INITIALIZED
     state. See below under errors for more details.

4.5.3. OPENREC state

   When in the OPENREC state a LSR takes the following actions:

      If a LDP_PIE_KEEP_ALIVE is received then it transitions to the
     OPERATIONAL state.

     Receipt of any other PDU causes the generation of a
     LDP_PIE_NOTIFICATION and transition to the INITIALIZED state.

   The LDP_PIE_OPEN has the following format:

            0                   1                   2                   3
            0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
           +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
           |     TYPE (0x100)              |      LENGTH                   |
           +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
           |        Prop Ver               |      Hold Time                |
           +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
           |                       Optional Parameters                     |
           |                       (Variable  Length)                      |
           +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   TYPE:

     Type field as described above. Set to 0x100 for LDP_PIE_OPEN.

   LENGTH:

     Length in octets of the  value field of this PIE. LENGTH  is set to
     the length of the whole PIE in octets minus four.

   Prop Ver:

     The Version of the LDP that the LSR that generated this PDU pro-
     poses be used for this LDP session once it is established.  Note
     that the session is not established until the LSR that issues a
     LDP_PIE_OPEN receives a LDP_PIE_OPEN in response.

   Hold Time:

     Two octet unsigned non zero integer that indicates the number of
     seconds that the peer initiating the connection proposes for the
     value of the Hold Timer.  Upon receipt of a PDU with PIE
      LDP_PIE_OPEN, a LDP peer MUST calculate the value of the Hold
     Timer by using the smaller of its configured HOLD_TIME and the
     HOLD_TIME received in the PDU.  The value chosen for HOLD_TIME
     indicates the maximum number of seconds that may elapse between the
     receipt of successive PDUs from the LDP peer. The Hold Timer is
     reset each time a LDP_PDU arrives.  If the timer expires without
     the arrival of a LDP_PDU then a LDP_NOTIFICATION with the optional
     parameter CLOSING is sent.
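
      In code, the negotiation reduces to taking a minimum (a sketch;
      both values are assumed to have already been validated as
      non-zero):

         def negotiated_hold_time(configured, received):
             # Each peer uses the smaller of its configured HOLD_TIME
             # and the Hold Time received in the LDP_PIE_OPEN.
             return min(configured, received)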

   Optional Parameters:

     This variable length field contains zero or more optional PIEs sup-
     plied in TLV structures.

             +-----------------------+----------+--------+-----------+
             | OPTIONAL PARAMETER    | Type     | Length | Value     |
             +-----------------------+----------+--------+-----------+
             | DOWNSTREAM_ON_DEMAND  | 0x101    |   0    |   0       |
             +-----------------------+----------+--------+-----------+
             | ATM_TAG_RANGE         | 0x102    |Variable|See below  |
             +-----------------------+----------+--------+-----------+
              | ATM_NULL_ENCAPSULATION| 0x103    |   0    |   0       |
             +-----------------------+----------+--------+-----------+

          DOWNSTREAM_ON_DEMAND:

          A LSR may supply this optional parameter to indicate that it
          wishes to use downstream label allocation on demand. When either
          of the peers in a LDP session indicates that it requires down-
          stream allocation on demand then both shall use that mechan-
          ism. LSRs operating in downstream on demand provide bindings
          only in response to LDP_PIE_REQUEST_BINDs.

          ATM_TAG_RANGE:

          An ATM-LSR supplies this parameter to indicate to its ATM peer
          the range of VCIs that it can use as labels (on this VP). An ATM
           LSR, when satisfying a LDP_PIE_REQUEST_BIND, may only generate
           VCI/prefix bindings, i.e., bindings of BLIST_TYPE 6, containing
          VCI values from the range communicated to it using this
          optional parameter.

          If an ATM-LSR is unable to generate a BLIST_TYPE 6 binding
          within the constraints imposed by ATM_TAG_RANGE it may gen-
          erate a binding of BLIST_TYPE 2.[In that case the LSR receiv-
          ing the binding sends data traffic on the default LDP VCI but
          labeled with the BLIST_TYPE 2 label]

          The value for this optional parameter is a list of entries of
          the following form:

              0                   1                   2                   3
              0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
             +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
             |               VPI                                             |
             +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
             |               VCI Upper range bound                           |
             +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
             |               VCI Lower range bound                           |
             +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

          VPI:

           32 bit unsigned integer encoding the VPI to which the
           following VCI range bounds apply.

          VCI Upper range bound:

          32 bit unsigned integer encoding the upper bound of a block of
          VCIs that the ATM_LSR originating the LDP_PIE_OPEN is making
          available as labels. VCI values between and including Upper and
          Lower range bound may be used as labels.

          VCI Lower range bound:

          32 bit unsigned integer encoding the lower bound of a block of
          VCIs that the ATM_LSR originating the LDP_PIE_OPEN is making
          available as labels. VCI values between and including Upper and
          Lower range bound may be used as labels.

          The number of entries may be deduced from the value in the
          Length field. VCI labels may be allocated from the range indi-
          cated by the upper/lower values inclusive of those values.
           There must be at least one entry; there may be more than one,
           and more than one entry may carry the same VPI value.
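
           Since each entry is three 32 bit fields, the value can be
           parsed as follows (an illustrative Python sketch; the entry
           count is deduced from the Length field as described above):

              import struct

              def parse_tag_ranges(value):
                  assert len(value) % 12 == 0 and len(value) >= 12
                  entries = []
                  for off in range(0, len(value), 12):
                      vpi, upper, lower = struct.unpack_from(
                          "!III", value, off)
                      # VCIs lower..upper inclusive are usable as labels.
                      entries.append((vpi, lower, upper))
                  return entries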

          ATM_NULL_ENCAPSULATION:

          An ATM-LSR supplies this parameter to indicate that it sup-
          ports the null encapsulation of RFC 1483 [15] for its
          data VCs. In this case IP packets are carried directly inside
           AAL5 frames. This option is only used by an ATM-LSR that is
           configured to support a single level of labeling. See [B]
          for more details.

          An ATM-LSR that cannot support this option will generate the
          error LDP_WRONG_ENCAPS.

4.5.4. Errors

   All Errors generated by the receipt of a LDP_PIE_OPEN are reported by
   issuing a LDP_PIE_NOTIFICATION.  The value field of the PIE contains
   one or more TLVs describing individual errors with more precision.

           +--------------------------+----------+--------+------------+
           | Error                    | Type     | Length | Value      |
           +--------------------------+----------+--------+------------+
           | LDP_OPEN_UNSUPPORTED_VER | 0x1F0    |   Var  | See below  |
           +--------------------------+----------+--------+------------+
           | LDP_BAD_OPEN             | 0x1F1    |   0    |   0        |
           +--------------------------+----------+--------+------------+
           | LDP_WRONG_ENCAPS         | 0x1F2    |   0    |   0        |
           +--------------------------+----------+--------+------------+

4.5.4.1. LDP_OPEN_UNSUPPORTED_VER:

   This error is issued to indicate to the LSR that generated the
   LDP_PIE_OPEN that this LSR does not support the version of LDP pro-
   posed in 'Prop Ver' in the PIE_OPEN. LDP_OPEN_UNSUPPORTED_VER reports
   the version(s) of the protocol that this LSR does support.

   A LSR that receives this error may choose to reissue the LDP_PIE_OPEN
   specifying a version of the protocol that the target system has
   indicated it can support. If a LSR is to take this action it should
   not close (and reopen) the TCP connection before so doing but should
   leave the connection 'up' during the negotiation process.

   A LSR that generates this error should anticipate that the other sys-
   tem may reissue the LDP_PIE_OPEN and should wait at least
   TRANSPORT_HOLDDOWN seconds (default 30) before it closes the TCP
   connection. The TRANSPORT_HOLDDOWN timer is started when a
   LDP_PIE_NOTIFICATION containing LDP_OPEN_UNSUPPORTED_VER is sent and
   is reset on reception of a LDP_PIE_OPEN. These measures are designed
   to stop the version negotiation mechanism 'thrashing' the transport
   setup mechanism.

   TYPE:

     LDP_OPEN_UNSUPPORTED_VER = 0x1F0

   LENGTH:

     Length in octets of the  value field of this PIE. LENGTH  is set to
     the length of the whole PIE in octets minus four.

   VALUE:

     One or more 2 octet integers that encode the Version(s) of the pro-
     tocol that this LSR supports.

   The format of a NOTIFICATION PIE containing LDP_OPEN_UNSUPPORTED_VER
   is:

            0                   1                   2                   3
            0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
           +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
           |    LDP_PIE_NOTIFICATION       |      LENGTH                   |
           +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
           |    LDP_OPEN_UNSUPPORTED_VER   |      LENGTH                   |
           +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
           | Supported version(s) ............
           +-+-+-+-+-+-+-+-+

4.5.4.2. LDP_BAD_OPEN

   This error is issued to indicate failure during the open phase.

4.5.4.3. LDP_WRONG_ENCAPS

   This error is used to indicate that an ATM-LSR will not support the
   null encapsulation proposed in the LDP_PIE_OPEN (by the inclusion of
   the option ATM_NULL_ENCAPSULATION).

4.6. LDP_PIE_BIND

   LDP_PIE_BIND is sent from one LSR to another to distribute label bind-
   ings. Transmission of a LDP_PIE_BIND may occur as a result of some
   local decision or it may be in response to the reception of a
   LDP_PIE_REQUEST_BIND.

   This PIE has the following format:

            0                   1                   2                   3
            0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
           +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
           |     TYPE (0x200)              |      LENGTH                   |
           +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
           |                Request ID                                     |
           +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
           |           AFAM                |      BLIST_TYPE               |
           +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
           |           BLIST_LENGTH        |                               |
           +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+     BINDING_LIST              |
           |           Variable length list consisting of one or more      |
           |           BLIST entries ....                                  |
           +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
           |                                                               |
           |                       Optional Parameters                     |
           |                       (Variable  Length)                      |
           +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   TYPE:

     Type field as described above. Set to 0x200 for LDP_PIE_BIND.

   LENGTH:

     Length in octets of the  value field of this PIE. LENGTH  is set to
     the length of the whole PIE in octets minus four.

   Request ID:

     If this LDP_PIE_BIND is generated in response to a
      LDP_PIE_REQUEST_BIND then the LSR places the value of the Request
      ID
     from that request PIE in this field. For all other LDP_PIE_BINDS
     this field must be set to zero.

   AFAM:

     This 16 bit integer contains a value from ADDRESS FAMILY NUMBERS
     in Assigned Numbers [N] that encodes the address family that
     the network layer address in the label bindings in the BINDING_LIST
     is from. This protocol provides support for multiple network
     address families.

   BLIST_TYPE:

     This 16 bit integer contains a value from the table below that
     encodes the format and semantics of the BLIST entries in the
     BINDING_LIST field.

             BLIST_TYPE      BLIST entry format
               0             Null list (see LDP_PIE_WITHDRAW_BIND)
               1             32 bit Upstream assigned
               2             32 bit Downstream assigned
               3             32 bit Multicast Upstream assigned (*,G)
               4             32 bit Multicast Upstream assigned (S,G)
               5             32 bit Upstream assigned VCI label
               6             32 bit Downstream assigned VCI label

     The formats are defined below.

   BLIST_LENGTH:

     Two octet unsigned integer that encodes the length of the
     BINDING_LIST

   BINDING_LIST:

     Variable length field consisting of one or more BLIST entries of
     the type indicated by BLIST_TYPE.

   Optional Parameters:

     This variable length field contains zero or more optional PIEs sup-
     plied in TLV structures.

4.6.1. BLIST_TYPE 0

   BLIST_TYPE = 0 indicates that there are no BLIST entries. See
   LDP_PIE_WITHDRAW_BIND for further details.

4.6.2. BLIST_TYPE 1 and 2

   A BLIST_TYPE 1 entry contains Upstream assigned labels.  A LSR must
   only include label values in a BLIST_TYPE 1 entry that lie within
   the range, inclusive of its bounds, that the LSR to whom the
   LDP_PIE_BIND is being sent indicated it could support during the OPEN
   phase.

   BLIST entries of type 1 and 2  have the following format.

            0                   1                   2                   3
            0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
           +-+-+-+-+-+-+-+-+
           |  Precedence   |
           +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
           |       Label                                                   |
           +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
           |  Pre Len      |     Prefix  (length variable )
           +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+............................

   Precedence

     8 bit unsigned integer containing the precedence with which traffic
     bearing this label will be serviced by the LSR that issued the
     LDP_PIE_BIND. [Note that the precedence is likely to be restricted
     to perhaps three bits of the space reserved here.]

   Label:

     Label is a 32 bit unsigned integer encoding the value of the label.

   Pre Len:

     This one octet unsigned integer contains the length in bits of the
     address prefix that follows.

   Prefix:

     A variable length field containing an address prefix whose length,
     in bits, was specified in the previous (Pre Len) field. A Prefix is
     padded with sufficient trailing zero bits  to cause the end of the
     field to fall on an octet boundary.
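
   An encoder for a single type 1 or 2 BLIST entry, showing the octet
   boundary padding of the prefix, might look as follows (a sketch; the
   prefix is assumed to be supplied as raw address bytes):

      import struct

      def pack_blist_entry(precedence, label, pre_len, prefix_bytes):
          octets = (pre_len + 7) // 8      # octets needed for pre_len bits
          padded = prefix_bytes[:octets].ljust(octets, b"\x00")
          return struct.pack("!BIB", precedence, label, pre_len) + padded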

4.6.3. BLIST_TYPE 3

   This binding  allows the association of a label with  the (*,G) shared
   tree.  See [M] for a discussion of (*,G) shared trees.

   The (*,G) binding has the following format:

            0                   1                   2                   3
            0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
           +-+-+-+-+-+-+-+-+
           |  Precedence   |
           +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
           |       Label                                                   |
           +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
           |       Multicast Group Address G                               |
           +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Precedence

     8 bit unsigned integer containing the precedence with which traffic
     bearing this label will be serviced by the LSR that issued the
     LDP_PIE_BIND. [Note that the precedence is likely to be restricted
     to perhaps three bits of the space reserved here.]

   Label:

     Label is a 32 bit unsigned integer encoding the value of the label.

   Multicast Group Address G:

     Multicast Group Address. The length of this address is network
     layer specific and can be deduced from the value of AFAM. The
     diagram above illustrates a four octet IPv4 address format.

4.6.4. BLIST_TYPE 4

   This binding type  allows association of a label with a (S,G) source
   rooted tree. See [M] for a discussion of (S,G) trees.

   The (S,G) binding has the following format:

            0                   1                   2                   3
            0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
           +-+-+-+-+-+-+-+-+
           |  Precedence   |
           +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
           |                       Label                                   |
           +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
           |                       Source Address S                        |
           +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
           |                       Multicast Group Address G               |
           +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Precedence

     8 bit unsigned integer containing the precedence with which traffic
     bearing this label will be serviced by the LSR that issued the
     LDP_PIE_BIND. [Note that the precedence is likely to be restricted
     to perhaps three bits of the space reserved here.]

   Label:

     Label is a 32 bit unsigned integer encoding the value of the label.

   Source Address S:

     Network Layer address of the  source sending to the G tree. The
     length of this address is network layer specific and can be deduced
     from the value of AFAM. The diagram above illustrates a four octet
     IPv4 address format.

   Multicast Group Address G:

     Network Layer Multicast group address.  The length of this address
     is network layer specific and can be deduced from the value of
     AFAM. The diagram above illustrates a four octet IPv4 address for-
     mat.

4.6.5. BLIST_TYPE 5 and 6

   BLIST entries of type 5 and 6  have the following format.

            0                   1                   2                   3
            0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
           +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
           |  Precedence   |      HC       |
           +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
           |       Label                                                   |
           +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
           |  Pre Len      |     Prefix  (length variable )
           +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+............................

   Precedence:

     8 bit unsigned integer containing the precedence with which traffic
     bearing this label will be serviced by the LSR that issued the
     LDP_PIE_BIND. [Note that the precedence is likely to be restricted
     to perhaps three bits of the space reserved here.]

   HC:

     Hop count. See [B] for a detailed description.

   Label:

     Label is a 32 bit signed integer encoding the value of the label. (See
     section 2.1).

   Pre Len:

     This one octet unsigned integer contains the length in bits of the
     address prefix that follows.

   Prefix:

     A variable length field containing an address prefix whose length,
     in bits, was specified in the previous (Pre Len) field. A Prefix is
     padded with sufficient trailing zero bits  to cause the end of the
     field to fall on an octet boundary.
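
   A similar non-normative sketch for a type 5 or 6 entry shows the
   trailing zero-bit padding of the Prefix field; the names are again
   illustrative and contiguous packing is assumed.

      import struct

      def pack_blist_type5(precedence, hop_count, label, prefix,
                           prefix_len):
          # prefix is an unsigned integer holding prefix_len significant
          # bits; trailing zero bits pad the field to an octet boundary,
          # as required above.
          n = (prefix_len + 7) // 8
          padded = prefix << (n * 8 - prefix_len)
          head = struct.pack("!BBIB", precedence, hop_count, label,
                             prefix_len)
          return head + padded.to_bytes(n, "big")

      # A 20 bit prefix occupies three octets with four zero pad bits.
      entry = pack_blist_type5(0, 3, 17, 0xC0A80, 20)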

4.7. LDP_PIE_REQUEST_BIND

   LDP_PIE_REQUEST_BIND is sent from a LSR to a peer to request a bind-
   ing for one or more specific NLRIs, or to request all the bindings
   that its peer has.

   A LSR receiving a LDP_PIE_REQUEST_BIND must respond with a
   LDP_PIE_BIND or with a LDP_PIE_NOTIFICATION. A LSR that issues a
   LDP_PIE_BIND in response to a LDP_PIE_REQUEST_BIND places the Request
   ID from LDP_PIE_REQUEST_BIND in the Request ID field in the
   LDP_PIE_BIND that it issues.

   When a LSR receiving a LDP_PIE_REQUEST_BIND is unable to satisfy it
   because of resource limitations, it issues a LDP_PIE_NOTIFICATION
   for RESOURCE_LIMIT containing the Request ID from the
   LDP_PIE_REQUEST_BIND.

   A LSR that issues a LDP_PIE_NOTIFICATION with RESOURCE_LIMIT set
   must send a subsequent LDP_PIE_NOTIFICATION, containing the status
   notification RESOURCES, to the peer to whom it previously sent that
   LDP_PIE_NOTIFICATION, once it again has resources available to
   satisfy further LDP_PIE_REQUEST_BINDs from that peer.

   If a LDP_PIE_NOTIFICATION is received containing RESOURCE_LIMIT the
   LSR may not issue further LDP_PIE_REQUEST_BINDs until it receives a
   LDP_PIE_NOTIFICATION with the Optional parameter RESOURCES.

   A LSR may receive a LDP_PIE_REQUEST_BIND for a prefix for which there
   is no entry in its routing information base (RIB). If this occurs the
   LSR issues a LDP_PIE_NOTIFICATION containing the Optional parameter
   NO_ROUTE. The value field of the NO_ROUTE parameter contains the
   prefix(es) for which no entry was found in the RIB.

   The procedures to be employed by a LSR that receives a
   LDP_PIE_NOTIFICATION with the optional parameter NO_ROUTE are outside
   the scope of this specification.

   A LSR may issue both a LDP_PIE_BIND and a LDP_PIE_NOTIFICATION
   containing RESOURCE_LIMIT or NO_ROUTE in response to a single
   LDP_PIE_REQUEST_BIND.  A LSR must satisfy as much of a
   LDP_PIE_REQUEST_BIND as it can; it may not ignore the remaining
   prefixes in a LDP_PIE_REQUEST_BIND on encountering an error with one
   prefix.
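
   The receiver behaviour described above can be summarized in the
   following non-normative Python sketch; the function and its
   arguments are illustrative, not part of this protocol.

      def handle_request_bind(rib, free_labels, prefixes):
          # rib maps prefix -> route entry; free_labels is the pool of
          # unassigned labels.  Returns the bindings made, the prefixes
          # with no RIB entry (reported via NO_ROUTE), and whether
          # RESOURCE_LIMIT must be reported.  Note that an error with
          # one prefix does not stop processing of the others.
          bindings, no_route, limited = [], [], False
          for prefix in prefixes:
              if prefix not in rib:
                  no_route.append(prefix)
                  continue
              if not free_labels:
                  limited = True
                  break
              bindings.append((prefix, free_labels.pop()))
          return bindings, no_route, limited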

   This PIE has the following format:

            0                   1                   2                   3
            0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
           +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
           |     TYPE (0x300)              |      LENGTH                   |
           +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
           |           Request ID                                          |
           +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
           |           AFAM                |      ALIST_TYPE               |
           +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
           |           ALIST_LENGTH        |                               |
           +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               |
           |           ADDR_LIST                                           |
           |           Variable length list consisting of one or           |
           |           more ALIST entries                                  |
           +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
           |                                                               |
           |                       Optional Parameters                     |
           |                       (Variable  Length)                      |
           +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   TYPE:

     Type field as described above. Set to 0x300 for
     LDP_PIE_REQUEST_BIND.

   LENGTH:

     Length in octets of the  value field of this PIE. LENGTH  is set to
     the length of the whole PIE in octets minus four.

   Request ID:

     This four octet unsigned integer contains a locally significant non
     zero value that a LSR uses to identify LDP_PIE_BINDs or
     LDP_PIE_NOTIFICATIONs that are generated in response to this
     request.

   AFAM:

      This 16 bit integer contains a value from ADDRESS FAMILY NUMBERS
      in Assigned Numbers [N] that encodes the address family of the
      network layer addresses carried in the ADDR_LIST. This version of
      LDP supports IPv4 and IPv6.

   ALIST_TYPE:

     This 16 bit integer contains a value from the table below that
     encodes the format of the ALIST entries in the ADDR_LIST field.
      Currently there are three values defined by this specification.

             ALIST_TYPE      ALIST entry format
               0             Null list
               1             Precedence followed by variable length NLRI
               2             Precedence, Hop Count followed by variable length NLRI

     The format for these entries is defined below.

   ALIST_LENGTH:

     Two octet unsigned integer that encodes the length in octets of the
     ADDR_LIST field.

   ADDR_LIST:

      A variable length list consisting of one or more entries of the
      format indicated by ALIST_TYPE.

   Optional Parameters:

     This variable length field contains zero or more optional PIEs sup-
     plied in TLV structures.
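
   By way of non-normative illustration, the following Python sketch
   packs this PIE.  It assumes that the ADDR_LIST and any optional
   parameters have already been encoded as octet strings; the function
   name is illustrative.

      import struct

      def pack_request_bind(request_id, afam, alist_type, addr_list,
                            optional=b""):
          # The LENGTH field covers the whole PIE minus the four octets
          # occupied by the TYPE and LENGTH fields themselves.
          value = struct.pack("!IHHH", request_id, afam, alist_type,
                              len(addr_list)) + addr_list + optional
          return struct.pack("!HH", 0x300, len(value)) + value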

4.7.1. ALIST formats

   ALIST_TYPE = 0 indicates a null list, i.e., there are no ALIST
   entries.  A LDP receiving a LDP_PIE_REQUEST_BIND with ALIST_TYPE set
   to 0 interprets this as an implicit request for all the bindings
   that it currently has.

   For ALIST_TYPE = 1 ALIST entries have  the following form:

            0                   1                   2                   3
            0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
           +-+-+-+-+-+-+-+-+
           |   Precedence  |
           +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
           |    Pre Len    |   Prefix  (length variable ....................
           +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
            .....................
           +-+-+-+-+-+-+-+-+-+-+-

   For ALIST_TYPE = 2 ALIST entries have  the following form:

            0                   1                   2                   3
            0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
           +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
           |   Precedence  |      HC       |
           +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
           |    Pre Len    |   Prefix  (length variable ....................
           +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
            .....................
           +-+-+-+-+-+-+-+-+-+-+-

   HC:

     Hop count.

   Precedence:

     This one octet unsigned integer encodes the precedence with which
     the requestor wants traffic to this prefix handled.

   Pre Len:

     This one octet unsigned integer contains the length in bits of the
     address prefix that follows.

   Prefix:

     A variable length field containing an address prefix whose length,
     in bits, was specified in the previous (Pre Len) field. A Prefix is
     padded with sufficient trailing zero bits  to cause the end of the
     field to fall on an octet boundary.
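
   A corresponding non-normative parser for the two non-null ALIST
   formats might look as follows; prefixes are returned as raw octets,
   and all names are illustrative.

      def parse_alist(alist_type, data):
          # Walks an ADDR_LIST of type 1 or 2 entries.  For type 1 the
          # hop count is absent and is returned as None; the prefix is
          # padded to an octet boundary as described above.
          entries, i = [], 0
          while i < len(data):
              precedence = data[i]; i += 1
              hc = None
              if alist_type == 2:
                  hc = data[i]; i += 1
              pre_len = data[i]; i += 1
              n = (pre_len + 7) // 8
              entries.append((precedence, hc, pre_len,
                              bytes(data[i:i + n])))
              i += n
          return entries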

4.7.2. Errors

   Errors are reported using LDP_PIE_NOTIFICATION.

           +-----------------------+----------+--------+--------------+
           | STATUS NOTIFICATION   | Type     | Length | Value        |
           +-----------------------+----------+--------+--------------+
           | RESOURCE_LIMIT        | 0x3F0    |   4    | Request ID   |
           +-----------------------+----------+--------+--------------+
           | RESOURCES             | 0x3F1    |   0    | 0            |
           +-----------------------+----------+--------+--------------+
           | HOP_COUNT_EQUALLED    | 0x3F2    |   Var  | See below    |
           +-----------------------+----------+--------+--------------+
           | NO_ROUTE              | 0x3F3    |   Var  | See below    |
           +-----------------------+----------+--------+--------------+

   RESOURCE_LIMIT:

   If the LSR is unable to provide a LDP_PIE_BIND in response to a
   request, the LSR indicates this by supplying the RESOURCE_LIMIT
   status notification as a parameter in the LDP_PIE_NOTIFICATION. The
   Request ID from the LDP_PIE_REQUEST_BIND is supplied in the Value
   field of this status notification.

   RESOURCES:

   A LSR that has sent  RESOURCE_LIMIT to a peer sends RESOURCES when
   that resource limit clears.

   HOP_COUNT_EQUALLED:

   An ATM-LSR that receives a LDP_PIE_REQUEST_BIND containing a
   HOP_COUNT that equals MAX_HOP_COUNT does not generate a binding but
   instead sends this error notification. The length is variable and the
   value returns the Request ID and the ALIST entry(ies) that caused the
   error in the following format.

            0                   1                   2                   3
            0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
           +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
           |       Request ID                                              |
           +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
           |       HC      | Precedence    | Pre Len       | Prefix ......
           +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
            (length variable) ....................
           +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-
           |       HC      | Precedence    | Pre Len       | Prefix ......
           +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
            (length variable) ....................
           +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-

   NO_ROUTE:

   A LSR that has no RIB entry for a prefix that it receives in a
   LDP_PIE_REQUEST_BIND issues a notification containing this parameter
   for the prefix(es) concerned. The value field of this parameter
   contains the Request_ID, AFAM, and ALIST_TYPE from the
   LDP_PIE_REQUEST_BIND and a suitably modified ALIST_LENGTH and
   ADDR_LIST in the following format.

   See section 4.7 for descriptions of the Request_ID, AFAM, ALIST_TYPE,
   ALIST_LENGTH and ADDR_LIST elements.

           +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
           |           Request ID                                          |
           +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
           |           AFAM                |      ALIST_TYPE               |
           +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
           |           ALIST_LENGTH        | Variable length list of       |
           +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ prefixes for which no RIB     |
           | entry exists in ADDR_LIST format..............................
           +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
           +-+-+-+-+

4.8. LDP_PIE_WITHDRAW_BIND

   LDP_PIE_WITHDRAW_BIND is issued by a LSR that originally provided a
   binding containing the label in question. It is an absolute
   instruction to the LSR that receives it that it may no longer use
   that label to forward traffic to the LSR issuing the
   LDP_PIE_WITHDRAW_BIND.

   This PIE has the following format.

            0                   1                   2                   3
            0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
           +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
           |     TYPE (0x400)              |      LENGTH                   |
           +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
           |           BLIST_TYPE          |      BLIST_LENGTH             |
           +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
           |           BINDING_LIST                                        |
           |           Variable length list consisting of one or           |
           |           more BLIST entries ....                             |
           +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
           |                                                               |
           |                       Optional Parameters                     |
           |                       (Variable  Length)                      |
           +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   TYPE:

     Type field as described above. Set to 0x400 for
     LDP_PIE_WITHDRAW_BIND.

   LENGTH:

     Length in octets of the  value field of this PIE. LENGTH  is set to
     the length of the whole PIE in octets minus four.

   BLIST_TYPE:

      This 16 bit integer encodes the format of the BLIST entries in the
      BINDING_LIST field. Possible values are defined in Section 4.6.  A
      LDP receiving this PIE with the BLIST_TYPE set to Null interprets
      it (based on the semantics) as either (a) an implicit instruction
      to WITHDRAW all bindings belonging to the peer that issued the
      PIE, or (b) an indication that all the bindings requested by the
      peer are no longer needed by the peer that issued the PIE.

   BLIST_LENGTH:

     This 16 bit unsigned integer encodes the length in octets of the
     BINDING_LIST.

   BINDING_LIST:

     Variable length field consisting of one or more BLIST entries of
     the type indicated by BLIST_TYPE. The format of these entries is
     defined in Section 4.6.

   Optional Parameters:

     This variable length field contains zero or more optional PIEs sup-
     plied in TLV structures.

4.9. LDP_PIE_RELEASE_BIND

   LDP_PIE_RELEASE_BIND is issued by a LSR that received a label as a
   consequence of an Upstream Request/downstream assignment sequence.
   It is an indication to the LSR that receives it that the LSR that
   requested the binding no longer needs that binding.

   This PIE has, with the exception of a different type value, exactly
   the same syntax as LDP_PIE_WITHDRAW_BIND.

            0                   1                   2                   3
            0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
           +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
           |     TYPE (0x700)              |      LENGTH                   |
           +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
           |           BLIST_TYPE          |      BLIST_LENGTH             |
           +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
           |           BINDING_LIST                                        |
           |           Variable length list consisting of one or           |
           |           more BLIST entries ....                             |
           +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
           |                                                               |
           |                       Optional Parameters                     |
           |                       (Variable  Length)                      |
           +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   See the discussion of LDP_PIE_WITHDRAW_BIND for details of the syn-
   tax.

   Optional Parameters:

     This variable length field contains zero or more optional PIEs sup-
     plied in TLV structures.

4.10. LDP_PIE_KEEP_ALIVE

   The Hold Timer mechanism described earlier in Sections 3 and 4 is
   reset every time a LDP_PDU is received. LDP_PIE_KEEP_ALIVE is pro-
   vided to allow reset of the Hold Timer in circumstances where a LDP
   has no other information to communicate to its peer.

   A LDP must arrange that its peer sees a LDP_PDU from it at least
   once every HOLD_TIME period. That PDU may be any other PDU from the
   protocol; in circumstances where there is no other PDU to send, it
   must be a LDP_PIE_KEEP_ALIVE.
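
   A non-normative sketch of such an arrangement follows.  The use of
   one third of HOLD_TIME as the transmission interval is an assumption
   of the sketch, not a requirement of this specification.

      import time

      class KeepAliveScheduler:
          # Ensures the peer sees some LDP_PDU at least every HOLD_TIME.
          def __init__(self, hold_time, send_keep_alive):
              self.interval = hold_time / 3.0
              self.send_keep_alive = send_keep_alive
              self.last_sent = time.monotonic()

          def note_pdu_sent(self):
              # Call whenever any LDP_PDU is transmitted to the peer.
              self.last_sent = time.monotonic()

          def tick(self):
              # Call periodically; emits LDP_PIE_KEEP_ALIVE only when
              # no other PDU has gone out recently.
              if time.monotonic() - self.last_sent >= self.interval:
                  self.send_keep_alive()
                  self.note_pdu_sent()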

   This PIE has the following format

            0                   1                   2                   3
            0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
           +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
           |     TYPE (0x500)              |      LENGTH                   |
           +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
           |                                                               |
           |                       Optional Parameters                     |
           |                       (Variable  Length)                      |
           +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   TYPE:

     Type field as described above. Set to 0x500 for LDP_PIE_KEEP_ALIVE.

   LENGTH:

     Length in octets of the  value field of this PIE. LENGTH  is set to
     the length of the whole PIE in octets minus four.

   Optional Parameters:

     This variable length field contains zero or more optional PIEs sup-
     plied in TLV structures.

4.11. LDP_PIE_NOTIFICATION

   LDP_PIE_NOTIFICATION is issued by LDP to inform its peer of a signi-
   ficant event. 'Significant events' include errors  and changes in LSR
   capabilities or operational state.

   All notification information is encoded as TLVs in the optional
   parameters field.

   This PIE has the following format

            0                   1                   2                   3
            0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
           +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
           |     TYPE (0x600)              |      LENGTH                   |
           +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
           |                                                               |
           |                       Optional Parameters                     |
           |                       (Variable  Length)                      |
           +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   TYPE:

     Type field as described above. Set to 0x600 for
     LDP_PIE_NOTIFICATION

   LENGTH:

     Length in octets of the  value field of this PIE. LENGTH  is set to
     the length of the whole PIE in octets minus four.

   Optional Parameters:

     This variable length field contains zero or more optional parame-
     ters supplied in TLV structures.

     The optional parameter types and their uses are:

          RETURNED_PDU:

          A LSR uses this parameter to return a PDU to the LSR that
          issued it.

                  +--------------------------+-------+--------+--------------+
                  | Optional Parameter       | Type  | Length | Value        |
                  +--------------------------+-------+--------+--------------+
                  | RETURNED_PDU             | 0x601 |  Var   | Peer's PDU   |
                  +--------------------------+-------+--------+--------------+

           As much as possible of the PDU being returned, including its
           header, is inserted into the Value field. The Length is set
           to the number of octets of the returned PDU that have been
           inserted into the Value field of this optional parameter.
           Implementations parsing RETURNED_PDU must be careful to
           recognize that the returned PDU may have been truncated.
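
           A non-normative sketch of constructing this parameter,
           assuming the same 16 bit type and 16 bit length TLV layout
           as the PIE headers; the truncation bound is illustrative.

              import struct

              def pack_returned_pdu(pdu, limit=1024):
                  # Copy as much of the offending PDU as fits; Length
                  # records how many octets were actually copied, so
                  # the returned PDU may be truncated.
                  value = pdu[:limit]
                  return struct.pack("!HH", 0x601, len(value)) + value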

          CLOSING: A LSR uses this parameter to indicate that it is ter-
          minating the LDP session.

                  +--------------------------+-------+--------+--------------+
                  | Optional Parameter       | Type  | Length | Value        |
                  +--------------------------+-------+--------+--------------+
                  | CLOSING                  | 0x602 |   0    |     0        |
                  +--------------------------+-------+--------+--------------+

          LDP may send a LDP_PIE_NOTIFICATION with CLOSING set in
          response to a protocol error or to administrative interven-
          tion.

          A LDP receiving or issuing this notification transitions to
          the INITIALIZED state.

          The following optional parameters are defined for returning
          errors from individual PIEs. See the description of the
          relevant PIEs for a complete description of the errors.

          LDP_PIE_OPEN:

                  +--------------------------+----------+
                  | Optional Parameter       | Type     |
                  +--------------------------+----------+
                  | LDP_OPEN_UNSUPPORTED_VER | 0x1F0    |
                  +--------------------------+----------+
                  | LDP_BAD_OPEN             | 0x1F1    |
                  +--------------------------+----------+
                  | LDP_WRONG_ENCAPS         | 0x1F2    |
                  +--------------------------+----------+

          LDP_PIE_REQUEST_BIND:

                  +--------------------------+----------+
                  | Optional Parameter       | Type     |
                  +--------------------------+----------+
                  | RESOURCE_LIMIT           | 0x3F0    |
                  +--------------------------+----------+
                  | RESOURCES                | 0x3F1    |
                  +--------------------------+----------+
                  | HOP_COUNT_EQUALLED       | 0x3F2    |
                  +--------------------------+----------+
                  | NO_ROUTE                 | 0x3F3    |
                  +--------------------------+----------+

                                        H. Esaki   (Toshiba Corp.)
                                        Y. Katsube (Toshiba Corp.)
                                        K. Nagami  (Toshiba Corp.)
                                        P. Doolan  (Cisco Systems Inc.)
                                        Y. Rekhter (Cisco Systems Inc.)

          IP Address Resolution and ATM Signaling for MPLS
                      over ATM SVC services

                  <draft-katsube-mpls-over-svc-00.txt>

Abstract

   Enabling interconnection of ATM Label Switching Routers (ATM-LSRs)
   over standard ATM switch networks is desirable in order to
   introduce MPLS technology into emerging ATM-LAN/WAN platforms.
   This draft describes how ATM-LSRs may interoperate with other
   ATM-LSRs over ATM UNI SVC services.  The two aspects of the problem
   that we address are ATM address (of target ATM-LSR) resolution and
   SVC to label binding.

1. Introduction

   Several architectural models for label switching have been proposed
   to the MPLS working group, e.g., [7][RFC2098][RFC2105]. One of the
   major applications of MPLS technology described in these proposals
   is ATM.

   Interconnecting ATM-LSRs over point-to-point links is straight-
   forward since it is analogous to conventional ATM switch networks
   except for the difference in the control protocol (ATM-UNI/NNI
   control plane vs. MPLS specific control plane). It is possible to
   build "Hybrid" ATM-LSRs that operate in a "Ships in the night" mode
   running both MPLS and ATM Forum control planes on the same link
   [ATM_TAG]. It is also possible to interconnect ATM-LSRs over a cloud
   of standard (non-MPLS) ATM switches.

   Interconnecting ATM-LSRs over a cloud of standard ATM switches using
   VP services is described in [ATM_TAG]. In this case, the VCIs within
   the VP are the labels, and again the MPLS control plane can manage
   them much as in the point-to-point link case.

   This document describes operations of MPLS over ATM SVC networks.
   One possible circumstance where this might be necessary is ATM-LSRs
   interconnected through ATM switches that support ATM SVC services
   (e.g., an ATM WAN cloud).

   Imagine two LSRs connected via such an ATM cloud. If one LSR decides,
   by normal L3 routing procedures, that it must forward traffic to the
   other, it opens an ATM SVC to that LSR. The question is "how does it
   obtain the ATM address of the target LSR to put in the SVC
   Setup?". When we answer this question, another immediately occurs:
   "when the SVC Setup arrives at the target LSR, how does that system
   know to which route or Forwarding Equivalence Class (FEC) [TAG_ARCH]
   this new SVC should be bound?".

   We provide answers to these questions below in sections 2 and 3
   respectively.  The diagram below shows a more general view of the
   application space.

    +---+                     +---+                     +---+
    |LSR| +-----+     +-----+ |LSR| +-----+     +-----+ |LSR|
   -|   |-|ATMSW|-...-|ATMSW|-|   |-|ATMSW|-...-|ATMSW|-|   |-
    +---+ +-----+     +-----+ +---+ +-----+     +-----+ +---+
        <--------------------->     <------------------->
          standard ATM link           standard ATM link
   <=========================================================>
              ATM cloud (or collections of ATM clouds)

    Fig.1  Label Switching Routers (LSRs) with standard ATM links

2  LSR ATM address selection

   An ATM-LSR, having made a determination that it must route traffic
   to another ATM-LSR, and finding, through some local procedure, that
   it must open an ATM SVC to that ATM-LSR, must select an appropriate
   ATM address to place in the SVC Setup message.

   The selection of the address may be based on configuration or on
   dynamic resolution. In either case the ATM-LSR has, from its routing
   table, an IP address for which it requires a corresponding ATM
   address.

2.1 Configuration

   It is possible to provide tools and procedures to configure an ATM-
   LSR with, for example, IP to ATM address translation tables. The
   control mechanisms in the LSR that are responsible for deciding to
   set up SVCs could use these tables to obtain the requisite ATM
   addresses for peer ATM-LSRs.

   This operation is analogous to existing point-to-point link
   operation over SONET links, and it may become common for LSRs
   connected over WAN ATM links.

2.2 Dynamic resolution

2.2.1 Classical IP over ATM LSRs

   If the ATM-LSR is configured to be able to reach an RFC1577 ARP
   server and if the IP address of the target LSR is on the same subnet
   then the LSR may employ ATM-ARP [RFC1577] to attempt to resolve the
   ATM address of the target ATM-LSR.

2.2.2 NHRP capable LSRs

   If the ATM-LSR is able to reach an NHRP server, by configuration,
   anycast or MPOA LE_ARP discovery [MPOA], then it may use NHRP to
   attempt to resolve the ATM address of the target ATM-LSR.

3  Binding of ATM SVCs and MPLS labels

   When an ATM-LSR has decided to open an SVC to its neighbor ATM-LSR
   and has determined the appropriate ATM address using the procedures
   in section 2, it uses the mechanisms described here to communicate
   the 'binding' between the SVC and a specific forwarding equivalence
   class (FEC) [TAG_ARCH]. The MPLS label value, which is the VPI/VCI
   or VCI in ATM, is not the same at both ends of an SVC.  Therefore,
   neighboring ATM-LSRs, when they are communicating with each other,
   cannot use the label as an identifier of the SVC; they must instead
   use another identifier that both ends of the SVC can interpret
   unambiguously. The binding between the unique identifier and the
   label is performed locally at individual ATM-LSRs.

   The basic binding mechanism for SVCs is to convey a unique
   identifier for an SVC in the BLLI IE of the SETUP message, and to
   convey the same identifier, together with the specific FEC, in the
   MPLS binding request and/or acknowledge messages. Some example
   procedures are described below.
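
   As a non-normative illustration, the downstream ATM-LSR's
   identifier-to-FEC table might be kept as follows; all names are
   illustrative.

      class SvcBindingTable:
          # The identifier handed out in the MPLS binding message is
          # later matched against the BLLI IE of the incoming SETUP.
          def __init__(self):
              self.by_id = {}

          def offer_binding(self, identifier, fec):
              # Record the FEC this identifier was offered for.
              self.by_id[identifier] = fec

          def accept_setup(self, blli_identifier, vci):
              # The identifier is needed only until the binding is
              # established, so it is dropped once the SETUP arrives;
              # the SVC's local label (the VCI) is then bound to the
              # FEC.
              fec = self.by_id.pop(blli_identifier)
              return fec, vci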

3.1 Support for Destination-based unicast routing over ATM SVCs

   Following the procedures outlined in [TAG_ATM], an ATM-LSR sends
   an MPLS message that requests a binding from its next-hop
   (downstream) ATM-LSR for the destination prefix contained in the
   request.  The downstream ATM-LSR that receives the request provides
   a binding that contains an identifier that is unique between itself
   and the upstream ATM-LSR. The upstream ATM-LSR, after it receives
   the binding, sets up an SVC to the downstream ATM-LSR using the ATM
   Forum signalling procedures, including the received identifier in
   the BLLI field of the Setup message. The downstream LSR accepting
   the SVC setup is able to determine, from the identifier value in the
   BLLI, the binding of this new SVC to the destination prefix.

   Other examples are possible; e.g., an SVC setup including the unique
   identifier in the BLLI field may precede the MPLS messages that
   exchange the binding between the SVC (identified by the value in the
   BLLI field) and the destination prefix.

   Whether the 7-bit BLLI space for the identifier is sufficient
   depends on how long the identifier value must be held for a given
   SVC, which in turn depends on the protocol used to maintain or
   remove bindings.  For instance, if removal of a binding is performed
   by releasing the related SVC with an ATM signaling RELEASE message,
   the unique identifier need be maintained only for the period between
   the time the binding procedure begins and the time the binding is
   established.
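
   For illustration, a non-normative sketch of a 128-value identifier
   pool managed on that assumption follows; keeping the value zero out
   of the pool anticipates the special meaning given to it in section
   3.2 and is otherwise an assumption of the sketch.

      class IdentifierPool:
          # Only bindings in progress consume identifiers; a value is
          # returned to the pool as soon as its binding is established.
          def __init__(self):
              self.free = set(range(1, 128))

          def allocate(self):
              return self.free.pop()    # raises KeyError when exhausted

          def release(self, identifier):
              self.free.add(identifier)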

3.2 Support for multicast over ATM SVCs

   In the case that PIM-SM is used as a multicast routing protocol,
   following the procedures outlined in [TAG_PIM], an ATM-LSR sends
   PIM_Join messages to upstream neighboring ATM-LSRs toward the RP
   for the shared-tree (*,G) or source for the source-tree (S,G).
   If the two ATM-LSRs are using SVCs, the downstream LSR will include
   a unique identifier for an SVC, instead of a label value, in the PIM
   Join message. When the upstream router receives the PIM Join, it
   will set up, using ATM signalling procedures, an SVC to the
   downstream ATM-LSR. The Upstream router sends the identifier it
   received in the PIM Join message in the BLLI IE of the SETUP
   message that it uses to establish the SVC. In this way the
   downstream ATM-LSR is able to associate the new SVC's label with
   the appropriate multicast tree. Once an SVC is set up for a group,
   subsequent PIM Join (refresh) messages include the value zero in the
   identifier field. This special value indicates to the next hop
   router toward the RP that an SVC already exists, and in this case
   that router simply refreshes its PIM state without initiating SVC
   setup.
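
   The upstream router's handling of the identifier field can be
   sketched non-normatively as follows; names are illustrative.

      def handle_pim_join(identifier, group, existing_svcs):
          # existing_svcs maps group -> SVC.  A zero identifier marks a
          # refresh for a tree whose SVC already exists; any other
          # value is echoed in the BLLI IE of the SETUP for a new SVC.
          if identifier == 0:
              assert group in existing_svcs  # refresh of PIM state only
              return None                    # no SVC setup initiated
          return identifier                  # BLLI value for the SETUP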

   The identifier employed for this purpose is likewise constrained by
   the available space in the BLLI to 7 bits. Whether this small space
   is sufficient depends on how long the identifier value must be held
   for a given SVC, which in turn depends on the protocol used to
   maintain or remove bindings. For instance, if the identifier is of
   significance only to the downstream LSR, and only during the period
   between the dispatch of the PIM Join to the upstream router and the
   completion of the SVC setup, this is not a major restriction.

   Other examples, e.g., the use of a multicast routing protocol other
   than PIM-SM, are possible and will be added in a later version of
   this draft.

                                                           Nancy Feldman
Expiration Date: September 1997                         Arun Viswanathan
                                                               IBM Corp.

                                                              March 1997

                           ARIS Specification

                    <draft-feldman-aris-spec-00.txt>

Abstract

   ARIS (Aggregate Route-Based IP Switching) adds the advantages of
   high-speed switching to a network based on routing protocols.  It
   provides a means of mapping network-layer routing information to
   link-layer switched paths, enabling datagrams to traverse a network
   at media speeds.  This memo defines the ARIS protocol and its
   mechanisms.

1. Introduction

   An Integrated Switch Router (ISR) is a switch that has been augmented
   with standard IP routing support.  The ARIS protocol establishes
   switched paths through a network of ISRs by mapping network-layer
   routing information directly to data-link layer switched paths.
   These switched paths may have an endpoint at a directly attached
   neighbor (comparable to IP hop-by-hop forwarding), or may have an
   endpoint at a network egress node, enabling switching via all
   intermediary nodes.

   A switched path is created for an "egress identifier", which
   identifies a routed path through a network. The egress identifiers
   may be extracted from information existing in the routing protocols,
   or may be configured. Routes that are populated in a router's
   forwarding table are extended to include a reference to an egress
   identifier with a corresponding switched path.  ARIS supports
   switched path granularity ranging from end-to-end flows to the
   aggregation of all traffic through a common egress node.  The choice
   of granularity is determined by the choice of the egress identifier.
   Since multiple routes may map to the same egress identifier, the
   number of switched paths needed in a network is minimized.  Switched
   paths for different levels of aggregation may exist simultaneously.
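
   The aggregation argument can be made concrete with a small
   non-normative fragment: three routes sharing two BGP next hops need
   only two switched paths (all values illustrative).

      # prefix -> BGP next hop used as the egress identifier
      fib = {
          "10.1.0.0/16": "192.0.2.7",
          "10.2.0.0/16": "192.0.2.7",
          "10.3.0.0/16": "192.0.2.9",
      }
      switched_paths = set(fib.values())   # one path per egress id
      assert len(switched_paths) == 2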

   ARIS can support IP or other network protocols. This version of the
   draft is defined with respect to IPv4, and will be extended to
   support other protocols in future revisions.

   It is assumed the reader is familiar with the ARIS architecture, as
   defined in the ARIS Overview Internet Draft [7].

2. ARIS Messaging

   ARIS messages are used to communicate with directly attached ARIS
   neighbors.  Their purpose is to manage the switched paths in an ARIS
   network.

   ARIS messages are transmitted single hop to adjacent neighbors
   directly over IP, with IP protocol number 104.  The following ARIS
   messages are defined:

   INIT        - This message establishes neighbor adjacencies
   KEEPALIVE   - This message maintains neighbor adjacencies
   ESTABLISH   - This message creates switched path(s)
   TRIGGER     - This message requests an Establish message
   TEARDOWN    - This message deletes established switched path(s)
   ACKNOWLEDGE - This message positively or negatively acknowledges a
                 received ARIS message

2.1. ARIS Objects

   All ARIS messages begin with a common header, and may be followed by
   one or more ARIS objects.  See section 13 for a definition of the
   common header and objects.

   ARIS Common Header:
        The common header authenticates an ARIS message.

   Egress Identifier Object (EGRID_OBJ):
        A representation of one or more routes that traverse a common
        switched path through a network.  ARIS creates a switched path
        for each egress identifier.

   Label Object (LABEL_OBJ):
        The unique label assigned to a switched path on a link.

   Multipath Object (MPATH_OBJ):
        A numeric identifier used to distinguish multiple switched paths
        to the same egress identifier.  It is of local significance per
        link.

   Router Path Object (RPATH_OBJ):
        The list of ISRs through which the message has passed, where
        each ISR is identified via a unique router-id.  This object's
        purpose is to detect routing loops, where a loop is detected
        when an ISR finds its own router-id in the received message
        router-id list.

   Explicit Path Object (EPATH_OBJ):
        The source routed path the Establish message must follow.

   Tunnel Object (TUNNEL_OBJ):
        A label used for an encapsulated tunnel.  This enables a de-
        aggregating ISR to avoid L3 forwarding, by removing an incoming
        L2 header and switching on an encapsulated label.

   Timer Object (TIMER_OBJ):
        A value used for timeouts.  In the Init message, this is the
        neighbor adjacency timeout.  In the Establish message, this is
        the Establish refresh timeout.

   Acknowledge Object (ACK_OBJ):
        An indication of the success or failure of an ARIS message.

   Init Object (INIT_OBJ):
        Information that must be communicated to adjacent ARIS neighbors
        on initialization.

2.2. Init

   This is the first message exchanged by ARIS neighbors to establish
   adjacency.

   The format of an Init Message is:
         <Init Message> ::= <Common Header>
                            <Init Object>
                            <Timer Object>

   The INIT message MUST be periodically transmitted over each ARIS link
   until a successful INIT message exchange, at which time the neighbor
   state is transitioned to ACTIVE.  All other ARIS messages may only be
   transmitted after the ACTIVE state is achieved.  The INIT message
   contains such information as the neighbor timeout period, and the
   ISR's supported label ranges.

   The neighbor adjacency sequence is addressed in section 3.

2.3. Establish

   Neighbors that are in the ACTIVE state may exchange Establish mes-
   sages.

2.3.1. Destination-Based Routing

   The format of the Establish Message is:
         <Establish Message> ::= <Common Header> <Timer Object>
                                 { <Egress Identifier Object>
                                  [<Router Path Object>]
                                  [<Multipath Object>]
                                   <Label Object>
                                   <Hop Count>
                                  [<Tunnel Object>] } ...

   The "destination-based" Establish message builds switched paths for
   egress identifiers which follow the forwarding paths created by the
   network layer routing protocols.  The switched paths form a
   multipoint-to-point tree, where the egress ISR is the root and the
   ingress ISRs are the leaves.

   The Establish message is initiated by the egress ISR (see section 4.1
   for egress definition), and is periodically sent to each upstream
   neighbor to setup or refresh a switched path.  These upstream neigh-
   bors forward the messages to their own upstream neighbors in Reverse
   Path Multicast (RPM) style, continuing this pattern until each ISR
   with a routed path to the given egress identifier establishes a
   switched path.

   An Establish message is also sent by any ISR in response to a Trigger
   message.  In this case, the Establish message is only forwarded to
   the triggering ISR.

   An egress ISR initiating an Establish message for an egress identif-
   ier allocates a label for each upstream ACTIVE ARIS neighbor, where
   the downstream neighbor is determined by the forwarding table (FIB).
   The egress ISR creates and forwards an Establish message to each
   upstream neighbor.

   The Establish message contains:

        o    a hop-count set to 0 if the Egress is the switched path
             endpoint, else the hop-count set to 1

        o    the router-id list set to the Egress ISR's router-id if
             loop prevention is configured

        o    the allocated upstream label

   Each ISR that receives an Establish message for an egress identifier
   verifies that the message was received from the correct next-hop for
   the given egress identifier, as indicated by the forwarding table
   (FIB), and if loop prevention is configured, verifies that the path
   taken is loop free by examining the ISR list.  A message that con-
   tains a loop or is received from a non-downstream neighbor is dropped
   and a negative acknowledge MUST be transmitted to the sender of the
   invalid message.

   On receipt of a valid Establish message, the ISR uses the multipath
   identifier to determine if the switched path for each given egress
   identifier is new or is an update to a previously established
   switched path.

   An ISR receiving a new valid Establish message populates the FIB with
   the given downstream label and replies to the sender with a positive
   Acknowledge message.  The ISR then allocates a label for each of its
   ARIS upstream neighbors (where the downstream neighbor is the ISR
   from which the Establish was received).  The ISR splices the given
   downstream with the newly allocated upstream label(s), unless loop
   prevention is configured; when loop prevention is configured, the
   paths MUST NOT be spliced until a positive Acknowledge is received.
   The ISR then forwards the Establish message RPM style.
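
   The validity checks described above can be sketched non-normatively
   as follows; label allocation, splicing and acknowledgement are
   elided, and all names are illustrative.

      def receive_establish(my_id, fib_next_hop, sender, rpath,
                            loop_prevention):
          # Returns "nak" for a looping or mis-routed Establish, else
          # the router-id list to forward upstream RPM style.
          if sender != fib_next_hop:
              return "nak"                 # not from our downstream hop
          if loop_prevention and my_id in rpath:
              return "nak"                 # own router-id found: a loop
          return rpath + [my_id] if loop_prevention else rpath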

   An ISR forwards the Establish message to the upstream neighbors with:

        o    incremented hop-count

        o    the ISR's router id appended to the ISR list if loop
             prevention is configured

        o    the allocated upstream label

   An ISR receiving a valid Establish message for a previously esta-
   blished switched path determines if the path has been modified by
   examining the label and the ISR list (if loop prevention configured).
   If these are unchanged the message is considered a refresh, else the
   message is an update.

   An ISR receiving a refresh Establish message MUST acknowledge the
   message and forward the Establish RPM style, with the local router id
   appended to the ISR list (if loop prevention configured) and the pre-
   viously allocated upstream label.

   An ISR receiving an update Establish with a new downstream label MUST
   unsplice the obsolete downstream switched path, and populate the FIB
   with the newly acquired label.  An ISR receiving a new ISR list MUST
   unsplice the downstream switched path if loop prevention is config-
   ured. The ISR then follows the procedures described above for for-
   warding the Establish message upstream.

   An Establish message MAY also use the form of upstream allocation, as
   decided on initialization.  This option, in conjunction with an end-
   to-end acknowledge, may be useful on switches that do not support
   merging.  In this case, the egress ISR initiates an Establish message
   to its upstream neighbors with an end-to-end bit, and a zero label
   value.  Each intermediary ISR that receives such an Establish for-
   wards the message RPM style as described above, except that no label
   is allocated, and no Acknowledge is immediately returned.  The label
   selection is initiated at each ingress ISR, and transmitted to the
   sender of the Establish via the Acknowledge message.  The intermedi-
   ary ISRs which receive a label via the Acknowledge message splice the
   given label to an allocated downstream label, and forward an Ack-
   nowledge message with the selected downstream label to the ISR from
   which the Establish was received.

   The Establish message must be retransmitted at the Retransmit Timer
   interval (see section 10), until an Acknowledge message is received.
   In the case of an end-to-end acknowledgment, the Establish retransmit
   interval SHOULD be a configured multiple of the Retransmit Timer
   value.

   The egress ISR MUST send a refresh Establish message within the con-
   figured RefreshEstablish interval (see section 10).  The RefreshEs-
   tablish interval is transmitted to the upstream ISRs in the Establish
   Timer object.  An ISR SHOULD timeout an established switched path if
   no refresh is received within the given interval.  If a Timer object
   is not present in the received Establish message, the switched path
   SHOULD NOT timeout.

2.3.2. Explicit Routes

   The format of the Establish Message is:
         <Establish Message> ::= <Common Header> <Timer Object>
                                 { <Egress Identifier Object>
                                  [<Multipath Object>]
                                   <Explicit Path Object>
                                   <Label Object>
                                   <Hop Count>
                                  [<Tunnel Object>] } ...

   The explicit-route Establish message builds switched paths for an
   egress identifier that follow an explicitly chosen loop-free path.
   An Establish message is identified as an explicit-route by the pres-
   ence of an Explicit Path object.  Such an Establish is initiated by
   the first ISR in the explicit path, and is periodically sent to the
   path neighbors to setup or refresh a switched path.  Since the expli-
   cit Establish may be initiated by an ingress or egress ISR, the
   Establish indicates the direction of the dataflow.

   The initiator of the explicit-route Establish message for a particu-
   lar egress identifier allocates a label for each given adjacent ARIS
   neighbor in the explicit path.  It then creates an explicit path
   Establish message with an indication of upstream or downstream path
   direction, and forwards the Establish to the explicitly given neigh-
   bors.

   Each ISR that receives an explicit route Establish message for an
   egress replies with an Acknowledge message.  It then allocates labels
   for the listed ARIS neighbors, and splices the given label to the
   newly allocated label.  The message is forwarded to the listed ARIS
   neighbors with an update to the Explicit Path object indicating the
   current path location.

   Note that the Establish message may contain a zero label rather than
   an allocated label.  This causes the receiving node to do the label
   allocation and respond with the given label in the Acknowledge mes-
   sage.

   The Establish message MUST be retransmitted at the Retransmit Timer
   rate (see section 10), until an Acknowledge message is received.

   The initiating ISR MUST send a refresh Establish message within the
   configured RefreshEstablish interval (see section 10).  The
   RefreshEstablish interval is transmitted to the neighbor ISRs in the
   Establish Timer Object.  An ISR SHOULD timeout an established
   switched path if no refresh is received within the given interval.
   If a Timer Object is not present in the received Establish message,
   the switched path SHOULD NOT timeout.

2.4. Trigger

   Neighbors that are in the ACTIVE state may exchange Trigger messages.

   The format of a Trigger Message is:
         <Trigger Message> ::= <Common Header>
                               { <Egress Identifier Object> } ...

   A Trigger message is used in destination-based routing, when a local
   routing change has modified the network layer path to an egress iden-
   tifier.  When this occurs, the ISR MUST unsplice the obsolete
   switched path, and transmit a Trigger message to the new downstream
   neighbor, requesting an Establish message.

   An ISR that receives a Trigger message for a particular egress iden-
   tifier identifies the downstream switched path(s) and allocates the
   label(s) to the upstream triggering ISR.  The upstream label is
   spliced to the downstream label, unless loop prevention is config-
   ured; in this case, the paths are NOT spliced until a positive Ack-
   nowledge is received.

   The ISR transmits an Establish to the triggering neighbor with:

        o    incremented hop-count

        o    ISR's router id appended to the ISR list (if loop preven-
             tion configured)

        o    the allocated upstream label

   An ISR that receives a trigger but has no downstream path to the
   egress identifier replies to the triggering ISR with a Nak.

   The Trigger message must be retransmitted at the Retransmit Timer
   rate, until an Establish or a Nak is received.

   Note that the Trigger message is NOT a requirement for switched path
   corrections, as switched paths are consistently refreshed and re-
   established via the refresh Establish message.

2.5. Teardown

   Neighbors that are in the ACTIVE state may exchange Teardown mes-
   sages.

   The format of a Teardown Message is:
         <Teardown Message> ::= <Common Header>
                                { <Egress Identifier Object>
                                  <Label Object>
                                  <Multipath Object> } ...

   A teardown message is used to delete a switched path to an egress
   identifier.  When the routing protocols indicate an egress identifier
   is no longer in use, the egress ISR SHOULD initiate a teardown.  When
   an ISR changes from an egress ISR to a non-egress ISR due to an ARIS
   link turning ACTIVE, the ISR SHOULD initiate a teardown for its
   obsolete established switched paths.

   The Teardown follows the same path as the Establish message.  On
   receipt of a Teardown message, the allocated labels for the given
   egress identifier MUST be released.

   The Teardown message must be retransmitted at the Retransmit Timer
   interval, until an Acknowledge message is received.

2.6. Acknowledge

   This message is sent as a response to ARIS messages.  It may be used
   as a positive acknowledge (Ack) or negative acknowledge (Nak).

   The format of an Acknowledge Message is:
         <Acknowledge Message> ::= <Common Header>
                                   <Acknowledge Object>
                                  [<Label Object>]
                                  [<Tunnel Object>]

   ARIS messages requiring a response SHOULD be placed on a retransmit
   queue.  If an expected ARIS response is not received within the
   Retransmit Timer period, the ARIS message is retransmitted.

   The Init, Trigger, Teardown, and Establish messages may receive a
   negative acknowledge (Nak) on an error condition.  The only messages
   that require a positive acknowledge (Ack) are the Teardown and Estab-
   lish messages.  On receipt of an Acknowledge message, the original
   message MUST be removed from the retransmit queue.

   On receipt of a positive Acknowledge message in response to a
   destination-based Establish message, the ISR splices the upstream
   label to the downstream label when loop prevention is configured.
   When a received Establish message contained a zero label, the Ack-
   nowledge returns the allocated label.

   See DVMRP section 8.1 for a discussion on the use of the Acknowledge
   Tunnel object.

2.7. KeepAlive

   Neighbors that are in the ACTIVE state may exchange KeepAlive mes-
   sages.

   The format of a KeepAlive Message is:
         <KeepAlive Message> ::= <Common Header>

   Note that the KeepAlive message is the only ARIS message to have no
   objects.

   This message is sent by an ISR to inform its neighbors of its contin-
   ued existence.  It is the first message that is transmitted after an
   adjacency has been established.  In order to prevent the neighbor
   timeout period from expiring, ARIS messages must be periodically sent
   to neighbors.  The KeepAlive will only be sent when no other ARIS
   messages have been transmitted within the periodic interval time.  A
   recommended transmission interval is 1/3 of the exchanged neighbor
   timeout period.

3. Neighbor Adjacency

   ARIS must form an adjacency with each of its directly attached ARIS
   neighbors before it may begin establishing switched paths.  Initiali-
   zation messages are transmitted over ARIS interfaces to discover the
   ARIS neighbors and exchange adjacency information.

3.1. State Transition

   Local Session Number (LSN):
      Sending ISR's own session number.  This is always sent in the common
      header "Sender Session Number" field, and verified in the "Receiver
      Session Number" field upon receipt of a message.

   Neighbor Session Number (NSN):
      The session number of an adjacent ISR as learned via the common header
      "Sender Session Number" field received via the INIT message.  It is sent
      in the "Receive Session Number" field in all subsequent messages.  If
      the Neighbor Session Number is unknown, this field must be set to zero.

   Match condition (S1):   Receiver Session Number == 0
   Match condition (S2):   Local Session Number == Receiver Session Number
   Match condition (S3):   Local Session Number == Receiver Session Number &&
                           Neighbor Session Number == Sender Session Number

   Note:
   INIT w/0:   INIT transmitted with the Receiver Session Number set to 0
   INIT w/NSN: INIT transmitted with the Receiver Session Number set to the
               learned Neighbor Session Number

   State: INITSENT

   +====================================================================+
   | Receive        | Action                                | New State |
   +================+=======================================+===========+
   | INIT && (S1)   | Update NSN; Send INIT w/NSN           | INITRCVD  |
   +----------------+---------------------------------------+-----------+
   | INIT && (S2)   | Update NSN; Send KEEP                 | ACTIVE    |
   +----------------+---------------------------------------+-----------+
   | INIT && !(S2)  | Send INIT w/0                         | INITSENT  |
   +----------------+---------------------------------------+-----------+
   | other          | Drop Packet                           | INITSENT  |
   +----------------+---------------------------------------+-----------+
   | timeout        | Send INIT w/0                         | INITSENT  |
   +====================================================================+

   State: INITRCVD

   +====================================================================+
   | Receive        | Action                                | New State |
   +================+=======================================+===========+
   | INIT && (S1)   | Update NSN; Send INIT w/NSN           | INITRCVD  |
   +----------------+---------------------------------------+-----------+
   | INIT && (S2)   | Update NSN; Send KEEP                 | ACTIVE    |
   +----------------+---------------------------------------+-----------+
   | INIT && !(S2)  | Send INIT w/0                         | INITSENT  |
   +----------------+---------------------------------------+-----------+
   | KEEP && (S3)   | Send KEEP                             | ACTIVE    |
   +----------------+---------------------------------------+-----------+
   | KEEP && !(S3)  | Send INIT w/0                         | INITSENT  |
   +----------------+---------------------------------------+-----------+
   | other          | Drop Packet                           | INITRCVD  |
   +----------------+---------------------------------------+-----------+
   | timeout        | Send INIT w/0                         | INITSENT  |
   +====================================================================+

   State: ACTIVE

   +====================================================================+
   | Receive        | Action                                | New State |
   +================+=======================================+===========+
   | INIT && (S1)   | New LSN; Update NSN; Send INIT w/NSN  | INITRCVD  |
   +----------------+---------------------------------------+-----------+
   | INIT && (S3)   | Send KEEP                             | ACTIVE    |
   +----------------+---------------------------------------+-----------+
   | KEEP && (S3)   | Send KEEP                             | ACTIVE    |
   +----------------+---------------------------------------+-----------+
   | other          | Drop Packet                           | ACTIVE    |
   +----------------+---------------------------------------+-----------+
   | timeout        | New LSN; Send INIT                    | INITSENT  |
   +====================================================================+

   NOTES:
   ======
   * No more than one KEEPALIVE may be sent within the KeepAlive transmission
     interval
   * No more than one INIT may be sent within the Retransmit Timer interval
   * Once ARIS neighbors are in the ACTIVE state, all ARIS messages MUST verify
     the session numbers and the neighbor router-id.
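
   The three state tables can be captured in a small non-normative
   sketch, with the S1/S2/S3 match results folded into the event names
   (the encoding is illustrative only).

      TRANSITIONS = {
          ("INITSENT", "INIT+S1"): ("update NSN; send INIT w/NSN", "INITRCVD"),
          ("INITSENT", "INIT+S2"): ("update NSN; send KEEP",       "ACTIVE"),
          ("INITSENT", "INIT-S2"): ("send INIT w/0",               "INITSENT"),
          ("INITSENT", "timeout"): ("send INIT w/0",               "INITSENT"),
          ("INITRCVD", "INIT+S1"): ("update NSN; send INIT w/NSN", "INITRCVD"),
          ("INITRCVD", "INIT+S2"): ("update NSN; send KEEP",       "ACTIVE"),
          ("INITRCVD", "INIT-S2"): ("send INIT w/0",               "INITSENT"),
          ("INITRCVD", "KEEP+S3"): ("send KEEP",                   "ACTIVE"),
          ("INITRCVD", "KEEP-S3"): ("send INIT w/0",               "INITSENT"),
          ("INITRCVD", "timeout"): ("send INIT w/0",               "INITSENT"),
          ("ACTIVE",   "INIT+S1"): ("new LSN; update NSN; send INIT w/NSN",
                                    "INITRCVD"),
          ("ACTIVE",   "INIT+S3"): ("send KEEP",                   "ACTIVE"),
          ("ACTIVE",   "KEEP+S3"): ("send KEEP",                   "ACTIVE"),
          ("ACTIVE",   "timeout"): ("new LSN; send INIT",          "INITSENT"),
      }

      def step(state, event):
          # Anything not tabulated is dropped with no state change.
          return TRANSITIONS.get((state, event), ("drop packet", state))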

4. Egress Identifiers

   A unique switched path is established for each egress identifier,
   where an egress identifier represents one or more routes that share a
   common switched path through a network.  Egress identifiers may be
   extracted from information in the routing protocols, or may be expli-
   citly configured.  The egress identifiers are populated in an ISR
   forwarding table.  Once ARIS neighbors are ACTIVE, they may begin
   exchanging switched path labels for each egress identifier via the
   Establish message.

4.1. Egress ISR

   An ISR is defined to be an "egress ISR" on a per egress identifier
   basis. Thus, an ISR may be considered an egress for a particular set of
   egress identifiers, and a non-egress for others.

   An ISR is an egress ISR, with respect to a particular egress identif-
   ier, under any of the following conditions:

        1.   The egress identifier refers to the ISR itself (including
             one of its directly attached interfaces).

        2.   The egress identifier is reachable via a next hop router
             that is outside the ISR switching infrastructure.

        3.   The egress identifier is reachable by crossing a routing
             domain boundary, such as another area for OSPF summary net-
             works, or another autonomous system for OSPF AS externals
             and BGP routes [rfc1583] [rfc1771].

4.2. Selecting Egress Identifiers

   Following are the currently defined egress identifiers. New egress
   identifiers may be added as needed:

        a)   IPv4 address
             This egress identifier contains an IP address.  This iden-
             tifier is used for host or CIDR prefixes [rfc1519].  This
             type results in each IP destination prefix sustaining its
             own switched path tree.  It is recommended in environments
             where no aggregation information is provided by the routing
             protocols (such as RIP), or in networks where the number of
             destination prefixes is limited.

        b)   BGP Next Hop
             This egress identifier contains the value in the BGP
             NEXT_HOP attribute.  It may be the IP address of a BGP
             border router (enabling one switched path tree for all des-
             tinations reachable through the same egress point), or the
             address of an external BGP peer (enabling one switched path
             tree for all routes destined to the same external peer).
             This identifier provides the maximum obtainable aggrega-
             tion.

        c)   OSPF Router ID
             This egress identifier contains the OSPF Router ID of the
             router that initiated the link state advertisement.  This
             type allows aggregation of traffic on behalf of multiple
             datagram protocols routed by OSPF.

        d)   OSPF Area Border Router
             This egress identifier contains the OSPF Router ID of the
             border router.  This identifier is used in OSPF external
             link advertisement with a non-zero forwarding address.

        e)   Explicit Path
             This egress identifier contains an explicitly defined
             source-routed path.  This information may be provided via
             configuration, or may be computed via a Dijkstra calcula-
             tion for a certain metric (e.g., QoS, ToS), and may be used
             for point-to-point, point-to-multipoint, or multipoint-to-
             point paths.  This type of egress identifier may be egress
             or ingress based.

        f)   CIDR group
             This egress identifier contains a list of CIDR prefixes
             that are to share a common egress point.  This type is con-
             figured, and may be used when additional aggregation not
             provided by the routing protocols is required.

        g)   Flow
             This egress identifier contains information pertaining to
             a fixed set of datagram fields, such as port,
             dest-addr, src-addr, etc.  This feature provides the user
             with the ability to use ARIS with no aggregation.  This
             type of egress identifier may be egress or ingress based.

        h)   Multicast (S,G)
             This egress identifier contains the unique (Source, Group)
             multicast pair. It creates one switched path tree per (S,G)
             pair.  It is used by DVMRP and PIM-DM [rfc1075] [pim-dm].

        i)   Multicast (G)
             This egress identifier contains the unique multicast group
             on a multicast tree.  It creates one switched path tree per
             group.  It is used by PIM-SM [pim-sm].

5. Destination-Based Routing

5.1. Forwarding Information Bases

   An ISR extends the FIB to associate route entries with egress iden-
   tifiers via the routing protocols or configuration.  An egress iden-
   tifier may be defined for a single route entry, or may be "aggre-
   gated", where it is shared by multiple route entries.  These egress
   identifiers are assigned switched paths by the ARIS protocol.

   Route lookups are performed as they are on conventional routers
   (longest prefix match).  However, if an ARIS switched path is associ-
   ated with the route, traffic is forwarded on that path.  If a
   switched path is not available, traffic may be forwarded as on a con-
   ventional router.

   Note that any route may be associated with both an aggregated
   switched path (sharing a common switched path with many routes), as
   well as an individual switched path (de-aggregated from the shared
   path).  The switched path selected will always be the most specific
   available as decided by a longest prefix match or policy.
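
   As an illustration (not part of this specification), the forwarding
   decision described above might be sketched in C as follows; all
   type and function names are hypothetical:

   struct packet;              /* opaque datagram                   */
   struct switched_path;       /* label-switched path handle        */
   struct next_hop;            /* conventional routing next hop     */

   struct fib_entry {
       struct switched_path *individual; /* de-aggregated path      */
       struct switched_path *aggregate;  /* shared switched path    */
       struct next_hop      *nh;         /* hop-by-hop fallback     */
   };

   extern struct fib_entry *longest_prefix_match(struct packet *p);
   extern void switch_on_path(struct packet *p,
                              struct switched_path *sp);
   extern void forward_hop_by_hop(struct packet *p,
                                  struct next_hop *nh);

   void forward(struct packet *p)
   {
       struct fib_entry *fe = longest_prefix_match(p);

       if (fe->individual != NULL)         /* most specific first   */
           switch_on_path(p, fe->individual);
       else if (fe->aggregate != NULL)     /* shared aggregate path */
           switch_on_path(p, fe->aggregate);
       else                                /* no switched path      */
           forward_hop_by_hop(p, fe->nh);
   }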

                              +---+
                              | A |
                              +---+
                                |
                                V
                              +---+
                              | B |
                            / +---+ \
                           V         V
                         +---+     +---+
                         | C |     | D |
                         +---+     +---+
                           |         |
                           V         V
                         Net 1     Net 2
                         Net 3     Net 4

     Figure 1:  Sample Topology

   In Figure 1, the Forwarding Information Base (FIB) on ISR A knows of
   two aggregated egress identifiers, ISR C and ISR D.  Net 1 and Net 3
   in the FIB are associated with the switched path assigned to
   (egress-id:C), while Net 2 and Net 4 are associated with the
   switched path assigned to (egress-id:D).

   It is also possible to de-aggregate a prefix from the selected
   aggregate egress-id and set up a unique switched path.  For example, the
   FIB entry on ISR A for a de-aggregated Net 2 would be associated with
   (egress-id:Net2).

5.2. TTL Decrement

   In order to comply with the requirements for IPv4 routers [rfc1812],
   the IP datagram Time-To-Live (TTL) field must be decremented on each
   hop it traverses.  The forwarding ISR SHOULD decrement a packet's
   TTL by the number of switched hops plus one when the link-layer
   packet header does not have a TTL field (as in ATM).  The switched
   path hop-count is computed via the Establish message.  If the decre-
   ment value is greater than or equal to the TTL of the packet, the
   packet MAY be forwarded hop-by-hop or discarded.
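
   As an illustration (not part of this specification), the TTL
   adjustment might be coded as follows, where "hop_count" is the
   switched path hop count learned from the Establish message:

   #include <stdint.h>

   /*
    * Returns the new TTL, or 0 when the decrement meets or exceeds
    * the packet TTL and the packet must instead be forwarded
    * hop-by-hop or discarded.
    */
   uint8_t adjust_ttl(uint8_t ttl, uint8_t hop_count)
   {
       unsigned int decrement = (unsigned int)hop_count + 1;

       if (decrement >= ttl)
           return 0;
       return (uint8_t)(ttl - decrement);
   }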

5.3. Loop Prevention

   An ISR may be configured with loop prevention.  In this mode, an
   egress ISR initiating an Establish message includes the Router Path
   object, and each ISR that the Establish traverses MUST append its
   unique router-id to the ISR list in the object.  If an
   ISR receives an Establish message with itself in the list, a loop is
   detected.  When this occurs, the Establish message MUST be ter-
   minated.
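
   As an illustration (not part of this specification), the loop check
   on receipt of an Establish message might be sketched as follows;
   the function name is hypothetical:

   #include <stdint.h>

   /*
    * Returns nonzero if the local router-id already appears in the
    * Router Path object's ISR list, in which case the Establish
    * message MUST be terminated.
    */
   int router_path_loops(const uint32_t *isr_list, int count,
                         uint32_t my_router_id)
   {
       int i;

       for (i = 0; i < count; i++)
           if (isr_list[i] == my_router_id)
               return 1;       /* loop detected                  */
       return 0;               /* loop-free: append own id and
                                  propagate upstream             */
   }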

   Further, if an ISR modifies the network layer path to an egress iden-
   tifier due to a routing change, the ISR MUST NOT splice the upstream
   switched path(s) to the new downstream switched path until it for-
   wards the new ISR list to the upstream ISR(s) via the Establish mes-
   sage, and receives the Acknowledge message(s) in return.  An ISR that
   receives an Establish message with a modified ISR list in the Router
   Path object MUST unsplice any established upstream switched path(s)
   from the downstream switched path, and re-establish the path through
   the Establish/Acknowledge mechanism.

   If an ISR is not configured with loop prevention, no Router Path
   object is included in the Establish message, and modified paths to
   egress identifiers are immediately spliced.  The default configura-
   tion is loop prevention.

5.4. BGP Interaction with ARIS

   The BGP implementation of the ISR uses the NEXT_HOP attribute as the
   egress identifier.  When the BGP border ISR injects routes into the
   BGP mesh, it may use its own IP address or the address of its exter-
   nal BGP peer as the value of the NEXT_HOP attribute.  This choice of
   NEXT_HOP attribute value creates different establishment behaviors
   with ARIS.

   If the BGP border ISR uses its own IP address as the NEXT_HOP attri-
   bute in its injected routes, then all of these BGP routes share the
   same egress identifier.  This approach establishes only one tree to
   the BGP border ISR, and the border ISR may forward traffic at the IP
   layer towards its external BGP neighbors.

   If the BGP border ISR uses the external BGP peer as the NEXT_HOP
   attribute in its injected routes, then the BGP routes from each
   unique external BGP neighbor share the same egress identifier.  This
   approach establishes one switched path tree per external BGP neighbor
   of the BGP border ISR.  The BGP border ISR can switch traffic
   directly to its external BGP neighbors.

5.5. OSPF Interaction with ARIS

   The OSPF protocol exchanges five types of "link state advertisements"
   to create OSPF routing tables.  All types of advertisements contain
   an "Advertising Router" field, which identifies the OSPF Router ID of
   the router that originates the advertisement.  The ISR uses this OSPF
   Router ID as the egress identifier.

   The use of the OSPF Router ID as an egress identifier allows a new
   level of destination prefix abstraction.  In a typical network, a
   router may be connected to several LANs (Ethernets, Token Rings,
   etc.), and may communicate to remote networks outside of its routing
   domain via adjacent routers.  The remote destination networks may be
   injected into the link state routing domain via static configuration,
   or via other routing protocols (such as RIP or BGP).  These local and
   remote networks may be represented in the router forwarding tables as
   many destination prefixes, which cannot be aggregated into shorter
   prefixes (even when using CIDR).  Router labels (OSPF Router ID)
   provide a compact means of representing a number of destination pre-
   fixes that exit the link state routing domain at the same egress
   router.  The association between destination prefixes and router
   labels is an easy by-product of the normal SPF computation.

   The one exception to using the OSPF Router ID is when ISRs receive an
   AS external link advertisement with a non-zero forwarding address.
   The OSPF protocol uses the forwarding address to allow traffic to
   bypass the router that originates the advertisement.  Since the OSPF
   Router ID refers to the bypassed router, it is inadequate as an
   egress identifier in this case.  Instead, the ARIS protocol must use
   the forwarding address as the egress identifier.

   Using the forwarding address as the egress identifier provides signi-
   ficant benefits.  Since the AS external forwarding address and the
   BGP NEXT_HOP attribute are both external IP addresses, they are com-
   patible types of egress identifiers, which may allow BGP and OSPF
   routes to share the same switched path.  Further, the OSPF AS boun-
   dary ISR can switch traffic directly to its external neighbors, just
   like BGP.

   The ISR identifies itself as an OSPF egress when the ISR is an area
   border router or an AS boundary router, or when it is directly
   attached to a LAN.

6. L2-Tunnels

   L2-tunnels provide a mechanism by which L2 data units can be switched
   at a de-aggregating ISR without performing network layer forwarding.
   The ARIS protocol enables a de-aggregating ISR to advertise labels to
   the upstream ingress ISRs.  Ingress ISRs use this information to
   build a packet with the advertised label in the L2 header.  This
   packet is then encapsulated into another L2 header with the label
   representing the switched path to the de-aggregating ISR.

   The de-aggregating ISR advertises the labels via the Establish mes-
   sage using the tunnel object, where this object contains a list of
   egress identifiers that are to use the tunnel label.  An ISR receiv-
   ing a tunnel object for an egress identifier that is not in the for-
   warding table ignores the tunnel information.

   For example, the egress identifiers in the Tunnel object may be a
   list of CIDRs.  This enables those CIDRs to share a switched path to
   a de-aggregation point, and then be de-encapsulated and switched
   towards their final destinations on different paths.

   Although the current specification provides only two levels of tun-
   neling, multiple level support may be provided in future revisions.

7. Label Management

7.1. VCIB

   The VC information base (VCIB), which does not exist on a standard
   router, maintains for each egress identifier the upstream to down-
   stream switched path label mappings and related states. This mapping
   is controlled by the ARIS protocol.  The labels are populated in the
   label swapping cross-connect table.

7.2. Label Swapping

   An incoming L2 data unit is forwarded using the information provided
   by an L2 forwarding and label swapping table.  This table is indexed
   directly by the incoming port number and label, and provides the
   mapped outgoing port(s) and outgoing label(s). In the case of point-
   to-multipoint, outgoing information for each branch is obtained.
   This cross-connect table and the L2 forwarding/swapping mechanism
   currently exist in standard ATM and Frame Relay switches.
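
   As an illustration (not part of this specification), the
   cross-connect lookup might be sketched as follows; all names are
   hypothetical:

   /*
    * One output leg of a cross-connect entry.  Point-to-multipoint
    * entries chain several legs together.
    */
   struct xconnect_leg {
       unsigned int         out_port;
       unsigned int         out_label;
       struct xconnect_leg *next;      /* further p2mp branches    */
   };

   /*
    * The table is indexed directly by incoming port and label;
    * no search or longest-prefix match is required.
    */
   struct xconnect_leg *
   xconnect_lookup(struct xconnect_leg **port_table[],
                   unsigned int in_port, unsigned int in_label)
   {
       return port_table[in_port][in_label];
   }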

   The label swapping table should be extended to include L2-tunnel
   information, so when an ISR is a switched path termination point,
   de-encapsulation and appropriate re-encapsulation can take place.
   All related information for this purpose should be maintained in this
   table.

8. Multicast

   The establishment of the IP Multicast point-to-multipoint switched
   path tree is initiated at the root (ingress) node.  The switched path
   tree carries traffic from the ingress ISR to all egress ISRs, using
   multicast switching at intermediate ISRs.

   The mechanism for establishing the switched path is virtually the
   same as described in the unicast destination-based routing case.  The
   root of the tree (ingress in this case) transmits the Establish,
   RPM-style, to all child links as determined by the multicast FIB.
   Each
   ISR that receives the Establish MUST verify the message was received
   from the correct parent, and if loop prevention is configured, uses
   the Router Path object to guarantee a loop-free path.  The switched
   paths MAY be created such that the Establish carries NO Label object,
   and the Acknowledge message returns the downstream link label that is
   spliced to the upstream label.  A new receiver joining an established
   tree may either send a trigger message to the parent ISR, or wait for
   the next refresh cycle to be spliced into the switched path.

8.1. DVMRP and PIM-DM

   The choice of egress identifier for the multicast routing protocols
   DVMRP and PIM-DM is the (S,G) pair.  This egress identifier creates
   one ingress routed point-to-multipoint switched path tree per source
   address and group pair.  The creation of the switched path is ini-
   tiated by the ingress node on receipt of traffic from the sender S
   for a particular multicast group G.

   The branches of the point-to-multipoint switched path tree that do
   not lead to receivers are pruned when the multicast routing protocol
   prunes upstream by deleting forwarding entries in the multicast FIB.

   ARIS can also support the notion of DVMRP tunnel switched paths,
   through the Establish and L2-tunneling mechanism.  In this case, the
   egress identifier is the DVMRP tunnel endpoint.  The source of the
   tunnel initiates the Establish message to its next-hop, but with the
   end-to-end option.  Each intermediary node that receives such an
   Establish may create the switched path, but does not immediately send
   an Acknowledge message.  It forwards the message to its next-hop and
   waits for the DVMRP tunnel endpoint to initiate the Acknowledge.
   This Acknowledge message includes a Tunnel object, where the Tunnel
   object contains the list of labels for all reachable (S, G) pairs.
   These labels are used by the DVMRP tunnel source to populate its
   label-swapping table for the purpose of encapsulation.  At the source
   of the DVMRP tunnel, an incoming header is replaced by a header with
   the DVMRP tunnel label, followed by the label used by the DVMRP tun-
   nel endpoint to the given (S,G).  This enables the DVMRP tunnel end-
   point to de-encapsulate the packet, and forward the message on its
   switched path to (S, G).

8.2. PIM-SM

   The choice of egress identifier for those groups on a shared tree is
   (RP,G), where RP is the PIM-SM rendezvous point.  For the groups on a
   source-specific tree, the egress identifier is (S, G).

   The PIM-SM switched path Establish is initiated by an ingress when it
   receives a PIM Join/Prune message.  In the shared-tree case, the RP
   behaves as the ingress, and initiates the switched path for all down-
   stream receivers.  For those groups that are on a source-specific
   tree, the ingress of the source initiates the switched path.  A
   source-specific switched path for a group that was created by the
   rendezvous point SHOULD be spliced to the downstream shared tree.

9. Multipath

   Many IP routing protocols such as OSPF support the notion of equal-
   cost multipath routes, in which a router maintains multiple next hops
   for one destination prefix when two or more equal-cost paths to the
   prefix exist.  Each ISR that receives multiple Establish messages
   from downstream ISRs with different paths to the same egress identif-
   ier can choose via configuration one of four different approaches for
   sending Establish messages upstream.

   One approach is to send multiple Establish messages upstream,
   preserving multiple switched paths to the egress ISR, where each
   switched path represents a different equal-cost path.  In this case,
   the ingress ISR will make multipath decisions for traffic on behalf
   of all downstream ISRs.  Each Establish message requires an addi-
   tional numeric identifier to be able to distinguish multiple distinct
   switched paths to the destination, so that successive Establish mes-
   sages for distinct switched paths are not misinterpreted as consecu-
   tive replacements of the same switched path.  When multiple Establish
   switched paths are preserved upstream, they require distinct label
   assignments, which works against conservation of switched paths.

   Another approach, which conserves switched paths at the cost of
   switching performance, is to originate one Establish message
   upstream, and to forward datagrams at the IP network layer on the
   multipath point ISR.

   A third approach is to propagate only one Establish message from the
   downstream ISRs to the upstream ISRs, and ignore the content of other
   Establish messages.  This conserves switched paths and maintains
   switching performance, but may not balance loads across downstream
   links as well as the first two approaches, even if switched paths are
   selectively dropped.

   The final approach is to propagate one Establish message that carries
   the content of all downstream Establish messages, so that only one
   upstream switched path is created to the multipath point.  This
   requires that the switching hardware on the multipath ISR be capable
   of correctly distributing the traffic of an upstream switched path
   onto multiple downstream switched paths.  Furthermore, the Establish
   message to send upstream must concatenate the ISR ID lists from down-
   stream messages, in order to preserve the loop-free property.  The
   ISR ID list concatenation is similar to using AS_SETs for aggregation
   in the BGP protocol.  This final approach has the benefit of both
   conservation and performance, although it requires a slightly more
   complex implementation.

   The default behavior is to ignore the multipath route(s), and
   establish only one switched path to the egress identifier.
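
   As an illustration (not part of this specification), the four
   configurable behaviors might be represented as follows; the names
   are hypothetical:

   enum multipath_mode {
       MPATH_PRESERVE_ALL,  /* one Establish per equal-cost path   */
       MPATH_IP_FORWARD,    /* one switched path; IP forwarding
                               at the multipath ISR                */
       MPATH_SINGLE,        /* propagate only one Establish
                               (the default)                       */
       MPATH_MERGE          /* one Establish carrying the content
                               of all downstream messages          */
   };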

                              +---+
                              | A |
                              +---+
                                |
                                V
                              +---+
                              | B |
                            / +---+ \
                           V         V
                         +---+     +---+
                         | C |     | D |
                         +---+ \ / +---+
                                V
                              +---+
                              | E |
                              +---+
                                |
                              Net 1

   Figure 2:  Multipath Sample Topology

   Figure 2 shows a topology for a network with two equal-cost paths
   to the egress identifier, ISR E.  On ISR A, the Forwarding
   Information Base (FIB) for Net 1 is associated with both
   (egress-id:E, mpath-id:1) and (egress-id:E, mpath-id:2).

10. Timers

        a)   KeepAlive Timer
             This configured value is exchanged via the INIT message
             Timer object, and is the interval within which an ARIS
             message must be received to prevent neighbor adjacency
             time-out.  A
             recommended KeepAlive transmission interval is 1/3 of the
             exchanged neighbor timeout period.

        b)   EstablishRefresh Timer
             This configured value is received in the Establish message
             Timer object, and is the interval within which an Establish
             refresh message MUST be received to prevent an egress
             identifier's switched path time-out.  If this value is 0,
             no refresh Establish message will be transmitted, else the
             refresh will be transmitted at 1/3 of the EstablishRefresh
             Timer value.  Note that this value MUST be greater than the
             Retransmit Timer value.

        c)   Retransmit Timer
             This is the interval within which an ARIS message must
             receive a response.  If a response is not received within
             the interval, the ARIS message will be retransmitted.
             Note that this value MUST be less than the
             EstablishRefresh Timer value (see the illustrative sketch
             following this list).
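
   As an illustration (not part of this specification), the timer
   relationships above might be checked and derived as follows; all
   names are hypothetical:

   #include <assert.h>

   struct aris_timers {
       unsigned int keepalive;         /* neighbor dead interval   */
       unsigned int establish_refresh; /* 0 disables refreshes     */
       unsigned int retransmit;        /* response wait interval   */
   };

   unsigned int keepalive_tx_interval(const struct aris_timers *t)
   {
       return t->keepalive / 3;        /* recommended 1/3 rule     */
   }

   unsigned int establish_tx_interval(const struct aris_timers *t)
   {
       /* Retransmit MUST be less than EstablishRefresh (if set)  */
       assert(t->establish_refresh == 0 ||
              t->retransmit < t->establish_refresh);
       return t->establish_refresh / 3; /* 0 when disabled         */
   }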

11. Configuration

   ARIS MUST allow configuration of the various timers as described in
   section 10.

   The configuration MUST support egress identifier selection.  By
   default, ARIS egress identifiers are selected via the routing proto-
   cols.  For example, BGP may select the egress identifier from its
   NEXT_HOP field, OSPF may select the area border router, and other
   protocols may select the CIDR prefix. However, configuration can
   override these defaults.  In addition, the user may configure addi-
   tional egress identifiers for specifically requested switched paths
   at edge routers.  The configuration SHOULD also provide the ability
   to stop an egress ISR from originating Establishes for a specified
   set of egress identifiers.

   The configuration MUST allow the selection of the owner/allocator of
   the incoming and the outgoing label space for each link.  Note that
   this MUST have a matching configuration on the link neighbor.

   The configuration SHOULD support the choice of ATM encapsulation
   [15].  The default is NULL encapsulation.

   The configuration SHOULD support the choice of action for multipath.
   The default action MUST be to propagate only one path towards the
   ingress.

12. ARIS Signaling Pseudo Code

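   /*
    * Dispatch a received ARIS message by type, after verifying
    * the common header
    */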
   receive_message(arispkt, nbr)
   {

      if (verify_msg(arispkt) fails)
          return;

      switch(arispkt->type) {
      case INIT:
           See state table defined in section 3.1
           break;
      case KEEPALIVE:
           See state table defined in section 3.1
           break;
      case ESTABLISH:
           process_establish_msg(arispkt->establish_contents, nbr);
           update nbr->rcv_time to current-time;
           break;
      case TRIGGER:
           process_trigger_msg(arispkt->trigger_contents, nbr);
           update nbr->rcv_time to current-time;
           break;
      case ACK:
           process_ack_msg(arispkt->ack_contents, nbr);
           update nbr->rcv_time to current-time;
           break;
      case TEAR:
           process_teardown_msg(arispkt->tear_contents, nbr);
           update nbr->rcv_time to current-time;
           break;
      }
   }

   /*
    * Verify contents of ARIS common header
    */
   int
   verify_msg(arispkt, nbr)
   {
      if (IP-style header checksum fails ||
          common-header sequence number check fails)
          return(error);
      if (nbr->state == ACTIVE) {
          if (common-header neighbor router id not matched ||
              common-header sender session number not matched ||
              common-header receiver session number not matched)
              return(error);
      }
      return(0);
   }

   /*
    * Process received establish message (destination-based)
    */
   process_establish_msg(establish_msg, sender-nbr)
   {
       for (each egress-id in establish_msg) {
           fib_verify(egress-id, sender-nbr);
           transmit ack to sender-nbr;
           if ((new egress-id) || (different multipath-identifier)) {
               create VCIB entry;
               populate fib with given label;
               for (each ARIS-nbr) {
                    if (ARIS-nbr == sender-nbr)
                        continue;    /* ignore downstream nbr */
                    allocate upstream label/populate VCIB;
                     tx_establish_msg(establish_msg, ARIS-nbr);
               }
           } else {
               if ((ISR list same as previously received) &&
                   (switched-path label same as previously received)) {
                    /* This is a refresh message */
                    for (each ARIS-nbr) {
                        if (ARIS-nbr == sender-nbr)
                            continue;    /* ignore downstream nbr */
                         tx_establish_msg(establish_msg, ARIS-nbr);
                     }
                } else {
                   unsplice current downstream label;
                   if (new switched-path label)
                       repopulate fib with given label;
                   for (each ARIS-nbr) {
                       if (ARIS-nbr == sender-nbr)
                           continue;    /* ignore downstream nbr */
                       tx_establish_msg(establish_msg, ARIS-nbr);
                   }
               }
           }
           if (egress-id is on retransmit queue)   /* May be triggering */
               remove egress-id from retransmit queue
       }
   }

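   /*
    * Propagate an Establish message to an upstream neighbor,
    * extending the ISR list and hop count
    */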
   tx_establish_msg(establish_msg, upstream-nbr)
   {
       append router-id to ISR-list
       increment hopcount
       set upstream-nbr label
       tx_msg(establish_msg, upstream-nbr, ESTABLISH)
   }

   /*
    * Process received trigger message
    */
   process_trigger_msg(trigger_msg, upstream-nbr)
   {
       for (each egress-id in trigger_msg) {
           for (each ARIS-nbr) {
               if (ARIS-nbr is next-hop for egress-id) {
                  build establish_msg;
                  if (no upstream-nbr label)
                      allocate upstream label/populate VCIB;
                   tx_establish_msg(establish_msg, upstream-nbr);
               }
           }
       }
   }

   /*
    * Process received acknowledge message
    */
   process_ack_msg(ack-message, sender-nbr)
   {
       if (match original sequence number and msg type to retransmit queue entry)
           remove item from retransmit queue;
       if (original message was establish-message)
          splice upstream label to downstream label;
   }

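   /*
    * Transmit an ARIS message; all messages except ACKs are queued
    * for retransmission until acknowledged
    */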
   tx_msg(arispkt, nbr, msg-type)
   {
       prepend common-header to arispkt;
       if (arispkt->type != ACK)
           put message on retransmit queue
       prepend protocol header to arispkt;
       update nbr->send_time to current-time;
       transmit message to nbr
   }

   /*
    * Forwarding Table signals ARIS with an egress identifier
    */
   learn_egress_identifier(egress-id, next-hop)
   {
      if ((next-hop is not aris_nbr) ||
          (next-hop is different OSPF-area) ||
          (next-hop is different domain)) {
           /* I am the egress */
           initiate_establish(egress-id)
       } else {
           /* I am intermediary ISR or ingress */
           send_trigger(egress-id, next-hop)
       }
   }

   /*
    * Egress node initiates Establish message
    */
   initiate_establish(egress-id)
   {
       for (each ARIS-nbr) {
           if (ARIS-nbr is next-hop for egress-id)
               continue;           /* ignore downstream nbr(s) */
           allocate upstream label/populate VCIB;
           build establish_msg;
            tx_establish_msg(establish_msg, ARIS-nbr);
       }
       put egress-id on EstablishRefresh queue
   }

13. Object Definitions

13.1. Common Header

      ARIS messages begin with the following header.

         0                   1                   2                   3
         0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        |    Version    |    Msg Type   |            Length             |
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        |         Header Checksum       |           Reserved            |
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        |                        Sender Router ID                       |
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        |                      Sender Sequence Number                   |
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        |                      Sender Session Number                    |
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        |                     Receiver Session Number                   |
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

        Version
            Version number of the ARIS protocol, currently 1.

        Msg Type
            Defines the type of the ARIS message, as follows:

                  INIT        = 1
                  KEEPALIVE   = 2
                  TRIGGER     = 3
                  ESTABLISH   = 4
                  TEARDOWN    = 5
                  ACKNOWLEDGE = 6

        Length
            Total length of the ARIS message, including this header.

        Header Checksum
            IP-style checksum of the complete ARIS message, including
            the ARIS Common Header and all the objects therein.

        Sender Router ID
            Sender router identifier.

        Sender Sequence Number
            Sender message sequence number. The upper 16 bits may be used
            as local flags, while the lower 16 bits represent sequence numbers
            from 1 through 2^16-1.

        Sender Session Number
            Unique session number of the sender.

        Receiver Session Number
            Session number of the receiver as known by the sender through
            a previous INIT message.  The sender MUST set this to 0 in
            an INIT message if there is no learned receiver session number.
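
      As an illustration (not part of this specification), the header
      layout and IP-style checksum might be coded as follows; fields
      are carried in network byte order, and the checksum field is
      assumed to be zero while the checksum is computed:

      #include <stddef.h>
      #include <stdint.h>

      /* Layout sketch only; a real implementation serializes the
       * fields one by one in network byte order. */
      struct aris_common_header {
          uint8_t  version;           /* currently 1               */
          uint8_t  msg_type;          /* INIT=1 ... ACKNOWLEDGE=6  */
          uint16_t length;            /* entire message, octets    */
          uint16_t checksum;
          uint16_t reserved;
          uint32_t sender_router_id;
          uint32_t sender_seq;        /* flags(16) | sequence(16)  */
          uint32_t sender_session;
          uint32_t receiver_session;  /* 0 in INIT if unknown      */
      };

      /* Standard one's-complement ("IP style") checksum over the
       * complete message. */
      uint16_t aris_checksum(const uint8_t *msg, size_t len)
      {
          uint32_t sum = 0;
          size_t   i;

          for (i = 0; i + 1 < len; i += 2)
              sum += (uint32_t)((msg[i] << 8) | msg[i + 1]);
          if (len & 1)                /* pad odd trailing octet    */
              sum += (uint32_t)(msg[len - 1] << 8);
          while (sum >> 16)           /* fold carries              */
              sum = (sum & 0xffff) + (sum >> 16);
          return (uint16_t)~sum;
      }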

13.2. Common Object Header

      All objects in the ARIS message start with the following object
      header. The objects are placed back-to-back within the ARIS message.
      Each object MUST be padded to a word boundary.

         0                   1                   2                   3
         0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        |    Obj Type   |    Sub Type   |            Length             |
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

        Object Type
            Object type of this object. Currently the following objects
            are defined:
                 LABEL_OBJ       = 1
                 EGRID_OBJ       = 2
                 MPATH_OBJ       = 3
                 RPATH_OBJ       = 4
                 EPATH_OBJ       = 5
                 TUNNEL_OBJ      = 6
                 TIMER_OBJ       = 7
                 ACK_OBJ         = 8
                 INIT_OBJ        = 9

        Sub Type
            Sub type of the object. See object definitions for sub types
            of an object.

        Length
            Length of the object, including this header.

13.3. Label Object

      The selected link-layer label.

      Obj Type = 1, Sub Type = 1  (ATM)

         0                   1                   2                   3
         0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        |E|Res|V|          VPI          |              VCI              |
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

        E-bit
            End-to-End Acknowledge bit.

        V-bit
            Virtual Path switching indicator bit.  If V-bit is 1, only the
            VPI field is significant.  If V-bit is 0, both VPI and VCI are
            significant.

        VPI (12 bits)
            Virtual Path Identifier. If VPI is less than 12-bits it should
            be right justified in this field and preceding bits should be
            set to 0.  If both the VPI and the VCI are 0, the receiver
            allocates the label.

        VCI (16 bits)
            Virtual Connection Identifier. If the VCI is less than 16-bits,
            it should be right justified in the field and the preceding
            bits must be set to 0. If Virtual Path switching is indicated
            by the V-bit, then this field must be ignored by the receiver
            and set to 0 by the sender.  If both the VPI and the VCI are 0,
            the receiver allocates the label.
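
      As an illustration (not part of this specification), packing and
      unpacking the label word (E, Reserved, V, VPI, VCI, most
      significant bit first as drawn above) might look as follows:

      #include <stdint.h>

      uint32_t atm_label_pack(unsigned int e, unsigned int v,
                              unsigned int vpi, unsigned int vci)
      {
          return ((uint32_t)(e   & 0x1)   << 31) |
                 ((uint32_t)(v   & 0x1)   << 28) |
                 ((uint32_t)(vpi & 0xfff) << 16) |
                  (uint32_t)(vci & 0xffff);
      }

      void atm_label_unpack(uint32_t w, unsigned int *e,
                            unsigned int *v, unsigned int *vpi,
                            unsigned int *vci)
      {
          *e   = (w >> 31) & 0x1;     /* End-to-End Acknowledge    */
          *v   = (w >> 28) & 0x1;     /* VP switching indicator    */
          *vpi = (w >> 16) & 0xfff;
          *vci =  w        & 0xffff;
      }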

13.4. Egress Identifier Object

      The egress identifier, in any one of the following formats:

      Obj Type = 2, Sub Type = 1  (IPv4 Address)

         0                   1                   2                   3
         0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        |                    Reserved                   |  Prefix Len   |
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        |                            IPv4 Address                       |
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

        Prefix Len
            Number of significant bits of the IPv4 Network Address field.

        IPv4 Address
            Egress identifier represented by an IPv4 Network address.

      Obj Type = 2, Sub Type = 2  (BGP NEXT_HOP)

         0                   1                   2                   3
         0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        |                     BGP Next Hop IPv4 Address                 |
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

        BGP Next Hop
           The IPv4 address of the BGP Next Hop router.

      Obj Type = 2, Sub Type = 3  (OSPF Router ID)

         0                   1                   2                   3
         0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        |                        OSPF Router Id                         |
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

        OSPF Router ID
            Router identifier of the OSPF node.

      Obj Type = 2, Sub Type = 4  (OSPF Area Border Router)

         0                   1                   2                   3
         0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        |                    Reserved                   |  Prefix Len   |
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        |                     IPv4 Network Address                      |
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        |                  OSPF Area Border Router ID                   |
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

        Prefix Len
            Number of significant bits of the IPv4 Network Address field.

        IPv4 Network Address
            Network Address.

        OSPF Area Border Router ID
            Router identifier of the OSPF ABR node.

      Obj Type = 2, Sub Type = 5  (Multicast Source,Group)

         0                   1                   2                   3
         0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        |                    IPv4 Source Address                        |
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        |               IPv4 Multicast Group Address                    |
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

        IPv4 Source Address
            Source IPv4 address of the multicast stream.

        IPv4 Multicast Group Address
            IPv4 Multicast Group Address.

      Obj Type = 2, Sub Type = 6  (Multicast Group)

         0                   1                   2                   3
         0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        |                    IPv4 Rendezvous Point                      |
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        |               IPv4 Multicast Group Address                    |
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

        IPv4 Rendezvous Point
            IPv4 address of the rendezvous point of the multicast stream.

        IPv4 Multicast Group Address
            IPv4 Multicast Group Address.

      Obj Type = 2, Sub Type = 7  (Flow)

         0                   1                   2                   3
         0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        |                    IPv4 Source Address                        |
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        |                    IPv4 Dest Address                          |
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        |         Source Port           |        Dest Port              |
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        |    Protocol   |   Direction   |        Reserved               |
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

        IPv4 Source Address
            Source IPv4 address.

        IPv4 Destination Address
            Destination IPv4 address.

         Source Port
            Source port.

         Destination Port
            Destination port.

         Protocol
             Protocol type.

         Direction
            Field indicating the direction of the switched path.  Field is
            set to 1 on Downstream; field is set to 2 on Upstream.

      Obj Type = 2, Sub Type = 8  (CIDR list)

         0                   1                   2                   3
         0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        |                     Aggregate Router-Id                       |
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        |             Count             |    Reserved   |  Prefix Len   |
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        |                          IPv4 Address                         |
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        ~                                                               ~
        |                                                               |
        ~                                                               ~
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        |  Prefix Len   |              IPv4 Address...
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
            ...         |                 Pad                           |
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

        Aggregate Router-Id
            The IPv4 Address of the ISR which is the aggregation point
            for the listed set of IPv4 Addresses.

        Count
            The number of IPv4 addresses in this object, which comprise
            the set of addresses that share the aggregation ISR.

        Prefix Len
            Number of significant bits of the IPv4 Address field.

         IPv4 Address
            An IPv4 Address associated with the aggregation ISR.
            Note this value may not be word aligned.

13.5. Multipath Identifier Object

      Uniquely identifies a switched path to an egress identifier.

      Obj Type = 3, Sub Type = 1

         0                   1                   2                   3
         0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        |                    Multipath Identifier                       |
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

        Multipath Identifier
            A unique value that identifies a switched path.

13.6. Router Path Object

      Information related to the path the Establish message traverses.

      Obj Type = 4, Sub Type = 1

         0                   1                   2                   3
         0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        |   Hop Count   |   Reserved    |        Router Id Count        |
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        |                          Router Id 1                          |
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        |                                                               |
        ~                                                               ~
        |                                                               |
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        |                          Router Id n                          |
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

        Hop Count
            The number of hops to the egress identifier.  It is incremented
            at each forwarding ISR.

        Router Id Count
             The number of Router Identifiers in this object.

        Router Id 1 to n-1
            A series of Router Identifiers indicating the path that the message
            has traversed.

        Router Id n
            Router Identifier of the router that sent the current message.
            This must be an adjacent router.

13.7. Explicit Path Object

      Explicitly defined source-routed path.

      Obj Type = 5, Sub Type = 1

         0                   1                   2                   3
         0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        |   Direction   |  Reserved     |          Current              |
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        |                          IPv4 Address                         |
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        |   NH-Cnt      |   NH-Offset   |   NH-Offset   |   ...         |
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        |                                                               |
        ~                                                               ~
        |                                                               |
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        |                           IPv4 Address                        |
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        |   NH-Cnt      |   NH-Offset   |   NH-Offset   |   ...         |
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

        Direction
            Field indicating the direction of the switched path.  Field is
            set to 1 on Downstream; field is set to 2 on Upstream.

         Current
            Pointer to the current IP address location in the explicit path.

         IPv4 Address

            IP address of a node in the switched path.  Note this value
            may not be word aligned.

         NH-Cnt
            The number of next-hops for the corresponding IP Address.
            A value of 0 indicates the end of the list.

          NH-Offset
            A relative offset from the corresponding IP address to the
            location of a next-hop IP address.

13.8. Tunnel Object

      Encapsulation label.

      Obj Type = 6, Sub Type = 1  (IPv4 addresses)

         0                   1                   2                   3
         0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        |                      Link-layer Label                         |
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        |     Count     |                Reserved                       |
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        |  Prefix Len   |       IPv4 Address...
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
            ...         |    Prefix Len   |          IPv4 Addr...
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        ~                                                               ~
        |                                                               |
        ~                                                               ~
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        |  Prefix Len   |                 IPv4 Address...
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
            ...         |                 Pad                           |
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

         Link-layer Label
            The label to use in tunnel encapsulation.

         Count
            The number of IPv4 addresses associated with the given label.

        Prefix Len
            Number of significant bits of the IPv4 Address field.

         IPv4 Address

            IP address that is to use the associated label.  Note this value
            may not be word aligned.

      Obj Type = 6, Sub Type = 2  (S, G)

         0                   1                   2                   3
         0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        |                      Link-layer Label                         |
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        |       Count   |                Reserved                       |
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        |                    IPv4 Source Address                        |
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        |               IPv4 Multicast Group Address                    |
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        ~                                                               ~
        |                                                               |
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        |                    IPv4 Source Address                        |
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        |               IPv4 Multicast Group Address                    |
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

         Link-layer Label
            The label to use in tunnel encapsulation.

         Count
            The number of (S,G) pairs associated with the given label.

         IPv4 Source Address
            Source IPv4 address of the multicast stream.

         IPv4 Multicast Group Address
            IPv4 Multicast Group Address.

13.9. Timer Object

      Timeout value.

      Obj Type = 7, Sub Type = 1

         0                   1                   2                   3
         0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        |                      Timer Interval                           |
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

        Timer Interval
            A timeout interval (in seconds).  When present in an Init message,
            it is the Neighbor Dead Interval.  This interval is the maximum
            number of seconds that may elapse between received ARIS messages.
            When present in an Establish message, it is the
            EstablishRefresh Interval.  This interval is the maximum
            number of seconds that may
            elapse between egress identifier refresh Establish messages.
            This value MUST be greater than 0.

13.10. Acknowledge Message Object

      Status of an ARIS message.

      Obj Type = 8, Sub Type = 1

         0                   1                   2                   3
         0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        |                 Acknowledge Sequence Number                   |
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        |   Obj Type    |     Reserved  |             Error             |
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

        Acknowledge Sequence Number
             The sequence number of the originating message that is being
             acknowledged.

        Obj Type
              Type of message being acknowledged

        Error
              An error code.  A value of 0 indicates no error.

13.11. Init Message Object

      Information pertaining to neighbor initialization.

      Obj Type = 9, Sub Type = 1

         0                   1                   2                   3
         0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        |  Res  |    Minimum VPI        |        Minimum VCI            |
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        |  Res  |    Maximum VPI        |        Maximum VCI            |
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

        Res
            Reserved.

        Minimum VPI (12 bits)
            Minimum Virtual Path Identifier that is supported on the switch.
            If VPI is less than 12-bits it should be right justified in this
            field and preceding bits should be set to 0.

        Minimum VCI (16 bits)
            Minimum Virtual Connection Identifier that is supported on the
            switch.  If VCI is less than 16-bits it should be right justified
            in this field and preceding bits should be set to 0.

        Maximum VPI (12 bits)
            Maximum Virtual Path Identifier that is supported on the switch.
            If VPI is less than 12-bits it should be right justified in this
            field and preceding bits should be set to 0.

        Maximum VCI (16 bits)
            Maximum Virtual Connection Identifier that is supported on the
            switch.  If VCI is less than 16-bits it should be right justified
            in this field and preceding bits should be set to 0.

Internetworking Over NBMA                               James V. Luciani
INTERNET-DRAFT                                            (Bay Networks)
<draft-ietf-ion-scsp-01.txt>                          Grenville Armitage
                                                              (Bellcore)
                                                            Joel Halpern
                                                             (Newbridge)
                                                  Expires September 1997





              Server Cache Synchronization Protocol (SCSP)

1. Introduction

2. Overview

   In order to give a frame of reference for the following discussion,
   the terms Local Server (LS), Directly Connected Server (DCS), and
   Remote Server (RS) are introduced.  The LS is the server under
   scrutiny; i.e., all statements are made from the perspective of the
   LS when discussing the SCSP protocol. The DCS is a server which is
   directly connected to the LS;  e.g., there exists a VC between the LS
   and DCS.  Thus, every server is a DCS from the point of view of every
   other server which connects to it directly, and every server is an LS
   which has zero or more DCSs directly connected to it. From the
   perspective of an LS, an RS is a server, separate from the LS, which
   is not directly connected to the LS (i.e., an RS is always two or
   more hops away from an LS whereas a DCS is always one hop away from
   an LS).

   SCSP contains three sub protocols: the "Hello" protocol, the "Cache
   Alignment" protocol, and the "Cache State Update" protocol.  The
   "Hello" protocol is used to ascertain whether a DCS is operational
   and whether the connection between the LS and DCS is bidirectional,
   unidirectional, or non-functional.  The "Cache Alignment" (CA)
   protocol allows an LS to synchronize its entire cache with that of
   the cache of its DCSs. The "Cache State Update" (CSU) protocol is
   used to update the state of cache entries in servers for a given SG.
   Sections 2.1, 2.2, and 2.3 contain a more in-depth explanation of the
   Hello, CA, and CSU protocols and the messages they use.

   SCSP based synchronization is performed on a per protocol instance
   basis.  That is, a separate instance of SCSP is run for each instance
   of the given protocol running in a given box.  The protocol is
   identified in SCSP via a Protocol ID and the instance of the protocol
   is identified by a Server Group ID (SGID).  Thus the PID/SGID pair
   uniquely identifies an instance of SCSP.  In general, this is not an
   issue since it is seldom the case that many instances of a given
   protocol (which is distributed and needs cache synchronization) are
   running within the same physical box.  However, when this is the
   case, there is a mechanism called the Family ID (described briefly in
   the Hello Protocol) which enables a substantial reduction in
   maintenance traffic at little real cost in terms of control.  The use
   of the Family ID mechanism, when appropriate for a given protocol
   which is using SCSP, will be fully defined in the given SCSP protocol
   specific specification.


                       +---------------+
                       |               |
              +-------@|     DOWN      |@-------+
              |        |               |        |
              |        +---------------+        |
              |            |       @            |
              |            |       |            |
              |            |       |            |
              |            |       |            |
              |            @       |            |
              |        +---------------+        |
              |        |               |        |
              |        |    WAITING    |        |
              |     +--|               |--+     |
              |     |  +---------------+  |     |
              |     |    @           @    |     |
              |     |    |           |    |     |
              |     @    |           |    @     |
            +---------------+     +---------------+
            | BIDIRECTIONAL |----@| UNIDIRECTIONAL|
            |               |     |               |
            |  CONNECTION   |@----|  CONNECTION   |
            +---------------+     +---------------+


          Figure 1: Hello Finite State Machine (HFSM)


2.1  Hello Protocol

   "Hello" messages are used to ascertain whether a DCS is operational
   and whether the connections between the LS and DCS are bidirectional,
   unidirectional, or non-functional. In order to do this, every LS MUST
   periodically send a Hello message to its DCSs.

   An LS must be configured with a list of NBMA addresses which
   represent the addresses of peer servers in a SG to which the LS
   wishes to have a direct connection for the purpose of running SCSP;
   that is, these addresses are the addresses of would-be DCSs.  The
   mechanism for configuring an LS with these NBMA addresses is beyond
   the scope of this document, although one possible mechanism would
   be an autoconfiguration server.

   An LS has a Hello Finite State Machine (HFSM) associated with each of
   its DCSs (see Figure 1) for a given SG, and the HFSM monitors the
   state of the connectivity between the servers.

   The HFSM starts in the "Down" State and transitions to the "Waiting"
   State after NBMA level connectivity has been established.  Once in
   the Waiting State, the LS starts sending Hello messages to the DCS.
   The Hello message includes: a Sender ID which is set to the LS's ID
   (LSID), zero or more Receiver IDs which identify the DCSs from which
   the LS has heard a Hello message, and a HelloInterval and DeadFactor
   which will be described below.   At this point, the DCS may or may
   not already be sending its own Hello messages to the LS.

   When the LS receives a Hello message from one of its DCSs, the LS
   saves the Sender ID from that message and checks whether its own
   LSID appears in one of the message's Receiver ID fields.  If the
   LSID is in one of the Receiver ID fields then the LS transitions
   the HFSM to the Bidirectional Connection State; otherwise, it
   transitions the HFSM to the Unidirectional Connection State.  The
   saved Sender ID is the DCS's ID
   (DCSID).  The next time that the LS sends its own Hello message to
   the DCS, the LS will check the saved DCSID against a list of Receiver
   IDs which the LS uses when sending the LS's own Hello messages.  If
   the DCSID is not found in the list of Receiver IDs then it is added
   to that list before the LS sends its Hello message.

   Hello messages also contain a HelloInterval and a DeadFactor.  The
   HelloInterval advertises the time (in seconds) between the sending
   of consecutive Hello messages by the server which is sending the
   "current" Hello message.  Thus, if the time between reception of
   Hello messages from a DCS exceeds the HelloInterval advertised by
   that DCS, the next Hello message is considered late by the LS.  If
   the LS does not receive a Hello message containing the LS's LSID in
   one of the Receiver ID fields within HelloInterval*DeadFactor
   seconds (where HelloInterval and DeadFactor were advertised by the
   DCS in a previous Hello message), then the LS MUST consider the DCS
   to be stalled.  At that point, one of two things happens: 1) if any
   Hello messages have been received from the DCS during the last
   HelloInterval*DeadFactor seconds then the LS should transition the
   HFSM for that DCS to the Unidirectional Connection State;
   otherwise, 2) the LS should transition the HFSM for that DCS to the
   Waiting State and remove the DCSID from the Receiver ID list.
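
   The stall check can be stated compactly in code.  The sketch below
   is illustrative only; the per-DCS timer state it assumes is an
   implementation matter, not something SCSP specifies:

      import time

      def check_dcs(hfsm, now=None):
          # hfsm is assumed to record the HelloInterval and DeadFactor
          # advertised by the DCS, the time a Hello naming our LSID was
          # last received, and the time any Hello was last received.
          now = time.time() if now is None else now
          deadline = hfsm.hello_interval * hfsm.dead_factor
          if now - hfsm.last_hello_naming_us <= deadline:
              return                        # the DCS is not stalled
          if now - hfsm.last_hello_any <= deadline:
              hfsm.state = "UNIDIRECTIONAL CONNECTION"
          else:
              hfsm.state = "WAITING"
              hfsm.receiver_ids.discard(hfsm.dcsid)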

   Note that the Hello Protocol is on a per PID/SGID basis. Thus, for
   example, if there are two servers (one in SG A and the other in SG B)
   associated with an NBMA address X and another two servers (also one
   in SG A and the other in SG B) associated with NBMA address Y and
   there is a suitable point-to-point VC between the NBMA addresses then
   there are two HFSMs running on each side of the VC (one per
   PID/SGID).

   Hello messages contain a list of Receiver IDs instead of a single
   Receiver ID in order to make use of point to multipoint connections.
   While there is an HFSM per DCS, an LS MUST send only a single Hello
   message to its DCSs attached as leaves of a point to multipoint
   connection.  The LS does this by including DCSIDs in the list of
   Receiver IDs when the LS sends its next Hello message.  Only the
   DCSIDs of non-stalled DCSs from which the LS has heard a Hello
   message are included.

   Any abnormal event, such as receiving a malformed SCSP message,
   causes the HFSM to transition to the Waiting State; however, a loss
   of NBMA connectivity causes the HFSM to transition to the Down State.
   Until the HFSM is in the Bidirectional Connection State, any
   properly formed SCSP messages other than Hello messages must be
   ignored (this is for the case where, for example, a point to
   multipoint connection is involved).

Appendix B:  SCSP Message Formats

   This section of the appendix includes the message formats for SCSP.
   SCSP packets are LLC/SNAP encapsulated with LLC=0xAA-AA-03,
   OUI=0x00-00-5e, and PID=0x00-05.

   Every SCSP packet has three parts: the fixed part, the mandatory
   part, and the extensions part.  The fixed part exists in every
   packet and is shown below.  The mandatory part is specific to the
   particular message type (i.e., CA, CSU Request/Reply, Hello, CSUS),
   and it includes (among other packet elements) a Mandatory Common
   Part and zero or more records, each of which contains
   information pertinent to the state of a particular cache entry
   (except in the case of a Hello message) whose information is being
   synchronized within a SG. The extensions part contains the set of
   extensions for the SCSP message.

   In the following message formats, "unused" fields are set to zero
   when transmitting a message and are ignored on receipt of a
   message.

B.1 Fixed Part

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |    Version    |  Type Code    |        Packet Size            |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |          Checksum             |      Start Of Extensions      |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Version
     This is the version of the SCSP protocol being used.  The current
     version is 1.

   Type Code
     This is the code for the message type (e.g., CA (1), CSU Request
     (2), CSU Reply (3), CSUS (4), Hello (5)).

   Packet Size
     The total length of the SCSP packet, in octets (excluding link
     layer and/or other protocol encapsulation).

   Checksum
     The standard IP checksum over the entire SCSP packet (starting with
     the fixed header).

   Start Of Extensions
     This field is coded as zero when no extensions are present in the
     message.  If extensions are present then this field will be coded
     with the offset from the top of the fixed header to the beginning
     of the first extension.
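
   For illustration, the fixed part and the checksum might be produced
   as follows (a sketch, not a normative algorithm; the function names
   are invented here):

      import struct

      def ip_checksum(data):
          # Standard IP checksum: ones-complement of the ones-complement
          # sum of 16-bit words, computed with the Checksum field zeroed.
          if len(data) % 2:
              data += b"\x00"
          total = sum(struct.unpack("!%dH" % (len(data) // 2), data))
          while total >> 16:
              total = (total & 0xFFFF) + (total >> 16)
          return ~total & 0xFFFF

      def pack_fixed_part(type_code, packet_size, start_of_extensions=0):
          # Version = 1; Checksum (octets 4-5) is filled in later.
          return struct.pack("!BBHHH", 1, type_code, packet_size, 0,
                             start_of_extensions)

      def finalize(packet):
          # Compute the checksum over the entire SCSP packet, then store
          # it in the Checksum field of the fixed header.
          csum = ip_checksum(packet)
          return packet[:4] + struct.pack("!H", csum) + packet[6:]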


B.2.0 Mandatory Part

   The mandatory part of the SCSP packet contains the operation specific
   information for a given message type (e.g., SCSP Cache State Update
   Request/Reply, etc.), and it includes (among other packet elements) a
   Mandatory Common Part (described in Section B.2.0.1) and zero or more
   records each of which contains information pertinent to the state of
   a particular cache entry (except in the case of a Hello message)
   whose information is being synchronized within a SG.  These records
   may, depending on the message type, be either Cache State
   Advertisement Summary (CSAS) Records (described in Section B.2.0.2)
   or Cache State Advertisement (CSA) Records (described in Section
   B.2.2.1).  CSA Records contain a summary of a cache entry's
   information (i.e., a CSAS Record) plus some additional client/server
   protocol specific information.  The mandatory common part format
   and the CSAS Record format are shown immediately below, prior to
   showing their use in SCSP messages, in order to prevent replication
   within the message descriptions.

B.2.0.1 Mandatory Common Part

   Sections B.2.1 through B.2.5 have a substantial overlap in format.
   This overlapping format is called the mandatory common part and its
   format is shown below:

   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |         Protocol ID           |        Server Group ID        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |            unused             |             Flags             |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | Sender ID Len | Recvr ID Len  |       Number of Records       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                  Sender ID (variable length)                  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                Receiver ID (variable length)                  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Protocol ID
     This field contains an identifier which identifies the
     client/server protocol which is making use of SCSP for the given
     message.  The assignment of Protocol IDs for this field is given
     over to IANA.  IANA will accept any and all requests for value
     assignment as long as the client/server protocol specific document
     exists.  Protocols with current documents have the following
     defined values:
       1 - ATMARP
       2 - NHRP
       3 - MARS
       4 - DHCP
       5 - LNNI

   Server Group ID
     This ID uniquely identifies the instance of a given
     client/server protocol for which servers are being synchronized.

   Flags
     The Flags field is message specific, and its use will be described
     in the specific message format sections below.

   Sender ID Len
     This field holds the length in octets of the Sender ID.

   Recvr ID Len
     This field holds the length in octets of the Receiver ID.

   Number of Records
     This field contains the number of additional records associated
     with the given message.  The exact format of these records is
     specific to the message and will be described for each message type
     in the sections below.

   Sender ID
     This is an identifier assigned to the server which is sending the
     given message.  One possible assignment might be the protocol
     address of the sending server.

   Receiver ID
     This is an identifier assigned to the server which is to receive
     the given message.  One possible assignment might be the protocol
     address of the server which is to receive the given message.
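
   For illustration only (the function name and calling convention are
   invented here), the mandatory common part might be encoded as:

      import struct

      def pack_common_part(pid, sgid, flags, sender_id, receiver_id,
                           num_records):
          # sender_id and receiver_id are byte strings whose lengths
          # are carried in the Sender ID Len and Recvr ID Len fields.
          header = struct.pack("!HHHHBBH", pid, sgid, 0, flags,
                               len(sender_id), len(receiver_id),
                               num_records)
          return header + sender_id + receiver_id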

B.2.5 Hello:

   The Hello message is used to check connectivity between the sending
   server (the LS) and one of its directly connected neighbor servers
   (the DCSs).  The Hello message type code is 5.  The Hello message
   mandatory part format is as follows:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |         HelloInterval         |          DeadFactor           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |            unused             |          Family ID            |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                    Mandatory Common Part                      |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                 Additional Receiver ID Record                 |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
                               .........
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                 Additional Receiver ID Record                 |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   HelloInterval
     The hello interval advertises the time between sending of
     consecutive Hello Messages.  If the LS does not receive a Hello
     message from the DCS (which contains the LSID as a Receiver ID)
     within the HelloInterval advertised by the DCS then the DCS's Hello
     is considered to be late.  Also, the LS MUST send its own Hello
     message to a DCS within the HelloInterval which it advertised to
     the DCS in the LS's previous Hello message to that DCS (otherwise
     the DCS would consider the LS's Hello to be late).

   DeadFactor
     This is a multiplier to the HelloInterval. If an LS does not
     receive a Hello message which contains the LS's LSID as a Receiver
     ID within the interval HelloInterval*DeadFactor from a given DCS,
     which advertised the HelloInterval and DeadFactor in a previous
     Hello message, then the LS MUST consider the DCS to be stalled; at
     this point, one of two things MUST happen: 1) if the LS has
     received any Hello messages from the DCS during this time then the
     LS transitions the corresponding HFSM to the Unidirectional State;
     otherwise, 2) the LS transitions the corresponding HFSM to the
     Waiting State.

   Family ID
     This is an opaque bit string which is used to refer to an aggregate
     of Protocol ID/SGID pairs.  Only a single HFSM is run for all
     Protocol ID/SGID pairs assigned to a Family ID.  Thus, there is a
     one to many mapping between the single HFSM and the CAFSMs
     corresponding to each of the Protocol ID/SGID pairs.  This might
     have the net effect of substantially reducing HFSM maintenance
     traffic.  See the protocol specific SCSP documents for further
     details.

   Mandatory Common Part
     The mandatory common part is described in detail in Section
     B.2.0.1.  There are two fields in the mandatory common part whose
     codings are specific to a given message type.  These fields are the
     "Number of Records" field and the "Flags" field.

     Number of Records
       The Number of Records field of the mandatory common part for the
       Hello message contains the number of "Additional Receiver ID"
       records which are included in the Hello.  Additional Receiver ID
       records contain a length field and a Receiver ID field.  Note
       that the count in "Number of Records" does NOT include the
       Receiver ID which is included in the Mandatory Common Part.

     Flags
       Currently, there are no flags defined for the Flags field of the
       mandatory common part for the Hello message.

     All other fields of the mandatory common part are coded as
     described in Section B.2.0.1.

   Additional Receiver ID Record
     This record contains a length field followed by a Receiver ID.
     Since it is conceivable that the length of a given Receiver ID may
     vary even within an SG, each additional Receiver ID heard (beyond
     the first one) will have both its length in bytes and value encoded
     in an "Additional Receiver ID Record".  Receiver IDs are IDs of a
     DCS from which the LS has heard a recent Hello (i.e., within
     DeadFactor*HelloInterval as advertised by the DCS in a previous
     Hello message).

     The format for this record is as follows:

      0                   1                   2                   3
      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |  Rec ID Len   |                 Receiver ID                   |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+


   If the LS has not heard from any DCS then the LS sets the Hello
   message fields as follows: Recvr ID Len is set to zero and no storage
   is allocated for the Receiver ID in the Mandatory Common Part,
   "Number of Records" is set to zero, and no storage is allocated for
   "Additional Receiver ID Records".

   If the LS has heard from exactly one DCS then the LS sets the Hello
   message fields as follows: the Receiver ID of the DCS which was heard
   and the length of that Receiver ID are encoded in the Mandatory
   Common Part, "Number of Records" is set to zero, and no storage is
   allocated for "Additional Receiver ID Records".

   If the LS has heard from two or more DCSs then the LS sets the Hello
   message fields as follows: the Receiver ID of the first DCS which was
   heard and the length of that Receiver ID are encoded in the
   Mandatory Common Part, "Number of Records" is set to the number of
   "Additional" DCSs heard, and for each additional DCS an "Additional
   Receiver ID Record" is formed and appended to the end of the Hello
   message.
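
   The three cases above reduce to one rule: the first Receiver ID
   heard goes in the Mandatory Common Part, and each further one
   becomes an Additional Receiver ID Record.  A sketch (illustrative,
   not normative):

      import struct

      def hello_receiver_fields(heard):
          # heard: Receiver IDs (byte strings) of the non-stalled DCSs
          # from which the LS has heard a recent Hello.
          if not heard:
              return b"", 0, b""       # Recvr ID Len = 0, no records
          first, rest = heard[0], heard[1:]
          records = b"".join(struct.pack("!B", len(r)) + r
                             for r in rest)
          return first, len(rest), records

   The three values returned are the Receiver ID for the Mandatory
   Common Part, the "Number of Records" field, and the Additional
   Receiver ID Records to append to the Hello message.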


B.3  Extensions Part

   The Extensions Part, if present, carries one or more extensions in
   {Type, Length, Value} triplets.

   Extensions have the following format:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |            Type               |           Length              |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                         Value...                              |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Type
     The extension type code (see below).

   Length
     The length in octets of the value (not including the Type and
     Length fields;  a null extension will have only an extension header
     and a length of zero).

   When extensions exist, the extensions part is terminated by the End
   of Extensions extension, having Type = 0 and Length = 0.

   Extensions may occur in any order but any particular extension type
   may occur only once in an SCSP packet.  An LS MUST NOT change the
   order of extensions.
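
   An illustrative (non-normative) parser for the extensions part
   follows; it stops at the End Of Extensions extension and enforces
   the once-per-packet rule, allowing the Vendor-Private exception
   noted in Section B.3.2:

      import struct

      def parse_extensions(buf):
          exts, seen, off = [], set(), 0
          while off + 4 <= len(buf):
              etype, elen = struct.unpack_from("!HH", buf, off)
              off += 4
              if etype == 0 and elen == 0:
                  break                # End Of Extensions
              if etype in seen and etype != 2:
                  # Type 2 (Vendor-Private) may recur with distinct
                  # Vendor IDs; any other type may occur only once.
                  raise ValueError("repeated extension type %d" % etype)
              seen.add(etype)
              exts.append((etype, buf[off:off + elen]))
              off += elen
          return exts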


B.3.0  The End Of Extensions

    Type = 0
    Length = 0

   When extensions exist, the extensions part is terminated by the End
   Of Extensions extension.


B.3.1  SCSP Authentication Extension

    Type = 1
    Length = variable

   The SCSP Authentication Extension is carried in SCSP packets to
   convey authentication information between an LS and a DCS in the same
   SG.

   Authentication is done pairwise on an LS to DCS basis;  i.e., the
   authentication extension is generated at each LS. If a received
   packet fails the authentication test then an "abnormal event" has
   occurred.  Any "abnormal event" causes the HFSM associated with the
   server from which the packet was received to transition to the
   Waiting State.

   The presence or absence of the Authentication Extension is a local
   matter.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                     Authentication Type                       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+ Authentication Data... -+-+-+-+-+-+-+-+-+-+
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   The Authentication Type field identifies the authentication method in
   use.  Currently assigned values are:

      1 - Cleartext Password
      2 - Keyed MD5

   All other values are reserved.

   The Authentication Data field contains the type-specific
   authentication information.

   In the case of Cleartext Password Authentication, the Authentication
   Data consists of a variable length password.

   In the case of Keyed MD5 Authentication, the Authentication Data
   contains a 16 byte MD5 digest.  This digest is computed over the
   fixed part and the mandatory part of the SCSP packet with the
   authentication key appended to the end of the data being digested;
   the authentication key itself is not transmitted with the packet.
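
   For illustration (a sketch only; hashlib is simply one available
   MD5 implementation):

      import hashlib

      def keyed_md5(fixed_and_mandatory_parts, key):
          # The key is appended to the digested data only; it is never
          # transmitted.  The 16 byte result is the Authentication
          # Data.
          return hashlib.md5(fixed_and_mandatory_parts + key).digest()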


B.3.2  SCSP Vendor-Private Extension

    Type = 2
    Length = variable

   The SCSP Vendor-Private Extension is carried in SCSP packets to
   convey vendor-private information between an LS and a DCS in the same
   SG and is thus of limited use.  If a finer granularity (e.g., CSA
   record level) is desired then the given client/server protocol
   specific SCSP document MUST define such a mechanism.  Obviously,
   however, such a protocol specific mechanism might look exactly like
   this extension.  The Vendor-Private Extension MUST NOT appear more
   than once in an SCSP packet for a given Vendor ID value.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                  Vendor ID                    |  Data....     |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Vendor ID
     802 Vendor ID as assigned by the IEEE [10].

   Data
     The remaining octets after the Vendor ID in the payload are
     vendor-dependent data.

   If the receiver does not handle this extension, or if the Vendor ID
   in the extension does not match the receiver's Vendor ID, then the
   extension may be completely ignored by the receiver.

Network Working Group                                        Bruce Davie
Internet Draft                                               Paul Doolan
Expiration Date: July 1997                               Jeremy Lawrence
                                                        Keith McCloghrie
                                                           Yakov Rekhter
                                                              Eric Rosen
                                                          George Swallow

                                                     Cisco Systems, Inc.

                                                            January 1997

                     Use of Tag Switching With ATM

                  draft-davie-tag-switching-atm-01.txt

Abstract

   A label switching architecture is described in [4].  Label Switching
   enables the use of ATM Switches as Label Switching Routers. The ATM
   Switches run network layer routing algorithms (such as OSPF, IS-IS,
   etc.), and their data forwarding is based on the results of these
   routing algorithms. No ATM-specific routing or addressing is needed.

   This document describes how the label switching architecture is applied
   to ATM switches.

1. Introduction

   A label switching architecture is described in [4]. It is possible to
   use ATM switches as label switching routers. Such ATM switches run
   network layer routing algorithms (such as OSPF, IS-IS, etc.), and
   their forwarding is based on the results of these routing algorithms.
   No ATM-specific routing or addressing is needed.

   When an ATM switch is used for label switching, the label on which
   forwarding decisions are based is carried in the VCI and/or VPI
   fields. (It is possible to carry multiple labels in the VCI and/or VPI
   fields, but the scope of this document is restricted to the case of a
   single label.)

   The characteristics of ATM switches require some specialized
   procedures and conventions to support label switching. This document
   describes those aspects of label switching which are specific to ATM.

2. Definitions

   A Label Switching Router (LSR) is a device which implements the
   label switching control and forwarding components described in [1].

   A label switching controlled ATM (TC-ATM) interface is an ATM interface
   controlled by the label switching control component. Packets traversing
   such an interface carry labels in the VCI and/or VPI field.

   An ATM-LSR is a LSR with a number of TC-ATM interfaces which forwards
   cells between these interfaces using labels carried in the VCI and/or
   VPI field.

   A frame-based LSR is a LSR which forwards complete frames between its
   interfaces. Note that such a LSR may have zero, one or more TC-ATM
   interfaces.

   An ATM-LSR cloud is a set of ATM-LSRs which are mutually
   interconnected by TC-ATM interfaces.

   The Edge Set of an ATM-LSR cloud is the set of frame-based LSRs which
   are connected to the cloud by TC-ATM interfaces.

   VC-merge is the process by which a switch receives cells on several
   incoming VCIs and transmits them on a single outgoing VCI without
   causing the cells of different AAL5 PDUs to become interleaved.

3. Special Characteristics of ATM Switches

   While the label switching architecture permits considerable flexibility
   in LSR implementation, an ATM-LSR is constrained by the capabilities
   of the (possibly pre-existing) hardware and the restrictions on such
   matters as cell format imposed by ATM standards. Because of these
   constraints, some special procedures are required for ATM-LSRs.

   Some of the key features of ATM switches that affect their behavior
   as LSRs are:

      - the label swapping function is performed on fields (the VCI
      and/or VPI) in the cell header; this dictates the size and
      placement of the label(s) in a packet.

      - multipoint-to-point and multipoint-to-multipoint VCs are
      generally not supported. This means that most switches cannot
      support `VC-merge' as defined above.

      - there is generally no capability to perform a `TTL-decrement'
      function as is performed on IP headers in routers.

   This document describes ways of applying label switching to ATM
   switches which work within these constraints.

4. Label Switching Control Component for ATM

   To support label switching an ATM switch must implement the control
   component of label switching. This consists primarily of label allocation
   and maintenance procedures. Label binding information is communicated
   by several mechanisms, notably the Label Distribution Protocol (LDP)
   [2].

   Since the label switching control component uses information
   learned directly from network layer routing protocols, the switch
   must participate as a peer in these protocols (e.g., OSPF, IS-IS).

   In some cases, LSRs make use of other protocols (e.g. RSVP, PIM, BGP)
   to distribute label bindings. In these cases, an ATM LSR would need to
   participate in these protocols.

   Support of label switching on an ATM switch does not require the switch
   to support the ATM control component defined by the ITU and ATM Forum
   (e.g., UNI, PNNI). An ATM-LSR may optionally respond to OAM cells.

5. Hybrid Switches (Ships in the Night)

   The existence of the label switching control component on an ATM switch
   does not preclude the ability to support the ATM control component
   defined by the ITU and ATM Forum on the same switch and the same
   interfaces.  The two control components, label switching and the
   one defined by the ITU/ATM Forum, would operate independently.

   Definition of how such a device operates is beyond the scope of this
   document.  However, only a small amount of information needs to be
   consistent between the two control components, such as the portions
   of the VPI/VCI space which are available to each component.

6. Use of VPI/VCIs

   Label switching is accomplished by associating labels with routes and
   using the label value to forward packets, including determining the
   value of any replacement label.  See [1] for further details. In an
   ATM-LSR, the label is carried in the VPI and/or VCI field. Just as in
   conventional ATM, for a cell arriving at an interface, the VPI/VCI is
   looked up, replaced, and the cell is switched.

   ATM-LSRs may be connected by ATM virtual paths to enable
   interconnection of ATM-LSRs over a cloud of conventional ATM
   switches. In this case, the label is carried in the VCI field.

   For two connected ATM-LSRs, a connection must be available for LDP.
   The default is for this connection to be on VPI 0, VCI 32. For ATM-
   LSRs connected by a VPI of value x, the default for the LDP
   connection is VPI x, VCI 32. Additionally, for all VPI values, VCIs 0
   - 32 are not used as labels.

   With the exception of these reserved values, the VPI/VCI values used
   in the two directions of the link may be treated as independent
   spaces.

   The allowable ranges of VPI/VCIs are always communicated through LDP.
   If more than one VPI is used for label switching, the allowable range
   of VCIs may be different for each VPI, and each range is communicated
   through LDP.
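
   The defaults above can be summarized as follows (an illustrative
   sketch, not normative; the function names are invented here):

      def default_ldp_vc(vpi=0):
          # Directly connected ATM-LSRs use VPI 0, VCI 32 for LDP;
          # ATM-LSRs connected by a VP of value x use VPI x, VCI 32.
          return (vpi, 32)

      def usable_as_label(vci):
          # For every VPI, VCIs 0 - 32 are never used as labels.
          return vci > 32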

7. Label Allocation and Maintenance Procedures

   ATM-LSRs use the downstream-on-demand allocation mechanism described
   in [1]. The procedures for label allocation depend on whether the
   switches support VC-merge or not. We therefore describe the two
   scenarios in turn. We begin by describing the behavior of members of
   the Edge Set of an ATM-LSR cloud; these edge LSRs are not themselves
   ATM-LSRs, and their behavior is the same whether the cloud contains
   VC-merge capable LSRs or not.

7.1. Edge LSR Behavior

   Consider a member of the Edge Set of an ATM-LSR cloud. Assume that,
   as a result of its routing calculations, it selects an ATM-LSR as the
   next hop of a certain route, and that the next hop is reachable via a
   TC-ATM interface. The Edge LSR uses LDP's BIND_REQUEST to request a
   label binding from the next hop.  The hop count field in the request is
   set to 1.  Once the Edge LSR receives the label binding information,
   the label is used as an outgoing label. The binding received by the edge
   LSR may contain a hop count, which represents the number of hops a
   packet will take to cross the ATM-LSR cloud when using this label. The
   edge LSR may either

       - use this hop count to decrement the TTL of packets before
      transmitting them over the cloud

       - decrement the TTL of packets by one before transmitting them
      over the cloud.

   The choice between these two options should be made based on local
   configuration.
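
   An illustrative sketch of the two options (the names are invented
   here):

      def ttl_on_cloud_entry(ttl, cloud_hop_count, use_hop_count):
          # Decrement either by the hop count learned in the binding
          # or by one, according to local configuration.
          return ttl - (cloud_hop_count if use_hop_count else 1)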

   When a member of the Edge Set of the ATM-LSR cloud receives a label
   binding request from an ATM-LSR, it allocates a label, creates a
   new entry in its Label Information Base (LIB), places that label in
   the incoming label component of the entry, and returns (via LDP) a
   binding containing the allocated label back to the peer that
   originated the request.  It sets the hop count in the binding to 1.

   When a routing calculation causes an Edge LSR to change the next hop
   for a route, and the former next hop was in the ATM-LSR cloud, the
   Edge LSR should notify the former next hop (via LDP) that the label
   binding associated with the route is no longer needed.

7.2. Conventional ATM Switches (non-VC-merge)

   When an ATM-LSR receives (via LDP) a label binding request for a
   certain route from a peer connected to the ATM-LSR over a TC-ATM
   interface, the ATM-LSR takes the following actions:

       - it allocates a label, creates a new entry in its Label Information
       Base (LIB), and places that label in the incoming label component of
      the entry;

       - it requests (via LDP) a label binding from the next hop for that
      route;

       - it returns (via LDP) a binding containing the allocated
      incoming label back to the peer that originated the request.

   The hop count field in the request that the ATM-LSR sends (to the
   next hop LSR) is set to the hop count field in the request that it
   received from the upstream LSR plus one.  Once the ATM-LSR receives
   the binding from the next hop, it places the label from the binding
   into the outgoing label component of the LIB entry.

   The ATM-LSR may choose to wait for the request to be satisfied from
   downstream before returning the binding upstream (a "conservative"
   approach).  In this case, the ATM-LSR increments the hop count it
   received from downstream and uses this value in the binding it
   returns upstream. If the value of the hop count equals MAX_HOP_COUNT
   the ATM-LSR should notify the upstream neighbor that it could not
   satisfy the binding request.

   Alternatively, the ATM-LSR may return the binding upstream without
   waiting for a binding from downstream (an "optimistic" approach). In
   this case, it uses a reserved value for hop count in the binding,
   indicating that it is unknown. The correct value for hop count will
   be returned later, as described below.

   Since both the conservative and the optimistic approaches have
   advantages and disadvantages, this is left as an implementation
   choice.
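
   Either way, the hop count arithmetic is the same.  In the sketch
   below (illustrative only), MAX_HOP_COUNT and the "unknown" encoding
   are assumptions of the sketch, their actual values being a matter
   for LDP:

      MAX_HOP_COUNT = 32    # assumed limit, for illustration only
      HOP_UNKNOWN = 0       # assumed reserved "unknown" encoding

      def hop_count_for_upstream(downstream_hop_count):
          # Conservative approach: increment the hop count received
          # from downstream before returning the binding upstream, and
          # refuse the binding once MAX_HOP_COUNT is reached.  An
          # optimistic LSR returns HOP_UNKNOWN now and corrects it
          # later.
          if downstream_hop_count == HOP_UNKNOWN:
              return HOP_UNKNOWN
          hop_count = downstream_hop_count + 1
          if hop_count == MAX_HOP_COUNT:
              raise ValueError("cannot satisfy binding request")
          return hop_count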

   Note that an ATM-LSR, or a member of the edge set of an ATM-LSR
   cloud, may receive multiple binding requests for the same route from
   the same ATM-LSR. It must generate a new binding for each request
   (assuming adequate resources to do so), and retain any existing
   binding(s). For each request received, an ATM-LSR should also
   generate a new binding request toward the next hop for the route.

   When a routing calculation causes an ATM-LSR to change the next hop
   for a route, the ATM-LSR should notify the former next hop (via LDP)
   that the label binding associated with the route is no longer needed.

   When a LSR receives a notification that a particular label binding is
   no longer needed, the LSR may deallocate the label associated with the
   binding, and destroy the binding. In the case where an ATM-LSR
   receives such notification and destroys the binding, it should notify
   the next hop for the route that the label binding is no longer needed.
   If a LSR does not destroy the binding, it may re-use the binding only
   if it receives a request for the same route with the same hop count
   as the request that originally caused the binding to be created.

   When a route changes, the label bindings are re-established from the
   point where the route diverges from the previous route.  LSRs
   upstream of that point are (with one exception, noted below)
   oblivious to the change.  Whenever a LSR changes its next hop for a
   particular route, if the new next hop is an ATM-LSR or a member of
   the edge set reachable via a TC-ATM interface, then for each entry in
   its LIB associated with the route the LSR should request (via LDP) a
   binding from the new next hop.

   When an ATM-LSR receives a label binding from a downstream neighbor, it
   may already have provided a corresponding label binding for this route
   to an upstream neighbor, either because it is operating
   optimistically or because the new binding from downstream is the
   result of a routing change. In this case, it should extract the hop
   count from the new binding and increment it by one. If the new hop
   count is different from that which was previously conveyed to the
   upstream neighbor (including the case where the upstream neighbor was
   given the value `unknown') the ATM-LSR must notify the upstream
   neighbor of the change. Each ATM-LSR in turn increments the hop count
   and passes it upstream until it reaches the ingress Edge LSR. If at
   any point the value of the hop count equals MAX_HOP_COUNT, the ATM-
   LSR should withdraw the binding from the upstream neighbor.

   Whenever an ATM-LSR originates a label binding request to its next hop
   LSR as a result of receiving a label binding request from another
   (upstream) LSR, and the request to the next hop LSR is not satisfied,
   the ATM-LSR should destroy the binding created in response to the
   received request, and notify the requester (via LDP).

   If an ATM-LSR receives a binding request containing a hop count that
   equals MAX_HOP_COUNT, no binding should be established and an error
   message should be returned to the requester.

   When a LSR determines that it has lost its LDP session with another
   LSR, the following actions are taken.  Any binding information
   learned via this connection must be discarded.  For any label bindings
   that were created as a result of receiving label binding requests from
   the peer, the LSR may destroy these bindings (and deallocate labels
   associated with these bindings).


7.3. Stream Merge

   Label merge occurs when multiple incoming labels are mapped to a
   single outgoing label.  An LSR is not required to perform merging
   itself, but every LSR must be able to interoperate with a
   neighboring LSR that does.

   Merge-capable LSRs need only one outgoing label per route, even if
   multiple requests for label bindings to that route are received from
   upstream neighbors.

   For merging, when a LSR receives a binding request from an
   upstream LSR for a certain route, and it does not already have an
   outgoing label binding for that route, it issues a bind request to
   its next hop just as before.  If, however, it already has an
   outgoing label binding for that route, it does not need to issue a
   downstream binding request.  Instead, it creates a new LIB entry,
   allocates an incoming label for that entry, returns that label in a
   binding to the upstream requester, and uses the existing outgoing
   label for the outgoing label entry in the LIB.  It also takes the
   hop count that was provided with the label binding it received from
   downstream, increments it by one, and uses this value in the
   binding that it sends to the upstream requester.

   A merging LSR must issue new bindings every time it receives a request
   from upstream.  However, it only needs to issue a corresponding binding
   request downstream if it does not already have a label binding for the
   appropriate route.

   When a change in the routing table of a merging LSR causes it to
   select a new next hop for one of the routes associated with a
   merged set of labels, the LSR releases the bindings for that route
   from the former next hop and requests a new binding from the new
   next hop. If the new binding contains
   a hop count that differs from that which was received in the old binding,
   then the LSR must take the new hop count, increment it by one, and notify
   any upstream neighbors who have label bindings for this route of the new
   value. This enables the new hop count to propagate back to the ingress of
   the LSR cloud. If at any point the hop count reaches MAX_HOP_COUNT, the
   label bindings for this route must be withdrawn from upstream neighbor
   LSRs for which a binding was previously provided. This ensures that any
   loops caused by routing transients will be detected and broken.
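
   A sketch of the merge-capable behavior described above follows
   (illustrative only; the callbacks stand in for LDP and for local
   label management):

      outgoing_labels = {}    # route -> the single outgoing label
      incoming_labels = {}    # route -> {upstream peer: label}

      def on_bind_request(route, upstream, alloc_label,
                          request_downstream):
          # Every upstream requester gets a fresh incoming label...
          label = alloc_label()
          incoming_labels.setdefault(route, {})[upstream] = label
          # ...but a downstream request is issued only when no
          # outgoing binding exists yet for the route.
          if route not in outgoing_labels:
              request_downstream(route)
          return label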

7.4. Efficient use of label space

[ So, how does one go about asking for a single label for multiple
  routes - in order to conserve labels

   The above discussion assumes that an edge LSR will request one tag
   for each prefix in its routing table that has a next hop in the ATM-
   LSR cloud. In fact, it is possible to significantly reduce the number
   of labels needed by having the edge LSR request instead one label for
   several routes. Use of many-to-one mappings between routes (address
   prefixes) and labels using the notion of Forwarding Equivalence Classes
   (as described in [1]) provides a mechanism to conserve the number of
   labels.
]

8. Generic Encapsulation

   The generic encapsulation is as described in the sections below.
   In the general case, the current label is included, with a
   corresponding TTL, at the front (or top) of the label stack.  The
   exception to this applies to technologies which do not support use
   of TTL, in which case the current label is omitted from the label
   stack.

[ ATM or other non-TTL technology -
   For systems which are using only one level of labeling, LDP may be
   used to negotiate null encapsulation.  This negotiation is done once
   at LDP open and applies to all VPI/VCI values used as labels. In this
   case, IP packets are carried directly inside AAL5 frames, as in the
   null encapsulation of RFC 1483.
]

   LDP may be used to advertise additional VPI/VCIs to carry control
   information or non-labeled packets. These may use either the null
   encapsulation, as defined in Section 5.1 of RFC 1483, or the LLC/SNAP
   encapsulation, as defined in Section 4.1 of RFC 1483.