Internet Draft                                          TE Working Group
Heinrich Hummel                                               Siemens AG
Expiration Date: Sept. 2000                                   March 2000

                  Orchestrally Conducted Traffic (OCT)
                       draft-hummel-te-oct-01.txt

Status of this Memo

This document is an Internet-Draft and is in full conformance with all provisions of Section 10 of RFC2026, except that the right to produce derivative works is not granted.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html.

Abstract

This draft proposes to engineer traffic such that we can speak of an Orchestrally Conducted Traffic (OCT): any traffic stream (given by its source and destination nodes and by a priority class) may use several differently routed LSPs. Each traffic stream's ingress node periodically reports the currently measured traffic load (as a bitrate) to a common TE Conductor (TEC), which evaluates these reports to estimate what traffic load change may occur in the immediate future. Accordingly, the TEC computes well-synchronized likelihood values by which to take which route/LSP. These values are chosen such that any traffic stream is served as well as possible, as often as possible, while equally ranked traffic streams are treated fairly, higher ranked streams are prioritized, and the network throughput is maximized.
The TEC sends the likelihood values to the respective ingress node, which will distribute the received packets of the traffic stream to the pertaining LSPs accordingly.

Hummel                            March 2000                    [Page 1]

Orchestrally conducted Traffic (OCT)                      Exp. Sept 2000

1. Introduction and summary

This draft proposes to engineer traffic such that we can speak of an Orchestrally Conducted Traffic (OCT): any traffic stream (given by its source and destination nodes and by a priority class) may use several differently routed Label Switched Paths (LSPs). Each traffic stream's ingress node periodically reports the currently measured traffic load (a bitrate value) to a common Traffic Engineering Conductor (TEC), which evaluates these reports to estimate what traffic load change may occur in the immediate future. Accordingly, the TEC computes well-synchronized likelihood values by which to take which route/LSP. These values are chosen such that any traffic stream is served as well as possible, as often as possible, while equally ranked traffic streams are treated fairly, higher ranked streams are prioritized, and the network throughput is maximized. The TEC sends the likelihood values to the respective ingress node, which will distribute the received packets of the traffic stream to the pertaining LSPs accordingly.

2. OCT concept in detail

The OCT concept is based on a periodic information exchange between the ingress nodes of all traffic streams and a central Traffic Engineering Conductor (TEC). For now, it is assumed that communication channels (i.e. special LSPs) are established to carry this information exchange. It is also assumed that one or several differently routed LSPs are established per traffic stream to carry the user data. The OCT concept answers the question: "with which likelihood should which LSP be taken so that the entire traffic is balanced best onto the given network?"
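The periodic exchange outlined above rests on per-stream reports from each ingress node to the TEC. A minimal sketch of the information such a report might carry is given below; the draft deliberately leaves the message format open (it may even be proprietary, see section 3.6), so every name and field here is an illustrative assumption, not part of the draft.

```python
from dataclasses import dataclass, field

@dataclass
class TrafficReport:
    """Illustrative per-stream report from an ingress node to the TEC.

    The draft leaves the actual message format open; these fields
    merely mirror the information items described in the text.
    """
    ingress: str      # i: ingress node of the traffic stream
    egress: str       # e: egress node of the traffic stream
    priority: int     # p: DiffServ-based priority class
    total_load: float # currently measured total traffic load (bit/s)
    lsp_loads: dict = field(default_factory=dict)  # per-LSP load shares
    timestamp: float = 0.0  # time of day the report refers to

# Hypothetical example: stream (i1, e7, priority 2) carrying 5 Mbit/s
# split over two LSPs.
r = TrafficReport("i1", "e7", 2, 5_000_000.0,
                  {"lsp1": 3e6, "lsp2": 2e6}, 36000.0)
```

The per-LSP shares are included because, as described below, the TEC can use them both to derive the free bandwidth per link and to release bandwidth proportionally when a load decrease is processed.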
Best balancing includes many aspects, some of which have already been mentioned: higher ranked traffic streams shall be prioritized over lower ranked traffic streams; equally ranked traffic streams shall be serviced equally fairly; the throughput should be maximized. There are further properties of "best balancing", mentioned in the previous draft version, such as: avoiding congestion as far as possible, and NOT resolving a congestion by creating a new congestion at just another location.

Maximizing the throughput means: among equally ranked traffic streams, give preference nevertheless to that portion of a traffic stream that can be serviced by a shorter LSP (using fewer hops). In some cases such preferential handling is beneficial to all traffic streams (e.g. all traffic streams can be completely serviced, rather than some being more and some less deteriorated). In other cases, however, the traffic streams that use shorter LSPs are indeed better serviced than the others. Conclusion: maximizing the throughput is more important than equal treatment.

Let us partition the whole traffic into traffic streams T(i,e,p), each identified by an information triplet {i-ngress node, e-gress node, p-riority class}. In other words, a traffic stream is the sum of all IP micro-flows which enter the MPLS network domain at the same node (the ingress node), leave the MPLS network domain at the same node (the egress node), and share the same priority class, based on their DiffServ parameters.

For each traffic stream T(i,e,p) there may be n(i,e,p) differently routed LSPs available for transmitting all packets of the traffic stream. LSP-1 shall be the most preferable LSP, LSP-n(i,e,p) the least preferable one. The ingress node i of traffic stream T(i,e,p) knows all respective n(i,e,p) LSPs, just like the TEC.
Indeed, by means of mutual communication, both know each LSP's entire sequence of links.

Periodically, each ingress node reports to the TEC the traffic load (as a bitrate) of all traffic streams it originates: for traffic stream T(i,e,p), ingress node i may report both the total actual traffic load and the individual traffic load shares of all respective LSPs.

The TEC knows, of course, the network domain's topology and the maximum bandwidth of each link j, for all j=1,...,jMax. The TEC also needs to know the actually available bandwidth BWj on each link j in the network domain.

Solution 1: The TEC may learn this information by means of a routing protocol.

Solution 2: The TEC takes, from all reports, all traffic load shares with respect to all LSPs and computes for each link j the currently remaining free bandwidth BWj.

The sending of the reports to the TEC by all ingress nodes may be slightly jittered, but neglecting the jitter, the TEC shall treat all reports as referring to the same time of day. It is advisable that any report also contain either the exact time of day at which it was sent or the time of day the report is meant for.

With respect to a given time of day, the TEC thus knows the currently available bandwidth BWj per network link j (see above). As the TEC receives the reports about the current Traffic Load TL(i,e,p) for each traffic stream T(i,e,p) periodically, it will be able to extrapolate a Traffic Load Change TLC(i,e,p) for the immediately upcoming future. TLC(i,e,p) will be a positive bitrate if the traffic load increases and a negative bitrate if the traffic load decreases. Different mathematical or heuristic methods, algorithms, concepts, and theories may be applied to compute, i.e. to forecast, TLC(i,e,p).
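Two of the computations just described can be sketched in a few lines: Solution 2's derivation of the free bandwidth BWj from the reported per-LSP load shares, and a clamped linear extrapolation of the traffic load change in the spirit of Method 1 described next. This is an illustrative sketch only; the function and parameter names are assumptions, not part of the draft.

```python
def remaining_bandwidth(capacity, lsp_links, lsp_load):
    """Solution 2: derive the free bandwidth BWj of each link j
    from the reported per-LSP traffic load shares.

    capacity:  {link: maximum bandwidth (bit/s)}
    lsp_links: {lsp: list of links the LSP traverses}
    lsp_load:  {lsp: reported traffic load share carried by that LSP}
    """
    bw = dict(capacity)
    for lsp, load in lsp_load.items():
        for link in lsp_links[lsp]:  # the share occupies every link
            bw[link] -= load         # traversed by this LSP
    return bw

def forecast_tlc(tl_prev, tl_now, tl_max_ever):
    """Clamped linear extrapolation of the traffic load change TLC:
    continue the trend of the last report interval, but never beyond
    the highest load ever reported, nor below zero.
    """
    tl_next = tl_now + (tl_now - tl_prev)        # linear extrapolation
    tl_next = max(0, min(tl_next, tl_max_ever))  # clamp to [0, max ever]
    return tl_next - tl_now                      # positive or negative TLC
```

For example, forecast_tlc(40, 60, 70) yields a TLC of +10 (the raw extrapolation to 80 is clamped to the historical maximum 70), while forecast_tlc(60, 40, 70) yields -20.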
For example, two methods are shown in the following.

Method 1: Make a linear extrapolation based on the two youngest traffic load values, at time t-1 ("t minus one") and at time t0 ("t zero"), to determine the traffic load at time t1 ("t one"). If the extrapolated traffic load TL(i,e,p)(t1) exceeds the highest traffic load ever reported for traffic stream T(i,e,p), replace the extrapolated value by that highest value. If TL(i,e,p)(t1) turns out negative, replace it by the value 0. Then build the traffic load change TLC(i,e,p) = TL(i,e,p)(t1) - TL(i,e,p)(t0).

Method 2 (very promising if, upstream of ingress node i*, there is a web page of general interest): Compare the traffic load TL(i*,e*,p*) with all the other traffic loads TL(i*,e,p) that share the same ingress node i*, and determine how far TL(i*,e*,p*) is ahead of or behind all the others: build the average avTL over all current traffic loads TL(i*,e,p), for all e except e* and all p except p*. Set TLC(i*,e*,p*) = avTL - TL(i*,e*,p*).

Certainly there will be many more methods for computing TLC(i,e,p) for traffic stream T(i,e,p). The method may also vary depending on whether TL(i,e,p) is currently high or low.

Learning from success: Note that one report-timer interval later it can be verified how well the prognosticated traffic load change turned out. We can gather statistics on which method is more often successful than the others, and apply the more successful method more often. So we may learn from success.

Let us recapture: the TEC knows the currently available bandwidth BWj for all links j in the network. This currently remaining network may expect an oncoming traffic given by all positive TLC(i,e,p), whereas all
negative TLC(i,e,p) will increase the available bandwidth BWj of those links j which are traversed by the respective LSPs of the respective traffic streams T(i,e,p).

Prosaic description of how the likelihood values may be computed: Imagine the TEC as a person sitting at a casino table. On the table, however, the topology graph of his network is drawn. On each ingress node i the TEC has placed, with respect to traffic stream T(i,e,p), a pile of uniformly sized chips: the height of the pile amounts to the absolute value of the traffic load change TLC(i,e,p). (Possibly, at the bottom of a pile, there is one chip smaller than the uniformly sized chips.) If the traffic load change TLC(i,e,p) is positive (load increase), all chips of the respective pile have a unique color, "i,e,p". If TLC(i,e,p) is negative (load decrease), all chips of that pile are colorless. On each link j of the network there is also a pile of colorless chips whose height amounts to the currently available bandwidth BWj; let us call it the link-j-pile.

The goal is to distribute the chips of each positive TLC(i,e,p)-pile by piling up respective LSP-specific piles. Possibly, a remainder of a positive TLC(i,e,p)-pile cannot be distributed in this way; it reflects the proportion of the traffic stream which cannot be transmitted. Furthermore, all the chips of any negative TLC(i,e,p)-pile shall "disappear": taking away one such (colorless) chip is accompanied by adding a colorless chip to each link-j-pile of all links of one particular LSP servicing traffic stream T(i,e,p).

The TEC processes one chip after the other, in general each time one chip from a different TLC(i,e,p)-pile.

Processing a chip from a positive TLC(i,e,p)-pile: Try to move the chip to the k-th LSP (k=1,2,...,n(i,e,p)), beginning with the best possible LSP, i.e. the k-th LSP with the lowest possible k.
Whenever a positive chip is moved to the k-th LSP-specific pile, a colorless chip is removed from each link-j-pile of the links j this k-th LSP traverses. If this is not feasible because at least one of these links lacks a colorless chip, the same is tried with the (k+1)st LSP. If it is not feasible with any of the n(i,e,p) LSPs, there will be a remainder pile whose (normed) height indicates the likelihood by which a packet of traffic stream T(i,e,p) shall not be transmitted.

Processing a chip from a negative TLC(i,e,p)-pile: Remove one (colorless) chip from the TLC(i,e,p)-pile and concurrently add a colorless chip to each link-j-pile of all links j that are traversed by one particular LSP of traffic stream T(i,e,p). The question is: which particular LSP shall it be? Indeed, it should be done in proportion to the current traffic load share per LSP. (Note that the current traffic load share per LSP has been reported to the TEC. The TEC may form a statistical stage function and generate a random number which hits a column of that histogram, thereby selecting the respective LSP. Make sure not to release more bandwidth on the links of a particular LSP than has been used so far, i.e. not more than the current respective LSP-specific traffic load share.)

The sequence in which the chips of all TLC(i,e,p)-piles are handled as described above is very important in order to achieve fairness among equally ranked traffic streams, to prioritize/discriminate higher/lower ranked traffic streams, and to get the best throughput:

Higher ranked traffic streams are preferred over lower ranked traffic streams by processing ALL the chips of the positive TLC(i,e,p)-piles of the higher ranked traffic streams first.

Alternate the processing of a positive chip and of a negative chip. Hereby do not differentiate between a higher ranked negative chip and a lower ranked negative chip. E.g.
process (as long as there are both positive and negative chips) according to one of several possible alternation patterns; each pattern may represent a particular behaviour, from very optimistic to very cautious.

Distribute positive chips of a particular traffic stream to its higher ranked LSPs before its lower ranked LSPs. This takes the best possible LSP, as long as no other traffic stream "objects".

Create fairness among equally ranked traffic streams by processing each time a chip from a different (positive) TLC(i,e,p)-pile of an equally ranked traffic stream, however with the following exception: creating fairness may collide with another goal, which is achieving the best throughput. Best throughput may be accomplished by preferring those chips of positive TLC(i,e,p)-piles (among those of equally ranked traffic streams) which can be moved to an LSP-specific pile whose LSP consists of a relatively smaller number of links.

As a final result, each positive TLC(i,e,p)-pile will be distributed to its LSP-specific piles. There may be a remainder pile which could not be distributed to any of the LSP-specific piles. We norm each pile by dividing its height by the sum of all heights (including the remainder pile's height) and obtain the likelihood values by which to take which LSP, as well as the likelihood value by which IP packets should not be transmitted.

With respect to a traffic stream with a negative TLC(i,e,p)-pile we cannot compute likelihood values by which to select any LSP (at least not with the algorithm shown above). Proposal: the ingress node may continue to use the old likelihood values which applied throughout the previous report-timer interval.

The TEC may send back to ingress node i, with respect to traffic stream T(i,e,p), either the newly computed likelihood values or an indication that it should continue to use the old likelihood values.
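The distribution of one positive TLC(i,e,p)-pile over its LSP-specific piles, followed by the normalization just described, can be sketched as follows. This is an illustrative sketch under assumed data structures (none of the names below come from the draft); it handles a single traffic stream in isolation and omits the alternation with negative chips, the priority ordering, and the fairness scheduling across streams.

```python
def distribute_positive_pile(tlc, lsps, link_free, chip=1.0):
    """Distribute a positive TLC pile onto LSP-specific piles.

    tlc:       positive traffic load change (height of the chip pile)
    lsps:      list of LSPs, best first; each LSP is a list of links
    link_free: {link: currently available bandwidth}, i.e. the
               link-j-piles, updated in place as chips are moved
    chip:      bandwidth quantum represented by one chip
    Returns ([likelihood per LSP], likelihood of non-transmission).
    """
    piles = [0.0] * len(lsps)
    remainder = 0.0
    while tlc > 0:
        quantum = min(chip, tlc)  # a smaller chip may sit at the bottom
        tlc -= quantum
        for k, lsp in enumerate(lsps):  # try the best LSP first
            if all(link_free[j] >= quantum for j in lsp):
                for j in lsp:           # remove a colorless chip from
                    link_free[j] -= quantum  # every traversed link
                piles[k] += quantum
                break
        else:  # no LSP could take the chip: it joins the remainder pile
            remainder += quantum
    total = sum(piles) + remainder
    return [p / total for p in piles], remainder / total
```

For example, with links A-B (3 units free), B-C (1 free) and A-C (2 free), and two LSPs A-B-C and A-C for a stream with a positive TLC of 4, the sketch yields likelihoods of 0.25 and 0.5 for the two LSPs and 0.25 for non-transmission.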
Balancing the received user data by the ingress node according to the likelihood values: An ingress node i which receives a new IP packet may identify the traffic stream to which this packet belongs, i.e. it may identify the egress node based on the packet's IP destination address and the priority class based on its DiffServ parameters. In accordance with the actual likelihood values it will select an appropriate LSP for transmitting this packet, or, respectively, dishonor its forwarding request:

The ingress node may build a hash with respect to the IP source addresses of the IP packets of T(i,e,p). The simplest hash partitions the entire IP unicast address space into n(i,e,p)+1 intervals whose sizes are proportional to the likelihood values. The IP source address of a received IP packet of a particular traffic stream is examined to determine which interval it falls into. If it is from the k-th interval, the packet is forwarded along the k-th LSP, k=1,...,n(i,e,p); if it is from the (n(i,e,p)+1)st interval, it is not forwarded at all. This very simple hash mechanism suits best if the ingress node is fairly in the "center of the Internet", i.e. receives IP packets containing any existing IP unicast source address. Alternatively, an ingress node i (e.g. of an access network) may know, by means of network management configuration or by learning while operating, the appropriate, more limited IP source address range which may be partitioned to form the right hash intervals.

If an ingress node receives IP packets from a fairly limited number of sources, e.g. fewer than 100 different IP source addresses, it may be feasible to queue up information pertaining to all currently active IP micro-flows: a queue element may consist of an IP source address and the LSP to be selected. If the first packet of a new IP micro-flow is received, i.e.
if no queued information matches the received packet's IP source address, a new queue element must be allocated, whose LSP information may be determined as follows: generate a random number which hits/selects a column of a statistical stage function built from the likelihood values. The k-th column selects the k-th LSP, which shall be used and also entered in the new queue element. Any succeeding packet of the particular IP micro-flow may be sent along the same LSP. Advantage: all packets of the same IP micro-flow arrive in proper sequence at the egress node.

A mechanism needs to be added to determine the end of the lifetime of an IP micro-flow, i.e. the point in time when its queue element is deleted: maintain information in the queue element which is refreshed by the transmission request of each packet of the IP micro-flow, and employ a periodic timer. Whenever it expires, check all queue elements of all traffic streams originated at this ingress node for whether they have aged out; if so, delete them.

3. Further Aspects

3.1 Other traffic load input

The TEC may also receive information about SCHEDULED traffic from senders other than the ingress nodes. Such senders may be a policy agent or a network management station/agent.

3.2 Adding/removing another LSP per traffic stream

3.2.1 Adding an LSP: In the preceding it was assumed that there are some LSPs per traffic stream, without saying when and why an(other) LSP may be added or removed. An obvious reason for establishing another LSP is given when the TEC determines a considerable likelihood for NOT forwarding some packets of the traffic stream along any of the existing LSPs. Indeed, the TEC may compute the route for another LSP while taking into account the currently available bandwidth in the network and the anticipated traffic load change.
The TEC may suggest to the ingress node to set up another LSP, sending both its route information and the likelihood value with which to use it. (This may reduce the likelihood of not forwarding packets of this traffic stream.)

3.2.2 Removing an LSP: It may turn out that over a long period of time the likelihood values for some LSPs constantly turn out to be zero. In this case the TEC may suggest to the ingress node to remove the idle LSPs.

3.3 Delaying instead of dishonoring packet transmission

Despite best traffic balancing, and even despite 3.2, there may be IP packets whose transmission request is dishonored. Alternatively, their transmission may only be delayed. Further study is required on how to do this and what the consequences are with respect to the entire "traffic balancing result".

3.4 On-demand LSPs

Another variation would be to establish LSPs only on demand. For a traffic stream, the ingress node may have a set of routes to choose from instead of a set of existing LSPs. The procedure described for determining the right LSP for transmitting an IP packet may then be applied to determine the route to use when the on-demand LSP is established.

3.5 Long-term results

Over a longer period of time, while and although the total traffic is orchestrally conducted, the TEC may find out which links are permanently under-utilized or over-utilized, and may conclude which links should be stocked up by which amount of bandwidth, or which links may be removed or used differently after being reconfigured.

Determining to which extent a link is permanently under-utilized: The simplest method is to determine the smallest available bandwidth that link has ever had.
Determining to which extent the bandwidth of a link should be stocked up: After the TEC has computed and sent out the likelihood values for all traffic streams, it may use the remaining time to compute likelihood values for a fictitious "should-be" network as follows: in one additional cycle the TEC may process each remainder TLC(i,e,p)-pile as if it were one big chip, distributing this big chip onto the stream's most preferred LSP-1. Hereby subtract the height of the remainder TLC(i,e,p)-pile from each affected link-j-pile, and do not mind if the height of a link-j-pile drops below zero. Indeed, after this cycle, any resulting negative link-j-pile indicates the absolute amount of bandwidth by which link j should be stocked up.

Of course, averages should be built over a longer period of time to determine the permanent over/under-utilization of any link in the network.

3.6 Standardization required / not required

It is possible to deploy the OCT concept without standardizing any new message types, TLV types, or TLV codepoints. In the simplest implementation, special p2p LSPs are established from each ingress node to the TEC and vice versa, to be used as communication channels. The format of the communicated data packets may be proprietary.

However, it is up to the TE WG (and probably the MPLS WG) whether or not to standardize these formats, e.g. as new LDP messages. If there were a readiness/willingness to standardize these formats, there would be an option where no p2p communication channels between the ingress nodes and the TEC were required: each ingress node sends its report messages to the TEC in a new LDP message.
The TEC sends out a new LDP message which is guided by a TREE ROUTE TLV (see [3]) in a spanning-tree fashion to all ingress nodes of the network domain, carrying the likelihood values as individually targeted information for each individual ingress node.

4. Intellectual Property Considerations

Siemens AG may seek patent or other intellectual property protection for some or all of the technologies disclosed in this document. If any standards arising from this document are or become protected by one or more patents assigned to Siemens AG, Siemens AG intends to disclose those patents and license them under openly specified and non-discriminatory terms.

5. References

[1] Awduche, "Traffic Engineering Concept", draft-awduche-mpls-traffic-eng-00.txt

[2] Li, Rekhter, "Provider Architecture for Differentiated Services and Traffic Engineering (PASTE)", draft-li-paste-00.txt, January 1998

[3] Hummel, Loke, "Explicit Tree Routing", draft-hummel-mpls-tree-route-01.txt

6. Author's Address

Heinrich Hummel
Siemens AG
Hofmannstrasse 51
81379 Munich, Germany
Tel: +49 89 722 32057
Email: heinrich.hummel@icn.siemens.de