TSVWG                                                         B. Briscoe
Internet Draft                                               G. Corliano
draft-briscoe-tsvwg-cl-architecture-00.txt                    P. Eardley
Expires: January 2006                                          P. Hovell
                                                              A. Jacquet
                                                            D. Songhurst
                                                                      BT
                                                           July 11, 2005



       An architecture for edge-to-edge controlled load service using
              distributed measurement-based admission control
                draft-briscoe-tsvwg-cl-architecture-00.txt


Status of this Memo

   By submitting this Internet-Draft, each author represents that
   any applicable patent or other IPR claims of which he or she is
   aware have been or will be disclosed, and any of which he or she
   becomes aware will be disclosed, in accordance with Section 6 of
   BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
        http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
        http://www.ietf.org/shadow.html

   This Internet-Draft will expire on January 11, 2006.

Copyright Notice

   Copyright (C) The Internet Society (2005).  All Rights Reserved.






Briscoe               Expires January 11, 2006                [Page 1]


Internet-Draft      Controlled Load architecture             July 2005


Abstract

This document describes an architecture to achieve a Controlled Load
(CL) service edge-to-edge, i.e. within a particular region of the
Internet, by using distributed measurement-based admission control. The
measurement made is of CL packets that have their Congestion
Experienced (CE) codepoint set as they travel across the edge-to-edge
region. Setting the CE codepoint, which is under the control of a new
Per Hop Behaviour (CL-ramp-PHB, defined in draft-briscoe-tsvwg-cl-phb-
00.txt), provides an "early warning" of potential congestion. This
information is used by the ingress node of the edge-to-edge region to
decide whether to admit a new CL microflow.

A use case is described which shows how the PHB is a fundamental
building block in the edge-to-edge architecture, and in turn how this
is a building block within a broader QoS architecture achieving an end-
to-end CL service.



Table of Contents


   1. Introduction................................................3
      1.1. Summary................................................3
      1.2. Key features...........................................4
      1.3. Benefits...............................................6
      1.4. Standardisation requirements............................6
      1.5. Terminology............................................7
      1.6. Structure of rest of document...........................8
   2. Use case....................................................8
      2.1. Configured bandwidth allocation to the CL behaviour aggregate
       ...........................................................10
      2.2. Flexible bandwidth allocation to CL behaviour aggregate.11
   3. Details....................................................12
      3.1. Packet processing......................................12
         3.1.1. Ingress nodes.....................................12
         3.1.2. Interior nodes....................................13
         3.1.3. Egress nodes......................................15
      3.2. Signalling............................................16
   4. Extensions.................................................17
      4.1. Multi-domain and multi-operator usage..................17
      4.2. Variable bit rate sources..............................18
      4.3. Starvation prevention..................................18
   5. Relationship to other QoS mechanisms........................18
      5.1. Standardisation requirements...........................18
      5.2. Controlled Load........................................18


Briscoe               Expires January 11, 2006                [Page 2]


Internet-Draft      Controlled Load architecture             July 2005


      5.3. Integrated services operation over Diffserv............19
      5.4. Differentiated Services................................19
      5.5. ECN...................................................19
      5.6. RTECN.................................................20
      5.7. RMD...................................................20
      5.8. MPLS-TE...............................................20
   6. Security Considerations.....................................21
   7. Acknowledgements...........................................21
   8. Comments solicited.........................................21
   9. References.................................................21
   Authors' Addresses............................................24
   Intellectual Property Statement................................26
   Disclaimer of Validity........................................26
   Copyright Statement...........................................26

1. Introduction

1.1. Summary

   This document describes an architecture to achieve a controlled load
   service edge-to-edge, i.e. within a particular region of the
   Internet, using distributed measurement-based admission control.
   Controlled load service is a quality of service (QoS) closely
   approximating the QoS that the same flow would receive from a lightly
   loaded network element [RFC2211]. Controlled Load (CL) is useful for
   inelastic flows such as those for streaming real-time media.

   The architecture described in this document achieves edge-to-edge
   controlled load service using a new Per Hop Behaviour (PHB) as a
   fundamental building block. In turn, an end-to-end CL service would
   use this architecture as a building block within a broader QoS
   architecture. The PHB, edge-to-edge and end-to-end aspects are now
   briefly introduced in turn.

   The new PHB, called CL-ramp-PHB, is defined in [CL-PHB]. Network
   nodes that implement the differentiated services (DS) enhancements to
   IP use a codepoint in the IP header to select a PHB as the specific
   forwarding treatment for that packet [RFC2474, RFC2475]. The CL-ramp-
   PHB is different from PHBs defined so far, in that it defines
   Explicit Congestion Notification (ECN) marking semantics as part of
   the PHB. A node in the CL-region sets the Congestion Experienced (CE)
   codepoint in the IP header as an "early warning" of potential
   congestion, and aims to do so before there is any significant build-
   up of CL packets in the queue.





Briscoe               Expires January 11, 2006                [Page 3]


Internet-Draft      Controlled Load architecture             July 2005


   To achieve the CL service edge-to-edge, ie within a region of the
   Internet - which we call CL-region (defined below) - distributed
   measurement-based admission control is used. All nodes within the CL-
   region run the CL-ramp-PHB. The measurement is of the CL packets that
   have had their CE codepoint set as they travel across the CL-region.
   Since any node in the CL-region may set the CE codepoint, the
   measurement is distributed. The measurement is recorded by the egress
   node of the CL-region. The egress node calculates the bits in these
   CE packets as a fraction of the bits in all the CL packets, as an
   exponentially weighted moving average (which we term Congestion-
   Level-Estimate). Depending on the value of Congestion-Level-Estimate,
   the ingress node of the CL-region decides whether to admit a new CL
   microflow. Since setting the CE codepoint is an "early warning" of
   potential congestion (ie before there is any significant build-up of
   CL packets in the queue), the admission control procedure means that
   previously accepted CL microflows will suffer minimal queuing delay,
   jitter and loss - exactly the requirements of real time traffic.

   In turn, the edge-to-edge architecture is a building block in
   delivering an end-to-end CL service. The approach is similar to that
   described in [RFC2998] for Integrated services operation over
   Diffserv networks. Like [RFC2998], an IntServ class (CL in our case)
   is achieved end-to-end, with a CL-region viewed as a single
   reservation hop in the total end-to-end path. Interior routers of the
   CL-region do not process flow signalling nor do they hold state.
   Unlike [RFC2998] we do not require the end-to-end signalling
   mechanism to be RSVP, although it can be - as indeed we assume in
   Sections 2 and 3. [RFC2998] and our approach are compared further in
   Section 5.



1.2. Key features

   In this section we discuss some of the key aspects of the edge-to-
   edge architecture.

   One key feature of our approach revolves around the use of Explicit
   Congestion Notification (ECN) [RFC3168] to indicate that the amount
   of packets flowing is getting close to the engineered capacity. Note
   that ECN operates across the CL-region, ie edge-to-edge, and not
   host-to-host as in [RFC3168].

   The new PHB, CL-ramp-PHB, is designed to provide an "early warning"
   of potential congestion. It assumes that a new microflow won't move
   the CL-region directly from no congestion to overload; there will
   always be an intermediate stage where a new CL microflow causes CL


Briscoe               Expires January 11, 2006                [Page 4]


Internet-Draft      Controlled Load architecture             July 2005


   packets to have their CE codepoint set but still be delivered without
   significant delay. This assumption is valid for core and backbone
   networks but is unlikely to be valid in access networks where the
   granularity of an individual call becomes significant.

   Note that the CL-region can potentially span multiple domains.
   Indeed, over time CL-regions may incrementally grow and merge, and
   could eventually become a single CL-region encompassing all core and
   backbone networks, providing Internet-wide controlled load service in
   concert with stateful admission control mechanisms at the very edges
   of the Internet.

   It is also possible for a CL-region to include domains run by
   different operators. The border routers between operators within the
   CL-region only have to do bulk accounting - per microflow metering
   and policing is not needed. Section 4.1 discusses further.

   CL-packets are marked with a Differentiated Services Codepoint
   (DSCP), so that nodes in the CL-region can distinguish the CL packets
   from non-CL ones [RFC2474] and know that the CL-ramp-PHB is required.

   However, note that we do not use the traffic conditioning agreements
   (TCAs) of the (informational) Diffserv architecture [RFC2475], in
   which operators in practice rely on subscription-time Service Level
   Agreements (SLAs) that statically define the parameters of the
   traffic that will be accepted from a customer. Operators deploying
   our mechanism do not need to make a fixed assignment of capacity
   because the division of bandwidth between CL and non-CL traffic can
   be flexible.

   Our edge-to-edge architecture uses dynamic admission control: the
   closed feedback loop between the ingress and egress nodes of the CL-
   region. The key advantage of controlling the load dynamically rather
   than with TCAs is that the latter can fail catastrophically. The
   problem arises because the TCA at the ingress must allow any
   destination address, if it is to remain scalable. But for longer
   topologies, the chances increase that traffic will focus on a
   resource near the egress, even though it is within contract at the
   ingress [Reid]. Even though networks can be engineered to make such
   failures rare, when they occur all inelastic flows through the
   congested resource fail catastrophically. This is also why in our
   approach the egress node of the CL-region calculates the Congestion-
   Level-Estimate separately for CL packets from each ingress node.

   Finally, it is assumed that the end systems react properly to non-CL
   packets that are dropped or have their CE codepoint set, otherwise



Briscoe               Expires January 11, 2006                [Page 5]


Internet-Draft      Controlled Load architecture             July 2005


   new CL microflows calls may get unfairly blocked. How to police this
   is out of scope of this document.



1.3. Benefits

   We believe that the mechanism described in this document has several
   advantages, which we briefly explain with reference to the key
   features described above:

   o It achieves statistical guarantees of quality of service for
      microflows, delivering a very low delay, jitter and packet loss
      service suitable for applications like voice and video calls that
      generate real time inelastic traffic. This is because of its per
      microflow admission control scheme, combined with its "early
      warning" of potential congestion. The guarantee is at least as
      strong as with Intserv Controlled Load (Section 5 mentions why the
      guarantee may be somewhat better), but without its scalability
      problems [RFC2208].

   o It scales well, because there is no signal processing or path
      state held by the interior nodes of the CL-region.

   o It is resilient, again because no state is held by the interior
      nodes of the CL-region.

   o It requires minimal new standardisation, because it reuses
      existing QoS protocols.

   o It can be deployed incrementally, network by network. Not all the
      networks on the end-to-end path need to have it deployed. Two CL-
      regions can be separated by a network that uses another QoS
      mechanism (eg MPLS), or where they are adjacent can merge to
      become a single CL-region.

   o It can work between operators, ie the CL-region can include
      domains run by different operators. This is scalable because there
      is only bulk metering at the inter-operator interface; there is no
      need for per microflow accounting or policing.



1.4. Standardisation requirements

   The architecture described in this document has two new
   standardisation requirements: for a new PHB, as described in [CL-


Briscoe               Expires January 11, 2006                [Page 6]


Internet-Draft      Controlled Load architecture             July 2005


   PHB], and for the end-to-end signalling protocol to carry the
   Congestion-Level-Estimate report (eg with RSVP, the RESV message must
   carry a new opaque object across the CL-region). Other than these two
   things, the arrangement uses existing standards throughout although,
   as mentioned above, not in their usual architecture. Section 5
   discusses standardisation issues further.

   This document is INFORMATIONAL.



1.5. Terminology

   o Ingress node: a node which is an ingress gateway to the CL-region.
      A CL-region may have several ingress nodes.

   o Egress node: a node which is an egress gateway from the CL-region.
      A CL-region may have several egress nodes.

   o Interior node: a node which is part of the CL-region, but isn't an
      ingress or egress node.

   o CL-region: A region of the Internet in which all nodes run the CL-
      ramp-PHB and all traffic enters/leaves through an ingress/egress
      node. A CL-region is a DS region (a DS region is either a single
      DS domain or set of contiguous DS domains), but note that the CL-
      region does not use the traffic conditioning agreements (TCAs) of
      the (informational) Diffserv architecture.

   o CL-ramp-PHB: A new Per Hop Behaviour, described in [CL-PHB].

   o Congestion-Level-Estimate: the bits in CL packets that have the CE
      codepoint set, divided by the bits in all CL packets. It is
      calculated as an exponentially weighted moving average. It is
      calculated by an egress node for CL packets from a particular
      ingress node.













Briscoe               Expires January 11, 2006                [Page 7]


Internet-Draft      Controlled Load architecture             July 2005


   ______________________________
  /                              \
 /                                \
|-------|    |--------|    |-------|
|Ingress|----|Interior|----|Egress |
| node  |    | node   |    | node  |
|-------|    |--------|    |-------|
 \                                /
  \______________________________/

< ---------- CL-region ----------- >

Figure 1: Sample edge-to-edge configuration and terminology



1.6. Structure of rest of document

   Section 2 describes a use case, with further details in Section 3 and
   extensions in Section 4. Section 5 discusses standardisation aspects.



2. Use case

   In this section we outline a usage scenario to illustrate how our
   mechanism works. It is intended to show how the main features fit
   together to deliver QoS, with further details in Section 3.

   Our QoS mechanism operates over a CL-region. For now we assume that
   it consists of one domain whilst in Section 4.1 we extend it to the
   multi-domain case, including where different operators run the
   domains. So our scenario consists of two end hosts, each connected to
   their own access networks, which are linked by the CL-region. We
   require some other method, for instance IntServ, to be used outside
   the CL-region to provide QoS. For now we assume that the end-to-end
   signalling protocol is RSVP; other protocols are considered in
   Section 3.2. From the perspective of RSVP the CL-region is a single
   hop, so the RSVP PATH and RESV messages are processed by the ingress
   and egress nodes but are carried transparently across all the
   interior nodes. Hence, the ingress and egress nodes hold per
   microflow state, whilst no state is kept by the interior nodes.

   Section 2.1 describes a restricted scenario where the CL behaviour
   aggregate is assigned a fixed amount of bandwidth. This is equivalent


Briscoe               Expires January 11, 2006                [Page 8]


Internet-Draft      Controlled Load architecture             July 2005


   to the case today with the DS architecture: a subscription-time
   Service Level Agreement (SLA) statically defines the amount of
   bandwidth reserved for a particular behaviour aggregate. Section 2.2
   describes the more general case where there is no fixed allocation to
   CL traffic.

   Each node in the CL-region runs an algorithm to determine whether to
   set the CE codepoint of a particular CL packet. In our description we
   assume that a bulk token bucket is used (other implementations are
   possible), and that tokens are added when packets are queued and are
   consumed at a fixed rate. The idea is that an excess of tokens is
   seen before the queue of CL packets has got long enough to cause the
   CL packets to suffer a significant delay - the algorithms are
   explained more fully below and are slightly different in Sections 2.1
   and 2.2. Note that the same token bucket is used for all the CL
   packets, ie it operates in bulk on the CL behaviour aggregate and not
   per microflow.

 ___    ____    _______________________________________    ____    ___
|   |  |    |  |                                       |  |    |  |   |
|   |  |    |  |Ingress   Interior   Interior    Egress|  |    |  |   |
|   |  |    |  | node      node       node       node  |  |    |  |   |
|   |  |    |  |------|   |------|   |------|   |------|  |    |  |   |
|   |  |    |  | CL-  |   | CL-  |   | CL-  |   |      |  |    |  |   |
|   |..|    |..| PHB  |...| PHB  |...| PHB  |...| Meter|..|    |..|   |
|   |  |    |  |------|   |------|   |------|   |------|  |    |  |   |
|   |  |    |  |  \                                 /  |  |    |  |   |
|   |  |    |  |   \                               /   |  |    |  |   |
|   |  |    |  |    --<------------<-----------<--     |  |    |  |   |
|   |  |    |  |                                       |  |    |  |   |
|___|  |____|  |_______________________________________|  |____|  |___|

Sx     Access               CL-region                    Access    Rx
End    Network                                           Network   End
Host                                                               Host

                <------ edge-to-edge signalling ------>
                          (admission control)

<-------------------end-to-end QoS signalling protocol---------------->

Figure 2: Overall QoS architecture



Briscoe               Expires January 11, 2006                [Page 9]


Internet-Draft      Controlled Load architecture             July 2005



2.1. Configured bandwidth allocation to the CL behaviour aggregate

   Each node in the CL-region has a fixed rate (bandwidth) allocated to
   CL traffic, under the control of management configuration. Tokens are
   consumed at a fixed rate that is slightly slower than the configured
   rate, and added when packets are queued. This means that the amount
   of tokens starts to increase before the actual queue builds up but
   when it is in danger of doing so soon; hence it can be used as an
   "early warning" of potential congestion. The probability that a node
   sets the CE codepoint of a CL packet depends on the number of tokens
   in the bucket. Below one threshold value of the number of tokens no
   packets have their CE codepoint set and above the second they all do;
   in between, the probability increases linearly.

   We now describe how setting the CE codepoint influences admission
   control by the ingress node. For ease of description we imagine that
   packets are already flowing. Each egress meters whether a CL packet
   has its CE codepoint set. We assume that initially the traffic load
   is such that there are no CE packets.

   Next a source tries to set up a new CL microflow. The RSVP PATH
   message is processed by the ingress and egress nodes and PATH state
   is installed in these two routers. When the RSVP RESV message travels
   back from the receiving end host, the egress node adds on an RSVP
   object which states that currently no CL packets have their CE
   codepoint set. Hence the ingress node admits the new CL microflow,
   and the RESV message continues on to the source.

   We imagine that this new microflow results in one (or more) of the
   interior nodes starting to set the CE codepoint of CL packets because
   their arrival rate is nearing the configured rate. The egress
   calculates - as an exponentially weighted moving average - the
   fraction of CL packets from a particular ingress node that have their
   CE codepoint set (or rather the calculation is done according to the
   bits in those packets). This Congestion-Level-Estimate provides an
   estimate of how near the CL-region is getting to a load where the CL
   traffic will start suffering significant delays. Note that the
   metering is done separately per ingress node, because (as discussed
   in Section 1.2) there may be sufficient capacity on all the nodes on
   the path between one ingress node and a particular egress, but not
   from a second ingress.

   The next time a source tries to set up a CL microflow, the egress
   informs the ingress node about the relevant Congestion-Level-
   Estimate; this is included as an opaque object within the RSVP RESV


Briscoe               Expires January 11, 2006               [Page 10]


Internet-Draft      Controlled Load architecture             July 2005


   message. If it is greater than some threshold value then the ingress
   refuses the request, otherwise it is accepted and the RSVP RESV
   continues to the source end host.

   It is also possible for an egress node to get a RSVP RESV message and
   not know what Congestion-Level-Estimate is. For example, if there are
   no CL microflows at present between the relevant ingress and egress
   nodes. In this case the egress requests the ingress to send probe
   packets, from which it can initialise its meter.



   Having explained how the admission control decision is reached we now
   look at an on-going data microflow. The source sends CL packets,
   which arrive at the ingress node. The ingress uses a normal five-
   tuple filter to identify that the packets are part of a previously
   admitted CL microflow, and it also polices the microflow to ensure it
   remains within its traffic profile. (The ingress has learnt the
   required information from the RSVP PATH message.) The ingress sets
   the DSCP appropriately and the ECN field to ECT (ECN-Capable
   Transport). The CL packets now travel across the CL-region, with the
   CE codepoint getting set if necessary. Also, appropriate queue
   scheduling is needed in each node to ensure that CL traffic gets its
   configured bandwidth. For instance, a Weighted Round Robin scheduler
   could be used.



2.2. Flexible bandwidth allocation to CL behaviour aggregate

   The set-up is similar to the previous sub-section, except that nodes
   in the CL-region do not allocate a fixed bandwidth to CL flows. As a
   consequence, the algorithm for setting the CE codepoint is slightly
   altered.

   Tokens are consumed at a fixed rate that is slightly slower than the
   (total) outgoing service rate, and added when packets are queued. The
   probability that a node sets the CE codepoint of a CL packet depends
   on the number of tokens in the bucket *plus* the number of queued
   non-CL packets. Below one threshold value of this sum no packets have
   their CE codepoint set and above the second they all do; in between,
   the probability increases linearly.

   Note that the probability reflects the load of both CL and non-CL
   traffic. The reason is to ensure a 'fair balance' between the two
   classes, by rejecting CL session requests if non-CL demand is very
   high. Alternatively, if the number of queued non-CL packets is not


Briscoe               Expires January 11, 2006               [Page 11]


Internet-Draft      Controlled Load architecture             July 2005


   included, then the admission of a CL microflow is independent of the
   amount of non-CL traffic.

   The admission control procedure is as in the previous sub-section. As
   regards queue scheduling, CL packets are always scheduled ahead of
   non-CL ones, in order to minimise their delay and jitter, and FIFO
   (First In First Out) queuing is used to prevent reordering within a
   CL microflow. This is more restrictive than in the previous sub-
   section, which we believe is necessary now the arrival rate of CL
   packets is unknown.



3. Details

   In this section we first concentrate on the details about packet
   processing in nodes in the CL-region, before looking more briefly at
   issues associated with the signalling for admission control.

3.1. Packet processing

   A network operator upgrades normal IP routers by:

   o Adding functionality related to admission control to all its
      ingress and egress nodes

   o Adding appropriate queuing and scheduling behaviour to its nodes,
      including the ability to set the CE codepoint "early".

   We consider the detailed actions required for each of the types of
   node in turn.

3.1.1. Ingress nodes

   Ingress nodes perform the following tasks:

   o Classify incoming packets - decide whether they are CL or non-CL
      packets. This is done using a normal filter spec (source and
      destination addresses and port numbers), whose details have been
      gathered from the RSVP PATH message

   o Police - check that the microflow is conformant with what has been
      agreed (ie the flow keeps to its agreed data rate). If necessary,
      the suggested action is that packets are marked to Best Effort.

   o Packet colouring - for CL microflows, set the DSCP appropriately
      and set the ECN field to ECT(0) or ECT(1)


Briscoe               Expires January 11, 2006               [Page 12]


Internet-Draft      Controlled Load architecture             July 2005


   o Perform standard 'interior node' functions (see next sub-section)

3.1.2. Interior nodes

   Interior nodes do the following tasks:

   o Examine the DSCP - to see if it's a CL packet

   o Enqueue - CL and non-CL packets are put into logically separate
      queues; if required, a CL packet can pre-empt non-CL packet(s) in
      the total buffer (see below).

   o Non-CL packets are handled as usual. A RED algorithm [RFC2309] is
      used to decide whether to drop packets or, if they are ECN-
      capable, set their CE codepoint.

   o CL packets have their CE codepoint set according to what is
      essentially a token bucket algorithm (see below).

   o Dequeue - any CL packet is always dequeued before a non-CL packet.
      Within the CL class scheduling is FIFO. There may be a hierarchy
      of non-CL classes, this is out of scope.



   Queuing:

   Although CL and non-CL packets are put into logically separate
   queues, implementations in practice share the same buffer space. If
   the buffer is full then an incoming non-CL packet is dropped, whilst
   an incoming CL packets is queued and sufficient of the newest non-CL
   packet(s) are dropped. In the unlikely event that the buffer is full
   of CL packets, then the newest CL packet is discarded (ie tail drop).
   Because of the admission procedure this should be rare, but it is
   needed to protect the network in case of misconfiguration for
   instance.



   Setting the CE codepoint:

   Tokens are added when CL packets are queued and are consumed at a
   fixed rate related to the outgoing service rate.

   When a CL packet arrives the token bucket is updated as follows:




Briscoe               Expires January 11, 2006               [Page 13]


Internet-Draft      Controlled Load architecture             July 2005


   [CL-bucket-level]n+1 = [CL-bucket-level]n + CL-packet-size -
   (service-bit-rate * time * safety-factor)

   Where

   CL-bucket-level is the amount of tokens in the token bucket. It is
   constrained to lie between 0 and a fixed upper limit

   time is the time elapsed since CL-bucket-level was last updated

   safety-factor is > 1 and gives the "early warning" of potential
   congestion

   service-bit-rate is

     either the configured bit rate for CL traffic - for the fixed
     bandwidth case (ie Section 2.1),

     or the outgoing service rate for all traffic - for the flexible
     bandwidth case (ie Section 2.2).



   CL packets have their CE codepoint set with a probability that
   depends on the number of non-CL packets in the queue, as well as the
   number of tokens in a token bucket.

   When a CL packet arrives, the probability that the node sets its CE
   codepoint is determined as follows:

   if  [CL-bucket-level]n+1 + (A * smoothed-non-CL-queue-length) < min-
   threshold

     Probability-CE-codepoint-set = 0

   if  [CL-bucket-level]n+1   + (A * smoothed-non-CL-queue-length) >
   max-threshold

     Probability-CE-codepoint-set = 1

   otherwise

     Probability-CE-codepoint-set = (CL-bucket-level - min-threshold) /
   (max-threshold - min-threshold)





Briscoe               Expires January 11, 2006               [Page 14]


Internet-Draft      Controlled Load architecture             July 2005


   Where

   max-threshold > min-threshold

   max-threshold <= the fixed upper limit of CL-bucket-level

   smoothed-non-CL-queue-length is the number of bits in packets in the
   non-CL queue, smoothed as an exponentially weighted moving average
   (EWMA)

   A is either 0 or 1:

      A = 0 for the fixed bandwidth case (ie Section 2.1),

      A = 1 for the flexible bandwidth case (ie Section 2.2).



3.1.3. Egress nodes

   Egress nodes do the following tasks:

   o Metering - for CL packets, calculating the fraction of the total
      bits which are in CE packets. The calculation is done as an
      exponentially weighted moving average. A separate calculation is
      made for CL packets from each ingress router.

   o Packet colouring - for CL packets, set the DSCP and the ECN field
      to whatever has been agreed as appropriate for the next domain.

   An egress node getting a CL packet first determines which ingress
   node that packet has come from. The necessary details are gathered
   from the RSVP PATH message (previous RSVP hop, ie ingress node, vs.
   filter spec). It then updates the two meters associated with that
   ingress node. The meters work on an aggregate basis, and not per
   microflow.



   For every CL packet arrival:

   [EWMA-total-bits]n+1  =  (w * bits-in-packet)  +  ((1-w) * [EWMA-
   total-bits]n )

   [EWMA-CE-bits]n+1  =  (B * w * bits-in-packet)  +  ((1-w) * [EWMA-CE-
   bits]n )



Briscoe               Expires January 11, 2006               [Page 15]


Internet-Draft      Controlled Load architecture             July 2005


   [Congestion-Level-Estimate]n+1  =  [EWMA-CE-bits]n+1  /  [EWMA-total-
   bits]n+1



   where

   EWMA-total-bits is the total number of bits in CL packets, calculated
   as an exponentially weighted moving average (EWMA)

   EWMA-CE-bits is the total number of bits in CL packets where the
   packet has its CE codepoint set, again calculated as an EWMA.

   B is either 0 or 1:

     B = 0 if the CL packet does not have its CE codepoint set

     B = 1 if the CL packet has its CE codepoint set

   w is the exponential weighting factor.


   Varying the value of the weight trades off between the smoothness and
   responsiveness of the estimate of the percentage of CE packets. There
   will be a threshold inter-arrival time between packets of the same
   aggregate below which the egress will consider the estimate of the
   Congestion-Level-Estimate as too stale, and it will then trigger
   probing by the ingress.
   For packet colouring, by default the ECN field is set to the Not-ECT
   codepoint. Note that this results in the loss of the end-to-end
   meaning of the ECN field. It can usually be assumed that end-to-end
   congestion control is unnecessary within an end-to-end reservation.
   But if a genuine need is identified for end-to-end ECN semantics
   within a reservation, then an alternative is to tunnel CL packets
   across the CL-region, or to agree an extension to end-to-end
   signalling to indicate that the microflow uses an ECN-capable
   transport. We do not recommend such apparently unnecessary
   complexity.



3.2. Signalling

   The admission control procedure involves signalling between the
   ingress and egress nodes. The following new messages are needed:-



Briscoe               Expires January 11, 2006               [Page 16]


Internet-Draft      Controlled Load architecture             July 2005


   o Egress to ingress: piggy-backed on reservation reply: this is the
      current value of Congestion-Level-Estimate. An egress node is
      configured to know it is an egress node, so it always appends this
      to the reservation response. A flag in this message can indicate
      the value is unknown, in order to trigger probing by the ingress.

   o Ingress to egress: probe: this is a probe packet

   The description in the earlier sections has assumed that RSVP
   signalling is used. In this case, the first bullet requires
   standardisation so that the RSVP RESV message can carry a new opaque
   object with the load report.



   However, there are several other possible signalling protocols, for
   instance using NSIS. It would therefore be sensible to ensure that
   the new signalling messages do not constrain the choice of end-to-end
   QoS mechanism nor how the end-to-end and edge-to-edge (ie ingress-to-
   egress) mechanisms interact. As an example on the latter point, with
   RSVP the PATH message is forwarded immediately to the next domain,
   with the Congestion-Level-Estimate report only being calculated when
   the RESV returns, at which point it can be piggy-backed on to the
   RESV and sent to the ingress. In other cases, it may be that
   admission control is performed before the signalling message is
   forwarded to the next domain.



4. Extensions

4.1. Multi-domain and multi-operator usage

   The CL-region can consist of multiple domains. Then only the ingress
   and egress nodes of the CL-region take part in the admission control
   procedure, ie at the ingress to the first domain and the egress from
   the final domain. Note that domain border nodes within the CL-region
   do not take part in signal processing or hold path state.

   The multiple domains can even be run by different operators. The
   border routers between operators within the CL-region only have to do
   bulk accounting - per microflow metering and policing is not needed
   [Briscoe]. This is possible even when the operators do not trust each
   other. In a later version of the draft we will explain how a
   downstream domain can police that its upstream domain does not
   'cheat' by admitting traffic when the downstream path is over-
   congested [Re-feedback].


Briscoe               Expires January 11, 2006               [Page 17]


Internet-Draft      Controlled Load architecture             July 2005


4.2. Variable bit rate sources

   So far we have assumed that the real time inelastic sources operate
   at a constant bit rate. We have determined under what conditions it
   is possible to handle variable bit rate (VBR) sources. The simplest
   approach is an algorithm that decides whether to set the CE codepoint
   using a service rate much less than the real service rate (ie
   allowing an extra safety margin); the network can still operate
   efficiently when resources are shared between CL and non-CL flows.
   This approach assumes that the sources are statistically independent.

4.3. Starvation prevention

   According to the particular traffic levels it may sometimes be
   possible for either the non-CL or CL traffic to be starved. An
   algorithm to prevent starvation will be documented in a future draft.



5. Relationship to other QoS mechanisms

5.1. Standardisation requirements

   Standardisation of two functions is needed:

   o First, a new per hop behaviour is required (CL-ramp-PHB), which is
      described in [CL-PHB]. The corresponding DSCP needs to be
      RECOMMENDED rather than EXP/LU (experimental / local use), to
      enable multi-domain operation and vendor interoperability. This
      document is a use case of CL-ramp-PHB.

   o Signalling between the ingress and egress nodes and its
      interaction with the end-to-end QoS mechanism, for instance RSVP
      or NSIS. For instance, given RSVP's capabilities to carry opaque
      objects, define an object to carry the Congestion-Level-Estimate
      report. Probe packets are simply data addressed to the egress
      gateway and require no protocol standardisation, although best
      practice is required for their number, size and rate.

5.2. Controlled Load

   The CL mechanism delivers QoS similar to Integrated Services
   controlled load, but rather better as queues are kept empty by
   driving admission control from bulk token buckets on each interface
   that can detect a rise in load before queues build, sometimes termed
   a virtual queue [AVQ, vq]. It is also more robust to route changes.



Briscoe               Expires January 11, 2006               [Page 18]


Internet-Draft      Controlled Load architecture             July 2005


5.3. Integrated services operation over Diffserv

   Our approach to end-to-end QoS is similar to that described in
   [RFC2998] for Integrated services operation over Diffserv networks.
   Like [RFC2998], an IntServ class (CL in our case) is achieved end-to-
   end, with a CL-region viewed as a single reservation hop in the total
   end-to-end path. Interior routers of the CL-region do not process
   flow signalling nor do they hold state. Unlike [RFC2998] we do not
   require the end-to-end signalling mechanism to be RSVP, although it
   can be. Also, we do not use the DS architecture (see Section 5.4).

   Bearing in mind these differences, we can describe our architecture
   in the terms of the options in [RFC2998]. The Diffserv network region
   is RSVP-aware, but awareness is confined to (what [RFC2998] calls)
   the "border routers" of the Diffserv region. We use explicit
   admission control into this region, with either static provisioning
   or explicit signalling (corresponding to the configured and flexible
   bandwidth cases of Sections 2.1 and 2.2 respectively). The ingress
   "border router" does per microflow policing and sets the correct DSCP
   (ie we use router marking rather than host marking).

5.4. Differentiated Services

   The DS architecture does not specify any way for devices outside the
   domain to dynamically reserve resources or receive indications of
   network resource availability.  In practice, service providers rely
   on subscription-time Service Level Agreements (SLAs) that statically
   define the parameters of the traffic that will be accepted from a
   customer. The CL mechanism allows dynamic reservation of resources
   and unlike Diffserv it can span multiple domains without active
   mechanisms at the borders. Therefore we do not use the traffic
   conditioning agreements (TCAs) of the (informational) Diffserv
   architecture [RFC2475].

   [Johnson] compares admission control with a 'generously dimensioned'
   Diffserv network as ways to achieve QoS. The former is recommended.

5.5. ECN

   CL complies with the ECN aspects of the IP wire protocol [RFC3168],
   but provides its own edge-to-edge feedback instead of the TCP aspects
   of ECN. All nodes within a particular CL-region are upgraded with the
   CL mechanism, so the requirements of [Floyd] are met. The operator
   prevents traffic arriving at a node that doesn't understand CL by
   administrative configuration of the ring of gateways around the
   region. Where a region of nodes that understand CL spans multiple
   domains, the operators contract with each other to surround the


Briscoe               Expires January 11, 2006               [Page 19]


Internet-Draft      Controlled Load architecture             July 2005


   region by gateways to prevent CL traffic being handled by nodes that
   do not understand it.

5.6. RTECN

   Real-time ECN (RTECN) [RTECN, RTECN-usage] has a similar aim to this
   document (to achieve a low delay, jitter and loss service suitable
   for RT traffic) and a similar approach (per microflow admission
   control combined with an "early warning" of potential congestion
   through setting the CE codepoint). But it has a different
   architecture: host-to-host (rather than edge-to-edge). [CL-PHB]
   defines a new PHB, CL-step-PHB, that should be suitable; its
   algorithm is similar to CL-ramp-PHB, but setting the CE codepoint is
   either 'on' or 'off'. Only probe packets use the CL-step-PHB, whilst
   data uses the Expedited Forwarding PHB [RFC3246].

5.7. RMD

   Resource Management in Diffserv (RMD) [RMD] is similar to this work,
   in that it pushes complex classification, traffic conditioning and
   admission control functions to the edge of a DS domain and simplifies
   the operation of the interior nodes. One of the RMD modes uses
   measurement-based admission control, however it works differently:
   each interior node measures the user traffic load in the PHB traffic
   aggregate, and each interior node processes a local RESERVE message
   and compares the requested resources with the available resources
   (maximum allowed load minus current load).

   Hence a difference is that the CL architecture described in this
   document has been designed not to require interaction between
   interior nodes and signalling, whereas in RMD all interior nodes are
   QoS-NSLP aware. So our architecture is more agnostic to signalling,
   requires fewer changes to existing standards and therefore works with
   existing RSVP as well as having the potential to work with future
   signalling protocols like NSIS.

5.8. MPLS-TE

   Multi-protocol label switching traffic engineering (MPLS-TE) allows
   reservation of resources for an aggregate of many flows. However, it
   still requires admission control and policing (using a bandwidth
   manager) of microflows into the aggregate. This must be repeated at
   each trust boundary. The present technique could be used for
   admission control of microflows into a set of MPLS-TE aggregates.
   They may span multiple domains without requiring per-microflow
   processing at the trust boundaries. However it would require that the
   MPLS header could include the ECN field.


Briscoe               Expires January 11, 2006               [Page 20]


Internet-Draft      Controlled Load architecture             July 2005




6. Security Considerations

   To protect against denial of service attacks, the ingress node of the
   CL-region needs to police all CL packets and drop packets in excess
   of the reservation.

   Further security aspects to be considered later.



7. Acknowledgements

   We thank Joe Babiarz for very helpful discussion about this document
   and [RTECN].

   This work evolved from the Guaranteed Stream Provider developed in
   the M3I project [GSPa, GSP-TR], which in turn was based on the
   theoretical work of Gibbens and Kelly [DCAC].

8. Comments solicited

   Comments and questions are encouraged and very welcome. They can be
   sent to the Transport Area Working Group's mailing list,
   tsvwg@ietf.org, and/or to the authors (either individually or
   collectively at gqs@jungle.bt.co.uk).



9. References

   A later version will distinguish normative and informative
   references.

   [AVQ]         S. Kunniyur and R. Srikant "Analysis and Design of an
                 Adaptive Virtual Queue (AVQ) Algorithm for Active
                 Queue Management", In: Proc. ACM SIGCOMM'01, Computer
                 Communication Review 31 (4) (October, 2001).

   [Briscoe]     Bob Briscoe and Steve Rudkin, "Commercial Models for
                 IP Quality of Service Interconnect", BT Technology
                 Journal, Vol 23 No 2, April 2005.






Briscoe               Expires January 11, 2006               [Page 21]


Internet-Draft      Controlled Load architecture             July 2005


   [CL-PHB]      B. Briscoe, G. Corliano, P. Eardley, P. Hovell, A.
                 Jacquet, D. Songhurst, "The Controlled Load per hop
                 behaviour", draft-briscoe-tsvwg-cl-phb-00.txt (work in
                 progress), July 2005

   [DCAC]        Richard J. Gibbens and Frank P. Kelly "Distributed
                 connection acceptance control for a connectionless
                 network", In: Proc. International Teletraffic Congress
                 (ITC16), Edinburgh, pp. 941ù952 (1999).

   [Floyd]       S. Floyd, 'Specifying Alternate Semantics for the
                 Explicit Congestion Notification (ECN) Field', draft-
                 floyd-ecn-alternates-00.txt (work in progress), April
                 2005

   [GSPa]        Karsten (Ed.), Martin "GSP/ECN Technology \&
                 Experiments", Deliverable: 15.3 PtIII, M3I Eu Vth
                 Framework Project IST-1999-11429, URL:
                 http://www.m3i.org/ (February, 2002) (superseded by
                 [GSP- TR])

   [GSP-TR]      Martin Karsten and Jens Schmitt, "Admission Control
                 Based on Packet Marking and Feedback Signalling ¡--
                 Mechanisms, Implementation and Experiments", TU-
                 Darmstadt Technical Report TR-KOM-2002-03, URL:
                 http://www.kom.e-technik.tu-
                 darmstadt.de/publications/abstracts/KS02-5.html (May,
                 2002)

   [Johnson]     DM Johnson, 'QoS control versus generous
                 dimensioning', BT Technology Journal, Vol 23 No 2,
                 April 2005

   [Re-feedback] Bob Briscoe, Arnaud Jacquet, Carla Di Cairano-
                 Gilfedder, Andrea Soppera, Re-feedback for Policing
                 Congestion Response in an Inter-network, ACM SIGCOMM
                 2005, August 2005.

   [Reid]        ABD Reid, 'Economics and scalability of QoS
                 solutions', BT Technology Journal, Vol 23 No 2, April
                 2005

   [RFC2208]     F. Baker et al, "Resource ReSerVation Protocol (RSVP)
                 --- Version 1 Applicability Statement; Some Guidelines
                 on Deployment" RFC2208 (January, 1997)




Briscoe               Expires January 11, 2006               [Page 22]


Internet-Draft      Controlled Load architecture             July 2005


   [RFC2211]     J. Wroclawski, Specification of the Controlled-Load
                 Network Element Service, September 1997

   [RFC2309]     Braden, B., et al., "Recommendations on Queue
                 Management and Congestion Avoidance in the Internet",
                 RFC 2309, April 1998.

   [RFC2474]     Nichols, K., Blake, S., Baker, F. and D. Black,
                 "Definition of the Differentiated Services Field (DS
                 Field) in the IPv4 and IPv6 Headers", RFC 2474,
                 December 1998

   [RFC2475]     Blake, S., Black, D., Carlson, M., Davies, E., Wang,
                 Z. and W. Weiss, "An Architecture for Differentiated
                 Services", RFC 2475, December 1998.

   [RFC2597]     Heinanen, J., Baker, F., Weiss, W. and J. Wrocklawski,
                 "Assured Forwarding PHB Group", RFC 2597, June 1999.

   [RFC2998]     Bernet, Y., Yavatkar, R., Ford, P., Baker, F., Zhang,
                 L., Speer, M., Braden, R., Davie, B., Wroclawski, J.
                 and E. Felstaine, "A Framework for Integrated Services
                 Operation Over DiffServ Networks", RFC 2998, November
                 2000.

   [RFC3168]     Ramakrishnan, K., Floyd, S. and D. Black "The Addition
                 of Explicit Congestion Notification (ECN) to IP", RFC
                 3168, September 2001.

   [RFC3246]     B. Davie, A. Charny, J.C.R. Bennet, K. Benson, J.Y. Le
                 Boudec, W. Courtney, S. Davari, V. Firoiu, D.
                 Stiliadis, 'An Expedited Forwarding PHB (Per-Hop
                 Behavior)', RFC 3246, March 2002.

   [RMD]         Attila Bader, Lars Westberg, Georgios Karagiannis,
                 Cornelia Kappler, Tom Phelan, 'RMD-QOSM - The Resource
                 Management in Diffserv QoS model', draft-ietf-nsis-
                 rmd-03 Work in Progress, June 2005.

   [RTECN]       Babiarz, J., Chan, K. and V. Firoiu, 'Congestion
                 Notification Process for Real-Time Traffic', draft-
                 babiarz-tsvwg-rtecn-03" Work in Progress, February
                 2005.






Briscoe               Expires January 11, 2006               [Page 23]


Internet-Draft      Controlled Load architecture             July 2005


   [RTECN-usage] Alexander, C., Ed., Babiarz, J. and J. Matthews,
                 'Admission Control Use Case for Real-time ECN, draft-
                 alexander-rtecn-admission-control-use-case-00', Work
                 in Progress, February 2005.

   [vq]          Costas Courcoubetis and Richard Weber "Buffer Overflow
                 Asymptotics for a Switch Handling Many Traffic
                 Sources" In: Journal Applied Probability 33 pp. 886--
                 903 (1996).





Authors' Addresses

   Bob Briscoe
   BT Research
   B54/77, Sirius House
   Adastral Park
   Martlesham Heath
   Ipswich, Suffolk
   IP5 3RE
   United Kingdom
   Email: bob.briscoe@bt.com


   Dave Songhurst
   BT Research
   B54/69, Sirius House
   Adastral Park
   Martlesham Heath
   Ipswich, Suffolk
   IP5 3RE
   United Kingdom
   Email: dsonghurst@jungle.bt.co.uk












Briscoe               Expires January 11, 2006               [Page 24]


Internet-Draft      Controlled Load architecture             July 2005


   Philip Eardley
   BT Research
   B54/77, Sirius House
   Adastral Park
   Martlesham Heath
   Ipswich, Suffolk
   IP5 3RE
   United Kingdom
   Email: philip.eardley@bt.com


   Peter Hovell
   BT Research
   B54/69, Sirius House
   Adastral Park
   Martlesham Heath
   Ipswich, Suffolk
   IP5 3RE
   United Kingdom
   Email: peter.hovell@bt.com


   Gabriele Corliano
   BT Research
   B54/70, Sirius House
   Adastral Park
   Martlesham Heath
   Ipswich, Suffolk
   IP5 3RE
   United Kingdom
   Email: gabriele.2.corliano@bt.com


   Arnaud Jacquet
   BT Research
   B54/70, Sirius House
   Adastral Park
   Martlesham Heath
   Ipswich, Suffolk
   IP5 3RE
   United Kingdom
   Email: arnaud.jacquet@bt.com







Briscoe               Expires January 11, 2006               [Page 25]


Internet-Draft      Controlled Load architecture             July 2005




Intellectual Property Statement

   The IETF takes no position regarding the validity or scope of any
   Intellectual Property Rights or other rights that might be claimed to
   pertain to the implementation or use of the technology described in
   this document or the extent to which any license under such rights
   might or might not be available; nor does it represent that it has
   made any independent effort to identify any such rights.  Information
   on the procedures with respect to rights in RFC documents can be
   found in BCP 78 and BCP 79.

   Copies of IPR disclosures made to the IETF Secretariat and any
   assurances of licenses to be made available, or the result of an
   attempt made to obtain a general license or permission for the use of
   such proprietary rights by implementers or users of this
   specification can be obtained from the IETF on-line IPR repository at
   http://www.ietf.org/ipr.

   The IETF invites any interested party to bring to its attention any
   copyrights, patents or patent applications, or other proprietary
   rights that may cover technology that may be required to implement
   this standard.  Please address the information to the IETF at
   ietf-ipr@ietf.org

Disclaimer of Validity

   This document and the information contained herein are provided on an
   "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
   OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET
   ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED,
   INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE
   INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Copyright Statement

   Copyright (C) The Internet Society (2005).

   This document is subject to the rights, licenses and restrictions
   contained in BCP 78, and except as set forth therein, the authors
   retain all their rights.






Briscoe               Expires January 11, 2006               [Page 26]