Internet Engineering Task Force                                M. Scharf
Internet-Draft                                  Alcatel-Lucent Bell Labs
Intended status: Experimental                              July 12, 2010
Expires: January 13, 2011


                 Multi-Connection TCP (MCTCP) Transport
                      draft-scharf-mptcp-mctcp-01

Abstract

   Multipath transport over potentially different paths can be realized
   by several coupled Transmission Control Protocol (TCP) connections.
   Multi-Connection TCP (MCTCP) transport aggregates multiple TCP
   connections between potentially different addresses into a single
   session that can be accessed by an application like a single TCP
   connection.  MCTCP encodes control information, as far as possible,
   in the payload of the TCP connections and therefore requires only
   minor changes in the TCP implementations, and it is transparent in
   the single-path case.  MCTCP is therefore proposed as a simple,
   modular, and extensible mechanism for multipath transport.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on January 13, 2011.

Copyright Notice

   Copyright (c) 2010 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents



Scharf                  Expires January 13, 2011                [Page 1]


Internet-Draft            Multi-Connection TCP                 July 2010


   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction . . . . . . . . . . . . . . . . . . . . . . . . .  3
   2.  Terminology  . . . . . . . . . . . . . . . . . . . . . . . . .  3
   3.  Design Considerations  . . . . . . . . . . . . . . . . . . . .  4
     3.1.  Objectives . . . . . . . . . . . . . . . . . . . . . . . .  4
     3.2.  Operation Summary  . . . . . . . . . . . . . . . . . . . .  5
     3.3.  Differences to Other Multipath Transport Solutions . . . .  9
   4.  TCP Extensions by MCTCP  . . . . . . . . . . . . . . . . . . . 14
     4.1.  Setup of the Initial Connection  . . . . . . . . . . . . . 14
     4.2.  Setup of Coupled Connection  . . . . . . . . . . . . . . . 15
     4.3.  Usage of Coupled Connections . . . . . . . . . . . . . . . 17
     4.4.  Operation Mode Switch  . . . . . . . . . . . . . . . . . . 18
   5.  MCTCP Session Protocol Messages  . . . . . . . . . . . . . . . 19
     5.1.  Data Segmentation and Encoding . . . . . . . . . . . . . . 19
     5.2.  Retransmission Requests  . . . . . . . . . . . . . . . . . 21
     5.3.  Address Advertisement  . . . . . . . . . . . . . . . . . . 22
     5.4.  Connection Management and Fallback . . . . . . . . . . . . 24
   6.  MCTCP Session Policies and Algorithms  . . . . . . . . . . . . 25
     6.1.  Message Scheduling . . . . . . . . . . . . . . . . . . . . 25
     6.2.  Congestion and Flow Control  . . . . . . . . . . . . . . . 25
   7.  Interfaces . . . . . . . . . . . . . . . . . . . . . . . . . . 26
     7.1.  Interface between MCTCP and TCP  . . . . . . . . . . . . . 26
     7.2.  Interface to Applications  . . . . . . . . . . . . . . . . 27
   8.  Interaction with Middleboxes . . . . . . . . . . . . . . . . . 27
     8.1.  Middleboxes that Manipulate TCP Options  . . . . . . . . . 27
     8.2.  Middleboxes that Change Content  . . . . . . . . . . . . . 28
     8.3.  Middleboxes that Translate Addresses/Ports . . . . . . . . 29
     8.4.  Middleboxes that Want to Control MCTCP Traffic . . . . . . 30
     8.5.  Middleboxes that Proactively Acknowledge Data  . . . . . . 30
   9.  Open Issues  . . . . . . . . . . . . . . . . . . . . . . . . . 31
   10. Security Considerations  . . . . . . . . . . . . . . . . . . . 31
   11. IANA Considerations  . . . . . . . . . . . . . . . . . . . . . 32
   12. Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . 32
   13. Acknowledgments  . . . . . . . . . . . . . . . . . . . . . . . 32
   14. References . . . . . . . . . . . . . . . . . . . . . . . . . . 33
     14.1. Normative References . . . . . . . . . . . . . . . . . . . 33
     14.2. Informative References . . . . . . . . . . . . . . . . . . 33
   Appendix A.  Possible Future MCTCP Extension . . . . . . . . . . . 33
   Appendix B.  Change History of the Document  . . . . . . . . . . . 35

1.  Introduction

   The objective of Multipath TCP is to enable transport over multiple
   paths that can be used like a regular TCP connection [1].  The
   motivation for using multiple paths, as well as design
   considerations, are discussed in [7].

   One key question concerning the Multipath TCP protocol design is how
   to transport the control information, which is required for the setup
   and the teardown of different sub-flows, as well as for the
   segmentation and reassembly of the byte stream in the sender and
   receiver, respectively.  One possibility is to encode this signaling
   information in several new TCP options [8].

   This document describes Multi-Connection TCP (MCTCP) transport.
   MCTCP is an alternative solution that transports both application and
   control data with its own framing mechanism in the payload of
   parallel TCP connections, but only if multipath transport is really
   needed.  MCTCP is simpler and more modular while providing almost the
   same service as a Multipath TCP protocol with option signaling.

   To applications, MCTCP offers the same reliable, in-order, byte-
   stream transport as TCP.  It is designed to be backward-compatible
   with both applications and the network layer.  Applications can use
   MCTCP exactly like a single TCP connection, as described in [11].  As
   long as multiple paths are not used, an MCTCP transfer is identical
   to a standard TCP transfer, except for a new TCP option in SYN
   segments that detects MCTCP support in the remote end.  Once multi-
   connection transfer is enabled, data chunks are sent over several TCP
   connections with a new type-length-value (TLV) framing format.  This
   framing also permits the exchange of arbitrary amounts of control
   information between the endpoints of the MCTCP session.  The multiple
   TCP connections operate independently, but the MCTCP session
   coordinates the congestion control states.  MCTCP can therefore use a
   coupled congestion control (e. g., [10]) that does not harm other
   network users.

2.  Terminology

   This document uses a terminology that differs slightly from [8]:

      Path: A sequence of links between a sender and a receiver, defined
      in this context by a source and destination address pair.

      Initial connection: The first TCP connection between the two
      endpoints of the MCTCP session.

      Coupled connection: A coupled connection is a follow-up TCP
      connection that is part of the session.  It roughly corresponds to
      a "subflow" in [8].

      Session: A collection of the initial connection and, if in use,
      one or more coupled TCP connections.  The applications at the two
      endpoints of the session can communicate as if there was a single
      TCP connection only.  For an application, there is a one-to-one
      mapping between a session and the socket.  If a session includes
      only the initial connection, it is almost identical to a standard
      TCP connection, except for a new TCP option in the SYN segments.

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [3].

3.  Design Considerations

   This section gives a high-level, non-normative overview of MCTCP's
   design and its usage.

3.1.  Objectives

   With multipath transport, applications should be able to use the
   aggregated bandwidth of several paths without having to deal with the
   details of data transport, path management, scheduling, and
   congestion control.  This can improve both performance and resilience
   compared to the current data transport that is mostly limited to a
   single path.

   Yet, a multipath transport solution that requires multiple addresses
   at least on one side will only be useful under certain constraints:
   First, it requires endsystems with more than one address.  Examples
   are mobile devices with several radio interfaces, which are
   increasingly common.  But even in that case it can make sense to use
   one interface only, for instance in order to save battery energy.
   Second, due to the signaling overhead and the effort of negotiation,
   a multipath transport mechanism is mainly useful for long bulk data
   transfers.  In the Internet, this use case only represents a small
   subset of TCP's usage scenarios.

   Given this rather specific use case, this document argues that a
   multipath transport mechanism should neither require complex
   modifications of the TCP stack, nor fundamentally change the TCP data
   transmission as seen by middleboxes on the path, at least as long as
   only a single path is in use.  Obviously, once multipath transport is
   enabled, any middlebox performing deep packet inspection may get
   confused as it will only see that part of the byte stream that is
   transported over the corresponding path.  As a consequence, one can
   use a different framing format in that case.  Furthermore, rapid
   deployment of a multipath solution would also significantly benefit
   from the possibility to implement it in the user space, as far as
   possible.

   Multi-Connection TCP (MCTCP) transport is designed to be a simple,
   modular, extensible, and non-disruptive multipath transport
   mechanism.  Key design objectives are:

   o  Backward-compatibility: MCTCP is designed to be entirely backward-
      compatible with a single TCP connection and falls back to standard
      TCP if it is not supported by both endsystems, or if the setup of
      additional coupled connections fails.

   o  Few TCP options only: MCTCP only requires new, short TCP options
      in SYN segments, at least for the basic operation.  As a result,
      middleboxes that strip, duplicate, or modify TCP options, drop
      such packets, or reassemble the byte stream cannot affect the
      integrity of the data transport.

   o  Identical byte stream: MCTCP's byte stream is identical to a TCP
      connection until multipath usage gets negotiated, except for the
      new TCP option in the SYN.  As a fallback, it is in principle even
      possible to seamlessly continue the transport of the whole
      application data over the initial TCP connection, if multipath
      transport fails (e. g., due to middleboxes).

   o  Simplicity: MCTCP tries to minimize the changes required inside
      existing network stacks.  Except for a few straightforward
      add-ons, a coupled TCP connection is set up, maintained, and
      closed like a standard TCP connection.  The major functions of
      MCTCP can be implemented in the user space.

   o  Same API: MCTCP can provide the same API to applications as the
      existing TCP.

   o  Multi-address assumption: MCTCP assumes that one or both endpoints
      of an MCTCP session are multihomed and multiaddressed.

   These objectives are achieved by defining two different operation
   modes of MCTCP, the single-connection and the multi-connection mode.

3.2.  Operation Summary

   In single-connection mode, an MCTCP session is equivalent to a single
   TCP connection.  The required minimum of control information is
   exchanged by TCP options.  When multipath transfer is to be enabled,
   MCTCP switches to the multi-connection mode, in which it opens
   additional, coupled TCP connections from or to possibly different
   addresses of the same endsystems.  The initial and coupled
   connections are linked by two tokens in each session endpoint, which
   are exchanged during the setup of the initial connection.

   Each coupled TCP connection can transport control information and
   data chunks in messages that are encoded in a type-length-value (TLV)
   framing format.  In multi-connection mode, the MCTCP transport on one
   of the coupled TCP connections is similar to the Transport Layer
   Security (TLS) protocol [5], except that data is not encrypted but
   partitioned over different connections.  TLS can be used on top of
   MCTCP without requiring any adaptation.
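
   As a rough illustration of the TLV framing idea, the following Python
   sketch encodes messages onto a byte stream and splits a received
   buffer back into messages.  The header layout (1-byte type, 2-byte
   length) and the type codes are assumptions made for this example
   only; they are not the message format defined by the MCTCP session
   protocol.

```python
import struct

# Hypothetical type codes; the real values are defined by the MCTCP
# session protocol, not by this sketch.
TYPE_DATA = 1
TYPE_CONTROL = 2

# Assumed header layout: 1-byte type, 2-byte big-endian length.
HEADER = struct.Struct("!BH")


def encode_tlv(msg_type: int, value: bytes) -> bytes:
    """Prefix a value with its type and length."""
    return HEADER.pack(msg_type, len(value)) + value


def decode_tlvs(buffer: bytes):
    """Split a received byte string into complete (type, value) pairs.

    Returns the decoded messages and the unconsumed remainder, since a
    read from a TCP connection may end in the middle of a message.
    """
    messages = []
    offset = 0
    while len(buffer) - offset >= HEADER.size:
        msg_type, length = HEADER.unpack_from(buffer, offset)
        end = offset + HEADER.size + length
        if end > len(buffer):
            break  # incomplete message, wait for more bytes
        messages.append((msg_type, buffer[offset + HEADER.size:end]))
        offset = end
    return messages, buffer[offset:]
```

   Because each message carries its own length, data chunks and control
   messages can be interleaved on the same coupled connection, and the
   receiver can skip over message types it does not understand.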

   In summary, in single-connection mode MCTCP is transparent, while in
   multi-connection mode it acts as a shim layer between several coupled
   TCP connections and the upper protocol layers, with a payload
   encoding similar to that of TLS.  An MCTCP session can also fall back
   to single-connection mode as a means to further increase MCTCP's
   robustness when facing problems with certain types of middleboxes.

            +-------------------------------+
            |           Application         |
            +-------------------------------+
                          ^^^^
                          |||| Byte stream (e. g., socket interface)
                          VVVV
            +-------------------------------+
            |       MCTCP session layer     |
            +-------------------------------+
               ^^   ^               ^   ^^
       Chunked ||   : Connection &  :   || Chunked
       data    ||   : cong. control :   || data
               VV   V               V   VV
            +---------------+---------------+
            | TCP connection| TCP connection|
            +-------------------------------+
            |       IP      |      IP       |
            +-------------------------------+

                   Figure 1: MCTCP in the protocol stack

   Figure 1 shows the position of MCTCP in the protocol stack, as a shim
   layer between (coupled) TCP connections and upper-layer protocols or
   applications.  For MCTCP's connection management and the coupled
   congestion control, the MCTCP session layer requires an additional
   interface to each TCP connection, as well as some simple changes in
   the TCP stack, e. g., to set the new TCP option in SYN segments.
   Both modifications are straightforward and only affect a small subset
   of TCP's functions.

   The MCTCP session layer can be implemented in the kernel space as an
   extension of the socket interface processing.  Alternatively, the
   connection management, data segmentation/reassembly, and congestion
   control coupling can be realized in the user space, in combination
   with some small modifications of TCP.  As an example, MCTCP could be
   implemented as an extension of the library that offers the socket
   interface to applications.  In both cases the MCTCP session layer can
   be completely transparent to applications, i. e., they can continue
   to use the existing socket interface to TCP [11].

   In the following, a high-level summary of normal operation of MCTCP
   is provided, for the scenario shown in Figure 2:

   o  To a non-MCTCP-aware application, MCTCP will be transparent and
      indistinguishable from normal TCP.  All MCTCP operation is handled
      by the MCTCP implementation, although extended APIs could provide
      additional control and influence [11].  An application begins by
      opening a TCP socket in the normal way.

   o  An MCTCP session begins in single-connection mode with a single
      TCP connection ("initial connection").  This is illustrated in
      Figure 2 between Addresses A1 and B1 on Hosts A and B,
      respectively.

   o  MCTCP uses a "Multipath Capable" TCP option in the SYN segments
      to determine whether both endsystems support MCTCP.  If the option
      is not echoed in the SYN/ACK, the connection initiator knows that
      the destination is not MCTCP-capable.  If the SYN segment has to
      be retransmitted, the connection initiator will not set the
      "Multipath Capable" TCP option again, in order to circumvent
      problems with middleboxes that cannot deal with unknown TCP
      options.  In that case, multipath transport cannot be used to that
      destination.

   o  MCTCP does not exchange much signaling information in single-
      connection mode, as this would require further TCP options outside
      SYN segments.  The only exception is the non-mandatory "Mode" TCP
      option, which can be set by one endpoint in order to signal to the
      other endpoint that it shall switch to multi-connection mode by
      establishing a coupled connection to the same destination IP
      address, over which additional information can then be exchanged.
      If this TCP option is removed on the path, MCTCP may not be able
      to enable multipath transport in some usage scenarios (e. g.,
      behind NAPTs), but the single-connection transport will continue
      without being impacted.

   o  If additional addresses are available, and if they shall be used,
      MCTCP switches to the multi-connection mode.

   o  When entering multi-connection mode, the MCTCP session endpoints
      establish one or more coupled TCP connections.  The first coupled
      connection should use the same IP source and destination addresses
      as the initial connection, in order to establish a control
      channel over which more information can be exchanged.  Each
      coupled connection is added to the MCTCP session.

   o  MCTCP identifies multiple paths by the presence of multiple
      addresses at endpoints, and it can establish coupled connections
      between combinations of these multiple addresses.  In the example
      shown in Figure 2, coupled connections are set up between A1 and
      B1, and between A2 and B1.

   o  The discovery and setup of additional coupled TCP connections will
      be achieved through a path management method described later in
      this document.

   o  The coupled connections use TLV-encoded messages and can thus
      transport both control messages and data chunks.  The data chunks
      include a session-level sequence number to allow the in-order
      reassembly of the data chunks from multiple coupled connections at
      the receiver.
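
   The in-order reassembly mentioned in the last item above can be
   sketched as follows.  This is a minimal illustration, assuming that
   the session-level sequence number is the byte offset of a chunk in
   the session's byte stream and that chunks never partially overlap;
   the class and method names are hypothetical.

```python
class Reassembler:
    """Reorders data chunks that arrive over several coupled connections."""

    def __init__(self) -> None:
        self.next_seq = 0   # next session-level byte offset to deliver
        self.pending = {}   # out-of-order chunks, keyed by sequence number

    def on_chunk(self, seq: int, data: bytes) -> bytes:
        """Store one chunk and return all bytes now deliverable in order."""
        # Chunks below next_seq were already delivered and are ignored.
        if seq >= self.next_seq and data:
            self.pending.setdefault(seq, data)
        out = bytearray()
        while self.next_seq in self.pending:
            chunk = self.pending.pop(self.next_seq)
            out += chunk
            self.next_seq += len(chunk)
        return bytes(out)
```

   A chunk that arrives early on one coupled connection is held back
   until the gap before it has been filled by chunks from the other
   connections, so the application still sees a single ordered stream.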

             Host A                               Host B
    ------------------------             ------------------------
    Address A1    Address A2             Address B1    Address B2
    ----------    ----------             ----------    ----------
        |             |                      |             |
        |    "Initial connection" setup      |             |   ^
        |--------------SYN+MPCAP------------>|             |   |
        |(incl. Multipath Capable TCP option)|             |   | Single-
        |             |                      |             |   | conn.
        |<----------SYN/ACK+MPCAP------------|             |   | mode
        |             |                      |             |   |
        |#####Byte stream data transfer######|             |   V
        |             |                      |             |
        ~             ~                      ~             ~
        |             |                      |             |
        |    "Coupled connections" setup     |             |
        |--------------SYN+JOIN------------->|             |
        |<-----------SYN/ACK+JOIN------------|             |   ^
        |             |                      |             |   |
        |             |------SYN+JOIN------->|             |   | Multi-
        |             |<----SYN/ACK+JOIN-----|             |   | conn.
        |             |                      |             |   | mode
        |##########TLV data transfer#########|             |   |
        |             |                      |             |   |
        |             |##TLV data transfer###|             |   V
        |             |                      |             |

                      Figure 2: MCTCP usage scenario
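
   The capability negotiation in the upper part of Figure 2 can be
   sketched as follows.  The sketch only models the decision logic
   described in the operation summary (the option is set on the first
   SYN, omitted on retransmitted SYNs, and must be echoed in the
   SYN/ACK).  The types and function names are hypothetical and do not
   correspond to any real TCP stack interface.

```python
from dataclasses import dataclass


@dataclass
class Syn:
    multipath_capable: bool  # is the "Multipath Capable" option present?


def build_syn(attempt: int) -> Syn:
    """Build the SYN for the given attempt (0 = first transmission).

    Retransmitted SYNs omit the option, to avoid problems with
    middleboxes that cannot deal with unknown TCP options.
    """
    return Syn(multipath_capable=(attempt == 0))


def peer_supports_mctcp(sent: Syn, synack_echoes_option: bool) -> bool:
    """The peer counts as MCTCP-capable only if the option was sent
    and echoed in the SYN/ACK."""
    return sent.multipath_capable and synack_echoes_option
```

   If the function returns False, the session simply continues as a
   standard TCP connection and never enters multi-connection mode.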

   For simplicity, MCTCP does not send further data over the initial
   connection after it has triggered the transition to multi-connection
   mode.  As a consequence, the initial connection will be unused in
   multi-connection mode.  This document mandates keeping the initial
   connection open as long as other coupled connections exist.  This
   design choice is motivated later in this document.

3.3.  Differences to Other Multipath Transport Solutions

   MCTCP follows the design principles outlined in [7], but it differs
   from the protocol design described in [8], which uses TCP options to
   transport all control information.  In the following, the key
   advantages of MCTCP are summarized:

   o  MCTCP does not rely on frequently sent TCP options, in particular
      not on options that may have to be present in many packets.  In
      the simplest case, it only requires two new types of TCP options
      which are set in SYN segments only.  The required options are
      short and do not consume much of the TCP option space, which is
      already scarce in SYNs.  It should also be noted that the
      selective acknowledgment (SACK) option [2] is currently the only
      major TCP option that is sporadically set after connection setup.
      Yet, SACK options are only present after packet losses or
      reordering events, which are rare, and they are often set in
      segments without payload.  Sporadically adding other new TCP
      options to all kinds of segments may increase the complexity of
      the TCP sender, since the MSS must be adapted correspondingly.  As
      a consequence, MCTCP may also be simpler to realize in combination
      with TCP segmentation offload on network cards.

   o  MCTCP's operation is much more robust in combination with
      middleboxes that strip, duplicate, or modify TCP options and/or
      drop packets with unknown TCP options.  The worst case is that
      multipath transport will not be enabled on a path with such
      middleboxes, but the data stream's integrity will not be affected.
      In general, the transport of information in TCP options outside
      SYNs is not necessarily reliable, unless an acknowledgement and
      retransmission mechanism for that information exists.  As a
      consequence, TCP options are not well suited for transport of
      information that is absolutely essential for the data integrity.
      It is also impossible to safely detect whether novel TCP options
      can indeed be exchanged between two hosts in the Internet, as the
      routing may change and additional middleboxes may appear on the
      paths, e. g., in mobile networks.  Therefore, a signaling method
      that transports essential control information such as sequence
      numbers in TCP options is not robust in such environments.
      Obviously, it cannot efficiently use multiple paths if a middlebox
      blocks TCP options, as there is no way to reliably exchange
      control information in options.  There are also situations where
      multipath transport with option encoding cannot even fall back to
      single-path transport, e. g., if routing changes and afterwards
      TCP options cannot be exchanged on all used paths.  Unlike MCTCP,
      multipath transport with option encoding would break and not be
      able to complete ongoing data transfers in such cases, except if
      it used an MCTCP-like approach as well.

   o  MCTCP is also rather robust when middleboxes rewrite content, as
      it can use a checksum to safely detect content modifications in
      one or several connections.  It could even define schemes that
      transfer such content in a different content encoding format.

   o  MCTCP offers a simple mechanism by which a middlebox can prevent
      the transport of any multi-connection traffic: It can simply drop
      SYN segments with the "JOIN" TCP option.  In that case, unless
      routing changes, paths through that middlebox will not be used in
      multi-connection mode.  If that middlebox is on the path of the
      initial connection, it will always see the whole, unmodified byte
      stream.  This middlebox-friendly design is an advantage of the
      distinction between initial and coupled connections.  It could
      also help to comply with certain network policies such as lawful
      interception.

   o  The TCP option space is limited to 40 bytes.  In multi-connection
      mode, MCTCP can exchange any amount of information between the
      endsystems.  As such, it is more extensible and flexible.  For
      instance, without length limitation, one can easily exchange a
      list of several IPv6 candidate addresses in the payload of a
      single TCP segment.  It would also be possible to announce lists
      of candidate port numbers or even to exchange address information
      in the form of a Uniform Resource Identifier (URI) or any other
      referral object structure.  Finally, MCTCP could use strong
      protection mechanisms between coupled connections to ensure that
      they have indeed the same endpoint, such as longer tokens.

   o  The design is modular, as the operation of a single TCP connection
      is almost independent from the multipath transport, except for the
      necessary coupling of congestion control.  For instance, there is
      no need to modify the SACK scoreboard implementation in existing
      TCP implementations, and synchronization issues between different
      TCP connections are avoided.

   o  MCTCP has a reasonable deployment roadmap.  Most functions of
      MCTCP can be realized in the user space with a small patch of the
      TCP implementation only.  The required extensions inside the
      network stack are simple, straightforward, and non-disruptive.
      This means that MCTCP can initially be deployed mostly as a user
      space solution, without lacking any features.  As a second step,
      once the protocol is widely supported in the Internet, it could
      become an integral part of the network stack.

   o  The transport of control information in the payload is reliable
      and congestion-controlled.  TLV-encoded messaging is
      straightforward and well-known, e. g., from TLS.  MCTCP does not
      use a mandatory positive acknowledgement mechanism and therefore
      does not require frequent additional data transport in the reverse
      direction.

   o  MCTCP can be extended in the future, for instance to use a
      stronger protection for the coupling of connections, possibly even
      by exchange of cryptographic keys, if needed.  A list of possible
      future extensions is provided in the appendix.
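
   To make the option-space argument in the list above concrete, the
   following Python sketch packs a few IPv6 candidate addresses into
   their binary form and compares the result with the 40-byte TCP option
   space.  The addresses are placeholders from the IPv6 documentation
   prefix, and the plain concatenation used here is an assumption for
   illustration, not the MCTCP address advertisement format.

```python
import ipaddress

TCP_OPTION_SPACE = 40  # bytes available for all TCP options in a segment


def pack_addresses(addresses):
    """Concatenate the 16-byte binary forms of IPv6 addresses."""
    return b"".join(ipaddress.IPv6Address(a).packed for a in addresses)


# Documentation-prefix addresses (2001:db8::/32), used as placeholders.
candidates = ["2001:db8::1", "2001:db8::2", "2001:db8::3"]
blob = pack_addresses(candidates)

# Three addresses already need 48 bytes, more than the option space.
print(len(blob), len(blob) > TCP_OPTION_SPACE)
```

   Even before any type or length fields are added, three IPv6 addresses
   do not fit into the TCP option space, whereas they easily fit into
   the payload of a single segment.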

   MCTCP shares a number of properties with [8].  It can use a coupled
   congestion control in a similar way, and it is able to enable
   multipath transport under the same constraints.

   Still, it must be noted that there are a number of potential
   drawbacks of MCTCP's design as well:

   o  MCTCP is designed for the use case of a bulk data transfer that
      starts as a single path transfer that is later "upgraded" in order
      to use multiple interfaces.  This is the most obvious use case of
      multipath transfer, as transporting smaller amounts of data over
      multiple paths would result in a significant overhead.  In
      contrast, MCTCP is less efficient if the multipath transfer shall
      be used right from the beginning of a transfer, due to the
      backward-compatible design of MCTCP's single-connection mode that
      results in very limited control.  If this use case were
      important, an MCTCP variant with payload encoding in the initial
      connection could be developed, too.  Its design is
      straightforward, but left for further study, as it would only be
      of use in certain scenarios.

   o  MCTCP opens an additional TCP connection when switching to multi-
      connection mode, and it does not continue using the initial
      connection.  The connection setup of the coupled connections
      results in a small delay, i. e., the path may not be completely
      utilized during a short time.  An obvious optimization would be to
      transfer the congestion control state from the initial connection
      to the first coupled connection, in order to avoid the TCP Slow-
      Start there.  Both connections should use the same path.  It must
      be noted that not using the initial connection after the switch-
      over to the multi-connection mode is the simplest solution;
      alternative solutions are possible.  Furthermore, the "handover"
      process and the resulting delay could be minimized by further
      optimization, but this is left for further study.

   o  MCTCP session endpoints do not exchange address information before
      entering the multi-connection mode, even if this would be possible
      by additional TCP options [8].  Both endsystems can initiate a
      change of operation mode, and address information can be exchanged
      by the MCTCP session protocol once this is successful.  If the
      "Mode" TCP option is supported, an endpoint can even trigger the
      setup of a coupled connection by the other endpoint, e. g., if
      that host is located behind a NAPT.  Yet, while being in single-
      connection mode, MCTCP provides no means to learn other addresses.
      As a consequence, endsystems may try to enter the multi-connection
      mode in vain, if they assume that their peer is multi-homed.  If
      that peer is not multi-homed, it can either agree to switch to
      multi-connection mode, or deny that (by not responding with a
      "Join" option).  In the former case, an additional TCP connection
      is needlessly established between both peers, and in the latter
      case data transfer could briefly slow down until MCTCP falls back
      to single-connection mode.  For long-lived connections that



Scharf                  Expires January 13, 2011               [Page 12]


Internet-Draft            Multi-Connection TCP                 July 2010


      benefit most from multi-connection mode, both cases hardly cause
      much harm.

   o  Given that MCTCP transports control information in the payload, it
      is more complex for middleboxes to parse and potentially modify
      MCTCP's control information.  In order to do so, a middlebox has
      to perform deep packet inspection and reassemble the messages of
      the coupled TCP connection(s).  This may prevent certain
      operations and optimizations by middleboxes.  However, it should
      be noted that middleboxes cannot affect the payload in other
      related protocols such as TLS either, i. e., MCTCP is somewhat
      similar to TLS in that sense.  Of course, middleboxes can still
      perform certain forms of traffic engineering for an individual
      coupled connection, such as randomizing initial sequence numbers
      or modifying the advertised receive window (which may, of course,
      do harm to any end-to-end connection).  A middlebox that wants to
      prevent MCTCP usage can simply and safely drop packets with the
      TCP "Join" option; it will then not be traversed by any multi-
      connection traffic, unless routing changes.

   o  If MCTCP detects that one coupled connection stalls, it can
      retransmit data over another connection, which can reduce the
      delivery time and prevent head-of-line blocking.  However, if
      MCTCP is partly realized in the user space, it might not be able
      to retransmit a lost segment immediately over another coupled
      connection, given that this would require complex changes of the
      segmentation and SACK scoreboard implementation in each coupled
      connection.  As a result, if congestion occurs on a subset of the
      coupled connections, the end-to-end delivery delay of a user-space
      solution may be larger than the delay of a protocol that is
      tightly integrated into the protocol stack.  In general, an
      implementation inside the protocol stack can assign data more
      flexibly and more dynamically to the different interfaces.  This
      would be an advantage of a kernel-space implementation.  Yet, a
      reasonable MCTCP session layer scheduling can reduce the risk of
      head-of-line blocking by simply avoiding long send buffer queues,
      even if it is realized in the user space.

   o  MCTCP as defined in this document does not provide some signaling
      mechanisms of [8], such as the "DATA FIN".  While it is obviously
      possible to add these mechanisms as well, it will result in a more
      complex protocol design and is therefore not addressed in this
      version of the protocol specification.

4.  TCP Extensions by MCTCP

   This section describes the modifications in the TCP protocol that are
   required by MCTCP.  MCTCP only defines additional TCP options.
   Several TCP options and mechanisms are similar to [8], but differ in
   details.  Later, Section 7.1 describes the information inside the
   TCP stack to which an MCTCP session must have access.

4.1.  Setup of the Initial Connection

   The initial connection of an MCTCP session is set up like a TCP
   connection with a three-way handshake.  A connection initiator that
   wants to announce its MCTCP capability sets the "Multipath Capable"
   TCP option in the SYN, as shown in Figure 3.  This option only
   declares that its sender is capable of using MCTCP, even if MCTCP
   will not be enabled for that session.  It includes a field that
   carries a locally unique token identifying this connection.  The two
   tokens will be used when adding additional coupled connections to
   verify that the endpoint is identical.

                          1                   2                   3
      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
     +---------------+---------------+-------------------------------+
     | Kind=OPT_MPCAP|   Length=6    |       Sender Token            :
     +---------------+---------------+-------------------------------+
     : Sender Token (contd.)         |
     +-------------------------------+

                    Figure 3: Multipath Capable option

   This option MUST only be present in packets with the SYN flag set.
   It is only used in the initial TCP connection, in order to identify
   the MCTCP session; all following (coupled) connections will use
   another, similar option to join the MCTCP session.

   If a SYN contains a "Multipath Capable" option but the SYN/ACK does
   not, it is assumed that the responder is not multipath capable and
   thus the MCTCP session MUST fall back to standard TCP.  If a SYN does
   not contain a "Multipath Capable" option, the SYN/ACK MUST NOT
   contain one in response.

   There are two tokens in an MCTCP session, one per endsystem.  The
   token is generated by the sender and has local meaning only.  It MUST
   be unique for the sender.  The token MUST be difficult for an
   attacker to guess, and thus it SHOULD be
   generated randomly.
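
   The token requirements and the option layout of Figure 3 can be
   illustrated by the following minimal sketch.  It is not part of the
   protocol definition.  The numeric value of OPT_MPCAP is a placeholder
   (253 is one of the TCP option kinds reserved for experiments), and
   the helper names are hypothetical.

```python
import secrets
import struct

# Placeholder: this document does not assign a value to OPT_MPCAP.
OPT_MPCAP = 253
MPCAP_LENGTH = 6  # Kind (1) + Length (1) + Sender Token (4)


def new_token(tokens_in_use):
    """Return a random 32-bit token that is not yet used on this host."""
    while True:
        token = secrets.randbits(32)
        if token not in tokens_in_use:
            tokens_in_use.add(token)
            return token


def encode_mpcap(token):
    """Encode the "Multipath Capable" option as laid out in Figure 3."""
    return struct.pack("!BBI", OPT_MPCAP, MPCAP_LENGTH, token)


def decode_mpcap(option):
    """Return the sender token, or None if the option is not valid."""
    if len(option) != MPCAP_LENGTH:
        return None
    kind, length, token = struct.unpack("!BBI", option)
    if kind != OPT_MPCAP or length != MPCAP_LENGTH:
        return None
    return token
```

   Drawing the token from a cryptographically strong source and
   re-drawing on a local collision is one simple way to meet both the
   uniqueness and the unpredictability requirement.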

   If the SYN packets are unacknowledged, it is up to a local policy to
   decide how to respond.  A sender SHOULD fall back to standard TCP (i.
   e., without the "Multipath Capable" option) after a maximum number of
   attempts, in order to work around middleboxes that may drop packets
   with unknown options.  The number of attempts that are made will be
   up to local policy.  Once the connection initiator has sent a SYN
   without the "Multipath Capable" option, it MUST fall back to regular
   TCP behavior, even if it subsequently receives a SYN/ACK that
   contains a "Multipath Capable" option.  This might happen if the
   "Multipath Capable" SYN and subsequent non-MP-capable SYN are
   reordered.  This is to ensure that the two endpoints end up in an
   interoperable state, no matter what order the SYNs arrive at the
   passive opener.
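
   The fallback rule above can be sketched as follows.  This is only an
   illustration under the assumption that the local policy is a fixed
   maximum number of attempts; the class name, the method names, and the
   default of three attempts are hypothetical.

```python
MAX_MPCAP_ATTEMPTS = 3  # hypothetical local policy value


class InitiatorHandshake:
    """Tracks whether the initiator may still use MCTCP on a connection."""

    def __init__(self, max_attempts=MAX_MPCAP_ATTEMPTS):
        self.max_attempts = max_attempts
        self.mpcap_syns_sent = 0
        self.plain_syn_sent = False

    def next_syn_has_mpcap(self):
        """Decide whether the next (re)transmitted SYN carries the option."""
        if not self.plain_syn_sent and self.mpcap_syns_sent < self.max_attempts:
            self.mpcap_syns_sent += 1
            return True
        # From now on the connection is committed to regular TCP.
        self.plain_syn_sent = True
        return False

    def mctcp_enabled(self, synack_has_mpcap):
        """Evaluate a SYN/ACK: MCTCP only if no plain SYN was ever sent."""
        if self.plain_syn_sent:
            return False
        return synack_has_mpcap
```

   Because the decision is latched once a SYN without the option has
   been sent, a late SYN/ACK that answers an earlier "Multipath Capable"
   SYN cannot re-enable MCTCP on that connection.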

4.2.  Setup of Coupled Connection

   An MCTCP session can open additional, coupled TCP connections.  These
   coupled TCP connections all run the MCTCP session protocol with TLV
   encoding, as specified below.  The endsystems can also use the
   coupled connection to exchange knowledge about their own address(es)
   - in particular the first one.  Using this knowledge, an endpoint can
   initiate further coupled connections over currently unused pairs of
   addresses.  Either endpoint that is part of an MCTCP session SHOULD
   be able to initiate the creation of a new coupled connection.

   A new coupled connection is started as a normal TCP three-way
   handshake.  The "Join" TCP option (Figure 4) identifies the session
   of which the new connection should become a part.  The token
   used is the locally unique token of the destination for the
   connection, as received by the "Multipath Capable" option in the SYN/
   ACK exchange of the initial connection.

                          1                   2                   3
      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
     +---------------+---------------+-------------------------------+
     | Kind=OPT_JOIN |   Length=6    |       Receiver Token          :
     +---------------+---------------+-------------------------------+
     : Receiver Token (contd.)       |
     +-------------------------------+

                      Figure 4: Multipath Join option

   This option MUST only be present when the SYN flag is set.  The
   recipient of the "Join" option with a token that is valid for an
   existing MCTCP session must decide whether to allow an additional
   coupled connection, or whether to deny it.  If the coupled connection
   shall be established, the recipient of the SYN responds with a SYN/
   ACK also containing a "Join" option, with the initiator's token.

   Otherwise, if the recipient decides to deny the setup of a coupled
   connection, it MUST reply with a TCP RST.  If the token is unknown at
   the recipient, the recipient MUST also respond with a TCP RST in the
   same way as when an unknown TCP port is used.  Similarly, if the
   initiator of a coupled connection receives a SYN/ACK with an invalid
   token or a SYN/ACK without the "Join" option, it must send a TCP RST.
   In all these cases, the setup procedure of that coupled connection
   MUST be abandoned.  As a result, the endpoints MUST return to single-
   connection mode if it is the first coupled connection.  If there are
   already other coupled connections, an endpoint SHOULD NOT use that
   address pair for multipath transport.  The verification of the tokens
   in both endpoints of the MCTCP session ensures that the endpoints of
   a coupled connection are identical to the endpoints of the initial
   connection.  Also, middleboxes that drop packets with SYN options, or
   strip the option, can be detected in that way.
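
   The token checks on both sides of the "Join" handshake can be
   sketched as follows.  This is an illustration only: the Session
   record, the function names, and the "allow" flag (standing in for a
   local policy decision) are hypothetical and not defined by this
   document.

```python
from dataclasses import dataclass


@dataclass
class Session:
    local_token: int  # token this host sent in "Multipath Capable"
    peer_token: int   # token received from the other endpoint


def handle_join_syn(sessions, token, allow):
    """Passive side: answer a SYN that carries a "Join" option.

    sessions maps local tokens to Session objects.  Returns
    ("RST", None) or ("SYNACK", token for the "Join" option).
    """
    session = sessions.get(token)
    if session is None or not allow:
        return ("RST", None)
    # The SYN/ACK carries the initiator's token.
    return ("SYNACK", session.peer_token)


def check_join_synack(session, token):
    """Active side: token is None if the SYN/ACK had no "Join" option."""
    if token is None or token != session.local_token:
        return "RST"
    return "ESTABLISH"
```

   With the tokens of Figure 5, Host B looks up Token B and answers with
   Token A, and Host A accepts the SYN/ACK only if it carries Token A.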

   A local policy SHOULD ensure that an endpoint stops re-sending SYNs
   with the "Join" option if it receives TCP RST or if it does not
   receive corresponding SYN/ACKs.  In general, an endpoint SHOULD NOT
   try to open further coupled connections if previous attempts to the
   same destination address failed.  An endpoint SHOULD also refrain
   from attempts to switch to multi-connection mode if this repeatedly
   failed before; this SHOULD be governed by a local policy.

             Host A                               Host B
    ------------------------             ------------------------
    Address A1    Address A2             Address B1    Address B2
    ----------    ----------             ----------    ----------
        |             |                      |             |
        |---------SYN+MPCAP (Token A)------->|             |   ^
        |<-----SYN/ACK+MPCAP (Token B)-------|             |   | Single-
        |             |                      |             |   | conn.
        |########Initial connection##########|             |   | mode
        |             |                      |             |   V
        ~             ~                      ~             ~
        |             |                      |             |
        |---------SYN+JOIN (Token B)-------->|             |
        |<------SYN/ACK+JOIN (Token A)-------|             |   ^
        |             |                      |             |   |
        |<=====E. g., MCTCP Add. Address=====|             |   | Multi-
        |             |                      |             |   | conn.
        |             |----------SYN+JOIN (Token B)------->|   | mode
        |             |<-------SYN/ACK+JOIN (Token A)------|   |
        |             |                      |             |   |
        |######First coupled connection######|             |   |
        |             |                      |             |   |
        |             |#####Second coupled connection######|   V
        |             |                      |             |

                   Figure 5: Example use of MCTCP tokens

   Figure 5 illustrates the usage of the two MCTCP tokens.  An endpoint
   can decide to switch to multi-connection mode any time, as long as
   the initial connection is established.  In multi-connection mode, an
   endpoint can add further coupled connections at any time.

4.3.  Usage of Coupled Connections

   The setup of the first coupled connection MUST use the same source
   and destination IP addresses and SHOULD use the same destination
   port as the initial connection.  This implies that the first coupled
   connection SHOULD be actively opened by the initiator of the initial
   connection.  This constraint ensures that the first coupled
   connection indeed uses valid addresses and that it uses the same path
   as the initial connection.  It also facilitates user-space
   implementation and network address port translation (NAPT) traversal.
   The first coupled connection has a special role because it enables
   the exchange of addresses or other information, which can be useful
   to set up additional coupled connections.

   The token supplied in the initial connection's SYN exchange is used
   for the demultiplexing of coupled connections, i. e., to associate a
   new coupled connection to an existing MCTCP session.  This means that
   the port numbers in a SYN of a coupled connection MAY NOT be used for
   demultiplexing.  Still, an active opener of a new coupled connection
   SHOULD use a destination port number that is already in use by the
   passive opener, as long as the 5-tuple is unique for each host.  Once
   a coupled connection is established, demultiplexing packets is done
   using the five-tuple, as in traditional TCP.  This strategy is
   intended to maximize the probability of the SYN being permitted by a
   firewall or network address port translation (NAPT) at the recipient
   and to avoid confusing any network monitoring software.
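
   The two demultiplexing steps described above can be sketched as
   follows.  The class and method names are hypothetical, and as a
   simplification the sketch binds the five-tuple as soon as the SYN is
   accepted rather than when the handshake completes.

```python
class Demultiplexer:
    """Maps incoming segments to MCTCP sessions."""

    def __init__(self):
        self.by_token = {}       # local token -> session
        self.by_five_tuple = {}  # five-tuple -> session

    def register_session(self, local_token, session):
        self.by_token[local_token] = session

    def on_join_syn(self, five_tuple, token):
        """SYN with "Join": look up by token, then bind the five-tuple."""
        session = self.by_token.get(token)
        if session is not None:
            self.by_five_tuple[five_tuple] = session
        return session

    def on_segment(self, five_tuple):
        """Any later segment: look up by five-tuple, as in regular TCP."""
        return self.by_five_tuple.get(five_tuple)
```

   A return value of None from on_join_syn() corresponds to the unknown
   token case, which is answered with a TCP RST.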

   Control information can be sent over any established coupled
   connection, and it always affects the MCTCP session as a whole.  As
   control information and data chunks are transported over the same
   pipe and may experience queueing in the send buffer, it is reasonable
   to send important control information immediately after the
   establishment of a new coupled connection (as shown in Figure 5 for
   an "MCTCP Additional Address" message).  A scheduler in the MCTCP
   session layer decides which MCTCP messages are sent over which
   coupled connection.

4.4.  Operation Mode Switch

   An MCTCP session endpoint MUST change its operation mode from single-
   connection to multi-connection mode once the first coupled connection
   is successfully set up.

   Either endpoint of an MCTCP session can request the other endpoint to
   switch to multi-connection mode by a "Mode" TCP option that is
   depicted in Figure 6.  This may be useful if only the other endpoint
   can establish coupled TCP connections, e. g., if it is located behind
   a middlebox performing network address port translation (NAPT).

                                          1
                      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5
                     +---------------+---------------+
                     | Kind=OPT_MODE |   Length=2    |
                     +---------------+---------------+

                           Figure 6: Mode option

   This TCP option MAY be set in segments of the initial connection.
   Its implementation is RECOMMENDED.  It MAY be set in segments without
   or with payload once the initial connection is established, as long
   as the MCTCP session is not in multi-connection mode.  The option is
   also allowed in SYN/ACK segments, but not in pure SYN segments.  If
   it is set in the SYN/ACK, it asks the connection initiator to enter
   multi-connection mode immediately.  When receiving a "Mode" TCP
   option, an MCTCP endpoint MAY send a SYN with the "Join" TCP option
   to the destination address and port of the initial connection, and
   switch to multi-connection mode.  It is also allowed to silently
   ignore that notification and to continue in single-connection mode.
   An endsystem MUST refrain from resending "Mode" TCP options
   frequently if the MCTCP session cannot successfully negotiate the
   multi-connection mode, in order to avoid needless effort.
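
   The placement rules for the "Mode" option can be summarized by the
   following sketch.  The value of OPT_MODE is a placeholder (254 is one
   of the TCP option kinds reserved for experiments), and the function
   names are hypothetical.

```python
import struct

# Placeholder: this document does not assign a value to OPT_MODE.
OPT_MODE = 254


def encode_mode():
    """Encode the two-octet "Mode" option of Figure 6."""
    return struct.pack("!BB", OPT_MODE, 2)


def mode_option_allowed(syn, ack, multi_connection_mode):
    """Return True if a "Mode" option may be set in such a segment."""
    if multi_connection_mode:
        return False  # only meaningful in single-connection mode
    if syn and not ack:
        return False  # not allowed in a pure SYN
    return True       # SYN/ACK, or any segment after the handshake
```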

5.  MCTCP Session Protocol Messages

   All coupled TCP connections run the MCTCP session protocol, which
   transports both data chunks and control messages in the format that
   is defined in this section.

5.1.  Data Segmentation and Encoding

   In multi-connection mode, MCTCP segments data in chunks and
   transports them as TLV-encoded messages over one or more coupled TCP
   connections.  The framing format of these chunks is shown in
   Figure 7.

                          1                   2                   3
      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
     +---------------+-------------------------------+---------------+
     | Type=MSG_CHUNK|      Total message length     |C|   Reserved  |
     +---------------+-------------------------------+---------------+
     |                 Session sequence number (64 bit)              :
     +---------------------------------------------------------------+
     :                 Session sequence number (contd.)              |
     +---------------------------------------------------------------+
     |                                                               |
     ~                      Data chunk (variable)                    ~
     |                                                               |
     +---------------------------------------------------------------+
     |                    Optional checksum (32 bit)                 |
     +---------------------------------------------------------------+

                    Figure 7: MCTCP Data Chunk message

   If a receiver observes a corrupted MCTCP message, e. g., by invalid
   TLV format or an invalid checksum, it SHOULD close the corresponding
   coupled connection by sending a TCP FIN.

   MCTCP uses a global sequence number during a session.  The value 0
   refers to the first byte that is sent over the initial connection.
   An MCTCP receiver reassembles the byte stream according to that
   sequence number and delivers the data in-order to the upper protocol
   layer or application.

   If the C-flag is set, the MCTCP Data Chunk message includes a 32
   bit checksum that covers the whole MCTCP message.  The checksum is
   OPTIONAL, but it helps to detect middleboxes that modify the TCP byte
   stream.  If it is present, a receiver MUST verify the checksum.  If
   there is a checksum mismatch, the receiver MUST discard the MCTCP
   message and its data, and it SHOULD close the corresponding coupled
   connection, as the integrity of the TLV framing on that connection is
   not guaranteed any more.  The receiver MAY ask for a retransmission
   of the corresponding data chunk over an alternative coupled
   connection, as defined in the next section.  If there is only one
   coupled connection, it is possible to fall back to transport over
   the initial connection, as discussed below.

   If present, the checksum is calculated by the Castagnoli CRC 32C
   algorithm that is also used in the Stream Control Transmission
   Protocol (SCTP) [4].
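
   The framing of Figure 7 can be illustrated by the following sketch.
   It rests on several assumptions that this document does not spell
   out: the value of MSG_CHUNK is a placeholder, the checksum is
   computed over the header and the data chunk (excluding the checksum
   field itself), and the checksum is stored in network byte order.

```python
import struct

MSG_CHUNK = 1   # placeholder; no value is assigned in this document
C_FLAG = 0x80   # "C" is the first bit of the flags octet in Figure 7
HEADER = struct.Struct("!BHBQ")  # type, total length, flags, sequence number


def _make_crc32c_table():
    table = []
    for byte in range(256):
        crc = byte
        for _ in range(8):
            crc = (crc >> 1) ^ 0x82F63B78 if crc & 1 else crc >> 1
        table.append(crc)
    return table


_CRC32C_TABLE = _make_crc32c_table()


def crc32c(data):
    """CRC-32C (Castagnoli), bit-reflected, table-driven."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc = _CRC32C_TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


def encode_chunk(seq, data, with_checksum=False):
    """Build one Data Chunk message for the given sequence number."""
    total = HEADER.size + len(data) + (4 if with_checksum else 0)
    if not data or total > 65535:
        raise ValueError("invalid data chunk size")
    flags = C_FLAG if with_checksum else 0
    msg = HEADER.pack(MSG_CHUNK, total, flags, seq) + data
    if with_checksum:
        msg += struct.pack("!I", crc32c(msg))
    return msg


def decode_chunk(msg):
    """Return (seq, data); raise ValueError on a corrupted message."""
    if len(msg) < HEADER.size:
        raise ValueError("truncated message")
    msg_type, total, flags, seq = HEADER.unpack_from(msg)
    if msg_type != MSG_CHUNK or total != len(msg):
        raise ValueError("invalid TLV format")
    end = total
    if flags & C_FLAG:
        end -= 4
        if end < HEADER.size:
            raise ValueError("invalid TLV format")
        (checksum,) = struct.unpack_from("!I", msg, end)
        if checksum != crc32c(msg[:end]):
            raise ValueError("checksum mismatch")
    return seq, msg[HEADER.size:end]
```

   A ValueError raised by decode_chunk() corresponds to the corrupted
   message case, in which the receiver discards the message and SHOULD
   close the affected coupled connection.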

   The sequence number in the first Data Chunk message sent over coupled
   TCP connections SHOULD be the first byte that the MCTCP
   implementation has not already enqueued on the initial connection.
   In that case, there is no overlap between data transported over the
   initial connection and data transported over the coupled
   connections, which simplifies the reassembly.  An MCTCP sender MAY
   also resend data that has already been written to the initial
   connection if a coupled connection can use a faster path, but it MUST
   NOT resend data that has already been acknowledged on the initial
   connection by the receiver.

   A sender SHOULD NOT write further data to the initial connection
   after it has sent its first Data Chunk message to a coupled
   connection, in order to simplify the reconstruction of the byte
   stream in the receiver.  The only exception is a fallback to single
   connection mode, which is needed if all coupled connections are
   closed.  The initial connection transports the upper layer protocol's
   byte stream without any gaps, i. e., the global session sequence
   number implicitly increases continuously even after multi-connection
   mode is entered.  As a consequence, apart from redundancy and
   fallback, it does not make much sense to continue sending the
   application byte stream over the initial connection.  A receiver
   SHOULD close the MCTCP session if it detects an inconsistency between
   the byte stream received over the initial connection and the data
   chunks on the coupled connections.

   The maximum allowed size of an MCTCP message is 65535 octets.
   Therefore, the maximum data chunk size is 2^16-13 = 65523 octets.
   The minimum allowed data chunk size is 1 octet.

   The segmentation of the application byte stream into data chunks and
   their assignment to coupled TCP connections is decided by a local
   algorithm in the MCTCP sender, which may take into account the path
   characteristics such as MSS, congestion control state, and other
   relevant information (e. g., the page size in case of a kernel
   implementation).  An efficient segmentation algorithm should avoid
   sending small data chunks to reduce the header overhead both in the
   MCTCP and TCP layer.
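
   The size limits and a simple segmentation can be sketched as follows.
   The segmentation algorithm is a local matter, so this is only one
   possible illustration; the MSS-based heuristic and its default of
   four TCP segments per chunk are hypothetical.

```python
HEADER_LEN = 12                       # type + length + flags + sequence number
MAX_MESSAGE = 65535
MAX_CHUNK = MAX_MESSAGE - HEADER_LEN  # 65523 octets, without checksum


def segment(data, first_seq, chunk_size):
    """Split data into (sequence number, chunk) pairs of at most chunk_size."""
    if not 1 <= chunk_size <= MAX_CHUNK:
        raise ValueError("chunk size out of range")
    return [(first_seq + off, data[off:off + chunk_size])
            for off in range(0, len(data), chunk_size)]


def chunk_size_for_mss(mss, segments_per_chunk=4):
    """Hypothetical heuristic: one message fills a whole number of segments."""
    return min(MAX_CHUNK, max(1, segments_per_chunk * mss - HEADER_LEN))
```

   For example, with an MSS of 1460 octets the heuristic yields chunks
   of 5828 octets, so that each MCTCP message including its 12 octet
   header occupies exactly four full-sized TCP segments.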

   MCTCP does not provide positive acknowledgements at session layer,
   since TCP transport is reliable as long as paths do not fail.  It is
   an allowed behavior for an MCTCP instance to free the memory after
   handing data over to a connection.  In that case, if a coupled TCP
   connection fails or if it is closed, it may be impossible to complete
   the transfer on other coupled connections.  Therefore, it is
   RECOMMENDED that an MCTCP instance caches sent data for a certain
   time.  An MCTCP sender can duplicate or retransmit data chunks over
   other coupled connections, even with overlapping sequence numbers.
   The receiver can explicitly request such retransmissions as described
   in the next section.  A retransmission strategy is more efficient if
   the retransmission is sent over a coupled connection that does not
   have a long-standing sending queue.  The MCTCP sender can infer the
   connection state from the sequence numbers and congestion control
   state of the individual connections.
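
   Since data chunks may arrive out of order, duplicated, or with
   overlapping sequence numbers, the receiver needs a reassembly buffer.
   A minimal sketch, with hypothetical names and without any limit on
   the buffered amount of data, could look as follows.

```python
class Reassembler:
    """Rebuilds the session byte stream from possibly overlapping chunks."""

    def __init__(self, next_seq=0):
        self.next_seq = next_seq  # first byte not yet delivered
        self.pending = {}         # start sequence number -> data

    def receive(self, seq, data):
        """Store a chunk and return the bytes that can now be delivered."""
        if seq + len(data) > self.next_seq:
            # Keep the longer chunk if two start at the same position.
            if len(data) > len(self.pending.get(seq, b"")):
                self.pending[seq] = data
        delivered = bytearray()
        for start in sorted(self.pending):
            if start > self.next_seq:
                break  # gap: wait for the missing bytes
            chunk = self.pending.pop(start)
            new = chunk[self.next_seq - start:]  # skip bytes already delivered
            delivered += new
            self.next_seq += len(new)
        return bytes(delivered)
```

   Chunks that lie entirely below next_seq are dropped silently, so a
   retransmission that arrives after the original data is harmless.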

5.2.  Retransmission Requests

   As the individual coupled TCP connections already provide reliable
   transport, the session error recovery only has to deal with
   connection failure or middlebox problems.  If a path fails, it will
   be necessary to retransmit the data that has not been successfully
   transported.  In this case the MCTCP sender SHOULD retransmit the
   data on a coupled connection over another path by assembling new
   MCTCP Data Chunk messages.  It MAY also close the MCTCP session
   instead.

   There are two ways in which the MCTCP sender can determine what data
   has to be retransmitted.  First, it can try to determine the missing
   data implicitly from the amount of unacknowledged data in the
   connection that fails, if it has access to this information.

   Alternatively, the MCTCP receiver can explicitly request the
   retransmission of data that has not successfully been received.
   Since MCTCP session messages are transported reliably, MCTCP uses a
   negative acknowledgment (NACK) mechanism: The receiver MAY send MCTCP
   Retransmission Request messages in order to indicate gaps in the
   received global sequence number space.  However, a receiver SHOULD
   wait until there is reasonable evidence that the data has been lost
   due to path failure, or that a retransmission over another coupled
   connection would be of significant benefit, in order to avoid
   spurious retransmissions.  The MCTCP Retransmission Request message
   MAY also be sent after a checksum mismatch in a Data Chunk message.
   It is allowed to send these messages over several coupled connections
   in parallel.  Such messages should only rarely be required, since
   TCP transport is in general reliable unless paths completely fail.
   If there are several gaps in the sequence number space, the receiver
   SHOULD coalesce the sequence numbers in a reasonable way to reduce
   the overhead.  The message format of the MCTCP Retransmission Request
   message is defined in Figure 8:

                           1                   2                   3
       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
      +---------------+-------------------------------+---------------+
      | Type=MSG_RTXRQ|      Total message length     |C| Reserved    |
      +---------------+-------------------------------+---------------+
      |              Start session sequence number (64 bit)           :
      +---------------------------------------------------------------+
      :              Start session sequence number (contd.)           |
      +---------------------------------------------------------------+
      |               End session sequence number (64 bit)            :
      +---------------------------------------------------------------+
      :               End session sequence number (contd.)            |
      +---------------------------------------------------------------+

              Figure 8: MCTCP Retransmission Request message

   The two sequence numbers refer to the first and last missing byte in
   the session sequence number space.  Upon reception of this message,
   an MCTCP sender SHOULD retransmit the data over one or more coupled
   connections other than the one that was originally used.  To do so,
   the MCTCP sender must still have the data buffered.  MCTCP also
   allows the sender to close the MCTCP session instead of
   retransmitting data, as single-path data transport
   over that path would have failed, too.
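
   The coalescing of gaps and the message layout of Figure 8 can be
   sketched as follows.  The value of MSG_RTXRQ is a placeholder, and
   the max_distance parameter is a hypothetical knob for what counts as
   a "reasonable" merge of neighboring gaps.

```python
import struct

MSG_RTXRQ = 2  # placeholder; no value is assigned in this document
RTXRQ = struct.Struct("!BHBQQ")  # type, length, flags, start, end (20 octets)


def coalesce(gaps, max_distance=0):
    """Merge (first, last) byte ranges that overlap or lie close together."""
    merged = []
    for first, last in sorted(gaps):
        if merged and first <= merged[-1][1] + 1 + max_distance:
            merged[-1] = (merged[-1][0], max(merged[-1][1], last))
        else:
            merged.append((first, last))
    return merged


def encode_rtxrq(first, last):
    """Request retransmission of the bytes first..last (inclusive)."""
    return RTXRQ.pack(MSG_RTXRQ, RTXRQ.size, 0, first, last)


def decode_rtxrq(msg):
    """Return (first, last) or raise ValueError for an invalid message."""
    if len(msg) != RTXRQ.size:
        raise ValueError("invalid Retransmission Request")
    msg_type, length, _flags, first, last = RTXRQ.unpack(msg)
    if msg_type != MSG_RTXRQ or length != RTXRQ.size or first > last:
        raise ValueError("invalid Retransmission Request")
    return first, last
```

   A larger max_distance reduces the number of request messages at the
   cost of retransmitting some bytes that were already received.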

5.3.  Address Advertisement

   As motivated in [7], path management refers to the exchange of
   information about additional paths between endpoints.  MCTCP requires
   multiple addresses at endpoints to be able to use multiple, possibly
   at least partly disjoint paths.

   In multi-connection mode, MCTCP can explicitly signal additional
   addresses of one endpoint to the other endpoint, which allows it to
   initiate new connections.  The MCTCP session can therefore also deal
   with addresses that change.

   The "Add Address" MCTCP message announces additional addresses on
   which an endpoint can be reached (Figure 9 and Figure 10).  Multiple
   messages can be sent subsequently in order to advertise several
   addresses.  This message can be sent at any time over any coupled
   connection.

                          1                   2                   3
      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
     +---------------+-------------------------------+---------------+
     | Type=MSG_AADD4|    Total message length = 8   |   Reserved    |
     +---------------+-------------------------------+---------------+
     |                       IPv4 address (32 bit)                   |
     +---------------------------------------------------------------+

              Figure 9: MCTCP Additional IPv4 Address message

   In Figure 9, the "Additional Address" message is shown for IPv4.  The
   reserved bits could be used to express priorities or policies (e. g.,
   "use now").

                          1                   2                   3
      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
     +---------------+-------------------------------+---------------+
     | Type=MSG_AADD6|   Total message length = 20   |   Reserved    |
     +---------------+-------------------------------+---------------+
     |                                                               |
     ~                     IPv6 address (128 bit)                    ~
     |                                                               |
     +---------------------------------------------------------------+

             Figure 10: MCTCP Additional IPv6 Address message

   Furthermore, there are MCTCP messages to remove candidate addresses,
   which are shown in Figure 11 and Figure 12.  If an address is
   removed, an endpoint SHOULD NOT try to open further coupled
   connections to that address.  Already established coupled connections
   are not affected by these messages and must be explicitly closed
   separately.

                          1                   2                   3
      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
     +---------------+-------------------------------+---------------+
     | Type=MSG_RADD4|    Total message length = 8   |   Reserved    |
     +---------------+-------------------------------+---------------+
     |                       IPv4 address (32 bit)                   |
     +---------------------------------------------------------------+

               Figure 11: MCTCP Remove IPv4 Address message

                          1                   2                   3
      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
     +---------------+-------------------------------+---------------+
     | Type=MSG_RADD6|   Total message length = 20   |   Reserved    |
     +---------------+-------------------------------+---------------+
     |                                                               |
     ~                     IPv6 address (128 bit)                    ~
     |                                                               |
     +---------------------------------------------------------------+

               Figure 12: MCTCP Remove IPv6 Address message
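
   The four address messages of Figures 9 to 12 share one layout and
   differ only in type and address length.  The following sketch
   illustrates this; the numeric message type values are placeholders
   and the function names are hypothetical.

```python
import ipaddress
import struct

# Placeholders; no values are assigned in this document.
MSG_AADD4, MSG_AADD6, MSG_RADD4, MSG_RADD6 = 16, 17, 18, 19


def encode_address_msg(address, remove=False):
    """Encode an Add (or Remove) Address message for an IPv4/IPv6 address."""
    ip = ipaddress.ip_address(address)
    if ip.version == 4:
        msg_type = MSG_RADD4 if remove else MSG_AADD4
    else:
        msg_type = MSG_RADD6 if remove else MSG_AADD6
    total = 4 + len(ip.packed)  # 8 for IPv4, 20 for IPv6
    return struct.pack("!BHB", msg_type, total, 0) + ip.packed


def decode_address_msg(msg):
    """Return (address, remove); raise ValueError on a type/length mismatch."""
    if len(msg) < 4:
        raise ValueError("truncated message")
    msg_type, total, _reserved = struct.unpack_from("!BHB", msg)
    expected = {MSG_AADD4: 8, MSG_RADD4: 8, MSG_AADD6: 20, MSG_RADD6: 20}
    if expected.get(msg_type) != total or len(msg) != total:
        raise ValueError("invalid address message")
    address = ipaddress.ip_address(msg[4:])
    return address, msg_type in (MSG_RADD4, MSG_RADD6)
```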

5.4.  Connection Management and Fallback

   Each coupled TCP connection is maintained individually.  A FIN only
   closes that individual connection.  If an application closes the
   socket, the MCTCP shim layer MUST close the initial connection and
   all existing coupled connections.  Apart from that, the MCTCP layer
   may always close (or even re-open) coupled connections, governed by
   the local path management policies.  In multi-connection mode, the
   MCTCP session is only closed once all coupled connections are closed.
   Coupled connections can be kept in the half-closed state, but the
   MCTCP connection management SHOULD avoid this.  It would be possible
   to specify an MCTCP message for explicitly closing the MCTCP session,
   or several coupled connections, but this is left for further study.

   MCTCP SHOULD keep the initial connection established while in
   multi-connection mode, even if it is no longer used for data
   transport.  This allows valid addresses and port numbers to be
   exposed to the application [11].  Keep-alives MAY be sent.  The
   initial connection is closed by the MCTCP layer when all coupled
   connections are closed.  If the initial connection is closed, the
   whole MCTCP session SHOULD be closed, too.  Further studies are
   needed to understand whether the initial connection could safely be
   closed earlier, and whether an MCTCP session can be kept established
   even if the addresses of the initial connection can no longer be
   used.

   If an MCTCP receiver detects that the byte stream on a coupled
   connection has been modified by a middlebox, it SHOULD close the
   corresponding coupled connection.  The error recovery and
   retransmission schemes can then transfer the corresponding data
   over other coupled connections.  If all coupled connections are
   closed, the session SHOULD fall back to single-connection mode.
   Then, data transfer SHOULD continue over the initial connection.  The
   MCTCP session MUST NOT try to enter multi-connection mode again.  As
   an alternative, either of the two session endpoints MAY decide to
   close the MCTCP session in case of such a violation of TCP's end-to-
   end semantics.
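
   The closing and fallback rules of this section can be summarized as
   a small state tracker.  The following Python sketch is one possible
   reading of the text, not a specified state machine: the class and
   method names are hypothetical, and it assumes that a regular close
   of the last coupled connection ends the session, whereas a close
   triggered by a detected middlebox modification leads to the fallback.

```python
from enum import Enum


class Mode(Enum):
    SINGLE = "single-connection"
    MULTI = "multi-connection"
    CLOSED = "closed"


class SessionState:
    """Tracks the open coupled connections of one MCTCP session."""

    def __init__(self):
        self.mode = Mode.SINGLE
        self.coupled = set()         # identifiers of open coupled connections
        self.violation_seen = False  # a middlebox modification was detected
        self.fell_back = False       # session returned to single-connection mode

    def open_coupled(self, conn_id):
        # After a fallback the session MUST NOT enter multi-connection mode again.
        if self.mode is Mode.CLOSED or self.fell_back:
            raise RuntimeError("coupled connections are not allowed")
        self.coupled.add(conn_id)
        self.mode = Mode.MULTI

    def close_coupled(self, conn_id, modified_by_middlebox=False):
        self.coupled.discard(conn_id)
        self.violation_seen = self.violation_seen or modified_by_middlebox
        if self.coupled or self.mode is not Mode.MULTI:
            return
        # The last coupled connection is gone.
        if self.violation_seen:
            self.mode = Mode.SINGLE  # continue over the initial connection
            self.fell_back = True
        else:
            self.mode = Mode.CLOSED  # regular end of the session

    def application_close(self):
        # Closing the socket closes the initial and all coupled connections.
        self.coupled.clear()
        self.mode = Mode.CLOSED
```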



   In certain cases, byte counters of the initial connection in the
   sender and receiver could get desynchronized if a middlebox
   transparently changes the length of the content sent over the initial
   connection.  As also discussed in Section 8, this violation of TCP's
   end-to-end semantics can be detected in the receiver, e. g., if there
   is a gap between the first byte received from the coupled connections
   and the last byte received from the initial connection.
   Alternatively, there could be an overlap or potentially even
   mismatching content.  If the receiver detects this, it SHOULD
   immediately close all coupled connections.  This means that the MCTCP
   session falls back to single-connection mode and continues the byte
   stream data transport over the initial connection, including all
   middlebox modifications.  As another remedy, or if a fallback is not
   possible, either sender or receiver MAY also decide to close the
   MCTCP session in case of such an event.  Further work is needed to
   define whether MCTCP should also have a method to resynchronize the
   sequence numbers at sender and receiver in such cases.
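
   The receiver-side check described above can be illustrated with a
   short Python sketch.  It is a simplification under stated
   assumptions: session offsets are counted in bytes from zero, the
   initial connection has already delivered all data that was sent
   before the switch to multi-connection mode, and the function names
   are hypothetical.  Mismatching content inside an overlap would need
   an additional byte comparison that is not shown.

```python
def classify_mode_switch(initial_bytes_received: int,
                         first_coupled_offset: int) -> str:
    """Compare the two byte counters at the switch to multi-connection mode.

    initial_bytes_received: stream bytes delivered by the initial connection.
    first_coupled_offset: session offset of the first byte that arrived
        over a coupled connection.
    """
    if first_coupled_offset > initial_bytes_received:
        return "gap"       # a middlebox may have shortened the content
    if first_coupled_offset < initial_bytes_received:
        return "overlap"   # a middlebox may have lengthened the content
    return "in-sync"


def reaction(result: str) -> str:
    """Map the classification to the recommended (SHOULD) behavior."""
    if result == "in-sync":
        return "continue"
    return "close-coupled-connections-and-fall-back"
```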

6.  MCTCP Session Policies and Algorithms

   This document does not mandate specific policies for how to use and
   share resources on the coupled connections.  Still, this section
   addresses some important issues that an MCTCP implementation must
   take into account.

6.1.  Message Scheduling

   Data and control messages can be assigned to any coupled TCP
   connection and are then sent over that connection.  Messages may be
   duplicated or retransmitted for redundancy reasons.  The receiver
   MUST process the messages in one coupled TCP connection in the order
   of arrival.  In-order message processing among several coupled
   connections of one MCTCP session is not ensured.
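
   As an illustration only, the following Python sketch assigns
   messages to coupled connections in round-robin order and duplicates
   selected messages on a second connection.  This document does not
   mandate any scheduling policy; the round-robin choice, the
   "is_critical" predicate, and the function name are assumptions made
   for the example.

```python
def assign_messages(messages, connections, is_critical=lambda msg: False):
    """Return a dict mapping each connection to its ordered send queue."""
    if not connections:
        raise ValueError("at least one coupled connection is required")
    queues = {conn: [] for conn in connections}
    for index, msg in enumerate(messages):
        primary = connections[index % len(connections)]
        queues[primary].append(msg)
        # Optionally send a redundant copy over the next connection.
        if is_critical(msg) and len(connections) > 1:
            backup = connections[(index + 1) % len(connections)]
            queues[backup].append(msg)
    return queues
```

   Each per-connection queue preserves its own order, which is all the
   receiver relies on; no ordering is implied across the queues.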

6.2.  Congestion and Flow Control

   The MCTCP protocol has neither its own congestion control nor its
   own flow control.  Instead, it relies on the algorithms in the
   individual TCP connections.  In the following, the operation is
   explained in more detail for the multi-connection mode.  In single-
   connection mode, there is no difference compared to a normal TCP
   connection.

   Concerning flow control, the operation is straightforward: If the
   MCTCP receiver runs out of buffer space, it stops reading data from
   one or more coupled TCP connections.  Depending on TCP's flow control
   and the available receive buffer, the flow control on one or more
   connections may throttle data transport until the MCTCP layer can
   process data again.

   The MCTCP layer SHOULD at least be able to queue one full-sized MCTCP
   message (i. e., 65535 bytes) for each established coupled TCP
   connection.  In order to avoid stalls of the data transfer, an
   end system SHOULD NOT actively or passively open coupled TCP
   connections when it is short on memory.  Similarly, coupled
   connections SHOULD NOT be established if an application explicitly
   sets small send or receive buffer sizes [11].
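
   The buffer recommendation can be turned into a simple admission
   check before a further coupled connection is opened.  The Python
   sketch below is an assumption-laden example: the function name and
   parameters are hypothetical, and the text does not define what a
   "small" application buffer is, so the sketch treats anything below
   one full-sized message as small.

```python
MAX_MESSAGE_SIZE = 65535  # one full-sized MCTCP message, in bytes


def may_open_coupled_connection(established: int,
                                queue_budget_bytes: int,
                                app_buffer_bytes=None) -> bool:
    """Decide whether one more coupled connection fits into memory.

    established: number of currently established coupled connections.
    queue_budget_bytes: memory the MCTCP layer may use for queueing.
    app_buffer_bytes: buffer size set explicitly by the application, or
        None if the application did not set one.
    """
    # Assumption: "small" means smaller than one full-sized message.
    if app_buffer_bytes is not None and app_buffer_bytes < MAX_MESSAGE_SIZE:
        return False
    # One full-sized message must be queueable per coupled connection.
    required = (established + 1) * MAX_MESSAGE_SIZE
    return queue_budget_bytes >= required
```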

   The coupled connections have different congestion windows.  To
   achieve resource pooling, it is necessary to couple the congestion
   windows in use on each connection, in order to push most traffic to
   uncongested links and avoid unfairness.  One algorithm that aims at
   achieving this objective is presented in [10].  MCTCP is able to use
   this or other coupled congestion control algorithms.
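
   To make the idea of coupling concrete, the following Python sketch
   shows a deliberately simplified coupled increase rule.  It is not
   claimed to be the algorithm of [10]; it only illustrates how the
   per-ACK window increase on one connection can depend on the sum of
   all congestion windows.  Windows are expressed in segments, and
   "alpha" stands for an additive increase factor that the MCTCP layer
   would set; both the rule and the names are assumptions.

```python
def coupled_increase(cwnds, index, alpha=1.0):
    """Per-ACK window increase (in segments) for connection `index`.

    cwnds: congestion windows of all coupled connections, in segments.
    alpha: additive increase factor shared by all connections.
    """
    total = sum(cwnds)
    coupled = alpha / total        # growth shared across all connections
    uncoupled = 1.0 / cwnds[index]  # what a standalone TCP would add
    # Never be more aggressive than a regular TCP on the same path.
    return min(coupled, uncoupled)
```

   With such a rule the aggregate does not grow faster than a single
   connection would, while each connection still reduces its own window
   after a loss.  Connections on congested paths therefore tend to end
   up with smaller windows than connections on uncongested paths.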

   In addition, an MCTCP sender may have local policies to decide how
   much traffic to send over the available connections.  It could also
   obtain path cost metrics from the receiver.  The latter could be
   realized by a new MCTCP message defining connection priorities,
   which is left for further study.

7.  Interfaces

   This section describes MCTCP's interfaces from a functional point of
   view.  Their realization is implementation-specific.

7.1.  Interface between MCTCP and TCP

   MCTCP must be able to control a small set of features inside a TCP
   stack and therefore requires a corresponding interface:

   o  The MCTCP layer must be able to set a "Multipath Capable" or
      "Join" TCP option in SYN segments.  It must also be notified if
      those options are set in an incoming SYN segment, it must be able
      to access the tokens, and it must be able to influence how to
      respond depending on the token value (i. e., either by a SYN/ACK
      or RST).

   o  The MCTCP layer may set the "Mode" TCP option on the established
      initial connection, in any segment other than pure SYNs, and it
      should be notified if that option is received.

   o  The MCTCP layer must be able to affect the congestion window on
      each coupled connection.  Depending on the algorithm, it may be
      sufficient to periodically set certain parameters of the
      congestion control, such as the additive increase factor.



   For efficient operation, MCTCP may also have to read certain
   information from each coupled TCP connection, such as:

   o  The current amount of acknowledged and unacknowledged data on that
      connection, or the corresponding pointers to the byte stream.

   o  The receive window advertised by the other endpoint on that
      connection.

   o  The estimated round-trip time.

   o  The maximum transmission unit (MTU) of the path, or TCP's maximum
      segment size (MSS).  Note that the MSS is not a constant value if
      TCP options are added to data segments.

   Many operating systems already provide information about a subset of
   these parameters through a kernel/user-space interface.
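
   The read side of this interface could look like the following Python
   sketch.  The record type, its field names, and the selection policy
   (lowest round-trip time among connections with at least one MSS of
   free window) are hypothetical and serve only to show how the listed
   parameters might be combined; the realization is implementation-
   specific.

```python
from dataclasses import dataclass


@dataclass
class TcpConnInfo:
    """Per-connection values that MCTCP may read from the TCP stack."""
    bytes_unacked: int    # sent but not yet acknowledged data
    cwnd_bytes: int       # current congestion window
    peer_rwnd_bytes: int  # receive window advertised by the other endpoint
    srtt_s: float         # estimated round-trip time in seconds
    mss_bytes: int        # current maximum segment size


def sendable_bytes(info: TcpConnInfo) -> int:
    """Bytes that can be sent now without exceeding either window."""
    window = min(info.cwnd_bytes, info.peer_rwnd_bytes)
    return max(0, window - info.bytes_unacked)


def pick_connection(infos: dict):
    """Return the key of the usable connection with the lowest RTT."""
    usable = [key for key, info in infos.items()
              if sendable_bytes(info) >= info.mss_bytes]
    if not usable:
        return None
    return min(usable, key=lambda key: infos[key].srtt_s)
```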

7.2.  Interface to Applications

   MCTCP provides reliable, in-order, byte-stream transport to
   applications and thus can be used by legacy applications like a
   standard TCP connection [11].  When MCTCP is realized inside the
   network stack, it is a new function block between the TCP instance
   and the socket interface, which is transparent to applications.

   Alternatively, MCTCP can be implemented in large parts by a user-
   space library that accesses an extended network stack through the
   socket interface, which may have to be enhanced to provide some
   additional control functions as explained in the previous section.
   Applications could then still use the standard APIs to that library
   and would not be affected at all.  Such a user-space implementation
   in combination with a simple patch of the network stack could
   facilitate the initial deployment of MCTCP.

8.  Interaction with Middleboxes

   There are various types of middleboxes in the Internet.  Some of them
   only parse a TCP stream (e. g., deep packet inspection), while others
   change TCP header fields on the fly, and some may even rewrite the
   TCP payload.  MCTCP is designed to be compatible with most types of
   middleboxes, but as middlebox behavior is not well specified, some
   open issues may remain.

8.1.  Middleboxes that Manipulate TCP Options

   One class of middleboxes may strip, duplicate, or modify TCP options
   and/or drop packets with unknown TCP options, and this may even
   depend on whether the SYN flag is set or not.  If a middlebox removes
   MCTCP's TCP options in SYN segments, multipath transport will not be
   enabled at all (if that middlebox is on the path of the initial
   connection), or not over that path (if the middlebox is on the path
   of a potential coupled connection towards another address).  Still,
   data transfer over the initial connection or other coupled
   connection(s) can continue without being significantly affected.

   Other TCP options that could be used by MCTCP are non-mandatory,
   i. e., the data integrity is not affected when these options are
   stripped or duplicated.  In summary, unlike protocols that transport
   essential information in TCP options outside SYNs, MCTCP operates
   safely in an environment with middleboxes that strip, duplicate, or
   modify TCP options and/or drop packets with unknown TCP options.

8.2.  Middleboxes that Change Content

   Other middleboxes may rewrite the content of the TCP payload and
   possibly also its length (e. g., by rewriting URIs).  MCTCP, as well
   as other multipath transport solutions, requires a session level
   sequence number space for the in-order reassembly of the application
   data.  If a middlebox changes the content and/or length on the
   initial connection or on coupled connections, it may be impossible to
   correctly reassemble the byte stream at the receiver.

   MCTCP will in many cases be able to detect changes of content over
   coupled connections, as it loses track of the TLV framing on that
   connection.  Content modifications can be detected even more
   reliably if the sender adds checksums to the data chunks.  If MCTCP
   detects a middlebox that changes the byte stream on a coupled
   connection, it will close the corresponding coupled connection.  The
   error recovery and retransmission schemes can then transfer the
   corresponding content over other coupled connections, or over the
   initial connection as a fallback method.
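
   How a loss of TLV framing shows up can be illustrated with a small
   parser.  The Python sketch below assumes that every MCTCP message
   starts with the 4-byte header shown in Figures 11 and 12 (8-bit
   type, 16-bit total length, 8-bit reserved) and uses placeholder type
   codes 1 to 6; both are assumptions of the example.  As stated above,
   such a check catches many modifications but not all of them.

```python
import struct

HEADER_LEN = 4
# Placeholder codes (assumption) for the six defined message types.
KNOWN_TYPES = frozenset(range(1, 7))


class FramingError(Exception):
    """Raised when the byte stream no longer parses as MCTCP messages."""


def split_messages(stream: bytes):
    """Split a byte stream into (type, payload) tuples.

    Returns the complete messages and the unparsed remainder.
    """
    messages = []
    offset = 0
    while len(stream) - offset >= HEADER_LEN:
        msg_type, total_length, _reserved = struct.unpack_from(
            "!BHB", stream, offset)
        if msg_type not in KNOWN_TYPES or total_length < HEADER_LEN:
            raise FramingError("implausible header at offset %d" % offset)
        if len(stream) - offset < total_length:
            break  # message not yet complete, wait for more data
        payload = stream[offset + HEADER_LEN:offset + total_length]
        messages.append((msg_type, payload))
        offset += total_length
    return messages, stream[offset:]
```

   A receiver that catches FramingError on a coupled connection would
   close that connection, as described in this section.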

   If a middlebox changes the length of the byte stream on the initial
   connection, the sequence numbers at sender and receiver will not be
   synchronized when entering multi-connection mode, and there could be
   a gap or an overlap even with mismatching content.  MCTCP can detect
   both cases.  MCTCP keeps the initial connection open even in multi-
   connection mode.  Therefore, if a content length modification on the
   initial connection is detected, it can fall back to the initial
   connection by closing all coupled connections and continue to use
   single-path transport.







8.3.  Middleboxes that Translate Addresses/Ports

   NAPT middleboxes that are unaware of MCTCP create two problems:
   First, as hosts have local addresses only, and the global addresses
   are not necessarily known to hosts behind the NAPT, it may not be
   possible to advertise addresses to the other endpoint.  Second, it
   may be impossible for one endpoint to open a coupled TCP connection
   to an endpoint sitting behind a NAPT middlebox.

   In order to address the latter issue, MCTCP defines the Mode option.
   With that option, one endpoint can ask the other endpoint to enter
   multi-connection mode.  As shown in Figure 13, sending this TCP
   option is useful if one endpoint has multiple public IP addresses,
   but cannot announce them over the initial connection.  If the host
   behind the NAPT middlebox receives the option and establishes a
   coupled connection, this can be used to convey the information about
   the other public address, and a coupled connection to that address
   can then be established, too.

             Host A             NAPT              Host B
    ------------------------     //      ------------------------
    Address A1    Address A2     //      Address B1    Address B2
    (private)     (private)      //       (public)      (public)
    ----------    ----------     //      ----------    ----------
        |             |          //          |             |
        |---------SYN+MPCAP------//--------->|             |   ^
        |<-----SYN/ACK+MPCAP-----//----------|             |   | Single-
        |             |          //          |             |   | conn.
        |###Initial connection###//##########|             |   | mode
        |             |          //          |             |   V
        ~             ~          ~~          ~             ~
        |             |          //          |             |
        |<--------Mode option----//----------|             |
        |             |          //          |             |
        |---------SYN+JOIN-------//--------->|             |
        |<------SYN/ACK+JOIN-----//----------|             |   ^
        |             |          //          |             |   |
        |#1st coupled connection#//##########|             |   |
        |             |          //          |             |   |
        |<=MCTCP Add. Address B2=//==========|             |   | Multi-
        |             |          //          |             |   | conn.
        |---------SYN+JOIN-------//----------------------->|   | mode
        |<------SYN/ACK+JOIN-----//------------------------|   |
        |             |          //          |             |   |
        |#2nd coupled connection#//########################|   V
        |             |          //          |             |

                 Figure 13: Example use of the Mode option



8.4.  Middleboxes that Want to Control MCTCP Traffic

   Given that MCTCP transports control information in the payload, it is
   more complex for middleboxes to parse and potentially modify MCTCP's
   control information.  In order to do so, a middlebox must perform
   deep packet inspection and it has to parse the MCTCP session messages
   in the TCP connection.  This may prevent certain operations and
   optimizations by middleboxes.  However, it should be noted that
   middleboxes cannot affect the payload in TLS either, i. e., MCTCP is
   somewhat similar to TLS in that sense.  As a remedy, it could be
   possible to define a TCP option that contains an offset field with a
   pointer to the first byte of an MCTCP control message, so that a
   middlebox can find control messages without parsing the whole byte
   stream of a coupled TCP connection.  Yet, such an option would be
   subject to all limitations of sporadically added TCP options.

   A middlebox that wants to prevent MCTCP usage can drop SYN segments
   containing the "Join" TCP option without causing any significant
   harm.  If that middlebox is on the path of the initial connection,
   MCTCP will continue using the backward-compatible initial TCP
   connection only.  If the middlebox is on the path towards another
   address, i. e., if the multi-connection mode is already entered,
   MCTCP will not establish an additional coupled connection.  Under the
   assumption of stable routing, no TLV-encoded content will pass that
   middlebox in either case.  Instead of dropping SYN segments with the
   "Join" TCP option, a middlebox could also strip the "Join" option, as
   the setup of a coupled connection will then fail.  This method would
   avoid timeouts and further retransmission attempts by the sender.

   Alternatively, a middlebox could remove the "Multipath Capable" TCP
   option from SYN segments.  Then, MCTCP will be identical to a
   standard TCP connection and never try to switch to multi-connection
   mode.  However, it is not recommended to drop SYN segments containing
   the "Multipath Capable" TCP option as a means to prevent MCTCP, since
   this needlessly results in a longer connection setup time, and since
   just dropping segments with the "Join" option would be sufficient.

8.5.  Middleboxes that Proactively Acknowledge Data

   Finally, there might be middleboxes that proactively acknowledge
   data, or middleboxes that transparently split the TCP connection.
   Such middleboxes break the end-to-end semantics of TCP connections
   [6], i. e., TCP cannot ensure a reliable end-to-end transport of data
   over such middleboxes.  Mitigating the drawbacks of proactively
   acknowledging middleboxes is mostly orthogonal to multipath
   transport.

   Yet, if such a middlebox is on a path used by MCTCP, and if this
   path fails, a specific problem arises: The MCTCP sender may
   erroneously assume that the data over the corresponding coupled
   connections has already been received by the receiver, and therefore
   it will not retransmit it.  In that case, after some time, the MCTCP
   receiver will observe a gap in the session sequence number space and
   can issue a request for retransmission.  The sender can then decide
   whether to retransmit the data over another coupled connection to
   solve this problem, or it can just close the session.  MCTCP
   explicitly allows the latter behavior as a single-path transport over
   the path with that middlebox would have failed, too.
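
   The gap detection at the receiver can be sketched as follows.  This
   Python example is a minimal illustration with hypothetical names.
   It assumes that every data chunk carries its session-level byte
   offset and that chunks do not partially overlap; the timer that
   decides when a gap has lasted "some time" is left to the caller.

```python
class Reassembler:
    """Reorders data chunks by session sequence number."""

    def __init__(self):
        self.next_seq = 0   # next session byte expected in order
        self.pending = {}   # out-of-order chunks, keyed by start offset

    def receive(self, seq: int, data: bytes) -> bytes:
        """Store one chunk and return the bytes now deliverable in order."""
        if seq >= self.next_seq:
            # Keep the first copy; duplicates may arrive for redundancy.
            self.pending.setdefault(seq, data)
        delivered = bytearray()
        while self.next_seq in self.pending:
            chunk = self.pending.pop(self.next_seq)
            delivered += chunk
            self.next_seq += len(chunk)
        return bytes(delivered)

    def missing_range(self):
        """Return (start, length) of the first gap, or None if no gap."""
        if not self.pending:
            return None
        first_buffered = min(self.pending)
        return (self.next_seq, first_buffered - self.next_seq)
```

   If missing_range() keeps returning the same range after the
   caller's timeout, the receiver can put that range into a
   retransmission request.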

   If MCTCP used positive session layer acknowledgements, future
   middleboxes could parse MCTCP's session messages and proactively
   acknowledge data on the session level, too.  MCTCP does not
   incorporate a positive session layer acknowledgement mechanism in
   order to prevent such a further violation of the end-to-end
   principle.  Of course, future middleboxes could still try to modify
   the retransmission requests inside the coupled connections, but this
   would not have any significant benefit.

9.  Open Issues

   o  Avoiding inconsistencies when switching in parallel to multi-
      connection mode.

   o  MCTCP does not support out-of-band TCP signaling transport (urgent
      flag).

10.  Security Considerations

   A generic threat analysis for the addition of multipath capabilities
   to TCP is presented in [9].  MCTCP is designed along the assumptions
   of that document, with some enhancements.  In general, MCTCP is
   subject to security threats similar to those of [8], but due to its
   extensibility, additional protection mechanisms could be incorporated
   in a future version.  For instance, MCTCP can employ more secure
   mechanisms to protect the coupling of TCP connections, even by
   cryptographic keys like in TLS.

   MCTCP uses a 32-bit token only, in order to save TCP option space in
   SYN segments.  This is reasonable, as this token is only required to
   authenticate the initiator of the first coupled connection, which
   must use the same IP source and destination address as the initial
   connection, i. e., off-path attacks are not possible.  Coupled
   connections that are added subsequently could use a more secure
   protection scheme at the MCTCP session layer, either by longer 64-bit
   tokens, or even by cryptographic methods, which could be exchanged by
   corresponding MCTCP control messages (not specified in this version
   of the document).
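
   A token check along these lines could be sketched in Python as
   shown below.  The sketch assumes that the 32-bit token is chosen at
   random and that a mismatch is answered with RST, as mentioned for
   the TCP interface in Section 7.1; the function names are
   hypothetical and the token generation is not specified by this
   section.

```python
import hmac
import secrets

TOKEN_LEN = 4  # 32-bit token carried in SYN segments


def new_token() -> bytes:
    """Create a random 32-bit token (assumption: tokens are random)."""
    return secrets.token_bytes(TOKEN_LEN)


def respond_to_join(expected: bytes, received: bytes) -> str:
    """Choose the reply to a SYN carrying the "Join" option."""
    # compare_digest avoids leaking the position of the first mismatch.
    if len(received) == TOKEN_LEN and hmac.compare_digest(expected, received):
        return "SYN/ACK"
    return "RST"
```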

   This section will be extended in a later version of this document.

11.  IANA Considerations

   This document will make a request to IANA to allocate new values for
   TCP option identifiers:

   o  OPT_MPCAP ("Multipath Capable" option)

   o  OPT_JOIN ("Join" option in order to add a coupled connection to
      the MCTCP session)

   o  OPT_MODE ("Mode" option that requests change from single-
      connection to multi-connection operation mode)

   This document also defines several types of MCTCP messages:

   o  MSG_CHUNK ("MCTCP Data Chunk")

   o  MSG_RTXRQ ("MCTCP Retransmission Request")

   o  MSG_AADD4 ("MCTCP Additional IPv4 Address")

   o  MSG_AADD6 ("MCTCP Additional IPv6 Address")

   o  MSG_RADD4 ("MCTCP Remove IPv4 Address")

   o  MSG_RADD6 ("MCTCP Remove IPv6 Address")

12.  Conclusion

   Multi-connection TCP transport is a simple, modular, and extensible
   solution to enable reliable transfer over multiple paths.  This
   specification defines the protocol on top of the TCP byte stream, the
   few required extensions of TCP, and the light-weight interface
   between MCTCP and each TCP connection.  In summary, MCTCP is a
   reasonable and incrementally deployable alternative to a signaling
   mechanism that uses TCP options only.

13.  Acknowledgments

   Michael Scharf is supported by the German-Lab project
   (http://www.german-lab.de/) funded by the German Federal Ministry of
   Education and Research (BMBF).

14.  References



14.1.  Normative References

   [1]   Postel, J., "Transmission Control Protocol", STD 7, RFC 793,
         September 1981.

   [2]   Mathis, M., Mahdavi, J., Floyd, S., and A. Romanow, "TCP
         Selective Acknowledgment Options", RFC 2018, October 1996.

   [3]   Bradner, S., "Key words for use in RFCs to Indicate Requirement
         Levels", BCP 14, RFC 2119, March 1997.

   [4]   Stewart, R., "Stream Control Transmission Protocol", RFC 4960,
         September 2007.

   [5]   Dierks, T. and E. Rescorla, "The Transport Layer Security (TLS)
         Protocol Version 1.2", RFC 5246, August 2008.

14.2.  Informative References

   [6]   Border, J., Kojo, M., Griner, J., Montenegro, G., and Z.
         Shelby, "Performance Enhancing Proxies Intended to Mitigate
         Link-Related Degradations", RFC 3135, June 2001.

   [7]   Ford, A., Raiciu, C., Barre, S., and J. Iyengar, "Architectural
         Guidelines for Multipath TCP Development",
         draft-ietf-mptcp-architecture-01 (work in progress), June 2010.

   [8]   Ford, A., Raiciu, C., and M. Handley, "TCP Extensions for
         Multipath Operation with Multiple Addresses",
         draft-ietf-mptcp-multiaddressed-00 (work in progress),
         June 2010.

   [9]   Bagnulo, M., "Threat Analysis for Multi-addressed/Multi-path
         TCP", draft-ietf-mptcp-threat-02 (work in progress),
         March 2010.

   [10]  Raiciu, C., Handley, M., and D. Wischik, "Coupled Multipath-
         Aware Congestion Control", draft-raiciu-mptcp-congestion-01
         (work in progress), March 2010.

   [11]  Scharf, M. and A. Ford, "MPTCP Application Interface
         Considerations", draft-scharf-mptcp-api-02 (work in progress),
         July 2010.

Appendix A.  Possible Future MCTCP Extensions

   This memo describes the baseline specification of MCTCP and the
   required minimum set of functions.  A future version of this
   specification may additionally add several other features to MCTCP,
   such as:

   o  Exchange of longer tokens (e. g., 64-bit) for connection coupling,
      using MCTCP control messages.

   o  Signaling messages to exchange policy information concerning the
      usage of the coupled TCP connections.

   o  A signaling message that advertises combinations of addresses and
      port numbers, e. g., to deal with corresponding policies on one
      endpoint.

   o  A signaling message that advertises additional addresses in
      another format, e. g., as URI.

   o  Positive MCTCP session-level acknowledgements ("data
      acknowledgement").

   o  A checksum in all MCTCP messages.

   o  Signaling messages to negotiate different payload encoding
      formats, e. g., MIME-like encoding.  A future version of the MCTCP
      session protocol could also define retransmission requests for a
      different encoding format to work around content modifying
      middleboxes.

   o  MCTCP control messages that manage coupled connections, such as a
      method to explicitly ask for closing several connections at MCTCP
      layer, similar to a "DATA FIN".

   o  A simple MCTCP session flow control mechanism, complementing TCP's
      flow control.

   o  A negotiation whether to indeed keep the initial connection
      established in multi-connection mode, assuming that it could
      either be closed or reused as a coupled connection.

   o  A variant of this protocol that uses TLV-encoded message transport
      right from the beginning.

   o  A method to discover and negotiate features between the two MCTCP
      session endpoints, e. g., by Hello messages similar to TLS.

   Further studies are needed to determine whether some of these
   functions should be added to MCTCP.  If so, their implementation may
   partly be optional and negotiated between the session endpoints.  The
   baseline MCTCP design should be kept as simple as possible.



Appendix B.  Change History of the Document

   Changes compared to version 00:

   o  Addition of a checksum in data chunk messages

   o  Definition of a message to request retransmission

   o  Description of how to fall back to single-connection mode

   o  Discussion of proactively acking middleboxes

   o  Various clarifications of the design motivations

Author's Address

   Michael Scharf
   Alcatel-Lucent Bell Labs
   Lorenzstrasse 10
   70435 Stuttgart
   Germany

   EMail: michael.scharf@alcatel-lucent.com



























