TCP Maintenance and Minor                               R. Scheffenegger
Extensions (tcpm)                                           NetApp, Inc.
Internet-Draft                                         November 15, 2010
Intended status: Standards Track
Expires: May 19, 2011


               Improving SACK-based loss recovery for TCP
             draft-scheffenegger-tcpm-sack-loss-recovery-00

Abstract

   This note clarifies the behavior of TCP SACK while doing loss
   recovery close to the end-of-stream.  This allows TCP SACK to never
   exhibit worse loss recovery characteristics than TCP NewReno under
   identical circumstances.

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on May 19, 2011.

Copyright Notice

   Copyright (c) 2010 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.



Scheffenegger             Expires May 19, 2011                  [Page 1]


Internet-Draft             SACK loss recovery              November 2010


Table of Contents

   1.  Introduction . . . . . . . . . . . . . . . . . . . . . . . . .  3
   2.  Requirements Language  . . . . . . . . . . . . . . . . . . . .  3
   3.  Overview . . . . . . . . . . . . . . . . . . . . . . . . . . .  4
   4.  Definitions  . . . . . . . . . . . . . . . . . . . . . . . . .  4
   5.  Algorithm  . . . . . . . . . . . . . . . . . . . . . . . . . .  5
   6.  Discussion . . . . . . . . . . . . . . . . . . . . . . . . . .  5
     6.1.  Reordering . . . . . . . . . . . . . . . . . . . . . . . .  6
   7.  Acknowledgements . . . . . . . . . . . . . . . . . . . . . . .  6
   8.  IANA Considerations  . . . . . . . . . . . . . . . . . . . . .  7
   9.  Security Considerations  . . . . . . . . . . . . . . . . . . .  7
   10. References . . . . . . . . . . . . . . . . . . . . . . . . . .  7
     10.1. Normative References . . . . . . . . . . . . . . . . . . .  7
     10.2. Informative References . . . . . . . . . . . . . . . . . .  7
   Appendix A.  Scenarios . . . . . . . . . . . . . . . . . . . . . .  8
     A.1.  Basic Case . . . . . . . . . . . . . . . . . . . . . . . .  9
     A.2.  Data delay ~1 RTT  . . . . . . . . . . . . . . . . . . . . 10
     A.3.  Data reordering  . . . . . . . . . . . . . . . . . . . . . 11
   Author's Address . . . . . . . . . . . . . . . . . . . . . . . . . 11































Scheffenegger             Expires May 19, 2011                  [Page 2]


Internet-Draft             SACK loss recovery              November 2010


1.  Introduction

   Selective Acknowledgement (SACK) is widely used to identify exactly
   which TCP segment was lost and only send these missing segments
   during a recovery episode.  This helps improve the effectiveness of
   loss recovery and aligns with the principle of packet conservation.
   When no SACK information is available, TCP senders typically revert
   to the [RFC3782] NewReno fast retransmission / fast recovery
   retransmission algorithm.  As ultima ratio, the method of last
   resort, retransmission timeouts (RTO) are used to perform loss
   recovery.

   When multiple segments of a window are lost, including one or more
   segments directly prior to the end-of-stream, TCP sessions making use
   of [RFC3517] SACK suffer worse loss recovery performance than TCP
   session utilizing [RFC3782] NewReno.  When this happens, TCP SACK has
   to revert to retransmission timeout (RTO) for loss recovery.

   An algorithm is described that allows the complete and timely
   recovery at the end-of-stream.  The aim of this algorithm is to
   address one corner case of TCP SACK.  The timeliness of recovery for
   TCP SACK is improved to that of TCP NewReno.  Overall, this minor
   change will minimize the prevalence of RTOs.


2.  Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].





















Scheffenegger             Expires May 19, 2011                  [Page 3]


Internet-Draft             SACK loss recovery              November 2010


3.  Overview

   TCP SACK Loss Recovery [RFC3517] was designed to reduce the number of
   unnecessary retransmissions to close to zero and also to recover from
   multiple segment loss within a single window without reverting to a
   retransmission timeout.

   In addition, [RFC2018] specifically stipulated up to which point a
   SACK enabled sender may promote segments to become elegible for
   retransmission under the SACK scheme.  This heuristic works very well
   during bulk transfers, where the sender always has additional data to
   transmit.  Close to the end of a stream, when there is no more data
   in the socket to send, still outstanding and never acknowledged
   segments can not become elegible for retransmission.

   When this happens, TCP SACK performance degrades and becomes worse
   than the performance of TCP NewReno [RFC3782].  TCP NewReno can
   recover such a set of loss events without reverting to RTO loss
   recovery.

   The relevance of this specific aspect may seem unimportant at first
   glance.  When TCP is used for boundary synchronized transactions,
   where applications regularly stall transmitting data, end-of-stream
   performance can dominate the transfer.  Such streams are very
   frequently application limited during their existance (see definition
   in [RFC5827]), and the performance penalty of TCP SACK often requires
   the use of TCP NewReno despite it having worse overall network
   efficiency.


4.  Definitions

   The reader is expected to be familiar with the definitions given in
   [RFC5827], [RFC5681], [RFC3517] and [RFC2018].

   SND.FACK (forward acknowledgment) is used to describe the highest
   sequence number that has been SACKed by the receiver and subsequently
   seen by the sender.  The full definition can be found in [MM96a] and
   [MM96b].

   End-of-stream is used similar to the definition of small congestion
   windows in [RFC5827], with the exception of small congestion windows
   due to TCP congestion control.  End-of-stream indicates that the TCP
   sender has no additional unsent data in the sender socket, or may
   wait for enough data to accumulate before sending (Nagle's Algorithm
   [RFC0896]).





Scheffenegger             Expires May 19, 2011                  [Page 4]


Internet-Draft             SACK loss recovery              November 2010


5.  Algorithm

   The key observation is that when the receiver sends out a cumulative
   ACK with no SACK entries, all data delivered to the receiver is fully
   continguous but some segments are potentially lost.  In NewReno loss
   recovery, any cumulative ACK below "recover" triggers a single
   retransmission regardless if NewReno is at end-of-stream or in
   continous transfer.

   TCP SACK already performs at least as good as TCP NewReno, as long as
   the sender can continue to inject new data into the network.  The
   modification outlined below ensures, that TCP SACK can perform as
   good as NewReno under a wider range of circumstances.

   This algorithm is only applicable when the TCP sender has SACK
   enabled for the TCP connection, and also maintains a variable
   SND.FACK.

   A.  A TCP sender SHOULD NOT exit loss recovery if it receives a
       cumulative ACK for a sequence number greater than RecoveryPoint
       while it is at end-of-stream.  Any necessary congestion window
       adjustments SHALL be performed as necessary.

   B.  A TCP sender using this algorithm MUST perform the following
       steps upon the receipt of a cumulative ACK containing no SACK
       information, while it is in loss recovery.

       1.  Process ACK information per the loss recovery algorithm
           outlined in [RFC3517].

       2.  If the ACK contains no SACK information, cumulatively
           acknowledges all data up to SND.FACK (SND.UNA == SND.FACK),
           some data is still outstanding (SND.UNA < SND.MAX), the TCP
           sender may send additional data (cwnd - Pipe >= 1 SMSS), and
           the TCP sender has no additional data to send beyond SND.MAX,
           the TCP sender SHOULD transmit one segment.

   In order to achive timely recovery the retransmission timer MUST NOT
   be reset when this algorithm performs a retransmission.  This is in
   strict compliance with [RFC0793].


6.  Discussion

   This algorithm does not deviate from current implementation of SACK
   loss recovery for bulk transfers.  However, at the end-of-stream,
   when there is no data to advance SND.MAX, this heuristic allows the
   recovery of segments similar to NewReno loss recovery.  If the loss



Scheffenegger             Expires May 19, 2011                  [Page 5]


Internet-Draft             SACK loss recovery              November 2010


   occurs during times where cwnd is very small, or when the ACK clock
   fails, this approach still falls back to RTO loss recovery.

   For the case of only a few (2-3) segments lost in the last window
   before the end-of-stream, which this algorithm addresses, no spurious
   retransmissions are performed unless reordering delay above 1 RTT
   occurs, any a cumulative ACK is received by the sender in the
   meantime.  This property of the outlined algorithm is identical to
   that of TCP NewReno.

   The aspects of packet conservation, timely loss recovery and
   avoidance of retransmission timeouts have lead to allowing only a
   single segment to be recovered per RTT.

6.1.  Reordering

   If the last segment(s) at the end-of-stream are not lost, but delays,
   three different cases may result:

   If RTT > RTO(min), and reordering delay >= RTT:  No change in the
       sender behavior, all segments may be retransmitted spuriously.
       Without this algorithm due to RTO, with this algorithm the
       retransmitted segments may be clocked out by ACKs.  Slow-start
       may be posponed somewhat reliving acute network congestion
       slightly.

   If RTT < RTO(min), and reorder delay between RTT and RTO(min):  Some
       spurious retransmits can happen, but retransmissions will again
       occur at most 1 segment per RTT.  A premature, spurious RTO may
       be avoided.

   If reordering delay < RTT:  The TCP sender will not see a cumulative
       ACK without SACK enties, thus SND.UNA will remain lower than
       SND.FACK.  The TCP sender behavior is therefore unchanged.


7.  Acknowledgements

   The author would like to thank Matt Mathis for the insightful
   discussions about SACK and it's intended behavior and the spirit
   driving the design of SACK.

   Furthermore, valuable feedback was received from Ethan Blanton,
   Yoshifumi Nishida and John Heffner.

   Dragana Damjanovic was very helpful in reviewing the text.





Scheffenegger             Expires May 19, 2011                  [Page 6]


Internet-Draft             SACK loss recovery              November 2010


8.  IANA Considerations

   This memo includes no request to IANA.


9.  Security Considerations

   The algorithm presented in this paper shares security considerations
   with [RFC2018] and [RFC3517].


10.  References

10.1.  Normative References

   [RFC1323]  Jacobson, V., Braden, B., and D. Borman, "TCP Extensions
              for High Performance", RFC 1323, May 1992.

   [RFC2018]  Mathis, M., Mahdavi, J., Floyd, S., and A. Romanow, "TCP
              Selective Acknowledgment Options", RFC 2018, October 1996.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC3517]  Blanton, E., Allman, M., Fall, K., and L. Wang, "A
              Conservative Selective Acknowledgment (SACK)-based Loss
              Recovery Algorithm for TCP", RFC 3517, April 2003.

   [RFC3782]  Floyd, S., Henderson, T., and A. Gurtov, "The NewReno
              Modification to TCP's Fast Recovery Algorithm", RFC 3782,
              April 2004.

   [RFC5681]  Allman, M., Paxson, V., and E. Blanton, "TCP Congestion
              Control", RFC 5681, September 2009.

   [RFC5827]  Allman, M., Avrachenkov, K., Ayesta, U., Blanton, J., and
              P. Hurtig, "Early Retransmit for TCP and Stream Control
              Transmission Protocol (SCTP)", RFC 5827, May 2010.

10.2.  Informative References

   [I-D.blanton-tcpm-3517bis]
              Blanton, E., Allman, M., Jarvinen, I., and M. Kojo, "A
              Conservative Selective Acknowledgment (SACK)-based Loss
              Recovery Algorithm for TCP", draft-blanton-tcpm-3517bis-00
              (work in progress), October 2010.

   [I-D.henderson-tcpm-rfc3782-bis]



Scheffenegger             Expires May 19, 2011                  [Page 7]


Internet-Draft             SACK loss recovery              November 2010


              Floyd, S., Henderson, T., and A. Gurtov, "The NewReno
              Modification to TCP's Fast Recovery Algorithm",
              draft-henderson-tcpm-rfc3782-bis-01 (work in progress),
              October 2010.

   [I-D.ietf-tcpm-sack-recovery-entry]
              Jarvinen, I. and M. Kojo, "Using TCP Selective
              Acknowledgement (SACK) Information to Determine Duplicate
              Acknowledgements for Loss Recovery Initiation",
              draft-ietf-tcpm-sack-recovery-entry-01 (work in progress),
              March 2010.

   [LRSF]     Hurtig, P., Garcia, J., and A. Brunstrom, "Loss Recovery
              in Short TCP/SCTP Flows", Dec 2006, <Karlstad University
              Studies 2006:71>.

   [MM96a]    Mathis, M. and J. Mahdavi, "Forward Acknowledgment:
              Refining TCP Congestion Control", Aug 1996, <Proceedings
              of SIGCOMM'96>.

   [MM96b]    Mathis, M. and J. Mahdavi, "TCP Rate-Halving with Bounding
              Parameters", Sep 2004,
              <http://www.psc.edu/networking/papers/FACKnotes/current>.

   [RFC0793]  Postel, J., "Transmission Control Protocol", STD 7,
              RFC 793, September 1981.

   [RFC0896]  Nagle, J., "Congestion control in IP/TCP internetworks",
              RFC 896, January 1984.

   [SimRTO]   Guan, Y., Van den Broeck, B., Potemans, J., Theunis, J.,
              Li, D., Van Lil, E., and A. Van de Capelle, "Simulation
              Study of TCP Eifel Algorithms", 2005, <http://
              www.esat.kuleuven.be/telemic/networking/
              opnetwork05_guan.pdf>.

   [TCPLat]   Cardwell, N., Savage, S., and T. Anderson, "Modeling TCP
              Latency", Mar 2000, <Proceedings IEEE INFOCOM>.


Appendix A.  Scenarios

   For clarity, each segment is denoted only via a single number.  Note
   that the ACKs are also given with the segement they ack, not the next
   sequence number.  For example S1 may span sequence numbers 1000-1999,
   while the acknowledgement A1 carries the sequence number 2000.  If an
   acknowledgement also carries SACK information, the SACK entries are
   listed after a colon.  A hyphen denotes which segments a single SACK



Scheffenegger             Expires May 19, 2011                  [Page 8]


Internet-Draft             SACK loss recovery              November 2010


   entry spans.  For simplicity, all segments are SMSS sized.

A.1.  Basic Case

   In this scenario, the sender has no more data to send past S7.
   Reordering of data segments or ACKs and ACK losses are are absent
   from this scenario.


                          ACK    TX  RX    ACK
                          Rcvd   Seg Seg   Sent

                          A00
                                 S1  S1
                                 S2  (dropped)
                                           A1
                          A0
                                 S3  S3
                                 S4  S4    A1,3
                                           A1,3-4
                          A1
                                 S5  S5
                                 S6  (dropped)
                                           A1,3-5
                          A1,3
                                 S7  (dropped)
                                 ---
                          A1,3-4
                                 ---
                          A1,3-5
                                 S2  S2
                                           A5
                          A5
                                 S6  S6
                                           A6
                          A6
                                 S7  S7
                                           A7
                          A7

                        end-of-stream loss recovery










Scheffenegger             Expires May 19, 2011                  [Page 9]


Internet-Draft             SACK loss recovery              November 2010


A.2.  Data delay ~1 RTT

   In this scenario, segments S6 and S7 are not dropped, but delayed by
   about 1 RTT - while RTT is smaller then the minimum allowed
   retransmission timeout threshold RTO(min).  Segments that are delayed
   by less than 1 RTT are not retransmitted.  Segments delayed more than
   1 RTT are either retransmitted by this algorithm, or by RTO loss
   recovery.

                          ACK    TX  RX    ACK
                          Rcvd   Seg Seg   Sent

                          A00
                                 S1  S1
                                 S2  (dropped)
                                           A1
                          A0
                                 S3  S3
                                 S4  S4    A1,3
                                           A1,3-4
                          A1
                                 S5  S5
                                 S6  (delayed)
                                           A1,3-5
                          A1,3
                                 S7  (delayed)
                                 ---
                          A1,3-4
                                 ---
                          A1,3-5     S6
                                 S2  S2    A1,3-6
                                           A6
                          A1,3-6     S7
                                 ---       A7
                          A6
                                 S7  S7
                                           A7
                          A7

                          A7

                     end-of-stream segment delay < RTT









Scheffenegger             Expires May 19, 2011                 [Page 10]


Internet-Draft             SACK loss recovery              November 2010


A.3.  Data reordering

   In this case, the segments S6 and S7 are delivered out of order.
   This is a normal SACK recovery event.

                         ACK    TX  RX    ACK
                         Rcvd   Seg Seg   Sent

                         A00
                                S1  S1
                                S2  (dropped)
                                          A1
                         A0
                                S3  S3
                                S4  S4    A1,3
                                          A1,3-4
                         A1
                                S5  S5
                                S6  (reordered)
                                          A1,3-5
                         A1,3
                                S7  S7
                                ---       A1,3-5,7
                         A1,3-4
                                --- S6
                                          A1,3-7
                         A1,3-5
                                S2  S2
                                          A7
                         A1,3-5,7
                                S6
                                          A7
                         A1,3-7
                                ---

                         A7

                         A7

                    end-of-stream segment reorder < RTT











Scheffenegger             Expires May 19, 2011                 [Page 11]


Internet-Draft             SACK loss recovery              November 2010


Author's Address

   Richard Scheffenegger
   NetApp, Inc.
   Am Euro Platz 2
   Vienna,   1120
   Austria

   Phone: +43 1 3676811 3146
   Email: rs@netapp.com









































Scheffenegger             Expires May 19, 2011                 [Page 12]