QUIC WG                                                    E. Stephan
Internet Draft                                               M. Cayla
Intended status: Informational                               A. Braud
Expires: July 2020                                           F. Fieau
                                                          A. Ferrieux
                                                               Orange
                                                             M. Ihlar
                                                             Ericsson
                                                       January 9, 2020



                     QUIC Interdomain Troubleshooting
           draft-stephan-quic-interdomain-troubleshooting-04.txt


Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html

   This Internet-Draft will expire on July 9, 2020.

Copyright Notice

   Copyright (c) 2020 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of




Stephan                 Expires July 9, 2020                  [Page 1]


Internet-Draft    QUIC Interdomain Troubleshooting        January 2020


   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.

Abstract

   The on-path network performance measurement methods currently
   deployed contribute to the ossification of the Internet because they
   are expensive to deploy and to maintain. This draft motivates
   exposing QUIC header fields for on-path network measurements, and
   specifying them in the QUIC core protocol, so that on-path network
   performance measurements do not ossify the IP stack in the future.

Table of Contents


   1. Introduction
   2. Conventions used in this document
   3. Interdomain UX troubleshooting
   4. Reference of Network Performance
   5. Explicit measurement signals
   5.1. Spinbit, measuring the delay
   5.2. lossbits, measuring packet losses
   6. QUIC Fallback
   6.1. Flapping
   7. Versioning and Implementations
   8. Tools
   8.1. Spindump
   9. Security Considerations
   10. IANA Considerations
   11. Discussions
   11.1. Fallback
   11.2. On-path Measurement
   12. References
   12.1. Normative References
   12.2. Informative References
   13. Acknowledgments

1. Introduction

   The IP layer does not carry the information needed to measure the
   delay and packet losses of segments of a path. Network performance
   is currently measured at points of presence along the path
   [SPATIAL], [COMPO] using fields of the upper layers: TCP transport
   layer, RTP application layer...

   The evolution of the Internet stack toward end-to-end integrity
   protection is unavoidable [IABSEC]. This document presents the
   benefits of preserving the same on-path network performance
   measurement capabilities in the evolution of TCP (TCP/TLS,
   TCPinc...) and UDP (QUIC/UDP...) currently specified at the IETF.

   On-path network performance measurement methods currently deployed
   contribute to the ossification of the Internet because they are
   expensive to deploy and complex to maintain. This is due to the use
   of protocol fields not primarily designed for this purpose. This
   draft motivates exposing, in the QUIC core protocol, the fields used
   for on-path measurement of delay [SPINBIT] and of packet losses
   [LOSSBITS], so that network performance measurements do not ossify
   the IP stack in the future.

   The memo recalls the complexity of interdomain UX troubleshooting
   [FALLBACK] introduced by QUIC deployment. It then describes
   operational concerns about QUIC fallbacks and discusses their
   potential impact on security. Finally, it discusses the benefits of
   durably exposing the fields needed for measuring packet delay and
   losses.

2. Conventions used in this document

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119].

3. Interdomain UX troubleshooting

   Fast troubleshooting of network performance is mandatory to maintain
   a high Quality of Experience (QoE) for end users.

   Troubleshooting is the act of identifying the origin of a problem. A
   major case is the localization of problems affecting a large number
   of customers. This case becomes critical when the problem appears
   suddenly and affects a noticeable part of the traffic between the
   entity that connects customers (ISP) and the entity that provides
   the data (APP).

   It becomes critical because the network operation center (NOC) teams
   of the two entities are expected to immediately identify the causes
   in order to restore UX as quickly as possible. Each team checks
   whether the point of failure is inside or outside its own domain. A
   team that locates the point of failure inside its domain
   investigates its own chain of components (network, routers, reverse
   proxies...) and quickly fixes the issue.

   It becomes extremely critical when an entity locates the point of
   failure outside of its own domain. In this case the time needed to
   fix the problem is much longer and unpredictable, because it depends
   on other entities on the path performing the same actions on their
   own segments.

   There are many cases of troubleshooting. A typical example starts
   with a report to the ISP that its end users are experiencing a
   significant decrease of QoE when using an Internet application.
   Typical points of failure are line card memory errors or overloaded
   routers located somewhere on the path, either in the ISP domain or
   outside.

   The ISP NOC has to localize the point of failure as soon as
   possible. Currently it proceeds by dichotomy (is the point of
   failure inside the ISP domain or outside?) using passive monitoring
   of packet loss and congestion.

   The parameters in use are described below.

   Packet loss downstream (and vice versa for upstream):

   o Measuring packet loss before the measurement point requires the
      TCP sequence number;

   o Measuring packet loss after the measurement point requires TCP
      ACK and SACK.

   Congestion:

   o Congestion also manifests as increased delays in queues, located
      by measuring half-RTTs upstream and downstream from the
      measurement point;

   o The analysis is based on TCP SEQ/ACK/Timestamp correlation.
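
   The correlation above can be sketched as follows. This is a minimal
   illustration, not the method of any deployed tool: the Segment
   record, the function names and the sample trace are assumptions,
   sequence number wrap-around, SACK and TCP timestamps are ignored,
   and reordering is counted as loss.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set


@dataclass
class Segment:
    """One TCP segment seen at the measurement point (hypothetical record)."""
    time: float                # capture timestamp, in seconds
    downstream: bool           # True when travelling towards the end user
    seq: int = 0               # first sequence number of the payload
    length: int = 0            # payload length in bytes
    ack: Optional[int] = None  # cumulative ACK number, if present


def downstream_half_rtts(trace: List[Segment]) -> List[float]:
    """Delay between a downstream data segment and the upstream ACK for it."""
    pending: Dict[int, float] = {}  # expected ACK number -> time data was seen
    seen: Set[int] = set()
    samples: List[float] = []
    for seg in trace:
        if seg.downstream and seg.length > 0:
            end = seg.seq + seg.length
            if end in seen:
                pending.pop(end, None)  # retransmission: timing is ambiguous
            else:
                seen.add(end)
                pending[end] = seg.time
        elif not seg.downstream and seg.ack is not None:
            sent = pending.pop(seg.ack, None)
            if sent is not None:
                samples.append(seg.time - sent)
            # Older data is covered by this cumulative ACK; stop timing it.
            for end in [e for e in pending if e < seg.ack]:
                del pending[end]
    return samples


def gap_bytes_before_point(trace: List[Segment]) -> int:
    """Bytes missing from the downstream sequence space on arrival."""
    highest_end: Optional[int] = None
    missing = 0
    for seg in trace:
        if not seg.downstream or seg.length == 0:
            continue
        if highest_end is not None and seg.seq > highest_end:
            missing += seg.seq - highest_end  # hole: lost or reordered earlier
        end = seg.seq + seg.length
        highest_end = end if highest_end is None else max(highest_end, end)
    return missing


# Hypothetical trace: one acknowledged segment, then a 100-byte hole.
trace = [
    Segment(0.000, True, seq=1000, length=100),
    Segment(0.020, False, ack=1100),
    Segment(0.030, True, seq=1200, length=100),
]
print(downstream_half_rtts(trace), gap_bytes_before_point(trace))
```

   Run at a measurement point inside the ISP domain, the first
   function estimates the delay on the segment between that point and
   the end user, and the second counts the bytes lost on the segment
   before that point.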

4. Reference of Network Performance

   A reference of the real performance of the network is not always
   provided by the counters of network equipment. Counters may not be
   implemented, their values are not always stable, and their values
   can be compromised by software bugs or equipment congestion. In
   addition, to face the increasing complexity of network architectures
   brought by the evolution of access networks, security and hosting
   infrastructures, NOCs need reliable network performance measurement
   in near real time.

   In practice, these measurement tools shall be able to monitor
   numerous points and interfaces within the network to provide near
   real time network performance indicators that take into account the
   global network state.

   These measurement systems also make it possible to detect issues or
   unexplained behaviors on equipment, links and peers. For instance,
   in mobile networks, operators shall be able to identify in real time
   the bottleneck links responsible for customer experience degradation
   and take the necessary actions to avoid a snowball effect.

   Charging of data usage is another key feature for Mobile Network
   Operators, for which accurate per-flow information is collected. It
   is important to reconcile the amount of charged data with the amount
   of data seen by the network (cf. the discussion on goodput in
   Section 7). This allows detecting fraud attempts or dysfunctions
   within the network. In case of a significant gap, operators must be
   able to react quickly to isolate this traffic. Additionally,
   charging may require differentiating goodput from throughput.

   Continuous network performance monitoring requires packet loss and
   delay measurements to allow operators to manage their networks
   properly and to provide them with a reference of the performance of
   their network for interdomain troubleshooting.

5. Explicit measurement signals

5.1. Spinbit, measuring the delay

   An alternative to exposing raw transport protocol data to the path
   is to have signals explicitly designed to facilitate on-path
   measurements. To facilitate troubleshooting, such a signal should
   enable passive measurement of RTT, packet loss and other congestion
   indicators such as ECN. An example of how a transport protocol could
   expose measurement information to the path would be three flag bits
   available in a public header. The first bit would be used for
   passive RTT measurement, the second bit would indicate whether the
   packet contains retransmitted data, and the third bit would indicate
   whether the packet contains an ECN echo. An explicit signal is
   unambiguous and simpler for a middlebox to interpret than parsed
   transport headers. Furthermore, it could be invariant between
   revisions of the transport protocol that exposes it, which minimizes
   the risk of network ossification.

   At this stage, the WG has specified a signal to measure the round
   trip delay [SPINBIT].
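
   As an illustration of how an on-path observer could use such a
   signal, the sketch below estimates the RTT from the spin bit. It
   assumes that the observer sees one direction of a single connection,
   in order, and that the bit changes value once per round trip, so
   that the time between two consecutive changes is one RTT sample. The
   function name and the (timestamp, bit) input format are
   hypothetical, and the filtering of lost or reordered packets is left
   out.

```python
from typing import Iterable, List, Optional, Tuple


def spin_rtt_samples(packets: Iterable[Tuple[float, int]]) -> List[float]:
    """RTT samples from (timestamp, spin bit) pairs of one flow direction."""
    samples: List[float] = []
    last_spin: Optional[int] = None
    last_edge: Optional[float] = None  # time of the previous bit change
    for ts, spin in packets:
        if last_spin is not None and spin != last_spin:
            if last_edge is not None:
                samples.append(ts - last_edge)  # one full spin period
            last_edge = ts
        last_spin = spin
    return samples


# Hypothetical capture: the bit changes at 0.05 s, 0.10 s and 0.16 s.
observed = [(0.00, 0), (0.01, 0), (0.05, 1), (0.06, 1), (0.10, 0), (0.16, 1)]
print(spin_rtt_samples(observed))
```

   The first change only starts the clock, so three changes yield two
   samples. The observer needs no other transport field, which is what
   allows the signal to stay stable across protocol versions.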

5.2. lossbits, measuring packet losses

   QUIC packets carry a packet number, but it is encrypted.
   Consequently, an additional signal such as the one described in
   [LOSSBITS] is needed to measure packet losses on the path.
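
   The general idea of such a signal can be sketched with an
   alternating marking bit: the sender flips one exposed bit every
   block_len packets, and an observer counts how many packets of each
   block it actually sees. This is not the exact encoding of
   [LOSSBITS]; the function name, the input format and the block length
   of 4 are assumptions chosen for readability, and reordering or the
   loss of a whole block is not handled.

```python
from typing import Iterable, List, Optional


def losses_per_block(q_bits: Iterable[int], block_len: int) -> List[int]:
    """Packets missing from each complete block of an alternating signal."""
    runs: List[int] = []
    current: Optional[int] = None
    count = 0
    for q in q_bits:
        if q == current:
            count += 1
        else:
            if current is not None:
                runs.append(count)  # the previous block just ended
            current, count = q, 1
    # The first run may have started before the capture and the last run
    # (never appended above) may still be in progress, so both are skipped.
    return [max(block_len - n, 0) for n in runs[1:]]


# Hypothetical capture with one packet missing from the second full block.
observed = [0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0]
print(losses_per_block(observed, block_len=4))  # prints [0, 1, 0]
```

   The count only covers losses between the sender and the observer.
   Comparing the counts of two observers on the same path would locate
   the loss on the segment between them.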

   [TODO: add other proposals]

6. QUIC Fallback

   Fallback is necessary to address cases where a QUIC connection
   establishment fails [QUICAPP] (a device on the path blocks UDP, the
   stack blocks 0-RTT...).

   Fallback may additionally occur when an active QUIC connection drops
   and tries to reconnect. As an example, the steps of the fallback
   could be:

   o The QUIC connection drops accidentally;

   o The UA falls back and connects to the origin server over TCP/TLS;

   o The UA receives from the origin server an indication for an
      alternate service [ALTSVC] supporting QUIC;

   o The UA gracefully ends the TCP connection;

   o The UA tries to establish a QUIC connection to the server and
      port described in the alternate service indication;

   o The QUIC 1-RTT connection is established.
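
   The steps above can be replayed with the following sketch. It is
   not taken from any user agent: the two callables stand in for the
   real connection attempts, and the function name, the log strings
   and the example host names are hypothetical.

```python
from typing import Callable, List, Optional


def reconnect_after_drop(
    connect_quic: Callable[[str], bool],
    connect_tcp_tls: Callable[[str], Optional[str]],
    origin: str,
) -> List[str]:
    """Replay the fallback steps and return the log of what happened.

    connect_tcp_tls returns the alternate service advertised by the
    origin server, or None when no QUIC alternative is advertised.
    """
    log = ["quic-dropped"]
    alternate = connect_tcp_tls(origin)       # fall back to TCP/TLS
    log.append("tcp-tls:" + origin)
    if alternate is None:
        return log                            # stay on TCP/TLS
    log.append("tcp-closed")                  # graceful end of the TCP connection
    if connect_quic(alternate):
        log.append("quic:" + alternate)       # QUIC is established again
    else:
        log.append("quic-failed:" + alternate)
    return log


print(reconnect_after_drop(
    connect_quic=lambda authority: True,
    connect_tcp_tls=lambda authority: "alt.example:443",
    origin="origin.example",
))
```

   When the last attempt fails, the UA is back at the first step,
   which is the starting point of the flapping discussed in the next
   section.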



6.1. Flapping

   A fallback may suddenly occur when one or more elements (links,
   nodes, reverse proxy, switch, server ...) of the path fail or are
   reconfigured.

   There are cases where the fallback loops and triggers flapping
   between the origin server and the alternate server. As an example,
   this might happen when an alternate service indication is outdated
   and points to a server which does not support QUIC anymore.
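
   A simple way for an endpoint or a monitoring system to recognize
   such a loop is to count the changes of transport within a sliding
   time window. The sketch below is only an illustration; the event
   format, the window length and the threshold are assumptions, not
   values defined by any specification.

```python
from collections import deque
from typing import Deque, Iterable, Optional, Tuple


def is_flapping(
    events: Iterable[Tuple[float, str]],
    window: float,
    max_switches: int,
) -> bool:
    """True if the transport changed more than max_switches times in window.

    events are (timestamp, transport) pairs in time order for one
    client and server pair, for example (12.0, "tcp").
    """
    switches: Deque[float] = deque()  # times of the recent transport changes
    previous: Optional[str] = None
    for ts, transport in events:
        if previous is not None and transport != previous:
            switches.append(ts)
            while ts - switches[0] > window:
                switches.popleft()    # forget changes older than the window
            if len(switches) > max_switches:
                return True
        previous = transport
    return False


# Hypothetical history: four changes of transport within 15 seconds.
history = [(0, "quic"), (5, "tcp"), (8, "quic"), (12, "tcp"), (15, "quic")]
print(is_flapping(history, window=30, max_switches=3))  # prints True
```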

   This becomes critical for UX when numerous fallbacks occur suddenly
   on the same path between a set of customers of an ISP and another
   entity which provides the application data. The time to troubleshoot
   can be very long. The origin server and the alternate servers can be
   hosted by different entities.

   This behavior should be specified in the QUIC core specifications.

7. Versioning and Implementations

   Versioning is an important part of the QUIC protocol framework
   [QUICCORE]. Multiple versions of the protocol are expected to be
   deployed and used concurrently. In order to encourage networks to
   rapidly support the QUIC protocol and to support any versions of
   QUIC in the future, the exposure of the fields for on-path network
   performance measurement must not depend on the version.

   There might be numerous implementations of the QUIC protocol in the
   future. A significant share of them will implement congestion
   control at the application level. Some will behave unfairly, for
   example with an abnormal retransmission rate, which affects the fair
   sharing of bandwidth among the customers of the network.
   Consequently, the network needs to be able to detect connections
   which have abnormal throughput/goodput.
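
   As a sketch of such a detection, the code below flags connections
   whose share of goodput in the observed throughput falls under a
   threshold. The per-connection counters, the function name and the
   threshold are assumptions. How an observer obtains the count of
   retransmitted bytes for QUIC is precisely what requires an exposed
   signal, and is not addressed here.

```python
from typing import Dict, List, Tuple


def abnormal_connections(
    counters: Dict[str, Tuple[int, int]],
    min_ratio: float,
) -> List[str]:
    """Connections whose goodput/throughput ratio is below min_ratio.

    counters maps a connection identifier to a pair
    (bytes seen on the wire, bytes identified as retransmissions).
    """
    flagged: List[str] = []
    for conn, (wire_bytes, retransmitted) in counters.items():
        if wire_bytes == 0:
            continue  # nothing observed, no ratio to compute
        goodput_ratio = (wire_bytes - retransmitted) / wire_bytes
        if goodput_ratio < min_ratio:
            flagged.append(conn)
    return sorted(flagged)


# Hypothetical counters for three connections.
seen = {"a": (1000, 10), "b": (1000, 400), "c": (0, 0)}
print(abnormal_connections(seen, min_ratio=0.9))  # prints ['b']
```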

8. Tools

8.1. Spindump

   The "Spindump" tool [SPINDUMP] is a Unix command-line utility that
   can be used for latency monitoring in traffic passing through an
   interface. The tool performs passive, in-network monitoring. It is
   not a tool to monitor traffic content or metadata of individual
   connections, and indeed that is not possible in the Internet as most
   connections are encrypted.

   The tool looks at the characteristics of transport protocols, such
   as the QUIC Spin Bit, and attempts to derive information about
   round-trip times for individual connections or for the aggregate or
   average values. The tool supports TCP, QUIC, COAP, DNS, and ICMP
   traffic. There's also an easy way to anonymize connection
   information so that the resulting statistics cannot be used to infer
   anything about specific connections or users.

   Spindump can both generate and read JSON formatted measurement
   events. Measurements from several collection points can be
   aggregated at a central point. Having multiple Spindump instances
   along a path allows for further segmentation of measurements than
   having a single measurement point, which is useful for both inter-
   and intra-domain troubleshooting.

   The Spindump command-line utility is based on the Spindump library
   which can be integrated with other measurement or troubleshooting
   software.

9. Security Considerations

   The integrity of the parameters exposed for measuring on-path delay
   and losses can be end-to-end protected to increase the security of
   the connection.

   Flapping from QUIC to a fallback protocol might overload on-path
   devices and endpoints and consequently affect the stability of the
   connections and introduce weaknesses.

   Falling back from a transport protocol with encrypted headers to one
   with cleartext headers might open the door to new types of active
   attacks.

   It is not clear yet whether a network can distinguish numerous QUIC
   fallback flappings from an active attack:

   o What is the expected behavior from the network?

   o Will networks detect QUIC flapping as an active attack?

10. IANA Considerations

   This draft does not request any IANA actions.

11. Discussions

11.1. Fallback

   Troubleshooting QUIC traffic and its fallbacks requires measuring
   similar metrics. One suggestion is to use the integrity mechanism of
   the TCPinc WG [TCPINC] to protect and keep visible the fields used
   for on-path measurement.

   Fallback must be precisely specified in the core specification of
   QUIC [QUICCORE].

   To avoid unnecessary flapping, [QUICCORE] might clarify the usage of
   the advertisement of QUIC support in HTTP protocols [ALTSVC].

   [QUICMAN] should propose guidance for the management of QUIC
   fallback in a way that avoids flapping situations.



11.2. On-path Measurement

   QUIC is designed to carry traffic other than HTTP, such as DNS and
   Web. End-to-end encryption of the transport headers prevents the use
   of models [E-MODEL] and heuristics to estimate UX on a path segment.
   To maintain a high level of UX, QUIC should support the measurement
   of the delay and the losses of a segment of a source-to-destination
   path.

   On-path measurement techniques are currently ad hoc. Exposing the
   fields for on-path packet delay and loss measurement in the core
   specification of the QUIC protocol creates a stable network
   performance measurement framework. It will be a real incentive for
   networks to support QUIC rapidly and to support the numerous QUIC
   versions in the future. This will reduce the contribution of
   networks to the ossification of the IP stack in the future.





12. References

12.1. Normative References

   [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
             Requirement Levels", BCP 14, RFC 2119, March 1997.

   [QUICCORE]   https://tools.ietf.org/html/draft-ietf-quic-transport

   [FALLBACK]  https://github.com/quicwg/base-drafts/issues/166

   [IABSEC] https://www.iab.org/2014/11/14/iab-statement-on-internet-
             confidentiality/

   [SPINBIT] https://tools.ietf.org/html/draft-ietf-quic-transport-
             20#section-17.3.1

   [LOSSBITS] https://tools.ietf.org/html/draft-ferrieuxhamchaoui-quic-
             lossbits-02

12.2. Informative References

   [E-MODEL]    https://www.itu.int/rec/T-REC-G.107-199812-S/en

   [SPATIAL]    https://tools.ietf.org/html/rfc5644

   [COMPO]  https://tools.ietf.org/html/rfc6049

   [QUICAPP] https://tools.ietf.org/wg/quic/draft-ietf-quic-
             applicability/

   [QUICMAN] https://tools.ietf.org/wg/quic/draft-ietf-quic-
             manageability/

   [TCPINC]  https://tools.ietf.org/wg/tcpinc/

   [ALTSVC] https://tools.ietf.org/html/rfc7838

   [SPINDUMP] https://github.com/EricssonResearch/spindump

   [SBA5G] https://www.3gpp.org/ftp/Specs/archive/29_series/
             29.893/29893-120.zip


13. Acknowledgments

   This document was prepared using 2-Word-v2.0.template.dot.

Authors' Addresses

   Emile Stephan
   Orange
   2, avenue Pierre Marzin
   Lannion 22300
   France

   Email: emile.stephan@orange.com

   Mathilde Cayla
   Orange
   6, avenue Albert Durand
   Blagnac 31700
   France

   Email: mathilde.cayla@orange.com

   Arnaud Braud
   Orange
   2, avenue Pierre Marzin
   Lannion 22300
   France

   Email: arnaud.braud@orange.com

   Fred Fieau
   Orange
   40-48, avenue de la Republique
   Chatillon 92320
   France

   Email: frederic.fieau@orange.com

   Alex Ferrieux
   Orange
   2, avenue Pierre Marzin
   Lannion 22300
   France

   Email: alexandre.ferrieux@orange.com

   Marcus Ihlar
   Ericsson

   Email: marcus.ihlar@ericsson.com