BESS Working Group                                       A. Sajassi, Ed.
Internet-Draft                                                 G. Badoni
Intended status: Standards Track                                  D. Rao
Expires: September 10, 2020                                 P. Brissette
                                                                   Cisco
                                                                J. Drake
                                                                 Juniper
                                                              J. Rabadan
                                                                   Nokia
                                                           March 9, 2020


                   Fast Recovery for EVPN DF Election
                draft-ietf-bess-evpn-fast-df-recovery-01

Abstract

   The Ethernet Virtual Private Network (EVPN) solution [RFC7432]
   describes Designated Forwarder (DF) election procedures for
   multi-homed Ethernet Segments.  These procedures are enhanced in
   [RFC8584] by applying the Highest Random Weight (HRW) algorithm to
   DF election in order to avoid unnecessary DF status changes upon a
   failure.  This draft further improves the DF election procedures in
   [RFC8584] by providing two options for fast DF election upon
   recovery of the failed link or node associated with the multi-homed
   Ethernet Segment.  This fast DF election is independent of the
   number of EVIs associated with that Ethernet Segment and is
   performed via simple signaling between the recovered PE and each PE
   in the multi-homing group.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on September 10, 2020.






Sajassi, et al.        Expires September 10, 2020               [Page 1]


Internet-Draft     Fast Recovery for EVPN DF Election         March 2020


Copyright Notice

   Copyright (c) 2020 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction  . . . . . . . . . . . . . . . . . . . . . . . .   3
     1.1.  Terminology . . . . . . . . . . . . . . . . . . . . . . .   3
   2.  Challenges with Existing Solution . . . . . . . . . . . . . .   3
     2.1.  Overview of Proposed Solutions  . . . . . . . . . . . . .   5
   3.  DF Election Handshake Solution  . . . . . . . . . . . . . . .   5
     3.1.  Discovery . . . . . . . . . . . . . . . . . . . . . . . .   5
     3.2.  DF Candidates Determination . . . . . . . . . . . . . . .   6
     3.3.  DF Election Handshake . . . . . . . . . . . . . . . . . .   6
     3.4.  Node Insertion  . . . . . . . . . . . . . . . . . . . . .   7
     3.5.  BGP Encoding  . . . . . . . . . . . . . . . . . . . . . .   8
       3.5.1.  DF Election Handshake Request Route . . . . . . . . .   8
       3.5.2.  DF Election Handshake Response Route  . . . . . . . .   8
     3.6.  DF Handshake Scenarios  . . . . . . . . . . . . . . . . .  10
     3.7.  Backwards Compatibility . . . . . . . . . . . . . . . . .  12
   4.  DF Election Synchronization Solution  . . . . . . . . . . . .  13
     4.1.  Advantages  . . . . . . . . . . . . . . . . . . . . . . .  14
     4.2.  BGP Encoding  . . . . . . . . . . . . . . . . . . . . . .  14
     4.3.  Note on NTP-based synchronization . . . . . . . . . . . .  15
     4.4.  Synchronization Scenarios . . . . . . . . . . . . . . . .  15
     4.5.  Backwards Compatibility . . . . . . . . . . . . . . . . .  16
   5.  Interoperability  . . . . . . . . . . . . . . . . . . . . . .  17
   6.  Security Considerations . . . . . . . . . . . . . . . . . . .  17
   7.  IANA Considerations . . . . . . . . . . . . . . . . . . . . .  17
   8.  References  . . . . . . . . . . . . . . . . . . . . . . . . .  18
     8.1.  Normative References  . . . . . . . . . . . . . . . . . .  18
     8.2.  Informative References  . . . . . . . . . . . . . . . . .  18
   Appendix A.  Contributors . . . . . . . . . . . . . . . . . . . .  18
   Appendix B.  Acknowledgements . . . . . . . . . . . . . . . . . .  19
   Authors' Addresses  . . . . . . . . . . . . . . . . . . . . . . .  19

1.  Introduction

   The Ethernet Virtual Private Network (EVPN) solution [RFC7432] is
   becoming pervasive in data center (DC) applications for Network
   Virtualization Overlay (NVO) and DC interconnect (DCI) services, and
   in service provider (SP) applications for next-generation virtual
   private LAN services.

   [RFC7432] describes Designated Forwarder (DF) election procedures
   for multi-homed Ethernet Segments.  These procedures are enhanced in
   [RFC8584] by applying the Highest Random Weight (HRW) algorithm to
   DF election in order to avoid unnecessary DF status changes upon a
   link or node failure associated with the multi-homed Ethernet
   Segment.  This draft further improves the DF election procedures in
   [RFC8584] by providing two options for fast DF election upon
   recovery of the failed link or node associated with the multi-homed
   Ethernet Segment.  This DF election is independent of the number of
   EVIs associated with that Ethernet Segment and is performed via
   simple signaling between the recovered PE and each PE in the
   multi-homing group.  The draft presents two signaling options: the
   first is based on a bidirectional handshake procedure, whereas the
   second is based on a simple one-way signaling mechanism.

1.1.  Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].

   Provider Edge (PE):  A device that sits at the boundary of Provider
      and Customer networks and performs encap/decap of data from L2 to
      L3 and vice versa.

   Designated Forwarder (DF):  A PE that is currently forwarding
      (encapsulating/decapsulating) traffic for a given VLAN in and out
      of a site.

2.  Challenges with Existing Solution

   In EVPN technology, multiple PE devices have the ability to encap and
   decap data belonging to the same VLAN.  In certain situations, this
   may cause L2 duplicates and even loops if there is a momentary
   overlap of forwarding roles between two or more PE devices, leading
   to broadcast storms.

   EVPN [RFC7432] currently uses timer-based synchronization among PE
   devices in a redundancy group.  This can result in duplicates (and
   even loops) because of multiple DFs if the timer is too short, or in
   blackholing if the timer is too long.

   Site-of-origin Split Horizon filtering can prevent loops (but not
   duplicates).  However, if there are overlapping DFs in two different
   sites at the same time for the same VLAN, the site identifier will
   be different upon re-entry of the packet, so the split horizon
   check will fail, leading to L2 loops.

   The current state of the art [RFC8584] uses the well-known HRW
   (Highest Random Weight) algorithm to avoid reshuffling VLANs among
   PE devices in the redundancy group upon failure/recovery, thus
   limiting the impact of failure/recovery on VLANs that are not on the
   failed/recovered ports.  This eliminates loops/duplicates in failure
   scenarios.

   However, upon PE insertion or port bring-up, HRW cannot help because
   the DF role needs to be transferred to the newly inserted
   device/port while the old DF is still active.
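
   The limitation can be made concrete with a small sketch.  The Python
   below is illustrative only: it uses SHA-256 as a stand-in for the
   hash function specified in [RFC8584], and the names hrw_weight and
   elect_df are hypothetical.  It shows that when PE3 joins PE1 and
   PE2, every VLAN whose DF changes moves to PE3, which is exactly the
   transfer that has to happen while the old DF is still forwarding.

```python
import hashlib


def hrw_weight(pe: str, vlan: int) -> int:
    """Pseudo-random weight for a (PE, VLAN) pair.

    Assumption: SHA-256 is used purely for illustration; it is not the
    hash function defined for HRW DF election.
    """
    digest = hashlib.sha256(f"{pe}|{vlan}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def elect_df(candidates, vlan):
    """Return the candidate with the highest weight for this VLAN."""
    # The PE name breaks ties so that every PE picks the same winner.
    return max(candidates, key=lambda pe: (hrw_weight(pe, vlan), pe))


vlans = range(1, 101)
before = {v: elect_df(["PE1", "PE2"], v) for v in vlans}
after = {v: elect_df(["PE1", "PE2", "PE3"], v) for v in vlans}

# VLANs whose DF changed when PE3 was inserted.
moved = [v for v in vlans if before[v] != after[v]]

# HRW property: a changed VLAN can only have moved to the new PE.
assert all(after[v] == "PE3" for v in moved)
```

   No VLAN moves between PE1 and PE2, so only the handover toward the
   newly inserted PE needs to be coordinated.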

                                     +---------+
                  +-------------+    |         |
                  |             |    |         |
                / |    PE1      |----|         |   +-------------+
               /  |             |    |  MPLS/  |   |             |---H3
              /   +-------------+    |  VxLAN/ |   |     PE10    |
         CE1 -                       |  Cloud  |   |             |
              \   +-------------+    |         |---|             |
               \  |             |    |         |   +-------------+
                \ |     PE2     |----|         |
                  |             |    |         |
                  +-------------+    |         |
                                     +---------+


    Figure 1: CE1 multi-homed to PE1 and PE2.  Potential for duplicate
                                    DF.

   In Figure 1, when PE2 is inserted or booted up, PE1 will transfer
   the DF role of some VLANs to PE2 to achieve load balancing.
   However, because there is no handshake mechanism between PE1 and
   PE2, duplication of DF roles for a given VLAN is possible.
   Duplication of DF roles may eventually lead to L2 loops as well as
   duplication of traffic.

   Current state of EVPN art relies on a blackholing timer for
   transferring the DF role to the newly inserted device.  This can
   cause the following issues:

   * Loops/Duplicates if the timer value is too short

   * Prolonged Traffic Blackholing if the timer value is too long

2.1.  Overview of Proposed Solutions

   The first solution proposed deterministically eliminates loops/
   duplicates with a state machine approach.  The second proposal helps
   narrow the DF Election window defined in [RFC7432], intended to
   eliminate loops, based on common clock alignment.  Both proposals
   provide fast convergence upon PE/port insertion.

   Two signaling mechanisms between the newly inserted PE and remaining
   PEs are described.  The signaling is only possible once the newly
   inserted PE has reliably discovered the other PEs and vice versa.
   The first option is referred to as DF Election Handshake solution and
   is described in Section 3.  The second option is referred to as DF
   Election Synchronization Solution and is described in Section 4.

3.  DF Election Handshake Solution

   Due to HRW, only one handshake is needed per PE device, independent
   of EVI/VNI scale.  The solution is divided into three phases:

   Phase 1:  Discovery

   Phase 2:  DF Candidates Determination

   Phase 3:  DF Election Handshake

   Each phase is described in detail below.

3.1.  Discovery

   Each PE needs to have a consistent view of the network including the
   newly inserted PE.

   The newly inserted PE will advertise its Ethernet Segment route and
   start a flood/wait timer.  This timer should be large enough to
   guarantee the dissemination and receipt of this advertisement by
   previously inserted PEs.

   As the old DF is continuously forwarding traffic while the new PE is
   running this timer, this timer can be made as long as required
   without impacting traffic convergence.  The timer value can be the
   BGP session hold time in the worst case to ensure proper discovery
   but in most cases will be equivalent to [RFC7432]'s PEERING timer.

3.2.  DF Candidates Determination

   After the discovery timer has elapsed, each PE would have an imported
   list of the Ethernet Segment Routes from other PEs.  The resultant
   database will comprise all the DF candidates on a per-ES basis and
   will be used for DF election.  Each PE will independently run the
   selected DF algorithm - i.e., HRW algorithm (or Preference-based) for
   all VLANs in a given Ethernet Segment.  Since the discovery phase
   guarantees uniform network view between the participating devices,
   the VLAN distribution results based on HRW (or Preference-based) will
   be consistent.
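
   A minimal sketch of this phase, under stated assumptions: imported
   ES routes are modelled as (ESI, originator) tuples, SHA-256 stands
   in for the HRW hash of [RFC8584], and all function names are
   hypothetical.  It illustrates that two PEs that imported the same
   routes in a different order still derive the same set of VLANs to
   hand over to the newly inserted PE.

```python
import hashlib
from collections import defaultdict


def df_candidates(es_routes, esi):
    """Originators of all imported ES routes for one Ethernet Segment."""
    return sorted({orig for (route_esi, orig) in es_routes if route_esi == esi})


def hrw_weight(pe, vlan):
    # Assumption: illustrative hash, not the one used by real HRW election.
    digest = hashlib.sha256(f"{pe}|{vlan}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def elect_df(candidates, vlan):
    return max(candidates, key=lambda pe: (hrw_weight(pe, vlan), pe))


def handover_plan(candidates, new_pe, vlans):
    """VLANs each previously inserted PE must hand over to new_pe."""
    previous = [pe for pe in candidates if pe != new_pe]
    plan = defaultdict(list)
    for vlan in vlans:
        if elect_df(candidates, vlan) == new_pe:
            # The current DF is the winner among the previous candidates.
            plan[elect_df(previous, vlan)].append(vlan)
    return dict(plan)


ESI = "00:11:22:33:44:55:66:77:88:99"
# The same ES routes, imported in a different order on two PEs.
view_pe1 = [(ESI, "192.0.2.1"), (ESI, "192.0.2.2"), (ESI, "192.0.2.3")]
view_pe2 = list(reversed(view_pe1))

plan_pe1 = handover_plan(df_candidates(view_pe1, ESI), "192.0.2.3", range(1, 51))
plan_pe2 = handover_plan(df_candidates(view_pe2, ESI), "192.0.2.3", range(1, 51))
assert plan_pe1 == plan_pe2
```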

3.3.  DF Election Handshake

   The DF Election handshake will be accomplished in the following
   steps:

   - The newly inserted PE will send the DF-Request to previously
     inserted PEs with a new sequence number.

   - The previously inserted PE(s) will receive the DF-Request and
     validate it against their own discovery state and local DF
     Candidates results (e.g., Modulo, HRW or Preference-based).

   - The previously inserted PE(s) will program its hardware to block
     the VLANs that must be transferred to the newly inserted PE.

   - The previously inserted PE(s) will send DF-Response (with DF-ACK or
     DF-NACK flag) to the newly inserted PE with the same sequence
     number that was contained in the DF-Request.

   - The newly inserted PE will receive the DF-Response and validate it
     using the sequence number.  It will act on each received
     DF-Response message as it arrives and, for faster convergence,
     does not wait for responses from all previously inserted devices.
     Handshake transactions are per pair of peering PEs.

   - The DF-Response received at the newly inserted PE is interpreted
     as an indication from the previously inserted PE that it has
     relinquished the DF role on those VLANs for which the newly
     inserted PE should be DF.  In other words, the newly inserted PE
     will only take over as DF for a given VLAN/ISID if

     A.  it is the DF Candidates election winner, AND

     B.  it gets the DF-ACK from the previous DF.

   - Upon receiving DF-Response with DF-ACK, newly inserted PE assumes
     the DF responsibility and will program its hardware to unblock the
     VLANs it is assuming.

   - In case of Preference-based DF Election, the above procedure should
     only be followed if there is at least one previously inserted PE
     that signals DP=0 in its ES route (there is no need for handshake
     in case of non-revertive mode).
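
   The requester side of the steps above can be sketched as follows.
   This is an illustration under stated assumptions, not a definitive
   implementation: the class and field names are hypothetical, the
   handover plan is taken as given from the DF Candidates phase, and
   restarting after a DF-NACK is left to the caller.

```python
from dataclasses import dataclass, field


@dataclass
class DFResponse:
    sender: str        # previously inserted PE that answers
    destination: str   # requester the response is meant for
    seq: int           # sequence number echoed from the DF-Request
    ack: bool          # True for DF-ACK, False for DF-NACK


@dataclass
class NewPE:
    address: str
    plan: dict                      # previous DF -> VLANs to take over
    seq: int = 0
    unblocked: set = field(default_factory=set)

    def send_request(self) -> int:
        """Start (or restart) a handshake round with a new sequence number."""
        self.seq = (self.seq + 1) % 256   # Sequence Number is one octet
        return self.seq

    def on_response(self, rsp: DFResponse) -> bool:
        """Process one DF-Response; return True if VLANs were taken over."""
        if rsp.destination != self.address or rsp.seq != self.seq:
            return False                  # meant for another PE, or stale
        if not rsp.ack:
            return False                  # DF-NACK: nothing is unblocked
        # Take over only the VLANs relinquished by this particular peer,
        # without waiting for the other peers to answer.
        self.unblocked.update(self.plan.get(rsp.sender, []))
        return True


pe3 = NewPE("PE3", {"PE1": [10, 30], "PE2": [20]})
seq = pe3.send_request()
pe3.on_response(DFResponse("PE2", "PE3", seq, ack=True))
assert pe3.unblocked == {20}
```

   Each response unblocks only the VLANs handed over by its sender, so
   the number of handshakes grows with the number of peering PEs and
   not with the number of VLANs.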

   A handshake is not needed on a per VLAN/EVI basis but rather per
   pair of PEs in the redundancy group; i.e., if a new PE is added to
   an existing redundancy group of 3 PE devices, only 3 handshakes are
   needed.  This is because the devices are already in sync about which
   VLANs to give up or take over.

   At the end of these three phases, the VLAN DF role transfer will
   have happened in a deterministic way while ensuring minimum traffic
   loss.  Device recovery and device insertion scenarios are identical
   in terms of the handshaking procedure.  The next section describes
   the procedure details for device insertion.

3.4.  Node Insertion

   Consider the scenario where PE3 is inserted in the network, while PE1
   and PE2 are already in stable state.  PE3 will send/receive the
   following flags along with the EVPN Type 4 route:

   - DF-Request:  Upon completing the DF Election, PE3 will send a
     DF-Request with a new sequence number.  PE1 and PE2 will receive
     this message and respond with DF-Response DF-ACK or DF-NACK
     carrying the same sequence number that was generated by PE3.

   - DF-Response DF-ACK:  When PE3 receives DF-Response DF-ACK from PE1
     with the same sequence number as the DF-Request, it will take over
     the DF role for the appropriate VLANs that are being transferred
     from PE1.  When DF-Response DF-ACK from PE2 arrives, the rest of
     the VLANs to be transferred from PE2 to PE3 are then taken over by
     PE3.

   - DF-Response DF-NACK:  If PE3 receives DF-Response DF-NACK from at
     least one of PE1 or PE2, it will not take over the DF role and will
     start over (with a new sequence number).

   Consider the scenario where two nodes PE3 and PE4 are being inserted
   at the same time.  Both of them will send a DF-Request to PE1 and PE2
   at around the same time with possibly the same sequence number.  When
   PE1 and PE2 respond with DF-Response DF-ACK, it is important to
   signify exactly whom the response is meant for as it could be for
   either requester (PE3 or PE4).  To remove any ambiguity and false
   positives, the IP address of the requester MUST be included in the
   response message to specify who the response is meant for.

3.5.  BGP Encoding

   The EVPN NLRI comprises Route Type (1B), Length (1B) and Route Type
   specific variable encoding.  Here we propose the creation of two
   new EVPN route types:

   + TBD1 - DF Election Handshake Request Route

   + TBD2 - DF Election Handshake Response Route

3.5.1.  DF Election Handshake Request Route

   A DF Election Handshake Request Type NLRI consists of the following:

       +-+-+-+-+-+-+-+-+-+-+-+-+-+-++-+-+-+-+-+-++
       | RD (8 octets)                           |
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       | Ethernet Segment Identifier (10 octets) |
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       | DF-Flags (1 octet)                      |
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       | Sequence Number (1 octet)               |
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       | Originating Router's IP Address         |
       |          (4 or 16 octets)               |
       +-----------------------------------------+


   The DF Flags can have the following values:

   DF-INIT : Sent initially upon boot-up; bootstraps the network
   DF-REQUEST : Sent to request DF takeover

   For the purpose of BGP route key processing, the Ethernet Segment
   Identifier and Originating Router's IP Address fields are considered
   to be part of the prefix in the NLRI.  The DF-Flags and Sequence
   Number are to be treated as route attributes as opposed to being
   part of the route.  This route is sent along with the ES-Import
   route target.
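
   As a sketch of the field layout above, the following Python packs
   and parses the route-type-specific part of the Request route.  The
   numeric values of DF_INIT and DF_REQUEST and all function names are
   hypothetical, since no code points are assigned in this document;
   the Route Type and Length octets are left out.

```python
import ipaddress
import struct

# Hypothetical flag values; the text does not assign numeric codes.
DF_INIT = 0x01
DF_REQUEST = 0x02


def encode_request(rd: bytes, esi: bytes, flags: int, seq: int,
                   originator: str) -> bytes:
    """Pack RD, ESI, DF-Flags, Sequence Number and originator address."""
    if len(rd) != 8 or len(esi) != 10:
        raise ValueError("RD must be 8 octets and ESI 10 octets")
    ip = ipaddress.ip_address(originator).packed      # 4 or 16 octets
    return rd + esi + struct.pack("!BB", flags, seq) + ip


def decode_request(body: bytes):
    """Split the body back into (rd, esi, flags, seq, originator)."""
    if len(body) not in (24, 36):                     # IPv4 or IPv6 originator
        raise ValueError("unexpected Request route length")
    rd, esi = body[:8], body[8:18]
    flags, seq = struct.unpack("!BB", body[18:20])
    originator = str(ipaddress.ip_address(body[20:]))
    return rd, esi, flags, seq, originator


def route_key(body: bytes):
    """ESI and originator form the key; flags and sequence number do not."""
    _, esi, _, _, originator = decode_request(body)
    return esi, originator


first = encode_request(bytes(8), bytes(range(10)), DF_INIT, 7, "192.0.2.3")
second = encode_request(bytes(8), bytes(range(10)), DF_REQUEST, 8, "192.0.2.3")
# A new flag or sequence number updates the same route, not a new one.
assert route_key(first) == route_key(second)
```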

3.5.2.  DF Election Handshake Response Route

   A DF Election Handshake Response Type NLRI consists of the following:

       +-+-+-+-+-+-+-+-+-+-+-+-+-+-++-+-+-+-+-+-++
       | RD (8 octets)                           |
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       | Ethernet Segment Identifier (10 octets) |
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       | IP-Address Length (1 octet)             |
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       | Destination Router's IP Address         |
       |      (4 or 16 octets)                   |
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       | DF-Flags (1 octet)                      |
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       | Sequence Number (1 octet)               |
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       | Originating Router's IP Address         |
       |          (4 or 16 octets)               |
       +-----------------------------------------+


   The DF-Flags can have the following values:

   DF-ACK : Sent to Acknowledge DF-REQUEST
   DF-NACK : Sent to Reject DF-REQUEST

   For the purpose of BGP route key processing, the Ethernet Segment
   Identifier, IP-Address Length, Destination Router's IP Address and
   Originating Router's IP Address fields are considered to be part of
   the prefix in the NLRI.  The DF-Flags and Sequence Number are to be
   treated as route attributes as opposed to being part of the route.
   This route is sent along with the ES-Import route target.
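
   A hedged parsing sketch for the Response route follows.  It assumes
   that the IP-Address Length is expressed in bits (32 or 128), as is
   done for the IP address length fields of [RFC7432]; the unit is not
   stated here.  The DF_ACK and DF_NACK values and the function names
   are hypothetical.

```python
import ipaddress
import struct

# Hypothetical flag values; the text does not assign numeric codes.
DF_ACK = 0x01
DF_NACK = 0x02


def decode_response(body: bytes) -> dict:
    """Parse the route-type-specific part of a Response route."""
    rd, esi = body[:8], body[8:18]
    ip_len_bits = body[18]
    if ip_len_bits not in (32, 128):      # assumption: length is in bits
        raise ValueError("unsupported IP-Address Length")
    n = ip_len_bits // 8
    destination = str(ipaddress.ip_address(body[19:19 + n]))
    flags, seq = struct.unpack("!BB", body[19 + n:21 + n])
    # The originator address occupies the remaining 4 or 16 octets.
    originator = str(ipaddress.ip_address(body[21 + n:]))
    return {"rd": rd, "esi": esi, "destination": destination,
            "flags": flags, "seq": seq, "originator": originator}


def is_for_me(rsp: dict, my_address: str, my_seq: int) -> bool:
    """Accept a response only if it names this PE and echoes its sequence."""
    return rsp["destination"] == my_address and rsp["seq"] == my_seq


body = (bytes(8) + bytes(10) + bytes([32])
        + ipaddress.ip_address("192.0.2.3").packed
        + bytes([DF_ACK, 7])
        + ipaddress.ip_address("192.0.2.1").packed)
rsp = decode_response(body)
assert is_for_me(rsp, "192.0.2.3", 7)
```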

   This document introduces a new flag called "H" (for Handshake) to the
   bitmap field of the DF Election Extended Community defined in
   [RFC8584].

                        1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | Type=0x06     | Sub-Type(0x06)|   DF Type     |D|A|H|T| |P|   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                         Reserved = 0                          |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+


   H: This flag is located in bit position 26 as shown above.  When set
   to 1, it indicates the desire to use Handshaking capability with the
   rest of the PEs in the ES.  This capability can only be used with a
   selected number of DF election algorithms such as HRW and Preference-
   based.
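
   For illustration, testing and setting the H flag in the 8-octet
   extended community could look like the sketch below.  Bit positions
   are counted from the most significant bit of the first 32-bit word,
   as in the figure; the function names and the DF Type octet used in
   the example are hypothetical.

```python
H_BIT = 26  # bit position of the H flag, counted from the left of the figure


def has_handshake(ext_community: bytes) -> bool:
    """Return True if the H flag is set in the extended community."""
    if len(ext_community) != 8:
        raise ValueError("an extended community is 8 octets long")
    word = int.from_bytes(ext_community[:4], "big")
    return bool(word & (1 << (31 - H_BIT)))


def set_handshake(ext_community: bytes) -> bytes:
    """Return a copy of the extended community with the H flag set."""
    if len(ext_community) != 8:
        raise ValueError("an extended community is 8 octets long")
    word = int.from_bytes(ext_community[:4], "big") | (1 << (31 - H_BIT))
    return word.to_bytes(4, "big") + ext_community[4:]


# Type 0x06, Sub-Type 0x06, an illustrative DF Type octet, empty bitmap.
base = bytes([0x06, 0x06, 0x01, 0x00, 0, 0, 0, 0])
with_h = set_handshake(base)
assert not has_handshake(base)
assert has_handshake(with_h)
```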

3.6.  DF Handshake Scenarios

   Consider the scenario where PE3 is freshly inserted into the network
   with PE1 and PE2 in steady state.  As shown in the sequence diagram
   below, at time = T0, PE3 will send its Type 4 ES route, which will
   cause PE1 and PE2 to discover PE3.

   After the discovery timer expires, at time = T1, PE3 will send a
   DF-Request containing [ESI, DF-REQ, SEQ1].

   PE2 responds via DF-Response DF-ACK at time = T2 with the same
   sequence number SEQ1: [ESI, DF-ACK, PE3, SEQ1].  Note that the
   sequence number is the same as that contained in the DF-Request from
   PE3.  PE3 will receive the DF-Response DF-ACK and take over the
   appropriate VLANs based on HRW only if the sequence number matches.

   PE1 responds via DF-Response DF-ACK at time = T3 with the same
   sequence number SEQ1: [ESI, DF-ACK, PE3, SEQ1].  PE3 will receive the
   DF-Response DF-ACK and take over the appropriate VLANs based on HRW
   only if the sequence number matches.

   By the end of the handshake, all appropriate VLANs for the ES are
   transferred from PE1 and PE2 to PE3 with a single per-ES handshake.

         PE1                     PE2                          PE3
          |                      |                           |
          |                      |    Type-4 (Discovery)     |
          |                      |<<-------------------------| T0
          |<<------------------------------------------------|
          |                      |                           |
          |                      |                           |
          |                      |    Type-TBD1 (DF-Request) |
          |<<--------------------|<<-------------------------| T1
          |                      |                           |
          |                      |    Type-TBD2 (DF-Response)|
          |                      |------------------------->>| T2
          | Type-TBD2 (DF-Response)|                         |
          |------------------------------------------------>>| T3
          |                      |                           |
          |<<##############################################>>|
          |               PE3 freshly inserted               |
          |<<##############################################>>|
          .                      .                           .

   Consider the scenario where PE2 and PE3 are inserted simultaneously
   in the network where PE1 is in steady state (as shown below).  PE2
   and PE3 will send the Type 4 ES routes and start the discovery timer.
   This will cause PE1, PE2 and PE3 to discover each other.

   PE2 and PE3 will then simultaneously and separately send a
   DF-Request.  PE1 will receive these requests and respond to them.

   To avoid any ambiguity, PE1 will explicitly specify in the DF
   Response route the destination for which the DF-ACK is meant.  That
   is why the responses from PE1 will contain [ESI, DF-ACK, PE2, SEQ]
   and [ESI, DF-ACK, PE3, SEQ] to specify that the response is meant
   for PE2 and PE3 respectively.

   Upon receiving the Type-TBD2 response message, PE2 and PE3 will take
   over the respective VLANs.

         PE1                    PE2                          PE3
          |                     |                           |
          |                     |    Type 4 (Discovery)     |
          |<<-------------------|                           | T0
          |<<-------------------|<<-------------------------|
          |                     |                           |
          |                     |                           |
          |                     |    Type-TBD1 (DF-Request) |
          |                     |<<-------------------------| T1
          |                     |                           |
          | Type-TBD1(DF-Request)|                          |
          |<<-------------------|                           | T2
          |                     |                           |
          |                     |    Type-TBD2 (DF-Response)|
          |----------------------------------------------->>| T3
          |                     |                           |
          |Type-TBD2(DF-Response)|                          |
          |------------------->>|                           | T4
          |                     |                           |
          |<<#############################################>>|
          |       PE2 and PE3 inserted simultaneously       |
          |<<#############################################>>|
          .                     .                           .



   When PE3 is shut down or removed from the network, the routes
   formerly advertised by PE3 will be withdrawn, including the Type-4
   route (as shown below).  When PE1 and PE2 process the deletion of
   PE3's Type-4 route, they will clean up any DF handshake state
   pertaining to PE3.  This means that PE1 and PE2 will withdraw the DF
   Response routes that they had earlier sent with PE3 as the
   destination.

         PE1                    PE2                          PE3
          |                     |                           |
          |                     |  Type-4 Route Withdrawal  |
          |                     |<<-------------------------| T0
          |<<-----------------------------------------------|
          |                     |                           |
          |   PE2 purges Type-TBD2 (DF-Response) sent to PE3| T2
          |                     |                           |
          |   PE1 purges Type-TBD2 (DF-Response) sent to PE3| T3
          |                     |                           |
          |<<#############################################>>|
          |   PE3 booted down/removed from the network      |
          |<<#############################################>>|
          .                     .                           .


3.7.  Backwards Compatibility

   Per redundancy group (per ES), for the DF election procedures to be
   globally convergent and unanimous, it is necessary that all the
   participating PEs agree on the DF Election algorithm to be used.  It
   is, however, possible that some PEs continue to use the existing
   modulus based DF election and do not rely on the new handshake/sync
   procedures.  PEs running older versions of the draft/RFC shall simply
   discard unrecognized new BGP extended communities.

   A PE can indicate its willingness to support new Handshake and/or
   Time Synchronization capabilities by signaling them in the DF
   Election Extended Community defined in [RFC8584] sent along with the
   Ethernet-Segment Route (Type-4).

   Consider the case where all the PE devices support the HRW election
   algorithm, but only a subset of them is capable of performing the
   handshake or synchronization mechanism.  In such a situation, the
   following procedures are exercised.

   In the illustration below, PE1, PE2 and PE3 send their respective
   Type-4 routes indicating their DF capabilities at time T1, T2 and T3
   respectively.  Only PE2 and PE3 are Handshake capable, hence only PE2
   and PE3 partake in DF Handshaking procedure described here at time T4
   and T5.  PE1 on the other hand, runs the DF election timer and takes
   over the DF role upon timer expiry at time T6.

         PE1                    PE2                          PE3
          |                     |                           |
          |                     |                           |
          |         Type-4 (0x0 Default Capability)         |
          |------------------->>|------------------------->>| T1
          |                     |                           |
          |         Type-4 (H=1 Handshake Capable)          |
          |<<-------------------|------------------------->>| T2
          |                     |                           |
          |         Type-4 (H=1 Handshake Capable)          |
          |<<-------------------|<<-------------------------| T3
          |                     |                           |
          |                     |                           |
          |                     |    Type-TBD1 (DF-Request) |
          |                     |<<-------------------------| T4
          |                     |                           |
          |                     |    Type-TBD2 (DF-Response)|
          |                     |------------------------->>| T5
          |   PE1  Timer Expiry (DF Takeover)               | T6
          |<<#############################################>>|
          |       Only PE2 and PE3 Handshake Capable        |
          |<<#############################################>>|
          .                     .                           .


4.  DF Election Synchronization Solution

   If all PE devices attached to a given Ethernet Segment are
   clock-synchronized with each other, then the above handshaking
   procedures can be simplified and packet loss can be reduced from
   BGP-propagation time (between the recovered PE and the DF PE) to a
   very short interval (e.g., milliseconds or less).

   The simplified procedure is as follows:

   The DF Election procedure, as described in [RFC7432] and as
   optionally signalled in [RFC8584], is applied.

   All PEs attached to a given Ethernet Segment are clock-synchronized
   using a networking protocol for clock synchronization (e.g., NTP,
   PTP, etc.).

   Upon insertion or failure recovery, the PE communicates to its
   peering partners the current time plus the time remaining on its
   peering timer.  This constitutes an "end" or "absolute" time as seen
   from the local PE.  That absolute time is called the "Service
   Carving Time" (SCT).

   A new BGP Extended Community is advertised along with the RT-4 to
   communicate the Service Carving Time to the other partners.

   Upon reception of that new BGP Extended Community, each partner PE
   knows exactly when to carve.  The notion of skew is introduced to
   eliminate any potential duplicate traffic or loops: partner PEs add
   a skew (default = -10ms) to the Service Carving Time.  The
   previously inserted PE(s) must carve first, followed shortly (by
   the skew interval) by the newly inserted PE.

   To summarize, all peering PEs carve almost simultaneously at the
   time announced by the newly added/recovered PE.  The newly inserted
   PE initiates the SCT and carves immediately on peering timer expiry.
   The previously inserted PE(s), on receiving an RT-4 with an SCT BGP
   extended community, carve shortly before the Service Carving Time.
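
   The timing rule above can be sketched as follows.  This is a minimal
   illustration, not part of the specification: the function names are
   hypothetical, times are plain seconds on the shared synchronized
   clock, and the only value taken from the text is the default skew of
   -10ms.

```python
DEFAULT_SKEW_S = -0.010  # default skew of -10 ms, as stated in the text


def service_carving_time(now_s: float, peering_timer_remaining_s: float) -> float:
    """SCT announced by the recovering PE: current time plus remaining timer."""
    return now_s + peering_timer_remaining_s


def partner_carve_time(sct_s: float, skew_s: float = DEFAULT_SKEW_S) -> float:
    """Carve time of a previously inserted PE: SCT plus the (negative) skew."""
    return sct_s + skew_s


# Hypothetical example: RT-4 sent at t=100 with a full 3-second peering timer.
sct = service_carving_time(now_s=100.0, peering_timer_remaining_s=3.0)
print("recovering PE carves at", sct)
print("partner PE carves at", partner_carve_time(sct))
```

   Because the skew is negative, the partner's carve time always falls
   before the SCT, which matches the requirement that the previously
   inserted PE(s) carve first.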

4.1.  Advantages

   There are multiple advantages to using this approach.  Here is a
   non-exhaustive list:

   - Simple uni-directional signaling is all that is needed

   - Backwards-compatible: PEs supporting only the older [RFC7432]
     procedures simply discard the unrecognized new "Service Carving
     Timestamp" BGP Extended Community

   - Multiple DF Election algorithms can be supported:

     * [RFC7432] default ordered list ordinal algorithm (Modulo),

     * [RFC8584] highest-random weight, etc.

   - Independent of BGP transmission delay for RT-4

   - Agnostic of the time synchronization mechanism used (e.g., NTP,
     PTP, etc.)

4.2.  BGP Encoding

   A new BGP extended community needs to be defined to communicate the
   Service Carving Timestamp for each Ethernet Segment.

   A new transitive extended community, where the Type field is 0x06
   and the Sub-Type is [TBD3], is advertised along with the Ethernet
   Segment route.  The timestamp for the expected service carving is
   carried in the 6-octet value field of this 8-octet extended
   community as follows:





    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | Type=0x06     | Sub-Type(TBD3)|            Timestamp(upper 16)|
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                       Timestamp (lower 32)                    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+


   This document introduces a new flag called "T" (for Time
   Synchronization) to the bitmap field of the DF Election Extended
   Community defined in [RFC8584].

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | Type=0x06     | Sub-Type(0x06)|   DF Type     |P|A|H|T| Bitmap|
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                         Reserved = 0                          |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+


   T: This flag is located in bit position 27 as shown above.  When set
   to 1, it indicates the desire to use the Time Synchronization
   capability with the rest of the PEs in the ES.  This capability is
   used in conjunction with the agreed-upon DF Type (DF Election Type).
   For example, if all the PEs in the ES indicate that they have the
   Time Synchronization capability and they want the DF Type to be HRW,
   then the HRW algorithm is used in conjunction with this capability.

4.3.  Note on NTP-based synchronization

   The 64-bit timestamp used by the NTP protocol consists of a 32-bit
   part for seconds and a 32-bit part for the fractional second, giving
   a time scale that rolls over every 2^32 seconds (136 years) and a
   theoretical resolution of 2^-32 seconds (233 picoseconds).  The
   recommendation is to keep the upper 32 bits (seconds) and to carry
   only the 16 most significant bits of the fractional second.
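
   As an illustration of this truncation, the sketch below packs a
   64-bit NTP timestamp into the 8-octet extended community shown in
   Section 4.2.  It assumes that "upper 16" holds the 16 most
   significant bits of the seconds part and that "lower 32" holds the
   remaining 16 bits of seconds followed by the 16 retained fraction
   bits.  The Sub-Type value is a placeholder because TBD3 is not yet
   assigned, and the function names are hypothetical.  Keeping 16
   fraction bits gives a resolution of 2^-16 seconds (about 15.3
   microseconds).

```python
import struct
from typing import Tuple

EVPN_EXT_COMM_TYPE = 0x06
SCT_SUB_TYPE_PLACEHOLDER = 0x00  # hypothetical value: TBD3 is not yet assigned


def encode_sct(ntp_timestamp: int, sub_type: int = SCT_SUB_TYPE_PLACEHOLDER) -> bytes:
    """Pack a 64-bit NTP timestamp into an 8-octet SCT extended community."""
    if not 0 <= ntp_timestamp < (1 << 64):
        raise ValueError("NTP timestamp must fit in 64 bits")
    ts48 = ntp_timestamp >> 16    # drop the 16 least significant fraction bits
    upper16 = ts48 >> 32          # upper 16 bits of the 48-bit value
    lower32 = ts48 & 0xFFFFFFFF   # lower 32 bits of the 48-bit value
    # Network byte order: Type (1), Sub-Type (1), upper 16 (2), lower 32 (4).
    return struct.pack("!BBHI", EVPN_EXT_COMM_TYPE, sub_type, upper16, lower32)


def decode_sct(community: bytes) -> Tuple[int, float]:
    """Return (seconds, fraction of a second) from an encoded community."""
    _type, _sub_type, upper16, lower32 = struct.unpack("!BBHI", community)
    ts48 = (upper16 << 32) | lower32
    return ts48 >> 16, (ts48 & 0xFFFF) / 65536.0


# Hypothetical timestamp: 103 seconds plus one half second.
example = (103 << 32) | (0x8000 << 16)
encoded = encode_sct(example)
print(encoded.hex(), decode_sct(encoded))
```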

4.4.  Synchronization Scenarios

   Let's take Figure 1 as an example where initially PE2 had failed and
   PE1 had taken over.

   Based on [RFC7432]:

   - Initial state: PE1 is in steady-state, PE2 is recovering

   - PE2 recovers at (absolute) time t=99

   - PE2 advertises RT-4 (sent at t=100) to partner PE1

   - PE2 starts its 3-second peering timer as per [RFC7432]

   - PE1 carves immediately on RT-4 reception, i.e., at t=100 + minimal
     BGP propagation delay

   - PE2 carves at time t=103

   With the above procedure, and given the [RFC7432] aim of favouring
   a traffic black hole over duplicate traffic, a traffic black hole
   occurs as part of each PE recovery sequence.  The peering timer
   value has a direct effect on the duration of the blackholing.  A
   short (especially zero) peering timer may, however, result in
   duplicate traffic or traffic loops.

   Based on the Service Carving Time (SCT) approach:

   - Initial state: PE1 is in steady-state, PE2 is recovering

   - PE2 recovers at (absolute) time t=99

   - PE2 advertises RT-4 (sent at t=100) with target SCT value t=103 to
     partner PE1

   - PE2 starts its 3-second peering timer as per [RFC7432]

   - Both PE1 and PE2 carve at (absolute) time t=103; in fact, PE1
     should carve slightly before PE2 (skew).

   Using the SCT approach, the negative effect of the peering timer is
   mitigated.  Also, the BGP RT-4 transmission delay (from PE2 to PE1)
   becomes a no-op.
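
   The two timelines can be compared with a short calculation.  The
   sketch below assumes that the traffic black hole lasts from the
   moment PE1 carves until the moment PE2 carves.  The BGP propagation
   delay of 50 ms is a made-up value for illustration, and the function
   names are hypothetical.

```python
def blackhole_window_rfc7432(rt4_sent_s: float, bgp_delay_s: float,
                             peering_timer_s: float) -> float:
    """Gap between PE1 carving on RT-4 receipt and PE2 carving on timer expiry."""
    pe1_carve = rt4_sent_s + bgp_delay_s
    pe2_carve = rt4_sent_s + peering_timer_s
    return pe2_carve - pe1_carve


def blackhole_window_sct(skew_s: float = -0.010) -> float:
    """Gap between PE1 carving at SCT + skew and PE2 carving at SCT."""
    return -skew_s


# RT-4 sent at t=100, 3-second peering timer, assumed 50 ms BGP delay.
print(blackhole_window_rfc7432(100.0, 0.050, 3.0))  # close to 2.95 seconds
print(blackhole_window_sct())                       # 0.01 seconds
```

   Under these assumptions the window no longer depends on the peering
   timer or on the BGP delay, only on the configured skew.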

4.5.  Backwards Compatibility

   Per redundancy group, for the DF election procedures to be globally
   convergent and unanimous, it is necessary that all the participating
   PEs agree on the DF Election algorithm to be used.  It is, however,
   possible that some PEs continue to use the existing modulus-based DF
   election and do not rely on the new SCT BGP extended community.  PEs
   running a baseline DF election mechanism shall simply discard the
   unrecognized new SCT BGP extended community.

   A PE can indicate its willingness to support clock-synchronized
   carving by signaling the new 'T' DF Election Capability as well as
   including the new Service Carving Time BGP extended community along
   with the Ethernet Segment route (Type-4).

5.  Interoperability

   If some PEs in the redundancy group signal both Handshake and Time
   Synchronization capabilities (both H & T set to 1), then Time
   Synchronization capability SHALL be chosen over Handshake capability
   with the HRW (or Preference-based) DF election algorithm.

   If some PEs in the redundancy group signal Time Synchronization (T=1)
   but not Handshaking (H=0), whereas some other PEs in the same
   redundancy group signal Handshaking (H=1) but not Time
   Synchronization (T=0), then the PEs that have the Handshaking
   capability SHALL perform DF Election using the signaled or default
   DF-Type with handshaking among themselves, and the PEs that have the
   Time Synchronization capability SHALL perform DF Election using the
   signaled or default DF-Type with time synchronization among
   themselves.

   If some PEs in the redundancy group signal neither the Time
   Synchronization nor the Handshaking capability, then these PEs SHALL
   perform DF Election (Modulo, HRW or Preference-based) with the
   default peering-timer-based mechanism defined in [RFC7432].
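
   One possible per-PE reading of these rules is sketched below.  It is
   an interpretation, not normative text: the names are hypothetical,
   and the sketch does not address how a PE that signals both H and T
   interacts with a peer that signals H only, which the rules above do
   not spell out.

```python
from enum import Enum
from typing import Dict, List, Tuple


class RecoveryMode(Enum):
    TIME_SYNC = "time-synchronization"
    HANDSHAKE = "handshake"
    PEERING_TIMER = "peering-timer"


def recovery_mode(h: bool, t: bool) -> RecoveryMode:
    """Pick the recovery mechanism for one PE from its H and T flags."""
    if t:                       # T=1 is chosen even when H=1 is also set
        return RecoveryMode.TIME_SYNC
    if h:                       # H=1, T=0
        return RecoveryMode.HANDSHAKE
    return RecoveryMode.PEERING_TIMER  # neither capability signaled


def group_by_mode(flags: Dict[str, Tuple[bool, bool]]) -> Dict[RecoveryMode, List[str]]:
    """Group PE names by the mechanism each would use; flags are (H, T)."""
    groups: Dict[RecoveryMode, List[str]] = {}
    for pe, (h, t) in sorted(flags.items()):
        groups.setdefault(recovery_mode(h, t), []).append(pe)
    return groups


# Hypothetical redundancy group with mixed capabilities, as (H, T) pairs.
groups = group_by_mode({
    "PE1": (True, True),
    "PE2": (True, False),
    "PE3": (False, False),
    "PE4": (False, True),
})
for mode, pes in groups.items():
    print(mode.value, pes)
```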

6.  Security Considerations

   The mechanisms in this document use the EVPN control plane as
   defined in [RFC7432].  Security considerations described in
   [RFC7432] are equally applicable.  This document uses MPLS and
   IP-based tunnel technologies to support data plane transport.
   Security considerations described in [RFC7432] and in [RFC8365] are
   equally applicable.

7.  IANA Considerations

   This document solicits the allocation of the following route types
   in the "EVPN Route Types" registry set up by [RFC7432]:

         TBD1  DF Election Handshake Request      This document
         TBD2  DF Election Handshake Response     This document

   This document solicits the allocation of the following sub-type in
   the "EVPN Extended Community Sub-Types" registry set up by [RFC7153]:

         TBD3     Service Carving Timestamp    This document

   This document solicits the allocation of the following values in the
   "DF Election Capabilities" registry set up by [RFC8584]:

         Bit         Name                             Reference
         ----        ----------------                 -------------
         2           Handshake                        This document
         3           Time Synchronization             This document

8.  References

8.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997,
              <https://www.rfc-editor.org/info/rfc2119>.

   [RFC7153]  Rosen, E. and Y. Rekhter, "IANA Registries for BGP
              Extended Communities", RFC 7153, DOI 10.17487/RFC7153,
              March 2014, <https://www.rfc-editor.org/info/rfc7153>.

   [RFC7432]  Sajassi, A., Ed., Aggarwal, R., Bitar, N., Isaac, A.,
              Uttaro, J., Drake, J., and W. Henderickx, "BGP MPLS-Based
              Ethernet VPN", RFC 7432, DOI 10.17487/RFC7432, February
              2015, <https://www.rfc-editor.org/info/rfc7432>.

   [RFC8365]  Sajassi, A., Ed., Drake, J., Ed., Bitar, N., Shekhar, R.,
              Uttaro, J., and W. Henderickx, "A Network Virtualization
              Overlay Solution Using Ethernet VPN (EVPN)", RFC 8365,
              DOI 10.17487/RFC8365, March 2018,
              <https://www.rfc-editor.org/info/rfc8365>.

   [RFC8584]  Rabadan, J., Ed., Mohanty, S., Ed., Sajassi, A., Drake,
              J., Nagaraj, K., and S. Sathappan, "Framework for Ethernet
              VPN Designated Forwarder Election Extensibility",
              RFC 8584, DOI 10.17487/RFC8584, April 2019,
              <https://www.rfc-editor.org/info/rfc8584>.

8.2.  Informative References

   [RFC8126]  Cotton, M., Leiba, B., and T. Narten, "Guidelines for
              Writing an IANA Considerations Section in RFCs", BCP 26,
              RFC 8126, DOI 10.17487/RFC8126, June 2017,
              <https://www.rfc-editor.org/info/rfc8126>.

Appendix A.  Contributors

   In addition to the authors listed on the front page, the following
   co-authors have also contributed substantially to this document:

   Luc Andre Burdet
   Cisco

   Email: lburdet@cisco.com

Appendix B.  Acknowledgements

   The authors would like to acknowledge the helpful comments and
   contributions of Satya Mohanty and Bharath Vasudevan.

Authors' Addresses

   Ali Sajassi (editor)
   Cisco

   Email: sajassi@cisco.com


   Gaurav Badoni
   Cisco

   Email: gbadoni@cisco.com


   Dhananjaya Rao
   Cisco

   Email: dhrao@cisco.com


   Patrice Brissette
   Cisco

   Email: pbrisset@cisco.com


   John Drake
   Juniper

   Email: jdrake@juniper.net


   Jorge Rabadan
   Nokia

   Email: jorge.rabadan@nokia.com





