Network Working Group P. Lapukhov
Internet-Draft Facebook
Intended status: Informational J. Tantsura
Expires: January 2, 2020 Apstra, Inc.
July 1, 2019
Equal-Cost Multipath Considerations for BGP
draft-lapukhov-bgp-ecmp-considerations-02
Abstract
BGP (Border Gateway Protocol) [RFC4271] employs tie-breaking logic to
select a single best path among multiple paths available, known as
BGP best path selection. At the same time, it is a common practice
to allow for "equal-cost multipath" (ECMP) selection and programming
of multiple next-hops in routing tables. This document summarizes
some common considerations for the ECMP logic when BGP is used as the
routing protocol, with the intent of providing a common reference for
an otherwise unstandardized set of features.
Status of This Memo
This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet-
Drafts is at https://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
This Internet-Draft will expire on January 2, 2020.
Copyright Notice
Copyright (c) 2019 IETF Trust and the persons identified as the
document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
(https://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents
carefully, as they describe your rights and restrictions with respect
Lapukhov & Tantsura Expires January 2, 2020 [Page 1]
Internet-Draft draft-lapukhov-bgp-ecmp-considerations July 2019
to this document. Code Components extracted from this document must
include Simplified BSD License text as described in Section 4.e of
the Trust Legal Provisions and are provided without warranty as
described in the Simplified BSD License.
Table of Contents
1. Introduction
2. AS-PATH attribute comparison
3. Multipath among eBGP-learned paths
4. Multipath among iBGP learned paths
5. Multipath among eBGP and iBGP paths
6. Multipath with AIGP
7. Best path advertisement
8. Multipath and non-deterministic tie-breaking
9. Weighted equal-cost multipath
10. Informative References
Authors' Addresses
1. Introduction
Section 9.1.2.2 of [RFC4271] defines a step-by-step tie-breaking
procedure for selecting a single "best-path" among multiple
alternatives available for the same route. In order to improve
efficiency in densely meshed symmetric network topologies, it is
common to allow the selection of multiple "equal cost" paths for the
same route. A typical approach is to abort the tie-breaking process
after comparing the IGP cost to the NEXT_HOP attribute and to select
either all eBGP or all iBGP paths that remained "equal" under the
tie-breaking rules. See [BGPMP] for a vendor document explaining the
logic. In a nutshell, the steps that compare the BGP identifiers and
BGP peer IP addresses (steps (f) and (g) in [RFC4271]) are ignored
for the purpose of multipath routing. BGP implementations commonly
have a configuration knob that specifies the maximum number of equal
paths that are allowed to be programmed in the routing table.
Commonly, there is also a knob to enable multipath separately for
iBGP-learned or eBGP-learned paths.
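The approach described above can be illustrated with a minimal Python
sketch. It is not taken from any implementation: the Path fields, the
function names, and the max_paths parameter are hypothetical, and MED
is compared across all paths for brevity, which is a simplification
of step (c). The sketch orders paths by the degree of preference and
steps (a) through (e), and treats every path that ties with the best
path on that key as a multipath candidate. Steps (f) and (g) are used
only to pick the single best path and to order the result.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class Path:
    # All field names are illustrative assumptions.
    next_hop: str
    local_pref: int
    as_path_len: int
    origin: int        # 0 = IGP, 1 = EGP, 2 = INCOMPLETE
    med: int
    is_ebgp: bool
    igp_cost: int
    router_id: int     # BGP identifier, compared in step (f)
    peer_addr: str     # peer IP address, compared in step (g)


def preference_key(p):
    """Degree of preference plus steps (a)-(e); lower sorts first."""
    return (-p.local_pref, p.as_path_len, p.origin, p.med,
            not p.is_ebgp, p.igp_cost)


def full_key(p):
    """preference_key extended with steps (f) and (g)."""
    return preference_key(p) + (p.router_id,
                                int(ipaddress.ip_address(p.peer_addr)))


def select_multipath(paths, max_paths):
    """Return up to max_paths paths that tie with the best path."""
    if not paths or max_paths < 1:
        return []
    best = min(paths, key=full_key)
    equal = [p for p in paths if preference_key(p) == preference_key(best)]
    # The best path sorts first, so it always survives the cut.
    return sorted(equal, key=full_key)[:max_paths]
```

Because is_ebgp is part of preference_key, the returned set contains
either only eBGP or only iBGP paths, matching the behaviour described
above.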
2. AS-PATH attribute comparison
The mandatory requirement for all paths that are considered as
candidates for ECMP selection is to have the same AS_PATH length,
computed using the logic defined in [RFC4271] and [RFC5065], i.e.,
counting each AS_SET as a single AS number and ignoring the
AS_CONFED_SEQUENCE and AS_CONFED_SET segments. The content of these
segments is used purely for routing loop prevention. Assuming that
the AS_PATH lengths computed in this fashion are the same, many
implementations require that the content of the AS_SEQUENCE segments
MUST be the same among all the paths
considered. Two common configuration knobs are usually provided to
alter this behaviour. The first relaxes the otherwise mandatory
AS_SEQUENCE comparison rule, enforcing only the AS_PATH length rule
while ignoring the content of AS_SEQUENCE. The second requires that
the first AS number in the first AS_SEQUENCE segment found in the
AS_PATH (often referred to as the "peer AS" number) be the same as
the one found in the best path (as determined by running the full
tie-breaking procedure). This document refers to these two knobs as
"multipath as-path relaxed" and "multipath same peer-as",
respectively.
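The comparison rules in this section can be sketched in Python as
follows. This is an illustration under stated assumptions, not a
reference implementation: an AS_PATH is modelled as a list of
(segment type, AS list) tuples, the function and parameter names are
hypothetical, and the length computation follows [RFC4271] (an AS_SET
counts as one) and [RFC5065] (confederation segments are not
counted). How the two knobs combine is not standardized; here "same
peer-as" is applied as an additional constraint on top of "relaxed".

```python
# Segment type codes as assigned in RFC 4271 and RFC 5065.
AS_SET, AS_SEQUENCE, AS_CONFED_SEQUENCE, AS_CONFED_SET = 1, 2, 3, 4


def as_path_length(as_path):
    """AS_PATH length as used by best-path selection."""
    length = 0
    for seg_type, asns in as_path:
        if seg_type == AS_SEQUENCE:
            length += len(asns)
        elif seg_type == AS_SET:
            length += 1  # counts as one regardless of its size
        # Confederation segments contribute nothing.
    return length


def as_sequence(as_path):
    """Concatenated content of all AS_SEQUENCE segments."""
    return [asn for seg_type, asns in as_path
            if seg_type == AS_SEQUENCE for asn in asns]


def peer_as(as_path):
    """First AS number of the first AS_SEQUENCE segment, if any."""
    for seg_type, asns in as_path:
        if seg_type == AS_SEQUENCE and asns:
            return asns[0]
    return None


def multipath_eligible(candidate, best, relaxed=False, same_peer_as=False):
    """Check a candidate AS_PATH against the best path's AS_PATH."""
    if as_path_length(candidate) != as_path_length(best):
        return False
    if same_peer_as and peer_as(candidate) != peer_as(best):
        return False
    if not relaxed and as_sequence(candidate) != as_sequence(best):
        return False
    return True
```

With the default settings, two paths received from different peer
ASes are never considered equal, even when their AS_PATH lengths
match; setting relaxed=True lifts that restriction.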
3. Multipath among eBGP-learned paths
Step (d) in Section 9.1.2.2 of [RFC4271] mandates that, in the
presence of an eBGP path, all iBGP paths be removed from the ECMP
candidate set.
This leaves the BGP tie-breaking procedure with just eBGP paths. At
this point, the mandatory BGP NEXT_HOP attribute value most commonly
belongs to the IP subnet that the BGP speaker shares with the
advertising neighbor. In this case, it is common for implementations
to treat all NEXT_HOP values as having the same "internal cost" to
reach them per the guidance of step (e) of Section 9.1.2.2. In some
cases, either static routing or an IGP routing protocol could be used
between the BGP speakers peering using an eBGP session. An
implementation may use the next-hop metric discovered from the above
sources to perform tie-breaking even for eBGP paths.
If the MED attribute is present in some paths, the set of multipath
routes allowed will most likely be reduced to the ones coming from
the same peer AS, per step (c) of Section 9.1.2.2. This is unless an
implementation provides a configuration knob to always compare MED
attributes across all paths, as recommended by [RFC4451]. In the
latter case, the presence of the MED attribute does not automatically
reduce the candidate path set to the same peer AS only.
4. Multipath among iBGP learned paths
In most cases iBGP is used along with an underlying IGP. Thus, when
all paths for a prefix are learned via iBGP, the tie-breaking
commonly occurs based on IGP metric of the NEXT_HOP attribute. In
some implementations, it is possible to ignore the IGP cost as well,
if all of the paths are reachable via some kind of tunneling
mechanism, such as MPLS [RFC3031]. This is enabled via a knob
referred to in this document as "skip igp check". Note that there is
no standard way for a BGP speaker to detect the presence of such
tunneling techniques other than relying on configuration
settings.
When iBGP is deployed with BGP route-reflectors per [RFC4456], the
path attribute list may include the CLUSTER_LIST attribute. Many
implementations ignore it for the purpose of ECMP route selection,
assuming that IGP cost alone should be sufficient for loop
prevention. This assumption may not hold when an IGP is not deployed
and iBGP sessions are instead configured to reset the NEXT_HOP
attribute to "self" on every node. This also assumes the use of
directly connected link addresses for session formation. In this
case, ignoring CLUSTER_LIST length might lead to routing loops. It
is therefore recommended for implementations to have a knob that
enables accounting for CLUSTER_LIST length when performing multipath
route selection. Effectively, the CLUSTER_LIST attribute length
should be used as an IGP metric.
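One possible reading of this recommendation is sketched below in
Python. The function name, the dictionary layout, and the
use_cluster_list_length parameter are assumptions made for
illustration. When the knob is enabled, only the paths with the
shortest CLUSTER_LIST remain multipath candidates, in the same way
that only the paths with the lowest IGP cost would remain; a path
without the attribute is treated as having length zero.

```python
def filter_by_cluster_list(paths, use_cluster_list_length=False):
    """Keep only the paths with the shortest CLUSTER_LIST.

    Each path is a dict; the optional "cluster_list" key holds the
    list of CLUSTER_IDs. With the knob disabled the candidate set is
    returned unchanged.
    """
    if not use_cluster_list_length or not paths:
        return list(paths)

    def cluster_len(path):
        return len(path.get("cluster_list", []))

    shortest = min(cluster_len(p) for p in paths)
    return [p for p in paths if cluster_len(p) == shortest]
```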
Similarly to the route-reflector scenario, the use of BGP
confederations in multipath scenarios assumes the presence of an IGP
for proper loop prevention and uses the IGP metric as the final
tie-breaker for multipath routing. In addition, and similarly to the
eBGP case, implementations often require that, in order to be
considered equal, the paths must belong to the same peer member AS as
the best-path. It is useful to have the following two configuration
knobs: one enabling "multipath same confederation member peer-as",
and another enabling the less restrictive "confed as-path multipath
relaxed" rule, which allows selecting multipath routes reachable via
any confederation member peer AS. As mentioned above, the
AS_CONFED_SEQUENCE length is usually ignored for the purpose of
AS_PATH length comparison, relying instead on the IGP cost for loop
prevention.
In cases when an IGP is not present in a BGP confederation
deployment, and similarly to the route-reflection case, it may be
necessary to consider the AS_CONFED_SEQUENCE length when selecting
the equivalent routes, effectively using it as a substitute for an
IGP metric. A separate configuration knob is needed to allow this
behavior.
Per [RFC5065], paths learned over BGP intra-confederation peering
sessions are treated as iBGP. There is no specification or
operational document that defines how mixed iBGP route-reflector
and confederation-based deployments would work together. Therefore,
this document does not make recommendations for the above case.
5. Multipath among eBGP and iBGP paths
The best-path selection algorithm explicitly prefers eBGP paths over
iBGP paths and over paths learned from a BGP confederation member AS,
which, as per [RFC5065], are treated the same as iBGP from the
perspective of best-path selection. In some cases, however, it might
be beneficial to allow multipath routing between eBGP-learned and
iBGP-learned paths. This is only possible if some sort of tunneling
technique is used to reach both the eBGP and iBGP paths. If this
feature is enabled, the equal
routes are selected prior to the MED comparison step (c) in
Section 9.1.2.2 of [RFC4271].
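The effect of such a feature can be sketched in Python by splitting
the comparison key at step (c). The function names, dictionary keys,
and the ebgp_ibgp_multipath parameter are hypothetical, and MED is
compared across all paths for brevity. With the feature disabled, a
path must tie with the best path through step (e); with it enabled,
a tie through step (b) is sufficient, so eBGP and iBGP paths may end
up in the same set.

```python
def early_key(p):
    """Degree of preference plus steps (a) and (b); lower sorts first."""
    return (-p["local_pref"], p["as_path_len"], p["origin"])


def late_key(p):
    """Steps (c), (d) and (e): MED, eBGP over iBGP, interior cost."""
    return (p["med"], not p["is_ebgp"], p["igp_cost"])


def multipath_set(paths, ebgp_ibgp_multipath=False):
    """Return the paths considered equal to the best path."""
    if not paths:
        return []
    best = min(paths, key=lambda p: early_key(p) + late_key(p))
    if ebgp_ibgp_multipath:
        # Stop comparing before step (c).
        return [p for p in paths if early_key(p) == early_key(best)]
    best_key = early_key(best) + late_key(best)
    return [p for p in paths if early_key(p) + late_key(p) == best_key]
```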
6. Multipath with AIGP
The AIGP attribute defined in [RFC7311] must be used for best-path
selection prior to running any logic of Section 9.1.2.2 of [RFC4271].
Only the paths with the minimal value of the AIGP metric are eligible
for further consideration under the tie-breaking rules. The rest of
the multipath selection logic remains the same.
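A minimal Python sketch of this pre-filtering step follows. The
function name and the "aigp" dictionary key are assumptions. The
value stored under "aigp" stands for whatever metric the
implementation derives for the path under [RFC7311], and the handling
of paths without the attribute (dropped as soon as any path carries
it) reflects one reading of that specification rather than text from
this document.

```python
def filter_by_aigp(paths):
    """Keep only the paths with the lowest AIGP metric.

    Each path is a dict; "aigp" is absent or None when the path does
    not carry the attribute.
    """
    with_aigp = [p for p in paths if p.get("aigp") is not None]
    if not with_aigp:
        # No path carries AIGP: the step does not apply.
        return list(paths)
    lowest = min(p["aigp"] for p in with_aigp)
    return [p for p in with_aigp if p["aigp"] == lowest]
```

The surviving paths are then passed to the usual tie-breaking and
multipath logic unchanged.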
7. Best path advertisement
Unless the BGP "Add-Path" feature described in [RFC7911] is enabled,
a BGP speaker announces only a single best path to its peers, even
though multiple equal paths may be selected for programming into the
routing table. The unique best path is elected from the multipath
set using the standard tie-breaking rules.
8. Multipath and non-deterministic tie-breaking
Some implementations may implement non-standard tie-breaking logic,
for example using the oldest path rule (reference). This is generally
not recommended, and may interact with multipath route selection on
downstream BGP speakers. That is, after a route flap that affects
the best path upstream, the original best path would not be
restored, and the older path would still be advertised, possibly
affecting the tie-breaking rules on a downstream device if, for
example, the AS_PATH contents differ from those of the previous best
path.
9. Weighted equal-cost multipath
The proposal in [I-D.ietf-idr-link-bandwidth] defines conditions
under which the iBGP multipath feature might inform the routing table
of "weights" associated with the multiple external paths.
[I-D.ietf-idr-link-bandwidth] defines the weight extended community
attribute as non-transitive and considers its applicability in the
iBGP case only, though there are implementations that apply it to
eBGP as well. The proposal does not change the equal-cost multipath
selection logic, but associates additional load-sharing attributes
with equivalent paths.
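As an illustration of how such weights might be derived, the Python
sketch below reduces per-path bandwidth values to small integer
weights. It is only one possible approach and is not described in
[I-D.ietf-idr-link-bandwidth]: the function name is hypothetical,
bandwidths are given as integers for simplicity, and the set of
equivalent paths is assumed to have been selected already.

```python
from functools import reduce
from math import gcd


def ecmp_weights(link_bandwidth):
    """Map each next hop to an integer weight.

    link_bandwidth maps a next-hop address to the bandwidth value
    associated with the path. Weights are proportional to bandwidth
    and reduced by the greatest common divisor.
    """
    if not link_bandwidth:
        return {}
    divisor = reduce(gcd, link_bandwidth.values())
    if divisor == 0:
        # All bandwidths are zero: fall back to plain ECMP.
        return {nh: 1 for nh in link_bandwidth}
    return {nh: bw // divisor for nh, bw in link_bandwidth.items()}
```

For two paths with bandwidths of 1,250,000,000 and 5,000,000,000,
the resulting weights are 1 and 4, so the second next hop would
receive four fifths of the flows.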
10. Informative References
[BGPMP] "BGP Best Path Selection Algorithm",
<http://www.cisco.com/c/en/us/support/docs/ip/
border-gateway-protocol-bgp/13753-25.html>.
[I-D.ietf-idr-link-bandwidth]
Mohapatra, P. and R. Fernando, "BGP Link Bandwidth
Extended Community", draft-ietf-idr-link-bandwidth-07
(work in progress), March 2018.
[RFC3031] Rosen, E., Viswanathan, A., and R. Callon, "Multiprotocol
Label Switching Architecture", RFC 3031,
DOI 10.17487/RFC3031, January 2001,
<https://www.rfc-editor.org/info/rfc3031>.
[RFC4271] Rekhter, Y., Ed., Li, T., Ed., and S. Hares, Ed., "A
Border Gateway Protocol 4 (BGP-4)", RFC 4271,
DOI 10.17487/RFC4271, January 2006,
<https://www.rfc-editor.org/info/rfc4271>.
[RFC4451] McPherson, D. and V. Gill, "BGP MULTI_EXIT_DISC (MED)
Considerations", RFC 4451, DOI 10.17487/RFC4451, March
2006, <https://www.rfc-editor.org/info/rfc4451>.
[RFC4456] Bates, T., Chen, E., and R. Chandra, "BGP Route
Reflection: An Alternative to Full Mesh Internal BGP
(IBGP)", RFC 4456, DOI 10.17487/RFC4456, April 2006,
<https://www.rfc-editor.org/info/rfc4456>.
[RFC5065] Traina, P., McPherson, D., and J. Scudder, "Autonomous
System Confederations for BGP", RFC 5065,
DOI 10.17487/RFC5065, August 2007,
<https://www.rfc-editor.org/info/rfc5065>.
[RFC7311] Mohapatra, P., Fernando, R., Rosen, E., and J. Uttaro,
"The Accumulated IGP Metric Attribute for BGP", RFC 7311,
DOI 10.17487/RFC7311, August 2014,
<https://www.rfc-editor.org/info/rfc7311>.
[RFC7911] Walton, D., Retana, A., Chen, E., and J. Scudder,
"Advertisement of Multiple Paths in BGP", RFC 7911,
DOI 10.17487/RFC7911, July 2016,
<https://www.rfc-editor.org/info/rfc7911>.
Authors' Addresses
Petr Lapukhov
Facebook
1 Hacker Way
Menlo Park, CA 94025
US
Email: petr@fb.com
Jeff Tantsura
Apstra, Inc.
Menlo Park, CA 94025
US
Email: jefftant.ietf@gmail.com