Last Call Review of draft-ietf-opsawg-large-flow-load-balancing-11
review-ietf-opsawg-large-flow-load-balancing-11-opsdir-lc-pignataro-2014-05-15-00

Request Review of draft-ietf-opsawg-large-flow-load-balancing
Requested rev. no specific revision (document currently at 15)
Type Last Call Review
Team Ops Directorate (opsdir)
Deadline 2014-05-06
Requested 2014-04-24
Authors Ramki Krishnan, Lucy Yong, Anoop Ghanwani, Ning So, Bhumip Khasnabish
Draft last updated 2014-05-15
Completed reviews Genart Last Call review of -11 by Martin Thomson
Secdir Last Call review of -11 by Yoav Nir
Secdir Telechat review of -15 by Yoav Nir
Opsdir Last Call review of -11 by Carlos Pignataro
Opsdir Telechat review of -15 by Carlos Pignataro
Assignment Reviewer Carlos Pignataro 
State Completed
Review review-ietf-opsawg-large-flow-load-balancing-11-opsdir-lc-pignataro-2014-05-15
Reviewed rev. 11 (document currently at 15)
Review result Not Ready
Review completed: 2014-05-15

Review

Hi!

I have reviewed this document as part of the Operational directorate's ongoing effort to review all IETF documents being processed by the IESG. These comments were written primarily for the benefit of the operational area directors. Document editors and WG chairs should treat these comments just like any other last call comments.

This document is on the Informational track, providing mechanisms for optimization of LAG/ECMP load-balancing.

Summary: Not ready; there are issues to resolve.

The document offers a thorough analysis and presents a taxonomy and lexicon to talk about LAG/ECMP load-balancing. It also presents various options for the mechanisms it describes, with the pros and cons of each. This is all very useful and helpful.

I do have a number of questions and comments included below, categorized as Major and Minor. I apologize in advance if I misunderstood anything in the document that led to these questions and observations, and I hope these are useful.

Major:

Section 4 describes the trade-offs and limitations of local-only optimizations. However, what this document describes is an active (stateful) mechanism, as opposed to a hash-based passive (stateless) one. There should be a section on Operational Considerations of stateful LAG/ECMP load balancing, given that monitoring flows degrades forwarding performance, requires state maintenance, etc.

4.2. Operational Overview
...
   Step 2) The egress component links are periodically scanned for link
   utilization and the imbalance for the LAG/ECMP group is monitored. If
   the imbalance exceeds a certain imbalance threshold, then re-
   balancing is triggered. Measurement of the imbalance is discussed
   further in 5.1. Additional criteria may also be used to determine
   whether or not to trigger rebalancing, such as the maximum
   utilization of any of the component links, in addition to the
   imbalance.

If the egress component links of an ECMP group are measured, but those links are on different routers, how is this a local-only method, and how is the loop closed and the "rebalancing required" condition signaled?

Take for example:

   +--B
A==+
   +--C

If B and C measure imbalance, how do they know they belong to the same ECMP group?

The doc says:

   All of the steps identified above can be done locally within the
   router itself or could involve the use of a central management
   entity.

But I am not sure how some of these are done locally only, and also the "central management entity" seems underspecified.
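
For illustration, the Step 2 check quoted above might look like the following on a single router. This is a minimal sketch and not the draft's definition: the function names, the capacity-weighted mean, and the choice of "largest deviation from the mean" as the imbalance measure are all assumptions made here, and the threshold values are left to the caller precisely because the draft gives no defaults.

```python
def group_imbalance(utilizations, speeds):
    """Return (mean, imbalance) for one LAG/ECMP group.

    utilizations: per-link utilization as a fraction in [0, 1]
    speeds: per-link speed in bits per second, in the same order
    Both the weighting and the imbalance measure are assumptions.
    """
    if not utilizations or len(utilizations) != len(speeds):
        raise ValueError("need one utilization per component link")
    # Capacity-weighted mean, so links of different speeds are comparable.
    mean = sum(u * b for u, b in zip(utilizations, speeds)) / sum(speeds)
    # Imbalance taken as the largest deviation of any link from the mean.
    imbalance = max(abs(u - mean) for u in utilizations)
    return mean, imbalance


def should_rebalance(utilizations, speeds, imbalance_threshold,
                     max_util_threshold=None):
    """Trigger only if the imbalance exceeds the threshold and, when the
    optional extra criterion is given, some link is at least that busy."""
    _, imbalance = group_imbalance(utilizations, speeds)
    if imbalance <= imbalance_threshold:
        return False
    if max_util_threshold is not None and max(utilizations) < max_util_threshold:
        return False
    return True
```

Note that this only works when one node sees all component links of the group, which is the case for a LAG and for the ECMP next hops of a single router, but not for links measured on B and C in the example above.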


5.1. Configuration Parameters for Flow Rebalancing
...


Also, this paragraph and the document as a whole define a number of variables such as the "imbalance threshold", the "max utilization of any component links", etc. From an operational perspective: how are those values set? What are their defaults? What are appropriate ranges and values? Section 5 describes the parameters nicely, but does not give guidance on default values and ranges.


4.3. Large Flow Recognition

4.3.1. Flow Identification

   A flow (large flow or small flow) can be defined as a sequence of
   packets for which ordered delivery should be maintained.  Flows are
   typically identified using one or more fields from the packet header,
   for example:

     .  Layer 2: source MAC address, destination MAC address, VLAN ID.

     .  IP header: IP Protocol, IP source address, IP destination
        address, flow label (IPv6 only), TCP/UDP source port, TCP/UDP
        destination port.

Are these only applicable to TCP and UDP traffic? I think there needs to be a more exhaustive list of transports for this to be useful.

   For tunneling protocols like Generic Routing Encapsulation (GRE)
   [RFC 2784], Virtual eXtensible Local Area Network (VXLAN) [VXLAN],
   Network Virtualization using Generic Routing Encapsulation (NVGRE)
   [NVGRE], Stateless Transport Tunneling (STT) [STT], etc., flow
   identification is possible based on inner and/or outer headers.

Please add L2TPv3 as a key tunneling protocol.

Also, for tunneling protocols, there is a lot more than that. Yes, inner or outer headers, BUT there is typically also the tunnel header itself: for example, the GRE Key, the L2TPv3 Session ID, etc. Sometimes these fields summarize a flow decision. You might also want to look at (and reference) RFC 5640, "Load-Balancing for Mesh Softwires".

4.3.2. Criteria and Techniques for Large Flow Recognition

   From a bandwidth and time duration perspective, in order to recognize
   large flows we define an observation interval and observe the
   bandwidth of the flow over that interval.  A flow that exceeds a
   certain minimum bandwidth threshold over that observation interval
   would be considered a large flow.

From an operational standpoint, it appears these techniques are under-specified as far as these thresholds, time intervals, etc. are concerned. How are those configured? What are the defaults? What are appropriate ranges?

Sections 4.3 and 4.4 present, respectively, different techniques for sampling and re-balancing. The analyses are very useful. It would be really helpful to have a table summarizing all the different options and their associated pros and cons, and perhaps some applicability-based recommendations.
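
To make the two knobs in question concrete, here is a minimal sketch of the criterion quoted above: bytes are counted per flow over one observation interval, and a flow is reported as large if its average rate exceeds the minimum bandwidth threshold. The function name, the `(flow_key, length_bytes)` input shape, and exact per-packet counting (rather than sampling) are assumptions made for illustration; the draft does not prescribe them.

```python
from collections import defaultdict


def large_flows(packets, interval_s, min_bandwidth_bps):
    """Return {flow_key: rate_bps} for flows above the threshold.

    packets: iterable of (flow_key, length_bytes) seen in one interval
    interval_s: observation interval in seconds (no default assumed)
    min_bandwidth_bps: minimum bandwidth threshold in bits per second
    """
    if interval_s <= 0:
        raise ValueError("observation interval must be positive")
    byte_counts = defaultdict(int)
    for flow_key, length_bytes in packets:
        byte_counts[flow_key] += length_bytes
    result = {}
    for flow_key, count in byte_counts.items():
        rate_bps = count * 8 / interval_s  # average rate over the interval
        if rate_bps > min_bandwidth_bps:
            result[flow_key] = rate_bps
    return result
```

Both parameters change the outcome directly: a longer interval averages out short bursts, and a lower threshold reports more flows, which is why guidance on values matters.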

5.2. System Configuration and Identification Parameters
...
How are those parameters (besides an IP address) defined? What is a "LAG ID"? A UTF-8 string? A 64-bit unsigned integer?

5.3. Information for Alternative Placement of Large Flows

See comment above regarding transport protocols and tunnels.

5.6. Monitoring information

5.6.1. Interface (link) utilization

   The incoming bytes (ifInOctets), outgoing bytes (ifOutOctets) and
   interface speed (ifSpeed) can be measured from the Interface table
   (iftable) MIB [RFC 1213].

Why are these algorithms using MIBs only?
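
For reference, utilization derived from these objects is usually computed from two successive counter readings. The sketch below assumes 32-bit wrapping octet counters and at most one wrap between polls; the function names and the polling model are illustrative, and nothing here depends on how the values are retrieved.

```python
COUNTER32_MODULUS = 2 ** 32  # ifInOctets/ifOutOctets are 32-bit counters


def counter_delta(previous, current, modulus=COUNTER32_MODULUS):
    """Octets counted between two readings, allowing for one wrap."""
    return (current - previous) % modulus


def link_utilization(prev_octets, curr_octets, interval_s, if_speed_bps):
    """Fraction of link capacity used in one direction over the interval.

    prev_octets, curr_octets: successive readings of one octet counter
    interval_s: seconds between the two readings
    if_speed_bps: interface speed in bits per second (ifSpeed)
    """
    if interval_s <= 0 or if_speed_bps <= 0:
        raise ValueError("interval and speed must be positive")
    bits = counter_delta(prev_octets, curr_octets) * 8
    return bits / (interval_s * if_speed_bps)
```

On fast links a 32-bit octet counter can wrap more than once within a polling interval, in which case this calculation silently under-reports; that is one operational reason to ask whether sources other than these MIB objects are intended.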


Minor:

I think it is confusing to talk about "short-lived large flows" referring to them as "small flows". In fact I think it is potentially very confusing. I'd recommend creating a new term.

The introduction describes a bunch of numbers (5% link bandwidth, 10s/100s flows, etc.), but from an operational standpoint it is not clear how those potentially vary or are tied to a specific set of use cases. Further, it is not clear how those can potentially influence different algorithms. Maybe the answer is to put caps on them, or something else, but it would help to be more prescriptive about applicability.

1.2. Terminology

   ECMP table: A table that is used as the nexthop of an ECMP route that
   comprises the set of component links and the weights associated with
   each of those component links.  The weights are used to determine
   which values of the hash function map to a given component link.

It is not clear what the "weights" are if this is ECMP and not UCMP (U for Unequal).

Also, "a table used as the next hop" is confusing.
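
One possible reading of the definition, offered only as a guess at the intent: the weights decide how many hash buckets point at each component link, and a flow's hash selects a bucket. The sketch below uses hypothetical names, integer weights, and CRC32 as a stand-in hash; none of these come from the draft.

```python
import zlib


def build_table(weights):
    """Expand {link: integer weight} into a list of hash buckets."""
    table = []
    for link, weight in weights.items():
        if weight < 0:
            raise ValueError("weights must be non-negative")
        table.extend([link] * weight)  # more buckets means more hash values
    if not table:
        raise ValueError("at least one link needs a positive weight")
    return table


def select_link(table, flow_key):
    """Map a flow key (e.g. a tuple of header fields) to a component link."""
    digest = zlib.crc32(repr(flow_key).encode("utf-8"))
    return table[digest % len(table)]
```

Under this reading, equal weights give plain ECMP, and unequal weights give exactly the UCMP-like behavior that the question above is about, so the text should say which is meant.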

   LAG table: A table that is used as the output port which is a LAG
   that comprises the set of component links and the weights associated
   with each of those component links.

What is the input? Or, what is the LAG table associated with (i.e., given that it is not a route)?

                Figure 2: Unevenly Utilized Component Links

I am not sure how realistic the example in Section 3, Figure 2 is, if only two flows congest a member link...

4. Mechanisms for Optimizing LAG/ECMP Component Link Utilization

   The suggested mechanisms in this draft are about a local optimization
   solution; they are local in the sense that both the identification of
   large flows and re-balancing of the load can be accomplished
   completely within individual nodes in the network without the need
   for interaction with other nodes.

It is not clear to me how a local-only node can deal with node polarization in ECMP networks. A small explanation of this could help.

     .  Component Link Weight: The relative weight to be applied to
        traffic for a given component link when using hash-based
        techniques for load distribution.

Is this for ECMP or UCMP?

11. References

11.1. Normative References

11.2. Informative References

I would have expected that many of these references are Normative (i.e., needed to understand the document). Yes, the doc is Informational. The meaning of Normative vs. Informative still remains.

Hope this helps.

Thanks,

-- Carlos.