Last Call Review of draft-ietf-rtgwg-backoff-algo-07
review-ietf-rtgwg-backoff-algo-07-secdir-lc-kaduk-2018-02-14-00

Request Review of draft-ietf-rtgwg-backoff-algo
Requested rev. no specific revision (document currently at 10)
Type Last Call Review
Team Security Area Directorate (secdir)
Deadline 2018-02-14
Requested 2018-01-31
Draft last updated 2018-02-14
Completed reviews Rtgdir Early review of -04 by Sasha Vainshtein (diff)
Secdir Last Call review of -07 by Benjamin Kaduk (diff)
Genart Last Call review of -07 by Elwyn Davies (diff)
Opsdir Last Call review of -07 by Mehmet Ersue (diff)
Assignment Reviewer Benjamin Kaduk
State Completed
Review review-ietf-rtgwg-backoff-algo-07-secdir-lc-kaduk-2018-02-14
Reviewed rev. 07 (document currently at 10)
Review result Has Nits
Review completed: 2018-02-14

Review
review-ietf-rtgwg-backoff-algo-07-secdir-lc-kaduk-2018-02-14

Hi all,

I have reviewed this document as part of the security directorate's 
ongoing effort to review all IETF documents being processed by the 
IESG.  These comments were written primarily for the benefit of the 
security area directors.  Document editors and WG chairs should treat 
these comments just like any other last call comments.

>From a security perspective, this document is Ready.  It specifies a
standard scheme that can be used to back off SPF calculations during
periods of frequent IGP events, avoiding excessive resource
consumption performing calculations that would be rendered redundant
(or just be useless) soon.  The security considerations correctly
note that an attacker that can generate IGP events would be able to
delay the IGP convergence time, which is true both for this scheme
and all schemes previously in use.  (I might use more words to say
the same thing if I was writing it, but that probably reflects more
on me than the document.)


I do have some questions about the actual proposed FSM, though -- I
suspect that I am just making some implicit assumptions that may not
be grounded in reality.  In particular, I am basically assuming that
INITIAL_SPF_DELAY < SHORT_SPF_DELAY < LONG_SPF_DELAY <
HOLDDOWN_INTERVAL.  (The draft itself only has it as RECOMMENDED for
SPF_INITIAL_DELAY <= SPF_SHORT_DELAY <= SPF_LONG_DELAY in Section 6,
yes, with the different spellings, but has a MUST for
HOLDDOWN_INTERVAL > TIME_TO_LEARN_INTERVAL.)
This would potentially affect the state machine for events 1, 6, and
7.

In transition 1, we say to only start the SPF_TIMER if it is not already
running, but I do not see a way for it to already be running unless
the HOLDDOWN_TIMER value is less than one or more of the SPF_TIMER
values.

Similarly, I don't see how transition 7 could ever happen, since an IGP
event moves us out of the QUIET state, and I assume that the
SPF_TIMER would fire before the HOLDDOWN_TIMER, since the latter is
reset on every IGP event and I assume the latter has a larger value.

Transition 6 is a little less clear, but is also similar -- if
HOLDDOWN_TIMER is larger than LEARN_TIMER, then LEARN_TIMER must
fire before HOLDDOWN_TIMER, and we leave SHORT_WAIT to go to
LONG_WAIT before we could consider leaving SHORT_WAIT to go back to
QUIET.

So, am I making some flawed assumptions?  (Are there examples of
situations that clearly demonstrate the flaw?)


Also, to confirm my understanding, suppose a scenario happens where
an IGP participant sees an event, then a gap of 100ms, then three
IGP events at equal 10ms intervals, with the SPF delays at the
example values of 0/50/2000 ms and TIME_TO_LEARN of 1s.  The first
event triggers a SPF computation immediately, then we go to
SHORT_WAIT, the first of the three events kicks off a SPF_TIMER for
50ms, which is reset by the next two events, the SPF timer fires and
we recompute the SPF, then TIME_TO_LEARN fires and we go to
LONG_WAIT until the HOLDDOWN_TIMER fires.  Or maybe the SPF
calculation takes more than 200ms, so when the second IGP event
fires, we abort the currently in-progress calculation and don't
start another one until 50ms after the last event?  I bring this up
because of the text in the second paragraph of Section 4 that talks
of computing the post-failure routing table in "a single route
computation".  But if I understand correctly, the *single*
computation only happens in the second case here, when the
calculation takes some hundreds of milliseconds; otherwise we still
have *two* computations (one triggered while we're in QUIET and the
second triggered in SHORT_DELAY).  So I'm not sure I fully
understand the expected scenario.


I also am probably having some problems with terminology, presumably
just my misunderstanding, which hopefully can be set straight
easily.

In the Introduction, we have a "desire to compute a new Shortest
Path First (SPF) as soon as a failure is detected", which is using
SPF as it is a data structure (e.g., the result of an algorithm),
whereas my intuition has SPF referring to the algorithm [class] but
not its output.

In section 3, we talk of "computation of the routing table, by the
IGP", which gets me confused about whether "the IGP" represents a
network protocol for conveying (e.g.) link state information, an
algorithm for SPF computation, or a router that performs SPF
computations.

In section 6 we talk of "the number of protocols
reactions/computations triggered by IGP SPF".  Is this just in the sense
of "each SPF calculation triggers a bunch of other stuff"?  I think
this is another case about me being confused whether "SPF" means an
algorithm, a specific computation using that algorithm, etc.



Some other editorial notes:

It's probably better to cite RFC 8174 instead of/in addition to RFC
2119, especially since there is at least a lowercase "may" present.

It's unclear that "temporally close" in "multiple temporally close
failures over a short time" really adds any value, in the
Introduction.

In section 2, last bullet point on page 3, "SPF_DELAY timers values"
probably doesn't need the plural "timers" (so, either "timer" or
the possessive "timers'"), though I am mindful of the recent
discussion on ietf@ about (non-)American English.  The second
sentence of the bullet is also a sentence fragment and not a
complete sentence.

SRLG is used without expansion in multiple places, but does not
appear on https://www.rfc-editor.org/materials/abbrev.expansion.txt
as a "well-known" abbreviation.

In section 6, we find the awkward construction "play it safe and
start with safe, i.e., longer timers".  Probably we want to say
"safe values" as the noun, and maybe consider rewording to avoid the
duplicate "safe" and/or the colloquialism "play it safe".

Section 8 says:

   [...]. FIBs
   are installed after multiple steps such as flooding of the IGP event
   across the network, SPF wait time, SPF computation, FIB distribution
   across line cards, and FIB update.  This document only addresses the
   first contribution.

which makes me try to match up "the first contribution" with the
flooding, when I assume it's meant to match up with the SPF wait
time.

-Benjamin