Serving Stale Data to Improve DNS Resiliency
draft-ietf-dnsop-serve-stale-10

Yes

(Alexey Melnikov)
(Barry Leiba)

No Objection

(Alissa Cooper)
(Alvaro Retana)
(Deborah Brungard)
(Magnus Westerlund)
(Martin Vigoureux)

Recuse


Note: This ballot was opened for revision 09 and is now closed.

Roman Danyliw
No Objection
Comment (2019-12-03 for -09) Sent
* I agree with Mirja: Section 8 is more informative than what is alluded to in the paragraph starting with “Several recursive resolvers …” in Section 3, and IMO is worth keeping.  It struck me as odd to call out the operational practice of a particular vendor (Akamai).  We might want to check if this reference is ok – Ben?

* A few reference nits:
- Section 6.  Per the mention of DNS-OARC, please provide a citation.
- Sections 6 and 9.  The text references “during discussions in the IETF”.  What is that specifically – WG deliberation?

* Thanks for covering the attacker use cases of stale data in Section 10.
Éric Vyncke
No Objection
Comment (2019-12-01 for -09) Sent
Thank you for the work put into this document. The short document is easy to read. Feel free to ignore the sentences below.

I loved the sentence "stale bread is better than no bread." Who said that I-Ds are boring? :-)

Should the assertion about products serving stale DNS data (end of section 3) be backed by external references? This is somewhat addressed in section 8 (which is to be removed...).

Finally, I am unsure whether it is worth documenting the WG discussion about EDNS.

Regards,

-éric
Warren Kumari
Recuse
Comment (2019-11-18 for -09) Sent
Recusing because I'm an author.
Alexey Melnikov Former IESG member
Yes
Yes (for -09) Not sent
Barry Leiba Former IESG member
Yes
Yes (for -09) Unknown
Suresh Krishnan Former IESG member
Yes
Yes (2019-12-04 for -09) Sent
* Section 6

It might be useful to include a reference to DITL for some background on the dataset mentioned in this section

http://www.caida.org/projects/ditl/
Adam Roach Former IESG member
No Objection
No Objection (2019-12-02 for -09) Sent
Thanks to everyone who put work into documenting this useful and apparently
well-deployed mechanism. I have a handful of comments on the current document.

---------------------------------------------------------------------------

§3:

>  Several recursive resolver operators, including Akamai, currently use
>  stale data for answers in some way.

This won't age well; and it's not clear why calling out Akamai amongst
the various DNS service providers is warranted. Suggest:

  At the time of this document's publication, several recursive resolver
  operators use stale data for answers in some way

(If the notion of citing Akamai is to indicate the scale of such operators,
I suggest "...operators, including large-scale operators, use stale...")

---------------------------------------------------------------------------

§4:

>  The definition of TTL in [RFC1035] Sections 3.2.1 and 4.1.3 is
>  amended to read:
>
>  TTL  a 32-bit unsigned integer number of seconds that specifies the
>     duration that the resource record MAY be cached before the source
>     of the information MUST again be consulted.  Zero values are
>     interpreted to mean that the RR can only be used for the
>     transaction in progress, and should not be cached.  Values SHOULD
>     be capped on the orders of days to weeks, with a recommended cap
>     of 604,800 seconds (seven days).  If the data is unable to be
>     authoritatively refreshed when the TTL expires, the record MAY be
>     used as though it is unexpired.  See the Section 5 and Section 6
>     sections for details.


The addition of what I must presume is intended to be RFC 2119 language to a
document that doesn't cite RFC 2119 seems questionable.  I would suggest
either explicitly adding RFC 2119 boilerplate to RFC 1035 as part of this
update, or using plain English language to convey the same concepts as are
intended.

Nit: "See the Section 5 and Section 6 sections for details" is a very awkward
way to phrase the closing sentence.

More substantively: Sections 5 and 6 of RFC 1035 are "MASTER FILES" and "NAME
SERVER IMPLEMENTATION" respectively. Is this final sentence intended to refer to
those two sections? Or is it pointing to "Example Method" and "Implementation
Considerations" of this document? If the latter, please specifically cite this
document (e.g., "See Section 5 and Section 6 of [RFCXXXX] for details.")

---------------------------------------------------------------------------

§4:

>  therefor leave any previous state intact.  See Section 6 for a

Nit: "therefore"

---------------------------------------------------------------------------

§5:

>  When a request is received by a recursive resolver, it should start
>  the client response timer.

The passive voice in this sentence makes "it" linguistically ambiguous.
Suggest: "When the recursive resolver receives a request, it should start..."

---------------------------------------------------------------------------

§10:

>  A proposed mitigation is that certificate authorities
>  should fully look up each name starting at the DNS root for every
>  name lookup.  Alternatively, CAs should use a resolver that is not
>  serving stale data.

This seems like a perfectly good solution, although I wonder how many CAs are
likely to read this document. If I were the type to engage in wagering, I'd
put all of my money on "zero." I'm not sure specific action is called for
prior to publication of this document as an RFC, but it seems that additional
publicity of this issue and the way that serve-stale interacts with it --
e.g., to CAB Forum and its members -- is warranted.
Alissa Cooper Former IESG member
No Objection
No Objection (for -09) Not sent
Alvaro Retana Former IESG member
No Objection
No Objection (for -09) Not sent
Benjamin Kaduk Former IESG member
No Objection
No Objection (2019-12-04 for -09) Sent
Thanks for this document; it's some good comprehensive discussion of the
issues related to this topic and will improve the stability of the
internet.  I have several minor comments and a few side notes that
are expected to lead to at most my own elucidation (but no textual
changes).

Section 2

   For a comprehensive treatment of DNS terms, please see [RFC8499].

(side note: I myself would not use the word "comprehensive" when it
explicitly says that "some DNS-related terms are interpreted quite
differently by different DNS experts", but I understand why it is used
here.)

Section 3

   There are a number of reasons why an authoritative server may become
   unreachable, including Denial of Service (DoS) attacks, network
   issues, and so on.  If a recursive server is unable to contact the
   authoritative servers for a query but still has relevant data that

side note: the way this is worded might make a reader wonder if the
recursive is expected to attempt to contact all known authoritatives
before declaring failure.

   Several recursive resolver operators, including Akamai, currently use
   stale data for answers in some way.  A number of recursive resolver

I did not follow the discussions that led to this wording, but one of my
colleagues at Akamai suggested that "currently fall back to stale data
for answers under some circumstances" might be a nicer wording, though I
note that Adam has already proposed some text here as well, which is
probably fine.

Section 4

   The definition of TTL in [RFC1035] Sections 3.2.1 and 4.1.3 is
   amended to read:

   TTL  a 32-bit unsigned integer number of seconds that specifies the
      duration that the resource record MAY be cached before the source
      of the information MUST again be consulted.  Zero values are
      interpreted to mean that the RR can only be used for the
      transaction in progress, and should not be cached.  Values SHOULD
      be capped on the orders of days to weeks, with a recommended cap
      of 604,800 seconds (seven days).  If the data is unable to be
      authoritatively refreshed when the TTL expires, the record MAY be
      used as though it is unexpired.  See the Section 5 and Section 6
      sections for details.

I recommend using "[this document]" in the section references, since a
reader reading the updated content in the context of RFC 1035 might look
there instead of here.

Section 5

   The resolver then checks its cache for any unexpired records that
   satisfy the request and returns them if available.  If it finds no
   relevant unexpired data and the Recursion Desired flag is not set in
   the request, it should immediately return the response without
   consulting the cache for expired records.  Typically this response
   would be a referral to authoritative nameservers covering the zone,
   but the specifics are implementation-dependent.

side note: I'm slightly surprised that the semantics of the absence of
Recursion Desired are not more tightly nailed down, but neither is it the
role of this document to specify them.
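
For readers who want the quoted cache check spelled out, here is a minimal Python sketch. It covers only the first decision in the quoted paragraph (fresh hit, no Recursion Desired, or go resolve). The names `CachedRRset` and `initial_cache_step`, the dictionary-based cache, and the string outcomes are illustrative assumptions, not anything defined by the draft.

```python
from dataclasses import dataclass


@dataclass
class CachedRRset:
    records: list
    expires_at: float  # absolute time at which the TTL runs out


def initial_cache_step(cache, key, now, recursion_desired):
    """Return (outcome, records) for the first cache check of a request.

    Hypothetical outcomes:
      "answer"                - unexpired data was found and is returned
      "respond_without_stale" - RD is not set, so expired records are
                                not consulted at all
      "resolve"               - proceed to the normal resolution attempt
    """
    entry = cache.get(key)
    if entry is not None and entry.expires_at > now:
        return ("answer", entry.records)
    if not recursion_desired:
        # Per the quoted text: respond immediately without looking at
        # expired records (typically with a referral).
        return ("respond_without_stale", None)
    return ("resolve", None)


cache = {("example.com.", "A"): CachedRRset(["192.0.2.1"], expires_at=100.0)}
print(initial_cache_step(cache, ("example.com.", "A"), 50.0, True))
print(initial_cache_step(cache, ("example.com.", "A"), 150.0, False))
```

In this sketch an expired entry is never returned when the Recursion Desired flag is clear, which is the behavior the quoted paragraph describes; what the response actually contains in that case is left open, as the draft says it is implementation-dependent.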

   When no authorities are able to be reached during a resolution
   attempt, the resolver should attempt to refresh the delegation and
   restart the iterative lookup process with the remaining time on the
   query resolution timer.  This resumption should be done only once
   during one resolution effort.

Is the "during one" more like a global cap or more like "during a
given"?

Section 6

   The client response timer is another variable which deserves
   consideration.  If this value is too short, there exists the risk
   that stale answers may be used even when the authoritative server is
   actually reachable but slow; this may result in sub-optimal answers
   being returned.  Conversely, waiting too long will negatively impact
   user experience.

Not just sub-optimal but potentially even wrong or actively harmful
answers, no?

   The balance for the failure recheck timer is responsiveness in
   detecting the renewed availability of authorities versus the extra
   resource use for resolution.  If this variable is set too large,
   stale answers may continue to be returned even after the
   authoritative server is reachable; per [RFC2308], Section 7, this
   should be no more than five minutes.  If this variable is too small,
   authoritative servers may be rapidly hit with a significant amount of
   traffic when they become reachable again.

I think part of the concern is also that setting the value too small
will cause additional traffic towards the authoritative even while it is
nonresponsive/nonreachable, which could aggravate any DoS attack ongoing
against the authoritative.  Which is to say, that perhaps "became
reachable again" does not quite reflect the full set of considerations.
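
The trade-off around the failure recheck timer can be illustrated with a small Python sketch of a gate that suppresses refresh attempts for a name until the interval since its last failure has elapsed. The class name `RefreshGate`, its methods, and the 30-second interval are assumptions chosen for illustration; the quoted text only bounds the value at no more than five minutes.

```python
FAILURE_RECHECK_INTERVAL = 30.0  # seconds; assumed value, must be <= 300


class RefreshGate:
    """Limits how often a failed lookup is attempted again."""

    def __init__(self, interval=FAILURE_RECHECK_INTERVAL):
        self.interval = interval
        self._last_failure = {}  # key -> time of the most recent failure

    def record_failure(self, key, now):
        self._last_failure[key] = now

    def record_success(self, key):
        self._last_failure.pop(key, None)

    def may_attempt(self, key, now):
        last = self._last_failure.get(key)
        return last is None or now - last >= self.interval


gate = RefreshGate()
gate.record_failure("example.com.", now=1000.0)
print(gate.may_attempt("example.com.", now=1010.0))  # inside the interval
print(gate.may_attempt("example.com.", now=1030.0))  # interval has elapsed
```

With a gate like this, a smaller interval means more queries reach the authoritative servers both during the outage and at the moment they recover, which is the concern raised in the comment above.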

   Regarding the TTL to set on stale records in the response,
   historically TTLs of zero seconds have been problematic for some
   implementations, and negative values can't effectively be
   communicated to existing software.  Other very short TTLs could lead
   to congestive collapse as TTL-respecting clients rapidly try to
   refresh.  The recommended value of 30 seconds not only sidesteps
   those potential problems with no practical negative consequences, it
   also rate limits further queries from any client that honors the TTL,
   such as a forwarding resolver.

I wonder a little bit whether an RFC 8085 reference would make sense
here, but that's not exactly my area of expertise.
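
As a concrete reading of the quoted recommendation, the following Python sketch picks the TTL to place in a response: the remaining lifetime for an unexpired record, and a fixed 30 seconds for a stale one. The function name `response_ttl` and the use of integer timestamps are assumptions made for this example.

```python
STALE_ANSWER_TTL = 30  # seconds; the value recommended in the quoted text


def response_ttl(expires_at: int, now: int) -> int:
    """TTL to send to the client for a cached record.

    Unexpired records carry their remaining lifetime. Stale records get
    a fixed positive TTL, avoiding both zero and (unrepresentable)
    negative values.
    """
    remaining = expires_at - now
    return remaining if remaining > 0 else STALE_ANSWER_TTL


print(response_ttl(expires_at=100, now=40))   # 60
print(response_ttl(expires_at=100, now=500))  # 30
```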

   There's also no record of TTLs in the wild having the most
   significant bit set in DNS-OARC's "Day in the Life" samples.  With no

Should we have a reference for DNS-OARC's samples?

   apparent reason for operators to use them intentionally, that leaves
   either errors or non-standard experiments as explanations as to why
   such TTLs might be encountered, with neither providing an obviously
   compelling reason as to why having the leading bit set should be
   treated differently from having any of the next eleven bits set and
   then capped per Section 4.

side note(?): This discussion, as roughly "we can't think of any reason
why the change would be problematic", calls to mind the ongoing
discussions of RFC (text) format changes, where arguments are being made
for more-strict backwards/historical compatibility.  That said, I have
no reason to doubt the WG consensus position here, hence "side note".
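
The arithmetic behind "the leading bit ... the next eleven bits" can be checked with a short Python sketch. It assumes the cap is applied as a simple minimum over the TTL read as an unsigned 32-bit integer; the names `MAX_TTL` and `effective_ttl` are illustrative and not taken from the draft.

```python
MAX_TTL = 604_800  # recommended cap from the quoted definition (seven days)


def effective_ttl(wire_ttl: int) -> int:
    """Interpret a 32-bit TTL as unsigned and apply the cap."""
    if not 0 <= wire_ttl <= 0xFFFFFFFF:
        raise ValueError("TTL must fit in 32 bits")
    return min(wire_ttl, MAX_TTL)


# 604,800 is below 2**20, so any TTL with one of the top twelve bits set
# (the leading bit or any of the next eleven) already exceeds the cap.
print(MAX_TTL < 2**20)            # True
print(effective_ttl(0x80000000))  # 604800 (leading bit set)
print(effective_ttl(1 << 20))     # 604800 (lowest of the next eleven bits)
print(effective_ttl(300))         # 300
```

Under that assumption a TTL with the most significant bit set needs no special case: it is capped exactly like any other value above seven days.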

Section 7

   Be aware that Canonical Name (CNAME) and DNAME [RFC6672] records
   mingled in the expired cache with other records at the same owner
   name can cause surprising results.  This was observed with an initial
   implementation in BIND when a hostname changed from having an IPv4
   Address (A) record to a CNAME.  The version of BIND being used did
   not evict other types in the cache when a CNAME was received, which
   in normal operations is not a significant issue.  However, after both
   records expired and the authorities became unavailable, the fallback
   to stale answers returned the older A instead of the newer CNAME.

I'm not sure to what extent the lesson from this scenario is limited to
"CNAME/DNAME are special" versus "when serving stale, serve the
least-stale you have".

Section 8

   Details of Apple's implementation are not currently known.

I'm amenable to the other reviewer's comment that this section might be
interesting to keep, RFC 6982 notwithstanding, in which case this might
be more appropriately worded as "publicly disclosed" -- one assumes that
the Apple employees that wrote it know what it does!

Section 10

   The most obvious security issue is the increased likelihood of DNSSEC
   validation failures when using stale data because signatures could be
   returned outside their validity period.  Stale negative records can

We seem to be carefully not giving explicit guidance about using "stale"
DNSSEC keys in addition to stale resolution records.  If the
consequences of potentially using expired key material are more severe
than the consequences of potentially using expired DNS records (as it
seems to me), perhaps we should explicitly reiterate that serve-stale is
not an excuse to ignore key validity periods (as we are implicitly doing
here)?

   In [CloudStrife], it was demonstrated how stale DNS data, namely
   hostnames pointing to addresses that are no longer in use by the
   owner of the name, can be used to co-opt security such as to get
   domain-validated certificates fraudulently issued to an attacker.
   While this document does not create a new vulnerability in this area,
   it does potentially enlarge the window in which such an attack could
   be made.  A proposed mitigation is that certificate authorities
   should fully look up each name starting at the DNS root for every
   name lookup.  Alternatively, CAs should use a resolver that is not
   serving stale data.

[I think Adam has probably already covered this one, but keeping just in
case.]
I note that the target of this guidance (CAs) is not obviously in the
expected readership set for a document about DNS recursive resolver
operational considerations.  Can we do more to expand the visibility of
this guidance to the audience where it would be most useful?  (I don't
see an obvious candidate for, e.g., an additional Updates: relationship,
but perhaps someone has other ideas.)
Deborah Brungard Former IESG member
No Objection
No Objection (for -09) Not sent
Magnus Westerlund Former IESG member
No Objection
No Objection (for -09) Not sent
Martin Vigoureux Former IESG member
No Objection
No Objection (for -09) Not sent
Mirja Kühlewind Former IESG member
No Objection
No Objection (2019-12-02 for -09) Sent
Two comments:

1) It seems to me that this sentence in section 7 should/could actually be phrased as a normative requirement in this document:
"it is not necessary that every client request needs to
   trigger a new lookup flow in the presence of stale data, but rather
   that a good-faith effort has been recently made to refresh the stale
   data before it is delivered to any client."
Maybe worth considering...

2) I find the Implementation Status section (8) actually quite interesting for this document and maybe it should be considered to keep it in the document for final publication.