Ballot for draft-ietf-dnsop-no-response-issue
Yes
No Objection
Note: This ballot was opened for revision 17 and is now closed.
{Yes}
[nits]

S3.2.2
* s/answers responses/responses/ (or answers)

S5
* Is there a reference for a definition of "scrubbing service"?

S10
* s/None the tests/None of the tests/
General:

Nits:
* I tripped almost every time on saying "set FOO bit to 1" and similar because I'm used to "set" implying one and "not set" or "clear" implying zero. In other places the prose does go with simply saying "FOO bit is set". Maybe that's just me though; we'll see how my colleagues feel.

Section 1:
* Suggest including a reference to RFC4732 in the discussion of amplification attacks.

Section 2:
* In the discussion of abandoned transition to the SPF type, suggest a reference to RFC6686.

Nits:
* "Widespread non-response to EDNS queries has lead to ..." -- s/lead/led/
* "Widespread non-response to EDNS options, requires ..." -- remove comma
* "... requires recursive servers to have to decide ..." -- s/to have//
* "... being present, leads to ..." -- remove comma

Section 3.1.2:

A nit:
* "The exception to this are ..." -- either s/exception/exceptions/ or s/are/is/.

Section 3.1.5:

A nit:
* "While firewalls should not block TCP connection attempts if they do they should ..." -- suggest: "While firewalls should not block TCP connection attempts, those that do should ..."

Section 3.2.2:

More nits:
* "... version 0 queries but ... version numbers that are higher than zero." -- why the digit in one place but prose in the other?

Section 4:
* Paragraphs 3, 4, and 5 could be common factored very easily since most of the text is identical.

Section 5:
* I've never heard of a "scrubbing service". Is there a reference RFC, or could we include a short definition?
* "One needs to take care when choosing a scrubbing service." -- This is vague. What, apart from the prior sentence (whose implications I don't understand), should an operator be looking for?

Section 8:

Nit:
* "Testing is divided into two sections." -- a list follows, so s/./:/

Section 9:
* The final paragraph suggests disconnection of broken nameservers. This can have serious non-technical implications as well. That might be worth mentioning.

Nit:
* "Name server operators ..." -- s/Name server/Nameserver/, to be consistent with the rest of the document
Thanks for this document; it allows for a very approachable way to verify conformance.

** Section 2. Per “Working around issues due to non-compliance with RFCs is not sustainable”, this seems like a bold statement. What is the basis for it?

** Section 4. This section repeats several times that firewalls should not drop DNS traffic with unknown parameters and such traffic should not be construed as an attack. In the general case with “normal clients”, this is good advice. However, for certain highly controlled enclaves where a white-list-style approach to traffic is taken, this is not realistic. The presence of unexpected classes of new DNS traffic would be a bad sign (e.g., of compromise, a new software load whose features were not understood, or a configuration which was not validated).

** Section 8. For completeness, per “The test below use dig from BIND 9.11.0”, please provide a reference.

** Section 8 dig examples. It would be worth explaining $zone and $server.

** Section 10. Per “Testing protocol compliance can potentially result in false reports of attempts to break services from Intrusion Detection Services and firewalls.”, thanks for calling this out. I would recommend tuning this language:
-- s/break services/attack services/
-- to acknowledge that uncommon DNS protocol fields or traffic (from this test regime) might trigger anomaly-detection/profile-based IDS alerts too

** Editorial Nits:
-- Section 8. s/is know/is known/
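
As an illustration of the $zone and $server placeholders raised in the Section 8 comment above, here is a minimal Python sketch. It assumes that $zone is the name of a zone the tested nameserver is configured to serve and that $server is the address or hostname of that nameserver. The function name and the example values are hypothetical; the flags (+noedns, +noad, +norec) are existing dig options, chosen here to match a query "with no DNS flag bits set and without EDNS".

```python
def build_dig_command(zone: str, server: str, qtype: str = "soa") -> list[str]:
    """Return an argv list for a plain DNS query (no EDNS, AD clear, RD clear).

    zone   -- stands in for $zone: a zone the server is configured to serve
    server -- stands in for $server: the nameserver to send the query to
    qtype  -- record type to ask for (hypothetical default: "soa")
    """
    return ["dig", "+noedns", "+noad", "+norec", qtype, zone, f"@{server}"]


if __name__ == "__main__":
    # Example values are placeholders, not taken from the draft.
    print(" ".join(build_dig_command("example.com", "192.0.2.53")))
```

The argv list could be handed to subprocess.run() by an operator who wants to script the tests; passing a different qtype (for example "type1000") would cover the unknown-type case in the same way.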
Thank you for the work put into this document. I also like the extensive test scenarios with 'dig' ;-)

To be honest, I was about to ballot a DISCUSS as I have some doubts whether the objective of removing non-compliant servers (end of section 2) is achievable... Too many non-compliant servers, probably managed by clueless people. But, hey, we can always try!

I also wonder why this document is a generic BCP while section 8 and other parts seem to be more about testing servers. Balloting NO OBJECTION, but after a long hesitation over a DISCUSS.

Please address the nits found by Carlos during the INTDIR review: https://mailarchive.ietf.org/arch/msg/int-dir/wfKo4vDmFJwPa1HeDY9wxP2JdEA (at least one nit is already addressed, thank you)

Please find below some non-blocking COMMENTs and NITs. An answer will be appreciated.

I hope that this helps to improve the document,

Regards,

-éric

== COMMENTS ==

Generic: the objective of this document is a little unclear to me: is it to do compliance testing/certification of specific DNS server software, or of actual DNS servers on the Internet?

-- Section 1 --
Suggest also adding middle-boxes dropping EDNS to the sentence "Due to the inability to distinguish between packet loss and nameservers dropping EDNS" (see section 4).

-- Section 4 --
Why limit the middle boxes to only firewalls and load balancers? There are many different types of middle-box (NAT, ...) also doing "packet massaging" on the fly... :-(

-- Section 10 --
The security considerations section is rather weak... If the intent is to probe Internet servers, then why not add some text around 'do it only with prior agreement of the DNS servers operator'? Also, if the firewall is "protecting" the DNS server (cough cough), then as a security officer I would prefer to block all unknown opcodes/types at the firewall (possibly with a reply).

== NITS ==

-- section 2 --
Please add an expansion or a reference to "AD flag bit" (and, to me as a non-native English speaker, it is a pleonasm).
Thanks for a BCP on this. I agree with Ben about the commas. For what it’s worth, I disagree with Martin’s comment about “should” and such: the document does not cite BCP 14, and I think that’s fine.

Some editorial stuff:

-- Section 1 --

   While there is still a pool of servers that don't respond to EDNS requests, clients have no way to know if the lack of response is due to packet loss, or EDNS packets not being supported,

I tripped on the meaning of “while” here, and I suggest changing it to “As long as there are still servers...”, so as to avoid the ambiguity.

-- Section 2 --

   Some are caused directly from the non-compliant behaviour and others as a result of work-arounds

Make it “directly by”, not “from”. And then “and others are as a result”.

   o Widespread non-response to EDNS queries has lead to recursive

Make it “has led”.

   servers to have to decide whether to probe to see if it is the EDNS option or just EDNS that is causing the non response.

I would say, “the specific EDNS option or the use of EDNS in general”.
Someone (maybe the RFC Editor) will end up tweaking a lot of commas. I didn't try to list them all.

I didn't see a response to the secdir reviewer's question (though I'm also not sure that there's an easy answer to it).

Section 1

   The existence of servers which fail to respond to queries results in developers being hesitant to deploy new standards. Such servers need

nit: it feels a little like a juxtaposition to have "developers" that "deploy" new standards (vs. "developers that implement" or "operators that deploy").

   indication that the server is under attack. Parent zone operators are advised to regularly check that the delegating NS records are consistent with those of the delegated zone and to correct them when they are not [RFC1034]. Doing this regularly should reduce the instances of broken delegations.

I can't tell if this 1034 reference is for the recommendation to regularly check or the definition of "consistent" or something else; if the recommendation is new, then would BCP 14 keywords be appropriate?

Section 2

   o The AD flag bit in a response cannot be trusted to mean anything as some servers incorrectly copy the flag bit from the request to the response [RFC1035], [RFC4035].

Would it be worth a 6840 ref here as well (to catch setting AD in a request, even though that's not exactly what's being mentioned)?

Section 3.1.2

(Do we want to remind the reader on the NOERROR vs. NXDOMAIN rules? "No" is probably acceptable. I see we do so later, in Section 7, so even a forward reference might suffice.)

Where's the first reference/mention of Meta-RRs? I see RFC 2929 (obsoleted, transitively, by 6895) that we cite for the "range reserved for private use" but not for terminology. Even RFC 8499 (which we don't cite) only has "meta-RR" in a parenthetical in the description of OPT.

Section 3.1.5

micro-nit: I guess firewalls don't exactly count as "nameservers", which seems to be the claimed scope for this document.

Section 3.2.1

This section threw me a bit, at first, as the 3.1.x had led me to expect "nameservers should behave in this way", but this section is "here is how to tell if a nameserver is misbehaving". That's not necessarily a problem, just a ... comment :)

Section 3.2.6

   Some nameservers fail to copy the DO bit to the response despite clearly supporting DNSSEC by returning an RRSIG records to EDNS queries with DO=1.

I'm not sure if we also want an explicit "nameservers should copy the DO bit to the response when they support DNSSEC".

Section 3.2.7

[similarly an affirmative statement of what nameservers should do might be appropriate here.]

Section 4

   Firewalls and load balancers can affect the externally visible behaviour of a nameserver. Tests for conformance should to be done from outside of any firewall so that the system is tested as a whole.

(These are conformance tests run by the nameserver's own operator, or externally-driven tests, too?)

   However, there may be times when a nameserver mishandles messages with a particular flag, EDNS option, EDNS version field, opcode, type or class field or combination thereof to the point where the integrity of the nameserver is compromised. Firewalls should offer the ability to selectively reject messages using an appropriately constructed response based on all these fields while awaiting a fix from the nameserver vendor.

I would suggest reiterating that this is "with a response" vs. "drop the packet silently".

Section 5

   Ideally, Operators should run these tests against a packet scrubbing service to ensure that these tests are not seen as attack vectors.

It feels like maybe the most we can say here is "not seen as attack vectors during normal operation". We can't exclude the possibility that some actor decides to generate a flood of messages that happens to match the test behavior (whether by accident or design), which seems fairly likely to lead to blocking of the test-behavior traffic as collateral damage.

Section 7

   If the server does not support EDNS at all, FORMERR is the expected error code. That said a minimal EDNS server implementation requires parsing the OPT records and responding with an empty OPT record in the additional section in most cases. There is no need to interpret any EDNS options present in the request as unsupported EDNS options are expected to be ignored [RFC6891]. Additionally EDNS flags can be ignored. The only part of the OPT record that needs to be examined is the version field to determine if BADVERS needs to be sent or not.

It seems like there's an implied "so providing minimal EDNS support is pretty trivial and you ought to do so already" in here; do we want to make such sentiment explicit?

Section 8

   Testing is divided into two sections. "Basic DNS", which all servers should meet, and "Extended DNS", which should be met by all servers that support EDNS (a server is deemed to support EDNS if it gives a valid EDNS response to any EDNS query). If a server does not support EDNS it should still respond to all the tests.

Is this "respond to all the tests, albeit with [error responses]"?

   The tests below use dig from BIND 9.11.0.

I guess this version could become important if some future version starts setting a new flag by default (that would need to be suppressed if that version of dig was used for many of these tests).

Section 8.1.2

   Ask for the TYPE1000 RRset at the configured zone's name. This query is made with no DNS flag bits set and without EDNS. TYPE1000 has been chosen for this purpose as IANA is unlikely to allocate this type in the near future and it is not in a range reserved for private use [RFC6895]. Any unallocated type code could be chosen for this test.

Is there a risk that since we document TYPE1000 like this some server will implement "respond to TYPE1000" without implementing the actual desired behavior?

Section 8.1.3.2

   AD use in queries is defined in [RFC6840].

(Knowing this would have been helpful up in the toplevel section 8 where we talk about one or both AD=1 and DO=1 being a signal to expect AD=1.)

Section 8.2.3, 8.2.6

[Same comment about option code 100 as for TYPE1000 above; the same response is assumed.]

Section 9

   When notification is not effective at correcting problems with a misbehaving name server, parent operators can choose to remove NS record sets (and glue records below) that refer to the faulty server until the servers are fixed. This should only be done as a last resort and with due consideration, as removal of a delegation can have unanticipated side effects. [...]

I have mixed feelings about recommending "cut you off until you fix your bugs" as an option, but not strongly enough to override WG consensus.
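
To make the Section 7 remark above concrete (minimal EDNS support only needs the version field of the OPT record), here is a hedged Python sketch. It assumes the RFC 6891 layout of the OPT TTL field (8 bits of extended RCODE, 8 bits of version, 16 bits of flags) and a server that implements only EDNS version 0. The function name and its return shape are hypothetical and do not come from the draft.

```python
BADVERS = 16  # extended RCODE assigned by RFC 6891


def minimal_edns_reply(request_opt_ttl: int, supported_version: int = 0) -> tuple[int, int]:
    """Decide the reply RCODE and OPT TTL for a request that carried an OPT record.

    request_opt_ttl   -- the 32-bit TTL field of the request's OPT record
    supported_version -- highest EDNS version this (hypothetical) server implements

    Returns (header_rcode, reply_opt_ttl). EDNS options and flags in the
    request are deliberately ignored; only the version field is examined.
    """
    requested_version = (request_opt_ttl >> 16) & 0xFF
    rcode = BADVERS if requested_version > supported_version else 0

    header_rcode = rcode & 0x0F   # low 4 bits go in the DNS header
    extended_rcode = rcode >> 4   # upper 8 bits go in the OPT TTL
    # Reply advertises the highest version the server implements; flags are left clear.
    reply_opt_ttl = (extended_rcode << 24) | (supported_version << 16)
    return header_rcode, reply_opt_ttl
```

A version 0 request, with or without DO set, yields a zero RCODE and an empty OPT; a request with a higher version yields BADVERS split across the header and the OPT TTL. A server that supports DNSSEC would additionally need to copy the DO bit, which this sketch does not do.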
Thanks for the draft. It's always good for congestion controls if congestion-based packet losses are disambiguated from other types.

A few nits:
- Section 1 has a number of acronyms without clear references (DANE, SPF, TLSA). Please define them on first use.
- Sec. 3.1.5. Please add a comma after "attempts"
- Sec 3.2.4 uses lower case versions of the normative keywords. Selecting a synonym would improve it.