DNS Stateful Operations
draft-ietf-dnsop-session-signal-20

Yes

Warren Kumari

No Objection

(Adam Roach)
(Alissa Cooper)
(Deborah Brungard)
(Suresh Krishnan)
(Terry Manderson)

Note: This ballot was opened for revision 11 and is now closed.

Warren Kumari
Yes
Spencer Dawkins Former IESG member
(was Discuss) Yes
Yes (2018-08-02 for -13) Unknown
Thanks for addressing my Discuss, and for considering my comments. I'll stick the Discuss text here for ease of archaeology. 

--- previous Discuss 

I really like this document, and think it's headed in the right direction. Of course I have four pages of comments, because reasons, but the only part I'm really confused about is this one ...

I would have thought that if you end up with a different endpoint because your anycast address now resolves differently, the new endpoint would have to have shared a lot of state with the previous endpoint, for this to work:

  When an anycast service is configured on a particular IP address and
   port, it must be the case that although there is more than one
   physical server responding on that IP address, each such server can
   be treated as equivalent.  If a change in network topology causes
   packets in a particular TCP connection to be sent to an anycast
   server instance that does not know about the connection, the normal
   keepalive and TCP connection timeout process will allow for recovery.

What I would have expected to happen, is that the new endpoint sees a packet arrive that's not on a synchronized TCP connection, and immediately responds with a RST (reset), rather than the normal keepalive and TCP connection timeout process happening. That's also the way I'm reading https://tools.ietf.org/html/rfc7828#section-3.6. Is that not the way it's working for anycast these days?

--- end of previous Discuss

Everything else is a comment, so non-blocking, and please do the right thing. 

This is a nit, and your answer could be "no", and that's fine, but in some places this document uses "DSO keepalive", and in other places, "keepalive" with no qualifier. It's likely that less confusion would result if you could consistently call this "DSO keepalive", so that it is clearly NOT a TCP keepalive. Do the right thing, of course. 

Is the expectation that DSO would also be used in DNS over HTTP? I'm reading 

  At the time of publication, DSO is specified only for DNS over TCP
   [RFC1035] [RFC7766], and for DNS over TLS over TCP [RFC7858].  Any
   use of DSO over some other connection technology needs to be
   specified in an appropriate future document.

and noticing that https://tools.ietf.org/html/draft-ietf-doh-dns-over-https-12 is currently in IETF Last Call.

This next one is well within the "Spencer wouldn't have done it this way, but Spencer's not the working group, or the IETF" range, but 

  However, in the typical case a server will not know in advance
   whether a client supports DSO, so in general, unless it is known in
   advance by other means that a client does support DSO, a server MUST
   NOT initiate DSO request messages or DSO unacknowledged messages
   until a DSO Session has been mutually established by at least one
   successful DSO request/response exchange initiated by the client, as
   described below.  Similarly, unless it is known in advance by other
   means that a server does support DSO, a client MUST NOT initiate DSO
   unacknowledged messages until after a DSO Session has been mutually
   established.

seems fragile, especially in environments where clients can come and go, and servers may be addressed using anycast (so I knew in advance that the four servers at that anycast address supported DSO, but somebody installed a fifth server that does not). Is that unlikely to be a problem?

I'm sure 

  A single server may support multiple services, including DNS Updates
   [RFC2136], DNS Push Notifications [I-D.ietf-dnssd-push], and other
   services, for one or more DNS zones.  When a client discovers that
   the target server for several different operations is the same target
   hostname and port, the client SHOULD use a single shared DSO Session
   for all those operations.  A client SHOULD NOT open multiple
   connections to the same target host and port just because the names
   being operated on are different or happen to fall within different
   zones.  This requirement is to reduce unnecessary connection load on
   the DNS server.

is correct from the server side, but perhaps it's also worth noting that using multiple TCP connections unnecessarily increases the chances that data transfers happen during TCP slow start. If only one or two packets are being exchanged, that doesn't matter, but as more packets are exchanged, the difference increases, because congestion windows will grow more rapidly if fewer connections are used. 

I appreciate the inclusion of 5.4.  DSO Response Generation

But I've gotta ask. In the last paragraph of that section, I see 

   o  Use a networking API that lets the receiver signal to the TCP
      implementation that the receiver has received and processed a
      client request for which it will not be generating any immediate
      response.  This allows the TCP implementation to operate
      efficiently in both cases; for requests that generate a response,
      the TCP ACK, window update, and DSO response are transmitted
      together in a single TCP segment, and for requests that do not
      generate a response, the application-layer software informs the
      TCP implementation that it should go ahead and send the TCP ACK
      and window update immediately, without waiting for the Delayed ACK
      timer.  Unfortunately it is not known at this time which (if any)
      of the widely-available networking APIs currently include this
      capability.

I would love to know if there are any widely-available network APIs that include this capability, before including this text in a standards-track RFC. Do you need help chasing this down?

The text in 6.1.  DSO Session Initiation seems rough to me, for a couple of reasons. 

   The client may perform as many DNS operations as it wishes using the
   newly created DSO Session.  Operations SHOULD be pipelined (i.e., the

I don't understand why this would be a SHOULD. At least from the client's perspective, it's not needed for interoperation. 

   client doesn't need wait for a response before sending the next
   message).  The server MUST act on messages in the order they are
   transmitted, but responses to those messages SHOULD be sent out of
   order when appropriate.

Is it correct to say that "responses to those messages SHOULD be sent when they become available, even if the responses are sent out of order"? If not, I'm probably missing what "when appropriate" means.

I'm a bit mystified by this text in 6.2.  DSO Session Timeouts

  In the usual case where the inactivity timeout is shorter than the
   keepalive interval, it is only when a client has a very long-lived,
   low-traffic, operation that the keepalive interval comes into play,
   to ensure that a sufficient residual amount of traffic is generated
   to maintain NAT and firewall state and to assure client and server
   that they still have connectivity to each other.

I think the basics are correct - the inactivity timer and (DSO) keepalive interval are independent - but I'm struggling to think of a reason to send (DSO) keepalives that's NOT tied to maintaining NAT/firewall state, and there's a lot of text before the paragraph that mentions NAT/firewall, that talks about why either interval might be longer or shorter than the other, without considering NAT/firewall. Am I missing something here? 

... and, now that I keep reading, 6.5.2.  Values for the Keepalive Interval
does a much better job of explaining how a (DSO) keepalive interval should be selected - I think you could reasonably delete most of the text about (DSO) keepalive intervals in section 6.2, and at most provide a forward pointer to 6.5.2. 

(As an aside, I think you probably want to cite https://tools.ietf.org/html/bcp142 as the operative recommendation for NAT behaviour toward TCP, since https://tools.ietf.org/html/rfc5382 has been updated)

I found this text 

  For long-lived DNS Stateful operations (such as a Push Notification
   subscription [I-D.ietf-dnssd-push] or a Discovery Relay interface
   subscription [I-D.ietf-dnssd-mdns-relay]), an operation is considered
   in progress for as long as the operation is active, until it is
   cancelled.  This means that a DSO Session can exist, with active
   operations, with no messages flowing in either direction, for far
   longer than the inactivity timeout, and this is not an error.  This
   is why there are two separate timers: the inactivity timeout, and the
   keepalive interval.  Just because a DSO Session has no traffic for an
   extended period of time does not automatically make that DSO Session
   "inactive", if it has an active operation that is awaiting events.

to be extremely helpful, but it's 28 pages into the document. Is there a place earlier in the document that describes these timers, where you could place this text? Maybe section 3/Terminology isn't the right place, but maybe there is a right place toward the front of the document. 
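
The distinction the quoted paragraph draws can be sketched in a few lines. This is only a simplified illustration, not the draft's state machine: the class name, the field names, and the 60-second keepalive value are assumptions made for the example, and idleness is measured from the last message rather than from the end of the last operation.

```python
from dataclasses import dataclass


@dataclass
class SessionTimers:
    """Hypothetical model of the two independent DSO session timers."""
    inactivity_timeout: float    # seconds
    keepalive_interval: float    # seconds
    last_traffic: float = 0.0    # time of the last message in either direction
    active_operations: int = 0   # long-lived operations still in progress

    def is_inactive(self, now: float) -> bool:
        # A session with an active operation is never "inactive",
        # however long it has been silent.
        if self.active_operations > 0:
            return False
        return now - self.last_traffic >= self.inactivity_timeout

    def keepalive_due(self, now: float) -> bool:
        # The keepalive interval depends only on silence, not on operations.
        return now - self.last_traffic >= self.keepalive_interval
```

With a 15-second inactivity timeout and one active subscription, `is_inactive()` stays false indefinitely, while `keepalive_due()` still becomes true after each silent keepalive interval.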

I'm not understanding why the SHOULDs are not MUSTs in this text:

  If, at any time during the life of the DSO Session, twice the
   inactivity timeout value (i.e., 30 seconds by default), or five
   seconds, if twice the inactivity timeout value is less than five
   seconds, elapses without there being any operation active on the DSO
   Session, the server SHOULD consider the client delinquent, and SHOULD
   forcibly abort the DSO Session.

Perhaps part of my confusion is that I'm not sure what it means to "consider the client delinquent", but NOT to "forcibly abort the DSO session". But there are several "will forcibly abort"s in section 6.4.2, that sound more like MUST than SHOULD.
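
For reference, the threshold described in the quoted text reduces to a single expression. The function name below is hypothetical, and the sketch says nothing about whether acting on the threshold is a SHOULD or a MUST.

```python
def delinquency_threshold(inactivity_timeout: float) -> float:
    """Seconds without any active operation before the client counts as delinquent.

    Twice the inactivity timeout, but never less than five seconds,
    as described in the quoted paragraph.
    """
    return max(2 * inactivity_timeout, 5.0)
```

With the default inactivity timeout implied by the quote (15 seconds), this gives 30 seconds; with a 1-second timeout the five-second floor applies.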

I don't think the MUST NOT in 

  Normally a server MUST NOT close a DSO Session with a client.  A
   server only causes a DSO Session to be ended in the exceptional
   circumstances outlined below. 

is quite right. Given that you have a bulleted list of reasons why a server would violate the MUST NOT immediately following this sentence, you might want to say "Normally a server does not close" here.
Adam Roach Former IESG member
No Objection
No Objection (for -12) Unknown
Alexey Melnikov Former IESG member
No Objection
No Objection (2018-07-29 for -12) Unknown
Thank you for this document, I find it to be well written and useful.

One small nit:

In Section 9.3:

   Requests to register additional new DSO Type Codes in the
   "Unassigned" range 0040-F7FF are to be recorded by IANA after Expert
   Review [RFC8126].  At the time of publication of this document, the
   Designated Expert for the newly created DSO Type Code registry is
   [*TBD*].

The last sentence is not going to age well (even though it is "TBD").
I think it should be removed from the document and such instructions
should be sent separately to IANA in the IESG ballot.
Alissa Cooper Former IESG member
No Objection
No Objection (for -13) Unknown
Ben Campbell Former IESG member
No Objection
No Objection (2018-08-01 for -12) Unknown
Substantive Comments:

§5.1,"
   If the RCODE is set to any value other than NOERROR (0) or DSOTYPENI
   ([TBA2] tentatively 11), then the client MUST assume that the server
   does not implement DSO at all.  In this case the client is permitted
   to continue sending DNS messages on that connection, but the client
   SHOULD NOT issue further DSO messages on that connection."

Why is the SHOULD NOT not MUST NOT? Do you envision situations where it might make sense to violate the SHOULD NOT?

§5.1.2: Are there security considerations for using zero round trip handshakes?

§5.1.3: "In cases where a DSO session is terminated on one side of a
   middlebox, and then some session is opened on the other side of the
   middlebox in order to satisfy requests sent over the first DSO
   session, any such session MUST be treated as a separate session."

By what? How would the ultimate endpoints know?

§5.2.2.4: "
   If DSO unacknowledged message is received containing an unrecognized
   Primary TLV, with a zero MESSAGE ID (indicating that no response is
   expected), then this is a fatal error and the recipient MUST forcibly
   abort the connection immediately."
 
Doesn't that make extensibility difficult? What if an extension adds a new unacknowledged message type that uses a new primary TLV?

§6.1: "The server MUST act on messages in the order they are
   transmitted, but responses to those messages SHOULD be sent out of
   order when appropriate."

The SHOULD seems more like a MAY, unless you mean for implementors to go looking for reasons to do things out of order.

§6.4.1: Does "consider delinquent" entail any concrete actions beyond resetting the connection?

§12.2: It seems like TLS1.3 should be a normative reference, given that it's used to describe the condition for a normative statement.


Editorial Comments:

- General: Please watch for comma splices.

§1:
-- " It is likely that future updates to these tools will add the ability to recognize, decode, and display the DSO data."
That sentence may not age well.
-- " A goal of this approach is to avoid the operational issues that have befallen EDNS(0), particularly relating to middlebox behaviour."
Is there something that can be cited to describe the operational issues?

§2: There's a fair amount of procedure, including normative statements, described in the terminology section. That would be better reserved for the sections that are more about procedure. Some readers only use terminology sections to look up terms on demand; such users may miss the procedure bits.

§5.1, "DSO messages MUST be carried in only protocols "
"MUST ... only" constructions can be ambiguous. Consider reformulating into a "MUST NOT" construction.

§5.1.3:
-- 3rd paragraph: Should "stateless" be "stateful"?
-- "will have no way to
   know on which connection to forward a DSO message, and therefore will
   not be able to behave incorrectly."
That seems like famous last words :-)

§5.2.3, first paragraph: The MUST NOT seems more like a statement of fact.

§5.3: The section has a number of redundant normative keywords.  Please consider stating them authoritatively in one place, and making the others descriptive
Benjamin Kaduk Former IESG member
(was Discuss) No Objection
No Objection (2018-10-01 for -16) Unknown
Thank you for resolving my Discuss points!

[COMMENT section unchanged from original ballot; contents may no longer apply]

Six authors exceeds five, so "there is likely to be discussion" about this
being too large a number of authors.  What is the justification for the
author count?

Do we need to specify some GREASE (per draft-ietf-tls-grease) for new TLV
types in order to ensure the proper handling of unknown types?

Section-by-section comments follow.

Section 1

A DoH reference seems timely/apt.  (But maybe then it is only "Some such
transports" that can offer persistent sessions.)

Maybe give some examples of advantages of server-initiated messages?  Are
we talking about letting the server push records with larger TTLs or
notifying when the response to a query is changing, or just more mundane
keepalive-type things?

Section 3

   The terms "initiator" and "responder" correspond respectively to the
   initial sender and subsequent receiver of a DSO request message or
   unacknowledged message, regardless of which was the "client" and
   "server" in the usual DNS sense.

We just defined "client" and "server" explicitly (without reference to the
"usual DNS sense"), so probably it's best to have this definition refer to
the previous client/server definitions or clarify that the above
definitions match the "usual DNS sense".

   When an anycast service is configured on a particular IP address and
   port, it must be the case that although there is more than one
   physical server responding on that IP address, each such server can
   be treated as equivalent.  If a change in network topology causes
   packets in a particular TCP connection to be sent to an anycast
   server instance that does not know about the connection, the normal
   keepalive and TCP connection timeout process will allow for recovery.
   If after the connection is re-established, [...]

Perhaps clarifying that "recovery" means "detecting the broken session and
starting a new one" would be useful?  (I guess Spencer's DISCUSS takes this
a different direction.)

   DSO unacknowledged messages are unidirectional messages and do not
   generate any response.

"Do not generate any response" at the DNS layer; any reason to mention that
TCP will still ACK the bytes (or rather, that the "reliable" part of the
data stream will need to do so)?

Section 5.1

   DSO messages MUST be carried in only protocols and in environments
   where a session may be established according to the definition given
   above in the Terminology section (Section 3).

nit: is this "in only" or "only in"?

   If the RCODE is set to any value other than NOERROR (0) or DSOTYPENI
   ([TBA2] tentatively 11), then the client MUST assume that the server
   does not implement DSO at all.  In this case the client is permitted
   to continue sending DNS messages on that connection, but the client
   SHOULD NOT issue further DSO messages on that connection.

I'm confused how the server would still have proper framing for subsequent
DNS messages, since the DSO TLVs would be "spurious extra data" after a
request header and potentially subject to misinterpretation as the start of
another DNS message header.

Section 5.1.3

It is probably worth explicitly noting that a middlebox MUST NOT
make/forward a DSO request with TLVs that it does not implement.

Section 5.2

   If a DSO message is received where any of the count fields are not
   zero, then a FORMERR MUST be returned, unless a future IETF Standard
   specifies otherwise.

This seems like ... not the conventional wording for this behavior, and
subject to large debates about the meaning of "IETF Standard".
(Similar language is used elsewhere, too.)
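
Whatever wording is chosen, the check itself is small. The sketch below assumes the standard 12-byte DNS header layout from RFC 1035 (ID, flags, then four 16-bit count fields); the function name is hypothetical, and returning FORMERR for a truncated header is an assumption of the example rather than something the quoted text states.

```python
import struct

NOERROR = 0
FORMERR = 1  # RCODE 1, Format Error (RFC 1035)


def check_dso_counts(message: bytes) -> int:
    """Return FORMERR if any count field in the DNS header is non-zero."""
    if len(message) < 12:
        # Assumption for this sketch: a truncated header is also a format error.
        return FORMERR
    _msg_id, _flags, qdcount, ancount, nscount, arcount = struct.unpack(
        "!HHHHHH", message[:12]
    )
    if qdcount or ancount or nscount or arcount:
        return FORMERR
    return NOERROR
```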

Section 5.2.2

The start of this section seems to duplicate a lot of Section 3 -- e.g.,
the specification of the "Primary TLV"; request/response/unacknowledged as
the types of messages; etc.  It's unclear to me that this duplication of
content is helpful to the reader.

   Unacknowledged messages MUST NOT be used "speculatively" in cases
   where the sender doesn't know if the receiver supports the Primary
   TLV in the message, because there is no way to receive any response
   to indicate success or failure of the request message (the request
   message does not contain a unique MESSAGE ID with which to associate
   a response with its corresponding request).  Unacknowledged messages
   are only appropriate in cases where the sender already knows that the
   receiver supports, and wishes to receive, these messages.

Having gone to the trouble to explicitly define (twice!) "request",
"response" and "unacknowledged", it's pretty confusing to then use the
English word "request" to refer to an "unacknowledged" message.

Section 5.2.2.1

   The specification for a TLV states whether that DSO-TYPE may be used
   in "Primary", "Additional", "Response Primary", or "Response
   Additional" TLVs.

Perhaps this could be wordsmithed to avoid accidental misreading as
exclusive-or?

Section 5.3.1

   When a DSO unacknowledged message is unsuccessful for some reason,
   the responder immediately aborts the connection.

Doesn't this kill the client/server pairing for an hour?  "For some reason"
is very vague to induce such behavior, and could include transient internal
errors.

Section 6.1

When would it be appropriate for the server to send responses out of order?

Section 6.6.1.1

nit: "RECONNECT DELAY" is used with inconsistent capitalization.

Section 7.1

The description of the two timeout fields is predicated on understanding
that it is only the response's incarnation of them that is authoritative;
as an editorial matter, it might be nice to introduce this fact earlier.

Section 7.1.1

It seems like we could consolidate the "equal to" and "greater than" cases
into "greater than or equal to".

Section 7.2.1

   A client MUST NOT send a Retry Delay DSO request message or DSO
   unacknowledged message to a server. [...]

nit: it must not send it as a response, either, so perhaps "MUST NOT send a
Retry Delay DSO message to a server" is shorter and better.

Section 9.3

I thought IANA liked to see a "registration template" for what subsequent
registrations in a registry being created will need to specify.  (That
said, the IANA state is "IANA OK - Actions Needed", and one might expect
that they know better than I do...)

Section 10

I'm a little surprised to not see some discussion that this mechanism
encourages the maintenance of persistent connections on DNS servers, which
has impact on resource consumption/load, but is not expected to be
problematic because the server can tell the clients to go away if needed.
Deborah Brungard Former IESG member
No Objection
No Objection (for -12) Unknown
Eric Rescorla Former IESG member
No Objection
No Objection (2018-07-31 for -12) Unknown
Rich version of this review at:
https://mozphab-ietf.devsvcdev.mozaws.net/D4358



IMPORTANT
S 5.3.
>      field set to zero, and MUST NOT elicit a response.
>   
>      Every DSO request message (QR=0) with a nonzero MESSAGE ID field is
>      an acknowledged DSO request, and MUST elicit a corresponding response
>      (QR=1), which MUST have the same MESSAGE ID in the DNS message header
>      as in the corresponding request.

How do I handle duplicate message IDs on the responder? Did I miss
where you said this? Is this just an error?


S 9.3.
>   
>      Requests to register additional new DSO Type Codes in the
>      "Unassigned" range 0040-F7FF are to be recorded by IANA after Expert
>      Review [RFC8126].  At the time of publication of this document, the
>      Designated Expert for the newly created DSO Type Code registry is
>      [*TBD*].

What is the standard for the expert to follow?

COMMENTS
S 1.
>      is appended to the end of the DNS message header.  When displayed
>      using packet analyzer tools that have not been updated to recognize
>      the DSO format, this will result in the DSO data being displayed as
>      unknown additional data after the end of the DNS message.  It is
>      likely that future updates to these tools will add the ability to
>      recognize, decode, and display the DSO data.

I'm sure you will get to this soon, but what are the backward
compatibility implications for the two endpoints?


S 3.
>      The unqualified term "session" in the context of this document means
>      the exchange of DNS messages over a connection where:
>   
>      o  The connection between client and server is persistent and
>         relatively long-lived (i.e., minutes or hours, rather than
>         seconds).

This is a surprising taxonomy. I would assume that some of the options
you are proposing would be relevant with a 30s connection (very long
by HTTP standards!)


S 3.
>      Where this specification says, "close gracefully," that means sending
>      a TLS close_notify (if TLS is in use) followed by a TCP FIN, or the
>      equivalents for other protocols.  Where this specification requires a
>      connection to be closed gracefully, the requirement to initiate that
>      graceful close is placed on the client, to place the burden of TCP's
>      TIME-WAIT state on the client rather than the server.

Does this mean that the server will ask the client to close?


S 3.
>      connection to be closed gracefully, the requirement to initiate that
>      graceful close is placed on the client, to place the burden of TCP's
>      TIME-WAIT state on the client rather than the server.
>   
>      Where this specification says, "forcibly abort," that means sending a
>      TCP RST, or the equivalent for other protocols.  In the BSD Sockets

Because you bother to mention TLS above, what about non-close_notify
TLS alerts?
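
As background on the "forcibly abort" definition quoted above: on BSD-style sockets, a common way to make close() send a TCP RST instead of a FIN is to enable SO_LINGER with a zero linger time first. The sketch below is only an illustration. The function name is hypothetical, the `struct linger` layout of two C ints is an assumption that holds on Linux, the BSDs, and macOS but not on Windows, and it does not address TLS alerts at all.

```python
import socket
import struct


def forcibly_abort(sock: socket.socket) -> None:
    """Close a TCP socket so that the peer sees a RST rather than a FIN."""
    # struct linger { int l_onoff; int l_linger; }: linger on, zero seconds.
    # Assumed layout for Unix-like systems; Windows uses a different struct.
    linger = struct.pack("ii", 1, 0)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, linger)
    sock.close()
```

A peer blocked in recv() on the other end of such a connection typically sees a connection-reset error rather than an orderly end of stream.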


S 3.
>      the server's listening socket.
>   
>      The terms "initiator" and "responder" correspond respectively to the
>      initial sender and subsequent receiver of a DSO request message or
>      unacknowledged message, regardless of which was the "client" and
>      "server" in the usual DNS sense.

Might be helpful to say earlier that this is a request/response
protocol


S 3.
>   
>      DNS Stateful Operations uses three kinds of message: "DSO request
>      messages", "DSO response messages", and "DSO unacknowledged
>      messages".  A DSO request message elicits a DSO response message.
>      DSO unacknowledged messages are unidirectional messages and do not
>      generate any response.

This would be useful further up.


S 5.1.
>   
>      DNS over plain UDP [RFC0768] is not appropriate since it fails on the
>      requirement for in-order message delivery, and, in the presence of
>      NAT gateways and firewalls with short UDP timeouts, it fails to
>      provide a persistent bi-directional communication channel unless an
>      excessive amount of keepalive traffic is used.

Note that this is going to make things not work super-well with DNS-
over-QUIC unless you use one stream only.


S 5.1.
>   
>      If the RCODE is set to any value other than NOERROR (0) or DSOTYPENI
>      ([TBA2] tentatively 11), then the client MUST assume that the server
>      does not implement DSO at all.  In this case the client is permitted
>      to continue sending DNS messages on that connection, but the client
>      SHOULD NOT issue further DSO messages on that connection.

Why is this a SHOULD and not a MUST?


S 5.1.3.
>      to any problems that could be result from the inadvertent replay that
>      can occur with zero round-trip operation.
>   
>   5.1.3.  Middlebox Considerations
>   
>      Where an application-layer middlebox (e.g., a DNS proxy, forwarder,

I'm having trouble with this section. Is it a set of requirements on
middleboxes or statements of fact? If the latter, it seems like there
are a bunch of ways for middleboxes to mess things up.


S 5.4.
>         generate a response, the application-layer software informs the
>         TCP implementation that it should go ahead and send the TCP ACK
>         and window update immediately, without waiting for the Delayed ACK
>         timer.  Unfortunately it is not known at this time which (if any)
>         of the widely-available networking APIs currently include this
>         capability.

Are you going to make a recommendation here?


S 6.6.1.2.
>      client a Retry Delay message, or by forcibly aborting the underlying
>      transport connection) the client SHOULD try to reconnect, to that
>      service instance, or to another suitable service instance, if more
>      than one is available.  If reconnecting to the same service instance,
>      the client MUST respect the indicated delay, if available, before
>      attempting to reconnect.

Do you want to recommend some sort of randomness around this value to
avoid avalanche?
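
One possible shape for such randomness, purely as an illustration: add a random amount on top of the indicated delay, so that clients disconnected at the same moment do not all return at once while each still respects the server's delay. The function name, the 10% default, and the choice of a uniform distribution are assumptions of this sketch, not anything the draft specifies.

```python
import random


def reconnect_delay(indicated_delay: float,
                    jitter_fraction: float = 0.1,
                    rng=random) -> float:
    """Return the indicated delay plus up to jitter_fraction of extra wait.

    The result is never shorter than indicated_delay, so the server's
    requested delay is still respected.
    """
    jitter = rng.uniform(0, indicated_delay * jitter_fraction)
    return indicated_delay + jitter
```

With an indicated delay of 60 seconds and the default fraction, clients would reconnect somewhere between 60 and 66 seconds later.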


S 8.2.
>      The table below indicates, for each of the three TLVs defined in this
>      document, whether they are valid in each of ten different contexts.
>   
>      The first five contexts are requests or unacknowledged messages from
>      client to server, and the corresponding responses from server back to
>      client:

Nit. This text is a tiny bit hard to read, because you don't list S-P,
etc.


S 10.
>      messages are subject to the same constraints as any other DNS-over-
>      TLS messages and MUST NOT be sent in the clear before the TLS session
>      is established.
>   
>      The data field of the "Encryption Padding" TLV could be used as a
>      covert channel.

Why not require this to be 0, then?
Ignas Bagdonas Former IESG member
No Objection
No Objection (2018-08-02 for -13) Unknown
What is the status of running code for this? Are there any known and interoperable implementations? This is in the context of IETF 102 plenary discussion on implementations and interoperability.
Mirja Kühlewind Former IESG member
(was Discuss) No Objection
No Objection (2018-12-06 for -19) Sent
Thanks for addressing my Discuss!

------------------------
These are old comments for the record:

1) sec 3: I really find it a bit strange that there is normative language about error handling (as well as in the "same service instance" definition part) in the terminology section. Maybe move those paragraphs somewhere else...? Also the part about "long-lived operations" and message types provides far more information than just terminology and I would recommend to also move it into its own section or maybe just have it as part of the intro.

2) Maybe call section 5 "Protocol specification" instead of "Protocol details"...?

3) Sec 5.1: "DSO messages MUST be carried in only protocols and in environments
   where a session may be established according to the definition given
   above in the Terminology section (Section 3)."
I don't get this. Which part of section 3? Given section 3 is on terminology and this is a normative statement, I would recommend to spell out here explicitly what is meant. Do you mean the protocol must be connection-oriented, reliable, and providing in-order delivery? Anything else? However, given that you say two paragraphs onwards that this spec is only applicable for the use with TCP and TLS/TCP, do you even need to specify these requirements normatively?

4) sec 5.1 "It is a common
   convention that protocols specified to run over TLS are given IANA
   service type names ending in "-tls"."
Not sure this is true. Isn't it usually just an "s" at the end? Or which registry are you talking about?

5) sec 5.1: "In some environments it may be known in advance by external means
   that both client and server support DSO, ..."
I guess the client and server also need to know if TLS is supported or not. Maybe spell this out as well.

6) sec 5.1: "... therefore either
   client or server may be the initiator of a message."
Maybe s/initiator of a message/initiator of a message exchange/

7) sec 5.1.2: "Having initiated a connection to a server, possibly using zero round-
   trip TCP Fast Open and/or zero round-trip TLS 1.3, a client MAY send
   multiple response-requiring DSO request messages to the server in
   succession without having to wait for a response to the first request
   message to confirm successful establishment of a DSO session."
Why is the ability to send more than one request related to TCP Fast Open/TLS1.3 0-RTT? These are two independent mechanisms to speed up processing. Mentioning TCP Fast Open/TLS1.3 0-RTT here is rather confusing. Likewise, I don't think that the sentence
"Similarly, DSO supports zero round-trip operation." describes quite the same thing.
 
8) Further please provide references to TCP Fast Open and TLS1.3 and maybe rephrase this paragraph to use normative language:
"Caution must be taken to ensure that DSO messages sent before the
   first round-trip is completed are idempotent, or are otherwise immune
   to any problems that could be result from the inadvertent replay that
   can occur with zero round-trip operation."
Maybe just:
"DSO messages sent with TLS1.3 0-RTT before the TLS handshake is completed or in TCP SYN data with use of TCP Fast Open MUST be idempotent."
However, this is actually already required by TLS1.3 and TFO, so there is after all no need to just rephrase this requirement here (at least not normatively). I think it would be more useful for every DSO message type to specify if it can be sent in 0-RTT or not and require this for specification of future TLVs.

9) sec 5.1.3: "In cases where a DSO session is terminated on one side of a
   middlebox, and then some session is opened on the other side of the
   middlebox in order to satisfy requests sent over the first DSO
   session, any such session MUST be treated as a separate session."
This sentence seems a bit nonsensical, which probably isn't great for a normative sentence. If a session is terminated on one side and another is opened on the other side, doesn't that mean that you have two sessions?

10) sec 5.1.3: "A middlebox that is not doing a strict pass-through will have no way to
   know on which connection to forward a DSO message, and therefore will
   not be able to behave incorrectly."
I'm not sure I understand this sentence. Can you clarify?

11) As already briefly mentioned by Ben, there is quite some redundant text in sec 5 (with 5.2) for handling of message IDs and TLVs. Given this text is normative, I would really recommend to only specify it clearly once. Please also check the rest of the doc for further things that are specified normatively multiple times. It usually makes it much clearer to specify it only once, at least normatively, at the appropriate position in the doc.

12) sec 5.3.1: "When a DSO unacknowledged message is unsuccessful for some reason, .."
What does unsuccessful mean here? Can you clarify?

13) sec 6.5.2: "A corporate DNS server that knows it is serving only clients on the
   internal network, with no intervening NAT gateways or firewalls, can
   impose a higher keepalive interval, because frequent keepalive
   traffic is not required."
I guess in this scenario it is probably most appropriate to not send any keep-alives…

14) sec 6.6: "   o  The server application software terminates unexpectedly (perhaps
      due to a bug that makes it crash)."
This bullet point does not really make sense to me, because once the app has crashed there is no way for the server to perform any actions anymore.

15) sec 7.1: "When a client is sending its second and subsequent Keepalive DSO
   requests to the server, the client SHOULD continue to request its
   preferred values each time. "
I don't understand the SHOULD here. What else should the client put in these fields instead?

16) sec 7.1.2: "Once a DSO Session has been established, if either
   client or server receives a DNS message over the DSO Session that
   contains an EDNS(0) TCP Keepalive option, this is a fatal error and
   the receiver of the EDNS(0) TCP Keepalive option MUST forcibly abort
   the connection immediately."
This is normatively specified multiple times (3?) in the doc. Please consider specifying it only once, where most appropriate (probably section 7.1.2).

16) sec 7.1: "The Keepalive TLV is not used as an Additional TLV."
This is redundant with the normative sentence in the next paragraph. Maybe just remove this sentence...?

17) +1 to Ben's discuss regarding the reconnection of clients. A TCP RST can be sent for many reasons and waiting for an hour seems not appropriate. I would rather recommend to log an error and directly try to reconnect.
Suresh Krishnan Former IESG member
No Objection
No Objection (for -12) Unknown
Terry Manderson Former IESG member
No Objection
No Objection (for -12) Unknown