INTERNET-DRAFT                                           Carsten Bormann
Expires: September 1998                          Universitaet Bremen TZI
                                                              March 1998


 Network News Distribution Protocol: Architecture and Design Guidelines
                     draft-bormann-mnnp-nndp-00.txt


Status of this memo

   This document is an Internet-Draft.  Internet-Drafts are working
   documents of the Internet Engineering Task Force (IETF), its areas,
   and its working groups.  Note that other groups may also distribute
   working documents as Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as ``work in progress.''

   To learn the current status of any Internet-Draft, please check the
   ``1id-abstracts.txt'' listing contained in the Internet-Drafts Shadow
   Directories on ftp.is.co.za (Africa), nic.nordu.net (Europe),
   munnari.oz.au (Pacific Rim), ds.internic.net (US East Coast), or
   ftp.isi.edu (US West Coast).

   Distribution of this document is unlimited.

Abstract

   This document describes an architecture and a set of protocols for
   distributing Netnews [RFC0977, RFC1036] via IP multicast enabled
   networks.  The architecture is designed to be useful in the global
   Internet.  In particular, it allows multiple news servers to
   cooperate on multicasting each new article only once.  To facilitate
   scalability to tens of thousands of news servers, it also provides
   for receive-only multicast participants (that continue to send
   articles via conventional NNTP).

   This document is a submission to the IETF MNNP working group.
   Comments are solicited and should be addressed to the working group's
   mailing list at ietf-mnnp@va.pubnix.com and/or the author.


1.  Introduction

   Netnews (or Usenet news) is one of the more important systems for
   electronic communication that make up what is now loosely called
   ``the Internet'' in the media.  Usenet operates by flood-distributing
   messages called articles between participating systems, called news
   servers.  Like any other element of the thriving Internet
   environment, Usenet is experiencing growth problems.

Bormann                                                         [Page 1]


INTERNET-DRAFT  NNDP: Architecture and Design Guidelines      March 1998

   It is widely recognized that NNTP, the article distribution system in
   use in the Usenet, is running into scaling problems.  Some ISPs
   report that NNTP accounts for between 7 and 12 % of their backbone
   traffic -- this for a data stream that is less than 64 kbit/s in
   total (see below).

   As Usenet is fundamentally a multicasting system, an obvious approach
   is to apply the emerging Internet network layer multicasting
   technology to Usenet distribution.  One experiment described in the
   literature, MUSE [firehose paper], transmitted Usenet articles as UDP
   multicast packets between participating sites.  While this experiment
   was moderately successful, it suffered from packet loss problems
   (that increase exponentially with the number of fragments generated
   from one article).  Also, a scalable security architecture was not
   defined for this experiment.
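
   The effect of fragment count on article loss can be illustrated with
   a short calculation.  The sketch below assumes independent packet
   losses and a hypothetical per-packet loss rate of 2 %; neither the
   loss model nor the numbers are taken from the MUSE experiment.

```python
def article_delivery_probability(fragments: int, loss_rate: float) -> float:
    """Probability that every fragment of an article arrives.

    Assumes each packet is lost independently with probability
    loss_rate (an illustrative model, not a measurement).
    """
    return (1.0 - loss_rate) ** fragments


if __name__ == "__main__":
    # Hypothetical loss rate of 2 % per packet.
    for n in (1, 5, 20, 50):
        p = article_delivery_probability(n, 0.02)
        print(f"{n:3d} fragments: delivered intact with probability {p:.3f}")
```

   Under this model the chance of receiving an unprotected article
   intact shrinks geometrically with the number of fragments, which is
   one motivation for adding error control to the distribution protocol.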

   This document defines an architecture and sketches two protocols to
   make network layer multicasting more useful for news distribution.
   The architecture will, in reference to an earlier experiment
   [newscaster], be called Newscaster-2 or simply Newscaster; the two
   protocols will be called NNDP (Network News Distribution Protocol)
   and NNDCP (Network News Distribution Coordination Protocol),
   respectively.

1.1.  Benefits of multicasting Netnews

   Distributing Netnews via network layer multicast provides a number of
   benefits.  For ISPs, Newscaster can help to significantly reduce the
   backbone NNTP load: Each article traverses each link (in the best
   case) only once instead of traversing the backbone links multiple
   times, once to each target news server.

   One other benefit of Newscaster will be reduced article propagation
   times -- while current NNTP servers can be very fast, Newscaster
   replaces multiple unicast hops between news servers by a single
   multicast hop.  As propagation times currently measure on the order
   of hours, a reduction to the order of minutes would be a nice
   achievement; a reduction below that (to seconds) is, however, not
   intended.  (As a side benefit, Newscaster will reduce the link
   bandwidth consumed by a leaf news receiver by using batching and
   compression and by reducing the NNTP/TCP/IP overhead incurred per
   article.)

1.2.  Basic Assumptions

   This document makes a number of assumptions about the basic technical
   parameters of the Netnews system.  We assume that the total number of
   new news articles to be distributed per day is in the low hundreds of
   thousands, i.e., one to a few articles per second.  We also assume
   that the total volume of those articles is on the order of hundreds
   of megabytes per day, i.e., tens to a few hundreds of kbit/s.
   Newscaster-2 is scalable beyond those numbers, but not infinitely so.
   [In particular, ``similar'' problems with different technical
   parameters (such as live stock price feeds) are not necessarily
   supported as efficiently as the actual worldwide Netnews system;
   solving such similar problems is explicitly a non-goal of the
   architecture.]
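
   The conversion from daily totals to average rates used above can be
   checked with a few lines of arithmetic.  The input figures in the
   sketch (300,000 articles and 500 megabytes per day) are hypothetical
   values chosen from within the assumed ranges; a megabyte is taken as
   10**6 bytes and a kbit as 1000 bits.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86400


def articles_per_second(articles_per_day: float) -> float:
    """Average article arrival rate over one day."""
    return articles_per_day / SECONDS_PER_DAY


def kbit_per_second(megabytes_per_day: float) -> float:
    """Average bit rate, with 1 megabyte = 10**6 bytes and 1 kbit = 1000 bits."""
    return megabytes_per_day * 1e6 * 8 / SECONDS_PER_DAY / 1000


if __name__ == "__main__":
    # Hypothetical sample values within the assumed ranges.
    print(f"{articles_per_second(300_000):.2f} articles/s")  # prints 3.47
    print(f"{kbit_per_second(500):.1f} kbit/s")              # prints 46.3
```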

   In addition, we assume that the concept of News servers that receive
   a full feed of news articles continues to be useful.  On-demand
   retrieval of news articles from neighboring servers is an interesting
   concept but outside the scope.  We believe that most News servers
   will want to receive most of the articles in the Netnews system;
   Newscaster does not support elaborate mechanisms to receive a
   specific subset of articles that covers exactly the newsgroups to
   which a News server is ``subscribed''.  (Newscaster does support
   partitioning the global news-feed into a few general subsets, such as
   alt.* and comp.*/sci.*.)

   One very important point in the design of a multicast Netnews
   distribution system is that, even if it takes off quickly, News
   server administrators will not simply turn off their existing, well-
   understood and robust system of NNTP feeds.  To make a feature out of
   what could be considered a bug, the Newscaster system is intended to
   work with and be supplementary to the NNTP system.  Newscaster-based
   news servers continue to speak NNTP to neighboring systems, using
   NNTP as a background scheme to fill in articles that they might have
   missed in the multicast distribution.  Therefore, Newscaster can be a
   much more light-weight protocol, as it need not be 100 % reliable.

1.3.  The multiple-entry problem

   Given that Newscaster is not replacing, but supplementing NNTP, and
   that the Newscaster system will for a long time be only a subset of
   the global Netnews system, the two distribution mechanisms need to
   cooperate.  The most significant problem here is that a single news
   article may be flood-distributed from its source via NNTP and reach
   multiple Newscaster systems at about the same time (observations in
   the live network show that this now often happens for multiple
   well-connected news servers within a second).  In a multicast
   scenario, there is no way to ask all the receivers whether they have
   already received an article.  Without further mechanisms, this would
   mean that Newscasters regularly send multiple redundant copies of a
   single article.

   This document proposes a coordination protocol between Newscaster
   systems to decide which Newscaster system distributes a particular
   article.  The coordination protocol is separate from the distribution
   protocol; receive-only sites need not be involved in the coordination
   protocol.  Note that correctness of the coordination protocol is not
   a prerequisite to correctness of the overall system, only to its
   efficiency, i.e., an occasional slip (multiple transmission of one
   article) is tolerable.




2.  The Newscaster Architecture


2.1.  Protocols

   Newscaster assumes an underlying IP multicast network such as the
   experimental Mbone and/or the operational IP multicast networks being
   deployed by many ISPs.  The multicast network is assumed to be able
   to sustain a rate-controlled low-bandwidth stream of packets for
   extended periods; the only form of congestion control envisaged is
   that receivers can drop out if they experience consistent congestion.

   To achieve a degree of performance in the presence of losses in the
   experimental Mbone, some form of error control is required.  To
   achieve good scalability without router support, the distribution
   protocol uses only forward error correction; as news servers gain
   multicast connectivity, they can simply start listening to the feed
   without having to send any (unicast or multicast) data.

   The coordination protocol does not need to be as scalable as the
   distribution protocol: It will be hard, if not impossible, to
   coordinate between a few tens of thousands of news servers, and
   various features of the distribution protocol (batching, compression,
   digital signatures) argue for limiting the number of active
   Newscaster servers.  We assume that new articles travel via NNTP to
   the nearest active Newscaster system and are multicast from there to
   the rest of the world.

   Appendix A defines a preliminary coordination protocol based on a
   multicast transport protocol called MTP-2.  (This protocol is a
   version of MTP (RFC1301) that was developed further to be more useful
   in WANs.  It allows multicasting a sequence of arbitrary size
   messages, each of which can consist of one or more multicast packets.
   The MTP-2 protocol provides a global sequencing of the messages, as
   well as global rate control.)

   Other coordination protocols may be defined.  Passive, receive-only
   Newscaster systems need not be aware of the coordination protocol
   being used -- they only need to understand the distribution protocol.
   In particular, the distribution protocol can be used from a single
   source to a local (e.g., per-ISP) set of receivers; the coordination
   protocol then becomes trivial.

2.2.  Operation of active Newscasters

   A news server actively participating in the Newscaster system is
   simply called a Newscaster.  The set of cooperating Newscasters is
   called the Newscaster Web.  The entire Web is a single news system
   from the point of view of RFC1036 Path headers.  For the global
   Newscaster Web, the name of the news system as it occurs in the Path
   header is "newscaster-2.mcast.net".  Additional local Newscaster Webs
   can be created, if needed, under different names.


   Each Newscaster checks each article it receives, via NNTP or other
   means, for an existing Newscaster Path header entry and, if one is
   present, immediately removes the article from further consideration
   in the Newscaster Web (in the INN implementation of the Netnews
   protocols, this is done automatically if the outgoing link is
   identified by the Web name, e.g. "newscaster-2.mcast.net").
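
   The Path header check described above might be sketched as follows.
   The function names are hypothetical, and the sketch assumes that Path
   entries are separated by "!" and compared without regard to case;
   an actual news system may apply different matching rules.

```python
from typing import List

WEB_NAME = "newscaster-2.mcast.net"  # name of the global Newscaster Web


def path_entries(path_value: str) -> List[str]:
    """Split the body of a Path header at '!' (assumed separator)."""
    return [entry.strip() for entry in path_value.split("!") if entry.strip()]


def already_in_web(path_value: str, web_name: str = WEB_NAME) -> bool:
    """True if the article has already passed through the given Web."""
    wanted = web_name.lower()
    return any(entry.lower() == wanted for entry in path_entries(path_value))


if __name__ == "__main__":
    # Hypothetical Path values for illustration.
    print(already_in_web("newscaster-2.mcast.net!news.example.org!not-for-mail"))
    print(already_in_web("news.example.org!not-for-mail"))
```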

   Those articles that do not contain a Newscaster Path header entry are
   then prepared for being multicast into the Web.  Several such
   articles will generally be sent together as a batch.  The
   coordination protocol is used to decide, for each article, whether it
   is actually this Newscaster which will distribute the article.  At
   the service interface, an implementation of a coordination protocol
   receives a set of message-ids (a tentative batch) as input and
   returns a (possibly empty) subset of the message-ids to be sent in an
   actual batch.  In general, each Newscaster should have only one set
   of articles in progress with the coordination protocol at any point
   in time.  Further articles arriving during processing by the
   coordination protocol should be collected for a future tentative
   batch.  Also, Newscasters should wait a few seconds for further
   articles to arrive before submitting a new batch to the coordination
   protocol.

   Actual batches are then formed from the selected articles according
   to RFC 1036, section 4.3.  They are then compressed using the gzip
   format (RFC1952) and digitally signed (see below).  Finally, they are
   distributed using the distribution protocol.
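
   A minimal sketch of the batching and compression steps follows.  It
   assumes the "#! rnews <count>" batch framing of RFC 1036 and uses the
   gzip module of the Python standard library; the function names are
   hypothetical, and the signing step is left out because the signature
   format is still an open issue.

```python
import gzip
from typing import Iterable


def make_batch(articles: Iterable[bytes]) -> bytes:
    """Concatenate articles, each prefixed by '#! rnews <byte count>'."""
    parts = []
    for article in articles:
        parts.append(b"#! rnews %d\n" % len(article))
        parts.append(article)
    return b"".join(parts)


def compress_batch(batch: bytes) -> bytes:
    """Compress a batch into the gzip format (RFC 1952)."""
    return gzip.compress(batch)


if __name__ == "__main__":
    # Two toy "articles"; real articles carry full headers and bodies.
    batch = make_batch([b"abc\n", b"hello\n"])
    payload = compress_batch(batch)
    assert gzip.decompress(payload) == batch
    print(len(batch), "bytes before compression")
```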

2.3.  Security

   Any system that transports Netnews must provide some basic security
   against spoofing attacks.  Since the multicasting system itself
   provides only very limited assurances that a source address is
   correct, we resort to cryptographic measures.

   Simple shared-secret authentication is not scalable -- in a
   production version, thousands of News server administrators would
   have to be in possession of the key.  Instead, a public key system is
   used, based on a web-of-trust security policy.

   In the current NNTP system, each news server administrator trusts its
   neighbor news server administrators to institute a good local usage
   policy and to respond to incidents in a manner that helps to preserve
   the integrity of the news system.  The transitive closure of this web
   of trust equals the actual connectivity of the news system.  If a
   news administrator misbehaves, he runs the risk of being
   disconnected.

   The Newscaster security policy attempts to mimic this existing policy
   by cryptographic means.  Instead of creating NNTP links to
   ``neighboring'' systems, a news administrator creates certificates
   for all the Newscasters that she trusts.  These certificates are
   regularly distributed in a newsgroup that is reserved for this

Bormann                                                         [Page 5]


INTERNET-DRAFT  NNDP: Architecture and Design Guidelines      March 1998

   purpose (such as, news.config.newscaster), ensuring they can be
   received even by sites that are not yet in possession of all the
   certificates.  Every receive-only system has to trust one or more
   sites (e.g., the Newscaster equivalent of a ``well-connected site'')
   to root its certificate chain.  If a receiver of a Newscaster batch
   does not find a certificate chain that verifies the signature of the
   batch, it discards the batch.
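
   The search for a certificate chain can be pictured as a graph search
   from the trusted sites to the signer of a batch.  The sketch below is
   only an illustration under simplifying assumptions: certificates are
   modelled as (issuer, subject) pairs of hypothetical site names, and
   the cryptographic verification of each certificate and of the batch
   signature is deliberately omitted.

```python
from collections import deque


def find_chain(trusted_roots, certificates, signer):
    """Return a list of sites leading from a trusted root to signer, or None.

    certificates is an iterable of (issuer, subject) pairs.  Signature
    checks on the certificates themselves are not modelled here.
    """
    issued = {}
    for issuer, subject in certificates:
        issued.setdefault(issuer, []).append(subject)

    queue = deque([root] for root in trusted_roots)
    seen = set(trusted_roots)
    while queue:
        chain = queue.popleft()
        if chain[-1] == signer:
            return chain
        for subject in issued.get(chain[-1], []):
            if subject not in seen:
                seen.add(subject)
                queue.append(chain + [subject])
    return None  # no chain found: the receiver discards the batch


if __name__ == "__main__":
    certs = [("hub", "a"), ("a", "b"), ("x", "y")]  # hypothetical sites
    print(find_chain(["hub"], certs, "b"))
    print(find_chain(["hub"], certs, "y"))
```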

   * Issue *: What type of key system and digital signature is used?
   Newscaster should provide relatively fast signature checking with
   modest, but (due to batching) not necessarily stellar signing
   performance.  The author would tend to use RFC1991 type (PGP)
   formats, using RSA and MD5.

3.  NNDP: The distribution protocol

   The NNDP distribution protocol is used to distribute payloads to all
   receivers.  Payloads will generally range from small to a few dozen
   kilobytes, but may be much larger if a large article needs to be
   transferred.  The job of the distribution protocol is to:

   -    partition the payload into packets that can be multicast without
        being fragmented on the way.  We assume an Internet-wide MTU of
        1280 bytes (based on the IPv6 minimum MTU) and reserve 80 bytes
        for header overhead (IP, UDP, other), leaving 1200 bytes for the
        distribution protocol data.

   -    add forward error correction.  We use Vandermonde matrices as
        implemented by Luigi Rizzo
        [http://www.iet.unipi.it/~luigi/vdm.tgz].  The amount of error
        correction to be added is a system parameter: For small batches,
        we always add at least one FEC packet.  For larger batches, the
        FEC overhead is defined by a constant expansion factor.  (This
        factor could be chosen to match the TCP equation at the rate
        intended.)  For very large batches, the batch is split into
        units which are independently subjected to FEC (packets from all
        units of a batch are interleaved to spread out the
        transmission).

   -    multicast the data at a defined rate (leaky bucket model).  It
        is the job of the coordination protocol to assign a rate to each
        batch to be sent.  (The rate should be relatively low to space
        out the packets, allowing FEC to work around burst losses.)

   -    enable reassembly/erasure processing at the receiver.  The
        batches are tagged by a unique, 80-bit global ID, which is
        assigned by the coordination protocol (e.g., global source
        ID/sequence number).  (Note that reassembly errors are not
        catastrophic, as an incorrectly reassembled batch will be
        rejected at signature check.)  Each packet carries a total batch
        size, a unit number within the batch, a packet number within the
        unit, and the number of packets to be sent per unit (N).
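
   The packet and FEC accounting in the list above might be sketched as
   follows.  Several values are assumptions made only for illustration:
   that the 24-byte NNDP header counts against the 1200 bytes, an
   expansion factor of 1.25, and a unit size of 128 KB.  The sketch only
   counts packets and orders them for transmission; the Vandermonde
   coding itself is not shown.

```python
import math

PACKET_BYTES = 1200      # from the text: 1280-byte MTU minus 80 bytes
HEADER_BYTES = 24        # assumption: NNDP header is part of the 1200 bytes
DATA_BYTES = PACKET_BYTES - HEADER_BYTES
EXPANSION = 1.25         # hypothetical constant expansion factor
UNIT_BYTES = 128 * 1024  # hypothetical unit size


def plan_units(batch_bytes: int):
    """Return one (data_packets, fec_packets) pair per unit of the batch."""
    units = []
    remaining = batch_bytes
    while remaining > 0:
        unit = min(remaining, UNIT_BYTES)
        k = math.ceil(unit / DATA_BYTES)
        # At least one FEC packet; otherwise the constant expansion factor.
        n = max(k + 1, math.ceil(k * EXPANSION))
        units.append((k, n - k))
        remaining -= unit
    return units


def send_order(units):
    """Interleave (unit_idx, pkt_idx) pairs round-robin across all units."""
    totals = [k + fec for k, fec in units]
    order = []
    for pkt in range(max(totals, default=0)):
        for unit, total in enumerate(totals):
            if pkt < total:
                order.append((unit, pkt))
    return order


if __name__ == "__main__":
    print(plan_units(1000))     # a small batch
    print(plan_units(200_000))  # a batch spanning two units
```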



                  distribution protocol packet layout
  0                   1                   2                   3
  0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |                           global ID                           |
 +                                                               +
 |                                                               |
 +                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |                               |               N               |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |         pkt idx               |        unit idx               |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |                   total batch size                            |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |                   rate                                        |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |                              data                             |
 |                              ....                             |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+


   (For a discussion of the rate parameter, see NNDCP below.)
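
   The packet layout shown above could be encoded and decoded as in the
   following sketch.  The field widths are read off the diagram (80-bit
   global ID, three 16-bit fields, two 32-bit fields); the use of
   network byte order, the field names, and the sample values are
   assumptions, as the text does not fix them.

```python
import struct
from collections import namedtuple

# 10-byte global ID, N, pkt idx, unit idx (16 bits each),
# total batch size and rate (32 bits each); "!" = network byte order.
HEADER = struct.Struct("!10sHHHII")

NndpHeader = namedtuple(
    "NndpHeader", "global_id n pkt_idx unit_idx total_batch_size rate"
)


def pack_packet(header: NndpHeader, data: bytes) -> bytes:
    """Prepend the fixed header to the packet data."""
    return HEADER.pack(*header) + data


def unpack_packet(packet: bytes):
    """Split a received packet into its header fields and its data."""
    header = NndpHeader(*HEADER.unpack_from(packet))
    return header, packet[HEADER.size:]


if __name__ == "__main__":
    # Hypothetical field values for illustration only.
    hdr = NndpHeader(b"\x01" * 10, 23, 4, 0, 20000, 64000)
    pkt = pack_packet(hdr, b"payload")
    print(HEADER.size, unpack_packet(pkt))
```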

   * Issue *: What is a good unit size?  E.g., 128 KB?  Should we
   actually use the TCP equivalence equation to compute an expansion
   factor from the rate?

4.  Acknowledgments

   This document has been prompted by the discussions in the MNNP BOF at
   the Washington IETF.  In particular, the author would like to thank
   Joe Malcolm for the thought-provoking discussions at this IETF.

5.  References

   TBD

6.  Addresses


6.1.  Working Group

   [The MNNP working group is in creation.]

6.2.  Author's address

   Carsten Bormann
   Universitaet Bremen FB3 TZI
   Postfach 330440
   D-28334 Bremen, GERMANY
   cabo@tzi.org
   phone +49.421.218-7024
   fax +49.421.218-7000




7.  Annex A: MTP-2 based coordination protocol

   When a batch is being prepared, a short MTP-2 message (an
   announcement) is sent that just contains the message IDs of the
   articles in the batch.  When this message has been transmitted in the
   MTP-2 Web and all lower-numbered messages have arrived, the
   Newscaster removes those articles from the batch that have been
   announced in lower-numbered announcements.  This, in the steady state
   case, makes it unlikely that two Newscasters will be transmitting the
   same article concurrently.  However, Newscasters that return after a
   multicast outage would start to transmit old articles (that they have
   received via NNTP while other systems got them via Newscaster).  To
   minimize the impact of such late-comers on the Newscast efficiency,
   Newscasters only newscast articles they have newly received while
   being active in the Web (i.e., no spooling).
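
   The removal of already-announced articles might look like the sketch
   below.  It assumes that all lower-numbered announcements have already
   arrived and ignores sequence number wrap-around; the function name
   and the dictionary representation are hypothetical.

```python
def actual_batch(own_seq, own_ids, announcements):
    """Drop message-ids that appear in lower-numbered announcements.

    announcements maps an MTP-2 sequence number to the message-ids
    carried in that announcement.
    """
    claimed = set()
    for seq, ids in announcements.items():
        if seq < own_seq:
            claimed.update(ids)
    return [mid for mid in own_ids if mid not in claimed]


if __name__ == "__main__":
    # Hypothetical announcements seen by this Newscaster.
    seen = {5: ["<a@x>", "<b@x>"], 6: ["<c@x>"], 8: ["<d@x>"]}
    print(actual_batch(7, ["<a@x>", "<c@x>", "<d@x>", "<e@x>"], seen))
```

   Announcement 8 is higher-numbered than the Newscaster's own
   announcement 7, so the article it names stays in the batch.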

   For IPv4, the global ID of a batch is composed of the concatenation
   of the IP address of the MTP-2 master at the time of receiving the
   announcement and the 24-bit MTP-2 sequence number, filled with zeroes
   at the end.
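
   As an illustration, the 80-bit global ID for IPv4 could be assembled
   as shown below: 32 bits of address, 24 bits of sequence number, and
   24 bits of zero padding.  Big-endian encoding of the sequence number
   is an assumption, and the address used is a documentation address.

```python
import ipaddress


def global_id(master_ipv4: str, mtp2_seq: int) -> bytes:
    """Build the 10-byte global ID from master address and sequence number."""
    if not 0 <= mtp2_seq < (1 << 24):
        raise ValueError("MTP-2 sequence number must fit in 24 bits")
    address = ipaddress.IPv4Address(master_ipv4).packed  # 4 bytes
    return address + mtp2_seq.to_bytes(3, "big") + b"\x00" * 3


if __name__ == "__main__":
    print(global_id("192.0.2.1", 0x010203).hex())
```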

   Rate control is performed in the following way: Each Newscaster is
   aware of the total system rate defined for the Web (e.g., 128
   kbit/s).  Newscasters that are transmitting batches share this
   bandwidth by setting up short-term reservations.  Each Newscaster
   also maintains a running idea of all the reservations currently in
   effect.  Upon reception of an announcement, the receiving Newscaster
   considers half the unreserved system rate to be reserved for the
   announcer.  This reservation is corrected by the actual rate used by
   the sender, once an NNDP packet is received for this batch (rate
   field).  The sender of a batch is allowed to use up to half of what
   it considers to be the unreserved rate at the time it receives its
   own announcement for this batch.  Each Newscaster deletes a
   reservation for a batch once the sender should have stopped sending
   data, according to its actual chosen rate and the size of the batch
   as indicated in the NNDP packets, or (if no NNDP packets were
   received at all), after a timeout of T_SEND (T_SEND is initially set
   to 15 seconds).  Newscasters avoid using silly rates (i.e., less than
   a very small fraction of the system rate for a large batch).
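
   The reservation bookkeeping can be sketched as a small table kept by
   each Newscaster.  The class and method names are hypothetical, rates
   are in kbit/s, and timers are not modelled: release() stands for the
   point at which the sender should have finished or T_SEND has expired.

```python
class RateTable:
    """Running idea of the reservations currently in effect."""

    def __init__(self, system_rate: float):
        self.system_rate = system_rate
        self.reservations = {}  # batch id -> reserved rate

    def unreserved(self) -> float:
        return self.system_rate - sum(self.reservations.values())

    def on_announcement(self, batch_id) -> float:
        """Reserve half the unreserved rate for the announcer.

        For the sender's own announcement, the returned value is the
        upper limit on the rate it may choose for the batch.
        """
        rate = self.unreserved() / 2
        self.reservations[batch_id] = rate
        return rate

    def on_nndp_packet(self, batch_id, rate: float) -> None:
        """Correct the reservation to the rate field of an NNDP packet."""
        self.reservations[batch_id] = rate

    def release(self, batch_id) -> None:
        """Delete the reservation (transfer finished or T_SEND expired)."""
        self.reservations.pop(batch_id, None)


if __name__ == "__main__":
    table = RateTable(128.0)             # example system rate from the text
    print(table.on_announcement("a"))    # 64.0
    print(table.on_announcement("b"))    # 32.0
    table.on_nndp_packet("a", 40.0)      # sender chose less than its limit
    print(table.unreserved())            # 56.0
```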



8.  Annex B: Newscasters: Active vs. Passive

   Given that there are tens of thousands of news servers in operation,
   and that NNDCP is intended to work between perhaps a thousand active
   Newscasters, the question immediately comes to mind which news
   servers should be active Newscasters and which should only listen to
   the global Netnews distribution.  In essence, this is of course a
   judgment call, which may be guided by:

   -    Multicast connectivity.  An active Newscaster obviously needs to
        be able to source multicast traffic, not just receive it.  Given
        the current tendency of ISPs to charge extra for multicast
        sourcing, many news servers may not want to become active
        Newscasters.

   -    Path lengths.  While the Newscaster architecture takes out many
        hops from the Netnews distribution paths, an article needs to
        traverse NNTP hops up to the first active Newscaster before it
        can be efficiently multicast to the rest of the world.  Often, a
        (topological) region will want to maintain at least one active
        Newscaster to minimize those path lengths.

   -    Maintaining the web of trust.  Maintainers of active Newscasters
        need to actively work on maintaining their position in the web
        of trust that is used as the security foundation of Newscaster.





























Bormann                                                         [Page 9]