draft-maenpaa-p2psip-topologyplugin-00

P2PSIP Working Group                                          J. Maenpaa
Internet-Draft                                                  Ericsson
Intended status: Standards Track                          A. Swaminathan
Expires: January 7, 2010                                          S. Das
                                                          Qualcomm, Inc.
                                                            G. Camarillo
                                                           J. Hautakorpi
                                                                Ericsson
                                                            July 6, 2009


         A Topology Plug-in for REsource LOcation And Discovery
                 draft-maenpaa-p2psip-topologyplugin-00

Status of this Memo

   This Internet-Draft is submitted to IETF in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on January 7, 2010.

Copyright Notice

   Copyright (c) 2009 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents in effect on the date of
   publication of this document (http://trustee.ietf.org/license-info).
   Please review these documents carefully, as they describe your rights
   and restrictions with respect to this document.




Maenpaa, et al.          Expires January 7, 2010                [Page 1]


Internet-Draft         Topology Plug-in for RELOAD             July 2009


Abstract

   REsource LOcation And Discovery (RELOAD) is a peer-to-peer signaling
   protocol that can be used to maintain an overlay network, and to
   store data in and retrieve data from the overlay.  This document
   defines a new topology plug-in for RELOAD that is more appropriate
   for real world large scale overlays.  This topology plug-in
   implements three important functionalities that allow RELOAD to
   operate under real world constraints.  First, it includes a load
   balancing algorithm that specifies efficient allocation of load to
   different nodes in the network.  Second, the document describes
   robust techniques for stabilization of fingers and successors and
   specifies self tuning mechanisms that allow dynamic and automatic
   adjustment of parameters needed for these advanced techniques in the
   topology plug-in.  Finally, it specifies a locality aware finger
   selection algorithm that reduces average lookup latency.



































Maenpaa, et al.          Expires January 7, 2010                [Page 2]


Internet-Draft         Topology Plug-in for RELOAD             July 2009


Table of Contents

   1.  Introduction . . . . . . . . . . . . . . . . . . . . . . . . .  4
   2.  Terminology  . . . . . . . . . . . . . . . . . . . . . . . . .  6
   3.  Need for Advanced Topology Plug-in . . . . . . . . . . . . . .  8
     3.1.  Need for Load Balancing  . . . . . . . . . . . . . . . . .  8
     3.2.  Need for Robust Stabilization  . . . . . . . . . . . . . .  9
     3.3.  Need for Locality Awareness  . . . . . . . . . . . . . . .  9
     3.4.  Need for Self-Tuning of System Parameters  . . . . . . . . 10
   4.  Chord  . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
   5.  Load Balancing in the Proposed Topology Plug-in for RELOAD . . 11
     5.1.  Basic Load Balancing scheme  . . . . . . . . . . . . . . . 11
     5.2.  Recommendations on Virtual Servers . . . . . . . . . . . . 12
     5.3.  Routing  . . . . . . . . . . . . . . . . . . . . . . . . . 13
     5.4.  Obtaining Virtual Server Identities  . . . . . . . . . . . 14
       5.4.1.  Without Enrollment Server  . . . . . . . . . . . . . . 14
       5.4.2.  With Enrollment Server . . . . . . . . . . . . . . . . 15
     5.5.  Extensions to Overlays with Heterogeneous Nodes  . . . . . 16
   6.  Stabilizing Fingers, Successors, and Predecessors in the
       Topology Plug-in . . . . . . . . . . . . . . . . . . . . . . . 16
     6.1.  Choice of Approach to Stabilization  . . . . . . . . . . . 16
     6.2.  Update Messages for Stabilization  . . . . . . . . . . . . 18
     6.3.  Finger Stabilization . . . . . . . . . . . . . . . . . . . 21
       6.3.1.  Locality-aware Finger Selection  . . . . . . . . . . . 22
     6.4.  Successor Stabilization  . . . . . . . . . . . . . . . . . 23
     6.5.  Predecessor Stabilization  . . . . . . . . . . . . . . . . 24
     6.6.  Joining the Overlay  . . . . . . . . . . . . . . . . . . . 24
       6.6.1.  Contents of the Join Message . . . . . . . . . . . . . 25
     6.7.  Leaving the Overlay  . . . . . . . . . . . . . . . . . . . 26
       6.7.1.  Contents of the Leave Message  . . . . . . . . . . . . 26
     6.8.  Self Tuning System Parameters  . . . . . . . . . . . . . . 27
       6.8.1.  Self Tuning Load Balancing Algorithm Parameters  . . . 28
       6.8.2.  Self Tuning the Stabilization Interval . . . . . . . . 32
   7.  Security Considerations  . . . . . . . . . . . . . . . . . . . 37
   8.  IANA Considerations  . . . . . . . . . . . . . . . . . . . . . 37
   9.  Acknowledgments  . . . . . . . . . . . . . . . . . . . . . . . 37
   10. Appendix . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
     10.1. Comparison of the Load Balancing Algorithm with Chord  . . 38
     10.2. Performance of the Load Balancing Algorithm as Network
           Grows  . . . . . . . . . . . . . . . . . . . . . . . . . . 38
   11. References . . . . . . . . . . . . . . . . . . . . . . . . . . 40
     11.1. Normative References . . . . . . . . . . . . . . . . . . . 40
     11.2. Informative References . . . . . . . . . . . . . . . . . . 40
   Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 42







Maenpaa, et al.          Expires January 7, 2010                [Page 3]


Internet-Draft         Topology Plug-in for RELOAD             July 2009


1.  Introduction

   REsource LOcation And Discovery (RELOAD) [1] is a peer-to-peer
   signaling protocol that can be used to maintain an overlay network,
   and to store data in and retrieve data from the overlay.  For
   interoperability reasons, RELOAD specifies one overlay algorithm that
   is mandatory to implement.  Additionally, RELOAD supports a variety
   of other overlay algorithms through the use of topology plug-ins.  A
   topology plug-in implements the topology defined by a specific
   overlay algorithm.

   This document defines a new topology plug-in for RELOAD that is more
   appropriate for real world large scale overlays that have to deal
   with object storage in a fair manner, provide good lookup performance
   to a variety of applications, and self-organize to deal with churn.
   This topology plug-in implements three important functionalities that
   allow RELOAD to operate under these real world constraints.  First,
   it includes a load balancing algorithm that specifies efficient
   allocation of load to different nodes in the network.  Second, the
   document describes robust techniques for stabilization of fingers and
   successors and specifies self tuning mechanisms that allow dynamic
   and automatic adjustment of parameters needed for these advanced
   techniques in the topology plug-in; this avoids the need for
   constants that may only work in specific scenarios.  Finally, it
   specifies a locality aware finger selection algorithm that reduces
   average lookup latency.

   Load balancing is essential to effectively manage data and provide
   services on overlays.  Load balancing, as an integral part of the
   overlay, encourages participation by imposing collective fate sharing
   on the nodes.  Without such a scheme, overlay adoption may be
   significantly affected.  However, the mandatory-to-implement RELOAD
   DHT protocol based on Chord does not support operating the overlay in
   a load balanced manner.  For instance, for a system with N nodes, it
   can been shown that the imbalance factor, defined as the maximum load
   on any node on the network divided by the average load, is of the
   order of O(log2(N)) in the number of items even if all objects are
   assumed to be homogeneous.  In the case of heterogeneous networks,
   where the capabilities of nodes (storage and bandwidth) can differ by
   multiple orders of magnitude, the problem of load balancing becomes
   more important because the imbalance in load distribution could
   potentially create a bottleneck in the system.  Thus to enable
   scalable, real world deployments of RELOAD, the topology plug-in in
   this document specifies a scheme for load balancing.

   Peer-to-peer overlay networks face a fundamental challenge with
   churn: joining and leaving of nodes.  This can affect the integrity
   of the routing structure, cause a node to become disconnected from



Maenpaa, et al.          Expires January 7, 2010                [Page 4]


Internet-Draft         Topology Plug-in for RELOAD             July 2009


   the overlay, and cause overhead in maintaining consistency.  Several
   research studies have been performed on the type of stabilization
   that can be used in DHT based peer-to-peer overlays such as RELOAD.
   This document specifies a method for stabilization to deal with churn
   to allow RELOAD to be scalable and reliable in real world conditions.
   We specifically prescribe the use of a periodic stabilization routine
   to counter the undesirable effects of churn on routing.

   To enable load balancing and stabilization to deal with churn some
   parameters need to be set (described later).  The use of specific
   constants for such parameters leads to deployment that may only work
   for specific scenarios (where the parameters are evaluated).  Thus,
   this document specifies self tuning mechanisms for system parameters
   to allow RELOAD to scale naturally as network dynamics dictate.  For
   example, as churn increases, the topology plug-in specified in this
   document adapts by increasing the frequency of stabilization.
   Similarly, as network size increases, load balancing parameters and
   DHT parameters are modified to ensure quick finger and successor
   updates and to keep the load balancing property over time.  To enable
   self tuning of system parameters, some characteristics such as churn
   rate and network size are estimated.  These characteristics are then
   used to dynamically adjust the topology plug-in parameters such as
   the size of the successor set, size of the routing table, and rate of
   maintenance messages.  The benefit of this approach is that it avoids
   the problem with static parameters.  Using static parameters, it is
   not possible to achieve a low failure rate and a low communication
   overhead.  The topology plug-in specified in this document allows the
   system to take into account the evolution of network conditions and
   adapt to them.

   Even with up-to-date fingers and successors, making progress in the
   identifier space can be expensive in terms of network latency because
   small progress in identifier space could result in a significant leap
   in physical distance.  The successor and predecessor lists can be
   used to optimize network latency by relaxing the requirement for
   finger selection.  Specifically, at each overlay hop, as progress is
   made in the identifier space, small physical distance hops are used
   so as to avoid high latencies in overall lookup.  More details on the
   proposed locality aware finger selection are described later in this
   document.

   In summary, the topology plug-in proposed in this document has the
   following advantages:

   o  First, building on the RELOAD framework, this document introduces
      a load balancing algorithm that provides an imbalance factor of
      the order of O(1).  The solution proposed in the document can also
      be extended to the case of heterogeneous nodes in such a way that



Maenpaa, et al.          Expires January 7, 2010                [Page 5]


Internet-Draft         Topology Plug-in for RELOAD             July 2009


      the number of objects assigned to a node is proportional to the
      amount of load it can handle;

   o  Second, this topology plug-in specifies stabilization techniques
      that deal effectively with churn;

   o  Third, this document proposes self-tuning of parameters required
      in these algorithms such that users no longer need to tune every
      DHT parameter correctly for a given operating environment.  By
      periodically computing the network parameters, the system
      automatically adapts to changing operating conditions; and

   o  Finally, the locality aware finger selection algorithm
      incorporated with the DHT algorithm further optimizes the
      selection of successors and predecessors to reduce network
      latency.

   This document is organized as follows.  We first describe the
   algorithms for load balancing, stabilization, and locality aware
   finger selection.  Then we describe the major components required to
   operate the topology plug-in: joining, leaving, routing, dealing with
   failures etc.  Finally, we describe self tuning of the system
   parameters to make the topology plug-in adjust itself to changing
   network conditions.


2.  Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [2].

   This document uses the terminology and definitions from the Concepts
   and Terminology for Peer-to-Peer SIP [1] draft.

   Chord ring:  The Chord DHT orders identifiers on an identifier circle
      of size 2^m (m is the number of bits in peer identifiers).  This
      identifier circle is called the Chord ring.

   DHT:  Distributed Hash Tables (DHTs) are a class of decentralized
      distributed systems that provide a lookup service similar to a
      hash table.  Given a key, any participating peer can retrieve the
      value associated with that key.  The responsibility for
      maintaining the mapping from keys to values is distributed among
      the peers.






Maenpaa, et al.          Expires January 7, 2010                [Page 6]


Internet-Draft         Topology Plug-in for RELOAD             July 2009


   Finger table:  A data structure with up to m entries maintained by
      each peer in a Chord-based overlay.  The ith entry in the finger
      table of peer n contains the identity of the first primary virtual
      server that succeeds n by at least 2^(m-i) on the Chord ring.
      This peer is called the ith finger of peer n.  As an example, the
      first entry in the finger table of peer n contains a peer half-way
      around the Chord ring from peer n.  The purpose of the finger
      table is to accelerate lookups.

   log2(N):  Logarithm of N with base 2.

   N: The number of nodes in the overlay unless otherwise specified.

   Neighborhood set:  Consists of successor and predecessor lists.

   O(g(n)):  Informally, saying that some equation f(n) = O(g(n)) means
      that f(n) is less than some constant multiple of g(n).

   Omega(g(n)):  Informally, saying that some equation f(n) =
      Omega(g(n)) means that f(n) is more than some constant multiple of
      g(n).

   Predecessor list:  A data structure containing the predecessors of a
      peer.

   Successor list:  A data structure containing the successors of a
      peer.

   Virtual server:  Each physical peer instantiates one or more virtual
      servers with random IDs that act as peers in the DHT.

   Primary virtual server:  The primary ID assigned to the peer when it
      joins the network.  This location is chosen uniformly at random
      among the available IDs in [0, 2^m-1].

   Secondary virtual server:  List of all identities owned by the
      physical peer excluding the primary virtual server.  The location
      of the secondary virtual servers are selected so that the ith
      virtual server is distributed between [vp-i*delta*2^m, vp-
      (i+1)*delta*2^m] where vp is the primary virtual server.  These
      locations may change to adapt to network dynamics.

   alpha:  number of virtual servers per physical ID.  This includes one
      primary virtual server and (alpha-1) secondary virtual servers.







Maenpaa, et al.          Expires January 7, 2010                [Page 7]


Internet-Draft         Topology Plug-in for RELOAD             July 2009


   delta:  The value of delta defines the spacing between the virtual
      servers.  The location of the secondary virtual servers are
      selected so that the ith virtual server is distributed between
      [vp-i*delta*2^m, vp-(i+1)*delta*2^m] where vp is the location of
      the primary virtual server.

   Routing table:  The set of peers that a node can use to route overlay
      messages.  The routing table consists of the finger table,
      successor list and predecessor list, all populated with primary
      virtual server IDs.


3.  Need for Advanced Topology Plug-in

   This section details the need for specifying load balancing
   mechanisms, robust stabilization and locality awareness for a RELOAD
   topology plug-in to make it efficient and scalable.

3.1.  Need for Load Balancing

   Most DHTs, such as Chord, Pastry, CAN, Tapestry, and the RELOAD base
   topology plug-in employ uniform hashing to generate object IDs so
   that the object IDs are uniformly distributed over the ID space
   (example 0 to D = 2^128 - 1).  Objects are then assigned to the nodes
   based on their IDs and the exact way that this is done differs in
   different DHTs.  Let us assume that there are N nodes in the network.
   Suppose the node IDs are generated uniformly at random, then the
   probability that the ith largest nodeID is equal to k is given by:

   P(nodeID(i) == k) = (N-1)C(i-1) * (k/D)^(i-1) * (N-i)C(1) * (1/D) *
   (1 - (k+1)/D)^(N-i+1)

   For large N and D, this distribution can be approximated as follows:

   P(nodeID(i) == k) = e^(-k*N/D) * (k*N/D)^1 * (1/i!)

   Therefore, ith largest node IDs follows a Poisson distribution.  The
   inter-node distance then follows an exponential distribution with
   parameter N/D. As consequence of this property, the number of data
   items assigned to a node is proportional to the inter-node distance
   and, thus, follows an exponential distribution (c denotes an
   arbitrary constant)

   Pr(#items in node == x) = c * (N/D) * e^(c*(N/D)*x)

   The goodness of Chord for load balancing can be measured in terms of
   the imbalance metric defined as:




Maenpaa, et al.          Expires January 7, 2010                [Page 8]


Internet-Draft         Topology Plug-in for RELOAD             July 2009


                                 Maximum # of data items
          imbalance factor = -------------------------------
                                 Average # of data items


   Here, the maximum value is calculated as the one which is largest
   with probability: O(1 - (1/N^l)) where l is any constant.  The
   imbalance factor measures the storage load of a server; the longer
   the interval is, the more data has to be stored in the server.

   For the case of Chord, this imbalance factor is of the order of
   O(log2(N)) and there are nodes in the overlay that manage log2(N)
   times the average node's load.  Further, this result suggests that
   the imbalance factor for Chord increases as the number of nodes in
   the network and as the number of nodes increase the maximum number of
   data items handled per node increases of the order of
   O(T*(log2(N)/N)) where T denotes the total number of objects on the
   overlay.  Ideally, we would want the maximum value to be close to the
   average value of (T/N) giving an imbalance factor close to 1 for a
   good load balancing algorithm.  Therefore, RELOAD base DHT is not
   very efficient for load balancing and there is a need for load
   balancing mechanisms in a topology plug-in that is widely usable.

3.2.  Need for Robust Stabilization

   To ensure correct lookups in the presence of churn and to ensure
   optimal routing of queries, a node needs to ensure that its fingers,
   successor list, and predecessor list are up to date.  However,
   stabilization of these features comes at a cost: peers in the overlay
   network need to consume network bandwidth to maintain routing state.
   DHTs use stabilization routines to counter the undesirable effects of
   churn on routing.  The purpose of stabilization is to keep the
   routing information of each peer in the overlay consistent with the
   constantly changing overlay topology.

   The current RELOAD base topology plug-in proposes reactive
   stabilization.  Research studies have shown that this approach may
   not be the most robust to a wide variety of network conditions.  In
   this document, we revisit the choice of stabilization and recommend
   techniques likely to work efficiently across deployment types.

3.3.  Need for Locality Awareness

   A major performance issue with structured peer-to-peer based topology
   plug-ins is the need for multi-overlay-hop routing to lookup a piece
   of data such as the SIP Address of Record (AoR )to IP mapping.  Since
   the identifier space is completely random, a lookup for a data item
   stored 10ms RTT away can potentially take multiple intercontinental



Maenpaa, et al.          Expires January 7, 2010                [Page 9]


Internet-Draft         Topology Plug-in for RELOAD             July 2009


   hops before getting answered significantly affecting lookup latency.
   Additionally, the fact that routers in such peer-to-peer networks are
   expected to be constrained non-infrastructure nodes adds on
   additional delays.  Locality aware routing aims to mitigate the
   routing stretch of the topology plug-in so that a lookup is answered
   by making progress in the identifier space while minimizing the
   amount of distance traveled in physical space at each hop.

3.4.  Need for Self-Tuning of System Parameters

   Two main advantages of a self-tuning DHT are that users no longer
   need to tune every DHT parameter correctly for a given operating
   environment and that the system adapts to changing operating
   conditions.


4.  Chord

   This document proposes a new topology plugin for RELOAD that is based
   on the Chord DHT algorithm.  Topology plugins allow RELOAD to support
   a variety of overlay algorithms.  The proposed topology plugin uses a
   load balanced self-tuning version of Chord.  It can be used as an
   alternative to the default DHT specified by RELOAD.

   Chord [4] is a structured peer-to-peer algorithm that uses consistent
   hashing to build a DHT out of several independent peers.  Consistent
   hashing assigns each peer and resource an m-bit identifier.  Peers
   MUST use SHA-1 as the base hash fuction to generate the identifiers.
   The length of the identifiers MUST be m=128 bits.  The identifiers
   are ordered on an identifier circle of size 2^m.  On the identifier
   circle, key k MUST be assigned to the first peer whose identifier
   equals or follows the identifier of k in the identifier space.  The
   identifier circle is called the Chord ring.

   Different DHTs differ significantly in performance when bandwidth is
   limited.  It has been shown that when compared to other DHTs, the
   advantages of Chord include that it uses bandwidth efficiently and
   can achieve low lookup latencies at little cost [5].

   A simple lookup mechanism could be implemented on a Chord ring by
   requiring each peer to only know how to contact its current successor
   on the identifier circle.  Queries for a given identifier could then
   be passed around the circle via the successor pointers until they
   encounter the first peer whose identifier is equal to or larger than
   the desired identifier.  Such a lookup scheme uses a number of
   messages that grows linearly with the number of peers.  To reduce the
   cost of lookups, Chord also maintains additional routing information;
   each peer n MUST maintain a data structure with up to m entries,



Maenpaa, et al.          Expires January 7, 2010               [Page 10]


Internet-Draft         Topology Plug-in for RELOAD             July 2009


   called the finger table.  The first entry in the finger table of peer
   n contains the peer half-way around the ring from peer n.  The second
   entry contains the peer that is 1/4th of the way around, the third
   entry the peer that is 1/8th of the way around, etc.  In other words,
   the ith entry in the finger table at peer n SHOULD contain the
   identity of the first peer s that succeeds n by at least 2^(m-i) on
   the Chord ring.  This peer is called the ith finger of peer n.  The
   interval between two consecutive fingers is called a finger interval.
   The ith finger interval of peer n covers the range [n.id + 2^(m-i),
   n.id + 2^(m-i+1)) on the Chord ring.  In an N-peer network, each peer
   maintains information about O(log2(N)) other peers in its finger
   table.  As an example, if N=1000, it is sufficient to maintain 10
   fingers.

   To increase robustness in the event of peer failures, each Chord peer
   MUST maintain a successor list containing the peer's immediate
   successors on the Chord ring.  The successor list will be further
   described in Section 5.3.  Peers MUST also maintain a predecessor
   list containing the peer's immediate predecessors on the Chord ring.
   The recommeded value for the size of the predecessor list is 3, as is
   also specified in [1].


5.  Load Balancing in the Proposed Topology Plug-in for RELOAD

5.1.  Basic Load Balancing scheme

   The basic Chord DHT suffers from drawbacks.  The uniform hashing used
   in Chord to generate object IDs result in an imbalance factor close
   to O(log2(N)); this implies that there are peers in the Chord ring
   that would have to manage O(log2(N)) times the load that is to be
   handled by an average peer.  In this section, we build upon the basic
   Chord DHT and present the virtual servers algorithm that results in
   an imbalance factor close to 1.

   Several research papers have proposed schemes to provide load
   balanced DHTs.  Some of the schemes have general techniques while
   others are targeted towards optimizing a specific DHT.  This solution
   proposed in this document takes into account results and ideas from
   previous ideas for load balancing in [6], [7], [8], [4], [9], [10],
   [11], [12], [13], [14], [15], [16].

   The basic scheme for load balancing proposed in this document is an
   extension of the virtual servers approach [4] over Chord with the
   benefit of reducing the routing state.  At a high level, virtual
   servers approach works in improving load balancing as follows.  The
   work in [4] suggested that load balancing can be performed
   efficiently if each peer simulates a logarithmic number of "virtual



Maenpaa, et al.          Expires January 7, 2010               [Page 11]


Internet-Draft         Topology Plug-in for RELOAD             July 2009


   servers".  The method works by associating keys with virtual servers,
   and mapping multiple virtual servers (with unrelated identifiers) to
   each real node.  The authors show that this will provide a more
   uniform coverage of the identifier space.  For example, if we
   allocate O(log2(N)) randomly chosen virtual servers to each real
   node, with high probability each of the N bins will contain
   O(log2(N)) nodes.  The downside of this approach is the additional
   routing state and its maintenance cost.

   This document proposes load balancing based on a proposal in [14],
   and is an improvement over the basic virtual servers approach
   introduced in [4].  In our approach, each physical node instantiates
   one or more virtual servers that act as peers in the DHT.  The
   locations of these virtual servers are chosen close to each other in
   the nodeID space allowing the node to share a single set of overlay
   links among the virtual servers.

   The main system parameters of the solution in this document are: how
   many virtual servers should be chosen exactly and what should the
   spacing between the node identifiers of those virtual servers.  The
   next sections describe this document's recommendations on virtual
   nodes and routing for a load balanced DHT.

   Since the choices of the system parameters depend on network
   dynamics, this document further discusses dynamically adjusting the
   topology plug-in with network dynamics and recommends a periodic
   stabilization model to keep the system parameters up-to-date.

5.2.  Recommendations on Virtual Servers

   Selection of virtual servers require two decisions to be made: (1)
   how many virtual servers (denoted by alpha) should we choose and (2)
   what should be the namespace spacing (represented by delta*2^m) in
   between these virtual servers.  This section provides the present
   document's recommendation on choosing alpha and delta.

   Each physical peer maintains two sets of virtual server identities,
   namely, primary virtual server identities and secondary virtual
   server identities.  The location of the primary virtual server
   (denoted as vp) is chosen using existing techniques in RELOAD.  These
   locations remain fixed throughout the entire lifetime of the node.

   In addition to the primary virtual server, each node maintains
   secondary virtual servers.  The locations of these secondary virtual
   servers are chosen such that the ith virtual server, vs_i, is
   uniformly distributed in [vp-i*delta*2^m, vp-(i+1)*delta*2^m].

   The value of delta * 2^m defines the spacing between the virtual



Maenpaa, et al.          Expires January 7, 2010               [Page 12]


Internet-Draft         Topology Plug-in for RELOAD             July 2009


   nodes and this document requires that nodes MUST choose the value of
   delta as 1/N where N is the number of nodes in the network.  This
   value was found to perform well in simulations and is consistent with
   the value reported in [14].

   Each node MUST maintain a total of alpha = 2*log2(N) virtual server
   identities for load balancing where N is the number of nodes in the
   overlay.  This value of alpha includes one primary virtual server and
   (alpha-1) secondary virtual server identities.  The choice of
   2*log2(N) for alpha has been found to work well in simulations [14].

   While both these parameters, alpha and delta, depend on N (the size
   of the network), Section 6.8.1 describes an algorithm used to
   determine alpha and delta without performing explicit network size
   estimation.  Note that explicit network size estimation only needs to
   be performed when self-tuning the DHT parameters, since in that case
   a more accurate estimate of the network size is needed.

5.3.  Routing

   Assume an identifier space [0, 2^m-1].  As in the basic Chord
   approach, the recommended topology plug-in maintains two types of
   routing information such as finger tables and successor tables.  The
   finger tables are constructed based on the location of the primary
   virtual server, i.e., the ith finger at peer v contains the identity
   of the first peer s that succeeds its primary virtual server, vp, by
   at least 2^(m-i) on the Chord ring.

   The neighbor table of peer v contains the entries of all nodes which
   have ownership of any part of the log2(N)/N space that node v owns.
   As in the case of Chord, each virtual server belonging to the node
   obtains its successor list of its immediate successor.  If this total
   length is greater than log2(N), the clockwise-farthest entry is
   dropped to restrict the number of neighbor table entries to log2(N).
   Further, it can be shown that the maximum number of node IDs that
   fall within this space is O(log2(N)) resulting in those many neighbor
   table entries.  The added benefit of having log2(N) successors is
   that if each peer fails independently with probability p, the
   probability that all log2(N) successors fail simultaneously is only
   p^log2(N); therefore, the system is much more resilient to failures.

   To obtain the entries in the peer's neighbor table, the peer starts
   with the alpha-th secondary virtual ID, denoted as vs_alpha.  The
   peer then proceeds clockwise from vs_alpha on the Chord ring and
   identifies the virtual IDs of peers that fall in this region.  When
   it encounters a virtual server vs', it stores the location of both
   the virtual server, vs', and the location of the corresponding
   primary virtual server, vp'.  On the other hand, when it sees its own



Maenpaa, et al.          Expires January 7, 2010               [Page 13]


Internet-Draft         Topology Plug-in for RELOAD             July 2009


   virtual server, it is dropped and no information is stored.  This
   process is followed until log2(N) distinct primary virtual servers
   are obtained to populate the neighbor table.  It is to be noted that
   while the size of the successor list may be greater than log2(N)
   because it includes both the primary and the secondary virtual
   servers, the number of successor connections is limited to log2(N).

   To locate any object in the overlay, the nodes MUST first employ the
   finger tables to reach the virtual server ID that is closest to the
   query object ID.  At the end of this step, the chosen physical node
   may or may not have the object.  If the physical node does not have
   the object, then it forwards the object request to its neighbors
   using the neighbor table entries.  Therefore the total number of
   message hops required for locating an object is of the order of
   O(log2(N)) + 1.

   Thus, this solution proposed in this document, based on [14],
   requires maintaining a lower number of connections than that of a
   pure virtual servers based approach for load balancing.  This is
   because, while the basic Chord scheme with O(log2(N)) virtual servers
   per physical node requires maintaining O(log2(N)) sets of fingers
   (and O(log2(N)) routing tables) for each physical node for efficient
   routing, this document requires maintaining just one set of fingers
   (and routing table) per physical node.

   In Section 6, we discuss approaches to keep the fingers and
   successors up-to-date that is required to ensure correct look-ups.

5.4.  Obtaining Virtual Server Identities

   This section describes the procedure for obtaining the primary and
   virtual server identities.

5.4.1.  Without Enrollment Server

   When a new node, v, joins the system, the first primary node identity
   vp MUST be computed by applying the digest specified in the self-
   signed-permitted element of the overlay configuration document to the
   DER representation of the node's public key.  The subsequent (alpha -
   1) virtual node IDs MUST be computed as follows: the ith virtual ID
   is chosen as random(vp - i*delta*2^m, vp - (i+1)*delta*2^m)

   This ensures that the chosen virtual servers are close to each other
   and a maximum of log2(N)/N apart.

   A node's public key relates to the vp and can be verified by other
   nodes.  However, all other virtual server IDs cannot be related to
   this public key so they need to be verified based on vp.  There are



Maenpaa, et al.          Expires January 7, 2010               [Page 14]


Internet-Draft         Topology Plug-in for RELOAD             July 2009


   two options to do this verification:

   o  a.  Use a constant fixed offset for the ith virtual server as (vp
      - i*delta*2^m) exactly.  This removes the random component of the
      virtual server ID selection.  Since all virtual server IDs are at
      fixed intervals from vp, a node can weakly verify the node id.
      However node ID collisions may make this hard.

   o  b.  Use the current technique described in Section 5.2 with random
      selection in an interval.  In this case a node can verify that the
      virtual server ID is in a small range near vp.

   OPEN ISSUE: At the moment we recommend option (b), but this issue
   needs further analysis.

5.4.2.  With Enrollment Server

   When an enrollment server is present, the added security benefits of
   the enrollment server certified node identities are for virtual
   server identities.  The enrollment server is provided with the values
   of the system parameters alpha and delta, and based on them, the
   enrollment server gives out a set of virtual server identities to a
   node.  Similar to what a node itself does in a deployment without an
   enrollment server; the enrollment server MUST use a uniform hash
   function to randomly choose the primary virtual server ID from the
   space (0, 2^128 - 1); call this virtual server ID vp.  The remaining
   (alpha - 1) node IDs MUST be then chosen by the enrollment server
   around vp in such a way that the ith virtual serverID of v is chosen
   to be uniformly distributed in (vp - i/N, vp - (i+1)/N).  These node
   IDs are then passed on to the node.

   When a node initially joins, it does not have an estimate of alpha
   and delta since that is based on the number of fingers in the routing
   table.  Thus, the enrollment server MUST use existing values of alpha
   and delta in requests it receives from nodes in the current overlay
   or based on its own diagnostic messages sent into the overlay to
   determine the number of virtual servers and the spacing in between
   the virtual servers and consequently the node IDs handed out to the
   joining node.  If the overlay has recently formed the enrollment
   server MAY bootstrap the values of alpha and delta as 20 and 1/1000
   respectively.  The enrollment server may also choose to use past
   overlay size estimates it may possess to bootstrap alpha and delta as
   2log2(Nestimated) and 1/Nestimated.  For example, an enrollment
   server at a venue based overlay which is torn down can use past size
   estimates.

   Note that the enrollment procedure in RELOAD already defines
   providing multiple node identities to the enrolling node.  The change



Maenpaa, et al.          Expires January 7, 2010               [Page 15]


Internet-Draft         Topology Plug-in for RELOAD             July 2009


   needed is to include the values of alpha and delta in the simple
   enrollment request.

5.5.  Extensions to Overlays with Heterogeneous Nodes

   This load balancing solution can be extended for overlays with nodes
   with heterogeneous node capacities.  Let the capacity of the node-v
   in the overlay be represented by Cv.

   If an overlay has nodes with heterogeneous nodes, nodes in the
   overlay MUST use alpha = 2*Cv*log2(N) virtual servers per physical
   node.  The location of these (2 Cv log2(N)) virtual nodes MUST be
   chosen near the primary node location vp such that the location of
   the ith virtual serverID is uniformly distributed in (vp - i* delta *
   2^m, vp - (i+1) * delta * 2^m).  Each physical node then maintains a
   total of O(Cv log2(N)) neighbor entries and O(Cv log2(N)) routing
   fingers.  The imbalance factor for this scenario, defined as,

                                  Maximum # of data items
          imbalance factor = ---------------------------------
                                Cv * Average # of data items


   can be shown to be close to 1.

   Note that in addition to alpha and delta, the node capacity of the
   node MUST be passed to the enrollment server in the simple enrollment
   request.  This capacity value is used by the enrollment server to
   calculate Cv as

                                      node_capacity(node n)
              Cv(node n)  = ----------------------------------------
                               sum_{k=1}^N node_capacity(node k)


   In deployments with no enrollment servers, the node MUST estimate
   Cv(node n) by obtaining its neighbors' node capacities and building
   an estimate of average node capacity.


6.  Stabilizing Fingers, Successors, and Predecessors in the Topology
    Plug-in

6.1.  Choice of Approach to Stabilization

   There are two alternative approaches to stabilization: periodic and
   reactive.  Periodic stabilization can either use a fixed
   stabilization rate or calculate the stabilization rate in an adaptive



Maenpaa, et al.          Expires January 7, 2010               [Page 16]


Internet-Draft         Topology Plug-in for RELOAD             July 2009


   fashion.

   In reactive stabilization, a peer reacts to the loss of a peer in its
   neighborhood set or to the appearance of a new peer that should be
   added to its neighborhood set by sending a copy of its neighbor table
   to all peers in the neighborhood set.  Periodic recovery, in
   contrast, takes place independently of changes in the neighborhood
   set.  In periodic recovery, a peer periodically shares its
   neighborhood set with each of the members of that set.

   The mandatory-to-implement Chord DHT algorithm in RELOAD [1] uses
   reactive stabilization for the neighborhood set, unlike the original
   Chord algorithm, which uses periodic stabilization.  It has been
   shown in [17] that reactive stabilization works well for small
   neighborhood sets (i.e., small overlays) and moderate churn.
   However, in large-scale (e.g., 1000 peers or more [17]) or high-churn
   overlays, reactive stabilization runs the risk of creating a positive
   feedback cycle, which can eventually result in congestion collapse.
   In [17], it is shown that a 1000-peer overlay under churn uses
   significantly less bandwidth and has lower latencies when periodic
   stabilization is used than when reactive stabilization is used.
   Although in the experiments carried out in [17], reactive recovery
   performed well when there was no churn, its bandwidth use was
   observed to jump dramatically under churn.  At higher churn rates and
   larger scale overlays, periodic stabilization uses less bandwidth and
   the resulting lower contention for the network leads to lower
   latencies.  For this reason, most DHTs such as CAN, Chord, Pastry,
   Bamboo, etc. use periodic stabilization.  As an example, the first
   version of Bamboo used reactive recovery, which caused Bamboo to
   suffer from degradation in performance under churn.  To fix this
   problem, Bamboo was modified to use periodic stabilization.

   In Chord, periodic stabilization is typically done both for
   successors and fingers.  An alternative strategy is analyzed in [18].
   In this strategy, called the correction-on- change maintenance
   strategy, a peer periodically stabilizes its successors but does not
   do so for its fingers.  Instead, finger pointers are stabilized in a
   reactive fashion.  The results obtained in [18] imply that although
   the correction-on-change strategy works well when churn is low,
   periodic stabilization outperforms the correction-on-change strategy
   when churn is high.

   In this document, we propose to use periodic stabilization for
   fingers, successors, and predecessors based on these insights.  Each
   peer MUST maintain a stabilization timer.  When the stabilization
   timer fires, the peer MUST restart the timer and carry out the
   stabilization operations.  The stabilization routine is described
   next and more details on computing the stabilization interval are



Maenpaa, et al.          Expires January 7, 2010               [Page 17]


Internet-Draft         Topology Plug-in for RELOAD             July 2009


   elaborated in Section 6.8.2.

6.2.  Update Messages for Stabilization

   The stabilization procedures are implemented using Update requests
   and answers.  To describe the contents of these messages, the syntax
   defined in [1] is used.  A Chord Update request is defined as:

           enum { reserved (0), notify(1), succ_stab(2), pred_stab(3),
                  full(4), virtualserver_stab_join(5),
                  virtualserver_stab_leave(6), (255) }
                ChordUpdateType;

           struct {
             ChordUpdateType       type;
             NodeId                sender_id;

             select(type) {
               case notify:
                 uint32            uptime;
                 NodeId            sender_virtual_ids <0..2^16-1>;
               case pred_stab:     /* Empty */
                 ;
               case succ_stab:     /* Empty */
                 ;
               case virtualserver_stab_join:
                 NodeId            sender_virtual_ids <0..2^16-1>;
               case virtualserver_stab_leave:
                 NodeId            sender_virtual_ids <0..2^16-1>;
               case full:
                 uint32            uptime;
                 uint32            alpha;
                 uint32            delta;
                 NodeId            sender_virtual_ids <0..2^16-1>;
                 NodeId            predecessors <0..2^16-1>;
                 NodeId            successors <0..2^16-1>;
                 NodeId            fingers <0..2^16-1>;
             };
           } UpdateReq;


   The "type" field MUST indicate the reason why the Update was sent:

   notify:  the sender of the Update wishes to notify the recipient of
      the sender's existence.  Upon receiving the Update, the recipient
      SHOULD insert the sender into its routing table, if appropriate.
      The 'notify' message MUST include the virtual server IDs of the
      sender in the 'sender_virtual_ids' field.



Maenpaa, et al.          Expires January 7, 2010               [Page 18]


Internet-Draft         Topology Plug-in for RELOAD             July 2009


   succ_stab:  the Update request is related to the successor
      stabilization routine.

   pred_stab:  the Update request is related to the predecessor
      stabilization routine.

   virtualserver_stab_join and virtualserver_stab_leave:  the Update
      request is related to the DHT parameter stabilization routine.

   full:  the Update request contains the entire routing and neighbor
      table of the sender.

   The sender_id field contains the sender's primary virtual ID.

   The sender_virtual_ids contains the list of all secondary virtual
   server IDs belonging to the sender.

   If the type of the Update request is 'pred_stab' or 'succ_stab', the
   request MUST NOT carry any additional information.

   If the type of the Update request is 'notify', the request MUST
   contain the sender's current uptime in seconds and the location of
   the sender's current virtual server IDs.

   If the type of the request is 'virtualserver_stab_join' or
   'virtualserver_stab_leave', the contents of the message MUST include
   the list of primary and the secondary virtual server IDs of the
   sender.

   If the type of the request is 'full', the contents of the message
   MUST be:

   o  uptime: The sender's current uptime in seconds;

   o  alpha: The sender's current DHT parameter value - alpha;

   o  delta: The sender's current DHT parameter value - delta;

   o  sender_virtual_ids: The sender's list of current virtual server
      IDs;

   o  predecessors: The sender's predecessor list;

   o  successors: The sender's successor list;

   o  fingers: The sender's finger table.

   In the introduced topology plug-in, each peer decides independently



Maenpaa, et al.          Expires January 7, 2010               [Page 19]


Internet-Draft         Topology Plug-in for RELOAD             July 2009


   the appropriate size for the successor list, predecessor list, finger
   table, and system parameters (alpha and delta).  Thus, the
   'predecessors', 'successors', and 'fingers' fields are of variable
   length.  The number of virtual IDs assigned to a node along with the
   spacing between the virtual IDs are chosen based on the system
   parameters and may change as the network changes.  In keeping with
   this change, the length of the 'sender_virtual_ids' field MUST also
   be of variable length and would include the list of its virtual
   server IDs assigned to the sender.  As specified in RELOAD [1],
   variable length fields are on the wire preceded by length bytes.  In
   the case of the successor list, predecessor list, sender_virtual_ids,
   and finger table, there are two length bytes (allowing lengths up to
   2^16-1).  The number of NodeId structures included in each field can
   be calculated based on the length bytes since the size of a single
   NodeId structure is 16 bytes.  If a peer receives more entries than
   fit into its successor list, predecessor list or finger table, the
   peer SHOULD ignore the extra entries.  If the peer is assigned more
   virtual IDs than fit into its ID list, it SHOULD reject the
   assignment.  If a peer receives fewer entries than it currently has
   in its own data structure, the peer SHOULD NOT drop the extra entries
   from its data structure.

   If the Update request succeeds, the responding peer sends an
   UpdateAns message, which is defined as:

           enum { reserved (0), notify(1), succ_stab(2), pred_stab(3),
                  full(4), virtualserver_stab_join(5),
                  virtualserver_stab_leave(6), (255) }
                ChordUpdateType;

           struct {
             ChordUpdateType         type;

             select(type) {
               case full:            /* Empty */
                 ;
               case virtualserver_stab_join:      /* Empty */
                 ;
               case virtualserver_stab_leave:      /* Empty */
                 ;
               case notify:
                 uint32              uptime;
               case pred_stab:
                 NodeId              predecessors <0..2^16-1>;
               case succ_stab:
                 NodeId              predecessors <0..2^16-1>;
                 NodeId              successors <0..2^16-1>;
             };



Maenpaa, et al.          Expires January 7, 2010               [Page 20]


Internet-Draft         Topology Plug-in for RELOAD             July 2009


           } UpdateAns;


   If the type of the Update answer is 'full',
   'virtualserver_stab_join', or 'virtualserver_stab_leave', the answer
   MUST NOT carry any additional information.  If the type is 'notify',
   the answer MUST contain the sender's current uptime in seconds.  If
   the type is 'pred_stab', the answer SHOULD carry the predecessor list
   of the responding peer.  If the type is 'succ_stab', the answer
   SHOULD include the predecessor and successor lists of the responding
   peer.

6.3.  Finger Stabilization

   The purpose of the finger stabilization procedure is to incorporate
   new peers into the finger table.  In the procedure, peer v MUST
   maintain a counter 'next', which stores the index of the next finger
   that should be stabilized.  The counter MUST be initialized to value
   one and it MUST be incremented by one after each finger stabilization
   procedure.  When the stabilization timer fires, peer v MUST choose
   one finger interval i from the set of finger_table_size finger
   intervals it maintains:

   i = next % (finger_table_size + 1),

   and send a Probe request addressed to the first identifier belonging
   to the chosen finger interval i.  The peer f responding to the Probe
   request SHOULD become the ith finger of v.  Peer v SHOULD send an
   Attach request to peer f to initiate a new connection to it.

   This document defines a new ProbeInformationType value 'uptime'.

   When this value is present in the requested_info field of a Probe
   request, it indicates that the receiver MUST include in the Probe
   response its current uptime in a ProbeInformation structure.  A Probe
   request that is sent as part of the finger stabilization procedure
   MUST contain the 'uptime' ProbeInformationType in its requested_info
   field.  The extended ProbeInformation structure that is returned in
   the Probe response is defined as:












Maenpaa, et al.          Expires January 7, 2010               [Page 21]


Internet-Draft         Topology Plug-in for RELOAD             July 2009


           enum { responsible_set(1), num_resources(2), uptime(3),
                  (255) }
                ProbeInformationType;

           struct {
             ProbeInformationType    type;

             select (type) {
               case responsible_set:
                 uint32              responsible_ppb;

               case num_resources:
                 uint32              num_resources;

               case uptime:
                 uint32              uptime;
             };
           } ProbeInformation;


   The types "responsible_ppb" and "num_resources" have been specified
   in RELOAD [1].  The "uptime" is a new type and contains the sender's
   current uptime in seconds.

6.3.1.  Locality-aware Finger Selection

   Making progress in the identifier space can be expensive in terms of
   network latency.  The successor and predecessor lists can be used to
   optimize network latency by relaxing the requirement for finger
   selection.  Specifically, for each finger table entry, a node (say v)
   first determines a node n that matches the identifier of the finger.
   It then retrieves the successors and predecessors of n from n.  Node
   v then PINGs the successors and predecessors of node n and chooses
   the topologically closest node among these as the choice for the
   finger table entry.  The sizes of the successor and predecessor lists
   have an impact on network latency; the greater the number of
   successors and predecessors, the higher the probability of finding a
   topologically close finger table entry.  Our simulations of the basic
   Chord protocol with just three successors and three predecessors
   itself shows a reduction close to 41-46% in delay of lookup in the
   DHT.  Another alternate simulation study reported in [19] confirm our
   results and show 31% to 40% lookup stretch reductions using 2^(16)
   nodes and a Euclidean and transit-stub model.  We expect that the
   lookup delay performance would further reduce in the proposed
   topology plug-in with log2(N) successors.






Maenpaa, et al.          Expires January 7, 2010               [Page 22]


Internet-Draft         Topology Plug-in for RELOAD             July 2009


6.4.  Successor Stabilization

   Both the primary virtual IDs and secondary virtual IDs of the nodes
   are stored as part of the successor and predecessor tables.

   In the successor stabilization routine, a peer v asks the peer s that
   is the first entry in its successor table for the virtual ID of the
   successor's first predecessor p.  If the successor's first
   predecessor pointer does not point to v's virtual ID but instead to p
   (for instance, because p has joined the overlay between v and s), the
   peer with virtual ID p should become v's first successor instead of
   s.  Thus, v adds p to the front of its successor list and notifies p
   of v's existence, so that p can change its predecessor to v.

   Also successor lists are stabilized as part of the successor
   stabilization routine.  In order to do this, peer v copies the
   successor list of its successor s, removing the last entry and
   prepending s to it.  If peer v, as a result of running the successor
   stabilization routing, notices that its successor has failed, then it
   does the following:

   o  Using the virtual ID, s, of its successor, it looks up its virtual
      ID to physical ID mapping to identify which physical node, say S,
      has failed.  Here, the term 'physical ID' refers to the primary
      virtual ID of the peer.

   o  The peer then uses the same virtual ID to physical ID mapping
      table to identify the location of other virtual IDs in its
      successor list that correspond to the physical node S. These IDs
      are marked for replacement.

   o  The peer then replaces the successor with the first live entry in
      its successor list, say n, and contacts this node for its
      successor list.  The peer synchronizes n's successor list with its
      own removing the IDs that are marked for replacement.  This step
      is repeated by contacting the other live entries, one after the
      other, until all the IDs marked for replacement are updated.

   The successor stabilization routine is executed when the
   stabilization timer fires.  To begin the successor stabilization
   routine, peer v MUST send an Update request to its first successor s.
   The type of the Update request MUST be 'succ_stab'.  Upon receiving
   the Update request, peer s MUST send an Update answer to peer v.  The
   Update answer SHOULD include the successor and predecessor lists of
   peer s.  If v learns from the predecessor and successor lists
   included in the answer that new peers should be included in its
   neighborhood set, v MUST send Attach requests to the new peers.  Once
   a direct connection has been established with each new peer as a



Maenpaa, et al.          Expires January 7, 2010               [Page 23]


Internet-Draft         Topology Plug-in for RELOAD             July 2009


   result of the Attach procedure, peer v MUST send an Update request of
   type 'notify' to each new peer.  This allows the new peers to insert
   v into their neighborhood sets.

6.5.  Predecessor Stabilization

   The predecessor stabilization routine is executed when the
   stabilization timer fires.  To begin the predecessor stabilization
   routine, a peer v MUST send an Update request to its predecessor p.
   The type of the Update request MUST be 'pred_stab'.  Upon receiving
   the Update request, peer p MUST send an Update answer to peer v.  The
   Update answer SHOULD include the predecessor list of peer p.  Peer v
   SHOULD use the predecessor list carried in the answer to update its
   own predecessor list.  If new peers are inserted into the predecessor
   list, peer v MUST send Attach requests and Update requests of type
   'notify' to the new peers in the same way as during the successor
   stabilization routine.

6.6.  Joining the Overlay

   The process of joining an overlay is as follows:

   1.   The Joining Peer (JP) SHOULD connect to a bootstrap peer.

   2.   The JP SHOULD send an Attach request to the bootstrap peer,
        which SHOULD route the request towards the Admitting Peer (AP).
        Here, the AP is the node that is the successor of JP's primary
        virtual server.  Once the Attach procedure is finished, there is
        a direct connection between the JP and the AP.

   3.   The JP SHOULD send a Join request to the AP.  The AP returns a
        Join answer.

   4.   The AP MUST send an Update request of type 'full' to the JP.
        The Update request SHOULD contain the contents of AP's routing
        table.  The JP SHOULD use the contents of the Update request to
        initialize its finger table and DHT parameters, i.e., alpha and
        delta.  The JP SHOULD set the size of its successor list,
        predecessor list, finger table, to the same values that the AP
        uses.  The values of alpha and delta are also set to the ones
        that AP uses.  If an enrollment server is present in the
        overlay, the information about the choice of DHT parameters,
        alpha and delta, can be obtained from it directly.

   5.   The JP then chooses the virtual server locations, vs_i for i =
        2, 3, ..., alpha.  This step is not performed if there is an
        enrollment server in the overlay.  In this scenario, the
        locations of the virtual servers are obtained from the



Maenpaa, et al.          Expires January 7, 2010               [Page 24]


Internet-Draft         Topology Plug-in for RELOAD             July 2009


        enrollment server apriori.

   6.   The JP SHOULD send an Attach request to the successor of
        vs_alpha, call it n_alpha.  The node returns a Attach answer.

   7.   The peer n_alpha MUST send an Update request of type 'full' to
        the JP.  The Update request SHOULD contain the contents of the
        node's successors.  The JP SHOULD use the contents of the Update
        request to initialize its successor and predecessor lists.

   8.   The JP MUST send Attach requests to initiate connections to each
        of the peers in its predecessor list, successor list, and finger
        table.  Since the JP is already connected to the AP and n_alpha,
        there is no need to send a new Attach request to these nodes.

   9.   The JP MUST send an Update request of type 'notify' to each of
        the peers in its predecessor and successor lists (except for the
        AP and n_alpha that are already aware of the JP).

   10.  The JP MUST send a Probe request carrying the 'uptime'
        ProbeInformationType value in the requested_info field to each
        of its fingers.  This way the JP will learn the uptimes of its
        fingers (the uptimes of predecessors and successors are learned
        from Update responses in the previous step).  The uptimes are
        needed when estimating the join rate of peers in the overlay.
        It should be noted that these Probe requests are not routed via
        the overlay but are sent on a direct connection.

   11.  Now, the JP takes ownership of regions in the overlay based on
        its virtual server locations, vs_i.  In this step, the JP MUST
        send one Update message to the successor of each vs_i for all i.
        The type of the Update message MUST be set to
        'virtualserver_stab_join'.  The peer, say n_i, receiving this
        message makes a note of this change and locally updates its
        ownership locations.  Peer n_i then sends an UpdateAns message
        with type 'virtualserver_stab_join' to the JP acknowledging the
        change.  Peer n_i SHOULD then issue a series of Store requests
        to JP to transfer ownership of the resources.  This step is
        repeated for all i = alpha, alpha-1, ..., 1.  At the end of this
        step all the new virtual servers corresponding to JP have joined
        the overlay, and the JP stores the data for these virtual
        locations.

6.6.1.  Contents of the Join Message

   This topology plug-in extends the Join request defined in RELOAD [1].
   In addition to the joining_peer_id, nodes MAY choose to send its
   virtual server locations as part of the join message if they are



Maenpaa, et al.          Expires January 7, 2010               [Page 25]


Internet-Draft         Topology Plug-in for RELOAD             July 2009


   obtained apriori from the enrollment server as part of the enrollment
   process.  The JoinReq message is defined as:

        struct {
          NodeId            joining_peer_id;
          NodeId            joining_peer_virtual_server_ids <0..2^16-1>;
          opaque            overlay_specific_data <0..2^16-1>;
        } JoinReq;


   The JoinReq contains the Node-ID which the sending peer wishes to
   assume and MAY include other virtual server locations that the
   joining peer wishes to occupy.

   If the request succeeds, the responding peer responds with a JoinAns
   message; this is defined as in the case of RELOAD.

6.7.  Leaving the Overlay

   The process of leaving the overlay is as follows:

   1.  If no replication is being performed in the overlay, leaving peer
       SHOULD issue a series of Store requests to the successor of each
       of its virtual servers, vs_i, to transfer the ownership of the
       resource records it is storing.  Note that if replication is
       being used, the successor of peer is already storing replicas and
       the amount of data transferred can be minimized.

   2.  The leaving peer MUST send a Leave request to the predecessor and
       successor of each virtual server, vs_i.  The Leave request that
       is sent to the vs_i's successor SHOULD contain the predecessor
       list of the leaving peer.  The Leave request that is sent to the
       vs_i's predecessor SHOULD contain the successor list of the
       leaving peer.  The first successor SHOULD use the predecessor
       list carried in the Leave request to update its own predecessor
       list.  The first predecessor SHOULD use the successor list
       carried in the Leave request to update its own successor list.

6.7.1.  Contents of the Leave Message

   This topology plug-in extends the Leave request defined in RELOAD
   [1].  In addition to the leaving_peer_id, a node MUST send its
   virtual server locations as part of the leave message as defined in:

            public struct {
              NodeId        leaving_peer_id;
              NodeId        leaving_peer_virtual_server_ids <0..2^16-1>;
              opaque        overlay_specific_data <0..2^16-1>;



Maenpaa, et al.          Expires January 7, 2010               [Page 26]


Internet-Draft         Topology Plug-in for RELOAD             July 2009


            } LeaveReq;


   The overlay_specific_data field of the Leave request MUST contain a
   ChordLeaveData structure:

               enum { reserved (0),
                      from_succ(1), from_pred(2), (255) }
                    ChordLeaveType;

               struct {
                 ChordLeaveType         type;

                 select(type) {
                   case from_succ:
                     NodeId              successors <0..2^16-1>;
                   case from_pred:
                     NodeId              predecessors <0..2^16-1>;
                 };
               } ChordLeaveData;



   The 'type' field indicates whether the Leave request was sent by a
   predecessor or a successor of the recipient:

   from_succ

      The Leave request was sent by a successor.

   from_pred

      The Leave request was sent by a predecessor.

   If the type of the request is 'from_succ', the contents will include
   the sender's successor list.

   If the type of the request is 'from_pred', the contents will include
   the sender's predecessor list.

6.8.  Self Tuning System Parameters

   This section describes how the system parameters for load balancing
   and stabilization adapt to changing network conditions.







Maenpaa, et al.          Expires January 7, 2010               [Page 27]


Internet-Draft         Topology Plug-in for RELOAD             July 2009


6.8.1.  Self Tuning Load Balancing Algorithm Parameters

   All DHTs require constant updating as the size of the network grows.
   For example, in the case of Chord, each node starts out with m
   entries in the finger table (if node-IDs take values from 0 to
   2^m-1).  When the number of nodes in the network is small, some of
   these m entries collapse to the same node.  As the number of nodes in
   the network increase, the finger table entries degenerate to pointing
   to different nodes and the number of fingers grows as O(log2(N)).
   Thus, the number of fingers that each node has to maintain grows as
   O(log2(N)) as N increases.

   A similar update step is required by the solution in this document
   scheme as well because the system parameters, alpha and delta, also
   depend upon the value of N. This section describes how nodes MUST
   update the values of alpha, delta, and the location of virtual
   servers as the network changes in size.  This document proposes to
   update the system values each time the network size doubles or
   halves.

   In addition to finger, successor, and predecessor stabilization, each
   peer also needs to perform DHT parameter stabilization.  Unlike the
   other stabilization routines that are done periodically when the
   stabilization timer fires, the parameter stabilization routine is
   done whenever the number of fingers is observed to change.
   Performing stabilization in this way is sufficient for updating the
   load balancing parameters (i.e., alpha and delta) because it is not
   the highest priority update and the overlay would function
   effectively even without this update.  In the proposed topology
   plug-in, we employ lazy updates for stabilizing the parameters, alpha
   and delta.  More specifically, we use the change in number of fingers
   as a trigger to perform DHT parameter update.

   The parameter stabilization routine is executed when the number of
   fingers in a node changes.  We consider two possible scenarios that
   can arise:

   1.  Number of connections decrease: Let f be the lost connection.
       The peer v MUST first check the ID of the lost connection to see
       if there is any response.  Let n be the ID of the node responding
       to the request.  If n is the same as f, the peer must re-connect
       to f.  If n is not the same as f and does not correspond to any
       node that is already in the finger table of v, then v MUST send
       an Attach request to n.  In the above two cases, the parameter
       stabilization routine is not performed since the number of
       fingers does not change.  On the other hand, if n is not the same
       as f and corresponds to some other node in the finger table of v,
       then v understands that the network size has reduced and invokes



Maenpaa, et al.          Expires January 7, 2010               [Page 28]


Internet-Draft         Topology Plug-in for RELOAD             July 2009


       the parameter stabilization routine.

   2.  Number of connections increase: In this case, v invokes the
       parameter stabilization routine.

   When the parameter stabilization routine is invoked, the following
   functions are performed:

   o  Step 1: The (alpha-1) virtual servers corresponding to v leave the
      overlay.

   o  Step 2: The value of alpha and delta are updated and the new
      values are obtained.  The corresponding virtual server locations
      are also modified to get vs_i.

   o  Step 3: The node v joins the overlay at the new virtual server
      locations, vs_i.

   o  Step 4: The successors and predecessors of v are updated.

   During the parameter stabilization routine, the portion of the data
   space owned by the corresponding physical node changes and so data
   needs to be moved to match this change.  The locations of successors
   and predecessors also change to adjust to the updates in virtual
   server locations.  However, the fingers do not change because these
   are associated with the primary virtual ID and the location of the
   primary node-ID is fixed.

   In Step 1 of the stabilization routine (see above), the peer v MUST
   send one Update message to the successor of the virtual server, vs_i.
   The type of the Update message MUST be set to
   'virtualserver_stab_leave'.  The peer n_i receiving this message
   makes a note of this change and locally updates its ownership
   locations.  Peer n_i then sends an UpdateAns message with type
   'virtualserver_stab_leave' to vs_i acknowledging the change.  On
   receiving the UpdateAns message, peer v SHOULD issue a series of
   Store requests to n_i to transfer ownership of the resources.  This
   step is repeated for all i = 2, ... alpha.  At the end of this step,
   all the virtual servers corresponding v have left the overlay and
   peer v no longer stores the data corresponding to these virtual
   locations.

   In Step 2, the peer v obtains its new virtual server locations, vs_i.
   In the presence of an enrollment server, the peer v MUST request the
   enrollment server for a new set of identities and stop participating
   in the overlay network with the old node identities.  The enrollment
   server MAY choose to implement diagnostics using mechanisms in [3] to
   ascertain that the node requesting identities is not requesting more



Maenpaa, et al.          Expires January 7, 2010               [Page 29]


Internet-Draft         Topology Plug-in for RELOAD             July 2009


   or less than what it should based on the finger table sizes of a
   random sampling of other nodes in the overlay.

   In the absence of the enrollment server, the following protocol is
   implemented to obtain the new values of alpha, delta, and vs_i.














































Maenpaa, et al.          Expires January 7, 2010               [Page 30]


Internet-Draft         Topology Plug-in for RELOAD             July 2009


UpdateAlphaDelta()
{
      If (number_of_fingers increases by 1) {
           // Can be detected as in the case of Chord when finger tables
           // are updated. This implies that the network size has about
           // doubled.

           delta = delta/2;     // update delta

           // Update existing virtual server locations
           // vs_1 = vp as the location of the primary virtual server
           // does not change.
           for i = 2 to alpha
               vs_i = vp  - (vp - vs_i)/2;

           // Choose new virtual server location between
           // (vp - alpha*delta*2^m, vp - (alpha+1)*delta*2^m);

           vs_(alpha+1) = random (vp - alpha*delta*2^m,
                vp - (alpha+1)*delta*2^m);
           vs_(alpha+2) = random (vp  (alpha+1)*delta*2^m,
                vp - (alpha+2)*delta*2^m);
           alpha = alpha + 2;
      }

      If (number_of_fingers decreases by 1) {
          // Can be detected as in the case of Chord when finger tables
          // are updated. This implies that the network size has about
          // doubled.

          // Leave network with those virtual ids
          Remove virt_server_(alpha-1);
          Remove virt_server_(alpha);

          delta = delta * 2;     // update delta
          alpha = alpha - 2;     // update alpha

          // Update existing virtual server locations
          // vs_1 = vp as the location of the primary virtual server
          // does not change.
          for i = 2 to alpha
              vs_i = vp  - (vp - vs_i) * 2;
      }
}


   Once the location of the virtual servers are obtained, the peer re-
   joins the overlay at these positions.  During this step, the peer v



Maenpaa, et al.          Expires January 7, 2010               [Page 31]


Internet-Draft         Topology Plug-in for RELOAD             July 2009


   MUST send one Update message to the successor of vs_i in Step 3.  The
   type of the Update message MUST be set to 'virtualserver_stab_join'.
   The peer n_i receiving this message makes a note of this change and
   locally updates its ownership locations.  Peer n_i then sends an
   UpdateAns message with type 'virtualserver_stab_join' to v
   acknowledging the change.  Peer n_i SHOULD then issue a series of
   Store requests to v to transfer ownership of the resources.  In this
   process, the virtual server's n_i are added to the successor list of
   v.  This step is repeated for all i = alpha, alpha-1, ..., 2.  At the
   end of this step all the new virtual servers corresponding to v have
   joined the overlay, and peer v stores the data for these virtual
   locations.

   If replication is performed on the overlay, some of the data may
   already be present in the node v and this would reduce the amount of
   Store requests.

   Successor and predecessor stabilization routines are invoked as
   described in Section 6.4 and Section 6.5, respectively, in Step 4 and
   this completes the parameter stablization routine.

6.8.2.  Self Tuning the Stabilization Interval

   To ensure that lookups produce correct results as the set of
   participating peers changes and to ensure that all peers' connections
   be up to date, each peer MUST run a stabilization protocol
   periodically in the background.  The stabilization protocol uses
   three operations: finger stabilization, successor stabilization, and
   predecessor stabilization as defined earlier.  Each peer MUST
   maintain a stabilization timer.  When the stabilization timer fires,
   the peer MUST restart the timer and carry out the stabilization
   operations.  In this section, we present methods to compute the
   stabilization timer.

   When periodic stabilization is used, one faces the problem of
   selecting an appropriate execution rate for the stabilization
   procedure.  If the execution rate of periodic stabilization is high,
   changes in the system can be quickly detected, but at the
   disadvantage of increased communication overhead.  On the other hand,
   if the stabilization rate is low and the churn rate is high, routing
   tables become inaccurate and DHT performance deteriorates.  Thus, the
   problem is setting the parameters so that the overlay achieves the
   desired reliability and performance even in challenging conditions,
   such as under heavy churn.  This naturally results in high cost
   during periods when the churn level is lower than expected, or
   alternatively, poor performance or even network partitioning in worse
   than expected conditions.




Maenpaa, et al.          Expires January 7, 2010               [Page 32]


Internet-Draft         Topology Plug-in for RELOAD             July 2009


   The current approach is to configure overlays statically.  This works
   in situations where perfect information about the future is
   available.  In situations where the operating conditions of the
   network are known in advance and remain static throughout the
   lifetime of the system, it is possible to choose fixed optimal values
   for parameters such as stabilization rate, neighborhood set size, and
   routing table size.  However, if the operating conditions (e.g., the
   size of the overlay and its churn rate) do not remain static but
   evolve with time, it is not possible to achieve both a low lookup
   failure rate and a low communication overhead by using fixed
   parameters [20].

   As an example, to configure the Chord DHT algorithm, one needs to
   select appropriate values for the size of successor list and the
   stabilization interval.  To select an appropriate value for the
   stabilization interval, one needs to know the expected churn rate and
   overlay size.  According to [21], a Chord network in a ring-like
   state remains in a ring-like state as long as peers send
   Omega(log2^2(N)) messages before N new peers join or N/2 peers fail.
   Thus, in a 500-peer overlay churning at a rate such that one peer
   joins and one peer leaves the network every 30 seconds, an
   appropriate stabilization interval would be on the order of 93s.
   According to [4], the size of the successor list should be on the
   order of log2(N).  Having a successor list of size O(log2(N)) makes
   it unlikely that a peer will lose all of its successors, which would
   cause the Chord ring to become disconnected.  Thus, in a 500-peer
   network each peer should maintain on the order of nine successors.
   However, if the churn rate doubles and the network size remains
   unchanged, the stabilization rate should double as well.  That is,
   the appropriate maintenance interval would now be on the order of
   46s.  On the other hand, if the churn rate becomes e.g. six-fold and
   the size of the network grows to 2000 peers, on the order of eleven
   successors should be maintained and the stabilization interval should
   be on the order of 42s.  If one continued using the old values, this
   could result in inaccurate routing tables, network partitioning, and
   deteriorating performance.

   The proposed topology plug-in takes into consideration the continuous
   evolution of network conditions and adapts to them.  Each peer
   collects statistical data about the network and adaptively adjusts
   its stabilization rate and successor list size based on the analysis
   of the data [20].  Reference [22] shows that by using a self-tuning
   mechanism, it is possible to achieve high reliability and performance
   even in adverse conditions with low maintenance cost.  Adaptive
   stabilization has been shown to outperform periodic stabilization in
   terms of both lookup failure and communication overhead [20].

   The following sub-sections specify methods to determine the



Maenpaa, et al.          Expires January 7, 2010               [Page 33]


Internet-Draft         Topology Plug-in for RELOAD             July 2009


   appropriate stabilization rate in an adaptive fashion.  The proposed
   mechanism is based on [22][21][20].  To calculate an appropriate
   stabilization rate, the values of three parameters MUST be estimated:
   overlay size N, failure rate U, and join rate L. Peers in the overlay
   MUST re-calculate the values of the parameters to self-tune the
   algorithm at the end of each stabilization period before re-starting
   the stabilization timer.

6.8.2.1.  Estimating Overlay Size

   Techniques for estimating the size of an overlay network have been
   proposed for instance in [22] [23] [24] [25] and [20].  In Chord, the
   density of peer identifiers in the successor set can be used to
   produce an estimate of the size of the overlay, N [22].  Since peer
   identifiers are picked randomly with uniform probability from the
   m-bit identifier space, the average distance between peer identifiers
   in the successor set is (2^m)/N.

   To estimate the overlay network size, a peer MUST compute the average
   inter-peer distance d between the successive peers starting from the
   most distant predecessor and ending to the most distant successor in
   the successor list.  The estimated network size MUST be calculated
   as:

                                 2^m
                          N = ---------
                                  d


   This estimate has been found to be accurate within 15% of the real
   network size [20].  Of course, the size of the neighborhood set
   affects the accuracy of the estimate.

   When a peer joins the network, the admitting peer sends the joining
   peer a copy of its neighborhood set.  Thus, a joining peer
   immediately has enough information at its disposal to calculate an
   estimate of the network size.

6.8.2.2.  Estimating Failure Rate

   A typical approach is to assume that peers join the overlay according
   to a Poisson process with rate L and leave according to a Poisson
   process with rate parameter U [22].  The value of U can be estimated
   using peer failures in the finger table and neighborhood set [22].
   If peers fail with rate U, a peer with M unique peer identifiers in
   its routing table should observe K failures in time K/(M*U).  Every
   peer in the overlay MUST maintain a history of the last K failures.
   The current time MUST be inserted into the history when the peer



Maenpaa, et al.          Expires January 7, 2010               [Page 34]


Internet-Draft         Topology Plug-in for RELOAD             July 2009


   joins the overlay.  The estimate of U MUST be calculated as:

                                  k
                          U = ----------,
                                M * Tk


   where M is the number of unique peer identifiers in the routing
   table, Tk is the time between the first and the last failure in the
   history, and k is the number of failures in the history.  If k is
   smaller than K, the estimate is computed as if there was a failure at
   the current time.  It has been shown that an estimate calculated in a
   similar manner is accurate within 17% of the real value of U [20].

   The size of the failure history K affects the accuracy of the
   estimate of U. One can increase the accuracy by increasing K.
   However, this has the side effect of decreasing responsiveness to
   changes in the failure rate.  On the other hand, a small history size
   may cause a peer to overreact each time a new failure occurs.  In
   [20], K is set 25% of the routing table size.

6.8.2.2.1.  Estimating Join Rate

   Reference [20] proposes that a peer can estimate the peer join rate
   based on the uptime of the peers in its routing table.  An increase
   in peer join rate will be reflected by a decrease in the average age
   of peers in the routing table.  Thus, each peer MUST maintain an
   array of the ages of the peers in its routing table sorted in
   increasing order.  Using this information, an estimate of the global
   peer join rate L MUST be calculated as:

                               N           1
                          L = --- * ---------------,
                               4     Ages[rsize/4]


   where Ages is an array containing the ages of the peers in the
   routing table sorted in increasing order and rsize is the size of the
   routing table.  Only the ages of the 25% of the youngest peers in the
   routing table SHOULD be used to reduce the bias that a small number
   of peers with very old ages can cause [20].  It has been shown that
   the estimate obtained by using this method is accurate within 22% of
   the real join rate [20].  Of course, the size of the routing table
   affects the accuracy.

   In order for this mechanism to work, peers need to exchange
   information about the time they have been present in the overlay.
   Peers learn the uptimes of their successors and predecessors when



Maenpaa, et al.          Expires January 7, 2010               [Page 35]


Internet-Draft         Topology Plug-in for RELOAD             July 2009


   adding the successors and predecessors to their routing tables since
   Update requests and answers that are of type 'notify' carry uptime
   values.  Peers learn the uptimes of their fingers because the Probe
   responses sent as part of the finger stabilization routine carry
   uptime values.  A joining peer learns the admitting peer's uptime
   since an Update request of type 'full' contains uptime information.

6.8.2.2.2.   Calculating the Stabilization Interval

   According to [21], a Chord network in a ring-like state remains in a
   ring-like state as long as peers send Omega(log2^2(N)) messages
   before N new peers join or N/2 peers fail.  We can use the estimate
   of peer failure rate, U, to calculate the time Tf in which N/2 peers
   fail:

                                  1
                          Tf = --------
                                 2*U


   Based on this estimate, a stabilization interval Tstab-1 is
   calculated as:

                                        Tf
                          Tstab-1 = -----------
                                     log2^2(N)


   Further, the estimated join rate L can be used to calculate the time
   in which N new peers join the overlay.  Based on the estimate of L, a
   stabilization interval Tstab-2 is calculated as:

                                           N
                          Tstab-2 = ---------------
                                     L * log2^2(N)


   Finally, the actual stabilization interval Tstab that SHOULD be used
   can be obtained by taking the minimum of Tstab-1 and Tstab-2.

   The results obtained in [26] indicate that making the stabilization
   interval too small has the effect of making the overlay less stable
   (e.g., in terms of detected loops and path failures).  Thus, a lower
   limit should be used for the stabilization period.  Based on the
   results in [26], a lower limit of 15s is proposed, since using a
   stabilization period smaller than this will with a high probability
   cause too much traffic in the overlay.




Maenpaa, et al.          Expires January 7, 2010               [Page 36]


Internet-Draft         Topology Plug-in for RELOAD             July 2009


7.  Security Considerations

   One important concern for the virtual servers approach is that it is
   not easily enforceable.  Given a set of virtual serverIDs for a
   physical node, it must be ensured that the physical node actually
   takes ownership of all the virtual serverIDs assigned to it.  In the
   presence of a centralized enrollment server, this server can ensure
   that each physical node gets its share of the number of virtual
   servers.  In the absence of the enrollment server, the verification
   can be performed during the join process.  For instance, in the case
   of Chord, when a new node joins the overlay, it contacts its
   neighbors to obtain neighbor relations and finger table entries.  A
   similar approach can be adopted in the case of virtual servers
   approach.  In this scenario, each physical node provides its
   neighbors a list of 2 log2(N) virtual server IDs.  The neighbors
   first verify that the number of virtual server locations received is
   close to the number of virtual server locations that it owns before
   sending the finger table entries to the joining node.  In this way,
   it can be ensured that each node has O(log2(N)) virtual locations on
   the overlay.


8.  IANA Considerations

   (a) A new overlay algorithm type should be defined for the proposed
   new topology plug-in.

   (b) This document defines one new Probe Information Type value:

                    +-----------------+------+---------------+
                    | Probe Option    | Code | Specification |
                    +-----------------+------+---------------+
                    | uptime          |    3 |      RFC-AAAA |
                    +-----------------+------+---------------+


   (c) Other IANA considerations are TBD.


9.  Acknowledgments

   This document benefited from design discussions with Vidya Narayanan
   from Qualcomm Inc.


10.  Appendix

   This appendix lists a few performance results of the load balancing



Maenpaa, et al.          Expires January 7, 2010               [Page 37]


Internet-Draft         Topology Plug-in for RELOAD             July 2009


   solution proposed in this document.

10.1.  Comparison of the Load Balancing Algorithm with Chord

   Compared to the virtual markers approach in [7], the solution
   proposed in this document does not require O(log2(N)) nodes to change
   their location upon each node arrival.  For instance, in the case of
   the virtual markers approach, if O(log2(N)) nodes change location,
   then O(log2(N)/N) data objects need to be reassigned and at least O(2
   log2(N)) nodes are involved in this step.  This is in addition to the
   messages required for updating routing tables which is O(log2^3(N)).
   In contrast to the virtual markers approach, the proposed solution
   requires O(1/N) data objects to be reassigned and around O(log2(N))
   nodes send data to the joining node.  The number of routing messages
   required is still O(log2^2(N)) similar to the case of Chord with no
   virtual servers; this is because the number of connections is still
   O(log2(N)) links per node.

   We performed some simulation studies to compare the performance of
   the solution in this document with Chord.  In order to study the
   percentage of nodes with significant load imbalance, we look at the
   q-percentile load imbalance factor defined as the ratio of the
   q-percentile load and the average load.  For example, a 99-percentile
   imbalance factor of 5 implies that less than 1 percent of nodes in
   the system have a load more than the value 5 times the average load.

   We tested the solution in this document under N = 2^(15) and compared
   the results with Chord.  We found that the 99-percentile imbalance
   factor for the solution in this document is around 2.  In comparison,
   our results with Chord suggest that the 100-percentile imbalance
   factor is around 9, the 99-percentile imbalance factor is around 4.5
   and so on.  This result suggests that in Chord, around 1% of the
   nodes have an imbalance between 4.5 and 9, 2% have an imbalance
   greater than 4 and so on compared to the solution in this document
   where less than 1% of the nodes has an imbalance greater than 2.

   With regard to routing, our simulations compared the average route
   length for Chord (with one virtual server) with this document's
   solution (2 log2(N)) servers; the results showed that the both these
   DHT implementations require around the same number of hops.

10.2.  Performance of the Load Balancing Algorithm as Network Grows

   In this section, we take a closer look at the performance of the
   solution in this document as network size grows in terms of load
   imbalance factor and routing state maintenance.





Maenpaa, et al.          Expires January 7, 2010               [Page 38]


Internet-Draft         Topology Plug-in for RELOAD             July 2009


   (1)  Load Imbalance: In our simulations, we fix the value of the
        estimated network size to be Nest = 2^9, and chose alpha = 18
        and delta = 2^(-9).  The actual network size is then increased
        from 2^2 to 2^(14).  When N < Nest, the spacing between virtual
        servers (chosen as 1/Nest) is very small and the solution in
        this document becomes similar to Chord.  In this case, the load
        imbalance factor is of the order of O(log2(N)).  However, since
        the numerical value of N is small, the imbalance factor is still
        low and around 3 for most nodes as demonstrated by our
        simulations.  As the value of network size, N, increases, the
        spacing, delta, comes into effect and the virtual servers help
        balance the load.  Our simulations indicate that the imbalance
        is around its lowest value of 2 when N = Nest, and increases
        slowly as N becomes larger than Nest.

   (2)  Routing State: Here, we examine the performance of this solution
        with regard to the amount of routing state that it needs to
        maintain as the network grows.  We analyze the performance of
        the solution in this document analytically to determine the
        number of hops required to reach a destination as a function of
        N and Nest.  We perform the analysis in two steps:

        *  [Step 1] Routing within a distance of 1/N from destination:
           If nodes in this solution employ only their primary virtual
           servers (and the corresponding O(log2(N)) connections) for
           routing as in the case of Chord, it can be shown that any
           discretization built upon the solution in this document would
           be able to get within a distance of 1/N from the destination
           node in O(log2(N)) hops with O(log2(N)) fingers per node.
           This results is unaffected by the relative values of N and
           Nest.

        *  [Step 2] Last-mile: The number of hops required to route
           messages from within a distance of 1/N to the exact
           destination (referred to as the last-mile problem) would
           depend upon the exact realization of DHT and its construction
           of neighbor tables.  In this document, each node has alpha
           virtual servers chosen uniformly at random separated by a
           distance O(delta*2^m).  Therefore, the number of nodes within
           a 1/N distance from a chosen location would be of the order
           of O(log2(N) + N*log2(1+s alpha delta)), where 's' is a
           constant independent of N. If alpha and delta are chosen
           using Nest, then the number of nodes within a 1/N distance
           can be approximated to be of the order of O(log2(N) + N*
           log2(Nest)/Nest) for large Nest.  Therefore, a random walk
           would lead to the final destination within O(log2(N) + N*
           log2(Nest)/Nest) hops.




Maenpaa, et al.          Expires January 7, 2010               [Page 39]


Internet-Draft         Topology Plug-in for RELOAD             July 2009


        This result indicates that when N < Nest, the solution in this
        document performs similar to Chord and messages can be routed to
        any node in O(log2(N)) hops by maintaining O(log2(N)) fingers
        and O(log2(N)) successors.  On the other hand, when N is much
        larger compared to Nest, O(N) hops might be required to reach
        the destination if the number of fingers per node is O(log2(N)).
        However, this situation arises only when the network size
        estimation is very much lower than N and the value of alpha and
        delta are not updated as the network grows.


11.  References

11.1.  Normative References

   [1]   Jennings, C., Lowekamp, B., Rescorla, E., Baset, S., and H.
         Schulzrinne, "REsource LOcation And Discovery (RELOAD) Base
         Protocol", draft-ietf-p2psip-base-02 (work in progress),
         March 2009.

   [2]   Bradner, S., "Key words for use in RFCs to Indicate Requirement
         Levels", BCP 14, RFC 2119, March 1997.

   [3]   Yongchao, S., Jiang, X., Even, R., and D. Bryan, "P2PSIP
         Overlay Diagnostics", draft-ietf-p2psip-diagnostics-01 (work in
         progress), June 2009.

11.2.  Informative References

   [4]   Stoica, I., Morris, R., Karger, D., Kaashoek, M., and H.
         Balakrishnan, "Chord: A scalable peer-to-peer lookup service
         for internet applications",  In Proc. of the ACM SIGCOMM, 2001.

   [5]   Li, J., Strinbling, J., Gil, T., and M. Kaashoek, "Comparing
         the performance of distributed hash tables under churn",  In
         Proc. of the 3rd International Workshop on Peer-to-Peer
         Systems, 2004.

   [6]   Bienkowski, M., Korzeniowski, M., and F. auf der Heide,
         "Dynamic load balancing in distributed hash tables",  In Proc.
         of IPTPS, 2005.

   [7]   Karger, D. and M. Ruhl, "Simple efficient load balancing
         algorithms for peer-to-peer systems",  In 3rd International
         Workshop on Peer-to-Peer Systems (IPTPS), 2004.

   [8]   Awerbuch, B. and C. Scheideler, "Group spreading: A protocol
         for provably secure distributed name service",  In Proc. of the



Maenpaa, et al.          Expires January 7, 2010               [Page 40]


Internet-Draft         Topology Plug-in for RELOAD             July 2009


         31st Int. Colloquium on Automata, Languages, and Programming
         (ICALP), 2004.

   [9]   Fraigniaud, P. and C. Gavoille, "The content-addressable
         network d2b",  Technical Report 1349, LRI, Univ. Paris-Sud,
         France, 2003.

   [10]  Kaashoek, F. and D. Karger, "Koorde: A simple degree-optimal
         hash table",  In Proc. 2nd International Workshop on Peer-to-
         Peer Systems (IPTPS), 2003.

   [11]  Naor, M. and U. Wieder, "Novel architectures for peer to peer
         applications: the continuous-discrete approach.",  In Proc. of
         the 15th ACM Symp. on Parallel Algorithms and Architectures
         (SPAA), 2003.

   [12]  Byers, J., Considine, J., and M. Mitzenmacher, "Simple load
         balancing for distributed hash tables",  In 2nd International
         Workshop on Peer-to-Peer Systems (IPTPS), 2003.

   [13]  Manku, G., "Routing Networks for Distributed Hash Tables",  In
         Proc. of the Principles of Distributed Computing (PODC), 2003.

   [14]  Godfrey, B. and I. Stoica, "Heterogenity and load balance in
         Distributed Hash Tables",  IEEE INFOCOM, 2005.

   [15]  Godfrey, B., Lakshminarayanan, K., Surana, S., Karp, R., and I.
         Stoica, "Load balancing in dynamic structured peer to peer
         systems",  In 23rd Conference of the IEEE Communications
         Society (INFOCOM), 2004.

   [16]  Rao, A., Lakshminarayanan, K., Surana, S., Karp, R., and I.
         Stoica, "Load balancing in structured peer to peer systems",
          In 2nd International Workshop on Peer-to-Peer Systems (IPTPS),
         2004.

   [17]  Rhea, S., Geels, D., Roscoe, T., and J. Kubiatowicz, "Handling
         churn in a DHT",  In Proc. of the USENIX Annual Techincal
         Conference, 2004.

   [18]  Krishnamurthy, S., El-Ansary, S., Aurell, E., and S. Haridi,
         "Comparing maintenance strategies for overlays",  In Proc. of
         16th Euromicro Conference on Parallel, Distributed and Network-
         Based Processing, 2008.

   [19]  Stoica, I., Morris, R., Liben-Nowell, D., Karger, D., Kaashoek,
         M., Dabek, F., and H. Balakrishnan, "Chord: A scalable peer-to-
         peer lookup protocol for internet applications",  IEEE/ACM



Maenpaa, et al.          Expires January 7, 2010               [Page 41]


Internet-Draft         Topology Plug-in for RELOAD             July 2009


         Transactions on Networking, 2003.

   [20]  Ghinita, G. and Y. Teo, "An adaptive stabilization framework
         for distributed hash tables",  20th International Parallel and
         Distributed Processing Symposium, 2006.

   [21]  Liben-Nowell, D., Balakrishnan, H., and D. Karger,
         "Observations on the dynamic evolution of peer-to-peer
         networks",  In Proc. of the First International Workshop on
         Peer-to-Peer Systems, 2002.

   [22]  Mahajan, R., Castro, M., and A. Rowstron, "Controlling the cost
         of reliability in peer-to-peer overlays",  In Proceedings of
         the 2nd International Workshop on Peer-to- Peer Systems, 2003.

   [23]  Horowitz, K. and D. Malkhi, "Estimating Network Size from Local
         Information",  Information Processing Letters, Volume 88, Issue
         5, pp. 237-243, 2003.

   [24]  Kostoulas, D., Psaltoulis, D., Gupta, I., Birman, K., and A.
         Demers, "Decentralized schemes for size estimation in large and
         dynamic groups",  Fourth IEEE International Symposium on
         Network Computing and Applications, pp. 41-48, 2005.

   [25]  Binzenhofer, A., Kunzmann, G., and R. Henjes, "A scalable
         algorithm to monitor chord-based peer to peer systems at
         runtime",  20th International Parallel and Distributed
         Processing Symposium, 2006.

   [26]  Maenpaa, J. and G. Camarillo, "A study on maintenance
         operations in a Chord-based Peer-to-Peer Session Initiation
         Protocol overlay network",  Accepted to Sixth International
         Workshop on Hot Topics in P2P Systems (HotP2P 2009), 2009.

   [27]  Ktari, S., Zoubert, M., Hecker, A., and H. Labiod, "Performance
         evaluation of replication strategies in DHTs under churn",  In
         Proc. of the 6th International Conference on Mobile and
         Ubiquitous Multimedia, 2007.













Maenpaa, et al.          Expires January 7, 2010               [Page 42]


Internet-Draft         Topology Plug-in for RELOAD             July 2009


Authors' Addresses

   Jouni Maenpaa
   Ericsson
   Hirsalantie 11
   Jorvas 02420
   Finland

   Email: Jouni.Maenpaa@ericsson.com


   Ashwin Swaminathan
   Qualcomm, Inc.
   5775 Morehouse Dr
   San Diego, CA
   USA

   Phone: +1 858-845-8775
   Email: sashwin@qualcomm.com


   Saumitra M. Das
   Qualcomm, Inc.
   3195 Kifer Road
   Santa Clara, CA
   USA

   Phone: +1 408-533-9529
   Email: saumitra@qualcomm.com


   Gonzalo Camarillo
   Ericsson
   Hirsalantie 11
   Jorvas 02420
   Finland

   Email: Gonzalo.Camarillo@ericsson.com


   Jani Hautakorpi
   Ericsson
   Hirsalantie 11
   Jorvas 02420
   Finland

   Email: Jani.Hautakorpi@ericsson.com




Maenpaa, et al.          Expires January 7, 2010               [Page 43]

Document	Document type	Expired Internet-Draft (individual) Expired & archived
	Select version	00
	Authors	Jouni Maenpaa , Ashwin Swaminathan , Saumitra Das , Gonzalo Camarillo , Jani Hautakorpi Email authors
	RFC stream	(None)
	Intended RFC status	(None)
	Other formats	txt pdf bibtex bibxml