Skip to main content

Agenda IETF117: dinrg
agenda-117-dinrg-00

Meeting Agenda Decentralization of the Internet Research Group (dinrg) RG
Date and time 2023-07-25 20:00
Title Agenda IETF117: dinrg
State Active
Other versions markdown
Last updated 2023-07-14

agenda-117-dinrg-00

DINRG Meeting at IETF-117 – 2023-07-25, 20:00 to 21:30 UTC

1 DINRG Chairs’ Presentation: Status, Updates Chairs 05 min
2 Let The Platforms Burn: Bringing Back the Good Fire of the Old Internet Cory Doctorow 30 min"churn", with peers continually arriving and departing. One objection to concerns about highly transient peers is that many applications use peers in well- connected parts of the network. The Tapestry authors analysed the impact of churn in a network of 1000 nodes [31]. Others opined that it is possible to maintain a robust DHT at relatively low cost [258]. Very few papers have quantitatively compared the resilience of Risson & Moors Expires September 3, 2007 [Page 8] Internet-Draft Survey of Research on P2P Search March 2007 structured systems. Loguinov, Kumar et al claimed that there were only two such works [24, 36]. The second criticism of structured systems is that they do not support keyword searches and complex queries as well as unstructured systems. Given the current file-sharing deployments, keyword searches seem more important than exact-match key searches in the short term. Paraphrased, "most queries are for hay, not needles" [61]. More recently, some have justifiably seen unstructured and structured proposals as complementary, and have devised hybrid models [259]. Their starting point was the observation that unstructured flooding or random walks are inefficient for data that is not highly replicated across the P2P network. Structured graphs can find keys efficiently, irrespective of replication. Castro et al proposed Structella, a hybrid of Gnutella built on top of Pastry [259]. Another design used structured search for rare items and unstructured search for massively replicated items [54]. However, the "structured versus unstructured routing" taxonomy is becoming less useful, for two reasons, Firstly, most "unstructured" proposals have evolved and incorporated structure. Consider the classic "unstructured" system, Gnutella [4]. For scalability, its peers are either ultrapeers or leaf nodes. This hierarchy is augmented with a query routing protocol whereby ultrapeers receive a hashed summary of the resource names available at leaf-nodes. Between ultrapeers, simple query broadcast is still used, though methods to reduce the query load here have been considered [260]. Secondly, there are emerging schema-based P2P designs [59], with super-node hierarchies and structure within documents. These are quite distinct from the structured DHT proposals. 1.3. Indexes and Queries Given that most, if not all, P2P designs today assume some structure, a more instructive taxonomy would describe the structure. In this survey, we use a database taxonomy in lieu of the networking taxonomy, as suggested by Hellerstein, Cooper and Garcia-Molina [23, 261]. The structure is determined by the type of index (Sections 2. , 3. and 4. ). Queries feature in lieu of routing (Section 5. ). The DHT algorithms implement a "semantic-free index" [216]. They are oblivious of whether keys represent document titles, meta-data, or text. Gnutella-like and schema-based proposals have a "semantic index". Index engineering is at the heart of P2P search methods. It captures a broad range of P2P issues, as demonstrated by the Search/Index Risson & Moors Expires September 3, 2007 [Page 9] Internet-Draft Survey of Research on P2P Search March 2007 Links model [261]. As Manber put it, "the most important of the tools for information retrieval is the index - a collection of terms with pointers to places where information about documents can be found"[262]. Sen and Wang noted that a "P2P network" usually consists of connections between hosts for application-layer signaling, rather than for the data transfer itself [263]. Similarly, we concentrate on the "signaled" indexes and queries. Our focus here is the dependability and adaptability of the search network. Static dependability is a measure of how well queries route around failures in a network that is normally fault-free. Dynamic dependability gives an indication of query success when nodes and data are continually joining and leaving the P2P system. An adaptable index accommodates change in the data and query distribution. It enables data independence, in that it facilitates changes to the data layout without requiring changes to the applications that use the data [23]. An adaptable P2P system can support rich queries for a wide range of applications. Some applications benefit from simple, semantic-free key lookups [264]. Others require more complex, Structured Query Language (SQL)-like queries to find documents with multiple keywords, or to aggregate or join query results from distributed relations [22]. 2. Index Types A P2P index can be local, centralized or distributed. With a local index, a peer only keeps the references to its own data, and does not receive references for data at other nodes. The very early Gnutella design epitomized the local index (Section 2.1. ). In a centralized index, a single server keeps references to data on many peers. The classic example is Napster (Section 2.2. ). With distributed indexes, pointers towards the target reside at several nodes. One very early example is Freenet (Section 2.3. ). Distributed indexes are used in most P2P designs nowadays - they dominate this survey. P2P indexes can also be classified as non-forwarding and forwarding. When queries are guided by a non-forwarding index, they jump to the node containing the target data in a single hop. There have been semantic and semantic-free one-hop schemes [138, 265, 266]. Where scalability to a massive number of peers is required, these schemes have been extended to two-hops [267, 268]. More common are the forwarding P2Ps where the number of hops varies with the total number of peers, often logarithmically. The related tradeoffs between routing state, lookup latency, update bandwidth and peer churn are critical to total system dependability. Risson & Moors Expires September 3, 2007 [Page 10] Internet-Draft Survey of Research on P2P Search March 2007 2.1. Local Index (Gnutella) P2Ps with a purely local data index are becoming rare. In such designs, peers flood queries widely and only index their own content. They enable rich queries - the search is not limited to a simple key lookup. However, they also generate a large volume of query traffic with no guarantee that a match will be found, even if it does exist on the network. For example, to find potential peers on the early instantiations of Gnutella, 'ping' messages were broadcast over the P2P network and the 'pong' responses were used to build the node index. Then small 'query' messages, each with a list of keywords, are broadcast to peers which respond with matching filenames [4]. There have been numerous attempts to improve the scalability of local-index P2P networks. Gnutella uses fixed time-to-live (TTL) rings, where the query's TTL is set less than 7-10 hops [4]. Small TTLs reduce the network traffic and the load on peers, but also reduce the chances of a successful query hit. One paper reported, perhaps a little too bluntly, that the fixed "TTL-based mechanism does not work" [67] To address this TTL selection problem, they proposed an expanding ring, known elsewhere as iterative deepening [29]. It uses successively larger TTL counters until there is a match. The flooding, ring and expanding ring methods all increase network load with duplicated query messages. A random walk, whereby an unduplicated query wanders about the network, does indeed reduce the network load but massively increases the search latency. One solution is to replicate the query k times at each peer. Called random k-walkers, this technique can be coupled with TTL limits, or periodic checks with the query originator, to cap the query load [67]. Adamic, Lukose et al. suggested that the random walk searches be directed to nodes with higher degree, that is, with larger numbers of inter-peer connections [269]. They assumed that higher-degree peers are also capable of higher query throughputs. However without some balancing design rule, such peers would be swamped with the entire P2P signaling traffic. In addition to the above approaches, there is the 'directed breadth-first' algorithm [29]. It forwards queries within a subset of peers selected according to heuristics on previous performance, like the number of successful query results. Another algorithm, called probabilistic flooding, has been modeled using percolation theory [270]. Several measurement studies have investigated locally indexed P2Ps. Jovanovic noted Gnutella's power law behaviour [70]. Sen and Wang compared the performance of Gnutella, Fasttrack [271] and Direct Connect [263, 272, 273]. At the time, only Gnutella used local data indexes. All three schemes now use distributed data indexes, with hierarchy in the form of Ultrapeers (Gnutella), Super-Nodes Risson & Moors Expires September 3, 2007 [Page 11] Internet-Draft Survey of Research on P2P Search March 2007 (FastTrack) and Hubs (Direct Connect). It was found that a very small percentage of peers have a very high degree and that the total system dependability is at the mercy of such peers. While peer up-time and bandwidth were heavy-tailed, they did not fit well with the Zipf distribution. Fortunately for Internet Service Providers, measures aggregated by IP prefix and Autonomous System (AS) were more stable than for individual IP addresses. A study of University of Washington traffic found that Gnutella and Kazaa together contributed 43% of the university's total TCP traffic [274]. They also reported a heavy- tailed distribution, with 600 external peers (out of 281,026) delivering 26% of Kazaa bytes to internal peers. Furthermore, objects retrieved from the P2P network were typically three orders of magnitude larger than web objects - 300 objects contributed to almost half of the total outbound Kazaa bandwidth. Others reported Gnutella's topology mismatch, whereby only 2-5% of P2P connections link peers in the same AS, despite over 40% of peers being in the top 10 ASes [65]. Together these studies underscore the significance of multimedia sharing applications. They motivate interesting caching and locality solutions to the topology mismatch problem. These same studies bear out one main dependability lesson: total system dependability may be sensitive to the dependability of high degree peers. The designers of Scamp translated this observation to the design heuristic, "have the degree of each node be of nearly equal size" [153]. They analyzed a system of N peers, with mean degree c.log(N), where link failures occur independently with probability e. If d>0 is fixed and c>(1+d)/(-log(e)) then the probability of graph disconnection goes to zero as N->infinity. Otherwise, if c<(1-d)/(-log(e)) then the probability of disconnection goes to one as N->infinity. They presented a localizer, which finds approximate minima to a global function of peer degree and arbitrary link costs using only local information. The Scamp overlay construction algorithms could support any of the flooding and walking routing schemes above, or other epidemic and multicasting schemes for that matter. Resilience to high churn rates was identified for future study. 2.2. Central Index (Napster) Centralized schemes like Napster [256] are significant because they were the first to demonstrate the P2P scalability that comes from separating the data index from the data itself. Ultimately 36 million Napster users lost their service not because of technical failure, but because the single administration was vulnerable to the legal challenges of record companies [275]. Risson & Moors Expires September 3, 2007 [Page 12] Internet-Draft Survey of Research on P2P Search March 2007 There has since been little research on P2P systems with central data indexes. Such systems have also been called 'hybrid' since the index is centralized but the data is distributed. Yang and Garcia-Molina devised a four-way classification of hybrid systems [276]: unchained servers, where users whose index is on one server do not see other servers' indexes; chained servers, where the server that receives a query forwards it to a list of servers if it does not own the index itself; full replication, where all centralized servers keep a complete index of all available metadata; and hashing, where keywords are hashed to the server where the associated inverted list is kept. The unchained architecture was used by Napster, but it has the disadvantage that users do not see all indexed data in the system. Strictly speaking, the other three options illustrate the distributed data index, not the central index. The chained architecture was recommended as the optimum for the music-swapping application at the time. The methods by which clients update the central index were classified as batch or incremental, with the optimum determined by the query-to-login ratio. Measurements were derived from a clone of Napster called OpenNap[277]. Another study of live Napster data reported wide variation in the availability of peers, a general unwillingness to share files (20-40% of peers share few or no files), and a common understatement of available bandwidth so as to discourage other peers from sharing one's link [202]. Influenced by Napster's early demise, the P2P research community may have prematurely turned its back on centralized architectures. Chawathe, Ratnasamy et al. opined that Google and Yahoo demonstrate the viability of a centralized index. They argued that "the real barriers to Napster-like designs are not technical but legal and financial" [61]. Even this view may be a little too harsh on the centralized architectures - it implies that they always have an upfront capital hurdle that is steeper than for distributed architectures. The closer one looks at scalable 'centralized' architectures, the less the distinction with 'distributed' architectures seems to matter. For example, it is clear that Google's designers consider Google a distributed, not centralized, file system [278]. Google demonstrates the scale and performance possible on commodity hardware, but still has a centralized master that is critical to the operation of each Google cluster. Time may prove that the value of emerging P2P networks, regardless of the centralized- versus-distributed classification, is that they smooth the capital outlays and remove the single points of failure across the spectra of scale and geographic distribution. Risson & Moors Expires September 3, 2007 [Page 13] Internet-Draft Survey of Research on P2P Search March 2007 2.3. Distributed Index (Freenet) An important early P2P proposal for a distributed index was Freenet [5, 71, 279]. While its primary emphasis was the anonymity of peers, it did introduce a novel indexing scheme. Files are identified by low-level "content-hash" keys and by "secure signed-subspace" keys which ensure that only a file owner can write to a file while anyone can read from it. To find a file, the requesting peer first checks its local table for the node with keys closest to the target. When that node receives the query, it too checks for either a match or another node with keys close to the target. Eventually, the query either finds the target or exceeds time-to-live (TTL) limits. The query response traverses the successful query path in reverse, depositing a new routing table entry (the requested key and the data holder) at each peer. The insert message similarly steps towards the target node, updating routing table entries as it goes, and finally stores the file there. Whereas early versions of Gnutella used breadth-first flooding, Freenet uses a more economic depth-first search [280]. An initial assessment has been done of Freenet's robustness. It was shown that in a network of 1000 nodes, the median query path length stayed under 20 hops for a failure of 30% of nodes. While the Freenet designers considered this as evidence that the system is "surprisingly robust against quite large failures" [71], the same datapoint may well be outside meaningful operating bounds. How many applications are useful when the first quartile of queries have path lengths of several hundred hops in a network of only 1000 nodes, per Figure 4 of [71]? To date, there has been no analysis of Freenet's dynamic robustness. For example, how does it perform when nodes are continually arriving and departing? There have been both criticisms and extensions of the early Freenet work. Gnutella proponents acknowledged the merit in Freenet's avoidance of query broadcasting [281]. However, they are critical on two counts: the exact file name is needed to construct a query; and exactly one match is returned for each query. P2P designs using DHTs, per Section 3. , share similar characteristics - a precise query yields a precise response. The similarity is not surprising since Freenet also uses a hash function to generate keys. However, the query routing used in the DHTs has firmer theoretical foundations. Another difference with DHTs is that Freenet will take time, when a new node joins the network, to build an index that facilitates efficient query routing. By the inventor's own admission, this is damaging for a user's first impressions [282]. It was proposed to download a copy of routing tables from seed nodes at startup, even though the new node might be far from the seed node. Freenet's slow Risson &
3 Ecosystem Evolution and Digital Infrastructure Policy Challenges: Insights & Reflections from an Economics Perspective Volker Stocker & William Lehr 20 min
4 Minimal Global Broadcast (MGB) Christian Tschudin 20 min
5 Wrap-up & Buffer All 15 min

Documents

Notes

Please remember that all sessions are being recorded.