
Basics and Definitions


5.2. Basics and Definitions

5.2.1. Model of the P2P network

As this chapter is mainly concerned with network topologies and routing strategies in P2PKM systems, we abstract from the details of a P2PKM system implementation such as elaborated in the previous chapter.

We assume that the following abstractions hold for the system under consideration (similar to (Haase et al., 2004)):

Each peer stores a set of content items, e.g., entities in a knowledge base. On these content items, there exists a similarity function $\mathrm{sim}$ which can be used to determine the similarity of content items to each other. We assume $d := 1 - \mathrm{sim}$ to be a metric in the mathematical sense, i.e., for all content items $x, y, z$, the following hold: $d(x, x) = 0$, $d(x, y) = d(y, x)$, $d(x, z) \le d(x, y) + d(y, z)$. The particular set of content items used in this chapter will be entities from an ontology with the related metric such as described in Section 3.4.

Each peer provides a self-description of what it contains, in the following referred to as expertise. Expertises need to be much smaller than the knowledge bases they describe, as they are transmitted over the network and used in other peers' routing tables. In our case, the expertise consists of a content item selected as representative for the peer, but in general, the expertise could also include peer metadata like query languages supported, additional capabilities of the peer etc. As peer expertises are content items, they can be compared to each other and to queries using the $\mathrm{sim}$ function.

There is a relation knows on the set of peers. Each peer knows about a certain set of other peers, i. e., it knows their expertises and network address (IP, JXTA ID). This corresponds to the routing index as proposed by Crespo and Garcia-Molina (2002). In order to account for the limited amount of memory and processing power, the size of the routing index at each peer is limited.

Sometimes it is more convenient to talk about the network in terms of graph theory. One can view the P2P network as a directed graph $G(V, E)$ with a set $V$ of nodes and a set $E \subseteq V \times V$ of edges, where each peer $P$ constitutes a node in $V$, and $(P_1, P_2) \in E$ iff $\mathit{knows}(P_1, P_2)$. We will use both notations synonymously.

Peers query for content items on other peers by sending query messages to some or all of their neighbors; these queries are forwarded by peers according to some query routing strategy. Using the $\mathrm{sim}$ function mentioned above, queries can thus be compared to content items and to peers' expertises.
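The abstractions above can be made concrete with a small sketch. It is only an illustration under stated assumptions: content items are modeled as numbers in [0, 1], the similarity is the toy function 1 - |x - y| (so that d = 1 - sim is a metric), and the names `Peer`, `learn`, `max_index_size`, and `edges` are hypothetical and do not come from the system described here.

```python
from dataclasses import dataclass, field


def sim(x: float, y: float) -> float:
    """Toy similarity on [0, 1] (assumption); d = 1 - sim equals |x - y|, a metric."""
    return 1.0 - abs(x - y)


@dataclass
class Peer:
    """A peer with an expertise and a size-limited routing index (hypothetical layout)."""
    name: str
    expertise: float          # stand-in for the representative content item
    max_index_size: int = 3   # assumed bound on the routing index size
    routing_index: dict = field(default_factory=dict)  # known peer name -> expertise

    def learn(self, other: "Peer") -> bool:
        """Record `other` in the routing index if there is room; return success."""
        if other.name == self.name:
            return False
        is_new = other.name not in self.routing_index
        if is_new and len(self.routing_index) >= self.max_index_size:
            return False
        self.routing_index[other.name] = other.expertise
        return True


def edges(peers):
    """Edge set E of the directed graph: (P1, P2) is in E iff knows(P1, P2)."""
    return {(p.name, q) for p in peers for q in p.routing_index}


# Three peers; note that knows is not symmetric: a knows c, but c knows nobody.
a, b, c = Peer("a", 0.1), Peer("b", 0.2), Peer("c", 0.9)
a.learn(b)
a.learn(c)
b.learn(a)
print(sorted(edges([a, b, c])))
```

The routing index stores only names and expertises, which mirrors the requirement that expertises be much smaller than the knowledge bases they describe.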

5.2.2. (Weighted) Clustering Coefficients

One observation about small-world networks found in many areas such as sociology or biology is that there are clusters of nodes. This means, loosely speaking, that for each node, its neighbors are likely to be connected directly themselves.

More formally, the clustering coefficient for a node $v$ has been defined by Watts (1999) as the fraction of possible edges in the neighborhood of a node which are actually present. We slightly modify that definition to use a directed graph as our knows relation may be asymmetric.

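$$\gamma_v = \frac{1}{k_v (k_v - 1)} \sum_{w \in \Gamma(v)} \left| \{ u \in \Gamma(v) : (w, u) \in E \} \right| \tag{5.1}$$

where $\Gamma(v)$ are the nodes pointed to by $v$, not including $v$:

$$\Gamma(v) = \{ u \in V \setminus \{v\} : (v, u) \in E \} \tag{5.2}$$

and $k_v = |\Gamma(v)|$ is the size of the neighborhood. As $k_v (k_v - 1)$ in Equation 5.1 is the maximum number of edges possible in the neighborhood, $\gamma_v$ takes on values between 0 and 1.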

The clustering coefficient $\gamma(G)$ of a graph is the mean of the clustering coefficient over all nodes.
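As an illustration of the definition, the following sketch computes the directed clustering coefficient on a small example graph. It rests on a few assumptions that the text does not state: the neighborhood of a node is taken to be the nodes it points to, self-loops are ignored, and a node with fewer than two neighbors is assigned the value 0, since the fraction is undefined there. The function names are hypothetical.

```python
def out_neighbors(edges, v):
    """Neighborhood of v: the nodes pointed to by v, excluding v itself."""
    return {u for (s, u) in edges if s == v and u != v}


def clustering_coefficient(edges, v):
    """Fraction of possible directed edges among v's neighbors that are present.

    Returns 0.0 when v has fewer than two neighbors (an assumption).
    """
    gamma = out_neighbors(edges, v)
    k = len(gamma)
    if k < 2:
        return 0.0
    # Count ordered pairs (w, u) of distinct neighbors with an edge w -> u.
    present = sum(1 for w in gamma for u in gamma if u != w and (w, u) in edges)
    return present / (k * (k - 1))


def graph_clustering_coefficient(nodes, edges):
    """Mean of the per-node clustering coefficients over all nodes."""
    return sum(clustering_coefficient(edges, v) for v in nodes) / len(nodes)


# Example: a points to b, c, d; three of the six possible edges among them exist.
nodes = ["a", "b", "c", "d"]
E = {("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("c", "b"), ("d", "b")}
print(clustering_coefficient(E, "a"))
print(graph_clustering_coefficient(nodes, E))
```

In the example, node a has three neighbors and therefore six possible directed edges among them; three are present, while b, c, and d each have a single neighbor and contribute 0 to the mean.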

In the following, we extend this notion to a weighted clustering coefficient $\gamma^w$. The motivation for this is that we do not only want to capture how densely connected the neighborhood of each peer is, but also if the neighbors have contents similar to that of the respective peer:

$$\gamma_v^w = \frac{1}{k_v (k_v - 1)} \sum_{w \in \Gamma(v)} \mathrm{sim}(v, w) \, \left| \{ u \in \Gamma(v) : (w, u) \in E \} \right| \tag{5.3}$$

This means that for the weighted clustering coefficient of node $v$, each edge from a neighbor $w$ counts only as much as the similarity between $w$ and $v$.
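A minimal sketch of the weighted variant follows. As before, it is an illustration under assumptions rather than the implementation used in this work: expertises are modeled as numbers in [0, 1], the similarity is the toy function 1 - |x - y|, the neighborhood consists of the nodes a node points to, self-loops are ignored, and nodes with fewer than two neighbors are assigned 0. All names are hypothetical.

```python
def sim(x, y):
    """Toy similarity on [0, 1] (assumption, not the ontology-based measure)."""
    return 1.0 - abs(x - y)


def weighted_clustering_coefficient(edges, expertise, v):
    """Clustering coefficient of v where each edge leaving a neighbor w
    is weighted by the similarity between v and w.

    Returns 0.0 when v has fewer than two neighbors (an assumption).
    """
    gamma = {u for (s, u) in edges if s == v and u != v}
    k = len(gamma)
    if k < 2:
        return 0.0
    total = 0.0
    for w in gamma:
        # Edges from neighbor w to other neighbors of v.
        links = sum(1 for u in gamma if u != w and (w, u) in edges)
        total += sim(expertise[v], expertise[w]) * links
    return total / (k * (k - 1))


E = {("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("c", "b"), ("d", "b")}
expertise = {"a": 0.5, "b": 0.5, "c": 0.25, "d": 1.0}
print(weighted_clustering_coefficient(E, expertise, "a"))
```

In this example the unweighted coefficient of a would be 3/6 = 0.5. The weighted value is lower, (1.0 + 0.75 + 0.5)/6 = 0.375, because two of the three neighbors differ in content from a.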

The weighted clustering coefficient is related to the observation that in actual small-world networks where there is a notion of similarity between nodes, nodes are not only surrounded by dense neighborhoods. Beyond the density of the neighborhood, the neighbor nodes of a particular node tend to be similar to the node under consideration. In a social network of humans, for example, you are likely to find people of common interests in these clusters. With the above definitions, we have $0 \le \gamma_v^w \le 1$. Large values of $\gamma_v^w$ mean that $v$ is surrounded by a dense neighborhood of similar nodes.

Note that other weighted clustering coefficients have been defined (Barrat et al., 2004; Barthelemy et al., 2004; Schank and Wagner, 2005) which do not express the same intention as the one defined here.

5.2.3. Characteristic Path Length

The characteristic path length $L$ is a measure for the mean distance between nodes in the network. It is defined by Watts (Watts, 1999; p. 29) as follows: "The characteristic path length ($L$) of a graph is the median of the means of the shortest path lengths connecting each vertex $v \in V(G)$ to all other vertices. That is, calculate $d(v, j)\ \forall j \in V(G)$ and find $d_v$ for each $v$. Then define $L$ as the median of $\{d_v\}$." Here, $d(v, j)$ is the number of edges on the shortest path from $v$ to $j$, and $d_v$ is the average of $d(v, j)$ over all $j \in V \setminus \{v\}$.

For reasons of efficiency, we use the sampling technique proposed by Watts (take a sample $\{v_1, \ldots, v_m\} \subset V$ for some $m < |V|$, compute the mean distance $d_{v_i}$ for each, take the median of mean distances as $L$) to estimate $L$. Note again that in contrast to (Watts, 1999), we consider our network to be directed.
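The sampling estimate can be sketched as follows, using breadth-first search along directed edges. Several details are assumptions made for illustration only: the graph is given as an adjacency dictionary in which every node appears as a key, nodes unreachable from a sampled node are left out of that node's mean, and the function names and the `rng` parameter are hypothetical.

```python
import random
from collections import deque
from statistics import mean, median


def shortest_path_lengths(adj, source):
    """BFS distances (number of edges) from source along directed edges."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for u in adj.get(v, ()):
            if u not in dist:
                dist[u] = dist[v] + 1
                queue.append(u)
    return dist


def characteristic_path_length(adj, sample_size=None, rng=None):
    """Median over (sampled) nodes of the mean shortest-path length.

    Unreachable nodes are skipped when averaging (an assumption); raises
    statistics.StatisticsError if no sampled node reaches any other node.
    """
    nodes = list(adj)
    if sample_size is not None and sample_size < len(nodes):
        rng = rng or random.Random()
        nodes = rng.sample(nodes, sample_size)
    means = []
    for v in nodes:
        dist = shortest_path_lengths(adj, v)
        others = [d for u, d in dist.items() if u != v]
        if others:
            means.append(mean(others))
    return median(means)


# Directed cycle 0 -> 1 -> 2 -> 3 -> 0: every node has distances 1, 2, 3.
cycle = {0: [1], 1: [2], 2: [3], 3: [0]}
print(characteristic_path_length(cycle))
print(characteristic_path_length(cycle, sample_size=2, rng=random.Random(0)))
```

On the directed cycle, every node has mean distance 2 to the others, so both the exact computation and any sample yield the same value. On less regular graphs the sampled estimate will vary with the sample drawn.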

As the measurement of clustering coefficients and characteristic path lengths requires a global knowledge of the graph, these measures cannot be used directly by the peers to guide their routing and rewiring strategies. We will use them instead to evaluate the behavior of the P2P system from the outside.