Universität des Saarlandes
Naturwissenschaftlich-Technische Fakultät I
Fachrichtung Informatik
Diplomstudiengang Informatik
Diplomarbeit
A Comparative Study of Pub/Sub Methods in
Structured P2P Networks
submitted by
Sebastian Parkitny
on 27 September 2006
prepared under the supervision of
Prof. Gerhard Weikum
advised by
Matthias Bender, Sebastian Michel
reviewed by
Prof. Gerhard Weikum
Prof. Thorsten Herfet
Declaration

I hereby declare that I have written this thesis independently and have cited all sources used.

Saarbrücken, 27 September 2006
Sebastian Parkitny
Contents
1 Introduction
2 P2P Systems
  2.1 Unstructured P2P
  2.2 Structured P2P
3 Publish/Subscribe systems
  3.1 Topic-based Publish/Subscribe
  3.2 Content-based Publish/Subscribe
  3.3 Term-based Publish/Subscribe
  3.4 Type-based Publish/Subscribe
  3.5 Current Publish/Subscribe systems
4 Distributed Publish/Subscribe Algorithms
  4.1 Scribe
  4.2 Meghdoot
5 Introduction to Store-Sub and Store-Pub
  5.1 Terminology
  5.2 Store-Sub
  5.3 Store-Pub
6 Analysis of Store-Sub and Store-Pub
  6.1 Formulas
  6.2 Analysis
7 Design and Implementation of a Discrete Simulator
  7.1 Creating Synthetic Data
  7.2 Implementation
  7.3 Lessons learned
8 Experimental Results
  8.1 Average behaviour
  8.2 Comparison Simulator and Formulas
  8.3 Growing number of subscribers
  8.4 Adding publishers
9 Conclusion
List of Figures
2.1 Gnutella network topology
2.2 Chord ring with 8 nodes
2.3 Primitive lookup()
2.4 Chord ring with 8 nodes and finger table
3.1 Overview of general Publish/Subscribe system
3.2 Hierarchical addressing in topic-based Publish/Subscribe
3.3 Content-based Publish/Subscribe in stock markets
3.4 Type-based Publish/Subscribe
5.1 Overview Store-Sub
5.2 Overview Store-Pub
6.1 Message types in Store-Sub
6.2 Message types in Store-Pub
6.3 #Messages at variable publishing rate
6.4 #Messages at variable refresh interval
7.1 UML diagram of term-based simulator
7.2 Relationship of peer classes
7.3 Class Subscription
7.4 PublisherDB and SubscriberDB
7.5 Class PublisherList
7.6 Class DocumentDB
7.7 Class RandomGenerator
7.8 Class PublicProperties
7.9 Class BerkeleyDocumentDB
7.10 Class TopicDB
7.11 Class DocumentCollection
7.12 Class TopicDistribution
8.1 Publishing rate 1
8.2 Publishing rate 100
8.3 Comparison of Simulator and Formula with r = 1
8.4 Comparison of Simulator and Formula with r = 100
8.5 Varying number of subscribers
8.6 Varying number of publishers
List of Tables
5.1 Notation
6.1 Parameters for scenario with variable publishing rate
6.2 Parameters for scenario with variable subavg
6.3 Parameters for scenario with variable interval
7.1 Members of class Publisher
7.2 Members of class Subscription
7.3 Members of class PublisherList
7.4 Members of class DocumentDB
7.5 Members of class RandomGenerator
7.6 List of properties
7.7 Members of class BerkeleyDocumentDB
7.8 Members of class TopicDB
7.9 Members of class TopicDistribution
8.1 Parameters for scenario with variable publishing rate
8.2 Parameters for scenario with growing number of subscribers
8.3 Parameters for scenario with growing number of publishers
A shorter version of this work was accepted to the Fourth International Workshop on
Databases, Information Systems and Peer-to-Peer Computing (DBISP2P 2006) in Seoul,
Korea.
Zusammenfassung

The Publish/Subscribe paradigm is a common model for distributing content of all kinds between producers (publishers) and interested parties (subscribers). Various structured approaches have been designed and analyzed, but not compared with other approaches. We develop two Publish/Subscribe approaches based on a structured P2P network and mathematically analyze their complexity. Furthermore, we compare the two approaches under different choices of system parameters and associate the results with particular scenarios. Finally, we design a discrete simulator and present experimental measurements of the two approaches.
Abstract
Publish/Subscribe systems have become an established model for delivering content from publishers to subscribers. Many approaches built on top of a P2P network have been proposed and evaluated, but typically each approach is evaluated on its own. We identify two approaches to implementing Publish/Subscribe on a structured P2P network and provide a mathematical analysis of their complexity. Furthermore, we compare the two approaches for several choices of system parameters and associate the outcomes with certain usage scenarios. Thus, we can provide evidence of which approach is suitable for which scenario. Finally, we design and implement a discrete event simulator and present results of experimental measurements of both approaches.
1 Introduction
In the last few years the term "Peer-to-Peer" (P2P) has drawn much attention. P2P applications provide file sharing, distributed computing, internet-based telephony, and more. Applications for file sharing in particular have emerged rapidly because the Peer-to-Peer paradigm offers many more possibilities than the old-fashioned client-server architecture. As new applications arise, internet traffic also grows enormously: internet providers claim that currently more than 50% of all traffic is generated by Peer-to-Peer applications. This continuous growth of traffic implies a need to focus on the design and analysis of Peer-to-Peer applications. First, careful design can ensure the health and stability of such applications and make them more reliable, scalable and flexible. Second, analysis can give hints about the traffic generated by a Peer-to-Peer application and thus help to increase the quality of programs.
A popular subset of P2P implementations are distributed Publish/Subscribe applications. Publish/Subscribe is an asynchronous messaging paradigm that decouples publishers from subscribers. It is suitable for exchanging content of any kind and comes in several schemes, such as the content-based and topic-based schemes. There are several approaches to implementing and evaluating Publish/Subscribe systems, but none of them compares the proposed system to other approaches.
The goal of this thesis is to identify two different usage scenarios for Publish/Subscribe and two general approaches to implementing a Publish/Subscribe system on top of a structured P2P network. The approaches are called Store-Sub and Store-Pub and will be explained in detail. The complexity of the approaches is analyzed theoretically. Finally, we design and implement a discrete event simulator and present results of experimental measurements of the two approaches under different choices of system parameters. The results of the experiments are interpreted and associated with certain usage scenarios.
The remainder of the thesis is organized as follows. Chapter 2 explains the P2P paradigm and presents two general categories, namely unstructured and structured P2P networks. Chapter 3 illustrates the general Publish/Subscribe paradigm; afterwards, several Publish/Subscribe schemes are presented and some deployed implementations are explained. Chapter 4 presents some distributed Publish/Subscribe systems and explains the advantages of decentralized Publish/Subscribe. Chapter 5 introduces two approaches to implementing Publish/Subscribe on top of a P2P network; the approaches are called Store-Sub and Store-Pub and differ in the organization of data. Chapter 6 analyzes the two approaches mathematically and compares their behaviour and sensitivity in different usage scenarios. Chapter 7 presents the discrete event simulator in detail and clarifies the assumptions made in the implementation. Chapter 8 shows experiments made with the simulator, explains the outcomes and associates them with different usage scenarios. Additionally, the simulator is used to compare the outcomes of the mathematical analysis and the simulations. Finally, Chapter 9 concludes this work and points out future research directions.
2 P2P Systems
Today, P2P networks are one of the most popular classes in the field of internet applications. A tremendous number of applications use the P2P paradigm, ranging from instant messaging and VoIP¹ to file sharing and grid computing. Most of these applications are implemented as large-scale P2P networks with millions of participating users spread all over the world. All of them use special techniques and possibly implement additional paradigms to offer their services and fulfill the needs of the participating peers. P2P systems are also the focus of many research groups, which contribute in several ways but, unlike popular applications, do not necessarily deploy their results in real-world applications. According to internet providers, today more than 50% of internet traffic is generated by P2P applications; in some cases the share is even more than 75%.
Oram et al. [30] give the basic definition of the term Peer-to-Peer:
[a Peer-to-Peer system is] a self-organizing system of equal, autonomous en-
tities (peers) [which] aims for the shared usage of distributed resources in a
networked environment avoiding central services.
Following this definition, a P2P network consists of a set of peers participating in the network via an application. All peers are by definition equal; in essence, they are in a position to execute the same operations and must contribute in the same manner within the P2P network. These applications are mostly offered by open-source projects but are also available from commercial software providers.
The popular counterpart of the P2P paradigm is the client-server paradigm, which separates the client from the server. The client-server paradigm, or design pattern, is still used in the majority of internet applications. Typical representatives of this paradigm are, for example, HTTP [23], SSH [24], Telnet [22] and SMTP [31].

¹ Voice over Internet Protocol (VoIP) is the usual term for making phone calls over the internet via appropriate applications.
The most common example is HTTP. In HTTP there are web servers and clients. A web server offers various contents (websites, images, movies) that clients wish to retrieve. A client uses HTTP to connect to the server and ask for some specific content, and the server delivers the desired content via a separate HTTP connection or possibly the same one. This leads to high computational load on the server and is clearly a bottleneck of the client-server paradigm. Another drawback is the single point of failure: if the server is not able to respond, for whatever reason, the clients cannot retrieve any content. In a P2P network, however, a peer acts both as server and client; the word servent captures the fact that a peer is able to play both roles. Nowadays, the old-fashioned client-server paradigm struggles with a number of requirements which are, more or less, implicitly met by P2P applications. These main requirements can be identified as follows:
• Scalability
• Security and reliability
• Flexibility
P2P networks became popular because they meet these requirements better than client-server applications: they offer nearly unlimited scalability and much more flexibility and reliability. There are several kinds of P2P networks: P2P networks with servers, hybrid networks and pure P2P networks. Pure P2P networks have no servers, and no peer has an overview of the whole network. They are highly decentralized, and resources are spread across the peers in the network. A peer only knows those peers it interacts with and hence may only query and match resources published by a certain number of peers. Because there are no servers, there is no single point of failure, and hence such networks are more scalable. Peers may join and leave the network arbitrarily, so P2P networks change over time; they are highly dynamic artifacts. Since there are no dedicated servers, the contents, where they exist, are located at the peers.
One of the most popular applications is indeed the sharing of files between peers, not only legally but unfortunately also illegally. Famous examples of this kind of application are Napster [12, 11] (today a legal online music service), eDonkey2000 [6, 5], Gnutella [8, 7] and BitTorrent [1, 2]. These file-sharing networks can also be categorized into different generations of P2P networks according to the paradigms they pursue.
2.1 Unstructured P2P
An unstructured P2P network is, in general, a P2P network that has no structured overlay topology. Peers that would like to participate in the network do so by connecting to other network participants. Unstructured P2P networks were used in most of the early P2P applications. The file-sharing applications mentioned above (Napster, Gnutella, eDonkey2000, BitTorrent) are unstructured P2P applications and can be divided into two categories, namely centralized and decentralized P2P networks. Napster and eDonkey, for example, are centralized P2P file-sharing systems that need a central server to connect to; all search queries are routed to this central server. Consider a user A who would like to retrieve a freely available audio book that is offered by another user B in the P2P network. User A contacts the server and queries for the audio book. The server responds with a list of users offering it. User A can now contact user B from the reply list and request the audio book directly. At this point the server is no longer needed.
Fully decentralized systems like Gnutella and BitTorrent don't need any kind of central server to connect to or to transmit search queries. They use different techniques to emit search queries and receive responses. The most popular search technique in a decentralized, unstructured P2P network is flooding, which is explained in the remainder of this section.
Gnutella 0.4 is a network that consists of a large number of nodes. The nodes may be distributed all over the world and the network has no central entity. According to M.A. Jovanovic [28], the overlay topology can be visualized as a graph of network nodes. Figure 2.1 shows an example topology of a Gnutella 0.4 network. In a Gnutella network every peer only knows a certain number of peers, called its neighbours. Because no peer has a complete overview of the network, a peer cannot necessarily find everything it queries. This is due to the fact that the peers are arbitrarily connected to other peers and that flooding does not reach all peers. Some unstructured P2P networks try to improve the topology, and therefore the performance and scalability, e.g. by establishing direct connections only between geographically proximate peers, thus minimizing the geographic distance between them.
Figure 2.1: Gnutella network topology

Flooding works as follows. If a peer searches for a file, it queries its neighbours, which have been determined during bootstrapping. The query is conveyed by a message with additional fields. One of them is the TTL field, which defines the number of hops a message travels within the network. The message is sent from node to node, and every node decreases the TTL field until it reaches zero; the message is then not propagated any further, thus avoiding infinite cycles. The TTL thus defines a radius of nodes to be queried. It has been found that a TTL value of 7 is sufficient for most queries. The following formula states the method used in Gnutella 0.4:

TTL(0) = TTL(i) + Hops(i) ≤ 7     (2.1)

Here Hops(i) is the number of times the message has already been forwarded and TTL(i) is the TTL field within the message. TTL(0) is the value of the TTL field when the message has not been forwarded yet. With this formula the radius is defined.
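The TTL mechanism can be made concrete with a small sketch, which is illustrative only and not code from this thesis: the overlay is modelled as an adjacency dictionary, and the function name flood is an assumption for illustration.

```python
from collections import deque

def flood(network, start, ttl=7):
    """TTL-limited flooding: the query starts with the given TTL and every
    hop decreases it by one; once it reaches zero the message is dropped.
    Returns the set of peers the query reaches within that radius."""
    best = {start: ttl}                # highest remaining TTL seen per peer
    queue = deque([(start, ttl)])
    while queue:
        node, t = queue.popleft()
        if t == 0:
            continue                   # TTL exhausted: do not forward further
        for neighbour in network.get(node, ()):
            if best.get(neighbour, -1) < t - 1:
                best[neighbour] = t - 1
                queue.append((neighbour, t - 1))
    return set(best)

# A chain of peers 1-2-3-4-5: with ttl=2 a query from peer 1
# reaches only peers 1, 2 and 3, illustrating the limited radius.
chain = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
reached = flood(chain, 1, ttl=2)
```

This also illustrates the drawback discussed below: resources held by peers outside the TTL radius are never found, even though they exist in the network.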
Another pure P2P network is BitTorrent. BitTorrent pursues a slightly different paradigm than Gnutella: in essence, a provider of some resource can shift traffic that would otherwise be generated exclusively by itself to peers, thus avoiding the unavailability of its resources. The network uses trackers to define different kinds of resources. A client can use a tracker to download a resource and at the same time offer this resource to other peers interested in it. Like Gnutella, BitTorrent doesn't use any structured overlay to partition the resources offered by peers. This means that even in a steady state of the overlay it is not guaranteed that a peer finds what it desires, even if the resource is present in the network.
In unstructured P2P networks the following drawbacks can be identified. There may be a bottleneck when a server is used to respond to queries; scalability is limited and the server is a single point of failure if it cannot respond. In pure P2P networks, flooding causes a lot of traffic, and peers might not retrieve all the resources they query for because flooding is limited to a certain radius.
2.2 Structured P2P
Unlike unstructured P2P networks, structured P2P networks form a predefined structure. These networks make each peer responsible for some part of the overall content of all peers within the network. To this end, the peers form a so-called Distributed Hash Table (DHT). The DHT is constructed using a hash function that partitions the peers and the contents. The choice of the hash function is important for robustness and performance, because it determines how uniformly keys and peers (identified by their IP addresses) are distributed over the same identifier space. The hash function divides the search space into sections. A peer entering the network is assigned a section of the search space [0, 2^m − 1] and automatically becomes responsible for this section. The basic functionality of a DHT is the so-called lookup() function: given a key, it maps the key onto a network node. There are several implementations of DHTs that are all based on the same principles but differ in their search and partitioning strategies. Prominent examples of DHT implementations include Chord [4, 25], Pastry [13], Tapestry [3], P-Grid [16] and CAN [32] (Content Addressable Network). Most of these systems guarantee retrieval of a queried resource with complexity O(log(N)). Other operations, like adding new content or peers and handling failures, have complexity O(log(N)) or O(log²(N)). Therefore, unlike unstructured P2P networks, structured ones are highly scalable in the number of nodes. On top of such a DHT it is possible to implement any kind of P2P network.
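The partitioning idea can be sketched as follows: both content keys and peer addresses are hashed into the shared identifier space [0, 2^m − 1]. This is an illustrative sketch with assumed names, not code from any of the systems cited above.

```python
import hashlib

M = 16  # illustrative ring exponent; Chord, for instance, uses SHA-1's full 160 bits

def ring_id(value: str) -> int:
    """Hash a content key or a peer's address into the common
    identifier space [0, 2^m - 1] that the DHT partitions."""
    digest = hashlib.sha1(value.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (2 ** M)

# Peers and keys land in the same identifier space; the peer whose
# section of [0, 2^m - 1] contains a key's identifier is responsible
# for that key, and lookup(key) routes to exactly that peer.
peer_id = ring_id("10.0.0.7:6346")
key_id = ring_id("freely-available-audiobook")
```

Because the hash distributes identifiers uniformly, each peer ends up responsible for roughly the same share of keys, which is the load-balance property discussed below.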
According to Ion Stoica et al. [25], Chord reduces the complexity of designing P2P systems and applications by addressing the following problems:

Load balance By using a hash function, an even distribution of keys is naturally imposed.

Decentralization Chord is completely decentralized, without any kind of server or super node. Each node has the same responsibility, leading to a robust system.

Scalability The system scales logarithmically as it grows in the number of nodes.

Availability Even during continuous change as peers join and leave the network, Chord guarantees that the node responsible for a key can be found. Node failures are detected, so the system is self-organizing.

Flexible naming There are no constraints on the naming of keys; any kind of key can be mapped with the lookup() function.
The basic idea of Chord is the following. Chord uses consistent hashing to map keys onto the Chord ring. The hash function (SHA-1 [14]) assigns each key an m-bit identifier. To determine the identifier of a node, its IP address is hashed. Using this procedure, and maintaining several properties of m and the hash function, the computed identifiers are ordered on an identifier circle modulo 2^m. The parameter m is called the exponent of the ring, and the maximal number of participating peers is N = 2^m.

Keys are assigned to peers as follows: a key k is assigned to the first node whose identifier is equal to or follows k in the identifier space. This node is called successor(k), the successor node of k. Represented on the modulo circle, successor(k) is the first node clockwise from k. Figure 2.2 shows a Chord ring with 8 nodes and 6 keys; the keys are mapped onto their successors in the ring. For example, key S8 is mapped to peer P12, and keys S43 and S33 are mapped to peer P44.

Figure 2.2: Chord ring with 8 nodes

With this, a primitive lookup() can already be implemented. To look up a key k, a peer x asks its successor whether it is responsible for k. If so, the search is completed and the responsible peer is found. If not, y, the successor of x, routes the query to its own successor, and so on until the search is completed. This primitive lookup has a search complexity of O(N), where N is the number of nodes in the Chord ring. Figure 2.3 shows a primitive lookup() in a Chord ring: P12 is looking for key S57, and the query is routed over each node on the path around the ring; the red arrows show the successive queries.

To speed up query routing, each node in the Chord layer additionally has a routing table, called the finger table. The finger table contains a total of O(log(N)) entries; these are not only the successor and predecessor of a peer. Each peer knows a total of m other peers, and the entries of the finger table are called fingers. With this knowledge it is possible to search for a key in O(log(N)). Let p be a peer; then the i-th finger of p is the peer that follows p by at least 2^(i−1) in the ID space. Formula 2.2 shows the construction of the finger table for a peer:

x.finger[i] = successor(key_i), key_i = (x.ID + 2^(i−1)) mod 2^m, i ∈ [1, m]     (2.2)

Figure 2.4 shows a Chord ring with 8 nodes. The finger table of node P12 is displayed, and arrows illustrate the fingers of this node. If, for example, node P12 is looking for key S56, its finger table tells it that node P44 is the nearest known node for the sought-after key, so P12 redirects the query to P44. P44, by the same procedure, is able to return the result. This search strategy guarantees that any key can be retrieved with O(log(N)) messages. For a deeper insight into the mechanisms and techniques of Chord, the reader is referred to [4, 25].
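The successor mapping and finger-table routing can be sketched as follows. This is an illustrative global-knowledge model, not the thesis simulator; the class name, ring exponent and node IDs are assumptions chosen to mirror the figures.

```python
from bisect import bisect_left

M = 6  # ring exponent: identifier space [0, 2^m - 1] = [0, 63]

def in_interval(x, a, b):
    """True if x lies in the half-open ring interval (a, b]."""
    return a < x <= b if a < b else x > a or x <= b

class ChordRing:
    """Global-knowledge model of a Chord ring for counting routing hops."""

    def __init__(self, node_ids):
        self.nodes = sorted(node_ids)

    def successor(self, ident):
        """First node whose ID is equal to or follows ident clockwise."""
        i = bisect_left(self.nodes, ident % 2 ** M)
        return self.nodes[i % len(self.nodes)]

    def fingers(self, node):
        """finger[i] = successor(node + 2^(i-1)) for i = 1..m (Formula 2.2)."""
        return [self.successor(node + 2 ** (i - 1)) for i in range(1, M + 1)]

    def lookup(self, start, key, hops=0):
        """Route a query from node start towards successor(key) via fingers."""
        if start == self.successor(key):
            return start, hops         # start itself is responsible for key
        succ = self.successor(start + 1)
        if in_interval(key, start, succ):
            return succ, hops + 1      # the immediate successor owns key
        nxt = start
        for f in self.fingers(start):
            if in_interval(f, start, key):
                nxt = f                # keep the closest preceding finger
        return self.lookup(nxt, key, hops + 1)

# A ring resembling the figures: a lookup for key 56 started at node 12
# is redirected to node 44, which finds the responsible node 56 --
# two messages instead of walking every node as in the primitive lookup.
ring = ChordRing([1, 12, 20, 32, 44, 56])
owner, hops = ring.lookup(12, 56)
```

The model computes finger tables globally instead of maintaining them per node, but the routing decision at each step, namely forwarding to the closest finger preceding the key, is the one the text describes.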
Figure 2.3: Primitive lookup()
Figure 2.4: Chord ring with 8 nodes and finger table
3 Publish/Subscribe systems
A Publish/Subscribe system is based on the Publish/Subscribe paradigm, which has several ingredients: publishers, subscribers and an event notification service. Publishers are information providers that would like to publish some kind of content (HTML files, PDF files, etc.).
Subscribers, in general, are information consumers interested in content. Subscribers create subscriptions that filter out relevant content; a subscription expresses the interests of a subscriber. Subscriptions can consist of terms or predicates. A term is usually a word like football or Manchester, and a subscription can consist of a conjunction of terms like Manchester ∧ United ∧ football. A predicate is an expression like DAX ≤ 5000, and content relevant to this predicate must contain the appropriate DAX value. Unlike a simple query on a search engine, a subscription is a standing request: it remains stored until the user decides to delete it.
The event notification service is the part of the Publish/Subscribe system responsible for storing the subscriptions and filtering the relevant content. It offers an interface called subscribe() to add and store a subscription and an interface called unsubscribe() to delete a subscription. The filtering, or matching, is based on the subscriptions and the content: if, for example, a subscription is expressed with terms, only content that contains all the terms is relevant. After filtering the relevant content for a subscription, the event notification service notifies the corresponding subscriber about the event, the event being that relevant content for the subscription has been identified. The notification can be made by e-mail, by sending the content, etc.
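The subscribe()/unsubscribe() interfaces and term-conjunction matching described above can be sketched as a toy event notification service. This is illustrative only, with assumed names, and is not one of the systems studied in this thesis.

```python
class NotificationService:
    """Toy event notification service with term-based subscriptions:
    a subscription is a conjunction of terms, and published content
    matches only if it contains every term of the subscription."""

    def __init__(self):
        self.subscriptions = {}        # subscriber -> set of subscribed terms

    def subscribe(self, subscriber, terms):
        """Store a standing subscription; it remains until deleted."""
        self.subscriptions[subscriber] = set(terms)

    def unsubscribe(self, subscriber):
        """Delete the subscriber's subscription again."""
        self.subscriptions.pop(subscriber, None)

    def publish(self, content_terms):
        """Match published content against all stored subscriptions and
        return the subscribers to notify (e.g. by e-mail)."""
        content = set(content_terms)
        return sorted(s for s, t in self.subscriptions.items() if t <= content)

ens = NotificationService()
ens.subscribe("alice", ["Manchester", "United", "football"])
ens.subscribe("bob", ["DAX"])
# Content containing all three of alice's terms triggers her notification;
# bob's subscription does not match, so he is not notified.
matches = ens.publish(["football", "news", "Manchester", "United"])
```

Note that publishers never appear in the matching logic at all, which is exactly the decoupling of publishers from subscribers that the paradigm provides.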
A popular example is Google Alert [21], where news agencies like CNN or Spiegel play the role of publishers. They frequently put articles about different issues on their websites, and Google Alert indexes these articles. A user can express an interest in the form of keywords, the type of information (news, websites and user groups), the frequency of notifications and an e-mail address; this is the subscribe() interface. Google Alert also provides the appropriate unsubscribe() to delete subscriptions again.
The event notification service here is Google Alert, which is responsible for storing the subscriptions and notifying users about relevant content. Suppose a user's subscription is the words George Bush election. If CNN publishes a news story about the election and George Bush, Google Alert sends an e-mail to this user notifying them about the publication of this article. Other examples of Publish/Subscribe systems are mailing lists, sensor networks, user groups like USENET, etc.
These different shapes of Publish/Subscribe systems also imply different scenarios concerning the distribution of peers. Consider, for example, a scenario where users want to share their knowledge with each other, as in a wiki or a blog. Here, users are interested in events published by other users, or they even become publishers themselves and publish contributions like comments or new articles at a low rate. This is a user-to-user scenario: consider, for example, a blog where a user wants all new articles about the term VoIP; when a new article is published, the user is notified via e-mail. The counterpart is the publisher-to-user scenario, in which there are many subscribers and only a few publishers. The publishers publish content at a much higher rate to many more subscribers, and typically the publishers are distinct from the set of subscribers.
Figure 3.1: Overview of general Publish/Subscribe system

Figure 3.1 sketches an overview of a Publish/Subscribe system. By applying the Publish/Subscribe paradigm, the participating parties are decoupled in:
Space Publishers publish their events via the event notification service, which notifies the subscribers about matching events. Publishers do not know who has announced interest or how many entities are interested, and subscribers do not know who is publishing. The event notification service serves as a mediator between the two parties.

Time If a publisher publishes some content, the subscribers don't need to be present at the same time; from the subscriber's view, the publishers don't need to be present when the event notification service delivers the notification. The event notification service dispatches all upcoming events at the times at which consumers and producers are available.
The decoupling of publishers and subscribers reduces the complexity of applications
based on the Publish/Subscribe paradigm. It separates the process of production from
the process of consumption and makes the communication between the entities
asynchronous. This separation can be described as loose coupling and can be considered
an advantage of the paradigm.
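The decoupling in space described above can be made concrete with a minimal sketch. The class and method names below are illustrative assumptions, not part of any concrete system discussed in this thesis: the mediator stores (predicate, callback) pairs, and neither side ever learns about the other.

```python
# Minimal sketch (assumed names) of an event notification service that
# decouples publishers and subscribers in space.
class EventNotificationService:
    def __init__(self):
        self._subscriptions = []  # list of (predicate, callback) pairs

    def subscribe(self, predicate, callback):
        # The subscriber registers interest; it never learns who publishes.
        self._subscriptions.append((predicate, callback))

    def publish(self, event):
        # The publisher hands the event to the mediator; it never learns
        # who (or how many) subscribers are notified.
        for predicate, callback in self._subscriptions:
            if predicate(event):
                callback(event)

service = EventNotificationService()
received = []
service.subscribe(lambda e: "VoIP" in e["terms"], received.append)
service.publish({"terms": {"VoIP", "article"}})   # matches, delivered
service.publish({"terms": {"Sports"}})            # no match, dropped
```

Decoupling in time would additionally require the mediator to buffer events for subscribers that are temporarily offline; that is omitted here for brevity.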
In addition to the events and notifications already introduced, publishers may be able to
advertise upcoming events. The event notification service can offer an interface called
advertise() for this purpose. Subscribers can use the advertisements to learn
about further upcoming events. In addition, the advertisements help the
event notification service to adapt to future subscriptions and to optimize its opera-
tion.
Publish/Subscribe systems can be categorized into different subscription schemes. Each
of those schemes comes with its own functionality and algorithms. Many research
groups contribute optimized algorithms, making Publish/Subscribe a hot research
topic. The remainder of this chapter presents the different subscription schemes.
3.1 Topic-based Publish/Subscribe
Topic-based Publish/Subscribe systems implement one of the earliest subscription schemes
applying the Publish/Subscribe paradigm. In this scheme, subscribers announce
subscriptions in the form of topics (e.g. Sports, Politics). Topics can be compared
to interest groups consisting of all entities that would like to receive similar content. The
event notification service can be considered a broker that disseminates notifica-
tions through a communication channel to all matching subscribers. Each subscriber can
choose some topics to subscribe to and only receives notifications about documents
of its topics of interest. This kind of group communication is widely used in several
applications; mailing lists and USENET are prominent and popular representatives of
interest groups. Because subscribers subscribe to individual topics at the event
notifier, topics need to be predefined before they can be subscribed to. This can simply be
done by defining several topics by hand for the system in question. Another inter-
esting approach is the automatic extraction of topics from the resources to be published. Topic
extraction is a research area of its own and out of the scope of this thesis.
Besides topics in a simple, flat hierarchy, there are concepts extending the scheme to structure the
topic space and make it more accessible to subscribers and publishers. The introduction
of hierarchies allows publishers and subscribers to formulate their events and
subscriptions in terms of topics and subtopics. A subscription made to some
topic or subtopic then includes all topics that belong below that topic node in the hierarchy.
USENET news uses this kind of hierarchical addressing. Figure 3.2 shows the principle
of hierarchical addressing. Subscriber A subscribes to the topic Football and receives all
events that belong to the topic Football or one of its subtopics (like a particular team).
Subscriber B, however, is interested in all events from the topic Sports and therefore
receives, in addition to all Football events, also those from Sports in general.
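The hierarchical matching rule just described can be sketched in a few lines. The dot-separated topic paths are an assumed encoding (USENET-style), not prescribed by the scheme itself:

```python
# Hierarchical topic addressing: a subscription to a topic node matches
# events published on that node or on any subtopic below it.
def topic_matches(subscribed_topic: str, event_topic: str) -> bool:
    # Topics are dot-separated paths, e.g. "Sports.Football.TeamX" (assumed encoding).
    return (event_topic == subscribed_topic
            or event_topic.startswith(subscribed_topic + "."))

# Subscriber A subscribes to "Sports.Football", subscriber B to "Sports".
print(topic_matches("Sports.Football", "Sports.Football.TeamX"))  # True: A is notified
print(topic_matches("Sports.Football", "Sports.Tennis"))          # False: A is not
print(topic_matches("Sports", "Sports.Tennis"))                   # True: B receives all Sports events
```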
3.2 Content-based Publish/Subscribe
Figure 3.2: Hierarchical addressing in topic-based Publish/Subscribe

Content-based Publish/Subscribe is far more general than the topic-based scheme. In
essence, this scheme offers more flexibility for matching new content against the subscrip-
tions stored in the event notification service. The contents published by the publishers
are inspected one by one and matched against the subscriptions in the
notification service. This scheme can therefore be considered a message-by-message ap-
proach, where the content of the messages is the basis of the filtering. Thus, this concept
is much more flexible compared to the topic-based scheme. Users can subscribe by using
special subscription languages. Subscriptions can be combined in several ways to form
more complex ones, so-called subscription patterns. Operators typically used in such sub-
scription languages are <, ≤, =, >, ≥. The content-based scheme is thus not suitable for
plain-text filtering but rather for filtering on the basis of predicates. The subscription
patterns are matched and the subscribers are notified. Compared to the topic-based
scheme, the content-based scheme is more expressive, but this freedom of choice is paid
for with more storage in the event notification service: it leads to
a tremendous number of distinct subscription patterns. This enforces
the usage of efficient matching algorithms and is clearly a disadvantage of content-based
Publish/Subscribe.
Consider the following example: the event notification service is deployed on a stock
market and the stocks themselves are regarded as publishers. Whenever some stock
changes, it emits the new value to the event notification service, which matches the
changed value against the standing subscription patterns and notifies all subscribers
about the changes. Figure 3.3 illustrates content-based Publish/Subscribe in
combination with stock markets. Subscriber A has subscribed for the event of the
DAX being less than or equal to 5000, and Subscriber B for the event of the NASDAQ being less
than or equal to 2500. When the DAX emits a new event that its current value is 4800, the event
notification service only notifies Subscriber A, because Subscriber B is not interested
in that event. More sophisticated examples can be constructed by combining patterns.
Figure 3.3: Content-based Publish/Subscribe in Stock markets

The subscription language can be processed using algorithms to optimize the matching
of subscriptions and contents. Subscriptions could be expressed using SQL or XML and
then be processed by methods such as XPath. XFilter, which uses XPath [18, 26], is a
method capable of efficiently filtering XML documents and disseminating selective in-
formation from them. Other methods are template objects using JavaSpaces [10] and
executable code [19].
3.3 Term-based Publish/Subscribe
The term-based Publish/Subscribe scheme is a restricted variant of content-based Pub-
lish/Subscribe. In content-based Publish/Subscribe the basis is the full content, which
is observed before the subscribers are notified by the event notification service. Term-
based Publish/Subscribe instead focuses on terms and combinations of them: there are no
operators like <, ≤, =, >, ≥, but only conjunctions of terms.
In the term-based scheme the subscription patterns are therefore usually less sophisti-
cated. The contents are documents consisting of terms and phrases, and the subscription
patterns are predicates consisting of terms. A subscriber is able - like in a search engine
- to subscribe for terms that are contained in the content. Consider for example an
article shipped in the form of HTML or PDF and a subscriber that has stored a
subscription matching the contents of the article. The article is then delivered
to the subscriber.
The atomic part of a subscription pattern is a term. Terms can be combined to form
predicates of arbitrary size, like George ∧ Bush ∧ president ∧ USA. Only docu-
ments that contain all of these terms match the subscription pattern and are
disseminated to the subscribers.
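In terms of the notation introduced later (Table 5.1), a term-based subscription matches a document exactly when the subscription's terms are a subset of the document's terms. A minimal sketch:

```python
# Term-based matching: a subscription is a conjunction of terms and
# matches a document iff all of its terms occur in the document.
def term_matches(subscription: set, document: set) -> bool:
    return subscription <= document  # subset test

doc = {"George", "Bush", "president", "USA", "visit"}
print(term_matches({"George", "Bush", "president", "USA"}, doc))  # True
print(term_matches({"George", "Bush", "Germany"}, doc))           # False
```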
A part of the analysis of different Publish/Subscribe schemes performed in this thesis is
also based on term-based Publish/Subscribe.
3.4 Type-based Publish/Subscribe
The type-based Publish/Subscribe scheme [20], besides analyzing the content of incoming
notifications, makes use of additional mechanisms to better classify resources and op-
timize event notification services. Type-based Publish/Subscribe introduces the concept
of types into the system. Content that was previously classified only by the name of its
topic can now additionally be classified by its type. Using types in
Publish/Subscribe links objects of the underlying programming lan-
guage with the types of events in the event notification service. A further effect is that
type-safety can be ensured at compile time by parameterizing the resulting abstraction interface
with the type of the corresponding events. This type checking, together with further
content-based refinements, ensures that the Publish/Subscribe system can match sub-
scriptions against content more efficiently. Figure 3.4 shows an example: the
stock events are split into two distinct types, stock quotes and stock requests.
Publishers can publish stock quotes, and brokers can subscribe for stock requests,
expressing their interest in buying stocks within a possible price range.
Figure 3.4: Type-based Publish/Subscribe
Eugster et al. [20] give an in-depth description and discussion of type-based Publish/Subscribe.
3.5 Current Publish/Subscribe systems
This section briefly presents some systems implementing the Publish/Subscribe pattern.
Publish/Subscribe systems exist in a great variety, ranging from research prototypes to
commercial implementations.
3.5.1 Java Message Service
The Java Message Service (JMS) is part of the Java 2 Enterprise Edition API and can be used to
implement applications based on the Publish/Subscribe model or on point-to-point queuing. JMS
is a very flexible framework that enables loosely coupled, reliable and asynchronous
distributed communication. JMS uses topic-based subscrip-
tions, grouping messages together into abstract topics. One information
provider can send a message to many consumers through predefined virtual chan-
nels, i.e. the topics. Additionally, some content-based filtering can be applied
on message headers and fields. In JMS, topics are predefined by information providers,
the so-called JMS providers. They define which content belongs to a specific topic and
which does not. Topics can also be structured using hierarchy patterns.
A lot of commercial providers as well as open-source and academic projects imple-
ment the JMS API, making it a promising and popular messaging API.
3.5.2 Cambridge Event Architecture
The Cambridge Event Architecture (CEA) [27] is a research prototype developed at Cam-
bridge University in the early 1990s. CEA was implemented to address the problem
of asynchronous communication for applications in sensor environments and extends
the object-oriented middleware CORBA¹ with a "publish, register, notify" paradigm.
In the context of CEA there are event producers (= information providers) and event
consumers (= information consumers). In contrast to CORBA, event producers in CEA
can register the events they produce with a name service; this can be considered the
advertising mechanism. In CORBA, event consumers register directly at event producers,
which creates a tight coupling between event consumers and event producers. CEA
instead introduces a mediator between the two entities to decouple them in time and
space. This so-called event mediator is a CORBA entity that implements the producer and
consumer interfaces. By providing typed events, CEA offers a type-based Pub-
lish/Subscribe system but lacks the capabilities of content-based Publish/Subscribe. An
interesting property of CEA is that it provides a mechanism to subscribe for composite
event patterns, called event correlation: event consumers can subscribe for the occurrence of
several events and are only notified if all of the subscribed events occur.

¹CORBA defines APIs, a communication protocol, and object/service information models to enable heterogeneous applications, written in various languages and running on various platforms, to interoperate.
4 Distributed
Publish/Subscribe-Algorithms
Publish/Subscribe systems arise in many applications. Existing centralized implemen-
tations promise that a potentially unlimited number of subscribers
can participate. The advantage of the centralized approach is that the system has a com-
plete overview of all subscriptions and of all notifications of the publishers, and
efficient algorithms can be implemented on this physically global directory. The users of such
a system, however, are usually autonomous entities that join or leave the system arbitrarily and
provide and consume content independently. This inevitably leads to a
bottleneck when the number of users grows and therefore the number of notifications
and subscriptions grows, since many notifications have to be forwarded to the appropriate
subscribers. This scalability and reliability issue can be tackled by implementing Pub-
lish/Subscribe systems in a decentralized fashion, i.e. as a Publish/Subscribe system on
top of a P2P network. Especially in ubiquitous and pervasive [33]
environments, deploying Publish/Subscribe systems in a decentralized manner
is the best choice.
P2P systems have emerged as a very popular technique for exchanging information
in wide-area networks that span the whole world. Advanced P2P systems use struc-
tured overlays to provide more efficient lookups. The most popular structured P2P
networks are DHT implementations which are completely decentralized, scalable, self-
organizing and reliable. DHT implementations have been explained in Section 2.2. A
Publish/Subscribe system can be implemented straightforwardly on top of that over-
lay. There are several approaches to implement the Publish/Subscribe paradigm in a
decentralized fashion. Popular examples include Meghdoot [17] and Scribe [29].
4.1 Scribe
Scribe is an application-level multicast infrastructure which supports a large number of
groups with a large number of members per group. It is built on top of Pastry [13]
and leverages Pastry's scalability, locality and self-organization properties. Any peer in
Scribe can create a topic and other peers can subscribe to that topic. A peer with the
appropriate credentials can publish an event, and Scribe routes the event to the
corresponding subscribers. Scribe provides a simple API with the following functions:
• create(credentials, topicId)
• subscribe(credentials, topicId, eventHandler)
• unsubscribe(credentials, topicId)
• publish(credentials, topicId, event)
The event handler passed to the method subscribe handles each incoming
event routed to the subscriber. The Scribe node/peer numerically closest to the unique
topicId is the so-called rendezvous point for the topic. The rendezvous point is the root
of a multicast tree created for the topic. Further details on Scribe and its functionality
can be found in [29].
4.2 Meghdoot
Meghdoot is a P2P-based Publish/Subscribe system. It uses a structured distributed
hash table to determine where subscriptions are stored and how events are routed to
the subscriptions. Meghdoot implements the content-based Publish/Subscribe scheme
known from Section 3.2 and additionally deals with the problem of load balancing: by
applying alternative methods, the authors show that even with highly skewed datasets no
peer is excessively loaded.
The system is described by a set of attributes; the schema of a system can be
written as S = {A1, A2, . . . , An}. A subscription in Meghdoot is a conjunction of
predicates over a set of attributes. A predicate defines a constant value (with =) or a
range (using <, ≤, ≥, >) for an attribute. An event is a set of equalities rep-
resented as e = {A1 = c1, A2 = c2, . . . , An = cn}. An event e matches a subscription
s if each predicate of s is satisfied by the corresponding attribute value in the event.
Given the schema S = {A1, A2, . . . , An}, Meghdoot constructs a 2n-dimensional Carte-
sian space to maintain the DHT. This space is partitioned among the participating peers,
and storing a subscription has a complexity of O(d · N^(1/d)), where N is the number of
peers and d the dimension of the Cartesian space. An incoming event in Meghdoot is
matched against subscriptions by applying a routing algorithm within the constructed
2n-dimensional space. For further details the reader is referred to [17].
The Meghdoot system can be compared to one of our proposed approaches in chapter 5.
Meghdoot provides the content-based Publish/Subscribe scheme and stores the subscrip-
tions split across some peers of the network. For our purposes, the system schema S
can be the set of all terms/words used in a content-based Publish/Subscribe scheme. Obvi-
ously this set is huge, and the 2n-dimensional Cartesian space will clearly lead to
performance problems. The possibility to state range queries offers great flexibility
concerning the subscriptions. The problem of the design is that the system is not flexible
when it comes to adding new terms/words: the system needs to be reconstructed,
as the 2n-dimensional space is built from the system schema. Therefore, Meghdoot is
a good candidate when the schema does not change. Store-Sub stores subscriptions at
peers, too. Unlike Meghdoot, it does not offer range-based search, but the number
of terms can change arbitrarily without reconstructing the whole system, which makes it
much more flexible. The next chapter introduces the approaches and explains them
in detail.
5 Introduction to Store-Sub and
Store-Pub
This chapter introduces two approaches to realize Publish/Subscribe on top of a structured
P2P network with a DHT. The approaches differ in what information they store
in the conceptually global directory: either the subscribers store their subscriptions in
the directory, or the publishers store statistics about their content in the directory. Both
approaches have their strengths and weaknesses, which depend on the usage scenario. In
the remainder of this chapter we present these two approaches in more detail.
5.1 Terminology
To explain the two approaches, we first need to introduce some terminology. The P2P
network consists of several peers, which we also call directory peers. The directory peers
use the lookup() function provided by the DHT: given a key k, the directory maps the
key to the ID of the directory peer responsible for maintaining k. The directory
peer stores some information for the key k and remains responsible for it until the network
structure changes. Each peer wishing to add or retrieve the information for a key
calls the lookup(key) function and obtains the peer currently responsible for maintaining
the value of the key. After that, the peer can directly contact the respective peer via a
point-to-point connection and retrieve or add the desired information.
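The mapping performed by lookup() can be illustrated with a small sketch. The hashing scheme and the ring assignment below are illustrative assumptions in the style of consistent hashing, not the concrete DHT implementation used later:

```python
# Illustrative sketch of lookup(): a key is hashed into the identifier space
# and assigned to the next peer at or after it on the ring (wrapping around).
import hashlib
from bisect import bisect_left

def key_id(key: str, bits: int = 16) -> int:
    # Hash the key into a bounded identifier space (assumed scheme).
    return int(hashlib.sha1(key.encode()).hexdigest(), 16) % (2 ** bits)

def lookup(key: str, peer_ids: list) -> int:
    # Return the ID of the directory peer responsible for the key.
    ids = sorted(peer_ids)
    i = bisect_left(ids, key_id(key))
    return ids[i % len(ids)]  # wrap around at the end of the ring

peers = [17, 45, 76, 1200, 30000, 51000]
responsible = lookup("sport", peers)
print(responsible in peers)  # True: exactly one peer is responsible for the key
```

The essential properties used throughout this chapter are that lookup() is deterministic (the same key always maps to the same peer, as long as the network structure is unchanged) and that every key has exactly one responsible peer.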
The set of all directory peers is N = {n1, n2, . . . , nk}. There are subscribers S =
{s1, s2, . . . , sn} and publishers P = {p1, p2, . . . , pm}.
Subscriber si expresses its interest by issuing a subscription subi,j, which denotes the j-th
subscription of subscriber si. A subscription consists of a subset of terms, i.e. subi,j ⊂ T,
where T is the set of all terms. Each subscriber can have more than one subscription.
For example, si can have the following subscriptions:
• subi,1 = {Britney, Spears, concert, New, York} and
• subi,2 = {George, Bush, visit, Germany}
The average number of subscriptions per subscriber is denoted by subavg and each sub-
scription has an average number of terms denoted by sub.
Publishers frequently publish new documents at some specific publishing rate r. Each
document consists of a number of terms t ∈ T, and documents are of the form d ∈ D
where D ⊂ 2^T. A subscription subi,j matches a document d if all terms of the sub-
scription are contained in the document, i.e. subi,j ⊂ d. In this case the corresponding
subscriber is notified. Additionally, each document d belongs to a topic c = topic(d).
Table 5.1 gives an overview of the notation used.
si ∈ S       subscriber from the set of all subscribers
pi ∈ P       publisher from the set of all publishers
ni ∈ N       node (peer) from the P2P network
t ∈ T        term t from the set of all terms
c ∈ C        topic c from the set of all topics
subi,j       j-th subscription of subscriber si
subavg       average number of subscriptions per subscriber
sub          average number of terms per subscription
d ∈ D        a document from the set of all documents
|d|          number of terms per document
r            the number of documents published per round
|T|          the number of all terms within the simulation
|C|          the number of all topics within the simulation
topic(d)     mapping from document d to topic c ∈ C

Table 5.1: Notation

In the analysis and simulation part all subscriptions are formulated in terms, i.e.
content-based Publish/Subscribe is applied in a simple manner. It is possible to use
more complicated subscription patterns, but the focus is on measuring and analyzing a
typical Publish/Subscribe system that imitates a news network or user groups.
5.2 Store-Sub
Store-Sub is the method most commonly used when implementing Publish/Subscribe.
It pursues the intuitive way of storing the subscriptions directly in the distributed di-
rectory. The data structure is conceptually global, but physically distributed over all
participating peers.
Store-Sub implements the following approach. It gives subscribers the possibility to
subscribe for their interests and to store these subscriptions in the globally available DHT. Each sub-
scriber can store as many subscriptions as it likes and can also delete them. The procedure
is as follows: each term in a subscription can be mapped by the lookup() function to
a nodeID. In more detail, a subscription subi,j containing the terms {a, b, c} is stored at
the directory peers responsible for the terms a, b, c, respectively. The directory peers store
the term and the whole subscription together with the nodeID of the subscribing peer.
If, for example, the term a is contained in more than one subscription, each subscription
containing a is attached at the directory peer responsible for a. The directory
therefore contains terms, the nodeIDs of the subscribing peers and the subscriptions. The
topic-based scheme is explained in Section 5.2.1.
The publishers frequently want to publish content in the form of documents. In the term-
based scheme, the publishers only look at the pure content of their documents.
Documents are sets of terms (see Table 5.1), and each publisher calls the lookup()
function for all terms of all its documents to retrieve the subscribing peers. Suppose there
is a subscription subi,j = {a, b, c} and document d contains these terms, i.e. {a, b, c} ⊂ d.
The publisher retrieves all subscriptions for document d and can check whether each
subscription is satisfied completely. After matching, the publisher can disseminate the
content and be sure that only subscribing peers receive the desired content.
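The term-based Store-Sub flow can be sketched as follows. The dictionary standing in for the distributed directory and the helper names are assumptions for illustration; in the real system each directory access goes through lookup():

```python
# Sketch of term-based Store-Sub: a subscription {a, b, c} is stored at the
# directory peers responsible for a, b and c; a publisher retrieves the
# candidate subscriptions for each term of its document and keeps only
# those that are satisfied completely.
from collections import defaultdict

directory = defaultdict(list)  # term -> [(subscriber_id, full subscription)]

def store_subscription(subscriber_id, terms):
    for term in terms:  # one lookup() per subscription term in the real system
        directory[term].append((subscriber_id, frozenset(terms)))

def publish(document):
    candidates = set()
    for term in document:  # one lookup() per document term
        candidates.update(directory[term])
    # keep only subscriptions that are matched completely (sub ⊆ d)
    return {sid for sid, sub in candidates if sub <= document}

store_subscription("S42", {"a", "b", "c"})
store_subscription("S17", {"a", "x"})
print(publish({"a", "b", "c", "d"}))  # {'S42'}: S17's term 'x' is missing
```

The sketch also makes the bottleneck discussed below visible: publish() touches the directory once per document term, and each of those lists can be large.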
Figure 5.1 illustrates the data structure of the Store-Sub infrastructure.

Figure 5.1: Overview Store-Sub

There are three directory peers N17, N45 and N76. Directory peer N17 is responsible for the topic
sport, N45 for comedy and politics, and N76 for music. Each of the peers has already stored some
subscriptions in the data structure. The subscriptions are stored by calling the lookup()
function on the topic and obtaining the nodeID as result; at this node, the subscriptions
are stored. Each publisher can retrieve these lists by the same procedure. Imagine
a publisher wants all subscribers for the topic sport. Calling lookup(), it receives the
list {S42, S17, . . . } and can match it against its own content. The figure also shows
incoming subscriptions from S14, S97 and S37. Note that the figure only illustrates the topic-
based Publish/Subscribe scheme; in term-based Publish/Subscribe, terms are stored at
the directory peers. In both cases the lookup() function returns the nodeID of the responsible peer,
and subscribers as well as publishers have access to the directory.
The advantage of Store-Sub in general is that subscriptions are always matched when con-
tent is available: as soon as a publisher has content that matches a subscription,
the content will be delivered regardless of when the subscription was stored. The
bottlenecks of Store-Sub are obvious: publishers need to retrieve possibly huge lists of
subscribers, because they have to retrieve the subscriptions for all terms of their docu-
ments in order to make sure that all subscriptions are matched. Especially in scenarios
with a high publishing rate r, the directory will be stressed. A typical way to cope with
this is to reduce the number of lookups by introducing a smaller set of topics to which the
terms belong.
5.2.1 Topic-based Store-Sub
In the topic-based scheme we modify the pure term-based scheme: now, each document and
each subscription belongs to one topic.
Let subi,j = {x, y, z} be the subscription of a subscriber and c the topic of that subscription.
The subscriber calls lookup(c) to retrieve the nodeID of the directory peer responsible
for topic c. Then it adds the subscription and its nodeID to the directory making it
available to the publishers.
If a publisher now wants to disseminate a document d with topic c = topic(d), it
calls lookup(c) and retrieves the nodeID of the corresponding directory
peer. By contacting the directory peer, the publisher retrieves all subscriptions for
topic c. The publisher then proceeds as in the term-based scheme: it matches the
subscriptions against document d and finally notifies the appropriate subscribers.
The introduction of topics extends the pure term-based Publish/Subscribe to a mixture
of the term-based and topic-based schemes. The advantages are that each subscription is stored
exactly once in the directory and that the publishers do not need to call lookup() for each
term of the document, but only for the topic c.
5.3 Store-Pub
Store-Pub implements a different approach than Store-Sub. Instead of subscribers
storing subscriptions in the distributed overlay on top of the DHT, publishers store statis-
tics about the contents of their documents in the directory. Subscribers can use these
statistics to subscribe directly at the publishers afterwards.
Each publisher pi announces its profile in the form of statistics in the DHT. For example, if
a publisher has some documents for a topic - say sport - it announces the number
of documents for that topic. As in Store-Sub, there are two
schemes for storing statistics in Store-Pub. Using the term-based scheme, each publisher stores
statistics in the directory for each term occurring in its documents: it calls lookup() for a
term, retrieves the nodeID of the directory peer and stores the statistics there. This is
done for all terms contained in all documents and leads to a detailed description of
the publisher's content within the DHT. The topic-based scheme is described in Section 5.3.1.
Subscribers use the information in the directory to decide at which publishers they are
going to subscribe. Instead of subscribing at the directory, they call the lookup() function
for each term of their subscriptions and retrieve a list of statistics of all publishers.
With these publisher lists they can decide at which publishers to subscribe, i.e. they
use the statistics to determine which publishers best fit their needs and can
choose a subset of the publishers for their subscriptions. Then they subscribe directly
at those publishers.
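This selection step can be sketched as follows. The statistics format (per-publisher document counts per term) and the additive scoring rule are assumptions for illustration; the approach itself does not fix how subscribers rank publishers:

```python
# Sketch of a Store-Pub subscriber choosing publishers: for every term of
# its subscription it retrieves the per-publisher statistics from the
# directory and subscribes at the top-k publishers (assumed ranking rule).
def choose_publishers(subscription, directory, k=2):
    scores = {}
    for term in subscription:  # one lookup() per subscription term
        # directory[term]: {publisher_id: number of documents containing term}
        for publisher, count in directory.get(term, {}).items():
            scores[publisher] = scores.get(publisher, 0) + count
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k]

directory = {
    "sport":  {"P12": 40, "P97": 5},
    "tennis": {"P12": 10, "P23": 3},
}
print(choose_publishers({"sport", "tennis"}, directory, k=1))  # ['P12']
```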
Figure 5.2 illustrates three directory peers N17, N45 and N76. The arrows show in-
coming posts for the topics sport, comedy and music from publishers P12, P97 and P23,
respectively. Each directory peer stores all statistics for some topics. In the example,
N17 is responsible for the topic sport, N45 for the topics comedy and politics, and N76 for the topic
music. In the term-based scenario, statistics are only stored about terms and there are
no topics.
Figure 5.2: Overview Store-Pub
Store-Pub has the advantage that the subscribers have full control over which publishers
they contact. They can decide at how many publishers they subscribe and thereby restrict the
quantity of notifications they receive. It is also not possible for interested parties to do
social engineering, because the subscriptions are not publicly available. The main
disadvantage of this method is that the information stored in the directory ages; hence
after some time it is necessary to repost the statistics, and subscribers therefore have to
choose their favorites again. Another disadvantage is that this approach does not guarantee the
delivery of the desired information: with high probability not all matching content
is sent to the subscribers.
Altogether, each of the two methods has its advantages and disadvantages. Chapter
6 analyzes Store-Sub and Store-Pub in depth and explains why one method
behaves better than the other in certain situations. This kind of analysis can be used
to evaluate deployed systems and may help to decide which system to choose for a given
kind of application.
5.3.1 Topic-based Store-Pub
In the topic-based scheme, the publishers store statistics for all the topics of their docu-
ments at the directory peers. There is no information about individual terms and their
statistics. For example, the statistics now contain information about how many docu-
ments a publisher has for a topic; again, more sophisticated statistics are possible.
Subscribers call the lookup() function for the topics they are interested in and retrieve
the nodeID of the directory peer responsible for the topic. They can then retrieve the
list of statistics and decide at which subset of publishers they subscribe directly.
The problem with the topic-based scheme is that the list of subscriptions for a topic can
be huge. A possible improvement to avoid retrieving these long lists is the following:
when a publisher wants to disseminate its documents, it sends them to the direc-
tory peer that is responsible for the topic of the documents. The directory peer then
either matches the documents against the stored subscriptions and sends only the
matching subscribers back to the publisher, or it disseminates the documents itself to
the matching subscribers. This method notably reduces the generated traffic, but not
the number of generated messages.
6 Analysis of Store-Sub and Store-Pub
This chapter analyzes the approaches Store-Sub and Store-Pub. The first section con-
tains the formulas that can be derived from the approaches introduced in chapter 5. The
second part shows applications of the formulas: they are used either to analyze
an approach when changing the system parameters or to compare the approaches with
each other. The comparison is based on measuring the messages generated by each
approach, as messages stress the distributed directory. The analysis of the number of
messages is crucial because it has an impact on the latency and the reliability of the
directory. The traffic does not have the same impact, because the transfer of data
is usually accomplished by a direct connection between the participating peers, while the
messages are necessary to look up a directory peer. The comparison is useful for giving rec-
ommendations on which approach fits better in which usage scenario, i.e. user-to-user or
publisher-to-user. The mathematical analysis uses the formulas presented in this chapter,
and chapter 8 presents simulation experiments with a dynamically changing system to
approximate the message generation of a real system.
6.1 Formulas
This section outlines the formulas derived from the approaches and explains them step
by step. The presented formulas generally can forecast two values:
• the messages generated and
• the traffic generated
by an approach. Additionally, each value can be split into two parts: the first part can
be expressed as storing the information in the DHT and the second one as retrieving
the information from the DHT. In each case - Store-Sub and Store-Pub - the information
available in the DHT is different.
The two approaches are implemented on top of a DHT. To forecast the messages and
the traffic of the approaches, it is necessary to keep the message complexity of the DHT
in mind. Therefore all formulas contain the complexity factor log2(|N|), where N is
the network of nodes and |N| is the number of nodes. In essence, whenever a node
sends a message to another node within the network, there are log2(|N|) message hops
due to the DHT implementation.
6.1.1 Formulas Store-Sub
In Store-Sub the complexity of messages and traffic consists of two parts:
• Subscribers that join the network need to announce their subscriptions
• Publishers need to retrieve the subscriptions for the terms or topics in order to
disseminate their contents
According to the different schemes explained in chapter 3, the formulas can be used to
measure the traffic and messages of the following schemes:
• term-based Publish/Subscribe
• topic-based Publish/Subscribe
• topic-based Publish/Subscribe with performance optimizations
The two main schemes, term-based and topic-based, also imply different message and
traffic generation. Consider the term-based scheme: if a subscriber would like to announce
its subscription, it needs to add the subscription at several directory peers. For each
term of the subscription, the subscriber needs to identify the responsible directory peer
and add the subscription there. In the topic-based scheme, by contrast, the subscription
only needs to be added to one directory peer: the one responsible for the topic of
the subscription. To reflect these schemes, there are two values in the formulas, fs and
fp. In the term-based scheme fs = sub, because the subscription has to be
sent to one peer for every term in the subscription, whereas in the topic-based scheme
fs = 1, because each subscription belongs to exactly one topic. The value fp also reflects
the difference between the term-based and topic-based scheme. Consider a publisher that
needs to retrieve all subscribers for the terms of its document: in the term-based scheme
fp = |D|, because the publisher has to look up all possibly interested subscribers per
document term. In the topic-based scheme, on the other hand, fp = 1, because each
document belongs to one topic.
The following formulas show the messages and the traffic generated in Store-Sub.
M1 = O(|Snew| ∗ subavg ∗ fs ∗ log(|N|))    (6.1)

M2 = O(|P| ∗ r ∗ fp ∗ log(|N|))    (6.2)

T1 = O(sizet1avg ∗ |Snew| ∗ subavg ∗ fs ∗ log(|N|))    (6.3)

T2 = O(sizet2avg ∗ |P| ∗ r ∗ fp ∗ (log(|N|) + listlength))    (6.4)

where

listlength = (|S| ∗ subavg ∗ sub) / |T|    (6.5)

or

listlength = (|S| ∗ subavg) / |C|    (6.6)
Formula 6.1 represents the messages that are generated by the subscribers joining the
network; M1 thus belongs to the part of the messages that store information in the DHT.
The joining subscribers need to announce their interests in the form of subscriptions.
Each subscriber therefore generates messages for each of its subscriptions when adding
it to the directory. The number of messages per subscription is expressed by fs, which
changes depending on the scheme. Each of these messages causes O(log(|N|)) message
hops in the DHT.
Formula 6.2 shows the messages that are generated by the publishers disseminating
their documents, so M2 is the part of the messages that retrieves information from the
DHT. For each document, a publisher needs to know all potentially matching subscribers.
Therefore each publisher generates a number of messages per document. The number of
documents per publisher is denoted by the publishing rate r. The number of messages
per document depends on the scheme used and is expressed by the value fp. Additionally,
each message causes O(log(|N|)) message hops in the DHT.
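To make formulas 6.1 and 6.2 concrete, their dominant terms can be evaluated in code. The following Java sketch drops the O-notation, so the results are proportional estimates rather than exact message counts; the class and method names are illustrative, and any concrete value of |N| plugged in is an assumption, since the network size is a free parameter.

```java
// Illustrative evaluation of the dominant terms of formulas 6.1 and 6.2
// for Store-Sub. Constants hidden by the O-notation are dropped.
public class StoreSubMessages {
    // M1: messages for |Snew| new subscribers storing subAvg subscriptions
    // each, with fs directory insertions per subscription.
    public static double m1(int sNew, double subAvg, double fs, int n) {
        return sNew * subAvg * fs * log2(n);
    }

    // M2: messages for |P| publishers retrieving subscriber lists for
    // r documents each, with fp lookups per document.
    public static double m2(int p, int r, double fp, int n) {
        return p * r * fp * log2(n);
    }

    // DHT hop factor log2(|N|), via change of base.
    static double log2(int n) {
        return Math.log(n) / Math.log(2);
    }
}
```

For example, with |Snew| = 10, subavg = 3, fs = sub = 5 (term-based) and an assumed network of |N| = 1024 nodes, m1 yields 10 ∗ 3 ∗ 5 ∗ 10 = 1500 message hops.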
Formula 6.3 reflects the traffic generated by the subscribers joining the network. It
depends on the same values as the number of messages in formula 6.1, but contains an
additional value, sizet1avg, which approximates the average size of such a message. This
value can be estimated from the simulation parameters: taking the average number of
terms per subscription (subavg) and the average length of each term (which is
language-dependent) yields an approximate value for sizet1avg. In essence,

sizet1avg = subavg ∗ lengthavg(term)    (6.7)
Formula 6.4 is slightly more complicated. It represents the traffic generated by the
publishers to disseminate the documents. It consists of the same values as formula 6.2,
but the parameter listlength varies depending on the scheme: the value of listlength for
the term-based scheme is shown in formula 6.5, and the value for the topic-based scheme
without performance optimization in formula 6.6. The performance optimization changes
listlength either to 0, if the directory peers take care of the matching and the dissemination
of all documents, or to some value matchingSubscribers(D), if the directory peer sends
back the subscribers that are candidates for dissemination. The latter value is hard to
approximate, because the behaviour of the subscribers and the documents would have
to be modeled well. The same is true for the value sizet2avg, which represents the
average size of a message in this context.
6.1.2 Formulas Store-Pub
In Store-Pub the complexity of messages and traffic consists of two parts:
• Publishers need to announce their statistics
• Subscribers have to retrieve the announced statistics to find publishers that fit
their needs
As in Section 6.1.1, the formulas measure the messages and the traffic of the following
schemes:
• term-based Publish/Subscribe
• topic-based Publish/Subscribe
Because two different schemes are used, the formulas need to reflect the differences
between them. There are two values to cope with these differences. The first one is the
feature space F: in the term-based scheme this feature space is the set of all terms in
the simulation, and in the topic-based scheme it is the set of all topics. Hence, |F| is
the number of elements in this set. The second one is fs, which equals sub in the
term-based scheme and 1 in the topic-based one.
Compared to Store-Sub, it is not necessary to optimize the performance of the
dissemination, because Store-Pub is based on a different approach. But it is necessary
to cope with the problem of aging statistics in the directory: within a given time interval,
each publisher has to refresh its statistics and each subscriber has to check whether there
are better publishers fitting its needs. The variable that reflects this circumstance is
interval.
M1 = O((|Pnew| + |P| / interval) ∗ |F| ∗ log(|N|))    (6.8)

M2 = O((|Snew| + |S| / interval) ∗ subavg ∗ fs ∗ log(|N|))    (6.9)

T1 = O(sizet3avg ∗ (|Pnew| + |P| / interval) ∗ |F| ∗ log(|N|))    (6.10)

T2 = O(sizet4avg ∗ (|Snew| + |S| / interval) ∗ subavg ∗ fs ∗ log(|N|))    (6.11)
Formula 6.8 represents the number of messages generated by the publishers publishing
their content. The formula distinguishes between new publishers, Pnew, and existing
ones, P. New publishers naturally have to publish all their statistics, because
they did not participate in the network before. Already participating publishers only
have to republish their statistics in some fixed interval. |F| denotes the number of features
to be published, i.e. terms in term-based Publish/Subscribe or topics in topic-based
Publish/Subscribe. Altogether, formula 6.8 is the quantity of messages needed to store
the information in the DHT.
Formula 6.9 shows the number of messages needed to retrieve promising publishers; it
represents the retrieval part of the information stored in the DHT. Like the previous
formula, this one also distinguishes between new subscribers and existing ones. New
subscribers have to choose their promising publishers based on their subscriptions.
Existing subscribers have to re-select publishers, because some of their previously chosen
publishers may have changed their statistics, so that another publisher is now the better
choice. Because all publishers and subscribers refresh their information, both are more
likely to achieve good results.
Formula 6.10 equals 6.8 except for the factor sizet3avg, which describes the size of one
message in the context of storing statistics in the DHT. sizet3avg is the sum of two values:
• the average size of the information identifying the publisher a statistic belongs to
(like the IP address or nodeID)
• the average size of a statistics token, which depends on its structure
Formula 6.11 is almost the same as formula 6.9 except for the factor sizet4avg. This
factor is exactly the same as sizet3avg, so:

sizet3avg = sizet4avg = sizenodeID + sizestatistics    (6.12)
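As a sketch, the dominant terms of formulas 6.8 and 6.9 can be evaluated in the same way as for Store-Sub. The O-notation is dropped, so the results are only proportional; the sum of the new-peer term and the refresh term reflects the explanation that new peers announce everything while existing ones refresh every interval rounds. All names are illustrative.

```java
// Illustrative evaluation of the dominant terms of formulas 6.8 and 6.9
// for Store-Pub. Constants hidden by the O-notation are dropped.
public class StorePubMessages {
    // M1 (formula 6.8): new publishers plus the existing ones refreshing
    // every `interval` rounds, each announcing |F| features.
    public static double m1(int pNew, int p, int interval, int f, int n) {
        return (pNew + (double) p / interval) * f * log2(n);
    }

    // M2 (formula 6.9): new subscribers plus refreshing existing ones,
    // each resolving subAvg subscriptions with fs lookups per subscription.
    public static double m2(int sNew, int s, int interval,
                            double subAvg, double fs, int n) {
        return (sNew + (double) s / interval) * subAvg * fs * log2(n);
    }

    // DHT hop factor log2(|N|).
    static double log2(int n) {
        return Math.log(n) / Math.log(2);
    }
}
```

For instance, 100 existing publishers refreshing every 100 rounds behave, per round, like a single publisher announcing its |F| features from scratch.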
The presented formulas are derived from the two approaches and are approximations
that can be fed with known parameters. Examples of such parameters are the average
number of subscriptions per subscriber or the average number of terms per subscription.
Using the formulas and the simulator, it can be shown that the formulas give a good
approximation of an average simulation run.
6.2 Analysis
This section shows the effects of changing some parameters. Changing one parameter
can, for example, make Store-Sub the better approach to deploy Publish/Subscribe,
while changing another parameter may suddenly make Store-Pub the better solution.
All these effects can be explained with the aid of the formulas, which can be used to
forecast different situations.
6.2.1 Analysis of Store-Sub with changing publishing rate r
This section analyzes the Store-Sub approach, i.e. it analyzes the two different message
types of Store-Sub: storing the subscriptions (by subscribers) and retrieving the
subscriptions (by publishers). The experiment shows which type of message is responsible
for the main part of the generated messages. The parameters of this experiment are
depicted in Table 6.1 and Figure 6.1 shows the two types of messages.

|S|       100000
|Snew|    10
|P|       100
subavg    3
sub       5
|T|       100000
|D|       log2(100000) ≈ 16
r         variable from 1 up to 1000 documents per round

Table 6.1: Parameters for scenario with variable publishing rate

The red line represents the first type of messages, the storing of subscriptions. It does
not change, as the publishing rate r does not affect this number.
The green line represents the second type, the retrieving of subscriptions. The publishing
rate ranges from 1 document per round up to 1000 documents per round. It can be seen
that the number of messages of the second type increases as r increases, because the
publishers need to retrieve more and more subscriber lists for the increasing number of
documents. This implies that Store-Sub is sensitive to changes of the publishing rate.
[Figure: log-log plot of #Messages (10^3 to 10^8) over the publish rate (1 to 1000); series: "Sent subscription" and "Retrieve subscription"]
Figure 6.1: Message types in Store-Sub
6.2.2 Analysis of Store-Pub with changing average number of
subscriptions subavg
The next analysis examines the different message types of Store-Pub. In Store-Pub,
statistics need to be announced by the publishers (store) and the subscribers retrieve
these statistics. The variable that changes is the average number of subscriptions,
subavg; all other parameters are fixed. Table 6.2 shows the parameters.

|S|       100000
|Snew|    10
|P|       100
subavg    varies from 1 to 100
sub       5
|T|       100000
|D|       log2(100000) ≈ 16
r         10 documents per round
interval  every 100 rounds

Table 6.2: Parameters for scenario with variable subavg

Figure 6.2 shows the plots for Store-Pub. The green line shows the number of messages
generated by all publishers when storing their statistics in the DHT. These remain
constant, as the number of publishers is constant and the varying number of subscriptions
does not affect this part. The red line shows the number of messages generated when
the subscribers retrieve the statistics to decide which subset of publishers to subscribe
to. The more subscriptions they generate, the more messages are generated.
[Figure: log-log plot of #Messages (10^4 to 10^7) over #Subscriptions (1 to 100); series: "Retrieve statistics" and "Store statistics"]
Figure 6.2: Message types in Store-Pub
6.2.3 Analysis of Store-Sub and Store-Pub with changing
publishing rate r
The next experiment compares the two approaches Store-Sub and Store-Pub when the
publishing rate r is changed. The parameters used are those shown in Table 6.1. Figure
6.3 shows the outcome of this analysis. The red line is the number of messages of
Store-Sub and the green and blue lines are those of Store-Pub. Store-Pub is plotted
twice, once with an interval of 50 and once with an interval of 100. The y-axis shows
the number of messages generated and the x-axis the varying publishing rate r.
First, it can be seen that the change of the publishing rate has no impact on Store-Pub.
This is because the publishers do not need to retrieve the subscriptions of the users
from the DHT (the subscribers subscribed directly at the publishers). Second, the change
does have an impact on Store-Sub, as the publishers need to retrieve the subscriptions
from the DHT, and the more documents they intend to publish, the more information
has to be retrieved.

[Figure: log-log plot of #Messages (10^4 to 10^8) over the publish rate (1 to 1000); series: Store-Sub, Store-Pub (interval = 50), Store-Pub (interval = 100)]
Figure 6.3: #Messages at variable publishing rate
The experiment implies that in scenarios with many subscribers and few publishers with
a high publishing rate (publisher-to-user), Store-Pub fits better, because it is insensitive
to the publishing rate.
6.2.4 Analysis of Store-Sub and Store-Pub with changing refresh
interval interval
The next experiment compares Store-Sub and Store-Pub when the refresh interval is
changed. Changing this parameter clearly has no impact on Store-Sub, as in Store-Sub
it is not necessary to refresh the information stored in the directory. We vary the
parameter interval; all other parameters are fixed. Table 6.3 shows a summary of the
parameters.

|S|       100000
|Snew|    10
|P|       100
subavg    3
sub       5
|T|       100000
|D|       log2(100000) ≈ 16
r         10 documents per round
interval  varies from every round to every 1000 rounds

Table 6.3: Parameters for scenario with variable interval

Figure 6.4 shows the plot for this scenario. The y-axis shows the generated messages
and the x-axis the varying interval. The red line shows method Store-Sub. It can be
seen that changing this parameter has no impact on it, because in Store-Sub there is
no need to refresh the information stored in the DHT. The green line shows method
Store-Pub. At an interval of 1, Store-Pub generates many more messages than Store-Sub,
as the information is refreshed every round by both subscribers and publishers. At an
interval of 1000, things change in favour of Store-Pub. The parameter interval therefore
has a great impact on Store-Pub.
In a publisher-to-user scenario with the requirement to keep statistics up-to-date
frequently, Store-Sub might be the better solution to deploy Publish/Subscribe.
There are many possibilities to plot further scenarios with different parameters changed.
Such an analysis can give a good approximation of the consumption of resources like
messages and traffic.
[Figure: log-log plot of #Messages (10^4 to 10^8) over the refresh interval (1 to 1000); series: Store-Sub and Store-Pub]
Figure 6.4: #Messages at variable refresh interval
7 Design and Implementation of a
Discrete Simulator
This chapter describes the simulator written within the scope of this thesis. The
simulator approximates and measures the messages and the traffic generated by a
Publish/Subscribe P2P network. As in chapter 6, the messages are the focus of attention.
Because the implementation uses randomness for various tasks, like creating subscriptions
or the joining and leaving of subscribers, it approximates the real traffic of a
Publish/Subscribe system more closely. The simulator is discrete, i.e. it simulates a set
of publishers and subscribers interacting with the Publish/Subscribe system in rounds,
and it uses these rounds to measure the messages and the traffic in the system. In each
round there are some tasks to be done. The list below shows the general steps executed
in each round:
1. new subscribers join the P2P network
2. existing subscribers leave the P2P network
3. all publishers create documents they want to publish
4. and, depending on whether Store-Sub or Store-Pub is simulated,
• all publishers check the DHT for subscribers and match their subscriptions
(Store-Sub)
• or the publishers directly match the subscriptions made by subscribers at the
publishers (Store-Pub)
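The round steps above can be sketched as a loop. This is a purely illustrative skeleton under assumed names; the simulator's real classes, described in Section 7.2, interact in the step methods stubbed out here.

```java
// Hypothetical outline of the simulator's discrete main loop. The step
// methods are illustrative stand-ins for the class interactions
// described in Section 7.2.
public class SimulationLoop {
    public static int simulate(int rounds, boolean storeSub) {
        int executedRounds = 0;
        for (int round = 0; round < rounds; round++) {
            joinNewSubscribers();      // step 1
            leaveSomeSubscribers();    // step 2
            createDocuments();         // step 3
            if (storeSub) {
                matchViaSubscriberDB(); // step 4a: publishers query the DHT
            } else {
                matchAtPublishers();    // step 4b: subscribers ask publishers
            }
            executedRounds++;
        }
        return executedRounds;
    }

    // Stubs; the real simulator delegates to Subscriber, Publisher,
    // SubscriberDB and PublisherDB here.
    static void joinNewSubscribers() {}
    static void leaveSomeSubscribers() {}
    static void createDocuments() {}
    static void matchViaSubscriberDB() {}
    static void matchAtPublishers() {}
}
```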
According to chapter 5, within the set of all peers there is a set of subscribers. Each
subscriber has a number of subscriptions that represent its interests as standing queries.
The subscriptions consist of a number of terms and, in the topic-based scheme, they are
additionally categorized into a topic. Depending on the Publish/Subscribe scheme
chosen, the subscriptions are matched based on the terms (term-based) or based on the
topics (topic-based); in the latter case, exact matchings based on terms are made as a
further refinement.
Among the set of peers there is also a number of publishers. These publishers would like
to disseminate documents to the subscribers subscribed in the DHT. The content to be
published consists of documents, which in turn consist of a set of terms and belong to
a topic. Section 7.1 describes the model chosen to generate the documents realistically.
Each corpus is pregenerated, and the publishers pick some documents from it in each
round of the simulation.
7.1 Creating Synthetic Data
The Zipf distribution [15] is the model chosen to pregenerate the document corpora.
Zipf's law is a power law published by George Kingsley Zipf. It states that in a corpus
of natural language utterances, the frequency of any word is roughly inversely
proportional to its rank in the frequency table. That is, the most frequent word occurs
approximately twice as often as the second most frequent word, which in turn occurs
twice as often as the fourth most frequent word, and so on. This behaviour has been
observed by many research groups in several application fields on different corpora. The
law can mathematically be stated as follows:

f(k; s, N) = (1/k^s) / (Σ_{j=1}^{N} 1/j^s),    (7.1)
where N is the number of elements in the corpus, k is their rank and s is a constant
that characterizes the distribution. In the classical version of Zipf's law, the constant s
is nearly 1. The generation of a document corpus works as follows: for each document d,
the algorithm throws an i-sided coin to decide whether term i should be included in
document d. This is done for all n terms, i.e. term 1 has a probability of 1, term 2 a
probability of 1/2 and term n a probability of 1/n of being included in document d.
Doing this for all documents generates a document corpus following a Zipf-like term
distribution.
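This generation procedure can be sketched compactly. The sketch below is an illustration under assumptions (class and method names, a seeded pseudo-random source for reproducibility), not the thesis implementation; it implements exactly the "i-sided coin" rule, so term 1 appears in every document and term i with probability 1/i.

```java
import java.util.*;

// Sketch of the corpus generation described above: term i (1-based) is
// included in a document with probability 1/i, which yields a Zipf-like
// term distribution over the whole corpus.
public class ZipfCorpus {
    public static List<Set<Integer>> generate(int documents, int terms, long seed) {
        Random rnd = new Random(seed);
        List<Set<Integer>> corpus = new ArrayList<>();
        for (int d = 0; d < documents; d++) {
            Set<Integer> doc = new HashSet<>();
            for (int i = 1; i <= terms; i++) {
                if (rnd.nextDouble() < 1.0 / i) { // the "i-sided coin"
                    doc.add(i);
                }
            }
            corpus.add(doc);
        }
        return corpus;
    }
}
```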
7.2 Implementation
This section explains the classes of the simulator written within the scope of this thesis.
It allows simulating a set of publishers and subscribers randomly acting in the context
of a Publish/Subscribe system. The implementation is capable of simulating the two
approaches Store-Sub and Store-Pub with a term-based and a topic-based scheme. There
is an individual simulator for each scheme, because it is more practical to separate them.
In the remainder of this chapter, the classes of the term-based scheme are explained.
Wherever the word term appears, it can be substituted by topic; classes that exist in
only one of the schemes are explicitly mentioned whenever necessary.
The parameters defined in Table 5.1 need to be fed to the simulator. Changing these
parameters has an impact on the number of messages and the traffic generated during
the simulation. The goal is to simulate real-world scenarios in experiments and measure
their outcome. This can give hints about how the Publish/Subscribe system behaves
under the given parameters and which approach fits better. An example of an experiment
is a scenario with a lot of subscribers and only a few publishers (a news network) that
publish at a high publishing rate r. Depending on the approach chosen, the outcome is
very different. Figure 7.1 shows an overview of the term-based simulator.
[Figure: UML overview of the term-based simulator with the classes Peer, Subscriber, Publisher, Subscription, SubscriberDB, PublisherDB, PeerManager, PublisherList, PublicProperties, RandomGenerator, DocumentDB and Statistics]
Figure 7.1: UML diagram of term-based simulator
7.2.1 Class Peer
The Peer class represents a peer in the simulation. A peer has a unique id in the P2P
network, so the Peer class has a class member id and a getter getId() to retrieve it.
The class itself is never instantiated in the simulator; it only serves as an abstract base
class for the classes Publisher and Subscriber.
7.2.2 Class Subscriber
The class Subscriber inherits from the Peer class and represents a subscriber in the P2P
network. A subscriber has some subscriptions that express its standing interests. The
subscriptions are generated by each subscriber individually using the RandomGenerator
(cf. Section 7.2.9) class. The number of subscriptions and the terms they contain are
randomly chosen within a configurable range. The subscriptions are stored in the
member field subscriptions.
The class is used in the two different Publish/Subscribe methods, Store-Sub and
Store-Pub. If the simulator simulates Store-Sub, the class uses the methods
subscribe2SubscriberDB() and removeFromSubscriberDB() to subscribe and unsubscribe
itself from the DHT, making use of the class SubscriberDB (cf. Section 7.2.5) to
accomplish these tasks. The method getSubscriptions() is used by SubscriberDB to
store the subscriptions of a subscriber. If Store-Pub is simulated, another method is
used: the class Subscriber calls chooseBestPublishers() to retrieve all statistics
relevant to its subscriptions and to choose the best publishers. The detected publishers
are stored in the class member publishers and are used in every round to check whether
there are new documents. The method mergePublishers() is a helper for
chooseBestPublishers() for the case that the same publishers appear for different
subscriptions.
In a real Publish/Subscribe application that uses Store-Pub, the publishers store the
subscriptions of the subscribers. When new documents arrive, they check whether some
subscriptions match their documents and then do the rest of the job. To simplify the
implementation, the subscribers - who keep their interests in subscriptions - call
the method checkPublications() and use publishers to ask the publishers to match
their subscriptions by calling the method send(Subscriptions). The method send(...)
matches the subscriptions and sends the matching documents to the subscriber.
The method checkReallyAvailableDocuments(...) is only used in Store-Pub and
computes all documents matching the subscriptions of a subscriber. There might be a
discrepancy between the documents disseminated from the publishers to a subscriber
and the documents that would actually match, because each subscriber only chooses a
subset of the publishers. Figure 7.2 shows a UML diagram of class Subscriber.
[Figure: UML class diagram]

Peer: #id: int | +Peer(id:int), +getId(): int

Subscriber (extends Peer): -subscriptions, -publishers | +Subscriber(id:int), -generateSubscriptions(), +subscribe2SubscriberDB(), +removeFromSubscriberDB(), +chooseBestPublishers(), -mergePublishers(), +checkPublications(), +checkReallyAvailableDocuments(publishers), +getSubscriptions(), +receive()

Publisher (extends Peer): -MAX_CHECK_TERMS: int, -MAX_PER_ROUND: int, -db_base: DocumentDB, -db_new: DocumentDB, -db_round: DocumentDB | +Publisher(id:int), +publish(), +unpublish(), +republish(), +checkSubscriptions(), +generateDocumentsForRound(), +getNumberOfDocuments(term:int), +getTerms2Documents(), +compareTo(obj:Object), +send(s:Subscription), +matchSubscriptions(subscriber:Subscriber), +writeBaseToDisk(path:String), +writeNewToDisk(path:String), +loadBaseFromDisk(path:String), +loadNewFromDisk(path:String)

Figure 7.2: Relationship of peer classes
7.2.3 Class Publisher
This class represents a publisher in the simulation. Table 7.1 shows the member fields
of this class:

MAX_CHECK_TERMS   the maximal number of terms to be checked
MAX_PER_ROUND     the maximal number of documents to be published per round
db_base           the 1st document corpus
db_new            the 2nd document corpus
db_round          the document corpus for the current round

Table 7.1: Members of class Publisher

As mentioned before, a publisher has a document corpus and uses it to disseminate
documents. In Store-Sub each publisher has only one corpus, whereas in Store-Pub each
one has two corpora: the first to compute the statistics of the documents and the second
to disseminate the documents. The statistics are announced, deleted and republished
by the methods publish(), unpublish() and republish().
Publishers again have two modes in which they can be used. In Store-Sub, they use the
method checkSubscriptions() to retrieve the subscriptions for the document terms.
The subscriptions are matched against the documents, which are then sent to the
appropriate subscribers. In this case, the class makes use of SubscriberDB (cf. Section
7.2.5) to get the subscriptions and picks some documents for the round from the 1st
corpus. If some term is contained in more than one document, the method is nevertheless
called just once for that term.
In Store-Pub, all subscribers call the method send(Subscription) to match the
documents of the current round against their subscriptions. The documents are picked
by generateDocumentsForRound(n) from the 2nd corpus. The method
matchSubscriptions() is used to determine the documents that are really available in
this round; see Section 7.2.2 for an explanation.
There are some additional methods, write*ToDisk() and load*FromDisk(), that are
used to generate and persist document corpora. The method getNumberOfDocuments(term)
returns the number of documents for a given term, and getTerms2Documents() is used
by PublisherDB (cf. Section 7.2.6) to create the statistics for the corpus of each publisher.
Figure 7.2 shows class Publisher.
7.2.4 Class Subscription
This class represents a subscription in the simulator. A subscription consists of a set of
terms and belongs to a topic (only in the topic-based scheme). Table 7.2 shows the
member fields of this class:

subscriber_id   indicates which subscriber made the subscription
terms           the generated terms
terms_array     the same as terms, but sorted and stored in an array
topic           indicates the topic

Table 7.2: Members of class Subscription

At creation time, the subscription generates its terms by calling the method generate(),
which makes use of class RandomGenerator (cf. Section 7.2.9) to choose the terms and
their number. The methods getTerms() and getSubscriber_ID() return the generated
terms and the nodeID of the subscriber that made the subscription. Figure 7.3 gives an
overview of this class.
[Figure: class Subscription with members subscriber_id, terms and methods Subscription(s:Subscriber), generate(), getSubscriber_ID(), getTerms()]
Figure 7.3: Class Subscription
7.2.5 Class SubscriberDB
This singleton class is responsible for storing and managing the subscriptions made by
the subscribers. The singleton pattern has the advantage that the instance is loaded
only once and can be accessed by all classes without any additional cost. The class is
only used when Store-Sub is simulated, not in Store-Pub, i.e. it represents the DHT of
a real P2P network.
The class has a method getInstance() that is used by the other classes to get the
single instance of SubscriberDB. It is therefore guaranteed that every subscriber and
publisher has the same information about the subscriptions, like in a real P2P network
with a DHT. Subscribers and publishers access the instance in different ways: subscribers
use addSubscription(...) and removeSubscription(...) to add and remove their
subscriptions from the DHT, while publishers, who need all subscribers subscribed for
one particular term, use getPeersSubscribing() for this task. Additionally, there is a
method clear() to delete all entries of the data structure. Figure 7.4 shows class
SubscriberDB.
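The singleton structure just described can be sketched as follows. The method bodies and exact signatures are illustrative reconstructions from the description (the thesis code is not reproduced here); only the lazily created single instance, the term-to-subscribers Hashtable, and the listed method names follow the text.

```java
import java.util.*;

// Sketch of the SubscriberDB singleton: one shared instance holding a
// Hashtable from terms to subscriber lists, standing in for the DHT.
public class SubscriberDB {
    private static SubscriberDB instance;
    private final Hashtable<Integer, List<Integer>> term2subscriberList = new Hashtable<>();

    private SubscriberDB() {} // private constructor: singleton pattern

    public static synchronized SubscriberDB getInstance() {
        if (instance == null) instance = new SubscriberDB();
        return instance;
    }

    // A subscriber announces one subscription term.
    public void addSubscription(int term, int subscriberId) {
        term2subscriberList.computeIfAbsent(term, t -> new ArrayList<>()).add(subscriberId);
    }

    public void removeSubscription(int term, int subscriberId) {
        List<Integer> l = term2subscriberList.get(term);
        if (l != null) l.remove(Integer.valueOf(subscriberId));
    }

    // Publishers retrieve all subscribers for one term.
    public List<Integer> getPeersSubscribing(int term) {
        return term2subscriberList.getOrDefault(term, Collections.emptyList());
    }

    public void clear() {
        term2subscriberList.clear();
    }
}
```

Because every class obtains the same instance via getInstance(), all simulated peers observe the same subscription state, mimicking the globally consistent view a DHT provides.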
[Figure: class SubscriberDB with members subscriberdb: SubscriberDB, term2subscriberList: Hashtable and methods SubscriberDB(), getInstance(): SubscriberDB, addSubscriber(s:Subscriber), removeSubscriber(s:Subscriber), getPeersSubscribing(term:int), clear(); class PublisherDB with members publisherdb: PublisherDB, term2publishers: Hashtable and methods PublisherDB(), getInstance(): PublisherDB, addPublication(p:Publisher), removePublication(p:Publisher), getPublishersByTerm(term:int), getPublisherByTermAsTreeSet(term:int), clear()]
Figure 7.4: PublisherDB and SubscriberDB
7.2.6 Class PublisherDB
Class PublisherDB is also a singleton class, for the same reason as SubscriberDB. The
class is used when simulating the Store-Pub method, and it stores the statistics of the
publishers. The statistic in the simulation is the number of documents for an entry:
the more documents a publisher has for an entry, the better its rank. Much more
sophisticated implementations of statistics are possible.
The publishers announce their statistics using the method addPublication(). Subscribers
can retrieve the statistics for some entry via the methods getPublishersByTerm(term)
and getPublisherByTermAsTreeSet(term), which only differ in the data structure they
return. Like class SubscriberDB, this one has the methods clear() and getInstance()
to delete and to get the single instance of this class. Figure 7.4 shows class PublisherDB.
7.2.7 Class PublisherList
This class is a helper class used by PublisherDB. Its purpose is to store the publishers
and therefore their statistics. The following table shows the members of PublisherList:
publisherlist the datastructure containing pointers to the publishers
term the term of the list, i.e. all publishers are ordered
by the occurrence of term in their documents
dirty indicates whether the list is sorted
Table 7.3: Members of class PublisherList
The class has the methods addPublisher(p) and removePublisher(p) to add and remove
publishers. The number of publishers can be retrieved by size(), and the list can be
emptied using the method clear().
Consider that term has the value 1. Then all publishers having documents that contain
term 1 add themselves to the list. The methods get_X_best(x) and getPublishers()
sort the publishers according to the number of occurrences of term 1 in their documents
and then return either the whole sorted list or only the x best entries. The dirty flag
signals that the list is not sorted: it is cheaper to add without sorting and to sort only
when the list is needed in sorted order. Figure 7.5 shows the class in a UML diagram.
PublisherList
-publisherlist
-term
-dirty
+PublisherList(term)
+addPublisher(p:Publisher)
+get_X_best(n:int)
+clear()
+getPublishers()
+removePublisher(p:Publisher)
+size()
+sort()
Figure 7.5: Class PublisherList
7.2.8 Class DocumentDB
Class DocumentDB contains the document corpus. It is a serializable class that has
methods to generate documents, match subscriptions, restructure the corpus, and so on.
The member fields are:
documents2terms A hashtable that maps document ids to
the terms contained in the documents
terms2documents A hashtable that maps terms to the
documents that contain the term
probability offset Needed to generate documents with terms
and their probabilities
startterm id The 1st/lowest term in the corpus
next document The next document to be published
Table 7.4: Members of class DocumentDB
The class DocumentDB can either generate a corpus ad hoc, using the constructor that
is given the number of terms, the number of documents and the start term, or it can be
deserialized from hard disk, which is helpful to run comparable and reproducible exper-
iments. The second constructor gets a collection of documents and constructs a
mapping from terms to documents. This mapping is called an inverted list and is used
to find out in which documents a term is contained. The method generate() uses the
algorithm explained in Section 7.1 to construct the corpus. Additionally, the class has
methods to get information about the corpus: getNumberOfDocuments(term) returns the
number of documents that exist for a given term and getRemainingDocuments() returns
the number of documents not yet published.
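Constructing such an inverted list is a plain inversion of the document-to-terms mapping. A minimal sketch, under the assumption that documents are represented by integer ids and sets of integer term ids (the class and method names are illustrative):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch of constructing an inverted list: from a mapping
// document id -> terms to a mapping term -> document ids.
class InvertedList {
    static Map<Integer, Set<Integer>> invert(Map<Integer, Set<Integer>> documents2terms) {
        Map<Integer, Set<Integer>> terms2documents = new HashMap<>();
        for (Map.Entry<Integer, Set<Integer>> doc : documents2terms.entrySet()) {
            for (int term : doc.getValue()) {
                // every term of the document points back to the document id
                terms2documents.computeIfAbsent(term, t -> new HashSet<>())
                               .add(doc.getKey());
            }
        }
        return terms2documents;
    }
}
```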
Two methods are capable of matching subscriptions and subscribers (with all their sub-
scriptions) against the documents. Their names are getDocumentsForSubscriber()
and getDocumentsForSubscription(...), and they return the documents matching the
subscriptions; both use matchSubscription(...) as a helper method. To create the
documents for a round, the method getDocuments(n) is called; it returns n documents
from the corpus, constructs the inverted list and increments next_document, i.e. n
documents from the corpus are consumed for publishing. The terms of these documents
can be accessed by getTerms() and are used to query class SubscriberDB for subscribers.
Class PublisherDB makes use of the method getTerms2Documents() to add the publishers'
statistics to the DHT. It is also possible to merge documents into an existing
DocumentDB. This is necessary when publishers need to republish their statistics:
usually some documents have already been published and must also be used to form the
profile, so the DocumentDB, which is the basis for the statistics, has to be augmented
with the already published documents. The methods responsible for this are add(doc),
getFirstDocuments() and restructure(). Figure 7.6 shows a diagram of this class.
7.2.9 Class RandomGenerator
Class RandomGenerator is a singleton class which offers methods to randomly generate
numbers. The member fields of this class are:
rg the static instance of this class
random instance of Random class from Java API
Table 7.5: Members of class RandomGenerator
This class has two methods to generate random numbers: randomNumber() generates
an arbitrary random number, while randomNumber(n) generates a random number
between 0 and n − 1 and returns it.
These methods are used at various places in the simulator. They are used to generate:
• number of subscriptions for each subscriber
• number of terms in subscriptions
• number of subscribers that join/leave in a round
• number of documents to publish in a round
Figure 7.7 shows the UML diagram of this class.
DocumentDB
-documents2terms
-terms2documents
-probability_offset
-startterm_id
-next_document
+DocumentDB(documents2terms)
+DocumentDB(terms,documents,startterm)
+add(id,doc)
+clear()
-generate()
+get_N_BestTerms(n)
+getDocumentDB(n)
+getDocuments4Term(term)
+getDocumentsForSubscriber(subscriber)
+getDocumentsForSubscription(subscription)
+getFirstDocuments()
+getNumberOfDocuments(term)
+getRemainingDocuments()
+getTerms()
+getTerms2Documents()
-matchSubscription(subscription)
+restructure()
Figure 7.6: Class DocumentDB
RandomGenerator
-rg: RandomGenerator
-random
-RandomGenerator()
+getInstance()
+randomNumber()
+randomNumber(n:int)
Figure 7.7: Class RandomGenerator
7.2.10 Class PublicProperties
PublicProperties is a properties class that is implemented as a singleton. The class is
used to parameterize the simulator and is self-explanatory. A diagram can be viewed in
Figure 7.8.
PublicProperties
-pp: PublicProperties
-properties
-PublicProperties()
+getInstance()
+getAsBoolean(key)
+getAsInt(key)
+getAsString(key)
Figure 7.8: Class PublicProperties
The properties that can be accessed are listed in table 7.6:
rounds number of rounds to simulate
subs per peer maximal subscriptions per subscriber
max number of terms maximal number of terms per subscription
docs number of documents per publisher
terms number of different terms per corpus
topics number of topics per corpus
termspace all occurring terms in the simulation
probability offset defines the probability of the 1st term in the corpus
doc path directory path of serialized corpora
env path directory of Berkeley environment
Table 7.6: List of properties
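Such typed accessors are thin wrappers around java.util.Properties. A minimal sketch of the idea; the class name SimProperties, the set(...) method and the default values are assumptions of this sketch (the real simulator reads its values from a properties file):

```java
import java.util.Properties;

// Sketch of typed accessors over java.util.Properties, in the style of
// PublicProperties (getAsInt, getAsBoolean, getAsString), as a singleton.
class SimProperties {
    private static SimProperties pp;
    private final Properties properties = new Properties();

    private SimProperties() {
        // illustrative defaults instead of a loaded properties file
        properties.setProperty("rounds", "1000");
        properties.setProperty("topics", "100");
    }

    static synchronized SimProperties getInstance() {
        if (pp == null) pp = new SimProperties();
        return pp;
    }

    void set(String key, String value) { properties.setProperty(key, value); }

    int getAsInt(String key) { return Integer.parseInt(properties.getProperty(key)); }

    boolean getAsBoolean(String key) { return Boolean.parseBoolean(properties.getProperty(key)); }

    String getAsString(String key) { return properties.getProperty(key); }
}
```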
7.2.11 Class Statistics
Class Statistics is a singleton class that counts the messages and traffic generated
during a simulation. In each method - Store-Sub and Store-Pub - the messages and
traffic must be counted in different ways and at different positions in the code. The
output of this class can be used as a basis for plotting graphs and gives an overview of
message and traffic consumption.
7.2.12 Class BerkeleyDocumentDB
This class is only used in the topic-based scheme and never in the term-based one. It
represents the document corpus and uses BerkeleyDB [9] as back end. Because this class
needs many members to maintain functionality that is not important to the simulator,
those are omitted from the description. Table 7.7 shows the member fields of
BerkeleyDocumentDB.
counter the first not-yet-published document
id the id of the class
number of documents number of documents contained
number of terms number of terms in the corpus
startterm the lowest term of the corpus
termdistributiondb a database which contains the inverted list
topicdistributiondb mapping from topic to documents
Table 7.7: Members of class BerkeleyDocumentDB
BerkeleyDocumentDB
-counter
-documentdb
-id
-number_of_documents
-number_of_terms
-startterm
-termdistribution
-topicdistribution
+BerkeleyDocumentDB(n:int)
+close()
-computePath()
-computeTopics()
-generate(startterm:int)
+getDocument(id:int)
+getDocumentsForRound(n:int)
+getNumberOfDocuments(topic:int)
+getTopicDistribution(topic:int)
+hasMoreElements()
-initializeDatabase()
+nextElement()
+storeDocument()
+storeTopicDistribution(td:TopicDistribution)
+topicDistributions()
Figure 7.9: Class BerkeleyDocumentDB
The constructor initializes the instance. It uses the methods computePath()
and initializeDatabase() and class PublicProperties.
Method getDocument(id) returns a requested document from the corpus in the form of
class Document, and storeDocument(doc) stores a document. Class Document is simply
a wrapper around the terms, implemented to use the mechanisms of BerkeleyDB.
Method generate() is used to generate a new corpus and store the whole database
under the name id. Method getDocumentsForRound(n) returns a DocumentCollection
with n documents for each round. To return the number of documents for a topic and
to construct the profile for each publisher, the method getNumberOfDocuments(topic)
is called by class PublisherList. In the case of topic-based Publish/Subscribe, topics
need to be computed; method computeTopics() accomplishes this task, using
getTopicDistribution(topic) as a helper method. Figure 7.9 shows an overview of this
class.
TopicDB
-instance
-number_of_topics
-topicdb
-topic2terms
-TopicDB()
+getInstance()
+close()
-computePath()
-generateTopics()
+getTerms4Topic(topic:int)
+getTopic(term:int)
-initializeDatabase()
-invert()
Figure 7.10: Class TopicDB
7.2.13 Class TopicDB
Class TopicDB is responsible for assigning a topic to each term. In real documents, a term
may belong to several topics, but for simplicity each term has exactly one topic here.
This class is only used in the topic-based scheme. Its members are shown in Table 7.8.
instance the singleton instance of this class
number of topics the number of topics in the simulation
topicdb the database containing the mapping term → topic
topic2terms the database containing mapping topic → terms
Table 7.8: Members of class TopicDB
This class is used system-wide to generate subscriptions and to retrieve the topic of a
term or all the terms belonging to a topic. To generate the TopicDB, the whole termspace
is divided and each term is assigned a topic; afterwards the mapping is inverted. To be
accessible from all classes, this class is a singleton. It has the methods
getTerms4Topic(...) and getTopic(...) to retrieve all terms for a topic and the
topic for a term, respectively. The methods generateTopics() and invert() construct the
whole database with intensive use of RandomGenerator. Figure 7.10 gives an overview.
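The construction of the term-to-topic mapping and its inversion can be sketched as follows. Dividing the termspace into equal, contiguous blocks is a simplification of this sketch; the real class assigns topics with RandomGenerator and keeps both mappings in BerkeleyDB:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of TopicDB's construction: divide the termspace so that each
// term gets exactly one topic, then invert the mapping (term -> topic
// becomes topic -> terms).
class Topics {
    private final Map<Integer, Integer> term2topic = new HashMap<>();
    private final Map<Integer, List<Integer>> topic2terms = new HashMap<>();

    Topics(int termspace, int numberOfTopics) {
        int termsPerTopic = termspace / numberOfTopics;
        for (int term = 0; term < termspace; term++) {
            // deterministic block assignment instead of RandomGenerator
            int topic = Math.min(term / termsPerTopic, numberOfTopics - 1);
            term2topic.put(term, topic);
        }
        invert();
    }

    private void invert() {
        for (Map.Entry<Integer, Integer> e : term2topic.entrySet()) {
            topic2terms.computeIfAbsent(e.getValue(), t -> new ArrayList<>())
                       .add(e.getKey());
        }
    }

    int getTopic(int term) { return term2topic.get(term); }

    List<Integer> getTerms4Topic(int topic) { return topic2terms.get(topic); }
}
```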
DocumentCollection
-documents
+DocumentCollection(documents)
+getDocuments4Subscriber(s:Subscriber)
+getDocuments4Subscription(s:Subscription)
+getTopics()
+match(Subscription,Document)
Figure 7.11: Class DocumentCollection
7.2.14 Class DocumentCollection
This class represents a collection of documents and is only used in the topic-based
scheme. In every round, publishers choose a number of documents to publish.
DocumentCollection collects all these documents and is also capable of matching the
documents against subscriptions. The class only has the member field documents, which
stores the documents to be published.
The documents for the round are passed via a constructor parameter. With the method
getTopics(), publishers can retrieve the set of topics of the documents in the collection.
Method match(...) is a helper method that matches one subscription against one
document and returns true if all terms of the subscription are contained in the document.
The methods getDocuments4Subscriber() and getDocuments4Subscription() make
use of this helper and return all documents matching the subscriber and the subscription,
respectively. Figure 7.11 sketches class DocumentCollection.
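The matching rule can be sketched in a few lines. Representing subscriptions and documents as sets of integer term ids is an assumption of this sketch; the real match(...) works on the simulator's Subscription and Document classes:

```java
import java.util.Set;

// Sketch of the matching rule: a subscription matches a document
// iff every term of the subscription occurs in the document.
class Matcher {
    static boolean match(Set<Integer> subscriptionTerms, Set<Integer> documentTerms) {
        return documentTerms.containsAll(subscriptionTerms);
    }
}
```

This is conjunctive matching: a subscription with terms {1, 4} does not match a document with terms {1, 2, 3}, because term 4 is missing.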
7.2.15 Class TopicDistribution
TopicDistribution is the class that stores how many documents there are for a topic.
This information is used by PublisherDB to create the statistics used by subscribers.
This class is only used in the topic-based scheme, and its members are shown in
Table 7.9. The constructor creates a new topic distribution for a given topic. The method
counter the number of documents for this topic
topic the topic
Table 7.9: Members of class TopicDistribution
addDocument() counts the documents, and the methods getNumberOfDocuments() and
getTopic() return the number of documents for the topic and the topic itself,
respectively. Figure 7.12 shows class TopicDistribution.
TopicDistribution
-counter
-topic
+TopicDistribution(topic:int)
+addDocument(doc:Document)
+getNumberOfDocuments()
+getTopic()
Figure 7.12: Class TopicDistribution
7.3 Lessons learned
This section briefly describes the lessons learned while implementing the simulator. First,
depending on the system parameters, the simulator needed very different amounts of time
to finish so that the results could be reviewed. The run times varied from 30 minutes
up to 9 hours for one run of Store-Pub with many subscribers and about 20 pub-
lishers. This was due to the fact that publishers as well as subscribers needed to refresh
the statistics, and that in each round the method checkReallyAvailableDocuments(..)
(cf. 7.2.2) had to be executed for all subscribers.
Second, the space consumption of the document corpora is huge. As each pub-
lisher in Store-Pub has two corpora with 50000 documents each, a fast database is
crucial for this purpose. The term-based implementation works with serialized objects
and therefore consumes a lot of memory. The topic-based implementation uses the Java
API of BerkeleyDB and is thus much more efficient with respect to memory consumption.
The only disadvantage is that corpus generation is not as fast as with the term-based
serialization.
8 Experimental Results
This chapter presents measurements made with the simulator. In the simulator, each
publisher has its own document corpus, a predefined range of terms and a predefined
number of documents. All documents only contain terms within this range, which in
the simulations always comprised 100000 terms. To ensure that some subscriptions can
be matched by only one publisher, the range of terms for each publisher is different, but
the ranges mostly overlap. For example, publisher p1 has documents with terms in the
range from 1 to 100000, p2 from 21 to 100020, p3 from 41 to 100040, and so on. In
essence, there can be documents exclusively owned by one publisher and not by another,
and it can also happen that different publishers have the same document.
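The sliding term ranges from the example can be written down directly. The following sketch (the class name and methods are illustrative) encodes the offset of 20 terms per publisher used above:

```java
// Sketch of the overlapping term ranges from the example:
// publisher i covers the terms [1 + 20*(i-1), 100000 + 20*(i-1)],
// so p1 covers 1..100000, p2 covers 21..100020, and so on.
class TermRange {
    static int lowest(int publisher)  { return 1 + 20 * (publisher - 1); }
    static int highest(int publisher) { return 100000 + 20 * (publisher - 1); }

    // true iff the given term can occur in this publisher's documents
    static boolean covers(int publisher, int term) {
        return term >= lowest(publisher) && term <= highest(publisher);
    }
}
```

With these ranges, a subscription for a term like 100010 can only be matched by publishers p2 and later, while the bulk of the termspace is shared by all publishers.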
8.1 Average behaviour
The simulation shows the average behaviour under normal load. The system parameters
are shown below in table 8.1. Figure 8.1 shows the result when the parameters above
|S| 5000
|Snew| 5 on average
|P | 10
subavg 3
sub 5
|T | 100000
|D| log2(100000) ≈ 16
r 1 or 100 per round
Table 8.1: Parameters for scenario with variable publishing rate
are used with a publishing rate of 1 document per round. As can be seen, Store-Sub
consumes fewer messages but shows a lot of variance in comparison to Store-Pub.
Additionally, it can be derived from the graph that Store-Sub is very sensitive to changes
of the publishing rate r. These changes are visible (compared to the formulas) because
the number of documents published in each round is random: it may happen that in one
round just 10 ∗ 1 = 10 documents are published and in another round 20 ∗ 10 = 200.
This clearly affects the number of messages; the average consumption can be observed
in the plot. Figure 8.2 shows the simulation when the publishing rate is 100 on average.
Figure 8.1: Publishing rate 1
Again, the system parameters are as in table 8.1. Now Store-Pub performs far better
than Store-Sub; Store-Pub itself has not changed compared to Figure 8.1, as the change
of the publishing rate has no impact on it.
This leads to the conclusion that Store-Sub seems to perform better in a user-to-user
scenario, because under low, stable publishing rates Store-Pub generates more messages.
On the other hand, Store-Pub performs better in a publisher-to-user scenario, because
Store-Sub is sensitive to the publishing rate r while Store-Pub generates nearly the same
number of messages throughout.
Figure 8.2: Publishing rate 100
8.2 Comparison Simulator and Formulas
This section directly compares the results of experiments with the simulator to the
results computed with the formulas described in Section 6. The system parameters are
the same as in table 8.1 for both the simulator and the formulas.
Figure 8.3: Comparison of Simulator and Formula with r = 1
Figure 8.3 shows the comparison of the simulator and the derived formulas. The legend
of the graph shows (Sim) if a line belongs to data of the simulator and (For) if it belongs
to data of the formulas. It can be seen that at a publishing rate of r = 1 the values of
the simulator are forecast well. Figure 8.4 shows the same comparison, but with a
publishing rate r of 100.
Figure 8.4: Comparison of Simulator and Formula with r = 100
8.3 Growing number of subscribers
The following experiment uses the topic-based scheme to compare the approaches Store-
Sub and Store-Pub. In the experiments we assume 100 pregenerated topics that are used
to categorize the documents of the publishers. We vary the number of subscribers, such
that at the beginning there is no subscriber and at the end there are 50000 subscribers.
Table 8.2 shows the parameters used in the simulation. Figure 8.5 shows the outcome of
|S| growing from 0 to 50000
|Snew| 5 on average
|P | 10
subavg 3
sub 5
|T | 100000
|C| 100
|D| log2(100000) ≈ 16
r 10 per round
Table 8.2: Parameters for scenario with growing number of subscribers
the experiment.
Figure 8.5: Varying number of subscribers
The x-axis indicates the number of subscribers and the y-axis indicates the number of
messages generated by each approach. The red line shows the approach
Store-Sub. It can be seen that as the number of subscribers grows, the number of
messages grows faster than in Store-Pub, which is shown as the green line. The number
of messages grows in Store-Pub as well, because the log(N) factor (routing in the DHT)
grows. The result of this experiment shows that in a typical publisher-to-user scenario
Store-Pub fits better, simply because its number of messages is much lower and less
sensitive.
8.4 Adding publishers
This experiment again simulates the topic-based scheme to compare the two approaches.
Now we vary the number of publishers, while the number of subscribers in the experiment
stays at approximately 5000. At the beginning of the simulation there are 10 publishers,
and in round 2500 we add 10 more publishers. Table 8.3 shows the parameters for the
simulation.
Figure 8.6 shows the outcome of this simulation run. The x-axis indicates the round
and the y-axis the number of messages. The green line shows the number of messages
of Store-Pub: adding publishers has no impact on its number of messages, which remains
constant throughout the simulation. The red line shows the effect of adding the
publishers in round 2500: the number of messages doubles, because the number of
publishers and therefore the average number of documents per round doubles. Thus,
Store-Sub again shows the same sensitivity to the publishing rate r. This leads to the
conclusion that Store-Pub seems to be more efficient
|S| 5000 on average
|Snew| 5 on average
|P | 10 to 20
subavg 3
sub 5
|T | 100000
|C| 100
|D| log2(100000) ≈ 16
r 10 per round
Table 8.3: Parameters for scenario with growing number of publishers
Figure 8.6: Varying number of publishers
(concerning the number of messages) than Store-Sub in a publisher-to-user scenario.
9 Conclusion
We presented two approaches to implementing the Publish/Subscribe paradigm on top
of a structured P2P network. The approaches have been analyzed concerning the
generation of messages and traffic. Additionally, we designed and implemented a discrete
event simulator, presented some experiments and related the outcomes to certain
usage scenarios. The analysis and the measurements show that Store-Sub is very sensitive
to changes of the publishing rate r and the number of publishers. This leads to the
conclusion that Store-Pub is the better choice for typical publisher-to-user scenarios.
On the contrary, in a user-to-user scenario with many publishers and subscribers, Store-
Sub seems to perform better, as Store-Pub is sensitive to the number of subscriptions.
One of the biggest challenges in the future will be to tackle network congestion. This
congestion will certainly be a topic for further research, since network resources are
becoming available to nearly everyone and future network applications will make heavy
use of them. The presented analysis of Publish/Subscribe approaches can quickly give
hints on the properties of Publish/Subscribe systems. It can be used to evaluate the
sensitivity of a system when application models or usage scenarios change. Additionally,
it contributes to optimizing the application designs in question and to improving the
quality of such systems.
Further steps include the use of more refined models of, for example, the user behaviour
when subscribing. A model of "What is a user subscribing for?" could give additional
guidance for Publish/Subscribe applications and lead to a more efficient use of the
available options.
Bibliography
[1] BitTorrent homepage. http://www.bittorrent.com.
[2] Bittorrent at Wikipedia. http://en.wikipedia.org/wiki/BitTorrent.
[3] Chimera and Tapestry. http://p2p.cs.ucsb.edu/chimera/html/home.html.
[4] The Chord project. http://pdos.csail.mit.edu/chord.
[5] eDonkey2000 at Wikipedia. http://en.wikipedia.org/wiki/EDonkey2000.
[6] eDonkey2000 homepage. http://www.edonkey2000.com/.
[7] Gnutella at Wikipedia. http://en.wikipedia.org/wiki/Gnutella.
[8] Gnutella homepage. http://www.gnutella.com/.
[9] Homepage of BerkeleyDB. http://www.sleepycat.com.
[10] JavaSpaces. http://java.sun.com/docs/books/jini/javaspaces/.
[11] Napster at Wikipedia. http://en.wikipedia.org/wiki/Napster.
[12] Napster homepage. http://www.napster.com.
[13] The Pastry Project. http://freepastry.org/.
[14] SHA hash functions. http://en.wikipedia.org/wiki/SHA_hash_functions.
[15] Zipf's law. http://en.wikipedia.org/wiki/Zipf's_law.
[16] Karl Aberer. A self-organizing access structure for P2P information systems, vol-
ume 2172. Springer Verlag Berlin et al., 2001.
[17] Abhishek Gupta, Ozgur D. Sahin, Divyakant Agrawal, and Amr El Abbadi. Megh-
doot: Content-based publish/subscribe over P2P networks. Technical report, Uni-
versity of California at Santa Barbara.
[18] Chee-Yong Chan, Pascal Felber, Minos Garofalakis, and Rajeev Rastogi. Efficient
filtering of XML documents with XPath expressions. The VLDB Journal, pages
354–379, 2002.
[19] P. Eugster and R. Guerraoui. Content-based publish/subscribe with structural
reflection. In Usenix Conference on Object-Oriented Technologies and Systems,
2001.
[20] Patrick Thomas Eugster. Type-based Publish/Subscribe. PhD thesis, École Poly-
technique Fédérale de Lausanne, 2001.
[21] Google, www.google.com/alerts. Google Alert.
[22] Network Working Group. Telnet Protocol Specification.
http://www.ietf.org/rfc/rfc854.txt, May 1983.
[23] Network Working Group. Hypertext Transfer Protocol – HTTP/1.1.
http://www.ietf.org/rfc/rfc2616.txt, June 1999.
[24] Network Working Group. The Secure Shell (SSH) Protocol Architecture.
http://www.ietf.org/rfc/rfc4251.txt, January 2006.
[25] Ion Stoica, Robert Morris, David Karger, M. Frans Kaashoek, and Hari Balakrishnan.
Chord: A scalable peer-to-peer lookup service for internet applications. Technical
report, MIT Laboratory for Computer Science.
[26] M. Altinel and M. Franklin. Efficient filtering of XML documents for selective
dissemination. In Proceedings of the 26th International Conference on Very Large
Data Bases (VLDB '00), pages 53–64, 2000.
[27] Chaoying Ma and Jean Bacon. COBEA: A CORBA-based event architecture. Uni-
versity of Cambridge, 1998.
[28] M.A. Jovanovic, F.S. Annexstein, and K.A. Berman. Scalability issues in large
peer-to-peer networks - a case study of Gnutella. Technical report, University of
Cincinnati, 2001.
[29] Miguel Castro, Peter Druschel, Anne-Marie Kermarrec, and Antony Rowstron.
SCRIBE: A large-scale and decentralized application-level multicast infrastructure.
IEEE Journal on Selected Areas in Communications, volume 20, October 2002.
[30] A. Oram. Peer-to-Peer: Harnessing the Power of Disruptive Technologies. O'Reilly
& Associates, Inc., Sebastopol, CA, USA, 2001.
[31] Jonathan B. Postel. Simple Mail Transfer Protocol. University of Southern Califor-
nia, http://www.ietf.org/rfc/rfc0821.txt, August 1982.
[32] S. Ratnasamy, P. Francis, M. Handley, R. Karp, and S. Schenker. A scalable
content-addressable network. In SIGCOMM 2001, 2001.
[33] Andreas Zeidler. A Distributed Publish/Subscribe Notification Service for Pervasive
Environments. PhD thesis, TU Darmstadt, September 2004.