Universität des Saarlandes
Naturwissenschaftlich-Technische Fakultät I
Fachrichtung Informatik
Diplomstudiengang Informatik
Diplomarbeit
A Comparative Study of Pub/Sub Methods in
Structured P2P Networks
submitted by
Sebastian Parkitny
on 27 September 2006
prepared under the supervision of
Prof. Gerhard Weikum
advised by
Matthias Bender, Sebastian Michel
reviewed by
Prof. Gerhard Weikum
Prof. Thorsten Herfet
Declaration

I hereby declare that I have written this thesis independently and have cited all sources used.

Saarbrücken, 27 September 2006
Sebastian Parkitny
Contents
1 Introduction
2 P2P Systems
  2.1 Unstructured P2P
  2.2 Structured P2P
3 Publish/Subscribe systems
  3.1 Topic-based Publish/Subscribe
  3.2 Content-based Publish/Subscribe
  3.3 Term-based Publish/Subscribe
  3.4 Type-based Publish/Subscribe
  3.5 Current Publish/Subscribe systems
4 Distributed Publish/Subscribe Algorithms
  4.1 Scribe
  4.2 Meghdoot
5 Introduction to Store-Sub and Store-Pub
  5.1 Terminology
  5.2 Store-Sub
  5.3 Store-Pub
6 Analysis of Store-Sub and Store-Pub
  6.1 Formulas
  6.2 Analysis
7 Design and Implementation of a Discrete Simulator
  7.1 Creating Synthetic Data
  7.2 Implementation
  7.3 Lessons learned
8 Experimental Results
  8.1 Average behaviour
  8.2 Comparison Simulator and Formulas
  8.3 Growing number of subscribers
  8.4 Adding publishers
9 Conclusion
List of Figures
2.1 Gnutella network topology
2.2 Chord ring with 8 nodes
2.3 Primitive lookup()
2.4 Chord ring with 8 nodes and finger table
3.1 Overview of general Publish/Subscribe system
3.2 Hierarchical addressing in topic-based Publish/Subscribe
3.3 Content-based Publish/Subscribe in stock markets
3.4 Type-based Publish/Subscribe
5.1 Overview Store-Sub
5.2 Overview Store-Pub
6.1 Message types in Store-Sub
6.2 Message types in Store-Pub
6.3 #Messages at variable publishing rate
6.4 #Messages at variable refresh interval
7.1 UML diagram of term-based simulator
7.2 Relationship of peer classes
7.3 Class Subscription
7.4 PublisherDB and SubscriberDB
7.5 Class PublisherList
7.6 Class DocumentDB
7.7 Class RandomGenerator
7.8 Class PublicProperties
7.9 Class BerkeleyDocumentDB
7.10 Class TopicDB
7.11 Class DocumentCollection
7.12 Class TopicDistribution
8.1 Publishing rate 1
8.2 Publishing rate 100
8.3 Comparison of Simulator and Formula with r = 1
8.4 Comparison of Simulator and Formula with r = 100
8.5 Varying number of subscribers
8.6 Varying number of publishers
List of Tables
5.1 Notation
6.1 Parameters for scenario with variable publishing rate
6.2 Parameters for scenario with variable subavg
6.3 Parameters for scenario with variable interval
7.1 Members of class Publisher
7.2 Members of class Subscription
7.3 Members of class PublisherList
7.4 Members of class DocumentDB
7.5 Members of class RandomGenerator
7.6 List of properties
7.7 Members of class BerkeleyDocumentDB
7.8 Members of class TopicDB
7.9 Members of class TopicDistribution
8.1 Parameters for scenario with variable publishing rate
8.2 Parameters for scenario with growing number of subscribers
8.3 Parameters for scenario with growing number of publishers
A shorter version of this work was accepted to the Fourth International Workshop on
Databases, Information Systems and Peer-to-Peer Computing (DBISP2P 2006) in Seoul,
Korea.
Zusammenfassung

The Publish/Subscribe paradigm is a common model for distributing content of all kinds between producers (publishers) and interested parties (subscribers). Various structured approaches have been designed and analyzed, but not compared with other approaches. We develop two Publish/Subscribe approaches based on a structured P2P network and mathematically analyze their complexity. Furthermore, we compare the two approaches under different choices of system parameters and associate the results with particular scenarios. Finally, we design a discrete simulator and present experimental measurements of the two approaches.
Abstract
Publish/Subscribe systems have become an established model for delivering content from publishers to subscribers. Many approaches built on top of a P2P network have been proposed and evaluated, but typically each approach is evaluated on its own. We identify two approaches to implementing Publish/Subscribe on a structured P2P network and provide a mathematical analysis of their complexity. Furthermore, we compare the two approaches for several choices of system parameters and associate the outcomes with certain usage scenarios. Thus, we can provide evidence of which approach is suitable for which scenario. Finally, we design and implement a discrete event simulator and present results of experimental measurements of both approaches.
1 Introduction
In the last few years the term "Peer-to-Peer" (P2P) has drawn much attention. P2P applications provide file sharing, distributed computing, internet-based telephony, and more. Applications for file sharing in particular have emerged rapidly because the Peer-to-Peer paradigm offers many more possibilities than the old-fashioned client-server architecture. As new applications arise, internet traffic also grows enormously: internet providers claim that currently more than 50% of all traffic is generated by Peer-to-Peer applications. This continuous growth of traffic implies a need to focus on the design and analysis of Peer-to-Peer applications. First, careful design can ensure the health and stability of such applications and make them more reliable, scalable and flexible. Second, analysis can give hints about the traffic generated by a Peer-to-Peer application and thus help to increase the quality of programs.
A popular subset of P2P implementations are distributed Publish/Subscribe applications. Publish/Subscribe is an asynchronous messaging paradigm that decouples publishers from subscribers. It is suitable for exchanging content of any kind and comes in several schemes, such as the content-based and topic-based schemes. There are several approaches to implementing and evaluating Publish/Subscribe systems, but none of them compares the proposed system to other approaches.
The goal of this thesis is to identify two different usage scenarios for Publish/Subscribe and two general approaches to implementing a Publish/Subscribe system on top of a structured P2P network. The approaches are called Store-Sub and Store-Pub and will be explained in detail. The complexity of the approaches is analyzed theoretically. Finally, we design and implement a discrete event simulator and present results of experimental measurements of the two approaches under different choices of system parameters. The results of the experiments are interpreted and associated with certain usage scenarios.
The remainder of the thesis is organized as follows. Chapter 2 explains the P2P paradigm and presents two general categories, namely unstructured and structured P2P networks. Chapter 3 illustrates the general Publish/Subscribe paradigm; afterwards, several Publish/Subscribe schemes are presented and some deployed implementations are explained. Chapter 4 presents some distributed Publish/Subscribe systems and explains the advantages of decentralized Publish/Subscribe. Chapter 5 introduces two approaches to implementing Publish/Subscribe on top of a P2P network; the approaches are called Store-Sub and Store-Pub and differ in the organization of data. Chapter 6 analyzes the two approaches mathematically and compares their behaviour and sensitivity in different usage scenarios. Chapter 7 presents the discrete event simulator in detail and clarifies the assumptions made in the implementation. Chapter 8 shows experiments made with the simulator, explains the outcomes and associates them with different usage scenarios. Additionally, the simulator is used to compare the outcomes of the mathematical analysis and the simulations. Finally, Chapter 9 concludes this work and points out future research directions.
2 P2P Systems
Today, P2P networks are one of the most popular classes in the field of internet applications. A tremendous number of applications use the P2P paradigm, ranging from instant messaging and VoIP¹ to file sharing and grid computing. Most of these applications are implemented as large-scale P2P networks with millions of participating users spread all over the world. All of them use special techniques and possibly implement additional paradigms to offer their services and fulfill the needs of the participating peers. P2P systems are also the focus of many research groups, which contribute in several ways but, unlike popular applications, do not necessarily deploy their results in real-world applications. According to internet providers, today more than 50% of internet traffic is generated by P2P applications; in some cases the share is even more than 75%.
Oram et al. [30] give the basic definition of the term Peer-to-Peer:
[a Peer-to-Peer system is] a self-organizing system of equal, autonomous en-
tities (peers) [which] aims for the shared usage of distributed resources in a
networked environment avoiding central services.
Following this definition, a P2P network consists of a set of peers participating in the network via an application. All peers are by definition equal; in essence, they are in a position to execute the same operations and must contribute in the same manner within the P2P network. These applications are mostly offered by open-source projects but are also available from commercial software providers.
The popular counterpart of the P2P paradigm is the client-server paradigm, which separates the client from the server. The client-server paradigm, or design pattern, is still used in the majority of internet applications. Typical representatives of this paradigm are, for example, HTTP [23], SSH [24], Telnet [22] and SMTP [31].

¹ Voice over Internet Protocol (VoIP) is the usual term for making phone calls over the internet via appropriate applications.
The most common example is HTTP. In HTTP there are web servers and clients. A web server offers various contents (websites, images, movies) that clients wish to retrieve. A client uses HTTP to connect to the server and ask for some specific content, and the server delivers the desired content via a separate HTTP connection or possibly the same one. This leads to high computational load on the server and is clearly a bottleneck of the client-server paradigm. Another drawback is the single point of failure: if the server is not able to respond, for whatever reason, the clients cannot retrieve any content. In a P2P network, however, a peer acts both as server and client; the word servent captures the fact that a peer is able to play both roles. Nowadays, the old-fashioned client-server paradigm struggles with a number of requirements which are, more or less, implicitly met by P2P applications. These main requirements can be identified as follows:
• Scalability
• Security and reliability
• Flexibility
P2P networks became popular because they meet these requirements better than client-server applications: they offer nearly unlimited scalability and much more flexibility and reliability. There are several kinds of P2P networks: P2P networks with servers, hybrid networks and pure P2P networks. Pure P2P networks have no servers, and no peer has an overview of the whole network. They are highly decentralized, and resources are spread across the peers in the network. A peer only knows those peers it interacts with and hence may only query and match resources published by a certain number of peers. Because there are no servers, there is no single point of failure, and hence such networks are more scalable. Peers may join and leave the network arbitrarily, so P2P networks change over time; they are highly dynamic artifacts. Since there are no dedicated servers, the contents, where they exist, are located at the peers.
One of the most popular applications is indeed the sharing of files between peers, not only legally but unfortunately also illegally. Famous examples of this kind of application are Napster [12, 11] (today a legal online music service), eDonkey2000 [6, 5], Gnutella [8, 7] and BitTorrent [1, 2]. These file-sharing networks can also be categorized into different generations of P2P networks according to the paradigms they pursue.
2.1 Unstructured P2P
An unstructured P2P network is, in general, a P2P network that has no structured overlay topology. Peers that would like to participate in the network do so by connecting to other network participants. Unstructured P2P networks were used in most of the early P2P applications. The file-sharing applications mentioned above (Napster, Gnutella, eDonkey2000, BitTorrent) are unstructured P2P applications and can be divided into two categories, namely centralized and decentralized P2P networks. Napster and eDonkey, for example, are centralized P2P file-sharing systems that need a central server to connect to; all search queries are routed to this central server. Consider a user A who would like to retrieve a freely available audio book that is offered by another user B in the P2P network. User A contacts the server and queries for the audio book. The server responds with a list of users offering it. User A can now contact user B from the reply list and request the audio book directly. At this point the server is no longer needed.
Fully decentralized systems like Gnutella and BitTorrent don't need any kind of central server to connect to or to transmit search queries. They use different techniques to emit search queries and receive responses. The most popular search technique in a decentralized, unstructured P2P network is flooding, which is explained in the remainder of this section.
Gnutella 0.4 is a network that consists of a large number of nodes. The nodes may be distributed all over the world and the network has no central entity. According to M.A. Jovanovic [28], the overlay topology can be visualized as a graph of network nodes. Figure 2.1 shows an example topology of a Gnutella 0.4 network. In a Gnutella network every peer only knows a certain number of peers, called its neighbours. Because no peer has a complete overview of the network, a peer cannot necessarily find everything it queries. This is due to the fact that the peers are arbitrarily connected to other peers and that flooding does not reach all peers. Some unstructured P2P networks try to improve the topology, and therefore the performance and scalability, e.g. by establishing direct connections only between geographically proximate peers, thus minimizing the geographic distance between them.
Figure 2.1: Gnutella network topology

Flooding works as follows. If a peer searches for a file, it queries its neighbours, which have been determined during bootstrapping. The query is conveyed by a message with additional fields. One of them is the TTL field, which defines the number of hops a message travels within the network. The message is sent from node to node, and every node decreases the TTL field until it reaches zero; the message is then not propagated any further, thus avoiding infinite cycles. The TTL thus defines a radius of nodes to be queried. It has been found that a TTL value of 7 is sufficient for most queries. The following formula states the method used in Gnutella 0.4:

TTL(0) = TTL(i) + Hops(i) ≤ 7     (2.1)

Here Hops(i) is the number of times the message has already been forwarded and TTL(i) is the TTL field within the message. TTL(0) is the value of the TTL field when the message has not been forwarded yet. With this formula the radius is defined.
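The TTL mechanism can be made concrete with a small sketch, which is illustrative only and not code from this thesis: the overlay is modelled as an adjacency dictionary, and the function name flood is an assumption for illustration.

```python
from collections import deque

def flood(network, start, ttl=7):
    """TTL-limited flooding: the query starts with the given TTL and every
    hop decreases it by one; once it reaches zero the message is dropped.
    Returns the set of peers the query reaches within that radius."""
    best = {start: ttl}                # highest remaining TTL seen per peer
    queue = deque([(start, ttl)])
    while queue:
        node, t = queue.popleft()
        if t == 0:
            continue                   # TTL exhausted: do not forward further
        for neighbour in network.get(node, ()):
            if best.get(neighbour, -1) < t - 1:
                best[neighbour] = t - 1
                queue.append((neighbour, t - 1))
    return set(best)

# A chain of peers 1-2-3-4-5: with ttl=2 a query from peer 1
# reaches only peers 1, 2 and 3, illustrating the limited radius.
chain = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
reached = flood(chain, 1, ttl=2)
```

This also illustrates the drawback discussed below: resources held by peers outside the TTL radius are never found, even though they exist in the network.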
Another pure P2P network is BitTorrent. BitTorrent pursues a slightly different paradigm than Gnutella: in essence, a provider of some resource can shift traffic that would otherwise be generated exclusively by itself to peers, thus avoiding the unavailability of its resources. The network uses trackers to define different kinds of resources. A client can use a tracker to download a resource and at the same time offer this resource to other peers interested in it. Like Gnutella, BitTorrent doesn't use any structured overlay to partition the resources offered by peers. This means that even in a steady state of the overlay it is not guaranteed that a peer finds what it desires, even if the resource is present in the network.
In unstructured P2P networks the following drawbacks can be identified. There may be a bottleneck when a server is used to respond to queries; scalability is limited and the server is a single point of failure if it cannot respond. In pure P2P networks, flooding causes a lot of traffic, and peers might not retrieve all the resources they query for because flooding is limited to a certain radius.
2.2 Structured P2P
Unlike unstructured P2P networks, structured P2P networks form a predefined structure. These networks make each peer responsible for some part of the overall content of all peers within the network. To this end, the peers form a so-called Distributed Hash Table (DHT). The DHT is constructed using a hash function that partitions the peers and the contents. The choice of the hash function is important for robustness and performance, because it determines how uniformly keys and peers (identified by their IP addresses) are distributed over the same identifier space. The hash function divides the search space into sections. A peer entering the network is assigned a section of the search space [0, 2^m − 1] and automatically becomes responsible for this section. The basic functionality of a DHT is the so-called lookup() function: given a key, it maps the key onto a network node. There are several implementations of DHTs that are all based on the same principles but differ in their search and partitioning strategies. Prominent examples of DHT implementations include Chord [4, 25], Pastry [13], Tapestry [3], P-Grid [16] and CAN [32] (Content Addressable Network). Most of these systems guarantee retrieval of a queried resource with complexity O(log(N)). Other operations, like adding new content or peers and handling failures, have complexity O(log(N)) or O(log²(N)). Therefore, unlike unstructured P2P networks, structured ones are highly scalable in the number of nodes. On top of such a DHT it is possible to implement any kind of P2P network.
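The partitioning idea can be sketched as follows: both content keys and peer addresses are hashed into the shared identifier space [0, 2^m − 1]. This is an illustrative sketch with assumed names, not code from any of the systems cited above.

```python
import hashlib

M = 16  # illustrative ring exponent; Chord, for instance, uses SHA-1's full 160 bits

def ring_id(value: str) -> int:
    """Hash a content key or a peer's address into the common
    identifier space [0, 2^m - 1] that the DHT partitions."""
    digest = hashlib.sha1(value.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (2 ** M)

# Peers and keys land in the same identifier space; the peer whose
# section of [0, 2^m - 1] contains a key's identifier is responsible
# for that key, and lookup(key) routes to exactly that peer.
peer_id = ring_id("10.0.0.7:6346")
key_id = ring_id("freely-available-audiobook")
```

Because the hash distributes identifiers uniformly, each peer ends up responsible for roughly the same share of keys, which is the load-balance property discussed below.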
According to Ion Stoica et al. [25], Chord reduces the complexity of designing P2P systems and applications by addressing the following problems:

Load balance By using a hash function, an even distribution of keys is naturally imposed.

Decentralization Chord is completely decentralized, without any kind of server or super node. Each node has the same responsibility, leading to a robust system.

Scalability The system scales logarithmically as it grows in the number of nodes.

Availability Even during continuous change as peers join and leave the network, Chord guarantees that the node responsible for a key can be found. Node failures are detected, so the system is self-organizing.

Flexible naming There are no constraints on the naming of keys; any kind of key can be mapped with the lookup() function.
The basic idea of Chord is the following. Chord uses consistent hashing to map keys onto the Chord ring. The hash function (SHA-1 [14]) assigns each key an m-bit identifier. To determine the identifier of a node, its IP address is hashed. Using this procedure, and maintaining several properties of m and the hash function, the computed identifiers are ordered on an identifier circle modulo 2^m. The parameter m is called the exponent of the ring, and the maximal number of participating peers is N = 2^m.

Keys are assigned to peers as follows: a key k is assigned to the first node whose identifier is equal to or follows k in the identifier space. This node is called successor(k), the successor node of k. Represented on the modulo circle, successor(k) is the first node clockwise from k. Figure 2.2 shows a Chord ring with 8 nodes and 6 keys; the keys are mapped onto their successors in the ring. For example, key S8 is mapped to peer P12, and keys S43 and S33 are mapped to peer P44.

Figure 2.2: Chord ring with 8 nodes

With this, a primitive lookup() can already be implemented. To look up a key k, a peer x asks its successor whether it is responsible for k. If so, the search is completed and the responsible peer is found. If not, y, the successor of x, routes the query to its own successor, and so on until the search is completed. This primitive lookup has a search complexity of O(N), where N is the number of nodes in the Chord ring. Figure 2.3 shows a primitive lookup() in a Chord ring: P12 is looking for key S57, and the query is routed over each node on the path around the ring; the red arrows show the successive queries.

To speed up query routing, each node in the Chord layer additionally has a routing table, called the finger table. The finger table contains a total of O(log(N)) entries; these are not only the successor and predecessor of a peer. Each peer knows a total of m other peers, and the entries of the finger table are called fingers. With this knowledge it is possible to search for a key in O(log(N)). Let p be a peer; then the i-th finger of p is the peer that follows p by at least 2^(i−1) in the ID space. Formula 2.2 shows the construction of the finger table for a peer:

x.finger[i] = successor(key_i), key_i = (x.ID + 2^(i−1)) mod 2^m, i ∈ [1, m]     (2.2)

Figure 2.4 shows a Chord ring with 8 nodes. The finger table of node P12 is displayed, and arrows illustrate the fingers of this node. If, for example, node P12 is looking for key S56, its finger table tells it that node P44 is the nearest known node for the sought-after key, so P12 redirects the query to P44. P44, by the same procedure, is able to return the result. This search strategy guarantees that any key can be retrieved with O(log(N)) messages. For a deeper insight into the mechanisms and techniques of Chord, the reader is referred to [4, 25].
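The successor mapping and finger-table routing can be sketched as follows. This is an illustrative global-knowledge model, not the thesis simulator; the class name, ring exponent and node IDs are assumptions chosen to mirror the figures.

```python
from bisect import bisect_left

M = 6  # ring exponent: identifier space [0, 2^m - 1] = [0, 63]

def in_interval(x, a, b):
    """True if x lies in the half-open ring interval (a, b]."""
    return a < x <= b if a < b else x > a or x <= b

class ChordRing:
    """Global-knowledge model of a Chord ring for counting routing hops."""

    def __init__(self, node_ids):
        self.nodes = sorted(node_ids)

    def successor(self, ident):
        """First node whose ID is equal to or follows ident clockwise."""
        i = bisect_left(self.nodes, ident % 2 ** M)
        return self.nodes[i % len(self.nodes)]

    def fingers(self, node):
        """finger[i] = successor(node + 2^(i-1)) for i = 1..m (Formula 2.2)."""
        return [self.successor(node + 2 ** (i - 1)) for i in range(1, M + 1)]

    def lookup(self, start, key, hops=0):
        """Route a query from node start towards successor(key) via fingers."""
        if start == self.successor(key):
            return start, hops         # start itself is responsible for key
        succ = self.successor(start + 1)
        if in_interval(key, start, succ):
            return succ, hops + 1      # the immediate successor owns key
        nxt = start
        for f in self.fingers(start):
            if in_interval(f, start, key):
                nxt = f                # keep the closest preceding finger
        return self.lookup(nxt, key, hops + 1)

# A ring resembling the figures: a lookup for key 56 started at node 12
# is redirected to node 44, which finds the responsible node 56 --
# two messages instead of walking every node as in the primitive lookup.
ring = ChordRing([1, 12, 20, 32, 44, 56])
owner, hops = ring.lookup(12, 56)
```

The model computes finger tables globally instead of maintaining them per node, but the routing decision at each step, namely forwarding to the closest finger preceding the key, is the one the text describes.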
Figure 2.3: Primitive lookup()
Figure 2.4: Chord ring with 8 nodes and finger table
3 Publish/Subscribe systems
A Publish/Subscribe system is based on the Publish/Subscribe paradigm, which has several ingredients: publishers, subscribers and an event notification service. Publishers are information providers that would like to publish some kind of content (HTML files, PDF files, etc.).
Subscribers, in general, are information consumers interested in content. Subscribers create subscriptions that filter out relevant content; a subscription expresses the interests of a subscriber. Subscriptions can consist of terms or predicates. A term is usually a word like football or Manchester, and a subscription can consist of a conjunction of terms like Manchester ∧ United ∧ football. A predicate is an expression like DAX ≤ 5000, and content relevant to this predicate must contain the appropriate DAX value. Unlike a simple query on a search engine, a subscription is a standing request: it remains stored until the user decides to delete it.
The event notification service is the part of the Publish/Subscribe system responsible for storing the subscriptions and filtering the relevant content. It offers an interface called subscribe() to add and store a subscription and an interface called unsubscribe() to delete a subscription. The filtering, or matching, is based on the subscriptions and the content: if, for example, a subscription is expressed with terms, only content that contains all the terms is relevant. After filtering the relevant content for a subscription, the event notification service notifies the corresponding subscriber about the event, the event being that relevant content for the subscription has been identified. The notification can be made by e-mail, by sending the content, etc.
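The subscribe()/unsubscribe() interfaces and term-conjunction matching described above can be sketched as a toy event notification service. This is illustrative only, with assumed names, and is not one of the systems studied in this thesis.

```python
class NotificationService:
    """Toy event notification service with term-based subscriptions:
    a subscription is a conjunction of terms, and published content
    matches only if it contains every term of the subscription."""

    def __init__(self):
        self.subscriptions = {}        # subscriber -> set of subscribed terms

    def subscribe(self, subscriber, terms):
        """Store a standing subscription; it remains until deleted."""
        self.subscriptions[subscriber] = set(terms)

    def unsubscribe(self, subscriber):
        """Delete the subscriber's subscription again."""
        self.subscriptions.pop(subscriber, None)

    def publish(self, content_terms):
        """Match published content against all stored subscriptions and
        return the subscribers to notify (e.g. by e-mail)."""
        content = set(content_terms)
        return sorted(s for s, t in self.subscriptions.items() if t <= content)

ens = NotificationService()
ens.subscribe("alice", ["Manchester", "United", "football"])
ens.subscribe("bob", ["DAX"])
# Content containing all three of alice's terms triggers her notification;
# bob's subscription does not match, so he is not notified.
matches = ens.publish(["football", "news", "Manchester", "United"])
```

Note that publishers never appear in the matching logic at all, which is exactly the decoupling of publishers from subscribers that the paradigm provides.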
A popular example is Google Alert [21], where news agencies like CNN or Spiegel play the role of publishers. They frequently put articles about different issues on their websites, and Google Alert indexes these articles. A user can express an interest in the form of keywords, the type of information (news, websites and user groups), the frequency of notifications and an e-mail address; this is the subscribe() interface. Google Alert also provides the appropriate unsubscribe() to delete subscriptions again.
The event notification service here is Google Alert, which is responsible for storing the subscriptions and notifying users about relevant content. Suppose a user's subscription is the words George Bush election. If CNN publishes a news story about the election and George Bush, Google Alert sends an e-mail to this user notifying them about the publication of this article. Other examples of Publish/Subscribe systems are mailing lists, sensor networks, user groups like USENET, etc.
These different shapes of Publish/Subscribe systems also imply different scenarios concerning the distribution of peers. Consider, for example, a scenario where users want to share their knowledge with each other, as in a wiki or a blog. Here, users are interested in events published by other users, or they even become publishers themselves and publish contributions like comments or new articles at a low rate. This is a user-to-user scenario: consider, for example, a blog where a user wants all new articles about the term VoIP; when a new article is published, the user is notified via e-mail. The counterpart is the publisher-to-user scenario, in which there are many subscribers and only a few publishers. The publishers publish content at a much higher rate to many more subscribers, and typically the publishers are distinct from the set of subscribers.
Figure 3.1: Overview of general Publish/Subscribe system

Figure 3.1 sketches an overview of a Publish/Subscribe system. By applying the Publish/Subscribe paradigm, the participating parties are decoupled in:
Space Publishers publish their events via the event notification service, which notifies the subscribers about matching events. Publishers do not know who has announced interest or how many entities are interested, and subscribers do not know who is publishing. The event notification service serves as a mediator between the two parties.

Time If a publisher publishes some content, the subscribers don't need to be present at the same time; from the subscriber's view, the publishers don't need to be present when the event notification service delivers the notification. The event notification service dispatches all upcoming events at the times at which consumers and producers are available.
The decoupling of publishers and subscribers reduces the complexity of applications
based on the Publish/Subscribe paradigm. It separates the process of production from
the process of consumption and makes the communication between the entities
asynchronous. This separation can be described as loose coupling and can be considered
an advantage of the paradigm.
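The decoupling in space described above can be made concrete with a minimal sketch. The class and method names below are illustrative assumptions, not part of any concrete system discussed in this thesis: the mediator stores (predicate, callback) pairs, and neither side ever learns about the other.

```python
# Minimal sketch (assumed names) of an event notification service that
# decouples publishers and subscribers in space.
class EventNotificationService:
    def __init__(self):
        self._subscriptions = []  # list of (predicate, callback) pairs

    def subscribe(self, predicate, callback):
        # The subscriber registers interest; it never learns who publishes.
        self._subscriptions.append((predicate, callback))

    def publish(self, event):
        # The publisher hands the event to the mediator; it never learns
        # who (or how many) subscribers are notified.
        for predicate, callback in self._subscriptions:
            if predicate(event):
                callback(event)

service = EventNotificationService()
received = []
service.subscribe(lambda e: "VoIP" in e["terms"], received.append)
service.publish({"terms": {"VoIP", "article"}})   # matches, delivered
service.publish({"terms": {"Sports"}})            # no match, dropped
```

Decoupling in time would additionally require the mediator to buffer events for subscribers that are temporarily offline; that is omitted here for brevity.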
In addition to the events and notifications already introduced, publishers may be able to
advertise upcoming events. The event notification service can offer an interface called
advertise() for this purpose. Subscribers can use the advertisements to learn
about further upcoming events. In addition, the advertisements help the
event notification service to adapt to future subscriptions and to optimize its opera-
tion.
Publish/Subscribe systems can be categorized into different subscription schemes. Each
of those schemes comes with its own functionality and algorithms. Many research
groups contribute optimized algorithms, making Publish/Subscribe a hot research
topic. The remainder of this chapter presents the different subscription schemes.
3.1 Topic-based Publish/Subscribe
Topic-based Publish/Subscribe systems implement one of the earliest subscription schemes
applying the Publish/Subscribe paradigm. In this scheme, subscribers announce
subscriptions in the form of topics (e.g. Sports, Politics). Topics can be compared
to interest groups consisting of all entities that would like to receive similar content. The
event notification service can be considered a broker that disseminates notifica-
tions through a communication channel to all matching subscribers. Each subscriber can
choose some topics to subscribe to and only receives notifications about documents
of its topics of interest. This kind of group communication is widely used in several
applications; mailing lists and USENET are prominent and popular representatives of
interest groups. Because subscribers subscribe to individual topics at the event
notifier, topics need to be predefined before they can be subscribed to. This can simply be
done by defining several topics by hand for the system in question. Another inter-
esting approach is the automatic extraction of topics from the resources to be published. Topic
extraction is a research area of its own and out of the scope of this thesis.
Besides topics in a simple, flat hierarchy, there are concepts extending the scheme to structure the
topic space and make it more accessible to subscribers and publishers. The introduction
of hierarchies allows publishers and subscribers to formulate their events and
subscriptions in terms of topics and subtopics. A subscription made to some
topic or subtopic then includes all topics that belong below that topic node in the hierarchy.
USENET news uses this kind of hierarchical addressing. Figure 3.2 shows the principle
of hierarchical addressing. Subscriber A subscribes to the topic Football and receives all
events that belong to the topic Football or one of its subtopics (like a particular team).
Subscriber B, however, is interested in all events from the topic Sports and therefore
receives, in addition to all Football events, also those from Sports in general.
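The hierarchical matching rule just described can be sketched in a few lines. The dot-separated topic paths are an assumed encoding (USENET-style), not prescribed by the scheme itself:

```python
# Hierarchical topic addressing: a subscription to a topic node matches
# events published on that node or on any subtopic below it.
def topic_matches(subscribed_topic: str, event_topic: str) -> bool:
    # Topics are dot-separated paths, e.g. "Sports.Football.TeamX" (assumed encoding).
    return (event_topic == subscribed_topic
            or event_topic.startswith(subscribed_topic + "."))

# Subscriber A subscribes to "Sports.Football", subscriber B to "Sports".
print(topic_matches("Sports.Football", "Sports.Football.TeamX"))  # True: A is notified
print(topic_matches("Sports.Football", "Sports.Tennis"))          # False: A is not
print(topic_matches("Sports", "Sports.Tennis"))                   # True: B receives all Sports events
```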
3.2 Content-based Publish/Subscribe
Figure 3.2: Hierarchical addressing in topic-based Publish/Subscribe

Content-based Publish/Subscribe is far more general than the topic-based scheme. In
essence, this scheme offers more flexibility for matching new content against the subscrip-
tions stored in the event notification service. The contents published by the publishers
are inspected one by one and matched against the subscriptions in the
notification service. This scheme can therefore be considered a message-by-message ap-
proach, where the content of the messages is the basis of the filtering. Thus, this concept
is much more flexible compared to the topic-based scheme. Users can subscribe by using
special subscription languages. Subscriptions can be combined in several ways to form
more complex ones, so-called subscription patterns. Operators typically used in such sub-
scription languages are <, ≤, =, >, ≥. The content-based scheme is thus not suitable for
plain-text filtering but rather for filtering on the basis of predicates. The subscription
patterns are matched and the subscribers are notified. Compared to the topic-based
scheme, the content-based scheme is more expressive, but this freedom of choice is paid
for with more storage in the event notification service: it leads to
a tremendous number of distinct subscription patterns. This enforces
the usage of efficient matching algorithms and is clearly a disadvantage of content-based
Publish/Subscribe.
Consider the following example: the event notification service is deployed on a stock
market and the stocks themselves are regarded as publishers. Whenever some stock
changes, it emits the new value to the event notification service, which matches the
changed value against the standing subscription patterns and notifies all subscribers
about the changes. Figure 3.3 illustrates content-based Publish/Subscribe in
combination with stock markets. Subscriber A has subscribed for the event of the
DAX being less than or equal to 5000, and Subscriber B for the event of the NASDAQ being less
than or equal to 2500. When the DAX emits a new event that its current value is 4800, the event
notification service only notifies Subscriber A, because Subscriber B is not interested
in that event. More sophisticated examples can be constructed by combining patterns.
Figure 3.3: Content-based Publish/Subscribe in Stock markets

The subscription language can be processed using algorithms to optimize the matching
of subscriptions and contents. Subscriptions could be expressed using SQL or XML and
then be processed by methods such as XPath. XFilter, which uses XPath [18, 26], is a
method capable of efficiently filtering XML documents and disseminating selective in-
formation from them. Other methods are template objects using JavaSpaces [10] and
executable code [19].
3.3 Term-based Publish/Subscribe
The term-based Publish/Subscribe scheme is a restricted variant of content-based Pub-
lish/Subscribe. In content-based Publish/Subscribe the basis is the full content, which
is observed before the subscribers are notified by the event notification service. Term-
based Publish/Subscribe instead focuses on terms and combinations of them: there are no
operators like <, ≤, =, >, ≥, but only conjunctions of terms.
In the term-based scheme the subscription patterns are therefore usually less sophisti-
cated. The contents are documents consisting of terms and phrases, and the subscription
patterns are predicates consisting of terms. A subscriber is able - like in a search engine
- to subscribe for terms that are contained in the content. Consider for example an
article shipped in the form of HTML or PDF and a subscriber that has stored a
subscription matching the contents of the article. The article is then delivered
to the subscriber.
The atomic part of a subscription pattern is a term. Terms can be combined to form
predicates of arbitrary size, like George ∧ Bush ∧ president ∧ USA. Only docu-
ments that contain all of these terms match the subscription pattern and are
disseminated to the subscribers.
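In terms of the notation introduced later (Table 5.1), a term-based subscription matches a document exactly when the subscription's terms are a subset of the document's terms. A minimal sketch:

```python
# Term-based matching: a subscription is a conjunction of terms and
# matches a document iff all of its terms occur in the document.
def term_matches(subscription: set, document: set) -> bool:
    return subscription <= document  # subset test

doc = {"George", "Bush", "president", "USA", "visit"}
print(term_matches({"George", "Bush", "president", "USA"}, doc))  # True
print(term_matches({"George", "Bush", "Germany"}, doc))           # False
```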
A part of the analysis of different Publish/Subscribe schemes performed in this thesis is
also based on term-based Publish/Subscribe.
3.4 Type-based Publish/Subscribe
The type-based Publish/Subscribe scheme [20], besides analyzing the content of incoming
notifications, makes use of additional mechanisms to better classify resources and op-
timize event notification services. Type-based Publish/Subscribe introduces the concept
of types into the system. Content that was previously classified only by the name of its
topic can now additionally be classified by its type. Using types in
Publish/Subscribe links objects of the underlying programming lan-
guage with the types of events in the event notification service. A further effect is that
type-safety can be ensured at compile time by parameterizing the resulting abstraction interface
with the type of the corresponding events. This type checking, together with further
content-based refinements, ensures that the Publish/Subscribe system can match sub-
scriptions against content more efficiently. Figure 3.4 shows an example: the
stock events are split into two distinct types, stock quotes and stock requests.
Publishers can publish stock quotes, and brokers can subscribe for stock requests,
expressing their interest in buying stocks within a possible price range.
Figure 3.4: Type-based Publish/Subscribe
Eugster et al. [20] give an in-depth description and discussion of type-based Publish/Subscribe.
3.5 Current Publish/Subscribe systems
This section briefly presents some systems implementing the Publish/Subscribe pattern.
Publish/Subscribe systems exist in a great variety, ranging from research prototypes to
commercial implementations.
3.5.1 Java Message Service
The Java Message Service (JMS) is part of the Java 2 Enterprise Edition API and can be used to
implement applications based on the Publish/Subscribe model or on point-to-point queuing. JMS
is a very flexible framework that enables loosely coupled, reliable and asynchronous
distributed communication. JMS uses topic-based subscrip-
tions, grouping messages together into abstract topics. One information
provider can send a message to many consumers through predefined virtual chan-
nels, i.e. the topics. Additionally, some content-based filtering can be applied
on message headers and fields. In JMS, topics are predefined by information providers,
the so-called JMS providers. They define which content belongs to a specific topic and
which does not. Topics can also be structured using hierarchy patterns.
A lot of commercial providers as well as open-source and academic projects imple-
ment the JMS API, making it a promising and popular messaging API.
3.5.2 Cambridge Event Architecture
The Cambridge Event Architecture (CEA) [27] is a research prototype developed at Cam-
bridge University in the early 1990s. CEA was implemented to address the problem
of asynchronous communication for applications in sensor environments and extends
the object-oriented middleware CORBA¹ with a "publish, register, notify" paradigm.
In the context of CEA there are event producers (= information providers) and event
consumers (= information consumers). In contrast to CORBA, event producers in CEA
can register the events they produce with a name service; this can be considered the
advertising mechanism. In CORBA, event consumers register directly at event producers,
which creates a tight coupling between event consumers and event producers. CEA
instead introduces a mediator between the two entities to decouple them in time and
space. This so-called event mediator is a CORBA entity that implements the producer and
consumer interfaces. By providing typed events, CEA offers a type-based Pub-
lish/Subscribe system but lacks the capabilities of content-based Publish/Subscribe. An
interesting property of CEA is that it provides a mechanism to subscribe for composite
event patterns, called event correlation: event consumers can subscribe for the occurrence of
several events and are only notified if all of the subscribed events occur.

¹CORBA defines APIs, a communication protocol, and object/service information models to enable heterogeneous applications, written in various languages and running on various platforms, to interoperate.
4 Distributed
Publish/Subscribe-Algorithms
Publish/Subscribe systems arise in many applications. Existing centralized implemen-
tations promise that a potentially unlimited number of subscribers
can participate. The advantage of the centralized approach is that the system has a com-
plete overview of all subscriptions and of all notifications of the publishers, and
efficient algorithms can be implemented on this physically global directory. The users of such
a system, however, are usually autonomous entities that join or leave the system arbitrarily and
provide and consume content independently. This inevitably leads to a
bottleneck when the number of users grows and therefore the number of notifications
and subscriptions grows, since many notifications have to be forwarded to the appropriate
subscribers. This scalability and reliability issue can be tackled by implementing Pub-
lish/Subscribe systems in a decentralized fashion, i.e. as a Publish/Subscribe system on
top of a P2P network. Especially in ubiquitous and pervasive [33]
environments, deploying Publish/Subscribe systems in a decentralized manner
is the best choice.
P2P systems have emerged as a very popular technique for exchanging information
in wide-area networks that span the whole world. Advanced P2P systems use struc-
tured overlays to provide more efficient lookups. The most popular structured P2P
networks are DHT implementations which are completely decentralized, scalable, self-
organizing and reliable. DHT implementations have been explained in Section 2.2. A
Publish/Subscribe system can be implemented straightforwardly on top of that over-
lay. There are several approaches to implement the Publish/Subscribe paradigm in a
decentralized fashion. Popular examples include Meghdoot [17] and Scribe [29].
4.1 Scribe
Scribe is an application-level multicast infrastructure which supports a large number of
groups with a large number of members per group. It is built on top of Pastry [13]
and leverages Pastry's scalability, locality and self-organization properties. Any peer in
Scribe can create a topic and other peers can subscribe to that topic. A peer with the
appropriate credentials can publish an event, and Scribe routes the event to the
corresponding subscribers. Scribe provides a simple API with the following functions:
• create(credentials, topicId)
• subscribe(credentials, topicId, eventHandler)
• unsubscribe(credentials, topicId)
• publish(credentials, topicId, event)
The event handler passed to the method subscribe handles each incoming
event routed to the subscriber. The Scribe node/peer numerically closest to the unique
topicId is the so-called rendezvous point for the topic. The rendezvous point is the root
of a multicast tree created for the topic. Further details on Scribe and its functionality
can be found in [29].
4.2 Meghdoot
Meghdoot is a P2P-based Publish/Subscribe system. It uses a structured distributed
hash table to determine where subscriptions are stored and how events are routed to
the subscriptions. Meghdoot implements the content-based Publish/Subscribe scheme
known from Section 3.2 and additionally deals with the problem of load balancing: by
applying alternative methods, the authors show that even with highly skewed datasets no
peer is excessively loaded.
The system is described by a set of attributes; the schema of a system can be
written as S = {A1, A2, . . . , An}. A subscription in Meghdoot is a conjunction of
predicates over a set of attributes. A predicate defines a constant value (with =) or a
range (using <, ≤, ≥, >) for an attribute. An event is a set of equalities rep-
resented as e = {A1 = c1, A2 = c2, . . . , An = cn}. An event e matches a subscription
s if each predicate of s is satisfied by the corresponding attribute value in the event.
Given the schema S = {A1, A2, . . . , An}, Meghdoot constructs a 2n-dimensional Carte-
sian space to maintain the DHT. This space is partitioned among the participating peers,
and storing a subscription has a complexity of O(d · N^(1/d)), where N is the number of
peers and d the dimension of the Cartesian space. An incoming event in Meghdoot is
matched against subscriptions by applying a routing algorithm within the constructed
2n-dimensional space. For further details the reader is referred to [17].
The Meghdoot system can be compared to one of our proposed approaches in chapter 5.
Meghdoot provides the content-based Publish/Subscribe scheme and stores the subscrip-
tions split across some peers of the network. For our purposes, the system schema S
can be the set of all terms/words used in a content-based Publish/Subscribe scheme. Obvi-
ously this set is huge, and the 2n-dimensional Cartesian space will clearly lead to
performance problems. The possibility to state range queries offers great flexibility
concerning the subscriptions. The problem of the design is that the system is not flexible
when it comes to adding new terms/words: the system needs to be reconstructed,
as the 2n-dimensional space is built from the system schema. Therefore, Meghdoot is
a good candidate when the schema does not change. Store-Sub stores subscriptions at
peers, too. Unlike Meghdoot, it does not offer range-based search, but the number
of terms can change arbitrarily without reconstructing the whole system, which makes it
much more flexible. The next chapter introduces the approaches and explains them
in detail.
5 Introduction to Store-Sub and
Store-Pub
This chapter introduces two approaches to realize Publish/Subscribe on top of a structured
P2P network with a DHT. The approaches differ in what information they store
in the conceptually global directory: either the subscribers store their subscriptions in
the directory, or the publishers store statistics about their content in the directory. Both
approaches have their strengths and weaknesses, which depend on the usage scenario. In
the remainder of this chapter we present these two approaches in more detail.
5.1 Terminology
To explain the two approaches, we first need to introduce some terminology. The P2P
network consists of several peers, which we also call directory peers. The directory peers
use the lookup() function provided by the DHT: given a key k, the directory maps the
key to the ID of the directory peer responsible for maintaining k. The directory
peer stores some information for the key k and remains responsible for it until the network
structure changes. Each peer wishing to add or retrieve the information for a key
calls the lookup(key) function and obtains the peer currently responsible for maintaining
the value of the key. After that, the peer can directly contact the respective peer via a
point-to-point connection and retrieve or add the desired information.
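The mapping performed by lookup() can be illustrated with a small sketch. The hashing scheme and the ring assignment below are illustrative assumptions in the style of consistent hashing, not the concrete DHT implementation used later:

```python
# Illustrative sketch of lookup(): a key is hashed into the identifier space
# and assigned to the next peer at or after it on the ring (wrapping around).
import hashlib
from bisect import bisect_left

def key_id(key: str, bits: int = 16) -> int:
    # Hash the key into a bounded identifier space (assumed scheme).
    return int(hashlib.sha1(key.encode()).hexdigest(), 16) % (2 ** bits)

def lookup(key: str, peer_ids: list) -> int:
    # Return the ID of the directory peer responsible for the key.
    ids = sorted(peer_ids)
    i = bisect_left(ids, key_id(key))
    return ids[i % len(ids)]  # wrap around at the end of the ring

peers = [17, 45, 76, 1200, 30000, 51000]
responsible = lookup("sport", peers)
print(responsible in peers)  # True: exactly one peer is responsible for the key
```

The essential properties used throughout this chapter are that lookup() is deterministic (the same key always maps to the same peer, as long as the network structure is unchanged) and that every key has exactly one responsible peer.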
The set of all directory peers is N = {n1, n2, . . . , nk}. There are subscribers S =
{s1, s2, . . . , sn} and publishers P = {p1, p2, . . . , pm}.
Subscriber si expresses its interest by issuing a subscription subi,j, which denotes the j-th
subscription of subscriber si. A subscription consists of a subset of terms, i.e. subi,j ⊂ T,
where T is the set of all terms. Each subscriber can have more than one subscription.
For example, si can have the following subscriptions:
• subi,1 = {Britney, Spears, concert, New, York} and
• subi,2 = {George, Bush, visit, Germany}
The average number of subscriptions per subscriber is denoted by subavg and each sub-
scription has an average number of terms denoted by sub.
Publishers frequently publish new documents at some specific publishing rate r. Each
document consists of a number of terms t ∈ T, and documents are of the form d ∈ D
where D ⊂ 2^T. A subscription subi,j matches a document d if all terms of the sub-
scription are contained in the document, i.e. subi,j ⊂ d. In this case the corresponding
subscriber is notified. Additionally, each document d belongs to a topic c = topic(d).
Table 5.1 gives an overview of the notation used.
si ∈ S       subscriber from the set of all subscribers
pi ∈ P       publisher from the set of all publishers
ni ∈ N       node (peer) from the P2P network
t ∈ T        term t from the set of all terms
c ∈ C        topic c from the set of all topics
subi,j       j-th subscription of subscriber si
subavg       average number of subscriptions per subscriber
sub          average number of terms per subscription
d ∈ D        a document from the set of all documents
|d|          number of terms per document
r            the number of documents published per round
|T|          the number of all terms within the simulation
|C|          the number of all topics within the simulation
topic(d)     mapping from document d to topic c ∈ C

Table 5.1: Notation

In the analysis and simulation part all subscriptions are formulated in terms, i.e.
content-based Publish/Subscribe is applied in a simple manner. It is possible to use
more complicated subscription patterns, but the focus is on measuring and analyzing a
typical Publish/Subscribe system that imitates a news network or user groups.
5.2 Store-Sub
Store-Sub is the method most commonly used when implementing Publish/Subscribe.
It pursues the intuitive way of storing the subscriptions directly in the distributed di-
rectory. The data structure is conceptually global, but physically distributed over all
participating peers.
Store-Sub implements the following approach. It gives subscribers the possibility to
subscribe for their interests and to store these subscriptions in the globally available DHT. Each sub-
scriber can store as many subscriptions as it likes and can also delete them. The procedure
is as follows: each term in a subscription can be mapped by the lookup() function to
a nodeID. In more detail, a subscription subi,j containing the terms {a, b, c} is stored at
the directory peers responsible for the terms a, b, c, respectively. The directory peers store
the term and the whole subscription together with the nodeID of the subscribing peer.
If, for example, the term a is contained in more than one subscription, each subscription
containing a is attached at the directory peer responsible for a. The directory
therefore contains terms, the nodeIDs of the subscribing peers and the subscriptions. The
topic-based scheme is explained in Section 5.2.1.
The publishers frequently want to publish content in the form of documents. In the term-
based scheme, the publishers only look at the pure content of their documents.
Documents are sets of terms (see Table 5.1), and each publisher calls the lookup()
function for all terms of all its documents to retrieve the subscribing peers. Suppose there
is a subscription subi,j = {a, b, c} and document d contains these terms, i.e. {a, b, c} ⊂ d.
The publisher retrieves all subscriptions for document d and can check whether each
subscription is satisfied completely. After matching, the publisher can disseminate the
content and be sure that only subscribing peers receive the desired content.
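The term-based Store-Sub flow can be sketched as follows. The dictionary standing in for the distributed directory and the helper names are assumptions for illustration; in the real system each directory access goes through lookup():

```python
# Sketch of term-based Store-Sub: a subscription {a, b, c} is stored at the
# directory peers responsible for a, b and c; a publisher retrieves the
# candidate subscriptions for each term of its document and keeps only
# those that are satisfied completely.
from collections import defaultdict

directory = defaultdict(list)  # term -> [(subscriber_id, full subscription)]

def store_subscription(subscriber_id, terms):
    for term in terms:  # one lookup() per subscription term in the real system
        directory[term].append((subscriber_id, frozenset(terms)))

def publish(document):
    candidates = set()
    for term in document:  # one lookup() per document term
        candidates.update(directory[term])
    # keep only subscriptions that are matched completely (sub ⊆ d)
    return {sid for sid, sub in candidates if sub <= document}

store_subscription("S42", {"a", "b", "c"})
store_subscription("S17", {"a", "x"})
print(publish({"a", "b", "c", "d"}))  # {'S42'}: S17's term 'x' is missing
```

The sketch also makes the bottleneck discussed below visible: publish() touches the directory once per document term, and each of those lists can be large.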
Figure 5.1 illustrates the data structure of the Store-Sub infrastructure.

Figure 5.1: Overview Store-Sub

There are three directory peers N17, N45 and N76. Directory peer N17 is responsible for the topic
sport, N45 for comedy and politics, and N76 for music. Each of the peers has already stored some
subscriptions in the data structure. The subscriptions are stored by calling the lookup()
function on the topic and obtaining the nodeID as result; at this node, the subscriptions
are stored. Each publisher can retrieve these lists by the same procedure. Imagine
a publisher wants all subscribers for the topic sport. Calling lookup(), it receives the
list {S42, S17, . . . } and can match it against its own content. The figure also shows
incoming subscriptions from S14, S97 and S37. Note that the figure only illustrates the topic-
based Publish/Subscribe scheme; in term-based Publish/Subscribe, terms are stored at
the directory peers. In both cases the lookup() function returns the nodeID of the responsible peer,
and subscribers as well as publishers have access to the directory.
The advantage of Store-Sub in general is that subscriptions are always matched when con-
tent is available: as soon as a publisher has content that matches a subscription,
the content will be delivered regardless of when the subscription was stored. The
bottlenecks of Store-Sub are obvious: publishers need to retrieve possibly huge lists of
subscribers, because they have to retrieve the subscriptions for all terms of their docu-
ments in order to make sure that all subscriptions are matched. Especially in scenarios
with a high publishing rate r, the directory will be stressed. A typical way to cope with
this is to reduce the number of lookups by introducing a smaller set of topics to which the
terms belong.
5.2.1 Topic-based Store-Sub
In the topic-based scheme we modify the pure term-based scheme: now, each document and
each subscription belongs to one topic.
Let subi,j = {x, y, z} be the subscription of a subscriber and c the topic of that subscription.
The subscriber calls lookup(c) to retrieve the nodeID of the directory peer responsible
for topic c. Then it adds the subscription and its nodeID to the directory making it
available to the publishers.
If a publisher now wants to disseminate a document d with topic c = topic(d), it
calls lookup(c) and retrieves the nodeID of the corresponding directory
peer. By contacting the directory peer, the publisher retrieves all subscriptions for
topic c. The publisher then proceeds as in the term-based scheme: it matches the
subscriptions against document d and finally notifies the appropriate subscribers.
The introduction of topics extends the pure term-based Publish/Subscribe to a mixture
of the term-based and topic-based schemes. The advantages are that each subscription is stored
exactly once in the directory and that the publishers do not need to call lookup() for each
term of the document, but only for the topic c.
5.3 Store-Pub
Store-Pub implements a different approach than Store-Sub. Instead of subscribers
storing subscriptions in the distributed overlay on top of the DHT, publishers store statis-
tics about the contents of their documents in the directory. Subscribers can use these
statistics to subscribe directly at the publishers afterwards.
Each publisher pi announces its profile in the form of statistics in the DHT. For example, if
a publisher has some documents for a topic - say sport - it announces the number
of documents for that topic. As in Store-Sub, there are two
schemes for storing statistics in Store-Pub. Using the term-based scheme, each publisher stores
statistics in the directory for each term occurring in its documents: it calls lookup() for a
term, retrieves the nodeID of the directory peer and stores the statistics there. This is
done for all terms contained in all documents and leads to a detailed description of
the publisher's content within the DHT. The topic-based scheme is described in Section 5.3.1.
Subscribers use the information in the directory to decide at which publishers they are
going to subscribe. Instead of subscribing at the directory, they call the lookup() function
for each term of their subscriptions and retrieve a list of statistics of all publishers.
With these publisher lists they can decide at which publishers to subscribe, i.e. they
use the statistics to determine which publishers best fit their needs and can
choose a subset of the publishers for their subscriptions. Then they subscribe directly
at those publishers.
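This selection step can be sketched as follows. The statistics format (per-publisher document counts per term) and the additive scoring rule are assumptions for illustration; the approach itself does not fix how subscribers rank publishers:

```python
# Sketch of a Store-Pub subscriber choosing publishers: for every term of
# its subscription it retrieves the per-publisher statistics from the
# directory and subscribes at the top-k publishers (assumed ranking rule).
def choose_publishers(subscription, directory, k=2):
    scores = {}
    for term in subscription:  # one lookup() per subscription term
        # directory[term]: {publisher_id: number of documents containing term}
        for publisher, count in directory.get(term, {}).items():
            scores[publisher] = scores.get(publisher, 0) + count
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k]

directory = {
    "sport":  {"P12": 40, "P97": 5},
    "tennis": {"P12": 10, "P23": 3},
}
print(choose_publishers({"sport", "tennis"}, directory, k=1))  # ['P12']
```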
Figure 5.2 illustrates three directory peers N17, N45 and N76. The arrows show in-
coming posts for the topics sport, comedy and music from publishers P12, P97 and P23,
respectively. Each directory peer stores all statistics for some topics. In the example,
N17 is responsible for the topic sport, N45 for the topics comedy and politics, and N76 for the topic
music. In the term-based scenario, statistics are only stored about terms and there are
no topics.
Figure 5.2: Overview Store-Pub
Store-Pub has the advantage that the subscribers have full control over which publishers
they contact. They can decide at how many publishers they subscribe and thereby restrict the
quantity of notifications they receive. It is also not possible for interested parties to do
social engineering, because the subscriptions are not publicly available. The main
disadvantage of this method is that the information stored in the directory ages; hence
after some time it is necessary to repost the statistics, and subscribers therefore have to
choose their favorites again. Another disadvantage is that this approach does not guarantee the
delivery of the desired information: with high probability not all matching content
is sent to the subscribers.
Altogether, each of the two methods has its advantages and disadvantages. Chapter
6 analyzes Store-Sub and Store-Pub in depth and explains why one method
behaves better than the other in certain situations. This kind of analysis can be used
to evaluate deployed systems and may help to decide which system to choose for a given
kind of application.
5.3.1 Topic-based Store-Pub
In the topic-based scheme, the publishers store statistics for all the topics of their docu-
ments at the directory peers. There is no information about individual terms and their
statistics. For example, the statistics now contain information about how many docu-
ments a publisher has for a topic; again, more sophisticated statistics are possible.
Subscribers call the lookup() function for the topics they are interested in and retrieve
the nodeID of the directory peer responsible for the topic. They can then retrieve the
list of statistics and decide at which subset of publishers they subscribe directly.
The problem with the topic-based scheme is that the list of subscriptions for a topic can
be huge. A possible improvement to avoid retrieving these long lists is the following:
when a publisher wants to disseminate its documents, it sends them to the direc-
tory peer that is responsible for the topic of the documents. The directory peer then
either matches the documents against the stored subscriptions and sends only the
matching subscribers back to the publisher, or it disseminates the documents itself to
the matching subscribers. This method notably reduces the generated traffic, but not
the number of generated messages.
6 Analysis of Store-Sub and Store-Pub
This chapter analyzes the approaches Store-Sub and Store-Pub. The first section con-
tains the formulas that can be derived from the approaches introduced in chapter 5. The
second part shows applications of the formulas: they are used either to analyze
an approach when changing the system parameters or to compare the approaches with
each other. The comparison is based on measuring the messages generated by each
approach, as messages stress the distributed directory. The analysis of the number of
messages is crucial because it has an impact on the latency and the reliability of the
directory. The traffic does not have the same impact, because the transfer of data
is usually accomplished by a direct connection between the participating peers, while the
messages are necessary to look up a directory peer. The comparison is useful for giving rec-
ommendations on which approach fits better in which usage scenario, i.e. user-to-user or
publisher-to-user. The mathematical analysis uses the formulas presented in this chapter,
and chapter 8 presents simulation experiments with a dynamically changing system to
approximate the message generation of a real system.
6.1 Formulas
This section outlines the formulas derived from the approaches and explains them step
by step. The presented formulas generally can forecast two values:
• the messages generated and
• the traffic generated
by an approach. Additionally, each value can be split into two parts: the first part can
be expressed as storing the information in the DHT and the second one as retrieving
the information from the DHT. In each case - Store-Sub and Store-Pub - the information
available in the DHT is different.
The two approaches are implemented on top of a DHT. To forecast the messages and
the traffic of the approaches, it is necessary to keep the message complexity of the DHT
in mind. Therefore all formulas contain the complexity factor log2(|N|), where N is
the network of nodes and |N| is the number of nodes. In essence, whenever a node
sends a message to another node within the network, there are log2(|N|) message hops
due to the DHT implementation.
6.1.1 Formulas Store-Sub
In Store-Sub the complexity of messages and traffic consists of two parts:
• Subscribers that join the network need to announce their subscriptions
• Publishers need to retrieve the subscriptions for the terms or topics in order to
disseminate their contents
According to the different schemes explained in chapter 3, the formulas can be used to
measure the traffic and messages of the following schemes:
• term-based Publish/Subscribe
• topic-based Publish/Subscribe
• topic-based Publish/Subscribe with performance optimizations
The two main schemes, term-based and topic-based, also imply different message and
traffic generation. Consider the term-based scheme: if a subscriber would like to announce
its subscription, it needs to add the subscription at several directory peers. For each
term of the subscription, the subscriber needs to identify the responsible directory peer
and add the subscription there. In the topic-based scheme, by contrast, the subscription
only needs to be added to one directory peer: the one responsible for the topic of
the subscription. To reflect these schemes, there are two values in the formulas, fs and
fp. In the term-based scheme fs = sub, because the subscription has to be
sent to one peer for every term in the subscription, whereas in the topic-based scheme
fs = 1, because each subscription belongs to exactly one topic. The value fp also reflects
the difference between the term-based and topic-based scheme. Consider a publisher that
needs to retrieve all subscribers for the terms of its document: in the term-based scheme
fp = |D|, because the publisher has to look up all possibly interested subscribers per
document term. In the topic-based scheme, on the other hand, fp = 1, because each
document belongs to one topic.
The following formulas show the messages and the traffic generated in Store-Sub.
M1 = O(|Snew| ∗ subavg ∗ fs ∗ log(|N|))    (6.1)

M2 = O(|P| ∗ r ∗ fp ∗ log(|N|))    (6.2)

T1 = O(sizet1avg ∗ |Snew| ∗ subavg ∗ fs ∗ log(|N|))    (6.3)

T2 = O(sizet2avg ∗ |P| ∗ r ∗ fp ∗ (log(|N|) + listlength))    (6.4)

where

listlength = (|S| ∗ subavg ∗ sub) / |T|    (6.5)

or

listlength = (|S| ∗ subavg) / |C|    (6.6)
Formula 6.1 represents the messages that are generated by the subscribers joining the
network; M1 thus belongs to the part of the messages that store information in the DHT.
The joining subscribers need to announce their interests in the form of subscriptions.
Each subscriber therefore generates messages for each of its subscriptions when adding
it to the directory. The number of messages per subscription is expressed by fs, which
changes depending on the scheme. Each of these messages causes O(log(|N|)) message
hops in the DHT.
Formula 6.2 shows the messages that are generated by the publishers disseminating
their documents, so M2 is the part of the messages that retrieves information from the
DHT. For each document, a publisher needs to know all potentially matching subscribers.
Therefore each publisher generates a number of messages per document. The number of
documents per publisher is denoted by the publishing rate r. The number of messages
per document depends on the scheme used and is expressed by the value fp. Additionally,
each message causes O(log(|N|)) message hops in the DHT.
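To make formulas 6.1 and 6.2 concrete, their dominant terms can be evaluated in code. The following Java sketch drops the O-notation, so the results are proportional estimates rather than exact message counts; the class and method names are illustrative, and any concrete value of |N| plugged in is an assumption, since the network size is a free parameter.

```java
// Illustrative evaluation of the dominant terms of formulas 6.1 and 6.2
// for Store-Sub. Constants hidden by the O-notation are dropped.
public class StoreSubMessages {
    // M1: messages for |Snew| new subscribers storing subAvg subscriptions
    // each, with fs directory insertions per subscription.
    public static double m1(int sNew, double subAvg, double fs, int n) {
        return sNew * subAvg * fs * log2(n);
    }

    // M2: messages for |P| publishers retrieving subscriber lists for
    // r documents each, with fp lookups per document.
    public static double m2(int p, int r, double fp, int n) {
        return p * r * fp * log2(n);
    }

    // DHT hop factor log2(|N|), via change of base.
    static double log2(int n) {
        return Math.log(n) / Math.log(2);
    }
}
```

For example, with |Snew| = 10, subavg = 3, fs = sub = 5 (term-based) and an assumed network of |N| = 1024 nodes, m1 yields 10 ∗ 3 ∗ 5 ∗ 10 = 1500 message hops.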
Formula 6.3 reflects the traffic generated by the subscribers joining the network. It
depends on the same values as the number of messages in formula 6.1, but contains an
additional value, sizet1avg, which approximates the average size of such a message. This
value can be estimated from the simulation parameters: taking the average number of
terms per subscription (subavg) and the average length of each term (which is
language-dependent) yields an approximate value for sizet1avg. In essence,

sizet1avg = subavg ∗ lengthavg(term)    (6.7)
Formula 6.4 is slightly more complicated. It represents the traffic generated by the
publishers to disseminate the documents. It consists of the same values as formula 6.2,
but the parameter listlength varies depending on the scheme: the value of listlength for
the term-based scheme is shown in formula 6.5, and the value for the topic-based scheme
without performance optimization in formula 6.6. The performance optimization changes
listlength either to 0, if the directory peers take care of the matching and the dissemination
of all documents, or to some value matchingSubscribers(D), if the directory peer sends
back the subscribers that are candidates for dissemination. The latter value is hard to
approximate, because the behaviour of the subscribers and the documents would have
to be modeled well. The same is true for the value sizet2avg, which represents the
average size of a message in this context.
6.1.2 Formulas Store-Pub
In Store-Pub the complexity of messages and traffic consists of two parts:
• Publishers need to announce their statistics
• Subscribers have to retrieve the announced statistics to find publishers that fit
their needs
As in Section 6.1.1, the formulas measure the messages and the traffic of the following
schemes:
• term-based Publish/Subscribe
• topic-based Publish/Subscribe
Because two different schemes are used, the formulas need to reflect the differences
between them. There are two values to cope with these differences. The first one is the
feature space F: in the term-based scheme this feature space is the set of all terms in
the simulation, and in the topic-based scheme it is the set of all topics. Hence, |F| is
the number of elements in this set. The second one is fs, which equals sub in the
term-based scheme and 1 in the topic-based one.
Compared to Store-Sub, it is not necessary to optimize the performance of the
dissemination, because Store-Pub is based on a different approach. But it is necessary
to cope with the problem of aging statistics in the directory: within a given time interval,
each publisher has to refresh its statistics and each subscriber has to check whether there
are better publishers fitting its needs. The variable that reflects this circumstance is
interval.
M1 = O((|Pnew| + |P| / interval) ∗ |F| ∗ log(|N|))    (6.8)

M2 = O((|Snew| + |S| / interval) ∗ subavg ∗ fs ∗ log(|N|))    (6.9)

T1 = O(sizet3avg ∗ (|Pnew| + |P| / interval) ∗ |F| ∗ log(|N|))    (6.10)

T2 = O(sizet4avg ∗ (|Snew| + |S| / interval) ∗ subavg ∗ fs ∗ log(|N|))    (6.11)
Formula 6.8 represents the number of messages generated by the publishers publishing
their content. The formula distinguishes between new publishers, Pnew, and existing
ones, P. New publishers naturally have to publish all their statistics, because
they did not participate in the network before. Already participating publishers only
have to republish their statistics in some fixed interval. |F| denotes the number of features
to be published, i.e. terms in term-based Publish/Subscribe or topics in topic-based
Publish/Subscribe. Altogether, formula 6.8 is the quantity of messages needed to store
the information in the DHT.
Formula 6.9 shows the number of messages needed to retrieve promising publishers; it
represents the retrieval part of the information stored in the DHT. Like the previous
formula, this one also distinguishes between new subscribers and existing ones. New
subscribers have to choose their promising publishers based on their subscriptions.
Existing subscribers have to re-select publishers, because some of their previously chosen
publishers may have changed their statistics, so that another publisher is now the better
choice. Because all publishers and subscribers refresh their information, both are more
likely to achieve good results.
Formula 6.10 equals 6.8 except for the factor sizet3avg, which describes the size of one
message in the context of storing statistics in the DHT. sizet3avg is the sum of two values:
• the average size of the information identifying the publisher a statistic belongs to
(like the IP address or nodeID)
• the average size of a statistics token, which depends on its structure
Formula 6.11 is almost the same as formula 6.9 except for the factor sizet4avg. This
factor is exactly the same as sizet3avg, so:

sizet3avg = sizet4avg = sizenodeID + sizestatistics    (6.12)
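As a sketch, the dominant terms of formulas 6.8 and 6.9 can be evaluated in the same way as for Store-Sub. The O-notation is dropped, so the results are only proportional; the sum of the new-peer term and the refresh term reflects the explanation that new peers announce everything while existing ones refresh every interval rounds. All names are illustrative.

```java
// Illustrative evaluation of the dominant terms of formulas 6.8 and 6.9
// for Store-Pub. Constants hidden by the O-notation are dropped.
public class StorePubMessages {
    // M1 (formula 6.8): new publishers plus the existing ones refreshing
    // every `interval` rounds, each announcing |F| features.
    public static double m1(int pNew, int p, int interval, int f, int n) {
        return (pNew + (double) p / interval) * f * log2(n);
    }

    // M2 (formula 6.9): new subscribers plus refreshing existing ones,
    // each resolving subAvg subscriptions with fs lookups per subscription.
    public static double m2(int sNew, int s, int interval,
                            double subAvg, double fs, int n) {
        return (sNew + (double) s / interval) * subAvg * fs * log2(n);
    }

    // DHT hop factor log2(|N|).
    static double log2(int n) {
        return Math.log(n) / Math.log(2);
    }
}
```

For instance, 100 existing publishers refreshing every 100 rounds behave, per round, like a single publisher announcing its |F| features from scratch.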
The presented formulas are derived from the two approaches and are approximations
that can be fed with known parameters. Examples of such parameters are the average
number of subscriptions per subscriber or the average number of terms per subscription.
Using the formulas and the simulator, it can be shown that the formulas give a good
approximation of an average simulation run.
6.2 Analysis
This section shows the effects of changing some parameters. Changing one parameter
can, for example, make Store-Sub the better approach to deploy Publish/Subscribe,
while changing another parameter may suddenly make Store-Pub the better solution.
All these effects can be explained with the aid of the formulas, which can be used to
forecast different situations.
6.2.1 Analysis of Store-Sub with changing publishing rate r
This section analyzes the Store-Sub approach, i.e. it analyzes the two different message
types of Store-Sub: storing the subscriptions (by subscribers) and retrieving the
subscriptions (by publishers). The experiment shows which type of message is responsible
for the main part of the generated messages. The parameters of this experiment are
depicted in Table 6.1 and Figure 6.1 shows the two types of messages.

|S|       100000
|Snew|    10
|P|       100
subavg    3
sub       5
|T|       100000
|D|       log2(100000) ≈ 16
r         variable from 1 up to 1000 documents per round

Table 6.1: Parameters for scenario with variable publishing rate

The red line represents the first type of messages, the storing of subscriptions. It does
not change, as the publishing rate r does not affect this number.
The green line represents the second type, the retrieving of subscriptions. The publishing
rate ranges from 1 document per round up to 1000 documents per round. It can be seen
that the number of messages of the second type increases as r increases, because the
publishers need to retrieve more and more subscriber lists for the increasing number of
documents. This implies that Store-Sub is sensitive to changes of the publishing rate.
[Figure: log-log plot of #Messages (10^3 to 10^8) over the publish rate (1 to 1000); series: "Sent subscription" and "Retrieve subscription"]
Figure 6.1: Message types in Store-Sub
6.2.2 Analysis of Store-Pub with changing average number of
subscriptions subavg
The next analysis examines the different message types of Store-Pub. In Store-Pub,
statistics need to be announced by the publishers (store) and the subscribers retrieve
these statistics. The variable that changes is the average number of subscriptions,
subavg; all other parameters are fixed. Table 6.2 shows the parameters.

|S|       100000
|Snew|    10
|P|       100
subavg    varies from 1 to 100
sub       5
|T|       100000
|D|       log2(100000) ≈ 16
r         10 documents per round
interval  every 100 rounds

Table 6.2: Parameters for scenario with variable subavg

Figure 6.2 shows the plots for Store-Pub. The green line shows the number of messages
generated by all publishers when storing their statistics in the DHT. These remain
constant, as the number of publishers is constant and the varying number of subscriptions
does not affect this part. The red line shows the number of messages generated when
the subscribers retrieve the statistics to decide which subset of publishers to subscribe
to. The more subscriptions they generate, the more messages are generated.
[Figure: log-log plot of #Messages (10^4 to 10^7) over #Subscriptions (1 to 100); series: "Retrieve statistics" and "Store statistics"]
Figure 6.2: Message types in Store-Pub
6.2.3 Analysis of Store-Sub and Store-Pub with changing
publishing rate r
The next experiment compares the two approaches Store-Sub and Store-Pub when the
publishing rate r is changed. The parameters used are those shown in Table 6.1. Figure
6.3 shows the outcome of this analysis. The red line is the number of messages of
Store-Sub and the green and blue lines are those of Store-Pub. Store-Pub is plotted
twice, once with an interval of 50 and once with an interval of 100. The y-axis shows
the number of messages generated and the x-axis the varying publishing rate r.
First, it can be seen that the change of the publishing rate has no impact on Store-Pub.
This is because the publishers do not need to retrieve the subscriptions of the users
from the DHT (the subscribers subscribed directly at the publishers). Second, the change
does have an impact on Store-Sub, as the publishers need to retrieve the subscriptions
from the DHT, and the more documents they intend to publish, the more information
has to be retrieved.

[Figure: log-log plot of #Messages (10^4 to 10^8) over the publish rate (1 to 1000); series: Store-Sub, Store-Pub (interval = 50), Store-Pub (interval = 100)]
Figure 6.3: #Messages at variable publishing rate
The experiment implies that in scenarios with many subscribers and few publishers with
a high publishing rate (publisher-to-user), Store-Pub fits better, because it is insensitive
to the publishing rate.
6.2.4 Analysis of Store-Sub and Store-Pub with changing refresh
interval interval
The next experiment compares Store-Sub and Store-Pub when the refresh interval is
changed. Changing this parameter clearly has no impact on Store-Sub, as in Store-Sub
it is not necessary to refresh the information stored in the directory. We vary the
parameter interval; all other parameters are fixed. Table 6.3 shows a summary of the
parameters.

|S|       100000
|Snew|    10
|P|       100
subavg    3
sub       5
|T|       100000
|D|       log2(100000) ≈ 16
r         10 documents per round
interval  varies from every round to every 1000 rounds

Table 6.3: Parameters for scenario with variable interval

Figure 6.4 shows the plot for this scenario. The y-axis shows the generated messages
and the x-axis the varying interval. The red line shows method Store-Sub. It can be
seen that changing this parameter has no impact on it, because in Store-Sub there is
no need to refresh the information stored in the DHT. The green line shows method
Store-Pub. At an interval of 1, Store-Pub generates many more messages than Store-Sub,
as the information is refreshed every round by both subscribers and publishers. At an
interval of 1000, things change in favour of Store-Pub. The parameter interval therefore
has a great impact on Store-Pub.
In a publisher-to-user scenario with the requirement to keep statistics up-to-date
frequently, Store-Sub might be the better solution to deploy Publish/Subscribe.
There are many possibilities to plot further scenarios with different parameters changed.
Such an analysis can give a good approximation of the consumption of resources like
messages and traffic.
[Figure: log-log plot of #Messages (10^4 to 10^8) over the refresh interval (1 to 1000); series: Store-Sub and Store-Pub]
Figure 6.4: #Messages at variable refresh interval
7 Design and Implementation of a
Discrete Simulator
This chapter describes the simulator written within the scope of this thesis. The
simulator approximates and measures the messages and the traffic generated by a
Publish/Subscribe P2P network. As in chapter 6, the messages are the focus of attention.
Because the implementation uses randomness for various tasks, like creating subscriptions
or the joining and leaving of subscribers, it approximates the real traffic of a
Publish/Subscribe system more closely. The simulator is discrete, i.e. it simulates a set
of publishers and subscribers interacting with the Publish/Subscribe system in rounds,
and it uses these rounds to measure the messages and the traffic in the system. In each
round there are some tasks to be done. The list below shows the general steps executed
in each round:
1. new subscribers join the P2P network
2. existing subscribers leave the P2P network
3. all publishers create documents they want to publish
4. and, depending on whether Store-Sub or Store-Pub is simulated,
• all publishers check the DHT for subscribers and match their subscriptions
(Store-Sub)
• or the publishers directly match the subscriptions made by subscribers at the
publishers (Store-Pub)
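The round steps above can be sketched as a loop. This is a purely illustrative skeleton under assumed names; the simulator's real classes, described in Section 7.2, interact in the step methods stubbed out here.

```java
// Hypothetical outline of the simulator's discrete main loop. The step
// methods are illustrative stand-ins for the class interactions
// described in Section 7.2.
public class SimulationLoop {
    public static int simulate(int rounds, boolean storeSub) {
        int executedRounds = 0;
        for (int round = 0; round < rounds; round++) {
            joinNewSubscribers();      // step 1
            leaveSomeSubscribers();    // step 2
            createDocuments();         // step 3
            if (storeSub) {
                matchViaSubscriberDB(); // step 4a: publishers query the DHT
            } else {
                matchAtPublishers();    // step 4b: subscribers ask publishers
            }
            executedRounds++;
        }
        return executedRounds;
    }

    // Stubs; the real simulator delegates to Subscriber, Publisher,
    // SubscriberDB and PublisherDB here.
    static void joinNewSubscribers() {}
    static void leaveSomeSubscribers() {}
    static void createDocuments() {}
    static void matchViaSubscriberDB() {}
    static void matchAtPublishers() {}
}
```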
According to chapter 5, within the set of all peers there is a set of subscribers. Each
subscriber has a number of subscriptions that represent its interests as standing queries.
The subscriptions consist of a number of terms and, in the topic-based scheme, they are
additionally categorized into a topic. Depending on the Publish/Subscribe scheme
chosen, the subscriptions are matched based on the terms (term-based) or based on the
topics (topic-based); in the latter case, exact matchings based on terms are made as a
further refinement.
Among the set of peers there is also a number of publishers. These publishers would like
to disseminate documents to the subscribers subscribed in the DHT. The content to be
published consists of documents, which in turn consist of a set of terms and belong to
a topic. Section 7.1 describes the model chosen to generate the documents realistically.
Each corpus is pregenerated, and the publishers pick some documents from it in each
round of the simulation.
7.1 Creating Synthetic Data
The Zipf distribution [15] is the model chosen to pregenerate the document corpora.
Zipf's law is a power law published by George Kingsley Zipf. It states that in a corpus
of natural language utterances, the frequency of any word is roughly inversely
proportional to its rank in the frequency table. That is, the most frequent word occurs
approximately twice as often as the second most frequent word, which in turn occurs
twice as often as the fourth most frequent word, and so on. This behaviour has been
observed by many research groups in several application fields on different corpora. The
law can mathematically be stated as follows:

f(k; s, N) = (1/k^s) / (Σ_{j=1}^{N} 1/j^s),    (7.1)
where N is the number of elements in the corpus, k is their rank and s is a constant
that characterizes the distribution. In the classical version of Zipf's law, the constant s
is nearly 1. The generation of a document corpus works as follows: for each document d,
the algorithm throws an i-sided coin to decide whether term i should be included in
document d. This is done for all n terms, i.e. term 1 has a probability of 1, term 2 a
probability of 1/2 and term n a probability of 1/n of being included in document d.
Doing this for all documents generates a document corpus following a Zipf-like term
distribution.
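This generation procedure can be sketched compactly. The sketch below is an illustration under assumptions (class and method names, a seeded pseudo-random source for reproducibility), not the thesis implementation; it implements exactly the "i-sided coin" rule, so term 1 appears in every document and term i with probability 1/i.

```java
import java.util.*;

// Sketch of the corpus generation described above: term i (1-based) is
// included in a document with probability 1/i, which yields a Zipf-like
// term distribution over the whole corpus.
public class ZipfCorpus {
    public static List<Set<Integer>> generate(int documents, int terms, long seed) {
        Random rnd = new Random(seed);
        List<Set<Integer>> corpus = new ArrayList<>();
        for (int d = 0; d < documents; d++) {
            Set<Integer> doc = new HashSet<>();
            for (int i = 1; i <= terms; i++) {
                if (rnd.nextDouble() < 1.0 / i) { // the "i-sided coin"
                    doc.add(i);
                }
            }
            corpus.add(doc);
        }
        return corpus;
    }
}
```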
7.2 Implementation
This section explains the classes of the simulator written within the scope of this thesis.
It allows simulating a set of publishers and subscribers randomly acting in the context
of a Publish/Subscribe system. The implementation is capable of simulating the two
approaches Store-Sub and Store-Pub with a term-based and a topic-based scheme. There
is an individual simulator for each scheme, because it is more practical to separate them.
In the remainder of this chapter, the classes of the term-based scheme are explained.
Wherever the word term appears, it can be substituted by topic; classes that exist in
only one of the schemes are explicitly mentioned whenever necessary.
The parameters defined in Table 5.1 need to be fed to the simulator. Changing these
parameters has an impact on the number of messages and the traffic generated during
the simulation. The goal is to simulate real-world scenarios in experiments and measure
their outcome. This can give hints about how the Publish/Subscribe system behaves
under the given parameters and which approach fits better. An example of an experiment
is a scenario with a lot of subscribers and only a few publishers (a news network) that
publish at a high publishing rate r. Depending on the approach chosen, the outcome is
very different. Figure 7.1 shows an overview of the term-based simulator.
[Figure: UML overview of the term-based simulator with the classes Peer, Subscriber, Publisher, Subscription, SubscriberDB, PublisherDB, PeerManager, PublisherList, PublicProperties, RandomGenerator, DocumentDB and Statistics]
Figure 7.1: UML diagram of term-based simulator
7.2.1 Class Peer
The Peer class represents a peer in the simulation. A peer has a unique id in the P2P
network, so the Peer class has a class member id and a getter getId() to retrieve it.
The class itself is never instantiated in the simulator; it only serves as an abstract base
class for the classes Publisher and Subscriber.
7.2.2 Class Subscriber
The class Subscriber inherits from the Peer class and represents a subscriber in the P2P
network. A subscriber has some subscriptions that express its standing interests. The
subscriptions are generated by each subscriber individually using the RandomGenerator
(cf. Section 7.2.9) class. The number of subscriptions and the terms they contain are
randomly chosen within a configurable range. The subscriptions are stored in the
member field subscriptions.
The class is used in the two different Publish/Subscribe methods, Store-Sub and
Store-Pub. If the simulator simulates Store-Sub, the class uses the methods
subscribe2SubscriberDB() and removeFromSubscriberDB() to subscribe and unsubscribe
itself from the DHT, making use of the class SubscriberDB (cf. Section 7.2.5) to
accomplish these tasks. The method getSubscriptions() is used by SubscriberDB to
store the subscriptions of a subscriber. If Store-Pub is simulated, another method is
used: the class Subscriber calls chooseBestPublishers() to retrieve all statistics
relevant to its subscriptions and to choose the best publishers. The detected publishers
are stored in the class member publishers and are used in every round to check whether
there are new documents. The method mergePublishers() is a helper for
chooseBestPublishers() for the case that the same publishers appear for different
subscriptions.
In a real Publish/Subscribe application that uses Store-Pub, the publishers store the
subscriptions of the subscribers. When new documents arrive, they check whether some
subscriptions match their documents and then do the rest of the job. To simplify the
implementation, the subscribers - who keep their interests in subscriptions - call
the method checkPublications() and use publishers to ask the publishers to match
their subscriptions by calling the method send(Subscriptions). The method send(...)
matches the subscriptions and sends the matching documents to the subscriber.
The method checkReallyAvailableDocuments(...) is only used in Store-Pub and
computes all documents matching the subscriptions of a subscriber. There might be a
discrepancy between the documents disseminated from the publishers to a subscriber
and the documents that would actually match, because each subscriber only chooses a
subset of the publishers. Figure 7.2 shows a UML diagram of class Subscriber.
[Figure: UML class diagram]

Peer: #id: int | +Peer(id:int), +getId(): int

Subscriber (extends Peer): -subscriptions, -publishers | +Subscriber(id:int), -generateSubscriptions(), +subscribe2SubscriberDB(), +removeFromSubscriberDB(), +chooseBestPublishers(), -mergePublishers(), +checkPublications(), +checkReallyAvailableDocuments(publishers), +getSubscriptions(), +receive()

Publisher (extends Peer): -MAX_CHECK_TERMS: int, -MAX_PER_ROUND: int, -db_base: DocumentDB, -db_new: DocumentDB, -db_round: DocumentDB | +Publisher(id:int), +publish(), +unpublish(), +republish(), +checkSubscriptions(), +generateDocumentsForRound(), +getNumberOfDocuments(term:int), +getTerms2Documents(), +compareTo(obj:Object), +send(s:Subscription), +matchSubscriptions(subscriber:Subscriber), +writeBaseToDisk(path:String), +writeNewToDisk(path:String), +loadBaseFromDisk(path:String), +loadNewFromDisk(path:String)

Figure 7.2: Relationship of peer classes
7.2.3 Class Publisher
This class represents a publisher in the simulation. Table 7.1 shows the member fields
of this class:

MAX_CHECK_TERMS   the maximal number of terms to be checked
MAX_PER_ROUND     the maximal number of documents to be published per round
db_base           the 1st document corpus
db_new            the 2nd document corpus
db_round          the document corpus for the current round

Table 7.1: Members of class Publisher

As mentioned before, a publisher has a document corpus and uses it to disseminate
documents. In Store-Sub each publisher has only one corpus, whereas in Store-Pub each
one has two corpora: the first to compute the statistics of the documents and the second
to disseminate the documents. The statistics are announced, deleted and republished
by the methods publish(), unpublish() and republish().
Publishers again have two modes in which they can be used. In Store-Sub, they use the
method checkSubscriptions() to retrieve the subscriptions for the document terms.
The subscriptions are matched against the documents, which are then sent to the
appropriate subscribers. In this case, the class makes use of SubscriberDB (cf. Section
7.2.5) to get the subscriptions and picks some documents for the round from the 1st
corpus. If some term is contained in more than one document, the method is nevertheless
called just once for that term.
In Store-Pub, all subscribers call the method send(Subscription) to match the
documents of the current round against their subscriptions. The documents are picked
by generateDocumentsForRound(n) from the 2nd corpus. The method
matchSubscriptions() is used to determine the documents that are really available in
this round; see Section 7.2.2 for an explanation.
There are some additional methods, write*ToDisk() and load*FromDisk(), that are
used to generate and persist document corpora. The method getNumberOfDocuments(term)
returns the number of documents for a given term, and getTerms2Documents() is used
by PublisherDB (cf. Section 7.2.6) to create the statistics for the corpus of each publisher.
Figure 7.2 shows class Publisher.
7.2.4 Class Subscription
This class represents a subscription in the simulator. A subscription consists of a set of
terms and belongs to a topic (only in the topic-based scheme). Table 7.2 shows the
member fields of this class:

subscriber_id   indicates which subscriber made the subscription
terms           the generated terms
terms_array     the same as terms, but sorted and stored in an array
topic           indicates the topic

Table 7.2: Members of class Subscription

At creation time, the subscription generates its terms by calling the method generate(),
which makes use of class RandomGenerator (cf. Section 7.2.9) to choose the terms and
their number. The methods getTerms() and getSubscriber_ID() return the generated
terms and the nodeID of the subscriber that made the subscription. Figure 7.3 gives an
overview of this class.
[Figure: class Subscription with members subscriber_id, terms and methods Subscription(s:Subscriber), generate(), getSubscriber_ID(), getTerms()]
Figure 7.3: Class Subscription
7.2.5 Class SubscriberDB
This singleton class is responsible for storing and managing the subscriptions made by
the subscribers. The singleton pattern has the advantage that the instance is loaded
only once and can be accessed by all classes without any additional cost. The class is
only used when Store-Sub is simulated, not in Store-Pub, i.e. it represents the DHT of
a real P2P network.
The class has a method getInstance() that is used by the other classes to get the
single instance of SubscriberDB. It is therefore guaranteed that every subscriber and
publisher has the same information about the subscriptions, like in a real P2P network
with a DHT. Subscribers and publishers access the instance in different ways: subscribers
use addSubscription(...) and removeSubscription(...) to add and remove their
subscriptions from the DHT, while publishers, who need all subscribers subscribed for
one particular term, use getPeersSubscribing() for this task. Additionally, there is a
method clear() to delete all entries of the data structure. Figure 7.4 shows class
SubscriberDB.
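The singleton structure just described can be sketched as follows. The method bodies and exact signatures are illustrative reconstructions from the description (the thesis code is not reproduced here); only the lazily created single instance, the term-to-subscribers Hashtable, and the listed method names follow the text.

```java
import java.util.*;

// Sketch of the SubscriberDB singleton: one shared instance holding a
// Hashtable from terms to subscriber lists, standing in for the DHT.
public class SubscriberDB {
    private static SubscriberDB instance;
    private final Hashtable<Integer, List<Integer>> term2subscriberList = new Hashtable<>();

    private SubscriberDB() {} // private constructor: singleton pattern

    public static synchronized SubscriberDB getInstance() {
        if (instance == null) instance = new SubscriberDB();
        return instance;
    }

    // A subscriber announces one subscription term.
    public void addSubscription(int term, int subscriberId) {
        term2subscriberList.computeIfAbsent(term, t -> new ArrayList<>()).add(subscriberId);
    }

    public void removeSubscription(int term, int subscriberId) {
        List<Integer> l = term2subscriberList.get(term);
        if (l != null) l.remove(Integer.valueOf(subscriberId));
    }

    // Publishers retrieve all subscribers for one term.
    public List<Integer> getPeersSubscribing(int term) {
        return term2subscriberList.getOrDefault(term, Collections.emptyList());
    }

    public void clear() {
        term2subscriberList.clear();
    }
}
```

Because every class obtains the same instance via getInstance(), all simulated peers observe the same subscription state, mimicking the globally consistent view a DHT provides.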
[Figure: class SubscriberDB with members subscriberdb: SubscriberDB, term2subscriberList: Hashtable and methods SubscriberDB(), getInstance(): SubscriberDB, addSubscriber(s:Subscriber), removeSubscriber(s:Subscriber), getPeersSubscribing(term:int), clear(); class PublisherDB with members publisherdb: PublisherDB, term2publishers: Hashtable and methods PublisherDB(), getInstance(): PublisherDB, addPublication(p:Publisher), removePublication(p:Publisher), getPublishersByTerm(term:int), getPublisherByTermAsTreeSet(term:int), clear()]
Figure 7.4: PublisherDB and SubscriberDB
7.2.6 Class PublisherDB
Class PublisherDB is also a singleton class, for the same reason as SubscriberDB. The
class is used when simulating the Store-Pub method, and it stores the statistics of the
publishers. The statistic in the simulation is the number of documents for an entry:
the more documents a publisher has for an entry, the better its rank. Much more
sophisticated implementations of statistics are possible.
The publishers announce their statistics using the method addPublication(). Subscribers
can retrieve the statistics for some entry via the methods getPublishersByTerm(term)
and getPublisherByTermAsTreeSet(term), which only differ in the data structure they
return. Like class SubscriberDB, this one has the methods clear() and getInstance()
to delete and to get the single instance of this class. Figure 7.4 shows class PublisherDB.
7.2.7 Class PublisherList
This class is a helper class used by PublisherDB. Its purpose is to store the publishers
and therefore their statistics. The following table shows the members of PublisherList:
publisherlist the datastructure containing pointers to the publishers
term the term of the list, i.e. all publishers are ordered
by the occurrence of term in their documents
dirty indicates whether the list is sorted
Table 7.3: Members of class PublisherList
The class has the methods addPublisher(p) and removePublisher(p) to add and remove
publishers. The number of publishers can be retrieved by size(), and the list can be
emptied using the method clear().
Consider that term has the value 1. Then all publishers having documents that contain
term 1 add themselves to the list. The methods get_X_best(x) and getPublishers()
sort the publishers according to the number of occurrences of term 1 in their documents
and then return either the whole sorted list or only the x best entries. The dirty flag
signals that the list is not sorted: it is cheaper to add without sorting and to sort only
when the list is needed in sorted order. Figure 7.5 shows the class in a UML diagram.
PublisherList
-publisherlist
-term
-dirty
+PublisherList(term)
+addPublisher(p:Publisher)
+get_X_best(n:int)
+clear()
+getPublishers()
+removePublisher(p:Publisher)
+size()
+sort()
Figure 7.5: Class PublisherList
7.2.8 Class DocumentDB
Class DocumentDB contains the document corpus. It is a serializable class that has
methods to generate documents, match subscriptions, restructure the corpus, and so on.
The member fields are:
documents2terms A hashtable that maps document ids to
the terms contained in the documents
terms2documents A hashtable that maps terms to the
documents that contain the term
probability offset Needed to generate documents with terms
and their probabilities
startterm id The 1st/lowest term in the corpus
next document The next document to be published
Table 7.4: Members of class DocumentDB
The class DocumentDB can either generate a corpus ad hoc, using the constructor that
is given the number of terms, the number of documents and the start term, or it can be
deserialized from hard disk, which is helpful to run comparable and reproducible exper-
iments. The second constructor gets a collection of documents and constructs a
mapping from terms to documents. This mapping is called an inverted list and is used
to find out in which documents a term is contained. The method generate() uses the
algorithm explained in Section 7.1 to construct the corpus. Additionally, the class has
methods to get information about the corpus: getNumberOfDocuments(term) returns the
number of documents that exist for a given term and getRemainingDocuments() returns
the number of documents not yet published.
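Constructing such an inverted list is a plain inversion of the document-to-terms mapping. A minimal sketch, under the assumption that documents are represented by integer ids and sets of integer term ids (the class and method names are illustrative):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch of constructing an inverted list: from a mapping
// document id -> terms to a mapping term -> document ids.
class InvertedList {
    static Map<Integer, Set<Integer>> invert(Map<Integer, Set<Integer>> documents2terms) {
        Map<Integer, Set<Integer>> terms2documents = new HashMap<>();
        for (Map.Entry<Integer, Set<Integer>> doc : documents2terms.entrySet()) {
            for (int term : doc.getValue()) {
                // every term of the document points back to the document id
                terms2documents.computeIfAbsent(term, t -> new HashSet<>())
                               .add(doc.getKey());
            }
        }
        return terms2documents;
    }
}
```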
Two methods are capable of matching subscriptions and subscribers (with all their sub-
scriptions) against the documents. Their names are getDocumentsForSubscriber()
and getDocumentsForSubscription(...), and they return the documents matching the
subscriptions; both use matchSubscription(...) as a helper method. To create the
documents for a round, the method getDocuments(n) is called; it returns n documents
from the corpus, constructs the inverted list and increments next_document, i.e. n
documents from the corpus are consumed for publishing. The terms of these documents
can be accessed by getTerms() and are used to query class SubscriberDB for subscribers.
Class PublisherDB makes use of the method getTerms2Documents() to add the publishers'
statistics to the DHT. It is also possible to merge documents into an existing
DocumentDB. This is necessary when publishers need to republish their statistics:
usually some documents have already been published and must also be used to form the
profile, so the DocumentDB, which is the basis for the statistics, has to be augmented
with the already published documents. The methods responsible for this are add(doc),
getFirstDocuments() and restructure(). Figure 7.6 shows a diagram of this class.
7.2.9 Class RandomGenerator
Class RandomGenerator is a singleton class which offers methods to randomly generate
numbers. The member fields of this class are:
rg the static instance of this class
random instance of Random class from Java API
Table 7.5: Members of class RandomGenerator
This class has two methods to generate random numbers: randomNumber() generates
an arbitrary random number, while randomNumber(n) generates a random number
between 0 and n − 1 and returns it.
These methods are used at various places in the simulator. They are used to generate:
• number of subscriptions for each subscriber
• number of terms in subscriptions
• number of subscribers that join/leave in a round
• number of documents to publish in a round
Figure 7.7 shows the UML diagram of this class.
DocumentDB
-documents2terms
-terms2documents
-probability_offset
-startterm_id
-next_document
+DocumentDB(documents2terms)
+DocumentDB(terms,documents,startterm)
+add(id,doc)
+clear()
-generate()
+get_N_BestTerms(n)
+getDocumentDB(n)
+getDocuments4Term(term)
+getDocumentsForSubscriber(subscriber)
+getDocumentsForSubscription(subscription)
+getFirstDocuments()
+getNumberOfDocuments(term)
+getRemainingDocuments()
+getTerms()
+getTerms2Documents()
-matchSubscription(subscription)
+restructure()
Figure 7.6: Class DocumentDB
RandomGenerator
-rg: RandomGenerator
-random
-RandomGenerator()
+getInstance()
+randomNumber()
+randomNumber(n:int)
Figure 7.7: Class RandomGenerator
7.2.10 Class PublicProperties
PublicProperties is a properties class that is implemented as a singleton. The class is
used to parameterize the simulator and is self-explanatory. A diagram can be viewed in
Figure 7.8.
PublicProperties
-pp: PublicProperties
-properties
-PublicProperties()
+getInstance()
+getAsBoolean(key)
+getAsInt(key)
+getAsString(key)
Figure 7.8: Class PublicProperties
The properties that can be accessed are listed in table 7.6:
rounds number of rounds to simulate
subs per peer maximal subscriptions per subscriber
max number of terms maximal number of terms per subscription
docs number of documents per publisher
terms number of different terms per corpus
topics number of topics per corpus
termspace all occurring terms in the simulation
probability offset defines the probability of the 1st term in the corpus
doc path directory path of serialized corpora
env path directory of Berkeley environment
Table 7.6: List of properties
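Such typed accessors are thin wrappers around java.util.Properties. A minimal sketch of the idea; the class name SimProperties, the set(...) method and the default values are assumptions of this sketch (the real simulator reads its values from a properties file):

```java
import java.util.Properties;

// Sketch of typed accessors over java.util.Properties, in the style of
// PublicProperties (getAsInt, getAsBoolean, getAsString), as a singleton.
class SimProperties {
    private static SimProperties pp;
    private final Properties properties = new Properties();

    private SimProperties() {
        // illustrative defaults instead of a loaded properties file
        properties.setProperty("rounds", "1000");
        properties.setProperty("topics", "100");
    }

    static synchronized SimProperties getInstance() {
        if (pp == null) pp = new SimProperties();
        return pp;
    }

    void set(String key, String value) { properties.setProperty(key, value); }

    int getAsInt(String key) { return Integer.parseInt(properties.getProperty(key)); }

    boolean getAsBoolean(String key) { return Boolean.parseBoolean(properties.getProperty(key)); }

    String getAsString(String key) { return properties.getProperty(key); }
}
```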
7.2.11 Class Statistics
Class Statistics is a singleton class that counts the messages and traffic generated
during a simulation. In each method - Store-Sub and Store-Pub - the messages and
traffic must be counted in different ways and at different positions in the code. The
output of this class can be used as a basis for plotting graphs and gives an overview of
message and traffic consumption.
7.2.12 Class BerkeleyDocumentDB
This class is only used in the topic-based scheme and never in the term-based one. It
represents the document corpus and uses BerkeleyDB [9] as back end. Because this class
needs many members to maintain functionality that is not important to the simulator,
those are omitted from the description. Table 7.7 shows the member fields of
BerkeleyDocumentDB.
counter the first not-yet-published document
id the id of the class
number of documents number of documents contained
number of terms number of terms in the corpus
startterm the lowest term of the corpus
termdistributiondb a database which contains the inverted list
topicdistributiondb mapping from topic to documents
Table 7.7: Members of class BerkeleyDocumentDB
BerkeleyDocumentDB
-counter
-documentdb
-id
-number_of_documents
-number_of_terms
-startterm
-termdistribution
-topicdistribution
+BerkeleyDocumentDB(n:int)
+close()
-computePath()
-computeTopics()
-generate(startterm:int)
+getDocument(id:int)
+getDocumentsForRound(n:int)
+getNumberOfDocuments(topic:int)
+getTopicDistribution(topic:int)
+hasMoreElements()
-initializeDatabase()
+nextElement()
+storeDocument()
+storeTopicDistribution(td:TopicDistribution)
+topicDistributions()
Figure 7.9: Class BerkeleyDocumentDB
The constructor initializes the instance. It uses the methods computePath()
and initializeDatabase() and class PublicProperties.
Method getDocument(id) returns a requested document from the corpus in the form of
class Document, and storeDocument(doc) stores a document. Class Document is simply
a wrapper around the terms, implemented to use the mechanisms of BerkeleyDB.
Method generate() is used to generate a new corpus and store the whole database
under the name id. Method getDocumentsForRound(n) returns a DocumentCollection
with n documents for each round. To return the number of documents for a topic and
to construct the profile for each publisher, the method getNumberOfDocuments(topic)
is called by class PublisherList. In the case of topic-based Publish/Subscribe, topics
need to be computed; method computeTopics() accomplishes this task, using
getTopicDistribution(topic) as a helper method. Figure 7.9 shows an overview of this
class.
TopicDB
-instance
-number_of_topics
-topicdb
-topic2terms
-TopicDB()
+getInstance()
+close()
-computePath()
-generateTopics()
+getTerms4Topic(topic:int)
+getTopic(term:int)
-initializeDatabase()
-invert()
Figure 7.10: Class TopicDB
7.2.13 Class TopicDB
Class TopicDB is responsible for assigning a topic to each term. In real documents, a term
may belong to several topics, but for simplicity each term has exactly one topic here.
This class is only used in the topic-based scheme. Its members are shown in Table 7.8.
instance the singleton instance of this class
number of topics the number of topics in the simulation
topicdb the database containing the mapping term → topic
topic2terms the database containing mapping topic → terms
Table 7.8: Members of class TopicDB
This class is used system-wide to generate subscriptions and to retrieve the topic of a
term or all the terms belonging to a topic. To generate the TopicDB, the whole termspace
is divided and each term is assigned a topic; afterwards the mapping is inverted. To be
accessible from all classes, this class is a singleton. It has the methods
getTerms4Topic(...) and getTopic(...) to retrieve all terms for a topic and the
topic for a term, respectively. The methods generateTopics() and invert() construct the
whole database with intensive use of RandomGenerator. Figure 7.10 gives an overview.
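The construction of the term-to-topic mapping and its inversion can be sketched as follows. Dividing the termspace into equal, contiguous blocks is a simplification of this sketch; the real class assigns topics with RandomGenerator and keeps both mappings in BerkeleyDB:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of TopicDB's construction: divide the termspace so that each
// term gets exactly one topic, then invert the mapping (term -> topic
// becomes topic -> terms).
class Topics {
    private final Map<Integer, Integer> term2topic = new HashMap<>();
    private final Map<Integer, List<Integer>> topic2terms = new HashMap<>();

    Topics(int termspace, int numberOfTopics) {
        int termsPerTopic = termspace / numberOfTopics;
        for (int term = 0; term < termspace; term++) {
            // deterministic block assignment instead of RandomGenerator
            int topic = Math.min(term / termsPerTopic, numberOfTopics - 1);
            term2topic.put(term, topic);
        }
        invert();
    }

    private void invert() {
        for (Map.Entry<Integer, Integer> e : term2topic.entrySet()) {
            topic2terms.computeIfAbsent(e.getValue(), t -> new ArrayList<>())
                       .add(e.getKey());
        }
    }

    int getTopic(int term) { return term2topic.get(term); }

    List<Integer> getTerms4Topic(int topic) { return topic2terms.get(topic); }
}
```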
DocumentCollection
-documents
+DocumentCollection(documents)
+getDocuments4Subscriber(s:Subscriber)
+getDocuments4Subscription(s:Subscription)
+getTopics()
+match(Subscription,Document)
Figure 7.11: Class DocumentCollection
7.2.14 Class DocumentCollection
This class represents a collection of documents and is only used in the topic-based
scheme. In every round, publishers choose a number of documents to publish.
DocumentCollection collects all these documents and is also capable of matching the
documents against subscriptions. The class only has the member field documents, which
stores the documents to be published.
The documents for the round are passed via a constructor parameter. With the method
getTopics(), publishers can retrieve the set of topics of the documents in the collection.
Method match(...) is a helper method that matches one subscription against one
document and returns true if all terms of the subscription are contained in the document.
The methods getDocuments4Subscriber() and getDocuments4Subscription() make
use of this helper and return all documents matching the subscriber and the subscription,
respectively. Figure 7.11 sketches class DocumentCollection.
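The matching rule can be sketched in a few lines. Representing subscriptions and documents as sets of integer term ids is an assumption of this sketch; the real match(...) works on the simulator's Subscription and Document classes:

```java
import java.util.Set;

// Sketch of the matching rule: a subscription matches a document
// iff every term of the subscription occurs in the document.
class Matcher {
    static boolean match(Set<Integer> subscriptionTerms, Set<Integer> documentTerms) {
        return documentTerms.containsAll(subscriptionTerms);
    }
}
```

This is conjunctive matching: a subscription with terms {1, 4} does not match a document with terms {1, 2, 3}, because term 4 is missing.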
7.2.15 Class TopicDistribution
TopicDistribution is the class that stores how many documents there are for a topic.
This information is used by PublisherDB to create the statistics used by subscribers.
This class is only used in the topic-based scheme, and its members are shown in
Table 7.9. The constructor creates a new topic distribution for a given topic. The method
counter the number of documents for this topic
topic the topic
Table 7.9: Members of class TopicDistribution
addDocument() counts the documents, and the methods getNumberOfDocuments() and
getTopic() return the number of documents for the topic and the topic itself,
respectively. Figure 7.12 shows class TopicDistribution.
TopicDistribution
-counter
-topic
+TopicDistribution(topic:int)
+addDocument(doc:Document)
+getNumberOfDocuments()
+getTopic()
Figure 7.12: Class TopicDistribution
7.3 Lessons learned
This section briefly describes the lessons learned while implementing the simulator. First,
depending on the system parameters, the simulator needed very different amounts of time
to finish so that the results could be reviewed. The run times varied from 30 minutes
up to 9 hours for one run of Store-Pub with many subscribers and about 20 pub-
lishers. This was due to the fact that publishers as well as subscribers needed to refresh
the statistics, and that in each round the method checkReallyAvailableDocuments(..)
(cf. 7.2.2) had to be executed for all subscribers.
Second, the space consumption of the document corpora is huge. As each pub-
lisher in Store-Pub has two corpora with 50000 documents each, a fast database is
crucial for this purpose. The term-based implementation works with serialized objects
and therefore consumes a lot of memory. The topic-based implementation uses the Java
API of BerkeleyDB and is thus much more efficient with respect to memory consumption.
The only disadvantage is that corpus generation is not as fast as with the term-based
serialization.
8 Experimental Results
This chapter presents measurements made with the simulator. In the simulator, each
publisher has its own document corpus, a predefined range of terms and a predefined
number of documents. All documents only contain terms within this range, which in
the simulations always comprised 100000 terms. To ensure that some subscriptions can
be matched by only one publisher, the range of terms for each publisher is different, but
the ranges mostly overlap. For example, publisher p1 has documents with terms in the
range from 1 to 100000, p2 from 21 to 100020, p3 from 41 to 100040, and so on. In
essence, there can be documents exclusively owned by one publisher and not by another,
and it can also happen that different publishers have the same document.
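The sliding term ranges from the example can be written down directly. The following sketch (the class name and methods are illustrative) encodes the offset of 20 terms per publisher used above:

```java
// Sketch of the overlapping term ranges from the example:
// publisher i covers the terms [1 + 20*(i-1), 100000 + 20*(i-1)],
// so p1 covers 1..100000, p2 covers 21..100020, and so on.
class TermRange {
    static int lowest(int publisher)  { return 1 + 20 * (publisher - 1); }
    static int highest(int publisher) { return 100000 + 20 * (publisher - 1); }

    // true iff the given term can occur in this publisher's documents
    static boolean covers(int publisher, int term) {
        return term >= lowest(publisher) && term <= highest(publisher);
    }
}
```

With these ranges, a subscription for a term like 100010 can only be matched by publishers p2 and later, while the bulk of the termspace is shared by all publishers.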
8.1 Average behaviour
The simulation shows the average behaviour under normal load. The system parameters
are shown below in table 8.1. Figure 8.1 shows the result when the parameters above
|S| 5000
|Snew| 5 on average
|P | 10
subavg 3
sub 5
|T | 100000
|D| log2(100000) ≈ 16
r 1 or 100 per round
Table 8.1: Parameters for scenario with variable publishing rate
are used with a publishing rate of 1 document per round. As can be seen, Store-Sub
consumes fewer messages but shows a lot of variance in comparison to Store-Pub.
Additionally, it can be derived from the graph that Store-Sub is very sensitive to changes
of the publishing rate r. These changes are visible (compared to the formulas) because
the number of documents published in each round is random: it may happen that in one
round just 10 ∗ 1 = 10 documents are published and in another round 20 ∗ 10 = 200.
This clearly affects the number of messages; the average consumption can be observed
in the plot. Figure 8.2 shows the simulation when the publishing rate is 100 on average.
Figure 8.1: Publishing rate 1
Again, the system parameters are as in table 8.1. Now Store-Pub performs far better
than Store-Sub; Store-Pub itself has not changed compared to Figure 8.1, as the change
of the publishing rate has no impact on it.
This leads to the conclusion that Store-Sub seems to perform better in a user-to-user
scenario, because under low, stable publishing rates Store-Pub generates more messages.
On the other hand, Store-Pub performs better in a publisher-to-user scenario, because
Store-Sub is sensitive to the publishing rate r while Store-Pub generates nearly the same
number of messages throughout.
Figure 8.2: Publishing rate 100
8.2 Comparison Simulator and Formulas
This section directly compares the results of experiments with the simulator to the
results computed with the formulas described in Section 6. The system parameters are
the same as in table 8.1 for both the simulator and the formulas.
Figure 8.3: Comparison of Simulator and Formula with r = 1
Figure 8.3 shows the comparison of the simulator and the derived formulas. The legend
of the graph shows (Sim) if a line belongs to data of the simulator and (For) if it belongs
to data of the formulas. It can be seen that at a publishing rate of r = 1 the values of
the simulator are forecast well. Figure 8.4 shows the same comparison, but with a
publishing rate r of 100.
Figure 8.4: Comparison of Simulator and Formula with r = 100
8.3 Growing number of subscribers
The following experiment uses the topic-based scheme to compare the approaches Store-
Sub and Store-Pub. In the experiments we assume 100 pregenerated topics that are used
to categorize the documents of the publishers. We vary the number of subscribers, such
that at the beginning there is no subscriber and at the end there are 50000 subscribers.
Table 8.2 shows the parameters used in the simulation. Figure 8.5 shows the outcome of
|S| growing from 0 to 50000
|Snew| 5 on average
|P | 10
subavg 3
sub 5
|T | 100000
|C| 100
|D| log2(100000) ≈ 16
r 10 per round
Table 8.2: Parameters for scenario with growing number of subscribers
the experiment.
Figure 8.5: Varying number of subscribers
The x-axis indicates the number of subscribers and the y-axis indicates the number of
messages generated by each approach. The red line shows the approach
Store-Sub. It can be seen that as the number of subscribers grows, the number of
messages grows faster than in Store-Pub, which is shown as the green line. The number
of messages grows in Store-Pub as well, because the log(N) factor (routing in the DHT)
grows. The result of this experiment shows that in a typical publisher-to-user scenario
Store-Pub fits better, simply because its number of messages is much lower and less
sensitive.
8.4 Adding publishers
This experiment again simulates the topic-based scheme to compare the two approaches.
Now we vary the number of publishers, while the number of subscribers in the experiment
stays at approximately 5000. At the beginning of the simulation there are 10 publishers,
and in round 2500 we add 10 more publishers. Table 8.3 shows the parameters for the
simulation.
Figure 8.6 shows the outcome of this simulation run. The x-axis indicates the round
and the y-axis the number of messages. The green line shows the number of messages
of Store-Pub: adding publishers has no impact on its number of messages, which remains
constant throughout the simulation. The red line shows the effect of adding the
publishers in round 2500: the number of messages doubles, because the number of
publishers and therefore the average number of documents per round doubles. Thus,
Store-Sub again shows the same sensitivity to the publishing rate r. This leads to the
conclusion that Store-Pub seems to be more efficient
|S| 5000 on average
|Snew| 5 on average
|P | 10 to 20
subavg 3
sub 5
|T | 100000
|C| 100
|D| log2(100000) ≈ 16
r 10 per round
Table 8.3: Parameters for scenario with growing number of publishers
Figure 8.6: Varying number of publishers
(concerning the number of messages) than Store-Sub in a publisher-to-user scenario.
9 Conclusion
We presented two approaches to implementing the Publish/Subscribe paradigm on top
of a structured P2P network. The approaches have been analyzed concerning the
generation of messages and traffic. Additionally, we designed and implemented a discrete
event simulator, presented some experiments and related the outcomes to certain
usage scenarios. The analysis and the measurements show that Store-Sub is very sensitive
to changes of the publishing rate r and the number of publishers. This leads to the
conclusion that Store-Pub is the better choice for typical publisher-to-user scenarios.
On the contrary, in a user-to-user scenario with many publishers and subscribers, Store-
Sub seems to perform better, as Store-Pub is sensitive to the number of subscriptions.
One of the biggest challenges in the future will be to tackle network congestion. This
congestion will certainly be a topic for further research, since network resources are
becoming available to nearly everyone and future network applications will make heavy
use of them. The presented analysis of Publish/Subscribe approaches can quickly give
hints on the properties of Publish/Subscribe systems. It can be used to evaluate the
sensitivity of a system when application models or usage scenarios change. Additionally,
it contributes to optimizing the application designs in question and to improving the
quality of such systems.
Further steps include the use of more refined models of, for example, the user behaviour
when subscribing. A model of "What is a user subscribing for?" could give additional
guidance for Publish/Subscribe applications and lead to a more efficient use of the
available options.
Bibliography
[1] BitTorrent homepage. http://www.bittorrent.com.
[2] Bittorrent at Wikipedia. http://en.wikipedia.org/wiki/BitTorrent.
[3] Chimera and Tapestry. http://p2p.cs.ucsb.edu/chimera/html/home.html.
[4] The Chord project. http://pdos.csail.mit.edu/chord.
[5] eDonkey2000 at Wikipedia. http://en.wikipedia.org/wiki/EDonkey2000.
[6] eDonkey2000 homepage. http://www.edonkey2000.com/.
[7] Gnutella at Wikipedia. http://en.wikipedia.org/wiki/Gnutella.
[8] Gnutella homepage. http://www.gnutella.com/.
[9] Homepage of BerkeleyDB. http://www.sleepycat.com.
[10] JavaSpaces. http://java.sun.com/docs/books/jini/javaspaces/.
[11] Napster at Wikipedia. http://en.wikipedia.org/wiki/Napster.
[12] Napster homepage. http://www.napster.com.
[13] The Pastry Project. http://freepastry.org/.
[14] SHA hash functions. http://en.wikipedia.org/wiki/SHA_hash_functions.
[15] Zipf's law. http://en.wikipedia.org/wiki/Zipf's_law.
[16] Karl Aberer. A self-organizing access structure for P2P information systems, vol-
ume 2172. Springer Verlag Berlin et al., 2001.
[17] Abhishek Gupta, Ozgur D. Sahin, Divyakant Agrawal, and Amr El Abbadi. Megh-
doot: Content-based publish/subscribe over P2P networks. Technical report, Uni-
versity of California at Santa Barbara.
[18] Chee-Yong Chan, Pascal Felber, Minos Garofalakis, and Rajeev Rastogi. Efficient
filtering of XML documents with XPath expressions. The VLDB Journal, pages
354–379, 2002.
[19] P. Eugster and R. Guerraoui. Content-based publish/subscribe with structural
reflection. In Usenix Conference on Object-Oriented Technologies and Systems,
2001.
[20] Patrick Thomas Eugster. Type-based Publish/Subscribe. PhD thesis, École Poly-
technique Fédérale de Lausanne, 2001.
[21] Google, www.google.com/alerts. Google Alert.
[22] Network Working Group. Telnet Protocol Specification.
http://www.ietf.org/rfc/rfc854.txt, May 1983.
[23] Network Working Group. Hypertext Transfer Protocol – HTTP/1.1.
http://www.ietf.org/rfc/rfc2616.txt, June 1999.
[24] Network Working Group. The Secure Shell (SSH) Protocol Architecture.
http://www.ietf.org/rfc/rfc4251.txt, January 2006.
[25] Ion Stoica, Robert Morris, David Karger, M. Frans Kaashoek, and Hari Balakrishnan.
Chord: A scalable peer-to-peer lookup service for internet applications. Technical
report, MIT Laboratory for Computer Science.
[26] M. Altinel and M. Franklin. Efficient filtering of XML documents for selective
dissemination. In Proceedings of the 26th International Conference on Very Large
Data Bases (VLDB '00), pages 53–64, 2000.
[27] Chaoying Ma and Jean Bacon. COBEA: A CORBA-based event architecture. Uni-
versity of Cambridge, 1998.
[28] M.A. Jovanovic, F.S. Annexstein, and K.A. Berman. Scalability issues in large
peer-to-peer networks - a case study of Gnutella. Technical report, University of
Cincinnati, 2001.
[29] Miguel Castro, Peter Druschel, Anne-Marie Kermarrec, and Antony Rowstron.
SCRIBE: A large-scale and decentralized application-level multicast infrastructure.
IEEE Journal on Selected Areas in Communications, volume 20, October 2002.
[30] A. Oram. Peer-to-Peer: Harnessing the Power of Disruptive Technologies. O'Reilly
& Associates, Inc., Sebastopol, CA, USA, 2001.
[31] Jonathan B. Postel. Simple Mail Transfer Protocol. University of Southern Califor-
nia, http://www.ietf.org/rfc/rfc0821.txt, August 1982.
[32] S. Ratnasamy, P. Francis, M. Handley, R. Karp, and S. Schenker. A scalable
content-addressable network. In SIGCOMM 2001, 2001.
[33] Andreas Zeidler. A Distributed Publish/Subscribe Notification Service for Pervasive
Environments. PhD thesis, TU Darmstadt, September 2004.