Tutorial 8 (web graph models)
Evolutionary Models of the Web Graph
Kira Radinsky
Web size estimation models are based on the Stanford slides by Christopher Manning and Prabhakar Raghavan
7 December 2010
Stochastic Models for the Web’s Graph
So what can explain the observed Power Law in/out degree distributions of Web pages?
• Standard G(n,p) Erdös-Rényi random graphs:
– A graph contains n nodes, and every two nodes are connected with probability p
– Degrees are distributed B(n-1,p), and since on the Web np<<n (i.e., p is small), they are approximately Poisson((n-1)p)
– Such distributions have light, exponentially decreasing tails: nodes with very large in-degrees are practically impossible, yet they abound on the Web
Erdös-Rényi random graphs do not model the Web graph
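A minimal sketch of the light-tail argument, assuming small illustrative values of n and p (they are not taken from the slides). The helper names `er_degrees` and `poisson_tail` are made up for this example. It samples one G(n,p) graph, compares the mean degree with (n-1)p, and evaluates how unlikely a degree of five times the mean is under the Poisson approximation.

```python
import math
import random


def er_degrees(n, p, seed=0):
    """Sample an undirected G(n, p) graph and return the degree of every node."""
    rng = random.Random(seed)
    deg = [0] * n
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:      # each pair is connected independently
                deg[u] += 1
                deg[v] += 1
    return deg


def poisson_tail(lam, k, terms=200):
    """Approximate P(X >= k) for X ~ Poisson(lam) by summing `terms` terms."""
    # First term e^{-lam} lam^k / k!, computed in log space to avoid overflow.
    term = math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
    total = 0.0
    for i in range(k, k + terms):
        total += term
        term *= lam / (i + 1)         # next Poisson probability
    return total


if __name__ == "__main__":
    n, p = 1000, 0.01                 # illustrative values only
    deg = er_degrees(n, p)
    lam = (n - 1) * p
    print("mean degree:", sum(deg) / n, "vs (n-1)p =", lam)
    print("max degree:", max(deg))
    print("P(degree >= 5x mean) ~", poisson_tail(lam, int(5 * lam)))
```

In a run like this the maximum degree stays within a small multiple of the mean, which is the behaviour the slide contrasts with the hubs observed on the Web.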
Evolutionary Models – First Attempt
• The Web wasn’t built in a day; in fact, it is constantly growing and evolving
• Models should (somewhat) reflect the authoring process of Web pages
• Observation: older, well-established nodes should be better connected as they’ve been around longer and are better known
• A corresponding model:
– Start at time 0 with a single node.
– At step t, add a new node with a single new edge that connects to one of the t pre-existing nodes chosen uniformly at random
– The expected in-degree at time T of the node added at time t: $\sum_{j=t+1}^{T} 1/j \approx \ln T - \ln t$
– Doesn’t result in a power law: P(2x)/P(x) is not a constant
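The uniform-attachment process can be simulated in a few lines. This is a sketch under the stated model only; the function names and the values of T, t and the number of runs are illustrative assumptions. It compares the simulated average in-degree of the node added at time t with the exact harmonic sum and with ln T - ln t.

```python
import math
import random


def uniform_attachment(T, seed=0):
    """Run T steps; return in-degrees, where index i is the node added at time i."""
    rng = random.Random(seed)
    indeg = [0]                       # time 0: a single node, no edges
    for t in range(1, T + 1):         # at step t there are t pre-existing nodes
        target = rng.randrange(t)     # chosen uniformly at random
        indeg[target] += 1
        indeg.append(0)               # the new node starts with in-degree 0
    return indeg


def expected_indegree(t, T):
    """Exact expectation: sum of 1/j for j = t+1, ..., T."""
    return sum(1.0 / j for j in range(t + 1, T + 1))


if __name__ == "__main__":
    T, runs, t = 2000, 200, 10        # illustrative values only
    avg = sum(uniform_attachment(T, seed=s)[t] for s in range(runs)) / runs
    print("simulated average:", avg)
    print("exact sum:        ", expected_indegree(t, T))
    print("ln T - ln t:      ", math.log(T) - math.log(t))
```

Because the expected in-degree grows only logarithmically with age, even the oldest node collects few links, which is why this model cannot produce a power law.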
7 December 2010 236620 Search Engine Technology
Preferential Attachment
• Observation: while older, well-established nodes are better known, it is not strictly because of their age but rather because of them having more in-links
• The preferential attachment model:
– Start at time 0 with a single node.
– At step t, add a new node with a single new edge that connects to one of the t pre-existing nodes
• The probability of linking to node v: (1+in-degree(v)) / (2t-1)
• A variant involves a parameter α:
– Start at time 0 with a single node.
– At step t, add a new node with a single new edge that connects to node v with probability α/t + (1-α)*in-degree(v)/(t-1)
• Both variants indeed result in a Power-Law distribution of in-degrees (different exponents)
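Both variants can be sketched with simple list-based sampling. The function names and the "urn" trick are choices made for this example, not something the slides prescribe. In the first function each node appears in the urn 1 + in-degree times, so a uniform draw from the urn has exactly the probability (1+in-degree(v)) / (2t-1). In the α variant the formula is undefined at t=1 (no edges yet), so the sketch assumes the first edge is placed uniformly.

```python
import random


def preferential_attachment(T, seed=0):
    """P(link to v) = (1 + indeg(v)) / (2t - 1) at step t."""
    rng = random.Random(seed)
    indeg = [0]
    urn = [0]                          # node v appears 1 + indeg(v) times
    for t in range(1, T + 1):          # len(urn) == 2t - 1 at this point
        target = rng.choice(urn)
        indeg[target] += 1
        urn.append(target)             # one more ticket for the target
        indeg.append(0)
        urn.append(t)                  # the new node's own ticket
    return indeg


def mixed_attachment(T, alpha, seed=0):
    """P(link to v) = alpha / t + (1 - alpha) * indeg(v) / (t - 1) at step t."""
    rng = random.Random(seed)
    indeg = [0]
    edge_targets = []                  # destination of every existing edge
    for t in range(1, T + 1):
        # Assumption: with no edges yet (t == 1) fall back to the uniform choice.
        if not edge_targets or rng.random() < alpha:
            target = rng.randrange(t)              # uniform over t nodes
        else:
            target = rng.choice(edge_targets)      # proportional to in-degree
        indeg[target] += 1
        edge_targets.append(target)
        indeg.append(0)
    return indeg


if __name__ == "__main__":
    indeg = preferential_attachment(50000)
    x = 1
    while x <= max(indeg):             # counts at doubling thresholds
        print(x, sum(1 for k in indeg if k >= x))
        x *= 2
```

Printing the counts at doubling thresholds is a rough way to eyeball the P(2x)/P(x) ratio mentioned earlier: for a power law the ratio between successive rows should be roughly constant.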
Preferential Attachment (cont.)
• Another observation: if search engine rankings are influenced by PageRank, then new pages will link to high-PageRank pages more than to low PageRank pages
• The model uses two positive parameters d, p such that d+p<1
• The evolution:
– Start at time 0 with a single node.
– At step t, add a new node with a single new edge as follows:
• With probability d, connect the edge to one of the existing nodes in proportion to the in-degree (or 1+in-degree) of that node
• With probability p, connect the edge to a node chosen at random according to the PageRank distribution at time t
• With probability 1-p-d, connect the edge to an existing node chosen uniformly at random
• With properly chosen parameters, this model can fit both the in-degree and PageRank Power-Law distributions
Pandurangan, Raghavan, and Upfal, “Using PageRank to Characterize Web Structure”, 2002
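A small-scale sketch of the three-way mixture described above. Several details are assumptions of this example rather than part of the slide: the PageRank damping factor (0.85), the number of power iterations, the uniform redistribution of rank from the node without an out-link, and the use of 1+in-degree for the degree-based choice. The names `pagerank` and `grow` are hypothetical. PageRank is recomputed from scratch whenever it is needed, so this is only practical for a few hundred nodes.

```python
import random


def pagerank(out_link, damping=0.85, iters=30):
    """Power iteration where node v has at most one out-link, out_link[v]."""
    n = len(out_link)
    rank = [1.0 / n] * n
    for _ in range(iters):
        nxt = [(1.0 - damping) / n] * n
        dangling = 0.0
        for v, w in enumerate(out_link):
            if w is None:
                dangling += rank[v]            # node without an out-link
            else:
                nxt[w] += damping * rank[v]
        share = damping * dangling / n         # spread dangling mass uniformly
        rank = [r + share for r in nxt]
    return rank


def grow(T, d, p, seed=0):
    """Add T nodes; each links by degree (prob d), PageRank (p) or uniformly."""
    assert d > 0 and p > 0 and d + p < 1
    rng = random.Random(seed)
    out_link = [None]                          # time 0: a single node
    indeg = [0]
    for t in range(1, T + 1):
        r = rng.random()
        if r < d:
            weights = [1 + k for k in indeg]   # proportional to 1 + in-degree
        elif r < d + p:
            weights = pagerank(out_link)       # PageRank distribution at time t
        else:
            weights = [1] * t                  # uniform over existing nodes
        target = rng.choices(range(t), weights=weights)[0]
        indeg[target] += 1
        out_link.append(target)
        indeg.append(0)
    return indeg


if __name__ == "__main__":
    indeg = grow(300, d=0.4, p=0.3)            # illustrative parameters only
    print("max in-degree:", max(indeg))
```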
The Copy Model
The “Copy Model” assumes the following authoring model:
• Each page is on a topic of interest to its author.
– Some of its links will be copied from a previous page on the same topic that the author found useful
– Some links will be “original”, i.e. chosen independently by the author of the page
• The stochastic process creates nodes with an out-degree of d (parallel edges are allowed)
– Start at time 0 with a single node and d self-loops
– At step t, add a new node with d out-links as follows
• Choose an intermediate node v u.a.r. (uniformly at random) from the t existing nodes
• For j=1,…,d:
– With probability α, connect link j to a node chosen u.a.r. from the t existing nodes
– With probability 1-α, copy the j’th link of v
• The copy model results in Power-Law in-degree distributions
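The stochastic process of the copy model translates almost directly into code. The following is a sketch; the function name `copy_model` and the parameter values in the usage part are illustrative assumptions.

```python
import random


def copy_model(T, d, alpha, seed=0):
    """Run T steps of the copy model; return the in-degree of every node."""
    rng = random.Random(seed)
    out_links = [[0] * d]              # time 0: one node with d self-loops
    indeg = [d]
    for t in range(1, T + 1):
        proto = rng.randrange(t)       # intermediate node v, chosen u.a.r.
        links = []
        for j in range(d):
            if rng.random() < alpha:
                target = rng.randrange(t)          # "original" link
            else:
                target = out_links[proto][j]       # copy the j'th link of v
            links.append(target)
            indeg[target] += 1         # parallel edges are allowed
        out_links.append(links)
        indeg.append(0)
    return indeg


if __name__ == "__main__":
    indeg = copy_model(20000, d=3, alpha=0.3)      # illustrative values only
    x = 1
    while x <= max(indeg):
        print(x, sum(1 for k in indeg if k >= x))
        x *= 2
```

Copying a link of a random page reaches a target with probability proportional to its in-degree, which is why the copy step behaves like preferential attachment without any page needing to know global in-degrees.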
Evolutionary Models - Summary
• Overall, models exist that can simultaneously fit the observed Power-Law distributions of in-degrees, out-degrees and PageRank
– Many other properties of the graph are still unexplained by theoretical evolutionary models
• The accepted models mix-and-match the principles of preferential attachment (degrees/PageRank), copying, and random connectivity
• These models have the “rich get richer” property, and favor seniority (i.e. nodes from earlier rounds tend to have higher degrees)
– One can add some random “fitness” to nodes, with preferential attachment considering fitness as well, to give new nodes better chances of competing with existing nodes
• Note that there’s a difference between “rich get richer” and “winner takes all” – the Web’s graph doesn’t exhibit the dominance of a single winner
Related Research Area: The Science of Networks
• Power-law and scale-free networks
• “Small World” networks and the importance of weak ties
– Kleinberg’s small-world grid
• Social/collaboration networks
– Milgram’s “six degrees of separation”
– The six degrees of Kevin Bacon
– Erdös numbers
“My sister’s son’s wife told me, and my neighbor’s uncle got the deputy battalion commander post” (Mashina, Shlomi Bracha)
What is the size of the web?
• Issues
– The web is really infinite
• Dynamic content, e.g., calendar
• Soft 404: www.yahoo.com/<anything> is a valid page
– Static web contains syntactic duplication, mostly due to mirroring (~30%)
– Some servers are seldom connected
• Who cares?
– Media, and consequently the user
– Engine design
– Engine crawl policy. Impact on recall.
What can we attempt to measure?
(IQ is whatever the IQ tests measure.)
– The statically indexable web is whatever search engines index.
• Different engines have different preferences
– max url depth, max count/host, anti-spam rules, priority rules, etc.
• Different engines index different things under the same URL:
– frames, meta-keywords, document restrictions, document extensions, ...
Relative Size from Overlap
Given two engines A and B:
• Sample URLs randomly from A
• Check if contained in B and vice versa
Each test involves: (i) Sampling (ii) Checking
Example:
A ∩ B = (1/2) * Size A
A ∩ B = (1/6) * Size B
(1/2) * Size A = (1/6) * Size B
∴ Size A / Size B = (1/6)/(1/2) = 1/3
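The overlap arithmetic can be sketched as a small estimator. Here the two "indexes" are plain Python sets and the function names are hypothetical. With a real engine, both the sampling and the containment check are the hard parts and are not modeled here. The toy sets are sized so that half of A lies in B and one sixth of B lies in A, matching the worked example.

```python
import random


def relative_size(a_in_b, b_in_a):
    """Estimate Size A / Size B from two lists of containment checks (booleans)."""
    frac_a_in_b = sum(a_in_b) / len(a_in_b)    # estimates (A ∩ B) / Size A
    frac_b_in_a = sum(b_in_a) / len(b_in_a)    # estimates (A ∩ B) / Size B
    if frac_a_in_b == 0:
        raise ValueError("no sampled URL from A was found in B")
    return frac_b_in_a / frac_a_in_b


def estimate_from_indexes(index_a, index_b, k, seed=0):
    """Sample k URLs from each index and check containment in the other one."""
    rng = random.Random(seed)
    sample_a = rng.sample(sorted(index_a), k)
    sample_b = rng.sample(sorted(index_b), k)
    return relative_size([u in index_b for u in sample_a],
                         [u in index_a for u in sample_b])


if __name__ == "__main__":
    index_a = set(range(0, 300))               # toy index A, 300 "URLs"
    index_b = set(range(150, 1050))            # toy index B, 900 "URLs"
    print("estimated Size A / Size B:", estimate_from_indexes(index_a, index_b, 200))
    print("true ratio:", len(index_a) / len(index_b))
```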
Sampling URLs
• Ideal strategy: Generate a random URL and check for containment in each index.
• Problem: Random URLs are hard to find! It is enough to generate a random URL contained in a given engine.
• Approach 1: Generate a random URL contained in a given engine (suffices for relative sizes)
– Random queries
– Random searches
• Approach 2: Sample pages from the web itself, which could in theory give a true estimate of the size of the web (as opposed to just relative sizes of indexes)
– Random IP addresses
– Random walks
Random URLs from random queries
• Generate random query: how?
– Lexicon: 400,000+ words from a web crawl (not an English dictionary)
– Conjunctive queries: w1 AND w2, e.g., vocalists AND rsi
• Get 100 result URLs from engine A
• Choose a random URL as the candidate to check for presence in engine B
• This sampling procedure induces a probability weight W(p) for each page.
• Conjecture: W(SE_A) / W(SE_B) ≈ |SE_A| / |SE_B|
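The random-query sampling loop can be sketched against a toy in-memory index. Everything engine-related here is an assumption: `toy_search` stands in for a real engine's AND query, the index is a dict from URL to its set of words, and the lexicon is three made-up tokens rather than 400,000+ crawled words.

```python
import random


def random_conjunctive_query(lexicon, rng):
    """Pick two distinct lexicon words to form the query 'w1 AND w2'."""
    w1, w2 = rng.sample(lexicon, 2)
    return w1, w2


def toy_search(index, w1, w2, limit=100):
    """Hypothetical engine: return up to `limit` URLs containing both words."""
    hits = sorted(url for url, words in index.items()
                  if w1 in words and w2 in words)
    return hits[:limit]


def sample_url(index, lexicon, rng, max_tries=1000):
    """Issue random queries until one has results; return one result at random."""
    for _ in range(max_tries):
        w1, w2 = random_conjunctive_query(lexicon, rng)
        hits = toy_search(index, w1, w2)
        if hits:
            return rng.choice(hits)
    return None                        # no query produced any result


if __name__ == "__main__":
    index_a = {                        # toy data, not real pages
        "http://example.com/1": {"vocalists", "rsi", "jazz"},
        "http://example.com/2": {"vocalists", "jazz"},
        "http://example.com/3": {"rsi", "jazz"},
    }
    lexicon = ["vocalists", "rsi", "jazz"]
    print(sample_url(index_a, lexicon, random.Random(0)))
```

Pages containing many lexicon words match more queries and are therefore sampled more often, which is the non-uniform weight W(p) that the conjecture refers to.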
Random searches
• Choose random searches extracted from a local log [Lawrence & Giles 97] or build “random searches” [Notess]
– Use only queries with small result sets.
– Count normalized URLs in result sets.
– Use ratio statistics
Random IP addresses
• Generate random IP addresses
• Find a web server at the given address
– If there’s one
• Collect all pages from server
– From this, choose a page at random
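The steps above can be sketched without touching the network by passing the probing and crawling steps in as functions. `has_web_server` and `list_pages` are hypothetical callables supplied by the caller. Restricting the draw to globally routable IPv4 addresses is an assumption of this example. Only `ipaddress.IPv4Address` and its `is_global` attribute come from the Python standard library.

```python
import ipaddress
import random


def random_public_ipv4(rng):
    """Draw random 32-bit addresses until one is globally routable."""
    while True:
        addr = ipaddress.IPv4Address(rng.getrandbits(32))
        if addr.is_global:             # skips private, loopback, reserved ranges
            return addr


def sample_page(rng, has_web_server, list_pages, max_tries=10000):
    """Pick a random address with a web server, then one of its pages at random."""
    for _ in range(max_tries):
        addr = random_public_ipv4(rng)
        if has_web_server(addr):       # hypothetical probe, e.g. a port-80 check
            pages = list_pages(addr)   # hypothetical crawl of the whole server
            if pages:
                return rng.choice(pages)
    return None


if __name__ == "__main__":
    # Stubs standing in for real network access.
    def fake_probe(addr):
        return int(addr) % 1000 == 0

    def fake_crawl(addr):
        return [f"http://{addr}/page{i}" for i in range(3)]

    print(sample_page(random.Random(0), fake_probe, fake_crawl))
```

Choosing a server first and a page second means pages on small servers are sampled with higher probability than pages on large ones, so the result is not a uniform sample of pages.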
Random walks
• View the Web as a directed graph
• Build a random walk on this graph
– Includes various “jump” rules back to visited sites
• Does not get stuck in spider traps!
• Can follow all links!
– Converges to a stationary distribution
• Must assume graph is finite and independent of the walk.
• Conditions are not satisfied (cookie crumbs, flooding)
• Time to convergence not really known
– Sample from stationary distribution of walk
– Use the “strong query” method to check coverage by SE
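A toy version of such a walk, assuming a small fixed graph given as a dict of out-links. The jump probability of 0.15, the rule "jump to a uniformly chosen previously visited page", and the function name are assumptions made for this sketch. The slides only say that jump rules back to visited sites exist. The example graph contains a two-page spider trap to show that the jumps let the walk escape it.

```python
import random
from collections import Counter


def random_walk(graph, start, steps, jump_prob=0.15, seed=0):
    """Walk `steps` steps from `start`; return how often each page was visited."""
    rng = random.Random(seed)
    current = start
    visited = [start]                  # distinct pages seen so far
    seen = {start}
    counts = Counter()
    for _ in range(steps):
        out = graph.get(current, [])
        if not out or rng.random() < jump_prob:
            current = rng.choice(visited)      # jump back to a visited page
        else:
            current = rng.choice(out)          # follow a random out-link
        if current not in seen:
            seen.add(current)
            visited.append(current)
        counts[current] += 1
    return counts


if __name__ == "__main__":
    graph = {                          # toy graph; t1 <-> t2 is a spider trap
        "a": ["b", "t1"],
        "b": ["a"],
        "t1": ["t2"],
        "t2": ["t1"],
    }
    counts = random_walk(graph, "a", 10000)
    total = sum(counts.values())
    for page in sorted(counts):
        print(page, counts[page] / total)
```

The visit frequencies approximate the walk's stationary distribution on this fixed, finite graph. On the real Web the conditions listed above do not hold, so such frequencies are only a heuristic sample.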