Tutorial 8 (web graph models)


Part of the Search Engine course given at the Technion (2011)

Transcript of Tutorial 8 (web graph models)

Page 1

Evolutionary Models of the Web Graph

Kira Radinsky

Web size estimation models are based on the Stanford slides by Christopher Manning and Prabhakar Raghavan

Page 2


Stochastic Models for the Web’s Graph

So what can explain the observed Power-Law in/out-degree distributions of Web pages?

• Standard G(n, p) Erdős–Rényi random graphs:
  – A graph contains n nodes, and every two nodes are connected with probability p.
  – Degrees are distributed B(n − 1, p), and since on the Web np << n, they can be viewed as distributed Poisson(np − p).
  – Such distributions have light-weight, exponentially decreasing tails: nodes with very large in-degrees are practically impossible, yet they abound on the Web.

Erdős–Rényi random graphs do not model the Web graph.
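For intuition, here is a minimal Python sketch (parameter values are illustrative, not from the slides) showing how sharply the Binomial/Poisson in-degree tail of G(n, p) falls off:

```python
import numpy as np

# Minimal sketch: in-degree tail of G(n, p), with illustrative n, p.
rng = np.random.default_rng(0)
n, p = 100_000, 10 / 100_000          # mean degree np = 10

# Each node's in-degree is Binomial(n - 1, p) ~ Poisson(np - p).
degrees = rng.binomial(n - 1, p, size=n)

# The tail P(deg >= k) decays exponentially; a power law would
# decay only polynomially.
for k in (10, 20, 40, 80):
    print(f"P(deg >= {k:2d}) = {(degrees >= k).mean():.1e}")
# With mean degree 10, in-degrees beyond ~40 essentially never
# occur, whereas the real Web has pages with millions of in-links.
```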

Page 3


Evolutionary Models – First Attempt

• The Web wasn’t built in a day; in fact, it is constantly growing and evolving

• Models should (somewhat) reflect the authoring process of Web pages

• Observation: older, well-established nodes should be better connected as they’ve been around longer and are better known

• A corresponding model:
  – Start at time 0 with a single node.
  – At step t, add a new node with a single new edge that connects to one of the t pre-existing nodes, chosen uniformly at random.
  – The expected in-degree at time T of the node added at time t is ∑_{j=t+1..T} 1/j ≈ log T − log t.
  – This doesn't result in a power law: P(2x)/P(x) is not a constant (the sketch below simulates the logarithmic growth).
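A minimal simulation sketch of this uniform-attachment model (parameters are illustrative; the node added at step t is identified by the index t, with the initial node as 0):

```python
import math
import random

random.seed(0)
T, runs, probe = 2000, 200, 10   # track the node added at t = 10
total = 0
for _ in range(runs):
    indeg = 0
    for t in range(probe + 1, T + 1):
        # At step t there are t existing nodes (0 .. t-1); the new
        # node links to each of them with probability 1/t.
        if random.randrange(t) == probe:
            indeg += 1
    total += indeg
print("simulated in-degree:", total / runs)
print("log T - log t:      ", math.log(T / probe))
```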

Page 4


Preferential Attachment

• Observation: while older, well-established nodes are better known, it is not strictly because of their age but rather because they have more in-links

• The preferential attachment model:
  – Start at time 0 with a single node.
  – At step t, add a new node with a single new edge that connects to one of the t pre-existing nodes.
    • The probability of linking to node v: (1 + in-degree(v)) / (2t − 1)
• A variant involves a parameter α:
  – Start at time 0 with a single node.
  – At step t, add a new node with a single new edge that connects to node v with probability α/t + (1 − α) · in-degree(v)/(t − 1)
• Both variants indeed result in a Power-Law distribution of in-degrees (with different exponents); see the sketch below.
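A minimal Python sketch of the basic variant, using the standard "urn" trick: keeping one list entry per unit of (1 + in-degree) weight turns the weighted choice into a uniform one (T is illustrative):

```python
import random
from collections import Counter

random.seed(0)
T = 100_000
urn = [0]            # node 0 enters with weight 1 + in-degree = 1
indeg = Counter()
for t in range(1, T + 1):
    v = random.choice(urn)    # P(v) = (1 + indeg(v)) / (2t - 1)
    indeg[v] += 1
    urn.append(v)             # v's weight grows by one
    urn.append(t)             # new node t enters with weight 1

# A roughly straight line on a log-log plot of this histogram
# is the power-law signature.
hist = Counter(indeg.values())
for k in (1, 2, 4, 8, 16, 32):
    print(k, hist.get(k, 0))
```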

Page 5


Preferential Attachment (cont.)

• Another observation: if search engine rankings are influenced by PageRank, then new pages will link to high-PageRank pages more than to low-PageRank pages

• The model uses two positive parameters d, p such that d+p<1

• The evolution:
  – Start at time 0 with a single node.
  – At step t, add a new node with a single new edge as follows:

• With probability d, connect the edge to one of the existing nodes in proportion to the in-degree (or 1+in-degree) of that node

• With probability p, connect the edge to a node chosen at random according to the PageRank distribution at time t

• With probability 1-p-d, connect the edge to an existing node chosen uniformly at random

• With properly chosen parameters, this model can fit both the in-degree and PageRank Power-Law distributions

Raghavan et al., “Using PageRank to characterize Web Structure”, 2002
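A minimal sketch of this evolution, with illustrative values of d and p and a short power-iteration PageRank helper (an assumption for illustration; the cited paper's exact setup may differ):

```python
import random

random.seed(0)
d, p, T = 0.5, 0.3, 500          # illustrative; requires d + p < 1
links = {0: []}                  # node -> out-links
indeg = {0: 0}

def pagerank(links, damping=0.85, iters=20):
    # Plain power iteration; dangling mass is spread uniformly.
    n = len(links)
    pr = {v: 1.0 / n for v in links}
    for _ in range(iters):
        nxt = {v: (1 - damping) / n for v in links}
        for v, outs in links.items():
            if outs:
                share = damping * pr[v] / len(outs)
                for w in outs:
                    nxt[w] += share
            else:
                for w in nxt:
                    nxt[w] += damping * pr[v] / n
        pr = nxt
    return pr

for t in range(1, T + 1):
    nodes = list(links)
    r = random.random()
    if r < d:                    # proportional to 1 + in-degree
        target = random.choices(nodes, [1 + indeg[v] for v in nodes])[0]
    elif r < d + p:              # proportional to current PageRank
        pr = pagerank(links)
        target = random.choices(nodes, [pr[v] for v in nodes])[0]
    else:                        # uniformly at random
        target = random.choice(nodes)
    links[t] = [target]
    indeg[t] = 0
    indeg[target] += 1
```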

Page 6


The Copy Model

The "Copy Model" assumes the following authoring model:
• Each page is on a topic of interest to its author.

– Some of its links will be copied from a previous page on the same topic that the author found useful

– Some links will be “original”, i.e. chosen independently by the author of the page

• The stochastic process creates nodes with an out-degree of d (parallel edges are allowed)

– Start at time 0 with a single node and d self-loops.
– At step t, add a new node with d out-links as follows:
  • Choose an intermediate node v, u.a.r. from the t existing nodes.
  • For j = 1, …, d:
    – With probability α, connect link j to a node chosen u.a.r. from the t existing nodes.
    – With probability 1 − α, copy the j-th link of v.

• The copy model results in Power-Law in-degree distributions
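A minimal sketch of the copy process, with illustrative values of d and α:

```python
import random
from collections import Counter

random.seed(0)
d, alpha, T = 7, 0.5, 50_000     # illustrative parameters
out = {0: [0] * d}               # node 0 starts with d self-loops
for t in range(1, T + 1):
    v = random.randrange(t)      # intermediate node, u.a.r.
    new_links = []
    for j in range(d):
        if random.random() < alpha:
            new_links.append(random.randrange(t))  # "original" link
        else:
            new_links.append(out[v][j])            # copy v's j-th link
    out[t] = new_links

# Copying re-copies already-popular targets, which skews in-degrees:
indeg = Counter(w for ls in out.values() for w in ls)
print(indeg.most_common(5))
```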

Page 7


Evolutionary Models - Summary

• Overall, models exist that can simultaneously fit the observed Power-Law distributions of in-degrees, out-degrees and PageRank

– Many other properties of the graph are still unexplained by theoretical evolutionary models

• The accepted models mix and match the principles of preferential attachment (degrees/PageRank), copying, and random connectivity

• These models have the “rich get richer” property, and favor seniority (i.e. nodes from earlier rounds tend to have higher degrees)

– One can add some random “fitness” to nodes, with preferential attachment considering fitness as well, to give new nodes better chances of competing with existing nodes

• Note that there’s a difference between “rich get richer” and “winner takes all” – the Web’s graph doesn’t exhibit the dominance of a single winner

Page 8


Related Research Area: The Science of Networks

• Power-law and scale-free networks

• “Small World” networks and the importance of weak ties

– Kleinberg’s small-world grid

• Social/collaboration networks
  – Milgram's "six degrees of separation"

– The six degrees of Kevin Bacon

– Erdős numbers

(Hebrew song lyrics quoted on the slide, by Mashina / Shlomi Bracha, as a pop-culture example of short acquaintance chains: "My neighbor's uncle got the samgad [deputy battalion commander post]… so said my sister's son's wife")

Page 9

What is the size of the web?

• Issues
  – The web is really infinite
    • Dynamic content, e.g., calendar
    • Soft 404: www.yahoo.com/<anything> is a valid page

– Static web contains syntactic duplication, mostly due to mirroring (~30%)

– Some servers are seldom connected

• Who cares?
  – Media, and consequently the user

– Engine design

– Engine crawl policy; impact on recall.

Page 10

What can we attempt to measure?

(IQ is whatever the IQ tests measure.)

– The statically indexable web is whatever search engines index.

• Different engines have different preferences

– max url depth, max count/host, anti-spam rules, priority rules, etc.

• Different engines index different things under the same URL:

– frames, meta-keywords, document restrictions, document extensions, ...

Page 11

Relative Size from Overlap
Given two engines A and B:
• Sample URLs randomly from A; check if they are contained in B, and vice versa.
• Each test involves: (i) sampling, (ii) checking.
• Suppose the overlap measures A ∩ B = (1/2) · Size A and A ∩ B = (1/6) · Size B.
• Then (1/2) · Size A = (1/6) · Size B, so Size A / Size B = (1/6) / (1/2) = 1/3 (see the sketch below).
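A minimal sketch of this estimator, assuming we already have uniform URL samples from each engine and a hypothetical containment oracle `contains(engine, url)` for the checking step:

```python
def relative_size(sample_a, sample_b, contains):
    # Fraction of A's sample found in B, and vice versa.
    a_in_b = sum(contains("B", u) for u in sample_a) / len(sample_a)
    b_in_a = sum(contains("A", u) for u in sample_b) / len(sample_b)
    # a_in_b * |A| = |A ∩ B| = b_in_a * |B|, so:
    return b_in_a / a_in_b       # estimate of |A| / |B|

# With the slide's numbers (a_in_b = 1/2, b_in_a = 1/6),
# relative_size returns (1/6) / (1/2) = 1/3.
```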

Page 12

Sampling URLs

• Ideal strategy: Generate a random URL and check for containment in each index.
• Problem: Random URLs are hard to find! It is enough to generate a random URL contained in a given engine.

• Approach 1: Generate a random URL contained in a given engine

– Random queries

– Random searches

• Approach 2: Methods that give a true estimate of the size of the web (as opposed to just relative sizes of indexes)

– Random IP addresses

– Random walks

Page 13

Random URLs from random queries

• Generate a random query: how?
  – Lexicon: 400,000+ words from a web crawl (not an English dictionary)
  – Conjunctive queries: w1 AND w2, e.g., vocalists AND rsi

• Get 100 result URLs from engine A

• Choose a random URL as the candidate to check for presence in engine B

• This distribution induces a probability weight W(p) for each page p.

• Conjecture: W(SE_A) / W(SE_B) ≈ |SE_A| / |SE_B|

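A minimal sketch of this procedure, where `search` stands for a hypothetical function returning an engine's result URLs for a query; a real experiment must also handle URL normalization, query quotas, and ranking bias:

```python
import random

def sample_url_from_engine(search, lexicon, k=100):
    # Draw conjunctive queries until one returns results, then
    # pick one of the top-k result URLs at random.
    while True:
        w1, w2 = random.sample(lexicon, 2)
        results = search(f"{w1} AND {w2}")[:k]
        if results:
            return random.choice(results)

def overlap_fraction(search_a, search_b, lexicon, trials=1000):
    # Fraction of A-sampled URLs that engine B also contains
    # (containment checked here by querying B for the URL itself,
    # a simplification of the real containment test).
    hits = 0
    for _ in range(trials):
        url = sample_url_from_engine(search_a, lexicon)
        hits += url in search_b(url)
    return hits / trials
```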

Page 14

Random searches

• Choose random searches extracted from a local log [Lawrence & Giles 97] or build “random searches” [Notess]

– Use only queries with small result sets.

– Count normalized URLs in result sets.

– Use ratio statistics
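A minimal sketch of the ratio statistic, where `results_a` and `results_b` are hypothetical mappings from each logged query to the two engines' result URLs:

```python
def normalize(url):
    # Simplified normalization; real studies canonicalize far more.
    return url.lower().rstrip("/")

def size_ratio(queries, results_a, results_b):
    # Per-query ratio of normalized result counts, averaged over
    # queries with small (hence fully retrievable) result sets.
    ratios = []
    for q in queries:
        a = {normalize(u) for u in results_a[q]}
        b = {normalize(u) for u in results_b[q]}
        if b:
            ratios.append(len(a) / len(b))
    return sum(ratios) / len(ratios)    # estimate of |A| / |B|
```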

Page 15

Random IP addresses

• Generate random IP addresses

• Find a web server at the given address

– If there’s one

• Collect all pages from server

– From this, choose a page at random
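A minimal sketch of the probing step (a real study must skip reserved address ranges, deal with virtual hosting, and respect scanning ethics):

```python
import random
import socket

def random_ip():
    return ".".join(str(random.randrange(256)) for _ in range(4))

def has_web_server(ip, timeout=1.0):
    # Crude liveness test: does port 80 accept a TCP connection?
    try:
        with socket.create_connection((ip, 80), timeout=timeout):
            return True
    except OSError:
        return False

hits = sum(has_web_server(random_ip()) for _ in range(100))
print(f"{hits}/100 random addresses answered on port 80")
```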

Page 16

Random walks

• View the Web as a directed graph

• Build a random walk on this graph
  – Includes various "jump" rules back to visited sites

• Does not get stuck in spider traps!

• Can follow all links!

– Converges to a stationary distribution
  • Must assume the graph is finite and independent of the walk.

• Conditions are not satisfied (cookie crumbs, flooding)

• Time to convergence not really known

– Sample from stationary distribution of walk

– Use the “strong query” method to check coverage by SE
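A minimal sketch of such a walk, run here over an already-crawled graph standing in for the live Web; `graph` is a hypothetical mapping from each page to its out-links:

```python
import random

def walk_sample(graph, steps=100_000, jump=0.15):
    # Random walk with jumps: avoids spider traps and, on a finite
    # graph, converges toward a stationary distribution.
    visits = {v: 0 for v in graph}
    v = random.choice(list(graph))
    for _ in range(steps):
        outs = graph[v]
        if random.random() < jump or not outs:
            v = random.choice(list(graph))   # "jump" rule
        else:
            v = random.choice(outs)          # follow an out-link
        visits[v] += 1
    # Sample a page from the empirical visit frequencies, which
    # approximate the stationary distribution after convergence.
    pages, counts = zip(*visits.items())
    return random.choices(pages, counts)[0]
```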