Internet 信息检索中的数学

Internet 信息检索中的数学

Zhi-Ming Ma April 24, 2009, 厦门

Email: [email protected] http://www.amt.ac.cn/member/mazhiming/index.html

How can google make a ranking of 2,040,000 pages

in 0.11 seconds?

A main task of Internet (Web)

Information Retrieval = Design and Analysis of

Search Engine (SE) Algorithm

involving plenty of Mathematics

Internetwork is a large scale complex random network

The Earth is developing an electronic nervous system, a network with diverse nodes and links are

搜索引擎的流程

Link Analysis

缓存

网页剖析器

倒排表

Page & Site数据库

网络图

Web

网页爬取器r

用户界面

缓存页面Links &AnchorsPages

索引编辑器

Link Map

Page Ranks

网络图生成器

查询

Indexing and Ranking

在线部分

离线部分

Static Rank ( 静态排序）• Importance ranking

– Goal: compute page importance, page authority– Method: Link analysis

• A method is based on the topology of the graph of whole Web pages.

• Web graph: page node, hyperlink edge.

– Algorithms:• HITS(Kleinberg) [5]

• PageRank(GOOGLE) [6]

Dynamic Rank （动态排序）• Relevance ranking （相关性排序）

– Goal: compute the content match relevant score between pages and query.

– Method: Statistic machine learning– Algorithms:

• Point-wise: BM25[7]

• Pair-wise: RankBoost[3], RankSVM[4], RankNet[1],…

• List-wise: ListNet[10]

Research on Complex Networks and Information Retrieval

• In recent years we have been involved in the research direction of Random Complex Networks and Information Retrieval. I shall briefly review some of our recent results (in collaboration with Microsoft Research Asia) in this direction.

Outlines

• Markov chain methods in search engines

• Point process describing Browsing behavior

• Two layer statistical learning

• Stochstic complement method in ranking Web sites

• Final remarks

Browse Rank vs.

Page Rank ?

HITS

PageRank

1998 Jon Kleinberg Cornell University

1998 Sergey Brin and Larry Page

Stanford University

Nevanlinna Prize （ 2006)Jon Kleinberg

• One of Kleinberg‘s most important research achievements focuses on the internetwork structure of the World Wide Web.

• Prior to 　 Kleinberg‘s work, search engines focused only on the content of web pages ， not on the link structure.

• Kleinberg introduced the idea of

• “authorities” and “hubs”: • An authority is a web page that contains 　 information on a

particular topic, • and a hub is a page that contains links to 　 many authorities.

Zhuzihu thesis.pdf

Page Rank, the ranking system used by the Google search engine.

• Query independent

• content independent.

• using only the web graph structure

Markov chain describing surfing behavior

Web surfers usually have two basic ways to access web pages:

1. with probability α, they visit a web page by clicking a hyperlink.

2. with probability 1-α, they visit a web page by inputting its URL address.

More generally we may consider personalized d.:

PageRank is the unique positive eigenvector:

By the strong ergodic theorem:

Problem:

PageRank as a Function of the Damping FactorPaolo Boldi Massimo Santini Sebastiano Vigna

DSI, Università degli Studi di Milano

WWW 2005 paper

3.1 Choosing the damping factor3 General Behaviour

3.2 Getting close to 1

　can we somehow characterise the properties of ? what makes 　　different from the other (infinitely many, if P is reducible) limit distributions of P?

)(lim1

*

**

is the limit distribution of P when the starting

distribution is uniform, that is,

Conjecture 1 :

*

.lim)(lim1

n

nP

N

1

Research results by our group:• Limit of PageRank• Comparison of Different

Irreducible Markov Chains• N-step PageRank• ……

Weak points of PageRank

• Using only static web graph structure• Reflecting only the will of web managers, but ignore the will of users e.g. the staying

time of users on a web.• Can not effectively against spam and junk

pages.

BrowseRankSIGIR.ppt

Letting Web Users Vote for Page Importance

• When calculating the page importance,– Use the users’ real browsing behavior

• Make no artificial assumption on the users’ behavior

– Use the users’ complete browsing behavior• Contain the time information

23/4/11 Yuting Liu@SIGIR'08 36

Browsing Process

• Markov property

• Time-homogeneity

BrowseRank: User browsing graph


Vertex: Web page

Edge: Transition

Edge weight wij:The number of

transitions

Staying time Ti:The time spend on

page i

Reset probability :Normalized frequencies as first page of session

Mathematical Deduction

Maximum likelihood estimation:

of staying time


where

Therefore


Additional Noise:

• the speed of the Internet connection,

• the length of the page,

• the layout of the page,

• user does some other things (e.g.,answers a phone call)

• Other factors


Assume

Noise: Chi-square distribution with degree k


ideally we would have:

However, due to data sparseness, we encounter challenges……


To tackle this challenge, we turn it into optimization problems:

Mathematical Deduction– Stationary distribution:

•

– is the mean of the staying time on page i.

The more important a page is, the longer staying time on it is.

– is the mean of the first re-visit time at page i. The more important a page is, the smaller the

re-visit time is, and the larger the visit frequency is.

( )P t


• Properties of Q process: – Jumping probability is conditionally independent

from jumping time: •

– Embedded Markov chain:• is a Markov chain with the transition probability

matrix


– is the stationary distribution of – The stationary distribution of discrete model is

easy to compute• Power method for

• Log data for

Experiments• Data set:

– 5.6 million vertices– 53 million edges

• Baselines: PageRank, TrustRank

• Aim to:– Find good websites– Fight spam websites


Website-level: Find good


Website-level: Fight spam


BrowseRank: Letting Web Users Vote for Page Importance

Yuting Liu, Bin Gao, Tie-Yan Liu, Ying Zhang,

Zhiming Ma, Shuyuan He, and Hang Li

July 23, 2008, Singapore the 31st Annual International ACM SIGIR

Conference on Research & Development on Information Retrieval.

Best student paper !

BrowseRank: Letting Web Users Vote for Page Importance

Google search ,

110,000,000 results for

Browse Rank

Further Studies

• Browsing Processes will be a

Basic Mathematical Tool

in Internet Information Retrieval

• How about inhomogenous process?

• Marked point process– Hyperlink is not reliable.– Users’ real behavior should be considered.

Dynamic Rank （动态排序）• Relevance ranking （相关性排序）

– Goal: compute the content match relevant score between pages and query.

– Method: Statistic machine learning– Algorithms:

• Point-wise: BM25[7]

• Pair-wise: RankBoost[3], RankSVM[4], RankNet[1],…

• List-wise: ListNet[10]

Outlines





• Final remarks

Learning to Rank

ModelModel)1(

)1(2

)1(1

)1(

)1(nd

d

d

q

Learning

SystemLearning System

Ranking System

Ranking System

)1( mq

)(

)(2

)(1

)(

)(m

n

m

m

m

md

d

d

q

)1(

)1(2

)1(1

)1(

m

n

m

m

md

d

d

min Loss

63Wei-Ying Ma, Microsoft Research Asia

learning to rank in IR is a two layer statistical learning

• Distributions of the relevance judgments of documents may vary from query to query

• The numbers of documents for different queries may differ largely

• Evaluation in IR is usually conducted at

query level

Document level vs Query level

– Two queries in total.

– Same errors in terms of pairwise classification. 780/790=98.73%

– Different errors in terms of query-level evaluation. 99% vs. 50%.

• Query-Level Stability and Generalization in Learning to Rank,

to appear in Proceedings of the 25th International Conference on Machine Learning 2008,

Yanyan Lan, Tie-Yan Liu, Tao Qin, Zhiming Ma, Hang Li

• Two Layer Statistical Learning and Applications in Information Retrieval,

in preparation, Yanyan Lan, Hang Li, Tie-Yan Liu, Zhi-Ming Ma, Tao Qin

Microsoft Scholar Fellowship

• We propose a new framework of statistical learning model, in which the

training dada are composed in two layers.

the two layer structure of training data is not artificial, but arises from the real world

Especially from learning to rank in Information Retrieval

Two-Layer Statistical Learning Framework

• First layer: objects

• Second layer: associated samples

: instances

: descriptions of instances

Instances are the objectives which we are concern

In learning to rank for IR, an object is a query, an instance and corresponding description can be

interpreted as

• a single document (pointwise algorithm)

• a pair of document (pairwise algorithm)

• a set of documents (listwise algorithm)

a score (or label) of a document

an order on a pair of documents

a permutation (list) of documents

Training Process

i.i.d.

For each i, the associated samples

, distribution

the training data is denoted as

• The algorithm of a training process is to learn from the training data a function

that will be used to predict the features of instances.

loss function on

expected object level loss

empirical object level loss

expected risk

empirical risk

• The challenge is that when dealing with two layer training data, most of the existing results of statistical learning can not be directly applied. Thus we have to develop new results or modify the existing results to suit the new model. In this aspect much research should be conducted.

Generalization Analysis based on Stability Theory

• Devroye, L. and Wagner, T.(1979). Stability•stability-bounds depend on properties of the algorithm itself •rather than the property of the function class,

• Bousquet, O. and Elisseeff, A.(2002).. uniform leave-one-out stability

• Motivated by the above work, we invent:Object –level uniform leave-one-out stabilityIn short, Object –level stability

Definition: We say a algorithm possesses:

Object –level uniform leave-one-out stabilityAbbreviated as Object –level stability, if:

Function learned from training data

Function learned from training data

Generalization based on Object-level Stability

Object-level stability

The number of training objects

With probability at least

Note: if , then the bound makes sense. This condition can be satisfied in many practical cases.

As case studies, we investigate Ranking SVM and RankBoost. We show that after introducing query-level normalization to its objective function, Ranking SVM will have query-level stability.

For RankBoost, the query-level stability can be achieved if we introduce both query-level normalization and regularization to its objective function.

These analyses agree largely with our experiments and the experiments in Y. Cao, J. Xu, T.-Y. Liu, H. Li, Y. Huang, and H.-W. Hon,

2006[5] and [11].

• IRSVM : modified Ranking SVM in Y. Cao, J. Xu, T.-Y. Liu, H. Li, Y. Huang, and H.-W. Hon, 2006

• query-level normalization

Query-level Empirical Risk

Generalization Bound:

Generalization Bounds Comparison

• Ranking SVM

)1(O



Modified RSVM

)( 1rO

RankBoost with Query-level Normalization and Regularization

introducing query-level normalization to RankBoost does not lead to good performance [11].

query-level normalization cannot make RankBoost have query-level stability. adding both query-level normalization and regularization to the objective function, query-level stability can be achieved . Thus the framework offers us a way of modifying RankBoost for good ranking performance

Experimental Results (I)

• Query-Level Stability

1200 queries from a search

engine’s data repository. 200 queries for training, 500

queries for validation and 500

queries for test. Five relevance labels and we

treat the first three label as

“relevant” and the other ones

as “irrelevant” to construct pairs.

• Query-level Generalization Bound

Experimental Results (II)

Future Problems and Challenges

It is worth to see whether new learning to rank algorithms can be derived under the guide of our theoretical studies.

We have investigated the generalization analysis based on the novel two layer statistical learning. We will continue to conduct other theoretical analysis. We have proposed “object-level stability”, we will try to investigate other tools.

Two layer statistical learning in other fields

Outlines





• Final remarks

• We have briefly reviewed part of our recent joint work (in collaboration with Microsoft Research Asia) concerning Internet Information Retrieval.

• Mathematics is becoming more and more important in the area of Internet information retrieval.

• Internet information retrieval has been a rich source providing plenty of interesting and challenging problems in Mathematics.

Thank you !

Internet 信息检索中的数学

Technology

Transcript of Internet 信息检索中的数学