Internet 信息检索中的数学
Click here to load reader
-
Upload
xu-jiakon -
Category
Technology
-
view
1.915 -
download
4
description
Transcript of Internet 信息检索中的数学
Internet 信息检索中的数学
Zhi-Ming Ma April 24, 2009, 厦门
Email: [email protected] http://www.amt.ac.cn/member/mazhiming/index.html
How can google make a ranking of 2,040,000 pages
in 0.11 seconds?
A main task of Internet (Web)
Information Retrieval = Design and Analysis of
Search Engine (SE) Algorithm
involving plenty of Mathematics
Internetwork is a large scale complex random network
The Earth is developing an electronic nervous system, a network with diverse nodes and links are
搜索引擎的流程
Link Analysis
缓存
网页剖析器
倒排表
Page & Site数据库
网络图
Web
网页爬取器r
用户界面
缓存页面Links &AnchorsPages
索引编辑器
Link Map
Page Ranks
网络图生成器
查询
Indexing and Ranking
在线部分
离线部分
Static Rank ( 静态排序)• Importance ranking
– Goal: compute page importance, page authority– Method: Link analysis
• A method is based on the topology of the graph of whole Web pages.
• Web graph: page node, hyperlink edge.
– Algorithms:• HITS(Kleinberg) [5]
• PageRank(GOOGLE) [6]
Dynamic Rank (动态排序)• Relevance ranking (相关性排序)
– Goal: compute the content match relevant score between pages and query.
– Method: Statistic machine learning– Algorithms:
• Point-wise: BM25[7]
• Pair-wise: RankBoost[3], RankSVM[4], RankNet[1],…
• List-wise: ListNet[10]
Research on Complex Networks and Information Retrieval
• In recent years we have been involved in the research direction of Random Complex Networks and Information Retrieval. I shall briefly review some of our recent results (in collaboration with Microsoft Research Asia) in this direction.
Outlines
• Markov chain methods in search engines
• Point process describing Browsing behavior
• Two layer statistical learning
• Stochstic complement method in ranking Web sites
• Final remarks
Browse Rank vs.
Page Rank ?
HITS
PageRank
1998 Jon Kleinberg Cornell University
1998 Sergey Brin and Larry Page
Stanford University
Nevanlinna Prize ( 2006)Jon Kleinberg
• One of Kleinberg‘s most important research achievements focuses on the internetwork structure of the World Wide Web.
• Prior to Kleinberg‘s work, search engines focused only on the content of web pages , not on the link structure.
• Kleinberg introduced the idea of
• “authorities” and “hubs”: • An authority is a web page that contains information on a
particular topic, • and a hub is a page that contains links to many authorities.
Zhuzihu thesis.pdf
Page Rank, the ranking system used by the Google search engine.
• Query independent
• content independent.
• using only the web graph structure
Markov chain describing surfing behavior
Markov chain describing surfing behavior
Web surfers usually have two basic ways to access web pages:
1. with probability α, they visit a web page by clicking a hyperlink.
2. with probability 1-α, they visit a web page by inputting its URL address.
where
More generally we may consider personalized d.:
PageRank is the unique positive eigenvector:
By the strong ergodic theorem:
Problem:
PageRank as a Function of the Damping FactorPaolo Boldi Massimo Santini Sebastiano Vigna
DSI, Università degli Studi di Milano
WWW 2005 paper
3.1 Choosing the damping factor3 General Behaviour
3.2 Getting close to 1
can we somehow characterise the properties of ? what makes different from the other (infinitely many, if P is reducible) limit distributions of P?
)(lim1
*
**
is the limit distribution of P when the starting
distribution is uniform, that is,
Conjecture 1 :
*
.lim)(lim1
n
nP
N
1
Research results by our group:• Limit of PageRank• Comparison of Different
Irreducible Markov Chains• N-step PageRank• ……
Weak points of PageRank
• Using only static web graph structure• Reflecting only the will of web managers, but ignore the will of users e.g. the staying
time of users on a web.• Can not effectively against spam and junk
pages.
BrowseRankSIGIR.ppt
Letting Web Users Vote for Page Importance
• When calculating the page importance,– Use the users’ real browsing behavior
• Make no artificial assumption on the users’ behavior
– Use the users’ complete browsing behavior• Contain the time information
23/4/11 Yuting Liu@SIGIR'08 36
Browsing Process
• Markov property
• Time-homogeneity
BrowseRank: User browsing graph
23/4/11 Yuting Liu@SIGIR'08 42
Vertex: Web page
Edge: Transition
Edge weight wij:The number of
transitions
Staying time Ti:The time spend on
page i
Reset probability :Normalized frequencies as first page of session
Mathematical Deduction
Maximum likelihood estimation:
of staying time
Mathematical Deduction
where
Therefore
Mathematical Deduction
Additional Noise:
• the speed of the Internet connection,
• the length of the page,
• the layout of the page,
• user does some other things (e.g.,answers a phone call)
• Other factors
Mathematical Deduction
Assume
Noise: Chi-square distribution with degree k
Mathematical Deduction
ideally we would have:
However, due to data sparseness, we encounter challenges……
Mathematical Deduction
To tackle this challenge, we turn it into optimization problems:
Mathematical Deduction– Stationary distribution:
•
– is the mean of the staying time on page i.
The more important a page is, the longer staying time on it is.
– is the mean of the first re-visit time at page i. The more important a page is, the smaller the
re-visit time is, and the larger the visit frequency is.
( )P t
Mathematical Deduction
• Properties of Q process: – Jumping probability is conditionally independent
from jumping time: •
– Embedded Markov chain:• is a Markov chain with the transition probability
matrix
Mathematical Deduction
– is the stationary distribution of – The stationary distribution of discrete model is
easy to compute• Power method for
• Log data for
Experiments• Data set:
– 5.6 million vertices– 53 million edges
• Baselines: PageRank, TrustRank
• Aim to:– Find good websites– Fight spam websites
23/4/11 Yuting Liu@SIGIR'08 54
Website-level: Find good
23/4/11 Yuting Liu@SIGIR'08 55
Website-level: Fight spam
23/4/11 Yuting Liu@SIGIR'08 56
BrowseRank: Letting Web Users Vote for Page Importance
Yuting Liu, Bin Gao, Tie-Yan Liu, Ying Zhang,
Zhiming Ma, Shuyuan He, and Hang Li
July 23, 2008, Singapore the 31st Annual International ACM SIGIR
Conference on Research & Development on Information Retrieval.
Best student paper !
BrowseRank: Letting Web Users Vote for Page Importance
Google search ,
110,000,000 results for
Browse Rank
Further Studies
• Browsing Processes will be a
Basic Mathematical Tool
in Internet Information Retrieval
• How about inhomogenous process?
• Marked point process– Hyperlink is not reliable.– Users’ real behavior should be considered.
Dynamic Rank (动态排序)• Relevance ranking (相关性排序)
– Goal: compute the content match relevant score between pages and query.
– Method: Statistic machine learning– Algorithms:
• Point-wise: BM25[7]
• Pair-wise: RankBoost[3], RankSVM[4], RankNet[1],…
• List-wise: ListNet[10]
Outlines
• Markov chain methods in search engines
• Point process describing Browsing behavior
• Two layer statistical learning
• Stochstic complement method in ranking Web sites
• Final remarks
Learning to Rank
ModelModel)1(
)1(2
)1(1
)1(
)1(nd
d
d
q
Learning
SystemLearning System
Ranking System
Ranking System
)1( mq
)(
)(2
)(1
)(
)(m
n
m
m
m
md
d
d
q
)1(
)1(2
)1(1
)1(
m
n
m
m
md
d
d
min Loss
63Wei-Ying Ma, Microsoft Research Asia
learning to rank in IR is a two layer statistical learning
• Distributions of the relevance judgments of documents may vary from query to query
• The numbers of documents for different queries may differ largely
• Evaluation in IR is usually conducted at
query level
Document level vs Query level
– Two queries in total.
– Same errors in terms of pairwise classification. 780/790=98.73%
– Different errors in terms of query-level evaluation. 99% vs. 50%.
• Query-Level Stability and Generalization in Learning to Rank,
to appear in Proceedings of the 25th International Conference on Machine Learning 2008,
Yanyan Lan, Tie-Yan Liu, Tao Qin, Zhiming Ma, Hang Li
• Two Layer Statistical Learning and Applications in Information Retrieval,
in preparation, Yanyan Lan, Hang Li, Tie-Yan Liu, Zhi-Ming Ma, Tao Qin
Microsoft Scholar Fellowship
• We propose a new framework of statistical learning model, in which the
training dada are composed in two layers.
the two layer structure of training data is not artificial, but arises from the real world
Especially from learning to rank in Information Retrieval
Two-Layer Statistical Learning Framework
• First layer: objects
• Second layer: associated samples
: instances
: descriptions of instances
Instances are the objectives which we are concern
In learning to rank for IR, an object is a query, an instance and corresponding description can be
interpreted as
• a single document (pointwise algorithm)
• a pair of document (pairwise algorithm)
• a set of documents (listwise algorithm)
a score (or label) of a document
an order on a pair of documents
a permutation (list) of documents
Training Process
i.i.d.
For each i, the associated samples
, distribution
the training data is denoted as
• The algorithm of a training process is to learn from the training data a function
that will be used to predict the features of instances.
loss function on
expected object level loss
empirical object level loss
expected risk
empirical risk
• The challenge is that when dealing with two layer training data, most of the existing results of statistical learning can not be directly applied. Thus we have to develop new results or modify the existing results to suit the new model. In this aspect much research should be conducted.
Generalization Analysis based on Stability Theory
• Devroye, L. and Wagner, T.(1979). Stability•stability-bounds depend on properties of the algorithm itself •rather than the property of the function class,
• Bousquet, O. and Elisseeff, A.(2002).. uniform leave-one-out stability
• Motivated by the above work, we invent:Object –level uniform leave-one-out stabilityIn short, Object –level stability
Definition: We say a algorithm possesses:
Object –level uniform leave-one-out stabilityAbbreviated as Object –level stability, if:
Function learned from training data
Function learned from training data
Generalization based on Object-level Stability
Object-level stability
The number of training objects
With probability at least
Note: if , then the bound makes sense. This condition can be satisfied in many practical cases.
As case studies, we investigate Ranking SVM and RankBoost. We show that after introducing query-level normalization to its objective function, Ranking SVM will have query-level stability.
For RankBoost, the query-level stability can be achieved if we introduce both query-level normalization and regularization to its objective function.
These analyses agree largely with our experiments and the experiments in Y. Cao, J. Xu, T.-Y. Liu, H. Li, Y. Huang, and H.-W. Hon,
2006[5] and [11].
• IRSVM : modified Ranking SVM in Y. Cao, J. Xu, T.-Y. Liu, H. Li, Y. Huang, and H.-W. Hon, 2006
• query-level normalization
Query-level Empirical Risk
Generalization Bound:
Generalization Bounds Comparison
• Ranking SVM
)1(O
Generalization Bound:
Generalization Bound:
Modified RSVM
)( 1rO
RankBoost with Query-level Normalization and Regularization
introducing query-level normalization to RankBoost does not lead to good performance [11].
query-level normalization cannot make RankBoost have query-level stability. adding both query-level normalization and regularization to the objective function, query-level stability can be achieved . Thus the framework offers us a way of modifying RankBoost for good ranking performance
Experimental Results (I)
• Query-Level Stability
1200 queries from a search
engine’s data repository. 200 queries for training, 500
queries for validation and 500
queries for test. Five relevance labels and we
treat the first three label as
“relevant” and the other ones
as “irrelevant” to construct pairs.
• Query-level Generalization Bound
Experimental Results (II)
Future Problems and Challenges
It is worth to see whether new learning to rank algorithms can be derived under the guide of our theoretical studies.
We have investigated the generalization analysis based on the novel two layer statistical learning. We will continue to conduct other theoretical analysis. We have proposed “object-level stability”, we will try to investigate other tools.
Two layer statistical learning in other fields
Outlines
• Markov chain methods in search engines
• Point process describing Browsing behavior
• Two layer statistical learning
• Stochstic complement method in ranking Web sites
• Final remarks
Outlines
• Markov chain methods in search engines
• Point process describing Browsing behavior
• Two layer statistical learning
• Stochstic complement method in ranking Web sites
• Final remarks
• We have briefly reviewed part of our recent joint work (in collaboration with Microsoft Research Asia) concerning Internet Information Retrieval.
• Mathematics is becoming more and more important in the area of Internet information retrieval.
• Internet information retrieval has been a rich source providing plenty of interesting and challenging problems in Mathematics.
Thank you !