Intelligent Information Retrieval (智能信息检索)
Prof. Xiaoyong Du, Renmin University of China
Prof. Ji-Rong Wen, Microsoft Research Asia
Course Introduction
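Course history
• On September 27, 2007, we and Microsoft Research Asia (MSRA) jointly held an academic seminar on Internet data management, together with the ceremony appointing Dr. Ji-Rong Wen as an adjunct researcher;
• Spring semester 2008: the first collaboration with Microsoft to offer the course "Intelligent Information Retrieval"; 9 researchers from MSRA gave 11 lectures in total;
• Reference: "从智能信息检索看微软前瞻性课程" (Microsoft's forward-looking courses as seen through Intelligent Information Retrieval), 《计算机教育》 (Computer Education)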
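Teaching style
• IR fundamentals + topical lectures
• Topical lectures are given by Microsoft researchers and are very dense in information
• Assessment:
– (1) Choose one topic and write a survey-style report covering: What is the research problem? What are the theoretical foundations of the area? Where do the technical difficulties lie? What approaches and methods currently exist? What experimental methods and evaluation frameworks are used to study these problems? Finally, present your own views.
– (2) Hand the printed report to the teaching assistant in the last class; it will be distributed to the relevant lecturers for grading.
– (3) Regular assessment, based mainly on participation in discussions.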
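Course content
• Fundamentals
– Basic models
– Basic implementation techniques
– Evaluation methods
• Core techniques and systems
– Ranking
– Information Extraction
– Log Mining
– System Implementation
• Applications
– Image search
– Multi-language search
– Knowledge Management
– ……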
Reading Materials
• R. Baeza-Yates, B. Ribeiro-Neto, Modern Information Retrieval, ACM Press,1999
• E.M.Voorhees,D.K.Harman, TREC: Experiment and Evaluation in Information Retrieval, The MIT Press 2005
• K. S. Jones, P. Willett, Readings in Information Retrieval, Morgan Kaufmann,1997
• Proceedings SIGIR, SIGMOD , WWW
Course schedule
Date | Topic | Lecturer
March 2 | IR: Basic concepts and Models | Xiaoyong Du
March 9 | IR: Overview of Key Techniques | Xiaoyong Du
March 16 | Information retrieval in the new computing era | Wei-Ying Ma
March 23 | Fundamental Web IR | Hang Li
March 30 | Web IR Evaluation | Ruihua Song
April 6 | Link Analysis and Anti Web Spam | Bin Gao
April 13 | Web Search Log Mining | Daxin Jiang
April 20 | Learning to Rank for Information Retrieval | Tie-Yan Liu
April 27 | Search Engine Overview: System, Algorithms and Challenges | Ji-Rong Wen
May 4 | Multi-language search | Ming Zhou
May 11 | Web-Scale Entity Search and Knowledge Mining | Zaiqing Nie
May 18 | Web Image Search | Lei Zhang
May 25 | Information Extraction | Hang Li
June 1 | Spatial data mining and its applications | Xing Xie
June 8 | Wisdom of the crowd on the Web | Haixun Wang
June 15 | Summary | Xiaoyong Du
Course arrangements
• http://iir.ruc.edu.cn/
• Contact:
– Xiaoyong Du [email protected] (Information Building, Room 0459)
– Ji-Rong Wen [email protected]
Introduction to IR
Prof. Xiaoyong Du
What is Information Retrieval
• Definition from comparison
Aspects | IR | DB
Data | Unstructured | Structured
Operations | Read only | Read/Write
User's need | Keywords | SQL
Results | Similarity-based match | Exact match
What is IR?
• Definition by examples:– Library system, like CALIS– Search engine, like Google, Baidu
What is IR?
• Definition by content
– IR = <D, Q, R(qi,dj)>, where
– D: document collection
– Q: user's query
– R: the relevance/similarity degree between query qi and document dj
IR System Architecture
[Figure: IR system architecture. Classical IR: Unstructured Data, Indexer, Index, Ranker, User Interface. Web IR adds: Web, Crawler, Extractor, Data Miner, Query log / Feedback.]
Content
• IR Model• System architecture• Evaluation and benchmark• Key techniques
– Media-related operators– Indexing– IE– Classification and clustering– Link analysis– Relevance evaluation– ……
Related Area
• Natural Language Processing
• Large-scale distributed computing
• Database
• Data mining
• Information Science
• Artificial Intelligence
• ……
Model
IR Model
• Representation: how to represent a document/query
– Bag-of-words
– Sequence-of-words
– Link of documents
– Semantic network
• Similarity/relevance evaluation: sim(dj,q) = ?
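Two broad classes of models
• Retrieval models based on text content
– Boolean model
– Vector space model
– Probabilistic model
– Statistical language model
• Other, content-independent retrieval models
– Collaboration-based models
– Link-analysis-based models
– Association-based models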
Classical IR Models ---- Basic Concepts
• Bag-of-words model
• Each document is represented by a set of representative keywords or index terms
• The importance of the index terms is represented by weights associated with them
• Let
– ki : an index term
– dj : a document
– t : the total number of index terms
– K = {k1, k2, …, kt} : the set of all index terms
Classic IR Models - Basic Concepts
– wij >= 0 : a weight associated with (ki,dj)
The weight wij quantifies the importance of the index term for describing the document contents
• wij = 0 indicates that term does not belong to doc
– vec(dj) = (w1j, w2j, …, wtj) : a weighted vector associated with the document dj
– gi(vec(dj)) = wij : a function which returns the weight of term ki in document dj
Classical IR Models - Basic Concepts
• A ranking is an ordering of the documents retrieved that (hopefully) reflects the relevance of the documents to the user query
• A ranking is based on fundamental premises regarding the notion of relevance, such as:– common sets of index terms– sharing of weighted terms– likelihood of relevance
• Each set of premises leads to a distinct IR model
The Boolean Model
• Simple model based on set theory
• Queries specified as Boolean expressions
– precise semantics
– neat formalism
– q = ka ∧ (kb ∨ ¬kc)
• Terms are either present or absent. Thus, wij ∈ {0,1}
• Consider
– q = ka ∧ (kb ∨ ¬kc)
– vec(qdnf) = (1,1,1) ∨ (1,1,0) ∨ (1,0,0)
– vec(qcc) = (1,1,0) is a conjunctive component
Outline
• Boolean Model ( BM )• Vector Space Model ( VSM )• Probabilistic Model ( PM )• Language Model(LM)
The Boolean Model
• q = ka ∧ (kb ∨ ¬kc)
• sim(q,dj) = 1 if ∃ vec(qcc) | (vec(qcc) ∈ vec(qdnf)) ∧ (∀ki, gi(vec(dj)) = gi(vec(qcc))); 0 otherwise
[Figure: Venn diagram of Ka, Kb and Kc with the regions (1,1,1), (1,1,0) and (1,0,0) marked]
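The matching rule of the Boolean model can be sketched in a few lines of Python. This is only an illustrative sketch: the names `TERMS`, `query`, `Q_DNF`, `doc_vector` and `sim` are made up for this example, and the query is the one used on the slide, q = ka ∧ (kb ∨ ¬kc).

```python
from itertools import product

TERMS = ("ka", "kb", "kc")  # index terms, in vector order

def query(ka, kb, kc):
    """The slide's example query: q = ka AND (kb OR NOT kc)."""
    return bool(ka and (kb or not kc))

# Disjunctive normal form: every binary term vector that satisfies q.
Q_DNF = {bits for bits in product((1, 0), repeat=len(TERMS)) if query(*bits)}

def doc_vector(doc_terms):
    """Binary weights wij: 1 if the index term occurs in the document."""
    return tuple(int(t in doc_terms) for t in TERMS)

def sim(doc_terms):
    """1 if the document vector equals some conjunctive component, else 0."""
    return 1 if doc_vector(doc_terms) in Q_DNF else 0

print(sorted(Q_DNF, reverse=True))  # the three conjunctive components
print(sim({"ka", "kb"}), sim({"ka", "kc"}))
```

A document either matches one conjunctive component exactly or is rejected, which is why the model cannot rank its results.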
Drawbacks of the Boolean Model
• Exact matching: a document either satisfies the query or it does not
• No ranking: all matching documents are treated as equally relevant
• Awkward: the information need has to be translated into a Boolean expression
• Too simple: the Boolean queries formulated by users are most often too simplistic
• Unsatisfactory results: the Boolean model frequently returns either too few or too many documents in response to a user query
The Vector Model
• Non-binary weights provide consideration for partial matches
• These term weights are used to compute a degree of similarity between a query and each document
• Ranked set of documents provides for better matching
The Vector Model
• Define:
– wij > 0 whenever ki ∈ dj
– wiq >= 0 associated with the pair (ki,q)
– vec(dj) = (w1j, w2j, ..., wtj)
– vec(q) = (w1q, w2q, ..., wtq)
– index terms are assumed to occur independently within the documents; that is, the term vectors are assumed to be pairwise orthogonal
• The t terms form an orthonormal basis for a t-dimensional space
• In this space, queries and documents are represented as weighted vectors
The Vector Model
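• sim(q,dj) = cos(θ) = [vec(dj) · vec(q)] / (|dj| * |q|)
= [∑_i wij * wiq] / (|dj| * |q|), where θ is the angle between vec(dj) and vec(q)
• Since wij >= 0 and wiq >= 0, 0 <= sim(q,dj) <= 1
• A document is retrieved even if it matches the query terms only partially
[Figure: the vectors dj and q plotted on axes i and j, separated by the angle θ]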
The Vector Model
• sim(q,dj) = [∑_i wij * wiq] / (|dj| * |q|)
• The key question is how to compute the weights wij and wiq
• A good weight must take into account two effects:
– quantification of intra-document contents (similarity): the tf factor, the term frequency within a document
– quantification of inter-document separation (dissimilarity): the idf factor, the inverse document frequency
– TF*IDF formula: wij = tf(i,j) * idf(i)
The Vector Model
• Let
– N be the total number of docs in the collection
– ni be the number of docs which contain ki
– freq(i,j) be the raw frequency of ki within dj
• A normalized tf factor is given by
– tf(i,j) = freq(i,j) / max_l freq(l,j)
– where the maximum is taken over all terms kl ∈ dj
• The idf factor is computed as
– idf(i) = log(N/ni)
– the log is used to make the values of tf and idf comparable. It can also be interpreted as the amount of information associated with the term ki.
The Vector Model
• tf-idf weighting scheme
– wij = tf(i,j) * log(N/ni)
– The best-known term-weighting schemes use weights of this form
• For the query term weights,
– wiq = (0.5 + [0.5 * freq(i,q) / max_l freq(l,q)]) * log(N/ni)
– or specified by the user
• The vector model with tf-idf weights is a good ranking strategy for general collections
• The vector model is usually as good as the known ranking alternatives. It is also simple and fast to compute.
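The tf-idf weights and the cosine formula above can be tried out on a toy collection. The following is a minimal sketch, not a reference implementation: the three documents, the query and all function names are invented for illustration, and every document is assumed to be non-empty.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns ([{term: wij}, ...], {term: idf})."""
    N = len(docs)
    df = Counter(t for d in docs for t in set(d))        # ni per term
    idf = {t: math.log(N / n) for t, n in df.items()}    # idf(i) = log(N/ni)
    vectors = []
    for d in docs:
        freq = Counter(d)
        max_f = max(freq.values())                       # max_l freq(l,j)
        vectors.append({t: (f / max_f) * idf[t] for t, f in freq.items()})
    return vectors, idf

def query_vector(query, idf):
    """wiq = (0.5 + 0.5 * freq(i,q) / max_l freq(l,q)) * idf(i)."""
    freq = Counter(t for t in query if t in idf)
    if not freq:
        return {}
    max_f = max(freq.values())
    return {t: (0.5 + 0.5 * f / max_f) * idf[t] for t, f in freq.items()}

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

docs = [["web", "search", "engine"], ["image", "search"], ["database", "system"]]
vectors, idf = tfidf_vectors(docs)
q = query_vector(["image", "search"], idf)
scores = [cosine(q, v) for v in vectors]
print(scores)  # the second document scores highest, the third scores 0
```

Note that the first document still receives a non-zero score although it contains only one of the two query terms, which is the partial matching that the Boolean model lacks.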
The Vector Model
• Advantages:
– term weighting improves the quality of the answer set
– partial matching allows retrieval of docs that approximate the query conditions
– the cosine ranking formula sorts documents according to their degree of similarity to the query
• Disadvantages:
– assumes independence of index terms
Probabilistic Model
• Objective: to capture the IR problem using a probabilistic framework
• Given a user query, there is an ideal answer set
• Querying as specification of the properties of this ideal answer set (clustering)
• But, what are these properties?
• Guess at the beginning what they could be (i.e., guess an initial description of the ideal answer set)
• Improve by iteration
Probabilistic Model
• Basic ideas:
– An initial set of documents is retrieved somehow
– The user inspects these docs looking for the relevant ones (in truth, only the top 10-20 need to be inspected)
– The IR system uses this information to refine the description of the ideal answer set
– By repeating this process, it is expected that the description of the ideal answer set will improve
• The description of the ideal answer set is modeled in probabilistic terms
Probabilistic Ranking Principle
• The probabilistic model tries to estimate the probability that the user will find the document dj interesting (i.e., relevant).
• The model assumes that this probability of relevance depends on the query and the document representations only.
• Let R be the ideal answer set.
• But,
– how to compute the probabilities?
– what is the sample space?
The Ranking
• Probabilistic ranking computed as:
– sim(q,dj) = P(dj relevant-to q) / P(dj non-relevant-to q)
• Definition:
– wij ∈ {0,1}
– P(R | vec(dj)) : probability that the given doc is relevant
– P(¬R | vec(dj)) : probability that the doc is not relevant
The Ranking
• sim(dj,q) = P(R | vec(dj)) / P(¬R | vec(dj))
= [P(vec(dj) | R) * P(R)] / [P(vec(dj) | ¬R) * P(¬R)]
~ P(vec(dj) | R) / P(vec(dj) | ¬R)
• P(vec(dj) | R) : probability of randomly selecting the document dj from the set R of relevant documents
The Ranking
• sim(dj,q) ~ P(vec(dj) | R) / P(vec(dj) | ¬R)
~ ( [∏_{gi(vec(dj))=1} P(ki | R)] * [∏_{gi(vec(dj))=0} P(¬ki | R)] ) / ( [∏_{gi(vec(dj))=1} P(ki | ¬R)] * [∏_{gi(vec(dj))=0} P(¬ki | ¬R)] )
• P(ki | R) : probability that the index term ki is present in a document randomly selected from the set R of relevant documents
The Ranking
• Taking logarithms and dropping the factors that are the same for all documents:
sim(dj,q) ~ log ( [∏_{gi(vec(dj))=1} P(ki | R)] * [∏_{gi(vec(dj))=0} P(¬ki | R)] ) / ( [∏_{gi(vec(dj))=1} P(ki | ¬R)] * [∏_{gi(vec(dj))=0} P(¬ki | ¬R)] )
~ ∑_i wiq * wij * ( log[P(ki | R) / P(¬ki | R)] + log[P(¬ki | ¬R) / P(ki | ¬R)] )
where P(¬ki | R) = 1 - P(ki | R) and P(¬ki | ¬R) = 1 - P(ki | ¬R)
The Initial Ranking
• sim(dj,q) ~ ∑_i wiq * wij * ( log[P(ki | R) / (1 - P(ki | R))] + log[(1 - P(ki | ¬R)) / P(ki | ¬R)] )
• How to compute the probabilities P(ki | R) and P(ki | ¬R)?
• Estimates based on assumptions:
– P(ki | R) = 0.5
– P(ki | ¬R) = ni / N
– where ni is the number of docs that contain ki
– Use this initial guess to retrieve an initial ranking
– Improve upon this initial ranking
Improving the Initial Ranking
• sim(dj,q) ~ ∑_i wiq * wij * ( log[P(ki | R) / (1 - P(ki | R))] + log[(1 - P(ki | ¬R)) / P(ki | ¬R)] )
• Let
– V : set of docs initially retrieved
– Vi : subset of the docs retrieved that contain ki
• Re-evaluate the estimates (V and Vi also denote the sizes of these sets):
– P(ki | R) = Vi / V
– P(ki | ¬R) = (ni - Vi) / (N - V)
• Repeat recursively
Improving the Initial Ranking
• sim(dj,q) ~ ∑_i wiq * wij * ( log[P(ki | R) / (1 - P(ki | R))] + log[(1 - P(ki | ¬R)) / P(ki | ¬R)] )
• To avoid problems with V=1 and Vi=0:
– P(ki | R) = (Vi + 0.5) / (V + 1)
– P(ki | ¬R) = (ni - Vi + 0.5) / (N - V + 1)
• Also,
– P(ki | R) = (Vi + ni/N) / (V + 1)
– P(ki | ¬R) = (ni - Vi + ni/N) / (N - V + 1)
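The initial ranking and its iterative refinement can be sketched as follows. This is a hedged illustration under several assumptions that are not in the slides: binary weights are modelled as sets of terms, the top `top_v` documents of the current ranking stand in for the set V (no real user feedback), the 0.5 adjustment is used in every round, and the helper `clamp` keeps the initial estimate ni/N away from 0 and 1 so that the logarithm is defined. All names and the toy collection are made up.

```python
import math

EPS = 1e-6

def clamp(p):
    """Keep a probability strictly inside (0, 1) (an assumption of this sketch)."""
    return min(max(p, EPS), 1.0 - EPS)

def score(doc, query, p_rel, p_non):
    """Sum over query terms present in the doc (wiq = wij = 1)."""
    return sum(
        math.log(p_rel[k] / (1 - p_rel[k])) + math.log((1 - p_non[k]) / p_non[k])
        for k in query if k in doc
    )

def rank(docs, query, p_rel, p_non):
    scores = [score(d, query, p_rel, p_non) for d in docs]
    return sorted(range(len(docs)), key=lambda j: -scores[j])

def probabilistic_ranking(docs, query, top_v=2, rounds=3):
    N = len(docs)
    n = {k: sum(k in d for d in docs) for k in query}
    # Initial guess: P(ki|R) = 0.5 and P(ki|not R) = ni/N.
    p_rel = {k: 0.5 for k in query}
    p_non = {k: clamp(n[k] / N) for k in query}
    order = rank(docs, query, p_rel, p_non)
    for _ in range(rounds):
        V = order[:top_v]                      # docs assumed relevant
        Vi = {k: sum(k in docs[j] for j in V) for k in query}
        p_rel = {k: (Vi[k] + 0.5) / (len(V) + 1) for k in query}
        p_non = {k: (n[k] - Vi[k] + 0.5) / (N - len(V) + 1) for k in query}
        order = rank(docs, query, p_rel, p_non)
    return order

docs = [{"web", "search", "engine"}, {"web", "page"}, {"image", "search"},
        {"database"}, {"web", "mining"}, {"music"}]
order = probabilistic_ranking(docs, {"search", "engine"})
print(order[:2])  # indices of the two best-ranked documents
```

In a real system the set V would come from user relevance judgements or from a larger pseudo-feedback window; the fixed `top_v` here only keeps the example small.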
Discussion
• Advantages:
– Docs are ranked in decreasing order of their probability of relevance
• Disadvantages:
– need to guess initial estimates for P(ki | R)
– the method does not take into account tf and idf factors
Document Representation
• Bag-of-words
• Bag-of-facts
• Bag-of-sentences
• Bag-of-nets
• Document = Bag of Words
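• Document = Bag of Sentences, Sentence = word sequence
– Word order decides how 南京市长江大桥 should be segmented: "Nanjing mayor / Jiang Daqiao" is far less likely than "Nanjing city / Yangtze River Bridge":
p(南京市长) p(江大桥 | 南京市长) << p(南京市) p(长江大桥 | 南京市)
– Likewise, p(中国人民大学) >> p(中国大学人民)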
What is a LM?
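• A "language" is a probability distribution over its alphabet that reflects how likely any sequence of symbols is to be a sentence (or any other linguistic unit) of the language. This probability distribution is called a language model.
– Given a language, the probability of any "sentence" (symbol string) can be estimated.
– Example: for English, suppose p1(a quick brown dog) > p2(dog brown a quick) > p3(brown dog 棕熊) > p4(棕熊)
– If p1 = p2, the model is a first-order (unigram) language model; otherwise it is a higher-order language model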
Basic Notation
• M: the language we are trying to model; it can be thought of as a source
• s: observation (string of tokens)
• P(s|M): probability of observation “s” in M, that is the probability of getting “s” during random sampling from M
Basic Notation
• Let S = s1 s2 … sn be any sentence
• P(S) = P(s1) P(s2|s1) … P(sn|s1,s2,…,sn-1)
• Under an n-gram model
P(si|s1,s2,…,si-1) = P(si|si-n+1,…,si-1)
• n = 1, unigram
P(si|s1,s2,…,si-1) = P(si)
How can we use LMs in IR
• Use LMs to model the process of query generation:
• Every document in a collection defines a "language"
• P(s|MD) is the probability that the author would write down the string s
• Now suppose q is the user's query
• P(q|MD) is the probability of q under random sampling from MD, and can be used to rank document D in the collection
Major issues in applying LMs
• What kind of language model should we use?– Unigram or high-order models
• How can we estimate model parameters?– Basic model or advanced model– Data smoothing approaches
Which kind of model is better?
• Unigram model
• Bigram model
• High-order model
Unigram LMs
• Words are “sampled” independently of each other
• The joint probability decomposes into a product of marginals
– P(xyz) = p(x) p(y) p(z)
– P("brown dog") = P("dog") P("brown")
• Estimation of probability :simple counting
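A unigram model estimated by simple counting can be sketched as below. The training sentence and the function names are invented for this example; unseen words receive probability 0, which is the zero-frequency problem that smoothing addresses.

```python
from collections import Counter
from math import prod

def unigram_mle(tokens):
    """Relative-frequency estimate P(w) = count(w) / number of tokens."""
    counts = Counter(tokens)
    total = len(tokens)
    return {w: c / total for w, c in counts.items()}

def sequence_probability(seq, model):
    """Under a unigram model the joint probability is a product of marginals."""
    return prod(model.get(w, 0.0) for w in seq)

model = unigram_mle("a quick brown dog saw a brown fox".split())
p1 = sequence_probability(["brown", "dog"], model)
p2 = sequence_probability(["dog", "brown"], model)
print(p1, p2)  # identical: a unigram model ignores word order
```

Because the product does not depend on the order of its factors, p1 and p2 coincide, which is the property used earlier to distinguish first-order from higher-order models.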
Higher-order Models
• n-gram: condition on preceding words
• Cache: condition on a window
• Grammar: condition on a parse tree
• Are they useful?
• Parameter estimation is very expensive!
Comparison
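• Song and Croft report that mixing the unigram and bigram language models performs about 8% better than using the unigram model alone. However, Victor Lavrenko points out that the higher-order models used by Song and Croft do not consistently outperform the unigram model.
• David R. H. Miller also reports that a mixture of unigram and bigram language models outperforms the unigram model.
• Other studies suggest that word order has little effect on retrieval results.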
Estimation of parameter
• Given a string of text S ( =Q or D ) , estimate its LM: Ms
• Basic LMs– Maximum-likelihood estimation– Zero-frequency problem– Discounting technology– Interpolation method
Maximum-likelihood estimation
• Let V be the vocabulary of M, Q = q1 q2 … qm be a query with qi ∈ V, and S = d1 d2 … dn be a doc.
• Let Ms be the language model of S
• P(Q|Ms) = ? is called the query likelihood
• P(Ms|Q) = P(Q|Ms) P(Ms) / P(Q) ~ P(Q|Ms) P(Ms) can be used to rank doc S
• So we need to estimate P(Q|Ms) and P(Ms)
Maximum-likelihood estimation
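• Ways to estimate P(Q|Ms):
– Multivariate Bernoulli model
– Multinomial model
• Bernoulli model
– Considers only whether a word occurs in the query, not how many times it occurs. The query is viewed as the outcome of |V| independent Bernoulli trials.
– P(Q|Ms) = ∏_{w∈Q} P(w|Ms) * ∏_{w∉Q} (1 - P(w|Ms))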
Maximum-likelihood estimation
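• Multinomial model
– The query is viewed as the outcome of a sequence of multinomial trials, so the number of times a word occurs in the query is taken into account.
– P(Q|Ms) = ∏_{qi∈Q} P(qi|Ms) = ∏_{w∈Q} P(w|Ms)^#(w,Q)
• Both approaches reduce to estimating P(w|Ms); that is, the IR problem becomes the problem of estimating the document language model, so results from LM research can be applied.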
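The two query-likelihood models can be compared on a toy example. This is only a sketch: the probabilities in `p` are invented numbers standing in for P(w|Ms), and for the Bernoulli variant they are read as the probability that the word appears in a query at all.

```python
from collections import Counter
from math import prod

def multinomial_likelihood(query, p):
    """P(Q|Ms) = product over distinct query words of P(w|Ms)^#(w,Q)."""
    counts = Counter(query)
    return prod(p.get(w, 0.0) ** c for w, c in counts.items())

def bernoulli_likelihood(query, p, vocab):
    """P(Q|Ms) = prod_{w in Q} P(w|Ms) * prod_{w not in Q} (1 - P(w|Ms))."""
    present = set(query)
    return prod(
        p.get(w, 0.0) if w in present else 1.0 - p.get(w, 0.0)
        for w in vocab
    )

p = {"web": 0.5, "search": 0.3, "image": 0.2}   # hypothetical P(w|Ms)
query = ["web", "web", "search"]
print(multinomial_likelihood(query, p))          # uses the repeated "web" twice
print(bernoulli_likelihood(query, p, p.keys()))  # only presence/absence matters
```

The repeated query word lowers the multinomial likelihood but leaves the Bernoulli likelihood unchanged, which is exactly the difference between the two models.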
Maximum-likelihood estimation
• The simplest approach is maximum-likelihood estimation: count the relative frequencies of words in S
P(w|Ms) = #(w,S) / |S|
• Zero-frequency problem (caused by data sparsity)
– If some event does not occur in S, its estimated probability is 0!
– This is not correct, and we need to avoid it
Discounting Method
• Laplace correction (add-one approach):
– Add 1 to every count, then normalize
– P(w|Ms) = (#(w,S) + 1) / (|S| + |V|)
– Problematic for large vocabularies (when |V| is too large)
• Ref.: Chen SF, Goodman JT, An empirical study of smoothing techniques for language modeling, Proc. 34th Annual Meeting of the ACL, 1996
Smoothing methods
• Additive smoothing methods
• Jelinek-Mercer method
• Dirichlet method
Additive smoothing methods
• P(w|Ms) = [#(w,S) + c] / [|S| + c|V|]
• When c = 1, this is the Laplace smoothing method
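Additive smoothing is easy to check numerically. The sketch below uses a made-up three-word vocabulary; `additive_smoothing` is a name chosen for this example.

```python
from collections import Counter

def additive_smoothing(tokens, vocab, c=1.0):
    """P(w|Ms) = (#(w,S) + c) / (|S| + c|V|); c = 1 gives Laplace smoothing."""
    counts = Counter(tokens)
    denom = len(tokens) + c * len(vocab)
    return {w: (counts[w] + c) / denom for w in vocab}

vocab = {"web", "search", "image"}
model = additive_smoothing(["web", "search", "web"], vocab, c=1.0)
print(model)                # the unseen word "image" now has probability 1/6
print(sum(model.values()))  # the estimates still sum to 1
```

With a very large vocabulary the term c|V| dominates the denominator, so most of the probability mass moves to unseen words; this is the problem noted for the add-one approach.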
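Jelinek-Mercer method
• Discounting methods treat all unseen words alike, but in reality they differ. Background knowledge (for example a unigram ML estimate, or knowledge of the English language) can be used to tell them apart.
• P(w|Ms) = c PML(w|Ms) + (1-c) P(w|REF)
• PML(w|Ms) is the conditional probability estimated from S
• P(w|REF) is the prior probability given by the reference (background) model
• Set c to be a constant, independent of document and query.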
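Query-likelihood ranking with Jelinek-Mercer interpolation can be sketched as follows. The toy documents, the choice c = 0.5 and the use of the whole collection as the reference model REF are assumptions of this example, not something prescribed by the slides; query words absent from the reference model are simply skipped.

```python
import math
from collections import Counter

def mle(tokens):
    """Maximum-likelihood unigram model of a token list."""
    counts = Counter(tokens)
    return {w: n / len(tokens) for w, n in counts.items()}

def jm_log_likelihood(query, doc, reference, c=0.5):
    """log P(Q|Ms) with P(w|Ms) = c*PML(w|Ms) + (1-c)*P(w|REF)."""
    p_doc = mle(doc)
    return sum(
        math.log(c * p_doc.get(w, 0.0) + (1 - c) * reference[w])
        for w in query if w in reference
    )

docs = [["web", "search", "engine"], ["image", "search"], ["web", "mining", "web"]]
reference = mle([w for d in docs for w in d])   # collection model as REF
query = ["web", "search"]
scores = [jm_log_likelihood(query, d, reference) for d in docs]
ranking = sorted(range(len(docs)), key=lambda i: -scores[i])
print(ranking)
```

Every score is finite even though two of the documents miss one query word, because the reference model supplies the missing probability mass.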
Effect of smoothing on retrieval performance
• Zhai CX, Lafferty J, A study of smoothing methods for language models applied to ad hoc information retrieval. ACM SIGIR 2001
• Zhai CX, Lafferty J, A study of smoothing methods for language models applied to information retrieval. ACM TOIS 22(2): 179-214
• Smoothing plays two roles: estimation, which solves the zero-probability problem; and query modeling, which removes or reduces the influence of noise
Translation Models
• Basic LMs do not address word synonymy.
• P(q|M) = ∑w P(w|M) P(q|w)
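• P(q|w) captures the relationship between q and w. If q and w are near-synonyms, this value is relatively large.
• P(q|w) can be computed from word co-occurrence, shared stems, dictionaries and so on; this is the key to the method
• P(w|M) is the probability of w under the language model.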
LM Tools
• LEMUR– www.cs.cmu.edu/~lemur– CMU/UMass joint project– C++, good documentation, forum-based support– Ad-hoc IR, Clustering, Q/A systems– ML+smoothing, …
• YARI– [email protected]– Ad-hoc IR, cross-language,classification– ML+smoothing,…
Other applications of LM
• Topic detection and tracking– Treat “q” as a topic description
• Classification/ filtering
• Cross-language retrieval
• Multi-media retrieval
References
• Ponte JM, Croft WB, A Language Modeling approach to Information Retrieval, ACM SIGIR 1998, pp275-281
• Ponte JM, A Language Modeling approach to Information Retrieval, PhD Dissertation, UMass, 1998
Bag-of-nets
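• What happens if the concepts of a text are expressed with an ontology, that is, if the concepts extracted from the text are placed against the background of a domain ontology to form a network of concepts?
• Could a Bayesian network be used? The key is how to interpret the relationships between words: are they causal?
• For example, hypernym/hyponym relations? Association relations?
• Other, content-independent retrieval models
– Collaboration-based models
– Link-analysis-based models
– Association-based models
• Usually used together with content-based models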
Collaborative Recommendation
• raj denotes the score of item j rated by an active user a. If user a had not rated item j, raj=0.
• m - total number of users, n - total number of items.
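$$R = \begin{pmatrix} r_{11} & \cdots & r_{1j} & \cdots & r_{1n} \\ \vdots & & \vdots & & \vdots \\ r_{a1} & \cdots & r_{aj} & \cdots & r_{an} \\ \vdots & & \vdots & & \vdots \\ r_{m1} & \cdots & r_{mj} & \cdots & r_{mn} \end{pmatrix}_{m \times n}$$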
Collaborative recommendation model
• For a given user a and document j, predict p_aj = ?
$$p_{aj} = \bar{r}_a + k \sum_{i=1}^{m_a} w(a,i)\,(r_{ij} - \bar{r}_i)$$
• \bar{r}_a and \bar{r}_i are the mean ratings given by user a and user i.
• m_a is the number of users who are similar to user a and have rated item j.
• w(a,i): the weight of the similarity between user a and user i.
• k is a normalizing factor such that the absolute values of the weights sum to unity.
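The prediction rule can be sketched in Python. Several choices here are assumptions of this example rather than part of the slides: the similarity weight w(a,i) is taken to be the Pearson correlation over co-rated items, every other user who rated item j is used as a neighbour, mean ratings are computed over rated items only, and the small rating matrix is invented.

```python
import math

def mean_rating(row):
    """Mean over rated items only (a rating of 0 means 'not rated')."""
    rated = [r for r in row if r != 0]
    return sum(rated) / len(rated) if rated else 0.0

def pearson(ra, ri):
    """Assumed similarity weight w(a,i): Pearson correlation on co-rated items."""
    common = [j for j in range(len(ra)) if ra[j] and ri[j]]
    if len(common) < 2:
        return 0.0
    ma, mi = mean_rating(ra), mean_rating(ri)
    num = sum((ra[j] - ma) * (ri[j] - mi) for j in common)
    den = math.sqrt(sum((ra[j] - ma) ** 2 for j in common)
                    * sum((ri[j] - mi) ** 2 for j in common))
    return num / den if den else 0.0

def predict(R, a, j):
    """p_aj = mean(r_a) + k * sum_i w(a,i) * (r_ij - mean(r_i))."""
    base = mean_rating(R[a])
    neighbours = [i for i in range(len(R)) if i != a and R[i][j] != 0]
    weights = {i: pearson(R[a], R[i]) for i in neighbours}
    norm = sum(abs(w) for w in weights.values())   # 1/k
    if norm == 0:
        return base
    return base + sum(w * (R[i][j] - mean_rating(R[i]))
                      for i, w in weights.items()) / norm

R = [[5, 4, 0, 1],   # user 0 has not rated item 2
     [4, 5, 5, 1],
     [1, 2, 1, 5]]
prediction = predict(R, a=0, j=2)
print(round(prediction, 3))
```

User 1 agrees with user 0 and liked item 2, while user 2 disagrees with user 0 and disliked it; both facts push the prediction above user 0's average rating.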
Main problems of the algorithm
• Cold start
• Sparsity
• High dimensionality
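Classification-based collaborative filtering recommendation
Addresses the cold-start problem
Basic idea:
(1) Partition the matrix according to the semantic categories of the items
(2) Run collaborative filtering on each resulting submatrix
(3) Generate the predictions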
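Classification-based collaborative filtering recommendation
Basic idea:
(1) Assign each item to one or more categories;
(2) Decompose the user-item rating matrix by category:
$$D_{m \times n} = \begin{pmatrix} d_{11} & \cdots & d_{1n} \\ \vdots & & \vdots \\ d_{m1} & \cdots & d_{mn} \end{pmatrix} \;\Rightarrow\; D_i = \begin{pmatrix} d_{1 v_{i1}} & \cdots & d_{1 v_{in_i}} \\ \vdots & & \vdots \\ d_{m v_{i1}} & \cdots & d_{m v_{in_i}} \end{pmatrix}_{m \times n_i}$$
where Genre[i] = {v_{i1}, v_{i2}, …, v_{in_i}} is the set of items in category i;
(3) Prune D_i, removing the users who have not rated any item in this category.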
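Classification-based collaborative filtering algorithm (continued)
(4) Use the pruned matrix D'_i to compute the similarity between users within a category, which yields each user's nearest neighbours:
$$D_i \;\Rightarrow\; D'_i = \begin{pmatrix} d_{u_{i1} v_{i1}} & \cdots & d_{u_{i1} v_{in_i}} \\ \vdots & & \vdots \\ d_{u_{im_i} v_{i1}} & \cdots & d_{u_{im_i} v_{in_i}} \end{pmatrix}_{m_i \times n_i}$$
(5) Compute the user's degree of interest in the items of a particular category.
(6) Combine the user's degrees of interest across multiple categories to obtain the final recommendation.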
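Clustering-based collaborative filtering algorithm
Basic idea:
(1) Partition the matrix, using clustering algorithms such as sparse-matrix clustering or K-Means
(2) Run collaborative filtering on each resulting submatrix
(3) Generate the predictions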
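Collaborative filtering based on matrix clustering
Rating matrix after (1,0) conversion (before partitioning); rows are users, columns are items:
User | 1 2 3 4 5 6 7 8
1 | 1 1 1 0 0 0 0 0
2 | 1 1 0 1 0 0 0 0
3 | 0 1 1 1 0 0 0 0
4 | 1 0 1 1 0 0 0 0
5 | 0 0 0 0 0 1 1 1
6 | 0 0 0 0 0 1 1 0
7 | 0 0 1 1 0 0 1 1
8 | 0 0 1 1 1 1 1 1
9 | 0 0 1 0 1 0 0 0
Matrices after partitioning:
User | 1 2 3 4
1 | 1 1 1 0
2 | 1 1 0 1
3 | 0 1 1 1
4 | 1 0 1 1
User | 6 7 8
5 | 1 1 1
6 | 1 1 0
7 | 0 1 1
8 | 1 1 1
User | 3 4 5
7 | 1 1 0
8 | 1 1 1
9 | 1 0 1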
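Collaborative filtering based on matrix clustering
Basic idea:
(1) Assign each item to one or more submatrices; each user is likewise assigned to one or more submatrices:
$$D_i \;\Rightarrow\; D'_i = \begin{pmatrix} d_{u_{i1} v_{i1}} & \cdots & d_{u_{i1} v_{in_i}} \\ \vdots & & \vdots \\ d_{u_{im_i} v_{i1}} & \cdots & d_{u_{im_i} v_{in_i}} \end{pmatrix}_{m_i \times n_i}$$
Clustering-based collaborative filtering algorithm (continued)
(2) Use D'_i to compute the similarity between users within a category, which yields each user's nearest neighbours.
(3) Compute the user's degree of interest in the items of a particular category.
(4) Combine the user's degrees of interest across multiple categories to obtain the final recommendation.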
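Link analysis models
• For hypertext (e.g. pages on the WWW), the hyperlink structure is a very rich and important resource; if it is fully exploited, the quality of retrieval results can be improved greatly.
• Sergey Brin and Larry Page proposed the PageRank algorithm in 1998
• J. Kleinberg proposed the HITS algorithm in 1998
• Other researchers subsequently proposed further link-analysis algorithms, such as SALSA, PHITS and Bayesian algorithms.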
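PageRank algorithm
• Brin S, Page L, The anatomy of a large-scale hypertextual web search engine. WWW’98
• Basic idea: three heuristic rules:
– If a page is cited many times, the page is probably important.
– If a page is cited by important pages, the page is probably important.
– The importance of a page is divided evenly and passed on to the pages it cites.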
PageRank
• Citation graph (link graph) of the web
• A web page’s “PageRank”:
PR(A)=(1-d)+d(PR(T1)/C(T1)+…+PR(Tn)/C(Tn))
• Page A has pages T1,…,Tn which point to it (i.e. are citations)
• 0<d<1 is a damping factor (d=0.85)
• C(A) is the number of links going out of A
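The formula can be evaluated by simple fixed-point iteration. The following sketch uses an invented three-page link graph and starts every page at PR = 1; it assumes every page has at least one outgoing link and makes no attempt at the efficiency tricks a real search engine would need.

```python
def pagerank(links, d=0.85, iterations=100):
    """links: {page: [pages it points to]}. Iterates
    PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over pages T that point to A."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    pr = {p: 1.0 for p in pages}
    for _ in range(iterations):
        new_pr = {}
        for p in pages:
            incoming = sum(pr[q] / len(links[q])
                           for q in pages if p in links.get(q, ()))
            new_pr[p] = (1 - d) + d * incoming
        pr = new_pr
    return pr

links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}  # hypothetical link graph
pr = pagerank(links)
print({p: round(v, 3) for p, v in sorted(pr.items())})
```

Page C is cited by both other pages and ends up with the highest score; page A comes next because its single citation comes from the important page C.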
HITS algorithm
• J. Kleinberg. Authoritative sources in a hyperlinked environment. In Proc. Ninth Ann. ACM-SIAM Symp. Discrete Algorithms, pages 668-677, ACM Press, New York, 1998
• Hub page: a page that points to authority pages, for example a directory page.
• Authority page: a page that is pointed to by many pages
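HITS algorithm
• Step 1: build the subgraph S
– The result pages R of the query (the top n)
– Every page pointed to by a page in R
– Pages that point to pages in R (their number may need to be limited)
• Step 2: iteratively compute the h value and a value of each page
– Initially, h(p) = 1 and a(p) = 1 for every page
– Define two operations:
I: a(p) = ∑_{(q,p)∈E} h(q)
O: h(p) = ∑_{(p,q)∈E} a(q)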
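HITS algorithm (continued)
• Step 3: repeat Step 2 k times (it can be shown that the iteration converges to a fixed point, but how to choose k remains an open question), then output the top-m hub pages and authority pages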
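Steps 2 and 3 can be sketched as below, assuming the subgraph S has already been built. The edge list is invented, and the normalization of the score vectors after each round is an addition of this sketch (without it the values grow without bound); it does not change the resulting order.

```python
import math

def hits(edges, k=20):
    """edges: list of (source, target) links of the subgraph S."""
    nodes = {n for edge in edges for n in edge}
    h = {n: 1.0 for n in nodes}
    a = {n: 1.0 for n in nodes}
    for _ in range(k):
        # I operation: a(p) = sum of h(q) over edges (q, p)
        a = {p: sum(h[q] for q, t in edges if t == p) for p in nodes}
        # O operation: h(p) = sum of a(q) over edges (p, q)
        h = {p: sum(a[q] for s, q in edges if s == p) for p in nodes}
        # Normalize so the scores stay bounded (assumption of this sketch).
        norm_a = math.sqrt(sum(v * v for v in a.values())) or 1.0
        norm_h = math.sqrt(sum(v * v for v in h.values())) or 1.0
        a = {p: v / norm_a for p, v in a.items()}
        h = {p: v / norm_h for p, v in h.items()}
    return h, a

edges = [("h1", "a1"), ("h1", "a2"), ("h2", "a1"), ("h3", "a1")]
hub, authority = hits(edges)
print(max(hub, key=hub.get), max(authority, key=authority.get))
```

Here a1 is pointed to by all three hubs and becomes the top authority, while h1 points to both authorities and becomes the top hub.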
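SimRank algorithm
• Basic idea: if two objects of the same type are frequently linked to the same other objects, their similarity should be high.
SimRank algorithm: computing text similarity
• 1. Use the citation relationships among articles to compute text similarity: if two documents have the same citations, the two documents are highly similar.
• 2. Use external information (associations) about the articles to compute text similarity.
– External information about a document: authors, publication venue
– If two documents share authors and were published at the same conference, the two documents are highly similar.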
SimRank algorithm
• The similarity between a and b is denoted s(a,b):
– if a = b, s(a,b) = 1, i.e. s(a,a) = s(b,b) = 1
– otherwise:
$$s(a,b) = \frac{C}{|I(a)|\,|I(b)|} \sum_{i=1}^{|I(a)|} \sum_{j=1}^{|I(b)|} s(I_i(a), I_j(b))$$
where I(a) is the set of in-neighbors of a and I_i(a) is its i-th element
• C is called the "confidence level" or "decay factor": a constant between 0 and 1
• if |I(a)| or |I(b)| is 0, s(a,b) = 0
• symmetric: s(a,b) = s(b,a)
– The similarity between a and b is the average similarity between the in-neighbors of a and the in-neighbors of b, scaled by C
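The recursive definition can be evaluated by naive iteration over all node pairs. This is a sketch for small graphs only: the graph, the value C = 0.8 and the number of iterations are choices made for this example, and every in-neighbor is assumed to appear as a key of `in_neighbors`.

```python
def simrank(in_neighbors, C=0.8, iterations=10):
    """in_neighbors: {node: list of nodes pointing to it}."""
    nodes = list(in_neighbors)
    s = {(a, b): 1.0 if a == b else 0.0 for a in nodes for b in nodes}
    for _ in range(iterations):
        new_s = {}
        for a in nodes:
            for b in nodes:
                if a == b:
                    new_s[(a, b)] = 1.0
                    continue
                Ia, Ib = in_neighbors[a], in_neighbors[b]
                if not Ia or not Ib:
                    new_s[(a, b)] = 0.0
                    continue
                total = sum(s[(i, j)] for i in Ia for j in Ib)
                new_s[(a, b)] = C * total / (len(Ia) * len(Ib))
        s = new_s
    return s

# Hypothetical graph: u points to p1 and p2; p3 has no in-links.
in_neighbors = {"u": [], "p1": ["u"], "p2": ["u"], "p3": []}
s = simrank(in_neighbors)
print(s[("p1", "p2")], s[("p1", "p3")])
```

The pair p1, p2 shares its only in-neighbor u, so its similarity is C * s(u,u) = C. This all-pairs computation is the global cost that later work on SimRank tries to reduce.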
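Improving the computation of SimRank
LinkClus algorithm:
1) The 2/8 principle: the similarity between two nodes in the graph is determined by only a subset of the nodes in the graph, not by all of them.
Based on this core idea, the global computation of SimRank is turned into a local, tree-structured computation, which greatly improves efficiency.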
References
Link Mining:
[1] Lise Getoor, Christopher P. Diehl, Link Mining: A Survey, SIGKDD, 2005
[2] Ted E. Senator, Link Mining Applications: Progress and Challenges, SIGKDD, 2005
[3] Lise Getoor, Link mining: a new data mining challenge, SIGKDD, 2003
Similarity Computation:
[1] Glen Jeh, Jennifer Widom, SimRank: A Measure of Structural-Context Similarity, SIGKDD, 2002
[2] Jimeng Sun, Huiming Qu, Deepayan Chakrabarti, Christos Faloutsos, Relevance Search and Anomaly Detection in Bipartite Graphs, SIGKDD, 2005
[3] Xiaoxin Yin, Jiawei Han, Philip S. Yu, LinkClus: Efficient Clustering via Heterogeneous Semantic Links, VLDB, 2006
[4] Xiaoxin Yin, Jiawei Han, Distinguishing Objects with Identical Names in Relational Databases, ICDE, 2007
[5] Zhenjiang Lin, Irwin King, Michael R. Lyu, PageSim: A Novel Link-based Similarity Measure for the World Wide Web, WWW, 2006