智能信息检索 (Intelligent Information Retrieval)


Transcript of 智能信息检索 (Intelligent Information Retrieval)

Page 1: 智能信息检索

智能信息检索 (Intelligent Information Retrieval)

Prof. Xiaoyong Du, Renmin University of China; Prof. Ji-Rong Wen, Microsoft Research Asia

Page 2: 智能信息检索

Course Introduction

Page 3: 智能信息检索

Course History
• On September 27, 2007, we and Microsoft Research Asia jointly held an academic seminar on "Internet Data Management", at which Dr. Ji-Rong Wen was also appointed as an adjunct researcher;

• In the spring semester of 2008, the course "Intelligent Information Retrieval" was offered for the first time in cooperation with Microsoft; nine researchers from MSRA gave a total of eleven lectures;

• Reference: "Microsoft's forward-looking courses as seen through Intelligent Information Retrieval", 《计算机教育》 (Computer Education)

Page 4: 智能信息检索

Teaching Style
• IR fundamentals + topical lectures
• The topical lectures are given by Microsoft researchers and are highly information-dense
• Assessment:

– (1) Choose one topic and write a survey-style report covering: What is the research problem? What is the theoretical foundation of the field? Where are the technical difficulties? What are the current approaches and methods for solving the problem? What are the experimental methods and evaluation frameworks for these problems? Present your own viewpoint.

– (2) Hand in a printed copy of the report to the TA in the last class; it will be distributed to the relevant instructors for review.

– (3) Regular assessment, mainly based on participation in discussions.

Page 5: 智能信息检索

Course Content
• Fundamentals
– Basic models
– Basic implementation techniques
– Evaluation methods

• Core techniques and systems
– Ranking
– Information Extraction
– Log Mining
– System Implementation

• Applications
– Image search
– Multi-language search
– Knowledge Management
– ……

Page 6: 智能信息检索

Reading Materials

• R. Baeza-Yates, B. Ribeiro-Neto, Modern Information Retrieval, ACM Press, 1999

• E.M. Voorhees, D.K. Harman, TREC: Experiment and Evaluation in Information Retrieval, The MIT Press, 2005

• K.S. Jones, P. Willett, Readings in Information Retrieval, Morgan Kaufmann, 1997

• Proceedings of SIGIR, SIGMOD, WWW

Page 7: 智能信息检索

Course Schedule

Date    Topic                                                        Lecturer
Mar 2   IR: Basic Concepts and Models                                Xiaoyong Du
Mar 9   IR: Overview of Key Techniques                               Xiaoyong Du
Mar 16  Information Retrieval in the New Computing Era               Wei-Ying Ma
Mar 23  Fundamental Web IR                                           Hang Li
Mar 30  Web IR Evaluation                                            Ruihua Song
Apr 6   Link Analysis and Anti Web Spam                              Bin Gao
Apr 13  Web Search Log Mining                                        Daxin Jiang
Apr 20  Learning to Rank for Information Retrieval                   Tie-Yan Liu
Apr 27  Search Engine Overview: System, Algorithms and Challenges    Ji-Rong Wen
May 4   Multi-language Search                                        Ming Zhou
May 11  Web-Scale Entity Search and Knowledge Mining                 Zaiqing Nie
May 18  Web Image Search                                             Lei Zhang
May 25  Information Extraction                                       Hang Li
Jun 1   Spatial Data Mining and Its Applications                     Xing Xie
Jun 8   Wisdom of the Crowd on the Web                               Haixun Wang
Jun 15  Summary                                                      Xiaoyong Du

Page 8: 智能信息检索

Course Information
• http://iir.ruc.edu.cn/

• Contact:
– Xiaoyong Du, [email protected] (Information Building, Room 0459)
– Ji-Rong Wen, [email protected]

Page 9: 智能信息检索

Introduction to IR

Prof. Xiaoyong Du

Page 10: 智能信息检索

What is Information Retrieval

• Definition from comparison

Aspect         IR                       DB

Data           Unstructured             Structured

Operations     Read only                Read/Write

User's need    Keywords                 SQL

Results        Similarity-based match   Exact match

Page 11: 智能信息检索

What is IR?

• Definition by examples:
– Library systems, like CALIS
– Search engines, like Google, Baidu

Page 12: 智能信息检索

What is IR?

• Definition by content:
– IR = <D, Q, R(qi,dj)>, where
– D: the document collection
– Q: the user's query
– R: the relevance/similarity degree between query qi and document dj

Page 13: 智能信息检索

IR System Architecture

[Architecture diagram. Classical IR: unstructured data → indexer → index → ranker → user interface. Web IR adds: a crawler over the WEB, query log / feedback, an extractor, and a data miner.]

Page 14: 智能信息检索

Content

• IR models
• System architecture
• Evaluation and benchmarks
• Key techniques
– Media-related operators
– Indexing
– IE
– Classification and clustering
– Link analysis
– Relevance evaluation
– ……

Page 15: 智能信息检索

Related Area

• Natural Language Processing

• Large-scale distributed computing

• Database

• Data mining

• Information Science

• Artificial Intelligence

• ……

Page 16: 智能信息检索

Model

Page 17: 智能信息检索

IR Model

• Representation: how to represent a document/query
– Bag-of-words
– Sequence-of-words
– Links between documents
– Semantic network

• Similarity/relevance evaluation: sim(dj, q) = ?

Page 18: 智能信息检索

Two Categories of Models
• Retrieval models based on textual content
– Boolean model
– Vector space model
– Probabilistic model
– Statistical language model

• Other retrieval models not based on content
– Collaboration-based models
– Link-analysis-based models
– Association-based models

Page 19: 智能信息检索

Classical IR Models: Basic Concepts

• Bag-of-words model
• Each document is represented by a set of representative keywords, or index terms
• The importance of an index term is represented by a weight associated with it
• Let
– ki: an index term
– dj: a document
– t: the total number of index terms
– K = {k1, k2, …, kt}: the set of all index terms

Page 20: 智能信息检索

Classic IR Models - Basic Concepts

– wij >= 0: a weight associated with the pair (ki, dj)
The weight wij quantifies the importance of the index term for describing the contents of document dj
• wij = 0 indicates that the term does not belong to the doc

– vec(dj) = (w1j, w2j, …, wtj): the weighted vector associated with document dj

– gi(vec(dj)) = wij: a function which returns the weight of term ki in document dj

Page 21: 智能信息检索

Classical IR Models - Basic Concepts

• A ranking is an ordering of the retrieved documents that (hopefully) reflects their relevance to the user query

• A ranking is based on fundamental premises regarding the notion of relevance, such as:
– common sets of index terms
– sharing of weighted terms
– likelihood of relevance

• Each set of premises leads to a distinct IR model

Page 22: 智能信息检索

The Boolean Model

• Simple model based on set theory
• Queries specified as Boolean expressions
– precise semantics
– neat formalism
– q = ka ∧ (kb ∨ ¬kc)

• Terms are either present or absent; thus wij ∈ {0,1}
• Consider
– q = ka ∧ (kb ∨ ¬kc)
– vec(qdnf) = (1,1,1) ∨ (1,1,0) ∨ (1,0,0)
– vec(qcc) = (1,1,0) is a conjunctive component

Page 23: 智能信息检索

Outline

• Boolean Model (BM)
• Vector Space Model (VSM)
• Probabilistic Model (PM)
• Language Model (LM)

Page 24: 智能信息检索

The Boolean Model

• q = ka ∧ (kb ∨ ¬kc)

• sim(q,dj) = 1 if ∃ vec(qcc) such that (vec(qcc) ∈ vec(qdnf)) ∧ (∀ki, gi(vec(dj)) = gi(vec(qcc)));
sim(q,dj) = 0 otherwise

[Venn diagram over Ka, Kb, Kc marking the query's conjunctive components (1,1,1), (1,1,0), and (1,0,0)]
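To make the set semantics concrete, here is a minimal Python sketch of the Boolean model as defined above; the query is the slide's example ka ∧ (kb ∨ ¬kc), and the four documents are hypothetical:

```python
# Minimal sketch of the Boolean model for q = ka AND (kb OR NOT kc).
# Documents are sets of index terms; weights are implicitly 0/1.
docs = {
    "d1": {"ka", "kb", "kc"},   # vector (1,1,1)
    "d2": {"ka", "kb"},         # vector (1,1,0)
    "d3": {"ka"},               # vector (1,0,0)
    "d4": {"kb", "kc"},         # vector (0,1,1)
}

def matches(terms):
    """sim(q, dj) = 1 iff dj satisfies ka AND (kb OR NOT kc)."""
    return "ka" in terms and ("kb" in terms or "kc" not in terms)

for name, terms in docs.items():
    print(name, 1 if matches(terms) else 0)   # d1, d2, d3 -> 1; d4 -> 0
```

Only the documents matching one of the three conjunctive components of the DNF are retrieved, with no ordering among them.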

Page 25: 智能信息检索

Drawbacks of the Boolean Model

• Exact matching, no ranking
• Awkward: the information need has to be translated into a Boolean expression
• Too simple: the Boolean queries formulated by users are most often too simplistic
• Unsatisfactory results: the Boolean model frequently returns either too few or too many documents in response to a user query

Page 26: 智能信息检索

Outline

• Boolean Model (BM)
• Vector Space Model (VSM)
• Probabilistic Model (PM)
• Language Model (LM)

Page 27: 智能信息检索

The Vector Model

• Non-binary weights allow for partial matching

• These term weights are used to compute a degree of similarity between a query and each document

• The ranked set of documents it produces provides better matching

Page 28: 智能信息检索

The Vector Model

• Define:
– wij > 0 whenever ki ∈ dj
– wiq >= 0: the weight associated with the pair (ki, q)
– vec(dj) = (w1j, w2j, ..., wtj); vec(q) = (w1q, w2q, ..., wtq)
– Index terms are assumed to occur independently within the documents; that is, the vector space is orthonormal

• The t terms form an orthonormal basis for a t-dimensional space

• In this space, queries and documents are represented as weighted vectors

Page 29: 智能信息检索

The Vector Model

• sim(q,dj) = cos(θ) = [vec(dj) · vec(q)] / (|dj| × |q|)
= ∑i (wij × wiq) / (|dj| × |q|)

• Since wij >= 0 and wiq >= 0, we have 0 <= sim(q,dj) <= 1

• A document is retrieved even if it matches the query terms only partially

[Figure: document vector dj and query vector q separated by angle θ in term space]

Page 30: 智能信息检索

The Vector Model

• sim(q,dj) = ∑i (wij × wiq) / (|dj| × |q|)
• The KEY is how to compute the weights wij and wiq
• A good weight must take two effects into account:

– quantification of intra-document content (similarity)
• the tf factor, the term frequency within a document

– quantification of inter-document separation (dissimilarity)
• the idf factor, the inverse document frequency

– the TF*IDF formula: wij = tf(i,j) * idf(i)

Page 31: 智能信息检索

The Vector Model

• Let
– N be the total number of docs in the collection
– ni be the number of docs which contain ki
– freq(i,j) be the raw frequency of ki within dj

• A normalized tf factor is given by
– tf(i,j) = freq(i,j) / max_l freq(l,j), where kl ∈ dj

• The idf factor is computed as
– idf(i) = log(N/ni)
– The log makes the values of tf and idf comparable; it can also be interpreted as the amount of information associated with term ki

Page 32: 智能信息检索

The Vector Model

• The tf-idf weighting scheme
– wij = tf(i,j) * log(N/ni)
– among the best known term-weighting schemes

• For the query term weights:
– wiq = (0.5 + 0.5 * freq(i,q) / max_l freq(l,q)) * log(N/ni)
– or specified by the user

• The vector model with tf-idf weights is a good ranking strategy for general collections

• The vector model is usually as good as the known ranking alternatives. It is also simple and fast to compute.
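A small sketch of this ranking scheme under the slide's formulas (max-normalized tf, idf = log(N/ni)); for brevity the query is weighted with the same formula as the documents rather than the 0.5-smoothed variant, and the three-document corpus is hypothetical:

```python
import math
from collections import Counter

docs = ["information retrieval system",
        "retrieval of web information",
        "web search engine"]
corpus = [d.split() for d in docs]
N = len(corpus)
vocab = sorted({w for d in corpus for w in d})
ni = {w: sum(w in d for d in corpus) for w in vocab}   # docs containing w

def tfidf_vec(tokens):
    freq = Counter(tokens)
    max_f = max(freq.values())
    # w_ij = tf(i,j) * idf(i), tf normalized by the max frequency in the doc
    return [(freq[w] / max_f) * math.log(N / ni[w]) if w in freq else 0.0
            for w in vocab]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

doc_vecs = [tfidf_vec(d) for d in corpus]
q_vec = tfidf_vec("web information".split())
for text, dv in zip(docs, doc_vecs):
    print(text, round(cosine(q_vec, dv), 3))
```

Documents sharing only part of the query vocabulary still receive a nonzero score, which is exactly the partial-matching behavior described above.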

Page 33: 智能信息检索

The Vector Model

• Advantages:
– term weighting improves the quality of the answer set
– partial matching allows retrieval of docs that approximate the query conditions
– the cosine ranking formula sorts documents according to their degree of similarity to the query

• Disadvantages:
– assumes independence of index terms

Page 34: 智能信息检索

Outline

• Boolean Model (BM)
• Vector Space Model (VSM)
• Probabilistic Model (PM)
• Language Model (LM)

Page 35: 智能信息检索

Probabilistic Model

• Objective: to capture the IR problem within a probabilistic framework
• Given a user query, there is an ideal answer set
• Querying is a specification of the properties of this ideal answer set (clustering)
• But what are these properties?
• Guess at the beginning what they could be (i.e., guess an initial description of the ideal answer set)
• Improve by iteration

Page 36: 智能信息检索

Probabilistic Model
• Basic ideas:
– An initial set of documents is retrieved somehow
– The user inspects these docs looking for the relevant ones (in practice, only the top 10-20 need to be inspected)
– The IR system uses this information to refine its description of the ideal answer set
– By repeating this process, it is expected that the description of the ideal answer set will improve

• The description of the ideal answer set is modeled in probabilistic terms

Page 37: 智能信息检索

Probabilistic Ranking Principle

• The probabilistic model tries to estimate the probability that the user will find document dj interesting (i.e., relevant)

• The model assumes that this probability of relevance depends only on the query and the document representations

• Let R be the ideal answer set
• But,
– how do we compute the probabilities?
– what is the sample space?

Page 38: 智能信息检索

The Ranking

• Probabilistic ranking is computed as:
– sim(q,dj) = P(dj relevant to q) / P(dj non-relevant to q)

• Definitions:
– wij ∈ {0,1}
– P(R | vec(dj)): the probability that the given doc is relevant
– P(¬R | vec(dj)): the probability that the doc is not relevant

Page 39: 智能信息检索

The Ranking

• sim(dj,q) = P(R | vec(dj)) / P(¬R | vec(dj))

= [P(vec(dj) | R) × P(R)] / [P(vec(dj) | ¬R) × P(¬R)]

~ P(vec(dj) | R) / P(vec(dj) | ¬R)

• P(vec(dj) | R): the probability of randomly selecting the document dj from the set R of relevant documents

Page 40: 智能信息检索

The Ranking

• sim(dj,q) ~ P(vec(dj) | R) / P(vec(dj) | ¬R)

~ [∏(gi(dj)=1) P(ki | R) × ∏(gi(dj)=0) P(¬ki | R)] / [∏(gi(dj)=1) P(ki | ¬R) × ∏(gi(dj)=0) P(¬ki | ¬R)]

• P(ki | R): the probability that the index term ki is present in a document randomly selected from the set R of relevant documents

Page 41: 智能信息检索

The Ranking

• sim(dj,q) ~ log { [∏ P(ki | R) × ∏ P(¬ki | R)] / [∏ P(ki | ¬R) × ∏ P(¬ki | ¬R)] }

~ K × [ ∑i log( P(ki | R) / P(¬ki | R) ) + ∑i log( P(¬ki | ¬R) / P(ki | ¬R) ) ]

~ ∑i wiq × wij × ( log( P(ki | R) / (1 - P(ki | R)) ) + log( (1 - P(ki | ¬R)) / P(ki | ¬R) ) )

where P(¬ki | R) = 1 - P(ki | R) and P(¬ki | ¬R) = 1 - P(ki | ¬R)

Page 42: 智能信息检索

The Initial Ranking

• sim(dj,q) ~ ∑i wiq × wij × ( log( P(ki | R) / (1 - P(ki | R)) ) + log( (1 - P(ki | ¬R)) / P(ki | ¬R) ) )

• How do we compute the probabilities P(ki | R) and P(ki | ¬R)?
• Estimates based on assumptions:

– P(ki | R) = 0.5
– P(ki | ¬R) = ni / N

– where ni is the number of docs that contain ki
– Use this initial guess to retrieve an initial ranking
– Improve upon this initial ranking

Page 43: 智能信息检索

Improving the Initial Ranking

• sim(dj,q) ~ ∑i wiq × wij × ( log( P(ki | R) / (1 - P(ki | R)) ) + log( (1 - P(ki | ¬R)) / P(ki | ¬R) ) )

• Let

– V: the set of docs initially retrieved

– Vi: the subset of V that contains ki

• Re-evaluate the estimates:
– P(ki | R) = |Vi| / |V|

– P(ki | ¬R) = (ni - |Vi|) / (N - |V|)

• Repeat recursively

Page 44: 智能信息检索

Improving the Initial Ranking

• sim(dj,q) ~ ∑i wiq × wij × ( log( P(ki | R) / (1 - P(ki | R)) ) + log( (1 - P(ki | ¬R)) / P(ki | ¬R) ) )

• To avoid problems with |V| = 1 and |Vi| = 0:
– P(ki | R) = (|Vi| + 0.5) / (|V| + 1)
– P(ki | ¬R) = (ni - |Vi| + 0.5) / (N - |V| + 1)

• Alternatively:
– P(ki | R) = (|Vi| + ni/N) / (|V| + 1)
– P(ki | ¬R) = (ni - |Vi| + ni/N) / (N - |V| + 1)
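A compact sketch of this scheme: initial estimates P(ki|R) = 0.5 and P(ki|¬R) = ni/N, one ranking pass, then the 0.5-adjusted re-estimation from the top retrieved docs; the collection and query are hypothetical:

```python
import math

# Toy binary term-document incidence (hypothetical data).
docs = {"d1": {"a", "b"}, "d2": {"b", "c"}, "d3": {"a", "c"}, "d4": {"c"}}
N = len(docs)
n = {t: sum(t in d for d in docs.values()) for t in {"a", "b", "c"}}
query = {"a", "c"}

def weight(p_rel, p_nrel):
    # log( p/(1-p) ) + log( (1-q)/q ), the per-term log-odds contribution
    return math.log(p_rel / (1 - p_rel)) + math.log((1 - p_nrel) / p_nrel)

def score(doc_terms, p_rel, p_nrel):
    return sum(weight(p_rel[t], p_nrel[t]) for t in query if t in doc_terms)

# Initial estimates: P(ki|R) = 0.5, P(ki|not R) = ni / N
p_rel = {t: 0.5 for t in query}
p_nrel = {t: n[t] / N for t in query}
ranking = sorted(docs, key=lambda d: score(docs[d], p_rel, p_nrel), reverse=True)

# Re-estimate from the initially retrieved set V (pseudo relevance feedback)
V = ranking[:2]
Vi = {t: sum(t in docs[d] for d in V) for t in query}
p_rel = {t: (Vi[t] + 0.5) / (len(V) + 1) for t in query}
p_nrel = {t: (n[t] - Vi[t] + 0.5) / (N - len(V) + 1) for t in query}
print(sorted(docs, key=lambda d: score(docs[d], p_rel, p_nrel), reverse=True))
```

The 0.5 adjustment in the re-estimation step is exactly what keeps the log-odds finite when |V| is tiny or Vi is empty.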

Page 45: 智能信息检索

Discussion

• Advantages:
– docs are ranked in decreasing order of their probability of relevance

• Disadvantages:
– the initial estimates for P(ki | R) have to be guessed
– the method does not take the tf and idf factors into account

Page 46: 智能信息检索

Outline

• Boolean Model (BM)
• Vector Space Model (VSM)
• Probabilistic Model (PM)
• Language Model (LM)

Page 47: 智能信息检索

Document Representation

• Bag-of-words

• Bag-of-facts

• Bag-of-sentences

• Bag-of-nets

Page 48: 智能信息检索

• Document = Bag of Words

• Document = Bag of Sentences,

Sentence = word sequence

p(南京市长) p(江大桥 | 南京市长) << p(南京市) p(长江大桥 | 南京市)
(segmenting 南京市长江大桥 as "Nanjing mayor" + "Jiang Daqiao" is far less likely than "Nanjing city" + "Yangtze River Bridge")

p(中国人民大学) >> p(中国大学人民)
(the correct word order "Renmin University of China" is far more likely than the scrambled one)

Page 49: 智能信息检索

What is a LM?

• A "language" is a probability distribution over its alphabet; the distribution reflects how likely any symbol sequence is to be a sentence (or any other linguistic unit) of that language. This probability distribution is called a language model.
– Given a language, we can estimate the probability of a "sentence" (symbol string) occurring in it.

– For example, for English we would expect p1(a quick brown dog) > p2(dog brown a quick) > p3(brown dog 棕熊) > p4(棕熊)
– If p1 = p2, the model is called a first-order language model; otherwise it is a higher-order language model

Page 50: 智能信息检索

Basic Notation

• M: the language we are trying to model; it can be thought of as a source

• s: an observation (a string of tokens)

• P(s|M): the probability of observing "s" in M, that is, the probability of getting "s" during random sampling from M

Page 51: 智能信息检索

Basic Notation

• Let S = s1 s2 … sn be any sentence

• P(S) = P(s1) P(s2|s1) … P(sn|s1, s2, …, sn-1)

• Under an n-gram model

P(si|s1, s2, …, si-1) = P(si|si-n+1, …, si-1)

• For n = 1 (unigram)

P(si|s1, s2, …, si-1) = P(si)
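A short sketch of this chain-rule decomposition under the unigram and bigram assumptions; the toy corpus is hypothetical and no smoothing is applied:

```python
from collections import Counter

corpus = "the quick brown dog saw the quick brown fox".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
T = len(corpus)

def p_unigram(s):
    # P(S) = prod P(si) under the n = 1 (unigram) assumption
    p = 1.0
    for w in s:
        p *= unigrams[w] / T
    return p

def p_bigram(s):
    # P(S) = P(s1) * prod P(si | si-1) under the n = 2 (bigram) assumption
    p = unigrams[s[0]] / T
    for prev, w in zip(s, s[1:]):
        p *= bigrams[(prev, w)] / unigrams[prev]
    return p

print(p_unigram("quick brown".split()), p_bigram("quick brown".split()))
```

The bigram score rewards word sequences actually seen in the source, which is the word-order sensitivity the previous page's Chinese segmentation example illustrates.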

Page 52: 智能信息检索

How can we use LMs in IR

• Use an LM to model the process of query generation:

• Every document in a collection defines a "language"

• P(s|MD) defines the probability that the author would write down string "s"

• Now suppose "q" is the user's query
• P(q|MD) is the probability of getting "q" during random sampling from the model of D, and can be treated as the ranking score of document D in the collection

Page 53: 智能信息检索

Major issues in applying LMs

• What kind of language model should we use?
– Unigram or higher-order models

• How can we estimate the model parameters?
– Basic or advanced models
– Data smoothing approaches

Page 54: 智能信息检索

Which kind of model is better?

• Unigram model

• Bigram model

• High-order model

Page 55: 智能信息检索

Unigram LMs

• Words are “sampled” independently of each other

• The joint probability decomposes into a product of marginals
– P(xyz) = p(x) p(y) p(z)
– P("brown dog") = P("brown") P("dog")

• Estimation of probabilities: simple counting

Page 56: 智能信息检索

Higher-order Models

• n-gram: condition on preceding words

• Cache: condition on a window

• Grammar: condition on a parse tree

• Are they useful?

• Parameter estimation is very expensive!

Page 57: 智能信息检索

Comparison

• Song and Croft reported that a mixture of unigram and bigram language models performs about 8% better than a unigram model alone. However, Victor Lavrenko pointed out that the multi-gram models used by Song and Croft do not always beat a unigram-only model.

• David R.H. Miller also reported that a mixture of unigram and bigram language models outperforms a unigram model.

• Some studies also suggest that word order has little effect on retrieval results.

Page 58: 智能信息检索

Major issues in applying LMs

• What kind of language model should we use?
– Unigram or higher-order models

• How can we estimate the model parameters?
– Basic or advanced models
– Data smoothing approaches

Page 59: 智能信息检索

Estimation of parameter

• Given a string of text S (= Q or D), estimate its LM: Ms

• Basic LMs
– Maximum-likelihood estimation
– The zero-frequency problem
– Discounting techniques
– Interpolation methods

Page 60: 智能信息检索

Maximum-likelihood estimation

• Let V be the vocabulary of M, Q = q1 q2 … qm be a query with qi ∈ V, and S = d1 d2 … dn be a doc

• Let Ms be the language model of S

• P(Q|Ms) = ?, called the query likelihood

• P(Ms|Q) = P(Q|Ms) P(Ms) / P(Q) ~ P(Q|Ms) P(Ms) can be treated as the ranking score of doc S

• So we need to estimate P(Q|Ms) and P(Ms)

Page 61: 智能信息检索

Maximum-likelihood estimation

• Methods for estimating P(Q|Ms):
– the multivariate Bernoulli model
– the multinomial model

• Bernoulli model
– considers only whether a word occurs in the query, not how many times; the query is viewed as the outcome sequence of |V| independent Bernoulli trials

– P(Q|Ms) = ∏(w∈Q) P(w|Ms) × ∏(w∉Q) (1 - P(w|Ms))

Page 62: 智能信息检索

Maximum-likelihood estimation

• Multinomial model
– the query is viewed as the outcome sequence of a multinomial experiment, so the number of times each word occurs in the query is taken into account

– P(Q|Ms) = ∏(qi∈Q) P(qi|Ms) = ∏(w∈Q) P(w|Ms)^#(w,Q)

• Both approaches reduce to estimating P(w|Ms); that is, the IR problem is turned into the problem of estimating a document language model, so results from LM research can be reused.
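A sketch contrasting the two estimates; the document model P(w|Ms) is taken as given here (its estimation is the topic of the following pages), and the vocabulary is hypothetical:

```python
from collections import Counter

# Hypothetical document model P(w|Ms) over a tiny vocabulary.
p_w = {"web": 0.4, "search": 0.3, "data": 0.2, "mining": 0.1}
query = ["web", "search", "web"]

def bernoulli(q):
    # Presence/absence only: P(w|Ms) for words in q,
    # (1 - P(w|Ms)) for vocabulary words absent from q.
    q_set = set(q)
    p = 1.0
    for w, pw in p_w.items():
        p *= pw if w in q_set else (1 - pw)
    return p

def multinomial(q):
    # Counts matter: P(w|Ms) ** #(w, Q) for each distinct query word.
    p = 1.0
    for w, c in Counter(q).items():
        p *= p_w[w] ** c
    return p

print(bernoulli(query), multinomial(query))
```

Note how the repeated "web" changes only the multinomial score; the Bernoulli estimate is blind to query term frequency.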

Page 63: 智能信息检索

Maximum-likelihood estimation

• The simplest approach is maximum-likelihood estimation: count the relative frequencies of words in S

P(w|Ms) = #(w,S) / |S|

• The zero-frequency problem (caused by data sparseness):
– if some event does not occur in S, its estimated probability is 0!
– this is not correct, and we need to avoid it

Page 64: 智能信息检索

Discounting Method

• Laplace correction (the add-one approach):
– add 1 to every count, then normalize
– P(w|Ms) = (#(w,S) + 1) / (|S| + |V|)
– problematic for large vocabularies (when |V| is too large)

• Ref: Chen SF and Goodman JT, An empirical study of smoothing techniques for language modeling, Proc. 34th Annual Meeting of the ACL, 1996

Page 65: 智能信息检索

Smoothing methods

• Additive smoothing methods

• The Jelinek-Mercer method
• The Dirichlet method

Page 66: 智能信息检索

Additive smoothing methods

• P(w|Ms) = (#(w,S) + c) / (|S| + c|V|)

• When c = 1, this is the Laplace smoothing method

Page 67: 智能信息检索

The Jelinek-Mercer Method
• Discounting methods treat all unseen words alike, but in fact they still differ; we can use background knowledge (a reference model, or first-order ML estimate), e.g., knowledge of the English language.

• P(w|Ms) = c × PML(w|Ms) + (1 - c) × P(w)

• PML(w|Ms) is the conditional (document-model) probability
• P(w) = P(w|REF) is the prior (reference-model) probability
• Set c to be a constant, independent of document and query.
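A sketch of query-likelihood ranking with this interpolation; the corpus and the choice c = 0.7 are hypothetical:

```python
from collections import Counter

docs = {
    "d1": "web search engine".split(),
    "d2": "language model for information retrieval".split(),
    "d3": "web information retrieval model".split(),
}
collection = [w for d in docs.values() for w in d]
p_ref = Counter(collection)            # background/reference counts
T = len(collection)

def p_jm(w, doc, c=0.7):
    # P(w|Ms) = c * P_ML(w|Ms) + (1 - c) * P(w|REF)
    p_ml = doc.count(w) / len(doc)
    return c * p_ml + (1 - c) * p_ref[w] / T

def query_likelihood(query, doc):
    # Multinomial query likelihood: P(Q|Ms) = prod P(qi|Ms)
    p = 1.0
    for w in query.split():
        p *= p_jm(w, doc)
    return p

for name, doc in docs.items():
    print(name, round(query_likelihood("web retrieval", doc), 5))
```

Because of the background term, a document missing one query word still gets a nonzero score instead of being zeroed out, which is the whole point of the interpolation.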

Page 68: 智能信息检索

The Effect of Smoothing on Retrieval Performance
• Zhai CX, Lafferty J, A study of smoothing methods for language models applied to ad hoc information retrieval, ACM SIGIR 2001

• Zhai CX, Lafferty J, A study of smoothing methods for language models applied to information retrieval, ACM TOIS 22(2):179-214

• Smoothing plays two roles: first, estimation, solving the zero-probability problem; second, query modeling, eliminating or reducing the influence of noise

Page 69: 智能信息检索

Translation Models

• Basic LMs do not address word synonymy.

• P(q|M) = ∑w P(w|M) P(q|w)

• P(q|w) captures the relation between q and w. If q and w are near-synonyms, this value is relatively large.

• P(q|w) can be computed from word co-occurrence relations, shared word stems, dictionaries, etc.; this is the key to the method

• P(w|M) is the probability of w under the language model.
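A sketch of the scoring formula P(q|M) = ∑w P(w|M) P(q|w); the translation table P(q|w) is hypothetical and would in practice come from co-occurrence statistics, stems, or a dictionary as noted above:

```python
# Hypothetical document language model and word-translation table.
p_doc = {"automobile": 0.5, "engine": 0.3, "repair": 0.2}   # P(w|M)
p_trans = {                                                  # P(q|w)
    ("car", "automobile"): 0.8,   # near-synonyms -> large value
    ("car", "engine"): 0.1,
    ("automobile", "automobile"): 0.9,
}

def p_translation(q):
    # P(q|M) = sum over w of P(w|M) * P(q|w)
    return sum(pw * p_trans.get((q, w), 0.0) for w, pw in p_doc.items())

print(p_translation("car"))   # nonzero although "car" never occurs in the doc
```

This is how the model addresses synonymy: the query word "car" is matched through its translation link to "automobile".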

Page 70: 智能信息检索

LM Tools

• LEMUR
– www.cs.cmu.edu/~lemur
– CMU/UMass joint project
– C++, good documentation, forum-based support
– Ad-hoc IR, clustering, Q/A systems
– ML + smoothing, …

• YARI
– [email protected]
– Ad-hoc IR, cross-language, classification
– ML + smoothing, …

Page 71: 智能信息检索

Other applications of LM

• Topic detection and tracking
– Treat "q" as a topic description

• Classification/ filtering

• Cross-language retrieval

• Multi-media retrieval

Page 72: 智能信息检索

References

• Ponte JM, Croft WB, A Language Modeling Approach to Information Retrieval, ACM SIGIR 1998, pp. 275-281

• Ponte JM, A Language Modeling Approach to Information Retrieval, PhD Dissertation, UMass, 1998

Page 73: 智能信息检索

Bag-of-nets

• What if the concepts of a text are represented with an ontology, i.e., the concepts extracted from the text are placed against the background of a domain ontology to form a network of concepts?

• Could a Bayesian network be used? The key is how to understand the relations between words: do they have causal relations?

• For example, hyponymy relations? Association relations?

Page 74: 智能信息检索

• Other retrieval models not based on content
– Collaboration-based models
– Link-analysis-based models
– Association-based models

• Usually used together with content-based models

Page 75: 智能信息检索

Collaborative Recommendation

• raj denotes the score of item j rated by an active user a. If user a has not rated item j, raj = 0.

• m: the total number of users; n: the total number of items.

• The ratings form an m × n user-item matrix R = (rij), whose a-th row (ra1, …, raj, …, ran) contains user a's ratings.

Page 76: 智能信息检索

Collaborative Recommendation Model
• For a given user a and item j, predict paj = ?

• paj = r̄a + k × ∑i w(a,i) × (rij - r̄i), where the sum ranges over the users i who are similar to user a and have rated item j, and r̄a, r̄i are the average ratings of users a and i.

• w(a,i): the weight of the similarity between user a and user i.

• k is a normalizing factor such that the absolute values of the weights sum to unity.
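A sketch of this prediction formula, using Pearson correlation over co-rated items as w(a,i) (one common choice, not mandated by the slide); the rating matrix is hypothetical and 0 means "not rated":

```python
import math

# Rows: users, columns: items; 0 means "not rated" (hypothetical data).
R = [
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [1, 0, 5, 4],
]

def mean(u):
    rated = [r for r in R[u] if r > 0]
    return sum(rated) / len(rated)

def sim(a, i):
    # Pearson correlation over co-rated items, used as w(a, i).
    common = [j for j in range(len(R[a])) if R[a][j] > 0 and R[i][j] > 0]
    if not common:
        return 0.0
    ma, mi = mean(a), mean(i)
    num = sum((R[a][j] - ma) * (R[i][j] - mi) for j in common)
    da = math.sqrt(sum((R[a][j] - ma) ** 2 for j in common))
    di = math.sqrt(sum((R[i][j] - mi) ** 2 for j in common))
    return num / (da * di) if da and di else 0.0

def predict(a, j):
    # p_aj = mean(a) + k * sum_i w(a,i) * (r_ij - mean(i)), over users who rated j
    neighbors = [i for i in range(len(R)) if i != a and R[i][j] > 0]
    weights = [sim(a, i) for i in neighbors]
    total = sum(abs(w) for w in weights)
    k = 1.0 / total if total else 0.0
    return mean(a) + k * sum(w * (R[i][j] - mean(i))
                             for w, i in zip(weights, neighbors))

print(predict(0, 2))   # predict user 0's rating for item 2
```

Mean-centering compensates for users who rate systematically high or low before their opinions are combined.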

Page 77: 智能信息检索

Main Problems of the Algorithm
• Cold start

• Sparsity

• High dimensionality

Page 78: 智能信息检索

Classification-Based Collaborative Filtering

Solves the cold-start problem

Basic idea:
(1) Partition the matrix according to the semantic categories of the items

(2) Run collaborative filtering on the resulting sub-matrices

(3) Generate the predictions

Page 79: 智能信息检索

Classification-Based Collaborative Filtering

Basic idea:
(1) Assign each item to one or more categories;

(2) Decompose the m × n user-item rating matrix D = (dij) into per-category sub-matrices Di over the item set Genre[i] = {vi1, vi2, …, vini} of category i;

(3) Prune each Di by removing the users who have not rated any item of that category.

Page 80: 智能信息检索

Classification-Based Collaborative Filtering (cont.)

(4) Using the pruned sub-matrix D'i (with users ui1, …, uimi and items vi1, …, vini), compute the similarity between users within the category, i.e., find each user's nearest neighbors.

(5) Compute the user's degree of interest in the items of that category.

(6) Combine the user's degrees of interest across the categories to obtain the final recommendation.

Page 81: 智能信息检索

Clustering-Based Collaborative Filtering

Basic idea:

(1) Partition the matrix, using clustering algorithms such as sparse-matrix clustering or K-Means

(2) Run collaborative filtering on the resulting sub-matrices

(3) Generate the predictions

Page 82: 智能信息检索

Matrix-Clustering-Based Collaborative Filtering

The (1,0)-converted rating matrix (before partitioning):

User\Item   1  2  3  4  5  6  7  8
1           1  1  1  0  0  0  0  0
2           1  1  0  1  0  0  0  0
3           0  1  1  1  0  0  0  0
4           1  0  1  1  0  0  0  0
5           0  0  0  0  0  1  1  1
6           0  0  0  0  0  1  1  0
7           0  0  1  1  0  0  1  1
8           0  0  1  1  1  1  1  1
9           0  0  1  0  1  0  0  0

The sub-matrices after partitioning:

User\Item   1  2  3  4      User\Item   6  7  8      User\Item   3  4  5
1           1  1  1  0      5           1  1  1      7           1  1  0
2           1  1  0  1      6           1  1  0      8           1  1  1
3           0  1  1  1      7           0  1  1      9           1  0  1
4           1  0  1  1      8           1  1  1

Page 83: 智能信息检索

Matrix-Clustering-Based Collaborative Filtering

Basic idea:

(1) Assign each item to one or more sub-matrices; each user is likewise assigned to one or more sub-matrices D'i (with users ui1, …, uimi and items vi1, …, vini);

Page 84: 智能信息检索

Clustering-Based Collaborative Filtering (cont.)

(2) Using D'i, compute the similarity between users within a category, i.e., find each user's nearest neighbors.

(3) Compute the user's degree of interest in the items of that category.

(4) Combine the user's degrees of interest across the categories to obtain the final recommendation.

Page 85: 智能信息检索

• Other retrieval models not based on content
– Collaboration-based models
– Link-analysis-based models
– Association-based models

• Usually used together with content-based models

Page 86: 智能信息检索

Link Analysis Models
• For hypertext (e.g., web pages on the WWW), the hyperlink structure is a very rich and important resource; exploited fully, it can greatly improve the quality of retrieval results.

• Sergey Brin and Larry Page proposed the PageRank algorithm in 1998

• J. Kleinberg proposed the HITS algorithm in 1998
• Other researchers have since proposed further link analysis algorithms, such as SALSA, PHITS, and Bayesian variants.

Page 87: 智能信息检索

The PageRank Algorithm
• Brin S, Page L, The anatomy of a large-scale hypertextual web search engine, WWW'98

• Basic idea: three heuristic rules:
– If a page is referenced many times, it is probably important.
– If a page is referenced by important pages, it is probably important.
– A page's importance is divided evenly and propagated to the pages it references.

Page 88: 智能信息检索

PageRank

• Citation graph (link graph) of the web

• A web page's "PageRank":

PR(A) = (1 - d) + d (PR(T1)/C(T1) + … + PR(Tn)/C(Tn))

• Page A has pages T1, …, Tn which point to it (i.e., are citations)

• 0 < d < 1 is a damping factor (typically d = 0.85)

• C(A) is the number of links going out of A
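A power-iteration sketch of this formula on a hypothetical four-page link graph:

```python
# Iterates PR(A) = (1 - d) + d * sum(PR(T)/C(T) for pages T linking to A).
links = {            # hypothetical link graph: page -> pages it links to
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}
d = 0.85
pr = {p: 1.0 for p in links}

for _ in range(50):   # repeat until the values stabilize
    new = {}
    for p in links:
        in_links = [q for q in links if p in links[q]]
        new[p] = (1 - d) + d * sum(pr[q] / len(links[q]) for q in in_links)
    pr = new

print({p: round(pr[p], 3) for p in sorted(pr)})
```

Page C, which collects links from three pages, ends up with the highest score, matching the first heuristic rule above.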

Page 89: 智能信息检索

The HITS Algorithm
• J. Kleinberg, Authoritative sources in a hyperlinked environment, in Proc. Ninth Ann. ACM-SIAM Symp. on Discrete Algorithms, pages 668-677, ACM Press, New York, 1998

• Hub pages: pages that point to authority pages, e.g., directory pages.

• Authority pages: pages pointed to by many other pages.

Page 90: 智能信息检索

The HITS Algorithm
• Step 1: Construct the subgraph S
– the query result pages R (the top n)
– the pages each page in R points to
– the pages pointing to pages in R (their number may need to be limited)

• Step 2: Iteratively compute the h and a values of the pages
– initialize every page with h(p) = 1, a(p) = 1
– define two operations:
I: a(p) = ∑(q,p)∈E h(q)

O: h(p) = ∑(p,q)∈E a(q)

Page 91: 智能信息检索

The HITS Algorithm (cont.)
• Step 3: Repeat Step 2 k times (the iteration can be proven to converge to a fixed point, but how to choose k is an open question); output the top-m hub pages and authority pages
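A sketch of Steps 2-3, applying the I and O operations with normalization after each iteration (without normalization the values grow unboundedly); the subgraph is hypothetical:

```python
import math

# Hypothetical subgraph S as directed edges (p, q): p links to q.
edges = {("h1", "a1"), ("h1", "a2"), ("h2", "a1"), ("h2", "a3"), ("a1", "a2")}
nodes = {n for e in edges for n in e}
h = {n: 1.0 for n in nodes}
a = {n: 1.0 for n in nodes}

for _ in range(20):                                   # Step 3: repeat k times
    a = {p: sum(h[q] for (q, x) in edges if x == p) for p in nodes}  # I op
    h = {p: sum(a[q] for (x, q) in edges if x == p) for p in nodes}  # O op
    # Normalize so the iteration converges to a fixed point.
    na = math.sqrt(sum(v * v for v in a.values()))
    nh = math.sqrt(sum(v * v for v in h.values()))
    a = {n: v / na for n, v in a.items()}
    h = {n: v / nh for n, v in h.items()}

print(sorted(nodes, key=a.get, reverse=True)[:2])   # top-m authority pages
print(sorted(nodes, key=h.get, reverse=True)[:2])   # top-m hub pages
```

The mutually reinforcing update is visible in the code: a page's authority is the sum of the hub scores pointing at it, and vice versa.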

Page 92: 智能信息检索

• Other retrieval models not based on content
– Collaboration-based models
– Link-analysis-based models
– Association-based models

• Usually used together with content-based models

Page 93: 智能信息检索

The SimRank Algorithm
• Basic idea:

Two objects of the same type should be highly similar if they frequently link to the same other objects.

Page 94: 智能信息检索

SimRank for Document Similarity

• 1. Use the citation relations between papers to compute document similarity: if two documents share the same citations, their similarity is high.

• 2. Use external (association) information about papers to compute document similarity:

document metadata (authors, venue); if two documents share authors and appear at the same venue, their similarity is high.

Page 95: 智能信息检索

The SimRank Algorithm
• The similarity between a and b is denoted s(a,b):
– if a = b, s(a,b) = 1 (so s(a,a) = s(b,b) = 1)
– otherwise:

s(a,b) = C / (|I(a)| × |I(b)|) × ∑(i=1..|I(a)|) ∑(j=1..|I(b)|) s(Ii(a), Ij(b))

• C is called the "confidence level" or "decay factor", a constant between 0 and 1

• If |I(a)| or |I(b)| is 0, then s(a,b) = 0
• Symmetric: s(a,b) = s(b,a)

– The similarity between a and b is the (decayed) average similarity between the in-neighbors of a and the in-neighbors of b
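A direct sketch of the recursive definition, iterated toward its fixed point on a hypothetical graph; I(v) denotes the in-neighbor list of v:

```python
# Naive SimRank: s(a,b) = C/(|I(a)||I(b)|) * sum_i sum_j s(I_i(a), I_j(b)).
in_nbrs = {            # hypothetical in-neighbor lists I(v)
    "u1": [], "u2": [],
    "p1": ["u1", "u2"], "p2": ["u1"], "p3": ["u2"],
}
C = 0.8
nodes = list(in_nbrs)
s = {(x, y): 1.0 if x == y else 0.0 for x in nodes for y in nodes}

for _ in range(10):
    new = {}
    for x in nodes:
        for y in nodes:
            if x == y:
                new[(x, y)] = 1.0
            elif not in_nbrs[x] or not in_nbrs[y]:
                new[(x, y)] = 0.0          # |I(a)| or |I(b)| is 0 -> s = 0
            else:
                total = sum(s[(i, j)] for i in in_nbrs[x] for j in in_nbrs[y])
                new[(x, y)] = C * total / (len(in_nbrs[x]) * len(in_nbrs[y]))
    s = new

print(round(s[("p1", "p2")], 3), round(s[("p2", "p3")], 3))
```

p1 and p2 share the in-neighbor u1 and so become similar, while p2 and p3 share nothing and stay at 0; the naive version is O(n²) pairs per iteration, which motivates the LinkClus speedup on the next page.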

Page 96: 智能信息检索

Improving the SimRank Computation

The LinkClus algorithm:
1) The 80/20 principle: the similarity between two nodes of the graph is determined mainly by a subset of the nodes, not by all of them.

Starting from this core idea, LinkClus converts SimRank's global computation into a local, tree-structured computation, greatly improving efficiency.

Page 97: 智能信息检索

References

Link Mining:
[1] Lise Getoor, Christopher P. Diehl, Link Mining: A Survey, SIGKDD, 2005
[2] Ted E. Senator, Link Mining Applications: Progress and Challenges, SIGKDD, 2005
[3] Lise Getoor, Link Mining: A New Data Mining Challenge, SIGKDD, 2003

Similarity Computation:
[1] Glen Jeh, Jennifer Widom, SimRank: A Measure of Structural-Context Similarity, SIGKDD, 2002
[2] Jimeng Sun, Huiming Qu, Deepayan Chakrabarti, Christos Faloutsos, Relevance Search and Anomaly Detection in Bipartite Graphs, SIGKDD, 2005
[3] Xiaoxin Yin, Jiawei Han, Philip S. Yu, LinkClus: Efficient Clustering via Heterogeneous Semantic Links, VLDB, 2006
[4] Xiaoxin Yin, Jiawei Han, Distinguishing Objects with Identical Names in Relational Databases, ICDE, 2007
[5] Zhenjiang Lin, Irwin King, Michael R. Lyu, PageSim: A Novel Link-based Similarity Measure for the World Wide Web, WWW, 2006