Document Representation


Transcript of Document Representation

Page 1: Document Representation

Document Representation

• Bag-of-words

• Bag-of-facts

• Bag-of-sentences

• Bag-of-nets

Page 2: Document Representation

Language Modeling in IR

2008-03-06

Page 3: Document Representation

• Document = Bag of Words

• Document = Bag of Sentences,

Sentence = word sequence

p(南京市长) p(江大桥 | 南京市长) <<

p(南京市) p(长江大桥 | 南京市)

(the string 南京市长江大桥 is far more likely segmented as "Nanjing City / Yangtze River Bridge" than as "Nanjing mayor / Jiang Daqiao")

p(中国人民大学) >> p(中国大学人民)

(the correct word order, "Renmin University of China", is far more likely than the same characters reordered)

Page 4: Document Representation

Agenda

• Introduction to Language Model
  – What is LM
  – How can we use LM
  – What are the major issues in LM?

Page 5: Document Representation

What is a LM?

• A "language" is a probability distribution over its alphabet; the distribution reflects how likely any sequence of symbols is to be a sentence (or any other linguistic unit) of that language. We call this probability distribution a language model.
  – Given a language, we can estimate the probability of any "sentence" (string of symbols) occurring in it.
  – For example, for English we expect p1(a quick brown dog) > p2(dog brown a quick) > p3(brown dog 棕熊) > p4(棕熊)
  – If p1 = p2, the model is called a first-order language model; otherwise it is a higher-order language model

Page 6: Document Representation

Basic Notation

• M: the language we are trying to model; it can be thought of as a source

• s: observation (string of tokens)

• P(s|M): the probability of observation "s" under M, that is, the probability of getting "s" when sampling randomly from M

Page 7: Document Representation

Basic Notation

• Let S = s1 s2 … sn be any sentence

• P(S) = P(s1) P(s2|s1) … P(sn|s1, s2, …, sn-1)

• Under an n-gram model,

P(si|s1, s2, …, si-1) = P(si|si-n+1, …, si-1)

• For n = 1 (unigram),

P(si|s1, s2, …, si-1) = P(si)
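
To make the decomposition concrete, here is a minimal Python sketch (illustrative, not part of the original slides) that scores a sentence under a unigram and a bigram model estimated from a tiny toy corpus:

```python
from collections import Counter

# Toy corpus; in practice these counts would come from a document or collection.
corpus = "a quick brown dog saw a brown dog".split()

unigram = Counter(corpus)
bigram = Counter(zip(corpus, corpus[1:]))
N = len(corpus)

def p_unigram(sentence):
    """P(S) = prod_i P(s_i) under the unigram (bag-of-words) model."""
    p = 1.0
    for w in sentence:
        p *= unigram[w] / N
    return p

def p_bigram(sentence):
    """P(S) = P(s_1) * prod_i P(s_i | s_{i-1}) under a bigram model (no smoothing)."""
    p = unigram[sentence[0]] / N
    for prev, cur in zip(sentence, sentence[1:]):
        p *= bigram[(prev, cur)] / unigram[prev]
    return p

# The unigram model scores both orderings identically;
# the bigram model prefers the grammatical one (cf. the word-order example above).
print(p_unigram("a quick brown dog".split()), p_unigram("dog brown a quick".split()))
print(p_bigram("a quick brown dog".split()), p_bigram("dog brown a quick".split()))
```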

Page 8: Document Representation

How can we use LMs in IR

• Use LM to model the process of query generation:

• Every document in a collection defines a “language”

• P(s|MD) defines the probability that the author would write down the string "s"

• Now suppose "q" is the user's query

• P(q|MD) is the probability of drawing "q" when sampling randomly from MD, and can be used as the ranking score of document D in the collection
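
As an illustration (not from the slides), a minimal Python sketch that ranks a toy collection by unigram query likelihood; the document texts, IDs, and function names are made up:

```python
from collections import Counter

# Hypothetical toy collection for illustration only.
docs = {
    "d1": "language models for information retrieval".split(),
    "d2": "a probabilistic model of document retrieval".split(),
}

def query_likelihood(query, doc):
    """Unigram query likelihood P(q|M_D) with maximum-likelihood estimates.
    Unseen query words give probability 0 (the zero-frequency problem,
    addressed later by smoothing)."""
    counts = Counter(doc)
    p = 1.0
    for w in query:
        p *= counts[w] / len(doc)
    return p

query = "information retrieval".split()
ranking = sorted(docs, key=lambda d: query_likelihood(query, docs[d]), reverse=True)
print(ranking)  # documents ordered by P(q|M_D)
```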

Page 9: Document Representation

Other Ways to Rank

• Query likelihood: rank by computing P(Q|MD), i.e., by how well the document model can generate the query.

• Document likelihood: rank by computing P(D|MQ), i.e., by how well the query model can generate the document.

• Model comparison: rank by computing P(MQ || MD), i.e., by the similarity between the query model and the document model.
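
The slide does not name a specific comparison measure; a common instantiation (an assumption here, in the spirit of the Zhai and Lafferty work cited later) is to rank by the negative KL divergence between the query model and the document model. A minimal Python sketch with toy data:

```python
import math
from collections import Counter

def unigram_lm(text, vocab, eps=1e-6):
    """Maximum-likelihood unigram model with a tiny floor to avoid log(0)."""
    counts = Counter(text)
    total = len(text)
    return {w: max(counts[w] / total, eps) for w in vocab}

def neg_kl(query_lm, doc_lm):
    """Rank value: -KL(M_Q || M_D); larger means the two models are more similar."""
    return -sum(p * math.log(p / doc_lm[w]) for w, p in query_lm.items())

doc = "language models for information retrieval".split()
query = "information retrieval".split()
vocab = set(doc) | set(query)
score = neg_kl(unigram_lm(query, vocab), unigram_lm(doc, vocab))
print(score)
```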

Page 10: Document Representation

Major issues in applying LMs

• What kind of language model should we use?
  – Unigram or high-order models

• How can we estimate model parameters?
  – Basic model or advanced model
  – Data smoothing approaches

Page 11: Document Representation

Which kind of model is better?

• Unigram model

• Bigram model

• High-order model

Page 12: Document Representation

Unigram LMs

• Words are “sampled” independently of each other

• The joint probability decomposes into a product of marginals
  – P(xyz) = P(x) P(y) P(z)
  – P("brown dog") = P("dog") P("brown")

• Probability estimation: simple counting

Page 13: Document Representation

Higher-order Models

• n-gram: condition on preceding words

• Cache: condition on a window

• Grammar: condition on a parse tree

• Are they useful?

• Parameter estimation is very expensive!

Page 14: Document Representation

Comparison

• Song and Croft report that mixing a unigram model with a bigram model performs about 8% better than using the unigram model alone. However, Victor Lavrenko points out that the multi-gram models used by Song and Croft do not always outperform the unigram model.

• David R. H. Miller also reports that interpolating unigram and bigram models outperforms the unigram model alone.

• Other studies suggest that word order has little effect on retrieval results.

Page 15: Document Representation

Major issues in applying LMs

• What kind of language model should we use?
  – Unigram or high-order models

• How can we estimate model parameters?
  – Basic model or advanced model
  – Data smoothing approaches

Page 16: Document Representation

Estimation of parameter

• Given a string of text S (= Q or D), estimate its LM: Ms

• Basic LMs
  – Maximum-likelihood estimation
  – Zero-frequency problem
  – Discounting techniques
  – Interpolation methods

Page 17: Document Representation

Maximum-likelihood estimation

• Let V be the vocabulary of M, Q = q1 q2 … qm be a query with qi ∈ V, and S = d1 d2 … dn be a document

• Let Ms be the language model of S

• P(Q|Ms) = ?, called the query likelihood

• P(Ms|Q) = P(Q|Ms) P(Ms) / P(Q) ∝ P(Q|Ms) P(Ms), and can be treated as the ranking score of document S

• So we need to estimate P(Q|Ms) and P(Ms)

Page 18: Document Representation

Maximum-likelihood estimation

• Methods for estimating P(Q|Ms):
  – Multivariate Bernoulli model
  – Multinomial model

• Bernoulli model
  – Only considers whether a word occurs in the query, not how many times it occurs. The query is viewed as the outcome of |V| independent Bernoulli trials.
  – P(Q|Ms) = ∏_{w∈Q} P(w|Ms) · ∏_{w∉Q} (1 − P(w|Ms))

Page 19: Document Representation

Maximum-likelihood estimation

• Multinomial model
  – The query is viewed as the outcome of a sequence of multinomial trials, so the number of times each word occurs in the query is taken into account.
  – P(Q|Ms) = ∏_i P(qi|Ms) = ∏_{w∈Q} P(w|Ms)^#(w,Q)

• Both approaches reduce to estimating P(w|Ms); that is, the IR problem becomes the problem of estimating the document's language model, so existing LM research can be reused.
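
A minimal sketch (illustrative, not from the slides) contrasting the multinomial and multivariate Bernoulli estimates of P(Q|Ms) for a toy query; the document model P(w|Ms) is assumed to be given:

```python
import math
from collections import Counter

# Hypothetical document model P(w|Ms) over a tiny vocabulary.
p_w = {"information": 0.3, "retrieval": 0.3, "language": 0.2, "model": 0.2}

def multinomial_likelihood(query, p_w):
    """P(Q|Ms) = prod_w P(w|Ms)^#(w,Q): the counts of query terms matter."""
    counts = Counter(query)
    return math.prod(p_w[w] ** c for w, c in counts.items())

def bernoulli_likelihood(query, p_w):
    """P(Q|Ms) = prod_{w in Q} P(w|Ms) * prod_{w not in Q} (1 - P(w|Ms)):
    only the presence or absence of each vocabulary word matters."""
    present = set(query)
    p = 1.0
    for w, pw in p_w.items():
        p *= pw if w in present else (1.0 - pw)
    return p

q = "information retrieval retrieval".split()
print(multinomial_likelihood(q, p_w), bernoulli_likelihood(q, p_w))
```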

Page 20: Document Representation

Maximum-likelihood estimation

• The simplest approach is maximum-likelihood estimation: count the relative frequencies of words in S

P(w|Ms) = #(w,S) / |S|

• Zero-frequency problem (caused by data sparseness)
  – If some event does not occur in S, its estimated probability is 0!
  – This is not correct, and we need to avoid it
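
A minimal sketch (toy document, illustrative names) of the maximum-likelihood estimate and the zero-frequency problem it causes:

```python
from collections import Counter

doc = "to be or not to be".split()
counts = Counter(doc)

def p_ml(w):
    """Maximum-likelihood estimate P(w|Ms) = #(w,S) / |S|."""
    return counts[w] / len(doc)

print(p_ml("be"))        # 2/6
print(p_ml("question"))  # 0.0 -- the zero-frequency problem: any query
                         # containing this word scores 0 for the document
```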

Page 21: Document Representation

Discounting Method

• Laplace correction (add-one approach):
  – Add 1 to every count (and renormalize)
  – P(w|Ms) = (#(w,S) + 1) / (|S| + |V|)
  – Problematic for large vocabularies (when |V| is very large)

• Lidstone correction (generalized add-one method)
  – Add a small constant to every count

• Leave-one-out discounting
  – Remove one word, compute P(S|Ms), repeat for every word in the document, and maximize the overall likelihood

• Ref. Chen SF and Goodman JT, An Empirical Study of Smoothing Techniques for Language Modeling, Proc. 34th Annual Meeting of the ACL, 1996

Page 22: Document Representation

Smoothing methods

• Discounting methods treat all unseen words in the same way, but in reality they differ; we can use background knowledge, for example knowledge of the English language.

• P(w|Ms) = c · PML(w|Ms) + (1 − c) · P(w)

• PML(w|Ms) is the conditional (maximum-likelihood) probability
• P(w) = P(w|REF) is the prior (background) probability

Page 23: Document Representation

Additive smoothing methods

• P(w|Ms) = (#(w,S) + c) / (|S| + c|V|)

• P(w) = 1/|V|

• The extra mass given to each word is c / (|S| + c|V|); equivalently, the weight on the background model P(w) is c|V| / (|S| + c|V|)
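
A minimal sketch of additive smoothing under the formula above; the vocabulary and the constant c are illustrative:

```python
from collections import Counter

def additive_smoothed_lm(doc, vocab, c=0.5):
    """P(w|Ms) = (#(w,S) + c) / (|S| + c|V|).
    c = 1 gives the Laplace (add-one) correction; smaller c is the Lidstone variant."""
    counts = Counter(doc)
    denom = len(doc) + c * len(vocab)
    return {w: (counts[w] + c) / denom for w in vocab}

doc = "to be or not to be".split()
vocab = set(doc) | {"question"}            # illustrative vocabulary
lm = additive_smoothed_lm(doc, vocab, c=1.0)
print(lm["be"], lm["question"])            # the unseen word now has non-zero probability
print(abs(sum(lm.values()) - 1.0) < 1e-9)  # the smoothed distribution still sums to 1
```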

Page 24: Document Representation

Jelinek-Mercer method

• Set c to be a constant, independent of the document and the query

• Tune it to optimize retrieval performance on different databases, query sets, etc.

Page 25: Document Representation

Dirichlet method

• c = N/(N + u), 1 − c = u/(N + u)

• N: the sample size, i.e., the length of S; u is a parameter
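
A minimal sketch (toy data, illustrative names) of query-likelihood scoring with Jelinek-Mercer and Dirichlet smoothing in the interpolated form above; the reference model P(w|REF) is estimated from a toy collection:

```python
import math
from collections import Counter

collection = "language models for information retrieval a probabilistic model".split()
p_ref = {w: c / len(collection) for w, c in Counter(collection).items()}  # P(w|REF)

def log_query_likelihood(query, doc, mode="dirichlet", lam=0.5, mu=2.0):
    """log P(Q|Ms) with P(w|Ms) = c*PML(w|Ms) + (1-c)*P(w|REF).
    Jelinek-Mercer: c = lam (a constant).  Dirichlet: c = N/(N+mu), N = |S|."""
    counts, n = Counter(doc), len(doc)
    c = lam if mode == "jm" else n / (n + mu)
    score = 0.0
    for w in query:
        p_ml = counts[w] / n
        score += math.log(c * p_ml + (1 - c) * p_ref.get(w, 1e-9))
    return score

doc = "language models for information retrieval".split()
q = "information retrieval".split()
print(log_query_likelihood(q, doc, mode="jm"), log_query_likelihood(q, doc, mode="dirichlet"))
```

In practice lam and mu would be tuned on held-out queries, which is the point made on the Jelinek-Mercer slide about tuning per database and query set.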

Page 26: Document Representation

The effect of smoothing on retrieval performance

• Zhai CX, Lafferty J, A study of smoothing methods for language models applied to ad hoc information retrieval, ACM SIGIR 2001

• Zhai CX, Lafferty J, A study of smoothing methods for language models applied to information retrieval, ACM TOIS 22(2), pp. 179-214

• Smoothing plays two roles: one is estimation, solving the zero-probability problem; the other is query modeling, eliminating or reducing the effect of noise

Page 27: Document Representation

Translation Models

• Basic LMs do not address word synonymy.

• P(q|M) = ∑w P(w|M) P(q|w)

• P(q|w) captures the relationship between q and w; if q and w are near-synonyms, this value is relatively large.

• P(q|w) can be computed from word co-occurrence, shared stems, dictionaries, etc.; this is the key step of the method.

• P(w|M) is the probability of w under the language model.
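
A minimal sketch of the translation-model score P(q|M) = ∑_w P(w|M) P(q|w); the document model and translation table below are hand-made toy values, not a real estimation method:

```python
# Hypothetical document language model P(w|M) and translation table P(q|w).
p_w_given_m = {"car": 0.4, "engine": 0.3, "road": 0.3}
p_q_given_w = {                      # P(q|w): how strongly w "translates" to the query term
    ("automobile", "car"): 0.7,
    ("automobile", "engine"): 0.1,
}

def translation_score(q, p_w_given_m, p_q_given_w):
    """P(q|M) = sum_w P(w|M) * P(q|w): a synonym of a document word still receives mass."""
    return sum(p_w * p_q_given_w.get((q, w), 0.0) for w, p_w in p_w_given_m.items())

print(translation_score("automobile", p_w_given_m, p_q_given_w))  # non-zero even though
                                                                   # "automobile" is not in the document
```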

Page 28: Document Representation

LM Tools

• LEMUR
  – www.cs.cmu.edu/~lemur
  – CMU/UMass joint project
  – C++, good documentation, forum-based support
  – Ad-hoc IR, clustering, Q/A systems
  – ML + smoothing, …

• YARI
  – [email protected]
  – Ad-hoc IR, cross-language, classification
  – ML + smoothing, …

Page 29: Document Representation

Other applications of LM

• Topic detection and tracking
  – Treat "q" as a topic description

• Classification/ filtering

• Cross-language retrieval

• Multi-media retrieval

Page 30: Document Representation

References

• Ponte JM, Croft WB, A Language Modeling Approach to Information Retrieval, ACM SIGIR 1998, pp. 275-281

• Ponte JM, A Language Modeling Approach to Information Retrieval, PhD Dissertation, UMass, 1998

Page 31: Document Representation

Bag-of-nets

• 如果文本的概念用本体来表达 , 也就是将从文本中抽取出的概念放在领域本体的背景下 , 形成一个概念的网络 , 情况将如何呢 ?

• 可否利用 Bayesian Network? 关键是怎么理解词与词之间的关系 , 是否具有因果关系 ?

• 比如上下位关系 ? 关联关系 ?

Page 32: Document Representation