Text Information Retrieval - Text Operations


Transcript of Text Information Retrieval - Text Operations

Page 1: Text Information Retrieval - Text Operations

Multimedia Computing Institute of NJU (南京大学多媒体研究所)

Text Information Retrieval - Text Operations

Wu Gangshan (武港山)

Tel: 83594243

Office: Mong Man Wai Building (蒙民伟楼), Room 608B

Email: [email protected]

Page 2: Text Information Retrieval - Text Operations

Architecture of an Information Retrieval System

[Figure: block diagram of an IR system. Components: user interface; user need; user feedback; query; query language and query processing; documents; document processing; text processing; logical view; indexing; index (inverted file); text database; database management; search; ranking; ranked documents; retrieved documents; indexing and retrieval; specific application systems (CLIR, QA, Web).]
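To make the figure's text-processing / indexing / search path concrete, here is a toy sketch in Python; the function names, the tokenizer, and the two-document corpus are all illustrative, not part of the course material:

```python
import re
from collections import defaultdict

def build_inverted_index(docs):
    # docs: {doc_id: text}. Text processing produces the logical view
    # (here: lowercase alphabetic tokens); indexing builds the inverted file.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in re.findall(r"[a-z]+", text.lower()):
            index[term].add(doc_id)
    return index

def search(index, term):
    # Query processing + search: look the term up in the inverted file.
    return index.get(term, set())

idx = build_inverted_index({1: "text operations", 2: "text retrieval"})
print(search(idx, "text"))  # {1, 2}
```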

Page 3: Text Information Retrieval - Text Operations

Outline

- Document preprocessing
- Document classification
- Document clustering
- Document summarization
- Document compression

Page 4: Text Information Retrieval - Text Operations

Document Preprocessing

- Tokenization
- Stopword removal
- Lemmatization
- Stemming
- Metadata and markup languages

Page 5: Text Information Retrieval - Text Operations

Simple Tokenization

- Analyze text into a sequence of discrete tokens (words).
- Sometimes punctuation (e-mail), numbers (1999), and case (Republican vs. republican) can be a meaningful part of a token. However, frequently they are not.
- The simplest approach is to ignore all numbers and punctuation and use only case-insensitive, unbroken strings of alphabetic characters as tokens (see the sketch below).
- A more careful approach: separate on ? ! ; : " ' [ ] ( ) < >; take care with "." and "-" (why? when?); take care with "…".
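A minimal sketch of the simplest approach, assuming a regex over lowercase alphabetic runs is acceptable (the function name is mine, not the slides'):

```python
import re

def simple_tokenize(text):
    # Ignore numbers and punctuation entirely; a token is an unbroken
    # run of alphabetic characters, compared case-insensitively.
    return re.findall(r"[a-z]+", text.lower())

print(simple_tokenize("Republican vs. republican, e-mail, 1999"))
# -> ['republican', 'vs', 'republican', 'e', 'mail']
```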

Page 6: Text Information Retrieval - Text Operations

Punctuation

- Ne'er: use language-specific mappings to normalize.
- State-of-the-art: break up the hyphenated sequence.
- U.S.A. vs. USA
- a.out

Page 7: Text Information Retrieval - Text Operations

Numbers

- 3/12/91
- Mar. 12, 1991
- 55 B.C.
- B-52
- 100.2.86.144

Generally, don't index numbers as text.

Page 8: Text Information Retrieval - Text Operations

Case folding

- Reduce all letters to lower case.
- Exception: upper case in mid-sentence, e.g., General Motors; Fed vs. fed; SAIL vs. sail.
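A minimal sketch of case folding with the mid-sentence exception; the sentence-boundary test here is a crude assumption of mine, not a prescription from the slides:

```python
def case_fold(tokens):
    # Lowercase everything except upper-case tokens in mid-sentence,
    # a crude proxy for names and acronyms like "General Motors" or "SAIL".
    out = []
    for i, tok in enumerate(tokens):
        sentence_initial = (i == 0) or tokens[i - 1].endswith((".", "!", "?"))
        out.append(tok.lower() if (sentence_initial or tok.islower()) else tok)
    return out

print(case_fold(["The", "Fed", "raised", "rates."]))
# -> ['the', 'Fed', 'raised', 'rates.']
```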

Page 9: Text Information Retrieval - Text Operations

Tokenizing HTML

- Should text in HTML markup not typically seen by the user be included as tokens? E.g., words appearing in URLs, or words appearing in the "meta text" of images.
- The simplest approach is to exclude all HTML tag information (between "<" and ">") from tokenization.
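A minimal sketch of that simplest approach, assuming a regex is acceptable for stripping tags (real HTML, with comments and scripts, is messier than this):

```python
import re

def tokenize_html(html):
    # Drop everything between "<" and ">", then tokenize the
    # remaining visible text.
    visible = re.sub(r"<[^>]*>", " ", html)
    return re.findall(r"[a-z]+", visible.lower())

print(tokenize_html('<a href="http://nju.edu.cn"><b>NJU</b> home</a>'))
# -> ['nju', 'home']
```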

Page 10: Text Information Retrieval - Text Operations

Stopwords

- It is typical to exclude high-frequency words (e.g., function words: "a", "the", "in", "to"; pronouns: "I", "he", "she", "it").
- Stopwords are language dependent.
- For efficiency, store the stopword strings in a hashtable to recognize them in constant time, e.g., a simple Perl hashtable for Perl-based implementations (see the sketch below).
- How to determine a list of stopwords? For English, existing lists can be used, e.g., SMART's common-word list (~400 words) or the WordNet stopword list. For Spanish? Bulgarian?
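A minimal sketch of constant-time stopword filtering, in Python rather than the Perl the slide mentions; the word list is an illustrative subset, not SMART's:

```python
# A Python set is hash-based, so membership tests take O(1) expected
# time -- the same trick as the Perl hashtable mentioned on the slide.
STOPWORDS = {"a", "the", "in", "to", "i", "he", "she", "it"}  # illustrative subset

def remove_stopwords(tokens):
    return [t for t in tokens if t not in STOPWORDS]

print(remove_stopwords(["the", "cat", "in", "the", "hat"]))
# -> ['cat', 'hat']
```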

Page 11: Text Information Retrieval - Text Operations

Lemmatization

- Reduce inflectional/variant forms to the base form; this has a direct impact on VOCABULARY size. E.g.:
  - am, are, is → be
  - car, cars, car's, cars' → car
  - "the boy's cars are different colors" → "the boy car be different color"
- How to do this? Need a list of grammatical rules plus a list of irregular words: children → child, spoken → speak, …
- Practical implementation: use WordNet's morphstr function (a Python near-equivalent is sketched below).
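morphstr belongs to the WordNet C library; a rough Python near-equivalent, assuming NLTK with the WordNet data installed, is wn.morphy:

```python
# Requires NLTK and its WordNet data:
#   pip install nltk; then nltk.download("wordnet") once.
from nltk.corpus import wordnet as wn

for word, pos in [("cars", wn.NOUN), ("are", wn.VERB), ("children", wn.NOUN)]:
    # morphy applies detachment rules plus WordNet's irregular-form lists.
    print(word, "->", wn.morphy(word, pos))
# cars -> car, are -> be, children -> child
```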

Page 12: Text Information Retrieval - Text Operations

Stemming

- Reduce tokens to the "root" form of words to recognize morphological variation: "computer", "computational", and "computation" are all reduced to the same token "compute".
- Correct morphological analysis is language specific and can be complex.
- Stemming "blindly" strips off known affixes (prefixes and suffixes) in an iterative fashion, so that "for example, compressed and compression are both accepted as equivalent to compress" stems to:

  "for exampl compres and compres are both accept as equival to compres."
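A sketch of the same example using NLTK's Porter stemmer, an assumption of convenience; the slide does not prescribe NLTK, and its output differs slightly from the slide's rendering:

```python
# Requires NLTK: pip install nltk
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
sentence = ("for example compressed and compression are "
            "both accepted as equivalent to compress")
print(" ".join(stemmer.stem(w) for w in sentence.split()))
# -> for exampl compress and compress are both accept as equival to compress
```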

Page 13: Text Information Retrieval - Text Operations

Porter Stemmer

- A simple procedure for removing known affixes in English without using a dictionary.
- Can produce unusual stems that are not English words: "computer", "computational", and "computation" are all reduced to the same token "comput".
- May conflate (reduce to the same token) words that are actually distinct.
- Does not recognize all morphological derivations.

Page 14: Text Information Retrieval - Text Operations

Typical rules in Porter

- sses → ss
- ies → i
- ational → ate
- tional → tion

See the class website for a link to the "official" Porter stemmer site, which provides ready-to-use Perl and C implementations.
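A minimal sketch of applying just these four rules, longest suffix first; real Porter additionally checks a "measure" condition on the remaining stem, which is omitted here:

```python
RULES = [("sses", "ss"), ("ational", "ate"), ("tional", "tion"), ("ies", "i")]

def apply_rules(word):
    # Rewrite the first matching suffix; ordering puts longer,
    # more specific suffixes ("ational") before shorter ones ("tional").
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            return word[: -len(suffix)] + replacement
    return word

for w in ["caresses", "ponies", "relational", "conditional"]:
    print(w, "->", apply_rules(w))
# caresses -> caress, ponies -> poni, relational -> relate,
# conditional -> condition
```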

Page 15: Text Information Retrieval - Text Operations

Porter Stemmer Errors

Errors of "commission" (words wrongly conflated):
- organization, organ → organ
- police, policy → polic
- arm, army → arm

Errors of "omission" (variants left distinct):
- cylinder, cylindrical
- create, creation
- Europe, European

Page 16: Text Information Retrieval - Text Operations

Other stemmers

- Other stemmers exist, e.g., the Lovins stemmer:
  http://www.comp.lancs.ac.uk/computing/research/stemming/general/lovins.htm
  - Single-pass, longest-suffix removal (about 250 rules)
  - Motivated by linguistics as well as IR
- Full morphological analysis brings only modest benefits for retrieval.

Page 17: Text Information Retrieval - Text Operations

Metadata

- Information about a document that may not be a part of the document itself (data about data).
- Often included in Web pages; hidden from the browser, but useful for indexing.
- Descriptive metadata is external to the meaning of the document: author, title, source (book, magazine, newspaper, journal), date, ISBN, publisher, length (see the sketch below).
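A minimal sketch of descriptive metadata as a record type; the field names follow the slide, while the class itself and the sample values are purely illustrative:

```python
from dataclasses import dataclass

@dataclass
class DescriptiveMetadata:
    # The fields listed on the slide; all describe the document
    # rather than forming part of its content.
    author: str
    title: str
    source: str      # book, magazine, newspaper, journal
    date: str
    isbn: str
    publisher: str
    length: int      # e.g., number of pages

meta = DescriptiveMetadata("A. Author", "An Example Title", "journal",
                           "1999", "0-000-00000-0", "Example Press", 12)
```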

Page 18: Text Information Retrieval - Text Operations

Text Properties

- Distribution of words in language; why is it important?
- Zipf's Law
- Heaps' Law

Page 19: Text Information Retrieval - Text Operations

Statistical Properties of Text

How is the frequency of different words distributed?

How fast does vocabulary size grow with the size of a corpus?

Such factors affect the performance of information retrieval and can be used to select appropriate term weights and other aspects of an IR system.

Page 20: Text Information Retrieval - Text Operations

Word Frequency

- A few words are very common: the 2 most frequent words (e.g., "the", "of") can account for about 10% of word occurrences.
- Most words are very rare: half the words in a corpus appear only once; these are called hapax legomena (Greek for "read only once").
- This is called a "heavy-tailed" distribution, since most of the probability mass is in the "tail".
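A minimal sketch for checking these proportions on a corpus; "corpus.txt" is a hypothetical stand-in file, not something provided with the course:

```python
import re
from collections import Counter

def word_stats(text):
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    top2 = sum(f for _, f in counts.most_common(2))
    hapax = sum(1 for f in counts.values() if f == 1)
    print(f"top-2 words cover {top2 / len(tokens):.1%} of occurrences")
    print(f"hapax legomena: {hapax / len(counts):.1%} of the vocabulary")

word_stats(open("corpus.txt").read())  # hypothetical corpus file
```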

Page 21: Text Information Retrieval - Text Operations

Sample Word Frequency Data (from B. Croft, UMass) [table not reproduced]

Page 22: Text Information Retrieval - Text Operations

Zipf's Law

- Rank (r): the numerical position of a word in a list sorted by decreasing frequency (f).
- Zipf (1949) "discovered" that frequency and rank satisfy

  $f \cdot r = k$  (for constant $k$)

- If the probability of the word of rank $r$ is $p_r$ and $N$ is the total number of word occurrences:

  $p_r = \frac{f}{N} = \frac{A}{r}$,  with $A \approx 0.1$ a constant roughly independent of the corpus
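A quick empirical check that $f \cdot r$ stays roughly constant across ranks, again using a hypothetical corpus file:

```python
import re
from collections import Counter

tokens = re.findall(r"[a-z]+", open("corpus.txt").read().lower())  # stand-in file
freqs = sorted(Counter(tokens).values(), reverse=True)

# Under Zipf's law, frequency * rank should be roughly the same constant k.
for r in [1, 10, 100, 1000]:
    if r <= len(freqs):
        print(f"rank {r:4d}: f*r = {r * freqs[r - 1]}")
```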

Page 23: Text Information Retrieval - Text Operations

Zipf and Term Weighting

Luhn (1958) suggested that both extremely common and extremely uncommon words were not very useful for indexing.

Page 24: Text Information Retrieval - Text Operations

Predicting Occurrence Frequencies

- By Zipf, a word appearing $n$ times has rank $r_n = AN/n$.
- Several words may occur $n$ times; assume rank $r_n$ applies to the last of these.
- Therefore, $r_n$ words occur $n$ or more times and $r_{n+1}$ words occur $n+1$ or more times.
- So the number of words appearing exactly $n$ times is:

  $I_n = r_n - r_{n+1} = \frac{AN}{n} - \frac{AN}{n+1} = \frac{AN}{n(n+1)}$

Page 25: Text Information Retrieval - Text Operations

Predicting Word Frequencies (cont'd)

- Assume the highest-ranking term occurs once and therefore has rank $D = AN/1$.
- The fraction of words with frequency $n$ is:

  $\frac{I_n}{D} = \frac{1}{n(n+1)}$

- The fraction of words appearing only once is therefore $1/2$.
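The predicted fractions are easy to tabulate (pure arithmetic, no corpus needed):

```python
# Predicted fraction of the vocabulary occurring exactly n times:
# I_n / D = 1 / (n * (n + 1)).
for n in range(1, 6):
    print(f"n = {n}: {1 / (n * (n + 1)):.3f}")
# n = 1: 0.500, n = 2: 0.167, n = 3: 0.083, n = 4: 0.050, n = 5: 0.033
```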

Page 26: Text Information Retrieval - Text Operations

Occurrence Frequency Data (from B. Croft, UMass) [table not reproduced]

Page 27: Text Information Retrieval - Text Operations

Does Real Data Fit Zipf's Law?

- A law of the form $y = kx^c$ is called a power law.
- Zipf's law is a power law with $c = -1$.
- On a log-log plot, power laws give a straight line with slope $c$:

  $\log(y) = \log(kx^c) = \log k + c \log x$

- Zipf is quite accurate except for very high and low rank.
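A minimal sketch of estimating the exponent $c$ from rank-frequency data, assuming NumPy is available:

```python
import numpy as np

def power_law_slope(freqs):
    # Fit a straight line to (log rank, log frequency); for Zipfian
    # data the slope should come out near -1.
    ranks = np.arange(1, len(freqs) + 1)
    slope, _intercept = np.polyfit(np.log(ranks), np.log(freqs), 1)
    return slope

# e.g. power_law_slope(sorted(Counter(tokens).values(), reverse=True))
```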

Page 28: Text Information Retrieval - Text Operations

Fit to Zipf for Brown Corpus

[Plot not reproduced; fitted constant k = 100,000.]

Page 29: Text Information Retrieval - Text Operations

Mandelbrot (1954) Correction

The following more general form gives a bit better fit:

  $f = P(r + \rho)^{-B}$,  for constants $P$, $B$, $\rho$
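The corrected form as a one-line function (the parameter names are mine):

```python
def mandelbrot_freq(r, P, B, rho):
    # f = P * (r + rho) ** (-B); reduces to Zipf's law when rho = 0 and B = 1.
    return P * (r + rho) ** (-B)
```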

Page 30: Text Information Retrieval - Text Operations

Mandelbrot Fit

[Plot not reproduced; fitted constants $P = 10^{5.4}$, $B = 1.15$, $\rho = 100$.]

Page 31: Text Information Retrieval - Text Operations

Explanations for Zipf's Law

- Zipf's explanation was his "principle of least effort": a balance between the speaker's desire for a small vocabulary and the hearer's desire for a large one.
- Li (1992) shows that just randomly typing letters, including a space, will generate "words" with a Zipfian distribution.
  http://linkage.rockefeller.edu/wli/zipf/
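A minimal sketch of the random-typing experiment; the alphabet size and sample length are arbitrary choices of mine:

```python
import random
from collections import Counter

random.seed(0)
# "Type" random characters from a small alphabet plus space; the
# space-delimited chunks are the "words".
text = "".join(random.choice("abcde ") for _ in range(200_000))
freqs = sorted(Counter(text.split()).values(), reverse=True)

# Inspect the rank-frequency curve; it should fall off like a power law.
for r in [1, 10, 100]:
    print(f"rank {r}: frequency {freqs[r - 1]}")
```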

Page 32: Text Information Retrieval - Text Operations

Zipf’s Law Impact on IR

Good News: Stopwords will account for a large fraction of text so eliminating them greatly reduces inverted-index storage costs.

Bad News: For most words, gathering sufficient data for meaningful statistical analysis (e.g. for correlation analysis for query expansion) is difficult since they are extremely rare.

Page 33: Text Information Retrieval - Text Operations

Vocabulary Growth

- How does the size of the overall vocabulary (number of unique words) grow with the size of the corpus?
- This determines how the size of the inverted index will scale with the size of the corpus.
- Vocabulary is not really upper-bounded, due to proper names, typos, etc.

Page 34: Text Information Retrieval - Text Operations

Heaps' Law

- If $V$ is the size of the vocabulary and $n$ is the length of the corpus in words:

  $V = Kn^{\beta}$,  with constants $K$ and $0 < \beta < 1$

- Typical constants: $K \approx 10$-$100$, $\beta \approx 0.4$-$0.6$ (approximately square-root growth).
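A minimal sketch estimating $\beta$ from a corpus, assuming NumPy and the same hypothetical corpus file as before:

```python
import re
import numpy as np

tokens = re.findall(r"[a-z]+", open("corpus.txt").read().lower())  # stand-in file

# Record the vocabulary size V at increasing corpus prefixes of length n.
ns, vs, seen = [], [], set()
for i, tok in enumerate(tokens, 1):
    seen.add(tok)
    if i % 10_000 == 0:
        ns.append(i)
        vs.append(len(seen))

# Fit log V = log K + beta * log n; beta should land near 0.4-0.6.
beta, logK = np.polyfit(np.log(ns), np.log(vs), 1)
print(f"beta = {beta:.2f}, K = {np.exp(logK):.1f}")
```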

Page 35: Text Information Retrieval - Text Operations

Heaps' Law Data [plot not reproduced]

Page 36: Text Information Retrieval - Text Operations

Explanation for Heaps' Law

- Can be derived from Zipf's law by assuming documents are generated by randomly sampling words from a Zipfian distribution.
- Heaps' Law also holds for distributions of other data: our own experiments on the types of questions asked by users show similar behavior.