Lecture 2 : Tolerant Retrieval



Transcript of Lecture 2 : Tolerant Retrieval

Page 1: Lecture 2  : Tolerant Retrieval


Lecture 2: Tolerant Retrieval
Professor 楊立偉
Department of Information Management, National Taiwan University of Science and Technology
[email protected]

These slides are adapted from the slides accompanying the book Introduction to Information Retrieval, Ch. 3.

Page 2: Lecture 2  : Tolerant Retrieval


Recap : Inverted Index

For each term t, we store a list of all documents that contain t.


(Figure: the dictionary maps each term to its postings list.)
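Not part of the original slides: a minimal Python sketch of the idea, with a naive whitespace tokenizer and both dictionary and postings kept in memory.

```python
# Minimal sketch of an inverted index: term -> sorted list of doc IDs containing it.
from collections import defaultdict

def build_inverted_index(docs):
    """docs: dict mapping doc_id -> text."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():      # naive tokenizer, for illustration only
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {1: "new home sales top forecasts", 2: "home sales rise in july"}
index = build_inverted_index(docs)
print(index["home"])   # [1, 2]
print(index["sales"])  # [1, 2]
```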

Page 3: Lecture 2  : Tolerant Retrieval


Intersecting two posting lists

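The intersection example on this slide is a figure in the original deck; below is a minimal sketch of the standard linear-time merge of two sorted postings lists (illustrative, not the book's exact pseudocode).

```python
def intersect(p1, p2):
    """Intersect two sorted postings lists in O(len(p1) + len(p2)) time."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:           # doc appears in both lists
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1                   # advance the pointer with the smaller doc ID
        else:
            j += 1
    return answer

print(intersect([1, 2, 4, 11, 31, 45, 173], [1, 2, 4, 5, 6, 16, 57, 173]))  # [1, 2, 4, 173]
```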

Page 4: Lecture 2  : Tolerant Retrieval


Inverted index with position

(Figure: a character-level positional inverted index over Doc1: 台灣大哥台 and Doc2: 台灣大學. Indexed documents are read in one at a time; the dictionary, kept in a hashtable or tree/trie, maps each character term to an inverted list of (doc : position) postings. For the query 台大 AND 大哥 with a tolerance distance of 1: (1) decide to merge the postings of (大) and (哥) first, then merge (台) and (大); (2) merge the two intermediate results. The query returns Doc1.)
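Not from the original slides: a minimal sketch of such a character-level positional index, where each posting records a (doc_id, position) pair; the distance-tolerant merge itself is sketched later under proximity intersection.

```python
from collections import defaultdict

def build_positional_index(docs):
    """docs: dict doc_id -> text. Index each character with its 1-based position."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for pos, ch in enumerate(text, start=1):
            index[ch].append((doc_id, pos))    # postings stay sorted by (doc_id, pos)
    return index

docs = {1: "台灣大哥台", 2: "台灣大學"}
index = build_positional_index(docs)
print(index["大"])  # [(1, 3), (2, 3)]
print(index["哥"])  # [(1, 4)]
```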

Page 5: Lecture 2  : Tolerant Retrieval


Google uses the Boolean model, too

On Google, the default interpretation of a query [w1 w2 . . . wn] is w1 AND w2 AND . . . AND wn.

Simple Boolean retrieval returns matching documents in no particular order.

Google (and most well-designed Boolean engines) rank the result set: they rank good hits (according to some estimator of relevance) higher than bad hits.

Page 6: Lecture 2  : Tolerant Retrieval


Spelling correction


Page 7: Lecture 2  : Tolerant Retrieval


Spelling correction

Two principal uses (correct at indexing time, or at query time?):
- Correcting documents being indexed
- Correcting user queries

Two different methods for spelling correction:
- Isolated-word spelling correction (correcting a single word on its own): check each word on its own for misspelling. This will not catch typos that result in correctly spelled words, e.g., "an asteroid that fell form the sky".
- Context-sensitive spelling correction (correcting based on context): look at the surrounding words. This can correct the form/from error above, e.g., "I flew form Taipei to Tokyo."

Page 8: Lecture 2  : Tolerant Retrieval


Correcting queries

For isolated-word spelling correction:
- Premise 1: There is a list of “correct words” from which the correct spellings come.
- Premise 2: We have a way of computing the distance between a misspelled word and a correct word.

Simple spelling correction algorithm: return the “correct” word that has the smallest distance to the misspelled word.
Example: informaton → information

For the list of correct words, we can use the vocabulary of all words that occur in our collection → the dictionary.

Page 9: Lecture 2  : Tolerant Retrieval


(Figure: dictionary and postings lists, as in the recap.)

Page 10: Lecture 2  : Tolerant Retrieval


Alternatives to using the term vocabulary

- A standard dictionary (Webster's, OED, etc.)
- An industry-specific dictionary (for specialized IR systems), e.g., a dictionary for law, electronics, medicine, etc.
- The term vocabulary of the collection (the corpus)
- The term vocabulary of the query log

Page 11: Lecture 2  : Tolerant Retrieval


Distance between misspelled word and “correct” word

1. Edit distance (Levenshtein distance)
2. Weighted edit distance
3. k-gram overlap

Page 12: Lecture 2  : Tolerant Retrieval


(1) Edit distance

The edit distance between string s1 and string s2 is the minimum number of basic operations that convert s1 into s2.
Levenshtein distance: the admissible basic operations are insert, delete, and replace.

Exercise:
- dog-do : 1
- cat-cart : 1
- cat-cut : 1
- cat-act : 2
- hordes-lords : ?
- water-wine : ?

Page 13: Lecture 2  : Tolerant Retrieval


Levenshtein distance: Computation


Ref: See http://www.merriampark.com/ld.htm for more details

Page 14: Lecture 2  : Tolerant Retrieval


Levenshtein distance: Algorithm


Page 15: Lecture 2  : Tolerant Retrieval


Levenshtein distance: Algorithm


Page 16: Lecture 2  : Tolerant Retrieval


Levenshtein distance: Algorithm


Page 17: Lecture 2  : Tolerant Retrieval


Levenshtein distance: Algorithm


Page 18: Lecture 2  : Tolerant Retrieval


Levenshtein distance: Algorithm


Page 19: Lecture 2  : Tolerant Retrieval


Levenshtein distance: Example


Page 20: Lecture 2  : Tolerant Retrieval


Each cell of Levenshtein matrix


Each cell combines:
- the cost of getting here from my upper-left neighbor (copy or replace)
- the cost of getting here from my upper neighbor (delete)
- the cost of getting here from my left neighbor (insert)
- the minimum of the three possible “movements”: the cheapest way of getting here
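The algorithm slides above are figures in the original deck; here is a minimal sketch of the standard dynamic-programming computation they describe, where each cell keeps only the cheapest of the three possible moves.

```python
def levenshtein(s1, s2):
    """Minimum number of insert/delete/replace operations turning s1 into s2."""
    m, n = len(s1), len(s2)
    dist = [[0] * (n + 1) for _ in range(m + 1)]   # dist[i][j]: distance s1[:i] -> s2[:j]
    for i in range(m + 1):
        dist[i][0] = i                             # delete all of s1[:i]
    for j in range(n + 1):
        dist[0][j] = j                             # insert all of s2[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if s1[i - 1] == s2[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j - 1] + sub,   # upper-left: copy or replace
                             dist[i - 1][j] + 1,         # upper: delete
                             dist[i][j - 1] + 1)         # left: insert
    return dist[m][n]

print(levenshtein("cat", "cart"))  # 1
print(levenshtein("cat", "act"))   # 2
```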

Page 21: Lecture 2  : Tolerant Retrieval


Dynamic programming

Optimal substructure: the optimal solution to the problem contains within it subsolutions, i.e., optimal solutions to subproblems (recursive).

Overlapping subsolutions: the subsolutions overlap and are computed over and over again when a recursive algorithm computes the global optimum.

Dynamic programming avoids this repeated computation.

Page 22: Lecture 2  : Tolerant Retrieval


(2) Weighted edit distance

As above, but the weight of an operation depends on the characters involved.
Meant to capture keyboard errors, e.g., m is more likely to be mistyped as n than as q; therefore, replacing m with n is a smaller edit distance than replacing it with q.
The same idea is usable for OCR, e.g., O is closer to Q than to Z.
We now require a weight matrix as input, and the dynamic programming is modified to handle weights.
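A minimal sketch of that modification (not the book's code): the same DP as above, with a caller-supplied substitution-cost function standing in for the weight matrix.

```python
def weighted_edit_distance(s1, s2, sub_cost, ins_cost=1.0, del_cost=1.0):
    """Edit distance where replacing character a by b costs sub_cost(a, b)."""
    m, n = len(s1), len(s2)
    dist = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dist[i][0] = dist[i - 1][0] + del_cost
    for j in range(1, n + 1):
        dist[0][j] = dist[0][j - 1] + ins_cost
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0.0 if s1[i - 1] == s2[j - 1] else sub_cost(s1[i - 1], s2[j - 1])
            dist[i][j] = min(dist[i - 1][j - 1] + sub,
                             dist[i - 1][j] + del_cost,
                             dist[i][j - 1] + ins_cost)
    return dist[m][n]

# Toy weights: keyboard neighbours such as m/n are cheaper to confuse than m/q.
cheap = {("m", "n"), ("n", "m")}
cost = lambda a, b: 0.5 if (a, b) in cheap else 1.0
print(weighted_edit_distance("dinner", "dimmer", cost))  # 1.0 (two cheap n->m replacements)
```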

Page 23: Lecture 2  : Tolerant Retrieval


Using edit distance for spelling correction

Given a query, first enumerate all character sequences within a preset (possibly weighted) edit distance.
Intersect this set with our list of “correct” words.
Then suggest the terms in the intersection to the user.
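Enumerating every sequence within the distance bound is expensive in practice; here is a minimal sketch that produces the same candidate set by scanning the vocabulary with the levenshtein function from the earlier sketch (illustrative only).

```python
def suggest_corrections(query_term, vocabulary, max_dist=2):
    """Return vocabulary words within max_dist edits of query_term, closest first."""
    scored = [(levenshtein(query_term, w), w) for w in vocabulary]
    return [w for d, w in sorted(scored) if d <= max_dist]

vocab = ["information", "informal", "retrieval", "informant"]
print(suggest_corrections("informaton", vocab))  # ['information']
```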

Page 24: Lecture 2  : Tolerant Retrieval


Exercise : Compute Levenshtein distance matrix for OSLO – SNOW

Page 25: Lecture 2  : Tolerant Retrieval


(3) k-gram indexes for spelling correction

Enumerate all k-grams in the query term.
Example: with a bigram index and the misspelled word bordroom, the bigrams are bo, or, rd, dr, ro, oo, om.
Use the k-gram index to retrieve “correct” words that match the query term's k-grams.
Threshold by the number of matching k-grams, e.g., keep only vocabulary terms that differ by at most 3 k-grams.
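Not from the slides: a minimal sketch of a bigram index over the vocabulary, used to pull candidate terms that share enough bigrams with the misspelled word.

```python
from collections import defaultdict

def bigrams(word):
    return {word[i:i + 2] for i in range(len(word) - 1)}

def build_bigram_index(vocabulary):
    """Map each bigram to the set of vocabulary terms containing it."""
    index = defaultdict(set)
    for term in vocabulary:
        for bg in bigrams(term):
            index[bg].add(term)
    return index

def kgram_candidates(query, index, min_overlap=2):
    """Vocabulary terms sharing at least min_overlap bigrams with the query."""
    counts = defaultdict(int)
    for bg in bigrams(query):
        for term in index[bg]:
            counts[term] += 1
    return sorted(t for t, c in counts.items() if c >= min_overlap)

vocab = ["border", "boardroom", "bedroom", "lord", "aboard"]
index = build_bigram_index(vocab)
print(kgram_candidates("bordroom", index, min_overlap=3))  # ['bedroom', 'boardroom', 'border']
```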

Page 26: Lecture 2  : Tolerant Retrieval


k-gram indexes for spelling correction: bordroom


Page 27: Lecture 2  : Tolerant Retrieval


Example with trigrams

• Suppose the text is november: its trigrams are nov, ove, vem, emb, mbe, ber.
• The query is december: its trigrams are dec, ece, cem, emb, mbe, ber.
• So 3 trigrams overlap (out of 6 in each term).
• Design a normalized measure of overlap (normalize the matching score).

Page 28: Lecture 2  : Tolerant Retrieval


Jaccard coefficient

• A commonly used measure of overlap.
• Let X and Y be two sets; then the Jaccard coefficient is

  J(X, Y) = |X ∩ Y| / |X ∪ Y|

• X and Y don't have to be of the same size.
• It equals 1 when X and Y have the same elements and 0 when they are disjoint, so it always lies between 0 and 1.
  Ex. terms with J.C. > 0.8 can be listed as candidates and ranked from high to low.
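A minimal sketch (not from the slides) of the Jaccard coefficient computed on trigram sets, reusing the november/december example:

```python
def kgrams(word, k=3):
    return {word[i:i + k] for i in range(len(word) - k + 1)}

def jaccard(x, y):
    """|X ∩ Y| / |X ∪ Y|: 1 for identical sets, 0 for disjoint sets."""
    return len(x & y) / len(x | y)

print(jaccard(kgrams("november"), kgrams("december")))  # 3 / 9 ≈ 0.33
```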

Page 29: Lecture 2  : Tolerant Retrieval


Example of matching bigrams

• Consider the query lord: we wish to identify words matching 2 of its 3 bigrams (lo, or, rd).
• Use the inverted index to intersect for the answers.

(Figure: bigram postings. lo → alone, lord, sloth; or → border, lord, morbid; rd → ardent, border, card. Only lord and border match at least two of the bigrams.)

Page 30: Lecture 2  : Tolerant Retrieval


Context-sensitive spelling correction


Page 31: Lecture 2  : Tolerant Retrieval


Context-sensitive spelling correction

Our example was: flew form munich. How can we correct form here?
One idea: hit-based spelling correction.
Retrieve “correct” terms close to each query term; for "flew form munich": flea for flew, from for form, munch for munich.
Now try all possible resulting phrases as queries, with one word “fixed” at a time:
- Try the query “flea form munich”
- Try the query “flew from munich”
- Try the query “flew form munch”
The correct query “flew from munich” has the most hits.

Suppose we have 7 alternatives for flew, 20 for form, and 3 for munich; how many “corrected” phrases will we enumerate?
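A minimal sketch of the hit-based idea (not from the slides); count_hits is a hypothetical stand-in for running a phrase query against the engine and reading off its hit count.

```python
def hit_based_correction(query_terms, alternatives, count_hits):
    """Try phrases with one word swapped at a time; keep the phrase with the most hits.

    alternatives: dict term -> list of nearby "correct" terms.
    count_hits: hypothetical callback returning the hit count of a phrase query.
    """
    best_phrase, best_hits = query_terms, count_hits(query_terms)
    for i, term in enumerate(query_terms):
        for alt in alternatives.get(term, []):
            candidate = query_terms[:i] + [alt] + query_terms[i + 1:]
            hits = count_hits(candidate)
            if hits > best_hits:
                best_phrase, best_hits = candidate, hits
    return best_phrase

alternatives = {"flew": ["flea"], "form": ["from"], "munich": ["munch"]}
fake_hits = lambda p: 1000 if p == ["flew", "from", "munich"] else 3  # toy stand-in
print(hit_based_correction(["flew", "form", "munich"], alternatives, fake_hits))
```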

Page 32: Lecture 2  : Tolerant Retrieval


Context-sensitive spelling correction

The hit-based algorithm we just outlined is not very efficient.
- Run it only on queries that matched few docs (because it slows retrieval down, enable it only when few or no documents are found).
A more efficient alternative: look at the “collection” of queries, not documents.
- Ex. use query-log analysis: compare only against terms from the set of popular queries (this shrinks the set of candidates to compare).
Use probabilistic methods (discussed later): statistical approaches are efficient and also accurate.

Page 33: Lecture 2  : Tolerant Retrieval


Thesaurus


Page 34: Lecture 2  : Tolerant Retrieval


Thesaurus

• Thesaurus: a language-specific list of synonyms for terms likely to be queried
  – car = automobile, etc.
  – hand-made, or maintained with machine learning: algorithms can automatically learn near-synonyms (i.e., co-occurring or substitutable terms) from a large corpus.

Page 35: Lecture 2  : Tolerant Retrieval


Thesaurus

• Example: use link analysis to build a thesaurus from the web.

(Figure: two pages link to www.ibm.com with different anchor text, “Armonk, NY-based computer giant IBM announced today” and “Big Blue today announced record profits for the quarter”, so IBM and Big Blue can be learned as near-synonyms.)

Page 36: Lecture 2  : Tolerant Retrieval


Use thesaurus for recommendation

• Query automobile, retrieve documents containing automobile or car.
• May retrieve more junk
  – Ex. puma → jaguar retrieves documents on cars instead of on sneakers (false matches can occur).

Page 37: Lecture 2  : Tolerant Retrieval


Use thesaurus for recommendation

• How to do it
1. Index expansion
  • Add a document containing automobile or car to both postings lists.
  • When the query is automobile, the answer list also includes documents containing only car.
  • Drawbacks: index blowup, and a normal (non-thesaurus) query can no longer be performed.

(Figure: the postings lists for car and automobile are cross-populated.)

Page 38: Lecture 2  : Tolerant Retrieval


Use thesaurus for recommendation

• How to do it
2. Query expansion
  • Expand the query automobile to (automobile OR car).
  • Drawback: query processing may be slowed down.
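A minimal sketch of query expansion (not from the slides); the thesaurus and index here are toy in-memory dictionaries, assumed only for illustration.

```python
def expand_query(terms, thesaurus):
    """Replace each query term by the OR-group of itself and its synonyms."""
    return [[t] + thesaurus.get(t, []) for t in terms]

def run_expanded_query(groups, index):
    """OR within each group, AND across groups; returns matching doc IDs."""
    result = None
    for group in groups:
        docs = set()
        for term in group:
            docs |= set(index.get(term, []))
        result = docs if result is None else result & docs
    return sorted(result or [])

thesaurus = {"automobile": ["car"]}
index = {"automobile": [2], "car": [1, 3], "insurance": [1, 2]}
groups = expand_query(["automobile", "insurance"], thesaurus)
print(run_expanded_query(groups, index))  # [1, 2]
```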

Page 39: Lecture 2  : Tolerant Retrieval


Proximity Search


Page 40: Lecture 2  : Tolerant Retrieval


Proximity search


Use a positional index.

Example 1: employment /4 place
- Find all documents that contain EMPLOYMENT and PLACE within 4 words of each other (in either order).
- “Employment agencies that place healthcare workers are seeing growth” is a hit.

Example 2: 台 /3 大 (order matters here)
- 台大醫院, 台灣大學, 台科大, 台灣大車隊 are hits; 台灣科大 is not a hit.

Page 41: Lecture 2  : Tolerant Retrieval


Recap : Positional Inverted index

(Figure repeated from Page 4: the character-level positional index over Doc1: 台灣大哥台 and Doc2: 台灣大學, and the merge for the query 台大 AND 大哥 with tolerance distance 1, which returns Doc1.)

Page 42: Lecture 2  : Tolerant Retrieval


Proximity search


Simplest algorithm: look at the cross-product of the positions of (i) EMPLOYMENT in the document and (ii) PLACE in the document.

Note that we want to return the actual matching positions, not just a list of documents. (This is important for dynamic summaries, the example passages shown on the results page.)

Wildcard queries (?, *) can be transformed into proximity searches.
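A minimal sketch of that cross-product check (unordered, as in the employment /4 place example), taking postings from a positional index like the one sketched earlier; this is the simple version, not the book's optimized pseudocode.

```python
def proximity_intersect(postings1, postings2, k):
    """Return (doc_id, pos1, pos2) triples where the two terms are within k positions.

    postings1, postings2: lists of (doc_id, position) pairs.
    """
    by_doc = {}
    for doc_id, pos in postings1:
        by_doc.setdefault(doc_id, []).append(pos)
    matches = []
    for doc_id, pos2 in postings2:
        for pos1 in by_doc.get(doc_id, []):      # cross-product of positions in the doc
            if abs(pos1 - pos2) <= k:            # unordered: either term may come first
                matches.append((doc_id, pos1, pos2))
    return matches

employment = [(7, 3), (7, 18), (12, 5)]
place = [(7, 6), (12, 40)]
print(proximity_intersect(employment, place, 4))  # [(7, 3, 6)]
```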

Page 43: Lecture 2  : Tolerant Retrieval


“Proximity” intersection


(The positions are matched regardless of term order.)

Page 44: Lecture 2  : Tolerant Retrieval


Conclusion

• In Boolean retrieval, make it easier to pick keywords
  – Spelling correction
  – Thesaurus

• 3 basic algorithms to measure the similarity of keywords or documents
  – Edit distance, weighted edit distance, k-gram overlap
  – slow, but they work; any better alternatives?

• Proximity search is useful (for both English and Chinese).