1

Chap 14 Ranking Algorithm

Advisor: Dr. 黃三益    Students: 吳金山, 鄭菲菲


2

Outline
  Introduction
  Ranking models
  Selecting ranking techniques
  Data structures and algorithms
  The creation of an inverted file
  Searching the inverted file
  Stemmed and unstemmed query terms
  A Boolean system with ranking
  Pruning


3

Introduction

Boolean systems
  Providing powerful on-line search capabilities for librarians and other trained intermediaries
  Providing very poor service for end-users who use the system infrequently
The ranking approach
  Inputting a natural language query without Boolean syntax
  Producing a list of ranked records that "answer" the query
  More oriented toward end-users


4

Introduction (cont.)

The natural language/ranking approach is more effective for end-users
  The results are ranked based on co-occurrence of query terms, modified by statistical term-weighting
  Eliminating the often-wrong Boolean syntax used by end-users
  Providing some results even if a query term is incorrect


5

Figure 14.1 Statistical ranking

Term order for all vectors: (Factors, Information, Help, Human, Operation, Retrieval, Systems)

Qry.  Human factors in information retrieval systems     Vtr. (1 1 0 1 0 1 1)
Rec1. Human, factors, information, retrieval             Vtr. (1 1 0 1 0 1 0)
Rec2. Human, factors, help, systems                      Vtr. (1 0 1 1 0 0 1)
Rec3. Factors, operation, systems                        Vtr. (1 0 0 0 1 0 1)


6

Figure 14.1 Statistical ranking

Simple match
  Query (1 1 0 1 0 1 1) x Rec1 (1 1 0 1 0 1 0) -> (1 1 0 1 0 1 0) = 4
  Query (1 1 0 1 0 1 1) x Rec2 (1 0 1 1 0 0 1) -> (1 0 0 1 0 0 1) = 3
  Query (1 1 0 1 0 1 1) x Rec3 (1 0 0 0 1 0 1) -> (1 0 0 0 0 0 1) = 2

Weighted match
  Query (1 1 0 1 0 1 1) x Rec1 (2 3 0 5 0 3 0) -> (2 3 0 5 0 3 0) = 13
  Query (1 1 0 1 0 1 1) x Rec2 (2 0 4 5 0 0 1) -> (2 0 0 5 0 0 1) = 8
  Query (1 1 0 1 0 1 1) x Rec3 (2 0 0 0 2 0 1) -> (2 0 0 0 0 0 1) = 3
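Both matches above are simple inner products of the query vector with each record's vector (binary or weighted). The following is a minimal Python sketch, not part of the chapter, that reproduces the numbers of Figure 14.1; all names are illustrative.

```python
# Reproduces the simple and weighted matches of Figure 14.1.
# Term order: Factors, Information, Help, Human, Operation, Retrieval, Systems.

def match(query_vec, record_vec):
    """Inner product of the query vector and a record vector."""
    return sum(q * r for q, r in zip(query_vec, record_vec))

query = [1, 1, 0, 1, 0, 1, 1]   # "Human factors in information retrieval systems"

binary = {"Rec1": [1, 1, 0, 1, 0, 1, 0],
          "Rec2": [1, 0, 1, 1, 0, 0, 1],
          "Rec3": [1, 0, 0, 0, 1, 0, 1]}
weighted = {"Rec1": [2, 3, 0, 5, 0, 3, 0],
            "Rec2": [2, 0, 4, 5, 0, 0, 1],
            "Rec3": [2, 0, 0, 0, 2, 0, 1]}

for name, vec in binary.items():
    print("simple match  ", name, match(query, vec))    # 4, 3, 2
for name, vec in weighted.items():
    print("weighted match", name, match(query, vec))    # 13, 8, 3
```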


7

Ranking models

Two types of ranking models
  Ranking the query against individual documents
    Vector space model
    Probabilistic model
  Ranking the query against entire sets of related documents


8

Ranking models (cont.)

Vector space model
  Using cosine correlation to compute similarity
  Early experiments: SMART system (overlap similarity function)
    Results
      Within-document frequency weighting > no term weighting
      Cosine correlation with frequency term weighting > overlap similarity function
  Salton & Yang (1973): relying on term importance within an entire collection
    Results
      Significant performance improvement using the within-document frequency weighting + the inverse document frequency (IDF)
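As an illustration of this weighting scheme, here is a small Python sketch (not from the chapter) that weights each term by its within-document frequency times IDF and ranks a toy collection by cosine correlation; the whitespace tokenizer and the three-document collection are placeholders.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine correlation between two sparse term -> weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0

docs = ["human factors in information retrieval",
        "human factors help systems",
        "factors operation systems"]
N = len(docs)
tokenized = [d.split() for d in docs]

# document frequency and IDF of every term in the collection
df = Counter(t for toks in tokenized for t in set(toks))
idf = {t: math.log(N / n) for t, n in df.items()}

def weight(tokens):
    """Within-document frequency * IDF for each term of one document or query."""
    tf = Counter(tokens)
    return {t: f * idf.get(t, 0.0) for t, f in tf.items()}

query = weight("human factors in information retrieval systems".split())
doc_vectors = [weight(toks) for toks in tokenized]
ranking = sorted(range(N), key=lambda i: cosine(query, doc_vectors[i]), reverse=True)
print(ranking)   # record ids in decreasing order of similarity
```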


9

Ranking models (cont.)

Probabilistic model
  Terms appearing in previously retrieved relevant documents were given a higher weight
  Croft and Harper (1979)
    Probabilistic indexing without any relevance information
    Assuming all query terms have equal probability
    Deriving a term-weighting formula:

similarity_{jk} = \sum_{i=1}^{Q} \left( C + \log \frac{N - n_i}{n_i} \right)

where the sum runs over the Q query terms that match the document, C is a constant, n_i is the number of documents containing term i, and N is the number of documents in the collection.
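A minimal Python sketch of this weighting, assuming the reading of the symbols given above; the document-frequency table, the constant C = 1.0, and the function name are illustrative only.

```python
import math

def croft_harper_score(query_terms, doc_terms, doc_freq, N, C=1.0):
    """Croft and Harper (1979) weighting: sum C + log((N - n_i) / n_i) over the
    query terms that match the document. doc_freq[t] is n_i, the number of
    documents containing term t; N is the collection size. The log is undefined
    for a term occurring in every document, so such terms are skipped here."""
    score = 0.0
    for t in query_terms & doc_terms:
        n = doc_freq[t]
        if 0 < n < N:
            score += C + math.log((N - n) / n)
    return score

# Illustrative numbers only.
doc_freq = {"information": 50, "retrieval": 40, "systems": 400}
print(croft_harper_score({"information", "retrieval", "systems"},
                         {"information", "retrieval", "models"},
                         doc_freq, N=1000))
```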


10

Ranking models (cont.)

Probabilistic model: Croft (1983)
  Incorporating within-document frequency weights
  Using a tuning factor K
  Result: significant improvement over both the IDF weighting alone and the combination weighting

similarity_{jk} = \sum_{i=1}^{Q} (C + IDF_i) \cdot f_{ij}

f_{ij} = K + (1 - K) \cdot \frac{freq_{ij}}{maxfreq_j}

where freq_{ij} is the frequency of term i in document j and maxfreq_j is the maximum term frequency in document j.
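A short Python sketch of this scoring as reconstructed above; the value K = 0.3 is only a placeholder for the tuning factor (which is collection-dependent), and the IDF and frequency numbers are made up for illustration.

```python
def normalized_tf(freq, max_freq, K):
    """Normalized within-document frequency: f_ij = K + (1 - K) * freq_ij / maxfreq_j."""
    return K + (1 - K) * freq / max_freq

def croft_1983_score(query_terms, doc_tf, idf, C=1.0, K=0.3):
    """Sum (C + IDF_i) * f_ij over the query terms that occur in the document.
    doc_tf maps each document term to its raw within-document frequency."""
    if not doc_tf:
        return 0.0
    max_freq = max(doc_tf.values())
    return sum((C + idf[t]) * normalized_tf(doc_tf[t], max_freq, K)
               for t in query_terms if t in doc_tf)

# Illustrative call with made-up statistics.
print(croft_1983_score({"information", "retrieval"},
                       {"information": 3, "retrieval": 1, "models": 2},
                       idf={"information": 2.9, "retrieval": 3.2, "models": 1.5}))
```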


11

Other experiments involving ranking

Direct comparison of similarity measures and term-weighting schemes
  4 types of term frequency weightings (Sparck Jones, 1973)
    Term frequency within a document
    Term frequency within a collection
    Term postings within a document (a binary measure)
    Term postings within a collection
  Indexing was taken from manually extracted keywords
  Results
    Using the term frequency (or postings) within a collection always improved performance
    Using the term frequency (or postings) within a document improved performance only for some collections


12

Other experiments involving ranking (cont.)

Harman (1986): four term-weighting factors
  (a) The number of matches between a document and a query
  (b) The distribution of a term within a document collection (IDF and noise measure)
  (c) The frequency of a term within a document
  (d) The length of the document
  Results
    Using the single measures alone, the distribution of the term within the collection (b) performed better than the within-document frequency (c)
    Combining the within-document frequency with either the IDF or noise measure performed better than using the IDF or noise alone


13

Other experiments involving ranking (cont.)

Ranking based on document structure
  Not only using weights based on term importance both within an entire collection and within a given document (Bernstein and Williamson, 1984)
  But also using the structural position of the term
    Summary versus text paragraphs
  In SIBRIS, increasing term-weights for terms in titles of documents and decreasing term-weights for terms added to a query from a thesaurus


14

Selecting ranking techniques

Using term-weighting based on the distribution of a term within a collection
  Always improves performance
Within-document frequency + IDF weight
  Often provides even more improvement
  The within-document frequency can be combined with the IDF measure in several ways
Adding additional weight for document structure
  E.g., higher weightings for terms appearing in the title or abstract vs. those appearing only in the text
Relevance weighting (Chap 11)


15

The creation of an inverted file

Implications for supporting inverted file structures
  Only the record id has to be stored (smaller index)
  Using strategies that increase recall at the expense of precision
The inverted file is usually split into two pieces for searching
  The dictionary, containing each term along with statistics about that term (such as the number of postings and the IDF) and a pointer to the location of the postings file for that term
  The postings file, containing the record ids and the weights for all occurrences of the term
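A minimal in-memory Python sketch of this two-piece structure, not from the chapter: the dictionary holds per-term statistics and a "pointer" (here just a list index), while the postings file holds record ids with within-record frequencies. The tokenization is a placeholder.

```python
import math
from collections import Counter, defaultdict

def build_inverted_file(records):
    """Build a dictionary of per-term statistics (number of postings, IDF, pointer)
    and a postings file of (record id, within-record frequency) pairs."""
    term_postings = defaultdict(list)
    for rec_id, text in enumerate(records):
        for term, freq in Counter(text.lower().split()).items():
            term_postings[term].append((rec_id, freq))

    N = len(records)
    dictionary, postings_file = {}, []
    for term, plist in term_postings.items():
        dictionary[term] = {"num_postings": len(plist),
                            "idf": math.log(N / len(plist)),
                            "pointer": len(postings_file)}   # index of this term's postings
        postings_file.append(plist)
    return dictionary, postings_file
```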


16

The creation of an inverted file (cont.)

4 major options for storing weights in the postings file
  1. Store the raw frequency
     Slowest search
     Most flexible
  2. Store a normalized frequency
     Not suitable for use with the cosine similarity function
     Updating would not change the postings


17

The creation of an inverted file (cont.)

  3. Store the completely weighted term
     Any of the combination weighting schemes are suitable
     Disadvantage: updating requires changing all postings
  4. If no within-record weighting is used, then the postings records do not have to store weights


18

Searching the inverted file

Figure 14.4 Flowchart of the search engine:
query -> parser -> dictionary lookup (returns the dictionary entry) -> get weights (record numbers on a per-term basis) -> accumulator (record numbers and total weights) -> sort by weight -> ranked record numbers
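A short Python sketch of this flow, reusing the build_inverted_file() structures from the earlier sketch; query parsing is reduced to whitespace tokenization, and frequency * IDF is just one possible term weight.

```python
from collections import defaultdict

def search(query, dictionary, postings_file):
    """Figure 14.4 flow: parse the query, look each term up in the dictionary,
    read its postings, accumulate a total weight per record id, and sort the
    accumulators to obtain ranked record numbers."""
    accumulators = defaultdict(float)                         # record id -> total weight
    for term in query.lower().split():                        # parser
        entry = dictionary.get(term)                          # dictionary lookup
        if entry is None:
            continue
        for rec_id, freq in postings_file[entry["pointer"]]:  # get weights
            accumulators[rec_id] += freq * entry["idf"]
    return sorted(accumulators.items(), key=lambda kv: kv[1], reverse=True)

# Example, reusing the earlier sketch:
# dictionary, postings_file = build_inverted_file(["human factors in information retrieval",
#                                                  "human factors help systems"])
# print(search("human factors", dictionary, postings_file))
```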


19

Searching the inverted file (cont.)

Inefficiencies of this technique
  The I/O needs to be minimized
    Do a single read for all the postings of a given term, and then separate the buffer into record ids and weights
  Time savings can be gained at the expense of some memory space
    Direct access to memory rather than through hashing
  A final major bottleneck can be the sort step of the "accumulators" for large data sets
    Even a fast sort of thousands of records is very time consuming


20

Stemmed and unstemmed query terms

If query terms were automatically stemmed in a ranking system, users generally got better results (Frakes, 1984; Candela, 1990)
In some cases, a stem is produced that leads to improper results
  The original record terms are not stored in the inverted file; only their stems are used


21

Stemmed and unstemmed query terms (cont.)

Harman & Candela (1990)
  Two separate inverted files could be created and stored
    Stemmed terms: used for normal queries
    Unstemmed terms: used for "don't stem" queries
  Hybrid inverted file
    Saves no space in the dictionary part
    Saves considerable storage compared with keeping two versions of the postings
    At the expense of some additional search time


22

A Boolean system with ranking

SIRE system
  Full Boolean capability + a variation of the basic search process
  Accepts queries that are either Boolean logic strings or natural language queries (implicit OR)
  Major modification to the basic search process
    Merge the postings from the query terms before ranking is done
  Performance
    Faster response time for Boolean queries
    No increase in response time for natural language queries
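The chapter gives no code for SIRE; the following Python sketch only illustrates the idea of merging postings before ranking, with a union for the implicit OR of a natural language query and an intersection for a purely conjunctive Boolean query. Full Boolean expressions are omitted, and all names are illustrative.

```python
def merge_postings(term_postings, boolean_and=False):
    """Merge per-term postings into one candidate record set before any ranking:
    union for a natural language query (implicit OR), intersection for a
    conjunctive Boolean query."""
    record_sets = [{rec_id for rec_id, _ in plist} for plist in term_postings]
    if not record_sets:
        return set()
    return set.intersection(*record_sets) if boolean_and else set.union(*record_sets)

def rank_merged(candidates, term_postings, idfs):
    """Rank only the records that survived the merge."""
    scores = dict.fromkeys(candidates, 0.0)
    for plist, idf in zip(term_postings, idfs):
        for rec_id, freq in plist:
            if rec_id in scores:
                scores[rec_id] += freq * idf
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```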


23

Pruning

A major time bottleneck in the basic search process
  The sort of the accumulators for large data sets
Changed search algorithm with pruning:
1. Sort all query terms (stems) by decreasing IDF value
2. Do a binary search for the first term (i.e., the highest IDF) and get the address of the postings list for that term
3. Read the entire postings file for that term into a buffer and add the term weights for each record id into the contents of the unique accumulator for the record id


24

Pruning (cont.)

4. Check the IDF of the next query term.
   If its IDF is at least 1/3 of the maximum IDF of any term in the data set, repeat steps 2, 3, and 4; otherwise repeat steps 2, 3, and 4, but do not add weights to zero-weight accumulators
5. Sort the accumulators with nonzero weights to produce the final ranked record list
6. If a query has only high-frequency terms, then pruning cannot be done.
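A Python sketch of steps 1-6, reusing the dictionary/postings layout of the earlier sketches. The handling of step 6 (a query containing only high-frequency terms disables pruning) is one reading of the algorithm, and the data structures are illustrative.

```python
from collections import defaultdict

def pruned_search(query_terms, dictionary, postings_file):
    """Process query terms in decreasing IDF order; once a term's IDF drops below
    one third of the maximum IDF of any term in the data set, add its weights only
    to accumulators that already have nonzero weight (no new accumulators)."""
    terms = sorted((t for t in query_terms if t in dictionary),
                   key=lambda t: dictionary[t]["idf"], reverse=True)      # step 1
    if not terms:
        return []
    threshold = max(e["idf"] for e in dictionary.values()) / 3.0          # 1/3 of max IDF
    # Step 6: if even the highest-IDF query term is below the threshold,
    # the query has only high-frequency terms and pruning cannot be done.
    pruning_allowed = dictionary[terms[0]]["idf"] >= threshold

    accumulators = defaultdict(float)
    for pos, term in enumerate(terms):
        entry = dictionary[term]                                          # step 2
        prune = pruning_allowed and pos > 0 and entry["idf"] < threshold  # step 4
        for rec_id, freq in postings_file[entry["pointer"]]:              # step 3
            if prune and rec_id not in accumulators:
                continue      # low-IDF term: only add to existing accumulators
            accumulators[rec_id] += freq * entry["idf"]
    # Step 5: sort the nonzero accumulators into the final ranked record list.
    return sorted(accumulators.items(), key=lambda kv: kv[1], reverse=True)
```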


25

Thanks