
5. Index Construction

Artificial Intelligence Laboratory

2

Contents
• Memory-based inversion
• Sort-based inversion
• Exploiting index compression
• Compressed in-memory inversion
• Comparison of inversion methods
• Constructing signature files and bitmaps
• Dynamic collections

3

The problem is the size of the frequency matrix

                                 Bible                       TREC
Terms                            9,020                       538,244
Documents                        31,102                      742,368
Matrix (4-byte integer entries)  4 x 9,020 x 31,102 bytes    4 x 538,244 x 742,368 bytes
                                 = over one gigabyte         = 1.4 terabytes
Figure 5.1 (static)              One month                   127 years
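The matrix sizes above can be checked with a few lines of arithmetic. This is only a sketch; the function name `matrix_bytes` is made up for illustration, and the 4-byte entry size is the one stated on the slide.

```python
def matrix_bytes(terms, docs, entry_bytes=4):
    """Size in bytes of a dense terms x documents frequency matrix."""
    return entry_bytes * terms * docs

bible = matrix_bytes(9_020, 31_102)
trec = matrix_bytes(538_244, 742_368)

# Report the sizes in binary units (2**30 and 2**40 bytes).
print(f"Bible: {bible / 2**30:.2f} GB")
print(f"TREC:  {trec / 2**40:.2f} TB")
```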

4

Memory-based inversion

• Fig. 5.2 (documents), Fig. 5.3 (inverted file)
• Assumed that the linked lists are sorted
• Dynamic dictionary data structure
• Linked list (reference point)
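A minimal sketch of memory-based inversion, assuming whitespace tokenization and document numbers starting at 1 (neither is stated on the slide). A Python dict stands in for the dynamic dictionary, and a Python list stands in for each linked list; `invert_in_memory` is a hypothetical name.

```python
from collections import defaultdict

def invert_in_memory(docs):
    """Build term -> [(d, f_dt), ...] entirely in main memory."""
    index = defaultdict(list)              # dictionary of terms
    for d, text in enumerate(docs, start=1):
        counts = defaultdict(int)
        for term in text.lower().split():
            counts[term] += 1
        for term, f_dt in counts.items():
            # Documents arrive in increasing d, so every list stays sorted.
            index[term].append((d, f_dt))
    return dict(index)

index = invert_in_memory(["a b a", "b c"])
```

Because entries are appended in document order, no sorting step is needed, which matches the assumption that the lists are sorted.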

5

Problems?

• Requires a large amount of memory
• The best method for small collections (Bible…)
• Cannot handle random data

6

Sort-based inversion
• Fig. 5.4, Fig. 5.5: comparison of time and space

[Diagram labels: triples <1,1,2>....; QSort; K block; initial sorted runs; temporary file; external mergesort [log R]; merged runs (fully sorted)]
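The flow in the diagram can be sketched as follows, under several assumptions: triples are kept in Python lists instead of a temporary file, runs of k triples are sorted with the built-in sort in place of QSort, and runs are merged two at a time so that R runs need about log R passes. The name `sort_based_invert` is hypothetical.

```python
from heapq import merge

def sort_based_invert(docs, k=4):
    """Return (lexicon, sorted <t, d, f_dt> triples, number of merge passes)."""
    lexicon, triples = {}, []
    for d, text in enumerate(docs, start=1):
        counts = {}
        for word in text.lower().split():
            t = lexicon.setdefault(word, len(lexicon) + 1)  # term number
            counts[t] = counts.get(t, 0) + 1
        triples.extend((t, d, f) for t, f in counts.items())

    # Initial sorted runs of k triples each.
    runs = [sorted(triples[i:i + k]) for i in range(0, len(triples), k)]

    # Pairwise merging: each pass halves the number of runs.
    passes = 0
    while len(runs) > 1:
        runs = [list(merge(*runs[i:i + 2])) for i in range(0, len(runs), 2)]
        passes += 1
    return lexicon, (runs[0] if runs else []), passes

lexicon, triples, passes = sort_based_invert(["a b a", "b c", "c a"], k=2)
```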

7

Problems?

• Two copies of the temporary files are needed
• Suitable for collections in the 10~100 Mbyte range

8

Exploiting index compression

• To reduce resource usage (space, time)
 - compress the temporary file (sort-based)
 - build the inverted file in main memory and decompress it before the index is written to disk
• Compressing the temporary files
• Multiway merging
• In-place multiway merging

9

Compressing the temporary files

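• Explained in Chapters 3 and 4
• Compress the temporary file of <t, d, fdt> triples
• The t component causes a slight loss in compression (e.g., unary + delta code, TREC collection)

(Assumption) the t-gap (the difference in t between a triple and the one that follows it) is coded in unary: t-gap = 0 → code 0, t-gap = 1 → 10, t-gap = 2 → 110 (0.6 Mbyte required)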
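The unary t-gap code listed above can be sketched in a few lines. The helper names `unary` and `t_gap_codes` are made up for illustration, and codes are returned as strings of '0' and '1' rather than packed bits.

```python
def unary(gap):
    """Unary code used on the slide: gap ones followed by a zero."""
    return "1" * gap + "0"

def t_gap_codes(triples):
    """Code the t-gap between each triple and the one that follows it."""
    return [unary(nxt[0] - cur[0]) for cur, nxt in zip(triples, triples[1:])]

codes = t_gap_codes([(1, 1, 2), (1, 3, 1), (2, 1, 1), (4, 2, 1)])
```

Since the triples are sorted by t, most gaps are 0 and cost a single bit each.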

10

Multiway merging

• Now more processor-intensive than disk-intensive

• Reduce time with a multiway merge
• Use a priority queue such as a heap
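A minimal sketch of a multiway merge driven by a heap, assuming the runs are already sorted and held as Python lists (a real merge would read them block by block from disk). `multiway_merge` is a hypothetical name.

```python
import heapq

def multiway_merge(runs):
    """Merge any number of sorted runs in a single pass using a heap."""
    # Each heap entry is (item, run index, position within that run).
    heap = [(run[0], i, 0) for i, run in enumerate(runs) if run]
    heapq.heapify(heap)
    out = []
    while heap:
        item, i, pos = heapq.heappop(heap)
        out.append(item)
        if pos + 1 < len(runs[i]):
            heapq.heappush(heap, (runs[i][pos + 1], i, pos + 1))
    return out

merged = multiway_merge([[1, 4, 7], [2, 5, 8, 9], [3, 6]])
```

The heap holds one candidate per run, so each output item costs O(log R) comparisons for R runs, and the data is read and written only once.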

11

In-place multiway merging (1)

[Figure labels: heap; blocks in memory, one per run (RUN 1, BLOCK 2; RUN 2, BLOCK 3; RUN 3, BLOCK 2); temporary file on disk holding run blocks (RUN 1, BLOCK 2; RUN 2, BLOCK 2; RUN 3, BLOCK 1) and OUTPUT BLOCK 1, OUTPUT BLOCK 2, OUTPUT BLOCK 3; block table in memory]

12

In-place multiway merging (2)

• Algorithm
 - a block of b bytes from each run is read into memory and fed to the heap
 - b bytes are moved from the heap to the in-memory output block
 - the output block is written back to the temporary file (block table)

• Use of slack
 - the output process tends to run ahead of the input process, so empty blocks are added
 - processing: add slack → permutation → compaction → truncation

(Second Edition: permutation → truncation)
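The steps above can be simulated in memory. This sketch follows the permutation → truncation variant and rests on assumptions not in the slides: the "disk" is a Python list of blocks, b counts items rather than bytes, each run occupies whole blocks of at most b items, and `in_place_multiway_merge` is a hypothetical name.

```python
import heapq
from collections import deque

def in_place_multiway_merge(disk, runs, b):
    """disk: list of blocks; runs: (first_block, end_block) per run."""
    free, block_table = [], []       # free slots; output block i -> slot
    cursor = [first for first, _ in runs]
    buffers = [deque() for _ in runs]
    out = []

    def refill(r):                   # read the next block of run r, freeing its slot
        if cursor[r] < runs[r][1]:
            slot = cursor[r]
            cursor[r] += 1
            buffers[r].extend(disk[slot])
            disk[slot] = None
            free.append(slot)

    def flush():                     # write the output block into a freed slot
        slot = free.pop(0)
        disk[slot] = out[:]
        block_table.append(slot)
        out.clear()

    heap = []
    for r in range(len(runs)):
        refill(r)
        if buffers[r]:
            heap.append((buffers[r].popleft(), r))
    heapq.heapify(heap)

    while heap:
        item, r = heapq.heappop(heap)
        out.append(item)
        if len(out) == b:
            flush()
        if not buffers[r]:
            refill(r)
        if buffers[r]:
            heapq.heappush(heap, (buffers[r].popleft(), r))
    if out:
        flush()

    # Permutation: move output block i into slot i.
    where = {slot: i for i, slot in enumerate(block_table)}
    for i in range(len(block_table)):
        s = block_table[i]
        if s == i:
            continue
        disk[i], disk[s] = disk[s], disk[i]
        j = where.pop(i, None)       # output block that was sitting in slot i
        if j is None:
            where.pop(s)
        else:
            block_table[j], where[s] = s, j
        block_table[i], where[i] = i, i

    del disk[len(block_table):]      # truncation
    return disk

disk = [[1, 4], [7], [2, 5], [8, 9], [3, 6]]
in_place_multiway_merge(disk, [(0, 2), (2, 4), (4, 5)], b=2)
```

An output block is written only after b items have been consumed, so at least one input slot has always been freed by then and the merge never needs a second copy of the file.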

13

Compressed in-memory inversion: Large memory inversion (1)

Large main memory

array - list of document numbers d, frequencies fdt

Compared with the in-memory technique (Section 5.1)

no next-pointer field is needed.

term t : ft log N bits (document numbers)

ft log mt bits (frequencies)

(mt : maximum within-document frequency)

A preliminary pass is required : N, ft, mt
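As a worked example of the space estimate, the sketch below computes ft log N + ft log mt for one term. It assumes each logarithm is rounded up to a whole number of bits, which the slide does not state; `array_bits` is a hypothetical name.

```python
def bits_for(values):
    """Whole bits needed to distinguish `values` different values (ceil(log2))."""
    return (values - 1).bit_length()

def array_bits(N, ft, mt):
    """Bits for term t: ft document numbers plus ft frequencies."""
    return ft * bits_for(N) + ft * bits_for(mt)

# Hypothetical numbers: 1,024 documents, term in 10 of them, frequency at most 4.
total = array_bits(N=1024, ft=10, mt=4)
```

Here each document number takes 10 bits and each frequency 2 bits, so the term's array needs 120 bits. N, ft and mt must be known before the arrays are allocated, which is why the preliminary pass is needed.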

14

Compressed in-memory inversion: Large memory inversion (2)

Two-pass Golomb-coded in memory

First Pass

- count ft, Ft

- write ft, Ft to a lexicon file

Second Pass

- read lexicon file

- calculate bt = 2^floor(log((N - ft)/ft)) and Bt

- build a compressed in-memory inverted file

- rebuild in-memory inverted file
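A rough sketch of the two passes, with loud assumptions: the Golomb parameter bt is taken to be a power of two near (N - ft)/ft (one reading of the formula on the slide), only document gaps are coded (frequencies and the Bt bound are left out), the lexicon is kept in a dict instead of a file, and codes are bit strings. The function names are hypothetical.

```python
from collections import Counter

def golomb_pow2(x, b):
    """Golomb code of x >= 1 when b is a power of two: unary quotient, binary remainder."""
    q, r = divmod(x - 1, b)
    k = b.bit_length() - 1                 # remainder width in bits
    return "1" * q + "0" + (format(r, f"0{k}b") if k else "")

def two_pass_invert(docs):
    N = len(docs)
    # First pass: count ft for every term.
    ft = Counter()
    for text in docs:
        ft.update(set(text.lower().split()))
    # Second pass: choose bt per term, then code the document gaps.
    bt = {t: 1 << (max(1, (N - f) // f).bit_length() - 1) for t, f in ft.items()}
    codes = {t: [] for t in ft}
    last = dict.fromkeys(ft, 0)
    for d, text in enumerate(docs, start=1):
        for t in set(text.lower().split()):
            codes[t].append(golomb_pow2(d - last[t], bt[t]))
            last[t] = d
    return bt, {t: "".join(c) for t, c in codes.items()}

bt, codes = two_pass_invert(["x", "y", "y", "y", "y", "x", "y", "y"])
```

In this toy collection the term "x" occurs in documents 1 and 6 of 8, so its gaps are 1 and 5 and its parameter comes out as 2.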

15

Compressed in-memory inversion: Lexicon-based partitioning

Subdivide into small tasks

Lexicon-based, no extra disk

make multiple second passes

each pass processes one load

- e.g., three second passes

Lexicon-based, extra disk

Saves time, wastes disk space
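One way to picture the partitioning is a greedy split of the lexicon into loads that each fit in memory, with one second pass per load. The per-term sizes, the memory budget, and the name `partition_lexicon` are all assumptions made for illustration.

```python
def partition_lexicon(sizes, budget):
    """Split (term, bits) pairs, in lexicon order, into loads of at most `budget` bits."""
    loads, current, used = [], [], 0
    for term, bits in sizes:
        if current and used + bits > budget:
            loads.append(current)          # this load is full; start the next one
            current, used = [], 0
        current.append(term)
        used += bits
    if current:
        loads.append(current)
    return loads

# Hypothetical inverted-list sizes (in bits) and a 100-bit memory budget.
loads = partition_lexicon([("a", 40), ("b", 50), ("c", 30), ("d", 80)], budget=100)
```

With these numbers the lexicon splits into three loads, so the text would be read by three second passes.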

16

Compressed in-memory inversion: Text-based partitioning

Inversion and Merge

Build an in-memory inverted file for each chunk; merge the inverted files on disk

Chunk

Information file

- frequency of each term in chunk

Second temp disk file

- disk current pointer
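A small sketch of the invert-then-merge idea, assuming whitespace tokenization and using a Python dict in place of the on-disk inverted file (so the information file and disk pointers are not modelled). The names are hypothetical.

```python
def invert_chunk(chunk, first_doc):
    """In-memory inverted file for one chunk: term -> [(d, f_dt), ...]."""
    part = {}
    for d, text in enumerate(chunk, start=first_doc):
        counts = {}
        for term in text.lower().split():
            counts[term] = counts.get(term, 0) + 1
        for term, f in counts.items():
            part.setdefault(term, []).append((d, f))
    return part

def text_partitioned_invert(docs, chunk_size):
    merged = {}                            # stand-in for the inverted file on disk
    for start in range(0, len(docs), chunk_size):
        part = invert_chunk(docs[start:start + chunk_size], start + 1)
        for term, postings in part.items():
            # Chunks are processed in document order, so appending keeps lists sorted.
            merged.setdefault(term, []).extend(postings)
    return merged

merged = text_partitioned_invert(["a b", "b", "a a", "c b"], chunk_size=2)
```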

17

Constructing signature files and bitmaps

Enough Main Memory

signature of k documents

k = 8M / W

- W : signature width (bits)

- M : main memory (bytes)
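The formula k = 8M / W follows from M bytes holding 8M bits, with each signature taking W bits. A tiny worked example, with made-up values for M and W and a hypothetical function name:

```python
def signatures_in_memory(memory_bytes, width_bits):
    """k = 8M / W: how many W-bit signatures fit in M bytes."""
    return (8 * memory_bytes) // width_bits

# Hypothetical setting: 1,000,000 bytes of memory, 1,000-bit signatures.
k = signatures_in_memory(1_000_000, 1_000)
```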

Bitmap

build a compressed inverted file

decompress it and store it with unary code

18

Dynamic collections (1)

‘Insert’ operation

append a new document to an existing collection

‘Edit’ operation

alter, remove

Expanding the text

Expanding the index

19

Dynamic collections: Expanding the text

Inserting a new document

the text of the collection must be expanded

compression

- cope with hitherto unseen symbol

uncompression

- escape flag, stored uncompressed

must periodically be completely rebuilt

a new compression model

20

Dynamic collections: Expanding the index (1)

‘stop-press’ file

accumulate updates in a stop-press file

rebuild when the file becomes too large

drawback

time needed to reindex (the data)
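The stop-press idea can be sketched as a small class. This is a simplification under stated assumptions: both the main index and the stop-press file are dicts in memory, documents arrive with increasing numbers, and `rebuild` merely folds the stop-press entries into the main index where a real system would reindex the data. The class and its `limit` parameter are hypothetical.

```python
class StopPressIndex:
    def __init__(self, limit):
        self.main = {}           # main inverted index: term -> [doc numbers]
        self.stop_press = {}     # recent updates, same layout
        self.pending = 0         # number of postings in the stop-press file
        self.limit = limit       # rebuild once pending exceeds this
        self.rebuilds = 0

    def insert(self, doc_id, text):
        for term in set(text.lower().split()):
            self.stop_press.setdefault(term, []).append(doc_id)
            self.pending += 1
        if self.pending > self.limit:
            self.rebuild()

    def rebuild(self):
        for term, docs in self.stop_press.items():
            self.main.setdefault(term, []).extend(docs)
        self.stop_press.clear()
        self.pending = 0
        self.rebuilds += 1

    def lookup(self, term):
        # A query must consult both the main index and the stop-press file.
        return self.main.get(term, []) + self.stop_press.get(term, [])

idx = StopPressIndex(limit=3)
idx.insert(1, "a b")
idx.insert(2, "a c")
```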

The Inverted file

a newly inserted document contains many terms

variable-length records

21


Issue

suitable file structure

record extension

record insertion

22


Block Structure

Fixed length blocks : b bytes

- block address table, records, free space

- figure 5.15

Main memory

- record address table : record number, block number

- free list

- current last block of the file

23


Access record

1) Record number

2) Block address from the record address table

3) Block read into memory

4) The address of the record within the block

5) Read the record

24


Expanding a particular record

sufficient free space

1) Block read

2) Move records to make space

3) Add the extension

4) Update the block table and write the block

insufficient free space

- remove the smallest record, then insert the extension

- or remove the extended record and insert it into a new block
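The block structure, record access, and record extension can be sketched together. The sketch makes several simplifications that are not in the slides: blocks live in a Python list instead of on disk, each block is a dict from record number to bytes (so the per-block address table costs no space), a linear scan stands in for the free list, and when space runs out the extended record is the one that moves. `BlockFile` and its methods are hypothetical names.

```python
class BlockFile:
    def __init__(self, block_size):
        self.b = block_size
        self.blocks = []             # "disk": one dict per fixed-length block
        self.record_table = {}       # in memory: record number -> block number

    def _free(self, blk):
        return self.b - sum(len(v) for v in self.blocks[blk].values())

    def insert(self, rec, data):
        if len(data) > self.b:
            raise ValueError("record larger than a block")
        for blk in range(len(self.blocks)):      # stand-in for the free list
            if self._free(blk) >= len(data):
                break
        else:
            self.blocks.append({})               # create a new block
            blk = len(self.blocks) - 1
        self.blocks[blk][rec] = data
        self.record_table[rec] = blk

    def read(self, rec):
        blk = self.record_table[rec]             # block address from the table
        return self.blocks[blk][rec]             # locate the record in the block

    def extend(self, rec, extra):
        blk = self.record_table[rec]
        if self._free(blk) >= len(extra):        # sufficient free space
            self.blocks[blk][rec] += extra
        else:                                    # move the extended record
            data = self.blocks[blk].pop(rec) + extra
            self.insert(rec, data)

f = BlockFile(block_size=10)
f.insert(1, b"aaaa")
f.insert(2, b"bbbb")
f.extend(1, b"cc")      # fits in block 0
f.extend(2, b"dd")      # no room left, so record 2 moves to another block
```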

25


Insert a record

free list check

- decide which block to insert into

- or create a new block

Block read/write (disk operations)

general case : 2

worst case : 4

Reduce the number of disk operations

using ‘update cache’
