5. Index Construction (Artificial Intelligence Laboratory)


Page 1:

5. Index Construction

Artificial Intelligence Laboratory

Page 2:

Contents
• Memory-based inversion
• Sort-based inversion
• Exploiting index compression
• Compressed in-memory inversion
• Comparison of inversion methods
• Constructing signature files and bitmaps
• Dynamic collections

Page 3:

The problem is the size of the frequency matrix

Each entry is a 4-byte integer.

                     Bible                        TREC
Terms                9,020                        538,244
Documents            31,102                       742,368
Matrix size          4 x 9,020 x 31,102 bytes     4 x 538,244 x 742,368 bytes
                     = over one gigabyte          = 1.4 terabytes
Figure 5.1 (static)  One month                    127 years
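The matrix sizes above can be checked with a few lines of Python. This is only a sketch of the arithmetic. The function name `matrix_bytes` is made up for illustration, and the results are printed in binary units (GiB, TiB), which is an assumption about how the slide rounds its figures.

```python
# Size of a dense term-by-document frequency matrix, 4 bytes per entry.
BYTES_PER_ENTRY = 4

def matrix_bytes(num_terms: int, num_docs: int) -> int:
    """Bytes needed to hold one 4-byte counter per (term, document) pair."""
    return BYTES_PER_ENTRY * num_terms * num_docs

bible = matrix_bytes(9_020, 31_102)      # figures from the slide
trec = matrix_bytes(538_244, 742_368)    # figures from the slide

print(f"Bible: {bible:,} bytes ({bible / 2**30:.2f} GiB)")
print(f"TREC:  {trec:,} bytes ({trec / 2**40:.2f} TiB)")
```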

Page 4:

Memory-based inversion

• Fig. 5.2 (documents), Fig. 5.3 (inverted file)
• Assumed that the linked lists are sorted
• Dynamic dictionary data structure
• Linked list (reference point)
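A minimal sketch of memory-based inversion, under some assumptions: a Python dict stands in for the dynamic dictionary, Python lists stand in for the linked lists, and terms are obtained by lowercasing and splitting on whitespace. The function name and the sample documents are invented for illustration.

```python
from collections import Counter, defaultdict

def invert_in_memory(docs):
    """Return {term: [(d, f_dt), ...]} with document numbers in ascending order."""
    index = defaultdict(list)  # dynamic dictionary: term -> list of (d, f_dt)
    for d, text in enumerate(docs, start=1):
        # Count each term once per document, then append one entry per term.
        for term, f_dt in Counter(text.lower().split()).items():
            # Documents are processed in order, so every list stays sorted by d.
            index[term].append((d, f_dt))
    return dict(index)

docs = ["the cat sat", "the dog sat on the cat"]  # hypothetical sample text
index = invert_in_memory(docs)
for term in sorted(index):
    print(term, index[term])
```

Because every list is held in memory at once, the memory requirement grows with the total number of (term, document) pairs.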

Page 5:

Problems?

• Requires a large amount of memory
• The best method for small collections (Bible…)
• Cannot handle random data

Page 6:

Sort-based inversion
• Fig. 5.4, Fig. 5.5: comparison of time and space

[Diagram labels: QSort; <1,1,2>....; K block; initial sorted runs (temporary file); external mergesort [log R]; merged runs (fully sorted)]
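The diagram can be sketched in Python as follows. This is an in-memory simulation under stated assumptions: lists stand in for the temporary file, Python's built-in `sorted` replaces quicksort, and runs are merged two at a time so that R initial runs need about log2 R passes. The function name and the sample triples are hypothetical.

```python
from heapq import merge  # standard-library merge of already sorted inputs

def sort_based_inversion(triples, k):
    """Sort <t, d, f_dt> triples via sorted runs of k triples and pairwise merging.

    Returns the fully sorted list and the number of merge passes.
    Lists stand in for the temporary file on disk.
    """
    triples = list(triples)
    # Initial sorted runs: each block of k triples is sorted in memory.
    runs = [sorted(triples[i:i + k]) for i in range(0, len(triples), k)]
    passes = 0
    # Each pass merges neighbouring runs pairwise, halving the number of runs.
    while len(runs) > 1:
        runs = [list(merge(*runs[i:i + 2])) for i in range(0, len(runs), 2)]
        passes += 1
    return (runs[0] if runs else []), passes

triples = [(3, 1, 1), (1, 1, 2), (2, 2, 1), (1, 2, 1), (3, 3, 4)]  # sample data
inverted, passes = sort_based_inversion(triples, k=2)
print(inverted, passes)
```

On disk, each pass reads one copy of the temporary file and writes another, which is where the two copies of the temporary file come from.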

Page 7:

Problems?

• Two copies of the temporary files
• Suitable for the 10 to 100 Mbyte range

Page 8:

Exploiting index compression

• To reduce the resources (space, time)
  - compress the temporary file (sort-based)
  - build the inverted file in main memory and decompress it before writing the index to disk
• Compressing the temporary files
• Multiway merging
• In-place multiway merging

Page 9:

Compressing the temporary files

• Explained in Chapters 3 and 4
• Compress the temporary file of <t,d,fdt> triples
• The t component causes a slight loss of compression (e.g., unary + delta code, TREC collection)

(Assumption) The t-gap (the difference in t from the following triple) is coded in unary: t-gap=0 → code 0, t-gap=1 → 10, t-gap=2 → 110 (0.6 Mbyte needed)
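The unary t-gap code above is small enough to write out directly. The sketch below follows the slide's convention (a gap of g is g one-bits followed by a zero-bit). The function names are invented, and codes are returned as strings of '0' and '1' rather than packed bits to keep the example readable.

```python
def unary(gap: int) -> str:
    """Unary code as on the slide: 0 -> '0', 1 -> '10', 2 -> '110'."""
    if gap < 0:
        raise ValueError("gap must be non-negative")
    return "1" * gap + "0"

def encode_t_gaps(term_numbers):
    """Code the gap between the t values of each pair of consecutive triples.

    term_numbers must be in non-decreasing order, as in a sorted run.
    """
    return [unary(b - a) for a, b in zip(term_numbers, term_numbers[1:])]

ts = [1, 1, 1, 2, 4]          # t components of five sorted triples (sample data)
print(encode_t_gaps(ts))
```

Most consecutive triples in a sorted run share the same t, so most gaps are 0 and cost a single bit.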

Page 10:

Multiway merging

• The process is now more processor-intensive than disk-intensive
• Reduce time with a multiway merge
• Use a priority queue such as a heap
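A heap-based multiway merge might look like the sketch below. It assumes every run is already sorted and that runs are plain Python lists of triples. A real implementation would read the runs block by block from the temporary file. The function name is hypothetical.

```python
import heapq

def multiway_merge(runs):
    """Merge any number of sorted runs in one pass using a min-heap."""
    iterators = [iter(run) for run in runs]
    heap = []
    # Seed the heap with the first item of every non-empty run.
    for i, it in enumerate(iterators):
        first = next(it, None)
        if first is not None:
            heapq.heappush(heap, (first, i))
    merged = []
    while heap:
        item, i = heapq.heappop(heap)       # smallest item over all runs
        merged.append(item)
        following = next(iterators[i], None)
        if following is not None:           # refill from the run just consumed
            heapq.heappush(heap, (following, i))
    return merged

runs = [[(1, 1, 2), (3, 1, 1)], [(1, 2, 1), (2, 2, 1)], [(3, 3, 4)]]  # sample runs
print(multiway_merge(runs))
```

With R runs the heap holds at most R items, so each output item costs O(log R) comparisons and the data is read and written only once.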

Page 11:

In-place multiway merging (1)

[Figure: in-place multiway merging. Labels: heap; blocks in memory, one per run; block table, in memory; temporary file, on disk, holding a mix of output blocks and run blocks (OUTPUT BLOCK 1, RUN 1 BLOCK 2, OUTPUT BLOCK 2, RUN 2 BLOCK 3, RUN 3 BLOCK 2, RUN 2 BLOCK 2, RUN 3 BLOCK 1, OUTPUT BLOCK 3)]

Page 12:

In-place multiway merging (2)

• Algorithm
  - A block of b bytes from each run in memory is moved to the heap
  - b bytes at a time are moved from the heap to the in-memory output block
  - The output block is written back to the temporary file (block table)

• Use of slack
  - The output process tends to run ahead of the input process, so empty blocks are added
  - Add slack → permutation → compaction → truncation

(Second Edition) permutation → truncation

Page 13:

Compressed in-memory inversion: Large memory inversion (1)

Large main memory

array - list of document numbers d, frequencies fdt

Compared with the in-memory technique (Section 5.1)

no next-pointer field is needed

term t : ft log N bits

ft log mt bits

(mt : maximum within-document frequency)

A preliminary pass is required : N, ft, mt
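As a rough worked example of the two bit counts above, the sketch below assumes that ft log N bits hold the document numbers and ft log mt bits hold the frequencies, and that each logarithm is base 2 and rounded up to a whole number of bits. The rounding, the function name, and the sample values of ft and mt are assumptions, not figures from the slides.

```python
def array_bits(N: int, f_t: int, m_t: int):
    """Bits for one term's array: (document-number bits, frequency bits)."""
    # (x - 1).bit_length() equals ceil(log2(x)) for any integer x >= 1.
    doc_bits = f_t * (N - 1).bit_length()
    freq_bits = f_t * (m_t - 1).bit_length()
    return doc_bits, freq_bits

# N = 31,102 documents is the Bible figure; f_t = 100 and m_t = 5 are made up.
doc_bits, freq_bits = array_bits(N=31_102, f_t=100, m_t=5)
print(doc_bits, freq_bits, doc_bits + freq_bits)
```

The preliminary pass is what makes this possible: N, ft and mt must all be known before the arrays can be allocated at their exact sizes.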

Page 14:

Compressed in-memory inversion: Large memory inversion (2)

Two-pass Golomb-coded in memory

First Pass

- count ft, Ft

- write ft, Ft to a lexicon file

Second Pass

- read lexicon file

- calculate bt, btw = 2 log((N-ft)/ft), Bt

- build a compressed in-memory inverted file

- rebuild in-memory inverted file
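For readers unfamiliar with Golomb coding, here is a minimal encoder sketch. It takes the parameter b as an argument and does not attempt to reproduce the slide's formula for bt. The unary part uses ones terminated by a zero and the remainder uses a truncated binary code, which is one common convention. The function name and the sample list are hypothetical.

```python
def golomb_encode(x: int, b: int) -> str:
    """Golomb code of an integer x >= 1 with parameter b >= 1, as a bit string."""
    if x < 1 or b < 1:
        raise ValueError("x and b must be positive")
    q, r = divmod(x - 1, b)
    code = "1" * q + "0"                 # quotient in unary
    k = (b - 1).bit_length()             # ceil(log2(b))
    threshold = (1 << k) - b             # remainders below this use k - 1 bits
    if r < threshold:
        width, value = k - 1, r
    else:
        width, value = k, r + threshold
    if width > 0:                        # b == 1 has no remainder bits at all
        code += format(value, f"0{width}b")
    return code

doc_numbers = [3, 4, 9]                  # sample inverted list for one term
gaps = [d - p for p, d in zip([0] + doc_numbers, doc_numbers)]
print([golomb_encode(g, 3) for g in gaps])
```

In the second pass each document number would be appended to its term's bit string as a coded gap, so the in-memory inverted file stays compressed throughout.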

Page 15:

Compressed in-memory inversion: Lexicon-based partitioning

Subdivide into small tasks

Lexicon-based, no extra disk

make multiple second passes

each processing one load

- e.g., three second passes

Lexicon-based, extra disk

Saves time, wastes disk space
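One way to picture the lexicon-based split is a greedy partition of the lexicon into loads that each fit a memory budget, with one second pass per load. The sketch below is an assumption about how loads could be chosen. It measures a term's cost simply by its ft, and the function name, budget, and lexicon are invented for illustration.

```python
def partition_lexicon(lexicon, budget):
    """Split (term, f_t) pairs, in lexicon order, into loads within a budget.

    budget is the largest total f_t that one in-memory pass can hold.
    A single term larger than the budget still gets a load of its own.
    """
    loads, current, used = [], [], 0
    for term, f_t in lexicon:
        if current and used + f_t > budget:
            loads.append(current)        # close the current load
            current, used = [], 0
        current.append(term)
        used += f_t
    if current:
        loads.append(current)
    return loads

lexicon = [("a", 4), ("b", 3), ("c", 5), ("d", 2), ("e", 6)]  # hypothetical
loads = partition_lexicon(lexicon, budget=8)
print(loads)   # one second pass over the text per load
```

Each second pass reads the whole text but inverts only the terms of its own load, which trades extra passes for bounded memory.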

Page 16:

Compressed in-memory inversion: Text-based partitioning

Inversion and Merge

Build an in-memory inverted file, then merge it into the inverted file on disk

Chunk

Information file

- frequency of each term in chunk

Second temp disk file

- disk current pointer

Page 17:

Constructing signature files and bitmaps

Enough Main Memory

signature of k documents

k = 8M / W

- W : signature width (bits)

- M : main memory (bytes)

Bitmap

build a compressed inverted file

decompress it and store it with unary code
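Both points on this slide can be illustrated briefly. The first function evaluates k = 8M / W. The second expands one term's document numbers into a bitmap row by writing each d-gap in unary, here as gap - 1 zeros followed by a one. That bit convention, the function names, and the sample values are assumptions.

```python
def signatures_per_load(memory_bytes: int, width_bits: int) -> int:
    """k = 8M / W: how many W-bit signatures fit in M bytes of main memory."""
    return (8 * memory_bytes) // width_bits

def bitmap_row(doc_numbers, num_docs: int) -> str:
    """Expand an ascending list of document numbers into an N-bit bitmap row."""
    bits, previous = [], 0
    for d in doc_numbers:
        gap = d - previous
        bits.append("0" * (gap - 1) + "1")   # unary code of the d-gap
        previous = d
    bits.append("0" * (num_docs - previous))  # pad out to N bits
    return "".join(bits)

print(signatures_per_load(memory_bytes=1_000_000, width_bits=1_000))
print(bitmap_row([2, 5], num_docs=6))
```

Bit i of the row is set exactly when document i contains the term, so unary-coded gaps and a bitmap are the same bit pattern.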

Page 18:

Dynamic collections(1)

‘Insert’ operation

append a new document to an existing collection

‘Edit’ operation

alter, remove

Expanding the text

Expanding the index

Page 19:

Dynamic collections: Expanding the text

Inserting a new document

the text of the collection must be expanded

compression

- cope with hitherto unseen symbol

uncompression

- escape flag, stored uncompressed

periodically be completely rebuilt

a new compression model

Page 20:

Dynamic collections: Expanding the Index (1)

‘stop-press’ file

accumulate updates in a stop-press file

rebuild when the file becomes too large

drawback

time to reindex (the data)

The Inverted file

a newly inserted document contains many terms

variable-length records
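The stop-press idea described above can be sketched as a small class. Everything here is an illustrative assumption: the class and method names, the whitespace tokenizer, the document-count limit used as the "too large" test, and the choice to scan the stop-press documents linearly at query time.

```python
from collections import defaultdict

def build_index(docs):
    """Full inversion: term -> ascending list of document numbers."""
    index = defaultdict(list)
    for d, text in enumerate(docs, start=1):
        for term in sorted(set(text.split())):
            index[term].append(d)
    return index

class StopPressCollection:
    def __init__(self, docs, limit):
        self.docs = list(docs)
        self.limit = limit            # max documents held in the stop-press file
        self.stop_press = []
        self.main = build_index(self.docs)
        self.rebuilds = 0

    def insert(self, text):
        self.stop_press.append(text)
        if len(self.stop_press) > self.limit:     # file too large: reindex
            self.docs.extend(self.stop_press)
            self.stop_press = []
            self.main = build_index(self.docs)
            self.rebuilds += 1

    def search(self, term):
        hits = list(self.main.get(term, []))
        base = len(self.docs)
        # New documents are not in the main index yet, so scan them directly.
        hits += [base + i for i, text in enumerate(self.stop_press, start=1)
                 if term in text.split()]
        return hits

c = StopPressCollection(["a b", "b c"], limit=2)
c.insert("c d")
print(c.search("c"), c.rebuilds)
```

Inserts are cheap until the limit is reached. The rebuild then pays the full reindexing time, which is the drawback named on the slide.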

Page 21:

Dynamic collections: Expanding the Index (2)

Issue

suitable file structure

record extension

record insertion

Page 22:

Dynamic collections: Expanding the Index (2)

Block Structure

Fixed length blocks : b bytes

- block address table, records, free space

- figure 5.15

Main memory

- record address table : record number, block number

- free list

- current last block of the file

Page 23:

Dynamic collections: Expanding the Index (3)

Access record

1) Record number

2) Block address from the record address table

3) Block read into memory

4) The address of the record within the block

5) Read the record
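The five access steps can be traced in a small in-memory model of the block structure. This is a sketch under several assumptions: a Python list stands in for the file on disk, the space taken by each block's address table is not charged against the b bytes, and new records simply go into the current last block. All class and attribute names are hypothetical.

```python
class Block:
    """One fixed-length block: a block address table plus record bytes."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.table = {}              # record number -> (offset, length)
        self.data = bytearray()

    def free_space(self):
        return self.capacity - len(self.data)

    def append(self, record_no, payload):
        self.table[record_no] = (len(self.data), len(payload))
        self.data += payload

class BlockFile:
    def __init__(self, block_size):
        self.block_size = block_size
        self.blocks = [Block(block_size)]   # stands in for the file on disk
        self.record_table = {}              # in memory: record no. -> block no.
        self.disk_reads = 0

    def insert(self, record_no, payload):
        if len(payload) > self.block_size:
            raise ValueError("record longer than a block")
        last = self.blocks[-1]              # current last block of the file
        if last.free_space() < len(payload):
            last = Block(self.block_size)
            self.blocks.append(last)
        last.append(record_no, payload)
        self.record_table[record_no] = len(self.blocks) - 1

    def read(self, record_no):
        block_no = self.record_table[record_no]   # steps 1 and 2
        block = self.blocks[block_no]             # step 3: read the block
        self.disk_reads += 1
        offset, length = block.table[record_no]   # step 4: address in block
        return bytes(block.data[offset:offset + length])  # step 5

f = BlockFile(block_size=8)
f.insert(1, b"abcde")
f.insert(2, b"fgh")
f.insert(3, b"ij")
print(f.record_table, f.read(2), f.disk_reads)
```

Keeping the record address table in main memory means a record can be fetched with a single block read.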

Page 24:

Dynamic collections: Expanding the Index (4)

Expanding a particular record

sufficient free space

1) Read the block

2) Move records to make space

3) Add the extension

4) Update the block table and write the block

insufficient free space

- smallest record remove, insert extension

- extended record remove, insert into new block

Page 25:

Dynamic collections: Expanding the Index (5)

Insert a record

free list check

- decide which block to insert into
- create a new block

Block read/write (disk operations)

general case : 2

worst case : 4

Reduce the number of disk operations

using an ‘update cache’