5. Index Construction
Artificial Intelligence Laboratory
2
Contents
• Memory-based inversion
• Sort-based inversion
• Exploiting index compression
• Compressed in-memory inversion
• Comparison of inversion methods
• Constructing signature files and bitmaps
• Dynamic collections
3
The problem is the size of the frequency matrix (4-byte integer for each entry)

                     Bible                         TREC
Terms                9,020                         538,244
Documents            31,102                        742,368
Matrix size          4 x 9,020 x 31,102 bytes      4 x 538,244 x 742,368 bytes
                     = over one gigabyte           = 1.4 terabytes
Figure 5.1 (static)  One month                     127 years
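The matrix sizes above can be checked with a few lines of Python. This is only an arithmetic sketch; the function name `matrix_bytes` is illustrative, and the 4-byte entry size is the one assumed on the slide.

```python
def matrix_bytes(num_terms, num_docs, bytes_per_entry=4):
    """Size in bytes of a dense term-by-document frequency matrix."""
    return bytes_per_entry * num_terms * num_docs

bible = matrix_bytes(9_020, 31_102)
trec = matrix_bytes(538_244, 742_368)

# Report in binary units (2**30 bytes per gigabyte, 2**40 per terabyte)
print(f"Bible: {bible / 2**30:.2f} gigabytes")
print(f"TREC:  {trec / 2**40:.2f} terabytes")
```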
4
Memory-based inversion
• Fig. 5.2 (documents), Fig. 5.3 (inverted file)
• Assumed that the linked lists are sorted
• Dynamic dictionary data structure
• Linked list (reference point)
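A minimal sketch of memory-based inversion, assuming the whole index fits in main memory. A Python `dict` stands in for the dynamic dictionary and Python lists stand in for the linked lists; the name `invert_in_memory` and the sample documents are illustrative only.

```python
from collections import Counter, defaultdict

def invert_in_memory(docs):
    """docs: list of token lists. Document numbers start at 1."""
    index = defaultdict(list)               # term -> [(d, f_dt), ...]
    for d, tokens in enumerate(docs, start=1):
        for term, f_dt in Counter(tokens).items():
            # Documents are visited in increasing d, so each list stays sorted
            index[term].append((d, f_dt))
    return dict(index)

docs = [
    ["pease", "porridge", "hot"],
    ["pease", "porridge", "cold"],
    ["pease", "pease"],
]
index = invert_in_memory(docs)
```

Because documents are processed in order, every list is already sorted by document number, which matches the assumption stated above.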
5
Problems?
• Requires a lot of memory
• The best method for small collections (Bible…)
• Cannot handle random data
6
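Sort-based inversion
• Fig. 5.4, Fig. 5.5: compare time and space
[Figure: triples such as <1,1,2>.... are quicksorted in blocks of k to give the initial sorted runs in a temporary file; an external mergesort ([log R] passes) then produces the merged, fully sorted runs.]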
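The sort-then-merge flow can be sketched as follows. This is an illustration under simplifying assumptions: in-memory lists stand in for the temporary file, Python's built-in sort stands in for quicksort, and `term_ids` is a hypothetical pre-built mapping from terms to term numbers.

```python
import heapq
from collections import Counter

def sort_based_inversion(docs, term_ids, k=4):
    """Return (<t, d, f_dt> triples in sorted order, number of merge passes)."""
    triples = [(term_ids[w], d, f)
               for d, tokens in enumerate(docs, start=1)
               for w, f in Counter(tokens).items()]
    # Initial sorted runs: each block of k triples is sorted on its own
    runs = [sorted(triples[i:i + k]) for i in range(0, len(triples), k)]
    # Pairwise merge passes until a single fully sorted run remains
    passes = 0
    while len(runs) > 1:
        runs = [list(heapq.merge(*runs[i:i + 2]))
                for i in range(0, len(runs), 2)]
        passes += 1
    return (runs[0] if runs else []), passes

docs = [["a", "b"], ["b", "c", "b"], ["a", "c"]]
term_ids = {"a": 1, "b": 2, "c": 3}
sorted_triples, passes = sort_based_inversion(docs, term_ids, k=2)
```

With three initial runs, two pairwise passes are needed, which is the log R behaviour noted in the figure.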
7
Problems?
• Two copies of the temporary file are needed
• Suitable for collections in the 10~100 Mbyte range
8
Exploiting index compression
• To reduce the resources (space, time)
  - compress the temporary file (sort-based)
  - build the inverted file in main memory, decompressing the index before it is written to disk
• Compressing the temporary files
• Multiway merging
• In-place multiway merging
9
Compressing the temporary files
• Described in Chapters 3 and 4
• Compress the temporary file of <t, d, f_dt> triples
• The t component causes a small loss in compression (e.g., unary + delta code, TREC collection)
(Assumption) Unary-code the t-gap (the difference in t from the following triple): t-gap=0 → code 0, t-gap=1 → 10, t-gap=2 → 110 (0.6 Mbyte needed)
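The unary t-gap coding can be sketched as below, using the same convention as the slide (0 → 0, 1 → 10, 2 → 110). The function names are illustrative, and measuring the first gap from term number 1 is an assumption made only so the example is self-contained.

```python
def unary(gap):
    """Unary code in the slide's convention: 0 -> '0', 1 -> '10', 2 -> '110'."""
    return "1" * gap + "0"

def encode_t_gaps(triples, first_t=1):
    """triples: <t, d, f_dt> sorted by t. Returns one unary codeword per triple."""
    codes = []
    prev_t = first_t                     # assumption: term numbers start at 1
    for t, _d, _f in triples:
        codes.append(unary(t - prev_t))
        prev_t = t
    return codes

triples = [(1, 1, 1), (1, 3, 1), (2, 1, 1), (2, 2, 2), (4, 2, 1)]
codes = encode_t_gaps(triples)
```

Most triples share their t with a neighbour, so most t-gaps are 0 and cost a single bit.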
10
Multiway merging
• With compressed temporary files, the merge is now more processor-intensive than disk-intensive
• Reduce time by using a multiway merge
• Use a priority queue such as a heap
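A minimal sketch of a heap-driven multiway merge, assuming the runs are available as in-memory sorted lists of triples (on disk they would be read block by block). The name `multiway_merge` is illustrative.

```python
import heapq

def multiway_merge(runs):
    """Merge any number of sorted runs in a single pass using a min-heap."""
    iters = [iter(r) for r in runs]
    heap = []
    for i, it in enumerate(iters):
        first = next(it, None)
        if first is not None:
            heap.append((first, i))      # (smallest unread item, run index)
    heapq.heapify(heap)
    out = []
    while heap:
        item, i = heapq.heappop(heap)    # globally smallest remaining item
        out.append(item)
        nxt = next(iters[i], None)       # refill from the run it came from
        if nxt is not None:
            heapq.heappush(heap, (nxt, i))
    return out

runs = [
    [(1, 1, 1), (2, 1, 1)],
    [(2, 2, 2), (3, 2, 1)],
    [(1, 3, 1), (3, 3, 1)],
]
merged = multiway_merge(runs)
```

The heap holds one candidate per run, so each output item costs O(log R) comparisons for R runs, and all runs are merged in one pass.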
11
In-place multiway merging(1)
[Figure: in-place multiway merge. Labels: Heap; Blocks in memory, one per run; Temporary file, on disk; Block table, in memory. The temporary file holds output blocks (OUTPUT BLOCK 1 to 3) interleaved with unread run blocks (RUN 1, BLOCK 2; RUN 2, BLOCK 2 and 3; RUN 3, BLOCK 1 and 2).]
12
In-place multiway merging(2)
• Algorithm
  - a block of b bytes from each run in memory is moved into the heap
  - b bytes at a time are moved from the heap into the in-memory output block
  - the output block is written back to the temporary file (block table)
• Use of slack
  - the output process tends to run ahead of the input process, so empty blocks are added
  - add slack → permutation → compaction → truncation
  - (Second Edition) permutation → truncation
13
Compressed in-memory inversion: Large memory inversion (1)
Large main memory
array - list of document numbers d, frequencies fdt
Compared with the in-memory technique (Section 5.1)
No next-pointer field is needed.
term t : f_t log N bits (document numbers)
f_t log m_t bits (frequencies)
(m_t : maximum within-document frequency)
A preliminary pass is needed to obtain N, f_t, m_t
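The per-term space estimate can be sketched as follows. Rounding each logarithm up to a whole number of bits is an assumption made here for illustration; the function name and the sample values of f_t and m_t are hypothetical.

```python
from math import ceil, log2

def array_bits(N, f_t, m_t):
    """Bits for term t: f_t document numbers plus f_t frequencies."""
    doc_bits = f_t * ceil(log2(N))       # each d takes ceil(log2 N) bits
    freq_bits = f_t * ceil(log2(m_t))    # each f_dt takes ceil(log2 m_t) bits
    return doc_bits + freq_bits

# Bible-sized collection (N = 31,102), a term in 100 documents, m_t = 5
bits = array_bits(N=31_102, f_t=100, m_t=5)
```

All three inputs must be known before the array is allocated, which is why the preliminary pass is required.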
14
Compressed in-memory inversion: Large memory inversion (2)
Two-pass Golomb-coded in memory
First Pass
- count ft, Ft
- write ft, Ft to a lexicon file
Second Pass
- read lexicon file
- calculate bt, btw = 2 log((N-ft)/ft), Bt
- build a compressed in-memory inverted file
- rebuild in-memory inverted file
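For reference, a Golomb codeword for a single gap can be produced as sketched below. The parameter `b` is passed in directly and the slide's formula for b_t is not reproduced here; the function name and the choice to return a bit string are illustrative.

```python
def golomb(x, b):
    """Golomb codeword (bit string) for an integer x >= 1 with parameter b >= 1."""
    q, r = divmod(x - 1, b)
    prefix = "1" * q + "0"               # quotient in unary
    k = (b - 1).bit_length()             # ceil(log2 b): bits for long remainders
    cutoff = (1 << k) - b                # this many remainders use only k-1 bits
    if r < cutoff:
        suffix = format(r, "b").zfill(k - 1) if k > 1 else ""
    else:
        suffix = format(r + cutoff, "b").zfill(k) if k > 0 else ""
    return prefix + suffix

# d-gaps of a sorted document list, coded with b = 3
doc_numbers = [3, 4, 8]
gaps = [d - p for d, p in zip(doc_numbers, [0] + doc_numbers[:-1])]
bits = "".join(golomb(g, 3) for g in gaps)
```

Since the codeword length depends only on the gap and on b, the second pass can size each term's in-memory list from the counts gathered in the first pass.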
15
Compressed in-memory inversion: Lexicon-based partitioning
Subdivide into small tasks
Lexicon-based, no extra disk
make multiple second passes
each processing one load
- e.g., three second passes
Lexicon-based, extra disk
Saves time, wastes disk space
16
Compressed in-memory inversion: Text-based partitioning
Inversion and Merge
Build an in-memory inverted file, then merge the inverted files on disk
Chunk
Information file
- frequency of each term in chunk
Second temp disk file
- disk current pointer
17
Constructing signature files and bitmaps
Enough Main Memory
signature of k documents
k = 8M / W
- W : signature width (bits)
- M : main memory (bytes)
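The relation k = 8M / W is simple arithmetic: M bytes hold 8M bits, and each signature takes W bits. A small sketch, with illustrative values for M and W:

```python
def signatures_per_load(M, W):
    """k = 8M / W: number of W-bit signatures that fit in M bytes."""
    return (8 * M) // W

# Illustrative values: 100 Mbyte of main memory, 1,000-bit signatures
k = signatures_per_load(M=100 * 2**20, W=1000)
```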
Bitmap
build a compressed inverted file
decompress it and store it with unary code
18
Dynamic collections(1)
‘Insert’ operation
append a new document to an existing collection
‘Edit’ operation
alter, remove
Expanding the text
Expanding the index
19
Dynamic collections: Expanding the text
Inserting a new document
the text of the collection must be expanded
compression
- cope with hitherto unseen symbols
decompression
- escape flag, stored uncompressed
periodically be completely rebuilt
a new compression model
20
Dynamic collections: Expanding the index (1)
‘stop-press’ file
accumulate updates in a stop-press file
rebuild when the file grows too large
drawback
time to reindex (the data)
The inverted file
a newly inserted document contains many terms
variable-length records
21
Dynamic collections: Expanding the index (2)
Issue
suitable file structure
record extension
record insertion
22
Dynamic collections: Expanding the index (2)
Block Structure
Fixed length blocks : b bytes
- block address table, records, free space
- figure 5.15
Main memory
- record address table : record number, block number
- free list
- current last block of the file
23
Dynamic collections: Expanding the index (3)
Access record
1) Record number
2) Block address from the record address table
3) Block read into memory
4) The address of the record within the block
5) Read the record
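The five access steps can be sketched with a toy model. Here a Python list stands in for the file of fixed-length blocks on disk, each block is a pair (block address table, data bytes), and the class and field names are hypothetical.

```python
class BlockFile:
    """Toy model of the block structure; `blocks` stands in for the disk file."""

    def __init__(self, blocks, record_table):
        self.blocks = blocks              # block number -> (address table, data)
        self.record_table = record_table  # in memory: record number -> block number

    def read_record(self, record_number):
        block_number = self.record_table[record_number]  # steps 1-2
        table, data = self.blocks[block_number]          # step 3: one block read
        offset, length = table[record_number]            # step 4: address in block
        return data[offset:offset + length]              # step 5

blocks = [
    ({7: (0, 3), 2: (3, 2)}, b"abcde"),
    ({5: (0, 4)}, b"wxyz"),
]
bf = BlockFile(blocks, record_table={7: 0, 2: 0, 5: 1})
record = bf.read_record(2)
```

Only step 3 touches the disk; the record address table lookup happens in main memory.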
24
Dynamic collections: Expanding the index (4)
Expanding a particular record
sufficient free space
1) Block read
2) Move records to make space
3) Add the extension
4) Update the block table and write the block
insufficient free space
- remove the smallest record, insert the extension
- remove the extended record, insert it into a new block
25
Dynamic collections: Expanding the index (5)
Insert a record
check the free list
- choose the block to insert into
- create a new block
Block read/write (disk operations)
general case : 2
worst case : 4
Reduce the number of disk operations
using ‘update cache’