Processo Q&P (Quenching and Partitioning) Estudo de caso completo
Fail Signature Konsep Asas - superimposed coding Pemetakan secara Menegak (Vertical partitioning) ...
-
date post
15-Jan-2016 -
Category
Documents
-
view
248 -
download
0
Transcript of Fail Signature Konsep Asas - superimposed coding Pemetakan secara Menegak (Vertical partitioning) ...
Fail Signature
Konsep Asas - superimposed coding
Pemetakan secara Menegak(Vertical partitioning)
Pemetakan Secara Melintang(Horizontal partitioning)
Fail Signatur merupakan struktur indeks berorientasikan perkataan berdasarkan cincangan
Idea utamanya adalah untuk memulangkan paten bit
(deskriptor atau signatur) dan menyimpannya dalam fail
berasingan yang bertindak sebagai penapis untuk
menghapuskan item data yang tidak bertepatan dengan
permintaan maklumat Kaedah berasaskan signatur adalah lebih pantas
daripada imbasan teks penuh
Fail SignatureFail Signature
Fail signatur selalunya menggunakan pengkodan
superimposed untuk menghasilkan fail signatur.
Dokumen disimpan secara berturutan dalam “fail teks”.
“Signatur” disimpan dalam “fail signatur”.
Blok signatur – setiap satu perkataan dalam blok
dicincang kepada paten bit dengan panjang tetap, F
dengan m bit diset kepada 1 dan yang lainnya diset
kepada 0. Kedudukan m ditentukan oleh fungsi
cincangan.
Berdasarkan algorithma Karp-Rabin
Fail SignatureFail Signature
Algorithma Karp-RabinAlgorithma Karp-Rabin
• “A string matching algorithm that compares string's hash values, rather than the strings themselves. For efficiency, the hash value of the next position in the text is easily computed from the hash value of the current position”
• Fungsi cincang akan mengurangkan perbandingan m huruf kepada perbandingan satu integer,
• Karp dan Rabin mencadangkan penggunaan fungsi cincang h(x) = x mod q di mana q adalah nombor prime yang besar
• Bagi perkataan w dengan panjang m maka hash(w)
hash(w[0 .. m-1])=(w[0]*2m-1+ w[1]*2m-2+···+ w[m-1]*20) mod q
rehash(a,b,h)= ((h-a*2m-1)*2+b) mod q
ascii(e)
string = abcdeascii(a) = 97m = 4
hash(“abcd”) = (97*23 + 98*22 + 99*21 + 100*20 ) % q = 1466 % q
hash(“bcde”) = [ ( 1466 - 97*23) * 2 + 101 ] % q
hash(“bcde”) = (98*23 + 99*22 + 100*21 + 101*20 ) % q = 1481 % q
hash(w[0 .. m-1])=(w[0]*2m-1+ w[1]*2m-2+···+ w[m-1]*20) mod q
rehash(a,b,h)= ((h-a*2m-1)*2+b) mod q
Penggunaan pengekodan superimposed bagi menjana signature
Setiap teks dibahagi kepada blok logikal
Setiap blok mengandungi perbezaan n perkataan
Setiap perkataan ditukar dalam bentuk “signature perkataan”
Signature perkataan adalah dalam bentuk B-bit dengan 1-bit m
Setiap perkataan dipecah kepada jujukan dan tindanan huruf-huruf e.g.
free --> fr, fre, ree, ee Setiap perkataan signature adalah dalam bentuk OR bagi membentuk
block signature
Blok signature dicantumkan bagi membentuk signature dokumen
Fail Signature : StrukturFail Signature : Struktur
Signature perkataan merupakan vektor bit bagi F bit yang mengandungi m bit, 1.
Jika 12 bit digunakan bagi 4 bit 1, maka bagi menentukan kedudukan bit 1 satu maka langkah berikut perlu dilakukan
1.Jadi perkataan ke dalam bentuk triplet bertindan ( 3 – 3 huruf)
Contoh free *fr, fre, ree, ee* (* mewakili ruang kosong)
Words hash(word signatures)free 001 000 110 010text 000 010 101 001
free textText string:
F = 12m = 4
Signature PerkataanSignature Perkataan
Tukar huruf diatas (dalam hexadecimal) dalam bentuk ASCII
*fr 206672 ree 726565
fre 667265 ee* 656520
Mod kan nilai di atas dengan 12 dan dapatkan bakinya
206672 % 12 = 8 726565 % 12 = 7
667265 % 12 = B 656520 % 12 = 0 rehash 3
(rehash(a,b,h) = ((h-a*2m-1)*2+b) mod q
Maka kedudukan bit 1 ditentukan berdasarkan langkah diatas.
Signature PerkataanSignature Perkataan
1 2 3 4 5 6 7 8 9 10 11 12
free 0 0 1 0 0 0 1 1 0 0 1 0
block signaturesBlock 2Block 1
Dokumen dibahagikan kepada blok yang saiznya ditetapkan (satu blok mungkin menagndungi 2 perkataan bersebelahan)
superimposed coding
• bit-wise OR bagi semua signature perkataan di dalam blok.
Words Word signaturesfree 001 000 110 010text 000 010 101 001
information 000 100 011 110retrieval 101 000 100 001
free text information retrievalText string:
001 010 111 011
101 100 111 111
Block SignatureBlock Signature
Diberi perkataan, q, penjanaan kueri signature ( F bit dengan m, 1) Sq
Keadaan dikatakan berpadanan (match) bilamana Sq Sb = Sq
Example:
query signature 000 010 101 001
block 1 001 010 111 011
block 2 101 100 111 111
Retrieval Using Signature FileRetrieval Using Signature File
000 010 101 001… …… … other signatures
in block 2
Block 1: 000 010 101 001 001 010 111 011 000 010 101 001
Block 2: 000 010 101 001 101 100 111 111
= 000 000 101 001 000 010 101 001Berpadanan dokumen dan kueri
X Berpadanan dokumen dan kueri
Copyright Bruce Croft and/or James Allan
Copyright Bruce Croft and/or James Allan
Signature File
Signature File
This is a text. A text has many words. Words are made from letters.Block 1 Block 2 Block 3 Block 4
000101 110101 100100 101101
Text
Text Signatureh(text) =000101h(many) =110000h(words) =100100h(made) =001100h(letters) =100001
Kebarangkalian False DropKebarangkalian False Drop
• TETAPI… jika berpadanan antara kueri dan blok, ianya belum menjamin yang ianya betul-betul berpadanan
• Contoh kueri adalah free query signature 001 000 110 010
block 1 001 010 111 011
block 2 101 100 111 111
Kedua blok didapati berpadanan, namun perkataan free tidak terdapat pada blok 2
• Blok 2 dikatakan false drop.
• Apakah yang menyebabkan false drops?–Alasan utama : superimposition merupakan cantuman bit dari beberapa perkataan signature
–Alasan Minor : perlanggaran cincang (2 different words having same word signature), but if F is large enough hash collision probability is very low
Saiz fail signature adalah kecil dan mudah dikawal Kos penyelenggaraan kecil (kemaskini dan hapus) kerana
pengorganisasian yang mudah. Signature mudah dijanakan dan kos kemasukkan data baru
adalah rendah. Pengkodan Superimposed adalah baik untuk capaian multi-
attribute.
KebaikanKebaikan
Pengelintaran adalah lambat berbanding file songsang. False drops adalah mahal untuk dihapuskan kerana semua
signature yang berpadanan mesti dipastikan patternnya. Maklumat kekerapan/pemberat susah nak ditentukan di dalam
signature. Fungsi queri seperti disjunctive conditions, synonyms, wildcards,
proximity operations susah nak dikendalikan
KekuranganKekurangan
The most straightforward way is to store the block signatures sequentially
Signature filePointer
file Document file
N b
lock
s
F bits
001 010 111 011
101 100 111 101
Sequential Signature File (SSF)Sequential Signature File (SSF)
Walaupun kaedah di atas boleh diimplementasikan terus, prestasi penggelintaran menjadi lambat untuk pangkalan data yang bersar. Metod yang boleh digunakan untuk mengatasi masalah ini ialah:
• Pemampatan (Compression) example run length coding
• Pembahagian menegak (Vertical partitioning)
• Pembahagian melintang (Horizontal partitioning)
Sequential Signature File (SSF)Sequential Signature File (SSF)
data 0000 0000 0000 0010 0000base 0000 0001 0000 0000 0000management 0000 1000 0000 0000 0000system 0000 0000 0000 0000 1000block signature 0000 1001 0000 0010 1000
L1 L2 L3L4 L5
[L1] [L2] [L3] [L4] [L5]where [x] is the encoded value of x.
search: Decode the encoded lengths of all the preceding intervalsexample: search “data” (1) data ==> 0000 0000 0000 0010 0000 (2) decode [L1]=0000, decode [L2]=00, decode [L3]=000000disadvantage: search becomes low
Compression using run-length encodingCompression using run-length encoding
• Repeated occurrence of the same character is
called a run• Number of repetition is called the length of the run• Run of any length is represented by three
characters• eeeeeee7tnnnnnnnn• @e7t@n8
run-length encoding
Vertical Partitioning
To avoid bringing useless portions of the document signature in main memory
• Bit-Sliced Signature Files (BSSF)
• Frame-sliced signature files (FSSF)
Bit-Sliced Signature Files (BSSF)Store the signatures slice by slice. Each slice i is of length N and corresponding
to i-th bit in the signatures
Pointer file
Doc
umen
t fi
le
F s
igna
ture
file
s
N
0
0
0
11
….
N
Signature file
N b
lock
s
F bits
0 0 ...1 1 0
1 signature
query signature 001 000 110 010
. . .
1
Retrieval
• if ith bit in the query signature is set to 1, retrieve the ith signature block/record.
• if there is n number of bits are set to 1, only n number of records needs to be retrieved.
search: (1) retrieve m bit-files. e.g., the word signature of free is 001 000 110 010 the document contains “free”: 3rd, 7th, 8th, 11th bit are set i.e., only 3rd, 7th, 8th, 11th files are examined.
Frame-Sliced Signature File (FSSF) Divide a signature into a number of frames, and each word only
sets bits in one single frame. Signatures are stored frame-wise; I.e., 1 frame = 1 file
000000
000000010110
110010free
textNo. of frames: 2Signature length: 12No. of 1’s: 3
Tentukan bit pada frame
Tentukan frame
......
110110010001101010
000111
frame1
N blocks ......
010110011001111010
100101
frame2
. . . . . .......
110110010001101010
110100
frame k......
pointer fileD1
D2D3
Dn
Ex) Query = {free, text} H1(free) = 2, H2(free) = 100001 H1(text) = 1, H2(text) = 000011
Frame-Sliced Signature File (FSSF)Frame-Sliced Signature File (FSSF)
FSSF -- Performance Two hash functions are required: one to select the frame and one to set
the bits within the frame One keyword searches only one frame
• If a frame is stored as an individual sequential file, a query with one keyword results in searching one sequential file faster search speed
• e.g., ‘free’ will search the second frame; ‘text’ will search the first frame Updates cost is proportional to the number of frames in a signature (not the
number of bits)
• e.g., inserting a signature means generating the signature (only main memory accesses) and append the signature to the frames
• If a frame is a separate file, insertion will invoke a number of disk accesses equal to the number of frames
Horizontal Partitioning
The methods are to group signatures into sets so that only a few sets are searched.
• Two-level superimposed coding
• multi-level superimposed coding
Two-level signature files Two levels of signature are used. Here records refer to
documents, and blocks refer to a group of documents:
• First level: "Record signature": stored sequentially like SSF.
• Second level: "Block signature": superimposing signatures of all words in the block and stored in a bit-sliced form.
Each level has it own hashing function.
Searching: scan block signature first. And the work on those that qualify
Two Level Superimposed Coding Two levels of signature
• document signature is generated from the document
• block signature is generated using another hash function from the documents in the block (without document boundaries)
block signatures
document signatures documents
Two Level Superimposed Coding (2)
Searching:
• scan the block signature first
• scan the document signature files for the promising blocks only
Comments
• Searching is faster
• the second level signatures have to be longer
Multi-Level Superimposed Coding A generalization of the two level superimposed coding method.
(Lee, et. al., 1995) Signatures at each level are generated directly from the text using
a different hash function.
document signatures documents