Slides He Csdl Dpt 03 4in1


  • Chapter 3

    Text databases (Text/Document database)

    Nguyễn Thanh Bình

    Khoa CNTT&TT (College of Information and Communication Technology), Đại học Cần Thơ (Can Tho University)

    8/2010

    Nguyễn Thanh Bình, Ch. 3: Text databases, 1 / 38

    Contents

    1 Text databases

    2 Text preprocessing and representation

    3 Latent Semantic Indexing

    4 Measuring the reliability of a text retrieval system

    What is a text database?

    A text database stores documents in text form.

    Sources of text data: books, newspapers, magazines, metadata, film subtitles, automatic character recognition output, ...

    A document may be a single sentence, a paragraph, a book chapter, or an entire book.

    Example of a text database

    DocID  String
    d1     Jose Orojuelo's Operations in Bosnia.
    d2     The Medelin Cartel's Financial Organization.
    d3     The Cali Cartel's Distribution Network.
    d4     Banking Operations and Money Laundering.
    d5     Profile of Hector Gomez.
    d6     Connections between Terrorism and Asian Dope Operations.
    d7     Hector Gomez: How he Gave Agents the Slip in Cali.
    d8     Sex, Drugs, and Videotape.
    d9     The Iranian Connection.
    d10    Boating and Drugs: Slips owned by the Cali Cartel.

  • Text document retrieval system

    A user wants to find documents about a topic T.

    The retrieval system searches the database for documents related to T, ranks them by their degree of relevance to T, and returns the results to the user.

    Text retrieval workflow

    [Figure: diagram of the text retrieval workflow]

    Indexing text documents

    To store, search, and sort text documents efficiently, we need to build an index for them.

    Indexing a document involves identifying the main ideas (topics) it contains.

    Indexing is done by associating the document with a set of words called index terms or index keys.

    Steps for indexing a text database

      Preprocess the text documents to extract the index terms.
      Organize the index terms using multidimensional data structures, hash tables, or inverted files.
      Choose the weights assigned to the index terms.
      Build a model for comparing index terms.
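Of the index structures listed in the steps above, the inverted file is the simplest to illustrate. The sketch below is only an illustration under stated assumptions: the function name `build_inverted_index` is hypothetical, and the three sample documents are assumed to be already reduced to lowercase index terms.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each index term to the sorted list of document IDs containing it."""
    postings = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

# Hypothetical documents, already reduced to index terms.
docs = {
    "d8": ["sex", "drug", "videotape"],
    "d9": ["iran", "connect"],
    "d10": ["boat", "drug", "slip", "own", "cali", "cartel"],
}

index = build_inverted_index(docs)
print(index["drug"])   # the documents that mention "drug"
print(index["iran"])   # ['d9']
```

Looking up a term is then a single dictionary access, which is why inverted files suit keyword search.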

  • Text preprocessing

    Text preprocessing consists of the following steps:

      Split the text into words, removing whitespace and unusual symbols.
      Filter out the stop words in the text.
      Reduce the words in the text to their stems.
      Count the words in the text to build the frequency table.

    Tokenization and stop-word removal

    Tokenization

    Split the text into a sequence of words, drop the whitespace, and convert uppercase letters to lowercase.

    Remove all unusual characters and punctuation; keep only tokens made up of alphabetic characters.

    Stop-word removal

    Stop words are words or phrases whose removal does not affect the content of the text or the results of searching it.

    For efficient processing, the stop-word strings are stored in a hash table to optimize lookup and processing time.
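The two steps above can be sketched as follows. This is a minimal sketch, assuming English text (only the letters a to z are kept) and a small illustrative stop-word list that is not from the slides; a Python `set` plays the role of the hash table.

```python
import re

# Illustrative stop-word list (an assumption). A Python set is hash-based,
# so membership tests take constant time on average.
STOP_WORDS = {"a", "an", "and", "by", "he", "how", "in", "of", "the"}

def tokenize(text):
    """Lowercase the text and keep only runs of the letters a-z."""
    return re.findall(r"[a-z]+", text.lower())

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop every token that appears in the stop-word set."""
    return [t for t in tokens if t not in stop_words]

tokens = tokenize("Boating and Drugs: Slips owned by the Cali Cartel.")
filtered = remove_stop_words(tokens)
print(filtered)  # ['boating', 'drugs', 'slips', 'owned', 'cali', 'cartel']
```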

    Stemming

    A word can have many variants.

    Examples: drug, drugged, drugs; stop, stopped, stopping; child, children; ...

    To reduce words to their stem, we can use a dictionary or an automatic program.

    Automatic programs (for example, the Porter Stemmer [a]) rely on rules to reduce a word to its stem.

    sses → ss ; ies → i ; ational → ate ; tional → tion

    [a] http://tartarus.org/~martin/PorterStemmer/
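To make the rule-based idea concrete, the sketch below applies only the four suffix rules quoted above. It is not the Porter Stemmer: the real algorithm has many more rules and extra conditions on the remaining stem. The function name `apply_suffix_rules` is hypothetical.

```python
# The four suffix rules quoted above, tried in order (longest suffix first).
RULES = [
    ("ational", "ate"),
    ("tional", "tion"),
    ("sses", "ss"),
    ("ies", "i"),
]

def apply_suffix_rules(word):
    """Rewrite the first matching suffix; return the word unchanged otherwise."""
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            return word[: -len(suffix)] + replacement
    return word

for w in ["caresses", "ponies", "relational", "conditional", "drug"]:
    print(w, "->", apply_suffix_rules(w))
```

Variants such as "drugged" or "children" are not covered by these four rules, which is where the fuller rule set or a dictionary comes in.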

    Example

    [Figure: original text]

    [Figure: text after tokenization]

  • Example

    [Figure: text after removing whitespace and unusual characters and converting uppercase to lowercase]

    [Figure: text after removing stop words]

    Example

    [Figure: text after stemming]

    [Figure: frequency of occurrence of each term]

    Representing documents with a frequency table

    After preprocessing, each document is reduced to a set of index terms together with the frequency of each term in the document.

    The set of all documents in the database forms the frequency table.

    In the frequency table, each document is represented as a frequency vector.

    Frequency table

    The frequency table (FreqT) records the occurrences of terms in documents (term / document).

      Each row of FreqT represents a term.
      Each column of FreqT represents a document.
      Each value FreqT(i, j) gives the number of occurrences of term t_i in document d_j.

    Definition

    Let D be a set of n documents and T a set of m terms occurring in the documents of D. The frequency table FreqT associated with D and T is an (m × n) matrix satisfying:

    FreqT(i, j) = number of occurrences of term t_i in document d_j

  • Example of a frequency table

    DocID  String
    d8     Sex, Drugs, and Videotape.
    d9     The Iranian Connection.
    d10    Boating and Drugs: Slips owned by the Cali Cartel.

    Term/Doc    d8  d9  d10
    sex          1   0    0
    drug         1   0    1
    videotape    1   0    0
    iran         0   1    0
    connect      0   1    0
    boat         0   0    1
    slip         0   0    1
    own          0   0    1
    cali         0   0    1
    cartel       0   0    1
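The table above can be rebuilt programmatically. The sketch below assumes the three documents have already been preprocessed into the stems shown in the table; `build_freq_table` is a hypothetical helper name.

```python
from collections import Counter

# Index terms per document after preprocessing (taken from the table above).
docs = {
    "d8": ["sex", "drug", "videotape"],
    "d9": ["iran", "connect"],
    "d10": ["boat", "drug", "slip", "own", "cali", "cartel"],
}

def build_freq_table(docs):
    """Return (terms, doc_ids, FreqT) with FreqT[i][j] = count of terms[i] in doc_ids[j]."""
    doc_ids = list(docs)
    counts = {d: Counter(terms) for d, terms in docs.items()}
    # Keep terms in order of first appearance so the rows match the table.
    terms = list(dict.fromkeys(t for d in doc_ids for t in docs[d]))
    freq_t = [[counts[d][t] for d in doc_ids] for t in terms]
    return terms, doc_ids, freq_t

terms, doc_ids, freq_t = build_freq_table(docs)
for term, row in zip(terms, freq_t):
    print(f"{term:10s}", row)
```

Each row is a term and each column a document, matching the (m × n) layout of FreqT.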

    Another example of a frequency table

    Consider the following frequency table:

    Term/Doc   d1   d2   d3   d4   d5   d6
    t1        615  390   10   10   18   65
    t2         15    4   76  217   91  816
    t3          2    8  815  142  765    1
    t4        312  511  677   11  711    2
    t5         45   33  516   64  491   59

    d1 and d2 are similar, because the distribution of terms in d1 resembles the distribution of terms in d2.

    Both contain many occurrences of t1 and t4, few of t2 and t3, and a moderate number of t5.

    d3 and d5 are also similar to each other.

    d4 and d6 differ from each other.

    Frequency table

    However, simply counting terms does not reveal how important each term is within a document.

    Example: a term occurs the same number of times in two documents. In the shorter document, the term carries more weight. This motivates tf (term frequency).

    Example: a term that occurs in very few documents is more significant than a term that occurs in very many documents. This motivates idf (inverse document frequency).

    To assess the importance of terms in a document, we rely not only on the number of occurrences of the terms in the document (local information) but also on global information about the system (the text database).

    This leads to the tf-idf weighting method.

    TF-IDF

    The weight of each entry of FreqT is computed with the formula:

    $$W_{ij} = \left(\frac{F_{ij}}{\sum_k F_{kj}}\right) \times \log\left(\frac{N}{DF_i}\right)$$

    where:

      N is the total number of documents.
      DF_i is the number of documents in which term i occurs.
      F_ij is the number of occurrences of term i in document j.

    We have:

    $TF = \frac{F_{ij}}{\sum_k F_{kj}}$ (term frequency) expresses the weight of term i in document j.

    $IDF = \log\left(\frac{N}{DF_i}\right)$ (inverse document frequency) expresses the importance of term i in the database.

  • Example of TF-IDF

    Example:

      Document d1 contains 100 words, 3 of which are the word car, so tf(car, d1) = 3 / 100 = 0.03.
      Suppose the database holds 10,000,000 documents and car occurs in 1,000 of them, so idf(car) = log10(10,000,000 / 1,000) = 4.
      Hence tf-idf(car, d1) = 0.03 × 4 = 0.12.

    Exercise: compute the tf-idf values for the following frequency table:

    term        d1  d2  d3  df_t
    car         27   4  24  18,165
    auto         3  33   0   6,723
    insurance    0  33  29  19,241
    best        14   0  17  25,235

    The total number of documents is 806,791.
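The worked example and the exercise can be checked with a short script. The sketch assumes base-10 logarithms (consistent with idf = 4 in the worked example) and, for the exercise, that each document's length is the sum of the four listed counts; the slides do not state either point explicitly, and `tf_idf` is a hypothetical helper name.

```python
import math

def tf_idf(f_ij, doc_total, n_docs, df_i):
    """W_ij = (F_ij / sum_k F_kj) * log10(N / DF_i)."""
    return (f_ij / doc_total) * math.log10(n_docs / df_i)

# Worked example: 3 occurrences of "car" in a 100-word document,
# and "car" appears in 1,000 of 10,000,000 documents.
w_car = tf_idf(3, 100, 10_000_000, 1_000)
print(round(w_car, 2))  # 0.12

# Exercise table: raw counts per document (d1, d2, d3) and document frequencies.
counts = {
    "car": [27, 4, 24],
    "auto": [3, 33, 0],
    "insurance": [0, 33, 29],
    "best": [14, 0, 17],
}
df = {"car": 18_165, "auto": 6_723, "insurance": 19_241, "best": 25_235}
N = 806_791

# Assumption: a document's length is the sum of the four listed counts.
totals = [sum(col) for col in zip(*counts.values())]
weights = {
    term: [tf_idf(f, totals[j], N, df[term]) for j, f in enumerate(row)]
    for term, row in counts.items()
}
for term, row in weights.items():
    print(term, [round(w, 3) for w in row])
```

A term that is absent from a document gets weight 0 regardless of its idf, because its tf is 0.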

    Query processing

    Suppose the user submits the query Q: "Find the 25 documents most relevant to banking transactions and drugs."

    Query Q retrieves documents related to two keywords whose stems are drug and bank.

    If we treat the query Q as a document itself, we can apply the same steps: tokenization, removal of whitespace and unusual symbols, stop-word removal, stemming, ... Q can then be represented as a column vector.

    We need to find the columns of FreqT that are close to the column vector associated with Q.
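Turning a preprocessed query into a column vector can be sketched as below. The vocabulary (the row order of FreqT) is a hypothetical example, and `query_vector` is an illustrative name.

```python
def query_vector(query_terms, vocabulary):
    """Count how often each vocabulary term occurs in the preprocessed query."""
    return [query_terms.count(term) for term in vocabulary]

# Row order of FreqT (a hypothetical vocabulary for this example).
vocabulary = ["bank", "cartel", "drug", "launder", "money"]

# The query after tokenization, stop-word removal and stemming.
query_terms = ["drug", "bank"]

vec_q = query_vector(query_terms, vocabulary)
print(vec_q)  # [1, 0, 1, 0, 0]
```

The vector has one entry per row of FreqT, so it can be compared directly with any document column.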

    Similarity measures

    A similarity measure evaluates how alike two documents are.

    The following measures are used:

    1 Term distance between a query Q and a document d_r:

    $$\sqrt{\sum_{j=1}^{M} \left(\mathrm{vecQ}(j) - \mathrm{FreqT}(j, r)\right)^2}$$

    2 Cosine distance, the measure most commonly used in text databases:

    $$\frac{\sum_{j=1}^{M} \mathrm{vecQ}(j) \times \mathrm{FreqT}(j, r)}{\sqrt{\sum_{j=1}^{M} \mathrm{vecQ}(j)^2 \times \sum_{j=1}^{M} \mathrm{FreqT}(j, r)^2}}$$

    Here vecQ(j) is the number of occurrences of term t_j in the query Q, and M is the number of terms (the number of rows of FreqT).
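Both measures are easy to compute on plain lists. The sketch below is an illustration with hypothetical function names; it compares columns d1, d2 and d3 of the six-document frequency table shown earlier, treating one document as the query.

```python
import math

def term_distance(vec_q, doc_col):
    """Euclidean distance between the query vector and a FreqT column."""
    return math.sqrt(sum((q - f) ** 2 for q, f in zip(vec_q, doc_col)))

def cosine_similarity(vec_q, doc_col):
    """Cosine of the angle between the two vectors (1.0 = same direction)."""
    dot = sum(q * f for q, f in zip(vec_q, doc_col))
    norm = math.sqrt(sum(q * q for q in vec_q) * sum(f * f for f in doc_col))
    return dot / norm if norm else 0.0

# Columns d1, d2, d3 of the six-document frequency table shown earlier.
d1 = [615, 15, 2, 312, 45]
d2 = [390, 4, 8, 511, 33]
d3 = [10, 76, 815, 677, 516]

print(round(cosine_similarity(d1, d2), 2))  # high: similar term distributions
print(round(cosine_similarity(d1, d3), 2))  # much lower: dissimilar documents
```

The cosine value depends only on the direction of the vectors, so a long and a short document with the same term proportions still score as similar, whereas the term distance penalizes the difference in length.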

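    Search model with Boolean operators

    Boolean operations are applied between the rows that contain the keywords.

      Google supports the operators AND (the default), OR (keyword OR), and NOT (written -).
      Examples: tin AND học ; tin OR học ; tin -học

    Weights are used to rank the documents found.

    However, exact matching on keywords often does not give good results.

    The query usually has to be expanded to meet the user's needs.

    Example:

    $$Q_i = \left(0.5 + \frac{0.5\,F_i}{\max_k F_k}\right) \times \log\left(\frac{N}{DF_i}\right)$$

    where F_i is the frequency of term i in the query. The denominator is garbled in the source; it is taken here to be the largest term frequency in the query, the usual normalization for this weighting.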
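With an inverted index, the Boolean operators map directly onto set operations. The postings below are a hypothetical sketch derived from the ten example titles at the start of the chapter; `docs_with` is an illustrative helper, not part of any search engine's API.

```python
# Hypothetical postings: each index term maps to the set of documents containing it.
postings = {
    "drug": {"d8", "d10"},
    "cali": {"d3", "d7", "d10"},
    "cartel": {"d2", "d3", "d10"},
}
all_docs = {f"d{i}" for i in range(1, 11)}

def docs_with(term):
    """Documents containing the term (empty set if the term is not indexed)."""
    return postings.get(term, set())

print(sorted(docs_with("drug") & docs_with("cali")))    # AND -> ['d10']
print(sorted(docs_with("drug") | docs_with("cartel")))  # OR
print(sorted(docs_with("cali") - docs_with("drug")))    # cali -drug -> ['d3', 'd7']
print(len(all_docs - docs_with("drug")))                # NOT drug -> 8
```

The result of a Boolean query is an unordered set, which is why weights are still needed to rank the matching documents.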

  • Obstacles to exact-match search

    1 Synonymy: searching for the keyword car will not return documents that talk about auto.

    2 Polysemy: searching for the keyword Apple returns both documents about apple trees and documents about computers.

    Latent Semantic Indexing (LSI) can be used to overcome these problems.

    Latent Semantic Indexing

    LSI is an indexing and retrieval technique that uses singular value decomposition (SVD) to identify the relationships between the terms and the concepts contained in an unstructured collection of text. [1]

    LSI is based on the principle that words used in the same contexts tend to have similar meanings.

    A key feature of LSI is its ability to extract the conceptual content of a text by establishing associations between terms that occur in similar contexts.

    LSI exposes the latent semantic structure underlying the use of words in a text.

    [1] http://en.wikipedia.org/wiki/Latent_semantic_indexing

    The SVD technique

    SVD is the foundation of LSI.

    SVD decomposes the frequency matrix FreqT (of size m × n and rank r) into the product of 3 matrices:

      the term-concept matrix T, an orthogonal matrix of size m × r;
      the singular value matrix S, a non-increasing diagonal matrix of size r × r;
      the concept-document matrix D, an orthogonal matrix of size n × r.

    $$\mathrm{FreqT} = T \, S \, D^T$$

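    Matrix review

    A matrix M is called orthogonal if $M^T M = I$.

    Example: $C = \begin{bmatrix} 1 & 1 \\ 0 & 0 \end{bmatrix}$ (entries as they appear in the source; any signs or fractions in the original are not recoverable)

    A matrix M of size (m × m) is called diagonal if, for all $1 \le i, j \le m$:

    $$i \ne j \Rightarrow M(i, j) = 0$$

    Examples:

    $$A = \begin{bmatrix} 5 & 0 & 0 \\ 0 & 4 & 0 \\ 0 & 0 & 1 \end{bmatrix} \qquad B = \begin{bmatrix} 0 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 0 \end{bmatrix}$$

    A diagonal matrix M of size (m × m) is called non-increasing if, for all $1 \le i, j \le m$:

    $$i \le j \Rightarrow M(i, i) \ge M(j, j)$$

    Example: A is a non-increasing matrix.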

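  • Example of SVD

    Given the matrix:

    $$\mathrm{FreqT} = \begin{bmatrix} 1.44 & 0.52 \\ 0.92 & 1.44 \end{bmatrix}$$

    compute its SVD.

    $$\mathrm{FreqT} = \begin{bmatrix} 0.66 & -0.75 \\ 0.75 & 0.66 \end{bmatrix} \begin{bmatrix} 2.17 & 0 \\ 0 & 0.73 \end{bmatrix} \begin{bmatrix} 0.75 & 0.66 \\ -0.66 & 0.75 \end{bmatrix}$$

    The minus signs, lost in the source, are placed so that the product reproduces FreqT to within rounding; flipping the sign of a column of T together with the matching row of D^T gives an equally valid decomposition.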
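The decomposition can be checked numerically without any library. In the sketch below the placement of the minus signs in T and D^T is an assumption (they do not survive in the slide text); it is the choice for which the product reproduces FreqT. The singular values are computed independently from the eigenvalues of FreqT^T FreqT.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

freq_t = [[1.44, 0.52], [0.92, 1.44]]

# Factors from the slide; the minus signs are an assumption (see lead-in).
T = [[0.66, -0.75], [0.75, 0.66]]
S = [[2.17, 0.0], [0.0, 0.73]]
Dt = [[0.75, 0.66], [-0.66, 0.75]]

product = matmul(matmul(T, S), Dt)
print([[round(v, 2) for v in row] for row in product])

# Singular values of a 2x2 matrix are the square roots of the eigenvalues
# of A^T A, which follow from its trace and determinant.
ata = matmul([list(col) for col in zip(*freq_t)], freq_t)
trace = ata[0][0] + ata[1][1]
det = ata[0][0] * ata[1][1] - ata[0][1] * ata[1][0]
disc = math.sqrt(trace * trace - 4 * det)
singular_values = [math.sqrt((trace + disc) / 2), math.sqrt((trace - disc) / 2)]
print([round(s, 2) for s in singular_values])  # [2.17, 0.73]
```

Because the factors are rounded to two decimals, the product matches FreqT only to within about 0.02 per entry.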

    Matrix reduction

    After applying SVD to the matrix FreqT, we only need to keep the k largest singular values in the matrix S. Typically k = 100..300 and k ≤ r.

    Keeping the k largest singular values in S preserves the most important semantic information in the text while removing noise and other unwanted effects present in the original matrix FreqT.

    Keeping only the k largest singular values in S also greatly reduces the size of the two matrices T and D (to $T_{m \times k}$ and $D_{n \times k}$).

    We then recompute the new frequency matrix:

    $$\mathrm{FreqT}_k = T_{m \times k} \, S_{k \times k} \, D^T_{n \times k}$$

    Example of matrix reduction

    Suppose the matrix FreqT has the SVD:

    [Figure: SVD of FreqT]

    We keep the 3 largest singular values:

    [Figure: the reduced matrices]
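The matrices of this example are not preserved in the transcript, so the sketch below illustrates the truncation on the small 2 × 2 decomposition from the earlier SVD example instead, keeping k = 1. The signs in T and D^T are an assumption, the values are rounded, and `truncate_svd` is a hypothetical helper name.

```python
import math

def truncate_svd(T, s, Dt, k):
    """Rebuild FreqT_k from the k largest singular values and their vectors."""
    rows, cols = len(T), len(Dt[0])
    return [
        [sum(T[i][c] * s[c] * Dt[c][j] for c in range(k)) for j in range(cols)]
        for i in range(rows)
    ]

# Factors of the 2x2 example (signs are an assumption, values rounded).
T = [[0.66, -0.75], [0.75, 0.66]]
s = [2.17, 0.73]  # singular values, largest first
Dt = [[0.75, 0.66], [-0.66, 0.75]]
freq_t = [[1.44, 0.52], [0.92, 1.44]]

freq_t_1 = truncate_svd(T, s, Dt, k=1)

# Frobenius norm of the difference; it is close to the discarded singular value.
error = math.sqrt(
    sum((a - b) ** 2 for ra, rb in zip(freq_t, freq_t_1) for a, b in zip(ra, rb))
)
print([[round(v, 2) for v in row] for row in freq_t_1])
print(round(error, 2))
```

The approximation error is governed by the singular values that were dropped, which is why discarding only the small ones loses little information.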

    Analysis

    In a text database, the number of terms m and the number of documents n are usually very large.

      In English, the number of terms m can exceed 100,000 (including proper names).
      In a text database, the number of documents n can exceed 1,000,000.

    With the SVD technique above, LSI finds a relatively small set of k concepts that still distinguishes the n documents as well as possible.

    In practice, LSI has been found to work well with k ≈ 200.

    Advantage: each document is now represented by a column vector of size 200 instead of size m.

  • Analysis

    Typically, k is chosen to be 200.

    The original frequency table is usually very large. Even for a small text database, we easily reach m = 10,000 terms and n = 1,000,000 documents.

    The sizes of the 3 matrices after dimensionality reduction are:

      The first matrix is (m × k): (10,000 × 200)
      The singular value matrix is (k × k): (200 × 200)
      The last matrix is (n × k): (1,000,000 × 200)

    We therefore process and compute on about 200 million matrix entries, instead of a matrix of size (m × n), that is (10,000 × 1,000,000) or 10 billion entries.
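The storage comparison is plain arithmetic and can be reproduced directly from the sizes quoted above:

```python
m, n, k = 10_000, 1_000_000, 200

reduced = m * k + k * k + n * k   # entries of T, S and D after truncation
full = m * n                      # entries of the original FreqT

print(f"{reduced:,}")  # 202,040,000
print(f"{full:,}")     # 10,000,000,000
print(round(full / reduced, 1))   # how many times smaller the reduced form is
```

Almost all of the reduced storage comes from the (n × k) document matrix, so the saving grows with the ratio m / k.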

    The 4 steps of LSI

    1 Table creation: build the frequency matrix FreqT.

    2 SVD: compute the singular value decomposition (T, S, D) of the matrix FreqT.

    3 Vector identification: for each document d, let vec(d) be the vector corresponding to d in the matrix D.

    4 Index creation: store the set of vectors vec(d) and index it using one of the many available indexing techniques.

    Effectiveness of information retrieval

    When searching with a query typed by a user, the results returned are not always what the user expects.

    The results may fall into special cases such as: relevant documents that are not found, and documents that are found but are not relevant.

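    Precision and Recall

    Suppose:

      D is a finite set of documents.
      A is an algorithm that takes a topic string T as input and returns a set of documents T′.

    $$\text{Precision} = \frac{\text{number of correct documents found}}{\text{number of documents found}} = \frac{\text{number of documents in } T' \text{ relevant to topic } T}{\text{number of documents in } T'}$$

    $$\text{Recall} = \frac{\text{number of correct documents found}}{\text{number of correct documents in the database}} = \frac{\text{number of documents in } T' \text{ relevant to topic } T}{\text{number of documents in the database relevant to topic } T}$$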

  • Example of Precision and Recall

    A text database contains:

      1000 documents
      600 documents relevant to topic T

    Applying a text retrieval algorithm, we obtain:

      750 documents
      450 of them relevant to topic T

    We have:

    $$P_t = \frac{450}{750} = 60\%$$

    $$R_t = \frac{450}{600} = 75\%$$
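The two ratios can be computed from sets of document IDs. The sketch below uses hypothetical integer IDs chosen only to reproduce the counts in the example; `precision_recall` is an illustrative name.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a retrieved set and the relevant set."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical document IDs chosen to reproduce the counts in the example:
# 600 relevant documents, 750 retrieved, 450 in common.
relevant = set(range(600))
retrieved = set(range(150, 900))

p, r = precision_recall(retrieved, relevant)
print(p, r)  # 0.6 0.75
```

Returning every document in the database would push recall to 100% while precision drops to 600 / 1000 = 60%, which is why the two measures are reported together.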
