Web Data Mining


  • 8/3/2019 Khai pha du lieu web

    1/29

    Chapter 1. OVERVIEW OF WEB DATA MINING

    1.1. INTRODUCTION TO DATA MINING AND KDD

    1.1.1. Why data mining is needed

    Over roughly the last decade, the amount of information stored on electronic media (hard disks, CD-ROMs, magnetic tape, etc.) has grown without pause, and this accumulation is happening at an explosive rate. It is estimated that the amount of information worldwide doubles about every two years, and the number and size of databases grow just as quickly. Figuratively speaking, we are drowning in data yet starving for knowledge. The question is whether anything useful can be extracted from these "mountains" of seemingly discarded data.

    "Necessity is the mother of invention": data mining emerged as an effective answer to that question []. There are many definitions of data mining, and they are covered later. For now it can be understood as a knowledge technology that helps extract useful information from the data stores accumulated throughout the operation of a company or organization.

    1.1.2. What is data mining?

    Data mining is defined as the process of extracting, or mining, knowledge from a large amount of data. A commonly used analogy is mining gold from rock and sand: data mining is like "panning for gold" in a large given collection of data. The term refers to finding a small, valuable subset within a large volume of raw data. Several other terms carry a similar meaning, such as knowledge mining, knowledge extraction, data/pattern analysis, data archaeology and data dredging.

    Definition: Data mining is a set of techniques used to automatically explore and discover the relationships among data in a huge and complex data set, and to uncover the patterns hidden in that data set.

    Data mining is one of the seven steps of the KDD (Knowledge Discovery in Databases) process. The seven steps, in order, are:

    1. Data cleaning (data cleaning & preprocessing): remove noise and unnecessary data.

    2. Data integration: merge the data into data stores (data warehouses & data marts) after cleaning and preprocessing.

    3. Data selection: select data from the data stores and convert it into a form suitable for knowledge extraction. This step also covers the handling of noisy data, incomplete data, etc.

    4. Data transformation: convert the data into forms suitable for processing.

    5. Data mining: one of the most important steps, in which intelligent methods are applied to extract data patterns.

    6. Pattern evaluation (knowledge evaluation): assess the results found, using some chosen measures.

    7. Knowledge presentation: use visualization and representation techniques to present the results to the user.

    Figure 1 - The steps in Data Mining & KDD

    1.1.3. The main functions of data mining

    Data mining is divided into several main directions:

    - Concept description: focuses on describing, synthesizing and summarizing concepts. Example: text summarization.
    - Association rules: rules that express knowledge in a fairly simple form. Example: 60% of men who go to the supermarket buy beer, and as many as 80% of those also buy dried beef. Association rules are widely applied in business, medicine, bioinformatics, finance and the stock market, etc.
    - Classification and prediction: assign an object to one of a set of classes known in advance. Example: classifying geographic regions by weather data. This approach usually relies on machine learning techniques such as decision trees and artificial neural networks. Classification is also called supervised learning.
    - Clustering: group objects into clusters whose number and names are not known in advance. Clustering is also called unsupervised learning.
    - Sequential/temporal pattern mining: similar to association rule mining, but with ordering and time taken into account. This approach is widely applied in finance and the stock market because of its strong predictive power.

    1.1.4. Applications of data mining

    Although data mining is a relatively new approach, it has attracted a great deal of attention from researchers and developers thanks to its practical applications. Some typical applications are:

    - Data analysis and decision support
    - Medical treatment
    - Text mining & Web mining
    - Bioinformatics

    - Finance and the stock market
    - Insurance
    - Pattern recognition
    - etc.

    1.2. HYPERTEXT AND FULLTEXT DATABASES

    1.2.1. FullText databases

    FullText data is unstructured data whose information consists only of text documents. Each document carries information about some topic, expressed through the content of all the words that make it up. The meaning of a word in a document is not fixed; it depends on the context in which the word appears. The words in a document are connected to one another according to the rules of some language.

    Text is one of the most common kinds of data today. It is everywhere and we encounter it constantly, so text processing problems were posed long ago and remain central to text mining. Notable problems include text retrieval, text classification, text clustering and text routing.

    A full-text database is an unstructured database whose data consists of documents and document attributes. It is usually organized as a combination of two components: an ordinary structured database (holding the characteristics of the documents) and the documents themselves. The content of a document is stored indirectly, in the sense that the system only manages the address at which the content is stored.

    [Figure: a Full-Text database, made up of a structured database holding document characteristics, and the documents]

    Text databases can be divided into two kinds:

    Unstructured: the ordinary texts we read every day, written in natural human language with no formatting structure at all. Examples: collections of books, magazines and articles managed in a digital library network.

    Semi-structured: texts organized with a loose structure, such as records carrying markup symbols, which still convey the main content of the text. Examples: HTML pages, email, etc.

    This division into two kinds is not clear-cut, however. Software systems often have to combine both, as in search engines or in text retrieval, one of the most active fields today. Search systems such as Yahoo, Altavista and Google all organize data into groups and directories, and each group may contain many subgroups. Altavista also integrates an automatic translation program that can translate into many languages with fairly good results.

    1.2.2. HyperText databases

    According to the Oxford English Dictionary Additions Series, hypertext is text that does not form a single continuous sequence and that can be read in different orders. In particular, it is text and graphics linked in such a way that the reader does not have to read continuously. For example, someone reading a book need not go page by page from beginning to end, but can jump ahead to later passages to consult the topics of interest.

    Hypertext therefore consists of non-continuous writing that branches and lets readers choose their own way of reading. In the everyday sense, hypertext is a set of written pages connected by links, which lets the reader follow different paths. HTML pages are the familiar example: a page contains links pointing to different parts of the same page or to other pages, and the reader reads the text by following those links.

    Hypertext is also a special form of text, so it can contain continuous writing as well (the most common form of writing). Because hypertext is not bound by continuity, new forms of presentation become possible, and a document can reflect its intended content better. Readers can also choose a suitable way to read, for example by going deeper into a topic that interests them. Using pointers to other documents to link together a set of related documents is a genuinely good and very useful way to organize information. For writers, it removes the worry about the order of presentation: they can break a subject into small parts and use links to show how the parts relate. For readers, it offers shortcuts through the information network and lets them decide which information is relevant enough to explore further.

    Compared with linear reading, that is, reading in sequence, hypertext gives us a far more effective interface to the content. From the viewpoint of machine learning algorithms, hypertext offers the chance to look beyond a single document when classifying it, by also taking into account the documents linked to it. Not every linked document is useful for classification, of course, especially since hyperlinks can point to many different kinds of documents. Even so, using the documents linked to a page to improve the accuracy of classifying that page remains a promising direction for research.

    Two hypertext concepts need attention:

    Hypertext document: a single text document within a hypertext system. If the hypertext system is pictured as a graph, the documents correspond to the nodes.

    Hypertext link: a reference that connects one hypertext document to another. The hyperlinks play the role of the edges in that graph.

    Hypertext is a very common kind of data today, and one for which the demand for search and classification is very high. It is the prevalent data on the Internet. A hypertext database holds semi-structured text because of the tags that appear in it: structural tags (title, introduction, body) and emphasis tags (bold, italic, ...). Thanks to these tags we gain one more criterion, compared with full-text documents, for searching and classifying. Based on predefined tags, keywords can be given different priorities depending on where they occur. For example, when searching for documents related to "people", we enter "people" as the search keyword, and documents in which "people" appears in the title are considered closer to the query.

    Comparing Fulltext data and Web page data

    Although a Web page is a special form of full-text data, the two kinds of data differ in many ways. The observations below show how Web data differs from full-text data. These differences in characteristics are the main reason why mining the two kinds of data (classification, search, ...) differs as well.

    [Figure: hypertext documents shown as nodes and hypertext links shown as the edges between them]

    The comparison below between Web data and full-text data is presented in [2].

    No. 1
      Web page: Semi-structured text. The content has a title and tags that emphasize the meaning of a word or phrase.
      Ordinary text (Fulltext): Usually unstructured text. The content offers no criterion on which an evaluation can be based.

    No. 2
      Web page: The content is usually brief and concise, with hyperlinks that point the reader to other places with related content.
      Ordinary text (Fulltext): The content is usually very detailed and complete.

    No. 3
      Web page: The content contains hyperlinks that connect pages with related content to one another.
      Ordinary text (Fulltext): Ordinary text pages cannot link to the content of other pages.
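    The idea that a keyword found in the title should count for more than one found in the body can be sketched as follows. This is only an illustration under stated assumptions, not the scoring of any real search engine: the weights in `TAG_WEIGHTS`, the class `WeightedTermCounter` and the function `keyword_score` are hypothetical names and values introduced here.

```python
from html.parser import HTMLParser

# Hypothetical weights: a keyword inside <title> counts more than one in
# emphasized text, which counts more than one in plain body text.
TAG_WEIGHTS = {"title": 5.0, "h1": 3.0, "b": 2.0, "i": 2.0}
DEFAULT_WEIGHT = 1.0


class WeightedTermCounter(HTMLParser):
    """Accumulates a tag-weighted count for one keyword while parsing HTML."""

    def __init__(self, keyword):
        super().__init__()
        self.keyword = keyword.lower()
        self.open_tags = []  # stack of currently open tags
        self.score = 0.0

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating sloppy markup.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        hits = data.lower().split().count(self.keyword)
        if hits:
            # Use the highest weight among the enclosing tags.
            weight = max(
                (TAG_WEIGHTS.get(t, DEFAULT_WEIGHT) for t in self.open_tags),
                default=DEFAULT_WEIGHT,
            )
            self.score += hits * weight


def keyword_score(html, keyword):
    parser = WeightedTermCounter(keyword)
    parser.feed(html)
    parser.close()
    return parser.score


page_a = "<html><head><title>People of the web</title></head><body>About users.</body></html>"
page_b = "<html><head><title>Statistics</title></head><body>Many people use the web.</body></html>"
```

    With these assumed weights, `keyword_score(page_a, "people")` is higher than `keyword_score(page_b, "people")`, because the keyword sits in the title of the first page and only in the body of the second.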

    1.3. TEXT MINING AND WEB MINING

    As mentioned above, text mining and Web mining are among the important applications of data mining. This section looks at them in more depth.

    1.3.1. Problems in text mining

    1. Text retrieval

    a. Content

    Text retrieval is the process of finding documents that match a user's request. Requests are expressed as queries, the simplest form of which is a set of keywords. A text retrieval system can be pictured as sorting documents into two classes: one containing the documents that satisfy the query, which are shown, and one containing those that do not, which are hidden. Real systems today do not display results in this way [...] of the documents depending on the query entered. Typical examples are search engines such as Google and Altavista.

    b. Process

    The retrieval process is divided into four main stages:

    Indexing: raw documents must be converted into some representation before they can be processed. This stage is also called text representation. The representation must be structured and easy to process.

    Query formulation: users must describe the information they need in the form of a query. Queries must be expressed in a form the search system accepts, most commonly by entering the keywords to look for. Queries can also be formulated in natural language or as examples, but those forms require more complex processing techniques. The great majority of today's retrieval systems use keyword queries.

    Comparison: the system must compare the user's query clearly and completely against the documents stored in the database. It then decides which documents are relevant to the query and in what order to rank them. The system displays either the whole document or only part of it.

    Feedback: the results returned at first often do not satisfy the user, so a feedback stage is needed in which the user can revise the request or enter a new one. The user can also interact with the system about the documents that do satisfy the request, and the system can update those documents accordingly. This stage is called relevance feedback.

    Today's search tools concentrate mainly on the first three stages, and most of them still lack feedback or any handling of user-machine interaction. Feedback is now being studied widely, and within human-machine interface research a line of work on interface agents has emerged.

    2. Text classification (Text Categorization)

    a. Content

    Text classification is the process of assigning documents to one or more classes defined in advance. Documents can be classified manually, by reading each one and assigning it to a class, but for large numbers of documents this takes so much time and effort that it is not feasible. Automatic classification is therefore needed, and it uses machine learning methods from artificial intelligence (decision trees, Bayes, k nearest neighbours).
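    As an illustration of one of the methods named above, the sketch below classifies a text with k nearest neighbours over bag-of-words vectors compared by cosine similarity. It is a minimal example under stated assumptions, not a production classifier: the functions `term_vector`, `cosine` and `knn_classify`, the choice k = 3 and the tiny hand-labelled `training` set are all introduced here for illustration.

```python
import math
from collections import Counter


def term_vector(text):
    """Bag-of-words vector: term -> number of occurrences."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def knn_classify(text, labelled_docs, k=3):
    """Return the majority label among the k most similar training documents."""
    query = term_vector(text)
    ranked = sorted(
        labelled_docs,
        key=lambda doc: cosine(query, term_vector(doc[0])),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical hand-labelled training documents.
training = [
    ("the team won the football match", "sports"),
    ("the striker scored a late goal", "sports"),
    ("the coach praised the goalkeeper", "sports"),
    ("the stock market fell sharply", "economy"),
    ("inflation and interest rates rose", "economy"),
    ("the bank cut interest rates", "economy"),
]

print(knn_classify("interest rates and the stock market", training))  # economy
```

    The hand-labelled `training` list plays the role of the manually classified documents that the text says are needed to train a classifier. A real system would also remove very common words such as "the" before comparing vectors.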

    One of the most important applications of text classification is in text retrieval. Starting from a classified data set, documents are indexed under their corresponding classes, and users can specify in their queries the topic or class of documents they want to find.

    Another application is text understanding. Text classification can be used to filter documents, or parts of documents, that contain the data being sought, without losing the complexity of natural language.

    In text classification, membership in a class can be a true/false value (the document either belongs to the class or does not) or a degree of membership (the document belongs to the class to some extent). When there are many classes, true/false classification amounts to deciding whether a document belongs to one particular class.

    b. Process

    Text classification follows these steps:

    Indexing: documents are indexed in the same way as in text retrieval. Indexing speed matters here, because some new documents may have to be processed in real time.

    Determining the class: as in text retrieval, text classification needs a way to express how a document is judged to belong to a class, based on its representation. In a text classification system this component is called the classifier (categorizer). It plays the same role as the query in a retrieval system, but while queries are transient, the classifier is used in a stable, long-term way.

    Comparison: in most classifiers, every document must be given a true/false assignment to some class. The biggest difference from the comparison stage in text retrieval is that each document is compared against a set of classes only once, and choosing the right decision also depends on the relationships among the classes.

    Feedback: feedback plays two roles in a text classification system. First, classification requires a large number of documents that have already been classified by hand. These documents serve as training samples for building the classifier. Second, the requirements of a classifier are not easy to change the way a query is changed during retrieval feedback. Instead, users can tell the system maintainers which classes they want removed, added or changed.

    3. Other problems

    Besides the two problems above, there are the following:

    - Text summarization
    - Text clustering
    - Term clustering
    - Term classification
    - Latent term indexing
    - Text routing

    In all of these text processing problems, text representation plays a very large role, especially in retrieval, classification, clustering and routing.

    1.3.2. Web data mining

    a. The need

    The rapid growth of the Internet and of intranets has produced an enormous volume of hypertext data (Web data). As the content and number of Web pages change and grow by the day and by the hour, finding information becomes ever harder for users. The need to search for information in an unstructured database has grown largely alongside the Internet itself. Through the Internet, people have become familiar with Web pages and their countless pieces of information. In recent years the Internet has become one of the main channels for science, economic information, commerce and advertising. One reason for this growth is the low cost of publishing a Web page. Compared with other services, such as selling or advertising in a newspaper or magazine, a Web page costs far less and [...] users everywhere in the world. The Web can be likened to an encyclopedia. Information on Web pages is diverse in both content and form. The Internet can be seen as a virtual society, holding information about every aspect of economic and social life, presented as text, images, sound and so on.

    This diversity and sheer volume, however, have given rise to information overload. People cannot find by themselves the addresses of the Web pages that hold what they need, so a utility is required that manages the content of Web pages and finds the addresses of pages whose content matches the searcher's request. Such utilities manage data as unstructured objects. Familiar examples today are Yahoo, Google and Altavista.

    On the other hand, suppose we run Web pages on topics such as computing, sports, economy and society, and construction. Based on the content of the documents that customers view or download, classification tells us which content on our site customers focus on, so that we can add more documents on the topics they care about, and fewer on the others. On the customer side, the analysis also reveals which topics a customer concentrates on, so that extra support can be offered to that customer. Given these practical needs, Web page classification and search remain interesting problems that call for further research.

    b. Difficulties

    The World Wide Web acts as a huge, widely distributed information service covering every field: science, society, commerce, culture and more. The Web is a rich resource for data mining. The following observations show that it also poses great challenges for data mining technology.

    1. The Web seems too large to be organized into a data warehouse for data mining

    Traditional databases are not very large and are usually stored in one place. The Web, by contrast, is very large, on the scale of terabytes, changes constantly, and is spread across a great many computers around the world. Studies of the size of the Web report figures such as these: more than a billion Web pages are currently available to users on the Internet. Assuming each page is 5-10 KB, their total size is on the order of 10 terabytes (10^9 pages × 5-10 KB ≈ 5-10 TB). The growth rate is just as striking: the number of Web pages doubled in the last two years and is expected to keep growing over the next two. Many organizations and communities put most of their public information on the Web. Building a data warehouse to store, copy or integrate all the data on the Web is therefore close to impossible.

    2. Web pages are far more complex than traditional text documents

    Data in traditional databases is usually homogeneous (in language, format, ...), whereas Web data is not homogeneous at all. Web data comes in many languages (both natural languages and programming languages), many formats (text, HTML, PDF, images, sound, ...) and many kinds of vocabulary (email addresses, links, postal codes (zip codes), telephone numbers).

    In other words, Web pages lack a uniform structure. The Web is regarded as a vast digital library, yet the enormous number of documents in it is not arranged according to any particular standard: not by category, title, author, page count or content. This is a serious challenge for finding the information needed in such a library.

    3. The Web is a highly dynamic information source

    The Web changes not only in size. The information inside the pages themselves is also updated continuously. According to one study of more than 500,000 Web pages over more than four months, 23% of the pages changed daily, and within about 10 days 50% of the pages in that domain had disappeared, meaning their URLs no longer existed. News sites, stock markets, advertising companies and Web service centers update their pages regularly. In addition, linkage information and access records are updated as well.

    4. The Web serves a large and diverse community of users

    The Internet currently connects about 50 [sic] workstations, and the user community is still expanding rapidly. Users differ in knowledge, interests and preferences. Most of them do not have a good understanding of the structure of the information network, or do not search deliberately, so they easily get "lost" while "groping" [...] searching and end up with fragments of information that are of little use.

    5. Only a very small part of the information on the Web is truly useful

    According to statistics, 99% of the information on the Web is useless to 99% of Web users. Yet the parts of the Web that a user does not care about still get mixed into the search results. How, then, should the Web be mined to obtain the pages of highest quality by the user's own standards?

    These points show how searching a traditional database differs from searching the Internet. The challenges above have spurred research into mining and exploiting the resources of the Internet.

    c. Advantages

    Alongside these challenges, the Web offers some advantages for Web mining.

    1. The Web consists not only of pages but also of hyperlinks pointing from one page to another. When an author creates a hyperlink from their page to a page A, it means A is useful for the topic under discussion. The more hyperlinks point to page A from other pages, the more important A is taken to be. The large volume of link information therefore provides rich information about the relevance, quality and structure of Web content, and is a major resource for Web mining.

    2. A Web server usually records a Weblog entry for every access to a Web page. The entry contains the URL, the IP address and a timestamp. Weblog data provides rich information about dynamic Web pages. From the URL, IP address and similar fields, a multidimensional view can be built on top of the Weblog database. Multidimensional OLAP analysis can then report the top N users, the top N most accessed pages, the time periods with the most visitors, and the trends in Web access.

    d. The contents of Web mining

    Following the analysis above of the characteristics and content of hypertext documents, Web mining concentrates on the components found in a Web page:

    1. Web content mining

    Web content mining has two parts:

    a. Web Page Content

    This uses only the words in a document and ignores the links between documents. It is exactly text mining.

    b. Search Result

    This is mining of search results. After a search engine has found the Web pages that satisfy the user's request, an equally important task remains: ordering the results by how close they are to the content being sought. This too is Web content mining.

    2. Web Structure Mining

    Mining based on the hyperlinks between related documents.

    3. Web Usage Mining

    a. General Access Pattern Tracking: analyze Web logs to discover users' access patterns on a Web site.

    b. Customized Usage Tracking: analyze each user's access patterns over time to learn how the access tendencies of each group of users vary from one point in time to another.
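    The kind of Web log analysis mentioned above (top N users, top N pages, busiest period) can be sketched with a few counters. This is a minimal sketch under stated assumptions: the log format of (URL, IP address, timestamp) tuples, the sample entries in `weblog` and the function names are hypothetical, and a real server log would first need parsing.

```python
from collections import Counter
from datetime import datetime

# Hypothetical Web log entries: (URL, client IP address, ISO timestamp).
weblog = [
    ("/index.html", "10.0.0.1", "2020-01-15T09:15:00"),
    ("/sports.html", "10.0.0.2", "2020-01-15T09:40:00"),
    ("/index.html", "10.0.0.2", "2020-01-15T10:05:00"),
    ("/sports.html", "10.0.0.1", "2020-01-15T10:20:00"),
    ("/index.html", "10.0.0.1", "2020-01-15T10:45:00"),
    ("/economy.html", "10.0.0.3", "2020-01-15T14:30:00"),
]


def top_pages(entries, n):
    """The n most frequently accessed URLs with their hit counts."""
    return Counter(url for url, _, _ in entries).most_common(n)


def top_users(entries, n):
    """The n most active client addresses with their hit counts."""
    return Counter(ip for _, ip, _ in entries).most_common(n)


def busiest_hour(entries):
    """Hour of the day (0-23) with the largest number of accesses."""
    hours = Counter(datetime.fromisoformat(ts).hour for _, _, ts in entries)
    return hours.most_common(1)[0][0]


print(top_pages(weblog, 2))   # [('/index.html', 3), ('/sports.html', 2)]
print(top_users(weblog, 1))   # [('10.0.0.1', 3)]
print(busiest_hour(weblog))   # 10
```

    Grouping the same entries by user instead of over the whole log would give the per-user view that customized usage tracking calls for.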

    [Figure: The contents of Web mining. Web Mining divides into Web Structure, Web Content (Web Page Content, Search Result) and Web Usage (General Access Pattern, Customized Usage).]

    Chapter 2. SEARCH ENGINES

    2.1. THE NEED

    As noted in the previous chapter, the Internet is like a virtual society: it holds information about every aspect of economic and social life, presented as text, images, sound and so on. Information on Web pages is diverse in both content and form. This diversity and sheer volume, however, have given rise to information overload. For any one user only a very small part of the information is useful: some people, for example, care only about sports and culture pages and hardly ever about the economy. People cannot find by themselves the addresses of the Web pages that hold what they need, so a utility is required that manages the content of Web pages and finds the addresses of pages whose content matches the searcher's request. Familiar examples today are Yahoo, Google and Altavista.

    A search engine is a system built to accept a user's search request (usually a set of keywords), analyze it, search an existing database, and return the resulting Web pages to the user. Concretely, the user sends a query, in its simplest form a list of keywords, and the search engine returns a list of Web pages that are relevant to or contain those keywords. In more complex cases the query is a whole document, a passage of text, or a summary of a document.

    2.2. STRUCTURE AND OPERATION

    2.2.1. Overview of current search systems

    We take the Google search system as a concrete example. This section gives an overview of how it works. Later sections discuss the main applications (crawling, indexing, searching) and the data structures not covered here.

    Most of Google is written in C and C++ and runs well on Solaris or Linux. In Google, Web crawling (downloading Web pages) is carried out by several distributed Web crawlers. A URL server sends lists of URLs to be fetched to the crawlers. The fetched pages are sent to the store server, which compresses them and saves them in the Repository. Every Web page has an associated ID number called a DocID. Indexing is performed by the Indexer and the Sorter. The Indexer reads the Repository, decompresses the documents and parses them. Each document is converted into a set of word occurrences called Hits. A Hit records the word, its position, an approximation of the font size, and its capitalization. The Indexer distributes the Hits into a set of "Barrels". The Indexer has another important function: it parses all the hyperlinks on every page and stores the important information about them in a separate file. This file holds a large amount of information, enough to determine where each link points from and to, together with the text of the link.

    To summarize: the Crawler downloads Web pages and stores them in the Repository. The Indexer reads the Repository, decompresses and parses the documents, encodes them as Hits, sorts the Hits into "Barrels", and parses all hyperlinks and stores them in a file.

    2.2.2. Structure of search systems

    Today's search engines are usually organized into three modules:

    Indexing module: crawls the Web pages on the Internet, parses them and stores them in the database.

    Searching module: queries the database to return the list of documents that satisfy a user request (given as a query consisting of a set of keywords).

    Human-machine interface module: takes the results from the searching module.

    The following sections describe each module and its tasks in detail.

    Figure 2.3_Architecture of the Google search engine

    a. Indexing module

    The indexing module performs the following tasks:

    1. Parse the text and index all the keywords in it (number of occurrences, positions of occurrence).
    2. Build the link graph between hypertext documents (forward links and back links).
    3. Compute the PageRank importance of every document from the hypertext link structure (Google™).

    Each task is examined in detail below.

    a.1. Following hyperlinks across the Web (Web Crawler)

    Crawler(s): most search engines depend on programs called crawlers, which supply the data (the Web pages) that the engine works on. Crawlers are small programs belonging to the search engine that browse the Web, much as a person does when following links from one Web page to another. A crawler is given a set of starting URLs. It parses the links found in those pages and passes the information to the crawler control. The crawler control decides which link to visit next and sends that decision back to the crawler (in some search engines the crawler performs this function itself). The crawler also passes the pages it has found to the page repository, and goes on visiting other Web pages until its resources are exhausted.

    In short, the crawler module retrieves pages from the network and downloads them. The indexing module then indexes the pages and pushes them into the database. This process repeats until the crawler is told to stop.
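    The crawling loop described above can be sketched as a breadth-first traversal. To stay self-contained and runnable, the example does not touch the network: the dictionary `TOY_WEB`, the `fetch` stand-in and the `max_pages` stopping rule are assumptions made here, and the `frontier` queue plays the role of a very simple crawler control.

```python
from collections import deque

# Hypothetical in-memory "Web": URL -> (page text, outgoing links).
TOY_WEB = {
    "http://a.example/": ("home page", ["http://b.example/", "http://c.example/"]),
    "http://b.example/": ("sports news", ["http://c.example/"]),
    "http://c.example/": ("economy news", ["http://a.example/", "http://d.example/"]),
    "http://d.example/": ("culture page", []),
}


def fetch(url):
    """Stand-in for an HTTP download; returns (text, links) or None."""
    return TOY_WEB.get(url)


def crawl(seed_urls, max_pages):
    frontier = deque(seed_urls)  # crawler control: URLs waiting to be visited
    seen = set(seed_urls)        # URLs already queued, to avoid revisiting
    repository = {}              # page repository: URL -> page text
    while frontier and len(repository) < max_pages:
        url = frontier.popleft()
        page = fetch(url)
        if page is None:
            continue
        text, links = page
        repository[url] = text
        for link in links:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return repository


pages = crawl(["http://a.example/"], max_pages=10)
print(list(pages))  # the four toy URLs, in the order they were visited
```

    Replacing the first-in, first-out queue with a priority queue ordered by an importance score is one way to make the crawler visit "important" pages first.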

    The crawler control decides which Web page is visited next.

    A standard search engine has to consider two main issues in the crawler module:

    - The number of Web pages is huge, so the crawler cannot download them all and must pick the "important" ones. Which pages count as important, and how is importance computed?
    - Because the content of Web pages changes continuously, the crawler must regularly revisit the pages it has downloaded in order to pick up the changes. Moreover, pages change at different rates, so the crawler must consider carefully which pages to revisit and which to skip.

    Issue 1: Importance

    For a Web page P, importance can be computed in the following ways:

    1. Given a query Q, the importance of P is defined as the "textual similarity" between P and Q. Represent Q and P as two n-dimensional vectors v = (w1, w2, ..., wn), where wi stands for the i-th word of the vocabulary, for example wi = the number of occurrences of the i-th word. The similarity between P and Q is the cosine of the angle between the two vectors: cos(vP, vQ) = (vP · vQ) / (|vP| |vQ|). Call the importance obtained this way IS(P).

    2. A page that many other pages link to is more important, so another measure of the importance of P is the number of links pointing to P. Call the importance obtained this way IB(P).

    3. Importance can be derived from the URL itself. A page whose address ends in ".com" or contains the word "home" is considered more important.

    Call the importance obtained this way IL(P).

    4. One more way to measure importance is to count how many times users access the page within some period of time. Call this IU(P).

    The final importance of page P is a combination of the measures above, in some chosen proportion:

    IC(P) = k1·IS(P) + k2·IB(P) + k3·IL(P) + k4·IU(P)    (with k1, k2, k3, k4 and the query Q given in advance)

    Issue 2: Refreshing downloaded pages

    There are two strategies for refreshing downloaded pages:

    1. Uniform periodic refresh: the crawler revisits all pages at the same frequency f, regardless of how often they change. All pages are treated equally however they change. With a monthly refresh, for example, PageRank and the index of the words in each URL are recomputed monthly.

    2. Proportional refresh: the more often a page changes, the more frequently it is refreshed. Example: pages e1, e2, ..., en change k1, k2, ..., kn times respectively.

    a.2. Indexing

    The Indexer module examines every word in each Web page stored in the page repository and records the URLs of the pages that contain each word. The result is a very large index table, which makes it possible to return all the URLs of the matching pages on request.

    The indexer module and the collection analysis module in Figure 1 build the various indexes for the downloaded Web pages. The indexer module builds two basic indexes: a text (content) index and a structure (link) index. Using these two indexes and the pages in the repository, the collection analysis module builds a variety of other useful indexes. A few types of index are described briefly below, with a focus on their structure and use.

    Link index

    To build the link index, the part of the Web gathered by the crawler is modelled as a graph of nodes and edges, in which the nodes are Web pages and the edges are the links between pages. The index is built from the nodes and edges of this graph. (see figure)
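    A link index over such a graph can be sketched with two adjacency maps, one for forward links and one for back links. This is a small in-memory illustration rather than the structure any particular engine uses: the page names and the functions `build_link_index` and `backlink_count` are assumptions. The back-link count it returns corresponds to the measure IB(P) defined earlier.

```python
def build_link_index(edges):
    """Build forward and backward adjacency maps from (source, target) links."""
    forward, backward = {}, {}
    for src, dst in edges:
        forward.setdefault(src, set()).add(dst)
        backward.setdefault(dst, set()).add(src)
        # Make sure every page appears in both maps, even with no links.
        forward.setdefault(dst, set())
        backward.setdefault(src, set())
    return forward, backward


def backlink_count(backward, page):
    """IB(P): the number of distinct pages that link to `page`."""
    return len(backward.get(page, ()))


# Hypothetical link structure between four pages.
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("D", "C"), ("C", "A")]
forward, backward = build_link_index(edges)

print(sorted(forward["A"]))         # ['B', 'C']
print(backlink_count(backward, "C"))  # 3
print(backlink_count(backward, "D"))  # 0
```

    Dictionaries of sets work for graphs of this size. As the text goes on to note, a graph with millions of nodes needs a far more compact representation.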

    Figure 1.2_Graph illustrating the nodes (hypertext documents) and edges (links) in a collection of hypertext documents

    The structural information commonly used by search algorithms in retrieval systems is information taken from linked pages, and the links themselves provide an effective way to reach neighbouring information. Small graphs with hundreds or even thousands of nodes can be represented with almost any data structure, but doing the same for a graph with millions of nodes is a major challenge.

    Text index

    Although link-based techniques are used to improve the quality and relevance of the results, term-based retrieval (finding the pages that contain the keywords) is still the main way to identify the Web pages relevant to a query. An index that supports term-based queries can be implemented with any traditional access method for searching the full content of documents. Search engines use an inverted index to represent documents. The inverted index is the traditional choice of index structure for Web pages.

    Suppose we have the following 3 documents:

    document 1: computer science
    document 2: computer is about live
    document 3: to live or not to live

    The index file is created as follows:

    - Take all the words that appear in the 3 documents
    - Store them in alphabetical order (a, b, c, ...)

    - Store the information about each document (document ID, URL, title, short description, ...)

    The result is an inverted index file, which is a list of entries such as the following:

    Term      Doc ID  Position  Address  Title  Description
    about     2       3         ...      ...    ...
    computer  1       1         ...      ...    ...
    computer  2       1         ...      ...    ...
    is        2       2         ...      ...    ...
    live      3       2         ...      ...    ...
    live      3       6         ...      ...    ...
    live      2       4         ...      ...    ...
    not       3       4         ...      ...    ...
    or        3       3         ...      ...    ...
    science   1       2         ...      ...    ...
    to        3       1         ...      ...    ...
    to        3       5         ...      ...    ...
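    The construction of this table can be reproduced with a few lines of code. The sketch below covers only the term, document ID and position columns for the three example documents. The function names `build_inverted_index` and `lookup` are introduced here for illustration, and a real index would store postings per term rather than scan a flat list.

```python
def build_inverted_index(documents):
    """Return sorted (term, doc_id, position) postings; positions start at 1."""
    postings = []
    for doc_id, text in documents.items():
        for position, term in enumerate(text.lower().split(), start=1):
            postings.append((term, doc_id, position))
    postings.sort()  # alphabetical by term, then by document and position
    return postings


def lookup(postings, term):
    """All (doc_id, position) pairs at which `term` occurs."""
    return [(doc_id, pos) for t, doc_id, pos in postings if t == term]


# The three example documents from the text.
documents = {
    1: "computer science",
    2: "computer is about live",
    3: "to live or not to live",
}

postings = build_inverted_index(documents)
for term, doc_id, position in postings:
    print(term, doc_id, position)

print(lookup(postings, "live"))  # [(2, 4), (3, 2), (3, 6)]
```

    The printed postings match the rows of the table above, apart from the order of entries for the same term, which here is sorted by document ID.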

    A search algorithm, however, usually also uses information about how a term occurs in a Web page, for example whether the term is emphasized (it lies inside a formatting tag) or appears in a heading (it lies inside heading tags). To incorporate this information, a new field called the payload is added. It encodes the additional information about the occurrences of terms in the document, which the ranking algorithm uses later.

    Inverted index

    The inverted index is stored as a database file of records. Building a database to hold the inverted index of a large data set, such as the collection of Web pages on the Internet, requires a highly flexible distributed architecture. In the Web environment there are two basic strategies for dividing the inverted index among a set of nodes so that it can be stored in a distributed way across many locations.

    The first type is the local inverted file (IFL).

    In the IFL organization, each node stores the inverted lists of a different small subset of the Web pages held in the page repository. When a search request arrives, the search query component broadcasts the request to all nodes, and each node returns its own list of pages that contain the words being searched.
    The second kind is the global inverted file (GFL).
    In the GFL organization, the inverted index is partitioned by word, so each query server stores the inverted lists of a small subset of the words in the collection. For example, in a system with two query servers A and B, A stores the inverted lists of all words beginning with the letters a to o, and B stores the lists of the remaining words, from p to z. A query for pages containing the word "people" then only needs to ask server B.
    Main data structures
    The Indexer module takes the pages that the Crawler has downloaded into the Repository, indexes them, and stores the result in the database. The database is created during indexing. The following is the main database structure in most search engines:
    a. A keyword file made of records, each with at least two fields: keyword ID and keyword (figure a). The keywords are established during indexing: read the document file, extract the keywords, and check whether each one is already in the keyword file. If it is not, create a new record in the keyword file, which gives the keyword its ID. If it is, fetch its ID. The ID obtained is used to create the next record.
    b. A file of the documents managed by the system, made of records, one per document, with at least the fields: document ID, document name (URL), and the address on the system machine where the document file is stored (the cache of the web pages) (figure b).
    c. A file of the occurrences of keywords in documents, made of records with three fields: document ID, keyword ID, and the position of the keyword in the document (figure c). This is the inverted index file.
    Database organization: use a hash structure keyed on the vocabulary words.
    Challenges
    - Building an inverted index file involves preprocessing the pages into small parts, sorting them by term and locating them, and finally writing the sorted parts out as a set of inverted lists. The time to build the index is not a strict constraint; however, when working with a large collection of Web pages, some index files become hard to manage, require large resources (such as memory), and often need a long time to complete. A comparison with traditional information retrieval systems shows that, in the system under study, the repository holds 40 million Web pages; although that is only 4% of all indexable Web pages, it is already larger than the standard information retrieval benchmark (the TREC-7 collection), which is 100GB.
    - Because the content of web pages changes quickly, the index files must be rebuilt to keep the pages fresh. Part of the Crawler's work is downloading pages, and the index files are rebuilt in parallel with that work.
    - Finally, the storage format of the inverted index file needs careful design. A compressed index file improves query operations compared with keeping the whole index file in memory. The drawback is the time spent on decompression.

    a.3. Computing PageRank
    Search engines have two important features that help them return highly precise results. First, they use the link structure of the Web to compute an importance value for each Web page (PageRank). Second, they use the links to rank the results (Ranking). The map of links between Web pages is what allows PageRank to be computed quickly.
    PageRank is defined as follows:
    Suppose page A has pages T1, T2, ..., Tn pointing to it. The parameter d is a damping factor with a value between 0 and 1; we usually set d = 0.85. C(A) is the number of links going out of page A. The PageRank of A is then:
    $$PR(A) = (1-d) + d\left(\frac{PR(T_1)}{C(T_1)} + \dots + \frac{PR(T_n)}{C(T_n)}\right)$$
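The definition above can be evaluated by simple repeated substitution. The sketch below is a minimal illustration, not the implementation of any search engine: the function name `pagerank`, the dictionary-of-lists input, the starting value of 1.0 per page and the tiny three-page graph are all assumptions made for the example.

```python
def pagerank(links, d=0.85, iterations=20):
    """links maps each page to the list of pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    pr = {p: 1.0 for p in pages}
    for _ in range(iterations):
        new_pr = {}
        for page in pages:
            # Sum PR(T)/C(T) over every page T that links to `page`.
            incoming = sum(
                pr[src] / len(targets)
                for src, targets in links.items()
                if page in targets
            )
            new_pr[page] = (1 - d) + d * incoming
        pr = new_pr
    return pr

links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(links)
print(ranks)
```

With the formula exactly as written, and when every page has at least one outgoing link, the values sum to the number of pages. Replacing the term (1-d) by (1-d)/N, with N the number of pages, is a commonly used variant whose values sum to 1.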


    Because the PageRank of a page represents a probability distribution over the Web pages of a given collection, the PageRank values of all pages in the collection sum to 1.
    [Figure 2.2: pages V1, V2, ..., Vm link to page U, each contributing R(Vi)/N(Vi)]
    The computation is repeated until it converges. With d = 0.85, about 20 iterations are needed for a few million pages. Computing PageRank for 26 million web pages on a medium-sized workstation takes a few hours.
    2.3. DISADVANTAGES OF SEARCH ENGINES

    1. They are automatic search systems: the user plays no role in the search process, and there is no feedback mechanism through which the user can update the search parameters to make the next search more effective.
    2. They treat all keywords as equally important and so do not allow keywords to carry different importance. In large search engines such as Google and Yahoo, entering "System Information" makes the engine look for all Web pages related to the two words System and Information. If a user wants to search for "Computer Story" where the word Computer matters more than the word Story (for example, Computer has weight 0.8 and Story has weight 0.2), a search system that supports this still has to be built.
    3. They do not address the nature of text processing, namely synonymy and polysemy.
    Many documents are related to the content being sought but do not contain the given keywords, only synonyms of them, and those documents are skipped during the search.
    Because most engines search by keyword, relying on an index of the Web pages (index-based search engines), hundreds of documents may contain the given keywords. The engine therefore returns a large number of documents, many of which are barely related or unrelated to the content being sought.
    2.4. THE NEW SEARCH PROBLEM
    Every day billions of people access the Internet, and about as many perform searches with various search engines. If the information from each of these searches were collected, we would certainly have an enormous source of information, and knowing how to use it would make many useful tasks possible. The search problems solved by ordinary search engines simply satisfy the customer's information need and do not yet exploit the information that the customer supplies with each search. Below is a problem proposed as an additional feature for search engines, together with a direction for solving it in the future.
    Problem:
    Based on the documents that customers view or download, analysis tells us which content customers concentrate on within our collection of Web pages, so that we can add more of the documents customers are interested in, and vice versa. On the customer side, analysis also tells us which topics a customer concentrates on, so that further support can be offered to that customer.
    Direction of the solution:
    Build a database of documents that includes a ClassificationID field indicating which domain each document belongs to, based on the result of a prior analysis (the classification table).
    Build a database of customers: before a customer accesses the database, ask the customer to register an account with name, age, address, and so on; we also add two important fields, occupation and education level (let the accuracy of this information be c%). Registering an account is optional for the customer. Afterwards, each time the customer accesses the database, we record the documents the customer accesses in the customer information table. Then, from the information about the accessed documents and the information about the customers, a decision tree algorithm is used to generate rules stating which domain a customer with a given occupation and education level is interested in, with confidence at threshold c.
    2.5. CONCLUSION
    Chapter 3. THE CLASSIFICATION PROBLEM
    3.1. PROBLEM STATEMENT

    By nature, people tend to divide things into different parts and classes. In the same way, a classification algorithm is simply a mapping from an existing database to some specific range of values, based on one attribute or a set of attributes of the data.

    Researchers agree in defining text classification as the assignment of predefined topics to text documents based on their content. Text classification supports information retrieval (Information Retrieval), information extraction (Information Extraction), text filtering, and the automatic routing of documents to predefined topics. To classify documents, supervised machine learning (supervised learning) is used. The data set is split into a training set and a test set: first a model is built from the training samples, then its accuracy is checked on the test set.
    The following figure is a framework for text classification with three main stages. The first stage is document representation, that is, converting the text data into some structured form and gathering the given samples into a training set. The second stage applies machine learning techniques to the training samples just represented, so the representation produced by stage one is the input of stage two. The third stage adds knowledge supplied by the user to increase the accuracy of the document representation or of the learning process.
    In stage two many machine learning methods can be applied: Bayesian network models, decision trees, the k nearest neighbours method, neural networks, SVM, ...
    [Figure: input data is fed to the classification algorithm, which assigns it to Class 1, Class 2, ..., Class n]

    3.2. DOCUMENT REPRESENTATION METHODS
    3.2.1. Document representation methods in FullText databases
    There are three typical FullText database models: the logic model, the syntactic model and the vector model.
    a. The syntactic analysis model

    a.1. Storage rules:
    - Every document must be parsed, and the parse returns detailed information about the topic of the document.


    - The topics of each document are then indexed. Indexing topics works like indexing documents, except that only the words appearing in the topic are indexed.
    - Documents are managed through these topics so that they can be found on request; search queries are based on these topics.
    a.2. Search rules:
    Search queries rely on the indexed topics, so the topics must be indexed first. Indexing a topic works like indexing all the words contained in the topic.
    A query can itself be parsed to return a topic, and the search is then run on that topic.
    The main processing component of a database system built on this model is therefore the system that parses documents and infers their content.
    a.3. Advantages and disadvantages
    Advantages
    Once the topics are available, searching with this method is quite effective and simple, because the search is fast and accurate.
    For languages with simple grammar, the analysis can reach a high and acceptable level of accuracy.
    Disadvantages
    The quality of a system based on this method depends entirely on the quality of the component that parses documents and infers their content. In practice, building that component is very complex, depends on the characteristics of each language, and mostly has not yet reached high accuracy.
    b. The logic model
    In this model, the meaningful words of a document are indexed and the document content is managed through those index entries.
    b.1. Storage rules
    - Every document is indexed according to these rules:
    Collect the meaningful words in the documents, that is, the words carrying the main information about the stored documents.
    Index the incoming documents against this list of keywords. For each keyword in the list, store the positions at which it appears in each document and the name of the document containing it.

    For example, take two documents with IDs VB1 and VB2.
    "Cộng hòa xã hội chủ nghĩa Việt Nam" (VB1)

    "Việt Nam dân chủ cộng hòa" (VB2)

    They are then represented by an index table of terms and positions:

    b.2. Search rules:
    The search query is given in logical form, that is, as a set of operators (AND, OR, ...) applied to words or phrases. The search uses the index table that was built, and the result is the set of documents that satisfy all of the conditions.
    b.3. Advantages and disadvantages


    Advantages
    - Search is fast and simple. Suppose we need to find the word "computer". The system scans the index table to reach the corresponding index entry, if the word "computer" exists in the system. This lookup is fast and simple when the index table has been sorted alphabetically beforehand: a binary search on the sorted table takes on the order of log2(n) comparisons, where n is the number of words in the index table (sorting the table beforehand costs on the order of n*log2(n)). The index entry then tells us which documents contain the word. A search involving k words therefore needs on the order of k*log2(n) operations.
    - Queries are quick to write and flexible.
    Special characters can be used in a query without affecting the complexity of the search. For example, searching for "ta%" returns the documents containing the words ta, tao, tay, ..., that is, the words beginning with "ta". The character % is called a wildcard character.
    In addition, logical operators let the search words be organized into queries flexibly. For example, to find the words [tôi, ta, tao], the brackets [] express a search for any one of the words in the group. This is simply a flexible way of writing the OR operator of logic algebra, instead of writing: find the documents containing the word tôi or the word ta or the word tao.

    Index table for the example documents VB1 and VB2:

    Term    Document ID (position of occurrence)
    Cộng    VB1(1), VB2(5)
    hòa     VB1(2), VB2(6)
    xã      VB1(3)
    hội     VB1(4)
    chủ     VB1(5), VB2(4)
    nghĩa   VB1(6)
    Việt    VB1(7), VB2(1)
    Nam     VB1(8), VB2(2)
    dân     VB2(3)

    Disadvantages:
    - The searcher must have expertise in the domain being searched.
    Because the query is given in logical form, the result also has a logical (Boolean) value: a document is returned only when it satisfies every condition given. To find documents by content, the searcher must therefore know the documents precisely.
    - Indexing the documents is time-consuming and complex.
    - The index tables take up storage space.
    - The documents found are not ranked by relevance.
    - The index tables are inflexible. When the vocabulary changes (words added, deleted, ...), the index entries must change accordingly.
    c. The vector space model
    c.1. Storage rules
    One of the typical methods for representing documents in general is the vector space. In this representation each document is represented by a vector. Each component of the vector corresponds to a distinct term of the original collection (corpus) and is assigned a value by a function f that measures the density of the term in the document.

    We can represent documents with single words as terms and with f giving the number of times each word occurs; this is also called the bag of words representation.
    For instance, document vb1 is represented by a vector V(v1, v2, ..., vn), where vi is the number of occurrences of the i-th keyword (ti) in document vb1.
    Consider the following two documents:
    Document 1: Computer is not only computer
    Document 2: Computer is life

    Word      Vector for document 1  Vector for document 2
    Computer  2                      1
    Is        1                      1
    Life      0                      1
    Not       1                      0
    Only      1                      0

    There are many criteria for choosing the function f, so many different weight values can be produced. Some of these criteria follow.
    The Boolean model

    Suppose a database contains m documents D = {d1, d2, ..., dm}. Each document is represented as a vector over n terms T = {t1, t2, ..., tn}. Let W = (wij) be the weight matrix, where wij is the value of term ti in document dj.
    The Boolean model is the simplest model and is defined as follows:
    wij = 0 if ti does not occur in dj, and wij = 1 otherwise.
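The bag of words counts and the Boolean weights can be computed directly from the two documents used in this section. The sketch below is illustrative only: it assumes whitespace tokenization and lowercasing, and the helper name `term_counts` is hypothetical.

```python
from collections import Counter

def term_counts(text):
    """Bag of words counts for one document (lowercased, split on spaces)."""
    return Counter(text.lower().split())

docs = ["Computer is not only computer", "Computer is life"]
counts = [term_counts(d) for d in docs]
terms = sorted(set().union(*counts))

# W[i][j] = 1 if term i occurs in document j, otherwise 0 (Boolean model).
W = [[1 if c[t] > 0 else 0 for c in counts] for t in terms]

for t, row in zip(terms, W):
    print(t, [c[t] for c in counts], row)
```

Each printed line shows a term, its occurrence counts in the two documents, and the corresponding Boolean row of W.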

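    For example, take the following two documents:
    Document 1: Computer is not only computer
    Document 2: Computer is life

    Word      Vector for document 1  Vector for document 2
    Computer  1                      1
    Is        1                      1
    Life      0                      1
    Not       1                      0
    Only      1                      0

    The frequency model (Frequency Model)
    The frequency model sets the entries of the matrix W = (wij) to positive numbers based on how often a term occurs in a document or on how many documents of the database contain it. There are three common methods: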

    The term frequency method (TF, Term Frequency)
    The value of a term is computed from the number of times the term occurs in the document. Let tfij be the number of occurrences of term ti in document dj; then wij is computed by one of:
    $$w_{ij} = tf_{ij} \quad\text{or}\quad w_{ij} = 1 + \log(tf_{ij}) \quad\text{or}\quad w_{ij} = \sqrt{tf_{ij}}$$
    The inverse document frequency method (IDF, Inverse Document Frequency)
    The value of a term is computed by the following formula, where dfi is the number of documents that contain term ti and m is the number of documents:
    $$w_{ij} = \log\frac{m}{df_i} = \log(m) - \log(df_i)$$
    The TF.IDF method
    This method combines the TF and IDF methods; the weight matrix is computed as follows:
    $$w_{ij} = \begin{cases} \left[1 + \log(tf_{ij})\right]\log\dfrac{m}{df_i} & \text{if } tf_{ij} \ge 1 \\ 0 & \text{if } tf_{ij} = 0 \end{cases}$$
    c.2. Search rules
    A query is mapped to a vector Q(q1, q2, ..., qm) whose coefficients differ from word to word: the more meaningful a word is for the content being sought, the larger its coefficient.
    qi = 0 when the word is not in the list of words being sought.
    qi is non-zero when the word is in the list of words being sought, and the larger qi is, the more relevant the word is to the content of the document. That is, the system gives priority to documents containing search words with high coefficients.
    Example: if, in the content being sought, the word Machine is more important than the word Computer, then in the vector Q we can set qk = 2 and qh = 1, corresponding to tk = Machine and th = Computer.
    Given a vocabulary, we can thus determine the vector of each document, and each incoming query likewise gets a corresponding vector whose coefficients are fixed in advance. Search and management are carried out on these vectors.
    Describing the content of documents and queries by vectors gives a new way of searching and storing Full-Text documents:

    1. Each document is encoded as a vector.
    2. The documents are classified according to these vectors.
    3. Each incoming query is also encoded as a vector.
    Documents are searched by multiplying the query vector with the vector of each document in turn.
    The result returned is every document related to the search query.
    c.3. Advantages and disadvantages
    Advantages
    - The returned documents can be sorted by their relevance to the requested content, because the product computed for each document yields a score of its relevance to the request.
    - Writing search queries is easy and does not require the searcher to have deep expertise in the subject.
    - Storage and search are simpler to carry out.
    - The searcher can decide how many of the most relevant documents are returned.
    Disadvantages
    - Search is rather slow when the vocabulary is large, because the computation runs over all the document vectors.
    - Representing vectors with natural-number coefficients increases search accuracy but slows the computation considerably, because the vector products must be carried out on natural or real numbers; storing the vectors is also costly and complex.
    - The system is inflexible in storing keywords. Even a very small change to the vocabulary table forces either re-vectorizing all stored documents or ignoring the newly added meaningful words in the documents encoded earlier. Given the method's advantages, this small error can be tolerated, because the set of meaningful words is encoded fairly completely before the documents are encoded. The vector method therefore remains of interest and in use.
    - A further disadvantage is that the dimension of each vector in this representation is very large, since it equals the number of distinct words in the document collection. For example, the number of words can range from 10^3 to 10^5 in a small collection, and it is larger still in a large collection, especially in the Web environment.
    Remedy: several methods are applied to reduce the dimension of the vectors. A simple and effective one is to remove stop words (stop words). Stop words are words that express the structure of a sentence rather than the content of the document, such as conjunctions and prepositions. Such words occur very often in documents but are unrelated to the topic and content of the document. We can therefore remove them to reduce the dimension of the representation vectors without affecting search effectiveness.
    Some examples of stop words

    Vietnamese  English
    Và          a
    Hoặc        the
    Cũng        do
                about

    3.2.2. Document representation methods in HyperText databases
    Chapter I described the difficulties of searching Web data and the differences between the structure of a traditional document and that of a HyperText document. Because of these difficulties, data representation in search engines is very important. How should web pages be represented so that an enormous number of them can be stored and the search engine can search quickly and return accurate results to the user?
    a. Representing HyperText documents in search engines (inverted index)
    The Indexer module takes the pages that the Crawler has downloaded into the Repository, indexes them, and stores the result in the database. The database is created during indexing. The following is the main database structure in most search engines:
    - A keyword file made of records, each with at least two fields: keyword ID and keyword. The keywords are established during indexing.
    - A file of the documents managed by the system, made of records, one per document, with at least the fields: document ID, document name (URL), and the address on the system machine where the document file is stored (the cache of the web pages).
    - A file of the occurrences of keywords in documents, made of records with three fields: document ID, keyword ID, and the position of the keyword in the document.
    Advantages: the positions at which words occur are represented (it is known in which kinds of tag a keyword appears, and whether it appears in the title or in the body of the document), so the importance information of keywords is stored.
    Disadvantages: the frequency of keywords is not represented, so the ability to search Web pages by content is missing.
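The frequency information that the positional index lacks is exactly what the TF.IDF weights of section 3.2.1 capture. The sketch below is a minimal illustration under stated assumptions: natural logarithms, whitespace tokenization, a plain dot product as the query score, and hypothetical names (`tfidf_matrix`, `score`). Note that `weights[j][i]` holds the weight of term i in document j, the transpose of the matrix W in the text. The three documents are the ones used in the inverted index example earlier in this chapter.

```python
import math
from collections import Counter

def tfidf_matrix(docs):
    """Return (terms, weights); weights[j][i] is the weight of term i in document j."""
    counts = [Counter(d.lower().split()) for d in docs]
    terms = sorted(set().union(*counts))
    m = len(docs)
    # df[t]: number of documents that contain term t.
    df = {t: sum(1 for c in counts if c[t] > 0) for t in terms}
    weights = []
    for c in counts:
        row = []
        for t in terms:
            tf = c[t]
            # w = (1 + log tf) * log(m / df) when tf >= 1, otherwise 0.
            row.append((1 + math.log(tf)) * math.log(m / df[t]) if tf >= 1 else 0.0)
        weights.append(row)
    return terms, weights

def score(query, terms, doc_vector):
    """Dot product of a query (word -> coefficient) with one document vector."""
    return sum(query.get(t, 0.0) * w for t, w in zip(terms, doc_vector))

docs = ["computer science", "computer is about live", "to live or not to live"]
terms, weights = tfidf_matrix(docs)

# "live" is given twice the coefficient of "computer" in the query vector.
query = {"live": 2.0, "computer": 1.0}
ranking = sorted(range(len(docs)),
                 key=lambda j: score(query, terms, weights[j]),
                 reverse=True)
print(ranking)
```

Documents are returned in decreasing order of their score, which is the ranking by relevance that the Boolean and positional representations cannot provide on their own.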

    b. Representing HyperText documents with the vector model
    In his doctoral thesis, San Slattery [May 2002_CMU-CS-02-142] presents four ways of representing HyperText documents with the vector model.
    Approach 1
    Ignore all link information between neighbouring documents and represent only the content of the document at hand. This is the bag of words representation. When documents are completely independent of one another with respect to their class, this representation is a good choice. In practice, however, neighbouring documents provide a good deal of useful information for classification, so this representation is not effective.
    Approach 2
    The simplest way to use the content of neighbouring documents is to combine the content of the document being represented with the content of all its neighbours into one super_document. Each vector component is then the frequency of a keyword in the super_document.
    The limitation of this representation is that it blurs the distinction between the document under consideration and its neighbours, which introduces confusion into classification. It works well only when the linked documents share the topic of the document being classified.
    Approach 3
    In this representation the vector is split into two parts: the first part represents the keywords in the document being classified, and the second part represents the keywords appearing in all of its neighbouring documents.
    This overcomes the weakness of the previous approach, because the target document is no longer diluted by its neighbours. If the neighbours are useful for classification, their content is easily accessible. The disadvantage is that the dimension of the vector is large.
    Approach 4
    This representation is built as follows:
    - Find the number of neighbouring pages in the whole hypertext collection under consideration; suppose d is the number of neighbours.
    - Structure the vector into d+1 parts:
    The first part directly represents the document being classified.
    Parts 2 to d+1 represent the neighbouring documents, one part per neighbour.
    The resulting vector is clearly very large, and it also does not follow a single rule: there are many ways to order parts 2 onward. This variety in the representation makes it hard to choose data samples for building the model.
    From these approaches we draw the following remarks on representing HyperText documents with the vector model.
    Advantages:
    - The potential information in hyperlinks is exploited.
    - The frequency of words is represented, so documents can be searched by closeness of content.
    Disadvantages:
    - The positions at which words occur are not represented, so information about the importance of a keyword is lost; for example, a keyword appearing in the title or inside bold tags is more important than one appearing elsewhere.

    - The dimension of the vector is very large.
    c. Representing HyperText documents with the relational model
    The relational model is a natural representation for HyperText documents. A binary relation (a link between documents) is easy to construct: its first argument is the name of the document containing the hyperlink and its second argument is the name of the document pointed to.
    a) What is a relation
    To understand the advantages of relational learning (relational learning), we first compare it with propositional algorithms (propositional algorithms), which work on independent examples or entities. All that a propositional learner needs to know about the training examples is the description of each example itself. Likewise, when classifying an example, a propositional learner considers only the information of that example and ignores its relationships with other examples.
    A relational representation includes the propositional representation (such as the vector model, the bag of words, or the set of words) together with information about the relationships between examples. For instance, if our training examples are people, a propositional representation describes only information such as the name, age, job and salary of each person, whereas a relational representation holds all of that plus further information such as the employer-employee relationship or the marriage relationship.
    A relational representation clearly offers the chance to search the whole rich space of relationships. If we believe that related examples can be a useful source of information for classifying some examples, the relational representation is appropriate; conversely, if related examples provide no additional information, the relational representation (relation representation) cannot be better than the propositional representation (proposition representation).
    Relational representation for HyperText
    The relations:
    Link_to(page, page): this relation expresses the hyperlinks (hyperlink) that make up the structure between pages in the whole Web collection. That page 15 contains a hyperlink pointing to page 37 is written: link_to(page15, page37).
    Has_word(page): provides information about the content of each Web page. We only represent the words we are interested in (those later chosen as keywords). For instance, has_computer(A) means that page A contains the word computer.
    Negation can also be expressed: not(link_to(page15, page37)) means that page15 does not link to page37, and not(has_computer(A)) means that page A does not contain the word computer.
    Example: take the following two Web pages A and B.

    Suppose A is a student home page in the set of Web pages of a university.
    Page A is then represented as follows:
    A :- has_common(A), has_list(A), has_vector(A), link_to(B,A), has_jame(B), has_link(B), has_paul(B), not(has_home(A))
    In plain language this translates into the rule: a page that contains the keywords list, vector and common, does not contain the keyword home, and is linked from a page containing the words jame, paul and link is a student home page.
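The relations above can be held as plain sets of tuples, and the example rule can be evaluated against them. This is only an illustrative sketch: the data layout, the helper `has` and the function `is_student_home_page` are hypothetical, and the facts are the ones stated for pages A and B in the example.

```python
# Facts stored as sets of tuples: has_word(word, page) and link_to(source, target).
has_word = {
    ("list", "A"), ("vector", "A"), ("common", "A"),
    ("jame", "B"), ("paul", "B"), ("link", "B"),
}
link_to = {("B", "A")}

def has(word, page):
    return (word, page) in has_word

def is_student_home_page(page):
    """The page has list, vector and common, lacks home, and is linked from
    some page that has jame, paul and link."""
    if not all(has(w, page) for w in ("list", "vector", "common")):
        return False
    if has("home", page):
        return False
    return any(
        target == page and all(has(w, source) for w in ("jame", "paul", "link"))
        for source, target in link_to
    )

print(is_student_home_page("A"), is_student_home_page("B"))
```

A propositional learner would see only the `has_word` facts of a single page; the rule needs the `link_to` fact and the words of page B as well, which is what the relational representation adds.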


    [Figure: page A contains the words List, Vector and Common; page B contains the words Jame, Paul and Link]

    3.3. MACHINE LEARNING METHODS
    3.3.1. The Bayes classification algorithm
    The Bayes classification algorithm is one of the most typical classification algorithms in data and knowledge mining. Its main idea is to compute the posterior probability that event c belongs to class x from the prior probability that event c belongs to class x under condition T.

    Let V be the set of all vocabulary words.
    Suppose there are N document classes: C1, C2, ..., Cn.
    Each class Ci has a probability p(Ci) and a threshold CtgTshi.

    Let p(C | Doc) be the probability that document Doc belongs to class C.
    Given a class C and a document Doc, if the computed probability p(C | Doc) is greater than or equal to the threshold of C, then document Doc belongs to class C.
    Document Doc is represented as a vector whose size is the number of keywords in the document. Each component holds a word of the document and the frequency of that word in the document. The algorithm works on the vocabulary V, the vector representing Doc and the documents already in each class; it computes p(C | Doc) and decides which class Doc belongs to.
    The probability p(C | Doc) is computed by formula (1), with:
    $$p(c \mid x) = \sum_{T} p(c \mid x, T)\, p(T \mid x)$$
    Where:
    |V| : the number of words in the set V
    Fj : the j-th keyword in the vocabulary
    TF(Fj | Doc) : the frequency of word Fj in document Doc (synonyms included)
    TF(Fj | C) : the frequency of word Fj in class C (the number of times Fj occurs in all documents of class C)
    P(Fj | C) : the conditional probability that word Fj occurs in a document of class C
    P(Fj | C) is computed with the Laplace probability estimate. The 1 in the numerator of that formula avoids the case where the frequency of Fj in class C is 0, which happens when Fj does not occur in class C.
    To reduce the complexity and time of the computation, observe that a given document Doc does not contain every word of the vocabulary V. Hence TF(Fj | Doc) = 0 when Fj belongs to V but not to Doc, and so (P(Fj | C))^TF(Fj, Doc) = 1. Formula (1) can therefore be rewritten so that it runs only over the keywords Fj that occur in Doc.
    The classification process thus does not rely on the whole vocabulary but only on the keywords appearing in document Doc.
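The quantities defined above are enough to sketch a classifier. The code below is an assumption-laden illustration, not the formula of this text: it uses the standard multinomial Naive Bayes form, with P(Fj | C) = (1 + TF(Fj | C)) / (|V| + sum of TF(Fi | C) over all i) as the Laplace estimate and p(C | Doc) proportional to p(C) times the product of P(Fj | C)^TF(Fj | Doc) over the words of Doc. The names `train` and `posterior`, the toy classes and the threshold values are hypothetical.

```python
import math
from collections import Counter

def train(class_docs):
    """class_docs maps a class name to a list of tokenized documents."""
    vocab = {w for docs in class_docs.values() for d in docs for w in d}
    total_docs = sum(len(docs) for docs in class_docs.values())
    model = {}
    for c, docs in class_docs.items():
        tf = Counter(w for d in docs for w in d)
        # prior p(C), term frequencies TF(F | C), and their total
        model[c] = (len(docs) / total_docs, tf, sum(tf.values()))
    return vocab, model

def posterior(doc, vocab, model):
    """Return p(C | Doc) for every class, using only the words present in Doc."""
    log_scores = {}
    for c, (prior, tf, total) in model.items():
        s = math.log(prior)
        for word, count in Counter(doc).items():
            if word in vocab:
                # Laplace estimate: the +1 keeps unseen words from giving 0.
                s += count * math.log((1 + tf[word]) / (len(vocab) + total))
        log_scores[c] = s
    top = max(log_scores.values())
    exp = {c: math.exp(s - top) for c, s in log_scores.items()}
    norm = sum(exp.values())
    return {c: v / norm for c, v in exp.items()}

class_docs = {
    "cs": [["computer", "science"], ["computer", "vector"]],
    "bio": [["life", "science"]],
}
vocab, model = train(class_docs)
post = posterior(["computer", "computer"], vocab, model)

# Hypothetical per-class thresholds playing the role of CtgTsh.
thresholds = {"cs": 0.5, "bio": 0.5}
assigned = [c for c, p in post.items() if p >= thresholds[c]]
print(post, assigned)
```

Working in log space avoids underflow for long documents, and the loop runs only over the words of Doc, which matches the remark that the whole vocabulary is not needed at classification time.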

    3.3.2. The k nearest neighbours algorithm
    This algorithm does not rely on the vocabulary set. It still uses the threshold CtgTsh, however, and proceeds through the steps described above. The idea is to take k documents at random and compute the probability p(C | Doc) from the similarity between document Doc and the k chosen documents. The probability p(C | Doc) is computed from the following quantities:

    n : the number of classes
    k : the number of documents chosen for comparison
    P(Ci | Dj) : has value 0 or 1 and indicates whether document Dj belongs to class Ci. This value is needed because a document can belong to more than one class.
    Sm(Doc, Dj) : the degree of similarity between document Doc and the chosen document Dj, computed as the cosine of the angle between the two vectors representing Doc and Dj.
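The similarity Sm and the class memberships P(Ci | Dj) can be combined as in the sketch below. How the text weighs and normalizes the votes is an assumption here: each chosen document Dj adds Sm(Doc, Dj) to every class it belongs to, and the totals are normalized over all classes. The function names and the toy vectors are hypothetical.

```python
import math

def cosine(x, y):
    """Sm: cosine of the angle between two term-frequency dictionaries."""
    dot = sum(x[t] * y.get(t, 0) for t in x)
    nx = math.sqrt(sum(v * v for v in x.values()))
    ny = math.sqrt(sum(v * v for v in y.values()))
    return dot / (nx * ny) if nx and ny else 0.0

def knn_posterior(doc, labelled, classes):
    """labelled: list of (vector, set_of_classes) for the k chosen documents."""
    votes = {c: 0.0 for c in classes}
    for vector, doc_classes in labelled:
        sim = cosine(doc, vector)
        for c in doc_classes:  # P(Ci | Dj) is 1 for these classes, 0 otherwise
            votes[c] += sim
    total = sum(votes.values())
    return {c: (v / total if total else 0.0) for c, v in votes.items()}

doc = {"computer": 1, "science": 1}
labelled = [
    ({"computer": 2}, {"cs"}),
    ({"life": 1, "science": 1}, {"bio"}),
    ({"computer": 1, "science": 1}, {"cs", "edu"}),
]
post = knn_posterior(doc, labelled, ["cs", "bio", "edu"])
print(post)
```

The third document belongs to two classes, which is why P(Ci | Dj) is kept as a 0/1 indicator per class rather than as a single label.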

    Documents are represented in this algorithm exactly as in the Bayes classification algorithm above, that is, by keywords Fi and their corresponding frequencies Xi.
    In formula (4), the cosine similarity
    $$Sm(Doc, D_j) = \frac{\sum_i X_i Y_i}{\sqrt{\sum_i X_i^2}\,\sqrt{\sum_i Y_i^2}}$$

    Xi is the frequency of the i-th keyword (counting the synonyms appearing in document Doc), and

    Yi is the frequency of the i-th keyword (counting the synonyms appearing in document Dj).
    3.3.3. Classification with decision trees
    Decision tree learning is a widely used method for inductive learning from a large sample. It approximates a target function with discrete values. A decision tree can also be converted into an equivalent representation as knowledge in the form of If-then rules. Among decision tree learning algorithms, ID3 and C4.5 are the two best known. The ID3 algorithm is given below.

    ID3 (Examples, Target_attribute, Attributes)
    1. Create a root node Root for the decision tree.
    2. If all Examples are positive, return the single-node tree Root with label +.
    3. If all Examples are negative, return the single-node tree Root with label -.
    4. If Attributes is empty, return the single-node tree Root labelled with the most common value of Target_attribute in Examples.
    5. Otherwise Begin
    5.1. A
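Steps 1 to 4 above map directly onto code. The remaining step is cut off in this text, so the sketch below fills it in with the usual ID3 choice, which is an assumption: split on the attribute with the highest information gain and recurse on each value. The function names and the four toy examples are hypothetical.

```python
import math
from collections import Counter

def entropy(examples, target):
    counts = Counter(e[target] for e in examples)
    n = len(examples)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def information_gain(examples, attribute, target):
    n = len(examples)
    remainder = 0.0
    for value in {e[attribute] for e in examples}:
        subset = [e for e in examples if e[attribute] == value]
        remainder += len(subset) / n * entropy(subset, target)
    return entropy(examples, target) - remainder

def id3(examples, target, attributes):
    labels = [e[target] for e in examples]
    # Steps 2-3: every example has the same label, so return a single node.
    if len(set(labels)) == 1:
        return labels[0]
    # Step 4: no attributes left, so return the most common label.
    if not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Step 5 (assumed): split on the attribute with the highest gain.
    best = max(attributes, key=lambda a: information_gain(examples, a, target))
    rest = [a for a in attributes if a != best]
    return {best: {
        value: id3([e for e in examples if e[best] == value], target, rest)
        for value in {e[best] for e in examples}
    }}

examples = [
    {"has_computer": "yes", "has_life": "no", "label": "+"},
    {"has_computer": "yes", "has_life": "yes", "label": "+"},
    {"has_computer": "no", "has_life": "yes", "label": "-"},
    {"has_computer": "no", "has_life": "no", "label": "-"},
]
tree = id3(examples, "label", ["has_computer", "has_life"])
print(tree)
```

Each path from the root of the returned dictionary to a leaf reads as one If-then rule, which is the conversion to rules mentioned above.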


    A Horn clause is a clause with at most one positive literal, of the following form:
    H \/ (¬L1) \/ (¬L2) \/ ... \/ (¬Ln)
    Here H, L1, L2, ..., Ln are called positive literals, and ¬L1, ¬L2, ..., ¬Ln are called negative literals.
    Written as a rule: (L1 ^ L2 ^ ... ^ Ln) => H. This form is called a First_Order rule.

    L1, L2, ..., Ln are called the set of preconditions. H is called the conclusion.
    Examples of First_Order rules:
    If Parents(x,y) then Ancestor(x,y)
    If (Parents(x,z) ^ Ancestor(z,y)) then Ancestor(x,y)

    Here Parents and Ancestor are called predicates.
    b. The FOIL algorithm
    FOIL was proposed and developed by Quinlan (Quinlan, 1990). FOIL learns from data sets that contain only two classes, the positive examples and the negative examples, and it learns one class, the positive class. The input of FOIL consists of the preconditions and the conclusions. The output is a set of rules generated from those preconditions and conclusions. At each step FOIL adds one literal to the preconditions of the rule being trained. The algorithm uses the function Foil_Gain to choose one literal from the set of candidate literals.

    FOIL is a non-incremental machine learning model: a hill-climbing algorithm that uses a metric based on information theory to build rules covering the data. FOIL has two main stages:
    1. Separate stage: start a new rule.
    2. Conquer stage: conjoin literals to build the body of the clause.
    The separate phase of the algorithm starts a new rule, while the conquer phase builds a conjunction of literals that forms the body of the rule. Each rule describes some subset of the positive examples and no negative example. Note that FOIL has two operators: start a new rule with an empty body, and add a literal to the current rule. FOIL stops adding literals when no negative example is covered by the rule, and it keeps starting new rules until every positive example is covered by some rule.
    The positive examples covered by a clause are removed from the training set, and the process continues by learning the next clauses from the remaining examples; it ends when no positive examples remain.
    The first-level design of FOIL is as follows:
    1. Let POS be the set of positive examples.
    2. Let NEG be the set of negative examples.
    3. Set NewClauseBody to empty.
    4. While POS is not empty do:
       Separate (start a new rule):
       5. Remove from POS all examples that satisfy NewClauseBody.
       6. Reset NEG to the original set of negative examples.
       7. Reset NewClauseBody to empty.
       While NEG is not empty do:
          Conquer (build the clause body):
          8. Choose a literal L.
          9. Conjoin L to NewClauseBody.
          10. Remove from NEG the negative examples that do not satisfy L.

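    FOIL uses hill climbing to add to a rule the literal with the largest information gain. For each variant of a predicate P, FOIL measures the information gained. To select the literal with the highest gain, it needs to know how many of the current positive and negative tuples are satisfied by the variants of each predicate, as determined by enumeration. The FOIL information gain is computed as:
    $$Gain(Literal) = T^{++}\left(\log_2\frac{P_1}{P_1+N_1} - \log_2\frac{P_0}{P_0+N_0}\right)$$
    P0 and N0 are the numbers of positive and negative examples before the literal L is added to the clause.
    P1 and N1 are the numbers of positive and negative examples after the literal L is added to the clause.

    T++ is the number of positive examples covered both before and after the literal is added (that is, the examples that hold for both rule R and rule R', where R' is R after the literal L has been added).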
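The gain formula translates directly into a small function. In the sketch below the function name `foil_gain` and the convention of returning 0.0 when no positive example is covered (where the logarithm would be undefined) are assumptions for illustration, as are the example counts.

```python
import math

def foil_gain(p0, n0, p1, n1, t):
    """FOIL gain of adding a literal.

    p0, n0: positive and negative examples covered before adding the literal
    p1, n1: positive and negative examples covered after adding it
    t:      positive examples covered both before and after (T++)
    """
    if p0 == 0 or p1 == 0:
        # Assumed convention: no gain when no positive example is covered.
        return 0.0
    return t * (math.log2(p1 / (p1 + n1)) - math.log2(p0 / (p0 + n0)))

# Hypothetical counts: 1 positive and 15 negative bindings before the literal,
# 1 positive and 3 negative bindings after it.
gain = foil_gain(1, 15, 1, 3, 1)
print(gain)
```

A literal that removes many negative bindings while keeping the positive ones raises the ratio P1/(P1+N1), so its gain is positive; FOIL adds the candidate literal with the largest such value.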


    The following example illustrates the FOIL algorithm.
    We want to learn the relation GrandDaughter(x,y) from the relations (predicates) GrandDaughter, Father, Male, Female and the constants Victor, Sharon, Bob, Tom.
    Example set: the assertions involving the predicates GrandDaughter, Father, Male, Female and the constants Victor, Sharon, Bob, Tom, where the positive examples are GrandDaughter(Victor, Sharon), Father(Sharon, Bob), Father(Tom, Bob), Female(Sharon), Father(Bob, Victor). All other examples are negative (for instance ¬GrandDaughter(Tom, Bob), ¬Father(Victor, Victor), ...).
    To choose the literals of a rule, FOIL considers the different ways of binding the variables x, y, z, t to the constants above. For instance, at the initial step the rule is only:
    - Step 1: initial rule: GrandDaughter(x,y) ←
    The binding {x/Victor, y/Sharon} gives a positive example, because GrandDaughter(Victor, Sharon) is true in the training data.
    The other 15 bindings correspond to negative examples, because no matching assertion is found in the training set.
    - At each later stage the rule is formed from the set of bindings that yield positive and negative examples. When a literal is added to the rule, the sets of positive and negative examples change. For instance, if the next literal added to the rule is Father(y,z), then instead of the binding {x/Victor, y/Sharon} above, the new binding {x/Victor, y/Sharon, z/Bob} corresponds to a positive example. At each step the numbers of positive and negative examples are computed in order to obtain the information gain Foil_Gain(L,R).
    CHAPTER 4. EXPERIMENTAL SYSTEM
    4.1. SOME RELATED RESEARCH WORKS
    The experimental system is built by combining the strengths of the solutions in earlier research on search and text classification. The content and results of those works are as follows.
    1. [San Slattery (May 2002_CMU-CS-02-142)] Doctoral thesis "HyperText Classification"
    In his doctoral thesis the author compares machine learning algorithms applied to Web page classification, together with their corresponding representations:
    1. Naive Bayes with documents represented as a bag of words (bag of words)
    2. k nearest neighbours with the frequency model for representing Web pages (TF-IDF)
    3. The FOIL algorithm with each document represented as a set of words (set of words), ignoring the links in each document
    4. The FOIL algorithm with documents represented as a set of words (set of words) and taking into account the link information in the documents
    The author implemented and tested these approaches and reports results using recall (recall) and precision (Precision) as the evaluation criteria.

    Approach 4 is the best of all, giving clearly higher recall and precision.
    The author then builds a new HyperText classifier that uses the FOIL_PILES algorithm with documents represented by the relational model.
    2. [Đoàn Sơn] Master's thesis "A method using fuzzy logic and its application in FullText data mining"
    In this thesis the author classifies documents using a representation based on fuzzy logic and applies the decision tree learning algorithm.
    This way of solving the problem has some advantages: using fuzzy concepts reduces the number of attribute dimensions, which reduces the computation time of decision tree learning. The representation also has some limitations, however: a great deal of effort is needed to build the topics, the concepts and the relationships among them.
    3. [Bùi Quang Minh] The Vietseek search engine. Report on the results of a special scientific project at the level of ĐHQGHN, code QG 02-02.
    In the Vietseek search engine, documents are organized into a database.
    Vietseek builds all three kinds of index (TextIndex, StructureIndex and UtilityIndex). The Vietseek database is divided into two parts:
    Part 1: data about Web documents, Domain and Word is stored in the tables of a MySQL database.
    Part 2: index data is stored separately and has its own structure. Because this part demands high speed, it is not stored in the MySQL database but in 300 different binary files.
    Vietseek searches by the phrase entered and returns the documents containing that key phrase, but it does not perform classification.
    4. [Phạm Thị Thanh Nam] Master's thesis "Some solutions to the search problem in HyperText databases"
    From the index database built by Vietseek, the author constructs vectors representing the Web pages, whose components are the frequencies of the keywords in the document under consideration.
    The thesis proposes several algorithms:
    - List the Web pages "closest in meaning" to a given Web page or search phrase, by the criterion of "closeness of content". Closeness of content is obtained by comparing the representation vectors with one another.
    - The importance of a Web page, based on its links with other Web pages and on the frequency of the search keywords in the page.
    - Combine closeness of content and page importance into one criterion called the combined value. Results are displayed in order of combined value.

    Remarks
    Although the first work [Sean Slattery] gives a fairly broad overview of classification methods and analyzes some experimental results, none of the four studies above really addresses the design and implementation of refined solutions to the problems of synonyms and multiple languages in a classification system over a Web database. Surveying solutions to these problems and implementing them experimentally is a meaningful research task. A number of text classification algorithms exist. Implementing, testing, and evaluating the performance of some typical algorithms of this kind on a real Web database (about ten thousand pages) can be seen as a necessary first step toward building and developing Vietnamese search engines.

    4.2. A PROPOSED DATABASE ORGANIZATION AND THE ALGORITHMS APPLIED
    From the HyperText representation methods that have been and are being used and studied, we make the following general observation. The representation used in search engines has the advantage of exploiting important information about where keywords occur, so the pages found can be ranked by closeness to the content of the query keywords. However, it does not appear to take the frequency of keywords in the document into account, so searching by content is hard to carry out.
    The vector-model representation of Sean Slattery [2002], on the other hand, discards the position at which keywords occur, which is very important information for text classification. Moreover, under representation 2, the original document to be classified is blurred within the set of documents related to it, and classification loses accuracy, especially when the related documents do not share the same topic. Under representations 3 and 4, the vector has a very large number of dimensions and many repeated components (the words that occur again and again across the set of related documents).
    Given the strengths and weaknesses of these methods, I propose a representation of my own. The main idea is still based on the vector model, while the construction of the keyword file takes synonyms into account.

    4.2.1. Problem statement
    A set of HyperText documents is given in advance and grouped into classes; each class contains documents (in *.html form) of the same category. Build a system with the following function: when a new document is read in, the system assigns it to an appropriate class.

    4.2.2. Document representation:
    Use a frequency-based vector model that takes into account the importance of the position at which keywords occur, together with the links between pages.
    Build the vector for Web page A as follows:

    - For each Web page A, collect the Web pages that link to A and the pages that A points to.
    - Count the occurrences of each keyword in A and in the pages related to A. Let count[i] be the number of occurrences of the i-th keyword in the vector representing page A:
      If keyword i occurs inside the body tag, increase count[i] by 1.
      If keyword i occurs inside the title tag, increase count[i] by 3.
      After page A has been counted, multiply count[i] by 3 (the weight of the document being represented), then continue counting in the linked pages, applying the same position weights as in document A. The weight of the related documents is 1.
    Thus this representation combines several kinds of information: the incoming and outgoing links of the HyperText document; the neighboring documents, while still giving the original document a higher weight; and the number of occurrences of each keyword in the document together with the positions at which the keywords occur.

    4.2.3. Database design.
    The HyperText documents are encoded into 3 tables in an Access database.
    1. Table 1: the keyword table (KeyWords)

    Field Name | Data Type   | Description
    -----------+-------------+------------------------
    KeyWordID  | Auto Number | Keyword ID
    KeyWord    | Text        | Keyword
    Synonymous | Memo        | Synonyms of the keyword

    Keyword (KeyWord): the content is an English word, so it must satisfy the following conditions. Each English word is treated as a single syllable, that is, one string of the characters a-z, A-Z. Words in a sentence are separated by spaces or by any characters (period, comma, colon, ...) outside a-z, A-Z.
    Synonyms (Synonymous): a Memo field of the form (word1,word2,...,wordn). The synonyms therefore share the same ID (KeyWordID) as the keyword.
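The splitting rule and the Synonymous memo format above can be sketched in Python as follows. This is a minimal illustration, not the author's implementation: the helper names `tokenize`, `parse_synonyms`, and `build_lookup` are hypothetical, and lower-casing words before lookup is an assumption, since the text does not discuss letter case.

```python
import re

# A word is a maximal run of the characters a-z, A-Z; anything else separates words.
WORD_RE = re.compile(r"[A-Za-z]+")


def tokenize(text):
    """Split text into words according to the a-z / A-Z rule."""
    return WORD_RE.findall(text)


def parse_synonyms(memo):
    """Parse a Synonymous memo of the form '(word1,word2,...,wordn)'."""
    inner = memo.strip().strip("()")
    return [w.strip() for w in inner.split(",") if w.strip()]


def build_lookup(rows):
    """Map each keyword and each synonym to the shared KeyWordID.

    rows: iterable of (KeyWordID, KeyWord, Synonymous) tuples.
    Lower-casing is an assumption made for this sketch.
    """
    lookup = {}
    for keyword_id, keyword, memo in rows:
        lookup[keyword.lower()] = keyword_id
        for synonym in parse_synonyms(memo):
            lookup[synonym.lower()] = keyword_id
    return lookup
```

With such a lookup, a synonym found in a page resolves to the same KeyWordID as its keyword, which is what lets synonyms contribute to a single vector component.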

    2. Table 2: the document table (Documents)

    Field Name | Data Type   | Description
    -----------+-------------+---------------------------------
    DocID      | Auto Number | Document ID
    DocName    | Text        | Document name
    CacheAdd   | Text        | Cache address
    Vector     | Memo        | Vector representing the document

    Vector: a Memo field. Each vector has the form:
    (ID of keyword 1, number of occurrences in the title, total number of occurrences in the document);(ID of keyword 2, number of occurrences in the title, total number of occurrences in the document);
    The number of components of the vector is the number of keywords that occur in the Web page being represented, not the number of all keywords in the KeyWords table, so the dimensionality of the vector is greatly reduced. Each component of the vector records how many times a keyword occurs in the document and where it occurs.
    Example: a vector of the form (1,1,4);(2,1,4);(4,2,7) means:
    - Keyword 1 occurs 4 times, of which 1 occurrence is in the title.
    - Keyword 2 occurs 4 times, of which 1 occurrence is in the title.
    - Keyword 4 occurs 7 times, of which 2 occurrences are in the title.

    DocID | Cache Address           | Vector
    ------+-------------------------+--------------------------------
    1     | C:\data\sport\s1.htm    | (1,1,4); (3,1,4); (4,2,7); ...
    2     | C:\data\sport\s2.htm    | (1,2,7); (2,1,4); (3,2,8); ...
    3     | C:\data\culture\ct3.htm | (1,2,6); (5,1,4); (7,2,7); ...
    4     | C:\data\culture\c4.htm  | (2,1,4); (3,1,4); (4,2,7); ...
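Reading and writing the Vector memo field can be sketched as below. The function names `parse_vector` and `format_vector` are hypothetical; the sketch assumes the field holds only complete `(id,title,total)` triples separated by semicolons, as in the format defined for the Vector field.

```python
def parse_vector(memo):
    """Parse '(1,1,4);(2,1,4)' into {keyword_id: (title_count, total_count)}."""
    vector = {}
    for part in memo.split(";"):
        part = part.strip().strip("()")
        if not part:
            continue  # skip the empty piece after a trailing ';'
        keyword_id, title_count, total_count = (int(x) for x in part.split(","))
        vector[keyword_id] = (title_count, total_count)
    return vector


def format_vector(vector):
    """Serialize the dictionary back to the memo format, ordered by keyword ID."""
    return ";".join(
        f"({keyword_id},{title_count},{total_count})"
        for keyword_id, (title_count, total_count) in sorted(vector.items())
    )
```

Keeping only the keywords that actually occur makes this a sparse representation: a keyword that is absent from the dictionary simply has count zero.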

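    3. Table 3: the links between documents (LINKS)

    Field Name | Field Type | Description
    -----------+------------+-----------------------------
    DocID1     | Number     | ID of the linking document
    DocID2     | Number     | ID of the linked-to document

    DocID1 holds the IDs of the documents that link to the documents whose IDs are in DocID2.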
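The direction of a LINKS row determines how the neighbors of a document are looked up. The sketch below uses Python's built-in `sqlite3` with an in-memory database purely as a stand-in for the Access database described in the text; the sample rows and the helper names `outgoing` and `incoming` are made up for illustration.

```python
import sqlite3

# In-memory SQLite database standing in for the Access database (an assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE LINKS (DocID1 INTEGER NOT NULL, DocID2 INTEGER NOT NULL)")
# Illustrative rows: document 1 links to 2 and 3; document 4 links to 1.
conn.executemany("INSERT INTO LINKS VALUES (?, ?)", [(1, 2), (1, 3), (4, 1)])


def outgoing(conn, doc_id):
    """IDs of the documents that doc_id links to."""
    rows = conn.execute(
        "SELECT DocID2 FROM LINKS WHERE DocID1 = ? ORDER BY DocID2", (doc_id,)
    )
    return [row[0] for row in rows]


def incoming(conn, doc_id):
    """IDs of the documents that link to doc_id."""
    rows = conn.execute(
        "SELECT DocID1 FROM LINKS WHERE DocID2 = ? ORDER BY DocID1", (doc_id,)
    )
    return [row[0] for row in rows]
```

Both lookups are needed when the vector of a page is built, because the representation counts keywords in pages on either side of a link.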

    4. Table 4: the probabilities of the classes

    Field Name  | Field Type       | Description
    ------------+------------------+-------------------------
    ClassName   | Text             | Class name
    Probability | Number (0..100)  | Probability of the class

    4.2.4. Design of the program modules

    1. Module that parses Web pages to create the KEYWORDS table
    Algorithm:
    Input: the documents used to create keywords
    While (documents remain to be read) do
        1. Read the next document
        2. While (the document has not been read to the end) do
            2.1. Read the next word
            2.2. Insert it into the database
        End
    End.
    Output: the keyword file
    The Synonymous field is filled in by hand for each keyword. The module also provides functions to add keywords by hand and to delete unnecessary keywords.

    2. Module that takes the cache address (CacheAddress) of each training document, creates a document ID (DocID), and fills in the first two fields of the DOCUMENTS table. The Vector field is created later by module 4.
    Algorithm:
    Input: the documents used for training
    While (documents remain to be read) do
        1.1. Read the cache address of the document
             Insert it into the database
        1.2. Read the name of the document
             Insert it into the database
    End
    The document ID is incremented automatically.

    3. Module that creates the LINKS table. To create the LINKS table, the DOCUMENTS table must first be read to obtain the ID (DocID) of each document.
    Algorithm:
    1. Read the folder on the hard disk that contains the documents
    2. Set the variable TênTM=[path of the folder]
    3. While (documents remain to be parsed) do
        3.1. Take the next document in the folder together with its cache address (CacheAdd).
        3.2. Use CacheAdd to look up the DocID of this document in the DOCUMENTS table, giving DocID1
            3.2.1. Parse the document to extract the hyperlink tags, that is, the fragments of the form href=[name of the document pointed to]. Suppose there are N such tags.
            3.2.2. For i=1 to N do
                3.2.2.1. Concatenate TênTM and [name of the document pointed to] to obtain a cache address, then look it up in DOCUMENTS to obtain its DocID, giving DocID2
                3.2.2.2. Insert the two DocIDs obtained above into the fields DocID1 and DocID2 of the LINKS table
            End.
    End
    4. Return the LINKS table in the database
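The link-extraction steps of module 3 can be sketched with Python's standard `html.parser`. This is a hedged illustration rather than the author's code: `HrefCollector` and `link_pairs` are hypothetical names, the pages are passed in as strings instead of being read from disk, `ntpath` is used because the cache addresses in the text are Windows paths, and silently skipping targets that are not in DOCUMENTS is an assumption the text does not state.

```python
import ntpath
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collects the href values of all <a> tags in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def link_pairs(folder, doc_ids, html_by_address):
    """Return (DocID1, DocID2) pairs for the LINKS table.

    folder: path of the document folder (TênTM in the algorithm).
    doc_ids: cache address -> DocID, as read from DOCUMENTS.
    html_by_address: cache address -> HTML text of that document.
    """
    pairs = []
    for address, html in html_by_address.items():
        doc_id1 = doc_ids[address]
        collector = HrefCollector()
        collector.feed(html)
        for href in collector.hrefs:
            target = ntpath.join(folder, href)  # folder + name of the target document
            doc_id2 = doc_ids.get(target)
            if doc_id2 is not None:  # assumption: ignore links to unknown documents
                pairs.append((doc_id1, doc_id2))
    return pairs
```

Each returned pair corresponds to one Insert in step 3.2.2.2.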

    4. Module that creates the vector for each document and stores it in the Vector field of the DOCUMENTS table.
    Algorithm:
    1. Read DocID and CacheAdd from the DOCUMENTS table in the database
    2. While (records remain to be read)
        2.1. Use CacheAdd to read the document from the hard disk
        2.2. Set DocID_curence=DocID
        2.3. Set total_occurence=0; header_occurence=0; vector="";
        2.4. Take each keyword in the KEYWORDS table in turn for comparison
            2.4.1. While (keywords remain)
                2.4.1.1. Parse the document to obtain each word in turn: word
                2.4.1.2. If word is not yet in the KEYWORDS table, add it
                2.4.1.3. While (the document has not been read to the end)
                    - If ((word=keyword) or (word=a synonym)) and (word is inside the title tag) then total_occurence+=3 and header_occurence+=1;
                    - If ((word=keyword) or (word=a synonym)) and (word is not inside the title tag) then total_occurence++;
                End.
                2.4.1.4. total_occurence*=3; header_occurence*=3;
                2.4.1.5. Read all documents that the current document links to (outgoing). Repeat the analysis, increasing the two variables total_occurence and header_occurence
                2.4.1.6. Read all documents that link to the current document (incoming). Repeat the analysis steps as for the current document, increasing the two variables total_occurence and header_occurence
            End.
        2.5. If (total_occurence != 0) then vector += "(" + KeyWordID + "," + header_occurence + "," + total_occurence + ");"
        2.6. Update DOCUMENTS, setting Vector=vector where DocID=DocID_curence.
    3. End.
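The weighting used by module 4 (title hit 3, body hit 1, the page itself weighted 3, linked pages weighted 1) can be sketched as follows. Several points are assumptions made for this sketch only: the "title tag" is taken to be `<title>`, all text outside it is treated as body text, pages are passed in as strings, and `lookup` is a hypothetical dictionary mapping lower-cased keywords and synonyms to KeyWordID.

```python
import re
from html.parser import HTMLParser

WORD_RE = re.compile(r"[A-Za-z]+")


class KeywordCounter(HTMLParser):
    """Counts keyword hits, tracking whether the text lies inside <title>."""

    def __init__(self, lookup):
        super().__init__()
        self.lookup = lookup   # word (lower case) -> KeyWordID
        self.in_title = False
        self.counts = {}       # KeyWordID -> [header_occurence, total_occurence]

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        for word in WORD_RE.findall(data):
            keyword_id = self.lookup.get(word.lower())
            if keyword_id is None:
                continue
            entry = self.counts.setdefault(keyword_id, [0, 0])
            if self.in_title:
                entry[0] += 1  # one more title occurrence
                entry[1] += 3  # a title hit adds 3 to the total
            else:
                entry[1] += 1  # a body hit adds 1 to the total


def count_document(html, lookup):
    counter = KeywordCounter(lookup)
    counter.feed(html)
    counter.close()
    return counter.counts


def build_vector(main_html, neighbor_htmls, lookup, own_weight=3, neighbor_weight=1):
    """Combine the page's own counts (weight 3) with its neighbors' counts (weight 1)."""
    weighted = [(main_html, own_weight)] + [(h, neighbor_weight) for h in neighbor_htmls]
    vector = {}
    for html, weight in weighted:
        for keyword_id, (header, total) in count_document(html, lookup).items():
            entry = vector.setdefault(keyword_id, [0, 0])
            entry[0] += weight * header
            entry[1] += weight * total
    return ";".join(f"({k},{h},{t})" for k, (h, t) in sorted(vector.items()))
```

Here `neighbor_htmls` would hold both the outgoing and the incoming pages of the current document. Keywords that never occur are left out, matching step 2.5.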

    5. Module that performs classification.
    Input: the set of documents to be classified.
    While (documents remain to be read) do
        Read in the document to be classified
        1. Convert the document into a vector, as in the module that creates the Vector field of the DOCUMENTS table
        2. Together with the vectors of the documents in the database, apply one of the machine learning algorithms to classify it.
    End

    4.2.5. Analysis of the system's functions

    a. Main functions of the system
    b. Detailed functions
    - Database creation
    - Classification and search

    4.2.6. Evaluation of the experimental system
    a. Some example results from the experimental system
    The system runs and has produced some initial results:
    - The database system described above has been built
      + Documents are parsed to extract keywords
      + The links between hypertext documents within a hypertext collection are represented
      + Documents are encoded as vectors and stored in the database
    - A given hypertext document can be classified
    - An input word can be searched for
    b. Limitations of the system
    Because of time constraints, the system still has some limitations:
    - The keywords are not yet complete and have not been filtered
    - Documents can only be classified one at a time (to be revised if time permits)
    - Accuracy is not yet high, because accurate training data is not yet available.