Recommender System. WBIA. 黄连恩, [email protected], Peking University School of Information Engineering (北京大学信息工程学院), 11/25/2014.
WBIA Review
![Page 2: WBIA Review](https://reader036.fdocument.pub/reader036/viewer/2022062408/56814532550346895db1fa23/html5/thumbnails/2.jpg)
Bow-tie
• Strongly Connected Component (SCC): the core
• Upstream (IN): the core can't reach IN
• Downstream (OUT): OUT can't reach the core
• Disconnected: tendrils & tubes
![Page 3: WBIA Review](https://reader036.fdocument.pub/reader036/viewer/2022062408/56814532550346895db1fa23/html5/thumbnails/3.jpg)
Power-law
Nature seems to create bell curves (range around an average)
Human activity seems to create power laws (popularity skewing)
![Page 4: WBIA Review](https://reader036.fdocument.pub/reader036/viewer/2022062408/56814532550346895db1fa23/html5/thumbnails/4.jpg)
Power Law Distribution: Examples
From "Graph structure in the web" (AltaVista crawl, 1999)
![Page 5: WBIA Review](https://reader036.fdocument.pub/reader036/viewer/2022062408/56814532550346895db1fa23/html5/thumbnails/5.jpg)
Exercise: how would you store the Web graph?
Web Graph
![Page 6: WBIA Review](https://reader036.fdocument.pub/reader036/viewer/2022062408/56814532550346895db1fa23/html5/thumbnails/6.jpg)
PageRank
Why and how does it work?
![Page 7: WBIA Review](https://reader036.fdocument.pub/reader036/viewer/2022062408/56814532550346895db1fa23/html5/thumbnails/7.jpg)
Random walker model
[Figure: page V and its neighboring pages u1, u2, u3, u4, u5]
![Page 8: WBIA Review](https://reader036.fdocument.pub/reader036/viewer/2022062408/56814532550346895db1fa23/html5/thumbnails/8.jpg)
Damping Factor
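$$p = (1-\beta)\,L^T p + \frac{\beta}{N}\,\mathbf{1}_N\,p = \left[(1-\beta)\,L^T + \frac{\beta}{N}\,\mathbf{1}_N\right] p$$
β is chosen between 0.1 and 0.2 and is called the damping factor (Page & Brin 1997)
$G = (1-\beta)\,L^T + \frac{\beta}{N}\,\mathbf{1}_N$ is called the Google Matrix, where $\mathbf{1}_N$ is the N×N matrix of all ones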
![Page 9: WBIA Review](https://reader036.fdocument.pub/reader036/viewer/2022062408/56814532550346895db1fa23/html5/thumbnails/9.jpg)
[Matrix L for the 11-page example: its nonzero entries are 1, 1/2, or 1/3, plus one full line of eleven 1/11 entries]
![Page 10: WBIA Review](https://reader036.fdocument.pub/reader036/viewer/2022062408/56814532550346895db1fa23/html5/thumbnails/10.jpg)
Solving a small example
With β = 0.15: $G = 0.85\,L^T + \frac{0.15}{11}\,\mathbf{1}_N$, $P_0 = (1/11, 1/11, \ldots)^T$
$P_1 = G P_0$, …
Power iteration (50 iterations) gives $P = (0.033, 0.384, 0.343, 0.039, 0.081, 0.039, 0.016, \ldots)^T$
You can try this in MatLab
![Page 11: WBIA Review](https://reader036.fdocument.pub/reader036/viewer/2022062408/56814532550346895db1fa23/html5/thumbnails/11.jpg)
Exercise: write pseudocode for the PageRank algorithm
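One way to approach this exercise is the power iteration used in the small example. The sketch below is a minimal Python version under stated assumptions: `pagerank`, `out_links`, and `toy_graph` are hypothetical names, the four-page graph is made up for illustration, and pages without out-links are assumed to jump to every page (the slides do not say how dangling pages are handled).

```python
def pagerank(out_links, beta=0.15, iterations=50):
    """Power iteration for p = (1 - beta) * L^T p + beta / N.

    out_links maps each page to the list of pages it links to.
    Assumption: a page without out-links is treated as linking to every page.
    """
    pages = sorted(out_links)
    n = len(pages)
    p = {u: 1.0 / n for u in pages}             # P0 = (1/N, ..., 1/N)^T
    for _ in range(iterations):
        nxt = {u: beta / n for u in pages}      # random-jump share
        for u in pages:
            targets = out_links[u] or pages     # dangling page: jump anywhere
            share = (1.0 - beta) * p[u] / len(targets)
            for v in targets:
                nxt[v] += share                 # follow-a-link share
        p = nxt
    return p


# Hypothetical toy graph, not taken from the slides.
toy_graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(toy_graph)
```

Because every iteration redistributes exactly the probability mass it started with, the scores keep summing to 1, and a page with no in-links (here `D`) ends up with only the random-jump share β/N.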
![Page 12: WBIA Review](https://reader036.fdocument.pub/reader036/viewer/2022062408/56814532550346895db1fa23/html5/thumbnails/12.jpg)
HITS(Hyperlink Induced Topic Search)
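High prestige (large in-degree): high authority. Knowing many high-prestige pages (large out-degree): strong hub quality. How are these computed?
Power Iteration on:
$$h = E a = E E^T h$$
$$a = E^T h = E^T E a$$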
![Page 13: WBIA Review](https://reader036.fdocument.pub/reader036/viewer/2022062408/56814532550346895db1fa23/html5/thumbnails/13.jpg)
Authority and Hub scores
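For u ∈ V(q), two parameters are defined on each page u: a[u] and h[u], which represent its authority and its hub quality.
They are defined in terms of each other:
• the a value of a page u depends on the h values of the pages v that point to it
• the h value of a page u depends on the a values of the pages v that it points to
$$h = E a = E E^T h$$
$$a = E^T h$$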
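The mutual definition of a and h can be sketched as a short power iteration. This is a minimal illustration, assuming a graph given as a list of `(source, target)` edges and L2 normalization after each round; `hits`, `toy_edges`, and the page names are hypothetical and not taken from the slides.

```python
import math


def hits(edges, iterations=50):
    """Alternate a = E^T h and h = E a, normalizing both vectors each round."""
    nodes = sorted({u for edge in edges for u in edge})
    a = {u: 1.0 for u in nodes}
    h = {u: 1.0 for u in nodes}
    for _ in range(iterations):
        # a[v]: sum of hub scores of the pages that point to v
        a = {v: sum(h[src] for src, dst in edges if dst == v) for v in nodes}
        # h[u]: sum of authority scores of the pages that u points to
        h = {u: sum(a[dst] for src, dst in edges if src == u) for u in nodes}
        for vec in (a, h):
            norm = math.sqrt(sum(x * x for x in vec.values())) or 1.0
            for key in vec:
                vec[key] /= norm
    return a, h


# Hypothetical toy graph: p1 and p2 both link to p3 and p4; p4 links to p3.
toy_edges = [("p1", "p3"), ("p2", "p3"), ("p1", "p4"), ("p2", "p4"), ("p4", "p3")]
authority, hub = hits(toy_edges)
```

In this toy graph `p3` collects the most in-links from good hubs and so gets the top authority score, while `p1` and `p2` tie as the best hubs.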
![Page 14: WBIA Review](https://reader036.fdocument.pub/reader036/viewer/2022062408/56814532550346895db1fa23/html5/thumbnails/14.jpg)
Web Spam
• Term spamming: manipulating the text of web pages in order to appear relevant to queries
• Link spamming: creating link structures that boost PageRank or hubs-and-authorities scores
![Page 15: WBIA Review](https://reader036.fdocument.pub/reader036/viewer/2022062408/56814532550346895db1fa23/html5/thumbnails/15.jpg)
TrustRank
Expecting that good pages point to other good pages, all pages reachable from a good seed page in M or fewer steps are denoted as good
$$t = \alpha \cdot L^T \cdot t + (1-\alpha) \cdot d / |d|$$
[Figure: example graph of pages 1 to 7, each marked as a good page or a bad page]
![Page 16: WBIA Review](https://reader036.fdocument.pub/reader036/viewer/2022062408/56814532550346895db1fa23/html5/thumbnails/16.jpg)
TrustRank in Action
• Select the seed set using inverse PageRank, which orders the pages as [2, 4, 5, 1, 3, 6, 7]
• Invoke L (= 3) oracle functions
• Populate the static score distribution vector: d = [0, 1, 0, 1, 0, 0, 0]
• Normalize the distribution vector: d = [0, 1/2, 0, 1/2, 0, 0, 0]
• Calculate TrustRank scores using biased PageRank with trust dampening and trust splitting: $t = \alpha \cdot L^T \cdot t + (1-\alpha) \cdot d / |d|$
RESULTS: [0, 0.18, 0.12, 0.15, 0.13, 0.05, 0.05]
[Figure: the example graph of pages 1 to 7 annotated with these scores]
![Page 17: WBIA Review](https://reader036.fdocument.pub/reader036/viewer/2022062408/56814532550346895db1fa23/html5/thumbnails/17.jpg)
Tokenization
Friends, Romans, Countrymen, lend me your ears;
Friends | Romans | Countrymen | lend | me | your | ears
• Token: an instance of a sequence of characters that are grouped together as a useful semantic unit for processing
• Type: the class of all tokens containing the same character sequence
• Term: a type that is included in the system dictionary (normalized)
![Page 18: WBIA Review](https://reader036.fdocument.pub/reader036/viewer/2022062408/56814532550346895db1fa23/html5/thumbnails/18.jpg)
Stemming and lemmatization
• Stemming: crude heuristic process that chops off the ends of words
  Democratic → democa
• Lemmatization: use of vocabulary and morphological analysis; returns the base form of a word (lemma)
  Democratic → democracy; Sang → sing
![Page 19: WBIA Review](https://reader036.fdocument.pub/reader036/viewer/2022062408/56814532550346895db1fa23/html5/thumbnails/19.jpg)
Porter stemmer
Most common algorithm for stemming English
5 phases of word reduction
• SSES → SS: caresses → caress
• IES → I: ponies → poni
• SS → SS
• S → (empty): cats → cat
• EMENT → (empty): replacement → replac, cement → cement
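The plural rules on this slide (SSES, IES, SS, S) can be sketched as a longest-suffix-first lookup. This is only an illustration of that one rule group, not the full Porter algorithm: `porter_step_1a` is a hypothetical name, and the EMENT rule is left out because it also depends on a condition on the remaining stem (which is why "cement" stays unchanged).

```python
def porter_step_1a(word):
    """Apply the first matching plural rule, trying longer suffixes first."""
    rules = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]
    for suffix, replacement in rules:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word


examples = {w: porter_step_1a(w) for w in ["caresses", "ponies", "caress", "cats"]}
```

Trying "ss" before "s" is what protects words such as "caress" from losing their final letter.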
![Page 20: WBIA Review](https://reader036.fdocument.pub/reader036/viewer/2022062408/56814532550346895db1fa23/html5/thumbnails/20.jpg)
Bag of words model
A document can now be viewed as the collection of terms in it and their associated weight
"Mary is smarter than John" and "John is smarter than Mary" are equivalent in the bag of words model
![Page 21: WBIA Review](https://reader036.fdocument.pub/reader036/viewer/2022062408/56814532550346895db1fa23/html5/thumbnails/21.jpg)
Term frequency and weighting
A word that appears often in a document is probably very descriptive of what the document is about
Assign to each term in a document a weight that depends on the number of occurrences of that term in the document
Term frequency (tf): assign the weight to be equal to the number of occurrences of term t in document d
![Page 22: WBIA Review](https://reader036.fdocument.pub/reader036/viewer/2022062408/56814532550346895db1fa23/html5/thumbnails/22.jpg)
Inverse document frequency
N number of documents in the collection
• N = 1000; df[the] = 1000; idf[the] = 0
• N = 1000; df[some] = 100; idf[some] = 2.3
• N = 1000; df[car] = 10; idf[car] = 4.6
• N = 1000; df[merger] = 1; idf[merger] = 6.9
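The four values above are consistent with idf computed as a natural logarithm of N/df. A minimal sketch that reproduces them, assuming that reading (`idf` and `idf_values` are illustrative names):

```python
import math


def idf(n_docs, df):
    """Inverse document frequency; natural log, matching the numbers above."""
    return math.log(n_docs / df)


N = 1000
df = {"the": 1000, "some": 100, "car": 10, "merger": 1}
idf_values = {term: round(idf(N, count), 1) for term, count in df.items()}
```

A different log base would rescale every idf by the same constant, so the relative ordering of terms would not change.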
![Page 23: WBIA Review](https://reader036.fdocument.pub/reader036/viewer/2022062408/56814532550346895db1fa23/html5/thumbnails/23.jpg)
tf.idf weighting
• Highest when t occurs many times within a small number of documents, thus lending high discriminating power to those documents
• Lower when the term occurs fewer times in a document, or occurs in many documents, thus offering a less pronounced relevance signal
• Lowest when the term occurs in virtually all documents
![Page 24: WBIA Review](https://reader036.fdocument.pub/reader036/viewer/2022062408/56814532550346895db1fa23/html5/thumbnails/24.jpg)
tf x idf term weights
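tf x idf weight formula:
• term frequency (tf), or wf: some measure of term density in a doc
• inverse document frequency (idf): expresses the importance (rarity) of a term. The raw value is $idf_t = 1/df_t$; as before, it is usually smoothed:
$$idf_t = \log\left(\frac{N}{df_t}\right)$$
Compute a tf.idf weight for every term in the document:
$$w_{t,d} = tf_{t,d} \times \log(N/df_t)$$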
![Page 25: WBIA Review](https://reader036.fdocument.pub/reader036/viewer/2022062408/56814532550346895db1fa23/html5/thumbnails/25.jpg)
Document vector space representation
Each document is viewed as a vector with one component corresponding to each term in the dictionary
The value of each component is the tf-idf score for that word
For dictionary terms that do not occur in the document, the weights are 0
![Page 26: WBIA Review](https://reader036.fdocument.pub/reader036/viewer/2022062408/56814532550346895db1fa23/html5/thumbnails/26.jpg)
Documents as vectors
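Each document j can be viewed as a vector: every term is one dimension, and its value is the tf.idf weight
So we have a vector space
• terms are axes
• docs live in this space
• high-dimensional: even with stemming, it may have 20,000+ dimensions

| Term | D1 | D2 | D3 | D4 | D5 | D6 | … |
|---|---|---|---|---|---|---|---|
| 中国 (China) | 4.1 | 0.0 | 3.7 | 5.9 | 3.1 | 0.0 | |
| 文化 (culture) | 4.5 | 4.5 | 0 | 0 | 11.6 | 0 | |
| 日本 (Japan) | 0 | 3.5 | 2.9 | 0 | 2.1 | 3.9 | |
| 留学生 (overseas students) | 0 | 3.1 | 5.1 | 12.8 | 0 | 0 | |
| 教育 (education) | 2.9 | 0 | 0 | 2.2 | 0 | 0 | |
| 北京 (Beijing) | 7.1 | 0 | 0 | 0 | 4.4 | 3.8 | |
| … | | | | | | | |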
![Page 27: WBIA Review](https://reader036.fdocument.pub/reader036/viewer/2022062408/56814532550346895db1fa23/html5/thumbnails/27.jpg)
Cosine similarity
![Page 28: WBIA Review](https://reader036.fdocument.pub/reader036/viewer/2022062408/56814532550346895db1fa23/html5/thumbnails/28.jpg)
Cosine similarity
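The "closeness" of vectors d1 and d2 can be measured by the size of the angle between them
Specifically, the cosine of the angle θ can be used to compute vector similarity
Vectors are normalized by length (normalization):
$$|d_j| = \sqrt{\sum_{i=1}^{M} w_{i,j}^2}$$
[Figure: document vectors d1 and d2 in the space of term axes t1, t2, t3, separated by the angle θ]
$$sim(d_j, d_k) = \frac{d_j \cdot d_k}{|d_j|\,|d_k|} = \frac{\sum_{i=1}^{M} w_{i,j}\,w_{i,k}}{\sqrt{\sum_{i=1}^{M} w_{i,j}^2}\,\sqrt{\sum_{i=1}^{M} w_{i,k}^2}}$$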
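As a worked illustration of the cosine formula, the sketch below compares columns D1, D2, and D5 of the "Documents as vectors" table. It assumes that only the six listed term rows are used (the table itself is truncated), so the values are illustrative rather than the true similarities; `cosine` is a hypothetical helper name.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# tf.idf weights of the six listed terms for three of the documents.
d1 = [4.1, 4.5, 0.0, 0.0, 2.9, 7.1]
d2 = [0.0, 4.5, 3.5, 3.1, 0.0, 0.0]
d5 = [3.1, 11.6, 2.1, 0.0, 0.0, 4.4]

sim_d1_d2 = cosine(d1, d2)
sim_d1_d5 = cosine(d1, d5)
```

On these six dimensions D1 is closer to D5 than to D2, because D1 and D5 share weight on three terms while D1 and D2 overlap on only one.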
![Page 29: WBIA Review](https://reader036.fdocument.pub/reader036/viewer/2022062408/56814532550346895db1fa23/html5/thumbnails/29.jpg)
Jaccard coefficient
Resemblance
$$r(A,B) = \frac{|S(A) \cap S(B)|}{|S(A) \cup S(B)|}$$
• Symmetric, reflexive, not transitive, not a metric
• Note r(A,A) = 1
• But r(A,B) = 1 does not mean A and B are identical!
• Forgives any number of occurrences and any permutations of the terms.
Resemblance distance
$$d(A,B) = 1 - r(A,B)$$
![Page 30: WBIA Review](https://reader036.fdocument.pub/reader036/viewer/2022062408/56814532550346895db1fa23/html5/thumbnails/30.jpg)
Shingling
A contiguous subsequence contained in D is called a shingle.
Given a document D we define its w-shingling S(D, w) as the set of all unique shingles of size w contained in D.
• D = (a,rose,is,a,rose,is,a,rose)
• S(D,4) = {(a,rose,is,a),(rose,is,a,rose),(is,a,rose,is)}
• "a rose is a rose is a rose" => a_rose_is_a, rose_is_a_rose, is_a_rose_is
Why shingling? S(D,4) vs. S(D,1)
What is a good value for w?
![Page 31: WBIA Review](https://reader036.fdocument.pub/reader036/viewer/2022062408/56814532550346895db1fa23/html5/thumbnails/31.jpg)
Shingling & Jaccard Coefficient
Doc1 = "to be or not to be, that is a question!"
Doc2 = "to be a question or not"
Let the window size be w = 2. Resemblance r(A,B) = ?
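One way to work this exercise is to build the two shingle sets and apply the resemblance formula directly. The sketch assumes a particular tokenization (lowercase, punctuation dropped, word-level shingles); a different tokenization could change the answer. `shingles` and `resemblance` are hypothetical helper names.

```python
import re


def shingles(text, w):
    """Set of unique word-level shingles of size w."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


def resemblance(a, b, w):
    """Jaccard coefficient of the two w-shingle sets."""
    sa, sb = shingles(a, w), shingles(b, w)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


doc1 = "to be or not to be, that is a question!"
doc2 = "to be a question or not"
r = resemblance(doc1, doc2, 2)
```

Under this tokenization Doc1 has 8 distinct 2-shingles, Doc2 has 5, and they share 3, so r = 3 / 10.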
![Page 32: WBIA Review](https://reader036.fdocument.pub/reader036/viewer/2022062408/56814532550346895db1fa23/html5/thumbnails/32.jpg)
Random permutation
• Let Ω be a set (e.g. 1..N)
• Pick a permutation π: Ω → Ω uniformly at random
π = {3,7,1,4,6,2,5}, A = {2,3,6}, MIN(π(A)) = ?
![Page 33: WBIA Review](https://reader036.fdocument.pub/reader036/viewer/2022062408/56814532550346895db1fa23/html5/thumbnails/33.jpg)
Inverted index
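For each term T, store the list of documents (IDs) that contain T

| Dictionary | Postings |
|---|---|
| 中国 (China) | 2 4 8 16 32 64 128 |
| 文化 (culture) | 1 2 3 5 8 13 21 34 |
| 留学生 (overseas students) | 13 16 |

Sorted by docID (more later on why).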
![Page 34: WBIA Review](https://reader036.fdocument.pub/reader036/viewer/2022062408/56814532550346895db1fa23/html5/thumbnails/34.jpg)
Inverted index with counts
• supports better ranking algorithms
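A minimal sketch of an inverted index with counts, plus the linear merge that docID-sorted postings make possible. The three toy documents and the names `build_index` and `intersect` are assumptions for illustration, not part of the slides.

```python
from collections import Counter, defaultdict


def build_index(docs):
    """Map each term to a docID-sorted postings list of (docID, count)."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        for term, count in sorted(Counter(docs[doc_id].split()).items()):
            index[term].append((doc_id, count))
    return dict(index)


def intersect(p1, p2):
    """Walk two docID-sorted postings lists in step; return common docIDs."""
    i = j = 0
    result = []
    while i < len(p1) and j < len(p2):
        if p1[i][0] == p2[j][0]:
            result.append(p1[i][0])
            i += 1
            j += 1
        elif p1[i][0] < p2[j][0]:
            i += 1
        else:
            j += 1
    return result


# Hypothetical toy collection.
toy_docs = {1: "new home sales", 2: "home sales rise in july", 3: "new home home prices"}
index = build_index(toy_docs)
both = intersect(index["new"], index["home"])
```

Keeping postings sorted by docID is what lets `intersect` run in time proportional to the combined list lengths, and the stored counts give a ranking function the term frequencies it needs.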
![Page 35: WBIA Review](https://reader036.fdocument.pub/reader036/viewer/2022062408/56814532550346895db1fa23/html5/thumbnails/35.jpg)
VS-based Retrieval
Columns headed ‘n’ are acronyms for weight schemes.
Why is the base of the log in idf immaterial?
Sec. 6.4
![Page 36: WBIA Review](https://reader036.fdocument.pub/reader036/viewer/2022062408/56814532550346895db1fa23/html5/thumbnails/36.jpg)
tf-idf example: lnc.ltc
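Document: car insurance auto insurance
Query: best car insurance

| Term | Query tf-raw | Query tf-wt | df | idf | Query wt | Query n'lize | Doc tf-raw | Doc tf-wt | Doc wt | Doc n'lize | Prod |
|---|---|---|---|---|---|---|---|---|---|---|---|
| auto | 0 | 0 | 5000 | 2.3 | 0 | 0 | 1 | 1 | 1 | 0.52 | 0 |
| best | 1 | 1 | 50000 | 1.3 | 1.3 | 0.34 | 0 | 0 | 0 | 0 | 0 |
| car | 1 | 1 | 10000 | 2.0 | 2.0 | 0.52 | 1 | 1 | 1 | 0.52 | 0.27 |
| insurance | 1 | 1 | 1000 | 3.0 | 3.0 | 0.78 | 2 | 1.3 | 1.3 | 0.68 | 0.53 |

Exercise: what is N, the number of docs?
Doc length = $\sqrt{1^2 + 0^2 + 1^2 + 1.3^2} \approx 1.92$
Score = 0 + 0 + 0.27 + 0.53 = 0.8
Sec. 6.4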
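The table can be recomputed with a few lines. This sketch assumes base-10 logarithms, log-weighted tf of the form 1 + log10(tf), and N = 1,000,000 documents, which is the collection size that reproduces the idf column; all variable and function names are illustrative.

```python
import math


def log_tf(tf):
    """Logarithmic tf weight: 1 + log10(tf), or 0 when the term is absent."""
    return 1 + math.log10(tf) if tf > 0 else 0.0


def normalize(weights):
    """Cosine (length) normalization of a term -> weight mapping."""
    length = math.sqrt(sum(w * w for w in weights.values()))
    return {t: (w / length if length else 0.0) for t, w in weights.items()}


N = 1_000_000  # assumption: consistent with the idf values in the table
df = {"auto": 5000, "best": 50000, "car": 10000, "insurance": 1000}
query_tf = {"auto": 0, "best": 1, "car": 1, "insurance": 1}
doc_tf = {"auto": 1, "best": 0, "car": 1, "insurance": 2}

# ltc for the query: log tf, idf, cosine normalization
query = normalize({t: log_tf(query_tf[t]) * math.log10(N / df[t]) for t in df})
# lnc for the document: log tf, no idf, cosine normalization
doc = normalize({t: log_tf(doc_tf[t]) for t in df})
score = sum(query[t] * doc[t] for t in df)
```

Small differences from the table in the third decimal place come from the table rounding intermediate values.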
![Page 37: WBIA Review](https://reader036.fdocument.pub/reader036/viewer/2022062408/56814532550346895db1fa23/html5/thumbnails/37.jpg)
Singular Value Decomposition
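Apply singular value decomposition (SVD) to the term-document matrix:
$$W_{t \times d} = T_{t \times r}\,\Sigma_{r \times r}\,D^T_{r \times d}$$
• r: the rank of the matrix
• Σ: the diagonal matrix of singular values (in descending order)
• D, T: matrices whose columns are orthogonal and of unit length ($T^T T = I$, $D^T D = I$)
• Σ comes from the eigenvalues of $W W^T$ (the singular values are their square roots); D and T hold the eigenvectors of $W^T W$ and $W W^T$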
![Page 38: WBIA Review](https://reader036.fdocument.pub/reader036/viewer/2022062408/56814532550346895db1fa23/html5/thumbnails/38.jpg)
Latent Semantic Model
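LSI retrieval: the query is mapped (projected) into the LSI $D^T$ space, which is called "folding in". With $W = T \Sigma D^T$, if q is projected to q' in $D^T$, then
$$q = T\,\Sigma\,q'^T$$
and hence
$$q' = (\Sigma^{-1} T^{-1} q)^T = q^T\,T\,\Sigma^{-1}$$
Folding in therefore amounts to multiplying the document/query vector by $T\,\Sigma^{-1}$
The document vectors of the collection are given by $D^T$
The similarity between the two is computed with the dot product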
![Page 39: WBIA Review](https://reader036.fdocument.pub/reader036/viewer/2022062408/56814532550346895db1fa23/html5/thumbnails/39.jpg)
Stochastic Language Models
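Statistical models used to generate text
• Probability distribution over strings in a given language
For a model M and a four-word string $w_1 w_2 w_3 w_4$:
$$P(w_1 w_2 w_3 w_4 \mid M) = P(w_1 \mid M)\,P(w_2 \mid M, w_1)\,P(w_3 \mid M, w_1 w_2)\,P(w_4 \mid M, w_1 w_2 w_3)$$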
![Page 40: WBIA Review](https://reader036.fdocument.pub/reader036/viewer/2022062408/56814532550346895db1fa23/html5/thumbnails/40.jpg)
Unigram model (likely topics):
$$P(w) = \frac{count(w)}{\#tokens}$$
Bigram model (grammaticality):
$$P(w_i \mid w_{i-1}) = \frac{count(w_{i-1}\,w_i)}{count(w_{i-1})}$$
![Page 41: WBIA Review](https://reader036.fdocument.pub/reader036/viewer/2022062408/56814532550346895db1fa23/html5/thumbnails/41.jpg)
Bigram Model
Approximate P(unicorn | the mythical) by P(unicorn | mythical)
$$P(w_n \mid w_1^{n-1}) \approx P(w_n \mid w_{n-1})$$
Markov assumption: the probability of a word depends only on a limited history
Generalization: the probability of a word depends only on the n previous words
• trigrams, 4-grams, …
• the higher n is, the more data needed to train
• backoff models…
![Page 42: WBIA Review](https://reader036.fdocument.pub/reader036/viewer/2022062408/56814532550346895db1fa23/html5/thumbnails/42.jpg)
A Simple Example: bigram model
P(I want to eat Chinese food) = P(I | <start>) P(want | I) P(to | want) P(eat | to) P(Chinese | eat) P(food | Chinese) P(<end> | food)
![Page 43: WBIA Review](https://reader036.fdocument.pub/reader036/viewer/2022062408/56814532550346895db1fa23/html5/thumbnails/43.jpg)
LM-based Retrieval
Ranking formula:
$$p(Q, d) = p(d)\,p(Q \mid d) \approx p(d)\,p(Q \mid M_d)$$
Using maximum likelihood estimation:
$$\hat{p}(Q \mid M_d) = \prod_{t \in Q} \hat{p}_{mle}(t \mid M_d) = \prod_{t \in Q} \frac{tf_{(t,d)}}{dl_d}$$
Unigram assumption: given a particular language model, the query terms occur independently
• $M_d$: language model of document d
• $tf_{(t,d)}$: raw tf of term t in document d
• $dl_d$: total number of tokens in document d
![Page 44: WBIA Review](https://reader036.fdocument.pub/reader036/viewer/2022062408/56814532550346895db1fa23/html5/thumbnails/44.jpg)
Laplace smoothing
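• Also called add-one smoothing
• Just add one to all the counts!
• Very simple
MLE estimate: $P_{MLE}(w) = \frac{count(w)}{\#tokens}$
Laplace estimate: $P_{Laplace}(w) = \frac{count(w) + 1}{\#tokens + V}$, where V is the vocabulary size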
![Page 45: WBIA Review](https://reader036.fdocument.pub/reader036/viewer/2022062408/56814532550346895db1fa23/html5/thumbnails/45.jpg)
Mixture model smoothing
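$P(w \mid d) = \lambda\,P_{mle}(w \mid M_d) + (1 - \lambda)\,P_{mle}(w \mid M_c)$
The parameter λ is very important:
• A high value of λ makes the query "conjunctive-like", which suits short queries
• A low value of λ is better suited to long queries
• Tune λ to optimize performance, for example by making it depend on document length (cf. Dirichlet prior or Witten-Bell smoothing)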
![Page 46: WBIA Review](https://reader036.fdocument.pub/reader036/viewer/2022062408/56814532550346895db1fa23/html5/thumbnails/46.jpg)
Example
Document collection (2 documents)
• d1: Xerox reports a profit but revenue is down
• d2: Lucent narrows quarter loss but revenue decreases further
Model: MLE unigram from documents; λ = ½
Query: revenue down
P(Q|d1) = [(1/8 + 2/16)/2] x [(1/8 + 1/16)/2] = 1/8 x 3/32 = 3/256
P(Q|d2) = [(1/8 + 2/16)/2] x [(0 + 1/16)/2] = 1/8 x 1/32 = 1/256
Ranking: d1 > d2
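The arithmetic in this example can be checked with exact fractions. The sketch below is a minimal mixture-model scorer, assuming whitespace tokenization and lowercasing; `query_likelihood` and the other names are illustrative.

```python
from collections import Counter
from fractions import Fraction

docs = {
    "d1": "Xerox reports a profit but revenue is down".lower().split(),
    "d2": "Lucent narrows quarter loss but revenue decreases further".lower().split(),
}
# Collection model M_c: term counts over all documents together.
collection = Counter(t for tokens in docs.values() for t in tokens)
collection_len = sum(collection.values())


def query_likelihood(query, tokens, lam=Fraction(1, 2)):
    """P(Q|d) = prod over query terms of lam*P(t|M_d) + (1-lam)*P(t|M_c)."""
    tf = Counter(tokens)
    p = Fraction(1)
    for t in query.lower().split():
        p_doc = Fraction(tf[t], len(tokens))
        p_coll = Fraction(collection[t], collection_len)
        p *= lam * p_doc + (1 - lam) * p_coll
    return p


scores = {name: query_likelihood("revenue down", tokens) for name, tokens in docs.items()}
```

The collection term keeps P(Q|d2) above zero even though "down" never occurs in d2, which is the point of the smoothing.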
![Page 47: WBIA Review](https://reader036.fdocument.pub/reader036/viewer/2022062408/56814532550346895db1fa23/html5/thumbnails/47.jpg)
What is relative entropy?
KL divergence/relative entropy
![Page 48: WBIA Review](https://reader036.fdocument.pub/reader036/viewer/2022062408/56814532550346895db1fa23/html5/thumbnails/48.jpg)
Relative entropy between the two distributions
Cost in bits of coding using Q when true distribution is P
$$D_{KL}(P \| Q) = -\sum_i P(i)\log Q(i) - \left(-\sum_i P(i)\log P(i)\right)$$
$$H(P(x)) = -\sum_i P(i)\log P(i)$$
![Page 49: WBIA Review](https://reader036.fdocument.pub/reader036/viewer/2022062408/56814532550346895db1fa23/html5/thumbnails/49.jpg)
$$D_{KL}(P \| Q) = \sum_i P(i)\log\left(\frac{P(i)}{Q(i)}\right)$$
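A small numeric check of the two equivalent forms of the KL divergence, using base-2 logarithms so the result is in bits. The distributions `p` and `q` are made-up examples, and the function names are illustrative.

```python
import math


def entropy(p):
    """H(P) = -sum_i P(i) log2 P(i)."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)


def cross_entropy(p, q):
    """-sum_i P(i) log2 Q(i); assumes q[i] > 0 wherever p[i] > 0."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)


def kl_divergence(p, q):
    """D_KL(P || Q) = sum_i P(i) log2(P(i) / Q(i))."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical distributions over three outcomes.
p = [0.5, 0.25, 0.25]
q = [0.25, 0.25, 0.5]
kl = kl_divergence(p, q)
```

Here the cross-entropy is 1.75 bits and the entropy of P is 1.5 bits, so coding with Q instead of P costs 0.25 extra bits per symbol. The divergence is not symmetric in general and is zero when the two distributions are equal.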
![Page 50: WBIA Review](https://reader036.fdocument.pub/reader036/viewer/2022062408/56814532550346895db1fa23/html5/thumbnails/50.jpg)
Precision and Recall
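Precision: the fraction of retrieved documents that are relevant = P(relevant|retrieved)
Recall: the fraction of relevant documents that are retrieved = P(retrieved|relevant)
Precision P = tp/(tp + fp)
Recall R = tp/(tp + fn)

| | Relevant | Not Relevant |
|---|---|---|
| Retrieved | tp | fp |
| Not Retrieved | fn | tn |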
![Page 51: WBIA Review](https://reader036.fdocument.pub/reader036/viewer/2022062408/56814532550346895db1fa23/html5/thumbnails/51.jpg)
Accuracy
Given a query, the search engine classifies each document as "Relevant" or "Irrelevant".
Accuracy of an engine: the fraction of these classifications that are correct.
Accuracy = (tp + tn)/(tp + fp + tn + fn)
Is this a very useful evaluation measure in IR?

| | Relevant | Not Relevant |
|---|---|---|
| Retrieved | tp | fp |
| Not Retrieved | fn | tn |
![Page 52: WBIA Review](https://reader036.fdocument.pub/reader036/viewer/2022062408/56814532550346895db1fa23/html5/thumbnails/52.jpg)
A combined measure: F
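F measure: a combined measure of P and R (weighted harmonic mean):
$$F = \frac{1}{\alpha\frac{1}{P} + (1-\alpha)\frac{1}{R}} = \frac{(\beta^2+1)\,P R}{\beta^2 P + R}$$
The balanced F1 measure is usually used (β = 1 or α = ½)
The harmonic mean is a conservative average; it heavily penalizes low values of P or R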
![Page 53: WBIA Review](https://reader036.fdocument.pub/reader036/viewer/2022062408/56814532550346895db1fa23/html5/thumbnails/53.jpg)
MAP
Averaging across multiple queries:
• Micro-average: each relevant document is one data point in the average
• Macro-average: each query is one data point in the average
Average of many queries' average precision values
• Called mean average precision (MAP)
• "Average average precision" sounds weird
Most common
![Page 54: WBIA Review](https://reader036.fdocument.pub/reader036/viewer/2022062408/56814532550346895db1fa23/html5/thumbnails/54.jpg)
Averaging across queries
![Page 55: WBIA Review](https://reader036.fdocument.pub/reader036/viewer/2022062408/56814532550346895db1fa23/html5/thumbnails/55.jpg)
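Exercise 8-9 [**] In a collection of 10,000 documents, a query has 8 relevant documents in total. The relevance of the top 20 ranked results that a system returned for this query is given below (R = relevant, N = nonrelevant); 6 of them are relevant:
RRNNN NNNRN RNNNR NNNNR
a. What is the precision of the top 20 documents?
b. What is the F1 of the top 20 documents?
c. What is the interpolated precision at the 25% recall level?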
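The three quantities in this exercise can be computed directly from the ranking string. This is a sketch under the usual definitions (interpolated precision at recall level r is the highest precision observed at any recall ≥ r); the function names are illustrative.

```python
def precision_recall_points(ranking, total_relevant):
    """(recall, precision) measured after each rank position."""
    points = []
    hits = 0
    for k, label in enumerate(ranking, start=1):
        if label == "R":
            hits += 1
        points.append((hits / total_relevant, hits / k))
    return points


def interpolated_precision(points, recall_level):
    """Highest precision at any recall greater than or equal to recall_level."""
    return max((p for r, p in points if r >= recall_level), default=0.0)


ranking = "RRNNN NNNRN RNNNR NNNNR".replace(" ", "")
points = precision_recall_points(ranking, total_relevant=8)
recall, precision = points[-1]                      # values at rank 20
f1 = 2 * precision * recall / (precision + recall)
p_at_25 = interpolated_precision(points, 0.25)
```

At rank 20 this gives precision 6/20 and recall 6/8; recall first reaches 25% at rank 2, where both retrieved documents are relevant.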
![Page 56: WBIA Review](https://reader036.fdocument.pub/reader036/viewer/2022062408/56814532550346895db1fa23/html5/thumbnails/56.jpg)
KNN
[Figure: training documents from the classes Government, Science, and Arts, plus a test document; what is P(science | test document)?]
Sec. 14.3
![Page 57: WBIA Review](https://reader036.fdocument.pub/reader036/viewer/2022062408/56814532550346895db1fa23/html5/thumbnails/57.jpg)
Naïve Bayes
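$$\begin{aligned}
c_{MAP} &= \arg\max_{c_j \in C} P(c_j \mid x_1, x_2, \ldots, x_n) && \text{(maximum a posteriori hypothesis)} \\
&= \arg\max_{c_j \in C} P(x_1, x_2, \ldots, x_n \mid c_j)\,P(c_j) && \text{(Bayes rule)} \\
&= \arg\max_{c_j \in C} \hat{P}(c_j) \prod_i \hat{P}(x_i \mid c_j) && \text{(conditional independence assumption)}
\end{aligned}$$
Add-one smoothed maximum likelihood estimates:
$$\hat{P}(c_j) = \frac{N(C = c_j)}{N}$$
$$\hat{P}(x_i \mid c_j) = \frac{N(X_i = x_i, C = c_j) + 1}{N(C = c_j) + k}$$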
![Page 58: WBIA Review](https://reader036.fdocument.pub/reader036/viewer/2022062408/56814532550346895db1fa23/html5/thumbnails/58.jpg)
Parameter estimation
Binomial model: $\hat{P}(X_w = t \mid c_j)$ = fraction of documents of topic $c_j$ in which word w appears
Multinomial model: $\hat{P}(X_i = w \mid c_j)$ = fraction of times in which word w appears across all positions in the documents of topic $c_j$
![Page 59: WBIA Review](https://reader036.fdocument.pub/reader036/viewer/2022062408/56814532550346895db1fa23/html5/thumbnails/59.jpg)
NB Example
c(5)=?
![Page 60: WBIA Review](https://reader036.fdocument.pub/reader036/viewer/2022062408/56814532550346895db1fa23/html5/thumbnails/60.jpg)
![Page 61: WBIA Review](https://reader036.fdocument.pub/reader036/viewer/2022062408/56814532550346895db1fa23/html5/thumbnails/61.jpg)
Multinomial NB Classifier
Feature likelihood estimate
Posterior
Result: c(5) = China
![Page 62: WBIA Review](https://reader036.fdocument.pub/reader036/viewer/2022062408/56814532550346895db1fa23/html5/thumbnails/62.jpg)
NB Example
c(5)=?
![Page 63: WBIA Review](https://reader036.fdocument.pub/reader036/viewer/2022062408/56814532550346895db1fa23/html5/thumbnails/63.jpg)
Bernoulli NB Classifier
Feature likelihood estimate
Posterior
Result: c(5) ≠ China
![Page 64: WBIA Review](https://reader036.fdocument.pub/reader036/viewer/2022062408/56814532550346895db1fa23/html5/thumbnails/64.jpg)
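Example problem: your task is to classify words as English or non-English. The words are generated from the following distribution:
(i) Compute the parameters of a multinomial NB classifier that uses the letters b, n, o, u, and z as features. Use smoothing when computing the parameters: zero probabilities are smoothed to 0.01, and nonzero probabilities are left unchanged.
(ii) How does this classifier classify the word zoo?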
![Page 65: WBIA Review](https://reader036.fdocument.pub/reader036/viewer/2022062408/56814532550346895db1fa23/html5/thumbnails/65.jpg)
Support Vector Machine (SVM)
• SVMs maximize the margin around the separating hyperplane.
• A.k.a. large margin classifiers
• The decision function is fully specified by a subset of training samples, the support vectors.
• Solving SVMs is a quadratic programming problem
• Seen by many as the most successful current text classification method*
*but other discriminative methods often perform very similarly
[Figure: a separating hyperplane and its support vectors, contrasting the separator that maximizes the margin with one that has a narrower margin]
![Page 66: WBIA Review](https://reader036.fdocument.pub/reader036/viewer/2022062408/56814532550346895db1fa23/html5/thumbnails/66.jpg)
χ² statistic (CHI)
The null hypothesis: Term(jaguar) is independent of Class(auto)
Then, what values are expected in this confusion matrix?
Observed counts $f_o$:

| | Term = jaguar | Term ≠ jaguar |
|---|---|---|
| Class = auto | 2 | 500 |
| Class ≠ auto | 3 | 9500 |
![Page 67: WBIA Review](https://reader036.fdocument.pub/reader036/viewer/2022062408/56814532550346895db1fa23/html5/thumbnails/67.jpg)
χ² statistic (CHI)
χ² is interested in $(f_o - f_e)^2/f_e$ summed over all table entries
$$\chi^2(j,a) = \sum \frac{(O-E)^2}{E} = \frac{(2-.25)^2}{.25} + \frac{(3-4.75)^2}{4.75} + \frac{(500-502)^2}{502} + \frac{(9500-9498)^2}{9498} = 12.9 \quad (p < .001)$$
The null hypothesis is rejected with confidence .999, since 12.9 > 10.83 (the value for .999 confidence).
Observed counts $f_o$, with expected counts $f_e$ in parentheses:

| | Term = jaguar | Term ≠ jaguar |
|---|---|---|
| Class = auto | 2 (0.25) | 500 (502) |
| Class ≠ auto | 3 (4.75) | 9500 (9498) |
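The expected counts follow from the row and column totals under the independence hypothesis. A minimal sketch that derives them and sums the statistic (`chi_square_2x2` is an illustrative name; the observed counts are the ones on the slide):

```python
def chi_square_2x2(observed):
    """Chi-square statistic of a 2x2 table given as two rows of counts."""
    n = sum(sum(row) for row in observed)
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            # Expected count if term and class were independent.
            e = row_totals[i] * col_totals[j] / n
            chi2 += (o - e) ** 2 / e
    return chi2


# Rows: Term = jaguar, Term != jaguar. Columns: Class = auto, Class != auto.
observed = [[2, 3], [500, 9500]]
chi2 = chi_square_2x2(observed)
```

With unrounded expected counts the statistic comes out slightly below the slide's 12.9, which uses expected counts rounded to 0.25, 4.75, 502, and 9498; either way it exceeds the 10.83 threshold.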
![Page 68: WBIA Review](https://reader036.fdocument.pub/reader036/viewer/2022062408/56814532550346895db1fa23/html5/thumbnails/68.jpg)
K-Means
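Assume documents are real-valued vectors.
Clusters are based on the centroids (aka the center of gravity or mean) of the points in a cluster ω:
$$\mu(\omega) = \frac{1}{|\omega|}\sum_{x \in \omega} x$$
Instances are assigned to clusters according to their distance to the cluster centroids: each instance goes to the nearest centroid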
![Page 69: WBIA Review](https://reader036.fdocument.pub/reader036/viewer/2022062408/56814532550346895db1fa23/html5/thumbnails/69.jpg)
K Means Example(K=2)
Pick seeds → Reassign clusters → Compute centroids → Reassign clusters → Compute centroids → Reassign clusters → Converged!
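The loop in this example (pick seeds, reassign, recompute centroids, stop when assignments no longer change) can be sketched in plain Python. The six 2-D points and the choice of seeds are made up for illustration, squared Euclidean distance is assumed, and `kmeans` is a hypothetical name.

```python
def kmeans(points, seeds, max_iter=100):
    """Alternate nearest-centroid assignment and centroid recomputation."""
    centroids = [tuple(s) for s in seeds]
    assignment = None
    for _ in range(max_iter):
        # Reassign: each point goes to its nearest centroid.
        new_assignment = [
            min(range(len(centroids)),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])))
            for p in points
        ]
        if new_assignment == assignment:
            break                               # converged: nothing moved
        assignment = new_assignment
        # Recompute: each centroid becomes the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignment


# Hypothetical data: two visually separate groups, with both seeds in one group.
toy_points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels = kmeans(toy_points, seeds=[toy_points[0], toy_points[1]])
```

Even though both seeds start in the same group, the second centroid is pulled toward the far points after the first update, and the assignments settle after two reassignments.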
![Page 70: WBIA Review](https://reader036.fdocument.pub/reader036/viewer/2022062408/56814532550346895db1fa23/html5/thumbnails/70.jpg)
Hierarchical Agglomerative Clustering (HAC)
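Assume a similarity function that determines the similarity of two instances.
Greedy algorithm:
• Start with each instance as its own cluster
• Pick the two most similar clusters and merge them into a new cluster
• Continue until only one cluster remains
The merge history forms a binary tree, or hierarchy.
Dendrogram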
![Page 71: WBIA Review](https://reader036.fdocument.pub/reader036/viewer/2022062408/56814532550346895db1fa23/html5/thumbnails/71.jpg)
Purity
[Figure: three clusters labeled Cluster I, Cluster II, Cluster III]
Cluster I: Purity = 1/6 * (max(5, 1, 0)) = 5/6
Cluster II: Purity = 1/6 * (max(1, 4, 1)) = 4/6
Cluster III: Purity = 1/5 * (max(2, 0, 3)) = 3/5
Total: Purity = 1/17 * (5+4+3) = 12/17
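The total purity can be reproduced from the per-cluster class counts that appear inside the max(...) terms. A minimal sketch, with `clusters` and `purity` as illustrative names:

```python
from fractions import Fraction

# Class counts inside each cluster, as used in the max(...) terms.
clusters = {"I": [5, 1, 0], "II": [1, 4, 1], "III": [2, 0, 3]}


def purity(clusters):
    """Sum of each cluster's majority-class count, divided by the total count."""
    total = sum(sum(counts) for counts in clusters.values())
    majority = sum(max(counts) for counts in clusters.values())
    return Fraction(majority, total)


total_purity = purity(clusters)
```

Purity rewards clusters dominated by a single class; it reaches 1 trivially when every point is its own cluster, so it is usually read alongside other measures.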
![Page 72: WBIA Review](https://reader036.fdocument.pub/reader036/viewer/2022062408/56814532550346895db1fa23/html5/thumbnails/72.jpg)
Rand Index
View it as a series of decisions, one for each of the N(N − 1)/2 pairs of documents in the collection.
true positive (TP) decision assigns two similar documents to the same cluster
true negative (TN) decision assigns two dissimilar documents to different clusters.
false positive (FP) decision assigns two dissimilar documents to the same cluster.
false negative (FN) decision assigns two similar documents to different clusters.
![Page 73: WBIA Review](https://reader036.fdocument.pub/reader036/viewer/2022062408/56814532550346895db1fa23/html5/thumbnails/73.jpg)
Rand Index
Number of point pairs:

| | Same cluster in clustering | Different clusters in clustering |
|---|---|---|
| Same class in ground truth | TP | FN |
| Different classes in ground truth | FP | TN |
Rand index Example
Cluster I Cluster II Cluster III
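The pair counts can be derived from a table of class counts per cluster, with the Rand index taken as RI = (TP + TN) / (TP + FP + FN + TN). The sketch below assumes this example uses the same three clusters as the purity slide, with class counts (5, 1, 0), (1, 4, 1), and (2, 0, 3); that link is an assumption, since the figure itself is not in the transcript. `rand_index` is an illustrative name.

```python
from math import comb


def rand_index(clusters):
    """clusters: one list of class counts per cluster, same class order in each."""
    n = sum(sum(c) for c in clusters)
    total_pairs = comb(n, 2)
    same_cluster = sum(comb(sum(c), 2) for c in clusters)           # TP + FP
    same_class = sum(comb(sum(col), 2) for col in zip(*clusters))   # TP + FN
    tp = sum(comb(k, 2) for c in clusters for k in c)
    fp = same_cluster - tp
    fn = same_class - tp
    tn = total_pairs - tp - fp - fn
    return tp, fp, fn, tn, (tp + tn) / total_pairs


counts = [[5, 1, 0], [1, 4, 1], [2, 0, 3]]
tp, fp, fn, tn, ri = rand_index(counts)
```

For these counts there are 136 pairs in total, split into TP = 20, FP = 20, FN = 24, and TN = 72, so the Rand index is 92/136.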
![Page 75: WBIA Review](https://reader036.fdocument.pub/reader036/viewer/2022062408/56814532550346895db1fa23/html5/thumbnails/75.jpg)
Thank You!
Q&A