Bayesian Learning
Transcript of Bayesian Learning
Bayesian Reasoning
• Basic assumption
– The quantities of interest are governed by probability distributions
– These probabilities + observed data ==> reasoning ==> optimal decision
• Significance
– Foundation of algorithms that manipulate probabilities directly
• e.g., naïve Bayes classifier
– Framework for analyzing algorithms that do not manipulate probabilities
• e.g., cross entropy, the inductive bias of decision trees, the MDL principle

Features & Limitations

• Features of Bayesian learning
– Each observed training example incrementally raises or lowers the estimated probability of a hypothesis
– Prior knowledge: P(h), P(D|h)
– Applies to probabilistic prediction
– Prediction by combining multiple hypotheses
• Limitations
– Requires initial knowledge (the priors)
– Significant computational cost

Bayes Theorem

• Terms
– P(h): prior probability of h
– P(D): prior probability that D will be observed
– P(D|h): probability of observing D given h (prior knowledge)
– P(h|D): posterior probability of h, given D
• Theorem

P(h|D) = P(D|h)P(h) / P(D)

• Machine learning: the process of finding the most probable hypothesis from the given data

Example

• Medical diagnosis
– P(cancer) = 0.008, P(~cancer) = 0.992
– P(+|cancer) = 0.98, P(−|cancer) = 0.02
– P(+|~cancer) = 0.03, P(−|~cancer) = 0.97
– P(cancer|+) ∝ P(+|cancer)P(cancer) = 0.0078
– P(~cancer|+) ∝ P(+|~cancer)P(~cancer) = 0.0298
– h_MAP = ~cancer

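The slide's numbers can be checked directly; a minimal sketch using only the probabilities given above:

```python
# Values from the slide.
p_cancer, p_not = 0.008, 0.992
p_pos_given_cancer, p_pos_given_not = 0.98, 0.03

# Unnormalized posteriors P(+|h)P(h)
u_cancer = p_pos_given_cancer * p_cancer   # 0.98 * 0.008 = 0.00784
u_not = p_pos_given_not * p_not            # 0.03 * 0.992 = 0.02976

# Normalizing by P(+) = 0.00784 + 0.02976 gives the true posteriors;
# the argmax (h_MAP) is unaffected by the normalization.
p_pos = u_cancer + u_not
post_cancer = u_cancer / p_pos
post_not = u_not / p_pos
```

Since 0.02976 > 0.00784, h_MAP = ~cancer, even though a positive test substantially raises the probability of cancer.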
MAP Hypothesis

• MAP (maximum a posteriori) hypothesis

h_MAP = argmax_{h∈H} P(h|D)
      = argmax_{h∈H} P(D|h)P(h) / P(D)
      = argmax_{h∈H} P(D|h)P(h)

ML Hypothesis

• Maximum likelihood (ML) hypothesis
– Basic assumption: all hypotheses are equally probable a priori

h_ML = argmax_{h∈H} P(D|h)

• Basic formulas
– P(A ∧ B) = P(A|B)P(B) = P(B|A)P(A)
– P(B) = Σ_i P(B|A_i)P(A_i)

Bayes Theorem and Concept Learning

• Brute-force MAP learning
– For each h ∈ H, calculate P(h|D)
– Output the hypothesis with the highest posterior, h_MAP
• Assumptions
– Noise-free data D
– The target concept c is contained in the hypothesis space H
– Every hypothesis is equally probable a priori: P(h) = 1/|H|
• Result

P(h|D) = 1/|VS_{H,D}|  (if h is consistent with D)
P(h|D) = 0             (otherwise)

• Every consistent hypothesis is a MAP hypothesis
• Derivation
– P(D|h) = 1 if d_i = h(x_i) for every example in D, else 0
– P(D) = Σ_{h_i∈H} P(D|h_i)P(h_i) = |VS_{H,D}| / |H|
– For consistent h: P(h|D) = P(D|h)P(h) / P(D) = (1 · 1/|H|) / (|VS_{H,D}|/|H|) = 1/|VS_{H,D}|

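The brute-force result can be verified by enumeration. A minimal sketch with a hypothetical toy hypothesis space (all boolean functions over 2-bit instances, so |H| = 16):

```python
from itertools import product

# Hypothetical toy setup: instances are 2-bit inputs and H is the space of
# all boolean functions over them, so |H| = 2^4 = 16.
instances = list(product([0, 1], repeat=2))
hypotheses = [dict(zip(instances, outs)) for outs in product([0, 1], repeat=4)]

# Noise-free training data D: two labeled examples.
D = [((0, 0), 0), ((1, 1), 1)]

# Version space: the hypotheses consistent with D.
consistent = [h for h in hypotheses if all(h[x] == d for x, d in D)]
vs_size = len(consistent)  # the two unseen instances are free, so |VS| = 4

# P(h|D) = 1/|VS_{H,D}| for consistent h, 0 otherwise.
posterior = [1 / vs_size if h in consistent else 0.0 for h in hypotheses]
```

The posterior sums to 1 and is uniform over the version space, matching the formula above.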
Consistent Learner

• Definition: an algorithm that outputs a hypothesis with zero error on the training examples
• Result
– Every consistent hypothesis is a MAP hypothesis
– Hence every consistent learner outputs a MAP hypothesis
• Conditions
– Uniform prior probability distribution over H
– Deterministic, noise-free training data

ML and Least-Squared-Error Hypothesis

• Least-squared-error hypothesis
– Neural networks, curve fitting, linear regression
– Continuous-valued target function
• Task: learn f from examples d_i = f(x_i) + e_i, where e_i is zero-mean Gaussian noise
• Preliminaries
– Probability densities, normal distribution
– Target values are mutually independent given h
• Result

h_ML = argmax_{h∈H} p(D|h)
     = argmax_{h∈H} Π_{i=1}^m (1/√(2πσ²)) e^{−(d_i − h(x_i))² / 2σ²}
     = argmax_{h∈H} Σ_{i=1}^m [ ln(1/√(2πσ²)) − (d_i − h(x_i))² / 2σ² ]
     = argmax_{h∈H} Σ_{i=1}^m −(d_i − h(x_i))²
     = argmin_{h∈H} Σ_{i=1}^m (d_i − h(x_i))²

• Limitation: assumes noise only in the target value, not in the attribute values

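Under the Gaussian-noise assumption, maximizing the log-likelihood and minimizing the squared error pick out the same hypothesis. A minimal sketch, assuming a hypothetical linear target f(x) = 2x and a grid search over candidate slopes:

```python
import math
import random

random.seed(0)
# Hypothetical setup: linear target f(x) = 2x observed with Gaussian noise.
sigma = 0.1
xs = [i / 10 for i in range(20)]
ds = [2 * x + random.gauss(0, sigma) for x in xs]

def log_likelihood(w):
    # Σ_i ln p(d_i|h) for the hypothesis h(x) = w*x under N(0, σ²) noise
    return sum(-0.5 * math.log(2 * math.pi * sigma ** 2)
               - (d - w * x) ** 2 / (2 * sigma ** 2)
               for x, d in zip(xs, ds))

def squared_error(w):
    return sum((d - w * x) ** 2 for x, d in zip(xs, ds))

# Grid search over candidate slopes: both criteria select the same hypothesis.
candidates = [1.5 + 0.01 * k for k in range(101)]
w_ml = max(candidates, key=log_likelihood)
w_lse = min(candidates, key=squared_error)
```

The log-likelihood is a constant minus the squared error divided by 2σ², so the two searches agree term by term with the derivation above.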
ML Hypothesis for Predicting Probabilities

• Task: learn g(x) = P(f(x) = 1) for a nondeterministic boolean target function f
• Question: what criterion should we optimize in order to find an ML hypothesis for g?
• Result: cross entropy

h_ML = argmax_{h∈H} Σ_{i=1}^m [ d_i ln h(x_i) + (1 − d_i) ln(1 − h(x_i)) ]

– Compare the entropy function: −Σ_i p_i ln p_i
• Derivation
– P(D|h) = Π_{i=1}^m P(x_i, d_i|h) = Π_{i=1}^m P(d_i|h, x_i) P(x_i)
– P(d_i|h, x_i) = h(x_i) if d_i = 1, and 1 − h(x_i) if d_i = 0, i.e.
  P(d_i|h, x_i) = h(x_i)^{d_i} (1 − h(x_i))^{1−d_i}
– Therefore

h_ML = argmax_{h∈H} Π_{i=1}^m h(x_i)^{d_i} (1 − h(x_i))^{1−d_i} P(x_i)
     = argmax_{h∈H} Π_{i=1}^m h(x_i)^{d_i} (1 − h(x_i))^{1−d_i}
     = argmax_{h∈H} Σ_{i=1}^m [ d_i ln h(x_i) + (1 − d_i) ln(1 − h(x_i)) ]

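The quantity being maximized is exactly the negative cross entropy. A minimal sketch with hypothetical binary targets d_i and probability estimates h(x_i):

```python
import math

# Hypothetical data: binary targets d_i with a hypothesis h giving
# probability estimates h(x_i) for the same instances.
d = [1, 0, 1, 1, 0]
h = [0.9, 0.2, 0.7, 0.8, 0.1]

# ln P(D|h) = Σ_i d_i ln h(x_i) + (1 − d_i) ln(1 − h(x_i))
log_lik = sum(di * math.log(hi) + (1 - di) * math.log(1 - hi)
              for di, hi in zip(d, h))

# The ML hypothesis maximizes log_lik, i.e. minimizes the cross entropy.
cross_entropy = -log_lik
```

A hypothesis that assigns higher probability to the observed outcomes gets a higher log-likelihood and hence a lower cross entropy.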
Gradient Search for ML in Neural Networks

• Let G(h, D) = cross entropy = Σ_{i=1}^m [ d_i ln h(x_i) + (1 − d_i) ln(1 − h(x_i)) ]
• Gradient with respect to a weight w_jk:

∂G(h,D)/∂w_jk = Σ_{i=1}^m ∂G(h,D)/∂h(x_i) · ∂h(x_i)/∂w_jk

∂G(h,D)/∂h(x_i) = ∂[ d_i ln h(x_i) + (1 − d_i) ln(1 − h(x_i)) ]/∂h(x_i)
                = (d_i − h(x_i)) / ( h(x_i)(1 − h(x_i)) )

• For a sigmoid unit, ∂h(x_i)/∂w_jk = h(x_i)(1 − h(x_i)) x_ijk, so by gradient ascent:

Δw_jk = η Σ_{i=1}^m (d_i − h(x_i)) x_ijk

• Compare backpropagation, which minimizes the sum of squared errors:

Δw_jk = η Σ_{i=1}^m h(x_i)(1 − h(x_i)) (d_i − h(x_i)) x_ijk   (BP)

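The cross-entropy update rule for a single sigmoid unit can be sketched directly. A minimal example on a hypothetical four-example training set (no bias weight; the target is simply the second input):

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

# Hypothetical training set for a single sigmoid unit with two inputs.
X = [(0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (0.0, 0.0)]
D = [1, 0, 1, 0]

w = [0.0, 0.0]
eta = 0.5
for _ in range(200):
    # Cross-entropy gradient ascent: Δw_jk = η Σ_i (d_i − h(x_i)) x_ijk
    grad = [0.0, 0.0]
    for x, d in zip(X, D):
        h = sigmoid(w[0] * x[0] + w[1] * x[1])
        for k in range(2):
            grad[k] += (d - h) * x[k]
    for k in range(2):
        w[k] += eta * grad[k]

pred = [sigmoid(w[0] * x[0] + w[1] * x[1]) for x in X]
```

Note the simpler form of the update compared with backpropagation: the sigmoid derivative h(1 − h) cancels against the cross-entropy gradient.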
MDL Principle

• Goal: interpret the MDL principle and its inductive bias in Bayesian terms
• Shannon and Weaver: the optimal code assigns −log₂ p_i bits to a message with probability p_i
• Rewriting h_MAP in terms of code lengths:

h_MAP = argmax_{h∈H} [ log₂ P(D|h) + log₂ P(h) ]
      = argmin_{h∈H} [ −log₂ P(D|h) − log₂ P(h) ]
      = argmin_{h∈H} [ L_{C_H}(h) + L_{C_{D|H}}(D|h) ]

• MDL principle: choose

h_MDL = argmin_{h∈H} [ L_{C1}(h) + L_{C2}(D|h) ]

Bayes Optimal Classifier

• Motivation: the classification of a new instance is optimized by combining the predictions of all hypotheses
• Task: find the most probable classification of the new instance given the training data
• Answer: combine the predictions of all hypotheses, weighted by their posteriors
• Bayes optimal classification:

argmax_{v_j∈V} Σ_{h_i∈H} P(v_j|h_i) P(h_i|D)

• Limitation: significant computational cost ==> Gibbs algorithm

Bayes Optimal Classifier Example

• Posteriors and predictions
– P(h1|D) = .4, P(−|h1) = 0, P(+|h1) = 1
– P(h2|D) = .3, P(−|h2) = 1, P(+|h2) = 0
– P(h3|D) = .3, P(−|h3) = 1, P(+|h3) = 0
• Combining over V = {+, −}:

Σ_{h_i∈H} P(+|h_i) P(h_i|D) = .4
Σ_{h_i∈H} P(−|h_i) P(h_i|D) = .6

argmax_{v_j∈V} Σ_{h_i∈H} P(v_j|h_i) P(h_i|D) = −

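The weighted vote above takes a few lines to reproduce, using only the numbers given on the slide:

```python
# Posteriors and per-hypothesis predictions from the slide.
posterior = {"h1": 0.4, "h2": 0.3, "h3": 0.3}
p_plus = {"h1": 1.0, "h2": 0.0, "h3": 0.0}
p_minus = {"h1": 0.0, "h2": 1.0, "h3": 1.0}

# Σ_{h_i∈H} P(v|h_i) P(h_i|D) for each class value v
vote_plus = sum(p_plus[h] * posterior[h] for h in posterior)
vote_minus = sum(p_minus[h] * posterior[h] for h in posterior)
best = "+" if vote_plus > vote_minus else "-"
```

Note that the single MAP hypothesis h1 predicts +, while the Bayes optimal combination predicts −.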
Gibbs Algorithm

• Algorithm
– 1. Choose h from H at random, according to the posterior probability distribution over H
– 2. Use h to predict the classification of the next instance x
• Usefulness of the Gibbs algorithm (Haussler et al., 1994)
– E[error(Gibbs algorithm)] ≤ 2 · E[error(Bayes optimal classifier)]

Naïve Bayes Classifier

• Naïve Bayes classifier

v_MAP = argmax_{v_j∈V} P(a_1, a_2, …, a_n|v_j) P(v_j)
v_NB  = argmax_{v_j∈V} P(v_j) Π_i P(a_i|v_j)

• Differences from other learners
– No explicit search through H
– Probabilities are estimated by counting the frequencies of the training examples
• m-estimate of probability: (n_c + mp) / (n + m)
– m: equivalent sample size, p: prior estimate of the probability

Example

• Classify (outlook=sunny, temperature=cool, humidity=high, wind=strong)
• P(wind=strong|PlayTennis=yes) = 3/9 = .33
• P(wind=strong|PlayTennis=no) = 3/5 = .60
• P(yes) P(sunny|yes) P(cool|yes) P(high|yes) P(strong|yes) = .0053
• P(no) P(sunny|no) P(cool|no) P(high|no) P(strong|no) = .0206
• v_NB = no

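A sketch of the computation. Only P(strong|yes) and P(strong|no) appear on the slide; the remaining conditional probabilities below are taken from the standard PlayTennis counts this example is based on, and they reproduce the slide's .0053 and .0206:

```python
# P(strong|yes) and P(strong|no) are given on the slide; the remaining
# probabilities are the standard PlayTennis counts (assumed here).
p_yes, p_no = 9 / 14, 5 / 14
cond_yes = {"sunny": 2 / 9, "cool": 3 / 9, "high": 3 / 9, "strong": 3 / 9}
cond_no = {"sunny": 3 / 5, "cool": 1 / 5, "high": 4 / 5, "strong": 3 / 5}

# v_NB = argmax_v P(v) Π_i P(a_i|v)
score_yes, score_no = p_yes, p_no
for a in ("sunny", "cool", "high", "strong"):
    score_yes *= cond_yes[a]
    score_no *= cond_no[a]
# score_yes ≈ .0053, score_no ≈ .0206, so v_NB = no
```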
Bayesian Belief Networks

• Definition
– Describes the joint probability distribution for a set of variables
– Does not require that all the variables be conditionally independent
– Expresses partial dependence relationships among variables as probabilities
• Representation

Inference

• Task: infer the probability distribution of the target variables
• Methods
– Exact inference: NP-hard
– Approximate inference
• Theoretically also NP-hard
• Practically useful
• e.g., Monte Carlo methods

Learning

• Settings
– Structure known + fully observable data
• Easy: estimate the conditional probabilities as in the naïve Bayes classifier
– Structure known + partially observable data
• Gradient ascent procedure (Russell et al., 1995)
• Analogous to searching for the ML hypothesis maximizing P(D|h)
• Update rule: w_ijk ← w_ijk + η Σ_{d∈D} P_h(y_ij, u_ik|d) / w_ijk
– Structure unknown

Learning (2)

• Structure unknown
– Bayesian scoring metric (Cooper & Herskovits, 1992)
– K2 algorithm
• Cooper & Herskovits, 1992
• Heuristic greedy search
• Requires fully observed data
– Constraint-based approach
• Spirtes et al., 1993
• Infer dependence and independence relationships from the data
• Construct the structure using these relationships
• Derivation of the gradient ascent rule w_ijk ← w_ijk + η Σ_{d∈D} P_h(y_ij, u_ik|d) / w_ijk, where w_ijk = P_h(y_ij|u_ik):

∂ ln P_h(D)/∂w_ijk = Σ_{d∈D} ∂ ln P_h(d)/∂w_ijk
                   = Σ_{d∈D} (1/P_h(d)) ∂P_h(d)/∂w_ijk

Expanding P_h(d) over the values y_ij' of node Y_i and the values u_ik' of its parents U_i:

P_h(d) = Σ_{j',k'} P_h(d|y_ij', u_ik') P_h(y_ij'|u_ik') P_h(u_ik')
       = Σ_{j',k'} P_h(d|y_ij', u_ik') w_ij'k' P_h(u_ik')

Since ∂w_ij'k'/∂w_ijk = 0 unless j' = j and k' = k, only one term survives:

∂P_h(d)/∂w_ijk = P_h(d|y_ij, u_ik) P_h(u_ik)
               = [ P_h(y_ij, u_ik|d) P_h(d) / P_h(y_ij, u_ik) ] P_h(u_ik)
               = P_h(y_ij, u_ik|d) P_h(d) / P_h(y_ij|u_ik)
               = P_h(y_ij, u_ik|d) P_h(d) / w_ijk

Hence ∂ ln P_h(D)/∂w_ijk = Σ_{d∈D} P_h(y_ij, u_ik|d) / w_ijk.

EM Algorithm

• EM: Expectation-Maximization
• Setting
– Learning in the presence of unobserved variables
– The form of the probability distribution is known
• Applications
– Training Bayesian belief networks
– Training radial basis function networks
– Basis for many unsupervised clustering algorithms
– Basis for the Baum-Welch forward-backward algorithm for HMMs

K-means Algorithm

• Setting: data generated at random from k normal distributions
• Task: find the mean value of each distribution
• Full instance description: <x_i, z_i1, z_i2, …>, where z_ij indicates which distribution generated x_i
– If the z_ij were known, use μ_ML = argmin_μ Σ_i (x_i − μ)²
– Otherwise, use the EM algorithm

K-means Algorithm (2)

• Initialize h = <μ_1, …, μ_k>
• Calculate E[z_ij]:

E[z_ij] = p(x = x_i|μ = μ_j) / Σ_{n=1}^k p(x = x_i|μ = μ_n)
        = e^{−(x_i − μ_j)²/2σ²} / Σ_{n=1}^k e^{−(x_i − μ_n)²/2σ²}

• Calculate a new ML hypothesis:

μ_j ← Σ_{i=1}^m E[z_ij] x_i / Σ_{i=1}^m E[z_ij]

==> converges to a local ML hypothesis

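The two steps above can be sketched end to end. A minimal example, assuming hypothetical data drawn from two Gaussians with true means 0 and 5 and known σ = 1:

```python
import math
import random

random.seed(1)
sigma = 1.0
# Hypothetical data: a mixture of two Gaussians with true means 0 and 5.
data = ([random.gauss(0, sigma) for _ in range(100)]
        + [random.gauss(5, sigma) for _ in range(100)])

mu = [1.0, 4.0]  # initial guesses for the two means
for _ in range(50):
    # E step: E[z_ij] = exp(-(x_i-mu_j)^2/2s^2) / Σ_n exp(-(x_i-mu_n)^2/2s^2)
    ez = []
    for x in data:
        ws = [math.exp(-((x - m) ** 2) / (2 * sigma ** 2)) for m in mu]
        total = sum(ws)
        ez.append([w / total for w in ws])
    # M step: mu_j <- Σ_i E[z_ij] x_i / Σ_i E[z_ij]
    for j in range(2):
        num = sum(ez[i][j] * data[i] for i in range(len(data)))
        den = sum(ez[i][j] for i in range(len(data)))
        mu[j] = num / den
```

After a few iterations the estimated means settle near the true means, a local ML hypothesis.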
General Statement of the EM Algorithm

• Terms
– θ: parameters of the underlying probability distribution
– X: observed data
– Z: unobserved data
– Y = X ∪ Z: the full data
– h: current hypothesis for θ
– h': revised hypothesis
• Task: estimate θ from X

Guideline

• Search for the h' maximizing the expected log-likelihood of the full data:

h' = argmax_{h'} E[ln P(Y|h')]

• Since Z is unobserved, take this expectation under the current hypothesis h and the observed X, which defines the function Q:

Q(h'|h) = E[ ln P(Y|h') | h, X ]

EM Algorithm (2)

• Estimation (E) step: calculate Q(h'|h) using the current hypothesis h and the observed data X:

Q(h'|h) ← E[ ln P(Y|h') | h, X ]

• Maximization (M) step: replace h by the hypothesis maximizing Q:

h ← argmax_{h'} Q(h'|h)

• Converges to a local maximum
• Derivation for the k-means case:

p(y_i|h') = p(x_i, z_i1, …, z_ik|h') = (1/√(2πσ²)) e^{−(1/2σ²) Σ_{j=1}^k z_ij (x_i − μ'_j)²}

ln P(Y|h') = Σ_i ln p(y_i|h')
           = Σ_i [ ln(1/√(2πσ²)) − (1/2σ²) Σ_j z_ij (x_i − μ'_j)² ]

Taking the expectation (which is linear in z_ij):

Q(h'|h) = E[ln P(Y|h')] = Σ_{i=1}^m [ ln(1/√(2πσ²)) − (1/2σ²) Σ_j E[z_ij] (x_i − μ'_j)² ]

where

E[z_ij] = p(x = x_i|μ = μ_j) / Σ_{n=1}^k p(x = x_i|μ = μ_n)
        = e^{−(x_i − μ_j)²/2σ²} / Σ_{n=1}^k e^{−(x_i − μ_n)²/2σ²}

Maximizing Q over μ'_j gives the M-step update:

μ_j ← Σ_{i=1}^m E[z_ij] x_i / Σ_{i=1}^m E[z_ij]