
Variational Mixture of Gaussians

Sargur Srihari srihari@cedar.buffalo.edu

Objective

•  Apply the variational inference machinery to Gaussian Mixture Models

•  Demonstrate how a Bayesian treatment elegantly resolves the difficulties of maximum likelihood

•  Many more complex distributions can be handled by straightforward extensions of this analysis

Graphical Model for GMM

•  Graphical model corresponding to the likelihood function of a standard GMM

•  For each observation xn we have a corresponding latent variable zn
   –  A 1-of-K binary vector with elements znk for k=1,..,K

•  Denote the observed data by X={x1,..,xN} and the latent variables by Z={z1,..,zN}

[Figure: plate notation and the equivalent expanded network — a directed acyclic graph representing the mixture]

Likelihood Function for GMM

•  Mixture density: since z takes the values {zk} with probabilities πk,

    p(x) = \sum_{k=1}^{K} \pi_k \mathcal{N}(x \mid \mu_k, \Sigma_k)

•  Therefore the likelihood function is

    p(X \mid \pi, \mu, \Sigma) = \prod_{n=1}^{N} \left\{ \sum_{k=1}^{K} \pi_k \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right\}

    where the product is over the N i.i.d. samples

•  Therefore the log-likelihood function is

    \ln p(X \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln \left\{ \sum_{k=1}^{K} \pi_k \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right\}

•  Find the parameters π, µ and Σ that maximize the log-likelihood
   –  A more difficult problem than for a single Gaussian
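As a concrete illustration of the log-likelihood above, here is a minimal NumPy/SciPy sketch; the function name and the use of log-sum-exp for numerical stability are my own choices, not part of the slides.

```python
# Sketch: evaluating ln p(X | pi, mu, Sigma) for a GMM.
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def gmm_log_likelihood(X, pis, mus, Sigmas):
    """ln p(X) = sum_n ln sum_k pi_k N(x_n | mu_k, Sigma_k)."""
    N, K = X.shape[0], len(pis)
    log_terms = np.empty((N, K))
    for k in range(K):
        # log of pi_k * N(x_n | mu_k, Sigma_k) for every n
        log_terms[:, k] = np.log(pis[k]) + multivariate_normal.logpdf(X, mus[k], Sigmas[k])
    # log-sum-exp over components, then sum over the N i.i.d. samples
    return logsumexp(log_terms, axis=1).sum()
```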

GMM m.l.e. expressions

•  Obtained using derivatives of the log-likelihood

•  Not closed-form solutions for the parameters
   –  Since the responsibilities γ(znk) depend on those parameters in a complex way

    Means:                 \mu_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\, x_n

    Covariance matrices:   \Sigma_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\, (x_n - \mu_k)(x_n - \mu_k)^T

    Mixing coefficients:   \pi_k = \frac{N_k}{N}, \quad \text{where} \quad N_k = \sum_{n=1}^{N} \gamma(z_{nk})

•  All three are in terms of the responsibilities γ(znk)

EM for GMM

•  E step
   –  Use the current values of the parameters µk, Σk, πk to evaluate the posterior probabilities p(Z|X), i.e., the responsibilities γ(znk)

•  M step
   –  Use these responsibilities to re-estimate the means, covariances and mixing coefficients by maximizing the expectation of ln p(X,Z) with respect to p(Z|X)
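A minimal sketch of one EM cycle for the maximum-likelihood GMM, assuming data X of shape (N, D); the helper names and the small covariance regularizer are illustrative additions, not from the slides.

```python
# Sketch of the E and M steps for maximum-likelihood EM on a GMM.
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def e_step(X, pis, mus, Sigmas):
    """Responsibilities gamma(z_nk) = p(z_nk = 1 | x_n), shape (N, K)."""
    log_r = np.stack([np.log(pis[k]) + multivariate_normal.logpdf(X, mus[k], Sigmas[k])
                      for k in range(len(pis))], axis=1)
    return np.exp(log_r - logsumexp(log_r, axis=1, keepdims=True))

def m_step(X, gamma, reg=1e-6):
    """Re-estimate pi_k, mu_k, Sigma_k from the responsibilities."""
    N, D = X.shape
    Nk = gamma.sum(axis=0)                      # effective counts N_k
    pis = Nk / N
    mus = (gamma.T @ X) / Nk[:, None]
    Sigmas = []
    for k in range(len(Nk)):
        diff = X - mus[k]
        Sigmas.append((gamma[:, k, None] * diff).T @ diff / Nk[k] + reg * np.eye(D))
    return pis, mus, np.array(Sigmas)
```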

Graphical model for Bayesian GMM

•  To specify the model we need these conditional probabilities:
   1.  p(Z|π): conditional distribution of Z given the mixing coefficients
   2.  p(X|Z, µ, Λ): conditional distribution of the observed data given the latent variables and component parameters
   3.  p(π): distribution of the mixing coefficients
   4.  p(µ,Λ): prior governing the mean and precision of each component

[Figure: directed graphs for the GMM and the Bayesian GMM, with nodes for the mixing coefficients, means and precisions]

Conditional Distribution Expressions

1.  Conditional distribution of Z={z1,..,zN} given the mixing coefficients π
    –  Since the components are mutually exclusive (recall p(z) = \prod_{k=1}^{K} \pi_k^{z_k})

    p(Z \mid \pi) = \prod_{n=1}^{N} \prod_{k=1}^{K} \pi_k^{z_{nk}}

2.  Conditional distribution of the observed data X={x1,..,xN} given the latent variables and component parameters
    –  Since the components are Gaussian (recall p(x \mid z) = \prod_{k=1}^{K} \mathcal{N}(x \mid \mu_k, \Sigma_k)^{z_k})

    p(X \mid Z, \mu, \Lambda) = \prod_{n=1}^{N} \prod_{k=1}^{K} \mathcal{N}(x_n \mid \mu_k, \Lambda_k^{-1})^{z_{nk}}

    –  where µ = {µk} and Λ = {Λk}
    •  Use of the precision matrix simplifies the further analysis
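To make the two factors concrete, here is a small sketch of the complete-data log joint ln p(Z|π) + ln p(X|Z,µ,Λ) for given one-hot assignments; the names Z_onehot and Lambdas are illustrative only and not from the slides.

```python
# Sketch: complete-data log probability with precision matrices Lambda_k.
import numpy as np
from scipy.stats import multivariate_normal

def complete_data_log_prob(X, Z_onehot, pis, mus, Lambdas):
    """ln p(Z|pi) + ln p(X|Z,mu,Lambda) for one-hot assignments Z."""
    total = 0.0
    for k in range(len(pis)):
        members = Z_onehot[:, k].astype(bool)          # points with z_nk = 1
        total += members.sum() * np.log(pis[k])
        if members.any():
            total += multivariate_normal.logpdf(
                X[members], mean=mus[k], cov=np.linalg.inv(Lambdas[k])).sum()
    return total
```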

Parameter Priors: Mixing Coefficients

3.  Distribution of the mixing coefficients p(π)
    •  Conjugate priors simplify the analysis
    •  Dirichlet distribution over π

    p(\pi) = \mathrm{Dir}(\pi \mid \alpha_0) = C(\alpha_0) \prod_{k=1}^{K} \pi_k^{\alpha_0 - 1}

    –  We have chosen the same parameter α0 for each of the components
    –  C(α0) is the normalization constant of the Dirichlet distribution

Parameter Priors: Mean, Precision

4.  Distribution of the mean and precision of the Gaussian components
    –  The Gaussian-Wishart prior is

    p(\mu, \Lambda) = p(\mu \mid \Lambda)\, p(\Lambda) = \prod_{k=1}^{K} \mathcal{N}\!\left(\mu_k \mid m_0, (\beta_0 \Lambda_k)^{-1}\right) \mathcal{W}(\Lambda_k \mid W_0, \nu_0)

    –  which is the conjugate prior when both the mean and precision are unknown

•  The resulting model has a link between Λ and µ
    –  Due to the distribution p(µ,Λ) in (4) above
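A small sketch of what this prior looks like in practice: drawing one (µk, Λk) pair from the Gaussian-Wishart. The hyperparameter values below are arbitrary illustrations, not taken from the slides.

```python
# Sketch: sampling from N(mu | m0, (beta0*Lambda)^-1) W(Lambda | W0, nu0).
import numpy as np
from scipy.stats import wishart, multivariate_normal

D = 2
m0, beta0 = np.zeros(D), 1.0
W0, nu0 = np.eye(D), D + 2            # nu0 >= D gives a proper Wishart

Lambda_k = wishart.rvs(df=nu0, scale=W0)                    # precision sample
mu_k = multivariate_normal.rvs(mean=m0, cov=np.linalg.inv(beta0 * Lambda_k))
```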

Bayesian Network for Bayesian GMM

•  Joint distribution of all random variables:

    p(X, Z, \pi, \mu, \Lambda) = p(X \mid Z, \mu, \Lambda)\, p(Z \mid \pi)\, p(\pi)\, p(\mu \mid \Lambda)\, p(\Lambda)

    –  All the factors were given earlier
    –  Only X={x1,..,xN} are observed

•  This BN provides a nice distinction between latent variables and parameters
    –  Variables such as zn that appear inside the plate are latent variables
       •  The number of such variables grows with the data set
    –  Variables outside the plate (the means, precisions and mixing coefficients) are parameters
       •  Fixed in number, independent of the size of the data set
    –  From the viewpoint of PGMs there is no fundamental difference

The variational approach

•  Recall the GMM:

    p(x) = \sum_{z} p(z)\, p(x \mid z) = \sum_{k=1}^{K} \pi_k \mathcal{N}(x \mid \mu_k, \Sigma_k)

    –  Here p(z) has parameter π, which now itself has a distribution p(π)

•  The EM approach:
    1.  Evaluation of the posterior distribution p(Z|X)
    2.  Evaluation of the expectation of ln p(X,Z) with respect to p(Z|X)

•  Our goal is to specify the variational distribution q(Z,π,µ,Λ), which will approximate p(Z,π,µ,Λ|X)
    –  Recall

    \ln p(X) = \mathcal{L}(q) + \mathrm{KL}(q \| p), \quad \text{where} \quad
    \mathcal{L}(q) = \int q(Z) \ln\left\{\frac{p(X,Z)}{q(Z)}\right\} dZ
    \quad \text{and} \quad
    \mathrm{KL}(q \| p) = -\int q(Z) \ln\left\{\frac{p(Z \mid X)}{q(Z)}\right\} dZ

Variational Distribution

•  In variational inference we specify q by using a factorized distribution

    q(Z) = \prod_{i=1}^{M} q_i(Z_i)

    (subscripts on the q's are omitted)

•  For the Bayesian GMM the latent variables and parameters are Z, π, µ and Λ
    •  So we consider the variational distribution

    q(Z, \pi, \mu, \Lambda) = q(Z)\, q(\pi, \mu, \Lambda)

    –  Remarkably, this is the only assumption needed for a tractable solution to a Bayesian mixture model

•  The functional forms of both q(Z) and q(π,µ,Λ) are determined automatically by optimizing the variational distribution

Sequential update equations

•  Using the general result for factorized distributions
    –  When L(q) is defined as

    \mathcal{L}(q) = \int q(Z) \ln\left\{\frac{p(X,Z)}{q(Z)}\right\} dZ = \int \prod_i q_i \left\{\ln p(X,Z) - \sum_i \ln q_i\right\} dZ

    –  the q that makes the functional L(q) largest is

    \ln q_j^*(Z_j) = \mathbb{E}_{i \neq j}\left[\ln p(X,Z)\right] + \text{const}

•  For the Bayesian GMM the log of the optimized factor is

    \ln q^*(Z) = \mathbb{E}_{\pi,\mu,\Lambda}\left[\ln p(X, Z, \pi, \mu, \Lambda)\right] + \text{const}

•  Since p(X, Z, \pi, \mu, \Lambda) = p(X \mid Z, \mu, \Lambda)\, p(Z \mid \pi)\, p(\pi)\, p(\mu \mid \Lambda)\, p(\Lambda), we have

    \ln q^*(Z) = \mathbb{E}_{\pi}\left[\ln p(Z \mid \pi)\right] + \mathbb{E}_{\mu,\Lambda}\left[\ln p(X \mid Z, \mu, \Lambda)\right] + \text{const}

    –  Note: the expectations are just weighted sums

Simplification of q*(Z)

•  Expression for the factor q*(Z):

    \ln q^*(Z) = \mathbb{E}_{\pi}\left[\ln p(Z \mid \pi)\right] + \mathbb{E}_{\mu,\Lambda}\left[\ln p(X \mid Z, \mu, \Lambda)\right] + \text{const}

•  Absorbing terms not depending on Z into the constant:

    \ln q^*(Z) = \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk} \ln \rho_{nk} + \text{const}

    where

    \ln \rho_{nk} = \mathbb{E}[\ln \pi_k] + \frac{1}{2}\mathbb{E}[\ln |\Lambda_k|] - \frac{D}{2}\ln(2\pi) - \frac{1}{2}\mathbb{E}_{\mu_k,\Lambda_k}\left[(x_n - \mu_k)^T \Lambda_k (x_n - \mu_k)\right]

    and D is the dimensionality of the data variable x

•  Taking exponentials on both sides:

    q^*(Z) \propto \prod_{n=1}^{N} \prod_{k=1}^{K} \rho_{nk}^{z_{nk}}

•  The normalized distribution is

    q^*(Z) = \prod_{n=1}^{N} \prod_{k=1}^{K} r_{nk}^{z_{nk}} \quad \text{where} \quad r_{nk} = \frac{\rho_{nk}}{\sum_{j=1}^{K} \rho_{nj}}

    –  The rnk are positive since the ρnk are exponentials of real numbers, and they sum to one as required
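In code the normalization is usually done in log space so that the exponentials cannot underflow; a minimal sketch, assuming log_rho is an (N, K) array of the ln ρnk values.

```python
# Sketch: r_nk = rho_nk / sum_j rho_nj, computed stably via log-sum-exp.
import numpy as np
from scipy.special import logsumexp

def normalize_responsibilities(log_rho):
    """Turn ln rho_nk into responsibilities r_nk that sum to one over k."""
    return np.exp(log_rho - logsumexp(log_rho, axis=1, keepdims=True))
```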

Factor q*(Z) has same form as prior

•  The normalized distribution

    q^*(Z) = \prod_{n=1}^{N} \prod_{k=1}^{K} r_{nk}^{z_{nk}}

    has the same form as the prior

    p(Z \mid \pi) = \prod_{n=1}^{N} \prod_{k=1}^{K} \pi_k^{z_{nk}}

•  We have found the form of q* that maximizes the functional L(q)

•  The distribution q*(Z) is discrete and has the standard result E[znk] = rnk
    –  which play the role of responsibilities

•  Since the equations for q*(Z) depend on moments of the other variables, they are coupled and must be solved iteratively

Variational EM

•  Variational E-step: determine the responsibilities rnk

•  Variational M-step:
    1.  Determine the statistics of the data set

        N_k = \sum_{n=1}^{N} r_{nk}    (effective number of points assigned to component k)

        \bar{x}_k = \frac{1}{N_k} \sum_{n=1}^{N} r_{nk}\, x_n    (weighted mean of component k)

        S_k = \frac{1}{N_k} \sum_{n=1}^{N} r_{nk}\, (x_n - \bar{x}_k)(x_n - \bar{x}_k)^T    (weighted covariance matrix of component k)

    2.  Find the optimal solution for the factor q(π,µ,Λ)
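A minimal sketch of step 1, computing Nk, x̄k and Sk from the responsibilities; the small epsilon that guards against empty components is my addition, not in the slides.

```python
# Sketch: the variational M-step statistics N_k, xbar_k, S_k.
import numpy as np

def compute_statistics(X, r, eps=1e-10):
    N, D = X.shape
    Nk = r.sum(axis=0) + eps                    # N_k = sum_n r_nk
    xbar = (r.T @ X) / Nk[:, None]              # xbar_k = (1/N_k) sum_n r_nk x_n
    Sk = np.empty((r.shape[1], D, D))
    for k in range(r.shape[1]):
        diff = X - xbar[k]
        Sk[k] = (r[:, k, None] * diff).T @ diff / Nk[k]   # weighted covariance S_k
    return Nk, xbar, Sk
```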

Factorization of q(π,µ,Λ)

•  Using the general result for factorized distributions

    \ln q_j^*(Z_j) = \mathbb{E}_{i \neq j}\left[\ln p(X,Z)\right] + \text{const}

    we can write

    \ln q^*(\pi, \mu, \Lambda) = \ln p(\pi) + \sum_{k=1}^{K} \ln p(\mu_k, \Lambda_k) + \mathbb{E}_Z\left[\ln p(Z \mid \pi)\right] + \sum_{n=1}^{N} \sum_{k=1}^{K} \mathbb{E}[z_{nk}] \ln \mathcal{N}(x_n \mid \mu_k, \Lambda_k^{-1}) + \text{const}

•  This decomposes into terms involving only π and terms involving only µ and Λ
    –  The terms involving µ and Λ comprise a sum of terms involving µk and Λk, leading to the factorization

    q(\pi, \mu, \Lambda) = q(\pi) \prod_{k=1}^{K} q(\mu_k, \Lambda_k)

Factor q(π) is a Dirichlet

•  Given the factorization q(\pi, \mu, \Lambda) = q(\pi) \prod_{k=1}^{K} q(\mu_k, \Lambda_k), consider each factor in turn: q(π) and q(µk,Λk)

•  (2a) Identifying the terms depending on π, q(π) has the solution

    \ln q^*(\pi) = (\alpha_0 - 1) \sum_{k=1}^{K} \ln \pi_k + \sum_{k=1}^{K} \sum_{n=1}^{N} r_{nk} \ln \pi_k + \text{const}

•  Taking the exponential of both sides, q*(π) is a Dirichlet:

    q^*(\pi) = \mathrm{Dir}(\pi \mid \alpha) \quad \text{where } \alpha \text{ has components } \alpha_k = \alpha_0 + N_k

•  Recall the Dirichlet distribution:

    \mathrm{Dir}(\pi \mid \alpha) = \frac{\Gamma(\hat{\alpha})}{\Gamma(\alpha_1)\cdots\Gamma(\alpha_K)} \prod_{k=1}^{K} \pi_k^{\alpha_k - 1} \quad \text{where} \quad \hat{\alpha} = \sum_{k=1}^{K} \alpha_k

[Figure: a Dirichlet density for K=3 with αk=0.1]

Factor q*(µk,Λk) is a Gaussian-Wishart

•  (2b) The variational posterior q*(µk,Λk)
    –  Does not factorize further into marginals
    –  It is a Gaussian-Wishart distribution:

    q^*(\mu_k, \Lambda_k) = \mathcal{N}\!\left(\mu_k \mid m_k, (\beta_k \Lambda_k)^{-1}\right) \mathcal{W}(\Lambda_k \mid W_k, \nu_k)

•  W is the Wishart distribution. It has the form

    \mathcal{W}(\Lambda \mid W, \nu) = B(W, \nu)\, |\Lambda|^{(\nu - D - 1)/2} \exp\!\left(-\tfrac{1}{2}\mathrm{Tr}(W^{-1}\Lambda)\right)

    where ν is the number of degrees of freedom, W is a D x D scale matrix, Tr is the trace and B(W,ν) is a normalization constant

•  It is the conjugate prior for a Gaussian with known mean and unknown precision matrix Λ

Parameters of q*(µk,Λk)

•  The Gaussian-Wishart factor is

    q^*(\mu_k, \Lambda_k) = \mathcal{N}\!\left(\mu_k \mid m_k, (\beta_k \Lambda_k)^{-1}\right) \mathcal{W}(\Lambda_k \mid W_k, \nu_k)

    where we have defined

    \beta_k = \beta_0 + N_k

    m_k = \frac{1}{\beta_k}\left(\beta_0 m_0 + N_k \bar{x}_k\right)

    W_k^{-1} = W_0^{-1} + N_k S_k + \frac{\beta_0 N_k}{\beta_0 + N_k}\left(\bar{x}_k - m_0\right)\left(\bar{x}_k - m_0\right)^T

    \nu_k = \nu_0 + N_k

•  These update equations are analogous to the M-step of EM for the maximum likelihood solution of the GMM
    –  They involve evaluation of the same sums over the data set as EM
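A minimal sketch of these updates, including the Dirichlet update αk = α0 + Nk from the previous slide; the function name and argument ordering are illustrative choices, not from the slides.

```python
# Sketch: variational M-step updates for alpha_k, beta_k, m_k, W_k, nu_k.
import numpy as np

def update_hyperparameters(Nk, xbar, Sk, alpha0, beta0, m0, W0, nu0):
    alpha = alpha0 + Nk                               # Dirichlet: alpha_k = alpha0 + N_k
    beta = beta0 + Nk                                 # beta_k = beta0 + N_k
    m = (beta0 * m0 + Nk[:, None] * xbar) / beta[:, None]
    nu = nu0 + Nk                                     # nu_k = nu0 + N_k
    K, D = xbar.shape
    W = np.empty((K, D, D))
    W0_inv = np.linalg.inv(W0)
    for k in range(K):
        diff = (xbar[k] - m0)[:, None]                # column vector xbar_k - m0
        Wk_inv = (W0_inv + Nk[k] * Sk[k]
                  + (beta0 * Nk[k] / (beta0 + Nk[k])) * diff @ diff.T)
        W[k] = np.linalg.inv(Wk_inv)
    return alpha, beta, m, nu, W
```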

Expression for Responsibilities

•  For the M step we need the expectations E[znk] = rnk
    –  which are obtained by normalizing the ρnk: r_{nk} = \rho_{nk} \big/ \sum_{j=1}^{K} \rho_{nj}

•  Recall

    \ln \rho_{nk} = \mathbb{E}[\ln \pi_k] + \frac{1}{2}\mathbb{E}[\ln |\Lambda_k|] - \frac{D}{2}\ln(2\pi) - \frac{1}{2}\mathbb{E}_{\mu_k,\Lambda_k}\left[(x_n - \mu_k)^T \Lambda_k (x_n - \mu_k)\right]

•  The three expectations with respect to the variational distribution of the parameters are easily evaluated to give

    \ln \tilde{\pi}_k \equiv \mathbb{E}[\ln \pi_k] = \psi(\alpha_k) - \psi(\hat{\alpha}) \quad \text{where} \quad \hat{\alpha} = \sum_k \alpha_k

    \ln \tilde{\Lambda}_k \equiv \mathbb{E}[\ln |\Lambda_k|] = \sum_{i=1}^{D} \psi\!\left(\frac{\nu_k + 1 - i}{2}\right) + D \ln 2 + \ln |W_k|

    \mathbb{E}_{\mu_k,\Lambda_k}\left[(x_n - \mu_k)^T \Lambda_k (x_n - \mu_k)\right] = D\beta_k^{-1} + \nu_k (x_n - m_k)^T W_k (x_n - m_k)

    –  ψ is the digamma function, \psi(a) = \frac{d}{da}\ln\Gamma(a); it also appears in the normalization of the Dirichlet
    –  νk is the number of degrees of freedom of the Wishart

Evaluation of Responsibilities

•  Substituting the three expectations into ln ρnk gives

    r_{nk} \propto \tilde{\pi}_k\, \tilde{\Lambda}_k^{1/2} \exp\left\{-\frac{D}{2\beta_k} - \frac{\nu_k}{2}(x_n - m_k)^T W_k (x_n - m_k)\right\}

•  This is similar to the responsibilities in maximum likelihood EM:

    \gamma(z_k) \equiv p(z_k = 1 \mid x) = \frac{p(z_k = 1)\, p(x \mid z_k = 1)}{\sum_{j=1}^{K} p(z_j = 1)\, p(x \mid z_j = 1)} = \frac{\pi_k \mathcal{N}(x \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \mathcal{N}(x \mid \mu_j, \Sigma_j)}

    which can be written in the form

    r_{nk} \propto \pi_k\, |\Lambda_k|^{1/2} \exp\left\{-\frac{1}{2}(x_n - \mu_k)^T \Lambda_k (x_n - \mu_k)\right\}

    –  where we have used the precision Λk instead of the covariance Σk to highlight the similarity
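Putting the three expectations and the normalization together gives the variational E-step; a minimal, self-contained sketch, where the function name and argument layout (alpha, beta, m, W, nu for the current variational hyperparameters) are my own choices.

```python
# Sketch of the variational E-step: build ln rho_nk and normalize to r_nk.
import numpy as np
from scipy.special import digamma, logsumexp

def vb_e_step(X, alpha, beta, m, W, nu):
    """Responsibilities r_nk under the current variational posterior."""
    N, D = X.shape
    K = len(alpha)
    ln_pi = digamma(alpha) - digamma(alpha.sum())                 # E[ln pi_k]
    i = np.arange(1, D + 1)
    ln_Lambda = (digamma((nu[:, None] + 1 - i) / 2).sum(axis=1)   # E[ln |Lambda_k|]
                 + D * np.log(2) + np.linalg.slogdet(W)[1])
    log_rho = np.empty((N, K))
    for k in range(K):
        diff = X - m[k]
        quad = D / beta[k] + nu[k] * np.einsum('nd,de,ne->n', diff, W[k], diff)
        log_rho[:, k] = (ln_pi[k] + 0.5 * ln_Lambda[k]
                         - 0.5 * D * np.log(2 * np.pi) - 0.5 * quad)
    # normalize in log space to obtain r_nk
    return np.exp(log_rho - logsumexp(log_rho, axis=1, keepdims=True))
```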

Summary of Optimization

•  Optimization of the variational posterior distribution involves cycling between two stages
    –  Analogous to the E and M steps of maximum likelihood EM

•  Variational E-step:
    –  Use the current distribution over the model parameters to evaluate the moments and hence evaluate E[znk] = rnk

•  Variational M-step:
    –  Keep the responsibilities fixed and use them to recompute the variational distribution over the parameters, using

    q^*(\pi) = \mathrm{Dir}(\pi \mid \alpha) \quad \text{and} \quad q^*(\mu_k, \Lambda_k) = \mathcal{N}\!\left(\mu_k \mid m_k, (\beta_k \Lambda_k)^{-1}\right) \mathcal{W}(\Lambda_k \mid W_k, \nu_k)
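A minimal driver that cycles between the two stages, assuming the sketches above (vb_e_step, compute_statistics, update_hyperparameters) have been collected into one module; the random initialization and the convergence test on the responsibilities are my own simplifications, not from the slides.

```python
# Sketch of the overall VB-EM cycle for the Bayesian GMM.
import numpy as np

def fit_vb_gmm(X, K, alpha0=1e-3, beta0=1.0, nu0=None, n_iter=200, tol=1e-6):
    N, D = X.shape
    nu0 = D if nu0 is None else nu0
    m0, W0 = X.mean(axis=0), np.eye(D)
    # crude initialization: random responsibilities
    r = np.random.default_rng(0).dirichlet(np.ones(K), size=N)
    for _ in range(n_iter):
        Nk, xbar, Sk = compute_statistics(X, r)                    # M-step, part 1
        alpha, beta, m, nu, W = update_hyperparameters(
            Nk, xbar, Sk, alpha0, beta0, m0, W0, nu0)              # M-step, part 2
        r_new = vb_e_step(X, alpha, beta, m, W, nu)                # E-step
        if np.max(np.abs(r_new - r)) < tol:
            r = r_new
            break
        r = r_new
    return r, (alpha, beta, m, nu, W)
```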

Variational Bayesian GMM

[Figure: variational Bayesian GMM fitted to the Old Faithful data set, starting with K=6 components. After convergence only two components remain; the density of red ink inside each ellipse shows the mean value of its mixing coefficient.]

Similarity of Variational Bayes and EM

•  There is a close similarity between the variational solution for the Bayesian mixture of Gaussians and the EM algorithm for maximum likelihood

•  In the limit N → ∞, the Bayesian treatment converges to maximum likelihood EM

•  The variational algorithm is more expensive, but the problem of singularities is eliminated

Variational Lower Bound

•  We can straightforwardly evaluate the lower bound L(q) for this model

•  Recall

    \ln p(X) = \mathcal{L}(q) + \mathrm{KL}(q \| p), \quad \text{where} \quad
    \mathcal{L}(q) = \int q(Z) \ln\left\{\frac{p(X,Z)}{q(Z)}\right\} dZ
    \quad \text{and} \quad
    \mathrm{KL}(q \| p) = -\int q(Z) \ln\left\{\frac{p(Z \mid X)}{q(Z)}\right\} dZ

•  The lower bound is used to monitor the re-estimation and to test for convergence

Predictive Density

•  In using a Bayesian GMM we will be interested in the predictive density for a new value x̂ of the observed variable

•  Assuming a corresponding latent variable ẑ, we can show that

    p(\hat{x} \mid X) = \frac{1}{\hat{\alpha}} \sum_{k=1}^{K} \alpha_k\, \mathrm{St}\!\left(\hat{x} \mid m_k, L_k, \nu_k + 1 - D\right)

    where the kth component has mean mk and precision

    L_k = \frac{(\nu_k + 1 - D)\,\beta_k}{1 + \beta_k}\, W_k

    –  The mixture of Student's t distributions becomes a GMM as N → ∞
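A minimal sketch of this predictive density using scipy.stats.multivariate_t (available in scipy >= 1.6); note that scipy's `shape` argument is a scale matrix, i.e. the inverse of the precision Lk above, and the hyperparameter names follow the sketches earlier in this document.

```python
# Sketch: p(x_hat | X) as a mixture of Student's t densities.
import numpy as np
from scipy.stats import multivariate_t

def predictive_density(x_hat, alpha, beta, m, W, nu):
    D = m.shape[1]
    weights = alpha / alpha.sum()                     # alpha_k / alpha_hat
    p = 0.0
    for k in range(len(alpha)):
        df = nu[k] + 1 - D                            # degrees of freedom
        L = (df * beta[k] / (1 + beta[k])) * W[k]     # precision of component k
        p += weights[k] * multivariate_t.pdf(x_hat, loc=m[k],
                                             shape=np.linalg.inv(L), df=df)
    return p
```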

Determining no. of components

•  Plot of the variational lower bound L versus the number of components K
    –  Distinct peak at K=2
    –  For each K the model is trained from 100 different starts; results are shown as "+"

[Figure: variational lower bound plotted against K, showing a distinct peak at K=2]
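In practice this behaviour can be reproduced with scikit-learn's BayesianGaussianMixture, which implements a variational treatment of the GMM; the synthetic two-cluster data and all parameter values below are illustrative only, roughly mirroring the Old Faithful experiment rather than reproducing it.

```python
# Sketch: a variational Bayesian GMM pruning surplus components.
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([-2, 0], 0.5, size=(200, 2)),
               rng.normal([2, 1], 0.5, size=(200, 2))])   # two true clusters

bgmm = BayesianGaussianMixture(
    n_components=6,                                   # start with K=6
    weight_concentration_prior_type="dirichlet_distribution",
    weight_concentration_prior=0.1,                   # small alpha0 favours sparsity
    n_init=5, max_iter=500, random_state=0).fit(X)

print(np.round(bgmm.weights_, 3))    # most mixing coefficients collapse toward 0
print(bgmm.lower_bound_)             # variational lower bound, used to monitor convergence
```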