
Variational Mixture of Gaussians

Sargur Srihari srihari@cedar.buffalo.edu

Objective

•  Apply the variational inference machinery to Gaussian Mixture Models

•  Demonstrate how a Bayesian treatment elegantly resolves the difficulties of maximum likelihood

•  Many more complex distributions can be handled by straightforward extensions of this analysis

Graphical Model for GMM

•  Graphical model corresponding to the likelihood function of a standard GMM

•  For each observation xn we have a corresponding latent variable zn
   –  A 1-of-K binary vector with elements znk for k=1,..,K

•  Denote the observed data by X={x1,..,xN} and the latent variables by Z={z1,..,zN}

[Figure: plate notation and the equivalent expanded network — a directed acyclic graph representing the mixture]

Likelihood Function for GMM

•  Mixture density: since z takes the values {zk} with probabilities πk,

    p(x) = \sum_{k=1}^{K} \pi_k \mathcal{N}(x \mid \mu_k, \Sigma_k)

•  Therefore the likelihood function is

    p(X \mid \pi, \mu, \Sigma) = \prod_{n=1}^{N} \left\{ \sum_{k=1}^{K} \pi_k \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right\}

    where the product is over the N i.i.d. samples

•  Therefore the log-likelihood function is

    \ln p(X \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln \left\{ \sum_{k=1}^{K} \pi_k \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right\}

•  Find the parameters π, µ and Σ that maximize the log-likelihood
   –  A more difficult problem than for a single Gaussian
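As a concrete illustration of the log-likelihood above, here is a minimal NumPy/SciPy sketch; the function name and the use of log-sum-exp for numerical stability are my own choices, not part of the slides.

```python
# Sketch: evaluating ln p(X | pi, mu, Sigma) for a GMM.
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def gmm_log_likelihood(X, pis, mus, Sigmas):
    """ln p(X) = sum_n ln sum_k pi_k N(x_n | mu_k, Sigma_k)."""
    N, K = X.shape[0], len(pis)
    log_terms = np.empty((N, K))
    for k in range(K):
        # log of pi_k * N(x_n | mu_k, Sigma_k) for every n
        log_terms[:, k] = np.log(pis[k]) + multivariate_normal.logpdf(X, mus[k], Sigmas[k])
    # log-sum-exp over components, then sum over the N i.i.d. samples
    return logsumexp(log_terms, axis=1).sum()
```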

GMM m.l.e. expressions

•  Obtained using derivatives of the log-likelihood

•  Not closed-form solutions for the parameters
   –  Since the responsibilities γ(znk) depend on those parameters in a complex way

    Means:                 \mu_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\, x_n

    Covariance matrices:   \Sigma_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\, (x_n - \mu_k)(x_n - \mu_k)^T

    Mixing coefficients:   \pi_k = \frac{N_k}{N}, \quad \text{where} \quad N_k = \sum_{n=1}^{N} \gamma(z_{nk})

•  All three are in terms of the responsibilities γ(znk)

EM for GMM

•  E step
   –  Use the current values of the parameters µk, Σk, πk to evaluate the posterior probabilities p(Z|X), i.e., the responsibilities γ(znk)

•  M step
   –  Use these responsibilities to re-estimate the means, covariances and mixing coefficients by maximizing the expectation of ln p(X,Z) with respect to p(Z|X)
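A minimal sketch of one EM cycle for the maximum-likelihood GMM, assuming data X of shape (N, D); the helper names and the small covariance regularizer are illustrative additions, not from the slides.

```python
# Sketch of the E and M steps for maximum-likelihood EM on a GMM.
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def e_step(X, pis, mus, Sigmas):
    """Responsibilities gamma(z_nk) = p(z_nk = 1 | x_n), shape (N, K)."""
    log_r = np.stack([np.log(pis[k]) + multivariate_normal.logpdf(X, mus[k], Sigmas[k])
                      for k in range(len(pis))], axis=1)
    return np.exp(log_r - logsumexp(log_r, axis=1, keepdims=True))

def m_step(X, gamma, reg=1e-6):
    """Re-estimate pi_k, mu_k, Sigma_k from the responsibilities."""
    N, D = X.shape
    Nk = gamma.sum(axis=0)                      # effective counts N_k
    pis = Nk / N
    mus = (gamma.T @ X) / Nk[:, None]
    Sigmas = []
    for k in range(len(Nk)):
        diff = X - mus[k]
        Sigmas.append((gamma[:, k, None] * diff).T @ diff / Nk[k] + reg * np.eye(D))
    return pis, mus, np.array(Sigmas)
```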

Graphical model for Bayesian GMM

•  To specify the model we need these conditional probabilities:
   1.  p(Z|π): conditional distribution of Z given the mixing coefficients
   2.  p(X|Z, µ, Λ): conditional distribution of the observed data given the latent variables and component parameters
   3.  p(π): distribution of the mixing coefficients
   4.  p(µ,Λ): prior governing the mean and precision of each component

[Figure: directed graphs for the GMM and the Bayesian GMM, with nodes for the mixing coefficients, means and precisions]

Conditional Distribution Expressions

1.  Conditional distribution of Z={z1,..,zN} given the mixing coefficients π
    –  Since the components are mutually exclusive (recall p(z) = \prod_{k=1}^{K} \pi_k^{z_k})

    p(Z \mid \pi) = \prod_{n=1}^{N} \prod_{k=1}^{K} \pi_k^{z_{nk}}

2.  Conditional distribution of the observed data X={x1,..,xN} given the latent variables and component parameters
    –  Since the components are Gaussian (recall p(x \mid z) = \prod_{k=1}^{K} \mathcal{N}(x \mid \mu_k, \Sigma_k)^{z_k})

    p(X \mid Z, \mu, \Lambda) = \prod_{n=1}^{N} \prod_{k=1}^{K} \mathcal{N}(x_n \mid \mu_k, \Lambda_k^{-1})^{z_{nk}}

    –  where µ = {µk} and Λ = {Λk}
    •  Use of the precision matrix simplifies the further analysis
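To make the two factors concrete, here is a small sketch of the complete-data log joint ln p(Z|π) + ln p(X|Z,µ,Λ) for given one-hot assignments; the names Z_onehot and Lambdas are illustrative only and not from the slides.

```python
# Sketch: complete-data log probability with precision matrices Lambda_k.
import numpy as np
from scipy.stats import multivariate_normal

def complete_data_log_prob(X, Z_onehot, pis, mus, Lambdas):
    """ln p(Z|pi) + ln p(X|Z,mu,Lambda) for one-hot assignments Z."""
    total = 0.0
    for k in range(len(pis)):
        members = Z_onehot[:, k].astype(bool)          # points with z_nk = 1
        total += members.sum() * np.log(pis[k])
        if members.any():
            total += multivariate_normal.logpdf(
                X[members], mean=mus[k], cov=np.linalg.inv(Lambdas[k])).sum()
    return total
```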

Parameter Priors: Mixing Coefficients

3.  Distribution of the mixing coefficients p(π)
    •  Conjugate priors simplify the analysis
    •  Dirichlet distribution over π

    p(\pi) = \mathrm{Dir}(\pi \mid \alpha_0) = C(\alpha_0) \prod_{k=1}^{K} \pi_k^{\alpha_0 - 1}

    –  We have chosen the same parameter α0 for each of the components
    –  C(α0) is the normalization constant of the Dirichlet distribution

Parameter Priors: Mean, Precision

4.  Distribution of the mean and precision of the Gaussian components
    –  The Gaussian-Wishart prior is

    p(\mu, \Lambda) = p(\mu \mid \Lambda)\, p(\Lambda) = \prod_{k=1}^{K} \mathcal{N}\!\left(\mu_k \mid m_0, (\beta_0 \Lambda_k)^{-1}\right) \mathcal{W}(\Lambda_k \mid W_0, \nu_0)

    –  which is the conjugate prior when both the mean and precision are unknown

•  The resulting model has a link between Λ and µ
    –  Due to the distribution p(µ,Λ) in (4) above
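A small sketch of what this prior looks like in practice: drawing one (µk, Λk) pair from the Gaussian-Wishart. The hyperparameter values below are arbitrary illustrations, not taken from the slides.

```python
# Sketch: sampling from N(mu | m0, (beta0*Lambda)^-1) W(Lambda | W0, nu0).
import numpy as np
from scipy.stats import wishart, multivariate_normal

D = 2
m0, beta0 = np.zeros(D), 1.0
W0, nu0 = np.eye(D), D + 2            # nu0 >= D gives a proper Wishart

Lambda_k = wishart.rvs(df=nu0, scale=W0)                    # precision sample
mu_k = multivariate_normal.rvs(mean=m0, cov=np.linalg.inv(beta0 * Lambda_k))
```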

Bayesian Network for Bayesian GMM

•  Joint distribution of all random variables:

    p(X, Z, \pi, \mu, \Lambda) = p(X \mid Z, \mu, \Lambda)\, p(Z \mid \pi)\, p(\pi)\, p(\mu \mid \Lambda)\, p(\Lambda)

    –  All the factors were given earlier
    –  Only X={x1,..,xN} are observed

•  This BN provides a nice distinction between latent variables and parameters
    –  Variables such as zn that appear inside the plate are latent variables
       •  The number of such variables grows with the data set
    –  Variables outside the plate (the means, precisions and mixing coefficients) are parameters
       •  Fixed in number, independent of the size of the data set
    –  From the viewpoint of PGMs there is no fundamental difference

The variational approach

•  Recall the GMM:

    p(x) = \sum_{z} p(z)\, p(x \mid z) = \sum_{k=1}^{K} \pi_k \mathcal{N}(x \mid \mu_k, \Sigma_k)

    –  Here p(z) has parameter π, which now itself has a distribution p(π)

•  The EM approach:
    1.  Evaluation of the posterior distribution p(Z|X)
    2.  Evaluation of the expectation of ln p(X,Z) with respect to p(Z|X)

•  Our goal is to specify the variational distribution q(Z,π,µ,Λ), which will approximate p(Z,π,µ,Λ|X)
    –  Recall

    \ln p(X) = \mathcal{L}(q) + \mathrm{KL}(q \| p), \quad \text{where} \quad
    \mathcal{L}(q) = \int q(Z) \ln\left\{\frac{p(X,Z)}{q(Z)}\right\} dZ
    \quad \text{and} \quad
    \mathrm{KL}(q \| p) = -\int q(Z) \ln\left\{\frac{p(Z \mid X)}{q(Z)}\right\} dZ

Variational Distribution

•  In variational inference we specify q by using a factorized distribution

    q(Z) = \prod_{i=1}^{M} q_i(Z_i)

    (subscripts on the q's are omitted)

•  For the Bayesian GMM the latent variables and parameters are Z, π, µ and Λ
    •  So we consider the variational distribution

    q(Z, \pi, \mu, \Lambda) = q(Z)\, q(\pi, \mu, \Lambda)

    –  Remarkably, this is the only assumption needed for a tractable solution to a Bayesian mixture model

•  The functional forms of both q(Z) and q(π,µ,Λ) are determined automatically by optimizing the variational distribution

Sequential update equations

•  Using the general result for factorized distributions
    –  When L(q) is defined as

    \mathcal{L}(q) = \int q(Z) \ln\left\{\frac{p(X,Z)}{q(Z)}\right\} dZ = \int \prod_i q_i \left\{\ln p(X,Z) - \sum_i \ln q_i\right\} dZ

    –  the q that makes the functional L(q) largest is

    \ln q_j^*(Z_j) = \mathbb{E}_{i \neq j}\left[\ln p(X,Z)\right] + \text{const}

•  For the Bayesian GMM the log of the optimized factor is

    \ln q^*(Z) = \mathbb{E}_{\pi,\mu,\Lambda}\left[\ln p(X, Z, \pi, \mu, \Lambda)\right] + \text{const}

•  Since p(X, Z, \pi, \mu, \Lambda) = p(X \mid Z, \mu, \Lambda)\, p(Z \mid \pi)\, p(\pi)\, p(\mu \mid \Lambda)\, p(\Lambda), we have

    \ln q^*(Z) = \mathbb{E}_{\pi}\left[\ln p(Z \mid \pi)\right] + \mathbb{E}_{\mu,\Lambda}\left[\ln p(X \mid Z, \mu, \Lambda)\right] + \text{const}

    –  Note: the expectations are just weighted sums

Simplification of q*(Z)

•  Expression for the factor q*(Z):

    \ln q^*(Z) = \mathbb{E}_{\pi}\left[\ln p(Z \mid \pi)\right] + \mathbb{E}_{\mu,\Lambda}\left[\ln p(X \mid Z, \mu, \Lambda)\right] + \text{const}

•  Absorbing terms not depending on Z into the constant:

    \ln q^*(Z) = \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk} \ln \rho_{nk} + \text{const}

    where

    \ln \rho_{nk} = \mathbb{E}[\ln \pi_k] + \frac{1}{2}\mathbb{E}[\ln |\Lambda_k|] - \frac{D}{2}\ln(2\pi) - \frac{1}{2}\mathbb{E}_{\mu_k,\Lambda_k}\left[(x_n - \mu_k)^T \Lambda_k (x_n - \mu_k)\right]

    and D is the dimensionality of the data variable x

•  Taking exponentials on both sides:

    q^*(Z) \propto \prod_{n=1}^{N} \prod_{k=1}^{K} \rho_{nk}^{z_{nk}}

•  The normalized distribution is

    q^*(Z) = \prod_{n=1}^{N} \prod_{k=1}^{K} r_{nk}^{z_{nk}} \quad \text{where} \quad r_{nk} = \frac{\rho_{nk}}{\sum_{j=1}^{K} \rho_{nj}}

    –  The rnk are positive since the ρnk are exponentials of real numbers, and they sum to one as required
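In code the normalization is usually done in log space so that the exponentials cannot underflow; a minimal sketch, assuming log_rho is an (N, K) array of the ln ρnk values.

```python
# Sketch: r_nk = rho_nk / sum_j rho_nj, computed stably via log-sum-exp.
import numpy as np
from scipy.special import logsumexp

def normalize_responsibilities(log_rho):
    """Turn ln rho_nk into responsibilities r_nk that sum to one over k."""
    return np.exp(log_rho - logsumexp(log_rho, axis=1, keepdims=True))
```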

Factor q*(Z) has same form as prior

•  The normalized distribution

    q^*(Z) = \prod_{n=1}^{N} \prod_{k=1}^{K} r_{nk}^{z_{nk}}

    has the same form as the prior

    p(Z \mid \pi) = \prod_{n=1}^{N} \prod_{k=1}^{K} \pi_k^{z_{nk}}

•  We have found the form of q* that maximizes the functional L(q)

•  The distribution q*(Z) is discrete and has the standard result E[znk] = rnk
    –  which play the role of responsibilities

•  Since the equations for q*(Z) depend on moments of the other variables, they are coupled and must be solved iteratively

Variational EM

•  Variational E-step: determine the responsibilities rnk

•  Variational M-step:
    1.  Determine the statistics of the data set

        N_k = \sum_{n=1}^{N} r_{nk}    (effective number of points assigned to component k)

        \bar{x}_k = \frac{1}{N_k} \sum_{n=1}^{N} r_{nk}\, x_n    (weighted mean of component k)

        S_k = \frac{1}{N_k} \sum_{n=1}^{N} r_{nk}\, (x_n - \bar{x}_k)(x_n - \bar{x}_k)^T    (weighted covariance matrix of component k)

    2.  Find the optimal solution for the factor q(π,µ,Λ)
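A minimal sketch of step 1, computing Nk, x̄k and Sk from the responsibilities; the small epsilon that guards against empty components is my addition, not in the slides.

```python
# Sketch: the variational M-step statistics N_k, xbar_k, S_k.
import numpy as np

def compute_statistics(X, r, eps=1e-10):
    N, D = X.shape
    Nk = r.sum(axis=0) + eps                    # N_k = sum_n r_nk
    xbar = (r.T @ X) / Nk[:, None]              # xbar_k = (1/N_k) sum_n r_nk x_n
    Sk = np.empty((r.shape[1], D, D))
    for k in range(r.shape[1]):
        diff = X - xbar[k]
        Sk[k] = (r[:, k, None] * diff).T @ diff / Nk[k]   # weighted covariance S_k
    return Nk, xbar, Sk
```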

Factorization of q(π,µ,Λ)

•  Using the general result for factorized distributions

    \ln q_j^*(Z_j) = \mathbb{E}_{i \neq j}\left[\ln p(X,Z)\right] + \text{const}

    we can write

    \ln q^*(\pi, \mu, \Lambda) = \ln p(\pi) + \sum_{k=1}^{K} \ln p(\mu_k, \Lambda_k) + \mathbb{E}_Z\left[\ln p(Z \mid \pi)\right] + \sum_{n=1}^{N} \sum_{k=1}^{K} \mathbb{E}[z_{nk}] \ln \mathcal{N}(x_n \mid \mu_k, \Lambda_k^{-1}) + \text{const}

•  This decomposes into terms involving only π and terms involving only µ and Λ
    –  The terms involving µ and Λ comprise a sum of terms involving µk and Λk, leading to the factorization

    q(\pi, \mu, \Lambda) = q(\pi) \prod_{k=1}^{K} q(\mu_k, \Lambda_k)

Factor q(π) is a Dirichlet

•  Given the factorization q(\pi, \mu, \Lambda) = q(\pi) \prod_{k=1}^{K} q(\mu_k, \Lambda_k), consider each factor in turn: q(π) and q(µk,Λk)

•  (2a) Identifying the terms depending on π, q(π) has the solution

    \ln q^*(\pi) = (\alpha_0 - 1) \sum_{k=1}^{K} \ln \pi_k + \sum_{k=1}^{K} \sum_{n=1}^{N} r_{nk} \ln \pi_k + \text{const}

•  Taking the exponential of both sides, q*(π) is a Dirichlet:

    q^*(\pi) = \mathrm{Dir}(\pi \mid \alpha) \quad \text{where } \alpha \text{ has components } \alpha_k = \alpha_0 + N_k

•  Recall the Dirichlet distribution:

    \mathrm{Dir}(\pi \mid \alpha) = \frac{\Gamma(\hat{\alpha})}{\Gamma(\alpha_1)\cdots\Gamma(\alpha_K)} \prod_{k=1}^{K} \pi_k^{\alpha_k - 1} \quad \text{where} \quad \hat{\alpha} = \sum_{k=1}^{K} \alpha_k

[Figure: a Dirichlet density for K=3 with αk=0.1]

Factor q*(µk,Λk) is a Gaussian-Wishart

•  (2b) The variational posterior q*(µk,Λk)
    –  Does not factorize further into marginals
    –  It is a Gaussian-Wishart distribution:

    q^*(\mu_k, \Lambda_k) = \mathcal{N}\!\left(\mu_k \mid m_k, (\beta_k \Lambda_k)^{-1}\right) \mathcal{W}(\Lambda_k \mid W_k, \nu_k)

•  W is the Wishart distribution. It has the form

    \mathcal{W}(\Lambda \mid W, \nu) = B(W, \nu)\, |\Lambda|^{(\nu - D - 1)/2} \exp\!\left(-\tfrac{1}{2}\mathrm{Tr}(W^{-1}\Lambda)\right)

    where ν is the number of degrees of freedom, W is a D x D scale matrix, Tr is the trace and B(W,ν) is a normalization constant

•  It is the conjugate prior for a Gaussian with known mean and unknown precision matrix Λ

Parameters of q*(µk,Λk)

•  The Gaussian-Wishart factor is

    q^*(\mu_k, \Lambda_k) = \mathcal{N}\!\left(\mu_k \mid m_k, (\beta_k \Lambda_k)^{-1}\right) \mathcal{W}(\Lambda_k \mid W_k, \nu_k)

    where we have defined

    \beta_k = \beta_0 + N_k

    m_k = \frac{1}{\beta_k}\left(\beta_0 m_0 + N_k \bar{x}_k\right)

    W_k^{-1} = W_0^{-1} + N_k S_k + \frac{\beta_0 N_k}{\beta_0 + N_k}\left(\bar{x}_k - m_0\right)\left(\bar{x}_k - m_0\right)^T

    \nu_k = \nu_0 + N_k

•  These update equations are analogous to the M-step of EM for the maximum likelihood solution of the GMM
    –  They involve evaluation of the same sums over the data set as EM
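A minimal sketch of these updates, including the Dirichlet update αk = α0 + Nk from the previous slide; the function name and argument ordering are illustrative choices, not from the slides.

```python
# Sketch: variational M-step updates for alpha_k, beta_k, m_k, W_k, nu_k.
import numpy as np

def update_hyperparameters(Nk, xbar, Sk, alpha0, beta0, m0, W0, nu0):
    alpha = alpha0 + Nk                               # Dirichlet: alpha_k = alpha0 + N_k
    beta = beta0 + Nk                                 # beta_k = beta0 + N_k
    m = (beta0 * m0 + Nk[:, None] * xbar) / beta[:, None]
    nu = nu0 + Nk                                     # nu_k = nu0 + N_k
    K, D = xbar.shape
    W = np.empty((K, D, D))
    W0_inv = np.linalg.inv(W0)
    for k in range(K):
        diff = (xbar[k] - m0)[:, None]                # column vector xbar_k - m0
        Wk_inv = (W0_inv + Nk[k] * Sk[k]
                  + (beta0 * Nk[k] / (beta0 + Nk[k])) * diff @ diff.T)
        W[k] = np.linalg.inv(Wk_inv)
    return alpha, beta, m, nu, W
```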

Expression for Responsibilities

•  For the M step we need the expectations E[znk] = rnk
    –  which are obtained by normalizing the ρnk: r_{nk} = \rho_{nk} \big/ \sum_{j=1}^{K} \rho_{nj}

•  Recall

    \ln \rho_{nk} = \mathbb{E}[\ln \pi_k] + \frac{1}{2}\mathbb{E}[\ln |\Lambda_k|] - \frac{D}{2}\ln(2\pi) - \frac{1}{2}\mathbb{E}_{\mu_k,\Lambda_k}\left[(x_n - \mu_k)^T \Lambda_k (x_n - \mu_k)\right]

•  The three expectations with respect to the variational distribution of the parameters are easily evaluated to give

    \ln \tilde{\pi}_k \equiv \mathbb{E}[\ln \pi_k] = \psi(\alpha_k) - \psi(\hat{\alpha}) \quad \text{where} \quad \hat{\alpha} = \sum_k \alpha_k

    \ln \tilde{\Lambda}_k \equiv \mathbb{E}[\ln |\Lambda_k|] = \sum_{i=1}^{D} \psi\!\left(\frac{\nu_k + 1 - i}{2}\right) + D \ln 2 + \ln |W_k|

    \mathbb{E}_{\mu_k,\Lambda_k}\left[(x_n - \mu_k)^T \Lambda_k (x_n - \mu_k)\right] = D\beta_k^{-1} + \nu_k (x_n - m_k)^T W_k (x_n - m_k)

    –  ψ is the digamma function, \psi(a) = \frac{d}{da}\ln\Gamma(a); it also appears in the normalization of the Dirichlet
    –  νk is the number of degrees of freedom of the Wishart

Evaluation of Responsibilities

•  Substituting the three expectations into ln ρnk gives

    r_{nk} \propto \tilde{\pi}_k\, \tilde{\Lambda}_k^{1/2} \exp\left\{-\frac{D}{2\beta_k} - \frac{\nu_k}{2}(x_n - m_k)^T W_k (x_n - m_k)\right\}

•  This is similar to the responsibilities in maximum likelihood EM:

    \gamma(z_k) \equiv p(z_k = 1 \mid x) = \frac{p(z_k = 1)\, p(x \mid z_k = 1)}{\sum_{j=1}^{K} p(z_j = 1)\, p(x \mid z_j = 1)} = \frac{\pi_k \mathcal{N}(x \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \mathcal{N}(x \mid \mu_j, \Sigma_j)}

    which can be written in the form

    r_{nk} \propto \pi_k\, |\Lambda_k|^{1/2} \exp\left\{-\frac{1}{2}(x_n - \mu_k)^T \Lambda_k (x_n - \mu_k)\right\}

    –  where we have used the precision Λk instead of the covariance Σk to highlight the similarity
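Putting the three expectations and the normalization together gives the variational E-step; a minimal, self-contained sketch, where the function name and argument layout (alpha, beta, m, W, nu for the current variational hyperparameters) are my own choices.

```python
# Sketch of the variational E-step: build ln rho_nk and normalize to r_nk.
import numpy as np
from scipy.special import digamma, logsumexp

def vb_e_step(X, alpha, beta, m, W, nu):
    """Responsibilities r_nk under the current variational posterior."""
    N, D = X.shape
    K = len(alpha)
    ln_pi = digamma(alpha) - digamma(alpha.sum())                 # E[ln pi_k]
    i = np.arange(1, D + 1)
    ln_Lambda = (digamma((nu[:, None] + 1 - i) / 2).sum(axis=1)   # E[ln |Lambda_k|]
                 + D * np.log(2) + np.linalg.slogdet(W)[1])
    log_rho = np.empty((N, K))
    for k in range(K):
        diff = X - m[k]
        quad = D / beta[k] + nu[k] * np.einsum('nd,de,ne->n', diff, W[k], diff)
        log_rho[:, k] = (ln_pi[k] + 0.5 * ln_Lambda[k]
                         - 0.5 * D * np.log(2 * np.pi) - 0.5 * quad)
    # normalize in log space to obtain r_nk
    return np.exp(log_rho - logsumexp(log_rho, axis=1, keepdims=True))
```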

Summary of Optimization

•  Optimization of the variational posterior distribution involves cycling between two stages
    –  Analogous to the E and M steps of maximum likelihood EM

•  Variational E-step:
    –  Use the current distribution over the model parameters to evaluate the moments and hence evaluate E[znk] = rnk

•  Variational M-step:
    –  Keep the responsibilities fixed and use them to recompute the variational distribution over the parameters, using

    q^*(\pi) = \mathrm{Dir}(\pi \mid \alpha) \quad \text{and} \quad q^*(\mu_k, \Lambda_k) = \mathcal{N}\!\left(\mu_k \mid m_k, (\beta_k \Lambda_k)^{-1}\right) \mathcal{W}(\Lambda_k \mid W_k, \nu_k)
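A minimal driver that cycles between the two stages, assuming the sketches above (vb_e_step, compute_statistics, update_hyperparameters) have been collected into one module; the random initialization and the convergence test on the responsibilities are my own simplifications, not from the slides.

```python
# Sketch of the overall VB-EM cycle for the Bayesian GMM.
import numpy as np

def fit_vb_gmm(X, K, alpha0=1e-3, beta0=1.0, nu0=None, n_iter=200, tol=1e-6):
    N, D = X.shape
    nu0 = D if nu0 is None else nu0
    m0, W0 = X.mean(axis=0), np.eye(D)
    # crude initialization: random responsibilities
    r = np.random.default_rng(0).dirichlet(np.ones(K), size=N)
    for _ in range(n_iter):
        Nk, xbar, Sk = compute_statistics(X, r)                    # M-step, part 1
        alpha, beta, m, nu, W = update_hyperparameters(
            Nk, xbar, Sk, alpha0, beta0, m0, W0, nu0)              # M-step, part 2
        r_new = vb_e_step(X, alpha, beta, m, W, nu)                # E-step
        if np.max(np.abs(r_new - r)) < tol:
            r = r_new
            break
        r = r_new
    return r, (alpha, beta, m, nu, W)
```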

Variational Bayesian GMM

[Figure: variational Bayesian GMM fitted to the Old Faithful data set, starting with K=6 components. After convergence only two components remain; the density of red ink inside each ellipse shows the mean value of its mixing coefficient.]

Similarity of Variational Bayes and EM

•  There is a close similarity between the variational solution for the Bayesian mixture of Gaussians and the EM algorithm for maximum likelihood

•  In the limit N → ∞, the Bayesian treatment converges to maximum likelihood EM

•  The variational algorithm is more expensive, but the problem of singularities is eliminated

Variational Lower Bound

•  We can straightforwardly evaluate the lower bound L(q) for this model

•  Recall

    \ln p(X) = \mathcal{L}(q) + \mathrm{KL}(q \| p), \quad \text{where} \quad
    \mathcal{L}(q) = \int q(Z) \ln\left\{\frac{p(X,Z)}{q(Z)}\right\} dZ
    \quad \text{and} \quad
    \mathrm{KL}(q \| p) = -\int q(Z) \ln\left\{\frac{p(Z \mid X)}{q(Z)}\right\} dZ

•  The lower bound is used to monitor the re-estimation and to test for convergence

Predictive Density

•  In using a Bayesian GMM we will be interested in the predictive density for a new value x̂ of the observed variable

•  Assuming a corresponding latent variable ẑ, we can show that

    p(\hat{x} \mid X) = \frac{1}{\hat{\alpha}} \sum_{k=1}^{K} \alpha_k\, \mathrm{St}\!\left(\hat{x} \mid m_k, L_k, \nu_k + 1 - D\right)

    where the kth component has mean mk and precision

    L_k = \frac{(\nu_k + 1 - D)\,\beta_k}{1 + \beta_k}\, W_k

    –  The mixture of Student's t distributions becomes a GMM as N → ∞
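A minimal sketch of this predictive density using scipy.stats.multivariate_t (available in scipy >= 1.6); note that scipy's `shape` argument is a scale matrix, i.e. the inverse of the precision Lk above, and the hyperparameter names follow the sketches earlier in this document.

```python
# Sketch: p(x_hat | X) as a mixture of Student's t densities.
import numpy as np
from scipy.stats import multivariate_t

def predictive_density(x_hat, alpha, beta, m, W, nu):
    D = m.shape[1]
    weights = alpha / alpha.sum()                     # alpha_k / alpha_hat
    p = 0.0
    for k in range(len(alpha)):
        df = nu[k] + 1 - D                            # degrees of freedom
        L = (df * beta[k] / (1 + beta[k])) * W[k]     # precision of component k
        p += weights[k] * multivariate_t.pdf(x_hat, loc=m[k],
                                             shape=np.linalg.inv(L), df=df)
    return p
```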

Determining no. of components

•  Plot of the variational lower bound L versus the number of components K
    –  Distinct peak at K=2
    –  For each K the model is trained from 100 different starts; results are shown as "+"

[Figure: variational lower bound plotted against K, showing a distinct peak at K=2]
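In practice this behaviour can be reproduced with scikit-learn's BayesianGaussianMixture, which implements a variational treatment of the GMM; the synthetic two-cluster data and all parameter values below are illustrative only, roughly mirroring the Old Faithful experiment rather than reproducing it.

```python
# Sketch: a variational Bayesian GMM pruning surplus components.
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([-2, 0], 0.5, size=(200, 2)),
               rng.normal([2, 1], 0.5, size=(200, 2))])   # two true clusters

bgmm = BayesianGaussianMixture(
    n_components=6,                                   # start with K=6
    weight_concentration_prior_type="dirichlet_distribution",
    weight_concentration_prior=0.1,                   # small alpha0 favours sparsity
    n_init=5, max_iter=500, random_state=0).fit(X)

print(np.round(bgmm.weights_, 3))    # most mixing coefficients collapse toward 0
print(bgmm.lower_bound_)             # variational lower bound, used to monitor convergence
```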