Topic Model (Text Mining)
Yueshen Xu, [email protected]
Middleware, CCNT, ZJU
Text Mining & NLP & ML
6/11/2014
Outline
Basic Concepts
Application and Background
Famous Researchers
Language Model
Vector Space Model (VSM)
Term Frequency-Inverse Document Frequency (TF-IDF)
Latent Semantic Indexing (LSI)
Probabilistic Latent Semantic Indexing (pLSA)
Expectation-Maximization Algorithm (EM) & Maximum-Likelihood Estimation (MLE)
Outline
Latent Dirichlet Allocation (LDA)
Conjugate Prior
Poisson Distribution
Variational Distribution and Variational Inference (VD & VI)
Markov Chain Monte Carlo (MCMC)
Metropolis-Hastings Sampling (MH)
Gibbs Sampling and GS for LDA
Bayesian Theory vs. Probability Theory
Concepts
Latent Semantic Analysis
Topic Model
Text Mining
Natural Language Processing
Computational Linguistics
Information Retrieval
Dimension Reduction
Expectation-Maximization (EM)
[Diagram: a Venn-style map placing LSA/Topic Model at the intersection of Information Retrieval, Computational Linguistics, Natural Language Processing, Text Mining, Data Mining, Machine Learning (EM, Dimension Reduction), and Machine Translation]
Aim: find the topic that a word or a document belongs to: the Latent Factor Model (LFM)
Application
LFM has been a fundamental technique in modern search engines, recommender systems, tag extraction, blog clustering, Twitter topic mining, news (text) summarization, etc.
Search Engine
  PageRank: how important is this web page?
  LFM: how relevant is this web page? How relevant is the user's query to one document?
Other applications: Recommender System, Opinion Extraction, Spam Detection, Tag Extraction, Text Summarization, Abstract Generation, Twitter Topic Mining
Example text: "Steve Jobs had left us for about two years... Apple's price will fall down..."
Famous Researchers
David Blei, Princeton, LDA
ChengXiang Zhai, UIUC, Presidential Early Career Award
W. Bruce Croft, UMass, Language Model
Bing Liu, UIC, Opinion Mining
John D. Lafferty, CMU, CRF & IBM
Thomas Hofmann, Brown, pLSA
Andrew McCallum, UMass, CRF & IBM
Susan Dumais, Microsoft, LSI
Language Model
Unigram Language Model == Zero-order Markov Chain
Bigram Language Model == First-order Markov Chain
N-gram Language Model == (N-1)-order Markov Chain
Mixture-unigram Language Model
Unigram: $p(w \mid M) = \prod_{w_i \in s} p(w_i \mid M)$
Bigram: $p(w \mid M) = \prod_{w_i \in s} p(w_i \mid w_{i-1}, M)$
Bag of Words (BoW): no order, no grammar, only multiplicity
Mixture of unigrams: $p(d) = \sum_{z} p(z) \prod_{n=1}^{N} p(w_n \mid z)$
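As a concrete illustration of these factorizations, here is a minimal sketch (my own, not from the slides) of maximum-likelihood estimation for unigram and bigram models by counting; `train_bigram_lm` and the toy corpus are illustrative names:

```python
from collections import Counter, defaultdict

def train_bigram_lm(corpus):
    """MLE estimates for a unigram and a bigram language model.

    corpus: list of tokenized sentences, e.g. [["the", "cat"], ...]
    Returns (p_unigram, p_bigram) as plain dicts of probabilities.
    """
    unigrams = Counter()
    bigrams = defaultdict(Counter)
    for sent in corpus:
        tokens = ["<s>"] + sent  # sentence-start symbol for the bigram model
        unigrams.update(sent)
        for prev, cur in zip(tokens, tokens[1:]):
            bigrams[prev][cur] += 1
    total = sum(unigrams.values())
    p_uni = {w: c / total for w, c in unigrams.items()}
    p_bi = {prev: {w: c / sum(ctr.values()) for w, c in ctr.items()}
            for prev, ctr in bigrams.items()}
    return p_uni, p_bi

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
p_uni, p_bi = train_bigram_lm(corpus)
print(p_uni["the"])        # 2/6
print(p_bi["the"]["cat"])  # 1/2
```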
Vector Space Model
A document is represented as a vector of identifiers
Identifier
Boolean: 0 or 1
Term Count: how many times the term occurs
Term Frequency: how frequent the term is in this document
TF-IDF: how important the term is in the corpus (the most widely used)
Relevance Ranking
First used in SMART (Gerard Salton, Cornell; SIGIR's Gerard Salton Award is named after him)
$d_j = (w_{1j}, w_{2j}, \ldots, w_{tj})$, $\quad q = (w_{1q}, w_{2q}, \ldots, w_{tq})$
$\text{sim}(d_j, q) = \cos\theta = \dfrac{d_j \cdot q}{\lVert d_j \rVert \, \lVert q \rVert}$
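A minimal sketch of the cosine relevance score above, assuming numpy; the toy vectors are illustrative:

```python
import numpy as np

def cosine_similarity(d, q):
    """Cosine of the angle between a document vector and a query vector."""
    d, q = np.asarray(d, dtype=float), np.asarray(q, dtype=float)
    return d @ q / (np.linalg.norm(d) * np.linalg.norm(q))

# Toy 4-term vocabulary; the weights could be counts, TF, or TF-IDF.
d_j = [2.0, 0.0, 1.0, 3.0]
q = [1.0, 0.0, 0.0, 1.0]
print(cosine_similarity(d_j, q))  # ~0.945
```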
TF-IDF
Mixture language model: a linear combination of certain distributions (e.g., Gaussian) can give better performance
TF: Term Frequency (how important the term is in this document)
IDF: Inverse Document Frequency (how important the term is in this corpus)
$tf_{ij} = \dfrac{n_{ij}}{\sum_k n_{kj}}$ (term $i$, document $j$; $n_{ij}$: count of term $i$ in document $j$)
$idf_i = \log \dfrac{N}{1 + |\{ d \in D : t_i \in d \}|}$ ($N$: the number of documents in the corpus $D$)
$tfidf(t_i, d_j, D) = tf_{ij} \times idf_i$
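A minimal sketch of these TF-IDF formulas, using the slide's 1 + df smoothing in the IDF denominator; `tf_idf` and the toy documents are illustrative names:

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute tf-idf weights for a list of tokenized documents.

    tf_ij = n_ij / sum_k n_kj and idf_i = log(N / (1 + df_i)), as above.
    Returns a list of {term: weight} dicts, one per document.
    """
    N = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # document frequency of each term
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = sum(counts.values())
        weights.append({t: (c / total) * math.log(N / (1 + df[t]))
                        for t, c in counts.items()})
    return weights

docs = [["apple", "stock", "fall"], ["apple", "pie"], ["stock", "market"]]
print(tf_idf(docs)[0])
```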
Latent Semantic Indexing
Challenge (defects of VSM)
Compare documents in the same concept space
Compare documents across languages
Synonymy, e.g., buy - purchase, user - consumer
Polysemy, e.g., book (noun) - book (verb), draw (sketch) - draw (pull)
Key Idea
Dimensionality reduction of the word-document co-occurrence matrix
Construction of a latent semantic space
VSM: word → document; LSI: word → concept → document
(A concept is also called an aspect, topic, or latent factor)
Singular Value Decomposition
LSI ≈ truncated SVD
U, V: orthogonal matrices
$\Sigma$: the diagonal matrix of the singular values of N
Full decomposition: $N = U \Sigma V^T$, where N is the $t \times d$ term-document matrix (rows: terms, columns: documents; entries: counts, frequencies, or TF-IDF), U is $t \times m$, $\Sigma$ is $m \times m$, and $V^T$ is $m \times d$
Truncated decomposition: keep only the $k$ largest singular values ($k < m$, often $k \ll m$): $N \approx U_k \Sigma_k V_k^T$, with $U_k$ of size $t \times k$, $\Sigma_k$ of size $k \times k$, and $V_k^T$ of size $k \times d$
(Words are assumed exchangeable)
Singular Value Decomposition
The k largest singular values distinguish the variance between words and documents to the greatest extent
Discarding the lowest dimensions: reduces noise and enlarges the distinctiveness
Filling the matrix: enables prediction and lowers computational complexity
The decomposition yields concepts, i.e., semantics, topics (aspects)
Related: (probabilistic) matrix factorization; factorization models (SVD provides the analytic solution); unsupervised learning
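A minimal sketch of LSI as truncated SVD, assuming numpy; the toy matrix and k = 2 are illustrative:

```python
import numpy as np

# Toy 5-term x 4-document count matrix (rows: terms, columns: documents).
N = np.array([[2., 0., 1., 0.],
              [1., 1., 0., 0.],
              [0., 2., 0., 1.],
              [0., 0., 3., 1.],
              [1., 0., 1., 2.]])

U, s, Vt = np.linalg.svd(N, full_matrices=False)

k = 2  # number of latent concepts to keep
N_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]  # rank-k approximation of N

# Documents in the k-dimensional latent semantic space: columns of Sigma_k V_k^T.
doc_vectors = np.diag(s[:k]) @ Vt[:k, :]
print(np.round(N_k, 2))
print(np.round(doc_vectors, 2))
```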
Probabilistic Latent Semantic Indexing
pLSI Model
[Graphical model: documents d_1..d_M connect to latent topics z_1..z_K, which connect to words w_1..w_N, with probabilities p(d), p(z|d), p(w|z); p(z|d) is document-specific (local) while p(w|z) is corpus-wide (global); structurally one layer of a "deep neural network"]
Assumption
Pairs (d, w) are assumed to be generated independently
Conditioned on z, w is generated independently of d
Words in a document are exchangeable
Documents are exchangeable
Latent topics z are independent
Generative Process/Model
$p(d, w) = p(d)\, p(w \mid d) = p(d) \sum_{z \in Z} p(w, z \mid d) = p(d) \sum_{z \in Z} p(z \mid d)\, p(w \mid z)$
Both $p(z \mid d)$ and $p(w \mid z)$ are multinomial distributions
Probabilistic Latent Semantic Indexing
Two equivalent formulations (by Bayes' rule), which lead to two different inference processes:
Asymmetric: $p(w \mid d) = \sum_{z \in Z} p(z \mid d)\, p(w \mid z)$
Symmetric: $p(d, w) = \sum_{z \in Z} p(w, d \mid z)\, p(z) = \sum_{z \in Z} p(w \mid z)\, p(d \mid z)\, p(z)$
pLSA is a probabilistic graphical model: a directed acyclic graph (DAG); d is exchangeable
[Plate diagrams: d → z → w with plate N (words) nested in plate M (documents), one diagram per formulation]
Expectation-Maximization
EM is a general algorithm for maximum-likelihood estimation (MLE) where the data are "incomplete" or contain latent variables: pLSA, GMM, HMM, ... (cross-domain)
Deduction Process
$\theta$: the parameter to be estimated; $\theta^0$: initialized randomly; $\theta^n$: the current value; $\theta^{n+1}$: the next value
Objective: choose $\theta^{n+1}$ so that $L(\theta^{n+1}) \ge L(\theta^n)$
$L(\theta) = \log p(X \mid \theta)$; $\quad L_c(\theta) = \log p(X, H \mid \theta)$ ($H$: the latent variable)
$L_c(\theta) = \log p(X, H \mid \theta) = \log p(X \mid \theta) + \log p(H \mid X, \theta) = L(\theta) + \log p(H \mid X, \theta)$
$L(\theta) - L(\theta^n) = L_c(\theta) - L_c(\theta^n) - \log \dfrac{p(H \mid X, \theta)}{p(H \mid X, \theta^n)}$
Expectation-Maximization
Taking the expectation over $H$ with respect to $p(H \mid X, \theta^n)$:
$L(\theta) - L(\theta^n) = \sum_H p(H \mid X, \theta^n) L_c(\theta) - \sum_H p(H \mid X, \theta^n) L_c(\theta^n) + \sum_H p(H \mid X, \theta^n) \log \dfrac{p(H \mid X, \theta^n)}{p(H \mid X, \theta)}$
The last term is a Kullback-Leibler divergence (relative entropy), which is non-negative, so:
$L(\theta) - L(\theta^n) \ge \langle L_c(\theta) \rangle_{p(H \mid X, \theta^n)} - \langle L_c(\theta^n) \rangle_{p(H \mid X, \theta^n)}$ (a lower bound)
Q-function: $Q(\theta; \theta^n) = E_{p(H \mid X, \theta^n)}[L_c(\theta)] = \sum_H p(H \mid X, \theta^n) \log p(X, H \mid \theta)$
E-step (expectation): compute Q
M-step (maximization): re-estimate $\theta$ by maximizing Q; iterate until convergence
How is EM used in pLSA?
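Before turning to pLSA, here is a minimal sketch (an illustration, not from the slides) of EM on another model the slides mention, a two-component 1-D Gaussian mixture; `em_gmm_1d` is an illustrative name:

```python
import numpy as np

def em_gmm_1d(x, n_iter=50):
    """EM for a two-component 1-D Gaussian mixture."""
    rng = np.random.default_rng(0)
    mu = rng.choice(x, size=2)          # theta^0: random initialization
    pi, var = np.array([0.5, 0.5]), np.var(x) * np.ones(2)
    for _ in range(n_iter):
        # E-step: responsibilities r[i, k] = p(z = k | x_i, theta^n)
        dens = np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
        r = pi * dens
        r /= r.sum(axis=1, keepdims=True)
        # M-step: maximize Q by re-estimating the parameters
        nk = r.sum(axis=0)
        pi = nk / len(x)
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return pi, mu, var

x = np.concatenate([np.random.default_rng(1).normal(-2, 1, 200),
                    np.random.default_rng(2).normal(3, 1, 200)])
print(em_gmm_1d(x))  # mixing weights ~0.5/0.5, means near -2 and 3
```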
EM in pLSA
Log-likelihood to maximize: $L = \sum_i \sum_j n(d_i, w_j) \log p(d_i, w_j)$
Q-function (the posterior $p(z_k \mid d_i, w_j)$ is computed from the current parameters, which are random values at initialization):
$Q(\theta; \theta^n) = E_{p(H \mid X, \theta^n)}[L_c(\theta)] = \sum_{i=1}^{N} \sum_{j=1}^{M} n(d_i, w_j) \sum_{k=1}^{K} p(z_k \mid d_i, w_j) \log\left[ p(w_j \mid z_k)\, p(z_k \mid d_i) \right]$
Constraints:
1. $\sum_{j=1}^{M} p(w_j \mid z_k) = 1$
2. $\sum_{k=1}^{K} p(z_k \mid d_i) = 1$
Lagrange multipliers ($\tau_k$, $\rho_i$), with $p(w_j \mid z_k)$ and $p(z_k \mid d_i)$ as the independent variables:
$\mathcal{H} = E[L_c] + \sum_{k=1}^{K} \tau_k \Big(1 - \sum_{j=1}^{M} p(w_j \mid z_k)\Big) + \sum_{i=1}^{N} \rho_i \Big(1 - \sum_{k=1}^{K} p(z_k \mid d_i)\Big)$
Setting the partial derivatives to zero yields the M-step:
$p(w_j \mid z_k) = \dfrac{\sum_i n(d_i, w_j)\, p(z_k \mid d_i, w_j)}{\sum_m \sum_i n(d_i, w_m)\, p(z_k \mid d_i, w_m)}$
$p(z_k \mid d_i) = \dfrac{\sum_j n(d_i, w_j)\, p(z_k \mid d_i, w_j)}{n(d_i)}$
E-step (by Bayes' rule, using the associative and distributive laws):
$p(z_k \mid d_i, w_j) = \dfrac{p(w_j \mid z_k)\, p(z_k \mid d_i)}{\sum_{l=1}^{K} p(w_j \mid z_l)\, p(z_l \mid d_i)}$
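A minimal sketch of the E-step and M-step updates above, vectorized with numpy; `plsa_em` and the toy count matrix are illustrative:

```python
import numpy as np

def plsa_em(n_dw, K, n_iter=100):
    """EM for pLSA. n_dw: document-word count matrix (N docs x M words).

    Returns p(z|d) of shape (N, K) and p(w|z) of shape (K, M).
    """
    rng = np.random.default_rng(0)
    N, M = n_dw.shape
    p_z_d = rng.random((N, K)); p_z_d /= p_z_d.sum(1, keepdims=True)
    p_w_z = rng.random((K, M)); p_w_z /= p_w_z.sum(1, keepdims=True)
    for _ in range(n_iter):
        # E-step: p(z_k | d_i, w_j) for every (i, j), shape (N, M, K)
        post = p_z_d[:, None, :] * p_w_z.T[None, :, :]
        post /= post.sum(axis=2, keepdims=True) + 1e-12
        # M-step
        weighted = n_dw[:, :, None] * post          # n(d_i, w_j) p(z_k | d_i, w_j)
        p_w_z = weighted.sum(axis=0).T              # (K, M), unnormalized
        p_w_z /= p_w_z.sum(axis=1, keepdims=True)
        p_z_d = weighted.sum(axis=1)                # (N, K), unnormalized
        p_z_d /= n_dw.sum(axis=1, keepdims=True)    # divide by n(d_i)
    return p_z_d, p_w_z

n_dw = np.array([[5, 2, 0, 0], [4, 3, 0, 1], [0, 0, 6, 3], [1, 0, 4, 5]], float)
p_z_d, p_w_z = plsa_em(n_dw, K=2)
print(np.round(p_z_d, 2))
```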
Bayesian Theory vs. Probability Theory
Estimate $\theta$ through the posterior vs. estimate $\theta$ through the maximization of the likelihood
Bayesian theory relies on a prior vs. probability (frequentist) theory relies on statistics
When the number of samples → ∞, the Bayesian estimate coincides with the frequentist one
Parameter Estimation
$p(\theta \mid D) \propto p(D \mid \theta)\, p(\theta)$; what should $p(\theta)$ be? A conjugate prior of the likelihood is helpful, but its usefulness is limited. Otherwise?
Non-parametric Bayesian methods (complicated); kernel methods: I just know a little...
Progression: VSM → CF → MF → pLSA → LDA → non-parametric Bayesian → deep learning
Latent Dirichlet Allocation
Latent Dirichlet Allocation (LDA): David M. Blei, Andrew Y. Ng, Michael I. Jordan, Journal of Machine Learning Research, 2003; cited > 3000 (Blei later received the ACM-Infosys Foundation Award)
Hierarchical Bayesian model; Bayesian pLSI
[Plate diagram: α → θ → z → w ← β, with plate N (words) nested in plate M (documents)]
Generative process of a document d in a corpus according to LDA:
Choose N ~ Poisson(ξ); why?
For each document d = {w_1, w_2, ..., w_N}:
  Choose θ ~ Dir(α); why?
  For each of the N words w_n in d:
    a) Choose a topic z_n ~ Multinomial(θ); why?
    b) Choose a word w_n from p(w_n | z_n, β), a multinomial probability conditioned on z_n; why?
Latent Dirichlet Allocation
LDA (cont.)
[Plate diagram: α → θ → z → w, with φ ~ Dir(β) over the K topics]
Generative process of a document d in LDA, with the "why" answered:
Choose N ~ Poisson(ξ): not important
For each document d = {w_1, w_2, ..., w_N}:
  Choose θ ~ Dir(α); θ = (θ_1, θ_2, ..., θ_K), |θ| = K, K is fixed, $\sum_{k=1}^{K} \theta_k = 1$; the Dirichlet is the conjugate prior of the multinomial
  For each of the N words w_n in d:
    a) Choose a topic z_n ~ Multinomial(θ): one word, one topic; one document, multiple topics
    b) Choose a word w_n from p(w_n | z_n, β), a multinomial probability conditioned on z_n
θ = (θ_1, θ_2, ..., θ_K); z = (z_1, z_2, ..., z_K); for each word w_n there is a z_n
In pLSA the number of p(z|d) parameters grows linearly with the number of documents, which causes overfitting; LDA's Dirichlet priors act as regularization, leaving M + K Dirichlet-multinomial structures
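A minimal sketch of this generative process, assuming numpy; `generate_corpus` and the toy β are illustrative:

```python
import numpy as np

def generate_corpus(M, alpha, beta, xi=50, rng=None):
    """Sample a toy corpus from the LDA generative process described above.

    alpha: length-K Dirichlet parameter; beta: K x V topic-word matrix
    (rows sum to 1); xi: Poisson mean for the document length.
    """
    rng = rng or np.random.default_rng(0)
    K, V = beta.shape
    corpus = []
    for _ in range(M):
        N = max(1, rng.poisson(xi))          # N ~ Poisson(xi)
        theta = rng.dirichlet(alpha)         # theta ~ Dir(alpha)
        doc = []
        for _ in range(N):
            z = rng.choice(K, p=theta)       # z_n ~ Multinomial(theta)
            w = rng.choice(V, p=beta[z])     # w_n ~ p(w | z_n, beta)
            doc.append(w)
        corpus.append(doc)
    return corpus

beta = np.array([[0.5, 0.4, 0.05, 0.05],    # topic 0 favors words 0-1
                 [0.05, 0.05, 0.5, 0.4]])   # topic 1 favors words 2-3
print(generate_corpus(M=2, alpha=[0.5, 0.5], beta=beta, xi=8))
```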
Conjugate Prior & Distributions
Conjugate Prior:
If the posterior p(θ|x) is in the same family as the prior p(θ), the prior and posterior are called conjugate distributions, and the prior is called a conjugate prior of the likelihood p(x|θ): p(θ|x) ∝ p(x|θ)p(θ)
Why do the prior and posterior need to be conjugate distributions?
Distributions
Binomial Distribution ↔ Beta Distribution
Multinomial Distribution ↔ Dirichlet Distribution
Binomial & Beta Distribution
Binomial (the likelihood): $\mathrm{Bin}(m \mid N, \theta) = \binom{N}{m} \theta^m (1-\theta)^{N-m}$, where $\binom{N}{m} = \dfrac{N!}{(N-m)!\, m!}$
Beta: $\mathrm{Beta}(\theta \mid a, b) = \dfrac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)} \theta^{a-1} (1-\theta)^{b-1}$, where $\Gamma(a) = \int_0^\infty t^{a-1} e^{-t}\, dt$
Conjugate Prior & Distributions
Posterior with m successes and l failures:
$p(\theta \mid m, l, a, b) \propto \mathrm{Bin}(m \mid m+l, \theta)\, \mathrm{Beta}(\theta \mid a, b) \propto \theta^{m+a-1} (1-\theta)^{l+b-1}$
$p(\theta \mid m, l, a, b) = \dfrac{\Gamma(m+a+l+b)}{\Gamma(m+a)\Gamma(l+b)} \theta^{m+a-1} (1-\theta)^{l+b-1}$: a Beta distribution! This is Bayesian parameter estimation at work.
Multinomial & Dirichlet Distribution
$x$ is a multivariate indicator, e.g., $x = (0, 0, 1, 0, 0, 0)$: the event that $x_3$ happens
The probability distribution of $x$ in only one event: $p(x \mid \mu) = \prod_{k=1}^{K} \mu_k^{x_k}$, $\mu = (\mu_1, \mu_2, \ldots, \mu_K)$
Conjugate Prior & Distributions
Multinomial & Dirichlet Distribution (cont.)
$\mathrm{Mult}(m_1, m_2, \ldots, m_K \mid \mu, N) = \dfrac{N!}{m_1!\, m_2! \cdots m_K!} \prod_{k=1}^{K} \mu_k^{m_k}$: the likelihood function of $\mu$
Mult is the exact probability distribution behind $p(z_n \mid \theta_d)$ and $p(w_n \mid z_n)$
In Bayesian theory, we need to find a conjugate prior of $\mu$ for Mult, where $0 < \mu_k < 1$ and $\sum_{k=1}^{K} \mu_k = 1$
Dirichlet Distribution
$\mathrm{Dir}(\mu \mid \alpha) = \dfrac{\Gamma(\alpha_0)}{\Gamma(\alpha_1) \cdots \Gamma(\alpha_K)} \prod_{k=1}^{K} \mu_k^{\alpha_k - 1}$, where $\alpha$ is a vector and $\alpha_0 = \sum_k \alpha_k$
Hyper-parameter: a parameter in the probability density function (pdf)
Conjugate Prior & Distributions
Multinomial & Dirichlet Distribution (cont.)
$p(\mu \mid m, \alpha) \propto p(m \mid \mu)\, p(\mu \mid \alpha) \propto \prod_{k=1}^{K} \mu_k^{\alpha_k + m_k - 1}$: a Dirichlet?
$p(\mu \mid m, \alpha) = \mathrm{Dir}(\mu \mid m + \alpha) = \dfrac{\Gamma(\alpha_0 + N)}{\Gamma(\alpha_1 + m_1) \cdots \Gamma(\alpha_K + m_K)} \prod_{k=1}^{K} \mu_k^{\alpha_k + m_k - 1}$: a Dirichlet!
Why? The Gamma function $\Gamma$ is a mysterious function
Expectations:
$t \sim \mathrm{Beta}(t \mid \alpha, \beta)$: $E[t] = \int_0^1 t \cdot \dfrac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)} t^{\alpha-1} (1-t)^{\beta-1}\, dt = \dfrac{\alpha}{\alpha+\beta}$
$\mu \sim \mathrm{Dir}(\mu \mid \alpha)$: $E[\mu] = \left( \dfrac{\alpha_1}{\sum_{k=1}^{K} \alpha_k}, \dfrac{\alpha_2}{\sum_{k=1}^{K} \alpha_k}, \ldots, \dfrac{\alpha_K}{\sum_{k=1}^{K} \alpha_k} \right)$
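A minimal sketch of both conjugate updates: with conjugacy, computing the posterior amounts to adding the observed counts to the prior parameters (the numbers here are illustrative):

```python
import numpy as np

# Beta-binomial update: prior Beta(a, b), observe m successes and l failures.
a, b = 2.0, 2.0
m, l = 7, 3
post_a, post_b = a + m, b + l                  # posterior is Beta(a+m, b+l)
print("E[theta | data] =", post_a / (post_a + post_b))  # alpha / (alpha + beta)

# Dirichlet-multinomial update: prior Dir(alpha), observe counts m_k.
alpha = np.array([1.0, 1.0, 1.0])
counts = np.array([5, 2, 3])
post_alpha = alpha + counts                    # posterior is Dir(alpha + m)
print("E[mu | data] =", post_alpha / post_alpha.sum())
```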
Poisson Distribution
Why the Poisson distribution?
Many experimental situations observe counts of events within a set unit of time, area, volume, length, etc.: the number of births per hour during a given day; the number of particles emitted by a radioactive source in a given time; the number of cases of a disease in different towns
$p(k \mid \lambda) = \dfrac{\lambda^k e^{-\lambda}}{k!}$
For $\mathrm{Bin}(n, p)$, when $n$ is large and $p$ is small: $p(X = k) \approx \dfrac{\lambda^k e^{-\lambda}}{k!}$, with $\lambda \approx np$
Relation to the Gamma distribution: $\mathrm{Gamma}(x \mid \alpha) = \dfrac{x^{\alpha-1} e^{-x}}{\Gamma(\alpha)}$; with $\alpha = k+1$: $\mathrm{Gamma}(x \mid k+1) = \dfrac{x^{k} e^{-x}}{k!}$ (since $\Gamma(k+1) = k!$); the Poisson is discrete, the Gamma continuous
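A quick numeric check of the Binomial-to-Poisson approximation (the n, p values are illustrative):

```python
from math import comb, exp, factorial

n, p = 1000, 0.003          # large n, small p
lam = n * p                 # lambda = np
for k in range(6):
    binom = comb(n, k) * p**k * (1 - p)**(n - k)
    poisson = lam**k * exp(-lam) / factorial(k)
    print(k, round(binom, 5), round(poisson, 5))  # the two columns nearly match
```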
Solution for LDA
LDA (cont.): $\alpha, \beta$ are corpus-level parameters; $\theta$ is a document-level variable; $z, w$ are word-level variables
A conditionally independent hierarchical model; a parametric Bayes model
$\beta$ is the $K \times V$ topic-word probability matrix, $\beta_{ij} = p(w^j \mid z^i)$
$p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)$
Solving process (note $p(z_n = i \mid \theta) = \theta_i$): marginalize $z$ and integrate out $\theta$ (a multiple integral):
$p(\mathbf{w} \mid \alpha, \beta) = \int p(\theta \mid \alpha) \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)\, d\theta$
For the whole corpus D:
$p(D \mid \alpha, \beta) = \prod_{d=1}^{M} \int p(\theta_d \mid \alpha) \prod_{n=1}^{N_d} \sum_{z_{dn}} p(z_{dn} \mid \theta_d)\, p(w_{dn} \mid z_{dn}, \beta)\, d\theta_d$
Solution for LDA
LDA is arguably the most significant generative model in the machine learning community in the recent ten years
Rewriting $p(\mathbf{w} \mid \alpha, \beta)$ in terms of the model parameters:
$p(\mathbf{w} \mid \alpha, \beta) = \dfrac{\Gamma\left( \sum_i \alpha_i \right)}{\prod_i \Gamma(\alpha_i)} \int \left( \prod_{i=1}^{K} \theta_i^{\alpha_i - 1} \right) \left( \prod_{n=1}^{N} \sum_{i=1}^{K} \prod_{j=1}^{V} (\theta_i \beta_{ij})^{w_n^j} \right) d\theta$
$\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_K)$ and $\beta \in \mathbb{R}^{K \times V}$: what we need to solve for
Two routes: variational inference (deterministic inference) and Gibbs sampling (stochastic inference)
Why variational inference? To simplify the dependency structure
Why sampling? To approximate the statistical properties of the population with those of samples
Variational Inference
Variational Inference (VI): inference through a variational distribution
VI aims to use an approximating distribution that has a simpler dependency structure than that of the exact posterior distribution: $p(H \mid D) \approx Q(H)$, where $p(H \mid D)$ is the true posterior and $Q(H)$ is the variational distribution
Dissimilarity between P and Q? The Kullback-Leibler divergence:
$KL(Q \parallel P) = \int Q(H) \log \dfrac{Q(H)}{p(H \mid D)}\, dH = \int Q(H) \log \dfrac{Q(H)}{p(H, D)}\, dH + \log p(D)$
Lower bound: $L = \int Q(H) \log p(H, D)\, dH - \int Q(H) \log Q(H)\, dH = \langle \log p(H, D) \rangle_{Q(H)} + \mathcal{H}(Q)$, where $\mathcal{H}(Q)$ is the entropy of Q
Variational Inference
For LDA: $p(H \mid D) = p(\theta, \mathbf{z} \mid \mathbf{w}, \alpha, \beta)$, and the variational distribution is $Q(H) = q(\theta, \mathbf{z} \mid \gamma, \phi) = q(\theta \mid \gamma)\, q(\mathbf{z} \mid \phi) = q(\theta \mid \gamma) \prod_{n=1}^{N} q(z_n \mid \phi_n)$
($\theta$ and $\mathbf{z}$ are treated as approximately independent to facilitate computation)
$(\gamma^*, \phi^*) = \arg\min_{\gamma, \phi} KL\left( q(\theta, \mathbf{z} \mid \gamma, \phi) \,\|\, p(\theta, \mathbf{z} \mid \mathbf{w}, \alpha, \beta) \right)$, but we do not know the exact analytical form of this KL
$\log p(\mathbf{w} \mid \alpha, \beta) = \log \int \sum_{\mathbf{z}} p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)\, d\theta = \log \int \sum_{\mathbf{z}} p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta) \dfrac{q(\theta, \mathbf{z})}{q(\theta, \mathbf{z})}\, d\theta$
$\ge \int \sum_{\mathbf{z}} q(\theta, \mathbf{z}) \log \dfrac{p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)}{q(\theta, \mathbf{z})}\, d\theta = E_q[\log p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)] - E_q[\log q(\theta, \mathbf{z})] = L(\gamma, \phi; \alpha, \beta)$
$\log p(\mathbf{w} \mid \alpha, \beta) = L(\gamma, \phi; \alpha, \beta) + KL(q \parallel p)$, so minimizing the KL is equivalent to maximizing L
Variational Inference
$L(\gamma, \phi; \alpha, \beta) = E_q[\log p(\theta \mid \alpha)] + E_q[\log p(\mathbf{z} \mid \theta)] + E_q[\log p(\mathbf{w} \mid \mathbf{z}, \beta)] - E_q[\log q(\theta)] - E_q[\log q(\mathbf{z})]$
$E_q[\log p(\theta \mid \alpha)] = \sum_{i=1}^{K} (\alpha_i - 1) E_q[\log \theta_i] + \log \Gamma\left( \sum_{i=1}^{K} \alpha_i \right) - \sum_{i=1}^{K} \log \Gamma(\alpha_i)$
$E_q[\log \theta_i] = \Psi(\gamma_i) - \Psi\left( \sum_{j=1}^{K} \gamma_j \right)$ ($\Psi$: the digamma function)
$E_q[\log p(\mathbf{z} \mid \theta)] = \sum_{n=1}^{N} \sum_{i=1}^{K} E_q[z_{ni}]\, E_q[\log \theta_i] = \sum_{n=1}^{N} \sum_{i=1}^{K} \phi_{ni} \left( \Psi(\gamma_i) - \Psi\Big( \sum_{j=1}^{K} \gamma_j \Big) \right)$
$E_q[\log p(\mathbf{w} \mid \mathbf{z}, \beta)] = \sum_{n=1}^{N} \sum_{i=1}^{K} \sum_{j=1}^{V} E_q[z_{ni}]\, w_n^j \log \beta_{ij} = \sum_{n=1}^{N} \sum_{i=1}^{K} \sum_{j=1}^{V} \phi_{ni}\, w_n^j \log \beta_{ij}$
Variational Inference
$E_q[\log q(\theta \mid \gamma)]$ has the same form as $E_q[\log p(\theta \mid \alpha)]$, with $\gamma$ in place of $\alpha$
$E_q[\log q(\mathbf{z} \mid \phi)] = E_q\left[ \sum_{n=1}^{N} \sum_{i=1}^{K} z_{ni} \log \phi_{ni} \right] = \sum_{n=1}^{N} \sum_{i=1}^{K} \phi_{ni} \log \phi_{ni}$
Maximize L with respect to $\phi_{ni}$, using a Lagrange multiplier for the constraint $\sum_{i=1}^{K} \phi_{ni} = 1$:
$L_{[\phi_{ni}]} = \phi_{ni} \left( \Psi(\gamma_i) - \Psi\Big( \sum_{j=1}^{K} \gamma_j \Big) \right) + \phi_{ni} \log \beta_{iv} - \phi_{ni} \log \phi_{ni} + \lambda \left( \sum_{i=1}^{K} \phi_{ni} - 1 \right)$
Taking the derivative with respect to $\phi_{ni}$: $\dfrac{\partial L}{\partial \phi_{ni}} = \Psi(\gamma_i) - \Psi\Big( \sum_{j=1}^{K} \gamma_j \Big) + \log \beta_{iv} - \log \phi_{ni} - 1 + \lambda = 0$
$\Rightarrow \phi_{ni} \propto \beta_{iv} \exp\left( \Psi(\gamma_i) - \Psi\Big( \sum_{j=1}^{K} \gamma_j \Big) \right)$
Variational Inference
You can refer to the original paper for more details.
Variational EM Algorithm
Aim: $(\alpha^*, \beta^*) = \arg\max \prod_{d=1}^{M} p(\mathbf{w}_d \mid \alpha, \beta)$
Initialize $\alpha, \beta$
E-step: for each document, compute the optimizing variational parameters $(\gamma_d^*, \phi_d^*)$ through variational inference, giving a likelihood approximation
M-step: maximize the resulting lower bound with respect to $\alpha, \beta$
Repeat until convergence
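A minimal sketch of the per-document E-step, assuming scipy for the digamma function Ψ; the γ update γ_i = α_i + Σ_n φ_ni comes from maximizing L with respect to γ (derived in the original paper, not on these slides), and `e_step` is an illustrative name:

```python
import numpy as np
from scipy.special import digamma

def e_step(doc_word_ids, alpha, beta, n_iter=50):
    """Per-document variational E-step for LDA.

    doc_word_ids: word indices of one document; alpha: (K,); beta: (K, V).
    Returns the variational parameters gamma (K,) and phi (N, K).
    """
    K = len(alpha)
    N = len(doc_word_ids)
    phi = np.full((N, K), 1.0 / K)
    gamma = alpha + N / K
    for _ in range(n_iter):
        # phi_ni ∝ beta_iv * exp(Psi(gamma_i) - Psi(sum_j gamma_j))
        log_phi = np.log(beta[:, doc_word_ids].T) + digamma(gamma) - digamma(gamma.sum())
        phi = np.exp(log_phi - log_phi.max(axis=1, keepdims=True))
        phi /= phi.sum(axis=1, keepdims=True)
        # gamma_i = alpha_i + sum_n phi_ni
        gamma = alpha + phi.sum(axis=0)
    return gamma, phi

beta = np.array([[0.5, 0.4, 0.05, 0.05], [0.05, 0.05, 0.5, 0.4]])
gamma, phi = e_step([0, 1, 1, 3], alpha=np.array([0.5, 0.5]), beta=beta)
print(np.round(gamma, 2), np.round(phi, 2))
```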
Markov Chain Monte Carlo
MCMC basics: Markov chains (first-order) and the stationary distribution; the foundation of Gibbs sampling
General Markov property: $p(X_{t+k} = x \mid X_1, X_2, \ldots, X_t) = p(X_{t+k} = x \mid X_t)$
First-order: $p(X_{t+1} = x \mid X_1, X_2, \ldots, X_t) = p(X_{t+1} = x \mid X_t)$
One-step transition probability matrix (from $X_m$ to $X_{m+1}$):
$P = \begin{pmatrix} p(1 \mid 1) & p(2 \mid 1) & \cdots & p(|S| \mid 1) \\ p(1 \mid 2) & p(2 \mid 2) & \cdots & p(|S| \mid 2) \\ \vdots & \vdots & \ddots & \vdots \\ p(1 \mid |S|) & p(2 \mid |S|) & \cdots & p(|S| \mid |S|) \end{pmatrix}$
Markov Chain
Initialization probability: $\pi_0 = \{ \pi_0(1), \pi_0(2), \ldots, \pi_0(|S|) \}$
$\pi_n = \pi_{n-1} P = \pi_{n-2} P^2 = \cdots = \pi_0 P^n$: the Chapman-Kolmogorov equation
Convergence theorem (given the connectivity/irreducibility of P): $\lim_{n \to \infty} P^n_{ij} = \pi(j)$, where $\pi(j) = \sum_{i=1}^{|S|} \pi(i) P_{ij}$; every row of $\lim_{n \to \infty} P^n$ equals $\pi = \{ \pi(1), \pi(2), \ldots, \pi(|S|) \}$, so $\lim_{n \to \infty} \pi_0 P^n = \pi$
Stationary Distribution
$X_0 \sim \pi_0(x) \to X_1 \sim \pi_1(x) \to \cdots \to X_n \sim \pi(x) \to X_{n+1} \sim \pi(x) \to X_{n+2} \sim \pi(x) \to \cdots$
After convergence, every sample is drawn from the stationary distribution $\pi(x)$
Markov Chain Monte Carlo
MCMC Sampling
We should construct a relationship between $p(x)$ and the MC transition process: the detailed balance condition
In a common MC with distribution $\pi(i)$ and transition matrix P: if $\pi(i) P_{ij} = \pi(j) P_{ji}$ for all $i, j$, then $\pi(x)$ is the stationary distribution of this MC (a sufficient condition)
Proof: $\sum_{i=1}^{\infty} \pi(i) P_{ij} = \sum_{i=1}^{\infty} \pi(j) P_{ji} = \pi(j)$, i.e., $\pi P = \pi$, so $\pi$ is the solution of the equation $\pi P = \pi$. Done.
For a common MC with proposal $q(i, j)$ (also written $q(j \mid i)$ or $q(i \to j)$) and any probability distribution $p(x)$ (the dimension of x is arbitrary), introduce an acceptance probability to enforce detailed balance:
$p(i)\, q(i, j)\, \alpha(i, j) = p(j)\, q(j, i)\, \alpha(j, i)$, with $\alpha(i, j) = p(j)\, q(j, i)$ and $\alpha(j, i) = p(i)\, q(i, j)$
The corrected kernels $Q'(i, j) = q(i, j)\, \alpha(i, j)$ and $Q'(j, i) = q(j, i)\, \alpha(j, i)$ then satisfy detailed balance
Markov Chain Monte Carlo
MCMC Sampling (cont.)
Step 1: initialize $X_0 = x_0$
Step 2: for t = 0, 1, 2, ...
  $X_t = x_t$; sample $y$ from the proposal $q(x \mid x_t)$
  Sample $u$ from Uniform[0, 1]
  If $u < \alpha(x_t, y) = p(y)\, q(x_t \mid y)$: accept the transition $x_t \to y$, i.e., $X_{t+1} = y$
  Else: $X_{t+1} = x_t$
Metropolis-Hastings Sampling (rescales the acceptance probability to improve the acceptance rate)
Step 1: initialize $X_0 = x_0$
Step 2: for t = 0, 1, 2, ..., n, n+1, n+2, ...
  $X_t = x_t$; sample $y$ from the proposal $q(x \mid x_t)$
  Sample $u$ from Uniform[0, 1]
  If $u < \alpha(x_t, y) = \min\left\{ \dfrac{p(y)\, q(x_t \mid y)}{p(x_t)\, q(y \mid x_t)}, 1 \right\}$: $X_{t+1} = y$
  Else: $X_{t+1} = x_t$
Samples drawn before convergence fall in the burn-in period and are discarded
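A minimal sketch of MH with a symmetric Gaussian random-walk proposal (so the q terms cancel), assuming numpy; `metropolis_hastings` and the standard-normal target are illustrative:

```python
import numpy as np

def metropolis_hastings(log_p, x0, n_samples=5000, step=1.0, burn_in=1000):
    """MH with a symmetric Gaussian random-walk proposal.

    With a symmetric proposal, q(x_t|y)/q(y|x_t) = 1, so the acceptance
    ratio reduces to p(y)/p(x_t) (the Metropolis case).
    """
    rng = np.random.default_rng(0)
    x, samples = x0, []
    for t in range(n_samples + burn_in):
        y = x + rng.normal(0, step)            # sample y from q(x | x_t)
        if np.log(rng.uniform()) < log_p(y) - log_p(x):
            x = y                              # accept: X_{t+1} = y
        if t >= burn_in:                       # discard the burn-in period
            samples.append(x)
    return np.array(samples)

# Target: standard normal (knowing p only up to a constant is fine).
samples = metropolis_hastings(lambda x: -0.5 * x**2, x0=0.0)
print(samples.mean(), samples.std())  # approximately 0 and 1
```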
Gibbs Sampling
MH is not well suited to high-dimensional variables
Gibbs Sampling (two dimensions): consider points A(x1, y1), B(x1, y2), C(x2, y1), and D on a 2-D grid
$p(x_1, y_1)\, p(y_2 \mid x_1) = p(x_1)\, p(y_1 \mid x_1)\, p(y_2 \mid x_1)$
$p(x_1, y_2)\, p(y_1 \mid x_1) = p(x_1)\, p(y_2 \mid x_1)\, p(y_1 \mid x_1)$
Therefore $p(x_1, y_1)\, p(y_2 \mid x_1) = p(x_1, y_2)\, p(y_1 \mid x_1)$, i.e., $p(A)\, p(y_2 \mid x_1) = p(B)\, p(y_1 \mid x_1)$
Similarly, $p(A)\, p(x_2 \mid y_1) = p(C)\, p(x_1 \mid y_1)$
Gibbs Sampling
Gibbs Sampling (cont.)
We can accordingly construct the transition probability matrix Q over the points A(x1, y1), B(x1, y2), C(x2, y1), D:
$Q(A \to B) = p(y_B \mid x_1)$, if $x_A = x_B = x_1$
$Q(A \to C) = p(x_C \mid y_1)$, if $y_A = y_C = y_1$
$Q(A \to D) = 0$, otherwise
Detailed balance condition: $p(X)\, Q(X \to Y) = p(Y)\, Q(Y \to X)$ ✓
Gibbs sampling (in two dimensions):
Step 1: initialize $X_0 = x_0$, $Y_0 = y_0$
Step 2: for t = 0, 1, 2, ...
  1. $y_{t+1} \sim p(y \mid x_t)$
  2. $x_{t+1} \sim p(x \mid y_{t+1})$
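A minimal sketch of this two-dimensional scheme for a bivariate normal target, whose full conditionals are available in closed form (an illustration, not from the slides):

```python
import numpy as np

def gibbs_bivariate_normal(rho, n_samples=5000, burn_in=500):
    """Gibbs sampling from a standard bivariate normal with correlation rho.

    Both full conditionals are known in closed form here:
    x | y ~ N(rho * y, 1 - rho^2) and y | x ~ N(rho * x, 1 - rho^2).
    """
    rng = np.random.default_rng(0)
    x, y = 0.0, 0.0                       # Step 1: initialize X_0, Y_0
    sd = np.sqrt(1 - rho**2)
    samples = []
    for t in range(n_samples + burn_in):
        y = rng.normal(rho * x, sd)       # y_{t+1} ~ p(y | x_t)
        x = rng.normal(rho * y, sd)       # x_{t+1} ~ p(x | y_{t+1})
        if t >= burn_in:
            samples.append((x, y))
    return np.array(samples)

s = gibbs_bivariate_normal(rho=0.8)
print(np.corrcoef(s.T)[0, 1])  # approximately 0.8
```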
Gibbs Sampling
Gibbs sampling (in n dimensions):
Step 1: initialize $X_0 = x^{(0)} = \{ x_i^{(0)} : i = 1, 2, \ldots, n \}$
Step 2: for t = 0, 1, 2, ...
  1. $x_1^{(t+1)} \sim p(x_1 \mid x_2^{(t)}, x_3^{(t)}, \ldots, x_n^{(t)})$
  2. $x_2^{(t+1)} \sim p(x_2 \mid x_1^{(t+1)}, x_3^{(t)}, \ldots, x_n^{(t)})$
  3. ...
  4. $x_j^{(t+1)} \sim p(x_j \mid x_1^{(t+1)}, \ldots, x_{j-1}^{(t+1)}, x_{j+1}^{(t)}, \ldots, x_n^{(t)})$
  5. ...
  6. $x_n^{(t+1)} \sim p(x_n \mid x_1^{(t+1)}, x_2^{(t+1)}, \ldots, x_{n-1}^{(t+1)})$
(Components already updated in round t+1 are conditioned on at their new values; the rest at their round-t values)
Gibbs Sampling for LDA
Gibbs Sampling in LDA
$\mathrm{Dir}(p \mid \alpha) = \dfrac{1}{\Delta(\alpha)} \prod_{k} p_k^{\alpha_k - 1}$, where $\Delta(\alpha)$ is the normalization factor: $\Delta(\alpha) = \int \prod_{k} p_k^{\alpha_k - 1}\, dp$
For one document m:
$p(\mathbf{z}_m \mid \alpha) = \int p(\mathbf{z}_m \mid p)\, p(p \mid \alpha)\, dp = \int \prod_{k} p_k^{n_k}\, \mathrm{Dir}(p \mid \alpha)\, dp$
$= \int \prod_{k} p_k^{n_k} \dfrac{1}{\Delta(\alpha)} \prod_{k} p_k^{\alpha_k - 1}\, dp = \dfrac{1}{\Delta(\alpha)} \int \prod_{k} p_k^{n_k + \alpha_k - 1}\, dp = \dfrac{\Delta(n_m + \alpha)}{\Delta(\alpha)}$
Over the corpus:
$p(\mathbf{z} \mid \alpha) = \prod_{m=1}^{M} \dfrac{\Delta(n_m + \alpha)}{\Delta(\alpha)}$, and combining the two sides:
$p(\mathbf{w}, \mathbf{z} \mid \alpha, \beta) = \prod_{k=1}^{K} \dfrac{\Delta(n_k + \beta)}{\Delta(\beta)} \prod_{m=1}^{M} \dfrac{\Delta(n_m + \alpha)}{\Delta(\alpha)}$
Gibbs Sampling for LDA
Gibbs Sampling in LDA (cont.)
$p(\theta_m \mid \mathbf{z}_{\neg i}, \mathbf{w}_{\neg i}) = \mathrm{Dir}(\theta_m \mid n_{m, \neg i} + \alpha)$, $\quad p(\varphi_k \mid \mathbf{z}_{\neg i}, \mathbf{w}_{\neg i}) = \mathrm{Dir}(\varphi_k \mid n_{k, \neg i} + \beta)$
$p(z_i = k \mid \mathbf{z}_{\neg i}, \mathbf{w}_{\neg i}) \propto p(z_i = k, w_i = t, \theta_m, \varphi_k \mid \mathbf{z}_{\neg i}, \mathbf{w}_{\neg i}) = E[\theta_{mk}] \cdot E[\varphi_{kt}] = \hat{\theta}_{mk} \cdot \hat{\varphi}_{kt}$
$\hat{\theta}_{mk} = \dfrac{n_{m, \neg i}^{(k)} + \alpha_k}{\sum_{k=1}^{K} \left( n_{m, \neg i}^{(k)} + \alpha_k \right)}, \quad \hat{\varphi}_{kt} = \dfrac{n_{k, \neg i}^{(t)} + \beta_t}{\sum_{t=1}^{V} \left( n_{k, \neg i}^{(t)} + \beta_t \right)}$
Therefore the sampling formula is:
$p(z_i = k \mid \mathbf{z}_{\neg i}, \mathbf{w}) \propto \dfrac{n_{m, \neg i}^{(k)} + \alpha_k}{\sum_{k=1}^{K} \left( n_{m, \neg i}^{(k)} + \alpha_k \right)} \times \dfrac{n_{k, \neg i}^{(t)} + \beta_t}{\sum_{t=1}^{V} \left( n_{k, \neg i}^{(t)} + \beta_t \right)}$
For each word token i, draw $z_i^{(t+1)} \sim p(z_i = k \mid \mathbf{z}_{\neg i}, \mathbf{w})$, $k = 1, \ldots, K$
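A minimal sketch of collapsed Gibbs sampling for LDA implementing the sampling formula above with symmetric priors α, β; `lda_gibbs` and the toy corpus are illustrative (the document-side denominator is dropped since it is constant in k):

```python
import numpy as np

def lda_gibbs(docs, K, V, alpha=0.5, beta=0.1, n_iter=200, seed=0):
    """Collapsed Gibbs sampling for LDA, following the formula above.

    docs: list of lists of word ids in [0, V); returns topic assignments
    and the count matrices n_mk (doc-topic) and n_kt (topic-word).
    """
    rng = np.random.default_rng(seed)
    n_mk = np.zeros((len(docs), K))           # n_m^(k): topic counts per doc
    n_kt = np.zeros((K, V))                   # n_k^(t): word counts per topic
    n_k = np.zeros(K)                         # total words per topic
    z = [[rng.integers(K) for _ in doc] for doc in docs]
    for m, doc in enumerate(docs):            # initialize the counts
        for i, t in enumerate(doc):
            k = z[m][i]
            n_mk[m, k] += 1; n_kt[k, t] += 1; n_k[k] += 1
    for _ in range(n_iter):
        for m, doc in enumerate(docs):
            for i, t in enumerate(doc):
                k = z[m][i]                   # remove token i: the "not i" counts
                n_mk[m, k] -= 1; n_kt[k, t] -= 1; n_k[k] -= 1
                p = (n_mk[m] + alpha) * (n_kt[:, t] + beta) / (n_k + V * beta)
                k = rng.choice(K, p=p / p.sum())
                z[m][i] = k                   # resample z_i and restore counts
                n_mk[m, k] += 1; n_kt[k, t] += 1; n_k[k] += 1
    return z, n_mk, n_kt

docs = [[0, 1, 0, 1], [0, 0, 1], [2, 3, 2], [3, 2, 3, 2]]
z, n_mk, n_kt = lda_gibbs(docs, K=2, V=4)
print(n_kt)  # each topic should concentrate on words {0,1} or {2,3}
```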