Topic model, LDA and all that


Topic Models, LDA and all that

肖智博, 2011-03-23

Just make sure we are on the same page

LDA

• Linear Discriminant Analysis
  – Fisher Linear Discriminant Analysis
  – Supervised learning
  – Finds the projection direction that maximizes the ratio of between-class to within-class scatter
  – Viewed from the matrix-algebra perspective

• Latent Dirichlet Allocation
  – Unsupervised learning
  – Viewed from the graphical-model perspective

• Required mathematical background
• Latent Dirichlet Allocation
• Methods for approximating the posterior
• Evolution of topic models: exploring relationships between topics
• Supervised LDA --- MedLDA
• Applications of topic models

Roadmap

Main Contents

• Posterior
• Approximation
• Sampling
• Variational methods
• Optimization

Keywords

• Approximation methods are useful!
  – Gibbs sampling
  – Variational methods

• (Convex) optimization is useful!
• Math is almighty!

My afterthoughts

• Probability recap
• Beta and Gamma functions
• Dirichlet distribution
• Multinomial distribution
• Conjugate distributions
• A brief introduction to Bayesian networks

This Section : Overview

Probability recap
• Chain rule (conditional independence)

• Bayes rule

• Marginal distribution
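In standard notation, these three rules are

p(x_1, \dots, x_n) = \prod_{i=1}^{n} p(x_i \mid x_1, \dots, x_{i-1}) \qquad \text{(chain rule)}

p(\theta \mid x) = \frac{p(x \mid \theta)\, p(\theta)}{p(x)} \qquad \text{(Bayes rule)}

p(x) = \sum_{y} p(x, y) \ \ \text{or} \ \ p(x) = \int p(x, y)\, dy \qquad \text{(marginal distribution)}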

Probability Recap

• The Gamma function

• The Beta function
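The standard definitions of the two functions are

\Gamma(x) = \int_0^{\infty} t^{x-1} e^{-t}\, dt, \qquad \Gamma(n) = (n-1)! \ \text{for integer } n

\mathrm{B}(\alpha, \beta) = \int_0^{1} t^{\alpha-1} (1-t)^{\beta-1}\, dt = \frac{\Gamma(\alpha)\,\Gamma(\beta)}{\Gamma(\alpha+\beta)}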

Gamma and Beta Functions

Dirichlet distribution
• Probability density function
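For a K-dimensional Dirichlet with parameter vector α = (α_1, ..., α_K), the density is

p(\theta \mid \alpha) = \frac{\Gamma\!\left(\sum_{k=1}^{K} \alpha_k\right)}{\prod_{k=1}^{K} \Gamma(\alpha_k)} \prod_{k=1}^{K} \theta_k^{\alpha_k - 1}, \qquad \theta_k \ge 0, \ \ \sum_{k=1}^{K} \theta_k = 1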

Dirichlet Distribution

Multinomial distribution
• Probability density function

• Expectation
• Variance
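For n trials over K categories with probabilities θ = (θ_1, ..., θ_K), the standard forms are

p(x_1, \dots, x_K \mid n, \theta) = \frac{n!}{\prod_{k} x_k!} \prod_{k=1}^{K} \theta_k^{x_k}, \qquad \mathbb{E}[x_k] = n\,\theta_k, \qquad \mathrm{Var}(x_k) = n\,\theta_k (1 - \theta_k)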

Multinomial Distribution

Conjugate distributions: a prior distribution is said to be conjugate to a likelihood when the resulting posterior belongs to the same family as the prior. Conjugate priors make the posterior convenient to compute.
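The pairing used throughout LDA is Dirichlet-multinomial conjugacy: combining a Dirichlet prior with multinomial count data x gives a posterior that is again Dirichlet,

p(\theta \mid x, \alpha) \;\propto\; \prod_{k} \theta_k^{x_k} \cdot \prod_{k} \theta_k^{\alpha_k - 1} \;=\; \prod_{k} \theta_k^{\alpha_k + x_k - 1} \quad\Longrightarrow\quad \theta \mid x \sim \mathrm{Dir}(\alpha + x)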

Conjugate distribution

Very Important!

Conjugate Distributions

Similar

David Barber: Bayesian Reasoning and Machine Learning

Bayesian Network

Bayesian Network (continued)

• How can we represent a distribution that satisfies particular independence assumptions?
  – the representation problem

• How can we exploit those independence assumptions to compute efficiently?
  – the inference problem

• How can we identify such independence assumptions from data?
  – the learning problem

Bayesian Network : problems to solve

• David Barber– Bayesian Reasoning and Machine Learning

• Daphne Koller, Nir Friedman – Probabilistic Graphical Models

• Bishop– Pattern Recognition and Machine Learning Ch8

• Eric Xing– Probabilistic Graphical Models

Bayesian Network : where to learn more

This section: topic models
• Topic models
• LDA
• Inference methods
• Relationships between topics
• MedLDA
• Applications of LDA

Topic Model Overview

Topic Models

Topic Model Overview

• LDA based Topic Models

• Hyperspace Analogue to Language (HAL) (Lund and Burgess, 1996)

• Bound Encoding of the Aggregate Language Environment (BEAGLE) (Jones and Mewhort, 2007)

Researchers in Topic Models

D Blei

Andrew McCallum

Michael I. Jordan, Andrew Ng, John Lafferty

Eric Xing, Fei-Fei Li

Mark Steyvers

Researchers in Topic Models

Hanna Wallach, Yee Whye Teh, Jun Zhu, David Mimno

Why Latent?

Rethinking Bayesian models
• A suitable model should cover every situation that could plausibly occur.
• An appropriate prior should avoid assigning small probability to situations that could actually happen, but it should also not lump nearly impossible events in with everything else. Avoiding this requires accounting for the relationships among the model parameters. One strategy is to introduce latent variables into the model; another is to introduce hyperparameters. Both approaches remain tractable.

From Radford Neal’s CSC2541 “Bayesian Methods for Machine Learning”

The goal is to find short descriptions of the members of a collection that enable efficient processing of large collections while preserving the essential statistical relationships that are useful for basic tasks such as classification, novelty detection, summarization, and similarity and relevance judgments.

Goal

Goal and Motivation of Topic Model

Topic models: prior work

tf-idf (1983)

tf-idf counts term frequencies but cannot capture statistical structure within or across documents

Previous Work : tf-idf

LSI (1990)

tf-idf (1983)

LSI: Latent Semantic Indexing
Applies SVD to the term-by-document matrix; linear combinations of tf-idf features can capture some linguistic regularities

Previous work : LSI

Topic models: prior work

LSI (1990)

tf-idf (1983)

pLSI (1999)

pLSI (a.k.a. the aspect model)
The number of parameters grows with the size of the corpus, so it overfits easily; there is no document-level statistical model, so it cannot assign probabilities to documents

Previous work : pLSI

Topic models: prior work

LSI (1990)

tf-idf (1983)

LDA (2003)

pLSI (1999)

LDA
A mixture model that, under the bag-of-words assumption, treats both words and documents as exchangeable

LDA

Topic models: prior work

LDA

LDA as a graphical model

An LDA example: an online music community

An analog : Modeling Shared Tastes in Online Communities - Laura Dietz NIPS 09 workshop

LDA

For each document in the corpus, LDA assumes the following hierarchical Bayesian generative process:
1. Choose the number of words N
2. Choose the topic proportions θ for the document
3. For each word:
   a) choose a topic z
   b) choose the word w from the distribution associated with that topic
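A minimal sketch of this generative process in Python; the vocabulary size, topic count, and hyperparameter values below are made-up assumptions, not values from the talk:

import numpy as np

V, K = 1000, 20                                  # assumed vocabulary size and number of topics
alpha, beta = 0.1, 0.01                          # assumed symmetric hyperparameters
phi = np.random.dirichlet([beta] * V, size=K)    # one word distribution per topic

def generate_document(n_words=100):
    theta = np.random.dirichlet([alpha] * K)     # step 2: topic proportions for this document
    words = []
    for _ in range(n_words):                     # steps 1 and 3: for each of the N words
        z = np.random.choice(K, p=theta)         # step 3a: choose a topic
        words.append(np.random.choice(V, p=phi[z]))  # step 3b: choose a word from that topic
    return words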

LDA procedure

LDA

Given the hyperparameters α and β, the joint probability of the topics and the words is

Integrating over θ and summing over z gives the marginal probability of a document

Taking the product of these marginal probabilities over all documents then gives the probability of the corpus
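Written out in the notation of Blei et al. (2003), these three quantities are

p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)

p(\mathbf{w} \mid \alpha, \beta) = \int p(\theta \mid \alpha) \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \right) d\theta

p(D \mid \alpha, \beta) = \prod_{d=1}^{M} p(\mathbf{w}_d \mid \alpha, \beta)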

LDA : to see a document


Why integrate?

From Jerry Zhu’s CS 731 Advanced Artificial Intelligence

• Probabilities are obtained from frequencies
• The data are random, so expectations computed from them are random too
• Parameters are fixed unknown constants and are not given probability distributions

• Probability expresses degree of belief
• The expectation of a parameter is obtained from its probability distribution
• Unknown parameters are estimated by computing their marginal (posterior) probabilities

Frequentist

Bayesian

LDA by Human and Computer

An interesting incident from the winter break

• The main idea of the essay
• The main idea of each paragraph
• Elaboration

Training / testing

Reading …

LDA : Topic

LDA : Five topics from a 50-topic LDA model fit to Science from 1980–2002

LDA : Personas

LDA : Personas

Demohttp://personas.media.mit.edu/personasWeb.html

LDA : where to learn more --- Surveys

1. David M Blei, Andrew Y Ng, and Michael I Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3:993–1022, 2003

2. David M Blei and John D Lafferty. Topic models. Taylor and Francis, 2009.

3. Ali Daud, Juanzi Li, Lizhu Zhou, and Faqir Muhammad. Knowledge discovery through directed probabilistic topic models: a survey. Frontiers of Computer Science in China, 4(2):280–301,January 2010.

4. Mark Steyvers and Tom Griffiths. Probabilistic topic models. In Latent Semantic Analysis: A Road to Meaning. Laurence Erlbaum, July 2006.

How to obtain the parameters of LDA --- inference
The topic proportions of each document; the topic assignment probability of each word

The central computational task in LDA is computing the posterior probability of the latent variables

Variational inference vs. Gibbs sampling

Inference --- get important parameters in LDA

Inference Methods

Inference Methods Overview

Inference methods
• Stochastic methods (sampling)
  – MCMC, Metropolis-Hastings, Gibbs, etc.
  – Computationally expensive, but relatively accurate

• Deterministic methods (optimization)
  – Mean field, belief propagation
  – Variational Bayes, expectation propagation
  – Computationally cheap and inexact, but they provide bounds

Inference Methods : Comparison of two major methods

Variational Inference

Variational ≈ Optimization

Variational ≈ Convex Optimization
The basic idea of convexity-based variational inference is to make use of Jensen's inequality to obtain an adjustable lower bound on the log likelihood. Essentially, one considers a family of lower bounds, indexed by a set of variational parameters. The variational parameters are chosen by an optimization procedure that attempts to find the tightest possible lower bound.
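For LDA's log likelihood, for example, any variational distribution q over the latent variables gives such a bound via Jensen's inequality:

\log p(\mathbf{w} \mid \alpha, \beta) \;\ge\; \mathbb{E}_{q}\!\left[\log p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)\right] - \mathbb{E}_{q}\!\left[\log q(\theta, \mathbf{z})\right] \;=:\; L(q)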

Variational Inference

Mean field

• Basic idea
  – Approximate the true posterior with a simple, fully factorized distribution q
  – Choose the q that minimizes the KL divergence to the posterior

• Why the name?
  – The distribution factorizes completely
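For LDA, the fully factorized family used by Blei et al. (2003) and the resulting optimization are

q(\theta, \mathbf{z} \mid \gamma, \phi) = q(\theta \mid \gamma) \prod_{n=1}^{N} q(z_n \mid \phi_n), \qquad (\gamma^{*}, \phi^{*}) = \arg\min_{\gamma, \phi} \mathrm{D}\!\left( q(\theta, \mathbf{z} \mid \gamma, \phi) \,\|\, p(\theta, \mathbf{z} \mid \mathbf{w}, \alpha, \beta) \right)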

Mean field variational inference

Variational inference in LDA

Variational inference in LDA Overview

Goal: compute the posterior p(θ, z | w, α, β)

Variational inference in LDA

Variational Inference : Beautiful math

Jensen's inequality

Variational inference in LDA

Variational Inference : Beautiful math

Denote the resulting lower bound by L(γ, φ; α, β); then log p(w | α, β) differs from L exactly by the KL divergence between q and the true posterior

Because both q and the joint distribution factorize, the bound L decomposes into a sum of tractable expectation terms

Variational inference in LDA

Variational Inference : Beautiful math

Variational inference in LDA

Variational Inference : Beautiful math

Applying the method of Lagrange multipliers, we obtain
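In the notation of Blei et al. (2003), the resulting coordinate-ascent updates are

\phi_{ni} \;\propto\; \beta_{i w_n} \exp\!\left\{ \Psi(\gamma_i) - \Psi\!\Big(\textstyle\sum_{j=1}^{K} \gamma_j\Big) \right\}, \qquad \gamma_i = \alpha_i + \sum_{n=1}^{N} \phi_{ni}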

Summary: variational inference in LDA

Variational Inference : Review

Goal: compute the posterior p(θ, z | w, α, β)

Variational Inference : where to learn more

1. Martin Wainwright. Graphical models and variational methods: message-passing, convex relaxations, and all that. ICML 2008 tutorial.

2. M. J. Wainwright and M. I. Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, Vol. 1, Numbers 1--2, pp. 1--305, December 2008

MCMC in LDA

MCMC Overview

• Sampling in general
  – Why sampling is necessary and why it is hard
  – Importance sampling, rejection sampling
  – Markov chain Monte Carlo
  – Metropolis-Hastings, Gibbs sampling

• Collapsed Gibbs in LDA

Pioneers behind sampling methods

MCMC Overview

Nicholas C. Metropolis, Josiah W. Gibbs, Andrey Markov

National Bureau of Statistics of China, 16 March 2006: with the approval of the State Council, China carried out a nationwide 1% population sample survey at the end of 2005. The sample comprised 17.05 million people, 1.31% of the total population. Within the whole population, 67.64 million people have college-level education or above, 150.83 million have senior-high-school education (including technical secondary school), 467.35 million have junior-high-school education, and 407.06 million have primary-school education.

Sampling example : population statistics

1. Generate samples from a given probability distribution p(x)

The problems sampling has to solve

Sampling

2. Under a given probability distribution p(x), estimate the expectation of a function f(x)

Example: measuring the concentration of a substance in lake water
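Both problems reduce to the basic Monte Carlo estimate

\mathbb{E}_{p}[f] = \int f(x)\, p(x)\, dx \;\approx\; \frac{1}{L} \sum_{l=1}^{L} f\!\left(x^{(l)}\right), \qquad x^{(l)} \sim p(x)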

Sampling : why is it so damn hard?

Rejection sampling

Sampling : Rejection sampling

Accept

Reject
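One rejection-sampling step, with proposal q and a constant M chosen so that M q(x) ≥ p̃(x) everywhere, is

x \sim q(x), \quad u \sim \mathrm{Uniform}(0, 1): \qquad \text{accept } x \text{ if } u < \frac{\tilde{p}(x)}{M\, q(x)}, \ \text{otherwise reject}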

Importance sampling

Sampling : Importance sampling

In rejection sampling, throwing away an x seems wasteful: the rejected sample is the only thing we have learned about the original distribution. Importance sampling instead keeps every sample and weights it.
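The standard (self-normalized) importance-sampling estimate with proposal q is

\mathbb{E}_{p}[f] \;\approx\; \sum_{l=1}^{L} w_l\, f\!\left(x^{(l)}\right), \qquad w_l = \frac{\tilde{p}(x^{(l)}) / q(x^{(l)})}{\sum_{m=1}^{L} \tilde{p}(x^{(m)}) / q(x^{(m)})}, \qquad x^{(l)} \sim q(x)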

Metropolis-Hastings method

Sampling : Metropolis-Hastings method

Using the Markov property, each state depends only on the previous one. At state t, the proposal q(x' | x_t) can be any distribution we can sample from, for example a Gaussian. For a newly proposed state x', consider the acceptance probability
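With target distribution p and proposal q, the standard Metropolis-Hastings acceptance probability is

A(x_t \to x') = \min\!\left( 1, \; \frac{p(x')\, q(x_t \mid x')}{p(x_t)\, q(x' \mid x_t)} \right)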

Gibbs Sampling

Sampling : Gibbs Sampling

1. Initialize all variables
2. Pick a dimension i

3. Sample x_i from the full conditional distribution p(x_i | x_{-i}), keeping all other dimensions fixed

Gibbs Sampling in LDA : Joint distribution

Gibbs Sampling in LDA

Gibbs Sampling in LDA : Joint distribution

Gibbs Sampling in LDA

Collapsed:

Substituting the expression above into the joint distribution

Gibbs Sampling in LDA : Joint distribution

Gibbs Sampling in LDA

(Several formulas omitted here ...)

Gibbs Sampling in LDA : Marginal dist.

Gibbs Sampling in LDA

(Several formulas omitted here ...)
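For reference, the omitted full conditional has the standard collapsed-Gibbs form (as in Griffiths and Steyvers, 2004), where the counts n exclude the current assignment of token i, m is its document, and V is the vocabulary size:

p(z_i = k \mid \mathbf{z}_{-i}, \mathbf{w}) \;\propto\; \frac{n_{k,-i}^{(w_i)} + \beta}{n_{k,-i}^{(\cdot)} + V\beta} \cdot \left( n_{m,-i}^{(k)} + \alpha \right)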

Gibbs Sampling in LDA in Python Code

Sampling : Gibbs Sampling code in Python

for m in xrange(n_docs):
    for i, w in enumerate(word_indices(matrix[m, :])):
        # assign a random initial topic to word i of document m
        z = np.random.randint(self.n_topics)
        self.nmz[m, z] += 1   # document-topic count
        self.nm[m] += 1       # number of words in document m
        self.nzw[z, w] += 1   # topic-word count
        self.nz[z] += 1       # number of words assigned to topic z
        self.topics[(m, i)] = z

Really simple!
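The snippet above only initializes the assignments and counts. A sketch of the corresponding resampling sweep, assuming the same count arrays and hypothetical float attributes self.alpha, self.beta, and self.vocab_size, might look like this:

for m in xrange(n_docs):
    for i, w in enumerate(word_indices(matrix[m, :])):
        z = self.topics[(m, i)]
        # remove the current assignment of this word from all counts
        self.nmz[m, z] -= 1; self.nm[m] -= 1
        self.nzw[z, w] -= 1; self.nz[z] -= 1
        # full conditional p(z_i = k | z_-i, w), up to normalization
        p_z = (self.nzw[:, w] + self.beta) / (self.nz + self.beta * self.vocab_size) \
              * (self.nmz[m, :] + self.alpha)
        z = np.random.multinomial(1, p_z / p_z.sum()).argmax()
        # record the new assignment and restore the counts
        self.nmz[m, z] += 1; self.nm[m] += 1
        self.nzw[z, w] += 1; self.nz[z] += 1
        self.topics[(m, i)] = z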

1. D.J.C. MacKay. Information theory, inference, and learning algorithms. Cambridge University Press, 2003.

2. Gregor Heinrich. Parameter estimation for text analysis. Technical Report, 2009.

3. Michael I. Jordan and Yair Weiss. Graphical models: Probabilistic inference.

4. Christophe Andrieu, N De Freitas, A Doucet, and Michael I. Jordan. An introduction to MCMC for machine learning. Machine learning, pages 5–43, 2003.

5. Yi Wang. Distributed Gibbs Sampling of Latent Dirichlet Allocation : The Gritty Details. Technical Report, 2007.

Gibbs Sampling : where to learn more

Evolution of Topic Models

• Correlated topic models
• Dynamic topic models
• Temporal topic models

1. David M. Blei and John D Lafferty. Correlated Topic Models. In Advances in Neural Information Processing Systems 18, 2006.

2. David M. Blei and John D Lafferty. A correlated topic model of Science. The Annals of Applied Statistics, 1(1):17–35, 2007.

3. David M. Blei and John D Lafferty. Dynamic topic models. Proceedings of the 23rd international conference on Machine learning - ICML ’06, pages 113–120, 2006.

Correlated + Dynamic TM

Correlated + Dynamic Topic Models

Correlated + Dynamic TM

Correlated + Dynamic Topic Models

Correlated TM

Correlated Topic Models

The Dirichlet cannot capture correlations between topics; the logistic-normal prior used instead is not conjugate to the multinomial, so inference is carried out with variational methods

Correlated TM

Correlated Topic Models

Correlated TM

Correlated Topic Models

Controlling sparsity

Dynamic TM

Dynamic Topic Models

In DTM, documents are assumed to be divided into time slices

Dynamic TM

Dynamic Topic Models : Top 10 words and example articles from Science

Topics over time

Topics over time

Published in: KDD '06 Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining

Topics over time

Topics over time

Topics over time : Topic discovery

Topics over time : State-of-the-Union Addresses

TOT

LDA

Topics over time : Topic evolution

Topics over time : Topic evolution on NIPS data

Topics over time : Co-occurring Topics

Topics over time : Topic evolution on NIPS data

Topics over time : Review

Topics over time : Review ('A Survey of Research Methods on LDA-based Topic Evolution', Journal of Chinese Information Processing, November 2010)

Work by Hanna Wallach

Work by Hanna Wallach

NIPS ’09

ICML ’09

Asymmetric priors

Rethinking LDA : Why priors matter

Advantages of asymmetric priors

Rethinking LDA : Why priors matter

• The difference is that an asymmetric prior over the per-document topic proportions smooths the topics differently for different documents
• For the topic-word prior, its magnitude controls whether topics are sparser or closer to uniform; since it acts on the whole corpus, making it asymmetric weakens the model's description of within-document structure
• The topic proportions are per-document parameters, so an asymmetric prior is well suited to describing individual documents

1. David Blei and Jon D. McAuliffe. Supervised topic models. In Advances in Neural Information Processing Systems, pages 1–22, 2008.

2. Daniel Ramage, David Hall, Ramesh Nallapati, and C.D. Manning. Labeled LDA : A supervised topic model for credit attribution in multi-labeled corpora. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1, pages 248–256. Association for Computational Linguistics, 2009.

3. Jun Zhu, Amr Ahmed, and Eric Xing. MedLDA: Maximum Margin Supervised Topic Models. Journal of Machine Learning Research, 1:1–48, 2010.

Supervised LDA

Supervised LDA

Supervised LDA : goal

Supervised LDA

On top of the unsupervised model, add a description of the class labels so that classification and regression can be performed
The naive approach: run LDA first, then classify using the topics as features

Model the labels with a probability distribution

Supervised topic models

Supervised LDA

Model the labels with a generalized linear model (GLM)

Supervised topic models

Supervised LDA : Why use GLM

A GLM can flexibly describe any label whose distribution can be written in exponential-family form:
• Gaussian
• Binomial
• Multinomial
• Poisson
• Gamma
• Weibull
• ...
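For example, in the Gaussian case of sLDA (Blei and McAuliffe, 2008), the response of document d is modeled through the empirical topic frequencies of its words:

y_d \mid z_{d,1:N}, \eta, \sigma^2 \;\sim\; \mathcal{N}\!\left( \eta^{\top} \bar{z}_d, \; \sigma^2 \right), \qquad \bar{z}_d = \frac{1}{N} \sum_{n=1}^{N} z_{d,n}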

Semi-Supervised LDA

Semi-Supervised LDA

Unlabeled / Labeled

MedLDA

Maximum Entropy Discrimination Latent Dirichlet Allocation

Combines max-margin learning with topic models by optimizing a single objective function subject to a set of margin constraints
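Schematically, for binary classification this couples the variational bound L(q) with soft margin constraints on the expected topic features (a simplified sketch of the formulation in Zhu et al., not their exact objective):

\min_{q,\, \xi \ge 0} \; L(q) + C \sum_{d=1}^{M} \xi_d \qquad \text{s.t.} \quad y_d\, \mathbb{E}_{q}\!\left[ \eta^{\top} \bar{z}_d \right] \;\ge\; 1 - \xi_d, \quad \forall d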

MedTM

Maximum Entropy Discrimination Topic Models

Advantages:
1. Uses max-margin theory to classify correctly
2. Describes the data better

MedLDA vs. LDA

MedLDA : topic discovery

2D embedding on 20Newsgroups data

MedLDA : classification

Classification on 20 Newsgroups data

Binary classification / multi-class classification

Applications of Topic Models

NIPS ’09 Workshop on Applications for Topic Models: Text and Beyond

• Social networks, microblogs, email: non-standard language (abbreviations, misspellings, quotations, irregular citations, @, RT, ...)

• Protein expression analysis

Hierarchical structure, finer granularity
• Topic tracking, evolution, and disappearance
• Text summarization
• Multimedia (images, audio, video)

Reconstructing Pompeian Households

Reconstructing Pompeian Households

Reconstructing Pompeian Households

Stew pots / pastry molds

Reconstructing Pompeian Households

Stew pots / pastry molds

Is it so?

Reconstructing Pompeian Households

Room / function / artifacts

Topic distribution / words

Reconstructing Pompeian Households

Reconstructing Pompeian Households

The use of an object does not match its name

Mining software components

Software Analysis with Unsupervised Topic Models

12,151 Java projects from Sourceforge and Apache

4,632 projects

366,287 source files

38.7 million lines of code

written by 9,250 developers

Mining software components

Software Analysis with Unsupervised Topic Models

Mining software components

Software Analysis with Unsupervised Topic Models

Future Directions

• LDA as a dimensionality-reduction method, applied to k-means
• Further extensions built on MedLDA
• LDA + active learning
• Semi-supervised topic models
• Asymmetric Dirichlet priors
• Better approximation methods (data augmentation)

• Text summarization
• Topic models for unstructured data (microblogs, forums, gene sequences)
• Applying LDA to images

Topics not covered

• Hierarchical topic models
• Nested Chinese restaurant process
• Indian buffet process

Topic Models Background : General

Topic Models Background

• D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, January 2003.

• D. Blei and M. Jordan. Variational inference for Dirichlet process mixtures. Journal of Bayesian Analysis, 1:121–144, 2006.

• M. Steyvers and T. Griffiths. Probabilistic Topic Models. In Latent Semantic Analysis: A Road to Meaning, T. Landauer, Mcnamara, S. Dennis, and W. Kintsch eds. Laurence Erlbaum, 2006.

• Y. Teh, M. Jordan, M. Beal, and D. Blei. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101:1566-1581, 2006.

• J. Zhu, A. Ahmed and E. P. Xing. MedLDA: Maximum Margin Supervised Topic Models for Regression and Classification. The 26th International Conference on Machine Learning, 2009.

Inference and Evaluation

Topic Models Background

• Asuncion, M. Welling, P. Smyth, and Y. Teh. On Smoothing and Inference for Topic Models. In Uncertainty in Artificial Intelligence, 2009.

• W. Li, and A. McCallum. Pachinko Allocation: DAG-structured Mixture Models of Topic Correlations. In International Conference on Machine Learning, 2006.

• Porteous, A. Asuncion, D. Newman, A. Ihler, P. Smyth, and M. Welling. Fast Collapsed Gibbs Sampling For Latent Dirichlet Allocation. In Knowledge Discovery and Data Mining, 2008.

• H. Wallach, I. Murray, R. Salakhutdinov and D. Mimno. Evaluation Methods for Topic Models. In International Conference on Machine Learning, 2009.

• M. Welling, Y. Teh and B. Kappen. Hybrid Variational/Gibbs Inference in Topic Models. In Uncertainty in Artificial Intelligence, 2008.

Biology

Topic Models Background

• E. Airoldi, D. Blei, S. Fienberg, and E. Xing. Mixed-membership stochastic blockmodels. Journal of Machine Learning Research, 9: 1981-2014, 2008.

• P. Agius, Y. Ying, and C. Campbell. Bayesian Unsupervised Learning with Multiple Data Types. Statistical Applications in Genetics and Molecular Biology, 3(1):27, 2009.

• P. Flaherty, G. Giaever, J. Kumm, Michael I. Jordan, Adam P. Arkin. A Latent Variable Model for Chemogenomic Profiling. Bioinformatics 2005 Aug 1;21(15):3286-93.

• S. Shringarpure and E. P. Xing. mStruct: Inference of Population Structure in Light of Both Genetic Admixing and Allele Mutations. Genetics, Vol 182, issue 2, 2009.

Natural Language Processing

Topic Models Background

• J. Boyd-Graber and D. Blei. Syntactic topic models. In Neural Information Processing Systems, 2009.

• T. Griffiths, M. Steyvers, D. Blei, and J. Tenenbaum. Integrating topics and syntax. In Neural Information Processing Systems, 2005.

• K. Toutanova and M. Johnson. A Bayesian LDA-based Model for Semi-Supervised Part-of-speech Tagging. In Neural Information Processing Systems, 2008.

Social Science

Topic Models Background

• J. Chang, J. Boyd-Graber, and D. Blei. Connections between the lines: Augmenting social networks with text. Knowledge Discovery and Data Mining, 2009.

• L. Dietz, S. Bickel, and T. Scheffer. Unsupervised Prediction of Citation Influences. In International Conference on Machine Learning, 2007.

• D. Hall, D. Jurafsky, and C. Manning. Studying the History of Ideas Using Topic Models. In Empirical Methods in Natural Language Processing, 2008.

Temporal and Network Models

Topic Models Background

• D. Blei and J. Lafferty. Dynamic topic models. In International Conference on Machine Learning, 2006.

• J. Chang and D. Blei. Relational topic models for document networks. Artificial Intelligence and Statistics (in print), 2009.

• E.P. Xing, W. Fu, and L. Song. A State-Space Mixed Membership Blockmodel for Dynamic Network Tomography. Annals of Applied Statistics, 2009.

• H. Wallach. Topic Modeling: Beyond Bag-of-Words. In International Conference on Machine Learning, 2006.

Vision

Topic Models Background

• J. Philbin, J. Sivic, and A. Zisserman. Geometric LDA: A Generative Model for Particular Object Discovery In British Machine Vision Conference, 2008.

• L. Fei-Fei, R. Fergus and P. Perona. Learning generative visual models for 101 object categories. In Computer Vision and Image Understanding, 2007.

• C. Wang, D. Blei and L. Fei-Fei. Simultaneous Image Classification and Annotation. In Computer Vision and Pattern Recognition, 2009.

ONE MORE THING…

Structure / relationships / wide applications

Room / function / artifacts

Topic distribution / words

Q & A