Machine Learning with EM


Transcript of Machine Learning with EM

Page 1: Machine Learning with EM

Machine Learning with EM

Yan Hongfei (闫宏飞), School of Electronics Engineering and Computer Science, Peking University. 7/24/2012

http://net.pku.edu.cn/~course/cs402/2012

This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License. See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details.

Jimmy Lin, University of Maryland. SEWMGroup.

Page 2: Machine Learning with EM

Today’s Agenda

• Introduction to statistical models
• Expectation maximization
• Apache Mahout

Page 3: Machine Learning with EM

Introduction to statistical models
• Until the 1990s, text processing relied on rule-based systems
• Advantages
  – More predictable
  – Easy to understand
  – Easy to identify errors and fix them
• Disadvantages
  – Extremely labor-intensive to create
  – Not robust to out-of-domain input
  – No partial output or analysis when failure occurs

Page 4: Machine Learning with EM

Introduction to statistical models
• A better strategy is to use data-driven methods
• Basic idea: learn from a large corpus of examples of what we wish to model (training data)
• Advantages
  – More robust to the complexities of real-world input
  – Creating training data is usually cheaper than creating rules
    • Even easier today thanks to Amazon Mechanical Turk
    • Data may already exist for independent reasons
• Disadvantages
  – Systems often behave differently compared to expectations
  – Hard to understand the reasons for errors or to debug them

Page 5: Machine Learning with EM

Introduction to statistical models
• Learning from training data usually means estimating the parameters of the statistical model
• Estimation is usually carried out via machine learning
• Two kinds of machine learning algorithms
• Supervised learning
  – Training data consists of the inputs and respective outputs (labels)
  – Labels are usually created via expert annotation (expensive)
  – Difficult to annotate when predicting more complex outputs
• Unsupervised learning
  – Training data just consists of inputs. No labels.
  – One example of such an algorithm: Expectation Maximization

Page 6: Machine Learning with EM

EM-Algorithm

Page 7: Machine Learning with EM

What is MLE?

• Given
  – A sample X = {X1, …, Xn}
  – A vector of parameters θ
• We define
  – Likelihood of the data: P(X | θ)
  – Log-likelihood of the data: L(θ) = log P(X | θ)
• Given X, find

$$\theta_{ML} = \arg\max_{\theta} L(\theta)$$

Page 8: Machine Learning with EM

MLE (cont)

• Often we assume that the Xi are independent and identically distributed (i.i.d.)
• Depending on the form of p(x | θ), solving the optimization problem can be easy or hard.

$$\begin{aligned}
\theta_{ML} &= \arg\max_{\theta} L(\theta) \\
&= \arg\max_{\theta} \log P(X \mid \theta) \\
&= \arg\max_{\theta} \log P(X_1, \ldots, X_n \mid \theta) \\
&= \arg\max_{\theta} \log \prod_{i=1}^{n} P(X_i \mid \theta) \\
&= \arg\max_{\theta} \sum_{i=1}^{n} \log P(X_i \mid \theta)
\end{aligned}$$

Page 9: Machine Learning with EM

An easy case

• Assuming
  – A coin has a probability p of being heads, 1 − p of being tails.
  – Observation: we toss the coin N times, and the result is a set of Hs and Ts, with m Hs.

• What is the value of p based on MLE, given the observation?

Page 10: Machine Learning with EM

An easy case (cont)

$$L(\theta) = \log P(X \mid \theta) = \log p^{m}(1-p)^{N-m} = m \log p + (N-m)\log(1-p)$$

$$\frac{dL(\theta)}{dp} = \frac{d\,(m \log p + (N-m)\log(1-p))}{dp} = \frac{m}{p} - \frac{N-m}{1-p} = 0$$

$$\Rightarrow \quad p = \frac{m}{N}$$
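For instance (numbers chosen here purely for illustration, not from the original slides): tossing the coin N = 10 times and observing m = 7 heads gives

$$p = \frac{m}{N} = \frac{7}{10} = 0.7$$

which matches the intuitive frequency estimate.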

Page 11: Machine Learning with EM

EM: basic concepts

Page 12: Machine Learning with EM

Basic setting in EM
• X is a set of data points: observed data.
• θ is a parameter vector.
• EM is a method to find θML where

$$\theta_{ML} = \arg\max_{\theta} L(\theta) = \arg\max_{\theta} \log P(X \mid \theta)$$

• Calculating P(X | θ) directly is hard.
• Calculating P(X, Y | θ) is much simpler, where Y is “hidden” data (or “missing” data).

Page 13: Machine Learning with EM

The basic EM strategy

• Z = (X, Y)
  – Z: complete data (“augmented data”)
  – X: observed data (“incomplete” data)
  – Y: hidden data (“missing” data)
  – For example, in a mixture model, X is the observed points and Y the unobserved component assignments.

Page 14: Machine Learning with EM

The log-likelihood function

• L is a function of θ, while holding X constant:

$$L(\theta) = L(X \mid \theta) = P(X \mid \theta)$$

$$\begin{aligned}
l(\theta) &= \log L(\theta) = \log P(X \mid \theta) \\
&= \sum_{i=1}^{n} \log P(x_i \mid \theta) \\
&= \sum_{i=1}^{n} \log \sum_{y_i} P(x_i, y_i \mid \theta)
\end{aligned}$$

Page 15: Machine Learning with EM

The iterative approach for MLE

$$\theta_{ML} = \arg\max_{\theta} L(\theta) = \arg\max_{\theta} l(\theta) = \arg\max_{\theta} \sum_{i=1}^{n} \log \sum_{y_i} p(x_i, y_i \mid \theta)$$

In many cases, we cannot find the solution directly. An alternative is to find a sequence

$$\theta^0, \theta^1, \ldots, \theta^t, \ldots \quad \text{s.t.} \quad l(\theta^0) \le l(\theta^1) \le \ldots \le l(\theta^t) \le \ldots$$

Page 16: Machine Learning with EM

$$\begin{aligned}
l(\theta) - l(\theta^t) &= \log P(X \mid \theta) - \log P(X \mid \theta^t) \\
&= \sum_{i=1}^{n} \log P(x_i \mid \theta) - \sum_{i=1}^{n} \log P(x_i \mid \theta^t) \\
&= \sum_{i=1}^{n} \log \frac{\sum_{y_i} P(x_i, y_i \mid \theta)}{\sum_{y_i'} P(x_i, y_i' \mid \theta^t)} \\
&= \sum_{i=1}^{n} \log \sum_{y_i} \frac{P(x_i, y_i \mid \theta^t)}{\sum_{y_i'} P(x_i, y_i' \mid \theta^t)} \cdot \frac{P(x_i, y_i \mid \theta)}{P(x_i, y_i \mid \theta^t)} \\
&= \sum_{i=1}^{n} \log \sum_{y_i} P(y_i \mid x_i, \theta^t) \, \frac{P(x_i, y_i \mid \theta)}{P(x_i, y_i \mid \theta^t)} \\
&\ge \sum_{i=1}^{n} \sum_{y_i} P(y_i \mid x_i, \theta^t) \log \frac{P(x_i, y_i \mid \theta)}{P(x_i, y_i \mid \theta^t)} \qquad \text{(Jensen's inequality)} \\
&= \sum_{i=1}^{n} E_{P(y \mid x_i, \theta^t)} \left[ \log \frac{P(x_i, y \mid \theta)}{P(x_i, y \mid \theta^t)} \right]
\end{aligned}$$

Page 17: Machine Learning with EM

Jensen’s inequality

• If f is convex, then $E[f(g(x))] \ge f(E[g(x)])$.
• If f is concave, then $E[f(g(x))] \le f(E[g(x)])$.
• log is a concave function, so $E[\log p(x)] \le \log E[p(x)]$.
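A quick numeric check (values chosen for this write-up, not from the original slides): let x be 2 or 8 with probability 1/2 each. Then

$$E[\log_2 x] = \tfrac{1}{2}\log_2 2 + \tfrac{1}{2}\log_2 8 = 2 \;\le\; \log_2 E[x] = \log_2 5 \approx 2.32$$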

Page 18: Machine Learning with EM

Maximizing the lower bound

$$\begin{aligned}
\theta^{(t+1)} &= \arg\max_{\theta} \sum_{i=1}^{n} E_{P(y \mid x_i, \theta^t)} \left[ \log \frac{P(x_i, y \mid \theta)}{P(x_i, y \mid \theta^t)} \right] \\
&= \arg\max_{\theta} \sum_{i=1}^{n} \sum_{y} P(y \mid x_i, \theta^t) \log \frac{P(x_i, y \mid \theta)}{P(x_i, y \mid \theta^t)} \\
&= \arg\max_{\theta} \sum_{i=1}^{n} \sum_{y} P(y \mid x_i, \theta^t) \log P(x_i, y \mid \theta) \\
&= \arg\max_{\theta} \sum_{i=1}^{n} E_{P(y \mid x_i, \theta^t)} \left[ \log P(x_i, y \mid \theta) \right]
\end{aligned}$$

The last expression is the Q function.

Page 19: Machine Learning with EM

The Q-function
• Define the Q-function (a function of θ):
  – Y is a random vector.
  – X = (x1, x2, …, xn) is a constant (vector).
  – θt is the current parameter estimate and is a constant (vector).
  – θ is the normal variable (vector) that we wish to adjust.

• The Q-function is the expected value of the complete data log-likelihood P(X,Y|θ) with respect to Y given X and θt.

$$\begin{aligned}
Q(\theta, \theta^t) &= E_{P(Y \mid X, \theta^t)} \left[ \log P(X, Y \mid \theta) \mid X, \theta^t \right] = E_{P(Y \mid X, \theta^t)} \left[ \log P(X, Y \mid \theta) \right] \\
&= \sum_{i=1}^{n} E_{P(y_i \mid x_i, \theta^t)} \left[ \log P(x_i, y_i \mid \theta) \right] \\
&= \sum_{i=1}^{n} \sum_{y_i} P(y_i \mid x_i, \theta^t) \log P(x_i, y_i \mid \theta)
\end{aligned}$$

Page 20: Machine Learning with EM

The inner loop of the EM algorithm

• E-step: calculate

$$Q(\theta, \theta^t) = \sum_{i=1}^{n} \sum_{y_i} P(y_i \mid x_i, \theta^t) \log P(x_i, y_i \mid \theta)$$

• M-step: find

$$\theta^{(t+1)} = \arg\max_{\theta} Q(\theta, \theta^t)$$

Page 21: Machine Learning with EM

L(θ) is non-decreasing at each iteration

• The EM algorithm will produce a sequence

$$\theta^0, \theta^1, \ldots, \theta^t, \ldots$$

• It can be proved that

$$l(\theta^0) \le l(\theta^1) \le \ldots \le l(\theta^t) \le \ldots$$

Page 22: Machine Learning with EM

The inner loop of the Generalized EM algorithm (GEM)

• E-step: calculate

$$Q(\theta, \theta^t) = \sum_{i=1}^{n} \sum_{y_i} P(y_i \mid x_i, \theta^t) \log P(x_i, y_i \mid \theta)$$

• M-step: instead of a full maximization, find any θ(t+1) that improves the Q function:

$$Q(\theta^{(t+1)}, \theta^t) \ge Q(\theta^t, \theta^t)$$

Page 23: Machine Learning with EM

Recap of the EM algorithm

Page 24: Machine Learning with EM

Idea #1: find θ that maximizes the likelihood of training data

$$\theta_{ML} = \arg\max_{\theta} L(\theta) = \arg\max_{\theta} \log P(X \mid \theta)$$

Page 25: Machine Learning with EM

Idea #2: find the θt sequence

There is no analytical solution, so use an iterative approach: find a sequence

$$\theta^0, \theta^1, \ldots, \theta^t, \ldots \quad \text{s.t.} \quad l(\theta^0) \le l(\theta^1) \le \ldots \le l(\theta^t) \le \ldots$$

Page 26: Machine Learning with EM

Idea #3: find θt+1 that maximizes a tight lower bound of l(θ) − l(θt)

$$l(\theta) - l(\theta^t) \ge \sum_{i=1}^{n} E_{P(y \mid x_i, \theta^t)} \left[ \log \frac{P(x_i, y \mid \theta)}{P(x_i, y \mid \theta^t)} \right] \qquad \text{(a tight lower bound)}$$

Page 27: Machine Learning with EM

Idea #4: find θt+1 that maximizes the Q function

$$\begin{aligned}
\theta^{(t+1)} &= \arg\max_{\theta} \sum_{i=1}^{n} E_{P(y \mid x_i, \theta^t)} \left[ \log \frac{p(x_i, y \mid \theta)}{p(x_i, y \mid \theta^t)} \right] && \text{(lower bound of } l(\theta) - l(\theta^t)\text{)} \\
&= \arg\max_{\theta} \sum_{i=1}^{n} E_{P(y \mid x_i, \theta^t)} \left[ \log P(x_i, y \mid \theta) \right] && \text{(the Q function)}
\end{aligned}$$

Page 28: Machine Learning with EM

The EM algorithm

• Start with an initial estimate θ0
• Repeat until convergence
  – E-step: calculate

$$Q(\theta, \theta^t) = \sum_{i=1}^{n} \sum_{y_i} P(y_i \mid x_i, \theta^t) \log P(x_i, y_i \mid \theta)$$

  – M-step: find

$$\theta^{(t+1)} = \arg\max_{\theta} Q(\theta, \theta^t)$$
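To make the loop concrete, here is a minimal sketch of EM for a classic two-coin mixture (not from the original slides; the data, names, and starting values are illustrative assumptions). One of two biased coins is picked uniformly at random per trial and tossed several times; the coin identity is the hidden y, the head count is the observed x, and θ = (θA, θB):

    // Minimal EM sketch for a two-coin mixture (illustrative only).
    public class TwoCoinEM {
        public static void main(String[] args) {
            int[] heads = {5, 9, 8, 4, 7};     // observed heads per trial (toy data)
            int tosses = 10;                   // flips per trial
            double thetaA = 0.6, thetaB = 0.5; // initial estimate theta^0

            for (int t = 0; t < 20; t++) {     // repeat until (approximate) convergence
                double hA = 0, nA = 0, hB = 0, nB = 0;
                for (int h : heads) {
                    // E-step: posterior P(y = A | x, theta^t) via Bayes' rule
                    double likeA = Math.pow(thetaA, h) * Math.pow(1 - thetaA, tosses - h);
                    double likeB = Math.pow(thetaB, h) * Math.pow(1 - thetaB, tosses - h);
                    double wA = likeA / (likeA + likeB);
                    // Accumulate expected counts (the sufficient statistics of Q)
                    hA += wA * h;       nA += wA * tosses;
                    hB += (1 - wA) * h; nB += (1 - wA) * tosses;
                }
                // M-step: argmax of Q is a weighted MLE, just like p = m/N
                thetaA = hA / nA;
                thetaB = hB / nB;
            }
            System.out.printf("thetaA = %.3f, thetaB = %.3f%n", thetaA, thetaB);
        }
    }

Each iteration performs the E-step (posterior weights for the hidden coin) and the M-step (weighted MLE), so l(θ) is non-decreasing, as the earlier slide states.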

Page 29: Machine Learning with EM

An EM Example

Page 30: Machine Learning with EM
Page 31: Machine Learning with EM

E-step

Page 32: Machine Learning with EM

M-step

Page 33: Machine Learning with EM

Apache Mahout

Industrial Strength Machine Learning, May 2008

Page 34: Machine Learning with EM

Current Situation
• Large volumes of data are now available
• Platforms now exist to run computations over large datasets (Hadoop, HBase)
• Sophisticated analytics are needed to turn data into information people can use
• Active research community and proprietary implementations of “machine learning” algorithms
• The world needs scalable implementations of ML under an open license: ASF

Page 35: Machine Learning with EM

History of Mahout

• Summer 2007
  – Developers needed scalable ML
  – Mailing list formed
• Community formed
  – Apache contributors
  – Academia & industry
  – Lots of initial interest
• Project formed under Apache Lucene
  – January 25, 2008

Page 36: Machine Learning with EM

Current Code Base
• Matrix & Vector library
  – Memory-resident sparse & dense implementations
• Clustering
  – Canopy
  – K-Means
  – Mean Shift
• Collaborative Filtering
  – Taste
• Utilities
  – Distance Measures
  – Parameters

Page 37: Machine Learning with EM

Under Development

• Naïve Bayes
• Perceptron
• PLSI/EM
• Genetic Programming
• Dirichlet Process Clustering
• Clustering Examples
• Hama (Incubator) for very large arrays

Page 38: Machine Learning with EM

Appendix

• Sean Owen, Robin Anil, Ted Dunning, and Ellen Friedman, Mahout in Action, Manning Publications, Pap/Psc edition (October 14, 2011).
• From “Mahout Hands On” by Ted Dunning and Robin Anil, OSCON 2011, Portland.

Page 39: Machine Learning with EM

Step 1 – Convert dataset into a Hadoop Sequence File

• http://www.daviddlewis.com/resources/testcollections/reuters21578/reuters21578.tar.gz

• Download (8.2 MB) and extract the SGML files:
  – $ mkdir -p mahout-work/reuters-sgm
  – $ cd mahout-work/reuters-sgm && tar xzf ../reuters21578.tar.gz && cd .. && cd ..

• Extract content from SGML to text files:
  – $ bin/mahout org.apache.lucene.benchmark.utils.ExtractReuters mahout-work/reuters-sgm mahout-work/reuters-out

Page 40: Machine Learning with EM

Step 1 – Convert dataset into a Hadoop Sequence File

• Use the seqdirectory tool to convert the text files into a Hadoop Sequence File:
  – $ bin/mahout seqdirectory \
        -i mahout-work/reuters-out \
        -o mahout-work/reuters-out-seqdir \
        -c UTF-8 -chunk 5

Page 41: Machine Learning with EM

Hadoop Sequence File
• A sequence of records, where each record is a <Key, Value> pair
  – <Key1, Value1>
  – <Key2, Value2>
  – …
  – <Keyn, Valuen>
• Key and Value need to be of class org.apache.hadoop.io.Text
  – Key = record name, file name, or unique identifier
  – Value = content as a UTF-8 encoded string
• TIP: dump data from your database directly into Hadoop Sequence Files (see next slide)

Page 42: Machine Learning with EM

Writing to Sequence Files

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    // Write one record per document: key = document id, value = document content.
    // `documents` and MAX_DOCS are pseudo-helpers from the original slide.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("testdata/part-00000");
    SequenceFile.Writer writer =
        new SequenceFile.Writer(fs, conf, path, Text.class, Text.class);
    for (int i = 0; i < MAX_DOCS; i++) {
      writer.append(new Text(documents(i).Id()),
                    new Text(documents(i).Content()));
    }
    writer.close();
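For completeness, reading the records back uses the matching reader API (a minimal sketch, assuming the same Text/Text schema as above):

    // Iterate over all <Text, Text> records in the file written above
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
    Text key = new Text();
    Text value = new Text();
    while (reader.next(key, value)) {
      System.out.println(key + " => " + value.toString().length() + " chars");
    }
    reader.close();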

Page 43: Machine Learning with EM

Generate Vectors from Sequence Files

• Steps
  1. Compute dictionary
  2. Assign integers for words
  3. Compute feature weights
  4. Create a vector for each document using the word-integer mapping and feature weights

Or

• Simply run $ bin/mahout seq2sparse (a conceptual sketch of these steps follows below)
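To make the four steps concrete, here is a tiny in-memory sketch of the same idea (illustrative only; the real seq2sparse job runs as MapReduce over the sequence files, and the class name and data below are made up):

    import java.util.*;

    public class TinyVectorizer {
        public static void main(String[] args) {
            List<String[]> docs = Arrays.asList(
                new String[]{"oil", "price", "rises"},
                new String[]{"oil", "workers", "strike"});

            // Steps 1-2: compute the dictionary and assign integer ids to words
            Map<String, Integer> dict = new LinkedHashMap<>();
            for (String[] doc : docs)
                for (String w : doc)
                    dict.putIfAbsent(w, dict.size());

            // Steps 3-4: compute feature weights (raw term frequency here) and
            // build one sparse vector per document via the word-integer mapping
            for (String[] doc : docs) {
                Map<Integer, Double> vec = new TreeMap<>();
                for (String w : doc)
                    vec.merge(dict.get(w), 1.0, Double::sum);
                System.out.println(vec);  // e.g. {0=1.0, 1=1.0, 2=1.0}
            }
        }
    }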

Page 44: Machine Learning with EM

Generate Vectors from Sequence Files

• $ bin/mahout seq2sparse \
      -i mahout-work/reuters-out-seqdir/ \
      -o mahout-work/reuters-out-seqdir-sparse-kmeans
• Important options
  – N-grams
  – Lucene Analyzer for tokenizing
  – Feature pruning
    • Min support
    • Max document frequency
    • Min LLR (for n-grams)
  – Weighting method
    • TF vs. TF-IDF
    • lp-norm
    • Log-normalize length
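For reference (the standard definitions, not spelled out on the slide): with tf(t, d) the count of term t in document d, df(t) the number of documents containing t, and N the corpus size,

$$\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \log \frac{N}{\mathrm{df}(t)}$$

The lp-norm and log-normalize options then control how the resulting document vectors are rescaled.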

Page 45: Machine Learning with EM

Start K-Means clustering
• $ bin/mahout kmeans \
      -i mahout-work/reuters-out-seqdir-sparse-kmeans/tfidf-vectors/ \
      -c mahout-work/reuters-kmeans-clusters \
      -o mahout-work/reuters-kmeans \
      -dm org.apache.mahout.distance.CosineDistanceMeasure -cd 0.1 \
      -x 10 -k 20 -ow
• Things to watch out for
  – Number of iterations
  – Convergence delta
  – Distance measure
  – Creating assignments

Page 46: Machine Learning with EM

Inspect clusters
• $ bin/mahout clusterdump \
      -s mahout-work/reuters-kmeans/clusters-9 \
      -d mahout-work/reuters-out-seqdir-sparse-kmeans/dictionary.file-0 \
      -dt sequencefile -b 100 -n 20

Typical output:

    VL-21438{n=518 c=[0.56:0.019, 00:0.154, 00.03:0.018, 00.18:0.018, …
    Top Terms:
      iran    => 3.1861672217321213
      strike  => 2.567886952727918
      iranian => 2.133417966282966
      union   => 2.116033937940266
      said    => 2.101773806290277
      workers => 2.066259451354332
      gulf    => 1.9501374918521601
      had     => 1.6077752463145605
      he      => 1.5355078004962228