Machine Learning Gladys Castillo, UA
Naive Bayes
Gladys Castillo
University of Aveiro
The Supervised Classification Problem

A classifier is a function f: X → C that assigns a class label c ∈ C = {c1, …, cm} to objects described by a set of attributes X = {X1, X2, …, Xn}.

Given: a dataset D with N labeled examples <x(i), c(i)> of <X, C>.
Build: a classifier, a hypothesis hC: X → C, that can correctly predict the class labels of new objects.

Learning phase: a supervised learning algorithm takes the dataset D = {<x(1), c(1)>, <x(2), c(2)>, …, <x(N), c(N)>} as input and outputs the hypothesis hC.
Classification phase: the class attached to a new example x(N+1) is c(N+1) = hC(x(N+1)) ∈ C. Inputs: the attribute values of x(N+1); output: the class of x(N+1).

[Figure: training examples of two classes ("give credit" vs. "don't give credit") plotted in the space of the attributes X1 and X2.]
Statistical Classifiers

Treat the attributes X = {X1, X2, …, Xn} and the class C as random variables. A random variable is defined by its probability density function.

Instead of having the map X → C, we have X → P(C | X): a statistical classifier gives the probability P(cj | x) that x belongs to a particular class, rather than a hard classification. The class c* attached to an example x is the class with the largest posterior probability P(cj | x).

[Figure: probability density function f(x) of a random variable and a few observations.]
Bayesian Classifiers

"Bayesian" because the class c* attached to an example x is determined by Bayes' theorem. Starting from the two factorizations of the joint distribution,

P(X, C) = P(X | C) · P(C)
P(X, C) = P(C | X) · P(X)

we obtain Bayes' theorem:

P(C | X) = P(C) · P(X | C) / P(X)

Bayes' theorem is the main tool in Bayesian inference: we can combine the prior distribution and the likelihood of the observed data in order to derive the posterior distribution.
Bayes Theorem Example

Given:
- A doctor knows that meningitis causes stiff neck 50% of the time.
- The prior probability of any patient having meningitis is 1/50,000.
- The prior probability of any patient having stiff neck is 1/20.

If a patient has a stiff neck, what is the probability that he/she has meningitis?

P(M | S) = P(S | M) · P(M) / P(S) = (0.5 × 1/50,000) / (1/20) = 0.0002

posterior ∝ prior × likelihood:  P(C | X) = P(C) · P(X | C) / P(X)

Before observing the data, our prior beliefs can be expressed in a prior probability distribution that represents the knowledge we have about the unknown features. After observing the data, our revised beliefs are captured by a posterior distribution over the unknown features.
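The meningitis computation above can be reproduced directly; a minimal sketch:

```python
# Bayes' theorem applied to the meningitis example from the slide:
# P(M | S) = P(S | M) * P(M) / P(S)
p_s_given_m = 0.5        # likelihood: stiff neck given meningitis
p_m = 1 / 50_000         # prior: meningitis
p_s = 1 / 20             # prior: stiff neck

p_m_given_s = p_s_given_m * p_m / p_s
print(p_m_given_s)       # 0.0002
```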
Bayesian Classifier

How to determine P(cj | x) for each class cj? Bayes' theorem:

P(cj | x) = P(cj) · P(x | cj) / P(x)

P(x) can be ignored because it is the same for all the classes (a normalization constant).

Maximum a posteriori classification: classify x into the class which has the largest posterior probability.

Bayesian classification rule:

h*Bayes(x) = argmax_{j=1..m} P(cj) · P(x | cj)
Naïve Bayes (NB) Classifier
Duda and Hart (1973); Langley (1992)

"Bayes" because the class c* attached to an example x is determined by Bayes' theorem:

h*Bayes(x) = argmax_{j=1..m} P(cj) · P(x | cj)

When the attribute space is high dimensional, direct estimation of P(x | cj) is hard unless we introduce some assumptions.

"Naïve" because of its very naïve independence assumption: all the attributes are conditionally independent given the class. Then P(x | cj) can be decomposed into a product of n terms, one term for each attribute, giving the NB classification rule:

h*NB(x) = argmax_{j=1..m} P(cj) · ∏_{i=1..n} P(Xi = xi | cj)
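The NB classification rule can be sketched in a few lines. The model below (priors and conditional tables) is a hypothetical two-attribute example, not taken from the slides:

```python
# A minimal sketch of the NB classification rule
# h*_NB(x) = argmax_j P(c_j) * prod_i P(X_i = x_i | c_j),
# assuming the model is given as a prior and per-attribute conditional tables.
def nb_classify(x, priors, cond):
    # cond[c][i] is a dict mapping a value of attribute i to P(X_i = value | c)
    scores = {}
    for c, prior in priors.items():
        score = prior
        for i, value in enumerate(x):
            score *= cond[c][i].get(value, 0.0)
        scores[c] = score
    return max(scores, key=scores.get)

# hypothetical two-attribute model
priors = {"+": 0.6, "-": 0.4}
cond = {"+": [{"a": 0.7, "b": 0.3}, {"y": 0.8, "n": 0.2}],
        "-": [{"a": 0.2, "b": 0.8}, {"y": 0.1, "n": 0.9}]}
print(nb_classify(("a", "y"), priors, cond))  # +
```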
Naïve Bayes (NB) Learning Phase (Statistical Parameter Estimation)

Given a training dataset D of N labeled examples (assuming complete data):

1. Estimate P(cj) for each class cj:

   P̂(cj) = Nj / N

2. Estimate P(Xi = xk | cj) for each value xk of the attribute Xi and for each class cj:

   - Xi discrete:

     P̂(Xi = xk | cj) = Nijk / Nj

   - Xi continuous, two options:
     - The attribute is discretized and then treated as a discrete attribute.
     - A Normal distribution is usually assumed:

       P(Xi = xk | cj) = g(xk; μij, σij),  with  g(x; μ, σ) = 1/(√(2π) σ) · e^{-(x-μ)² / (2σ²)}

       The mean μij and the standard deviation σij are estimated from D.

where Nj is the number of examples of the class cj, and Nijk is the number of examples of the class cj having the value xk for the attribute Xi.
Continuous Attributes: Normal or Gaussian Distribution

Estimate P(Xi = xk | cj) for a value of the attribute Xi and for each class cj. A Normal distribution is usually assumed:

Xi | cj ~ N(μij, σ²ij), where the mean μij and the standard deviation σij are estimated from D:

P(Xi = xk | cj) = g(xk; μij, σij),  g(x; μ, σ) = 1/(√(2π) σ) · e^{-(x-μ)² / (2σ²)}

The probability density function f(x) is symmetrical around its mean.

Example: for a variable X ~ N(74, 36), the density at the value 66 is given by f(66) = g(66; 74, 6) ≈ 0.0273.

[Figure: density of N(0, 2), symmetric around its mean 0.]
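The density value quoted above can be checked with a direct implementation of g(x; μ, σ):

```python
import math

# Gaussian density g(x; mu, sigma) as defined on the slide.
def gaussian_pdf(x, mu, sigma):
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

# X ~ N(74, 36), i.e. mu = 74 and variance 36, so sigma = 6:
print(round(gaussian_pdf(66, 74, 6), 4))  # 0.0273
```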
Naïve Bayes Probability Estimates

Binary classification problem. Two classes: + (positive) and - (negative). Two attributes: X1, discrete, which takes values a and b; X2, continuous.

Training dataset D:

Class  X1  X2
+      a   1.0
+      b   1.2
+      a   3.0
-      b   4.4
-      b   4.5

1. Estimate P(cj) for each class cj:

   P̂(C = +) = 3/5,  P̂(C = -) = 2/5

2. Estimate P(X1 = xk | cj) for each value of X1 and each class cj (X1 discrete):

   P̂(X1 = a | +) = 2/3,  P̂(X1 = b | +) = 1/3
   P̂(X1 = a | -) = 0/2,  P̂(X1 = b | -) = 2/2

For X2 (continuous), a Normal distribution is assumed, with μ2+ = 1.73, σ2+ = 1.10 and μ2- = 4.45, σ2- = 0.07:

P(X2 = x | +) = g(x; 1.73, 1.10),  P(X2 = x | -) = g(x; 4.45, 0.07)

Example from John & Langley (1995)
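The estimates above can be recomputed from the five training rows; note that the slide's standard deviations match the sample (n-1) estimator:

```python
import statistics

# Parameter estimation for the toy dataset from John & Langley (1995).
data = [("+", "a", 1.0), ("+", "b", 1.2), ("+", "a", 3.0),
        ("-", "b", 4.4), ("-", "b", 4.5)]

classes = {c for c, _, _ in data}
n = len(data)
for c in sorted(classes):
    rows = [(x1, x2) for cls, x1, x2 in data if cls == c]
    prior = len(rows) / n                                    # P(c)
    p_a = sum(1 for x1, _ in rows if x1 == "a") / len(rows)  # P(X1 = a | c)
    mu = statistics.mean(x2 for _, x2 in rows)               # mean of X2 | c
    sigma = statistics.stdev(x2 for _, x2 in rows)           # sample std of X2 | c
    print(c, prior, p_a, round(mu, 2), round(sigma, 2))
```

For the + class this prints prior 0.6, P(X1 = a | +) ≈ 0.67, μ = 1.73 and σ = 1.1; for the - class, prior 0.4, μ = 4.45 and σ = 0.07, matching the slide.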
Probability Estimates: Discrete Attributes

Training dataset (adapted from © Tan, Steinbach, Kumar, Introduction to Data Mining):

Tid  Refund  Marital Status  Taxable Income  Evade
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Refund and Marital Status are categorical, Taxable Income is continuous, and Evade is the class.

For each class: P̂(cj) = Nj / N

P(No) = 7/10,  P(Yes) = 3/10

For each attribute value and class: P̂(Xi = xk | cj) = Nijk / Nj. To compute Nijk we need to count the number of examples of the class cj having the value xk for the attribute Xi. Examples:

P(Status = Married | No) = 4/7
P(Refund = Yes | Yes) = 0
Probability Estimates: Continuous Attributes

Using the same training dataset, a Normal distribution is assumed for each pair attribute-class (Xi, cj):

P(Xi = xk | cj) = g(xk; μij, σij) = 1/(√(2π) σij) · e^{-(xk-μij)² / (2σ²ij)}

Example, for (Income, Evade = No):

sample mean = 110
sample standard deviation = 54.54

P(Income = 120 | No) = 1/(√(2π) · 54.54) · e^{-(120-110)² / (2 · 2975)} ≈ 0.0072
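The estimate above can be reproduced from the seven Evade = No incomes in the table:

```python
import math

# Reproducing the slide's estimate P(Income = 120 | Evade = No).
incomes_no = [125, 100, 70, 120, 60, 220, 75]  # Taxable Income (in K) of the 7 "No" rows

mu = sum(incomes_no) / len(incomes_no)                                 # 110.0
var = sum((x - mu) ** 2 for x in incomes_no) / (len(incomes_no) - 1)   # 2975.0 (sample variance)
sigma = math.sqrt(var)                                                 # about 54.54

density = math.exp(-((120 - mu) ** 2) / (2 * var)) / (math.sqrt(2 * math.pi) * sigma)
print(round(density, 4))  # 0.0072
```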
The Balance-scale Problem
Balance-scale dataset from the UCI repository

The dataset was generated to model psychological experimental results. Each example has 4 numerical attributes:
- the left weight (Left_W)
- the left distance (Left_D)
- the right weight (Right_W)
- the right distance (Right_D)

Each example is classified into 3 classes, describing the balance scale:
- tip to the right (Right)
- tip to the left (Left)
- is balanced (Balanced)

3 target rules:
- If LD x LW > RD x RW, tip to the left
- If LD x LW < RD x RW, tip to the right
- If LD x LW = RD x RW, it is balanced
The Balance-scale Problem

Balance-Scale dataset (discretization is applied: each attribute is mapped to 5 intervals):

Left_W  Left_D  Right_W  Right_D  Class
1       5       4        2        Right
2       5       3        2        Left
3       4       6        2        Balanced
...     ...     ...      ...      ...

Adapted from © João Gama's slides "Aprendizagem Bayesiana"
The Balance-scale Problem: Learning Phase

Build contingency tables (565 examples):

Class counters: Left 260, Balanced 45, Right 260, Total 565

Attribute: Left_W
Class     I1  I2  I3  I4  I5
Left      14  42  61  71  72
Balanced  10   8   8  10   9
Right     86  66  49  34  25

Attribute: Left_D
Class     I1  I2  I3  I4  I5
Left      16  38  59  70  77
Balanced   8  10   9  10   8
Right     90  57  49  37  27

Attribute: Right_W
Class     I1  I2  I3  I4  I5
Left      87  63  49  33  28
Balanced   8  10  10   9   8
Right     16  37  58  70  79

Attribute: Right_D
Class     I1  I2  I3  I4  I5
Left      91  65  44  35  25
Balanced   8  10   8  10   9
Right     17  37  57  67  82

Assuming complete data, the computation of all the required estimates requires a simple scan through the data, an operation of time complexity O(N x n), where N is the number of training examples and n is the number of attributes.
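The single-scan counting can be sketched as follows; the three rows are the example balance-scale rows shown earlier, standing in for the full 565-example dataset:

```python
from collections import Counter, defaultdict

# A sketch of the single scan that fills the class counters and the
# per-attribute contingency tables in O(N x n) time.
data = [((1, 5, 4, 2), "Right"),
        ((2, 5, 3, 2), "Left"),
        ((3, 4, 6, 2), "Balanced")]

class_counts = Counter()
table = defaultdict(Counter)      # table[i][(value, class)] = N_ijk for attribute i
for x, c in data:                 # one pass over the N examples
    class_counts[c] += 1
    for i, value in enumerate(x): # n counter updates per example
        table[i][(value, c)] += 1

print(class_counts["Right"], table[1][(5, "Right")])  # 1 1
```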
The Balance-scale Problem: Classification Phase

How does NB classify the example x = (Left_W = 1, Left_D = 5, Right_W = 4, Right_D = 2)?

The class counters and contingency tables are used to compute the posterior probabilities for each class. We need to estimate P(cj | x) for each class:

P(cj | x) ∝ P(cj) x P(Left_W = 1 | cj) x P(Left_D = 5 | cj) x P(Right_W = 4 | cj) x P(Right_D = 2 | cj),  cj ∈ {Left, Balanced, Right}

The class attached to this example is the class with the largest posterior probability:

P(Left | x)  P(Balanced | x)  P(Right | x)
0.277796     0.135227         0.586978

max → Class = Right
Iris Dataset
Iris dataset from the UCI repository, from the statistician Douglas Fisher.

Three flower types (classes): Setosa, Virginica, Versicolour.

Four continuous attributes: sepal width and length, petal width and length.

[Photo: Virginica. Robert H. Mohlenbrock. USDA NRCS. 1995. Northeast wetland flora: Field office guide to plant species. Northeast National Technical Center, Chester, PA. Courtesy of USDA NRCS Wetland Science Institute.]
Scatter Plot Array of Iris Attributes

The attributes petal width and petal length provide a moderate separation of the Iris species.
Naïve Bayes: Iris Dataset – Continuous Attributes

Normal probability density functions for the attribute PetalWidth: P(PetalWidth | Setosa), P(PetalWidth | Versicolor), P(PetalWidth | Virginica)

Class Iris-setosa: mean 0.244, standard deviation 0.107
Class Iris-versicolor: mean 1.326, standard deviation 0.198
Class Iris-virginica: mean 2.026, standard deviation 0.275
Machine Learning Gladys Castillo, U.A. 22
Class Iris-versicolor (0.327): ============================== Attribute sepallength mean: 5.936, standard deviation: 0.516 Attribute sepalwidth mean: 2.770, standard deviation: 0.314 Attribute petallength mean: 4.260, standard deviation: 0.470 Attribute petalwidth mean: 1.326, standard deviation: 0.198
Naïve Bayes Iris Dataset – Continuous Attributes
NB Classifier Model
Class Iris-setosa (0.327): ========================== Attribute sepallength mean: 5.006, standard deviation: 0.352 Attribute sepalwidth mean: 3.418, standard deviation: 0.381 Attribute petallength mean: 1.464, standard deviation: 0.174 Attribute petalwidth mean: 0.244, standard deviation: 0.107
Class Iris-virginica (0.327): ============================= Attribute sepallength mean: 6.588, standard deviation: 0.636 Attribute sepalwidth mean: 2.974, standard deviation: 0.322 Attribute petallength mean: 5.552, standard deviation: 0.552 Attribute petalwidth mean: 2.026, standard deviation: 0.275
Naïve Bayes: Iris Dataset – Continuous Attributes
Classification Phase

[Screenshot: posterior probabilities per example; the predicted class takes the maximum value.]
Naïve Bayes: Iris Dataset – Continuous Attributes

How does NB classify the example x = (sepalLength = 5, sepalWidth = 3, petalLength = 2, petalWidth = 2)? Estimate P(cj | x) for each class:

P(cj | x) ∝ P(cj) x P(sepalLength = 5 | cj) x P(sepalWidth = 3 | cj) x P(petalLength = 2 | cj) x P(petalWidth = 2 | cj),  cj ∈ {setosa, versicolor, virginica}

The class attached to this example is the class with the largest posterior probability:

P(setosa | x)  P(versicolor | x)  P(virginica | x)
0              0.995              0.005

max → Class = versicolor
Naïve Bayes: Iris Dataset – Continuous Attributes

Estimate P(cj | x) for the versicolor class. From the model, Class Iris-versicolor (0.327): attribute sepallength mean 5.936, standard deviation 0.516; sepalwidth mean 2.770, standard deviation 0.314; petallength mean 4.260, standard deviation 0.470; petalwidth mean 1.326, standard deviation 0.198.

P(versicolor | x) ∝ P(versicolor) x P(sepalLength = 5 | versicolor) x P(sepalWidth = 3 | versicolor) x P(petalLength = 2 | versicolor) x P(petalWidth = 2 | versicolor)
                 = 0.327 x g(5; 5.936, 0.516) x g(3; 2.770, 0.314) x g(2; 4.260, 0.470) x g(2; 1.326, 0.198)
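Carrying this computation out for all three classes and normalizing reproduces the posteriors quoted on the earlier slide:

```python
import math

# Posteriors for x = (5, 3, 2, 2) under the Gaussian NB model
# (per-class prior and per-attribute mean/standard deviation).
def g(x, mu, sigma):
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

model = {  # class -> (prior, [(mean, std) per attribute])
    "setosa":     (0.327, [(5.006, 0.352), (3.418, 0.381), (1.464, 0.174), (0.244, 0.107)]),
    "versicolor": (0.327, [(5.936, 0.516), (2.770, 0.314), (4.260, 0.470), (1.326, 0.198)]),
    "virginica":  (0.327, [(6.588, 0.636), (2.974, 0.322), (5.552, 0.552), (2.026, 0.275)]),
}
x = (5, 3, 2, 2)

scores = {c: prior * math.prod(g(v, m, s) for v, (m, s) in zip(x, params))
          for c, (prior, params) in model.items()}
total = sum(scores.values())
for c, s in scores.items():
    print(c, round(s / total, 3))  # setosa 0.0, versicolor 0.995, virginica 0.005
```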
Naïve Bayes: Continuous Attributes – Discretization

For continuous attributes, discretization can be applied instead of assuming a Normal distribution.

[Screenshot: the Iris attributes after BinDiscretization, bins = 3.]
Naïve Bayes: Continuous Attributes – Discretization
NB Classifier Model

Class Iris-setosa (0.327):
  Attribute sepallength: range1 0.940, range2 0.060, range3 0.000
  Attribute sepalwidth: range1 0.020, range2 0.720, range3 0.260
  Attribute petallength: range1 1.000, range2 0.000, range3 0.000
  Attribute petalwidth: range1 1.000, range2 0.000, range3 0.000

Class Iris-versicolor (0.327):
  Attribute sepallength: range1 0.220, range2 0.720, range3 0.060
  Attribute sepalwidth: range1 0.460, range2 0.540, range3 0.000
  Attribute petallength: range1 0.000, range2 0.960, range3 0.040
  Attribute petalwidth: range1 0.000, range2 0.980, range3 0.020

Class Iris-virginica (0.327):
  Attribute sepallength: range1 0.020, range2 0.640, range3 0.340
  Attribute sepalwidth: range1 0.380, range2 0.580, range3 0.040
  Attribute petallength: range1 0.000, range2 0.120, range3 0.880
  Attribute petalwidth: range1 0.000, range2 0.100, range3 0.900
Naïve Bayes: Continuous Attributes – Discretization

From the model we can build a conditional probability table (CPT) for each attribute:

Class Probabilities
Setosa  Versicolor  Virginica
0.327   0.327       0.327

Attribute: Sepallength
            Range1  Range2  Range3
Setosa      0.940   0.060   0.000
Versicolor  0.220   0.720   0.060
Virginica   0.020   0.640   0.340

Attribute: Sepalwidth
            Range1  Range2  Range3
Setosa      0.020   0.720   0.260
Versicolor  0.460   0.540   0.000
Virginica   0.380   0.580   0.040
Attribute: Petallength
            Range1  Range2  Range3
Setosa      1.000   0.000   0.000
Versicolor  0.000   0.960   0.040
Virginica   0.000   0.100   0.900

Attribute: Petalwidth
            Range1  Range2  Range3
Setosa      1.000   0.000   0.000
Versicolor  0.000   0.980   0.020
Virginica   0.000   0.100   0.900
Naïve Bayes: Iris Dataset – Discretized Attributes
Classification (Implementation) Phase

[Screenshot: discretized examples and their posterior probabilities; the predicted class takes the maximum value.]
Naïve Bayes: Classification Phase

How to classify a new example?

Example: sepallength = 5, sepalwidth = 3, petallength = 2, petalwidth = 2.
With discretized attributes: sepallength = r1, sepalwidth = r2, petallength = r1, petalwidth = r3.

Compute the conditional posterior probabilities for each class:

P(setosa | x) = P(setosa) x P(sepalLength = r1 | setosa) x P(sepalWidth = r2 | setosa) x P(petalLength = r1 | setosa) x P(petalWidth = r3 | setosa)

P(versicolor | x) = P(versicolor) x P(sepalLength = r1 | versicolor) x P(sepalWidth = r2 | versicolor) x P(petalLength = r1 | versicolor) x P(petalWidth = r3 | versicolor)

P(virginica | x) = P(virginica) x P(sepalLength = r1 | virginica) x P(sepalWidth = r2 | virginica) x P(petalLength = r1 | virginica) x P(petalWidth = r3 | virginica)
Naïve Bayes: Continuous Attributes – Discretization

Using the class probabilities and the CPTs, for the discretized example (sepallength = r1, sepalwidth = r2, petallength = r1, petalwidth = r3):

P(setosa | x) = P(setosa) x P(sepalLength = r1 | setosa) x P(sepalWidth = r2 | setosa) x P(petalLength = r1 | setosa) x P(petalWidth = r3 | setosa)
              = 0.327 x 0.940 x 0.720 x 1.000 x 0.000 = 0
P(versicolor | x) = P(versicolor) x P(sepalLength = r1 | versicolor) x P(sepalWidth = r2 | versicolor) x P(petalLength = r1 | versicolor) x P(petalWidth = r3 | versicolor)
                  = 0.327 x 0.220 x 0.540 x 0.000 x 0.020 = 0
P(virginica | x) = P(virginica) x P(sepalLength = r1 | virginica) x P(sepalWidth = r2 | virginica) x P(petalLength = r1 | virginica) x P(petalWidth = r3 | virginica)
                 = 0.327 x 0.020 x 0.580 x 0.000 x 0.900 = 0

If all the class probabilities are zero, we cannot determine the class for this example.
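The zero-frequency failure can be demonstrated with the discretized Iris CPTs: for the example (r1, r2, r1, r3) every class score contains a zero factor, so the scores are all zero.

```python
import math

# Zero-frequency problem on the discretized Iris CPTs.
priors = {"setosa": 0.327, "versicolor": 0.327, "virginica": 0.327}
cpt = {  # class -> P(range | class) per attribute: sepallength, sepalwidth, petallength, petalwidth
    "setosa":     [{"r1": 0.94, "r2": 0.06, "r3": 0.00},
                   {"r1": 0.02, "r2": 0.72, "r3": 0.26},
                   {"r1": 1.00, "r2": 0.00, "r3": 0.00},
                   {"r1": 1.00, "r2": 0.00, "r3": 0.00}],
    "versicolor": [{"r1": 0.22, "r2": 0.72, "r3": 0.06},
                   {"r1": 0.46, "r2": 0.54, "r3": 0.00},
                   {"r1": 0.00, "r2": 0.96, "r3": 0.04},
                   {"r1": 0.00, "r2": 0.98, "r3": 0.02}],
    "virginica":  [{"r1": 0.02, "r2": 0.64, "r3": 0.34},
                   {"r1": 0.38, "r2": 0.58, "r3": 0.04},
                   {"r1": 0.00, "r2": 0.10, "r3": 0.90},
                   {"r1": 0.00, "r2": 0.10, "r3": 0.90}],
}
x = ("r1", "r2", "r1", "r3")

scores = {c: priors[c] * math.prod(t[v] for t, v in zip(cpt[c], x)) for c in priors}
print(scores)  # every score is 0.0
```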
Naïve Bayes: Laplace Correction

To avoid zero probabilities due to zero counters we can use the Laplace correction. To calculate the conditional probabilities, instead of using the estimate

P̂(Xi = xk | cj) = Nijk / Nj

we can use the Laplace correction:

P̂(Xi = xk | cj) = (Nijk + 1) / (Nj + k)

where:
- Nijk is the number of examples in D such that Xi = xk and C = cj
- Nj is the number of examples in D of class cj
- k is the number of possible values of Xi
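A one-line sketch of the corrected estimate; the counter values below are hypothetical, chosen only to show that a zero counter no longer yields a zero probability:

```python
# Laplace-corrected estimate of P(Xi = xk | cj), as on the slide:
# (N_ijk + 1) / (N_j + k), where k is the number of possible values of Xi.
def laplace_estimate(n_ijk, n_j, k):
    return (n_ijk + 1) / (n_j + k)

# Hypothetical counters: a value never seen among 50 examples of a class,
# for an attribute with k = 3 possible values:
print(round(laplace_estimate(0, 50, 3), 4))   # 0.0189, instead of 0
print(round(laplace_estimate(48, 50, 3), 4))  # 0.9245, instead of 48/50 = 0.96
```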
Bayesian SPAM Filtering

Binary classification problem: is an e-mail SPAM?
- 1 – is SPAM
- 0 – is not SPAM

57 real attributes: word frequencies, character frequencies, average frequencies of capital letters.

Spambase dataset from the UCI repository:

Spam  Not Spam  Total
1813  2788      4601
Bayesian SPAM Filtering: An Implementation using RapidMiner Studio 6.0.007

Greedy feature selection using "forward selection" and CFS(1) as the performance measure.

(1) Hall, M. A. (1998). Correlation-based Feature Subset Selection for Machine Learning. Thesis submitted in partial fulfillment of the requirements of the degree of Doctor of Philosophy at the University of Waikato.
Bayesian SPAM Filtering: k-fold Cross-Validation Results (k = 10)

Results without feature subset selection vs. results with feature subset selection: with only 11 attributes selected, the filter performs better.
Bayesian SPAM Filters: Confusion Matrix

Concept learning problem: is an e-mail SPAM?

                  Actual
Predicted    Yes (+)  No (-)
Yes (+)      TP       FP
No (-)       FN       TN

- True Positive (TP): number of e-mails classified as SPAM which truly are
- False Positive (FP): number of e-mails classified as SPAM which are valid
- False Negative (FN): number of e-mails classified as "Not SPAM" which are SPAM
- True Negative (TN): number of e-mails classified as "Not SPAM" which truly are

Spam Precision = TP / (TP + FP): percentage of e-mails classified as SPAM which truly are
Spam Recall = TPR = TP / (TP + FN): percentage of e-mails classified as SPAM over the total of SPAMs
Not Spam Precision = TN / (FN + TN): percentage of e-mails classified as "Not SPAM" which truly are
Not Spam Recall = TN / (FP + TN): percentage of e-mails classified as "Not SPAM" over the total of valid e-mails

Accuracy: 91.74%

Predicted      Spam (1)  Not Spam (0)  Precision
Spam (1)       1558      125           92.57%
Not Spam (0)   255       2633          91.25%
Recall         85.93%    95.52%

A good Spam filter must achieve high precision and recall, thus reducing the number of false positives.
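The Spam-class figures in the table can be recovered from the raw counts (rows are predictions, so TP = 1558, FP = 125, FN = 255):

```python
# Precision and recall for the Spam class of the confusion matrix above.
tp, fp, fn = 1558, 125, 255

spam_precision = tp / (tp + fp)   # fraction of predicted SPAM that truly is SPAM
spam_recall = tp / (tp + fn)      # fraction of all SPAM that was caught

print(round(100 * spam_precision, 2))  # 92.57
print(round(100 * spam_recall, 2))     # 85.93
```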
Naïve Bayes: Performance

NB is one of the simplest and most effective classifiers. Yet NB has a very strong, unrealistic independence assumption: all the attributes are conditionally independent given the value of the class.

In practice the independence assumption is violated → HIGH BIAS, which can lead to poor classification. However, NB is efficient due to its high variance management: fewer parameters → LOW VARIANCE.

[Chart: bias-variance decomposition of the test error of Naive Bayes on the Nursery dataset, for training sets of 500 to 12,500 examples.]
References

1. Castillo, G. (2006) Adaptive Learning Algorithms for Bayesian Network Classifiers. PhD Thesis. Chapter 3.3: Naïve Bayes Classifier.
2. Domingos, P. and Pazzani, M. (1997) On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning, 29(2-3):103-130.
3. Duda, R.O. and Hart, P.E. (1973) Pattern Classification and Scene Analysis. John Wiley & Sons, Inc.
4. John, G.H. and Langley, P. (1995) Estimating Continuous Distributions in Bayesian Classifiers. Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, pp. 338-345. Morgan Kaufmann, San Mateo.
5. Mitchell, T. (1997) Machine Learning. Chapter 6: Bayesian Learning. McGraw Hill.
6. Tan, P., Steinbach, M., Kumar, V. (2006) Introduction to Data Mining. Chapter 5: Classification - Alternative Techniques. Section 5.1 (pp. 166-180). Addison Wesley Higher Education.