Machine Learning Gladys Castillo, UA
Naive Bayes
Gladys Castillo
University of Aveiro
The Supervised Classification Problem

A classifier is a function f: X → C that assigns a class label c ∈ C = {c1, …, cm} to objects described by a set of attributes X = {X1, X2, …, Xn}.

Given: a dataset D with N labeled examples <x(i), c(i)> of <X, C>.
Build: a classifier, a hypothesis hC: X → C, that can correctly predict the class labels of new objects.

Learning phase: a supervised learning algorithm takes the dataset D = {<x(1), c(1)>, <x(2), c(2)>, …, <x(N), c(N)>} as input and outputs the hypothesis hC.
Classification phase: the class attached to a new example x(N+1) is c(N+1) = hC(x(N+1)) ∈ C. Inputs: the attribute values of x(N+1); output: the class of x(N+1).

[Figure: training examples of two classes ("give credit" vs. "don't give credit") plotted in the space of the attributes X1 and X2.]
Statistical Classifiers

Treat the attributes X = {X1, X2, …, Xn} and the class C as random variables. A random variable is defined by its probability density function.

Instead of having the map X → C, we have X → P(C | X): a statistical classifier gives the probability P(cj | x) that x belongs to a particular class, rather than a hard classification. The class c* attached to an example x is the class with the largest posterior probability P(cj | x).

[Figure: probability density function f(x) of a random variable and a few observations.]
Bayesian Classifiers

"Bayesian" because the class c* attached to an example x is determined by Bayes' theorem. Starting from the two factorizations of the joint distribution,

P(X, C) = P(X | C) · P(C)
P(X, C) = P(C | X) · P(X)

we obtain Bayes' theorem:

P(C | X) = P(C) · P(X | C) / P(X)

Bayes' theorem is the main tool in Bayesian inference: we can combine the prior distribution and the likelihood of the observed data in order to derive the posterior distribution.
Bayes Theorem Example

Given:
- A doctor knows that meningitis causes stiff neck 50% of the time.
- The prior probability of any patient having meningitis is 1/50,000.
- The prior probability of any patient having stiff neck is 1/20.

If a patient has a stiff neck, what is the probability that he/she has meningitis?

P(M | S) = P(S | M) · P(M) / P(S) = (0.5 × 1/50,000) / (1/20) = 0.0002

posterior ∝ prior × likelihood:  P(C | X) = P(C) · P(X | C) / P(X)

Before observing the data, our prior beliefs can be expressed in a prior probability distribution that represents the knowledge we have about the unknown features. After observing the data, our revised beliefs are captured by a posterior distribution over the unknown features.
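The meningitis computation above can be reproduced directly; a minimal sketch:

```python
# Bayes' theorem applied to the meningitis example from the slide:
# P(M | S) = P(S | M) * P(M) / P(S)
p_s_given_m = 0.5        # likelihood: stiff neck given meningitis
p_m = 1 / 50_000         # prior: meningitis
p_s = 1 / 20             # prior: stiff neck

p_m_given_s = p_s_given_m * p_m / p_s
print(p_m_given_s)       # 0.0002
```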
Bayesian Classifier

How to determine P(cj | x) for each class cj? Bayes' theorem:

P(cj | x) = P(cj) · P(x | cj) / P(x)

P(x) can be ignored because it is the same for all the classes (a normalization constant).

Maximum a posteriori classification: classify x into the class which has the largest posterior probability.

Bayesian classification rule:

h*Bayes(x) = argmax_{j=1..m} P(cj) · P(x | cj)
Naïve Bayes (NB) Classifier
Duda and Hart (1973); Langley (1992)

"Bayes" because the class c* attached to an example x is determined by Bayes' theorem:

h*Bayes(x) = argmax_{j=1..m} P(cj) · P(x | cj)

When the attribute space is high dimensional, direct estimation of P(x | cj) is hard unless we introduce some assumptions.

"Naïve" because of its very naïve independence assumption: all the attributes are conditionally independent given the class. Then P(x | cj) can be decomposed into a product of n terms, one term for each attribute, giving the NB classification rule:

h*NB(x) = argmax_{j=1..m} P(cj) · ∏_{i=1..n} P(Xi = xi | cj)
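The NB classification rule can be sketched in a few lines. The model below (priors and conditional tables) is a hypothetical two-attribute example, not taken from the slides:

```python
# A minimal sketch of the NB classification rule
# h*_NB(x) = argmax_j P(c_j) * prod_i P(X_i = x_i | c_j),
# assuming the model is given as a prior and per-attribute conditional tables.
def nb_classify(x, priors, cond):
    # cond[c][i] is a dict mapping a value of attribute i to P(X_i = value | c)
    scores = {}
    for c, prior in priors.items():
        score = prior
        for i, value in enumerate(x):
            score *= cond[c][i].get(value, 0.0)
        scores[c] = score
    return max(scores, key=scores.get)

# hypothetical two-attribute model
priors = {"+": 0.6, "-": 0.4}
cond = {"+": [{"a": 0.7, "b": 0.3}, {"y": 0.8, "n": 0.2}],
        "-": [{"a": 0.2, "b": 0.8}, {"y": 0.1, "n": 0.9}]}
print(nb_classify(("a", "y"), priors, cond))  # +
```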
Naïve Bayes (NB) Learning Phase (Statistical Parameter Estimation)

Given a training dataset D of N labeled examples (assuming complete data):

1. Estimate P(cj) for each class cj:

   P̂(cj) = Nj / N

2. Estimate P(Xi = xk | cj) for each value xk of the attribute Xi and for each class cj:

   - Xi discrete:

     P̂(Xi = xk | cj) = Nijk / Nj

   - Xi continuous, two options:
     - The attribute is discretized and then treated as a discrete attribute.
     - A Normal distribution is usually assumed:

       P(Xi = xk | cj) = g(xk; μij, σij),  with  g(x; μ, σ) = 1/(√(2π) σ) · e^{-(x-μ)² / (2σ²)}

       The mean μij and the standard deviation σij are estimated from D.

where Nj is the number of examples of the class cj, and Nijk is the number of examples of the class cj having the value xk for the attribute Xi.
Continuous Attributes: Normal or Gaussian Distribution

Estimate P(Xi = xk | cj) for a value of the attribute Xi and for each class cj. A Normal distribution is usually assumed:

Xi | cj ~ N(μij, σ²ij), where the mean μij and the standard deviation σij are estimated from D:

P(Xi = xk | cj) = g(xk; μij, σij),  g(x; μ, σ) = 1/(√(2π) σ) · e^{-(x-μ)² / (2σ²)}

The probability density function f(x) is symmetrical around its mean.

Example: for a variable X ~ N(74, 36), the density at the value 66 is given by f(66) = g(66; 74, 6) ≈ 0.0273.

[Figure: density of N(0, 2), symmetric around its mean 0.]
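The density value quoted above can be checked with a direct implementation of g(x; μ, σ):

```python
import math

# Gaussian density g(x; mu, sigma) as defined on the slide.
def gaussian_pdf(x, mu, sigma):
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

# X ~ N(74, 36), i.e. mu = 74 and variance 36, so sigma = 6:
print(round(gaussian_pdf(66, 74, 6), 4))  # 0.0273
```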
Naïve Bayes Probability Estimates

Binary classification problem. Two classes: + (positive) and - (negative). Two attributes: X1, discrete, which takes values a and b; X2, continuous.

Training dataset D:

Class  X1  X2
+      a   1.0
+      b   1.2
+      a   3.0
-      b   4.4
-      b   4.5

1. Estimate P(cj) for each class cj:

   P̂(C = +) = 3/5,  P̂(C = -) = 2/5

2. Estimate P(X1 = xk | cj) for each value of X1 and each class cj (X1 discrete):

   P̂(X1 = a | +) = 2/3,  P̂(X1 = b | +) = 1/3
   P̂(X1 = a | -) = 0/2,  P̂(X1 = b | -) = 2/2

For X2 (continuous), a Normal distribution is assumed, with μ2+ = 1.73, σ2+ = 1.10 and μ2- = 4.45, σ2- = 0.07:

P(X2 = x | +) = g(x; 1.73, 1.10),  P(X2 = x | -) = g(x; 4.45, 0.07)

Example from John & Langley (1995)
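The estimates above can be recomputed from the five training rows; note that the slide's standard deviations match the sample (n-1) estimator:

```python
import statistics

# Parameter estimation for the toy dataset from John & Langley (1995).
data = [("+", "a", 1.0), ("+", "b", 1.2), ("+", "a", 3.0),
        ("-", "b", 4.4), ("-", "b", 4.5)]

classes = {c for c, _, _ in data}
n = len(data)
for c in sorted(classes):
    rows = [(x1, x2) for cls, x1, x2 in data if cls == c]
    prior = len(rows) / n                                    # P(c)
    p_a = sum(1 for x1, _ in rows if x1 == "a") / len(rows)  # P(X1 = a | c)
    mu = statistics.mean(x2 for _, x2 in rows)               # mean of X2 | c
    sigma = statistics.stdev(x2 for _, x2 in rows)           # sample std of X2 | c
    print(c, prior, p_a, round(mu, 2), round(sigma, 2))
```

For the + class this prints prior 0.6, P(X1 = a | +) ≈ 0.67, μ = 1.73 and σ = 1.1; for the - class, prior 0.4, μ = 4.45 and σ = 0.07, matching the slide.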
Probability Estimates: Discrete Attributes

Training dataset (adapted from © Tan, Steinbach, Kumar, Introduction to Data Mining):

Tid  Refund  Marital Status  Taxable Income  Evade
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Refund and Marital Status are categorical, Taxable Income is continuous, and Evade is the class.

For each class: P̂(cj) = Nj / N

P(No) = 7/10,  P(Yes) = 3/10

For each attribute value and class: P̂(Xi = xk | cj) = Nijk / Nj. To compute Nijk we need to count the number of examples of the class cj having the value xk for the attribute Xi. Examples:

P(Status = Married | No) = 4/7
P(Refund = Yes | Yes) = 0
Probability Estimates: Continuous Attributes

Using the same training dataset, a Normal distribution is assumed for each pair attribute-class (Xi, cj):

P(Xi = xk | cj) = g(xk; μij, σij) = 1/(√(2π) σij) · e^{-(xk-μij)² / (2σ²ij)}

Example, for (Income, Evade = No):

sample mean = 110
sample standard deviation = 54.54

P(Income = 120 | No) = 1/(√(2π) · 54.54) · e^{-(120-110)² / (2 · 2975)} ≈ 0.0072
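The estimate above can be reproduced from the seven Evade = No incomes in the table:

```python
import math

# Reproducing the slide's estimate P(Income = 120 | Evade = No).
incomes_no = [125, 100, 70, 120, 60, 220, 75]  # Taxable Income (in K) of the 7 "No" rows

mu = sum(incomes_no) / len(incomes_no)                                 # 110.0
var = sum((x - mu) ** 2 for x in incomes_no) / (len(incomes_no) - 1)   # 2975.0 (sample variance)
sigma = math.sqrt(var)                                                 # about 54.54

density = math.exp(-((120 - mu) ** 2) / (2 * var)) / (math.sqrt(2 * math.pi) * sigma)
print(round(density, 4))  # 0.0072
```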
The Balance-scale Problem
Balance-scale dataset from the UCI repository

The dataset was generated to model psychological experimental results. Each example has 4 numerical attributes:
- the left weight (Left_W)
- the left distance (Left_D)
- the right weight (Right_W)
- the right distance (Right_D)

Each example is classified into 3 classes, describing the balance scale:
- tip to the right (Right)
- tip to the left (Left)
- is balanced (Balanced)

3 target rules:
- If LD x LW > RD x RW, tip to the left
- If LD x LW < RD x RW, tip to the right
- If LD x LW = RD x RW, it is balanced
The Balance-scale Problem

Balance-Scale dataset (discretization is applied: each attribute is mapped to 5 intervals):

Left_W  Left_D  Right_W  Right_D  Class
1       5       4        2        Right
2       5       3        2        Left
3       4       6        2        Balanced
...     ...     ...      ...      ...

Adapted from © João Gama's slides "Aprendizagem Bayesiana"
The Balance-scale Problem: Learning Phase

Build contingency tables (565 examples):

Class counters: Left 260, Balanced 45, Right 260, Total 565

Attribute: Left_W
Class     I1  I2  I3  I4  I5
Left      14  42  61  71  72
Balanced  10   8   8  10   9
Right     86  66  49  34  25

Attribute: Left_D
Class     I1  I2  I3  I4  I5
Left      16  38  59  70  77
Balanced   8  10   9  10   8
Right     90  57  49  37  27

Attribute: Right_W
Class     I1  I2  I3  I4  I5
Left      87  63  49  33  28
Balanced   8  10  10   9   8
Right     16  37  58  70  79

Attribute: Right_D
Class     I1  I2  I3  I4  I5
Left      91  65  44  35  25
Balanced   8  10   8  10   9
Right     17  37  57  67  82

Assuming complete data, the computation of all the required estimates requires a simple scan through the data, an operation of time complexity O(N x n), where N is the number of training examples and n is the number of attributes.
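The single-scan counting can be sketched as follows; the three rows are the example balance-scale rows shown earlier, standing in for the full 565-example dataset:

```python
from collections import Counter, defaultdict

# A sketch of the single scan that fills the class counters and the
# per-attribute contingency tables in O(N x n) time.
data = [((1, 5, 4, 2), "Right"),
        ((2, 5, 3, 2), "Left"),
        ((3, 4, 6, 2), "Balanced")]

class_counts = Counter()
table = defaultdict(Counter)      # table[i][(value, class)] = N_ijk for attribute i
for x, c in data:                 # one pass over the N examples
    class_counts[c] += 1
    for i, value in enumerate(x): # n counter updates per example
        table[i][(value, c)] += 1

print(class_counts["Right"], table[1][(5, "Right")])  # 1 1
```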
The Balance-scale Problem: Classification Phase

How does NB classify the example x = (Left_W = 1, Left_D = 5, Right_W = 4, Right_D = 2)?

The class counters and contingency tables are used to compute the posterior probabilities for each class. We need to estimate P(cj | x) for each class:

P(cj | x) ∝ P(cj) x P(Left_W = 1 | cj) x P(Left_D = 5 | cj) x P(Right_W = 4 | cj) x P(Right_D = 2 | cj),  cj ∈ {Left, Balanced, Right}

The class attached to this example is the class with the largest posterior probability:

P(Left | x)  P(Balanced | x)  P(Right | x)
0.277796     0.135227         0.586978

max → Class = Right
Iris Dataset
Iris dataset from the UCI repository, from the statistician Douglas Fisher.

Three flower types (classes): Setosa, Virginica, Versicolour.

Four continuous attributes: sepal width and length, petal width and length.

[Photo: Virginica. Robert H. Mohlenbrock. USDA NRCS. 1995. Northeast wetland flora: Field office guide to plant species. Northeast National Technical Center, Chester, PA. Courtesy of USDA NRCS Wetland Science Institute.]
Scatter Plot Array of Iris Attributes

The attributes petal width and petal length provide a moderate separation of the Iris species.
Naïve Bayes: Iris Dataset – Continuous Attributes

Normal probability density functions for the attribute PetalWidth: P(PetalWidth | Setosa), P(PetalWidth | Versicolor), P(PetalWidth | Virginica)

Class Iris-setosa: mean 0.244, standard deviation 0.107
Class Iris-versicolor: mean 1.326, standard deviation 0.198
Class Iris-virginica: mean 2.026, standard deviation 0.275
Machine Learning Gladys Castillo, U.A. 22
Class Iris-versicolor (0.327): ============================== Attribute sepallength mean: 5.936, standard deviation: 0.516 Attribute sepalwidth mean: 2.770, standard deviation: 0.314 Attribute petallength mean: 4.260, standard deviation: 0.470 Attribute petalwidth mean: 1.326, standard deviation: 0.198
Naïve Bayes Iris Dataset – Continuous Attributes
NB Classifier Model
Class Iris-setosa (0.327): ========================== Attribute sepallength mean: 5.006, standard deviation: 0.352 Attribute sepalwidth mean: 3.418, standard deviation: 0.381 Attribute petallength mean: 1.464, standard deviation: 0.174 Attribute petalwidth mean: 0.244, standard deviation: 0.107
Class Iris-virginica (0.327): ============================= Attribute sepallength mean: 6.588, standard deviation: 0.636 Attribute sepalwidth mean: 2.974, standard deviation: 0.322 Attribute petallength mean: 5.552, standard deviation: 0.552 Attribute petalwidth mean: 2.026, standard deviation: 0.275
Naïve Bayes: Iris Dataset – Continuous Attributes
Classification Phase

[Screenshot: posterior probabilities per example; the predicted class takes the maximum value.]
Naïve Bayes: Iris Dataset – Continuous Attributes

How does NB classify the example x = (sepalLength = 5, sepalWidth = 3, petalLength = 2, petalWidth = 2)? Estimate P(cj | x) for each class:

P(cj | x) ∝ P(cj) x P(sepalLength = 5 | cj) x P(sepalWidth = 3 | cj) x P(petalLength = 2 | cj) x P(petalWidth = 2 | cj),  cj ∈ {setosa, versicolor, virginica}

The class attached to this example is the class with the largest posterior probability:

P(setosa | x)  P(versicolor | x)  P(virginica | x)
0              0.995              0.005

max → Class = versicolor
Naïve Bayes: Iris Dataset – Continuous Attributes

Estimate P(cj | x) for the versicolor class. From the model, Class Iris-versicolor (0.327): attribute sepallength mean 5.936, standard deviation 0.516; sepalwidth mean 2.770, standard deviation 0.314; petallength mean 4.260, standard deviation 0.470; petalwidth mean 1.326, standard deviation 0.198.

P(versicolor | x) ∝ P(versicolor) x P(sepalLength = 5 | versicolor) x P(sepalWidth = 3 | versicolor) x P(petalLength = 2 | versicolor) x P(petalWidth = 2 | versicolor)
                 = 0.327 x g(5; 5.936, 0.516) x g(3; 2.770, 0.314) x g(2; 4.260, 0.470) x g(2; 1.326, 0.198)
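Carrying this computation out for all three classes and normalizing reproduces the posteriors quoted on the earlier slide:

```python
import math

# Posteriors for x = (5, 3, 2, 2) under the Gaussian NB model
# (per-class prior and per-attribute mean/standard deviation).
def g(x, mu, sigma):
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

model = {  # class -> (prior, [(mean, std) per attribute])
    "setosa":     (0.327, [(5.006, 0.352), (3.418, 0.381), (1.464, 0.174), (0.244, 0.107)]),
    "versicolor": (0.327, [(5.936, 0.516), (2.770, 0.314), (4.260, 0.470), (1.326, 0.198)]),
    "virginica":  (0.327, [(6.588, 0.636), (2.974, 0.322), (5.552, 0.552), (2.026, 0.275)]),
}
x = (5, 3, 2, 2)

scores = {c: prior * math.prod(g(v, m, s) for v, (m, s) in zip(x, params))
          for c, (prior, params) in model.items()}
total = sum(scores.values())
for c, s in scores.items():
    print(c, round(s / total, 3))  # setosa 0.0, versicolor 0.995, virginica 0.005
```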
Naïve Bayes: Continuous Attributes – Discretization

For continuous attributes, discretization can be applied instead of assuming a Normal distribution.

[Screenshot: the Iris attributes after BinDiscretization, bins = 3.]
Naïve Bayes: Continuous Attributes – Discretization
NB Classifier Model

Class Iris-setosa (0.327):
  Attribute sepallength: range1 0.940, range2 0.060, range3 0.000
  Attribute sepalwidth: range1 0.020, range2 0.720, range3 0.260
  Attribute petallength: range1 1.000, range2 0.000, range3 0.000
  Attribute petalwidth: range1 1.000, range2 0.000, range3 0.000

Class Iris-versicolor (0.327):
  Attribute sepallength: range1 0.220, range2 0.720, range3 0.060
  Attribute sepalwidth: range1 0.460, range2 0.540, range3 0.000
  Attribute petallength: range1 0.000, range2 0.960, range3 0.040
  Attribute petalwidth: range1 0.000, range2 0.980, range3 0.020

Class Iris-virginica (0.327):
  Attribute sepallength: range1 0.020, range2 0.640, range3 0.340
  Attribute sepalwidth: range1 0.380, range2 0.580, range3 0.040
  Attribute petallength: range1 0.000, range2 0.120, range3 0.880
  Attribute petalwidth: range1 0.000, range2 0.100, range3 0.900
Naïve Bayes: Continuous Attributes – Discretization

From the model we can build a conditional probability table (CPT) for each attribute:

Class Probabilities
Setosa  Versicolor  Virginica
0.327   0.327       0.327

Attribute: Sepallength
            Range1  Range2  Range3
Setosa      0.940   0.060   0.000
Versicolor  0.220   0.720   0.060
Virginica   0.020   0.640   0.340

Attribute: Sepalwidth
            Range1  Range2  Range3
Setosa      0.020   0.720   0.260
Versicolor  0.460   0.540   0.000
Virginica   0.380   0.580   0.040
Attribute: Petallength
            Range1  Range2  Range3
Setosa      1.000   0.000   0.000
Versicolor  0.000   0.960   0.040
Virginica   0.000   0.100   0.900

Attribute: Petalwidth
            Range1  Range2  Range3
Setosa      1.000   0.000   0.000
Versicolor  0.000   0.980   0.020
Virginica   0.000   0.100   0.900
Naïve Bayes: Iris Dataset – Discretized Attributes
Classification (Implementation) Phase

[Screenshot: discretized examples and their posterior probabilities; the predicted class takes the maximum value.]
Naïve Bayes: Classification Phase

How to classify a new example?

Example: sepallength = 5, sepalwidth = 3, petallength = 2, petalwidth = 2.
With discretized attributes: sepallength = r1, sepalwidth = r2, petallength = r1, petalwidth = r3.

Compute the conditional posterior probabilities for each class:

P(setosa | x) = P(setosa) x P(sepalLength = r1 | setosa) x P(sepalWidth = r2 | setosa) x P(petalLength = r1 | setosa) x P(petalWidth = r3 | setosa)

P(versicolor | x) = P(versicolor) x P(sepalLength = r1 | versicolor) x P(sepalWidth = r2 | versicolor) x P(petalLength = r1 | versicolor) x P(petalWidth = r3 | versicolor)

P(virginica | x) = P(virginica) x P(sepalLength = r1 | virginica) x P(sepalWidth = r2 | virginica) x P(petalLength = r1 | virginica) x P(petalWidth = r3 | virginica)
Naïve Bayes: Continuous Attributes – Discretization

Using the class probabilities and the CPTs, for the discretized example (sepallength = r1, sepalwidth = r2, petallength = r1, petalwidth = r3):

P(setosa | x) = P(setosa) x P(sepalLength = r1 | setosa) x P(sepalWidth = r2 | setosa) x P(petalLength = r1 | setosa) x P(petalWidth = r3 | setosa)
              = 0.327 x 0.940 x 0.720 x 1.000 x 0.000 = 0
P(versicolor | x) = P(versicolor) x P(sepalLength = r1 | versicolor) x P(sepalWidth = r2 | versicolor) x P(petalLength = r1 | versicolor) x P(petalWidth = r3 | versicolor)
                  = 0.327 x 0.220 x 0.540 x 0.000 x 0.020 = 0
P(virginica | x) = P(virginica) x P(sepalLength = r1 | virginica) x P(sepalWidth = r2 | virginica) x P(petalLength = r1 | virginica) x P(petalWidth = r3 | virginica)
                 = 0.327 x 0.020 x 0.580 x 0.000 x 0.900 = 0

If all the class probabilities are zero, we cannot determine the class for this example.
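The zero-frequency failure can be demonstrated with the discretized Iris CPTs: for the example (r1, r2, r1, r3) every class score contains a zero factor, so the scores are all zero.

```python
import math

# Zero-frequency problem on the discretized Iris CPTs.
priors = {"setosa": 0.327, "versicolor": 0.327, "virginica": 0.327}
cpt = {  # class -> P(range | class) per attribute: sepallength, sepalwidth, petallength, petalwidth
    "setosa":     [{"r1": 0.94, "r2": 0.06, "r3": 0.00},
                   {"r1": 0.02, "r2": 0.72, "r3": 0.26},
                   {"r1": 1.00, "r2": 0.00, "r3": 0.00},
                   {"r1": 1.00, "r2": 0.00, "r3": 0.00}],
    "versicolor": [{"r1": 0.22, "r2": 0.72, "r3": 0.06},
                   {"r1": 0.46, "r2": 0.54, "r3": 0.00},
                   {"r1": 0.00, "r2": 0.96, "r3": 0.04},
                   {"r1": 0.00, "r2": 0.98, "r3": 0.02}],
    "virginica":  [{"r1": 0.02, "r2": 0.64, "r3": 0.34},
                   {"r1": 0.38, "r2": 0.58, "r3": 0.04},
                   {"r1": 0.00, "r2": 0.10, "r3": 0.90},
                   {"r1": 0.00, "r2": 0.10, "r3": 0.90}],
}
x = ("r1", "r2", "r1", "r3")

scores = {c: priors[c] * math.prod(t[v] for t, v in zip(cpt[c], x)) for c in priors}
print(scores)  # every score is 0.0
```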
Naïve Bayes: Laplace Correction

To avoid zero probabilities due to zero counters we can use the Laplace correction. To calculate the conditional probabilities, instead of using the estimate

P̂(Xi = xk | cj) = Nijk / Nj

we can use the Laplace correction:

P̂(Xi = xk | cj) = (Nijk + 1) / (Nj + k)

where:
- Nijk is the number of examples in D such that Xi = xk and C = cj
- Nj is the number of examples in D of class cj
- k is the number of possible values of Xi
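A one-line sketch of the corrected estimate; the counter values below are hypothetical, chosen only to show that a zero counter no longer yields a zero probability:

```python
# Laplace-corrected estimate of P(Xi = xk | cj), as on the slide:
# (N_ijk + 1) / (N_j + k), where k is the number of possible values of Xi.
def laplace_estimate(n_ijk, n_j, k):
    return (n_ijk + 1) / (n_j + k)

# Hypothetical counters: a value never seen among 50 examples of a class,
# for an attribute with k = 3 possible values:
print(round(laplace_estimate(0, 50, 3), 4))   # 0.0189, instead of 0
print(round(laplace_estimate(48, 50, 3), 4))  # 0.9245, instead of 48/50 = 0.96
```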
Bayesian SPAM Filtering

Binary classification problem: is an e-mail SPAM?
- 1 – is SPAM
- 0 – is not SPAM

57 real attributes: word frequencies, character frequencies, average frequencies of capital letters.

Spambase dataset from the UCI repository:

Spam  Not Spam  Total
1813  2788      4601
Bayesian SPAM Filtering: An Implementation using RapidMiner Studio 6.0.007

Greedy feature selection using "forward selection" and CFS(1) as the performance measure.

(1) Hall, M. A. (1998). Correlation-based Feature Subset Selection for Machine Learning. Thesis submitted in partial fulfillment of the requirements of the degree of Doctor of Philosophy at the University of Waikato.
Bayesian SPAM Filtering: k-fold Cross-Validation Results (k = 10)

Results without feature subset selection vs. results with feature subset selection: with only 11 attributes selected, the filter performs better.
Bayesian SPAM Filters: Confusion Matrix

Concept learning problem: is an e-mail SPAM?

                  Actual
Predicted    Yes (+)  No (-)
Yes (+)      TP       FP
No (-)       FN       TN

- True Positive (TP): number of e-mails classified as SPAM which truly are
- False Positive (FP): number of e-mails classified as SPAM which are valid
- False Negative (FN): number of e-mails classified as "Not SPAM" which are SPAM
- True Negative (TN): number of e-mails classified as "Not SPAM" which truly are

Spam Precision = TP / (TP + FP): percentage of e-mails classified as SPAM which truly are
Spam Recall = TPR = TP / (TP + FN): percentage of e-mails classified as SPAM over the total of SPAMs
Not Spam Precision = TN / (FN + TN): percentage of e-mails classified as "Not SPAM" which truly are
Not Spam Recall = TN / (FP + TN): percentage of e-mails classified as "Not SPAM" over the total of valid e-mails

Accuracy: 91.74%

Predicted      Spam (1)  Not Spam (0)  Precision
Spam (1)       1558      125           92.57%
Not Spam (0)   255       2633          91.25%
Recall         85.93%    95.52%

A good Spam filter must achieve high precision and recall, thus reducing the number of false positives.
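The Spam-class figures in the table can be recovered from the raw counts (rows are predictions, so TP = 1558, FP = 125, FN = 255):

```python
# Precision and recall for the Spam class of the confusion matrix above.
tp, fp, fn = 1558, 125, 255

spam_precision = tp / (tp + fp)   # fraction of predicted SPAM that truly is SPAM
spam_recall = tp / (tp + fn)      # fraction of all SPAM that was caught

print(round(100 * spam_precision, 2))  # 92.57
print(round(100 * spam_recall, 2))     # 85.93
```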
Naïve Bayes: Performance

NB is one of the simplest and most effective classifiers. Yet NB has a very strong, unrealistic independence assumption: all the attributes are conditionally independent given the value of the class.

In practice the independence assumption is violated → HIGH BIAS, which can lead to poor classification. However, NB is efficient due to its high variance management: fewer parameters → LOW VARIANCE.

[Chart: bias-variance decomposition of the test error of Naive Bayes on the Nursery dataset, for training sets of 500 to 12,500 examples.]
References

1. Castillo, G. (2006) Adaptive Learning Algorithms for Bayesian Network Classifiers. PhD Thesis. Chapter 3.3: Naïve Bayes Classifier.
2. Domingos, P. and Pazzani, M. (1997) On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning, 29(2-3):103-130.
3. Duda, R.O. and Hart, P.E. (1973) Pattern Classification and Scene Analysis. John Wiley & Sons, Inc.
4. John, G.H. and Langley, P. (1995) Estimating Continuous Distributions in Bayesian Classifiers. Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, pp. 338-345. Morgan Kaufmann, San Mateo.
5. Mitchell, T. (1997) Machine Learning. Chapter 6: Bayesian Learning. McGraw Hill.
6. Tan, P., Steinbach, M., Kumar, V. (2006) Introduction to Data Mining. Chapter 5: Classification - Alternative Techniques. Section 5.1 (pp. 166-180). Addison Wesley Higher Education.