Data Mining
Chapter 7, Part 1
Classification & Prediction
Classification vs. Prediction

Classification: predicts categorical class labels
Constructs a model based on the training set and the values (class labels) of a classifying attribute, then uses the model to classify new data
Prediction: models continuous-valued functions, i.e., predicts unknown or missing values
Typical applications: credit approval, target marketing, medical diagnosis, treatment effectiveness analysis
Classification: Definition

Given a collection of records (the training set), where each record contains a set of attributes and one of the attributes is the class.
Find a model for the class attribute as a function of the values of the other attributes.
Goal: previously unseen records should be assigned a class as accurately as possible.
A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets; the training set is used to build the model and the test set is used to validate it.
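This split-then-validate workflow fits in a few lines of code. A minimal sketch, assuming scikit-learn and its bundled iris data (neither is named in the slides); any labeled dataset and classifier would do:

```python
# Sketch of the train/test workflow described above (scikit-learn is an
# assumption; the slides name no library).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)            # any labeled dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)    # divide into training and test sets

model = DecisionTreeClassifier().fit(X_train, y_train)  # build model on training set
y_pred = model.predict(X_test)                           # classify unseen records
print(accuracy_score(y_test, y_pred))                    # accuracy on the test set
```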
Illustrating Classification Task

A learning algorithm performs induction on the training set to learn a model; the model is then applied (deduction) to classify the records of the test set.

Training Set:
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test Set:
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?
Examples of Classification Task

Predicting tumor cells as benign or malignant
Classifying credit card transactions as legitimate or fraudulent
Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil
Categorizing news stories as finance, weather, entertainment, sports, etc.
Classification: A Two-Step Process

Model construction (induction): describing a set of predetermined classes
Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
The set of tuples used for model construction is the training set
The model is represented as classification rules, decision trees, or mathematical formulae
Model usage (deduction): classifying future or unknown objects
Estimate the accuracy of the model: the known label of each test sample is compared with the classified result from the model
The accuracy rate is the percentage of test set samples that are correctly classified by the model
The test set must be independent of the training set, otherwise over-fitting will occur
Supervised vs. Unsupervised Learning

Supervised learning (classification)
Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
New data is classified based on the training set
Unsupervised learning (clustering)
The class labels of the training data are unknown
Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data
Issues regarding classification and prediction: (1) Data Preparation

Data cleaning: preprocess data in order to reduce noise and handle missing values
Relevance analysis (feature selection): remove irrelevant or redundant attributes
Data transformation: generalize and/or normalize data
Issues regarding classification and prediction: (2) Evaluating Classification Methods

Predictive accuracy
Speed: time to construct the model; time to use the model
Robustness: handling noise and missing values
Scalability: efficiency for disk-resident databases
Classification by Decision Tree Induction

Decision tree: a flow-chart-like tree structure
An internal node denotes a test on an attribute
A branch represents an outcome of the test
Leaf nodes represent class labels or class distributions
Decision tree generation consists of two phases
Tree construction: at the start, all the training examples are at the root; examples are then partitioned recursively based on selected attributes
Tree pruning: identify and remove branches that reflect noise or outliers
Use of a decision tree: classify an unknown sample by testing its attribute values against the decision tree
Example of a Decision Tree

Training Data:
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Model: Decision Tree (splitting attributes: Refund, MarSt, TaxInc)

Refund?
  Yes -> NO
  No  -> MarSt?
           Married          -> NO
           Single, Divorced -> TaxInc?
                                 < 80K  -> NO
                                 >= 80K -> YES
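A learned tree is just a sequence of attribute tests. As a sketch (not from the slides), the example tree above can be written as nested if/else tests; classify() mirrors a root-to-leaf traversal:

```python
# Sketch encoding the example tree above as nested if/else tests.
def classify(refund: str, marital_status: str, taxable_income: float) -> str:
    if refund == "Yes":
        return "NO"
    # Refund == "No": test marital status next
    if marital_status == "Married":
        return "NO"
    # Single or Divorced: test taxable income last
    return "NO" if taxable_income < 80_000 else "YES"

# Test record used later in the slides: Refund=No, Married, 80K
print(classify("No", "Married", 80_000))  # -> NO
```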
Another Example of Decision Tree

The same training data also fits a different tree:

MarSt?
  Married          -> NO
  Single, Divorced -> Refund?
                        Yes -> NO
                        No  -> TaxInc?
                                 < 80K  -> NO
                                 >= 80K -> YES

There could be more than one tree that fits the same data!
Decision Tree Classification Task

The general classification flow, specialized to trees: a tree induction algorithm learns a decision tree from the training set (induction); applying the tree to the test set assigns the missing class labels (deduction). The training and test sets are the same as in the earlier illustration.
Apply Model to Test Data

Test Data:
Refund  Marital Status  Taxable Income  Cheat
No      Married         80K             ?

Start from the root of the tree and follow the branches that match the record's attribute values:
1. Root node, Refund? The record has Refund = No, so take the No branch.
2. Next node, MarSt? The record has Marital Status = Married, so take the Married branch.
3. The Married branch leads directly to the leaf NO, so Taxable Income is never tested.
Assign Cheat to "No".
Decision Tree Classification Task (recap)

The walk-through above is the deduction step: the decision tree induced from the training set has been applied to label a test record.
Algorithm for Decision Tree Induction

Basic algorithm: the tree is constructed in a top-down recursive manner
At the start, all the training examples are at the root
Attributes are categorical (continuous-valued attributes are discretized in advance)
Examples are partitioned recursively based on selected attributes
Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
Conditions for stopping partitioning
All samples for a given node belong to the same class
There are no remaining attributes for further partitioning (majority voting is employed for classifying the leaf)
There are no samples left
Decision Tree Induction

Many algorithms exist, for example:
Hunt's Algorithm (one of the earliest)
CART
ID3, C4.5
SLIQ, SPRINT
General Structure of Hunt's Algorithm

Let D_t be the set of training records that reach a node t.
General procedure:
If D_t contains records that all belong to the same class y_t, then t is a leaf node labeled as y_t
If D_t is an empty set, then t is a leaf node labeled by the default class, y_d
If D_t contains records that belong to more than one class, use an attribute test to split the data into smaller subsets, and recursively apply the procedure to each subset
(Illustrated on the Refund/Marital Status/Taxable Income training data shown earlier.)
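The recursion can be sketched directly. A minimal illustration (not the slides' code), assuming records are dicts with a "Class" key; choose_test is a hypothetical placeholder for the attribute-selection heuristics covered later:

```python
# Sketch of Hunt's recursive procedure.
from collections import Counter

def hunt(records, attributes, default_class):
    if not records:                                    # D_t is empty
        return default_class                           # leaf labeled y_d
    classes = {r["Class"] for r in records}
    if len(classes) == 1:                              # all records in one class
        return classes.pop()                           # leaf labeled y_t
    if not attributes:                                 # no attributes left
        return Counter(r["Class"] for r in records).most_common(1)[0][0]
    attr = choose_test(records, attributes)            # pick an attribute test
    majority = Counter(r["Class"] for r in records).most_common(1)[0][0]
    children = {}
    for value in {r[attr] for r in records}:           # split into smaller subsets
        subset = [r for r in records if r[attr] == value]
        rest = [a for a in attributes if a != attr]
        children[value] = hunt(subset, rest, majority) # recurse on each subset
    return (attr, children)                            # internal node: test + branches

def choose_test(records, attributes):
    return attributes[0]  # placeholder; a real version uses Gini / info gain

records = [{"Refund": "Yes", "Class": "No"}, {"Refund": "No", "Class": "Yes"}]
print(hunt(records, ["Refund"], "No"))  # node testing Refund: Yes->No, No->Yes
```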
Hunt's Algorithm: Example

Applied to the Refund/Marital Status/Taxable Income training data (class attribute: Cheat), the tree grows in stages:
1. Start: a single node predicting the default class, Don't Cheat.
2. Split on Refund: Yes -> Don't Cheat; No -> Don't Cheat (the No subset still contains both classes).
3. Grow the No branch with Marital Status: Single, Divorced -> Cheat; Married -> Don't Cheat.
4. The Single, Divorced subset is still mixed, so split it on Taxable Income: < 80K -> Don't Cheat; >= 80K -> Cheat.
Tree Induction

Greedy strategy: split the records based on an attribute test that optimizes a certain criterion.
Issues:
Determine how to split the records
How to specify the attribute test condition?
How to determine the best split?
Determine when to stop splitting
How to Specify Test Condition?
Depends on attribute types
Nominal
Ordinal
Continuous
Depends on number of ways to split
2-way split
Multi-way split
Splitting Based on Nominal Attributes

Multi-way split: use as many partitions as there are distinct values, e.g., CarType -> {Family}, {Sports}, {Luxury}.
Binary split: divides the values into two subsets; need to find the optimal partitioning, e.g., CarType -> {Sports, Luxury} vs. {Family}, or {Family, Luxury} vs. {Sports}.
Splitting Based on Ordinal Attributes

Multi-way split: use as many partitions as there are distinct values, e.g., Size -> {Small}, {Medium}, {Large}.
Binary split: divides the values into two subsets that preserve the order; need to find the optimal partitioning, e.g., Size -> {Small, Medium} vs. {Large}, or {Medium, Large} vs. {Small}.
What about the split Size -> {Small, Large} vs. {Medium}? (It violates the order property of an ordinal attribute.)
Splitting Based on Continuous Attributes (1)

Different ways of handling:
Discretization to form an ordinal categorical attribute
Static: discretize once at the beginning
Dynamic: ranges can be found by equal-interval bucketing, equal-frequency bucketing (percentiles), or clustering
Binary decision: (A < v) or (A >= v)
Consider all possible splits and find the best cut
Can be more compute-intensive
Splitting Based on Continuous Attributes (2)

(i) Binary split: Taxable Income > 80K? -> Yes / No
(ii) Multi-way split: Taxable Income? -> < 10K, [10K, 25K), [25K, 50K), [50K, 80K), > 80K
Tree Induction (recap)

Greedy strategy: split the records based on an attribute test that optimizes a certain criterion. The next slides address the second issue: how to determine the best split.
How to Determine the Best Split (1)

Before splitting: 10 records of class 0, 10 records of class 1.
Candidate test conditions:
Own Car? (Yes: C0=6, C1=4; No: C0=4, C1=6)
Car Type? (Family: C0=1, C1=3; Sports: C0=8, C1=0; Luxury: C0=1, C1=7)
Student ID? (one branch per ID c1 ... c20, each holding a single record)
Which test condition is the best?
How to Determine the Best Split (2)

Greedy approach: nodes with a homogeneous class distribution are preferred.
Need a measure of node impurity:
C0: 5, C1: 5 -- non-homogeneous, high degree of impurity
C0: 9, C1: 1 -- homogeneous, low degree of impurity
Measures of Node Impurity
Gini Index
Entropy
Misclassification error
How to Find the Best Split

Before splitting, compute the impurity M0 of the parent node (counts C0 = N00, C1 = N01).
For each candidate test (say A? with children N1, N2, and B? with children N3, N4), compute the impurity of each child node (M1, M2, M3, M4) and combine them into a weighted impurity for the split (M12 for A, M34 for B).
Gain = M0 - M12 vs. M0 - M34: choose the test with the higher gain, i.e., the lower weighted child impurity.
Measure of Impurity: GINI

Gini index for a given node t:

$GINI(t) = 1 - \sum_j [p(j \mid t)]^2$

(NOTE: p(j | t) is the relative frequency of class j at node t.)
Maximum (1 - 1/n_c) when records are equally distributed among all classes, implying the least interesting information
Minimum (0.0) when all records belong to one class, implying the most interesting information

C1 = 0, C2 = 6: Gini = 0.000
C1 = 1, C2 = 5: Gini = 0.278
C1 = 2, C2 = 4: Gini = 0.444
C1 = 3, C2 = 3: Gini = 0.500
Examples for Computing GINI

$GINI(t) = 1 - \sum_j [p(j \mid t)]^2$

C1 = 0, C2 = 6: P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
Gini = 1 - P(C1)^2 - P(C2)^2 = 1 - 0 - 1 = 0

C1 = 1, C2 = 5: P(C1) = 1/6, P(C2) = 5/6
Gini = 1 - (1/6)^2 - (5/6)^2 = 0.278

C1 = 2, C2 = 4: P(C1) = 2/6, P(C2) = 4/6
Gini = 1 - (2/6)^2 - (4/6)^2 = 0.444
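These values are easy to check in code. A small sketch (not part of the slides) computing the Gini index of a node from its class counts:

```python
# Gini index of a node from its class counts, reproducing the examples above.
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

print(gini([0, 6]))  # 0.0
print(gini([1, 5]))  # ~0.278
print(gini([2, 4]))  # ~0.444
print(gini([3, 3]))  # 0.5
```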
Splitting Based on GINI

Used in CART, SLIQ, SPRINT.
When a node p is split into k partitions (children), the quality of the split is computed as

$GINI_{split} = \sum_{i=1}^{k} \frac{n_i}{n} \, GINI(i)$

where n_i = number of records at child i, and n = number of records at node p.
Binary Attributes: Computing the GINI Index

Splits into two partitions. Effect of weighting the partitions: larger and purer partitions are sought.
Example: split on B? (Yes -> Node N1, No -> Node N2)

Parent: C1 = 6, C2 = 6, Gini = 0.500

      N1  N2
C1    5   1
C2    2   4

Gini(N1) = 1 - (5/7)^2 - (2/7)^2 = 0.408
Gini(N2) = 1 - (1/5)^2 - (4/5)^2 = 0.320
Gini(children) = 7/12 x 0.408 + 5/12 x 0.320 = 0.371
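The weighted split quality can be sketched on top of the gini() helper above, reproducing the 0.371 result:

```python
# Weighted GINI_split over a list of child-node class counts
# (assumes the gini() helper sketched earlier).
def gini_split(partitions):
    n = sum(sum(p) for p in partitions)
    return sum(sum(p) / n * gini(p) for p in partitions)

# Node N1: C1=5, C2=2; Node N2: C1=1, C2=4
print(gini_split([[5, 2], [1, 4]]))  # ~0.371
```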
Categorical Attributes: Computing the Gini Index

For each distinct value, gather counts for each class in the dataset, and use the count matrix to make decisions.

Multi-way split:
CarType   Family  Sports  Luxury
C1        1       2       1
C2        4       1       1
Gini = 0.393

Two-way split (find the best partition of values):
CarType   {Sports, Luxury}  {Family}
C1        3                 1
C2        2                 4
Gini = 0.400

CarType   {Sports}  {Family, Luxury}
C1        2         2
C2        1         5
Gini = 0.419
Continuous Attributes: Computing the Gini Index (1)

Use binary decisions based on one value, e.g., Taxable Income > 80K? (Yes / No)
Several choices for the splitting value: the number of possible splitting values = the number of distinct values
Each splitting value v has a count matrix associated with it: class counts in each of the partitions, A < v and A >= v
Simple method to choose the best v: for each v, scan the database to gather the count matrix and compute its Gini index
Computationally inefficient! Repetition of work.
Continuous Attributes: Computing the Gini Index (2)

For efficient computation, for each attribute:
Sort the attribute on its values
Linearly scan these values, each time updating the count matrix and computing the Gini index
Choose the split position that has the least Gini index

Cheat labels:    No   No   No   Yes  Yes  Yes  No   No   No   No
Sorted values:   60   70   75   85   90   95   100  120  125  220

Split position v:  55    65    72    80    87    92    97    110   122   172   230
Yes (<=v, >v):     0,3   0,3   0,3   0,3   1,2   2,1   3,0   3,0   3,0   3,0   3,0
No  (<=v, >v):     0,7   1,6   2,5   3,4   3,4   3,4   3,4   4,3   5,2   6,1   7,0
Gini:              0.420 0.400 0.375 0.343 0.417 0.400 0.300 0.343 0.375 0.400 0.420

The best split position is 97, with the least Gini index (0.300).
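The linear scan can be sketched as follows (an illustration, not the slides' code); it sorts once, then updates the class counts incrementally at each candidate midpoint instead of rescanning the data:

```python
# Efficient best-split search for a continuous attribute via a sorted scan.
def gini(counts):
    n = sum(counts)
    return (1.0 - sum((c / n) ** 2 for c in counts)) if n else 0.0

def best_split(values, labels):
    pairs = sorted(zip(values, labels))
    total = {"Yes": 0, "No": 0}
    for _, c in pairs:
        total[c] += 1
    left = {"Yes": 0, "No": 0}                    # class counts for A <= v
    best_gini, best_v = float("inf"), None
    for i in range(len(pairs) - 1):
        left[pairs[i][1]] += 1                    # shift one record leftward
        v = (pairs[i][0] + pairs[i + 1][0]) / 2   # candidate midpoint split
        l = [left["Yes"], left["No"]]
        r = [total["Yes"] - l[0], total["No"] - l[1]]
        n = len(pairs)
        g = sum(l) / n * gini(l) + sum(r) / n * gini(r)  # weighted Gini
        if g < best_gini:
            best_gini, best_v = g, v
    return best_gini, best_v

income = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
cheat = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]
print(best_split(income, cheat))  # -> (0.3, 97.5): the "97" column above
```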
Alternative Splitting Criteria Based on INFO

Entropy at a given node t:

$Entropy(t) = -\sum_j p(j \mid t) \, \log_2 p(j \mid t)$

(NOTE: p(j | t) is the relative frequency of class j at node t.)
Measures the homogeneity of a node.
Maximum (log n_c) when records are equally distributed among all classes, implying the least information
Minimum (0.0) when all records belong to one class, implying the most information
Entropy-based computations are similar to the GINI index computations
Examples for Computing Entropy

$Entropy(t) = -\sum_j p(j \mid t) \, \log_2 p(j \mid t)$

C1 = 0, C2 = 6: P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
Entropy = -0 log 0 - 1 log 1 = -0 - 0 = 0

C1 = 1, C2 = 5: P(C1) = 1/6, P(C2) = 5/6
Entropy = -(1/6) log2(1/6) - (5/6) log2(5/6) = 0.65

C1 = 2, C2 = 4: P(C1) = 2/6, P(C2) = 4/6
Entropy = -(2/6) log2(2/6) - (4/6) log2(4/6) = 0.92
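As with Gini, the entropy examples are easy to verify. A small sketch (not from the slides):

```python
# Node entropy from class counts, reproducing the examples above.
from math import log2

def entropy(counts):
    n = sum(counts)
    return sum(-(c / n) * log2(c / n) for c in counts if c > 0)

print(entropy([0, 6]))  # 0.0
print(entropy([1, 5]))  # ~0.65
print(entropy([2, 4]))  # ~0.92
```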
Splitting Based on INFO: Information Gain

$GAIN_{split} = Entropy(p) - \sum_{i=1}^{k} \frac{n_i}{n} \, Entropy(i)$

Parent node p is split into k partitions; n_i is the number of records in partition i.
Measures the reduction in entropy achieved because of the split. Choose the split that achieves the most reduction (maximizes GAIN).
Used in ID3 and C4.5.
Disadvantage: tends to prefer splits that result in a large number of partitions, each being small but pure.
Training Dataset (the classic buys_computer example)

age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31..40  high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31..40  low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31..40  medium  no       excellent      yes
31..40  high    yes      fair           yes
>40     medium  no       excellent      no
Attribute Selection by Information Gain Computation

Class P: buys_computer = "yes" (9 records); Class N: buys_computer = "no" (5 records)
$I(p, n) = I(9, 5) = 0.940$

Compute the entropy for age:

age     p_i  n_i  I(p_i, n_i)
<=30    2    3    0.971
31..40  4    0    0
>40     3    2    0.971

$E(age) = \frac{5}{14} I(2,3) + \frac{4}{14} I(4,0) + \frac{5}{14} I(3,2) = 0.69$

Hence
$Gain(age) = I(p, n) - E(age) = 0.940 - 0.69 = 0.25$

Similarly,
Gain(income) = 0.029
Gain(student) = 0.151
Gain(credit_rating) = 0.048
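These gains can be recomputed from the table two slides back. A sketch (not from the slides); the printed values match the slide up to rounding:

```python
# Recompute I(p, n), E(A), and Gain(A) for the buys_computer table.
from math import log2

rows = [  # (age, income, student, credit_rating, buys_computer)
    ("<=30", "high", "no", "fair", "no"), ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"), (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"), (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"), ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"), (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"), ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"), (">40", "medium", "no", "excellent", "no"),
]

def info(labels):                      # I(p, n): entropy of a label list
    n = len(labels)
    return -sum((labels.count(c) / n) * log2(labels.count(c) / n)
                for c in set(labels))

def gain(attr_index):                  # Gain(A) = I(p, n) - E(A)
    labels = [r[-1] for r in rows]
    e = 0.0
    for v in set(r[attr_index] for r in rows):
        subset = [r[-1] for r in rows if r[attr_index] == v]
        e += len(subset) / len(rows) * info(subset)
    return info(labels) - e

for i, name in enumerate(["age", "income", "student", "credit_rating"]):
    print(name, round(gain(i), 3))
# age 0.247, income 0.029, student 0.152, credit_rating 0.048
# (the slide rounds age to 0.25 and truncates student to 0.151)
```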
Splitting Based on INFO: Gain Ratio

$GainRATIO_{split} = \frac{GAIN_{split}}{SplitINFO}$, where $SplitINFO = -\sum_{i=1}^{k} \frac{n_i}{n} \log_2 \frac{n_i}{n}$

Parent node p is split into k partitions; n_i is the number of records in partition i.
Adjusts the information gain by the entropy of the partitioning (SplitINFO). Higher-entropy partitioning (a large number of small partitions) is penalized!
Used in C4.5.
Designed to overcome the disadvantage of information gain.
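A sketch of the gain-ratio adjustment, reusing rows, info(), and gain() from the previous example:

```python
# Gain ratio = information gain divided by the entropy of the partitioning.
def split_info(attr_index):
    n = len(rows)
    sizes = [sum(1 for r in rows if r[attr_index] == v)
             for v in set(r[attr_index] for r in rows)]
    return -sum((s / n) * log2(s / n) for s in sizes)

def gain_ratio(attr_index):
    return gain(attr_index) / split_info(attr_index)

print(round(gain_ratio(0), 3))  # gain ratio for age (~0.156)
```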
Splitting Criteria Based on Classification Error

Classification error at a node t:

$Error(t) = 1 - \max_i P(i \mid t)$

Measures the misclassification error made by a node.
Maximum (1 - 1/n_c) when records are equally distributed among all classes, implying the least interesting information
Minimum (0.0) when all records belong to one class, implying the most interesting information
Examples for Computing Error

$Error(t) = 1 - \max_i P(i \mid t)$

C1 = 0, C2 = 6: P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
Error = 1 - max(0, 1) = 1 - 1 = 0

C1 = 1, C2 = 5: P(C1) = 1/6, P(C2) = 5/6
Error = 1 - max(1/6, 5/6) = 1 - 5/6 = 1/6

C1 = 2, C2 = 4: P(C1) = 2/6, P(C2) = 4/6
Error = 1 - max(2/6, 4/6) = 1 - 4/6 = 1/3
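A small sketch (not from the slides) reproducing the error examples, using exact fractions:

```python
# Classification error of a node from its class counts.
from fractions import Fraction

def classification_error(counts):
    n = sum(counts)
    return 1 - max(Fraction(c, n) for c in counts)

print(classification_error([0, 6]))  # 0
print(classification_error([1, 5]))  # 1/6
print(classification_error([2, 4]))  # 1/3
```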
End of Chapter 7, Part 1