Post on 03-Apr-2018
7/28/2019 Bagging, Boosting
Dealing with Data, Bagging, Boosting
Types of Data: Binary Data
ID   Salary   Male/Female   Mortgage   Car
A 50000 1 0 0
B 85000 0 1 1
C 55000 1 0 1
D 95000 1 1 0
E 75000 0 0 0
F 45000 0 1 1
G 65000 1 1 0
A binary variable has two states, 0 or 1, where 0 means the variable is absent
and 1 means it is present. Thus the variable smoker has the value 1 if the
person smokes and 0 if he does not. A binary variable is symmetric if both of
its states are equally valuable and carry the same weight. A variable denoting
the gender of a person is a symmetric binary variable, as the male and female
values are equally important.
Consider the data above. Here Male/Female, Mortgage and Car are binary
variables, as they take only the values 0 and 1. In this case, how do we find
the distance between A and B?
Types of Data: Binary Data
                       object j
                     1      0      sum
object i      1      q      r      q+r
              0      s      t      s+t
            sum     q+s    r+t      p
We construct a matrix as shown above. The matrix shows the matching
between two objects i and j. In the matrix, q denotes the number of matches
between i and j where both are 1, r denotes the number of matches where
i = 1 and j = 0, and so on.

d(i, j) = (r + s) / (q + r + s + t) = (r + s) / p

The distance between i and j is also called the dissimilarity between i and j.
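The counts q, r, s, t and the resulting distance can be computed directly from the two 0/1 vectors. The function below is a minimal sketch (not part of the slides); the names are illustrative.

```python
def symmetric_binary_dissimilarity(a, b):
    """a, b: equal-length lists of 0/1 values for two objects."""
    q = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)  # both 1
    r = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)  # i=1, j=0
    s = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)  # i=0, j=1
    t = sum(1 for x, y in zip(a, b) if x == 0 and y == 0)  # both 0
    return (r + s) / (q + r + s + t)

# Objects A and B from the table (Male/Female, Mortgage, Car):
A = [1, 0, 0]
B = [0, 1, 1]
print(symmetric_binary_dissimilarity(A, B))  # 1.0, i.e. d(A,B) = 3/3
```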
Calculation of d(A,B), i.e. the dissimilarity between A and B:

              B
            1     0    sum
A      1    0     1     1
       0    2     0     2
     sum    2     1     3

d(A,B) = (r + s) / p = 3/3 = 1

Calculation of d(A,C), i.e. the dissimilarity between A and C:

              C
            1     0    sum
A      1    1     0     1
       0    1     1     2
     sum    2     1     3

d(A,C) = (r + s) / p = 1/3 = .33
Symmetric Data
Asymmetric Binary variable
                       object j
                     1      0      sum
object i      1      q      r      q+r
              0      s      t      s+t
            sum     q+s    r+t      p
A variable is asymmetric if the outcomes of its states are not equally
important, such as the positive and negative outcomes of a disease test. Let
the variable be the HIV status of a person: it is 1 if the disease is present
and 0 if it is absent. Given two asymmetric binary variables, the agreement
of two 1s (a positive match) is considered more significant than that of two
0s. In this case the formula for dissimilarity becomes:

d(i, j) = (r + s) / (q + r + s)

where t is not considered.
name   gender   fever   cough   test-1   test-2   test-3   test-4
Jack     M        Y       N       P        N        N        N
Mary     F        Y       N       P        N        P        N
Jim      M        Y       Y       N        N        N        N

name   gender   fever   cough   test-1   test-2   test-3   test-4
Jack     M        1       0       1        0        0        0
Mary     F        1       0       1        0        1        0
Jim      M        1       1       0        0        0        0
In the above case gender is a symmetric variable and the other attributes are
asymmetric binary. We encode the asymmetric values as 1 for Yes/Positive and
0 for No/Negative.
D(Jack, Mary)= ( 0 + 1) / ( 2+ 0 +1 ) = .33
D(Jack,Jim) = ( 1 + 1) / ( 1 + 1 +1) = .67
D(Mary,Jim) = ( 1 + 2 ) / ( 1 + 1 +2) = .75
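The three pairwise values above can be checked with a short sketch (not from the slides) that applies the asymmetric formula to the six asymmetric attributes, i.e. everything except gender:

```python
def asymmetric_binary_dissimilarity(a, b):
    # t (both 0) is ignored for asymmetric binary variables
    q = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    r = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)
    s = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)
    return (r + s) / (q + r + s)

jack = [1, 0, 1, 0, 0, 0]   # fever, cough, test-1 .. test-4
mary = [1, 0, 1, 0, 1, 0]
jim  = [1, 1, 0, 0, 0, 0]
print(round(asymmetric_binary_dissimilarity(jack, mary), 2))  # 0.33
print(round(asymmetric_binary_dissimilarity(jack, jim), 2))   # 0.67
print(round(asymmetric_binary_dissimilarity(mary, jim), 2))   # 0.75
```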
Asymmetric Binary variable
Categorical Variables
A categorical variable is a generalization of the binary variable in that it
can take on more than two states. For example, map_color is a categorical
variable that may take five states: red, yellow, green, pink, and blue.
The dissimilarity between two categorical objects i and j can be computed
based on the ratio of mismatches:

d(i, j) = (p - m) / p

where m is the number of matches (i.e. the number of variables for which i
and j are in the same state), and p is the total number of variables.
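The mismatch ratio is one line of code. A minimal sketch (names are illustrative, not from the slides):

```python
def categorical_dissimilarity(a, b):
    """a, b: equal-length lists of categorical states for two objects."""
    p = len(a)                              # total number of variables
    m = sum(x == y for x, y in zip(a, b))   # number of matching states
    return (p - m) / p

# With p = 1 (a single variable such as map_color):
print(categorical_dissimilarity(["red"], ["yellow"]))  # 1.0 (mismatch)
print(categorical_dissimilarity(["red"], ["red"]))     # 0.0 (match)
```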
Categorical Variables
We take into account object identifier and test-1 only and make the dissimilarity
matrix. We have p=1 since only one variable is considered.
Ordinal Variables
A discrete ordinal variable resembles a categorical variable, except that the M
states of the ordinal value are ordered in a meaningful sequence.
Ordinal Variables
We consider the object identifier and test2 (ordinal variable). We replace each of
the test-2 value by the rank. Since there are three states namely ( excellent, fair
and good) Mf = 3.
Ordinal Variables
object-identifier   test-2 (rank)   normalized value
1                   3               (3-1)/(3-1) = 1
2                   1               (1-1)/(3-1) = 0
3                   2               (2-1)/(3-1) = .5
4                   3               (3-1)/(3-1) = 1

Rank: 1 - fair, 2 - good, 3 - excellent

We next calculate the Euclidean distance between the objects using the
normalized values. The distance between objects 2 and 1 is
((0 - 1)^2)^(1/2) = 1, the distance between objects 3 and 1 is
((.5 - 1)^2)^(1/2) = .5, and so on. This results in the following distance
matrix.
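The normalization and the distances can be sketched as follows (a minimal illustration, not from the slides; the state-to-rank mapping is taken from the rank legend above):

```python
import math

RANK = {"fair": 1, "good": 2, "excellent": 3}
M = 3  # number of ordered states

def normalize(state):
    # z = (rank - 1) / (M - 1) maps the ranks onto [0, 1]
    return (RANK[state] - 1) / (M - 1)

test2 = ["excellent", "fair", "good", "excellent"]  # objects 1..4
z = [normalize(s) for s in test2]                   # [1.0, 0.0, 0.5, 1.0]

def dist(i, j):
    # Euclidean distance on a single normalized variable (0-based indices)
    return math.sqrt((z[i] - z[j]) ** 2)

print(dist(1, 0))  # 1.0  (objects 2 and 1)
print(dist(2, 0))  # 0.5  (objects 3 and 1)
```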
Ratio-Scaled Variables

For ratio-scaled variables we take the log of the values. Consider the
object identifier and the test-3 variable.
Ratio-Scaled Variables

object-identifier   test-3   log value
1                   445      log(445) = 2.65
2                   22       log(22) = 1.34
3                   164      log(164) = 2.21
4                   1210     log(1210) = 3.08

From the values in the last column we calculate the Euclidean distance, and
we get the following distance matrix.
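The log transform and the resulting pairwise distances can be sketched as follows (base-10 logs, which reproduce the table above; not part of the slides):

```python
import math

test3 = {1: 445, 2: 22, 3: 164, 4: 1210}
logs = {k: math.log10(v) for k, v in test3.items()}  # 2.65, 1.34, 2.21, 3.08

def dist(i, j):
    # Euclidean distance on a single log-transformed variable
    return abs(logs[i] - logs[j])

print(round(dist(1, 2), 2))  # 1.31
```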
Bagging

x   0.1   0.2   0.3   0.4   0.5   0.6   0.7   0.8   0.9   1
y    1     1     1    -1    -1    -1    -1     1     1    1
Bagging, which is also known as bootstrap aggregating, is a technique that
repeatedly samples (with replacement) from a data set. Each bootstrap sample
has the same size as the original data. Because the sampling is done with
replacement, some instances may appear several times in the same training
set, while others may be omitted from it.
Let x denote a one-dimensional attribute and y denote the class label. We
apply a classifier that induces a one-level binary decision tree (a decision
stump) with a test condition x <= k, where k is a split point chosen to
minimize the entropy of the leaf nodes.
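The procedure above can be sketched in a few lines. This is a minimal illustration, not the slides' exact implementation: the stump below picks its split point by minimizing training error, a simpler stand-in for the entropy criterion, and the ensemble combines rounds by majority vote.

```python
import random

X = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
Y = [1, 1, 1, -1, -1, -1, -1, 1, 1, 1]

def train_stump(xs, ys):
    """One-level tree with test x <= k; k chosen to minimize training error."""
    best = None
    for k in xs:
        for left in (1, -1):
            err = sum((left if x <= k else -left) != y for x, y in zip(xs, ys))
            if best is None or err < best[0]:
                best = (err, k, left)
    _, k, left = best
    return lambda x, k=k, left=left: left if x <= k else -left

def bagging(xs, ys, rounds=10, seed=0):
    rng = random.Random(seed)
    stumps = []
    for _ in range(rounds):
        # bootstrap sample: same size as the data, drawn with replacement
        idx = [rng.randrange(len(xs)) for _ in xs]
        stumps.append(train_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    # majority vote over the stumps' +/-1 predictions
    return lambda x: 1 if sum(s(x) for s in stumps) >= 0 else -1

clf = bagging(X, Y)
print([clf(x) for x in X])
```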
Bagging

These values of y are determined by the stump built in each bagging round;
round 1, for instance, assigns y = 1 or y = -1 to each x according to its
split point. The predicted y values in each column are then added across
rounds, and the sign of the sum gives the ensemble prediction.
Boosting

An iterative procedure that adaptively changes the distribution of the
training data by focusing more on previously misclassified records.
Initially, all N records are assigned equal weights.
Unlike bagging, the weights may change at the end of each boosting round.
Boosting

Records that are wrongly classified will have their weights increased.
Records that are classified correctly will have their weights decreased.
Boosting - AdaBoost

AdaBoost Algorithm
1: w = { w_j = 1/N | j = 1, 2, ..., N }  {Initialize the weights for all N examples}
2: Let k be the number of boosting rounds.
3: for i = 1 to k do
4:   Create training set D_i by sampling (with replacement) from D according to w.
5:   Train a base classifier C_i on D_i.
6:   Apply C_i to all examples in the original training set D and calculate
     the weighted error:
       epsilon_i = (1/N) * sum_{j=1}^{N} w_j * delta( C_i(x_j) != y_j )
7:   if epsilon_i > .5 then
       w = { w_j = 1/N | j = 1, 2, ..., N }  {Reset the weights for all N examples}
       Go back to step 4.
8:   end if
9:   Calculate alpha_i = (1/2) ln( (1 - epsilon_i) / epsilon_i ).
10:  Update the weight of each example.
11: end for
Boosting - AdaBoost

Base classifiers: C1, C2, ..., CT

Error rate:
epsilon_i = (1/N) * sum_{j=1}^{N} w_j * delta( C_i(x_j) != y_j )

Importance of a classifier:
alpha_i = (1/2) ln( (1 - epsilon_i) / epsilon_i )
Boosting - AdaBoost

Weight update:
w_i^(j+1) = (w_i^(j) / Z_j) * exp(-alpha_j)   if C_j(x_i) = y_i
w_i^(j+1) = (w_i^(j) / Z_j) * exp(alpha_j)    if C_j(x_i) != y_i
where Z_j is the normalization factor.

If any intermediate round produces an error rate higher than 50%, the
weights are reverted back to 1/N and the resampling procedure is repeated.

Classification:
C*(x) = argmax_y sum_{j=1}^{T} alpha_j * delta( C_j(x) = y )
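The full loop can be sketched as below. This is a minimal illustration, not the slides' exact implementation: the stump minimizes plain training error rather than entropy, and the weighted error is computed as sum w_j * delta (the weights already sum to 1), whereas the slides' formula carries an extra 1/N factor.

```python
import math
import random

X = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
Y = [1, 1, 1, -1, -1, -1, -1, 1, 1, 1]

def train_stump(xs, ys):
    # decision stump with test x <= k, minimizing training error
    best = None
    for k in xs:
        for left in (1, -1):
            err = sum((left if x <= k else -left) != y for x, y in zip(xs, ys))
            if best is None or err < best[0]:
                best = (err, k, left)
    _, k, left = best
    return lambda x, k=k, left=left: left if x <= k else -left

def adaboost(xs, ys, rounds=3, seed=0):
    rng = random.Random(seed)
    n = len(xs)
    w = [1.0 / n] * n          # step 1: equal initial weights
    ensemble = []
    for _ in range(rounds):
        # step 4: sample D_i (with replacement) from D according to w
        idx = rng.choices(range(n), weights=w, k=n)
        C = train_stump([xs[i] for i in idx], [ys[i] for i in idx])
        # step 6: weighted error on the ORIGINAL training set
        eps = sum(wj for wj, x, y in zip(w, xs, ys) if C(x) != y)
        if eps > 0.5:          # step 7: revert the weights and retry
            w = [1.0 / n] * n
            continue
        eps = max(eps, 1e-10)  # avoid taking the log of zero
        alpha = 0.5 * math.log((1 - eps) / eps)
        # step 10: reweight, then divide by Z so the weights sum to 1
        w = [wj * math.exp(-alpha if C(x) == y else alpha)
             for wj, x, y in zip(w, xs, ys)]
        Z = sum(w)
        w = [wj / Z for wj in w]
        ensemble.append((alpha, C))
    # classification: sign of the alpha-weighted vote
    return lambda x: 1 if sum(a * C(x) for a, C in ensemble) >= 0 else -1

clf = adaboost(X, Y)
print([clf(x) for x in X])
```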
Boosting - AdaBoost

x   0.1   0.2   0.3   0.4   0.5   0.6   0.7   0.8   0.9   1
y    1     1     1    -1    -1    -1    -1     1     1    1

N = 10, the number of elements shown above.
w = 1/N = 1/10 = .1 is the initial weight assigned to each element in the data.
Let k = number of boosting rounds = 3.
Boosting - AdaBoost

The figure above shows the three boosting rounds. The elements are sampled
with replacement, hence an element may appear more than once in a round.
Boosting - AdaBoost
In round 1 all elements are given the same weight = 1 /10 =.1
as shown in the first row above.
The weights of training records are as follows (calculation is shown in subsequent
slides)
7/28/2019 Bagging, Boosting
26/32
Boosting - AdaBoost
The split point for each round is a threshold test of the form x <= k.
Boosting - AdaBoost

We need to calculate the values of epsilon_i and alpha_i so that the new
weights can be calculated according to the equations:

epsilon_i = (1/N) * sum_{j=1}^{N} w_j * delta( C_i(x_j) != y_j )

alpha_i = (1/2) ln( (1 - epsilon_i) / epsilon_i )

w_i^(j+1) = (w_i^(j) / Z_j) * exp(-alpha_j)   if C_j(x_i) = y_i
w_i^(j+1) = (w_i^(j) / Z_j) * exp(alpha_j)    if C_j(x_i) != y_i

where Z_j is the normalization factor.
Boosting - AdaBoost

The calculation is as under:

epsilon_i = (1/N) * sum_{j=1}^{N} w_j * delta( C_i(x_j) != y_j )

delta = 1 if the prediction for a data element does not match the original
label, else it is 0. Thus delta = 1 for the first three data elements in D;
w is the weight assigned to each element, which is equal to .1 in the first
round.

epsilon_i = 1/10 * (.1 x 1 + .1 x 1 + .1 x 1 + 0 + 0 + ...) = .1 x (.3) = .03
Boosting - AdaBoost

We have the value of epsilon_i:

epsilon_i = .1 x (.3) = .03

alpha_i = (1/2) ln( (1 - .03) / .03 ) = 1.738
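The two numbers can be verified with a couple of lines (a quick check, not part of the slides):

```python
import math

# round-1 weights and mismatch indicators (first three elements misclassified)
w = [0.1] * 10
delta = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]

eps = (1 / 10) * sum(wj * dj for wj, dj in zip(w, delta))  # epsilon_i
alpha = 0.5 * math.log((1 - eps) / eps)                    # alpha_i
print(round(eps, 2))    # 0.03
print(round(alpha, 3))  # 1.738
```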
Boosting - AdaBoost

We now need to calculate the new weights, given by the equation:

w_i^(j+1) = (w_i^(j) / Z_j) * exp(-alpha_j)   if C_j(x_i) = y_i
w_i^(j+1) = (w_i^(j) / Z_j) * exp(alpha_j)    if C_j(x_i) != y_i

where Z_j is the normalization factor. The normalization factor ensures that
sum_i w_i^(j+1) = 1.

The condition C_j(x_i) = y_i checks the matching or non-matching of values:

x   0.1   0.2   0.3   0.4   0.5   0.6   0.7   0.8   0.9   1
y    1     1     1    -1    -1    -1    -1     1     1    1

Matching values
Boosting - AdaBoost

We need to calculate the value of Z_j, the normalization factor:

1 = (.1/Z_j)(e^1.738) + (.1/Z_j)(e^1.738) + (.1/Z_j)(e^1.738) + (.1/Z_j)(e^-1.738) + ...
1 = (.1/Z_j)(3 x e^1.738) + (.1/Z_j)(.176 x 7)

The value of Z_j must make the right-hand side equal to 1. Solving the above
equation gives Z_j = 1.82.

For non-matching (misclassified) instances the new weights are:
(.1 / 1.82) x e^1.738 = .31
For matching (correctly classified) instances the new weights are:
(.1 / 1.82) x e^-1.738 = .0096 ~ .01

The whole process is repeated with the new weights.
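The normalization and the two new weight values can be checked numerically (a quick verification, not part of the slides; the exact computation gives Z_j closer to 1.83, so the slide's 1.82 reflects intermediate rounding):

```python
import math

alpha = 1.738
# 3 misclassified elements (weight grows) and 7 correct ones (weight shrinks)
Z = 0.1 * (3 * math.exp(alpha) + 7 * math.exp(-alpha))
w_wrong = 0.1 / Z * math.exp(alpha)    # new weight of a misclassified instance
w_right = 0.1 / Z * math.exp(-alpha)   # new weight of a correct instance
print(round(Z, 2))        # 1.83 (the slide's 1.82, up to rounding)
print(round(w_wrong, 2))  # 0.31
print(round(w_right, 4))  # 0.0096
```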
Boosting - AdaBoost

The final prediction for each x is the sign of the weighted vote
sum_j alpha_j * C_j(x), using the round importances alpha_1 = 1.738,
alpha_2 = 2.7784 and alpha_3 = 4.1195. For one region of x the sum is
-1 x (1.738) + 1 x (2.7784) + 1 x (4.1195), while for another it is
-1 x (1.738) + 1 x (2.7784) + -1 x (4.1195).