Support Vector Machines in Marketing Georgi Nalbantov MICC, Maastricht University.
Support Vector Machines in Marketing
Georgi Nalbantov
MICC, Maastricht University
Contents
Purpose
Linear Support Vector Machines
Nonlinear Support Vector Machines
(Theoretical justifications of SVM)
Marketing Examples
Conclusion and Q & A
(some extensions)
Purpose
Task to be solved (The Classification Task):
Classify cases (customers) into “type 1” or “type 2” on the basis of
some known attributes (characteristics)
Chosen tool to solve this task:
Support Vector Machines
The Classification Task
Given data on explanatory and explained variables, where the explained variable can take two values {−1, +1}, find a function that gives the "best" separation between the "−1" cases and the "+1" cases:
Given: (x1, y1), …, (xm, ym) ∈ ℝⁿ × {−1, +1}
Find: f : ℝⁿ → {−1, +1}
"Best function" = the one whose expected error on unseen data (xm+1, ym+1), …, (xm+k, ym+k) is minimal
Existing techniques to solve the classification task:
Linear and Quadratic Discriminant Analysis
Logit choice models (Logistic Regression)
Decision trees, Neural Networks, Least Squares SVM
Support Vector Machines: Definition
Support Vector Machines are a non-parametric tool for classification/regression
Support Vector Machines are used for prediction rather than description purposes
Support Vector Machines have been developed by Vapnik and co-workers
[Figure: scatter plot of customers; ∆ buyers, ● non-buyers; x-axis: months since last purchase, y-axis: number of art books purchased]
Linear Support Vector Machines
A direct marketing company wants to sell a new book:
“The Art History of Florence”
Nissan Levin and Jacob Zahavi in Lattin, Carroll and Green (2003).
Problem: How to identify buyers and non-buyers using the two variables: months since last purchase and number of art books purchased?
Main idea of SVM:
separate groups by a line.
However: There are infinitely many lines that have zero training error…
… which line shall we choose?
Linear SVM: Separable Case
SVMs use the idea of a margin around the separating line.
The thinner the margin, the more complex the model.
The best line is the one with the largest margin.
Linear SVM: Separable Case
The line having the largest margin is:
w1x1 + w2x2 + b = 0
Where
x1 = months since last purchase
x2 = number of art books purchased
Note:
w1xi1 + w2xi2 + b ≥ +1 for i ∈ ∆ (buyers)
w1xj1 + w2xj2 + b ≤ −1 for j ∈ ● (non-buyers)
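The line above can be recovered numerically. A minimal sketch with scikit-learn's SVC on made-up toy customers (the slide's actual book-buyer data is not reproduced here): after fitting, coef_ and intercept_ hold w1, w2 and b.

```python
# Sketch (hypothetical toy data, not the slide's data set): fit a linear SVM
# and read off the separating line w1*x1 + w2*x2 + b = 0.
import numpy as np
from sklearn.svm import SVC

# Columns: [months since last purchase, number of art books purchased]
X = np.array([[1.0, 4.0], [2.0, 5.0], [3.0, 4.5],   # buyers (+1)
              [6.0, 1.0], [7.0, 0.5], [8.0, 1.5]])  # non-buyers (-1)
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1.0).fit(X, y)
w1, w2 = clf.coef_[0]      # weights of the separating line
b = clf.intercept_[0]      # offset of the separating line
```

A point near the buyer cloud is classified +1, one near the non-buyers −1.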
Linear SVM: Separable Case
[Figure: separating line w1x1 + w2x2 + b = 0 with margin boundaries w1x1 + w2x2 + b = +1 and w1x1 + w2x2 + b = −1, and normal vector w]
The width of the margin is given by:
margin = 2 / ||w|| = 2 / √(w1² + w2²)
Note: maximizing the margin 2 / ||w|| is equivalent to minimizing ||w|| / 2, and hence to minimizing ||w||² / 2
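As a quick numerical check of the formula above, for an illustrative weight vector w = (3, 4) (an assumed example, not fitted from data) the margin width is 2/5:

```python
import numpy as np

# For any weight vector w, the margin width is 2 / ||w||.
w = np.array([3.0, 4.0])            # illustrative example weights
margin = 2.0 / np.linalg.norm(w)    # = 2 / sqrt(3^2 + 4^2) = 2 / 5
```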
Linear SVM: Separable Case
The optimization problem for SVM is:
minimize L(w) = ||w||² / 2
subject to:
w1xi1 + w2xi2 + b ≥ +1 for i ∈ ∆
w1xj1 + w2xj2 + b ≤ −1 for j ∈ ●
Linear SVM: Separable Case
“Support vectors” are those points that lie on the boundaries of the margin
The decision surface (line) is determined only by the support vectors; all other points are irrelevant
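This can be checked directly: refitting on the support vectors alone reproduces the same line. A sketch on assumed toy data (large C approximates the separable, hard-margin case):

```python
# Sketch (hypothetical toy data): the fitted line depends only on the
# support vectors, so refitting on them alone gives the same line.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 4.0], [2.0, 5.0], [3.0, 4.5],
              [6.0, 1.0], [7.0, 0.5], [8.0, 1.5]])
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # large C ~ hard margin
sv = clf.support_vectors_                      # points on the margin boundaries

# Refit using only the support vectors: the separating line is unchanged.
clf_sv = SVC(kernel="linear", C=1e6).fit(sv, y[clf.support_])
```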
Linear SVM: Separable Case
Non-separable case: no line separates the two groups without error
Here, SVM minimizes L(w, C):
L(w, C) = ||w||² / 2 + C Σi ξi
subject to:
w1xi1 + w2xi2 + b ≥ +1 − ξi for i ∈ ∆
w1xj1 + w2xj2 + b ≤ −1 + ξj for j ∈ ●
ξi, ξj ≥ 0
Training set: 1000 targeted customers
The first term maximizes the margin, the second minimizes the training errors:
L(w, C) = Complexity + Errors
Linear SVM: Nonseparable Case
[Figure: two fits of the same data, one with C = 5 (thinner margin), one with C = 1 (wider margin)]
Bigger C (thinner margin):
smaller number of errors (better fit on the data)
increased complexity
Smaller C (wider margin):
bigger number of errors (worse fit on the data)
decreased complexity
Linear SVM: The Role of C
C varies both complexity and empirical error, by affecting the optimal w and the optimal number of training errors
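The trade-off can be seen empirically. On overlapping (hence non-separable) synthetic blobs (an assumed stand-in data set), a small C widens the margin and tolerates more violations, so more points end up as support vectors than with a big C:

```python
# Sketch: effect of C on synthetic overlapping data (assumed, not the
# slide's data). Small C -> wide margin -> many support vectors.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)),     # class -1 blob
               rng.normal(1.5, 1.0, (50, 2))])    # class +1 blob (overlapping)
y = np.array([-1] * 50 + [1] * 50)

n_sv_small_C = SVC(kernel="linear", C=0.01).fit(X, y).n_support_.sum()
n_sv_big_C = SVC(kernel="linear", C=100.0).fit(X, y).n_support_.sum()
```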
Mapping into a higher-dimensional space
Optimization task: minimize L(w,C)
subject to:
w1 xi1² + w2 √2 xi1 xi2 + w3 xi2² + b ≥ +1 − ξi for i ∈ ∆
w1 xj1² + w2 √2 xj1 xj2 + w3 xj2² + b ≤ −1 + ξj for j ∈ ●
Nonlinear SVM: Nonseparable Case
Nonlinear SVM: Nonseparable Case
Map the data into a higher-dimensional space: ℝ² → ℝ³
(x1, x2) → (x1², √2 x1 x2, x2²)
[Figure: the four points (1, 1), (−1, 1), (1, −1), (−1, −1) (∆ and ●) in the original space and their images in the transformed space]
Nonlinear SVM: Nonseparable Case
Find the optimal hyperplane in the transformed space
[Figure: the mapped points and the optimal separating plane in the transformed (x1², √2 x1 x2, x2²) space]
Nonlinear SVM: Nonseparable Case
Observe the decision surface in the original space (optional)
[Figure: the nonlinear decision surface in the original (x1, x2) space induced by the plane in the transformed space]
Nonlinear SVM: Nonseparable Case
Dual formulation of the (primal) SVM minimization problem
Primal:
minimize ||w||² / 2 + C Σi ξi
subject to:
yi (w · xi + b) ≥ 1 − ξi
ξi ≥ 0, yi ∈ {−1, +1}

Dual:
maximize Σi αi − ½ Σi Σj αi αj yi yj (xi · xj)
subject to:
0 ≤ αi ≤ C
Σi αi yi = 0
Nonlinear SVM: Nonseparable Case
Dual formulation of the (primal) SVM minimization problem
Dual:
maximize Σi αi − ½ Σi Σj αi αj yi yj (xi · xj)
subject to:
0 ≤ αi ≤ C
Σi αi yi = 0

With the mapping (x1, x2) → (x1², √2 x1 x2, x2²), the inner product of two mapped points reduces to a function of the original points:
φ(xi) · φ(xj) = xi1² xj1² + 2 xi1 xi2 xj1 xj2 + xi2² xj2² = (xi · xj)²
K(xi, xj) = φ(xi) · φ(xj) (kernel function)
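The kernel identity can be verified numerically: for the degree-2 mapping φ(x) = (x1², √2·x1·x2, x2²), the inner product in the mapped space equals (x · z)², so the mapping never has to be computed explicitly.

```python
# Numerical check of the kernel trick: phi(x) . phi(z) == (x . z)^2.
import numpy as np

def phi(x):
    # Explicit degree-2 feature map from the slides.
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])

lhs = phi(x) @ phi(z)    # inner product in the transformed space
rhs = (x @ z) ** 2       # kernel K(x, z) = (x . z)^2, no mapping needed
```

Here x · z = 11, so both sides equal 121.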
Nonlinear SVM: Nonseparable Case
Dual formulation of the (primal) SVM minimization problem
Dual:
maximize Σi αi − ½ Σi Σj αi αj yi yj (xi · xj)
subject to:
0 ≤ αi ≤ C
Σi αi yi = 0

Replacing the inner product by the kernel, the dual becomes:
maximize Σi αi − ½ Σi Σj αi αj yi yj φ(xi) · φ(xj) = Σi αi − ½ Σi Σj αi αj yi yj K(xi, xj)
Strengths of SVM:
Training is relatively easy
No local minima
It scales relatively well to high-dimensional data
The trade-off between classifier complexity and error can be controlled explicitly via C
Robustness of the results
The "curse of dimensionality" is avoided
Weaknesses of SVM:
What is the best trade-off parameter C?
A good transformation of the original space is needed
Strengths and Weaknesses of SVM
The Ketchup Marketing Problem
Two types of ketchup: Heinz and Hunts
Seven attributes:
Feature Heinz
Feature Hunts
Display Heinz
Display Hunts
Feature&Display Heinz
Feature&Display Hunts
Log price difference between Heinz and Hunts
Training Data: 2498 cases (89.11% Heinz is chosen)
Test Data: 300 cases (88.33% Heinz is chosen)
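The hit rates reported in the tables that follow are the share of correctly classified cases, read off a confusion matrix of actual versus predicted brand. A sketch of the computation on made-up toy labels (not the actual ketchup data):

```python
# Sketch: computing a hit rate from a confusion matrix (toy labels).
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array(["Hunts", "Hunts", "Heinz", "Heinz", "Heinz"])
y_pred = np.array(["Hunts", "Heinz", "Heinz", "Heinz", "Heinz"])

# Rows: actual group; columns: predicted group.
cm = confusion_matrix(y_true, y_pred, labels=["Hunts", "Heinz"])
hit_rate = np.trace(cm) / cm.sum()   # correct predictions / all cases
```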
[Figure: surface of cross-validation mean squared errors over C and σ, SVM with RBF kernel, showing the min and max]
Do a (5-fold) cross-validation procedure to find the best combination of the manually adjustable parameters (here: C and σ)
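A minimal sketch of such a 5-fold search with scikit-learn's GridSearchCV, on synthetic stand-in data (the ketchup data is not reproduced here); note that scikit-learn parametrizes the RBF kernel by gamma = 1/(2σ²) rather than σ directly:

```python
# Sketch: 5-fold cross-validated grid search over C and the RBF width
# (synthetic stand-in data with 7 attributes, as in the slides).
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = rng.normal(size=(120, 7))                # 7 hypothetical attributes
y = np.where(X[:, 0] + X[:, 6] > 0, 1, -1)   # hypothetical choice rule

grid = {"C": [0.1, 1.0, 10.0], "gamma": [0.01, 0.1, 1.0]}
search = GridSearchCV(SVC(kernel="rbf"), grid, cv=5).fit(X, y)
best = search.best_params_   # best (C, gamma) combination found
```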
The Ketchup Marketing Problem
Choose a kernel mapping:
K(xi, xj) = xi · xj (linear kernel)
K(xi, xj) = (xi · xj + 1)^d (polynomial kernel)
K(xi, xj) = exp(−||xi − xj||² / 2σ²) (RBF kernel)
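The three kernels above, written out directly in NumPy (d and σ are the manually chosen parameters from the slides; the evaluation points are illustrative):

```python
# The three kernel functions from the slides, evaluated on two toy points.
import numpy as np

def k_linear(x, z):
    return x @ z

def k_poly(x, z, d=2):
    return (x @ z + 1) ** d

def k_rbf(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])
```

For these points x · z = 11, so the linear kernel gives 11, the degree-2 polynomial kernel 144, and the RBF kernel exp(−4).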
Model: Linear Discriminant Analysis
The Ketchup Marketing Problem – Training Set

Actual (count)   Predicted Hunts   Predicted Heinz   Total
Hunts            68                204               272
Heinz            58                2168              2226
Hit rate: 89.51%

Actual (%)       Predicted Hunts   Predicted Heinz   Total
Hunts            25.00%            75.00%            100.00%
Heinz            2.61%             97.39%            100.00%
Model: Logit Choice Model
The Ketchup Marketing Problem – Training Set

Actual (count)   Predicted Hunts   Predicted Heinz   Total
Hunts            214               58                272
Heinz            497               1729              2226
Hit rate: 77.79%

Actual (%)       Predicted Hunts   Predicted Heinz   Total
Hunts            78.68%            21.32%            100.00%
Heinz            22.33%            77.67%            100.00%
Model: Support Vector Machines
The Ketchup Marketing Problem – Training Set

Actual (count)   Predicted Hunts   Predicted Heinz   Total
Hunts            255               17                272
Heinz            6                 2220              2226
Hit rate: 99.08%

Actual (%)       Predicted Hunts   Predicted Heinz   Total
Hunts            93.75%            6.25%             100.00%
Heinz            0.27%             99.73%            100.00%
Model: Majority Voting
The Ketchup Marketing Problem – Training Set

Actual (count)   Predicted Hunts   Predicted Heinz   Total
Hunts            0                 272               272
Heinz            0                 2226              2226
Hit rate: 89.11%

Actual (%)       Predicted Hunts   Predicted Heinz   Total
Hunts            0%                100%              100.00%
Heinz            0%                100%              100.00%
Model: Linear Discriminant Analysis
The Ketchup Marketing Problem – Test Set

Actual (count)   Predicted Hunts   Predicted Heinz   Total
Hunts            3                 32                35
Heinz            3                 262               265
Hit rate: 88.33%

Actual (%)       Predicted Hunts   Predicted Heinz   Total
Hunts            8.57%             91.43%            100.00%
Heinz            1.13%             98.87%            100.00%
Model: Logit Choice Model
The Ketchup Marketing Problem – Test Set

Actual (count)   Predicted Hunts   Predicted Heinz   Total
Hunts            29                6                 35
Heinz            63                202               265
Hit rate: 77%

Actual (%)       Predicted Hunts   Predicted Heinz   Total
Hunts            82.86%            17.14%            100.00%
Heinz            23.77%            76.23%            100.00%
Model: Support Vector Machines
The Ketchup Marketing Problem – Test Set

Actual (count)   Predicted Hunts   Predicted Heinz   Total
Hunts            25                10                35
Heinz            3                 262               265
Hit rate: 95.67%

Actual (%)       Predicted Hunts   Predicted Heinz   Total
Hunts            71.43%            28.57%            100.00%
Heinz            1.13%             98.87%            100.00%
Conclusion
Support Vector Machines (SVM) can be applied to binary and multi-class classification problems
SVM behave robustly in multivariate problems
Further research in various Marketing areas is needed to justify
or refute the applicability of SVM
Support Vector Regressions (SVR) can also be applied
http://www.kernel-machines.org
Email: [email protected]