Support Vector Machines
Konstantin Tretyakov (kt@ut.ee)
MTAT.03.227 Machine Learning
So far…
  Supervised machine learning
    Linear models
      Least squares regression
      Fisher’s discriminant, Perceptron, Logistic model
    Non-linear models
      Neural networks, Decision trees, Association rules
  Unsupervised machine learning
    Clustering/EM, PCA
  Generic scaffolding
    Probabilistic modeling, ML/MAP estimation
    Performance evaluation, Statistical learning theory
    Linear algebra, Optimization methods
Coming up next
  Supervised machine learning
    Linear models
      Least squares regression, SVM
      Fisher’s discriminant, Perceptron, Logistic regression, SVM
    Non-linear models
      Neural networks, Decision trees, Association rules
      SVM, Kernel-XXX
  Unsupervised machine learning
    Clustering/EM, PCA, Kernel-XXX
  Generic scaffolding
    Probabilistic modeling, ML/MAP estimation
    Performance evaluation, Statistical learning theory
    Linear algebra, Optimization methods
  Kernels
First things first
SVM: ($y \in \{-1, 1\}$)

library('e1071')
m = svm(X, y, kernel = 'linear')  # train a linear SVM (y must be a factor for classification)
predict(m, newX)                  # predict labels for new data
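For instance, a complete toy run might look as follows (the data here is synthetic and purely illustrative; note that e1071 performs classification, rather than regression, only when y is a factor):

library(e1071)

# Two Gaussian blobs with labels -1 and +1
set.seed(1)
X <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 3), ncol = 2))
y <- factor(rep(c(-1, 1), each = 20))

m <- svm(X, y, kernel = 'linear')
newX <- matrix(c(0, 0, 3, 3), ncol = 2, byrow = TRUE)
predict(m, newX)  # expect -1 for (0, 0) and +1 for (3, 3)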
Quiz
[Figure: two classes of points in the plane, a separating line, its normal vector, and level lines.]
This line is called …
This vector is …
Those lines are …
$f(\boldsymbol{x}) = {}$?
$\boldsymbol{x}_1 = {}$?  $y_1 = {}$?
Functional margin of $\boldsymbol{x}_1$?
Geometric margin of $\boldsymbol{x}_1$?
Distance to origin?
Quiz
Separating hyperplane
Normal $\boldsymbol{w}$
Isolines (level lines)
$f(\boldsymbol{x}) = \boldsymbol{w}^T\boldsymbol{x} + b$
$\boldsymbol{x}_1 = (2, 6)$; $y_1 = -1$
Functional margin: $y_1 \cdot f(\boldsymbol{x}_1) \approx 2$
Geometric margin: $y_1 \cdot f(\boldsymbol{x}_1)/\|\boldsymbol{w}\| \approx 3\sqrt{2}$
Distance to origin: $d = |b|/\|\boldsymbol{w}\|$
Quiz
Suppose we scale $\boldsymbol{w}$ and $b$ by some constant.
Will it:
  Affect the separating hyperplane? How?
  Affect the functional margins? How?
  Affect the geometric margins? How?
Quiz
Example: $\boldsymbol{w} \to 2\boldsymbol{w}$, $b = 0$
Quiz
Suppose we scale $\boldsymbol{w}$ and $b$ by some constant.
Will it:
  Affect the separating hyperplane? How?
    No: $\boldsymbol{w}^T\boldsymbol{x} + b = 0 \;\Leftrightarrow\; 2\boldsymbol{w}^T\boldsymbol{x} + 2b = 0$
  Affect the functional margins? How?
    Yes: $(2\boldsymbol{w}^T\boldsymbol{x} + 2b)\,y = 2\,(\boldsymbol{w}^T\boldsymbol{x} + b)\,y$
  Affect the geometric margins? How?
    No: $\dfrac{2\boldsymbol{w}^T\boldsymbol{x} + 2b}{\|2\boldsymbol{w}\|} = \dfrac{\boldsymbol{w}^T\boldsymbol{x} + b}{\|\boldsymbol{w}\|}$
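A quick numeric check of this invariance (the values of w, b, x, and y below are made up for illustration):

# Scaling (w, b) doubles functional margins but leaves geometric margins intact
w <- c(1, -2); b <- 0.5
x <- c(2, 6);  y <- -1

margin_f <- function(w, b, x, y) y * (sum(w * x) + b)                    # functional
margin_g <- function(w, b, x, y) margin_f(w, b, x, y) / sqrt(sum(w^2))   # geometric

margin_f(w, b, x, y); margin_f(2 * w, 2 * b, x, y)  # 9.5 vs 19: doubled
margin_g(w, b, x, y); margin_g(2 * w, 2 * b, x, y)  # 4.248 vs 4.248: unchanged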
Which classifier is best?
Maximal margin classifier
Why maximal margin?
Well-defined, single stable solution
Noise-tolerant
Small parameterization
(Fairly) efficient algorithms exist for finding it
Maximal margin: Separable case
Isolines $f(\boldsymbol{x}) = 1$ and $f(\boldsymbol{x}) = -1$ bound the margin.
We require every point to satisfy: $\forall i \;\; f(\boldsymbol{x}_i)\, y_i \ge 1$
The (geometric) distance from the separating hyperplane to the isoline $f(\boldsymbol{x}) = 1$ is:
$$d = \frac{f(\boldsymbol{x})}{\|\boldsymbol{w}\|} = \frac{1}{\|\boldsymbol{w}\|}$$
Maximal margin: Separable case
Among all linear classifiers $(\boldsymbol{w}, b)$
… which keep all points at a functional margin of 1 or more,
… we shall look for the one which has the largest distance $d$ to the corresponding isolines, i.e. the largest geometric margin.
As $d = \frac{1}{\|\boldsymbol{w}\|}$, this is equivalent to finding the classifier with minimal $\|\boldsymbol{w}\|$ …
… which is equivalent to finding the classifier with minimal $\|\boldsymbol{w}\|^2$.
Compare
“Generic” linear classification (separable case):
Find $(\boldsymbol{w}, b)$ such that all points are classified correctly,
i.e. $f(\boldsymbol{x}_i)\, y_i > 0$.
Maximal margin classification (separable case):
Find $(\boldsymbol{w}, b)$ such that all points are classified correctly with a fixed functional margin,
i.e. $f(\boldsymbol{x}_i)\, y_i \ge 1$,
and $\|\boldsymbol{w}\|^2$ is minimal.
Remember
SVM optimization problem (separable case):
$$\min_{\boldsymbol{w},b} \; \frac{1}{2}\|\boldsymbol{w}\|^2$$
so that
$$(\boldsymbol{w}^T\boldsymbol{x}_i + b)\, y_i \ge 1$$
General case (“soft margin”)
The same, but we also penalize all margin violations:
$$\min_{\boldsymbol{w},b} \; \frac{1}{2}\|\boldsymbol{w}\|^2 + C \sum_i \xi_i, \qquad \xi_i = \big(1 - f(\boldsymbol{x}_i)\, y_i\big)_+$$
Writing $m_i = f(\boldsymbol{x}_i)\, y_i$ for the (functional) margin of point $i$, the penalty term is the hinge loss:
$$\min_{\boldsymbol{w},b} \; \frac{1}{2}\|\boldsymbol{w}\|^2 + C \sum_i \mathrm{hinge}(m_i), \qquad \mathrm{hinge}(m_i) = (1 - m_i)_+$$
Classification loss functions
“Generic” classification:
$$\min_{\boldsymbol{w},b} \sum_i [m_i < 0]$$
Perceptron:
$$\min_{\boldsymbol{w},b} \sum_i (-m_i)_+$$
Least squares classification*:
$$\min_{\boldsymbol{w},b} \sum_i (m_i - 1)^2$$
Boosting:
$$\min_{\boldsymbol{w},b} \sum_i \exp(-m_i)$$
Logistic regression:
$$\min_{\boldsymbol{w},b} \sum_i \log(1 + e^{-m_i})$$
Regularized logistic regression:
$$\min_{\boldsymbol{w},b} \sum_i \log(1 + e^{-m_i}) + \lambda \cdot \frac{1}{2}\|\boldsymbol{w}\|^2$$
SVM:
$$\min_{\boldsymbol{w},b} \sum_i (1 - m_i)_+ + \frac{1}{2C}\|\boldsymbol{w}\|^2$$
L2-SVM:
$$\min_{\boldsymbol{w},b} \sum_i (1 - m_i)_+^2 + \frac{1}{2C}\|\boldsymbol{w}\|^2$$
L1-regularized L2-SVM:
$$\min_{\boldsymbol{w},b} \sum_i (1 - m_i)_+^2 + \frac{1}{2C}\|\boldsymbol{w}\|_1$$
… etc.
In general:
$$\min_{\boldsymbol{w},b} \sum_i \phi(m_i) + \lambda \cdot \Omega(\boldsymbol{w})$$
The first term measures model fit, the second model complexity.
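All of these losses are simple functions of the margin; a small R sketch makes the comparison concrete (the margin values in m are arbitrary):

# Margin-based classification losses
m <- seq(-2, 2, by = 0.5)  # illustrative margins

zero_one    <- function(m) as.numeric(m < 0)   # "generic" 0-1 loss
perceptron  <- function(m) pmax(-m, 0)
squared     <- function(m) (m - 1)^2           # least squares
exponential <- function(m) exp(-m)             # boosting
logistic    <- function(m) log(1 + exp(-m))
hinge       <- function(m) pmax(1 - m, 0)      # SVM
hinge_sq    <- function(m) pmax(1 - m, 0)^2    # L2-SVM

sapply(list(zero_one = zero_one, perceptron = perceptron, squared = squared,
            exponential = exponential, logistic = logistic,
            hinge = hinge, hinge_sq = hinge_sq),
       function(f) f(m))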
Compare to MAP estimation:
$$\max_{\mathrm{Model}} \sum_i \log P(x_i \mid \mathrm{Model}) + \log P(\mathrm{Model})$$
i.e.
$$\max_{\mathrm{Model}} \; \log P(\mathrm{Data} \mid \mathrm{Model}) + \log P(\mathrm{Model})$$
The first term is the likelihood, the second the model prior.
Solving the SVM
$$\min_{\boldsymbol{w},b} \; \frac{1}{2}\|\boldsymbol{w}\|^2 + C \sum_i \big(1 - f(\boldsymbol{x}_i)\, y_i\big)_+$$
Equivalently, with slack variables $\xi_i$:
$$\min_{\boldsymbol{w},b,\boldsymbol{\xi}} \; \frac{1}{2}\|\boldsymbol{w}\|^2 + C \sum_i \xi_i$$
such that
$$f(\boldsymbol{x}_i)\, y_i - (1 - \xi_i) \ge 0, \qquad \xi_i \ge 0$$
A quadratic function with linear constraints!
Quadratic programming
Minimize
$$f(\boldsymbol{x}) = \frac{1}{2}\boldsymbol{x}^T \boldsymbol{Q}\boldsymbol{x} + \boldsymbol{c}^T\boldsymbol{x}$$
subject to:
$$\boldsymbol{A}\boldsymbol{x} \ge \boldsymbol{b}, \qquad \boldsymbol{C}\boldsymbol{x} = \boldsymbol{d}$$

> library(quadprog)
> # solve.QP(Dmat, dvec, Amat, bvec, meq) minimizes (1/2) x^T Dmat x - dvec^T x
> # subject to t(Amat) %*% x >= bvec, with the first meq constraints as equalities
> solve.QP(Q, -c, t(A), b, meq = 0)
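As a sanity check of the interface, here is a toy QP with made-up numbers: minimize $\frac{1}{2}(x_1^2 + x_2^2) - 2x_1 - 4x_2$ subject to $x_1 + x_2 = 1$ and $x_1, x_2 \ge 0$:

library(quadprog)

Q <- diag(2)
cvec <- c(-2, -4)
A <- rbind(c(1, 1),   # equality constraint x1 + x2 = 1 (must come first)
           diag(2))   # inequality constraints x1 >= 0, x2 >= 0
b <- c(1, 0, 0)
sol <- solve.QP(Dmat = Q, dvec = -cvec, Amat = t(A), bvec = b, meq = 1)
sol$solution  # (0, 1): the unconstrained optimum (2, 4) pulled onto the constraints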
Solving the SVM: Dual
$$\min_{\boldsymbol{w},b,\boldsymbol{\xi}} \; \frac{1}{2}\|\boldsymbol{w}\|^2 + C \sum_i \xi_i \quad\text{such that}\quad f(\boldsymbol{x}_i)\, y_i - (1 - \xi_i) \ge 0,\; \xi_i \ge 0$$
is equivalent to:
$$\min_{\boldsymbol{w},b,\boldsymbol{\xi}} \; \max_{\boldsymbol{\alpha} \ge 0,\, \boldsymbol{\beta} \ge 0} \; \frac{1}{2}\|\boldsymbol{w}\|^2 + C \sum_i \xi_i - \sum_i \alpha_i \big(f(\boldsymbol{x}_i)\, y_i - (1 - \xi_i)\big) - \sum_i \beta_i \xi_i$$
Grouping the $\xi_i$ terms:
$$\min_{\boldsymbol{w},b,\boldsymbol{\xi}} \; \max_{\boldsymbol{\alpha} \ge 0,\, \boldsymbol{\beta} \ge 0} \; \frac{1}{2}\|\boldsymbol{w}\|^2 + \sum_i \xi_i (C - \alpha_i - \beta_i) - \sum_i \alpha_i \big(f(\boldsymbol{x}_i)\, y_i - 1\big)$$
For the inner max to stay finite we need $C - \alpha_i - \beta_i = 0$; together with $\alpha_i, \beta_i \ge 0$ this gives
$$0 \le \alpha_i \le C$$
Solving the SVM: Dual
$$\min_{\boldsymbol{w},b} \; \max_{\boldsymbol{\alpha}} \; \frac{1}{2}\|\boldsymbol{w}\|^2 - \sum_i \alpha_i \big(f(\boldsymbol{x}_i)\, y_i - 1\big), \qquad 0 \le \alpha_i \le C$$
Sparsity: $\alpha_i$ is nonzero only for those points which have
$$f(\boldsymbol{x}_i)\, y_i - 1 \le 0$$
Solving the SVM: Dual
Now swap the min and the max (this can be done in particular because everything is nice and convex):
$$\max_{\boldsymbol{\alpha}} \; \min_{\boldsymbol{w},b} \; \frac{1}{2}\|\boldsymbol{w}\|^2 - \sum_i \alpha_i \big(f(\boldsymbol{x}_i)\, y_i - 1\big), \qquad 0 \le \alpha_i \le C$$
Next solve the inner (unconstrained) min as usual:
$$\nabla_{\boldsymbol{w}} = \boldsymbol{w} - \sum_i \alpha_i y_i \boldsymbol{x}_i = 0, \qquad \nabla_b = -\sum_i \alpha_i y_i = 0$$
Express $\boldsymbol{w}$ and substitute:
$$\boldsymbol{w} = \sum_i \alpha_i y_i \boldsymbol{x}_i \quad\text{(the dual representation)}, \qquad \sum_i \alpha_i y_i = 0$$
Solving the SVM: Dual
After substitution:
$$\max_{\boldsymbol{\alpha}} \; \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \boldsymbol{x}_i^T \boldsymbol{x}_j, \qquad 0 \le \alpha_i \le C, \quad \sum_i \alpha_i y_i = 0$$
Solving the SVM: Dual
In matrix form:
$$\max_{\boldsymbol{\alpha}} \; \boldsymbol{1}^T\boldsymbol{\alpha} - \frac{1}{2}\boldsymbol{\alpha}^T (\boldsymbol{K} \circ \boldsymbol{Y}) \boldsymbol{\alpha}, \qquad \boldsymbol{0} \le \boldsymbol{\alpha} \le C, \quad \boldsymbol{y}^T\boldsymbol{\alpha} = 0$$
where $K_{ij} = \boldsymbol{x}_i^T\boldsymbol{x}_j$ and $Y_{ij} = y_i y_j$.
Solving the SVM: Dual
As a quadratic program:
$$\min_{\boldsymbol{\alpha}} \; \frac{1}{2}\boldsymbol{\alpha}^T (\boldsymbol{K} \circ \boldsymbol{Y}) \boldsymbol{\alpha} - \boldsymbol{1}^T\boldsymbol{\alpha}, \qquad \boldsymbol{\alpha} \ge \boldsymbol{0}, \quad -\boldsymbol{\alpha} \ge -C, \quad \boldsymbol{y}^T\boldsymbol{\alpha} = 0$$
Then find $b$ from the condition:
$$f(\boldsymbol{x}_i)\, y_i = 1 \quad\text{if}\quad 0 < \alpha_i < C$$
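This dual can be handed to quadprog directly. A minimal sketch on toy data follows; the small ridge added to K ∘ Y is a practical workaround, since solve.QP insists on a strictly positive definite matrix:

library(quadprog)

# Toy linearly separable data, labels in {-1, +1}
set.seed(1)
X <- rbind(matrix(rnorm(20), ncol = 2),
           matrix(rnorm(20, mean = 4), ncol = 2))
y <- rep(c(-1, 1), each = 10)
n <- length(y); C <- 10

K <- X %*% t(X)                          # linear kernel: K_ij = x_i^T x_j
D <- K * (y %*% t(y)) + 1e-8 * diag(n)   # K ∘ Y, ridged for positive definiteness
A <- cbind(y, diag(n), -diag(n))         # y^T alpha = 0;  alpha >= 0;  -alpha >= -C
b0 <- c(0, rep(0, n), rep(-C, n))
alpha <- solve.QP(Dmat = D, dvec = rep(1, n), Amat = A, bvec = b0, meq = 1)$solution

w  <- colSums(alpha * y * X)                      # w = sum_i alpha_i y_i x_i
sv <- which(alpha > 1e-6 & alpha < C - 1e-6)      # points with 0 < alpha_i < C
b  <- mean(y[sv] - X[sv, , drop = FALSE] %*% w)   # from f(x_i) y_i = 1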
Support vectors
[Figure: training points annotated with their dual coefficients $\alpha_i$ (values 0, 0.5, 1, and $C$).]
$$\sum_i \alpha_i y_i = 0, \qquad 0 \le \alpha_i \le C$$
Sparsity
The dual solution is often very sparse. This allows optimization to be performed efficiently, using a “working set” approach.
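In e1071, the fitted model exposes exactly these quantities; for the model m trained earlier (field names as documented in the package):

m$index  # indices of the support vectors among the training points
m$SV     # the support vectors themselves
m$coefs  # their coefficients alpha_i * y_i
m$rho    # negative intercept: e1071's decision function is w^T x - rho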
Kernels
$$f(\boldsymbol{x}) = \boldsymbol{w}^T\boldsymbol{x} + b, \qquad \boldsymbol{w} = \sum_i \alpha_i y_i \boldsymbol{x}_i$$
$$\Rightarrow\; f(\boldsymbol{x}) = \sum_i \alpha_i y_i\, \boldsymbol{x}_i^T\boldsymbol{x} + b$$
Replacing the inner product with a kernel function:
$$f(\boldsymbol{x}) = \sum_i \alpha_i y_i\, K(\boldsymbol{x}_i, \boldsymbol{x}) + b$$
Examples:
$$f(x) = w_1 x + w_2 x^2 + b \qquad\text{(quadratic features)}$$
$$f(\boldsymbol{x}) = \sum_i \alpha_i y_i \exp\!\big(-\|\boldsymbol{x}_i - \boldsymbol{x}\|^2\big) + b \qquad\text{(RBF kernel)}$$
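In e1071 the kernel trick is a one-argument change. A sketch on data that no linear classifier can separate (the ring-shaped dataset is constructed purely for illustration):

library(e1071)

# Inner blob of one class surrounded by a ring of the other
set.seed(1)
r <- c(runif(50, 0, 1), runif(50, 2, 3))  # radii
theta <- runif(100, 0, 2 * pi)            # angles
X <- cbind(r * cos(theta), r * sin(theta))
y <- factor(rep(c(1, -1), each = 50))

m_lin <- svm(X, y, kernel = 'linear')
m_rbf <- svm(X, y, kernel = 'radial', gamma = 1)  # K(u, v) = exp(-gamma * |u - v|^2)
mean(predict(m_lin, X) == y)  # mediocre: no separating hyperplane exists
mean(predict(m_rbf, X) == y)  # close to 1: the RBF kernel handles the ring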
Quiz
SVM is a __________ linear classifier.
Margin maximization can be achieved via
minimization of ______________.
SVM uses _____ loss and _______
regularization.
Besides hinge loss I also know ____ loss and
___ loss.
SVM in both primal and dual form is solved
using ________ programming.
Quiz
In primal formulation we solve for parameter
vector ___. In dual formulation we solve for
___ instead.
_____ form of SVM is typically sparse.
Support vectors are those training points for
which _______.
The relation between primal and dual variables is: $\_\_\_ = \sum_i \_\_\_\_\_\_$.
A Kernel is a generalization of _____ product.