Support Vector Machines
Konstantin Tretyakov (kt@ut.ee)
MTAT.03.227 Machine Learning
So far…
  Supervised machine learning
    Linear models
      Least squares regression
      Fisher’s discriminant, Perceptron, Logistic model
    Non-linear models
      Neural networks, Decision trees, Association rules
  Unsupervised machine learning
    Clustering/EM, PCA
  Generic scaffolding
    Probabilistic modeling, ML/MAP estimation
    Performance evaluation, Statistical learning theory
    Linear algebra, Optimization methods
Coming up next
  Supervised machine learning
    Linear models
      Least squares regression, SVM
      Fisher’s discriminant, Perceptron, Logistic regression, SVM
    Non-linear models
      Neural networks, Decision trees, Association rules
      SVM, Kernel-XXX
  Unsupervised machine learning
    Clustering/EM, PCA, Kernel-XXX
  Generic scaffolding
    Probabilistic modeling, ML/MAP estimation
    Performance evaluation, Statistical learning theory
    Linear algebra, Optimization methods
  Kernels
First things first
SVM: ($y \in \{-1, 1\}$)

library('e1071')
m = svm(X, y, kernel = 'linear')  # train a linear SVM (y must be a factor for classification)
predict(m, newX)                  # predict labels for new data
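For instance, a complete toy run might look as follows (the data here is synthetic and purely illustrative; note that e1071 performs classification, rather than regression, only when y is a factor):

library(e1071)

# Two Gaussian blobs with labels -1 and +1
set.seed(1)
X <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 3), ncol = 2))
y <- factor(rep(c(-1, 1), each = 20))

m <- svm(X, y, kernel = 'linear')
newX <- matrix(c(0, 0, 3, 3), ncol = 2, byrow = TRUE)
predict(m, newX)  # expect -1 for (0, 0) and +1 for (3, 3)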
Quiz
[Figure: two classes of points in the plane, a separating line, its normal vector, and level lines.]
This line is called …
This vector is …
Those lines are …
$f(\boldsymbol{x}) = {}$?
$\boldsymbol{x}_1 = {}$?  $y_1 = {}$?
Functional margin of $\boldsymbol{x}_1$?
Geometric margin of $\boldsymbol{x}_1$?
Distance to origin?
Quiz
Separating hyperplane
Normal $\boldsymbol{w}$
Isolines (level lines)
$f(\boldsymbol{x}) = \boldsymbol{w}^T\boldsymbol{x} + b$
$\boldsymbol{x}_1 = (2, 6)$; $y_1 = -1$
Functional margin: $y_1 \cdot f(\boldsymbol{x}_1) \approx 2$
Geometric margin: $y_1 \cdot f(\boldsymbol{x}_1)/\|\boldsymbol{w}\| \approx 3\sqrt{2}$
Distance to origin: $d = |b|/\|\boldsymbol{w}\|$
Quiz
Suppose we scale $\boldsymbol{w}$ and $b$ by some constant.
Will it:
  Affect the separating hyperplane? How?
  Affect the functional margins? How?
  Affect the geometric margins? How?
Quiz
Example: $\boldsymbol{w} \to 2\boldsymbol{w}$, $b = 0$
Quiz
Suppose we scale $\boldsymbol{w}$ and $b$ by some constant.
Will it:
  Affect the separating hyperplane? How?
    No: $\boldsymbol{w}^T\boldsymbol{x} + b = 0 \;\Leftrightarrow\; 2\boldsymbol{w}^T\boldsymbol{x} + 2b = 0$
  Affect the functional margins? How?
    Yes: $(2\boldsymbol{w}^T\boldsymbol{x} + 2b)\,y = 2\,(\boldsymbol{w}^T\boldsymbol{x} + b)\,y$
  Affect the geometric margins? How?
    No: $\dfrac{2\boldsymbol{w}^T\boldsymbol{x} + 2b}{\|2\boldsymbol{w}\|} = \dfrac{\boldsymbol{w}^T\boldsymbol{x} + b}{\|\boldsymbol{w}\|}$
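A quick numeric check of this invariance (the values of w, b, x, and y below are made up for illustration):

# Scaling (w, b) doubles functional margins but leaves geometric margins intact
w <- c(1, -2); b <- 0.5
x <- c(2, 6);  y <- -1

margin_f <- function(w, b, x, y) y * (sum(w * x) + b)                    # functional
margin_g <- function(w, b, x, y) margin_f(w, b, x, y) / sqrt(sum(w^2))   # geometric

margin_f(w, b, x, y); margin_f(2 * w, 2 * b, x, y)  # 9.5 vs 19: doubled
margin_g(w, b, x, y); margin_g(2 * w, 2 * b, x, y)  # 4.248 vs 4.248: unchanged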
Which classifier is best?
Maximal margin classifier
Why maximal margin?
Well-defined, single stable solution
Noise-tolerant
Small parameterization
(Fairly) efficient algorithms exist for finding it
Maximal margin: Separable case
Isolines $f(\boldsymbol{x}) = 1$ and $f(\boldsymbol{x}) = -1$ bound the margin.
We require every point to satisfy: $\forall i \;\; f(\boldsymbol{x}_i)\, y_i \ge 1$
The (geometric) distance from the separating hyperplane to the isoline $f(\boldsymbol{x}) = 1$ is:
$$d = \frac{f(\boldsymbol{x})}{\|\boldsymbol{w}\|} = \frac{1}{\|\boldsymbol{w}\|}$$
Maximal margin: Separable case
Among all linear classifiers $(\boldsymbol{w}, b)$
… which keep all points at a functional margin of 1 or more,
… we shall look for the one which has the largest distance $d$ to the corresponding isolines, i.e. the largest geometric margin.
As $d = \frac{1}{\|\boldsymbol{w}\|}$, this is equivalent to finding the classifier with minimal $\|\boldsymbol{w}\|$ …
… which is equivalent to finding the classifier with minimal $\|\boldsymbol{w}\|^2$.
Compare
“Generic” linear classification (separable case):
Find $(\boldsymbol{w}, b)$ such that all points are classified correctly,
i.e. $f(\boldsymbol{x}_i)\, y_i > 0$.
Maximal margin classification (separable case):
Find $(\boldsymbol{w}, b)$ such that all points are classified correctly with a fixed functional margin,
i.e. $f(\boldsymbol{x}_i)\, y_i \ge 1$,
and $\|\boldsymbol{w}\|^2$ is minimal.
Remember
SVM optimization problem (separable case):
$$\min_{\boldsymbol{w},b} \; \frac{1}{2}\|\boldsymbol{w}\|^2$$
so that
$$(\boldsymbol{w}^T\boldsymbol{x}_i + b)\, y_i \ge 1$$
General case (“soft margin”)
The same, but we also penalize all margin violations:
$$\min_{\boldsymbol{w},b} \; \frac{1}{2}\|\boldsymbol{w}\|^2 + C \sum_i \xi_i, \qquad \xi_i = \big(1 - f(\boldsymbol{x}_i)\, y_i\big)_+$$
Writing $m_i = f(\boldsymbol{x}_i)\, y_i$ for the (functional) margin of point $i$, the penalty term is the hinge loss:
$$\min_{\boldsymbol{w},b} \; \frac{1}{2}\|\boldsymbol{w}\|^2 + C \sum_i \mathrm{hinge}(m_i), \qquad \mathrm{hinge}(m_i) = (1 - m_i)_+$$
Classification loss functions
“Generic” classification:
$$\min_{\boldsymbol{w},b} \sum_i [m_i < 0]$$
Perceptron:
$$\min_{\boldsymbol{w},b} \sum_i (-m_i)_+$$
Least squares classification*:
$$\min_{\boldsymbol{w},b} \sum_i (m_i - 1)^2$$
Boosting:
$$\min_{\boldsymbol{w},b} \sum_i \exp(-m_i)$$
Logistic regression:
$$\min_{\boldsymbol{w},b} \sum_i \log(1 + e^{-m_i})$$
Regularized logistic regression:
$$\min_{\boldsymbol{w},b} \sum_i \log(1 + e^{-m_i}) + \lambda \cdot \frac{1}{2}\|\boldsymbol{w}\|^2$$
SVM:
$$\min_{\boldsymbol{w},b} \sum_i (1 - m_i)_+ + \frac{1}{2C}\|\boldsymbol{w}\|^2$$
L2-SVM:
$$\min_{\boldsymbol{w},b} \sum_i (1 - m_i)_+^2 + \frac{1}{2C}\|\boldsymbol{w}\|^2$$
L1-regularized L2-SVM:
$$\min_{\boldsymbol{w},b} \sum_i (1 - m_i)_+^2 + \frac{1}{2C}\|\boldsymbol{w}\|_1$$
… etc.
In general:
$$\min_{\boldsymbol{w},b} \sum_i \phi(m_i) + \lambda \cdot \Omega(\boldsymbol{w})$$
The first term measures model fit, the second model complexity.
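All of these losses are simple functions of the margin; a small R sketch makes the comparison concrete (the margin values in m are arbitrary):

# Margin-based classification losses
m <- seq(-2, 2, by = 0.5)  # illustrative margins

zero_one    <- function(m) as.numeric(m < 0)   # "generic" 0-1 loss
perceptron  <- function(m) pmax(-m, 0)
squared     <- function(m) (m - 1)^2           # least squares
exponential <- function(m) exp(-m)             # boosting
logistic    <- function(m) log(1 + exp(-m))
hinge       <- function(m) pmax(1 - m, 0)      # SVM
hinge_sq    <- function(m) pmax(1 - m, 0)^2    # L2-SVM

sapply(list(zero_one = zero_one, perceptron = perceptron, squared = squared,
            exponential = exponential, logistic = logistic,
            hinge = hinge, hinge_sq = hinge_sq),
       function(f) f(m))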
Compare to MAP estimation:
$$\max_{\mathrm{Model}} \sum_i \log P(x_i \mid \mathrm{Model}) + \log P(\mathrm{Model})$$
i.e.
$$\max_{\mathrm{Model}} \; \log P(\mathrm{Data} \mid \mathrm{Model}) + \log P(\mathrm{Model})$$
The first term is the likelihood, the second the model prior.
Solving the SVM
$$\min_{\boldsymbol{w},b} \; \frac{1}{2}\|\boldsymbol{w}\|^2 + C \sum_i \big(1 - f(\boldsymbol{x}_i)\, y_i\big)_+$$
Equivalently, with slack variables $\xi_i$:
$$\min_{\boldsymbol{w},b,\boldsymbol{\xi}} \; \frac{1}{2}\|\boldsymbol{w}\|^2 + C \sum_i \xi_i$$
such that
$$f(\boldsymbol{x}_i)\, y_i - (1 - \xi_i) \ge 0, \qquad \xi_i \ge 0$$
A quadratic function with linear constraints!
Quadratic programming
Minimize
$$f(\boldsymbol{x}) = \frac{1}{2}\boldsymbol{x}^T \boldsymbol{Q}\boldsymbol{x} + \boldsymbol{c}^T\boldsymbol{x}$$
subject to:
$$\boldsymbol{A}\boldsymbol{x} \ge \boldsymbol{b}, \qquad \boldsymbol{C}\boldsymbol{x} = \boldsymbol{d}$$

> library(quadprog)
> # solve.QP(Dmat, dvec, Amat, bvec, meq) minimizes (1/2) x^T Dmat x - dvec^T x
> # subject to t(Amat) %*% x >= bvec, with the first meq constraints as equalities
> solve.QP(Q, -c, t(A), b, meq = 0)
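As a sanity check of the interface, here is a toy QP with made-up numbers: minimize $\frac{1}{2}(x_1^2 + x_2^2) - 2x_1 - 4x_2$ subject to $x_1 + x_2 = 1$ and $x_1, x_2 \ge 0$:

library(quadprog)

Q <- diag(2)
cvec <- c(-2, -4)
A <- rbind(c(1, 1),   # equality constraint x1 + x2 = 1 (must come first)
           diag(2))   # inequality constraints x1 >= 0, x2 >= 0
b <- c(1, 0, 0)
sol <- solve.QP(Dmat = Q, dvec = -cvec, Amat = t(A), bvec = b, meq = 1)
sol$solution  # (0, 1): the unconstrained optimum (2, 4) pulled onto the constraints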
Solving the SVM: Dual
$$\min_{\boldsymbol{w},b,\boldsymbol{\xi}} \; \frac{1}{2}\|\boldsymbol{w}\|^2 + C \sum_i \xi_i \quad\text{such that}\quad f(\boldsymbol{x}_i)\, y_i - (1 - \xi_i) \ge 0,\; \xi_i \ge 0$$
is equivalent to:
$$\min_{\boldsymbol{w},b,\boldsymbol{\xi}} \; \max_{\boldsymbol{\alpha} \ge 0,\, \boldsymbol{\beta} \ge 0} \; \frac{1}{2}\|\boldsymbol{w}\|^2 + C \sum_i \xi_i - \sum_i \alpha_i \big(f(\boldsymbol{x}_i)\, y_i - (1 - \xi_i)\big) - \sum_i \beta_i \xi_i$$
Grouping the $\xi_i$ terms:
$$\min_{\boldsymbol{w},b,\boldsymbol{\xi}} \; \max_{\boldsymbol{\alpha} \ge 0,\, \boldsymbol{\beta} \ge 0} \; \frac{1}{2}\|\boldsymbol{w}\|^2 + \sum_i \xi_i (C - \alpha_i - \beta_i) - \sum_i \alpha_i \big(f(\boldsymbol{x}_i)\, y_i - 1\big)$$
For the inner max to stay finite we need $C - \alpha_i - \beta_i = 0$; together with $\alpha_i, \beta_i \ge 0$ this gives
$$0 \le \alpha_i \le C$$
Solving the SVM: Dual
$$\min_{\boldsymbol{w},b} \; \max_{\boldsymbol{\alpha}} \; \frac{1}{2}\|\boldsymbol{w}\|^2 - \sum_i \alpha_i \big(f(\boldsymbol{x}_i)\, y_i - 1\big), \qquad 0 \le \alpha_i \le C$$
Sparsity: $\alpha_i$ is nonzero only for those points which have
$$f(\boldsymbol{x}_i)\, y_i - 1 \le 0$$
Solving the SVM: Dual
Now swap the min and the max (this can be done in particular because everything is nice and convex):
$$\max_{\boldsymbol{\alpha}} \; \min_{\boldsymbol{w},b} \; \frac{1}{2}\|\boldsymbol{w}\|^2 - \sum_i \alpha_i \big(f(\boldsymbol{x}_i)\, y_i - 1\big), \qquad 0 \le \alpha_i \le C$$
Next solve the inner (unconstrained) min as usual:
$$\nabla_{\boldsymbol{w}} = \boldsymbol{w} - \sum_i \alpha_i y_i \boldsymbol{x}_i = 0, \qquad \nabla_b = -\sum_i \alpha_i y_i = 0$$
Express $\boldsymbol{w}$ and substitute:
$$\boldsymbol{w} = \sum_i \alpha_i y_i \boldsymbol{x}_i \quad\text{(the dual representation)}, \qquad \sum_i \alpha_i y_i = 0$$
Solving the SVM: Dual
After substitution:
$$\max_{\boldsymbol{\alpha}} \; \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \boldsymbol{x}_i^T \boldsymbol{x}_j, \qquad 0 \le \alpha_i \le C, \quad \sum_i \alpha_i y_i = 0$$
Solving the SVM: Dual
In matrix form:
$$\max_{\boldsymbol{\alpha}} \; \boldsymbol{1}^T\boldsymbol{\alpha} - \frac{1}{2}\boldsymbol{\alpha}^T (\boldsymbol{K} \circ \boldsymbol{Y}) \boldsymbol{\alpha}, \qquad \boldsymbol{0} \le \boldsymbol{\alpha} \le C, \quad \boldsymbol{y}^T\boldsymbol{\alpha} = 0$$
where $K_{ij} = \boldsymbol{x}_i^T\boldsymbol{x}_j$ and $Y_{ij} = y_i y_j$.
Solving the SVM: Dual
As a quadratic program:
$$\min_{\boldsymbol{\alpha}} \; \frac{1}{2}\boldsymbol{\alpha}^T (\boldsymbol{K} \circ \boldsymbol{Y}) \boldsymbol{\alpha} - \boldsymbol{1}^T\boldsymbol{\alpha}, \qquad \boldsymbol{\alpha} \ge \boldsymbol{0}, \quad -\boldsymbol{\alpha} \ge -C, \quad \boldsymbol{y}^T\boldsymbol{\alpha} = 0$$
Then find $b$ from the condition:
$$f(\boldsymbol{x}_i)\, y_i = 1 \quad\text{if}\quad 0 < \alpha_i < C$$
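This dual can be handed to quadprog directly. A minimal sketch on toy data follows; the small ridge added to K ∘ Y is a practical workaround, since solve.QP insists on a strictly positive definite matrix:

library(quadprog)

# Toy linearly separable data, labels in {-1, +1}
set.seed(1)
X <- rbind(matrix(rnorm(20), ncol = 2),
           matrix(rnorm(20, mean = 4), ncol = 2))
y <- rep(c(-1, 1), each = 10)
n <- length(y); C <- 10

K <- X %*% t(X)                          # linear kernel: K_ij = x_i^T x_j
D <- K * (y %*% t(y)) + 1e-8 * diag(n)   # K ∘ Y, ridged for positive definiteness
A <- cbind(y, diag(n), -diag(n))         # y^T alpha = 0;  alpha >= 0;  -alpha >= -C
b0 <- c(0, rep(0, n), rep(-C, n))
alpha <- solve.QP(Dmat = D, dvec = rep(1, n), Amat = A, bvec = b0, meq = 1)$solution

w  <- colSums(alpha * y * X)                      # w = sum_i alpha_i y_i x_i
sv <- which(alpha > 1e-6 & alpha < C - 1e-6)      # points with 0 < alpha_i < C
b  <- mean(y[sv] - X[sv, , drop = FALSE] %*% w)   # from f(x_i) y_i = 1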
Support vectors
[Figure: training points annotated with their dual coefficients $\alpha_i$ (values 0, 0.5, 1, and $C$).]
$$\sum_i \alpha_i y_i = 0, \qquad 0 \le \alpha_i \le C$$
Sparsity
The dual solution is often very sparse. This allows optimization to be performed efficiently, using a “working set” approach.
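In e1071, the fitted model exposes exactly these quantities; for the model m trained earlier (field names as documented in the package):

m$index  # indices of the support vectors among the training points
m$SV     # the support vectors themselves
m$coefs  # their coefficients alpha_i * y_i
m$rho    # negative intercept: e1071's decision function is w^T x - rho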
Kernels
$$f(\boldsymbol{x}) = \boldsymbol{w}^T\boldsymbol{x} + b, \qquad \boldsymbol{w} = \sum_i \alpha_i y_i \boldsymbol{x}_i$$
$$\Rightarrow\; f(\boldsymbol{x}) = \sum_i \alpha_i y_i\, \boldsymbol{x}_i^T\boldsymbol{x} + b$$
Replacing the inner product with a kernel function:
$$f(\boldsymbol{x}) = \sum_i \alpha_i y_i\, K(\boldsymbol{x}_i, \boldsymbol{x}) + b$$
Examples:
$$f(x) = w_1 x + w_2 x^2 + b \qquad\text{(quadratic features)}$$
$$f(\boldsymbol{x}) = \sum_i \alpha_i y_i \exp\!\big(-\|\boldsymbol{x}_i - \boldsymbol{x}\|^2\big) + b \qquad\text{(RBF kernel)}$$
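In e1071 the kernel trick is a one-argument change. A sketch on data that no linear classifier can separate (the ring-shaped dataset is constructed purely for illustration):

library(e1071)

# Inner blob of one class surrounded by a ring of the other
set.seed(1)
r <- c(runif(50, 0, 1), runif(50, 2, 3))  # radii
theta <- runif(100, 0, 2 * pi)            # angles
X <- cbind(r * cos(theta), r * sin(theta))
y <- factor(rep(c(1, -1), each = 50))

m_lin <- svm(X, y, kernel = 'linear')
m_rbf <- svm(X, y, kernel = 'radial', gamma = 1)  # K(u, v) = exp(-gamma * |u - v|^2)
mean(predict(m_lin, X) == y)  # mediocre: no separating hyperplane exists
mean(predict(m_rbf, X) == y)  # close to 1: the RBF kernel handles the ring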
Quiz
SVM is a __________ linear classifier.
Margin maximization can be achieved via
minimization of ______________.
SVM uses _____ loss and _______
regularization.
Besides hinge loss I also know ____ loss and
___ loss.
SVM in both primal and dual form is solved
using ________ programming.
Quiz
In primal formulation we solve for parameter
vector ___. In dual formulation we solve for
___ instead.
_____ form of SVM is typically sparse.
Support vectors are those training points for
which _______.
The relation between primal and dual variables is: $\_\_\_ = \sum_i \_\_\_\_\_\_$.
A Kernel is a generalization of _____ product.