Transcript of lecture slides: Support Vector Machines (aarti/Class/10601/slides/svm_11_22_2011.pdf, 2011-11-22)
Support Vector Machines
Aarti Singh

Machine Learning 10-601, Nov 22, 2011
SVMs reminder
min_{w,b,ξ} w.w + C Σj ξj
s.t. (w.xj + b) yj ≥ 1 − ξj  ∀j
     ξj ≥ 0  ∀j

The C Σj ξj term is the hinge loss; w.w is the regularization term.

Soft margin approach – essentially a constrained optimization problem!
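The soft-margin objective above is just regularized hinge-loss minimization, so before turning to the constrained (QP) view, here is a minimal subgradient-descent sketch of it. This is illustrative only; the toy data, learning rate, and epoch count are made up, and it is not the QP approach developed on the following slides.

```python
import numpy as np

def train_soft_margin_svm(X, y, C=1.0, lr=0.01, epochs=200):
    """Minimize w.w + C * sum_j max(0, 1 - y_j (w.x_j + b)) by subgradient descent."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        viol = margins < 1                      # points with nonzero slack xi_j
        gw = 2 * w - C * (y[viol, None] * X[viol]).sum(axis=0)   # subgradient in w
        gb = -C * y[viol].sum()                                  # subgradient in b
        w -= lr * gw
        b -= lr * gb
    return w, b

# Toy linearly separable data
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = train_soft_margin_svm(X, y)
assert np.all(np.sign(X @ w + b) == y)   # all training points on the correct side
```

Subgradient descent scales well but gives no dual variables; the Lagrangian machinery below is what exposes the support vectors.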
Constrained Optimization

Primal problem:

f* = min_x x²  s.t. x ≥ b  ≡  min_x max_{α≥0} [x² − α(x − b)]

Here L(x, α) = x² − α(x − b) is the Lagrangian, and α is the Lagrange multiplier.

Dual problem:

d* = max_{α≥0} min_x [x² − α(x − b)]

In general, d* ≤ f*. When is d* = f*?
Constrained Optimization

Primal problem: f* = min_x max_{α≥0} L(x, α), with x* = primal solution
Dual problem: d* = max_{α≥0} min_x L(x, α), with α* = dual solution

d* = f* if f is convex and x*, α* satisfy the KKT conditions:

∇L(x*, α*) = 0      (zero gradient)
x* ≥ b              (primal feasibility)
α* ≥ 0              (dual feasibility)
α*(x* − b) = 0      (complementary slackness)

⇒ If α* > 0, then x* = b, i.e. the constraint is effective.
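The four conditions can be checked numerically on the toy problem suggested by the Lagrangian above, min_x x² s.t. x ≥ b, whose closed-form solution is x* = max(0, b) and α* = max(0, 2b). This is a sketch added for illustration, not part of the original slides.

```python
import numpy as np

def solve_toy(b):
    """Closed-form solution (x*, alpha*) of min_x x^2 s.t. x >= b."""
    if b <= 0:
        return 0.0, 0.0       # unconstrained minimum is feasible: alpha* = 0
    return b, 2.0 * b         # constraint active: x* = b, alpha* = 2b

for b in [-1.0, 0.5, 2.0]:
    x, a = solve_toy(b)
    assert np.isclose(2 * x - a, 0.0)            # zero gradient: dL/dx = 2x - alpha
    assert x >= b                                # primal feasibility
    assert a >= 0                                # dual feasibility
    assert np.isclose(a * (x - b), 0.0)          # complementary slackness
    assert np.isclose(-a**2 / 4 + a * b, x**2)   # strong duality: d* = f*
```

The last line uses the dual function g(α) = min_x [x² − α(x − b)] = −α²/4 + αb, so the check confirms d* = f* for this convex problem.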
Constrained Optimization

α* = 0 ⇒ constraint is ineffective
α* = 1/2 > 0 ⇒ constraint is effective

Complementary slackness: α*(x* − 1) = 0
Dual SVM – linearly separable case
• Primal problem:  min_{w,b} w.w  s.t. (w.xj + b) yj ≥ 1  ∀j
• Lagrangian:  L(w, b, α) = w.w − Σj αj [(w.xj + b) yj − 1],  αj ≥ 0

w – weights on features
α – weights on training pts

αj > 0 ⇒ constraint is effective ⇒ (w.xj + b) yj = 1 ⇒ point j is a support vector!!
Dual SVM – linearly separable case
• Dual problem derivation: set the gradients of the Lagrangian to zero,
  ∂L/∂w = 2w − Σj αj yj xj = 0  ⇒  w = ½ Σj αj yj xj
  ∂L/∂b = −Σj αj yj = 0  ⇒  Σj αj yj = 0
and substitute back into L.

If we can solve for the αs (dual problem), then we have a solution for w (primal problem).
Dual SVM – linearly separable case
• Dual problem derivation: substituting w back into the Lagrangian gives
• Dual problem:
  max_α Σj αj − ¼ Σj Σk αj αk yj yk (xj.xk)
  s.t. αj ≥ 0 ∀j,  Σj αj yj = 0
  (with the primal written as min ½ w.w instead of min w.w, the ¼ becomes the usual ½)
Dual SVM – linearly separable case
• Dual problem:
  max_α Σj αj − ¼ Σj Σk αj αk yj yk (xj.xk)
  s.t. αj ≥ 0 ∀j,  Σj αj yj = 0

The dual problem is also a QP; the solution gives the αjs.

Use the support vectors to compute b: for any support vector k, (w.xk + b) yk = 1, i.e. w.xk + b = yk.
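As a sketch, the dual can be solved for toy data with a generic solver (here scipy's SLSQP, not a dedicated QP package). The data are made up, and the conventional ½‖w‖² primal scaling is used, which only rescales the αs by a constant relative to the w.w objective.

```python
import numpy as np
from scipy.optimize import minimize

# Toy separable data
X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -2.0], [-1.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n = len(y)
G = (y[:, None] * X) @ (y[:, None] * X).T      # G[j, k] = y_j y_k (x_j . x_k)

def neg_dual(a):
    # negative dual objective: -(sum_j a_j - 1/2 sum_jk a_j a_k y_j y_k x_j.x_k)
    return -(a.sum() - 0.5 * a @ G @ a)

cons = {"type": "eq", "fun": lambda a: a @ y}  # sum_j alpha_j y_j = 0
res = minimize(neg_dual, np.zeros(n), bounds=[(0, None)] * n, constraints=cons)
alpha = res.x

w = ((alpha * y)[:, None] * X).sum(axis=0)     # w = sum_j alpha_j y_j x_j
k = int(np.argmax(alpha))                      # index of a support vector
b = y[k] - X[k] @ w                            # from w.x_k + b = y_k
assert np.all(np.sign(X @ w + b) == y)         # training points classified correctly
```

Note how w and b are recovered from the αs exactly as the slide describes: w from the stationarity condition, b from any support vector's tight constraint.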
Dual SVM Interpretation: Sparsity

Decision boundary: w.x + b = 0

Only a few αjs can be non-zero: those where the constraint is tight, (w.xj + b) yj = 1.

Support vectors – training points j whose αjs are non-zero (in the figure, the points on the margin have αj > 0; all other points have αj = 0).
So why solve the dual SVM?
• There are quadratic programming algorithms that can solve the dual faster than the primal, especially in high dimensions (d ≫ n).

Recall: (w1, w2, …, wd, b) – d+1 primal variables; α1, α2, …, αn – n dual variables.
Dual SVM – non-separable case

• Primal problem:
  min_{w,b,ξ} w.w + C Σj ξj
  s.t. (w.xj + b) yj ≥ 1 − ξj ∀j,  ξj ≥ 0 ∀j
  (Lagrange multipliers: αj for the margin constraints, μj for ξj ≥ 0)

• Dual problem:
  max_α Σj αj − ¼ Σj Σk αj αk yj yk (xj.xk)
  s.t. 0 ≤ αj ≤ C ∀j,  Σj αj yj = 0
Dual SVM – non-separable case

The dual problem is also a QP; the solution gives the αjs.

The upper bound αj ≤ C comes from the multipliers on ξj ≥ 0 (stationarity in ξj gives αj = C − μj). Intuition: earlier, a violated constraint drove αi → ∞; now αi is capped at C.
So why solve the dual SVM?
• There are quadratic programming algorithms that can solve the dual faster than the primal, especially in high dimensions (d ≫ n).

Recall: (w1, w2, …, wd, b) – d+1 primal variables; α1, α2, …, αn – n dual variables.
• But, more importantly, the “kernel trick”!!!
What if data is not linearly separable?

Using non-linear features to get linear separation: map (x1, x2) to

Radius, r = √(x1² + x2²)
Angle, θ
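A quick numeric check of the idea, on synthetic concentric-class data with hypothetical radii: the radius feature alone already makes the classes separable by a threshold.

```python
import numpy as np

rng = np.random.default_rng(0)
# Class +1: inner disc (r < 1); class -1: outer ring (2 < r < 3)
t1, t2 = rng.uniform(0, 2 * np.pi, 40), rng.uniform(0, 2 * np.pi, 40)
r1, r2 = rng.uniform(0.0, 1.0, 40), rng.uniform(2.0, 3.0, 40)
X = np.vstack([np.c_[r1 * np.cos(t1), r1 * np.sin(t1)],
               np.c_[r2 * np.cos(t2), r2 * np.sin(t2)]])
y = np.r_[np.ones(40), -np.ones(40)]

# Non-linear feature: radius r = sqrt(x1^2 + x2^2)
r = np.sqrt((X ** 2).sum(axis=1))
# A single threshold on r separates the classes (impossible with a line in (x1, x2))
assert np.all(r[y == 1] < 1.5) and np.all(r[y == -1] > 1.5)
```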
What if data is not linearly separable?

Using non-linear features to get linear separation: map x to (x, x²).
What if data is not linearly separable?
Use features of features of features of features….

Φ(x) = (x1², x2², x1x2, …, exp(x1))

The feature space becomes really large very quickly!

Can we get by without having to write out the features explicitly?
Higher Order Polynomials
d – input features, m – degree of polynomial

Number of degree-m terms:

C(m + d − 1, m) = (m + d − 1)! / (m! (d − 1)!) ~ d^m

grows fast! m = 6, d = 100: about 1.6 billion terms.
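The count is easy to verify with a quick check of the slide's arithmetic:

```python
from math import comb, factorial

d, m = 100, 6
n_terms = comb(m + d - 1, m)    # number of degree-m monomials in d variables
assert n_terms == factorial(m + d - 1) // (factorial(m) * factorial(d - 1))
print(n_terms)                  # 1609344100, about 1.6 billion
```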
The dual formulation only depends on dot-products, not on w!

Φ(x) – high-dimensional feature space, but we never need it explicitly, as long as we can compute the dot product Φ(x).Φ(z) fast using some kernel K.
Dot Product of Polynomials
m = 1:  Φ(x).Φ(z) = x.z
m = 2:  with Φ(x) = (x1², x2², √2 x1x2),  Φ(x).Φ(z) = (x.z)²
General m:  Φ(x).Φ(z) = (x.z)^m = K(x, z)

Don't store high-dimensional features – only evaluate dot-products with kernels.
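Verifying the m = 2 case in code for 2-D inputs; the √2 coefficient on the cross term is what makes the explicit map match the kernel exactly.

```python
import numpy as np

def phi2(v):
    """Explicit degree-2 feature map for 2-D input: (x1^2, x2^2, sqrt(2) x1 x2)."""
    x1, x2 = v
    return np.array([x1 ** 2, x2 ** 2, np.sqrt(2) * x1 * x2])

x, z = np.array([1.0, 3.0]), np.array([2.0, -1.0])
# The 3-D dot product equals the kernel evaluated directly on the 2-D inputs
assert np.isclose(phi2(x) @ phi2(z), (x @ z) ** 2)   # both equal 1.0 here
```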
Finally: The Kernel Trick!
• Never represent features explicitly – compute dot products in closed form
• Constant-time high-dimensional dot-products for many classes of features
Common Kernels
• Polynomials of degree d:  K(x, z) = (x.z)^d
• Polynomials of degree up to d:  K(x, z) = (1 + x.z)^d
• Gaussian/Radial kernels (polynomials of all orders – recall the series expansion):  K(x, z) = exp(−‖x − z‖² / 2σ²)
• Sigmoid:  K(x, z) = tanh(η x.z + ν)
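In code, with the usual parameterizations (the slides' exact parameter names are not shown, so σ, η, ν here are conventional choices):

```python
import numpy as np

def poly_kernel(x, z, deg):          # polynomials of degree exactly deg
    return (x @ z) ** deg

def poly_upto_kernel(x, z, deg):     # polynomials of degree up to deg
    return (1 + x @ z) ** deg

def gaussian_kernel(x, z, sigma):    # Gaussian / RBF
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

def sigmoid_kernel(x, z, eta, nu):   # sigmoid (not a valid kernel for all eta, nu)
    return np.tanh(eta * (x @ z) + nu)

x, z = np.array([1.0, 2.0]), np.array([0.5, -1.0])
assert np.isclose(gaussian_kernel(x, x, 1.0), 1.0)       # K(x, x) = 1 for the Gaussian
assert np.isclose(poly_upto_kernel(x, z, 1), 1 + x @ z)  # degree 1 reduces to 1 + x.z
```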
Which Functions Can Be Kernels?

• Not all functions – for some definitions of K(x1, x2) there is no corresponding projection Φ(x)
• There is nice theory on this (Mercer's condition: K must produce positive semi-definite kernel matrices), including how to construct new kernels from existing ones
• Initially kernels were defined over data points in Euclidean space, but more recently over strings, over trees, over graphs, …
Overfitting

• Huge feature space with kernels – what about overfitting???
  – Maximizing margin leads to a sparse set of support vectors (the decision boundary is not too complicated)
  – Some interesting theory says that SVMs search for simple hypotheses with large margin
  – Often robust to overfitting
What about classification time?

• For a new input x, if we need to represent Φ(x), we are in trouble!
• Recall the classifier: sign(w.Φ(x) + b)
• Using kernels we are cool!
SVMs with Kernels
• Choose a set of features and a kernel function
• Solve the dual problem to obtain the support vector weights αi
• At classification time, compute Σi αi yi K(xi, x) + b and classify as its sign
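The three steps above can be sketched as follows, using a hypothetical support set with a linear kernel so the result can be checked against sign(w.x + b):

```python
import numpy as np

def svm_predict(x, sv_X, sv_y, sv_alpha, b, kernel):
    """Classify x as sign(sum_i alpha_i y_i K(x_i, x) + b), using support vectors only."""
    s = sum(a * yi * kernel(xi, x) for a, yi, xi in zip(sv_alpha, sv_y, sv_X))
    return np.sign(s + b)

linear = lambda u, v: u @ v
sv_X = np.array([[1.0, 1.0], [-1.0, -1.0]])
sv_y = np.array([1.0, -1.0])
sv_alpha = np.array([0.5, 0.5])
# Implied w = 0.5*(1,1) - 0.5*(-1,-1) = (1, 1), with b = 0
assert svm_predict(np.array([2.0, 0.5]), sv_X, sv_y, sv_alpha, 0.0, linear) == 1.0
assert svm_predict(np.array([-2.0, 0.5]), sv_X, sv_y, sv_alpha, 0.0, linear) == -1.0
```

Note that Φ(x) never appears: only kernel evaluations against the (few) support vectors are needed.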
SVM Decision Surface using Gaussian Kernel
Bishop Fig 7.2. Circled points are the support vectors: training examples with non-zero αj. Points are plotted in the original 2-D space; contour lines show constant values of the decision function.
SVM Soft Margin Decision Surface using Gaussian Kernel

Bishop Fig 7.4. Circled points are the support vectors: training examples with non-zero αj. Points are plotted in the original 2-D space; contour lines show constant values of the decision function.
SVMs vs. Kernel Regression

Differences:
• SVMs:
  – Learn weights αi (and bandwidth)
  – Often sparse solution
• KR:
  – Fixed "weights", learn bandwidth
  – Solution may not be sparse
  – Much simpler to implement
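For contrast, here is a Nadaraya-Watson kernel regression sketch: every training point gets a fixed kernel weight and only the bandwidth is tunable, so nothing is sparse. The 1-D data and bandwidth are assumed values for illustration.

```python
import numpy as np

def nw_regress(x_query, X_train, y_train, bandwidth):
    """Nadaraya-Watson kernel regression: weighted average with Gaussian weights."""
    w = np.exp(-((X_train - x_query) ** 2) / (2 * bandwidth ** 2))
    return (w * y_train).sum() / w.sum()   # every point contributes: no sparsity

X_train = np.linspace(0, 1, 50)
y_train = X_train ** 2
pred = nw_regress(0.5, X_train, y_train, bandwidth=0.05)
assert abs(pred - 0.25) < 0.02   # a smooth local average near the true value 0.5^2
```

Unlike the SVM, there is no optimization at training time at all, which is why it is much simpler to implement.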
SVMs vs. Logistic Regression

|  | SVMs | Logistic Regression |
| --- | --- | --- |
| Loss function | Hinge loss | Log-loss |
| High-dimensional features with kernels | Yes! | Yes! |
| Solution sparse | Often yes! | Almost always no! |
| Semantics of output | "Margin" | Real probabilities |
Kernels in Logistic Regression

• Define the weights in terms of the features: w = Σi αi Φ(xi)
• Derive a simple gradient descent rule on the αi
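A sketch of that rule: with w = Σi αi Φ(xi), predictions depend on the data only through the Gram matrix K, and the log-loss gradient with respect to α is K(p − y). The toy data, kernel width, and step size below are made up.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def kernel_logreg(K, y01, lr=0.1, epochs=500):
    """Gradient descent on alpha for logistic regression with w = sum_i alpha_i Phi(x_i).

    K is the n x n Gram matrix K[i, j] = <Phi(x_i), Phi(x_j)>; labels y01 are in {0, 1}.
    """
    alpha = np.zeros(K.shape[0])
    for _ in range(epochs):
        p = sigmoid(K @ alpha)                     # p_i = P(y_i = 1 | x_i)
        alpha -= lr * (K @ (p - y01)) / len(y01)   # chain rule through f = K alpha
    return alpha

X = np.array([[0.0], [0.2], [0.8], [1.0]])
y01 = np.array([0.0, 0.0, 1.0, 1.0])
K = np.exp(-(X - X.T) ** 2 / 0.5)                  # Gaussian Gram matrix
alpha = kernel_logreg(K, y01)
p = sigmoid(K @ alpha)
assert np.all((p > 0.5) == (y01 == 1))             # training points classified correctly
```

Consistent with the table on the next slide, the resulting α vector is dense: unlike the SVM dual, nothing forces most αi to zero.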
SVMs vs. Logistic Regression

|  | SVMs | Logistic Regression |
| --- | --- | --- |
| Loss function | Hinge loss | Log-loss |
| High-dimensional features with kernels | Yes! | Yes! |
| Solution sparse | Often yes! | Almost always no! |
| Semantics of output | "Margin" | Real probabilities |
Kernels: Key Points

• Many learning tasks are framed as optimization problems
• Primal and dual formulations of optimization problems
• The dual version is framed in terms of dot products between x's
• Kernel functions K(x, y) allow calculating dot products <Φ(x), Φ(y)> without bothering to project x into Φ(x)
• Leads to major efficiencies, and the ability to use very high-dimensional (virtual) feature spaces
What you need to know…
• Dual SVM formulation – how it's derived
• The kernel trick
• Common kernels
• Differences between SVMs and kernel regression
• Differences between SVMs and logistic regression
• Kernelized logistic regression
• Also, Kernel PCA, Kernel ICA, ...