Transcript of ATC Lecture by Tie-Yan Liu
SVM and Its Applications to Text Classification
Dr. Tie-Yan Liu
WSM Group, MSR Asia, 2006.3.23
-
Outline
A Brief History of SVM
SVM: A Large-Margin Classifier
  Linear SVM
  Kernel Trick
  Fast Implementation: SMO
SVM for Text Classification
  Multi-class Classification
  Multi-label Classification
  Our Hierarchical Classification Tool
-
History of SVM
SVM is inspired by statistical learning theory [3].
SVM was first introduced in 1992 [1].
SVM became popular because of its success in handwritten digit recognition:
  1.1% test error rate for SVM, the same as the error rate of a carefully constructed neural network, LeNet 4.
  See Section 5.11 in [2] or the discussion in [3] for details.
SVM is now regarded as an important example of kernel methods, arguably the hottest area in machine learning.
[1] B. E. Boser et al. A Training Algorithm for Optimal Margin Classifiers. Proceedings of the Fifth Annual Workshop on Computational Learning Theory, pp. 144-152, Pittsburgh, 1992.
[2] L. Bottou et al. Comparison of classifier methods: a case study in handwritten digit recognition. Proceedings of the 12th IAPR International Conference on Pattern Recognition, vol. 2, pp. 77-82, 1994.
[3] V. Vapnik. The Nature of Statistical Learning Theory. 2nd edition, Springer, 1999.
-
What is a Good Decision Boundary?
Consider a two-class, linearly separable classification problem.
Many decision boundaries are possible! The Perceptron algorithm can be used to find such a boundary.
Are all decision boundaries equally good?
[Figure: points from Class 1 and Class 2 with several candidate decision boundaries]
-
Examples of Bad Decision Boundaries
[Figure: two boundaries that separate Class 1 and Class 2 but pass very close to the training points of one class]
-
Large-margin Decision Boundary
The decision boundary should be as far away from the data of both classes as possible.
We should maximize the margin, m.
[Figure: Class 1 and Class 2 separated by a boundary with margin m]
-
Finding the Decision Boundary
Let {x1, ..., xn} be our data set and let yi ∈ {1, -1} be the class label of xi.
The decision boundary should classify all points correctly.
The decision boundary can be found by solving the following constrained optimization problem.
The Lagrangian of this optimization problem is given below.
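The formulas on this slide did not survive extraction; the standard hard-margin formulation and its Lagrangian, presumably what the slide shows, are:

```latex
% Hard-margin primal problem (standard form, reconstructed; not taken from the slide image)
\min_{\mathbf{w},\,b}\ \tfrac{1}{2}\|\mathbf{w}\|^2
\quad\text{s.t.}\quad y_i\bigl(\mathbf{w}^\top \mathbf{x}_i + b\bigr) \ge 1,\qquad i = 1,\dots,n

% Lagrangian with multipliers \alpha_i \ge 0
\mathcal{L}(\mathbf{w}, b, \boldsymbol{\alpha})
= \tfrac{1}{2}\|\mathbf{w}\|^2
- \sum_{i=1}^{n} \alpha_i \bigl[\, y_i(\mathbf{w}^\top \mathbf{x}_i + b) - 1 \,\bigr]
```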
-
The Dual Problem
By setting the derivatives of the Lagrangian to zero, the optimization problem can be rewritten in terms of the αi (the dual problem), as shown below.
This is a quadratic programming (QP) problem; a global maximum over the αi can always be found.
w can be recovered from the αi.
If the number of training examples is large, SVM training will be very slow, because the number of parameters αi in the dual problem is very large.
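The dual problem itself was lost in extraction; its standard form, together with the recovery of w, is presumably:

```latex
\max_{\boldsymbol{\alpha}}\ \sum_{i=1}^{n} \alpha_i
- \tfrac{1}{2} \sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j\, \mathbf{x}_i^\top \mathbf{x}_j
\quad\text{s.t.}\quad \alpha_i \ge 0,\quad \sum_{i=1}^{n} \alpha_i y_i = 0

\mathbf{w} = \sum_{i=1}^{n} \alpha_i y_i \mathbf{x}_i
```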
-
KKT Condition
The QP problem is solved when, for all i, the Karush-Kuhn-Tucker (KKT) conditions hold.
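The conditions themselves were lost in extraction; for the hard-margin problem above they are the standard ones:

```latex
\alpha_i \ge 0, \qquad
y_i\bigl(\mathbf{w}^\top \mathbf{x}_i + b\bigr) - 1 \ge 0, \qquad
\alpha_i \bigl[\, y_i(\mathbf{w}^\top \mathbf{x}_i + b) - 1 \,\bigr] = 0
```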
-
Characteristics of the Solution
The KKT conditions indicate that many of the αi are zero: w is a linear combination of a small number of data points.
The xi with non-zero αi are called support vectors (SVs).
The decision boundary is determined only by the SVs.
Let tj (j = 1, ..., s) be the indices of the s support vectors; then w can be written in terms of them.
For testing with a new data point z:
  Compute the decision value, and classify z as class 1 if the sum is positive, and as class 2 otherwise.
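In formulas (reconstructed; the slide's own equations were lost):

```latex
\mathbf{w} = \sum_{j=1}^{s} \alpha_{t_j} y_{t_j} \mathbf{x}_{t_j},
\qquad
f(\mathbf{z}) = \mathbf{w}^\top \mathbf{z} + b
= \sum_{j=1}^{s} \alpha_{t_j} y_{t_j}\, \mathbf{x}_{t_j}^\top \mathbf{z} + b
```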
-
A Geometrical Interpretation
[Figure: Class 1 and Class 2 with the separating hyperplane; only the points on the margin have non-zero multipliers (a1 = 0.8, a6 = 1.4, a8 = 0.6), while a2, a3, a4, a5, a7, a9, a10 = 0]
-
Non-linearly Separable Problems
We allow an error ξi in the classification.
[Figure: Class 1 and Class 2 overlap; misclassified points are allowed at a cost ξi]
-
Soft Margin Hyperplane
If we minimize Σi ξi, the ξi can be obtained as shown below.
The ξi are slack variables in the optimization; ξi = 0 if there is no error for xi, and Σi ξi is an upper bound on the number of errors.
We want to minimize (1/2)||w||² + C Σi ξi.
C: tradeoff parameter between error and margin.
The optimization problem becomes the following.
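The formulas were lost in extraction; the standard soft-margin slack and primal problem, presumably what the slide shows, are:

```latex
\xi_i = \max\bigl(0,\ 1 - y_i(\mathbf{w}^\top \mathbf{x}_i + b)\bigr)

\min_{\mathbf{w},\,b,\,\boldsymbol{\xi}}\ \tfrac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{n} \xi_i
\quad\text{s.t.}\quad y_i\bigl(\mathbf{w}^\top \mathbf{x}_i + b\bigr) \ge 1 - \xi_i,\quad \xi_i \ge 0
```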
-
The Optimization Problem
The dual of the problem is given below.
w is recovered from the αi and the data points, as before.
This is very similar to the optimization problem in the linearly separable case, except that there is now an upper bound C on the αi.
Once again, a QP solver can be used to find the αi.
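In standard form (reconstructed), the soft-margin dual is:

```latex
\max_{\boldsymbol{\alpha}}\ \sum_{i=1}^{n} \alpha_i
- \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j\, \mathbf{x}_i^\top \mathbf{x}_j
\quad\text{s.t.}\quad 0 \le \alpha_i \le C,\quad \sum_{i=1}^{n} \alpha_i y_i = 0,
\qquad
\mathbf{w} = \sum_{i=1}^{n} \alpha_i y_i \mathbf{x}_i
```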
-
Extension to Non-linear Decision Boundary
So far we have only considered large-margin classifiers with a linear decision boundary. How can we generalize them to be non-linear?
Key idea: transform xi to a higher-dimensional space to make life easier.
  Input space: the space where the points xi are located.
  Feature space: the space of f(xi) after the transformation.
Why transform?
  A linear operation in the feature space is equivalent to a non-linear operation in the input space.
  Classification can become easier with a proper transformation. In the XOR problem, for example, adding a new feature x1*x2 makes the problem linearly separable.
-
Transforming the Data
Computation in the feature space can be costly because it is high-dimensional; the feature space is typically infinite-dimensional!
The kernel trick comes to the rescue.
[Figure: the mapping f(.) takes points from the input space to the feature space]
-
The Kernel Trick
Recall the SVM optimization problem: the data points only appear as inner products.
As long as we can calculate the inner product in the feature space, we do not need the mapping explicitly.
Many common geometric operations (angles, distances) can be expressed by inner products.
Define the kernel function K by K(xi, xj) = f(xi)·f(xj).
-
An Example for f(.) and K(.,.)
Suppose f(.) is given as the mapping below.
An inner product in the feature space can then be computed directly from the inner product in the input space.
So, if we define the kernel function accordingly, there is no need to carry out f(.) explicitly.
This use of a kernel function to avoid carrying out f(.) explicitly is known as the kernel trick.
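The specific mapping and kernel on this slide were lost in extraction; a typical degree-2 example of this kind (an assumption, not necessarily the slide's own) is:

```latex
f(x_1, x_2) = \bigl(1,\ \sqrt{2}\,x_1,\ \sqrt{2}\,x_2,\ x_1^2,\ x_2^2,\ \sqrt{2}\,x_1 x_2\bigr)

\langle f(\mathbf{x}), f(\mathbf{y}) \rangle
= \bigl(1 + x_1 y_1 + x_2 y_2\bigr)^2
=: K(\mathbf{x}, \mathbf{y})
```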
-
Kernel Functions
In practical use of SVM, only the kernel function (and not f(.)) is specified.
The kernel function can be thought of as a similarity measure between the input objects.
Not every similarity measure can be used as a kernel function, however. Mercer's condition states that any positive semi-definite kernel K(x, y) can be expressed as a dot product in a high-dimensional space.
-
Examples of Kernel Functions
Polynomial kernel with degree d
Radial basis function (RBF) kernel with width σ
  Closely related to radial basis function neural networks
Sigmoid kernel with parameters κ and θ
  It does not satisfy the Mercer condition for all κ and θ
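The kernel formulas were lost in extraction; their standard forms are:

```latex
K(\mathbf{x}, \mathbf{y}) = \bigl(\mathbf{x}^\top \mathbf{y} + 1\bigr)^{d}
\qquad \text{(polynomial, degree } d)

K(\mathbf{x}, \mathbf{y}) = \exp\!\left(-\frac{\|\mathbf{x}-\mathbf{y}\|^2}{2\sigma^2}\right)
\qquad \text{(RBF, width } \sigma)

K(\mathbf{x}, \mathbf{y}) = \tanh\!\bigl(\kappa\,\mathbf{x}^\top \mathbf{y} + \theta\bigr)
\qquad \text{(sigmoid)}
```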
-
Modification Due to Kernel Function
Change all inner products to kernel functions.
For training:
  Original: the dual objective uses the inner product xi·xj.
  With kernel function: the inner product is replaced by K(xi, xj).
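Concretely, the kernelized training problem (standard form, reconstructed) is:

```latex
\max_{\boldsymbol{\alpha}}\ \sum_{i} \alpha_i
- \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j\, K(\mathbf{x}_i, \mathbf{x}_j)
\quad\text{s.t.}\quad 0 \le \alpha_i \le C,\quad \sum_{i} \alpha_i y_i = 0
```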
-
Modification Due to Kernel Function
For testing, the new data point z is classified as class 1 if f(z) > 0, and as class 2 if f(z) < 0, where f is the kernelized decision function.
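The kernelized decision function (standard form, reconstructed) is:

```latex
f(\mathbf{z}) = \sum_{j=1}^{s} \alpha_{t_j} y_{t_j}\, K(\mathbf{x}_{t_j}, \mathbf{z}) + b
```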
-
Why SVM Works?
The feature space is often very high-dimensional. Why don't we suffer from the curse of dimensionality?
A classifier in a high-dimensional space has many parameters and is hard to estimate.
Vapnik argues that the fundamental problem is not the number of parameters to be estimated; rather, the problem is the flexibility of the classifier.
Typically, a classifier with many parameters is very flexible, but there are also exceptions:
  Let xi = 10^-i, where i ranges from 1 to n. The classic one-parameter classifier sign(sin(θx)) can classify all the xi correctly for every possible combination of class labels on the xi.
  This one-parameter classifier is therefore very flexible.
-
Why SVM Works?
Vapnik argues that the flexibility of a classifier should not be characterized by the number of parameters, but by the capacity of the classifier.
This is formalized by the VC-dimension of a classifier.
The addition of ||w||² has the effect of restricting the VC-dimension of the classifier in the feature space.
The SVM objective can also be justified by structural risk minimization: the empirical risk (training error), plus a term related to the generalization ability of the classifier, is minimized.
Another view: the SVM loss function is analogous to ridge regression. The term ||w||² shrinks the parameters towards zero to avoid overfitting.
-
Choosing the Kernel Function
Probably the trickiest part of using SVM.
The kernel function is important because it creates the kernel matrix, which summarizes all the data.
Many principles have been proposed (diffusion kernel, Fisher kernel, string kernel, ...).
There is even research on estimating the kernel matrix from the available information.
In practice, a low-degree polynomial kernel or an RBF kernel with a reasonable width is a good initial try for most applications.
It has been said that for text classification the linear kernel is the best choice, because the feature dimension is already high enough.
-
Strengths and Weaknesses of SVM
Strengths
  Training is relatively easy: no local optima, unlike in neural networks.
  It scales relatively well to high-dimensional data.
  The tradeoff between classifier complexity and error can be controlled explicitly.
  Non-traditional data such as strings and trees can be used as input to SVM, instead of feature vectors.
  By performing logistic regression (a sigmoid fit) on the SVM outputs of a set of data, SVM outputs can be mapped to probabilities.
Weaknesses
  Need to choose a good kernel function.
-
Summary: Steps for Classification
Prepare the pattern matrix.
Select the kernel function to use.
Select the parameters of the kernel function and the value of C.
  You can use the values suggested by the SVM software, or you can set apart a validation set to determine the values of the parameters.
Execute the training algorithm and obtain the αi.
Unseen data can be classified using the αi and the support vectors (see the sketch after this list).
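As a concrete illustration of these steps, here is a minimal sketch using scikit-learn's SVC (an assumption: the lecture does not prescribe this library, and the synthetic data and parameter grid below are placeholders):

```python
# Minimal sketch of the steps above using scikit-learn (illustrative only;
# the dataset, kernel choice, and parameter grid are placeholder assumptions).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC

# 1. Prepare the pattern matrix (here: a synthetic two-class problem).
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 2-3. Select the kernel and choose its parameters and C on validation folds.
param_grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.01, 0.1]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=3)

# 4. Execute the training algorithm (this solves the dual QP and finds the alphas).
search.fit(X_train, y_train)

# 5. Classify unseen data using the learned support vectors and coefficients.
print("best params:", search.best_params_)
print("test accuracy:", search.best_estimator_.score(X_test, y_test))
```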
-
Fast SVM Implementations
SMO: Sequential Minimal Optimization
SVM-Light
LibSVM
BSVM
-
SMO: Sequential Minimal Optimization
Key idea: divide the large QP problem of SVM into a series of smallest-possible QP problems, which can be solved analytically. This avoids using a time-consuming numerical QP solver in the inner loop (a kind of SQP method).
Space complexity: O(n).
Since the QP is greatly simplified, the most time-consuming part of SMO is the evaluation of the decision function; therefore it is very fast for linear SVM and sparse data.
-
SMO
At each step, SMO chooses two Lagrange multipliers to jointly optimize, finds the optimal values for these multipliers, and updates the SVM to reflect the new optimal values.
Three components:
  An analytic method to solve for the two Lagrange multipliers
  A heuristic for choosing which multipliers to optimize
  A method for computing b at each step, so that the KKT conditions are fulfilled for both of the two examples
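For reference, the analytic update for the two chosen multipliers, from Platt's SMO paper (not spelled out on the slide), is:

```latex
\eta = K(\mathbf{x}_1,\mathbf{x}_1) + K(\mathbf{x}_2,\mathbf{x}_2) - 2K(\mathbf{x}_1,\mathbf{x}_2),
\qquad
\alpha_2^{\text{new}} = \alpha_2 + \frac{y_2\,(E_1 - E_2)}{\eta}

\alpha_2^{\text{clipped}} = \min\bigl(H,\ \max(L,\ \alpha_2^{\text{new}})\bigr),
\qquad
\alpha_1^{\text{new}} = \alpha_1 + y_1 y_2\,\bigl(\alpha_2 - \alpha_2^{\text{clipped}}\bigr)
```

Here Ei = f(xi) - yi is the prediction error on example i, and [L, H] is the box on α2 implied by 0 ≤ αi ≤ C and Σi αi yi = 0.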
-
Choosing Which Multipliers to Optimize
First multiplier:
  Iterate over the entire training set, and find an example that violates the KKT conditions.
Second multiplier:
  Maximize the size of the step taken during the joint optimization, i.e. maximize |E1 - E2|, where Ei is the error on the i-th example.
-
SVM for Text Classification
-
Text Categorization
Typical features:
  Term frequency
  Inverse document frequency
TC is a typical multi-class, multi-label classification problem.
SVM, with some additional heuristics, has been regarded as one of the best classification schemes for text data, based on many benchmark evaluations.
TC is a high-dimensional, sparse problem, so SMO is a very good choice in this case (a small pipeline sketch follows).
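A minimal sketch of such a pipeline with scikit-learn, assuming a tiny in-memory corpus (the documents, labels, and C value are placeholders):

```python
# TF-IDF features + linear SVM for text categorization (illustrative sketch;
# the tiny corpus and labels below are placeholder assumptions).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

docs = ["the match ended in a draw",            # sports
        "the striker scored twice",             # sports
        "shares fell after the earnings call",  # finance
        "the central bank raised rates"]        # finance
labels = ["sports", "sports", "finance", "finance"]

# Term frequency * inverse document frequency features (high-dimensional, sparse).
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

# A linear kernel usually suffices for text: the feature space is already high-dimensional.
clf = LinearSVC(C=1.0)
clf.fit(X, labels)

print(clf.predict(vectorizer.transform(["the goalkeeper saved a penalty"])))
```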
-
Multi-Class SVM Classification
1-vs-rest
1-vs-1
  MaxWin
DB2
Error Correcting Output Coding
K-class SVM
-
1-vs-rest
For each class C, train a binary classifier to distinguish C from its complement, not-C.
For an unseen sample, use the binary classifier with the highest confidence score for the final decision (see the sketch below).
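A minimal sketch of the 1-vs-rest decision rule with scikit-learn on synthetic data (the data is a placeholder; the explicit wrapper simply makes the per-class scores visible):

```python
# 1-vs-rest: one binary SVM per class, predict the class whose classifier
# gives the highest confidence score (illustrative sketch, synthetic data).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=300, n_features=10, n_informative=6,
                           n_classes=3, random_state=0)
ovr = OneVsRestClassifier(LinearSVC()).fit(X, y)

scores = ovr.decision_function(X[:5])   # one confidence score per class
print(np.argmax(scores, axis=1))        # class with the highest score
print(ovr.predict(X[:5]))               # same decision, made by the wrapper
```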
-
1-vs-1
Train C(N, 2) = N(N-1)/2 classifiers, each distinguishing one class from another.
Pairwise combination strategies:
  MaxWin (C(N, 2) tests)
  Error-correcting output codes
  DAG: Pachinko machine (N tests)
A sketch of the pairwise scheme follows.
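For comparison, scikit-learn's SVC trains the pairwise classifiers internally; a minimal sketch on synthetic data that exposes the C(N, 2) pairwise scores:

```python
# 1-vs-1: SVC trains a classifier for every pair of classes and combines
# them by voting (MaxWin). Illustrative sketch on synthetic data.
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10, n_informative=6,
                           n_classes=4, random_state=0)
clf = SVC(kernel="linear", decision_function_shape="ovo").fit(X, y)

# With 4 classes there are C(4, 2) = 6 pairwise classifiers, hence 6 scores per sample.
print(clf.decision_function(X[:1]).shape)   # (1, 6)
print(clf.predict(X[:2]))
```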
-
Error Correcting Output Coding
Code matrix M (N x K): N classes, K classifiers.
Decoding by Hamming distance: the class Ci whose code word has minimum error (distance) to the classifier outputs wins (a decoding sketch follows the tables).
Two example code matrices for 4 classes and 6 pairwise classifiers:

M    1:2  1:3  1:4  2:3  2:4  3:4
1     1    1    1    0    0    0
2    -1    0    0    1    1    0
3     0   -1    0   -1    0    1
4     0    0   -1    0   -1   -1

M    1,2  1,3  1,4  2,3  2,4  3,4
1     1    1    1   -1   -1   -1
2     1   -1   -1    1    1   -1
3    -1    1   -1    1   -1    1
4    -1   -1    1   -1    1    1
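A minimal sketch of the decoding step, assuming the dense code matrix above and a made-up vector of classifier outputs:

```python
# ECOC decoding: pick the class whose code word is closest (in Hamming
# distance) to the observed classifier outputs. Outputs below are made up.
import numpy as np

# Rows = classes 1..4, columns = the 6 binary classifiers (dense matrix above).
M = np.array([[ 1,  1,  1, -1, -1, -1],
              [ 1, -1, -1,  1,  1, -1],
              [-1,  1, -1,  1, -1,  1],
              [-1, -1,  1, -1,  1,  1]])

outputs = np.array([1, -1, -1, 1, -1, -1])   # hypothetical +/-1 classifier outputs

hamming = np.sum(M != outputs, axis=1)       # disagreements with each code word
print("distances:", hamming)
print("predicted class:", np.argmin(hamming) + 1)
```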
-
Intransitivity of DAG
For classes C1, C2, C3: if C1 ≻ C2 and C2 ≻ C3 together imply C1 ≻ C3, we say the pairwise preference ≻ is transitive. The pairwise SVM decisions are not guaranteed to be transitive.
[Figure: three DAGs over C1, C2, C3 in which the pairwise tests 1~2, 1~3, 2~3 are evaluated in different orders, yielding the class orderings 1 2 3, 1 3 2, and 2 1 3]
-
Divided-by-2 (DB2)
Hierarchically divide the data into two subsets until every subset consists of only one class.
-
Divided-by-2 (DB2)
Data partitioning criterion: group the classes such that the resulting subsets have the largest margin.
Trade-off: use clustering methods instead
  k-means: use the mean of each class
  Balanced subsets: minimize the difference in sample numbers
-
K-class SVM
Change the loss function and constraints so that all K classes are handled in a single optimization problem (one common formulation is sketched below).
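The slide's formulation was lost in extraction; one common single-machine formulation of this kind (in the style of Weston and Watkins, and of Bredensteiner and Bennett cited in the references) is:

```latex
\min_{\{\mathbf{w}_m\},\{b_m\},\boldsymbol{\xi}}\ \ \tfrac{1}{2} \sum_{m=1}^{K} \|\mathbf{w}_m\|^2
+ C \sum_{i=1}^{n} \sum_{m \ne y_i} \xi_i^{m}

\text{s.t.}\quad \mathbf{w}_{y_i}^\top \mathbf{x}_i + b_{y_i}
\ \ge\ \mathbf{w}_m^\top \mathbf{x}_i + b_m + 2 - \xi_i^{m},
\qquad \xi_i^{m} \ge 0,\quad m \ne y_i
```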
-
Multi-label SVM Classification
Where do multi-label problems come from?
  Whole-vs-part relationships
  Shared concepts
-
Whole-vs-part
Common for parent-child relationships.
Add an "Other" category, and do binary classification to distinguish the child from the "Other" category.
Since the classification boundary is non-linear, kernel methods may be more effective.
-
Shared concepts: Training
Mode-S: label each multi-label example with the class to which it most likely belongs, by some perhaps subjective criterion.
Mode-N: consider the multi-label data as a new class.
Mode-X: use each multi-label example more than once, as a positive example of each of the classes to which it belongs.
-
Shared concepts: Testing
P-cut: label each test example with all of the classes whose SVM scores are positive. If no score is positive, assign the example to the class with the top score.
S-cut: train a threshold for each class by cross-validation, and label each test example with all of the classes whose scores exceed their thresholds.
R-cut: for any given test instance, always assign the r labels with the highest confidence scores; r can be learned from the training data.
A small sketch of the three strategies follows.
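A minimal sketch of the three strategies on a made-up matrix of per-class SVM scores (scores, thresholds, and r are placeholder assumptions):

```python
# P-cut, S-cut, and R-cut on a made-up score matrix (rows = documents,
# columns = classes). Scores and thresholds are placeholder assumptions.
import numpy as np

scores = np.array([[ 0.7, -0.2,  0.1],
                   [-0.5, -0.1, -0.3]])

# P-cut: take every class with a positive score; fall back to the top class.
p_cut = [np.nonzero(row > 0)[0] if (row > 0).any() else [np.argmax(row)]
         for row in scores]

# S-cut: per-class thresholds (learned by cross-validation in practice).
thresholds = np.array([0.0, -0.15, 0.2])
s_cut = [np.nonzero(row > thresholds)[0] for row in scores]

# R-cut: always take the r highest-scoring classes (r learned from training data).
r = 2
r_cut = [np.argsort(row)[::-1][:r] for row in scores]

print(p_cut, s_cut, r_cut, sep="\n")
```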
-
Evaluation Criteria
Micro-F1:
  Measures the overall classification accuracy (more consistent with the practical application scenario).
Macro-F1:
  Measures the classification accuracy at the category level; it can reflect the classifier's capability of dealing with rare categories.
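For reference, the standard definitions (not shown on the slide) are:

```latex
P = \frac{\sum_c TP_c}{\sum_c (TP_c + FP_c)},\qquad
R = \frac{\sum_c TP_c}{\sum_c (TP_c + FN_c)},\qquad
\text{Micro-F1} = \frac{2PR}{P + R}

\text{Macro-F1} = \frac{1}{|C|} \sum_{c \in C} \frac{2 P_c R_c}{P_c + R_c},
\qquad P_c, R_c:\ \text{per-category precision and recall}
```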
-
References
Martin Law. A Simple Introduction to Support Vector Machines.
Bredensteiner, E. J., and Bennett, K. P. Multicategory Classification by Support Vector Machines. Computational Optimization and Applications, 53-79, 1999.
Dumais, S., and Chen, H. Hierarchical classification of Web content. In Proc. SIGIR, 256-263, 2000.
Platt, J. Fast training of support vector machines using sequential minimal optimization. Advances in Kernel Methods - Support Vector Learning, 185-208, MIT Press, Cambridge, MA, 1999.
Yang, Y., Zhang, J., and Kisiel, B. A scalability analysis of classifiers in text categorization. SIGIR, 96-103, 2003.
Yang, Y. A study of thresholding strategies for text categorization. SIGIR, 137-145, 2001.
Tie-Yan Liu, Yiming Yang, Hao Wan, et al. Support Vector Machines Classification with Very Large Scale Taxonomy. SIGKDD Explorations, Special Issue on Text Mining and Natural Language Processing, vol. 7, issue 1, pp. 36-43, 2005.
-
Thanks
http://research.microsoft.com/users/tyliu
mailto:[email protected]