Transcript of "Introduction to Machine Learning and Soft Computing"
Introduction to Machine Learning and Soft Computing
謝哲光
Jer-Guang Hsieh
義守大學電機工程系
Department of Electrical Engineering, I-Shou University
Kaohsiung, Taiwan 840
Contents
Introduction
Single-layer Neural Networks
Linear Classification
Linear Regression
Kernel
Multi-layer Neural Networks
Nonlinear Classification
Nonlinear Regression
Model Selection
GA-based Frameworks
PSO-based Frameworks
Conclusion
Epilogue
Introduction
^^ Almost all of science is fitting models to data.
^^ The first step in the mathematical modeling of a system under consideration is to use first principles, e.g., Newton's laws in mechanics, Kirchhoff's laws in lumped electric circuits, or the various laws of thermodynamics.
^^ As a system becomes increasingly complex, the possibility of obtaining a precise description of it in quantitative terms decreases.
⇒ What we desire in practice is a reasonable yet tractable model.
^^ It may also happen that there is no analytic model for the system under consideration. This is particularly true in social science problems.
^^ However, in many real situations we do have some experimental data, either from measurement or from data collection by some means.
⇒ This raises the necessity of a theory of learning from examples, i.e., obtaining a good mathematical model from experimental data. This is what machine learning is all about.
Brief Sketch of Machine Learning and Soft Computing
[Diagram: machine learning and soft computing at the center, drawing on numerical optimization, approximation theory, statistical learning, linear algebra, probability, and chaos; techniques include ANN, FNN, CNN, GRBFN, SVM, WLM, GA, and PSO; application areas include intelligent control, regression, classification, management, bioinformatics, time series analysis, secure communication, diagnostics, filter design, and data compression.]
Machine Learning
Main Features:
^^ The field of machine learning and soft computing is vast, versatile, and fascinating.
^^ It can be viewed as a judicious mixture of computational intelligence and applied statistics.
Basic Belief:
There is a process that explains the data we observe. Though we do not know the details of the process underlying the generation of the data, we know that it is not completely random.
Learning Problem:
Find a general rule that explains experimental data, given only a sample of limited size.
Main Categories:
Machine Learning
  Supervised Learning
    Classification Learning (Pattern Recognition)
    Function Learning (Regression Estimation)
    Preference Learning
  Unsupervised Learning
  Reinforcement Learning
Techniques:
  Statistical Learning
  Artificial Neural Network (ANN)
  Radial Basis Function Network (RBFN)
  Fuzzy Neural Network (FNN)
  Support Vector Machine (SVM)
  Wilcoxon Learning Machine (WLM)
  Evolutionary Computation (EC)
Supervised Learning
Data: a sample of input-output pairs (training sample)
Task: find a deterministic function that maps any input to an output such that disagreement with future input-output observations is minimized.
Classification Learning (pattern recognition)
Features:
  Output: categorical variables (class labels); no ordering between the classes
Examples:
  Credit scoring of loan applicants in a bank
  Classification of handwritten letters and digits
  Optical character recognition
  Face recognition
  Speech recognition
  Classification of news in a news agency
Function Learning (regression estimation)
Features:
  Output: continuous variables
Examples:
  Prediction of stock market share values
  Weather forecasting
  Navigation of an autonomous car
Preference Learning
Features:
^^ Output space: order space; elements of the output space are called ranks. We can compare whether two elements are equal or, if not, which one is to be preferred.
Examples:
  Arrangement of WEB pages such that the most relevant pages (according to a query) are ranked highest.
Unsupervised Learning
Data: a sample of objects without associated target values
Task: extract some structure or regularity from the experimental data
Features:
  A concise description of the data could be a set of clusters or a probability density stating how likely it is to observe a certain object in the future.
Examples:
  Image and text segmentation
  Novelty detection in process control
  Grouping of customers in a company
  Alignment in molecular biology
Reinforcement Learning
Data: state-action-reward triples
Task: find a concise description of the data in the form of a strategy or policy (what to do) that maximizes the expected reward over time.
Features:
^^ The output of the system is a sequence of actions.
^^ A single action is not important; what matters is the strategy or policy, i.e., the sequence of correct actions needed to reach the goal.
^^ No optimal action exists in isolation in a given intermediate state; an action is good if it is part of a good policy.
^^ The learning algorithm should be able to assess the goodness of policies and identify a sequence of actions, learned from the past, that maximizes the expected reward over time.
Examples:
  Game playing, e.g., playing chess
  Robot navigation in search of a goal location
Supervised Learning
[Diagram: in the training phase, patterns (input data) and targets (output data) from the training dataset are fed to the learning system, yielding a trained system and a training error; in the testing phase, patterns from the testing dataset are fed to the trained system, and the predicted outputs are compared with the targets of the testing dataset to give the testing error (generalization error).]
Successful Learning
Quality of the training data: accuracy, integrity, consistency, no redundancy, timeliness, comprehensibility, completeness
Feature extraction: feature selection + feature composition
Model selection: choice of the learning machine and determination of the machine parameters
Algorithm utilized to train the learning machine
Training and Generalization
A small training error does not imply good generalization to previously unseen data.
A learning machine with too high a capacity typically leads to the very undesirable effect of overfitting.
A learning machine with too low a capacity typically leads to the very undesirable effect of underfitting.
Any learning machine learns almost nothing from too few training examples.
Evolutionary Computation (EC)
Techniques:
  Genetic Programming (GP)
  Evolution Strategy (ES)
  Evolutionary Programming (EP)
  Genetic Algorithm (GA)
  Particle Swarm Optimization (PSO)
Spirit of Soft Computing:
Law of Sufficiency: If a solution is good enough, fast enough, and cheap enough, then it is sufficient.
^^ In almost all real-world applications, we are looking for, and satisfied with, sufficient solutions.
^^ Hybrids of various soft-computing approaches with other computational intelligence tools such as neural networks are becoming more prevalent.
Brief History of Machine Learning and Soft Computing
Year(s) | Name(s) | Event
1936 | Fisher | Discriminant analysis
1943 | McCulloch, Pitts | First mathematical model of the artificial neuron
1958 | Rosenblatt | First model of a learning machine (perceptron) for classification; true beginning of the mathematical analysis of learning processes
1958 | Friedberg | Genetic Programming (GP)
1960 | Widrow, Hoff | Adaptive linear neuron (Adaline) for regression, using the delta learning rule
1962 | Novikoff | First (convergence) theorem about the perceptron
1962 | Holland | Genetic Algorithm (GA)
1963 | Tikhonov | Regularization method for solutions of ill-posed problems
1965 | Zadeh | Fuzzy mathematics
1965 | Rechenberg, Schwefel | Evolution Strategy (ES)
1966 | Fogel, Owens, Walsh | Evolutionary Programming (EP)
1969 | Minsky, Papert | Simple biologically motivated learning systems (perceptrons) are incapable of learning arbitrarily complex problems (negative result)
1971 | Vapnik, Chervonenkis | Statistical learning theory
1982 | Hopfield | Hopfield network
1982 | Vapnik | Introduction of regularization theory into machine learning
1986 | Rumelhart, Hinton, Williams, Le Cun | Error back-propagation algorithm (generalized delta learning rule) for multi-layer neural networks (direct generalization of perceptrons)
1988 | Chua, Yang | Cellular Neural Network (CNN)
1989 | Poggio, Girosi | Radial Basis Function Network (RBFN)
1989~1991 | Goldberg, Davis | Popularization of the genetic algorithm
1991 | Koza | Improvement of genetic programming
1992 | Vapnik | Support Vector Machine (SVM)
1995 | Kennedy, Eberhart | Particle Swarm Optimization (PSO)
A Simple Binary Classification Algorithm: Nearest Mean Classifier
[Figure: clusters of positive and negative training points with class means $c_+$ and $c_-$; the decision surface is the hyperplane with normal vector $w$ that perpendicularly bisects the line segment joining the two means.]
Let $X \subseteq \mathbb{R}^n$ and $Y := \{+1, -1\}$.
Data (training set): $S := \{(x_q, d_q)\}_{q=1}^{l} \subseteq X \times Y$
Basic idea:
Assign an unseen pattern to the class with the closer mean.
Step 1: First compute the means of the two classes. Define
$I_S^+ := \{q \in l : d_q = +1\}$, $I_S^- := \{j \in l : d_j = -1\}$,
$c_+ := \frac{1}{m_+} \sum_{q \in I_S^+} x_q$, $c_- := \frac{1}{m_-} \sum_{j \in I_S^-} x_j$,
$m_+$: number of examples with positive labels; $m_-$: number of examples with negative labels.
Step 2: Assign a new point $x \in X$ to the class whose mean is closest.
Derivation:
$\|x - c_-\|^2 - \|x - c_+\|^2 = \langle x - c_-, x - c_- \rangle - \langle x - c_+, x - c_+ \rangle = 2\left[\langle x, c_+ - c_- \rangle - \tfrac{1}{2}\left(\|c_+\|^2 - \|c_-\|^2\right)\right]$.
Define
$w := c_+ - c_-$, $b := -\tfrac{1}{2}\left(\|c_+\|^2 - \|c_-\|^2\right)$.
Decision function:
$g(x) = \operatorname{sgn}[\langle w, x \rangle + b]$, $x \in X$.
Discriminant function:
$f(x) := \langle w, x \rangle + b$.
Decision surface: a hyperplane in $\mathbb{R}^n$ with normal vector $w$ and bias $b$:
$f(x) = \langle w, x \rangle + b = 0$.
Discriminant function in terms of the input patterns:
$f(x) = \frac{1}{m_+} \sum_{q \in I_S^+} \langle x, x_q \rangle - \frac{1}{m_-} \sum_{j \in I_S^-} \langle x, x_j \rangle + b$,
where
$b = \frac{1}{2}\left[\frac{1}{m_-^2} \sum_{j, q \in I_S^-} \langle x_j, x_q \rangle - \frac{1}{m_+^2} \sum_{j, q \in I_S^+} \langle x_j, x_q \rangle\right]$.
^^ The discriminant function involves only inner products of the input data.
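As a concrete illustration, here is a minimal NumPy sketch of the nearest mean classifier; the function names and the toy data are our own additions, not part of the slides.

```python
import numpy as np

def nearest_mean_fit(X, d):
    """Compute (w, b) of the nearest mean classifier from patterns X (l x n) and labels d in {+1, -1}."""
    c_plus = X[d == +1].mean(axis=0)    # mean of the positive class
    c_minus = X[d == -1].mean(axis=0)   # mean of the negative class
    w = c_plus - c_minus
    b = -0.5 * (np.dot(c_plus, c_plus) - np.dot(c_minus, c_minus))
    return w, b

def nearest_mean_predict(X, w, b):
    """Decision function g(x) = sgn(<w, x> + b)."""
    return np.sign(X @ w + b)

# toy example (assumed data, for illustration only)
X = np.array([[2.0, 2.0], [3.0, 2.5], [-1.0, -1.5], [-2.0, -1.0]])
d = np.array([+1, +1, -1, -1])
w, b = nearest_mean_fit(X, d)
print(nearest_mean_predict(X, w, b))   # [ 1.  1. -1. -1.]
```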
Hyperplane
A hyperplane $H_{w,b}$ with normal vector $w$ and bias $b$:
$w_1 x_1 + w_2 x_2 + \dots + w_n x_n + b = \langle w, x \rangle + b = 0$,
$x := [x_1 \dots x_n]^T$, $w := [w_1 \dots w_n]^T \in \mathbb{R}^n$.
Define
$f_{w,b}(x) := w_1 x_1 + w_2 x_2 + \dots + w_n x_n + b = \langle w, x \rangle + b$.
⇒ $H_{w,b} := \{x \in \mathbb{R}^n : f_{w,b}(x) = \langle w, x \rangle + b = 0\}$
Define
$g_{w,b}(x) := \|w\|^{-1} f_{w,b}(x)$, $x \in \mathbb{R}^n$.
⇒ $H_{w,b} = \{x \in \mathbb{R}^n : g_{w,b}(x) = \langle \|w\|^{-1} w, x \rangle + \|w\|^{-1} b = 0\}$
^^ Note that $w$ is a normal vector perpendicular to $H_{w,b}$, while varying the value of $b$ moves the hyperplane parallel to itself.
^^ The hyperplane thus defined is an affine subspace (linear manifold) of dimension $n - 1$. It divides $\mathbb{R}^n$ into two half-spaces.
Linear Classification
Let $X \subseteq \mathbb{R}^n$ and $Y := \{+1, -1\}$.
Training examples: $S := \{(x_q, d_q)\}_{q=1}^{l} \subseteq X \times Y$
Definition: The training set $S$ is said to be linearly separable if there is a hyperplane that correctly classifies the training data.
Convention: For a given hyperplane $(w, b)$, recall
$f_{w,b}(x) := \langle w, x \rangle + b$, $x \in \mathbb{R}^n$.
If the hyperplane $(w, b)$ correctly classifies the training set, then, by convention, we assign
$d_i = +1$ if $f_{w,b}(x_i) \ge 0$; $d_i = -1$ if $f_{w,b}(x_i) < 0$.
Note: we desire "similar patterns ⇒ similar classes"! By the Cauchy-Schwarz inequality,
$|f_{w,b}(x_i) - f_{w,b}(x_j)| = |\langle w, x_i - x_j \rangle| \le \|w\| \cdot \|x_i - x_j\|$;
i.e., whenever two data points are close (small $\|x_i - x_j\|$), their difference in the real-valued output of a hypothesis is also small.
Definition:
^^ functional margin of $(x_q, d_q)$ w.r.t. $(w, b)$:
$\mu_q(w, b) := d_q \cdot [\langle w, x_q \rangle + b] = d_q \cdot f_{w,b}(x_q)$
^^ geometric margin of $(x_q, d_q)$ w.r.t. $(w, b)$:
$\eta_q(w, b) := d_q \cdot [\langle \|w\|^{-1} w, x_q \rangle + \|w\|^{-1} b] = d_q \cdot g_{w,b}(x_q)$
^^ $\mu_q > 0$ (or $\eta_q > 0$): correct classification of $(x_q, d_q)$
^^ In general, $\mu_q$ (or $\eta_q$) may be negative.
^^ functional margin of $(w, b)$ w.r.t. $S$:
$\mu_S(w, b) := \min_{q=1}^{l} \mu_q(w, b)$
^^ geometric margin of $(w, b)$ w.r.t. $S$:
$\eta_S(w, b) := \min_{q=1}^{l} \eta_q(w, b)$
^^ margin of a training set $S$: the maximum geometric margin over all hyperplanes:
$\gamma_S := \max_{w,b} \min_{q=1}^{l} d_q \cdot g_{w,b}(x_q) = \max_{w,b} \min_{q=1}^{l} d_q \cdot [\langle \|w\|^{-1} w, x_q \rangle + \|w\|^{-1} b]$
^^ A hyperplane realizing this maximum is called a maximal margin hyperplane or optimal hyperplane.
^^ The margin of a linearly separable training set is positive.
Robustness property of the maximal margin hyperplane
Fact: The training set $S$ is linearly separable if and only if there exist a vector $w^* \in \mathbb{R}^n$ with $\|w^*\| = 1$, a number $b^* \in \mathbb{R}$, and a positive number $\gamma > 0$ such that $d_q \cdot [\langle w^*, x_q \rangle + b^*] \ge \gamma > 0$ for all $q \in l$. In this case, we have $\gamma_S \ge \eta_S(w^*, b^*) \ge \gamma > 0$, i.e., the margin of the training set $S$ is at least $\gamma$.
[Figure: linearly separable training points with the maximal margin hyperplane and its margin.]
Single-layer Neural Networks
Perceptron: for classification
$x$: input of the network; $d$: desired output of the network; $u$: input to the neuron; $y$: output of the neuron; $e$: error;
$f_o(\cdot)$: activation function of the neuron (sign function or hard-limiting function)
Define
$z := \begin{bmatrix} x \\ 1 \end{bmatrix} \in \mathbb{R}^{n+1}$, $x_{n+1} = 1$, $\beta := \begin{bmatrix} w \\ b \end{bmatrix} \in \mathbb{R}^{n+1}$, $w_{n+1} = b$.
⇒
$u = w_1 x_1 + \dots + w_n x_n + b = \langle w, x \rangle + b = \langle \beta, z \rangle$,
$y = f_o(u) = \operatorname{sign}(u)$,
$e = d - y$.
Primal Form of Rosenblatt's Algorithm
Data: training set $S := \{(x_q, d_q)\}_{q=1}^{l} \subseteq X \times Y$; learning rate $\eta > 0$
Goal: $(w, b)$ defining a linear discriminant function that correctly classifies the training set
Step 1: $w_0 \leftarrow 0$; $b_0 \leftarrow 0$; $k \leftarrow 0$;
Step 2: Choose $R \ge U := \max_{q=1}^{l} \|x_q\|$;
Step 3:
repeat
  for $q = 1$ to $l$
    if $d_q \cdot [\langle w, x_q \rangle + b] \le 0$, then
      $w \leftarrow w + \eta d_q x_q$; $b \leftarrow b + \eta d_q R^2$; $k \leftarrow k + 1$;
    end if
  end for
until no misclassification within the for loop
return $k$, $(w_k, b_k)$; $k$: number of mistakes
^^ In case the training set is not linearly separable, the algorithm will not converge.
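A minimal NumPy sketch of the primal perceptron updates, following the steps above; the function name and the epoch cap max_epochs are our additions.

```python
import numpy as np

def perceptron_primal(X, d, eta=0.1, max_epochs=100):
    """Primal Rosenblatt algorithm: X is l x n, d has entries in {+1, -1}."""
    l, n = X.shape
    w, b, k = np.zeros(n), 0.0, 0
    R = np.max(np.linalg.norm(X, axis=1))      # R >= max_q ||x_q||
    for _ in range(max_epochs):                # cap the number of passes
        mistakes_this_pass = 0
        for q in range(l):
            if d[q] * (X[q] @ w + b) <= 0:     # misclassified (or on the boundary)
                w += eta * d[q] * X[q]
                b += eta * d[q] * R**2
                k += 1
                mistakes_this_pass += 1
        if mistakes_this_pass == 0:            # no misclassification in a full pass
            break
    return w, b, k
```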
Geometric Interpretation
In this figure, $x_q$ (of the positive class) is misclassified by the current linear classifier having normal vector $w_k$. The update step then changes $w_k$ into $w_{k+1} = w_k + \eta d_q x_q$ (with $\eta = 1$ in the figure), so that $d_q x_q$ "attracts" the hyperplane. After this step, the misclassified point $x_q$ is correctly classified. Thus, geometrically, the perceptron algorithm performs a walk through the primal parameter space, with each step made in the direction of decreasing training error.
Novikoff Theorem: Suppose $S$ is a nontrivial training set and there exist a vector $w^* \in \mathbb{R}^n$ with $\|w^*\| = 1$, a number $b^* \in \mathbb{R}$, and a positive number $\gamma > 0$ such that $d_q \cdot [\langle w^*, x_q \rangle + b^*] \ge \gamma > 0$ for all $q \in l$. Then the number of mistakes made by the on-line perceptron algorithm on the training set $S$ is at most $(2R/\gamma)^2$.
^^ The Novikoff Theorem was one of the first theoretical justifications of the idea that large margins yield better classifiers, here in terms of the number of mistakes made during learning.
Dual Form of Rosenblatt's Algorithm
In the primal form of Rosenblatt's algorithm starting from $w_0 = 0$, the final weight is of the form
$w = \sum_{q=1}^{l} \eta \alpha_q d_q x_q$,
where $\alpha_q$, $q \in l$, is the number of mistakes made when using $(x_q, d_q)$ as the training example. Then we have
$f_{w,b}(x) := \langle w, x \rangle + b = \sum_{j=1}^{l} \eta \alpha_j d_j \langle x_j, x \rangle + b$,
$\mu_q := d_q \cdot f_{w,b}(x_q) = d_q \cdot [\langle w, x_q \rangle + b] = d_q \cdot \left[\sum_{j=1}^{l} \eta \alpha_j d_j \langle x_j, x_q \rangle + b\right]$.
This means that the decision rule can be evaluated using just the inner products between the test point $x$ and the training points $x_j$, i.e., $\langle x_j, x \rangle$.
Rosenblatt's Algorithm (dual form)
Data: training set $S := \{(x_q, d_q)\}_{q=1}^{l} \subseteq X \times Y$; learning rate $\eta > 0$
Goal: $(\alpha, b)$ defining a linear discriminant function that correctly classifies the training set
Step 1: $\alpha \leftarrow 0$; $b \leftarrow 0$;
Step 2: Choose $R \ge U := \max_{q=1}^{l} \|x_q\|$;
Step 3:
repeat
  for $q = 1$ to $l$
    if $d_q \cdot \left[\sum_{j=1}^{l} \eta \alpha_j d_j \langle x_j, x_q \rangle + b\right] \le 0$, then
      $\alpha_q \leftarrow \alpha_q + 1$; $b \leftarrow b + \eta d_q R^2$;
    end if
  end for
until no misclassification within the for loop
return $(\alpha, b)$
^^ The training data enter the algorithm only through the entries of the Gram matrix $G := [\langle x_i, x_j \rangle] \in \mathbb{R}^{l \times l}$.
^^ In the preceding algorithm, the integer
$\|\alpha\|_1 := \alpha_1 + \alpha_2 + \dots + \alpha_l$
is equal to the number of mistakes. By the Novikoff Theorem, we have $\|\alpha\|_1 \le (2R/\gamma)^2$.
Linearly Inseparable Data
Three strategies:
^^ Nonlinearly transform the data to another space. If the transformed data are linearly separable in the new space, then we may apply the techniques for linearly separable data.
⇒ linear classifiers in the new space; nonlinear classifiers in the original space
^^ Allow some misclassifications of the original data, but remain in the original space.
⇒ linear classifiers in the original space
^^ Nonlinearly transform the data to another space and also allow some misclassifications of the transformed data.
⇒ linear classifiers in the new space; nonlinear classifiers in the original space
Rosenblatt's Algorithms for Nonlinear Classifiers
Example: (NXOR problem)
pattern (x1, x2) | label d
(0, 0) | +1
(0, 1) | -1
(1, 0) | -1
(1, 1) | +1
⇒ not linearly separable in the input space $\mathbb{R}^2$
[Figure: a nonlinear map φ takes the data, which are not linearly separable in the input space, to a feature space in which they are linearly separable.]
Define a nonlinear map $\phi: \mathbb{R}^2 \to \mathbb{R}^3$ as
$\phi(x_1, x_2) := [\phi_1(x_1, x_2),\ \phi_2(x_1, x_2),\ \phi_3(x_1, x_2)]^T = [x_1,\ x_2,\ (x_1 - x_2)^2]^T$.
⇒
transformed pattern (φ1, φ2, φ3) | label d
(0, 0, 0) | +1
(0, 1, 1) | -1
(1, 0, 1) | -1
(1, 1, 0) | +1
⇒ linearly separable in $\mathbb{R}^3$
infinitely many suitable discriminant functions:
$f(x) = w_1 \phi_1(x) + w_2 \phi_2(x) + w_3 \phi_3(x) + b = w_1 x_1 + w_2 x_2 + w_3 (x_1 - x_2)^2 + b$
^^ nonlinear discriminant functions in $\mathbb{R}^2$
[Figure: a two-layer network with inputs $x_1$, $x_2$, hidden units, a bias unit, and output $y = f_o(\cdot)$, realizing such a nonlinear discriminant function; a contour plot of one particular $f$ shows the decision boundary $f(x) = 0$, with bold arrows indicating the region where $f(x) > 0$.]
Two interesting observations:
^^ By nonlinearly transforming the linearly inseparable training examples to a feature space, the transformed training examples become linearly separable in the feature space, and the final discriminant function is a nonlinear function in the original input space.
^^ Instead of the original single-layer network, we now have a two-layer network with a hidden layer. One of the activation functions of the hidden neurons is nonlinear.
The whole idea of the preceding example is best seen from the dual representation of the discriminant function. Before doing this, let us state the following important definition.
Kernel
Definition: Let $(F, \langle \cdot, \cdot \rangle)$, called the feature space, be a real inner product space and $X \subseteq \mathbb{R}^n$. A kernel is a real-valued function on $X \times X$ such that
$K(x, z) := \langle \phi(x), \phi(z) \rangle$, $x, z \in X$,
where $\phi$, called the feature map, is a mapping from $X$ to $F$.
^^ The idea of a kernel generalizes the standard inner product in $\mathbb{R}^n$: taking $\phi = I$, the identity map, gives
$K(x, z) := \langle x, z \rangle = x^T z$.
^^ Very often, it is more practical to define the kernel function directly and then specify the corresponding feature map.
^^ Some popular kernels are, for $x, z \in \mathbb{R}^n$,
polynomial kernel:
$K(x, z) := (\langle x, z \rangle + c)^d = (x^T z + c)^d$, $c \ge 0$, $d \ge 2$;
Gaussian kernel:
$K(x, z) := \exp(-\sigma^{-2} \|x - z\|^2)$;
Mahalanobis kernel:
$K(x, z) := \exp[-\sigma_1^{-2} (x_1 - z_1)^2 - \dots - \sigma_n^{-2} (x_n - z_n)^2]$.
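The three kernels above are easy to write down directly; the following NumPy sketch (our own helper names) evaluates them for two pattern matrices.

```python
import numpy as np

def polynomial_kernel(X, Z, c=1.0, degree=2):
    """K(x, z) = (<x, z> + c)^d, computed for all pairs of rows of X and Z."""
    return (X @ Z.T + c) ** degree

def gaussian_kernel(X, Z, sigma=1.0):
    """K(x, z) = exp(-||x - z||^2 / sigma^2)."""
    sq_dists = np.sum((X[:, None, :] - Z[None, :, :]) ** 2, axis=2)
    return np.exp(-sq_dists / sigma**2)

def mahalanobis_kernel(X, Z, sigmas):
    """K(x, z) = exp(-sum_i (x_i - z_i)^2 / sigma_i^2), one sigma per input dimension."""
    diffs = (X[:, None, :] - Z[None, :, :]) ** 2 / np.asarray(sigmas) ** 2
    return np.exp(-np.sum(diffs, axis=2))
```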
Important observation: In the dual form of the Rosenblatt algorithms,
the training data enter the algorithm through $\langle x_j, x_i \rangle$;
the final discriminant function depends only upon $\langle x_j, x \rangle$;
⇒ map $x$ in the input space into $\phi(x)$ in the feature space
⇒ $\langle \phi(x_j), \phi(x_i) \rangle = K(x_j, x_i)$, $\langle \phi(x_j), \phi(x) \rangle = K(x_j, x)$.
Since $\|\phi(x_i)\|^2 = \langle \phi(x_i), \phi(x_i) \rangle = K(x_i, x_i)$,
⇒ choose $R \ge U := \max_{i=1}^{l} \sqrt{K(x_i, x_i)}$.
nonlinear discriminant function:
$f_{w,b}(x) = \sum_{j=1}^{l} \eta \alpha_j d_j \langle \phi(x_j), \phi(x) \rangle + b = \sum_{j=1}^{l} \eta \alpha_j d_j K(x_j, x) + b$,
and the functional margin of the qth example becomes
$\mu_q := d_q \cdot f_{w,b}(x_q) = d_q \cdot \left[\sum_{j=1}^{l} \eta \alpha_j d_j \langle \phi(x_j), \phi(x_q) \rangle + b\right] = d_q \cdot \left[\sum_{j=1}^{l} \eta \alpha_j d_j K(x_j, x_q) + b\right]$.
Nonlinear Rosenblatt's Algorithm
Data: training set $S := \{(x_q, d_q)\}_{q=1}^{l} \subseteq X \times Y$; learning rate $\eta > 0$
Goal: $(\alpha, b)$ defining a nonlinear discriminant function that correctly classifies the training set
Step 1: $\alpha \leftarrow 0$; $b \leftarrow 0$;
Step 2: Choose $R \ge U := \max_{q=1}^{l} \sqrt{K(x_q, x_q)}$;
Step 3:
repeat
  for $q = 1$ to $l$
    if $d_q \cdot \left[\sum_{j=1}^{l} \eta \alpha_j d_j K(x_j, x_q) + b\right] \le 0$, then
      $\alpha_q \leftarrow \alpha_q + 1$; $b \leftarrow b + \eta d_q R^2$;
    end if
  end for
until no misclassification within the for loop
return $(\alpha, b)$
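A minimal kernel-perceptron sketch in NumPy following the dual-form steps above; the Gaussian kernel, the epoch cap, and the toy usage on the NXOR data are our illustrative choices.

```python
import numpy as np

def kernel_perceptron(X, d, kernel, eta=0.1, max_epochs=100):
    """Nonlinear (dual) Rosenblatt algorithm. `kernel(X, Z)` returns the Gram matrix."""
    l = X.shape[0]
    G = kernel(X, X)                        # data enter only through the Gram matrix
    alpha, b = np.zeros(l), 0.0
    R2 = np.max(np.diag(G))                 # R^2 >= max_q K(x_q, x_q)
    for _ in range(max_epochs):
        mistakes = 0
        for q in range(l):
            if d[q] * (np.sum(eta * alpha * d * G[:, q]) + b) <= 0:
                alpha[q] += 1
                b += eta * d[q] * R2
                mistakes += 1
        if mistakes == 0:
            break
    return alpha, b

def kernel_predict(Xtr, d, alpha, b, kernel, Xte, eta=0.1):
    """Discriminant f(x) = sum_j eta*alpha_j*d_j*K(x_j, x) + b on test patterns."""
    return np.sum(eta * alpha[:, None] * d[:, None] * kernel(Xtr, Xte), axis=0) + b

# usage on the NXOR data with a Gaussian kernel (illustrative)
gauss = lambda X, Z: np.exp(-np.sum((X[:, None, :] - Z[None, :, :])**2, axis=2))
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
d = np.array([+1., -1., -1., +1.])
alpha, b = kernel_perceptron(X, d, gauss)
print(np.sign(kernel_predict(X, d, alpha, b, gauss, X)))   # [ 1. -1. -1.  1.]
```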
Example: (NXOR problem)
polynomial kernel: $K(x, z) = (x_1 z_1 + x_2 z_2 + c)^2$
[Figure: contour plots and decision boundaries of the resulting discriminant function for c = 1.5, c = 0.8, c = 0.2, and c = 0.0.]
Linear Regression
Let $X \subseteq \mathbb{R}^n$ and $Y \subseteq \mathbb{R}$.
Training examples: $S := \{(x_q, d_q)\}_{q=1}^{l} \subseteq X \times Y$
Linear regression: find a linear function $f$ that models the data: $f(x) = \langle w, x \rangle + b$.
[Figure: scatter plot of one-dimensional data with a fitted regression line.]
Least Squares Linear Regressor: Choose a $(w, b)$ that minimizes
$E(w, b) := \sum_{q=1}^{l} [d_q - \langle w, x_q \rangle - b]^2$.
Adaline: Adaptive linear neuron (for regression and classification)
$x$: input of the network; $d$: desired output of the network; $u$: input to the neuron; $y$: output of the neuron; $e$: error;
$f_o(\cdot)$: activation function of the neuron (identity function)
Define
$z := \begin{bmatrix} x \\ 1 \end{bmatrix} \in \mathbb{R}^{n+1}$, $x_{n+1} = 1$, $\beta := \begin{bmatrix} w \\ b \end{bmatrix} \in \mathbb{R}^{n+1}$, $w_{n+1} = b$.
⇒
$u = w_1 x_1 + \dots + w_n x_n + b = \langle w, x \rangle + b = \langle \beta, z \rangle$,
$y = f_o(u) = u$,
$e = d - y$.
Widrow-Hoff Algorithm (primal form) (delta learning rule)
qth training example $(x_q, d_q)$ with error
$E_q(w, b) := \tfrac{1}{2} (d_q - y_q)^2 = \tfrac{1}{2} [d_q - \langle w, x_q \rangle - b]^2$
Learning rules:
$\delta_o \leftarrow -\dfrac{\partial E_q}{\partial u} = e$,
$w \leftarrow w - \eta \dfrac{\partial E_q}{\partial w} = w + \eta \delta_o x_q$,
$b \leftarrow b - \eta \dfrac{\partial E_q}{\partial b} = b + \eta \delta_o$.
Data: training set $S := \{(x_q, d_q)\}_{q=1}^{l} \subseteq X \times Y$; learning rate $\eta > 0$
Goal: $(w, b)$ defining a linear predictive function minimizing the sum of squared errors
Step 1: $w \leftarrow 0$; $b \leftarrow 0$;
Step 2:
repeat
  for $q = 1$ to $l$
    $e \leftarrow d_q - \langle w, x_q \rangle - b$;
    $w \leftarrow w + \eta e x_q$; $b \leftarrow b + \eta e$;
  end for
until convergence criterion satisfied
return $(w, b)$
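A short NumPy sketch of the primal Widrow-Hoff (delta) rule above; the convergence test on the epoch-wise squared error is our own choice.

```python
import numpy as np

def widrow_hoff(X, d, eta=0.01, max_epochs=1000, tol=1e-8):
    """Primal Widrow-Hoff (delta rule): stochastic gradient descent on the squared error."""
    l, n = X.shape
    w, b = np.zeros(n), 0.0
    prev_sse = np.inf
    for _ in range(max_epochs):
        sse = 0.0
        for q in range(l):
            e = d[q] - (X[q] @ w + b)   # prediction error on the qth example
            w += eta * e * X[q]
            b += eta * e
            sse += e**2
        if abs(prev_sse - sse) < tol:   # simple convergence criterion (our addition)
            break
        prev_sse = sse
    return w, b
```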
Dual Form of the Widrow-Hoff Algorithm
In the primal Widrow-Hoff algorithm with $w_0 = 0$, the final weight has the form
$w = \sum_{q=1}^{l} \eta \alpha_q x_q$,
where $\alpha_q$, $q \in l$, is the accumulated prediction error obtained when using $(x_q, d_q)$ as the training example. Then we have
$f_{w,b}(x) := \langle w, x \rangle + b = \sum_{j=1}^{l} \eta \alpha_j \langle x_j, x \rangle + b$,
$e_q := d_q - f_{w,b}(x_q) = d_q - \sum_{j=1}^{l} \eta \alpha_j \langle x_j, x_q \rangle - b$.
Data: training set $S := \{(x_q, d_q)\}_{q=1}^{l} \subseteq X \times Y$; learning rate $\eta > 0$
Goal: $(\alpha, b)$ defining a linear predictive function minimizing the sum of squared errors
Step 1: $\alpha \leftarrow 0$; $b \leftarrow 0$;
Step 2:
repeat
  for $q = 1$ to $l$
    $e \leftarrow d_q - \sum_{j=1}^{l} \eta \alpha_j \langle x_j, x_q \rangle - b$;
    $\alpha_q \leftarrow \alpha_q + e$; $b \leftarrow b + \eta e$;
  end for
until convergence criterion satisfied
return $(\alpha, b)$
Widrow-Hoff Algorithms for Nonlinear Regressors
Important observation: In the dual form of the Widrow-Hoff algorithm,
the training data enter the algorithm through $\langle x_j, x_i \rangle$;
the final predictive function depends only upon $\langle x_j, x \rangle$;
⇒ map $x$ in the input space into $\phi(x)$ in the feature space
⇒ $\langle \phi(x_j), \phi(x_i) \rangle = K(x_j, x_i)$, $\langle \phi(x_j), \phi(x) \rangle = K(x_j, x)$.
nonlinear predictive function:
$f_{w,b}(x) = \sum_{j=1}^{l} \eta \alpha_j K(x_j, x) + b$
qth error:
$e_q := d_q - f_{w,b}(x_q) = d_q - \sum_{j=1}^{l} \eta \alpha_j K(x_j, x_q) - b$
[Figure: the feature map φ takes the input space to the feature space.]
Nonlinear Widrow-Hoff Algorithm
Data: training set $S := \{(x_q, d_q)\}_{q=1}^{l} \subseteq X \times Y$; learning rate $\eta > 0$
Goal: $(\alpha, b)$ defining a nonlinear predictive function minimizing the sum of squared errors
Step 1: $\alpha \leftarrow 0$; $b \leftarrow 0$;
Step 2:
repeat
  for $q = 1$ to $l$
    $e \leftarrow d_q - \sum_{j=1}^{l} \eta \alpha_j K(x_j, x_q) - b$;
    $\alpha_q \leftarrow \alpha_q + e$; $b \leftarrow b + \eta e$;
  end for
until convergence criterion satisfied
return $(\alpha, b)$
Multi-layer Neural Networks
^^ Artificial Neural Networks
^^ Generalized Radial Basis Function Networks
^^ Fuzzy Neural Networks
^^ Support Vector Machines
Crucial property for the success of a class of learning machines
$M := \{f(x; \theta) : \theta \in \Omega\}$: a class of learning machines
Universal Approximation Property
Given any continuous function $g(x)$ defined on a compact set $U \subseteq \mathbb{R}^n$ and any positive constant $\varepsilon > 0$, no matter how small, there is a learning machine $f_\varepsilon \in M$ such that
$\sup_{x \in U} |f_\varepsilon(x) - g(x)| \le \varepsilon$.
^^ All four classes of learning machines are universal approximators.
^^ This fact is usually proved via the famous Stone-Weierstrass Theorem from mathematical analysis.
⇒ Usually a non-constructive existence result only.
Artificial Neural Networks
[Figure: a feedforward network with input nodes $z_1, \dots, z_{n+1}$, hidden nodes with activations $f_{h1}(u_1), \dots, f_{hm}(u_m)$, output nodes with activations $f_{o1}(s_1), \dots, f_{op}(s_p)$, input-to-hidden weights $v_{ij}$, and hidden-to-output weights $w_{jk}$.]
1 input layer with $n + 1$ nodes
1 hidden layer with $m + 1$ nodes
1 output layer with $p$ nodes
input vector:
$x := [x_1 \dots x_n]^T \in \mathbb{R}^n$, or $z := [z_1 \dots z_n\ z_{n+1}]^T = [x_1 \dots x_n\ 1]^T \in \mathbb{R}^{n+1}$.
output vector:
$y := [y_1 \dots y_p]^T \in \mathbb{R}^p$.
$v_{ij}$: connection weight from the ith input node to the input of the jth hidden node
$w_{jk}$: connection weight from the output of the jth hidden node to the input of the kth output node
$f_{hj}$: activation function of the jth hidden node (sigmoidal functions)
$f_{ok}$: activation function of the kth output node (sigmoidal functions for classification; linear functions with unit slope for regression)
input and output of the jth hidden node:
$u_j = \sum_{i=1}^{n+1} v_{ij} z_i$, $r_j = f_{hj}(u_j)$, $z_{n+1} := 1$, $j \in m$.
input and output of the kth output node:
$s_k = \sum_{j=1}^{m+1} w_{jk} r_j$, $y_k = f_{ok}(s_k)$, $r_{m+1} := 1$, $k \in p$.
Back Propagation Algorithm (generalized delta learning rule)
^^ The error (or residual) $e_{qk}$ at the kth output node due to the qth example:
$e_{qk} := d_{qk} - y_{qk}$, $q \in l$, $k \in p$.
Goal: Choose weights that minimize the total sum of squared errors:
$E_{\text{total}} := \tfrac{1}{2} \sum_{q=1}^{l} \sum_{k=1}^{p} e_{qk}^2$.
^^ Sum of squared errors due to the qth example:
$E_q := \tfrac{1}{2} \sum_{k=1}^{p} e_{qk}^2$, $q \in l$ ⇒ $E_{\text{total}} = \sum_{q=1}^{l} E_q$.
^^ BP algorithm: the $E_q$'s are minimized in sequence:
$\delta^o_{qk} \leftarrow -\dfrac{\partial E_q}{\partial s_{qk}} = e_{qk} f'_{ok}(s_{qk})$,
$w_{jk} \leftarrow w_{jk} - \eta_w \dfrac{\partial E_q}{\partial w_{jk}} = w_{jk} + \eta_w \delta^o_{qk} r_{qj}$,
$\delta^h_{qj} \leftarrow -\dfrac{\partial E_q}{\partial u_{qj}} = \left[\sum_{k=1}^{p} \delta^o_{qk} w_{jk}\right] f'_{hj}(u_{qj})$,
$v_{ij} \leftarrow v_{ij} - \eta_v \dfrac{\partial E_q}{\partial v_{ij}} = v_{ij} + \eta_v \delta^h_{qj} z_{qi}$.
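To make these updates concrete, here is a compact NumPy sketch of one stochastic back-propagation step for a single-hidden-layer network with sigmoidal hidden units and linear outputs (a regression setting); the weight shapes, initialization scale, and learning rates are illustrative choices of ours.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def bp_step(x, d, V, W, eta_v=0.05, eta_w=0.05):
    """One back-propagation step on a single example (x, d).

    V: (n+1) x m input-to-hidden weights, W: (m+1) x p hidden-to-output weights.
    Hidden activations are sigmoidal; output activations are linear (regression).
    """
    z = np.append(x, 1.0)            # augmented input, z_{n+1} = 1
    u = z @ V                        # hidden-node inputs
    r = np.append(sigmoid(u), 1.0)   # hidden-node outputs, r_{m+1} = 1
    s = r @ W                        # output-node inputs
    y = s                            # linear output activation
    e = d - y                        # errors at the output nodes

    delta_o = e * 1.0                                            # e_k * f'_ok(s_k), f'_ok = 1 here
    delta_h = (W[:-1, :] @ delta_o) * r[:-1] * (1.0 - r[:-1])    # [sum_k delta_o_k w_jk] * f'_hj(u_j)

    W += eta_w * np.outer(r, delta_o)            # w_jk <- w_jk + eta_w * delta_o_k * r_j
    V += eta_v * np.outer(z, delta_h)            # v_ij <- v_ij + eta_v * delta_h_j * z_i
    return V, W

# usage with random weights (assumed sizes, for illustration)
rng = np.random.default_rng(0)
V = 0.1 * rng.standard_normal((3, 4))   # n = 2 inputs (+ bias), m = 4 hidden nodes
W = 0.1 * rng.standard_normal((5, 1))   # m = 4 hidden nodes (+ bias), p = 1 output
V, W = bp_step(np.array([0.5, -1.0]), np.array([0.2]), V, W)
```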
Key Observations
1. An artificial neural network is a cascade of two layers of interconnected generalized linear models.
^^ Each neuron implements a generalized linear model, where the link function is the inverse of the activation function.
2. The operation of the artificial neural network can be described in two steps:
^^ First, in a peculiar way, nonlinearly transform the input vector $z \in \mathbb{R}^{n+1}$ to the feature vector $r \in \mathbb{R}^m$, the outputs of the hidden nodes, with $\mathbb{R}^m$ treated as the feature space.
^^ Then perform generalized linear regression in the feature space to produce the output vector $y \in \mathbb{R}^p$ of the network.
Generalized Radial Basis Function Networks
Feedforward network:
1 input layer with $n$ nodes
1 hidden layer with $m$ nodes
1 output layer with $p$ nodes
input vector: $x := [x_1 \dots x_n]^T \in \mathbb{R}^n$
output vector: $y := [y_1 \dots y_p]^T \in \mathbb{R}^p$
predictive function $f$: a nonlinear map
$y_k = f_k(x) = f_{ok}\left(\sum_{j=1}^{m} w_{jk} \exp\left[-\sum_{i=1}^{n} \frac{(x_i - c_{ij})^2}{v_{ij}}\right]\right)$
$w_{jk}$: connection weight from the jth hidden node to the kth output node;
$c_j := [c_{1j}\ c_{2j} \dots c_{nj}]^T$: center of the jth basis function;
$v_{ij}$: the ith "variance" of the jth basis function, $v_{ij} := 2\sigma_{ij}^2 > 0$;
$f_{ok}$: activation function of the kth output node.
Define, for $i \in n$, $j \in m$, and $k \in p$,
$u_j = \sum_{i=1}^{n} \frac{(x_i - c_{ij})^2}{v_{ij}}$ (Mahalanobis distance),
$r_j = \exp(-u_j)$,
$s_k = \sum_{j=1}^{m} w_{jk} r_j$
⇒ $y_k = f_{ok}(s_k)$.
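A small NumPy sketch of the GRBFN forward pass just described, with an identity output activation; the array shapes are our assumptions.

```python
import numpy as np

def grbfn_forward(x, C, V, W):
    """Forward pass of a generalized RBF network.

    x: input vector of length n.
    C: n x m matrix of centers (column j is c_j).
    V: n x m matrix of positive "variances" v_ij.
    W: m x p matrix of hidden-to-output weights.
    The output activation f_ok is taken to be the identity here.
    """
    u = np.sum((x[:, None] - C) ** 2 / V, axis=0)   # Mahalanobis distances u_j
    r = np.exp(-u)                                  # hidden-node outputs r_j
    s = r @ W                                       # output-node inputs s_k
    return s                                        # y_k = f_ok(s_k) with identity f_ok
```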
Key Observations
1. A generalized radial basis function network is a cascade of a layer of generalized radial basis functions and a layer of generalized linear models.
^^ Each hidden neuron implements a generalized radial basis function, and each output neuron implements a generalized linear model, where the link function is the inverse of the output activation function.
2. The operation of the generalized radial basis function network can be described in two steps:
^^ First, in a peculiar way, nonlinearly transform the input vector $x \in \mathbb{R}^n$ to the feature vector $r \in \mathbb{R}^m$, the outputs of the hidden nodes, with $\mathbb{R}^m$ treated as the feature space.
^^ Then perform generalized linear regression in the feature space to produce the output vector $y \in \mathbb{R}^p$ of the network.
Back Propagation Algorithm (generalized delta learning rule)
^^ The error (or residual) $e_{qk}$ at the kth output node due to the qth example:
$e_{qk} := d_{qk} - y_{qk}$, $q \in l$, $k \in p$.
Goal: Choose weights that minimize the total sum of squared errors:
$E_{\text{total}} := \tfrac{1}{2} \sum_{q=1}^{l} \sum_{k=1}^{p} e_{qk}^2$.
^^ Sum of squared errors due to the qth example:
$E_q := \tfrac{1}{2} \sum_{k=1}^{p} e_{qk}^2$, $q \in l$ ⇒ $E_{\text{total}} = \sum_{q=1}^{l} E_q$.
^^ BP algorithm: the $E_q$'s are minimized in sequence:
$\delta^o_{qk} \leftarrow -\dfrac{\partial E_q}{\partial s_{qk}} = e_{qk} f'_{ok}(s_{qk})$,
$w_{jk} \leftarrow w_{jk} - \eta_w \dfrac{\partial E_q}{\partial w_{jk}} = w_{jk} + \eta_w \delta^o_{qk} r_{qj}$,
$\delta^h_{qj} \leftarrow \dfrac{\partial E_q}{\partial u_{qj}} = \left[\sum_{k=1}^{p} \delta^o_{qk} w_{jk}\right] r_{qj}$,
$c_{ij} \leftarrow c_{ij} - \eta_c \dfrac{\partial E_q}{\partial c_{ij}} = c_{ij} + \eta_c \delta^h_{qj} \cdot \dfrac{2 (x_{qi} - c_{ij})}{v_{ij}}$,
$v_{ij} \leftarrow v_{ij} - \eta_v \dfrac{\partial E_q}{\partial v_{ij}} = v_{ij} + \eta_v \delta^h_{qj} \cdot \dfrac{(x_{qi} - c_{ij})^2}{v_{ij}^2}$.
Fuzzy Neural Networks
Standard fuzzy system
Canonical fuzzy IF-THEN rules:
IF $x_1$ is $A_{1j}$ and $x_2$ is $A_{2j}$ and ... and $x_n$ is $A_{nj}$,
THEN $y_1$ is $B_{j1}$ and $y_2$ is $B_{j2}$ and ... and $y_p$ is $B_{jp}$.
[Diagram: a crisp $x$ in $U$ is fuzzified, processed by the fuzzy inference engine using the fuzzy rule base, and defuzzified to a crisp $y$ in $V$; the overall system is a crisp nonlinear map from $U$ to $V$.]
Fuzzy System as a Nonlinear Map
fuzzy system:
$n$ inputs: $x := [x_1 \dots x_n]^T \in \mathbb{R}^n$
$p$ outputs: $y := [y_1 \dots y_p]^T \in \mathbb{R}^p$
$m$ canonical fuzzy rules:
$\mu_{ij}(x_i)$: membership function of $A_{ij}$, $j \in m$, $i \in n$;
$w_{jk}$: center of the normal fuzzy set $B_{jk}$, $k \in p$, $j \in m$.
singleton fuzzifier + product inference engine + center average defuzzifier:
$y_k = f_k(x) = \dfrac{\sum_{j=1}^{m} w_{jk} \left[\prod_{i=1}^{n} \mu_{ij}(x_i)\right]}{\sum_{j=1}^{m} \left[\prod_{i=1}^{n} \mu_{ij}(x_i)\right]}$, $k \in p$.
singleton fuzzifier + minimum inference engine + center average defuzzifier:
$y_k = f_k(x) = \dfrac{\sum_{j=1}^{m} w_{jk} \left[\min_{i=1}^{n} \mu_{ij}(x_i)\right]}{\sum_{j=1}^{m} \left[\min_{i=1}^{n} \mu_{ij}(x_i)\right]}$, $k \in p$.
Note: the output $y_k$ is a convex combination of the $w_{jk}$.
Gaussian membership functions:
$y_k = f_k(x) = \dfrac{\sum_{j=1}^{m} w_{jk} \prod_{i=1}^{n} \exp[-(x_i - c_{ij})^2 / v_{ij}]}{\sum_{j=1}^{m} \prod_{i=1}^{n} \exp[-(x_i - c_{ij})^2 / v_{ij}]}$, $k \in p$,
$w_{jk}$: center of the kth fuzzy set $B_{jk}$ in the jth rule;
$c_{ij}$: center of the ith Gaussian fuzzy set $A_{ij}$ in the jth rule;
$v_{ij}$: "variance" of the ith Gaussian fuzzy set $A_{ij}$ in the jth rule; $v_{ij} := 2\sigma_{ij}^2 > 0$.
⇒
$y_k = f_k(x) = \dfrac{\sum_{j=1}^{m} w_{jk} \exp\left[-\sum_{i=1}^{n} (x_i - c_{ij})^2 / v_{ij}\right]}{\sum_{j=1}^{m} \exp\left[-\sum_{i=1}^{n} (x_i - c_{ij})^2 / v_{ij}\right]}$, $k \in p$.
Define, for $i \in n$, $j \in m$, and $k \in p$,
$u_j = \sum_{i=1}^{n} \frac{(x_i - c_{ij})^2}{v_{ij}}$ (Mahalanobis distance),
$r_j = \exp(-u_j)$,
$s_k = \sum_{j=1}^{m} w_{jk} r_j$, $g = \sum_{j=1}^{m} r_j$,
⇒ $y_k = f_{ok}(s_k / g)$.
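A NumPy sketch of the fuzzy system viewed as a crisp nonlinear map, with Gaussian memberships, either product or minimum inference, and center-average defuzzification; the array shapes are our assumptions.

```python
import numpy as np

def fuzzy_system_output(x, C, V, W, inference="product"):
    """Fuzzy system y_k = sum_j w_jk m_j / sum_j m_j with Gaussian memberships.

    x: input vector of length n.
    C, V: n x m centers and "variances" of the membership functions mu_ij.
    W: m x p matrix of rule-consequent centers w_jk.
    inference: "product" or "minimum" inference engine.
    """
    mu = np.exp(-((x[:, None] - C) ** 2) / V)      # mu_ij(x_i), an n x m array
    if inference == "product":
        m = np.prod(mu, axis=0)                    # product of memberships per rule
    else:
        m = np.min(mu, axis=0)                     # minimum of memberships per rule
    return (m @ W) / np.sum(m)                     # center-average defuzzification
```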
Fuzzy Neural Networks
[Figure: a feedforward fuzzy neural network with inputs $x_1, \dots, x_n$, hidden nodes $r_1, \dots, r_m$ parameterized by $(c_{ij}, v_{ij})$, a normalizing node $g$, and output nodes $y_k = f_{ok}(s_k / g)$ with weights $w_{jk}$.]
Feedforward network:
1 input layer with $n$ nodes
1 hidden layer with $m$ nodes
1 output layer with $p$ nodes
Back Propagation Algorithm (generalized delta learning rule)
^^ The error (or residual) $e_{qk}$ at the kth output node due to the qth example:
$e_{qk} := d_{qk} - y_{qk}$, $q \in l$, $k \in p$.
Goal: Choose weights that minimize the total sum of squared errors:
$E_{\text{total}} := \tfrac{1}{2} \sum_{q=1}^{l} \sum_{k=1}^{p} e_{qk}^2$.
^^ Sum of squared errors due to the qth example:
$E_q := \tfrac{1}{2} \sum_{k=1}^{p} e_{qk}^2$, $q \in l$ ⇒ $E_{\text{total}} = \sum_{q=1}^{l} E_q$.
^^ BP algorithm: the $E_q$'s are minimized in sequence:
$\delta^o_{qk} \leftarrow -\dfrac{\partial E_q}{\partial s_{qk}} = e_{qk} f'_{ok}\!\left(\dfrac{s_{qk}}{g_q}\right) \dfrac{1}{g_q}$,
$w_{jk} \leftarrow w_{jk} - \eta_w \dfrac{\partial E_q}{\partial w_{jk}} = w_{jk} + \eta_w \delta^o_{qk} r_{qj}$,
$\delta^h_{qj} \leftarrow \dfrac{\partial E_q}{\partial u_{qj}} = \left[\sum_{k=1}^{p} \delta^o_{qk} \left(w_{jk} - \dfrac{s_{qk}}{g_q}\right)\right] r_{qj}$,
$c_{ij} \leftarrow c_{ij} - \eta_c \dfrac{\partial E_q}{\partial c_{ij}} = c_{ij} + \eta_c \delta^h_{qj} \cdot \dfrac{2 (x_{qi} - c_{ij})}{v_{ij}}$,
$v_{ij} \leftarrow v_{ij} - \eta_v \dfrac{\partial E_q}{\partial v_{ij}} = v_{ij} + \eta_v \delta^h_{qj} \cdot \dfrac{(x_{qi} - c_{ij})^2}{v_{ij}^2}$.
Support Vector Machines
(Boser, Guyon, and Vapnik, 1992~)
Convex Optimization from Nonlinear Optimization Theory
+ Kernel Representation from Functional Analysis
+ Distribution-free Generalization Error Bounds from Statistical Learning Theory
Maximal Margin Classifier
Let $X \subseteq \mathbb{R}^n$ and $Y := \{+1, -1\}$.
Training examples: $S := \{(x_q, d_q)\}_{q=1}^{l} \subseteq X \times Y$
Define $I_S^+ := \{q \in l : d_q = +1\}$, $I_S^- := \{j \in l : d_j = -1\}$.
Fact: Assume the margin of a linearly separable training set $S$ is $\gamma_S > 0$. Then there exist $w^* \in \mathbb{R}^n$ with $\|w^*\| = 1$ and $b^* \in \mathbb{R}$, realizing the maximal margin hyperplane, such that
$d_q \cdot [\langle w^*, x_q \rangle + b^*] \ge \gamma_S > 0$ for all $q \in l$,
$\langle w^*, x_{j^*} \rangle + b^* = \gamma_S$ for some $j^* \in I_S^+$,
$\langle w^*, x_{k^*} \rangle + b^* = -\gamma_S$ for some $k^* \in I_S^-$.
Suppose there exist $w^* \in \mathbb{R}^n$ with $\|w^*\| = 1$, $b^* \in \mathbb{R}$, and $\gamma > 0$ such that
$d_q \cdot [\langle w^*, x_q \rangle + b^*] \ge \gamma > 0$ for all $q \in l$,
$d_j \cdot [\langle w^*, x_j \rangle + b^*] = \gamma$ for some $j \in l$.
⇒ the geometric margin of $H_{w^*, b^*}$ is $\gamma$.
Define
$w_0 := \gamma^{-1} w^*$, $b_0 := \gamma^{-1} b^*$
⇒ $d_q \cdot [\langle w_0, x_q \rangle + b_0] \ge 1$ for all $q \in l$,
$d_j \cdot [\langle w_0, x_j \rangle + b_0] = 1$ for some $j \in l$.
⇒ (a) $H_{w_0, b_0} = H_{w^*, b^*}$
(b) the functional margin of $H_{w_0, b_0}$ is 1 ($H_{w_0, b_0}$: canonical hyperplane)
(c) $\gamma = \|w_0\|^{-1}$.
Important observation:
maximization of the margin of a linearly separable training set
⇔ minimization of the Euclidean norm of the weight vector of the canonical hyperplane
Primal problem: (P0)
minimize $\tfrac{1}{2} w^T w$
subject to $d_q \cdot [\langle w, x_q \rangle + b] \ge 1$ for all $q \in l$.
⇒ a standard quadratic convex program
Suppose $(w^*, b^*)$ solves (P0).
⇒ optimal discriminant function: $f^*(x) = \langle w^*, x \rangle + b^*$
margin: $\gamma_S = \|w^*\|^{-1}$
Dual problem
Lagrangian:
$L(w, b, \alpha) := \tfrac{1}{2} w^T w - \sum_{q=1}^{l} \alpha_q \left( d_q [w^T x_q + b] - 1 \right)$
Vector of Lagrange multipliers:
$\alpha := [\alpha_1 \dots \alpha_l]^T \in \mathbb{R}^l$
Derivation:
$0 = \dfrac{\partial L}{\partial w} = w - \sum_{q=1}^{l} \alpha_q d_q x_q$, $0 = \dfrac{\partial L}{\partial b} = -\sum_{q=1}^{l} \alpha_q d_q$.
⇒ $w = \sum_{q=1}^{l} \alpha_q d_q x_q$, $\sum_{q=1}^{l} \alpha_q d_q = 0$.
⇒ $J_d(\alpha) := \min_{w, b} L(w, b, \alpha) = \sum_{q=1}^{l} \alpha_q - \tfrac{1}{2} \sum_{q=1}^{l} \sum_{j=1}^{l} \alpha_q \alpha_j d_q d_j x_q^T x_j$
Dual problem: (D0)
maximize $\sum_{q=1}^{l} \alpha_q - \tfrac{1}{2} \sum_{q=1}^{l} \sum_{j=1}^{l} \alpha_q \alpha_j d_q d_j \langle x_q, x_j \rangle$
subject to $\sum_{q=1}^{l} \alpha_q d_q = 0$ and $\alpha_q \ge 0$ for all $q \in l$.
⇒ a standard quadratic concave program
^^ The cost functional to be maximized depends on the input patterns only through a set of inner products, $\langle x_q, x_j \rangle$, $q, j \in l$.
^^ The relation $w^* = \sum_{q=1}^{l} \alpha_q^* d_q x_q$ shows that the hypothesis can be described as a linear combination of the training points.
optimal weight: $w^* = \sum_{q=1}^{l} \alpha_q^* d_q x_q$
margin: $\gamma_S = \|w^*\|^{-1}$
KKT complementarity conditions: for all $q \in l$,
$\alpha_q^* \left( d_q [\langle w^*, x_q \rangle + b^*] - 1 \right) = 0$,
$d_q [\langle w^*, x_q \rangle + b^*] - 1 \ge 0$,
$\alpha_q^* \ge 0$.
Define $I_{sv} := \{q \in l : \alpha_q^* > 0\}$.
optimal weight: $w^* = \sum_{q=1}^{l} \alpha_q^* d_q x_q = \sum_{q \in I_{sv}} \alpha_q^* d_q x_q$
optimal discriminant function:
$f^*(x) = \langle w^*, x \rangle + b^* = \sum_{q \in I_{sv}} \alpha_q^* d_q \langle x_q, x \rangle + b^*$,
where
$b^* = d_k - \sum_{q \in I_{sv}} \alpha_q^* d_q \langle x_q, x_k \rangle$ for any $k$ with $\alpha_k^* > 0$.
^^ Obviously, the Lagrange multiplier associated with each point quantifies how important that training point is in forming the final solution.
^^ Points with zero $\alpha_q^*$ have no influence.
^^ For any $q \in I_{sv}$, we have $\alpha_q^* > 0$.
KKT conditions ⇒ $d_q [\langle w^*, x_q \rangle + b^*] = 1$.
This implies that the functional margin of $(x_q, d_q)$ with respect to the maximal margin hyperplane is one, and therefore $x_q$ lies closest to the maximal margin hyperplane.
positive support vector: any pattern $x_q$ with $q \in I_{sv}$ and $d_q = +1$
negative support vector: any pattern $x_q$ with $q \in I_{sv}$ and $d_q = -1$
^^ In conceptual terms, the support vectors are those data points that lie closest to the decision surface and are therefore the most difficult to classify.
^^ The fact that only a subset of the Lagrange multipliers is nonzero is referred to as sparseness; it means that the support vectors contain all the information necessary to reconstruct the optimal hyperplane.
Slack Variables for Classification
Definition: Let $\gamma > 0$ be given. The margin slack variable $\xi_q$ of an example $(x_q, d_q)$ with respect to the hyperplane $H: (w, b)$ and target margin $\gamma$ is defined by
$\xi_q := \max\left(0,\ \gamma - d_q \cdot [\langle w, x_q \rangle + b]\right)$.
From the definition, we have
$\xi_q \ge 0$ and $d_q \cdot [\langle w, x_q \rangle + b] + \xi_q \ge \gamma$.
^^ $\|\xi\|_1 := \sum_{q=1}^{l} \xi_q$ or $\|\xi\|_2^2 := \sum_{q=1}^{l} \xi_q^2$
measures the amount by which the training set fails to have margin $\gamma$, and takes into account any misclassifications of the training data.
[Figure: a separating hyperplane H with margin hyperplanes H+ and H-; the slack variable of a point on the wrong side of its margin hyperplane measures how far it falls short of the target margin.]
1-norm Soft Margin Classifier
Primal problem: (P1)
minimize $\tfrac{1}{2} w^T w + C \sum_{q=1}^{l} \xi_q$
subject to $d_q \cdot [\langle w, x_q \rangle + b] \ge 1 - \xi_q$, $\xi_q \ge 0$, $q \in l$.
The parameter $C > 0$ controls the trade-off between the complexity of the machine and the number of non-separable points.
Dual problem: (D1)
maximize $\sum_{q=1}^{l} \alpha_q - \tfrac{1}{2} \sum_{q=1}^{l} \sum_{j=1}^{l} \alpha_q \alpha_j d_q d_j \langle x_q, x_j \rangle$
subject to $\sum_{q=1}^{l} \alpha_q d_q = 0$, $0 \le \alpha_q \le C$, $q \in l$.
^^ box constraints ⇒ the influence of the individual patterns gets limited
optimal weight: $w^* = \sum_{q=1}^{l} \alpha_q^* d_q x_q = \sum_{q \in I_{sv}} \alpha_q^* d_q x_q$
optimal discriminant function:
$f^*(x) = \langle w^*, x \rangle + b^* = \sum_{q \in I_{sv}} \alpha_q^* d_q \langle x_q, x \rangle + b^*$,
$b^* = d_k - \sum_{q \in I_{sv}} \alpha_q^* d_q \langle x_q, x_k \rangle$ for any $k$ with $0 < \alpha_k^* < C$.
Ridge Regressor (LS-SVM)
idea: choose a $(w, b)$ that minimizes
$E(w, b) := \tfrac{\lambda}{2} \langle w, w \rangle + \tfrac{1}{2} \sum_{q=1}^{l} [d_q - \langle w, x_q \rangle - b]^2$
$\lambda > 0$: regularization parameter
$\lambda \langle w, w \rangle = \lambda w^T w = \lambda \|w\|^2 = \lambda \|\partial f / \partial x\|^2$:
smoothing functional or stabilizer, ensuring that the approximating function is smooth
Primal problem: (P0)
minimize $\tfrac{\lambda}{2} w^T w + \tfrac{1}{2} \sum_{q=1}^{l} \xi_q^2$
subject to $d_q - w^T x_q - b = \xi_q$ for all $q \in l$.
Dual problem: (D0)
maximize $\sum_{q=1}^{l} d_q \alpha_q - \tfrac{1}{2\lambda} \sum_{q=1}^{l} \sum_{j=1}^{l} \alpha_q \alpha_j \langle x_q, x_j \rangle - \tfrac{1}{2} \sum_{q=1}^{l} \alpha_q^2$
subject to $\sum_{q=1}^{l} \alpha_q = 0$.
Suppose $\alpha^*$ solves the problem (D0). Define $I_{sv} := \{q \in l : \alpha_q^* \ne 0\}$.
Optimal weight:
$w^* = \tfrac{1}{\lambda} \sum_{q=1}^{l} \alpha_q^* x_q = \tfrac{1}{\lambda} \sum_{q \in I_{sv}} \alpha_q^* x_q$,
$\xi_q^* = \alpha_q^*$, $q \in l$.
optimal predictive function:
$f^*(x) = \langle w^*, x \rangle + b^* = \tfrac{1}{\lambda} \sum_{q \in I_{sv}} \alpha_q^* \langle x_q, x \rangle + b^*$,
$b^* = d_k - \alpha_k^* - \tfrac{1}{\lambda} \sum_{q \in I_{sv}} \alpha_q^* \langle x_q, x_k \rangle$, for any $k \in l$.
^^ Note how the preceding equations depend only on the inner products.
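The relations $d_q = \tfrac{1}{\lambda}\sum_j \alpha_j \langle x_j, x_q \rangle + b + \alpha_q$ and $\sum_q \alpha_q = 0$ implied by the optimality conditions above form a linear system in $(\alpha, b)$; the following NumPy sketch (our own formulation of that system) solves it directly.

```python
import numpy as np

def ls_svm_fit(X, d, lam=1.0):
    """Solve the LS-SVM (ridge regressor) for (alpha, b) via its optimality conditions."""
    l = X.shape[0]
    G = X @ X.T                                   # Gram matrix of inner products
    A = np.zeros((l + 1, l + 1))
    A[:l, :l] = G / lam + np.eye(l)               # d_q = (1/lam)*sum_j alpha_j <x_j,x_q> + b + alpha_q
    A[:l, l] = 1.0                                # the bias column
    A[l, :l] = 1.0                                # the constraint sum_q alpha_q = 0
    rhs = np.concatenate([d, [0.0]])
    sol = np.linalg.solve(A, rhs)
    return sol[:l], sol[l]                        # (alpha, b)

def ls_svm_predict(Xtr, alpha, b, lam, Xte):
    """Predictive function f(x) = (1/lam) * sum_q alpha_q <x_q, x> + b."""
    return (Xte @ Xtr.T) @ alpha / lam + b
```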
Slack Variables for Regression
Definition: Let $\varepsilon > 0$ be given. The margin slack variable $s_q$ of an example $(x_q, d_q)$ with respect to the hyperplane $H: (w, b)$ and target margin $\varepsilon$ is defined by
$s_q := \max\left(0,\ |d_q - \langle w, x_q \rangle - b| - \varepsilon\right)$.
From the definition, we have
$s_q \ge 0$ and $|d_q - \langle w, x_q \rangle - b| \le s_q + \varepsilon$.
[Figure: regression data around the ε-band of a fitted line; points outside the band have positive slack ξ or η.]
^^ The quantity defined by
$\|s\|_1 := \sum_{q=1}^{l} s_q$ or $\|s\|_2^2 := \sum_{q=1}^{l} s_q^2$
measures the amount by which the training set fails to fall within the $\varepsilon$-band of the hyperplane.
It is convenient to introduce two slack variables, one for exceeding the target value by more than $\varepsilon$, and the other for being more than $\varepsilon$ below the target. Let us define the slack variables as
$\xi_q := \max\left(0,\ d_q - \langle w, x_q \rangle - b - \varepsilon\right)$,
$\eta_q := \max\left(0,\ \langle w, x_q \rangle + b - d_q - \varepsilon\right)$.
⇒ $\xi_q \eta_q = 0$,
$s_q = \xi_q + \eta_q = \max\left(0,\ |d_q - \langle w, x_q \rangle - b| - \varepsilon\right)$.
^^ Likewise, the quantity defined by
$\sum_{q=1}^{l} (\xi_q + \eta_q)$ or $\sum_{q=1}^{l} (\xi_q^2 + \eta_q^2)$
measures the amount by which the training set fails to fall within the $\varepsilon$-band of the hyperplane.
1-norm Soft Regressor
Primal problem: (P1)
minimize $\tfrac{1}{2} w^T w + C \sum_{q=1}^{l} (\xi_q + \eta_q)$
subject to $d_q - [\langle w, x_q \rangle + b] \le \varepsilon + \xi_q$, $\xi_q \ge 0$,
$[\langle w, x_q \rangle + b] - d_q \le \varepsilon + \eta_q$, $\eta_q \ge 0$, $q \in l$.
^^ We have introduced two slack variables, one for exceeding the target value by more than $\varepsilon$, and the other for being more than $\varepsilon$ below the target.
Dual problem: (D1)
maximize $\sum_{q=1}^{l} d_q (\alpha_q - \beta_q) - \varepsilon \sum_{q=1}^{l} (\alpha_q + \beta_q) - \tfrac{1}{2} \sum_{q=1}^{l} \sum_{j=1}^{l} (\alpha_q - \beta_q)(\alpha_j - \beta_j) \langle x_q, x_j \rangle$
subject to
$\sum_{q=1}^{l} (\alpha_q - \beta_q) = 0$,
$0 \le \alpha_q \le C$, $0 \le \beta_q \le C$, $q \in l$.
Simpler dual problem: (D1')
maximize $\sum_{q=1}^{l} d_q \theta_q - \varepsilon \sum_{q=1}^{l} |\theta_q| - \tfrac{1}{2} \sum_{q=1}^{l} \sum_{j=1}^{l} \theta_q \theta_j \langle x_q, x_j \rangle$
subject to $\sum_{q=1}^{l} \theta_q = 0$, $-C \le \theta_q \le C$, $q \in l$.
^^ box constraints ⇒ the influence of the individual patterns gets limited
Define $I_{sv} := \{q \in l : \theta_q^* \ne 0\}$.
optimal weight: $w^* = \sum_{q=1}^{l} \theta_q^* x_q = \sum_{q \in I_{sv}} \theta_q^* x_q$
optimal predictive function:
$f^*(x) = \langle w^*, x \rangle + b^* = \sum_{q \in I_{sv}} \theta_q^* \langle x_q, x \rangle + b^*$,
where
$b^* = d_k - \varepsilon - \sum_{q \in I_{sv}} \theta_q^* \langle x_q, x_k \rangle$ for some $k$ with $0 < \theta_k^* < C$.
Nonlinear Classification
Cover's Theorem:
A complex pattern-classification problem cast nonlinearly into a high-dimensional space is more likely to be linearly separable than in a low-dimensional space.
Note: Cover's Theorem states that the training set $S$ may be transformed into a new feature space where the patterns are linearly separable with high probability, provided two conditions are satisfied:
^^ The transformation is nonlinear.
^^ The dimensionality of the feature space is high enough.
[Figure: the feature map φ takes data that are not linearly separable in the input space to a feature space in which they are linearly separable.]
Nonlinear Classifier
Let $X \subseteq \mathbb{R}^n$ and $Y := \{+1, -1\}$.
training examples: $S := \{(x_q, d_q)\}_{q=1}^{l} \subseteq X \times Y$
kernel: $K(x, z) := \langle \phi(x), \phi(z) \rangle$, $x, z \in X$.
1-norm Soft Margin Classifier in Feature Space (D1)
maximize $\sum_{q=1}^{l} \alpha_q - \tfrac{1}{2} \sum_{q=1}^{l} \sum_{j=1}^{l} \alpha_q \alpha_j d_q d_j \langle \phi(x_q), \phi(x_j) \rangle = \sum_{q=1}^{l} \alpha_q - \tfrac{1}{2} \sum_{q=1}^{l} \sum_{j=1}^{l} \alpha_q \alpha_j d_q d_j K(x_q, x_j)$
subject to
$\sum_{q=1}^{l} \alpha_q d_q = 0$, $0 \le \alpha_q \le C$, $q \in l$.
Define $I_{sv} := \{q \in l : \alpha_q^* > 0\}$.
optimal weight: $w^* = \sum_{q=1}^{l} \alpha_q^* d_q \phi(x_q) = \sum_{q \in I_{sv}} \alpha_q^* d_q \phi(x_q)$
optimal discriminant function:
$f^*(x) = \sum_{q \in I_{sv}} \alpha_q^* d_q \langle \phi(x_q), \phi(x) \rangle + b^* = \sum_{q \in I_{sv}} \alpha_q^* d_q K(x_q, x) + b^*$,
where
$b^* = d_k - \sum_{q \in I_{sv}} \alpha_q^* d_q \langle \phi(x_q), \phi(x_k) \rangle = d_k - \sum_{q \in I_{sv}} \alpha_q^* d_q K(x_q, x_k)$ for any $k$ with $0 < \alpha_k^* < C$.
^^ There is no need to calculate any features to form the final discriminant function.
^^ The kernel is just good enough.
^^ The kernel-based support vector classifier (SVC) can be represented as a multi-layer feed-forward network with a single hidden layer and with nonlinear activation functions.
^^ The number of hidden neurons is equal to the number of support vectors, which is automatically determined by the support vector machine under consideration.
[Figure: a network with input $x = (x_1, \dots, x_n)$, hidden nodes $K(x_1, x), \dots, K(x_j, x), \dots, K(x_l, x)$ weighted by $d_1\alpha_1, \dots, d_l\alpha_l$, a bias $b$, and a summing output node $y$.]
Nonlinear Regression
Let $X \subseteq \mathbb{R}^n$ and $Y \subseteq \mathbb{R}$.
training examples: $S := \{(x_q, d_q)\}_{q=1}^{l} \subseteq X \times Y$
kernel: $K(x, z) := \langle \phi(x), \phi(z) \rangle$, $x, z \in X$.
[Figure: the feature map φ takes the input space to the feature space.]
1-norm Soft Regressor in Feature Space (D1')
maximize $\sum_{q=1}^{l} d_q \theta_q - \varepsilon \sum_{q=1}^{l} |\theta_q| - \tfrac{1}{2} \sum_{q=1}^{l} \sum_{j=1}^{l} \theta_q \theta_j \langle \phi(x_q), \phi(x_j) \rangle = \sum_{q=1}^{l} d_q \theta_q - \varepsilon \sum_{q=1}^{l} |\theta_q| - \tfrac{1}{2} \sum_{q=1}^{l} \sum_{j=1}^{l} \theta_q \theta_j K(x_q, x_j)$
subject to $\sum_{q=1}^{l} \theta_q = 0$, $-C \le \theta_q \le C$, $q \in l$.
Define $I_{sv} := \{q \in l : \theta_q^* \ne 0\}$.
optimal weight: $w^* = \sum_{q=1}^{l} \theta_q^* \phi(x_q) = \sum_{q \in I_{sv}} \theta_q^* \phi(x_q)$
optimal predictive function:
$f^*(x) = \sum_{q \in I_{sv}} \theta_q^* \langle \phi(x_q), \phi(x) \rangle + b^* = \sum_{q \in I_{sv}} \theta_q^* K(x_q, x) + b^*$,
where
$b^* = d_k - \varepsilon - \sum_{q \in I_{sv}} \theta_q^* \langle \phi(x_q), \phi(x_k) \rangle = d_k - \varepsilon - \sum_{q \in I_{sv}} \theta_q^* K(x_q, x_k)$ for some $k$ with $0 < \theta_k^* < C$.
^^ There is no need to calculate any features to form the final predictive function. The kernel is just good enough.
^^ The kernel-based support vector regressor (SVR) can be represented as a multi-layer feed-forward network with a single hidden layer and with nonlinear activation functions.
^^ The number of hidden neurons is equal to the number of support vectors, which is automatically determined by the support vector machine under consideration.
[Figure: a network with input $x = (x_1, \dots, x_n)$, hidden nodes $K(x_1, x), \dots, K(x_j, x), \dots, K(x_l, x)$ weighted by $\theta_1, \dots, \theta_l$, a bias $b$, and a summing output node $y$.]
Model Selection: SVM Parameter Settings
Commonly Used Techniques
Model Selection by Cross Validation
Population-based Model Selection: Genetic Algorithm (GA), Particle Swarm Optimization (PSO) Algorithm
Cross Validation (dividing the data into training and testing datasets)
^^ First split the experimental dataset into several, say m, parts of approximately equal size.
^^ Perform m training runs. Each time, one of the m parts is left out and used as an independent validation set for optimizing the parameters.
^^ Taking into consideration both the average training and testing error rates, we choose the parameters with acceptable results on average over the m runs. A small sketch of this procedure is given below.
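A minimal sketch of m-fold cross validation for choosing SVM parameters, assuming scikit-learn's SVC with an RBF kernel as the learning machine; the parameter grid and fold count are illustrative.

```python
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def select_svc_parameters(X, d, Cs=(0.1, 1.0, 10.0), gammas=(0.01, 0.1, 1.0), m=5):
    """Pick (C, gamma) of an RBF-kernel SVC by m-fold cross validation."""
    best = None
    for C in Cs:
        for gamma in gammas:
            clf = SVC(C=C, kernel="rbf", gamma=gamma)
            # average validation accuracy over the m held-out parts
            score = cross_val_score(clf, X, d, cv=m).mean()
            if best is None or score > best[0]:
                best = (score, C, gamma)
    return best  # (average validation accuracy, C, gamma)
```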
Population-based SVM Parameter Settings
Mahalanobis kernel:
$K(x, z) := \exp[-\sigma_1^{-2} (x_1 - z_1)^2 - \dots - \sigma_n^{-2} (x_n - z_n)^2]$.
Parameters for Support Vector Classification (SVC): $C, \sigma_1^2, \dots, \sigma_n^2$
Parameters for Support Vector Regression (SVR): $C, \varepsilon, \sigma_1^2, \dots, \sigma_n^2$
Genetic Algorithm
Chromosome for SVC: $z := [C\ \sigma_1^2\ \dots\ \sigma_n^2]$
Chromosome for SVR: $z := [C\ \varepsilon\ \sigma_1^2\ \dots\ \sigma_n^2]$
Gene for SVC: $z_i$, $i \in n + 1$
Gene for SVR: $z_i$, $i \in n + 2$
Particle Swarm Optimization Algorithm
Particle for SVC: $z := [C\ \sigma_1^2\ \dots\ \sigma_n^2]$
Particle for SVR: $z := [C\ \varepsilon\ \sigma_1^2\ \dots\ \sigma_n^2]$
Genetic Algorithm
^^ a biologically motivated general search technique mimicking natural selection and natural genetics
^^ a population-based search method between exhaustive search and traditional search methods
[Flowchart: generate the initial population → calculate objective values → check the convergence criterion (end if satisfied) → selection for mating → genetic operations for reproduction (crossover, mutation) → replacement to form the new population → back to calculating objective values.]
Basic Notions
Selection: the process of choosing parents for reproduction
  fitness-proportionate selection (roulette wheel selection)
  stochastic universal sampling
  random selection
  rank selection (from top to bottom)
  tournament selection
^^ low selection pressure in the initial generations ⇒ avoids premature convergence and maintains diversity
^^ high selection pressure in the final generations ⇒ speeds up convergence
Tournament selection
^^ Randomly select two chromosomes from the mating pool, say $z_1$ and $z_2$.
^^ If $\text{fitness}[z_1] \ge \text{fitness}[z_2]$, then dad := $z_1$; else dad := $z_2$.
^^ Randomly select two chromosomes from the mating pool, say $z_3$ and $z_4$.
^^ If $\text{fitness}[z_3] \ge \text{fitness}[z_4]$, then mom := $z_3$; else mom := $z_4$.
Crossover: a reproduction operator that forms a new chromosome from two parent chromosomes by combining part of the information from each
  single-point crossover
  two-point crossover
  uniform crossover
  parameterized uniform crossover
Parameterized uniform crossover
parents:
  dad: $z^d := [z_1^d\ z_2^d\ \dots\ z_{l-1}^d\ z_l^d]$
  mom: $z^m := [z_1^m\ z_2^m\ \dots\ z_{l-1}^m\ z_l^m]$
offspring:
  brother: $z^b := [z_1^b\ z_2^b\ \dots\ z_{l-1}^b\ z_l^b]$
  sister: $z^s := [z_1^s\ z_2^s\ \dots\ z_{l-1}^s\ z_l^s]$
for each $i \in l$:
  if $\text{Rand}(\cdot) > \text{threshold\_x}$
    $z_i^b = z_i^m - \text{Rand}(\cdot) \cdot (z_i^m - z_i^d)$;
    $z_i^s = z_i^d + \text{Rand}(\cdot) \cdot (z_i^m - z_i^d)$;
  else
    $z_i^b = z_i^d$; $z_i^s = z_i^m$;
$\text{Rand}(\cdot)$: a random number in $[0, 1]$
^^ $\text{Rand}(\cdot) = 0$ ⇒ $z_i^b = z_i^m$, $z_i^s = z_i^d$.
^^ $\text{Rand}(\cdot) = 1$ ⇒ $z_i^b = z_i^d$, $z_i^s = z_i^m$.
Mutation: a reproduction operator that randomly alters the values of genes in a chromosome
  point mutation
  uniform mutation
  parameterized uniform mutation
^^ escape from a local maximum; leapfrog over sticking points
Parameterized mutation
for each $i \in l$:
  if $\text{rand}(\cdot) > \text{threshold\_m}$, $z_i$ is replaced by a random number; else $z_i$ is maintained
$\text{rand}(\cdot)$: a random number in $[0, 1]$ (a code sketch of these operators follows below)
Building block hypothesis: GAs attempt to find highly fit solutions to the problem at hand by the juxtaposition of "good building blocks."
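A NumPy sketch of tournament selection, parameterized uniform crossover, and parameterized mutation as described above; the threshold values and gene bounds are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def tournament_select(pop, fitness):
    """Pick one parent: the fitter of two randomly chosen chromosomes."""
    i, j = rng.integers(len(pop), size=2)
    return pop[i] if fitness[i] >= fitness[j] else pop[j]

def parameterized_uniform_crossover(dad, mom, threshold_x=0.5):
    """Blend genes where a uniform draw exceeds threshold_x; copy them otherwise."""
    brother, sister = dad.copy(), mom.copy()
    for i in range(len(dad)):
        if rng.random() > threshold_x:
            r = rng.random()
            brother[i] = mom[i] - r * (mom[i] - dad[i])
            sister[i] = dad[i] + r * (mom[i] - dad[i])
    return brother, sister

def parameterized_mutation(z, low, high, threshold_m=0.9):
    """Replace a gene by a random number in [low_i, high_i] with probability 1 - threshold_m."""
    z = z.copy()
    for i in range(len(z)):
        if rng.random() > threshold_m:
            z[i] = rng.uniform(low[i], high[i])
    return z
```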
GA-based Framework 1
Possible fitness value of a chromosome in SVC: the number of testing data that are correctly classified
Possible fitness value of a chromosome in SVR: the negative of the mean squared error on the testing data
[Flowchart: generate a random initial population of [C, σᵢ] or [C, ε, σᵢ]; select random training and testing data from the experimental data set; train the SVC or SVR and compute the fitness of all chromosomes; if not converged, perform selection for mating, genetic operations for reproduction (crossover, mutation), and replacement to form the new population, then repeat with a new random split of training and testing data.]
Criteria for Final Chromosome Selection
Life time of a chromosome: the number of generations that a given chromosome lives.
The chromosome with the highest life time is chosen to give our final parameters.
Life score of a chromosome: the average score over the generations of a given chromosome.
For instance:
Fitness ranking in descending order | score
1 | 5
2 | 4
3 | 3
4 | 2
5 | 1
6 and above | 0
The chromosome with the highest life score is chosen to give our final parameters.
Particle Swarm Optimization
Motive:
^^ Human intelligence results from social interaction.
^^ Swarm intelligence provides a useful paradigm for implementing adaptive systems.
Main Features:
^^ a biologically motivated general search technique mimicking fish schooling, bird flocking, and bug swarming
^^ a population-based search method with each particle associated with a velocity
^^ The particles fly through the search space.
PSO Flowchart
[Flowchart: generate the initial population → calculate objective values → check the convergence criterion (end if satisfied) → update velocities → update positions → back to calculating objective values.]
Two main components
"Cognitive" part: $c_1 \cdot \text{rand}(\cdot) \cdot \{bestx_{ij} - x_{ij}\}$
"Social" part: $c_2 \cdot \text{Rand}(\cdot) \cdot \{bestx_{champion, j} - x_{ij}\}$
updating formulae:
$v_{ij} \leftarrow w \cdot v_{ij} + c_1 \cdot \text{rand}(\cdot) \cdot \{bestx_{ij} - x_{ij}\} + c_2 \cdot \text{Rand}(\cdot) \cdot \{bestx_{champion, j} - x_{ij}\}$;
$x_{ij} \leftarrow x_{ij} + v_{ij}$;
$x_{ij}$: jth position component of the ith particle
$v_{ij}$: jth velocity component of the ith particle
$bestx_{ij}$: jth component of the best previous position of the ith particle
champion: the particle giving the best objective value of all particles up to the present
$w$: inertia weight balancing local and global searches
$c_1$, $c_2$: two pre-specified constants
$\text{rand}(\cdot)$, $\text{Rand}(\cdot)$: two generators giving random numbers in $[0, 1]$
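One PSO iteration written out in NumPy, following the updating formulae above for a maximization problem; the inertia weight, constants, and objective are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def pso_step(X, V, bestX, best_f, champion, objective, w=0.7, c1=1.5, c2=1.5):
    """One PSO iteration. X, V, bestX: (num_particles x dim) arrays; best_f: best objective per particle."""
    num, dim = X.shape
    r1 = rng.random((num, dim))           # rand(.) for the cognitive part
    r2 = rng.random((num, dim))           # Rand(.) for the social part
    V = w * V + c1 * r1 * (bestX - X) + c2 * r2 * (bestX[champion] - X)
    X = X + V                             # position update
    f = np.array([objective(x) for x in X])
    improved = f > best_f                 # maximization: keep better personal bests
    bestX[improved], best_f[improved] = X[improved], f[improved]
    champion = int(np.argmax(best_f))     # particle with the best value so far
    return X, V, bestX, best_f, champion
```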
PSO-based Framework 1
Possible fitness value of a particle in SVC: the number of testing data that are correctly classified
Possible fitness value of a particle in SVR: the negative of the mean squared error on the testing data
[Flowchart: generate a random initial population of [C, σᵢ] or [C, ε, σᵢ]; select random training and testing data from the experimental data set; train the SVC or SVR and compute the fitness of all particles; if not converged, update velocities and positions to form the new population, then repeat with a new random split of training and testing data.]
Conclusion
^^ New learning machines are invented day by day: the versatility of learning machines
^^ Some learning machines are still in their infancy: much remains to be done
^^ Wide applicability to various branches of science, engineering, and management
Simple research enthusiasm ?⇒ Simple happiness
Epilogue
We must be grateful to God that He created the world in such a way that
everything simple is true and everything complicated is untrue.
Gregory Skovoroda (18th century Ukrainian philosopher)