Transcript of "Introduction to Machine Learning and Soft Computing"
Introduction to Machine Learning and Soft Computing
謝哲光
Jer-Guang Hsieh
義守大學電機工程系
Department of Electrical Engineering, I-Shou University
Kaohsiung, Taiwan 840
Contents
Introduction
Single-layer Neural Networks
Linear Classification
Linear Regression
Kernel
Multi-layer Neural Networks
Nonlinear Classification
Nonlinear Regression
Model Selection
GA-based Frameworks
PSO-based Frameworks
Conclusion
Epilogue
Introduction
^^ Almost all of science is fitting models to data.
^^ The first step in the mathematical modeling of a system under consideration is to use first principles, e.g., Newton's laws in mechanics, Kirchhoff's laws in lumped electric circuits, or the various laws of thermodynamics.
^^ As a system becomes increasingly complex, the possibility of obtaining a precise description of it in quantitative terms decreases.
⇒ What we desire in practice is a reasonable yet tractable model.
^^ It may also happen that there is no analytic model for the system under consideration. This is particularly true in social science problems.
^^ However, in many real situations we do have some experimental data, either from measurement or from data collection by some means.
⇒ This raises the necessity of a theory of learning from examples, i.e., obtaining a good mathematical model from experimental data. This is what machine learning is all about.
Brief Sketch of Machine Learning and Soft Computing
[Diagram: machine learning and soft computing at the center, drawing on numerical optimization, approximation theory, statistical learning, linear algebra, probability, and chaos; techniques include ANN, FNN, CNN, GRBFN, SVM, WLM, GA, and PSO; application areas include intelligent control, regression, classification, management, bioinformatics, time series analysis, secure communication, diagnostics, filter design, and data compression.]
Machine Learning
Main Features:
^^ The field of machine learning and soft computing is vast, versatile, and fascinating.
^^ It can be viewed as a judicious mixture of computational intelligence and applied statistics.
Basic Belief:
There is a process that explains the data we observe. Though we do not know the details of the process underlying the generation of the data, we know that it is not completely random.
Learning Problem:
Find a general rule that explains experimental data, given only a sample of limited size.
Main Categories:
Machine Learning
  Supervised Learning
    Classification Learning (Pattern Recognition)
    Function Learning (Regression Estimation)
    Preference Learning
  Unsupervised Learning
  Reinforcement Learning
Techniques:
  Statistical Learning
  Artificial Neural Network (ANN)
  Radial Basis Function Network (RBFN)
  Fuzzy Neural Network (FNN)
  Support Vector Machine (SVM)
  Wilcoxon Learning Machine (WLM)
  Evolutionary Computation (EC)
Supervised Learning
Data: a sample of input-output pairs (training sample)
Task: find a deterministic function that maps any input to an output such that disagreement with future input-output observations is minimized.
Classification Learning (pattern recognition)
Features:
  Output: categorical variables (class labels); no ordering between the classes
Examples:
  Credit scoring of loan applicants in a bank
  Classification of handwritten letters and digits
  Optical character recognition
  Face recognition
  Speech recognition
  Classification of news in a news agency
Function Learning (regression estimation)
Features:
  Output: continuous variables
Examples:
  Prediction of stock market share values
  Weather forecasting
  Navigation of an autonomous car
Preference Learning
Features:
^^ Output space: order space; elements of the output space are called ranks. We can compare whether two elements are equal or, if not, which one is to be preferred.
Examples:
  Arrangement of WEB pages such that the most relevant pages (according to a query) are ranked highest.
Unsupervised Learning
Data: a sample of objects without associated target values
Task: extract some structure or regularity from the experimental data
Features:
  A concise description of the data could be a set of clusters or a probability density stating how likely it is to observe a certain object in the future.
Examples:
  Image and text segmentation
  Novelty detection in process control
  Grouping of customers in a company
  Alignment in molecular biology
Reinforcement Learning
Data: state-action-reward triples
Task: find a concise description of the data in the form of a strategy or policy (what to do) that maximizes the expected reward over time.
Features:
^^ The output of the system is a sequence of actions.
^^ A single action is not important; what matters is the strategy or policy, i.e., the sequence of correct actions needed to reach the goal.
^^ No optimal action exists in isolation in a given intermediate state; an action is good if it is part of a good policy.
^^ The learning algorithm should be able to assess the goodness of policies and identify a sequence of actions, learned from the past, that maximizes the expected reward over time.
Examples:
  Game playing, e.g., playing chess
  Robot navigation in search of a goal location
Supervised Learning
[Diagram: in the training phase, patterns (input data) and targets (output data) from the training dataset are fed to the learning system, yielding a trained system and a training error; in the testing phase, patterns from the testing dataset are fed to the trained system, and the predicted outputs are compared with the targets of the testing dataset to give the testing error (generalization error).]
Successful Learning
Quality of the training data: accuracy, integrity, consistency, no redundancy, timeliness, comprehensibility, completeness
Feature extraction: feature selection + feature composition
Model selection: choice of the learning machine and determination of the machine parameters
Algorithm utilized to train the learning machine
Training and Generalization
A small training error does not imply good generalization to previously unseen data.
A learning machine with too high a capacity typically leads to the very undesirable effect of overfitting.
A learning machine with too low a capacity typically leads to the very undesirable effect of underfitting.
Any learning machine learns almost nothing from too few training examples.
Evolutionary Computation (EC)
Techniques:
  Genetic Programming (GP)
  Evolution Strategy (ES)
  Evolutionary Programming (EP)
  Genetic Algorithm (GA)
  Particle Swarm Optimization (PSO)
Spirit of Soft Computing:
Law of Sufficiency: If a solution is good enough, fast enough, and cheap enough, then it is sufficient.
^^ In almost all real-world applications, we are looking for, and satisfied with, sufficient solutions.
^^ Hybrids of various soft-computing approaches with other computational intelligence tools such as neural networks are becoming more prevalent.
Brief History of Machine Learning and Soft Computing
Year(s) | Name(s) | Event
1936 | Fisher | Discriminant analysis
1943 | McCulloch, Pitts | First mathematical model of the artificial neuron
1958 | Rosenblatt | First model of a learning machine (perceptron) for classification; true beginning of the mathematical analysis of learning processes
1958 | Friedberg | Genetic Programming (GP)
1960 | Widrow, Hoff | Adaptive linear neuron (Adaline) for regression, using the delta learning rule
1962 | Novikoff | First (convergence) theorem about the perceptron
1962 | Holland | Genetic Algorithm (GA)
1963 | Tikhonov | Regularization method for solutions of ill-posed problems
1965 | Zadeh | Fuzzy mathematics
1965 | Rechenberg, Schwefel | Evolution Strategy (ES)
1966 | Fogel, Owens, Walsh | Evolutionary Programming (EP)
1969 | Minsky, Papert | Simple biologically motivated learning systems (perceptrons) are incapable of learning arbitrarily complex problems (negative result)
1971 | Vapnik, Chervonenkis | Statistical learning theory
1982 | Hopfield | Hopfield network
1982 | Vapnik | Introduction of regularization theory into machine learning
1986 | Rumelhart, Hinton, Williams, Le Cun | Error back-propagation algorithm (generalized delta learning rule) for multi-layer neural networks (direct generalization of perceptrons)
1988 | Chua, Yang | Cellular Neural Network (CNN)
1989 | Poggio, Girosi | Radial Basis Function Network (RBFN)
1989~1991 | Goldberg, Davis | Popularization of the genetic algorithm
1991 | Koza | Improvement of genetic programming
1992 | Vapnik | Support Vector Machine (SVM)
1995 | Kennedy, Eberhart | Particle Swarm Optimization (PSO)
A Simple Binary Classification Algorithm: Nearest Mean Classifier
[Figure: clusters of positive and negative training points with class means $c_+$ and $c_-$; the decision surface is the hyperplane with normal vector $w$ that perpendicularly bisects the line segment joining the two means.]
Let $X \subseteq \mathbb{R}^n$ and $Y := \{+1, -1\}$.
Data (training set): $S := \{(x_q, d_q)\}_{q=1}^{l} \subseteq X \times Y$
Basic idea:
Assign an unseen pattern to the class with the closer mean.
Step 1: First compute the means of the two classes. Define
$I_S^+ := \{q \in l : d_q = +1\}$, $I_S^- := \{j \in l : d_j = -1\}$,
$c_+ := \frac{1}{m_+} \sum_{q \in I_S^+} x_q$, $c_- := \frac{1}{m_-} \sum_{j \in I_S^-} x_j$,
$m_+$: number of examples with positive labels; $m_-$: number of examples with negative labels.
Step 2: Assign a new point $x \in X$ to the class whose mean is closest.
Derivation:
$\|x - c_-\|^2 - \|x - c_+\|^2 = \langle x - c_-, x - c_- \rangle - \langle x - c_+, x - c_+ \rangle = 2\left[\langle x, c_+ - c_- \rangle - \tfrac{1}{2}\left(\|c_+\|^2 - \|c_-\|^2\right)\right]$.
Define
$w := c_+ - c_-$, $b := -\tfrac{1}{2}\left(\|c_+\|^2 - \|c_-\|^2\right)$.
Decision function:
$g(x) = \operatorname{sgn}[\langle w, x \rangle + b]$, $x \in X$.
Discriminant function:
$f(x) := \langle w, x \rangle + b$.
Decision surface: a hyperplane in $\mathbb{R}^n$ with normal vector $w$ and bias $b$:
$f(x) = \langle w, x \rangle + b = 0$.
Discriminant function in terms of the input patterns:
$f(x) = \frac{1}{m_+} \sum_{q \in I_S^+} \langle x, x_q \rangle - \frac{1}{m_-} \sum_{j \in I_S^-} \langle x, x_j \rangle + b$,
where
$b = \frac{1}{2}\left[\frac{1}{m_-^2} \sum_{j, q \in I_S^-} \langle x_j, x_q \rangle - \frac{1}{m_+^2} \sum_{j, q \in I_S^+} \langle x_j, x_q \rangle\right]$.
^^ The discriminant function involves only inner products of the input data.
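As a concrete illustration, here is a minimal NumPy sketch of the nearest mean classifier; the function names and the toy data are our own additions, not part of the slides.

```python
import numpy as np

def nearest_mean_fit(X, d):
    """Compute (w, b) of the nearest mean classifier from patterns X (l x n) and labels d in {+1, -1}."""
    c_plus = X[d == +1].mean(axis=0)    # mean of the positive class
    c_minus = X[d == -1].mean(axis=0)   # mean of the negative class
    w = c_plus - c_minus
    b = -0.5 * (np.dot(c_plus, c_plus) - np.dot(c_minus, c_minus))
    return w, b

def nearest_mean_predict(X, w, b):
    """Decision function g(x) = sgn(<w, x> + b)."""
    return np.sign(X @ w + b)

# toy example (assumed data, for illustration only)
X = np.array([[2.0, 2.0], [3.0, 2.5], [-1.0, -1.5], [-2.0, -1.0]])
d = np.array([+1, +1, -1, -1])
w, b = nearest_mean_fit(X, d)
print(nearest_mean_predict(X, w, b))   # [ 1.  1. -1. -1.]
```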
Hyperplane
A hyperplane $H_{w,b}$ with normal vector $w$ and bias $b$:
$w_1 x_1 + w_2 x_2 + \dots + w_n x_n + b = \langle w, x \rangle + b = 0$,
$x := [x_1 \dots x_n]^T$, $w := [w_1 \dots w_n]^T \in \mathbb{R}^n$.
Define
$f_{w,b}(x) := w_1 x_1 + w_2 x_2 + \dots + w_n x_n + b = \langle w, x \rangle + b$.
⇒ $H_{w,b} := \{x \in \mathbb{R}^n : f_{w,b}(x) = \langle w, x \rangle + b = 0\}$
Define
$g_{w,b}(x) := \|w\|^{-1} f_{w,b}(x)$, $x \in \mathbb{R}^n$.
⇒ $H_{w,b} = \{x \in \mathbb{R}^n : g_{w,b}(x) = \langle \|w\|^{-1} w, x \rangle + \|w\|^{-1} b = 0\}$
^^ Note that $w$ is a normal vector perpendicular to $H_{w,b}$, while varying the value of $b$ moves the hyperplane parallel to itself.
^^ The hyperplane thus defined is an affine subspace (linear manifold) of dimension $n - 1$. It divides $\mathbb{R}^n$ into two half-spaces.
Linear Classification
Let $X \subseteq \mathbb{R}^n$ and $Y := \{+1, -1\}$.
Training examples: $S := \{(x_q, d_q)\}_{q=1}^{l} \subseteq X \times Y$
Definition: The training set $S$ is said to be linearly separable if there is a hyperplane that correctly classifies the training data.
Convention: For a given hyperplane $(w, b)$, recall
$f_{w,b}(x) := \langle w, x \rangle + b$, $x \in \mathbb{R}^n$.
If the hyperplane $(w, b)$ correctly classifies the training set, then, by convention, we assign
$d_i = +1$ if $f_{w,b}(x_i) \ge 0$; $d_i = -1$ if $f_{w,b}(x_i) < 0$.
Note: we desire "similar patterns ⇒ similar classes"! By the Cauchy-Schwarz inequality,
$|f_{w,b}(x_i) - f_{w,b}(x_j)| = |\langle w, x_i - x_j \rangle| \le \|w\| \cdot \|x_i - x_j\|$;
i.e., whenever two data points are close (small $\|x_i - x_j\|$), their difference in the real-valued output of a hypothesis is also small.
Definition:
^^ functional margin of $(x_q, d_q)$ w.r.t. $(w, b)$:
$\mu_q(w, b) := d_q \cdot [\langle w, x_q \rangle + b] = d_q \cdot f_{w,b}(x_q)$
^^ geometric margin of $(x_q, d_q)$ w.r.t. $(w, b)$:
$\eta_q(w, b) := d_q \cdot [\langle \|w\|^{-1} w, x_q \rangle + \|w\|^{-1} b] = d_q \cdot g_{w,b}(x_q)$
^^ $\mu_q > 0$ (or $\eta_q > 0$): correct classification of $(x_q, d_q)$
^^ In general, $\mu_q$ (or $\eta_q$) may be negative.
^^ functional margin of $(w, b)$ w.r.t. $S$:
$\mu_S(w, b) := \min_{q=1}^{l} \mu_q(w, b)$
^^ geometric margin of $(w, b)$ w.r.t. $S$:
$\eta_S(w, b) := \min_{q=1}^{l} \eta_q(w, b)$
^^ margin of a training set $S$: the maximum geometric margin over all hyperplanes:
$\gamma_S := \max_{w,b} \min_{q=1}^{l} d_q \cdot g_{w,b}(x_q) = \max_{w,b} \min_{q=1}^{l} d_q \cdot [\langle \|w\|^{-1} w, x_q \rangle + \|w\|^{-1} b]$
^^ A hyperplane realizing this maximum is called a maximal margin hyperplane or optimal hyperplane.
^^ The margin of a linearly separable training set is positive.
Robustness property of the maximal margin hyperplane
Fact: The training set $S$ is linearly separable if and only if there exist a vector $w^* \in \mathbb{R}^n$ with $\|w^*\| = 1$, a number $b^* \in \mathbb{R}$, and a positive number $\gamma > 0$ such that $d_q \cdot [\langle w^*, x_q \rangle + b^*] \ge \gamma > 0$ for all $q \in l$. In this case, we have $\gamma_S \ge \eta_S(w^*, b^*) \ge \gamma > 0$, i.e., the margin of the training set $S$ is at least $\gamma$.
[Figure: linearly separable training points with the maximal margin hyperplane and its margin.]
Single-layer Neural Networks
Perceptron: for classification
$x$: input of the network; $d$: desired output of the network; $u$: input to the neuron; $y$: output of the neuron; $e$: error;
$f_o(\cdot)$: activation function of the neuron (sign function or hard-limiting function)
Define
$z := \begin{bmatrix} x \\ 1 \end{bmatrix} \in \mathbb{R}^{n+1}$, $x_{n+1} = 1$, $\beta := \begin{bmatrix} w \\ b \end{bmatrix} \in \mathbb{R}^{n+1}$, $w_{n+1} = b$.
⇒
$u = w_1 x_1 + \dots + w_n x_n + b = \langle w, x \rangle + b = \langle \beta, z \rangle$,
$y = f_o(u) = \operatorname{sign}(u)$,
$e = d - y$.
Primal Form of Rosenblatt's Algorithm
Data: training set $S := \{(x_q, d_q)\}_{q=1}^{l} \subseteq X \times Y$; learning rate $\eta > 0$
Goal: $(w, b)$ defining a linear discriminant function that correctly classifies the training set
Step 1: $w_0 \leftarrow 0$; $b_0 \leftarrow 0$; $k \leftarrow 0$;
Step 2: Choose $R \ge U := \max_{q=1}^{l} \|x_q\|$;
Step 3:
repeat
  for $q = 1$ to $l$
    if $d_q \cdot [\langle w, x_q \rangle + b] \le 0$, then
      $w \leftarrow w + \eta d_q x_q$; $b \leftarrow b + \eta d_q R^2$; $k \leftarrow k + 1$;
    end if
  end for
until no misclassification within the for loop
return $k$, $(w_k, b_k)$; $k$: number of mistakes
^^ In case the training set is not linearly separable, the algorithm will not converge.
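A minimal NumPy sketch of the primal perceptron updates, following the steps above; the function name and the epoch cap max_epochs are our additions.

```python
import numpy as np

def perceptron_primal(X, d, eta=0.1, max_epochs=100):
    """Primal Rosenblatt algorithm: X is l x n, d has entries in {+1, -1}."""
    l, n = X.shape
    w, b, k = np.zeros(n), 0.0, 0
    R = np.max(np.linalg.norm(X, axis=1))      # R >= max_q ||x_q||
    for _ in range(max_epochs):                # cap the number of passes
        mistakes_this_pass = 0
        for q in range(l):
            if d[q] * (X[q] @ w + b) <= 0:     # misclassified (or on the boundary)
                w += eta * d[q] * X[q]
                b += eta * d[q] * R**2
                k += 1
                mistakes_this_pass += 1
        if mistakes_this_pass == 0:            # no misclassification in a full pass
            break
    return w, b, k
```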
Geometric Interpretation
In this figure, $x_q$ (of the positive class) is misclassified by the current linear classifier having normal vector $w_k$. The update step then changes $w_k$ into $w_{k+1} = w_k + \eta d_q x_q$ (with $\eta = 1$ in the figure), so that $d_q x_q$ "attracts" the hyperplane. After this step, the misclassified point $x_q$ is correctly classified. Thus, geometrically, the perceptron algorithm performs a walk through the primal parameter space, with each step made in the direction of decreasing training error.
Novikoff Theorem: Suppose $S$ is a nontrivial training set and there exist a vector $w^* \in \mathbb{R}^n$ with $\|w^*\| = 1$, a number $b^* \in \mathbb{R}$, and a positive number $\gamma > 0$ such that $d_q \cdot [\langle w^*, x_q \rangle + b^*] \ge \gamma > 0$ for all $q \in l$. Then the number of mistakes made by the on-line perceptron algorithm on the training set $S$ is at most $(2R/\gamma)^2$.
^^ The Novikoff Theorem was one of the first theoretical justifications of the idea that large margins yield better classifiers, here in terms of the number of mistakes made during learning.
Dual Form of Rosenblatt's Algorithm
In the primal form of Rosenblatt's algorithm starting from $w_0 = 0$, the final weight is of the form
$w = \sum_{q=1}^{l} \eta \alpha_q d_q x_q$,
where $\alpha_q$, $q \in l$, is the number of mistakes made when using $(x_q, d_q)$ as the training example. Then we have
$f_{w,b}(x) := \langle w, x \rangle + b = \sum_{j=1}^{l} \eta \alpha_j d_j \langle x_j, x \rangle + b$,
$\mu_q := d_q \cdot f_{w,b}(x_q) = d_q \cdot [\langle w, x_q \rangle + b] = d_q \cdot \left[\sum_{j=1}^{l} \eta \alpha_j d_j \langle x_j, x_q \rangle + b\right]$.
This means that the decision rule can be evaluated using just the inner products between the test point $x$ and the training points $x_j$, i.e., $\langle x_j, x \rangle$.
Rosenblatt's Algorithm (dual form)
Data: training set $S := \{(x_q, d_q)\}_{q=1}^{l} \subseteq X \times Y$; learning rate $\eta > 0$
Goal: $(\alpha, b)$ defining a linear discriminant function that correctly classifies the training set
Step 1: $\alpha \leftarrow 0$; $b \leftarrow 0$;
Step 2: Choose $R \ge U := \max_{q=1}^{l} \|x_q\|$;
Step 3:
repeat
  for $q = 1$ to $l$
    if $d_q \cdot \left[\sum_{j=1}^{l} \eta \alpha_j d_j \langle x_j, x_q \rangle + b\right] \le 0$, then
      $\alpha_q \leftarrow \alpha_q + 1$; $b \leftarrow b + \eta d_q R^2$;
    end if
  end for
until no misclassification within the for loop
return $(\alpha, b)$
^^ The training data enter the algorithm only through the entries of the Gram matrix $G := [\langle x_i, x_j \rangle] \in \mathbb{R}^{l \times l}$.
^^ In the preceding algorithm, the integer
$\|\alpha\|_1 := \alpha_1 + \alpha_2 + \dots + \alpha_l$
is equal to the number of mistakes. By the Novikoff Theorem, we have $\|\alpha\|_1 \le (2R/\gamma)^2$.
Linearly Inseparable Data
Three strategies:
^^ Nonlinearly transform the data to another space. If the transformed data are linearly separable in the new space, then we may apply the techniques for linearly separable data.
⇒ linear classifiers in the new space; nonlinear classifiers in the original space
^^ Allow some misclassifications of the original data, but remain in the original space.
⇒ linear classifiers in the original space
^^ Nonlinearly transform the data to another space and also allow some misclassifications of the transformed data.
⇒ linear classifiers in the new space; nonlinear classifiers in the original space
Rosenblatt's Algorithms for Nonlinear Classifiers
Example: (NXOR problem)
pattern (x1, x2) | label d
(0, 0) | +1
(0, 1) | -1
(1, 0) | -1
(1, 1) | +1
⇒ not linearly separable in the input space $\mathbb{R}^2$
[Figure: a nonlinear map φ takes the data, which are not linearly separable in the input space, to a feature space in which they are linearly separable.]
Define a nonlinear map $\phi: \mathbb{R}^2 \to \mathbb{R}^3$ as
$\phi(x_1, x_2) := [\phi_1(x_1, x_2),\ \phi_2(x_1, x_2),\ \phi_3(x_1, x_2)]^T = [x_1,\ x_2,\ (x_1 - x_2)^2]^T$.
⇒
transformed pattern (φ1, φ2, φ3) | label d
(0, 0, 0) | +1
(0, 1, 1) | -1
(1, 0, 1) | -1
(1, 1, 0) | +1
⇒ linearly separable in $\mathbb{R}^3$
infinitely many suitable discriminant functions:
$f(x) = w_1 \phi_1(x) + w_2 \phi_2(x) + w_3 \phi_3(x) + b = w_1 x_1 + w_2 x_2 + w_3 (x_1 - x_2)^2 + b$
^^ nonlinear discriminant functions in $\mathbb{R}^2$
[Figure: a two-layer network with inputs $x_1$, $x_2$, hidden units, a bias unit, and output $y = f_o(\cdot)$, realizing such a nonlinear discriminant function; a contour plot of one particular $f$ shows the decision boundary $f(x) = 0$, with bold arrows indicating the region where $f(x) > 0$.]
Two interesting observations:
^^ By nonlinearly transforming the linearly inseparable training examples to a feature space, the transformed training examples become linearly separable in the feature space, and the final discriminant function is a nonlinear function in the original input space.
^^ Instead of the original single-layer network, we now have a two-layer network with a hidden layer. One of the activation functions of the hidden neurons is nonlinear.
The whole idea of the preceding example is best seen from the dual representation of the discriminant function. Before doing this, let us state the following important definition.
Kernel
Definition: Let $(F, \langle \cdot, \cdot \rangle)$, called the feature space, be a real inner product space and $X \subseteq \mathbb{R}^n$. A kernel is a real-valued function on $X \times X$ such that
$K(x, z) := \langle \phi(x), \phi(z) \rangle$, $x, z \in X$,
where $\phi$, called the feature map, is a mapping from $X$ to $F$.
^^ The idea of a kernel generalizes the standard inner product in $\mathbb{R}^n$: taking $\phi = I$, the identity map, gives
$K(x, z) := \langle x, z \rangle = x^T z$.
^^ Very often, it is more practical to define the kernel function directly and then specify the corresponding feature map.
^^ Some popular kernels are, for $x, z \in \mathbb{R}^n$,
polynomial kernel:
$K(x, z) := (\langle x, z \rangle + c)^d = (x^T z + c)^d$, $c \ge 0$, $d \ge 2$;
Gaussian kernel:
$K(x, z) := \exp(-\sigma^{-2} \|x - z\|^2)$;
Mahalanobis kernel:
$K(x, z) := \exp[-\sigma_1^{-2} (x_1 - z_1)^2 - \dots - \sigma_n^{-2} (x_n - z_n)^2]$.
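The three kernels above are easy to write down directly; the following NumPy sketch (our own helper names) evaluates them for two pattern matrices.

```python
import numpy as np

def polynomial_kernel(X, Z, c=1.0, degree=2):
    """K(x, z) = (<x, z> + c)^d, computed for all pairs of rows of X and Z."""
    return (X @ Z.T + c) ** degree

def gaussian_kernel(X, Z, sigma=1.0):
    """K(x, z) = exp(-||x - z||^2 / sigma^2)."""
    sq_dists = np.sum((X[:, None, :] - Z[None, :, :]) ** 2, axis=2)
    return np.exp(-sq_dists / sigma**2)

def mahalanobis_kernel(X, Z, sigmas):
    """K(x, z) = exp(-sum_i (x_i - z_i)^2 / sigma_i^2), one sigma per input dimension."""
    diffs = (X[:, None, :] - Z[None, :, :]) ** 2 / np.asarray(sigmas) ** 2
    return np.exp(-np.sum(diffs, axis=2))
```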
Important observation: In the dual form of the Rosenblatt algorithms,
the training data enter the algorithm through $\langle x_j, x_i \rangle$;
the final discriminant function depends only upon $\langle x_j, x \rangle$;
⇒ map $x$ in the input space into $\phi(x)$ in the feature space
⇒ $\langle \phi(x_j), \phi(x_i) \rangle = K(x_j, x_i)$, $\langle \phi(x_j), \phi(x) \rangle = K(x_j, x)$.
Since $\|\phi(x_i)\|^2 = \langle \phi(x_i), \phi(x_i) \rangle = K(x_i, x_i)$,
⇒ choose $R \ge U := \max_{i=1}^{l} \sqrt{K(x_i, x_i)}$.
nonlinear discriminant function:
$f_{w,b}(x) = \sum_{j=1}^{l} \eta \alpha_j d_j \langle \phi(x_j), \phi(x) \rangle + b = \sum_{j=1}^{l} \eta \alpha_j d_j K(x_j, x) + b$,
and the functional margin of the qth example becomes
$\mu_q := d_q \cdot f_{w,b}(x_q) = d_q \cdot \left[\sum_{j=1}^{l} \eta \alpha_j d_j \langle \phi(x_j), \phi(x_q) \rangle + b\right] = d_q \cdot \left[\sum_{j=1}^{l} \eta \alpha_j d_j K(x_j, x_q) + b\right]$.
Nonlinear Rosenblatt's Algorithm
Data: training set $S := \{(x_q, d_q)\}_{q=1}^{l} \subseteq X \times Y$; learning rate $\eta > 0$
Goal: $(\alpha, b)$ defining a nonlinear discriminant function that correctly classifies the training set
Step 1: $\alpha \leftarrow 0$; $b \leftarrow 0$;
Step 2: Choose $R \ge U := \max_{q=1}^{l} \sqrt{K(x_q, x_q)}$;
Step 3:
repeat
  for $q = 1$ to $l$
    if $d_q \cdot \left[\sum_{j=1}^{l} \eta \alpha_j d_j K(x_j, x_q) + b\right] \le 0$, then
      $\alpha_q \leftarrow \alpha_q + 1$; $b \leftarrow b + \eta d_q R^2$;
    end if
  end for
until no misclassification within the for loop
return $(\alpha, b)$
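A minimal kernel-perceptron sketch in NumPy following the dual-form steps above; the Gaussian kernel, the epoch cap, and the toy usage on the NXOR data are our illustrative choices.

```python
import numpy as np

def kernel_perceptron(X, d, kernel, eta=0.1, max_epochs=100):
    """Nonlinear (dual) Rosenblatt algorithm. `kernel(X, Z)` returns the Gram matrix."""
    l = X.shape[0]
    G = kernel(X, X)                        # data enter only through the Gram matrix
    alpha, b = np.zeros(l), 0.0
    R2 = np.max(np.diag(G))                 # R^2 >= max_q K(x_q, x_q)
    for _ in range(max_epochs):
        mistakes = 0
        for q in range(l):
            if d[q] * (np.sum(eta * alpha * d * G[:, q]) + b) <= 0:
                alpha[q] += 1
                b += eta * d[q] * R2
                mistakes += 1
        if mistakes == 0:
            break
    return alpha, b

def kernel_predict(Xtr, d, alpha, b, kernel, Xte, eta=0.1):
    """Discriminant f(x) = sum_j eta*alpha_j*d_j*K(x_j, x) + b on test patterns."""
    return np.sum(eta * alpha[:, None] * d[:, None] * kernel(Xtr, Xte), axis=0) + b

# usage on the NXOR data with a Gaussian kernel (illustrative)
gauss = lambda X, Z: np.exp(-np.sum((X[:, None, :] - Z[None, :, :])**2, axis=2))
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
d = np.array([+1., -1., -1., +1.])
alpha, b = kernel_perceptron(X, d, gauss)
print(np.sign(kernel_predict(X, d, alpha, b, gauss, X)))   # [ 1. -1. -1.  1.]
```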
Example: (NXOR problem)
polynomial kernel: $K(x, z) = (x_1 z_1 + x_2 z_2 + c)^2$
[Figure: contour plots and decision boundaries of the resulting discriminant function for c = 1.5, c = 0.8, c = 0.2, and c = 0.0.]
Linear Regression
Let $X \subseteq \mathbb{R}^n$ and $Y \subseteq \mathbb{R}$.
Training examples: $S := \{(x_q, d_q)\}_{q=1}^{l} \subseteq X \times Y$
Linear regression: find a linear function $f$ that models the data: $f(x) = \langle w, x \rangle + b$.
[Figure: scatter plot of one-dimensional data with a fitted regression line.]
Least Squares Linear Regressor: Choose a $(w, b)$ that minimizes
$E(w, b) := \sum_{q=1}^{l} [d_q - \langle w, x_q \rangle - b]^2$.
Adaline: Adaptive linear neuron (for regression and classification)
$x$: input of the network; $d$: desired output of the network; $u$: input to the neuron; $y$: output of the neuron; $e$: error;
$f_o(\cdot)$: activation function of the neuron (identity function)
Define
$z := \begin{bmatrix} x \\ 1 \end{bmatrix} \in \mathbb{R}^{n+1}$, $x_{n+1} = 1$, $\beta := \begin{bmatrix} w \\ b \end{bmatrix} \in \mathbb{R}^{n+1}$, $w_{n+1} = b$.
⇒
$u = w_1 x_1 + \dots + w_n x_n + b = \langle w, x \rangle + b = \langle \beta, z \rangle$,
$y = f_o(u) = u$,
$e = d - y$.
Widrow-Hoff Algorithm (primal form) (delta learning rule)
qth training example $(x_q, d_q)$ with error
$E_q(w, b) := \tfrac{1}{2} (d_q - y_q)^2 = \tfrac{1}{2} [d_q - \langle w, x_q \rangle - b]^2$
Learning rules:
$\delta_o \leftarrow -\dfrac{\partial E_q}{\partial u} = e$,
$w \leftarrow w - \eta \dfrac{\partial E_q}{\partial w} = w + \eta \delta_o x_q$,
$b \leftarrow b - \eta \dfrac{\partial E_q}{\partial b} = b + \eta \delta_o$.
Data: training set $S := \{(x_q, d_q)\}_{q=1}^{l} \subseteq X \times Y$; learning rate $\eta > 0$
Goal: $(w, b)$ defining a linear predictive function minimizing the sum of squared errors
Step 1: $w \leftarrow 0$; $b \leftarrow 0$;
Step 2:
repeat
  for $q = 1$ to $l$
    $e \leftarrow d_q - \langle w, x_q \rangle - b$;
    $w \leftarrow w + \eta e x_q$; $b \leftarrow b + \eta e$;
  end for
until convergence criterion satisfied
return $(w, b)$
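A short NumPy sketch of the primal Widrow-Hoff (delta) rule above; the convergence test on the epoch-wise squared error is our own choice.

```python
import numpy as np

def widrow_hoff(X, d, eta=0.01, max_epochs=1000, tol=1e-8):
    """Primal Widrow-Hoff (delta rule): stochastic gradient descent on the squared error."""
    l, n = X.shape
    w, b = np.zeros(n), 0.0
    prev_sse = np.inf
    for _ in range(max_epochs):
        sse = 0.0
        for q in range(l):
            e = d[q] - (X[q] @ w + b)   # prediction error on the qth example
            w += eta * e * X[q]
            b += eta * e
            sse += e**2
        if abs(prev_sse - sse) < tol:   # simple convergence criterion (our addition)
            break
        prev_sse = sse
    return w, b
```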
Dual Form of the Widrow-Hoff Algorithm
In the primal Widrow-Hoff algorithm with $w_0 = 0$, the final weight has the form
$w = \sum_{q=1}^{l} \eta \alpha_q x_q$,
where $\alpha_q$, $q \in l$, is the accumulated prediction error obtained when using $(x_q, d_q)$ as the training example. Then we have
$f_{w,b}(x) := \langle w, x \rangle + b = \sum_{j=1}^{l} \eta \alpha_j \langle x_j, x \rangle + b$,
$e_q := d_q - f_{w,b}(x_q) = d_q - \sum_{j=1}^{l} \eta \alpha_j \langle x_j, x_q \rangle - b$.
Data: training set $S := \{(x_q, d_q)\}_{q=1}^{l} \subseteq X \times Y$; learning rate $\eta > 0$
Goal: $(\alpha, b)$ defining a linear predictive function minimizing the sum of squared errors
Step 1: $\alpha \leftarrow 0$; $b \leftarrow 0$;
Step 2:
repeat
  for $q = 1$ to $l$
    $e \leftarrow d_q - \sum_{j=1}^{l} \eta \alpha_j \langle x_j, x_q \rangle - b$;
    $\alpha_q \leftarrow \alpha_q + e$; $b \leftarrow b + \eta e$;
  end for
until convergence criterion satisfied
return $(\alpha, b)$
Widrow-Hoff Algorithms for Nonlinear Regressors
Important observation: In the dual form of the Widrow-Hoff algorithm,
the training data enter the algorithm through $\langle x_j, x_i \rangle$;
the final predictive function depends only upon $\langle x_j, x \rangle$;
⇒ map $x$ in the input space into $\phi(x)$ in the feature space
⇒ $\langle \phi(x_j), \phi(x_i) \rangle = K(x_j, x_i)$, $\langle \phi(x_j), \phi(x) \rangle = K(x_j, x)$.
nonlinear predictive function:
$f_{w,b}(x) = \sum_{j=1}^{l} \eta \alpha_j K(x_j, x) + b$
qth error:
$e_q := d_q - f_{w,b}(x_q) = d_q - \sum_{j=1}^{l} \eta \alpha_j K(x_j, x_q) - b$
[Figure: the feature map φ takes the input space to the feature space.]
Nonlinear Widrow-Hoff Algorithm
Data: training set $S := \{(x_q, d_q)\}_{q=1}^{l} \subseteq X \times Y$; learning rate $\eta > 0$
Goal: $(\alpha, b)$ defining a nonlinear predictive function minimizing the sum of squared errors
Step 1: $\alpha \leftarrow 0$; $b \leftarrow 0$;
Step 2:
repeat
  for $q = 1$ to $l$
    $e \leftarrow d_q - \sum_{j=1}^{l} \eta \alpha_j K(x_j, x_q) - b$;
    $\alpha_q \leftarrow \alpha_q + e$; $b \leftarrow b + \eta e$;
  end for
until convergence criterion satisfied
return $(\alpha, b)$
Multi-layer Neural Networks
^^ Artificial Neural Networks
^^ Generalized Radial Basis Function Networks
^^ Fuzzy Neural Networks
^^ Support Vector Machines
Crucial property for the success of a class of learning machines
$M := \{f(x; \theta) : \theta \in \Omega\}$: a class of learning machines
Universal Approximation Property
Given any continuous function $g(x)$ defined on a compact set $U \subseteq \mathbb{R}^n$ and any positive constant $\varepsilon > 0$, no matter how small, there is a learning machine $f_\varepsilon \in M$ such that
$\sup_{x \in U} |f_\varepsilon(x) - g(x)| \le \varepsilon$.
^^ All four classes of learning machines are universal approximators.
^^ This fact is usually proved via the famous Stone-Weierstrass Theorem from mathematical analysis.
⇒ Usually a non-constructive existence result only.
Artificial Neural Networks
[Figure: a feedforward network with input nodes $z_1, \dots, z_{n+1}$, hidden nodes with activations $f_{h1}(u_1), \dots, f_{hm}(u_m)$, output nodes with activations $f_{o1}(s_1), \dots, f_{op}(s_p)$, input-to-hidden weights $v_{ij}$, and hidden-to-output weights $w_{jk}$.]
1 input layer with $n + 1$ nodes
1 hidden layer with $m + 1$ nodes
1 output layer with $p$ nodes
input vector:
$x := [x_1 \dots x_n]^T \in \mathbb{R}^n$, or $z := [z_1 \dots z_n\ z_{n+1}]^T = [x_1 \dots x_n\ 1]^T \in \mathbb{R}^{n+1}$.
output vector:
$y := [y_1 \dots y_p]^T \in \mathbb{R}^p$.
$v_{ij}$: connection weight from the ith input node to the input of the jth hidden node
$w_{jk}$: connection weight from the output of the jth hidden node to the input of the kth output node
$f_{hj}$: activation function of the jth hidden node (sigmoidal functions)
$f_{ok}$: activation function of the kth output node (sigmoidal functions for classification; linear functions with unit slope for regression)
input and output of the jth hidden node:
$u_j = \sum_{i=1}^{n+1} v_{ij} z_i$, $r_j = f_{hj}(u_j)$, $z_{n+1} := 1$, $j \in m$.
input and output of the kth output node:
$s_k = \sum_{j=1}^{m+1} w_{jk} r_j$, $y_k = f_{ok}(s_k)$, $r_{m+1} := 1$, $k \in p$.
Back Propagation Algorithm (generalized delta learning rule)
^^ The error (or residual) $e_{qk}$ at the kth output node due to the qth example:
$e_{qk} := d_{qk} - y_{qk}$, $q \in l$, $k \in p$.
Goal: Choose weights that minimize the total sum of squared errors:
$E_{\text{total}} := \tfrac{1}{2} \sum_{q=1}^{l} \sum_{k=1}^{p} e_{qk}^2$.
^^ Sum of squared errors due to the qth example:
$E_q := \tfrac{1}{2} \sum_{k=1}^{p} e_{qk}^2$, $q \in l$ ⇒ $E_{\text{total}} = \sum_{q=1}^{l} E_q$.
^^ BP algorithm: the $E_q$'s are minimized in sequence:
$\delta^o_{qk} \leftarrow -\dfrac{\partial E_q}{\partial s_{qk}} = e_{qk} f'_{ok}(s_{qk})$,
$w_{jk} \leftarrow w_{jk} - \eta_w \dfrac{\partial E_q}{\partial w_{jk}} = w_{jk} + \eta_w \delta^o_{qk} r_{qj}$,
$\delta^h_{qj} \leftarrow -\dfrac{\partial E_q}{\partial u_{qj}} = \left[\sum_{k=1}^{p} \delta^o_{qk} w_{jk}\right] f'_{hj}(u_{qj})$,
$v_{ij} \leftarrow v_{ij} - \eta_v \dfrac{\partial E_q}{\partial v_{ij}} = v_{ij} + \eta_v \delta^h_{qj} z_{qi}$.
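To make these updates concrete, here is a compact NumPy sketch of one stochastic back-propagation step for a single-hidden-layer network with sigmoidal hidden units and linear outputs (a regression setting); the weight shapes, initialization scale, and learning rates are illustrative choices of ours.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def bp_step(x, d, V, W, eta_v=0.05, eta_w=0.05):
    """One back-propagation step on a single example (x, d).

    V: (n+1) x m input-to-hidden weights, W: (m+1) x p hidden-to-output weights.
    Hidden activations are sigmoidal; output activations are linear (regression).
    """
    z = np.append(x, 1.0)            # augmented input, z_{n+1} = 1
    u = z @ V                        # hidden-node inputs
    r = np.append(sigmoid(u), 1.0)   # hidden-node outputs, r_{m+1} = 1
    s = r @ W                        # output-node inputs
    y = s                            # linear output activation
    e = d - y                        # errors at the output nodes

    delta_o = e * 1.0                                            # e_k * f'_ok(s_k), f'_ok = 1 here
    delta_h = (W[:-1, :] @ delta_o) * r[:-1] * (1.0 - r[:-1])    # [sum_k delta_o_k w_jk] * f'_hj(u_j)

    W += eta_w * np.outer(r, delta_o)            # w_jk <- w_jk + eta_w * delta_o_k * r_j
    V += eta_v * np.outer(z, delta_h)            # v_ij <- v_ij + eta_v * delta_h_j * z_i
    return V, W

# usage with random weights (assumed sizes, for illustration)
rng = np.random.default_rng(0)
V = 0.1 * rng.standard_normal((3, 4))   # n = 2 inputs (+ bias), m = 4 hidden nodes
W = 0.1 * rng.standard_normal((5, 1))   # m = 4 hidden nodes (+ bias), p = 1 output
V, W = bp_step(np.array([0.5, -1.0]), np.array([0.2]), V, W)
```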
Key Observations
1. An artificial neural network is a cascade of two layers of interconnected generalized linear models.
^^ Each neuron implements a generalized linear model, where the link function is the inverse of the activation function.
2. The operation of the artificial neural network can be described in two steps:
^^ First, in a peculiar way, nonlinearly transform the input vector $z \in \mathbb{R}^{n+1}$ to the feature vector $r \in \mathbb{R}^m$, the outputs of the hidden nodes, with $\mathbb{R}^m$ treated as the feature space.
^^ Then perform generalized linear regression in the feature space to produce the output vector $y \in \mathbb{R}^p$ of the network.
Generalized Radial Basis Function Networks
Feedforward network:
1 input layer with $n$ nodes
1 hidden layer with $m$ nodes
1 output layer with $p$ nodes
input vector: $x := [x_1 \dots x_n]^T \in \mathbb{R}^n$
output vector: $y := [y_1 \dots y_p]^T \in \mathbb{R}^p$
predictive function $f$: a nonlinear map
$y_k = f_k(x) = f_{ok}\left(\sum_{j=1}^{m} w_{jk} \exp\left[-\sum_{i=1}^{n} \frac{(x_i - c_{ij})^2}{v_{ij}}\right]\right)$
$w_{jk}$: connection weight from the jth hidden node to the kth output node;
$c_j := [c_{1j}\ c_{2j} \dots c_{nj}]^T$: center of the jth basis function;
$v_{ij}$: the ith "variance" of the jth basis function, $v_{ij} := 2\sigma_{ij}^2 > 0$;
$f_{ok}$: activation function of the kth output node.
Define, for $i \in n$, $j \in m$, and $k \in p$,
$u_j = \sum_{i=1}^{n} \frac{(x_i - c_{ij})^2}{v_{ij}}$ (Mahalanobis distance),
$r_j = \exp(-u_j)$,
$s_k = \sum_{j=1}^{m} w_{jk} r_j$
⇒ $y_k = f_{ok}(s_k)$.
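A small NumPy sketch of the GRBFN forward pass just described, with an identity output activation; the array shapes are our assumptions.

```python
import numpy as np

def grbfn_forward(x, C, V, W):
    """Forward pass of a generalized RBF network.

    x: input vector of length n.
    C: n x m matrix of centers (column j is c_j).
    V: n x m matrix of positive "variances" v_ij.
    W: m x p matrix of hidden-to-output weights.
    The output activation f_ok is taken to be the identity here.
    """
    u = np.sum((x[:, None] - C) ** 2 / V, axis=0)   # Mahalanobis distances u_j
    r = np.exp(-u)                                  # hidden-node outputs r_j
    s = r @ W                                       # output-node inputs s_k
    return s                                        # y_k = f_ok(s_k) with identity f_ok
```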
Key Observations
1. A generalized radial basis function network is a cascade of a layer of generalized radial basis functions and a layer of generalized linear models.
^^ Each hidden neuron implements a generalized radial basis function, and each output neuron implements a generalized linear model, where the link function is the inverse of the output activation function.
2. The operation of the generalized radial basis function network can be described in two steps:
^^ First, in a peculiar way, nonlinearly transform the input vector $x \in \mathbb{R}^n$ to the feature vector $r \in \mathbb{R}^m$, the outputs of the hidden nodes, with $\mathbb{R}^m$ treated as the feature space.
^^ Then perform generalized linear regression in the feature space to produce the output vector $y \in \mathbb{R}^p$ of the network.
Back Propagation Algorithm (generalized delta learning rule)
^^ The error (or residual) $e_{qk}$ at the kth output node due to the qth example:
$e_{qk} := d_{qk} - y_{qk}$, $q \in l$, $k \in p$.
Goal: Choose weights that minimize the total sum of squared errors:
$E_{\text{total}} := \tfrac{1}{2} \sum_{q=1}^{l} \sum_{k=1}^{p} e_{qk}^2$.
^^ Sum of squared errors due to the qth example:
$E_q := \tfrac{1}{2} \sum_{k=1}^{p} e_{qk}^2$, $q \in l$ ⇒ $E_{\text{total}} = \sum_{q=1}^{l} E_q$.
^^ BP algorithm: the $E_q$'s are minimized in sequence:
$\delta^o_{qk} \leftarrow -\dfrac{\partial E_q}{\partial s_{qk}} = e_{qk} f'_{ok}(s_{qk})$,
$w_{jk} \leftarrow w_{jk} - \eta_w \dfrac{\partial E_q}{\partial w_{jk}} = w_{jk} + \eta_w \delta^o_{qk} r_{qj}$,
$\delta^h_{qj} \leftarrow \dfrac{\partial E_q}{\partial u_{qj}} = \left[\sum_{k=1}^{p} \delta^o_{qk} w_{jk}\right] r_{qj}$,
$c_{ij} \leftarrow c_{ij} - \eta_c \dfrac{\partial E_q}{\partial c_{ij}} = c_{ij} + \eta_c \delta^h_{qj} \cdot \dfrac{2 (x_{qi} - c_{ij})}{v_{ij}}$,
$v_{ij} \leftarrow v_{ij} - \eta_v \dfrac{\partial E_q}{\partial v_{ij}} = v_{ij} + \eta_v \delta^h_{qj} \cdot \dfrac{(x_{qi} - c_{ij})^2}{v_{ij}^2}$.
Fuzzy Neural Networks
Standard fuzzy system
Canonical fuzzy IF-THEN rules:
IF $x_1$ is $A_{1j}$ and $x_2$ is $A_{2j}$ and ... and $x_n$ is $A_{nj}$,
THEN $y_1$ is $B_{j1}$ and $y_2$ is $B_{j2}$ and ... and $y_p$ is $B_{jp}$.
[Diagram: a crisp $x$ in $U$ is fuzzified, processed by the fuzzy inference engine using the fuzzy rule base, and defuzzified to a crisp $y$ in $V$; the overall system is a crisp nonlinear map from $U$ to $V$.]
Fuzzy System as a Nonlinear Map
fuzzy system:
$n$ inputs: $x := [x_1 \dots x_n]^T \in \mathbb{R}^n$
$p$ outputs: $y := [y_1 \dots y_p]^T \in \mathbb{R}^p$
$m$ canonical fuzzy rules:
$\mu_{ij}(x_i)$: membership function of $A_{ij}$, $j \in m$, $i \in n$;
$w_{jk}$: center of the normal fuzzy set $B_{jk}$, $k \in p$, $j \in m$.
singleton fuzzifier + product inference engine + center average defuzzifier:
$y_k = f_k(x) = \dfrac{\sum_{j=1}^{m} w_{jk} \left[\prod_{i=1}^{n} \mu_{ij}(x_i)\right]}{\sum_{j=1}^{m} \left[\prod_{i=1}^{n} \mu_{ij}(x_i)\right]}$, $k \in p$.
singleton fuzzifier + minimum inference engine + center average defuzzifier:
$y_k = f_k(x) = \dfrac{\sum_{j=1}^{m} w_{jk} \left[\min_{i=1}^{n} \mu_{ij}(x_i)\right]}{\sum_{j=1}^{m} \left[\min_{i=1}^{n} \mu_{ij}(x_i)\right]}$, $k \in p$.
Note: the output $y_k$ is a convex combination of the $w_{jk}$.
Gaussian membership functions:
$y_k = f_k(x) = \dfrac{\sum_{j=1}^{m} w_{jk} \prod_{i=1}^{n} \exp[-(x_i - c_{ij})^2 / v_{ij}]}{\sum_{j=1}^{m} \prod_{i=1}^{n} \exp[-(x_i - c_{ij})^2 / v_{ij}]}$, $k \in p$,
$w_{jk}$: center of the kth fuzzy set $B_{jk}$ in the jth rule;
$c_{ij}$: center of the ith Gaussian fuzzy set $A_{ij}$ in the jth rule;
$v_{ij}$: "variance" of the ith Gaussian fuzzy set $A_{ij}$ in the jth rule; $v_{ij} := 2\sigma_{ij}^2 > 0$.
⇒
$y_k = f_k(x) = \dfrac{\sum_{j=1}^{m} w_{jk} \exp\left[-\sum_{i=1}^{n} (x_i - c_{ij})^2 / v_{ij}\right]}{\sum_{j=1}^{m} \exp\left[-\sum_{i=1}^{n} (x_i - c_{ij})^2 / v_{ij}\right]}$, $k \in p$.
Define, for $i \in n$, $j \in m$, and $k \in p$,
$u_j = \sum_{i=1}^{n} \frac{(x_i - c_{ij})^2}{v_{ij}}$ (Mahalanobis distance),
$r_j = \exp(-u_j)$,
$s_k = \sum_{j=1}^{m} w_{jk} r_j$, $g = \sum_{j=1}^{m} r_j$,
⇒ $y_k = f_{ok}(s_k / g)$.
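A NumPy sketch of the fuzzy system viewed as a crisp nonlinear map, with Gaussian memberships, either product or minimum inference, and center-average defuzzification; the array shapes are our assumptions.

```python
import numpy as np

def fuzzy_system_output(x, C, V, W, inference="product"):
    """Fuzzy system y_k = sum_j w_jk m_j / sum_j m_j with Gaussian memberships.

    x: input vector of length n.
    C, V: n x m centers and "variances" of the membership functions mu_ij.
    W: m x p matrix of rule-consequent centers w_jk.
    inference: "product" or "minimum" inference engine.
    """
    mu = np.exp(-((x[:, None] - C) ** 2) / V)      # mu_ij(x_i), an n x m array
    if inference == "product":
        m = np.prod(mu, axis=0)                    # product of memberships per rule
    else:
        m = np.min(mu, axis=0)                     # minimum of memberships per rule
    return (m @ W) / np.sum(m)                     # center-average defuzzification
```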
Fuzzy Neural Networks
[Figure: a feedforward fuzzy neural network with inputs $x_1, \dots, x_n$, hidden nodes $r_1, \dots, r_m$ parameterized by $(c_{ij}, v_{ij})$, a normalizing node $g$, and output nodes $y_k = f_{ok}(s_k / g)$ with weights $w_{jk}$.]
Feedforward network:
1 input layer with $n$ nodes
1 hidden layer with $m$ nodes
1 output layer with $p$ nodes
Back Propagation Algorithm (generalized delta learning rule)
^^ The error (or residual) $e_{qk}$ at the kth output node due to the qth example:
$e_{qk} := d_{qk} - y_{qk}$, $q \in l$, $k \in p$.
Goal: Choose weights that minimize the total sum of squared errors:
$E_{\text{total}} := \tfrac{1}{2} \sum_{q=1}^{l} \sum_{k=1}^{p} e_{qk}^2$.
^^ Sum of squared errors due to the qth example:
$E_q := \tfrac{1}{2} \sum_{k=1}^{p} e_{qk}^2$, $q \in l$ ⇒ $E_{\text{total}} = \sum_{q=1}^{l} E_q$.
^^ BP algorithm: the $E_q$'s are minimized in sequence:
$\delta^o_{qk} \leftarrow -\dfrac{\partial E_q}{\partial s_{qk}} = e_{qk} f'_{ok}\!\left(\dfrac{s_{qk}}{g_q}\right) \dfrac{1}{g_q}$,
$w_{jk} \leftarrow w_{jk} - \eta_w \dfrac{\partial E_q}{\partial w_{jk}} = w_{jk} + \eta_w \delta^o_{qk} r_{qj}$,
$\delta^h_{qj} \leftarrow \dfrac{\partial E_q}{\partial u_{qj}} = \left[\sum_{k=1}^{p} \delta^o_{qk} \left(w_{jk} - \dfrac{s_{qk}}{g_q}\right)\right] r_{qj}$,
$c_{ij} \leftarrow c_{ij} - \eta_c \dfrac{\partial E_q}{\partial c_{ij}} = c_{ij} + \eta_c \delta^h_{qj} \cdot \dfrac{2 (x_{qi} - c_{ij})}{v_{ij}}$,
$v_{ij} \leftarrow v_{ij} - \eta_v \dfrac{\partial E_q}{\partial v_{ij}} = v_{ij} + \eta_v \delta^h_{qj} \cdot \dfrac{(x_{qi} - c_{ij})^2}{v_{ij}^2}$.
Support Vector Machines
(Boser, Guyon, and Vapnik, 1992~)
Convex Optimization from Nonlinear Optimization Theory
+ Kernel Representation from Functional Analysis
+ Distribution-free Generalization Error Bounds from Statistical Learning Theory
Maximal Margin Classifier
Let $X \subseteq \mathbb{R}^n$ and $Y := \{+1, -1\}$.
Training examples: $S := \{(x_q, d_q)\}_{q=1}^{l} \subseteq X \times Y$
Define $I_S^+ := \{q \in l : d_q = +1\}$, $I_S^- := \{j \in l : d_j = -1\}$.
Fact: Assume the margin of a linearly separable training set $S$ is $\gamma_S > 0$. Then there exist $w^* \in \mathbb{R}^n$ with $\|w^*\| = 1$ and $b^* \in \mathbb{R}$, realizing the maximal margin hyperplane, such that
$d_q \cdot [\langle w^*, x_q \rangle + b^*] \ge \gamma_S > 0$ for all $q \in l$,
$\langle w^*, x_{j^*} \rangle + b^* = \gamma_S$ for some $j^* \in I_S^+$,
$\langle w^*, x_{k^*} \rangle + b^* = -\gamma_S$ for some $k^* \in I_S^-$.
Suppose there exist $w^* \in \mathbb{R}^n$ with $\|w^*\| = 1$, $b^* \in \mathbb{R}$, and $\gamma > 0$ such that
$d_q \cdot [\langle w^*, x_q \rangle + b^*] \ge \gamma > 0$ for all $q \in l$,
$d_j \cdot [\langle w^*, x_j \rangle + b^*] = \gamma$ for some $j \in l$.
⇒ the geometric margin of $H_{w^*, b^*}$ is $\gamma$.
Define
$w_0 := \gamma^{-1} w^*$, $b_0 := \gamma^{-1} b^*$
⇒ $d_q \cdot [\langle w_0, x_q \rangle + b_0] \ge 1$ for all $q \in l$,
$d_j \cdot [\langle w_0, x_j \rangle + b_0] = 1$ for some $j \in l$.
⇒ (a) $H_{w_0, b_0} = H_{w^*, b^*}$
(b) the functional margin of $H_{w_0, b_0}$ is 1 ($H_{w_0, b_0}$: canonical hyperplane)
(c) $\gamma = \|w_0\|^{-1}$.
Important observation:
maximization of the margin of a linearly separable training set
⇔ minimization of the Euclidean norm of the weight vector of the canonical hyperplane
Primal problem: (P0)
minimize $\tfrac{1}{2} w^T w$
subject to $d_q \cdot [\langle w, x_q \rangle + b] \ge 1$ for all $q \in l$.
⇒ a standard quadratic convex program
Suppose $(w^*, b^*)$ solves (P0).
⇒ optimal discriminant function: $f^*(x) = \langle w^*, x \rangle + b^*$
margin: $\gamma_S = \|w^*\|^{-1}$
Dual problem
Lagrangian:
$L(w, b, \alpha) := \tfrac{1}{2} w^T w - \sum_{q=1}^{l} \alpha_q \left( d_q [w^T x_q + b] - 1 \right)$
Vector of Lagrange multipliers:
$\alpha := [\alpha_1 \dots \alpha_l]^T \in \mathbb{R}^l$
Derivation:
$0 = \dfrac{\partial L}{\partial w} = w - \sum_{q=1}^{l} \alpha_q d_q x_q$, $0 = \dfrac{\partial L}{\partial b} = -\sum_{q=1}^{l} \alpha_q d_q$.
⇒ $w = \sum_{q=1}^{l} \alpha_q d_q x_q$, $\sum_{q=1}^{l} \alpha_q d_q = 0$.
⇒ $J_d(\alpha) := \min_{w, b} L(w, b, \alpha) = \sum_{q=1}^{l} \alpha_q - \tfrac{1}{2} \sum_{q=1}^{l} \sum_{j=1}^{l} \alpha_q \alpha_j d_q d_j x_q^T x_j$
Dual problem: (D0)
maximize $\sum_{q=1}^{l} \alpha_q - \tfrac{1}{2} \sum_{q=1}^{l} \sum_{j=1}^{l} \alpha_q \alpha_j d_q d_j \langle x_q, x_j \rangle$
subject to $\sum_{q=1}^{l} \alpha_q d_q = 0$ and $\alpha_q \ge 0$ for all $q \in l$.
⇒ a standard quadratic concave program
^^ The cost functional to be maximized depends on the input patterns only through a set of inner products, $\langle x_q, x_j \rangle$, $q, j \in l$.
^^ The relation $w^* = \sum_{q=1}^{l} \alpha_q^* d_q x_q$ shows that the hypothesis can be described as a linear combination of the training points.
optimal weight: $w^* = \sum_{q=1}^{l} \alpha_q^* d_q x_q$
margin: $\gamma_S = \|w^*\|^{-1}$
KKT complementarity conditions: for all $q \in l$,
$\alpha_q^* \left( d_q [\langle w^*, x_q \rangle + b^*] - 1 \right) = 0$,
$d_q [\langle w^*, x_q \rangle + b^*] - 1 \ge 0$,
$\alpha_q^* \ge 0$.
Define $I_{sv} := \{q \in l : \alpha_q^* > 0\}$.
optimal weight: $w^* = \sum_{q=1}^{l} \alpha_q^* d_q x_q = \sum_{q \in I_{sv}} \alpha_q^* d_q x_q$
optimal discriminant function:
$f^*(x) = \langle w^*, x \rangle + b^* = \sum_{q \in I_{sv}} \alpha_q^* d_q \langle x_q, x \rangle + b^*$,
where
$b^* = d_k - \sum_{q \in I_{sv}} \alpha_q^* d_q \langle x_q, x_k \rangle$ for any $k$ with $\alpha_k^* > 0$.
^^ Obviously, the Lagrange multiplier associated with each point quantifies how important that training point is in forming the final solution.
^^ Points with zero $\alpha_q^*$ have no influence.
^^ For any $q \in I_{sv}$, we have $\alpha_q^* > 0$.
KKT conditions ⇒ $d_q [\langle w^*, x_q \rangle + b^*] = 1$.
This implies that the functional margin of $(x_q, d_q)$ with respect to the maximal margin hyperplane is one, and therefore $x_q$ lies closest to the maximal margin hyperplane.
positive support vector: any pattern $x_q$ with $q \in I_{sv}$ and $d_q = +1$
negative support vector: any pattern $x_q$ with $q \in I_{sv}$ and $d_q = -1$
^^ In conceptual terms, the support vectors are those data points that lie closest to the decision surface and are therefore the most difficult to classify.
^^ The fact that only a subset of the Lagrange multipliers is nonzero is referred to as sparseness; it means that the support vectors contain all the information necessary to reconstruct the optimal hyperplane.
Slack Variables for Classification
Definition: Let $\gamma > 0$ be given. The margin slack variable $\xi_q$ of an example $(x_q, d_q)$ with respect to the hyperplane $H: (w, b)$ and target margin $\gamma$ is defined by
$\xi_q := \max\left(0,\ \gamma - d_q \cdot [\langle w, x_q \rangle + b]\right)$.
From the definition, we have
$\xi_q \ge 0$ and $d_q \cdot [\langle w, x_q \rangle + b] + \xi_q \ge \gamma$.
^^ $\|\xi\|_1 := \sum_{q=1}^{l} \xi_q$ or $\|\xi\|_2^2 := \sum_{q=1}^{l} \xi_q^2$
measures the amount by which the training set fails to have margin $\gamma$, and takes into account any misclassifications of the training data.
[Figure: a separating hyperplane H with margin hyperplanes H+ and H-; the slack variable of a point on the wrong side of its margin hyperplane measures how far it falls short of the target margin.]
1-norm Soft Margin Classifier
Primal problem: (P1)
minimize $\tfrac{1}{2} w^T w + C \sum_{q=1}^{l} \xi_q$
subject to $d_q \cdot [\langle w, x_q \rangle + b] \ge 1 - \xi_q$, $\xi_q \ge 0$, $q \in l$.
The parameter $C > 0$ controls the trade-off between the complexity of the machine and the number of non-separable points.
Dual problem: (D1)
maximize $\sum_{q=1}^{l} \alpha_q - \tfrac{1}{2} \sum_{q=1}^{l} \sum_{j=1}^{l} \alpha_q \alpha_j d_q d_j \langle x_q, x_j \rangle$
subject to $\sum_{q=1}^{l} \alpha_q d_q = 0$, $0 \le \alpha_q \le C$, $q \in l$.
^^ box constraints ⇒ the influence of the individual patterns gets limited
optimal weight: $w^* = \sum_{q=1}^{l} \alpha_q^* d_q x_q = \sum_{q \in I_{sv}} \alpha_q^* d_q x_q$
optimal discriminant function:
$f^*(x) = \langle w^*, x \rangle + b^* = \sum_{q \in I_{sv}} \alpha_q^* d_q \langle x_q, x \rangle + b^*$,
$b^* = d_k - \sum_{q \in I_{sv}} \alpha_q^* d_q \langle x_q, x_k \rangle$ for any $k$ with $0 < \alpha_k^* < C$.
Ridge Regressor (LS-SVM)
idea: choose a $(w, b)$ that minimizes
$E(w, b) := \tfrac{\lambda}{2} \langle w, w \rangle + \tfrac{1}{2} \sum_{q=1}^{l} [d_q - \langle w, x_q \rangle - b]^2$
$\lambda > 0$: regularization parameter
$\lambda \langle w, w \rangle = \lambda w^T w = \lambda \|w\|^2 = \lambda \|\partial f / \partial x\|^2$:
smoothing functional or stabilizer, ensuring that the approximating function is smooth
Primal problem: (P0)
minimize $\tfrac{\lambda}{2} w^T w + \tfrac{1}{2} \sum_{q=1}^{l} \xi_q^2$
subject to $d_q - w^T x_q - b = \xi_q$ for all $q \in l$.
Dual problem: (D0)
maximize $\sum_{q=1}^{l} d_q \alpha_q - \tfrac{1}{2\lambda} \sum_{q=1}^{l} \sum_{j=1}^{l} \alpha_q \alpha_j \langle x_q, x_j \rangle - \tfrac{1}{2} \sum_{q=1}^{l} \alpha_q^2$
subject to $\sum_{q=1}^{l} \alpha_q = 0$.
Suppose $\alpha^*$ solves the problem (D0). Define $I_{sv} := \{q \in l : \alpha_q^* \ne 0\}$.
Optimal weight:
$w^* = \tfrac{1}{\lambda} \sum_{q=1}^{l} \alpha_q^* x_q = \tfrac{1}{\lambda} \sum_{q \in I_{sv}} \alpha_q^* x_q$,
$\xi_q^* = \alpha_q^*$, $q \in l$.
optimal predictive function:
$f^*(x) = \langle w^*, x \rangle + b^* = \tfrac{1}{\lambda} \sum_{q \in I_{sv}} \alpha_q^* \langle x_q, x \rangle + b^*$,
$b^* = d_k - \alpha_k^* - \tfrac{1}{\lambda} \sum_{q \in I_{sv}} \alpha_q^* \langle x_q, x_k \rangle$, for any $k \in l$.
^^ Note how the preceding equations depend only on the inner products.
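The relations $d_q = \tfrac{1}{\lambda}\sum_j \alpha_j \langle x_j, x_q \rangle + b + \alpha_q$ and $\sum_q \alpha_q = 0$ implied by the optimality conditions above form a linear system in $(\alpha, b)$; the following NumPy sketch (our own formulation of that system) solves it directly.

```python
import numpy as np

def ls_svm_fit(X, d, lam=1.0):
    """Solve the LS-SVM (ridge regressor) for (alpha, b) via its optimality conditions."""
    l = X.shape[0]
    G = X @ X.T                                   # Gram matrix of inner products
    A = np.zeros((l + 1, l + 1))
    A[:l, :l] = G / lam + np.eye(l)               # d_q = (1/lam)*sum_j alpha_j <x_j,x_q> + b + alpha_q
    A[:l, l] = 1.0                                # the bias column
    A[l, :l] = 1.0                                # the constraint sum_q alpha_q = 0
    rhs = np.concatenate([d, [0.0]])
    sol = np.linalg.solve(A, rhs)
    return sol[:l], sol[l]                        # (alpha, b)

def ls_svm_predict(Xtr, alpha, b, lam, Xte):
    """Predictive function f(x) = (1/lam) * sum_q alpha_q <x_q, x> + b."""
    return (Xte @ Xtr.T) @ alpha / lam + b
```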
Slack Variables for Regression
Definition: Let $\varepsilon > 0$ be given. The margin slack variable $s_q$ of an example $(x_q, d_q)$ with respect to the hyperplane $H: (w, b)$ and target margin $\varepsilon$ is defined by
$s_q := \max\left(0,\ |d_q - \langle w, x_q \rangle - b| - \varepsilon\right)$.
From the definition, we have
$s_q \ge 0$ and $|d_q - \langle w, x_q \rangle - b| \le s_q + \varepsilon$.
[Figure: regression data around the ε-band of a fitted line; points outside the band have positive slack ξ or η.]
^^ The quantity defined by
$\|s\|_1 := \sum_{q=1}^{l} s_q$ or $\|s\|_2^2 := \sum_{q=1}^{l} s_q^2$
measures the amount by which the training set fails to fall within the $\varepsilon$-band of the hyperplane.
It is convenient to introduce two slack variables, one for exceeding the target value by more than $\varepsilon$, and the other for being more than $\varepsilon$ below the target. Let us define the slack variables as
$\xi_q := \max\left(0,\ d_q - \langle w, x_q \rangle - b - \varepsilon\right)$,
$\eta_q := \max\left(0,\ \langle w, x_q \rangle + b - d_q - \varepsilon\right)$.
⇒ $\xi_q \eta_q = 0$,
$s_q = \xi_q + \eta_q = \max\left(0,\ |d_q - \langle w, x_q \rangle - b| - \varepsilon\right)$.
^^ Likewise, the quantity defined by
$\sum_{q=1}^{l} (\xi_q + \eta_q)$ or $\sum_{q=1}^{l} (\xi_q^2 + \eta_q^2)$
measures the amount by which the training set fails to fall within the $\varepsilon$-band of the hyperplane.
1-norm Soft Regressor
Primal problem: (P1)
minimize $\tfrac{1}{2} w^T w + C \sum_{q=1}^{l} (\xi_q + \eta_q)$
subject to $d_q - [\langle w, x_q \rangle + b] \le \varepsilon + \xi_q$, $\xi_q \ge 0$,
$[\langle w, x_q \rangle + b] - d_q \le \varepsilon + \eta_q$, $\eta_q \ge 0$, $q \in l$.
^^ We have introduced two slack variables, one for exceeding the target value by more than $\varepsilon$, and the other for being more than $\varepsilon$ below the target.
Dual problem: (D1)
maximize $\sum_{q=1}^{l} d_q (\alpha_q - \beta_q) - \varepsilon \sum_{q=1}^{l} (\alpha_q + \beta_q) - \tfrac{1}{2} \sum_{q=1}^{l} \sum_{j=1}^{l} (\alpha_q - \beta_q)(\alpha_j - \beta_j) \langle x_q, x_j \rangle$
subject to
$\sum_{q=1}^{l} (\alpha_q - \beta_q) = 0$,
$0 \le \alpha_q \le C$, $0 \le \beta_q \le C$, $q \in l$.
Simpler dual problem: (D1')
maximize $\sum_{q=1}^{l} d_q \theta_q - \varepsilon \sum_{q=1}^{l} |\theta_q| - \tfrac{1}{2} \sum_{q=1}^{l} \sum_{j=1}^{l} \theta_q \theta_j \langle x_q, x_j \rangle$
subject to $\sum_{q=1}^{l} \theta_q = 0$, $-C \le \theta_q \le C$, $q \in l$.
^^ box constraints ⇒ the influence of the individual patterns gets limited
Define $I_{sv} := \{q \in l : \theta_q^* \ne 0\}$.
optimal weight: $w^* = \sum_{q=1}^{l} \theta_q^* x_q = \sum_{q \in I_{sv}} \theta_q^* x_q$
optimal predictive function:
$f^*(x) = \langle w^*, x \rangle + b^* = \sum_{q \in I_{sv}} \theta_q^* \langle x_q, x \rangle + b^*$,
where
$b^* = d_k - \varepsilon - \sum_{q \in I_{sv}} \theta_q^* \langle x_q, x_k \rangle$ for some $k$ with $0 < \theta_k^* < C$.
Nonlinear Classification
Cover's Theorem:
A complex pattern-classification problem cast nonlinearly into a high-dimensional space is more likely to be linearly separable than in a low-dimensional space.
Note: Cover's Theorem states that the training set $S$ may be transformed into a new feature space where the patterns are linearly separable with high probability, provided two conditions are satisfied:
^^ The transformation is nonlinear.
^^ The dimensionality of the feature space is high enough.
[Figure: the feature map φ takes data that are not linearly separable in the input space to a feature space in which they are linearly separable.]
Nonlinear Classifier
Let $X \subseteq \mathbb{R}^n$ and $Y := \{+1, -1\}$.
training examples: $S := \{(x_q, d_q)\}_{q=1}^{l} \subseteq X \times Y$
kernel: $K(x, z) := \langle \phi(x), \phi(z) \rangle$, $x, z \in X$.
1-norm Soft Margin Classifier in Feature Space (D1)
maximize $\sum_{q=1}^{l} \alpha_q - \tfrac{1}{2} \sum_{q=1}^{l} \sum_{j=1}^{l} \alpha_q \alpha_j d_q d_j \langle \phi(x_q), \phi(x_j) \rangle = \sum_{q=1}^{l} \alpha_q - \tfrac{1}{2} \sum_{q=1}^{l} \sum_{j=1}^{l} \alpha_q \alpha_j d_q d_j K(x_q, x_j)$
subject to
$\sum_{q=1}^{l} \alpha_q d_q = 0$, $0 \le \alpha_q \le C$, $q \in l$.
Define $I_{sv} := \{q \in l : \alpha_q^* > 0\}$.
optimal weight: $w^* = \sum_{q=1}^{l} \alpha_q^* d_q \phi(x_q) = \sum_{q \in I_{sv}} \alpha_q^* d_q \phi(x_q)$
optimal discriminant function:
$f^*(x) = \sum_{q \in I_{sv}} \alpha_q^* d_q \langle \phi(x_q), \phi(x) \rangle + b^* = \sum_{q \in I_{sv}} \alpha_q^* d_q K(x_q, x) + b^*$,
where
$b^* = d_k - \sum_{q \in I_{sv}} \alpha_q^* d_q \langle \phi(x_q), \phi(x_k) \rangle = d_k - \sum_{q \in I_{sv}} \alpha_q^* d_q K(x_q, x_k)$ for any $k$ with $0 < \alpha_k^* < C$.
^^ There is no need to calculate any features to form the final discriminant function.
^^ The kernel is just good enough.
^^ The kernel-based support vector classifier (SVC) can be represented as a multi-layer feed-forward network with a single hidden layer and with nonlinear activation functions.
^^ The number of hidden neurons is equal to the number of support vectors, which is automatically determined by the support vector machine under consideration.
[Figure: a network with input $x = (x_1, \dots, x_n)$, hidden nodes $K(x_1, x), \dots, K(x_j, x), \dots, K(x_l, x)$ weighted by $d_1\alpha_1, \dots, d_l\alpha_l$, a bias $b$, and a summing output node $y$.]
Nonlinear Regression
Let $X \subseteq \mathbb{R}^n$ and $Y \subseteq \mathbb{R}$.
training examples: $S := \{(x_q, d_q)\}_{q=1}^{l} \subseteq X \times Y$
kernel: $K(x, z) := \langle \phi(x), \phi(z) \rangle$, $x, z \in X$.
[Figure: the feature map φ takes the input space to the feature space.]
1-norm Soft Regressor in Feature Space (D1')
maximize $\sum_{q=1}^{l} d_q \theta_q - \varepsilon \sum_{q=1}^{l} |\theta_q| - \tfrac{1}{2} \sum_{q=1}^{l} \sum_{j=1}^{l} \theta_q \theta_j \langle \phi(x_q), \phi(x_j) \rangle = \sum_{q=1}^{l} d_q \theta_q - \varepsilon \sum_{q=1}^{l} |\theta_q| - \tfrac{1}{2} \sum_{q=1}^{l} \sum_{j=1}^{l} \theta_q \theta_j K(x_q, x_j)$
subject to $\sum_{q=1}^{l} \theta_q = 0$, $-C \le \theta_q \le C$, $q \in l$.
Define $I_{sv} := \{q \in l : \theta_q^* \ne 0\}$.
optimal weight: $w^* = \sum_{q=1}^{l} \theta_q^* \phi(x_q) = \sum_{q \in I_{sv}} \theta_q^* \phi(x_q)$
optimal predictive function:
$f^*(x) = \sum_{q \in I_{sv}} \theta_q^* \langle \phi(x_q), \phi(x) \rangle + b^* = \sum_{q \in I_{sv}} \theta_q^* K(x_q, x) + b^*$,
where
$b^* = d_k - \varepsilon - \sum_{q \in I_{sv}} \theta_q^* \langle \phi(x_q), \phi(x_k) \rangle = d_k - \varepsilon - \sum_{q \in I_{sv}} \theta_q^* K(x_q, x_k)$ for some $k$ with $0 < \theta_k^* < C$.
^^ There is no need to calculate any features to form the final predictive function. The kernel is just good enough.
^^ The kernel-based support vector regressor (SVR) can be represented as a multi-layer feed-forward network with a single hidden layer and with nonlinear activation functions.
^^ The number of hidden neurons is equal to the number of support vectors, which is automatically determined by the support vector machine under consideration.
[Figure: a network with input $x = (x_1, \dots, x_n)$, hidden nodes $K(x_1, x), \dots, K(x_j, x), \dots, K(x_l, x)$ weighted by $\theta_1, \dots, \theta_l$, a bias $b$, and a summing output node $y$.]
Model Selection: SVM Parameter Settings
Commonly Used Techniques
Model Selection by Cross Validation
Population-based Model Selection: Genetic Algorithm (GA), Particle Swarm Optimization (PSO) Algorithm
Cross Validation (dividing the data into training and testing datasets)
^^ First split the experimental dataset into several, say m, parts of approximately equal size.
^^ Perform m training runs. Each time, one of the m parts is left out and used as an independent validation set for optimizing the parameters.
^^ Taking into consideration both the average training and testing error rates, we choose the parameters with acceptable results on average over the m runs. A small sketch of this procedure is given below.
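A minimal sketch of m-fold cross validation for choosing SVM parameters, assuming scikit-learn's SVC with an RBF kernel as the learning machine; the parameter grid and fold count are illustrative.

```python
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def select_svc_parameters(X, d, Cs=(0.1, 1.0, 10.0), gammas=(0.01, 0.1, 1.0), m=5):
    """Pick (C, gamma) of an RBF-kernel SVC by m-fold cross validation."""
    best = None
    for C in Cs:
        for gamma in gammas:
            clf = SVC(C=C, kernel="rbf", gamma=gamma)
            # average validation accuracy over the m held-out parts
            score = cross_val_score(clf, X, d, cv=m).mean()
            if best is None or score > best[0]:
                best = (score, C, gamma)
    return best  # (average validation accuracy, C, gamma)
```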
Population-based SVM Parameter Settings
Mahalanobis kernel:
$K(x, z) := \exp[-\sigma_1^{-2} (x_1 - z_1)^2 - \dots - \sigma_n^{-2} (x_n - z_n)^2]$.
Parameters for Support Vector Classification (SVC): $C, \sigma_1^2, \dots, \sigma_n^2$
Parameters for Support Vector Regression (SVR): $C, \varepsilon, \sigma_1^2, \dots, \sigma_n^2$
Genetic Algorithm
Chromosome for SVC: $z := [C\ \sigma_1^2\ \dots\ \sigma_n^2]$
Chromosome for SVR: $z := [C\ \varepsilon\ \sigma_1^2\ \dots\ \sigma_n^2]$
Gene for SVC: $z_i$, $i \in n + 1$
Gene for SVR: $z_i$, $i \in n + 2$
Particle Swarm Optimization Algorithm
Particle for SVC: $z := [C\ \sigma_1^2\ \dots\ \sigma_n^2]$
Particle for SVR: $z := [C\ \varepsilon\ \sigma_1^2\ \dots\ \sigma_n^2]$
Genetic Algorithm
^^ a biologically motivated general search technique mimicking natural selection and natural genetics
^^ a population-based search method between exhaustive search and traditional search methods
[Flowchart: generate the initial population → calculate objective values → check the convergence criterion (end if satisfied) → selection for mating → genetic operations for reproduction (crossover, mutation) → replacement to form the new population → back to calculating objective values.]
Basic Notions
Selection: the process of choosing parents for reproduction
  fitness-proportionate selection (roulette wheel selection)
  stochastic universal sampling
  random selection
  rank selection (from top to bottom)
  tournament selection
^^ low selection pressure in the initial generations ⇒ avoids premature convergence and maintains diversity
^^ high selection pressure in the final generations ⇒ speeds up convergence
Tournament selection
^^ Randomly select two chromosomes from the mating pool, say $z_1$ and $z_2$.
^^ If $\text{fitness}[z_1] \ge \text{fitness}[z_2]$, then dad := $z_1$; else dad := $z_2$.
^^ Randomly select two chromosomes from the mating pool, say $z_3$ and $z_4$.
^^ If $\text{fitness}[z_3] \ge \text{fitness}[z_4]$, then mom := $z_3$; else mom := $z_4$.
Crossover: a reproduction operator that forms a new chromosome from two parent chromosomes by combining part of the information from each
  single-point crossover
  two-point crossover
  uniform crossover
  parameterized uniform crossover
Parameterized uniform crossover
parents:
  dad: $z^d := [z_1^d\ z_2^d\ \dots\ z_{l-1}^d\ z_l^d]$
  mom: $z^m := [z_1^m\ z_2^m\ \dots\ z_{l-1}^m\ z_l^m]$
offspring:
  brother: $z^b := [z_1^b\ z_2^b\ \dots\ z_{l-1}^b\ z_l^b]$
  sister: $z^s := [z_1^s\ z_2^s\ \dots\ z_{l-1}^s\ z_l^s]$
for each $i \in l$:
  if $\text{Rand}(\cdot) > \text{threshold\_x}$
    $z_i^b = z_i^m - \text{Rand}(\cdot) \cdot (z_i^m - z_i^d)$;
    $z_i^s = z_i^d + \text{Rand}(\cdot) \cdot (z_i^m - z_i^d)$;
  else
    $z_i^b = z_i^d$; $z_i^s = z_i^m$;
$\text{Rand}(\cdot)$: a random number in $[0, 1]$
^^ $\text{Rand}(\cdot) = 0$ ⇒ $z_i^b = z_i^m$, $z_i^s = z_i^d$.
^^ $\text{Rand}(\cdot) = 1$ ⇒ $z_i^b = z_i^d$, $z_i^s = z_i^m$.
Mutation: a reproduction operator that randomly alters the values of genes in a chromosome
  point mutation
  uniform mutation
  parameterized uniform mutation
^^ escape from a local maximum; leapfrog over sticking points
Parameterized mutation
for each $i \in l$:
  if $\text{rand}(\cdot) > \text{threshold\_m}$, $z_i$ is replaced by a random number; else $z_i$ is maintained
$\text{rand}(\cdot)$: a random number in $[0, 1]$ (a code sketch of these operators follows below)
Building block hypothesis: GAs attempt to find highly fit solutions to the problem at hand by the juxtaposition of "good building blocks."
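A NumPy sketch of tournament selection, parameterized uniform crossover, and parameterized mutation as described above; the threshold values and gene bounds are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def tournament_select(pop, fitness):
    """Pick one parent: the fitter of two randomly chosen chromosomes."""
    i, j = rng.integers(len(pop), size=2)
    return pop[i] if fitness[i] >= fitness[j] else pop[j]

def parameterized_uniform_crossover(dad, mom, threshold_x=0.5):
    """Blend genes where a uniform draw exceeds threshold_x; copy them otherwise."""
    brother, sister = dad.copy(), mom.copy()
    for i in range(len(dad)):
        if rng.random() > threshold_x:
            r = rng.random()
            brother[i] = mom[i] - r * (mom[i] - dad[i])
            sister[i] = dad[i] + r * (mom[i] - dad[i])
    return brother, sister

def parameterized_mutation(z, low, high, threshold_m=0.9):
    """Replace a gene by a random number in [low_i, high_i] with probability 1 - threshold_m."""
    z = z.copy()
    for i in range(len(z)):
        if rng.random() > threshold_m:
            z[i] = rng.uniform(low[i], high[i])
    return z
```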
GA-based Framework 1
Possible fitness value of a chromosome in SVC: the number of testing data that are correctly classified
Possible fitness value of a chromosome in SVR: the negative of the mean squared error on the testing data
[Flowchart: generate a random initial population of [C, σᵢ] or [C, ε, σᵢ]; select random training and testing data from the experimental data set; train the SVC or SVR and compute the fitness of all chromosomes; if not converged, perform selection for mating, genetic operations for reproduction (crossover, mutation), and replacement to form the new population, then repeat with a new random split of training and testing data.]
Criteria for Final Chromosome Selection
Life time of a chromosome: the number of generations that a given chromosome lives.
The chromosome with the highest life time is chosen to give our final parameters.
Life score of a chromosome: the average score over the generations of a given chromosome.
For instance:
Fitness ranking in descending order | score
1 | 5
2 | 4
3 | 3
4 | 2
5 | 1
6 and above | 0
The chromosome with the highest life score is chosen to give our final parameters.
Particle Swarm Optimization
Motive:
^^ Human intelligence results from social interaction.
^^ Swarm intelligence provides a useful paradigm for implementing adaptive systems.
Main Features:
^^ a biologically motivated general search technique mimicking fish schooling, bird flocking, and bug swarming
^^ a population-based search method with each particle associated with a velocity
^^ The particles fly through the search space.
PSO Flowchart
[Flowchart: generate the initial population → calculate objective values → check the convergence criterion (end if satisfied) → update velocities → update positions → back to calculating objective values.]
Two main components
"Cognitive" part: $c_1 \cdot \text{rand}(\cdot) \cdot \{bestx_{ij} - x_{ij}\}$
"Social" part: $c_2 \cdot \text{Rand}(\cdot) \cdot \{bestx_{champion, j} - x_{ij}\}$
updating formulae:
$v_{ij} \leftarrow w \cdot v_{ij} + c_1 \cdot \text{rand}(\cdot) \cdot \{bestx_{ij} - x_{ij}\} + c_2 \cdot \text{Rand}(\cdot) \cdot \{bestx_{champion, j} - x_{ij}\}$;
$x_{ij} \leftarrow x_{ij} + v_{ij}$;
$x_{ij}$: jth position component of the ith particle
$v_{ij}$: jth velocity component of the ith particle
$bestx_{ij}$: jth component of the best previous position of the ith particle
champion: the particle giving the best objective value of all particles up to the present
$w$: inertia weight balancing local and global searches
$c_1$, $c_2$: two pre-specified constants
$\text{rand}(\cdot)$, $\text{Rand}(\cdot)$: two generators giving random numbers in $[0, 1]$
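One PSO iteration written out in NumPy, following the updating formulae above for a maximization problem; the inertia weight, constants, and objective are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def pso_step(X, V, bestX, best_f, champion, objective, w=0.7, c1=1.5, c2=1.5):
    """One PSO iteration. X, V, bestX: (num_particles x dim) arrays; best_f: best objective per particle."""
    num, dim = X.shape
    r1 = rng.random((num, dim))           # rand(.) for the cognitive part
    r2 = rng.random((num, dim))           # Rand(.) for the social part
    V = w * V + c1 * r1 * (bestX - X) + c2 * r2 * (bestX[champion] - X)
    X = X + V                             # position update
    f = np.array([objective(x) for x in X])
    improved = f > best_f                 # maximization: keep better personal bests
    bestX[improved], best_f[improved] = X[improved], f[improved]
    champion = int(np.argmax(best_f))     # particle with the best value so far
    return X, V, bestX, best_f, champion
```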
PSO-based Framework 1
Possible fitness value of a particle in SVC: the number of testing data that are correctly classified
Possible fitness value of a particle in SVR: the negative of the mean squared error on the testing data
[Flowchart: generate a random initial population of [C, σᵢ] or [C, ε, σᵢ]; select random training and testing data from the experimental data set; train the SVC or SVR and compute the fitness of all particles; if not converged, update velocities and positions to form the new population, then repeat with a new random split of training and testing data.]
Conclusion
^^ New learning machines are invented day by day: the versatility of learning machines
^^ Some learning machines are still in their infancy: much remains to be done
^^ Wide applicability to various branches of science, engineering, and management
Simple research enthusiasm ?⇒ Simple happiness
Epilogue
We must be grateful to God that He created the world in such a way that
everything simple is true and everything complicated is untrue.
Gregory Skovoroda (18th century Ukrainian philosopher)