
Single-Layer Perceptron Classifiers

Berlin Chen, 2002


Outline

• Foundations of trainable decision-making networks to be formulated
– Mapping from the input space to the output space (classification space)

• Focus on the classification of linearly separable classes of patterns
– Linear discriminant functions and a simple correction function
– Continuous error-function minimization

• Explanation and justification of the perceptron and delta training rules

Classification Model, Features,and Decision Regions

• A pattern is the quantitative description of an object, event, or phenomenon
– Spatial patterns: weather maps, fingerprints, …
– Temporal patterns: speech signals, …

• Pattern classification/recognition
– Assign the input data (a physical object, event, or phenomenon) to one of the pre-specified classes (categories)
– Discriminate the input data within the object population via the search for invariant attributes among members of the population

Classification Model, Features,and Decision Regions (cont.)

• The block diagram of the recognition and classification system

[Figure: block diagram of input data -> feature extractor (dimension reduction) -> classifier; a neural network can serve both for classification and for feature extraction]

Classification Model, Features,and Decision Regions (cont.)

• More about feature extraction
– The data compressed from the input patterns while preserving the salient information
– E.g., speech vowel sounds analyzed with a 16-channel filterbank provide 16-dimensional spectral vectors, which can be further transformed into two dimensions
• Tone height (high-low) and retraction (front-back)
– Input patterns are thus projected onto, and reduced to, lower dimensions (see the sketch below)
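To make the projection idea concrete, here is a minimal sketch that reduces hypothetical 16-dimensional spectral vectors to two dimensions with principal component analysis; PCA is an assumed choice for illustration, since the slides do not name a specific transform.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 100 vowel frames, each a 16-channel spectral vector.
X = rng.normal(size=(100, 16))

# PCA via SVD of the mean-centered data matrix.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Project onto the two leading principal directions; each row is now a
# 2-D feature vector (playing the role of tone height / retraction).
X2 = Xc @ Vt[:2].T
print(X2.shape)  # (100, 2)
```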


Classification Model, Features,and Decision Regions (cont.)

• More about feature extraction

[Figure: two-dimensional pattern clusters in (x, y) coordinates projected onto reduced coordinates (x', y')]

Classification Model, Features,and Decision Regions (cont.)

• Two simple ways to generate the pattern vectors for cases of spatial and temporal objects to be classified

• A pattern classifier maps input patterns (vectors) in E^n space into numbers in E^1 that specify the class membership:

$$i_0(\mathbf{x}) = j, \qquad j = 1, 2, \ldots, R$$

Classification Model, Features,and Decision Regions (cont.)

• Classification described in geometric terms
– Decision regions
– Decision surfaces: in general, the decision surfaces for n-dimensional patterns are (n-1)-dimensional hypersurfaces

$$i_0(\mathbf{x}) = j \quad \text{for all } \mathbf{x} \in \mathcal{X}_j, \; j = 1, 2, \ldots, R$$

[Figure: decision regions in the plane; the decision surfaces here are curved lines]

Discriminant Functions

• The classifier determines membership in a category by comparing the R discriminant functions g1(x), g2(x), …, gR(x)
– x lies within the region X_k if g_k(x) has the largest value:

$$i_0 = k \quad \text{if } g_k(\mathbf{x}) > g_j(\mathbf{x}) \text{ for } k, j = 1, 2, \ldots, R, \; j \neq k$$

[Figure: classifier block diagram; the training patterns x^1, x^2, …, x^p, …, x^P (with P >> n) are assumed to have been used already, i.e., the classifier has been designed. The input components x1, …, xn feed R discriminant units computing g1(x), …, gR(x), followed by a maximum selector.]
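A minimal sketch of this maximum-selector scheme, assuming linear discriminant functions g_i(x) = w_i^t x + w_{i,n+1}; the weights and biases below are invented for illustration.

```python
import numpy as np

def classify(x, W, b):
    """Return the index k of the largest discriminant g_k(x) = W[k] @ x + b[k]."""
    scores = W @ x + b      # one discriminant value per class
    return int(np.argmax(scores))

# Hypothetical 3-class linear machine over 2-D patterns.
W = np.array([[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]])
b = np.array([0.5, 1.0, -2.0])
print(classify(np.array([1.0, 1.0]), W, b))  # -> 0, i.e. class 1
```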


Discriminant Functions (cont.)

• Example 3.1

$$g(\mathbf{x}) = g_1(\mathbf{x}) - g_2(\mathbf{x}) = -2x_1 + x_2 + 2$$

$$g(\mathbf{x}) > 0: \text{class 1}, \qquad g(\mathbf{x}) < 0: \text{class 2}$$

Decision surface equation: g(x) = 0

– The decision surface does not uniquely specify the discriminant functions
– A classifier that classifies patterns into two classes or categories is called a "dichotomizer" (from the Greek roots for "two" and "cut")

Discriminant Functions (cont.)

[Figure: the pattern space of Example 3.1, with prototype points (1, 0) (class 1) and (0, -2) (class 2) and the decision surface -2x1 + x2 + 2 = 0]

Discriminant Functions (cont.)

Two discriminant functions constructed geometrically in (x1, x2, g) space. The figure shows planes built through the point (0, -2, 1) above the prototype points (1, 0) and (0, -2), with normal vectors such as [2, -1, 0], [-2, 1, 0], [0, 0, 1], and [2, -1, 1].

Solution 1 (plane normals [2, -1, 1] and [-2, 1, 1]):

$$(x_1 - 0, \; x_2 + 2, \; g_1 - 1) \cdot (2, -1, 1) = 0 \;\Rightarrow\; 2x_1 - x_2 - 2 + g_1 - 1 = 0 \;\Rightarrow\; g_1 = -2x_1 + x_2 + 3$$
$$(x_1 - 0, \; x_2 + 2, \; g_2 - 1) \cdot (-2, 1, 1) = 0 \;\Rightarrow\; -2x_1 + x_2 + 2 + g_2 - 1 = 0 \;\Rightarrow\; g_2 = 2x_1 - x_2 - 1$$
$$g = g_1 - g_2 = 0 \;\Rightarrow\; -4x_1 + 2x_2 + 4 = 0 \;\Rightarrow\; -2x_1 + x_2 + 2 = 0$$

Solution 2 (plane normals [2, -1, 2] and [-2, 1, 2]):

$$(x_1 - 0, \; x_2 + 2, \; g_1 - 1) \cdot (2, -1, 2) = 0 \;\Rightarrow\; 2x_1 - x_2 - 2 + 2g_1 - 2 = 0 \;\Rightarrow\; g_1 = -x_1 + \tfrac{1}{2}x_2 + 2$$
$$(x_1 - 0, \; x_2 + 2, \; g_2 - 1) \cdot (-2, 1, 2) = 0 \;\Rightarrow\; -2x_1 + x_2 + 2 + 2g_2 - 2 = 0 \;\Rightarrow\; g_2 = x_1 - \tfrac{1}{2}x_2$$
$$g = g_1 - g_2 = 0 \;\Rightarrow\; -2x_1 + x_2 + 2 = 0$$

Both solutions produce the same decision surface: an infinite number of discriminant functions will yield correct classification.

In vector form (Solution 1):

$$g_1(\mathbf{x}) = \begin{bmatrix} -2 & 1 \end{bmatrix}\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} + 3, \qquad g_2(\mathbf{x}) = \begin{bmatrix} 2 & -1 \end{bmatrix}\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} - 1$$

Discriminant Functions (cont.)

A multi-class classifier compares R discriminant functions; the two-class case reduces to subtraction of two discriminant functions followed by sign examination:

$$g(\mathbf{x}) = g_1(\mathbf{x}) - g_2(\mathbf{x})$$

$$g(\mathbf{x}) > 0: \text{class 1}, \qquad g(\mathbf{x}) < 0: \text{class 2}$$

Discriminant Functions (cont.)

The design of the discriminator for this case is not straightforward; the discriminant functions may turn out to be nonlinear functions of x1 and x2.

Bayes’ Decision Theory

• Decision making based on both the posterior knowledge obtained from specific observation data and prior knowledge of the categories
– Prior class probabilities: $P(\omega_i)$, for every class i
– Class-conditional probabilities: $P(x \mid \omega_i)$, for every class i

$$k = \arg\max_i P(\omega_i \mid x) = \arg\max_i \frac{P(x \mid \omega_i)\,P(\omega_i)}{P(x)} = \arg\max_i \frac{P(x \mid \omega_i)\,P(\omega_i)}{\sum_{j=1}^{R} P(x \mid \omega_j)\,P(\omega_j)}$$

$$\Rightarrow\; k = \arg\max_i P(\omega_i \mid x) = \arg\max_i P(x \mid \omega_i)\,P(\omega_i)$$
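A tiny numeric sketch of this decision rule, with invented priors and class-conditional likelihoods; the evidence P(x) is omitted since it is common to all classes.

```python
import numpy as np

# Hypothetical two-class problem, evaluated at a single observation x.
priors = np.array([0.6, 0.4])        # P(w1), P(w2)
likelihoods = np.array([0.2, 0.5])   # P(x | w1), P(x | w2)

# argmax_i P(w_i | x) = argmax_i P(x | w_i) P(w_i)
k = int(np.argmax(likelihoods * priors))
print(k + 1)  # -> 2, since 0.5 * 0.4 = 0.20 > 0.2 * 0.6 = 0.12
```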


Bayes’ Decision Theory (cont.)

• Bayes' decision rule is designed to minimize the overall risk involved in making a decision
– The expected loss (conditional risk) when making decision $\delta_i$:

$$R(\delta_i \mid x) = \sum_j l(\delta_i, \omega_j)\,P(\omega_j \mid x), \quad \text{where } l(\delta_i, \omega_j) = \begin{cases} 0, & i = j \\ 1, & i \neq j \end{cases}$$

$$\Rightarrow\; R(\delta_i \mid x) = \sum_{j \neq i} P(\omega_j \mid x) = 1 - P(\omega_i \mid x)$$

• The overall risk (Bayes' risk):

$$R = \int_{-\infty}^{\infty} R(\delta(x) \mid x)\,p(x)\,dx, \qquad \delta(x): \text{the decision selected for a sample } x$$

– Minimize the overall risk (classification error) by computing the conditional risks and selecting the decision $\delta_i$ for which the conditional risk $R(\delta_i \mid x)$ is minimum, i.e., for which $P(\omega_i \mid x)$ is maximum (minimum-error-rate decision rule)

Bayes’ Decision Theory (cont.)

• Two-class pattern classification

Likelihood ratio or log-likelihood ratio (Bayes' classifier):

$$l(x) = \frac{P(x \mid \omega_1)}{P(x \mid \omega_2)} \;\underset{\omega_2}{\overset{\omega_1}{\gtrless}}\; \frac{P(\omega_2)}{P(\omega_1)}$$

$$\log l(x) = \log P(x \mid \omega_1) - \log P(x \mid \omega_2) \;\underset{\omega_2}{\overset{\omega_1}{\gtrless}}\; \log P(\omega_2) - \log P(\omega_1)$$

Equivalently:

$$P(x \mid \omega_1)\,P(\omega_1) \;\underset{\omega_2}{\overset{\omega_1}{\gtrless}}\; P(x \mid \omega_2)\,P(\omega_2)$$

with discriminant functions

$$g_1(x) = P(\omega_1 \mid x) \cong P(x \mid \omega_1)\,P(\omega_1), \qquad g_2(x) = P(\omega_2 \mid x) \cong P(x \mid \omega_2)\,P(\omega_2)$$

Classification error:

$$p(\text{error}) = P(x \in R_2, \omega_1) + P(x \in R_1, \omega_2) = P(x \in R_2 \mid \omega_1)\,P(\omega_1) + P(x \in R_1 \mid \omega_2)\,P(\omega_2)$$

$$= \int_{R_2} P(x \mid \omega_1)\,P(\omega_1)\,dx + \int_{R_1} P(x \mid \omega_2)\,P(\omega_2)\,dx$$

Bayes’ Decision Theory (cont.)

• When the environment is multivariate Gaussian, the Bayes' classifier reduces to a linear classifier
– The same form taken by the perceptron
– But the linear nature of the perceptron is not contingent on the assumption of Gaussianity

Assumptions:

$$\text{Class } \omega_1: \; E[\mathbf{X}] = \boldsymbol{\mu}_1, \quad E\left[(\mathbf{X} - \boldsymbol{\mu}_1)(\mathbf{X} - \boldsymbol{\mu}_1)^t\right] = \boldsymbol{\Sigma}$$
$$\text{Class } \omega_2: \; E[\mathbf{X}] = \boldsymbol{\mu}_2, \quad E\left[(\mathbf{X} - \boldsymbol{\mu}_2)(\mathbf{X} - \boldsymbol{\mu}_2)^t\right] = \boldsymbol{\Sigma}$$

$$P(\mathbf{x} \mid \omega_i) = \frac{1}{(2\pi)^{n/2}\,|\boldsymbol{\Sigma}|^{1/2}} \exp\left[-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_i)^t \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}_i)\right]$$

$$P(\omega_1) = P(\omega_2) = \frac{1}{2}$$

Bayes’ Decision Theory (cont.)

• When the environment is Gaussian, the Bayes' classifier reduces to a linear classifier (cont.)

$$\log l(\mathbf{x}) = \log P(\mathbf{x} \mid \omega_1) - \log P(\mathbf{x} \mid \omega_2)$$
$$= -\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_1)^t \boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu}_1) + \frac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_2)^t \boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu}_2)$$
$$= (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)^t \boldsymbol{\Sigma}^{-1}\mathbf{x} + \frac{1}{2}\left(\boldsymbol{\mu}_2^t \boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1^t \boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_1\right)$$
$$= \mathbf{w}^t\mathbf{x} + b$$

Since the priors are equal, the threshold $\log[P(\omega_2)/P(\omega_1)]$ is zero:

$$\therefore\; \log l(\mathbf{x}) = \mathbf{w}^t\mathbf{x} + b \;\underset{\omega_2}{\overset{\omega_1}{\gtrless}}\; 0$$
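A small sketch of this reduction, computing w and b from assumed means and a shared covariance (all numbers invented for illustration).

```python
import numpy as np

# Assumed class statistics (equal covariance, equal priors).
mu1 = np.array([1.0, 1.0])
mu2 = np.array([-1.0, 0.0])
Sigma = np.array([[1.0, 0.2],
                  [0.2, 1.0]])
Sinv = np.linalg.inv(Sigma)

# Bayes classifier in linear form: log l(x) = w^T x + b.
w = Sinv @ (mu1 - mu2)
b = 0.5 * (mu2 @ Sinv @ mu2 - mu1 @ Sinv @ mu1)

x = np.array([0.5, 0.3])
print("class 1" if w @ x + b > 0 else "class 2")  # -> class 1
```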


Bayes’ Decision Theory (cont.)

• Multi-class pattern classification


Linear Machine and Minimum Distance Classification

• Find the linear-form discriminant function for two-class classification when the class prototypes are known

• Example 3.1: Select the decision hyperplane that contains the midpoint of the line segment connecting the center points of the two classes

Linear Machine and Minimum Distance Classification (cont.)

The dichotomizer's discriminant function g(x) is the hyperplane through the midpoint of the segment joining the class prototypes x1 and x2, normal to that segment:

$$g(\mathbf{x}) = (\mathbf{x}_1 - \mathbf{x}_2)^t\left(\mathbf{x} - \frac{\mathbf{x}_1 + \mathbf{x}_2}{2}\right) = 0$$

$$\Rightarrow\; (\mathbf{x}_1 - \mathbf{x}_2)^t\,\mathbf{x} + \frac{1}{2}\left(\|\mathbf{x}_2\|^2 - \|\mathbf{x}_1\|^2\right) = 0$$

Taken as $\mathbf{w}^t\mathbf{y} = 0$ with the augmented input pattern $\mathbf{y} = [\mathbf{x}^t \; 1]^t$, where

$$\mathbf{w}_{1..n} = \mathbf{x}_1 - \mathbf{x}_2, \qquad w_{n+1} = \frac{1}{2}\left(\|\mathbf{x}_2\|^2 - \|\mathbf{x}_1\|^2\right)$$

It is a simple minimum-distance classifier.

Linear Machine and Minimum Distance Classification (cont.)

• The linear-form discriminant functions for multi-class classification
– There are up to R(R-1)/2 decision hyperplanes for R pairwise separable classes
– Some classes may not be contiguous

[Figure: two scatter plots of three pattern classes (x, o, Δ); in the second plot the region of one class is not contiguous]

Linear Machine and Minimum Distance Classification (cont.)

• Linear machine or minimum-distance classifier
– Assume the class prototypes xi are known for all classes

• Euclidean distance between an input pattern x and the center (prototype) of class i:

$$\|\mathbf{x} - \mathbf{x}_i\| = \sqrt{(\mathbf{x} - \mathbf{x}_i)^t(\mathbf{x} - \mathbf{x}_i)}$$

• Since

$$\|\mathbf{x} - \mathbf{x}_i\|^2 = \mathbf{x}^t\mathbf{x} - 2\,\mathbf{x}_i^t\mathbf{x} + \mathbf{x}_i^t\mathbf{x}_i$$

and the term $\mathbf{x}^t\mathbf{x}$ is the same for all classes, minimizing $\|\mathbf{x} - \mathbf{x}_i\|$ is equal to maximizing $\mathbf{x}_i^t\mathbf{x} - \frac{1}{2}\mathbf{x}_i^t\mathbf{x}_i$

– Set the discriminant function for each class i to be:

$$g_i(\mathbf{x}) = \mathbf{x}_i^t\mathbf{x} - \frac{1}{2}\,\mathbf{x}_i^t\mathbf{x}_i$$

– With the augmented input pattern y, $g_i(\mathbf{y}) = \mathbf{w}_i^t\mathbf{y}$, where

$$\mathbf{w}_i = \begin{bmatrix}\mathbf{x}_i \\ w_{i,n+1}\end{bmatrix}, \qquad w_{i,n+1} = -\frac{1}{2}\,\mathbf{x}_i^t\mathbf{x}_i$$

Linear Machine and Minimum Distance Classification (cont.)

This approach is also called correlation classification. A 1 is appended as the (n+1)-th component of the input pattern:

$$g_i(\mathbf{x}) = \mathbf{x}_i^t\mathbf{x} - \frac{1}{2}\,\mathbf{x}_i^t\mathbf{x}_i \quad\Longleftrightarrow\quad g_i(\mathbf{y}) = \mathbf{w}_i^t\mathbf{y}$$

Linear Machine and Minimum Distance Classification (cont.)

• Example 3.2

Prototype points $\mathbf{x}_1 = [10 \;\; 2]^t$, $\mathbf{x}_2 = [2 \;\; -5]^t$, $\mathbf{x}_3 = [-5 \;\; 5]^t$ give the weight vectors

$$\mathbf{w}_1 = \begin{bmatrix}10\\2\\-52\end{bmatrix}, \qquad \mathbf{w}_2 = \begin{bmatrix}2\\-5\\-14.5\end{bmatrix}, \qquad \mathbf{w}_3 = \begin{bmatrix}-5\\5\\-25\end{bmatrix}$$

From $g_i(\mathbf{x}) = \mathbf{x}_i^t\mathbf{x} - \frac{1}{2}\mathbf{x}_i^t\mathbf{x}_i$:

$$g_1(\mathbf{x}) = 10x_1 + 2x_2 - 52, \qquad g_2(\mathbf{x}) = 2x_1 - 5x_2 - 14.5, \qquad g_3(\mathbf{x}) = -5x_1 + 5x_2 - 25$$

Decision lines $S_{ij}: g_i(\mathbf{x}) - g_j(\mathbf{x}) = 0$:

$$S_{12}: 8x_1 + 7x_2 - 37.5 = 0, \qquad S_{13}: 15x_1 - 3x_2 - 27 = 0, \qquad S_{23}: 7x_1 - 10x_2 + 10.5 = 0$$
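A short sketch reproducing this linear machine directly from the prototype points of Example 3.2 (numbers as above; the test point is invented).

```python
import numpy as np

# Prototype points of the three classes (Example 3.2).
protos = np.array([[10.0, 2.0], [2.0, -5.0], [-5.0, 5.0]])

# Minimum-distance weights: w_i = [x_i, -1/2 x_i^T x_i].
W = np.hstack([protos, -0.5 * np.sum(protos**2, axis=1, keepdims=True)])

def classify(x):
    y = np.append(x, 1.0)              # augmented input pattern
    return int(np.argmax(W @ y)) + 1   # class index 1..3

print(W[0] - W[1])                     # S12 coefficients: [8. 7. -37.5]
print(classify(np.array([9.0, 3.0])))  # -> 1 (closest to prototype x1)
```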


Linear Machine and Minimum Distance Classification (cont.)

• If R linear discriminant functions exist for a set of patterns such that

$$g_i(\mathbf{x}) > g_j(\mathbf{x}) \quad \text{for all } \mathbf{x} \in \text{class } i; \;\; i, j = 1, 2, \ldots, R; \;\; j \neq i$$

– then the classes are linearly separable


Linear Machine and Minimum Distance Classification (cont.)

(a) 2x1 - x2 + 2 = 0: the decision surface is a line
(b) 2x1 - x2 + 2 = 0: the decision surface is a plane
(c) x1 = [2, 5]^t, x2 = [-1, -3]^t => the decision surface for the minimum-distance classifier:

$$(\mathbf{x}_1 - \mathbf{x}_2)^t\mathbf{x} + \frac{1}{2}\left(\|\mathbf{x}_2\|^2 - \|\mathbf{x}_1\|^2\right) = 0 \;\Rightarrow\; 3x_1 + 8x_2 - \frac{19}{2} = 0$$

(d) [Figure panels: the lines above drawn in the (x1, x2) plane through points such as (-1, 0) and (0, 2), with intercepts (19/6, 0) and (19/16, 0) for the minimum-distance case]

Linear Machine and Minimum Distance Classification (cont.)

• Examples 3.1 and 3.2 have shown that the coefficients (weights) of the linear discriminant functions can be determined if a priori information about the sets of patterns and their class membership is known

Linear Machine and Minimum Distance Classification (cont.)

• An example of linearly non-separable patterns

Linear Machine and Minimum Distance Classification (cont.)

Two TLUs implement the decision lines

$$x_1 + x_2 + 1 = 0, \qquad -x_1 - x_2 + 1 = 0$$

with weights of ±1 on the inputs and a fixed bias input. In the image space (o1, o2), the patterns map onto the corners (1, 1), (-1, 1), (1, -1), (-1, -1) and become linearly separable by the line

$$o_1 + o_2 - 1 = 0$$

[Figure: the two-TLU layered network, the two parallel decision lines in the (x1, x2) plane, and the mapped patterns in the (o1, o2) plane]

A sketch of this mapping follows below.
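A minimal sketch of this layered mapping, assuming bipolar TLUs (sign activations); patterns between the two parallel lines map to (1, 1) in the image space and are then separated by o1 + o2 - 1 > 0.

```python
import numpy as np

def tlu(net):
    """Bipolar threshold logic unit (sign activation)."""
    return 1.0 if net > 0 else -1.0

def classify(x):
    # First layer: the two decision lines in the input space.
    o1 = tlu(x[0] + x[1] + 1.0)    #  x1 + x2 + 1 = 0
    o2 = tlu(-x[0] - x[1] + 1.0)   # -x1 - x2 + 1 = 0
    # Output TLU separates the mapped patterns in (o1, o2) space.
    return tlu(o1 + o2 - 1.0)

for x in ([0.0, 0.0], [2.0, 2.0], [-2.0, -2.0]):
    print(x, classify(np.array(x)))   # +1 between the lines, -1 outside
```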


Discrete Perceptron Training Algorithm - Geometrical Representations

• Examine neural network classifiers that derive their weights during training, based on the error-correction scheme

$$g(\mathbf{y}) = \mathbf{w}^t\mathbf{y} \qquad (\mathbf{y}: \text{augmented input pattern})$$

Vector representations in the weight space:

$$\text{Class 1: } \mathbf{w}^t\mathbf{y} > 0, \qquad \text{Class 2: } \mathbf{w}^t\mathbf{y} < 0$$

Discrete Perceptron Training Algorithm - Geometrical Representations (cont.)

• Devise an analytic approach based on the geometrical representations
– E.g., the decision surface for the training pattern y1 (a hyperplane in the weight space)

If y1 is in class 1 but currently misclassified, move the weights in the direction of the gradient of w^t y1, which is the direction of steepest increase:

$$\nabla_{\mathbf{w}}\left(\mathbf{w}^t\mathbf{y}_1\right) = \mathbf{y}_1 \;\Rightarrow\; \mathbf{w}' = \mathbf{w} + c\,\mathbf{y}_1$$

If y1 is in class 2 but currently misclassified:

$$\mathbf{w}' = \mathbf{w} - c\,\mathbf{y}_1$$

c (> 0) is the correction increment (two times the learning constant introduced before); c controls the size of the adjustment.

Discrete Perceptron Training Algorithm - Geometrical Representations (cont.)

Weight adjustments for three augmented training patterns y1, y2, y3, shown in the weight space, with

$$\mathbf{y}_1 \in C_1, \qquad \mathbf{y}_2 \in C_1, \qquad \mathbf{y}_3 \in C_2$$

– Weights in the shaded region are the solutions
– The three lines (the pattern hyperplanes) are fixed during training

Discrete Perceptron Training Algorithm - Geometrical Representations (cont.)

• More about the correction increment c
– It need not be merely a constant; it can be related to the current training pattern

The distance p from the current weight vector w1 to the decision plane w^t y = 0 in the weight space:

$$p = \frac{|\mathbf{w}_1^t\mathbf{y}|}{\|\mathbf{y}\|}$$

Requiring the correction step c·y to move the weight vector onto that plane gives (since y^t y > 0):

$$c = \frac{|\mathbf{w}_1^t\mathbf{y}|}{\mathbf{y}^t\mathbf{y}}$$

This shows how to select the correction increment based on the distance between w1 and the corrected weight vector.

Discrete Perceptron Training Algorithm - Geometrical Representations (cont.)

• For the fixed correction rule with c = constant, the correction of weights is always the same fixed portion of the current training vector:

$$\mathbf{w}' = \mathbf{w} \pm c\,\mathbf{y}, \quad \text{equivalently} \quad \mathbf{w}' = \mathbf{w} + \Delta\mathbf{w}, \qquad \Delta\mathbf{w} = \frac{c}{2}\left[d - \operatorname{sgn}\left(\mathbf{w}^t\mathbf{y}\right)\right]\mathbf{y}$$

(for d in {+1, -1} the bracket equals ±2 on a mistake and 0 otherwise)

– The weights can be initialized at any value

• For the dynamic correction rule with c dependent on the distance from the weight vector to the decision surface in the weight space:

$$c = \frac{|\mathbf{w}^t\mathbf{y}|}{\mathbf{y}^t\mathbf{y}}$$

– The initial weights must be different from 0

Discrete Perceptron Training Algorithm - Geometrical Representations (cont.)

• Dynamic correction rule with c dependent on the distance from the weight vector to the decision surface:

$$c = \lambda\,\frac{|\mathbf{w}_1^t\mathbf{y}|}{\mathbf{y}^t\mathbf{y}}, \qquad \lambda > 0$$

so that the correction step becomes

$$c\,\mathbf{y} = \lambda\,\frac{|\mathbf{w}_1^t\mathbf{y}|}{\mathbf{y}^t\mathbf{y}}\,\mathbf{y}$$

(λ = 1 places the corrected weight vector exactly on the pattern hyperplane; λ = 2 reflects it about the hyperplane)

Discrete Perceptron Training Algorithm - Geometrical Representations (cont.)

• Example 3.3

Augmented training patterns and class membership:

$$\mathbf{y}_1 = \begin{bmatrix}1\\1\end{bmatrix}, \quad \mathbf{y}_2 = \begin{bmatrix}-0.5\\1\end{bmatrix}, \quad \mathbf{y}_3 = \begin{bmatrix}3\\1\end{bmatrix}, \quad \mathbf{y}_4 = \begin{bmatrix}-2\\1\end{bmatrix}$$

$$\mathbf{y}_1 \in C_1, \quad \mathbf{y}_2 \in C_2, \quad \mathbf{y}_3 \in C_1, \quad \mathbf{y}_4 \in C_2$$

Fixed-correction training rule:

$$\Delta\mathbf{w}^k = \frac{c}{2}\left[d_j - \operatorname{sgn}\left(\mathbf{w}^{k\,t}\mathbf{y}_j\right)\right]\mathbf{y}_j$$

What if $\mathbf{w}^{k\,t}\mathbf{y}_j = 0$? It is interpreted as a mistake and followed by a correction.
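A minimal sketch of this fixed-correction training loop on the four patterns above; the correction increment c, the initial weights, and the epoch cap are assumed for illustration.

```python
import numpy as np

# Augmented patterns and desired responses (+1 for C1, -1 for C2).
Y = np.array([[1.0, 1.0], [-0.5, 1.0], [3.0, 1.0], [-2.0, 1.0]])
d = np.array([1.0, -1.0, 1.0, -1.0])

c = 1.0                      # correction increment (assumed)
w = np.array([-2.5, 1.75])   # initial weights (assumed)

for epoch in range(100):
    errors = 0
    for yj, dj in zip(Y, d):
        net = w @ yj
        # sgn(net); net == 0 is treated as a mistake (scored opposite to d).
        o = 1.0 if net > 0 else (-1.0 if net < 0 else -dj)
        if o != dj:
            w = w + 0.5 * c * (dj - o) * yj
            errors += 1
    if errors == 0:          # all patterns classified correctly
        break

print(w)  # e.g. [3. 0.75]: w1*x + w2 > 0 exactly for the class 1 patterns
```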


Continuous Perceptron Training Algorithm

• Replace the TLU (threshold logic unit) with the sigmoid activation function, for two reasons:
– Gain finer control over the training procedure
– Facilitate the differential characteristics needed to compute the error gradient

$$\hat{\mathbf{w}} = \mathbf{w} - \eta\,\nabla E(\mathbf{w})$$

(η: learning constant; ∇E(w): error gradient)

Continuous Perceptron Training Algorithm (cont.)

• The new weight vector is obtained by moving in the direction of the negative gradient along the multidimensional error surface

Continuous Perceptron Training Algorithm (cont.)

• Define the error as the squared difference between the desired output and the actual output:

$$E = \frac{1}{2}(d - o)^2, \quad \text{or} \quad E = \frac{1}{2}\left[d - f\left(\mathbf{w}^t\mathbf{y}\right)\right]^2 = \frac{1}{2}\left[d - f(net)\right]^2$$

The error gradient:

$$\nabla E(\mathbf{w}) = \begin{bmatrix} \partial E/\partial w_1 \\ \partial E/\partial w_2 \\ \vdots \\ \partial E/\partial w_{n+1} \end{bmatrix} = -(d - o)\,f'(net)\begin{bmatrix} \partial(net)/\partial w_1 \\ \vdots \\ \partial(net)/\partial w_{n+1} \end{bmatrix} = -(d - o)\,f'(net)\,\mathbf{y}$$

since $net = \mathbf{w}^t\mathbf{y}$ implies $\partial(net)/\partial w_i = y_i$.

Continuous Perceptron Training Algorithm (cont.)

• Bipolar continuous activation function:

$$f(net) = \frac{2}{1 + \exp(-\lambda \cdot net)} - 1, \qquad f'(net) = \frac{2\lambda\exp(-\lambda \cdot net)}{\left[1 + \exp(-\lambda \cdot net)\right]^2} = \frac{\lambda}{2}\left(1 - o^2\right)$$

$$\hat{\mathbf{w}} = \mathbf{w} + \frac{1}{2}\,\eta\lambda\,(d - o)\left(1 - o^2\right)\mathbf{y}$$

• Unipolar continuous activation function:

$$f(net) = \frac{1}{1 + \exp(-\lambda \cdot net)}, \qquad f'(net) = \frac{\lambda\exp(-\lambda \cdot net)}{\left[1 + \exp(-\lambda \cdot net)\right]^2} = \lambda\,f(net)\left[1 - f(net)\right] = \lambda\,o\,(1 - o)$$

$$\hat{\mathbf{w}} = \mathbf{w} + \eta\lambda\,(d - o)\,o\,(1 - o)\,\mathbf{y}$$
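A small sketch of the bipolar delta-rule update, reusing the Example 3.3 patterns; η, λ, the initial weights, and the number of sweeps are assumed.

```python
import numpy as np

def f(net, lam=1.0):
    """Bipolar continuous activation function."""
    return 2.0 / (1.0 + np.exp(-lam * net)) - 1.0

Y = np.array([[1.0, 1.0], [-0.5, 1.0], [3.0, 1.0], [-2.0, 1.0]])
d = np.array([1.0, -1.0, 1.0, -1.0])

eta, lam = 0.5, 1.0          # assumed learning constant and gain
w = np.array([-2.5, 1.75])   # assumed initial weights

for epoch in range(500):
    for yj, dj in zip(Y, d):
        o = f(w @ yj, lam)
        # w <- w + (1/2) eta lam (d - o)(1 - o^2) y
        w = w + 0.5 * eta * lam * (dj - o) * (1.0 - o**2) * yj

print(w)  # weights approach a separating solution as the error decreases
```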


Continuous Perceptron Training Algorithm (cont.)

• Example 3.3 revisited with the bipolar continuous activation (λ = 1):

$$f(net) = \frac{2}{1 + \exp(-net)} - 1$$

Same training patterns:

$$\mathbf{y}_1 = \begin{bmatrix}1\\1\end{bmatrix}, \quad \mathbf{y}_2 = \begin{bmatrix}-0.5\\1\end{bmatrix}, \quad \mathbf{y}_3 = \begin{bmatrix}3\\1\end{bmatrix}, \quad \mathbf{y}_4 = \begin{bmatrix}-2\\1\end{bmatrix}$$

Continuous Perceptron Training Algorithm (cont.)

• Example 3.3: training trajectories started from four arbitrary initial weights

[Figure: the total error surface over the weight plane, with four gradient-descent trajectories]

Continuous Perceptron Training Algorithm (cont.)

• Treat the last, fixed component of the input pattern vector as the neuron activation threshold

Continuous Perceptron Training Algorithm (cont.)

• R-category linear classifier using R discrete bipolar perceptrons
– Goal: the i-th TLU responds with +1 to indicate class i, and all other TLUs respond with -1 (a "local representation"):

$$d_i = 1, \quad d_j = -1 \quad \text{for } j = 1, 2, \ldots, R, \; j \neq i$$

$$\hat{\mathbf{w}}_i = \mathbf{w}_i + \frac{c}{2}\,(d_i - o_i)\,\mathbf{y}$$

Continuous Perceptron Training Algorithm (cont.)

• Example 3.5

Continuous Perceptron Training Algorithm (cont.)

• R-category linear classifier using R continuous bipolar perceptrons:

$$\hat{\mathbf{w}}_i = \mathbf{w}_i + \frac{1}{2}\,\eta\lambda\,(d_i - o_i)\left(1 - o_i^2\right)\mathbf{y}, \qquad i = 1, 2, \ldots, R$$

$$d_i = 1, \quad d_j = -1 \quad \text{for } j = 1, 2, \ldots, R, \; j \neq i$$

Continuous Perceptron Training Algorithm (cont.)

• Error function dependent on the difference vector d - o

Bayes' Classifier vs. Perceptron

• The perceptron operates on the premise that the patterns to be classified are linearly separable (otherwise the training algorithm will oscillate), while the Bayes' classifier assumes that the (Gaussian) distributions of the two classes do overlap

• The perceptron is nonparametric, while the Bayes' classifier is parametric (its derivation is contingent on the assumption of the underlying distributions)

• The perceptron is simple and adaptive and needs little storage, while the Bayes' classifier can be made adaptive, but at the expense of increased storage and more complex computations


Homework

• P3.5, P3.7, P3.9, P3.22