
Single-Layer Perceptron Classifiers

Berlin Chen, 2002


Outline

• Foundations of trainable decision-making networks to be formulated
– Mapping from the input space to the output space (classification space)

• Focus on the classification of linearly separable classes of patterns
– Linear discriminant functions and a simple correction function
– Continuous error-function minimization

• Explanation and justification of the perceptron and delta training rules

Classification Model, Features,and Decision Regions

• A pattern is the quantitative description of an object, event, or phenomenon
– Spatial patterns: weather maps, fingerprints, …
– Temporal patterns: speech signals, …

• Pattern classification/recognition
– Assign the input data (a physical object, event, or phenomenon) to one of the pre-specified classes (categories)
– Discriminate the input data within the object population via the search for invariant attributes among members of the population

Classification Model, Features,and Decision Regions (cont.)

• The block diagram of the recognition and classification system

[Figure: block diagram of input data -> feature extractor (dimension reduction) -> classifier; a neural network can serve both for classification and for feature extraction]

Classification Model, Features,and Decision Regions (cont.)

• More about feature extraction
– The data compressed from the input patterns while preserving the salient information
– E.g., speech vowel sounds analyzed with a 16-channel filterbank provide 16-dimensional spectral vectors, which can be further transformed into two dimensions
• Tone height (high-low) and retraction (front-back)
– Input patterns are thus projected onto, and reduced to, lower dimensions (see the sketch below)
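To make the projection idea concrete, here is a minimal sketch that reduces hypothetical 16-dimensional spectral vectors to two dimensions with principal component analysis; PCA is an assumed choice for illustration, since the slides do not name a specific transform.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 100 vowel frames, each a 16-channel spectral vector.
X = rng.normal(size=(100, 16))

# PCA via SVD of the mean-centered data matrix.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Project onto the two leading principal directions; each row is now a
# 2-D feature vector (playing the role of tone height / retraction).
X2 = Xc @ Vt[:2].T
print(X2.shape)  # (100, 2)
```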


Classification Model, Features,and Decision Regions (cont.)

• More about feature extraction

[Figure: two-dimensional pattern clusters in (x, y) coordinates projected onto reduced coordinates (x', y')]

Classification Model, Features,and Decision Regions (cont.)

• Two simple ways to generate the pattern vectors for cases of spatial and temporal objects to be classified

• A pattern classifier maps input patterns (vectors) in E^n space into numbers in E^1 that specify the class membership:

$$i_0(\mathbf{x}) = j, \qquad j = 1, 2, \ldots, R$$

Classification Model, Features,and Decision Regions (cont.)

• Classification described in geometric terms
– Decision regions
– Decision surfaces: in general, the decision surfaces for n-dimensional patterns are (n-1)-dimensional hypersurfaces

$$i_0(\mathbf{x}) = j \quad \text{for all } \mathbf{x} \in \mathcal{X}_j, \; j = 1, 2, \ldots, R$$

[Figure: decision regions in the plane; the decision surfaces here are curved lines]

Discriminant Functions

• The classifier determines membership in a category by comparing the R discriminant functions g1(x), g2(x), …, gR(x)
– x lies within the region X_k if g_k(x) has the largest value:

$$i_0 = k \quad \text{if } g_k(\mathbf{x}) > g_j(\mathbf{x}) \text{ for } k, j = 1, 2, \ldots, R, \; j \neq k$$

[Figure: classifier block diagram; the training patterns x^1, x^2, …, x^p, …, x^P (with P >> n) are assumed to have been used already, i.e., the classifier has been designed. The input components x1, …, xn feed R discriminant units computing g1(x), …, gR(x), followed by a maximum selector.]
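A minimal sketch of this maximum-selector scheme, assuming linear discriminant functions g_i(x) = w_i^t x + w_{i,n+1}; the weights and biases below are invented for illustration.

```python
import numpy as np

def classify(x, W, b):
    """Return the index k of the largest discriminant g_k(x) = W[k] @ x + b[k]."""
    scores = W @ x + b      # one discriminant value per class
    return int(np.argmax(scores))

# Hypothetical 3-class linear machine over 2-D patterns.
W = np.array([[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]])
b = np.array([0.5, 1.0, -2.0])
print(classify(np.array([1.0, 1.0]), W, b))  # -> 0, i.e. class 1
```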


Discriminant Functions (cont.)

• Example 3.1

$$g(\mathbf{x}) = g_1(\mathbf{x}) - g_2(\mathbf{x}) = -2x_1 + x_2 + 2$$

$$g(\mathbf{x}) > 0: \text{class 1}, \qquad g(\mathbf{x}) < 0: \text{class 2}$$

Decision surface equation: g(x) = 0

– The decision surface does not uniquely specify the discriminant functions
– A classifier that classifies patterns into two classes or categories is called a "dichotomizer" (from the Greek roots for "two" and "cut")

Discriminant Functions (cont.)

[Figure: the pattern space of Example 3.1, with prototype points (1, 0) (class 1) and (0, -2) (class 2) and the decision surface -2x1 + x2 + 2 = 0]

Discriminant Functions (cont.)

Two discriminant functions constructed geometrically in (x1, x2, g) space. The figure shows planes built through the point (0, -2, 1) above the prototype points (1, 0) and (0, -2), with normal vectors such as [2, -1, 0], [-2, 1, 0], [0, 0, 1], and [2, -1, 1].

Solution 1 (plane normals [2, -1, 1] and [-2, 1, 1]):

$$(x_1 - 0, \; x_2 + 2, \; g_1 - 1) \cdot (2, -1, 1) = 0 \;\Rightarrow\; 2x_1 - x_2 - 2 + g_1 - 1 = 0 \;\Rightarrow\; g_1 = -2x_1 + x_2 + 3$$
$$(x_1 - 0, \; x_2 + 2, \; g_2 - 1) \cdot (-2, 1, 1) = 0 \;\Rightarrow\; -2x_1 + x_2 + 2 + g_2 - 1 = 0 \;\Rightarrow\; g_2 = 2x_1 - x_2 - 1$$
$$g = g_1 - g_2 = 0 \;\Rightarrow\; -4x_1 + 2x_2 + 4 = 0 \;\Rightarrow\; -2x_1 + x_2 + 2 = 0$$

Solution 2 (plane normals [2, -1, 2] and [-2, 1, 2]):

$$(x_1 - 0, \; x_2 + 2, \; g_1 - 1) \cdot (2, -1, 2) = 0 \;\Rightarrow\; 2x_1 - x_2 - 2 + 2g_1 - 2 = 0 \;\Rightarrow\; g_1 = -x_1 + \tfrac{1}{2}x_2 + 2$$
$$(x_1 - 0, \; x_2 + 2, \; g_2 - 1) \cdot (-2, 1, 2) = 0 \;\Rightarrow\; -2x_1 + x_2 + 2 + 2g_2 - 2 = 0 \;\Rightarrow\; g_2 = x_1 - \tfrac{1}{2}x_2$$
$$g = g_1 - g_2 = 0 \;\Rightarrow\; -2x_1 + x_2 + 2 = 0$$

Both solutions produce the same decision surface: an infinite number of discriminant functions will yield correct classification.

In vector form (Solution 1):

$$g_1(\mathbf{x}) = \begin{bmatrix} -2 & 1 \end{bmatrix}\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} + 3, \qquad g_2(\mathbf{x}) = \begin{bmatrix} 2 & -1 \end{bmatrix}\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} - 1$$

Discriminant Functions (cont.)

A multi-class classifier compares R discriminant functions; the two-class case reduces to subtraction of two discriminant functions followed by sign examination:

$$g(\mathbf{x}) = g_1(\mathbf{x}) - g_2(\mathbf{x})$$

$$g(\mathbf{x}) > 0: \text{class 1}, \qquad g(\mathbf{x}) < 0: \text{class 2}$$

Discriminant Functions (cont.)

The design of the discriminator for this case is not straightforward; the discriminant functions may turn out to be nonlinear functions of x1 and x2.

Bayes’ Decision Theory

• Decision making based on both the posterior knowledge obtained from specific observation data and prior knowledge of the categories
– Prior class probabilities: $P(\omega_i)$, for every class i
– Class-conditional probabilities: $P(x \mid \omega_i)$, for every class i

$$k = \arg\max_i P(\omega_i \mid x) = \arg\max_i \frac{P(x \mid \omega_i)\,P(\omega_i)}{P(x)} = \arg\max_i \frac{P(x \mid \omega_i)\,P(\omega_i)}{\sum_{j=1}^{R} P(x \mid \omega_j)\,P(\omega_j)}$$

$$\Rightarrow\; k = \arg\max_i P(\omega_i \mid x) = \arg\max_i P(x \mid \omega_i)\,P(\omega_i)$$
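A tiny numeric sketch of this decision rule, with invented priors and class-conditional likelihoods; the evidence P(x) is omitted since it is common to all classes.

```python
import numpy as np

# Hypothetical two-class problem, evaluated at a single observation x.
priors = np.array([0.6, 0.4])        # P(w1), P(w2)
likelihoods = np.array([0.2, 0.5])   # P(x | w1), P(x | w2)

# argmax_i P(w_i | x) = argmax_i P(x | w_i) P(w_i)
k = int(np.argmax(likelihoods * priors))
print(k + 1)  # -> 2, since 0.5 * 0.4 = 0.20 > 0.2 * 0.6 = 0.12
```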


Bayes’ Decision Theory (cont.)

• Bayes' decision rule is designed to minimize the overall risk involved in making a decision
– The expected loss (conditional risk) when making decision $\delta_i$:

$$R(\delta_i \mid x) = \sum_j l(\delta_i, \omega_j)\,P(\omega_j \mid x), \quad \text{where } l(\delta_i, \omega_j) = \begin{cases} 0, & i = j \\ 1, & i \neq j \end{cases}$$

$$\Rightarrow\; R(\delta_i \mid x) = \sum_{j \neq i} P(\omega_j \mid x) = 1 - P(\omega_i \mid x)$$

• The overall risk (Bayes' risk):

$$R = \int_{-\infty}^{\infty} R(\delta(x) \mid x)\,p(x)\,dx, \qquad \delta(x): \text{the decision selected for a sample } x$$

– Minimize the overall risk (classification error) by computing the conditional risks and selecting the decision $\delta_i$ for which the conditional risk $R(\delta_i \mid x)$ is minimum, i.e., for which $P(\omega_i \mid x)$ is maximum (minimum-error-rate decision rule)

Bayes’ Decision Theory (cont.)

• Two-class pattern classification

Likelihood ratio or log-likelihood ratio (Bayes' classifier):

$$l(x) = \frac{P(x \mid \omega_1)}{P(x \mid \omega_2)} \;\underset{\omega_2}{\overset{\omega_1}{\gtrless}}\; \frac{P(\omega_2)}{P(\omega_1)}$$

$$\log l(x) = \log P(x \mid \omega_1) - \log P(x \mid \omega_2) \;\underset{\omega_2}{\overset{\omega_1}{\gtrless}}\; \log P(\omega_2) - \log P(\omega_1)$$

Equivalently:

$$P(x \mid \omega_1)\,P(\omega_1) \;\underset{\omega_2}{\overset{\omega_1}{\gtrless}}\; P(x \mid \omega_2)\,P(\omega_2)$$

with discriminant functions

$$g_1(x) = P(\omega_1 \mid x) \cong P(x \mid \omega_1)\,P(\omega_1), \qquad g_2(x) = P(\omega_2 \mid x) \cong P(x \mid \omega_2)\,P(\omega_2)$$

Classification error:

$$p(\text{error}) = P(x \in R_2, \omega_1) + P(x \in R_1, \omega_2) = P(x \in R_2 \mid \omega_1)\,P(\omega_1) + P(x \in R_1 \mid \omega_2)\,P(\omega_2)$$

$$= \int_{R_2} P(x \mid \omega_1)\,P(\omega_1)\,dx + \int_{R_1} P(x \mid \omega_2)\,P(\omega_2)\,dx$$

Bayes’ Decision Theory (cont.)

• When the environment is multivariate Gaussian, the Bayes' classifier reduces to a linear classifier
– The same form taken by the perceptron
– But the linear nature of the perceptron is not contingent on the assumption of Gaussianity

Assumptions:

$$\text{Class } \omega_1: \; E[\mathbf{X}] = \boldsymbol{\mu}_1, \quad E\left[(\mathbf{X} - \boldsymbol{\mu}_1)(\mathbf{X} - \boldsymbol{\mu}_1)^t\right] = \boldsymbol{\Sigma}$$
$$\text{Class } \omega_2: \; E[\mathbf{X}] = \boldsymbol{\mu}_2, \quad E\left[(\mathbf{X} - \boldsymbol{\mu}_2)(\mathbf{X} - \boldsymbol{\mu}_2)^t\right] = \boldsymbol{\Sigma}$$

$$P(\mathbf{x} \mid \omega_i) = \frac{1}{(2\pi)^{n/2}\,|\boldsymbol{\Sigma}|^{1/2}} \exp\left[-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_i)^t \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}_i)\right]$$

$$P(\omega_1) = P(\omega_2) = \frac{1}{2}$$

Bayes’ Decision Theory (cont.)

• When the environment is Gaussian, the Bayes' classifier reduces to a linear classifier (cont.)

$$\log l(\mathbf{x}) = \log P(\mathbf{x} \mid \omega_1) - \log P(\mathbf{x} \mid \omega_2)$$
$$= -\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_1)^t \boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu}_1) + \frac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_2)^t \boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu}_2)$$
$$= (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)^t \boldsymbol{\Sigma}^{-1}\mathbf{x} + \frac{1}{2}\left(\boldsymbol{\mu}_2^t \boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1^t \boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_1\right)$$
$$= \mathbf{w}^t\mathbf{x} + b$$

Since the priors are equal, the threshold $\log[P(\omega_2)/P(\omega_1)]$ is zero:

$$\therefore\; \log l(\mathbf{x}) = \mathbf{w}^t\mathbf{x} + b \;\underset{\omega_2}{\overset{\omega_1}{\gtrless}}\; 0$$
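A small sketch of this reduction, computing w and b from assumed means and a shared covariance (all numbers invented for illustration).

```python
import numpy as np

# Assumed class statistics (equal covariance, equal priors).
mu1 = np.array([1.0, 1.0])
mu2 = np.array([-1.0, 0.0])
Sigma = np.array([[1.0, 0.2],
                  [0.2, 1.0]])
Sinv = np.linalg.inv(Sigma)

# Bayes classifier in linear form: log l(x) = w^T x + b.
w = Sinv @ (mu1 - mu2)
b = 0.5 * (mu2 @ Sinv @ mu2 - mu1 @ Sinv @ mu1)

x = np.array([0.5, 0.3])
print("class 1" if w @ x + b > 0 else "class 2")  # -> class 1
```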


Bayes’ Decision Theory (cont.)

• Multi-class pattern classification


Linear Machine and Minimum Distance Classification

• Find the linear-form discriminant function for two-class classification when the class prototypes are known

• Example 3.1: Select the decision hyperplane that contains the midpoint of the line segment connecting the center points of the two classes

Linear Machine and Minimum Distance Classification (cont.)

The dichotomizer's discriminant function g(x) is the hyperplane through the midpoint of the segment joining the class prototypes x1 and x2, normal to that segment:

$$g(\mathbf{x}) = (\mathbf{x}_1 - \mathbf{x}_2)^t\left(\mathbf{x} - \frac{\mathbf{x}_1 + \mathbf{x}_2}{2}\right) = 0$$

$$\Rightarrow\; (\mathbf{x}_1 - \mathbf{x}_2)^t\,\mathbf{x} + \frac{1}{2}\left(\|\mathbf{x}_2\|^2 - \|\mathbf{x}_1\|^2\right) = 0$$

Taken as $\mathbf{w}^t\mathbf{y} = 0$ with the augmented input pattern $\mathbf{y} = [\mathbf{x}^t \; 1]^t$, where

$$\mathbf{w}_{1..n} = \mathbf{x}_1 - \mathbf{x}_2, \qquad w_{n+1} = \frac{1}{2}\left(\|\mathbf{x}_2\|^2 - \|\mathbf{x}_1\|^2\right)$$

It is a simple minimum-distance classifier.

Linear Machine and Minimum Distance Classification (cont.)

• The linear-form discriminant functions for multi-class classification
– There are up to R(R-1)/2 decision hyperplanes for R pairwise separable classes
– Some classes may not be contiguous

[Figure: two scatter plots of three pattern classes (x, o, Δ); in the second plot the region of one class is not contiguous]

Linear Machine and Minimum Distance Classification (cont.)

• Linear machine or minimum-distance classifier
– Assume the class prototypes xi are known for all classes

• Euclidean distance between an input pattern x and the center (prototype) of class i:

$$\|\mathbf{x} - \mathbf{x}_i\| = \sqrt{(\mathbf{x} - \mathbf{x}_i)^t(\mathbf{x} - \mathbf{x}_i)}$$

• Since

$$\|\mathbf{x} - \mathbf{x}_i\|^2 = \mathbf{x}^t\mathbf{x} - 2\,\mathbf{x}_i^t\mathbf{x} + \mathbf{x}_i^t\mathbf{x}_i$$

and the term $\mathbf{x}^t\mathbf{x}$ is the same for all classes, minimizing $\|\mathbf{x} - \mathbf{x}_i\|$ is equal to maximizing $\mathbf{x}_i^t\mathbf{x} - \frac{1}{2}\mathbf{x}_i^t\mathbf{x}_i$

– Set the discriminant function for each class i to be:

$$g_i(\mathbf{x}) = \mathbf{x}_i^t\mathbf{x} - \frac{1}{2}\,\mathbf{x}_i^t\mathbf{x}_i$$

– With the augmented input pattern y, $g_i(\mathbf{y}) = \mathbf{w}_i^t\mathbf{y}$, where

$$\mathbf{w}_i = \begin{bmatrix}\mathbf{x}_i \\ w_{i,n+1}\end{bmatrix}, \qquad w_{i,n+1} = -\frac{1}{2}\,\mathbf{x}_i^t\mathbf{x}_i$$

Linear Machine and Minimum Distance Classification (cont.)

This approach is also called correlation classification. A 1 is appended as the (n+1)-th component of the input pattern:

$$g_i(\mathbf{x}) = \mathbf{x}_i^t\mathbf{x} - \frac{1}{2}\,\mathbf{x}_i^t\mathbf{x}_i \quad\Longleftrightarrow\quad g_i(\mathbf{y}) = \mathbf{w}_i^t\mathbf{y}$$

Linear Machine and Minimum Distance Classification (cont.)

• Example 3.2

Prototype points $\mathbf{x}_1 = [10 \;\; 2]^t$, $\mathbf{x}_2 = [2 \;\; -5]^t$, $\mathbf{x}_3 = [-5 \;\; 5]^t$ give the weight vectors

$$\mathbf{w}_1 = \begin{bmatrix}10\\2\\-52\end{bmatrix}, \qquad \mathbf{w}_2 = \begin{bmatrix}2\\-5\\-14.5\end{bmatrix}, \qquad \mathbf{w}_3 = \begin{bmatrix}-5\\5\\-25\end{bmatrix}$$

From $g_i(\mathbf{x}) = \mathbf{x}_i^t\mathbf{x} - \frac{1}{2}\mathbf{x}_i^t\mathbf{x}_i$:

$$g_1(\mathbf{x}) = 10x_1 + 2x_2 - 52, \qquad g_2(\mathbf{x}) = 2x_1 - 5x_2 - 14.5, \qquad g_3(\mathbf{x}) = -5x_1 + 5x_2 - 25$$

Decision lines $S_{ij}: g_i(\mathbf{x}) - g_j(\mathbf{x}) = 0$:

$$S_{12}: 8x_1 + 7x_2 - 37.5 = 0, \qquad S_{13}: 15x_1 - 3x_2 - 27 = 0, \qquad S_{23}: 7x_1 - 10x_2 + 10.5 = 0$$
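A short sketch reproducing this linear machine directly from the prototype points of Example 3.2 (numbers as above; the test point is invented).

```python
import numpy as np

# Prototype points of the three classes (Example 3.2).
protos = np.array([[10.0, 2.0], [2.0, -5.0], [-5.0, 5.0]])

# Minimum-distance weights: w_i = [x_i, -1/2 x_i^T x_i].
W = np.hstack([protos, -0.5 * np.sum(protos**2, axis=1, keepdims=True)])

def classify(x):
    y = np.append(x, 1.0)              # augmented input pattern
    return int(np.argmax(W @ y)) + 1   # class index 1..3

print(W[0] - W[1])                     # S12 coefficients: [8. 7. -37.5]
print(classify(np.array([9.0, 3.0])))  # -> 1 (closest to prototype x1)
```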


Linear Machine and Minimum Distance Classification (cont.)

• If R linear discriminant functions exist for a set of patterns such that

$$g_i(\mathbf{x}) > g_j(\mathbf{x}) \quad \text{for all } \mathbf{x} \in \text{class } i; \;\; i, j = 1, 2, \ldots, R; \;\; j \neq i$$

– then the classes are linearly separable


Linear Machine and Minimum Distance Classification (cont.)

(a) 2x1 - x2 + 2 = 0: the decision surface is a line
(b) 2x1 - x2 + 2 = 0: the decision surface is a plane
(c) x1 = [2, 5]^t, x2 = [-1, -3]^t => the decision surface for the minimum-distance classifier:

$$(\mathbf{x}_1 - \mathbf{x}_2)^t\mathbf{x} + \frac{1}{2}\left(\|\mathbf{x}_2\|^2 - \|\mathbf{x}_1\|^2\right) = 0 \;\Rightarrow\; 3x_1 + 8x_2 - \frac{19}{2} = 0$$

(d) [Figure panels: the lines above drawn in the (x1, x2) plane through points such as (-1, 0) and (0, 2), with intercepts (19/6, 0) and (19/16, 0) for the minimum-distance case]

Linear Machine and Minimum Distance Classification (cont.)

• Examples 3.1 and 3.2 have shown that the coefficients (weights) of the linear discriminant functions can be determined if a priori information about the sets of patterns and their class membership is known

Linear Machine and Minimum Distance Classification (cont.)

• An example of linearly non-separable patterns

Linear Machine and Minimum Distance Classification (cont.)

Two TLUs implement the decision lines

$$x_1 + x_2 + 1 = 0, \qquad -x_1 - x_2 + 1 = 0$$

with weights of ±1 on the inputs and a fixed bias input. In the image space (o1, o2), the patterns map onto the corners (1, 1), (-1, 1), (1, -1), (-1, -1) and become linearly separable by the line

$$o_1 + o_2 - 1 = 0$$

[Figure: the two-TLU layered network, the two parallel decision lines in the (x1, x2) plane, and the mapped patterns in the (o1, o2) plane]

A sketch of this mapping follows below.
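A minimal sketch of this layered mapping, assuming bipolar TLUs (sign activations); patterns between the two parallel lines map to (1, 1) in the image space and are then separated by o1 + o2 - 1 > 0.

```python
import numpy as np

def tlu(net):
    """Bipolar threshold logic unit (sign activation)."""
    return 1.0 if net > 0 else -1.0

def classify(x):
    # First layer: the two decision lines in the input space.
    o1 = tlu(x[0] + x[1] + 1.0)    #  x1 + x2 + 1 = 0
    o2 = tlu(-x[0] - x[1] + 1.0)   # -x1 - x2 + 1 = 0
    # Output TLU separates the mapped patterns in (o1, o2) space.
    return tlu(o1 + o2 - 1.0)

for x in ([0.0, 0.0], [2.0, 2.0], [-2.0, -2.0]):
    print(x, classify(np.array(x)))   # +1 between the lines, -1 outside
```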


Discrete Perceptron Training Algorithm - Geometrical Representations

• Examine neural network classifiers that derive their weights during training, based on the error-correction scheme

$$g(\mathbf{y}) = \mathbf{w}^t\mathbf{y} \qquad (\mathbf{y}: \text{augmented input pattern})$$

Vector representations in the weight space:

$$\text{Class 1: } \mathbf{w}^t\mathbf{y} > 0, \qquad \text{Class 2: } \mathbf{w}^t\mathbf{y} < 0$$

Discrete Perceptron Training Algorithm - Geometrical Representations (cont.)

• Devise an analytic approach based on the geometrical representations
– E.g., the decision surface for the training pattern y1 (a hyperplane in the weight space)

If y1 is in class 1 but currently misclassified, move the weights in the direction of the gradient of w^t y1, which is the direction of steepest increase:

$$\nabla_{\mathbf{w}}\left(\mathbf{w}^t\mathbf{y}_1\right) = \mathbf{y}_1 \;\Rightarrow\; \mathbf{w}' = \mathbf{w} + c\,\mathbf{y}_1$$

If y1 is in class 2 but currently misclassified:

$$\mathbf{w}' = \mathbf{w} - c\,\mathbf{y}_1$$

c (> 0) is the correction increment (two times the learning constant introduced before); c controls the size of the adjustment.

Discrete Perceptron Training Algorithm - Geometrical Representations (cont.)

Weight adjustments for three augmented training patterns y1, y2, y3, shown in the weight space, with

$$\mathbf{y}_1 \in C_1, \qquad \mathbf{y}_2 \in C_1, \qquad \mathbf{y}_3 \in C_2$$

– Weights in the shaded region are the solutions
– The three lines (the pattern hyperplanes) are fixed during training

Discrete Perceptron Training Algorithm - Geometrical Representations (cont.)

• More about the correction increment c
– It need not be merely a constant; it can be related to the current training pattern

The distance p from the current weight vector w1 to the decision plane w^t y = 0 in the weight space:

$$p = \frac{|\mathbf{w}_1^t\mathbf{y}|}{\|\mathbf{y}\|}$$

Requiring the correction step c·y to move the weight vector onto that plane gives (since y^t y > 0):

$$c = \frac{|\mathbf{w}_1^t\mathbf{y}|}{\mathbf{y}^t\mathbf{y}}$$

This shows how to select the correction increment based on the distance between w1 and the corrected weight vector.

Discrete Perceptron Training Algorithm - Geometrical Representations (cont.)

• For the fixed correction rule with c = constant, the correction of weights is always the same fixed portion of the current training vector:

$$\mathbf{w}' = \mathbf{w} \pm c\,\mathbf{y}, \quad \text{equivalently} \quad \mathbf{w}' = \mathbf{w} + \Delta\mathbf{w}, \qquad \Delta\mathbf{w} = \frac{c}{2}\left[d - \operatorname{sgn}\left(\mathbf{w}^t\mathbf{y}\right)\right]\mathbf{y}$$

(for d in {+1, -1} the bracket equals ±2 on a mistake and 0 otherwise)

– The weights can be initialized at any value

• For the dynamic correction rule with c dependent on the distance from the weight vector to the decision surface in the weight space:

$$c = \frac{|\mathbf{w}^t\mathbf{y}|}{\mathbf{y}^t\mathbf{y}}$$

– The initial weights must be different from 0

Discrete Perceptron Training Algorithm - Geometrical Representations (cont.)

• Dynamic correction rule with c dependent on the distance from the weight vector to the decision surface:

$$c = \lambda\,\frac{|\mathbf{w}_1^t\mathbf{y}|}{\mathbf{y}^t\mathbf{y}}, \qquad \lambda > 0$$

so that the correction step becomes

$$c\,\mathbf{y} = \lambda\,\frac{|\mathbf{w}_1^t\mathbf{y}|}{\mathbf{y}^t\mathbf{y}}\,\mathbf{y}$$

(λ = 1 places the corrected weight vector exactly on the pattern hyperplane; λ = 2 reflects it about the hyperplane)

Discrete Perceptron Training Algorithm - Geometrical Representations (cont.)

• Example 3.3

Augmented training patterns and class membership:

$$\mathbf{y}_1 = \begin{bmatrix}1\\1\end{bmatrix}, \quad \mathbf{y}_2 = \begin{bmatrix}-0.5\\1\end{bmatrix}, \quad \mathbf{y}_3 = \begin{bmatrix}3\\1\end{bmatrix}, \quad \mathbf{y}_4 = \begin{bmatrix}-2\\1\end{bmatrix}$$

$$\mathbf{y}_1 \in C_1, \quad \mathbf{y}_2 \in C_2, \quad \mathbf{y}_3 \in C_1, \quad \mathbf{y}_4 \in C_2$$

Fixed-correction training rule:

$$\Delta\mathbf{w}^k = \frac{c}{2}\left[d_j - \operatorname{sgn}\left(\mathbf{w}^{k\,t}\mathbf{y}_j\right)\right]\mathbf{y}_j$$

What if $\mathbf{w}^{k\,t}\mathbf{y}_j = 0$? It is interpreted as a mistake and followed by a correction.
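A minimal sketch of this fixed-correction training loop on the four patterns above; the correction increment c, the initial weights, and the epoch cap are assumed for illustration.

```python
import numpy as np

# Augmented patterns and desired responses (+1 for C1, -1 for C2).
Y = np.array([[1.0, 1.0], [-0.5, 1.0], [3.0, 1.0], [-2.0, 1.0]])
d = np.array([1.0, -1.0, 1.0, -1.0])

c = 1.0                      # correction increment (assumed)
w = np.array([-2.5, 1.75])   # initial weights (assumed)

for epoch in range(100):
    errors = 0
    for yj, dj in zip(Y, d):
        net = w @ yj
        # sgn(net); net == 0 is treated as a mistake (scored opposite to d).
        o = 1.0 if net > 0 else (-1.0 if net < 0 else -dj)
        if o != dj:
            w = w + 0.5 * c * (dj - o) * yj
            errors += 1
    if errors == 0:          # all patterns classified correctly
        break

print(w)  # e.g. [3. 0.75]: w1*x + w2 > 0 exactly for the class 1 patterns
```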


Continuous Perceptron Training Algorithm

• Replace the TLU (threshold logic unit) with the sigmoid activation function, for two reasons:
– Gain finer control over the training procedure
– Facilitate the differential characteristics needed to compute the error gradient

$$\hat{\mathbf{w}} = \mathbf{w} - \eta\,\nabla E(\mathbf{w})$$

(η: learning constant; ∇E(w): error gradient)

Continuous Perceptron Training Algorithm (cont.)

• The new weight vector is obtained by moving in the direction of the negative gradient along the multidimensional error surface

Continuous Perceptron Training Algorithm (cont.)

• Define the error as the squared difference between the desired output and the actual output:

$$E = \frac{1}{2}(d - o)^2, \quad \text{or} \quad E = \frac{1}{2}\left[d - f\left(\mathbf{w}^t\mathbf{y}\right)\right]^2 = \frac{1}{2}\left[d - f(net)\right]^2$$

The error gradient:

$$\nabla E(\mathbf{w}) = \begin{bmatrix} \partial E/\partial w_1 \\ \partial E/\partial w_2 \\ \vdots \\ \partial E/\partial w_{n+1} \end{bmatrix} = -(d - o)\,f'(net)\begin{bmatrix} \partial(net)/\partial w_1 \\ \vdots \\ \partial(net)/\partial w_{n+1} \end{bmatrix} = -(d - o)\,f'(net)\,\mathbf{y}$$

since $net = \mathbf{w}^t\mathbf{y}$ implies $\partial(net)/\partial w_i = y_i$.

Continuous Perceptron Training Algorithm (cont.)

• Bipolar continuous activation function:

$$f(net) = \frac{2}{1 + \exp(-\lambda \cdot net)} - 1, \qquad f'(net) = \frac{2\lambda\exp(-\lambda \cdot net)}{\left[1 + \exp(-\lambda \cdot net)\right]^2} = \frac{\lambda}{2}\left(1 - o^2\right)$$

$$\hat{\mathbf{w}} = \mathbf{w} + \frac{1}{2}\,\eta\lambda\,(d - o)\left(1 - o^2\right)\mathbf{y}$$

• Unipolar continuous activation function:

$$f(net) = \frac{1}{1 + \exp(-\lambda \cdot net)}, \qquad f'(net) = \frac{\lambda\exp(-\lambda \cdot net)}{\left[1 + \exp(-\lambda \cdot net)\right]^2} = \lambda\,f(net)\left[1 - f(net)\right] = \lambda\,o\,(1 - o)$$

$$\hat{\mathbf{w}} = \mathbf{w} + \eta\lambda\,(d - o)\,o\,(1 - o)\,\mathbf{y}$$
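A small sketch of the bipolar delta-rule update, reusing the Example 3.3 patterns; η, λ, the initial weights, and the number of sweeps are assumed.

```python
import numpy as np

def f(net, lam=1.0):
    """Bipolar continuous activation function."""
    return 2.0 / (1.0 + np.exp(-lam * net)) - 1.0

Y = np.array([[1.0, 1.0], [-0.5, 1.0], [3.0, 1.0], [-2.0, 1.0]])
d = np.array([1.0, -1.0, 1.0, -1.0])

eta, lam = 0.5, 1.0          # assumed learning constant and gain
w = np.array([-2.5, 1.75])   # assumed initial weights

for epoch in range(500):
    for yj, dj in zip(Y, d):
        o = f(w @ yj, lam)
        # w <- w + (1/2) eta lam (d - o)(1 - o^2) y
        w = w + 0.5 * eta * lam * (dj - o) * (1.0 - o**2) * yj

print(w)  # weights approach a separating solution as the error decreases
```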


Continuous Perceptron Training Algorithm (cont.)

• Example 3.3 revisited with the bipolar continuous activation (λ = 1):

$$f(net) = \frac{2}{1 + \exp(-net)} - 1$$

Same training patterns:

$$\mathbf{y}_1 = \begin{bmatrix}1\\1\end{bmatrix}, \quad \mathbf{y}_2 = \begin{bmatrix}-0.5\\1\end{bmatrix}, \quad \mathbf{y}_3 = \begin{bmatrix}3\\1\end{bmatrix}, \quad \mathbf{y}_4 = \begin{bmatrix}-2\\1\end{bmatrix}$$

Continuous Perceptron Training Algorithm (cont.)

• Example 3.3: training trajectories started from four arbitrary initial weights

[Figure: the total error surface over the weight plane, with four gradient-descent trajectories]

Continuous Perceptron Training Algorithm (cont.)

• Treat the last, fixed component of the input pattern vector as the neuron activation threshold

Continuous Perceptron Training Algorithm (cont.)

• R-category linear classifier using R discrete bipolar perceptrons
– Goal: the i-th TLU responds with +1 to indicate class i, and all other TLUs respond with -1 (a "local representation"):

$$d_i = 1, \quad d_j = -1 \quad \text{for } j = 1, 2, \ldots, R, \; j \neq i$$

$$\hat{\mathbf{w}}_i = \mathbf{w}_i + \frac{c}{2}\,(d_i - o_i)\,\mathbf{y}$$

Continuous Perceptron Training Algorithm (cont.)

• Example 3.5

Continuous Perceptron Training Algorithm (cont.)

• R-category linear classifier using R continuous bipolar perceptrons:

$$\hat{\mathbf{w}}_i = \mathbf{w}_i + \frac{1}{2}\,\eta\lambda\,(d_i - o_i)\left(1 - o_i^2\right)\mathbf{y}, \qquad i = 1, 2, \ldots, R$$

$$d_i = 1, \quad d_j = -1 \quad \text{for } j = 1, 2, \ldots, R, \; j \neq i$$

Continuous Perceptron Training Algorithm (cont.)

• Error function dependent on the difference vector d - o

Bayes' Classifier vs. Perceptron

• The perceptron operates on the premise that the patterns to be classified are linearly separable (otherwise the training algorithm will oscillate), while the Bayes' classifier assumes that the (Gaussian) distributions of the two classes do overlap

• The perceptron is nonparametric, while the Bayes' classifier is parametric (its derivation is contingent on the assumption of the underlying distributions)

• The perceptron is simple and adaptive and needs little storage, while the Bayes' classifier can be made adaptive, but at the expense of increased storage and more complex computations


Homework

• P3.5, P3.7, P3.9, P3.22