
Backpropagation

Hung-yi Lee

Background

• Cost function $C(\theta)$
  • Given training examples: $(x^1, \hat{y}^1), \cdots, (x^r, \hat{y}^r), \cdots, (x^R, \hat{y}^R)$
  • Find a set of parameters $\theta^*$ minimizing $C(\theta)$
  • $C(\theta) = \dfrac{1}{R}\sum_r C^r(\theta)$, where $C^r(\theta) = \left\| f(x^r; \theta) - \hat{y}^r \right\|$

• Gradient descent
  • $\nabla C(\theta) = \dfrac{1}{R}\sum_r \nabla C^r(\theta)$
  • Given $w_{ij}^l$ and $b_i^l$, we have to compute $\partial C^r / \partial w_{ij}^l$ and $\partial C^r / \partial b_i^l$

• There is an efficient way to compute the gradients of the network parameters: backpropagation.
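A minimal sketch of this setup in NumPy (not from the slides): the cost is the average of per-example costs, so one gradient-descent step averages the per-example gradients. The linear "network" $f(x;\theta)=Wx+b$, the squared-error cost, and the learning rate `eta` are illustrative assumptions.

```python
import numpy as np

# Illustrative assumption: f(x; theta) = W x + b and a squared-error
# per-example cost C^r = ||f(x^r; theta) - yhat^r||^2.
def grad_Cr(W, b, x, y_hat):
    err = W @ x + b - y_hat                  # f(x^r; theta) - yhat^r
    return 2 * np.outer(err, x), 2 * err     # dC^r/dW, dC^r/db

def gradient_descent_step(W, b, xs, ys, eta=0.1):
    # nabla C(theta) = (1/R) * sum_r nabla C^r(theta)
    gW, gb = np.zeros_like(W), np.zeros_like(b)
    for x, y_hat in zip(xs, ys):
        dW, db = grad_Cr(W, b, x, y_hat)
        gW += dW
        gb += db
    R = len(xs)
    return W - eta * gW / R, b - eta * gb / R
```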

Chain Rule

• Case 1: $y = g(x)$, $z = h(y)$
  $\Delta x \rightarrow \Delta y \rightarrow \Delta z$, so $\dfrac{dz}{dx} = \dfrac{dz}{dy}\dfrac{dy}{dx}$

• Case 2: $x = g(s)$, $y = h(s)$, $z = k(x, y)$
  $\Delta s \rightarrow \Delta x \rightarrow \Delta z$ and $\Delta s \rightarrow \Delta y \rightarrow \Delta z$, so $\dfrac{dz}{ds} = \dfrac{\partial z}{\partial x}\dfrac{dx}{ds} + \dfrac{\partial z}{\partial y}\dfrac{dy}{ds}$
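A quick numerical sanity check of the two cases (my own illustration; the functions $g$, $h$, $k$ are arbitrary choices): the analytic chain-rule derivative should match a finite-difference estimate.

```python
import numpy as np

eps = 1e-6

# Case 1: y = g(x), z = h(y)  =>  dz/dx = (dz/dy)(dy/dx)
g, dg = np.sin, np.cos
h, dh = np.exp, np.exp
x = 0.7
analytic1 = dh(g(x)) * dg(x)
numeric1 = (h(g(x + eps)) - h(g(x - eps))) / (2 * eps)
print(analytic1, numeric1)   # the two values agree to ~1e-9

# Case 2: x = g(s), y = h(s), z = k(x, y)
# dz/ds = (dz/dx)(dx/ds) + (dz/dy)(dy/ds)
k = lambda x, y: x * y ** 2
s = 0.3
xs, ys = np.sin(s), np.exp(s)
analytic2 = (ys ** 2) * np.cos(s) + (2 * xs * ys) * np.exp(s)
numeric2 = (k(np.sin(s + eps), np.exp(s + eps)) -
            k(np.sin(s - eps), np.exp(s - eps))) / (2 * eps)
print(analytic2, numeric2)
```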

๐œ•๐ถ๐‘Ÿ ๐œ•๐‘ค๐‘–๐‘—๐‘™

โ€ข is the multiplication of two termsl

ij

r

w

C

โ€ฆโ€ฆ

1

2

j

โ€ฆโ€ฆ

1

2

il

ijw l

iz l

ia

rl

i

l

ij ฮ”Cฮ”zฮ”w

l

i

r

l

ij

l

i

l

ij

r

z

C

w

z

w

C

Layer lLayer 1l

๐œ•๐ถ๐‘Ÿ ๐œ•๐‘ค๐‘–๐‘—๐‘™ - First Term

โ€ข is the multiplication of two termsl

ij

r

w

C

โ€ฆโ€ฆ

1

2

j

โ€ฆโ€ฆ

1

2

il

ijw l

iz l

ia

rl

i

l

ij ฮ”Cฮ”zฮ”w

l

i

r

l

ij

l

i

l

ij

r

z

C

w

z

w

C

Layer lLayer 1l

๐œ•๐ถ๐‘Ÿ ๐œ•๐‘ค๐‘–๐‘—๐‘™ - First Term

โ€ฆโ€ฆ

Layer l-1

1

2

jโ€ฆ

โ€ฆ

1

2

i

Layer l

l

i

r

l

ij

l

i

l

ij

r

z

C

w

z

w

C

l

i

l

j

j

l

ij

l

i bawz 1 1

l

jl

ij

l

i aw

zIf l > 1

l

ijw l

iz

๐œ•๐ถ๐‘Ÿ ๐œ•๐‘ค๐‘–๐‘—๐‘™ - First Term

l

i

r

l

ij

l

i

l

ij

r

z

C

w

z

w

C

If l = 1

111

i

r

j

j

iji bxwz r

j

ij

i xw

z

1

1

โ€ฆโ€ฆ

Input

โ€ฆโ€ฆ

1

2

i

Layer 1

rx1

rx2

r

jx1

ijw1

iz

l

i

l

j

j

l

ij

l

i bawz 1 1

l

jl

ij

l

i aw

zIf l > 1
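In code the first term costs nothing extra: it is just the value entering the weight, i.e. the previous-layer activation $a_j^{l-1}$ (or the input $x_j^r$ for $l = 1$), which the forward pass has already produced. A minimal sketch, with sigmoid as an assumed activation:

```python
import numpy as np

sigma = lambda z: 1.0 / (1.0 + np.exp(-z))   # assumed activation function

def forward_layer(W, b, a_prev):
    """a_prev holds a^{l-1} (or the input x^r when l = 1)."""
    z = W @ a_prev + b        # z_i^l = sum_j w_ij^l a_j^{l-1} + b_i^l
    return z, sigma(z)

# First term: dz_i^l / dw_ij^l = a_prev[j] -- no extra computation needed,
# the forward pass has already produced it.
```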

๐œ•๐ถ๐‘Ÿ ๐œ•๐‘ค๐‘–๐‘—๐‘™ - Second Term

โ€ข is the multiplication of two termsl

ij

r

w

C

โ€ฆโ€ฆ

1

2

j

โ€ฆโ€ฆ

1

2

il

ijw l

iz l

ia

rl

i

l

ij ฮ”Cฮ”zฮ”w

l

i

r

l

ij

l

i

l

ij

r

z

C

w

z

w

C

Layer lLayer 1l

l

i

๐œ•๐ถ๐‘Ÿ ๐œ•๐‘ค๐‘–๐‘—๐‘™ - Second Term

โ€ฆโ€ฆ

Layer l-1

1

2

jโ€ฆ

โ€ฆ

1

2

i

Layer l โ€ฆ

โ€ฆ1

2

k

Layer l+1

โ€ฆโ€ฆ

โ€ฆโ€ฆ

โ€ฆโ€ฆ

โ€ฆโ€ฆ

1

2

n

Layer L(output layer)

1. How to compute Lฮด

2. The relation of and lฮด 1lฮดl

i

r

l

ij

l

i

l

ij

r

z

C

w

z

w

C

l

i

l

iฮด

lฮด2

lฮด1

1l

kฮด

1

2

lฮด

1

1

lฮด

L

nฮด

L

2ฮด

Lฮด1

lฮด1lฮด Lฮด1-Lฮด

๐œ•๐ถ๐‘Ÿ ๐œ•๐‘ค๐‘–๐‘—๐‘™ - Second Term

1. How to compute Lฮด

2. The relation of and lฮด 1lฮดl

i

r

l

ij

l

i

l

ij

r

z

C

w

z

w

C

l

i

L

L

n

r

nz

C

rrLL Cyaz nnn

rL

r

n

r

n

n

y

C

z

y

L

nz โ€ฆโ€ฆ

1

2

n

Layer L(output layer)

L

nฮด

L

2ฮด

Lฮด1

z

z

Depending on the definition of cost function

L

nz

๐œ•๐ถ๐‘Ÿ ๐œ•๐‘ค๐‘–๐‘—๐‘™ - Second Term

1. How to compute Lฮด

2. The relation of and lฮด 1lฮดl

i

r

l

ij

l

i

l

ij

r

z

C

w

z

w

C

l

i

L

L

n

r

nz

C

rL

r

n

r

n

n

y

C

z

y

L

n

L

L

L

z

z

z

z

2

1

r

n

r

rr

rr

rr

yC

yC

yC

yC

2

1

rrl yCzฮด L

r

n

rL

ny

Cz

element-wise multiplication

?Lฮด
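A sketch of this step, assuming a sigmoid output layer and the squared-error cost $C^r = \|y^r - \hat{y}^r\|^2$ (the slide leaves the cost unspecified, so $\nabla C^r(y^r) = 2(y^r - \hat{y}^r)$ is my assumption here):

```python
import numpy as np

sigma  = lambda z: 1.0 / (1.0 + np.exp(-z))      # assumed output activation
dsigma = lambda z: sigma(z) * (1.0 - sigma(z))   # sigma'(z)

def delta_L(z_L, y_hat):
    y = sigma(z_L)                 # network output y^r = sigma(z^L)
    grad_C = 2.0 * (y - y_hat)     # assumed nabla C^r(y^r) for squared error
    return dsigma(z_L) * grad_C    # element-wise multiplication with sigma'(z^L)
```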

๐œ•๐ถ๐‘Ÿ ๐œ•๐‘ค๐‘–๐‘—๐‘™ - Second Term

l

i

l

i ฮ”aฮ”z rฮ”C

kl

k

r

l

i

l

k

l

i

l

i

l

i

rl

iz

C

a

z

z

a

z

C1

1

1

1

lฮ”z

1

2

lฮ”z

1l

kฮ”z

โ€ฆโ€ฆ

l

i

r

l

ij

l

i

l

ij

r

z

C

w

z

w

C

l

i

1. How to compute Lฮด

2. The relation of and lฮด 1lฮด

โ€ฆโ€ฆ

1

2

i

Layer l

โ€ฆโ€ฆ

1

2

k

Layer l+1

l

iฮด

lฮด2

lฮด1

1l

kฮด

1

2

lฮด

1

1

lฮด

1lฮดlฮด

l

i

rl

iz

Cฮด

1l

k

๐œ•๐ถ๐‘Ÿ ๐œ•๐‘ค๐‘–๐‘—๐‘™ - Second Term

l

i

r

l

ij

l

i

l

ij

r

z

C

w

z

w

C

l

i

k

l

k

l

ki

l

i

l

i wz 11โ€ฆโ€ฆ

1

2

i

Layer l

โ€ฆโ€ฆ

1

2

k

Layer l+1

l

iฮด

lฮด2

lฮด1

1l

kฮด

1

2

lฮด

1

1

lฮด

1lฮดlฮด

l

i

l

i ฮ”aฮ”z rฮ”C

1

1

lฮ”z

1

2

lฮ”z

1l

kฮ”z

โ€ฆโ€ฆ

k

l

kl

i

l

k

l

i

l

il

ia

z

z

a 11

l

izฯƒ111 l

k

l

i

i

l

ki

l

k bawz

๐œ•๐ถ๐‘Ÿ ๐œ•๐‘ค๐‘–๐‘—๐‘™ - Second Term

l

i

r

l

ij

l

i

l

ij

r

z

C

w

z

w

C

l

i

l

iฮด i

l

iz

multiply a constant

1l

kฮด

1

2

lฮด

1

1

lฮด

โ€ฆโ€ฆ

1l

kiw

1

2

l

iw

1

1

l

iwoutput

input

new type of neuron

k

l

k

l

ki

l

i

l

i wz 11

โ€ฆโ€ฆ

1

2

i

Layer l

โ€ฆโ€ฆ

1

2

k

Layer l+1

l

iฮด

lฮด2

lฮด1

1l

kฮด

1

2

lฮด

1

1

lฮด

1lฮดlฮด

๐œ•๐ถ๐‘Ÿ ๐œ•๐‘ค๐‘–๐‘—๐‘™ - Second Term

2

โ€ฆ

1

1

lz

1

2

lz

1 l

kz

1

k

2

1

i

โ€ฆ

Layer l+1Layer l

lz1

lz2

l

iz

l

iฮด

lฮด2

lฮด1

1l

kฮด

1

2

lฮด

1

1

lฮด

11 lTlll Wz

1lฮดlฮด

l

i

l

l

l

z

z

z

z

2

1

k

l

k

l

ki

l

i

l

i wz 11
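A sketch of this recursion in NumPy: the transposed weight matrix of layer $l+1$ pushes $\delta^{l+1}$ back one layer, then the result is scaled element-wise by $\sigma'(z^l)$ (sigmoid is assumed here, as before):

```python
import numpy as np

sigma  = lambda z: 1.0 / (1.0 + np.exp(-z))      # assumed activation
dsigma = lambda z: sigma(z) * (1.0 - sigma(z))   # sigma'(z)

def delta_prev_layer(W_next, delta_next, z_l):
    """delta^l = sigma'(z^l) (element-wise) (W^{l+1})^T delta^{l+1}"""
    return dsigma(z_l) * (W_next.T @ delta_next)
```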

๐œ•๐ถ๐‘Ÿ ๐œ•๐‘ค๐‘–๐‘—๐‘™ - Second Term

2

โ€ฆ

1

1

lz

1

2

lz

1 l

kz

1

k

2

1

i

โ€ฆ

Layer l+1Layer l

lz1

lz2

l

iz

l

iฮด

lฮด2

lฮด1

1l

kฮด

1

2

lฮด

1

1

lฮด

11 lTlll Wz

1lฮดlฮด

Compare

Forward pass:
$$a^{l+1} = \sigma(W^{l+1} a^l + b^{l+1})$$
Backward pass:
$$\delta^l = \sigma'(z^l) \odot (W^{l+1})^T \delta^{l+1}$$

[Figure: the forward network from layer $l$ to layer $l+1$, and the reversed network from the output layer $L$ (with $\partial C^r/\partial y_1^r, \partial C^r/\partial y_2^r, \cdots, \partial C^r/\partial y_n^r$) back through layer $L-1$ down to layer $l$.]

$\partial C^r / \partial w_{ij}^l$ - Second Term

1. How to compute $\delta^L$:
$$\delta^L = \sigma'(z^L) \odot \nabla C^r(y^r)$$
2. The relation of $\delta^l$ and $\delta^{l+1}$:
$$\delta^l = \sigma'(z^l) \odot (W^{l+1})^T \delta^{l+1}$$

[Figure: the reversed network. $\nabla C^r(y^r)$ enters at the output layer $L$, is multiplied by $\sigma'(z^L)$ to give $\delta^L$, then passes through $(W^L)^T$, $\sigma'(z^{L-1})$, $\cdots$, $(W^{l+1})^T$, $\sigma'(z^l)$ to give $\delta^{L-1}, \cdots, \delta^{l+1}, \delta^l$.]

Concluding Remarks

$$\frac{\partial C^r}{\partial w_{ij}^l} = \frac{\partial z_i^l}{\partial w_{ij}^l}\frac{\partial C^r}{\partial z_i^l}$$

Forward Pass (gives the first term, $\partial z_i^l / \partial w_{ij}^l = a_j^{l-1}$, with $a^0 = x^r$):
$$z^1 = W^1 x^r + b^1, \quad a^1 = \sigma(z^1), \quad z^{l+1} = W^{l+1} a^l + b^{l+1}, \quad a^{l+1} = \sigma(z^{l+1})$$

Backward Pass (gives the second term, $\partial C^r / \partial z_i^l = \delta_i^l$):
$$\delta^L = \sigma'(z^L) \odot \nabla C^r(y^r), \quad \delta^{L-1} = \sigma'(z^{L-1}) \odot (W^L)^T \delta^L, \quad \delta^l = \sigma'(z^l) \odot (W^{l+1})^T \delta^{l+1}$$

[Figure: layers $l-1$ and $l$ with weight $w_{ij}^l$; the forward pass supplies $a_j^{l-1}$ and the backward pass supplies $\delta_i^l$.]
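A compact end-to-end sketch of the two passes for one example (my own illustration; sigmoid activations and a squared-error cost are assumptions, the slides leave both generic). `Ws[l]` and `bs[l]` hold $W^{l+1}$ and $b^{l+1}$, and the returned `dWs[l][i, j]` is $\partial C^r / \partial w_{ij}^{l+1} = a_j^l \, \delta_i^{l+1}$:

```python
import numpy as np

sigma  = lambda z: 1.0 / (1.0 + np.exp(-z))      # assumed activation
dsigma = lambda z: sigma(z) * (1.0 - sigma(z))   # sigma'(z)

def backprop(Ws, bs, x, y_hat):
    # Forward pass: a^0 = x^r, z^{l+1} = W^{l+1} a^l + b^{l+1}, a^{l+1} = sigma(z^{l+1})
    a, activations, zs = x, [x], []
    for W, b in zip(Ws, bs):
        z = W @ a + b
        a = sigma(z)
        zs.append(z)
        activations.append(a)

    # Backward pass: delta^L = sigma'(z^L) (element-wise) nabla C^r(y^r)
    # (squared-error cost assumed, so nabla C^r(y^r) = 2 (y^r - yhat^r))
    delta = dsigma(zs[-1]) * 2.0 * (activations[-1] - y_hat)
    dWs, dbs = [None] * len(Ws), [None] * len(bs)
    for l in reversed(range(len(Ws))):
        dWs[l] = np.outer(delta, activations[l])   # dC^r/dw_ij = a_j^{l-1} * delta_i^l
        dbs[l] = delta                              # dC^r/db_i  = delta_i^l
        if l > 0:
            # delta^l = sigma'(z^l) (element-wise) (W^{l+1})^T delta^{l+1}
            delta = dsigma(zs[l - 1]) * (Ws[l].T @ delta)
    return dWs, dbs
```

Each gradient entry is the product of a forward-pass quantity ($a_j^{l-1}$) and a backward-pass quantity ($\delta_i^l$), exactly as in the two-term decomposition above.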

Acknowledgement

• Thanks to Ryan Sun for writing in to point out errors in the slides.