Transcript of SD Study RNN & LSTM (isw3.naist.jp/.../student/2015/seitaro-s/161110SDstudy.pdf)

Page 1:

SD Study RNN & LSTM 2016/11/10

Seitaro Shinagawa

Page 2:

This is a description for people who have already understood simple neural network architectures such as feed-forward networks.

Page 3:

I will introduce LSTM, how to use it, and some tips for Chainer.

Page 4:

1. RNN to LSTM

Figure cited from "わかるLSTM ～ 最近の動向と共に" (Understanding LSTM, with recent trends): http://qiita.com/t_Signull/items/21b82be280b46f467d1b

[Figure: Simple RNN with an input layer, a middle (hidden) layer, and an output layer]

Page 5:

FAQ from LSTM beginner students

A-san: "I hear LSTM is a kind of RNN, but LSTM looks like a different architecture… Are these the same or different?"

Neural bear: "They have the same architecture! Please follow me!"

[Figure: left, "the RNN A-san often sees" (x_t → h_t → y_t, with a recurrent loop on h_t); right, "the LSTM A-san often sees" (three LSTM blocks unrolled over x_1, x_2, x_3 with outputs y_1, y_2, y_3)]

Page 6:

Introducing the LSTM figure, starting from the RNN

[Figure: RNN with input x_t, hidden state h_t, output y_t, and a recurrent loop on h_t]

Page 7:

Introducing the LSTM figure, starting from the RNN

Unroll it along the time axis.

[Figure: the same RNN, about to be unrolled over time]

Page 8:

Introducing the LSTM figure, starting from the RNN

Unrolled along the time axis:

[Figure: unrolled RNN — from h_0, each step t takes x_t and produces hidden state h_t and output y_t, for t = 1, 2, 3]

A-san: "Oh, I often see this in RNN!"

h_t = tanh(W_xh x_t + W_hh h_{t-1})
y_t = sigmoid(W_hy h_t)

So this figure focuses on the variables and shows their relationships.

Page 9:

Let's focus on the actual process in more detail. I will write out the architecture in detail.

[Figure: the unrolled network redrawn with the computation inside each step shown explicitly; a rounded box denotes a function]

u_t = W_xh x_t + W_hh h_{t-1}
h_t = tanh(u_t)
v_t = W_hy h_t
y_t = sigmoid(v_t)

See the RNN as one large function that takes (x_t, h_{t-1}) as input and returns (y_t, h_t).
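As a concrete illustration, here is a minimal NumPy sketch of one RNN step viewed as exactly that function: it takes (x_t, h_{t-1}) and returns (y_t, h_t), following the equations above. The sizes and random weights are arbitrary choices for the example, not anything from the slides.

    import numpy as np

    def sigmoid(v):
        return 1.0 / (1.0 + np.exp(-v))

    # arbitrary sizes for the sketch
    in_size, hidden_size, out_size = 3, 5, 2
    rng = np.random.RandomState(0)
    W_xh = rng.randn(hidden_size, in_size).astype(np.float32)
    W_hh = rng.randn(hidden_size, hidden_size).astype(np.float32)
    W_hy = rng.randn(out_size, hidden_size).astype(np.float32)

    def rnn_step(x_t, h_prev):
        """One RNN step: takes (x_t, h_{t-1}) and returns (y_t, h_t)."""
        u_t = W_xh @ x_t + W_hh @ h_prev
        h_t = np.tanh(u_t)
        v_t = W_hy @ h_t
        y_t = sigmoid(v_t)
        return y_t, h_t

    # unrolling is just calling the same function at every time step
    h = np.zeros(hidden_size, dtype=np.float32)
    for x in rng.randn(3, in_size).astype(np.float32):
        y, h = rnn_step(x, h)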

Page 10:

(Build of the previous slide: the same equations, with the whole step now treated as one function box that takes (x_t, h_{t-1}) and returns (y_t, h_t).)

Page 11:

Now replace the function box with the label "LSTM".

A-san: "Oh, this looks the same as LSTM!"

[Figure: the first step of the unrolled diagram, with the function box simply labeled LSTM; it still takes (x_1, h_0) and returns (y_1, h_1)]

Page 12:

Summary of this section

The LSTM figure is not special!

Moreover, the initial hidden state h_0 is often omitted, as below.

[Figure: three RNN blocks unrolled over x_1, x_2, x_3 with outputs y_1, y_2, y_3; h_0 is not drawn]

If you read the RNN blocks as LSTM, you actually also have to pass the cell value to the LSTM module at the next time step, but that is usually omitted from the figure as well.

Page 13:

By the way, if you want to see the contents of LSTM…

[Figure: the inside of one LSTM block: inputs x_t, h_{t-1}, c_{t-1}; outputs y_t, h_t, c_t]

z_t = tanh(W_xz x_t + W_hz h_{t-1})
g_{i,t} = σ(W_xi x_t + W_hi h_{t-1})   (input gate)
g_{f,t} = σ(W_xf x_t + W_hf h_{t-1})   (forget gate)
g_{o,t} = σ(W_xo x_t + W_ho h_{t-1})   (output gate)
ẑ_t = z_t ⊙ g_{i,t}
ĉ_{t-1} = c_{t-1} ⊙ g_{f,t}
c_t = ĉ_{t-1} + ẑ_t
h_t = tanh(c_t) ⊙ g_{o,t}
y_t = σ(W_hy h_t)

(where σ(·) denotes the sigmoid function)
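The equations above translate almost line by line into code. Below is a minimal NumPy sketch of one LSTM step; biases, peepholes, and the output layer y_t = σ(W_hy h_t) are omitted, and the sizes are arbitrary for the example.

    import numpy as np

    def sigmoid(v):
        return 1.0 / (1.0 + np.exp(-v))

    in_size, n_units = 3, 4
    rng = np.random.RandomState(0)
    # one (W_x*, W_h*) pair per component: candidate z and gates i, f, o
    W = {k: (rng.randn(n_units, in_size).astype(np.float32),
             rng.randn(n_units, n_units).astype(np.float32))
         for k in ("z", "i", "f", "o")}

    def lstm_step(x_t, c_prev, h_prev):
        """One LSTM step following the equations above (no biases/peepholes)."""
        z = np.tanh(W["z"][0] @ x_t + W["z"][1] @ h_prev)    # candidate input
        g_i = sigmoid(W["i"][0] @ x_t + W["i"][1] @ h_prev)  # input gate
        g_f = sigmoid(W["f"][0] @ x_t + W["f"][1] @ h_prev)  # forget gate
        g_o = sigmoid(W["o"][0] @ x_t + W["o"][1] @ h_prev)  # output gate
        c_t = c_prev * g_f + z * g_i                         # cell update
        h_t = np.tanh(c_t) * g_o                             # new hidden state
        return c_t, h_t

    c = np.zeros(n_units, dtype=np.float32)
    h = np.zeros(n_units, dtype=np.float32)
    c, h = lstm_step(rng.randn(in_size).astype(np.float32), c, h)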

Page 14:

LSTM FAQ

Q. What is the difference between RNN and LSTM?

A. The Constant Error Carousel (CEC, often called the cell) and three gates: input gate, forget gate, output gate.

• Input gate: selects whether to accept the input into the cell or not
• Forget gate: selects whether to throw away the cell information or not
• Output gate: selects how much information to pass on to the next time step

Q. Why does LSTM avoid the gradient vanishing problem?

1. In a plain RNN, backpropagation suffers because the sigmoid derivative is multiplied repeatedly.
2. The RNN output is affected by hidden states that keep changing.
3. The LSTM has a cell that stores the previous inputs as a sum of weighted inputs, so it is robust to the current hidden state (of course, there is a limit to how long a sequence it can remember).

Page 15:

Figure cited from "わかるLSTM ～ 最近の動向と共に" (Understanding LSTM, with recent trends): http://qiita.com/t_Signull/items/21b82be280b46f467d1b

[Figure: LSTM block diagram]

Page 16:

Figure cited from "わかるLSTM ～ 最近の動向と共に" (Understanding LSTM, with recent trends): http://qiita.com/t_Signull/items/21b82be280b46f467d1b

LSTM with peepholes

[Figure: LSTM block diagram with peephole connections]

This is known as the standard LSTM, but the LSTM with the peepholes omitted is also often used.

Page 17:

Chainer usage

Without peepholes (the standard version in Chainer): chainer.links.LSTM
With peepholes: chainer.links.StatefulPeepholeLSTM

"Stateful ○○" means the link wraps the hidden state as its own internal state (※); with a "Stateless ○○" function you pass the state around explicitly:

Stateless:
h = init_state()
h = stateless_lstm(h, x1)
h = stateless_lstm(h, x2)

Stateful:
stateful_lstm(x1)
stateful_lstm(x2)

(※) https://groups.google.com/forum/#!topic/chainer-jp/bJ9IQWtsef4

Page 18:

2. LSTM Learning Methods

Full BPTT vs. Truncated BPTT (BPTT: Back-Propagation Through Time)

[Figure: from Graham Neubig, NLP tutorial 8 - recurrent neural networks, http://www.phontron.com/slides/nlp-programming-ja-08-rnn.pdf]

Page 19:

Truncated BPTT with Chainer

Figure cited from "Chainerの使い方と自然言語処理への応用" (How to use Chainer and its application to natural language processing): http://www.slideshare.net/beam2d/chainer-52369222

Page 20:

Truncated BPTT in Chainer

[Figure: an LSTM unrolled over a long input sequence x_1, x_2, x_3, …; the gradient is backpropagated only over up to i = 30 steps, the weights are updated, and training then continues from there instead of backpropagating through the entire sequence]
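A common way to write this in Chainer is to accumulate the loss, backpropagate every N steps, and cut the computational graph with unchain_backward(). The sketch below assumes a recent Chainer version; the model, data, and loss are placeholders, and N = 30 matches the figure.

    import numpy as np
    import chainer
    import chainer.functions as F
    import chainer.links as L

    class RNNModel(chainer.Chain):
        """Placeholder model: stateful LSTM plus a linear readout."""
        def __init__(self, in_size, n_units, out_size):
            super(RNNModel, self).__init__()
            with self.init_scope():
                self.lstm = L.LSTM(in_size, n_units)
                self.out = L.Linear(n_units, out_size)

        def __call__(self, x):
            return self.out(self.lstm(x))

    model = RNNModel(3, 4, 3)
    optimizer = chainer.optimizers.SGD()
    optimizer.setup(model)

    bprop_len = 30                                      # truncation length, as in the figure
    xs = np.random.randn(100, 1, 3).astype(np.float32)  # dummy input sequence
    ts = np.random.randn(100, 1, 3).astype(np.float32)  # dummy targets

    model.lstm.reset_state()
    loss = 0
    for i, (x, t) in enumerate(zip(xs, ts), start=1):
        loss += F.mean_squared_error(model(x), t)
        if i % bprop_len == 0:
            model.cleargrads()
            loss.backward()
            loss.unchain_backward()                     # truncate the graph here
            optimizer.update()
            loss = 0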

Page 21:

Mini-batch calculation with GPU

What should I do if I want to use the GPU when the sequence lengths are not aligned?

Padding the end of each sequence is the standard approach (I call this zero padding).

Example, with 0 as the end-of-sequence symbol:

Before padding:      After padding:
1 2 0                1 2 0 0 0
1 3 3 2 0            1 3 3 2 0
1 4 2 0              1 4 2 0 0
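A minimal NumPy sketch of this zero padding, using the sequences from the example above:

    import numpy as np

    seqs = [[1, 2, 0], [1, 3, 3, 2, 0], [1, 4, 2, 0]]
    max_len = max(len(s) for s in seqs)

    # pad every sequence with the end-of-sequence symbol 0 up to the same length
    batch = np.zeros((len(seqs), max_len), dtype=np.int32)
    for i, s in enumerate(seqs):
        batch[i, :len(s)] = s

    # batch is now:
    # [[1 2 0 0 0]
    #  [1 3 3 2 0]
    #  [1 4 2 0 0]]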

Page 22:

Mini-batch calculation with GPU (continued)

But the learned model becomes redundant: it has to learn a "keep outputting 0" rule for the padded part. Adding a handcrafted rule can solve this.

There are two methods in Chainer:

• chainer.functions.where
• NStepLSTM (v1.16.0 or later)

Page 23:

chainer.functions.where

[Figure: a mini-batch of three padded sequences (1 2 0 0 0 / 1 3 3 2 0 / 1 4 2 0 0) fed to an LSTM step that maps (c_{t-1}, h_{t-1}) to (c_t, h_t)]

Build a condition matrix S with True for the rows whose sequence is still active at time t and False for the rows that have already ended (here: False…, True…, False… for the three rows), and let (c_tmp, h_tmp) be the raw LSTM output at this step. Then

c_t = F.where(S, c_tmp, c_{t-1})
h_t = F.where(S, h_tmp, h_{t-1})

so a finished row keeps its old state and only the active rows are updated (e.g. the batch state becomes (h_{t-1}^1, h_t^2, h_{t-1}^3)).
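A minimal sketch of this masking step with chainer.functions.where; the states and the LSTM output are dummy arrays here, and only the second row of the batch is still active, as in the figure.

    import numpy as np
    import chainer.functions as F

    batch, n_units = 3, 4

    # previous states and the raw LSTM output at this step (dummy values)
    c_prev = np.zeros((batch, n_units), dtype=np.float32)
    h_prev = np.zeros((batch, n_units), dtype=np.float32)
    c_tmp = np.random.randn(batch, n_units).astype(np.float32)
    h_tmp = np.random.randn(batch, n_units).astype(np.float32)

    # condition matrix S: True for rows whose sequence is still active at time t
    active = np.array([False, True, False])
    S = np.repeat(active[:, None], n_units, axis=1)

    # active rows take the new state, finished rows keep the old one
    c_t = F.where(S, c_tmp, c_prev)
    h_t = F.where(S, h_tmp, h_prev)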

Page 24:

NStepLSTM (v1.16.0 or later)

NStepLSTM handles the padding automatically.

There was a bug with cuDNN dropout (※); a fixed version was merged into the master repository on 10/25. Use the latest version (wait for v1.18.0 or git clone from GitHub):
https://github.com/pfnet/chainer/pull/1804

There is no documentation yet, so read the raw script:
https://github.com/pfnet/chainer/blob/master/chainer/functions/connection/n_step_lstm.py

(※) "ChainerのNStepLSTMでニコニコ動画のコメント予測" (Predicting Nico Nico Douga comments with Chainer's NStepLSTM): http://www.monthly-hack.com/entry/2016/10/24/200000

A-san: "So I didn't need to listen to the F.where part?" — "Hahaha…"
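A minimal sketch of NStepLSTM usage; the sizes are arbitrary, and the exact call signature has changed between Chainer versions (older versions take extra arguments), so check the n_step_lstm.py source linked above for the version you actually use.

    import numpy as np
    import chainer
    import chainer.links as L

    n_layers, in_size, n_units, dropout = 1, 3, 4, 0.0
    rnn = L.NStepLSTM(n_layers, in_size, n_units, dropout)

    # variable-length sequences: no manual zero padding is needed
    xs = [chainer.Variable(np.random.randn(length, in_size).astype(np.float32))
          for length in (5, 3, 4)]

    # hx = cx = None uses zero initial states (in recent Chainer versions)
    hy, cy, ys = rnn(None, None, xs)
    # ys is a list of per-sequence outputs, keeping the original lengths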

Page 25:

Gradient clipping can suppress gradient explosion

LSTM can address the gradient vanishing problem, but RNNs still suffer from gradient explosion (※).

(※) On the difficulty of training recurrent neural networks, http://www.jmlr.org/proceedings/papers/v28/pascanu13.pdf

Gradient clipping, proposed in (※): if the norm of the whole gradient exceeds a threshold, rescale the gradient so that its norm equals the threshold.

In Chainer, you can use
optimizer.add_hook(chainer.optimizer.GradientClipping(threshold))
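A minimal sketch of adding that hook; the model here is just a placeholder link, and the threshold value is arbitrary.

    import chainer
    import chainer.links as L
    from chainer import optimizers

    model = L.Linear(3, 2)               # placeholder model
    optimizer = optimizers.Adam()
    optimizer.setup(model)
    # rescale the whole gradient whenever its L2 norm exceeds the threshold
    optimizer.add_hook(chainer.optimizer.GradientClipping(5.0))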

Page 26:

Applying dropout to LSTM

Dropout is a strong smoothing (regularization) method, but applying dropout just anywhere does not always succeed.

(※) Recurrent Dropout without Memory Loss, https://arxiv.org/abs/1603.05118

According to (※), comparing
1. dropout on the recurrent hidden state of the LSTM,
2. dropout on the cell of the LSTM,
3. dropout on the input gate of the LSTM,
the conclusion is that 3 achieved the best performance.

Basically (see the sketch below):
• the recurrent part → dropout should not be applied to it
• the forward part → dropout should be applied to it
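A minimal sketch of that rule in a Chainer forward pass, assuming a recent Chainer version; the class, layer names, and sizes are made up for the example.

    import numpy as np
    import chainer
    import chainer.functions as F
    import chainer.links as L

    class RNNClassifier(chainer.Chain):
        def __init__(self, in_size, n_units, out_size):
            super(RNNClassifier, self).__init__()
            with self.init_scope():
                self.lstm = L.LSTM(in_size, n_units)    # recurrent part
                self.out = L.Linear(n_units, out_size)  # forward readout

        def __call__(self, x):
            # dropout only on the forward (input-to-hidden, hidden-to-output) path;
            # the recurrent h_{t-1} -> h_t connection inside L.LSTM is left untouched
            h = self.lstm(F.dropout(x, ratio=0.5))
            return self.out(F.dropout(h, ratio=0.5))

    model = RNNClassifier(3, 4, 2)
    y = model(np.random.randn(1, 3).astype(np.float32))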

Page 27:

Batch Normalization on LSTM

What is Batch Normalization? Scaling the distribution of the activations (the sums of weighted inputs) to N(0, 1). (http://jmlr.org/proceedings/papers/v37/ioffe15.pdf)

In theory, BN should be applied over all the data; in practice, it is applied per mini-batch.

[Figure: Batch Normalization applied to the activation x]

Page 28:

Batch Normalization on LSTM (continued)

Applying BN to an RNN does not improve performance (※):
• hidden-to-hidden BN suffers from gradient explosion because of the repeated scaling
• input-to-hidden BN makes learning faster, but does not improve performance

(※) Batch Normalized Recurrent Neural Networks, https://arxiv.org/abs/1510.01378

Three newer proposals (in order of proposal date):
• Weight Normalization: https://arxiv.org/abs/1602.07868
• Recurrent Batch Normalization: https://arxiv.org/abs/1603.09025
• Layer Normalization: https://arxiv.org/abs/1607.06450

Page 29:

Difference between Batch Normalization and Layer Normalization

Assume the activations a, with a_i^(n) = Σ_j w_ij x_j^(n) and h_i^(n) = f(a_i^(n)) for example n and hidden unit i.

[Figure: a matrix of activations a_i^(n), with the examples in the mini-batch as rows and the hidden units i = 1 … H as columns]

• Batch Normalization normalizes vertically (the same unit across the examples in the mini-batch).
• Layer Normalization normalizes horizontally (across the units within one example).

The variance σ becomes larger when gradient explosion happens, and the normalization makes the output more robust (details are in the paper).
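A minimal NumPy sketch of the difference in normalization axis (the learnable scale and shift parameters of both methods are omitted):

    import numpy as np

    # activations for one mini-batch: rows = examples, columns = hidden units
    a = np.random.randn(32, 8).astype(np.float32)
    eps = 1e-5

    # Batch Normalization: normalize each unit over the batch (vertically, axis=0)
    bn = (a - a.mean(axis=0, keepdims=True)) / np.sqrt(a.var(axis=0, keepdims=True) + eps)

    # Layer Normalization: normalize each example over its units (horizontally, axis=1)
    ln = (a - a.mean(axis=1, keepdims=True)) / np.sqrt(a.var(axis=1, keepdims=True) + eps)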

Page 30:

Initialization Tips

• Exact solutions to the nonlinear dynamics of learning in deep linear neural networks, https://arxiv.org/abs/1312.6120v3
• A Simple Way to Initialize Recurrent Networks of Rectified Linear Units, https://arxiv.org/abs/1504.00941v2

An RNN with ReLU whose recurrent weight matrix is initialized to the identity matrix is as good as an LSTM.

Page 31:

[Figure: results from "A Simple Way to Initialize Recurrent Networks of Rectified Linear Units"]

Page 32:

[Figure: MNIST treated as a 784-step sequence prediction task]

Page 33:

Simple RNN:
h_t = tanh(W_xh x_t + W_hh h_{t-1})
y_t = sigmoid(W_hy h_t)

IRNN: replace the nonlinearities with ReLU and initialize the recurrent weights W_hh with the identity matrix:
h_t = ReLU(W_xh x_t + W_hh h_{t-1})
y_t = ReLU(W_hy h_t)

When x = 0, h = ReLU(h), so the hidden state is carried over unchanged.

[Figure: one step of the unrolled network, with x_1 and h_0 feeding h_1 and output y_1]
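A minimal NumPy sketch of the IRNN-style recurrent step: ReLU activation with the recurrent weights initialized to the identity matrix (the small Gaussian scale for the input weights is an arbitrary choice here, not something specified on the slide).

    import numpy as np

    rng = np.random.RandomState(0)
    in_size, n_units = 3, 5
    W_xh = (0.001 * rng.randn(n_units, in_size)).astype(np.float32)  # small random input weights
    W_hh = np.eye(n_units, dtype=np.float32)                         # identity initialization

    def irnn_step(x, h_prev):
        return np.maximum(0.0, W_xh @ x + W_hh @ h_prev)             # ReLU

    # with x = 0 the hidden state is simply carried over, as noted above
    h = np.maximum(0.0, rng.randn(n_units)).astype(np.float32)
    h_next = irnn_step(np.zeros(in_size, dtype=np.float32), h)
    assert np.allclose(h, h_next)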

Page 34:

Extra materials

Page 35:

Various RNN models

• Encoder-Decoder
• Bidirectional LSTM
• Attention model

Page 36:

Focusing on the initial value of the RNN hidden layer

[Figure: three RNN blocks unrolled over x_1, x_2, x_3 with outputs y_1, y_2, y_3, starting from the initial hidden state h_0]

• The RNN output changes depending on the initial hidden state h_0.
• h_0 is also learnable by backpropagation.
• It can be connected to an encoder output → encoder-decoder model.

[Figure: slice-by-slice pixel generation with an RNN: original digits, generation from a learned h_0, and generation from a random h_0. The first slice is 0 (black), yet various sequences appear.]

Page 37:

Encoder-Decoder model

[Figure: an encoder RNN reads x_1^enc, x_2^enc, x_3^enc and its final state becomes h_0^dec, the initial state of a decoder RNN that outputs y_1, y_2, y_3 from inputs x_1, x_2, x_3]

Point: use this when your input and output data have different sequence lengths.

h_0^dec is learned by training the encoder and the decoder at the same time.

To improve performance, you can use beam search on the decoder side.

Page 38:

Bidirectional LSTM

Very long time dependencies are difficult to learn even if you use LSTM (LSTM does not fundamentally solve gradient vanishing).

You can improve performance by also using an encoder that reads the input in the reverse order.

[Figure: a forward encoder over x_1^enc, x_2^enc, x_3^enc ("I remember the latter information!") and a reversed encoder over x_3^enc, x_2^enc, x_1^enc ("I remember the former information!") both feeding the decoder's h_0^dec]

Page 39:

Attention model

Moreover, using the intermediate hidden states of the encoder leads to even better performance!

[Figure: the encoder hidden states h_1^enc, h_2^enc, h_3^enc are combined with attention weights α_{1,t}, α_{2,t}, α_{3,t} and fed to the decoder at each step, in addition to the encoder-decoder connection h_0^dec]
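A minimal NumPy sketch of the attention step: score each encoder hidden state against the current decoder state, turn the scores into weights α with a softmax, and use the weighted sum as the context fed to the decoder. The dot-product scoring used here is just one common choice, not necessarily the one in the slide's figure.

    import numpy as np

    rng = np.random.RandomState(0)
    n_enc_steps, n_units = 3, 4
    h_enc = rng.randn(n_enc_steps, n_units).astype(np.float32)  # h_1^enc .. h_3^enc
    h_dec = rng.randn(n_units).astype(np.float32)               # current decoder state

    # attention weights alpha_{i,t}: softmax over the scores
    scores = h_enc @ h_dec
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()

    # context vector: weighted sum of the encoder hidden states, given to the decoder
    context = alpha @ h_enc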

Page 40:

Gated Recurrent Unit (GRU)

A variant of LSTM:
• the cell is removed
• the gates are reduced to 2

Despite the lower complexity, its performance is not bad. It often appears in MT tasks and SD tasks.

[Figure: GRU block diagram, shown next to the LSTM]

Page 41:

GRU can be interpreted as a special case of LSTM.

Try splitting the LSTM and turning it upside down:
1. What the cell is to the LSTM, the hidden state is to the GRU.
2. Share the input gate and the output gate as the update gate.
3. Remove the tanh applied to the cell output of the LSTM.

[Figure: GRU and LSTM block diagrams side by side]

Page 42:

GRU can be interpreted as a special case of LSTM.

1. Try splitting the LSTM and turning it upside down.

[Figure: the LSTM block diagram, split and flipped]

Page 43:

GRU can be interpreted as a special case of LSTM.

1. Try splitting the LSTM and turning it upside down.
2. See the LSTM cell as the GRU hidden state.
3. Share the input gate and the output gate as the update gate.
4. Remove the tanh applied to the cell output of the LSTM.

[Figure: GRU and LSTM block diagrams side by side]