Transcript of SD Study RNN & LSTM (isw3.naist.jp/.../student/2015/seitaro-s/161110SDstudy.pdf)
SD Study: RNN & LSTM (2016/11/10)
Seitaro Shinagawa
1/43
This is a description for people who already understand simple neural network architectures such as feed-forward networks.
2/43
I will introduce LSTM: how to use it, and tips for Chainer.
3/43
1. RNN to LSTM
Figure cited from the Qiita article "わかるLSTM ～ 最近の動向と共に" (Understanding LSTM, with recent trends): http://qiita.com/t_Signull/items/21b82be280b46f467d1b
[Figure: a simple RNN with an input layer, a middle (hidden) layer, and an output layer]
4/43
FAQ from LSTM beginner students
A-san: "I hear LSTM is a kind of RNN, but LSTM looks like a different architecture…"
Neural bear: "They have the same architecture! Please follow me!"
[Figure: the unrolled RNN that A-san often sees, next to the unrolled LSTM that A-san often sees. Are they the same or different?]
5/43
Introducing the LSTM figure, starting from the RNN
[Figure: an RNN block with input x_t, hidden state h_t, and output y_t]
6/43
Introducing the LSTM figure, starting from the RNN
Unroll it along the time axis.
[Figure: the RNN block copied for each time step and connected through the hidden state]
7/43
Introducing the LSTM figure, starting from the RNN
Unroll it along the time axis.
[Figure: the unrolled RNN drawn with its variables x_t, h_t, y_t at each time step]
A-san: "Oh, I often see this in RNN!"
$h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1}), \quad y_t = \mathrm{softmax}(W_{hy} h_t)$
So this figure focuses on the variables and shows the relationships between them.
8/43
Let's focus on the actual process in more detail.
Let me write out the architecture in detail:
$a_t = W_{xh} x_t + W_{hh} h_{t-1}$
$h_t = \tanh(a_t)$
$o_t = W_{hy} h_t$
$y_t = \mathrm{softmax}(o_t)$
[Figure: the unrolled RNN with these equations written inside each block; each block is a function]
See the RNN as one large function that takes $(x_t, h_{t-1})$ as input and returns $(y_t, h_t)$.
9/43
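To make the "RNN as one large function" view concrete, here is a minimal NumPy sketch of a single step (the weight names follow the slide's equations; biases are omitted and the weight matrices are assumed to be given):

import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def rnn_step(x_t, h_prev, W_xh, W_hh, W_hy):
    # One RNN step as a single function: takes (x_t, h_{t-1}), returns (y_t, h_t).
    a_t = W_xh @ x_t + W_hh @ h_prev   # a_t = W_xh x_t + W_hh h_{t-1}
    h_t = np.tanh(a_t)                 # h_t = tanh(a_t)
    o_t = W_hy @ h_t                   # o_t = W_hy h_t
    y_t = softmax(o_t)                 # y_t = softmax(o_t)
    return y_t, h_t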
Let's focus on the actual process in more detail.
[Figure: each RNN block replaced by a single box that computes the equations above]
See the RNN as one large function that takes $(x_t, h_{t-1})$ as input and returns $(y_t, h_t)$.
10/43
[Figure: the same boxes, now labeled LSTM]
A-san: "Oh, this looks the same as LSTM!"
See the RNN as one large function that takes $(x_t, h_{t-1})$ as input and returns $(y_t, h_t)$.
11/43
Summary of this section
The LSTM figure is not special!
Moreover, the initial hidden state $h_0$ is often omitted, as below.
[Figure: the unrolled RNN drawn without the initial hidden state]
If you view the RNN blocks as LSTM, you actually also need to pass the cell value to the LSTM module at the next time step, but that is usually omitted from the figure as well.
12/43
By the way, if you want to see the contents of LSTM…
$z_t = \tanh(W_{xz} x_t + W_{hz} h_{t-1})$
$g_{i,t} = \sigma(W_{xi} x_t + W_{hi} h_{t-1})$ (input gate)
$g_{f,t} = \sigma(W_{xf} x_t + W_{hf} h_{t-1})$ (forget gate)
$g_{o,t} = \sigma(W_{xo} x_t + W_{ho} h_{t-1})$ (output gate)
$\tilde{z}_t = z_t \odot g_{i,t}$
$\tilde{c}_{t-1} = c_{t-1} \odot g_{f,t}$
$c_t = \tilde{c}_{t-1} + \tilde{z}_t$
$h_t = \tanh(c_t) \odot g_{o,t}$
$y_t = \mathrm{softmax}(W_{hy} h_t)$
(where $\sigma(\cdot)$ denotes the sigmoid function)
[Figure: the inside of the LSTM block, with the cell passed from $c_{t-1}$ to $c_t$]
13/43
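As a minimal NumPy sketch of the equations above (biases omitted; passing the weight matrices in a dict is a convention of this sketch, not Chainer's API):

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, c_prev, W):
    # One LSTM step following the slide's equations; W maps names like
    # 'xz', 'hz', 'xi', 'hi', 'xf', 'hf', 'xo', 'ho' to weight matrices.
    z_t = np.tanh(W['xz'] @ x_t + W['hz'] @ h_prev)   # candidate input
    g_i = sigmoid(W['xi'] @ x_t + W['hi'] @ h_prev)   # input gate
    g_f = sigmoid(W['xf'] @ x_t + W['hf'] @ h_prev)   # forget gate
    g_o = sigmoid(W['xo'] @ x_t + W['ho'] @ h_prev)   # output gate
    c_t = g_f * c_prev + g_i * z_t                    # cell (CEC) update
    h_t = g_o * np.tanh(c_t)                          # hidden state
    return h_t, c_t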
LSTM FAQ
Q. What is the difference between RNN and LSTM?
A. LSTM has a Constant Error Carousel (CEC, often called the cell) and three gates: input, forget, and output.
• Input gate: selects whether to accept the input into the cell or not
• Forget gate: selects whether to throw away the cell information or not
• Output gate: selects how much information to pass on to the next time step
Q. Why does LSTM avoid the gradient vanishing problem?
1. Backpropagation suffers because of the repeated multiplication by sigmoid derivatives.
2. The RNN output is affected by the constantly changing hidden states.
→ LSTM has a cell and stores previous inputs as a sum of weighted inputs, so it is robust to the current hidden state (of course, there is still a limit to how long a sequence it can remember).
14/43
Figure cited from the Qiita article "わかるLSTM ～ 最近の動向と共に" (http://qiita.com/t_Signull/items/21b82be280b46f467d1b)
[Figure: LSTM]
15/43
Figure cited from the Qiita article "わかるLSTM ～ 最近の動向と共に" (http://qiita.com/t_Signull/items/21b82be280b46f467d1b)
LSTM with peephole connections
This is known as the standard LSTM, but the variant with the peephole connections omitted is often used as well.
16/43
Chainer usage
Without peephole (the standard version in Chainer): chainer.links.LSTM
With peephole: chainer.links.StatefulPeepholeLSTM
Stateless usage:
h = init_state()
h = stateless_lstm(h, x1)
h = stateless_lstm(h, x2)
Stateful usage:
stateful_lstm(x1)
stateful_lstm(x2)
"Stateful" means that the hidden state is wrapped inside the internal state of the function (※).
(※) https://groups.google.com/forum/#!topic/chainer-jp/bJ9IQWtsef4
17/43
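A minimal usage sketch of the stateful link (the layer sizes are made-up, and x1 and x2 stand for input mini-batches that are assumed to exist):

import chainer.links as L

lstm = L.LSTM(100, 50)   # chainer.links.LSTM keeps h and c inside the link

lstm.reset_state()       # clear the internal state before each new sequence
y1 = lstm(x1)            # the state is carried over between calls automatically
y2 = lstm(x2)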
2. LSTM Learning Methods
Full BPTT
Truncated BPTT
Graham Neubig, NLP tutorial 8: recurrent neural networks
http://www.phontron.com/slides/nlp-programming-ja-08-rnn.pdf
(BPTT: Back Propagation Through Time)
18/43
Truncated BPTT with Chainer
Cited from "Chainerの使い方と自然言語処理への応用" (How to use Chainer and its application to natural language processing): http://www.slideshare.net/beam2d/chainer-52369222
19/43
Truncated BPTT with Chainer
[Figure: the LSTM unrolled over time steps; backpropagation runs only back to a fixed time step, the weights are updated, and the older history is cut off]
20/43
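A rough training-loop sketch of truncated BPTT in Chainer (the model, optimizer, and the (x, t) training pairs are assumed to exist; bprop_len is the truncation length):

import chainer.functions as F

bprop_len = 35            # how many time steps to backpropagate through
loss = 0
for i, (x, t) in enumerate(training_pairs):
    y = model(x)                              # stateful LSTM model carries h, c
    loss += F.softmax_cross_entropy(y, t)
    if (i + 1) % bprop_len == 0:
        model.cleargrads()
        loss.backward()
        loss.unchain_backward()               # cut the computation graph here
        optimizer.update()
        loss = 0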
Mini-batch calculation with GPU
What should I do if I want to use the GPU with sequences of unequal length?
Padding the end of each sequence is the standard approach.
Example (the end-of-sequence symbol is 0):
1 2 0
1 3 3 2 0
1 4 2 0
becomes
1 2 0 0 0
1 3 3 2 0
1 4 2 0 0
I call this zero padding.
21/43
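For example, the padding above can be done with NumPy like this (a small sketch using the sequences from the slide):

import numpy as np

seqs = [[1, 2, 0], [1, 3, 3, 2, 0], [1, 4, 2, 0]]
max_len = max(len(s) for s in seqs)

# Pad every sequence with 0 up to the same length so the batch becomes one array.
batch = np.zeros((len(seqs), max_len), dtype=np.int32)
for i, s in enumerate(seqs):
    batch[i, :len(s)] = s
# batch:
# [[1 2 0 0 0]
#  [1 3 3 2 0]
#  [1 4 2 0 0]]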
Mini-batch calculation with GPU
The learned model becomes redundant!
It has to learn a "keep outputting 0" rule for the padded part.
Adding a handcrafted rule can solve this. There are two ways to do it in Chainer:
chainer.functions.where
NStepLSTM (v1.16.0 or later)
22/43
chainer.functions.where
[Figure: a padded mini-batch (1 2 0 0 0 / 1 3 3 2 0 / 1 4 2 0 0) is fed to the LSTM one time step at a time, producing new states from the previous states (c_{t-1}, h_{t-1})]
Build a condition matrix G for each time step: True for the rows whose current token is real data, False for the rows that are already in the padded region.
c_t = F.where(G, c_new, c_{t-1})
h_t = F.where(G, h_new, h_{t-1})
For the rows where G is False, the previous state is simply carried over unchanged.
23/43
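A small self-contained sketch of the F.where trick above (random numbers stand in for the real LSTM outputs; only the F.where call is the point):

import numpy as np
import chainer.functions as F

batch, units = 3, 4
h_prev = np.random.randn(batch, units).astype(np.float32)   # state from step t-1
h_new = np.random.randn(batch, units).astype(np.float32)    # LSTM output at step t

# True where the sample still has real data at step t, False where it is padding.
alive = np.array([True, True, False])
cond = np.repeat(alive[:, None], units, axis=1)

# Padded rows keep h_prev; live rows take h_new. Do the same for the cell c.
h_t = F.where(cond, h_new, h_prev)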
NStepLSTM (v1.16.0 or later)
NStepLSTM can handle the padding automatically.
There was a bug with cuDNN dropout (※). A fixed version was merged into the master repository on 10/25.
Use the latest version (wait for v1.18.0, or git clone from GitHub):
https://github.com/pfnet/chainer/pull/1804
There is no documentation yet, so read the raw script:
https://github.com/pfnet/chainer/blob/master/chainer/functions/connection/n_step_lstm.py
(※) "ChainerのNStepLSTMでニコニコ動画のコメント予測" http://www.monthly-hack.com/entry/2016/10/24/200000
So I didn't even need to listen to the F.where part? Hahaha…
24/43
Gradient clipping can suppress gradient explosion
LSTM can mitigate the gradient vanishing problem, but RNNs still suffer from gradient explosion (※).
※ On the difficulty of training recurrent neural networks
http://www.jmlr.org/proceedings/papers/v28/pascanu13.pdf
Gradient clipping was proposed in ※: if the norm of the overall gradient exceeds a threshold, rescale the gradient so that its norm equals the threshold.
In Chainer, you can use:
optimizer.add_hook(chainer.optimizer.GradientClipping(threshold))
25/43
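Usage sketch (the model and its setup are assumed to exist; 5.0 is just an example threshold):

from chainer import optimizers
from chainer.optimizer import GradientClipping

optimizer = optimizers.Adam()
optimizer.setup(model)                     # `model` is your chainer.Chain
optimizer.add_hook(GradientClipping(5.0))  # rescale gradients whose norm exceeds 5.0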
Dropout application to LSTM
Dropout is a strong smoothing (regularization) method, but applying dropout just anywhere does not always work.
※ Recurrent Dropout without Memory Loss
https://arxiv.org/abs/1603.05118
According to ※, comparing:
1. Dropout on the hidden recurrent state of the LSTM
2. Dropout on the cell of the LSTM
3. Dropout on the input gate of the LSTM
Conclusion: 3. achieved the best performance.
Basically:
Recurrent part → dropout should not be applied.
Forward part → dropout should be applied.
26/43
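A minimal Chainer sketch of that recommendation (the layer sizes are made-up): dropout only on the forward (input-to-hidden) path, while the recurrent state inside the LSTM link is left untouched.

import chainer.functions as F
import chainer.links as L

embed = L.Linear(100, 200)   # forward (input-to-hidden) part
lstm = L.LSTM(200, 200)      # recurrent part

def step(x):
    # Dropout only on the forward path; no dropout inside the recurrence.
    return lstm(F.dropout(embed(x), ratio=0.5))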
Batch Normalization on LSTM
What is Batch Normalization?
Scaling the activation (sum of weighted inputs) distribution to N(0, 1).
http://jmlr.org/proceedings/papers/v37/ioffe15.pdf
In theory, BN should be applied over all the data; in practice, it is applied per mini-batch.
[Figure: Batch Normalization applied to the activation x]
27/43
Batch Normalization on LSTM
Applying BN to an RNN does not improve performance (※):
• hidden-to-hidden BN suffers from gradient explosion due to the repeated scaling;
• input-to-hidden BN makes learning faster, but does not improve performance.
※ Batch Normalized Recurrent Neural Networks
https://arxiv.org/abs/1510.01378
Three newer proposals (in order of proposal date):
(Weight Normalization) https://arxiv.org/abs/1602.07868
(Recurrent Batch Normalization) https://arxiv.org/abs/1603.09025
Layer Normalization https://arxiv.org/abs/1607.06450
28/43
Difference between Batch Normalization and Layer Normalization
Assume the activations $a_i^{(n)} = \sum_j w_{ij} x_j^{(n)}$ and $h_i^{(n)} = f(a_i^{(n)})$, arranged as a matrix with one row per sample $n$ in the mini-batch and one column per unit $i$.
[Figure: the matrix of activations $a_1^{(n)}, a_2^{(n)}, \dots, a_H^{(n)}$ for the samples in a mini-batch]
Batch Normalization normalizes vertically (per unit, across the mini-batch).
Layer Normalization normalizes horizontally (per sample, across the units).
The variance $\sigma$ becomes larger if gradient explosion happens; normalization makes the output more robust (details are in the papers).
29/43
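The vertical/horizontal distinction is just a choice of axis; a NumPy sketch with random activations (a small epsilon is added for numerical stability):

import numpy as np

a = np.random.randn(8, 5).astype(np.float32)   # (mini-batch, units)

# Batch Normalization: statistics per unit, across the mini-batch (vertical).
a_bn = (a - a.mean(axis=0)) / np.sqrt(a.var(axis=0) + 1e-5)

# Layer Normalization: statistics per sample, across the units (horizontal).
a_ln = (a - a.mean(axis=1, keepdims=True)) / np.sqrt(a.var(axis=1, keepdims=True) + 1e-5)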
Initialization Tips
Exact solutions to the nonlinear dynamics of learning in deep linear neural networks
https://arxiv.org/abs/1312.6120v3
A Simple Way to Initialize Recurrent Networks of Rectified Linear Units
https://arxiv.org/abs/1504.00941v2
An RNN with ReLU activations whose recurrent weights are initialized to the identity matrix performs as well as LSTM.
30/43
From "A Simple Way to Initialize Recurrent Networks of Rectified Linear Units"
31/43
MNIST 784 sequence prediction
32/43
Standard RNN: $h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1}), \quad y_t = \mathrm{softmax}(W_{hy} h_t)$
IRNN: $h_t = \mathrm{ReLU}(W_{xh} x_t + W_{hh} h_{t-1}), \quad y_t = \mathrm{ReLU}(W_{hy} h_t)$
[Figure: the RNN block with its inputs, hidden states, and outputs]
IRNN: initialize the recurrent weight matrix $W_{hh}$ with the identity matrix.
Then, when $x = 0$, the update reduces to $h = \mathrm{ReLU}(h)$, so the hidden state is carried over unchanged (for non-negative $h$).
33/43
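A small sketch of the IRNN idea in Chainer (the sizes are made-up; only the identity initialization of the recurrent weights and the ReLU recurrence are the point, not the paper's exact setup):

import numpy as np
import chainer.functions as F
import chainer.links as L

units = 100
w_in = L.Linear(units, units)                 # input-to-hidden
w_rec = L.Linear(units, units, nobias=True)   # hidden-to-hidden
w_rec.W.data[:] = np.eye(units, dtype=np.float32)   # identity initialization

def irnn_step(x, h):
    # h_t = ReLU(W_xh x_t + W_hh h_{t-1}); with identity W_hh and x = 0,
    # this reduces to h = ReLU(h), so the state is preserved.
    return F.relu(w_in(x) + w_rec(h))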
Extra materials
34/43
Various RNN models
Encoder-Decoder
Bidirectional LSTM
Attention model
35/43
[Figure: the unrolled RNN with its initial hidden state h0]
The RNN output changes depending on the initial hidden state h0.
h0 is also learnable by backpropagation.
It can also be connected to an encoder output → the encoder-decoder model.
RNN slice-wise pixel generation (focusing on the initial value of the RNN's hidden layer):
[Figure: the original image, an image generated from the learned h0, and an image generated from a random h0]
The first slice is 0 (black), but various sequences appear.
36/43
Encoder-Decoder model
[Figure: an encoder RNN reads the input sequence and its final hidden state is handed to a decoder RNN, which generates the output sequence]
Point: use this when your input and output have different sequence lengths.
The encoder's final hidden state (h_enc) is learned by training the encoder and the decoder at the same time.
To improve performance, you can use beam search in the decoder.
37/43
Bidirectional LSTM
[Figure: an encoder-decoder whose encoder also reads the input sequence in the reverse direction]
Long-range dependencies are difficult to learn unless you use LSTM (and even LSTM does not fundamentally solve gradient vanishing).
You can improve performance by adding an encoder that reads the input in reverse:
one encoder remembers the later part of the input, the other remembers the earlier part.
38/43
Attention model
[Figure: an encoder-decoder where, at each decoding step t, the decoder also receives a weighted combination of the encoder's hidden states, with attention weights α_{1,t}, α_{2,t}, α_{3,t} and context vectors C_1, C_2, C_3]
Moreover, using the intermediate hidden states of the encoder leads to better performance!
39/43
Gated Recurrent Unit (GRU)
A variant of LSTM:
• the cell is removed
• the gates are reduced to two
Despite the lower complexity, its performance is not bad.
It often appears in MT tasks and SD tasks.
[Figure: GRU compared with LSTM]
40/43
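For reference (not on the slide), the standard GRU update with its two gates, update $z_t$ and reset $r_t$, is:

$z_t = \sigma(W_{xz} x_t + W_{hz} h_{t-1})$
$r_t = \sigma(W_{xr} x_t + W_{hr} h_{t-1})$
$\tilde{h}_t = \tanh(W_{xh} x_t + W_{hh}(r_t \odot h_{t-1}))$
$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$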
Try to split the LSTM and turn it upside down.
1. What the cell is to LSTM, the hidden state is to GRU.
2. Share the input gate and the output gate as a single update gate.
3. Remove the tanh applied to the LSTM cell output.
GRU can be interpreted as a special case of LSTM.
[Figure: GRU and LSTM side by side]
41/43
1. Try to split the LSTM and turn it upside down.
GRU can be interpreted as a special case of LSTM.
[Figure: the LSTM split apart and flipped]
42/43
GRU can be interpreted as a special case of LSTM.
1. Try to split the LSTM and turn it upside down.
2. See the LSTM cell as the GRU hidden state.
3. Share the input gate and the output gate as a single update gate.
4. Remove the tanh applied to the LSTM cell output.
[Figure: GRU and LSTM side by side]
43/43