RNN & NLP Application
강원대학교 IT대학
이창기
차례
• RNN
• Word Embedding
• NLP application
Recurrent Neural Network
• “Recurrent” property dynamical system over time
Bidirectional RNN
• Exploit future context as well as past
Long Short-Term Memory RNN • Vanishing Gradient Problem for RNN
• LSTM can preserve gradient information
LSTM Block Architecture
Gated Recurrent Unit (GRU)
• 𝑟𝑡 = 𝜎 𝑊𝑥𝑟𝑥𝑡 + 𝑊ℎ𝑟ℎ𝑡−1 + 𝑏𝑟
• 𝑧𝑡 = 𝜎 𝑊𝑥𝑥𝑥𝑡 + 𝑊ℎ𝑧ℎ𝑡−1 + 𝑏𝑧
• ℎ 𝑡 = 𝜙 𝑊𝑥ℎ𝑥𝑡 + 𝑊ℎℎ 𝑟𝑡 ⊙ ℎ𝑡−1 + 𝑏ℎ
• ℎ𝑡 = 𝑧𝑡ℎ𝑡 + 1 − 𝑧𝑡 ℎ 𝑡
• 𝑦𝑡 = 𝑔(𝑊ℎ𝑦ℎ𝑡 + 𝑏𝑦)
LSTM RNN 응용 예
• Hand Writing by Machine
• Music Composition
• Neural Machine Translation
• Image Caption Generation CNN + LSTM RNN
차례
• RNN
• Word Embedding
• NLP application
Deep Belief Network [Hinton06]
• Key idea
– Pre-train layers with an unsupervised learning algorithm in phases
– Then, fine-tune the whole network by supervised learning
• DBN are stacks of Restricted Boltzmann Machines (RBM)
10
Restricted Boltzmann Machine
• A Restricted Boltzmann machine (RBM) is a generative stochastic neural network that can learn a probability distribution over its set of inputs
• Major applications – Dimensionality reduction
– Topic modeling, …
11
Training DBN: Pre-Training
• 1. Layer-wise greedy unsupervised pre-training – Train layers in phase from the bottom layer
12
Training DBN: Fine-Tuning
• 2. Supervised fine-tuning for the classification task
13
The Back-Propagation Algorithm
Autoencoder
• Autoencoder is an NN whose desired output is the same as the input
– To learn a compressed representation (encoding) for a set of data.
– Find weight vectors A and B that minimize: Σi(yi-xi)
2
15 <겨울학교14 Deep Learning 자료 참고>
Stacked Autoencoders
• After training, the hidden node extracts features from the input nodes
• Stacking autoencoders constructs a deep network
16 <겨울학교14 Deep Learning 자료 참고>
텍스트의 표현 방식
• One-hot representation (or symbolic)
– Ex. [0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0]
– Dimensionality • 50K (PTB) – 500K (big vocab) – 3M (Google 1T)
– Problem • Motel [0 0 0 0 0 0 0 0 1 0 0] AND
• Hotel [0 0 0 0 0 0 1 0 0 0 0] = 0
• Continuous representation
– Latent Semantic Analysis, Random projection
– Latent Dirichlet Allocation, HMM clustering
– Neural Word Embedding • Dense vector
• By adding supervision from other tasks improve the representation
Neural Network Language Model (Bengio00,03)
Shared weights = Word embedding
• Idea – A word and its context is a
positive training sample
– A random word in that same context negative training sample
– Score(positive) > Score(neg.)
• Training complexity is high – Hidden layer output
– Softmax in the output layer
• Hierarchical softmax
• Negative sampling
• Ranking(hinge loss) Input Dim: 1 Dim: 2 Dim: 3 Dim: 4 Dim: 5
1 (boy) 0.01 0.2 -0.04 0.05 -0.3
2 (girl) 0.02 0.22 -0.05 0.04 -0.4
LT: |V|*d, Input(one hot): |V|*1 LTT I
Ranking-based Model (Collobert)
본 연구 추가
Shared weights = Word embedding
w(t-2)
w(t-1)
w(t) or wc(t)
s or sc
Negative sampling
s > sc
Word2Vec: CBOW, Skip-Gram
• Remove the hidden layer Speedup 1000x – Negative sampling
– Frequent word sampling
– Multi-thread (no lock)
• Continuous Bag-of-words (CBOW) – Predicts the current word given the con
text
• Skip-gram – Predicts the surrounding words given
the current word
– CBOW + DropOut/DropConnect
Shared weights
한국어 Word Embedding: NNLM
• Data – 세종코퍼스 원문 + Korean Wiki abstract + News data
• 2억 8000만 형태소
– Vocab. size: 60,000
• 모든 형태소 대상 (기호, 숫자, 한자, 조사 포함)
• 숫자 정규화 + 형태소/POS: 정부/NNG, 00/SN
• NNLM model – Dimension: 50
– Matlab 구현
– 학습시간: 16일
한국어 Word Embedding: NNLM
한국어 Word Embedding: Word2Vec(CBOW)
• Data – 2012-2013 News + Korean Wiki abstract:
• 9GB raw text
• 29억 형태소
– Vocab. size: 100,000 • 모든 형태소 대상 (기호, 숫자, 한자, 조사 포함)
• 숫자 정규화 + 영어 소문자 + 형태소/POS
• 기존 Word2Vec(CBOW, Skip-Gram) – 모델: CBOW > SKIP-Gram
– 학습 시간: 24분 Shared weights
한국어 Word Embedding: Word2Vec(CBOW)
Word Embedding 응용: Word Analogy
King – Man + Woman ≈ Queen
http://deeplearner.fz-qqq.net/
Bilingual Word Embedding
• Bilingual Word Embedding for PBMT (EMNLP13) – 중국어-영어 word aliment 이용
한영 Bilingual Word Embedding
한영 Bilingual Word Embedding – cont’d
차례
• RNN
• Word Embedding
• NLP application
Sequence Labeling – RNN, LSTM
Word embedding
Feature embedding
FFNN(or CNN), CNN+CRF (SENNA)
x(t-1) x(t ) x(t+1)
h(t-1) h(t ) h(t+1)
y(t+1) y(t-1) y(t )
x(t-1) x(t ) x(t+1)
h(t-1) h(t ) h(t+1)
y(t+1) y(t-1) y(t )
x(t-1) x(t ) x(t+1)
h(t-1) h(t ) h(t+1)
y(t+1) y(t-1) y(t )
x(t-1) x(t ) x(t+1)
y(t+1) y(t-1) y(t )
RNN, CRF Recurrent CRF
x(t-1) x(t ) x(t+1)
h(t-1) h(t ) h(t+1)
y(t+1) y(t-1) y(t )
x(t-1) x(t ) x(t+1)
y(t+1) y(t-1) y(t )
C(t) x(t ) h(t )
i (t )
f (t )
o(t )
x(t-1) x(t ) x(t+1)
h(t-1) h(t ) h(t+1)
y(t+1) y(t-1) y(t )
LSTM RNN + CRF LSTM-CRF (KCC 15)
x(t-1) x(t ) x(t+1)
h(t-1) h(t ) h(t+1)
y(t+1) y(t-1) y(t )
LSTM-CRF
• 𝑖𝑡 = 𝜎 𝑊𝑥𝑖𝑥𝑡 + 𝑊ℎ𝑖ℎ𝑡−1 + 𝑊𝑐𝑖𝑐𝑡−1 + 𝑏𝑖
• 𝑓𝑡 = 𝜎 𝑊𝑥𝑓𝑥𝑡 + 𝑊ℎ𝑓ℎ𝑡−1 + 𝑊𝑐𝑓𝑐𝑡−1 + 𝑏𝑓
• 𝑐𝑡 = 𝑓𝑡 ⊙ 𝑐𝑡−1 + 𝑖𝑡 ⊙ tanh 𝑊𝑥𝑐𝑥𝑡 + 𝑊ℎ𝑐ℎ𝑡−1 + 𝑏𝑐
• 𝑜𝑡 = 𝜎 𝑊𝑥𝑜𝑥𝑡 + 𝑊ℎ𝑜ℎ𝑡−1 + 𝑊𝑐𝑜𝑐𝑡 + 𝑏𝑜
• ℎ𝑡 = 𝑜𝑡 ⊙ tanh(𝑐𝑡)
• 𝑦𝑡 = 𝑔(𝑊ℎ𝑦ℎ𝑡 + 𝑏𝑦)
• 𝑦𝑡 = 𝑊ℎ𝑦ℎ𝑡 + 𝑏𝑦
• 𝑠 𝐱, 𝐲 = 𝐴 𝑦𝑡−1, 𝑦𝑡 + 𝑦𝑡𝑇𝑡=1
• log 𝑃 𝐲 𝐱 = 𝑠 𝐱, 𝐲 − log exp(𝑠(𝐱, 𝐲′))𝐲′
GRU+CRF
• 𝑟𝑡 = 𝜎 𝑊𝑥𝑟𝑥𝑡 + 𝑊ℎ𝑟ℎ𝑡−1 + 𝑏𝑟
• 𝑧𝑡 = 𝜎 𝑊𝑥𝑧𝑥𝑡 + 𝑊ℎ𝑧ℎ𝑡−1 + 𝑏𝑧
• ℎ 𝑡 = 𝜙 𝑊𝑥ℎ𝑥𝑡 + 𝑊ℎℎ 𝑟𝑡 ⊙ ℎ𝑡−1 + 𝑏ℎ
• ℎ𝑡 = 𝑧𝑡 ⊙ ℎ𝑡−1 + 𝟏 − 𝑧𝑡 ⊙ ℎ 𝑡
• 𝑦𝑡 = 𝑔(𝑊ℎ𝑦ℎ𝑡 + 𝑏𝑦)
• 𝑦𝑡 = 𝑊ℎ𝑦ℎ𝑡 + 𝑏𝑦
• 𝑠 𝐱, 𝐲 = 𝐴 𝑦𝑡−1, 𝑦𝑡 + 𝑦𝑡𝑇𝑡=1
• log 𝑃 𝐲 𝐱 = 𝑠 𝐱, 𝐲 − log exp(𝑠(𝐱, 𝐲′))𝐲′
x(t-1) x(t ) x(t+1)
h(t-1) h(t ) h(t+1)
y(t+1) y(t-1) y(t )
Bi-LSTM CRF
• Bidirectional LSTM+CRF
• Bidirectional GRU+CRF
• Stacked Bi-LSTM+CRF …
x(t-1) x(t ) x(t+1)
h(t-1) h(t ) h(t+1)
y(t+1) y(t-1) y(t )
bh(t-1) bh(t ) bh(t+1)
Stacked LSTM CRF
x(t-1) x(t ) x(t+1)
h(t-1) h(t ) h(t+1)
y(t+1) y(t-1) y(t )
bh(t-1) bh(t ) bh(t+1)
x(t-1) x(t ) x(t+1)
h(t-1) h(t ) h(t+1)
y(t+1) y(t-1) y(t )
h2(t-1) h2(t ) h2(t+1)
LSTM CRF with Context words = CNN + LSTM CRF
x(t-1) x(t ) x(t+1)
h(t-1) h(t ) h(t+1)
y(t+1) y(t-1) y(t )
x(t-2) x(t+2)
• Bi-LSTM CRF =~ LSTM CRF with Context > LSTM CRF
영어 개체명 인식 (KCC 15, Journal submitted)
영어 개체명 인식 (CoNLL03 data set) F1(dev) F1(test)
SENNA (Collobert) - 89.59
Structural SVM (baseline + Word embedding feature) - 85.58
FFNN (Sigm + Dropout + Word embedding) 91.58 87.35
RNN (Sigm + Dropout + Word embedding) 91.83 88.09
LSTM RNN (Sigm + Dropout + Word embedding) 91.77 87.73
GRU RNN (Sigm + Dropout + Word embedding) 92.01 87.96
CNN+CRF (Sigm + Dropout + Word embedding) 93.09 88.69
RNN+CRF (Sigm + Dropout + Word embedding) 93.23 88.76
LSTM+CRF (Sigm + Dropout + Word embedding) 93.82 90.12
GRU+CRF (Sigm + Dropout + Word embedding) 93.67 89.98
한국어 개체명 인식 (Journal submitted)
한국어 개체명 인식 (TV domain) F1(test)
Structural SVM (baseline) (basic + NE dic. + word cluster feature + morpheme feature)
89.03
FFNN (ReLU + Dropout + Word embedding) 87.70
RNN (Tanh + Dropout + Word embedding) 88.93
LSTM RNN (Tanh + Dropout + Word embedding) 89.38 (+0.35)
Bi-LSTM RNN (Tanh + Dropout + Word embedding) 89.21 (+0.18)
CNN+CRF (ReLU + Dropout + Word embedding) 90.06 (+1.03)
RNN+CRF (Sigm + Dropout + Word embedding) 90.52 (+1.49)
LSTM+CRF (Sigm + Dropout + Word embedding) 91.04 (+2.01)
GRU+CRF (Sigm + Dropout + Word embedding) 91.02 (+1.99)
NLU 실험 (ATIS data)
NLU (Airplane Traveling Information System data) F1(dev) F1(test)
FFNN (Sigm + Dropout + Word embedding) 93.75 92.13
RNN (Sigm + Dropout + Word embedding) 98.05 94.78
LSTM RNN (Sigm + Dropout + Word embedding) 98.27 94.85
GRU RNN (Sigm + Dropout + Word embedding) 98.11 94.82
CNN4+CRF (Sigm + Dropout + Word embedding) 96.66 95.56
RNN+CRF (Sigm + Dropout + Word embedding) 98.42 96.19
LSTM+CRF (Sigm + Dropout + Word embedding) 98.79 96.48
학습 속도
0
50
100
150
200
250
FFNN
RNN
LSTM RNN
GRU
SCRN
R-CRF
LSTM R-CRF
GRU-CRF
SCRN-CRF
Neural Architectures for NER (Arxiv16)
• LSTM-CRF model + Char-based Word Representation – Char: Bi-LSTM RNN
Neural Architectures for NER (Arxiv16)
End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF (ACL16)
• LSTM-CRF model + Char-level Representation – Char: CNN
End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF (ACL16)
NER with Bidirectional LSTM-CNNs (Arxiv16)
의미역 결정 (Semantic Role Labeling)
Korean PropBank
• Virginia Corpus – 군대 도메인, 54,500 단어 도메인이 맞지 않음
• Newswire Corpus – 신문 도메인
(1994/6/2~2000/3/20), 131,800 단어
– 구구조 기반 태깅을 의존 구조로 변환 4,882 문장
• Frame file – 2,749 xml files
• 용언의 root (not stem)
• 용언화 가능 명사 (deverbal noun)
• Frame file 예제
– 덧붙.01
– English Define = add
– Role set • ARG0: adder
• ARG1: thing added
• ARG2: added to
– Mapping1 • Rel = 덧붙이다
• Src = sbj, Trg = ARG0
• Src = obj, Trg = ARG1
• Src = comp, Trg=ARG2
한국어 의미역 결정 (SRL)
• 서술어 인식(PIC)
– 그는 르노가 3월말까지 인수제의 시한을 [갖고]갖.1 있다고 [덧붙였다]덧붙.1
• 논항 인식(AIC)
– 그는 [르노가]ARG0 [3월말까지]ARGM-TMP 인수제의 [시한을]ARG1 [갖고]갖.1 [있다고]AUX 덧붙였다
– [그는]ARG0 르노가 3월말까지 인수제의 시한을 갖고 [있다고]ARG1 [덧붙였다]
덧붙.1
의존 구문 분석
의미역 결정
딥러닝 기반 한국어 의미역 결정 – 한글 및 한국어 정보처리/동계학술대회 15
• Bidirectional LSTM+CRF
• Korean Word embedding
– Predicate word, argument word
– NNLM
• Feature embedding
– POS, distance, direction
– Dependency path, LCA
• Bi-LSTM+CRF 성능 (AIC)
– F1: 78.2% (+1.2)
– Backward LSTM+CRF: F1 77.6%
– S-SVM 성능 (KCC14) • 기본자질: F1 74.3%
• 기본자질+word cluster: 77.0% – 정보과학회 논문지 2015.02
x(t-1) x(t ) x(t+1)
h(t-1) h(t ) h(t+1)
y(t+1) y(t+1) y(t )
bh(t-1) bh(t ) bh(t+1)
Stacked Bi-LSTM CRF (한국어의미역결정) (정보과학회지 제출)
Syntactic information w/ w/o
Structural SVM
FFNN
Backward LSTM CRFs
Bidirectional LSTM CRFs
Stacked Bidirectional LSTM CRFs (2 layers)
Stacked Bidirectional LSTM CRFs (3 layers)
76.96
76.01
76.79
78.16
78.12
78.14
74.15
73.22
76.37
78.17
78.57
78.36
CNN for Sentence Classification (Sentiment Analysis)
• Convolutional NN – Convolution Layer
• Sparse Connectivity
• Shared Weights
• Multiple feature maps
– Sub-sampling Layer • Average/max pooling
• NxN1
• NLP (Sentence Classification)에 적용 – ACL14
– EMNLP14
Multiple feature maps
한국어 감성 분석 – CNN
• Mobile data – Train: 4543, Test: 500
• EMNLP14 모델(CNN) 적용 – Matlab으로 구현
– Word embedding • 한국어 10만 단어 + 도메인 특화 1420 단어
한국어 감성 분석 – CNN 실험 Data set Model Accuracy
Mobile Train: 4543 Test: 500
SVM (word feature) 85.58
CNN(relu,kernel3,hid50)+Word embedding (word feature)
91.20
CNN(relu,kernel3,hid50)+Random init. 89.00
LSTM RNN 기반 한국어 감성분석
• LSTM RNN-based encoding – Sentence embedding 입력
– Fully connected NN 출력
– GRU encoding 도 유사함
x(1) x(2 ) x(t)
h(1) h(2 ) h(t)
y
Data set Model Accuracy
Mobile Train: 4543 Test: 500
SVM (word feature) 85.58
CNN(relu,kernel3,hid50)+Word embedding (word feature)
91.20
GRU encoding + Fully connected NN 91.12
LSTM RNN encoding + Fully connected NN 90.93
Attention-based LSTM RNN encoding 기반 한국어 감성분석
• Bi-LSTM RNN-based encoding + Attention Mechanism
Data set Model Accuracy
Mobile Train: 4543 Test: 500
SVM (word feature) 85.58
CNN(relu,kernel3,hid50)+Word embedding (word feature)
91.20
GRU encoding + Fully connected NN 91.12
Attention-based GRU encoding + Fully connected NN
90.73
y
Attention Mechanism 결과
• 긍정 – 반응/nng/2 속도/nng/2 터치감/nng/2 만족/nng/2 하/xsa/19 고요/ec/25 </s>/44
– 카메라/nng/1 가/jks/2 깨끗이/mag/1 잘/mag/0 나오/vv/0 는/etm/1 듯/nnb/1 하/xsa/2 아/ec/1 참/mag/0 마음/nng/0 에/jkb/10 들/vv/12 었/ep/12 습니다/ef/12 ./sf/15 </s>/21
– 검정/nng/0 보다/jkb/0 화이트/nng/0 를/jko/1 선호/nng/1 하/xsv/3 는데/ec/3 다행/nng/2 이/jks/4 화이트/nng/2 가/jks/6 있/vv/4 어서/ec/5 디자인/nng/4 도/jx/8 맘/nng/8 에/jkb/10 들/vv/9 구요/ec/9 </s>/11
• 부정 – 화면/nng/10 이나/jc/12 해상도/nng/6 는/jx/12 않/nng/25 좋/va/1 아서/ec/8
</s>/23
– 화면/nng/5 자체/nng/9 가/jks/13 작/va/4 은데/ec/3 편하/va/1 겠/ep/7 냐/ec/16 </s>/38
– 메뉴/nng/5 무지/nng/0 느리/va/4 고요/ec/3 인터넷/nng/1 심심/xr/0 하/xsa/3 면/ec/4 렉먹/vv/7 고/ec/12 문자/nng/8 전송/nng/4 속도/nng/2 gg/sl/12 </s>/28
– 요금/nng/12 넘/vv/9 비싸/va/15 아/ec/19 </s>/43
GRU encoding + Deep Output
• GRU encoding + Deep Output – Sentence embedding
입력
– Fully connected NN 출력
• 실험 결과 성능 개선 없음
x(1) x(2 ) x(t)
h(1) h(2 ) h(t)
y
h'(t)
Attention-based GRU encoding + Deep Output
• Attention-based GRU encoding + Deep Output – Sentence embedding
입력
– Fully connected NN 출력
• 실험 결과 성능 개선 없음
h'
y
Stacked GRU encoding + MLP
• Stacked GRU encoding – Sentence embedding
입력
– Fully connected NN 출력
• 실험 중 – 일부 태스크에서 좋은 성능
을 보임
x(1) x(2 ) x(t)
h(1) h(2 ) h(t)
y
h’t)
Neural Machine Translation
T|S 777 항공편 은 3 시간 동안 지상 에 있 겠 습니다 . </s>
flight 0.5 0.4 0 0 0 0 0 0 0 0 0 0 0
777 0.3 0.6 0 0 0 0 0 0 0 0 0 0 0
is 0 0.1 0 0 0.1 0.2 0 0.4 0 0.1 0 0 0
on 0 0 0 0 0 0 0 0.7 0.2 0.1 0 0 0
the 0 0 0 0.2 0.3 0.3 0.1 0 0 0 0 0
ground 0 0 0 0.1 0.2 0.5 0.3 0 0 0 0 0 0
for 0 0 0 0.1 0.2 0.5 0.1 0.1 0 0 0 0 0
three 0 0 0 0.2 0.2 0.6 0 0 0 0 0 0 0
hours 0 0 0 0.1 0.3 0.5 0 0 0 0 0 0 0
. 0 0 0 0.4 0 0.1 0.2 0.1 0.1 0.1 0 0 0
</s> 0 0 0 0 0 0 0 0.1 0 0.1 0.1 0.3 0.3
Recurrent NN Encoder–Decoder for Statistical Machine Translation (EMNLP14)
GRU RNN Encoding GRU RNN Decoding Vocab: 15,000 (src, tgt)
Sequence to Sequence Learning with Neural Networks (NIPS14 – Google)
Source Voc.: 160,000 Target Voc.: 80,000 Deep LSTMs with 4 layers Train: 7.5 epochs (12M sentences, 10 days with 8-GPU machine)
Neural MT by Jointly Learning to Align and Translate (ICLR15)
GRU RNN + Alignment Encoding GRU RNN Decoding Vocab: 30,000 (src, tgt) Train: 5 days
NMT 모델의 장단점
• NMT의 장점 – Feature Engineering이 필요 없음 – End-to-end 방식의 단일 신경망 구조
• 단어 정렬, 번역 모델, 언어 모델 등이 필요 없음
– 디코더가 간단함 – 구문 분석 등이 필요 없음
• Pre-ordering, Syntax-based SMT
• NMT의 단점 – 출력 언어 단어 사전의 크기가 제한 됨
• 사전의 크기에 비례하여 학습/디코딩 시간이 필요 • 미등록어 처리가 필요함 • 최근 NMT 논문들은 이러한 미등록어 처리에 관한 연구가 대
부분임
문자 단위의 NMT 제안 (WAT15, 한글 및 한국어15)
• 기존의 NMT: 단어 단위의 인코딩-디코딩
– 미등록어 후처리 or NMT 모델의 수정 등이 필요
• 문자 단위의 NMT
– 입력 언어는 단어 단위로 인코딩
• 입력 언어를 문자 단위로 인코딩 성능 하락
– 출력 언어는 문자 단위로 디코딩
• 단어 단위: その/UN 結果/NCA を/PS 詳細/NCD …
• 문자 단위: そ/B の/I 結/B 果/I を/B 詳/B 細/I …
• 문자 단위 NMT의 장점
– 모든 문자를 사전에 등록 미등록어 문제해결
– 학습 및 디코딩 속도 향상
– 기존 NMT 모델의 수정이 필요 없음
– 미등록어 후처리 작업이 필요 없음
본 연구에서 확장한 NMT 모델
• Encoding (source language)
– Bidirectional GRU RNN 이용
– ℎ𝑡 = [ℎ𝑡 , ℎ𝑡].
• Decoding (target language)
– 𝑐 = ℎ0 (global context)
– 𝑐𝑡 = 𝑎𝑡𝑖ℎ𝑖 (local context)
– 𝑧𝑡 = 𝑠𝑖𝑔𝑚 𝑊𝑧𝑥𝐸𝑡(𝑦𝑡−1) + 𝑊𝑧𝑠𝑠𝑡−1 + 𝑏𝑧
– 𝑟𝑡 = 𝑠𝑖𝑔𝑚 𝑊𝑟𝑥𝐸𝑡(𝑦𝑡−1) + 𝑊𝑟𝑠𝑠𝑡−1 + 𝑏𝑟
– 𝑠 𝑡 = 𝑡𝑎𝑛ℎ 𝑊𝑠𝑥𝐸𝑡(𝑦𝑡−1) + 𝑊𝑠𝑠(𝑟𝑡𝑠𝑡−1) + 𝑊𝑠𝑐𝑐𝑡 + 𝑏𝑠
– 𝑠𝑡 = (1 − 𝑧𝑡)𝑠𝑡−1 + 𝑧𝑡𝑠 𝑡
– 𝒔′𝒕 = 𝒓𝒆𝒍𝒖 𝑾𝒔′𝒔𝒔𝒕 + 𝒃𝒚
– 𝑦𝑡 =
𝑠𝑜𝑓𝑡𝑚𝑎𝑥 𝑾𝒚𝒔′𝒔′𝒕 + 𝑊𝑦𝑠𝑠𝑡 + 𝑊𝑦𝑦𝐸𝑡(𝑦𝑡−1) + 𝑾𝒚𝒄𝒄 + 𝑏𝑦
S’ S’
yt-1 yt
c
ASPEC J-to-E 실험
• ASPEC J-to-E data
• 성능 (Juman 이용 BLEU)
– PB SMT: 18.45
– HPB SMT: 18.72
– Tree-to-string SMT: 22.16
– NMT (Word-level decoding): 21.63
– NMT (Character-level decoding): 21.72
– NMT (Word+Character-level decoding): 23.76
ASPEC E-to-J 실험 (WAT15)
• ASPEC E-to-J data
• 성능 (Juman 이용 BLEU)
– PB SMT: 27.48
– HPB SMT: 30.19
– Tree-to-string SMT: 32.63
– NMT (Word-level decoding): 29.78
– NMT (Character-level decoding): 33.14 (4위) • RIBES 0.8073 (2위)
– Tree-to-String + NMT(Character-level) re-ranking • BLEU 34.60 (2위)
• Human 53.25 (2위)
JPO K-to-J 실험 (WAT15)
• JPO Patent K-to-J data
• 성능 (Juman 이용 BLEU)
– PB SMT: 69.22
– HPB SMT: 67.41
– NMT(Word-level): 61.52
– NMT(Character-level): 65.72
– PB SMT + NMT(Character-level) re-ranking • BLEU 71.38 (2위)
• Human 14.75 (1위)
ASPEC E-to-J 실험 – 예제
• This/DT:0 paper/NN:1 explaines/NNS:2 experimental/JJ:3 result/NN:4 according/VBG:5 to/TO:6 the/DT:7 model/NN:8 ./.:9 </s>:10
• こ/B:0 の/I:1 モ/B:2 デ/I:3 ル/I:4 に/B:5 よ/B:6 る/I:7 実/B:8 験/I:9 結/B:10 果/I:11 を/B:12 説/B:13 明/I:14 し/B:15 た/B:16 。/B:17 </s>:18
ASPEC J-to-E 실험 – 예제
• 又/CJ:0 メルト/NQC:1 フラクチャー/NCC:2 防止/NCS:3 策/NCC:4 に/PS:5 つい/VC:6 て/PJ:7 も/PC:8 述べ/VC:9 た/VX:10 </s>:11
• M/B:0 e/I:1 l/I:2 t/I:3 i/I:4 n/I:5 g/I:6 prevention/NN:7 measures/NNS:8 are/VBP:9 also/RB:10 described/VBN:11 ./.:12 </s>:13
JPO K-to-J 실험 – 예제
• ○/ETC:0 :/ETC:1 필름/NOUN:2 면/NOUN:3 과의/JOSA:4 접착/NOUN:5 이/JOSA:6 없/NORMALVERB:7 다/EOMI:8 ./ETC:9 </s>:10
• ○/B:0 :/B:1 フ/B:2 ィ/I:3 ル/I:4 ム/I:5 面/B:6 と/B:7 の/B:8 接/B:9 着/I:10 が/B:11 な/B:12 い/I:13 。/B:14 </s>:15
Input-feeding Approach (EMNLP15)
The attentional decisions are made independently, which is suboptimal.
In standard MT, a coverage set is often maintained during the translation process to keep track of which source words have been translated.
Effect: - We hope to make the model fully aware of
previous alignment choices - We create a very deep network spanning
both horizontally and vertically
Copying Mechanism or CopyNet (ACL16)
Abstractive Text Summarization (한글 및 한국어 16)
로드킬로 숨진 친구의 곁을 지키는 길고양이의 모습이 포착되었다.
RNN_search+input_feeding+CopyNet
End-to-End 한국어 형태소 분석 (동계학술대회16)
Attention + Input-feeding + Copying mechanism
Sequence-to-sequence 기반 한국어 구구조 구문 분석 (한글 및 한국어 16)
입력 예시 1 43/SN 국/NNG <sp> 참가/NNG
입력 예시 2 4 3 <SN> 국 <NNG> <sp> 참 가 <NNG>
43/SN 국/NNG + 참가/NNG
NP NP
NP
(NP (NP 43/SN + 국/NNG) (NP 참가/NNG))
GRU
GRU
GRU
GRU
GRU
GRU
x1
h1t-1
h2t-1
yt-1
h1t
h2t
yt
x2 xT
ct
Attention + Input-feeding
입력
선 생 <NNG> 님 <XSN> 의 <JKG> <sp> 이 야 기 <NNG> <sp> 끝 나 <VV> 자 <EC> <sp> 마 치 <VV> 는 <ETM> <sp>
종 <NNG> 이 <JKS> <sp> 울 리 <VV> 었 <EP> 다 <EF> . <SF>
정답 (S (S (NP_SBJ (NP_MOD XX ) (NP_SBJ XX ) ) (VP XX ) ) (S
(NP_SBJ (VP_MOD XX ) (NP_SBJ XX ) ) (VP XX ) ) )
RNN-search[7] (Beam size 10)
(S (VP (NP_OBJ (NP_MOD XX ) (NP_OBJ XX ) ) (VP XX ) ) (S (NP_SBJ (VP_MOD XX ) (NP_SBJ XX ) ) (VP XX ) ) )
RNN-search + Input-feeding +
Dropout (Beam size 10)
(S (S (NP_SBJ (NP_MOD XX ) (NP_SBJ XX ) ) (VP XX ) ) (S (NP_SBJ (VP_MOD XX ) (NP_SBJ XX ) ) (VP XX ) ) )
Sequence-to-sequence 기반 한국어 구구조 구문 분석 (한글 및 한국어 16)
모델 F1
스탠포드 구문분석기[13] 74.65
버클리 구문분석기[13] 78.74
형태소 + <sp> RNN-search[7] (Beam size 10)
87.34(baseline)
87.65*(+0.31)
형태소의 음절 + 품사태그 + <sp>
RNN-search[7] (Beam size 10) 87.69(+0.35)
88.00*(+0.66)
RNN-search + Input-feeding (Beam size 10) 88.23(+0.89)
88.68*(+1.34)
RNN-search + Input-feeding + Dropout (Beam size 10)
88.78(+1.44)
89.03*(+1.69)
Pointer Network (NIPS15) • Travelling Salesman Problem: NP-hard • Pointer Network can learn approximate solutions: O(n^2)
상호참조해결을 위한 포인터 네트워크 모델 (Journal submitted)
• 입력: 단어(형태소) 열, 출발점(대명사, 한정사구(이 별자리 등))
– X = {A:0, B:1, C:2, D:3, <EOS>:4}, Start_Point=A:0
• 출력: 입력 단어 열의 위치(Pointer) 열 Entity
– Y = {A:0, C:2, D:3, <EOS>:4}
• 특징: End-to-end 방식의 대명사 상호참조 해결 (mention detection 과정 X)
A B C D <EOS> A C D <EOS>
Encoding Decoding
Hidden Layer
Projection Layer
Attention Layer
포인터 네트워크를 이용한 한국어 의존 구문 분석 (동계학술대회16)
Neural Responding Machine for Short-Text Conversation (ACL 15)
Neural Responding Machine – cont’d
실험 결과 (ACL 15)
관련연구: Building End-To-End Dialogue Systems Using Generative Hierarchical Neural Network
Models (arXiv 15) - HRED
관련연구: Attention with Intention for a Neural Network Conversation Model (arXiv 15) - AWI
관련연구: A Diversity-Promoting Objective Function for Neural Conversation Models
(arXiv 16) - MMI
관련연구: A Persona-Based Neural Conversation Model (arXiv 16)
Speaker model + MMI
강원대 딥러닝 기반 챗봇 모델
• Short-Text Conversation 가정
– 사용자의 현재 발화 정보만을 이용하여 대답을 생성
• GRU_encdec 모델
– Gate Recurrent Unit(GRU) 기반의 Encoder-Decoder 모델
• Stacked_GRU_encdec 모델
– 2 layer GRU_encdec 모델
• GRU_search 모델
– Attention 기반의 모델
• GRU_DO_search 모델
– Attention 기반의 모델 + Deep Output 모델
Subtitle 데이터 실험
• 학습데이타: 영화 대사 데이터 100만 문장
• 모델: GRU + enc-dec • what is two plus two ?
– A) one , two , three .
• how old are you ?
– A) 18 years .
• are you a follower or a leader ?
– A) yes , i am .
• my name is david . what is my name ?
– A) david palmer .
• what is the purpose of being intelligent ?
– A) it is difficult to explain .
• what happens if machines can think ?
– A) if they find anything , they 'll kill us .
고객센터 데이터 실험
• 학습데이타: 고객센터 데이터 18만 문장
• 모델: Stacked GRU + enc-dec • 세탁기가 전원이 안 들어와요?
– A) 네 아 그러세요
– A) 많이 불편하셨겠습니다
– A) 드럼이세요 아니면 일반 세탁기세요
• 언제 방문해 줄수 있나요?
– A) 언제 방문 드리면 편하실까요
– A) 언제 방문 드리면 편하시겠어요
• 출장비가 얼마죠?
– A) 출장 점검료는 15000원이구요
– A) 출장비가 만오천원이구요
Short-Text Conversation (동계학술대회16)
- Data: 클리앙 ‘아무거나 질문 게시판’ - 77,346 질문-응답 쌍 - 학습:개발:평가 = 8:1:1
이미지 캡션 생성 소개
• 이미지 내용 이해 이미지 내용을 설명하는 캡션 자동 생성 – 이미지 인식(이해) 기술 + 자연어처리(생성) 기술
• 활용 분야 – 이미지 검색
– 맹인들을 위한 사진 설명, 네비게이션
– 유아 교육, …
기존 연구 • Multimodal RNN (M-RNN) [2]
Baidu CNN + vanilla RNN
CNN: VGGNet • Neural Image Caption
generator (NIC) [4] Google CNN + LSTM RNN
CNN: GoogLeNet
• Deep Visual-Semantic alignments (DeepVS) [5] Stanford University RCNN + Bi-RNN
alignment (training) CNN + vanilla RNN
CNN: AlexNet
AlexNet, VGGNet
RNN을 이용한 이미지 캡션 생성 (동계학술대회 15)
• CNN + RNN
– CNN: VGGNet 15 번째 layer (4096 차원)
– RNN: GRU (LSTM RNN의 변형) • Hidden layer unit: 500, 1000 (Best)
• Multimodal layer unit: 500, 1000 (Best)
– Word embedding • SENNA: 50차원 (Best)
• Word2Vec: 300 차원
– Data set • Flickr 8K : 8000 이미지 * 이미지 캡션 5문장
– 6000 학습, 1000 검증, 1000 평가
• Flickr 30K : 31783 이미지 * 이미지 캡션 5문장
– 29000 학습, 1014 검증, 1000 평가
– 4가지 모델 실험 • GRU-DO1, GRU-DO2, GRU-DO3, GRU-DO4
GRU
Embedding
CNNMultimodal
Softmax
Wt
Wt+1
Image
GRU
Embedding
CNNMultimodal
Softmax
Wt
Wt+1
Image GRU
Embedding
CNNMultimodal
Softmax
Wt
Wt+1
Image
• GRU-DO1 • GRU-DO2
• GRU-DO3 • GRU-DO4
RNN을 이용한 이미지 캡션 생성 (동계학술대회 15)
GRU
Embedding
CNNMultimodal
Softmax
Wt
Wt+1
Image GRU
Embedding
CNNMultimodal
Softmax
Wt
Wt+1
Image GRU
Embedding
CNNMultimodal
Softmax
Wt
Wt+1
Image
Flickr 30K B-1 B-2 B-3 B-4
m-RNN (Baidu)[2] 60.0 41.2 27.8 18.7
DeepVS (Stanford)[5] 57.3 36.9 24.0 15.7
NIC (Google)[4] 66.3 42.3 27.7 18.3
Ours-GRU-DO1 63.01 43.60 29.74 20.14
Ours-GRU-DO2 63.24 44.25 30.45 20.58
Ours-GRU-DO3 62.19 43.23 29.50 19.91
Ours-GRU-DO4 63.03 43.94 30.13 20.21
Flickr 8K B-1 B-2 B-3 B-4
m-RNN (Baidu)[2] 56.5 38.6 25.6 17.0
DeepVS (Stanford)[5] 57.9 38.3 24.5 16.0
NIC (Google)[4] 63.0 41.0 27.0 -
Ours-GRU-DO1 63.12 44.27 29.82 19.34
Ours-GRU-DO2 61.89 43.86 29.99 19.85
Ours-GRU-DO3 62.63 44.16 30.03 19.83
Ours-GRU-DO4 63.14 45.14 31.09 20.94
Flickr30k 실험 결과
A black and white dog is jumping in the grass
A group of people in the snow
Two men are working on a roof
신규 데이터
A large clock tower in front of a building
A man in a field throwing a frisbee
A little boy holding a white frisbee
A man and a woman are playing with a sheep
한국어 이미지 캡션 생성
한 어린 소녀가 풀로 덮인 들판에 서 있다
건물 앞에 서 있는 한 남자
분홍색 개를 데리고 있는 한 여자와 한 여자
구명조끼를 입은 한 작은 소녀가 웃고 있다
GRU
Embedding
CNNMultimodal
Softmax
Wt
Wt+1
Image
Residual Network + 한국어 이미지 캡션 생성 (동계학술대회16)
Visual Question Answering
Facebook: Visual Q&A
Toronto COCO-QA dataset (ArXiv 15)
• QA dataset 자동 생성
– MS COCO (image caption) data set 이용
– Image caption으로부터 parser 등을 이용하여 질문과 정답을 자동 생성 • 정답은 single word라고 가정함 (classification problem)
• Object, Number, Color, Location types의 질문 생성 – Object: 70%, Color: 17%, Number: 7%, Location: 6%
– 123,287 images, 78,736 train questions, 38,948 questions
– Ex. What is there in front of the sofa? • Answer: table
VIS+LSTM model (Arxiv 15)
DPPnet (포항공대)
Stacked Attention Networks for Image QA (ArXiv 16)
Top Related