Get To The Point: Summarization with Pointer-Generator Networks (ACL17) - Paper Introduction


Transcript of Get To The Point: Summarization with Pointer-Generator Networks (ACL17) - Paper Introduction

  1. 1. 2017.06.26 NAIST, D1 Masayoshi Kondo - About Neural Summarization @ 2017. Get To The Point: Summarization with Pointer-Generator Networks (ACL17). Abigail See (Stanford University), Peter J. Liu (Google Brain), Christopher D. Manning (Stanford University)
  2. 2. 00: Overview. Task: abstractive summarization with a neural network (in: news article, out: multi-sentence summary). Model: Seq2Seq with a bi-directional RNN encoder and an RNN decoder, extended with a pointer mechanism (built on the attention mechanism) and a coverage mechanism. Evaluation: multi-sentence summarization on CNN/Daily Mail, measured with the ROUGE score. In short: plain seq2seq-attention models reproduce factual details inaccurately and tend to repeat themselves (repetition); pointing copies words from the source while generation produces novel words, giving the pointer-generator network; coverage keeps track of what has already been summarized and suppresses repetition; on CNN/Daily Mail the model outperforms the previous abstractive state of the art by at least 2 ROUGE points.
  3. 3. 1. Introduction 2. Our Models 3. Related Work 4. Dataset 5. Experiments 6. Results 7. Discussion 8. Conclusion
  4. 4. 1. Introduction 2. Our Models 3. Related Work 4. Dataset 5. Experiments 6. Results 7. Discussion 8. Conclusion
  5. 5. 00: Introduction. Text Summarization. Extractive Summarization: selects sentences or passages directly from the source document. Abstractive Summarization: a neural network generates the summary, producing new words and possibly copying some words from the source. [Diagram: source (Src) and target (Trg) documents illustrating extraction versus abstraction.]
  6. 6. 00: Introduction. Known problems of abstractive summarization: undesirable behavior such as inaccurately reproducing factual details; an inability to deal with out-of-vocabulary (OOV) words; repeating themselves. Task settings by summary length and input: short text (1 or 2 sentences, e.g. headline generation from a single document) versus long text (more than 3 sentences); single document versus multiple documents (e.g. opinion mining). This work targets long-text, single-document summarization.
  7. 7. 00: Introduction. Contributions: the Pointer-Generator Network, which can copy words from the source text; the Coverage Mechanism, which suppresses repetition; improved ROUGE score on the CNN/Daily Mail Dataset (English news articles).
  8. 8. 00: Introduction [Model diagram, step 1: an encoder (Bi-LSTM) reads the input sequence, and a decoder (RNN) predicts a vocabulary distribution from the context vector.]
  9. 9. 00: Introduction [Model diagram, step 2: an attention distribution over the input sequence is added; the context vector is its weighted sum of encoder states.]
  10. 10. 00: Introduction [Model diagram, step 3: the generation probability pgen is computed from the context vector and the decoder state.]
  11. 11. 00: Introduction [Model diagram, step 4: the final predicted vocabulary distribution mixes the vocabulary distribution (weight pgen) with the attention distribution (weight 1 - pgen).]
  12. 12. 00: Introduction [Model diagram, step 5: through the 1 - pgen branch, words can be copied directly from the source via the attention distribution.]
  13. 13. 1. Introduction 2. Our Models 3. Related Work 4. Dataset 5. Experiments 6. Results 7. Discussion 8. Conclusion
  14. 14. 00: Our Models. 2.1 Sequence-to-sequence attention model. [Encoder] / [Decoder]. Encoder hidden state: $h_i$; decoder hidden state: $s_t$; context vector: $h_t^*$.
$e_i^t = v^\top \tanh(W_h h_i + W_s s_t + b_{attn})$, $a^t = \mathrm{softmax}(e^t)$, $h_t^* = \sum_i a_i^t h_i$.
References: Neural machine translation by jointly learning to align and translate [Bahdanau+, ICLR15]; Abstractive text summarization using sequence-to-sequence RNNs and beyond [R. Nallapati et al., CoNLL16].
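As a concrete illustration of the attention equations above, here is a minimal PyTorch sketch (my own illustration, not the authors' code); the tensor sizes and random parameters are arbitrary assumptions.

    import torch

    T_src, enc_dim, dec_dim, attn_dim = 6, 512, 256, 256
    enc_states = torch.randn(T_src, enc_dim)        # h_i: bi-LSTM encoder outputs
    dec_state  = torch.randn(dec_dim)               # s_t: decoder state at step t

    W_h = torch.randn(attn_dim, enc_dim) * 0.1      # learnable parameters
    W_s = torch.randn(attn_dim, dec_dim) * 0.1
    v      = torch.randn(attn_dim) * 0.1
    b_attn = torch.zeros(attn_dim)

    # e_i^t = v^T tanh(W_h h_i + W_s s_t + b_attn)
    e = torch.tanh(enc_states @ W_h.T + dec_state @ W_s.T + b_attn) @ v   # [T_src]
    a = torch.softmax(e, dim=0)                     # attention distribution a^t
    context = a @ enc_states                        # h_t^* = sum_i a_i^t h_i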
  15. 15. 00: Our Models. 2.2 Pointer-generator network. Attention distribution, vocabulary distribution, and context vector as in 2.1; additionally a generation probability $p_{gen}$ weights generation against copying:
$p_{gen} = \sigma(w_{h^*}^\top h_t^* + w_s^\top s_t + w_x^\top x_t + b_{ptr})$
$P(w) = p_{gen} P_{vocab}(w) + (1 - p_{gen}) \sum_{i: w_i = w} a_i^t$
Final probability distribution: $P(w)$; context vector: $h_t^*$; decoder state: $s_t$; decoder input: $x_t$; vector parameters: $w_{h^*}, w_s, w_x$.
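A minimal sketch of how the final distribution P(w) can be assembled over an extended vocabulary (the fixed vocabulary plus in-article OOV words). The vocabulary size, the source-token ids, and the scalar fed to the sigmoid are illustrative assumptions, not the authors' implementation.

    import torch

    vocab_size, n_src_oov, T_src = 10, 2, 6
    p_vocab = torch.softmax(torch.randn(vocab_size), dim=0)   # P_vocab(w)
    attn    = torch.softmax(torch.randn(T_src), dim=0)        # a^t over source tokens
    # extended-vocabulary id of each source token (in-article OOVs get ids >= vocab_size);
    # positions 0 and 3 share id 3, so their attention mass is summed (sum over i: w_i = w)
    src_ext_ids = torch.tensor([3, 7, 10, 3, 11, 5])

    # p_gen = sigmoid(w_h*^T h_t^* + w_s^T s_t + w_x^T x_t + b_ptr); here just a dummy scalar
    p_gen = torch.sigmoid(torch.tensor(0.4))

    # P(w) = p_gen * P_vocab(w) + (1 - p_gen) * sum_{i: w_i = w} a_i^t
    p_final = torch.zeros(vocab_size + n_src_oov)
    p_final[:vocab_size] = p_gen * p_vocab
    p_final.index_add_(0, src_ext_ids, (1 - p_gen) * attn)    # scatter copy probabilities
    assert torch.isclose(p_final.sum(), torch.tensor(1.0))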
  16. 16. 00: Our Models. 2.3 Coverage mechanism. Coverage vector: $c^t = \sum_{t'=0}^{t-1} a^{t'}$, i.e. the sum of the attention distributions over decoder timesteps $1, 2, \dots, t-1$. $c^t$ is a (unnormalized) distribution over the source document words.
  17. 17. 00: Our Models. 2.3 Coverage mechanism. The coverage vector is fed into the attention, and a coverage loss is added:
$e_i^t = v^\top \tanh(W_h h_i + W_s s_t + w_c c_i^t + b_{attn})$
Coverage loss: $\mathrm{covloss}_t = \sum_i \min(a_i^t, c_i^t)$
Total loss: $\mathrm{loss}_t = -\log P(w_t^*) + \lambda \sum_i \min(a_i^t, c_i^t)$
At decoder step $t$, the attention over encoder position $i$ is informed (via $c_i^t$) of how much that position has already been attended to, and the loss penalizes the overlap $\min(a_i^t, c_i^t)$ between the new attention and the accumulated coverage. Repeatedly attending to the same encoder positions, which is the cause of repetition in the output, is therefore discouraged during backpropagation.
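A minimal sketch of the coverage bookkeeping and the coverage loss (again my own illustration; the attention distributions and the NLL term are dummy stand-ins, and the $w_c c_i^t$ term inside the attention energies is omitted for brevity).

    import torch

    T_src, lam = 6, 1.0
    coverage = torch.zeros(T_src)            # c^t = sum_{t' < t} a^{t'}

    def coverage_loss(attn, coverage):
        # covloss_t = sum_i min(a_i^t, c_i^t)
        return torch.minimum(attn, coverage).sum()

    for step in range(3):                    # a few decoder steps with dummy attention
        attn = torch.softmax(torch.randn(T_src), dim=0)
        nll = torch.tensor(2.3)              # stand-in for -log P(w_t^*)
        loss_t = nll + lam * coverage_loss(attn, coverage)
        coverage = coverage + attn           # update the coverage for the next step
        print(step, float(loss_t))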
  18. 18. 1. Introduction 2. Our Models 3. Related Work 4. Dataset 5. Experiments 6. Results 7. Discussion 8. Conclusion
  19. 19. 1. Introduction 2. Our Models 3. Related Work 4. Dataset 5. Experiments 6. Results 7. Discussion 8. Conclusion
  20. 20. 00: Dataset. CNN/Daily Mail Dataset: online news articles.
Source (article): 781 tokens on average; vocabulary size 150k.
Target (summary): 3.75 sentences / 56 tokens on average; vocabulary size 60k.
Settings: used the scripts of Nallapati et al. (2016) for pre-processing; used the original text (the non-anonymized version of the data).
Dataset size: train 287,226 / validation 13,368 / test 11,496 pairs.
  21. 21. 1. Introduction 2. Our Models 3. Related Work 4. Dataset 5. Experiments 6. Results 7. Discussion 8. Conclusion
  22. 22. 00: Experiments. Model details: hidden layer 256 dims; word embeddings 128 dims, learned from scratch (no pre-training); 2 vocabulary settings: large (src 150k / trg 60k) and small (src 50k / trg 50k).
Setting details: optimizer Adagrad; initial learning rate 0.15; initial accumulator value 0.1; no regularization terms; max gradient-clipping norm 2; early stopping on the validation set; batch size 16; beam size 4 (at test time).
Truncation: training: src 400 tokens / trg 100 tokens; test: src 400 tokens / trg 120 tokens.
Environment: a single Tesla K40m GPU.
Evaluation: ROUGE scores (F1) and METEOR scores.
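To make the optimization settings above explicit, here is a hedged PyTorch sketch; the authors' released code is in TensorFlow, and the nn.LSTM below is only a stand-in for the real model.

    import torch
    import torch.nn as nn

    model = nn.LSTM(input_size=128, hidden_size=256)      # stand-in for the summarization model

    optimizer = torch.optim.Adagrad(
        model.parameters(),
        lr=0.15,                          # initial learning rate
        initial_accumulator_value=0.1,    # initial accumulator value
    )

    # one illustrative update with gradient clipping (max gradient norm 2)
    x = torch.randn(400, 1, 128)          # a source sequence truncated to 400 tokens
    out, _ = model(x)
    loss = out.pow(2).mean()              # dummy loss
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=2.0)
    optimizer.step()
    optimizer.zero_grad()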
  23. 23. 00: Experiments. Training time (computational cost). Baseline model: 600,000 iterations (33 epochs); 4 days 14 hours with the 50k vocabulary, 8 days 21 hours with the 150k vocabulary. Proposed (pointer-generator) model: 230,000 iterations (12.8 epochs); about 3 days 4 hours.
Other settings: coverage loss weight λ = 1; the coverage mechanism is added only for a further ~3000 iterations at the end of training (as sketched below).
Inspection: with λ = 2 the coverage loss overwhelmed the primary loss; adding the coverage mechanism without the coverage loss did not reduce the repetition in the attention.
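A sketch of the two-phase schedule described above: train the pointer-generator without coverage first, then enable the coverage loss with weight lambda = 1 for a short final phase. The model.nll and model.coverage_loss methods and the batches iterable are hypothetical placeholders, not part of the paper's code.

    def train(model, batches, use_coverage, cov_weight, n_iters):
        # model.nll / model.coverage_loss are hypothetical methods standing in for
        # the negative log-likelihood and the coverage loss of a real model
        for _, batch in zip(range(n_iters), batches):
            loss = model.nll(batch)
            if use_coverage:
                loss = loss + cov_weight * model.coverage_loss(batch)
            loss.backward()
            # ... gradient clipping and Adagrad step as in the previous sketch ...

    # Phase 1: pointer-generator without coverage, e.g.
    #   train(model, batches, use_coverage=False, cov_weight=0.0, n_iters=230_000)
    # Phase 2: add coverage (lambda = 1) for a further ~3000 iterations, e.g.
    #   train(model, batches, use_coverage=True, cov_weight=1.0, n_iters=3_000)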
  24. 24. 1. Introduction 2. Our Models 3. Related Work 4. Dataset 5. Experiments 6. Results 7. Discussion 8. Conclusion
  25. 25. 00: Results [Results table: ROUGE-1 / ROUGE-2 / ROUGE-L for the baselines and the proposed models.] The lead-3 baseline simply takes the first three sentences of the source article. The scores of Nallapati et al. were obtained on the anonymized version of the data, so they are not directly comparable with lead-3 and the proposed models, which use the original text.
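For reference, ROUGE F1 scores of the kind shown in the table can be computed with the third-party rouge-score package as below; the strings are made up, and the paper itself reports scores obtained with the official ROUGE evaluation (the pyrouge package).

    # pip install rouge-score
    from rouge_score import rouge_scorer

    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
    reference = "police arrested three men after a robbery in central london ."
    candidate = "three men were arrested after a london robbery ."
    scores = scorer.score(reference, candidate)       # arguments: (target, prediction)
    for name, s in scores.items():
        print(name, round(s.fmeasure, 3))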
  26. 26. 00: Results [Fig. 1: example summaries from the compared models.] The baseline (seq2seq-attention) model outputs UNK for out-of-vocabulary (OOV) words and reproduces factual details inaccurately; the pointer-generator handles OOV words by copying them from the source.
  27. 27. 1. Introduction 2. Our Models 3. Related Work 4. Dataset 5. Experiments 6. Results 7. Discussion 8. Conclusion
  28. 28. 00: Discussion. 7.1 Comparison with extractive systems. On ROUGE-1 and ROUGE-2 the abstractive models still fall short of the lead-3 baseline and of extractive systems. News articles tend to put the most important information at the beginning, which also explains why truncating the article to the first 400 tokens (about 20 sentences) gives higher ROUGE than using the first 800 tokens.
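The lead-3 baseline referred to above simply takes the first three sentences of the article as the summary; a minimal sketch with naive sentence splitting (my own simplification):

    import re

    def lead_3(article: str) -> str:
        sentences = re.split(r"(?<=[.!?])\s+", article.strip())
        return " ".join(sentences[:3])

    article = ("The first sentence carries the key facts. The second adds detail. "
               "The third adds more. Later sentences are rarely needed for the summary.")
    print(lead_3(article))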
  29. 29. 00: Discussion. 7.1 Comparison with extractive systems. The high ROUGE scores of lead-3 (and of extractive systems in general) also reflect the nature of ROUGE itself: the choice of content for a summary is fairly subjective, and ROUGE rewards safe strategies such as selecting the first-appearing content and preserving the original phrasing.
  30. 30. 00: Discussion. 7.1 Comparison with extractive systems. The METEOR results show the same tendency: even in full mode, which rewards not only exact word matches but also matches via stemming, synonyms, and paraphrases, lead-3 still scores higher than the abstractive models.
  31. 31. 00: Discussion. 7.1 Comparison with extractive systems. "We believe that investigating this issue further is an important direction for future work." 7.2 How abstractive is our model? "We have shown that our pointer mechanism makes our abstractive system more reliable, copying factual details correctly more often. But does the ease of copying make our system any less abstractive?"
  32. 32. 00: Discussion. 7.2 How abstractive is our model? Abstractiveness is measured by the rate of novel n-grams: the fraction of n-grams in a generated summary that do not appear in the source article, compared against the same rate for the reference summaries.
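A small sketch of the novel n-gram rate described above; whitespace tokenization and the example strings are my own simplifications.

    def novel_ngram_rate(article: str, summary: str, n: int) -> float:
        art, summ = article.lower().split(), summary.lower().split()
        summ_ngrams = [tuple(summ[i:i + n]) for i in range(len(summ) - n + 1)]
        if not summ_ngrams:
            return 0.0
        art_ngrams = {tuple(art[i:i + n]) for i in range(len(art) - n + 1)}
        novel = sum(1 for g in summ_ngrams if g not in art_ngrams)
        return novel / len(summ_ngrams)

    article = "the cat sat on the mat in the warm sun"
    summary = "the cat relaxed on the mat"
    print(novel_ngram_rate(article, summary, 2))   # fraction of novel bigrams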
  33. 33. 00: Discussion. 7.2 How abstractive is our model? [Fig. 5 and Fig. 7: example outputs; e.g. the model composes fragments of the article into patterns such as "X beat Y ... on ...".]
  34. 34. 00: Discussion. 7.2 How abstractive is our model? During training, pgen starts at about 0.30 and rises to about 0.53 by the end of training. At test time, however, pgen is heavily skewed towards copying from the source, with an average value of 0.17.
  35. 35. 1. Introduction 2. Our Models 3. Related Work 4. Dataset 5. Experiments 6. Results 7. Discussion 8. Conclusion
  36. 36. 00: Conclusion. The pointer-generator network with coverage was applied to a long-text abstractive summarization dataset (CNN/Daily Mail). Copying from the source improves the reproduction of factual details and the handling of OOV words, the coverage mechanism suppresses repetition, and the model outperforms the previous abstractive state of the art on ROUGE.
  37. 37. END