4thNLPDL

Computing the Meaning of Phrases with Neural Networks. Sho Takase, Inui-Okazaki Laboratory, Tohoku University. [email protected] 2016/06/22 4th NLPDL 1

Transcript of 4thNLPDL

  • [email protected] 2016/06/22 4th NLPDL

    1

  • Speaker introduction (D3: third-year doctoral student)

    2

  • Representing word meaning with vectors, e.g., SGNS, GloVe

    3

    peach, penguin, medicine/drug

  • Distributional hypothesis [Harris, 64]

    SGNS implicitly factorizes a PMI matrix [Levy+ 2014]; GloVe likewise builds on co-occurrence counts

    4

             have  drink  eat  bottle  roast  essay  read  work
    beer       48     72   28      57     30      1     8    11
    wine      108     92   24      86     29      2     2    23
    mutton    309     31  105      13     48      0     0    17
    novelist   12      4    8       0      0    103   186   134
    writer     31      3   10       0      0    132   238    84


  • Computing phrase meaning by composition

    5

    e.g., the vector of "small spoon" from "spoon" and "small": f(v_small, v_spoon)

  • Recurrent NN, LSTM, and additive composition (+)

    6

  • Treat a phrase as a single unit (word) [Mikolov+ 13, Pham+ 15], or compose its vector with a neural encoder:

    Sequence: Recurrent NN; Gated Recurrent Unit (GRU) [Cho+ 14]; Long Short-Term Memory (LSTM) [Hochreiter+ 97]

    Tree: Recursive NN [Socher+ 11]; Matrix-Vector Recursive NN [Socher+ 12]; Recursive Neural Tensor Network [Socher+ 13]; Tree-LSTM [Tai+ 15]

    7

  • Recurrent NN vs. tree-structured models [Tai+ 15]

    Sequence models are simple and run efficiently (e.g., on a GPU)

    8

                            Relatedness          Sentiment
    LSTM Variant             d       |θ|          d       |θ|
    Standard                150   203,400        168   315,840
    Bidirectional           150   203,400        168   315,840
    2-layer                 108   203,472        120   318,720
    Bidirectional 2-layer   108   203,472        120   318,720
    Constituency Tree       142   205,190        150   316,800
    Dependency Tree         150   203,400        168   315,840

    Table 1: Memory dimensions d and composition function parameter counts |θ| for each LSTM variant that we evaluate.

    neutral sentences are excluded). Standard binarized constituency parse trees are provided for each sentence in the dataset, and each node in these trees is annotated with a sentiment label.

    For the sequential LSTM baselines, we predict the sentiment of a phrase using the representation given by the final LSTM hidden state. The sequential LSTM models are trained on the spans corresponding to labeled nodes in the training set.

    We use the classification model described in Sec. 4.1 with both Dependency Tree-LSTMs (Sec. 3.1) and Constituency Tree-LSTMs (Sec. 3.2). The Constituency Tree-LSTMs are structured according to the provided parse trees. For the Dependency Tree-LSTMs, we produce dependency parses³ of each sentence; each node in a tree is given a sentiment label if its span matches a labeled span in the training set.

    5.2 Semantic Relatedness

    For a given pair of sentences, the semantic relatedness task is to predict a human-generated rating of the similarity of the two sentences in meaning.

    We use the Sentences Involving Compositional Knowledge (SICK) dataset (Marelli et al., 2014), consisting of 9927 sentence pairs in a 4500/500/4927 train/dev/test split. The sentences are derived from existing image and video description datasets. Each sentence pair is annotated with a relatedness score y ∈ [1, 5], with 1 indicating that the two sentences are completely unrelated, and 5 indicating that the two sentences are very related. Each label is the average of 10 ratings assigned by different human annotators.

    Here, we use the similarity model described in Sec. 4.2. For the similarity prediction network (Eqs. 15) we use a hidden layer of size 50. We

    ³ Dependency parses produced by the Stanford Neural Network Dependency Parser (Chen and Manning, 2014).

    Method                                  Fine-grained      Binary
    RAE (Socher et al., 2013)                   43.2           82.4
    MV-RNN (Socher et al., 2013)                44.4           82.9
    RNTN (Socher et al., 2013)                  45.7           85.4
    DCNN (Blunsom et al., 2014)                 48.5           86.8
    Paragraph-Vec (Le and Mikolov, 2014)        48.7           87.8
    CNN-non-static (Kim, 2014)                  48.0           87.2
    CNN-multichannel (Kim, 2014)                47.4           88.1
    DRNN (Irsoy and Cardie, 2014)               49.8           86.6
    LSTM                                    46.4 (1.1)     84.9 (0.6)
    Bidirectional LSTM                      49.1 (1.0)     87.5 (0.5)
    2-layer LSTM                            46.0 (1.3)     86.3 (0.6)
    2-layer Bidirectional LSTM              48.5 (1.0)     87.2 (1.0)
    Dependency Tree-LSTM                    48.4 (0.4)     85.7 (0.4)
    Constituency Tree-LSTM
      randomly initialized vectors          43.9 (0.6)     82.0 (0.5)
      Glove vectors, fixed                  49.7 (0.4)     87.5 (0.8)
      Glove vectors, tuned                  51.0 (0.5)     88.0 (0.3)

    Table 2: Test set accuracies on the Stanford Sentiment Treebank. For our experiments, we report mean accuracies over 5 runs (standard deviations in parentheses). Fine-grained: 5-class sentiment classification. Binary: positive/negative sentiment classification.

    produce binarized constituency parses⁴ and dependency parses of the sentences in the dataset for our Constituency Tree-LSTM and Dependency Tree-LSTM models.

    5.3 Hyperparameters and Training Details

    The hyperparameters for our models were tuned on the development set for each task.

    We initialized our word representations using publicly available 300-dimensional Glove vectors⁵ (Pennington et al., 2014). For the sentiment classification task, word representations were updated during training with a learning rate of 0.1. For the semantic relatedness task, word representations were held fixed as we did not observe any significant improvement when the representations were tuned.

    Our models were trained using AdaGrad (Duchi et al., 2011) with a learning rate of 0.05 and a minibatch size of 25. The model parameters were regularized with a per-minibatch L2 regularization strength of 10⁻⁴. The sentiment classifier was additionally regularized using dropout (Srivastava et al., 2014) with a dropout rate of 0.5. We did not observe performance gains using dropout on the semantic relatedness task.

    ⁴ Constituency parses produced by the Stanford PCFG Parser (Klein and Manning, 2003).

    ⁵ Trained on 840 billion tokens of Common Crawl data, http://nlp.stanford.edu/projects/glove/.



    Stanford Sentiment Treebank [Tai+ 15]

  • Sequence vs. tree models on four tasks [Li+ 2015]

    Bi-directional LSTM > Tree-LSTM: sentiment analysis, QA, discourse parsing

    Tree-LSTM > Bi-directional LSTM: relation classification (where the SOTA also combines dependency structure with a CNN)

    9

  • Treat a phrase as a single unit (word) [Mikolov+ 13, Pham+ 15], or compose its vector with a neural encoder:

    Sequence: Recurrent NN; Gated Recurrent Unit (GRU) [Cho+ 14]; Long Short-Term Memory (LSTM) [Hochreiter+ 97]

    Tree: Recursive NN [Socher+ 11]; Matrix-Vector Recursive NN [Socher+ 12]; Recursive Neural Tensor Network [Socher+ 13]; Tree-LSTM [Tai+ 15]

    10

  • v_king + v_woman − v_man ≈ v_queen

    [Muraoka+ 14]

    11
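    To make the analogy concrete, here is a minimal sketch (assuming a hypothetical `embeddings` dict mapping words to unit-normalized numpy vectors; the names are illustrative, not from [Muraoka+ 14]):

```python
import numpy as np

# Toy illustration of v_king + v_woman - v_man ~ v_queen.
# `embeddings` maps words to unit-normalized numpy vectors (assumed given).
def analogy(embeddings, a, minus, plus, topn=1):
    """Return the word(s) closest to v_a - v_minus + v_plus by cosine."""
    target = embeddings[a] - embeddings[minus] + embeddings[plus]
    target /= np.linalg.norm(target)
    scores = {w: float(v @ target)
              for w, v in embeddings.items()
              if w not in (a, minus, plus)}   # exclude the query words
    return sorted(scores, key=scores.get, reverse=True)[:topn]

# analogy(embeddings, "king", "man", "woman") should rank "queen" highly
```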

    smoking increase the risk of lung

  • A relational pattern is the word sequence linking two entities X and Y

    12

    ... a b c d X [pattern] Y e f g h ...

    Goal: a vector for the pattern expressing the relation between X and Y

  • When is additive composition valid? [Tian+ 2016]

    Compare the phrase vector v_XY with the average (v_X + v_Y)/2 of its word vectors

    The gap is measured by the bias B:

    13

    B = ‖ v_XY − (1/2)(v_X + v_Y) ‖
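    A one-line sketch of this bias, assuming v_XY, v_X, v_Y are given as numpy vectors (the function name is ours, not from [Tian+ 2016]):

```python
import numpy as np

# B = || v_XY - (v_X + v_Y) / 2 ||: the gap between the observed phrase
# vector and its additive composition.
def additive_bias(v_xy, v_x, v_y):
    return float(np.linalg.norm(v_xy - (v_x + v_y) / 2.0))
```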

  • Build word vectors from word-context co-occurrence statistics

    The bias B can be bounded when each vector entry has the form

    F(P(context | word)) − a(word) − b(context)

    14

    where F is a concave function such as log or sqrt (e.g., F(x) = x^α with α < 0.5)

  • F(x) = log x, a(y) = 0, b(z) = log P(z) yields PMI

    This covers SGNS, which factorizes PMI [Levy+ 2014], and GloVe, which uses F(x) = log x with learned biases a(y), b(z)

    So vectors trained with SGNS or GloVe fit the framework for some F(x)

    15

  • Limitation [Tian+ 16]: additive composition ignores word order, e.g., help to stop = stop to help

    16

  • Recurrent Neural Network (1/2): at each position t, combine the input vector v(t) with the previous state to obtain the hidden state h(t)

    17

    h(t) = tanh(U v(t) + W h(t−1))

    smoking increase the risk of lung

    U: input matrix; W: recurrent (hidden-to-hidden) matrix
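    A minimal sketch of this recurrence in plain numpy (parameter shapes assumed d × d; the function name is illustrative):

```python
import numpy as np

# h(t) = tanh(U v(t) + W h(t-1)), with the same U, W at every position.
def rnn_encode(inputs, U, W):
    h = np.zeros(U.shape[0])        # h(0) = 0
    for v in inputs:                # inputs: word vectors v(1), ..., v(T)
        h = np.tanh(U @ v + W @ h)
    return h                        # final state as the phrase vector
```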

  • Recurrent Neural Network (2/2)

    A Recurrent NN applies the same U and W at every position, so the RNN treats function words (e.g., the) and content words (e.g., risk) with identical weights

    18

  • From the Recurrent Neural Network to the LSTM

    Gating helps when the sequence length L grows

    19

    smoking increase the risk of lung


    Figure 2: Number of judgments for each similarity rating. The total number of judgments is 27,775 (5,555 pairs × 5 workers).

    quality assessment of the workers. In other words, we discarded the similarity ratings of the Gold examples, and used those judged by the workers.

    To build a high quality dataset, we use judgments from workers whose confidence values (reliability scores) computed by CrowdFlower are greater than 75%. Additionally, we force every pair to have at least five judgments from the workers. Consequently, 60 workers participated in this job. In the final version of this dataset, each pair has five similarity ratings judged by the five most reliable workers who were involved in the pair.

    Figure 2 presents the number of judgments for each similarity rating. Workers seldom rated 7 for a pair of relational patterns, probably because most pairs have at least one difference in content words. The mean of the standard deviations of similarity ratings of all pairs is 1.16. The mean of Spearman's ρ among workers involved in the dataset is 0.728. These statistics show a high inter-annotator agreement of the dataset.

    3 Encoder for Relational Patterns

    As can be inferred from the unavailability of the dataset for relational patterns, we have no well-established method for learning distributed representations of relational patterns. A naïve approach would be to regard a relational pattern as a single unit (word) and to train word/pattern embeddings as usual. In fact, Mikolov et al. (2013) implemented this approach as a preprocessing step, mining phrasal expressions with strong collocations from a training corpus. However, this approach might be affected by data sparseness, which lowers the quality of distributed representations.

    Another simple but effective approach is additive composition (Mitchell and Lapata, 2010), where the distributed representation of a relational pattern is computed as the mean of the embeddings of the constituent words. Presuming that a relational pattern consists of a sequence of T words w_1, ..., w_T, we let x_t ∈ R^d be the embedding of the word w_t. This approach computes (1/T) Σ_{t=1}^{T} x_t as the embedding of the relational pattern. Muraoka et al. (2014) reported that the additive composition is a strong baseline among various methods built upon neural networks.
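    As a reference point, additive composition is only a few lines; a sketch, with `word_vectors` a hypothetical word-to-vector lookup:

```python
import numpy as np

# Mean of the constituent word embeddings: (1/T) * sum_t x_t.
def additive_composition(words, word_vectors):
    return np.mean([word_vectors[w] for w in words], axis=0)

# e.g. additive_composition(["have", "access", "to"], word_vectors)
```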

    Having said that, additive composition is inade-quate for some relational patterns because it treatsevery word in a relational pattern equally. It is nat-ural to expect that content words influence the dis-tributed representation of a relational pattern morethan functional words. In addition, adding the em-bedding of the word have might not be useful forX have access to Y, which has mostly the samemeaning as X access Y. Therefore, we explore analternative approach that is inspired by a Sequence-to-Sequence model (Sutskever et al., 2014) and anEncoder-Decoder model (Cho et al., 2014). The ideais to compute the embedding of a relational patternusing a function, F (x1, ..., xT ), where F (.) is mod-eled by a variant of recurrent neural network (RNN).

    3.1 Baseline: Long Short-Term Memory

    Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997) is a variant of RNN that has been applied successfully to various NLP tasks including word segmentation (Chen et al., 2015), dependency parsing (Dyer et al., 2015), machine translation (Sutskever et al., 2014), and sentiment analysis (Tai et al., 2015). LSTM computes the input gate i_t ∈ R^d, forget gate f_t ∈ R^d, output gate o_t ∈ R^d, memory cell c_t ∈ R^d, and hidden state h_t ∈ R^d for a given embedding x_t at position t⁵.

    i_t = σ(W_ix x_t + W_ih h_{t−1})    (1)
    f_t = σ(W_fx x_t + W_fh h_{t−1})    (2)
    o_t = σ(W_ox x_t + W_oh h_{t−1})    (3)
    c_t = f_t ⊙ c_{t−1} + i_t ⊙ g(W_cx x_t + W_ch h_{t−1})    (4)
    h_t = o_t ⊙ g(c_t)    (5)

    ⁵ We omitted peephole connections and bias terms in this study. We set the number of dimensions of hidden states identical to that of word embeddings (d) so that we can adapt the objective function of the Skip-gram model (Section 3.3).
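    A sketch of Equations 1-5 in plain numpy (no peepholes or biases, hidden size d equal to the embedding size, per the footnote; this is our paraphrase of the equations, not the paper's code):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_encode(xs, Wix, Wih, Wfx, Wfh, Wox, Woh, Wcx, Wch, g=np.tanh):
    d = Wix.shape[0]
    c, h = np.zeros(d), np.zeros(d)               # c_0 = h_0 = 0
    for x in xs:                                  # x_1, ..., x_T
        i = sigmoid(Wix @ x + Wih @ h)            # (1) input gate
        f = sigmoid(Wfx @ x + Wfh @ h)            # (2) forget gate
        o = sigmoid(Wox @ x + Woh @ h)            # (3) output gate
        c = f * c + i * g(Wcx @ x + Wch @ h)      # (4) memory cell
        h = o * g(c)                              # (5) hidden state
    return h                                      # h_T: pattern vector
```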

  • Gated Additive Composition (GAC) [Takase+ 16]

    Additive composition extended with input and forget gates

    The gates can suppress function words (the, of) and emphasize content words (increase, risk)

    20

    smoking increase the risk of lung


    [Figure 3 (schematic): the sentence "passive smoking increases the risk of lung cancer", with a relational pattern of length T = L = 4, context windows of width δ = 2 on each side, hidden vectors composed by Gated Additive Composition (GAC) into the pattern vector, and parameter updates by the Skip-gram model over the context words (word vectors and context vectors).]

    Figure 3: Overview of the proposed method. The proposed method computes the distributed representation of a relational pattern using the input gate and forget gate, and learns parameters by predicting surrounding words (Skip-gram model).

    In these equations, W_ix, W_ih, W_fx, W_fh, W_ox, W_oh, W_cx, W_ch are d × d matrices (parameters), σ(.) is the elementwise sigmoid function, g(.) is the elementwise activation function (tanh), and the operator ⊙ presents elementwise multiplication. We set c_0 = 0 and h_0 = 0 at t = 1. In essence, LSTM computes the hidden state h_t and the memory cell c_t based on those at the previous position (h_{t−1} and c_{t−1}) and the word embedding x_t. Applying these equations from t = 1 to T, we use h_T as the distributed representation of the relational pattern.

    3.2 Proposal: Gated Additive Composition

    Although LSTM is successful for various NLP tasks, its expressive power might be overly strong for handling short phrases of relational patterns. Furthermore, LSTM is often criticized as having numerous components for which the purpose is not immediately apparent (Jozefowicz et al., 2015). We are unsure whether LSTM is the optimal architecture for modeling relational patterns.

    For this reason, we simplified the LSTM architecture as follows. We removed the memory cell by replacing c_t with the hidden state h_t, because the problem of exponential error decay (Hochreiter et al., 2001) might not be prominent for relational patterns. We also removed the matrices corresponding to W_hh and W_hx because most relational patterns hold additive composition. This simplification yields the architecture defined by Equations 6–8.

    i_t = σ(W_ix x_t + W_ih h_{t−1})    (6)
    f_t = σ(W_fx x_t + W_fh h_{t−1})    (7)
    h_t = g(f_t ⊙ h_{t−1} + i_t ⊙ x_t)    (8)

    Here, W_ix, W_ih, W_fx, W_fh are d × d matrices. The input and forget gates (Equations 6 and 7) are identical to those in LSTM (Equations 1 and 2). However, Equation 8 is better explained as a weighted additive composition of the vector of the current word x_t and the vector of the previous hidden state h_{t−1}. The elementwise weights are controlled by the input gate i_t and forget gate f_t; we expect that input gates are closed (close to zero) and forget gates are opened (close to one) when the current word is a control verb or preposition. We name this architecture gated additive composition (GAC).
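    The same style of sketch for Equations 6-8 (again our paraphrase, assuming h_0 = 0 as in the LSTM above):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# GAC keeps the LSTM input/forget gates but composes x_t and h_{t-1}
# directly: no memory cell, no candidate transformation.
def gac_encode(xs, Wix, Wih, Wfx, Wfh, g=np.tanh):
    h = np.zeros(Wix.shape[0])                    # h_0 = 0
    for x in xs:
        i = sigmoid(Wix @ x + Wih @ h)            # (6) input gate
        f = sigmoid(Wfx @ x + Wfh @ h)            # (7) forget gate
        h = g(f * h + i * x)                      # (8) gated additive composition
    return h
```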

    3.3 Parameter estimation: Skip-gram model

    As explained in Section 1, we explore the OpenIE approach, which relies neither on an existing KB nor on supervision data for relation classification. Therefore, we adapt the Skip-gram model (Mikolov et al., 2013) to train the parameters in LSTM and GAC on an unlabeled text corpus.

    Formally, we designate an occurrence of a relational pattern p as a subsequence of L words w_s, ..., w_{s+L−1} in a corpus. We define the words appearing before and after pattern p as the context words C_p = (w_{s−δ}, ..., w_{s−1}, w_{s+L}, ..., w_{s+L+δ−1}). We define the log-likelihood l_p of the relational pattern, following the objective function of Skip-gram with negative sampling (SGNS) (Levy and Goldberg, 2014).

    l_p = Σ_{w̃ ∈ C_p} [ log σ(h_p · x̃_{w̃}) + Σ_{k=1}^{K} log σ(−h_p · x̃_{z_k}) ]    (9)

    In this formula, K denotes the number of negative samples; h_p ∈ R^d is the vector of the relational pattern p computed by LSTM or GAC; x̃_{w̃} ∈ R^d is the context vector of the word w̃⁶; and x̃_{z_k} ∈ R^d is the context vector of a word z_k that was sampled from the noise distribution.

    ⁶ The Skip-gram model has two kinds of vectors, x_t and x̃_t, assigned to a word w_t. Equation 2 of the original paper (Mikolov et al., 2013) denotes x_t (word vector) as v (input vector) and x̃_t (context vector) as v′ (output vector). The word2vec implementation does not write context (output) vectors but only word (input) vectors to a model file. Therefore, we modified the source code to save context vectors, and use them in Equation 9. This modification ensures the consistency of the entire model.
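    A sketch of Equation 9 for a single pattern occurrence (`neg_sampler` is a hypothetical stand-in for drawing K context vectors from the noise distribution):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# h_p: pattern vector from LSTM/GAC; ctx_vecs: context vectors of C_p.
def pattern_log_likelihood(h_p, ctx_vecs, neg_sampler, K=5):
    lp = 0.0
    for x_ctx in ctx_vecs:
        lp += np.log(sigmoid(h_p @ x_ctx))        # positive context word
        for x_neg in neg_sampler(K):              # K negative samples
            lp += np.log(sigmoid(-h_p @ x_neg))
    return lp
```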

  • Recurrent NN, LSTM, and additive composition (+)

    21

  • Experiment 1: relational pattern similarity

    Patterns extracted by Reverb [Fader+ 12] from ukWaC (2 billion words)

    22

    Cephalexin reduce the risk of the bacteria

    Cephalexin prevent the bacteria: both express Inhibit(Cephalexin, bacteria)

  • Skip-gram with negative sampling (SGNS) [Mikolov+ 13]

    23

    smoking increase the risk of lung

    log P(w_{t+j} | w_t) ≈ log σ(v_{w_t} · u_{w_{t+j}}) + Σ_{i=1}^{k} E_{z∼P_n(w)} [ log σ(−v_{w_t} · u_z) ]

    z: a negative sample drawn from the noise distribution P_n(w); k: the number of negative samples

    v_{w_t}: word vector of the center word; u_{w_{t+1}}, ..., u_{w_1}: context vectors of the surrounding words

  • Evaluation data: 5,555 relational pattern pairs with human similarity ratings (ACL 2016)

    24

    Pattern 1        Pattern 2                Mean (1–7)   SD
    inhibit          prevent the growth of        4.2      0.7
    be the part of   be an essential part of      5.6      0.8
    be open from     close at                     1.6      0.5

  • Results on the pattern similarity task (Figure 4)

    25


    At every occurrence of a relational pattern in the corpus, we use Stochastic Gradient Descent (SGD) and backpropagation through time (BPTT) for training the parameters (matrices) in encoders. More specifically, we initialize the word vectors x_t and context vectors x̃_t and fix them during training. At every occurrence of a relational pattern, we compute gradients for Equation 9 to update the parameters in encoders. In this way, each encoder is trained to compose a vector of a relational pattern so that it can predict the surrounding context words. An advantage of this parameter estimation is that the distributed representations of words and relational patterns stay in the same vector space. Figure 3 visualizes the training process for GAC.

    4 Experiments

    In Section 4.1, we investigate the performance of the distributed representations computed by different encoders on the pattern similarity task. Section 4.2 examines the contribution of the distributed representations on SemEval 2010 Task 8, and discusses the usefulness of the new dataset to predict successes of the relation classification task.

    4.1 Relational pattern similarity

    For every pair in the dataset built in Section 2, we compose the vectors of the two relational patterns using an encoder described in Section 3, and compute the cosine similarity of the two vectors. Repeating this process for all pairs in the dataset, we measure Spearman's ρ between the similarity values computed by the encoder and the similarity ratings assigned by humans.
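    The evaluation loop is essentially cosine similarity plus Spearman's ρ; a sketch, with `encode` standing in for any encoder from Section 3:

```python
import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# pairs: list of (pattern1, pattern2); human_ratings: gold similarity scores.
def evaluate(pairs, human_ratings, encode):
    predicted = [cosine(encode(p1), encode(p2)) for p1, p2 in pairs]
    rho, _ = spearmanr(predicted, human_ratings)
    return rho
```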

    4.1.1 Training procedure

    We used ukWaC⁹ as the training corpus for the encoders. This corpus includes the text of 2 billion words from Web pages crawled in the .uk domain. Part-of-speech tags and lemmas are annotated by TreeTagger¹⁰. We used lowercased lemmas throughout the experiments. We apply word2vec to this corpus to train word vectors x_t and context vectors x̃_t. All encoders use word vectors x_t to compose vectors of relational patterns; and the Skip-gram model uses context vectors x̃_t to compute the objective function and gradients.

    We applied Reverb (Fader et al., 2011) to the ukWaC corpus to extract relational pattern candidates.

    ⁹ http://wacky.sslmit.unibo.it
    ¹⁰ http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/

    Figure 4: Performance of each method on the relational pattern similarity task with variation in the number of dimensions.

    To remove unuseful relational patterns, we applied filtering rules that are compatible with those used in the publicly available extraction result¹¹. Additionally, we discarded relational patterns appearing in the evaluation dataset throughout the experiments to assess the performance under which an encoder composes vectors of unseen relational patterns. This preprocessing yielded 127,677 relational patterns.

    All encoders were implemented on Chainer¹², a flexible framework of neural networks. The hyperparameters of the Skip-gram model are identical to those in Mikolov et al. (2013): the width of the context window δ = 5, the number of negative samples K = 5, and subsampling of 10⁻⁵. For each encoder that requires training, we tried 0.025, 0.0025, and 0.00025 as the initial learning rate, and selected the best value for the encoder. In contrast to the presentation of Section 3, we compose a pattern vector in backward order (from the last word to the first) because preliminary experiments showed a slight improvement with this treatment.

    4.1.2 Results and discussions

    Figure 4 shows Spearman's rank correlations of different encoders when the number of dimensions of vectors is 100–500. The figure shows that GAC achieves the best performance on all dimensions.

    Figure 4 includes the performance of the naïve approach, NoComp, which regards a relational pattern as a single unit (word). In this approach, we allocated a vector h_p for each relational pattern p in Equation 9 instead of the vector composition, and trained the vectors of relational patterns using the Skip-gram model. The performance was poor for two reasons: we were unable to compute

    ¹¹ http://reverb.cs.washington.edu/
    ¹² http://chainer.org/

    Methods compared in Figure 4: GAC, RNN, GRU, Add, LSTM


  • Add is strong on short patterns, RNN variants on long ones; GAC outperforms Add on short patterns

    GAC (Add + gates) falls behind the RNN variants on patterns of length ≥ 4

    26


    Length      #   NoComp    Add   LSTM    GRU    RNN    GAC
    1         636    0.324  0.324  0.324  0.324  0.324  0.324
    2       1,018    0.215  0.319  0.257  0.274  0.285  0.321
    3       2,272    0.234  0.386  0.344  0.370  0.387  0.404
    4       1,206    0.208  0.306  0.314  0.329  0.319  0.323
    ≥ 5       423    0.278  0.315  0.369  0.384  0.394  0.357
    All     5,555    0.215  0.340  0.336  0.356  0.362  0.370

    Table 1: Spearman's rank correlations on different pattern lengths (number of dimensions d = 500).

    Gate state                  w_t w_{t+1} w_{t+2} ...
    large i_t (input open)      reimburse for; payable in; liable to
    small i_t (input close)     a charter member of; a valuable member of; be an avid reader of
    large f_t (forget open)     be eligible to participate in; be require to submit; be request to submit
    small f_t (forget close)    coauthor of; capital of; center of

    Table 2: Prominent moments for input/forget gates.

    similarity values for 1,744 pairs because relational patterns in these pairs do not appear in ukWaC; and relational patterns could not obtain sufficient statistics because of data sparseness.

    Table 1 reports Spearman's rank correlations computed for each pattern length. Here, the length of a relational-pattern pair is defined by the maximum of the lengths of the two patterns in the pair. At length 1, all methods achieve the same correlation score because they use the same word vector x_t. The table shows that additive composition (Add) performs well for shorter relational patterns (lengths of 2 and 3) but poorly for longer ones (lengths of 4 and 5+). GAC exhibits a similar tendency to Add, but it outperforms Add for shorter patterns (lengths of 2 and 3), probably because of the adaptive control of input and forget gates. In contrast, RNN and its variants (RNN, GRU, and LSTM) enjoy the advantage on longer patterns (lengths of 4 and 5+).

    To examine the roles of input and forget gates of GAC, we visualize the moments when input/forget gates are wide open or closed. More precisely, we extract the input word and scanned words when ‖i_t‖² or ‖f_t‖² is small (close to zero) or large (close to one) on the relational-pattern dataset. We restate that we compose a pattern vector in backward order (from the last word to the first): GAC scans "of", "author", and "be" in this order when composing the vector of the relational pattern "be author of".

    Table 2 displays the top three examples identified using this procedure. The table shows two groups of tendencies. Input gates open and forget gates close when the scanned words are only a preposition and the current word is a content word. In these situations, GAC tries to read the semantic vector of the content word and to ignore the semantic vector of the preposition. In contrast, input gates close and forget gates open when the current word is "be" or "a" and the scanned words form a noun phrase (e.g., "charter member of"), a complement (e.g., "eligible to participate in"), or a passive voice (e.g., "require(d) to submit"). This behavior is also reasonable because GAC emphasizes informative words more than functional words.

    4.2 Relation classification

    4.2.1 Experimental settings

    To examine the usefulness of the dataset and distributed representations for a different application, we address the task of relation classification on the SemEval 2010 Task 8 dataset (Hendrickx et al., 2010). In other words, we explore whether high-quality distributed representations of relational patterns are effective to identify the relation type of an entity pair.

    The dataset consists of 10,717 relation instances (8,000 training and 2,717 test instances) with their relation types annotated. The dataset defines 9 directed relations (e.g., CAUSE-EFFECT) and 1 undirected relation, OTHER. Given a pair of entity mentions, the task is to identify the relation type among 19 candidate labels (2 × 9 directed + 1 undirected relations). For example, given the pair of entity mentions e1 = burst and e2 =

  • Analysis of the GAC gates (Table 2)

    27


  • Analysis of the learned RNN matrices U and W

    28

  • Experiment 2: bigram similarity (adjective-noun (JN), noun-noun (NN), verb-noun (VN))

    29

  • Training on bigram paraphrase pairs from PPDB

    JN: 133,998 pairs; NN: 35,602 pairs; VN: 62,651 pairs

    30

    e.g., novel method ↔ new approach

    composed into vectors v1 and v2, trained with the max-margin loss (see the sketch below)

    max(0, 1 − v1 · v2 + v1 · n1) + max(0, 1 − v1 · v2 + v2 · n2)

    where n1 and n2 are negative examples for v1 and v2
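    A sketch of that loss for one training pair (our reading of the slide's formula; v1, v2 are the composed bigram vectors and n1, n2 their negative examples):

```python
import numpy as np

# max(0, 1 - v1.v2 + v1.n1) + max(0, 1 - v1.v2 + v2.n2)
def hinge_loss(v1, v2, n1, n2, margin=1.0):
    pos = float(v1 @ v2)
    return (max(0.0, margin - pos + float(v1 @ n1)) +
            max(0.0, margin - pos + float(v2 @ n2)))
```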

  • Evaluation setup of [Wieting+ 15]

    Bigram similarity dataset of [Mitchell+ 10]: 108 pairs each for JN, NN, and VN

    31

    Example bigram pairs with human ratings:

    Adj-Noun 1    Adj-Noun 2       Mean  SD
    vast amount   large quantity    5.0  0.9
    small house   little room       2.0  0.6
    better job    good place        3.0  0.6

  • GRU, Recursive NN, and GAC score well on average

    32

                                  JN    NN    VN   Average
    Add                          0.50  0.29  0.58   0.46
    Recursive NN [Wieting+ 15]   0.57  0.44  0.55   0.52
    Recurrent NN                 0.58  0.43  0.46   0.49
    GRU                          0.62  0.40  0.53   0.53
    LSTM                         0.57  0.44  0.49   0.49
    CNN                          0.58  0.48  0.50   0.50
    GAC                          0.56  0.43  0.52   0.52
    Human                        0.87  0.64  0.73   0.75

  • Experiment 3: PPDB phrase-pair similarity (5-point scale)

    Trained on 60,000 bigrams from PPDB; evaluated on 1,000 PPDB pairs

    33

  • GAC and CNN outperform LSTM and GRU

    34

                                 Spearman's rank correlation
    Add                              0.32
    Recursive NN [Wieting+ 15]       0.40
    Recurrent NN                     0.25
    GRU                              0.33
    LSTM                             0.32
    CNN                              0.45
    GAC                              0.47

  • Summary: Recurrent NN encoders trained with the SGNS objective

    35

    (with an analysis of the learned matrices U and W)

  • Conclusion: Recurrent NN variants and GAC (additive composition + gates) for computing phrase meaning

    36