4thNLPDL
-
sho-takase -
Transcript of 4thNLPDL
-
[email protected]/06/224NLPDL
1
-
D3
2
-
-e.g., SGNS, Glove
3
peach penguin medicinedrug
-
[Harris, 64]
SGNS-PMI [Levy+ 2014] Glove-
4
            have  drink  eat  bottle  roast  essay  read  work
beer          48     72   28      57     30      1     8    11
wine         108     92   24      86     29      2     2    23
mutton       309     31  105      13     48      0     0    17
novelist      12      4    8       0      0    103   186   134
writer        31      3   10       0      0    132   238    84
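A minimal sketch of how such a co-occurrence table can be turned into word similarities. The counts are copied from the table; the function names (`cosine`, `ppmi_rows`) and the choice of positive PMI weighting are illustrative assumptions, not something the slides specify.

```python
import math

# Co-occurrence counts copied from the table (rows: target words, columns: contexts).
CONTEXTS = ["have", "drink", "eat", "bottle", "roast", "essay", "read", "work"]
COUNTS = {
    "beer":     [48, 72, 28, 57, 30, 1, 8, 11],
    "wine":     [108, 92, 24, 86, 29, 2, 2, 23],
    "mutton":   [309, 31, 105, 13, 48, 0, 0, 17],
    "novelist": [12, 4, 8, 0, 0, 103, 186, 134],
    "writer":   [31, 3, 10, 0, 0, 132, 238, 84],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def ppmi_rows(counts):
    """Re-weight raw counts by positive PMI (an assumed, common weighting)."""
    total = sum(sum(row) for row in counts.values())
    n_cols = len(next(iter(counts.values())))
    col_sums = [sum(row[j] for row in counts.values()) for j in range(n_cols)]
    weighted = {}
    for word, row in counts.items():
        row_sum = sum(row)
        weighted[word] = [
            max(0.0, math.log(c * total / (row_sum * col_sums[j]))) if c > 0 else 0.0
            for j, c in enumerate(row)
        ]
    return weighted


if __name__ == "__main__":
    print("raw  beer~wine     :", round(cosine(COUNTS["beer"], COUNTS["wine"]), 3))
    print("raw  beer~novelist :", round(cosine(COUNTS["beer"], COUNTS["novelist"]), 3))
    ppmi = ppmi_rows(COUNTS)
    print("ppmi novelist~writer:", round(cosine(ppmi["novelist"], ppmi["writer"]), 3))
```

Words that share contexts (beer and wine) come out far closer than words that do not (beer and novelist), which is the distributional hypothesis in miniature.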
-
-
5
spoonsmall spoon smallf ( , )
-
Recurrent NN LSTM +
6
-
[Mikolov+ 13, Pham+ 15]
Recurrent NN, Gated Recurrent Unit (GRU) [Cho+ 14], Long Short-Term Memory (LSTM) [Hochreiter+ 97]
Recursive NN [Socher+ 11], Matrix-Vector Recursive NN [Socher+ 12], Recursive Neural Tensor Network [Socher+ 13], Tree-LSTM [Tai+ 15]
7
-
Recurrent NN
GPU
8
                          Relatedness         Sentiment
LSTM Variant              d     |θ|           d     |θ|
Standard                  150   203,400       168   315,840
Bidirectional             150   203,400       168   315,840
2-layer                   108   203,472       120   318,720
Bidirectional 2-layer     108   203,472       120   318,720
Constituency Tree         142   205,190       150   316,800
Dependency Tree           150   203,400       168   315,840
Table 1: Memory dimensions d and composition function parameter counts |θ| for each LSTM variant that we evaluate.
neutral sentences are excluded). Standard binarized constituency parse trees are provided for each sentence in the dataset, and each node in these trees is annotated with a sentiment label.
For the sequential LSTM baselines, we predict the sentiment of a phrase using the representation given by the final LSTM hidden state. The sequential LSTM models are trained on the spans corresponding to labeled nodes in the training set.
We use the classification model described in Sec. 4.1 with both Dependency Tree-LSTMs (Sec. 3.1) and Constituency Tree-LSTMs (Sec. 3.2). The Constituency Tree-LSTMs are structured according to the provided parse trees. For the Dependency Tree-LSTMs, we produce dependency parses (footnote 3) of each sentence; each node in a tree is given a sentiment label if its span matches a labeled span in the training set.
5.2 Semantic Relatedness
For a given pair of sentences, the semantic relatedness task is to predict a human-generated rating of the similarity of the two sentences in meaning.
We use the Sentences Involving Compositional Knowledge (SICK) dataset (Marelli et al., 2014), consisting of 9927 sentence pairs in a 4500/500/4927 train/dev/test split. The sentences are derived from existing image and video description datasets. Each sentence pair is annotated with a relatedness score $y \in [1, 5]$, with 1 indicating that the two sentences are completely unrelated, and 5 indicating that the two sentences are very related. Each label is the average of 10 ratings assigned by different human annotators.
Here, we use the similarity model described in Sec. 4.2. For the similarity prediction network (Eqs. 15) we use a hidden layer of size 50. We
Footnote 3: Dependency parses produced by the Stanford Neural Network Dependency Parser (Chen and Manning, 2014).
Method                                   Fine-grained   Binary
RAE (Socher et al., 2013)                43.2           82.4
MV-RNN (Socher et al., 2013)             44.4           82.9
RNTN (Socher et al., 2013)               45.7           85.4
DCNN (Blunsom et al., 2014)              48.5           86.8
Paragraph-Vec (Le and Mikolov, 2014)     48.7           87.8
CNN-non-static (Kim, 2014)               48.0           87.2
CNN-multichannel (Kim, 2014)             47.4           88.1
DRNN (Irsoy and Cardie, 2014)            49.8           86.6
LSTM                                     46.4 (1.1)     84.9 (0.6)
Bidirectional LSTM                       49.1 (1.0)     87.5 (0.5)
2-layer LSTM                             46.0 (1.3)     86.3 (0.6)
2-layer Bidirectional LSTM               48.5 (1.0)     87.2 (1.0)
Dependency Tree-LSTM                     48.4 (0.4)     85.7 (0.4)
Constituency Tree-LSTM
  randomly initialized vectors           43.9 (0.6)     82.0 (0.5)
  Glove vectors, fixed                   49.7 (0.4)     87.5 (0.8)
  Glove vectors, tuned                   51.0 (0.5)     88.0 (0.3)
Table 2: Test set accuracies on the Stanford Sentiment Treebank. For our experiments, we report mean accuracies over 5 runs (standard deviations in parentheses). Fine-grained: 5-class sentiment classification. Binary: positive/negative sentiment classification.
produce binarized constituency parses (footnote 4) and dependency parses of the sentences in the dataset for our Constituency Tree-LSTM and Dependency Tree-LSTM models.
5.3 Hyperparameters and Training Details
The hyperparameters for our models were tuned on the development set for each task.
We initialized our word representations using publicly available 300-dimensional Glove vectors (footnote 5) (Pennington et al., 2014). For the sentiment classification task, word representations were updated during training with a learning rate of 0.1. For the semantic relatedness task, word representations were held fixed as we did not observe any significant improvement when the representations were tuned.
Our models were trained using AdaGrad (Duchi et al., 2011) with a learning rate of 0.05 and a minibatch size of 25. The model parameters were regularized with a per-minibatch L2 regularization strength of $10^{-4}$. The sentiment classifier was additionally regularized using dropout (Srivastava et al., 2014) with a dropout rate of 0.5. We did not observe performance gains using dropout on the semantic relatedness task.
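As a rough illustration of the optimizer settings quoted above, the following sketch applies one AdaGrad update with an L2 penalty to a flat parameter list. The function name `adagrad_step`, the `eps` constant, and the way the L2 term is folded into the gradient are assumptions made for this example; the paper does not spell out these details.

```python
import math


def adagrad_step(params, grads, accum, lr=0.05, l2=1e-4, eps=1e-8):
    """One in-place AdaGrad update on flat lists of floats.

    lr and l2 default to the values quoted in the text. Adding l2 * parameter
    to the gradient is an assumed form of the L2 penalty, and eps is an
    assumed small constant that avoids division by zero.
    """
    for j in range(len(params)):
        g = grads[j] + l2 * params[j]   # gradient of loss plus L2 penalty
        accum[j] += g * g               # running sum of squared gradients
        params[j] -= lr * g / (math.sqrt(accum[j]) + eps)


if __name__ == "__main__":
    theta = [1.0, -2.0]
    history = [0.0, 0.0]
    adagrad_step(theta, [0.5, -0.1], history)
    print(theta, history)
```

Because the accumulated squared gradient only grows, the effective step size for each parameter shrinks over time, which is the defining property of AdaGrad.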
Footnote 4: Constituency parses produced by the Stanford PCFG Parser (Klein and Manning, 2003).
Footnote 5: Trained on 840 billion tokens of Common Crawl data, http://nlp.stanford.edu/projects/glove/.
Fine-grained Binary
Stanford Sentiment Treebank [Tai+ 15]
-
vs. 4 [Li+ 2015]
Bi-directional LSTM > Tree-LSTM: Sentiment analysis, QA, Discourse parsing
Tree-LSTM > Bi-directional LSTM: Relation classification (SOTA: dependency + CNN)
9
-
[Mikolov+ 13, Pham+ 15]
Recurrent NN, Gated Recurrent Unit (GRU) [Cho+ 14], Long Short-Term Memory (LSTM) [Hochreiter+ 97]
Recursive NN [Socher+ 11], Matrix-Vector Recursive NN [Socher+ 12], Recursive Neural Tensor Network [Socher+ 13], Tree-LSTM [Tai+ 15]
10
-
$v_{king} + v_{woman} - v_{man} \approx v_{queen}$
[Muraoka+ 14]
11
smoking increase the risk of lung
-
- XYX, Y
12
a b c d X Y e f g h X Y
XY
-
[Tian+ 2016]
- $v_{XY} \approx (v_X + v_Y)/2$
-B
13
$B = \left\| v_{XY} - \frac{1}{2}(v_X + v_Y) \right\|$
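The bias B can be computed directly from three vectors. This is a small sketch assuming the norm is Euclidean; the function name `additive_bias` and the toy vectors are hypothetical.

```python
import math


def additive_bias(v_xy, v_x, v_y):
    """Distance between the phrase vector v_xy and the average of v_x and v_y.

    The Euclidean norm is an assumption; the slide only shows a norm symbol.
    """
    return math.sqrt(
        sum((p - 0.5 * (a + b)) ** 2 for p, a, b in zip(v_xy, v_x, v_y))
    )


if __name__ == "__main__":
    # Toy vectors: the phrase vector is exactly the average, so the bias is 0.
    print(additive_bias([1.0, 1.0], [0.0, 2.0], [2.0, 0.0]))
```

A small B means that averaging the word vectors is a good approximation of the observed phrase vector.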
-
wordcontext
B
logsqrt
14
r(F (P (context|word)) a(word) b(context))
F
0(x) = x1+ < 0.5
-
F(x) = log x, a(y) = 0, b(z) = log P(z)-PMI
SGNS-PMI GloveF(x) = log x, a(y), b(z)
SGNS, GloveF(x)
15
-
J [Tian+ 16] J Lhelp to stop = stop to help
16
-
Recurrent Neural Network1/2 tv(t)h(t)h(t)
17
$h(t) = \tanh(U v(t) + W h(t-1))$
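A minimal sketch of this recurrence in plain Python, reading a sequence of word vectors v(1), ..., v(T) and returning the last hidden state. The zero initial state and the helper names (`matvec`, `rnn_encode`) are assumptions for illustration.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def rnn_encode(U, W, word_vectors):
    """Apply h(t) = tanh(U v(t) + W h(t-1)) over the sequence.

    h(0) is assumed to be the zero vector; the final hidden state is returned
    as the representation of the whole phrase.
    """
    h = [0.0] * len(W)
    for v in word_vectors:
        input_part = matvec(U, v)
        recurrent_part = matvec(W, h)
        h = [math.tanh(a + b) for a, b in zip(input_part, recurrent_part)]
    return h


if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]
    zeros = [[0.0, 0.0], [0.0, 0.0]]
    print(rnn_encode(identity, zeros, [[0.5, -0.5]]))
```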
smoking increase the risk of lung
W
U
-
Recurrent Neural Network2/2
U=W=Recurrent NN
RNN
Ltherisk
18
-
Recurrent Neural Network LSTM
L
19
smoking increase the risk of lung
NAACL 2016 Submission ***. Confidential review copy. DO NOT DISTRIBUTE.
Figure 2: Number of judgments for each similarity rating. The total number of judgments is 27,775 (5,555 pairs × 5 workers).
quality assessment of the workers. In other words, we discarded the similarity ratings of the Gold examples, and used those judged by the workers.
To build a high quality dataset, we use judgments from workers whose confidence values (reliability scores) computed by CrowdFlower are greater than 75%. Additionally, we force every pair to have at least five judgments from the workers. Consequently, 60 workers participated in this job. In the final version of this dataset, each pair has five similarity ratings judged by the five most reliable workers who were involved in the pair.
Figure 2 presents the number of judgments for each similarity rating. Workers seldom rated 7 for a pair of relational patterns, probably because most pairs have at least one difference in content words. The mean of the standard deviations of similarity ratings of all pairs is 1.16. The mean of Spearman's $\rho$ among workers involved in the dataset is 0.728. These statistics show a high inter-annotator agreement of the dataset.
3 Encoder for Relational Patterns
As can be inferred from the unavailability of the dataset for relational patterns, we have no well-established method for learning distributed representations of relational patterns. A naive approach would be to regard a relational pattern as a single unit (word) and to train word/pattern embeddings as usual. In fact, Mikolov et al. (2013) implemented this approach as a preprocessing step, mining phrasal expressions with strong collocations from a training corpus. However, this approach might be affected by data sparseness, which lowers the quality of distributed representations.
Another simple but effective approach is additive composition (Mitchell and Lapata, 2010), where the distributed representation of a relational pattern is computed by the mean of embeddings of constituent words. Presuming that a relational pattern consists of a sequence of $T$ words $w_1, ..., w_T$, we let $x_t \in \mathbb{R}^d$ be the embedding of the word $w_t$. This approach computes $\frac{1}{T} \sum_{t=1}^{T} x_t$ as the embedding of the relational pattern. Muraoka et al. (2014) reported that the additive composition is a strong baseline among various methods built upon neural networks.
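Additive composition as described here is just an elementwise mean. A minimal sketch, with toy embeddings that are made up for illustration:

```python
def additive_composition(embeddings):
    """Return the elementwise mean of T word embeddings of dimension d."""
    T = len(embeddings)
    d = len(embeddings[0])
    return [sum(x[j] for x in embeddings) / T for j in range(d)]


if __name__ == "__main__":
    # Hypothetical 2-dimensional embeddings for a two-word pattern.
    print(additive_composition([[1.0, 2.0], [3.0, 4.0]]))  # [2.0, 3.0]
```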
Having said that, additive composition is inadequate for some relational patterns because it treats every word in a relational pattern equally. It is natural to expect that content words influence the distributed representation of a relational pattern more than functional words. In addition, adding the embedding of the word "have" might not be useful for "X have access to Y", which has mostly the same meaning as "X access Y". Therefore, we explore an alternative approach that is inspired by a Sequence-to-Sequence model (Sutskever et al., 2014) and an Encoder-Decoder model (Cho et al., 2014). The idea is to compute the embedding of a relational pattern using a function, $F(x_1, ..., x_T)$, where $F(\cdot)$ is modeled by a variant of recurrent neural network (RNN).
3.1 Baseline: Long Short-Term Memory
Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997) is a variant of RNN that is applied successfully to various NLP tasks including word segmentation (Chen et al., 2015), dependency parsing (Dyer et al., 2015), machine translation (Sutskever et al., 2014), and sentiment analysis (Tai et al., 2015). LSTM computes the input gate $i_t \in \mathbb{R}^d$, forget gate $f_t \in \mathbb{R}^d$, output gate $o_t \in \mathbb{R}^d$, memory cell $c_t \in \mathbb{R}^d$, and hidden state $h_t \in \mathbb{R}^d$ for a given embedding $x_t$ at position $t$ (footnote 5).
$$i_t = \sigma(W_{ix} x_t + W_{ih} h_{t-1}) \quad (1)$$
$$f_t = \sigma(W_{fx} x_t + W_{fh} h_{t-1}) \quad (2)$$
$$o_t = \sigma(W_{ox} x_t + W_{oh} h_{t-1}) \quad (3)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot g(W_{cx} x_t + W_{ch} h_{t-1}) \quad (4)$$
$$h_t = o_t \odot g(c_t) \quad (5)$$
Footnote 5: We omitted peephole connections and bias terms in this study. We set the number of dimensions of hidden states identical to that of word embeddings ($d$) so that we can adapt the objective function of Skip-gram model (Section 3.3).
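A sketch of one LSTM step following Equations 1 to 5, without bias terms or peephole connections as stated in the footnote. The dictionary keys (`"W_ix"` and so on) and the helper names are naming assumptions for this example only.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def matvec(matrix, vector):
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def lstm_step(params, x, h_prev, c_prev):
    """One bias-free LSTM step; params maps names like "W_ix" to d x d matrices."""

    def affine(wx, wh):
        # W_{.x} x_t + W_{.h} h_{t-1}
        return [a + b for a, b in zip(matvec(params[wx], x), matvec(params[wh], h_prev))]

    i = [sigmoid(z) for z in affine("W_ix", "W_ih")]            # Eq. 1
    f = [sigmoid(z) for z in affine("W_fx", "W_fh")]            # Eq. 2
    o = [sigmoid(z) for z in affine("W_ox", "W_oh")]            # Eq. 3
    cand = [math.tanh(z) for z in affine("W_cx", "W_ch")]       # g(...) in Eq. 4
    c = [ft * cp + it * gt for ft, cp, it, gt in zip(f, c_prev, i, cand)]  # Eq. 4
    h = [ot * math.tanh(ct) for ot, ct in zip(o, c)]            # Eq. 5
    return h, c


if __name__ == "__main__":
    d = 2
    zero = [[0.0] * d for _ in range(d)]
    names = ["W_ix", "W_ih", "W_fx", "W_fh", "W_ox", "W_oh", "W_cx", "W_ch"]
    params = {name: zero for name in names}
    params["W_cx"] = [[1.0, 0.0], [0.0, 1.0]]
    print(lstm_step(params, [1.0, 0.0], [0.0, 0.0], [0.0, 0.0]))
```

Applying `lstm_step` from t = 1 to T and keeping the last `h` gives the pattern representation described in the text.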
-
Gated Additive Composition (GAC) [Takase+ 16]
+ +
the, ofincrease, risk
20
smoking increase the risk of lung
[Figure 3 diagram: the example sentence "passive smoking increases the risk of lung cancer" with a relational pattern of length T = L = 4 and a context window of width 2 on each side. The word vectors $x_s, ..., x_{s+L-1}$ of the pattern are combined by Gated Additive Composition (GAC), through input gates $i$ and forget gates $f$, into hidden vectors $h_s, ..., h_{s+L-1}$; the resulting pattern vector and the context vectors $x_{s-2}, x_{s-1}, x_{s+L}, x_{s+L+1}$ drive the parameter update by the Skip-gram model.]
Figure 3: Overview of the proposed method. The proposed method computes the distributed representation of a relational pattern using the input gate and forget gate, and learns parameters by predicting surrounding words (Skip-gram model).
In these equations, $W_{ix}$, $W_{ih}$, $W_{fx}$, $W_{fh}$, $W_{ox}$, $W_{oh}$, $W_{cx}$, $W_{ch}$ are $d \times d$ matrices (parameters), $\sigma(\cdot)$ is the elementwise sigmoid function, $g(\cdot)$ is the elementwise activation function (tanh), and the operator $\odot$ represents elementwise multiplication. We set $c_0 = 0$ and $h_0 = 0$ at $t = 1$. In essence, LSTM computes the hidden state $h_t$ and the memory cell $c_t$ based on those at the previous position ($h_{t-1}$ and $c_{t-1}$) and the word embedding $x_t$. Applying these equations from $t = 1$ to $T$, we use $h_T$ as the distributed representation of the relational pattern.
3.2 Proposal: Gated Additive Composition
Although LSTM is successful for various NLP tasks, its expressive power might be overly strong for handling short phrases of relational patterns. Furthermore, LSTM is often criticized as having numerous components for which the purpose is not immediately apparent (Jozefowicz et al., 2015). We are unsure whether LSTM is the optimal architecture for modeling relational patterns.
For this reason, we simplified the LSTM architecture as follows. We removed a memory cell by replacing $c_t$ with a hidden state $h_t$ because the problem of exponential error decay (Hochreiter et al., 2001) might not be prominent for relational patterns. We also removed matrices corresponding to $W_{hh}$ and $W_{hx}$ because most relational patterns hold additive composition. This simplification yields the architecture defined by Equations 6 to 8.
$$i_t = \sigma(W_{ix} x_t + W_{ih} h_{t-1}) \quad (6)$$
$$f_t = \sigma(W_{fx} x_t + W_{fh} h_{t-1}) \quad (7)$$
$$h_t = g(f_t \odot h_{t-1} + i_t \odot x_t) \quad (8)$$
Here, $W_{ix}$, $W_{ih}$, $W_{fx}$, $W_{fh}$ are $d \times d$ matrices. The input and forget gates (Equations 6 and 7) are identical to those in LSTM (Equations 1 and 2). However, Equation 8 is better explained using a weighted additive composition between the vector of the current word $x_t$ and the vector of the previous hidden state $h_{t-1}$. The elementwise weights are controlled by the input gate $i_t$ and forget gate $f_t$; we expect that input gates are closed (close to zero) and forget gates are opened (close to one) when the current word is a control verb or preposition. We name this architecture gated additive composition (GAC).
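The GAC recurrence (Equations 6 to 8) is short enough to sketch in plain Python. This is an illustrative sketch, not the authors' Chainer implementation; the dictionary keys and the zero initial state $h_0 = 0$ (carried over from the LSTM description) are assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def matvec(matrix, vector):
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def gac_encode(params, embeddings):
    """Compose word embeddings x_1..x_T into a pattern vector h_T with GAC."""
    d = len(embeddings[0])
    h = [0.0] * d  # assumed initial state h_0 = 0
    for x in embeddings:
        gate_in = [a + b for a, b in zip(matvec(params["W_ix"], x), matvec(params["W_ih"], h))]
        gate_fg = [a + b for a, b in zip(matvec(params["W_fx"], x), matvec(params["W_fh"], h))]
        i = [sigmoid(z) for z in gate_in]   # Eq. 6: input gate
        f = [sigmoid(z) for z in gate_fg]   # Eq. 7: forget gate
        # Eq. 8: gated, elementwise-weighted sum of previous state and current word
        h = [math.tanh(ft * hp + it * xt) for ft, hp, it, xt in zip(f, h, i, x)]
    return h


if __name__ == "__main__":
    d = 2
    zero = [[0.0] * d for _ in range(d)]
    params = {"W_ix": zero, "W_ih": zero, "W_fx": zero, "W_fh": zero}
    # With all-zero matrices both gates are 0.5 everywhere.
    print(gac_encode(params, [[1.0, -1.0], [0.2, 0.4]]))
```

With an input gate near zero and a forget gate near one, the update reduces to passing the previous state through tanh, so the current word is effectively skipped, which matches the behavior the text expects for function words.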
3.3 Parameter estimation: Skip-gram model
As explained in Section 1, we explore the OpenIE approach, which relies neither on an existing KB nor on supervision data for relation classification. Therefore, we adapt the Skip-gram model (Mikolov et al., 2013) to train the parameters in LSTM and GAC on an unlabeled text corpus.
Formally, we designate an occurrence of a relational pattern $p$ as a subsequence of $L$ words $w_s, ..., w_{s+L-1}$ in a corpus. We define $\delta$ words appearing before and after pattern $p$ as the context words $C_p = (s-\delta, ..., s-1, s+L, ..., s+L+\delta)$ for the pattern. We define the log-likelihood of the relational pattern $l_p$, following the objective function of Skip-gram with negative sampling (SGNS) (Levy and Goldberg, 2014).
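$$l_p = \sum_{\tau \in C_p} \left( \log \sigma(h_p^\top \tilde{x}_\tau) + \sum_{k=1}^{K} \log \sigma(-h_p^\top \tilde{x}_{\breve{\tau}}) \right) \quad (9)$$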
In this formula: $K$ denotes the number of negative samples; $h_p \in \mathbb{R}^d$ is the vector for the relational pattern $p$ computed by LSTM or GAC; $\tilde{x}_\tau \in \mathbb{R}^d$ is the context vector for the word $w_\tau$ (footnote 6); $\tilde{x}_{\breve{\tau}} \in \mathbb{R}^d$ is the context vector for the word that was sampled from
Footnote 6: The Skip-gram model has two kinds of vectors $x_t$ and $\tilde{x}_t$ assigned for a word $w_t$. Equation 2 of the original paper (Mikolov et al., 2013) denotes $x_t$ (word vector) as $v$ (input vector) and $\tilde{x}_t$ (context vector) as $v'$ (output vector). The word2vec implementation does not write context (output) vectors but only word (input) vectors to a model file. Therefore, we modified the source code to save context vectors, and use them in Equation 9. This modification ensures the consistency of the entire model.
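The pattern log-likelihood can be sketched as follows, assuming the usual SGNS form in which the score of each negative sample is negated inside the sigmoid. The function names and the data layout (one list of K negative vectors per context position) are assumptions for this example.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def pattern_log_likelihood(h_p, context_vectors, negative_vectors):
    """SGNS-style log-likelihood of one pattern occurrence.

    context_vectors: one context vector per position in C_p.
    negative_vectors: for each context position, a list of K sampled
    context vectors (hypothetical layout chosen for this sketch).
    """
    total = 0.0
    for x_pos, negatives in zip(context_vectors, negative_vectors):
        total += log_sigmoid(dot(h_p, x_pos))
        total += sum(log_sigmoid(-dot(h_p, x_neg)) for x_neg in negatives)
    return total


if __name__ == "__main__":
    h_p = [0.1, -0.2]
    contexts = [[0.3, 0.1], [0.0, 0.5]]
    negatives = [[[0.2, 0.2], [-0.1, 0.4]], [[0.5, 0.5], [0.1, 0.0]]]
    print(pattern_log_likelihood(h_p, contexts, negatives))
```

Training would maximize this quantity with respect to the encoder parameters that produce `h_p`.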
-
Recurrent NN LSTM +
21
-
1:
Reverb [Fader+ 12] ukWaC20
22
Cephalexin reduce the risk of the bacteria
Cephalexin prevent the bacteriaInhibit(Cephalexin, bacteria)
-
SGNS [Mikolov+ 13]
23
smoking increase the risk of lung
$\log P(w_{t+j} \mid w_t) \approx \log \sigma(v_{w_t} \cdot u_{w_{t+j}}) + \sum_{i=1}^{k} \mathbb{E}_{z \sim P_n(w)} \left[ \log \sigma(-v_{w_t} \cdot u_z) \right]$
z k
vwt
uwt+1uw1
-
5,555ACL2016
24
1 2 5
inhibit | prevent the growth of | 4.2 | 0.7
be the part of | be an essential part of | 5.6 | 0.8
be open from | close at | 1.6 | 0.5
-
25
ACL 2016 Submission ***. Confidential review copy. DO NOT DISTRIBUTE.
At every occurrence of a relational pattern in the corpus, we use Stochastic Gradient Descent (SGD) and backpropagation through time (BPTT) for training the parameters (matrices) in encoders. More specifically, we initialize the word vectors $x_t$ and context vectors $\tilde{x}_t$ and fix them during training. At every occurrence of a relational pattern, we compute gradients for Equation 5 to update the parameters in encoders. In this way, each encoder is trained to compose a vector of a relational pattern so that it can predict the surrounding context words. An advantage of this parameter estimation is that the distributed representations of words and relational patterns stay in the same vector space. Figure 3 visualizes the training process for GAC.
4 Experiments
In Section 4.1, we investigate the performance of the distributed representations computed by different encoders on the pattern similarity task. Section 4.2 examines the contribution of the distributed representations on SemEval 2010 Task 8, and discusses the usefulness of the new dataset to predict successes of the relation classification task.
4.1 Relational pattern similarity
For every pair in the dataset built in Section 2, we compose the vectors of the two relational patterns using an encoder described in Section 3, and compute the cosine similarity of the two vectors. Repeating this process for all pairs in the dataset, we measure Spearman's $\rho$ between the similarity values computed by the encoder and similarity ratings assigned by humans.
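The evaluation protocol just described (cosine similarity per pair, then Spearman's rank correlation against human ratings) can be sketched with the standard library only. Tie handling by average ranks is an assumption, and the vectors and ratings in the demo are made up.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda idx: values[idx])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(a, b):
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def spearman(a, b):
    """Spearman's rho as the Pearson correlation of the ranks."""
    return pearson(ranks(a), ranks(b))


if __name__ == "__main__":
    # Hypothetical pattern-vector pairs and human ratings.
    pairs = [([1.0, 0.0], [0.9, 0.1]), ([1.0, 0.0], [0.0, 1.0]), ([1.0, 1.0], [1.0, 0.5])]
    human = [6.2, 1.4, 4.8]
    model = [cosine(u, v) for u, v in pairs]
    print(spearman(model, human))
```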
4.1.1 Training procedure
We used ukWaC (footnote 9) as the training corpus for the encoders. This corpus includes the text of 2 billion words from Web pages crawled in the .uk domain. Part-of-speech tags and lemmas are annotated by TreeTagger (footnote 10). We used lowercased lemmas throughout the experiments. We apply word2vec to this corpus to train word vectors $x_t$ and context vectors $\tilde{x}_t$. All encoders use word vectors $x_t$ to compose vectors of relational patterns; and the Skip-gram model uses context vectors $\tilde{x}_t$ to compute the objective function and gradients.
We applied Reverb (Fader et al., 2011) to the ukWaC corpus to extract relational pattern candidates. To remove useless relational patterns, we applied filtering rules that are compatible with those used in the publicly available extraction result (footnote 11). Additionally, we discarded relational patterns appearing in the evaluation dataset throughout the experiments to assess the performance under which an encoder composes vectors of unseen relational patterns. This preprocessing yielded 127,677 relational patterns.
Footnote 9: http://wacky.sslmit.unibo.it
Footnote 10: http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/
Figure 4: Performance of each method on the relational pattern similarity task with variation in the number of dimensions.
All encoders were implemented on Chainer (footnote 12), a flexible framework of neural networks. The hyperparameters of the Skip-gram model are identical to those in Mikolov et al. (2013): the width of context window $\delta = 5$, the number of negative samples $K = 5$, the subsampling of $10^{-5}$. For each encoder that requires training, we tried 0.025, 0.0025, and 0.00025 as an initial learning rate, and selected the best value for the encoder. In contrast to the presentation of Section 3, we compose a pattern vector in backward order (from the last to the first) because preliminary experiments showed a slight improvement with this treatment.
4.1.2 Results and discussions
Figure 4 shows Spearman's rank correlations of different encoders when the number of dimensions of vectors is 100 to 500. The figure shows that GAC achieves the best performance on all dimensions.
Figure 4 includes the performance of the naive approach, NoComp, which regards a relational pattern as a single unit (word). In this approach, we allocated a vector $h_p$ for each relational pattern $p$ in Equation 5 instead of the vector composition, and trained the vectors of relational patterns using the Skip-gram model. The performance was poor for two reasons: we were unable to compute
Footnote 11: http://reverb.cs.washington.edu/
Footnote 12: http://chainer.org/
GAC, RNN, GRU, Add, LSTM (Add)
1
-
AddRNN GACAdd
Add+ 4
26
Length   #       NoComp   Add     LSTM    GRU     RNN     GAC
1        636     0.324    0.324   0.324   0.324   0.324   0.324
2        1,018   0.215    0.319   0.257   0.274   0.285   0.321
3        2,272   0.234    0.386   0.344   0.370   0.387   0.404
4        1,206   0.208    0.306   0.314   0.329   0.319   0.323
≥ 5      423     0.278    0.315   0.369   0.384   0.394   0.357
All      5,555   0.215    0.340   0.336   0.356   0.362   0.370
Table 1: Spearman's rank correlations on different pattern lengths (number of dimensions d = 500).
                              w_t          w_{t+1} w_{t+2} ...
large i_t (input open)        reimburse    for
                              payable      in
                              liable       to
small i_t (input close)       a            charter member of
                              a            valuable member of
                              be           an avid reader of
large f_t (forget open)       be           eligible to participate in
                              be           require to submit
                              be           request to submit
small f_t (forget close)      coauthor     of
                              capital      of
                              center       of
Table 2: Prominent moments for input/forget gates.
similarity values for 1,744 pairs because relational patterns in these pairs do not appear in ukWaC; and relational patterns could not obtain sufficient statistics because of data sparseness.
Table 1 reports Spearman's rank correlations computed for each pattern length. Here, the length of a relational-pattern pair is defined by the maximum of the lengths of two patterns in the pair. In length of 1, all methods achieve the same correlation score because they use the same word vector $x_t$. The table shows that additive composition (Add) performs well for shorter relational patterns (lengths of 2 and 3) but poorly for longer ones (lengths of 4 and 5+). GAC also exhibits the similar tendency to Add, but it outperforms Add for shorter patterns (lengths of 2 and 3) probably because of the adaptive control of input and forget gates. In contrast, RNN and its variants (RNN, GRU, and LSTM) enjoy the advantage on longer patterns (lengths of 4 and 5+).
To examine the roles of input and forget gates of GAC, we visualize the moments when input/forget gates are wide open or closed. More precisely, we extract the input word and scanned words when $\|i_t\|_2$ or $\|f_t\|_2$ is small (close to zero) or large (close to one) on the relational-pattern dataset. We restate that we compose a pattern vector in backward order (from the last to the first): GAC scans "of", "author", and "be" in this order for composing the vector of the relational pattern "be author of".
Table 2 displays the top three examples identified using the procedure. The table shows two groups of tendencies. Input gates open and forget gates close when scanned words are only a preposition and the current word is a content word. In these situations, GAC tries to read the semantic vector of the content word and to ignore the semantic vector of the preposition. In contrast, input gates close and forget gates open when the current word is "be" or "a" and scanned words form a noun phrase (e.g., "charter member of"), a complement (e.g., "eligible to participate in"), or a passive voice (e.g., "require(d) to submit"). This behavior is also reasonable because GAC emphasizes informative words more than functional words.
4.2 Relation classification
4.2.1 Experimental settings
To examine the usefulness of the dataset and distributed representations for a different application, we address the task of relation classification on the SemEval 2010 Task 8 dataset (Hendrickx et al., 2010). In other words, we explore whether high-quality distributed representations of relational patterns are effective to identify a relation type of an entity pair.
The dataset consists of 10,717 relation instances (8,000 training and 2,717 test instances) with their relation types annotated. The dataset defines 9 directed relations (e.g., CAUSE-EFFECT) and 1 undirected relation OTHER. Given a pair of entity mentions, the task is to identify a relation type in 19 candidate labels ($2 \times 9$ directed + 1 undirected relations). For example, given the pair of entity mentions $e_1$ = burst and $e_2$ =
-
GAC
27
-
RNNU, W
28
U W
-
2: bigram (JN, NN, VN)
29
-
bigramPPDB
JN: 133,998
NN: 35,602
VN: 62,651
30
novel method new approach
v1 v2
$\max(0, 1 - v_1 \cdot v_2 + v_1 \cdot n_1) + \max(0, 1 - v_1 \cdot v_2 + v_2 \cdot n_2)$
n1, n2v1, v2
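A small sketch of this margin-based loss for one paraphrase pair, where v1 and v2 are the composed vectors of the two phrases and n1 and n2 are vectors of negative examples chosen for v1 and v2. The function name and the toy vectors are hypothetical; how the negatives are selected is not shown here.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def paraphrase_margin_loss(v1, v2, n1, n2, margin=1.0):
    """Hinge loss pushing v1.v2 above v1.n1 and v2.n2 by at least the margin."""
    positive = dot(v1, v2)
    return (
        max(0.0, margin - positive + dot(v1, n1))
        + max(0.0, margin - positive + dot(v2, n2))
    )


if __name__ == "__main__":
    # Identical paraphrase vectors, orthogonal negatives: no loss.
    print(paraphrase_margin_loss([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]))
    # Orthogonal "paraphrases" with negatives equal to the phrases: large loss.
    print(paraphrase_margin_loss([1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]))
```

The loss is zero once the paraphrase pair is more similar than either negative by the margin, so training focuses on pairs that are still confusable.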
-
[Wieting+ 15]
JNNNVN108 [Mitchell+ 10]
31
Adj-Noun1 Adj-Noun2 5
bigram
vast amount | large quantity | 5.0 | 0.9
small house | little room | 2.0 | 0.6
better job | good place | 3.0 | 0.6
-
GRURecursive NNGAC
32
                              JN     NN     VN     Average
Add                           0.50   0.29   0.58   0.46
Recursive NN [Wieting+ 15]    0.57   0.44   0.55   0.52
Recurrent NN                  0.58   0.43   0.46   0.49
GRU                           0.62   0.40   0.53   0.53
LSTM                          0.57   0.44   0.49   0.49
CNN                           0.58   0.48   0.50   0.50
GAC                           0.56   0.43   0.52   0.52
Human                         0.87   0.64   0.73   0.75
-
3: PPDB PPDB 5
bigramPPDB60,0001,000 PPDB
33
-
GAC, CNN LSTM, GRU
34
                              Spearman's rank correlation
Add                           0.32
Recursive NN [Wieting+ 15]    0.40
Recurrent NN                  0.25
GRU                           0.33
LSTM                          0.32
CNN                           0.45
GAC                           0.47
-
Recurrent NN
SGNS
35
U W
-
Recurrent NN GAC +
36