4thNLPDL

Computing the Meaning of Phrases with Neural Networks. Sho Takase, Inui-Okazaki Laboratory, Tohoku University. [email protected] 2016/06/22 4th NLPDL 1

Transcript of 4thNLPDL

  • [email protected] 2016/06/22 4th NLPDL

    1

  • Speaker introduction (D3: third-year doctoral student)

    2

  • Representing word meaning with vectors, e.g., SGNS, GloVe

    3

    peach, penguin, medicine/drug

  • Distributional hypothesis [Harris, 64]

    SGNS implicitly factorizes a PMI matrix [Levy+ 2014]; GloVe likewise builds on co-occurrence counts

    4

             have  drink  eat  bottle  roast  essay  read  work
    beer       48     72   28      57     30      1     8    11
    wine      108     92   24      86     29      2     2    23
    mutton    309     31  105      13     48      0     0    17
    novelist   12      4    8       0      0    103   186   134
    writer     31      3   10       0      0    132   238    84


  • Computing phrase meaning by composition

    5

    e.g., the vector of "small spoon" from "spoon" and "small": f(v_small, v_spoon)

  • Recurrent NN, LSTM, and additive composition (+)

    6

  • Treat a phrase as a single unit (word) [Mikolov+ 13, Pham+ 15], or compose its vector with a neural encoder:

    Sequence: Recurrent NN; Gated Recurrent Unit (GRU) [Cho+ 14]; Long Short-Term Memory (LSTM) [Hochreiter+ 97]

    Tree: Recursive NN [Socher+ 11]; Matrix-Vector Recursive NN [Socher+ 12]; Recursive Neural Tensor Network [Socher+ 13]; Tree-LSTM [Tai+ 15]

    7

  • Recurrent NN vs. tree-structured models [Tai+ 15]

    Sequence models are simple and run efficiently (e.g., on a GPU)

    8

                            Relatedness          Sentiment
    LSTM Variant             d       |θ|          d       |θ|
    Standard                150   203,400        168   315,840
    Bidirectional           150   203,400        168   315,840
    2-layer                 108   203,472        120   318,720
    Bidirectional 2-layer   108   203,472        120   318,720
    Constituency Tree       142   205,190        150   316,800
    Dependency Tree         150   203,400        168   315,840

    Table 1: Memory dimensions d and composition function parameter counts |θ| for each LSTM variant that we evaluate.

    neutral sentences are excluded). Standard binarized constituency parse trees are provided for each sentence in the dataset, and each node in these trees is annotated with a sentiment label.

    For the sequential LSTM baselines, we predict the sentiment of a phrase using the representation given by the final LSTM hidden state. The sequential LSTM models are trained on the spans corresponding to labeled nodes in the training set.

    We use the classification model described in Sec. 4.1 with both Dependency Tree-LSTMs (Sec. 3.1) and Constituency Tree-LSTMs (Sec. 3.2). The Constituency Tree-LSTMs are structured according to the provided parse trees. For the Dependency Tree-LSTMs, we produce dependency parses³ of each sentence; each node in a tree is given a sentiment label if its span matches a labeled span in the training set.

    5.2 Semantic Relatedness

    For a given pair of sentences, the semantic relatedness task is to predict a human-generated rating of the similarity of the two sentences in meaning.

    We use the Sentences Involving Compositional Knowledge (SICK) dataset (Marelli et al., 2014), consisting of 9927 sentence pairs in a 4500/500/4927 train/dev/test split. The sentences are derived from existing image and video description datasets. Each sentence pair is annotated with a relatedness score y ∈ [1, 5], with 1 indicating that the two sentences are completely unrelated, and 5 indicating that the two sentences are very related. Each label is the average of 10 ratings assigned by different human annotators.

    Here, we use the similarity model described in Sec. 4.2. For the similarity prediction network (Eqs. 15) we use a hidden layer of size 50. We

    ³ Dependency parses produced by the Stanford Neural Network Dependency Parser (Chen and Manning, 2014).

    Method                                  Fine-grained      Binary
    RAE (Socher et al., 2013)                   43.2           82.4
    MV-RNN (Socher et al., 2013)                44.4           82.9
    RNTN (Socher et al., 2013)                  45.7           85.4
    DCNN (Blunsom et al., 2014)                 48.5           86.8
    Paragraph-Vec (Le and Mikolov, 2014)        48.7           87.8
    CNN-non-static (Kim, 2014)                  48.0           87.2
    CNN-multichannel (Kim, 2014)                47.4           88.1
    DRNN (Irsoy and Cardie, 2014)               49.8           86.6
    LSTM                                    46.4 (1.1)     84.9 (0.6)
    Bidirectional LSTM                      49.1 (1.0)     87.5 (0.5)
    2-layer LSTM                            46.0 (1.3)     86.3 (0.6)
    2-layer Bidirectional LSTM              48.5 (1.0)     87.2 (1.0)
    Dependency Tree-LSTM                    48.4 (0.4)     85.7 (0.4)
    Constituency Tree-LSTM
      randomly initialized vectors          43.9 (0.6)     82.0 (0.5)
      Glove vectors, fixed                  49.7 (0.4)     87.5 (0.8)
      Glove vectors, tuned                  51.0 (0.5)     88.0 (0.3)

    Table 2: Test set accuracies on the Stanford Sentiment Treebank. For our experiments, we report mean accuracies over 5 runs (standard deviations in parentheses). Fine-grained: 5-class sentiment classification. Binary: positive/negative sentiment classification.

    produce binarized constituency parses⁴ and dependency parses of the sentences in the dataset for our Constituency Tree-LSTM and Dependency Tree-LSTM models.

    5.3 Hyperparameters and Training Details

    The hyperparameters for our models were tuned on the development set for each task.

    We initialized our word representations using publicly available 300-dimensional Glove vectors⁵ (Pennington et al., 2014). For the sentiment classification task, word representations were updated during training with a learning rate of 0.1. For the semantic relatedness task, word representations were held fixed as we did not observe any significant improvement when the representations were tuned.

    Our models were trained using AdaGrad (Duchi et al., 2011) with a learning rate of 0.05 and a minibatch size of 25. The model parameters were regularized with a per-minibatch L2 regularization strength of 10⁻⁴. The sentiment classifier was additionally regularized using dropout (Srivastava et al., 2014) with a dropout rate of 0.5. We did not observe performance gains using dropout on the semantic relatedness task.

    ⁴ Constituency parses produced by the Stanford PCFG Parser (Klein and Manning, 2003).

    ⁵ Trained on 840 billion tokens of Common Crawl data, http://nlp.stanford.edu/projects/glove/.



    Stanford Sentiment Treebank [Tai+ 15]

  • Sequence vs. tree models on four tasks [Li+ 2015]

    Bi-directional LSTM > Tree-LSTM: sentiment analysis, QA, discourse parsing

    Tree-LSTM > Bi-directional LSTM: relation classification (where the SOTA also combines dependency structure with a CNN)

    9

  • Treat a phrase as a single unit (word) [Mikolov+ 13, Pham+ 15], or compose its vector with a neural encoder:

    Sequence: Recurrent NN; Gated Recurrent Unit (GRU) [Cho+ 14]; Long Short-Term Memory (LSTM) [Hochreiter+ 97]

    Tree: Recursive NN [Socher+ 11]; Matrix-Vector Recursive NN [Socher+ 12]; Recursive Neural Tensor Network [Socher+ 13]; Tree-LSTM [Tai+ 15]

    10

  • v_king + v_woman − v_man ≈ v_queen

    [Muraoka+ 14]

    11
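    To make the analogy concrete, here is a minimal sketch (assuming a hypothetical `embeddings` dict mapping words to unit-normalized numpy vectors; the names are illustrative, not from [Muraoka+ 14]):

```python
import numpy as np

# Toy illustration of v_king + v_woman - v_man ~ v_queen.
# `embeddings` maps words to unit-normalized numpy vectors (assumed given).
def analogy(embeddings, a, minus, plus, topn=1):
    """Return the word(s) closest to v_a - v_minus + v_plus by cosine."""
    target = embeddings[a] - embeddings[minus] + embeddings[plus]
    target /= np.linalg.norm(target)
    scores = {w: float(v @ target)
              for w, v in embeddings.items()
              if w not in (a, minus, plus)}   # exclude the query words
    return sorted(scores, key=scores.get, reverse=True)[:topn]

# analogy(embeddings, "king", "man", "woman") should rank "queen" highly
```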

    smoking increase the risk of lung

  • A relational pattern is the word sequence linking two entities X and Y

    12

    ... a b c d X [pattern] Y e f g h ...

    Goal: a vector for the pattern expressing the relation between X and Y

  • When is additive composition valid? [Tian+ 2016]

    Compare the phrase vector v_XY with the average (v_X + v_Y)/2 of its word vectors

    The gap is measured by the bias B:

    13

    B = ‖ v_XY − (1/2)(v_X + v_Y) ‖
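    A one-line sketch of this bias, assuming v_XY, v_X, v_Y are given as numpy vectors (the function name is ours, not from [Tian+ 2016]):

```python
import numpy as np

# B = || v_XY - (v_X + v_Y) / 2 ||: the gap between the observed phrase
# vector and its additive composition.
def additive_bias(v_xy, v_x, v_y):
    return float(np.linalg.norm(v_xy - (v_x + v_y) / 2.0))
```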

  • Build word vectors from word-context co-occurrence statistics

    The bias B can be bounded when each vector entry has the form

    F(P(context | word)) − a(word) − b(context)

    14

    where F is a concave function such as log or sqrt (e.g., F(x) = x^α with α < 0.5)

  • F(x) = log x, a(y) = 0, b(z) = log P(z) yields PMI

    This covers SGNS, which factorizes PMI [Levy+ 2014], and GloVe, which uses F(x) = log x with learned biases a(y), b(z)

    So vectors trained with SGNS or GloVe fit the framework for some F(x)

    15

  • Limitation [Tian+ 16]: additive composition ignores word order, e.g., help to stop = stop to help

    16

  • Recurrent Neural Network (1/2): at each position t, combine the input vector v(t) with the previous state to obtain the hidden state h(t)

    17

    h(t) = tanh(U v(t) + W h(t−1))

    smoking increase the risk of lung

    U: input matrix; W: recurrent (hidden-to-hidden) matrix
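    A minimal sketch of this recurrence in plain numpy (parameter shapes assumed d × d; the function name is illustrative):

```python
import numpy as np

# h(t) = tanh(U v(t) + W h(t-1)), with the same U, W at every position.
def rnn_encode(inputs, U, W):
    h = np.zeros(U.shape[0])        # h(0) = 0
    for v in inputs:                # inputs: word vectors v(1), ..., v(T)
        h = np.tanh(U @ v + W @ h)
    return h                        # final state as the phrase vector
```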

  • Recurrent Neural Network (2/2)

    A Recurrent NN applies the same U and W at every position, so the RNN treats function words (e.g., the) and content words (e.g., risk) with identical weights

    18

  • From the Recurrent Neural Network to the LSTM

    Gating helps when the sequence length L grows

    19

    smoking increase the risk of lung


    Figure 2: Number of judgments for each similarity rating. The total number of judgments is 27,775 (5,555 pairs × 5 workers).

    quality assessment of the workers. In other words, we discarded the similarity ratings of the Gold examples, and used those judged by the workers.

    To build a high quality dataset, we use judgments from workers whose confidence values (reliability scores) computed by CrowdFlower are greater than 75%. Additionally, we force every pair to have at least five judgments from the workers. Consequently, 60 workers participated in this job. In the final version of this dataset, each pair has five similarity ratings judged by the five most reliable workers who were involved in the pair.

    Figure 2 presents the number of judgments for each similarity rating. Workers seldom rated 7 for a pair of relational patterns, probably because most pairs have at least one difference in content words. The mean of the standard deviations of similarity ratings of all pairs is 1.16. The mean of Spearman's ρ among workers involved in the dataset is 0.728. These statistics show a high inter-annotator agreement of the dataset.

    3 Encoder for Relational Patterns

    As can be inferred from the unavailability of the dataset for relational patterns, we have no well-established method for learning distributed representations of relational patterns. A naïve approach would be to regard a relational pattern as a single unit (word) and to train word/pattern embeddings as usual. In fact, Mikolov et al. (2013) implemented this approach as a preprocessing step, mining phrasal expressions with strong collocations from a training corpus. However, this approach might be affected by data sparseness, which lowers the quality of distributed representations.

    Another simple but effective approach is additive composition (Mitchell and Lapata, 2010), where the distributed representation of a relational pattern is computed as the mean of the embeddings of the constituent words. Presuming that a relational pattern consists of a sequence of T words w_1, ..., w_T, we let x_t ∈ R^d be the embedding of the word w_t. This approach computes (1/T) Σ_{t=1}^{T} x_t as the embedding of the relational pattern. Muraoka et al. (2014) reported that the additive composition is a strong baseline among various methods built upon neural networks.
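    As a reference point, additive composition is only a few lines; a sketch, with `word_vectors` a hypothetical word-to-vector lookup:

```python
import numpy as np

# Mean of the constituent word embeddings: (1/T) * sum_t x_t.
def additive_composition(words, word_vectors):
    return np.mean([word_vectors[w] for w in words], axis=0)

# e.g. additive_composition(["have", "access", "to"], word_vectors)
```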

    Having said that, additive composition is inade-quate for some relational patterns because it treatsevery word in a relational pattern equally. It is nat-ural to expect that content words influence the dis-tributed representation of a relational pattern morethan functional words. In addition, adding the em-bedding of the word have might not be useful forX have access to Y, which has mostly the samemeaning as X access Y. Therefore, we explore analternative approach that is inspired by a Sequence-to-Sequence model (Sutskever et al., 2014) and anEncoder-Decoder model (Cho et al., 2014). The ideais to compute the embedding of a relational patternusing a function, F (x1, ..., xT ), where F (.) is mod-eled by a variant of recurrent neural network (RNN).

    3.1 Baseline: Long Short-Term Memory

    Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997) is a variant of RNN that has been applied successfully to various NLP tasks including word segmentation (Chen et al., 2015), dependency parsing (Dyer et al., 2015), machine translation (Sutskever et al., 2014), and sentiment analysis (Tai et al., 2015). LSTM computes the input gate i_t ∈ R^d, forget gate f_t ∈ R^d, output gate o_t ∈ R^d, memory cell c_t ∈ R^d, and hidden state h_t ∈ R^d for a given embedding x_t at position t⁵.

    i_t = σ(W_ix x_t + W_ih h_{t−1})    (1)
    f_t = σ(W_fx x_t + W_fh h_{t−1})    (2)
    o_t = σ(W_ox x_t + W_oh h_{t−1})    (3)
    c_t = f_t ⊙ c_{t−1} + i_t ⊙ g(W_cx x_t + W_ch h_{t−1})    (4)
    h_t = o_t ⊙ g(c_t)    (5)

    ⁵ We omitted peephole connections and bias terms in this study. We set the number of dimensions of hidden states identical to that of word embeddings (d) so that we can adapt the objective function of the Skip-gram model (Section 3.3).
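    A sketch of Equations 1-5 in plain numpy (no peepholes or biases, hidden size d equal to the embedding size, per the footnote; this is our paraphrase of the equations, not the paper's code):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_encode(xs, Wix, Wih, Wfx, Wfh, Wox, Woh, Wcx, Wch, g=np.tanh):
    d = Wix.shape[0]
    c, h = np.zeros(d), np.zeros(d)               # c_0 = h_0 = 0
    for x in xs:                                  # x_1, ..., x_T
        i = sigmoid(Wix @ x + Wih @ h)            # (1) input gate
        f = sigmoid(Wfx @ x + Wfh @ h)            # (2) forget gate
        o = sigmoid(Wox @ x + Woh @ h)            # (3) output gate
        c = f * c + i * g(Wcx @ x + Wch @ h)      # (4) memory cell
        h = o * g(c)                              # (5) hidden state
    return h                                      # h_T: pattern vector
```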

  • Gated Additive Composition (GAC) [Takase+ 16]

    Additive composition extended with input and forget gates

    The gates can suppress function words (the, of) and emphasize content words (increase, risk)

    20

    smoking increase the risk of lung


    [Figure 3 (schematic): the sentence "passive smoking increases the risk of lung cancer", with a relational pattern of length T = L = 4, context windows of width δ = 2 on each side, hidden vectors composed by Gated Additive Composition (GAC) into the pattern vector, and parameter updates by the Skip-gram model over the context words (word vectors and context vectors).]

    Figure 3: Overview of the proposed method. The proposed method computes the distributed representation of a relational pattern using the input gate and forget gate, and learns parameters by predicting surrounding words (Skip-gram model).

    In these equations, W_ix, W_ih, W_fx, W_fh, W_ox, W_oh, W_cx, W_ch are d × d matrices (parameters), σ(.) is the elementwise sigmoid function, g(.) is the elementwise activation function (tanh), and the operator ⊙ presents elementwise multiplication. We set c_0 = 0 and h_0 = 0 at t = 1. In essence, LSTM computes the hidden state h_t and the memory cell c_t based on those at the previous position (h_{t−1} and c_{t−1}) and the word embedding x_t. Applying these equations from t = 1 to T, we use h_T as the distributed representation of the relational pattern.

    3.2 Proposal: Gated Additive Composition

    Although LSTM is successful for various NLP tasks, its expressive power might be overly strong for handling short phrases of relational patterns. Furthermore, LSTM is often criticized as having numerous components for which the purpose is not immediately apparent (Jozefowicz et al., 2015). We are unsure whether LSTM is the optimal architecture for modeling relational patterns.

    For this reason, we simplified the LSTM architecture as follows. We removed the memory cell by replacing c_t with the hidden state h_t, because the problem of exponential error decay (Hochreiter et al., 2001) might not be prominent for relational patterns. We also removed the matrices corresponding to W_hh and W_hx because most relational patterns hold additive composition. This simplification yields the architecture defined by Equations 6–8.

    i_t = σ(W_ix x_t + W_ih h_{t−1})    (6)
    f_t = σ(W_fx x_t + W_fh h_{t−1})    (7)
    h_t = g(f_t ⊙ h_{t−1} + i_t ⊙ x_t)    (8)

    Here, W_ix, W_ih, W_fx, W_fh are d × d matrices. The input and forget gates (Equations 6 and 7) are identical to those in LSTM (Equations 1 and 2). However, Equation 8 is better explained as a weighted additive composition of the vector of the current word x_t and the vector of the previous hidden state h_{t−1}. The elementwise weights are controlled by the input gate i_t and forget gate f_t; we expect that input gates are closed (close to zero) and forget gates are opened (close to one) when the current word is a control verb or preposition. We name this architecture gated additive composition (GAC).
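    The same style of sketch for Equations 6-8 (again our paraphrase, assuming h_0 = 0 as in the LSTM above):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# GAC keeps the LSTM input/forget gates but composes x_t and h_{t-1}
# directly: no memory cell, no candidate transformation.
def gac_encode(xs, Wix, Wih, Wfx, Wfh, g=np.tanh):
    h = np.zeros(Wix.shape[0])                    # h_0 = 0
    for x in xs:
        i = sigmoid(Wix @ x + Wih @ h)            # (6) input gate
        f = sigmoid(Wfx @ x + Wfh @ h)            # (7) forget gate
        h = g(f * h + i * x)                      # (8) gated additive composition
    return h
```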

    3.3 Parameter estimation: Skip-gram model

    As explained in Section 1, we explore the OpenIE approach, which relies neither on an existing KB nor on supervision data for relation classification. Therefore, we adapt the Skip-gram model (Mikolov et al., 2013) to train the parameters in LSTM and GAC on an unlabeled text corpus.

    Formally, we designate an occurrence of a relational pattern p as a subsequence of L words w_s, ..., w_{s+L−1} in a corpus. We define the words appearing before and after pattern p as the context words C_p = (w_{s−δ}, ..., w_{s−1}, w_{s+L}, ..., w_{s+L+δ−1}). We define the log-likelihood l_p of the relational pattern, following the objective function of Skip-gram with negative sampling (SGNS) (Levy and Goldberg, 2014).

    l_p = Σ_{w̃ ∈ C_p} [ log σ(h_p · x̃_{w̃}) + Σ_{k=1}^{K} log σ(−h_p · x̃_{z_k}) ]    (9)

    In this formula, K denotes the number of negative samples; h_p ∈ R^d is the vector of the relational pattern p computed by LSTM or GAC; x̃_{w̃} ∈ R^d is the context vector of the word w̃⁶; and x̃_{z_k} ∈ R^d is the context vector of a word z_k that was sampled from the noise distribution.

    ⁶ The Skip-gram model has two kinds of vectors, x_t and x̃_t, assigned to a word w_t. Equation 2 of the original paper (Mikolov et al., 2013) denotes x_t (word vector) as v (input vector) and x̃_t (context vector) as v′ (output vector). The word2vec implementation does not write context (output) vectors but only word (input) vectors to a model file. Therefore, we modified the source code to save context vectors, and use them in Equation 9. This modification ensures the consistency of the entire model.
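    A sketch of Equation 9 for a single pattern occurrence (`neg_sampler` is a hypothetical stand-in for drawing K context vectors from the noise distribution):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# h_p: pattern vector from LSTM/GAC; ctx_vecs: context vectors of C_p.
def pattern_log_likelihood(h_p, ctx_vecs, neg_sampler, K=5):
    lp = 0.0
    for x_ctx in ctx_vecs:
        lp += np.log(sigmoid(h_p @ x_ctx))        # positive context word
        for x_neg in neg_sampler(K):              # K negative samples
            lp += np.log(sigmoid(-h_p @ x_neg))
    return lp
```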

  • Recurrent NN, LSTM, and additive composition (+)

    21

  • Experiment 1: relational pattern similarity

    Patterns extracted by Reverb [Fader+ 12] from ukWaC (2 billion words)

    22

    Cephalexin reduce the risk of the bacteria

    Cephalexin prevent the bacteria: both express Inhibit(Cephalexin, bacteria)

  • Skip-gram with negative sampling (SGNS) [Mikolov+ 13]

    23

    smoking increase the risk of lung

    log P(w_{t+j} | w_t) ≈ log σ(v_{w_t} · u_{w_{t+j}}) + Σ_{i=1}^{k} E_{z∼P_n(w)} [ log σ(−v_{w_t} · u_z) ]

    z: a negative sample drawn from the noise distribution P_n(w); k: the number of negative samples

    v_{w_t}: word vector of the center word; u_{w_{t+1}}, ..., u_{w_1}: context vectors of the surrounding words

  • Evaluation data: 5,555 relational pattern pairs with human similarity ratings (ACL 2016)

    24

    Pattern 1        Pattern 2                Mean (1–7)   SD
    inhibit          prevent the growth of        4.2      0.7
    be the part of   be an essential part of      5.6      0.8
    be open from     close at                     1.6      0.5

  • Results on the pattern similarity task (Figure 4)

    25


    At every occurrence of a relational pattern in the corpus, we use Stochastic Gradient Descent (SGD) and backpropagation through time (BPTT) for training the parameters (matrices) in encoders. More specifically, we initialize the word vectors x_t and context vectors x̃_t and fix them during training. At every occurrence of a relational pattern, we compute gradients for Equation 9 to update the parameters in encoders. In this way, each encoder is trained to compose a vector of a relational pattern so that it can predict the surrounding context words. An advantage of this parameter estimation is that the distributed representations of words and relational patterns stay in the same vector space. Figure 3 visualizes the training process for GAC.

    4 Experiments

    In Section 4.1, we investigate the performance of the distributed representations computed by different encoders on the pattern similarity task. Section 4.2 examines the contribution of the distributed representations on SemEval 2010 Task 8, and discusses the usefulness of the new dataset to predict successes of the relation classification task.

    4.1 Relational pattern similarity

    For every pair in the dataset built in Section 2, we compose the vectors of the two relational patterns using an encoder described in Section 3, and compute the cosine similarity of the two vectors. Repeating this process for all pairs in the dataset, we measure Spearman's ρ between the similarity values computed by the encoder and the similarity ratings assigned by humans.
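    The evaluation loop is essentially cosine similarity plus Spearman's ρ; a sketch, with `encode` standing in for any encoder from Section 3:

```python
import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# pairs: list of (pattern1, pattern2); human_ratings: gold similarity scores.
def evaluate(pairs, human_ratings, encode):
    predicted = [cosine(encode(p1), encode(p2)) for p1, p2 in pairs]
    rho, _ = spearmanr(predicted, human_ratings)
    return rho
```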

    4.1.1 Training procedure

    We used ukWaC⁹ as the training corpus for the encoders. This corpus includes the text of 2 billion words from Web pages crawled in the .uk domain. Part-of-speech tags and lemmas are annotated by TreeTagger¹⁰. We used lowercased lemmas throughout the experiments. We apply word2vec to this corpus to train word vectors x_t and context vectors x̃_t. All encoders use word vectors x_t to compose vectors of relational patterns; and the Skip-gram model uses context vectors x̃_t to compute the objective function and gradients.

    We applied Reverb (Fader et al., 2011) to the ukWaC corpus to extract relational pattern candidates.

    ⁹ http://wacky.sslmit.unibo.it
    ¹⁰ http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/

    Figure 4: Performance of each method on the relational pattern similarity task with variation in the number of dimensions.

    To remove unuseful relational patterns, we applied filtering rules that are compatible with those used in the publicly available extraction result¹¹. Additionally, we discarded relational patterns appearing in the evaluation dataset throughout the experiments to assess the performance under which an encoder composes vectors of unseen relational patterns. This preprocessing yielded 127,677 relational patterns.

    All encoders were implemented on Chainer¹², a flexible framework of neural networks. The hyperparameters of the Skip-gram model are identical to those in Mikolov et al. (2013): the width of the context window δ = 5, the number of negative samples K = 5, and subsampling of 10⁻⁵. For each encoder that requires training, we tried 0.025, 0.0025, and 0.00025 as the initial learning rate, and selected the best value for the encoder. In contrast to the presentation of Section 3, we compose a pattern vector in backward order (from the last word to the first) because preliminary experiments showed a slight improvement with this treatment.

    4.1.2 Results and discussions

    Figure 4 shows Spearman's rank correlations of different encoders when the number of dimensions of vectors is 100–500. The figure shows that GAC achieves the best performance on all dimensions.

    Figure 4 includes the performance of the naïve approach, NoComp, which regards a relational pattern as a single unit (word). In this approach, we allocated a vector h_p for each relational pattern p in Equation 9 instead of the vector composition, and trained the vectors of relational patterns using the Skip-gram model. The performance was poor for two reasons: we were unable to compute

    ¹¹ http://reverb.cs.washington.edu/
    ¹² http://chainer.org/

    Methods compared in Figure 4: GAC, RNN, GRU, Add, LSTM


  • Add is strong on short patterns, RNN variants on long ones; GAC outperforms Add on short patterns

    GAC (Add + gates) falls behind the RNN variants on patterns of length ≥ 4

    26


    Length      #   NoComp    Add   LSTM    GRU    RNN    GAC
    1         636    0.324  0.324  0.324  0.324  0.324  0.324
    2       1,018    0.215  0.319  0.257  0.274  0.285  0.321
    3       2,272    0.234  0.386  0.344  0.370  0.387  0.404
    4       1,206    0.208  0.306  0.314  0.329  0.319  0.323
    ≥ 5       423    0.278  0.315  0.369  0.384  0.394  0.357
    All     5,555    0.215  0.340  0.336  0.356  0.362  0.370

    Table 1: Spearman's rank correlations on different pattern lengths (number of dimensions d = 500).

    Gate state                  w_t w_{t+1} w_{t+2} ...
    large i_t (input open)      reimburse for; payable in; liable to
    small i_t (input close)     a charter member of; a valuable member of; be an avid reader of
    large f_t (forget open)     be eligible to participate in; be require to submit; be request to submit
    small f_t (forget close)    coauthor of; capital of; center of

    Table 2: Prominent moments for input/forget gates.

    similarity values for 1,744 pairs because relational patterns in these pairs do not appear in ukWaC; and relational patterns could not obtain sufficient statistics because of data sparseness.

    Table 1 reports Spearman's rank correlations computed for each pattern length. Here, the length of a relational-pattern pair is defined by the maximum of the lengths of the two patterns in the pair. At length 1, all methods achieve the same correlation score because they use the same word vector x_t. The table shows that additive composition (Add) performs well for shorter relational patterns (lengths of 2 and 3) but poorly for longer ones (lengths of 4 and 5+). GAC exhibits a similar tendency to Add, but it outperforms Add for shorter patterns (lengths of 2 and 3), probably because of the adaptive control of input and forget gates. In contrast, RNN and its variants (RNN, GRU, and LSTM) enjoy the advantage on longer patterns (lengths of 4 and 5+).

    To examine the roles of input and forget gates of GAC, we visualize the moments when input/forget gates are wide open or closed. More precisely, we extract the input word and scanned words when ‖i_t‖² or ‖f_t‖² is small (close to zero) or large (close to one) on the relational-pattern dataset. We restate that we compose a pattern vector in backward order (from the last word to the first): GAC scans "of", "author", and "be" in this order when composing the vector of the relational pattern "be author of".

    Table 2 displays the top three examples identified using this procedure. The table shows two groups of tendencies. Input gates open and forget gates close when the scanned words are only a preposition and the current word is a content word. In these situations, GAC tries to read the semantic vector of the content word and to ignore the semantic vector of the preposition. In contrast, input gates close and forget gates open when the current word is "be" or "a" and the scanned words form a noun phrase (e.g., "charter member of"), a complement (e.g., "eligible to participate in"), or a passive voice (e.g., "require(d) to submit"). This behavior is also reasonable because GAC emphasizes informative words more than functional words.

    4.2 Relation classification

    4.2.1 Experimental settings

    To examine the usefulness of the dataset and distributed representations for a different application, we address the task of relation classification on the SemEval 2010 Task 8 dataset (Hendrickx et al., 2010). In other words, we explore whether high-quality distributed representations of relational patterns are effective to identify the relation type of an entity pair.

    The dataset consists of 10,717 relation instances (8,000 training and 2,717 test instances) with their relation types annotated. The dataset defines 9 directed relations (e.g., CAUSE-EFFECT) and 1 undirected relation, OTHER. Given a pair of entity mentions, the task is to identify the relation type among 19 candidate labels (2 × 9 directed + 1 undirected relations). For example, given the pair of entity mentions e1 = burst and e2 =

  • Analysis of the GAC gates (Table 2)

    27


  • Analysis of the learned RNN matrices U and W

    28

  • Experiment 2: bigram similarity (adjective-noun (JN), noun-noun (NN), verb-noun (VN))

    29

  • Training on bigram paraphrase pairs from PPDB

    JN: 133,998 pairs; NN: 35,602 pairs; VN: 62,651 pairs

    30

    e.g., novel method ↔ new approach

    composed into vectors v1 and v2, trained with the max-margin loss (see the sketch below)

    max(0, 1 − v1 · v2 + v1 · n1) + max(0, 1 − v1 · v2 + v2 · n2)

    where n1 and n2 are negative examples for v1 and v2
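    A sketch of that loss for one training pair (our reading of the slide's formula; v1, v2 are the composed bigram vectors and n1, n2 their negative examples):

```python
import numpy as np

# max(0, 1 - v1.v2 + v1.n1) + max(0, 1 - v1.v2 + v2.n2)
def hinge_loss(v1, v2, n1, n2, margin=1.0):
    pos = float(v1 @ v2)
    return (max(0.0, margin - pos + float(v1 @ n1)) +
            max(0.0, margin - pos + float(v2 @ n2)))
```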

  • Evaluation setup of [Wieting+ 15]

    Bigram similarity dataset of [Mitchell+ 10]: 108 pairs each for JN, NN, and VN

    31

    Example bigram pairs with human ratings:

    Adj-Noun 1    Adj-Noun 2       Mean  SD
    vast amount   large quantity    5.0  0.9
    small house   little room       2.0  0.6
    better job    good place        3.0  0.6

  • GRU, Recursive NN, and GAC score well on average

    32

                                  JN    NN    VN   Average
    Add                          0.50  0.29  0.58   0.46
    Recursive NN [Wieting+ 15]   0.57  0.44  0.55   0.52
    Recurrent NN                 0.58  0.43  0.46   0.49
    GRU                          0.62  0.40  0.53   0.53
    LSTM                         0.57  0.44  0.49   0.49
    CNN                          0.58  0.48  0.50   0.50
    GAC                          0.56  0.43  0.52   0.52
    Human                        0.87  0.64  0.73   0.75

  • Experiment 3: PPDB phrase-pair similarity (5-point scale)

    Trained on 60,000 bigrams from PPDB; evaluated on 1,000 PPDB pairs

    33

  • GAC and CNN outperform LSTM and GRU

    34

                                 Spearman's rank correlation
    Add                              0.32
    Recursive NN [Wieting+ 15]       0.40
    Recurrent NN                     0.25
    GRU                              0.33
    LSTM                             0.32
    CNN                              0.45
    GAC                              0.47

  • Summary: Recurrent NN encoders trained with the SGNS objective

    35

    (with an analysis of the learned matrices U and W)

  • Conclusion: Recurrent NN variants and GAC (additive composition + gates) for computing phrase meaning

    36