PageRanking WordNet Synsets : An Application to Opinion Mining Andrea Esuli and Fabrizio Sebastiani...

PageRanking WordNet Synsets: An Application to Opinion Mining

Andrea Esuli and Fabrizio Sebastiani

Istituto di Scienza e Tecnologie dell’Informazione, Consiglio Nazionale delle Ricerche

Via Giuseppe Moruzzi, 1 – 56124 Pisa, Italy

fandrea.esuli,[email protected]

Advisor: Hsin-Hsi Chen

Speaker: Yong-Sheng Lo

Date: 2007/10/31

ACL - 2007

Introduction

Recent years have witnessed an explosion of work on opinion mining

An important part of this research has been the work on the automatic determination of the opinion-related properties (ORPs) of terms

ORPs = positive, negative, or neutral polarity

Related work 1/2

Traditional work

Determine the polarity of adjectives: Hatzivassiloglou and McKeown (1997); Kamps et al. (2004)

Determine the polarity of generic terms: Turney and Littman (2003); Kim and Hovy (2004); Takamura et al. (2005)

Related work 2/2

Recent work: using glosses from online dictionaries

Extend a set of terms of known positivity/negativity: Andreevskaia and Bergler (2006a)

Determine the ORPs of generic terms: Esuli and Sebastiani (2005; 2006a)

Determine the ORPs of WordNet synsets (synonym sets): Esuli and Sebastiani (2006b)

In this work

We have investigated the applicability of a random walk model to the problem of ranking synsets (synonym sets) according to positivity and negativity.

Using PageRank

PageRank needs nodes and links

We use eXtended WordNet version 2.0-1.1, which is based on WordNet version 2.0

eXtended WordNet (XWN)

The goal of this project is to develop a tool that takes as input the current or future versions of WordNet and automatically generates an eXtended WordNet that provides several important enhancements intended to remedy the present limitations of WordNet.

XWN has 4 files: adj.xml, adv.xml, noun.xml, verb.xml

How is the information represented in XWN?

Graph generation for PageRank

The directed graph G = (N, L)

N (nodes): the set of all WordNet synsets (115,424 synsets)

L (links): there is a link from synset Si to synset Sk (Si → Sk) iff the gloss of Si contains at least one term belonging to Sk

For example: the gloss of Si contains “by a small margin; …” and Sk contains “small, …”, so Si → Sk
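The link rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the synset ids, terms and glosses are invented toy data (not read from the XWN files), glosses are assumed to be already tokenized, and a gloss token is linked to every synset that contains it, with no word-sense disambiguation.

```python
from typing import Dict, List, Set, Tuple

# Toy data invented for illustration (hypothetical ids, not from XWN).
synset_terms: Dict[str, Set[str]] = {
    "S1": {"narrowly"},
    "S2": {"small", "little"},
    "S3": {"margin", "border"},
}
synset_gloss: Dict[str, List[str]] = {
    "S1": ["by", "a", "small", "margin"],
    "S2": ["limited", "in", "size"],
    "S3": ["the", "boundary", "of", "a", "surface"],
}


def build_links(terms: Dict[str, Set[str]],
                gloss: Dict[str, List[str]]) -> Set[Tuple[str, str]]:
    """Return all pairs (Si, Sk) such that the gloss of Si contains a term of Sk."""
    # Invert the term sets: term -> ids of the synsets that contain it.
    term_to_synsets: Dict[str, Set[str]] = {}
    for sid, words in terms.items():
        for word in words:
            term_to_synsets.setdefault(word, set()).add(sid)

    links: Set[Tuple[str, str]] = set()
    for si, tokens in gloss.items():
        for token in tokens:
            for sk in term_to_synsets.get(token, ()):
                links.add((si, sk))
    return links


links = build_links(synset_terms, synset_gloss)
print(sorted(links))  # [('S1', 'S2'), ('S1', 'S3')]
```

Here the gloss of S1 mentions "small" (a term of S2) and "margin" (a term of S3), so S1 links to both; the other two glosses mention no synset term and produce no links.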

The PageRank algorithm 1/4

Input: the row-normalized adjacency matrix W

Let W0 be the |N| × |N| adjacency matrix of G, where |N| = number of synsets

W0[i,j] = 1 iff there is a link from node i to node j

If W0[i,j] = 1 then W[i,j] = 1 / |F(i)|, else W[i,j] = 0

B(i) = { nj | W0[j,i] = 1 }: the set of the backward neighbours of ni (the nodes that link to node i)

F(i) = { nj | W0[i,j] = 1 }: the set of the forward neighbours of ni (the nodes that node i links to)

Output: a vector a = [ a1, …, a|N| ]

ai represents the score of node ni, for i = 1, …, |N|
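As a small illustration of these definitions, the sketch below derives B(i), F(i) and the row-normalized matrix from a 0/1 adjacency matrix. The 3-node matrix is a made-up example, and the function names are hypothetical; rows without outgoing links are simply left as all zeros, which is an assumption of this sketch.

```python
from typing import Dict, List, Set, Tuple


def neighbour_sets(W0: List[List[int]]) -> Tuple[Dict[int, Set[int]], Dict[int, Set[int]]]:
    """Return (B, F): backward and forward neighbour sets of every node."""
    n = len(W0)
    B = {i: {j for j in range(n) if W0[j][i] == 1} for i in range(n)}
    F = {i: {j for j in range(n) if W0[i][j] == 1} for i in range(n)}
    return B, F


def row_normalize(W0: List[List[int]]) -> List[List[float]]:
    """W[i][j] = 1/|F(i)| where W0[i][j] == 1, and 0 elsewhere."""
    W = []
    for row in W0:
        out_degree = sum(row)  # |F(i)| for this row's node
        W.append([x / out_degree if out_degree else 0.0 for x in row])
    return W


# Made-up graph: 0 -> 1, 0 -> 2, 1 -> 2; node 2 has no outgoing link.
W0 = [
    [0, 1, 1],
    [0, 0, 1],
    [0, 0, 0],
]
B, F = neighbour_sets(W0)
W = row_normalize(W0)
print(B[2], F[0])  # {0, 1} {1, 2}
print(W)           # [[0.0, 0.5, 0.5], [0.0, 0.0, 1.0], [0.0, 0.0, 0.0]]
```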

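The PageRank algorithm 2/4

PageRank iteratively computes the vector a:

$$a_i^{(k)} = \alpha \sum_{j \in B(i)} \frac{a_j^{(k-1)}}{|F(j)|} + (1 - \alpha)\, e_i \qquad (1)$$

The value of ei amounts to an internal source of score for node i. It is constant (= 1/|N|) across the iterations and independent of its backward neighbours.

In vectorial form, Equation 1 can be written as

$$\mathbf{a}^{(k)} = \alpha\, \mathbf{a}^{(k-1)} W + (1 - \alpha)\, \mathbf{e}$$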
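The iteration can be sketched as follows, reading the update as: the new score of node i is α times the sum, over its backward neighbours j, of the previous score of j divided by |F(j)|, plus (1 − α) times ei. The starting vector (a = e), the L1 stopping test, the tolerance and the value α = 0.5 are all assumptions made for this toy example, not settings reported on the slides.

```python
from typing import List


def pagerank(W: List[List[float]], e: List[float], alpha: float = 0.5,
             tol: float = 1e-10, max_iter: int = 1000) -> List[float]:
    """Iterate a <- alpha * a W + (1 - alpha) * e until a stops changing.

    W is the row-normalized adjacency matrix and e the vector of internal
    score sources. The stopping rule (L1 change below tol) is an assumption.
    """
    n = len(W)
    a = list(e)  # assumed starting point
    for _ in range(max_iter):
        new_a = [
            alpha * sum(a[j] * W[j][i] for j in range(n)) + (1 - alpha) * e[i]
            for i in range(n)
        ]
        if sum(abs(x - y) for x, y in zip(new_a, a)) < tol:
            return new_a
        a = new_a
    return a


# Made-up 3-node graph: 0 -> 1, 0 -> 2, 1 -> 2 (already row-normalized).
W = [
    [0.0, 0.5, 0.5],
    [0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0],
]
e = [1 / 3, 1 / 3, 1 / 3]  # uniform internal sources, 1/|N|
scores = pagerank(W, e, alpha=0.5)
print(scores)
```

On this graph the scores settle at 1/6, 5/24 and 5/16: node 2, which receives links from both other nodes, ends up ranked first.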

The PageRank algorithm 3/4

In this work, the ei values are used as internal sources of a given ORP (positivity or negativity) for node i, by attributing a null ei value to all but a few “seed” synsets known to possess that ORP

Simple procedure: PageRank will thus make the ORP flow from the seed synsets, at a rate constant throughout the iterations, into other synsets along the Si → Sk relation, until a stable state is reached; the final ai values can be used to rank the synsets in terms of that ORP.

The PageRank algorithm 4/4

[Figures: Run 1 and Run 2]

Why PageRank? 1/2

If terms contained in synset Sk occur in the glosses of many positive synsets, and if the positivity scores of these synsets are high, then it is likely that Sk is itself positive (the same happens for negativity).

This justifies the summation of Equation 1.

Why PageRank? 2/2

If the gloss of a positive synset that contains a term in synset Sk also contains many other terms, then this is a weaker indication that Sk is itself positive.

This justifies dividing by |F(j)| in Equation 1.

The ranking resulting from the algorithm needs to be biased in favour of a specific ORP: synsets known to possess the ORP get higher scores.

This justifies the presence of the ei factor in Equation 1.

Full procedure 1/2

(1) The graph G is generated

Numbers, articles and prepositions occurring in the glosses are discarded, since they can be assumed to carry no positivity or negativity

This leaves only nouns, adjectives, verbs, and adverbs

(2) The row-normalized adjacency matrix W of G is derived

The graph G is “pruned” by removing “self-loops”
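The two clean-up operations mentioned above can be sketched as below. This is a rough illustration only: the stop list is a tiny hypothetical stand-in for a real list of articles and prepositions, the tokenizer is a plain regular expression, and the function names are invented for this example.

```python
import re
from typing import List, Set, Tuple

# Hypothetical, deliberately tiny list of articles and prepositions.
ARTICLES_AND_PREPOSITIONS = {
    "a", "an", "the", "by", "of", "in", "on", "to", "with", "for", "at", "from",
}


def content_tokens(gloss: str) -> List[str]:
    """Lower-case a gloss and drop numbers, articles and prepositions."""
    tokens = re.findall(r"[a-z]+|\d+", gloss.lower())
    return [t for t in tokens
            if not t.isdigit() and t not in ARTICLES_AND_PREPOSITIONS]


def remove_self_loops(links: Set[Tuple[str, str]]) -> Set[Tuple[str, str]]:
    """Prune links that go from a synset to itself."""
    return {(si, sk) for (si, sk) in links if si != sk}


print(content_tokens("by a small margin of 2 points"))
# ['small', 'margin', 'points']
print(sorted(remove_self_loops({("S1", "S1"), ("S1", "S2")})))
# [('S1', 'S2')]
```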

Full procedure 2/2

(3) PageRank setting

The ei values are loaded into the e vector

All synsets other than the seed synsets of renowned positivity (negativity) are given a value of 0

We experiment with several different versions of the e vector and several different values of α

(4) PageRank is executed using W and e, iterating until a predefined termination condition is reached

(5) We rank all the synsets of WordNet in descending order of their ai score

The process is run twice, once for positivity and once for negativity

Setup (e) 1/2

e1 (baseline): all values uniformly set to 1/|N|

e2: uniform non-null ei scores assigned to the synsets that contain the adjective good (bad); null scores for all other synsets

e3: uniform non-null ei scores assigned to the synsets that contain at least one of the seven “paradigmatic” positive (negative) adjectives; null scores for all other synsets

Positive: good, nice, excellent, positive, fortunate, correct, superior

Negative: bad, nasty, poor, negative, unfortunate, wrong, inferior
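A minimal sketch of how such e vectors could be built is shown below. The four toy synsets are invented, the function names are hypothetical, and scaling the seed entries so that the vector sums to 1 is an assumption of this sketch (the slide only says the non-null scores are uniform).

```python
from typing import List, Set

POSITIVE_SEEDS = {"good", "nice", "excellent", "positive",
                  "fortunate", "correct", "superior"}


def uniform_e(n: int) -> List[float]:
    """Baseline vector: every entry equals 1/|N|."""
    return [1.0 / n] * n


def seed_e(synsets: List[Set[str]], seeds: Set[str]) -> List[float]:
    """Equal non-null score for synsets containing a seed term, 0 elsewhere.

    Entries are scaled to sum to 1 (an assumption of this sketch).
    """
    hits = [1.0 if terms & seeds else 0.0 for terms in synsets]
    total = sum(hits)
    if total == 0:
        raise ValueError("no synset contains a seed term")
    return [h / total for h in hits]


# Invented toy synsets, each given as its set of member terms.
synsets = [{"good", "well"}, {"table"}, {"excellent", "first-class"}, {"bad"}]

e1 = uniform_e(len(synsets))           # [0.25, 0.25, 0.25, 0.25]
e2 = seed_e(synsets, {"good"})         # [1.0, 0.0, 0.0, 0.0]
e3 = seed_e(synsets, POSITIVE_SEEDS)   # [0.5, 0.0, 0.5, 0.0]
print(e1, e2, e3)
```

The negative counterparts would be obtained the same way from the negative adjectives.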

Setup (e) 2/2

e4: the score assigned to a synset (for ei) is proportional to the positivity (negativity) score assigned to it by SentiWordNet release 1.0, and all entries sum up to 1

SentiWordNet is a lexical resource in which each WordNet synset is given a positivity score, a negativity score, and a neutrality score.

e5: like e4, but using SentiWordNet release 1.1
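The "proportional, summing to 1" construction amounts to dividing each score by the total. A minimal sketch, with made-up positivity scores standing in for values that would come from SentiWordNet, and a hypothetical function name:

```python
from typing import List


def proportional_e(scores: List[float]) -> List[float]:
    """Scale non-negative scores so that the resulting entries sum to 1."""
    total = sum(scores)
    if total <= 0:
        raise ValueError("at least one score must be positive")
    return [s / total for s in scores]


# Made-up positivity scores for four synsets (not real SentiWordNet values).
positivity = [0.75, 0.0, 0.25, 0.5]
e_pos = proportional_e(positivity)
print(e_pos)  # entries are 1/2, 0, 1/6 and 1/3
```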

SentiWordNet

Esuli and Sebastiani, LREC-06

SentiWordNet is a lexical resource for opinion mining

SentiWordNet assigns to each synset of WordNet three sentiment scores: positivity, negativity, objectivity

The benchmark 1/4

Micro-WNOp corpus (Cerini et al., 2007)

It consists of a set of 1,105 WordNet synsets, each of which was manually assigned scores

The corpus is divided into three parts:

Common: 110 synsets which all the evaluators evaluated by working together, so as to align their evaluation criteria.

Group1: 496 synsets which were each independently evaluated by three evaluators.

Group2: 499 synsets which were each independently evaluated by the other two evaluators.

The benchmark 2/4

To ensure that the corpus is composed of synsets which are relevant to the opinion topic, it was generated by randomly selecting 100 positive + 100 negative + 100 objective terms from the General Inquirer (GI) lexicon (Turney and Littman, 2003) and including all the synsets that contained at least one such term, without paying attention to part of speech.

The benchmark 3/4

How is the information represented in the Micro-WNOp corpus?

Scores range from 0 to 1

The benchmark 4/4

In this work, we obtain the positivity (negativity) ranking from Micro-WNOp by averaging the positivity (negativity) scores assigned by the evaluators of each group into a single score, and by sorting the synsets according to the resulting score.

Group 1 is used as a validation set, in order to tune α

Group 2 is used as a test set
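The averaging-and-sorting step can be sketched as follows. The synset ids and evaluator scores are invented for illustration, the function name is hypothetical, and sorting in descending order of the averaged score is an assumption of this sketch.

```python
from typing import Dict, List, Tuple


def gold_ranking(scores_by_synset: Dict[str, List[float]]
                 ) -> Tuple[Dict[str, float], List[str]]:
    """Average each synset's evaluator scores, then sort by that average."""
    average = {sid: sum(vals) / len(vals)
               for sid, vals in scores_by_synset.items()}
    # Descending order: the most positive (or most negative) synset first.
    ranked = sorted(average, key=lambda sid: average[sid], reverse=True)
    return average, ranked


# Invented positivity scores from three evaluators (values in [0, 1]).
group_scores = {
    "s1": [1.0, 0.75, 1.0],
    "s2": [0.0, 0.25, 0.0],
    "s3": [0.5, 0.5, 0.5],
}
average, ranked = gold_ranking(group_scores)
print(ranked)  # ['s1', 's3', 's2']
```

Synsets with equal averages remain tied in the gold standard, which is why the evaluation measure has to handle ties explicitly.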

The effectiveness measure

The p-normalized Kendall τ distance

$$\tau_p = \frac{n_d + p \cdot n_u}{Z}$$

0 ≤ τp ≤ 1; smaller is better

nd: the number of discordant pairs

nu: the number of pairs ordered (i.e., not tied) in the gold standard and tied in the prediction

Z: the total number of pairs

p = 1/2

For example, if the two rankings agree completely, nd = nu = 0 and therefore τp = 0

For example [Source: Wikipedia]

$$\tau_p = \frac{4 + \frac{1}{2} \cdot 0}{10} = 0.4$$
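A small Python sketch of the measure, computing (nd + p · nu) / Z over all pairs of items. Two points are assumptions of this sketch: pairs tied in the gold standard are skipped, and Z counts the pairs that are ordered in the gold standard, which equals the total number of pairs whenever the gold standard has no ties. The five-item data set is a toy example with 4 discordant pairs out of 10.

```python
from itertools import combinations
from typing import Dict


def kendall_tau_p(gold: Dict[str, float], pred: Dict[str, float],
                  p: float = 0.5) -> float:
    """p-normalized Kendall tau distance between two score assignments."""
    n_d = n_u = z = 0
    for x, y in combinations(list(gold), 2):
        g = gold[x] - gold[y]
        if g == 0:
            continue  # pair tied in the gold standard: ignored (assumption)
        z += 1
        q = pred[x] - pred[y]
        if q == 0:
            n_u += 1  # ordered in the gold standard, tied in the prediction
        elif (g > 0) != (q > 0):
            n_d += 1  # the two rankings order this pair in opposite ways
    if z == 0:
        raise ValueError("the gold standard orders no pair")
    return (n_d + p * n_u) / z


# Toy example: five items ranked by two criteria (rank positions as scores).
gold = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5}
pred = {"A": 3, "B": 4, "C": 1, "D": 2, "E": 5}
print(kendall_tau_p(gold, pred))  # 0.4

all_tied = {k: 0 for k in gold}
print(kendall_tau_p(gold, gold))      # 0.0, identical rankings
print(kendall_tau_p(gold, all_tied))  # 0.5, every pair is tied in the prediction
```

A prediction that ties everything is charged p = 1/2 per pair, so it scores 0.5, halfway between a perfect ranking (0) and a fully reversed one (1).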

Results

Conclusion

We argue that the binary relation Si → Sk is structurally akin to the relation between hyperlinked Web pages, and thus lends itself to PageRank analysis.

This paper thus presents a proof-of-concept of the model, and the results of experiments support our intuitions.

References

eXtended WordNet: http://xwn.hlt.utdallas.edu/

SentiWordNet (registration required): http://sentiwordnet.isti.cnr.it/

The MICRO-WNOP Corpus: http://www.unipv.it/wnop/
