PageRanking WordNet Synsets: An Application to Opinion Mining
Andrea Esuli and Fabrizio Sebastiani
Istituto di Scienza e Tecnologie dell’Informazione, Consiglio Nazionale delle Ricerche
Via Giuseppe Moruzzi, 1 – 56124 Pisa, Italy
fandrea.esuli,[email protected]
Advisor: Hsin-Hsi Chen
Speaker: Yong-Sheng Lo
Date: 2007/10/31
ACL - 2007
Introduction
Recent years have witnessed an explosion of work on opinion mining
An important part of this research has been the work on the automatic determination of the opinion-related properties (ORPs) of terms
ORPs = positive, negative, or neutral polarity
Related work 1/2
Traditional work
Determine the polarity of adjectives: Hatzivassiloglou and McKeown (1997); Kamps et al. (2004)
Determine the polarity of generic terms: Turney and Littman (2003); Kim and Hovy (2004); Takamura et al. (2005)
Related work 2/2
Recent work: using glosses from online dictionaries
Extend a set of terms of known positivity/negativity: Andreevskaia and Bergler (2006a)
Determine the ORPs of generic terms: Esuli and Sebastiani (2005; 2006a)
Determine the ORPs of WordNet synsets (synonym sets): Esuli and Sebastiani (2006b)
In this work
We have investigated the applicability of a random walk model to the problem of ranking synsets (synonym sets) according to positivity and negativity.
Using PageRank: it needs nodes and links
Using eXtended WordNet version 2.0-1.1, which is based on WordNet version 2.0
eXtended WordNet (XWN)
The goal of this project is to develop a tool that takes as input the current or future versions of WordNet and automatically generates an eXtended WordNet that provides several important enhancements intended to remedy the present limitations of WordNet.
XWN has 4 files: adj.xml, adv.xml, noun.xml, verb.xml
How is the information represented in XWN?
Graph generation for PageRank
The directed graph G = (N, L)
N (nodes): the set of all WordNet synsets (115,424 synsets)
L (links): there is a link from synset Si to synset Sk (Si → Sk) iff the gloss of Si contains at least one term belonging to Sk
For example: the gloss of Si contains “by a small margin; …” and Sk contains “small, …”, so Si → Sk
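The link rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the synset identifiers, the toy glosses, and the helper name `build_links` are hypothetical, glosses are assumed to be already tokenized, and nothing here reads the actual XWN files.

```python
from collections import defaultdict

def build_links(synset_terms, glosses):
    """Return the set of links (i, k) such that the gloss of synset i
    contains at least one term belonging to synset k."""
    # Invert synset membership: term -> ids of the synsets it belongs to.
    term_to_synsets = defaultdict(set)
    for sid, terms in synset_terms.items():
        for term in terms:
            term_to_synsets[term].add(sid)

    links = set()
    for sid, gloss_terms in glosses.items():
        for term in gloss_terms:
            # Terms that belong to no synset simply produce no link.
            for target in term_to_synsets.get(term, ()):
                links.add((sid, target))
    return links

# Toy data (hypothetical, not taken from XWN).
synset_terms = {
    "s_narrowly": ["narrowly"],
    "s_small": ["small", "little"],
    "s_margin": ["margin"],
}
glosses = {
    "s_narrowly": ["by", "a", "small", "margin"],
    "s_small": ["limited", "in", "size"],
    "s_margin": ["the", "boundary", "line"],
}
links = build_links(synset_terms, glosses)
```

On this toy input the only links are from `s_narrowly` to `s_small` and to `s_margin`, mirroring the "by a small margin" example.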
The PageRank algorithm 1/4
Input: the row-normalized adjacency matrix W
Wo is the |N| x |N| adjacency matrix of G, where |N| = number of synsets
Wo[i,j] = 1 iff there is a link from node i to node j
W[i,j] = 1 / |F(i)| if Wo[i,j] = 1, else W[i,j] = 0
B(i) = { nj | Wo[j,i] = 1 }: the set of the backward neighbours of ni (the nodes that link to node i)
F(i) = { nj | Wo[i,j] = 1 }: the set of the forward neighbours of ni (the nodes that node i links to)
Output: a vector a = [a1, …, a|N|]
ai represents the score of node ni, i = 1, …, |N|
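As a small illustration of these definitions, the following sketch derives W, F(i) and B(i) from a set of links. Nodes are assumed to be numbered 0 to n-1 and a dense list-of-lists matrix is used for readability; with 115,424 synsets a sparse representation would be needed in practice. The function name `row_normalize` is hypothetical.

```python
def row_normalize(n, links):
    """Build W, F and B for a graph with nodes 0..n-1 and directed links (i, j)."""
    forward = {i: set() for i in range(n)}   # F(i): nodes that i links to
    backward = {i: set() for i in range(n)}  # B(i): nodes that link to i
    for i, j in links:
        forward[i].add(j)
        backward[j].add(i)

    # W[i][j] = 1 / |F(i)| if there is a link i -> j, else 0.
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in forward[i]:
            W[i][j] = 1.0 / len(forward[i])
    return W, forward, backward

links = {(0, 1), (0, 2), (1, 2)}
W, F, B = row_normalize(3, links)
```

Each row of W with at least one outgoing link sums to 1; the row of a node with no forward neighbours (node 2 here) stays all zeros.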
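The PageRank algorithm 2/4
PageRank iteratively computes the vector a (Equation 1):
a_i^{(k)} = \alpha \sum_{j \in B(i)} \frac{a_j^{(k-1)}}{|F(j)|} + (1 - \alpha) e_i
The value of ei amounts to an internal source of score for node i
It is constant (= 1/|N|) across the iterations and independent of the scores of its backward neighbours
In vectorial form, Equation 1 can be written as
a^{(k)} = \alpha \, a^{(k-1)} W + (1 - \alpha) e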
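The iteration can be sketched as follows, assuming Equation 1 has the usual form a_i = α · Σ_{j ∈ B(i)} a_j / |F(j)| + (1 − α) · e_i. The default α = 0.85, the starting vector a = e, and the L1-based stopping rule are illustrative assumptions, not values taken from the slides.

```python
def pagerank(forward, e, alpha=0.85, tol=1e-10, max_iter=1000):
    """Iterate Equation 1 until the scores stop changing.

    forward maps node j to the list F(j) of nodes it links to;
    e is the vector of internal sources; nodes are numbered 0..len(e)-1.
    """
    n = len(e)
    # Derive B(i) from F(j): j is a backward neighbour of every node it links to.
    backward = {i: [] for i in range(n)}
    for j, targets in forward.items():
        for i in targets:
            backward[i].append(j)

    a = list(e)  # assumed starting point
    for _ in range(max_iter):
        new_a = [
            alpha * sum(a[j] / len(forward[j]) for j in backward[i])
            + (1 - alpha) * e[i]
            for i in range(n)
        ]
        # Assumed termination condition: L1 change below tol.
        if sum(abs(x - y) for x, y in zip(new_a, a)) < tol:
            return new_a
        a = new_a
    return a

# A 3-node cycle with uniform e: every node keeps the score 1/3.
scores = pagerank({0: [1], 1: [2], 2: [0]}, [1 / 3] * 3)
```

With a single seed, for example `pagerank({0: [1, 2], 1: [2]}, [1.0, 0.0, 0.0], alpha=0.5)`, the score flows from node 0 to the nodes it reaches, which is the behaviour the seeded setups rely on.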
The PageRank algorithm 3/4
In this work
The ei values are used as internal sources of a given ORP (positivity or negativity) for node i, by attributing a null ei value to all but a few “seed” synsets known to possess that ORP
Simple procedure: PageRank will thus make the ORP flow from the seed synsets, at a rate constant throughout the iterations, into other synsets along the Si → Sk relation, until a stable state is reached; the final ai values can be used to rank the synsets in terms of that ORP.
The PageRank algorithm 4/4
Run 1:
Run 2:
Why PageRank? 1/2
If terms contained in synset Sk occur in the glosses of many positive synsets, and if the positivity scores of these synsets are high, then it is likely that Sk is itself positive (the same happens for negativity).
This justifies the summation of Equation 1.
Why PageRank? 2/2
If the gloss of a positive synset that contains a term in synset Sk also contains many other terms, then this is a weaker indication that Sk is itself positive
This justifies dividing by |F(j)| in Equation 1
The ranking resulting from the algorithm needs to be biased in favour of a specific ORP
Synsets already known to possess the ORP receive higher scores
This justifies the presence of the ei factor in Equation 1
Full procedure 1/2
(1) The graph G is generated
Numbers, articles and prepositions occurring in the glosses are discarded, since they can be assumed to carry no positivity or negativity
This leaves only nouns, adjectives, verbs, and adverbs
(2) The row-normalized adjacency matrix W of G is derived
The graph G is “pruned” by removing “self-loops”
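The two clean-up operations in steps (1) and (2) might look like the sketch below. The part-of-speech tag names and the `(term, tag)` input format are hypothetical; the slides do not say how the glosses are tagged.

```python
# Hypothetical tag names for the categories that are discarded.
DISCARDED_POS = {"number", "article", "preposition"}

def content_terms(tagged_gloss):
    """Keep only the gloss terms whose tag is not in DISCARDED_POS."""
    return [term for term, pos in tagged_gloss if pos not in DISCARDED_POS]

def prune_self_loops(links):
    """Drop every link that goes from a synset to itself."""
    return {(i, k) for i, k in links if i != k}

tagged = [
    ("by", "preposition"),
    ("a", "article"),
    ("small", "adjective"),
    ("margin", "noun"),
]
kept = content_terms(tagged)
pruned = prune_self_loops({(0, 0), (0, 1), (1, 1), (2, 0)})
```

Here `kept` retains only "small" and "margin", and `pruned` retains only the links between distinct synsets.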
Full procedure 2/2
(3) PageRank setting
The ei values are loaded into the e vector
All synsets other than the seed synsets of renowned positivity (negativity) are given a value of 0
We experiment with several different versions of the e vector and several different values of α
(4) PageRank is executed using W and e, iterating until a predefined termination condition is reached
(5) We rank all the synsets of WordNet in descending order of their ai score
The process is run twice, once for positivity and once for negativity
Setup (e) 1/2
e1 (baseline): all values uniformly set to 1/|N|
e2: uniform non-null ei scores assigned to the synsets that contain the adjective good (bad); null scores for all other synsets
e3: uniform non-null ei scores assigned to the synsets that contain at least one of the seven “paradigmatic” positive (negative) adjectives; null scores for all other synsets
Positive: good, nice, excellent, positive, fortunate, correct, superior
Negative: bad, nasty, poor, negative, unfortunate, wrong, inferior
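The e1, e2 and e3 vectors can be built as in the sketch below. Normalizing the non-null entries so that they sum to 1 is an assumption made here for consistency with e1; the slides only say the scores are uniform and non-null. The toy synsets and the function names are hypothetical.

```python
POSITIVE_SEEDS = ["good", "nice", "excellent", "positive",
                  "fortunate", "correct", "superior"]
NEGATIVE_SEEDS = ["bad", "nasty", "poor", "negative",
                  "unfortunate", "wrong", "inferior"]

def uniform_e(n):
    """e1: every synset gets the same internal source 1/|N|."""
    return [1.0 / n] * n

def seed_e(synsets, seeds):
    """e2 / e3: uniform non-null score for synsets containing a seed term."""
    seed_set = set(seeds)
    hits = [bool(seed_set & set(terms)) for terms in synsets]
    count = sum(hits)
    if count == 0:
        raise ValueError("no synset contains any seed term")
    # Assumption: non-null entries are scaled so that the vector sums to 1.
    return [1.0 / count if hit else 0.0 for hit in hits]

# Toy synsets (hypothetical), each given as its list of terms.
synsets = [["good", "well"], ["bad"], ["nice", "decent"], ["table"]]

e1 = uniform_e(len(synsets))
e2_pos = seed_e(synsets, ["good"])
e3_pos = seed_e(synsets, POSITIVE_SEEDS)
e3_neg = seed_e(synsets, NEGATIVE_SEEDS)
```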
Setup (e) 2/2
e4: the score assigned to a synset (its ei) is proportional to the positivity (negativity) score assigned to it by SentiWordNet, with all entries summing to 1
Using SentiWordNet release 1.0
SentiWordNet is a lexical resource in which each WordNet synset is given a positivity score, a negativity score, and a neutrality score.
e5: same as e4, but using SentiWordNet release 1.1
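For e4 and e5, "proportional, with all entries summing to 1" amounts to dividing each score by the total. A minimal sketch follows; the input scores are made-up numbers, not SentiWordNet values, and `proportional_e` is a hypothetical helper name.

```python
def proportional_e(scores):
    """Scale non-negative per-synset scores so that they sum to 1."""
    total = sum(scores)
    if total == 0:
        raise ValueError("all scores are zero, cannot normalize")
    return [s / total for s in scores]

# Hypothetical positivity scores for four synsets.
positivity = [0.75, 0.0, 0.5, 0.25]
e4 = proportional_e(positivity)
```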
SentiWordNet Esuli and Sebastiani
LREC-06
SentiWordNet is a lexical resource for opinion mining
SentiWordNet assigns to each synset of WordNet three sentiment scores: positivity, negativity, objectivity
The benchmark 1/4
Micro-WNOp corpus (Cerini et al., 2007)
It consists of a set of 1,105 WordNet synsets, each of which was manually assigned scores
The corpus is divided into three parts:
Common: 110 synsets which all the evaluators evaluated by working together, so as to align their evaluation criteria.
Group1: 496 synsets which were each independently evaluated by three evaluators.
Group2: 499 synsets which were each independently evaluated by the other two evaluators.
The benchmark 2/4
To ensure the creation of a corpus composed of synsets which are relevant to the opinion topic, it was generated by randomly selecting 100 positive + 100 negative + 100 objective terms from the General Inquirer (GI) lexicon (Turney and Littman, 2003) and including all the synsets that contained at least one such term, without paying attention to part of speech.
The benchmark 3/4
How is the information represented in the Micro-WNOp corpus?
Each score ranges from 0 to 1
The benchmark 4/4
In this work
We obtain the positivity (negativity) ranking from Micro-WNOp by averaging the positivity (negativity) scores assigned by the evaluators of each group into a single score, and by sorting the synsets according to the resulting score.
Group 1 is used as a validation set, in order to tune α
Group 2 is used as a test set
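Deriving the gold ranking from the evaluators' scores can be sketched as below. The synset identifiers and scores are invented for illustration, and `gold_ranking` is a hypothetical helper name.

```python
def gold_ranking(scores_by_synset):
    """Average each synset's evaluator scores, then sort in descending order."""
    averaged = {
        sid: sum(values) / len(values)
        for sid, values in scores_by_synset.items()
    }
    ranking = sorted(averaged, key=lambda sid: averaged[sid], reverse=True)
    return ranking, averaged

# Hypothetical positivity scores from two evaluators.
group = {
    "syn_a": [1.0, 0.75],
    "syn_b": [0.0, 0.25],
    "syn_c": [0.5, 0.5],
}
ranking, averaged = gold_ranking(group)
```

Synsets with equal averages are tied in the gold standard, so keeping the `averaged` scores alongside the sorted list is useful when a tie-aware measure is computed later.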
The effectiveness measure
The p-normalized Kendall τ distance
\tau_p = \frac{n_d + p \cdot n_u}{Z}
0 ≤ τp ≤ 1; smaller is better
For example, if the two rankings agree completely: nd = nu = 0
nd: the number of discordant pairs
nu: the number of pairs ordered (i.e., not tied) in the gold standard and tied in the prediction
Z: the total number of pairs
p = 1/2
For example [source: Wikipedia]
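A minimal sketch of the measure, computed as (nd + p · nu) / Z. One assumption is made explicit here: Z counts only the pairs that are ordered (not tied) in the gold standard, which keeps the value within [0, 1]; pairs tied in the gold standard are skipped. Both inputs are dictionaries from item to score, and the function name is hypothetical.

```python
from itertools import combinations

def kendall_tau_p(gold, pred, p=0.5):
    """p-normalized Kendall tau distance between two score assignments."""
    n_d = n_u = z = 0
    for x, y in combinations(list(gold), 2):
        g = gold[x] - gold[y]
        if g == 0:
            continue  # tied in the gold standard: pair not counted (assumption)
        z += 1
        d = pred[x] - pred[y]
        if d == 0:
            n_u += 1  # ordered in gold, tied in prediction
        elif (g > 0) != (d > 0):
            n_d += 1  # ordered one way in gold, the other way in prediction
    if z == 0:
        raise ValueError("gold standard has no ordered pairs")
    return (n_d + p * n_u) / z

gold = {"a": 3, "b": 2, "c": 1}
same = kendall_tau_p(gold, {"a": 3, "b": 2, "c": 1})
reversed_order = kendall_tau_p(gold, {"a": 1, "b": 2, "c": 3})
all_tied = kendall_tau_p(gold, {"a": 0, "b": 0, "c": 0})
```

Identical rankings give 0, a fully reversed ranking gives 1, and a prediction that ties everything gives p.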
Results
Conclusion
We argue that the binary relation Si → Sk is structurally akin to the relation between hyperlinked Web pages, and thus lends itself to PageRank analysis.
This paper thus presents a proof-of-concept of the model, and the results of experiments support our intuitions.
Reference
eXtended WordNet: http://xwn.hlt.utdallas.edu/
SentiWordNet (registration required): http://sentiwordnet.isti.cnr.it/
The MICRO-WNOP Corpus: http://www.unipv.it/wnop/