AUTOMATIC GENERATION OF THE DOMAIN-SPECIFIC SENTIMENT RUSSIAN DICTIONARIES Alina Dubatovka / SPbSU Yurii Kurochkin / Yandex Elena Mikhailova / SPbSU Dialogue 2016, Moscow, June 1-4, 2016


Goals

• Automatic extraction of sentiment words

• Automatic polarity detection

• Unsupervised


Methodology

• Hatzivassiloglou, McKeown 1997

– "Tasty and healthy breakfast"

– "Cheap but nice hotel"

• The better a node is connected to other "positive" nodes and the worse it is connected to "negative" ones, the more positive it is


Graph builder

• Pattern: (ADV | NEG)* ADJ ( ,? (AND | BUT)? (ADV | NEG)* ADJ )+

• AND – the conjunction "and"

• BUT – one of the adversative conjunctions ("but", "instead", "however", "nevertheless")

• NEG – negation

• ADV – an adverb of measure and degree ("very", "quite", "too", "completely")

• ADJ – adjective
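
The pattern above can be tried out on a sequence of tags. The sketch below is only an illustration: it assumes the text has already been tagged upstream, that the grouping of the pattern is as reconstructed here, and that the comma gets its own tag, called COMMA (a hypothetical name that does not appear on the slides).

```python
import re

# Assumed reading of the slide pattern:
#   (ADV | NEG)* ADJ ( ,? (AND | BUT)? (ADV | NEG)* ADJ )+
# Tags are joined with single spaces; COMMA is a hypothetical tag for ",".
CHAIN = re.compile(
    r"\b(?:(?:ADV|NEG) )*ADJ"
    r"(?: (?:COMMA )?(?:(?:AND|BUT) )?(?:(?:ADV|NEG) )*ADJ)+\b"
)


def find_chains(tags):
    """Return every tag subsequence that matches the adjective-chain pattern."""
    text = " ".join(tags)
    return [match.group(0).split() for match in CHAIN.finditer(text)]


# "Tasty, plentiful but not very varied and expensive breakfast"
tags = ["ADJ", "COMMA", "ADJ", "BUT", "NEG", "ADV", "ADJ", "AND", "ADJ", "NOUN"]
print(find_chains(tags))
```

A chain needs at least two adjectives, so a lone adjective before a noun produces no match.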


Example

• "Tasty, plentiful but not very varied and expensive breakfast"

• Positive links: (tasty, plentiful), (tasty, varied), (plentiful, varied)

• Negative links: (tasty, expensive), (plentiful, expensive), (varied, expensive)
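
One rule that reproduces the links in this example is sketched below. It is an assumption inferred from the example, not a rule stated on the slides: every adversative conjunction flips the orientation of the text that follows it, a negation flips only the next adjective, and two adjectives get a positive link when their orientations agree. The tag names, including COMMA, are hypothetical.

```python
from itertools import combinations


def extract_links(tagged):
    """Map each adjective pair in one chain to +1 (positive) or -1 (negative)."""
    clause_sign = 1   # flipped by every BUT
    negated = False   # set by NEG, consumed by the next adjective
    oriented = []     # (adjective, orientation relative to the chain start)
    for word, tag in tagged:
        if tag == "BUT":
            clause_sign = -clause_sign
        elif tag == "NEG":
            negated = not negated
        elif tag == "ADJ":
            oriented.append((word, -clause_sign if negated else clause_sign))
            negated = False
        # ADV, AND and COMMA leave the orientation unchanged
    return {(a, b): sa * sb for (a, sa), (b, sb) in combinations(oriented, 2)}


example = [
    ("tasty", "ADJ"), (",", "COMMA"), ("plentiful", "ADJ"), ("but", "BUT"),
    ("not", "NEG"), ("very", "ADV"), ("varied", "ADJ"), ("and", "AND"),
    ("expensive", "ADJ"),
]
for pair, sign in extract_links(example).items():
    print(pair, "positive" if sign > 0 else "negative")
```

On this input the three pairs without "expensive" come out positive and the three pairs with "expensive" come out negative, matching the lists above.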


Particle “not” and prefix “un-”

[Figure: two graph fragments with the nodes Good (хороший), Pleasant (приятный), Free (бесплатный) and Big (большой). The first fragment also has a separate node Unpleasant (неприятный) and carries the labels "1486; -40" and "6; 0"; the second has no Unpleasant node and carries the labels "1556; -113" and "6; 0".]

Graph Analyzer

• Initialization

• Weight of the graph edges

– $\mathrm{weight}(word_1, word_2) = \#(word_1 \text{ AND } word_2) - K \cdot \#(word_1 \text{ BUT } word_2)$

• Distance to the final set

– The heaviest edge

– The sum of the weights of edges
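
A minimal sketch of the edge weight and the two distance variants listed above. The function names, the dictionary layout keyed by `frozenset` pairs, and the reading of "heaviest edge" as the largest single weight are assumptions made for illustration; the slides do not specify them.

```python
def edge_weight(and_count, but_count, k=1.0):
    """weight = #(w1 AND w2) - K * #(w1 BUT w2)."""
    return and_count - k * but_count


def build_weights(and_counts, but_counts, k=1.0):
    """Combine co-occurrence counts keyed by frozenset({w1, w2}) into weights."""
    pairs = set(and_counts) | set(but_counts)
    return {
        pair: edge_weight(and_counts.get(pair, 0), but_counts.get(pair, 0), k)
        for pair in pairs
    }


def distance_to_set(word, final_set, weights, mode="sum"):
    """Score a word against an already classified set of words.

    mode="heaviest" uses the largest single edge weight (assumed reading);
    mode="sum" adds up all edge weights between the word and the set.
    """
    edges = [
        weights[frozenset((word, other))]
        for other in final_set
        if other != word and frozenset((word, other)) in weights
    ]
    if not edges:
        return 0.0
    return max(edges) if mode == "heaviest" else sum(edges)


# Toy counts, not taken from the paper.
and_counts = {frozenset(("clean", "cosy")): 10, frozenset(("clean", "bright")): 5}
but_counts = {frozenset(("clean", "cosy")): 3}
weights = build_weights(and_counts, but_counts, k=2)
print(distance_to_set("clean", {"cosy", "bright"}, weights, mode="sum"))       # 9
print(distance_to_set("clean", {"cosy", "bright"}, weights, mode="heaviest"))  # 5
```

Raising K makes each "but" co-occurrence count more heavily against a pair, so the same counts can yield a negative weight at a larger K.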


Description of experiments

• 259,023 depersonalized unlabeled reviews

• Dataset size – 660 MB

• Hotel domain

• Texts by real users

– Misspellings

– Grammatical errors

– Informal words

– Unrelated information concerning flights, excursions, places of interest, etc.


“Large” dictionaries

Dictionary                                     Positive  Negative  Neutral  Total
Algorithm without removing the "un-" prefix        5252      2815        -   8067
Algorithm after removing the "un-" prefix          4936      2695        -   7631
“Large” dictionary                                 1948      1946     4951   8845

“Small” dictionaries

Dictionary                                     Positive dictionary  Negative dictionary  Total
“Manual” dictionary                                            173                  127    300
Algorithm without removing the “un-” prefix                    164                   74    238
Algorithm after removing the “un-” prefix                      163                   83    246

Results without removing the "un-" prefix

Metric                             Positive dictionary  Negative dictionary  Total dictionary
Recall                                           0.806                0.684             0.754
Precision                                        0.309                0.521             0.381
Precision without neutral words                  0.77                 0.827             0.796
F1-measure                                       0.447                0.591             0.506
F1-measure without neutral words                 0.788                0.749             0.774
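
The F1 rows can be cross-checked against the precision and recall rows. The sketch below assumes the standard definition of F1 as the harmonic mean of precision and recall, which the slides do not spell out; the reported values are consistent with it.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs from the table without removing the "un-" prefix
reported = {
    "positive": (0.309, 0.806),
    "negative": (0.521, 0.684),
    "total": (0.381, 0.754),
}
for name, (p, r) in reported.items():
    print(name, round(f1(p, r), 3))
```

This prints 0.447, 0.591 and 0.506, the values in the F1-measure row.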

Results after removing the "un-" prefix

Metric                             Positive dictionary  Negative dictionary  Total dictionary
Recall                                           0.793                0.683             0.746
Precision                                        0.314                0.502             0.38
Precision without neutral words                  0.779                0.82              0.799
F1-measure                                       0.45                 0.579             0.504
F1-measure without neutral words                 0.786                0.745             0.772

Precision@n for positive dictionary

[Figure: precision@n plots; y-axis from 0.3 to 1, x-axis n from 1 to 5201]
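
For reference, precision@n is commonly computed as the share of the top n ranked words that also appear in a reference dictionary. The sketch below uses that standard definition; the function name and the toy word lists are illustrative and not taken from the paper.

```python
def precision_at_n(ranked_words, reference, n):
    """Fraction of the first n ranked words that occur in the reference set."""
    top = ranked_words[:n]
    if not top:
        return 0.0
    return sum(word in reference for word in top) / len(top)


# Toy ranking, ordered from the most to the least confident prediction.
ranked = ["good", "pleasant", "big", "clean"]
reference = {"good", "pleasant", "clean"}
print(precision_at_n(ranked, reference, 1))  # 1.0
print(precision_at_n(ranked, reference, 4))  # 0.75
```

Plotting this value for n = 1, 201, 401, ... gives a curve of the kind shown on this slide.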

Precision@n for negative dictionary

[Figure: precision@n plots; y-axis from 0.5 to 1, x-axis n from 1 to 2801]

Dependence on K

[Figure: four plots, each with the two series "With neutral words" and "Without neutral words". Two plots have an x-axis from 1 to 10 and a y-axis from 0.5 to 0.9; one has an x-axis from 0.76 to 0.88 and a y-axis from 0 to 1; one has an x-axis from 0.75 to 0.95 and a y-axis from 0.3 to 0.9.]

St. Petersburg University

spbu.ru

Thanks!