Valeriia Mozharova and Natalia Loukachevitch - Combining Knowledge and CRF-based Approach to Named...

19
Combining Knowledge and CRF-based Approach to Named Entity Recognition in Russian Mozharova V. A., Loukachevitch N. V. Lomonosov Moscow State University

Transcript of Valeriia Mozharova and Natalia Loukachevitch - Combining Knowledge and CRF-based Approach to Named...

Page 1: Valeriia Mozharova and  Natalia Loukachevitch - Combining Knowledge and CRF-based Approach to  Named Entity Recognition in Russian

Combining Knowledge and CRF-based Approach

to Named Entity Recognition in Russian

Mozharova V. A.,Loukachevitch N. V.

Lomonosov Moscow State University

Page 2: Valeriia Mozharova and  Natalia Loukachevitch - Combining Knowledge and CRF-based Approach to  Named Entity Recognition in Russian

2

Named entity recognition taskA named entity is a word or a word collocation that means a specific object or an event and distinguishes it from other similar objects.

1. Президент [Владимир Путин] PER 17 декабря провел традиционную пресс-конференцию перед Новым Годом.2. Студенты и Татьяны получат эксклюзивный пропуск на Главный каток страны.

Methods:1. Machine learning2. Rule-based approach 3. Combination

Page 3: Valeriia Mozharova and  Natalia Loukachevitch - Combining Knowledge and CRF-based Approach to  Named Entity Recognition in Russian

3

Related works• English• A lot of works• Evaluations (MUC, CoNLL, ACE …)

• Russian• Machine learning (CRF)

• (Antonova, Soloviev, 2013; Podobryaev, 2013; Gareev, 2013)• Most works on own collections

• Rule-based• (Trofimov, 2014)• Open collection “Person-1000”

Page 4: Valeriia Mozharova and  Natalia Loukachevitch - Combining Knowledge and CRF-based Approach to  Named Entity Recognition in Russian

4

OutlineOur approach: CRF-based Named Entity recognitionFeatures:• Token-based• Lexicon-based• Context-based

Labeling representation• IO-scheme• BIO-scheme

Experiments on open collections• Persons-1000• Persons-1111-F (Eastern names)

Page 5: Valeriia Mozharova and  Natalia Loukachevitch - Combining Knowledge and CRF-based Approach to  Named Entity Recognition in Russian

5

CRF-based machine learningCRF is a tool for labeling sequential data.• CRF++ (open source implementation)

Preprocessing• Morphological analyzer (POS-tagging, lemmatization,

gender and grammatical case tagging)

Page 6: Valeriia Mozharova and  Natalia Loukachevitch - Combining Knowledge and CRF-based Approach to  Named Entity Recognition in Russian

6

Scheme of text processing

Text FeatureExtraction:

-token-based-lexicon-based-context-based

NameExtraction

CRF

Page 7: Valeriia Mozharova and  Natalia Loukachevitch - Combining Knowledge and CRF-based Approach to  Named Entity Recognition in Russian

7

Token featuresMost traditional features1. Token initial form (lemma)2. Number of symbols in a token3. Letter case: BigBig, BigSmall, SmallSmall, Fence4. Token type

• part of speech• type of punctuation

5. The presence of a vowel (a binary feature)6. If a token contains a known letter n-gram from a pre-defined set:

• Кузнецов, Матвиенко, Джугашвили• Госдепартамент, Газпром

Page 8: Valeriia Mozharova and  Natalia Loukachevitch - Combining Knowledge and CRF-based Approach to  Named Entity Recognition in Russian

8

Features based on lexiconsWe used vocabularies that store lists of useful expressions (words or phrase) Sources:• Phonebook• Wikipedia• Thesaurus (РуТез)

Single feature for each lexiconExample:

«Набережные[geo2] Челны[geo2]»

Page 9: Valeriia Mozharova and  Natalia Loukachevitch - Combining Knowledge and CRF-based Approach to  Named Entity Recognition in Russian

9

lexicons.Vocabulary Size, objects Clarification Examples

Famous persons 31482 Famous people Владимир ПутинFirst names 2773 First names Василий, Анна, Том

Surnames 66108 Surnames Кузнецов, Грибоедов

Verbs of informing 1729 Verbs that usually occur with persons

высказать, признаться

Companies 33380 Organization names СбербанкCompany types 6774 Organization types организация,

авиафирмаGeography 8969 Geographical objects Балтийское мореEquipment 44094 Devices, equipment, tools устройство, телефон

Page 10: Valeriia Mozharova and  Natalia Loukachevitch - Combining Knowledge and CRF-based Approach to  Named Entity Recognition in Russian

10

Context features and example

Token Lemma Register Token Type

Second Name

Geo Label

В В Small Auxiliary False False NO

России РОССИЯ BigSmall Noun False Geo1 GEOPOLIT

Алиев АЛИЕВ BigSmall Noun Sname1 False PER

третий ТРЕТИЙ Small Numeral False False NO

раз РАЗ Small Auxiliary False False NO

Page 11: Valeriia Mozharova and  Natalia Loukachevitch - Combining Knowledge and CRF-based Approach to  Named Entity Recognition in Russian

11

Expert labeling. Brat annotatiоn tool

Page 12: Valeriia Mozharova and  Natalia Loukachevitch - Combining Knowledge and CRF-based Approach to  Named Entity Recognition in Russian

12

Labeling representationIO-scheme (Inside-Outside)• I - belongs to named entity • O - does not belong to named

entity|C| + 1 classes

Token IO-Labels BIO-labels

Владимир I-PER B-PER

Путин I-PER I-PER

посетил OUTSIDE OUTSIDE

Англию I-GEOPOLIT B-GEOPOLIT

BIO-scheme (Begin-Inside-Outside)• B - named entity beginning• I - named entity continuation• O - not named entity2*|C| + 1 classes

Page 13: Valeriia Mozharova and  Natalia Loukachevitch - Combining Knowledge and CRF-based Approach to  Named Entity Recognition in Russian

13

IO-labeling: aggregation of tokens into named entities

I-PER I-PER I-PER I-PER I-PER I-PER

Person

I-PER I-PER I-PER,Person

I-PER ,Person

I-PER I-PER

I-PERПетр

I-PER I-PERАнне

I-PERPerson Person

I-PERI-PERПетр

I-PERАнне

I-PER

I-PER I-PER I-PER(Person

I-PER I-PER I-PER

Page 14: Valeriia Mozharova and  Natalia Loukachevitch - Combining Knowledge and CRF-based Approach to  Named Entity Recognition in Russian

14

IO-labeling: aggregation of tokens into named entities

I-ORG I-PERимени

OrganizationI-ORG I-PERимени

I-PERX1

I-PERДмитрий

I-PERXn…

PersonI-PER

X1I-PER

ДмитрийI-PER

Xn…

OUTSIDEX1

Person…

PersonOUTSIDE

Xn

Page 15: Valeriia Mozharova and  Natalia Loukachevitch - Combining Knowledge and CRF-based Approach to  Named Entity Recognition in Russian

15

Target metric

intersectionCount is the number of named entities labeled by both: the classier and the expert;

classifierCount is the number of named entities labeled by only the classier;

expertCount is the number of named entities labeled by only the expert.

Page 16: Valeriia Mozharova and  Natalia Loukachevitch - Combining Knowledge and CRF-based Approach to  Named Entity Recognition in Russian

16

Text collections• "Persons-1000" (1000 news documents)

• Russian names: Александр Игнатенко, Алексей Волков

• " Persons-1111F" (1111 news documents)• Eastern names: Абдалла Халаф, Иттё Ито

We additionally labeled:

• Organizations (ORG)

• Media organizations having a specific function of information providing (MEDIA)

• Locations (LOC)

• States and capitals in the role of a state (GEOPOLIT)

Page 17: Valeriia Mozharova and  Natalia Loukachevitch - Combining Knowledge and CRF-based Approach to  Named Entity Recognition in Russian

17

Experiments on Collection “Persons-1000”

NEType

F-score, % IO IO +

rulesBIO

PER 94.95 95.09 96.08ORG 80.03 80.23 83.84LOC 92.60 92.60 94.57Average

89.54 89.67 91.71

NEType

F-score, %IO IO +

rulesBIO

PER 94.95 95.01 95.63ORG 75.90 76.16 80.06MEDIA 87.95 87.95 87.99LOC 84.53 84.53 86.91GEOPOLIT 94.65 94.65 94.50Average 88.21 88.37 89.93

Cross-validation 3:1

Page 18: Valeriia Mozharova and  Natalia Loukachevitch - Combining Knowledge and CRF-based Approach to  Named Entity Recognition in Russian

18

Experiments on collection with Eastern names (Persons-1111F)Person name extraction

“Persons-1000”: cross-validation 3:1“Persons-1111F” : training on “Persons-1000”

Collection F-score, %Rule-based

(Trofimov, 2014)Our system

Pesons-1000 96.62 96.08Persons-1111F 64.43 81.68

Page 19: Valeriia Mozharova and  Natalia Loukachevitch - Combining Knowledge and CRF-based Approach to  Named Entity Recognition in Russian

19

Conclusion• We presented the system for Russian Named Entity

Recognition task using knowledge-based approach together with CRF classifier• We tested our system on two open text collections

“Persons-1000” and “Persons-1111” and compare our results with rule-based system• We compared two labeling schemes for Russian

texts: IO-scheme and BIO-scheme