Valeriia Mozharova and Natalia Loukachevitch - Combining Knowledge and CRF-based Approach to Named...

Combining Knowledge and CRF-based Approach

to Named Entity Recognition in Russian

Mozharova V. A.,Loukachevitch N. V.

Lomonosov Moscow State University

2

Named entity recognition taskA named entity is a word or a word collocation that means a specific object or an event and distinguishes it from other similar objects.

1. Президент [Владимир Путин] PER 17 декабря провел традиционную пресс-конференцию перед Новым Годом.2. Студенты и Татьяны получат эксклюзивный пропуск на Главный каток страны.

Methods:1. Machine learning2. Rule-based approach 3. Combination

3

Related works• English• A lot of works• Evaluations (MUC, CoNLL, ACE …)

• Russian• Machine learning (CRF)

• (Antonova, Soloviev, 2013; Podobryaev, 2013; Gareev, 2013)• Most works on own collections

• Rule-based• (Trofimov, 2014)• Open collection “Person-1000”

4

OutlineOur approach: CRF-based Named Entity recognitionFeatures:• Token-based• Lexicon-based• Context-based

Labeling representation• IO-scheme• BIO-scheme

Experiments on open collections• Persons-1000• Persons-1111-F (Eastern names)

5

CRF-based machine learningCRF is a tool for labeling sequential data.• CRF++ (open source implementation)

Preprocessing• Morphological analyzer (POS-tagging, lemmatization,

gender and grammatical case tagging)

6

Scheme of text processing

Text FeatureExtraction:

-token-based-lexicon-based-context-based

NameExtraction

CRF

7

Token featuresMost traditional features1. Token initial form (lemma)2. Number of symbols in a token3. Letter case: BigBig, BigSmall, SmallSmall, Fence4. Token type

• part of speech• type of punctuation

5. The presence of a vowel (a binary feature)6. If a token contains a known letter n-gram from a pre-defined set:

• Кузнецов, Матвиенко, Джугашвили• Госдепартамент, Газпром

8

Features based on lexiconsWe used vocabularies that store lists of useful expressions (words or phrase) Sources:• Phonebook• Wikipedia• Thesaurus (РуТез)

Single feature for each lexiconExample:

«Набережные[geo2] Челны[geo2]»

9

lexicons.Vocabulary Size, objects Clarification Examples

Famous persons 31482 Famous people Владимир ПутинFirst names 2773 First names Василий, Анна, Том

Surnames 66108 Surnames Кузнецов, Грибоедов

Verbs of informing 1729 Verbs that usually occur with persons

высказать, признаться

Companies 33380 Organization names СбербанкCompany types 6774 Organization types организация,

авиафирмаGeography 8969 Geographical objects Балтийское мореEquipment 44094 Devices, equipment, tools устройство, телефон

10

Context features and example

Token Lemma Register Token Type

Second Name

Geo Label

В В Small Auxiliary False False NO

России РОССИЯ BigSmall Noun False Geo1 GEOPOLIT

Алиев АЛИЕВ BigSmall Noun Sname1 False PER

третий ТРЕТИЙ Small Numeral False False NO

раз РАЗ Small Auxiliary False False NO

11

Expert labeling. Brat annotatiоn tool

12

Labeling representationIO-scheme (Inside-Outside)• I - belongs to named entity • O - does not belong to named

entity|C| + 1 classes

Token IO-Labels BIO-labels

Владимир I-PER B-PER

Путин I-PER I-PER

посетил OUTSIDE OUTSIDE

Англию I-GEOPOLIT B-GEOPOLIT

BIO-scheme (Begin-Inside-Outside)• B - named entity beginning• I - named entity continuation• O - not named entity2*|C| + 1 classes

13

IO-labeling: aggregation of tokens into named entities

I-PER I-PER I-PER I-PER I-PER I-PER

Person

I-PER I-PER I-PER,Person

I-PER ,Person

I-PER I-PER

I-PERПетр

I-PER I-PERАнне

I-PERPerson Person

I-PERI-PERПетр

I-PERАнне

I-PER

I-PER I-PER I-PER(Person

I-PER I-PER I-PER

14

IO-labeling: aggregation of tokens into named entities

I-ORG I-PERимени

OrganizationI-ORG I-PERимени

I-PERX1

I-PERДмитрий

I-PERXn…

PersonI-PER

X1I-PER

ДмитрийI-PER

Xn…

OUTSIDEX1

Person…

PersonOUTSIDE

Xn

15

Target metric

intersectionCount is the number of named entities labeled by both: the classier and the expert;

classifierCount is the number of named entities labeled by only the classier;

expertCount is the number of named entities labeled by only the expert.

16

Text collections• "Persons-1000" (1000 news documents)

• Russian names: Александр Игнатенко, Алексей Волков

• " Persons-1111F" (1111 news documents)• Eastern names: Абдалла Халаф, Иттё Ито

We additionally labeled:

• Organizations (ORG)

• Media organizations having a specific function of information providing (MEDIA)

• Locations (LOC)

• States and capitals in the role of a state (GEOPOLIT)

17

Experiments on Collection “Persons-1000”

NEType

F-score, % IO IO +

rulesBIO

PER 94.95 95.09 96.08ORG 80.03 80.23 83.84LOC 92.60 92.60 94.57Average

89.54 89.67 91.71

NEType

F-score, %IO IO +

rulesBIO

PER 94.95 95.01 95.63ORG 75.90 76.16 80.06MEDIA 87.95 87.95 87.99LOC 84.53 84.53 86.91GEOPOLIT 94.65 94.65 94.50Average 88.21 88.37 89.93

Cross-validation 3:1

18

Experiments on collection with Eastern names (Persons-1111F)Person name extraction

“Persons-1000”: cross-validation 3:1“Persons-1111F” : training on “Persons-1000”

Collection F-score, %Rule-based

(Trofimov, 2014)Our system

Pesons-1000 96.62 96.08Persons-1111F 64.43 81.68

19

Conclusion• We presented the system for Russian Named Entity

Recognition task using knowledge-based approach together with CRF classifier• We tested our system on two open text collections

“Persons-1000” and “Persons-1111” and compare our results with rule-based system• We compared two labeling schemes for Russian

texts: IO-scheme and BIO-scheme

Valeriia Mozharova and Natalia Loukachevitch - Combining Knowledge and CRF-based Approach to Named...

Data & Analytics

Transcript of Valeriia Mozharova and Natalia Loukachevitch - Combining Knowledge and CRF-based Approach to Named...