Valeriia Mozharova and Natalia Loukachevitch - Combining Knowledge and CRF-based Approach to Named...
-
Upload
aist -
Category
Data & Analytics
-
view
159 -
download
0
Transcript of Valeriia Mozharova and Natalia Loukachevitch - Combining Knowledge and CRF-based Approach to Named...
Combining Knowledge and CRF-based Approach
to Named Entity Recognition in Russian
Mozharova V. A.,Loukachevitch N. V.
Lomonosov Moscow State University
2
Named entity recognition taskA named entity is a word or a word collocation that means a specific object or an event and distinguishes it from other similar objects.
1. Президент [Владимир Путин] PER 17 декабря провел традиционную пресс-конференцию перед Новым Годом.2. Студенты и Татьяны получат эксклюзивный пропуск на Главный каток страны.
Methods:1. Machine learning2. Rule-based approach 3. Combination
3
Related works• English• A lot of works• Evaluations (MUC, CoNLL, ACE …)
• Russian• Machine learning (CRF)
• (Antonova, Soloviev, 2013; Podobryaev, 2013; Gareev, 2013)• Most works on own collections
• Rule-based• (Trofimov, 2014)• Open collection “Person-1000”
4
OutlineOur approach: CRF-based Named Entity recognitionFeatures:• Token-based• Lexicon-based• Context-based
Labeling representation• IO-scheme• BIO-scheme
Experiments on open collections• Persons-1000• Persons-1111-F (Eastern names)
5
CRF-based machine learningCRF is a tool for labeling sequential data.• CRF++ (open source implementation)
Preprocessing• Morphological analyzer (POS-tagging, lemmatization,
gender and grammatical case tagging)
6
Scheme of text processing
Text FeatureExtraction:
-token-based-lexicon-based-context-based
NameExtraction
CRF
7
Token featuresMost traditional features1. Token initial form (lemma)2. Number of symbols in a token3. Letter case: BigBig, BigSmall, SmallSmall, Fence4. Token type
• part of speech• type of punctuation
5. The presence of a vowel (a binary feature)6. If a token contains a known letter n-gram from a pre-defined set:
• Кузнецов, Матвиенко, Джугашвили• Госдепартамент, Газпром
8
Features based on lexiconsWe used vocabularies that store lists of useful expressions (words or phrase) Sources:• Phonebook• Wikipedia• Thesaurus (РуТез)
Single feature for each lexiconExample:
«Набережные[geo2] Челны[geo2]»
9
lexicons.Vocabulary Size, objects Clarification Examples
Famous persons 31482 Famous people Владимир ПутинFirst names 2773 First names Василий, Анна, Том
Surnames 66108 Surnames Кузнецов, Грибоедов
Verbs of informing 1729 Verbs that usually occur with persons
высказать, признаться
Companies 33380 Organization names СбербанкCompany types 6774 Organization types организация,
авиафирмаGeography 8969 Geographical objects Балтийское мореEquipment 44094 Devices, equipment, tools устройство, телефон
10
Context features and example
Token Lemma Register Token Type
Second Name
Geo Label
В В Small Auxiliary False False NO
России РОССИЯ BigSmall Noun False Geo1 GEOPOLIT
Алиев АЛИЕВ BigSmall Noun Sname1 False PER
третий ТРЕТИЙ Small Numeral False False NO
раз РАЗ Small Auxiliary False False NO
11
Expert labeling. Brat annotatiоn tool
12
Labeling representationIO-scheme (Inside-Outside)• I - belongs to named entity • O - does not belong to named
entity|C| + 1 classes
Token IO-Labels BIO-labels
Владимир I-PER B-PER
Путин I-PER I-PER
посетил OUTSIDE OUTSIDE
Англию I-GEOPOLIT B-GEOPOLIT
BIO-scheme (Begin-Inside-Outside)• B - named entity beginning• I - named entity continuation• O - not named entity2*|C| + 1 classes
13
IO-labeling: aggregation of tokens into named entities
I-PER I-PER I-PER I-PER I-PER I-PER
Person
I-PER I-PER I-PER,Person
I-PER ,Person
I-PER I-PER
I-PERПетр
I-PER I-PERАнне
I-PERPerson Person
I-PERI-PERПетр
I-PERАнне
I-PER
I-PER I-PER I-PER(Person
I-PER I-PER I-PER
14
IO-labeling: aggregation of tokens into named entities
I-ORG I-PERимени
OrganizationI-ORG I-PERимени
I-PERX1
I-PERДмитрий
I-PERXn…
PersonI-PER
X1I-PER
ДмитрийI-PER
Xn…
OUTSIDEX1
Person…
PersonOUTSIDE
Xn
15
Target metric
intersectionCount is the number of named entities labeled by both: the classier and the expert;
classifierCount is the number of named entities labeled by only the classier;
expertCount is the number of named entities labeled by only the expert.
16
Text collections• "Persons-1000" (1000 news documents)
• Russian names: Александр Игнатенко, Алексей Волков
• " Persons-1111F" (1111 news documents)• Eastern names: Абдалла Халаф, Иттё Ито
We additionally labeled:
• Organizations (ORG)
• Media organizations having a specific function of information providing (MEDIA)
• Locations (LOC)
• States and capitals in the role of a state (GEOPOLIT)
17
Experiments on Collection “Persons-1000”
NEType
F-score, % IO IO +
rulesBIO
PER 94.95 95.09 96.08ORG 80.03 80.23 83.84LOC 92.60 92.60 94.57Average
89.54 89.67 91.71
NEType
F-score, %IO IO +
rulesBIO
PER 94.95 95.01 95.63ORG 75.90 76.16 80.06MEDIA 87.95 87.95 87.99LOC 84.53 84.53 86.91GEOPOLIT 94.65 94.65 94.50Average 88.21 88.37 89.93
Cross-validation 3:1
18
Experiments on collection with Eastern names (Persons-1111F)Person name extraction
“Persons-1000”: cross-validation 3:1“Persons-1111F” : training on “Persons-1000”
Collection F-score, %Rule-based
(Trofimov, 2014)Our system
Pesons-1000 96.62 96.08Persons-1111F 64.43 81.68
19
Conclusion• We presented the system for Russian Named Entity
Recognition task using knowledge-based approach together with CRF classifier• We tested our system on two open text collections
“Persons-1000” and “Persons-1111” and compare our results with rule-based system• We compared two labeling schemes for Russian
texts: IO-scheme and BIO-scheme