Named Entity Recognition and Linking · Sony Ericsson sony erricson, sony ericsson, sony ericson,...

Named Entity Recognition and Linking Dr. SUN Aixin 孙爱欣 School of Computer Science and Engineering NTU, Singapore


Page 1

Named Entity Recognition and Linking

Dr. SUN Aixin 孙爱欣

School of Computer Science and Engineering

NTU, Singapore

Page 2

NER and EL

• Named-entity recognition (NER)

– The task of locating named entities in text and classifying them into pre-defined categories

• names of persons, organizations, locations,

• expressions of time, quantities, monetary values, percentages, etc.

– Example: [Jim]_Person bought 300 shares of [Acme Corp.]_Organization in [2006]_Time.

• Entity linking (EL)

– The task of determining the identity of entities mentioned in text, with reference to a knowledge base.

– Example: “Michael Jordan will give a talk at the conference.” The mention “Michael Jordan” must be resolved to the right knowledge-base entry, for example the basketball player or the machine-learning researcher.
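The two tasks can be pictured as two passes over a sentence. The sketch below is only an illustration: the `Mention` class, the hand-written annotations, and the toy knowledge base `KB` with its entry ids are hypothetical and do not come from the slides.

```python
from dataclasses import dataclass

@dataclass
class Mention:
    text: str    # surface form found in the sentence
    start: int   # character offset where the mention begins
    label: str   # pre-defined category, e.g. Person

sentence = "Jim bought 300 shares of Acme Corp. in 2006."

# NER output (hand-annotated here): locate the spans and classify them.
annotations = [("Jim", "Person"), ("Acme Corp.", "Organization"), ("2006", "Time")]
mentions = [Mention(text, sentence.index(text), label) for text, label in annotations]

# EL input: a toy knowledge base (hypothetical ids) in which one surface form
# maps to several entries, so the linker has to pick one of them.
KB = {
    "Michael Jordan": [
        "Michael_Jordan_(basketball_player)",
        "Michael_I._Jordan_(researcher)",
    ],
}

def candidates(surface: str) -> list:
    """Return the knowledge-base entries a mention could refer to."""
    return KB.get(surface, [])
```

NER stops at `mentions`; EL starts from `candidates("Michael Jordan")` and must choose one entry using the context of the sentence.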

Page 3

NER from Text

• Formal text (newspapers, research articles)

– Lexical features

– Grammatical features

– …

• Social media

– Informal language

– Misspellings

– Grammatical errors

– Self-defined abbreviations

– And many others

Page 4

NER from Social Media

• Domain-specific knowledge in user language

– A collection of the terms that users employ to name entities in a specific domain

– The domain defines what each term means

• Why not general (open-domain) knowledge bases?

– Wikipedia, Freebase, ProBase, …

– What does this term mean: “TCU 2/52”?

• Case studies:

– Extracting mobile phone names from a user forum

– Extracting locations from tweets

Page 5

Samsung Galaxy SIII – real data from Singaporean users

Page 6

Word grouping by Brown clustering

Source: http://www.cs.cmu.edu/~ark/TweetNLP/cluster_viewer.html

Page 7

Dictionary: candidate mobile phone names

Page 8

Dictionary (knowledge) in user language

• Many variants

• Many users do use formal names

• Brand, series, model

• Variants of the same name should appear in similar usage contexts

Brand            User spellings
Apple, HTC, LG   (no brand variations)
Nokia            nokia, nokie, nk
BlackBerry       blackberry, bbry, blackbery, bb, bberry
Samsung          ssg, samseng, sam, samsumg, sammy, sumsung, samsun, sung, samsuck, samsung, samsungs, samung
Sony Ericsson    sony erricson, sony ericsson, sony ericson, sony ericcson, sonyericsson, sony ericssion, sn, sony, sonyeric
Motorola         motorola, moto, motorolla, mot
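A dictionary in user language can be used as a plain lookup table. The sketch below transcribes the variants listed on this slide and inverts them into a spelling-to-brand map; the names `BRAND_VARIANTS`, `VARIANT_TO_BRAND`, and `normalize_brand` are illustrative and not taken from the original work.

```python
# Variant table transcribed from the slide; keys are the formal brand names.
BRAND_VARIANTS = {
    "Nokia": ["nokia", "nokie", "nk"],
    "BlackBerry": ["blackberry", "bbry", "blackbery", "bb", "bberry"],
    "Samsung": ["ssg", "samseng", "sam", "samsumg", "sammy", "sumsung",
                "samsun", "sung", "samsuck", "samsung", "samsungs", "samung"],
    "Sony Ericsson": ["sony erricson", "sony ericsson", "sony ericson",
                      "sony ericcson", "sonyericsson", "sony ericssion",
                      "sn", "sony", "sonyeric"],
    "Motorola": ["motorola", "moto", "motorolla", "mot"],
}

# Brands listed without variations simply map to themselves.
for brand in ("Apple", "HTC", "LG"):
    BRAND_VARIANTS[brand] = [brand.lower()]

# Invert the table: user spelling -> formal brand name.
VARIANT_TO_BRAND = {
    variant: brand
    for brand, variants in BRAND_VARIANTS.items()
    for variant in variants
}

def normalize_brand(token: str):
    """Map a user spelling to its formal brand name, or None if unknown."""
    return VARIANT_TO_BRAND.get(token.strip().lower())
```

For instance, `normalize_brand("Sammy")` returns `"Samsung"`, while an unseen spelling returns `None`, which is where a learned model would have to take over from the dictionary.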

Page 9

Recognize names based on a dictionary in user language

• Generate candidate names based on naming convention

• Recognize true product names from candidate names

• Normalize names based on naming convention
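The three steps above can be sketched as a small pipeline. Everything in this block is an assumption made for illustration: the naming convention (an optional brand word followed by a model token that contains a digit), the tiny `BRANDS` table, and the rule-based filter in `is_product_name`. The slides do not describe how true names are separated from candidates, so the filter is only a placeholder for that step.

```python
import re

# Hypothetical excerpt of a user-language brand dictionary.
BRANDS = {"samsung": "Samsung", "sammy": "Samsung", "ssg": "Samsung",
          "nokia": "Nokia", "nk": "Nokia"}

# Assumed naming convention: optional brand word, then a model token
# that contains at least one digit, e.g. "s3" or "n95".
CANDIDATE = re.compile(r"\b(?:(?P<brand>[a-z]+)\s+)?(?P<model>[a-z]*\d+[a-z]*)\b")

def generate_candidates(post: str):
    """Step 1: generate candidate names that follow the naming convention."""
    return [(m.group("brand"), m.group("model"))
            for m in CANDIDATE.finditer(post.lower())]

def is_product_name(brand, model) -> bool:
    """Step 2 (placeholder rule): keep candidates whose brand word is known."""
    return brand in BRANDS

def normalize(brand, model) -> str:
    """Step 3: rewrite the candidate as formal brand plus upper-case model."""
    return f"{BRANDS[brand]} {model.upper()}"

def extract_phone_names(post: str):
    return [normalize(b, m) for b, m in generate_candidates(post)
            if is_product_name(b, m)]

names = extract_phone_names("my sammy s3 is faster than nk n95, bought for 300")
```

In this toy run the candidate ("for", "300") is generated in step 1 and rejected in step 2, which is the role the recognition step plays in the pipeline.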

Page 10

Named Entity Linking

Page 11

Named Entity Linking

• Local confidence vs collective context

Page 12

Collective Linking

• Collective linking:

– Utilize semantic relatedness to improve linking performance

– e.g. “Wood played at 2006 Masters held in Augusta, Georgia”.

• Semantic relatedness measures

– Jaccard Similarity (JS) of citing article sets

– Entity Embedding Similarity (EES)
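Both relatedness measures reduce to a few lines of code. The sketch below assumes that each entity comes with the set of articles citing it and with an embedding vector; using cosine similarity for EES is an assumption, since the slide does not give the formula.

```python
import math

def jaccard_similarity(citing_a: set, citing_b: set) -> float:
    """JS: overlap of the article sets that cite each of the two entities."""
    union = citing_a | citing_b
    if not union:
        return 0.0
    return len(citing_a & citing_b) / len(union)

def embedding_similarity(vec_a, vec_b) -> float:
    """EES, taken here to be the cosine of the two entity embeddings."""
    dot = sum(x * y for x, y in zip(vec_a, vec_b))
    norm = math.sqrt(sum(x * x for x in vec_a)) * math.sqrt(sum(y * y for y in vec_b))
    return dot / norm if norm else 0.0

# Toy article ids (hypothetical): two of four distinct articles cite both entities.
js = jaccard_similarity({1, 2, 3}, {2, 3, 4})
```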

[Figure: the entities Tiger Woods and 2006 Masters Tournament]

Page 13

Collective Linking: Assumption

• All pairs of linked entities are related:

[Figure: complete-pairwise coherence model over the mentions Wood, 2006 Masters, Augusta, and Georgia]

Local confidence: $\phi(m_i, e_i)$

Global coherence: $\psi(e_i, e_j)$

“Wood played at 2006 Masters held in Augusta, Georgia”
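To make the complete-pairwise assumption concrete, the sketch below scores every joint assignment by summing the local confidence of each mention and the coherence of every entity pair, then keeps the best one. The additive objective, the function name `link_collectively`, and all scores are assumptions for illustration; the exhaustive search is exponential in the number of mentions and is shown only to make the model explicit.

```python
from itertools import combinations, product

def link_collectively(candidates, phi, psi):
    """candidates: mention -> list of entities
    phi: (mention, entity) -> local confidence
    psi: function (e_i, e_j) -> coherence of two entities"""
    mentions = list(candidates)
    best_score, best = float("-inf"), None
    for assignment in product(*(candidates[m] for m in mentions)):
        local = sum(phi[(m, e)] for m, e in zip(mentions, assignment))
        # Complete-pairwise coherence: every pair of linked entities counts.
        coherence = sum(psi(a, b) for a, b in combinations(assignment, 2))
        if local + coherence > best_score:
            best_score, best = local + coherence, dict(zip(mentions, assignment))
    return best

# Toy scores (hypothetical): the ship is locally more likely for "Wood".
candidates = {"Wood": ["USS Wood (ship)", "Tiger Woods"],
              "2006 Masters": ["2006 Masters Tournament"]}
phi = {("Wood", "USS Wood (ship)"): 0.6, ("Wood", "Tiger Woods"): 0.4,
       ("2006 Masters", "2006 Masters Tournament"): 1.0}
related = {frozenset({"Tiger Woods", "2006 Masters Tournament"})}
psi = lambda a, b: 1.0 if frozenset({a, b}) in related else 0.0

links = link_collectively(candidates, phi, psi)
```

With these toy numbers the coherence term outweighs the local preference for the ship, so "Wood" is linked to Tiger Woods.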

Page 14

Are mentioned entities densely connected?

[Figure: entity graph with the nodes Tiger Woods, 2006 Masters Tournament, Augusta, Georgia, and Georgia (U.S. state); a “?” label appears in the graph]

“Wood played at 2006 Masters held in Augusta, Georgia”

Page 16

Is complete-pairwise coherence always necessary?

• Measure the degree of coherence in real datasets

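– Average degree of the entity relatedness graph that consists of the high-weight edges (by the JS or EES measures).

Average degree of a graph with $N$ nodes, by graph form:

Dense (complete graph): $N - 1$

Tree: $2(N - 1)/N$

Chain: $2(N - 1)/N$

Forest: $1$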
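The reference values for each graph form can be checked numerically, and the same helper gives the degree of a thresholded relatedness graph. In this sketch the forest is modelled as disjoint pairs of nodes (an assumption that yields an average degree of 1 for even N), and `threshold` is a hypothetical cut-off for what counts as a high-weight edge.

```python
def average_degree(num_nodes: int, edges) -> float:
    """Each edge contributes to the degree of two nodes."""
    return 2 * len(edges) / num_nodes if num_nodes else 0.0

def dense_edges(n):   # complete graph: N(N-1)/2 edges
    return [(i, j) for i in range(n) for j in range(i + 1, n)]

def chain_edges(n):   # chain (a special tree): N-1 edges
    return [(i, i + 1) for i in range(n - 1)]

def forest_edges(n):  # assumed form: disjoint pairs, N/2 edges for even N
    return [(i, i + 1) for i in range(0, n - 1, 2)]

def relatedness_graph_degree(entities, relatedness, threshold: float) -> float:
    """Average degree when only pairs scoring at least `threshold` are kept."""
    edges = [(a, b) for i, a in enumerate(entities) for b in entities[i + 1:]
             if relatedness(a, b) >= threshold]
    return average_degree(len(entities), edges)

n = 6
degrees = {
    "dense": average_degree(n, dense_edges(n)),
    "chain": average_degree(n, chain_edges(n)),
    "forest": average_degree(n, forest_edges(n)),
}
```

Running `relatedness_graph_degree` on the gold entities of a document, with JS or EES as `relatedness`, is one way to obtain a calculated value to set against these three reference values.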

Page 17

More About Coherence Analysis

Datasets              Degree coherence (theoretical)      Degree coherence (calculated)
                      Forest   Tree/Chain   Dense         Jaccard Sim   EES
Reuters128 (news)     1.00     1.64         5.93          2.13          2.68
ACE2004 (news)        1.00     1.69         7.20          2.83          2.75
MSNBC (news)          1.00     1.83         14.89         4.48          7.08
DBpedia (news)        1.00     1.71         6.60          2.55          2.92
KORE50 (short news)   1.00     1.54         3.44          1.58          1.36
Micro2014 (Tweets)    1.00     1.53         3.33          1.72          1.82
AQUAINT (news)        1.00     1.84         12.82         3.39          4.53

In general, the calculated values lie closer to the expected values of the tree (or chain) form than to those of the dense form.

Page 18

Our Idea: Pair-Linking

• We do not need to look at all the other entities when deriving a linking decision.

• Iteratively resolve one pair of mentions at each step, from the most confident pairs to the less confident ones.

“Wood played at 2006 Masters held in Augusta, Georgia”

Page 19

Pair-Linking: Local confidence + Coherence

• Pairwise confidence

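[Figure: the mentions Wood and 2006 Masters with the candidate entities USS Wood (ship), Wood, Wisconsin (town), Singapore_Masters, and Masters of Horror (movie); each mention-candidate edge carries a local confidence $\phi(m_j, e_j)$]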
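One plausible way to combine local confidence and coherence for a pair of mentions is sketched below. The additive score (the two local confidences plus the relatedness of the two candidates) is an assumption, since the slide does not spell out the formula, and all numbers are made up.

```python
from itertools import product

def pair_confidence(cands_i, cands_j, psi):
    """cands_*: list of (entity, phi) for one mention.
    Returns (score, e_i, e_j) for the best-scoring candidate pair."""
    return max(
        ((phi_i + phi_j + psi(e_i, e_j), e_i, e_j)
         for (e_i, phi_i), (e_j, phi_j) in product(cands_i, cands_j)),
        key=lambda scored: scored[0],
    )

# Hypothetical local confidences for the two mentions in the figure.
wood = [("USS Wood (ship)", 0.5), ("Wood, Wisconsin (town)", 0.3), ("Tiger Woods", 0.4)]
masters = [("2006 Masters Tournament", 0.5), ("Masters of Horror (movie)", 0.3)]
related = {frozenset({"Tiger Woods", "2006 Masters Tournament"}): 0.9}
psi = lambda a, b: related.get(frozenset({a, b}), 0.0)

score, e_wood, e_masters = pair_confidence(wood, masters, psi)
```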

Page 20

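Pair-Linking Example

[Figure: the mentions Wood, 2006 Masters, Augusta, and Georgia with their candidate entities: USS Wood (ship), Tiger Woods, 2006 Masters Tournament, Augusta, Georgia, Augusta University, USS Augusta, Georgia (country), Georgia (U.S. state), and University of Georgia. Two candidate pairs are labelled with the scores 0.9 and 0.7.]

Pair score: $\mathrm{Confidence}_{m_i \to e_i,\; m_j \to e_j}$

“Wood played at 2006 Masters held in Augusta, Georgia”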

Page 21

Pair-Linking Example

[Figure: animation step on the mention and candidate graph for the example sentence, with the scores 0.9 and 0.7]

“Wood played at 2006 Masters held in Augusta, Georgia”

Page 22

Pair-Linking Example

[Figure: animation step on the mention and candidate graph for the example sentence, with the scores 0.9 and 0.7 and a “?” label]

“Wood played at 2006 Masters held in Augusta, Georgia”

Page 23

Pair-Linking Example

[Figure: animation step on the mention and candidate graph for the example sentence, with the scores 0.9 and 0.7 and a “?” label; the candidates Georgia (country), Augusta University, and University of Georgia are set apart from the others]

“Wood played at 2006 Masters held in Augusta, Georgia”

Page 24

Pair-Linking Example

[Figure: animation step on the mention and candidate graph for the example sentence, with the scores 0.9 and 0.7 and a “?” label; the candidates Georgia (country), Augusta University, and University of Georgia are set apart from the others]

“Wood played at 2006 Masters held in Augusta, Georgia”

Page 25

Pair-Linking Example

[Figure: final animation step on the mention and candidate graph for the example sentence, with the scores 0.9 and 0.7 and a “?” label; the candidates Georgia (country), Augusta University, University of Georgia, and USS Wood (ship) are set apart from the others]

“Wood played at 2006 Masters held in Augusta, Georgia”

Page 26

Pair-Linking is Super Fast

• At each step, Pair-Linking only needs the pair with the highest confidence score.

– Use a priority queue to store and retrieve that pair.

– Use early stopping to avoid scanning all possible pairs of candidates.
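The priority queue and the early stop can be sketched as follows. This is a simplified illustration and not the authors' implementation: the additive pair score, the bound `PSI_MAX` on the relatedness score, the lazy handling of stale queue entries, and all toy numbers are assumptions. Every mention is assumed to have at least one candidate.

```python
import heapq
from itertools import combinations

PSI_MAX = 1.0  # assumed upper bound on the relatedness score psi

def best_pair(cands_i, cands_j, psi):
    """Best (score, e_i, e_j); both lists are sorted by phi, descending.
    Early stop: skip candidates whose optimistic score cannot beat the best."""
    best = (float("-inf"), None, None)
    for e_i, phi_i in cands_i:
        if phi_i + cands_j[0][1] + PSI_MAX <= best[0]:
            break
        for e_j, phi_j in cands_j:
            if phi_i + phi_j + PSI_MAX <= best[0]:
                break
            score = phi_i + phi_j + psi(e_i, e_j)
            if score > best[0]:
                best = (score, e_i, e_j)
    return best

def pair_linking(candidates, psi):
    """candidates: mention -> list of (entity, phi). Returns mention -> entity."""
    cands = {m: sorted(c, key=lambda x: -x[1]) for m, c in candidates.items()}
    heap = []
    for m_i, m_j in combinations(cands, 2):
        score, e_i, e_j = best_pair(cands[m_i], cands[m_j], psi)
        heapq.heappush(heap, (-score, m_i, m_j, e_i, e_j))  # max-heap via negation
    linked = {}
    while heap and len(linked) < len(cands):
        _, m_i, m_j, e_i, e_j = heapq.heappop(heap)
        if m_i in linked and m_j in linked:
            continue
        if linked.get(m_i, e_i) != e_i or linked.get(m_j, e_j) != e_j:
            # Stale entry: one mention was linked to another entity meanwhile.
            score, e_i, e_j = best_pair(cands[m_i], cands[m_j], psi)
            heapq.heappush(heap, (-score, m_i, m_j, e_i, e_j))
            continue
        for m, e in ((m_i, e_i), (m_j, e_j)):
            linked[m] = e
            cands[m] = [c for c in cands[m] if c[0] == e]
    for m in cands:  # a lone mention falls back to its top local candidate
        linked.setdefault(m, cands[m][0][0])
    return linked

# Toy scores (hypothetical) for the running example.
candidates = {
    "Wood": [("USS Wood (ship)", 0.5), ("Tiger Woods", 0.4)],
    "2006 Masters": [("2006 Masters Tournament", 0.6)],
    "Augusta": [("USS Augusta", 0.4), ("Augusta, Georgia", 0.5)],
    "Georgia": [("Georgia (country)", 0.5), ("Georgia (U.S. state)", 0.45)],
}
related = {
    frozenset({"Tiger Woods", "2006 Masters Tournament"}): 0.9,
    frozenset({"Augusta, Georgia", "Georgia (U.S. state)"}): 0.9,
    frozenset({"2006 Masters Tournament", "Augusta, Georgia"}): 0.7,
}
psi = lambda a, b: related.get(frozenset({a, b}), 0.0)
links = pair_linking(candidates, psi)
```

In this toy run two pops from the queue are enough to link all four mentions, which illustrates why resolving the most confident pairs first can be much cheaper than scoring every joint assignment.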

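[Figure: the mentions Augusta and Georgia with the candidate entities Augusta, Georgia, USS Augusta, Augusta University, Georgia (U.S. state), Georgia (country), and University of Georgia; one candidate pair is labelled with the score 0.9]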

Page 27

Experiment: 8 benchmark datasets

Dataset      Type             #documents   Avg #words
Reuters128   News             111          136
ACE2004      News             35           375
MSNBC        News             20           544
DBpedia      News             57           29
RSS500       RSS feeds        343          30
KORE50       Short sentence   50           12
Micro2014    Tweets           696          18
AQUAINT      News             50           220

Page 28

Pair-Linking Performance

• Linking accuracy (F1)

• Speed: time per document, in milliseconds
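Both metrics are straightforward to compute. The sketch below is a generic formulation: the slides do not say whether F1 is micro- or macro-averaged, so this version simply pools all mentions, and `link_fn` stands for any hypothetical linker that takes one document.

```python
import time

def linking_f1(gold: dict, predicted: dict) -> float:
    """gold / predicted: mention id -> entity. Unlinked mentions may be absent."""
    correct = sum(1 for m, e in predicted.items() if gold.get(m) == e)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def ms_per_document(link_fn, documents) -> float:
    """Average wall-clock time per document, in milliseconds."""
    start = time.perf_counter()
    for doc in documents:
        link_fn(doc)
    return 1000 * (time.perf_counter() - start) / len(documents)

# One of two predictions is correct and one of three gold links is found.
f1 = linking_f1({"a": "X", "b": "Y", "c": "Z"}, {"a": "X", "b": "W"})
```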


(*) Results on ACE2004, RSS500, and Micro2014 are not shown here.

Page 29

• Contributors

– Phan Cong Minh

– Han Jialong

– Tay Yi

– Li Chenliang

– Yao Yangjie

• Mobile Phone Name Extraction from Internet Forums: A Semi-supervised Approach. Yangjie Yao, Aixin Sun. World Wide Web Journal, 19(5): 783-805, 2016.

• NeuPL: Attention-based Semantic Matching and Pair-Linking for Entity Disambiguation. Minh C. Phan, Aixin Sun, Yi Tay, Jialong Han, Chenliang Li. CIKM 2017.

• Pair-Linking for Collective Entity Disambiguation: Two Could Be Better Than All. Minh C. Phan, Aixin Sun, Yi Tay, Jialong Han, Chenliang Li. arXiv.

• Project demo: https://youtu.be/w3EsALNrKAk