Linking data without common identifiers

42
1 Linking data without common identifiers Semantic Web Meetup, 2011-08-23 Lars Marius Garshol, <[email protected]> http://twitter.com/larsga

description

Talk about the

Transcript of Linking data without common identifiers

Page 1: Linking data without common identifiers

1

Linking data without common identifiers

Semantic Web Meetup, 2011-08-23Lars Marius Garshol, <[email protected]>http://twitter.com/larsga

Page 2: Linking data without common identifiers

2

About me

• Lars Marius Garshol, Bouvet consultant• Worked with semantic technologies

since 1999– mostly Topic Maps

• Before that I worked with XML• Also background from– Opera Software (Unicode support)– open source development (Java/Python)

Page 3: Linking data without common identifiers

3

Agenda

• Linking data – the problem• Record linkage theory• Duke• A real-world example• Usage at Hafslund

Page 4: Linking data without common identifiers

4

The problem

• Data sets from different sources generally don’t share identifiers– names tend to be spelled every which way

• So how can they be linked?

Page 5: Linking data without common identifiers

5

A real-world example

DBPEDIA

Id http://dbpedia.org/resource/Samoa

Name Samoa

Founding date 1962-01-01

Capital Apia

Currency Tala

Area 2831

Leader name Tuilaepa Aiono Sailele Malielegaoi

MONDIAL

Id 17019

Name Western Samoa

Independence 01 01 1962

Capital Apia, Samoa

Population 214384

Area 2860

GDP 415

Page 6: Linking data without common identifiers

6

A difficult problem

• It requires n2 comparisons for n records– a million comparisons for 1000 records, 100

million for 10,000, ...

• Exact string comparison is not enough– must handle misspellings, name variants, etc

• Interpreting the data can be difficult even for a human being– is the address different because there are two

different people, or because the person moved?

• ...

Page 7: Linking data without common identifiers

7

Help from an unexpected quarter

Statisticians to the rescue!

Page 8: Linking data without common identifiers

8

Record linkage

• Statisticians very often must connect data sets from different sources

• They call it “record linkage”– term coined in 19461)

– mathematical foundations laid in 19592) – formalized in 1969 as “Fellegi-Sunter” model3)

• A whole subject area has been developed with well-known techniques, methods, and tools– these can of course be applied outside of

statistics1) http://ajph.aphapublications.org/cgi/reprint/36/12/14122) http://www.sciencemag.org/content/130/3381/954.citation3) http://www.jstor.org/pss/2286061

Page 9: Linking data without common identifiers

9

Other terms for the same thing

• Has been independently invented many times

• Under many names– entity resolution– identity resolution– merge/purge– deduplication– ...

• This makes Googling for information an absolute nightmare

Page 10: Linking data without common identifiers

10

Application areas

• Statistics (obviously)• Data cleaning• Data integration• Conversion• Fraud detection / intelligence /

surveillance

Page 11: Linking data without common identifiers

11

Mathematical model

Page 12: Linking data without common identifiers

12

Model, simplified

• We work with records, each of which has fields with values– before doing record linkage we’ve processed the

data so that they have the same set of fields– (fields which exist in only one of the sources get

discarded, as they are no help to us)

• We compare values for the same fields one by one, then estimate the probability that records represent the same entity given these values– probability depends on which field and what values– estimation can be very complex

• Finally, we can compute overall probability

Page 13: Linking data without common identifiers

13

Example

Field Record 1 Record 2 Probability

Name acme inc acme inc 0.9

Assoc no 177477707 0.5

Zip code 9161 9161 0.6

Country norway norway 0.51

Address 1 mb 113 mailbox 113 0.49

Address 2 0.5

Page 14: Linking data without common identifiers

14

String comparisons

• Must handle spelling errors and name variants– must estimate probability that values belong to same

entity despite differences

• Examples– John Smith ≈ Jonh Smith– J. Random Hacker ≈ James Random Hacker– ...

• Many well-known algorithms have been developed– no one best choice, unfortunately– some are eager to merge, others less so

• Many are computationally expensive– O(n2) where n is length of string is common– this gets very costly when number of record pairs to

compare is already high

Page 15: Linking data without common identifiers

15

Examples of algorithms

• Levenshtein– edit distance– ie: number of edits needed to change string 1 into 2– not so eager to merge

• Jaro-Winkler– specifically for short person names– very promiscuous

• Soundex– essentially a phonetic hashing algorithm– very fast, but also very crude

• Longest common subsequence / q-grams– haven’t tried this yet

• Monge-Elkan– one of the best, but computationally expensive

• TFIDF– based on term frequency– studies often find this performs best, but it’s expensive

Page 16: Linking data without common identifiers

16

Existing record linkage tools

• Commercial tools– big, sophisticated, and expensive– have found little information on what they

actually do– presumably also effective

• Open source tools– generally made by and for statisticians– nice user interfaces and rich configurability– architecture often not as flexible as it could be

Page 17: Linking data without common identifiers

17

Standard algorithm

• n2 comparisons for n records is unacceptable– must reduce number of direct comparisons

• Solution– produce a key from field values,– sort records by the key,– for each record, compare with n nearest neighbours– sometimes several different keys are produced, to

increase chances of finding matches

• Downsides– requires coming up with a key– difficult to apply incrementally– sorting is expensive

Page 18: Linking data without common identifiers

18

Good research papers

• Threat and Fraud Intelligence, Las Vegas Style, Jeff Jonas– http://jeffjonas.typepad.com/

IEEE.Identity.Resolution.pdf

• Real-world data is dirty: Data Cleansing and the Merge/Purge Problem, Hernandez & Stolfo– http://citeseerx.ist.psu.edu/viewdoc/download?

doi=10.1.1.30.3496&rep=rep1&type=pdf

• Swoosh: a generic approach to entity resolution, Benjelloun, Garcia-Molina et al– http://citeseerx.ist.psu.edu/viewdoc/download?

doi=10.1.1.122.5696&rep=rep1&type=pdf

Page 19: Linking data without common identifiers

19

DUplicate KillEr

Duke

Page 20: Linking data without common identifiers

20

Context

• Doing a project for Hafslund where we integrate data from many sources

• Entities are duplicated both inside systems and across systems

ERP

Suppliers

Customers

Companies

Customers

CRM

Customers

Billing

Page 21: Linking data without common identifiers

21

Requirements

• Must be flexible and configurable– no way to know in advance exactly what data we

will need to deduplicate

• Must scale to large data sets– CRM alone has 1.4 million customer records– that’s 2 trillion comparisons with naïve approach

• Must have an API– project uses SDshare protocol everywhere– must therefore be able to build connectors

• Must be able to work incrementally– process data as it changes and update conclusions

on the fly

Page 22: Linking data without common identifiers

22

Reviewed existing tools...

• ...but didn’t find anything matching the criteria

• Therefore...

Page 23: Linking data without common identifiers

23

Duke

• Java deduplication engine– released as open source– http://code.google.com/p/duke/

• Does not use key approach– instead indexes data with Lucene– does Lucene searches to find potential matches

• Still a work in progress, but– high performance (1M records in 10 minutes),– several data sources and comparators,– being used for real in real projects,– flexible architecture

Page 24: Linking data without common identifiers

24

How it works

• XML configuration file defines setup– lists data sources, properties, probabilities, etc

• Data sources provide streams of records• Steps:– normalize values so they are ready for comparison– collect a batch of records, then index it with

Lucene– compare records one by one against Lucene index– do detailed comparisons against search results– pairs with probability above configurable threshold

are considered matches

• For now, only API and command-line

Page 25: Linking data without common identifiers

25

Record comparisons in detail

• Comparator compares field values– return number in range 1.0 – 0.0– probability for field is high probability if

number is 1.0, or low probability if 0.0, otherwise in between the two

• Probability for entire record is computed using Bayes’s formula– approach known as “naïve Bayes” in research

literature

Page 26: Linking data without common identifiers

26

Components

Data sources• CSV• JDBC• Sparql• NTriples• <plug in your

own>

Comparators• ExactComparator• NumericComparato

r• SoundexComparato

r• TokenizedComparat

or• Levenshtein• JaroWinkler• Dice coefficient

Page 27: Linking data without common identifiers

27

Features

• Fairly complete command-line tool– with debugging support

• API for embedding the tool• Two modes:

– deduplication (all records matched against each other)

– linking (only matching across groups of records)

• Pluggable data sources, comparators, and cleaners

• Framework for composing cleaners• Incremental processing• High performance

Page 28: Linking data without common identifiers

28

A real-world example

Matching Mondial and DBpedia

Page 29: Linking data without common identifiers

29

Finding properties to match

• Need properties providing identity evidence

• Matching on the properties in bold below

• Extracted data to CSV for ease of useDBPEDIA

Id http://dbpedia.org/resource/Samoa

Name Samoa

Founding date 1962-01-01

Capital Apia

Currency Tala

Area 2831

Leader name Tuilaepa Aiono Sailele Malielegaoi

MONDIAL

Id 17019

Name Western Samoa

Independence 01 01 1962

Capital Apia, Samoa

Population 214384

Area 2860

GDP 415

Page 30: Linking data without common identifiers

30

Configuration – data sources

<group> <csv> <param name="input-file" value="dbpedia.csv"/> <param name="header-line" value="false"/> <column name="1" property="ID"/> <column name="2" cleaner="no.priv...CountryNameCleaner" property="NAME"/> <column name="3" property="AREA"/> <column name="4" cleaner="no.priv...CapitalCleaner" property="CAPITAL"/> </csv> </group>

<group> <csv> <param name="input-file" value="mondial.csv"/> <column name="id" property="ID"/> <column name="country" cleaner="no.priv...examples.CountryNameCleaner" property="NAME"/> <column name="capital" cleaner="no.priv...LowerCaseNormalizeCleaner" property="CAPITAL"/> <column name="area" property="AREA"/> </csv> </group>

Page 31: Linking data without common identifiers

31

Configuration – matching <schema> <threshold>0.65</threshold>

<property type="id"> <name>ID</name> </property> <property> <name>NAME</name> <comparator>no.priv.garshol.duke.Levenshtein</comparator> <low>0.3</low> <high>0.88</high> </property> <property> <name>AREA</name> <comparator>AreaComparator</comparator> <low>0.2</low> <high>0.6</high> </property> <property> <name>CAPITAL</name> <comparator>no.priv.garshol.duke.Levenshtein</comparator> <low>0.4</low> <high>0.88</high> </property> </schema>

<object class="no.priv.garshol.duke.NumericComparator" name="AreaComparator"> <param name="min-ratio" value="0.7"/> </object>

Duke analyzes this setup and decidesonly NAME and CAPITAL need to be searched on in Lucene.

Page 32: Linking data without common identifiers

32

Result

• Correct links found: 206 / 217 (94.9%)• Wrong links found: 0 / 12 (0.0%)• Unknown links found: 0• Percent of links correct 100.0%, wrong

0.0%, unknown 0.0%• Records with no link: 25• Precision 100.0%, recall 94.9%, f-

number 0.974

Page 33: Linking data without common identifiers

33

Examples

Field DBpedia Mondial

Name albania albania

Area 28748 28750

Capital tirana tirane

Probability 0.980

Field DBpedia Mondial

Name côte d'ivoire cote divoire

Area 322460 322460

Capital yamoussoukro yamoussoukro

Probability 0.975

Field DBpedia Mondial

Name samoa western samoa

Area 2831 2860

Capital apia apia

Probability 0.824

Field DBpedia Mondial

Name kazakhstan kazakstan

Area 2724900 2717300

Capital astana almaty

Probability 0.838

Field DBpedia Mondial

Name grande comore comoros

Area 1148 2170

Capital moroni moroní

Probability 0.440

Field DBpedia Mondial

Name serbia serbia and mont

Area 102350 88361

Capital sarajevo sarajevo

Probability 0.440

Page 34: Linking data without common identifiers

34

Choosing the right match

Field DBpedia Mondial Probability

Name samoa western samoa 0.3

Area 2831 2860 0.6

Capital apia apia 0.88

Probability 0.824

Field DBpedia Mondial Probability

Name samoa american samoa 0.3

Area 2831 199 0.4

Capital apia pago pago 0.4

Probability 0.067

Page 35: Linking data without common identifiers

35

An example of failure

• Duke doesn’t find this match– no tokens matching exactly– Lucene search finds nothing

• Detailed comparison gives correct result– so only problem is Lucene search

• Lucene does have Levenshtein search, but– in Lucene 3.x it’s very slow– therefore not enabled now– thinking of adding option to enable where

needed– Lucene 4.x will fix the performance problem

Field DBpedia Mondial

Name kazakhstan kazakstan

Area 2724900 2717300

Capital astana almaty

Probability 0.838

Page 36: Linking data without common identifiers

36

The effects of value matching

Case Precision Recall F

Optimal 100% 94.9% 97.4%

Cleaning, exact compare

99.3% 73.3% 84.4%

No cleaning, good compare

100% 93.4% 96.6%

No cleaning, exact compare

? ? ?

Cleaning, JaroWinkler

97.6% 96.7% 97.2%

Cleaning, Soundex

95.9% 97.2% 96.6%

Page 37: Linking data without common identifiers

37

Duke in real life

Usage at Hafslund

Page 38: Linking data without common identifiers

38

The SESAM project

• Building a new archival system– does automatic tagging of documents based

on knowledge about the data– knowledge extracted from backend systems

• Search engine-based frontend– using Recommind for this

• Very flexible architecture– extracted data stored in triple store (Virtuoso)– all data transfer based on SDshare protocol– data extracted from RDBMSs with Ontopia’s

DB2TM

Page 39: Linking data without common identifiers

39

The big picture

Virtuoso ERPRecommind

360

SDshare

SDshare SDshare

SDshare

CRM

Billing

DUPLICATES!

DukeSDshare

SDsharecontains owl:sameAs andhaf:possiblySameAs

Page 40: Linking data without common identifiers

40

Experiences so far

• Incremental processing works fine– links added and retracted as data changes

• Performance not an issue at all– but then only ~50,000 records so far...

• Matching works well, but not perfect– data are very noisy and messy– also, matching is hard

Page 41: Linking data without common identifiers

41

Duke roadmap

• 0.3– clean up the public API and document

properly– maybe some more comparators– support for writing owl:sameAs to Sparql

endpoint

• 0.4– add a web service interface

• 0.5 and onwards– more comparators– maybe some parallelism

Page 42: Linking data without common identifiers

42

Slides will appear on http://slideshare.net/larsga

Comments/questions?