Floris Geerts (University of Antwerp) Giansalvatore Mecca, Donatello Santoro (Università della...

27
Floris Geerts (University of Antwerp) Giansalvatore Mecca, Donatello Santoro (Università della Basilicata) Paolo Papotti (Qatar Computing Research Institute )

Transcript of Floris Geerts (University of Antwerp) Giansalvatore Mecca, Donatello Santoro (Università della...

Page 1: Floris Geerts (University of Antwerp) Giansalvatore Mecca, Donatello Santoro (Università della Basilicata) Paolo Papotti (Qatar Computing Research Institute)

Floris Geerts (University of Antwerp)

Giansalvatore Mecca, Donatello Santoro (Università della

Basilicata)

Paolo Papotti (Qatar Computing Research Institute)

Page 2: Floris Geerts (University of Antwerp) Giansalvatore Mecca, Donatello Santoro (Università della Basilicata) Paolo Papotti (Qatar Computing Research Institute)

The LLUNATIC Data-Cleaning Framework – F. Geerts, G. Mecca, P. Papotti, D. Santoro August, 28 2013

Overview2

‣ Motivations and Goals

‣ Semantics

‣ Experimental Results

Page 3: Floris Geerts (University of Antwerp) Giansalvatore Mecca, Donatello Santoro (Università della Basilicata) Paolo Papotti (Qatar Computing Research Institute)

The LLUNATIC Data-Cleaning Framework – F. Geerts, G. Mecca, P. Papotti, D. Santoro August, 28 2013

Overview3

‣ Motivations and Goals

‣ Semantics

‣ Experimental Results

Page 4: Floris Geerts (University of Antwerp) Giansalvatore Mecca, Donatello Santoro (Università della Basilicata) Paolo Papotti (Qatar Computing Research Institute)

The LLUNATIC Data-Cleaning Framework – F. Geerts, G. Mecca, P. Papotti, D. Santoro August, 28 2013

Data Cleaning as Data Repairing

•Data Cleaning is a very general term

•standardization, entity resolution, data fusion…

•We consider one of its facets: data repairing

4

Dirty Table R

Database Table R

ConstraintsC over R

Data Repairing

Cleaned Table R’

Page 5: Floris Geerts (University of Antwerp) Giansalvatore Mecca, Donatello Santoro (Università della Basilicata) Paolo Papotti (Qatar Computing Research Institute)

The LLUNATIC Data-Cleaning Framework – F. Geerts, G. Mecca, P. Papotti, D. Santoro August, 28 2013

5

A Motivating Example

Customers

SSN Name PhonePhCon

f City CC#

t1 111 M. White 408-3334 0.8 NY 112321

t2 222 L. Lennon 122-1876 0.9 SF 781658

t3 222 L. Lennon 000-0000 0.0 SF 784659

Master Data

SSN Name Phone Street City

tm 222 F. Lennon 122-1876 Sky Dr. SF

fd1. Cust: SSN, Name → Phone

Treatments

SSN SalaryInsuranc

e Treat. Date

t4 111 10k Abx Dental 10/01/11

t5 111 25k Abx Cholest. 08/12/12

t6 222 30k Med Eye Surg. 06/10/12er6. IF Cust.SSN = MD.SSN, Cust.Phone = MD.Phone → TAKE Name, Street, City from MD

fd2. Cust: SSN, Name → CC#

fd3. Treat: SSN → Salary

cfd4. Treat: Insur[‘Abx’] → Tr[‘Dental’]cfd5. IF Treat:Insur[‘Abx’] THEN Cust: City[‘SF’]

?

INTERACTION!

PREFERRED VALUES

Page 6: Floris Geerts (University of Antwerp) Giansalvatore Mecca, Donatello Santoro (Università della Basilicata) Paolo Papotti (Qatar Computing Research Institute)

The LLUNATIC Data-Cleaning Framework – F. Geerts, G. Mecca, P. Papotti, D. Santoro August, 28 2013

Language of

Constraints

Choose Preferred

Values

Repair Selection

6

Many TechniquesFD - ID

Cond FDCond ID

Matching DEdit Rules

CurrencyConfidenceMaster Data Cost-Minimality

Certain FixesSampling

6

How to put everything together?

Page 7: Floris Geerts (University of Antwerp) Giansalvatore Mecca, Donatello Santoro (Università della Basilicata) Paolo Papotti (Qatar Computing Research Institute)

The LLUNATIC Data-Cleaning Framework – F. Geerts, G. Mecca, P. Papotti, D. Santoro August, 28 2013

Language of

Constraints

Choose Preferred

Values

Repair Selection

7

Many TechniquesFD - ID

Cond FDCond ID

Matching DEdit Rules

FD - ID

CurrencyConfidenceMaster Data Cost-Minimality

Certain FixesSampling

Problem 1:missing semantics

andmissing repair

algorithm

7

Page 8: Floris Geerts (University of Antwerp) Giansalvatore Mecca, Donatello Santoro (Università della Basilicata) Paolo Papotti (Qatar Computing Research Institute)

The LLUNATIC Data-Cleaning Framework – F. Geerts, G. Mecca, P. Papotti, D. Santoro August, 28 2013

Language of

Constraints

Choose Preferred

Values

Repair Selection

8

Many TechniquesFD - ID

Cond FDCond ID

Matching DEdit Rules

FD - ID

CurrencyConfidenceMaster Data Cost-Minimality

Certain FixesSampling

Problem 2:missing formalism

to handle user-specified

preference rules

8

Page 9: Floris Geerts (University of Antwerp) Giansalvatore Mecca, Donatello Santoro (Università della Basilicata) Paolo Papotti (Qatar Computing Research Institute)

The LLUNATIC Data-Cleaning Framework – F. Geerts, G. Mecca, P. Papotti, D. Santoro August, 28 2013

Language of

Constraints

Choose Preferred

Values

Repair Selection

9

Many TechniquesFD - ID

Cond FDCond ID

Matching DEdit Rules

FD - ID

CurrencyConfidenceMaster Data Cost-Minimality

Certain FixesSampling

Problem 3:no DBMS-based

scalable implementations

9

Page 10: Floris Geerts (University of Antwerp) Giansalvatore Mecca, Donatello Santoro (Università della Basilicata) Paolo Papotti (Qatar Computing Research Institute)

The LLUNATIC Data-Cleaning Framework – F. Geerts, G. Mecca, P. Papotti, D. Santoro August, 28 2013

10System Comparison

Depend. LanguageRepair

StrategyValue Preference Solution Selection

FDs CFDs ERs Tgds RHS LHS Conf. Curr. MD Cost Cert.Card

.Sam

pl

[1] ✓ ✗ ✗ ✓ ✓ ✗ ✓ ✗ ✗ ✓ ✗ ✗ ✗[2] ✓ ✓ ✗ ✗ ✓ ✓ ✓ ✓ ✗ ✓ ✗ ✗ ✗[3] ✓ ✓ ✗ ✗ ✓ ✓ ✗ ✗ ✗ ✓ ✗ ✗ ✗[4] ✗ ✗ ✓ ✗ ✓ ✗ ✗ ✗ ✓ ✗ ✓ ✗ ✗[5] ✓ ✗ ✗ ✗ ✓ ✓ ✗ ✗ ✗ ✗ ✗ ✓ ✓

[1] Bohannon SIGMOD ’05 [2] Cong VLDB ‘07 [3] Kolahi ICDT ’09 [4] Fan VLDB ’10 [5] Beskales VLDB ’10

Page 11: Floris Geerts (University of Antwerp) Giansalvatore Mecca, Donatello Santoro (Università della Basilicata) Paolo Papotti (Qatar Computing Research Institute)

The LLUNATIC Data-Cleaning Framework – F. Geerts, G. Mecca, P. Papotti, D. Santoro August, 28 2013

11Our Goal

Depend. LanguageRepair

StrategyValue Preference Solution Selection

FDs CFDs ERs Tgds RHS LHS Conf. Curr. MD Cost Cert.Card

.Sam

pl

[1] ✓ ✗ ✗ ✓ ✓ ✗ ✓ ✗ ✗ ✓ ✗ ✗ ✗[2] ✓ ✓ ✗ ✗ ✓ ✓ ✓ ✓ ✗ ✓ ✗ ✗ ✗[3] ✓ ✓ ✗ ✗ ✓ ✓ ✗ ✗ ✗ ✓ ✗ ✗ ✗[4] ✗ ✗ ✓ ✗ ✓ ✗ ✗ ✗ ✓ ✗ ✓ ✗ ✗[5] ✓ ✗ ✗ ✗ ✓ ✓ ✗ ✗ ✗ ✗ ✗ ✓ ✓

GOAL

[1] Bohannon SIGMOD ’05 [2] Cong VLDB ‘07 [3] Kolahi ICDT ’09 [4] Fan VLDB ’10 [5] Beskales VLDB ’10

✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓

Page 12: Floris Geerts (University of Antwerp) Giansalvatore Mecca, Donatello Santoro (Università della Basilicata) Paolo Papotti (Qatar Computing Research Institute)

The LLUNATIC Data-Cleaning Framework – F. Geerts, G. Mecca, P. Papotti, D. Santoro August, 28 2013

Overview12

‣ Motivations and Goals

‣ Semantics

‣ Experimental Results

Page 13: Floris Geerts (University of Antwerp) Giansalvatore Mecca, Donatello Santoro (Università della Basilicata) Paolo Papotti (Qatar Computing Research Institute)

The LLUNATIC Data-Cleaning Framework – F. Geerts, G. Mecca, P. Papotti, D. Santoro August, 28 2013

13Cleaning EGDs

Customers

SSN Name PhonePhCon

f City CC#

t1 111 M. White 408-3334 0.8 NY 112321

t2 222 L. Lennon 122-1876 0.9 SF 781658

t3 222 L. Lennon 000-0000 0.0 SF 784659

Master Data

SSN Name Phone Street City

tm 222 F. Lennon 122-1876 Sky Dr. SF

e1. Cust(ssn, n, ph, c , cc ) , Cust(ssn, n, ph’, c’, cc’) → ph = ph’e2. Cust(ssn, n, ph, c , cc ) , Cust(ssn, n, ph’, c’, cc’) → cc = cc’e3. Treat(ssn, sal, i, t, d) , Treat(ssn, sal’, i’, t’, d’) → sal = sal’e4. Treat(ssn, s, ins, t, d) , ins = ‘Abx’ → tr = ‘Dental’

Treatments

SSN SalaryInsuranc

e Treat. Date

t4 111 10k Abx Dental 10/01/11

t5 111 25k Abx Cholest. 08/12/12

t6 222 30k Med Eye Surg. 06/10/12

e5. Cust(ssn, n, ph, c , cc ) , Treat(ssn, s, ins, t, d) , ins = ‘Abx’ → c = ‘SF’e6. Cust(ssn, n, ph, c , cc ) , Master(ssn, n’, ph, s, c’) , → n = n’, c = c’

Σ: Cleaning EGDs

Page 14: Floris Geerts (University of Antwerp) Giansalvatore Mecca, Donatello Santoro (Università della Basilicata) Paolo Papotti (Qatar Computing Research Institute)

The LLUNATIC Data-Cleaning Framework – F. Geerts, G. Mecca, P. Papotti, D. Santoro August, 28 2013

Semantics14

‣ Cell Groups

‣ Upgrades

‣ LLUNs

‣ Partial Order

Page 15: Floris Geerts (University of Antwerp) Giansalvatore Mecca, Donatello Santoro (Università della Basilicata) Paolo Papotti (Qatar Computing Research Institute)

The LLUNATIC Data-Cleaning Framework – F. Geerts, G. Mecca, P. Papotti, D. Santoro August, 28 2013

The Partial Order• Standard preference rules

• NULLS CONST

• source values are preferred to target values

• Ordering attribute

Π(Cust.Phone) = {dom(Cust.PhConf), ≥}

Π(Treat.Salary) = {dom(Treat.Date), ≥time}

•No orderΠ(Cust.CC#) = ∅

15

Customers

SSN Name PhonePhCon

f City CC#

t1 111 M. White 408-3334 0.8 NY 112321

t2 222 L. Lennon 122-1876 0.9 SF 781658

t3 222 L. Lennon 000-0000 0.0 SF 784659

Master Data

SSN Name Phone Street City

tm 222 F. Lennon 122-1876 Sky Dr. SF

Treatments

SSN SalaryInsuranc

e Treat. Date

t4 111 10k Abx Dental 10/01/11

t5 111 25k Abx Cholest. 08/12/12

t6 222 30k Med Eye Surg. 06/10/12

Page 16: Floris Geerts (University of Antwerp) Giansalvatore Mecca, Donatello Santoro (Università della Basilicata) Paolo Papotti (Qatar Computing Research Institute)

The LLUNATIC Data-Cleaning Framework – F. Geerts, G. Mecca, P. Papotti, D. Santoro August, 28 2013

SSN Name Phone PhConf

t1 222 L. Lennon 123 0.9

t2 222 F. Lennon 000 0.0

SSN Name Phone PhConf

t1 222 L. Lennon 123 0.9

t2 222 F. Lennon 123 0.0

Cell Groups

• We model partial repair by a Cell-Group

• set of target cells that must be changed together

• we do not disrupt this equality in the following

• Carry also lineage information

16

g1 = <123 → {t1.phone, t2.phone}>occurrencesvalue

Master Data

NamePhon

e

tm L. Lennon 999

e2. Cust(s, n, ph) , Master(n, ph’) → ph = ph’

222 L. Lennon 999 0.9

222 F. Lennon 999 0.0

g2 = <999→ {t1.phone, t2.phone} by {tm.ph}>

we change the value of all occurrences

justifications

e1. Cust(ssn, n, ph, c , cc ) , Cust(ssn, n’, ph’, c’, cc’) → ph = ph’

Partial orderover cell groups

Page 17: Floris Geerts (University of Antwerp) Giansalvatore Mecca, Donatello Santoro (Università della Basilicata) Paolo Papotti (Qatar Computing Research Institute)

The LLUNATIC Data-Cleaning Framework – F. Geerts, G. Mecca, P. Papotti, D. Santoro August, 28 2013

Upgrades17

SSN Name Phone

t2 222 L. Lennon 123

t3 222 F. Lennon 123

<123→ {t2.ph, t3.ph} >

SSN Name Phone

t2 222 L. Lennon 999

t3 222 F. Lennon 999

<999→ {t2.ph, t3.ph} by {tm.ph}>

SSN Name Phone

t2 222 L. Lennon 123

t3 222 F. Lennon 000

Starting database J

upgradeRepair 1

upgradeRepair N

Page 18: Floris Geerts (University of Antwerp) Giansalvatore Mecca, Donatello Santoro (Università della Basilicata) Paolo Papotti (Qatar Computing Research Institute)

The LLUNATIC Data-Cleaning Framework – F. Geerts, G. Mecca, P. Papotti, D. Santoro August, 28 2013

LLUNs

•There are cases in which we don’t have any clear strategy to remove a violation

•LLUNs

•a new class of symbols

•placeholders used to mark conflicts

•not only an unknown value but rather a “hypervalue”

• the opposite of a NULL, since it upgrades constants

18

SSN Name Phone CC#

t2 222 L. Lennon 122-1876 5555

t3 222 L. Lennon 000-0000 6666e3. Cust(ssn, n, ph, cc ) , Cust(ssn, n, ph’, cc’) → cc = cc’

L0

Page 19: Floris Geerts (University of Antwerp) Giansalvatore Mecca, Donatello Santoro (Università della Basilicata) Paolo Papotti (Qatar Computing Research Institute)

The LLUNATIC Data-Cleaning Framework – F. Geerts, G. Mecca, P. Papotti, D. Santoro August, 28 2013

Scenarios and Solutions19

•S: source schema, T: target schema, Σ: set of cleaning EGDs

•Π: the partial order specification

•Π is the way to specify when a value is preferrable to another

•Solution: Given C, an instance I of S, and an instance J of T compute an instance J’ such that:

• it is a repair, i.e., “I and J’ satisfy Σ”

•and “J’ is an upgrade of J according to Π”

Cleaning Scenario

Page 20: Floris Geerts (University of Antwerp) Giansalvatore Mecca, Donatello Santoro (Università della Basilicata) Paolo Papotti (Qatar Computing Research Institute)

The LLUNATIC Data-Cleaning Framework – F. Geerts, G. Mecca, P. Papotti, D. Santoro August, 28 2013

The Chase Algorithm

•To compute solutions:

•Chase algorithm for chasing egds

20

SSN Name Phone

t2 L1 L. Lennon 123

t3 222 F. Lennon 000

SSN Name Phone

t2 222 L. Lennon 123

t3 L2 F. Lennon 000

<L1→ {t2.ssn} >

Repair 2 (backward)

<L2→ {t3.ssn} >

Repair 3 (backward)

e1. Cust(ssn, n, ph, c , cc ) , Cust(ssn, n’, ph’, c’, cc’) → ph = ph’

SSN Name Phone

t2 222 L. Lennon 123

t3 222 F. Lennon 123

<123→ {t2.ph, t3.ph} >

Repair 1 (forward)

Page 21: Floris Geerts (University of Antwerp) Giansalvatore Mecca, Donatello Santoro (Università della Basilicata) Paolo Papotti (Qatar Computing Research Institute)

The LLUNATIC Data-Cleaning Framework – F. Geerts, G. Mecca, P. Papotti, D. Santoro August, 28 2013

A Few Results•Given a cleaning scenario C, and

instances <I, J>

•C always has a solution for <I, J>

•The chase always terminates (it never fails)

•The chase computes all minimal solutions

•The number of minimal solutions is exponential in the size of J

21

Page 22: Floris Geerts (University of Antwerp) Giansalvatore Mecca, Donatello Santoro (Università della Basilicata) Paolo Papotti (Qatar Computing Research Institute)

The LLUNATIC Data-Cleaning Framework – F. Geerts, G. Mecca, P. Papotti, D. Santoro August, 28 2013

Overview22

‣ Motivations and Goals

‣ Semantics

‣ Experimental Results

Page 23: Floris Geerts (University of Antwerp) Giansalvatore Mecca, Donatello Santoro (Università della Basilicata) Paolo Papotti (Qatar Computing Research Institute)

The LLUNATIC Data-Cleaning Framework – F. Geerts, G. Mecca, P. Papotti, D. Santoro August, 28 2013

Chase Tree23

J

R1

e0, b1

R2 R3

e0, b2e0, f

R4

e1, b1

R5 R6

e1, b2e1, f

R10

e1, b1 R11 R12

e1, b2e1, f

R13

e0, b1

R14 R15

e0, b2e0, f

the e0-e1 sequence the e1-e0 sequence

Our goal: to make this scalable ! Different orders of application

give different results

Page 24: Floris Geerts (University of Antwerp) Giansalvatore Mecca, Donatello Santoro (Università della Basilicata) Paolo Papotti (Qatar Computing Research Institute)

The LLUNATIC Data-Cleaning Framework – F. Geerts, G. Mecca, P. Papotti, D. Santoro August, 28 2013

Scalability Techniques

•Chase implementation based on equivalence classes

•Delta Databases

•a representation system for chase trees

•Cost managers

•pluggable strategies to prune the chase tree

24

Page 25: Floris Geerts (University of Antwerp) Giansalvatore Mecca, Donatello Santoro (Università della Basilicata) Paolo Papotti (Qatar Computing Research Institute)

The LLUNATIC Data-Cleaning Framework – F. Geerts, G. Mecca, P. Papotti, D. Santoro August, 28 2013

25Scalability

LLUNATIC-FR-S10LLUNATIC-FR-S1 LLUNATIC-FR-S1-FOLLUNATIC-FR-S50

CUSTOMERSHOSPITAL

CUSTOMERS

100K 250K 400K 550K 700K 850K 1M

LLUNATICis the first scalable

DBMS-based chase algorithm for

data repairing

sec.

Page 26: Floris Geerts (University of Antwerp) Giansalvatore Mecca, Donatello Santoro (Università della Basilicata) Paolo Papotti (Qatar Computing Research Institute)

The LLUNATIC Data-Cleaning Framework – F. Geerts, G. Mecca, P. Papotti, D. Santoro August, 28 2013

26Quality of Repairs

5k 10k 25k

LLUNATIC-FR-S10LLUNATIC-FR-S1 LLUNATIC-FR-S1-FO

SAMPLING-500 VERTEX COVER MIN. COST

HOSPITAL

Page 27: Floris Geerts (University of Antwerp) Giansalvatore Mecca, Donatello Santoro (Università della Basilicata) Paolo Papotti (Qatar Computing Research Institute)

The LLUNATIC Data-Cleaning Framework – F. Geerts, G. Mecca, P. Papotti, D. Santoro August, 28 2013

That’s all

Folks!