Floris Geerts (University of Antwerp) Giansalvatore Mecca, Donatello Santoro (Università della...
-
Upload
jaheem-edgecombe -
Category
Documents
-
view
223 -
download
5
Transcript of Floris Geerts (University of Antwerp) Giansalvatore Mecca, Donatello Santoro (Università della...
Floris Geerts (University of Antwerp)
Giansalvatore Mecca, Donatello Santoro (Università della
Basilicata)
Paolo Papotti (Qatar Computing Research Institute)
The LLUNATIC Data-Cleaning Framework – F. Geerts, G. Mecca, P. Papotti, D. Santoro August, 28 2013
Overview2
‣ Motivations and Goals
‣ Semantics
‣ Experimental Results
The LLUNATIC Data-Cleaning Framework – F. Geerts, G. Mecca, P. Papotti, D. Santoro August, 28 2013
Overview3
‣ Motivations and Goals
‣ Semantics
‣ Experimental Results
The LLUNATIC Data-Cleaning Framework – F. Geerts, G. Mecca, P. Papotti, D. Santoro August, 28 2013
Data Cleaning as Data Repairing
•Data Cleaning is a very general term
•standardization, entity resolution, data fusion…
•We consider one of its facets: data repairing
4
Dirty Table R
Database Table R
ConstraintsC over R
Data Repairing
Cleaned Table R’
The LLUNATIC Data-Cleaning Framework – F. Geerts, G. Mecca, P. Papotti, D. Santoro August, 28 2013
5
A Motivating Example
Customers
SSN Name PhonePhCon
f City CC#
t1 111 M. White 408-3334 0.8 NY 112321
t2 222 L. Lennon 122-1876 0.9 SF 781658
t3 222 L. Lennon 000-0000 0.0 SF 784659
Master Data
SSN Name Phone Street City
tm 222 F. Lennon 122-1876 Sky Dr. SF
fd1. Cust: SSN, Name → Phone
Treatments
SSN SalaryInsuranc
e Treat. Date
t4 111 10k Abx Dental 10/01/11
t5 111 25k Abx Cholest. 08/12/12
t6 222 30k Med Eye Surg. 06/10/12er6. IF Cust.SSN = MD.SSN, Cust.Phone = MD.Phone → TAKE Name, Street, City from MD
fd2. Cust: SSN, Name → CC#
fd3. Treat: SSN → Salary
cfd4. Treat: Insur[‘Abx’] → Tr[‘Dental’]cfd5. IF Treat:Insur[‘Abx’] THEN Cust: City[‘SF’]
?
INTERACTION!
PREFERRED VALUES
The LLUNATIC Data-Cleaning Framework – F. Geerts, G. Mecca, P. Papotti, D. Santoro August, 28 2013
Language of
Constraints
Choose Preferred
Values
Repair Selection
6
Many TechniquesFD - ID
Cond FDCond ID
Matching DEdit Rules
CurrencyConfidenceMaster Data Cost-Minimality
Certain FixesSampling
6
How to put everything together?
The LLUNATIC Data-Cleaning Framework – F. Geerts, G. Mecca, P. Papotti, D. Santoro August, 28 2013
Language of
Constraints
Choose Preferred
Values
Repair Selection
7
Many TechniquesFD - ID
Cond FDCond ID
Matching DEdit Rules
FD - ID
CurrencyConfidenceMaster Data Cost-Minimality
Certain FixesSampling
Problem 1:missing semantics
andmissing repair
algorithm
7
The LLUNATIC Data-Cleaning Framework – F. Geerts, G. Mecca, P. Papotti, D. Santoro August, 28 2013
Language of
Constraints
Choose Preferred
Values
Repair Selection
8
Many TechniquesFD - ID
Cond FDCond ID
Matching DEdit Rules
FD - ID
CurrencyConfidenceMaster Data Cost-Minimality
Certain FixesSampling
Problem 2:missing formalism
to handle user-specified
preference rules
8
The LLUNATIC Data-Cleaning Framework – F. Geerts, G. Mecca, P. Papotti, D. Santoro August, 28 2013
Language of
Constraints
Choose Preferred
Values
Repair Selection
9
Many TechniquesFD - ID
Cond FDCond ID
Matching DEdit Rules
FD - ID
CurrencyConfidenceMaster Data Cost-Minimality
Certain FixesSampling
Problem 3:no DBMS-based
scalable implementations
9
The LLUNATIC Data-Cleaning Framework – F. Geerts, G. Mecca, P. Papotti, D. Santoro August, 28 2013
10System Comparison
Depend. LanguageRepair
StrategyValue Preference Solution Selection
FDs CFDs ERs Tgds RHS LHS Conf. Curr. MD Cost Cert.Card
.Sam
pl
[1] ✓ ✗ ✗ ✓ ✓ ✗ ✓ ✗ ✗ ✓ ✗ ✗ ✗[2] ✓ ✓ ✗ ✗ ✓ ✓ ✓ ✓ ✗ ✓ ✗ ✗ ✗[3] ✓ ✓ ✗ ✗ ✓ ✓ ✗ ✗ ✗ ✓ ✗ ✗ ✗[4] ✗ ✗ ✓ ✗ ✓ ✗ ✗ ✗ ✓ ✗ ✓ ✗ ✗[5] ✓ ✗ ✗ ✗ ✓ ✓ ✗ ✗ ✗ ✗ ✗ ✓ ✓
[1] Bohannon SIGMOD ’05 [2] Cong VLDB ‘07 [3] Kolahi ICDT ’09 [4] Fan VLDB ’10 [5] Beskales VLDB ’10
The LLUNATIC Data-Cleaning Framework – F. Geerts, G. Mecca, P. Papotti, D. Santoro August, 28 2013
11Our Goal
Depend. LanguageRepair
StrategyValue Preference Solution Selection
FDs CFDs ERs Tgds RHS LHS Conf. Curr. MD Cost Cert.Card
.Sam
pl
[1] ✓ ✗ ✗ ✓ ✓ ✗ ✓ ✗ ✗ ✓ ✗ ✗ ✗[2] ✓ ✓ ✗ ✗ ✓ ✓ ✓ ✓ ✗ ✓ ✗ ✗ ✗[3] ✓ ✓ ✗ ✗ ✓ ✓ ✗ ✗ ✗ ✓ ✗ ✗ ✗[4] ✗ ✗ ✓ ✗ ✓ ✗ ✗ ✗ ✓ ✗ ✓ ✗ ✗[5] ✓ ✗ ✗ ✗ ✓ ✓ ✗ ✗ ✗ ✗ ✗ ✓ ✓
GOAL
[1] Bohannon SIGMOD ’05 [2] Cong VLDB ‘07 [3] Kolahi ICDT ’09 [4] Fan VLDB ’10 [5] Beskales VLDB ’10
✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
The LLUNATIC Data-Cleaning Framework – F. Geerts, G. Mecca, P. Papotti, D. Santoro August, 28 2013
Overview12
‣ Motivations and Goals
‣ Semantics
‣ Experimental Results
The LLUNATIC Data-Cleaning Framework – F. Geerts, G. Mecca, P. Papotti, D. Santoro August, 28 2013
13Cleaning EGDs
Customers
SSN Name PhonePhCon
f City CC#
t1 111 M. White 408-3334 0.8 NY 112321
t2 222 L. Lennon 122-1876 0.9 SF 781658
t3 222 L. Lennon 000-0000 0.0 SF 784659
Master Data
SSN Name Phone Street City
tm 222 F. Lennon 122-1876 Sky Dr. SF
e1. Cust(ssn, n, ph, c , cc ) , Cust(ssn, n, ph’, c’, cc’) → ph = ph’e2. Cust(ssn, n, ph, c , cc ) , Cust(ssn, n, ph’, c’, cc’) → cc = cc’e3. Treat(ssn, sal, i, t, d) , Treat(ssn, sal’, i’, t’, d’) → sal = sal’e4. Treat(ssn, s, ins, t, d) , ins = ‘Abx’ → tr = ‘Dental’
Treatments
SSN SalaryInsuranc
e Treat. Date
t4 111 10k Abx Dental 10/01/11
t5 111 25k Abx Cholest. 08/12/12
t6 222 30k Med Eye Surg. 06/10/12
e5. Cust(ssn, n, ph, c , cc ) , Treat(ssn, s, ins, t, d) , ins = ‘Abx’ → c = ‘SF’e6. Cust(ssn, n, ph, c , cc ) , Master(ssn, n’, ph, s, c’) , → n = n’, c = c’
Σ: Cleaning EGDs
The LLUNATIC Data-Cleaning Framework – F. Geerts, G. Mecca, P. Papotti, D. Santoro August, 28 2013
Semantics14
‣ Cell Groups
‣ Upgrades
‣ LLUNs
‣ Partial Order
The LLUNATIC Data-Cleaning Framework – F. Geerts, G. Mecca, P. Papotti, D. Santoro August, 28 2013
The Partial Order• Standard preference rules
• NULLS CONST
• source values are preferred to target values
• Ordering attribute
Π(Cust.Phone) = {dom(Cust.PhConf), ≥}
Π(Treat.Salary) = {dom(Treat.Date), ≥time}
•No orderΠ(Cust.CC#) = ∅
15
Customers
SSN Name PhonePhCon
f City CC#
t1 111 M. White 408-3334 0.8 NY 112321
t2 222 L. Lennon 122-1876 0.9 SF 781658
t3 222 L. Lennon 000-0000 0.0 SF 784659
Master Data
SSN Name Phone Street City
tm 222 F. Lennon 122-1876 Sky Dr. SF
Treatments
SSN SalaryInsuranc
e Treat. Date
t4 111 10k Abx Dental 10/01/11
t5 111 25k Abx Cholest. 08/12/12
t6 222 30k Med Eye Surg. 06/10/12
The LLUNATIC Data-Cleaning Framework – F. Geerts, G. Mecca, P. Papotti, D. Santoro August, 28 2013
SSN Name Phone PhConf
t1 222 L. Lennon 123 0.9
t2 222 F. Lennon 000 0.0
SSN Name Phone PhConf
t1 222 L. Lennon 123 0.9
t2 222 F. Lennon 123 0.0
Cell Groups
• We model partial repair by a Cell-Group
• set of target cells that must be changed together
• we do not disrupt this equality in the following
• Carry also lineage information
16
g1 = <123 → {t1.phone, t2.phone}>occurrencesvalue
Master Data
NamePhon
e
tm L. Lennon 999
e2. Cust(s, n, ph) , Master(n, ph’) → ph = ph’
222 L. Lennon 999 0.9
222 F. Lennon 999 0.0
g2 = <999→ {t1.phone, t2.phone} by {tm.ph}>
we change the value of all occurrences
justifications
e1. Cust(ssn, n, ph, c , cc ) , Cust(ssn, n’, ph’, c’, cc’) → ph = ph’
Partial orderover cell groups
The LLUNATIC Data-Cleaning Framework – F. Geerts, G. Mecca, P. Papotti, D. Santoro August, 28 2013
Upgrades17
SSN Name Phone
t2 222 L. Lennon 123
t3 222 F. Lennon 123
<123→ {t2.ph, t3.ph} >
SSN Name Phone
t2 222 L. Lennon 999
t3 222 F. Lennon 999
<999→ {t2.ph, t3.ph} by {tm.ph}>
SSN Name Phone
t2 222 L. Lennon 123
t3 222 F. Lennon 000
Starting database J
upgradeRepair 1
upgradeRepair N
The LLUNATIC Data-Cleaning Framework – F. Geerts, G. Mecca, P. Papotti, D. Santoro August, 28 2013
LLUNs
•There are cases in which we don’t have any clear strategy to remove a violation
•LLUNs
•a new class of symbols
•placeholders used to mark conflicts
•not only an unknown value but rather a “hypervalue”
• the opposite of a NULL, since it upgrades constants
18
SSN Name Phone CC#
t2 222 L. Lennon 122-1876 5555
t3 222 L. Lennon 000-0000 6666e3. Cust(ssn, n, ph, cc ) , Cust(ssn, n, ph’, cc’) → cc = cc’
L0
The LLUNATIC Data-Cleaning Framework – F. Geerts, G. Mecca, P. Papotti, D. Santoro August, 28 2013
Scenarios and Solutions19
•S: source schema, T: target schema, Σ: set of cleaning EGDs
•Π: the partial order specification
•Π is the way to specify when a value is preferrable to another
•Solution: Given C, an instance I of S, and an instance J of T compute an instance J’ such that:
• it is a repair, i.e., “I and J’ satisfy Σ”
•and “J’ is an upgrade of J according to Π”
Cleaning Scenario
The LLUNATIC Data-Cleaning Framework – F. Geerts, G. Mecca, P. Papotti, D. Santoro August, 28 2013
The Chase Algorithm
•To compute solutions:
•Chase algorithm for chasing egds
20
SSN Name Phone
t2 L1 L. Lennon 123
t3 222 F. Lennon 000
SSN Name Phone
t2 222 L. Lennon 123
t3 L2 F. Lennon 000
<L1→ {t2.ssn} >
Repair 2 (backward)
<L2→ {t3.ssn} >
Repair 3 (backward)
e1. Cust(ssn, n, ph, c , cc ) , Cust(ssn, n’, ph’, c’, cc’) → ph = ph’
SSN Name Phone
t2 222 L. Lennon 123
t3 222 F. Lennon 123
<123→ {t2.ph, t3.ph} >
Repair 1 (forward)
The LLUNATIC Data-Cleaning Framework – F. Geerts, G. Mecca, P. Papotti, D. Santoro August, 28 2013
A Few Results•Given a cleaning scenario C, and
instances <I, J>
•C always has a solution for <I, J>
•The chase always terminates (it never fails)
•The chase computes all minimal solutions
•The number of minimal solutions is exponential in the size of J
21
The LLUNATIC Data-Cleaning Framework – F. Geerts, G. Mecca, P. Papotti, D. Santoro August, 28 2013
Overview22
‣ Motivations and Goals
‣ Semantics
‣ Experimental Results
The LLUNATIC Data-Cleaning Framework – F. Geerts, G. Mecca, P. Papotti, D. Santoro August, 28 2013
Chase Tree23
J
R1
e0, b1
R2 R3
e0, b2e0, f
R4
e1, b1
R5 R6
e1, b2e1, f
R10
e1, b1 R11 R12
e1, b2e1, f
R13
e0, b1
R14 R15
e0, b2e0, f
the e0-e1 sequence the e1-e0 sequence
Our goal: to make this scalable ! Different orders of application
give different results
The LLUNATIC Data-Cleaning Framework – F. Geerts, G. Mecca, P. Papotti, D. Santoro August, 28 2013
Scalability Techniques
•Chase implementation based on equivalence classes
•Delta Databases
•a representation system for chase trees
•Cost managers
•pluggable strategies to prune the chase tree
24
The LLUNATIC Data-Cleaning Framework – F. Geerts, G. Mecca, P. Papotti, D. Santoro August, 28 2013
25Scalability
LLUNATIC-FR-S10LLUNATIC-FR-S1 LLUNATIC-FR-S1-FOLLUNATIC-FR-S50
CUSTOMERSHOSPITAL
CUSTOMERS
100K 250K 400K 550K 700K 850K 1M
LLUNATICis the first scalable
DBMS-based chase algorithm for
data repairing
sec.
The LLUNATIC Data-Cleaning Framework – F. Geerts, G. Mecca, P. Papotti, D. Santoro August, 28 2013
26Quality of Repairs
5k 10k 25k
LLUNATIC-FR-S10LLUNATIC-FR-S1 LLUNATIC-FR-S1-FO
SAMPLING-500 VERTEX COVER MIN. COST
HOSPITAL
The LLUNATIC Data-Cleaning Framework – F. Geerts, G. Mecca, P. Papotti, D. Santoro August, 28 2013
That’s all
Folks!