Vassiliadis
-
Upload
priyanka-tiwari -
Category
Documents
-
view
223 -
download
0
Transcript of Vassiliadis
-
8/6/2019 Vassiliadis
1/39
Data Provenance in ETL Scenarios
Panos VassiliadisUniversity of Ioannina
(joint work with Alkis Simitsis, IBM Almaden Research Center,
Timos Sellis and Dimitrios Skoutas, NTUA & ICCS)
-
8/6/2019 Vassiliadis
2/39
PrOPr 2007 2
Outline
Introduction
Conceptual Level
Logical Level
Physical Level Provenance &ETL
-
8/6/2019 Vassiliadis
3/39
PrOPr 2007 3
Outline
Introduction
Conceptual Level
Logical Level
Physical Level Provenance &ETL
-
8/6/2019 Vassiliadis
4/39
PrOPr 2007 4
Data Warehouse Environment
-
8/6/2019 Vassiliadis
5/39
PrOPr 2007 5
Extract-Transform-Load (ETL)
-
8/6/2019 Vassiliadis
6/39
PrOPr 2007 6
ETL: importance
ETL and Data Cleaning tools cost 30% of effort and expenses in the budget of the DW
55% of the total costs of DW runtime
80% of the development time in a DW project
ETL market: a multi-million market IBM paid $1.1 billion dollars for Ascential
ETL tools in the market
software packages
in-house development
No standard, no common model
most vendors implement a core set of operators and provide GUI to
create a data flow
-
8/6/2019 Vassiliadis
7/39
PrOPr 2007 7
Fundamental research question
Now: currently, ETL designers work directly at thephysical level (typically, via libraries of physical-level templates)
Challenge: can we design ETL flows as declaratively
as possible? Detail independence:
no care for the algorithmic choices
no care about the order of the transformations
(hopefully) no care for the details of the inter-attributemappings
-
8/6/2019 Vassiliadis
8/39
PrOPr 2007 8
Engine
Physical
templates
DW
Involveddata
stores
+
Now:
Physical
scenario
-
8/6/2019 Vassiliadis
9/39
PrOPr 2007 9
DWSchema
mappings
Conceptual tological mapper
Conceptual to
logical mapping
Optimizer
Engine
Logical
templates
Physical
templates
Logical
scenario
Physical
scenarioEngine
ETL tool
Vision:
Physical
templates
DW
Involveddata
stores
+
Physical
scenario
-
8/6/2019 Vassiliadis
10/39
PrOPr 2007 10
DWSchema
mappings
Conceptual tological mapper
Conceptual to
logical mapping
Optimizer
Engine
Logical
templates
Physical
templates
Logical
scenario
Physical
scenario
ETL tool
Detail independence
Automate
(as much as possible)
Conceptual: thedetails of the inter-
attribute mappings
Logical: the order of
the transformationsPhysical: the
algorithmic choices
-
8/6/2019 Vassiliadis
11/39
PrOPr 2007 11
Outline
Introduction
Conceptual Level
Logical Level
Physical Level Provenance &ETL
-
8/6/2019 Vassiliadis
12/39
PrOPr 2007 12
Conceptual Model: first attempts
-
8/6/2019 Vassiliadis
13/39
PrOPr 2007 13
Conceptual Model: The Data Mapping Diagram
Extension of UML to handle inter-attribute mappings
-
8/6/2019 Vassiliadis
14/39
PrOPr 2007 14
Conceptual Model: The Data Mapping Diagram
Aggregating computes the quarterly sales for each product.
-
8/6/2019 Vassiliadis
15/39
PrOPr 2007 15
Conceptual Model: Skoutas annotations
Application vocabulary
VC = {product, store}
VPproduct = {pid, pName, quantity, price,type, storage}
VPstore=
{sid, sName, city, street}VFpid = {source_pid, dw_pid}
VFsid = {source_sid, dw_sid}
VFprice = {dollars, euros}
VTtype = {software, hardware}
VTcity =
{paris, rome, athens}
Datastore mappings
Datastore annotation
-
8/6/2019 Vassiliadis
16/39
PrOPr 2007 16
Conceptual Model: Skoutas annotations
The class hierarchy Definition for classDS1_Products
-
8/6/2019 Vassiliadis
17/39
PrOPr 2007 17
Outline
Introduction
Conceptual Level
Logical Level
Physical Level Provenance &ETL
-
8/6/2019 Vassiliadis
18/39
PrOPr 2007 18
Logical Model
AddAttr2
SOURCE
SK2
DS.PS1.PKEY,
LOOKUP_PS.SKEY,SOURCE
$2
COST DATE
DS.PS1 SK1
DS.PS2.PKEY,LOOKUP_PS.SKEY,
SOURCECOST DATE=SYSDATE
AddDateU
DS.PS2
Log
rejected
Log
rejected
A2EDate
NotNULL
Log
rejected
Log
rejected
Log
rejected
DIFF2
DS.PSNEW2.PKEY,
DS.PSOLD2.PKEYDS.PSNEW2
DS.PSOLD2
DW.PARTS Aggregate1
PKEY, DAYMIN(COST)
Aggregate2
PKEY, MONTHAVG(COST)
V2
V1
TIME ZY
DW.PARTSUPP.DATE,
DAY
FTP2S2.PARTS
S1.PARTS FTP1
DS.PSNEW1
DIFF1
DS.PSOLD1
DS.PSNEW1.PKEY,
DS.PSOLD1.PKEY
SourcesDW
DSA
Log
rejected
QTY,COST
PK
PKEY,DATE
Log
rejected
-
8/6/2019 Vassiliadis
19/39
PrOPr 2007 19
Logical Model
Main question:
What information should we put inside a metadata
repository to be able to answer questions like:
what is the architecture of my DW back stage? which attributes/tables are involved in the population of
an attribute?
what part of the scenario is affected if we delete an
attribute?
-
8/6/2019 Vassiliadis
20/39
PrOPr 2007 20
Architecture Graph
$2
COST DATE
DS.PS1 SK1
DS.PS2.PKEY,LOOKUP_PS.SKEY,
SOURCECOST DATE=SYSDATE
AddDateU
Log
rejected
A2EDate
NotNULL
Log
rejected
Log
rejected
Log
rejected
DIFF2
DS.PSNEW2.PKEY,
DS.PSOLD2.PKEYDS.PSNEW2
DS.PSOLD2
DW.PARTS Aggregate1
PKEY, DAYMIN(COST)
Aggregate2
PKEY, MONTHAVG(COST)
V2
V1
TIME ZY
DW.PARTSUPP.DATE,
DAY
FTP2S2.PARTS
S1.PARTS FTP1
DS.PSNEW1
DIFF1
DS.PSOLD1
DS.PSNEW1.PKEY,
DS.PSOLD1.PKEY
SourcesDW
DSA
QTY,COST
PK
PKEY,DATE
Log
rejected
AddAttr2
SOURCE
SK2
DS.PS1.PKEY,
LOOKUP_PS.SKEY,SOURCE
DS.PS2
Log
rejected
Log
rejected
-
8/6/2019 Vassiliadis
21/39
PrOPr 2007 21
Architecture Graph
Example
2
-
8/6/2019 Vassiliadis
22/39
PrOPr 2007 22
Architecture Graph
Example
2
-
8/6/2019 Vassiliadis
23/39
PrOPr 2007 23
Optimization
Execution order
which is the proper
execution order?
-
8/6/2019 Vassiliadis
24/39
PrOPr 2007 24
Optimization
Execution order
order equivalence?
SK,f1,f2 orSK,f2,f1 or ... ?
-
8/6/2019 Vassiliadis
25/39
PrOPr 2007 25
Logical Optimization
Can we push selectionearly enough?
Can we aggregate
before $2 takesplace?
-
8/6/2019 Vassiliadis
26/39
PrOPr 2007 26
Outline
Introduction Conceptual Level
Logical Level
Physical Level Provenance &ETL
-
8/6/2019 Vassiliadis
27/39
PrOPr 2007 27
DWSchema
mappings
Conceptual tological mapper
Conceptual to
logical mapping
Optimizer
Engine
Logical
templates
Physical
templates
Logical
scenario
Physicalscenario
ETL tool
identify the best
possible physical
implementation for a
given logical ETLworkflow
Logical to Physical
-
8/6/2019 Vassiliadis
28/39
PrOPr 2007 28
Problem formulation
Given a logical-level ETL workflow GL
Compute a physical-level ETL workflow GP
Such that
the semantics of the workflow do not change
all constraints are met the cost is minimal
-
8/6/2019 Vassiliadis
29/39
PrOPr 2007 29
Solution
We model the problem of finding the physical implementation of an ETLprocess as a state-space search problem.
States. A state is a graph GP that represents aphysical-level ETLworkflow.
The initial state G0
P is produced after the random assignment of physical
implementations to logical activities w.r.t. preconditions and constraints.
Transitions. Given a state GP, a new state GP is generated by replacingthe implementation of a physical activity aPofGP with another validimplementation for the same activity.
Extension: introduction of a sorter activity (at the physical-level) as a newnode in the graph.
Sorter introduction
Intentionally introduce sorters to reduce execution & resumption costs
-
8/6/2019 Vassiliadis
30/39
PrOPr 2007 30
Sorters: impact
We intentionally introduce orderings, (via appropriate physical-level sorteractivities) towards obtaining physical plans of lower cost.
Semantics: unaffected
Price to pay: cost of sorting the stream of processed data
Gain:
it is possible to employ order-aware algorithms that significantly reduceprocessing cost
It is possible to amortize the cost over activities that utilize common useful
orderings
-
8/6/2019 Vassiliadis
31/39
PrOPr 2007 31
Sorter gains
Cost(G) = 100.000+10.000+3*[5.000*log2(5.000)+5.000]= 309.316
Ifsorter SA,B is addedto V:
Cost(G) = 100.000+10.000+2*5.000+[5.000*log2(5.000)+5.000] =247.877
Without order
cost(i) = n
costSO() = n*log2(n)+n
With appropriate order
cost(i) = seli * n
costSO() = n
-
8/6/2019 Vassiliadis
32/39
PrOPr 2007 32
Interesting orders
A asc A desc {A,B, [A,B]}
-
8/6/2019 Vassiliadis
33/39
PrOPr 2007 33
Outline
Introduction Conceptual Level
Logical Level
Physical Level Provenance &ETL
-
8/6/2019 Vassiliadis
34/39
PrOPr 2007 34
DW
Schemamappings
Conceptual tological mapper
Conceptual to
logical mapping
Optimizer
Engine
Logical
templates
Physical
templates
Logical
scenario
Physicalscenario
ETL tool
A principled architecture for ETL
WHY
WHAT
HOW
-
8/6/2019 Vassiliadis
35/39
PrOPr 2007 35
Logical Model: Questions revisited
What information should we put inside a metadatarepository to be able to answer questions like:
what is the architecture of my DW back stage?
it is described as the Architecture Graph
which attributes/tables are involved in the population ofan attribute?
what part of the scenario is affected if we delete an
attribute?
follow the appropriate path in the Architecture Graph
-
8/6/2019 Vassiliadis
36/39
PrOPr 2007 36
Fundamental questions on provenance & ETL
Why do we have a certain record in the DW? Because there is a process (described by the Architecture
Graph at the logical level + the conceptual model) that
produces this kind of tuples
Where did this record come from in my DW? Hard! If there is a way to derive an inverse workflow
that links the DW tuples to their sources you can answer
it.
Not always possible: transformations are not invertible,and a DW is supposed to progressively summarize data
Widoms work on record lineage
-
8/6/2019 Vassiliadis
37/39
PrOPr 2007 37
Fundamental questions on provenance & ETL
How are updates to the sources managed? (update takes place at the source, DW+data marts must be
updated)
Done, although in a tedious way: log sniffing, mainly.
Also, diff comparison of extracted snapshots When errors are discovered during the ETL process,
how are they handled?
(update takes place at the data staging area, sources must
be updated)
Too hard to back-fuse data into the sources, both for
political and workload issues. Currently, this is not
automated.
-
8/6/2019 Vassiliadis
38/39
PrOPr 2007 38
Fundamental questions on provenance & ETL
What happens if there are updates to the schema ofthe involved data sources?
Currently this is not automated, although the automation
of the task is part of the detail independence vision
What happens if we must update the workflowstructure and semantics?
Nothing is versioned back still, not really any user
requests for this to be supported
What is the equivalent of citations in ETL?
nothing really
-
8/6/2019 Vassiliadis
39/39
PrOPr 2007 39
Thank you!