Vassiliadis

download Vassiliadis

of 39

Transcript of Vassiliadis

  • 8/6/2019 Vassiliadis

    1/39

    Data Provenance in ETL Scenarios

    Panos VassiliadisUniversity of Ioannina

    (joint work with Alkis Simitsis, IBM Almaden Research Center,

    Timos Sellis and Dimitrios Skoutas, NTUA & ICCS)

  • 8/6/2019 Vassiliadis

    2/39

    PrOPr 2007 2

    Outline

    Introduction

    Conceptual Level

    Logical Level

    Physical Level Provenance &ETL

  • 8/6/2019 Vassiliadis

    3/39

    PrOPr 2007 3

    Outline

    Introduction

    Conceptual Level

    Logical Level

    Physical Level Provenance &ETL

  • 8/6/2019 Vassiliadis

    4/39

    PrOPr 2007 4

    Data Warehouse Environment

  • 8/6/2019 Vassiliadis

    5/39

    PrOPr 2007 5

    Extract-Transform-Load (ETL)

  • 8/6/2019 Vassiliadis

    6/39

    PrOPr 2007 6

    ETL: importance

    ETL and Data Cleaning tools cost 30% of effort and expenses in the budget of the DW

    55% of the total costs of DW runtime

    80% of the development time in a DW project

    ETL market: a multi-million market IBM paid $1.1 billion dollars for Ascential

    ETL tools in the market

    software packages

    in-house development

    No standard, no common model

    most vendors implement a core set of operators and provide GUI to

    create a data flow

  • 8/6/2019 Vassiliadis

    7/39

    PrOPr 2007 7

    Fundamental research question

    Now: currently, ETL designers work directly at thephysical level (typically, via libraries of physical-level templates)

    Challenge: can we design ETL flows as declaratively

    as possible? Detail independence:

    no care for the algorithmic choices

    no care about the order of the transformations

    (hopefully) no care for the details of the inter-attributemappings

  • 8/6/2019 Vassiliadis

    8/39

    PrOPr 2007 8

    Engine

    Physical

    templates

    DW

    Involveddata

    stores

    +

    Now:

    Physical

    scenario

  • 8/6/2019 Vassiliadis

    9/39

    PrOPr 2007 9

    DWSchema

    mappings

    Conceptual tological mapper

    Conceptual to

    logical mapping

    Optimizer

    Engine

    Logical

    templates

    Physical

    templates

    Logical

    scenario

    Physical

    scenarioEngine

    ETL tool

    Vision:

    Physical

    templates

    DW

    Involveddata

    stores

    +

    Physical

    scenario

  • 8/6/2019 Vassiliadis

    10/39

    PrOPr 2007 10

    DWSchema

    mappings

    Conceptual tological mapper

    Conceptual to

    logical mapping

    Optimizer

    Engine

    Logical

    templates

    Physical

    templates

    Logical

    scenario

    Physical

    scenario

    ETL tool

    Detail independence

    Automate

    (as much as possible)

    Conceptual: thedetails of the inter-

    attribute mappings

    Logical: the order of

    the transformationsPhysical: the

    algorithmic choices

  • 8/6/2019 Vassiliadis

    11/39

    PrOPr 2007 11

    Outline

    Introduction

    Conceptual Level

    Logical Level

    Physical Level Provenance &ETL

  • 8/6/2019 Vassiliadis

    12/39

    PrOPr 2007 12

    Conceptual Model: first attempts

  • 8/6/2019 Vassiliadis

    13/39

    PrOPr 2007 13

    Conceptual Model: The Data Mapping Diagram

    Extension of UML to handle inter-attribute mappings

  • 8/6/2019 Vassiliadis

    14/39

    PrOPr 2007 14

    Conceptual Model: The Data Mapping Diagram

    Aggregating computes the quarterly sales for each product.

  • 8/6/2019 Vassiliadis

    15/39

    PrOPr 2007 15

    Conceptual Model: Skoutas annotations

    Application vocabulary

    VC = {product, store}

    VPproduct = {pid, pName, quantity, price,type, storage}

    VPstore=

    {sid, sName, city, street}VFpid = {source_pid, dw_pid}

    VFsid = {source_sid, dw_sid}

    VFprice = {dollars, euros}

    VTtype = {software, hardware}

    VTcity =

    {paris, rome, athens}

    Datastore mappings

    Datastore annotation

  • 8/6/2019 Vassiliadis

    16/39

    PrOPr 2007 16

    Conceptual Model: Skoutas annotations

    The class hierarchy Definition for classDS1_Products

  • 8/6/2019 Vassiliadis

    17/39

    PrOPr 2007 17

    Outline

    Introduction

    Conceptual Level

    Logical Level

    Physical Level Provenance &ETL

  • 8/6/2019 Vassiliadis

    18/39

    PrOPr 2007 18

    Logical Model

    AddAttr2

    SOURCE

    SK2

    DS.PS1.PKEY,

    LOOKUP_PS.SKEY,SOURCE

    $2

    COST DATE

    DS.PS1 SK1

    DS.PS2.PKEY,LOOKUP_PS.SKEY,

    SOURCECOST DATE=SYSDATE

    AddDateU

    DS.PS2

    Log

    rejected

    Log

    rejected

    A2EDate

    NotNULL

    Log

    rejected

    Log

    rejected

    Log

    rejected

    DIFF2

    DS.PSNEW2.PKEY,

    DS.PSOLD2.PKEYDS.PSNEW2

    DS.PSOLD2

    DW.PARTS Aggregate1

    PKEY, DAYMIN(COST)

    Aggregate2

    PKEY, MONTHAVG(COST)

    V2

    V1

    TIME ZY

    DW.PARTSUPP.DATE,

    DAY

    FTP2S2.PARTS

    S1.PARTS FTP1

    DS.PSNEW1

    DIFF1

    DS.PSOLD1

    DS.PSNEW1.PKEY,

    DS.PSOLD1.PKEY

    SourcesDW

    DSA

    Log

    rejected

    QTY,COST

    PK

    PKEY,DATE

    Log

    rejected

  • 8/6/2019 Vassiliadis

    19/39

    PrOPr 2007 19

    Logical Model

    Main question:

    What information should we put inside a metadata

    repository to be able to answer questions like:

    what is the architecture of my DW back stage? which attributes/tables are involved in the population of

    an attribute?

    what part of the scenario is affected if we delete an

    attribute?

  • 8/6/2019 Vassiliadis

    20/39

    PrOPr 2007 20

    Architecture Graph

    $2

    COST DATE

    DS.PS1 SK1

    DS.PS2.PKEY,LOOKUP_PS.SKEY,

    SOURCECOST DATE=SYSDATE

    AddDateU

    Log

    rejected

    A2EDate

    NotNULL

    Log

    rejected

    Log

    rejected

    Log

    rejected

    DIFF2

    DS.PSNEW2.PKEY,

    DS.PSOLD2.PKEYDS.PSNEW2

    DS.PSOLD2

    DW.PARTS Aggregate1

    PKEY, DAYMIN(COST)

    Aggregate2

    PKEY, MONTHAVG(COST)

    V2

    V1

    TIME ZY

    DW.PARTSUPP.DATE,

    DAY

    FTP2S2.PARTS

    S1.PARTS FTP1

    DS.PSNEW1

    DIFF1

    DS.PSOLD1

    DS.PSNEW1.PKEY,

    DS.PSOLD1.PKEY

    SourcesDW

    DSA

    QTY,COST

    PK

    PKEY,DATE

    Log

    rejected

    AddAttr2

    SOURCE

    SK2

    DS.PS1.PKEY,

    LOOKUP_PS.SKEY,SOURCE

    DS.PS2

    Log

    rejected

    Log

    rejected

  • 8/6/2019 Vassiliadis

    21/39

    PrOPr 2007 21

    Architecture Graph

    Example

    2

  • 8/6/2019 Vassiliadis

    22/39

    PrOPr 2007 22

    Architecture Graph

    Example

    2

  • 8/6/2019 Vassiliadis

    23/39

    PrOPr 2007 23

    Optimization

    Execution order

    which is the proper

    execution order?

  • 8/6/2019 Vassiliadis

    24/39

    PrOPr 2007 24

    Optimization

    Execution order

    order equivalence?

    SK,f1,f2 orSK,f2,f1 or ... ?

  • 8/6/2019 Vassiliadis

    25/39

    PrOPr 2007 25

    Logical Optimization

    Can we push selectionearly enough?

    Can we aggregate

    before $2 takesplace?

  • 8/6/2019 Vassiliadis

    26/39

    PrOPr 2007 26

    Outline

    Introduction Conceptual Level

    Logical Level

    Physical Level Provenance &ETL

  • 8/6/2019 Vassiliadis

    27/39

    PrOPr 2007 27

    DWSchema

    mappings

    Conceptual tological mapper

    Conceptual to

    logical mapping

    Optimizer

    Engine

    Logical

    templates

    Physical

    templates

    Logical

    scenario

    Physicalscenario

    ETL tool

    identify the best

    possible physical

    implementation for a

    given logical ETLworkflow

    Logical to Physical

  • 8/6/2019 Vassiliadis

    28/39

    PrOPr 2007 28

    Problem formulation

    Given a logical-level ETL workflow GL

    Compute a physical-level ETL workflow GP

    Such that

    the semantics of the workflow do not change

    all constraints are met the cost is minimal

  • 8/6/2019 Vassiliadis

    29/39

    PrOPr 2007 29

    Solution

    We model the problem of finding the physical implementation of an ETLprocess as a state-space search problem.

    States. A state is a graph GP that represents aphysical-level ETLworkflow.

    The initial state G0

    P is produced after the random assignment of physical

    implementations to logical activities w.r.t. preconditions and constraints.

    Transitions. Given a state GP, a new state GP is generated by replacingthe implementation of a physical activity aPofGP with another validimplementation for the same activity.

    Extension: introduction of a sorter activity (at the physical-level) as a newnode in the graph.

    Sorter introduction

    Intentionally introduce sorters to reduce execution & resumption costs

  • 8/6/2019 Vassiliadis

    30/39

    PrOPr 2007 30

    Sorters: impact

    We intentionally introduce orderings, (via appropriate physical-level sorteractivities) towards obtaining physical plans of lower cost.

    Semantics: unaffected

    Price to pay: cost of sorting the stream of processed data

    Gain:

    it is possible to employ order-aware algorithms that significantly reduceprocessing cost

    It is possible to amortize the cost over activities that utilize common useful

    orderings

  • 8/6/2019 Vassiliadis

    31/39

    PrOPr 2007 31

    Sorter gains

    Cost(G) = 100.000+10.000+3*[5.000*log2(5.000)+5.000]= 309.316

    Ifsorter SA,B is addedto V:

    Cost(G) = 100.000+10.000+2*5.000+[5.000*log2(5.000)+5.000] =247.877

    Without order

    cost(i) = n

    costSO() = n*log2(n)+n

    With appropriate order

    cost(i) = seli * n

    costSO() = n

  • 8/6/2019 Vassiliadis

    32/39

    PrOPr 2007 32

    Interesting orders

    A asc A desc {A,B, [A,B]}

  • 8/6/2019 Vassiliadis

    33/39

    PrOPr 2007 33

    Outline

    Introduction Conceptual Level

    Logical Level

    Physical Level Provenance &ETL

  • 8/6/2019 Vassiliadis

    34/39

    PrOPr 2007 34

    DW

    Schemamappings

    Conceptual tological mapper

    Conceptual to

    logical mapping

    Optimizer

    Engine

    Logical

    templates

    Physical

    templates

    Logical

    scenario

    Physicalscenario

    ETL tool

    A principled architecture for ETL

    WHY

    WHAT

    HOW

  • 8/6/2019 Vassiliadis

    35/39

    PrOPr 2007 35

    Logical Model: Questions revisited

    What information should we put inside a metadatarepository to be able to answer questions like:

    what is the architecture of my DW back stage?

    it is described as the Architecture Graph

    which attributes/tables are involved in the population ofan attribute?

    what part of the scenario is affected if we delete an

    attribute?

    follow the appropriate path in the Architecture Graph

  • 8/6/2019 Vassiliadis

    36/39

    PrOPr 2007 36

    Fundamental questions on provenance & ETL

    Why do we have a certain record in the DW? Because there is a process (described by the Architecture

    Graph at the logical level + the conceptual model) that

    produces this kind of tuples

    Where did this record come from in my DW? Hard! If there is a way to derive an inverse workflow

    that links the DW tuples to their sources you can answer

    it.

    Not always possible: transformations are not invertible,and a DW is supposed to progressively summarize data

    Widoms work on record lineage

  • 8/6/2019 Vassiliadis

    37/39

    PrOPr 2007 37

    Fundamental questions on provenance & ETL

    How are updates to the sources managed? (update takes place at the source, DW+data marts must be

    updated)

    Done, although in a tedious way: log sniffing, mainly.

    Also, diff comparison of extracted snapshots When errors are discovered during the ETL process,

    how are they handled?

    (update takes place at the data staging area, sources must

    be updated)

    Too hard to back-fuse data into the sources, both for

    political and workload issues. Currently, this is not

    automated.

  • 8/6/2019 Vassiliadis

    38/39

    PrOPr 2007 38

    Fundamental questions on provenance & ETL

    What happens if there are updates to the schema ofthe involved data sources?

    Currently this is not automated, although the automation

    of the task is part of the detail independence vision

    What happens if we must update the workflowstructure and semantics?

    Nothing is versioned back still, not really any user

    requests for this to be supported

    What is the equivalent of citations in ETL?

    nothing really

  • 8/6/2019 Vassiliadis

    39/39

    PrOPr 2007 39

    Thank you!