Aberdeen, 28/1/2003

AutoMed: Automatic generation of Mediator tools for heterogeneous data integration

Alex Poulovassilis, School of Computer Science and Information Systems, Birkbeck

AutoMed is a joint project with Peter McBrien (Imperial College), funded under the 2nd DIM call by EPSRC grants GR/N38107 and GR/N35915

[Figure: three source schemas and one integrated schema]

Background

In earlier work (ER’97, IS’98, DKE’98) we developed a new framework to support transformation and integration of heterogeneous database schemas.

Our framework consisted of:

• a new notion of schema equivalence

• a set of primitive schema transformations which can be composed to define unconditional or conditional equivalences between schemas

Background

In our data integration approach, we represent the modelling constructs of higher-level data models (e.g. relational, object-oriented, semi-structured, XML, RDF) in terms of a low-level hypergraph data model – HDM – whose constructs are nodes, edges and constraints

The HDM common data model provides a unifying semantics for such higher-level modelling constructs

It avoids the semantic mismatches that may occur between constructs of higher-level modelling languages
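
As a rough illustration only, an HDM schema can be held as three collections: nodes, edges and constraints. The Python sketch below is an assumption made for exposition, not the AutoMed API; the class name `HdmSchema` and its methods are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class HdmSchema:
    """Hypothetical in-memory HDM schema: nodes, edges and constraints."""
    nodes: set = field(default_factory=set)
    # An edge is stored as (name, tuple of the node or edge names it links).
    edges: set = field(default_factory=set)
    # A constraint is an opaque tuple, e.g. ("subset", "Film", "Prog").
    constraints: set = field(default_factory=set)

    def add_node(self, name):
        self.nodes.add(name)

    def add_edge(self, name, *ends):
        # Every end must already exist, either as a node or as a named edge.
        known = self.nodes | {edge[0] for edge in self.edges}
        missing = [end for end in ends if end not in known]
        if missing:
            raise ValueError(f"unknown edge ends: {missing}")
        self.edges.add((name, ends))

    def add_constraint(self, kind, *args):
        self.constraints.add((kind, *args))


# A relational-style fragment: the edge `category` links Programme and Category.
hdm = HdmSchema()
hdm.add_node("Programme")
hdm.add_node("Category")
hdm.add_edge("category", "Programme", "Category")
```

Allowing an edge end to be another edge is what lets this structure express hyperedges that attach to existing relationships.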

Background

Our approach allows constructs from different modelling languages to be mixed within the same intermediate schema during the schema transformation/integration process (CAiSE’99)

Our schema transformations are automatically reversible, setting up a two-way transformation pathway between pairs of schemas

addClass Prog [p | (p,c) ∈ category]

addClass Film [p | (p,F) ∈ category]

addClass Doc [p | (p,D) ∈ category]

addClass Series [p | (p,S) ∈ category]

addSubClass Film Prog

addSubClass Doc Prog

addSubClass Series Prog

delRel category [(p,F) | p ∈ Film] U [(p,D) | p ∈ Doc] U [(p,S) | p ∈ Series]

addRel category [(p,F) | p ∈ Film] U [(p,D) | p ∈ Doc] U [(p,S) | p ∈ Series]

delSubClass Series Prog

delSubClass Doc Prog

delSubClass Film Prog

delClass Series [p | (p,S) ∈ category]

delClass Doc [p | (p,D) ∈ category]

delClass Film [p | (p,F) ∈ category]

delClass Prog [p | (p,c) ∈ category]
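
The reverse pathway can be produced mechanically: take the steps in reverse order and swap each add with the matching del, keeping the query unchanged. The following is a minimal Python sketch of that idea, assuming steps are stored as (operation, arguments, query) triples; the tuple layout and the name `reverse_pathway` are illustrative assumptions, not AutoMed's own representation.

```python
# Each primitive transformation is paired with its inverse.
INVERSE = {
    "addClass": "delClass", "delClass": "addClass",
    "addSubClass": "delSubClass", "delSubClass": "addSubClass",
    "addRel": "delRel", "delRel": "addRel",
}


def reverse_pathway(pathway):
    """Reverse the step order and swap each add with its del (and vice versa)."""
    return [(INVERSE[op], args, query) for op, args, query in reversed(pathway)]


# Queries are opaque strings here; "in" stands for set membership.
forward = [
    ("addClass", ("Prog",), "[p | (p,c) in category]"),
    ("addClass", ("Film",), "[p | (p,F) in category]"),
    ("addClass", ("Doc",), "[p | (p,D) in category]"),
    ("addClass", ("Series",), "[p | (p,S) in category]"),
    ("addSubClass", ("Film", "Prog"), None),
    ("addSubClass", ("Doc", "Prog"), None),
    ("addSubClass", ("Series", "Prog"), None),
    ("delRel", ("category",),
     "[(p,F) | p in Film] U [(p,D) | p in Doc] U [(p,S) | p in Series]"),
]

for step in reverse_pathway(forward):
    print(step)
```

Reversing twice gives back the original pathway, which is the sense in which the two directions form a single two-way pathway.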

addNode Prog [p | (p,c) ∈ category]

addNode Film [p | (p,F) ∈ category]

addNode Doc [p | (p,D) ∈ category]

addNode Series [p | (p,S) ∈ category]

addConstraint subset Film Prog

addConstraint subset Doc Prog

addConstraint subset Series Prog

delEdge category [(p,F) | p ∈ Film] U [(p,D) | p ∈ Doc] U [(p,S) | p ∈ Series]

delNode Programme Prog

delNode Category [F,D,S]

addNode Category [F,D,S]

addNode Programme Prog

addEdge category [(p,F) | p ∈ Film] U [(p,D) | p ∈ Doc] U [(p,S) | p ∈ Series]

delConstraint subset Series Prog

delConstraint subset Doc Prog

delConstraint subset Film Prog

delNode Series [p | (p,S) ∈ category]

delNode Doc [p | (p,D) ∈ category]

delNode Film [p | (p,F) ∈ category]

delNode Prog [p | (p,c) ∈ category]

Background

These pathways can be used to automatically translate data and queries between pairs of schemas (ER’99)

From a pathway T: S → S’ we:

• compose the queries in the add steps to derive a definition of each construct in S’ as a view over S, and

• compose the queries in the del steps to derive a definition of each construct in S as a view over S’
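
The two bullets above can be sketched in a few lines of Python. This is a deliberately simplified illustration: it assumes each add or del step carries one query and simply collects those queries, without the further composition a full implementation would need when one step's query refers to constructs introduced by another. The triple layout and the name `derive_views` are hypothetical.

```python
def derive_views(pathway):
    """Return (views of S' constructs over S, views of S constructs over S')."""
    forward_views = {}   # from the add steps
    backward_views = {}  # from the del steps
    for op, construct, query in pathway:
        if query is None:
            continue  # steps without an extent query define no view here
        if op.startswith("add"):
            forward_views[construct] = query
        elif op.startswith("del"):
            backward_views[construct] = query
    return forward_views, backward_views


# Queries are opaque strings; "in" stands for set membership.
pathway = [
    ("addClass", "Prog", "[p | (p,c) in category]"),
    ("addClass", "Film", "[p | (p,F) in category]"),
    ("addClass", "Doc", "[p | (p,D) in category]"),
    ("addClass", "Series", "[p | (p,S) in category]"),
    ("delRel", "category",
     "[(p,F) | p in Film] U [(p,D) | p in Doc] U [(p,S) | p in Series]"),
]

forward_views, backward_views = derive_views(pathway)
```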

Background

Thus

Prog = [p | (p,c) ∈ category]

Film = [p | (p,F) ∈ category]

Doc = [p | (p,D) ∈ category]

Series = [p | (p,S) ∈ category]

and

category = [(p,F) | p ∈ Film] U [(p,D) | p ∈ Doc] U [(p,S) | p ∈ Series]

These view definitions can then be used to automatically translate data and queries between S and S’
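
To make the translation concrete, the view definitions can be evaluated over a small sample extent. The data below is invented for illustration, and Python set comprehensions stand in for the comprehension syntax used on the slide.

```python
# Hypothetical extent of `category` in S: (programme, category code) pairs.
category = {("p1", "F"), ("p2", "D"), ("p3", "S"), ("p4", "F")}

# Constructs of S' as views over S (taken from the add steps).
Prog = {p for (p, c) in category}
Film = {p for (p, c) in category if c == "F"}
Doc = {p for (p, c) in category if c == "D"}
Series = {p for (p, c) in category if c == "S"}

# The construct of S as a view over S' (taken from the del step).
rebuilt = (
    {(p, "F") for p in Film}
    | {(p, "D") for p in Doc}
    | {(p, "S") for p in Series}
)

# The round trip recovers the original extent, provided every
# category code is one of F, D or S.
assert rebuilt == category
```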

Overview of the AutoMed Project

The AutoMed project aims to investigate:

• how our theoretical framework can be practically applied to real data integration problems

• how much of a mediator’s global query processing functionality can be automatically generated from our transformation pathways

• evolutionary and heuristic techniques for schema improvement and global query optimisation

The AutoMed Architecture

Global Query Processor

Global Query Optimiser

Schema Evolution Tool

Schema Transformation and Integration Tool

Model Definition Tool

Schema and Transformation Repository

Model Definitions Repository

Schema Transformation/Integration Networks in AutoMed

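[Figure: local schemas LS1, LS2, …, LSi, …, LSn, each linked by a transformation pathway to a union-compatible schema US1, US2, …, USi, …, USn; neighbouring union-compatible schemas are linked by id steps, and a pathway links them to the global schema GS]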

Schema Transformation/Integration Networks in AutoMed

On the previous slide:

• GS is a global schema

• LS1, …, LSn are local schemas

• US1, …, USn are union-compatible schemas

• the transformation pathways between each pair LSi and USi may consist of add, delete, rename, expand and contract primitive transformations, operating on any modelling construct defined in the AutoMed Model Definitions Repository

• the transformation pathway between USi and GS is similar

• the transformation pathway between each pair of union-compatible schemas consists of id transformation steps

Both-As-View integration

Our schema transformation pathways capture at least the information available from global-as-view (GAV) or local-as-view (LAV) specifications

We discuss this in a forthcoming paper (ICDE’03) and term our integration approach both-as-view (BAV)

In particular, we discuss how

• GAV and LAV view definitions can be derived from a BAV specification

• a BAV specification can be partially derived from a set of GAV or LAV view definitions

Schema Evolution

Unlike GAV and LAV, our framework readily supports the evolution of both local and global schemas

The first step is to define the evolution of the global or local schema as a schema transformation pathway from the old to the new schema

There is then a systematic way of evolving, as opposed to re-generating, the transformation pathways

In the case of a local schema evolution, the global schema may also be evolved

Schema Evolution

In particular (see our CAiSE’02 and ICDE’03 papers for details):

• if the evolved schema is semantically equivalent to the original schema, then the transformation network can be repaired automatically

• if the evolved schema is a contraction of the original schema, the transformation network can again be repaired automatically

• if the evolved schema is an extension of the original schema, then domain knowledge may be required (but again the network can be evolved rather than regenerated)

Global Query Processing

We are handling query language heterogeneity by translation into/from a functional intermediate query language, IQL (Edgar Jasper: BNCOD’02 poster, BNCOD’02 summer school paper)

A query Q expressed in a high-level query language on a global schema GS is first translated into IQL

GAV view definitions are derived from the transformation pathways between GS and the local schemas

These view definitions are substituted into Q, reformulating it into an IQL query over local schema constructs
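
The substitution step can be pictured as unfolding names in an expression tree. The sketch below is not IQL: queries are modelled as nested Python tuples of the form (operator, argument, ...), a bare string names a schema construct, and the view definitions are assumed to be acyclic. The construct names `ls1_programme` and `ls2_show` and the function `substitute` are hypothetical.

```python
def substitute(query, views):
    """Replace each global-schema construct in `query` by its view definition."""
    if isinstance(query, str):
        # A bare string names a construct; unfold it if a view is defined.
        return substitute(views[query], views) if query in views else query
    op, *args = query
    return (op, *(substitute(arg, views) for arg in args))


# Hypothetical GAV view: the global construct Prog is the union of
# two local-schema constructs.
views = {"Prog": ("union", "ls1_programme", "ls2_show")}

# A global query that counts the programmes.
global_query = ("count", "Prog")

local_query = substitute(global_query, views)
print(local_query)
```

After substitution the query mentions only local-schema constructs, so it can be split into sub-queries for the individual sources.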

Global Query Processing

Query optimisation and query evaluation then occur

Specific issues for query optimisation in AutoMed are:

• optimising the view definitions derived from the transformation pathways, and

• handling heterogeneous modelling constructs appearing within these view definitions

For query evaluation, wrappers translate IQL sub-queries into the local query language, and translate results back into the IQL type system.

Further query post-processing is possible.

Why a Functional Language as the AutoMed Intermediate Query Language?

Compositionality: operators can be composed to an arbitrary level of nesting within a query provided the types of the operators are respected by the expressions passed to them

Referential transparency: any query evaluates to a single answer, irrespective of the order of evaluation of its sub-expressions

These properties make view generation, query reformulation and query rewriting simpler than they would be with imperative or logic notations

Why a Functional Language as the AutoMed Intermediate Query Language?

Natural support for collection types and aggregation operators

This makes it a natural formalism for translating into/out of other query languages, e.g.

• OQL is a functional query language

• SQL can be considered to be a restriction of OQL

• XQuery has a functional core language

• other languages for semi-structured and RDF data are also functional (UnQL, YATL, RQL)

Why a Functional Language as the AutoMed Intermediate Query Language?

Aggregation operators over collection types such as sets, bags and lists are generalised by a single fold function (Buneman, Tannen, Naqvi, 1990s)

Optimisation techniques have been developed for fold which are applicable to all functional query languages with this formalism at their core (e.g. work by Wadler, Wong, Fegaras, Grust, Poulovassilis & Small)

We plan to leverage these techniques, and perhaps even existing software, for global query optimisation in AutoMed
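
As an illustration of how one fold covers several aggregation operators, the Python sketch below uses one common formulation: apply a function to each element, then combine the results with a binary operator starting from a unit value. It is written with `functools.reduce` and is not IQL; the helper name `fold` and the sample data are assumptions.

```python
from functools import reduce


def fold(f, op, unit, collection):
    """Apply f to each element and combine the results with op, starting at unit."""
    return reduce(lambda acc, x: op(acc, f(x)), collection, unit)


bag = [3, 1, 4, 1, 5]

total = fold(lambda x: x, lambda a, b: a + b, 0, bag)          # sum
count = fold(lambda x: 1, lambda a, b: a + b, 0, bag)          # count
largest = fold(lambda x: x, max, float("-inf"), bag)           # max
distinct = fold(lambda x: {x}, lambda a, b: a | b, set(), bag)  # bag to set

print(total, count, largest, sorted(distinct))
```

Because every aggregate is an instance of the same function, a rewrite rule stated once for `fold` applies to all of them.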

XML Data Sources

As well as integration of structured data sources, we have done some work on translating and integrating XML data – see our CAiSE’01 paper

We have defined a representation of XML in terms of the nodes, edges and constraints of the HDM

We capture the ordering of XML elements by an order node and a hyperedge to it from the edge representing the parent-child relationship

Translating XML into HDM

<customer name="Jones">
  <account number="A14"/>
  <account number="B37"/>
</customer>
<customer name="Smith">
  <account number="C514"/>
  <account number="D438"/>
</customer>

[Figure: HDM graph of this fragment, with nodes root, customer, name, account and number, and two order nodes]
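
A simplified sketch of such a translation, using Python's standard `xml.etree.ElementTree` parser, is given below. It is an illustration rather than the AutoMed encoding: the fragment is wrapped in a hypothetical `<root>` element so that it parses, element identifiers are invented path strings, and the ordering is recorded as a number attached to each parent-child edge instead of a separate order node and hyperedge.

```python
import xml.etree.ElementTree as ET

# The fragment has two top-level elements, so it is wrapped in a
# hypothetical <root> element to make it well-formed for the parser.
DOC = """<root>
  <customer name="Jones">
    <account number="A14"/>
    <account number="B37"/>
  </customer>
  <customer name="Smith">
    <account number="C514"/>
    <account number="D438"/>
  </customer>
</root>"""


def to_graph(elem, ident="root", out=None):
    """Collect (parent, child, order) and (element, attribute, value) triples."""
    if out is None:
        out = {"child": [], "attr": []}
    for name, value in elem.attrib.items():
        out["attr"].append((ident, name, value))
    # Iterating an element yields its children in document order.
    for order, child in enumerate(elem, start=1):
        child_ident = f"{ident}/{child.tag}[{order}]"
        out["child"].append((ident, child_ident, order))
        to_graph(child, child_ident, out)
    return out


graph = to_graph(ET.fromstring(DOC))
for edge in graph["child"]:
    print(edge)
```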

XML Data Sources

We have defined a set of primitive transformations on XML, in terms of the underlying transformations on the equivalent HDM representation (which is the general AutoMed methodology)

XML documents are then translated into a simple ER representation, which allows them to be integrated with each other and with other structured data sources

One possible direction of further work is automatic or semi-automatic transformation and integration of the ER models arising from XML documents

Unstructured Text Sources

We have also been working on extracting structure from unstructured text sources – Dean Williams

The aim here is to integrate information extracted from unstructured text with structured or semi-structured information available from other sources

We are using existing technology (the GATE tool) for the text annotation and IE part of this work

Unstructured Text Sources

Natural language and domain ontologies will be used to extend these annotations

These will be imported into RDF repositories, and we have extended AutoMed to encompass RDF and RDFS data sources

The information extracted from the text will be matched with existing structured information to derive new facts and perhaps new schema information as well

Materialised integration

Finally, as well as virtual integration of data sources, we are also investigating using the AutoMed framework for materialised integration i.e. a data warehousing approach

In particular, we are looking at incremental view maintenance and data lineage tracing using the AutoMed schema transformation pathways – Hao Fan

Ongoing AutoMed Work at Imperial

• Automatic generation of equivalences between different data models

• A graphical schema & transformations editor

• Data mining techniques for extracting relational schema equivalences

• Using AutoMed for integrating semi-structured and structured data, in particular genomic data

• Optimising schema transformation pathways